One shared transformer policy across 1,600+ reinforcement learning environments
Thibaut Kulak, NeoInstinct SA. June 2026
Accepted at the Reinforcement Learning Conference 2026 Workshop on Automated RL.
TL;DR
- We introduce LDM-v0, a 308M-parameter Large Decision Model trained offline on 9.3B RL transitions.
- The model is trained as one shared transformer policy across 1,600+ environments spanning robotics, driving, trading, inventory, cybersecurity, energy, games, and more.
- LDM-v0 reaches more than 80% of reference-policy performance on 1,600+ environments and matches task-specific reference policies on ~1,000 environments.
- These results suggest that large-scale pretraining may become a practical path toward RL foundation models, although our evaluations remain mostly in-distribution.

1. Motivation
Modern Reinforcement Learning still usually means building one policy per task.
That approach can work well, but it also comes with familiar costs: environment-specific tuning, substantial experimentation, and repeated training runs for each new problem. In practice, this is one of the reasons why RL remains difficult to adopt broadly in real-world settings.
In parallel, large sequence models have shown that a single model trained on sufficiently large and diverse data can learn useful shared structure across many tasks.
In this work, we ask a simple question:
Can we train one transformer policy across a very large and heterogeneous collection of RL environments, while still retaining strong task performance?
2. Key Idea
We build an automated pipeline that:
- collects Gym/Gymnasium-compatible environments from many libraries,
- trains task-specific RL agents to obtain strong reference policies,
- replays those policies to generate labeled trajectories,
- and then trains one shared transformer policy to predict actions from interaction history.
The resulting model is LDM-v0, our first Large Decision Model.
At inference time, LDM-v0 receives the recent interaction history (observations, actions, rewards, and termination signals) together with the current observation, and predicts the next action.
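As a minimal sketch of this inference loop, the snippet below keeps a bounded window of recent transitions and hands it to a model at every step. The names `Transition`, `HistoryPolicy`, and `dummy_model` are hypothetical and not taken from the LDM-v0 code. The layout of each transition (current observation plus the previous action, reward, and termination flag) is an assumption about how the history could be organized.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List, Optional


@dataclass
class Transition:
    observation: Any
    prev_action: Optional[Any]  # None at the very first step
    prev_reward: float
    prev_done: bool


class HistoryPolicy:
    """Feeds a sliding window of recent transitions to a model (hypothetical wrapper)."""

    def __init__(self, model: Callable[[List[Transition]], Any], context_length: int = 2048):
        self.model = model
        # deque(maxlen=...) silently drops the oldest transition once the window is full.
        self.history: Deque[Transition] = deque(maxlen=context_length)

    def act(self, observation, prev_action=None, prev_reward=0.0, prev_done=False):
        self.history.append(Transition(observation, prev_action, prev_reward, prev_done))
        return self.model(list(self.history))


def dummy_model(history: List[Transition]) -> int:
    # Stand-in for the transformer: returns the number of transitions it can see.
    return len(history)
```

With `context_length=3`, five consecutive calls to `act` make `dummy_model` see 1, 2, 3, 3, and 3 transitions. This sketch does not clear the window at episode boundaries, so the termination flags stay visible in the context; whether LDM-v0 does the same is not stated here.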

3. Method
3.1 Environment and Data Pipeline
One challenge of multi-task RL at scale is that the ecosystem is fragmented across many libraries whose implementations may require different Python versions and incompatible dependencies.
To address this, we developed an internal orchestration framework that wraps each environment library in isolated containers and exposes a unified interaction API.
This infrastructure lets us train and evaluate agents across a large set of heterogeneous environments in a reproducible way.
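The internal orchestration framework is not described in detail, so the following is only a sketch of what the API-unification layer might look like, leaving container isolation aside. It normalizes two widely used step conventions: the Gymnasium five-tuple `(obs, reward, terminated, truncated, info)` and the older Gym four-tuple `(obs, reward, done, info)`. The adapter classes, `StepResult`, and the toy environments are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class StepResult:
    observation: Any
    reward: float
    terminated: bool
    truncated: bool


class GymnasiumStyleAdapter:
    """Wraps envs whose step() returns (obs, reward, terminated, truncated, info)."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        observation, _info = self.env.reset()
        return observation

    def step(self, action):
        obs, reward, terminated, truncated, _info = self.env.step(action)
        return StepResult(obs, float(reward), bool(terminated), bool(truncated))


class LegacyGymStyleAdapter:
    """Wraps older envs whose step() returns (obs, reward, done, info)."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Older Gym time-limit wrappers flag truncation through this info key.
        truncated = bool(info.get("TimeLimit.truncated", False))
        return StepResult(obs, float(reward), bool(done) and not truncated, truncated)


class ToyNewEnv:
    """Three-step toy environment using the five-tuple convention."""

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, False, {}


class ToyOldEnv:
    """The same toy environment using the four-tuple convention."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 3, {}


def run_episode(env):
    """Runs one episode through the unified interface and returns all step results."""
    env.reset()
    results = []
    while True:
        result = env.step(action=0)
        results.append(result)
        if result.terminated or result.truncated:
            return results
```

Both `run_episode(GymnasiumStyleAdapter(ToyNewEnv()))` and `run_episode(LegacyGymStyleAdapter(ToyOldEnv()))` yield the same three `StepResult` records, which is the property downstream training and evaluation code would rely on.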
We then generate training data with an automated reference-policy pipeline:
- multiple RL algorithm/configuration pairs are tested per environment family,
- strong candidates are retained as reference policies,
- those policies are replayed to produce supervised trajectories,
- and the final offline dataset is used to train one model.
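The steps above can be sketched on a toy problem. This is an illustrative sketch only: the candidates here are two hand-written policies rather than trained RL agents, and `CoinEnv`, `rollout`, `select_reference`, and `build_dataset` are hypothetical names. The selection rule (highest mean return over a few evaluation seeds) is an assumption, since the actual retention criterion is not specified.

```python
import random
from statistics import mean


class CoinEnv:
    """Toy environment: reward 1 when the action equals the current 0/1 observation."""

    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self, seed):
        self.rng = random.Random(seed)
        self.t = 0
        self.obs = self.rng.randint(0, 1)
        return self.obs

    def step(self, action):
        reward = 1.0 if action == self.obs else 0.0
        self.t += 1
        self.obs = self.rng.randint(0, 1)
        return self.obs, reward, self.t >= self.horizon


def rollout(env, policy, seed):
    """Plays one episode and returns (episode return, list of labeled transitions)."""
    obs, done, total, transitions = env.reset(seed), False, 0.0, []
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        transitions.append({"observation": obs, "action": action,
                            "reward": reward, "done": done})
        total += reward
        obs = next_obs
    return total, transitions


def select_reference(candidates, eval_seeds):
    """Returns the name of the candidate with the highest mean evaluation return."""
    scores = {name: mean(rollout(CoinEnv(), policy, s)[0] for s in eval_seeds)
              for name, policy in candidates.items()}
    return max(scores, key=scores.get)


def build_dataset(policy, replay_seeds):
    """Replays the retained policy to produce supervised (observation, action) data."""
    dataset = []
    for seed in replay_seeds:
        dataset.extend(rollout(CoinEnv(), policy, seed)[1])
    return dataset


candidates = {
    "copy_observation": lambda obs: obs,  # optimal for CoinEnv
    "always_zero": lambda obs: 0,
}
best_name = select_reference(candidates, eval_seeds=range(10))
dataset = build_dataset(candidates[best_name], replay_seeds=range(100, 103))
```

Here `best_name` is `"copy_observation"` and `dataset` holds 15 transitions (three episodes of five steps), each with the action the reference policy chose.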
Overall, the dataset used for LDM-v0 contains 9.3B transitions and occupies roughly 29 TB of storage.
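As a quick sanity check on these two figures, the average storage per transition can be derived directly. This back-of-the-envelope calculation assumes decimal terabytes (10^12 bytes); the result is an average that says nothing about how individual environments differ.

```python
transitions = 9.3e9    # transitions in the dataset
storage_bytes = 29e12  # roughly 29 TB, assuming 1 TB = 10**12 bytes

bytes_per_transition = storage_bytes / transitions
print(f"{bytes_per_transition / 1e3:.1f} KB per transition on average")
```

This comes to roughly 3.1 KB per transition, which is plausible for a mix of small vector observations and larger image observations.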
3.2 Model
LDM-v0 is a decoder-only Llama-style transformer trained from scratch.
Our main model uses:
- 12 layers
- 12 attention heads
- hidden size 768
- context length of 2048 transitions
- 308M parameters
Unlike text-first LLMs, the architecture is designed directly for decision-making trajectories:
- modality-specific encoders process observations, actions, rewards, and termination signals,
- these are merged into one embedding per transition,
- the transformer predicts the next action autoregressively,
- and training is performed with supervised next-action prediction.
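To make the data flow concrete, here is a toy pure-Python sketch of the four points above. It is not the LDM-v0 architecture: the encoders are single random linear maps, the merge is a plain sum (the actual merge operation is not specified), and `causal_mix` is a causal running average standing in for the transformer. All names and dimensions are hypothetical.

```python
import math
import random

rng = random.Random(0)
D, OBS_DIM, N_ACTIONS = 8, 4, 3  # toy sizes; the main model uses hidden size 768


def make_linear(in_dim, out_dim):
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weight, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# One encoder per modality, plus an action head.
enc_obs = make_linear(OBS_DIM, D)
enc_action = make_linear(N_ACTIONS + 1, D)  # extra slot means "no previous action"
enc_reward = make_linear(1, D)
enc_done = make_linear(1, D)
head = make_linear(D, N_ACTIONS)


def embed_transition(obs, prev_action, prev_reward, prev_done):
    """Encodes each modality and merges them (by summation) into one embedding."""
    index = N_ACTIONS if prev_action is None else prev_action
    one_hot = [1.0 if i == index else 0.0 for i in range(N_ACTIONS + 1)]
    parts = [apply_linear(enc_obs, obs), apply_linear(enc_action, one_hot),
             apply_linear(enc_reward, [prev_reward]),
             apply_linear(enc_done, [float(prev_done)])]
    return [sum(values) for values in zip(*parts)]


def causal_mix(embeddings):
    """Stand-in for the transformer: position t only sees positions <= t."""
    mixed, running = [], [0.0] * D
    for t, emb in enumerate(embeddings, start=1):
        running = [r + v for r, v in zip(running, emb)]
        mixed.append([r / t for r in running])
    return mixed


def next_action_loss(trajectory):
    """Mean cross-entropy between predicted and reference actions."""
    embeddings, prev = [], (None, 0.0, False)
    for step in trajectory:
        embeddings.append(embed_transition(step["obs"], *prev))
        prev = (step["action"], step["reward"], step["done"])
    total = 0.0
    for hidden, step in zip(causal_mix(embeddings), trajectory):
        logits = apply_linear(head, hidden)
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[step["action"]]
    return total / len(trajectory)


example_trajectory = [
    {"obs": [0.1, 0.0, -0.2, 0.3], "action": 2, "reward": 1.0, "done": False},
    {"obs": [0.0, 0.5, 0.1, -0.1], "action": 0, "reward": 0.0, "done": False},
    {"obs": [0.2, 0.2, 0.0, 0.4], "action": 1, "reward": 1.0, "done": True},
]
loss = next_action_loss(example_trajectory)
```

With untrained weights the loss sits near `log(3)`, the value for a uniform guess over three actions. Training would lower it by making the reference action more likely at each position.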
3.3 Training Setup
The 308M model was trained from scratch for six days on two nodes with eight NVIDIA H200 GPUs each (16 GPUs in total).
Generating the reference-policy data was itself a large-scale effort, taking roughly 12 weeks on four servers with a combined 608 CPU cores.
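For a rough sense of scale, these figures can be converted into GPU-hours and CPU core-hours. The calculation below assumes continuous, full utilization and reads 608 as the total core count across all servers, so both numbers are best treated as upper bounds.

```python
gpu_count = 2 * 8                # two nodes, eight H200 GPUs each
training_hours = 6 * 24          # six days
gpu_hours = gpu_count * training_hours

cpu_cores = 608                  # assumed to be the total across servers
data_gen_hours = 12 * 7 * 24     # roughly 12 weeks
core_hours = cpu_cores * data_gen_hours

print(f"training: about {gpu_hours:,} GPU-hours")
print(f"data generation: up to about {core_hours:,} CPU core-hours")
```

Under these assumptions, training takes about 2,304 GPU-hours and data generation up to about 1.2 million CPU core-hours.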
4. Results
The main result is straightforward:
a single shared model reaches strong performance across a very large number of unrelated RL environments.
More precisely, LDM-v0:
- achieves more than 80% of reference-policy performance on over 1,600 environments,
- and matches task-specific reference policies on ~1,000 environments.
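The exact normalization behind "percent of reference-policy performance" is not spelled out here. One common convention, used below purely as an assumption, rescales returns so that a random policy scores 0 and the reference policy scores 1. The function names, the `match_tolerance` parameter, and the example returns are all hypothetical.

```python
def normalized_score(model_return, reference_return, random_return=0.0):
    """Rescales a return so that random maps to 0.0 and the reference to 1.0."""
    span = reference_return - random_return
    if span == 0:
        raise ValueError("reference and random returns must differ")
    return (model_return - random_return) / span


def summarize(results, threshold=0.8, match_tolerance=0.02):
    """Counts environments above the threshold and environments matching the reference."""
    scores = {env: normalized_score(*returns) for env, returns in results.items()}
    above = sum(1 for s in scores.values() if s > threshold)
    matched = sum(1 for s in scores.values() if s >= 1.0 - match_tolerance)
    return scores, above, matched


# Made-up returns: (model, reference, random) per environment.
example_results = {
    "env_a": (95.0, 100.0, 0.0),
    "env_b": (-120.0, -100.0, -500.0),
    "env_c": (30.0, 100.0, 10.0),
    "env_d": (101.0, 100.0, 0.0),
}
scores, above_threshold, matched = summarize(example_results)
```

On this made-up data, three of the four environments exceed the 80% threshold and one matches the reference. The `env_b` case shows why a random baseline helps: with negative returns, the naive ratio -120 / -100 = 1.2 would suggest the model beats the reference, whereas the normalized score is 0.95.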
This strong performance appears across a broad range of domains, including:
- robotic manipulation and control,
- autonomous driving simulation,
- drone and UAV control,
- electric motor control,
- smart-grid and energy-management tasks,
- financial trading,
- inventory optimization,
- cybersecurity environments,
- crop and plant optimization,
- Atari-style games and procedurally generated environments.
The key point is that these results come from one pretrained model with a single set of parameters, not one model per benchmark.

We also study model scaling with 32M, 70M, 308M, and 736M parameter variants.
Performance improves substantially from 32M to 308M, while gains appear to plateau between 308M and 736M in our current setup.
This suggests that heterogeneous offline RL pretraining does benefit from additional capacity, but that better characterization of scaling laws will require further experiments.
5. Why This Matters
We view this result as evidence that RL may be able to follow a path similar to other machine learning domains:
from training one specialized model per task toward pretraining a shared foundation model on large-scale decision-making data.
This does not mean we have solved general RL.
The paper should be read primarily as a demonstration of scalable multi-task pretraining. Our evaluations are still mostly in-distribution, so broad out-of-distribution generalization remains to be established.
Still, the result is encouraging: even across major variation in modalities, action spaces, reward scales, and temporal structure, a single transformer can learn a surprisingly wide set of useful behaviors.
6. Next Steps
The next milestone is out-of-distribution generalization: decision models that can transfer to genuinely new environments.
To move in that direction, we will focus on three areas: scaling the diversity of environments and data, exploring architectures and training objectives that enable in-context adaptation, and developing offline-to-online finetuning methods.
7. Takeaway
LDM-v0 shows that a single transformer policy can be trained across more than 1,600 heterogeneous RL environments and retain strong task performance at scale.
For us, this is an early but concrete step toward Large Decision Models as a practical foundation-model paradigm for Reinforcement Learning.