One shared transformer policy across 1,600+ reinforcement learning environments

Thibaut Kulak, NeoInstinct SA. June 2026

Accepted at the Reinforcement Learning Conference 2026 Workshop on Automated RL.


TL;DR

Figure: Performance-threshold curve of LDM-v0 across training environments.

1. Motivation

Modern Reinforcement Learning (RL) still usually means training one policy per task.

That approach can work well, but it also comes with familiar costs: environment-specific tuning, substantial experimentation, and repeated training runs for each new problem. In practice, this is one of the reasons why RL remains difficult to adopt broadly in real-world settings.

In parallel, large sequence models have shown that a single model trained on sufficiently large and diverse data can learn useful shared structure across many tasks.

In this work, we ask a simple question:

Can we train one transformer policy across a very large and heterogeneous collection of RL environments, while still retaining strong task performance?


2. Key Idea

We build an automated pipeline that:

The resulting model is LDM-v0, our first Large Decision Model.

At inference time, LDM-v0 receives the recent interaction history (observations, actions, rewards, and termination signals) together with the current observation, and predicts the next action.

Figure: Architecture of LDM-v0.
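To make the inference-time interface concrete, here is a minimal Python sketch of a rolling history buffer wrapped around a policy. It is an illustration under stated assumptions, not the LDM-v0 implementation: the names `Step`, `HistoryPolicy`, and `context_len` are hypothetical, the model is any callable that maps (history, current observation) to an action, and the buffer is not reset when an episode ends.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List, Optional, Tuple


@dataclass
class Step:
    """One completed interaction step (field names are illustrative)."""
    observation: Any
    action: Any
    reward: float
    done: bool  # termination signal


class HistoryPolicy:
    """Keeps the most recent steps and asks a model for the next action."""

    def __init__(self, model: Callable[[List[Step], Any], Any], context_len: int):
        self.model = model
        # A bounded deque silently drops the oldest step once it is full.
        self.history: Deque[Step] = deque(maxlen=context_len)
        self._pending: Optional[Tuple[Any, Any]] = None

    def act(self, observation: Any) -> Any:
        """Predict an action from the stored history plus the current observation."""
        action = self.model(list(self.history), observation)
        self._pending = (observation, action)
        return action

    def record(self, reward: float, done: bool) -> None:
        """Store the outcome of the last action so it becomes part of the history."""
        if self._pending is None:
            raise RuntimeError("record() called before act()")
        observation, action = self._pending
        self.history.append(Step(observation, action, reward, done))
        self._pending = None


# Stand-in model (assumption): returns the number of steps it can see.
def dummy_model(history: List[Step], observation: Any) -> int:
    return len(history)


policy = HistoryPolicy(dummy_model, context_len=3)
actions = []
for t in range(5):
    actions.append(policy.act(observation=t))
    policy.record(reward=0.0, done=False)
print(actions)
```

With a context length of 3, the dummy model sees at most three past steps regardless of how long the interaction runs, which is the behavior a fixed-context transformer policy would need from its caller.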

3. Method

3.1 Environment and Data Pipeline

One challenge of multi-task RL at scale is that the ecosystem is fragmented across many libraries whose implementations may require different Python versions and incompatible dependencies.

To address this, we developed an internal orchestration framework that runs each environment library in its own isolated container and exposes a single, unified interaction API across all of them.

This infrastructure lets us train and evaluate agents across a large set of heterogeneous environments in a reproducible way.
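As an illustration of what a unified interaction API can look like, the sketch below defines a small Python interface and an adapter that maps one library's calling convention onto it. Everything here is hypothetical: `UnifiedEnv`, `ToyCounterLib`, and `ToyCounterAdapter` are invented for the example, and the container isolation described above is not modeled at all.

```python
from typing import Any, Iterable, Protocol, Tuple


class UnifiedEnv(Protocol):
    """Hypothetical common interface that every wrapped environment exposes."""

    def reset(self, seed: int) -> Any: ...

    def step(self, action: Any) -> Tuple[Any, float, bool]: ...


class ToyCounterLib:
    """Stand-in for a third-party library with its own calling convention."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def start(self) -> dict:
        self.count = 0
        return {"state": self.count}

    def advance(self, delta: int) -> dict:
        self.count += delta
        return {"state": self.count, "score": float(delta), "over": self.count >= self.limit}


class ToyCounterAdapter:
    """Adapter translating ToyCounterLib calls into the unified interface."""

    def __init__(self, limit: int = 3):
        self._env = ToyCounterLib(limit)

    def reset(self, seed: int) -> Any:
        # The toy environment is deterministic, so the seed is unused here.
        return self._env.start()["state"]

    def step(self, action: Any) -> Tuple[Any, float, bool]:
        out = self._env.advance(int(action))
        return out["state"], out["score"], out["over"]


def rollout(env: UnifiedEnv, actions: Iterable[Any]) -> float:
    """Library-agnostic loop: it only relies on reset() and step()."""
    env.reset(seed=0)
    total = 0.0
    for action in actions:
        _, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


total_reward = rollout(ToyCounterAdapter(limit=3), [1, 1, 1, 1])
print(total_reward)
```

The point of the pattern is that `rollout` never needs to know which library sits behind the adapter, so training and evaluation code can be written once.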

We then generate training data with an automated reference-policy pipeline:

Overall, the dataset used for LDM-v0 contains 9.3B transitions and occupies roughly 29 TB of storage.
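For a sense of scale, the two figures above imply an average storage footprint per transition. The short calculation below assumes decimal terabytes (1 TB = 10^12 bytes) and treats both numbers as exact, although the source gives the storage figure only approximately.

```python
TRANSITIONS = 9.3e9    # transitions in the dataset, as reported
STORAGE_BYTES = 29e12  # roughly 29 TB, assuming 1 TB = 1e12 bytes

bytes_per_transition = STORAGE_BYTES / TRANSITIONS
print(f"{bytes_per_transition / 1e3:.1f} kB per transition on average")
```

This works out to about 3.1 kB per transition on average. It is an average only and says nothing about how storage is split between environment types.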

3.2 Model

LDM-v0 is a decoder-only Llama-style transformer trained from scratch.

Our main model uses:

Unlike text-first LLMs, the architecture is designed directly for decision-making trajectories:

3.3 Training Setup

The 308M model was trained from scratch for six days on two nodes with 8 NVIDIA H200 GPUs each (16 GPUs in total).

Generating the reference-policy data was itself a large-scale effort, taking roughly 12 weeks across four servers and 608 CPU cores.
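These figures translate into a rough compute budget. The calculation below is a back-of-the-envelope estimate: it assumes every GPU and every CPU core was busy for the full duration, so both results are upper bounds, and it treats "roughly 12 weeks" as exactly 12 weeks.

```python
HOURS_PER_DAY = 24

# Model training: six days on 2 nodes x 8 GPUs.
train_days = 6
gpus = 2 * 8
gpu_hours = train_days * HOURS_PER_DAY * gpus

# Reference-policy data generation: about 12 weeks on 608 CPU cores.
datagen_weeks = 12
cpu_cores = 608
core_hours = datagen_weeks * 7 * HOURS_PER_DAY * cpu_cores

print(f"training: up to {gpu_hours:,} GPU-hours")
print(f"data generation: up to {core_hours:,} CPU core-hours")
```

Under these assumptions, training accounts for at most 2,304 GPU-hours and data generation for at most about 1.2 million CPU core-hours.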


4. Results

The main result is straightforward:

a single shared model reaches strong performance across a very large number of unrelated RL environments.

More precisely, LDM-v0:

This strong performance appears across a broad range of domains, including:

The key point is that these results come from one pretrained model with a single set of parameters, not one model per benchmark.
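A performance-threshold curve like the one shown at the top of this post can be computed in a few lines. The sketch below assumes each environment has a normalized score in which 1.0 means reference-policy level; that normalization, the function name `threshold_curve`, and the example scores are all assumptions made for illustration, and none of the numbers are results from the paper.

```python
from typing import Dict, List, Sequence


def threshold_curve(scores: Dict[str, float], thresholds: Sequence[float]) -> List[float]:
    """For each threshold, return the fraction of environments scoring at least that much."""
    if not scores:
        raise ValueError("scores must contain at least one environment")
    values = list(scores.values())
    return [sum(v >= t for v in values) / len(values) for t in thresholds]


# Hypothetical normalized scores (1.0 = reference-policy level), NOT paper results.
example_scores = {"env_a": 1.05, "env_b": 0.92, "env_c": 0.40, "env_d": 0.78}
thresholds = [0.5, 0.75, 0.9, 1.0]
curve = threshold_curve(example_scores, thresholds)

for t, frac in zip(thresholds, curve):
    print(f"score >= {t:.2f}: {frac:.0%} of environments")
```

Reading the curve from left to right shows how quickly coverage falls as the bar is raised, which is more informative for a 1,600-environment suite than a single mean score.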

Figure: Model size scaling law.

We also study model scaling with 32M, 70M, 308M, and 736M parameter variants.

Performance improves substantially from 32M to 308M, while gains appear to plateau between 308M and 736M in our current setup.

This suggests that heterogeneous offline RL pretraining does benefit from additional capacity, but that better characterization of scaling laws will require further experiments.


5. Why This Matters

We view this result as evidence that RL may be able to follow a path similar to other machine learning domains:

from training one specialized model per task toward pretraining a shared foundation model on large-scale decision-making data.

This does not mean we have solved general RL.

The paper should be read primarily as a demonstration of scalable multi-task pretraining. Our evaluations are still mostly in-distribution, so broad out-of-distribution generalization remains to be established.

Still, the result is encouraging: even across major variation in modalities, action spaces, reward scales, and temporal structure, a single transformer can learn a surprisingly wide set of useful behaviors.


6. Next Steps

The next milestone is out-of-distribution generalization: decision models that can transfer to genuinely new environments.

To move in that direction, we will focus on three areas: scaling the diversity of environments and data, exploring architectures and training objectives that enable in-context adaptation, and developing offline-to-online finetuning methods.


7. Takeaway

LDM-v0 shows that a single transformer policy can be trained across more than 1,600 heterogeneous RL environments and retain strong task performance at scale.

For us, this is an early but concrete step toward Large Decision Models as a practical foundation-model paradigm for Reinforcement Learning.

