https://www.mechanize.work/blog/the-upcoming-gpt-3-moment-for-rl/
GPT-3 showed that simply scaling up language models unlocks powerful, task-agnostic, few-shot performance, often outperforming carefully fine-tuned models. Before GPT-3, achieving state-of-the-art performance meant first pre-training models on large generic text corpora, then fine-tuning them on specific tasks.
Today’s reinforcement learning is stuck in a similar pre-GPT-3 paradigm. We first pre-train large models, and then painstakingly fine-tune them on narrow tasks in highly specialized environments. But this approach has a fundamental limitation: the resulting performance rarely generalizes. The capabilities that models pick up from RL today are so brittle that even minor changes, such as simply renaming a tool, can dramatically reduce performance.
We think RL will soon have its own GPT-3 moment. Rather than fine-tuning models on a small number of environments, we expect the field will shift toward massive-scale training across thousands of diverse environments. Doing this effectively will produce RL models with strong few-shot, task-agnostic abilities capable of quickly adapting to entirely new tasks. But achieving this will require training environments at a scale and diversity that dwarf anything currently available.
How much RL will this take?
Current RL datasets are relatively small. For example, DeepSeek-R1 was trained on roughly 600k math problems, representing about six years of continuous human effort if each task takes five minutes to complete. By contrast, reconstructing GPT-3’s 300-billion-token training corpus would require on the order of tens of thousands of years of human writing at typical human writing speeds.
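As a rough sanity check on these figures, here is a back-of-envelope calculation. The 5-minutes-per-problem figure comes from the comparison above; the sustained writing speed and words-per-token ratio are illustrative assumptions, not numbers from this post.

```python
# Back-of-envelope: human effort behind current RL data vs. pretraining corpora.
# The per-problem time comes from the text; writing speed and words-per-token
# are illustrative assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # continuous effort, no breaks

# RL side: DeepSeek-R1's ~600k math problems at ~5 minutes each
rl_minutes = 600_000 * 5
print(f"RL data: ~{rl_minutes / MINUTES_PER_YEAR:.1f} years of continuous effort")
# -> ~5.7 years

# Pretraining side: GPT-3's ~300B training tokens, written out by hand
words = 300e9 * 0.75   # assumed ~0.75 words per token
writing_wpm = 40       # assumed sustained writing speed (words/minute)
writing_minutes = words / writing_wpm
print(f"Pretraining data: ~{writing_minutes / MINUTES_PER_YEAR:,.0f} years of continuous writing")
# -> on the order of tens of thousands of years
```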
Incidentally, achieving RL compute expenditure comparable to current frontier-model pretraining budgets will likely require roughly 10k years of model-facing task-time, measured in terms of how long humans would take to perform the same tasks. DeepSeek-R1 used about 6e23 FLOP during its RL stage, corresponding to about 6 years of model-facing task-time. Assuming future training runs use similar numbers of epochs and group sizes to DeepSeek-R1, scaling this to about 6e26 FLOP would imply roughly 6k years of model-facing task-time.
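The scaling logic here is a simple proportionality: if compute per unit of model-facing task-time stays roughly constant (similar epochs and group sizes), task-time scales linearly with RL compute. A minimal sketch:

```python
# Sketch of the linear scaling argument: if epochs and group sizes stay
# similar, RL compute grows roughly in proportion to model-facing task-time.

r1_rl_flop = 6e23        # approximate FLOP of DeepSeek-R1's RL stage
r1_task_time_years = 6   # approximate model-facing task-time behind it

target_flop = 6e26       # RL budget comparable to frontier pretraining runs

implied_task_time = r1_task_time_years * (target_flop / r1_rl_flop)
print(f"Implied model-facing task-time: ~{implied_task_time:,.0f} years")
# -> ~6,000 years, i.e. on the order of 10k years
```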
It is unclear whether future RL training will involve larger or smaller group sizes or more epochs, especially as we increase the diversity of task distributions. We don’t have much data on this question, so making precise estimates of the required model-facing task-time remains difficult, though ~10k years seems likely to be the correct order of magnitude.
For comparison, the amount of work this will require of models is on the same order as that of major software projects like Windows Server 2008, GTA V, or Red Hat Linux 7.1, each estimated to have involved on the order of 10k years of cumulative human effort.
Expanding RL to this scale is economically efficient. Because compute spending dominates total training expenses, scaling RL spending to be comparable with pretraining budgets will yield significant performance improvements without substantially increasing overall costs. However, achieving this will require dramatically scaling up RL environments while ensuring tasks remain automatically scoreable. Doing so will likely require new approaches to building RL environments.
Replication training
Imagine if every time you wanted to pretrain a language model using next-token prediction, you had to manually create the entire training corpus yourself. Clearly, this would be impractical. Instead, we leverage the enormous amount of existing content such as books, academic articles, blog posts, and Reddit discussions to build training corpora.
Similarly, we suspect the GPT-3 moment for RL will be enabled largely by a paradigm we’re calling replication training. This proposed paradigm involves tasking AIs with duplicating existing software products, or specific features within them. Simple command-line tools that implement obscure hashing and encryption algorithms are straightforward initial targets, but this approach can easily extend to more complex software, such as websites, professional software, and games.
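To make this concrete, here is a sketch of what a single replication task might look like. The field names and the example target (a hypothetical CRC-32 command-line tool) are illustrative, not drawn from any actual training setup.

```python
# Illustrative structure of a single replication task. Field names and the
# example target are hypothetical; a real environment would add resource
# limits, sandboxing details, and a much longer specification.

from dataclasses import dataclass

@dataclass
class ReplicationTask:
    spec: str              # detailed natural-language specification of behavior
    reference_cmd: list    # how to invoke the hidden reference implementation
    test_inputs: list      # inputs used to check behavioral equivalence

task = ReplicationTask(
    spec=(
        "Implement a command-line tool `crc32sum` that reads bytes from stdin "
        "and prints the CRC-32 checksum as 8 lowercase hex digits followed by "
        "a newline, exiting with status 0."
    ),
    reference_cmd=["./reference/crc32sum"],
    test_inputs=[b"", b"hello world", bytes(range(256)) * 1000],
)
```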
Each replication task consists of a detailed specification and a reference implementation. The central idea is that AI models are trained to produce an implementation that precisely matches the reference behavior. This clear-cut approach significantly simplifies evaluation, as the grading criteria are objective and direct: either the generated implementation behaves identically to the reference, or it doesn’t.
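Evaluation can then be a straightforward differential test: run the candidate and the reference on the same inputs and require identical observable behavior. A minimal sketch of such a grader, with placeholder command paths, might look like this:

```python
# Minimal differential grader: the candidate passes only if its observable
# behavior (stdout, stderr, exit code) matches the reference on every test
# input. Command paths are placeholders for whatever the environment provides.

import subprocess

def behaves_identically(candidate_cmd, reference_cmd, test_inputs, timeout=10):
    for data in test_inputs:
        cand = subprocess.run(candidate_cmd, input=data,
                              capture_output=True, timeout=timeout)
        ref = subprocess.run(reference_cmd, input=data,
                             capture_output=True, timeout=timeout)
        if (cand.stdout, cand.stderr, cand.returncode) != \
           (ref.stdout, ref.stderr, ref.returncode):
            return False
    return True

# Binary, objective reward: 1 if every input matches the reference, else 0.
reward = int(behaves_identically(["./candidate/crc32sum"],
                                 ["./reference/crc32sum"],
                                 [b"", b"hello world"]))
```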
Though these replication tasks may differ from typical day-to-day software engineering activities, they specifically target critical skills that current AI systems often struggle with. For instance, replicating a complex algorithm (such as a 10k-line encryption/decryption CLI tool, guided by a detailed specification) requires the model to: