The slowdown argument imagined the next phase of AI as monolithic long-horizon RL: let the model work for three days, discover at the end whether it succeeded, and somehow backpropagate one bit of reward through a gigantic trajectory. That regime really would be ugly. But it is not the regime that has been driving recent progress.
The important unit of progress is not “a long task.” It is a verified trace: sample many candidate reasoning/tool-use trajectories, check them cheaply, learn from the winners, and then distill the expensive search back into a cheaper policy. The horizon wall is real where no such verifier exists. The mistake was thinking the frontier had already run out of verifier-rich territory.
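Concretely, the unit looks like this. A minimal sketch, assuming hypothetical `sample` and `verify` callables that stand in for the policy and the cheap checker:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiedTrace:
    prompt: str
    trajectory: str  # the reasoning / tool-use steps that passed the check

def collect_verified_traces(
    prompts: list[str],
    sample: Callable[[str], str],        # hypothetical: one policy rollout
    verify: Callable[[str, str], bool],  # hypothetical: cheap checker (tests, proof checker, ...)
    k: int = 16,
) -> list[VerifiedTrace]:
    """Sample k candidates per prompt; keep only what the verifier accepts.

    The winners become supervised data, which is how the expensive
    search gets distilled back into a cheaper policy.
    """
    winners: list[VerifiedTrace] = []
    for prompt in prompts:
        for _ in range(k):
            trajectory = sample(prompt)
            if verify(prompt, trajectory):
                winners.append(VerifiedTrace(prompt, trajectory))
                break  # one verified trace per prompt is enough for SFT
    return winners
```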
Coding is the cleanest example. “Build this feature” may look like a multi-day task, but the work is full of dense signals: compiler errors, type errors, unit tests, logs, diffs, benchmarks, linters, API contracts, screenshots, runtime traces, and eventually user-visible behavior. The same is true, to varying degrees, for math, formal proof, data analysis, cyber, and many scientific workflows. These domains turn “did the agent succeed after days?” into “did this subclaim compile, execute, prove, reproduce, or pass tests?” The effective horizon is much shorter than the nominal human task horizon.
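A toy illustration of how cheap those signals are to harvest; the commands and file name below are stand-ins, not a real harness:

```python
import subprocess

def cheap_code_verifier(repo_dir: str) -> bool:
    """Check a candidate patch against dense, off-the-shelf signals."""
    checks = [
        ["python", "-m", "py_compile", "main.py"],  # does it even parse? (hypothetical file)
        ["python", "-m", "pytest", "-q"],           # do the unit tests pass?
    ]
    for cmd in checks:
        result = subprocess.run(cmd, cwd=repo_dir, capture_output=True)
        if result.returncode != 0:
            return False  # compiler errors and failing tests are free, dense reward
    return True
```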
This is why the o-series result was not just “more RL.” OpenAI described o1 as improving both with more RL train-time compute and with more time spent thinking at test time; on AIME 2024, o1 improved from 74% single-sample accuracy to 93% when 1,000 samples were reranked by a learned scoring function. That is the signature of search plus verification, not just ordinary imitation learning. OpenAI’s earlier process-supervision work made the same point in miniature: rewarding correct intermediate steps can outperform merely rewarding the final answer.
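The reranked-samples setup is easy to state. A sketch, with `sample` and `score` as hypothetical stand-ins for the policy and the learned scoring function:

```python
def rerank_best_of_n(prompt, sample, score, n=1000):
    """Draw n candidate solutions and return the one the learned
    scoring function ranks highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```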
DeepSeek-R1 made the mechanism even more legible. R1-Zero was trained with large-scale RL on a base model, but the rewards were largely rule-based in domains like math and coding: final-answer checks, format checks, and compiler/test feedback. DeepSeek explicitly avoided neural reward models for reasoning tasks because they invited reward hacking and added training complexity, while rule-based rewards gave precise, hard-to-game feedback. Then DeepSeek used rejection sampling, SFT, and distillation to move reasoning patterns from the expensive model into smaller dense models; it reported that direct distillation from R1 to a 32B model beat applying RL directly to that smaller model.
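A simplified sketch of what such a rule-based reward can look like; the tag names and weights are illustrative, not DeepSeek's exact recipe:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Two cheap, hard-to-game checks: format and final answer."""
    reward = 0.0
    # Format check: reasoning must be wrapped in think tags.
    if re.search(r"<think>.+?</think>", response, re.DOTALL):
        reward += 0.1  # illustrative weight, not DeepSeek's
    # Accuracy check: the boxed final answer must match the reference.
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m and m.group(1).strip() == gold_answer.strip():
        reward += 1.0
    return reward
```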
That distillation result is central. Search is not just an inference-time product feature; it is a data-generation engine. Once a model can occasionally solve a hard problem, you can sample many attempts, filter by verifier, train on the good traces, and repeat. One expensive solved trajectory can become thousands or millions of cheap supervised tokens. This is the search-distill flywheel. It lets progress compound without needing every downstream model to rediscover the same reasoning behavior through fresh online RL.
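In pseudocode-ish Python, the flywheel is just a loop; every callable here is a hypothetical stand-in, since the point is the shape of the loop, not any particular training API:

```python
def search_distill_flywheel(model, prompts, sample, verify, finetune, rounds=3, k=32):
    """Search, filter, distill, repeat."""
    for _ in range(rounds):
        dataset = []
        for prompt in prompts:
            attempts = [sample(model, prompt) for _ in range(k)]             # expensive search
            dataset += [(prompt, a) for a in attempts if verify(prompt, a)]  # cheap filter
        model = finetune(model, dataset)  # distill the winners into the policy
        # Each round, the stronger model solves prompts it used to miss,
        # so the verified set grows and the cycle compounds.
    return model
```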
The compute argument also overcounted the wrong thing. If you picture capability as “RL environment interaction FLOPs,” then GPT-4 → o1 → o3 looks like you quickly exhaust the feasible orders of magnitude. But if capability comes from a portfolio of train-time RL, test-time search, verifier construction, synthetic tasks, tool-use scaffolds, and distillation, then the bottleneck moves. You can buy more capability by spending at inference on hard cases, by improving verifiers, by curating harder prompts, by reusing successful trajectories, or by teaching the model when to call tools. OpenAI explicitly says the o3/o4-mini generation pushed another order of magnitude in both RL training compute and inference-time reasoning, and was trained through RL to use tools and to decide when to use them. That is not “naive policy gradient got cheap”; it is “the learning system changed.”
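One of those knobs, spending at inference on hard cases, fits in a few lines; `sample` and `verify` are the same hypothetical stand-ins as above:

```python
def solve_with_adaptive_budget(prompt, sample, verify, max_samples=256):
    """Easy prompts exit after one verified sample; hard ones get the full budget."""
    for i in range(max_samples):
        candidate = sample(prompt)
        if verify(prompt, candidate):
            return candidate, i + 1  # solved after i + 1 samples
    return None, max_samples  # still unsolved: escalate or hand off
```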
There is another missed feedback loop: better models make their own horizons shorter. A weak model explores almost randomly, so sparse rewards are devastating. A strong pretrained model starts with a good world prior, proposes plausible decompositions, writes mostly-correct code, identifies what to test, and notices when something smells wrong. As the base model improves, each rollout contains more meaningful decision points and fewer hopeless branches. The marginal rollout becomes more valuable. For a while, reward-per-FLOP can rise even as tasks get longer.
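A back-of-the-envelope version with made-up numbers: if a rollout costs C FLOPs and succeeds with probability p, a verified trace costs C/p in expectation, so success rates rising faster than rollout costs push reward-per-FLOP up:

```python
def flops_per_verified_trace(flops_per_rollout: float, p_success: float) -> float:
    """Expected cost of one verified trace: C / p."""
    return flops_per_rollout / p_success

weak   = flops_per_verified_trace(1e15, 0.001)  # near-random exploration
strong = flops_per_verified_trace(4e15, 0.20)   # better prior, longer rollouts
print(f"{weak / strong:.0f}x cheaper per verified trace")  # 50x, despite 4x rollout cost
```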
This explains why progress looks especially fast in agentic coding, cyber, computer use, and technical research. GPT-5.5’s public results emphasize agentic coding, computer use, knowledge work, and scientific workflows, with gains on Terminal-Bench, OSWorld-Verified, GDPval, and long scientific-analysis tasks. Anthropic’s Mythos Preview story points in the same direction: a large jump in general coding/reasoning produced a large jump in vulnerability discovery and exploit development, because cyber is unusually verifier-rich once the model can read code and run experiments.
So I would half-deny the premise. AI progress has not yet proved that we can do sample-efficient RL over arbitrary month-long real-world projects with fuzzy objectives. It has proved something narrower but still enormous: much of economically valuable cognitive labor can be embedded in digital environments where intermediate work is inspectable, executable, and compressible into training data.
The prediction is therefore uneven acceleration. Expect continued rapid gains in domains with cheap verifiers or simulators: code, math, cyber, formal methods, data analysis, logistics-like planning, spreadsheet work, and parts of biology that can be reduced to computational assays. Expect slower progress where success is delayed, social, taste-laden, adversarial in non-formal ways, or dependent on expensive physical experiments. “Run a company,” “pick a strategy,” “manage a team,” and “discover a drug in the wet lab” will not become easy merely because AIME and SWE-bench saturate.
The old horizon intuition was not stupid. It describes the wall we hit after we exhaust the giant backlog of verifiable digital work. But it fired too early. Before the world becomes a sparse-reward POMDP, it looks like a massive software-and-science corpus with compilers, tests, theorem provers, logs, sandboxes, and model critics attached. The thing scaling is not RL compute alone. It is verified thought: produced by search, checked by tools, and distilled back into the model.