LLMs find the right factors but miss the frame

Stress-testing LLM insight on the Dwarkesh Blog Prize

Dwarkesh Patel recently launched a Blog Prize for four big questions about AI. He pointed out that LLM responses to these kinds of questions fall into the failure mode of giving “5 plausible answers” but not “the context and taste to identify the crucial factor.” That generally aligns with my own experience, but I was curious to investigate it more rigorously. So I queried LLMs repeatedly for answers to each of the questions, and then Elo scored, clustered, and aggregated the responses.

Generating LLM thesis statements

Methodology. I queried the GPT-5.4 API ten times for each of Dwarkesh's questions, each query asking for ten thesis statements for essays answering the question. The thesis statements were pooled per question and ranked in a pairwise Elo tournament, also run on GPT-5.4 via a few thousand independent API calls (one per matchup). Finally, I clustered them with the help of Claude Opus 4.7. Each cluster was given a summary and is depicted below alongside its sample size; bars show each cluster's mean Elo. Hover any row for the full cluster description and the top-ranked response. Singleton clusters (those containing a single response) are rendered in lighter gray.
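For concreteness, here is a minimal sketch of how such a pairwise Elo tournament can be run. The `judge` callback (one API call per matchup), the K-factor, and the number of matchups are illustrative assumptions, not the exact setup used here.

```python
import random
from itertools import combinations

K = 24          # Elo K-factor (assumed; the post does not specify one)
BASE = 1500.0   # starting rating for every thesis statement

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run_tournament(theses: list[str], judge, n_matchups: int = 3000,
                   seed: int = 0) -> dict[str, float]:
    """Elo-score thesis statements via independent pairwise LLM judgments.

    `judge(a, b)` is assumed to make one API call and return 1.0 if thesis
    `a` is judged stronger, 0.0 if `b` is, and 0.5 for a tie.
    """
    rng = random.Random(seed)
    ratings = {t: BASE for t in theses}
    pairs = list(combinations(theses, 2))
    for _ in range(n_matchups):
        a, b = rng.choice(pairs)
        if rng.random() < 0.5:          # randomize presentation order
            a, b = b, a
        score_a = judge(a, b)           # one independent API call per matchup
        exp_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - exp_a)
        ratings[b] += K * ((1.0 - score_a) - (1.0 - exp_a))
    return ratings
```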

Q1. RL progress and timelines

19 clusters · 100 responses
A couple years ago, there was this idea that AI progress might slow down as we make further progress into the RL regime. 1. Because as horizon lengths increase, the AI needs to do many days' worth of work before we can even see if it did it right, so if we're still in a naive policy gradient world, the reward signal / FLOP goes down, and 2. We'd crossed through many OOMs of RL compute from GPT 4 to o1 to o3, and it would not be feasible to replicate that many OOMs increase in compute immediately again. But AI progress seems to have been fast nonetheless - even potentially speeding up if rumors about Spud or Mythos are to be believed. What gives? What did that previous intuition pump that motivated longer timelines miss? Feel free to deny premise of question.
verifier dense signal · 17
search distill flywheel · 9
off policy replay · 4
task decomposition · 14
synthetic env selfplay · 10
data quality over flops · 5
internal cot shaping · 1
capabilities compound · 2
parallelism engineering · 2
pretraining still dominates · 9
world model planner · 1
critic actor reward hacking · 1
system design multiplicative · 1
algorithmic efficiency · 9
rl small so ooms feasible · 1
measurement artifact · 3
capability threshold pipeline loop · 1
better problem formulations · 1
[Bars: mean Elo per cluster, axis 1300-1600]

Q2. Foundation model profitability

20 clusters · 100 responses
What's the most plausible story where foundation model companies actually start making money? If you consider each individual model as a company, then its profits may be able to pay back the training cost. But of course, if you don't train a bigger, more expensive model immediately, then you stop making money after 3 months. So when does the profit start? Maybe at some point scaling will plateau, but if progress at the frontier has slowed down, then the combination of distillation and low switching costs (cloud margins result from high switching costs) makes it really easy for open source to catch up to the labs, eating into their margins. So how do the labs actually start making money?
production tooling moat · 5
workflow outcome pricing · 16
enterprise private contracts · 2
regulatory compliance moat · 13
distribution channel ownership · 9
government defense contracts · 3
reliability sla premium · 2
data flywheel · 11
inference cost leadership · 11
model as loss leader · 3
internal labor then external playbook · 1
premise reversed acceleration window · 1
slowdown enables amortization · 8
model factory distillation · 3
brand network amortization · 1
specialization vertical models · 2
scarce capacity allocation · 1
consolidation oligopoly utility · 6
pharma rolling rd portfolio · 1
consortia shared public good · 1
[Bars: mean Elo per cluster, axis 1300-1600]

Q3. OpenAI Foundation: how to deploy $180B

20 clusters · 100 responses
With OpenAI's new raise at an $852B valuation, OpenAI Foundation's stake is now worth $180B. Anthropic's cofounders have pledged to donate 80% of their wealth. Nobody seems to have a concrete idea of how to deploy 100s of billions (soon trillions) of wealth productively to "make AI go well". If you were in charge of the OpenAI Foundation right now, what exactly would you do? And when? It's not enough to identify a cause you think is important, because that doesn't answer the fundamental problem of how you convert money to impact. Identify the concrete strategy you recommend pursuing.
compute governance verification stack · 11
auditor compute access trust · 1
audit eval regime · 10
procurement market for safety · 7
standards release infrastructure · 3
insurance liability regime · 6
buy and retire pause · 4
secure by default deployment · 2
weights security proliferation prevention · 1
civil defense resilience · 12
open auditing transparency · 2
state capacity talent pipeline · 9
infrastructure for governance · 2
treaty international coalition · 6
charter governance · 2
alignment manhattan project · 9
ai dividend plus hard law lobbying · 1
public option frontier lab · 8
preserve optionality emergency brake · 3
broad institutional resilience · 1
[Bars: mean Elo per cluster, axis 1300-1600]

Q4. Countries outside the AI production chain

21 clusters · 100 responses
What should countries which are not currently in the AI production chain (semis, energy, frontier models, robotics) do in order to not get totally sidestepped by transformative AI? If you're the leader of India or Nigeria, what do you do right now?
digital public infrastructure substrate · 7
energy datacenter wedge · 10
domestic deployment diffusion · 13
sovereign compute infrastructure · 5
data moat national datasets · 7
compute access bargaining bloc · 9
ai assurance eval jurisdiction · 5
vertical specialization export niches · 9
state capacity institutional · 5
mass deployment talent · 4
regulatory leverage jurisdiction · 5
national security defense · 2
bloc geopolitical alignment · 2
open standards portability · 3
elite ai builder talent · 4
ai adjacent manufacturing · 2
regional ai hub · 1
tax and redistribute ai rents · 3
food energy security resilience · 1
macro fundamentals · 1
social insurance labor resilience · 2
[Bars: mean Elo per cluster, axis 1300-1600]

So how well did LLMs do?

Overall, I think the set of answers identified and ranked by GPT-5.4 is a reasonable starting point for thinking about these questions. Take Q1, for example: verifiers and task decomposition are, in my view, some of the most important underlying drivers of recent AI progress, and I found it notable that those were close to the top. Most of the top-scoring answers for each question are worth thinking about. But while these answers are real mechanisms, and the top ones may even include the crucial factor (if you were forced to pick one specific mechanism), something is missing in terms of synthesis.

So I'd frame my own critique of LLMs somewhat differently from Dwarkesh's: they actually are good at identifying crucial factors, but they have difficulty coming up with a framing that unifies those factors. For example, my view is that the first question is answered more completely by the fact that the RL used to train LLMs is mechanically different from the RL assumed by the slowdown argument's framing: techniques like rejection sampling, GRPO, and SFT on verified traces look more like supervised learning than like traditional sparse-reward RL. Many of the top factors identified by GPT-5.4 (e.g., dense verification, the role of inference, task decomposition) are underlying mechanisms that illustrate or follow from this, but the framing itself is missing from GPT-5.4's set of responses.
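To make that mechanical difference concrete, here is a minimal sketch of the search-then-filter-then-fine-tune loop, under the assumption that the domain offers a cheap verifier. The hooks `sample_candidates`, `passes_verifier`, and `finetune_on` are placeholders for illustration, not any lab's actual API.

```python
def collect_verified_traces(model, prompts, sample_candidates, passes_verifier,
                            n_samples=16):
    """Rejection sampling: keep only the traces a cheap verifier accepts.

    sample_candidates(model, prompt, n) and passes_verifier(prompt, trace) are
    placeholder hooks, standing in for whatever check the domain supports
    (unit tests, a final-answer check, a proof checker, ...).
    """
    kept = []
    for prompt in prompts:
        for trace in sample_candidates(model, prompt, n_samples):
            if passes_verifier(prompt, trace):
                kept.append((prompt, trace))
    return kept


def verified_trace_round(model, prompts, sample_candidates, passes_verifier,
                         finetune_on):
    """One round of the loop: search, filter by verifier, then do ordinary
    supervised fine-tuning on the surviving traces (dense token-level signal,
    no sparse end-of-episode reward)."""
    traces = collect_verified_traces(model, prompts, sample_candidates,
                                     passes_verifier)
    return finetune_on(model, traces)
```

The point of the sketch is that each kept trace supplies many supervised tokens, which is why the reward-signal-per-FLOP worry from the sparse-reward picture doesn't straightforwardly apply.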

It's possible the framing of the prompt impacted the distribution of answers, but there was nothing stopping GPT-5.4 from giving a more abstract thesis. My best guess is that the complexity of the questions made it difficult for the model to understand what level of abstraction to engage with; that's hard for humans too. It's not at all certain, for example, that my own preference is the most popular one.

To check if another pass at the information might help with this kind of synthesis, I gave GPT-5.5 Pro the scored and clustered data and asked it to answer the first question about the slowdown argument again. In my estimation, it did an okay job, but it tried to unify everything under verified traces, which I think is a mistake. Still, its response was better than anything I'd been able to get out of a model previously, and it didn't obviously succumb to the failure mode Dwarkesh describes, where models give five plausible answers. You can see GPT-5.5 Pro's full response in the addendum below.

I also submitted my personal response to that same question for fun. I'll be curious to see how many of the top factors from this analysis are present in the winning answers, and how they're framed.

Addendum: GPT-5.5 Pro response to Q1

Methodology. GPT-5.5 Pro was given the clustered and scored thesis statements above and asked to answer the first question following Dwarkesh's original instructions. Its full response follows.

The slowdown argument imagined the next phase of AI as monolithic long-horizon RL: let the model work for three days, discover at the end whether it succeeded, and somehow backpropagate one bit of reward through a gigantic trajectory. That regime really would be ugly. But it is not the regime that has been driving recent progress.

The important unit of progress is not “a long task.” It is a verified trace: sample many candidate reasoning/tool-use trajectories, check them cheaply, learn from the winners, and then distill the expensive search back into a cheaper policy. The horizon wall is real where no such verifier exists. The mistake was thinking the frontier had already run out of verifier-rich territory.

Coding is the cleanest example. “Build this feature” may look like a multi-day task, but the work is full of dense signals: compiler errors, type errors, unit tests, logs, diffs, benchmarks, linters, API contracts, screenshots, runtime traces, and eventually user-visible behavior. The same is true, to varying degrees, for math, formal proof, data analysis, cyber, and many scientific workflows. These domains turn “did the agent succeed after days?” into “did this subclaim compile, execute, prove, reproduce, or pass tests?” The effective horizon is much shorter than the nominal human task horizon.

This is why the o-series result was not just “more RL.” OpenAI described o1 as improving both with more RL train-time compute and with more time spent thinking at test time; on AIME 2024, o1 improved from 74% single-sample accuracy to 93% when 1,000 samples were reranked by a learned scoring function. That is the signature of search plus verification, not just ordinary imitation learning. OpenAI’s earlier process-supervision work made the same point in miniature: rewarding correct intermediate steps can outperform merely rewarding the final answer.

DeepSeek-R1 made the mechanism even more legible. R1-Zero was trained with large-scale RL on a base model, but the rewards were largely rule-based in domains like math and coding: final-answer checks, format checks, and compiler/test feedback. DeepSeek explicitly avoided neural reward models for reasoning tasks because they were more hackable and complex, while rule-based rewards gave precise feedback. Then DeepSeek used rejection sampling, SFT, and distillation to move reasoning patterns from the expensive model into smaller dense models; it reported that direct distillation from R1 to a 32B model beat applying RL directly to that smaller model.

That last fact is central. Search is not just an inference-time product feature; it is a data-generation engine. Once a model can occasionally solve a hard problem, you can sample many attempts, filter by verifier, train on the good traces, and repeat. One expensive solved trajectory can become thousands or millions of cheap supervised tokens. This is the search-distill flywheel. It lets progress compound without needing every downstream model to rediscover the same reasoning behavior through fresh online RL.

The compute argument also overcounted the wrong thing. If you picture capability as “RL environment interaction FLOPs,” then GPT-4 → o1 → o3 looks like you quickly exhaust the feasible orders of magnitude. But if capability comes from a portfolio of train-time RL, test-time search, verifier construction, synthetic tasks, tool-use scaffolds, and distillation, then the bottleneck moves. You can buy more capability by spending at inference on hard cases, by improving verifiers, by curating harder prompts, by reusing successful trajectories, or by teaching the model when to call tools. OpenAI’s o3/o4-mini release explicitly says they pushed another order of magnitude in both RL training compute and inference-time reasoning, while also training models through RL to use tools and decide when to use them. That is not “naive policy gradient got cheap”; it is “the learning system changed.”

There is another missed feedback loop: better models make their own horizons shorter. A weak model explores almost randomly, so sparse rewards are devastating. A strong pretrained model starts with a good world prior, proposes plausible decompositions, writes mostly-correct code, identifies what to test, and notices when something smells wrong. As the base model improves, each rollout contains more meaningful decision points and fewer hopeless branches. The marginal rollout becomes more valuable. For a while, reward-per-FLOP can rise even as tasks get longer.

This explains why progress looks especially fast in agentic coding, cyber, computer use, and technical research. GPT-5.5’s public results emphasize agentic coding, computer use, knowledge work, and scientific workflows, with gains on Terminal-Bench, OSWorld-Verified, GDPval, and long scientific-analysis tasks. Anthropic’s Mythos Preview story points in the same direction: a large jump in general coding/reasoning produced a large jump in vulnerability discovery and exploit development, because cyber is unusually verifier-rich once the model can read code and run experiments.

So I would half-deny the premise. AI progress has not yet proved that we can do sample-efficient RL over arbitrary month-long real-world projects with fuzzy objectives. It has proved something narrower but still enormous: much of economically valuable cognitive labor can be embedded in digital environments where intermediate work is inspectable, executable, and compressible into training data.

The prediction is therefore uneven acceleration. Expect continued rapid gains in domains with cheap verifiers or simulators: code, math, cyber, formal methods, data analysis, logistics-like planning, spreadsheet work, and parts of biology that can be reduced to computational assays. Expect slower progress where success is delayed, social, taste-laden, adversarial in non-formal ways, or dependent on expensive physical experiments. “Run a company,” “pick a strategy,” “manage a team,” and “discover a drug in the wet lab” will not become easy merely because AIME and SWE-bench saturate.

The old horizon intuition was not stupid. It describes the wall we hit after we exhaust the giant backlog of verifiable digital work. But it fired too early. Before the world becomes a sparse-reward POMDP, it looks like a massive software-and-science corpus with compilers, tests, theorem provers, logs, sandboxes, and model critics attached. The thing scaling is not RL compute alone. It is verified thought: produced by search, checked by tools, and distilled back into the model.
