Refract
Technology · AI · 2026.07.03

What Comes After the Transformer — the Question Is Wrong

"What will replace the transformer?" Ask about the next mega-trend in AI architecture and most people expect the name of a successor. A state-space model (Mamba)? A world model? Or something not yet written up in a paper?

Yet look at what actually happened over the past two years, and the question itself may be wrongly posed. The biggest capability jumps of 2024–2025 (the "reasoning models": OpenAI's o1 and o3, DeepSeek's R1) came without changing the architecture. o1's system card calls the model itself a "generative pre-trained transformer (GPT)," trained with large-scale reinforcement learning to reason through a chain of thought. R1, being open source, is clearer still. Its early version, R1-Zero, was given not a single human-written reasoning example (no supervised fine-tuning, SFT), yet reasoning ability emerged from bolting pure reinforcement learning (GRPO) onto an existing base model: DeepSeek-V3, a mixture-of-experts (MoE) transformer that carries 671B parameters but activates only 37B per token. The production R1 adds a small set of cold-start examples in a multi-stage recipe, but neither version touched the base architecture.
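For readers who want the mechanics behind "pure reinforcement learning (GRPO)," here is a minimal sketch of the group-relative advantage step as the DeepSeek papers describe it: sample several answers to one prompt, score each with a rule-based reward, and normalize each reward against its own group. The function name, the toy rewards, the epsilon guard, and the choice of population standard deviation are illustrative assumptions, not DeepSeek's code. The policy update and KL penalty are left out.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO-style: advantage_i = (r_i - mean(group)) / std(group).
    The group itself serves as the baseline, so no value network is needed.
    `eps` is an assumed guard against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group: 4 sampled answers to one math prompt, reward 1.0 if the
# final answer checks out, else 0.0 (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers get a positive advantage and wrong ones a negative advantage, which is the only training signal the wrapper needs. Nothing in this step refers to the base model's architecture.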

So before hunting for the name of the next architecture, ask this. If the biggest recent leaps came without changing the architecture, then where is the "shift" actually happening?

Don't Line Up the Candidates — Score the Walls

Instead of listing buzzwords, use a single scorecard. The transformer has a few known "walls": attention cost that grows quadratically with sequence length; a diet of text alone, ungrounded in the physical world; the snap answer of next-token prediction, with no deliberation; and opaque, sequential autoregressive generation. Note which wall each candidate architecture aims at, but hold every candidate up to one litmus test.

Can you get that capability without building a new architecture — by bolting only a training-and-inference wrapper onto an existing pretrained model? If you can, the change happened not in the core (the operation that mixes the sequence) but in the layer wrapping it. If instead it can be won only by pretraining a fundamentally different mixer from scratch, that is a change in the core. Run this test and "what looks like a new architecture" splits apart from "a new regime laid over an old core." Let's score the axes one by one, in order of how big the leap was.

The Reasoning Wall: the Leap Came From Outside the Core

This is where the biggest leap happened. o1, the first reasoning-specialized line that "thinks for a long while before answering," arrived in September 2024. The size of the jump reads fastest as a table.

Benchmark | GPT-4o (non-reasoning) | o1 | R1
AIME 2024 (math) | 13.4% | 74.4% | 79.8%
GPQA Diamond (PhD-level science) | 56.1% | 78.0% | 71.5%

On GPQA, o1's 78.0% is the first score to clear the human-PhD-expert baseline (69.7%), and R1's AIME 79.8% and GPQA 71.5% sit on the same reasoning frontier. Hold up the litmus and the verdict is plain. This capability did not come from pretraining a new core; it came from wrapping reinforcement learning around an existing base. R1-Zero is the proof.

The lever behind this leap is mainly reinforcement-learning training (teaching the model to generate a long, deliberating chain of thought), overlaid with spending more compute at inference time. On the latter, Snell et al. find that allocating inference-time compute well over a fixed model can outperform a model 14 times larger in a compute-matched comparison. Both are regimes outside the core.
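Spending more compute at inference time can be sketched as best-of-N sampling against a verifier, one of the simplest strategies in this family. Everything below is a toy assumption: `propose` stands in for sampling one answer from a fixed model, and `verifier_score` for a grader or reward model. It is not the method of any system named above.

```python
import random

def propose(rng):
    """Hypothetical stand-in for one sampled model answer to 17 * 23."""
    return 17 * 23 + rng.choice([-20, -1, 0, 0, 3, 10])

def verifier_score(answer):
    """Hypothetical grader: higher is better, 0 means exactly right."""
    return -abs(answer - 391)

def best_of_n(n, seed=0):
    """Draw n candidates from the same fixed model and keep the best-scored one."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=verifier_score)

for n in (1, 4, 16):
    print(n, best_of_n(n))
```

The model never changes between runs; only the number of samples does. The sketch also shows why the gain depends on gradeable problems: without a usable `verifier_score`, there is nothing to select on.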

Two caveats attach here. First, this leap clusters in domains where the answer can be graded (math, code, science). o3 made headlines by hitting 87.5% on the abstract-reasoning benchmark ARC-AGI-1 in a high-compute configuration, but that was a special setup costing thousands of dollars per task, and on the successor ARC-AGI-2 — designed to resist pure compute scaling — it sank to roughly 2.9% (the human average is about 60%). "Pour in compute and it gets smarter" is not unconditional; it holds only for gradeable problems. Second, R1's leap happened on top of a 671B-scale MoE base. The regime alone did not do it — a base model that large had to already exist.

The Efficiency Wall: Here the Core Really Does Change — but It Buys Something Else

The candidate that promises to kill the quadratic cost is the state-space model (SSM). Mamba scales linearly with sequence length and promised throughput several times higher than a transformer's. But pure SSMs have a structural weakness. Because they pack the past into a fixed-size state, they are poor at pulling a specific token back out of the context exactly (copying, in-context recall). This is not just an empirical observation but a theoretical limit. A two-layer transformer can copy strings of exponential length; a fixed-state SSM fundamentally cannot. One analysis found that 82% of the quality gap between the subquadratic family and attention comes from this recall.
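The gap between quadratic and linear scaling is easy to see with a rough operation count. The sketch below counts only the sequence-mixing step and ignores constants, projections, and hardware effects. The sizes `D_MODEL = 4096` and `D_STATE = 16` are illustrative assumptions, not figures taken from any model discussed here.

```python
def attention_mix_ops(seq_len, d_model):
    """Pairwise scores plus weighted sum: grows with seq_len squared."""
    return 2 * seq_len * seq_len * d_model

def ssm_mix_ops(seq_len, d_model, d_state):
    """One fixed-size state update per token: grows linearly with seq_len."""
    return 2 * seq_len * d_model * d_state

D_MODEL, D_STATE = 4096, 16  # assumed sizes, for illustration only

for n in (1_000, 10_000, 100_000):
    ratio = attention_mix_ops(n, D_MODEL) / ssm_mix_ops(n, D_MODEL, D_STATE)
    print(f"{n:>7} tokens: attention/SSM op ratio = {ratio:.1f}")
```

Under these assumptions the ratio reduces to `seq_len / d_state`, so the advantage grows with context length. The same fixed `d_state` that makes the SSM cheap is what limits exact recall.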

So the answer production settled on is neither pure SSM nor pure transformer but a hybrid, and in that hybrid attention is the minority. AI21's Jamba mixes attention to Mamba at 1:7, NVIDIA's Nemotron-H leaves only about 8% of its layers as attention, and IBM shipped Granite 4.0 at roughly 9:1 Mamba to transformer. In these hybrids, roughly 90% of the sequence-mixing layers have already gone over to SSM. And even the attention that survives is not the original: DeepSeek-V3's attention was redesigned as multi-head latent attention (MLA). To say the core is "unchanged" is not accurate.
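The ratios above translate into layer shares with simple arithmetic. The sketch uses only the figures quoted in the text (1:7, about 8%, roughly 9:1) and treats each as a share of sequence-mixing layers, which is a simplifying assumption about how each vendor counts.

```python
from fractions import Fraction

def attention_share(attention_parts, ssm_parts):
    """Fraction of sequence-mixing layers that are attention."""
    return Fraction(attention_parts, attention_parts + ssm_parts)

shares = {
    "Jamba (1:7 attention to Mamba)": attention_share(1, 7),
    "Nemotron-H (about 8% attention)": Fraction(8, 100),
    "Granite 4.0 (roughly 9:1 Mamba to transformer)": attention_share(1, 9),
}

for name, share in shares.items():
    print(f"{name}: attention {float(share):.1%}, SSM {float(1 - share):.1%}")
```

The SSM share comes out between 87.5% and 92% across the three, which is where the round figure of about 90% comes from.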

So what does the litmus say here? These core changes bought efficiency, not a leap in capability. What Nemotron-H got by cutting attention to 8% is up to 3x more speed, not a new capability. Pure Mamba fell behind on standard benchmarks, and a minority of attention was reinserted precisely to win that capability back. The resulting hybrids sometimes beat a same-class transformer, but that is at the level of same-scale standard benchmarks, not a frontier leap like the reasoning models. The mixer is genuinely changing, yet what the change buys is speed, memory, and long context rather than a frontier capability jump. The capability jump, as the previous section showed, came from somewhere else (the regime).

None of this means pure SSM models don't exist. 7B-class ones like Falcon Mamba and Codestral Mamba did ship, but they are not frontier general models. And the claim that frontier general models are still attention-based is confirmed only for the open-source ones, DeepSeek-V3 and Llama. The closed models don't reveal their internals, so "they keep attention" is an absence of evidence, not confirmation.

The Grounding Wall: World Models Haven't Reached Language Yet

The world model is the candidate born from the critique that text alone cannot learn the causality of the physical world. Yann LeCun is the representative voice, arguing that autoregressive LLMs accumulate error exponentially. Meta's V-JEPA 2 is a video-prediction model that uses no text at all, put to work on zero-shot robot planning. DeepMind's Genie generates manipulable 3D environments from an image, and Wayve's GAIA models the world of autonomous driving. What they share is that they all live in robotics, video, and simulation. As of July 2026, there is no case of the language-generation frontier being replaced by a world-model architecture. (OpenAI called Sora a "world simulator," but with physics violations remaining, such as glass that does not shatter and liquids that pass through, it is far from a verified causal model.)

This "absence" has to be read with care. The world model's absence from the language frontier may only mean there hasn't been time to scale yet (Mamba itself arrived only in late 2023). Don't turn absence straight into disproof. The problem that a language model learns the correlations of text instead of the causality of the world points to the same spot as the missing "correction channel" taken up in A Machine That Has Never Seen a Board Knows the Board — Understanding, Mimicry, or the Wrong Question?.

The Generation-Mode Wall: a System Component, and One Real Counterexample

The wall of opaque, sequential generation has two candidates. One is neuro-symbolic. DeepMind's AlphaGeometry solved 25 of 30 olympiad geometry problems as a collaborative system in which a neural network proposes auxiliary constructions and a symbolic engine verifies them. Then a system combining AlphaProof and AlphaGeometry 2 solved 4 of 6 problems at the International Mathematical Olympiad, reaching silver-medal level — a setup that attached a formal prover (Lean) to a language model. This is not the replacement of a monolithic model but the assembly of a system that uses the LLM as a component, and it sits in the niche of verifiable domains.

The other is this piece's strongest counterexample. Diffusion language models throw out autoregression and build a sentence by parallel denoising, refining many positions at once over a series of steps instead of emitting one token at a time. LLaDA, at 8B scale, goes toe to toe with LLaMA3 8B, and on the "reversal curse" in particular (the problem where learning "A is B" fails to yield "B is A") it beats GPT-4o on a reversal poem-completion task. Commercial diffusion models like Mercury run at over 1,000 tokens per second. A different generation mode beating a frontier model on some axis: that is a real challenge to the story so far.

But here the litmus turns ambiguous. The network that actually clears the noise in a diffusion model — the denoiser — is usually a transformer. Only, having thrown out autoregression, its attention flips from causal to bidirectional. So is diffusion a "non-transformer core" or a "non-autoregressive regime"? This case, where the generation objective and the directionality of attention change together, is a gray zone in which the boundary between core and regime is not clean. There is still no diffusion-based frontier flagship, but this one candidate is worth watching for whether it becomes a signal that the core itself is splitting.
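The flip from causal to bidirectional attention is, mechanically, a change of mask. A minimal sketch in plain Python, where 1 means position i may attend to position j. The function names are illustrative and not taken from any diffusion model's code.

```python
def causal_mask(n):
    """Autoregressive: position i sees only positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style denoiser: every position sees every position."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
print()
for row in bidirectional_mask(4):
    print(row)
```

The layers can have the same shape in both cases; what differs is the mask and the training objective. That is why the core-versus-regime call is hard to make for diffusion.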

So: This Is Not a Discovery but a Lens Defended by the Litmus

Score all the axes and one picture emerges. The leaps in capability came from the layer wrapping the core — training-and-inference regimes, system assembly — while the change in the core (the mixer) itself bought efficiency, not capability.

Now face head-on the labeling problem I deferred twice. This picture is not a discovery the data revealed on its own. The very same facts can be read the opposite way. "The transformer of 2026 is not the transformer of 2017. It has already been made heterogeneous by sparse routing (MoE), state-space layers, and reinforcement-learning reasoning loops. The pure dense autoregressive monolith quietly died; we only call it a 'transformer' out of inertia." That side says "it has already been replaced." The lens above says "the layer wrapping the core changed." Where the core ends — that boundary is not one the data draws for you by counting layer ratios alone.

What draws the boundary is not the ratio but the litmus. If capability attaches by bolting a wrapper onto an existing pretrained model, it's a regime; if it requires pretraining a fundamentally different mixer from scratch, it's the core. By this standard the reasoning leap is plainly a regime (R1-Zero reproduces it with reinforcement learning alone on top of an existing base), the SSM hybrids change the mixer but buy efficiency rather than capability, and only diffusion stays in the gray zone. I choose this lens not because it is the one truth but because it is the operational criterion that separates what you can bolt onto an existing model from what you have to build anew. That is why the next question is practically useful. When someone puts forward a "new architecture," there is one question to throw at it. Can you bolt it onto an existing model, or must you train it from scratch? If the former, the value is in the wrapper, not the core.

What to Watch, and What Would Prove This Wrong

The litmus lets me nail the outlook down in a falsifiable form. Through 2027, new frontier capability leaps will come from regimes and systems laid over existing pretrained models, and will not require pretraining a new mixer from scratch.

The signals to watch. Do new capability jumps get announced as new regimes (reasoning, agents, tools, memory, verifiers) rather than new core architecture families? Do the efficiency-selling mixer swaps (SSM hybrids, MoE) and the capability-selling regimes stay separate? Do alternatives like diffusion and SSM keep winning on a particular axis while the general-benchmark capability frontier remains reproducible with a wrapper?

Let's also fix what would prove this outlook wrong. It fails when some new capability leap cannot be reproduced by laying a wrapper over an existing model, that is, when the leap is attributable not to a regime or routing but to a freshly pretrained mixer itself. This covers not only a pure-diffusion or pure-SSM core but also the case where a mostly-SSM hybrid pushes past the attention frontier on capability rather than efficiency. Then this lens has to bend. (Note that the absence itself, "there is still no non-transformer frontier," is not disproof. It may just be that a new mixer has had little time to scale to the frontier, so watch the attribution of the leap, not the absence.)

What comes after the transformer is most likely not a swap to a different model. Rather, whatever the core mixer turns into, the leaps in capability come from the regime and system layers wrapping it. That layer thickening is the real substance of the next trend. If this direction is right, it means the source of capability shifts from model training to system assembly and verification — and who builds and answers for that system leads into the question taken up in AI Agents and the Accountability Gap: the Work Is Delegated, the Respondent Is Not. Another axis, the fragmentation of deployment, lives in On-Device AI: Cost and Jurisdiction, Not Chips, Draw the Line. For the builder, the capability that will last is not calling which architecture wins but assembling and verifying that layer.

Sources
Reasoning regime (test-time compute · RL)
  1. OpenAI o1: Learning to reason with LLMs · o1 system card
  2. OpenAI o3: ARC-AGI breakthrough (ARC Prize) · ARC-AGI-2 resistance (Chollet et al., arXiv:2505.11831) · o3/o4-mini release
  3. DeepSeek-R1: arXiv:2501.12948 (R1-Zero pure RL / R1 multi-stage)
  4. Inference-time compute scaling: Snell et al., arXiv:2408.03314

Efficiency (SSM · hybrids)
  5. Mamba: Gu & Dao, arXiv:2312.00752
  6. Pure-SSM recall limit: Jelassi et al. "Repeat After Me", arXiv:2402.01032 · Zoology (recall 82%), arXiv:2312.04927
  7. Hybrid evidence & production: NVIDIA Mamba-Transformer, arXiv:2406.07887 · Jamba (AI21), arXiv:2403.19887 · Nemotron-H, arXiv:2504.03624 · IBM Granite 4.0 · Falcon Mamba · Codestral Mamba

Grounding (world models)
  8. Meta V-JEPA 2 · DeepMind Genie 2 · Wayve GAIA-1 (arXiv:2309.17080) · OpenAI Sora as world simulators

Generation mode (neuro-symbolic · diffusion)
  9. AlphaGeometry (DeepMind) · AlphaProof IMO silver-medal level
  10. Diffusion LLMs: LLaDA (arXiv:2502.09992) · Mercury (Inception Labs) · Gemini Diffusion

Modularity (MoE)
  11. Mixtral (arXiv:2401.04088) · DeepSeek-V3 (arXiv:2412.19437) · Llama 4 (Meta) · Qwen3 (arXiv:2505.09388)

Analyzed and verified multi-dimensionally with AI; reviewed by the author.