Technology·AI·2026.06.27



A Machine That Has Never Seen a Board Knows the Board — Understanding, Mimicry, or the Wrong Question?

There is a neural network that was never taught a single rule of Othello. It was fed nothing but sequences of legal moves — game records, the bare transcript of play — and trained to predict the next one. No one ever told it the board is 8×8, or that pieces flip. And yet open the model up and you find, grown inside it, a representation that computes which piece sits where on the board right now. It knows the board without knowing the rules.

I keep stopping in front of this one scene. On the question of LLMs we split into two camps. One says the thing understands the world; the other says it merely mimics words a human once wrote. Which is the Othello machine? If it is mimicry, how does it know the board — and if it is understanding, why was it never taught the rules? This piece follows that fork, not to pick an answer, but to chase the possibility that the question itself has been set down in the wrong place.

When Mimicry Deepens, a Map Appears

Start with the case for emergence. Propping up the weaker side as a straw man and knocking it over is not thinking, so begin with this camp's hardest result.

The first finding from the Othello experiment was subtle. A non-linear probe trained to reconstruct the board state had an error rate of 26.2% on a randomly initialized model, which fell to 1.7% once the model had finished training. Something like a board sits somewhere inside the network. A skeptic could fairly object here — isn't the probe wringing the board out by force? — and the objection was legitimate. Then a follow-up changed one thing, and the picture sharpened. Read the board not in absolute colors — black/white — but in player-relative coordinates — mine/yours/empty — and a simple linear probe alone reached over 99% accuracy. More than that: twist that linear direction artificially, and the model's next move changes causally. The representation is not merely smeared across the weights. The model reads it to choose its move.

Othello-GPT board reconstruction	Result
Non-linear probe · randomly initialized model	26.2% error
Non-linear probe · trained model	1.7% error
Linear probe · player-relative coordinates (mine/yours/empty)	99%+ accurate; intervening on the linear direction changes behavior causally

Table: the core evidence for the emergence camp. Sources — Li et al., Othello-GPT (ICLR 2023) · Neel Nanda, linear-representation analysis (2023). As of 2026-06-27.
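Why the change of coordinates matters can be seen in a toy case. The sketch below is not the Othello-GPT setup. It assumes a hypothetical two-feature activation, one feature for whether a square is mine or yours and one for whose turn it is, each coded as -1 or +1. Under that assumption the absolute colour is the product of the two features, an XOR-like target that no linear probe can fit, while the player-relative label can be read off directly. The names `predict`, `train_perceptron`, `accuracy`, `relative_task`, and `absolute_task` are illustrative only.

```python
from itertools import product


def predict(w, b, x):
    """Linear threshold unit: +1 if w.x + b > 0, else -1."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1


def train_perceptron(samples, epochs=50):
    """Fit a linear probe with the classic perceptron update rule."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in samples:
            if predict(w, b, x) != y:
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the linear probe labels correctly."""
    return sum(predict(w, b, x) == y for x, y in samples) / len(samples)


# Hypothetical activations (assumption): x = (relative, turn), each -1 or +1.
# relative: +1 = "mine", -1 = "yours"; turn: +1 = black to move, -1 = white.
points = list(product([-1, 1], repeat=2))
relative_task = [(x, x[0]) for x in points]         # label: mine / yours
absolute_task = [(x, x[0] * x[1]) for x in points]  # label: black / white

acc_relative = accuracy(relative_task, *train_perceptron(relative_task))
acc_absolute = accuracy(absolute_task, *train_perceptron(absolute_task))
print(acc_relative, acc_absolute)
```

On these four points the relative probe reaches 1.0. The absolute probe cannot exceed 0.75 whatever the training procedure, because the target is not linearly separable in this basis. That is the toy version of why a probe's failure in one coordinate system says little about whether the information is there.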

Nor is this peculiar to Othello. Gurnee and Tegmark opened up the Llama-2 family and showed that when the model is given a city name, its activations spontaneously carry a linear representation of latitude and longitude, and when it is given an event, a linear representation of the year. The larger the model, the more accurate the map, and they could even pick out individual neurons that handle space and others that handle time.

Read only this far and the conclusion looks all but settled. Let mimicry run deep enough and a map grows inside it. To predict the next word well, it pays in the end to internalize the structure of the world those words point to, and so the model draws a scale model of the world without being asked. That is the reading you want to reach for.

Sutton's Blade

Richard Sutton — a founder of reinforcement learning and a 2024 Turing Award co-laureate — rejects that conclusion head-on. In the autumn of 2025, in a long conversation with Dwarkesh Patel, he pinned the LLM down as a "dead end." His logic does not fight over whether the representation exists. It cuts lower.

Sutton's distinction is this. What an LLM learns is what a person will say next — not what will happen next in the world. The two look like the same act of prediction, but they differ at the root. The former imitates the speech process of a human speaker; the latter knows how a physical environment responds to an action. So he calls the LLM not a model of the world but a model of the human language-generation process. Even when we say the Othello machine knows the board, that knowing was shaped indirectly, through external data — legal move sequences someone else handed it — not won by placing a stone itself, watching it flip, and learning from the result.

And decisively, in Sutton's view the LLM lacks three things: a goal, ground truth, and the ability to learn on the job, that is, continual learning. Once pretraining ends, the weights freeze and only inference repeats. There is no structural mechanism by which today's mistake gets inscribed into tomorrow's weights and corrected. Even Patel granted that this absence of continual learning is "a genuine basic gap." But Patel pushes back: because today's models are trained with reinforcement learning on tasks where the answer is verified, such as mathematical proofs and coding, the line between imitation and experience is not as clean as Sutton draws it.

One Map, or a Bag of Heuristics?

The most refined voice doubting emergence comes not from Sutton but from the side of the cognitive scientist Melanie Mitchell. But push that doubt to its end and you reach not a denial of emergence but a more precise question.

Mitchell takes the linear-probe result from the Othello experiment seriously, but she does not cross from there to "there is a world model." Her objection was twofold. One: the non-linear probe is "too powerful," so the credit for reconstructing the board may belong to the probe's own computation rather than to the transformer. Two: when student researchers took it apart, the Othello model looked less like a single coherent board model than like a bundle of local rules scattered across the board, a "bag of heuristics." The first objection was largely blunted by the follow-up linear-probe result (over 99% with a computationally weak probe, plus causal intervention). So the live issue now is the second one: one map, or a bag of heuristics? Which is why Mitchell withholds judgment rather than ruling: "the claim that an abstract world model has emerged in LLMs is not yet supported by strong evidence."

The same probe experiment reads on one side as "evidence of a world model" and on the other as "evidence of a bag of heuristics." Not because the evidence is thin. Because we have never agreed on what to pack into the words "world model."

The Wrong Question

After looking at it for a long time, the place I arrive is this. "Does the LLM have a world model?" is the wrong question, set down in the wrong place.

Frame it as representation-present-or-absent, and the emergence research has already supplied part of the answer. Something map-like is inside the model. Sutton's rebuttal does not, in fact, deny the existence of that map. What he strikes at is the map's provenance and its correctability. Where did it come from — someone else's text, or a result I obtained by acting? And when it is wrong, can the world correct it?

On that second point, one thing must be split apart. "Correct" has two layers. One is fixing the output mid-inference, inside the same session. The model reads a compile error and fixes its code; a person hands back a flawed draft and the next output differs. This in-context correction plainly happens. Patel's point about the blurred line between imitation and experience lands exactly here. But this correction is volatile — it vanishes when the session ends and leaves not a single character in the weights. The other layer is correction that congeals into the weights, so that next time the same mistake is made less often: persistent continual learning. What today's LLM lacks is not the first layer but precisely this second one.

So the boundary line sharpens — not "does correction occur?" but "does correction stay and accumulate?" Imagine a frozen network that represents the Othello board perfectly. It knows the board. Within any single game a person can correct it. But because that correction is never inscribed in itself, play a thousand games and the thousandth still fails in the same place the first did. The representation is there, but it has no capacity to be wrong in the way that teaches itself. One thing must be made clear here: a frozen network can be competent moment to moment — a calculator is corrected by nothing and yet does not err. Momentary competence and the capacity to improve oneself are different axes. So the line I mean to draw is not a definition of "intelligence in general." It is the line that separates the kind of intelligence we delegate trust to, and lean on more and more — intelligence that gets better over time. That line falls not at the static spot of "understanding versus mimicry" but at the dynamic spot of what corrects this model, and whether that correction stays.
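The two layers of correction can be separated in a toy loop. The sketch below assumes a hypothetical model whose "weights" are a lookup table with one wrong entry, and a hypothetical `run_session` in which feedback is always available inside the session. The only thing that varies is whether a correction is also written back into the weights when it happens.

```python
def run_session(weights, truth, persist, attempts=2):
    """Play one session and return how many wrong answers were given.

    weights: the model's stored beliefs (these survive across sessions)
    persist: if True, a correction is also written into the weights
    """
    context = {}  # in-context corrections; discarded when the session ends
    errors = 0
    for _ in range(attempts):
        for question, answer in truth.items():
            guess = context.get(question, weights[question])
            if guess != answer:
                errors += 1
                context[question] = answer      # layer 1: volatile fix
                if persist:
                    weights[question] = answer  # layer 2: fix that stays
    return errors


truth = {"a1": "black", "b2": "white"}   # hypothetical board facts
frozen = {"a1": "black", "b2": "black"}  # one wrong belief
continual = dict(frozen)

frozen_errors = [run_session(frozen, truth, persist=False) for _ in range(5)]
continual_errors = [run_session(continual, truth, persist=True) for _ in range(5)]
print(frozen_errors)     # [1, 1, 1, 1, 1]
print(continual_errors)  # [1, 0, 0, 0, 0]
```

Within each session both models recover on the second attempt. Only the one that writes the correction into its weights stops making the mistake in later sessions.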

Sutton, citing the absence of continual learning and ground truth, calls the LLM a dead end. But even he still fights on the spot of "is the LLM a world model." What I mean to move is that spot itself. Posed as a static question, "does it have a world model," the two camps only run parallel lines over the same probe result: one looks at the representation and says "it's there," the other looks at the provenance and calls it "mimicry," and they run on forever without meeting. Move the axis to "what corrects it, and whether that correction stays," and why the two read the same evidence in opposite directions is explained at a stroke: the representation is real but has no agency over itself, and the disagreement comes not from a shortage of evidence but from the fact that we never agreed on what "world model" means. Sutton's continual-learning point is one coordinate on this axis, not the axis itself. Where he struck rightly, I move the line to a more precise place and draw it again.

Seen this way, it becomes clear why the picture of reinforcement learning Sutton spent his life on is so different. The agent in RL acts upon an environment, takes back a ground-truth signal called reward, and corrects itself with that signal. The one thing the frozen Othello network can never do — the result coming back to correct the self — sits here in the dead center of the circuit. Sutton and colleagues pushed it further still, advancing the hypothesis that the single principle of reward maximization alone could underwrite the whole of natural and artificial intelligence ("Reward is enough"). Whether intelligence is reward maximization — there I do not follow him. A head-on rebuttal exists in the literature — that a single scalar reward is not enough — and that question stays open. But whether the grand picture is right or wrong, one thing remains: at the center of the reinforcement-learning circuit sits a loop in which the world corrects me, and in today's LLM that loop is severed at the level of the weights. And that severed loop is exactly where the axis I drew points.
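The loop described above (act, receive a reward, update) can be sketched in a few lines. This is a generic two-armed bandit with a greedy incremental value update, not anything taken from Sutton's interview or from "Reward is enough." The reward values, the step size, and the optimistic initial estimates are assumptions chosen so that the run is deterministic.

```python
def run_bandit(rewards, steps=20, step_size=0.5, initial=1.0):
    """Greedy agent on a deterministic bandit; returns estimates and choices."""
    q = [initial] * len(rewards)  # optimistic start makes the agent try each arm
    history = []
    for _ in range(steps):
        arm = max(range(len(q)), key=lambda a: q[a])  # act (ties: lowest index)
        reward = rewards[arm]                         # the environment answers
        q[arm] += step_size * (reward - q[arm])       # the answer corrects the agent
        history.append(arm)
    return q, history


estimates, history = run_bandit(rewards=[0.2, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_arm, history[:3])
```

The agent's first guess about arm 0 is wrong, the reward says so, and the estimate that drives the next action changes. Nothing outside the loop has to tell it which arm is better.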

The Bitter Lesson Turns on Its Author

Go one step further, though, and a strange paradox surfaces.

In 2019, in "The Bitter Lesson," Sutton wrote the largest lesson of 70 years of AI in a single line: general methods that leverage computation ultimately beat methods that build in human domain knowledge, and by a large margin. In chess, in Go, in speech recognition, systems with what humans believe they know hand-coded in led in the short run but collapsed in the long run before two methods that scale arbitrarily — search and learning. The sentence he left is like a blade: building in how we think we think does not work in the long run. We must plant not the content of discovery but the capacity to discover.

And it is this very blade that turns on the LLM. The LLM is a vast compression of all the text humanity has written — that is, of what humans have discovered. In Sutton's view that is no shining proof of the bitter lesson. It is a negative example, rather. The model's cognition is capped at the ceiling of human knowledge, and no animal in the natural world, he argues, learns by that kind of imitation.

This last premise, though, is contestable. The human is precisely the animal that learns by imitating at scale — through language, culture, and the game record — and the cumulative culture of Homo sapiens is itself a product of imitation learning. On that view, room opens to argue that the LLM, reasoning over text, is a kind of experiential learning too. Sutton's proposition that "imitation is a dead end" is less established doctrine than a still-contested issue. And the very fact that this dispute never ends is what underwrites my diagnosis. The line "imitation or experience" is, like "world model or not," a static question no one can settle. Shift to the correction axis and the dispute falls away — because whether the imitation is human or an LLM's, all you need ask is whether the result comes back, corrects the self, and accumulates.

Read the same evidence from three positions and it splits like this.

Lens	Emergence reading	Sutton / skeptic reading
Technical	Board, space, and time representations arise spontaneously from move records and text alone; intervening on the linear direction changes behavior	Frozen weights, no continual learning, an observer that does not act → correction does not stay
Epistemic	To predict the next token well, internalizing world structure pays → mimicry into a map	The representation's provenance is someone else's text; a "human language-generation process" model, not a "world" model
Philosophical	The line between deep-enough mimicry and understanding is unclear; reserve judgment for want of strong evidence	Intelligence = goal, correction, agency; reward maximization is central, but its sufficiency is in dispute

Table: the dialectic of three lenses dividing the same evidence. Sources — Othello-GPT (Li et al., ICLR 2023) · linear representations (Nanda et al., 2023) · space-and-time representations (Gurnee & Tegmark, ICLR 2024) · Sutton interview (Dwarkesh, 2025-09-26) · Mitchell (2025) · Reward is enough (Silver et al., 2021) and its rebuttal (Vamplew et al., 2022). As of 2026-06-27.

So What Corrects the Tool?

This contemplation does not end in the abstract; it comes back to my desk. Most of us now use this model every day. We hand it a draft, have it write code, set it to summarize a source. If so, the question to ask is not the metaphysics of "does this understand the world?" but a far more practical sentence. What corrects this output, and does that feedback channel exist?

Where the answer is verified on the spot — code that compiles, a calculation that comes out right, a function whose tests pass — the world returns the output quickly, right there. So delegation is relatively safe. Conversely, where no ground-truth signal of right and wrong comes back at once — a judgment, a strategy, the truth of a matter of fact, a decision that carries accountability — the correction channel is empty. That empty seat is exactly where a human must stay in the loop. (What disappears when this delegation is pushed all the way to a life-and-death decision is taken up separately in Who Pulled the Trigger — Autonomous Weapons and No One Left to Answer.)

To be honest, one more layer has to go on top. A channel existing does not mean it is safe. Whether what that channel measures is the real answer is a further problem. Code that compiles but whose specification is wrong; a function that passes its tests but whose tests are themselves badly written — cleverly overfitting to a measurable signal is common. So the question refracts into two stages. What corrects this output — and does what that correction measures really match what I want? Ask not whether the model "understands" but what tells it that it is wrong, and is that telling trustworthy — and the line of how far to delegate and where to stop and inspect comes into focus.
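The gap between "a channel exists" and "the channel measures the right thing" shows up even in a tiny case. In the sketch below, `sort_candidate` stands in for a hypothetical model output, and `weak_check` and `strong_check` are two hypothetical test suites for it. None of these names come from a real project.

```python
def sort_candidate(xs):
    """Hypothetical model output: returns a copy of the input, never sorts."""
    return list(xs)


def weak_check(fn):
    """A badly written test: right length, right elements, pre-sorted input."""
    out = fn([1, 2, 3])
    return len(out) == 3 and set(out) == {1, 2, 3}


def strong_check(fn):
    """Compare against a trusted reference on ordered and unordered inputs."""
    cases = [[], [1], [1, 2, 3], [3, 1, 2], [2, 2, 1]]
    return all(fn(case) == sorted(case) for case in cases)


print(weak_check(sort_candidate))    # True: the channel reports "correct"
print(strong_check(sort_candidate))  # False: the weak channel measured too little
print(strong_check(sorted))          # True
```

Both checks are feedback channels. Only the second one measures what was actually wanted, which is the second-stage question: not just whether something corrects the output, but whether that something can be trusted.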

Sutton goes further still, all the way to the claim that the succession from humanity to AI is inevitable and that we should prepare for it rather than fear it. That is not a verified fact but one old scholar's forecast, and I do not follow him that far. But one thing he got right remains. To ask after the conditions of real intelligence is to ask not what the model knows but what corrects the model and whether that correction stays. A machine that knows the board without ever having seen one is a marvel. But until we ask what sets it right when it draws the board wrong, and whether that setting-right carries into the next game, we have not yet spoken of intelligence.

Sources

#	Outlet (via)	Primary source	Link	As of
1	incompleteideas.net	Richard Sutton, The Bitter Lesson	http://www.incompleteideas.net/IncIdeas/BitterLesson.html	2019-03-13
2	Dwarkesh Podcast	Richard Sutton interview ("dead end" · absence of continual learning)	https://www.dwarkesh.com/p/richard-sutton	2025-09-26
3	Dwarkesh Podcast	Dwarkesh Patel, "Thoughts on Sutton" (rebuttal note)	https://www.dwarkesh.com/p/thoughts-on-sutton	2025-09
4	ACM	2024 ACM Turing Award (Barto & Sutton, reinforcement learning)	https://awards.acm.org/about/2024-turing	2025-03
5	arXiv	Li et al., "Emergent World Representations" (Othello-GPT, ICLR 2023)	https://arxiv.org/abs/2210.13382	2023
6	neelnanda.io	Neel Nanda, "Othello-GPT Has A Linear Emergent World Representation"	https://www.neelnanda.io/mechanistic-interpretability/othello	2023
7	arXiv	Gurnee & Tegmark, "Language Models Represent Space and Time" (ICLR 2024)	https://arxiv.org/abs/2310.02207	2024
8	AI: A Guide for Thinking Humans	Melanie Mitchell, "LLMs and World Models, Part 2"	https://aiguide.substack.com/p/llms-and-world-models-part-2	2025
9	Artificial Intelligence (Elsevier)	Silver et al., "Reward is enough" (v299, 103535)	https://www.sciencedirect.com/science/article/pii/S0004370221000862	2021
10	arXiv	Vamplew et al., "Scalar reward is not enough" (rebuttal)	https://arxiv.org/abs/2112.15422	2022
11	X (@RichardSSutton)	Richard Sutton, on AI succession (WAIC talk)	https://x.com/RichardSSutton/status/1700315838468043015	2023-09