What Survived in Behavioral Economics — The Stars the Replication Crisis Toppled Were Mostly Social Psychology

One discipline won the Nobel Prize in Economics twice — Daniel Kahneman in 2002, Richard Thaler in 2017. Its founding paper, prospect theory, ranks among the most-cited papers in economics, and its applied arm, the nudge, has spread to some 300 institutions across 63 countries, with more than 200 dedicated units inside governments alone. And yet the same discipline has spent the past decade being told it "collapsed in the replication crisis." Both are true. So the first question to ask is what survived.

What Behavioral Economics Is

To score what survived, we first have to list the items that belong on the scorecard — which means starting with what behavioral economics is. To say it up front: this piece confines itself to the core theories of judgment and choice, the lineage of prospect theory. The market anomalies of behavioral finance (the disposition effect, the equity premium puzzle, and the like) are a separate subject and are left aside here.

Prospect theory (Kahneman and Tversky, 1979) holds that people judge gains and losses against a reference point rather than against final wealth. They avoid risk in the domain of gains and take on risk in the domain of losses, and they weigh a loss more heavily than a gain of the same size (loss aversion). Probabilities, too, are not used as given: low probabilities are inflated and high ones discounted in the weighting. The coefficient putting a loss at roughly 2.25 times a gain comes not from 1979 but from an estimate in a 1992 follow-up paper. The original 1979 paper established that asymmetry only qualitatively.
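The shape described above can be sketched numerically. The following is a minimal illustration, assuming the power-function value curve and the inverse-S weighting function of the 1992 follow-up paper. Only the loss-aversion coefficient 2.25 is quoted in this piece; the other constants (0.88, 0.61, 0.69) are the median estimates reported in that 1992 paper, and the function names are hypothetical.

```python
# Sketch of the Tversky & Kahneman (1992) functional forms.
# LAMBDA is the coefficient quoted in the article; ALPHA, BETA, GAMMA and
# DELTA are the 1992 median estimates and are NOT stated in the article.
ALPHA = 0.88   # curvature of the value function for gains
BETA = 0.88    # curvature of the value function for losses
LAMBDA = 2.25  # loss-aversion coefficient
GAMMA = 0.61   # probability-weighting curvature for gains
DELTA = 0.69   # probability-weighting curvature for losses


def value(x: float) -> float:
    """Subjective value of an outcome x measured from the reference point."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * ((-x) ** BETA)


def weight(p: float, curvature: float = GAMMA) -> float:
    """Decision weight: inflates low probabilities, discounts high ones."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    numerator = p ** curvature
    denominator = (numerator + (1.0 - p) ** curvature) ** (1.0 / curvature)
    return numerator / denominator


if __name__ == "__main__":
    # A loss of 100 weighs LAMBDA times as much as a gain of 100.
    print(abs(value(-100)) / value(100))
    # A 1% chance is overweighted; a 99% chance is underweighted.
    print(weight(0.01), weight(0.99))
    # Losses use their own curvature parameter.
    print(weight(0.01, DELTA))
```

Under these assumed parameters a sure gain of 50 is valued above half the value of a gain of 100 (risk aversion in gains), and the mirror-image comparison for losses goes the other way.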

Framing (1981) is the reversal of preference produced by how the same options are described. In a situation where 600 people will die, 72% chose the certain option under the survival frame — "saves 200 for sure" — but only 22% chose it under the logically identical mortality frame, "400 will die for sure." Anchoring (1974) is the pull an irrelevant number exerts on a judgment. A group for whom a spun wheel of fortune stopped at 10 estimated the share of African nations in the UN at a median of 25%, while a group for whom it stopped at 65 estimated 45% — and the effect did not shrink even when accuracy was rewarded.

The remaining theories sit in the same lineage. Mental accounting (Thaler, 1985 and 1999) holds that although money carries no label, people partition it in their minds into purpose-specific accounts. The endowment effect (1990) is the gap by which the price someone demands to sell a mug just handed to them runs to about twice the price they would pay to buy it. Present bias (Laibson, 1997) is the tendency of the present self to betray the future self — preferences that reverse over time, driving people to seek devices that bind their future selves. And the application that carried these insights into policy is the nudge: choice architecture that changes behavior in a predictable direction without forbidding any options or greatly altering economic incentives, its signature tool being the default.

This is behavioral economics — not a scattered list of illusions but a map of where and how people systematically depart from the rational-agent assumption of neoclassical economics.

Here a skeptical reader will object at once: Kahneman and Tversky were psychologists too, and anchoring and framing were psychology studies published in Science — so isn't the very distinction "behavioral economics versus social psychology" a post-hoc convenience? Half right. The line this piece draws is not the department someone belongs to. The distinction that holds is the research program. The prospect-theory lineage belongs to the judgment-and-decision-making (JDM) tradition, which deals with incentivized judgment and choice; that tradition was absorbed into economics and became the core of behavioral economics. Kahneman's Nobel citation itself was for "having integrated insights from psychological research into economic science." The other side is the social-priming tradition, in which a single subtle cue is said to shift attitudes and behavior. Both are broadly psychology, but their methods and traditions differ. And this piece applies that criterion evenly to both sides: it sorts what survived and what collapsed alike by the single yardstick of which tradition it came from.

The replication crisis was not a verdict on this map. When the Open Science Collaboration re-ran 100 psychology papers in 2015, results that had been significant in 97% of the originals were significant in only 36% of the replications. That episode was not an exposé that some particular theory was wrong; it was a measuring rod laid against the psychological and behavioral sciences as a whole. Preregistration, multi-lab replication, and registered reports rose to become the standard, and it emerged that using just four of the analytic liberties known as "researcher degrees of freedom" together drives the false-positive rate from a nominal 5% to 60.7%. In short, the replication crisis was the rod, and behavioral economics was not the target of that rod so much as one of the many fields scored by it. So the question narrows to one: what score did it get?
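How a single analytic liberty inflates the false-positive rate can be seen in a small simulation. This is a sketch under stated assumptions, not a reproduction of the 60.7% figure: it models only one liberty (testing once, then adding subjects and testing again), uses a z-test with known unit variance for simplicity, and all sample sizes and names are hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05  # nominal significance level


def p_value(a, b):
    """Two-sided z-test p-value for two equal-sized groups with unit variance."""
    n = len(a)
    z = (fmean(a) - fmean(b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(n_sims=4000, n_start=20, n_extra=10, seed=1):
    """Return (fixed_rate, flexible_rate) of false positives under a true null."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: any "effect" is false.
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        first_look = p_value(a, b) < ALPHA
        # The liberty: add subjects and test again. The extra data are always
        # drawn here, but they only matter when the first look missed.
        a += [rng.gauss(0, 1) for _ in range(n_extra)]
        b += [rng.gauss(0, 1) for _ in range(n_extra)]
        second_look = p_value(a, b) < ALPHA
        fixed_hits += first_look
        flexible_hits += first_look or second_look
    return fixed_hits / n_sims, flexible_hits / n_sims


if __name__ == "__main__":
    fixed, flexible = simulate()
    print(f"one planned test:      {fixed:.3f}")
    print(f"test, add data, retest: {flexible:.3f}")
```

The first rate should sit near the nominal 5%, and the second above it. Stacking further liberties (extra outcome measures, optional covariates, dropped conditions) compounds the inflation, which is the mechanism behind the much larger figure cited above.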

The Core Theories Passed the Audit

To give the score first: the core theories of behavioral economics did not dodge the audit — they passed it.

In 2014, "Many Labs 1" replicated 13 effects across 36 samples, 11 countries, and 6,344 participants, all preregistered. This was not a friendly confirmation; it was the same kind of adversarial multi-lab audit that would later topple the marquee names of social psychology. And the results split along lineage. The anchoring that Many Labs replicated was not the wheel-of-fortune demonstration described in the opening section but the four-item version from Jacowitz and Kahneman (1995). All four items replicated strongly, with some of the largest effect sizes in the project and a replication effect exceeding the original. The Asian-disease framing also replicated robustly (the replication effect size shrank to about half the original, but its significance was firm), and the sunk-cost effect, which was not even significant in the original paper, turned out decisive in the aggregate. Meanwhile, within the same project, two social-priming effects dissolved to effectively zero, with confidence intervals that included zero: flag priming (the claim that a glimpse of the American flag makes people more conservative) and money priming (the claim that cueing money makes people justify the system).

Effect	Lineage	Many Labs replication result
Anchoring (Jacowitz & Kahneman 1995 four-item version)	Behavioral economics	All four items robustly replicated, among the largest effect sizes in the project
Gain–loss framing (1981)	Behavioral economics	Robustly replicated (replication d=0.60, significant in 86% of samples)
Sunk cost (Thaler 1985 line)	Behavioral economics	Non-significant in the original, replicated in the aggregate
Flag priming (Carter et al. 2011)	Social psychology	Dissolved (replication d=0.03, CI includes zero)
Money priming (Caruso et al. 2013)	Social psychology	Dissolved (replication d=−0.02)

Table · Many Labs 1 (Klein et al. 2014): 13 effects preregistered and replicated across 36 samples, 11 countries, and 6,344 participants. In the same adversarial multi-lab audit, behavioral economics' judgment-and-choice effects passed while social-priming effects failed. The original effect sizes may be overstated by publication bias and are not trusted as a baseline. Primary source: Klein et al. (2014), Social Psychology 45(3):142–152.

Same audit, opposite results, and the dividing line ran along lineage. That is the lever of the argument. To be precise: the social priming that dissolved inside Many Labs 1 was flag and money priming, while the larger casualties to come (ego depletion, facial feedback, the power pose) died not in this project but each in its own separate multi-lab replication. The targets differ, but the kind of audit is the same: under preregistered multi-lab replication, behavioral economics' judgment-and-choice effects passed and the social-priming line failed. Behavioral economics did not slip past adversarial replication; it passed it head-on. Not all 13 effects in Many Labs 1 split this cleanly; some, like imagined contact, replicated only marginally, at the border. The point is not the overall tally but that, in the core cases where lineage is clear, the line ran along lineage.

Why did anchoring and framing live while priming died? Counterintuitiveness alone doesn't explain it — anchoring and framing are surprising enough effects in their own right. The difference lies in the structure of design and incentive. Judgment-and-choice tasks put an explicit, incentivized decision to subjects repeatedly, and the true effect itself is large — the large replication effect size of anchoring above is one piece of evidence. Social priming, by contrast, plants a single subtle cue among subjects and measures a shift in attitude; the manipulation is weak and the true effect is small. The lineage line is not a convenience drawn after the fact but comes from this structural difference in design and incentive.

Another axis of the core theories passed a different audit: field deployment and administrative data. One firm's switch to 401(k) auto-enrollment, analyzed as a natural experiment in a 2001 study, sent its participation rate from 37% to 86%, and roughly 41 million administrative records from Denmark showed that tax subsidies barely raise net saving while automatic contributions actually do increase it. Britain deployed this default to more than 10.7 million people over a decade; the opt-out rate had been forecast as high as 28% but in fact stayed at 8–10%. And a UK tax-authority field trial that put the social-norm line "nine out of ten people pay on time" into reminder letters raised the payment rate by up to 5.1 percentage points.

Effect	Type of audit	Scale · result
401(k) auto-enrollment	Natural experiment (Madrian & Shea 2001)	Participation 37%→86%
Savings default	Administrative data (Chetty et al. 2014)	~41 million records, only automatic contributions raise net saving
Organ-donation default	Lab-based online experiment (Johnson & Goldstein 2003)	Opt-in 42% vs opt-out 82%
UK auto-enrollment	Large-scale deployment (DWP 2012–)	10.7 million+ over a decade, opt-out 8–10%
Tax-notice social norm	Field RCT (HMRC 2017)	Payment rate +5.1pp

Table · The default line, which passed on deployment and administrative data. Unlike Many Labs, these passed a different audit — natural experiments, administrative data, and large-scale deployment rather than adversarial replication. Primary sources: Madrian & Shea 2001 / Chetty et al. 2014 / Johnson & Goldstein 2003 / UK DWP 2022 / Hallsworth et al. 2017.

One common summary should be rejected here. "The field survived and the lab died" is wrong. Anchoring and framing came out of the lab and survived, and the famous default experiment on organ-donation consent was a lab-based online experiment and survived too — 42% consented when the default was opt-in, 82% when it was opt-out. The predictor is not lab versus field but the combination of lineage (behavioral economics or social psychology) and traits (sample size; single-lab versus multi-lab; preregistration; degree of counterintuitiveness).

Passing Is Not the Same as Being Flawless

Passing is not the same as being flawless. Even within the core theories there are axes whose magnitude, universality, and reality have been recalibrated, and hiding that would make for a dishonest report card.

Loss aversion is the leading case. The value putting a loss at about 2.25 times a gain was not that large in many contexts. In 2018, Gal and Rucker corrected the record: loss aversion is not "fake" but "overgeneralized into a general principle." At low stakes gains actually operate more strongly, and status-quo bias has often been misattributed to loss aversion. The rebuttal is formidable too. Mrkva and colleagues' study of 17,720 people countered that while loss aversion diminishes as knowledge and experience accumulate, it is still observed at every level of knowledge — "the reports of its death are greatly exaggerated." Not a discarding, but a recalibration of magnitude and scope.

For the endowment effect, its very reality is in dispute. In Kahneman, Knetsch, and Thaler's 1990 mug experiment the selling price was about twice the buying price, but in 2005 Plott and Zeiler countered that the gap disappears once subjects are sufficiently trained in the procedure, so the gap may be an artifact of experimental procedure and subject misconception rather than of preference. This is not a verdict that the endowment effect "does not exist," but what was being measured is still contested. Mental accounting and present bias are similar. In a 2025 preregistered replication with 1,007 subjects, mental accounting largely replicated (11 of the 17 problems Thaler catalogued were supported), though some effects were weaker than in the original. For present bias the evidence is debated: it appears to replicate on real, effortful tasks but to weaken in pure monetary-reward experiments.

The nudge's effect size was revised downward as well. When DellaVigna and Linos aggregated 126 field trials, some 23 million people, from the two largest nudge units in the United States, the nudge's real effect was 1.4 percentage points — a sixth of the 8.7 points cited in the journals. About 70% of the gap between the two figures was due to selective publication in the journals. The nudge is not without effect; it is far smaller than advertised.
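The arithmetic behind "a sixth" and "about 70% of the gap" can be laid out explicitly. This is a minimal sketch using only the three figures quoted above; the variable names are hypothetical, and applying the 70% share to the simple percentage-point difference is a simplifying assumption made here for illustration.

```python
# Figures quoted in the text (DellaVigna & Linos).
journal_effect_pp = 8.7   # average effect cited in the journals
field_effect_pp = 1.4     # average effect measured in the nudge-unit trials
publication_share = 0.70  # share of the gap attributed to selective publication

ratio = field_effect_pp / journal_effect_pp        # field effect relative to journals
gap_pp = journal_effect_pp - field_effect_pp       # size of the gap
from_publication_pp = publication_share * gap_pp   # part of the gap from selective publication

print(f"field / journal ratio: {ratio:.2f}")                           # 0.16, about one sixth
print(f"gap: {gap_pp:.1f} pp")                                         # 7.3 pp
print(f"attributed to selective publication: {from_publication_pp:.1f} pp")  # 5.1 pp
```

On this reading, roughly 5 of the 7.3 missing percentage points reflect which results reached the journals rather than how the interventions performed.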

The question "so does the nudge work?" is itself a trap. In 2022, three teams working from almost the same data reached incompatible answers: Mertens and colleagues put the corrected effect at d=0.43; Maier and colleagues said that once corrected for publication bias it fell to d=0.04, leaving "no evidence that nudges are effective"; and Szaszi and colleagues' alternative correction also settled near zero. What the three analyses share is that they poured heterogeneous interventions into one bucket and averaged them. The question to ask is not "does the nudge work?" but "a nudge verified under which traits?" A meta-analysis stratified by trait and lineage has not yet appeared.

The Toppled Stars Were Mostly in the Next Seat Over

So who were "the behavioral-economics stars felled by the replication crisis"? The casualty list that filled the headlines is ego depletion, social priming, facial feedback, and the power pose. But by the line drawn in the opening section, these are not behavioral economics. Social priming (Bargh, 1996), facial feedback (Strack, 1988), the power pose (Carney, Cuddy, and Yap, 2010), and ego depletion all belong to the social-priming and embodied-cognition tradition, in which a subtle manipulation is said to sway attitudes and behavior. They are not of the prospect-theory–anchoring–framing lineage that deals with incentivized judgment and choice.

The manner of death was grim too. Ego depletion dissolved to an effect of zero in a replication that 23 labs preregistered and ran together (the widely reported original meta-analysis at around d=0.62 against a replication d=0.04), and facial feedback likewise vanished in a 17-lab replication. The power pose lost its hormonal and risk-taking effects in a replication that scaled the sample roughly fivefold (N=42 to N=200), and even the original paper's first author effectively retracted it in 2016, saying she does "not believe that 'power pose' effects are real." Priming not only failed to replicate; the replication also showed that the original result appeared to come from the experimenter's expectations rather than from the prime.

Effect	Original paper	Adversarial replication	Result
Ego depletion	Hagger 2010 meta, widely reported at around d=0.62	23-lab preregistered replication, N=2,141	Replication d=0.04, effectively zero
Facial feedback	Strack 1988, difference of 0.82 units	17-lab preregistered replication, N=1,894	Replication 0.03 units, effectively zero
Social priming	Bargh 1996, widely reported at around d≈1.08	Doyen 2012 replication failure	No effect; results were produced by experimenter expectation
Power pose	Carney, Cuddy & Yap 2010, N=42	Ranehill 2015, N=200	Original author effectively retracted in 2016

Table · The toppled effects: all four are of the social-psychology lineage, not behavioral economics. The original effect sizes (around d=0.62, d≈1.08) are overstated by publication bias and are not trusted as a baseline. Primary sources: Hagger et al. 2016 / Wagenmakers et al. 2016 / Doyen et al. 2012 / Ranehill et al. 2015 · Carney 2016.

So why was this collapse read as the collapse of behavioral economics? The key is the "surprise premium." Only a counterintuitive, clever finding makes it into the lecture, the bestseller, the cover paper. But "counterintuitive" means a low prior probability, and the lower a finding's prior, the more likely a significant result is a false positive and the more readily it crumbles in replication. The very trait that conferred fame foretold the failure to replicate. And because the most surprising findings happened to be on the social-psychology side, the surprise premium did two things at once: it selected them as the ones with the fragile trait, and it drew them into the single brand "behavioral." So in the boom, behavioral economics was co-credited with social psychology's flashy hits, and in the bust, behavioral economics took the fall for social psychology's replication failures. The misattribution ran in both directions. This two-way diagnosis, too, can be falsified: if the popular narrative of the boom years (bestsellers, the press) turns out to have classed the power pose and priming clearly as "social psychology" from the start and never mixed them with behavioral economics, the misattribution diagnosis is weakened. What Kahneman warned of in 2012 ("a train wreck looming") and admitted to in 2017 (that he had "placed too much faith in underpowered studies") was precisely this priming line.
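The link between a low prior and fragile replication is ordinary Bayesian arithmetic, and a short sketch makes it concrete. The priors and the power value below are illustrative assumptions, not estimates from any study cited here, and the function name is hypothetical.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a statistically significant finding reflects a true effect.

    prior: probability the hypothesis is true before the study is run
    power: probability the study detects the effect if it is real
    alpha: probability of a significant result when there is no effect
    """
    if not (0.0 <= prior <= 1.0 and 0.0 <= power <= 1.0 and 0.0 < alpha <= 1.0):
        raise ValueError("prior and power must lie in [0, 1], alpha in (0, 1]")
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Assumed, illustrative priors: a plausible claim vs. a surprising one.
    for label, prior in [("plausible claim", 0.50), ("surprising claim", 0.05)]:
        ppv = positive_predictive_value(prior=prior, power=0.80)
        print(f"{label}: prior={prior:.2f}, chance a significant result is real={ppv:.2f}")
```

With the same study quality, the surprising claim's significant result is far less likely to be real. Lower power, which is typical of small single-lab studies, pushes the figure down further.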

One point should be nailed down, though. To say only that everything toppled was in the next seat over would be self-serving. There is an exception. The nudge claiming that signing at the top of a form makes people more honest was a genuine application of behavioral economics, and it died twice. First, the original authors themselves reported in 2020 a replication with 5,794 people that found "no effect of signing first." Second, the 2012 field experiment (13,488 people) that had first reported the effect was found to rest on fabricated data, and the paper was retracted in 2021. Francesca Gino, a co-author of that paper, lost her Harvard tenure in 2025 over a separate data-fabrication finding. Gino herself, though, contests the finding and is fighting it, and who fabricated the signature experiment's data has not been publicly identified.

One axis must be kept separate here. Ego depletion, facial feedback, and the power pose were honest research that failed to replicate; the signature nudge was data that was forged. These are different events. "There was fraud, so it all collapsed" conflates them. What caught the fabrication, too, was not researchers' virtue but forensics that pried open the public data, and even the fabricated signature experiment was large-sample field data on 13,488 people. A method can be robust without guaranteeing integrity.

The Policy Risk Is Not Its Size but Its Location

So is "evidence-based policy" standing on sand? First narrow the scope of the fear. The serious policy bodies were already standing on the trait line, and did so before the crisis arrived. The UK's Behavioural Insights Team (BIT), the world's first, was founded in 2010, before the replication crisis, and in 2012 it made randomized controlled trials the policy standard with 'Test, Learn, Adapt.' The US Office of Evaluation Sciences (OES) was founded in 2015, the same year as the Open Science replication project, and has accumulated more than 120 field evaluations. What these bodies took as their standard in the thick of the crisis was the field trial. So the policy risk lies not in the diffuse fear that "the whole foundation is sand" but in an identifiable class deployed on the far side of that line: the cheap, clever, media-friendly nudge. This does not mean that class is quantitatively small in policy. Its share has never been measured; even the signature nudge retracted for fabrication had been adopted by several government agencies; and behavioral-insight units have spread to the governments of dozens of countries, according to the OECD OPSI mapping. The point is not the size of the risk but its location.

What the replication crisis purified was the academy, not the field of deployment. The effects of the surviving line (friction, defaults, and framing) have a second validator: the industrial A/B test run by firms. Large platforms each run more than 10,000 online controlled experiments a year. The 'dark patterns' catalogue the US Federal Trade Commission compiled in 2022 held 32 variants across 8 categories, and among them categories like obstruction, concealment, and forced action overlap substantially with friction, defaults, and framing (some, like urgency and scarcity, do not overlap). What Thaler in 2018 named 'sludge' (the malign mirror image of the nudge, friction planted deliberately for self-interest) is this line. That industrial A/B testing re-validates the academy's surviving effects is, however, an inference drawn from the catalogue and from cases rather than a direct measurement, so it is not used here as evidence for the robustness of defaults. What is certain here is not robustness but an asymmetry of accountability. Public nudge units publish their results and move cautiously, while private platforms deploy the same line of effects quietly and at scale. While the replication crisis was tidying up the academy, the surviving choice architecture scaled first on the side of private extraction. This structure repeats at greater scale in "Who Chose That Hour Last Night?" and "What I'll Want, and Who I Handed It To."

What Stays and What Retreats

The wave of policy re-validation is only beginning. Across 2026–2029, as budget-pressed governments demand evidence again and the institutional reckoning of the fabrication cases overlaps, the direction will split like this. Deployment-type interventions verified by repeated deployment, natural experiments, and administrative data — defaults, auto-enrollment, the social-norm framing of tax notices — will be absorbed into regulation and default settings and expanded. Anchoring and framing, which passed adversarial replication, will remain less as regulatory clauses in themselves than as material drawn on in such deployment-type designs. Meanwhile the lab-born, single-lab, counterintuitive "clever" nudge — interventions mostly rooted in social psychology — will quietly retreat from the policy toolkit.

This forecast must be capable of being shown false. The main criteria are two measurable signals. One: in the specific lists absorbed into regulation and guidance, do deployment-type nudges like auto-enrollment expand, while the fragile nudges governments had adopted (signature-honesty or salience-manipulation types) are deleted or flagged with warnings? Two: does the nudge's prevailing effect size move away from the journal-cited figure (8.7 percentage points) toward the field-measured figure (1.4 percentage points)? If these two signals go the other way, the thesis is wrong. As a secondary criterion, a 2029 deadline is set too: if priming-type nudges are absorbed into major regulation by then and are not retracted or flagged, the thesis is flawed. This condition is weighted lightly, though. Priming types are rarely deployed in policy to begin with, so it is "almost always confirmed," a quasi-tautology of the same kind as a condition that asks whether defaults will fail to replicate. The same yardstick is applied evenly to both sides.

Here is a decoder you can use right away. When you meet any behavioral-science finding, nudge, or self-help tip, look first at its lineage: is it rooted, like prospect theory, anchoring, and framing, in systematic departures of judgment and choice, or does it belong to the social-psychology tradition of subtle cues, like priming, posture, and expression? Then look at the four traits: is the sample large; is it single-lab or multi-lab; was it preregistered; and how counterintuitive is it (the more surprising, the more suspect)? Finally, check the record: what audit did it pass? The adversarial multi-lab replication that toppled social priming, or a non-adversarial check like deployment or a natural experiment?
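The decoder can also be written down as a checklist. The sketch below is one hypothetical encoding of the three steps (lineage, four traits, audit record); the class and function names are invented for illustration, the flags are unweighted reminders rather than a validated score, and the trait values given to the two examples are this sketch's reading of the cases discussed above.

```python
from dataclasses import dataclass
from enum import Enum


class Lineage(Enum):
    JUDGMENT_AND_CHOICE = "judgment and choice (prospect theory, anchoring, framing)"
    SOCIAL_PRIMING = "social priming (priming, posture, expression)"


class Audit(Enum):
    ADVERSARIAL_MULTILAB = "adversarial multi-lab replication"
    DEPLOYMENT = "deployment, natural experiment, or administrative data"
    NONE = "no audit on record"


@dataclass(frozen=True)
class Finding:
    name: str
    lineage: Lineage
    large_sample: bool
    multi_lab: bool
    preregistered: bool
    highly_counterintuitive: bool
    audit: Audit


def caution_flags(finding: Finding) -> list[str]:
    """Collect the reasons for caution, in the order the decoder checks them."""
    flags = []
    if finding.lineage is Lineage.SOCIAL_PRIMING:
        flags.append("lineage: social priming")
    if not finding.large_sample:
        flags.append("trait: small sample")
    if not finding.multi_lab:
        flags.append("trait: single lab")
    if not finding.preregistered:
        flags.append("trait: not preregistered")
    if finding.highly_counterintuitive:
        flags.append("trait: highly counterintuitive")
    if finding.audit is Audit.NONE:
        flags.append("record: no audit passed")
    return flags


# Two examples; the boolean trait values are assumptions for illustration.
power_pose_2010 = Finding("power pose (original study)", Lineage.SOCIAL_PRIMING,
                          large_sample=False, multi_lab=False, preregistered=False,
                          highly_counterintuitive=True, audit=Audit.NONE)
anchoring_many_labs = Finding("anchoring (Many Labs 1)", Lineage.JUDGMENT_AND_CHOICE,
                              large_sample=True, multi_lab=True, preregistered=True,
                              highly_counterintuitive=True,
                              audit=Audit.ADVERSARIAL_MULTILAB)

if __name__ == "__main__":
    for f in (power_pose_2010, anchoring_many_labs):
        print(f.name, "| audit:", f.audit.value, "| flags:", caution_flags(f))
```

The list is deliberately left uncounted and unweighted: the text gives an order in which to look, not a formula for combining what you find.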

The replication crisis rendered no verdict that "humans were rational after all," nor that "behavioral economics was a fraud." Behavioral economics did not collapse; it was sorted and recalibrated by an audit. Anchoring and framing passed the very audit that, within the same project, killed social priming; the magnitudes of loss aversion and the nudge were trimmed; and the fallen marquee names were mostly the social psychology in the next seat over. But what that audit tidied up was the academy. The accountability for policy built on effects that did not replicate or were inflated scatters along the chain of delegation that runs from researcher to meta-analysis, from nudge unit to regulator, and when the evidence is quietly retracted a few years later, the cost stays with the citizen already enrolled by default. Harder than sorting out what collapsed is the question of who watches the choice architecture that, in the meantime, scaled first on the side of private extraction.

Sources

The main primary sources are listed in the order they appear in the text. Primary authority (original papers, replication projects, institutional reports) is listed first, and because the nudge meta-analysis dispute (Mertens ↔ Maier ↔ Szaszi) and the endowment-effect reality dispute (KKT ↔ Plott & Zeiler) are unresolved, both sides are listed.

Behavioral economics canon (definitions and founding literature)

Kahneman & Tversky — prospect theory (reference point · value function · loss aversion · decision weights), Econometrica 47(2):263–291 (1979): https://courses.washington.edu/pbafhall/514/514%20Readings/ProspectTheory.pdf
Tversky & Kahneman — cumulative prospect theory (the primary source of the loss-aversion coefficient λ≈2.25, not a value from the 1979 original), Journal of Risk and Uncertainty 5:297–323 (1992): https://link.springer.com/article/10.1007/BF00122574
Tversky & Kahneman — framing / preference reversal (the Asian-disease problem, survival 72% vs mortality 22%), Science 211(4481):453–458 (1981): https://sites.stat.columbia.edu/gelman/surveys.course/TverskyKahneman1981.pdf
Tversky & Kahneman — judgment heuristics and biases (anchoring · wheel of fortune 10/65→25%/45%), Science 185(4157):1124–1131 (1974): https://www.cs.tufts.edu/comp/150AIH/pdf/TverskyKa74.pdf
Richard H. Thaler — mental accounting (formalized 1985 + synthesis 1999), Marketing Science 4(3):199–214 (1985): https://econpapers.repec.org/RePEc:inm:ormksc:v:4:y:1985:i:3:p:199-214 · synthesis: Journal of Behavioral Decision Making 12(3):183–206 (1999)
Kahneman, Knetsch & Thaler — the canonical endowment-effect experiment (the mug, WTA≈2×WTP · violation of the Coase theorem), Journal of Political Economy 98(6):1325–1348 (1990): https://ideas.repec.org/a/ucp/jpolec/v98y1990i6p1325-48.html
David Laibson — present bias · quasi-hyperbolic (β-δ) discounting · self-binding, Quarterly Journal of Economics 112(2):443–478 (1997): https://academic.oup.com/qje/article-abstract/112/2/443/1870925
Thaler & Sunstein — Nudge (the original definition of nudge and choice architecture, p.6), Yale University Press (2008); term-coining paper: 'Libertarian Paternalism', American Economic Review 93(2):175–179 (2003)

Scientific methodology (replication · open science)

Open Science Collaboration — high-powered replication of 100 psychology papers (original significant 97% → replication significant 36%), Science 349(6251):aac4716 (2015): https://www.science.org/doi/10.1126/science.aac4716
Simmons, Nelson & Simonsohn — "False-Positive Psychology" (false-positive rate 60.7% when four researcher degrees of freedom are used together), Psychological Science 22(11):1359–1366 (2011): https://journals.sagepub.com/doi/10.1177/0956797611417632
Chambers — first journal introduction of Registered Reports (preregistration · publication guaranteed regardless of results), Cortex 49(3):609–610 (2013): https://pubmed.ncbi.nlm.nih.gov/23347556/
Center for Open Science — Registered Reports (now adopted by 300+ journals): https://www.cos.io/initiatives/registered-reports
Daniel Kahneman — open letter "I see a train wreck looming" (published in Nature News) (2012-09-26): https://www.nature.com/news/polopoly_fs/7.6716.1349271308!/suppinfoFile/Kahneman%20Letter.pdf
Daniel Kahneman — self-correcting comment ("I placed too much faith in underpowered studies"), under his own name on the Replicability-Index blog (2017-02-14): https://replicationindex.com/2017/02/02/reconstruction-of-a-train-wreck-how-priming-research-went-of-the-rails/comment-page-1/#comment-1454

The core theories passing the audit (adversarial multi-lab replication · Many Labs)

Klein et al. — Many Labs 1 (13 effects · 36 samples · 11 countries · 6,344 participants, preregistered replication; anchoring, framing, and sunk cost robustly replicated, flag and money priming dissolved), Social Psychology 45(3):142–152 (2014): https://econtent.hogrefe.com/doi/10.1027/1864-9335/a000178
Jacowitz & Kahneman — the original of the four-item anchoring version that Many Labs replicated, Personality and Social Psychology Bulletin 21(11):1161–1166 (1995)

Recalibration disputes over the core theories (not a discarding)

Gal & Rucker — loss aversion is not 'absence' but 'a scaling-back of overgeneralization', Journal of Consumer Psychology 28(3):497–516 (2018): https://doi.org/10.1002/jcpy.1047
Mrkva, Johnson, Gächter & Herrmann — rebuttal on the robustness of loss aversion (5 samples, 17,720 people · does not vanish), Journal of Consumer Psychology 30(3):407–428 (2020): https://myscp.onlinelibrary.wiley.com/doi/abs/10.1002/jcpy.1156
Plott & Zeiler — the endowment-effect reality dispute (WTA–WTP gap disappears under controlled procedures · possible procedural artifact), American Economic Review 95(3):530–545 (2005): https://www.aeaweb.org/articles?id=10.1257%2F0002828054201387
Li & Feldman — preregistered replication of mental accounting (RR, N=1,007 · 11 of Thaler's 17 problems supported), Royal Society Open Science 12(9):250979 (2025): https://pmc.ncbi.nlm.nih.gov/articles/PMC12445221/

The surviving effects (deployment · natural experiments · administrative data)

Madrian & Shea — natural experiment on 401(k) auto-enrollment (participation 37%→86%), Quarterly Journal of Economics 116(4):1149–1187 (2001; NBER w7682): https://www.nber.org/papers/w7682
Chetty, Friedman, Leth-Petersen, Nielsen & Olsen — Danish savings administrative data (~41 million observations · tax subsidies barely raise net saving · only automatic contributions have an effect), Quarterly Journal of Economics 129(3):1141–1219 (2014; NBER w18565): https://www.nber.org/papers/w18565
Johnson & Goldstein — organ-donation default experiment (lab online, opt-in 42% vs opt-out 82%), Science 302(5649):1338–1339 (2003): https://www.science.org/doi/10.1126/science.1091721 · author's distributed copy: https://www.dangoldstein.com/papers/DefaultsScience.pdf
UK DWP — "Ten years of Automatic Enrolment in Workplace Pensions" (10.7 million+ newly enrolled over a decade · opt-out rate 8–10%) (2022-10-26): https://www.gov.uk/government/statistics/ten-years-of-automatic-enrolment-in-workplace-pensions/ten-years-of-automatic-enrolment-in-workplace-pensions-statistics-and-analysis

The nudge meta-analysis dispute (unresolved · both sides listed)

Mertens, Herberz, Hahnel & Brosch — nudge meta-analysis (corrected d=0.43), PNAS 119(1):e2107346118 (2022): https://www.pnas.org/doi/10.1073/pnas.2107346118
Maier, Bartoš, Stanley, Shanks, Harris & Wagenmakers — nudge effect dissolves when corrected for publication bias (posterior mean d=0.04 · "no evidence remains"), PNAS 119(31):e2200300119 (2022): https://www.pnas.org/doi/10.1073/pnas.2200300119
Szaszi et al. — alternative corrections · critique of heterogeneity ("implausibly large" · near zero), PNAS 119(31):e2200732119 (2022): https://www.pnas.org/doi/10.1073/pnas.2200732119
DellaVigna & Linos — the 'nudge gap' (126 RCTs · ~23 million people · field 1.4pp vs journals 8.7pp · about 70% of the gap is selective publication), Econometrica 90(1):81–116 (2022): https://www.hks.harvard.edu/publications/rcts-scale-comprehensive-evidence-two-nudge-units
The toppled effects (adversarial multi-lab replication) — mostly of the social-psychology lineage
Hagger et al. — multi-lab preregistered replication of ego depletion (RRR, 23 labs · N=2,141 · replication d=0.04 vs d=0.62 in the earlier meta-analysis), Perspectives on Psychological Science 11(4):546–573 (2016): https://journals.sagepub.com/doi/10.1177/1745691616652873
Wagenmakers et al. — replication of facial feedback (RRR, 17 labs · N=1,894 · effect effectively zero), Perspectives on Psychological Science 11(6):917–928 (2016): https://journals.sagepub.com/doi/10.1177/1745691616674458
Bargh, Chen & Burrows — the original elderly-priming paper, Journal of Personality and Social Psychology 71(2):230–244 (1996): https://web.mit.edu/curhan/www/docs/Articles/15341_Readings/Social_Cognition/Bargh_et_al_1996_Automaticity_of_social_behavior.pdf
Doyen, Klein, Pichon & Cleeremans — priming replication failure · 'experimenter expectation' produced the results, PLOS ONE 7(1):e29081 (2012): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0029081
Carney, Cuddy & Yap — the original power-pose paper (N=42), Psychological Science 21(10):1363–1368 (2010): https://journals.sagepub.com/doi/10.1177/0956797610383437
Ranehill et al. — failed power-pose replication with a larger sample (N=200), Psychological Science 26(5):653–656 (2015): https://journals.sagepub.com/doi/abs/10.1177/0956797614553946
Dana R. Carney — disavowal statement 'My position on "Power Poses"', UC Berkeley Haas (2016): https://faculty.haas.berkeley.edu/dana_carney/pdf_my%20position%20on%20power%20poses.pdf
Fraud · data integrity (separate from the replication axis)
Shu, Mazar, Gino, Ariely & Bazerman — signature-position–honesty experiment (Study 3 field N=13,488), PNAS 109:15197–15200 (2012; retracted 2021): https://www.pnas.org/doi/10.1073/pnas.1209746109
Data Colada #98 — evidence of data fabrication in Study 3 (2021-08-17): https://datacolada.org/98
PNAS — retraction notice for Shu et al. 2012, PNAS 118(38) (2021-09): https://www.pnas.org/doi/10.1073/pnas.2115397118
Kristal, Whillans, Bazerman, Gino, Shu, Mazar & Ariely — the original authors' own failure to replicate (N=5,794, before the fabrication was detected), PNAS 117(13):7103–7107 (2020): https://www.pnas.org/doi/abs/10.1073/pnas.1911695117
Data Colada #109–#112 'Data Falsificada' — allegations of fabrication in papers co-authored by Francesca Gino (2023-06): https://datacolada.org/109
The Harvard Crimson — Harvard revokes Gino's tenure (2025-05-27): https://www.thecrimson.com/article/2025/5/27/gino-tenure-revoked/
Policy bodies · field measurements
Behavioural Insights Team — 'Our History' (the world's first government nudge unit, established within the Cabinet Office in 2010): https://www.bi.team/about-us/our-history/
Haynes, Service, Goldacre & Torgerson — 'Test, Learn, Adapt' (making randomized controlled trials the policy standard), Cabinet Office/BIT (2012): https://www.gov.uk/government/publications/test-learn-adapt-developing-public-policy-with-randomised-controlled-trials
Hallsworth, List, Metcalfe & Vlaev — HMRC social-norm tax-reminder RCT (best message raised the payment rate by +5.1pp), Journal of Public Economics 148:14–31 (2017): https://doi.org/10.1016/j.jpubeco.2017.02.003
Office of Evaluation Sciences (GSA) — the US federal government's evaluation unit (founded 2015 · 120+ field evaluations): https://oes.gsa.gov/
OECD Observatory of Public Sector Innovation (OPSI) — global mapping of behavioral-insight units in government (300+ institutions across 63 countries · 200+ units; the quantitative figures are indicative only): https://oecd-opsi.org/blog/mapping-behavioural-insights/
The evil twin (dark patterns · sludge)
US FTC — "Bringing Dark Patterns to Light" (catalogue of 32 variants across 8 categories) (2022-09): https://www.ftc.gov/reports/bringing-dark-patterns-light
Richard H. Thaler — "Nudge, not sludge" (coining the concept of 'sludge'), Science 361(6401):431 (2018-08-03): https://www.science.org/doi/10.1126/science.aau9241
Kohavi & Thomke — the scale of industrial A/B testing (large platforms each run 10,000+ online controlled experiments a year), Harvard Business Review (2017-09): https://hbr.org/2017/09/the-surprising-power-of-online-experiments
The institutional victory (the Nobel Prize)
Kahneman — 2002 Nobel Prize in Economics ('for having integrated insights from psychological research into economic science'): https://www.nobelprize.org/prizes/economic-sciences/2002/kahneman/facts/
Thaler — 2017 Nobel Prize in Economics ('for his contributions to behavioural economics'): https://www.nobelprize.org/prizes/economic-sciences/2017/thaler/facts/