AI Agents and the Accountability Gap: the Work Is Delegated, the Respondent Is Not
A passenger asked Air Canada's website chatbot about a bereavement fare. The chatbot said the discount could be claimed retroactively, after the ticket was bought; that was not the actual policy. The passenger relied on it, bought the ticket at full fare, and was later refused the refund. In the dispute, Air Canada argued that the chatbot was "a separate legal entity that is responsible for its own actions." In February 2024, the British Columbia Civil Resolution Tribunal called this "a remarkable submission," dismissed it, and ordered the airline to pay. That chatbot only spoke. The next generation does not stop at speaking.
An AI agent is an LLM-based system that chooses its own tools and works in a loop, adjusting to feedback from its environment. It does not stop at generating text: it sends emails, buys things, runs code, operates software. It acts rather than advises. The sibling piece on understanding asked, "what corrects this model?" Once the model no longer stops at drafting and starts acting, the question shifts one notch. When it gets something wrong, who answers?
Agents Act—and Get It Wrong Often
One thing first. It is true that agents are faster than people and, on some tasks, more consistent; people tire, forget, and look away, and machines do not. But accuracy is not the issue here. The issue survives conceding accuracy: who is responsible when it goes wrong. Low reliability only sets the timing of that question. The more often an agent errs, the more often the accident with no one to answer for it becomes real.
And agents get it wrong often. Start with reliability. On the agent benchmark τ-bench, even the best agent succeeds on all eight repeated attempts of the same task (pass^8) only about 25% of the time in the retail domain. That is nearly 60% lower, in relative terms, than its single-attempt rate, so the same instruction yields uneven results. Averaged across the benchmark's two domains, even the single-attempt rate stays under 50%. Longer tasks get worse: by METR's measurements, the "time horizon" of tasks an agent finishes with 50% reliability is still short, and success drops steeply as tasks run longer (though that horizon is climbing fast). On TheAgentCompany, a Carnegie Mellon benchmark of 175 real office tasks, the best model finished only about 30% of them fully autonomously, with no human intervention (per the September 2025 leaderboard; 24% in the original December 2024 paper).
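The pass^k figure can be made concrete. As described in the τ-bench paper, pass^k is the chance that all k independent trials of a task succeed; for a task with n trials and c successes it is estimated as C(c, k) / C(n, k), then averaged over tasks. The Python sketch below computes that estimate. The function name and the trial counts are hypothetical, chosen only for illustration, and are not benchmark data.

```python
from math import comb

def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k trials of one task ALL succeed.

    results: list of (n_trials, n_successes) pairs, one per task.
    Per task, comb(c, k) / comb(n, k) is the fraction of k-trial subsets
    made up only of successes; the estimate is the mean over tasks.
    """
    if not results:
        raise ValueError("need at least one task")
    total = 0.0
    for n, c in results:
        if not 0 <= c <= n or not 1 <= k <= n:
            raise ValueError("require 0 <= c <= n and 1 <= k <= n")
        total += comb(c, k) / comb(n, k)
    return total / len(results)

# Hypothetical data: five tasks, eight trials each (not benchmark results).
trials = [(8, 8), (8, 8), (8, 6), (8, 4), (8, 0)]

single_attempt = pass_hat_k(trials, 1)  # average per-trial success rate
all_eight = pass_hat_k(trials, 8)       # only tasks that never failed count
```

With these made-up counts the single-attempt rate is 0.65 while pass^8 is 0.40: a task that succeeds six times out of eight contributes nothing once all eight attempts must succeed.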
These are not the numbers for a production agent fenced in by guardrails on a narrow job. They are the numbers for handing an open-ended professional task wholesale to autonomy: given an open task, even the best model finishes about three in ten. And autonomy is exactly what the market is now selling. Gartner forecasts that by 2028, 33% of enterprise software will embed agentic AI (up from under 1% in 2024), and that at least 15% of day-to-day work decisions will be made autonomously through it. The move to hand real decisions to an unreliable actor has already begun.
| Measure | Result | Source · As-of |
|---|---|---|
| τ-bench, best agent retail pass^8 (repeat consistency) | ~25% (≈60% below single-attempt) | arXiv 2406.12045, 2024-06 |
| TheAgentCompany, best model fully autonomous completion | ~30% (175 real tasks; 24% in original paper) | CMU, 2025-09 |
| Enterprise software embedding agentic AI (Gartner forecast) | 33% by 2028 (from <1% in 2024) | Gartner, 2025-06 |
Table: Agent reliability and adoption. Sources—τ-bench (Sierra · Princeton), TheAgentCompany (CMU, arXiv 2412.14161), Gartner forecast. As-of 2024-06 to 2025-09.
Responsibility Does Not Vanish—It Relocates
Say there is an accident. The agent executes a wrong refund, or invents a policy and promises it to a customer. Responsibility sits among four candidates: the user who ran it, the company that deployed it, the provider that built the model, and the agent itself.
Cross off the last candidate first. An agent has no legal personhood. It cannot hold assets, enter contracts, or be sued; it is, in law, incapable of bearing responsibility. The philosopher Andreas Matthias named this the "responsibility gap" back in 2004: with self-learning autonomous machines, a structure arises in which neither manufacturer nor operator can fairly be held responsible for the machine's actions. The attempt to push responsibility onto the agent itself has reached a tribunal once, in Air Canada's submission above. It was a small-claims matter of about CAD $650, decided by an online civil-resolution tribunal whose rulings do not bind courts as precedent; but the principle is general. A company answers for the words and actions of the tool it puts out to face its customers. This is no different from the old doctrine that a principal answers for what it had its agent do.
Among the remaining three, responsibility then forks into two channels that run in opposite directions. One channel collects at the party that faced the customer—the deployer (where Air Canada stood). The provider exits this channel by contract. Anthropic's consumer terms, for instance, provide outputs and actions "as is," disclaim any warranty of accuracy, and state that the user must not rely on outputs without independently confirming them, nor use them as the sole basis for high-stakes decisions such as financial ones. It is standard industry boilerplate. (Enterprise contracts, though, are negotiated, and their indemnity clauses can run the other way, back upstream.) So in the default case of a consumer or a small deployer, responsibility flows down the contract to the deployer.
The other channel runs back upstream. The revised EU Product Liability Directive explicitly brings software and AI within "product," places liability for a defect on the manufacturer—the party that built the model—and even eases the victim's burden of proof (disclosure of evidence, presumptions of defectiveness and causation). It is a structure that hauls the responsibility a disclaimer pushed downstream back up to the provider. The two channels run in exactly opposite directions.
No rule yet settles the crossing. The EU AI Act does split obligations between "provider" and "deployer," but that is an allocation of regulatory duties—conformity, transparency—not an allocation of civil liability for who compensates the victim. The attempt to tailor that civil channel to AI, the EU AI Liability Directive, was withdrawn in 2025 for "no foreseeable agreement." And the Product Liability Directive does not apply until December 2026. The delegation is happening now; the bill on the upstream channel arrives late.
| Candidate respondent | Bears it? | Why |
|---|---|---|
| The agent itself | Cannot | No legal personhood—cannot be sued or pay damages |
| The model provider | Two directions | Consumer terms disclaim (push downstream) / product liability summons it back upstream |
| The user | Partly | Terms and case law impose a duty to confirm independently |
| The deploying company | Front line, customer-facing | The party that faced customers with the tool—the general doctrine of vicarious liability |
Table: The two channels of responsibility. Sources—Matthias (2004), Moffatt v. Air Canada (2024 BCCRT 149), revised EU Product Liability Directive (2024/2853), provider terms. As-of 2004 to 2026-03.
The Upside Is Delegated, the Downside Externalized
Why hand decisions to an unreliable actor at all? Here economics answers. An agent's value lies in taking the human out of the loop. The work runs without a person inspecting it step by step, so it is fast and cheap. Autonomy is the saving. The upside, productivity, goes to the company that deployed the tool.
The trouble is the downside. When an agent causes harm, the cost falls first on the counterparty to the bad transaction, or the customer who believed the misinformation. And the machinery to manage that cost is still thin. By one governance survey, fewer than half of organizations (48%) monitor their AI systems for accuracy, misuse, or drift. The figure covers AI systems in general, but applied to agents that act on their own, the implication is heavier: a fair number of autonomous systems run without anyone watching what went wrong, or when.
One clear signal that the cost is real is that a product pricing it has appeared. In April 2025, Armilla launched an AI liability insurance policy backed by an underwriter at Lloyd's. It covers losses, and the legal costs that follow, when an AI fails to perform as intended or produces critical errors, hallucinations, or inaccuracies. The risk has begun to be priced into premiums. Separately, Gartner forecasts that more than 40% of agentic AI projects will be canceled by the end of 2027, on rising costs, unclear value, and inadequate risk controls. That is a signal distinct from the liability question, but it is evidence that the economics of delegation are already not frictionless.
From here on is interpretation, not fact. The value of delegation and the accountability gap are not two events but two faces of a single act. The cost delegation cuts is human attention and labor—and the person who would have answered, by name, when something went wrong was sitting in that very seat. One person held both roles, so cutting the human to save the cost cuts the respondent along with them. Responsibility, then, cannot simply be added back for free after the fact. Seating a person back into the answerable role carries a floor cost that cannot be driven to zero—because the whole appeal of full autonomy was driving that very cost toward zero.
Where, Then, Does the Respondent Remain
This analysis does not end in the abstract; it returns to the desk of the professional and the firm adopting agents.
By elimination, the conclusion is plain. The provider exits by contract, the agent cannot answer in law, and so the body the law first looks to as the respondent for a customer-facing accident is the company that deployed the tool. Delegation feels like an act of offloading a burden, but the circuit of responsibility closes back toward the deployer. So "where to keep a person answerable" is not a problem an engineer solves with precision—it is a governance and legal decision. Just as in the adjacent piece on infrastructure, where the line between what runs where left the engineer's hands and became a decision of cost, legal, and product, the line of responsibility moves to the same place.
Concretely, it means putting a named person's sign-off in front of actions that are irreversible or high-consequence—moving money, making external commitments, deleting data—and keeping a trail that records what was done and when. It is a design choice about where to cut the scope of autonomy and where to stand a person.
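One way to picture that design choice is a gate placed in front of the agent's tool calls. The Python sketch below is illustrative only: the action kinds, the `ApprovalGate` class, the field names, and the sample values are all hypothetical. A real deployment would also need to authenticate the approver and keep the trail in durable, tamper-evident storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

# Assumption: which action kinds count as irreversible or high-consequence
# is a policy decision; this set is a placeholder.
HIGH_CONSEQUENCE = {"move_money", "external_commitment", "delete_data"}

@dataclass
class AuditRecord:
    at: str                  # UTC timestamp, ISO 8601
    kind: str
    detail: str
    approver: Optional[str]  # named person who signed off, if any
    allowed: bool            # whether the gate let the action proceed

@dataclass
class ApprovalGate:
    trail: list = field(default_factory=list)

    def run(self, kind: str, detail: str, action: Callable[[], object],
            approver: Optional[str] = None):
        """Run `action` only if policy allows it; log every attempt."""
        allowed = kind not in HIGH_CONSEQUENCE or bool(approver)
        # Recorded before the action runs, so refused attempts are logged too.
        self.trail.append(AuditRecord(
            at=datetime.now(timezone.utc).isoformat(),
            kind=kind, detail=detail, approver=approver, allowed=allowed))
        if not allowed:
            raise PermissionError(f"{kind!r} needs a named sign-off")
        return action()

# Usage with made-up actions: low-stakes work passes, money waits for a name.
gate = ApprovalGate()
gate.run("draft_reply", "reply to ticket 1", lambda: "drafted")
try:
    gate.run("move_money", "refund order A", lambda: "refunded")
except PermissionError:
    pass  # blocked: no named approver was given
gate.run("move_money", "refund order A", lambda: "refunded", approver="J. Doe")
```

The gate does not make the agent more accurate. It only fixes, ahead of time, which actions proceed on the agent's own authority and which carry a person's name, and it leaves a record of both.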
In the sibling piece on understanding, the reason a person had to stay in the loop was correction: where no ground-truth signal returns, someone had to check whether the output was right. With an acting agent, the reason a person must stay is one layer heavier. It is to answer. Even a perfectly accurate agent needs a respondent, because responsibility is not a question of who was right but of who stands, by name, before the harm when it goes wrong.
So it closes on the signature question. Who bears it, and when? Responsibility has not vanished. Erased once at the agent, slipping past the provider on the contract, it lands at the nearest seat the law can find—usually the deploying company. Where even that cannot be reached (a small operator, across a border, harm diffused), it stays with the victim, with no one to answer. The timing is out of joint too. Delegation spreads now; the bill is carried in later, by lawsuits and regulation—delayed further where the tailored rule was withdrawn.
In the piece on lethal delegation, when the final decision to kill was handed to a machine, the name that should have stood before the death disappeared. In handing everyday work to an agent, the same shape returns. Only the scale differs; the structure is one. The act of handing over erases the one who would answer. An agent stands in for the work. It does not stand in for the seat where someone answers, by name, when the work goes wrong. Who keeps that seat is not a question of technology but of how responsibility is designed.
| # | Outlet (via) | Primary source | Link | As-of |
|---|---|---|---|---|
| 1 | Anthropic Research | Anthropic, "Building Effective Agents" (agent definition) | https://www.anthropic.com/research/building-effective-agents | 2024-12 |
| 2 | arXiv | Sierra · Princeton, τ-bench (2406.12045) | https://arxiv.org/abs/2406.12045 | 2024-06 |
| 3 | METR | METR, "Measuring AI Ability to Complete Long Tasks" | https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ | 2025-03 |
| 4 | arXiv | CMU et al., TheAgentCompany (2412.14161) | https://arxiv.org/abs/2412.14161 | 2025-09 |
| 5 | MES Computing (via) | Gartner (agentic AI forecast) | https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027 | 2025-06 |
| 6 | CanLII / McCarthy Tétrault | BC Civil Resolution Tribunal, Moffatt v. Air Canada (2024 BCCRT 149) | https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/2024bccrt149.html | 2024-02 |
| 7 | Springer (Ethics and Information Technology) | Andreas Matthias, "The responsibility gap" | https://doi.org/10.1007/s10676-004-3422-1 | 2004 |
| 8 | EUR-Lex | EU, AI Act (Regulation (EU) 2024/1689) | https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng | 2024-06 |
| 9 | EUR-Lex | EU, revised Product Liability Directive (Directive (EU) 2024/2853) | https://eur-lex.europa.eu/eli/dir/2024/2853/oj/eng | 2024-10 |
| 10 | EP Legislative Train / Bird & Bird (via) | European Commission, AI Liability Directive (AILD) withdrawal | https://www.europarl.europa.eu/legislative-train/theme-a-europe-fit-for-the-digital-age/file-ai-liability-directive | 2025-10 |
| 11 | WCR.Legal | (legal commentary) AI has no legal personhood | https://wcr.legal/ai-liability-false-statements/ | 2026-03 |
| 12 | Anthropic | Anthropic, Consumer Terms of Service | https://www.anthropic.com/legal/consumer-terms | 2025-10 |
| 13 | PR Newswire / Tech Monitor (via) | Armilla (underwritten at Lloyd's, Chaucer), AI liability insurance | https://www.prnewswire.com/news-releases/armilla-launches-affirmative-ai-liability-insurance-with-lloyds-underwriter-chaucer-302442586.html | 2025-04 |
| 14 | IoT For All (via) | Pacific AI / Gradient Flow, 2025 AI Governance Survey | https://www.iotforall.com/news/2025-ai-governance-survey-reveals-critical-gaps-between-ai-ambition-and-operational-readiness | 2025-06 |
Analyzed and verified multi-dimensionally with AI; reviewed by the author.