On-Device AI: Cost and Jurisdiction, Not Chips, Draw the Line

Classifying a single photo no longer requires a round trip to a cloud server. It finishes on the processor built into your phone or laptop. That is on-device AI. The data never leaves the device, so privacy holds; it works without a network; and the round-trip latency disappears. The usual reading stops there, with a story about chips and compression. But the line that decides what gets pushed down to the device and what stays on the server is not drawn by silicon alone.

Two Engineering Axes: NPUs and Quantization

Running a model on a device means solving two problems: fitting it inside limited memory and power (compression), and running it fast enough (acceleration).

Acceleration falls to the NPU—a chip specialized for neural-network operations like matrix multiplication. By one industry estimate, NPUs ship in more than 80% of Qualcomm's recent SoCs. The accelerators refresh every generation. On the Hexagon NPU in the Snapdragon 8 Elite Gen 5, unveiled on 24 September 2025, INT8 object detection finishes in roughly 12–15 ms on reference-device benchmarks.

Compression falls to quantization, which trims weight precision to cut memory and computation. The recipe for putting an LLM on a device has converged on a single path: train in 16-bit, then quantize to 4-bit for deployment. GPTQ in 2022 and AWQ in 2023 cut memory at 4-bit to roughly a quarter of the 16-bit footprint while preserving most of the quality, and INT8 loses almost nothing against FP32 across most production workloads. Some have pushed further. BitNet b1.58, released by Microsoft in April 2025, is a 1.58-bit model that restricts each weight to just three values: -1, 0, and +1. At the two-billion-parameter scale, its non-embedding memory comes to 0.4 GB, against 1.4 GB for the smaller Gemma-3 1B.
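To make the precision trade concrete, here is a minimal sketch of plain round-to-nearest quantization with one shared scale per tensor, plus an absmean-style ternary rounding in the spirit of BitNet b1.58. It is an illustration only: GPTQ and AWQ use more careful, error-aware schemes, BitNet is trained with ternary weights rather than rounded afterward, and the function names and sample weights below are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Round floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_ternary(weights):
    """Absmean-style rounding: each weight becomes -1, 0, or +1 times one scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def max_abs_error(weights, codes, scale):
    """Largest gap between an original weight and its dequantized value."""
    return max(abs(w - c * scale) for w, c in zip(weights, codes))


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.0, -0.12, 0.58]

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    err = max_abs_error(weights, codes, scale)
    print(f"INT{bits}: codes={codes} max_error={err:.4f}")

t_codes, t_scale = quantize_ternary(weights)
print(f"ternary: codes={t_codes} scale={t_scale:.3f}")
```

On this sample the reconstruction error grows as the bit width shrinks, which is the accuracy price the low-bit methods above are designed to contain.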

The Models Running on Devices Now

On these two axes, billion-parameter models already sit inside real shipping hardware. Sparse architectures activate only a fraction of their parameters for any one request.

Model	Parameters	Memory / Notes	Source · As-of
Apple AFM 3 Core	3B (dense)	On-device default	Apple official, 2026-06
Apple AFM 3 Core Advanced	20B (sparse)	Only 1–4B active per request	Apple official, 2026-06
Google Gemini Nano	1.8B / 3.25B	~1 GB at 4-bit (per secondary reporting)	Secondary reporting, 2026-06
Microsoft BitNet b1.58	2B	1.58-bit, 0.4 GB non-embedding	arXiv, 2025-04

Table: Scale of on-device commercial and open-weight models. Sources—Apple Machine Learning Research (AFM3), arXiv 2504.12285 (BitNet), secondary reporting (Gemini Nano). As-of 2025-04 to 2026-06.
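The memory figures in the table can be sanity-checked with back-of-the-envelope arithmetic: weights-only footprint is roughly parameters times bits per weight, divided by eight. The sketch below is a rough estimate under stated assumptions. It ignores embeddings, quantization scales, activations, and the KV cache; the generic 3B rows are hypothetical, since the table does not give Apple's deployment precision.

```python
def weight_memory_gb(params, bits_per_weight):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


rows = [
    ("Gemini Nano 1.8B at 4-bit", 1.8e9, 4),
    ("Gemini Nano 3.25B at 4-bit", 3.25e9, 4),
    ("BitNet b1.58 2B at 1.58-bit", 2.0e9, 1.58),
    ("generic 3B dense at 16-bit (hypothetical)", 3.0e9, 16),
    ("generic 3B dense at 4-bit (hypothetical)", 3.0e9, 4),
]
for label, params, bits in rows:
    print(f"{label}: {weight_memory_gb(params, bits):.2f} GB")

# Share of a 20B sparse model that is active when 1B to 4B parameters fire.
total_params = 20e9
for active in (1e9, 4e9):
    print(f"{active / 1e9:.0f}B active of 20B: {active / total_params:.0%}")
```

Under these assumptions the 1.8B variant lands near the reported ~1 GB and the BitNet estimate lands near the 0.4 GB non-embedding figure. The active share of a sparse model describes compute per request; whether all weights must stay resident in memory is not something the table states.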

These are a different weight class from the large models in the cloud. Which leaves a question. Being able to run a small-enough model on a device is one thing; having to push inference down to the device is another. Why push it down now?

Cost: Whose Silicon Absorbs the Inference?

Cost answers first. Cloud inference prices have fallen fast. a16z calls the trend "LLMflation": GPT-4-class inference dropped from roughly $20 per million tokens in late 2022 to about $0.40, cheapening by some 10x a year at constant performance. The decline is not uniform. By Epoch AI's measurements, the rate ranges from 9x to 900x a year depending on task difficulty.
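A yearly decline factor is a geometric rate, so it depends on how much time separates the two price points. The sketch below annualizes the $20 and $0.40 endpoints quoted above; the elapsed time is an assumption, since the text gives only "late 2022" for the starting point, and the function name is hypothetical.

```python
def annualized_factor(start_price, end_price, years):
    """Constant yearly factor that turns start_price into end_price over `years`."""
    if start_price <= 0 or end_price <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return (start_price / end_price) ** (1 / years)


start, end = 20.0, 0.40          # dollars per million tokens, from the text
print(f"total decline: {start / end:.0f}x")

# Elapsed time is assumed; try a few plausible spans.
for years in (1.5, 2.0, 3.0):
    factor = annualized_factor(start, end, years)
    print(f"over {years} years: about {factor:.1f}x cheaper per year")
```

The shorter the assumed span, the steeper the implied yearly rate, which is one reason headline figures such as "10x a year" are best read as an order-of-magnitude trend rather than an exact fit to two data points.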

If unit prices have fallen this far, why push inference to the device at all? Scale. One industry analysis estimates that a single AI inference query costs an order of magnitude more than a traditional search, eroding the cloud provider's per-query margin. Cheaper unit prices do not erase the inference line on the provider's books when call volume explodes.

Push inference to the device and that cost leaves the provider's ledger and lands on the chip the user has already paid for. Inference shifts from the provider's operating expense to capital expense on user-owned silicon. This is the logic behind Apple's hybrid of on-device inference and its own Private Cloud Compute: avoid the data-center capital outlay, where the supply of AI accelerators is itself capped by packaging capacity, and keep AI capex more conservative than rivals'.
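The interplay of falling unit prices, rising volume, and on-device offload is simple multiplication. The sketch below uses entirely hypothetical numbers (query counts, tokens per query, prices, and the on-device share are all assumptions chosen for illustration, not figures from any provider) to show how a bill can grow while the unit price falls.

```python
def provider_inference_bill(queries, tokens_per_query,
                            price_per_million_tokens, on_device_share=0.0):
    """Provider-side cost in dollars for the queries that still reach the cloud."""
    if not 0.0 <= on_device_share <= 1.0:
        raise ValueError("on_device_share must be between 0 and 1")
    cloud_queries = queries * (1.0 - on_device_share)
    return cloud_queries * tokens_per_query * price_per_million_tokens / 1e6


# Hypothetical scenario: price falls 10x while call volume grows 30x.
before = provider_inference_bill(1e9, 500, 4.00)
after = provider_inference_bill(30e9, 500, 0.40)
offloaded = provider_inference_bill(30e9, 500, 0.40, on_device_share=0.7)

print(f"before:              ${before:,.0f}")
print(f"after price drop:    ${after:,.0f}")
print(f"with 70% on device:  ${offloaded:,.0f}")
```

In this made-up scenario the bill rises despite the cheaper unit price, and only the offloaded case brings it back below the starting point; the cost has not vanished, it has moved to silicon the user already owns.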

Jurisdiction: Under Whose Law Does the Data Sit?

Alongside cost, regulation redraws the line—because on-device AI's primary benefit, "the data never leaves the device," becomes a compliance asset.

The EU AI Act applies its high-risk-system obligations from 2 August 2026. Fines for prohibited practices, the Act's top penalty tier, reach up to €35 million or 7% of global annual turnover, whichever is higher. At the same time, the physical location of data is no guarantee. The U.S. CLOUD Act exposes data stored in the EU to American jurisdiction if the provider is U.S.-headquartered. Under 2026 compliance readings, using "a U.S. hyperscaler's EU region" does not by itself satisfy data residency. By one industry report, roughly 20% of European companies have begun repatriating core data to in-region facilities.

When data never leaves the device, much of this problem never arises in the first place. And even when the cloud is unavoidable, the line gets redrawn. Apple Private Cloud Compute uses data only to process a request and stores nothing once the request ends; built on Apple Silicon and the Secure Enclave, it is designed so that not even Apple can reach it. It lifts the on-device principle of non-retention up into the cloud.

Who Draws the Line

The limits are clear. Memory and power constraints keep devices to mostly small models, and low-bit quantization trades away some accuracy for the weight it sheds. At INT8 the loss is small, but the price climbs the further you trim the bits.

So what runs where is not settled by the chip's capability alone. Three forces draw the line together: how fast it has to be (latency), who absorbs the inference cost (cost incidence), and under whose law the data sits (jurisdiction). Work that is light and instant, or sensitive, or that must run offline goes to the device; heavy work goes to the server. On-device AI is not a technology that replaces the cloud—it is one that redraws the line between the two. And the decision about where that line falls has left the engineer's hands. Cost, legal, and product now make it together.
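The three forces can be read as a routing policy. The sketch below is one hypothetical way to encode the rule of thumb above; the `Request` fields, the 80 ms round-trip default, and the reason labels are all assumptions for illustration, not a description of any vendor's actual router.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    DEVICE = "device"
    SERVER = "server"


@dataclass(frozen=True)
class Request:
    latency_budget_ms: float   # how fast the answer must arrive
    sensitive: bool            # data that should not cross a jurisdiction boundary
    offline: bool              # no network is available
    fits_on_device: bool       # a small-enough model exists for this task


def route(req, network_round_trip_ms=80.0):
    """Return (target, reasons) for one request under this hypothetical policy."""
    if not req.fits_on_device:
        if req.offline:
            raise ValueError("heavy request cannot be served without a network")
        return Target.SERVER, ["heavy work"]

    reasons = ["cost incidence"]            # inference runs on user-owned silicon
    if req.latency_budget_ms < network_round_trip_ms:
        reasons.append("latency")
    if req.sensitive:
        reasons.append("jurisdiction")
    if req.offline:
        reasons.append("offline")
    return Target.DEVICE, reasons


# Hypothetical requests: a private photo classification and a long report draft.
photo = Request(latency_budget_ms=30, sensitive=True, offline=False, fits_on_device=True)
report = Request(latency_budget_ms=5000, sensitive=False, offline=False, fits_on_device=False)

for name, req in (("photo", photo), ("report", report)):
    target, reasons = route(req)
    print(f"{name}: {target.value} ({', '.join(reasons)})")
```

Only one input to this function, whether the model fits, is a pure engineering question. The latency budget comes from product, the cost rationale from finance, and the sensitivity flag from legal.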

Sources

#	Outlet (via)	Primary source	Link	As-of
1	Apple Machine Learning Research	Apple (third-generation foundation models)	https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models	2026-06
2	Apple Security Research	Apple (Private Cloud Compute)	https://security.apple.com/blog/private-cloud-compute/	2024-06
3	arXiv	Microsoft (BitNet b1.58 2B4T)	https://arxiv.org/abs/2504.12285	2025-04
4	a16z	Guido Appenzeller, "LLMflation"	https://a16z.com/llmflation-llm-inference-cost/	2024-11
5	Epoch AI	Epoch AI (inference price trends)	https://epoch.ai/data-insights/llm-inference-price-trends	2025
6	European Commission	EU AI Act (regulatory framework)	https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai	2026-06
7	Lyceum Technology	US CLOUD Act / EU data residency	https://lyceum.technology/magazine/eu-data-residency-ai-infrastructure/	2026-06
8	Fortune / Kavout	Apple AI capex & strategy (reporting)	https://fortune.com/2026/02/17/why-apple-isnt-spending-big-on-ai-capex-commodity-integration-strategy/	2026-02
9	Aleph Zero Labs / Google for Developers	NPU, quantization, on-device benchmarks	https://www.alephzerolabs.com/blog/on-device-ai-2026-sub-20ms/	2026-06
10	Android Police	Google Gemini Nano specs (secondary reporting)	https://www.androidpolice.com/gemini-nano-guide/	2026-06