The Edge Inference Stack: A Field Map, Least Squares

Almost everything written about LLM inference quietly assumes a datacenter: elastic memory, terabytes per second of HBM bandwidth, batches of thousands of requests amortizing every weight read, a power budget measured in kilowatts. Move the same model onto a phone, a Jetson, or a microcontroller and the assumptions invert one by one. You cannot scale out. You cannot batch: there is one user, and they are waiting. The chip throttles after thirty seconds because there is no fan. The weights do not fit.

Edge inference is not cloud inference scaled down. It’s a different regime with a different bottleneck, and the techniques that win are tuned to it. This series is about that regime. This first part draws the map: the constraints that define the edge, the single piece of physics that governs the whole stack, the vocabulary we will measure everything against, and a survey of the three layers (hardware, runtime, model) where the real decisions get made. Later parts replace every borrowed number here with the one I measured myself.

1. The Five Constraints That Define "Edge"

Everything downstream follows from five constraints. Each is the inverse of a luxury the cloud takes for granted.

Memory ceiling

Device RAM is fixed and shared with the OS and the rest of the app. A model only runs if its weights fit. A 7B model needs ~14 GB at FP16 but only ~3.5 GB at 4-bit. On an 8 GB phone, quantization is what decides whether the model loads at all; it is not a nice-to-have speedup.
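The fit check above is plain arithmetic, and a minimal sketch makes it concrete. The function names and the `usable_fraction` parameter (the share of RAM an app can realistically claim) are my own assumptions for illustration, not numbers from any OS documentation.

```python
GB = 1e9

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed for the weights alone (no KV cache, activations, or runtime)."""
    return n_params * bits_per_weight / 8

def fits(n_params: float, bits_per_weight: float, ram_bytes: float,
         usable_fraction: float = 0.5) -> bool:
    """True if the weights fit in the assumed usable share of device RAM."""
    # usable_fraction is a placeholder assumption; the OS and the rest of the
    # app take the remainder, and the real share varies by device.
    return weight_bytes(n_params, bits_per_weight) <= ram_bytes * usable_fraction

for bits in (16, 8, 4):
    size_gb = weight_bytes(7e9, bits) / GB
    verdict = "fits" if fits(7e9, bits, 8 * GB) else "does not fit"
    print(f"7B at {bits:>2}-bit: {size_gb:4.1f} GB, {verdict} on an 8 GB phone")
```

With the assumed 50% usable share, only the 4-bit variant passes. This is a lower bound on footprint, since the KV cache and activations come on top of the weights.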

Memory bandwidth

Once a model fits, the speed limit is how fast its weights stream from DRAM into the compute units. Generating one token reads every weight once but does only ~2 ops per weight. Compute finishes long before the next weights arrive.

Power & thermal budget

Edge devices run on a battery and cool passively. Performance you can hit for one burst is not performance you can sustain. After the SoC heats up, clocks drop and your latency quietly doubles. A benchmark that runs cold looks great and ships broken.

Latency & real-time

There is no network round-trip to hide behind, and there is a human on the other end. Interactive work lives under hard budgets: a voice turn that feels instant has a few hundred milliseconds total for the whole pipeline.

Privacy & offline

Often the entire reason to be on-device: the data never leaves, and it works in a tunnel with no signal. This is a product constraint that forces all the engineering ones above.

A caveat on scope: "Edge" is a spectrum, not a point, from a Cortex-M microcontroller with kilobytes of SRAM, to a phone SoC, to a Jetson-class module with tens of GB/s of bandwidth. The five constraints hold across all of it, but their numbers differ by orders of magnitude.

The device spectrum, from constrained to capable:

Cortex-M MCU

kB of SRAM · TinyML territory · CPU only

Phone SoC

Snapdragon / Tensor / Apple A-series · unified LPDDR · NPU · 60–100 GB/s

Apple Silicon

Large unified memory · ~100+ GB/s · CPU + GPU + ANE

Jetson Orin

Discrete-GPU-like · ~68 GB/s · CUDA at the edge

Raspberry Pi 5 sits below this spectrum at ~17 GB/s, pure CPU, no GPU offload for matmul.

2. The Governing Physics: One Equation for the Whole Stack

The thesis everything else rests on: single-stream LLM decoding is bound by memory bandwidth, not by compute. With that in hand, most of the edge toolkit stops looking like a grab-bag of tricks and starts looking like a single idea applied in different places.

Why decode is memory-bound. Generate one token at batch size one, and the forward pass must read every weight in the model exactly once to produce that single token. The arithmetic per weight is tiny: one multiply-accumulate, about 2 floating-point operations. Define arithmetic intensity as useful work per byte moved:

$$\text{intensity} = \frac{\text{FLOPs}}{\text{bytes moved}} \approx \frac{2\text{ FLOPs/weight}}{\text{bytes/weight}}$$

FP16 weights (2 bytes): intensity ≈ 1 FLOP/byte · INT4 weights (0.5 bytes): intensity ≈ 4 FLOP/byte

Place that on the roofline (Williams, Waterman & Patterson, 2009). Attainable performance is min(peak compute, intensity × bandwidth). Modern accelerators have a ridge point, the intensity at which compute and bandwidth balance, that sits in the tens to low hundreds of FLOP/byte. Single-stream decode, at ~1–4 FLOP/byte, sits far to the left of that ridge: deep in the bandwidth-limited region. The expensive silicon for matrix math mostly idles, waiting on DRAM.

Figure 1, Roofline model: why on-device decode is memory-bound. Single-stream decode (~1–4 FLOP/byte) sits far left of the ridge: bandwidth-bound, not compute-bound. Schematic; axis values illustrative.
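The roofline bound is small enough to compute by hand, but a sketch shows how little of the compute peak decode can reach. The peak-compute figure below is a hypothetical round number chosen so the ridge lands at 100 FLOP/byte; it is not the spec of any real device.

```python
def arithmetic_intensity(flops_per_weight: float, bytes_per_weight: float) -> float:
    """FLOPs performed per byte moved from memory."""
    return flops_per_weight / bytes_per_weight

def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: min(peak compute, intensity x bandwidth), in FLOP/s."""
    return min(peak_flops, intensity * bandwidth)

def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOP/byte) at which compute and bandwidth balance."""
    return peak_flops / bandwidth

PEAK_FLOPS = 10e12   # hypothetical 10 TFLOP/s accelerator (assumption)
BANDWIDTH = 100e9    # 100 GB/s, matching the worked example in this section

print(f"ridge point: {ridge_point(PEAK_FLOPS, BANDWIDTH):.0f} FLOP/byte")
for name, bytes_per_weight in (("FP16", 2.0), ("INT4", 0.5)):
    ai = arithmetic_intensity(2.0, bytes_per_weight)
    perf = attainable_flops(ai, PEAK_FLOPS, BANDWIDTH)
    print(f"{name}: intensity {ai:.0f} FLOP/byte, "
          f"attainable {perf / 1e9:.0f} GFLOP/s "
          f"({100 * perf / PEAK_FLOPS:.0f}% of peak)")
```

Under these assumed numbers, FP16 decode can use about 1% of the compute peak and INT4 about 4%. The rest idles, which is the point of the figure.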

In the bandwidth-limited region, the achievable lower bound on per-token latency is just:

$$t_\text{token} \approx \frac{\text{model bytes read}}{\text{memory bandwidth}}$$

A worked example (published specs only, I’ll measure the real gap in Part 2): Take a 1.7B-parameter model on a device with ~100 GB/s of LPDDR5 bandwidth.

Precision	Weight bytes	Time per token (bytes ÷ 100 GB/s)	Throughput ceiling
FP16 (2 B/param)	~3.4 GB	~34 ms/token	~29 tok/s
INT4 (0.5 B/param)	~0.85 GB	~8.5 ms/token	~118 tok/s

Two things follow. First, quantizing FP16 to INT4 moves 4× fewer bytes and buys ~4× faster decode in this regime. Second, these are ceilings. You never hit peak bandwidth (60–80% is typical), KV-cache reads add traffic as context grows, and the OS competes for the same bus. The gap between this clean number and what you actually measure on hardware is what Part 2 is about.
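The table and the 60–80% caveat can be folded into one small function. This is a sketch of the ceiling arithmetic only: the `efficiency` parameter is my stand-in for "fraction of peak bandwidth actually achieved", and it ignores KV-cache traffic and OS contention entirely.

```python
def decode_ceiling(n_params: float, bytes_per_param: float,
                   bandwidth: float, efficiency: float = 1.0):
    """Return (seconds per token, tokens per second) for weight reads alone.

    efficiency is an assumed fraction of peak bandwidth (1.0 = the ideal
    ceiling); KV-cache reads and bus contention are not modeled.
    """
    model_bytes = n_params * bytes_per_param
    seconds_per_token = model_bytes / (bandwidth * efficiency)
    return seconds_per_token, 1.0 / seconds_per_token

BANDWIDTH = 100e9  # 100 GB/s, as in the worked example

for name, bytes_per_param in (("FP16", 2.0), ("INT4", 0.5)):
    for eff in (1.0, 0.7):
        t, tps = decode_ceiling(1.7e9, bytes_per_param, BANDWIDTH, eff)
        print(f"{name} at {eff:.0%} of peak: {t * 1e3:5.1f} ms/token, {tps:5.1f} tok/s")
```

At 100% of peak the output matches the table. At an assumed 70%, the FP16 ceiling drops from ~29 to ~21 tok/s before any other overhead is counted.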

Memory bandwidth by device: ~60–85 GB/s (current flagship phone) · ~100 GB/s (entry Apple Silicon) · ~68 GB/s (Jetson Orin Nano) · ~17 GB/s (Raspberry Pi 5)

Prefill vs. decode: everything above describes the decode phase at batch size 1. Prefill (processing the prompt) runs across many tokens in parallel, so each weight read is reused many times and the phase is compute-bound. The same holds for batched server inference. Mixing up the two is the most common cost-model mistake I see. The memory-bound story is the edge story because the edge lives at batch size one. At scale you can climb back toward compute-bound by batching many requests together, which is what the datacenter does (we covered that side in Inference at Scale). On the edge you don’t get that lever: one user means batch size one, so memory bandwidth stays the ceiling. That is part of what makes edge inference its own problem. (For the IO-aware view of attention, see FlashAttention; for the full datacenter latency math, Pope et al.)
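To see why batching (or prefill) changes the regime, note that one weight read can serve every token processed in the same pass, so intensity grows roughly linearly with the number of tokens. The sketch below uses that first-order approximation; it ignores activation and KV-cache traffic, and the ridge value of 100 FLOP/byte is a hypothetical figure, not a measured one.

```python
import math

def batched_intensity(tokens_per_pass: int, bytes_per_weight: float) -> float:
    """Approximate FLOP/byte when each weight read is reused across a batch.

    First-order model: 2 FLOPs per weight per token, weights read once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight

def tokens_to_reach_ridge(ridge: float, bytes_per_weight: float) -> int:
    """Smallest tokens-per-pass whose approximate intensity meets the ridge."""
    return math.ceil(ridge * bytes_per_weight / 2.0)

RIDGE = 100.0  # hypothetical ridge point in FLOP/byte (assumption)

for tokens in (1, 8, 64, 512):
    ai = batched_intensity(tokens, bytes_per_weight=2.0)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"{tokens:>3} tokens/pass at FP16: {ai:6.0f} FLOP/byte, {regime}")

print("FP16 tokens needed to reach the ridge:", tokens_to_reach_ridge(RIDGE, 2.0))
```

A 512-token prompt or a large server batch clears the assumed ridge easily. A single decoding user sits at 1 token per pass and never gets close.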

3. The Vocabulary: What We Measure, and What It Means

Every later part in this series reports these metrics. It is worth pinning them down once up front.

TTFT

Time to first token. Wall-clock from request to first output token. In a pipeline it's a sum: in a voice cascade, "first token" is really first audio out, and TTFT is the sum of every upstream stage.

TPOT / inter-token latency

Time per output token. Time between successive tokens once generation is flowing. This is the memory-bound number from Sec. 2. TTFT and TPOT are different costs and must be reported separately.

Throughput

Tokens per second, sustained. On the edge, "sustained" is the word that matters. A cold-start burst number tells you nothing about steady-state performance after thermal throttling kicks in.

Peak memory footprint

Maximum resident set. Weights + KV cache + activations + runtime overhead. This is what determines fits/doesn't fit against the Sec. 1 memory ceiling.

Energy per token

Joules per token (or mWh per turn). On battery this matters as much as latency. A model that is fast but draws 3W kills a session in minutes.

Thermal throttle point

How long until clocks drop. How long until sustained load forces a clock reduction, and what latency looks like after that. A benchmark that runs cold ships broken.
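Energy per token, from the list above, is just average power divided by throughput. A small sketch of the unit conversions follows. The 3 W draw echoes the example in the text; the 20 tok/s rate and the 200-token turn are hypothetical values I picked for illustration.

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return avg_power_watts / tokens_per_second

def mwh_per_turn(j_per_token: float, tokens_in_turn: int) -> float:
    """Energy for one turn in milliwatt-hours (1 mWh = 3.6 J)."""
    return j_per_token * tokens_in_turn / 3.6

# 3 W is the draw mentioned in the text; the other two numbers are assumptions.
jpt = joules_per_token(avg_power_watts=3.0, tokens_per_second=20.0)
turn = mwh_per_turn(jpt, tokens_in_turn=200)
print(f"{jpt:.2f} J/token, {turn:.1f} mWh per 200-token turn")
```

Measuring the power term honestly is the hard part. It has to be averaged over the sustained run, not sampled during the cold burst.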

Always report p50 and p95, never just the mean. On passively-cooled hardware the average is a lie: it blends a fast cold start with a slow throttled steady state into a number that describes neither. The tail is the user experience.
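A toy example shows how the mean hides the two regimes. The latencies below are synthetic, made up to mimic a fast cold phase followed by a throttled one; they are not measurements. The percentile uses the simple nearest-rank definition, one of several in common use.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic per-token latencies in ms (illustrative only, not measured):
# 70 tokens before throttling, 30 tokens after.
latencies_ms = [30.0] * 70 + [60.0] * 30

print(f"mean = {mean(latencies_ms):.1f} ms")
print(f"p50  = {percentile(latencies_ms, 50):.1f} ms")
print(f"p95  = {percentile(latencies_ms, 95):.1f} ms")
```

The mean comes out at 39 ms, a latency no token in this run actually had. The p50 and p95 together recover both regimes.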

4. Layer 1: The Hardware

The compute on an edge SoC is heterogeneous, and each unit wins at something different. Understanding which unit handles which workload, and the tradeoffs of each, is the first hardware decision.

CPU: flexible, great for control flow and the messy glue (audio framing, tokenization, sampling). Lowest throughput for dense matmul. Tradeoff: universal and simple, but leaves performance on the table for the heavy tensor work.

GPU: strong parallel throughput, the common target for on-device LLM matmuls via Metal/Vulkan/OpenCL. Tradeoff: good general accelerator, but shares the same DRAM bus. Sec. 2 still rules.

NPU/DSP: purpose-built for dense low-precision matmul at very high perf-per-watt. Tradeoff: fastest and most efficient when the op maps cleanly, and brutal when it doesn’t. Dynamic shapes, unusual ops, and control flow fall back to CPU and erase the win.

“NPU” isn’t a magic word. NPUs are built for static-shape INT8/INT4 matmul; dynamic sequence lengths and exotic operators fall off the fast path. Whether your model is NPU-friendly is really a property of the graph, decided long before you deploy.

Figure 2, One shared bus, many compute units. Whatever units a device has, they all pull weights through one DRAM bus, so bandwidth sets the ceiling. Stage-to-unit mapping shown is one phone-class SoC example; it varies by device.

The device spectrum (placing Sec. 1 on real silicon) runs from microcontrollers (Cortex-M, kB of SRAM; TinyML territory) to phone SoCs (Snapdragon/Tensor/Apple, unified LPDDR, dedicated NPU) to Apple Silicon (large unified memory, high memory bandwidth) to Jetson-class modules (discrete-GPU-like, tens of GB/s). The same model is a different engineering problem at each stop. And they don’t all carry the same kinds of compute: a Raspberry Pi is effectively CPU-only, most laptops are CPU+GPU, and phones reliably add an NPU/DSP, so “which unit runs what” is device-specific, not a fixed rule.

5. Layer 2: The Runtime / Engine

Between a model file and a token sits the inference engine, and it does far more than "run the math." Graph compilation folds many small ops into fewer fused kernels, cutting memory round-trips. Delegate selection routes each op to CPU, GPU, or NPU and picks a kernel for the shapes and precision. Memory planning pre-allocates a reusable arena so inference doesn't allocate on the hot path. KV-cache management stores past keys and values so decode is incremental, and on the edge, its growth with context is a real share of both memory and bandwidth.
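The KV-cache share mentioned above can be estimated from the model shape. The formula is the standard one for attention with separate key and value tensors per layer. The configuration in the example (28 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical small-model shape, not the spec of any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by the KV cache for one sequence.

    The leading 2 counts keys and values; bytes_per_elem=2 assumes an FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=28, n_kv_heads=8, head_dim=128)

print(f"per token: {kv_cache_bytes(seq_len=1, **shape) / 1024:.0f} KiB")
for seq_len in (512, 4096, 32768):
    size = kv_cache_bytes(seq_len=seq_len, **shape)
    print(f"context {seq_len:>5}: {size / 1e9:.2f} GB")
```

In standard attention each decoded token reads the whole cache, so this number counts twice: once against the memory ceiling and once as extra bytes per token on top of the weights.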

llama.cpp / ggml

The de-facto CPU/edge LLM runtime. GGUF format, broad quant support, runs nearly everywhere from Pi to phone.

ONNX Runtime

Portable cross-platform graph execution with a wide ecosystem of execution providers (ONNX Runtime's term for delegates) across CPU, GPU, and NNAPI.

ExecuTorch

PyTorch's on-device runtime for mobile and embedded export. Direct path from a PyTorch model to a phone.

LiteRT (TFLite)

Google's mobile runtime. Mature NNAPI, GPU, and Hexagon delegates. Wide Android support.

MLC-LLM

Compiler-driven (TVM) LLM deployment across GPUs and phones. Aggressive kernel generation for specific targets.

TensorRT-LLM

NVIDIA's high-performance stack. Jetson Orin at the edge end; full datacenter stack at the other.

Cactus

A newer engine aimed at IoT and mobile. Low RAM overhead, model-agnostic, targeting constrained hardware.

The runtime is where Sec. 2's ceiling meets reality. Fusion quality, kernel selection, and cache layout decide how close you actually get to the roofline. That gap between ceiling and runtime is measurable, and Part 2 measures it.

6. Layer 3: The Model and Compression

This is the layer with the most levers, and Sec. 2 tells you which ones actually move latency. Because decode is memory-bound, anything that reduces bytes-per-weight is a direct, proportional speed multiplier.

Quantization is the headline edge technique, and it is fundamentally a bandwidth decision. Fewer bits per weight means fewer bytes moved per token, which Sec. 2 turns directly into speed and into fitting the Sec. 1 ceiling at all. PTQ (post-training quantization) is cheap and fast; QAT (quantization-aware training) is costlier but better at low bit-widths. Common formats include GGUF k-quants (e.g., Q4_K_M) for llama.cpp, NF4 from QLoRA, weight-only PTQ via GPTQ and AWQ, and activation handling via LLM.int8() and SmoothQuant. The honest cost: quality degrades as bits drop, and unevenly across tasks. Reasoning and rare tokens suffer first. The 4× speedup is real; so is the loss. Measure it; a later part does.
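To make "fewer bits per weight" concrete, here is a minimal sketch of the simplest weight-only scheme: symmetric absmax quantization with one scale per group. It is deliberately not GGUF k-quants, GPTQ, AWQ, or NF4, all of which are more sophisticated. The group size of 32 and the 16-bit scale are assumptions for the example.

```python
def quantize_group(weights, bits=4):
    """Symmetric absmax quantization of one group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0   # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Reconstruct approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]

def effective_bits(bits, group_size, scale_bits=16):
    """Stored bits per weight once the per-group scale is amortized."""
    return bits + scale_bits / group_size

group = [0.7, -0.42, 0.12, 0.0]
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
worst = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print(f"worst-case error: {worst:.3f} (bounded by scale/2 = {scale / 2:.3f})")
print("effective bits/weight at group size 32:", effective_bits(4, 32))
```

Two things show up even in this toy. The rounding error is bounded by half the scale, so one outlier in a group coarsens every other weight in it. And the scales are not free: under these assumptions a "4-bit" format stores 4.5 bits per weight, which is the number that belongs in the bandwidth math.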

Distillation trains a small student model to mimic a larger teacher; pruning removes weights or structure. Both shrink model bytes directly. Efficient architectures like MobileLLM are sub-billion parameter LLMs designed for devices; Mamba and SSM-hybrid models trade quadratic attention for linear-time recurrence, cutting the KV-cache and long-context bandwidth tax. Memory-mapping streams weights from flash when they exceed RAM (Apple's LLM in a flash), trading bandwidth and latency for footprint.

Speculative decoding at the edge: A cheap draft model proposes several tokens; the target verifies them in one pass. Lossless by construction, and it pays off precisely in the low-batch, memory-bound edge regime, the same regime that makes everything else hard. The acceptance math is covered in detail in our Speculative Decoding deep-dive; this series will measure it on-device.
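A back-of-envelope sketch of why this pays off, using the usual simplifying assumption that each drafted token is accepted independently with probability alpha. The acceptance rate, draft length, and draft cost ratio below are hypothetical inputs, not measurements.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the gamma drafted tokens is accepted independently with
    probability alpha; the pass always yields at least one token.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost_ratio is the assumed cost of one draft step relative to one
    target pass; verification is counted as a single target pass.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * draft_cost_ratio + 1.0)

# Hypothetical inputs for illustration.
alpha, gamma, cost = 0.8, 4, 0.1
print(f"tokens per target pass: {expected_tokens_per_pass(alpha, gamma):.2f}")
print(f"estimated speedup:      {estimated_speedup(alpha, gamma, cost):.2f}x")
```

The model treats verifying several tokens as costing one target pass, which is reasonable only because decode is memory-bound: the weights are read once either way. The same assumption is why the draft model's bytes matter, since it also has to stream through the shared bus.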

7. Following a Single Voice Turn Through the Stack

The pipeline below is a representative on-device cascade, the standard pattern shared by voice assistants generally, not any specific product. Stage timings are illustrative until the measured parts.

Everything above sits in isolation until you wire it into a real product. The clearest example is an on-device voice assistant, because it isn’t one model; it’s a small pipeline of them. Your latency budget is whatever they add up to. One voice turn flows like this:

Each component spends a slice of the Sec. 3 budget, and each one is shaped by the choices from Sec. 4-6. The wake-word and VAD models are tiny and always listening, so on a phone they sit on the NPU/DSP (or, failing that, the CPU), while on a Pi or laptop they’d just run on the CPU (Sec. 4). STT and the LLM are the heavy stages: this is where your quantization and runtime picks (Sec. 5-6) decide whether the thing even fits in memory. TTS time-to-first-audio is what the user actually feels as “did it answer me?”

Figure 3, Voice-turn pipeline. A voice turn's latency is the sum of its stages; the slowest sets what the user feels. Timings illustrative.

In a pipeline, TTFT is not a single number; it is a sum. "First audio out" only happens once every upstream stage has done its part, so every stage spends budget and the slowest one dominates the experience. The Sec. 2 memory-bound physics runs underneath all of them.
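The sum is trivial to compute, and writing it down makes it obvious where the budget goes. The stage names follow the cascade described above; every timing below is a hypothetical placeholder in the same spirit as the figure, not a measurement.

```python
# Placeholder timings in ms, illustrative only.
stages_ms = {
    "vad_endpoint": 50,
    "stt": 150,
    "llm_prefill": 120,
    "llm_first_tokens": 40,
    "tts_first_audio": 90,
}

def pipeline_ttft(stages: dict) -> float:
    """Time to first audio out: the sum of every upstream stage."""
    return sum(stages.values())

def slowest_stage(stages: dict) -> str:
    """Name of the stage that spends the largest share of the budget."""
    return max(stages, key=stages.get)

total = pipeline_ttft(stages_ms)
print(f"time to first audio: {total} ms")
for name, ms in stages_ms.items():
    print(f"  {name:<17} {ms:>4} ms  ({100 * ms / total:4.1f}%)")
print("largest share:", slowest_stage(stages_ms))
```

This assumes the stages run strictly one after another. A streaming pipeline overlaps some of them, which shrinks the sum but not the need to account for each stage separately.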

8. What's Next

This part gave you the map and the one equation that ties it together. The map is worth little without the territory, so the rest of the series goes measured. Part 2 starts where Sec. 2 ended: how far does a real runtime fall from the roofline ceiling, and why? It reports TTFT and its pipeline decomposition, TPOT, throughput, peak memory, energy per token, and the thermal throttle point, all with p50/p95 and a stated device, quant, and sequence length.

Related reading: For the datacenter side of this memory-bound story, continuous batching, PagedAttention, and the economics of serving at scale, see Inference at Scale. For the acceptance-sampling math behind speculative decoding referenced in Sec. 6, see Speculative Decoding in LLMs.