AI, June 1, 2026

MLX Continuous Batching: Throughput Architecture on Apple Silicon (2026)

Continuous batching is the single largest throughput unlock for transformer inference on any hardware. NVIDIA stacks have spent three years optimizing around it: vLLM, TensorRT-LLM, and SGLang all converge on the same pattern of paged KV cache, request scheduling, and prefill/decode interleaving. MLX is younger, the runtime is different, and most NVIDIA intuitions do not survive contact with Apple's unified memory architecture.

This is the working model for MLX continuous batching as it stands in mid 2026: what the runtime actually does when you serve multiple concurrent requests, where throughput scales linearly and where it saturates, and what numbers we measured on M5 Max and M5 Ultra running Llama 3.3 70B and Qwen 3 32B at production-relevant batch sizes.

Why Continuous Batching Matters at All

The naive serving pattern is request-at-a-time: receive a prompt, run prefill, stream decode tokens, finish, accept the next request. GPU utilization on this pattern is catastrophically bad. Decode is memory-bandwidth-bound and uses a small fraction of available FLOPs. A 70B model decoding at 25 tokens per second on an M5 Max touches maybe 8 percent of the compute the chip can deliver.

Continuous batching fixes this by running multiple requests through the model simultaneously, interleaving prefill (compute-bound, parallel) with decode (memory-bound, serial). Aggregate throughput grows sublinearly with batch size, but the total number of tokens served per second is still dramatically higher than with serial serving. On NVIDIA hardware with vLLM, the difference is often 10x to 20x at saturation.

The question for MLX is whether the unified memory architecture changes the shape of that curve, and the answer turns out to be yes, in both directions.

The MLX Batching Primitives

MLX does not ship a vLLM equivalent. There is no first-class continuous batching server, no PagedAttention implementation in the core library, and no standard request scheduler. What you get is a set of primitives that you assemble into a serving layer.

The pieces that matter:

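import mlx.core as mx
from mlx_lm import load, generate
# The import path for generate_step depends on the installed mlx_lm version
from mlx_lm.utils import generate_step

model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

# Manual batched prefill: right-pad prompts to a common length, run one forward pass
def batched_prefill(prompts: list[mx.array]) -> mx.array:
    max_len = max(p.shape[0] for p in prompts)
    padded = mx.stack([mx.pad(p, (0, max_len - p.shape[0])) for p in prompts])
    lengths = mx.array([p.shape[0] for p in prompts])
    pos = mx.arange(max_len)
    # Boolean mask, True = attend. A bare (batch, seq) 0/1 mask would drop
    # causality and not broadcast against (batch, heads, seq, seq) scores.
    causal = pos[:, None] >= pos[None, :]    # (seq, seq)
    valid = pos[None, :] < lengths[:, None]  # (batch, seq): real tokens only
    mask = mx.logical_and(causal[None, None], valid[:, None, None])  # (batch, 1, seq, seq)
    # Assumes the model's __call__ accepts a mask argument. With right padding,
    # request i's next-token logits sit at index lengths[i] - 1, not at -1.
    return model(padded, mask=mask)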

The actual batched decode loop is the part most implementations get wrong. You cannot just iterate generate_step over a batched tensor naively, because requests finish at different lengths and the batch needs to compact when one drops out. The serving layer is responsible for tracking active request slots, KV cache layout, and admission control.
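The slot bookkeeping described above can be sketched without any model code. This is a minimal illustration, not the mlx_lm implementation: `Request` and `SlotTable` are hypothetical names, and the KV cache gather that must follow each compaction is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


class SlotTable:
    """Tracks which batch row belongs to which request (hypothetical helper)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.active: list[Request] = []

    def admit(self, req: Request) -> bool:
        # Admission control: refuse new work when every slot is taken.
        if len(self.active) >= self.capacity:
            return False
        self.active.append(req)
        return True

    def step(self, next_tokens: list[int]) -> list[int]:
        """Record one decoded token per active row, then compact.

        Returns the surviving row indices, in order, so the caller can
        gather the same rows out of its batched KV cache.
        """
        for req, tok in zip(self.active, next_tokens):
            req.generated.append(tok)
        keep = [i for i, req in enumerate(self.active) if not req.done]
        self.active = [self.active[i] for i in keep]
        return keep
```

The list returned by `step` is the important part: whatever row order the slot table ends up with, the batched KV cache has to be re-indexed with the same list before the next forward pass, otherwise requests read each other's history.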

mlx_lm.server ships a basic implementation. Production deployments we have seen all replace it.

How Prefill and Decode Behave on Apple Silicon

The fundamental measurement that drives every batching decision is the prefill-to-decode ratio at your batch size. We measured Llama 3.3 70B at 4-bit quantization on M5 Max (128 GB) and M5 Ultra (512 GB), with 4096-token prompts and 512-token responses, varying batch size from 1 to 16.

Prefill throughput on M5 Max:

Batch Size    Prefill Tokens/Sec    Latency (4K prompt)
1             380                   10.8s
2             690                   11.9s
4             1,180                 13.9s
8             1,820                 18.0s
16            2,440                 26.9s

Prefill scales sub-linearly but meaningfully. At batch 16 you are getting 6.4x the throughput of batch 1 (2,440 vs 380 tokens/sec) for 16x the work, at about 2.5x the per-call latency, giving a scaling efficiency around 0.4 (6.4 / 16). That is the regime where prefill is compute-bound and the GPU cores are saturating.
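The scaling figures can be re-derived from the prefill table. The sketch below uses only the measured tokens/sec values and the 4096-token prompt length from the text; the function and variable names are illustrative.

```python
PROMPT_TOKENS = 4096

# Measured prefill throughput on M5 Max, from the table above (tokens/sec).
prefill_tok_per_s = {1: 380, 2: 690, 4: 1180, 8: 1820, 16: 2440}


def prefill_report(batch: int) -> tuple[float, float, float]:
    """Return (latency in seconds, speedup vs batch 1, scaling efficiency)."""
    rate = prefill_tok_per_s[batch]
    latency = batch * PROMPT_TOKENS / rate   # all prompts finish together
    speedup = rate / prefill_tok_per_s[1]    # throughput relative to batch 1
    efficiency = speedup / batch             # 1.0 would be perfect scaling
    return latency, speedup, efficiency


for b in prefill_tok_per_s:
    latency, speedup, efficiency = prefill_report(b)
    print(f"batch {b:>2}: {latency:5.1f}s  {speedup:4.1f}x  efficiency {efficiency:.2f}")
```

Efficiency below 1.0 is not waste in itself. It is the price of finishing 16 prompts in 26.9 seconds instead of the roughly 173 seconds that 16 serial batch-1 prefills would take.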

Decode throughput on M5 Max:

Batch Size    Aggregate Tokens/Sec    Per-Request Tokens/Sec
1             27                      27
2             49                      24.5
4             84                      21
8             132                     16.5
16            168                     10.5

Decode scaling tells you where memory bandwidth gives out. M5 Max delivers about 546 GB/s of unified memory bandwidth, and a 70B 4-bit model at 42 GB of weights means every decode step reads the entire set of weights once. At batch 16, aggregate decode throughput hits 168 tok/s, which is 10.5 decode steps per second. Because the weights are read once per step and reused across all batched requests in that forward pass, that works out to roughly 440 GB/s of weight reads (10.5 x 42 GB), about 80 percent of peak bandwidth. The ceiling is the bandwidth, not the batch logic.

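M5 Ultra at the same configuration scales to about 280 aggregate decode tokens per second at batch 16, reflecting the higher 1,090 GB/s bandwidth, while the per-request rate drops to around 17.5 tok/s (280 / 16). That is still higher than the 10.5 tok/s per request on M5 Max at the same batch size. The Ultra is the better throughput chip, and its per-request latency only becomes the worse of the two if you spend the extra headroom on larger batches.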
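The bandwidth arithmetic is easy to check from the figures quoted above. This is a rough sketch that counts weight reads only. It assumes the 42 GB of weights are read exactly once per decode step, and it ignores KV cache and activation traffic.

```python
WEIGHT_GB = 42.0  # 4-bit 70B weights, as quoted in the text


def implied_weight_read_gb_per_s(aggregate_tok_per_s: float, batch: int) -> float:
    """Weight traffic implied by a measured aggregate decode rate.

    Each decode step emits one token per request, so steps/sec is the
    aggregate rate divided by the batch size.
    """
    steps_per_s = aggregate_tok_per_s / batch
    return steps_per_s * WEIGHT_GB


for chip, peak_gb_per_s, aggregate in [("M5 Max", 546, 168), ("M5 Ultra", 1090, 280)]:
    used = implied_weight_read_gb_per_s(aggregate, batch=16)
    print(f"{chip}: {used:.0f} GB/s of weight reads, {used / peak_gb_per_s:.0%} of peak")
```

Because KV cache reads grow with batch size and context length while weight reads do not, the real traffic at batch 16 sits above these figures, which is consistent with both chips being close to their ceilings.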

Where the NVIDIA Mental Model Breaks

vLLM on an H100 hits decode batch sizes of 64 to 128 before saturation, because HBM3 bandwidth is 3,350 GB/s and the math allows it. On Apple Silicon the practical batch ceiling for 70B 4-bit is around 8 on M5 Max and 16 on M5 Ultra. Past that, per-request latency degrades faster than aggregate throughput improves, and most workloads care about both.

The second mental model break: PagedAttention is less critical on Apple Silicon than on NVIDIA. The Apple GPU does not have the same memory fragmentation problem because the unified pool is large and the allocator is more forgiving. Naive contiguous KV cache allocation works fine up to the practical batch ceiling. Implementing PagedAttention in MLX is possible (a few projects have started), but the throughput delta we measured was under 5 percent at production batch sizes. Not worth the engineering cost for most deployments.
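A quick sizing calculation shows why contiguous allocation is affordable at these batch sizes. The sketch assumes the published Llama 3 70B attention layout (80 layers, 8 KV heads, head dimension 128) and an unquantized 16-bit KV cache. None of these figures come from the measurements in this article, so substitute your own model config.

```python
def kv_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def contiguous_kv_gb(batch: int, max_tokens: int) -> float:
    """Memory to preallocate one contiguous max-length cache per request."""
    return batch * max_tokens * kv_bytes_per_token() / 1e9


# 4096-token prompt plus 512-token response, as in the benchmark setup.
MAX_TOKENS = 4096 + 512

for batch in (8, 16):
    print(f"batch {batch:>2}: {contiguous_kv_gb(batch, MAX_TOKENS):.1f} GB of KV cache")
```

Under these assumptions, worst-case preallocation at batch 16 is about 24 GB on top of the 42 GB of weights, which fits in a 128 GB unified pool without paging. The calculation changes quickly with longer contexts or larger batches, which is where paged allocation starts to pay for itself.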

The third break: speculative decoding interacts with batching differently. On NVIDIA, speculative decoding inside a batched scheduler is hard because draft tokens make the per-request decode lengths irregular. On MLX with smaller batches, the overhead is amortized across fewer concurrent requests and the gain is more reliable. We have seen 1.6x to 2.1x speedups stacking speculative decoding on top of batching, which is rare in vLLM deployments.

Request Scheduling Patterns That Actually Work

Three scheduling patterns we have shipped into production MLX serving layers:

Static batch with admission control. Fixed batch size N, requests queue until N are ready or a timeout fires. Simple, predictable latency, wastes throughput when traffic is bursty. Reasonable default for low-volume internal tools.

Continuous batching with prefill chunking. Run a persistent decode loop, admit new requests at each step, chunk long prefills across multiple steps so a 16K prompt does not block the decode loop for 80 seconds. Required for anything user-facing. Implementation effort is meaningful: roughly 1,500 lines of Python plus careful KV cache management.

Hybrid: prefill pool plus decode pool. Run two separate model instances if you have the memory budget. One handles prefill at high batch size, the other handles decode at moderate batch size. KV cache transfers between them. Wins on workloads with very long prompts and short responses. Costs roughly 1.4x the memory of a single instance.

The hybrid pattern is the most interesting in 2026 because it gets close to the throughput frontier without requiring the full vLLM-equivalent scheduler. M5 Ultra has the memory headroom to run it cleanly.
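The second pattern, continuous batching with prefill chunking, is easier to reason about as a step loop. The following is a toy simulation with no model calls, intended only to show the control flow: `Seq`, `run_schedule`, the chunk size, and the one-prefill-chunk-per-step policy are illustrative assumptions rather than a description of any shipped scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int
    max_new: int
    prefilled: int = 0
    decoded: int = 0


def run_schedule(arrivals: dict[int, list[Seq]], max_batch: int = 4, chunk: int = 512):
    """Simulate the step loop and return a log of (step, prefilling, decoding).

    arrivals maps a step index to the sequences that arrive at that step.
    """
    waiting, running, log = deque(), [], []
    step = 0
    while waiting or running or any(s >= step for s in arrivals):
        waiting.extend(arrivals.get(step, []))
        # Admit at every step, up to the batch ceiling.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # At most one prefill chunk per step, so decode never stalls for long.
        prefilling = next((s for s in running if s.prefilled < s.prompt_len), None)
        if prefilling is not None:
            prefilling.prefilled = min(prefilling.prompt_len, prefilling.prefilled + chunk)
        # Every sequence that was already fully prefilled decodes one token.
        decoding = [s for s in running
                    if s.prefilled >= s.prompt_len and s is not prefilling]
        for s in decoding:
            s.decoded += 1
        log.append((step, prefilling.name if prefilling else None,
                    [s.name for s in decoding]))
        # Compact: finished sequences free their slot immediately.
        running = [s for s in running if s.decoded < s.max_new]
        step += 1
    return log


log = run_schedule({0: [Seq("a", 1024, 3)], 1: [Seq("b", 512, 2)]})
for entry in log:
    print(entry)
```

In this run, request "b" is prefilled at step 2 while "a" is already decoding, which is the interleaving that keeps a long prompt from freezing every active stream. A real implementation would replace the counters with forward passes and attach KV cache management to the admit and compact points.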

When This Applies to Your Stack

If you are serving an internal coding assistant or a single-user productivity tool, continuous batching is overkill and naive serial decode is fine. The cost of the serving layer exceeds the throughput gain at batch size 1.

If you are running an MLX-backed feature behind a real product, continuous batching is non-optional and the right reference point is mlx_lm.server plus a custom scheduler, not vLLM. The decode ceiling is bandwidth-bound and you should size hardware to your concurrent request budget rather than your peak request rate.

If you are evaluating whether to run inference on Apple Silicon at all versus a managed cloud endpoint, the break-even is usually 5 to 10 concurrent active sessions on M5 Max territory. Below that, the unit economics favor a managed API. Above that, dedicated hardware wins on cost per token and on data residency.

Where Contra Collective Comes In

We build the inference serving layers under production AI features for enterprise clients: Shopify Plus brands running on-prem catalog search, retail backends embedding local models for store-level personalization, and platform teams standardizing on Apple Silicon for engineering productivity. The MLX scheduler patterns above are pulled directly from those engagements. If you are evaluating Apple Silicon for production inference and want a second opinion on throughput math, hardware sizing, or migration from a cloud endpoint, we can help.

FAQ

Does MLX support continuous batching out of the box? No. The mlx_lm.server reference implementation does basic request batching but not the prefill chunking, admission control, and KV cache compaction that production serving requires. Most teams shipping MLX in production build a custom scheduler.

What batch size should I target on M5 Max for a 70B model? 4 to 8 for balanced throughput and latency. Past 8 the per-request decode rate degrades faster than aggregate throughput improves, and most user-facing workloads care about both metrics.

Is PagedAttention worth implementing in MLX? For 70B class models at production batch sizes, we measured under 5 percent throughput gain over naive contiguous KV cache. Engineering cost is high. Not recommended unless you are running 100B-plus models at batch 32+.

How does continuous batching interact with speculative decoding on MLX? Better than on NVIDIA. The smaller practical batch sizes on Apple Silicon mean the per-request overhead of speculative draft tokens is amortized across fewer concurrent requests, giving more predictable 1.6x to 2.1x speedups.

What is the bandwidth ceiling for batched 70B decode on M5 Ultra? Roughly 280 aggregate tokens per second at batch 16, set by the 1,090 GB/s unified memory bandwidth. The bottleneck is reading the 42 GB of 4-bit weights once per decode step.
