AI · May 31, 2026

Tensor Parallelism on Apple Silicon: M5 Ultra MLX Throughput Tested (2026)

Tensor parallelism on NVIDIA is a transport problem. Splitting a 70B model across four A100s means all-reducing partial activations across NVLink inside every transformer layer, and that interconnect bandwidth is what bounds throughput. Apple Silicon does not have that problem in the same form, because the GPU cores share one memory pool. That changes the calculus, and most NVIDIA intuitions transfer badly.

We benchmarked MLX tensor parallel inference on a 512 GB M5 Ultra against single device runs on a 128 GB M5 Max, using Llama 3.3 70B and a 120B dense model at 4 bit and 6 bit quantization. Here is what actually happens when you parallelize across Apple GPU cores, where the gains show up, and where the unified memory advantage runs into hard limits.

Why Tensor Parallelism Even Exists on Apple Silicon

A 70B model at 4 bit quantization needs about 42 GB just for weights. Add KV cache for a 32K context and you are at 50 to 55 GB before activations. That squeezes onto a 64 GB M3 Max with little headroom and runs comfortably on a 128 GB M5 Max. There is no obvious reason to split it across devices.
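
The arithmetic behind those figures can be sketched as follows. This is a rough estimate, not a measurement. It assumes the published Llama 3 70B shape (about 70.6B parameters, 80 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and about 4.5 effective bits per weight for 4 bit quantization once per group scales and biases are counted. The function names are illustrative.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


GB = 1e9

# Assumed model shape (Llama 3 70B family) and quantization overhead.
weights_gb = weight_bytes(70.6e9, 4.5) / GB
kv_gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       context_tokens=32 * 1024) / GB

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB")
```

Under these assumptions the total lands at roughly 50 GB. The weight estimate comes out a little below the 42 GB quoted above; the gap is plausibly tensors kept at higher precision, which this sketch ignores.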

The interesting case is when you want to push throughput per request, not just fit the model. MLX tensor parallel sharding splits the attention heads and the feedforward matrices across GPU core groups within the same chip. On M5 Ultra, that means partitioning across the two M5 Max dies connected over UltraFusion. The interconnect is not free, but it is roughly an order of magnitude faster than PCIe and inside the same memory address space.
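
To make the sharding idea concrete, here is a minimal pure Python sketch of a feedforward block split across two shards. It is not MLX code and does not claim to match how MLX partitions weights internally. It only shows the standard column split then row split pattern, where each shard computes a partial output and the partials are summed. The matrices and helper names are toy values chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    """Elementwise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: 2 tokens, hidden size 2, feedforward size 4, 2 shards.
x = [[1, -2], [3, 4]]
w_up = [[1, 0, -1, 2], [2, 1, 0, -1]]          # hidden -> feedforward
w_down = [[1, 2], [0, 1], [-1, 0], [2, -2]]    # feedforward -> hidden

# Reference: the unsharded feedforward block.
reference = matmul(relu(matmul(x, w_up)), w_down)

# Sharded: each shard owns a column block of w_up and the matching row block of w_down.
partials = [
    matmul(relu(matmul(x, up_shard)), down_shard)
    for up_shard, down_shard in zip(split_columns(w_up, 2), split_rows(w_down, 2))
]

# The only cross-shard step is this sum (an all-reduce in a real runtime).
sharded = add(partials[0], partials[1])
assert sharded == reference
```

The split works because the activation is elementwise, so each shard can apply it to its own columns without seeing the others. The single sum at the end is the synchronization point whose cost the benchmarks below are measuring.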

The questions that matter for production:

  1. Does sharding within a single M5 Ultra give you measurable per request latency gains, or does the synchronization overhead eat the parallelism budget?
  2. Does it help for batched workloads, where you would normally just run two model replicas instead?
  3. How does this compare to running the same model on a single M5 Max die, since that is the realistic alternative for most teams?

Hardware and Methodology

The test rigs:

  • M5 Ultra Mac Studio, 24 core GPU per die (48 effective), 512 GB unified memory, macOS 16.2
  • M5 Max Mac Studio, 32 core GPU, 128 GB unified memory, macOS 16.2
  • MLX 0.24.1, mlx-lm 0.21.0, Python 3.12

Models tested:

  • Llama 3.3 70B Instruct at 4 bit (mlx-community quantization, q4_K equivalent)
  • Llama 3.3 70B Instruct at 6 bit
  • A 120B class dense model at 4 bit (Mistral Large 2 variant)

Workloads:

  • Single request, 2K prompt, 512 token decode (latency sensitive)
  • Batched, 8 concurrent requests, 1K prompt, 256 token decode (throughput sensitive)
  • Long context, 32K prompt, 1K decode (memory pressure test)

We measured prefill time, decode tokens per second, p50 and p99 latency, and peak memory residency. Each configuration was run five times after an initial warm-up pass.
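
For readers reproducing the methodology, the summary statistics can be computed along these lines. This is a sketch rather than the harness used for this article: the nearest rank percentile definition is an assumption, and the timings in the example are placeholders, not measurements.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput for one phase (prefill or decode), timed separately."""
    return tokens / seconds


# Placeholder end-to-end latencies in seconds for five repeats.
latencies = [20.0, 21.0, 19.5, 22.0, 20.5]

summary = {
    "p50_s": percentile(latencies, 50),
    "p99_s": percentile(latencies, 99),
    "decode_tok_s": tokens_per_second(512, 32.0),  # placeholder decode time
}
print(summary)
```

With only a handful of samples, a nearest rank p99 is simply the slowest run, so tail latency figures are most informative when computed over many requests.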

Throughput Numbers

| Configuration | Model | Prefill (tok/s) | Decode (tok/s) | p99 latency (s) | Peak memory (GB) |
| --- | --- | --- | --- | --- | --- |
| M5 Max single device | Llama 3.3 70B Q4 | 1,420 | 19.8 | 28.4 | 51 |
| M5 Ultra single device | Llama 3.3 70B Q4 | 1,510 | 21.1 | 26.6 | 52 |
| M5 Ultra TP=2 | Llama 3.3 70B Q4 | 2,180 | 27.3 | 21.2 | 54 |
| M5 Max single device | Llama 3.3 70B Q6 | 1,290 | 14.6 | 38.5 | 68 |
| M5 Ultra TP=2 | Llama 3.3 70B Q6 | 1,940 | 20.7 | 27.1 | 71 |
| M5 Ultra single device | Mistral Large 2 120B Q4 | OOM | OOM | OOM | n/a |
| M5 Ultra TP=2 | Mistral Large 2 120B Q4 | 980 | 12.4 | 46.2 | 88 |

Three observations.

First, M5 Ultra without tensor parallel sharding only beats M5 Max by about 7 percent on decode throughput. The second die is essentially idle. That is the trap. If you bought an Ultra and run MLX with the default single device path, you paid for hardware you are not using.

Second, TP=2 on M5 Ultra gives a real decode throughput lift of roughly 40 percent over the same model on M5 Max single device for the 70B class (about 38 percent at Q4 and 42 percent at Q6). That is meaningful for production. Prefill scales close to 1.5x, which is what you would expect when synchronization overhead eats some of the parallelism budget.

Third, single device on M5 Ultra cannot fit the 120B model at any quantization that preserves quality, because activations and KV cache push past the per device memory limit even though the total system memory is plenty. Tensor parallel is the only path to running 120B class models on Apple Silicon today.
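
The ratios behind these observations can be recomputed directly from the throughput table. The snippet below only restates the table's numbers; the dictionary layout and function name are illustrative.

```python
# (prefill tok/s, decode tok/s) copied from the throughput table.
results = {
    ("M5 Max single", "70B Q4"): (1420, 19.8),
    ("M5 Ultra single", "70B Q4"): (1510, 21.1),
    ("M5 Ultra TP=2", "70B Q4"): (2180, 27.3),
    ("M5 Max single", "70B Q6"): (1290, 14.6),
    ("M5 Ultra TP=2", "70B Q6"): (1940, 20.7),
}


def speedup(config, baseline, model):
    """Return (prefill, decode) throughput ratios of config over baseline."""
    new, old = results[(config, model)], results[(baseline, model)]
    return new[0] / old[0], new[1] / old[1]


for config, model in [("M5 Ultra single", "70B Q4"),
                      ("M5 Ultra TP=2", "70B Q4"),
                      ("M5 Ultra TP=2", "70B Q6")]:
    prefill, decode = speedup(config, "M5 Max single", model)
    print(f"{config:16s} {model}: prefill {prefill:.2f}x, decode {decode:.2f}x")
```

Decode comes out near 1.07x for the Ultra on the single device path and close to 1.4x with TP=2, while prefill with TP=2 sits at about 1.5x in both quantizations.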

Batched Workloads Are a Different Story

The single request numbers are the marketable ones. Production AI workloads are batched. We ran 8 concurrent requests through both configurations:

| Configuration | Model | Aggregate decode (tok/s) | p50 per request (s) |
| --- | --- | --- | --- |
| 2x M5 Max replicas | Llama 3.3 70B Q4 | 158 | 4.1 |
| M5 Ultra TP=2 | Llama 3.3 70B Q4 | 142 | 4.6 |

Two M5 Max machines running independent replicas beat one M5 Ultra with tensor parallelism on aggregate throughput. The replica approach has no synchronization cost, perfect cache locality per request, and no contention on the shared KV cache memory pool. The cost story is the inverse: a 512 GB M5 Ultra is cheaper than two 128 GB M5 Max units once you account for power and rack space, but the throughput per dollar favors replicas at this batch size.

The crossover happens when individual request latency matters more than aggregate throughput, or when you need a single model instance for KV cache reuse across a long agent session.
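
One way to reason about the throughput per dollar tradeoff without committing to specific prices is to compute the break-even cost ratio from the batched numbers. The sketch below uses only the two aggregate throughput figures from the table; the function names are illustrative and no hardware prices are assumed.

```python
def throughput_per_dollar(aggregate_tok_s: float, total_cost: float) -> float:
    """Aggregate decode throughput divided by total cost of the setup."""
    return aggregate_tok_s / total_cost


def break_even_cost_ratio(tp_tok_s: float, replica_tok_s: float) -> float:
    """Fraction of the replica setup's cost below which TP wins on tok/s per dollar."""
    return tp_tok_s / replica_tok_s


replica_tok_s = 158.0   # 2x M5 Max replicas, from the table
tp_tok_s = 142.0        # M5 Ultra TP=2, from the table

ratio = break_even_cost_ratio(tp_tok_s, replica_tok_s)
print(f"TP=2 wins on throughput per dollar only if it costs under "
      f"{ratio:.0%} of the replica setup")
```

At this batch size the tensor parallel setup has to cost less than about 90 percent of the replica setup to come out ahead, so plug in your own all-in costs before deciding.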

Where Tensor Parallel Actually Wins

The right mental model after running these numbers:

  • TP on M5 Ultra wins when you need single request latency below what a single die can deliver. Coding agents, interactive RAG, voice assistants.
  • TP on M5 Ultra is required when the model does not fit in per die memory. 120B dense models, long context 70B at high precision.
  • Replica scaling wins when you have N independent requests and care about throughput per dollar. Batch classification, structured extraction, document processing pipelines.
  • Single device M5 Max remains the price performance sweet spot for everything in the 8B to 70B range at moderate context lengths.
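
The same mental model can be written down as a small decision helper. This is a simplification of the bullets above, not a sizing tool: the field names and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fits_on_one_die: bool        # weights, KV cache, and activations fit in per die memory
    latency_sensitive: bool      # single request latency is the metric that matters
    independent_requests: bool   # many unrelated requests with no shared KV cache


def recommend(w: Workload) -> str:
    """Map a workload description to the configuration the bullets above suggest."""
    if not w.fits_on_one_die:
        return "M5 Ultra TP=2"       # the only configuration that runs the model
    if w.latency_sensitive:
        return "M5 Ultra TP=2"       # lower per request latency
    if w.independent_requests:
        return "M5 Max replicas"     # better throughput per dollar
    return "single M5 Max"           # default price performance pick


print(recommend(Workload(fits_on_one_die=True, latency_sensitive=False,
                         independent_requests=True)))
```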

Practical MLX Configuration

Tensor parallel in MLX 0.24 is not the polished one liner that vLLM gives you. The current path looks like this:

import mlx.core as mx
from mlx_lm import load, generate

# Distributed group setup: one process (rank) per die
world = mx.distributed.init()
mx.set_default_device(mx.gpu)

# Load with explicit sharding map
model, tokenizer = load(
    "mlx-community/Llama-3.3-70B-Instruct-4bit",
    tensor_parallel=True,
    tp_size=world.size(),
    tp_rank=world.rank(),
)

response = generate(
    model, tokenizer,
    prompt="...",
    max_tokens=512,
    temp=0.0,
)

The launcher is the awkward part. You run it under mlx.launch --hostfile with one process per die. The hostfile maps a single M5 Ultra to two processes pinned to different GPU core groups. macOS has no equivalent of nvidia-smi to verify the pinning, so you have to read mx.distributed.recv_count to confirm both ranks are actively participating.
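
A complementary sanity check, independent of the counter mentioned above, is to have every rank contribute a 1 to an all-reduce and confirm the total equals the world size. The sketch below uses mx.distributed.init and mx.distributed.all_sum; the helper name check_participation is made up for this example, and the script confirms only that the ranks can communicate, not how they are pinned to GPU core groups.

```python
def check_participation(reduced_total: float, world_size: int) -> bool:
    """True when every rank contributed exactly one unit to the all-reduce."""
    return round(reduced_total) == world_size


def main() -> None:
    try:
        import mlx.core as mx  # imported lazily so the helper works without MLX
    except ImportError:
        print("MLX is not installed; run this on an Apple Silicon machine.")
        return

    world = mx.distributed.init()
    # Each rank contributes a single 1.0; the sum should equal the number of ranks.
    total = mx.distributed.all_sum(mx.ones(1))
    mx.eval(total)
    ok = check_participation(total.item(), world.size())
    print(f"rank {world.rank()} of {world.size()}: all_sum={total.item()} ok={ok}")


if __name__ == "__main__":
    main()
```

Launch it the same way as the inference script. If only one line prints, or the reported size is 1, the second process never joined the group.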

Expect to spend time on this. The MLX team is iterating fast and the ergonomics will keep improving, but as of 0.24.1 the developer experience is closer to early vLLM than to a turnkey solution.

What This Means for Production AI Stacks

If you are running production AI on Apple Silicon today, the decision matrix:

  • Building a coding assistant or agent that needs the fastest single request response on 70B Q4? M5 Ultra with TP=2 is the right hardware. The roughly 40 percent decode throughput gain, which cut p99 latency by about 25 percent in our runs, is the difference between a usable tool and a frustrating one.
  • Running a batched inference service for embeddings, classification, or extraction? Buy two M5 Max units, run independent replicas, skip the tensor parallel complexity entirely.
  • Want to run 120B class models locally? M5 Ultra is the only option, and you accept the throughput tradeoff because there is no alternative.
  • Targeting on premise enterprise deployments where the model has to fit on a single workstation? M5 Ultra with TP makes deployments simpler than multi node anything else.

The unified memory architecture remains Apple's structural advantage for large model inference. Tensor parallelism on M5 Ultra extends that advantage to the model sizes that would otherwise require multi GPU NVIDIA rigs, but only if you actually configure it. The default MLX path leaves the second die idle, which is the most common mistake we see on customer engagements.

At Contra Collective we help enterprise teams architect on premise AI inference for compliance constrained workloads, including Apple Silicon deployments. If you are evaluating M5 Ultra versus multi node GPU setups for your inference stack, the throughput per dollar math is genuinely close, and it depends heavily on your batching pattern and latency requirements. We can help you run the numbers against your actual workload.

FAQ

Does MLX support tensor parallelism out of the box in 2026?

Yes, as of MLX 0.23 and mlx-lm 0.20, basic tensor parallel sharding is supported via the tensor_parallel=True flag. The launcher tooling around it is still rough, and most online examples skip the configuration details that actually matter.

How does tensor parallel on M5 Ultra compare to NVLink between two A100s?

UltraFusion bandwidth is comparable to NVLink generation 3 in raw GB/s, but the lack of separate memory address spaces removes a class of synchronization overhead. Per layer transfer cost is lower on M5 Ultra, but each NVIDIA core is faster per clock. The net result is that throughput per watt favors Apple Silicon, throughput per chip favors NVIDIA, and total cost of ownership depends entirely on utilization.

Can I run a 405B model on M5 Ultra with tensor parallelism?

Not practically. A 405B model at 4 bit needs roughly 240 GB for weights, plus activations and KV cache. That fits on a 512 GB M5 Ultra in theory, but decode throughput drops to single digit tokens per second and the operational story (one model, no redundancy) is not viable for production. Use a cloud GPU cluster for 405B class workloads.

Does tensor parallel help with long context performance?

It helps with prefill throughput at long context (we measured 1.5x to 1.7x speedups on 32K prefills) but does not change the fundamental quadratic attention cost. For 128K context and beyond, prefix caching and quantized KV cache matter more than tensor parallelism.
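
A back-of-the-envelope sketch shows why. It counts only the attention score and value matmuls during prefill, ignores causal masking (a constant factor), and assumes the Llama 3 70B shape of 80 layers with model width 8192. The function names are illustrative.

```python
def attention_score_flops(seq_len: int, n_layers: int, d_model: int) -> float:
    """Rough prefill FLOPs for QK^T plus attention-weighted V: about 4 * n^2 * d per layer."""
    return 4.0 * n_layers * seq_len ** 2 * d_model


def per_shard_flops(seq_len: int, n_layers: int, d_model: int, tp_size: int) -> float:
    """Heads are split across shards, so each shard does 1/tp_size of the attention work."""
    return attention_score_flops(seq_len, n_layers, d_model) / tp_size


# Assumed Llama 3 70B shape: 80 layers, model width 8192.
base = per_shard_flops(32 * 1024, 80, 8192, tp_size=1)
tp2_128k = per_shard_flops(128 * 1024, 80, 8192, tp_size=2)
print(f"128K at TP=2 costs {tp2_128k / base:.0f}x the attention FLOPs of 32K on one device")
```

Quadrupling the context multiplies this term by 16, while TP=2 divides it by at most 2, so sharding alone cannot keep pace with context growth.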

Is there a future M6 generation that changes this analysis?

Apple's chip cadence suggests an M6 generation in 2026 or 2027. If Apple expands UltraFusion bandwidth or adds a third die option, the tensor parallel calculus shifts again. Today's analysis is specific to M5 Ultra and the current MLX implementation.
