The category error

The right inference engine doesn't maximize a benchmark. It matches the bottleneck that actually blocks your workflow.

Comparing local LLM inference engines on raw tokens-per-second is a category error. Imagine you need to serve a 70B model to three concurrent users on a single RTX 4090. You grab the engine that posted the highest single-request benchmark, fire it up, and watch your API time out. You didn't pick the wrong model. You picked the wrong bottleneck. Ollama solves setup friction. vLLM solves KV-cache scheduling. llama.cpp solves hardware portability. Apple's MLX is optimized for unified memory bandwidth. They optimize for completely different constraints, and treating them as interchangeable alternatives is what costs engineers hours of wasted VRAM and debugging time. I'll map your actual workload and hardware to the right stack. No vendor fluff, no benchmark theater. Just the tradeoffs that matter when you're trying to ship.

The Bottleneck You’re Actually Trying to Solve

  • llama.cpp: single-stream efficiency and bare-metal portability. C++ core, GGUF quantization, runs on basically anything.
  • Ollama: developer experience and ecosystem glue. Wraps llama.cpp with a registry, daemon, and OpenAI-compatible API.
  • vLLM: concurrent throughput and memory management. PagedAttention and continuous batching change the math for multi-user loads.
  • MLX: compute-memory co-design on Apple Silicon. Unified memory enables massive models, but bandwidth is the hard limit.

Picking based on "speed" alone wastes engineering time because speed is context-dependent. A single-request benchmark measures how fast one stream can pull tokens. It says nothing about how the engine handles memory fragmentation, context switching, or multiple simultaneous prompts. The right tool matches the bottleneck, not the benchmark table.

llama.cpp — Squeezing the Single Stream

llama.cpp powers a significant portion of the local AI ecosystem, and for good reason. GGUF quantization became the standard for consumer hardware because it strikes a practical balance between model fidelity and memory footprint. Q4_K_M isn't a theoretical curiosity; it's what lets an 8B model fit comfortably in 6GB of VRAM without collapsing into nonsense. When you run llama.cpp directly, you get the full control surface. You set the batch size, tune rope scaling for longer contexts, split GPU layers across cards, and cap context length explicitly. You get exactly what you configure. The tradeoff is that configuration is manual. There's no daemon guessing your intent. The serving reality matters most. llama-server gives you OpenAI-compatible endpoints, but concurrency is naive queuing. Requests stack up. Single-request latency is highly predictable. Multi-user throughput isn't. If you're on Linux bare-metal, running AMD/Intel/low-VRAM hardware, or need deterministic single-stream performance without wrapper overhead, this is your stack.

llama-server \
  --model ./llama-3.1-8b.Q4_K_M.gguf \
  --ctx-size 8192 \
  --gpu-layers 35 \
  --threads 8 \
  --port 8080

I haven't stress-tested llama-server past a handful of concurrent users, and the CLI can be brutal if you skip the server wrapper and try to script raw generation. But when you need to squeeze every cycle out of a single prompt on weird or constrained hardware, the C++ core leaves Python wrappers in the dust.

Ollama — Paying the DX Tax

Frame it honestly: Ollama is llama.cpp with training wheels. ollama run <model> downloads the GGUF, handles quantization defaults, starts a background daemon, and drops you into a chat interface or API in seconds. The ecosystem integration is unmatched. Virtually every local AI tool, IDE plugin, and notebook assumes Ollama exists. The tradeoff is control. The daemon manages GPU layers automatically. It decides how much VRAM to use, how to fall back to system RAM, and when to evict caches. That's great until you need to force a specific layer split across two mismatched cards, or you want to lock context length to prevent OOM crashes during long sessions. You lose the fine-grained knobs. Concurrency is the other ceiling. Ollama is built for single-user dev loops, not production APIs. Under load, the underlying naive queuing degrades fast. If you're on a personal workstation, a macOS dev machine, or rapid-prototyping where setup time outweighs inference time, it's the default. If you're serving a team, it will fight you.

NoteGotcha: Ollama's automatic GPU layer management can obscure memory fragmentation issues. Running long contexts or switching models may cause the daemon to spill to swap unexpectedly. Pin your layers or restart the service before it degrades your whole session.

vLLM — PagedAttention and the Multi-User Reality

vLLM exists because naive attention cache management is wildly inefficient. PagedAttention treats the KV cache like OS virtual memory. Instead of pre-allocating contiguous blocks for every possible sequence length, it allocates and frees cache blocks dynamically. Padding waste vanishes. Memory utilization jumps. Combine that with continuous batching, and the math changes. New requests don't wait for the current batch to finish. They join the generation stream immediately. Throughput scales with concurrency. Single-request overhead is higher because of the Python runtime and scheduler, but multi-user latency drops dramatically. The constraints are real. GPU-only. Python-based. Heavier startup. Configuration complexity for tensor parallelism, prefix caching, and structured output. If you're running a multi-user API endpoint, a batch processing pipeline, or any workload where concurrent requests exceed one, vLLM is the only rational choice.

Say you're running a flash sale and two buyers click at once. With naive queuing, buyer two waits for buyer one's entire generation to finish before their first token appears. With continuous batching, both streams run in the same forward pass. Buyer two gets a near-instant TTFT. Buyer one pays a negligible latency tax. That's why vLLM dominates serving benchmarks while losing single-request comparisons.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9

The --gpu-memory-utilization flag is worth noting. vLLM's scheduler reserves headroom for KV cache growth. Push it too high and you'll hit OOM during peak batching. Leave it at 0.85 or 0.9 for production. You're trading single-request peak speed for predictable multi-user throughput.

MLX — Apple Silicon’s Unified Memory Play

Apple's M-series architecture co-designs compute and memory. The model weights and KV cache share the same physical memory pool. There's no PCIe bottleneck copying weights from system RAM to VRAM. That's why a 70B model runs on a MacBook Max without swapping to disk. The hard limit isn't capacity. It's bandwidth. Unified memory gives you access to huge pools, but the memory controller still caps how fast you can feed the compute units — a bandwidth bottleneck that extra unified memory won't fix. When you push long contexts or concurrent requests, throughput flattens because the bus saturates. MLX is exceptional for iteration, fine-tuning, and running massive models on a laptop. It's terrible for production serving or multi-user loads. If you're on Mac and prioritizing the dev loop over serving metrics, MLX is your go-to. Skip it for Linux workstations or team APIs. The architecture advantage doesn't translate to serving throughput.

The Decision Matrix — Hardware × Workload

Workload / HardwareRecommended EngineWhy
Single GPU dev loop / personal workstationOllama or llama.cppOllama for frictionless setup. llama.cpp for deterministic control.
Multi-user API / production servingvLLMContinuous batching and PagedAttention scale throughput with concurrency.
Linux bare-metal / AMD / low-VRAM / weird hardwarellama.cppGGUF quantization and C++ portability cover hardware Python engines ignore.
Mac iteration / fine-tuning / massive contextMLXUnified memory enables large models; bandwidth limits serving.
OpenAI-compatible API parityOllama, vLLM, llama-serverAll expose the same endpoint shape. Pick based on concurrency, not the route.

The stack will fight you if you misalign it. Profile your actual request shape. Count your concurrent users. Check your memory bandwidth. Then pick the bottleneck you're trying to solve, not the benchmark table. Pick the bottleneck, not the benchmark. The stack will fight you if you misalign it, so profile your actual request shape and hardware constraints before you pip install anything.

local-llmbandwidthllm-inferencevllmllamacppollamamlx