The perplexity curve is flat until it isn't

Quant LevelPerplexity (Llama-3.1-8B baseline)Bytes/Param8B Model Size70B Model Size
F16~5.962.0~16 GB~140 GB
Q8_0~5.960041.0~8 GB~70 GB
Q6_K~5.97~0.75~6 GB~52.5 GB
Q5_K_M~5.99~0.625~5 GB~43.75 GB
Q4_K_M~6.01~0.5~4 GB~35 GB

The quality difference between Q4_K_M and Q8_0 is measured in hundredths of a perplexity point. You are not buying better reasoning. You are buying slower memory bandwidth.

I’ve watched the Q4 vs Q8 debate drag on, and it’s missing the real bottleneck. You probably treat quantization like a JPEG slider: crank it up until the artifacts show, then back off. But consumer GPU inference doesn't care about floating-point precision. It cares about how fast you can feed the compute units. When I look at llama.cpp's community perplexity tracking, the trade-off is brutally clear. Across the 7B-class models I’ve tested, the delta stays under 0.05. A baseline sits at roughly 5.96 perplexity in F16. Q8_0 pushes that to 5.96004. Q6_K lands at 5.97. Q5_K_M hits 5.99. Q4_K_M settles at 6.01. In practice, a 0.05 perplexity shift is statistically invisible for chat, drafting, or general summarization. It only surfaces in strict chain-of-thought reasoning, mathematical derivation, or when the model is forced to follow rigid JSON schemas. Yet jumping from Q4_K_M to Q8_0 doubles your VRAM footprint. You're trading a two-fold increase in memory consumption for a sub-1% shift in a metric that barely moves the needle for most workloads. The curve is flat. It stays flat until you drop below Q4, where legacy formats start bleeding accuracy. K-quants (the _K_M, _K_S, _K_L variants) were designed specifically to preserve that flat region by mixing precision across layers instead of applying uniform compression.

Why memory bandwidth dictates your tok/s, not compute

# Theoretical throughput bound (ignoring kernel overhead)
# tok/s ≈ (GPU Memory Bandwidth) / (Model Size in VRAM)

bandwidth_4090 = 1008e9 / 8  # ~1008 GB/s peak, divided by 8 for bytes
model_size_q4 = 35e9         # 70B params at ~0.5 bytes/param
model_size_q8 = 70e9         # 70B params at 1.0 bytes/param

tok_s_q4 = bandwidth_4090 / model_size_q4
tok_s_q8 = bandwidth_4090 / model_size_q8

This formula is a first-order approximation. It breaks down for batched inference or very small models where compute, not memory, becomes the bottleneck. But when you run single-request inference on modern consumer GPUs, you're almost entirely memory-bound. The compute cores sit idle waiting for weight data to stream from VRAM. The math is unforgiving: tok/s ≈ (GPU memory bandwidth) / (active model size in VRAM). Precision doesn't change the arithmetic intensity of the forward pass. It changes the volume of data you have to move. Take a 70B model you're trying to run. At Q4_K_M, the weights occupy roughly 35 GB. At Q8_0, they occupy 70 GB. On a card with 1 TB/s of peak bandwidth, the theoretical maximum throughput halves. Real-world tok/s will be lower due to kernel launch overhead, Python/Go runtime costs, and the KV cache, but the ratio holds. Double the precision, halve the speed. This is why community benchmarks consistently show Q4_K_M outpacing Q8_0 by a significant margin, often approaching a 2x theoretical speedup with real-world gains typically ranging from 30% to 50% depending on framework overhead, even though both run the same architecture. You won't fix this with unified memory or massive VRAM pools. As we've detailed, unified memory won't fix the bandwidth bottleneck. Offloading to system RAM or using CPU fallback introduces latency penalties that dwarf any precision gains. The bus width and memory clock speed are the hard ceiling. Then you hit the KV cache trap. Every token you keep in context consumes VRAM proportional to the model's hidden dimension and the number of layers. If you pick Q8_0 for a 30B model, the weights will occupy roughly 30 GB, but you'll have almost nothing left for the cache. The moment context grows past a few thousand tokens, the framework either truncates the conversation, spills to CPU, or OOMs entirely. You gain precision but lose coherence and speed simultaneously.

Note⚠️ The KV cache scales linearly with context length and layer count, but is independent of weight quantization. A Q4 model with a 64K context window will often consume more VRAM than a Q8 model with a 2K window. Always budget cache space before picking a quant level.

The VRAM-to-quant decision matrix

VRAM BudgetRecommended QuantTarget Model SizeContext Strategy
8 GBQ4_K_M7B–8BKeep context ≤ 4K–8K. Q5 will OOM on cache.
16 GBQ5_K_M or Q6_K7B–13B8K–16K viable. Use Q4_K_M if targeting 30B+ with heavy offloading.
24 GBQ6_K13B–30B16K–32K comfortable. Q8_0 only for 7B–8B if structured output/math is required.
32 GB+Q8_013B–30BFull context windows. Q4_K_M still wins for 70B+ unless dual-GPU or massive bandwidth.

At 8 GB, Q4_K_M is the only viable path for 7B–8B models. You can squeeze a Q5_K_M in, but the KV cache will eat your headroom the moment you exceed a few thousand tokens. Stick to Q4 unless you're running micro-models or distillations. At 16 GB, you gain breathing room. Q5_K_M or Q6_K for 7B–13B models gives you a perceptible bump in reasoning stability without tanking tok/s. If you're targeting 30B+ models, drop back to Q4_K_M and accept the 0.05 perplexity delta. The speed and context length gains are worth it. At 24 GB, Q6_K becomes the sweet spot for 13B–30B models. You get near-Q8 quality with significantly better memory throughput. Reserve Q8_0 for 7B–8B models only when you're generating strict JSON, running mathematical chains, or fine-tuning where small perplexity deltas compound over thousands of steps. At 32 GB and above, Q8_0 becomes practical for 13B–30B models. But for 70B+ weights, Q4_K_M still wins unless you have dual GPUs with NVLink or a workstation card with substantially higher bandwidth. The math doesn't lie: moving 70 GB of weights per forward pass will bottleneck you regardless of how many bits you pack into them.

# Verify VRAM allocation and layer offloading
llama-cli -m ./models/llama-3.1-8b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --verbose-prompt \
  --context-size 8192

# Cross-check with nvidia-smi to confirm weight vs cache split
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Run the command, watch the verbose output, and check nvidia-smi. You'll see exactly how many layers landed on the GPU and how much VRAM the KV cache is consuming. If the framework reports CPU offloading, your quant is too aggressive for your context budget. Drop the context size or move to a smaller quant. The numbers will tell you where the bottleneck actually lives.

When to reach for imatrix or Q8 anyway

I find K-quants excellent, but they're not the end of the road. IQ and imatrix quantization levels use calibration data to map weight distributions before compression. Instead of treating all layers uniformly, they identify high-magnitude weights that carry disproportionate semantic load and preserve them at higher precision, while aggressively compressing the rest. The result is a better perplexity-to-VRAM ratio, often matching or exceeding standard K-quant quality at equivalent bit depths. You'll feel the friction, though. Imatrix quants aren't drop-in replacements for every workflow. You need a representative calibration dataset that matches your target distribution. Running llama-quantize with calibration adds steps to your pipeline. If you're just spinning up a model for chat or code assistance, the overhead isn't justified. If you're baking a model into a production service where consistent perplexity matters across thousands of requests, the calibration step pays for itself. I still think Q8_0 is worth the VRAM tax in specific scenarios: - Fine-tuning bases where small weight errors amplify during LoRA/QLoRA training - Strict JSON/schema generation where token probability shifts cause structural breaks - Mathematical reasoning or symbolic tasks where perplexity deltas compound over long chains - When you're running evaluation suites and need a stable baseline Outside those cases, Q8_0 is usually a solution to a problem you don't have. The rule is simple: pick the smallest quant that keeps your perplexity delta below your task's tolerance threshold, then maximize context length with the remaining VRAM.

Details
Calibration workflow overview Imatrix quantization requires a text corpus representative of your inference workload. You pass it through llama-quantize with the --imatrix flag, which generates a frequency matrix of weight activations. The quantizer then uses that matrix to assign higher precision to frequently activated or high-magnitude weights. Community scripts typically use a mix of Wikipedia, code repositories, and synthetic prompts. If your workload is heavily domain-specific (medical, legal, financial), generic calibration data will underperform. Always validate perplexity on your actual prompt distribution after quantizing.

You'll find context length eats VRAM faster than precision does. Before you debate Q4 versus Q8, calculate how many tokens of history you actually need in memory. The right quant is the one that leaves room for the conversation, not the one that squeezes the most bits into your GPU.

local-llmunified-memorybandwidthinferenceggufquantizationllama-cppvram