Archive
Writing
Tutorials, teardowns and field notes across software, firmware and the machines between them.
3 pieces tagged #local-llm
2026
3
AI
Consumer GPUs are bandwidth-bound, not precision-bound. Here’s the exact VRAM-to-quant lookup table that maximizes tokens/sec without crossing the perceptible quality threshold.
Jun 7, 2026 · 7 min read
AI
Stop comparing local LLM engines on tokens per second. Pick the one that matches your actual bottleneck: setup friction, KV-cache scheduling, or memory bandwidth.
Jun 6, 2026 · 6 min read
AI
NVIDIA's RTX Spark packs 128GB of unified memory, but ~300 GB/s of bandwidth caps inference throughput. Here's the math on what you can actually run locally versus in the cloud.
Jun 6, 2026 · 6 min read