Bytecairn

Archive

Writing

Everything I've published — on gaming, displays, AI, hardware, programming, and agents. The hype, examined. Filter by tag, or just scroll.

4 pieces tagged #local-llm

2026 (4)

AI Hardware & Infrastructure · Jun 7, 2026 · 10 min
VRAM capacity and memory bandwidth, not raw compute, dictate which 2026 models actually run locally
The industry’s shift toward Mixture-of-Experts architectures and 128K context windows has turned GPU memory into a hard ceiling, forcing users to choose between aggressive quantization, slower unified memory, or NVIDIA’s $2,000+ 32GB cards.

AI · Jun 7, 2026 · 7 min
Stop Guessing GGUF Quants: A VRAM-to-Precision Lookup Table for Local LLMs
Consumer GPUs are bandwidth-bound, not precision-bound. Here’s the exact VRAM-to-quant lookup table that maximizes tokens/sec without crossing the perceptible quality threshold.

AI · Jun 6, 2026 · 6 min
Ollama Isn't a Competitor to vLLM (And Neither Is llama.cpp)
Stop comparing local LLM engines on tokens per second. Pick the one that matches your actual bottleneck: setup friction, KV-cache scheduling, or memory bandwidth.

AI · Jun 6, 2026 · 6 min
RTX Spark: 128GB Unified Memory Won't Fix the Bandwidth Bottleneck
NVIDIA's RTX Spark packs 128GB of unified memory, but ~300 GB/s bandwidth caps inference throughput: here's the math on what you can actually run locally versus the cloud.