Can you run AI locally?
Local models run on your hardware, offering privacy and no per-query cost. But they're fundamentally less capable than cloud frontier models, and optimizations can't close that gap.
Why can't I just run GPT-4 on my laptop?
You can run AI locally. Excellent open models work on consumer hardware. But here's the uncomfortable truth: local models are not frontier models made portable. They're fundamentally smaller, less capable models, and that gap cannot be closed by optimization alone.
Understanding why requires understanding what makes frontier models expensive to run.
The scale problem
Frontier models are enormous:
| Model | Parameters | Memory (FP16) | Hardware |
|-------|-----------|---------------|----------|
| GPT-4 (est.) | ~1.7T MoE | ~400GB+ | Data center clusters |
| Claude 3 Opus | Unknown | Massive | Cloud infrastructure |
| Llama 3 70B | 70B | ~140GB | 2-4 high-end GPUs |
| Local-friendly | 7-13B | 14-26GB | Single consumer GPU |
A model with 1 trillion parameters simply cannot fit in a laptop's memory: at FP16, the weights alone would need roughly 2TB. This isn't a software limitation. It's physics.
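The arithmetic is simple: weight memory is parameter count times bytes per parameter. A minimal sketch (dense models only; KV cache, activations, and runtime overhead are ignored, so real requirements are higher):

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Figures are illustrative approximations, not measured requirements.

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str = "FP16") -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for name, size_b in [("7B", 7), ("13B", 13), ("70B", 70), ("1T (dense)", 1000)]:
    print(f"{name:>10}: {weight_memory_gb(size_b):>6.0f} GB at FP16, "
          f"{weight_memory_gb(size_b, 'INT4'):>5.0f} GB at INT4")
```

Even at aggressive INT4 quantization, a dense 1-trillion-parameter model still needs on the order of 500GB for weights alone, far beyond any consumer device.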
Why optimization can't close the gap
"But wait," you might think, "can't we optimize frontier models to run locally?" Here's why that doesn't work:
1. Cloud providers already optimize aggressively
OpenAI, Anthropic, and Google aren't running naive, unoptimized code. They use:
- Quantization (serving at INT8 or INT4)
- Optimized inference kernels
- Custom hardware (TPUs, specialized GPUs)
- Sophisticated batching and caching
- Every trick in the book
If there were a way to run their models cheaply, they'd already be using it. Their costs would drop. The fact that they need massive infrastructure tells you: there's no hidden efficiency to unlock.
2. Optimizations have costs
Every optimization technique trades capability for efficiency:
- Quantization reduces precision → subtle degradation
- Distillation compresses knowledge → capability loss
- Pruning removes weights → reduced capacity
You can't compress 1 trillion parameters into 7 billion and keep the same capability. Information is lost. The small model is a lossy approximation of the large one.
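As a toy illustration of that loss, here is a minimal symmetric quantize-dequantize round trip on random weights; the reconstruction error is the information you give up. This is a sketch with illustrative numbers, not how any production quantizer works:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1        # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax    # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                   # back to float, with rounding error baked in

for bits in (8, 4, 2):
    err = np.abs(weights - quantize_roundtrip(weights, bits)).mean()
    print(f"{bits}-bit: mean absolute reconstruction error = {err:.6f}")
```

The error grows as the bit width shrinks, and in a real model those per-weight errors compound across billions of parameters and dozens of layers.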
3. Scale IS the capability
Scaling laws are clear: capability comes from parameters, data, and compute. A 7B model cannot match a 70B model because it has 10× fewer parameters to encode patterns.
This isn't a temporary gap that better software will fix. It's fundamental to how neural networks work.
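Published scaling-law fits take the form of a power law in parameter count N and training tokens D. The sketch below uses that Chinchilla-style form with purely illustrative constants (the constant values are an assumption for demonstration, not a fit to any real model family):

```python
def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Chinchilla-style loss estimate: L(N, D) = E + A / N**alpha + B / D**beta.

    Constants are illustrative placeholders in the spirit of published fits.
    """
    return e + a / n_params**alpha + b / n_tokens**beta

# Same training data, 10x difference in parameters:
tokens = 2e12
print(predicted_loss(7e9, tokens))   # 7B model: higher predicted loss
print(predicted_loss(70e9, tokens))  # 70B model: lower predicted loss
```

No amount of inference-time optimization changes N in that formula; the smaller model starts from a worse loss and stays there.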
The qualitative gap
Local models aren't just "slightly worse." For certain tasks, they're categorically different:
- Complex reasoning: Frontier models can chain multi-step logic; local models lose the thread
- Rare knowledge: Local models hallucinate more on niche topics
- Instruction following: Subtle nuances get missed
- Long context: Smaller models degrade faster with length
- Emergent capabilities: Some abilities only exist at scale
A 7B model is not a 70B model that runs slower. It's a different, less capable model.
The hardware landscape
What can run what (a rough fit check follows the list):
High-end consumer GPU (24GB VRAM):
- 7B-13B models at full precision
- Up to 30B models at INT4
- Good for serious local use
Standard gaming GPU (8-12GB VRAM):
- 7B models at INT8
- Limited capabilities
CPU-only (no GPU):
- Very slow inference
- 7B models possible but frustrating
- Quantization essential
Apple Silicon (M1/M2/M3):
- Unified memory helps
- 7B-13B models run well
- Surprising local capability for the form factor
Phones and edge devices:
- Sub-7B models only
- Very constrained
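A rough way to turn these tiers into a yes/no answer is to compare quantized weight size, plus some runtime headroom, against available memory. A minimal sketch; the 1.2× overhead factor is an assumption, since real overhead depends on context length, batch size, and the inference engine:

```python
# Rough "will it fit?" check for dense models.

BITS_PER_PARAM = {"FP16": 16, "INT8": 8, "INT4": 4}

def fits_in_memory(params_billions: float, precision: str,
                   memory_gb: float, overhead: float = 1.2) -> bool:
    """True if quantized weights plus assumed overhead fit in memory_gb."""
    weights_gb = params_billions * BITS_PER_PARAM[precision] / 8
    return weights_gb * overhead <= memory_gb

print(fits_in_memory(7, "FP16", 24))   # 7B at FP16 on a 24GB GPU -> True
print(fits_in_memory(30, "INT4", 24))  # 30B at INT4 on a 24GB GPU -> True
print(fits_in_memory(70, "INT4", 24))  # 70B at INT4 on a 24GB GPU -> False
```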
Running local models in practice
Popular tools for local deployment:
- Ollama: Simple CLI and API for running models locally
- LM Studio: GUI for exploring and running models
- llama.cpp: Efficient C++ inference, runs anywhere
- vLLM: High-performance serving for larger setups
- text-generation-webui: Feature-rich web interface
Most of these tools use quantized models (commonly in the GGUF format) for memory efficiency.
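As one concrete example, Ollama exposes a local HTTP API (by default on port 11434). A minimal non-streaming request, assuming you've already pulled a model named `llama3`; substitute whatever model you actually have installed:

```python
import json
import urllib.request

# Call a locally running Ollama server. Assumes `ollama pull llama3`
# has been run beforehand and the server is listening on the default port.
payload = {
    "model": "llama3",
    "prompt": "Explain quantization in one sentence.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```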
The practical choice
When to use local:
- Privacy is paramount
- Simple, high-volume tasks
- Offline requirements
- Fixed cost budget
- Learning and experimentation
When to use API/cloud:
- Frontier capability needed
- Complex reasoning
- Rare knowledge required
- Occasional, high-value queries
- Latest models
Many workflows use both: local for routine work, API calls for the hard parts. The choice isn't ideological. It's practical.
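A hybrid setup can be as simple as a router that sends routine prompts to the local model and escalates to a cloud API when a task looks hard. A minimal sketch; `call_local` and `call_cloud` are hypothetical placeholders for whichever clients you actually use, and the "looks hard" heuristic is purely illustrative:

```python
# Hypothetical routing sketch: neither helper below is a real library call.

HARD_TASK_HINTS = ("prove", "multi-step", "legal", "diagnose", "architecture")

def call_local(prompt: str) -> str:
    raise NotImplementedError("wrap your local model here, e.g. an Ollama request")

def call_cloud(prompt: str) -> str:
    raise NotImplementedError("wrap your hosted frontier-model client here")

def route(prompt: str) -> str:
    """Send routine prompts to the local model, hard ones to the cloud."""
    looks_hard = len(prompt) > 2000 or any(h in prompt.lower() for h in HARD_TASK_HINTS)
    return call_cloud(prompt) if looks_hard else call_local(prompt)
```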
The future
The gap between local and frontier may shrink through:
- More efficient architectures
- Better distillation techniques
- Improved hardware (faster, more memory)
- Specialized models that excel in narrow domains
But as local models improve, so do frontier models. The frontier keeps moving. Local will always offer a trade-off: accessibility and privacy in exchange for capability.