# Apple Silicon

Unified memory architecture for LLMs

Apple Silicon (M1/M2/M3/M4 chips) uses a fundamentally different architecture from discrete GPUs. Instead of separate pools of CPU and GPU memory, Apple uses unified memory that both processors can access. This makes it possible to run models that wouldn't fit in typical GPU VRAM.
## The Unified Memory Advantage

Traditional (Discrete GPU):

```
┌──────────────┐        ┌──────────────┐
│     CPU      │        │     GPU      │
│  System RAM  │◄──────►│     VRAM     │
│   (64GB+)    │  PCIe  │    (24GB)    │
└──────────────┘        └──────────────┘
```

Model must fit in VRAM (24GB limit).

Apple Silicon (Unified):

```
┌─────────────────────────────────────┐
│           Unified Memory            │
│            (up to 192GB)            │
│                                     │
│     CPU ◄─────────────────► GPU     │
│              800 GB/s               │
└─────────────────────────────────────┘
```

Model can use ALL memory (up to 192GB).
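How much fits is easy to estimate: a quantized model's weights take roughly params × bits / 8 bytes, plus working overhead for the KV cache and runtime. A rough sketch of that check, assuming a 1.2× overhead factor (an assumption, not a measured constant):

```python
def fits_in_memory(params_b: float, bits: int, memory_gb: float,
                   overhead: float = 1.2) -> bool:
    """Rough fit check: weights take params * bits / 8 bytes, padded for KV cache/runtime."""
    weights_gb = params_b * bits / 8  # 1B params at 8 bits is ~1GB
    return weights_gb * overhead <= memory_gb

# Llama 3 70B at 4-bit is ~35GB of weights:
print(fits_in_memory(70, 4, 24))  # False -- beyond a 24GB GPU
print(fits_in_memory(70, 4, 64))  # True  -- comfortable on a 64GB Mac
```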
## Apple Silicon Specs
| Chip | Max Memory | Bandwidth | GPU Cores | Found In |
|---|---|---|---|---|
| M1 | 16GB | 68 GB/s | 7-8 | MacBook Air/Pro (2020) |
| M1 Pro | 32GB | 200 GB/s | 14-16 | MacBook Pro 14"/16" |
| M1 Max | 64GB | 400 GB/s | 24-32 | MacBook Pro, Mac Studio |
| M1 Ultra | 128GB | 800 GB/s | 48-64 | Mac Studio |
| M2 | 24GB | 100 GB/s | 8-10 | MacBook Air (2022) |
| M2 Pro | 32GB | 200 GB/s | 16-19 | MacBook Pro, Mac Mini |
| M2 Max | 96GB | 400 GB/s | 30-38 | MacBook Pro, Mac Studio |
| M2 Ultra | 192GB | 800 GB/s | 60-76 | Mac Studio, Mac Pro |
| M3 | 24GB | 100 GB/s | 8-10 | MacBook Air/Pro |
| M3 Pro | 36GB | 150 GB/s | 14-18 | MacBook Pro |
| M3 Max | 128GB | 400 GB/s | 30-40 | MacBook Pro |
| M4 | 32GB | 120 GB/s | 10 | iPad Pro, Mac Mini |
| M4 Pro | 64GB | 273 GB/s | 16-20 | Mac Mini, MacBook Pro |
## Performance by Model Size
| Model | M2 Ultra (192GB) | M3 Max (128GB) | M4 Pro (64GB) |
|---|---|---|---|
| Llama 3 8B Q4 | ~100+ tok/s | ~80 tok/s | ~60 tok/s |
| Llama 3 70B Q4 | ~18-22 tok/s | ~10-14 tok/s | ~8-10 tok/s |
| Llama 3 70B Q8 | ~12-15 tok/s | Doesn't fit | Doesn't fit |
| Mixtral 8x7B Q4 | ~25-30 tok/s | ~15-20 tok/s | ~10-12 tok/s |
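These figures track memory bandwidth closely: single-stream decoding reads every weight once per generated token, so bandwidth divided by model size gives a rough ceiling on tokens per second, and real throughput lands near but below it. A sketch of that back-of-envelope, assuming an approximate ~40GB file size for a 70B Q4 model:

```python
def decode_ceiling_tps(bandwidth_gbs: float, model_gb: float) -> float:
    """Bandwidth-bound ceiling: each generated token streams all weights once."""
    return bandwidth_gbs / model_gb

# Llama 3 70B Q4 is roughly 40GB on disk:
print(decode_ceiling_tps(800, 40))  # M2 Ultra: ~20 tok/s ceiling
print(decode_ceiling_tps(400, 40))  # M3 Max:  ~10 tok/s ceiling
```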
## Software Support

### llama.cpp + Metal

Best performance: a native Metal backend, actively optimized.

```bash
# Run with Metal acceleration; -ngl 99 offloads all layers to the GPU
./main -m model.gguf -ngl 99
```

(Newer llama.cpp builds rename the `main` binary to `llama-cli`.)
### Ollama

Uses the llama.cpp Metal backend. Just works.

```bash
ollama run llama3:70b
```
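Ollama also serves a local REST API on port 11434, which is convenient for scripting. A minimal sketch using the third-party `requests` package:

```python
import requests

# Ollama listens on localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:70b", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```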
### MLX

Apple's native ML framework, optimized specifically for Apple Silicon.

```bash
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-3-70B-Instruct-4bit
```
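mlx-lm also exposes a Python API. A minimal sketch, using the same model repo as the command above:

```python
from mlx_lm import load, generate

# Downloads the weights from Hugging Face on first use
model, tokenizer = load("mlx-community/Llama-3-70B-Instruct-4bit")

text = generate(model, tokenizer,
                prompt="Explain unified memory in one sentence.",
                max_tokens=100)
print(text)
```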
### PyTorch MPS

PyTorch's Metal Performance Shaders backend. It works, but for inference it is less optimized than llama.cpp or MLX.
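Selecting the MPS device follows the standard PyTorch pattern:

```python
import torch

# Prefer the Metal backend when available, fall back to CPU otherwise
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(1024, 1024, device=device)
y = x @ x  # runs on the GPU via Metal Performance Shaders
print(y.device)
```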
## Choosing an Apple Machine

```
What models do you need?
│
├── 7B-13B models
│   ├── Mac Mini M4 (~$800): Budget friendly, 32GB
│   └── MacBook Air M3 (~$1,500): Portable, 24GB
│
├── 13B-34B models
│   ├── Mac Mini M4 Pro 64GB (~$1,800): Best value
│   └── MacBook Pro M3 Pro (~$2,500): Portable
│
├── 70B models (Q4)
│   ├── MacBook Pro M3 Max 128GB (~$4,500): Portable 70B!
│   └── Mac Studio M2 Max 96GB (~$3,000): Desktop
│
└── 70B+ or long context
    └── Mac Studio M2 Ultra 192GB (~$6,000+): Maximum capacity
```
## Apple vs Discrete GPU
| Aspect | Apple Silicon | NVIDIA GPU |
|---|---|---|
| Max memory | 192GB | 24GB (consumer) |
| Bandwidth | Up to 800 GB/s | Up to 1,008 GB/s |
| Power draw | 30-60W typical | 300-450W |
| Noise | Silent to quiet | Loud under load |
| Cost for 70B | ~$4,500+ (M3 Max 128GB) | ~$1,600 (2× 3090 used) |
| Software | Good (llama.cpp, MLX) | Best (CUDA ecosystem) |
| Other uses | Great general computer | Gaming, rendering |
## Tips for Apple Silicon

### Maximize Performance
- Use llama.cpp or MLX for best Metal optimization
- Close other apps — unified memory is shared with everything
- Monitor memory pressure — swapping kills performance (see the sketch after this list)
- Keep macOS updated — Metal optimizations improve regularly
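The memory check is easy to automate before loading a large model. A sketch using the third-party `psutil` package (`pip install psutil`); the 1.2× headroom factor is an assumption:

```python
import psutil

# Unified memory is shared with macOS and every open app,
# so check what is actually free, not the machine's total
avail_gb = psutil.virtual_memory().available / 1024**3
model_gb = 40  # e.g. Llama 3 70B Q4

if avail_gb < model_gb * 1.2:  # assumed headroom for KV cache and runtime
    print(f"Only {avail_gb:.0f}GB available; expect swapping and slow generation.")
```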
### Common Pitfalls
- Don't compare TFLOPS — Apple chips have lower TFLOPS but unified memory changes the game
- Memory pressure — if you see yellow/red in Activity Monitor, the model is too big
- Background apps — Safari with many tabs can use 10GB+ of your LLM memory
## The Sweet Spot
Apple Silicon shines when:
- You need to run models that don't fit in 24GB VRAM
- Silence matters (bedroom, office)
- You want a great general-purpose computer too
- Portability matters (MacBook running 70B on battery)
- Power costs matter (~60W vs 450W+ for a GPU rig)
NVIDIA wins when:
- Models fit in 24GB (faster bandwidth per $)
- Budget is tight ($800 used 3090 vs $4,500 Mac)
- You need CUDA-specific software
- You train or fine-tune models (Apple Silicon is, in practice, inference-only)