Model Compression Tutorial
Knowledge Distillation & Quantization — Toy Examples for Security Researchers
Part 1: Knowledge Distillation
Goal: Train a small "student" model to mimic a large "teacher" model's behavior.
The Setup: Network Traffic Classification
Imagine we have a large teacher model that classifies network packets into three categories:
- Normal — benign traffic
- Suspicious — worth investigating
- Malicious — definite threat
Step 1: Collect Teacher's Soft Outputs
We run 4 sample packets through our big teacher model:
| Input | P(Normal) | P(Suspicious) | P(Malicious) | Hard Label |
| --- | --- | --- | --- | --- |
| Packet A: HTTP GET /index.html | 0.92 | 0.06 | 0.02 | Normal |
| Packet B: SSH to unknown IP | 0.15 | 0.70 | 0.15 | Suspicious |
| Packet C: Encoded PowerShell payload | 0.02 | 0.18 | 0.80 | Malicious |
| Packet D: Port scan pattern | 0.05 | 0.45 | 0.50 | Malicious |
Step 2: Why Soft Labels Beat Hard Labels
Key Insight: Soft probabilities contain relationships between classes that hard labels throw away.
Look at Packet D (port scan):
Hard Label
[0, 0, 1] → "Malicious"
Information: just the winning class
(a one-hot label has zero entropy)
Soft Label
[0.05, 0.45, 0.50] → "Malicious"
Information: ~1.2 bits of entropy
+ class relationships!
The soft label tells the student:
- This is barely more malicious than suspicious (0.50 vs 0.45)
- It's definitely not normal (0.05)
- Port scans live on the suspicious/malicious boundary
With hard labels, the student just learns "port scan = malicious" with no nuance.
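As a sanity check on the entropy figure above, here is a minimal sketch (assuming NumPy) that computes the Shannon entropy of Packet D's hard and soft labels:
# Minimal sketch: entropy of hard vs. soft labels for Packet D (assumes NumPy)
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

hard = [0.0, 0.0, 1.0]                 # one-hot "Malicious"
soft = [0.05, 0.45, 0.50]              # teacher's soft output

print(entropy_bits(hard))              # 0.0: says nothing about runner-up classes
print(entropy_bits(soft))              # ~1.23 bits: encodes the suspicious/malicious split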
Step 3: The Distillation Loss Function
# Simplified distillation training loop (PyTorch-style; assumes an `optimizer`
# over the student's parameters)
import torch
import torch.nn.functional as F

T = 2.0  # temperature; T=2 or 3 typically

for packet in training_data:
    # Get teacher's soft predictions (softened with temperature, no gradients)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(packet) / T, dim=-1)
    # Get student's predictions as log-probabilities at the same temperature
    student_log_probs = F.log_softmax(student(packet) / T, dim=-1)
    # Loss = how different is the student from the teacher? KL(teacher || student);
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2
    # Update student to minimize the difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Step 4: Temperature Scaling
We use "temperature" (T) to soften the probabilities further:
| Packet D | T=1 (normal) | T=2 (soft) | T=4 (very soft) |
| --- | --- | --- | --- |
| P(Normal) | 0.05 | 0.15 | 0.22 |
| P(Suspicious) | 0.45 | 0.40 | 0.38 |
| P(Malicious) | 0.50 | 0.45 | 0.40 |
Higher temperature → probabilities spread out more → student learns finer distinctions between classes.
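A minimal sketch of the softening itself (assuming NumPy; the logits are recovered from the T=1 probabilities, so the outputs land close to, but not exactly on, the rounded values in the table):
# Sketch: temperature-scaled softmax for Packet D (assumes NumPy)
import numpy as np

def soften(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits_d = np.log([0.05, 0.45, 0.50])  # logits recovered up to an additive constant
for T in (1, 2, 4):
    print(T, np.round(soften(logits_d, T), 2))
# T=1 reproduces [0.05, 0.45, 0.5]; T=2 and T=4 come out close to the table,
# with small differences due to rounding in the illustrative numbers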
Step 5: What the Student Learns
TEACHER (100M params) STUDENT (5M params)
━━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
┌─────────────────┐ ┌───────────────┐
│ 12 transformer │ distill │ 3 transformer │
│ layers │ ───────────▶ │ layers │
│ │ (learns │ │
│ 768 hidden dim │ behavior) │ 256 hidden │
└─────────────────┘ └───────────────┘
Speed: 50ms/packet Speed: 5ms/packet
Memory: 400MB Memory: 20MB
Accuracy: 94% Accuracy: 89%
Capability Loss: The student is smaller, so it can't capture everything. It typically loses performance on rare/edge cases first. In security terms: common malware detection stays good, novel APT detection degrades.
Part 2: Quantization
Goal: Keep the same model architecture, but use fewer bits to represent weights.
The Setup: A Tiny Weight Matrix
Here's a 3×3 weight matrix from a model layer, stored as 32-bit floats:
# Original weights (FP32 — 32 bits per value)
W = [
[ 0.127, -0.891, 0.342 ],
[ 0.505, 0.012, -0.234 ],
[ -0.667, 0.789, 0.001 ]
]
Memory: 9 values × 32 bits = 288 bits (36 bytes)
Step 1: Find the Range
min(W) = -0.891
max(W) = 0.789
range = 0.789 - (-0.891) = 1.680
Step 2: Calculate Scale and Zero Point
For INT8 quantization (values from -128 to 127, or 0 to 255 for unsigned):
# Using unsigned INT8 (0 to 255)
scale = range / 255 = 1.680 / 255 = 0.00659
zero_point = round(-min / scale) = round(0.891 / 0.00659) = 135
# Quantization formula:
q(x) = round(x / scale) + zero_point
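The same arithmetic as a minimal runnable sketch (assuming NumPy); it derives the scale and zero point from W's range and quantizes every weight, reproducing the per-weight table in Step 3:
# Minimal sketch: asymmetric (affine) quantization to unsigned INT8 (assumes NumPy)
import numpy as np

W = np.array([[ 0.127, -0.891,  0.342],
              [ 0.505,  0.012, -0.234],
              [-0.667,  0.789,  0.001]], dtype=np.float32)

w_min, w_max = float(W.min()), float(W.max())
scale = (w_max - w_min) / 255            # ~0.00659
zero_point = int(round(-w_min / scale))  # 135

def quantize(x):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

W_q = quantize(W)
print(scale, zero_point)
print(W_q)                               # matches the quantized values in Step 3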
Step 3: Quantize Each Weight
| Original (FP32) | Calculation | Quantized (INT8) |
| --- | --- | --- |
| 0.127 | round(0.127/0.00659) + 135 = 154 | 154 |
| -0.891 | round(-0.891/0.00659) + 135 = 0 | 0 |
| 0.342 | round(0.342/0.00659) + 135 = 187 | 187 |
| 0.505 | round(0.505/0.00659) + 135 = 212 | 212 |
| 0.012 | round(0.012/0.00659) + 135 = 137 | 137 |
| -0.234 | round(-0.234/0.00659) + 135 = 99 | 99 |
| -0.667 | round(-0.667/0.00659) + 135 = 34 | 34 |
| 0.789 | round(0.789/0.00659) + 135 = 255 | 255 |
| 0.001 | round(0.001/0.00659) + 135 = 135 | 135 |
# Quantized weights (INT8 — 8 bits per value)
W_q = [
    [ 154,   0, 187 ],
    [ 212, 137,  99 ],
    [  34, 255, 135 ]
]
Memory: 9 values × 8 bits = 72 bits (9 bytes)
+ 2 floats for scale/zero_point = 8 bytes
Total: 17 bytes (vs 36 bytes) = 53% smaller
Step 4: Dequantize (Reconstruct)
When we need to use the weights, we convert back:
# Dequantization formula:
x_reconstructed = (q - zero_point) × scale
# Example: reconstructing 0.127
(154 - 135) × 0.00659 = 19 × 0.00659 = 0.125 (original: 0.127)
# Example: reconstructing 0.001
(135 - 135) × 0.00659 = 0 × 0.00659 = 0.000 (original: 0.001)
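Continuing the sketch from Step 2 (same assumed `scale`, `zero_point`, `W`, and `W_q`), reconstruction and error measurement look like this:
# Sketch: dequantize and measure the reconstruction error (assumes NumPy)
def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

W_hat = dequantize(W_q)
print(np.round(W_hat, 3))           # reconstructed weights
print(np.abs(W - W_hat).max())      # worst-case absolute error, roughly scale / 2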
Step 5: Precision Loss Analysis
| Original | Reconstructed | Error | % Error |
| --- | --- | --- | --- |
| 0.127 | 0.125 | 0.002 | 1.6% |
| -0.891 | -0.890 | 0.001 | 0.1% |
| 0.342 | 0.343 | 0.001 | 0.3% |
| 0.505 | 0.507 | 0.002 | 0.4% |
| 0.012 | 0.013 | 0.001 | 8.3% |
| -0.234 | -0.237 | 0.003 | 1.4% |
| -0.667 | -0.665 | 0.002 | 0.3% |
| 0.789 | 0.791 | 0.002 | 0.3% |
| 0.001 | 0.000 | 0.001 | 100% |
Critical Observation: Small values near zero suffer the most. The value 0.001 became 0.000 — a 100% error. This is why quantization can cause problems: weights that encode subtle features get rounded away.
Visual: The Quantization Grid
FP32: Continuous (infinite precision)
─────────────────────────────────────────────────
···|···|···|···|···|···|···|···|···|···|···|···
-0.9 -0.6 -0.3 0.0 0.3 0.6 0.9
INT8: Discrete (256 possible values)
─────────────────────────────────────────────────
  |       |      |      |      |      |      |
-0.891  -0.56  -0.23   0.10   0.43   0.76  0.789
  ↑                   ↑
 min             zero_point
(q=0)      (q=135, real value 0.0)
Any value between grid lines gets ROUNDED to nearest grid point.
Grid spacing = scale = 0.00659
Side-by-Side Comparison
| Aspect | Distillation | Quantization |
| --- | --- | --- |
| What changes | Model architecture (fewer layers/params) | Number precision (fewer bits) |
| Typical compression | 10-100× fewer parameters | 2-4× smaller file size |
| Speed improvement | Large (less computation) | Moderate (faster math ops) |
| Training required | Yes, full training run | Minimal or none |
| Capability loss | Variable (5-15% typical) | Small at INT8 (<2%) |
| Where it hurts | Rare cases, complex reasoning | Subtle features, edge cases |
Can they be combined? Yes! Distill first, then quantize for maximum compression:
FULL MODEL DISTILLED DISTILLED + QUANTIZED
━━━━━━━━━━━━ ━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
100M params 10M params 10M params
FP32 weights FP32 weights INT8 weights
Size: 400MB Size: 40MB Size: 10MB
Speed: 50ms Speed: 8ms Speed: 3ms
Accuracy: 94% Accuracy: 89% Accuracy: 87%
────────────────────────────────────────────────────────────────────────────▶
More compression, more speed, less accuracy
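As a rough sketch of that combination (assuming PyTorch; `student` stands in for the already-trained distilled model from Part 1), post-training dynamic quantization can be applied to the student in a few lines:
# Sketch: quantize an already-distilled student (assumes PyTorch; `student` is
# the hypothetical distilled model from Part 1)
import torch

student.eval()

# Dynamic post-training quantization: Linear-layer weights become INT8,
# activations stay floating point and are quantized on the fly at inference
student_int8 = torch.quantization.quantize_dynamic(
    student, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(student_int8.state_dict(), "student_int8.pt")  # noticeably smaller on disk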
Security Research Implications
When to Use Compressed Models
✓ Good Fit
- High-volume log triage
- First-pass malware scanning
- Embedding generation for similarity
- Edge/endpoint deployment
- Real-time classification
✗ Bad Fit
- Novel threat detection (APT, 0-day)
- Final verdict decisions
- Anything needing explanation
- Cases where FP cost is high
- Adversarial environments
Red Team Considerations
Quantization artifacts are exploitable. If an attacker knows you're running an INT8 model, they can craft inputs near decision boundaries where the quantized model's verdict diverges from the full-precision model's, producing inconsistent behavior between the two.
Distilled models have blind spots. The student didn't see the teacher's training data — it only saw outputs. Edge cases the teacher handled via memorization may be lost entirely.
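One practical way to probe both issues: perturb inputs slightly and flag the cases where the full-precision and compressed models disagree. A minimal sketch (assuming PyTorch classifiers and a batch of feature tensors `x`; all names are illustrative):
# Sketch: find inputs where the FP32 and INT8 models disagree (names illustrative)
import torch

def disagreements(model_fp32, model_int8, x, n_trials=100, eps=0.01):
    # Perturb x slightly and collect inputs whose predicted class differs
    # between the full-precision and quantized models
    diverging = []
    with torch.no_grad():
        for _ in range(n_trials):
            x_pert = x + eps * torch.randn_like(x)
            pred_fp32 = model_fp32(x_pert).argmax(dim=-1)
            pred_int8 = model_int8(x_pert).argmax(dim=-1)
            mask = pred_fp32 != pred_int8
            if mask.any():
                diverging.append(x_pert[mask])
    return torch.cat(diverging) if diverging else torch.empty(0)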
Validation Checklist
□ Test on YOUR data distribution, not public benchmarks
□ Check calibration (are confidence scores meaningful?)
□ Measure performance on rare/edge cases specifically
□ Test adversarial robustness (craft inputs near decision boundaries)
□ Compare FP/FN rates to full model on held-out threats (see the sketch below)
□ Keep full model available for escalation/verification
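For that FP/FN comparison specifically, a minimal sketch (assuming NumPy; `y_true`, `full_preds`, and `compressed_preds` are placeholder arrays of 0/1 labels standing in for your held-out threat set, 1 = threat):
# Sketch: compare false-positive / false-negative rates of the full and
# compressed models on a held-out threat set (placeholder arrays, 1 = threat)
import numpy as np

y_true = np.array([0, 0, 1, 1, 1])            # placeholder held-out labels
full_preds = np.array([0, 0, 1, 1, 0])        # placeholder full-model verdicts
compressed_preds = np.array([0, 1, 1, 0, 0])  # placeholder compressed-model verdicts

def fp_fn_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = float(np.mean(y_pred[y_true == 0] == 1))   # benign flagged as threat
    fn = float(np.mean(y_pred[y_true == 1] == 0))   # threat missed
    return fp, fn

print("full model  FP/FN:", fp_fn_rates(y_true, full_preds))
print("compressed  FP/FN:", fp_fn_rates(y_true, compressed_preds))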