Model Compression Tutorial
Knowledge Distillation & Quantization — Toy Examples for Security Researchers
Part 1: Knowledge Distillation
Goal: Train a small "student" model to mimic a large "teacher" model's behavior.
The Setup: Network Traffic Classification
Imagine we have a large teacher model that classifies network packets into three categories:
- Normal — benign traffic
- Suspicious — worth investigating
- Malicious — definite threat
Step 1: Collect Teacher's Soft Outputs
We run 4 sample packets through our big teacher model:
| Input | P(Normal) | P(Suspicious) | P(Malicious) | Hard Label |
| --- | --- | --- | --- | --- |
| Packet A: HTTP GET /index.html | 0.92 | 0.06 | 0.02 | Normal |
| Packet B: SSH to unknown IP | 0.15 | 0.70 | 0.15 | Suspicious |
| Packet C: Encoded PowerShell payload | 0.02 | 0.18 | 0.80 | Malicious |
| Packet D: Port scan pattern | 0.05 | 0.45 | 0.50 | Malicious |
Step 2: Why Soft Labels Beat Hard Labels
Key Insight: Soft probabilities contain relationships between classes that hard labels throw away.
Look at Packet D (port scan):
Hard Label
[0, 0, 1] → "Malicious"
Information: just the winning class
(a one-hot label has zero entropy)
Soft Label
[0.05, 0.45, 0.50] → "Malicious"
Information: ~1.2 bits of entropy
+ class relationships!
The soft label tells the student:
- This is barely more malicious than suspicious (0.50 vs 0.45)
- It's definitely not normal (0.05)
- Port scans live on the suspicious/malicious boundary
With hard labels, the student just learns "port scan = malicious" with no nuance.
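As a sanity check on the entropy figure above, here is a minimal sketch (assuming NumPy) that computes the Shannon entropy of Packet D's hard and soft labels:
# Minimal sketch: entropy of hard vs. soft labels for Packet D (assumes NumPy)
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

hard = [0.0, 0.0, 1.0]                 # one-hot "Malicious"
soft = [0.05, 0.45, 0.50]              # teacher's soft output

print(entropy_bits(hard))              # 0.0: says nothing about runner-up classes
print(entropy_bits(soft))              # ~1.23 bits: encodes the suspicious/malicious split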
Step 3: The Distillation Loss Function
# Simplified distillation training loop (PyTorch-style; assumes an `optimizer`
# over the student's parameters)
import torch
import torch.nn.functional as F

T = 2.0  # temperature; T=2 or 3 typically

for packet in training_data:
    # Get teacher's soft predictions (softened with temperature, no gradients)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(packet) / T, dim=-1)
    # Get student's predictions as log-probabilities at the same temperature
    student_log_probs = F.log_softmax(student(packet) / T, dim=-1)
    # Loss = how different is the student from the teacher? KL(teacher || student);
    # the T**2 factor keeps gradient magnitudes comparable across temperatures
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T**2
    # Update student to minimize the difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Step 4: Temperature Scaling
We use "temperature" (T) to soften the probabilities further:
| Packet D | T=1 (normal) | T=2 (soft) | T=4 (very soft) |
| --- | --- | --- | --- |
| P(Normal) | 0.05 | 0.15 | 0.22 |
| P(Suspicious) | 0.45 | 0.40 | 0.38 |
| P(Malicious) | 0.50 | 0.45 | 0.40 |
Higher temperature → probabilities spread out more → student learns finer distinctions between classes.
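A minimal sketch of the softening itself (assuming NumPy; the logits are recovered from the T=1 probabilities, so the outputs land close to, but not exactly on, the rounded values in the table):
# Sketch: temperature-scaled softmax for Packet D (assumes NumPy)
import numpy as np

def soften(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits_d = np.log([0.05, 0.45, 0.50])  # logits recovered up to an additive constant
for T in (1, 2, 4):
    print(T, np.round(soften(logits_d, T), 2))
# T=1 reproduces [0.05, 0.45, 0.5]; T=2 and T=4 come out close to the table,
# with small differences due to rounding in the illustrative numbers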
Step 5: What the Student Learns
TEACHER (100M params) STUDENT (5M params)
━━━━━━━━━━━━━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
┌─────────────────┐ ┌───────────────┐
│ 12 transformer │ distill │ 3 transformer │
│ layers │ ───────────▶ │ layers │
│ │ (learns │ │
│ 768 hidden dim │ behavior) │ 256 hidden │
└─────────────────┘ └───────────────┘
Speed: 50ms/packet Speed: 5ms/packet
Memory: 400MB Memory: 20MB
Accuracy: 94% Accuracy: 89%
Capability Loss: The student is smaller, so it can't capture everything. It typically loses performance on rare/edge cases first. In security terms: common malware detection stays good, novel APT detection degrades.
Part 2: Quantization
Goal: Keep the same model architecture, but use fewer bits to represent weights.
The Setup: A Tiny Weight Matrix
Here's a 3×3 weight matrix from a model layer, stored as 32-bit floats:
# Original weights (FP32 — 32 bits per value)
W = [
[ 0.127, -0.891, 0.342 ],
[ 0.505, 0.012, -0.234 ],
[ -0.667, 0.789, 0.001 ]
]
Memory: 9 values × 32 bits = 288 bits (36 bytes)
Step 1: Find the Range
min(W) = -0.891
max(W) = 0.789
range = 0.789 - (-0.891) = 1.680
Step 2: Calculate Scale and Zero Point
For INT8 quantization (values from -128 to 127, or 0 to 255 for unsigned):
# Using unsigned INT8 (0 to 255)
scale = range / 255 = 1.680 / 255 = 0.00659
zero_point = round(-min / scale) = round(0.891 / 0.00659) = 135
# Quantization formula:
q(x) = round(x / scale) + zero_point
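The same arithmetic as a minimal runnable sketch (assuming NumPy); it derives the scale and zero point from W's range and quantizes every weight, reproducing the per-weight table in Step 3:
# Minimal sketch: asymmetric (affine) quantization to unsigned INT8 (assumes NumPy)
import numpy as np

W = np.array([[ 0.127, -0.891,  0.342],
              [ 0.505,  0.012, -0.234],
              [-0.667,  0.789,  0.001]], dtype=np.float32)

w_min, w_max = float(W.min()), float(W.max())
scale = (w_max - w_min) / 255            # ~0.00659
zero_point = int(round(-w_min / scale))  # 135

def quantize(x):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

W_q = quantize(W)
print(scale, zero_point)
print(W_q)                               # matches the quantized values in Step 3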
Step 3: Quantize Each Weight
| Original (FP32) | Calculation | Quantized (INT8) |
| --- | --- | --- |
| 0.127 | round(0.127/0.00659) + 135 = 154 | 154 |
| -0.891 | round(-0.891/0.00659) + 135 = 0 | 0 |
| 0.342 | round(0.342/0.00659) + 135 = 187 | 187 |
| 0.505 | round(0.505/0.00659) + 135 = 212 | 212 |
| 0.012 | round(0.012/0.00659) + 135 = 137 | 137 |
| -0.234 | round(-0.234/0.00659) + 135 = 99 | 99 |
| -0.667 | round(-0.667/0.00659) + 135 = 34 | 34 |
| 0.789 | round(0.789/0.00659) + 135 = 255 | 255 |
| 0.001 | round(0.001/0.00659) + 135 = 135 | 135 |
# Quantized weights (INT8 — 8 bits per value)
W_q = [
    [ 154,   0, 187 ],
    [ 212, 137,  99 ],
    [  34, 255, 135 ]
]
Memory: 9 values × 8 bits = 72 bits (9 bytes)
+ 2 floats for scale/zero_point = 8 bytes
Total: 17 bytes (vs 36 bytes) = 53% smaller
Step 4: Dequantize (Reconstruct)
When we need to use the weights, we convert back:
# Dequantization formula:
x_reconstructed = (q - zero_point) × scale
# Example: reconstructing 0.127
(154 - 135) × 0.00659 = 19 × 0.00659 = 0.125 (original: 0.127)
# Example: reconstructing 0.001
(135 - 135) × 0.00659 = 0 × 0.00659 = 0.000 (original: 0.001)
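Continuing the sketch from Step 2 (same assumed `scale`, `zero_point`, `W`, and `W_q`), reconstruction and error measurement look like this:
# Sketch: dequantize and measure the reconstruction error (assumes NumPy)
def dequantize(q):
    return (q.astype(np.float32) - zero_point) * scale

W_hat = dequantize(W_q)
print(np.round(W_hat, 3))           # reconstructed weights
print(np.abs(W - W_hat).max())      # worst-case absolute error, roughly scale / 2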
Step 5: Precision Loss Analysis
| Original | Reconstructed | Error | % Error |
| --- | --- | --- | --- |
| 0.127 | 0.125 | 0.002 | 1.6% |
| -0.891 | -0.890 | 0.001 | 0.1% |
| 0.342 | 0.343 | 0.001 | 0.3% |
| 0.505 | 0.507 | 0.002 | 0.4% |
| 0.012 | 0.013 | 0.001 | 8.3% |
| -0.234 | -0.237 | 0.003 | 1.4% |
| -0.667 | -0.665 | 0.002 | 0.3% |
| 0.789 | 0.791 | 0.002 | 0.3% |
| 0.001 | 0.000 | 0.001 | 100% |
Critical Observation: Small values near zero suffer the most. The value 0.001 became 0.000 — a 100% error. This is why quantization can cause problems: weights that encode subtle features get rounded away.
Visual: The Quantization Grid
FP32: Continuous (infinite precision)
─────────────────────────────────────────────────
···|···|···|···|···|···|···|···|···|···|···|···
-0.9 -0.6 -0.3 0.0 0.3 0.6 0.9
INT8: Discrete (256 possible values)
─────────────────────────────────────────────────
  |       |      |      |      |      |      |
-0.891  -0.56  -0.23   0.10   0.43   0.76  0.789
  ↑                   ↑
 min             zero_point
(q=0)      (q=135, real value 0.0)
Any value between grid lines gets ROUNDED to nearest grid point.
Grid spacing = scale = 0.00659
Side-by-Side Comparison
| Aspect | Distillation | Quantization |
| --- | --- | --- |
| What changes | Model architecture (fewer layers/params) | Number precision (fewer bits) |
| Typical compression | 10-100× fewer parameters | 2-4× smaller file size |
| Speed improvement | Large (less computation) | Moderate (faster math ops) |
| Training required | Yes, full training run | Minimal or none |
| Capability loss | Variable (5-15% typical) | Small at INT8 (<2%) |
| Where it hurts | Rare cases, complex reasoning | Subtle features, edge cases |
Can they be combined? Yes! Distill first, then quantize for maximum compression:
FULL MODEL DISTILLED DISTILLED + QUANTIZED
━━━━━━━━━━━━ ━━━━━━━━━━ ━━━━━━━━━━━━━━━━━━━━━
100M params 10M params 10M params
FP32 weights FP32 weights INT8 weights
Size: 400MB Size: 40MB Size: 10MB
Speed: 50ms Speed: 8ms Speed: 3ms
Accuracy: 94% Accuracy: 89% Accuracy: 87%
────────────────────────────────────────────────────────────────────────────▶
More compression, more speed, less accuracy
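As a rough sketch of that combination (assuming PyTorch; `student` stands in for the already-trained distilled model from Part 1), post-training dynamic quantization can be applied to the student in a few lines:
# Sketch: quantize an already-distilled student (assumes PyTorch; `student` is
# the hypothetical distilled model from Part 1)
import torch

student.eval()

# Dynamic post-training quantization: Linear-layer weights become INT8,
# activations stay floating point and are quantized on the fly at inference
student_int8 = torch.quantization.quantize_dynamic(
    student, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(student_int8.state_dict(), "student_int8.pt")  # noticeably smaller on disk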
Security Research Implications
When to Use Compressed Models
✓ Good Fit
- High-volume log triage
- First-pass malware scanning
- Embedding generation for similarity
- Edge/endpoint deployment
- Real-time classification
✗ Bad Fit
- Novel threat detection (APT, 0-day)
- Final verdict decisions
- Anything needing explanation
- Cases where FP cost is high
- Adversarial environments
Red Team Considerations
Quantization artifacts are exploitable. If an attacker knows you're running an INT8 model, they can craft inputs near decision boundaries where the quantized model's verdict diverges from the full-precision model's, producing inconsistent behavior between the two.
Distilled models have blind spots. The student didn't see the teacher's training data — it only saw outputs. Edge cases the teacher handled via memorization may be lost entirely.
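One practical way to probe both issues: perturb inputs slightly and flag the cases where the full-precision and compressed models disagree. A minimal sketch (assuming PyTorch classifiers and a batch of feature tensors `x`; all names are illustrative):
# Sketch: find inputs where the FP32 and INT8 models disagree (names illustrative)
import torch

def disagreements(model_fp32, model_int8, x, n_trials=100, eps=0.01):
    # Perturb x slightly and collect inputs whose predicted class differs
    # between the full-precision and quantized models
    diverging = []
    with torch.no_grad():
        for _ in range(n_trials):
            x_pert = x + eps * torch.randn_like(x)
            pred_fp32 = model_fp32(x_pert).argmax(dim=-1)
            pred_int8 = model_int8(x_pert).argmax(dim=-1)
            mask = pred_fp32 != pred_int8
            if mask.any():
                diverging.append(x_pert[mask])
    return torch.cat(diverging) if diverging else torch.empty(0)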
Validation Checklist
□ Test on YOUR data distribution, not public benchmarks
□ Check calibration (are confidence scores meaningful?)
□ Measure performance on rare/edge cases specifically
□ Test adversarial robustness (craft inputs near decision boundaries)
□ Compare FP/FN rates to full model on held-out threats (see the sketch below)
□ Keep full model available for escalation/verification
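For that FP/FN comparison specifically, a minimal sketch (assuming NumPy; `y_true`, `full_preds`, and `compressed_preds` are placeholder arrays of 0/1 labels standing in for your held-out threat set, 1 = threat):
# Sketch: compare false-positive / false-negative rates of the full and
# compressed models on a held-out threat set (placeholder arrays, 1 = threat)
import numpy as np

y_true = np.array([0, 0, 1, 1, 1])            # placeholder held-out labels
full_preds = np.array([0, 0, 1, 1, 0])        # placeholder full-model verdicts
compressed_preds = np.array([0, 1, 1, 0, 0])  # placeholder compressed-model verdicts

def fp_fn_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = float(np.mean(y_pred[y_true == 0] == 1))   # benign flagged as threat
    fn = float(np.mean(y_pred[y_true == 1] == 0))   # threat missed
    return fp, fn

print("full model  FP/FN:", fp_fn_rates(y_true, full_preds))
print("compressed  FP/FN:", fp_fn_rates(y_true, compressed_preds))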