
Model Compression Tutorial

Knowledge Distillation & Quantization — Toy Examples for Security Researchers

Part 1: Knowledge Distillation

Goal: Train a small "student" model to mimic a large "teacher" model's behavior.

The Setup: Network Traffic Classification

Imagine we have a large teacher model that classifies network packets into three categories:

  • Normal
  • Suspicious
  • Malicious

Step 1: Collect Teacher's Soft Outputs

We run 4 sample packets through our big teacher model:

Input                                    P(Normal)   P(Suspicious)   P(Malicious)   Hard Label
Packet A: HTTP GET /index.html             0.92          0.06            0.02        Normal
Packet B: SSH to unknown IP                0.15          0.70            0.15        Suspicious
Packet C: Encoded powershell payload       0.02          0.18            0.80        Malicious
Packet D: Port scan pattern                0.05          0.45            0.50        Malicious

Step 2: Why Soft Labels Beat Hard Labels

Key Insight: Soft probabilities contain relationships between classes that hard labels throw away.

Look at Packet D (port scan):

Hard Label

[0, 0, 1] → "Malicious"
Information: just the winning class; nothing about how close the other classes were.

Soft Label

[0.05, 0.45, 0.50] → "Malicious"
Information: ~1.2 bits of entropy + class relationships! (See the quick check below.)

The soft label tells the student:

  • This packet is malicious, but only barely more so than "suspicious" (0.50 vs 0.45).
  • It is almost certainly not normal traffic (P(Normal) = 0.05).
  • Suspicious and Malicious are closely related classes for inputs like this one.

With hard labels, the student just learns "port scan = malicious" with no nuance.
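
As the quick check referenced above, a few lines of plain Python reproduce the entropy figure (nothing library-specific; the probabilities are Packet D's soft label):

import math

soft_label = [0.05, 0.45, 0.50]   # Packet D's soft probabilities

# Shannon entropy in bits: H = -sum(p * log2(p))
entropy_bits = -sum(p * math.log2(p) for p in soft_label)
print(round(entropy_bits, 2))     # ≈ 1.23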

Step 3: The Distillation Loss Function

# Simplified distillation training loop
for packet in training_data:
    # Get teacher's soft predictions (with temperature)
    teacher_probs = softmax(teacher(packet) / T)   # T=2 or 3 typically

    # Get student's predictions
    student_probs = softmax(student(packet) / T)

    # Loss = how different is student from teacher?
    loss = KL_divergence(teacher_probs, student_probs)

    # Update student to minimize difference
    loss.backward()
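
For readers who want something runnable, here is a minimal sketch in PyTorch (assumed available). The temperature T=2.0, the T**2 scaling on the loss, and the placeholder names in the commented loop are illustrative choices, not taken from the original:

import torch
import torch.nn.functional as F

T = 2.0  # distillation temperature (illustrative)

def distillation_loss(student_logits, teacher_logits, T=T):
    # Soften both distributions with the same temperature
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student); the T**2 factor keeps gradients at a similar
    # magnitude across temperatures (standard Hinton-style scaling)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)

# Smoke test with random logits: batch of 4 packets, 3 classes
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
print(distillation_loss(student_logits, teacher_logits).item())

# Hypothetical training loop (teacher, student, loader, optimizer are placeholders):
# for batch in loader:
#     with torch.no_grad():
#         t_logits = teacher(batch)
#     s_logits = student(batch)
#     loss = distillation_loss(s_logits, t_logits)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()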

Step 4: Temperature Scaling

We use "temperature" (T) to soften the probabilities further:

Packet D         T=1 (normal)   T=2 (soft)   T=4 (very soft)
P(Normal)            0.05          0.15           0.22
P(Suspicious)        0.45          0.40           0.38
P(Malicious)         0.50          0.45           0.40

Higher temperature → probabilities spread out more → the relative ranking of the non-winning classes becomes easier to see, so the student learns more about how the classes relate.
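
A minimal sketch of where those numbers come from, assuming the teacher's T=1 output for Packet D is exactly [0.05, 0.45, 0.50] and its logits are recovered as the log of those probabilities (the table above rounds slightly differently):

import math

def soften(probs, T):
    # Recover logits as log(p), then re-apply softmax at temperature T
    logits = [math.log(p) for p in probs]
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

packet_d = [0.05, 0.45, 0.50]
for T in (1, 2, 4):
    print(T, [round(p, 2) for p in soften(packet_d, T)])
# T=1 → [0.05, 0.45, 0.5]
# T=2 → roughly [0.14, 0.42, 0.44]
# T=4 → roughly [0.22, 0.38, 0.39]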

Step 5: What the Student Learns

TEACHER (100M params)                  STUDENT (5M params)
━━━━━━━━━━━━━━━━━━━━━━                 ━━━━━━━━━━━━━━━━━━━━━
┌─────────────────┐                    ┌───────────────┐
│ 12 transformer  │      distill       │ 3 transformer │
│ layers          │   ───────────▶     │ layers        │
│                 │      (learns       │               │
│ 768 hidden dim  │      behavior)     │ 256 hidden    │
└─────────────────┘                    └───────────────┘

Speed:    50ms/packet                  Speed:    5ms/packet
Memory:   400MB                        Memory:   20MB
Accuracy: 94%                          Accuracy: 89%
Capability Loss: The student is smaller, so it can't capture everything. It typically loses performance on rare/edge cases first. In security terms: common malware detection stays good, novel APT detection degrades.

Part 2: Quantization

Goal: Keep the same model architecture, but use fewer bits to represent weights.

The Setup: A Tiny Weight Matrix

Here's a 3×3 weight matrix from a model layer, stored as 32-bit floats:

# Original weights (FP32 — 32 bits per value)
W = [
  [  0.127, -0.891,  0.342 ],
  [  0.505,  0.012, -0.234 ],
  [ -0.667,  0.789,  0.001 ]
]

Memory: 9 values × 32 bits = 288 bits (36 bytes)

Step 1: Find the Range

min(W) = -0.891
max(W) =  0.789
range  =  0.789 - (-0.891) = 1.680

Step 2: Calculate Scale and Zero Point

For INT8 quantization (values from -128 to 127, or 0 to 255 for unsigned):

# Using unsigned INT8 (0 to 255)
scale      = range / 255 = 1.680 / 255 = 0.00659
zero_point = round(-min / scale) = round(0.891 / 0.00659) = 135

# Quantization formula:
q(x) = round(x / scale) + zero_point

Step 3: Quantize Each Weight

Original (FP32)    Calculation                            Quantized (INT8)
 0.127             round( 0.127/0.00659) + 135 = 154            154
-0.891             round(-0.891/0.00659) + 135 =   0              0
 0.342             round( 0.342/0.00659) + 135 = 187            187
 0.505             round( 0.505/0.00659) + 135 = 212            212
 0.012             round( 0.012/0.00659) + 135 = 137            137
-0.234             round(-0.234/0.00659) + 135 =  99             99
-0.667             round(-0.667/0.00659) + 135 =  34             34
 0.789             round( 0.789/0.00659) + 135 = 255            255
 0.001             round( 0.001/0.00659) + 135 = 135            135
# Quantized weights (INT8 — 8 bits per value)
W_q = [
  [ 154,   0, 187 ],
  [ 212, 137,  99 ],
  [  34, 255, 135 ]
]

Memory: 9 values × 8 bits = 72 bits (9 bytes)
        + 2 floats for scale/zero_point = 8 bytes
Total:  17 bytes (vs 36 bytes) = 53% smaller
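
The whole pipeline (Steps 1 through 3) in a few lines of NumPy, as a minimal per-tensor sketch of the asymmetric unsigned-INT8 scheme used above, not a drop-in for any framework's quantization API:

import numpy as np

W = np.array([[ 0.127, -0.891,  0.342],
              [ 0.505,  0.012, -0.234],
              [-0.667,  0.789,  0.001]], dtype=np.float32)

# Step 1: range; Step 2: scale and zero point
w_min, w_max = W.min(), W.max()
scale = (w_max - w_min) / 255.0            # ≈ 0.00659
zero_point = int(round(-w_min / scale))    # 135

# Step 3: quantize every weight onto the 0-255 grid
W_q = np.clip(np.round(W / scale) + zero_point, 0, 255).astype(np.uint8)
print(scale, zero_point)
print(W_q)   # matches the worked table above (154, 0, 187 / 212, 137, 99 / ...)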

Step 4: Dequantize (Reconstruct)

When we need to use the weights, we convert back:

# Dequantization formula:
x_reconstructed = (q - zero_point) × scale

# Example: reconstructing 0.127
(154 - 135) × 0.00659 = 19 × 0.00659 = 0.125   (original: 0.127)

# Example: reconstructing 0.001
(135 - 135) × 0.00659 = 0 × 0.00659 = 0.000    (original: 0.001)
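
Continuing the NumPy sketch above (reusing W, W_q, scale, and zero_point from it), dequantization and the reconstruction error are one line each:

# Dequantize: map grid indices back to approximate FP32 values
W_hat = (W_q.astype(np.float32) - zero_point) * scale

print(np.round(W_hat, 3))        # e.g. 0.127 → 0.125, 0.001 → 0.000
print(np.abs(W - W_hat).max())   # worst-case absolute error ≈ scale / 2 ≈ 0.0033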

Step 5: Precision Loss Analysis

Original    Reconstructed   Error    % Error
 0.127         0.125        0.002      1.6%
-0.891        -0.890        0.001      0.1%
 0.342         0.343        0.001      0.3%
 0.505         0.507        0.002      0.4%
 0.012         0.013        0.001      8.3%
-0.234        -0.237        0.003      1.3%
-0.667        -0.665        0.002      0.3%
 0.789         0.791        0.002      0.3%
 0.001         0.000        0.001      100%
Critical Observation: Small values near zero suffer the most. The value 0.001 became 0.000 — a 100% error. This is why quantization can cause problems: weights that encode subtle features get rounded away.

Visual: The Quantization Grid

FP32: Continuous (infinite precision)
─────────────────────────────────────────────────
  Any value in the range -0.891 ... 0.789 (and beyond) can be represented.

INT8: Discrete (256 possible values)
─────────────────────────────────────────────────
  |───|───|───| ··· |───|───|───|
  -0.891 maps to q = 0   (the min)
   0.000 maps to q = 135 (the zero_point)
   0.789 maps to q = 255 (the max)

  Any value between grid points gets ROUNDED to the nearest grid point.
  Grid spacing = scale = 0.00659

Side-by-Side Comparison

Aspect                 Distillation                                 Quantization
What changes           Model architecture (fewer layers/params)     Number precision (fewer bits)
Typical compression    10-100× fewer parameters                     2-4× smaller file size
Speed improvement      Large (less computation)                     Moderate (faster math ops)
Training required      Yes, full training run                       Minimal or none
Capability loss        Variable (5-15% typical)                     Small at INT8 (<2%)
Where it hurts         Rare cases, complex reasoning                Subtle features, edge cases

Can be combined? Yes! Distill first, then quantize for maximum compression.
FULL MODEL              DISTILLED               DISTILLED + QUANTIZED
━━━━━━━━━━━━            ━━━━━━━━━━              ━━━━━━━━━━━━━━━━━━━━━
100M params             10M params              10M params
FP32 weights            FP32 weights            INT8 weights
Size:  400MB            Size:  40MB             Size:  10MB
Speed: 50ms             Speed: 8ms              Speed: 3ms
Accuracy: 94%           Accuracy: 89%           Accuracy: 87%

──────────────────────────────────────────────────────────────────────▶
             More compression, more speed, less accuracy

Security Research Implications

When to Use Compressed Models

✓ Good Fit

  • High-volume log triage
  • First-pass malware scanning
  • Embedding generation for similarity
  • Edge/endpoint deployment
  • Real-time classification

✗ Bad Fit

  • Novel threat detection (APT, 0-day)
  • Final verdict decisions
  • Anything needing explanation
  • Cases where FP cost is high
  • Adversarial environments

Red Team Considerations

Quantization artifacts are exploitable. If an attacker knows you're using an INT8 model, they can craft inputs that land near quantization rounding boundaries, where tiny perturbations flip values onto different grid points and the quantized model's behavior diverges from the full-precision model it came from (see the toy illustration below).
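
A toy illustration, reusing the scale and zero_point from Part 2. It only demonstrates the rounding effect on single values; a real evasion attempt would target the model's activations and decision boundaries, which is beyond this tutorial:

scale, zero_point = 0.00659, 135   # from the Part 2 example

def quantize(x):
    # Same asymmetric unsigned-INT8 scheme as in Part 2
    return max(0, min(255, round(x / scale) + zero_point))

# Two nearly identical values can land on different grid points...
print(quantize(0.2270), quantize(0.2277))   # 169 vs 170 (inputs only 0.0007 apart)

# ...while values almost a full grid step apart collapse onto the same point.
print(quantize(0.2275), quantize(0.2335))   # 170 vs 170 (inputs 0.006 apart)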

Distilled models have blind spots. The student didn't see the teacher's training data — it only saw outputs. Edge cases the teacher handled via memorization may be lost entirely.

Validation Checklist

□ Test on YOUR data distribution, not public benchmarks
□ Check calibration (are confidence scores meaningful?)
□ Measure performance on rare/edge cases specifically
□ Test adversarial robustness (craft inputs near decision boundaries)
□ Compare FP/FN rates to full model on held-out threats (see the sketch below)
□ Keep full model available for escalation/verification
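
One way to operationalize the FP/FN comparison, as a sketch. The prediction arrays below are toy stand-ins; full_model, small_model, and held_out would be your own objects (hypothetical names), and class 2 means "Malicious" as in Part 1:

import numpy as np

# In practice these would come from e.g. full_model.predict(held_out) and
# small_model.predict(held_out) on the same held-out threat set.
labels      = np.array([2, 1, 2, 0, 2, 2])   # ground truth
full_preds  = np.array([2, 1, 2, 0, 1, 2])   # full-precision model
small_preds = np.array([2, 1, 1, 0, 1, 2])   # compressed model

agreement = (full_preds == small_preds).mean()

malicious = labels == 2
fn_full  = (full_preds[malicious]  != 2).mean()   # missed threats, full model
fn_small = (small_preds[malicious] != 2).mean()   # missed threats, compressed model

print(f"agreement: {agreement:.2f}")                                 # 0.83
print(f"FN rate full: {fn_full:.2f}, compressed: {fn_small:.2f}")    # 0.25 vs 0.50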

Built for the Ori group chat — March 2026
