
Model Distillation

Fundamentals

A technique where a smaller 'student' model is trained to replicate the behavior of a larger 'teacher' model, preserving much of the performance at a fraction of the size and cost.

Model distillation (also called knowledge distillation) is a model compression technique in which a large, high-performing model (the teacher) transfers its learned knowledge to a smaller model (the student). Rather than learning from raw data labels alone, the student is trained on the teacher's output probability distributions, which carry richer information about relationships between classes and nuances in the data.
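To make the "richer information" point concrete, here is a minimal sketch with made-up numbers: hypothetical teacher logits for a three-class task, contrasting the hard one-hot label with the soft label the teacher produces.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes ("dog", "wolf", "car").
teacher_logits = [4.0, 2.5, -1.0]

hard_label = [1, 0, 0]              # one-hot: "dog", nothing else
soft_label = softmax(teacher_logits)

# The soft label still ranks "dog" first, but it also shows the teacher
# considers "wolf" far more plausible than "car" -- inter-class structure
# that the one-hot label discards entirely.
```

Training the student against `soft_label` rather than `hard_label` is what lets it inherit these learned class relationships.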

The process typically involves running inputs through the teacher model to generate "soft labels": probability distributions over all possible outputs rather than hard one-hot labels. The student model is then trained to match these soft distributions, often using a temperature parameter that softens the probabilities further and reveals more of the teacher's learned structure.

Distillation is widely used in production AI systems where deploying a massive model is impractical due to latency, cost, or hardware constraints. For example, many commercially available smaller language models have been distilled from larger ones: DeepSeek R1's reasoning capabilities were distilled into smaller models built on Qwen and Llama. The technique enables organizations to achieve near-state-of-the-art performance with models that are 10-100x smaller and cheaper to run.

Last updated: March 1, 2026