Glossary Entry

Knowledge Distillation

Training a smaller model to imitate the outputs of a larger, stronger model so it inherits much of the capability at lower cost.

Training Models

Also called: distillation, distilled model, distil

Seed source: Hinton et al. 2015

Knowledge distillation transfers capability from a large “teacher” model to a smaller “student” by training the student on the teacher’s outputs. In the original formulation (Hinton et al. 2015) the student matches the teacher’s softened output probabilities; for reasoning models it usually means collecting long chain-of-thought traces generated by a strong model and fine-tuning a smaller model on them.
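The trace-collection step can be sketched roughly as follows. This is a minimal illustration, not a reference pipeline: `build_distillation_set`, `toy_teacher`, and the prompt/completion record layout are all hypothetical names chosen for the example, and the toy teacher stands in for a call to a real strong model.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect one teacher trace per prompt as a fine-tuning record."""
    records = []
    for prompt in prompts:
        # The teacher's full output, reasoning included, becomes the target.
        trace = teacher(prompt)
        # Skip empty generations (an assumption of this sketch, not a rule).
        if trace.strip():
            records.append({"prompt": prompt, "completion": trace})
    return records


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a strong model that returns a reasoning trace."""
    if not prompt:
        return ""
    return f"Let me think step by step about: {prompt}\nAnswer: ..."


if __name__ == "__main__":
    dataset = build_distillation_set(["What is 2 + 2?", ""], toy_teacher)
    for record in dataset:
        print(record["prompt"], "->", record["completion"])
```

The resulting records would then be passed to an ordinary supervised fine-tuning loop for the student; that step is omitted here because it depends on the training framework in use.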

Distillation is cheap and often surprisingly effective: small models fine-tuned on a strong teacher’s reasoning traces can outperform much larger models trained from scratch with reinforcement learning. The catch is that the student rarely exceeds its teacher: distillation spreads existing capability rather than creating new frontier behaviour.