Knowledge distillation transfers capability from a large “teacher” model to a smaller “student” by training the student on the teacher’s generated outputs. For reasoning models this usually means collecting long chain-of-thought traces from a strong model and fine-tuning a smaller model on them.
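The data-collection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference pipeline: `build_distillation_set`, `teacher_generate`, `toy_teacher`, and the `max_chars` filter are hypothetical names introduced here, and the prompt/completion record layout is just one common shape for supervised fine-tuning data.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    max_chars: int = 20_000,  # assumed length cap, not from the source text
) -> List[Dict[str, str]]:
    """Collect teacher traces and format them as fine-tuning records."""
    records: List[Dict[str, str]] = []
    for prompt in prompts:
        # The teacher produces a full chain-of-thought trace for the prompt.
        trace = teacher_generate(prompt)
        # Skip empty or overlong traces so they do not enter the training set.
        if not trace.strip() or len(trace) > max_chars:
            continue
        # The student is later trained to reproduce `completion` given `prompt`.
        records.append({"prompt": prompt, "completion": trace})
    return records


def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a strong reasoning model."""
    return f"Let me think step by step about: {prompt}\nAnswer: 42"


dataset = build_distillation_set(["What is 6 * 7?", "Compute 40 + 2."], toy_teacher)
```

In practice `toy_teacher` would be replaced by a call to the real teacher model, and the resulting records would be passed to an ordinary supervised fine-tuning loop for the student.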
Distillation is cheap and surprisingly effective: small models fine-tuned on a good teacher’s reasoning traces can outperform much larger models trained with reinforcement learning from scratch. The catch is that the student is, for the most part, bounded by its teacher: distillation spreads existing capability rather than creating new frontier behaviour.
