Question 1

How does knowledge distillation work?

Accepted Answer

Knowledge distillation trains a smaller student model to mimic the outputs of a larger teacher model. The teacher's logits are passed through a temperature-scaled softmax to produce soft target probabilities, and the student is trained to match them, often in combination with the original hard labels. This lets the student capture what the teacher has learned about class similarities and decision boundaries.
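As a rough illustration of the combined objective, here is a minimal sketch in plain Python for a single example. The function names, the weighting factor alpha, and the default temperature of 4.0 are assumptions chosen for illustration, not values from this answer. The soft term is scaled by the squared temperature, a common convention that keeps its gradient magnitude comparable to the hard-label term.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are illustrative hyperparameters (assumed here).
    """
    # Soft term: KL(teacher || student), both softened at the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard term: cross-entropy against the true label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Hypothetical logits for a 3-class problem; class 2 is the true label.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss([3.0, 2.0, 1.0], teacher, label=2))  # student disagrees
print(distillation_loss([1.0, 2.0, 3.0], teacher, label=2))  # student matches
```

A student that matches the teacher's logits drives the soft term to zero, so only the hard-label term remains. In practice this would be computed over batches with a deep learning framework rather than Python lists.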

Question 2

What is the role of the temperature parameter?

Accepted Answer

The temperature parameter controls how soft the teacher's output distribution is: the logits are divided by the temperature before the softmax is applied. A higher temperature yields a softer distribution that gives more weight to low-probability classes and so reveals relationships between classes. During training, the student learns from these softened targets, usually with its own softmax at the same temperature, and at inference time the temperature is typically set to 1.
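The softening effect can be seen with a small sketch. The logits and the temperature values below are hypothetical, picked only to make the trend visible; entropy is used here as one convenient measure of how soft a distribution is.

```python
import math

def softmax_t(logits, temperature):
    """Softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [5.0, 2.0, 0.5]  # hypothetical teacher outputs

for temperature in (1.0, 2.0, 5.0):
    probs = softmax_t(teacher_logits, temperature)
    print(f"T={temperature}: probs={[round(p, 3) for p in probs]}, "
          f"entropy={entropy(probs):.3f}")
```

As the temperature rises, the top class loses probability mass to the others and the entropy increases, while the ranking of the classes stays the same.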

Question 3

How does knowledge distillation compare to other compression methods?

Accepted Answer

Unlike pruning or quantization, which directly reduce a trained model's size or numerical precision, knowledge distillation transfers knowledge by training a separate, smaller model. The student often reaches better accuracy than the same small model trained from scratch, but distillation requires access to the teacher model (or its outputs) during training. It can be combined with other compression techniques for further efficiency gains.

Knowledge Distillation
