This is an implementation of the basic idea behind Hinton's Knowledge Distillation paper (Hinton et al., 2015). We do not reproduce the paper's exact results; rather, we show that the idea works.
While a few other implementations are available, their code flow is not very intuitive. Here we generate the soft targets from the teacher network online, i.e. on the fly, while training the student network, instead of precomputing and storing them.
This may or may not be the best way to implement the distillation architecture, but it leads to a clear improvement in the (small) student model. A minimal sketch of the scheme is given below.
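For illustration, here is a minimal PyTorch-style sketch of the online scheme described above, assuming the standard distillation loss from Hinton et al. The names `teacher`, `student`, `train_step`, the temperature `T`, and the weighting `alpha` are placeholders for this sketch, not this repo's actual API:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted sum of the soft (teacher) loss and the hard (label) loss."""
    # Soft targets: teacher logits softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # scale by T^2, as suggested in Hinton et al.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

def train_step(student, teacher, optimizer, images, labels):
    teacher.eval()
    with torch.no_grad():                 # teacher stays frozen
        teacher_logits = teacher(images)  # soft targets generated on the fly
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Generating the teacher's logits inside each training step (rather than precomputing them) keeps the code flow simple: one forward pass through the frozen teacher, one through the student, then a single combined loss.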