Knowledge-Distillation
KL Divergence Loss
The official implementation uses a KL-divergence loss, while your implementation appears to use Keras's categorical cross-entropy loss. In my opinion, using the latter would completely invalidate the use of the soft predictions. Let me know what you think, or whether I am mistaken.
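
For reference, here is a minimal sketch of what a KL-based distillation loss could look like in TensorFlow/Keras. The function name, the `temperature` parameter, and the default value are illustrative assumptions for this example, not the repo's actual API; the temperature-softening follows Hinton et al. (2015).

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Sketch of a KL-divergence distillation loss (names are illustrative)."""
    # Soften both distributions with the same temperature.
    teacher_probs = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    # KL(teacher || student), computed per example then averaged.
    kl = tf.reduce_sum(
        teacher_probs * (tf.math.log(teacher_probs + 1e-8) - student_log_probs),
        axis=-1,
    )
    # The T^2 factor keeps gradient magnitudes comparable to the
    # hard-label loss when the two terms are combined.
    return tf.reduce_mean(kl) * temperature ** 2
```

If a built-in is preferred, applying `tf.keras.losses.KLDivergence` to the two temperature-softened probability distributions computes the same quantity (up to the `T^2` scaling, which would still need to be applied manually).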