Questionable distillation technique
I noticed in Table 12 of your paper, the hyperparameter ($\beta$) is set to a very low value, 1e-8, which suggests that the proposed code-based distillation process plays an almost negligible role during training. This is quite puzzling. Could you explain?
Regarding the disparity between the hyperparameters of the code distillation and class distillation terms in Equation 7, it is important to consider how the two losses are computed. The class-based distillation loss is the KL divergence between the student predictions and the teacher's soft labels, and is therefore computed over only the class dimension. The code-based distillation loss, in contrast, compares the target node representations against all $M$ codes of the codebook embeddings. Since the number of codes exceeds the number of classes by many orders of magnitude, the raw code-based loss is roughly $10^7$ times larger than the class-based loss on large-scale datasets. Consequently, despite the much smaller hyperparameter $\beta$, the weighted code distillation term remains proportionally significant, and the gradients it propagates during backpropagation stay at a level comparable to those of the class distillation loss.
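For concreteness, below is a minimal numerical sketch of the scale argument. The batch size, class count, codebook size, hidden dimension, and the squared-distance stand-in for "comparing node representations against all $M$ codes" are all assumptions for illustration, not the paper's exact loss; the only point is that a term accumulated over $M$ codes is orders of magnitude larger than a KL divergence over a handful of classes, so a weight of 1e-8 still leaves it at a comparable scale.

```python
# Hypothetical illustration of the loss-scale argument. The constants below and the
# squared-distance stand-in for the code-based loss are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, C, M, D = 32, 40, 1_000_000, 64  # batch size, #classes, codebook size, hidden dim (assumed)
beta = 1e-8                          # weight reported for the code-based term

# Class-based distillation: KL divergence between teacher soft labels and student
# predictions, averaged over the batch -- a quantity of order 1.
student_logits = torch.randn(B, C)
teacher_probs = F.softmax(torch.randn(B, C), dim=-1)
loss_class = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                      reduction="batchmean")

# Code-based distillation (stand-in): each node representation is compared against
# all M codebook embeddings; accumulating the comparison over the whole codebook
# makes the raw magnitude grow with M.
node_repr = torch.randn(B, D)
codebook = torch.randn(M, D)
loss_code = torch.cdist(node_repr, codebook).pow(2).sum(dim=-1).mean()

print(f"class loss       : {loss_class.item():.3e}")
print(f"code loss (raw)  : {loss_code.item():.3e}")
print(f"beta * code loss : {(beta * loss_code).item():.3e}")
# With M ~ 1e6 the raw code loss is ~1e8, so beta * code loss lands on the same
# order of magnitude as the class loss, keeping the two gradient signals balanced.
```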