Why simply use the first constrained layer as the pruning template for all constrained layers?
From my observation of the training results, the hard masks of the constrained layers are not exactly aligned. https://github.com/MingSun-Tse/ASSL/blob/a564556c8b578c2ee86d135044f088bfeaafc707/src/pruner/utils.py#L71
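For example, comparing the kept filter indices derived from each constrained layer shows the mismatch. This is only a rough sketch of the check; the top-k-by-L1-norm hard-mask criterion, the pruning ratio, and the function names below are my placeholder assumptions, not the repo's exact logic:

```python
import torch

def kept_indices(conv_weight: torch.Tensor, pr_ratio: float = 0.5) -> set:
    """Indices of filters kept when pruning `pr_ratio` of them, ranked by filter-wise L1 norm."""
    n_kept = conv_weight.shape[0] - int(conv_weight.shape[0] * pr_ratio)
    l1 = conv_weight.abs().sum(dim=(1, 2, 3))  # L1 norm per output filter (4-D weight assumed)
    return set(torch.argsort(l1, descending=True)[:n_kept].tolist())

# constrained_convs: placeholder dict {layer_name: 4-D weight tensor} of the constrained layers
def compare_masks(constrained_convs: dict, pr_ratio: float = 0.5):
    names = list(constrained_convs)
    ref = kept_indices(constrained_convs[names[0]], pr_ratio)
    for name in names[1:]:
        other = kept_indices(constrained_convs[name], pr_ratio)
        iou = len(ref & other) / len(ref | other)
        print(f"{names[0]} vs {name}: kept-filter overlap (IoU) = {iou:.3f}")
```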
Hi @yumath, thanks for your interest in our work!
Yes, the hard masks are not exactly aligned. SSA is a regularization term: it only encourages aligned masks (as shown by the decreasing SSA loss) but cannot guarantee that the masks will be fully aligned (i.e., SSA loss = 0). We tried increasing the penalty strength of SSA to make the masks more aligned, but that came at the price of a performance drop. So the current scheme (a not-so-beautiful solution, the way I see it) simply uses the mask derived from the first constrained Conv layer after applying the SSA penalty. You may use masks derived from other constrained layers instead; I think there should be no obvious difference.
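In case it helps, here is a rough sketch of what this boils down to (hypothetical names, simplified, not our actual code): derive the kept filter indices once, from the first constrained Conv layer's weights, and reuse them as the template for every constrained layer, so the pruned output channels line up across layers.

```python
import torch
import torch.nn as nn

def prune_with_shared_template(constrained_convs: "dict[str, nn.Conv2d]", pr_ratio: float = 0.5):
    """constrained_convs: hypothetical ordered dict {name: Conv2d} of the SSA-constrained layers."""
    first = next(iter(constrained_convs.values()))
    n_filters = first.out_channels
    n_kept = n_filters - int(n_filters * pr_ratio)

    # Rank the FIRST constrained layer's filters by L1 norm; its top-n_kept
    # indices become the shared pruning template.
    l1 = first.weight.detach().abs().sum(dim=(1, 2, 3))
    kept = torch.argsort(l1, descending=True)[:n_kept].sort().values

    pruned = {}
    for name, conv in constrained_convs.items():
        new_conv = nn.Conv2d(conv.in_channels, n_kept, conv.kernel_size,
                             stride=conv.stride, padding=conv.padding,
                             bias=conv.bias is not None)
        # Keep the SAME output filters in every constrained layer.
        new_conv.weight.data = conv.weight.data[kept].clone()
        if conv.bias is not None:
            new_conv.bias.data = conv.bias.data[kept].clone()
        pruned[name] = new_conv
    return pruned, kept
```

Swapping `first` for any other constrained layer gives the alternative you asked about.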
More thoughts: even if the masks are not fully aligned, reducing the misalignment is, per se, a good thing for pruning and the later finetuning, because the gradient flow of the remaining weights is less distorted and trainability is better.
Best,
I'll close this issue since there are no further questions. Feel free to re-open if necessary.