Edward Hu

Results: 69 comments by Edward Hu

Thanks for flagging this! Do you know the best way to fix this?

Hi Joao, You should see a speedup if you have previously saturated your GPU utilization. Yes, for GPT-2 we only changed one layer and marked the rest as not trainable.

Hi BurguerJohn, We haven't implemented a mu-version of Conv2d to use as the output layer, but we can certainly do it! It seems slightly unusual to us to use Conv2d...

Hi both, Thanks for your patience regarding this issue. The muconv2d branch should work in principle, but I haven't added test cases since it requires the labels to be the...

Hi tchaton, Thanks for the pointer to the Lightning Tuner. We are not familiar with its usage, but from the page you linked, it looks like one can pass a...

Hi shjwudp, Thanks for your interest in our work! Your coordinate check plots seem identical across time steps, which is a sign that the learning rate is too small for...

Hi Zach, Thanks for your interest in muP! A couple of things come to mind. - It might help to locate the layer with the very small activation norm and...

I meant using your knowledge of the specific architecture you are using to reason about whether there's a bug in the code. E.g., maybe it's okay if it's the output of...

@shjwudp Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random...