Edward Hu comments

Results: 69 comments of Edward Hu

Current implementation can't be converted to ONNX

Thanks for flagging this! Do you know the best way to fix this?

LoRA/loralib/layers.py, line 71, uses nn.Linear.eval().

Thanks! Corrected.

Is it expected for the training time to not decrease?

Hi Joao, You should see a speedup if you have previously saturated your GPU utilization. Yes, for GPT-2 we only changed one layer and marked the rest as not trainable.
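The idea of marking everything except one layer as not trainable can be sketched in plain PyTorch. This is only an illustration under stated assumptions: the toy `nn.Sequential` model and the choice of its last layer are hypothetical, and this is not the loralib code or the GPT-2 setup from the thread.

```python
import torch.nn as nn

# Hypothetical toy model; it stands in for a much larger network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Freeze every parameter, then re-enable gradients for one chosen layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[2].parameters():
    p.requires_grad = True

# Count how many parameters the optimizer would actually update.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Frozen parameters still take part in the forward pass, so a smaller trainable count does not by itself shorten each step; that is consistent with the comment that a speedup shows up mainly when the GPU was previously saturated.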

Does mup work with model with Conv2D as output?

Hi BurguerJohn, We haven't implemented a mu-version of Conv2d to use as the output layer, but we can certainly do it! It seems slightly unusual to us to use Conv2d...

Does mup work with model with Conv2D as output?

Hi both, Thanks for your patience regarding this issue. The muconv2d branch should work in principle, but I haven't added test cases since it requires the labels to be the...

PyTorch Lightning example

Hi tchaton, Thanks for the pointer to the Lightning Tuner. We are not familiar with its usage, but from the page you linked, it looks like one can pass a...

Coord check looks good, but μTransfer is not working as expected

Hi shjwudp, Thanks for your interest in our work! Your coordinate check plots seem identical across time steps, which is a sign that the learning rate is too small for...
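The symptom described here, coordinate sizes that look identical across time steps, can be reproduced on a toy problem. The sketch below is not the mup coordinate check itself: `coord_sizes`, `spread`, the single linear layer, and both learning rates are hypothetical choices made only for illustration. It records the mean absolute activation at each SGD step and compares a very small learning rate with a larger one.

```python
import random

def coord_sizes(lr, steps=4, width=64, seed=0):
    """Mean |activation| of a toy linear layer at each step of plain SGD."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    w = [[rng.gauss(0.0, width ** -0.5) for _ in range(width)]
         for _ in range(width)]
    target = 1.0
    sizes = []
    for _ in range(steps):
        # Forward pass: h = W x, output = mean(h).
        h = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        sizes.append(sum(abs(v) for v in h) / width)
        err = sum(h) / width - target
        # For loss = 0.5 * err**2, d loss / d w[i][j] = err * x[j] / width.
        for row in w:
            for j in range(width):
                row[j] -= lr * err * x[j] / width
    return sizes

def spread(values):
    """Difference between the largest and smallest recorded value."""
    return max(values) - min(values)

tiny = coord_sizes(lr=1e-8)
larger = coord_sizes(lr=0.5)
print("lr=1e-8:", [round(s, 6) for s in tiny])
print("lr=0.5 :", [round(s, 6) for s in larger])
```

With the very small learning rate the recorded sizes barely move from step to step, so curves plotted for different time steps would sit on top of each other; with the larger rate they visibly separate.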

Conv1D Coord check looks good (I think), but μTransfer does not seem to work?

Hi Zach, Thanks for your interest in muP! A couple of things come to mind. - It might help to locate the layer with the very small activation norm and...

Conv1D Coord check looks good (I think), but μTransfer does not seem to work?

I meant using your knowledge of the specific architecture to reason about whether there's a bug in the code. E.g., maybe it's okay if it's the output of...

Conv1D Coord check looks good (I think), but μTransfer does not seem to work?

@shjwudp Thanks! You are right that the advantage of muP over SP should become more apparent as the difference in width grows. As a direct consequence, the effect of random...