Should `base=None` be used in `set_base_shapes` for the model used for tuning?
Hello! First of all, thank you for doing such great work and making it so accessible. I'm looking at using mup for a project but I'm a bit confused about how to set the base shapes for the smaller model used for hyperparameter tuning.
Let's say I want to train an MLP with hidden dimension 1024, and I want to muTransfer the best learning rate from an MLP with hidden dimension 128. My top-level code might look like this:
```python
import mup

best_loss = float('inf')
best_lr = 0.

# Hyperparameter sweep with hidden dimension 128
for lr in learning_rates:
    small_mlp = MLP(hidden_dim=128)
    # use `base=None` in `set_base_shapes`
    small_mlp = mup.set_base_shapes(small_mlp, base=None)
    final_loss = full_training_loop(small_mlp, lr=lr)
    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model
base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)
big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)
ultimate_loss = full_training_loop(big_mlp, lr=best_lr)
```
or like this:
```python
best_loss = float('inf')
best_lr = 0.

# Hyperparameter sweep with hidden dimension 128
for lr in learning_rates:
    small_mlp = MLP(hidden_dim=128)
    # use a base model in `set_base_shapes`
    smaller_mlp = MLP(hidden_dim=32)
    small_mlp = mup.set_base_shapes(small_mlp, base=smaller_mlp)
    final_loss = full_training_loop(small_mlp, lr=lr)
    if final_loss < best_loss:
        best_loss = final_loss
        best_lr = lr

# Transfer optimal LR to large model
base_mlp = MLP(hidden_dim=128)
big_mlp = MLP(hidden_dim=1024)
big_mlp = mup.set_base_shapes(big_mlp, base=base_mlp)
ultimate_loss = full_training_loop(big_mlp, lr=best_lr)
```
Could you please clarify which of these would be correct? Thank you very much for your time!
Thanks for the kind words!
You should do the 2nd thing. `base=None` essentially means not using muP.
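To spell out the distinction, here is a minimal sketch, reusing the hypothetical `MLP` from the question and assuming its output layer is a `mup.MuReadout` and that training uses one of mup's optimizers (`mup.MuAdam` / `mup.MuSGD`), which is what actually applies the muP learning-rate scaling; the widths and `lr` value are only illustrative:

```python
import mup

# `base=None`: the model serves as its own base, so every width multiplier
# is 1 and mup's optimizers behave like ordinary Adam/SGD -- effectively
# standard parametrization, with no muTransfer.
sp_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=None)

# A genuinely smaller base: layers whose width differs from the base get
# muP initialization and per-layer learning-rate scaling, which is what
# lets the tuned learning rate transfer to wider models.
proxy_mlp = mup.set_base_shapes(MLP(hidden_dim=128), base=MLP(hidden_dim=32))

# muP-aware optimizer; a plain torch.optim.Adam would not apply the scaling.
optimizer = mup.MuAdam(proxy_mlp.parameters(), lr=1e-3)
```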
Great, thanks Greg!