Hongwu Peng
According to the README, "We have found that it is very unstable to use different generation training batch sizes (--per_device_train_batch_size) and PPO training batch sizes (--per_device_mini_batch_size), more than one PPO training...
Any update on the BF16 atomic add support for devices other than Hopper?
In short, in the second for loop, for each minibatch's query and passage loss backward, you put the query and passage embeddings into the original batch and calculate the gradient...
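Here is a rough sketch of that second loop in plain PyTorch, in case it helps. The names `model`, `chunk_size`, and `temperature` are just placeholders I made up for illustration, the inputs are assumed to already be tensors the encoder can consume, and I've left out the RNG replay (RandContext) that the actual library does around the re-forward:

```python
import torch
import torch.nn.functional as F

def grad_cache_step(model, queries, passages, chunk_size, temperature=0.05):
    q_chunks = queries.split(chunk_size)
    p_chunks = passages.split(chunk_size)

    # First loop: no-grad forward over every minibatch, caching embeddings.
    with torch.no_grad():
        q_reps = torch.cat([model(c) for c in q_chunks])
        p_reps = torch.cat([model(c) for c in p_chunks])

    # Full-batch contrastive loss on detached leaf tensors, so backward()
    # fills .grad with the gradient of the loss w.r.t. each embedding
    # (this is the "gradient cache").
    q_reps = q_reps.detach().requires_grad_()
    p_reps = p_reps.detach().requires_grad_()
    scores = q_reps @ p_reps.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores, labels)
    loss.backward()
    q_cache = q_reps.grad.split(chunk_size)
    p_cache = p_reps.grad.split(chunk_size)

    # Second loop: re-run each minibatch with grad enabled and backprop a
    # surrogate (embeddings dotted with the cached full-batch gradients),
    # so parameter grads accumulate as if the whole batch ran at once.
    for q_c, p_c, q_g, p_g in zip(q_chunks, p_chunks, q_cache, p_cache):
        (model(q_c) * q_g).sum().backward()
        (model(p_c) * p_g).sum().backward()
    return loss.detach()
```

After this returns, a single optimizer.step() applies the accumulated gradients, same as a one-shot large-batch step.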
Oh, okay. I was using DeepSpeed + gradient caching; the model is wrapped in a DeepSpeed-defined object, and RandContext doesn't work on my side. But it's good to learn...
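For anyone hitting the same issue: RandContext is essentially an RNG snapshot/replay helper, so the grad-enabled re-forward sees the same dropout masks as the cached no-grad forward. A minimal sketch of that idea (my own reconstruction, assuming only the CPU and per-device CUDA RNG states need capturing, via the torch.utils.checkpoint helpers):

```python
import torch
from torch.utils.checkpoint import get_device_states, set_device_states

class RandContext:
    """Snapshot RNG states at construction; restore them on __enter__ so a
    re-run forward pass replays the exact same random ops (e.g. dropout)."""
    def __init__(self, *tensors):
        self.cpu_state = torch.get_rng_state()
        self.gpu_devices, self.gpu_states = get_device_states(*tensors)

    def __enter__(self):
        # Fork the global RNG so restoring states here doesn't leak out.
        self._fork = torch.random.fork_rng(devices=self.gpu_devices, enabled=True)
        self._fork.__enter__()
        torch.set_rng_state(self.cpu_state)
        set_device_states(self.gpu_devices, self.gpu_states)

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._fork.__exit__(exc_type, exc_val, exc_tb)
        self._fork = None
```

The intended use would be to build one context per minibatch during the no-grad pass (`ctx = RandContext(*inputs)`) and then wrap the second, grad-enabled forward in `with ctx:`.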