Hongwu Peng
According to the README, "We have found that it is very unstable to use different generation training batch sizes (--per_device_train_batch_size) and PPO training batch sizes (--per_device_mini_batch_size), more than one PPO training...
Any update on the BF16 atomic add support for devices other than Hopper?
In short, in the second for loop, for each minibatch's query and passage loss backward, you put the query and passage embeddings into the original batch and calculate the gradient...
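Here is a rough sketch of that second loop in plain PyTorch, in case it helps. The names `model`, `chunk_size`, and `temperature` are just placeholders I made up for illustration, the inputs are assumed to already be tensors the encoder can consume, and I've left out the RNG replay (RandContext) that the actual library does around the re-forward:

```python
import torch
import torch.nn.functional as F

def grad_cache_step(model, queries, passages, chunk_size, temperature=0.05):
    q_chunks = queries.split(chunk_size)
    p_chunks = passages.split(chunk_size)

    # First loop: no-grad forward over every minibatch, caching embeddings.
    with torch.no_grad():
        q_reps = torch.cat([model(c) for c in q_chunks])
        p_reps = torch.cat([model(c) for c in p_chunks])

    # Full-batch contrastive loss on detached leaf tensors, so backward()
    # fills .grad with the gradient of the loss w.r.t. each embedding
    # (this is the "gradient cache").
    q_reps = q_reps.detach().requires_grad_()
    p_reps = p_reps.detach().requires_grad_()
    scores = q_reps @ p_reps.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores, labels)
    loss.backward()
    q_cache = q_reps.grad.split(chunk_size)
    p_cache = p_reps.grad.split(chunk_size)

    # Second loop: re-run each minibatch with grad enabled and backprop a
    # surrogate (embeddings dotted with the cached full-batch gradients),
    # so parameter grads accumulate as if the whole batch ran at once.
    for q_c, p_c, q_g, p_g in zip(q_chunks, p_chunks, q_cache, p_cache):
        (model(q_c) * q_g).sum().backward()
        (model(p_c) * p_g).sum().backward()
    return loss.detach()
```

After this returns, a single optimizer.step() applies the accumulated gradients, same as a one-shot large-batch step.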
Oh, okay. I was using DeepSpeed + gradient caching; the model is wrapped in a DeepSpeed-defined object, and RandContext doesn't work on my side. But it's good to learn...
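For anyone hitting the same issue: RandContext is essentially an RNG snapshot/replay helper, so the grad-enabled re-forward sees the same dropout masks as the cached no-grad forward. A minimal sketch of that idea (my own reconstruction, assuming only the CPU and per-device CUDA RNG states need capturing, via the torch.utils.checkpoint helpers):

```python
import torch
from torch.utils.checkpoint import get_device_states, set_device_states

class RandContext:
    """Snapshot RNG states at construction; restore them on __enter__ so a
    re-run forward pass replays the exact same random ops (e.g. dropout)."""
    def __init__(self, *tensors):
        self.cpu_state = torch.get_rng_state()
        self.gpu_devices, self.gpu_states = get_device_states(*tensors)

    def __enter__(self):
        # Fork the global RNG so restoring states here doesn't leak out.
        self._fork = torch.random.fork_rng(devices=self.gpu_devices, enabled=True)
        self._fork.__enter__()
        torch.set_rng_state(self.cpu_state)
        set_device_states(self.gpu_devices, self.gpu_states)

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._fork.__exit__(exc_type, exc_val, exc_tb)
        self._fork = None
```

The intended use would be to build one context per minibatch during the no-grad pass (`ctx = RandContext(*inputs)`) and then wrap the second, grad-enabled forward in `with ctx:`.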