
Hi @eric-mitchell ,


In your formula (the image below), it seems that log[π(y|x)] is calculated by applying .sum(-1) after logits.softmax(-1) and then .log(). But in your code (the image below), log[π(y|x)] is calculated by applying .sum(-1) after logits.log_softmax(-1).

The two ways of calculating log[π(y|x)] seem different. Could you please tell me whether they conflict with each other?

Originally posted by @Gryff1ndor in https://github.com/eric-mitchell/direct-preference-optimization/issues/57#issuecomment-2106580951

Gryff1ndor · May 15, 2024

@Gryff1ndor, I think it might be because $\pi_\theta(y_w | x)$ is the probability of the entire sequence $y_w$ conditioned on the input $x$. So after decomposing into tokens:

$$\pi_\theta(y_w \mid x) = \prod_{i=1}^{|y_w|} \pi_\theta(y_i \mid x, y_0, \dots, y_{i-1})$$

And taking a log:

$$\log\pi_\theta(y_w \mid x) = \sum_{i=1}^{|y_w|} \log \pi_\theta(y_i \mid x, y_0, \dots, y_{i-1})$$

Which would correspond to the sum at the bottom of _get_batch_logps().
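
For reference, here's a minimal sketch of that sum-of-log-probabilities computation. It is not the repo's exact _get_batch_logps() (which also shifts labels and handles padding); the function and argument names here are illustrative:

```python
import torch

def sequence_logprob(logits, labels, mask):
    """Compute log pi(y | x) by summing per-token log-probabilities.

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) target token ids
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    # log_softmax(-1) gives log pi(y_i | x, y_<i) for every vocab entry;
    # gather picks out the entry for the actual target token.
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    # Summing the per-token logs equals the log of the product of the
    # per-token probabilities (softmax -> prod -> log), but it avoids the
    # numerical underflow of multiplying many small probabilities.
    return (per_token_logps * mask).sum(-1)
```

So the two formulations are mathematically the same; the code just computes the sum of logs directly because it is numerically stable.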

nlpfollower · Jul 12, 2024

Thanks a lot! There is another problem that bothers me: when using the DPO loss in my work, I found that the sigmoid function in the DPO loss caused a gradient explosion, because sigmoid(x) came out as exactly 0 or 1 (as if x tended to -∞ or +∞), even though the value of x was only around -10 or +10. Do you know how to fix it?

Gryff1ndor · Jul 27, 2024

Happy to help! I'm learning this stuff as well, so take it with a grain of salt, but I think there are a couple of things you can do:

  1. Play around with the hyperparams. Try lowering the learning rate, or lowering beta.
  2. Gradient clipping. Reducing the configured gradient-norm limit (self.config.max_grad_norm) may prevent exploding gradients.
  3. Perhaps another loss function, like the 'conservative' DPO (cDPO) or IPO, will work in your case. I also wonder what would happen if you modified the DPO loss to use a clip function, something like PPO-clip, to keep the sigmoid argument in a reasonable range; see the sketch after this list.
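
To make point 3 a bit more concrete, here is a rough sketch of the DPO loss with the 'conservative' label-smoothing variant, roughly in the spirit of the repo's dpo_loss when label_smoothing > 0; treat the exact signature and names as illustrative rather than the repo's code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    """DPO loss with optional conservative (cDPO) label smoothing.

    label_smoothing=0.0 is plain DPO; a small value like 0.1 assumes some
    preference labels are noisy and softens the loss when beta * logits
    saturates the sigmoid.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    # F.logsigmoid is the numerically stable form of log(sigmoid(x)),
    # so values of beta * logits around +/-10 don't blow up the gradient.
    losses = (
        -F.logsigmoid(beta * logits) * (1 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )
    return losses.mean()
```

If the gradients still explode, combining this with a lower beta and gradient clipping (points 1 and 2) is probably the first thing I'd try.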

Let me know how it goes! Right now I'm working with another fascinating repository that introduces a related loss called KTO. It's actually designed to be an extension of this repository, so it's very possible that the authors of the KTO repo addressed the exploding-gradient problem in their code. I will be training with DPO afterwards, so it'll be helpful to know what worked for you.

nlpfollower · Jul 27, 2024