
Hi @eric-mitchell ,


In your formula (the image below), it seems that log[π(y|x)] is calculated by applying .sum(-1) after logits.softmax(-1) and then .log(). But in your code (the image below), log[π(y|x)] is calculated by applying .sum(-1) after logits.log_softmax(-1).

The two ways of calculating log[π(y|x)] seem different. Could you please tell me whether they conflict with each other?

Originally posted by @Gryff1ndor in https://github.com/eric-mitchell/direct-preference-optimization/issues/57#issuecomment-2106580951

Gryff1ndor · May 15, 2024

@Gryff1ndor, I think it might be because $\pi_\theta(y_w | x)$ is the probability of the entire sequence $y_w$ conditioned on the input $x$. So after decomposing into tokens:

$$\pi_\theta(y_w \mid x) = \prod_{i=1}^{|y_w|} \pi_\theta(y_i \mid x, y_0, \dots, y_{i-1})$$

And taking a log:

$$\log\pi_\theta(y_w \mid x) = \sum_{i=1}^{|y_w|} \log \pi_\theta(y_i \mid x, y_0, \dots, y_{i-1})$$

Which would correspond to the sum at the bottom of _get_batch_logps().
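
For reference, here's a minimal sketch of that sum-of-log-probabilities computation. It is not the repo's exact _get_batch_logps() (which also shifts labels and handles padding); the function and argument names here are illustrative:

```python
import torch

def sequence_logprob(logits, labels, mask):
    """Compute log pi(y | x) by summing per-token log-probabilities.

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) target token ids
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    # log_softmax(-1) gives log pi(y_i | x, y_<i) for every vocab entry;
    # gather picks out the entry for the actual target token.
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    # Summing the per-token logs equals the log of the product of the
    # per-token probabilities (softmax -> prod -> log), but it avoids the
    # numerical underflow of multiplying many small probabilities.
    return (per_token_logps * mask).sum(-1)
```

So the two formulations are mathematically the same; the code just computes the sum of logs directly because it is numerically stable.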

nlpfollower · Jul 12, 2024

Thanks a lot! There is another problem that bothers me: when using the DPO loss in my work, I found that the sigmoid function in the DPO loss caused a gradient explosion, because sigmoid(x) came out as exactly 0 or 1 (as if x tended to -∞ or +∞), even though the value of x was only around -10 or +10. Do you know how to fix it?

Gryff1ndor · Jul 27, 2024

Happy to help! I'm learning this stuff as well, so take it with a grain of salt, but I think there are a couple of things you can do:

  1. Play around with the hyperparams. Try lowering the learning rate, or lowering beta.
  2. Gradient clipping. Reducing the configured gradient-norm limit (self.config.max_grad_norm) may prevent exploding gradients.
  3. Perhaps another loss function, like the 'conservative' DPO (cDPO) or IPO, will work in your case. I also wonder what would happen if you modified the DPO loss to use a clip function, something like PPO-clip, to keep the sigmoid argument in a reasonable range; see the sketch after this list.
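
To make point 3 a bit more concrete, here is a rough sketch of the DPO loss with the 'conservative' label-smoothing variant, roughly in the spirit of the repo's dpo_loss when label_smoothing > 0; treat the exact signature and names as illustrative rather than the repo's code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    """DPO loss with optional conservative (cDPO) label smoothing.

    label_smoothing=0.0 is plain DPO; a small value like 0.1 assumes some
    preference labels are noisy and softens the loss when beta * logits
    saturates the sigmoid.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    # F.logsigmoid is the numerically stable form of log(sigmoid(x)),
    # so values of beta * logits around +/-10 don't blow up the gradient.
    losses = (
        -F.logsigmoid(beta * logits) * (1 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )
    return losses.mean()
```

If the gradients still explode, combining this with a lower beta and gradient clipping (points 1 and 2) is probably the first thing I'd try.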

Let me know how it goes! Right now I'm working with another fascinating repository that introduces a related loss called KTO. It's actually designed to be an extension of this repository, so it's very possible that the authors of the KTO repo addressed the exploding-gradient problem in their code. I will be training with DPO afterwards, so it'll be helpful to know what worked for you.

nlpfollower · Jul 27, 2024