Paper discussion: Why does the end-to-end algorithm work properly?
In the paper, it seems that you combined the two steps of reinforcement learning into a single step, forming an end-to-end training method. The specific algorithm is shown in the figure.
However, I have a question about this end-to-end method. Suppose we have the Opponent Player model for round $t$ and now want to learn the Opponent Player model for round $t+1$ with the end-to-end algorithm. Since $P_{\theta}$ and $P_{\theta_{t}}$ are the same model, isn't the resulting loss 0? That would mean we cannot obtain a $P_{\theta_{t+1}}$ that makes any progress. Is my understanding wrong?
Thank you for your question. Since $\ell$ here is a monotonically decreasing and convex function (the logistic loss in our paper, $\ell(t) = \log(1 + \exp(-t))$), the gradient is nonzero even when $P_{\theta}$ and $P_{\theta_t}$ are the same. Let us know if you have further questions.
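To make this concrete, here is a sketch of the gradient at $\theta = \theta_t$ (writing $p_{\theta}$ for the model distribution, $y$ for the ground-truth response, $y'$ for the synthetic response drawn from $p_{\theta_t}$, and $\lambda$ for the regularization parameter in the objective):

$$
\begin{aligned}
L(\theta;\theta_t) &= \mathbb{E}\left[\ell\left(\lambda \log \frac{p_{\theta}(y\mid x)}{p_{\theta_t}(y\mid x)} - \lambda \log \frac{p_{\theta}(y'\mid x)}{p_{\theta_t}(y'\mid x)}\right)\right],\\
\nabla_{\theta} L\big|_{\theta=\theta_t} &= \lambda\,\ell'(0)\,\mathbb{E}\left[\nabla_{\theta}\log p_{\theta}(y\mid x) - \nabla_{\theta}\log p_{\theta}(y'\mid x)\right]\Big|_{\theta=\theta_t}.
\end{aligned}
$$

For the logistic loss, $\ell'(0) = -\tfrac{1}{2} \neq 0$, so even though the argument of $\ell$ is 0 at $\theta = \theta_t$, the gradient vanishes only when the ground-truth and synthetic responses are already indistinguishable to the model.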
https://github.com/uclaml/SPIN/blob/e84b7be111b41b388367e591bdc23e327725c869/spin/alignment/trainer.py#L405
In the spin_loss definition, at step 0 the loss starts at a fixed value of 0.6931 when p_theta equals p_theta_t.
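(For reference, 0.6931 matches $\log 2$ exactly; a quick standalone check in plain Python, separate from the repo code:)

```python
import math

# Logistic loss from the paper: l(t) = log(1 + exp(-t)).
def logistic_loss(t: float) -> float:
    return math.log(1.0 + math.exp(-t))

# With p_theta == p_theta_t the argument is 0, so the step-0 value is log(2).
print(logistic_loss(0.0))  # 0.6931471805599453
print(math.log(2.0))       # 0.6931471805599453
```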
I know that the initial loss is not 0 in the actual code, but I assumed that was just due to how it is calculated in practice. Still, the value of the formula in the paper is 0 in this case, isn't it?
First, the algorithm requires a monotonically decreasing and convex function $\ell$, and $\ell(0) = \log(2) \approx 0.6931$; therefore, the value of this formula is not 0 when its argument is 0. Second, the progress beyond $\theta_t$ depends not on the loss value but on its gradient. Let us know if there are any questions.
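To illustrate that second point, here is a minimal toy sketch (not the SPIN trainer; a hypothetical scalar $\theta$ with stand-in log-probabilities, using PyTorch autograd). It shows that the loss value at $\theta = \theta_t$ is the constant $\log 2$ while its gradient is nonzero, so a gradient step still updates $\theta$:

```python
import torch

# Toy stand-in for log p_theta(y | x): a scalar theta with a quadratic "log-probability".
def logp(param, target):
    return -(param - target) ** 2

theta_t = torch.tensor(0.0)                             # opponent player from round t
theta = theta_t.clone().requires_grad_(True)            # main player being trained
y_real, y_syn = torch.tensor(1.0), torch.tensor(-1.0)   # hypothetical real / synthetic responses
lam = 1.0

# SPIN-style argument: lambda * (log-ratio on the real response - log-ratio on the synthetic one).
t = lam * ((logp(theta, y_real) - logp(theta_t, y_real))
           - (logp(theta, y_syn) - logp(theta_t, y_syn)))
loss = torch.log1p(torch.exp(-t))  # logistic loss l(t) = log(1 + exp(-t))
loss.backward()

print(loss.item())        # 0.6931... = log(2): the argument is 0 at theta == theta_t
print(theta.grad.item())  # -2.0: nonzero, so a gradient step moves theta away from theta_t
```

The value sits at $\log 2$ only at the single point $\theta = \theta_t$; the gradient there pushes probability mass toward the real response and away from the synthetic one, which is exactly the progress from round $t$ to round $t+1$.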
