Paper discussion: Why does the end-to-end algorithm work properly?
In the paper, it seems that you combined the two steps of reinforcement learning into a single step, forming an end-to-end training method. The specific algorithm is shown in the figure.
However, I have a question about this end-to-end method. Suppose we have the Opponent Player model for round $t$ and now want to learn the Opponent Player model for round $t+1$ with the end-to-end algorithm. Since $P_{\theta}$ and $P_{\theta_{t}}$ are the same model, isn't the resulting loss 0? That would mean we cannot obtain a $P_{\theta_{t+1}}$ that makes any progress. Is my understanding wrong?
Thank you for your question. Since $\ell$ here is a monotonically decreasing and convex function (the logistic loss in our paper, $\ell(t) = \log(1 + \exp(-t))$), the gradient is nonzero even when $P_{\theta}$ and $P_{\theta_t}$ are the same. Let us know if you have further questions.
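To make this concrete, here is a sketch of the gradient at $\theta = \theta_t$ (writing $p_{\theta}$ for the model distribution, $y$ for the ground-truth response, $y'$ for the synthetic response drawn from $p_{\theta_t}$, and $\lambda$ for the regularization parameter in the objective):

$$
\begin{aligned}
L(\theta;\theta_t) &= \mathbb{E}\left[\ell\left(\lambda \log \frac{p_{\theta}(y\mid x)}{p_{\theta_t}(y\mid x)} - \lambda \log \frac{p_{\theta}(y'\mid x)}{p_{\theta_t}(y'\mid x)}\right)\right],\\
\nabla_{\theta} L\big|_{\theta=\theta_t} &= \lambda\,\ell'(0)\,\mathbb{E}\left[\nabla_{\theta}\log p_{\theta}(y\mid x) - \nabla_{\theta}\log p_{\theta}(y'\mid x)\right]\Big|_{\theta=\theta_t}.
\end{aligned}
$$

For the logistic loss, $\ell'(0) = -\tfrac{1}{2} \neq 0$, so even though the argument of $\ell$ is 0 at $\theta = \theta_t$, the gradient vanishes only when the ground-truth and synthetic responses are already indistinguishable to the model.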
https://github.com/uclaml/SPIN/blob/e84b7be111b41b388367e591bdc23e327725c869/spin/alignment/trainer.py#L405
In the spin_loss definition, at step 0 the loss starts at a fixed value of 0.6931 when p_theta equals p_theta_t.
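(For reference, 0.6931 matches $\log 2$ exactly; a quick standalone check in plain Python, separate from the repo code:)

```python
import math

# Logistic loss from the paper: l(t) = log(1 + exp(-t)).
def logistic_loss(t: float) -> float:
    return math.log(1.0 + math.exp(-t))

# With p_theta == p_theta_t the argument is 0, so the step-0 value is log(2).
print(logistic_loss(0.0))  # 0.6931471805599453
print(math.log(2.0))       # 0.6931471805599453
```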
I know that the initial loss is not 0 in the actual code, but I assumed that was just due to how it is calculated in practice. Still, the value of the formula in the paper is 0 in this case, isn't it?
First, the algorithm requires a monotonically decreasing and convex function $\ell$, and $\ell(0) = \log(2) \approx 0.6931$; therefore, the value of this formula is not 0 when its argument is 0. Second, the progress beyond $\theta_t$ depends not on the loss value but on its gradient. Let us know if there are any questions.
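To illustrate that second point, here is a minimal toy sketch (not the SPIN trainer; a hypothetical scalar $\theta$ with stand-in log-probabilities, using PyTorch autograd). It shows that the loss value at $\theta = \theta_t$ is the constant $\log 2$ while its gradient is nonzero, so a gradient step still updates $\theta$:

```python
import torch

# Toy stand-in for log p_theta(y | x): a scalar theta with a quadratic "log-probability".
def logp(param, target):
    return -(param - target) ** 2

theta_t = torch.tensor(0.0)                             # opponent player from round t
theta = theta_t.clone().requires_grad_(True)            # main player being trained
y_real, y_syn = torch.tensor(1.0), torch.tensor(-1.0)   # hypothetical real / synthetic responses
lam = 1.0

# SPIN-style argument: lambda * (log-ratio on the real response - log-ratio on the synthetic one).
t = lam * ((logp(theta, y_real) - logp(theta_t, y_real))
           - (logp(theta, y_syn) - logp(theta_t, y_syn)))
loss = torch.log1p(torch.exp(-t))  # logistic loss l(t) = log(1 + exp(-t))
loss.backward()

print(loss.item())        # 0.6931... = log(2): the argument is 0 at theta == theta_t
print(theta.grad.item())  # -2.0: nonzero, so a gradient step moves theta away from theta_t
```

The value sits at $\log 2$ only at the single point $\theta = \theta_t$; the gradient there pushes probability mass toward the real response and away from the synthetic one, which is exactly the progress from round $t$ to round $t+1$.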
