The training problem
I found that STDiT-v3 has 8 output channels, but during training only the first 4 channels enter the loss function.
Thanks. Could you help me pinpoint where you identified this? I noticed the 8 channels too, but I couldn't locate where the channel selection happens.
Open-Sora v1.2 is trained with the rectified flow scheduler. On lines 102-107, only the first 4 channels are used in the loss function.
In the DiT training code, the purpose of predicting sigma is to learn both the mean and the variance of the noise, with the variance trained via a KL loss against the gold standard. The code here sets pred_sigma to True by default, but directly uses the mean (the first 4 channels) as the predicted noise.
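For reference, a minimal sketch of the channel selection being described (shapes and names are assumptions for illustration, not the actual Open-Sora code): the 8-channel output is split in half, and only the mean half is regressed against the target while the variance half is discarded, so no KL/variational term is ever computed:

```python
import numpy as np

def training_loss(model_output, target):
    """Sketch of the loss described above.

    model_output: (B, 8, T, H, W) -- mean and (log-)variance stacked on dim 1
    target:       (B, 4, T, H, W) -- regression target (noise or velocity)
    """
    # Split the 8 channels into the mean half and the variance half.
    mean, log_var = np.split(model_output, 2, axis=1)  # each (B, 4, T, H, W)
    # log_var is never used: no KL / vlb loss is computed on the variance.
    return np.mean((mean - target) ** 2)

rng = np.random.default_rng(0)
out = rng.normal(size=(2, 8, 4, 8, 8))
tgt = rng.normal(size=(2, 4, 4, 8, 8))
print(training_loss(out, tgt))
```

In the original DiT recipe, by contrast, the second half would feed a variational-bound term (with the mean detached) so the predicted variance is actually trained.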
The released model also has 8 output channels (mean and variance). Why configure the model to predict both mean and variance, but never compute a loss on the variance?
I would also like to know. @JThh, is there any reason for doing this?
This issue is stale because it has been open for 7 days with no activity.
So why? What are the output dimensions, and what is the input?
I had the same doubt. Why set out_channel=8 but use only the mean (the first 4 channels) and drop the variance half as a bias? Maybe latent_z is updated by the gradient without that bias:
```python
dt = timesteps[i] - timesteps[i + 1] if i < len(timesteps) - 1 else timesteps[i]
dt = dt / self.num_timesteps
z = z + v_pred * dt[:, None, None, None, None]
```
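The snippet above is a plain Euler step of the rectified-flow ODE. A self-contained sketch (hypothetical shapes; `v_pred` stands in for the first 4 output channels of the model):

```python
import numpy as np

def euler_step(z, v_pred, timesteps, i, num_timesteps):
    """One reverse-flow Euler update, mirroring the snippet above.

    z:         (B, C, T, H, W) current latent
    v_pred:    (B, C, T, H, W) predicted velocity (the mean half of the output)
    timesteps: list of per-batch timestep arrays of shape (B,), descending
    """
    if i < len(timesteps) - 1:
        dt = timesteps[i] - timesteps[i + 1]
    else:
        dt = timesteps[i]  # last step integrates the remaining time down to 0
    dt = dt / num_timesteps  # normalize to [0, 1] flow time
    return z + v_pred * dt[:, None, None, None, None]

B, C, T, H, W = 2, 4, 4, 8, 8
z = np.zeros((B, C, T, H, W))
v = np.ones((B, C, T, H, W))            # constant velocity field for illustration
ts = [np.full(B, 1000.0), np.full(B, 500.0)]
z = euler_step(z, v, ts, 0, 1000)       # dt = 500/1000 = 0.5
z = euler_step(z, v, ts, 1, 1000)       # dt = 500/1000 = 0.5
print(z.mean())  # 1.0: v=1 integrated over total flow time 1
```

Note that only `v_pred` appears in the update, which is consistent with the variance channels being unused at sampling time as well.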
This comes from PixArt-alpha, which aligns with the original DiT. (https://github.com/PixArt-alpha/PixArt-sigma/issues/81#issuecomment-2100610843)
I see.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.