Question about noise schedule parametrization & samplers

Open jaywhang opened this issue 3 years ago • 2 comments

First of all, thanks for all the work; I appreciate that we have all these pretrained models freely available as open source.

I'm a bit confused by the relationship between sampler implementations and model checkpoints. If someone could clarify the following, that'd be much appreciated (and perhaps some of these can become part of the user documentation) -- especially Q2.

Q1. Where can I find the exact noise schedule used during training for each checkpoint? From what I can tell, some of the released Stable Diffusion U-Nets are using the original DDPM schedule (specific betas with 1000 steps), but I couldn't find any official documentation for this.

Q2. Continuing from Q1, how are we supposed to use different samplers without the explicit specification of the noise schedule used during training? Depending on the schedule, $\sigma_t$ as a function of $t$ can be different for each checkpoint -- which in turn may affect the meaning of "step" for different samplers. Or is it currently the case that all models use the same exact schedule (which I'm asking for in Q1), and all sampler implementations are implicitly assuming that?
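To make the Q2 concern concrete, here is a small sketch (my own toy code, not diffusers internals) showing that two beta schedules I believe are in common use -- the original DDPM "linear" schedule and the "scaled_linear" schedule attributed to Stable Diffusion -- map the same discrete index $t$ to very different noise levels $\sigma_t$. The specific beta endpoints are the widely cited defaults and should be treated as illustrative:

```python
import math

def make_betas(schedule, n=1000):
    # Two schedules commonly seen in diffusers scheduler configs.
    # Endpoint values are the commonly cited defaults (illustrative).
    if schedule == "linear":            # original DDPM
        lo, hi = 1e-4, 0.02
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    if schedule == "scaled_linear":     # Stable Diffusion
        lo, hi = math.sqrt(0.00085), math.sqrt(0.012)
        return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]
    raise ValueError(schedule)

def sigmas(betas):
    # sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t), the noise level
    # a k-diffusion-style sampler would associate with step t.
    out, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        out.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return out

sig_ddpm = sigmas(make_betas("linear"))
sig_sd = sigmas(make_betas("scaled_linear"))
# The same index t corresponds to different sigma_t under the two
# schedules, so "step t" means different things per checkpoint.
```

So a sampler that only receives a step index, without the training-time schedule, cannot recover the right $\sigma_t$.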

Q3. I may be mistaken here, but it seems like the library assumes epsilon prediction with discrete times. Why were these assumptions made? (somewhat related: #1308) A concrete example would be VDM [1], with the U-net conditioned on log(SNR) instead of $t$ with continuous time steps. It's not immediately clear to me whether the library can support this parametrization. My cursory understanding is that much of the existing sampler code will need to be rewritten.
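For concreteness, the VDM-style conditioning I have in mind is just a continuous function of the signal level rather than a discrete index (a hypothetical sketch, not diffusers code):

```python
import math

def log_snr(alpha_bar):
    # VDM-style conditioning: the U-Net receives log(SNR) = log(alpha_bar / (1 - alpha_bar))
    # instead of a discrete timestep index; alpha_bar must lie in (0, 1).
    return math.log(alpha_bar / (1.0 - alpha_bar))
```

A scheduler API built around integer timesteps has no obvious place for a quantity like this, which is why I suspect the sampler code would need rewriting.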

I'm not very familiar with the code base, so if there is any misunderstanding please correct me.

[1] Kingma, Diederik, et al. "Variational diffusion models." Advances in neural information processing systems 34 (2021): 21696-21707.

jaywhang avatar Dec 11 '22 06:12 jaywhang

Hi @jaywhang, I think you're raising some super relevant questions here!

It seems that there is a lot of work in progress regarding parameterization (see #818, #1010, #1505). However, these efforts mostly concern using trained models for sampling at the moment, and assume that you already have a U-Net denoiser trained with the appropriate parameterization at hand.

However, the unconditional DDPM training example currently supports epsilon-prediction as well as x-prediction with SNR weighting from the distillation paper. I would love to see it evolve to cover v-prediction as well.

Regarding your questions, I think the correct workflow when training your own model with diffusers is to save an entire pipeline (at least a U-Net model and a noise scheduler). The scheduler config dict will have prediction_type, beta_schedule, etc. tags that reflect the training configuration. Then, at sampling time, you'll be able to switch schedulers using from_config (see #1286) and thus keep this information while using another sampler.
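The mechanism, as I understand it, can be sketched with a toy example (my own illustrative code, not the actual diffusers implementation): each scheduler carries a config dict of training-time settings, and a from_config-style constructor lets a different sampler pick those settings up.

```python
# Toy model of the from_config workflow: training-time schedule settings
# survive a switch to a different sampler class. Class names are made up.

class ToyDDPMScheduler:
    def __init__(self, num_train_timesteps=1000, beta_schedule="linear",
                 prediction_type="epsilon"):
        # The config dict records how the model was trained.
        self.config = dict(num_train_timesteps=num_train_timesteps,
                           beta_schedule=beta_schedule,
                           prediction_type=prediction_type)

class ToyFancySampler(ToyDDPMScheduler):
    @classmethod
    def from_config(cls, config):
        # A real scheduler would keep only the fields it understands;
        # here we just re-instantiate with the saved settings.
        return cls(**config)

# Scheduler saved alongside a checkpoint trained with SD-style settings:
trained = ToyDDPMScheduler(beta_schedule="scaled_linear",
                           prediction_type="v_prediction")
# Swap in another sampler without losing the training configuration:
sampler = ToyFancySampler.from_config(trained.config)
```

The actual diffusers equivalent is one line of the form `pipe.scheduler = SomeScheduler.from_config(pipe.scheduler.config)`.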

It might also be nice to save a prediction_type tag in the model config dict, so we know what the output target of a trained model is. Maybe @patrickvonplaten can help with this :)

leopoldmaillard avatar Dec 13 '22 10:12 leopoldmaillard

Linking some maybe relevant discussions here: https://github.com/huggingface/diffusers/issues/1308

Overall, these are some very good questions and I think @leopoldmaillard answered some/all of them already very nicely. I can mostly only agree!

Some more comments:

Q1) As @leopoldmaillard said, if you train a checkpoint with diffusers and save it, then all relevant state (beta schedule, ...) will be saved as scheduler/config.json. E.g. see here: https://huggingface.co/prompthero/openjourney/tree/main/scheduler. We sadly cannot provide this for the original .ckpt models since we didn't train those.
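Such a scheduler config looks roughly like this (the field values below are the typical Stable Diffusion settings and are only illustrative; check the linked file for the actual contents):

```json
{
  "_class_name": "PNDMScheduler",
  "num_train_timesteps": 1000,
  "beta_start": 0.00085,
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "prediction_type": "epsilon"
}
```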

Q2) & Q3) Here there are a couple of things:

    1. Some schedulers are compatible with each other and some aren't; we already have some nice/relevant logic for this. E.g. see: https://huggingface.co/docs/diffusers/using-diffusers/schedulers
    2. It's correct that diffusers is currently implemented quite strongly around "the model was trained with discrete time steps", because Stable Diffusion and co. have been trained this way. As you can see in this very nice discussion: https://github.com/huggingface/diffusers/issues/1308, there are lots of difficult design decisions to play around with. We currently aren't 100% satisfied with the scheduler API, but it seems like the "best of the bad" options, especially taking backwards compatibility into account.
    3. The models (U-Nets) themselves can already very well be trained just on noise for timesteps

patrickvonplaten avatar Dec 16 '22 15:12 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 10 '23 15:01 github-actions[bot]