
[feat] Add option to use (Scheduled) Huber Loss in all diffusion training pipelines to improve resilience to data corruption and better image quality

kabachuha opened this issue 1 year ago · 8 comments

Is your feature request related to a problem? Please describe.

Diffusion models are known to be vulnerable to outliers in their training data. It is therefore possible for a relatively small number of corrupting samples to "poison" a model, leaving it unable to produce the desired output; this has been exploited by programs such as Nightshade.

One reason for this vulnerability may lie in the commonly used L2 (Mean Squared Error) loss, a fundamental part of diffusion/flow models, which is itself highly sensitive to outliers; see Anscombe's quartet for some examples.

Describe the solution you'd like.

In our new paper (also my first paper 🥳), "Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber Loss" (https://arxiv.org/abs/2403.16728), we present a novel scheme to improve score-matching models' resilience to corruption of parts of their datasets, introducing the Huber loss (long used in robust regression, e.g. when restoring a contour in highly noised computer vision tasks) and a Scheduled Huber loss. The Huber loss behaves exactly like L2 around zero and like L1 (Mean Absolute Error) as its argument tends towards infinity, so it penalizes outliers less harshly than the quadratic MSE. However, a common concern is that it may hinder the model's ability to learn diverse concepts and small details. That is why we introduce the Scheduled Pseudo-Huber loss with a decreasing parameter: the loss behaves like the Huber loss at early reverse-diffusion timesteps, when the image is only beginning to form and is most vulnerable to being led astray, and like L2 at the final timesteps, to learn the fine details of the images.
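As a rough sketch of the idea (my own minimal illustration, not the paper's exact implementation: the exponential schedule shape, the `delta_min`/`delta_max` constants, and the function names are all assumptions):

```python
import math

def pseudo_huber_loss(residual, delta):
    # Pseudo-Huber: behaves like residual^2 / 2 for |residual| << delta
    # (L2-like) and like delta * |residual| for |residual| >> delta
    # (L1-like, so large outlier residuals are penalized roughly linearly).
    return delta ** 2 * (math.sqrt(1 + (residual / delta) ** 2) - 1)

def scheduled_delta(t, num_timesteps, delta_min=0.01, delta_max=10.0):
    # Hypothetical exponentially decreasing schedule: small delta (robust,
    # Huber-like) at high-noise timesteps, where the image is only beginning
    # to form during sampling, and large delta (near-L2) at low-noise
    # timesteps, where fine details are learned.
    return delta_max * (delta_min / delta_max) ** (t / (num_timesteps - 1))

T = 1000
loss_fine = pseudo_huber_loss(0.1, scheduled_delta(0, T))        # near 0.1**2 / 2 (L2-like)
loss_robust = pseudo_huber_loss(5.0, scheduled_delta(T - 1, T))  # far below 5**2 / 2 (robust)
```

In a training loop this kind of per-timestep loss would replace the usual `F.mse_loss(model_pred, target)` call, reduced over the batch.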

Describe alternatives you've considered.

We ran tests with the Pseudo-Huber loss, the Scheduled Pseudo-Huber loss, and L2, and SPHL beats the rest in nearly all cases. (In the plots, resilience is the similarity to clean pictures on partially corrupted runs minus the similarity to clean pictures on clean runs; see the paper for more details.)

[Figure: resilience plots from "Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber Loss" (arXiv:2403.16728)]

Other alternatives are data filtration, image recaptioning (which may itself be vulnerable to adversarial noise), and/or "diffusion purification". These would require additional resources and may be impractical when training large models; moreover, false negatives may be drastic outliers with high corrupting potential.

👀 We also found that the Diffusers LCM training script uses the wrong coefficient proportionality in its Pseudo-Huber loss (the same mistake is in OpenAI's original article about LCMs), giving it the wrong asymptotics as its parameter tends to 0 or to infinity; the impact is most negative when the parameter is timestep-scheduled. This would be nice to fix as well (maybe with a compatibility option for previously trained LCMs).
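For illustration, assuming the script computes the loss as `sqrt(d**2 + c**2) - c` (the pseudo-Huber expression with its `c**2` prefactor replaced by `c`; the function names here are mine), the difference in asymptotics is easy to check numerically:

```python
import math

def script_loss(d, c):
    # Assumed form from the LCM distillation script: sqrt(d^2 + c^2) - c,
    # i.e. the pseudo-Huber loss divided by c.
    return math.sqrt(d * d + c * c) - c

def pseudo_huber(d, c):
    # Pseudo-Huber with the correct coefficient: c^2 * (sqrt(1 + (d/c)^2) - 1).
    return c * c * (math.sqrt(1 + (d / c) ** 2) - 1)

d = 0.1  # residual
# As c grows, the correctly scaled pseudo-Huber tends to the quadratic
# d^2 / 2, while the script's form tends to 0 like d^2 / (2c) instead.
big_c = pseudo_huber(d, 1e4)        # ~ d^2 / 2
big_c_script = script_loss(d, 1e4)  # ~ 0
```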

We show that our scheme works in the text-to-speech diffusion domain as well, further supporting the claims.

[Figure: text-to-speech results from the paper (arXiv:2403.16728)]

Additional context.

As a side effect (which I remembered after publishing, while looking through the sampled pictures), the Huber loss also seems to improve the "vibrancy" of pictures on clean runs, though the mechanism behind this is unknown (maybe better concept disentanglement?). I think it would be nice to include, if only for this effect 🙃

[Image: vanilla vs. Huber loss sample comparison (vanilla_vs_huber)]


As I was the one behind this idea and ran the experiments with a modified Diffusers library, I have all the code at hand and will make a PR soon.


We also tried extensively to prove a theorem claiming that when corrupting samples are present in the dataset (i.e. the third moment, the "skewness", of the distribution is greater than zero), using the Scheduled Pseudo-Huber loss with a timestep-decreasing parameter yields a smaller KL divergence between the clean data and the distribution generated by an ideal score-matching (e.g. diffusion) model than using L2. However, there was a mistake in the proof and we got stuck. If you'd like to take a look at our proof attempt, PM me.
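Roughly, the conjectured statement can be written as follows (my paraphrase, not the paper's exact formulation; $p_{\text{data}}$ is the clean data distribution and $q_{\ell}$ the distribution generated by an ideal score-matching model trained with loss $\ell$):

```latex
\text{If skewness} \;=\; \mathbb{E}\!\left[\left(\tfrac{X-\mu}{\sigma}\right)^{3}\right] > 0
\quad\text{(corrupted, skewed data), then}\quad
D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\big\|\, q_{\mathrm{SPHL}}\right)
\;<\;
D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\big\|\, q_{L_2}\right).
```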

kabachuha avatar Mar 27 '24 09:03 kabachuha

Thank you for your contributions. But I think just releasing your code independently will be more justified here.

If you want to add something to research_projects, that's also more than welcome.

> 👀 Also we found that the Diffusers LCM training script has a wrong Pseudo-Huber Loss coefficient proportionality (and this mistake was in the original OpenAI's article about LCMs), resulting in wrong asymptotics as its parameter tends to 0 or to infinity, resulting in the most negative impact when it is timestep-scheduled. This would be nice to fix as well (maybe adding a compatibility option for previously made LCMs)

You are welcome to open a PR for this, however.

sayakpaul avatar Mar 27 '24 13:03 sayakpaul

Ccing @kashif and @patil-suraj for awareness.

sayakpaul avatar Mar 27 '24 13:03 sayakpaul

Agree! I also found Huber to work well in the time-series setting back in the day: https://github.com/zalandoresearch/pytorch-ts/blob/master/pts/modules/gaussian_diffusion.py#L251-L252

kashif avatar Mar 27 '24 13:03 kashif

@sayakpaul Very well, then I'll make a repo with the modified training scripts and instructions to install Diffusers as a dependency, plus a PR for the research_projects folder, and a separate PR for the OpenAI fix; I'll ping them in this issue.

kabachuha avatar Mar 27 '24 14:03 kabachuha

Of course thank you! Happy to help promote your work too!

sayakpaul avatar Mar 27 '24 14:03 sayakpaul

@kashif btw, using the bare smooth L1 loss won't help in the case of my proposed changes, just as with OpenAI's formula for the PHL.

If you read the torch docs for the smooth L1 loss, it is torch's Huber loss divided by the constant beta: https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss

However, this loss has vastly different asymptotics:

> - As beta -> 0, Smooth L1 loss converges to L1Loss, while HuberLoss converges to a constant 0 loss. When beta is 0, Smooth L1 loss is equivalent to L1 loss.
> - As beta -> +∞, Smooth L1 loss converges to a constant 0 loss, while HuberLoss converges to MSELoss.
> - For Smooth L1 loss, as beta varies, the L1 segment of the loss has a constant slope of 1. For HuberLoss, the slope of the L1 segment is beta.


While this may not be noticeable with a constant parameter, it has a profound impact in the case of our scheduled Huber loss, which needs to stay close to L2 at the first forward-diffusion timesteps.

The Pseudo-Huber loss, defined via a square root (see Wikipedia), however, satisfies both desired asymptotics.
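The quoted asymptotics are easy to verify numerically; below I re-implement the piecewise formulas from the torch docs in plain Python for illustration (the function names are mine):

```python
import math

def smooth_l1(d, beta):
    # torch.nn.SmoothL1Loss: 0.5 * d^2 / beta if |d| < beta, else |d| - 0.5 * beta
    return 0.5 * d * d / beta if abs(d) < beta else abs(d) - 0.5 * beta

def huber(d, delta):
    # torch.nn.HuberLoss: 0.5 * d^2 if |d| < delta, else delta * (|d| - 0.5 * delta)
    return 0.5 * d * d if abs(d) < delta else delta * (abs(d) - 0.5 * delta)

def pseudo_huber(d, delta):
    # Pseudo-Huber: smooth everywhere, unlike the two piecewise losses above
    return delta ** 2 * (math.sqrt(1 + (d / delta) ** 2) - 1)

d = 0.1
# Large parameter: HuberLoss and pseudo-Huber tend to the quadratic d^2 / 2,
# while SmoothL1 collapses toward 0.
# Tiny parameter: SmoothL1 tends to plain L1 (|d|), while HuberLoss and
# pseudo-Huber both vanish like delta * |d|.
```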

Additionally, torch's native Huber loss and smooth L1 loss are piecewise and not twice differentiable: differentiating the parabola yields a linear function, differentiating the absolute value yields a constant, and there is a cusp at their intersection. Having a twice-continuously-differentiable loss was one of the key assumptions in our attempted theorem's proof (and may help others later).

kabachuha avatar Mar 28 '24 04:03 kabachuha

There may be an even better schedule for delta depending on the SNR; see the discussion in the kohya_ss repo.

kabachuha avatar Apr 01 '24 11:04 kabachuha

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 26 '24 15:04 github-actions[bot]