
[WIP] Stable Audio integration

Open ylacombe opened this issue 1 year ago • 4 comments

What does this PR do?

Stability AI recently open-sourced Stable Audio 1.0, which can be run using their toolkit library.

Unlike most diffusion models, the diffusion process here operates on a 1D latent signal, so I had to depart a bit from the other models.

For now, I've drafted how the pipeline will work, namely (a rough code sketch follows the list):

  1. Project the input 1D waveform signal (which can be noise) to a latent space
  2. Project the input text description plus two floats that indicate the start and end of the audio into the latent space
  3. Diffuse using a transformer-like model
  4. Decode from the latent space to the waveform space.
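
To make the flow concrete, here is a minimal sketch of those four steps; the component names (vae, text_encoder, projection, transformer, scheduler) and shapes are placeholders, not the final API:

    import torch

    def generate(vae, text_encoder, projection, transformer, scheduler,
                 prompt_ids, audio_start_in_s, audio_end_in_s, waveform=None):
        # 1. project the input 1D waveform into the latent space, or start from noise
        latents = vae.encode(waveform) if waveform is not None else torch.randn(1, 64, 1024)
        # 2. project the text description + start/end seconds into the conditioning space
        text_hidden_states = text_encoder(prompt_ids)
        cond = projection(text_hidden_states, audio_start_in_s, audio_end_in_s)
        # 3. diffuse with a transformer-like model operating on the 1D latent signal
        for t in scheduler.timesteps:
            noise_pred = transformer(latents, t, encoder_hidden_states=cond)
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        # 4. decode from the latent space back to the waveform space
        return vae.decode(latents)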

For this to work, I'm waiting for DAC to be integrated into transformers in this PR, in order to reuse its encoder and decoder code for the VAE.

Left TODO

  • [ ] Validate modeling and pipeline design
  • [ ] Integrate the VAE
  • [ ] Convert the weights
  • [ ] Verify 1-to-1 correspondence
  • [ ] Write tests

cc @sayakpaul and @yiyixuxu !

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

ylacombe avatar Jun 26 '24 17:06 ylacombe

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ylacombe

thanks for the PR!

Overall, it looks pretty well aligned with the diffusers design! Here is my initial feedback:

  1. can we make the projection_model part of the transformer? I think it just contains projection layers applied to prompt_embeds and to audio_start_in_s and audio_end_in_s, so IMO it is a natural part of the transformer model: all our transformer/UNet models apply some sort of projection to various conditioning inputs, e.g., image_size, time, text, style, etc. Is there a special reason you want to keep the projection layers outside of the transformer?

  2. if we can agree on 1, then I think we can change encode_prompt_and_seconds to an encode_prompt method that returns prompt_embeds and negative_prompt_embeds (a rough sketch of both points follows below)
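
For illustration only, here is a minimal sketch of what 1. and 2. could look like; every class/attribute name below is an assumption made for the example, not the PR's actual code:

    import torch
    import torch.nn as nn

    class AudioTransformerWithProjections(nn.Module):  # hypothetical name
        def __init__(self, text_dim=768, inner_dim=1024):
            super().__init__()
            # projections that currently live in the separate projection model
            self.text_proj = nn.Linear(text_dim, inner_dim)
            self.seconds_proj = nn.Linear(2, inner_dim)  # (audio_start_in_s, audio_end_in_s)
            self.blocks = nn.Identity()  # denoising blocks omitted in this sketch

        def forward(self, hidden_states, encoder_hidden_states,
                    audio_start_in_s, audio_end_in_s):
            # condition on text and on start/end seconds, the same way other
            # transformer/UNet models condition on time, image size, style, ...
            cond = self.text_proj(encoder_hidden_states)
            seconds = torch.stack([audio_start_in_s, audio_end_in_s], dim=-1)
            cond = cond + self.seconds_proj(seconds).unsqueeze(1)
            return self.blocks(hidden_states)  # cond would feed the attention blocks here

    # 2. encode_prompt would then only handle text:
    # encode_prompt(prompt, negative_prompt, ...) -> (prompt_embeds, negative_prompt_embeds)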

yiyixuxu avatar Jun 26 '24 19:06 yiyixuxu

Hey @yiyixuxu, thanks for the feedback here!

I think the main reason for the separate projection model is that encode_prompt_and_seconds also takes care of CFG and the negative tensors:

  • the cross-attention negative hidden states are set to 0 if there is CFG but no negative prompt, and this is done after the prompts and seconds have been projected to the latent space (sketched below)

Is this something we'd want to do in the transformer? IMO, no, but I'm happy to change the way it's implemented!
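
For clarity, a minimal sketch of the zeroing behaviour described above, in isolation (an illustrative helper, not code from the PR):

    import torch

    def prepare_negative_states(prompt_embeds, negative_prompt_embeds=None,
                                do_classifier_free_guidance=True):
        # prompt_embeds are the conditioning states *after* the projection step
        if do_classifier_free_guidance:
            if negative_prompt_embeds is None:
                # CFG without a negative prompt: use all-zero negative states
                negative_prompt_embeds = torch.zeros_like(prompt_embeds)
            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
        return prompt_embeds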

ylacombe avatar Jun 27 '24 08:06 ylacombe

@ylacombe Thanks for explaining! Feel free to continue with your plan and convert the weights; we can refactor later if needed :)

IMO, ideally we do want to move the projection layers into the transformer, but since the original implementation is structured this way, let's keep it as is for now. I can help look into this later.

One way I think we can go about this is to make audio_start_in_s and audio_end_in_s tensors (e.g., audio_end_in_s = torch.tensor([10.0])) and handle the CFG in the pipeline normally, something like this:

    # (conditions here are illustrative — just one way the CFG batching could look)
    if self.do_classifier_free_guidance and negative_prompt is not None:
        audio_end_in_s = torch.cat([audio_end_in_s, audio_end_in_s], dim=0)
    elif self.do_classifier_free_guidance:
        # no negative prompt given: use 0 seconds as the "negative" value
        neg_audio_end_in_s = torch.tensor([0.0])
        audio_end_in_s = torch.cat([neg_audio_end_in_s, audio_end_in_s], dim=0)

This way, when these arguments reach the transformer, they already contain the CFG information. But I'm just making this up here; I don't know if it would work, so let's not worry about it and continue with your implementation :)

yiyixuxu avatar Jun 27 '24 09:06 yiyixuxu

Thank you so much @ylacombe for your hard work here. Navigating through 1e14 comments and addressing them like you did is NO SMALL FEAT. Thank you once again!

sayakpaul avatar Jul 30 '24 09:07 sayakpaul

Thank you @ylacombe! Is there an example of how to use initial_audio_waveforms somewhere? Is that for extending or zero-shot generation?

tin2tin avatar Aug 08 '24 06:08 tin2tin