[WIP] Stable Audio integration
What does this PR do?
Stability AI recently open-sourced Stable Audio 1.0, which can be run using their toolkit library.
Unlike most diffusion models, the diffusion process here operates on a 1D latent signal, so I had to depart a bit from the design of other models.
For now, I've drafted how the pipeline will work, namely:
- Project the input 1D waveform signal (which can be noise) to a latent space
- Project the input text description + two float numbers that indicate the beginning and end of the audio into the latent space
- Diffuse using a transformer-like model
- Decode from the latent space to the waveform space.
For this to work, I'm waiting for DAC to be integrated into transformers in this PR, in order to use the encoder and decoder code for the VAE.
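To make the intended flow more concrete, here is a rough sketch of the denoising loop (every name here is a hypothetical placeholder rather than the final API; `vae` stands in for the DAC-based encoder/decoder mentioned above):

```python
def sketch_pipeline(vae, projection_model, transformer, scheduler,
                    waveform_or_noise, prompt_embeds,
                    audio_start_in_s, audio_end_in_s):
    # Hypothetical sketch of the four steps above, not the final pipeline code.
    latents = vae.encode(waveform_or_noise)            # 1. waveform (or noise) -> 1D latent
    cond = projection_model(                           # 2. text + start/end seconds -> latent space
        prompt_embeds, audio_start_in_s, audio_end_in_s
    )
    for t in scheduler.timesteps:                      # 3. transformer-based diffusion
        noise_pred = transformer(latents, t, encoder_hidden_states=cond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return vae.decode(latents)                         # 4. latent -> waveform
```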
Left TODO
- [ ] Validate modeling and pipeline design
- [ ] Integrate the VAE
- [ ] Convert the weights
- [ ] Verify 1-to-1 correspondence
- [ ] Write tests
cc @sayakpaul and @yiyixuxu!
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@ylacombe
thanks for the PR!
Overall, it looks pretty aligned with diffusers' design! Here is my initial feedback:
1. Can we make the `projection_model` part of the `transformer`? I think it just contains projection layers on `prompt_embeds`, `audio_start_in_s` and `audio_end_in_s`, so IMO it is a natural part of the transformer model: all our transformer/UNet models apply some sort of projection on various condition inputs, e.g., image_size, time, text, style, etc. Is there a special reason that you want to keep the projection layers outside of the `transformer`?
2. If we can agree on 1, then I think we can change the `encode_prompt_and_seconds` method to an `encode_prompt` method that returns `prompt_embeds` and `negative_prompt_embeds`.
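To illustrate point 1, here is a minimal sketch of what owning the projections inside the transformer could look like (module and layer names are hypothetical and only meant to show the idea):

```python
import torch
import torch.nn as nn

class StableAudioTransformerSketch(nn.Module):
    # Hypothetical: the transformer owns its condition projections,
    # the way other diffusers transformer/UNet models project time/text/etc.
    def __init__(self, text_dim: int = 768, inner_dim: int = 1024):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, inner_dim)
        self.start_proj = nn.Linear(1, inner_dim)  # audio_start_in_s (scalar per sample)
        self.end_proj = nn.Linear(1, inner_dim)    # audio_end_in_s (scalar per sample)

    def project_conditions(self, prompt_embeds, audio_start_in_s, audio_end_in_s):
        # prompt_embeds: (batch, seq, text_dim); seconds: (batch,)
        text = self.text_proj(prompt_embeds)
        start = self.start_proj(audio_start_in_s.unsqueeze(-1)).unsqueeze(1)
        end = self.end_proj(audio_end_in_s.unsqueeze(-1)).unsqueeze(1)
        return torch.cat([text, start, end], dim=1)  # (batch, seq + 2, inner_dim)
```

With something like this in place, the pipeline's `encode_prompt` would only need to return `prompt_embeds` and `negative_prompt_embeds`, as suggested in point 2.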
Hey @yiyixuxu, thanks for the feedback here!
I think the main reason for the separate projection model is that `encode_prompt_and_seconds` also takes care of CFG and the negative tensors:
- the negative cross-attention hidden states are set to 0 when CFG is used without negative prompts, and this happens after the prompts and seconds have been projected to the latent space
Is this something that we'd want to do in the transformer? IMO, no, but I'm happy to change the way it's implemented!
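For context, a simplified sketch of the behaviour described above (variable names are approximate, not the actual pipeline code):

```python
import torch

def encode_conditions_sketch(projection_model, prompt_embeds, negative_prompt_embeds,
                             audio_start_in_s, audio_end_in_s,
                             do_classifier_free_guidance):
    # Project prompt + seconds first, then handle CFG on the projected tensors.
    cond = projection_model(prompt_embeds, audio_start_in_s, audio_end_in_s)
    if do_classifier_free_guidance:
        if negative_prompt_embeds is None:
            # negative cross-attention states are zeroed *after* projection
            negative_cond = torch.zeros_like(cond)
        else:
            negative_cond = projection_model(
                negative_prompt_embeds, audio_start_in_s, audio_end_in_s
            )
        cond = torch.cat([negative_cond, cond], dim=0)
    return cond
```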
@ylacombe Thanks for explaining! Feel free to continue with your plan and convert the weights; we can refactor later if needed :)
IMO, ideally, we do want to move the projection layers into the transformer, but since the original implementation is structured this way, let's keep it as is for now. I can help look into this later.
One way I think we can go about this is to make `audio_start_in_s` and `audio_end_in_s` tensors (e.g., `audio_end_in_s = torch.tensor([10.0])`) and handle the CFG in the pipeline normally, something like this:
```python
if self.do_classifier_free_guidance:
    audio_end_in_s = torch.cat([audio_end_in_s, audio_end_in_s], dim=0)
elif ...:
    neg_audio_end_in_s = torch.tensor([0])
    audio_end_in_s = torch.cat([neg_audio_end_in_s, audio_end_in_s], dim=0)
```
This way, when these arguments reach the transformer, they already contain the CFG info. But I'm just making this up here; I don't know whether it would work, so let's not worry about it and continue with your implementation :)
Thank you so much @ylacombe for your hard work here. Navigating through 1e14 comments and addressing them like you did is NO SMALL FEAT. Thank you once again!
Thank you @ylacombe! Is there an example of how to use `initial_audio_waveforms` somewhere? Is that for extending or zero-shot generation?