What does this PR do?

In #4549, @sanchit-gandhi implemented AudioLDM2 but without TTS support. In this PR, I convert the GigaSpeech checkpoint and implement a TTS pipeline for AudioLDM2.

Fixes # (issue)

Before submitting

[x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline?
[x] Did you read our philosophy doc (important for complex PRs)?
[x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Oct 12 '23 20:10 tuanh123789

@sayakpaul @sanchit-gandhi @patrickvonplaten Could anyone please review this? Thank you! You can find the model here: https://huggingface.co/anhnct/audioldm2_giga_speech/tree/main. The checkpoint for LJSpeech will be updated soon.

Oct 12 '23 20:10 tuanh123789

Looks like a great start @tuanh123789! Nice job on getting to grips with the AudioLDM2 logic. I'm wondering if we could actually update the existing pipeline to be compatible with the TTS model? Most of the code is the same, so the existing pipeline should be compatible with the TTS task; it's just the pre-processing that differs slightly. If you follow the tips I've left below, you should find this to be quite straightforward:

Let the VITS tokenizer do the heavy lifting of phonemization and pre-processing. This means we can use the same pre-processing logic as in our current pipeline

Add the learned embeddings to the projection model as a new attribute: the projection model adds the new learned embeddings to the VITS hidden-states

Then you just need to update the arguments to include transcription (which you've done nicely on this PR)
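To make the second tip concrete, here is a minimal sketch of adding learned per-position embeddings to encoder hidden states. It is not the diffusers implementation: plain Python lists stand in for tensors so it runs without dependencies, and the function name `add_learned_position_embedding`, the shapes, and the toy values are assumptions for illustration only.

```python
from typing import List

Vector = List[float]


def add_learned_position_embedding(
    hidden_states: List[Vector],
    learned_embedding: List[Vector],
) -> List[Vector]:
    """Add the learned vector for each position to that position's hidden state.

    hidden_states: one vector per token position (stand-in for VITS outputs).
    learned_embedding: one vector per position, up to a maximum sequence length.
    """
    if len(hidden_states) > len(learned_embedding):
        raise ValueError("sequence is longer than the learned embedding table")
    output = []
    for state, embedding in zip(hidden_states, learned_embedding):
        if len(state) != len(embedding):
            raise ValueError("hidden size mismatch between states and embeddings")
        # Element-wise sum, mirroring `hidden_states + embedding` on tensors.
        output.append([h + e for h, e in zip(state, embedding)])
    return output


# Hypothetical toy values: 2 positions, hidden size 2, table covering 3 positions.
states = [[1.0, 2.0], [3.0, 4.0]]
table = [[0.5, 0.5], [1.0, -1.0], [9.0, 9.0]]
shifted = add_learned_position_embedding(states, table)
```

In a real projection model the table would presumably be a trainable parameter stored on the module, so that it is saved and loaded together with the checkpoint.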

Let me know if you need a hand with docs and tests too! Otherwise I think you've laid the groundwork for a nice design here 🤗

Thank you for the review. I find your opinion quite reasonable, and I will try to implement it according to that design. I will inform you as soon as there are any changes.

Oct 16 '23 18:10 tuanh123789

Hi @sanchit-gandhi, I have updated the code based on your review and design. Could you take a look? Thank you 🤗

Oct 17 '23 20:10 tuanh123789

Hi @sanchit-gandhi, could you take a look?

Oct 21 '23 12:10 tuanh123789

@yiyixuxu @DN6 can you check out this PR?

Oct 25 '23 14:10 patrickvonplaten

@tuanh123789 Could you add a single slow test to verify TTS functionality? See https://github.com/huggingface/diffusers/blob/main/tests/pipelines/audioldm2/test_audioldm2.py

Oct 31 '23 06:10 DN6

OK, I'll add a slow test and let you know when it's done.

Nov 01 '23 18:11 tuanh123789

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Dec 27 '23 15:12 github-actions[bot]

@patrickvonplaten Sorry, I was a bit busy for a while. Can I reopen this PR? Thank you.

Mar 08 '24 10:03 tuanh123789

Sorry, the bot accidentally closed it. It's open now.

Mar 08 '24 10:03 sayakpaul

Thank you 🤗 @sayakpaul

Mar 08 '24 10:03 tuanh123789

Hi @DN6, I added a slow test for the AudioLDM2 TTS pipeline. Could you check it?

Mar 08 '24 18:03 tuanh123789

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Mar 08 '24 22:03 HuggingFaceDocBuilderDev

Hi @yiyixuxu, could you check this? Thank you 🤗

Mar 11 '24 07:03 tuanh123789

gentle ping @DN6 for any update on this PR

Mar 13 '24 12:03 tuanh123789

@DN6 can you do a final review?

Mar 19 '24 21:03 yiyixuxu

Hi @DN6, could you review this? Thank you 🤗

Mar 22 '24 05:03 tuanh123789

Sorry for the delay. Just some small clean up requests. I think the checkpoint might have to be updated to support the new suggested init arg use_learned_position_embedding as well.
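One way to read the checkpoint remark: if the new init argument defaults to `None`, configs saved before this change still load, while the TTS checkpoint's config sets the flag explicitly. The sketch below illustrates that idea only. `ProjectionConfig`, `config_from_dict`, and the `text_encoder_dim` key are hypothetical names, not the diffusers config machinery; only `use_learned_position_embedding` comes from the review.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class ProjectionConfig:
    """Hypothetical stand-in for a projection model's init arguments."""

    text_encoder_dim: int
    # New optional argument: None means "no learned position embedding",
    # which keeps configs saved before this change loadable.
    use_learned_position_embedding: Optional[bool] = None


def config_from_dict(raw: Dict[str, Any]) -> ProjectionConfig:
    """Build a config from a saved dict, defaulting any missing optional keys."""
    known = {f.name for f in fields(ProjectionConfig)}
    return ProjectionConfig(**{k: v for k, v in raw.items() if k in known})


# An older checkpoint config without the new key, and an updated TTS one with it.
old_config = config_from_dict({"text_encoder_dim": 512})
tts_config = config_from_dict(
    {"text_encoder_dim": 512, "use_learned_position_embedding": True}
)
```

Under this reading, only the TTS checkpoint's saved config would need the new key; existing checkpoints would fall back to the default.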

No problem. I'll update the code based on your review. Thank you 🤗

Mar 28 '24 07:03 tuanh123789

Hi @DN6, could you do a final review? 🤗

Apr 03 '24 16:04 tuanh123789

Nicely done! 👍🏽

Thanks for the support 🤗 Can we merge now?

Apr 04 '24 11:04 tuanh123789

Hi @patrickvonplaten @yiyixuxu @DN6. If everything is okay, can we merge? That would help me with my next step: fine-tuning this model for my country's language. 🤗 🤗 🤗

Apr 05 '24 15:04 tuanh123789

Add AudioLDM2 TTS
