Add AudioLDM2 TTS
What does this PR do?
In #4549, @sanchit-gandhi implemented AudioLDM2, but without TTS. In this PR, I convert the GigaSpeech checkpoint and implement a TTS pipeline for AudioLDM2.
Fixes # (issue)
Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline?
- [x] Did you read our philosophy doc (important for complex PRs)?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul @sanchit-gandhi @patrickvonplaten Could anyone please review this? Thank you! You can find the model here: https://huggingface.co/anhnct/audioldm2_giga_speech/tree/main. The checkpoint for LJSpeech will be uploaded soon.
Looks like a great start @tuanh123789! Nice job on getting to grips with the AudioLDM2 logic. I'm wondering if we could actually update the existing pipeline to be compatible with the TTS model? Since most of the code is the same, the existing pipeline should be compatible with the TTS task, it's just the pre-processing that differs slightly? If you follow the tips I've left below, you should find this to be quite straightforward:
- Let the VITS tokenizer do the heavy lifting of phonemization and pre-processing. This means we can use the same pre-processing logic as in our current pipeline
- Add the learned embeddings to the projection model as a new attribute: the projection model adds the new learned embeddings to the VITS hidden-states
- Then you just need to update the arguments to include `transcription` (which you've done nicely on this PR)
Let me know if you need a hand with docs and tests too! Otherwise I think you've laid the groundwork for a nice design here 🤗
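To make the second tip concrete, here is a minimal pure-Python sketch of a projection step that adds a learned position embedding to the text-encoder hidden states. It is only an illustration under stated assumptions: `ToyProjection`, `max_seq_length`, and the list-based "tensors" are hypothetical and are not the diffusers API; a real implementation would operate on `torch` tensors and train the embedding.

```python
from typing import List, Optional

Matrix = List[List[float]]  # one row per token: (seq_len, hidden_dim)


class ToyProjection:
    """Hypothetical stand-in for a projection model (not the diffusers class)."""

    def __init__(
        self,
        hidden_dim: int,
        max_seq_length: int,  # assumed upper bound on tokens, illustrative only
        use_learned_position_embedding: bool = False,
    ) -> None:
        self.position_embedding: Optional[Matrix] = None
        if use_learned_position_embedding:
            # Zero-initialised here; in a real model these values are trained.
            self.position_embedding = [
                [0.0] * hidden_dim for _ in range(max_seq_length)
            ]

    def forward(self, hidden_states: Matrix) -> Matrix:
        if self.position_embedding is None:
            # Flag off: hidden states pass through unchanged (as a copy).
            return [row[:] for row in hidden_states]
        if len(hidden_states) > len(self.position_embedding):
            raise ValueError("sequence is longer than max_seq_length")
        # Add the embedding for position i to the hidden state of token i.
        return [
            [h + p for h, p in zip(row, pos)]
            for row, pos in zip(hidden_states, self.position_embedding)
        ]
```

Keeping the flag off by default means the existing text-to-audio path is untouched, which is the point of reusing one pipeline for both tasks.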
Thank you for the review. I find your opinion quite reasonable, and I will try to implement it according to that design. I will inform you as soon as there are any changes.
Hi @sanchit-gandhi, I have updated the code based on your review and design. Could you check it? Thank you 🤗
Hi @sanchit-gandhi, can you take a look?
@yiyixuxu @DN6 can you check out this PR?
@tuanh123789 Could you add a single slow test to verify TTS functionality? https://github.com/huggingface/diffusers/blob/main/tests/pipelines/audioldm2/test_audioldm2.py
OK, I'll add a slow test. I'll let you know when I'm done.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@patrickvonplaten Sorry, I was busy for a while. Can I reopen this PR? Thank you
Sorry, the bot accidentally closed it. It's open now.
Thank you 🤗 @sayakpaul
Hi @DN6, I added a slow test for the AudioLDM2 TTS pipeline. Can you check?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
hi @yiyixuxu can you check this, thank you 🤗
gentle ping @DN6 for any update on this PR
@DN6 can you do a final review?
Hi @DN6 can you review, thank you 🤗
Sorry for the delay. Just some small clean up requests. I think the checkpoint might have to be updated to support the new suggested init arg `use_learned_position_embedding` as well.
No problem. I'll update the code based on your review. Thank you 🤗
Hi @DN6, can you do a final review? 🤗
Nicely done! 👍🏽
Thanks for the support 🤗. Can we merge now?
Hi @patrickvonplaten @yiyixuxu @DN6. If everything is okay, can we merge? That would help me with the next step, fine-tuning this model for my country's language. 🤗 🤗 🤗