tango
Sound cloning
I'm looking to build on your research. I understand this is outside the scope of your project; I'm just curious and wanted the creators' thoughts. I want to retrain, repurpose, and experiment with this for expressive TTS instead of generic text-to-audio. I'm somewhat new to working with these models.
OBJECTIVES ->
- retrain on a more dynamic dataset
- synthetic dataset -> speech/text [with special utterances] {real speech / lo-fi speech from 'BARK'}, speech with synthetic audio environments generated by 'tango'/text [I have a rather large dataset]
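For the second objective, mixing BARK speech with Tango-generated environments presumably needs the background scaled to a controlled level so the speech stays intelligible. A minimal sketch of such a mixer, assuming both clips are mono NumPy arrays at the same sample rate (the function name and SNR-based approach are my own choice, not anything from the Tango codebase):

```python
import numpy as np

def mix_at_snr(speech, background, snr_db):
    """Mix clean speech with a background track at a target SNR (in dB).

    Scales `background` so that speech power / background power
    matches the requested signal-to-noise ratio, then sums the two.
    Assumes both arrays are the same length and sample rate.
    """
    speech_power = np.mean(speech ** 2)
    bg_power = np.mean(background ** 2) + 1e-12  # avoid divide-by-zero
    # Power the background must have for the target SNR to hold.
    target_bg_power = speech_power / (10 ** (snr_db / 10))
    background = background * np.sqrt(target_bg_power / bg_power)
    return speech + background
```

Sweeping `snr_db` per example (e.g. 5–20 dB) would give the model varied speech-to-environment balances rather than one fixed mix level.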
EXPECTATIONS ->
- the most expressive hybrid TTS [TTS with semantically conditioned background environments]
QUESTIONS ->
- What are your thoughts on approaching voice cloning with this style of architecture? I figure I should approach it like inpainting?
- If so, wouldn't it also clone any artifacts contained in the speech audio?
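To make the inpainting question concrete: a common recipe for inpainting with a pretrained diffusion model (the RePaint-style approach, not anything Tango-specific) is to denoise freely in the unknown region while, at every step, overwriting the known region with a freshly noised copy of the reference signal so both regions sit at the same noise level. A toy sketch of one such step, where `denoise_fn` is a stand-in for the trained model:

```python
import numpy as np

def inpaint_step(x_t, known, mask, alpha_bar_t, denoise_fn, rng):
    """One RePaint-style inpainting step (illustrative sketch).

    x_t        : current noisy latent
    known      : clean reference signal for the region to keep
    mask       : 1 where the reference is known, 0 where the model fills in
    alpha_bar_t: cumulative noise schedule value at this timestep
    denoise_fn : placeholder for the trained diffusion model's step
    """
    # Model-predicted less-noisy latent for the unknown region.
    x_prev = denoise_fn(x_t)
    # Re-noise the known reference to the current timestep's noise level.
    noised_known = (np.sqrt(alpha_bar_t) * known
                    + np.sqrt(1 - alpha_bar_t) * rng.standard_normal(known.shape))
    # Keep the (noised) reference where known, the model output elsewhere.
    return mask * noised_known + (1 - mask) * x_prev
```

On the artifact question: since the known region is copied back verbatim at each step, any artifacts in the reference audio would indeed survive in that region; only the masked region is regenerated.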
CLOSING THOUGHTS -> I'm open to sharing my results with you privately. I appreciate your contribution to the community.