DaegyeomKim

16 comments by DaegyeomKim

- Inference mel-spectrogram with an unseen speaker prompt: ![Screenshot 2024-03-07 154449](https://github.com/p0p4k/pflowtts_pytorch/assets/57590655/52dda87c-6ca9-4797-b15a-41e24504874d)
- Inference mel-spectrogram with a seen speaker prompt: ![Screenshot 2024-03-07 154706](https://github.com/p0p4k/pflowtts_pytorch/assets/57590655/ba17a894-72b0-4c6d-b3b2-08bd2ab6b14f)

Thank you for your response. I will try to modify it so that it extracts speaker characteristics as described in the paper. If I achieve good results, I will...

Hi yiwei0730. Thank you for your advice. I'll do some testing and share the results with you. Thank you.

Hello p0p4k, yiwei0730. I have incorporated the prompt encoder from the 'https://github.com/adelacvg/NS2VC' repository to extract prompt features for the text encoder. The reason I chose this model is that...

Hello, I have run an experiment that adds the NS2 prompt encoder to the P-Flow text encoder. I applied this to both the structure provided by p0p4k and the one...
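For anyone trying the same modification, here is a minimal, hypothetical sketch of what "adding a prompt encoder to the text encoder" can look like in PyTorch. All module names, dimensions, and the cross-attention conditioning scheme are my own illustrative assumptions, not the actual pflowtts_pytorch or NS2VC code.

```python
# Hypothetical sketch: a mel-spectrogram prompt encoder producing per-frame
# prompt features, which the text encoder consumes via cross-attention.
# Dimensions and layer choices are illustrative only.
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encodes a short mel prompt into per-frame prompt features."""
    def __init__(self, n_mels=80, d_model=192, n_layers=2, n_heads=2):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mel_prompt):                    # (B, n_mels, T_p)
        x = self.proj(mel_prompt).transpose(1, 2)     # (B, T_p, d_model)
        return self.encoder(x)                        # (B, T_p, d_model)

class PromptConditionedTextEncoder(nn.Module):
    """Text encoder whose hidden states cross-attend to the prompt features,
    letting speaker characteristics flow into the text representation."""
    def __init__(self, vocab_size=100, d_model=192, n_heads=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens, prompt_feats):          # (B, T_t), (B, T_p, d)
        h = self.emb(tokens)
        a, _ = self.self_attn(h, h, h)                # self-attention over text
        h = self.norm1(h + a)
        c, _ = self.cross_attn(h, prompt_feats, prompt_feats)  # attend to prompt
        return self.norm2(h + c)                      # (B, T_t, d_model)

prompt_enc = PromptEncoder()
text_enc = PromptConditionedTextEncoder()
mel = torch.randn(2, 80, 240)                 # ~3 s prompt
tokens = torch.randint(0, 100, (2, 50))       # dummy phoneme IDs
feats = prompt_enc(mel)                       # (2, 240, 192)
out = text_enc(tokens, feats)                 # (2, 50, 192)
```

The key design choice here is conditioning via cross-attention to per-frame prompt features instead of collapsing the prompt into a single global speaker vector; whether that matches the NS2VC encoder exactly should be checked against that repository.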

- Adding the NS2 prompt encoder to the paper's structure (59 epochs, batch size 64): ![image](https://github.com/p0p4k/pflowtts_pytorch/assets/57590655/f872fefd-c469-47a6-b50f-fde6c80f3045)
- Adding the NS2 prompt encoder to p0p4k's structure (59 epochs, batch size 64): ![image](https://github.com/p0p4k/pflowtts_pytorch/assets/57590655/cd5ced4c-fb42-4fdc-9e3f-ddc8b54c208e)

Is zero-shot TTS possible with this model?

The Korean data I used for training totals 1,186 hours.

The authors even wrote that zero-shot TTS of quality comparable to VALL-E is possible with less data. ![image](https://github.com/p0p4k/pflowtts_pytorch/assets/57590655/ea7acc7d-2c10-4a61-91f3-69faed765b66)

I can't play the demo audio; can you play it, p0p4k?