wyh2000
## What? We add a voice conversion task to ESPnet2, supporting both parallel and non-parallel data. ## Why? To support more voice conversion models and datasets. ## See also...
Hi, thanks for sharing this nice work. Could you share some example code for how to reconstruct images with DiffAE when only z_{sem} is encoded from the original images but x_T...
We add a HiFi-GAN-based vocoder that can decode HuBERT tokens to waveforms. It currently supports LJSpeech. It is aligned with the configurations used in discrete TTS from [ESPnet](https://github.com/espnet/espnet/pull/5626).
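For readers unfamiliar with the setup, here is a minimal sketch of the token-to-waveform idea, not the actual ESPnet module: discrete HuBERT token IDs are embedded and then upsampled to raw audio by a stack of transposed convolutions, as in HiFi-GAN. The class name, hyperparameters, and upsampling factors below are all hypothetical.

```python
# Hypothetical sketch only -- not the ESPnet vocoder implementation.
import torch
import torch.nn as nn


class TokenVocoderSketch(nn.Module):
    def __init__(self, n_tokens=500, emb_dim=256, upsample_factors=(8, 8, 5)):
        # 8 * 8 * 5 = 320 samples per token, matching 50 Hz HuBERT frames at 16 kHz.
        super().__init__()
        self.embed = nn.Embedding(n_tokens, emb_dim)  # token ID -> latent frame
        layers, ch = [], emb_dim
        for f in upsample_factors:  # stack of transposed convs, as in HiFi-GAN
            layers += [
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * f, stride=f, padding=f // 2),
                nn.LeakyReLU(0.1),
            ]
            ch //= 2
        layers += [nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
        self.generator = nn.Sequential(*layers)

    def forward(self, tokens):  # tokens: (batch, n_frames) of int64 token IDs
        x = self.embed(tokens).transpose(1, 2)  # (batch, emb_dim, n_frames)
        return self.generator(x).squeeze(1)     # (batch, n_samples)


tokens = torch.randint(0, 500, (1, 100))   # a dummy sequence of 100 token IDs
waveform = TokenVocoderSketch()(tokens)    # (1, n_samples), roughly 320x the token length
```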
## What? Add evaluation scripts for long-speech ASR. ## Why? For long speech (e.g., longer than 20 seconds), we should first split it into shorter segments and then evaluate...
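The PR contains the actual scripts; as a rough illustration of the splitting step, here is a hedged sketch that cuts a long recording into fixed-length windows with a small overlap before scoring. The function name and parameters are hypothetical, and the real scripts may instead segment by VAD or by reference timestamps.

```python
# Illustrative splitter only, not the evaluation scripts from the PR.
import numpy as np


def split_long_audio(wav: np.ndarray, sr: int = 16000,
                     max_sec: float = 20.0, overlap_sec: float = 1.0):
    """Yield (start_sample, segment) pairs, each segment at most max_sec seconds."""
    max_len = int(max_sec * sr)
    hop = max_len - int(overlap_sec * sr)   # small overlap to avoid cutting words
    for start in range(0, len(wav), hop):
        yield start, wav[start:start + max_len]
        if start + max_len >= len(wav):
            break


wav = np.zeros(16000 * 65, dtype=np.float32)   # dummy 65-second recording at 16 kHz
segments = list(split_long_audio(wav))         # -> 4 segments of <= 20 s each
```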
## What? Add a recipe for training SpeechComposer on voice conversion and speech enhancement. ## Why? To allow training language models on new tasks and to add the corresponding recipe. ## See also This...
## What? This is an implementation of EVA: Robust Audiovisual Speech Recognition Models with Mixture-of-Experts. It supports audiovisual ASR for unconstrained videos. The EVA implementation is based on OWSM v3.1, with...
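EVA's actual layers live in the PR; purely as a generic illustration of the mixture-of-experts idea it builds on, here is a minimal top-1-routed MoE feed-forward block in PyTorch. All names and sizes are hypothetical and this is not EVA's code.

```python
# Generic mixture-of-experts illustration, not the EVA implementation.
# A router scores each token, and each token is processed by its top-1 expert.
import torch
import torch.nn as nn


class Top1MoESketch(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                             # x: (batch, time, d_model)
        gate = self.router(x).softmax(dim=-1)         # (batch, time, n_experts)
        top1 = gate.argmax(dim=-1)                    # hard top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i                          # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[..., i][mask].unsqueeze(-1)
        return out


x = torch.randn(2, 50, 256)       # dummy per-frame audiovisual features
y = Top1MoESketch()(x)            # same shape as the input
```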
We want to develop ESPnet2 support for the voice conversion task. We aim to build voice conversion systems that support both parallel and non-parallel data. # Progress & Plan...
Hi, I'd like to know if DPOTrainer supports more than one rejected sample. As the original DPO paper mentions, the approach can be extended with a Plackett-Luce model over a set of possible responses.
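For context, the Plackett-Luce generalization in the DPO paper ranks the $K$ responses with a permutation $\tau$ instead of using a single chosen/rejected pair; as I understand it, the corresponding objective looks roughly like this (my own transcription, so please double-check against the paper):

$$
\mathcal{L}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{x,\, y_1,\dots,y_K,\, \tau}\left[\log \prod_{k=1}^{K} \frac{\exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(k)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(k)} \mid x)}\right)}{\sum_{j=k}^{K} \exp\!\left(\beta \log \frac{\pi_\theta(y_{\tau(j)} \mid x)}{\pi_{\mathrm{ref}}(y_{\tau(j)} \mid x)}\right)}\right]
$$

which should reduce to the standard DPO loss when $K = 2$.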
Hi, thanks for your great work! Do you have any plans to release the training code of Video-SALMONN?