
How does RVC handle reconstructing audio from the spectrogram?

Open kalomaze opened this issue 2 years ago • 7 comments

Are you guys using librosa.griffinlim or some other technique when converting to audio from spectrogram data?

kalomaze avatar Jun 14 '23 21:06 kalomaze

@fairseq:hubert

RVC-Boss avatar Jun 15 '23 02:06 RVC-Boss

Hubert doesn't handle anything to do with spec-to-audio, or audio generation in general. It only handles speech recognition (spec to text), right? Your answer seems rather ambiguous. Is it HiFi-GAN or something?

kalomaze avatar Jun 15 '23 07:06 kalomaze

The HiFi-GAN in RVC doesn't do a spec2wav transform, but something like one: RVC doesn't generate a spectrogram before HiFi-GAN, but a 192-dimensional vector. As for wav2spec: that's all in fairseq, and fairseq != hubert.
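To make the shapes concrete, here is a toy sketch (the `TinyDecoder` module, its layer sizes, and its upsample factors are all made up for illustration; only the input shape mirrors what is described above): the decoder consumes a sequence of 192-dimensional hidden vectors rather than a mel spectrogram, and upsamples it directly to a waveform.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy stand-in for a neural decoder: 192-dim latent frames -> waveform."""
    def __init__(self, hidden=192):
        super().__init__()
        self.net = nn.Sequential(
            # two transposed convolutions give a combined 8 * 32 = 256x upsample
            nn.ConvTranspose1d(hidden, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(64, 1, kernel_size=64, stride=32, padding=16),
            nn.Tanh(),  # squash output into [-1, 1] like audio samples
        )

    def forward(self, z):   # z: (batch, 192, frames)
        return self.net(z)  # -> (batch, 1, frames * 256) samples

z = torch.randn(1, 192, 50)       # 50 latent frames, no spectrogram anywhere
wav = TinyDecoder()(z)            # (1, 1, 12800)
```

The point of the sketch is just that no spectrogram appears anywhere in the forward pass: latent frames go in, audio samples come out.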

RVC-Boss avatar Jun 15 '23 12:06 RVC-Boss

RVC and So-Vits-Svc are similar end-to-end architectures. In fact, the spectrogram is never explicitly generated during conversion, even though HiFi-GAN is used (its input is a 192-dimensional latent vector, not a mel spectrogram).

To solve HiFi-GAN's broken-sound problem on singing, the software actually uses the improved NSF-HiFiGAN. Compared with speech, NSF (Neural Source Filter) technology is better suited to synthesizing singing voices.
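The core NSF idea can be sketched in a few lines (this is an illustration of the concept, not RVC's or NSF-HiFiGAN's actual code; the function name, amplitudes, and noise level are invented): build a source excitation from the F0 contour, sinusoidal where the signal is voiced and noise-like where it is unvoiced, which the neural filter network then shapes into the final waveform.

```python
import numpy as np

def nsf_source(f0, sr=16000, noise_std=0.003, seed=0):
    """Toy NSF-style source signal: a harmonic sine where f0 > 0 (voiced),
    plus low-level Gaussian noise everywhere (dominant when unvoiced)."""
    f0 = np.asarray(f0, dtype=np.float64)     # per-sample F0 in Hz, 0 = unvoiced
    rng = np.random.default_rng(seed)
    phase = 2.0 * np.pi * np.cumsum(f0) / sr  # instantaneous phase of the sine
    voiced = (f0 > 0).astype(np.float64)
    sine = 0.1 * np.sin(phase)
    noise = noise_std * rng.standard_normal(f0.shape)
    return voiced * sine + noise              # excitation fed to the filter net

# constant 100 Hz pitch for 0.1 s at 16 kHz
src = nsf_source(np.full(1600, 100.0))
```

Because the excitation is already periodic at the target pitch, the filter network no longer has to hallucinate a stable F0 on its own, which is roughly why NSF-style vocoders hold up better on sustained sung notes.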

yxlllc avatar Jun 15 '23 16:06 yxlllc

Is SingGAN similar to NSF-HiFiGAN? And aside from end-to-end architectures, do you know the "current" ways to convert spectrograms back to wavs, regardless of whether the spectrogram is of a voice or just music/sound? I'm looking for ways to combat phase-reconstruction issues.

Mangio621 avatar Jun 15 '23 19:06 Mangio621

SingGAN is indeed similar to NSF-HiFiGAN. For an arbitrary sound, without any prior distribution assumptions, the basic algorithm for generating a waveform from a phase-free spectrum is Griffin-Lim, which reconstructs the phase iteratively.

However, phase reconstruction is a mathematical problem with multiple valid solutions, and the phase this algorithm recovers often deviates greatly from the original waveform's, resulting in poor sound quality. That is why neural vocoders are popular now: they assume a prior distribution (a human voice, a specific musical instrument, etc.), so after training their predicted phase can be much closer to that of real audio.

yxlllc avatar Jun 16 '23 05:06 yxlllc

I see. And I'm assuming Ultimate Vocal Remover with the MDX, VR architectures etc. all use multiple prior-distribution-based vocoders for separating the stems in a song, like guitar, drums, and vocals? This is where pre-trained models come into play for such GANs, to determine reconstruction effectiveness. Cool! I'm wondering whether we've yet found a way to feed a typical convolutional neural network something like a phase spectrogram, though I understand a phase spec seems arbitrary and too context-specific.

Mangio621 avatar Jun 16 '23 07:06 Mangio621

This issue was closed because it has been inactive for 15 days since being marked as stale.

github-actions[bot] avatar Apr 28 '24 04:04 github-actions[bot]