NeMo
[TTS][German][Single Speaker][Fastpitch + HifiGAN] Bad Inference RTF on CPU
Describe the bug
The real-time factor (RTF) for inference is greater than 2 on Intel CPUs.
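For reference, a minimal sketch of how the RTF figure is meant here, assuming RTF is defined as processing time divided by the duration of the generated audio. The function names are hypothetical helpers, not part of NeMo; the 14 s and 7 s values are the approximate numbers reported in this issue.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int = 22050) -> float:
    """Duration of a mono clip, given its sample count and sample rate."""
    return num_samples / sample_rate


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / length of the audio produced.

    Values below 1 mean synthesis is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Roughly 14 s of compute for roughly 7 s of audio, as observed in this issue.
rtf = real_time_factor(14.0, 7.0)
print(f"RTF: {rtf:.2f}")
```

With the numbers from this report the result is 2.0, which is where the "greater than 2" figure comes from.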
Steps/Code to reproduce bug
The code below produces a 7-second audio file (speech2.wav) once the models have been loaded, and generating it takes more than 14 seconds. It is unclear to me how many cores are utilized and where this can be configured.
git clone https://github.com/NVIDIA/NeMo/
cd NeMo
python3.8 -m venv venv
source venv/bin/activate
pip3 install a-lot-of-missing-requirements  # reported as missing when executing python3 tts_german.py

The inference code is as follows (tts_german.py):
import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder
import time
# Download and load the pretrained fastpitch model
spec_generator = SpectrogramGenerator.from_pretrained(model_name="tts_de_fastpitch_singlespeaker")#.cuda()
# Download and load the pretrained hifigan model
vocoder = Vocoder.from_pretrained(model_name="tts_de_slr_hifigan_ft_fastpitch_singlespeaker")#.cuda()
# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("Was schreibst Du da gerade Papa?")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
# Save the audio to disk in a file called speech.wav
# Note: the vocoder returns a batch of audio. In this example, we just take the first and only sample.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
starttime=time.time()
parsed = spec_generator.parse("In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech2.wav", audio.to('cpu').detach().numpy()[0], 22050)
endtime=time.time()
print(endtime-starttime)
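To see how many cores are in play for the script above, a sketch along these lines might help. It assumes that PyTorch's intra-op thread pool is what drives CPU inference here, which this issue does not confirm. `describe_cpu_threads` is a hypothetical helper; `torch.get_num_threads` and `torch.get_num_interop_threads` are existing PyTorch calls.

```python
import os


def describe_cpu_threads() -> dict:
    """Collect the logical core count and, if PyTorch is importable, its thread settings."""
    info = {
        "logical_cores": os.cpu_count(),
        # OpenMP-backed builds commonly honor this variable when it is set before startup.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report only the OS-level view.
        return info
    info["torch_intra_op_threads"] = torch.get_num_threads()
    info["torch_inter_op_threads"] = torch.get_num_interop_threads()
    return info


print(describe_cpu_threads())
```

If the intra-op thread count turns out lower than the number of available cores, calling `torch.set_num_threads(n)` before running inference is one knob to try. Whether it improves the RTF for these models is untested here.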
Expected behavior
Faster inference time, RTF < 1.
Environment overview (please complete the following information)
See above
Environment details
See above
Additional context
--