[TTS][German][Single Speaker][Fastpitch + HifiGAN] Bad Inference RTF on CPU

Open eqikkwkp25-cyber opened this issue 3 years ago • 0 comments

Describe the bug

The real-time factor (RTF) for inference is greater than 2 on Intel CPUs.

Steps/Code to reproduce bug

The code below produces about 7 seconds of audio (speech2.wav) and, with the models already loaded, takes more than 14 seconds to do so. It is unclear to me how many CPU cores are utilized and where this can be configured.

git clone https://github.com/NVIDIA/NeMo/
cd NeMo
python3.8 -m venv venv
source venv/bin/activate
pip3 install a-lot-of-missing-requirements when executing
python3 tts_german.py

where the inference code is as follows:

tts_german.py

import soundfile as sf
from nemo.collections.tts.models.base import SpectrogramGenerator, Vocoder
import time

# Download and load the pretrained fastpitch model
spec_generator = SpectrogramGenerator.from_pretrained(model_name="tts_de_fastpitch_singlespeaker")#.cuda()
# Download and load the pretrained hifigan model
vocoder = Vocoder.from_pretrained(model_name="tts_de_slr_hifigan_ft_fastpitch_singlespeaker")#.cuda()

# All spectrogram generators start by parsing raw strings to a tokenized version of the string
parsed = spec_generator.parse("Was schreibst Du da gerade Papa?")
# They then take the tokenized string and produce a spectrogram
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
# Finally, a vocoder converts the spectrogram to audio
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# Note: the vocoder returns a batch of audio. In this example, we take the first and only sample.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)

# Time the second synthesis only; both models are already loaded at this point
start_time = time.time()
parsed = spec_generator.parse("In den alten Zeiten, wo das Wünschen noch geholfen hat, lebte ein König, dessen Töchter waren alle schön.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech2.wav", audio.to('cpu').detach().numpy()[0], 22050)
end_time = time.time()
# Elapsed wall-clock seconds, including writing the WAV file to disk
print(end_time - start_time)
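
On the question of core usage raised above: NeMo models run on PyTorch, and PyTorch's CPU intra-op thread count can be inspected with `torch.get_num_threads()` and changed with `torch.set_num_threads()`. The OpenMP and MKL backends also read the `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables if they are set before `torch` is imported. The following is a minimal sketch, not a confirmed fix: the thread count of `os.cpu_count()` is an illustrative assumption rather than a tuned value, and whether it brings the RTF below 1 for these models has not been measured here.

```python
import os

# Illustrative assumption: one thread per logical core reported by the OS.
NUM_THREADS = os.cpu_count() or 1

# Set these before importing torch so OpenMP/MKL read them at load time.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", str(NUM_THREADS))
os.environ.setdefault("MKL_NUM_THREADS", str(NUM_THREADS))

try:
    import torch
except ImportError:
    # Lets the sketch run on machines where PyTorch is not installed.
    torch = None

if torch is not None:
    # Report how many intra-op threads PyTorch currently uses on CPU.
    print("intra-op threads before:", torch.get_num_threads())
    # Explicitly set the intra-op thread count for CPU operators.
    torch.set_num_threads(NUM_THREADS)
    print("intra-op threads after:", torch.get_num_threads())
```

Placing this at the very top of tts_german.py, before the NeMo imports, would make the thread count explicit, so the timing can be repeated with different values.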

Expected behavior

Faster inference, with RTF < 1.
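
For reference, RTF here means synthesis time divided by the duration of the generated audio, so values below 1 are faster than real time. A minimal sketch of that calculation, using the approximate numbers from this report (about 14 seconds to produce about 7 seconds of audio at 22050 Hz); the helper name `real_time_factor` is hypothetical and not part of NeMo:

```python
def real_time_factor(elapsed_seconds: float, num_samples: int, sample_rate: int = 22050) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


# Approximate numbers from this report: ~14 s to synthesize ~7 s of audio.
rtf = real_time_factor(elapsed_seconds=14.0, num_samples=7 * 22050)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 2.00"
```

In the script above, `elapsed_seconds` would be the printed elapsed time and `num_samples` would presumably be `audio.shape[-1]`, assuming the vocoder output has shape (batch, samples).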

Environment overview (please complete the following information)

See above

Environment details

See above

Additional context

--

eqikkwkp25-cyber, Sep 10 '22 07:09