Textless NLP / GSLM: Speech resynthesis produces incorrect results
What is your question?
When I resynthesize audio using the HuBERT, TTS, and WaveGlow checkpoints, I am not getting correct results. It seems like my generated audio is truncated.
Code
import librosa
import soundfile as sf
import torch

# gslm is the GSLM wrapper (HuBERT speech2unit + unit2speech TTS + WaveGlow), loaded elsewhere.
speech1_path = 'original.flac'
speech1, sr = librosa.load(speech1_path, sr=None, mono=True)  # load at the file's native sample rate
speech1_tensor = torch.tensor(speech1)
audio = speech1_tensor.unsqueeze(0).cuda()  # add a batch dimension and move to GPU

a, b, c = gslm.encode(audio)
m, audi = gslm.decode(c, True)

audio = audi[0]
new_audio = audio.squeeze().cpu().numpy()
new_audio = new_audio.astype('float32')
sf.write('output.wav', new_audio, 22050)
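For debugging, here is a sanity check worth running (a sketch; it assumes the GSLM HuBERT encoder expects 16 kHz input, its training rate, while the Tacotron2 + WaveGlow decoder outputs 22.05 kHz, so loading the flac with sr=None could feed the encoder a mismatched rate):

import librosa
import soundfile as sf

# Force the encoder input to 16 kHz instead of the file's native rate.
speech1, sr = librosa.load('original.flac', sr=16000, mono=True)

# Compare input and output durations to quantify the truncation.
new_audio, out_sr = sf.read('output.wav')
print(f'input:  {len(speech1) / sr:.2f} s at {sr} Hz')
print(f'output: {len(new_audio) / out_sr:.2f} s at {out_sr} Hz')

If the durations differ substantially, the sample-rate handling is a reasonable place to look first.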
Link to the generated and original audio: Link
What's your environment?
- fairseq Version (e.g., 1.0 or main): v0.12.2
- PyTorch Version (e.g., 1.0): 1.10.1+cu102
- OS (e.g., Linux): Ubuntu 18.04
- How you installed fairseq (pip, source): https://github.com/facebookresearch/fairseq
- Build command you used (if compiling from source):
- Python version: 3.8.17
- CUDA/cuDNN version: 12.2
- GPU models and configuration:
- Any other relevant information: