tango: Reconstructing audio with only the VAE produces noise

Here is my code. Is there something wrong with how I am using the VAE?

`def recon_vae(self, filename):
    """Reconstruct audio using only the VAE."""
    with torch.no_grad():
        # Load the file and resample to 16 kHz, keeping the first channel.
        waveform, sample_rate = torchaudio.load(filename)
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]

        # Remove the DC offset and scale the peak amplitude to 0.5.
        waveform = waveform - torch.mean(waveform)
        waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
        waveform = 0.5 * waveform
        waveform = waveform / torch.max(torch.abs(waveform))
        waveform = 0.5 * waveform

        # Add a batch dimension and compute the mel spectrogram.
        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        audio = torch.autograd.Variable(audio, requires_grad=False)
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        melspec = melspec.transpose(1, 2)
        melspec = melspec.unsqueeze(1)

        # Encode to the latent space, then decode back to a waveform.
        truth_latent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))
        mel_recon = self.vae.decode_first_stage(truth_latent)
        wave = self.vae.decode_to_waveform(mel_recon)
    return wave[0], waveform`

May 31 '23 11:05 SuperiorDtj

Can you try the following:

import torch
import torchaudio
from tango import Tango
from tools.torch_tools import wav_to_fbank

filename = ... 

device = "cuda:0"
tango = Tango("declare-lab/tango", device)
tango.vae.eval()
tango.stft.eval()

duration = 10
target_length = int(duration * 102.4)

with torch.no_grad():
    mel, _, waveform = wav_to_fbank([filename], target_length, tango.stft)
    mel = mel.unsqueeze(1).to(device)
    latent = tango.vae.get_first_stage_encoding(tango.vae.encode_first_stage(mel))
    reconstructed_mel = tango.vae.decode_first_stage(latent)
    reconstructed_waveform = tango.vae.decode_to_waveform(reconstructed_mel)[0]

Jun 02 '23 14:06 deepanwayx

Thanks for your code! Now I can reconstruct the audio, but only when the number of frames in the audio is a multiple of four (for example, a 3.6 s clip works, but a 3.7 s clip does not). Is this a common issue with the VAE model?
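To make the 3.6 s versus 3.7 s difference concrete, here is a small check of the frame counts given by the `target_length = int(duration * 102.4)` formula from the snippet above. The helper name `frame_report` and the constant `ASSUMED_TIME_FACTOR` are hypothetical; the factor of four is only the multiple observed in this thread, not a confirmed property of the model.

```python
FRAMES_PER_SECOND = 102.4   # factor taken from the snippet in this thread
ASSUMED_TIME_FACTOR = 4     # assumption: the multiple observed above, not a confirmed model property


def frame_report(duration):
    """Return (frame count, remainder modulo the assumed factor) for a duration in seconds."""
    frames = int(duration * FRAMES_PER_SECOND)
    return frames, frames % ASSUMED_TIME_FACTOR


for duration in (3.6, 3.7, 10):
    frames, remainder = frame_report(duration)
    print(f"{duration} s: {frames} frames, remainder {remainder}")
```

Under these assumptions, 3.6 s gives 368 frames (divisible by four), while 3.7 s gives 378 frames (remainder two), which is consistent with the behaviour I see.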

Jun 05 '23 01:06 SuperiorDtj

What is the exact issue when reconstructing a 3.7s audio? Does it generate noise for the entire 3.7s or the last 0.1s?

Jun 06 '23 05:06 deepanwayx

When the VAE reconstructs a 3.7 s audio clip, it generates noise for the entire 3.7 s.

Jun 06 '23 06:06 SuperiorDtj

I have run into the same problem. Has it been solved? I tried reconstructing the same audio sample several times, and the result is always noise. The outputs also differ greatly from one run to the next.

Is the only solution to set the duration like this? target_length = int(duration * 102.4)
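One possible workaround, as a minimal sketch only: round `target_length` up to the next multiple of four before passing it to `wav_to_fbank`. The helper `padded_target_length` is hypothetical and not part of tango. It also assumes that `wav_to_fbank` pads the input to the requested length, and I have not verified that this removes the noise.

```python
import math


def padded_target_length(duration, frames_per_second=102.4, multiple=4):
    """Frame count for `duration` seconds, rounded up to a multiple of `multiple`.

    The 102.4 factor comes from the snippet earlier in this thread; the
    multiple of 4 is an assumption based on the behaviour reported here.
    """
    frames = int(duration * frames_per_second)
    return math.ceil(frames / multiple) * multiple


print(padded_target_length(3.7))
```

With this, a 3.7 s clip would request 380 frames instead of 378, and the extra frames would need to be trimmed from the reconstructed waveform afterwards.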

Jul 29 '23 14:07 ikm565