Question about speaker encoder input
The paper mentions that "The tone color extractor is a simple 2D convolutional neural network that operates on the mel-spectrogram of the input voice and outputs a single feature vector that encodes the tone color information." However, in api.py it looks like it operates on the linear (non-mel) spectrogram:
```python
gs = []
for fname in ref_wav_list:
    audio_ref, sr = librosa.load(fname, sr=hps.data.sampling_rate)
    y = torch.FloatTensor(audio_ref)
    y = y.to(device)
    y = y.unsqueeze(0)
    # spectrogram_torch returns a linear-magnitude STFT, not a mel-spectrogram
    y = spectrogram_torch(y, hps.data.filter_length,
                          hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length,
                          center=False).to(device)
    with torch.no_grad():
        g = self.model.ref_enc(y.transpose(1, 2)).unsqueeze(-1)
        gs.append(g.detach())
gs = torch.stack(gs).mean(0)
```
I'm wondering if this is true, and if so, if there was a reason for using the non-mel spectrogram (was quality better)?
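For reference, one quick way to confirm which representation ref_enc receives is to print the channel dimension of the tensor that spectrogram_torch returns: a linear STFT has filter_length // 2 + 1 frequency bins rather than a mel channel count. This is only a sketch; the config values below are placeholders, and the import path may differ (e.g. openvoice.mel_processing):

```python
import librosa
import torch
from mel_processing import spectrogram_torch  # same helper api.py uses; path may differ

# placeholder values; substitute hps.data.* from the actual config
sampling_rate, filter_length, hop_length, win_length = 22050, 1024, 256, 1024

audio_ref, sr = librosa.load("reference.wav", sr=sampling_rate)  # hypothetical reference file
y = torch.FloatTensor(audio_ref).unsqueeze(0)
spec = spectrogram_torch(y, filter_length, sampling_rate, hop_length, win_length,
                         center=False)

# shape is [1, filter_length // 2 + 1, frames] -> e.g. 513 linear bins, not ~80/128 mel channels
print(spec.shape)
```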
Thanks for pointing this out. This is true. There is actually no performance difference between the two.
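For anyone who wants to try the mel-spectrogram variant described in the paper, a minimal sketch of the same loop body with a mel front end might look like the following. It assumes the repo ships the standard VITS-style mel_spectrogram_torch helper and that the config exposes n_mel_channels, mel_fmin, and mel_fmax (these field names are assumptions, not confirmed from the project); ref_enc would also need to be built and trained with the matching input width for the result to be meaningful:

```python
from mel_processing import mel_spectrogram_torch  # assumed VITS-style helper

y = mel_spectrogram_torch(
    y,                            # [1, samples] waveform tensor, as in the loop above
    hps.data.filter_length,
    hps.data.n_mel_channels,      # assumed config field, e.g. 80 or 128
    hps.data.sampling_rate,
    hps.data.hop_length,
    hps.data.win_length,
    hps.data.mel_fmin,            # assumed config field
    hps.data.mel_fmax,            # assumed config field
    center=False,
).to(device)

with torch.no_grad():
    g = self.model.ref_enc(y.transpose(1, 2)).unsqueeze(-1)
```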
Ah, thank you. To clarify, the mel input in question was ~128 channels?
How can I optimize the audio cloning process? How would I modify the extract_se function to do that?
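One common optimization, sketched below, is to cache the speaker embedding per reference file so that repeated cloning runs reuse a saved tensor instead of redoing the STFT + ref_enc pass. This is only a hypothetical variant, not the project's API: the cache layout, self.device, and the _compute_se_for_file helper (which would wrap the loop body shown earlier) are all assumptions for illustration.

```python
import os
import torch

def extract_se_cached(self, ref_wav_list, cache_dir="se_cache"):
    """Hypothetical extract_se variant that caches one embedding per reference file."""
    os.makedirs(cache_dir, exist_ok=True)
    gs = []
    for fname in ref_wav_list:
        cache_path = os.path.join(cache_dir, os.path.basename(fname) + ".pt")
        if os.path.exists(cache_path):
            # reuse a previously computed embedding for this reference clip
            g = torch.load(cache_path, map_location=self.device)
        else:
            # hypothetical helper wrapping the spectrogram + ref_enc loop body shown above
            g = self._compute_se_for_file(fname)
            torch.save(g.cpu(), cache_path)
        gs.append(g.to(self.device))
    # average the per-file embeddings, as the original loop does
    return torch.stack(gs).mean(0)
```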