youngsuenXMLY
Hi, I used a model at step 59000, and the VC total loss had reduced to around 1.2, but all inference samples result in almost nothing. They looked like this: ...
I get almost the same results as you, @JRMeyer. Have you solved the problem?
My test results: [test_samples.zip](https://github.com/jxzhanggg/nonparaSeq2seqVC_code/files/4278949/test_samples.zip)
In the pre-train folder, I use a decay rate of 0.95 at each epoch and discard training samples whose frame length is longer than 800. The inferred results begin to make...
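In case it helps, here is a minimal sketch of those two tweaks; the names (`dataset`, the toy `Linear` model) are illustrative, not from the repo:

```python
import torch

MAX_FRAMES = 800  # discard clips longer than this many mel frames

# two toy "mel-spectrograms" of shape (n_mels, n_frames)
dataset = [torch.zeros(80, 500), torch.zeros(80, 900)]
kept = [mel for mel in dataset if mel.shape[-1] <= MAX_FRAMES]
print(len(kept))  # 1 -- the 900-frame clip is dropped

# exponential decay: multiply the learning rate by 0.95 after every epoch
model = torch.nn.Linear(80, 80)  # stand-in for the VC model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
for epoch in range(3):
    # ... one training epoch over the filtered dataset would go here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]['lr'])
```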
Hi, in the feature extraction process, I trimmed silence using librosa.effects.trim, and I used the 80-dimensional mel-spectrogram as specified in hparams.py. The text looks like this:  But the mean...
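For reference, a sketch of that extraction step; the file path and the STFT/trim parameters (`top_db=25`, `n_fft=1024`, `hop_length=256`) are assumptions here, and the actual values in hparams.py may differ:

```python
import librosa
import numpy as np

# 'sample.wav' is a placeholder path
y, sr = librosa.load('sample.wav', sr=16000)

# trim leading/trailing silence (the function is librosa.effects.trim)
y_trimmed, _ = librosa.effects.trim(y, top_db=25)

# 80-band mel-spectrogram, log-compressed
mel = librosa.feature.melspectrogram(y=y_trimmed, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, n_frames)
```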
In pre-train/model/layers.py, lines 353-354, I changed the code to `self.initialize_decoder_states(memory, mask=(1 - get_mask_from_lengths(memory_lengths)))`, because I found that `~` is a bitwise NOT: on a uint8 mask, `~1` gives 254.
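A small snippet demonstrating the pitfall (standalone, not from the repo):

```python
import torch

# On a uint8 (ByteTensor) mask, ~ is a bitwise NOT, so ~1 == 254 (still
# truthy!), which silently corrupts the mask. (1 - mask) flips 0/1 correctly.
byte_mask = torch.tensor([0, 1], dtype=torch.uint8)
print(~byte_mask)     # tensor([255, 254], dtype=torch.uint8)
print(1 - byte_mask)  # tensor([1, 0], dtype=torch.uint8)

# On a bool tensor (PyTorch >= 1.2), ~ is a logical NOT and works as intended.
bool_mask = torch.tensor([False, True])
print(~bool_mask)     # tensor([ True, False])
```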
I can't find any difference from the source code. So would you please send me a copy of your training text and phn files? @jxzhanggg
I conducted the experiment on Ubuntu 16.04, using PyTorch 1.3.1 and Python 3.7. For a boolean tensor, `~True` gives `False` and `~False` gives `True`. I will debug it. Please send me a copy...
After fixing the bitwise inversion `~`, the model began to converge to reasonable speech. One problem is that the inferred result doesn't keep the speaking style from the speaker embeddings, which...
Have you tried a VAE loss to further disentangle the content embedding from the speaker embedding?
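To be concrete, I mean something like the standard VAE KL regularizer applied to the content encoder output; this is just a sketch of the idea, not code from this repo, and the embedding shapes are hypothetical:

```python
import torch

def kl_divergence(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ): the standard VAE regularizer.
    # Applied to the content encoder output, it pressures the content
    # embedding toward a speaker-independent prior.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

mu = torch.zeros(4, 128)      # hypothetical content-embedding mean
logvar = torch.zeros(4, 128)  # hypothetical log-variance
print(kl_divergence(mu, logvar))  # 0.0 when the posterior equals the prior
```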