Katsuya Iida comments

Results 21 comments of Katsuya Iida

Preprocessed data are in the wrong path 512/wikipedia_pretrain.

The serialized data files `wikipedia_segmented_part_NN.bin` refer to `WikiNBookCorpusPretrainingDataCreator`, which has been deleted in the latest code. Adding the following avoids the issue.

```
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass
```
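The workaround makes sense if the `.bin` files are Python pickles, which is an assumption here and is not stated in the comment. A pickle stores only the module and class name of each object, so loading fails once that name no longer resolves, and succeeds again when a stub with the same name is restored. The sketch below reproduces this with a hypothetical in-memory module `demo_creator` and a hypothetical helper `define`; neither is part of the project's code.

```python
import pickle
import sys
import types

# Hypothetical module standing in for the project's data-creator module.
demo = types.ModuleType("demo_creator")
sys.modules["demo_creator"] = demo


def define(name, base=object):
    """Create a class registered under `name` inside the demo module."""
    cls = type(name, (base,), {"__module__": "demo_creator"})
    setattr(demo, name, cls)
    return cls


base = define("PretrainingDataCreator")
old = define("WikiNBookCorpusPretrainingDataCreator", base)

# Serialize an instance while the subclass still exists.
obj = old()
obj.instances = [1, 2, 3]
blob = pickle.dumps(obj)

# Simulate the upstream change: the subclass is deleted from the module.
delattr(demo, "WikiNBookCorpusPretrainingDataCreator")
try:
    pickle.loads(blob)
    load_failed = False
except AttributeError:
    # pickle cannot resolve the class name stored in the data.
    load_failed = True

# Workaround from the comment: re-add an empty subclass with the same name.
define("WikiNBookCorpusPretrainingDataCreator", base)
restored = pickle.loads(blob)
```

The stub only needs the right name and base class because the pickled state is restored into the instance's attributes, not taken from the class body.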

Evaluate adding Text2Speech Onnx to speech--audio-processing section

Thank you @GeorgeS2019 for the mention. For context, [voice100](https://github.com/kaiidams/voice100) is my personal TTS/ASR project, which uses CNN layers without recursion so that it can be embedded in mobile apps ([Xamarin Android sample](https://github.com/kaiidams/Voice100AndroidApp)). It is not...

Classes of torchvision\ops

@GeorgeS2019 The ONNX Runtime approach is probably not related to this.

> Why there are no Torchaudio.ops?

`torchaudio` has C code that uses Kaldi, SoX, and FFmpeg, which is not implemented...

Classes of torchvision\ops

@xhuan8 If this is the build from torchvision, it is a C++ torchvision library. You'll need to make a C wrapper so that C# can use it with P/Invoke. ```...

Classes of torchvision\ops

@NiklasGustafsson To build torchvision.dll you'll need Python (and zlib, libpng, CUDA, etc.), so I think it should be built outside TorchSharp.

Speech to Speech

Thanks. They use k-means clustered audio and seq2seq to translate between Spanish and English. K-means clustered audio can be used to replace CMU phonemes in Voice100. For Speech-to-Speech translation, I'm...

Adapt the network to another inputs

I think it is difficult to answer without more information. SoundStream tries to produce audio with a spectrogram close to that of the input. How do you measure your error? Does the model learn audio but...

Adapt the network to another inputs

`ORIGINAL_AUDIO.wav` has a very low signal level (< 0.03), while the model accepts normalized audio as input: https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L606 The code below produces a noisy sound followed by laughter. I think the noisy sound...
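For a quiet recording like the one described, one simple option is to rescale the waveform before feeding it to the model. The sketch below uses peak normalization on a plain list of float samples. This is an assumption for illustration: the normalization at the linked line of `soundstream.py` may work differently, and `peak_normalize`, `target_peak`, and `eps` are hypothetical names.

```python
def peak_normalize(samples, target_peak=1.0, eps=1e-8):
    """Scale samples so that the largest absolute value equals target_peak.

    Hypothetical helper; the model's own preprocessing may differ.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent (or empty) input: nothing meaningful to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# A toy waveform whose amplitude stays below 0.03, as in the comment.
quiet = [0.0, 0.01, -0.03, 0.02]
loud = peak_normalize(quiet)
```

Peak normalization preserves the shape of the waveform and only changes its overall gain, so it is a cheap way to check whether the low input level is what causes the noisy output.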

Adapt the network to another inputs

These are numbers for LIBRISPEECH.

|g_stft_loss|g_wave_loss|g_feat_loss|g_rec_loss|q_loss|g_loss|codes_entropy|d_stft_loss|d_wave_loss|d_loss|num_replaced|epoch|step|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|8.765625|2.03125|0.035614|13.462036|0.385002|20.735474|6.826962|0.0|1.387695|1.041016|0.0|24|21487|

Spikes of entropy in your case are expected; it jumps when some of the codes are replaced. Rec loss is flat after 1.5k,...
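To illustrate why replacing codes can move an entropy metric, the sketch below computes the Shannon entropy of a list of codebook indices. Whether the logged `codes_entropy` is computed this way, and whether it is in bits or nats, is an assumption; `code_entropy_bits` and the toy data are hypothetical.

```python
import math
from collections import Counter


def code_entropy_bits(codes):
    """Shannon entropy (in bits) of the empirical distribution of codes."""
    counts = Counter(codes)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data: one dominant code versus evenly used codes.
skewed = [0] * 97 + [1, 2, 3]
uniform = [0, 1, 2, 3] * 25

skewed_entropy = code_entropy_bits(skewed)
uniform_entropy = code_entropy_bits(uniform)
```

Entropy is highest when all codes are used equally often, so a step that swaps rarely used codes for ones that start attracting inputs would be expected to change the metric abruptly.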

How to train a new set of data?

Do you mean you want to train the soundstream model with new training data, or to train another model that uses the output of soundstream as features? In the first case,...