Katsuya Iida

Results 21 comments of Katsuya Iida

The serialized data `wikipedia_segmented_part_NN.bin` refers to `WikiNBookCorpusPretrainingDataCreator`, which has been deleted in the latest code. Adding the following avoids the issue:

```
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass
```
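A minimal sketch of why such a shim helps, assuming the `.bin` files are Python pickles (the comment above only says "serialized data"). A pickle stores a class by module and name, so loading fails once the name is gone and works again when any class with that name is restored. The module name `demo_creators` and the helper `define` are hypothetical and exist only for this illustration.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the module that defines the data creators.
mod = types.ModuleType("demo_creators")
sys.modules[mod.__name__] = mod


def define(name, base=object):
    """Create a class inside the stand-in module so pickle can find it by name."""
    cls = type(name, (base,), {"__module__": mod.__name__})
    setattr(mod, name, cls)
    return cls


Base = define("PretrainingDataCreator")
Old = define("WikiNBookCorpusPretrainingDataCreator", Base)

# Serialize an instance of the old class.
obj = Old()
obj.instances = [1, 2, 3]
blob = pickle.dumps(obj)

# Simulate the class being deleted from the code base.
delattr(mod, "WikiNBookCorpusPretrainingDataCreator")
try:
    pickle.loads(blob)
    load_failed = False
except AttributeError:
    load_failed = True

# Restore an empty subclass under the old name, as in the shim above.
define("WikiNBookCorpusPretrainingDataCreator", Base)
restored = pickle.loads(blob)
print(load_failed, restored.instances)
```

The restored object is an instance of the new, empty subclass, so it still behaves like a `PretrainingDataCreator` for any code that only relies on the base class.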

Thank you @GeorgeS2019 for mentioning it. For context, [voice100](https://github.com/kaiidams/voice100) is my personal TTS/ASR project that uses CNN layers without recurrence so that it can be embedded in mobile apps (see the [Xamarin Android sample](https://github.com/kaiidams/Voice100AndroidApp)). It is not...

@GeorgeS2019 The ONNX Runtime approach is probably not related to this.

> Why there are no Torchaudio.ops?

`torchaudio` has C code that uses Kaldi, Sox, and FFmpeg, which is not implemented...

@xhuan8 If this is the build from torchvision, it is a C++ torchvision library. You'll need to make a C wrapper so that C# can use it with P/Invoke. ```...

@NiklasGustafsson To build torchvision.dll you'll need Python (and zlib, libpng, CUDA, etc.). I think it should be built outside TorchSharp.

Thanks. They use k-means clustered audio and a seq2seq model for Spanish-English translation. k-means clustered audio could be used to replace the CMU phonemes in Voice100. For Speech-to-Speech translation, I'm...

I think it is difficult to answer without more information. SoundStream tries to produce audio whose spectrogram is close to that of the input. How do you measure your error? Does the model learn audio but...

`ORIGINAL_AUDIO.wav` has a very low signal level (< 0.03), while the model expects normalized audio as input: https://github.com/kaiidams/soundstream-pytorch/blob/9c6086e4fccaf75adb3f62014f750843fc68d84e/soundstream.py#L606 The code below produces noisy sound followed by laughter. I think the noisy sound...
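To illustrate the normalization point, here is a minimal sketch of peak normalization in plain Python. It is an assumption that peak scaling is appropriate; the linked line in `soundstream.py` may normalize differently. The function name `peak_normalize` and the synthetic 440 Hz test tone are hypothetical.

```python
import math


def peak_normalize(samples, target_peak=1.0, eps=1e-8):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent input: return a copy unchanged to avoid dividing by zero.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# One second of a quiet 440 Hz tone at 16 kHz with amplitude 0.03,
# similar in level to the low-signal file described above.
sample_rate = 16000
quiet = [0.03 * math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]

normalized = peak_normalize(quiet)
print(max(abs(s) for s in quiet), max(abs(s) for s in normalized))
```

Peak scaling keeps the waveform shape and only changes the gain, so it is a cheap way to check whether the input level alone explains poor reconstruction.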

These are numbers for LIBRISPEECH.

|g_stft_loss|g_wave_loss|g_feat_loss|g_rec_loss|q_loss|g_loss|codes_entropy|d_stft_loss|d_wave_loss|d_loss|num_replaced|epoch|step|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|8.765625|2.03125|0.035614|13.462036|0.385002|20.735474|6.826962|0.0|1.387695|1.041016|0.0|24|21487|

Spikes in entropy in your case are expected: it jumps when some of the codes are replaced. Rec loss is flat after 1.5k,...
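For readers unfamiliar with a metric like `codes_entropy`, the sketch below computes the Shannon entropy of codebook usage from a list of code indices. This is an assumption about what the metric measures, not the repository's actual implementation; the function name and the choice of natural logarithm are hypothetical.

```python
import math
from collections import Counter


def codes_entropy(codes):
    """Shannon entropy (in nats) of the empirical distribution of code indices."""
    counts = Counter(codes)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Every one of 1024 codes used equally often: entropy reaches its maximum, log(1024).
uniform = list(range(1024)) * 4
# Only a single code ever used: entropy is zero.
collapsed = [7] * 4096

print(codes_entropy(uniform), codes_entropy(collapsed))
```

Under this definition, replacing rarely used codes changes the usage distribution abruptly, which would be consistent with the jumps in entropy described above.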

Do you mean you want to train the soundstream model with new training data, or do you want to train another model that uses the output of soundstream as features? In the first case,...