Pasha S
In the given VALL-E example, only a text prefix is given, but in the VALL-E paper we also need to pass the 3-second audio prompt as a prefix along with the text, right?...
The model is only predicting 1026 values, including the codebook entries and the special tokens bos and eos. Can someone please clarify this?
Coming from the arXiv website. This paper is super cool imo. Would love to train this model for my use case. Are you planning to release the training and inference code?
Can we train a GPT model using text in the same language if we have audio transcriptions in that language?
Great research! I'm really interested in learning more about the training process. Do you have any plans to open-source the training code for the audio tokenizer and transformer? I'd love...