Liu Zihan
I am currently exploring the MusicGen model and have some questions about how audio prompts are applied within the model's architecture, particularly in relation to the cross_attention layers: 1. **Role...
I am using the Parler_TTS model with reference audio (`input_values`) during inference, similar to MusicGen, to perform continuation tasks: `model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)` While the model continues in the style...
Hello, I noticed that the recent architecture improvements include modules for RoPE positional encoding and for adding prompts in cross-attention. However, it seems that the two newly released...
I observed a significant discrepancy in CLAP scores when using different pretrained CLAP models to evaluate MusicGen. Specifically, I used two distinct pretrained CLAP checkpoints to assess MusicGen's performance...
Great work! I would like to ask whether any results are available on how semantic richness and acoustic fidelity vary as `n_q` changes in XCodec....
Hi, I would like to ask about performing batch inference with XCodec. Specifically, what is the expected shape of the `wav` input in the following code snippet? Should the...
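To frame the batch-shape question above: many neural audio codecs (e.g. EnCodec) expect mono waveforms batched as `(batch, channels, samples)`. Below is a minimal sketch of right-padding variable-length waveforms into that layout. The `batch_waveforms` helper and the `(B, 1, T)` layout are assumptions for illustration, not a confirmed XCodec API:

```python
import numpy as np

def batch_waveforms(waveforms, pad_value=0.0):
    """Right-pad mono waveforms to a common length and stack them
    into a (batch, channels=1, samples) array -- the EnCodec-style
    layout assumed here for XCodec (hypothetical, not confirmed)."""
    max_len = max(len(w) for w in waveforms)
    batch = np.full((len(waveforms), 1, max_len), pad_value, dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, 0, :len(w)] = w
    return batch

# Two clips of different lengths (1 s and 1.5 s at 16 kHz)
wavs = [np.random.randn(16000).astype(np.float32),
        np.random.randn(24000).astype(np.float32)]
batched = batch_waveforms(wavs)
print(batched.shape)  # (2, 1, 24000)
```

If XCodec instead expects `(batch, samples)`, the channel axis can be dropped with `batched.squeeze(1)`; padding to the longest clip is still needed either way.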