Inquiry regarding loss functions, hyperparameters, and weights for Causal Audio VAE training
Hi VoxCPM Team,
Thank you for open-sourcing VoxCPM and providing the technical report. I am very interested in the Causal Audio VAE architecture described in your work.
To reproduce the training of this module, could you please provide more details about the training configuration? Specifically, I am looking for:
- List of Loss Functions: What is the comprehensive list of loss functions used for the VAE training? The paper mentions Mel-spectrogram loss, adversarial (GAN) loss, and KL divergence. Were there any other auxiliary losses, such as feature matching loss or multi-resolution STFT loss? (My current guess at the composite objective is sketched after this list.)
- Loss Weights: What are the specific weights assigned to each component of the composite training objective?
- Mel-Spectrogram Settings: For the Mel-spectrogram reconstruction loss, what exact hyperparameters were used, such as the FFT size, window length, hop length, and number of Mel bins?
- KL Divergence Strategy: Regarding the KL-divergence loss weight of 5e-5, is this value held constant throughout training, or did you use a KL warmup schedule? (A sketch of the kind of schedule I mean follows the list.)
- Discriminator Configurations: Could you share more details on the settings of the multi-period and multi-scale discriminators used in the adversarial training phase? Did you use any tricks to avoid GAN training collapse? (An example of the level of detail I am hoping for is sketched below.)
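To make the first three questions concrete, here is a minimal sketch of how I currently imagine the composite objective, written in PyTorch. Everything except the 5e-5 KL weight (the Mel settings, the other loss weights, the presence of a feature-matching term) is a placeholder I made up for illustration, i.e. exactly the values I am asking about:

```python
import torch
import torch.nn.functional as F
import torchaudio

# Placeholder Mel settings -- these are precisely the hyperparameters I am asking about.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,   # assumption
    n_fft=1024,          # assumption
    win_length=1024,     # assumption
    hop_length=256,      # assumption
    n_mels=80,           # assumption
)

# Placeholder loss weights -- only the 5e-5 KL weight comes from the paper.
W_MEL, W_ADV, W_FM, W_KL = 45.0, 1.0, 2.0, 5e-5

def vae_loss(wav_real, wav_fake, mu, logvar, adv_loss, fm_loss):
    """Composite objective as I currently assume it:
    Mel reconstruction + adversarial + (feature matching?) + KL."""
    mel_loss = F.l1_loss(
        torch.log(mel_transform(wav_fake).clamp(min=1e-5)),
        torch.log(mel_transform(wav_real).clamp(min=1e-5)),
    )
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return W_MEL * mel_loss + W_ADV * adv_loss + W_FM * fm_loss + W_KL * kl_loss
```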
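For the KL question, this is the kind of schedule I mean by "warmup": a linear ramp from 0 to the final weight over the first N steps. The 50k-step horizon here is purely hypothetical.

```python
def kl_weight(step: int, final_weight: float = 5e-5, warmup_steps: int = 50_000) -> float:
    """Linearly anneal the KL weight from 0 to `final_weight` over `warmup_steps`.
    The warmup horizon is a made-up example; I would like to know whether VoxCPM
    uses something like this or keeps the 5e-5 weight fixed from step 0."""
    return final_weight * min(1.0, step / warmup_steps)
```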
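For the discriminator question, even a configuration summary like the one below would be very helpful. The periods and scales listed are just the common HiFi-GAN defaults and are my assumption, not something taken from your paper.

```python
# Assumed discriminator setup (HiFi-GAN-style defaults, NOT confirmed for VoxCPM).
discriminator_config = {
    "multi_period": {
        "periods": [2, 3, 5, 7, 11],   # assumption: standard HiFi-GAN MPD periods
    },
    "multi_scale": {
        "num_scales": 3,               # assumption: raw audio plus two average-pooled scales
    },
    "adversarial_loss": "least-squares (LSGAN)? hinge?",  # also part of the question
}
```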
Thank you for your time and for contributing this work to the community!