
Some Question of Pre-training

Open liruixinxinxin opened this issue 1 year ago • 1 comment

Pre-training Loss Convergence Value:

In the Masked Audio Modeling (MAM) task, the model uses cross-entropy loss to optimize the prediction of discrete labels. Could you please share the approximate loss convergence values during pre-training (e.g., in the BEATs iter1 and iter2 stages)? Are any relevant curves or numerical statistics available?

Accuracy of the 1024 Discrete Labels Generated by the Tokenizer:

In the Masked Audio Modeling task, the encoder predicts the 1024 discrete labels generated by the tokenizer. Did you track the prediction accuracy of these discrete labels during pre-training? If so, what was the approximate accuracy?

liruixinxinxin avatar Jan 15 '25 02:01 liruixinxinxin

Pre-training Loss: In BEATs, the cross-entropy loss typically starts around 3.0–4.0 (iter1) and plateaus near 1.5–2.5 (iter2), in line with trends in masked audio models such as HuBERT. The loss curves are not public, but they likely show a rapid early decline followed by slower refinement.
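For context on those numbers: with a 1024-way label vocabulary, a uniform (random) predictor has a cross-entropy of ln(1024) ≈ 6.93 nats, so a plateau near 1.5–2.5 implies the encoder assigns substantial probability mass to the correct token. A minimal sanity-check sketch in plain Python (not BEATs code, just the arithmetic):

```python
import math

# Cross-entropy over a 1024-way discrete-label vocabulary, as in MAM.
NUM_LABELS = 1024

def cross_entropy(predicted_probs, true_label):
    """Negative log-likelihood of the true label, in nats."""
    return -math.log(predicted_probs[true_label])

# Worst-case baseline: a uniform predictor.
uniform = [1.0 / NUM_LABELS] * NUM_LABELS
baseline = cross_entropy(uniform, true_label=0)  # ln(1024) ≈ 6.93

# Hypothetical model putting ~20% mass on the correct label already
# lands in the reported plateau region.
confident = [0.8 / (NUM_LABELS - 1)] * NUM_LABELS
confident[0] = 0.2
plateau = cross_entropy(confident, true_label=0)  # -ln(0.2) ≈ 1.61
```

So a loss of ~1.6 roughly corresponds to the model concentrating about 20% probability on the correct discrete label per masked position, far above the 1/1024 chance level.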

Prediction accuracy for the 1024 labels isn't explicitly reported, but comparable models achieve 10–25% top-1 and 30–50% top-k accuracy (versus ~0.1% random chance, i.e., 1/1024). BEATs' iterative training likely improves this through successive tokenizer and encoder refinements. Check the original code/docs for specifics.
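If you want to track the accuracy the question asks about yourself, the metric is straightforward to add to a training loop. A minimal sketch in plain Python (the helper name and toy 5-way vocabulary are illustrative; BEATs uses 1024 labels):

```python
def topk_accuracy(logits_batch, labels, k=5):
    """Fraction of masked positions whose true label is among the k highest logits."""
    hits = 0
    for logits, label in zip(logits_batch, labels):
        topk = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# Toy example: 3 masked positions over a 5-way vocabulary.
logits_batch = [
    [0.1, 0.7, 0.1, 0.05, 0.05],  # highest logit at index 1
    [0.3, 0.2, 0.4, 0.05, 0.05],  # highest logit at index 2
    [0.5, 0.3, 0.1, 0.05, 0.05],  # highest logit at index 0
]
labels = [1, 0, 0]

top1 = topk_accuracy(logits_batch, labels, k=1)  # 2/3: second position misses
top2 = topk_accuracy(logits_batch, labels, k=2)  # 3/3: label 0 is second-best there
```

Against a 1024-way vocabulary the same function gives the ~0.1% chance-level baseline for an untrained model, which makes the 10–25% top-1 figures cited above easy to put in perspective.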

Bhazantri avatar Jan 30 '25 05:01 Bhazantri