Why do different --max-tokens values during inference affect the prediction results?
I used wav2vec2_small.pt for fine-tuning. When I decode the fine-tuned model, I find that different --max-tokens values during inference produce different WER results, and setting it to the same size as during training gives the best results. May I ask why? Or is there something wrong with the parameters I set during inference?
What have you tried?
The --max-tokens size used during training was 1600000. On the test split of my dataset the WER is 8.3 with --max-tokens 1600000, but 14.5 with --max-tokens 4000000.
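To see whether batching and padding alone could explain a change like this, here is a minimal, self-contained sketch (plain PyTorch, not fairseq code, and only a hypothetical mechanism): if padded positions are not fully masked somewhere in the network, the same frames can attend differently depending on how much padding the batch adds, so the batch composition chosen by --max-tokens could change the outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
q = torch.randn(1, d)        # one query frame from a real utterance
k = torch.randn(6, d)        # keys for the 6 real frames
v = torch.randn(6, d)        # values for the 6 real frames

# Attention computed over the real frames only.
out_real = F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

# Same frames plus 4 zero-padded frames that are NOT masked out:
# softmax still assigns them weight, so the result changes.
k_pad = torch.cat([k, torch.zeros(4, d)])
v_pad = torch.cat([v, torch.zeros(4, d)])
out_pad = F.softmax(q @ k_pad.T / d ** 0.5, dim=-1) @ v_pad

print(torch.allclose(out_real, out_pad))  # False: unmasked padding shifts the output
```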
What's your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.8): 1.8
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip
- Build command you used (if compiling from source): pip
- Python version: 3.7
- CUDA/cuDNN version: 11.1
Hi, have you found out what's wrong with this? I think the documentation says the "--max-tokens" parameter is the maximum number of tokens per batch. Suppose the batch size is 64 and the average number of tokens per sample is 100. If "--max-tokens" is set to 4096, but 64 * 100 = 6400, does that mean the part over 4096 will be abandoned? In other words, could such an incomplete batch lead to worse performance?
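For reference, here is a minimal sketch of how I understand a per-batch token budget is usually applied (not fairseq's actual batching code, just an illustration); under a scheme like this the extra samples would simply end up in another batch rather than being cut off.

```python
# Minimal sketch (not fairseq's actual code) of grouping samples under a
# token budget: a batch is closed once adding the next sample would push
# (num samples) * (longest sample) over max_tokens; samples themselves
# are never truncated.
def batch_by_token_budget(sample_lengths, max_tokens):
    batches, current, longest = [], [], 0
    for idx, length in enumerate(sample_lengths):
        candidate = max(longest, length)
        if current and (len(current) + 1) * candidate > max_tokens:
            batches.append(current)
            current, candidate = [], length
        current.append(idx)
        longest = candidate
    if current:
        batches.append(current)
    return batches

# 64 samples of ~100 tokens with a 4096-token budget become two batches
# (40 samples and 24 samples), not one clipped batch.
print([len(b) for b in batch_by_token_budget([100] * 64, 4096)])
```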
Have you found the reason?