Why do different --max-tokens values during inference affect the prediction results?
I used wav2vec2_small.pt for fine-tuning. When I decode the fine-tuned model, I find that different --max-tokens values during inference produce different WER results, and setting it to the same size as during training gives the best results. May I ask why? Or is there something wrong with the parameters I set during inference?
What have you tried?
The --max-tokens size used during training was 1600000. On the test split of my dataset the WER is 8.3 with --max-tokens 1600000, but 14.5 with --max-tokens 4000000.
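To see whether batching and padding alone could explain a change like this, here is a minimal, self-contained sketch (plain PyTorch, not fairseq code, and only a hypothetical mechanism): if padded positions are not fully masked somewhere in the network, the same frames can attend differently depending on how much padding the batch adds, so the batch composition chosen by --max-tokens could change the outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
q = torch.randn(1, d)        # one query frame from a real utterance
k = torch.randn(6, d)        # keys for the 6 real frames
v = torch.randn(6, d)        # values for the 6 real frames

# Attention computed over the real frames only.
out_real = F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

# Same frames plus 4 zero-padded frames that are NOT masked out:
# softmax still assigns them weight, so the result changes.
k_pad = torch.cat([k, torch.zeros(4, d)])
v_pad = torch.cat([v, torch.zeros(4, d)])
out_pad = F.softmax(q @ k_pad.T / d ** 0.5, dim=-1) @ v_pad

print(torch.allclose(out_real, out_pad))  # False: unmasked padding shifts the output
```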
What's your environment?
- fairseq Version (e.g., 1.0 or main): main
- PyTorch Version (e.g., 1.8): 1.8
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip
- Build command you used (if compiling from source): pip
- Python version: 3.7
- CUDA/cuDNN version: 11.1
Hi, have you found out what's wrong with this? I think the documentation says the "--max-tokens" parameter is the maximum number of tokens per batch. Suppose the batch size is 64 and the average number of tokens per sample is 100. If "--max-tokens" is set to 4096, but 64 * 100 = 6400, does that mean the part over 4096 will be abandoned? In other words, could such an incomplete batch lead to worse performance?
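For reference, here is a minimal sketch of how I understand a per-batch token budget is usually applied (not fairseq's actual batching code, just an illustration); under a scheme like this the extra samples would simply end up in another batch rather than being cut off.

```python
# Minimal sketch (not fairseq's actual code) of grouping samples under a
# token budget: a batch is closed once adding the next sample would push
# (num samples) * (longest sample) over max_tokens; samples themselves
# are never truncated.
def batch_by_token_budget(sample_lengths, max_tokens):
    batches, current, longest = [], [], 0
    for idx, length in enumerate(sample_lengths):
        candidate = max(longest, length)
        if current and (len(current) + 1) * candidate > max_tokens:
            batches.append(current)
            current, candidate = [], length
        current.append(idx)
        longest = candidate
    if current:
        batches.append(current)
    return batches

# 64 samples of ~100 tokens with a 4096-token budget become two batches
# (40 samples and 24 samples), not one clipped batch.
print([len(b) for b in batch_by_token_budget([100] * 64, 4096)])
```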
Have you found the reason?