Different result from the paper when fine-tuning for ASR
Hi, sorry to disturb you again.
I used the provided pre-trained model base_lrs3_iter5.pt, the 30h_data split, and the base_lrs3_30h.yaml config to fine-tune for ASR, all following the README, and then decoded with s2s_decode.yaml, changing only override.modalities=['audio'] (see the snippet below). However, I got a WER of 9.28, which is far from the 5.4 reported in the paper.
Could you please suggest what is probably going wrong? Thank you.
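For reference, this is the only change I made relative to the released s2s_decode.yaml (a minimal sketch; everything else, including the generation settings dumped below, was left at its defaults):

```yaml
# Sole override applied to s2s_decode.yaml for audio-only decoding.
override:
  modalities: ['audio']   # decode from the audio stream only
```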
P.S. I used 4 GPUs instead of 8, so I changed update_freq to [2]; the other configs are untouched.
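In other words, I tried to keep the effective batch size constant; a sketch of the relevant fields (the 8-GPU baseline of update_freq [1] is my assumption from the released config):

```yaml
# 8 GPUs x update_freq [1]  ~=  4 GPUs x update_freq [2]  (same effective batch size)
distributed_training:
  distributed_world_size: 4   # GPUs actually used
optimization:
  update_freq: [2]            # gradient-accumulation steps per GPU
```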
WER: 9.282103134479271
err / num_ref_words = 918 / 9890
_name: null
beam: 50
nbest: 1
max_len_a: 1.0
max_len_b: 0
min_len: 1
match_source_len: false
unnormalized: false
no_early_stop: false
no_beamable_mm: false
lenpen: 1.0
unkpen: 0.0
replace_unk: null
sacrebleu: false
score_reference: false
prefix_size: 0
no_repeat_ngram_size: 0
sampling: false
sampling_topk: -1
sampling_topp: -1.0
constraints: null
temperature: 1.0
diverse_beam_groups: -1
diverse_beam_strength: 0.5
diversity_rate: -1.0
print_alignment: null
print_step: false
lm_path: null
lm_weight: 0.0
iter_decode_eos_penalty: 0.0
iter_decode_max_iter: 10
iter_decode_force_max_iter: false
iter_decode_with_beam: 1
iter_decode_with_external_reranker: false
retain_iter_history: false
retain_dropout: false
retain_dropout_modules: null
decoding_format: null
no_seed_provided: false
Hi,
The config file base_lrs3_30h.yaml is for lip reading. When fine-tuning an ASR model, have you tuned the hyperparameters, mainly freeze_finetune_updates, max_update, warmup_steps, and decay_steps?
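As a rough sketch of where those knobs live in the fine-tuning config (the values below are placeholders, not recommendations; check the released ASR fine-tuning settings):

```yaml
# Placeholder values only -- tune these for ASR rather than reusing the lip-reading ones.
model:
  freeze_finetune_updates: 24000   # updates during which the pre-trained encoder stays frozen
optimization:
  max_update: 30000                # total fine-tuning updates
lr_scheduler:
  warmup_steps: 10000              # LR ramp-up phase
  decay_steps: 20000               # LR decay phase
```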
One other thing to note is that the 5.4 in Table 4 of [2] is obtained with a pre-trained audio HuBERT model. Specifically, you need to take the pre-trained AV-HuBERT from the penultimate iteration to generate clusters, then pre-train an audio HuBERT (by setting modalities to ['audio'] during pre-training) using those clusters. If you directly fine-tune an AV-HuBERT for ASR (where the ASR result in [2] is obtained), the result can be a bit different. In our experiments, it is slightly worse, but the gap is not large.
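To make that pipeline concrete, a minimal sketch of the audio-only pre-training step (the field placement follows the decode-time override above; the cluster path is a placeholder):

```yaml
# Pre-train an audio-only HuBERT on clusters produced by the penultimate-iteration AV-HuBERT.
task:
  modalities: ['audio']                               # audio stream only during pre-training
  label_dir: /path/to/clusters_from_penultimate_iter  # targets generated with that AV-HuBERT
```

That audio HuBERT is then fine-tuned for ASR in the same way as described above.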