Fix language model repeated scoring

Open FieldsMedal opened this issue 2 years ago • 0 comments

This PR fixes repeated language model scoring. When hotwords_scorer->is_character_based and ext_scorer->is_character_based() is false, the language model and hotword scores are calculated repeatedly. In fact, if the language model is word based, the scorer should only be called when space_id is detected. After the change, we tested every combination on the dataset.
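The rule above (a word-based scorer runs only at a space, a character-based scorer runs on every character) can be pictured with a minimal sketch. This is not the ctc_decoder code: `scorer_should_run` and `count_scorer_calls` are hypothetical helpers, and the character ids are made up apart from `space_id = 45`, which matches the test settings.

```python
from typing import Sequence


def scorer_should_run(is_character_based: bool, new_char_id: int, space_id: int) -> bool:
    """Decide whether a scorer is called for the character just appended to a prefix.

    Hypothetical helper: each scorer (language model or hotwords) is checked on its own,
    so a word-based scorer is never triggered by the other scorer being character based.
    """
    if is_character_based:
        return True  # character-based scorer: score on every character
    return new_char_id == space_id  # word-based scorer: score only at a word boundary


def count_scorer_calls(char_ids: Sequence[int], space_id: int, is_character_based: bool) -> int:
    """Count how many times a scorer would be called along one decoded path."""
    return sum(scorer_should_run(is_character_based, c, space_id) for c in char_ids)


if __name__ == "__main__":
    path = [3, 7, 45, 9, 45]  # made-up character ids; 45 is the space
    print(count_scorer_calls(path, space_id=45, is_character_based=True))
    print(count_scorer_calls(path, space_id=45, is_character_based=False))
```

On this path the word-based scorer is called once per space rather than once per character, which is the behavior the fix is meant to restore.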

First audio

Set beam_size=10, num_processes=1, blank_id=0, space_id=45, cutoff_prob=1 (cutoff_prob is increased so that spaces are generated), alpha=0.5, beta=0.5, window_length=4, hot_words = {'换一': -3.40282e+38, '首歌': -100, '换首歌': 3.40282e+38}.

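| No. | Model | Hotwords is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | Neither | * | * | 换一首歌 |
| 2 | Hotwords | TRUE | * | 换首歌`a<unk>` |
| 3 | | FALSE | * | 换首歌`<space>A<space>爱'爱<unk>` |
| 4 | Language model | * | TRUE | 换一首歌 |
| 5 | | * | FALSE | 换一首 |
| 6 | Hotwords + language model | TRUE | TRUE | 换换首歌`<unk>` |
| 7 | | TRUE | FALSE | 一首 |
| 8 | | FALSE | TRUE | 换首歌`<space>A<space>`爱'`爱<unk>` |
| 9 | | FALSE | FALSE | 换一首 |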

In No. 7 and No. 9 the hotwords did not take effect. When the language model's is_character_based is false, any word generated between two spaces must be in the 1-grams or be a prefix of a 1-gram. The hotword '换首歌' is not in the 1-grams.
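The constraint can be illustrated with a small check. This is only a sketch under assumptions: `is_word_or_prefix` is a hypothetical helper, the 1-gram set below is invented for illustration and is not the vocabulary of the language model used in the test, and a real decoder would likely use a more efficient dictionary structure than a linear scan.

```python
from typing import Iterable


def is_word_or_prefix(candidate: str, unigrams: Iterable[str]) -> bool:
    """Return True if candidate equals a 1-gram or is a prefix of one.

    str.startswith also covers the exact-match case.
    """
    return any(word.startswith(candidate) for word in unigrams)


if __name__ == "__main__":
    # Invented 1-gram vocabulary, for illustration only.
    unigrams = {"换一首", "首歌", "歌"}
    for hotword in ("换一", "首歌", "换首歌"):
        print(hotword, is_word_or_prefix(hotword, unigrams))
```

With this invented vocabulary, '换首歌' fails the check because no 1-gram starts with it, so a path containing it between two spaces cannot survive with a word-based language model, whatever hotword score it carries.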

Second audio

Set beam_size=10, num_processes=1, blank_id=0, space_id=45, cutoff_prob=1 (cutoff_prob is increased so that spaces are generated), alpha=0.5, beta=0.5, window_length=4, hot_words = {'极点': 550}. Set the space symbol to `<space>` before compiling ctc_decoder.

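| No. | Model | Hotwords is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | Neither | * | * | 几点了 |
| 2 | Hotwords | TRUE | * | 极点极点点了 |
| 3 | | FALSE | * | 极点`<space><space><space><space>` |
| 4 | Language model | * | TRUE | 几点啦 |
| 5 | | * | FALSE | 几点啦 |
| 6 | Hotwords + language model | TRUE | TRUE | 极点极点极点啦 |
| 7 | | TRUE | FALSE | 极点`<space>极点<space>`极点 |
| 8 | | FALSE | TRUE | 极点`<space><space><space><space>` |
| 9 | | FALSE | FALSE | 极点`<space>是<space>是<space>是<space>` |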

May 22 '23 10:05 FieldsMedal