Fix language model repeated scoring
This PR fixes repeated language model scoring. When `hotwords_scorer->is_character_based()` is true and `ext_scorer->is_character_based()` is false, the language model and hot word scores are calculated repeatedly. A word-based language model should only call the scorer when `space_id` is detected. After the fix, we tested all combinations of the two settings on the dataset.
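The intended gating can be sketched as follows. This is a minimal illustration of the rule described above, not the decoder's actual code: `ShouldScoreLm` and `CountLmCalls` are hypothetical names, and tokens are assumed to be plain integer ids.

```cpp
#include <vector>

// Hypothetical helper (not part of ctc_decoder): decides whether the language
// model should be queried after token `c` is appended to a prefix.
bool ShouldScoreLm(bool lm_is_character_based, int c, int space_id) {
  // A character-based LM scores every new token; a word-based LM scores
  // only when a space closes the current word.
  return lm_is_character_based || c == space_id;
}

// Counts how many LM queries a token sequence would trigger under that rule.
int CountLmCalls(const std::vector<int>& tokens, bool lm_is_character_based,
                 int space_id) {
  int calls = 0;
  for (int c : tokens) {
    if (ShouldScoreLm(lm_is_character_based, c, space_id)) {
      ++calls;
    }
  }
  return calls;
}
```

With `space_id = 45`, a sequence containing one space triggers a single query for a word-based model, while a character-based model is queried once per token.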
First audio

Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (raised so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4, hot_words = {'换一': -3.40282e+38, '首歌': -100, '换首歌': 3.40282e+38}.
| No. | Model | Hot words is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | Neither | * | * | 换一首歌 |
| 2 | Hot words | TRUE | * | `换首歌a<unk>` |
| 3 | Hot words | FALSE | * | `换首歌<space>A<space>爱'爱<unk>` |
| 4 | Language model | * | TRUE | 换一首歌 |
| 5 | Language model | * | FALSE | 换一首 |
| 6 | Hot words + language model | TRUE | TRUE | `换换首歌<unk>` |
| 7 | Hot words + language model | TRUE | FALSE | 一首 |
| 8 | Hot words + language model | FALSE | TRUE | `换首歌<space>A<space>爱'爱<unk>` |
| 9 | Hot words + language model | FALSE | FALSE | 换一首 |

In No. 7 and No. 9 the hot words did not take effect. When the language model's is_character_based is false, a word generated between two spaces must be in the 1-grams or be a prefix of a 1-gram. The hot word '换首歌' is not in the 1-grams.
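The constraint above amounts to a prefix lookup over the unigram vocabulary. The following is a minimal sketch under stated assumptions: `IsUnigramOrPrefix` is a hypothetical helper, the vocabulary is held in a sorted `std::set` of UTF-8 strings, and the decoder's real data structure may differ.

```cpp
#include <set>
#include <string>

// Hypothetical helper (not part of ctc_decoder): returns true if `candidate`
// is a unigram or a prefix of some unigram in the sorted vocabulary.
bool IsUnigramOrPrefix(const std::set<std::string>& unigrams,
                       const std::string& candidate) {
  // In a sorted set, every string that starts with `candidate` sorts at or
  // after it, so the first element >= candidate is the only one to check.
  auto it = unigrams.lower_bound(candidate);
  return it != unigrams.end() &&
         it->compare(0, candidate.size(), candidate) == 0;
}
```

For a made-up vocabulary such as {"换一首", "几点"}, the candidate "换一" passes because it is a prefix of "换一首", while "换首歌" is rejected because no entry starts with it.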
Second audio

Settings: beam_size = 10, num_processes = 1, blank_id = 0, space_id = 45, cutoff_prob = 1 (raised so that spaces are generated), alpha = 0.5, beta = 0.5, window_length = 4, hot_words = {'极点': 550}. The space symbol was set to `<space>` before compiling ctc_decoder.
| No. | Model | Hot words is_character_based | Language model is_character_based | Decoding result (best path) |
|---|---|---|---|---|
| 1 | Neither | * | * | 几点了 |
| 2 | Hot words | TRUE | * | 极点极点点了 |
| 3 | Hot words | FALSE | * | `极点<space><space><space><space>` |
| 4 | Language model | * | TRUE | 几点啦 |
| 5 | Language model | * | FALSE | 几点啦 |
| 6 | Hot words + language model | TRUE | TRUE | 极点极点极点啦 |
| 7 | Hot words + language model | TRUE | FALSE | `极点<space>极点<space>极点` |
| 8 | Hot words + language model | FALSE | TRUE | `极点<space><space><space><space>` |
| 9 | Hot words + language model | FALSE | FALSE | `极点<space>是<space>是<space>是<space>` |