Sudachi
Unintended long OOVs are created
When "〇所定勤務時間" is given as input, the entire string is output as a single OOV.
- The kanji numeral "〇" is assigned to the character categories SYMBOL, KANJI, and KANJINUMERIC.
- The OOV generation rules (invoke, group, length) are set to `0 0 2` for KANJI and `1 1 0` for SYMBOL.
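- `InputText.getCharCategoryContinuousLength()` measures the continuous length with "〇" counted as KANJI, so the run extends over the KANJI characters that follow it.
- `MeCabOovProviderPlugin.provideOOV()` then treats "〇" as SYMBOL and applies the SYMBOL rule (grouping enabled) to that KANJI-based length, which produces the unintended long OOV.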
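
The mismatch described above can be reproduced with a small standalone model. This is only a sketch and not Sudachi's actual code: the class `OovLengthSketch`, its methods, and the simplified category table are all hypothetical. It assumes that the continuous length is the run of characters sharing at least one category with the first character, and that grouping emits one OOV candidate spanning that run. The invoke flag is ignored, and KANJINUMERIC has no rule here because the report does not state one.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the behaviour in this report (not Sudachi code).
class OovLengthSketch {
    enum Category { SYMBOL, KANJI, KANJINUMERIC }

    // Only the group flag and length from the report are modelled.
    static final class Rule {
        final boolean group;
        final int length;
        Rule(boolean group, int length) { this.group = group; this.length = length; }
    }

    static final Map<Category, Rule> RULES = new EnumMap<>(Category.class);
    static {
        RULES.put(Category.KANJI, new Rule(false, 2));  // "0 0 2"
        RULES.put(Category.SYMBOL, new Rule(true, 0));  // "1 1 0"
    }

    // Assumption: U+3007 (the kanji numeral zero) has three categories;
    // every other character is treated as plain KANJI.
    static Set<Category> categoriesOf(char c) {
        if (c == '\u3007') {
            return EnumSet.of(Category.SYMBOL, Category.KANJI, Category.KANJINUMERIC);
        }
        return EnumSet.of(Category.KANJI);
    }

    // Length of the leading run that still shares some category with the first character.
    static int continuousLength(String text) {
        if (text.isEmpty()) return 0;
        Set<Category> common = EnumSet.copyOf(categoriesOf(text.charAt(0)));
        int length = 1;
        while (length < text.length()) {
            common.retainAll(categoriesOf(text.charAt(length)));
            if (common.isEmpty()) break;
            length++;
        }
        return length;
    }

    // Applies each rule of the first character's categories to the shared run length.
    static List<String> oovCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        if (text.isEmpty()) return candidates;
        int run = continuousLength(text);
        for (Category category : categoriesOf(text.charAt(0))) {
            Rule rule = RULES.get(category);
            if (rule == null) continue;  // no rule modelled for this category
            if (rule.group) candidates.add(category + ":" + run);
            for (int i = 1; i <= rule.length && i <= run; i++) {
                candidates.add(category + ":" + i);
            }
        }
        return candidates;
    }

    // For contrast: the run length when only one specific category is counted.
    static int categoryRunLength(String text, Category category) {
        int length = 0;
        while (length < text.length() && categoriesOf(text.charAt(length)).contains(category)) {
            length++;
        }
        return length;
    }
}
```

In this model, the seven-character input from the report yields a continuous length of 7 because the run stays alive through the shared KANJI category, and the SYMBOL rule then groups all seven characters into one candidate. Counting only SYMBOL characters gives a run of 1, which is presumably the length the SYMBOL rule was meant to see.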