Unintended long OOVs are created

Open kazuma-t opened this issue 2 years ago • 0 comments

For the input "〇所定勤務時間", the entire string is output as a single OOV.

  • The kanji numeral "〇" is assigned to the character categories SYMBOL, KANJI, and KANJINUMERIC.
  • The OOV generation rule is set to 0 0 2 for KANJI and 1 1 0 for SYMBOL.
  • InputText.getCharCategoryContinuousLength() measures the run length using KANJI, the category "〇" shares with the characters that follow, so the run covers the whole string.
  • MeCabOovProviderPlugin.provideOOV() then applies the SYMBOL rule to that same length, which generates the unintended long OOV.
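
The interaction described above can be reproduced with a small, simplified model. This is only a sketch, not Sudachi's actual code: the function names, the category table, and the reading of the three rule values as invoke / group / length (the MeCab char.def convention) are assumptions made for illustration.

```python
# Hypothetical, simplified model of the behaviour described in this issue.
# It is NOT Sudachi's implementation; names and tables are illustrative.

# Categories per character. "〇" carries three categories (from the issue);
# every other character in the example is assumed to be plain KANJI.
CHAR_CATEGORIES = {"〇": {"SYMBOL", "KANJI", "KANJINUMERIC"}}
DEFAULT_CATEGORIES = {"KANJI"}

# category -> (invoke, group, length), assuming the MeCab char.def convention.
OOV_RULES = {"KANJI": (0, 0, 2), "SYMBOL": (1, 1, 0)}


def categories(ch):
    return CHAR_CATEGORIES.get(ch, DEFAULT_CATEGORIES)


def continuous_length(text, offset):
    """Length of the run whose characters all share at least one category."""
    shared = set(categories(text[offset]))
    length = 0
    for ch in text[offset:]:
        shared &= categories(ch)
        if not shared:
            break
        length += 1
    return length


def provide_oov(text, offset, has_other_words=False):
    """Return (surface, category) OOV candidates starting at offset."""
    length = continuous_length(text, offset)  # measured via the shared KANJI
    candidates = []
    for cat in sorted(categories(text[offset])):
        rule = OOV_RULES.get(cat)
        if rule is None:
            continue
        invoke, group, max_len = rule
        if not invoke and has_other_words:
            continue
        remaining = length
        if group:
            # The run length came from KANJI, but it is reused here for SYMBOL.
            candidates.append((text[offset:offset + length], cat))
            remaining -= 1
        for i in range(1, max_len + 1):
            if i > remaining:
                break
            candidates.append((text[offset:offset + i], cat))
    return candidates


oovs = provide_oov("〇所定勤務時間", 0)
```

In this model the run length is 7 because every character shares KANJI with "〇", and the SYMBOL rule (group = 1) then emits the full seven-character string as one candidate. One possible direction, offered only as a suggestion, would be to compute the run length per category instead of from the intersection.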

kazuma-t, Mar 24 '23 00:03