
Mistakes in word tokenization for text containing time- and finance-related digits

Open p16i opened this issue 3 years ago • 2 comments

Description

I was told by email that AttaCut (and possibly other tokenizers as well) does not cope well with texts like the ones below:

- 'เจอกันตอน 17.00น.' 
   - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', '.', '00น', '.']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17.00น', '.']
- 'เจอกันตอน 17:00'
   - actual:  ['เจอ', 'กัน', 'ตอน', ' ', '17', ':', '00']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17:00']
- 'ของชิ้นนี้ราคา 3.50 บาท'
   - actual:  ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3', '.', '50', ' ', 'บาท']
   - expected: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3.50', ' ', 'บาท']

IMHO, this problem seems quite general. I wonder what could be a good strategy to solve the problem.

p16i avatar Feb 10 '22 13:02 p16i

Maybe I should use a regex to fix this.
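
A minimal sketch of what such a regex might look like, assuming the goal is only to locate digit groups joined by `.` or `:` in the raw text. The pattern and the name `find_numeric_spans` are illustrative assumptions, not part of PyThaiNLP or AttaCut, and the trailing "น" in the first example is deliberately not covered.

```python
import re

# One or more digits, followed by one or more ".<digits>" or ":<digits>" groups.
# Hypothetical pattern for illustration; it does not include the "น" suffix.
NUMERIC_SPAN = re.compile(r"\d+(?:[.:]\d+)+")


def find_numeric_spans(text):
    """Return every substring that looks like a time or a decimal amount."""
    return [match.group() for match in NUMERIC_SPAN.finditer(text)]


print(find_numeric_spans("เจอกันตอน 17:00"))
print(find_numeric_spans("ของชิ้นนี้ราคา 3.50 บาท"))
```

Spans found this way could be protected before tokenization or used to re-join tokens afterwards; which of the two fits better is left open here.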

wannaphong avatar Feb 22 '22 15:02 wannaphong

The 1st one is actually bad. The 2nd and 3rd can be post-processed by combining digits and symbols into a larger token (with some semantic meaning).
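
The post-processing idea could be sketched roughly as follows, assuming the tokenizer output is a plain list of strings. The function name `merge_numeric_tokens` and the `joiners` parameter are hypothetical, and only all-digit tokens are merged, so the first example (where the tokenizer produced '00น') is left unchanged.

```python
def merge_numeric_tokens(tokens, joiners=(".", ":")):
    """Merge runs like ['17', ':', '00'] into a single token '17:00'."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.isdecimal():
            # Absorb "<joiner>, <digits>" pairs for as long as they follow.
            while (
                i + 2 < len(tokens)
                and tokens[i + 1] in joiners
                and tokens[i + 2].isdecimal()
            ):
                token += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(token)
        i += 1
    return merged


print(merge_numeric_tokens(["เจอ", "กัน", "ตอน", " ", "17", ":", "00"]))
print(merge_numeric_tokens(["ของ", "ชิ้น", "นี้", "ราคา", " ", "3", ".", "50", " ", "บาท"]))
```

Because the merge only looks at adjacent tokens, it can run after any tokenizer without touching the tokenizer itself.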

bact avatar Feb 22 '22 18:02 bact