pythainlp
Incorrect word tokenization for text containing time- and finance-related digits
Description
I was told by email that AttaCut (and possibly other tokenizers as well) does not cope well with texts like the ones below:
- 'เจอกันตอน 17.00น.'
  - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', '.', '00น', '.']
  - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17.00น', '.']
- 'เจอกันตอน 17:00'
  - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', ':', '00']
  - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17:00']
- 'ของชิ้นนี้ราคา 3.50 บาท'
  - actual: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3', '.', '50', ' ', 'บาท']
  - expected: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3.50', ' ', 'บาท']
IMHO, this problem seems quite general. I wonder what could be a good strategy to solve the problem.
Maybe I should use a regex to fix this.
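As a rough sketch of the regex idea: numeric spans such as `17:00` or `3.50` could be located first and kept whole, with only the remaining text passed on to the tokenizer. The pattern `NUMERIC` and the helper `split_protected` are hypothetical names made up for this illustration, not part of AttaCut or PyThaiNLP, and the choice of separators (`.`, `,`, `:`) is an assumption.

```python
import re

# Assumption: a "numeric span" is digits followed by one or more groups of
# a separator (., , or :) and more digits, e.g. 17:00, 3.50, 1,000.50.
NUMERIC = re.compile(r"\d+(?:[.,:]\d+)+")


def split_protected(text):
    """Split text into (segment, is_numeric) pairs that cover the whole text.

    Segments flagged True should be kept as single tokens; segments flagged
    False can be handed to the word tokenizer as usual.
    """
    segments = []
    pos = 0
    for match in NUMERIC.finditer(text):
        if match.start() > pos:
            # Plain text between the previous numeric span and this one.
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        # Trailing plain text after the last numeric span.
        segments.append((text[pos:], False))
    return segments
```

This only protects the digits and separators. In the first example the trailing `น` is not attached to `17.00`, so that case would still need separate handling.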
The 1st one is actually bad. The 2nd and 3rd can be handled by post-processing that merges digits and symbols into a larger token (one that carries some semantics).
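A minimal sketch of such a post-processing step, assuming the tokenizer output is a plain list of strings. The helper `merge_numeric_tokens` and the `SEPARATORS` set are hypothetical names for illustration only, not an existing AttaCut or PyThaiNLP API.

```python
# Assumption: these are the symbols allowed between two digit tokens.
SEPARATORS = {".", ",", ":"}


def merge_numeric_tokens(tokens):
    """Merge runs of digit, separator, digit tokens into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.isdigit():
            # Keep absorbing "separator + digits" pairs that follow.
            while (
                i + 2 < len(tokens)
                and tokens[i + 1] in SEPARATORS
                and tokens[i + 2].isdigit()
            ):
                token += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(token)
        i += 1
    return merged


print(merge_numeric_tokens(["เจอ", "กัน", "ตอน", " ", "17", ":", "00"]))
print(merge_numeric_tokens(["ของ", "ชิ้น", "นี้", "ราคา", " ", "3", ".", "50", " ", "บาท"]))
```

With this rule the 2nd and 3rd examples come out as expected, while the 1st is left unchanged because `00น` is not a pure digit token.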