pythainlp Incorrect word tokenization of text containing digits in times and prices

Description

I was told by email that AttaCut (and possibly other tokenizers as well) does not cope well with texts like the ones below:

- 'เจอกันตอน 17.00น.' 
   - actual: ['เจอ', 'กัน', 'ตอน', ' ', '17', '.', '00น', '.']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17.00น', '.']
- 'เจอกันตอน 17:00'
   - actual:  ['เจอ', 'กัน', 'ตอน', ' ', '17', ':', '00']
   - expected: ['เจอ', 'กัน', 'ตอน', ' ', '17:00']
- 'ของชิ้นนี้ราคา 3.50 บาท'
   - actual:  ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3', '.', '50', ' ', 'บาท']
   - expected: ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3.50', ' ', 'บาท']

IMHO, this problem seems quite general. I wonder what could be a good strategy to solve the problem.

Feb 10 '22 13:02 p16i

Maybe I should use a regex to fix this.

Feb 22 '22 15:02 wannaphong

The 1st one is actually bad. The 2nd and 3rd can be post-processed to combine digits and symbols into a larger token (with some semantic meaning).
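
A minimal sketch of what such a post-processing step could look like, applied to the token list after tokenization. The function name `merge_numeric_tokens` and the separator set are assumptions made for illustration; nothing here is part of PyThaiNLP or AttaCut.

```python
import re

# Assumption: only these symbols join digit groups
# (decimal point, time colon, thousands comma).
SEPARATORS = {".", ":", ","}
DIGITS = re.compile(r"\d+")


def merge_numeric_tokens(tokens):
    """Merge digit, separator, digit runs, e.g. ['17', ':', '00'] into '17:00'.

    Hypothetical helper: it only merges when the tokens on both sides of
    the separator consist entirely of digits.
    """
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if DIGITS.fullmatch(token):
            # Keep absorbing "<separator><digits>" pairs while they follow.
            while (
                i + 2 < len(tokens)
                and tokens[i + 1] in SEPARATORS
                and DIGITS.fullmatch(tokens[i + 2])
            ):
                token += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(token)
        i += 1
    return merged


time_tokens = merge_numeric_tokens(['เจอ', 'กัน', 'ตอน', ' ', '17', ':', '00'])
price_tokens = merge_numeric_tokens(
    ['ของ', 'ชิ้น', 'นี้', 'ราคา', ' ', '3', '.', '50', ' ', 'บาท']
)
```

With this rule the 2nd and 3rd examples come out as expected, while the 1st is left unchanged because '00น' is not a pure digit token. In practice the input list would come from the tokenizer, for example `word_tokenize(text, engine="attacut")` in PyThaiNLP.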

Feb 22 '22 18:02 bact