chardet
Enforce order for results with same confidence
Hi gogs developers, the current implementation uses goroutines to speed up detection, which makes perfect sense.
However, the result is not guaranteed to be consistent when multiple detectors return the same confidence.
POC:
- Encode `"ノエル"` with `Shift_JIS` => `"\x83m\x83G\x83\x8b"`
- Try to detect with `DetectBest`
  - The result is randomly picked from one of the following: `Shift_JIS`, `GB18030` and `Big5`
  - Because they all have the same confidence, `10`
- Try to detect with `DetectAll`
  - The result order is not consistent between runs 😢
- For the same byte sequence, decoding with a different charset obviously results in different content.
- And this breaks apps that rely on the detected charset to tell whether the content has changed 💥
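To illustrate the tie problem in isolation, here is a minimal sketch that does not use the library at all. It assumes results are collected in whatever order the goroutines finish and that the best one is chosen with a strict `>` comparison; the `Result` type and `pickFirstBest` below are hypothetical stand-ins, not the actual chardet code.

```go
package main

import "fmt"

// Result is a simplified, hypothetical stand-in for a detection result.
type Result struct {
	Charset    string
	Confidence int
}

// pickFirstBest keeps the first result with the highest confidence.
// On a tie, the winner is simply whichever result arrived first.
func pickFirstBest(results []Result) Result {
	best := results[0]
	for _, r := range results[1:] {
		if r.Confidence > best.Confidence {
			best = r
		}
	}
	return best
}

func main() {
	// Two possible arrival orders for the same input, all tied at confidence 10.
	runA := []Result{{"Shift_JIS", 10}, {"GB18030", 10}, {"Big5", 10}}
	runB := []Result{{"Big5", 10}, {"Shift_JIS", 10}, {"GB18030", 10}}

	fmt.Println(pickFirstBest(runA).Charset) // Shift_JIS
	fmt.Println(pickFirstBest(runB).Charset) // Big5
}
```

The same three candidates yield a different "best" charset purely because of arrival order, which is the behaviour described in the POC.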
Fix:
- Introduce a `Result.order` field
- Sort the results (or replace the result in `DetectBest`) by confidence; if the confidence is the same, sort by `order`
- This guarantees a consistent result
- Although the detected encoding MAY NOT BE CORRECT, the output is ALWAYS the SAME for the same input
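The ordering rule above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `Result` type is simplified, `sortResults` is a hypothetical helper, and the `order` values assigned to each charset are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a simplified, hypothetical stand-in for a detection result.
type Result struct {
	Charset    string
	Confidence int
	order      int // assumed: fixed position of the detector that produced this result
}

// sortResults orders by confidence (highest first) and breaks ties by order,
// so the output does not depend on the order in which results arrived.
func sortResults(results []Result) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Confidence != results[j].Confidence {
			return results[i].Confidence > results[j].Confidence
		}
		return results[i].order < results[j].order
	})
}

func main() {
	// Same candidates in two different arrival orders; order values are illustrative.
	runA := []Result{{"GB18030", 10, 1}, {"Big5", 10, 2}, {"Shift_JIS", 10, 0}}
	runB := []Result{{"Big5", 10, 2}, {"Shift_JIS", 10, 0}, {"GB18030", 10, 1}}

	sortResults(runA)
	sortResults(runB)

	fmt.Println(runA[0].Charset, runB[0].Charset) // Shift_JIS Shift_JIS
}
```

Because every `order` value is unique, the comparison is a total order, so `sort.Slice` (which is not stable) still produces the same sequence on every run.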