Enforce order for results with same confidence

Open xWTF opened this issue 3 years ago • 0 comments

Hi gogs developers, the current implementation uses goroutines to speed up detection, which makes perfect sense.

But the result is not guaranteed to be consistent when multiple detectors return the same confidence.

POC:

  1. Encode "ノエル" with Shift_JIS => "\x83m\x83G\x83\x8b"
  2. Try to detect with DetectBest
  3. The result is randomly picked from one of the following: Shift_JIS, GB18030 and Big5,
    because they all have the same confidence, 10
  4. Try to detect with DetectAll
  5. The result order is not consistent between runs 😢
  6. For the same byte sequence, decoding with a different charset obviously results in different content.
  7. And this breaks apps that want to detect whether the content has changed 💥
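
To illustrate why the pick can vary, here is a minimal, self-contained model of the race. It does not call the library: the `result` type and the `detectAllUnordered` function are hypothetical stand-ins, and the candidate list simply hard-codes the three tied charsets from the steps above. Results are collected in goroutine completion order and then sorted by confidence only, so the first element may differ between runs.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// result is a hypothetical stand-in for the library's result type.
type result struct {
	Charset    string
	Confidence int
}

// detectAllUnordered models detectors racing on goroutines: each candidate
// is reported over a channel, so results arrive in completion order.
func detectAllUnordered(candidates []result) []result {
	out := make(chan result)
	var wg sync.WaitGroup
	for _, c := range candidates {
		wg.Add(1)
		go func(c result) {
			defer wg.Done()
			out <- c
		}(c)
	}
	// Close the channel once every simulated detector has reported.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []result
	for r := range out {
		results = append(results, r)
	}
	// Sorting by confidence alone leaves tied entries in an unspecified order.
	sort.Slice(results, func(i, j int) bool {
		return results[i].Confidence > results[j].Confidence
	})
	return results
}

func main() {
	tied := []result{{"Shift_JIS", 10}, {"GB18030", 10}, {"Big5", 10}}
	// The printed charset may change from run to run.
	for run := 0; run < 5; run++ {
		fmt.Println(detectAllUnordered(tied)[0].Charset)
	}
}
```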

Fix:

  1. Introduce a Result.order field
  2. Sort the results (or replace the result in DetectBest) by confidence; if the confidence is the same, sort by order
  3. This guarantees the consistency of the result
  4. Although the detected encoding MAY NOT BE CORRECT, the output is ALWAYS THE SAME for the same input
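
A rough sketch of the proposed ordering, not a patch against the actual code. The `Result` struct here is a simplified local type, the `order` values are assumed to be each detector's index in a fixed list, and the ISO-8859-1 entry with confidence 5 is made up purely to show the primary sort key at work.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is a simplified local type; order is the proposed tie-breaker
// (assumed here to be the detector's index in a fixed list).
type Result struct {
	Charset    string
	Confidence int
	order      int
}

// sortResults sorts by confidence descending, then by order ascending.
// Because order is unique per detector, the outcome does not depend on
// the sequence in which results arrived.
func sortResults(rs []Result) {
	sort.Slice(rs, func(i, j int) bool {
		if rs[i].Confidence != rs[j].Confidence {
			return rs[i].Confidence > rs[j].Confidence
		}
		return rs[i].order < rs[j].order
	})
}

func main() {
	// The slice order simulates goroutines finishing in an arbitrary sequence.
	rs := []Result{
		{"Big5", 10, 2},
		{"ISO-8859-1", 5, 3}, // illustrative value only
		{"GB18030", 10, 1},
		{"Shift_JIS", 10, 0},
	}
	sortResults(rs)
	for _, r := range rs {
		fmt.Println(r.Charset, r.Confidence)
	}
	// Prints Shift_JIS, GB18030, Big5 (all 10), then ISO-8859-1 (5).
}
```

With this comparison, DetectBest could simply take the first element after sorting, and any arrival order of the same results yields the same output.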

xWTF, Jan 30 '23 09:01