binaryornot

Not checking for low percentage of high ascii chars.

Open siulkilulki opened this issue 7 years ago • 0 comments

Something looks wrong to me just from reading the code.

The comments in the code say:

and check for a low percentage of high ASCII characters: Binary if high ASCII chars are < 5% of the string

but high_chars is then computed by deleting the high ASCII bytes, so it holds every byte that is not high ASCII: high_chars = bytes_to_check.translate(None, _printable_high_ascii)

So nontext_ratio2 is actually the ratio of low ASCII chars (codes 0 to 126), not of high ASCII chars.
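
To illustrate the point about translate, here is a minimal sketch. It assumes _printable_high_ascii is defined as bytes(range(127, 256)), as in the Python 3 branch of the library; the sample bytes are made up for the demonstration.

```python
# Assumption: mirrors the Python 3 definition used by the library.
_printable_high_ascii = bytes(range(127, 256))

# Made-up sample: 3 ASCII letters, 3 high bytes, 1 newline (7 bytes total).
bytes_to_check = b"abc\xe4\xb8\xad\n"

# bytes.translate(None, delete) DELETES every byte listed in `delete`,
# so what is left over is exactly the bytes in the range 0-126.
high_chars = bytes_to_check.translate(None, _printable_high_ascii)
nontext_ratio2 = len(high_chars) / len(bytes_to_check)

print(high_chars)                # b'abc\n'
print(round(nontext_ratio2, 3))  # 0.571
```

Despite its name, high_chars ends up holding the four low bytes, so the ratio grows as the share of high bytes shrinks.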

Am I just tired, or am I misunderstanding something? It seems that either the high_chars variable is misnamed, or we are not actually checking for a low percentage of high ASCII chars.

But even if high_chars really held the high chars (codes 127 to 255), the condition (nontext_ratio1 > 0.8 and nontext_ratio2 > 0.8) would not make sense, for example for the test test_text_utf82. That test checks the file tests/isBinaryFile/encodings/utf_8.txt, which contains '中文\n'. In UTF-8 this is 7 bytes, 6 of which are high. nontext_ratio1, which also counts the high chars, would then be the same as nontext_ratio2, both equal to 6/7 ≈ 0.857. So the file would be classified as binary, and it is not.
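
The arithmetic for that file can be checked with a short sketch. It assumes the printable set is the five control chars \n, \r, \t, \f, \b plus codes 32 to 126, and ratio2_hypothetical is a name introduced here for the "what if high_chars really were high chars" case; it is not something the library computes.

```python
# UTF-8 encoding of '中文\n': two 3-byte characters plus a newline.
data = b"\xe4\xb8\xad\xe6\x96\x87\n"

# Assumption: printable set as described in the lead-in above.
control_chars = b"\n\r\t\f\b"
printable_ascii = control_chars + bytes(range(32, 127))

# nontext_ratio1: share of bytes that are not printable ASCII.
low_chars = data.translate(None, printable_ascii)
nontext_ratio1 = len(low_chars) / len(data)

# Hypothetical nontext_ratio2 if it counted only bytes 127-255.
true_high = bytes(b for b in data if b >= 127)
ratio2_hypothetical = len(true_high) / len(data)

print(round(nontext_ratio1, 3), round(ratio2_hypothetical, 3))  # 0.857 0.857

# Both ratios exceed 0.8, so this condition would flag the text file.
flagged_as_binary = nontext_ratio1 > 0.8 and ratio2_hypothetical > 0.8
print(flagged_as_binary)  # True
```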

What would make sense is defining nontext_ratio1 as the ratio of ASCII control chars with codes 0-7, 11, and 14-31, and nontext_ratio2 as the ratio of chars with codes 127 to 255. Then the conditions should be changed accordingly.
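
A rough sketch of the two ratios proposed above might look like this. The function name proposed_ratios and the constants are hypothetical, the printable set is assumed to match the library's, and no new thresholds are suggested here since the right conditions are still open.

```python
# Assumptions: same printable set as the library; all names are hypothetical.
CONTROL_CHARS = b"\n\r\t\f\b"
PRINTABLE_ASCII = CONTROL_CHARS + bytes(range(32, 127))
HIGH_BYTES = bytes(range(127, 256))


def proposed_ratios(data):
    """Return (low control byte ratio, high byte ratio) for `data`."""
    if not data:
        return 0.0, 0.0
    # Deleting printable ASCII and high bytes leaves only the low
    # non-text control bytes.
    low_control = data.translate(None, PRINTABLE_ASCII + HIGH_BYTES)
    # High byte count = total length minus what survives deleting them.
    high_count = len(data) - len(data.translate(None, HIGH_BYTES))
    return len(low_control) / len(data), high_count / len(data)


ratio1, ratio2 = proposed_ratios(b"\xe4\xb8\xad\xe6\x96\x87\n")
print(round(ratio1, 3), round(ratio2, 3))  # 0.0 0.857
```

With this split, the UTF-8 sample has no low control bytes at all, so the two ratios no longer move together.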

Am I really tired, or do we have a bug here?

siulkilulki, Jun 14 '18 22:06