Handling of BOM leading characters
From @yongminyan .
Hey @j256, I found these issues while parsing HTML content that starts with a BOM. Examples are the byte array "-17, -69, -65, 60, 104, 116, 109, 108, 32" (the first three bytes are the UTF-8 BOM, followed by the <html tag) and "-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0" (the first two bytes are the UTF-16 little-endian BOM, followed by the <html tag). In these cases the library fails to detect the content as text/html. For it to work, I think we need to fix the issues first and then add proper magic entries, something like
+0 byte 0xEF
+!:mime text/html
+>1 byte 0xBB
+>>2 byte 0xBF UTF-8 Unicode text with BOM
+>>>3 search/1/cb \<html
and
+# UTF-16 LE
+0 byte 0xFF
+!:mime text/html
+>1 byte 0xFE
+>>2 lestring16 \<html Little-endian UTF-16 Unicode text with BOM
I did not include the magic entries in the pull request because I feel those changes are not very generic: the same problem could occur with other types such as XML (i.e., in different encodings). I am not sure what the best solution is.
Also, I am not sure whether lestring16/bestring16 support the [Bbc] options. The magic5 spec does not say so, but I see that lestring16/bestring16 extend StringTypes. In other words, can we write something like lestring16/cb?
It would be great if you could take a look and answer my two questions above. Thanks a lot!
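For reference, here is a minimal sketch in plain Java of how the two sample byte arrays above decode once the BOM is recognized and skipped. It does not use any of the library's API. The class `BomSketch` and its methods are hypothetical names made up for this illustration, and it only covers the UTF-8 and UTF-16 BOMs discussed here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the library.
class BomSketch {

    /** Returns the charset implied by a leading BOM, or null if no known BOM is present. */
    static Charset detectBomCharset(byte[] bytes) {
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Note: FF FE is also the prefix of the UTF-32 LE BOM; that case is ignored here.
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Decodes the content that follows the BOM; falls back to ISO-8859-1 when there is no BOM. */
    static String decodeAfterBom(byte[] bytes) {
        Charset charset = detectBomCharset(bytes);
        if (charset == null) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
        // The UTF-8 BOM is 3 bytes long, both UTF-16 BOMs are 2 bytes long.
        int skip = charset.equals(StandardCharsets.UTF_8) ? 3 : 2;
        return new String(bytes, skip, bytes.length - skip, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = {-17, -69, -65, 60, 104, 116, 109, 108, 32};
        byte[] utf16le = {-1, -2, 60, 0, 104, 0, 116, 0, 109, 0, 108, 0};
        System.out.println(decodeAfterBom(utf8).startsWith("<html"));    // prints "true"
        System.out.println(decodeAfterBom(utf16le).startsWith("<html")); // prints "true"
    }
}
```

Both samples decode to text beginning with <html once the BOM is skipped. This is why a BOM-aware step ahead of the existing text/html matching might be more generic than one magic entry per encoding and type, though I have not tried that inside the library.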