UTF-16/32/7 (and UCS-2/4) support
Per the ack-users list discussion re UTF-16: while it's somewhat outside the core use case, it's not totally absurd to consider one or more command-line options to enable
- process all input files as a specified encoding (preferably with byte order correction if LE/BE not specified)
- check all input files for a Unicode BOM magic number, and process each accordingly (a BOM-sniffing sketch follows below)
- output encoding other than Latin1/UTF8 on request
Discussion - https://groups.google.com/forum/#!topic/ack-users/qidCgv3S5Uo
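As a rough sketch of the BOM-sniffing option (second bullet above), assuming we peek at the first four raw bytes of each file; the function name is illustrative, not anything ack has today:

```perl
use strict;
use warnings;

# Peek at the first four raw bytes of a file and guess an encoding
# from a Unicode byte order mark. Returns undef if no BOM is found.
sub encoding_from_bom {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or return undef;
    my $bytes = '';
    read $fh, $bytes, 4;
    close $fh;

    # Check the 4-byte UTF-32 marks before the 2-byte UTF-16 ones,
    # since a UTF-32LE BOM starts with the UTF-16LE BOM bytes.
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return undef;   # no BOM; fall back to --encoding or plain bytes
}
```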
Interestingly, παπια works as a search against the UTF-8 test file http://www.humancomp.org/unichtm/tongtws8.htm with ack normally, but not with the UTF-16 hack described on the list against the UCS-2 version of the file http://www.humancomp.org/unichtm/tongtwst.htm.
I don't think that's a bug until we claim to support UTF-16 :-) but it should be considered a criterion if we ever do.
For future reference, the MIT-licensed project https://github.com/tahonermann/text_view/ has an extensive examples section covering Unicode file types.
(Includes files without a BOM prefix, which are problematic for inspection. One could in theory try each possible decoding in a try{} block to see which, if any, are valid, but will that have false positives? I'm guessing so. BOM-free UTF-16/32 files will likely have to be requested from the command line and not mixed with others :-/ )
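For reference, a rough sketch of that try-each-decoding idea (the candidate list and function name are hypothetical); as noted above, short files often decode cleanly under several of these, so more than one hit means the guess is ambiguous:

```perl
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return every candidate encoding under which the raw octets decode
# without error. Multiple hits are common for small files, which is
# exactly the false-positive problem described above.
sub plausible_encodings {
    my ($octets) = @_;
    my @hits;
    for my $enc ('UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE') {
        my $ok = eval { decode($enc, $octets, FB_CROAK | LEAVE_SRC); 1 };
        push @hits, $enc if $ok;
    }
    return @hits;
}
```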
Suggestions would be:
- `--encoding=utf[-][32|16|8][be|le]|ucs-[2|4]` (implies what byte order if no be/le suffix? should allow case-insensitive spellings and canonicalize, except for the utf8/UTF-8/UTF8/utf8-strict distinction in Perl), and
- `--[no]encoding=bom|automatic` to use the Byte Order Mark to distinguish UTF-8 from ASCII and recognize UTF-16/32 BE/LE, and decode properly, before #! inspection. With the proliferation of UTF-8 BOMs, loose utf8 decoding upon seeing a BOM is a reasonable default!
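For the canonicalization step, Encode::resolve_alias already does most of the case-insensitive alias mapping. A sketch, with the caveat that the only deliberate special case is keeping Perl's loose utf8 distinct from strict UTF-8, and that exact alias coverage for spellings like utf16le would need checking against Encode::Alias:

```perl
use Encode ();

# Map whatever the user typed after --encoding= to Encode's canonical
# name, dying on anything Encode doesn't recognize.
sub canonical_encoding {
    my ($name) = @_;
    # Preserve Perl's distinction between loose 'utf8' and strict 'UTF-8'.
    return 'utf8' if lc($name) eq 'utf8';
    my $canonical = Encode::resolve_alias($name);
    die "ack: unknown encoding '$name'\n" unless $canonical;
    return $canonical;
}
```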
Based on cursory experiment, I do not believe heuristic detection of non-BOM-tagged UTF is practical without a heuristic test for "intelligible text" in the expected / desired language(s), which is beyond our scope. Small files will not fail to decode under encodings other than the intended one. (A good-quality Unicode text checker, if one can be found, may improve rejection of spurious decodes, however.)
Cross reference:
- beyondgrep/ack2#565 Unicode and beyondgrep/ack2#120 Unicode
- We need test cases for what Unicode does and doesn't match with and without (?u:) patterns, with and without the input being Unicode-decoded.
Reading the document as Unicode opens a can of worms regarding Unicode REs ...
- When is the RE to be treated as (?u:)?
- When does /[c]/ match "ç"?
- When does /\w/ match "à á â ç è é ê ì í î ô ü µ 𝛷 𝛹 𝛳 ô 𝟇 𝝿 𝜎 τ"?
... which may deserve its own Issue #
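A tiny illustration of why the decode step drives the answers above (this assumes strict UTF-8 input; /\w/ behaves differently against raw octets than against a decoded string):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xC3\xA7";                  # the UTF-8 bytes for "ç", undecoded
my $chars  = decode('UTF-8', $octets);    # one character, U+00E7

print $octets =~ /^\w+$/ ? "octets match\n" : "octets do not match\n";  # no match: two non-ASCII bytes
print $chars  =~ /^\w+$/ ? "chars match\n"  : "chars do not match\n";   # matches: ç is a Unicode word char
print $chars  =~ /[c]/   ? "[c] matches\n"  : "[c] does not match\n";   # no match: ç is not c
```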
Richard replied:
For what it's worth, I'd happily settle for an explicit encoding flag that could be tucked away in a working directory's .ackrc file.
That, including a BOM-magic option for mixed collections, seems least unlikely.
See also Rob's 2015 "Ack 2.1" notes on implications of Unicode Support
Additional caveat: the workaround noted in the email thread will break in Perl 5.30 and has a deprecation warning in Perl 5.24-5.28:
$ perl -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}' han.txt
sysread() is deprecated on :utf8 handles. This will be a fatal error in Perl 5.30 at /home/wdr/bin/ack line 302.
hello 世界
perlbrew exec --with perl-5.30.0 "perl -C '-Mopen IO=>\":encoding(utf8)\"' /home/wdr/bin/ack --noenv '\p{Han}' han.txt"
sysread() isn't allowed on :utf8 handles at /home/wdr/bin/ack line 302.
Command terminated with non-zero status.
...
That our UTF workaround is dead with the latest stable Perl suggests that getting at least multibyte UTF-8 correctly handled, by decoding the sysread buffer per the locale, is a bit more urgent than we thought.
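A hedged sketch of that direction, assuming we keep the filehandle binary and decode explicitly rather than pushing a :utf8/:encoding layer under sysread(): Encode's FB_QUIET removes the successfully decoded prefix from the buffer in place, so a multibyte sequence split across reads is carried over to the next call. Names here are illustrative only:

```perl
use Encode qw(decode FB_QUIET);

# Read raw octets with sysread() and decode them incrementally.
# $pending accumulates any trailing partial multibyte sequence until
# the next read completes it.
my $pending = '';
sub read_decoded {
    my ($fh, $encoding) = @_;   # $encoding from --encoding or the locale
    my $n = sysread($fh, my $buf, 65536);
    return undef unless defined $n && $n > 0;
    $pending .= $buf;
    # FB_QUIET decodes as much as it can and leaves the unconverted
    # tail in $pending. (Real code would also have to detect genuinely
    # malformed input, at which FB_QUIET just stops.)
    my $text = decode($encoding, $pending, FB_QUIET);
    return $text;
}
```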
(Also, we should add the UTF workarounds and limitations to the website FAQ and maybe the shipping POD FAQ.)