
UTF-16/32/7 (and UCS-2/4) support

Open n1vux opened this issue 7 years ago • 7 comments

Per the ack-users list discussion of UTF-16: while it's somewhat outside the core use case, it's not totally absurd to consider one or more command-line options to enable:

  • process all input files as a specified encoding (preferably with byte order correction if LE/BE not specified)
  • check all input files for Unicode BOM magic number, and process each accordingly
  • output encoding other than Latin1/UTF8 on request
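
The BOM check in the second bullet can be sketched as follows. This is a minimal illustration in Python, not ack's Perl code; `sniff_bom` and `decode_with_bom` are hypothetical names, and falling back to UTF-8 when no BOM is present is an assumption.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the BOM if there is one, else the (assumed) default."""
    encoding, skip = sniff_bom(data[:4])
    return data[skip:].decode(encoding or default)
```

One caveat of this ordering: a UTF-16-LE file whose first character after the BOM is U+0000 would be taken for UTF-32-LE, since the byte sequences are identical.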

Discussion - https://groups.google.com/forum/#!topic/ack-users/qidCgv3S5Uo

n1vux avatar Aug 24 '18 22:08 n1vux

Interestingly, παπια works as a search against the UTF-8 test file http://www.humancomp.org/unichtm/tongtws8.htm with Ack normally, but not with the UTF-16 hack described on list against the UCS-2 version of the file http://www.humancomp.org/unichtm/tongtwst.htm. I don't think that's a bug until we claim to support UTF-16 :-) but it should be considered a criterion if we ever do.

n1vux avatar Aug 24 '18 22:08 n1vux

For future reference, MIT Licensed project https://github.com/tahonermann/text_view/ has extensive examples section for Unicode file types.

(Includes files without a BOM prefix, which are problematic for inspection. One could in theory try each possible decoding in a try{} block to see which, if any, are valid, but will that have false positives? I'm guessing so. BOM-free UTF-16/32 files will likely have to be requested from the command line and not mixed with others :-/ )
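
The false-positive concern can be checked with a small experiment. The sketch below uses Python's strict decoders as stand-ins for Perl's, and the candidate list and function name are illustrative assumptions. It reports every encoding under which a byte string decodes without error.

```python
CANDIDATES = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def valid_decodings(data):
    """Return the candidate encodings under which data decodes strictly."""
    ok = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


ascii_sample = b"hello, world"            # 12 bytes of plain ASCII
utf16_sample = "hi".encode("utf-16-le")   # BOM-free UTF-16-LE
```

Both samples decode cleanly as UTF-8, UTF-16-LE and UTF-16-BE, so validity alone does not identify the intended encoding. Only the UTF-32 candidates are rejected here, because the resulting code points fall outside the Unicode range.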

n1vux avatar Aug 24 '18 23:08 n1vux

Suggestions would be:

  • --encoding=utf[-][32|16|8][be|le]|ucs-[2|4] (what byte order is implied if there is no be/le suffix? It should be case-insensitive and canonicalized, except for the utf8/UTF-8/UTF8/utf8-strict distinction in Perl), and
  • --[no]encoding=bom|automatic, to use the Byte Order Mark to distinguish UTF8 from ASCII and UTF16/32 BE/LE and decode properly, before #! inspection. With the proliferation of UTF8 BOMs, loose utf8 decoding upon seeing a BOM is a reasonable default!
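
Case-insensitive parsing and canonicalization of the proposed --encoding value could look roughly like this. It is a sketch in Python rather than ack's Perl: `canonical_encoding` is a hypothetical helper, the canonical spellings are assumptions, and the hyphens are treated as optional throughout. The open byte-order question is left open, so a value without a be/le suffix keeps no suffix.

```python
import re

# Accepts e.g. utf8, UTF-16, utf-32be, UTF-16-LE, ucs-2, UCS4 (any case).
_ENC_RE = re.compile(r"^(?:utf-?(8|16|32)-?(be|le)?|ucs-?([24]))$", re.IGNORECASE)


def canonical_encoding(value):
    """Map a user-supplied encoding name to one canonical spelling."""
    m = _ENC_RE.match(value.strip())
    if not m:
        raise ValueError(f"unrecognised encoding: {value!r}")
    width, order, ucs = m.groups()
    if ucs:
        return f"UCS-{ucs}"
    if width == "8":
        if order:
            raise ValueError("UTF-8 has no byte order")
        return "UTF-8"
    # No suffix means the byte order is still undecided at this point.
    return f"UTF-{width}{order.upper() if order else ''}"
```

The sketch deliberately ignores Perl's utf8 versus UTF-8 (strict) distinction, which the option would have to preserve.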

Based on a cursory experiment, I do not believe heuristic detection of non-BOM-tagged UTF is practical without a heuristic test for "intelligible text" in the expected/desired language(s), which is beyond our scope. Small files often decode without error under encodings other than the intended one. (A good Unicode quality checker, if found, may make rejection of spurious decodes more effective, however.)

Cross reference:

  • beyondgrep/ack2#565 Unicode + beyondgrep/ack2#120 Unicode

  • and we need test cases for what Unicode does and doesn't match with and without (?u:) patterns, with and without the input being Unicode-decoded.

n1vux avatar Aug 28 '18 15:08 n1vux

Reading the document as Unicode opens a can of worms regarding Unicode REs ... when is the RE to be treated as (?u:)? When does /[c]/ match "ç"? When does /\w/ match "à á â ç è é ê ì í î ô ü µ 𝛷 𝛹 𝛳 ô 𝟇 𝝿 𝜎 τ"? ... which may deserve its own Issue #
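
The same kinds of question can be made concrete in Python's `re` module. Its rules differ from Perl's, so this only illustrates the categories of behaviour that test cases would need to pin down; the variable names are illustrative.

```python
import re
import unicodedata

s = "ç"                  # U+00E7, a single precomposed character
raw = s.encode("utf-8")  # two undecoded bytes: 0xC3 0xA7

# Undecoded input: a bytes pattern only knows ASCII word characters.
word_in_bytes = re.search(rb"\w", raw) is not None

# Decoded input: \w is Unicode-aware unless the ASCII flag is set.
word_in_text = re.search(r"\w", s) is not None
word_in_text_ascii = re.search(r"\w", s, re.ASCII) is not None

# [c] never matches precomposed "ç", but it does match the base letter
# once the string is decomposed (NFD) into "c" + U+0327 combining cedilla.
c_in_composed = re.search(r"[c]", s) is not None
c_in_decomposed = re.search(r"[c]", unicodedata.normalize("NFD", s)) is not None
```

Here only `word_in_text` and `c_in_decomposed` are true, which shows that both the decoding step and the normalization form change what a pattern matches.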

n1vux avatar Aug 28 '18 15:08 n1vux

Richard replied

For what it's worth, I'd happily settle for an explicit encoding flag that could be tucked away in a working directory's .ackrc file. E

That, including a BOM-magic option for mixed collections, seems least unlikely.

n1vux avatar Sep 02 '18 21:09 n1vux

See also Rob's 2015 "Ack 2.1" notes on implications of Unicode Support

n1vux avatar Oct 31 '18 05:10 n1vux

Additional caveat: the workaround noted in the email thread breaks in Perl 5.30 and triggers a deprecation warning in Perl 5.24-5.28:

$ perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}' han.txt
sysread() is deprecated on :utf8 handles. This will be a fatal error in Perl 5.30 at /home/wdr/bin/ack line 302.
hello 世界
$ perlbrew exec --with perl-5.30.0 "perl  -C '-Mopen IO=>\":encoding(utf8)\"' /home/wdr/bin/ack --noenv '\p{Han}' han.txt"
sysread() isn't allowed on :utf8 handles at /home/wdr/bin/ack line 302.
Command terminated with non-zero status.
 ...

That our UTF workaround is dead with the latest stable Perl suggests that correctly handling at least multibyte UTF-8, by decoding the sysread buffer per locale, is a bit more urgent than we thought.
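
Decoding raw read buffers has one known trap: a multibyte character can straddle two reads. The sketch below uses Python's incremental decoder as a stand-in for whatever ack would do in Perl. `decoded_chunks` is a hypothetical name, and the encoding is passed explicitly here rather than taken from the locale.

```python
import codecs
import io


def decoded_chunks(raw, encoding="utf-8", bufsize=4096):
    """Yield decoded text from a binary stream, one raw read at a time.

    The incremental decoder holds back a trailing partial character
    until the next read supplies the rest of its bytes.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    while True:
        chunk = raw.read(bufsize)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "hello 世界".encode("utf-8")
# A 7-byte buffer deliberately splits the 3-byte encoding of 世.
pieces = list(decoded_chunks(io.BytesIO(data), bufsize=7))
```

Decoding each 7-byte buffer independently would fail on the first read, while the incremental decoder reassembles the text correctly.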

(Also, we should add UTF workarounds and limitations to the website FAQ and maybe to the FAQ in the shipped POD.)

n1vux avatar Aug 08 '19 15:08 n1vux