grep: add -o option
Also added a change to prevent grep from aborting or failing when the input contains non-UTF-8 bytes.
Questions:
- I see that BSD and Linux grep both have -o. Is this option widely used?
- This is a significant change. Is there a performance impact, vs current code? A microbenchmark would be nice.
I ran into a build script that was trying to use -o. I can't remember which program I was compiling at the time; I'll try to find it.
Most of the changes are related to switching from UTF-8 types to raw byte types. It's not uncommon to want to use text processing tools on "mostly UTF-8" files (in my case I was trying to search my shell history file, which somehow got some non-UTF-8 bytes in it).
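To make the "mostly UTF-8" case concrete, here is a minimal sketch of that kind of input. It is not taken from the pull request: the sample text is made up, and it uses whatever grep is on the PATH with LC_ALL=C, assuming that grep treats every byte as a valid character in the C locale (which, as far as I know, recent GNU grep does).

```shell
# Hypothetical sample input: the second line contains the lone byte 0xE9
# (octal 351), which is not valid UTF-8 on its own.
# LC_ALL=C applies only to the grep invocation and asks it to work on
# single bytes, so the stray byte should not count as an encoding error.
byte_matches=$(printf 'plain adjective line\ncaf\351 adjective line\n' \
    | LC_ALL=C grep -o 'adjective')

# Print one match per line.
printf '%s\n' "$byte_matches"
```

A grep that insists on valid UTF-8 input may stop or report an error at the second line instead, which is the situation the byte-oriented change is meant to avoid.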
Performance is mostly unchanged:
❯ hyperfine ' ./grep-after-pull-request -F adjective ./webster ' ' ./grep-before-pull-request -F adjective ./webster '
Benchmark 1: ./grep-after-pull-request -F adjective ./webster
Time (mean ± σ): 96.5 ms ± 1.2 ms [User: 88.4 ms, System: 7.1 ms]
Range (min … max): 95.2 ms … 100.3 ms 30 runs
Benchmark 2: ./grep-before-pull-request -F adjective ./webster
Time (mean ± σ): 94.7 ms ± 0.8 ms [User: 87.1 ms, System: 6.8 ms]
Range (min … max): 93.5 ms … 96.3 ms 30 runs
Summary
./grep-before-pull-request -F adjective ./webster ran
1.02 ± 0.02 times faster than ./grep-after-pull-request -F adjective ./webster
❯ hyperfine ' ./grep-after-pull-request adjective ./webster ' ' ./grep-before-pull-request adjective ./webster '
Benchmark 1: ./grep-after-pull-request adjective ./webster
Time (mean ± σ): 191.8 ms ± 3.9 ms [User: 181.6 ms, System: 8.7 ms]
Range (min … max): 183.3 ms … 195.6 ms 15 runs
Benchmark 2: ./grep-before-pull-request adjective ./webster
Time (mean ± σ): 210.7 ms ± 3.5 ms [User: 202.0 ms, System: 7.1 ms]
Range (min … max): 206.6 ms … 220.4 ms 14 runs
Summary
./grep-after-pull-request adjective ./webster ran
1.10 ± 0.03 times faster than ./grep-before-pull-request adjective ./webster
"./webster" is "The 1913 Webster Unabridged Dictionary" from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia (just under 40 mebibytes).
I couldn't find the program I had been trying to compile, but here's an interesting statistic:
I have a directory containing 42 Git repositories I've cloned. Four of those repositories contain scripts that use grep -o:
https://github.com/hinto-janai/festival
https://github.com/landley/toybox
https://github.com/libjxl/libjxl
https://github.com/uutils/coreutils
grep -o combined with grep's regular expression support is quite useful:
# Get all https URLs (each match stops at the first whitespace character).
grep -Eo 'https://[^[:space:]]+' text.txt
# Get numbers.
grep -Eo '[0-9]+' text.txt
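As a small illustration of what -o does with several matches on one line, here is a sketch using made-up input on stdin instead of text.txt. Each non-empty match is printed on its own line, and lines without a match produce no output.

```shell
# Made-up input: the first line has three runs of digits, the second has none.
number_matches=$(printf 'a1b22c333\nno digits here\n' | grep -Eo '[0-9]+')

# Prints 1, 22 and 333, each on its own line.
printf '%s\n' "$number_matches"
```

Because the output is one match per line, it pipes cleanly into tools like sort or wc -l, which is probably why build scripts reach for it.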