posixutils-rs

grep: add -o option

Open andrewliebenow opened this issue 1 year ago • 5 comments

andrewliebenow, Oct 22 '24 02:10

Added a further change so that grep no longer aborts when the input contains non-UTF-8 bytes.

andrewliebenow, Oct 28 '24 13:10

Questions:

  1. I see that BSD and Linux grep both have -o. Is this option widely used?

  2. This is a significant change. Is there a performance impact compared with the current code? A microbenchmark would be nice.

jgarzik, Nov 03 '24 15:11

I ran into a build script that was trying to use -o. I can't remember which program I was compiling at the time, but I'll try to find it.

Most of the changes are related to switching from UTF-8 string types to raw byte types. It's fairly common to want to use text processing tools on "mostly UTF-8" files (in my case I was trying to search my shell history file, which somehow got some non-UTF-8 bytes in it).
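To make the "mostly UTF-8" case concrete, here is a minimal sketch that builds a small file containing one stray 0xFF byte (never valid in UTF-8) and searches it. It uses whatever grep is installed on the system, not the posixutils-rs grep under discussion, and it assumes that grep accepts the common `-a` and `-o` extensions. The file contents and the `sample` and `matches` names are made up for illustration.

```shell
# Build a "mostly UTF-8" sample: the second line contains a stray 0xFF byte.
# (\377 is the octal escape for 0xFF.)
sample=$(mktemp)
printf 'git status\nmake \377 install\ngit log\n' > "$sample"

# Assumption: the system grep supports -a (treat input as text) and -o.
# LC_ALL=C asks grep to work on bytes rather than decoded characters.
matches=$(LC_ALL=C grep -a -o 'install' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

A grep that insists on valid UTF-8 input would fail on the second line instead of printing the match, which is the failure mode the switch to byte types is meant to avoid.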

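Performance is mostly unchanged: fixed-string (-F) search is about 2% slower, which is close to the measurement noise, and regular expression search is about 10% faster: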

❯ hyperfine ' ./grep-after-pull-request -F adjective ./webster ' ' ./grep-before-pull-request -F adjective ./webster '
Benchmark 1:  ./grep-after-pull-request -F adjective ./webster 
  Time (mean ± σ):      96.5 ms ±   1.2 ms    [User: 88.4 ms, System: 7.1 ms]
  Range (min … max):    95.2 ms … 100.3 ms    30 runs
 
Benchmark 2:  ./grep-before-pull-request -F adjective ./webster 
  Time (mean ± σ):      94.7 ms ±   0.8 ms    [User: 87.1 ms, System: 6.8 ms]
  Range (min … max):    93.5 ms …  96.3 ms    30 runs
 
Summary
   ./grep-before-pull-request -F adjective ./webster  ran
    1.02 ± 0.02 times faster than  ./grep-after-pull-request -F adjective ./webster 
❯ hyperfine ' ./grep-after-pull-request adjective ./webster ' ' ./grep-before-pull-request adjective ./webster '                                                                 
Benchmark 1:  ./grep-after-pull-request adjective ./webster 
  Time (mean ± σ):     191.8 ms ±   3.9 ms    [User: 181.6 ms, System: 8.7 ms]
  Range (min … max):   183.3 ms … 195.6 ms    15 runs
 
Benchmark 2:  ./grep-before-pull-request adjective ./webster 
  Time (mean ± σ):     210.7 ms ±   3.5 ms    [User: 202.0 ms, System: 7.1 ms]
  Range (min … max):   206.6 ms … 220.4 ms    14 runs
 
Summary
   ./grep-after-pull-request adjective ./webster  ran
    1.10 ± 0.03 times faster than  ./grep-before-pull-request adjective ./webster 

"./webster" is "The 1913 Webster Unabridged Dictionary" from https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia (just under 40 mebibytes).
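The ratios in the two hyperfine summaries can be checked by hand from the reported means. The sketch below assumes the summary figure is simply the slower mean divided by the faster mean; `speedup` is a hypothetical helper name, not part of hyperfine.

```shell
# Ratio of mean wall-clock times, slower over faster.
speedup() {
    # $1 = slower mean (ms), $2 = faster mean (ms)
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 96.5 94.7     # fixed-string (-F) run: after vs. before, prints 1.02
speedup 210.7 191.8   # regex run: before vs. after, prints 1.10
```

Both values agree with the summaries above: the -F difference sits inside the reported ±0.02 uncertainty, while the regex improvement does not.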

andrewliebenow, Nov 04 '24 11:11

I couldn't find the program I had been trying to compile, but here's an interesting statistic:

I have a directory containing 42 Git repositories I've cloned. Four of those repositories contain scripts that use grep -o:

https://github.com/hinto-janai/festival
https://github.com/landley/toybox
https://github.com/libjxl/libjxl
https://github.com/uutils/coreutils
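The comment does not say how this count was gathered. One way a similar statistic could be collected is sketched below, assuming each immediate subdirectory of a root directory is one clone. The function name `repos_using_grep_o` and the pattern are hypothetical, and the pattern is only a heuristic: it looks for `grep` followed by a single-dash option cluster ending in `o`, so it can miss some uses and flag a few false positives.

```shell
# Print the name of each repository under "$1" that contains at least one
# file with a grep call whose short options include "o"
# (for example "grep -o" or "grep -Eo").
repos_using_grep_o() {
    for repo in "$1"/*/; do
        [ -d "$repo" ] || continue
        # -r is not part of POSIX grep, but GNU, BSD and BusyBox grep accept it.
        if grep -rqE 'grep[[:space:]]+-[[:alnum:]]*o' "$repo" 2>/dev/null; then
            basename "$repo"
        fi
    done
}
```

Running `repos_using_grep_o ~/src | wc -l` (with `~/src` standing in for the directory of clones) would then give the number of matching repositories.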

andrewliebenow, Nov 04 '24 11:11

grep -o combined with grep's regular expression support is quite useful:

# Get all https URLs (each match stops at the first whitespace character).
grep -Eo 'https://[^[:space:]]+' text.txt

# Get numbers.
grep -Eo '[0-9]+' text.txt
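For readers who have not used -o, here is a small self-contained sketch of what it prints. It assumes a system grep that supports -o, and the sample text is invented. Because an unbounded pattern such as `https://.*` would run to the end of the line, the URL pattern here stops at the first whitespace character.

```shell
sample=$(mktemp)
printf 'See https://example.com/a and https://example.org/b for 2 of 15 mirrors.\n' > "$sample"

# With -o, each match is printed on its own line, in the order it appears,
# instead of the whole matching line being printed once.
numbers=$(grep -Eo '[0-9]+' "$sample")
urls=$(grep -Eo 'https://[^[:space:]]+' "$sample")
printf '%s\n' "$numbers" "$urls"

rm -f "$sample"
```

The first command yields `2` and `15` on separate lines, and the second yields the two URLs, even though all four matches come from a single input line.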

ghuls, Feb 06 '25 13:02