coreutils icon indicating copy to clipboard operation
coreutils copied to clipboard

head: rework handling of non-seekable files

Open karlmcdowall opened this issue 10 months ago • 11 comments

Rework logic for handling all-but-last-lines and all-but-last-bytes for non-seekable files. Changes give large performance improvement.

karlmcdowall avatar Mar 12 '25 16:03 karlmcdowall

GNU testsuite comparison:

GNU test failed: tests/head/head-write-error. tests/head/head-write-error is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?

github-actions[bot] avatar Mar 12 '25 17:03 github-actions[bot]

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Mar 12 '25 20:03 github-actions[bot]

Changes give large performance improvement.

Can you attach a benchmark to show the improvement ?

RenjiSann avatar Mar 13 '25 16:03 RenjiSann

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Mar 14 '25 14:03 github-actions[bot]

Changes give large performance improvement.

Can you attach a benchmark to show the improvement ?

The performance improvements are seen when the input comes in via a pipe. Comparing current head with head_improved which contains my changes...

hyperfine "cat ./shakespeare.txt | ./target/release/head -c -100000" "cat ./shakespeare.txt | ./target/release/head_improved -c -100000"
Summary
  cat ./shakespeare.txt | ./target/release/head_improved -c -100000 ran
   17.93 ± 2.56 times faster than cat ./shakespeare.txt | ./target/release/head -c -100000


hyperfine "cat ./shakespeare.txt | ./target/release/head -n -100000" "cat ./shakespeare.txt | ./target/release/head_improved -n -100000"
Summary
  cat ./shakespeare.txt | ./target/release/head_improved -n -100000 ran
   24.38 ± 2.60 times faster than cat ./shakespeare.txt | ./target/release/head -n -100000

Also should note though that I want to add a bit more test code before anyone reviews these changes. Give me another day or so please :)

karlmcdowall avatar Mar 14 '25 16:03 karlmcdowall

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Mar 15 '25 22:03 github-actions[bot]

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Mar 17 '25 01:03 github-actions[bot]

In addition to the main changes here, I also made a few changes to add/remove buffering (BufReader/BufWriter) through the head.rs file - I've remove some buffering that was redundant and added it where I think it's helpful. Here are the performance numbers before/after my changes on my machine (running Linux) to confirm that nothing has regressed...

Command Time Before (ms) Time After (ms) Improvement Ratio
head -n 100000 ./shakespeare.txt 3.6 3.0 1.19
head -n -100000 ./shakespeare.txt 3.9 3.9 1
head -c 100000 ./shakespeare.txt 1.0 1.0 1
head -c -100000 ./shakespeare.txt 1.2 1.2 1
cat ./shakespeare.txt | head -n 100000 4.0 3.5 1.15
cat ./shakespeare.txt | head -n -100000 133.5 5.7 23
cat ./shakespeare.txt | head -c 100000 1.4 1.4 1
cat ./shakespeare.txt | head -c -100000 75.5 4.0 19

So, no regressions, some reasonable improvements (~15%) for the ... -n 100000 cases, and some really big improvements for the cat ... -100000 cases.

If there are any other usecases people think I should run then please let me know.

karlmcdowall avatar Mar 17 '25 01:03 karlmcdowall

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Mar 17 '25 02:03 github-actions[bot]

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

github-actions[bot] avatar Mar 17 '25 17:03 github-actions[bot]

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/misc/tee is no longer failing!

github-actions[bot] avatar Mar 17 '25 19:03 github-actions[bot]

Appart from the typo. this looks good ^^ It's nice to see such performance improvement 🚀

Thanks for the review!

karlmcdowall avatar Mar 18 '25 15:03 karlmcdowall

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Mar 18 '25 15:03 github-actions[bot]