ugrep icon indicating copy to clipboard operation
ugrep copied to clipboard

Hexdump stops prematurely at LF

Open misutoneko opened this issue 1 year ago • 4 comments

New user here, hi :D

I've only tested ugrep for a short while, but I think I've found a bug. To cut to the chase here's an example:

echo -e "Hello World\nThis part left unseen" |ugrep -U --hexdump=1A1 "\x57\x6f\x72"

Which will give you this:

00000000  48 65 6c 6c 6f 20 57 6f  |Hello Wo|
00000008  72 6c 64 0a -- -- -- --  |rldJ----|

So the problem is that the output doesn't continue all the way to the context boundary, it stops at the first sight of LF (0x0a). OK sure, greps are usually line-oriented, but imho the byte values in context shouldn't matter when dealing with the binary stuff? Tested with v3.7.2 and v6.5.0.

Btw hexdumping is THE main attraction of ugrep for me :D I have high hopes that ugrep will help me to replace my silly old 600+ line bash/awk script that I'm now using for dumping...

misutoneko avatar Sep 24 '24 21:09 misutoneko

Here's a workaround:

echo -e "Hello World\nThis part left unseen" \ |ugrep -y -U --color=always --hexdump=1A1 "\x57\x6f\x72" \ |grep -A1 --color=never 31m

The -y in the first ugrep is passthru. The second (u)grep finds the lines based on the color code and prints them with the desired context added. Note that the search pattern 31m is ok for demo purposes, but it should be refined for RL usage.

EDIT: Actually that -y might eat all performance so it's better to restrict the first ugrep:

echo -e "Hello World\nThis part left unseen" \ |ugrep -U --color=always -A1 --hexdump=1A1 "\x57\x6f\x72" \ |ugrep --color=never -A1 31m

misutoneko avatar Sep 26 '24 10:09 misutoneko

As you point out, the hexdump feature is line-oriented, like grep in general is a line-oriented tool. The reason is the efficient buffering with a window that shifts out at line boundaries so it is possible to quickly search several GB of input without impacting memory resources.

A way to avoid this is to match newline \n byte as part of the pattern, but there are also reasons not to do that.

Perhaps this should be moved to the ugrep discussion so we can close this eventually.

genivia-inc avatar Oct 01 '24 18:10 genivia-inc

Yeah I had a hunch it might be something like that. And I agree, if this is ever fixed it should be done in a way that doesn't compromise speed. The workaround, well, works for me well enough (it could break at the next version update though).

I guess only the repo owner can move to Discussions? Didn't spot any way to do that...

misutoneko avatar Oct 01 '24 22:10 misutoneko

I agree that there are opportunities to improve the hexdump feature with better context control. I haven't done that, because I didn't want to mess with the search engine.

What I would like to do is improve the hexdump context to always show the given number of hex lines before and after a match, regardless of the presence of newlines. That needs a change to the search engine so that the before hex lines (i.e. bytes) are not shifted out of the window when searching large files.

If will put that on the TODO list.

genivia-inc avatar Oct 02 '24 14:10 genivia-inc

The upcoming v7 release will produce hex before and after context lines regardless of LF boundaries. The before hex context is guaranteed to display, as long as the before hex size specified is not crazy large like hundreds of hex lines (in that case, the hex before context may get truncated by chance if the file is several MB).

genivia-inc avatar Oct 20 '24 19:10 genivia-inc

Thank you! Works like a charm from what I can see :D

misutoneko avatar Oct 26 '24 16:10 misutoneko

The v7.0.2 update improves hex context lines when these run into each other. I want to avoid overlap, which can be confusing. So it looks a lot cleaner in the update, at least it does to me :)

genivia-inc avatar Oct 27 '24 19:10 genivia-inc