[FR] Add caching/indexing for static content
When I try to search a big repo such as https://github.com/NightMachinary/r_rational/tree/master/indices, ugrep is extremely slow, while a cached service like Sourcegraph answers almost instantaneously. I wonder if it is possible to add indexing for static content to achieve better performance? Sourcegraph also uses regexes, so it seems some kind of caching should be possible?
A great suggestion! I will work on this when I have time.
As a side note, the first recursive search can be slower than a second recursive search because of FS caching. Just try it; you'll see a timing difference. Even when you run another search tool after ugrep, that search tool sees the cached data and will be faster, which means you're comparing apples to oranges in that case. Also, with ugrep you always want to use option -I when possible to skip binary files. Grepping through binary files is unnecessary and can be expensive for some types of binary files.
Some source search tools use indexing to get instantaneous results, and there is no mystery to how that works. Indexes can get stale, though, so reindexing is important, or a daemon must run to keep them fresh. This is not really something that the original grep was designed to do. I don't think caching is the issue (ugrep can't effectively control FS caching anyway); the difference is indexing.
@genivia-inc Yes, exactly, indexing. I don’t know of any CLI search tool that does indexing. There is recoll, which was buggy and unusable when I tried it this week on macOS (no idea how it works on Linux).
It would provide quite a bit of value to have this. I just wanted to float the idea. If you feel it’s too out of scope, feel free to close the issue.
Indexing would be great. There's really not a great local on-disk full-text search option out there.
Yes, I agree it would be great to have! In fact, the regex engine already produces hashes from the regex pattern. These hashes can be compared to the hashes of an indexed file; if none match, then the file can be skipped. So this part of the conceptual approach is already "in place", so to speak. (Footnote 1: option -P for Perl matching does not produce hashes, so supporting it would require a lot more work. Footnote 2: fuzzy search cannot benefit from indexing; I don't see how that could work reliably.)
When files have been updated since the last indexing phase, their cached hashes become stale and those files have to be searched. Perhaps checking file modification times is all that is needed to detect this condition.
Indexing should be a separate phase to be run manually, perhaps. Re-indexing while searching would slow down searching too much, I think.
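To make the ideas in the last few comments concrete, here is a minimal Python sketch of the skip test, assuming a hypothetical per-file index that stores a bitset of hashed content trigrams plus the file's mtime for staleness detection. It only illustrates the concept; it is not ugrep's actual index format, hash function, or pattern analysis:

import os, hashlib

BITS = 1 << 16  # bitset size: more bits lowers the false positive rate

def trigram_hashes(data: bytes):
    # Hash every 3-byte window; a file that matches a pattern must
    # contain every trigram of the pattern's literal substrings.
    for i in range(max(0, len(data) - 2)):
        digest = hashlib.blake2b(data[i:i + 3], digest_size=4).digest()
        yield int.from_bytes(digest, 'big') % BITS

def index_file(path: str) -> dict:
    with open(path, 'rb') as f:
        bits = set(trigram_hashes(f.read()))
    return {'mtime': os.path.getmtime(path), 'bits': bits}

def must_search(path: str, index: dict, literal: bytes) -> bool:
    # True: the file may match and must be searched; False: skip it safely.
    if os.path.getmtime(path) != index['mtime']:
        return True  # stale index, as noted above: search (and reindex)
    return all(h in index['bits'] for h in trigram_hashes(literal))

A false positive here is a file whose bitset happens to contain all of the pattern's trigram hashes even though the file contains no match; such a file is searched needlessly but never reported incorrectly.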
Re: https://github.com/Genivia/ugrep/issues/136#issuecomment-852547746
If the archive exceeds the size of RAM, then FS caching no longer helps.
Suppose you have a VLDB-scale volume of raw unstructured data that you cannot turn into a DB, and you need to search it continuously using the same pattern. An index would reduce the response time from days down to seconds.
I've implemented a new approach to index files to speed up searching significantly.
I'm quite excited to work on this and to reveal more details soon! 🚀
At this time I can only reveal that my approach uses an efficient hashing technique that I came up with. It turns out to be quite effective at speeding up search, judging from some initial performance tests.
Perhaps you might ask: why a new method? Well, I am disappointed with existing indexing methods, which lack accuracy when regex patterns are used.
The most important trick is to reduce the false positive rate: a regex pattern matching the index of a file when the file itself contains no match. This happens because indexing is imperfect. A false positive slows the search down a bit, since we cannot skip the file and have to search the whole thing. I want to get as close as possible to a 0% false positive rate, and I'm getting quite close. Only when a given regex pattern is very "permissive", so to speak, is the false positive rate higher, for example when .* or \w* are used in a pattern, which obviously match a lot. The more strings a permissive regex can match, the more likely it is that the false positive rate goes up, because indexing is never going to be 100% perfect. We want the index storage to be small(ish) while keeping the false positive rate low. These are tradeoffs.
A nice feature of the indexing method I came up with is that the indexer accepts an accuracy value to reduce the overall false positive rate of all regex searches performed. The higher the specified accuracy, the lower the false positive rate, but the higher the storage overhead of the index files. By making this an option of the indexer, the user can decide that a higher accuracy is needed, based on the search performance or on the regex patterns typically used.
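The underlying method is not revealed here, but for intuition about the accuracy/storage tradeoff, the classic Bloom filter formula p = (1 - e^(-kn/m))^k shows how the false positive rate drops as the storage budget per indexed entry grows. A small Python illustration, using generic textbook math rather than ugrep-indexer's actual scheme:

from math import exp, log

n = 10_000  # hypothetical number of distinct hashed substrings in one file
for bits_per_entry in (4, 8, 12, 16):    # storage budget, akin to an accuracy knob
    m = n * bits_per_entry               # total index bits for this file
    k = max(1, round((m / n) * log(2)))  # near-optimal number of hash functions
    p = (1 - exp(-k * n / m)) ** k       # expected false positive rate
    print(f'{bits_per_entry:2} bits/entry -> ~{p:.2%} false positives')

Doubling the per-entry storage budget lowers the false positive rate by far more than a factor of two, which is why a single accuracy knob is a practical way to expose this tradeoff.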
I might release my prototype as a separate project "ugrep-indexer" on GitHub to let people try it out as a beta feature and give feedback, before integrating everything in the ugrep project.
Also, eventually and hopefully soon, I will work on a Windows ugrep.exe port with indexing, implement indexing of the contents of compressed files and archives, and add support for -v to invert matching.
A quick demo to illustrate how this is going to work. To (re)index all files in and recursively below the current working directory, while skipping binary files with -I:
$ ugrep-indexer -I
36620438 bytes scanned and indexed with 25% noise on average
40 files indexed in 4 directories
0 files modified in indexes
40 files added to indexes
0 files removed from indexes
1 symlinks and devices skipped
345636 bytes index files size increase
The noise rate is limited by the accuracy argument, which ranges from 0 to 9 and is 5 by default. The lower the noise, the lower the probability of false positives. Noise is like entropy or chaos: the more random a file's content is, the higher its indexing noise. Very large files with almost "random" content will have high noise rates and thus give higher rates of false positives.
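Since noise is described as entropy-like, a rough illustration may help: byte-level Shannon entropy captures the same notion that more random content is harder to index. This is only an analogy, not ugrep-indexer's actual noise metric:

from collections import Counter
from math import log2

def bits_per_byte(data: bytes) -> float:
    # Shannon entropy of the byte distribution, from 0 to 8 bits per byte.
    counts = Counter(data)
    return -sum(c / len(data) * log2(c / len(data)) for c in counts.values())

print(bits_per_byte(b'aaaa bbbb aaaa bbbb'))  # low: repetitive text
print(bits_per_byte(bytes(range(256))))       # high: 8.0, uniform bytes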
The index files are locally stored in each (sub)directory. The size of the index per file is between 128 bytes (best case) and 65536 bytes (worst case e.g. for a very large file).
The ugrep-indexer help page:
$ ugrep-indexer --help
Usage: ugrep-indexer [-0|-1|-2|-3|...|-9] [-.] [-f] [-I] [-q] [-s] [-z]

    -0, -1, -2, -3, ..., -9, --accuracy=DIGIT
            Specifies indexing accuracy. A low accuracy reduces indexing file
            storage overhead at the cost of a higher rate of false positive
            regex pattern matches (more noise). A high accuracy reduces the
            rate of false positive regex pattern matches (less noise) at a
            cost of an increased indexing file storage overhead. An accuracy
            between 2 and 7 is recommended. The default accuracy is 5.

    -., --hidden
            Index hidden files and directories.

    -?, --help
            Display a help message and exit.

    -d, --delete
            Recursively remove index files.

    -f, --force
            Force reindexing of files, even those that are already indexed.

    -I, --ignore-binary
            Do not index binary files.

    -q, --quiet, --silent
            Quiet mode: do not display indexing statistics.

    -s, --no-messages
            Silent mode: nonexistent and unreadable files are ignored, i.e.
            their error messages and warnings are suppressed.

    -V, --version
            Display version and exit.

    -v, --verbose
            Produce verbose output.

    -z, --decompress
            Index the contents of compressed files and archives.
After indexing with ugrep-indexer, a search with ugrep is performed with option --index (for convenience this option can be specified in a .ugrep config file instead of on the command line):
$ ug --index 'xyz' -I -l --stats
dir1/bits.h
Searched 40 files in 4 directories in 0.00139 seconds with 9 threads: 1 matching (2.5%)
Skipped 39 of 40 files with indexes not matching the search patterns
The following pathname selections and search restrictions were applied:
--directories=recurse
--no-hidden (default)
Lines matched if:
"xyz" matches
Note that there were 0 false positives in this case, since 39 files were skipped with indexes not matching the search pattern "xyz".
Without option --index, the same search with ugrep takes 0.0125 seconds, which is still fast, but 10x slower than the indexed search in this simple demo with a mix of mostly small files and a few big files. The speedup depends on the files indexed: if a lot of indexed files do not match and the files are generally larger than 4KB, then the speedup will be much more than 10x.
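As mentioned above, --index can also be put in a .ugrep configuration file so that ug always searches with indexes. Assuming the usual ugrep config format of one long option name per line, without the leading dashes, that would look something like:

$ cat ~/.ugrep
index
ignore-binary

after which a plain ug 'xyz' -l picks up indexing (and -I, via ignore-binary) automatically.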
The indexer is incremental. So running ugrep-indexer again only indexes files that were added or modified. Files that were removed will be removed from the index.
The only restrictions will be that ugrep options -P (Perl matching) and -Z (fuzzy search) cannot be used with indexing. I think that will not be a practical limitation. Perhaps -P is actually still possible, as long as the patterns do not use advanced/special Perl regex syntax such as lookahead/lookbehind and backreferences.
Now in the beta testing phase.
I have already tested a lot of cases and am running more tests on thousands of files with thousands of regex patterns, using automated scripts to generate cases and check them against known results from stable tools.
My nightmare scenario is a bug somewhere that did not get caught early on.
This might be useful, the new ugrep --index option:
    --index
            Perform indexing-based search on files indexed with ugrep-indexer.
            Recursive searches are performed by skipping non-matching files.
            Binary files are skipped with option -I. Note that the start-up
            time to search is increased, which may be significant when complex
            search patterns are specified that contain large Unicode character
            classes with `*' or `+' repeats, which should be avoided. Option
            -U (--ascii) improves performance. Option --stats=vm displays a
            detailed indexing-based search report. This is a beta feature.
As long as a lot of files are recursively searched and/or the specified regex patterns are not overly "permissive" in matching a lot (i.e. are not too complex, with Unicode patterns and wildcards), indexing will speed up searching significantly.
If regex patterns are very "permissive" (i.e. complex patterns that match a lot), for example with Unicode classes and wildcards, then the ugrep --index startup time will increase, sometimes by a lot, even more than a second in extreme cases. Be aware of this: indexing-based search makes sense only when recursively searching a lot of files, so that the startup time does not dominate the search time.
For example, in a Homebrew formula directory with thousands of source files we might search for a pattern like this:
$ ugrep -c -m1, -I '[hH]ello[ \t]world' --stats=vm
[...]
Searched 3897 files in 580 directories in 0.142 seconds with 9 threads: 8 matching (0.2053%)
Searched 5954 lines: 13 matching (0.2183%)
which is already pretty fast with ugrep, actually. With indexing the search is 4x faster for this specific case:
$ ugrep --index -c -m1, -I '[hH]ello[ \t]world' --stats=vm
Searched 3897 files in 580 directories in 0.031 seconds with 9 threads: 8 matching (0.2053%)
Searched 5954 lines: 13 matching (0.2183%)
Skipped 3888 of 3897 files with indexes not matching any search patterns
[...]
VM: 12 nodes (0ms) 13 edges (0ms) 26 opcode words (0ms) 100 hash tables (0ms)
The "100 hash tables" is small in this case, with no overhead (0ms) to construct. But if we try the pattern '[hH]ello\s+world' with Unicode spacing \s+ this gives:
Searched 3897 files in 580 directories in 0.128 seconds with 9 threads: 8 matching (0.2053%)
Searched 5954 lines: 13 matching (0.2183%)
Skipped 3811 of 3897 files with indexes not matching any search patterns
[...]
VM: 20 nodes (0ms) 49 edges (0ms) 56 opcode words (0ms) 2933833 hash tables (90ms)
Wow, "2933833 hash tables" are needed to represent this regex pattern! The reason is that there are a lot of patterns that starts with a Unicode space \s followed by world (or just w or wo or wor and so on). It's an exponential increase in possibilities.
So to summarize the above: use indexing-based search when applicable, with regex patterns that don't match too much, i.e. limit the use of the repeats * and + and avoid Unicode pattern classes when possible.
Note that this is work in progress. As we learn more, we can further improve ugrep's performance and features. It's exciting to try something like this with a fresh approach.
NEW!
The new tested and verified ugrep-indexer is released: https://github.com/Genivia/ugrep-indexer
Ugrep 3.12.6 supports index-based searching.
Please help!
I want to get more experience and feedback from users of ugrep-indexer. So please share your experience to help improve ugrep and ugrep-indexer.
What's next?
Some features that I want to include are not ready yet, such as indexing compressed files and multi-threading to speed up indexing. Therefore, it is a 0.9 beta release. Also, there is no Windows exe release yet.
Once promoted to stable with these new features, ugrep-indexer will become part of the ugrep repo.