glean
lightweight search engine for local text docs
For mainly English text corpora, using a Porter stemmer variant at index- and search-time might be a good idea. (If stemming, the terminal $ in the keyword search should be...
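A minimal sketch of the idea, nowhere near a full Porter stemmer (which has measure conditions and rewrite steps): strip a few common English suffixes in place, and apply the same function to words at both index time and query time so inflected forms collide. The function name and suffix list are assumptions for illustration.

```c
#include <string.h>

/* Hypothetical light stemmer: strips a few common English suffixes in
 * place. A real Porter stemmer has many more rules; this only shows
 * why the same stem function must run at index- and search-time. */
static void stem_light(char *w)
{
    static const char *suffixes[] = { "ing", "edly", "ed", "es", "s", NULL };
    size_t len = strlen(w);

    for (int i = 0; suffixes[i] != NULL; i++) {
        size_t sl = strlen(suffixes[i]);
        /* require a stem of at least 3 chars to avoid mangling short words */
        if (len > sl + 2 && strcmp(w + len - sl, suffixes[i]) == 0) {
            w[len - sl] = '\0';
            return;
        }
    }
}
```

With this, "searching" and "searched" both reduce to "search" and land in the same hash chain.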
Add agrep / nrgrep support in format_cmd in gln.c. Will need to set up a gunzip -c pipe for compressed token indexes. This will allow searching for keywords with a...
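One possible shape for this, assuming nothing about the real format_cmd signature in gln.c: build the command string with a gunzip -c pipe in front when the token index is compressed, then hand it to popen. The path layout, parameter names, and the choice of agrep's -# error flag are assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of how format_cmd might build the search pipeline. For a
 * gzipped token index, the command becomes a gunzip -c | agrep pipe
 * instead of running agrep on the file directly. The resulting string
 * would be handed to popen(buf, "r"). */
static void format_cmd(char *buf, size_t buflen,
                       const char *index_path, const char *pattern,
                       int compressed, int max_errs)
{
    if (compressed)
        snprintf(buf, buflen, "gunzip -c '%s' | agrep -%d '%s'",
                 index_path, max_errs, pattern);
    else
        snprintf(buf, buflen, "agrep -%d '%s' '%s'",
                 max_errs, pattern, index_path);
}
```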
A second token table could be added for large files, with an additional value for which block(s) contain the token. Using grep to search whole files is slow for very...
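A rough sketch of what such an entry could look like, with invented names and sizes: a bitmap with one bit per fixed-size block, so the search can seek straight to candidate blocks instead of grepping the whole file.

```c
#include <stdint.h>

/* Hypothetical block table entry for large files: alongside the normal
 * token -> file mapping, record which fixed-size blocks contain the
 * token. Block size and the 64-block cap are arbitrary choices here. */
#define BLOCK_SIZE (64 * 1024)          /* bytes per block */

struct block_entry {
    uint32_t token_id;                  /* token this entry describes */
    uint32_t file_id;                   /* large file it occurs in */
    uint64_t blocks;                    /* bit i set => token in block i */
};

/* Called during indexing for each occurrence of the token. */
static void mark_block(struct block_entry *e, long byte_offset)
{
    long b = byte_offset / BLOCK_SIZE;
    if (b < 64)
        e->blocks |= (uint64_t)1 << b;
}

/* Called at search time to decide whether a block is worth grepping. */
static int block_has_token(const struct block_entry *e, int block)
{
    return (int)((e->blocks >> block) & 1);
}
```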
Anything that can reduce the token occurrence index size without greatly increasing complexity or lookup time is worth considering.
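One candidate that is cheap on both complexity and lookup time: store occurrence offsets as deltas, each delta in a 7-bit variable-length encoding (the "varint" scheme many indexers use), so small gaps cost one byte instead of four. Function names here are made up for the sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Write v as a little-endian base-128 varint; returns bytes written. */
static size_t varint_encode(uint32_t v, unsigned char *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (unsigned char)((v & 0x7f) | 0x80); /* + continuation bit */
        v >>= 7;
    }
    out[n++] = (unsigned char)v;
    return n;
}

/* Delta-encode a sorted list of occurrence offsets. */
static size_t encode_offsets(const uint32_t *offs, size_t count,
                             unsigned char *out)
{
    size_t n = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        n += varint_encode(offs[i] - prev, out + n);
        prev = offs[i];
    }
    return n;
}
```

Offsets {100, 130, 4000} become deltas {100, 30, 3870} and fit in 4 bytes instead of 12.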
Tested so far on OpenBSD/amd64, Linux (debian/i386), OS X. Testing on FreeBSD, NetBSD, Cygwin, etc. would be good. There are (hopefully) not a lot of portability issues - the main...
Set up configuration hooks\* for non-text files that nonetheless can be meaningfully indexed: Pass .mp3s through id3tag, PDFs through ps2ascii, .docs through antiword, etc., and index the output. (Optionally, cache...
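The hook table could be as simple as an extension-to-command mapping; at index time the command runs via popen and its stdout gets tokenized like any text file. The table contents and lookup function below are illustrative guesses, not a real config format.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical filter hooks: map a file extension to a command that
 * writes indexable text on stdout. In practice these would be read
 * from a config file rather than compiled in. */
struct filter { const char *ext; const char *cmd; };

static const struct filter filters[] = {
    { ".pdf", "ps2ascii" },
    { ".doc", "antiword" },
    { ".mp3", "id3tag" },
    { NULL,   NULL }
};

/* Return the filter command for a path, or NULL for plain text. */
static const char *filter_for(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return NULL;
    for (int i = 0; filters[i].ext != NULL; i++)
        if (strcmp(dot, filters[i].ext) == 0)
            return filters[i].cmd;
    return NULL;
}
```

Indexing would then do something like popen("ps2ascii report.pdf", "r") and feed the stream to the tokenizer (and optionally cache the output, per the note above).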
glean should also work for non-ASCII text. It just needs a different hashing algorithm for hash_word in whash.c, a different word separator, and testing by people fluent in a whitespace-separated,...
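One drop-in candidate for hash_word that handles arbitrary UTF-8: FNV-1a over the raw bytes, which needs no decoding and spreads multi-byte sequences well. This is a suggestion, not the current whash.c implementation, and the word-separator question still stands.

```c
#include <stdint.h>

/* FNV-1a over the raw bytes of a NUL-terminated word. Works unchanged
 * for UTF-8 input since it never interprets the bytes. */
static uint32_t hash_word_utf8(const unsigned char *w, uint32_t nbuckets)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */
    for (; *w != '\0'; w++) {
        h ^= *w;
        h *= 16777619u;                 /* FNV prime */
    }
    return h % nbuckets;
}
```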
Rather than rebuilding the DB from scratch, add another table to the hash table chains in the DB files. When searching in gln.db, search all tables. Add a "merge"/"pack"/whatever command...
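In-memory shape of the idea, with invented struct names: each incremental index run appends a table to the chain, lookups walk every table newest-first, and the merge/pack command would collapse the chain back to one table.

```c
#include <string.h>
#include <stddef.h>

struct entry { const char *word; int file_id; };

/* One table per index run; later runs point at older tables. The real
 * on-disk layout in gln.db would use offsets rather than pointers. */
struct table {
    struct table *next;                 /* next (older) table in the chain */
    const struct entry *entries;
    size_t n;
};

/* Search every table in the chain; first (newest) match wins. */
static int lookup(const struct table *t, const char *word)
{
    for (; t != NULL; t = t->next)
        for (size_t i = 0; i < t->n; i++)
            if (strcmp(t->entries[i].word, word) == 0)
                return t->entries[i].file_id;
    return -1;
}
```

The cost is that lookups degrade linearly with the number of unmerged runs, which is what makes the pack command worth having.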
Add NEAR (alongside AND, OR, NOT); should be based on grep -C $NUM_CONTEXT_LINES.
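A rough cut of NEAR as a grep pipeline, with a made-up helper name: pull each match of the first keyword with $NUM_CONTEXT_LINES of context, then require the second keyword inside that window. (This is approximate; the second grep also matches when both words share a line, which is fine for NEAR.)

```c
#include <stdio.h>
#include <string.h>

/* Build "grep -C N 'w1' path | grep 'w2'" into buf. Quoting here is
 * naive; real code would have to escape the patterns. */
static void near_cmd(char *buf, size_t buflen, const char *w1,
                     const char *w2, int context, const char *path)
{
    snprintf(buf, buflen, "grep -C %d '%s' '%s' | grep '%s'",
             context, w1, path, w2);
}
```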