Various search improvement suggestions

Open joepio opened this issue 4 years ago • 0 comments

I've just implemented Full-Text Search #40 and it works pretty well! Good enough for now. However, I noticed some things could be improved upon:

[ ] Besides indexing individual triples, consider indexing full resources. That way, a user could combine terms present in various fields. For example, say I'm looking for a red shirt. This shirt would have two relevant properties: its type (shirt) and its color (red). Since we currently index only triples, a search would find one triple for `red` and one for `shirt`, but it would not find a single result that contains both. Indexing the full resource would fix this. #336 might be a solution.
[ ] Boost titles
[ ] Consider indexing connected resources, too. Say that in the previous example, red was not a literal string but a resource somewhere else, possibly with a very obscure Subject URL. This would mean that we would not even hit the red shirt if we searched for red! We could fix this by indexing connected resources and including these in the initial item. Perhaps we'd add a new field, connected, and serialize all values of all directly connected nodes into it. I think doing this for a depth of 1 is doable, although it would make indexing about 10x slower and the index about 10x larger. But it would open up some cool possibilities, such as searching for a user name + class type (e.g. joep document) and seeing all documents of that user - without any form of explicit filters. That's pretty cool, right?
[ ] Fuzzy searching does not, at the moment, score items at all. In other words, we get more or less random hits for fuzzy matches, which is what we use for all short strings. That's bad. I think there are people working on this, though; see PR: https://github.com/quickwit-inc/tantivy/pull/998. But in another comment, the PR creator said we could consider this PR discarded.
[ ] There is no scoring system to make important resources rank higher (think PageRank from Google). There is no user feedback to let the system learn what is relevant to me. There are no synonyms.
[ ] Search inside collections or in some hierarchy #226
[ ] Tokenize the search sentence into separate parts (a new query for each token). Inspiration permalink (thanks @ChillFish8!)
[ ] First execute the query with an edit distance of 0, then retry with a distance of 1 or more, depending on how many hits we get.
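
To make a few of the ideas above concrete, here is a minimal sketch in plain Python of how full-resource indexing, per-token queries, and a progressively widened edit distance could fit together. It is not how the current tantivy-based implementation works, and every name in it (`Resource`, `search`, `min_hits`, `max_distance`, the example.com subjects) is hypothetical. It also scans all resources linearly, which a real index would not do.

```python
from dataclasses import dataclass


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete from a
                curr[j - 1] + 1,           # insert into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


@dataclass
class Resource:
    """Hypothetical full resource: a subject plus all of its property values."""
    subject: str
    fields: dict[str, str]

    def terms(self) -> set[str]:
        # Index the whole resource: terms from every field end up in one set,
        # so a query can combine terms that live in different properties.
        return {t for value in self.fields.values() for t in tokenize(value)}


def search(resources, query, min_hits=1, max_distance=2):
    """Return (matching subjects, edit distance that was used).

    Each query token is matched separately and all tokens must match.
    The search starts at distance 0 and only widens when there are
    fewer than `min_hits` results.
    """
    query_terms = tokenize(query)
    if not query_terms:
        return [], 0
    hits = []
    for distance in range(max_distance + 1):
        hits = [
            r.subject
            for r in resources
            if all(
                any(edit_distance(q, t) <= distance for t in r.terms())
                for q in query_terms
            )
        ]
        if len(hits) >= min_hits:
            return hits, distance
    return hits, max_distance


resources = [
    Resource("https://example.com/shirt1", {"type": "shirt", "color": "red"}),
    Resource("https://example.com/shirt2", {"type": "shirt", "color": "blue"}),
    Resource("https://example.com/hat1", {"type": "hat", "color": "red"}),
]

print(search(resources, "red shirt"))  # exact match, found at distance 0
print(search(resources, "rad shirt"))  # typo, found only after widening to 1
```

Because exact matches are tried first, fuzzy hits only show up when the stricter pass finds too little, which sidesteps part of the unscored-fuzzy-results problem without solving ranking itself.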

Nov 12 '21 11:11 joepio