
autocomplete: extend additional name fields used in multimatch queries

Open missinglink opened this issue 3 years ago • 6 comments

this PR is an experiment with splitting up the name.* fields in order to avoid the negative effects of field norms due to field length, reported in https://github.com/pelias/openstreetmap/issues/507 and better explained in https://github.com/pelias/pelias/issues/862

in particular we see this issue in OSM and WOF due to those sources having more alt names than others, although it applies to all sources.

as discussed on our call today, it might be that https://github.com/pelias/openstreetmap/pull/435 exacerbated the issue (albeit unknown at the time) so reversing that method and moving back to multiple fields using a multi_match query should result in a significant reduction in the effects of the field norms issue on scoring.
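To make the norms effect concrete, here is a minimal sketch of a BM25-style term score. It assumes the Elasticsearch default parameters (k1=1.2, b=0.75); the document counts and field lengths are made-up illustration values, not measurements from a Pelias index, and the function name is hypothetical.

```python
import math


def bm25_term_score(tf, field_len, avg_field_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one field (illustrative only)."""
    # Inverse document frequency: rarer terms score higher.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: grows with the number of tokens in the field.
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    return idf * tf * (k1 + 1) / (tf + norm)


# Same term, matched once, in two records (all numbers are assumptions):
# one with only a primary name (2 tokens), one with many alt names (40 tokens).
short_score = bm25_term_score(tf=1, field_len=2, avg_field_len=6,
                              doc_count=1000, doc_freq=10)
long_score = bm25_term_score(tf=1, field_len=40, avg_field_len=6,
                             doc_count=1000, doc_freq=10)

print(short_score > long_score)
```

The record with many alt names in the same field scores lower for an identical match, which is the penalty that moving alt names into separate fields is meant to avoid.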

although fairly arbitrary, I've identified 4 new fields to begin with:

  • alt - this field will contain all alternative names, so the norms penalty will no longer apply to the primary name. this includes variants, colloquialisms & other alternatives
  • abbr - abbreviations, i.e. succinct representations of the primary name
  • code - similar to above but distinct in the case of airports, stop IDs etc.
  • org - brands, operators etc.

we may very well change these: maybe abbr and code can be merged, or org omitted; that's up for discussion. the main difference is that we attempt to have only a single token indexed per field.
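A rough sketch of what a query over the split fields could look like, written as a plain Python dict in Elasticsearch query DSL. The field paths (`name.default_alt` and so on) and the helper name are assumptions for illustration; the real mapping in this PR may name them differently. The `type` is left as a parameter because `multi_match` supports several modes, including `best_fields` and `cross_fields`.

```python
def build_name_multi_match(text, lang="default", match_type="best_fields"):
    """Build a multi_match clause over the primary name and the new sub-fields.

    Field names are hypothetical; only the multi_match keys
    ("query", "type", "fields") are standard Elasticsearch DSL.
    """
    suffixes = ["alt", "abbr", "code", "org"]
    # The primary name keeps its own field, so alt names no longer inflate its length.
    fields = [f"name.{lang}"] + [f"name.{lang}_{s}" for s in suffixes]
    return {
        "multi_match": {
            "query": text,
            "type": match_type,
            "fields": fields,
        }
    }


query = build_name_multi_match("sfo")
print(query)
```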

missinglink avatar Mar 31 '22 18:03 missinglink

I'm not sure if we want to keep using best_fields, maybe cross_fields is better if it doesn't suffer the same norms issue.

missinglink avatar Mar 31 '22 18:03 missinglink

Screenshot 2022-04-01 at 11 40 06

missinglink avatar Apr 01 '22 09:04 missinglink

this looks very promising: Screenshot 2022-04-04 at 12 10 52

worth noting we will need to make similar changes to the /v1/search subqueries, otherwise some aliases which were previously searchable no longer are (Phoenix Sky Harbor Intern.... in this example, no. 2 on the left) Screenshot 2022-04-04 at 12 12 39

missinglink avatar Apr 04 '22 10:04 missinglink

Interestingly, the popularity boosting may now be too strong (rather than too weak as proposed in https://github.com/pelias/api/pull/1619), or maybe this was always the case 🤔

For example, this /v1/search query has an exact matching result but the scoring of all top n items seems to be heavily influenced by the popularity value: https://pelias.github.io/compare/#/v1/search?text=pyramids+of+giza&debug=1

Screenshot 2022-04-04 at 12 23 19

missinglink avatar Apr 04 '22 10:04 missinglink

As discussed offline, I've pushed a new commit which changes this behaviour to use wildcards instead of explicit field names; I feel like this is more flexible. The _ delimiter is unfortunately required, otherwise the German wildcard would also match the default field (i.e. de* matches both de and default). Using - instead could potentially conflict with hyphenated language codes.
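The delimiter problem can be reproduced with simple glob matching. This sketch uses Python's `fnmatch` as a stand-in for Elasticsearch field-name wildcards (for patterns that only use `*` the behaviour should be equivalent, but that is an assumption), and the field list is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical field names; "zh-Hant" stands in for a hyphenated language code.
FIELDS = [
    "name.default",
    "name.de",
    "name.de_alt",
    "name.de_abbr",
    "name.zh",
    "name.zh-Hant",
    "name.zh-alt",
]


def expand(pattern, fields=FIELDS):
    """Return the fields a wildcard pattern would select."""
    return [f for f in fields if fnmatchcase(f, pattern)]


no_delimiter = expand("name.de*")   # also picks up name.default
underscore = expand("name.de_*")    # only the German sub-fields
hyphen = expand("name.zh-*")        # also picks up the zh-Hant primary field

print(no_delimiter)
print(underscore)
print(hyphen)
```

With the `_` delimiter the primary field (`name.de`) is not matched by the wildcard, so it would need to be listed explicitly alongside `name.de_*`.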

Screenshot 2022-04-15 at 12 18 22

missinglink avatar Apr 15 '22 10:04 missinglink

as-is this PR is safe to merge since it's backward compatible.

missinglink avatar Apr 15 '22 10:04 missinglink