
feat(web): provide lexicon probabilities directly on the search path 📚

Open jahorton opened this issue 1 year ago • 1 comments

This PR was originally part of #10973.

To traverse a full lexicon efficiently for dictionary-based wordbreaking, it's best to provide the relevant probability data as directly as possible. Fortunately, this can easily be made O(1) on the lexical model's internal iterator, the LexiconTraversal type. Recomputing it via the model's .predict method would instead take O(log(N)) time.

Note that this provides two different probability value types:

  1. The probability of each reached entry.
  2. The probability of the highest-frequency entry either represented by the current node or by any of its descendants.
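As a rough illustration of the two values, here is a minimal trie sketch. `SketchNode`, `entryP`, `maxP`, `insert`, and `walk` are hypothetical names chosen for this example and are not the actual LexiconTraversal API; the point is only that both values can be stored on each node at build time, so reading them during traversal is O(1).

```typescript
// Hypothetical trie node for illustration; NOT Keyman's LexiconTraversal type.
interface SketchNode {
  children: { [ch: string]: SketchNode };
  entryP: number; // probability of the entry ending at this node (0 if none)
  maxP: number;   // highest entryP among this node and all of its descendants
}

function newNode(): SketchNode {
  return { children: {}, entryP: 0, maxP: 0 };
}

// Inserts a word and raises maxP on every node along its path,
// so no subtree scan is needed when the value is read later.
function insert(root: SketchNode, word: string, p: number): void {
  let node = root;
  node.maxP = Math.max(node.maxP, p);
  for (const ch of word) {
    let next = node.children[ch];
    if (!next) {
      next = newNode();
      node.children[ch] = next;
    }
    next.maxP = Math.max(next.maxP, p);
    node = next;
  }
  node.entryP = p;
}

// Follows a prefix one character at a time; returns undefined when the
// prefix leaves the lexicon.
function walk(root: SketchNode, prefix: string): SketchNode | undefined {
  let node: SketchNode | undefined = root;
  for (const ch of prefix) {
    node = node.children[ch];
    if (!node) return undefined;
  }
  return node;
}

const root = newNode();
insert(root, "the", 0.5);
insert(root, "then", 0.1);
insert(root, "thin", 0.05);
```

In this sketch, the node reached by `walk(root, "th")` has no entry of its own (`entryP` is 0), but its `maxP` is 0.5 because "the" lies beneath it.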

There are uses for this outside of dictionary-based wordbreaking, too. The latter 'probability' listed above can be useful for optimizing the correction-search - if a path only produces low-frequency words, we should consider other paths that could yield higher-frequency words first.
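One way this could look, as a sketch only: a best-first search that uses the subtree maximum as an upper bound when deciding which path to expand next. This is not the existing correction-search code; `SearchNode`, `QueueItem`, `node`, and `topWords` are hypothetical names, and the linear scan stands in for a proper priority queue.

```typescript
// Hypothetical node shape for illustration; not the actual traversal type.
interface SearchNode {
  children: { [ch: string]: SearchNode };
  entryP: number; // probability of the entry ending here (0 if none)
  maxP: number;   // best entry probability at or below this node
}

interface QueueItem {
  text: string;
  priority: number;
  node?: SearchNode; // absent when the item is a completed word
}

// Builds a node and derives maxP from its own entry and its children.
function node(entryP: number, children: { [ch: string]: SearchNode } = {}): SearchNode {
  let maxP = entryP;
  for (const ch of Object.keys(children)) {
    maxP = Math.max(maxP, children[ch].maxP);
  }
  return { children, entryP, maxP };
}

// Returns up to `limit` words in descending probability order.
// Subtrees whose maxP is too low are never expanded.
function topWords(root: SearchNode, limit: number): string[] {
  const queue: QueueItem[] = [{ text: "", priority: root.maxP, node: root }];
  const results: string[] = [];
  while (queue.length > 0 && results.length < limit) {
    // Linear scan for the best item; a heap would be preferable at scale.
    let best = 0;
    for (let i = 1; i < queue.length; i++) {
      if (queue[i].priority > queue[best].priority) best = i;
    }
    const item = queue.splice(best, 1)[0];
    if (!item.node) {
      results.push(item.text); // a finished word: nothing queued can beat it
      continue;
    }
    const current = item.node;
    if (current.entryP > 0) {
      queue.push({ text: item.text, priority: current.entryP });
    }
    for (const ch of Object.keys(current.children)) {
      const child = current.children[ch];
      queue.push({ text: item.text + ch, priority: child.maxP, node: child });
    }
  }
  return results;
}

// Toy lexicon: "a" 0.2, "an" 0.3, "b" 0.4, "be" 0.1
const lexicon = node(0, {
  a: node(0.2, { n: node(0.3) }),
  b: node(0.4, { e: node(0.1) }),
});
const top3 = topWords(lexicon, 3);
```

With this toy lexicon, asking for three words never expands the "be" subtree, since its upper bound of 0.1 is below everything returned.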

There's also notable potential to merge or blend two different models via their LexiconTraversal iterators in this manner. Given our upcoming push toward #11872, this would be a fantastic way to achieve that goal: create a stand-in model for the OS's dictionary and blend it with the loaded lexical model via traversals.
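A minimal sketch of what such blending might look like, under several assumptions: the `Traversal` interface, `fromWords`, and `blend` are hypothetical and not part of the current codebase, and the linear-interpolation weighting is just one possible blending rule. Note that the blended `maxP` here is only an upper bound on the best blended entry below a node, which is still sufficient for ordering a search.

```typescript
// Hypothetical minimal traversal interface; not the real LexiconTraversal.
interface Traversal {
  entryP: number; // probability of the entry at this position (0 if none)
  maxP: number;   // upper bound on entry probabilities at or below this position
  childKeys(): string[];
  child(ch: string): Traversal | undefined;
}

// Builds a (deliberately naive) traversal over a word-to-probability table.
function fromWords(words: { [word: string]: number }, prefix: string = ""): Traversal {
  let maxP = 0;
  const keys: string[] = [];
  for (const w of Object.keys(words)) {
    if (w.indexOf(prefix) !== 0) continue;
    maxP = Math.max(maxP, words[w]);
    if (w.length > prefix.length) {
      const ch = w.charAt(prefix.length);
      if (keys.indexOf(ch) < 0) keys.push(ch);
    }
  }
  const hasEntry = Object.prototype.hasOwnProperty.call(words, prefix);
  return {
    entryP: hasEntry ? words[prefix] : 0,
    maxP,
    childKeys: () => keys,
    child: (ch: string) =>
      keys.indexOf(ch) >= 0 ? fromWords(words, prefix + ch) : undefined,
  };
}

// Blends two traversals by linear interpolation (an assumed rule).
// Either side may be missing once a path exists in only one model.
function blend(
  a: Traversal | undefined,
  b: Traversal | undefined,
  weightA: number
): Traversal | undefined {
  if (!a && !b) return undefined;
  const weightB = 1 - weightA;
  const keys = a ? a.childKeys().slice() : [];
  if (b) {
    for (const ch of b.childKeys()) {
      if (keys.indexOf(ch) < 0) keys.push(ch);
    }
  }
  return {
    entryP: weightA * (a ? a.entryP : 0) + weightB * (b ? b.entryP : 0),
    maxP: weightA * (a ? a.maxP : 0) + weightB * (b ? b.maxP : 0),
    childKeys: () => keys,
    child: (ch: string) =>
      blend(a ? a.child(ch) : undefined, b ? b.child(ch) : undefined, weightA),
  };
}

// Toy stand-ins for the loaded lexical model and an OS dictionary.
const modelWords = { cat: 0.6, car: 0.4 };
const osWords = { cat: 0.2, cab: 0.8 };
const blended = blend(fromWords(modelWords), fromWords(osWords), 0.5);
```

Because the blend is itself a traversal, it can be walked lazily: only the nodes actually visited are ever combined, and neither source model needs to be rebuilt.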

@keymanapp-test-bot skip

jahorton avatar Jun 25 '24 05:06 jahorton

Changes in this pull request will be available for download in Keyman version 18.0.70-alpha

keyman-server avatar Jul 08 '24 18:07 keyman-server