Enhance Phrasing performance
In function FindInterval. The following loop structure is used to find all intervals.
for ( begin = 0; begin < pgdata->nPhoneSeq; begin++ ) {
for ( end = begin; end < pgdata->nPhoneSeq; end++ ) {
However, for system phrase, we do not need two nest loops to find the intervals. For example, if both phrases 一二 and 一二三四 are system phrases. When libchewing do tree traversal to find 一二三四, phrase 一二 is in the middle of search path. So the second loop can be skip.
Userphrase, however, is another story.
valgrind suggests that we need to improve SaveList.
SaveList is used to calculate all possible combination of phrasing via recursive. The problem of SaveList is that it calculate the same result multiple times. The following is an example of this problem:
preedit buffer = 一二三四
intervals = 一, 一二, 二, ...
In SaveList > RecursiveSave, 一 will be chosen as first and do the RecursiveSave for 二三四. 一二 will be chosen later and do the RecursiveSave for 三四. However, during the recursive for 一, 二 will be chosen and the RecursiveSave will be called for 三四, which is exactly the same. If we can cache the result for 三四, we can save a recursive call. This will be a huge improvement for phrasing speed.
For the phrasing, it looks like a dynamic programming problem.
Assume P(x,y) is the maximum phrasing score between x and y. The recursive definition of P(x,y) is listed as following:
P(x,y) = MAX( P(x,y-1)+P(y-1,y), P(x,y-2)+P(y-2,y), ... )
This can be finished in linear time.
Yes, it is dynamic programming for the typical phrasing problem.
However, it may become not DP after applying some heuristic. (This was discussed on mailing list in the past, IIRC. But not yet implemented.)
And the unique feature of chewing, "NextCut()", requires all phrasing results.
@kcwu Is there any link for heuristic rule? And what is the feature of NextCut()?
I need some time to find the discussion.
Regarding to NextCut(), I mean this function https://github.com/chewing/libchewing/blob/master/src/tree.c#L866 which is triggered here https://github.com/chewing/libchewing/blob/master/src/chewingio.c#L875 When users press [Tab] at the end of input, the phrasing result become the second, third, ... best.
The original goal is to improve the performance because current windows-chewing-tsf has non neglected delay when preedit buffer length is greater than 20 in sqlite_userphrase build. The feature/sqlite_userphrase is used in windows-chewing-tsf to solve the concurrent access problem. I hope feature/sqlite_userphrase can be merged to master so we can build cross platform userphrase editor.
The following is current run time for test-bopomofo and test-config in my Windows machine. You can see test-bopomofo as short preedit buff test and test-config as long preedit buffer test because it has two cases to test preedit buffer limitation. FindInterval loop optimized is to modify loop in FindInterval so that it does not check interval longer than MAX_PHRASE_LEN and do early break if there is a break point between current interval.
master
test-bopomofo 0.07 sec
test-config 0.71 sec
master + FindInterval loop optimized
test-bopomofo 0.04 sec
test-config 0.48 sec
sqlite_userphrase
test-bopomofo 0.28 sec
test-config 13.80 sec
sqlite_userphrase + FindInterval loop optimized
test-bopomofo 0.23 sec
test-config 8.93 sec
Current I think we can just deploy FindInterval loop optimized because it has measurable improvement in Windows platform. As for DP, Google with keyword dynamic programming k best suggests that we might still using DP for k best phrasing. But I need some time to understand what the papers said.
The reason why FindInterval() is so slow because it repeatedly query the same set of phrases again and again.
My suggestion: Before FindInterval, run O(n^2) loop to query db for every possible phrases first and cache the results. In FindInterval, don't query db anymore. Just use the cached result.
I just reduce the second loop to MAX_PHRASE_LEN so that O(n^2) become O(n). It is the easiest fix.
for ( begin = 0; begin < pgdata->nPhoneSeq; begin++ ) {
for ( end = begin; end < min( pgdata->nPhoneSeq, begin + MAX_PHRASE_LEN ); end++ ) {
For cache part, we can actual remove the second loop entirely if our system/user phrase internal API can handle prefix search.
Because system phrase is tree, we just need to record the search path to get all possible phrase in particular position in different length. However, because #70 is still in review, I will do it after #70 is done.
For sqlite_userphrase, I think adding new field (ex: lenght) and turning SQL query is enough for prefix search.
As suggestion by Yuan Chao, we can do brute force phrasing only when user presses tab at the end of preedit buffer. In other scenarios, we just use DP based algorithm. In this way, user don't need to pay for brute force algorithm if they don't use tab at the end feature.
Also text in input buffer prior punctuations could be ignore when brute-force emulating different combinations with 'tab' key?
See also https://github.com/chewing/libchewing/commit/06e0073fee373f0d79ab1489f56aab62ee8632ed
The performance is enhanced by using the assumption that better phrasing results are part of the top K shortest paths so we can reduce the search space to only consider K results.
I plan to add some benchmark to futher analysis current implementation.