SOLR-15437: ReRanking/LTR does not work in combination with custom sort and SolrCloud

Open tkaessmann opened this issue 4 years ago • 6 comments

https://issues.apache.org/jira/browse/SOLR-15437

Description

We found out that a plain SolrCloud setup returns documents in random order if you combine re-ranking with a custom sort field or function. This problem has also been mentioned on the mailing lists (see the Jira issue) and has not been properly fixed yet.
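For illustration, a SolrJ sketch of the kind of request that triggers the bug; the collection, field names, and re-rank query below are hypothetical:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class ReRankWithCustomSort {
    public static void main(String[] args) {
        // Custom sort plus re-ranking: on a multi-shard SolrCloud collection,
        // the merged result order is effectively random (the bug described above).
        SolrQuery q = new SolrQuery("shirt");             // hypothetical query
        q.setSort("price", SolrQuery.ORDER.asc);          // custom sort field (hypothetical)
        q.set("rq", "{!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=2}");
        q.set("rqq", "{!func}popularity");                // hypothetical re-rank query
        // e.g. cloudSolrClient.query("products", q);
    }
}
```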

Solution

The basic idea is to fix this in the mergeIds method of the QueryComponent.

Previously, a single PriorityQueue was used to collect the results from all shards. We added a second PriorityQueue next to the existing one; the new queue collects the configured number of re-ranked documents.

After collecting all documents, the two queues have to be combined to fill the resultIds.

Be aware that the reRankDocs threshold is applied per shard but also has to hold across the whole merged result.

To handle this, documents are removed from the reRankQueue and inserted into the normal queue if their score after re-ranking does not place them within the top reRankDocs results.

The documents in the reRankQueue are sorted by their score after re-ranking; the original sort is applied to all documents in the normal queue.

To sort the non-re-ranked documents correctly, we had to remove the ShardFieldSortedHitQueue shortcut of comparing documents from the same shard by their orderInShard instead of by their sortValues. We extracted that functionality into the class ShardFieldSortedHitQueueWithSameShardCompareSkip, which is used for the reRankQueue or in cases without re-ranking.
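A minimal sketch of the merge described above, using plain java.util.PriorityQueue with simplified, illustrative types (the real code lives in QueryComponent.mergeIds and uses Solr's own ShardDoc and queue classes):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class ShardDoc {
    String id;          // unique document id
    float reRankScore;  // score after re-ranking
    long sortValue;     // value of the custom sort field
}

class TwoQueueMerge {
    /** Re-ranked docs come first, ordered by re-ranked score; all remaining
     *  docs follow in the user's custom sort order. */
    static List<ShardDoc> merge(List<ShardDoc> allShardDocs, int reRankDocs) {
        // Min-heap holding the global top-reRankDocs docs by re-ranked score.
        PriorityQueue<ShardDoc> reRankQueue =
            new PriorityQueue<>(Comparator.comparingDouble(d -> d.reRankScore));
        // All remaining docs, ordered by the custom sort.
        PriorityQueue<ShardDoc> sortQueue =
            new PriorityQueue<>(Comparator.comparingLong(d -> d.sortValue));

        for (ShardDoc doc : allShardDocs) {
            reRankQueue.offer(doc);
            // reRankDocs is applied per shard, but must also hold globally:
            // demote docs that fall out of the global top-N into the sort queue.
            if (reRankQueue.size() > reRankDocs) {
                sortQueue.offer(reRankQueue.poll());
            }
        }

        // Best re-ranked score first, then the custom-sorted remainder.
        List<ShardDoc> resultIds = new ArrayList<>(reRankQueue);
        resultIds.sort((a, b) -> Float.compare(b.reRankScore, a.reRankScore));
        while (!sortQueue.isEmpty()) {
            resultIds.add(sortQueue.poll());
        }
        return resultIds;
    }
}
```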

Tests

We've added two tests with custom sorting and re-ranking activated. One test re-ranks the whole result set and the other re-ranks only a subset, to make sure that the custom sort is applied to docs that were not re-ranked.

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [x] I have run ./gradlew check.
  • [x] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide.

tkaessmann avatar May 26 '21 13:05 tkaessmann

Hi @cpoerschke, the originalScore really is a hard nut to crack.

We have a very rough first draft of a possible solution and created a draft PR (https://github.com/apache/solr/pull/171) which is based on the code from this PR. There is still some stuff that does not work, but maybe you can take some ideas from the current WIP.

Apart from that, we would continue with the scope of only fixing the sort for now, starting with applying your suggestion.

tomglk avatar Jun 10 '21 11:06 tomglk

I ran into this problem today -- I'm glad others are working on it. Unless I've misattributed this issue, I think this problem is not related to a "custom sort" -- one only needs to have a re-ranking boost function that could reduce the score from what it was. QueryComponent.mergeIds doesn't know how to deal with scores that don't descend when you're sorting by score (as is the default).

The solution above is one approach and I think it's basically fine. It assumes that the rr.rows docs across all shards are more relevant than all docs that follow, which I don't think is necessarily true, although it usually is when your data is balanced across the shards, as is typical. Another solution is to have the ReRankQuery machinery apply a constant factor to the score of all docs that follow rr.rows, so that the scores continue to descend. The factor could be computed from how close the scores of the documents on the fence were prior to re-ranking, and then applied proportionally to the final score of the last re-ranked document.
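A rough sketch of that score-adjustment idea (a hypothetical helper, not actual Solr code), assuming a shard's scores arrive as an array in result order with positive values:

```java
class TailScoreAdjustment {
    /** Scale the scores of docs beyond reRankDocs so the per-shard score
     *  sequence keeps descending even when re-ranking lowered the top scores.
     *  Assumes scores[] is in result order and all scores are positive. */
    static void adjustTailScores(float[] scores, int reRankDocs) {
        if (reRankDocs <= 0 || reRankDocs >= scores.length) {
            return; // nothing was re-ranked, or nothing follows the re-ranked docs
        }
        float lastReRanked = scores[reRankDocs - 1]; // final score of the last re-ranked doc
        float firstTail = scores[reRankDocs];        // original score of the first tail doc
        if (firstTail < lastReRanked) {
            return; // sequence already descends; no adjustment needed
        }
        // Map the first tail score to just below the last re-ranked score and apply
        // the same factor to the whole tail, preserving its internal order.
        float factor = (lastReRanked * 0.999f) / firstTail;
        for (int i = reRankDocs; i < scores.length; i++) {
            scores[i] *= factor;
        }
    }
}
```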

dsmiley avatar Mar 01 '22 17:03 dsmiley

We're also affected by this bug and first of all we're wondering: Is anyone still working on these PRs?

Furthermore, while the solution proposed in PRs 151/171 also covers many corner cases, e.g. uneven distributions of re-ranked docs across the shards, we were thinking about deploying a temporary, simpler fix for our local installation, perhaps a variant of what @dsmiley mentioned in his last comment: adjusting the scores of all re-ranked documents locally on the shards to make them larger than the largest original score of the documents before re-ranking. There is still the possibility that the original scores differ per shard, but since we also accept that for normal ranking (except for distributed IDF via a global statsCache), we figured we could ignore it. We would also accept that the scores of all re-ranked docs are always better than the highest non-re-ranked score, and we'd set reRankDocs to our intended total number divided by the actual number of shards we're using. However (not being a Solr expert), is there something obvious I'm missing here?
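A minimal sketch of that per-shard adjustment (hypothetical helper, under the assumptions stated above):

```java
class PerShardScoreLift {
    /** Shift every re-ranked score on a shard above the shard's best original
     *  score, so that after merging, all re-ranked docs sort ahead of all
     *  non-re-ranked docs. A constant shift preserves the relative order
     *  among the re-ranked docs themselves. */
    static void lift(float[] reRankedScores, float maxOriginalScore) {
        float minReRanked = Float.POSITIVE_INFINITY;
        for (float s : reRankedScores) {
            minReRanked = Math.min(minReRanked, s);
        }
        // Even the lowest re-ranked score ends up above the best original score.
        float offset = (maxOriginalScore - minReRanked) + 1.0f;
        for (int i = 0; i < reRankedScores.length; i++) {
            reRankedScores[i] += offset;
        }
    }
}
```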

Stefan4solr avatar Oct 11 '22 11:10 Stefan4solr

I had to stop due to a lack of time.

However, I can try to make time to collaborate with you to finally fix this, especially if you have questions regarding this solution, and maybe also to develop a new solution (I cannot promise anything here). I definitely need a bit of time to get deep into the topic again before I can have fruitful discussions.

tomglk avatar Oct 19 '22 06:10 tomglk

Thanks a lot for the offer, @tomglk! I guess it will take us some time to get a full understanding of the issue(s) and the proposed solution. Perhaps we'll start by looking into the "simpler" fix proposed above and see if we can get that to run on our side - that will probably already give us a better understanding of these issues. We'll definitely get back to you and are looking forward to discussions on this!

Stefan4solr avatar Oct 26 '22 14:10 Stefan4solr

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

github-actions[bot] avatar Feb 26 '24 00:02 github-actions[bot]