Optimize MatchAny for a large number of values
Is your feature request related to a problem? Please describe.
There are use cases where we might want to specify a lot of values in the MatchAny condition. One of them is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).
Describe the solution you'd like
Make sure our implementation of MatchAny is efficient enough to handle those cases.
To do so, we would need:
- A benchmark dataset that approximates the scenario described above
- Measurements before and after the change
- Handling of the case on the segment level (most likely it could be something like converting the MatchAny list into a hashmap if it is long enough)
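The last point can be illustrated with a minimal sketch. It is written in Python for brevity rather than in Qdrant's own Rust, and it is not the actual implementation: `AnyMatcher` and `HASH_THRESHOLD` are hypothetical names, and the threshold value is an arbitrary assumption that the benchmarks would have to tune.

```python
from typing import Hashable, Iterable

# Hypothetical cutoff: below this, a linear scan over a short list is
# assumed to be cheap enough. The right value has to come from benchmarks.
HASH_THRESHOLD = 16


class AnyMatcher:
    """Checks whether a payload value is one of the requested values."""

    def __init__(self, values: Iterable[Hashable]) -> None:
        items = list(values)
        # Long lists are converted once, when the filter is built, so each
        # per-point check is a hash lookup instead of a scan over the list.
        self.uses_hash = len(items) > HASH_THRESHOLD
        self._values = frozenset(items) if self.uses_hash else tuple(items)

    def matches(self, payload_value: Hashable) -> bool:
        return payload_value in self._values


# A user with several thousand subscriptions (made-up group names).
subscriptions = [f"group_{i}" for i in range(5_000)]
matcher = AnyMatcher(subscriptions)
```

The conversion cost is paid once per filter, while the membership check runs for every candidate point, which is why the switch is only expected to pay off for long lists.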
Describe alternatives you've considered
Another potential approach is to assign user "ids" to the groups, but this might require huge lists of users associated with each group record.
Additional context
No interface / storage changes are expected in this PR.
/bounty $200
💎 $200 bounty created by Qdrant
🙋 If you start working on this, comment /attempt #3522 along with your implementation plan
👉 To claim this bounty, submit a pull request that includes the text /claim #3522 somewhere in its body
| Attempt | Started (GMT+0) | Solution |
|---|---|---|
| 🔴 @haruncurak | Feb 4, 2024, 8:55:44 PM | WIP |
| 🔴 @Shylock-Hg | Feb 5, 2024, 3:21:23 AM | WIP |
| 🟢 @ima-attac-helikoptaaa | Feb 5, 2024, 8:12:50 AM | WIP |
| 🟢 @JojiiOfficial | Feb 5, 2024, 11:49:04 AM | #3525 |
Hi all! I'd love to try and tackle this one as my first Qdrant issue. It would be fantastic if I could get assigned - excited to get going.
/attempt #3522
| Algora profile | Completed bounties | Tech | Active attempts | Options |
|---|---|---|---|---|
| @haruncurak | 4 bounties from 3 projects | TypeScript, Elixir | | |
/attempt #3522 in queue.
| Algora profile | Completed bounties | Tech | Active attempts | Options |
|---|---|---|---|---|
| @Shylock-Hg | 2 bounties from 1 project | C++, C, Shell & more | | |
> One of those use cases is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).
This is a nitpick, but Reddit does not support this either. They fake showing content for all your subreddits. Instead, they take 50 random subreddits you subscribed to and only show content for them.
Maybe it would be possible for the user actually wanting this to use a similar approach.
That doesn't mean we cannot do it though. So we should definitely try to go through with this :+1:
> This is a nitpick, but Reddit does not support this either. On your front page, it takes 50 random subreddits you subscribed to and only shows content for them.
Maybe it was Reddit that requested this feature to be improved :man_shrugging:
attempt #3522
Just a quick check @generall - would the points in the approximate dataset be users with subscriptions, potentially including posts within those subscriptions?
Something like this (with the user id as the point `id`)?
{
  "id": "5c56c793-69f3-4fbf-87e6-c4bf54c28c26",
  "vector": [0.9, 0.1, 0.1],
  "payload": {
    "subscriptions": [
      {
        "subscription": "subreddit_name",
        "posts": [
          {
            "title": "",
            "content": ""
          },
          {
            "...": "..."
          }
        ]
      },
      {
        "...": "..."
      }
    ]
  }
}
Or would the subscriptions (i.e. subreddits) themselves be points?
Each post in a subreddit should be a separate point.
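As a rough sketch of what such a benchmark dataset could look like, the snippet below generates one point per post and a filter listing a user's subscriptions. The payload field name `subreddit`, the `subreddit_<n>` naming, the vector size, and all counts are assumptions made for illustration; the filter is written as a plain dictionary in the shape of Qdrant's `match` / `any` condition, and `make_points` and `make_filter` are hypothetical helpers.

```python
import random


def make_points(num_posts: int, num_subreddits: int, dim: int = 4, seed: int = 0) -> list:
    """One point per post; the payload only records the post's subreddit."""
    rng = random.Random(seed)
    return [
        {
            "id": post_id,
            "vector": [rng.random() for _ in range(dim)],
            "payload": {"subreddit": f"subreddit_{rng.randrange(num_subreddits)}"},
        }
        for post_id in range(num_posts)
    ]


def make_filter(num_subscriptions: int, num_subreddits: int, seed: int = 1) -> dict:
    """Filter that keeps only posts from a user's (randomly chosen) subscriptions."""
    rng = random.Random(seed)
    chosen = rng.sample(range(num_subreddits), num_subscriptions)
    return {
        "must": [
            {"key": "subreddit", "match": {"any": [f"subreddit_{i}" for i in chosen]}}
        ]
    }


points = make_points(num_posts=1_000, num_subreddits=10_000)
user_filter = make_filter(num_subscriptions=3_000, num_subreddits=10_000)
```

With this layout the subscription list lives only in the query, so the size of the `any` list is the variable to sweep when measuring before and after the change.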
💡 @JojiiOfficial submitted a pull request that claims the bounty. You can visit your bounty board to reward.
@JojiiOfficial: Your claim has been rewarded! 👉 Complete your Algora onboarding to collect the bounty.