qdrant Optimize MatchAny for large amount of values

Is your feature request related to a problem? Please describe.

There are use cases where we might want to specify a lot of values in the MatchAny condition. One of them is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).

Describe the solution you'd like

Make sure our implementation of MatchAny is efficient enough to handle those cases.

To do so, we would need:

A benchmark dataset that approximates the mentioned scenario
Measurements before and after the change
Handling of the case at the segment level (most likely something like converting the MatchAny list into a hash map if it is long enough)
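
To illustrate the last item, here is a minimal sketch of the "convert to a hash-based lookup if the list is long enough" idea. It is written in Python for brevity and is not Qdrant's actual implementation; `build_any_matcher` and `HASHSET_THRESHOLD` are hypothetical names, and the cutoff value is an assumption that the benchmarks would have to settle.

```python
from typing import Callable, Sequence

# Hypothetical cutoff: the real value would have to come from benchmarks.
HASHSET_THRESHOLD = 16


def build_any_matcher(
    values: Sequence[str], threshold: int = HASHSET_THRESHOLD
) -> Callable[[str], bool]:
    """Return a predicate that checks whether a payload value is in `values`."""
    if len(values) > threshold:
        # Long list: pay the hashing cost once, then O(1) average lookups.
        lookup = frozenset(values)
        return lambda candidate: candidate in lookup
    # Short list: a linear scan avoids the cost of building a hash set.
    short = tuple(values)
    return lambda candidate: candidate in short


# Example: a user subscribed to several thousand groups.
subscriptions = [f"group_{i}" for i in range(5000)]
is_subscribed = build_any_matcher(subscriptions)
```

The predicate is built once per query and then applied to every candidate point, so the one-time conversion cost is amortized over many membership checks.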

Describe alternatives you've considered

Another potential approach is to assign user "ids" to the groups, but this might require huge lists of users associated with each group record.

Additional context

No interface / storage changes are expected in this PR.

Feb 04 '24 20:02 generall

/bounty $200

Feb 04 '24 20:02 generall

💎 $200 bounty created by Qdrant
🙋 If you start working on this, comment /attempt #3522 along with your implementation plan
👉 To claim this bounty, submit a pull request that includes the text /claim #3522 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to qdrant/qdrant!

Attempt	Started (GMT+0)	Solution
🔴 @haruncurak	Feb 4, 2024, 8:55:44 PM	WIP
🔴 @Shylock-Hg	Feb 5, 2024, 3:21:23 AM	WIP
🟢 @ima-attac-helikoptaaa	Feb 5, 2024, 8:12:50 AM	WIP
🟢 @JojiiOfficial	Feb 5, 2024, 11:49:04 AM	#3525

Feb 04 '24 20:02 algora-pbc[bot]

Hi all! I'd love to try and tackle this one as my first Qdrant issue. It would be fantastic if I could get assigned; I'm excited to get going.

/attempt #3522

Algora profile	Completed bounties	Tech	Active attempts
@haruncurak	4 bounties from 3 projects	TypeScript, Elixir	

Feb 04 '24 20:02 haruncurak

/attempt #3522 in queue.

Algora profile	Completed bounties	Tech	Active attempts
@Shylock-Hg	2 bounties from 1 project	C++, C, Shell & more	

Feb 05 '24 03:02 Shylock-Hg

/attempt #3522

Feb 05 '24 08:02 ima-helikoptaaa

> One of them is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).

This is a nitpick, but Reddit does not support this either. They only appear to show content from all your subreddits. In reality, they take 50 random subreddits you are subscribed to and only show content for them.

Maybe users who actually want this could use a similar approach.

That doesn't mean we cannot do it though. So we should definitely try to go through with this :+1:
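
For reference, the sampling workaround described above could look roughly like this on the client side. This is only a sketch: `sample_subscriptions` is a hypothetical helper, and the limit of 50 is taken from the Reddit behaviour mentioned in this comment, not from any Qdrant setting.

```python
import random
from typing import List, Optional, Sequence


def sample_subscriptions(
    subscriptions: Sequence[str], limit: int = 50, seed: Optional[int] = None
) -> List[str]:
    """Pick at most `limit` random subscriptions to use as MatchAny values."""
    if len(subscriptions) <= limit:
        # Nothing to trim: keep every subscription.
        return list(subscriptions)
    rng = random.Random(seed)
    # Sampling without replacement, so no subscription appears twice.
    return rng.sample(list(subscriptions), limit)
```

The trade-off is that results only cover the sampled groups, so this reduces the filter size at the cost of completeness.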

Feb 05 '24 09:02 timvisee

> This is a nitpick, but Reddit does not support this either. On your front page, it takes 50 random subreddits you are subscribed to and only shows content for them.

Maybe it was Reddit that requested this improvement :man_shrugging:

Feb 05 '24 09:02 generall

attempt #3522

Feb 05 '24 09:02 itssubhodiproy

Just a quick check @generall - would the points in the approximate dataset be users with subscriptions, potentially including posts within those subscriptions?

Something like this (where "id" is the user id)?

{
  "id": "5c56c793-69f3-4fbf-87e6-c4bf54c28c26",
  "vector": [0.9, 0.1, 0.1],
  "payload": {
    "subscriptions": [
      {
        "subscription": "subreddit_name",
        "posts": [
          {
            "title": "",
            "content": ""
          },
          {
            "...": "..."
          }
        ]
      },
      {
        "...": "..."
      }
    ]
  }
}

Or would the subscriptions (i.e. subreddits) themselves be points?

Feb 05 '24 10:02 haruncurak

Each post in a subreddit should be a separate point.
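
A rough sketch of what such a benchmark dataset could look like, with one point per post and the subreddit stored in the payload. The helper names, the `subreddit` payload key, and the tiny vector dimension are assumptions made for illustration; the filter dictionary follows the shape of Qdrant's "match any" condition, but check it against the current API before relying on it.

```python
import random
from typing import Any, Dict, List


def generate_points(
    num_posts: int, num_subreddits: int, dim: int = 4, seed: int = 0
) -> List[Dict[str, Any]]:
    """Generate one synthetic point per post, tagged with a random subreddit."""
    rng = random.Random(seed)
    return [
        {
            "id": post_id,
            "vector": [rng.random() for _ in range(dim)],
            # Hypothetical payload key holding the group the post belongs to.
            "payload": {"subreddit": f"subreddit_{rng.randrange(num_subreddits)}"},
        }
        for post_id in range(num_posts)
    ]


def user_filter(
    num_subscriptions: int, num_subreddits: int, seed: int = 0
) -> Dict[str, Any]:
    """Build a filter restricting the search to one user's subscriptions."""
    rng = random.Random(seed)
    chosen = rng.sample(range(num_subreddits), num_subscriptions)
    values = [f"subreddit_{i}" for i in chosen]
    return {"must": [{"key": "subreddit", "match": {"any": values}}]}
```

For the scenario in this issue, `num_subscriptions` would be in the thousands, so that the MatchAny list is long enough to expose any per-value lookup cost.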

Feb 05 '24 10:02 generall

/attempt #3522

Feb 05 '24 11:02 JojiiOfficial

💡 @JojiiOfficial submitted a pull request that claims the bounty. You can visit your bounty board to reward.

Feb 05 '24 12:02 algora-pbc[bot]

@JojiiOfficial: Your claim has been rewarded! 👉 Complete your Algora onboarding to collect the bounty.

Feb 08 '24 11:02 algora-pbc[bot]