Optimize MatchAny for large amount of values

Open generall opened this issue 2 years ago • 13 comments

Is your feature request related to a problem? Please describe.

There are use cases where we might want to specify a lot of values in the MatchAny condition. One of them is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).

Describe the solution you'd like

Make sure our implementation of MatchAny is efficient enough to handle those cases.

To do so, we would need:

  • A benchmark dataset that approximates the mentioned scenario
  • Measurements before and after the change
  • Handling of the case on the segment level (most likely it could be something like converting the MatchAny list into a hashmap if it is long enough)
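
The last item could be sketched roughly as follows. This is only an illustration in Python, not Qdrant's actual segment code: `build_match_any` and `HASH_THRESHOLD` are hypothetical names, and the cutoff value is an assumption that the benchmark would have to confirm.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical cutoff: below this, a linear scan over the list is assumed
# to be cheap enough. A real value would have to come from measurements.
HASH_THRESHOLD = 16


def build_match_any(values: Sequence[Hashable],
                    threshold: int = HASH_THRESHOLD) -> Callable[[Hashable], bool]:
    """Return a predicate that checks whether a payload value is in `values`."""
    if len(values) > threshold:
        # Long list: pay O(n) once to build a set, then O(1) average per check.
        lookup = frozenset(values)
    else:
        # Short list: keep it as a sequence, each check is an O(n) scan.
        lookup = tuple(values)
    return lambda candidate: candidate in lookup


# A user with several thousand subscriptions, as in the scenario above.
subscriptions = [f"group_{i}" for i in range(5000)]
is_subscribed = build_match_any(subscriptions)
print(is_subscribed("group_4999"))  # True
print(is_subscribed("group_5000"))  # False
```

The conversion happens once per filter, so its cost is amortized over every point checked against the condition.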

Describe alternatives you've considered

Another potential approach is to assign user "ids" to the groups, but this might require huge lists of users associated with each group record.

Additional context

No interface / storage changes are expected in this PR.

generall avatar Feb 04 '24 20:02 generall

/bounty $200

generall avatar Feb 04 '24 20:02 generall

💎 $200 bounty created by Qdrant
🙋 If you start working on this, comment /attempt #3522 along with your implementation plan
👉 To claim this bounty, submit a pull request that includes the text /claim #3522 somewhere in its body
📝 Before proceeding, please make sure you can receive payouts in your country
💵 Payment arrives in your account 2-5 days after the bounty is rewarded
💯 You keep 100% of the bounty award
🙏 Thank you for contributing to qdrant/qdrant!

Attempt Started (GMT+0) Solution
🔴 @haruncurak Feb 4, 2024, 8:55:44 PM WIP
🔴 @Shylock-Hg Feb 5, 2024, 3:21:23 AM WIP
🟢 @ima-attac-helikoptaaa Feb 5, 2024, 8:12:50 AM WIP
🟢 @JojiiOfficial Feb 5, 2024, 11:49:04 AM #3525

algora-pbc[bot] avatar Feb 04 '24 20:02 algora-pbc[bot]

Hi all! I'd love to try and tackle this one as my first Qdrant issue. It would be fantastic if I could get assigned - excited to get going.

/attempt #3522

haruncurak avatar Feb 04 '24 20:02 haruncurak

/attempt #3522 in queue.

Shylock-Hg avatar Feb 05 '24 03:02 Shylock-Hg

/attempt #3522

ima-helikoptaaa avatar Feb 05 '24 08:02 ima-helikoptaaa

One of them is a "reddit"-type application, where a user might have up to several thousand subscriptions and we want to search only among those subscriptions (and not all groups).

This is a nitpick, but Reddit does not support this either. They fake showing content for all your subreddits. Instead, they take 50 random subreddits you are subscribed to and only show content for them.

Maybe a user who actually wants this could use a similar approach.

That doesn't mean we cannot do it though. So we should definitely try to go through with this :+1:
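
As a rough illustration of that sampling workaround on the client side, here is a minimal Python sketch. The function name `sample_subscriptions` is hypothetical, and the limit of 50 is taken from the comment above rather than from any documented Reddit behavior.

```python
import random


def sample_subscriptions(subscriptions, limit=50, rng=None):
    """Pick at most `limit` subscriptions at random to use in the filter."""
    rng = rng or random
    subscriptions = list(subscriptions)
    if len(subscriptions) <= limit:
        # Nothing to trim: the full list is already small enough.
        return subscriptions
    # Sample without replacement so no subscription appears twice.
    return rng.sample(subscriptions, limit)


all_subscriptions = [f"subreddit_{i}" for i in range(3000)]
subset = sample_subscriptions(all_subscriptions, rng=random.Random(42))
print(len(subset))  # 50
```

This keeps the MatchAny list short at the cost of ignoring most subscriptions on any single query.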

timvisee avatar Feb 05 '24 09:02 timvisee

This is a nitpick, but Reddit does not support this either. On your front page, it takes 50 random subreddits you are subscribed to and only shows content for them.

Maybe it was Reddit who requested this improvement :man_shrugging:

generall avatar Feb 05 '24 09:02 generall

attempt #3522

itssubhodiproy avatar Feb 05 '24 09:02 itssubhodiproy

Just a quick check @generall - would the points in the approximate dataset be users with subscriptions, potentially including posts within those subscriptions?

Something like this?


{
  "id": "5c56c793-69f3-4fbf-87e6-c4bf54c28c26", (user id)
  "vector": [0.9, 0.1, 0.1],
  "payload": {
    "subscriptions": [
      {
        "subscription": "subreddit_name",
        "posts": [
          {
            "title": "",
            "content": ""
          },
          {
            "...": "..."
          }
        ]
      },
      {
        "...": "..."
      }
    ]
  }
}

Or would the subscriptions (i.e. subreddits) themselves be points?

haruncurak avatar Feb 05 '24 10:02 haruncurak

Each post in a subreddit should be a separate point.
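
Under that layout, a synthetic benchmark dataset and the matching filter might look roughly like the Python sketch below. The payload field name `subreddit`, the helper names, and the sizes are assumptions made for illustration; the filter dict follows the JSON shape of a Qdrant `match` / `any` condition.

```python
import random


def make_points(num_posts, num_subreddits, dim=4, seed=0):
    """Generate one point per post, each tagged with the subreddit it belongs to."""
    rng = random.Random(seed)
    return [
        {
            "id": post_id,
            "vector": [rng.random() for _ in range(dim)],
            # Assumed payload field name; any keyword field would work.
            "payload": {"subreddit": f"subreddit_{rng.randrange(num_subreddits)}"},
        }
        for post_id in range(num_posts)
    ]


def make_filter(subscriptions):
    """Build a filter that keeps only posts from the user's subscriptions."""
    return {
        "must": [
            {"key": "subreddit", "match": {"any": list(subscriptions)}}
        ]
    }


points = make_points(num_posts=1000, num_subreddits=200)
user_subscriptions = [f"subreddit_{i}" for i in range(50)]
query_filter = make_filter(user_subscriptions)
print(len(points), len(query_filter["must"][0]["match"]["any"]))
```

Scaling `num_subreddits` and the length of `user_subscriptions` into the thousands would approximate the scenario described in the issue.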

generall avatar Feb 05 '24 10:02 generall

/attempt #3522

JojiiOfficial avatar Feb 05 '24 11:02 JojiiOfficial

💡 @JojiiOfficial submitted a pull request that claims the bounty. You can visit your bounty board to reward.

algora-pbc[bot] avatar Feb 05 '24 12:02 algora-pbc[bot]

@JojiiOfficial: Your claim has been rewarded! 👉 Complete your Algora onboarding to collect the bounty.

algora-pbc[bot] avatar Feb 08 '24 11:02 algora-pbc[bot]