Implement Pipeline Collection smart search
We currently support unified (results re-ranked into a single list) and separate (results returned per pipeline) searches for a collection.
This issue adds a smart search that routes the query to the collections worth searching, by matching the query against each pipeline's description.
Hey @ddematheu, can you elaborate on this? I would like to contribute to this.
Sure.
At a high level, we have Pipelines that each have a description associated with them. (https://github.com/NeumTry/NeumAI/blob/main/neumai/neumai/Pipelines/Pipeline.py) A pipeline represents a collection of data, as it has a data source as well as a vector DB associated with it.
We introduced PipelineCollection (https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neumai_tools/PipelineCollection/PipelineCollection.py) as an easy way to query multiple pipelines at the same time. Ex. I want to query data both from a user record in Postgres and general info from files in S3. This is somewhat helpful, but the main piece of feedback we have heard is that users would prefer to dynamically decide which data collection to query based on the question. Ex. to know the status of a user, I would query Postgres; to get the information for a mortgage they are getting, I would go to S3 where the mortgage document is stored.
With this in mind, I have stubbed out a search_routed method.
The method is designed to take a collection of Pipelines (1+) and use the description field to decide which one to use.
For the decision, there are two approaches I have in mind:

- Using embeddings to do some basic classification: compare the embedding of the description with the embedding of the query, and apply a threshold on the similarity score to decide whether a given pipeline should be searched for the query.
- Using an LLM with function calling to decide, based on the description of each pipeline in the collection, which one to query.

I was leaning towards #1 to start, given that it is more lightweight and will provide faster responses, but #2 might provide better quality.
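To make approach #1 concrete, here is a minimal sketch of threshold-based routing over description embeddings. It assumes the query and descriptions have already been embedded with the same model; the dict shape, `description_vec` field, and the 0.75 default are illustrative placeholders, not existing NeumAI API:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def route_pipelines(query_vec, pipelines, threshold=0.75):
    """Keep only the pipelines whose description embedding clears the threshold."""
    selected = []
    for pipeline in pipelines:
        score = cosine_similarity(query_vec, pipeline["description_vec"])
        if score >= threshold:
            selected.append((pipeline["name"], score))
    # Highest-scoring collections first
    return sorted(selected, key=lambda pair: pair[1], reverse=True)
```

A query close to one description and far from the other would then route to a single pipeline instead of fanning out to all of them.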
@ddematheu Thanks, that made it clearer to me. I can also think of an approach like comparing the query with a pre-computed cluster center for each pipeline/sink, something representative of the data in the pipeline/sink, alongside the pipeline description. So it's the same as your point #1, with the addition of these pipeline representatives. I thought of this because when the data in the pipeline changes, the representative would also update and stay relevant. Would that be useful?
That is actually a great idea. How are you thinking about calculating the center? Do some vector DBs provide it out of the box? Alternatively, it might be something that we calculate at ingestion time and update over time as new data is added.
@ddematheu I will have to see if each DB offers this, will get back on that.
Regarding implementation, as an initial idea, I had thought of it like the way you mentioned:
Approach
- Every SinkConnector would have to define methods compute_cluster_center and update_cluster_center
- Whenever new data is added to a sink, it would trigger the update_cluster_center method
Doubts
- In each sink, a single data unit can have multiple fields, some or all of them vectorized; which fields should be used for the cluster-center calculation?
- A user might have data vectorized with 512-dimensional embeddings in one sink and 1024-dimensional embeddings in another; what should we do in that case?
- What method to use for cluster center calculation? Would simple averaging suffice?
Discussion
- I would love to know if there are more approaches to this. Please share if you come across any, I would also do some research on that.
- We could also explore simpler approaches than embedding similarity, because the semantic route raises several choices (which model to use, what embedding size, what threshold, etc.) and would clutter the config. We can discuss this in detail.
Something that might be tricky is that after calculating/updating the center, it would need to be stored somewhere. Maybe it is just added into the sink with some metadata/IDs to retrieve it later.
Doubt #1: that should be transparent to the sink. Data sources are broken down and translated into individual vectors.
Doubt #2: the center would be calculated at the dimensionality of its sink (each sink has an embedding model associated with it as part of the pipeline). For embedding the pipeline descriptions themselves, we would need to standardize on a model.
Doubt #3: still thinking about this.
In general, my vibe is to start with the most straightforward method and "benchmark" results before overly investing in a much more complicated approach.
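As a concrete version of the straightforward method: if simple averaging suffices (Doubt #3), the proposed update_cluster_center trigger could maintain a running mean at ingestion time without re-reading the whole sink. A minimal sketch, using the method name stubbed out above (the signature is an assumption, not existing connector code):

```python
def update_cluster_center(center, count, new_vectors):
    """Fold newly ingested vectors into a running mean.

    center: current mean vector, count: number of vectors seen so far.
    Returns the updated (center, count) pair.
    """
    for vec in new_vectors:
        count += 1
        # Incremental mean: m_new = m_old + (x - m_old) / n
        center = [m + (x - m) / count for m, x in zip(center, vec)]
    return center, count
```

The incremental form only needs the stored center and a running count, which fits the idea of persisting the center in the sink with some metadata.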
@ddematheu Okay, I would start with Marqo and try to implement a basic working version.
Update
I have implemented a get_representative_vector method for LanceDB and Marqo, using the mean of the vectors as the representative. Next I will write the code for search_routed.
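For reference, the mean-of-vectors representative can be sketched as a stand-alone function; this is a simplified illustration, not the actual connector code, which would pull the stored vectors from LanceDB or Marqo:

```python
def get_representative_vector(vectors):
    """Mean of all stored vectors, used as the sink's representative."""
    if not vectors:
        return None  # empty sink has no representative
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
```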
Sounds good. Feel free to open a PR and I can take a look to provide feedback.
@ddematheu How will the query be vectorized? In the separate search, we use each respective pipeline's embed_query method. In this case, what should we use to vectorize the query?
This is where it gets hard with the representative vector, as that vector is determined by the embedding model used within each pipeline, and comparing across models is hard. Unless, for the comparison, we embed the query using embed_query for each pipeline; we would then just compare the scores.
So yeah, I think using the embed_query makes sense.
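A rough sketch of how search_routed could work under this scheme; the pipeline interface here (embed_query, search, a stored representative attribute) is an assumption for illustration, not the actual implementation:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def search_routed(pipelines, query, threshold=0.5):
    """Search only the pipelines whose representative vector is close to the query.

    Each pipeline embeds the query with its own model, so dimensions always
    match its representative; the resulting scores are compared to a threshold.
    """
    results = []
    for pipeline in pipelines:
        query_vec = pipeline.embed_query(query)        # pipeline-specific embedding
        score = cosine_similarity(query_vec, pipeline.representative)
        if score >= threshold:
            results.extend(pipeline.search(query))     # only search relevant sinks
    return results
```

One caveat with comparing raw scores across pipelines is that different embedding models produce different score distributions, so a shared threshold may need tuning per pipeline.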
Okay then, I will go ahead with embed_query for now. I am looking to get an initial version up and running as quickly as possible; you can then get feedback from users and we can develop it further.
@ddematheu I have implemented the first version of smart search. It works well; I tested it using two data sources and two sinks. Excited for this feature and its further improvements!
Awesome, taking a look at the PR.