age icon indicating copy to clipboard operation
age copied to clipboard

Proposal: Vector handling with extension(pgvector)

Open iitaejeong opened this issue 2 years ago • 26 comments

Is your feature request related to a problem? Please describe. Vector usage is very skyrocketing now. so, market wants to handle with them using database. but we didnt. exactly, the vectorDB has many popularity from resource 'db-engine rank'.

related to a problem

but , Apache age doesn't have the function for vector engineering. so many other person who wants to apply the vector are just using the python library itself for supporting vector handling.

Describe the solution

Extension PGvector.

CRETATE EXTENSION vector;

SELECT embedding <-> '[3,1,2]' AS distance FROM items;
SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;
SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

and also can indexing function for searching of nearest neighbor. There are many very good functions except for the aforementioned functions.

Additional context Except for above problem, we expand this feature to ML practitioner who wants data efficient management. the overview of additional context is compatible with Pytorch_geometric remote backend function. They builds the class for ease to integration others database. if previous problem well solve , then we might do this future work.

image

in below reference , it has the function for convenience.

[Scaling Up GNNs via Remote Backends] https://pytorch-geometric.readthedocs.io/en/latest/advanced/remote.html

iitaejeong avatar Aug 02 '23 02:08 iitaejeong

This will be a great addition. I understand that Graph data has to be converted into vector via graph embeddings before being passed to ML algorithms. This feature will very effective for AGE and attract alot of ML/DL practitioners. I really look forward to a project of this nature to contribute to.

titoausten avatar Aug 02 '23 05:08 titoausten

i guess that ideally we can use eg vector search (but maybe also bm25 via pg_search) to be able to form graphs connecting similar entities (instead of exact matching ones). eg when building a knowlegde graph for RAG, the entities and relations are harvested using eg LLMs, leading to similar entities (instead of exactly teh sme ones) because eg the source data is written by diffferent people talking about the same things, yet using different words. lacking a formal verification/cleanup process, this leads to more then one entity in the graph.

stdweird avatar Apr 25 '24 13:04 stdweird

i guess that ideally we can use eg vector search (but maybe also bm25 via pg_search) to be able to form graphs connecting similar entities (instead of exact matching ones). eg when building a knowlegde graph for RAG, the entities and relations are harvested using eg LLMs, leading to similar entities (instead of exactly teh sme ones) because eg the source data is written by diffferent people talking about the same things, yet using different words. lacking a formal verification/cleanup process, this leads to more then one entity in the graph.

Unfortunately pg_search(https://github.com/paradedb/paradedb) does not work with agtype fields.

My workaround is to first convert agtype to JSON as a materialized view.

CREATE MATERIALIZED VIEW IF NOT EXISTS movies_view AS 
SELECT  
    graphid_to_agtype(id)::bigint as id,
    CAST(CAST(properties as VARCHAR) as JSON) as properties
FROM movies."MOVIE";

CALL paradedb.create_bm25(
        index_name => 'movies_view_idx',
        schema_name => 'public',
        table_name => 'movies_view',
        key_field => 'id',
        json_fields => '{properties: {}}'
);

But this only works for (mostly)static data.

mingfang avatar Apr 25 '24 13:04 mingfang

@mingfang and how do you use this in your graph queries? tbh, i also don't know if bm25 makes a lot of sense when comparing to embedding vectors. the entities are only a few words typically, so not sure the vector search would miss things bm25 would not miss (this is different in RAG where a single vector represents a whole sentence or paragraph).

stdweird avatar Apr 25 '24 14:04 stdweird

First I have to create a function to do the agtype conversion

CREATE OR REPLACE FUNCTION public.movies_view_bm25_search(search agtype, limit_rows agtype default '10'::agtype) RETURNS SETOF agtype AS $$
	SELECT id::agtype
	FROM movies_view_bm25.search(search::text, limit_rows => limit_rows::int)
$$ LANGUAGE sql;

Then you can call the search from cypher

unwind public.movies_view_bm25_search('plot:drink') as it
match (v1:MOVIE)
where id(v1)=it
return v1

You can play with it here https://age-playground.rebelsoft.com/graph/movies?q=with%20public.movies_view_bm25_search(%27plot%3Adrink%27)%20as%20it%0Amatch%20(v1%3AMOVIE)%0Awhere%20id(v1)%3Dit%0Areturn%20v1

mingfang avatar Apr 25 '24 19:04 mingfang

This issue is stale because it has been open 60 days with no activity. Remove "Abondoned" label or comment or this will be closed in 14 days.

github-actions[bot] avatar Jun 25 '24 00:06 github-actions[bot]

This issue was closed because it has been stalled for further 14 days with no activity.

github-actions[bot] avatar Jul 09 '24 00:07 github-actions[bot]

Hi, just wanna follow up on this... I wish to try both postgres's extension: apache age with pgvector... Is the docker-compose image working for both extensions? Or should I build the image myself. Thanks, and hope for reply.

tan-yong-sheng avatar Aug 11 '24 16:08 tan-yong-sheng

Would love to see native support and / or examples for this on the site. Neo4j has their vector implementation and it would probably increase adoption to see this more widely used/publicised.

alph486 avatar Aug 24 '24 13:08 alph486