datasketch

Storing MinHash for later use

Open ghost opened this issue 5 years ago • 5 comments

Right now I have a database of documents, and each day new documents enter the database. Let's say that up to a certain day I have all the MinHash functions for each document in my database (corpus).

Then, another day, a new document enters the database.

Can I store all these previously obtained MinHash functions and later when a new document enters the database I just MinHash that document and use the previously obtained MinHashes to find similarities with the new MinHash?

Basically, I don't want to recompute MinHashes for all the documents in my corpus every time a new document comes in. So, if I can store MinHashes to save running time, I want to do so.

Thank you very much.

ghost avatar Feb 27 '20 16:02 ghost

Hi.

You could use the insert or insertion_session methods of MinHashLSH to update your LSH index when a new document arrives, without rebuilding the index.


Sincerely yours, Aleksey Astafiev


aastafiev avatar Feb 27 '20 16:02 aastafiev

Just to add to @aastafiev's answer: you can also serialize/pickle your MinHash (or LeanMinHash for better performance) and save the serialized bytes in your database (e.g. bytea in Postgres). The next time a similarity computation is needed, you just load the LeanMinHash from your database instead of recomputing it.

See the documentation for LeanMinHash, which supports serialization and deserialization, for examples.

ekzhu avatar Feb 29 '20 19:02 ekzhu

I'm currently storing the key (i.e. document id), seed, and hash values of a LeanMinHash as a record. The LeanMinHash can be (re)created using the seed and the hash values. See: http://ekzhu.com/datasketch/documentation.html#lean-minhash

The key and MinHash can then be inserted into an LSH.

A benefit is that the data can be stored in almost any database.

hsicsa avatar Mar 21 '23 01:03 hsicsa

I want to store the MinHash values in a column of a Spark DataFrame. How can we do that?

sejalrj avatar Mar 30 '23 20:03 sejalrj

Pickle the MinHash objects and store them as bytes in a column.


ekzhu avatar Mar 30 '23 23:03 ekzhu