Storing MinHash for later use
Right now I have a database of documents, and new documents enter it each day. Let's say that, up to a certain day, I have computed the MinHash for every document in my database (corpus).
Then, another day, a new document enters the database.
Can I store all the previously computed MinHashes so that, when a new document enters the database, I only MinHash that document and compare it against the stored MinHashes to find similar documents?
Basically, I don't want to recompute the MinHashes for every document in my corpus each time a new document comes in. If storing MinHashes saves running time, I want to do that.
Thank you very much.
Hi.
You could use the insert or insertion_session methods of MinHashLSH to update your LSH index.
Sincerely yours, Aleksey Astafiev
Just to add to @aastafiev's answer: you can also serialize/pickle your MinHash (or LeanMinHash for better performance) and save the serialized bytes in your database (e.g. bytea in Postgres). The next time a similarity computation is needed, you just load the LeanMinHash from your database instead of recomputing it.
See the documentation for LeanMinHash, which supports serialization and deserialization, for examples.
I'm currently storing the key (i.e. document id), seed, and hash values of a LeanMinHash as a record. The LeanMinHash can be (re)created from the seed and the hash values. See: http://ekzhu.com/datasketch/documentation.html#lean-minhash
The key and the recreated LeanMinHash can then be inserted into an LSH.
A benefit is that these values can be stored in almost any database.
I want to store the MinHash values in a column of a Spark DataFrame. How can we do that?
Pickle the MinHash objects and store them as bytes in a column.