
Best practices on saving results

Open jhgeluk opened this issue 1 year ago • 1 comments

Hello,

I'm new to topic modeling, and I'm looking for some help on best practices for storing the results, including topics and the relationship between topics and documents, in a relational database (like MySQL or Postgres). Additionally, I'm interested in how to integrate this with the process of merging models whenever new data becomes available.

What are the recommended approaches for achieving these goals? Specifically, I'd like to understand how to assign unique identifiers to topics for easy referencing in the database.

jhgeluk avatar Feb 13 '24 09:02 jhgeluk
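
One possible way to lay out the storage question above is sketched below. It is not taken from the thread or from BERTopic itself: the table names, column names, and the `store_results` helper are hypothetical, and SQLite from the Python standard library stands in for MySQL or Postgres. The assumption behind the design is that a model's own topic numbers are only meaningful within one fitted or merged model, so each topic gets a database-generated key and the model's number is stored next to a version label.

```python
import sqlite3

# Hypothetical schema: a surrogate key per (model_version, model_topic_id) pair,
# plus a link table between documents and topics.
SCHEMA = """
CREATE TABLE IF NOT EXISTS topics (
    topic_pk       INTEGER PRIMARY KEY,
    model_version  TEXT    NOT NULL,
    model_topic_id INTEGER NOT NULL,
    label          TEXT,
    UNIQUE (model_version, model_topic_id)
);
CREATE TABLE IF NOT EXISTS document_topics (
    document_id TEXT    NOT NULL,
    topic_pk    INTEGER NOT NULL REFERENCES topics (topic_pk),
    PRIMARY KEY (document_id, topic_pk)
);
"""


def topic_pk(conn, model_version, model_topic_id, label=None):
    """Return the stable database key for a topic, creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO topics (model_version, model_topic_id, label) "
        "VALUES (?, ?, ?)",
        (model_version, model_topic_id, label),
    )
    row = conn.execute(
        "SELECT topic_pk FROM topics WHERE model_version = ? AND model_topic_id = ?",
        (model_version, model_topic_id),
    ).fetchone()
    return row[0]


def store_results(conn, model_version, topic_labels, doc_ids, doc_topics):
    """Store one topic assignment per document for a given model version."""
    conn.executescript(SCHEMA)
    for doc_id, model_topic_id in zip(doc_ids, doc_topics):
        model_topic_id = int(model_topic_id)
        pk = topic_pk(conn, model_version, model_topic_id,
                      topic_labels.get(model_topic_id))
        conn.execute(
            "INSERT OR REPLACE INTO document_topics (document_id, topic_pk) "
            "VALUES (?, ?)",
            (doc_id, pk),
        )
    conn.commit()


# Made-up example data; -1 is used here for documents without a topic.
conn = sqlite3.connect(":memory:")
store_results(
    conn,
    model_version="2024-02-base",
    topic_labels={-1: "outliers", 0: "cats", 1: "finance"},
    doc_ids=["doc-a", "doc-b", "doc-c"],
    doc_topics=[0, 1, -1],
)
rows = conn.execute(
    "SELECT d.document_id, t.model_topic_id, t.label "
    "FROM document_topics d JOIN topics t USING (topic_pk) "
    "ORDER BY d.document_id"
).fetchall()
```

With this layout, storing results from a later model under a new `model_version` creates new topic rows instead of overwriting the old ones, so earlier assignments remain queryable after a merge.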

Generally, you could just use the .transform method of BERTopic to predict topics for new documents as they come in. When the number of documents is not much more than a couple of hundred thousand, I seldom use a dedicated database to query the results.

MaartenGr avatar Feb 18 '24 17:02 MaartenGr

Thanks for the insight. I also noticed that it's quite slow to fit documents to a loaded "base" model (which I saved previously) before I merge it with my new model so that I can discover new topics.

Is there a way I can speed this up? With manual topic modelling for example?

jhgeluk avatar Feb 20 '24 22:02 jhgeluk

What do you mean by "fitting a loaded 'base' model"? The idea is that you first fit a model, save it, and then load it for inference, not training/fitting.

Also, what exactly do you consider to be slow? How many documents do you have? Are you using a GPU? etc.

MaartenGr avatar Feb 21 '24 11:02 MaartenGr

Excuse me for my vague question. I have found the solution to my "problem", thanks for the swift reply.

jhgeluk avatar Feb 22 '24 19:02 jhgeluk