Running batches and aggregating results
Is it possible to run BERTopic on segmented datasets? E.g., instead of running BERTopic on, say, 200,000 documents, it is run 20 times on 10,000 documents each, and the results are aggregated to find the most commonly occurring topics.
I believe the short answer is no. The issue is that the embedding process creates a model of the entire document set, and each document is embedded relative to the other documents in the set. If you break it up over multiple datasets, those relationships won't exist. However, you can try to find a random, representative subset of some thousands of documents; that representative model can then be used to transform() new documents against the already fit() model. The trick is figuring out how many documents to put in the subset.
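A minimal sketch of that subset-then-transform workflow, assuming `docs` is a list of strings. The helper names, the 5% fraction, and the seed are illustrative choices, not BERTopic recommendations:

```python
import random

def split_subset(docs, fraction=0.05, seed=42):
    """Split docs into a random 'representative' subset and the remainder."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_subset_then_transform(docs, fraction=0.05):
    # Deferred import so the helper above stays usable without BERTopic installed.
    from bertopic import BERTopic

    subset, remainder = split_subset(docs, fraction)
    topic_model = BERTopic().fit(subset)              # fit once on the subset
    topics, probs = topic_model.transform(remainder)  # assign the rest
    return topic_model, topics
```

How large the subset must be to be representative is, as noted above, the open question.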
> The issue is that the embedding process creates a model of the entire document set and each document is embedded relative to the other documents in the set. If you break it up over multiple datasets then those relationships won't exist.
The embedding models are pre-trained and do not actually embed documents relative to each other. Each document is embedded independently of the others when using a pre-trained embedding model, since that model is not being trained when fitting BERTopic. In other words, the embedding process is not the limiting factor here.
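To make the independence point concrete, here is a toy stand-in for a pre-trained embedder, a fixed hashing vectorizer that is purely illustrative (real sentence-transformer models behave the same way in this respect): a document's vector depends only on that document, never on the rest of the batch.

```python
import numpy as np

DIM = 64  # toy embedding dimensionality

def embed_one(doc):
    """Fixed (i.e. 'pre-trained') hashing embedder: output depends only on doc."""
    vec = np.zeros(DIM)
    for token in doc.lower().split():
        vec[hash(token) % DIM] += 1.0
    return vec

def embed_batch(docs):
    return np.vstack([embed_one(d) for d in docs])

docs = ["topic modeling with bertopic", "cats are mammals"]
together = embed_batch(docs)            # embedded as one batch
alone = embed_one(docs[0])              # embedded in isolation
assert np.allclose(together[0], alone)  # batch membership changes nothing
```

Because nothing is learned from the batch, splitting the corpus changes nothing at the embedding stage; it is only later steps (e.g., UMAP) that see the whole dataset.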
> Is it possible to run BERTopic on segmented datasets? E.g., instead of running BERTopic on, say, 200,000 documents, it is run 20 times on 10,000 documents each, and the results are aggregated to find the most commonly occurring topics.
If I am not mistaken, what you are describing here is essentially partial_fit in scikit-learn. This is currently not possible in BERTopic, as these methods are not supported across all of BERTopic's sub-models. I am currently doing some work on that, but I cannot make any promises.
Ahhh... Ok. Good to know. I think I understand my mistake.
I recently had occasion to split a corpus into two segments. When I split BERTopic.umap_model.embedding_, I didn't get the results I was expecting, so I went back to the original text, split that, and created new BERTopic models on both halves. I assumed the issue was with the original BERT embeddings, as reflected above. But now I'm assuming the issue wasn't with the original embeddings but with the UMAP reduction, which makes sense since UMAP uses all the data to do its reduction.
If I've got this right, then it would be perfectly easy to embed multiple batches outside of BERTopic with BERT, yes? If so, couldn't a subclass of BaseEmbedder handle this?
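Yes: since the embeddings are independent, they can be computed outside BERTopic in whatever batches fit in memory and passed in via the `embeddings` argument of `fit_transform` (a simpler alternative to subclassing `BaseEmbedder`). A sketch, where the helper names, model name, and batch size are illustrative:

```python
import numpy as np

def chunk_bounds(n, size):
    """Start/end indices for consecutive batches covering n items."""
    return [(i, min(i + size, n)) for i in range(0, n, size)]

def embed_in_batches(docs, batch_size=10_000, model_name="all-MiniLM-L6-v2"):
    # Deferred import so chunk_bounds stays usable without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Pre-trained model: stacking per-batch results equals embedding all at once.
    return np.vstack([model.encode(docs[lo:hi])
                      for lo, hi in chunk_bounds(len(docs), batch_size)])

def fit_with_precomputed_embeddings(docs):
    from bertopic import BERTopic

    embeddings = embed_in_batches(docs)
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics
```

Note that this only batches the embedding step; the UMAP reduction and clustering still run over the full set of embeddings at once.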
Thanks for the responses.
@krews2 With the newest version of BERTopic it is now possible to use .partial_fit and apply incremental topic modeling. You can find more about that here.
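A sketch of that incremental workflow, assuming online-capable sub-models are swapped in (here scikit-learn's IncrementalPCA and MiniBatchKMeans plus BERTopic's OnlineCountVectorizer); the component count, cluster count, and batch size are illustrative:

```python
def minibatches(docs, size=10_000):
    """Yield consecutive batches of documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def fit_incrementally(docs):
    # Deferred imports so minibatches stays usable without the libraries installed.
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    from bertopic import BERTopic
    from bertopic.vectorizers import OnlineCountVectorizer

    topic_model = BERTopic(
        umap_model=IncrementalPCA(n_components=5),                     # online reduction
        hdbscan_model=MiniBatchKMeans(n_clusters=50, random_state=0),  # online clustering
        vectorizer_model=OnlineCountVectorizer(stop_words="english"),
    )
    for batch in minibatches(docs):
        topic_model.partial_fit(batch)  # each call updates all sub-models
    return topic_model
```

This is exactly the 20-batches-of-10,000 scenario from the original question, with the aggregation handled inside the model rather than after the fact.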