How to get more than 3 representative docs per topic via get_topic_info()?

Open youssefabdelm opened this issue 2 years ago • 3 comments

Hey, thank you so much for making this library! Super awesome.

I've seen a bunch of issues here requesting this, but I haven't found a straightforward way to specify it via get_topic_info(), which contains a lot of the information I need. I wish there was a parameter in there like get_topic_info(number_of_representative_documents=3) that I could modify.

I'm not sure that _extract_representative_docs will work in my context, as I'm using UMAP, HDBSCAN, and GPT for topic labels, with no TF-IDF or anything, and c-TF-IDF seems to be a required parameter.

youssefabdelm avatar Jan 22 '24 22:01 youssefabdelm

The documents themselves are not saved within BERTopic, in part to reduce memory requirements, so it would not be possible to run something like .get_topic_info(number_of_representative_documents=3). You are always using c-TF-IDF since it is part of the default pipeline, so ._extract_representative_docs should work.

MaartenGr avatar Jan 23 '24 05:01 MaartenGr

so ._extract_representative_docs should work.

@MaartenGr I couldn't find an example in BERTopic doc. Could you please provide an example?

syGOAT avatar Mar 20 '24 06:03 syGOAT

I think there is a nice example here. It might be nice to have an additional function that re-calculates the representative documents since this question seems to appear frequently.

MaartenGr avatar Mar 20 '24 09:03 MaartenGr
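
For readers looking for the kind of example requested above, here is a minimal sketch of calling ._extract_representative_docs on an already fitted model. The helper name extract_representative_docs is hypothetical. Since this is a private method, the keyword arguments, the DataFrame columns ("Document", "ID", "Topic"), and the four-value return are assumptions based on recent BERTopic releases and may differ in your version. The documents must be passed back in because, as noted above, BERTopic does not store them.

```python
import pandas as pd


def extract_representative_docs(topic_model, docs, nr_repr_docs=10):
    """Return {topic_id: [documents]} with up to nr_repr_docs entries per topic.

    Assumes topic_model is a fitted BERTopic model and docs is the same list
    of documents, in the same order, that the model was fitted on.
    """
    # BERTopic does not keep the documents, so hand them back together with
    # the topic that was assigned to each one during fitting.
    documents = pd.DataFrame(
        {
            "Document": docs,
            "ID": range(len(docs)),
            "Topic": topic_model.topics_,
        }
    )
    # Private API: the signature and the 4-tuple return value are assumptions
    # based on recent releases and can change without notice.
    repr_docs, _, _, _ = topic_model._extract_representative_docs(
        c_tf_idf=topic_model.c_tf_idf_,
        documents=documents,
        topics=topic_model.topic_representations_,
        nr_repr_docs=nr_repr_docs,
    )
    return repr_docs
```

With a fitted topic_model and the original docs, calling extract_representative_docs(topic_model, docs, nr_repr_docs=10) should give a dictionary keyed by topic ID. One way to bring that back into the get_topic_info() table would be to map it onto the "Topic" column of the returned DataFrame. The method samples documents per topic before ranking them, so the result is an approximation rather than an exhaustive ranking.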