BERTopic Get index of docs plotted using sample param in visualize_documents and "smart" way to reduce data points?

Hi,

Is it possible to get the index of documents plotted when using the sample parameter in visualize_documents? As far as I understand it, if, say, sample=0.1, from each topic a random 10% of docs is sampled to visualize in the plot. Would help greatly to see which docs specifically, as we use GPT or other LLM API to summarize each doc and show this as hover in the plot. We only want to summarize the docs actually plotted.

edit: Generally, we have been thinking about how to decrease the number of data points plotted, preferably in a "smart" way, for the same reason of computational and cost constraints when using some LLM API to interpret the topic labels or docs themselves. I've used MMR for topic fine-tuning, so filtering terms with high marginal relevance, it definitely improved the output. Is there something similar for documents themselves? Plotting only the documents that are relevant or contain many highly marginal relevant terms? Anyone experience with this?

Would love to hear about it. Regards and thanks, Arne

Feb 22 '24 10:02 arneeichholtz

Is it possible to get the index of documents plotted when using the sample parameter in visualize_documents? As far as I understand it, if, say, sample=0.1, from each topic a random 10% of docs is sampled to visualize in the plot. Would help greatly to see which docs specifically, as we use GPT or other LLM API to summarize each doc and show this as hover in the plot. We only want to summarize the docs actually plotted.

It is currently not possible to return the indices, as that did not fit the way the visualizations are used. However, you can still pass the summary of each document instead of the documents themselves, and the function will perform the sampling itself.

If you only have summaries for the documents you selected yourself, I would advise adapting the visualization code to plot only the documents you are interested in.

edit: Generally, we have been thinking about how to decrease the number of data points plotted, preferably in a "smart" way, for the same reason of computational and cost constraints when using some LLM API to interpret the topic labels or docs themselves. I've used MMR for topic fine-tuning, so filtering terms with high marginal relevance, it definitely improved the output. Is there something similar for documents themselves? Plotting only the documents that are relevant or contain many highly marginal relevant terms? Anyone experience with this?

Something like that is currently not implemented. With enough data points, a random sample generally works well because the documents in a cluster tend to sit close together in the plot. If, however, you want the documents most relevant to the topic and not necessarily to the cluster, then c-TF-IDF generally works well for finding them. You could indeed also use MMR for finding the most relevant documents, but c-TF-IDF is quite a bit faster and is what is currently used for selecting the most representative documents.
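To make the c-TF-IDF idea concrete, here is a minimal pure-Python sketch. It is a simplification and not BERTopic's own implementation (which relies on its vectorizer and sparse matrices): the function names `ctfidf_by_topic` and `rank_docs` are hypothetical, tokenization is plain whitespace splitting, and documents are scored by cosine similarity between their raw term counts and their topic's c-TF-IDF weights.

```python
import math
from collections import Counter, defaultdict


def ctfidf_by_topic(docs, topics):
    """Compute a simplified c-TF-IDF weight for every (topic, term) pair."""
    counts = defaultdict(Counter)  # term counts per topic
    for doc, topic in zip(docs, topics):
        counts[topic].update(doc.lower().split())
    if not counts:
        return {}

    total = Counter()  # term counts summed over all topics
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)  # average words per topic

    weights = {}
    for topic, c in counts.items():
        n = sum(c.values())
        # normalized term frequency in the topic * log(1 + A / overall frequency)
        weights[topic] = {
            t: (f / n) * math.log(1 + avg_words / total[t]) for t, f in c.items()
        }
    return weights


def rank_docs(docs, topics, top_n=2):
    """Per topic, return indices of the `top_n` docs closest to its c-TF-IDF vector."""
    weights = ctfidf_by_topic(docs, topics)
    scored = defaultdict(list)
    for idx, (doc, topic) in enumerate(zip(docs, topics)):
        tf = Counter(doc.lower().split())
        w = weights[topic]
        dot = sum(tf[t] * w.get(t, 0.0) for t in tf)
        norm = math.sqrt(sum(v * v for v in tf.values())) * math.sqrt(
            sum(v * v for v in w.values())
        )
        scored[topic].append((dot / norm if norm else 0.0, idx))
    # Highest similarity first; ties broken by document index.
    return {
        t: [i for _, i in sorted(s, key=lambda p: (-p[0], p[1]))[:top_n]]
        for t, s in scored.items()
    }


# Toy data standing in for the fitted documents and their topic ids.
docs = [
    "cats purr and cats nap",
    "cats chase mice",
    "dogs bark loudly",
    "dogs fetch and dogs bark",
    "cats and dogs",
]
topics = [0, 0, 1, 1, 1]
top_docs = rank_docs(docs, topics, top_n=2)
```

The resulting indices refer to positions in the original document list, so they can be used to decide which documents to summarize and plot.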

Feb 22 '24 13:02 MaartenGr

Thanks for your response!

Based on the topic ids for each document it is indeed straightforward to sample (say) 10% of documents in each topic and get the documents indices. But is it possible to integrate this with visualize_documents without fitting the model again on the sample? Documentation specifies that docs param in the function should be documents used to fit model, and running it with sampled documents gives list index out of range error.

Fitting the model again (on the sample) will change the results. When using an LLM API we run into token and query limits for the summaries, so we want to summarize only a given sample and plot just those documents.
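For reference, the per-topic sampling described above could look roughly like the sketch below. It only uses the standard library; the helper name `sample_indices_per_topic` is made up for illustration, and the toy `topics` list stands in for `topic_model.topics_` (one topic id per fitted document).

```python
import random
from collections import defaultdict


def sample_indices_per_topic(topics, fraction=0.1, seed=42):
    """Return sorted indices of a random `fraction` of the documents in each topic."""
    rng = random.Random(seed)  # fixed seed so the same documents are picked each run
    by_topic = defaultdict(list)
    for idx, topic in enumerate(topics):
        by_topic[topic].append(idx)

    selected = []
    for idxs in by_topic.values():
        k = max(1, round(len(idxs) * fraction))  # keep at least one doc per topic
        selected.extend(rng.sample(idxs, k))
    return sorted(selected)


# Toy stand-in for `topic_model.topics_`.
topics = [0] * 50 + [1] * 30 + [-1] * 20
sampled_idx = sample_indices_per_topic(topics, fraction=0.1)
```

With the indices in hand, only `[docs[i] for i in sampled_idx]` would need to be sent to the LLM for summarization.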

Feb 23 '24 12:02 arneeichholtz

Based on the topic ids for each document it is indeed straightforward to sample (say) 10% of documents in each topic and get the documents indices. But is it possible to integrate this with visualize_documents without fitting the model again on the sample? Documentation specifies that docs param in the function should be documents used to fit model, and running it with sampled documents gives list index out of range error.

You would have to adapt the .visualize_documents function yourself, as this is currently not implemented within BERTopic. It should be straightforward since the code for the visualizations is independent of the fitting process. You can use the code here to do something like this:

def visualize_documents(topic_model, docs, **kwargs):
    # My updated functionality
    ...

visualize_documents(topic_model, my_sampled_docs)

To illustrate, under the hood, BERTopic is simply running the following when calling .visualize_documents:

https://github.com/MaartenGr/BERTopic/blob/99ee553e3ee00fa7189d3210bdc618a7c7a943c8/bertopic/_bertopic.py#L2329-L2340
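As a rough sketch of what such an adapted function might do internally, the snippet below keeps every input at full length (one entry per fitted document) and selects the documents to plot by index, which avoids the length mismatch between the sampled documents and the fitted topics. Everything here is an assumption for illustration: `build_traces` is a hypothetical helper, `coords` stands in for 2-D reduced embeddings, and the plotting call itself is left out.

```python
from collections import defaultdict


def build_traces(topics, coords, hover_texts, keep_idx):
    """Group the kept documents by topic into plot-ready traces.

    topics:      one topic id per fitted document
    coords:      one (x, y) pair per fitted document (2-D reduced embeddings)
    hover_texts: dict mapping document index to hover text (e.g. an LLM summary)
    keep_idx:    indices of the documents that should be plotted
    """
    traces = defaultdict(lambda: {"x": [], "y": [], "text": []})
    for i in keep_idx:
        trace = traces[topics[i]]
        x, y = coords[i]
        trace["x"].append(x)
        trace["y"].append(y)
        trace["text"].append(hover_texts.get(i, ""))  # blank hover if no summary
    return dict(traces)


# Toy data: five fitted documents, of which only two were summarized.
topics = [0, 0, 1, 1, -1]
coords = [(0.0, 0.1), (0.2, 0.3), (1.0, 1.1), (1.2, 1.3), (5.0, 5.0)]
summaries = {0: "summary A", 3: "summary B"}
traces = build_traces(topics, coords, summaries, keep_idx=sorted(summaries))
```

Each entry in `traces` holds the x/y positions and hover texts for one topic, which could then be handed to a plotting library such as Plotly as one scatter trace per topic.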

Feb 23 '24 12:02 MaartenGr