Show the distribution of top k documents in each topic
Referring to https://github.com/MaartenGr/BERTopic/issues/93#:~:text=So%2C%20only%20set%20this%20to%20True%20if%20you%20have%20less%20than%20100.000%20documents., does this mean probabilities will not be computed when I have more than 100 documents (to be precise, I have 40,000+ documents), even though I set calculate_probabilities=True?
My BERTopic model is:
from umap import UMAP
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Dimensionality reduction and clustering (k-Means in place of HDBSCAN)
umap_model = UMAP(n_neighbors=20, n_components=15, min_dist=0.0, metric='cosine', random_state=42)
cluster_model = KMeans(n_clusters=100, random_state=42)  # 100 clusters

# Topic representation, with seed words boosted in the c-TF-IDF weighting
vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(
    seed_words=seed_words,  # defined elsewhere in my script
    seed_multiplier=5
)

model = BERTopic(embedding_model='./paraphrase-MiniLM-L6-v2',
                 umap_model=umap_model,
                 min_topic_size=50,
                 ctfidf_model=ctfidf_model,
                 hdbscan_model=cluster_model,
                 vectorizer_model=vectorizer_model,
                 calculate_probabilities=True
                 )
Then run this code:
topics, probabilities = model.fit_transform(abstracts)
print(probabilities)
The output is None.
Version:
# pip show bertopic
Name: bertopic
Version: 0.16.0
Summary: BERTopic performs topic Modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
I understand how computationally expensive it would be to compute all the probabilities. If the answer to the question at the beginning of this issue is yes, could you please add a parameter that lets users set a top k, so that the probabilities of the k most similar documents for each topic are shown?
I have read this FAQ: https://maartengr.github.io/BERTopic/faq.html#how-do-i-calculate-the-probabilities-of-all-topics-in-a-document and understand why my probabilities are None. Thank you so much for making this useful library!
In addition, .approximate_distribution models the distribution of topics in the documents. That is documents to topics, not topics to documents, so I would maintain the request above: showing the distribution of the top k documents in each topic.
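To make the direction concrete, this is roughly how I understand .approximate_distribution is used (a minimal sketch using my abstracts from above; the shape comment is my assumption from the docs):

# Document-to-topic direction: for each document, approximate how strongly
# every topic is represented in it.
topic_distr, _ = model.approximate_distribution(abstracts)

# One row per document and one column per topic, so this answers
# "which topics are in this document?", not "which documents best fit this topic?"
print(topic_distr.shape)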
Referring to https://github.com/MaartenGr/BERTopic/issues/93, does this mean probabilities will not be computed when I have more than 100 documents (to be precise, I have 40,000+ documents), even though I set calculate_probabilities=True?
No, the reason it does not compute probabilities is that you are using k-Means, which does not generate probabilities. Instead, if you save the model as either pytorch or safetensors, then load it and use .transform, it will generate probabilities since k-Means is no longer involved.
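Something along these lines (a rough sketch; the save arguments, in particular pointing save_embedding_model at your local model path, may need adjusting for your setup):

from bertopic import BERTopic

# Save the fitted model in a lightweight format that drops the k-Means step
model.save("my_model_dir",
           serialization="safetensors",
           save_ctfidf=True,
           save_embedding_model='./paraphrase-MiniLM-L6-v2')

# The loaded model predicts by similarity to the topic embeddings
# instead of re-running k-Means
loaded_model = BERTopic.load("my_model_dir")

# .transform now returns similarity-based probabilities for each document
topics, probabilities = loaded_model.transform(abstracts)
print(probabilities)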
@MaartenGr Thanks for your reply! Would you mind answering another question of mine?
Could you please add a parameter that lets users set a top k, so that the probabilities of the k most similar documents for each topic are shown?
In addition, .approximate_distribution models the distribution of topics in the documents. That is documents to topics, not topics to documents, so I would maintain the request above: showing the distribution of the top k documents in each topic.
Could you please add a parameter that lets users set a top k, so that the probabilities of the k most similar documents for each topic are shown?
Ah, you mean the most representative documents per topic, right? If so, you currently get the top 3 most representative documents automatically. If you want more, you might need to use ._extract_representative_docs, but it would be nicer if this were a public function rather than a private one.
If someone is interested in working on this, it could be added as an additional parameter to get_representative_docs as a way to recalculate the most representative documents.
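For reference, the stored documents are already reachable through the public API today; it just does not recalculate anything (a minimal sketch):

# Representative documents stored during fit for topic 0
# (currently capped at the top 3 per topic)
print(model.get_representative_docs(topic=0))

# Without a topic argument you get a dict mapping each topic to its stored docs
all_repr_docs = model.get_representative_docs()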
@MaartenGr Great! I'm modifying the source code to do this. Hope I get a chance to post a PR.
But I have one small question. After calling .fit_transform (my code is above), I ran this code:
model.get_topic_info().loc[0, 'Representative_Docs']
Output:
But when running this code:
import pandas as pd

# Rebuild the document-topic dataframe that BERTopic uses internally
doc_topic = pd.DataFrame({
    'Topic': model.topics_,
    'ID': range(len(model.topics_)),
    'Document': abstracts
})

# Recompute representative documents with a larger sample and more docs per topic
repr_docs, _, _, _ = model._extract_representative_docs(
    c_tf_idf=model.c_tf_idf_,
    documents=doc_topic,
    topics=model.topic_representations_,
    nr_samples=1000,
    nr_repr_docs=5
)
print(repr_docs)
The output of topic 0 was different from the first method:
Could you please tell me what was going on?
Well, you changed a parameter, namely nr_samples, from the original 500 to 1000, so it makes sense that the output would change. Also, make sure to include "Image": images in the dataframe, as it might lead to errors otherwise.
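For example, something like this should reproduce the stored result (a sketch; the Image column of None values is my assumption for a text-only setup, pass your actual images otherwise):

import pandas as pd

# Mirror the dataframe BERTopic builds during fit, including the Image column
doc_topic = pd.DataFrame({
    'Topic': model.topics_,
    'ID': range(len(model.topics_)),
    'Document': abstracts,
    'Image': [None] * len(abstracts)
})

# Keep nr_samples (500) and nr_repr_docs (3) at their fit-time defaults so the
# result matches what .fit_transform stored in Representative_Docs
repr_docs, _, _, _ = model._extract_representative_docs(
    c_tf_idf=model.c_tf_idf_,
    documents=doc_topic,
    topics=model.topic_representations_,
    nr_samples=500,
    nr_repr_docs=3
)
print(repr_docs)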
Thank you for your reply and for your contribution to the application of clustering!
No problem, glad I could be of help. Feel free to reach out if you have any other questions.