
Semi-supervised learning with a manually labeled sample of data

AlexisV-hub opened this issue 3 years ago • 7 comments

Hello, I am new to BERTopic and I'm trying to replace my old algorithms (Corex and LDA) with BERTopic.

Here is my problem: I have many unlabeled documents (about 7000). In order to guide BERTopic, I read a sample of these documents and labeled them myself (about 400 documents). From these labeled documents I created about ten topics, and I still have 6600 documents without labels. I don't want BERTopic to find new topics, but to exploit the ones I've pre-created and assign them to the remaining unlabeled documents.

I would like to know the best way to use these 400 labeled documents and generalize their topics to the 6600 remaining ones, given that BERTopic does not seem to be able to re-train a pre-existing model on new data.

I read in the documentation the example of a "semi-supervised" model (https://maartengr.github.io/BERTopic/getting_started/supervised/supervised.html), where some topics are fully labeled. In my case, however, I have a small sample in which I have already defined all my topics, and I want to label the rest of the documents using the topics I created.

Thanks for your answers,

Alexis

AlexisV-hub avatar Jul 26 '22 15:07 AlexisV-hub

Apologies for the late reply, life is hectic lately!

Semi-supervised learning works by nudging topic creation towards the topics you have defined previously. In practice, that does not mean the topics you selected will be exactly the ones it finds. To circumvent this, you can try one of two approaches. First, instead of HDBSCAN, choose a clustering algorithm like k-Means that allows you to set the number of topics. Second, use a classification algorithm instead of BERTopic, as you are essentially doing predictive modeling.
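As a concrete illustration of the second suggestion, a minimal scikit-learn sketch that trains a classifier on the labeled sample and predicts one of the pre-created topics for each remaining document might look like the following. All names and the tiny toy data are placeholders, not part of the thread:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: in the real setup, `labeled_docs` would be the ~400 manually
# labeled documents and `unlabeled_docs` the remaining ~6600.
labeled_docs = [
    "colis casse a la livraison",
    "produit conforme, tres content",
    "remboursement refuse apres le retour",
]
labels = ["sav", "satisfaction", "retour"]
unlabeled_docs = ["le produit est arrive casse", "je suis content de mon achat"]

# Vectorize with the same n-gram range used elsewhere in the thread, then fit a
# simple linear classifier on the labeled sample.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(labeled_docs, labels)

# Assign one of the pre-created topics to every unlabeled document.
predicted = clf.predict(unlabeled_docs)
```

Since the topic set is fixed in advance, this setup guarantees that no new topics appear, which is exactly what the question asks for.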

MaartenGr avatar Jul 31 '22 05:07 MaartenGr

Thanks for your answer !

I have one last question, about guided BERTopic

I have tried to guide my model as shown here (https://maartengr.github.io/BERTopic/getting_started/guided/guided.html), but my model does not seem to be affected at all by the words I specify (I am freezing UMAP, and I obtain the exact same topics with or without guiding BERTopic). Is it possible to set a weight for this seed_topic_list in BERTopic (as with the Corex algorithm, for example)?

AlexisV-hub avatar Aug 02 '22 16:08 AlexisV-hub

I have tried to guide my model as shown here (https://maartengr.github.io/BERTopic/getting_started/guided/guided.html), but my model does not seem to be affected at all by the words I specify (I am freezing UMAP, and I obtain the exact same topics with or without guiding BERTopic).

Could you share the exact code you are using? More specifically, you will need to set an embedding_model in order for guided topic modeling to work.

Is it possible to set a weight for this seed_topic_list in BERTopic (as with the Corex algorithm, for example)?

That is currently not possible, but it might be something worthwhile to add in an upcoming release. I cannot make any promises, though, since the seeded topic list is added to both the embeddings and the c-TF-IDF representations.

MaartenGr avatar Aug 03 '22 06:08 MaartenGr

seed_topic_list = [
    ["casse", "sav", "defectueux"],
    ["produit conforme", "content"],
    ["produit non conforme", "decevant", "mauvaise taille"],
    ["livraison", "reception"],
    ["prix", "qualite prix", "tarif", "pas cher"],
    ["retour", "remboursement"],
    ["facturation", "paiement"],
]

topic_model = BERTopic(
    seed_topic_list=seed_topic_list,
    umap_model=frozen_model,
    min_topic_size=120,
    n_gram_range=(1, 3),
    verbose=True,
    language="french",
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
)

topics, probs = topic_model.fit_transform(docs)

Hi Maarten, thanks for your answer.

Indeed, I think I misunderstood an essential notion: I thought that by choosing language='french' I was choosing an embedding_model intended for French.

I think I misunderstood the documentation; could you explain the difference between embedding_model and language?

AlexisV-hub avatar Aug 03 '22 12:08 AlexisV-hub

Indeed, I think I misunderstood an essential notion: I thought that by choosing language='french' I was choosing an embedding_model intended for French. I think I misunderstood the documentation; could you explain the difference between embedding_model and language?

You are correct. Whenever you use language, BERTopic will automatically select an embedding model, so that does not seem to be the problem in your code.

topic_model = BERTopic(
    seed_topic_list=seed_topic_list,
    umap_model=frozen_model,
    min_topic_size=120,
    n_gram_range=(1, 3),
    verbose=True,
    language="french",
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
)

Could you share how you created the frozen_model variable? It may have something to do with that.

MaartenGr avatar Aug 04 '22 08:08 MaartenGr

Thanks for the answer! When I first tried to freeze UMAP by creating my own instance with only a random_state, BERTopic returned bad results because the UMAP settings were no longer the same: it used the defaults of the umap-learn library, which are different from those BERTopic uses internally.

So I froze the UMAP settings inside BERTopic as follows:

from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(random_state=100)  # first attempt: uses umap-learn's defaults, not BERTopic's

Bert = BERTopic()
Bert.umap_model.random_state = 100  # keep BERTopic's default UMAP settings, just fix the seed
frozen_model = Bert.umap_model

AlexisV-hub avatar Aug 04 '22 09:08 AlexisV-hub

I just tried out the following and did find a difference between the seeded and non-seeded topic models:

from bertopic import BERTopic
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))["data"]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

seed_topic_list = [["drug", "cancer", "drugs", "doctor"],
                   ["windows", "drive", "dos", "file"],
                   ["space", "launch", "orbit", "lunar"]]

# Seeded model
seeded_model = BERTopic(umap_model=umap_model, seed_topic_list=seed_topic_list)
seeded_topics, seeded_probs = seeded_model.fit_transform(docs)

# Non-seeded model
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)

Then, I compared topic_model.get_topic_info() with seeded_model.get_topic_info() to find differences in output between models. So it seems that seeded topic modeling works but I am not entirely sure why it is not working for you. Perhaps the seeded words are not accurate representations of the topics in the documents.

MaartenGr avatar Aug 05 '22 06:08 MaartenGr

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr avatar Sep 27 '22 08:09 MaartenGr