
Saving a model takes forever.. How to avoid?

Open shsh88 opened this issue 3 years ago • 4 comments

Thank you for this great tool and your support.

I trained a BERTopic model on 7917 Reddit posts with calculate_probabilities=True. Training took about 6 minutes, but saving has now been running for more than 24 hours! Is that normal?

Thanks,

shsh88, Sep 07 '22 07:09

No, it's not normal, pal. I dunno why's that happening to you.

diegopaucarv, Sep 07 '22 22:09

Could you share your entire code for training and saving your BERTopic model? Also, which version of BERTopic are you using?

MaartenGr, Sep 08 '22 08:09

This is the code. I commented out the line os.environ["TOKENIZERS_PARALLELISM"] = "false" and now the model saves!

I'll test with other variations as well

import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

#import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

training_dataset = pd.read_csv('./data/train_all.csv')

training_dataset.title_text = training_dataset.title_text.astype(str)
docs = training_dataset.title_text.to_list()

vectorizer_model = CountVectorizer(# ngram_range=(1, 2), 
                                   stop_words='english')

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(calculate_probabilities=True,
                       vectorizer_model=vectorizer_model,
                       embedding_model=sentence_model,
                       # diversity=0.2
                       )

topic_model.fit_transform(docs, embeddings)

topic_model.save("model", save_embedding_model=True)
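For comparing variations, a small timing helper can show which step is slow and what the environment variable was set to at that moment. This is only a sketch and not part of the original script: `timed` is a hypothetical helper name, and it only measures wall-clock time; it does not explain why the setting affects saving.

```python
import os
import time


def timed(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # Report the setting alongside the timing so runs can be compared.
    setting = os.environ.get("TOKENIZERS_PARALLELISM", "<unset>")
    print(f"{label}: {elapsed:.1f}s (TOKENIZERS_PARALLELISM={setting})")
    return result


# Possible usage with the objects defined in the script above:
# timed("fit", topic_model.fit_transform, docs, embeddings)
# timed("save", topic_model.save, "model", save_embedding_model=True)
```

Running the whole script once with the os.environ line active and once with it commented out, and comparing the printed "save" timings, would confirm whether that line is what makes the difference.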

shsh88, Sep 08 '22 08:09

Glad to hear that you found the solution! This will definitely help out others having similar issues.

MaartenGr, Sep 10 '22 08:09