Saving a model takes forever. How can I avoid this?
Thank you for this great tool and your support.
I have trained a BERTopic model on 7,917 Reddit posts with calculate_probabilities=True.
Training the model took ~6 minutes, but saving has now been running for more than 24 hours!
Is that normal?
Thanks,
No, that's not normal. I'm not sure why that's happening to you.
Could you share your entire code for training and saving your BERTopic model? Also, which version of BERTopic are you using?
This is the code.
I commented out the line
os.environ["TOKENIZERS_PARALLELISM"] = "false"
and now the model saves!
I'll test with other variations as well
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

# import os
# os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load the training data and make sure every document is a string
training_dataset = pd.read_csv("./data/train_all.csv")
training_dataset.title_text = training_dataset.title_text.astype(str)
docs = training_dataset.title_text.to_list()

vectorizer_model = CountVectorizer(  # ngram_range=(1, 2),
    stop_words="english")

# Pre-compute the sentence embeddings once, then pass them to fit_transform
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(calculate_probabilities=True,
                       vectorizer_model=vectorizer_model,
                       embedding_model=sentence_model,
                       # diversity=0.2
                       )
topic_model.fit_transform(docs, embeddings)
topic_model.save("model", save_embedding_model=True)
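For context on why that environment variable matters: HuggingFace's tokenizers library reads TOKENIZERS_PARALLELISM to decide whether to use its internal thread pool, and parallel tokenization can interact badly with processes forked later on (which is why it is often set to "false" to silence warnings). A minimal sketch of how such a boolean environment flag is typically parsed; the helper name and the accepted spellings here are illustrative, not the library's actual code:

```python
import os

def tokenizers_parallelism_enabled(default=True):
    """Illustrative reader for TOKENIZERS_PARALLELISM: treat common
    'false-y' spellings as disabled, anything else as enabled."""
    value = os.environ.get("TOKENIZERS_PARALLELISM")
    if value is None:
        return default
    return value.strip().lower() not in ("false", "0", "no", "off")

os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(tokenizers_parallelism_enabled())  # False while the variable is "false"
```

The key point is that the flag only takes effect if it is set before the tokenizer starts encoding, which is why commenting it out (or setting it) changes behavior for the whole run.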
Glad to hear that you found the solution! This will definitely help out others having similar issues.