Google Colab session crashes in cuML HDBSCAN during fit
I am using cuML's HDBSCAN for GPU-accelerated clustering because I have 1 million short docs, and I need to run with calculate_probabilities=True to get topic distributions.
My code is as follows:

```python
from google.colab import userdata
from cuml.cluster import HDBSCAN

api_key = 'openai-key'
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)
docs = df['body']
topic_modeler = TopicModel()
topics = topic_modeler.train_topic_model(docs,
                                         embeddings=embeddings,
                                         min_topic_size=5,
                                         nr_topics=None,
                                         calculate_probabilities=True,
                                         hdbscan_model=hdbscan_model,
                                         representation_model='openAI',
                                         api_key=api_key,
                                         save_path=MODEL_PATH,
                                         save_file='topic_model_min_topics_5_no_dc_subset')
```
I see the following messages:

```
2025-05-22 12:32:05,122 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-22 12:38:00,201 - BERTopic - Dimensionality - Completed ✓
2025-05-22 12:38:00,211 - BERTopic - Cluster - Start clustering the reduced embeddings
```
But a couple of minutes after clustering starts, the kernel crashes without any error message.
Please find attached the logs. Any help would be appreciated.
I edited your post a bit to make sure the code is properly formatted (tip: use ``` tags to format your code in markdown).
It is tricky to say without seeing what happens inside TopicModel, or knowing which versions of cuML and BERTopic you are using. Could you share that information?
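A quick way to collect those versions (a minimal sketch using only the standard library; the `report_versions` helper is just an illustrative name) is to query package metadata directly in the Colab session:

```python
from importlib.metadata import version, PackageNotFoundError

def report_versions(packages=("cuml", "bertopic")):
    """Return {package: installed version, or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = None
    return out

# Paste the printed dict into your reply.
print(report_versions())
```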
That said, it does indeed seem related to cuML here. Perhaps it is a result of a specific cuML version, or perhaps you have too little VRAM, since calculating full probabilities over 1 million documents is memory-hungry. Which GPU are you using?
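To answer the GPU question, one way (a sketch that assumes an NVIDIA driver is present, as it is on a Colab GPU runtime) is to query `nvidia-smi` for the device name and total VRAM:

```python
import shutil
import subprocess

def gpu_info():
    """Return the GPU name and total memory as reported by nvidia-smi,
    or None if no NVIDIA driver is available on this machine."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"],
        capture_output=True, text=True,
    )
    return result.stdout

print(gpu_info())
```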