Lack of Reproducibility in CTM
- OCTIS version: 1.10.3
- Python version: 3.8.10
- Operating System: Ubuntu 20.04 LTS
Description
I think CTM does not provide reproducibility, i.e., it has no random_state or random_seed parameter. I get different results with the same dataset and the same parameters. It would be very useful to have a random_state in CTM, similar to the LDA and NMF models.
Hello, I can implement this for the next release. In the meantime, you can try to follow the instructions provided for the original implementation of CTM to set the random seed:
import torch
import numpy as np
import random

# fix the seeds of all relevant random number generators
torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)

# make cuDNN behave deterministically
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True
Source: https://colab.research.google.com/drive/10Z1g7stkKNqfszwCicZOL3QlOK--RuDE?usp=sharing#scrollTo=SZmTpQUov8y8
Hope it helps,
Silvia
Hi Silvia, I've tried this, but it doesn't work for me:
import torch
import numpy as np
import random

torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.similarity_metrics import RBO
from octis.evaluation_metrics.topic_significance_metrics import KL_uniform, KL_vacuous, KL_background
from octis.models.CTM import CTM

# dataset
DATA_DIR = '../data/cleantech/'
dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATA_DIR)

# model
model = CTM(num_topics=8, use_partitions=False, activation='softplus', dropout=0.7,
            inference_type='combined', lr=1e-3, model_type='LDA', momentum=0.99,
            reduce_on_plateau=False)
trained_model = model.train_model(dataset, top_words=30)

# metrics
cv = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_v')
top_div = TopicDiversity(topk=10)
rbo = RBO(topk=10)
klb = KL_background()
metrics = {
    'Coherence': cv.score(trained_model),
    'Topic Diversity': top_div.score(trained_model),
    'Ranked-Biased Overlap': rbo.score(trained_model),
    'KL Background': klb.score(trained_model)
}
print(metrics)
What am I doing wrong?
Hi @dip-gupta, I ran your code twice on the M10 dataset available in OCTIS and got the same results and the same topics. One possible reason is that you didn't reset the seed before the second run. Make sure you run the following code each time before you train the model:
torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True
# then init model and train
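The effect of re-seeding can be illustrated with a minimal standard-library sketch (using Python's `random` module as a stand-in for the torch/NumPy generators; the same principle applies to them):

```python
import random

def draw(n=5):
    # stand-in for a stochastic step such as model training
    return [random.random() for _ in range(n)]

# seed once, run twice: the second run continues the generator's
# stream, so the two results differ
random.seed(10)
first = draw()
second = draw()
assert first != second

# re-seed before each run: both runs consume the same stream,
# so the results are identical
random.seed(10)
run_a = draw()
random.seed(10)
run_b = draw()
assert run_a == run_b
```

This is why training CTM a second time in the same session without re-running the seeding code produces different topics.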
From v1.11.0, CTM takes a "seed" parameter, i.e. the random seed. This should guarantee reproducibility for CTM. Feel free to re-open the issue if you run into further problems.
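With the new parameter, the seed is passed at model construction; a minimal configuration sketch (other hyperparameter values are placeholders):

```python
from octis.models.CTM import CTM

# seed fixes the random state internally (OCTIS >= 1.11.0), so the
# manual torch/numpy/random seeding above is no longer required
model = CTM(num_topics=8, use_partitions=False, seed=10)
```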
Silvia