
Lack of Reproducibility in CTM

Open berksudan opened this issue 3 years ago

  • OCTIS version: 1.10.3
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04 LTS

Description

I think CTM is not reproducible: it doesn't accept a random_state or random_seed parameter, so I get different results with the same dataset and the same parameters. It would be very good to have a random_state in CTM, similar to the LDA and NMF models.
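For illustration, this is roughly what I have in mind (a sketch; LDA already exposes random_state, while the equivalent CTM keyword is hypothetical here):

from octis.models.LDA import LDA
from octis.models.CTM import CTM

# LDA already takes a seed, so two runs give identical topics:
lda = LDA(num_topics=8, random_state=42)

# Hypothetical equivalent for CTM (this parameter does not exist yet):
# ctm = CTM(num_topics=8, random_state=42)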

berksudan avatar Jul 21 '22 16:07 berksudan

Hello, I can implement this for the next release. In the meantime, you can try to follow the instructions provided for the original implementation of CTM to set the random seed:

import torch
import numpy as np
import random

# seed every RNG that CTM's training touches
torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)

# force cuDNN into deterministic behavior
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

Source: https://colab.research.google.com/drive/10Z1g7stkKNqfszwCicZOL3QlOK--RuDE?usp=sharing#scrollTo=SZmTpQUov8y8

Hope it helps,

Silvia

silviatti avatar Jul 26 '22 10:07 silviatti

Hi Silvia, I've tried this but it doesn't work for me:

import torch
import numpy as np
import random

torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.similarity_metrics import RBO
from octis.evaluation_metrics.topic_significance_metrics import KL_background
from octis.models.CTM import CTM

# load the dataset before anything references it
DATA_DIR = '../data/cleantech/'
dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATA_DIR)

# train the model first, then score it
model = CTM(num_topics=8, use_partitions=False, activation='softplus', dropout=0.7,
            inference_type='combined', lr=1e-3, model_type='LDA', momentum=0.99,
            reduce_on_plateau=False)
trained_model = model.train_model(dataset, top_words=30)

# metrics
cv = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_v')
top_div = TopicDiversity(topk=10)
rbo = RBO(topk=10)
klb = KL_background()
metrics = {
    'Coherence': cv.score(trained_model),
    'Topic Diversity': top_div.score(trained_model),
    'Rank-Biased Overlap': rbo.score(trained_model),
    'KL Background': klb.score(trained_model)
}
print(metrics)

What am I doing wrong?

dip-gupta avatar Nov 07 '22 13:11 dip-gupta

Hi @dip-gupta, I ran your code twice on the M10 dataset available in OCTIS and got the same results, same topics. One possible reason is that you didn't reset the seed the second time. Make sure you run the following code each time, right before you train the model:

torch.manual_seed(10)
torch.cuda.manual_seed(10)
np.random.seed(10)
random.seed(10)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

# then init model and train
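One convenient pattern is to wrap the reset in a small helper and call it before every run (a sketch; set_seed is a name introduced here for illustration, not an OCTIS function):

def set_seed(seed=10):
    # reset every RNG that affects CTM training
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True

set_seed(10)  # call again before each new training run
# then init model and train as above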

silviatti avatar Nov 20 '22 10:11 silviatti

From v1.11.0, CTM takes as input a parameter "seed", which sets the random seed. This should guarantee reproducibility for CTM. Feel free to re-open this issue if the problem persists.
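For example (a sketch, assuming OCTIS >= 1.11.0; the seed parameter is the one introduced in that release):

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

dataset = Dataset()
dataset.fetch_dataset("M10")  # built-in OCTIS dataset mentioned above

# two runs with the same seed should produce identical topics
run_a = CTM(num_topics=8, use_partitions=False, seed=10).train_model(dataset)
run_b = CTM(num_topics=8, use_partitions=False, seed=10).train_model(dataset)
assert run_a["topics"] == run_b["topics"]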

Silvia

silviatti avatar Jan 07 '23 23:01 silviatti