
Issues in TNTM model debugging

Open williamlhy opened this issue 1 year ago • 2 comments

When I tried to use the TNTM model, I got the following error. Code:

from stream_topic.models import TNTM
from stream_topic.utils import TMDataset
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="TNTM")
model = TNTM()
model.fit(dataset)

Error:

/usr/local/lib/python3.10/dist-packages/stream_topic/models/abstract_helper_models/base.py in prepare_embeddings(self, dataset, logger)
    226                 f"--- Creating {self.embedding_model_name} document embeddings ---"
    227             )
--> 228             embeddings = self.encode_documents(
    229                 dataset.texts, encoder_model=self.embedding_model_name, use_average=True
    230             )
AttributeError: 'TNTM' object has no attribute 'encode_documents'

I then added the SentenceEncodingMixin class to the TNTM model class and fixed a few issues in the umap_model setup (a rough sketch of the change follows). Re-running the training code produced the error shown below the sketch:
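A minimal sketch of what I mean by "adding the mixin"; the import paths are my assumption based on the package layout, not the exact locations in the source:

# Hypothetical patch sketch: let TNTM inherit the mixin that provides
# encode_documents(), which prepare_embeddings() in base.py expects.
# Both import paths below are assumptions, not verified locations in the
# stream_topic source tree.
from stream_topic.models.abstract_helper_models.base import BaseModel
from stream_topic.models.abstract_helper_models.mixins import SentenceEncodingMixin

class TNTM(BaseModel, SentenceEncodingMixin):
    # ... existing TNTM implementation unchanged ...
    pass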

2024-12-19 15:48:07.837 | INFO     | stream_topic.models.abstract_helper_models.base:prepare_embeddings:225 - --- Creating /hongyi/stream/sentence-transformers/all-MiniLM-L6-v2 document embeddings ---
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2225/2225 [00:54<00:00, 40.89it/s]
2024-12-19 15:49:02.694 | INFO     | stream_topic.models.tntm:_initialize_datamodule:371 - --- Initializing Datamodule for TNTM ---
2024-12-19 15:49:02.964 | INFO     | stream_topic.models.tntm:_prepare_word_embeddings:335 - --- Creating /hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2 word embeddings ---
Batches: 100% 253/253 [00:01<00:00, 129.29it/s]
/hongyi/STREAM/stream_topic/models/neural_base_models/tntm_base.py:61: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.word_embeddings_projected = torch.tensor(word_embeddings_projected)
2024-12-19 15:49:38.776 | INFO     | stream_topic.models.tntm:_initialize_trainer:279 - --- Initializing Trainer for TNTM ---
Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2024-12-19 15:49:38.798 | INFO     | stream_topic.models.tntm:fit:489 - --- Training TNTM topic model ---
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /hongyi/STREAM/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name                    | Type             | Params | Mode 
---------------------------------------------------------------------
0 | model                   | TNTMBase         | 5.2 M  | train
1 | model.inference_network | InferenceNetwork | 5.2 M  | train
2 | model.mean_bn           | BatchNorm1d      | 10     | train
3 | model.logvar_bn         | BatchNorm1d      | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d      | 16.1 K | train
5 | model.theta_drop        | Dropout          | 0      | train
---------------------------------------------------------------------
5.2 M     Trainable params
8.1 K     Non-trainable params
5.2 M     Total params
20.916    Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0% 0/2 [00:00<?, ?it/s]
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=255` in the `DataLoader` to improve performance.
2024-12-19 15:49:38.955 | ERROR    | stream_topic.models.tntm:fit:496 - Error in training: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[2], line 3
      1 from stream_topic.models import KmeansTM,CEDC, ETM,DCTE,LDA,ProdLDA,NSTM,CTM,CTMNeg,CBC,BERTopicTM,TNTM
      2 model = TNTM(word_embedding_model_name="/hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2",embedding_model_name="/hongyi/stream/sentence-transformers/all-MiniLM-L6-v2")#
----> 3 model.fit(dataset,n_topics=5)#
      5 topics = model.get_topics()
      6 print(topics)

File ~/STREAM/stream_topic/models/tntm.py:493, in TNTM.fit(self, dataset, n_topics, val_size, lr, lr_patience, patience, factor, weight_decay, max_epochs, batch_size, shuffle, random_state, inferece_type, checkpoint_path, monitor, mode, trial, optimize, **kwargs)
    490     self._status = TrainingStatus.RUNNING
    491     # self.model.to("cuda:0")
    492     # print(self.model.device)
--> 493     self.trainer.fit(self.model, self.data_module)
    495 except Exception as e:
    496     logger.error(f"Error in training: {e}")

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    541 self.state.status = TrainerStatus.RUNNING
    542 self.training = True
--> 543 call._call_and_handle_interrupt(
    544     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    545 )

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     42     if trainer.strategy.launcher is not None:
     43         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44     return trainer_fn(*args, **kwargs)
     46 except _TunerExitException:
     47     _call_teardown_hook(trainer)

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:579, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    572 assert self.state.fn is not None
    573 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    574     self.state.fn,
    575     ckpt_path,
    576     model_provided=True,
    577     model_connected=self.lightning_module is not None,
    578 )
--> 579 self._run(model, ckpt_path=ckpt_path)
    581 assert self.state.stopped
    582 self.training = False

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:986, in Trainer._run(self, model, ckpt_path)
    981 self._signal_connector.register_signal_handlers()
    983 # ----------------------------
    984 # RUN THE TRAINER
    985 # ----------------------------
--> 986 results = self._run_stage()
    988 # ----------------------------
    989 # POST-Training CLEAN UP
    990 # ----------------------------
    991 log.debug(f"{self.__class__.__name__}: trainer tearing down")

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1028, in Trainer._run_stage(self)
   1026 if self.training:
   1027     with isolate_rng():
-> 1028         self._run_sanity_check()
   1029     with torch.autograd.set_detect_anomaly(self._detect_anomaly):
   1030         self.fit_loop.run()

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1057, in Trainer._run_sanity_check(self)
   1054 call._call_callback_hooks(self, "on_sanity_check_start")
   1056 # run eval step
-> 1057 val_loop.run()
   1059 call._call_callback_hooks(self, "on_sanity_check_end")
   1061 # reset logger connector

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py:182, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
    180     context_manager = torch.no_grad
    181 with context_manager():
--> 182     return loop_run(self, *args, **kwargs)

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:135, in _EvaluationLoop.run(self)
    133     self.batch_progress.is_last_batch = data_fetcher.done
    134     # run step hooks
--> 135     self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
    136 except StopIteration:
    137     # this needs to wrap the `*_step` call too (not just `next`) for `dataloader_iter` support
    138     break

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:396, in _EvaluationLoop._evaluation_step(self, batch, batch_idx, dataloader_idx, dataloader_iter)
    390 hook_name = "test_step" if trainer.testing else "validation_step"
    391 step_args = (
    392     self._build_step_args_from_hook_kwargs(hook_kwargs, hook_name)
    393     if not using_dataloader_iter
    394     else (dataloader_iter,)
    395 )
--> 396 output = call._call_strategy_hook(trainer, hook_name, *step_args)
    398 self.batch_progress.increment_processed()
    400 if using_dataloader_iter:
    401     # update the hook kwargs now that the step method might have consumed the iterator

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:311, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
    308     return None
    310 with trainer.profiler.profile(f"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}"):
--> 311     output = fn(*args, **kwargs)
    313 # restore current_fx when nested context
    314 pl_module._current_fx_name = prev_fx_name

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py:411, in Strategy.validation_step(self, *args, **kwargs)
    409 if self.model != self.lightning_module:
    410     return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
--> 411 return self.lightning_module.validation_step(*args, **kwargs)

File ~/STREAM/stream_topic/models/abstract_helper_models/neural_basemodel.py:46, in NeuralBaseModel.validation_step(self, batch, batch_idx)
     45 def validation_step(self, batch, batch_idx):
---> 46     val_loss = self.model.compute_loss(batch)
     48     self.log(
     49         "val_loss",
     50         val_loss,
   (...)
     54         logger=True,
     55     )
     57     return val_loss

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:215, in TNTMBase.compute_loss(self, x)
    201 """
    202 Computes the loss for the model.
    203 
   (...)
    212     The computed loss.
    213 """
    214 x_bow = x['bow']
--> 215 log_recon, posterior_mean, posterior_logvar = self.forward(x)
    216 loss = self.loss_function(x_bow, log_recon, posterior_mean, posterior_logvar)
    217 return loss

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:143, in TNTMBase.forward(self, x)
    124 """
    125 Forward pass through the network.
    126 
   (...)
    139     The log variance of the variational posterior.
    140 """
    141 theta, posterior_mean, posterior_logvar = self.get_theta(x)
--> 143 log_beta = self.calc_log_beta()
    147 # prodLDA vs LDA
    148 # use numerical trick to compute log(beta @ theta )
    149 log_theta = torch.nn.LogSoftmax(dim=-1)(theta)        #calculate log theta = log_softmax(theta_hat)

File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:112, in TNTMBase.calc_log_beta(self)
    109 log_probs = torch.zeros(self.n_topics, self.vocab_size)
    111 for i, dis in enumerate(normal_dis_lis):
--> 112     log_probs[i] = dis.log_prob(self.word_embeddings_projected)
    113 return log_probs

File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py:214, in LowRankMultivariateNormal.log_prob(self, value)
    212 if self._validate_args:
    213     self._validate_sample(value)
--> 214 diff = value - self.loc
    215 M = _batch_lowrank_mahalanobis(
    216     self._unbroadcasted_cov_factor,
    217     self._unbroadcasted_cov_diag,
    218     diff,
    219     self._capacitance_tril,
    220 )
    221 log_det = _batch_lowrank_logdet(
    222     self._unbroadcasted_cov_factor,
    223     self._unbroadcasted_cov_diag,
    224     self._capacitance_tril,
    225 )

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Finally, I tried moving both self.model and its parameters to "cuda:0", but it still reported the same error.
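For reference, the UserWarning above shows that word_embeddings_projected is stored as a plain tensor attribute in tntm_base.py. Plain tensor attributes are not moved by module.to(device) or by Lightning; only registered parameters and buffers are transferred, which would explain why moving self.model to "cuda:0" did not help. An untested sketch of a possible fix, keeping the attribute names from the traceback:

# Untested sketch, in TNTMBase.__init__: register the projected word embeddings
# as a buffer so they move to the GPU together with the module.
self.register_buffer(
    "word_embeddings_projected",
    torch.as_tensor(word_embeddings_projected, dtype=torch.float32),
)

# ... and in TNTMBase.calc_log_beta(): allocate log_probs on the same device,
# so the per-topic assignments below do not mix cuda:0 and cpu tensors.
log_probs = torch.zeros(
    self.n_topics,
    self.vocab_size,
    device=self.word_embeddings_projected.device,
)
for i, dis in enumerate(normal_dis_lis):
    log_probs[i] = dis.log_prob(self.word_embeddings_projected)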

williamlhy · Dec 19 '24 07:12

@AnFreTh Could you take a look at this issue?

williamlhy · Dec 19 '24 08:12

The SentenceEncoding issue is fixed on main. However, I currently cannot recreate the device issue. Since we are using lightning and all tensors are usually transferred to the same device, I am not sure where this issue might come from. I'll try to recreate it once I am on a machine with a GPU and will revisit this issue then.

AnFreTh · Jan 23 '25 22:01