
new size-match errors after commenting out lines linked to labels

Vz09 opened this issue 3 years ago • 2 comments

    > Hi @Vz09, as a quick fix, since you do not have any labels anyway, you could comment out the following lines (lines 373-374 of src/clustering_models/clusternet_modules/clusternetasmodel.py): `init_nmi = normalized_mutual_info_score(gt, init_labels)` and `init_ari = adjusted_rand_score(gt, init_labels)`. We will upload an official fix in the future.

That said, please verify (e.g. in a debugger) that the number of samples you are training on (self.codes) equals the number of samples in your dataset; if it does not, there may be a problem with the dimension configuration, and that will need more attention.
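The two suggestions above can be sketched as small helpers (the names `maybe_eval_init_metrics` and `check_sample_count` are illustrative, not functions from the DeepDPM codebase): guard the metric computation instead of deleting it outright, and assert the sample count up front.

```python
def maybe_eval_init_metrics(gt, init_labels, use_labels_for_eval):
    """Compute NMI/ARI only when real labels exist; otherwise skip them."""
    if not use_labels_for_eval:
        return None, None  # no ground truth: nothing to evaluate
    # sklearn is only needed on this path, so import it lazily
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
    return (normalized_mutual_info_score(gt, init_labels),
            adjusted_rand_score(gt, init_labels))


def check_sample_count(codes, dataset_size):
    """Fail fast if the training tensor disagrees with the dataset size."""
    assert codes.shape[0] == dataset_size, (
        f"training on {codes.shape[0]} samples, dataset has {dataset_size}"
    )
```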

Thank you for your reply! Do the print statements I added to the CustomDataset class (below) check that the number of samples I am training on equals the number of samples in the dataset? If so, they are equal (74770).

class CustomDataset(MyDataset):
    def __init__(self, args):
        super().__init__(args)
        self.transformer = transforms.Compose([transforms.ToTensor()])
        self._data_dim = 0
    
    def get_train_data(self):
        train_codes = torch.Tensor(torch.load(os.path.join(self.data_dir, "train_data.pt")))
        if self.args.transform_input_data:
            train_codes = transform_embeddings(self.args.transform_input_data, train_codes)
        if self.args.use_labels_for_eval:
            train_labels = torch.load(os.path.join(self.data_dir, "train_labels.pt"))
        else:
            train_labels = torch.zeros((train_codes.size()[0]))
        self._data_dim = train_codes.size()[1]
        print("train_codes.size()", train_codes.size())
        print("train_labels.size()", train_labels.size())
        train_set = TensorDatasetWrapper(train_codes, train_labels)
        del train_codes
        del train_labels
        return train_set

I commented out the two lines you mentioned, which indeed allowed training to continue to epoch 25; then another error occurred:

train_codes.size() torch.Size([74770, 128])
train_labels.size() torch.Size([74770])
Sequential()
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: The `pytorch_lightning.loggers.base.DummyLogger` is deprecated in v1.7 and will be removed in v1.9. Please use `pytorch_lightning.loggers.logger.DummyLogger` instead.
  return new_rank_zero_deprecation(*args, **kwargs)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
/usr/local/lib/python3.8/dist-packages/torch/utils/hooks.py:59: UserWarning: backward hook <function Subclustering_net.__init__.<locals>.<lambda> at 0x7f6643376af0> on tensor will not be serialized.  If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
  warnings.warn("backward hook {} on tensor will not be "
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:616: UserWarning: Checkpoint directory /mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/checkpoints exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py:381: RuntimeWarning: Found unsupported keys in the optimizer configuration: {'scheduler'}
  rank_zero_warn(

  | Name              | Type              | Params
--------------------------------------------------------
0 | cluster_net       | MLP_Classifier    | 6.5 K 
1 | subclustering_net | Subclustering_net | 6.6 K 
--------------------------------------------------------
13.1 K    Trainable params
0         Non-trainable params
13.1 K    Total params
0.052     Total estimated model params size (MB)
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:203: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/99 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/99 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/99 [00:19<31:47, 19.46s/it]
Epoch 0:   1%|          | 1/99 [00:19<31:47, 19.46s/it, loss=nan]
Epoch 0:   2%|▏         | 2/99 [00:19<15:44,  9.74s/it, loss=nan]
Epoch 0:   2%|▏         | 2/99 [00:19<15:44,  9.74s/it, loss=nan]
[... intermediate log skipped ...]
Epoch 25:  74%|███████▎  | 73/99 [00:22<00:07,  3.31it/s, loss=0]
Epoch 25:  74%|███████▎  | 73/99 [00:22<00:07,  3.31it/s, loss=0]
Epoch 25:  75%|███████▍  | 74/99 [00:22<00:07,  3.35it/s, loss=0]
Epoch 25:  75%|███████▍  | 74/99 [00:22<00:07,  3.35it/s, loss=0]

Validation: 0it [00:00, ?it/s][A

Validation:   0%|          | 0/25 [00:00<?, ?it/s][A

Validation DataLoader 0:   0%|          | 0/25 [00:00<?, ?it/s][A

Validation DataLoader 0:   4%|▍         | 1/25 [00:00<00:00, 173.97it/s][A
Epoch 25:  76%|███████▌  | 75/99 [00:44<00:14,  1.70it/s, loss=0]

Validation DataLoader 0:   8%|▊         | 2/25 [00:00<00:00, 134.73it/s][A
Epoch 25:  77%|███████▋  | 76/99 [00:44<00:13,  1.72it/s, loss=0]

Validation DataLoader 0:  12%|█▏        | 3/25 [00:00<00:00, 98.68it/s] [A
Epoch 25:  78%|███████▊  | 77/99 [00:44<00:12,  1.74it/s, loss=0]

Validation DataLoader 0:  16%|█▌        | 4/25 [00:00<00:00, 58.51it/s][A
Epoch 25:  79%|███████▉  | 78/99 [00:44<00:11,  1.76it/s, loss=0]

Validation DataLoader 0:  20%|██        | 5/25 [00:00<00:00, 67.64it/s][A
Epoch 25:  80%|███████▉  | 79/99 [00:44<00:11,  1.78it/s, loss=0]

Validation DataLoader 0:  24%|██▍       | 6/25 [00:00<00:00, 76.02it/s][A
Epoch 25:  81%|████████  | 80/99 [00:44<00:10,  1.81it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1


Validation DataLoader 0:  28%|██▊       | 7/25 [00:00<00:00, 79.10it/s][A
Epoch 25:  82%|████████▏ | 81/99 [00:44<00:09,  1.83it/s, loss=0]

Validation DataLoader 0:  32%|███▏      | 8/25 [00:00<00:00, 75.19it/s][A
Epoch 25:  83%|████████▎ | 82/99 [00:44<00:09,  1.85it/s, loss=0]

Validation DataLoader 0:  36%|███▌      | 9/25 [00:00<00:00, 73.14it/s][A
Epoch 25:  84%|████████▍ | 83/99 [00:44<00:08,  1.87it/s, loss=0]

Validation DataLoader 0:  40%|████      | 10/25 [00:00<00:00, 70.16it/s][A
Epoch 25:  85%|████████▍ | 84/99 [00:44<00:07,  1.89it/s, loss=0]

Validation DataLoader 0:  44%|████▍     | 11/25 [00:00<00:00, 73.41it/s][A
Epoch 25:  86%|████████▌ | 85/99 [00:44<00:07,  1.91it/s, loss=0]

Validation DataLoader 0:  48%|████▊     | 12/25 [00:00<00:00, 77.05it/s][A
Epoch 25:  87%|████████▋ | 86/99 [00:44<00:06,  1.94it/s, loss=0]

Validation DataLoader 0:  52%|█████▏    | 13/25 [00:00<00:00, 77.63it/s][A
Epoch 25:  88%|████████▊ | 87/99 [00:44<00:06,  1.96it/s, loss=0]

Validation DataLoader 0:  56%|█████▌    | 14/25 [00:00<00:00, 76.06it/s][A
Epoch 25:  89%|████████▉ | 88/99 [00:44<00:05,  1.98it/s, loss=0]

Validation DataLoader 0:  60%|██████    | 15/25 [00:00<00:00, 78.72it/s][A
Epoch 25:  90%|████████▉ | 89/99 [00:44<00:04,  2.00it/s, loss=0]

Validation DataLoader 0:  64%|██████▍   | 16/25 [00:00<00:00, 76.24it/s][A
Epoch 25:  91%|█████████ | 90/99 [00:44<00:04,  2.02it/s, loss=0]

Validation DataLoader 0:  68%|██████▊   | 17/25 [00:00<00:00, 73.49it/s][A
Epoch 25:  92%|█████████▏| 91/99 [00:44<00:03,  2.05it/s, loss=0]

Validation DataLoader 0:  72%|███████▏  | 18/25 [00:00<00:00, 72.70it/s][A
Epoch 25:  93%|█████████▎| 92/99 [00:44<00:03,  2.07it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1


Validation DataLoader 0:  76%|███████▌  | 19/25 [00:00<00:00, 71.92it/s][A
Epoch 25:  94%|█████████▍| 93/99 [00:44<00:02,  2.09it/s, loss=0]codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1


Validation DataLoader 0:  80%|████████  | 20/25 [00:00<00:00, 73.28it/s][A
Epoch 25:  95%|█████████▍| 94/99 [00:44<00:02,  2.11it/s, loss=0]

Validation DataLoader 0:  84%|████████▍ | 21/25 [00:00<00:00, 75.34it/s][A
Epoch 25:  96%|█████████▌| 95/99 [00:44<00:01,  2.13it/s, loss=0]

Validation DataLoader 0:  88%|████████▊ | 22/25 [00:00<00:00, 77.29it/s][A
Epoch 25:  97%|█████████▋| 96/99 [00:44<00:01,  2.16it/s, loss=0]

Validation DataLoader 0:  92%|█████████▏| 23/25 [00:00<00:00, 79.32it/s][A
Epoch 25:  98%|█████████▊| 97/99 [00:44<00:00,  2.18it/s, loss=0]

Validation DataLoader 0:  96%|█████████▌| 24/25 [00:00<00:00, 81.38it/s][A
Epoch 25:  99%|█████████▉| 98/99 [00:44<00:00,  2.20it/s, loss=0]

Validation DataLoader 0: 100%|██████████| 25/25 [00:00<00:00, 83.41it/s][A
Epoch 25: 100%|██████████| 99/99 [00:44<00:00,  2.22it/s, loss=0]
Epoch 25: 100%|██████████| 99/99 [00:44<00:00,  2.22it/s, loss=0]

codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
codes.size():  torch.Size([3154, 128])
logits.size():  torch.Size([9347, 1])
K:  1
Traceback (most recent call last):
  File "DeepDPM.py", line 456, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 436, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 286, in on_advance_end
    epoch_end_outputs = self.trainer._call_lightning_module_hook("training_epoch_end", epoch_end_outputs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 448, in training_epoch_end
    ) = self.training_utils.comp_cluster_params(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/training_utils.py", line 146, in comp_cluster_params
    mus = compute_mus(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 278, in compute_mus
    mus = compute_mus_soft_assignment(codes, logits, K)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 227, in compute_mus_soft_assignment
    [
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 228, in <listcomp>
    (logits[:, k].reshape(-1, 1) * codes).sum(axis=0) / denominator[k]
RuntimeError: The size of tensor a (9347) must match the size of tensor b (3154) at non-singleton dimension 0
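For context, the RuntimeError above boils down to a first-dimension mismatch between `logits` and `codes` (9347 vs 3154 rows). A minimal, self-contained sketch of the soft-assignment mean computation, simplified from the traceback (the function name and the explicit assert are illustrative, not the repo's exact code):

```python
import torch

def compute_mus_soft(codes: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """Soft-assignment cluster means: logits [N, K], codes [N, D] -> mus [K, D]."""
    assert logits.size(0) == codes.size(0), (
        f"sample dims differ: logits {logits.size(0)} vs codes {codes.size(0)}"
    )
    denominator = logits.sum(dim=0)  # [K], soft counts per cluster
    return torch.stack([
        (logits[:, k].reshape(-1, 1) * codes).sum(dim=0) / denominator[k]
        for k in range(logits.size(1))
    ])
```

With matching first dimensions this is well defined; with logits of 9347 rows against codes of 3154 rows, the elementwise product cannot broadcast, which is exactly the error reported.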

I then tried commenting out the following lines of clusternetasmodel.py; training continued until epoch 44, and then another error occurred:

if not freeze_mus:
    (
        self.pi,
        self.mus,
        self.covs,
    ) = self.training_utils.comp_cluster_params(
        self.train_resp,
        self.codes.view(-1, self.codes_dim),
        self.pi,
        self.K,
        self.prior,
    )
Epoch 44:  74%|███████▎  | 73/99 [00:25<00:09,  2.86it/s, loss=0]
Epoch 44:  74%|███████▎  | 73/99 [00:25<00:09,  2.86it/s, loss=0]
Epoch 44:  75%|███████▍  | 74/99 [00:25<00:08,  2.90it/s, loss=0]
Epoch 44:  75%|███████▍  | 74/99 [00:25<00:08,  2.90it/s, loss=0]

Validation: 0it [00:00, ?it/s][A

Validation:   0%|          | 0/25 [00:00<?, ?it/s][A

Validation DataLoader 0:   0%|          | 0/25 [00:00<?, ?it/s][A

Validation DataLoader 0:   4%|▍         | 1/25 [00:00<00:00, 149.64it/s][A
Epoch 44:  76%|███████▌  | 75/99 [00:48<00:15,  1.56it/s, loss=0]

Validation DataLoader 0:   8%|▊         | 2/25 [00:01<00:13,  1.67it/s] [A
Epoch 44:  77%|███████▋  | 76/99 [00:49<00:14,  1.54it/s, loss=0]

Validation DataLoader 0:  12%|█▏        | 3/25 [00:01<00:12,  1.80it/s][A
Epoch 44:  78%|███████▊  | 77/99 [00:49<00:14,  1.55it/s, loss=0]

Validation DataLoader 0:  16%|█▌        | 4/25 [00:01<00:08,  2.38it/s][A
Epoch 44:  79%|███████▉  | 78/99 [00:49<00:13,  1.57it/s, loss=0]

Validation DataLoader 0:  20%|██        | 5/25 [00:01<00:06,  2.93it/s][A
Epoch 44:  80%|███████▉  | 79/99 [00:49<00:12,  1.59it/s, loss=0]

Validation DataLoader 0:  24%|██▍       | 6/25 [00:01<00:05,  3.45it/s][A
Epoch 44:  81%|████████  | 80/99 [00:49<00:11,  1.61it/s, loss=0]

Validation DataLoader 0:  28%|██▊       | 7/25 [00:01<00:04,  4.00it/s][A
Epoch 44:  82%|████████▏ | 81/99 [00:49<00:11,  1.63it/s, loss=0]

Validation DataLoader 0:  32%|███▏      | 8/25 [00:01<00:03,  4.55it/s][A
Epoch 44:  83%|████████▎ | 82/99 [00:49<00:10,  1.65it/s, loss=0]

Validation DataLoader 0:  36%|███▌      | 9/25 [00:01<00:03,  5.08it/s][A
Epoch 44:  84%|████████▍ | 83/99 [00:49<00:09,  1.67it/s, loss=0]

Validation DataLoader 0:  40%|████      | 10/25 [00:01<00:02,  5.58it/s][A
Epoch 44:  85%|████████▍ | 84/99 [00:49<00:08,  1.69it/s, loss=0]

Validation DataLoader 0:  44%|████▍     | 11/25 [00:01<00:02,  6.07it/s][A
Epoch 44:  86%|████████▌ | 85/99 [00:49<00:08,  1.71it/s, loss=0]

Validation DataLoader 0:  48%|████▊     | 12/25 [00:01<00:01,  6.51it/s][A
Epoch 44:  87%|████████▋ | 86/99 [00:49<00:07,  1.73it/s, loss=0]

Validation DataLoader 0:  52%|█████▏    | 13/25 [00:01<00:01,  6.99it/s][A
Epoch 44:  88%|████████▊ | 87/99 [00:49<00:06,  1.74it/s, loss=0]

Validation DataLoader 0:  56%|█████▌    | 14/25 [00:01<00:01,  7.47it/s][A
Epoch 44:  89%|████████▉ | 88/99 [00:49<00:06,  1.76it/s, loss=0]

Validation DataLoader 0:  60%|██████    | 15/25 [00:01<00:01,  7.93it/s][A
Epoch 44:  90%|████████▉ | 89/99 [00:49<00:05,  1.78it/s, loss=0]

Validation DataLoader 0:  64%|██████▍   | 16/25 [00:01<00:01,  8.39it/s][A
Epoch 44:  91%|█████████ | 90/99 [00:49<00:04,  1.80it/s, loss=0]

Validation DataLoader 0:  68%|██████▊   | 17/25 [00:01<00:00,  8.84it/s][A
Epoch 44:  92%|█████████▏| 91/99 [00:49<00:04,  1.82it/s, loss=0]

Validation DataLoader 0:  72%|███████▏  | 18/25 [00:01<00:00,  9.21it/s][A
Epoch 44:  93%|█████████▎| 92/99 [00:49<00:03,  1.84it/s, loss=0]

Validation DataLoader 0:  76%|███████▌  | 19/25 [00:01<00:00,  9.62it/s][A
Epoch 44:  94%|█████████▍| 93/99 [00:49<00:03,  1.86it/s, loss=0]

Validation DataLoader 0:  80%|████████  | 20/25 [00:01<00:00, 10.02it/s][A
Epoch 44:  95%|█████████▍| 94/99 [00:49<00:02,  1.88it/s, loss=0]

Validation DataLoader 0:  84%|████████▍ | 21/25 [00:02<00:00, 10.41it/s][A
Epoch 44:  96%|█████████▌| 95/99 [00:50<00:02,  1.90it/s, loss=0]

Validation DataLoader 0:  88%|████████▊ | 22/25 [00:02<00:00, 10.86it/s][A
Epoch 44:  97%|█████████▋| 96/99 [00:50<00:01,  1.92it/s, loss=0]

Validation DataLoader 0:  92%|█████████▏| 23/25 [00:02<00:00, 11.22it/s][A
Epoch 44:  98%|█████████▊| 97/99 [00:50<00:01,  1.94it/s, loss=0]

Validation DataLoader 0:  96%|█████████▌| 24/25 [00:02<00:00, 11.67it/s][A
Epoch 44:  99%|█████████▉| 98/99 [00:50<00:00,  1.96it/s, loss=0]

Validation DataLoader 0: 100%|██████████| 25/25 [00:02<00:00, 12.13it/s][A
Epoch 44: 100%|██████████| 99/99 [00:50<00:00,  1.98it/s, loss=0]
Epoch 44: 100%|██████████| 99/99 [00:50<00:00,  1.98it/s, loss=0]

Traceback (most recent call last):
  File "DeepDPM.py", line 456, in <module>
    train_cluster_net()
  File "DeepDPM.py", line 436, in train_cluster_net
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 103, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 129, in _wrapping_function
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 286, in on_advance_end
    epoch_end_outputs = self.trainer._call_lightning_module_hook("training_epoch_end", epoch_end_outputs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/clusternetasmodel.py", line 462, in training_epoch_end
    ) = self.training_utils.init_subcluster_params(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/training_utils.py", line 190, in init_subcluster_params
    mus, covs, pis = init_mus_and_covs_sub(
  File "/mnt/iribhm/people/zzhang/DeepDPM/DeepDPM-main_vincent4/src/clustering_models/clusternet_modules/utils/clustering_utils/clustering_operations.py", line 134, in init_mus_and_covs_sub
    codes_k = codes[indices_k]
IndexError: The shape of the mask [9347] at index 0 does not match the shape of the indexed tensor [3154, 128] at index 0

I don't know whether commenting out more code affects the clustering results. Could you explain more about the dimension-configuration problem, please? How should I check or fix it?

Thank you.

Sincerely,

Zhao

Originally posted by @Vz09 in https://github.com/BGU-CS-VIL/DeepDPM/issues/34#issuecomment-1279844675

Vz09 avatar Nov 02 '22 12:11 Vz09

Hello,

Unless this issue is solved, one simply cannot use DeepDPM in the complete absence of labelled data, i.e. for a truly unsupervised problem.

If the goal of the package is to enable unsupervised exploratory analysis, one should be able to run it on completely unlabelled data.

All the best.

vdet avatar Nov 14 '22 16:11 vdet

Hi,

@vdet - our method is completely unsupervised and runs without labels; see, for example, past issues from users who did so successfully. The labels are used only for clustering evaluation, not for training.

@Vz09 - we pushed another fix today; please use the most up-to-date version. There is also no longer a need to supply mock labels, so if you have a labels.pt file, please delete it. Also, for your data, make sure the embeddings are saved such that the first dimension is the data size and the second is the dimensionality, i.e. embedding_tensor.size()[0] is the number of samples and embedding_tensor.size()[1] is the data's dimensionality.
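A quick way to check the orientation described above (the random tensor here is a stand-in for the user's saved train_data.pt; the transpose heuristic assumes far more samples than feature dimensions):

```python
import torch

emb = torch.randn(74770, 128)  # stand-in for torch.load(".../train_data.pt")

# First dimension should be the number of samples, second the dimensionality.
n_samples, dim = emb.size(0), emb.size(1)

# If the tensor was accidentally saved as [D, N], transpose before saving.
if emb.size(0) < emb.size(1):
    emb = emb.t().contiguous()
```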

Also, would you mind running our code on one GPU to see if this problem replicates?

We're closing the previous issue since you opened this new one. Hope this helps; let us know either way.

meitarronen avatar Nov 27 '22 18:11 meitarronen

In case anybody else hits a similar size-match error: mine turned out to be caused by the wrong version of pytorch_lightning. After installing pytorch_lightning==1.2.10, as pinned in requirements.txt, the error disappeared.
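If you hit the same error, pinning the version from requirements.txt before running may help (an environment-setup step, not DeepDPM code; the v1.6/v1.7 deprecation warnings in the log above suggest a much newer Lightning was installed):

```shell
# Install the Lightning version pinned in DeepDPM's requirements.txt
pip install "pytorch_lightning==1.2.10"
# Confirm the interpreter picks up the pinned version
python -c "import pytorch_lightning; print(pytorch_lightning.__version__)"
```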

wingvortex avatar Oct 09 '23 07:10 wingvortex