What is the proper procedure for finetuning a pretrained model on a custom dataset using this library?
Checklist
- [X] I have searched for similar issues.
- [X] I have tested with the latest development wheel.
- [X] I have checked the release documentation and the latest documentation (for the master branch).
My Question
Hello everyone,
First of all, here are the versions I am running:
- Python 3.8
- Open3D 0.15.2
- Torch 1.8.2+cpu
I would like to finetune RandLA-Net (pretrained on S3DIS) on my own custom point cloud dataset for indoor scene semantic segmentation. I have managed to load the dataset correctly and run inference with the randlanet_s3dis_202201071330utc.pth checkpoint. I have also modified the final layer's weight and bias (fc1.3.conv.weight and fc1.3.conv.bias) in model_state_dict so that they only include the weights for my classes of interest (following the solution from #277). However, when I bring this modified checkpoint in for training, I run into an issue:
RuntimeError: The size of tensor a (13) must match the size of tensor b (3) at non-singleton dimension 0.
I know this error is occurring because S3DIS contains 13 classes, while I am trying to train on only 3 of these. So, I must have forgotten some necessary modification. I will further elaborate below.
Below is how I manually sliced the weights of the last layer, since I am only interested in walls, windows, and doors. However, I am not sure whether this is technically the right approach, so please let me know if I need to make other modifications in order to finetune:
import torch
import os, sys, logging
import numpy as np
# Load pre-trained model checkpoint
checkpoint = torch.load('randlanet_s3dis_202201071330utc.pth', map_location=torch.device('cpu'))
# ckpt_mod is the same dict object as checkpoint; the slicing below modifies it in place
ckpt_mod = checkpoint
# Modify pretrained model weights for fine-tuning
# Slice out weights for the 3 classes of interest (walls, windows, doors)
ckpt_mod['model_state_dict']['fc1.3.conv.bias'] = checkpoint['model_state_dict']['fc1.3.conv.bias'][np.r_[2,5,6]]
ckpt_mod['model_state_dict']['fc1.3.conv.weight'] = checkpoint['model_state_dict']['fc1.3.conv.weight'][np.r_[2,5,6]]
# Save modified weights
save_mod = dict(epoch=checkpoint['epoch'],
                model_state_dict=ckpt_mod['model_state_dict'],
                optimizer_state_dict=checkpoint['optimizer_state_dict'],
                scheduler_state_dict=checkpoint['scheduler_state_dict'])
torch.save(save_mod, 'randlanet_finetune.pth')
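As a quick sanity check (my own addition, not part of the original post), printing the shapes of the sliced tensors should now show three output channels:

print(ckpt_mod['model_state_dict']['fc1.3.conv.weight'].shape)  # expected: torch.Size([3, 32, 1, 1])
print(ckpt_mod['model_state_dict']['fc1.3.conv.bias'].shape)    # expected: torch.Size([3])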
Below is my training script. As you can see, I froze all parameters except those of the last layer, i.e. fc1.3.conv.weight and fc1.3.conv.bias:
import torch
import open3d.ml as _ml3d
import open3d.ml.torch as ml3d
import numpy as np
from ml3d.datasets.customdataset import Custom3D
cfg_file = "ml3d/configs/randlanet_s3dis.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)
model = ml3d.models.RandLANet(**cfg.model)
cfg.dataset['dataset_path'] = '/mnt/d/auto_class/Open3D-ML-master/NIST_data/NPY'
def freeze_all_but_last(model):
    # named_parameters yields (parameter name: string, parameter: tensor) tuples
    for n, p in model.named_parameters():
        if 'fc1.3' not in n:
            p.requires_grad = False
freeze_all_but_last(model)
dataset = Custom3D(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="cpu", **cfg.pipeline)
ckpt_path = 'randlanet_finetune.pth'
pipeline.load_ckpt(ckpt_path=ckpt_path)
pipeline.run_train()
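As an aside, a quick way to confirm the freeze worked as intended (this check is my own addition, not part of the original script):

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expected to list only the fc1.3 parameters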
However, when I run the above training script, I get the following error:
INFO - 2022-08-09 16:04:56,085 - semantic_segmentation - DEVICE : cpu
INFO - 2022-08-09 16:04:56,086 - semantic_segmentation - Logging in file : ./logs/RandLANet_Custom3D_torch/log_train_2022-08-09_16:04:56.txt
INFO - 2022-08-09 16:04:56,090 - customdataset - Found 70 pointclouds for train
INFO - 2022-08-09 16:04:56,091 - customdataset - Found 9 pointclouds for validation
INFO - 2022-08-09 16:04:56,096 - semantic_segmentation - Loading checkpoint randlanet_finetune.pth
INFO - 2022-08-09 16:04:56,241 - semantic_segmentation - Loading checkpoint optimizer_state_dict
INFO - 2022-08-09 16:04:56,265 - semantic_segmentation - Loading checkpoint scheduler_state_dict
INFO - 2022-08-09 16:04:56,277 - semantic_segmentation - Writing summary in train_log/00023_RandLANet_Custom3D_torch.
INFO - 2022-08-09 16:04:56,278 - semantic_segmentation - Started training
INFO - 2022-08-09 16:04:56,279 - semantic_segmentation - === EPOCH 0/200 ===
training: 0%| | 0/35 [00:03<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 pipeline.run_train()
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/open3d/_ml3d/torch/pipelines/semantic_segmentation.py:421, in SemanticSegmentation.run_train(self)
418 if model.cfg.get('grad_clip_norm', -1) > 0:
419 torch.nn.utils.clip_grad_value_(model.parameters(),
420 model.cfg.grad_clip_norm)
--> 421 self.optimizer.step()
423 self.metric_train.update(predict_scores, gt_labels)
425 self.losses.append(loss.cpu().item())
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:65, in _LRScheduler.__init__.<locals>.with_counter.<locals>.wrapper(*args, **kwargs)
63 instance._step_count += 1
64 wrapped = func.__get__(instance, cls)
---> 65 return wrapped(*args, **kwargs)
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/optimizer.py:89, in Optimizer._hook_for_profile.<locals>.profile_hook_step.<locals>.wrapper(*args, **kwargs)
87 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
88 with torch.autograd.profiler.record_function(profile_name):
---> 89 return func(*args, **kwargs)
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.__class__():
---> 27 return func(*args, **kwargs)
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/adam.py:108, in Adam.step(self, closure)
105 state_steps.append(state['step'])
107 beta1, beta2 = group['betas']
--> 108 F.adam(params_with_grad,
109 grads,
110 exp_avgs,
111 exp_avg_sqs,
112 max_exp_avg_sqs,
113 state_steps,
114 group['amsgrad'],
115 beta1,
116 beta2,
117 group['lr'],
118 group['weight_decay'],
119 group['eps'])
120 return loss
File ~/miniconda3/envs/open3D/lib/python3.8/site-packages/torch/optim/_functional.py:84, in adam(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps)
81 grad = grad.add(param, alpha=weight_decay)
83 # Decay the first and second moment running average coefficient
---> 84 exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
85 exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
86 if amsgrad:
87 # Maintains the maximum of all 2nd moment running avg. till now
RuntimeError: The size of tensor a (13) must match the size of tensor b (3) at non-singleton dimension 0
Based on where the error is pointing, I assume this has something to do with the optimizer. Do I need to make any modifications to optimizer_state_dict as I did to the model_state_dict? Any help is appreciated, as being able to successfully implement the library would be a big breakthrough in my own research. Please let me know if I have left out any important details.
Well, I managed to figure this out by modifying optimizer_state_dict as follows:
# Slice Adam's running-moment buffers (exp_avg, exp_avg_sq) for the final layer,
# i.e. optimizer state entries 196 and 197 (the final layer's weight and bias here)
ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][196]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][196]['exp_avg_sq'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg'][np.r_[2,5,6]]
ckpt_mod['optimizer_state_dict']['state'][197]['exp_avg_sq'] = checkpoint['optimizer_state_dict']['state'][197]['exp_avg_sq'][np.r_[2,5,6]]
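Rather than hard-coding the entry indices 196 and 197, one could also find and slice every Adam moment buffer whose leading dimension still equals the old class count. This is just a sketch of mine, assuming the final layer's tensors are the only ones in the optimizer state with a leading dimension of 13:

keep = [2, 5, 6]  # wall, window, door
for st in ckpt_mod['optimizer_state_dict']['state'].values():
    if 'exp_avg' in st and st['exp_avg'].shape[0] == 13:
        st['exp_avg'] = st['exp_avg'][keep]
        st['exp_avg_sq'] = st['exp_avg_sq'][keep]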
Now I'm running into a new weird issue. I get Loss train: nan eval: nan at each epoch. Any suggestions?
Hi @eliasm56, I have run into the same bug. Have you solved this problem?
To fix this issue, you should modify the model architecture so that the last layer has the correct number of output channels. You can do this by updating the num_classes parameter when creating the RandLANet model instance. In your case, set num_classes=3 to match the three classes you are training on. Below is the updated code:
import torch
import open3d.ml as _ml3d
import open3d.ml.torch as ml3d
import numpy as np
from ml3d.datasets.customdataset import Custom3D

cfg_file = "ml3d/configs/randlanet_s3dis.yml"
cfg = _ml3d.utils.Config.load_from_file(cfg_file)

# Update the number of classes for the last layer
cfg.model.num_classes = 3

model = ml3d.models.RandLANet(**cfg.model)
cfg.dataset['dataset_path'] = '/mnt/d/auto_class/Open3D-ML-master/NIST_data/NPY'

def freeze_all_but_last(model):
    # named_parameters yields (parameter name: string, parameter: tensor) tuples
    for n, p in model.named_parameters():
        if 'fc1.3' not in n:
            p.requires_grad = False

freeze_all_but_last(model)

dataset = Custom3D(cfg.dataset.pop('dataset_path', None), **cfg.dataset)
pipeline = ml3d.pipelines.SemanticSegmentation(model, dataset=dataset, device="cpu", **cfg.pipeline)

ckpt_path = 'randlanet_finetune.pth'
pipeline.load_ckpt(ckpt_path=ckpt_path)
pipeline.run_train()
@AdityaRajThakur hey! I tried out your code but am still getting the following error:
RuntimeError: Error(s) in loading state_dict for RandLANet:
size mismatch for fc1.3.conv.weight: copying a param with shape torch.Size([13, 32, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 32, 1, 1]).
size mismatch for fc1.3.conv.bias: copying a param with shape torch.Size([13]) from checkpoint, the shape in current model is torch.Size([3]).
It seems that setting num_classes=3 builds the model with a 3-channel last layer, but the checkpoint being loaded is still sized for 13 classes. Am I missing a step? Thanks!
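For anyone debugging this, one way to narrow it down (my own suggestion, not from the thread; it assumes model is the 3-class RandLANet instance built above) is to compare the final-layer shapes in the checkpoint file that gets loaded against those in the model:

import torch

ckpt = torch.load('randlanet_finetune.pth', map_location='cpu')
for k in ('fc1.3.conv.weight', 'fc1.3.conv.bias'):
    print(k, 'checkpoint:', tuple(ckpt['model_state_dict'][k].shape),
          'model:', tuple(model.state_dict()[k].shape))
# If the checkpoint side still shows 13, the unsliced S3DIS weights are being loaded.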