
[bug] How to fine-tune the classification model (size mismatch for head.bias: copying a param with shape torch.Size([1000]) from checkpoint, the shape in current model is torch.Size([4]).)

Open jeonga0303 opened this issue 1 year ago • 4 comments

[screenshot]

I customized config.py. How do I train a fine-tuned classification model?

[screenshot]

jeonga0303 avatar Jun 05 '24 06:06 jeonga0303

How do I convert Nc1 (the checkpoint's 1000-class head) to Nc2 (my 4 classes)?

[screenshots]

jeonga0303 avatar Jun 10 '24 01:06 jeonga0303

[screenshot]

I changed the files in the following order.

I'm still training; the dataset is large, so I'll share the results later.

  1. Download the pretrained .pth file.

  2. config.py:

_C.DATA.IMG_SIZE = 224
_C.MODEL.PRETRAINED = 'internimage_b_1k_224.pth'
_C.MODEL.NUM_CLASSES = 4

  3. util.py (Nc1: 1000 -> Nc2: 4): modify the load_pretrained function (a standalone sketch of the same idea follows the command list below).
    if 'head.bias' in state_dict:
        head_bias_pretrained = state_dict['head.bias']
        Nc1 = head_bias_pretrained.shape[0]   # number of classes in the checkpoint (1000)
        Nc2 = model.head.bias.shape[0]        # number of classes in the current model (4)
        logger.info(f'{Nc1}, {Nc2}')
        if Nc1 != Nc2:
            # Re-initialize the classification head from zeros and drop the
            # pretrained head tensors so the remaining weights load cleanly.
            model.head.weight = torch.nn.Parameter(torch.zeros_like(model.head.weight))
            model.head.bias = torch.nn.Parameter(torch.zeros_like(model.head.bias))
            state_dict.pop('head.weight', None)
            state_dict.pop('head.bias', None)
  4. dataset/samplers.py: modify the __iter__ method of the sampler.
def __iter__(self):
        # deterministically shuffle based on epoch
        g = torch.Generator()
        g.manual_seed(self.epoch)

        t = torch.Generator()
        t.manual_seed(0)

        indices = torch.randperm(len(self.dataset), generator=t).tolist()
        indices = [i for i in indices if i % self.num_parts == self.rank]

        # add extra samples to make it evenly divisible
        while len(indices) < self.total_size_parts:
            indices += indices[:(self.total_size_parts - len(indices))]
        
        indices = indices[:self.total_size_parts]
        assert len(indices) == self.total_size_parts, f'Length of indices ({len(indices)}) does not match total_size_parts ({self.total_size_parts})'

        # subsample
        indices = indices[self.rank // self.num_parts:self.total_size_parts:self.num_replicas // self.num_parts]

        index = torch.randperm(len(indices), generator=g).tolist()
        indices = list(np.array(indices)[index])

        assert len(indices) == self.num_samples, f'Length of indices ({len(indices)}) does not match num_samples ({self.num_samples})'

        return iter(indices)
  5. Run the training command:
     python -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --data-path [data-path] --pretrained internimage_b_1k_224.pth --batch-size 120
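For readers hitting the same head.bias size mismatch, step 3 boils down to dropping the pretrained 1000-class head from the checkpoint and letting the new head train from scratch. A standalone sketch of that idea (the function name and the 'model' key are assumptions, not the repo's exact API; adapt them to your local util.py):

import torch

# Sketch only: drop the checkpoint's classifier head, load the rest non-strictly.
def load_backbone_only(model, ckpt_path, num_classes):
    ckpt = torch.load(ckpt_path, map_location='cpu')
    state_dict = ckpt.get('model', ckpt)  # InternImage checkpoints usually nest weights under 'model'
    # Remove the 1000-class ImageNet head so it cannot clash with the new head.
    for key in ('head.weight', 'head.bias'):
        state_dict.pop(key, None)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print('missing keys:', missing)        # expected: only the new head tensors
    print('unexpected keys:', unexpected)  # expected: []
    assert model.head.weight.shape[0] == num_classes
    return model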

My GPUs are 2x A100.

  • If you use a huge dataset, use the following command instead (a sketch of what --accumulation-steps does follows below):
    python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --batch-size 256 --accumulation-steps 4 --pretrained internimage_b_1k_224.pth --data-path [data-path] --local-rank 1 --output work_dirs
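For context, --accumulation-steps trades memory for effective batch size by summing gradients from several small batches before each optimizer step. A minimal, self-contained illustration of the concept (a toy model, not the repo's actual training loop):

import torch
from torch import nn

# Effective batch size = batch_size * accumulation_steps.
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accumulation_steps = 4

optimizer.zero_grad()
for step in range(16):
    images = torch.randn(2, 8)               # tiny "micro-batch"
    targets = torch.randint(0, 4, (2,))
    loss = criterion(model(images), targets) / accumulation_steps
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one update per 4 micro-batches
        optimizer.zero_grad()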

2024.06.11: training succeeded (image classification fine-tuning).

[screenshot]

jeonga0303 avatar Jun 10 '24 02:06 jeonga0303

[bug]

Training does not seem to be making progress; the loss stays the same every epoch. May I know the reason?

[screenshot]

jeonga0303 avatar Jun 11 '24 02:06 jeonga0303
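One generic way to narrow this down (not specific to InternImage) is to confirm that the classifier head actually changes between optimizer steps and that the scheduler is not holding the learning rate near zero during warmup. A small diagnostic sketch, with hypothetical helper names:

# Snapshot the head before a few iterations, then check whether it actually moved.
def head_snapshot(model):
    return {name: p.detach().clone()
            for name, p in model.named_parameters() if name.startswith('head')}

def head_changed(before, model, eps=1e-8):
    after = head_snapshot(model)
    return any((after[name] - before[name]).abs().max() > eps for name in before)

# Also worth printing the live learning rate each epoch; a long warmup on a
# small dataset can keep it near zero for many epochs:
# print(optimizer.param_groups[0]['lr'])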

Hello, how did you put the model on the GPUs? I get Segmentation fault (core dumped) when running model.cuda() on a 3090 GPU. So sad...

LovelySimon avatar May 25 '25 07:05 LovelySimon
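For what it's worth, a segmentation fault inside model.cuda() usually points to an environment problem (for example a PyTorch build that does not match the installed CUDA toolkit, or the compiled DCNv3 ops built against a different one) rather than the model code itself. A quick generic check, nothing repo-specific:

import torch

# Environment sanity check.
print(torch.__version__)            # PyTorch build
print(torch.version.cuda)           # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())    # can the driver/runtime be reached at all?

# If a plain module moves to the GPU and runs, the crash is more likely coming
# from the compiled extension ops than from .cuda() itself.
layer = torch.nn.Linear(4, 4).cuda()
out = layer(torch.randn(2, 4, device='cuda'))
print(out.shape)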