
Question regarding loading pretrained weights

Open purbayankar opened this issue 4 years ago • 9 comments

How can I load pretrained weights provided in your repository during training time?

purbayankar avatar Jan 20 '22 04:01 purbayankar

Use [COMMAND] --resume-from {A MODEL PATH} or [COMMAND] --cfg-options load_from={A MODEL PATH}. The former loads both the model and the optimizer state; the latter loads only the model weights.
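The same choice can also be expressed in the config file rather than on the command line. The fragment below is a minimal sketch, assuming an MMDetection 2.x-style Python config with top-level `load_from` and `resume_from` keys; the checkpoint path is a placeholder, not a file shipped with the repository.

```python
# Sketch of a config override (assumed MMDetection 2.x-style keys).
# The path is a hypothetical placeholder; substitute your own checkpoint.
load_from = "path/to/pretrained_checkpoint.pth"  # weights only; training starts from iteration 0
resume_from = None  # set this instead to also restore optimizer state and training progress
```

Set only one of the two keys, depending on whether you want to fine-tune from the weights or continue an interrupted run.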

MendelXu avatar Jan 20 '22 10:01 MendelXu

Should I add this in the script file dist_train_partially.sh?

purbayankar avatar Jan 20 '22 11:01 purbayankar

No. Just append it to your command. For example: bash tools/dist_train_partially.sh semi 0 10 8 --resume-from {A MODEL PATH}.

MendelXu avatar Jan 20 '22 13:01 MendelXu

I followed your instructions, but I am getting this error:

Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/tools/ssod/apis/train.py", line 205, in train_detector
    runner.load_checkpoint(cfg.load_from)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 337, in load_checkpoint
    return load_checkpoint(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 528, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 467, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 245, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 262, in load_from_local
    checkpoint = torch.load(filename, map_location=map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
index created!
2022-01-20 19:19:52,119 - mmdet.ssod - INFO - load checkpoint from local path: /hdd/purbayan/SoftTeacher/pretrained_weights/1/split-1/iter_180000.pth
Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/tools/ssod/apis/train.py", line 205, in train_detector
    runner.load_checkpoint(cfg.load_from)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 337, in load_checkpoint
    return load_checkpoint(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 528, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 467, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 245, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 262, in load_from_local
    checkpoint = torch.load(filename, map_location=map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3938940) of binary: /hdd/purbayan/envs/env_st/bin/python
Traceback (most recent call last):
  File "/hdd/purbayan/envs/env_st/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/hdd/purbayan/envs/env_st/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-01-20_19:19:59
  host      : insrisrvsr-0275
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3938941)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-01-20_19:19:59
  host      : insrisrvsr-0275
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3938940)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

purbayankar avatar Jan 20 '22 13:01 purbayankar

It seems the model file is corrupted. Which model did you try? I will try it on my machine.
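A quick way to check a downloaded checkpoint before launching training is to look at the file itself. The sketch below uses only the standard library; `diagnose_checkpoint` is a hypothetical helper, not part of SoftTeacher or mmcv. It also shows that unpickling an empty file produces the same `EOFError: Ran out of input` seen in the traceback above.

```python
import os
import pickle
import tempfile


def diagnose_checkpoint(path):
    """Hypothetical helper: classify a checkpoint file without importing torch."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        head = f.read(2)
    # torch.save in PyTorch >= 1.6 writes a zip archive by default ("PK" magic bytes).
    # A file without that magic may still be a valid legacy-format checkpoint.
    return "zip archive" if head == b"PK" else "not a zip archive"


# Create an empty stand-in for a failed download.
with tempfile.NamedTemporaryFile(suffix=".pth", delete=False) as tmp:
    empty_path = tmp.name

# Unpickling an empty file raises EOFError, as in the traceback.
try:
    with open(empty_path, "rb") as f:
        pickle.load(f)
    error_message = None
except EOFError as exc:
    error_message = str(exc)

status = diagnose_checkpoint(empty_path)
os.remove(empty_path)
print(status, "|", error_message)
```

If the helper reports "empty" or "missing" for your checkpoint path, re-download the file before retrying.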

MendelXu avatar Jan 21 '22 06:01 MendelXu

I have tried this one https://drive.google.com/drive/folders/1QA8sAw49DJiMHF-Cr7q0j7KgKjlJyklV. These are the checkpoints for 1% labelled data provided in your repository.

purbayankar avatar Jan 21 '22 06:01 purbayankar

It seems some files were corrupted because my Google One subscription expired. I have updated the models. Could you try again? https://drive.google.com/file/d/1dUWoWDmYqNBx6lko59xrs2ZMGGuzn_5y/view?usp=sharing

MendelXu avatar Jan 21 '22 06:01 MendelXu

I have tried this file: https://drive.google.com/file/d/1dUWoWDmYqNBx6lko59xrs2ZMGGuzn_5y/view?usp=sharing. It works perfectly now. Thank you very much for the prompt responses. Are the weight files in the repository updated now?

purbayankar avatar Jan 21 '22 09:01 purbayankar

I have checked, and the generated link has not changed.

MendelXu avatar Jan 21 '22 10:01 MendelXu