SoftTeacher Question regarding loading pretrained weights

How can I load the pretrained weights provided in your repository when training?

Jan 20 '22 04:01 purbayankar

Use [COMMAND] --resume-from {A MODEL PATH} or [COMMAND] --cfg-options load_from={A MODEL PATH}. The former loads both the model and the optimizer state; the latter loads only the model weights.
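As a rough sketch of the second option: assuming the training config follows the usual MMDetection 2.x layout, where `load_from` and `resume_from` are top-level keys, the same effect as `--cfg-options load_from=...` could probably be achieved by setting the key in a config file. The path below is a placeholder for illustration, not a real file.

```python
# Fragment of a hypothetical config file (MMDetection-style Python config).
# Assumption: the config exposes top-level `load_from` / `resume_from` keys.

# Initialize the model from these weights; the optimizer starts fresh.
load_from = "path/to/checkpoint.pth"  # placeholder path

# Leave unset unless continuing an interrupted run
# (that would also restore the optimizer state).
resume_from = None
```

Passing the value on the command line, as described above, avoids editing any file.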

Jan 20 '22 10:01 MendelXu

Should I add this to the script file dist_train_partially.sh?

Jan 20 '22 11:01 purbayankar

No. Just append it to your command. For example: bash tools/dist_train_partially.sh semi 0 10 8 --resume-from {A MODEL PATH}.

Jan 20 '22 13:01 MendelXu

I followed your instructions, but I am getting this error:

Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/tools/ssod/apis/train.py", line 205, in train_detector
    runner.load_checkpoint(cfg.load_from)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 337, in load_checkpoint
    return load_checkpoint(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 528, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 467, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 245, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 262, in load_from_local
    checkpoint = torch.load(filename, map_location=map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
index created!
2022-01-20 19:19:52,119 - mmdet.ssod - INFO - load checkpoint from local path: /hdd/purbayan/SoftTeacher/pretrained_weights/1/split-1/iter_180000.pth
Traceback (most recent call last):
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/hdd/purbayan/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/hdd/purbayan/SoftTeacher/tools/ssod/apis/train.py", line 205, in train_detector
    runner.load_checkpoint(cfg.load_from)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 337, in load_checkpoint
    return load_checkpoint(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 528, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 467, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 245, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/mmcv/runner/checkpoint.py", line 262, in load_from_local
    checkpoint = torch.load(filename, map_location=map_location)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3938940) of binary: /hdd/purbayan/envs/env_st/bin/python
Traceback (most recent call last):
  File "/hdd/purbayan/envs/env_st/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/hdd/purbayan/envs/env_st/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/hdd/purbayan/envs/env_st/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-01-20_19:19:59
  host      : insrisrvsr-0275
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3938941)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-01-20_19:19:59
  host      : insrisrvsr-0275
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3938940)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Jan 20 '22 13:01 purbayankar

It seems the model file is corrupted. Which model did you try? I will try it on my machine.
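A quick way to check a downloaded checkpoint before launching training is sketched below. It uses only the standard library, and the function name `inspect_checkpoint` is made up for this example. It relies on two assumptions: an empty file is a typical cause of `EOFError: Ran out of input` when unpickling, and checkpoints written by recent PyTorch versions with default settings are zip archives. A result of "other" is therefore not proof of corruption, since older checkpoints use a legacy non-zip format.

```python
import os
import zipfile


def inspect_checkpoint(path):
    """Return a coarse status string for a checkpoint file.

    "missing": the path is not a regular file
    "empty":   the file has zero bytes (unpickling it raises EOFError)
    "zip":     the file is a zip archive (default format of recent torch.save)
    "other":   anything else (legacy format, or a damaged download)
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    if zipfile.is_zipfile(path):
        return "zip"
    return "other"
```

For example, `inspect_checkpoint("pretrained_weights/1/split-1/iter_180000.pth")` returning "empty" would point to a failed download rather than a problem with the training command.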

Jan 21 '22 06:01 MendelXu

I tried this one: https://drive.google.com/drive/folders/1QA8sAw49DJiMHF-Cr7q0j7KgKjlJyklV. These are the checkpoints for 1% labelled data provided in your repository.

Jan 21 '22 06:01 purbayankar

It seems some files were corrupted because my Google One subscription expired. I have updated the models. Could you try again? https://drive.google.com/file/d/1dUWoWDmYqNBx6lko59xrs2ZMGGuzn_5y/view?usp=sharing

Jan 21 '22 06:01 MendelXu

I have tried this file: https://drive.google.com/file/d/1dUWoWDmYqNBx6lko59xrs2ZMGGuzn_5y/view?usp=sharing. It is working perfectly now. Thank you very much for the prompt responses. Are the weight files in the repository updated now?

Jan 21 '22 09:01 purbayankar

I have checked, and the generated link has not changed.

Jan 21 '22 10:01 MendelXu