PermissionError with ModelCheckpoints
Hi, I'm trying to train a model and am getting this error:
Traceback (most recent call last):
  File "/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/single_run.py", line 91, in <module>
    main(cfg.cfg)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/hydra/main.py", line 83, in decorated_main
    return task_function(cfg_passthrough)
  File "/home/jovyan/talmolab-smb/aadi/biogtr_expts/src/biogtr/biogtr/training/train.py", line 101, in main
    trainer.fit(model, dataset)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 203, in run
    self.on_advance_end()
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 374, in on_advance_end
    call._call_callback_hooks(trainer, "on_train_epoch_end", monitoring_callbacks=True)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 314, in on_train_epoch_end
    self._save_last_checkpoint(trainer, monitor_candidates)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 679, in _save_last_checkpoint
    self._link_checkpoint(trainer, self._last_checkpoint_saved, filepath)
  File "/opt/conda/envs/biogtr/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 397, in _link_checkpoint
    shutil.copy(filepath, linkpath)
  File "/opt/conda/envs/biogtr/lib/python3.9/shutil.py", line 428, in copy
    copymode(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/envs/biogtr/lib/python3.9/shutil.py", line 317, in copymode
    chmod_func(dst, stat.S_IMODE(st.st_mode))
PermissionError: [Errno 1] Operation not permitted: '/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/tests/test_chkpt/epoch=1-best-val_num_switches=36.0.ckpt'
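Reading the last few frames: the checkpoint itself is written fine, and the failure happens while Lightning updates the "last" checkpoint. `_link_checkpoint` falls back from symlinking to `shutil.copy`, and `shutil.copy` is `copyfile` followed by `copymode`, which calls `os.chmod` to replicate the source file's permission bits. SMB/CIFS mounts (the `talmolab-smb` path suggests one) often refuse `chmod` outright, which matches `[Errno 1] Operation not permitted`. A minimal sketch of the difference, with hypothetical paths on the mount:

import shutil

# Hypothetical paths on the SMB mount, for illustration only.
src = "/home/jovyan/talmolab-smb/scratch/a.ckpt"
dst = "/home/jovyan/talmolab-smb/scratch/b.ckpt"

shutil.copyfile(src, dst)  # copies contents only; no chmod involved
shutil.copy(src, dst)      # copyfile + copymode; copymode calls os.chmod,
                           # which a chmod-restricted mount may reject with
                           # PermissionError: [Errno 1] Operation not permitted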
This is how I set up my checkpoints:
# imports needed by this snippet
from pathlib import Path

import pytorch_lightning as pl
from omegaconf import OmegaConf


def get_checkpointing(self) -> list[pl.callbacks.ModelCheckpoint]:
    """Getter for lightning checkpointing callbacks.

    Returns:
        A list of lightning checkpointing callbacks with specified params,
        one per monitored metric.
    """
    # convert to dict to enable extracting/removing params
    checkpoint_params = OmegaConf.to_container(self.cfg.checkpointing, resolve=True)
    logging_params = self.cfg.logging
    if checkpoint_params.get("dirpath") is None:
        if "group" in logging_params:
            dirpath = f"./models/{logging_params.group}/{logging_params.name}"
        else:
            dirpath = f"./models/{logging_params.name}"
    else:
        dirpath = checkpoint_params["dirpath"]

    dirpath = Path(dirpath).resolve()
    if not dirpath.exists():
        try:
            dirpath.mkdir(parents=True, exist_ok=True)
        except OSError as e:
            print(
                f"Cannot create a new folder. Check the permissions to the given checkpoint directory.\n{e}"
            )

    # pop with a default so a missing "dirpath" key doesn't raise KeyError
    checkpoint_params.pop("dirpath", None)
    checkpointers = []
    monitor = checkpoint_params.pop("monitor")
    for metric in monitor:
        checkpointer = pl.callbacks.ModelCheckpoint(
            monitor=metric,
            dirpath=dirpath,
            filename=f"{{epoch}}-{{{metric}}}",
            **checkpoint_params,
        )
        checkpointer.CHECKPOINT_NAME_LAST = f"{{epoch}}-best-{{{metric}}}"
        checkpointers.append(checkpointer)
    return checkpointers
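Until this is resolved upstream, one possible stopgap (untested; a sketch based only on the call site visible in the traceback, `self._link_checkpoint(trainer, filepath, linkpath)`, whose signature may differ between Lightning versions) is to subclass `ModelCheckpoint` and replace the metadata-preserving `shutil.copy` fallback with `shutil.copyfile`, which never calls `chmod`:

import os
import shutil

import pytorch_lightning as pl


class SMBSafeModelCheckpoint(pl.callbacks.ModelCheckpoint):
    """ModelCheckpoint whose 'last' fallback copy skips chmod.

    Hypothetical workaround: overrides the private _link_checkpoint hook;
    the signature is inferred from the traceback above and may change
    across Lightning versions.
    """

    def _link_checkpoint(self, trainer: "pl.Trainer", filepath: str, linkpath: str) -> None:
        if trainer.is_global_zero:
            if os.path.islink(linkpath) or os.path.isfile(linkpath):
                os.remove(linkpath)
            try:
                os.symlink(filepath, linkpath)
            except OSError:
                # shutil.copyfile copies contents only; unlike shutil.copy it
                # never calls os.chmod, which SMB mounts may reject.
                shutil.copyfile(filepath, linkpath)
        trainer.strategy.barrier()

You would then construct `SMBSafeModelCheckpoint(...)` in `get_checkpointing` in place of `pl.callbacks.ModelCheckpoint`.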
It's quite strange because this error never used to happen before.
Originally posted by @aaprasad in https://github.com/Lightning-AI/pytorch-lightning/discussions/19396
cc @carmocca @awaelchli
@aaprasad Does this error happen with the latest version of Lightning (2.1.4)?
Hi, yes it does. I'm currently using:
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.3.101 0 nvidia
cuda-runtime 12.1.0 0 nvidia
cudatoolkit 11.1.74 h6bb024c_0 nvidia
cudnn 8.0.4 cuda11.1_0 nvidia
filelock 3.13.1 pyhd8ed1ab_0 conda-forge
fsspec 2024.2.0 pyhca7485f_0 conda-forge
hydra-core 1.3.2 pypi_0 pypi
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.8.1.2 0 nvidia
libcurand 10.3.4.107 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
lightning 2.1.4 pyhd8ed1ab_0 conda-forge
lightning-utilities 0.10.1 pyhd8ed1ab_0 conda-forge
python 3.9.18 h0755675_1_cpython conda-forge
pytorch 2.2.0 py3.9_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-lightning 2.1.3 pyhd8ed1ab_0 conda-forge
pytorch-mutex 1.0 cuda pytorch
torchmetrics 1.2.1 pyhd8ed1ab_0 conda-forge
torchtriton 2.2.0 py39 pytorch
torchvision 0.17.0 py39_cu121 pytorch
In addition to a bunch of other misc packages (lmk if you need any other versions)
For some more context, these are the config params I'm using for checkpointing:
checkpointing:
  monitor: ['val_num_switches', 'val_loss']
  verbose: True
  save_last: True
  dirpath: '/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/burnt_pancake'
  auto_insert_metric_name: True
  every_n_epochs: 10
  save_top_k: -1
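Worth noting: `save_last: True` is the option that routes every `on_train_epoch_end` through `_save_last_checkpoint` and into the failing `_link_checkpoint` call, so the quickest (if lossy) mitigation may be to turn it off; the best-k checkpoints are still written normally. A hypothetical sketch:

import pytorch_lightning as pl

# Hypothetical mitigation: build the callback without save_last so the
# "last" symlink/copy path is never exercised.
checkpointer = pl.callbacks.ModelCheckpoint(
    monitor="val_loss",
    dirpath="/home/jovyan/talmolab-smb/aadi/biogtr_expts/run/animal/SLAP_M74/models/burnt_pancake",
    verbose=True,
    save_last=False,  # skips _save_last_checkpoint entirely
    auto_insert_metric_name=True,
    every_n_epochs=10,
    save_top_k=-1,
)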
I also checked the checkpoint directory permissions with ls -lah and everything looks fine:
drwxrwxrwx 2 root root 0 Feb 5 18:36 ..
-rwxrwxrwx 1 root root 133M Feb 5 19:16 'epoch=0-best-val_loss=17.41611099243164.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 18:37 'epoch=0-best-val_loss=50.88465881347656.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 18:37 'epoch=0-best-val_num_switches=44.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 19:16 'epoch=0-best-val_num_switches=49.0.ckpt'
-rwxrwxrwx 1 root root 133M Feb 5 19:17 'epoch=1-best-val_num_switches=50.0.ckpt'
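One caveat: on SMB mounts the `rwxrwxrwx` bits can be synthesized by the mount options rather than actually changeable. A quick way to test whether `chmod` works at all there is to re-apply a checkpoint's own mode, which is exactly what `shutil.copymode` attempts in the traceback above:

import os
import stat

# Re-apply an existing checkpoint's own permission bits; this mirrors the
# chmod_func(dst, stat.S_IMODE(st.st_mode)) call that fails in the traceback.
path = "epoch=1-best-val_num_switches=50.0.ckpt"
os.chmod(path, stat.S_IMODE(os.stat(path).st_mode))
# If the mount forbids chmod, this raises
# PermissionError: [Errno 1] Operation not permitted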
@aaprasad Is this still an issue with Lightning 2.2?