Too many open files during fitting
I was running `fit-model` on 18 tomograms using the config.yaml attached below and got the "too many open files" error shown below (~~ulimit is actually unlimited already~~). Have you seen this before, or do you know what the cause might be?
`ulimit -n` was actually only 1024. I increased it to 4096, and that did allow ddw to resume upon re-running the command (instead of failing immediately with the same error). I do have ~1700 subtomograms in each half. However, after validation DDW exits with "Killed" and no other output.
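For completeness, the same limit can be checked and raised from inside Python via the standard `resource` module (just a minimal sketch of what `ulimit -n` does for the current process, not anything DDW does itself):

```python
import resource

# Inspect the current soft/hard limit on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

# Raise the soft limit (the value `ulimit -n` reports),
# assuming the hard limit allows it
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))
```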
DeadlockDetectedException: DeadLock detected from rank: 3
Traceback (most recent call last):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 296, in on_advance_end
    self.trainer._call_lightning_module_hook("on_train_epoch_end")
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 84, in on_train_epoch_end
    self.update_subtomo_missing_wedges()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 125, in update_subtomo_missing_wedges
    for batch in tqdm.tqdm(loader, desc="Updating subtomo missing wedges"):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1160, in _try_get_data
    raise RuntimeError(
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Killed
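As the error message itself suggests, an alternative to raising the descriptor limit is switching PyTorch's tensor-sharing strategy. This is not something DDW sets by default; the snippet below is only a sketch of the call the message refers to:

```python
import torch.multiprocessing

# Use file_system sharing instead of the default file_descriptor strategy,
# so DataLoader workers do not keep one open descriptor per shared tensor.
torch.multiprocessing.set_sharing_strategy("file_system")
```

For reference, here is the attached config.yaml: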
shared:
  project_dir: "."
  tomo0_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_ODD_Vol.mrc"
  tomo1_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_EVN_Vol.mrc"
  subtomo_size: 96
  mw_angle: 92
  num_workers: 32
  gpu: [0, 1, 2, 3]
  seed: 42
prepare_data:
  val_fraction: 0.1
  extract_larger_subtomos_for_rotating: true
  overwrite: true
fit_model:
  unet_params_dict:
    chans: 64
    num_downsample_layers: 3
    drop_prob: 0.0
  adam_params_dict:
    lr: 0.0004
  num_epochs: 1000
  batch_size: 5
  update_subtomo_missing_wedges_every_n_epochs: 10
  check_val_every_n_epochs: 10
  save_n_models_with_lowest_val_loss: 5
  save_n_models_with_lowest_fitting_loss: 5
  save_model_every_n_epochs: 50
  logger: "csv"
refine_tomogram:
  model_checkpoint_file: "logs/version_0/checkpoints/epoch/epoch=999.ckpt"
  subtomo_overlap: 32
  batch_size: 10
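As a side note, with 36 volumes spread over several session directories it may be worth confirming that every listed path resolves before launching. A hypothetical check (assuming PyYAML is installed and the file is saved as `config.yaml`) could look like:

```python
from pathlib import Path

import yaml

# Hypothetical sanity check: confirm every tomogram listed in config.yaml exists
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

missing = [
    p
    for key in ("tomo0_files", "tomo1_files")
    for p in cfg["shared"][key]
    if not Path(p).is_file()
]
print("missing files:", missing if missing else "none")
```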
Reran from scratch using the higher file-descriptor ulimit setting; this time it failed on a NoneType assignment error instead. The system has 4x H100 with CUDA 12.4. After the crash the cards were stuck holding some memory, and one even showed 100% utilization although no processes were listed.
Hi @asarnow,
thanks for trying DeepDeWedge and for reaching out! 🙂
I am sorry that the software is causing you so much trouble 🙁
I have never encountered the "too many open files" error, but I am glad that your ulimit fix seems to have worked.
The second error you report, i.e.,
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 166, in update_normalization
    self.update_hparam("unet_params", self.unet_params)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 178, in update_hparam
    hparams[hparam] = value
TypeError: 'NoneType' object does not support item assignment
reminds me of Issue https://github.com/MLI-lab/DeepDeWedge/issues/27
Summary: It seems that a still-unknown (perhaps randomly occurring?) issue occasionally produces a corrupt .yaml file which cannot be read and causes the error you encountered. (Maybe this is related to "too many open files"?)
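For illustration (this is not the actual DDW code, just the mechanism), an empty or truncated .yaml file parses to `None`, and any later item assignment then fails exactly like in your traceback:

```python
import yaml

# An empty or truncated hparams file parses to None instead of a dict
hparams = yaml.safe_load("")
print(hparams)  # None

# Any subsequent item assignment then raises the same error:
#   hparams["unet_params"] = {...}
#   TypeError: 'NoneType' object does not support item assignment
```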
In Issue https://github.com/MLI-lab/DeepDeWedge/issues/27, I also proposed a possible fix (in a separate branch). Maybe you can try the fix in that branch and let me know if it did the trick. If not, we might gain some insight into what's going on.
If you need any help with trying the fix, please let me know!
Thanks, I'll try that next. I just tried running it again, since the issue might be nondeterministic, and that led to a crazy NCCL crash (log attached). I had to reboot the server afterwards because nvidia-smi was not responding. I restarted with a single GPU, and it is a couple of epochs in at this point.