DeepDeWedge

Too many open files during fitting

asarnow opened this issue 8 months ago · 3 comments

I was running fit-model on 18 tomograms using the config.yaml attached below and got the "too many open files" error shown below (~~ulimit is actually unlimited already~~). Have you seen this before, or do you know what the cause might be?

ulimit -n was actually only 1024; I increased it to 4096, which did allow ddw to resume upon re-running the command (instead of failing immediately with the same error). I do have ~1700 subtomograms in each half. However, after validation DDW now exits with "Killed" and no other output.
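
For reference, the same limit can also be inspected and raised from inside Python with the standard `resource` module; this is only a sketch of what the `ulimit -n` change above does (DDW itself does not expose this), shown here for anyone wrapping the launch in a script:

```python
import resource

# Current soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raise the soft limit to 4096 (must not exceed the hard limit);
# equivalent to running `ulimit -n 4096` in the shell before launching ddw
resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))
```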

```
DeadlockDetectedException: DeadLock detected from rank: 3
Traceback (most recent call last):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 296, in on_advance_end
    self.trainer._call_lightning_module_hook("on_train_epoch_end")
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 84, in on_train_epoch_end
    self.update_subtomo_missing_wedges()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 125, in update_subtomo_missing_wedges
    for batch in tqdm.tqdm(loader, desc="Updating subtomo missing wedges"):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1160, in _try_get_data
    raise RuntimeError(
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code

Killed
```
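
As an aside, the RuntimeError above suggests a second workaround besides raising `ulimit -n`: switching PyTorch's worker sharing strategy to the file system. As far as I can tell DDW does not expose an option for this, so the snippet below is only a sketch of where such a call would go if one edited the script that launches training:

```python
import torch.multiprocessing as mp

# Use the file_system sharing strategy instead of file descriptors, as the
# RuntimeError suggests; this must run before any DataLoader workers start.
mp.set_sharing_strategy("file_system")
print("sharing strategy:", mp.get_sharing_strategy())
```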

config.yaml.txt

```yaml
shared:
  project_dir: "."
  tomo0_files: 
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_ODD_Vol.mrc"
  tomo1_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_EVN_Vol.mrc"
  subtomo_size: 96
  mw_angle: 92
  num_workers: 32
  gpu: [0, 1, 2, 3]
  seed: 42

prepare_data:
  val_fraction: 0.1
  extract_larger_subtomos_for_rotating: true
  overwrite: true

fit_model:
    unet_params_dict:
      chans: 64
      num_downsample_layers: 3
      drop_prob: 0.0
    adam_params_dict: 
      lr: 0.0004
    num_epochs: 1000
    batch_size: 5
    update_subtomo_missing_wedges_every_n_epochs: 10
    check_val_every_n_epochs: 10
    save_n_models_with_lowest_val_loss: 5
    save_n_models_with_lowest_fitting_loss: 5
    save_model_every_n_epochs: 50
    logger: "csv"


refine_tomogram:
    model_checkpoint_file: "logs/version_0/checkpoints/epoch/epoch=999.ckpt"
    subtomo_overlap: 32
    batch_size: 10

— asarnow, May 27 '25

Reran from scratch using the higher file-descriptor ulimit, and this time it failed with the NoneType assignment error in the attached log instead. The system has 4x H100 with CUDA 12.4; after the crash the cards were stuck with some memory still allocated, and one even showed 100% utilization although no processes were listed.

ddw_log2.txt

— asarnow, May 27 '25

Hi @asarnow,

thanks for trying DeepDeWedge and for reaching out! 🙂

I am sorry that the software is causing you so much trouble 🙁

I have never encountered the "too many open files" error, but I am glad that your ulimit fix seems to have worked.

The second error you report, i.e.,

File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 166, in update_normalization self.update_hparam("unet_params", self.unet_params) File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 178, in update_hparam hparams[hparam] = value TypeError: 'NoneType' object does not support item assignment

reminds me of Issue https://github.com/MLI-lab/DeepDeWedge/issues/27

Summary: it seems that a still-unknown (possibly randomly occurring?) issue occasionally produces a corrupt .yaml file which cannot be read, and this causes the error you encountered. (Maybe this is related to "too many open files"?)
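
For what it's worth, the TypeError is consistent with an empty or truncated YAML file: `yaml.safe_load` returns `None` for an empty document, and `None` does not support item assignment. A quick check one could run (the hparams path here is only a guess based on the `logs/version_0` layout referenced in the config above):

```python
import yaml

# NOTE: this path is an assumption based on the Lightning CSV-logger layout
# referenced in config.yaml ("logs/version_0/..."); adjust it to your run.
hparams_path = "logs/version_0/hparams.yaml"

with open(hparams_path) as f:
    hparams = yaml.safe_load(f)

if hparams is None:
    print(f"{hparams_path} is empty/corrupt -> would reproduce the NoneType error")
else:
    print(f"{hparams_path} parsed fine; keys: {list(hparams)}")
```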

In Issue https://github.com/MLI-lab/DeepDeWedge/issues/27, I also proposed a possible fix (in a separate branch). Maybe you can try the fix in that branch and let me know if it does the trick. If not, we might at least gain some insight into what's going on.

If you need any help with trying the fix, please let me know!

— SimWdm, May 27 '25

Thanks, I'll try that next. I just tried running it again, since the issue might be nondeterministic, and that led to a crazy NCCL crash (log attached). I had to reboot the server afterwards; nvidia-smi wasn't responding. I've restarted with a single GPU, and it's a couple of epochs in at this point.

ddw_log3.txt

— asarnow, May 27 '25