
I/O error when there are too many concurrent processes

siaimes opened this issue 4 years ago • 2 comments

Organization Name: fzu

Short summary about the issue/question: When I start multiple PyTorch DDP jobs at the same time, most of the processes crash with high probability after running for several epochs, reporting I/O errors like the following:

[2022-03-27 08:57:41] ERROR: Uncaught exception:
Traceback (most recent call last):
  File "main.py", line 55, in run_epoch
    for i, data in enumerate(train_loader):

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)

  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()

  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
OSError: Caught OSError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/csip-091/TorchDomain/torchdomain/datasets/folder.py", line 90, in __getitem__
    return super(DomainFolder, self).__getitem__(idx) + self._get_domain(idx)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 308, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 232, in __getitem__
    sample = self.loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 269, in default_loader
    return pil_loader(path)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 249, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/mnt/share/ImageNet/train/n02488702/n02488702_98.JPEG'

The error suggests the file is missing, but the file actually exists.

So I guess it may be caused by too many concurrent processes?

How can I solve this problem?
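
As a point of comparison, a common client-side workaround for transient NFS read failures (a sketch, not something from this thread — `open_with_retry` is an illustrative name) is to retry the read a few times before letting the `OSError` propagate and kill the DataLoader worker:

```python
# Hypothetical client-side workaround: retry reads that fail with a
# transient OSError (e.g. NFS EIO under load) before giving up.
import time


def open_with_retry(path, retries=3, delay=1.0):
    """Read a file's bytes, retrying on OSError up to `retries` times."""
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: re-raise the original error
            time.sleep(delay)
```

In a torchvision-style dataset this logic could wrap the `loader` callable passed to `ImageFolder`, so that a single transient EIO does not abort the whole epoch.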

Brief what process you are following: When too many concurrent processes access the same dataset, an I/O error is reported.

How to reproduce it:

Run 5 jobs at the same time, each with 2 main processes and 6 dataloader processes.

In this way, there will be a total of 5*2*6=60 processes accessing /mnt/share/ImageNet/* at the same time.

OpenPAI Environment:

  • OpenPAI version: v1.8.0
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.5 LTS
  • Kernel (e.g. uname -a): Linux 4.15.0-151-generic #157-Ubuntu SMP Fri Jul 9 23:07:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know: The storage /mnt/share is exported by the nfs-kernel-server of the host machine.

siaimes avatar Mar 27 '22 12:03 siaimes

Any NFS related logs?

Binyang2014 avatar Mar 29 '22 02:03 Binyang2014

NFS has no logs by default, and it is not convenient for me to restart it in the production environment. The issue does not appear when I reduce the number of jobs accessing the same dataset to 2 (24 processes).

Here is my config for nfs-kernel-server:

csip@csip-091:~$ cat /etc/exports 
# /etc/exports: the access control list for filesystems which may be exported
#		to NFS clients.  See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes       hostname1(rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
# /srv/nfs4        gss/krb5i(rw,sync,fsid=0,crossmnt,no_subtree_check)
# /srv/nfs4/homes  gss/krb5i(rw,sync,no_subtree_check)
/home/data 172.17.175.0/255.255.255.0(rw,fsid=0,async,no_subtree_check,no_auth_nlm,insecure,no_root_squash)
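
One server-side knob worth checking (an assumption, not something diagnosed in this thread): Ubuntu's nfs-kernel-server starts only 8 nfsd threads by default (`RPCNFSDCOUNT=8` in `/etc/default/nfs-kernel-server`), which 60 concurrent readers could saturate. The thread count can be raised at runtime with `rpc.nfsd`, without restarting the service:

```
# Sketch (requires root on the NFS server; no service restart needed):
cat /proc/fs/nfsd/threads      # check the current thread count
rpc.nfsd 64                    # raise it to 64 on the running server
# make the change persistent across future restarts:
sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=64/' /etc/default/nfs-kernel-server
```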

Here is my PV/PVC:

root@csip-dev-box-openpai092:/cluster-configuration/storage# cat share.yaml 
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: share-pv
  labels:
    name: share
spec:
  capacity:
    storage: 30Ti
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
    - soft
    - retry=0
    - retrans=1
    - timeo=20
  nfs:
    path: /share
    server: 172.17.175.90
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: share
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 30Ti
  selector:
    matchLabels:
      name: share
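
An observation on this PV (my reading, not confirmed in the thread): the mount options `soft,retry=0,retrans=1,timeo=20` make the client give up very quickly. With a `soft` mount, a request that is not answered within `timeo` (measured in tenths of a second, so 2 seconds here) is retransmitted `retrans` times and then returned to the application as EIO — exactly the `[Errno 5]` in the traceback. Under 60 concurrent readers, occasional slow responses become likely. A more forgiving sketch of the same section:

```yaml
  mountOptions:
    - nfsvers=4.1
    - hard         # keep retrying instead of surfacing EIO to the application
    - timeo=600    # 60 s per attempt (timeo is in tenths of a second)
    - retrans=2
```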

siaimes avatar Mar 30 '22 02:03 siaimes