Known issue: the multiprocessing module must be set to forkserver mode for parallel data loading from HDFS
Deep learning frameworks support multi-process data loading, for example the num_workers option of DataLoader in PyTorch and MultiprocessIterator in Chainer.
They use the multiprocessing module to launch worker processes, which uses fork by default (on Linux).
When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied to the child processes. The connection is eventually destroyed when one of the workers completes its work (in PyTorch DataLoader this happens at the end of each epoch). The remaining worker processes still need to talk to HDFS, but because the connection has been closed unexpectedly and outside their control, they break.
As far as I know, the actual error message or symptom that users see varies with the situation (for example a freeze, or an odd error such as RuntimeError: threads can only be started once), which makes troubleshooting even more difficult.
The workaround is to set the multiprocessing module to forkserver mode before accessing HDFS.
For a similar reason (to prevent the MPI context from being broken after fork), the ChainerCV and Chainer examples apply the same workaround, and it works for the PFIO + HDFS case, too.
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
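For reference, a minimal sketch of the workaround using only the standard library. The helper name enable_forkserver is hypothetical and is not part of PFIO or any framework. The only real API used is multiprocessing.set_start_method / get_start_method. The key point is ordering: the start method has to be chosen before any HDFS connection is opened and before any worker is launched.

```python
import multiprocessing


def enable_forkserver():
    """Select the forkserver start method if none has been fixed yet.

    Hypothetical helper for illustration. Call it at the very top of the
    program, before opening any HDFS connection and before creating a
    DataLoader or iterator that spawns workers.
    """
    # allow_none=True returns None when no start method has been fixed yet,
    # so we avoid the RuntimeError that set_start_method raises on a second call.
    if multiprocessing.get_start_method(allow_none=True) is None:
        multiprocessing.set_start_method('forkserver')
    # Report the method actually in effect (it may differ if some other code
    # already fixed it earlier).
    return multiprocessing.get_start_method()


if __name__ == '__main__':
    method = enable_forkserver()
    print('start method:', method)
    # Only after this point: open HDFS, build the dataset, create workers.
```

With forkserver, workers are forked from a clean server process rather than from the training process, so they do not inherit an already opened HDFS connection.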
related issue: #81
The V2 API introduced proactive fork detection: before entering PyArrow functions it checks the process ID, and when a fork is detected it raises an exception by default. With the vanilla Hdfs() class, developers can now detect fork-after-HDFS-init as a bug, then fix their code and introduce forkserver. What do you think?
An example of the process ID check: https://github.com/pfnet/pfio/pull/151/files#diff-4e49c0f20764e59a31322473b893e889d1163bc77c47758e50c11107f878d498R149
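To illustrate the idea, here is a rough sketch of fork detection by comparing process IDs. It is not the PFIO implementation: the names ForkDetectedError, PidGuard and GuardedClient are made up for this example, and the read method only returns a string instead of calling PyArrow.

```python
import os


class ForkDetectedError(RuntimeError):
    """Hypothetical error raised when an object is used in a forked child."""


class PidGuard:
    """Remembers the PID of the creating process and checks it later."""

    def __init__(self):
        self._pid = os.getpid()

    def check(self):
        # After fork(), the child has a different PID than the one recorded
        # at construction time, so a mismatch means we were forked.
        if os.getpid() != self._pid:
            raise ForkDetectedError(
                'created in PID %d but used in PID %d'
                % (self._pid, os.getpid()))


class GuardedClient:
    """Stand-in for a filesystem client; checks the guard on every call."""

    def __init__(self):
        self._guard = PidGuard()

    def read(self, path):
        self._guard.check()  # fail fast before touching the connection
        return 'read ' + path  # placeholder for the real I/O call


if __name__ == '__main__':
    client = GuardedClient()
    print(client.read('/tmp/example'))
```

Raising at the first call in the child turns a hard-to-diagnose freeze into an immediate, explicit error that points at the fork.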