"EOFError: Ran out of input“ occurred in example mnist_hogwild
Hi, when I ran example mnist_hogwild on cuda, errors occurred as below:
File "main.py", line 66, in <module>
p.start()
File "D:\Python3.7.3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "D:\Python3.7.3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Python3.7.3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "D:\Python3.7.3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "D:\Python3.7.3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "D:\Python3.7.3\lib\site-packages\torch\multiprocessing\reductions.py", line 232, in reduce_tensor
event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245
C:\Users\audrey\Desktop\test>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
My system: Windows 10; GPU: GeForce RTX 2080 Ti; PyTorch version: 1.2.0
How to fix this? Thanks!
I also get the same error on Windows 10 with an NVIDIA TITAN, PyTorch version 1.4.0.
But the same code runs fine on Linux with the same Python and PyTorch versions. The code can be found here: https://github.com/ArashJavan/DeepLIO
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Hi, I get the same issue on Windows 11 when trying to use CUDA. It works with CPU though.
@msaroufim That's probably not an issue with pickle but with shared memory for CUDA tensors on Windows. The EOFError in the child process is just a follow-on symptom that appears after the CUDA exception in the parent:
File "D:\Python3.7.3\lib\site-packages\torch\multiprocessing\reductions.py", line 232, in reduce_tensor
event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245
my output:
(vpyenv) PS D:\_projects\examples\mnist_hogwild> python main.py --cuda
Using device: cuda
Traceback (most recent call last):
File "D:\_projects\examples\mnist_hogwild\main.py", line 99, in <module>
p.start()
File "D:\Program Files\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "D:\Program Files\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Program Files\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "D:\Program Files\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
reduction.dump(process_obj, to_child)
File "D:\Program Files\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "D:\_projects\examples\mnist_hogwild\vpyenv\Lib\site-packages\torch\multiprocessing\reductions.py", line 261, in reduce_tensor
event_sync_required) = storage._share_cuda_()
^^^^^^^^^^^^^^^^^^^^^^
File "D:\_projects\examples\mnist_hogwild\vpyenv\Lib\site-packages\torch\storage.py", line 943, in _share_cuda_
return self._untyped_storage._share_cuda_(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(vpyenv) PS D:\_projects\examples\mnist_hogwild> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\Program Files\Python311\Lib\multiprocessing\spawn.py", line 120, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Program Files\Python311\Lib\multiprocessing\spawn.py", line 130, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
The issue seems to be caused by the lack of support for CUDA IPC operations on Windows (see https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations).
(IPC, inter-process communication, is the mechanism by which multiple processes or threads exchange data; it is what the parent process uses here to hand tensors to its workers.)
From the PyTorch docs:
"They [CUDA IPC operations] are not supported on Windows. Something like doing multiprocessing on CUDA tensors cannot succeed, there are two alternatives for this.
1. Don't use multiprocessing. Set the num_worker of DataLoader to zero.
2. Share CPU tensors instead. Make sure your custom DataSet returns CPU tensors."
So basically it's saying this is not possible? Please find a fix for Windows somehow; there must be a way. Being able to share CUDA tensors would save a lot of VRAM by not having to duplicate models across processes. I really need this.
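For what it's worth, the second alternative from the docs (share CPU tensors) is exactly what makes Hogwild work on Windows: keep the model's parameters in CPU shared memory via `share_memory()`, so only CPU storages ever cross the process boundary when the workers are spawned. Below is a minimal sketch of that pattern; the toy `nn.Linear` model, the random data, and the `train` worker function are my own illustrative choices, not the mnist_hogwild code itself:

```python
import torch
import torch.nn as nn

def train(model, steps=5):
    # Hogwild-style worker: updates the shared CPU parameters in place.
    # Each worker may still move its *inputs/activations* to CUDA locally,
    # but the shared parameters themselves must stay on the CPU on Windows.
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

model = nn.Linear(4, 1)  # parameters stay on the CPU (no .cuda() here!)
model.share_memory()     # moves parameter storage into CPU shared memory

# In the real example you would now start workers with
# torch.multiprocessing.Process(target=train, args=(model,)); pickling
# succeeds because only CPU storages cross the process boundary.
```

The VRAM-saving part (one shared copy of the weights instead of one per process) only applies to the CPU-resident parameters; any per-process CUDA copies are still duplicated, which is the limitation the docs describe.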