
"EOFError: Ran out of input“ occurred in example mnist_hogwild

Open audreycs opened this issue 6 years ago • 2 comments

Hi, when I ran the mnist_hogwild example on CUDA, the following errors occurred:

File "main.py", line 66, in <module>
    p.start()
  File "D:\Python3.7.3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "D:\Python3.7.3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\Python3.7.3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Python3.7.3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Python3.7.3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "D:\Python3.7.3\lib\site-packages\torch\multiprocessing\reductions.py", line 232, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245

C:\Users\audrey\Desktop\test>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)

My system: Windows 10, device: GeForce RTX 2080 Ti, PyTorch version: 1.2.0

How to fix this? Thanks!

audreycs avatar Dec 19 '19 05:12 audreycs

I also get the same error on Windows 10 with an NVIDIA TITAN. PyTorch version: 1.4.0

But the same code runs on Linux with the same Python and PyTorch versions. The code can be found here: https://github.com/ArashJavan/DeepLIO

    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

ArashJavan avatar Apr 14 '20 15:04 ArashJavan

Hi, I get the same issue on Windows 11 when trying to use CUDA. It works with CPU though.

@msaroufim That's probably not an issue with pickle but with CUDA shared memory on Windows. The EOFError appears asynchronously after the CUDA exception.

File "D:\Python3.7.3\lib\site-packages\torch\multiprocessing\reductions.py", line 232, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245

my output:

(vpyenv) PS D:\_projects\examples\mnist_hogwild> python main.py --cuda
Using device: cuda
Traceback (most recent call last):
  File "D:\_projects\examples\mnist_hogwild\main.py", line 99, in <module>
    p.start()
  File "D:\Program Files\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "D:\Program Files\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Program Files\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "D:\Program Files\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Program Files\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "D:\_projects\examples\mnist_hogwild\vpyenv\Lib\site-packages\torch\multiprocessing\reductions.py", line 261, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
                           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\_projects\examples\mnist_hogwild\vpyenv\Lib\site-packages\torch\storage.py", line 943, in _share_cuda_
    return self._untyped_storage._share_cuda_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(vpyenv) PS D:\_projects\examples\mnist_hogwild> Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\Program Files\Python311\Lib\multiprocessing\spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Program Files\Python311\Lib\multiprocessing\spawn.py", line 130, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input

The issue seems to be caused by the lack of CUDA IPC support on Windows (see https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations).
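As a stopgap, you can detect the platform up front and fall back to the CPU before spawning workers, instead of letting `_share_cuda_()` fail mid-pickle. A minimal sketch; the helper name `pick_device` is my own, not from the example:

```python
import sys

import torch


def pick_device(prefer_cuda: bool = True) -> torch.device:
    """Return a device safe to use with torch.multiprocessing.

    Sharing CUDA tensors across processes is unsupported on Windows,
    so prefer CUDA only on non-Windows platforms where it is available.
    """
    if prefer_cuda and torch.cuda.is_available() and sys.platform != "win32":
        return torch.device("cuda")
    return torch.device("cpu")
```

With this, the model handed to worker processes lives on CPU on Windows, and each worker can move its computations to the GPU locally if desired.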

"IPC is the way by which multiple processes or threads communicate with each other. IPC in OS obtains modularity, computational speedup, and data sharing."

From Pytorch Doc:

"They [IPC Operations] are not supported on Windows. Something like doing multiprocessing on CUDA tensors cannot succeed, there are two alternatives for this.

  1. Don’t use multiprocessing. Set the num_worker of DataLoader to zero.

  2. Share CPU tensors instead. Make sure your custom DataSet returns CPU tensors."
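Following alternative 2, a Hogwild-style setup can still share the *model parameters* across processes on Windows, as long as the shared tensors stay on the CPU. This is only a sketch under that assumption, not the actual example code; `train` here is a stand-in for the real training loop:

```python
import torch
import torch.multiprocessing as mp


def train(shared_param: torch.Tensor) -> None:
    # Stand-in for a real training step: each worker updates the
    # shared CPU tensor in place, Hogwild-style (lock-free).
    with torch.no_grad():
        shared_param += 1.0


if __name__ == "__main__":
    # "spawn" is the only start method available on Windows.
    mp.set_start_method("spawn", force=True)

    param = torch.zeros(3)
    param.share_memory_()  # CPU shared memory works on all platforms

    workers = [mp.Process(target=train, args=(param,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(param)
```

For a full model, `model.share_memory()` on a CPU-resident model plays the same role as `param.share_memory_()` here; only the forward/backward computation can be moved to the GPU inside each worker, not the shared storage itself.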

So basically it's saying it's not possible? Please find a fix for Windows somehow, there must be a way. It would let me save a lot of VRAM by not having to duplicate models etc. I really need this.

tobiaswuerth avatar Oct 02 '23 10:10 tobiaswuerth