
one GPU

Open gh2517956473 opened this issue 6 years ago • 5 comments

Can I train with a single GPU with 12 GB of memory? What do I need to change in the code? Thank you very much!

gh2517956473 avatar May 14 '19 08:05 gh2517956473

Please follow this: https://github.com/uber-research/UPSNet/issues/36#issuecomment-491593609
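The linked comment isn't reproduced here, so as a rough, non-UPSNet-specific rule of thumb: when a config written for several GPUs is run on one, the effective batch size shrinks, and the learning rate and schedule are usually rescaled to match (the "linear scaling rule"). A minimal sketch with placeholder numbers only (these are not UPSNet's actual defaults):

```python
reference_gpus = 4   # number of GPUs the released config assumes (placeholder)
reference_lr = 0.005 # placeholder learning rate from that config, not a real UPSNet value
images_per_gpu = 1   # placeholder per-GPU batch size

# On one GPU the effective batch size drops by a factor of reference_gpus,
# so the learning rate is usually scaled down by the same factor and the
# training schedule (number of iterations) stretched accordingly.
single_gpu_lr = reference_lr * (images_per_gpu / (images_per_gpu * reference_gpus))
print(f"single-GPU learning rate ~ {single_gpu_lr}")  # 0.00125 with these placeholders
```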

YuwenXiong avatar May 14 '19 19:05 YuwenXiong

Thank you!

gh2517956473 avatar May 15 '19 01:05 gh2517956473

Thank you! Hello, I used one GPU, but this error occurred:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "upsnet/upsnet_end2end_train.py", line 414, in <module>
    upsnet_train()
  File "upsnet/upsnet_end2end_train.py", line 268, in upsnet_train
    data, label, _ = train_iterator.next()
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/root/anaconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
    res = self._reader.recv_bytes()
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 31613) is killed by signal: Bus error. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
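For context: this bus error is about shared memory (/dev/shm) for the DataLoader worker processes, not GPU memory, and is common inside Docker containers with the default 64 MB shm. Enlarging shared memory (e.g. `docker run --shm-size=8g ...`) or disabling worker processes are the usual workarounds. A minimal, generic PyTorch sketch of the second option, not UPSNet's actual loader code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real training data.
dataset = TensorDataset(torch.randn(8, 3, 32, 32), torch.zeros(8))

# num_workers=0 loads data in the main process, so no shared memory is
# needed for inter-process transfer (at the cost of slower loading).
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

for data, label in loader:
    pass  # training step would go here
```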

lfdeep avatar Jun 10 '19 13:06 lfdeep

> Can I train with a single GPU with 12 GB of memory? What do I need to change in the code? Thank you very much!

Hello, were you able to run the code successfully on a single GPU?

lfdeep avatar Jun 11 '19 08:06 lfdeep

Thank you for the great work. What if I use Horovod on a single-GPU machine? I tried it and found it faster than not using Horovod; is there any problem with this? Also, how could I run multiple Horovod workers to mimic multiple GPUs on a single-GPU machine? Thanks a lot, I look forward to your reply.
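For what it's worth, launching several Horovod ranks on a one-GPU machine does not really mimic several GPUs: all ranks end up sharing the same device and its memory, so it mainly exercises the distributed code path. A hedged, non-UPSNet-specific sketch of the standard Horovod + PyTorch setup, with the single-GPU caveat in the comments:

```python
import torch
import horovod.torch as hvd

hvd.init()

# On a single-GPU machine hvd.local_rank() can exceed the device count,
# so clamp every rank onto GPU 0; all ranks then share that GPU's memory.
device_index = min(hvd.local_rank(), torch.cuda.device_count() - 1)
torch.cuda.set_device(device_index)

model = torch.nn.Linear(10, 2).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer for allreduce and broadcast initial state so all
# ranks start from identical weights.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

With Horovod installed, two ranks would be launched as `horovodrun -np 2 python train.py`; on one GPU, expect each rank to get roughly half the memory that a true multi-GPU run would have.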

pkuCactus avatar Aug 06 '19 08:08 pkuCactus