
Memory allocation failed

Open ppphhhleo opened this issue 3 years ago • 16 comments

I tried to train with 2 GPUs in Docker, but after one epoch a memory allocation error occurs. I am not sure what to check or what might be wrong. (screenshot)

ppphhhleo avatar May 24 '22 11:05 ppphhhleo

Thank you for reporting. Let me confirm the basics first.

  1. What GPU do you use? It would be helpful if you could provide the output of nvidia-smi.
  2. Are there any log messages before training starts, such as the OpenMPI library not being found?
  3. Do you get the same error message even with a smaller batch size? Does the error always appear on the 2nd epoch?

TomonobuTsujikawa avatar May 26 '22 03:05 TomonobuTsujikawa

(screenshot) Hello, why is g_loss_con 0.0000 (0.0000) all the time?

15755841658 avatar Jun 09 '22 03:06 15755841658

@TomonobuTsujikawa @ppphhhleo

15755841658 avatar Jun 09 '22 03:06 15755841658

Thank you for reporting; please let us check it.

TomonobuTsujikawa avatar Jun 10 '22 00:06 TomonobuTsujikawa

I have a GPU with 11 GB of memory, so I tried to run NVCNet in this environment. At first I couldn't run the model due to a memory allocation error, so I had to reduce batch_size to 2. After that, training started correctly, but g_loss_con is 0, as you pointed out.
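The "reduce the batch size until it fits" workaround above can be sketched generically. This is not part of the nnabla API; try_one_step is a hypothetical stand-in for one forward/backward pass, and a real nnabla run would surface the allocation failure as a RuntimeError rather than a MemoryError:

```python
# Generic sketch: probe for the largest batch size that fits in GPU memory
# by halving on allocation failure. "try_one_step" is a hypothetical
# callable standing in for one training step of the real model.
def find_max_batch_size(try_one_step, start=32, minimum=1):
    batch = start
    while batch >= minimum:
        try:
            try_one_step(batch)   # raises MemoryError if the batch is too big
            return batch
        except MemoryError:
            batch //= 2           # halve and retry
    return None                   # nothing fits, even at the minimum


# Example with a fake step that only fits batches of 2 or fewer:
def fake_step(batch):
    if batch > 2:
        raise MemoryError("allocation failed")

print(find_max_batch_size(fake_step))  # -> 2
```

Starting from 32, the probe fails at 32, 16, 8, and 4, then succeeds at 2, matching the batch size that worked on an 11 GB GPU here.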

I am now looking into g_loss_con.

TomonobuTsujikawa avatar Jun 14 '22 11:06 TomonobuTsujikawa

@TomonobuTsujikawa Please help: I want to train on multiple GPUs, but I run into the following error:

(tts_nnabla) twu@durian:/qwork4/twu/off_nvcnet$ mpirun -n 2 python main.py -c cudnn -d 0,2 --output_path log_new/baseline --batch_size 8
2022-08-24 16:29:11,963 [nnabla][INFO]: Initializing CPU extension...
2022-08-24 16:29:11,971 [nnabla][INFO]: Initializing CPU extension...
2022-08-24 16:29:12,607 [nnabla][INFO]: Initializing CUDA extension...
2022-08-24 16:29:12,607 [nnabla][INFO]: Initializing CUDA extension...
2022-08-24 16:29:25,542 [nnabla][INFO]: Initializing cuDNN extension...
value error in query /home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70 Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-08-24 16:29:25,558 [nnabla][INFO]: Initializing cuDNN extension...
value error in query /home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70 Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
2022-08-24 16:29:26,010 [nnabla][INFO]: Training data with 103 speakers.
2022-08-24 16:29:26,011 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-24 16:29:26,015 [nnabla][INFO]: Training data with 103 speakers.
2022-08-24 16:29:26,016 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-24 16:29:26,025 [nnabla][INFO]: Using DataIterator
2022-08-24 16:29:26,030 [nnabla][INFO]: Using DataIterator
Running epoch=1 lr=0.00010
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate again.
Error during forward propagation: RandintCuda MulScalarCuda AddScalarCuda Mul2Cuda RandCuda Mul2Cuda PadCuda WeightNormalizationCuda ConvolutionCudaCudnn [... long operator trace elided ...] WeightNormalizationCuda ConvolutionCudaCudnn <-- ERROR
Traceback (most recent call last):
  File "main.py", line 99, in run(args)
  File "main.py", line 70, in run
    Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
  File "/qwork4/twu/off_nvcnet/train.py", line 156, in run
    self.train_on_batch(i)
  File "/qwork4/twu/off_nvcnet/train.py", line 185, in train_on_batch
    p['d_loss'].forward()
  File "_variable.pyx", line 582, in nnabla._variable.Variable.forward
RuntimeError: memory error in alloc
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/src/nbla/memory/memory.cpp:39 Failed this->alloc_impl(): N4nbla10CudaMemoryE allocation failed.

And I tested the environment with python -c "import nnabla_ext.cuda, nnabla_ext.cudnn":

(screenshot)

15755841658 avatar Aug 24 '22 08:08 15755841658

Please provide the results of the following command:

pip list | grep -e pip -e nnabla

You can import nnabla correctly in a single-GPU environment, so I think this is a multi-GPU setup issue.

TomonobuTsujikawa avatar Aug 25 '22 07:08 TomonobuTsujikawa

@TomonobuTsujikawa The result: (screenshot)

15755841658 avatar Aug 29 '22 07:08 15755841658

Hmm, it seems to be OK.

Do you still get the same error if you do the following?

pip uninstall nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
pip install nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
mpirun -n 2 python main.py -c cudnn -d 0,1 --output_path log_new/baseline --batch_size 8

I will also check.

TomonobuTsujikawa avatar Aug 29 '22 07:08 TomonobuTsujikawa

@TomonobuTsujikawa I still get the same error:

(tts_nnabla) twu@durian:/qwork4/twu/nvcnet_offi$ mpirun -n 2 python main.py -c cudnn -d 0,1 --output_path log_new/baseline --batch_size 8
2022-08-29 17:52:27,939 [nnabla][INFO]: Initializing CPU extension...
2022-08-29 17:52:27,939 [nnabla][INFO]: Initializing CPU extension...
2022-08-29 17:52:30,726 [nnabla][INFO]: Initializing CUDA extension...
2022-08-29 17:52:30,727 [nnabla][INFO]: Initializing CUDA extension...
/qwork4/twu/miniconda/envs/tts_nnabla/bin/../lib/libmpi.so: undefined symbol: ompi_mpi_op_no_op
/qwork4/twu/miniconda/envs/tts_nnabla/bin/../lib/libmpi.so: undefined symbol: ompi_mpi_op_no_op
2022-08-29 17:52:43,731 [nnabla][INFO]: Initializing cuDNN extension...
2022-08-29 17:52:44,080 [nnabla][INFO]: Training data with 103 speakers.
2022-08-29 17:52:44,081 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-29 17:52:44,100 [nnabla][INFO]: Using DataIterator
2022-08-29 17:52:44,716 [nnabla][INFO]: Initializing cuDNN extension...
2022-08-29 17:52:45,076 [nnabla][INFO]: Training data with 103 speakers.
2022-08-29 17:52:45,076 [nnabla][INFO]: DataSource with shuffle(True)
2022-08-29 17:52:45,103 [nnabla][INFO]: Using DataIterator
value error in query /home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/include/nbla/function_registry.hpp:70 Failed it != items_.end(): Any of [cudnn:float, cuda:float, cpu:float] could not be found in []
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
Running epoch=1 lr=0.00010
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate. Freeing memory cache and retrying.
Failed to allocate again.
Error during forward propagation: RandintCuda MulScalarCuda AddScalarCuda Mul2Cuda RandCuda Mul2Cuda PadCuda WeightNormalizationCuda ConvolutionCudaCudnn [... long operator trace elided ...] WeightNormalizationCuda ConvolutionCudaCudnn Add2CudaCudnn <-- ERROR
Traceback (most recent call last):
  File "main.py", line 99, in run(args)
  File "main.py", line 70, in run
    Trainer(gen, gen_optim, dis, dis_optim, dataloader, rng, hp).run()
  File "/qwork4/twu/nvcnet_offi/train.py", line 156, in run
    self.train_on_batch(i)
  File "/qwork4/twu/nvcnet_offi/train.py", line 185, in train_on_batch
    p['d_loss'].forward()
  File "_variable.pyx", line 582, in nnabla._variable.Variable.forward
RuntimeError: memory error in alloc
/home/gitlab-runner/builds/LRsSYq-B/0/nnabla/builders/all/nnabla/src/nbla/memory/memory.cpp:39 Failed this->alloc_impl(): N4nbla10CudaMemoryE allocation failed.

15755841658 avatar Aug 29 '22 09:08 15755841658

@15755841658 Thank you for testing.

I set up many environments today to reproduce this error, but could not reproduce it. Can you provide a bit more information about your environment? If you run nvcnet in Docker, please show me the Dockerfile. The following logs will be very large, so I would appreciate it if you could attach them compressed.

cat /etc/os-release
dpkg -l | grep ^ii
conda --version
conda list
pip --version
pip list
nvidia-smi
set | grep -e LD_LIBRARY -e LD_PRELOAD
find /usr -name libmpi.so\*

I think this is the minimum command to check whether the issue has been resolved.

mpirun -n 2 python -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"

TomonobuTsujikawa avatar Aug 29 '22 12:08 TomonobuTsujikawa

OK! I will test. But here is the result of the minimum command: (screenshot)

15755841658 avatar Aug 29 '22 12:08 15755841658

Yes, your environment has an issue, which is why the minimum command fails. Please provide the information I asked for.

TomonobuTsujikawa avatar Aug 30 '22 00:08 TomonobuTsujikawa

@TomonobuTsujikawa OK, thanks. I have emailed you; please see the attachment for the results of these commands, and please tell me what is wrong. Thank you very much.

15755841658 avatar Aug 30 '22 01:08 15755841658

@15755841658

I checked your environment information; here is the list of problems that need to be solved.

  • Ubuntu 16: Official support is Ubuntu 18 and later, because many packages on Ubuntu 16 are very old.
  • openmpi1: OpenMPI v1 is not supported. I recommend using OpenMPI v3 as of now (you can still use OpenMPI v2).
  • ~~pip: If you use a conda environment, pip must be conda's pip, otherwise Python package management will conflict.~~ Your pip seems to be conda-based. I'm sorry.

I cannot find the NVIDIA driver/CUDA/cuDNN packages in your dpkg list; did you install them manually? Also, there seems to be a newer MPI under /usr/local, but you cannot use it because of a permission error.

Hmm, if you cannot upgrade the OS, I think it is better to use a Docker container. Here is an example:

docker pull nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0
docker run --rm -it -u $(id -u):$(id -g) --gpus all nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0

mpirun -n 2 python3 -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"

If you cannot install Docker, you need to build OpenMPI yourself. This shows how to build OpenMPI, although some steps may differ since the OS versions differ: https://github.com/sony/nnabla-ext-cuda/blob/v1.29.0/docker/release/Dockerfile.cuda-mpi#L54-L86

Also, please refer to the nnabla install page (https://nnabla.org/install/), which lists the install components and how to install them.

TomonobuTsujikawa avatar Aug 30 '22 02:08 TomonobuTsujikawa

I had the same trouble when I tried to set up another code repo. Environment:

  • numpy==1.22.4
  • docker with cuda 11.6
  • os: ubuntu-18.04

After I installed numpy>=1.23.0, the problem was fixed. However, some warnings showed up, such as:

...
2022-11-10 11:54:56,668 [nnabla][INFO]: Initializing CUDA extension...
<frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 232 from PyObject
...
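A quick guard against this could be to check the installed version before training. Below is a minimal sketch; the 1.23.0 threshold is taken from this report, and the helper assumes ordinary dotted version strings (in a real setup you would pass numpy.__version__):

```python
# Minimal sketch: compare a dotted version string against 1.23.0, the NumPy
# version that fixed the incompatibility reported above. Non-numeric
# suffixes (e.g. "rc1") are stripped; missing components count as 0.
def version_at_least(version, minimum=(1, 23, 0)):
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= minimum

print(version_at_least("1.22.4"))  # -> False: the version that triggered the bug
print(version_at_least("1.23.0"))  # -> True
```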

Hope this helps.

gl8-mt avatar Nov 10 '22 12:11 gl8-mt