
CUDNN_STATUS_INTERNAL_ERROR when converting mobilenet frozen graph to caffe model.


Platform (like ubuntu 16.04/win10): Ubuntu 16.04
Python version: 3.6.9
Source framework with version (like Tensorflow 1.4.1 with GPU): Tensorflow 1.12.2 with GPU
Destination framework with version (like CNTK 2.3 with GPU): Caffe

I'm trying to run the frozen graph to Caffe conversion example. However, when I run the command

mmconvert -sf tensorflow -iw mobilenet_v1_1.0_224/frozen_graph.pb --inNodeName input --inputShape 224,224,3 --dstNodeName MobilenetV1/Predictions/Softmax -df caffe -om tf_mobilenet

My GPU seemingly runs out of memory. This is the last section of the output:

I0622 09:12:44.478653 22097 net.cpp:122] Setting up MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0622 09:12:44.478657 22097 net.cpp:129] Top shape: 1 512 14 14 (100352)
I0622 09:12:44.478659 22097 net.cpp:137] Memory required for data: 67753856
I0622 09:12:44.478663 22097 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0622 09:12:44.478667 22097 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0622 09:12:44.478669 22097 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale <- MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0622 09:12:44.478673 22097 net.cpp:367] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale -> MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add (in-place)
I0622 09:12:44.478680 22097 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0622 09:12:44.478690 22097 net.cpp:122] Setting up MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add_scale
I0622 09:12:44.478694 22097 net.cpp:129] Top shape: 1 512 14 14 (100352)
I0622 09:12:44.478696 22097 net.cpp:137] Memory required for data: 68155264
I0622 09:12:44.478700 22097 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0622 09:12:44.478705 22097 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0622 09:12:44.478708 22097 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6 <- MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0622 09:12:44.478711 22097 net.cpp:367] MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6 -> MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add (in-place)
I0622 09:12:44.479003 22097 net.cpp:122] Setting up MobilenetV1_MobilenetV1_Conv2d_6_pointwise_Relu6
I0622 09:12:44.479012 22097 net.cpp:129] Top shape: 1 512 14 14 (100352)
I0622 09:12:44.479013 22097 net.cpp:137] Memory required for data: 68556672
I0622 09:12:44.479017 22097 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
I0622 09:12:44.479020 22097 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
I0622 09:12:44.479023 22097 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise <- MobilenetV1_MobilenetV1_Conv2d_6_pointwise_BatchNorm_batchnorm_add
I0622 09:12:44.479027 22097 net.cpp:380] MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise -> MobilenetV1_MobilenetV1_Conv2d_7_depthwise_depthwise
F0622 09:12:45.251077 22097 cudnn_conv_layer.cpp:53] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
Aborted (core dumped)

A suggestion from @linmajia was to hide the GPU device and run the conversion on the CPU instead by setting

export CUDA_VISIBLE_DEVICES=" "

However, this results in the following error:


I0622 09:17:14.161953 22272 layer_factory.hpp:77] Creating layer Placeholder
I0622 09:17:14.161962 22272 net.cpp:84] Creating Layer Placeholder
I0622 09:17:14.161967 22272 net.cpp:380] Placeholder -> Placeholder
I0622 09:17:14.161988 22272 net.cpp:122] Setting up Placeholder
I0622 09:17:14.161994 22272 net.cpp:129] Top shape: 1 3 224 224 (150528)
I0622 09:17:14.161998 22272 net.cpp:137] Memory required for data: 602112
I0622 09:17:14.162000 22272 layer_factory.hpp:77] Creating layer MobilenetV1_MobilenetV1_Conv2d_0_convolution
I0622 09:17:14.162006 22272 net.cpp:84] Creating Layer MobilenetV1_MobilenetV1_Conv2d_0_convolution
I0622 09:17:14.162009 22272 net.cpp:406] MobilenetV1_MobilenetV1_Conv2d_0_convolution <- Placeholder
I0622 09:17:14.162012 22272 net.cpp:380] MobilenetV1_MobilenetV1_Conv2d_0_convolution -> MobilenetV1_MobilenetV1_Conv2d_0_convolution
F0622 09:17:14.167845 22272 cudnn_conv_layer.cpp:52] Check failed: error == cudaSuccess (38 vs. 0)  no CUDA-capable device is detected
*** Check failure stack trace: ***
Aborted (core dumped)

Any suggestions on how to fix this issue? I've run into the same CUDNN_STATUS_INTERNAL_ERROR in previous deep learning applications, where it was solved by enabling GPU memory growth:

import tensorflow as tf  # TensorFlow 1.x
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)        # the option takes effect where the session is created

But I'm not sure where to put this code in this application.

goncz · Jun 22 '20

@goncz, thank you very much for the feedback. A quick workaround is to install the CPU versions of TensorFlow and Caffe. You could create a Python virtual environment (e.g. with Anaconda) dedicated to MMdnn and install only the CPU versions of the deep learning frameworks there.
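
A minimal sketch of such an environment (the environment name, package names, and versions below are only illustrative and may need adjusting for your platform):

conda create -n mmdnn-cpu python=3.6
conda activate mmdnn-cpu
pip install tensorflow==1.12.2   # CPU-only wheel (no "-gpu" suffix), matching your source framework version
conda install caffe              # a CPU-only Caffe build, if available for your platform; otherwise build Caffe from source with CPU_ONLY := 1
pip install mmdnn

Afterwards, rerun the same mmconvert command inside this environment.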

CUDNN_STATUS_INTERNAL_ERROR can have several root causes, and we will look into this issue.

linmajia · Jun 27 '20