Is dnnl::convolution_backward_data running on a single core?
CPU: i3-8100 (4 cores / 4 threads), Windows 10. I am using just a single convolution, forward and backward. It is about four times slower than PyTorch. Is multi-core not enabled? How can I accelerate my dnnl code?
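On the threading question: a quick way to rule out a single-threaded setup is to print the threading runtime information before running any primitives. The following is a minimal sketch, assuming oneDNN was built with the default OpenMP CPU runtime (it is not part of the original report):

#include <iostream>
#include <omp.h>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Report the oneDNN version that was linked in.
    const dnnl_version_t* v = dnnl_version();
    std::cout << "oneDNN " << v->major << "." << v->minor << "." << v->patch << "\n";

    // With the OpenMP CPU runtime, oneDNN parallelizes primitives across
    // these threads; a value of 1 (e.g. because OMP_NUM_THREADS=1 is set in
    // the environment) would mean the convolution really does run on one core.
    std::cout << "OpenMP max threads: " << omp_get_max_threads() << "\n";
    return 0;
}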
void convolution_node::backward2data(const dnnl::memory& diff_dst) {
    m_src_diff_md = dnnl::memory::desc(m_src_dims, dt::f32, tag::any);
    m_weights_diff_md = dnnl::memory::desc({ m_weights_dims }, dt::f32, tag::any);
    m_dst_diff_md = dnnl::memory::desc({ m_dst_dims }, dt::f32, tag::any);

    // std::cout << "Creating backward convolutional layer primitive descriptor\n";
    m_conv_bwd_data_desc = dnnl::convolution_backward_data::primitive_desc(
        m_engine, dnnl::algorithm::convolution_direct,
        m_src_diff_md, m_weights_md, m_dst_diff_md,
        m_stride_dims, m_dilation_dims, m_padding_dims, m_padding_dims,
        m_conv_fwd_desc);

    // Reorder diff_dst into the layout expected by the primitive, if needed.
    m_arg_diff_dst = diff_dst;
    if (diff_dst.get_desc() != m_conv_bwd_data_desc.diff_dst_desc()) {
        m_arg_diff_dst = dnnl::memory(m_conv_bwd_data_desc.diff_dst_desc(), m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(diff_dst, m_arg_diff_dst));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, diff_dst},
                                        {DNNL_ARG_TO, m_arg_diff_dst} });
    }

    m_arg_diff_src = dnnl::memory(m_conv_bwd_data_desc.diff_src_desc(), m_engine);
    m_net_bwd_data.push_back(dnnl::convolution_backward_data(m_conv_bwd_data_desc));
    m_net_bwd_data_args.push_back(
        { {DNNL_ARG_DIFF_SRC, m_arg_diff_src},
          {DNNL_ARG_DIFF_DST, m_arg_diff_dst},
          // If something does not work, check this: there might be some
          // reordering needed, done in a similar fashion to cnn_training_f32.cpp.
          {DNNL_ARG_WEIGHTS, m_arg_weights} });

    // Reorder the computed diff_src back to the user's nchw layout, if needed.
    auto user_diff_src_md = dnnl::memory::desc({ m_src_dims }, dt::f32, tag::nchw);
    m_user_diff_src = m_arg_diff_src;
    if (m_arg_diff_src.get_desc() != user_diff_src_md) {
        m_user_diff_src = dnnl::memory(user_diff_src_md, m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(m_arg_diff_src, m_user_diff_src));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, m_arg_diff_src},
                                        {DNNL_ARG_TO, m_user_diff_src} });
    }

    assert(m_net_bwd_data.size() == m_net_bwd_data_args.size() && "something is missing");
}
dnnl::convolution_backward_data is quite time-consuming:
infer cost: 10 ms
backward2data cost: 232 ms (PyTorch/LibTorch: 30~50 ms)
backward2weights cost: 12 ms
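For reference, these numbers are presumably wall-clock times around the execute loops. A minimal sketch of such a measurement (the helper name and signature are mine, not from the original code; averaging over several iterations after a warm-up run gives more stable numbers):

#include <chrono>
#include <unordered_map>
#include <vector>
#include "oneapi/dnnl/dnnl.hpp"

// Times one pass over a net of primitives, e.g. m_net_bwd_data / m_net_bwd_data_args.
double time_net_ms(dnnl::stream& s,
                   std::vector<dnnl::primitive>& net,
                   std::vector<std::unordered_map<int, dnnl::memory>>& net_args) {
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < net.size(); ++i)
        net[i].execute(s, net_args[i]);
    s.wait(); // ensure all primitives have finished before stopping the clock
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}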
Hi @w1005444804, could you please run oneDNN with verbose enabled? Here is the documentation: https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html?highlight=verbose
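For readers following along: verbose mode is controlled by the ONEDNN_VERBOSE environment variable described in the linked guide (e.g. set ONEDNN_VERBOSE=1 in the console on Windows), or programmatically. A minimal sketch of the programmatic route, which per that documentation is equivalent:

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Same effect as ONEDNN_VERBOSE=1: every primitive execution is logged
    // to stdout with its implementation name, memory formats, the problem
    // descriptor, and the execution time in milliseconds.
    dnnl::set_verbose(1);

    // ... create and execute primitives as usual ...
    return 0;
}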
@igorsafo Thanks. Activating ONEDNN_VERBOSE does have an effect, but the timing is very unstable: it changed from the previous ~230 ms to a range of 60-200 ms. Here is the verbose output:
onednn_verbose,188439297.948300,exec,cpu,convolution,jit:avx2,backward_data,src_f32:ap:blocked:aBcd8b::f0 wei_f32:ap:blocked:ABcd8a8b::f0 bia_undef::undef::: dst_f32:ap:blocked:aBcd8b::f0,,alg:convolution_direct,mb10_ic3oc6_ih160oh156kh5sh1dh0ph0_iw160ow156kw5sw1dw0pw0,100.937
Hi @igorsafo, is the problem caused by something on my side?
@w1005444804 Thanks for the additional information! It looks like it is not an integration problem: the data formats are blocked and an optimized implementation is called. I was also able to reproduce the low performance for this case. It does not run on a single thread, but the optimized implementation seems to have a performance gap for this kind of shape.
Is it the first layer in the model? You usually don't need to compute backward wrt data for the first layer. Unfortunately, if there are other layers before this convolution then the gradient is required.
If you can provide more details about the use case (model, hw/ISA), that would be helpful. How much time does this convolution take compared to the overall model time?
@igorsafo Yes, it is the first layer; my model is just a single conv layer. I only wanted to test the speed of forward and backward propagation of convolutions, and found this issue when comparing with PyTorch. Thank you for your reply!
The code is roughly as follows:

...
dnnl::memory::dims conv1_src_tz = { 10, 3, 160, 160 };
auto conv1_src_memory = dnnl::memory({ {conv1_src_tz}, dt::f32, tag::nchw }, engine);
convolution_node conv1(engine, 3, 6, 5, 1, 0, 0, 1, 0, 1, conv1_src_memory);
...
for (size_t i = 0; i < conv1.m_net_fwd.size(); i++) {
    conv1.m_net_fwd[i].execute(s, conv1.m_net_fwd_args[i]);
}
...
conv1.backward2data(top_memory);
for (size_t i = 0; i < conv1.m_net_bwd_data.size(); i++) {
    conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);
}
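Following up on the earlier note about the first layer: since nothing consumes conv1's input gradient here, the backward-data net can be skipped and only the weight gradients computed. A hedged sketch of that change to the driver above (backward2weights is referenced in the timings earlier; m_net_bwd_weights and m_net_bwd_weights_args are assumed member names following the same pattern as m_net_bwd_data):

// Weight gradients are still needed to update conv1's parameters.
conv1.backward2weights(top_memory);
for (size_t i = 0; i < conv1.m_net_bwd_weights.size(); i++) {
    conv1.m_net_bwd_weights[i].execute(s, conv1.m_net_bwd_weights_args[i]);
}

// conv1 is the first layer, so its diff_src feeds nothing and the expensive
// backward-data net can be skipped entirely:
// conv1.backward2data(top_memory);
// for (...) conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);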
Hi @w1005444804, thank you for the information. I have created an internal tracker for this issue; however, I can't guarantee it will be fixed until we have more requests/use cases for this particular shape.
Fixed in oneDNN v3.4.