
Is dnnl::convolution_backward_data running on a single core?

Open w1005444804 opened this issue 2 years ago • 9 comments

CPU: i3-8100 (4 cores / 4 threads); Windows 10. I am using just a single convolution, forward and backward. Its speed is four times slower than PyTorch. Is multi-core not enabled? How can I accelerate my oneDNN code?
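As a quick sanity check (a sketch, not part of the original report): with a oneDNN build that uses the default OpenMP CPU runtime, the number of threads the library can use is easy to query, and a value of 1 here would explain single-core behavior. The OMP_NUM_THREADS environment variable controls it.

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    // oneDNN's CPU primitives parallelize over the OpenMP thread pool by default,
    // so this is the upper bound on the cores a convolution can use.
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}
```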

w1005444804 avatar Jul 28 '23 09:07 w1005444804

void convolution_node::backward2data(const dnnl::memory& diff_dst) {
    m_src_diff_md = dnnl::memory::desc(m_src_dims, dt::f32, tag::any);
    m_weights_diff_md = dnnl::memory::desc({ m_weights_dims }, dt::f32, tag::any);
    m_dst_diff_md = dnnl::memory::desc({ m_dst_dims }, dt::f32, tag::any);

    // std::cout << "Creating backward convolutional layer primitive descriptor\n";
    m_conv_bwd_data_desc = dnnl::convolution_backward_data::primitive_desc(
            m_engine, dnnl::algorithm::convolution_direct,
            m_src_diff_md, m_weights_md, m_dst_diff_md,
            m_stride_dims, m_dilation_dims, m_padding_dims, m_padding_dims,
            m_conv_fwd_desc);

    // Reorder diff_dst into the format expected by the primitive, if needed.
    m_arg_diff_dst = diff_dst;
    if (diff_dst.get_desc() != m_conv_bwd_data_desc.diff_dst_desc()) {
        m_arg_diff_dst = dnnl::memory(m_conv_bwd_data_desc.diff_dst_desc(), m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(diff_dst, m_arg_diff_dst));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, diff_dst},
                {DNNL_ARG_TO, m_arg_diff_dst} });
    }

    m_arg_diff_src = dnnl::memory(m_conv_bwd_data_desc.diff_src_desc(), m_engine);
    m_net_bwd_data.push_back(dnnl::convolution_backward_data(m_conv_bwd_data_desc));
    m_net_bwd_data_args.push_back(
            { {DNNL_ARG_DIFF_SRC, m_arg_diff_src},
              {DNNL_ARG_DIFF_DST, m_arg_diff_dst},
              // If something does not work, check this: there might be some
              // reordering needed, done in a similar fashion to cnn_training_f32.cpp.
              {DNNL_ARG_WEIGHTS, m_arg_weights} });

    auto user_diff_src_md = dnnl::memory::desc({ m_src_dims }, dt::f32, tag::nchw);
    m_user_diff_src = m_arg_diff_src;
    if (m_arg_diff_src.get_desc() != user_diff_src_md) {
        m_user_diff_src = dnnl::memory(user_diff_src_md, m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(m_arg_diff_src, m_user_diff_src));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, m_arg_diff_src},
                {DNNL_ARG_TO, m_user_diff_src} });
    }

    assert(m_net_bwd_data.size() == m_net_bwd_data_args.size() && "something is missing");
}
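For readers of this snippet: the m_conv_fwd_desc passed as the last argument is the forward primitive descriptor that dnnl::convolution_backward_data::primitive_desc requires as a hint. Below is a minimal sketch of how such a hint is typically built, assuming the oneDNN v3.x API and the same member names used above; the actual forward setup of this class is not shown in the thread.

```cpp
// Hypothetical forward-hint setup, e.g. inside the node's forward initialization.
// prop_kind::forward_training makes the descriptor usable as a backward hint.
m_conv_fwd_desc = dnnl::convolution_forward::primitive_desc(
        m_engine, dnnl::prop_kind::forward_training,
        dnnl::algorithm::convolution_direct,
        dnnl::memory::desc(m_src_dims, dt::f32, tag::any),
        dnnl::memory::desc({ m_weights_dims }, dt::f32, tag::any),
        dnnl::memory::desc({ m_dst_dims }, dt::f32, tag::any),
        m_stride_dims, m_dilation_dims, m_padding_dims, m_padding_dims);
```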

w1005444804 avatar Jul 28 '23 09:07 w1005444804

dnnl::convolution_backward_data is quite time-consuming:

- inference: 10 ms
- backward2data: 232 ms (PyTorch/libtorch: 30-50 ms)
- backward2weights: 12 ms

w1005444804 avatar Jul 28 '23 13:07 w1005444804

Hi @w1005444804 , could you please run oneDNN with verbose enabled? Here is the documentation: https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html?highlight=verbose
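For anyone following along, here is a minimal sketch of turning verbose mode on programmatically, assuming the dnnl::set_verbose() service routine available in recent oneDNN releases; setting the ONEDNN_VERBOSE environment variable as described in the linked guide achieves the same without code changes.

```cpp
#include "dnnl.hpp"

int main() {
    // Log every primitive execution: implementation name, memory formats, shape, time in ms.
    // Equivalent to launching the program with ONEDNN_VERBOSE=1 set in the environment.
    dnnl::set_verbose(1);
    // ... create engine/stream and run the convolution as in the snippets above ...
    return 0;
}
```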

igorsafo avatar Jul 28 '23 15:07 igorsafo

@igorsafo thanks. Activating ONEDNN_VERBOSE does have some effect, but the timing is very unstable: it has changed from the previous 230 ms to a dynamic range of 60-200 ms. Here is the verbose output:

onednn_verbose,188439297.948300,exec,cpu,convolution,jit:avx2,backward_data,src_f32:ap:blocked:aBcd8b::f0 wei_f32:ap:blocked:ABcd8a8b::f0 bia_undef::undef::: dst_f32:ap:blocked:aBcd8b::f0,,alg:convolution_direct,mb10_ic3oc6_ih160oh156kh5sh1dh0ph0_iw160ow156kw5sw1dw0pw0,100.937

w1005444804 avatar Jul 29 '23 05:07 w1005444804

Hi @igorsafo, is the problem caused by something on my side?

w1005444804 avatar Jul 30 '23 06:07 w1005444804

@w1005444804 Thanks for the additional information! It looks like it is not an integration problem: the data formats are blocked and an optimized implementation is called. I was also able to reproduce the low performance for this case. It does not run on a single thread, but the optimized implementation seems to have a performance gap for this kind of shape.

Is it the first layer in the model? You usually don't need to compute the backward pass w.r.t. data for the first layer. Unfortunately, if there are other layers before this convolution, then the gradient is required.
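As an illustration of that point (a hypothetical sketch, not code from this thread): node, stream, is_first_layer, and the backward2weights()/m_net_bwd_weights members are assumed counterparts of the backward2data()/m_net_bwd_data members shown above.

```cpp
// Weight gradients are always needed for training.
node.backward2weights(diff_dst_memory);
for (size_t i = 0; i < node.m_net_bwd_weights.size(); i++)
    node.m_net_bwd_weights[i].execute(stream, node.m_net_bwd_weights_args[i]);

// The data gradient is only consumed by earlier layers, so the first layer can skip it.
if (!is_first_layer) {
    node.backward2data(diff_dst_memory);
    for (size_t i = 0; i < node.m_net_bwd_data.size(); i++)
        node.m_net_bwd_data[i].execute(stream, node.m_net_bwd_data_args[i]);
}
```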

If you can provide more details about the use case (model, HW/ISA), that would be helpful. How much time does this convolution take compared to the overall model time?

igorsafo avatar Jul 31 '23 16:07 igorsafo

@igorsafo Yes, it is the first layer. My model is just a single conv layer; I only wanted to test the forward and backward propagation speed of convolutions, and found this issue when comparing with PyTorch. Thank you for your reply!

w1005444804 avatar Aug 02 '23 08:08 w1005444804

The code is roughly as follows:

...
dnnl::memory::dims conv1_src_tz = { 10, 3, 160, 160 };
auto conv1_src_memory = dnnl::memory({ {conv1_src_tz}, dt::f32, tag::nchw }, engine);
convolution_node conv1(engine, 3, 6, 5, 1, 0, 0, 1, 0, 1, conv1_src_memory);
...
for (size_t i = 0; i < conv1.m_net_fwd.size(); i++) {
    conv1.m_net_fwd[i].execute(s, conv1.m_net_fwd_args[i]);
}
...
conv1.backward2data(top_memory);
for (size_t i = 0; i < conv1.m_net_bwd_data.size(); i++) {
    conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);
}
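A measurement note on this snippet (a sketch under assumptions, not code from the thread): keeping the backward2data() call, which creates the primitives, outside the timed region and waiting on the stream before stopping the clock makes the backward timing comparable to the numbers quoted above. On CPU the default stream is synchronous, so the wait mostly matters for asynchronous runtimes, but it is harmless either way.

```cpp
#include <chrono>
#include <cstdio>

// Primitive creation is expensive and happens once, so keep it out of the measurement.
conv1.backward2data(top_memory);

auto t0 = std::chrono::steady_clock::now();
for (size_t i = 0; i < conv1.m_net_bwd_data.size(); i++)
    conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);
s.wait(); // make sure all queued primitives finished before reading the clock
auto t1 = std::chrono::steady_clock::now();
std::printf("backward2data exec: %.3f ms\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count());
```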

w1005444804 avatar Aug 02 '23 08:08 w1005444804

Hi @w1005444804, thank you for the information. I created an internal tracker for this issue; however, I can't guarantee it will be fixed until we have more requests/use cases for this particular shape.

igorsafo avatar Aug 03 '23 18:08 igorsafo

Fixed in oneDNN v3.4.

vpirogov avatar Mar 29 '24 20:03 vpirogov