Is dnnl::convolution_backward_data running on a single core?
CPU: i3-8100 (4 cores / 4 threads), Windows 10. I am using just a single convolution, forward and backward. It is about four times slower than PyTorch. Is multi-core not enabled? How can I accelerate my dnnl code?
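On the threading question: a quick way to rule out a single-threaded setup is to print the threading runtime information before running any primitives. The following is a minimal sketch, assuming oneDNN was built with the default OpenMP CPU runtime (it is not part of the original report):

#include <iostream>
#include <omp.h>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Report the oneDNN version that was linked in.
    const dnnl_version_t* v = dnnl_version();
    std::cout << "oneDNN " << v->major << "." << v->minor << "." << v->patch << "\n";

    // With the OpenMP CPU runtime, oneDNN parallelizes primitives across
    // these threads; a value of 1 (e.g. because OMP_NUM_THREADS=1 is set in
    // the environment) would mean the convolution really does run on one core.
    std::cout << "OpenMP max threads: " << omp_get_max_threads() << "\n";
    return 0;
}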
void convolution_node::backward2data(const dnnl::memory& diff_dst) {
    m_src_diff_md = dnnl::memory::desc(m_src_dims, dt::f32, tag::any);
    m_weights_diff_md = dnnl::memory::desc({ m_weights_dims }, dt::f32, tag::any);
    m_dst_diff_md = dnnl::memory::desc({ m_dst_dims }, dt::f32, tag::any);

    // std::cout << "Creating backward convolutional layer primitive descriptor\n";
    m_conv_bwd_data_desc = dnnl::convolution_backward_data::primitive_desc(
        m_engine, dnnl::algorithm::convolution_direct,
        m_src_diff_md, m_weights_md, m_dst_diff_md,
        m_stride_dims, m_dilation_dims, m_padding_dims, m_padding_dims,
        m_conv_fwd_desc);

    // Reorder diff_dst into the layout expected by the primitive, if needed.
    m_arg_diff_dst = diff_dst;
    if (diff_dst.get_desc() != m_conv_bwd_data_desc.diff_dst_desc()) {
        m_arg_diff_dst = dnnl::memory(m_conv_bwd_data_desc.diff_dst_desc(), m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(diff_dst, m_arg_diff_dst));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, diff_dst},
                                        {DNNL_ARG_TO, m_arg_diff_dst} });
    }

    m_arg_diff_src = dnnl::memory(m_conv_bwd_data_desc.diff_src_desc(), m_engine);
    m_net_bwd_data.push_back(dnnl::convolution_backward_data(m_conv_bwd_data_desc));
    m_net_bwd_data_args.push_back(
        { {DNNL_ARG_DIFF_SRC, m_arg_diff_src},
          {DNNL_ARG_DIFF_DST, m_arg_diff_dst},
          // If something does not work, check this: there might be some
          // reordering needed, done in a similar fashion to cnn_training_f32.cpp.
          {DNNL_ARG_WEIGHTS, m_arg_weights} });

    // Reorder the computed diff_src back to the user's nchw layout, if needed.
    auto user_diff_src_md = dnnl::memory::desc({ m_src_dims }, dt::f32, tag::nchw);
    m_user_diff_src = m_arg_diff_src;
    if (m_arg_diff_src.get_desc() != user_diff_src_md) {
        m_user_diff_src = dnnl::memory(user_diff_src_md, m_engine);
        m_net_bwd_data.push_back(dnnl::reorder(m_arg_diff_src, m_user_diff_src));
        m_net_bwd_data_args.push_back({ {DNNL_ARG_FROM, m_arg_diff_src},
                                        {DNNL_ARG_TO, m_user_diff_src} });
    }

    assert(m_net_bwd_data.size() == m_net_bwd_data_args.size() && "something is missing");
}
dnnl::convolution_backward_data is quite time-consuming:
infer cost: 10 ms
backward2data cost: 232 ms (PyTorch/LibTorch: 30~50 ms)
backward2weights cost: 12 ms
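For reference, these numbers are presumably wall-clock times around the execute loops. A minimal sketch of such a measurement (the helper name and signature are mine, not from the original code; averaging over several iterations after a warm-up run gives more stable numbers):

#include <chrono>
#include <unordered_map>
#include <vector>
#include "oneapi/dnnl/dnnl.hpp"

// Times one pass over a net of primitives, e.g. m_net_bwd_data / m_net_bwd_data_args.
double time_net_ms(dnnl::stream& s,
                   std::vector<dnnl::primitive>& net,
                   std::vector<std::unordered_map<int, dnnl::memory>>& net_args) {
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < net.size(); ++i)
        net[i].execute(s, net_args[i]);
    s.wait(); // ensure all primitives have finished before stopping the clock
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}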
Hi @w1005444804, could you please run oneDNN with verbose enabled? Here is the documentation: https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html?highlight=verbose
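For readers following along: verbose mode is controlled by the ONEDNN_VERBOSE environment variable described in the linked guide (e.g. set ONEDNN_VERBOSE=1 in the console on Windows), or programmatically. A minimal sketch of the programmatic route, which per that documentation is equivalent:

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Same effect as ONEDNN_VERBOSE=1: every primitive execution is logged
    // to stdout with its implementation name, memory formats, the problem
    // descriptor, and the execution time in milliseconds.
    dnnl::set_verbose(1);

    // ... create and execute primitives as usual ...
    return 0;
}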
@igorsafo Thanks. Activating ONEDNN_VERBOSE does have an effect, but the timing is very unstable: it changed from the previous ~230 ms to a range of 60-200 ms. Here is the verbose output:
onednn_verbose,188439297.948300,exec,cpu,convolution,jit:avx2,backward_data,src_f32:ap:blocked:aBcd8b::f0 wei_f32:ap:blocked:ABcd8a8b::f0 bia_undef::undef::: dst_f32:ap:blocked:aBcd8b::f0,,alg:convolution_direct,mb10_ic3oc6_ih160oh156kh5sh1dh0ph0_iw160ow156kw5sw1dw0pw0,100.937
Hi @igorsafo, is the problem caused by something on my side?
@w1005444804 Thanks for the additional information! It looks like it is not an integration problem: the data formats are blocked and an optimized implementation is called. I was also able to reproduce the low performance for this case. It does not run on a single thread, but the optimized implementation seems to have a performance gap for this kind of shape.
Is it the first layer in the model? You usually don't need to compute backward wrt data for the first layer. Unfortunately, if there are other layers before this convolution then the gradient is required.
If you can provide more details about the use case (model, hw/ISA), that would be helpful. How much time does this convolution take compared to the overall model time?
@igorsafo Yes, it is the first layer; my model is just a single conv layer. I only wanted to test the speed of forward and backward propagation of convolutions, and found this issue when comparing with PyTorch. Thank you for your reply!
The code is roughly as follows:

...
dnnl::memory::dims conv1_src_tz = { 10, 3, 160, 160 };
auto conv1_src_memory = dnnl::memory({ {conv1_src_tz}, dt::f32, tag::nchw }, engine);
convolution_node conv1(engine, 3, 6, 5, 1, 0, 0, 1, 0, 1, conv1_src_memory);
...
for (size_t i = 0; i < conv1.m_net_fwd.size(); i++) {
    conv1.m_net_fwd[i].execute(s, conv1.m_net_fwd_args[i]);
}
...
conv1.backward2data(top_memory);
for (size_t i = 0; i < conv1.m_net_bwd_data.size(); i++) {
    conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);
}
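Following up on the earlier note about the first layer: since nothing consumes conv1's input gradient here, the backward-data net can be skipped and only the weight gradients computed. A hedged sketch of that change to the driver above (backward2weights is referenced in the timings earlier; m_net_bwd_weights and m_net_bwd_weights_args are assumed member names following the same pattern as m_net_bwd_data):

// Weight gradients are still needed to update conv1's parameters.
conv1.backward2weights(top_memory);
for (size_t i = 0; i < conv1.m_net_bwd_weights.size(); i++) {
    conv1.m_net_bwd_weights[i].execute(s, conv1.m_net_bwd_weights_args[i]);
}

// conv1 is the first layer, so its diff_src feeds nothing and the expensive
// backward-data net can be skipped entirely:
// conv1.backward2data(top_memory);
// for (...) conv1.m_net_bwd_data[i].execute(s, conv1.m_net_bwd_data_args[i]);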
Hi @w1005444804, thank you for the information. I have created an internal tracker for this issue; however, I can't guarantee it will be fixed until we have more requests/use cases for this particular shape.
Fixed in oneDNN v3.4.