Haicheng Wu
You can take a look at this igemm example: https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/threadblock/mma_pipelined_slicedk.cu#L42-L64 . We haven't added sliced-k support to sgemm, but the concepts are the same. The shared memory load code is...
`tbk x stage` shows how many elements in the k dimension are processed in one complete software pipeline. We can add the warp tile size to the kernel name so...
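For illustration, a minimal sketch of that accounting (the tile shape and stage count are assumptions for the example, not tied to any particular kernel):
```cpp
#include "cutlass/gemm/gemm.h"

// Illustrative assumption: a threadblock tile of GemmShape<128, 128, 8>
// with a 2-stage software pipeline.
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 8>;
constexpr int kStages = 2;

// tbk x stage: elements along the k dimension consumed by one complete
// software pipeline = 8 x 2 = 16.
constexpr int kElementsPerPipeline = ThreadblockShape::kK * kStages;
```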
It will be in 2.9.
It is not ready yet. We will add it as a patch.
https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/conv2d_fprop_with_reduction_sm75.cu is an example of calculating the sum needed by BN. It involves two kernels. The first one calculates the partial sum of each threadblock. The second kernel calculates the final sum...
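To illustrate just the two-pass structure (this is a hypothetical standalone sketch, not the CUTLASS reduction kernel from the linked test), the second step could look like this:
```cpp
// Hypothetical sketch of the second kernel: reduce the per-threadblock
// partial sums produced by the first kernel into one final sum per channel.
__global__ void reduce_partial_sums(float const *partial_sums,
                                    float *final_sum,
                                    int num_partials,
                                    int channels) {
  int c = blockIdx.x * blockDim.x + threadIdx.x;
  if (c < channels) {
    float sum = 0.0f;
    for (int i = 0; i < num_partials; ++i) {
      // Partial sums are stored as [partial][channel].
      sum += partial_sums[i * channels + c];
    }
    final_sum[c] = sum;
  }
}
```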
The standard conv we do is D = alpha x conv(A, B) + beta x C
In inference, we don't need batch norm. So, I am not really sure what you need. If your variance is a scalar, you can set it as `alpha`. If it...
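As a sketch of the folding (the BN statistics below are placeholder values): at inference, batch norm reduces to a fixed scale and bias, and a scalar scale maps directly onto `alpha` in `D = alpha x conv(A, B) + beta x C`.
```cpp
#include <cmath>

int main() {
  // Placeholder BN statistics for illustration.
  float gamma = 1.0f;   // learned BN scale
  float var   = 0.25f;  // running variance
  float eps   = 1e-5f;

  // Folded inference-time BN scale; if it is a scalar like this,
  // pass it as the epilogue's alpha.
  float alpha = gamma / std::sqrt(var + eps);
  (void)alpha;
  return 0;
}
```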
Are you using `fp32` for inference? Your error is that it cannot find a corresponding `operator()` for `cutlass::gemm::warp::MmaTensorOp`. Your top-level template configuration should be something like this:
```
cutlass::conv::kernel::DefaultConv2dFprop<
    float,...
```
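For context, here is a hedged sketch of a complete `fp32` SIMT fprop configuration (the tile shapes, epilogue, and stage count are assumptions modeled on the CUTLASS fprop examples, not your actual kernel); with `OpClassSimt` the warp-level MMA is `MmaSimt` rather than `MmaTensorOp`:
```cpp
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

// Sketch: an fp32 SIMT (non-tensor-op) fprop kernel. Tile sizes are
// illustrative assumptions.
using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
    float, cutlass::layout::TensorNHWC,       // ElementA, LayoutA (activations)
    float, cutlass::layout::TensorNHWC,       // ElementB, LayoutB (filters)
    float, cutlass::layout::TensorNHWC,       // ElementC, LayoutC (output)
    float,                                    // ElementAccumulator
    cutlass::arch::OpClassSimt,               // SIMT math, not tensor cores
    cutlass::arch::Sm50,                      // SIMT-capable architecture
    cutlass::gemm::GemmShape<128, 128, 8>,    // threadblock tile
    cutlass::gemm::GemmShape<32, 64, 8>,      // warp tile
    cutlass::gemm::GemmShape<1, 1, 1>,        // SIMT "instruction" shape
    cutlass::epilogue::thread::LinearCombination<float, 1, float, float>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    2,                                        // pipeline stages
    cutlass::arch::OpMultiplyAdd,
    cutlass::conv::IteratorAlgorithm::kOptimized
>::Kernel;

using ImplicitGemm =
    cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
```
The `{alpha, beta}` pair passed to the linear-combination epilogue is what implements `D = alpha x conv(A, B) + beta x C`.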
You are using NCHW, not NHWC.
It is most efficient to use NHWC on GPUs. You can convert NCHW to NHWC by using our utility: https://github.com/NVIDIA/cutlass/blob/master/tools/util/include/cutlass/util/device_nchw_to_nhwc.h
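A minimal usage sketch, assuming the header exposes `nchw_to_nhwc` taking input/output extents as `Tensor4DCoord` (which CUTLASS orders as n, h, w, c), the two `TensorRef`s, and a stream; the tensor sizes below are placeholders:
```cpp
#include "cutlass/util/device_nchw_to_nhwc.h"
#include "cutlass/util/host_tensor.h"

int main() {
  // Placeholder tensor sizes.
  int N = 1, C = 64, H = 56, W = 56;

  // Tensor4DCoord extents are ordered (n, h, w, c) for both tensors;
  // the layout type controls how they map to memory.
  cutlass::HostTensor<float, cutlass::layout::TensorNCHW> src({N, H, W, C});
  cutlass::HostTensor<float, cutlass::layout::TensorNHWC> dst({N, H, W, C});

  // ... fill src on the host, then src.sync_device() ...

  // Assumed signature based on the linked header.
  cutlass::nchw_to_nhwc<float>(src.extent(), dst.extent(),
                               src.device_ref(), dst.device_ref(),
                               /*stream=*/nullptr);
  return 0;
}
```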