Haicheng Wu

323 comments of Haicheng Wu

You can take a look at this igemm example: https://github.com/NVIDIA/cutlass/blob/master/test/unit/gemm/threadblock/mma_pipelined_slicedk.cu#L42-L64 . We haven't added sliced-K support to sgemm, but the concepts are the same. The shared memory load code is...
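
The sliced-K idea in that test can be summarized at the tile level: the threadblock K tile is split across warps, so each warp produces a partial accumulation that is reduced afterwards. Below is a minimal sketch with hypothetical tile shapes (not necessarily the shapes used in the linked test) showing how the warp count along K becomes greater than 1.

```
// Hypothetical tile shapes illustrating sliced-K: the threadblock K tile is
// split across warps, so kWarpsK > 1 and each warp holds a partial sum.
#include "cutlass/gemm/gemm.h"

using ThreadblockShape = cutlass::gemm::GemmShape<64, 64, 64>;
using WarpShape        = cutlass::gemm::GemmShape<32, 32, 32>;

constexpr int kWarpsM = ThreadblockShape::kM / WarpShape::kM;  // 2
constexpr int kWarpsN = ThreadblockShape::kN / WarpShape::kN;  // 2
constexpr int kWarpsK = ThreadblockShape::kK / WarpShape::kK;  // 2 -> sliced-K

static_assert(kWarpsK > 1, "sliced-K: the K tile is split across warps");
```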

`tbk x stage` shows how many elements in the K dimension are processed in one complete software pipeline. We can add the warp tile size to the kernel name so...
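
As a concrete illustration of `tbk x stage`, here is a small sketch with assumed (not CUTLASS-default) tile parameters: with a threadblock K tile of 32 and 2 stages, one complete software pipeline covers 64 elements along K.

```
// Assumed tile parameters, only to illustrate the tbk x stage count.
#include "cutlass/gemm/gemm.h"

using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 32>;  // tbm, tbn, tbk
constexpr int Stages = 2;                                         // pipeline stages

// Elements along K consumed by one complete software pipeline:
constexpr int kKPerPipeline = ThreadblockShape::kK * Stages;      // 32 x 2 = 64
static_assert(kKPerPipeline == 64, "tbk x stage");
```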

It is not ready yet. We will add it as a patch.

https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/conv2d_fprop_with_reduction_sm75.cu is an example of calculating the sum needed by BN. It involves two kernels. The first one calculates the partial sum of each threadblock. The second kernel calculates the final sum....
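
The same two-kernel pattern can be sketched in plain CUDA (this is not the CUTLASS code from the linked test, just the shape of the computation): kernel 1 writes one partial sum per threadblock, kernel 2 reduces those partials into the final sum.

```
#include <cuda_runtime.h>

// Kernel 1: each threadblock reduces its slice of x and writes one partial sum.
// Assumes blockDim.x is a power of two.
__global__ void partial_sums(const float* x, float* partials, int n) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + tid;
  smem[tid] = (idx < n) ? x[idx] : 0.0f;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) smem[tid] += smem[tid + stride];
    __syncthreads();
  }
  if (tid == 0) partials[blockIdx.x] = smem[0];   // one value per threadblock
}

// Kernel 2: a single threadblock reduces the per-threadblock partials.
__global__ void final_sum(const float* partials, float* out, int num_partials) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;
  float acc = 0.0f;
  for (int i = tid; i < num_partials; i += blockDim.x) acc += partials[i];
  smem[tid] = acc;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) smem[tid] += smem[tid + stride];
    __syncthreads();
  }
  if (tid == 0) *out = smem[0];
}

// Launch sketch: partial_sums<<<blocks, 256, 256 * sizeof(float)>>>(x, partials, n);
//                final_sum<<<1, 256, 256 * sizeof(float)>>>(partials, out, blocks);
```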

The standard conv we do is D = alpha x conv(A, B) + beta x C

In inference, we don't need batch norm, so I am not really sure what you need. If your variance is a scalar, you can set it as `alpha`. If it...
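
If the variance really is a single scalar, it can be folded into `alpha` of the linear-combination epilogue that computes D = alpha x conv(A, B) + beta x C. The fragment below is a sketch of that idea; `ImplicitGemm`, `problem_size`, and the tensors are assumed to be set up as in the CUTLASS device-level conv examples, and the exact `Arguments` layout should be checked against those examples.

```
// Sketch only: fold a scalar (e.g. 1 / variance) into alpha of the epilogue.
float variance = 2.0f;            // your per-tensor scalar (assumed)
float alpha = 1.0f / variance;    // D = alpha * conv(A, B) + beta * C
float beta  = 0.0f;               // no C term

typename ImplicitGemm::Arguments args{
  problem_size,
  tensor_a.device_ref(),
  tensor_b.device_ref(),
  tensor_c.device_ref(),
  tensor_d.device_ref(),
  {alpha, beta}                   // linear-combination epilogue parameters
};
```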

Are you using `fp32` for inference? Your error is that it cannot find a corresponding `operator()` for `cutlass::gemm::warp::MmaTensorOp`. Your top-level template configuration should be something like this: ``` cutlass::conv::kernel::DefaultConv2dFprop< float,...
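
For `fp32`, the operator class should be SIMT rather than tensor op, which is what the `MmaTensorOp` error points at. Below is a sketch of a float NHWC configuration following the pattern of the CUTLASS fp32 SIMT conv unit tests; the tile shapes and some parameter positions here are assumptions, so the exact template signature should be checked against those tests.

```
#include "cutlass/conv/kernel/default_conv2d_fprop.h"
#include "cutlass/conv/device/implicit_gemm_convolution.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
  float, cutlass::layout::TensorNHWC,                 // A: activations
  float, cutlass::layout::TensorNHWC,                 // B: filters
  float, cutlass::layout::TensorNHWC,                 // C/D: output
  float,                                              // accumulator
  cutlass::arch::OpClassSimt,                         // fp32 -> SIMT, not tensor op
  cutlass::arch::Sm50,
  cutlass::gemm::GemmShape<128, 128, 8>,              // threadblock tile (assumed)
  cutlass::gemm::GemmShape<32, 64, 8>,                // warp tile (assumed)
  cutlass::gemm::GemmShape<1, 1, 1>,                  // SIMT instruction shape
  cutlass::epilogue::thread::LinearCombination<float, 1, float, float>,
  cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
  2,                                                  // pipeline stages
  cutlass::arch::OpMultiplyAdd,
  cutlass::conv::IteratorAlgorithm::kAnalytic
>::Kernel;

using ImplicitGemm =
    cutlass::conv::device::ImplicitGemmConvolution<Conv2dFpropKernel>;
```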

You are using NCHW, not NHWC.

It is most efficient to use NHWC on GPUs. You can convert NCHW to NHWC by using our utility: https://github.com/NVIDIA/cutlass/blob/master/tools/util/include/cutlass/util/device_nchw_to_nhwc.h
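
For reference, the conversion the utility performs is just a re-indexing from N,C,H,W order to N,H,W,C order. A naive plain-CUDA version of that indexing (not the CUTLASS utility itself) looks like this:

```
#include <cuda_runtime.h>

// Naive NCHW -> NHWC copy; one thread per element.
__global__ void nchw_to_nhwc_naive(const float* in, float* out,
                                   int N, int C, int H, int W) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int total = N * C * H * W;
  if (idx >= total) return;

  // Decompose the linear NCHW index.
  int w = idx % W;
  int h = (idx / W) % H;
  int c = (idx / (W * H)) % C;
  int n = idx / (W * H * C);

  // Write to the corresponding NHWC position.
  out[((n * H + h) * W + w) * C + c] = in[idx];
}

// Launch sketch:
// nchw_to_nhwc_naive<<<(N * C * H * W + 255) / 256, 256>>>(in, out, N, C, H, W);
```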