Young-Jun Ko
Hi Bangsheng, could you please share more information, e.g. which GPU architecture you are targeting? A small [Gist](gist.github.com) with a reproducer would be very helpful. Thanks!
Hello, each warp computes a 16x16 tile, so this column offset should be fine.
@lw921014 can we close this if that answered your question, or is there something else regarding this we can help with?
EDIT: nvm, I missed the explicit cast to FP16 in the Python code. I thought the dtype was taken from the input. The main difference in the generated PTX seems...
Another difference is a lot of explicit packing: for half, for example, the PTX shows 32-bit shared loads in the epilogue that can be passed directly to 128-bit global stores, whereas...