Young-Jun Ko
Hi Bangsheng, could you please share more information, e.g. which GPU architecture you are targeting? A small [Gist](gist.github.com) with a reproducer would be very helpful. Thanks!
Hello, each warp computes a 16x16 tile, so this column offset should be fine.
@lw921014 can we close this if that answered your question, or is there something else regarding this we can help with?
EDIT: nvm, I missed the explicit cast to FP16 in the Python code. I thought the dtype was taken from the input. The main difference in the generated PTX seems...
Another difference is a lot of explicit packing: for half, for example, the PTX shows 32-bit shared loads in the epilogue that can be passed directly to 128-bit global stores, whereas...