Change the hardcoded 128 items per warp in embeddingBag to a variable and optimize for ROCm
Make @weihanmines's PR https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13 upstreamable. @sryap, would you please review the PR and consider converting it to a draft? Thank you.
Do you have perf improvement results on AMD GPUs with this PR? cc @shintaro-iwasaki @sryap @MohammadMahdiJavanmard
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jianyuh We see performance improvement in some cases (mostly the backward TBE) but we have requested AMD to share a more in-depth analysis with us (https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13#issuecomment-1205815995) to understand why this would improve the performance.
We think that the performance improvement comes from the reduced number of registers used per thread block, but we do not have concrete evidence to show this at the moment.
Please see slide 5 of 20220801_Mathews_AMDPhantom_MI250_Perf_EmbBag_DCPPv2 in workplace.
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @liligwu, thanks for your update. If we could use `4 * kWarpSize` (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.
Hi @shintaro-iwasaki, `4 * kWarpSize` is a good idea, but `kWarpSize` is a C++ constant, and we are using these `128`/`256` values in `{ }`, which are Jinja template sections. Simply replacing `items_per_warp` with `4 * kWarpSize` triggers the following error:
`jinja2.exceptions.UndefinedError: 'kWarpSize' is undefined`
I don't think we can "avoid having to put this logic in the template system" if it's in `{ }`. Otherwise, we would need to get the warp size at the Python level.
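As an illustration, here is a minimal standalone sketch (hypothetical template strings, not FBGEMM's actual codegen) of why the render fails: Jinja knows nothing about C++ constants, so an expression that multiplies by `kWarpSize` raises `UndefinedError`, while a value passed in from Python renders fine.

```python
# Minimal sketch of the issue; the template strings are hypothetical,
# not the real FBGEMM codegen templates.
import jinja2

# kWarpSize only exists as a C++ constant in the generated source, so
# evaluating it inside a Jinja expression fails at render time.
bad = jinja2.Template("constexpr int kItemsPerWarp = {{ 4 * kWarpSize }};")
try:
    bad.render()
except jinja2.exceptions.UndefinedError as err:
    print(err)  # 'kWarpSize' is undefined

# Passing the value from Python works, but then the warp size has to be
# known at the Python level when the kernels are generated.
ok = jinja2.Template("constexpr int kItemsPerWarp = {{ items_per_warp }};")
print(ok.render(items_per_warp=128))  # constexpr int kItemsPerWarp = 128;
```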
We can use `kWarpSize` outside `{}`. I see `kWarpSize` is already used in this file (although it's defined in https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/include/fbgemm_gpu/fbgemm_cuda_utils.cuh#L32). In the future, we should refactor and reuse `kWarpSize` from c10.
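To make the suggestion concrete, a rough sketch (hypothetical template text, not the real kernel template): when `4 * kWarpSize` sits in the C++ portion of the template rather than inside a Jinja expression, Jinja copies it through verbatim and the CUDA/HIP compiler resolves it against the constant defined in fbgemm_cuda_utils.cuh.

```python
# Sketch only: the expression lives in the C++ text, outside the Jinja
# braces, so Jinja never has to evaluate kWarpSize.
import jinja2

template = jinja2.Template(
    "template <typename cache_t>\n"
    "__global__ void {{ kernel_name }}_kernel(const cache_t* grad_output) {\n"
    "  // Resolved by the C++/HIP compiler, not by Jinja:\n"
    "  constexpr int kItemsPerWarp = 4 * kWarpSize;\n"
    "  // ... kernel body ...\n"
    "}\n"
)
# kernel_name is a placeholder for illustration only.
print(template.render(kernel_name="split_embedding_backward"))
```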
Sorry for the late reply. We are still working on this suggestion but haven't found a good solution yet. In particular, `items_per_warp` is in a Jinja for loop (https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L233), though we can easily move `items_per_warp` out of `{ }` in the if statements like https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L234.
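A simplified sketch of the two cases (hypothetical template text, not the actual embedding_backward_split_indice_weights_template.cu): the for-loop values are consumed by Jinja at render time, so they must be plain Python integers, and using `4 * kWarpSize` there would require knowing the warp size on the Python side; an if condition, by contrast, can be emitted as C++ text and left to the compiler.

```python
# Simplified sketch; the template text below is hypothetical.
import jinja2

template = jinja2.Template(
    # The for loop runs at render time, so its values must be Python ints;
    # a C++ expression like 4 * kWarpSize cannot be iterated here.
    "{% for items_per_warp in [128, 256] %}"
    "instantiate_kernel({{ items_per_warp }});\n"
    "{% endfor %}"
    # An if condition can simply be emitted as C++ text, so kWarpSize is
    # evaluated by the compiler rather than by Jinja.
    "if (items_per_warp == 4 * kWarpSize) { /* ... */ }\n"
)
print(template.render())
```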
Thanks @liligwu for your investigation! We can revisit this issue when we refactor. Would you mind rebasing this PR and making sure all tests pass?
I'll update the branch soon. Thank you.
Hi @shintaro-iwasaki, the branch has been updated. Tests passed locally on my side; you can wait for the CI. Thank you!
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@liligwu Merged. Thanks for your contribution!
@shintaro-iwasaki , I appreciate your help.