
Change the hardcoded 128 items per warp in embeddingBag to a variable and optimize for ROCm

Open · liligwu opened this issue 3 years ago · 9 comments

Make @weihanmines's PR https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13 upstreamable. @sryap, would you please review the PR and consider converting it to a draft? Thank you.

liligwu · Aug 05 '22 16:08

Deploy Preview for eclectic-stroopwafel-199537 canceled.

Latest commit: 3ddbc27db0ad51610fc6a77a1d6692cd5c17528b
Latest deploy log: https://app.netlify.com/sites/eclectic-stroopwafel-199537/deploys/63090ca664836500087b15e6

netlify[bot] · Aug 05 '22 16:08

Do you have the perf improvement result on AMD GPUs with this PR? cc @shintaro-iwasaki @sryap @MohammadMahdiJavanmard

jianyuh · Aug 08 '22 16:08

@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · Aug 08 '22 17:08

@jianyuh We see performance improvement in some cases (mostly the backward TBE), but we have asked AMD to share a more in-depth analysis with us (https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13#issuecomment-1205815995) to understand why this change improves performance.

We think that the performance improvement comes from the reduced number of registers used per thread block, but we do not have concrete evidence for this at the moment.

sryap · Aug 08 '22 18:08

Do you have the perf improvement result on AMD GPUs with this PR? cc @shintaro-iwasaki @sryap @MohammadMahdiJavanmard

Please see slide 5 of 20220801_Mathews_AMDPhantom_MI250_Perf_EmbBag_DCPPv2 in workplace.

amathews-amd · Aug 09 '22 15:08

@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · Aug 09 '22 16:08

Hi @liligwu, thanks for your update. If we could use 4 * kWarpSize (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.

shintaro-iwasaki · Aug 09 '22 23:08

Hi @liligwu, thanks for your update. If we could use 4 * kWarpSize (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.

Hi @shintaro-iwasaki, 4 * kWarpSize is a good idea, but kWarpSize is a C++ constant, and these 128/256 values are used in { }, which are jinja template sections. Simply replacing items_per_warp with 4 * kWarpSize triggers the following error:

jinja2.exceptions.UndefinedError: 'kWarpSize' is undefined.

I don't think we can "avoid having to put this logic in the template system" if it's in { }. Alternatively, we would need to get the warp size at the Python level.

liligwu · Aug 10 '22 00:08
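To illustrate the distinction being discussed, here is a minimal sketch. The toy render() helper stands in for jinja2 and is purely illustrative, not FBGEMM's actual codegen: placeholders inside {{ }} must resolve to Python values at generation time, while everything else is emitted verbatim for the C++ compiler to see.

```python
import re

# Toy stand-in for a jinja2 renderer: {{ name }} placeholders are resolved
# from a Python context dict; unmatched text is passed through unchanged.
def render(template: str, context: dict) -> str:
    def substitute(match):
        name = match.group(1)
        if name not in context:
            # analogous to jinja2.exceptions.UndefinedError
            raise NameError(f"'{name}' is undefined")
        return str(context[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# Inside {{ }} the value must exist in Python, so a C++ constant such as
# kWarpSize is out of reach at render time:
try:
    render("int items = {{ kWarpSize }};", {})
except NameError as err:
    print("render-time error:", err)

# Outside {{ }} the text passes through untouched, and the C++ compiler
# resolves kWarpSize later, at compile time:
print(render("int items = 4 * kWarpSize;", {}))
```

This is why emitting the literal text `4 * kWarpSize` works while referencing kWarpSize inside a template expression does not.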

Hi @liligwu, thanks for your update. If we could use 4 * kWarpSize (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.

Hi @shintaro-iwasaki, 4 * kWarpSize is a good idea, but kWarpSize is a C++ constant, and these 128/256 values are used in { }, which are jinja template sections. Simply replacing items_per_warp with 4 * kWarpSize triggers the following error:

jinja2.exceptions.UndefinedError: 'kWarpSize' is undefined.

I don't think we can "avoid having to put this logic in the template system" if it's in { }. Alternatively, we would need to get the warp size at the Python level.

We can use kWarpSize outside {}. I see kWarpSize is already used in this file (although it's defined in https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/include/fbgemm_gpu/fbgemm_cuda_utils.cuh#L32 ). In the future we should refactor and reuse the kWarpSize from c10.

jianyuh · Aug 10 '22 00:08
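The other option mentioned above, resolving the warp size at the Python level during code generation, can be sketched as follows. The function name, the is_rocm flag, and the 4x multiplier are assumptions for illustration; only the 128/256 values come from the thread.

```python
# Hypothetical sketch: derive items_per_warp at codegen (Python) time instead
# of hardcoding 128/256 in the template. is_rocm and the function name are
# illustrative, not FBGEMM's actual API.
def items_per_warp(is_rocm: bool) -> int:
    warp_size = 64 if is_rocm else 32  # AMD wavefront vs. NVIDIA warp
    return 4 * warp_size

print(items_per_warp(False))  # 128 on NVIDIA GPUs
print(items_per_warp(True))   # 256 on ROCm
```

A value computed this way can be passed into the jinja context, at the cost of duplicating warp-size knowledge that the C++ side already has in kWarpSize.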

Hi @liligwu, thanks for your update. If we could use 4 * kWarpSize (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.

Hi @shintaro-iwasaki, 4 * kWarpSize is a good idea, but kWarpSize is a C++ constant, and these 128/256 values are used in { }, which are jinja template sections. Simply replacing items_per_warp with 4 * kWarpSize triggers the following error:

jinja2.exceptions.UndefinedError: 'kWarpSize' is undefined.

I don't think we can "avoid having to put this logic in the template system" if it's in { }. Alternatively, we would need to get the warp size at the Python level.

We can use kWarpSize outside {}. I see kWarpSize is already used in this file (although it's defined in https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/include/fbgemm_gpu/fbgemm_cuda_utils.cuh#L32 ). In the future we should refactor and reuse the kWarpSize from c10.

Sorry for the late reply. We are still working on this suggestion but haven't found a good solution for it yet. In particular, items_per_warp is used in a jinja for loop, https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L233, although we can easily get items_per_warp outside { } in if statements like https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L234

liligwu · Aug 26 '22 16:08
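To see why the jinja for loop is the sticking point: a {% for %} loop is unrolled by the template engine at generation time, so every loop value must exist as a Python object and cannot come from a C++ constant such as kWarpSize. A toy sketch (the generated C++, run_kernel, and the values are illustrative only, not FBGEMM's real template):

```python
# Toy unrolling of a jinja-style {% for %} loop. Because the loop executes
# in Python, its iterable must hold concrete Python values; run_kernel and
# the 128/256 values are illustrative assumptions.
def unroll_cases(values):
    return "\n".join(f"case {v}: run_kernel<{v}>(); break;" for v in values)

print(unroll_cases([128, 256]))
```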

Thanks @liligwu for your investigation! We can revisit this issue on refactoring. Would you mind rebasing this PR and making sure all tests pass?

shintaro-iwasaki · Aug 26 '22 17:08

Thanks @liligwu for your investigation! We can revisit this issue on refactoring. Would you mind rebasing this PR and making sure all tests pass?

I'll update the branch soon. Thank you.

liligwu · Aug 26 '22 17:08

Hi @shintaro-iwasaki, the branch is updated. Tests passed locally on my side; you can wait for the CI. Thank you!

liligwu · Aug 26 '22 18:08

@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · Aug 26 '22 19:08

@liligwu Merged. Thanks for your contribution!

shintaro-iwasaki · Aug 29 '22 15:08

@liligwu Merged. Thanks for your contribution!

@shintaro-iwasaki , I appreciate your help.

liligwu · Aug 29 '22 16:08