Change the hardcoded 128 items per warp in embeddingBag to a variable and optimize for ROCm
Make @weihanmines's PR https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13 upstreamable. @sryap, would you please review the PR and consider converting it to a draft? Thank you.
Do you have perf improvement results on AMD GPUs with this PR? cc @shintaro-iwasaki @sryap @MohammadMahdiJavanmard
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jianyuh We see performance improvement in some cases (mostly the backward TBE) but we have requested AMD to share a more in-depth analysis with us (https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13#issuecomment-1205815995) to understand why this would improve the performance.
We think that the performance improvement comes from the reduced number of registers used per thread block, but we do not have concrete evidence to show this at the moment.
Please see slide 5 of 20220801_Mathews_AMDPhantom_MI250_Perf_EmbBag_DCPPv2 in workplace.
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @liligwu, thanks for your update. If we could use `4 * kWarpSize` (see above), it would be easier to maintain this parameter for both AMD and NVIDIA GPUs. I'd appreciate it if you could take a look at this option.
Hi @shintaro-iwasaki, `4 * kWarpSize` is a good idea, but `kWarpSize` is a C++ constant, and we are using these `128`/`256` values in `{ }`, which are Jinja template sections. Simply replacing `items_per_warp` with `4 * kWarpSize` triggers the following error:
`jinja2.exceptions.UndefinedError: 'kWarpSize' is undefined`
I don't think we can "avoid having to put this logic in the template system" if it's in `{ }`. Otherwise, we would need to get the warp size at the Python level.
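As an illustration, here is a minimal standalone sketch (hypothetical template strings, not FBGEMM's actual codegen) of why the render fails: Jinja knows nothing about C++ constants, so an expression that multiplies by `kWarpSize` raises `UndefinedError`, while a value passed in from Python renders fine.

```python
# Minimal sketch of the issue; the template strings are hypothetical,
# not the real FBGEMM codegen templates.
import jinja2

# kWarpSize only exists as a C++ constant in the generated source, so
# evaluating it inside a Jinja expression fails at render time.
bad = jinja2.Template("constexpr int kItemsPerWarp = {{ 4 * kWarpSize }};")
try:
    bad.render()
except jinja2.exceptions.UndefinedError as err:
    print(err)  # 'kWarpSize' is undefined

# Passing the value from Python works, but then the warp size has to be
# known at the Python level when the kernels are generated.
ok = jinja2.Template("constexpr int kItemsPerWarp = {{ items_per_warp }};")
print(ok.render(items_per_warp=128))  # constexpr int kItemsPerWarp = 128;
```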
We can use `kWarpSize` outside `{}`. I see `kWarpSize` is already used in this file (although it's defined in https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/include/fbgemm_gpu/fbgemm_cuda_utils.cuh#L32). In the future, we should refactor and reuse `kWarpSize` from c10.
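To make the suggestion concrete, a rough sketch (hypothetical template text, not the real kernel template): when `4 * kWarpSize` sits in the C++ portion of the template rather than inside a Jinja expression, Jinja copies it through verbatim and the CUDA/HIP compiler resolves it against the constant defined in fbgemm_cuda_utils.cuh.

```python
# Sketch only: the expression lives in the C++ text, outside the Jinja
# braces, so Jinja never has to evaluate kWarpSize.
import jinja2

template = jinja2.Template(
    "template <typename cache_t>\n"
    "__global__ void {{ kernel_name }}_kernel(const cache_t* grad_output) {\n"
    "  // Resolved by the C++/HIP compiler, not by Jinja:\n"
    "  constexpr int kItemsPerWarp = 4 * kWarpSize;\n"
    "  // ... kernel body ...\n"
    "}\n"
)
# kernel_name is a placeholder for illustration only.
print(template.render(kernel_name="split_embedding_backward"))
```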
Sorry for the late reply. We are still working on this suggestion but haven't found a good solution yet. In particular, `items_per_warp` is in a Jinja for loop (https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L233), though we can easily move `items_per_warp` out of `{ }` in the if statements like https://github.com/pytorch/FBGEMM/blob/86bf2ea27108da079b0b43e80acd0bc5ff5480a9/fbgemm_gpu/codegen/embedding_backward_split_indice_weights_template.cu#L234.
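A simplified sketch of the two cases (hypothetical template text, not the actual embedding_backward_split_indice_weights_template.cu): the for-loop values are consumed by Jinja at render time, so they must be plain Python integers, and using `4 * kWarpSize` there would require knowing the warp size on the Python side; an if condition, by contrast, can be emitted as C++ text and left to the compiler.

```python
# Simplified sketch; the template text below is hypothetical.
import jinja2

template = jinja2.Template(
    # The for loop runs at render time, so its values must be Python ints;
    # a C++ expression like 4 * kWarpSize cannot be iterated here.
    "{% for items_per_warp in [128, 256] %}"
    "instantiate_kernel({{ items_per_warp }});\n"
    "{% endfor %}"
    # An if condition can simply be emitted as C++ text, so kWarpSize is
    # evaluated by the compiler rather than by Jinja.
    "if (items_per_warp == 4 * kWarpSize) { /* ... */ }\n"
)
print(template.render())
```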
Thanks @liligwu for your investigation! We can revisit this issue when we refactor. Would you mind rebasing this PR and making sure all tests pass?
I'll update the branch soon. Thank you.
Hi @shintaro-iwasaki, the branch has been updated. Tests passed locally on my side; you can wait for the CI. Thank you!
@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@liligwu Merged. Thanks for your contribution!
@shintaro-iwasaki , I appreciate your help.