FBGEMM re-organize rocm header for the upcomming rocm update

Nov 09 '23 02:11 liligwu

Deploy Preview for pytorch-fbgemm-docs canceled.

Name	Link
Latest commit	5719c3e81222160eb3a8b7ca38961905c9fdf458
Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/654c4953fa9378000835ad06

Nov 09 '23 02:11 netlify[bot]

No ciflow labels are configured for this repo. For information on how to enable CIFlow bot see this wiki

Nov 09 '23 02:11 pytorch-bot[bot]

Thanks for the contributions! It'd be nice if you could add explanation a little bit :)

Judging from the changes, ROCm 5.7 has headers at both old and new places, while ROCm 6.0 will deprecate those places. Is my understanding correct?

#include <hipblas.h> // old
#include <hipblas/hipblas.h> // new?

#include <rocm_version.h> // old
#include <rocm-core/rocm_version.h> // new?

Nov 09 '23 03:11 shintaro-iwasaki

Thanks for the contributions! It'd be nice if you could add explanation a little bit :)

Judging from the changes, ROCm 5.7 has headers at both old and new places, while ROCm 6.0 will deprecate those places. Is my understanding correct?
#include <hipblas.h> // old
#include <hipblas/hipblas.h> // new?

#include <rocm_version.h> // old
#include <rocm-core/rocm_version.h> // new?

Sorry, I didn't explain the changes. Your reasoning is correct. ROCm5.7 is a transition from the old places to the new ones, and ROCm6.0 will deprecate the old header.

Nov 09 '23 03:11 liligwu

Thanks! It looks great to me. Let me import it and merge it. (The linter issue is not coming from this PR, so don't worry about it.)

Nov 09 '23 03:11 shintaro-iwasaki

Thanks! It looks great to me. Let me import it and merge it. (The linter issue is not coming from this PR, so don't worry about it.)

Thanks a lot for your help!

Nov 09 '23 03:11 liligwu

@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Nov 09 '23 03:11 facebook-github-bot

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6

/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>

@q10 Do you know if we still need to keep this job active?

Nov 09 '23 03:11 shintaro-iwasaki

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6
/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>
@q10 Do you know if we still need to keep this job active?

ROCm5.6 is expected to fail. I suggest moving to ROCm5.7; otherwise, there would be more divergences when enabling the UVM tests.

Nov 09 '23 03:11 liligwu

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6
/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>
@q10 Do you know if we still need to keep this job active?

We can disable it, since the Nova builds for ROCm are broken anyway - the wheel can be built but installing the wheel and running tests will result in segfaults (see example run here). cc @liligwu

Nov 09 '23 04:11 q10

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6
/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>
@q10 Do you know if we still need to keep this job active?
We can disable it, since the Nova builds for ROCm are broken anyway - the wheel can be built but installing the wheel and running tests will result in segfaults (see example run here). cc @liligwu

Hi @q10 . Yes, the unary test is an old issue. I saw it was skipped in the test https://github.com/pytorch/FBGEMM/blob/bd2d5fcb953dc64e3d07102de363e2c76fe25fbd/.github/scripts/fbgemm_gpu_test.bash#L94 We will take a look at it if you are concerned about the failure.

Nov 09 '23 04:11 liligwu

@q10 Do you know how to disable it? Is it something @liligwu can just remove 5.6 from one of the GHA yml files? It'd be good to disable this test before merging this PR.

Nov 09 '23 20:11 shintaro-iwasaki

@q10 Do you know how to disable it? Is it something @liligwu can just remove 5.6 from one of the GHA yml files? It'd be good to disable this test before merging this PR.

This line will need to be updated to disable. Unfortunately, ROCm versions in Nova runs are decided by PyTorch infra, not us, so we aren't able to say "keep 5.7 and disable 5.6".

Nov 09 '23 20:11 q10

Thanks! I see. If that's the case, can we keep 5.6 by adding #ifdef @liligwu until the PyTorch infra disables 5.6 (@liligwu)? We don't want to make the test red while we want to keep the ROCm test if possible.

We will need some versioning based on the ROCm version for a while, but hope it is not too burdensome.

Nov 09 '23 20:11 shintaro-iwasaki

Thanks! I see. If that's the case, can we keep 5.6 by adding #ifdef @liligwu until the PyTorch infra disables 5.6 (@liligwu)? We don't want to make the test red while we want to keep the ROCm test if possible.

We will need some versioning based on the ROCm version for a while, but hope it is not too burdensome.

Thanks @q10 for the pointer. I'll figure out how to keep rocm5.6 for the header changes. Please hold the merge @shintaro-iwasaki

But I don't have a good idea that switches UVM test enablement between 5.6 and 5.7 currently. I'll let you know if I can find an approach.

Nov 09 '23 21:11 liligwu

CUDA has torch.version.cuda. Maybe torch.version.hip returns something useful?

Nov 09 '23 23:11 shintaro-iwasaki

CUDA has torch.version.cuda. Maybe torch.version.hip returns something useful?

Sounds good. I'll try it out. Thanks.

Nov 10 '23 02:11 liligwu

re-organize rocm header for the upcomming rocm update

✅ Deploy Preview for pytorch-fbgemm-docs canceled.

Deploy Preview for pytorch-fbgemm-docs canceled.