FBGEMM icon indicating copy to clipboard operation
FBGEMM copied to clipboard

re-organize rocm header for the upcomming rocm update

Open liligwu opened this issue 2 years ago • 17 comments

liligwu avatar Nov 09 '23 02:11 liligwu

Deploy Preview for pytorch-fbgemm-docs canceled.

Name Link
Latest commit 5719c3e81222160eb3a8b7ca38961905c9fdf458
Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/654c4953fa9378000835ad06

netlify[bot] avatar Nov 09 '23 02:11 netlify[bot]

No ciflow labels are configured for this repo. For information on how to enable CIFlow bot see this wiki

pytorch-bot[bot] avatar Nov 09 '23 02:11 pytorch-bot[bot]

Thanks for the contributions! It'd be nice if you could add explanation a little bit :)

Judging from the changes, ROCm 5.7 has headers at both old and new places, while ROCm 6.0 will deprecate those places. Is my understanding correct?

#include <hipblas.h> // old
#include <hipblas/hipblas.h> // new?

#include <rocm_version.h> // old
#include <rocm-core/rocm_version.h> // new?

shintaro-iwasaki avatar Nov 09 '23 03:11 shintaro-iwasaki

Thanks for the contributions! It'd be nice if you could add explanation a little bit :)

Judging from the changes, ROCm 5.7 has headers at both old and new places, while ROCm 6.0 will deprecate those places. Is my understanding correct?

#include <hipblas.h> // old
#include <hipblas/hipblas.h> // new?

#include <rocm_version.h> // old
#include <rocm-core/rocm_version.h> // new?

Sorry, I didn't explain the changes. Your reasoning is correct. ROCm5.7 is a transition from the old places to the new ones, and ROCm6.0 will deprecate the old header.

liligwu avatar Nov 09 '23 03:11 liligwu

Thanks! It looks great to me. Let me import it and merge it. (The linter issue is not coming from this PR, so don't worry about it.)

shintaro-iwasaki avatar Nov 09 '23 03:11 shintaro-iwasaki

Thanks! It looks great to me. Let me import it and merge it. (The linter issue is not coming from this PR, so don't worry about it.)

Thanks a lot for your help!

liligwu avatar Nov 09 '23 03:11 liligwu

@shintaro-iwasaki has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 09 '23 03:11 facebook-github-bot

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6

/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>

@q10 Do you know if we still need to keep this job active?

shintaro-iwasaki avatar Nov 09 '23 03:11 shintaro-iwasaki

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6

/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>

@q10 Do you know if we still need to keep this job active?

ROCm5.6 is expected to fail. I suggest moving to ROCm5.7; otherwise, there would be more divergences when enabling the UVM tests.

liligwu avatar Nov 09 '23 03:11 liligwu

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6

/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>

@q10 Do you know if we still need to keep this job active?

We can disable it, since the Nova builds for ROCm are broken anyway - the wheel can be built but installing the wheel and running tests will result in segfaults (see example run here). cc @liligwu

q10 avatar Nov 09 '23 04:11 q10

Hmm, 5.6 test is failing: Build x86 Linux Wheels / pytorch/FBGEMM / manywheel-py3_8-rocm5_6

/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/src/split_embeddings_utils.hip:27:10: fatal error: 'rocm-core/rocm_version.h' file not found
#include <rocm-core/rocm_version.h>

@q10 Do you know if we still need to keep this job active?

We can disable it, since the Nova builds for ROCm are broken anyway - the wheel can be built but installing the wheel and running tests will result in segfaults (see example run here). cc @liligwu

Hi @q10 . Yes, the unary test is an old issue. I saw it was skipped in the test https://github.com/pytorch/FBGEMM/blob/bd2d5fcb953dc64e3d07102de363e2c76fe25fbd/.github/scripts/fbgemm_gpu_test.bash#L94 We will take a look at it if you are concerned about the failure.

liligwu avatar Nov 09 '23 04:11 liligwu

@q10 Do you know how to disable it? Is it something @liligwu can just remove 5.6 from one of the GHA yml files? It'd be good to disable this test before merging this PR.

shintaro-iwasaki avatar Nov 09 '23 20:11 shintaro-iwasaki

@q10 Do you know how to disable it? Is it something @liligwu can just remove 5.6 from one of the GHA yml files? It'd be good to disable this test before merging this PR.

This line will need to be updated to disable. Unfortunately, ROCm versions in Nova runs are decided by PyTorch infra, not us, so we aren't able to say "keep 5.7 and disable 5.6".

q10 avatar Nov 09 '23 20:11 q10

Thanks! I see. If that's the case, can we keep 5.6 by adding #ifdef @liligwu until the PyTorch infra disables 5.6 (@liligwu)? We don't want to make the test red while we want to keep the ROCm test if possible.

We will need some versioning based on the ROCm version for a while, but hope it is not too burdensome.

shintaro-iwasaki avatar Nov 09 '23 20:11 shintaro-iwasaki

Thanks! I see. If that's the case, can we keep 5.6 by adding #ifdef @liligwu until the PyTorch infra disables 5.6 (@liligwu)? We don't want to make the test red while we want to keep the ROCm test if possible.

We will need some versioning based on the ROCm version for a while, but hope it is not too burdensome.

Thanks @q10 for the pointer. I'll figure out how to keep rocm5.6 for the header changes. Please hold the merge @shintaro-iwasaki

But I don't have a good idea that switches UVM test enablement between 5.6 and 5.7 currently. I'll let you know if I can find an approach.

liligwu avatar Nov 09 '23 21:11 liligwu

CUDA has torch.version.cuda. Maybe torch.version.hip returns something useful?

shintaro-iwasaki avatar Nov 09 '23 23:11 shintaro-iwasaki

CUDA has torch.version.cuda. Maybe torch.version.hip returns something useful?

Sounds good. I'll try it out. Thanks.

liligwu avatar Nov 10 '23 02:11 liligwu