Better error message for no GPU or incompatible GPU

Open zkhatami opened this issue 3 years ago • 3 comments

This change gives the user a clue about what is happening when a program is compiled on a node with no GPU, or is compiled for a different compute capability than the one it runs on. Previously, neither scenario produced a useful error message. The proposed changes make these problems easier for users to troubleshoot.

This fix addresses issue #1785 reported on Thrust: https://github.com/NVIDIA/thrust/issues/1785

zkhatami avatar Sep 21 '22 17:09 zkhatami

From issue#1785 on thrust (https://github.com/NVIDIA/cccl/issues/818), for this small test case:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    int main()
    {
      thrust::device_vector<int> dv;
      thrust::sort(dv.begin(), dv.end());
    }

When this is compiled with -gpu=cc60 and then run on a system with cc80, the error message is:

    terminate called after throwing an instance of 'thrust::system::system_error'
      what(): radix_sort: failed on 1st step: cudaErrorUnsupportedPtxVersion: the provided PTX was compiled with an unsupported toolchain.
    Aborted

This does not help the user understand what is happening. This change addresses that, so a clearer message is shown instead:

    Incompatible GPU: you are trying to run this program on sm_80, different from the one that it was compiled for
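As a rough illustration of how such a message could be assembled, here is a minimal sketch in plain C++. The helper `describe_gpu_mismatch` and its parameters are hypothetical and are not part of CUB or Thrust. It also assumes that only an exact SM match counts as compatible, which is a simplification: real CUDA binaries that embed PTX can be JIT-compiled for newer architectures.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not a CUB or Thrust API): builds a readable message
// when the device's SM version is not among the targets the binary was
// compiled for. Returns an empty string when a matching target exists.
std::string describe_gpu_mismatch(int device_sm, const std::vector<int>& compiled_sms)
{
  assert(device_sm > 0 && "SM versions are positive, e.g. 60 or 80");

  // Simplifying assumption: only an exact match is treated as compatible.
  if (std::find(compiled_sms.begin(), compiled_sms.end(), device_sm) != compiled_sms.end())
  {
    return {};
  }

  // Render the compiled targets as "sm_60, sm_70, ...".
  std::string built_for;
  for (int sm : compiled_sms)
  {
    if (!built_for.empty())
    {
      built_for += ", ";
    }
    built_for += "sm_" + std::to_string(sm);
  }
  if (built_for.empty())
  {
    built_for = "no GPU target";
  }

  return "Incompatible GPU: this program is running on sm_" + std::to_string(device_sm) +
         " but was compiled for " + built_for;
}
```

Calling `describe_gpu_mismatch(80, {60})` mirrors the cc60/cc80 case above. Naming both the running and the compiled architectures tells the user which compiler flag to change.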

zkhatami avatar Sep 21 '22 18:09 zkhatami

@allisonvacanti @jrhemstad The "Reviewers" button is disabled for me for some reason and I can't fix it, so I am not able to add any reviewers for this change. Would you please take a look at the changes and share your comments with me? Thanks!

zkhatami avatar Sep 21 '22 18:09 zkhatami

Thanks @zkhatami! This is a nice addition. We'll have @senior-zero take a look.

jrhemstad avatar Sep 21 '22 21:09 jrhemstad

Is there a release targeted for NVIDIA/thrust#556? If that is unknown, it might be beneficial from the user's perspective to have this fix with the current CUB design. In this proposal, is there any way to test for GPU incompatibility at the very beginning, like the check we currently have that relies on the failure of cudaFuncGetAttributes?

zkhatami avatar Sep 22 '22 16:09 zkhatami

> have this fix with current CUB design.

I agree; however, I'm not sure about our ability to preserve this functionality into the future.

> In this proposal, is there any way to test for the GPU incompatibility in the very beginning, like the one that we currently have with using the failure on cudaFuncGetAttributes?

That's the problem, I don't think we do.

jrhemstad avatar Sep 22 '22 17:09 jrhemstad

I agree with Jake that writing error messages to stdout or stderr is not the best way to report this problem. The use cases that we care about are calling Thrust. NVC++ stdpar never calls CUB directly. Thrust usually reports device-side problems by throwing a thrust::system_error exception. A lack of GPU or a GPU with an incompatible architecture should also be reported with a thrust::system_error exception, with a custom message in the exception that clearly explains the problem.
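To make the exception-based approach concrete, here is a minimal sketch in standard C++. It uses std::system_error as a stand-in for thrust::system_error so that it builds without the CUDA toolkit. The error codes, the category, and the helper `throw_gpu_error` are all hypothetical names introduced for illustration; the actual Thrust change may look quite different.

```cpp
#include <cassert>
#include <string>
#include <system_error>

// Hypothetical error codes for illustration only; real code would carry
// cudaError_t values in Thrust's own error category.
enum class gpu_errc { no_device = 1, incompatible_arch = 2 };

// Hypothetical category that turns the codes above into readable text.
class gpu_category_impl : public std::error_category
{
public:
  const char* name() const noexcept override { return "gpu"; }

  std::string message(int ev) const override
  {
    switch (static_cast<gpu_errc>(ev))
    {
      case gpu_errc::no_device:
        return "no CUDA-capable GPU was found";
      case gpu_errc::incompatible_arch:
        return "the GPU architecture does not match the compiled targets";
      default:
        return "unknown GPU error";
    }
  }
};

inline const std::error_category& gpu_category()
{
  static gpu_category_impl instance;
  return instance;
}

// Throws instead of printing to stdout or stderr, so the caller decides
// how to handle or report the failure.
[[noreturn]] inline void throw_gpu_error(gpu_errc code, const std::string& context)
{
  assert(!context.empty() && "context should name the failing operation");
  throw std::system_error(static_cast<int>(code), gpu_category(), context);
}
```

A caller would wrap the algorithm call in `try { ... } catch (const std::system_error& e)` and inspect `e.code()` and `e.what()`. With an exception, a library user can recover or log the failure in their own way, which a message written directly to stderr does not allow.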

I wonder if get_ptx_version in thrust/system/cuda/detail/core/util.h would be a better place to put this logic.

dkolsen-pgi avatar Oct 10 '22 17:10 dkolsen-pgi

@jrhemstad @dkolsen-pgi @senior-zero I've moved my changes to thrust instead based on David's suggestion. Here is my pull request: https://github.com/NVIDIA/thrust/pull/1848

Still, I'm not able to add any reviewers here. The Reviewers button is still disabled for me. Would you please take a look at the changes and share your comments with me? Thanks!

zkhatami avatar Jan 13 '23 21:01 zkhatami

Hello @zkhatami!

> I've moved my changes to thrust instead based on David's suggestion.

If the change is now on the thrust part, can this PR be closed?

> Still, I'm not able to add any reviewers here.

The thrust PR seems to have 3 reviewers at the moment. Let me know if I can help by requesting reviews from other maintainers.

gevtushenko avatar Jan 26 '23 06:01 gevtushenko

I'm closing this pull request since I've moved the changes to thrust.

zkhatami avatar Jan 27 '23 22:01 zkhatami