[FEA]: Relevant exceptions for cuCheckpointProcessGetState
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
cuda.bindings
Is your feature request related to a problem? Please describe.
Very small issue, but I'm not sure whether it extends to other functions that I have not tested. For cuCheckpointProcessGetState, passing a PID that doesn't exist or that is not valid for checkpointing results in the following error:
>>> cu.cuCheckpointProcessGetState(123434)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cuda/bindings/driver.pyx", line 44467, in cuda.bindings.driver.cuCheckpointProcessGetState
  File "/usr/lib64/python3.11/enum.py", line 714, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/enum.py", line 1137, in __new__
    raise ve_exc
ValueError: 32718 is not a valid CUprocessState
I have also seen values other than 32718 show up as the returned CUprocessState, seemingly at random.
Describe the solution you'd like
Consistent exceptions for common failures, such as the PID not existing or being invalid. The cuda-checkpoint CLI gives the following message, which would be fine:
Error getting process state for process ID 1234234344: "OS call failed or operation not supported on this OS"
Describe alternatives you've considered
No response
Additional context
Inconsistent or irrelevant exceptions make unit testing around this area of the CUDA driver difficult and messy.
There is a bug in cuda.bindings. When cuCheckpointProcessGetState fails, state holds an arbitrary value (either populated randomly or not touched by the driver at all; the distinction doesn't matter for our purpose here):
// nvcc -std=c++11 -arch=sm_80 -lcuda 561.cu -o 561.out
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <iostream>

int main() {
    // Select device 0 through the runtime API so CUDA is initialized.
    assert (0 == cudaSetDevice(0));

    // Deliberately left uninitialized to show what the caller sees on failure.
    CUprocessState state;
    auto out = cuCheckpointProcessGetState(123456, &state);
    std::cout << (int)state << " ; " << (int)out << std::endl;

    // Translate the driver error code into its message.
    const char* reason;
    cuGetErrorString(out, &reason);
    std::cout << "(" << reason << ")" << std::endl;
    return 0;
}
Output:
-1445175672 ; 304
(OS call failed or operation not supported on this OS)
Therefore, converting state to the CUprocessState IntEnum to build the return tuple fails because the value is out of range (it could be any int, such as 32718 or, in my case, -1445175672).
You can see that the error string matches what you got from cuda-checkpoint. We should fix the enum conversion, possibly by checking the error code first, so that we can return the proper error value (304, which is CUDA_ERROR_OPERATING_SYSTEM) to you.
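As a rough illustration of "checking the error code first", here is a minimal pure-Python sketch. It is not the actual cuda.bindings code: the two enums are small stand-ins that mirror only the values relevant here, `build_return_tuple` is a hypothetical helper name, and returning `None` for the state on failure is an assumption about what the fixed binding could do.

```python
from enum import IntEnum


class CUresult(IntEnum):
    # Stand-in with only the two codes discussed in this issue.
    CUDA_SUCCESS = 0
    CUDA_ERROR_OPERATING_SYSTEM = 304


class CUprocessState(IntEnum):
    # Stand-in for the driver's process-state enum.
    CU_PROCESS_STATE_RUNNING = 0
    CU_PROCESS_STATE_LOCKED = 1
    CU_PROCESS_STATE_CHECKPOINTED = 2
    CU_PROCESS_STATE_FAILED = 3


def build_return_tuple(err, raw_state):
    """Hypothetical helper: convert raw_state only when the call succeeded."""
    result = CUresult(err)
    if result != CUresult.CUDA_SUCCESS:
        # raw_state may be garbage on failure, so never feed it to the enum.
        return (result, None)
    return (result, CUprocessState(raw_state))


# Failure path: the error code is reported instead of raising ValueError.
err, state = build_return_tuple(304, 32718)
print(err.name, state)
```

With this ordering, callers can test for CUDA_ERROR_OPERATING_SYSTEM directly instead of catching a ValueError from the enum machinery.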