[FEA]: Relevant exceptions for cuCheckpointProcessGetState

Open jricker2 opened this issue 9 months ago • 1 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

cuda.bindings

Is your feature request related to a problem? Please describe.

Very small issue, but I'm not sure whether it extends to other functions that I have not tested. For cuCheckpointProcessGetState, passing a PID that doesn't exist, or one that is not valid for checkpointing, results in the following error:

>>> cu.cuCheckpointProcessGetState(123434)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cuda/bindings/driver.pyx", line 44467, in cuda.bindings.driver.cuCheckpointProcessGetState
  File "/usr/lib64/python3.11/enum.py", line 714, in __call__
    return cls.__new__(cls, value)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/enum.py", line 1137, in __new__
    raise ve_exc
ValueError: 32718 is not a valid CUprocessState

I have also seen values other than 32718 show up as the returned CUprocessState, seemingly at random.

Describe the solution you'd like

Consistent exceptions for common failures, such as a PID that does not exist or is invalid. The cuda-checkpoint CLI gives the following message, which would be fine:

Error getting process state for process ID 1234234344: "OS call failed or operation not supported on this OS"

Describe alternatives you've considered

No response

Additional context

Inconsistent or irrelevant exceptions make unit testing around this area of the CUDA driver difficult and messy.
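
Until the binding is fixed, a test suite could normalize the failure on its own side. The following is only a minimal sketch: it assumes the call returns an `(err, state)` tuple, and `CheckpointStateError` and `get_state_or_raise` are hypothetical names that are not part of cuda-python. The driver function is passed in as an argument so the sketch runs without a GPU.

```python
class CheckpointStateError(RuntimeError):
    """Hypothetical exception used to normalize failures in tests."""


def get_state_or_raise(get_state, pid):
    """Call get_state(pid) and turn any failure into CheckpointStateError.

    get_state stands in for cu.cuCheckpointProcessGetState. It is assumed
    to return an (err, state) tuple where int(err) == 0 means success.
    """
    try:
        err, state = get_state(pid)
    except ValueError as exc:
        # Behavior reported above: the arbitrary state value fails the
        # enum conversion before any error code is returned.
        raise CheckpointStateError(f"could not get state for PID {pid}") from exc
    if int(err) != 0:
        # Behavior expected once the binding reports the error code.
        raise CheckpointStateError(f"could not get state for PID {pid}: {err!r}")
    return state
```

In a real test, `get_state` would be `cu.cuCheckpointProcessGetState`, and the test would assert on `CheckpointStateError` regardless of which of the two failure paths was taken.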

jricker2, Apr 17 '25 21:04

This is a bug in cuda.bindings. When cuCheckpointProcessGetState fails, state holds an arbitrary value (either the driver writes garbage or it does not touch it at all; the distinction doesn't matter for our purpose here):

// nvcc -std=c++11 -arch=sm_80 -lcuda 561.cu -o 561.out
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <iostream>

int main() {
  assert (0 == cudaSetDevice(0));
  CUprocessState state;
  auto out = cuCheckpointProcessGetState(123456, &state);
  std::cout << (int)state << " ; " << (int)out << std::endl;
  const char* reason;
  cuGetErrorString(out, &reason);
  std::cout << "(" << reason << ")" << std::endl;
  return 0;
}

Output:

-1445175672 ; 304
(OS call failed or operation not supported on this OS)

Therefore, converting state to the CUprocessState IntEnum to build the return tuple fails because the value is out of range (it could be any int, such as 32718 or, in my case, -1445175672).

You can see that the error string matches what you got from cuda-checkpoint. We should fix the enum conversion, possibly by checking the error code first, so that we can return the proper error value (304, which is CUDA_ERROR_OPERATING_SYSTEM) to you.
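
To illustrate the "check the error code first" idea, here is a rough pure-Python sketch, not the actual generated binding code. `Result` and `ProcessState` are stand-ins for CUresult and CUprocessState with assumed member values, `build_return` is a hypothetical helper, and returning `None` for the state on failure is an assumption about the convention the binding would follow.

```python
from enum import IntEnum


class Result(IntEnum):
    # Stand-in for CUresult; only the two values discussed in this issue.
    CUDA_SUCCESS = 0
    CUDA_ERROR_OPERATING_SYSTEM = 304


class ProcessState(IntEnum):
    # Stand-in for CUprocessState; member values are assumed.
    RUNNING = 0
    LOCKED = 1
    CHECKPOINTED = 2
    FAILED = 3


def build_return(raw_err, raw_state):
    """Build the (err, state) tuple from the raw driver outputs."""
    err = Result(raw_err)
    if err != Result.CUDA_SUCCESS:
        # On failure raw_state may hold anything, so never convert it.
        return (err, None)
    # Only a successful call guarantees that raw_state is a valid member.
    return (err, ProcessState(raw_state))
```

With the values from the C++ output above, `build_return(304, -1445175672)` yields the operating-system error with no state instead of raising `ValueError`.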

leofang, Apr 18 '25 01:04