pycuda on Windows is crashing with access violation

Behavior

pycuda crashes immediately when trying to do anything useful with it. Any ideas?

Example to reproduce

import pycuda.driver as cuda
a_gpu = cuda.mem_alloc(64)

Resulting error

"C:\Program Files (x86)\Python38\python.exe" C:/Users/.../main.py
Process finished with exit code -1073741819 (0xC0000005)
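
The negative exit code is easier to read as an unsigned 32-bit value. A minimal sketch of the conversion (the helper name as_ntstatus is made up for illustration); 0xC0000005 is the Windows status code for an access violation:

```python
def as_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit process exit code as unsigned hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(as_ntstatus(-1073741819))  # prints 0xC0000005
```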

Environment

Windows 10 (Build 19041.630)
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32
pycuda 2020.1
CUDA 11.2
NVIDIA Driver Version: 461.33
GPU: NVIDIA GeForce GTX TITAN X

Mar 25 '21 19:03 philipp-fischer

How did you install pycuda?
Can you run https://github.com/NVIDIA/cuda-samples/tree/master/Samples/matrixMulDrv?

Mar 25 '21 21:03 inducer

How did you install pycuda?

I simply installed it with pip install pycuda:

C:\WINDOWS\system32>"C:\Program Files (x86)\Python38\python.exe" -m pip install pycuda
Collecting pycuda
  Using cached pycuda-2020.1.tar.gz (1.6 MB)
Requirement already satisfied: pytools>=2011.2 in c:\program files (x86)\python38\lib\site-packages (from pycuda) (2021.2.1)
Requirement already satisfied: decorator>=3.2.0 in c:\program files (x86)\python38\lib\site-packages (from pycuda) (4.4.2)
Requirement already satisfied: appdirs>=1.4.0 in c:\program files (x86)\python38\lib\site-packages (from pycuda) (1.4.3)
Requirement already satisfied: mako in c:\program files (x86)\python38\lib\site-packages (from pycuda) (1.1.4)
Requirement already satisfied: numpy>=1.6.0 in c:\program files (x86)\python38\lib\site-packages (from pytools>=2011.2->pycuda) (1.18.1)
Requirement already satisfied: MarkupSafe>=0.9.2 in c:\program files (x86)\python38\lib\site-packages (from mako->pycuda) (1.1.1)
Installing collected packages: pycuda
    Running setup.py install for pycuda ... done
Successfully installed pycuda-2020.1
WARNING: You are using pip version 20.0.2; however, version 21.0.1 is available.
You should consider upgrading via the 'C:\Program Files (x86)\Python38\python.exe -m pip install --upgrade pip' command.

Can you run https://github.com/NVIDIA/cuda-samples/tree/master/Samples/matrixMulDrv?

Interestingly, this sample fails, although the other samples I have tried work. See below for a comparison between matrixMulDrv and matrixMul:

Sample "matrixMulDrv" (fails):

[ matrixMulDrv (Driver API) ]
> Using CUDA Device [0]: GeForce GTX TITAN X
> GPU Device has SM 5.2 compute capability
  Total amount of global memory:     12884901888 bytes
sdkFindFilePath <matrixMul_kernel64.fatbin> in ./
sdkFindFilePath <matrixMul_kernel64.fatbin> in ./../../bin/win64/Debug/matrixMulDrv_data_files/
sdkFindFilePath <matrixMul_kernel64.fatbin> in ./common/
sdkFindFilePath <matrixMul_kernel64.fatbin> in ./common/data/
sdkFindFilePath <matrixMul_kernel64.fatbin> in ./data/
> findModulePath found file at <./data/matrixMul_kernel64.fatbin>
> initCUDA loading module: <./data/matrixMul_kernel64.fatbin>
checkCudaErrors() Driver API error = 0701 "too many resources requested for launch" from file <C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.2\0_Simple\matrixMulDrv\matrixMulDrv.cpp>, line 160.

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.2\0_Simple\matrixMulDrv\../../bin/win64/Debug/matrixMulDrv.exe (process 19460) exited with code 1.

Sample "matrixMul" (works):

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Maxwell" with compute capability 5.2

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 22.30 GFlop/s, Time= 5.878 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.

PS: I used the samples that came with the CUDA Toolkit. I assume they are the same as the ones you linked on GitHub.

Mar 26 '21 07:03 philipp-fischer

import pycuda.driver as cuda
a_gpu = cuda.mem_alloc(64)

Try adding import pycuda.autoinit before trying to allocate memory.

Mar 26 '21 22:03 inducer

It's actually not the GPU memory allocation that fails, but the initialization itself. So import pycuda.autoinit also fails, because internally it fails on this line: https://github.com/inducer/pycuda/blob/29466d4e93ec20a81ce2534327aed24903c3a2e2/pycuda/autoinit.py#L5

And all this seems to be doing is calling cuInit: https://github.com/inducer/pycuda/blob/29466d4e93ec20a81ce2534327aed24903c3a2e2/src/cpp/cuda.hpp#L502

This post gave me an idea: https://stackoverflow.com/questions/38610264/cuinit0-not-needed-anymore

Could it be a problem that the device number is not set anywhere? I have two GPUs: an onboard GPU and an NVIDIA GPU.
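
One way to test the cuInit and device-count question without going through pycuda at all would be to call the driver library directly with ctypes. This is only a rough sketch: the function name probe_driver is made up, and the library names nvcuda.dll (Windows) and libcuda.so.1 (Linux) are assumptions about where the driver API lives on a typical install. cuInit and cuDeviceGetCount are the driver API entry points.

```python
import ctypes
import sys


def probe_driver():
    """Call cuInit and cuDeviceGetCount directly through the CUDA driver library.

    Returns None if the library cannot be loaded, otherwise a tuple
    (cuInit result, device count or None). A result of 0 is CUDA_SUCCESS.
    """
    try:
        if sys.platform == "win32":
            # The driver API uses the stdcall convention on Windows, hence WinDLL.
            lib = ctypes.WinDLL("nvcuda.dll")
        else:
            lib = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None

    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    lib.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.cuDeviceGetCount.restype = ctypes.c_int

    status = lib.cuInit(0)
    if status != 0:
        return status, None

    count = ctypes.c_int(0)
    if lib.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return status, None
    return status, count.value


if __name__ == "__main__":
    print(probe_driver())
```

If this returns a status of 0 and a sensible device count from the same interpreter that crashes with pycuda, the driver itself is probably fine and the problem is more likely in the extension module.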

Mar 27 '21 07:03 philipp-fischer

But matrixMulDrv must be calling cuInit, and that must be succeeding, given how far it gets. Could you check (maybe with "Dependency Walker") that pycuda's _driver DLL and matrixMulDrv find the same CUDA library?
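
Alongside Dependency Walker, it may be worth printing what the interpreter itself reports. The environment section above shows a 32 bit Python, while the sample output comes from a win64 build that loads matrixMul_kernel64.fatbin, so the two processes may not be loading the same library. This is a sketch only, not a diagnosis; the function name interpreter_info and the library names passed to find_library are assumptions.

```python
import ctypes.util
import platform
import struct
import sys


def interpreter_info():
    """Collect the interpreter's bitness and the driver library it would find."""
    # Assumed library names: "nvcuda" on Windows, "cuda" elsewhere.
    lib_name = "nvcuda" if sys.platform == "win32" else "cuda"
    return {
        "executable": sys.executable,
        "pointer_bits": struct.calcsize("P") * 8,  # 32 or 64
        "machine": platform.machine(),
        "driver_library": ctypes.util.find_library(lib_name),  # None if not found
    }


if __name__ == "__main__":
    for key, value in interpreter_info().items():
        print(f"{key}: {value}")
```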

Mar 27 '21 17:03 inducer

_driver.cp38-win32.pyd seems to crash before it even gets to loading CUDA and calling cuInit(). I will have to investigate deeper.

The following call is out of address space. I debugged this without debug symbols or source code.

Will have to debug with source...
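
Before setting up a full source build, a cheaper first step might be to run the failing import in a child process with Python's faulthandler enabled, which dumps the Python-level traceback when the process hits a fatal error. A minimal sketch, with run_isolated being a made-up helper name:

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child process with faulthandler enabled.

    Returns (exit code, captured stderr), so a crash in the child does
    not take down the calling process.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, stderr = run_isolated("import pycuda.driver as cuda; cuda.init()")
    print("exit code:", code)
    print(stderr)
```

This only shows which Python statement was executing when the crash happened; finding the faulting native call would still need a debugger and symbols.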

Mar 28 '21 15:03 philipp-fischer

Honestly, I was a bit too lazy to get the whole build environment ready to compile this myself, so I downloaded and installed the pre-compiled wheel pycuda-2020.1+cuda102-cp38-cp38-win32.whl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycuda and it works out of the box. So there must be a difference from the PyPI version that you get with pip.

Mar 28 '21 16:03 philipp-fischer

That's very strange. There aren't any binaries up on the package index (see?), so I wonder where your original binary was compiled, and what went wrong with it...

Mar 28 '21 18:03 inducer

Hmm, interesting. So this means that during pip install pycuda it must have been compiled locally on my machine, and something must have gone wrong with that build?
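
To see what the local build actually produced, one option is to list the compiled extension files that the installed distribution records. This is a sketch under the assumption that the install metadata includes a file list; compiled_files is a made-up helper name, and importlib.metadata is available from Python 3.8.

```python
from importlib import metadata


def compiled_files(dist_name="pycuda"):
    """List compiled extension modules recorded for an installed distribution.

    Returns None if the distribution is not installed, and an empty list
    if it is installed but has no recorded file list or extension files.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:
        return []
    return [str(f) for f in files if f.suffix in (".pyd", ".so")]


if __name__ == "__main__":
    print(compiled_files())
```

The platform tag in the file name (for example the cp38-win32 part of _driver.cp38-win32.pyd) shows which interpreter and architecture the extension was built for.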

Mar 31 '21 19:03 philipp-fischer

I don't see what else could have happened.

Mar 31 '21 19:03 inducer