Andreas Klöckner
> Tesla V100 on a cluster

Is their X server using the NVIDIA driver? Do they even have an X server running? I would ask them to run one of...
Out of curiosity, could you share what ultimately solved the problem?
You could try the (relatively recent) [`retain_primary_context`](https://documen.tician.de/pycuda/driver.html#pycuda.driver.Device.retain_primary_context) to create a context.
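As a minimal sketch of what that might look like (device index 0 is an assumption; this needs a CUDA-capable GPU to actually run):

```python
import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)  # assumes the first CUDA device is the one you want

# Attach to the device's primary context (the one the CUDA runtime API
# also uses) instead of creating a fresh one with dev.make_context().
ctx = dev.retain_primary_context()
ctx.push()

# ... allocate memory, compile modules, launch kernels here ...

ctx.pop()
```

Using the primary context can help when other CUDA-runtime-based code (or another library) in the same process already holds a context on the device.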
* Do the CUDA SDK samples work on that machine? Particularly the one for the driver SDK?
* Can you get a backtrace (via gdb) at the moment the program...
That's because the first time the function is called, a few kernels are compiled behind the scenes to do the work. The basic assumption is that your program will run...
If that works for your use case, then yes, that should avoid compilation/module load delays on subsequent runs of the kernel.
PyOpenCL is not involved in the execution of the kernels. The actual on-device kernel execution time should be exactly the same between a C program using OpenCL and a Python...
Are you saying your kernel times are different? Switch your command queue to enable profiling and get kernel execution times in both settings. They should match pretty closely.
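A rough sketch of event-based profiling in PyOpenCL (the kernel and buffer sizes here are made up for illustration; this requires an OpenCL device):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
# Enable profiling on the queue so enqueued events carry timestamps.
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

prg = cl.Program(ctx, """
__kernel void twice(__global float *a)
{ a[get_global_id(0)] *= 2; }
""").build()

a = np.random.rand(1 << 20).astype(np.float32)
a_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                  hostbuf=a)

evt = prg.twice(queue, a.shape, None, a_buf)
evt.wait()

# Profiling timestamps are in nanoseconds.
print("kernel time: %g ms" % ((evt.profile.end - evt.profile.start) * 1e-6))
```

The `evt.profile.end - evt.profile.start` difference is the on-device execution time, which is the number to compare between your C and Python runs.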
The Python bits (setting and preparing arguments) are slower, but this time can (and should) be hidden by kernel execution, which occurs asynchronously on the device.