Failed to launch kernels (error code invalid argument)
Describe the bug
I ran into a bug while trying to use Bend. When running Bend in "CUDA mode" I get the following error:
Error reading result from hvm. Output :
Failed to launch kernels (error code invalid argument)!
exit status: 1
I was going to file this in the Bend repo but, based on the error message, I think it's an HVM bug.
To Reproduce
Steps to reproduce the behavior:
- Grab the fib.bend example from the Bend repo. Here is the code:
add = λa λb (+ a b)
fib = λx switch x {
0: 1
_: let p = x-1; switch p {
0: 1
_: (+ (fib p) (fib p-1))
}
}
main = (fib 30)
- Run the following Bend command:
bend run-cu fib.bend
Expected behavior
If this error/bug did not happen, the output would be (ideally, quickly):
Result: 1346269
Desktop (please complete the following information):
- OS: Ubuntu 20.04.6 LTS x86_64
- CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz
- GPU: NVIDIA GeForce RTX 2070 SUPER
- CUDA Version: 12.3, V12.3.52
Additional context
n/a
I just added a little debug helper function and manually compiled with:
bend gen-cu test.bend > test.cu
nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib -O0 -g test.cu -o test_cuda
./test_cuda
#include <cuda_runtime.h> // for cudaError_t / cudaGetErrorString (implicit under nvcc, explicit here)
#include <iostream>

// Wraps a CUDA runtime call and reports the failing call site.
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        // We don't exit when we encounter CUDA errors in this example.
        // std::exit(EXIT_FAILURE);
    }
}
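For reference, the failing call in the generated test.cu is wrapped like this (the evaluator kernel and LNet type come from HVM's generated code; the macro expansion is what produces the message below):

CHECK_CUDA_ERROR(cudaFuncSetAttribute(evaluator,
    cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet)));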
Output:
CUDA Runtime Error at: test.cu:2597
invalid argument cudaFuncSetAttribute(evaluator, cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet))
Failed to launch kernels (error code invalid argument)!
cudaFuncSetAttribute seems to choke on either evaluator or sizeof(LNet).
For some devices, it seems that sizeof(LNet) exceeds the device's cudaDevAttrMaxSharedMemoryPerBlockOptin limit.
On my device that limit is 65536 bytes, versus the 98304 bytes required for LNet.
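For anyone else hitting this, here is a minimal standalone sketch (not part of HVM) to check your own device's opt-in shared memory limit:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;    // first visible GPU
    int max_optin = 0;
    // Query the per-block opt-in shared memory limit, which is the
    // value that cudaFuncSetAttribute is validated against.
    cudaDeviceGetAttribute(&max_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    std::printf("Max opt-in shared memory per block: %d bytes\n", max_optin);
    return 0;
}

Compile with nvcc and run; on my 2070 it prints 65536.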
This problem stems from the fact that your GPU does not support >=96KB of shared memory per block, which is what HVM currently hardcodes. We plan to release dynamic memory allocation soon; for now, in order to fix this, please reduce the shared memory from 96KB:
const u32 L_NODE_LEN = 0x2000;
const u32 L_VARS_LEN = 0x2000;
to 48KB:
const u32 L_NODE_LEN = 0x1000;
const u32 L_VARS_LEN = 0x1000;
For your GPU (the 2070), which has compute capability 7.5, the max shared memory is 64KB, so I guess you could use 0x1500.
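(A sanity check on that suggestion, assuming the per-entry sizes implied by the numbers above: 0x2000 nodes plus 0x2000 vars corresponds to 98304 bytes, i.e. 12 bytes per node/var pair, so 0x1500 of each gives 0x1500 * 12 = 64512 bytes, which fits just under the 65536-byte limit.)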
Closing this since it's a duplicate of https://github.com/HigherOrderCO/HVM/issues/283.