Failed to launch kernels (error code invalid argument)
Describe the bug
I ran into a bug while trying to use Bend. When running Bend in "CUDA mode" I get the following error:
Error reading result from hvm. Output :
Failed to launch kernels (error code invalid argument)!
exit status: 1
I was going to file this in the Bend repo but, based on the error message, I think it's an HVM bug.
To Reproduce
Steps to reproduce the behavior:
- Grab the fib.bend example from the Bend repo. Here is the code:
add = λa λb (+ a b)
fib = λx switch x {
0: 1
_: let p = x-1; switch p {
0: 1
_: (+ (fib p) (fib p-1))
}
}
main = (fib 30)
- Run the following Bend command:
bend run-cu fib.bend
Expected behavior
If this error/bug did not happen, the output would be (ideally, quickly):
Result: 1346269
Desktop (please complete the following information):
- OS: Ubuntu 20.04.6 LTS x86_64
- CPU: AMD Ryzen 5 3600 (12) @ 3.600GHz
- GPU: NVIDIA GeForce RTX 2070 SUPER
- CUDA Version: 12.3, V12.3.52
Additional context
n/a
I just added a little debug helper function and manually compiled with:
bend gen-cu test.bend > test.cu
nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib -O0 -g test.cu -o test_cuda
./test_cuda
#include <cuda_runtime.h> // for cudaError_t / cudaGetErrorString (implicit under nvcc, explicit here)
#include <iostream>

// Wraps a CUDA runtime call and reports the failing call site.
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        // We don't exit when we encounter CUDA errors in this example.
        // std::exit(EXIT_FAILURE);
    }
}
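For reference, the failing call in the generated test.cu is wrapped like this (the evaluator kernel and LNet type come from HVM's generated code; the macro expansion is what produces the message below):

CHECK_CUDA_ERROR(cudaFuncSetAttribute(evaluator,
    cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet)));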
Output:
CUDA Runtime Error at: test.cu:2597
invalid argument cudaFuncSetAttribute(evaluator, cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet))
Failed to launch kernels (error code invalid argument)!
cudaFuncSetAttribute seems to choke on either evaluator or sizeof(LNet).
For some devices, it seems that sizeof(LNet) exceeds the device's cudaDevAttrMaxSharedMemoryPerBlockOptin limit.
On my device that limit is 65536 bytes, versus the 98304 bytes required for LNet.
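For anyone else hitting this, here is a minimal standalone sketch (not part of HVM) to check your own device's opt-in shared memory limit:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;    // first visible GPU
    int max_optin = 0;
    // Query the per-block opt-in shared memory limit, which is the
    // value that cudaFuncSetAttribute is validated against.
    cudaDeviceGetAttribute(&max_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    std::printf("Max opt-in shared memory per block: %d bytes\n", max_optin);
    return 0;
}

Compile with nvcc and run; on my 2070 it prints 65536.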
This problem stems from the fact that your GPU does not support >=96KB of shared memory per block, which is what HVM currently hardcodes. We plan to release dynamic memory allocation soon; for now, in order to fix this, please reduce the shared memory from 96KB:
const u32 L_NODE_LEN = 0x2000;
const u32 L_VARS_LEN = 0x2000;
to 48KB:
const u32 L_NODE_LEN = 0x1000;
const u32 L_VARS_LEN = 0x1000;
For your GPU (the 2070), which has compute capability 7.5, the max shared memory is 64KB, so I guess you could use 0x1500.
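(A sanity check on that suggestion, assuming the per-entry sizes implied by the numbers above: 0x2000 nodes plus 0x2000 vars corresponds to 98304 bytes, i.e. 12 bytes per node/var pair, so 0x1500 of each gives 0x1500 * 12 = 64512 bytes, which fits just under the 65536-byte limit.)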
Closing this since it's a duplicate of https://github.com/HigherOrderCO/HVM/issues/283.