Runxin Zhong

10 comments by Runxin Zhong

Same question. It seems that the current code is not the final version, and the attention still uses dense rather than sparse methods. Will it be open-sourced? @HaochengWan @2020zhangcheng @zhangcheng828 Thanks...

I ran into the same problem.

Any update on this? I got the same error and cannot find any special flag that fixes it. Thanks for any help!

@thakkarV I followed the suggestion and copied the code provided by @lygztq into the example directory to try compiling it, but I still get errors. The steps I followed are: ```bash # copy...

I tried adding `CUTE_HOST_DEVICE` to `cast_smem_ptr_to_uint` and it works properly (see PR #2171).

Fixed by `pip install tensorrt_cu12_libs==10.0.1 tensorrt_cu12_bindings==10.0.1 tensorrt==10.0.1 --extra-index-url https://pypi.nvidia.com`

> can we re-open this issue? ok

Thanks a lot! `cute.copy_atom_call` seems to be what I want, and I will try it.

I found that we can use the driver API `cuFuncGetAttribute` to get `local_size_bytes` for analyzing register spills. The example code with a CuTe DSL kernel is as follows (I don't know whether...

The API changed after CuTe DSL 4.3. The right way now is as follows: (Note that the kernel should be compiled with `--keep-cubin`, and the `compiled_kernel` is the output...