Chenggang Zhao

Results: 26 issues by Chenggang Zhao

Some functionality in `<codecvt>` (e.g. `std::codecvt`) has been deprecated since C++17/20. Please refer to https://en.cppreference.com/w/cpp/locale/codecvt for more details. Compiling code that still uses it may emit annoying deprecation warnings.

For `BLOCK_SIZE_K=256`, `GmmaFP8Accumulation` has `accum_promotion_interval=4` but `mma_count_per_mainloop_iteration=8`, which means a non-FP8-fast-accum kernel never promotes into the FP32 accumulators. This PR fixes the incorrect assertion by changing `4` into the real number...
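
For context, a minimal Python sketch of why the mismatch disables promotion, assuming the accumulation counter advances by `mma_count_per_mainloop_iteration` each mainloop iteration and promotes only on an exact match with `accum_promotion_interval` (the helper below is an illustrative model, not CUTLASS code):

```python
# Illustrative model (assumption), not the actual CUTLASS implementation.
def count_promotions(interval, mma_per_iteration, iterations):
    mma_count, promotions = 0, 0
    for _ in range(iterations):
        mma_count += mma_per_iteration
        if mma_count == interval:  # exact-match check never fires if interval < step
            promotions += 1
            mma_count = 0
    return promotions

print(count_promotions(4, 8, 16))  # 0  -> accumulators never promoted to FP32
print(count_promotions(8, 8, 16))  # 16 -> promotion fires every iteration
```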

For extreme GPU memory saving, we currently use communication queues for the NVLink and RDMA buffers. This means tokens cyclically reuse a small buffer: when the queue is full, no...
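
As a rough mental model (the class and names below are illustrative, not the actual NVLink/RDMA buffer implementation), the queue behaves like a fixed-size ring whose slots are cyclically reused:

```python
# Illustrative ring-style communication queue (a sketch, not the real buffers).
class CommQueue:
    def __init__(self, num_slots):
        self.num_slots = num_slots            # small fixed-size buffer
        self.slots = [None] * num_slots
        self.head = 0                         # next slot the receiver consumes
        self.tail = 0                         # next slot the sender fills

    def try_push(self, token):
        if self.tail - self.head == self.num_slots:
            return False                      # full: sender must wait for the receiver
        self.slots[self.tail % self.num_slots] = token  # index wraps: cyclic reuse
        self.tail += 1
        return True

    def pop(self):
        assert self.head < self.tail, 'queue is empty'
        token = self.slots[self.head % self.num_slots]
        self.head += 1                        # frees the slot for the next token
        return token
```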

In some kernels, a tensor may be unused in some cases (e.g., with a specific template parameter), so users may pass `None` (`nullptr`) as the `T.ptr`. But `ValueError: Unsupported...

good first issue
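
To make the request above concrete, a pure-Python sketch of the desired argument handling (`lower_tensor_arg` is invented for illustration; only the `ValueError` comes from the report):

```python
import torch

# Illustrative only; not tilelang's real argument-lowering logic.
def lower_tensor_arg(arg):
    if arg is None:
        return 0                       # desired: treat None as a nullptr for `T.ptr`
    return arg.data_ptr()              # normal case: pass the real device pointer

x = torch.randn(4, device='cuda') if torch.cuda.is_available() else torch.randn(4)
print(lower_tensor_arg(x))             # a real pointer
print(lower_tensor_arg(None))          # 0, instead of `ValueError: Unsupported ...`
```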

As titled, similar to https://triton-lang.org/main/python-api/generated/triton.language.device_assert.html and https://triton-lang.org/main/python-api/generated/triton.language.static_assert.html. Host, device, and static asserts should all be supported. A host assert should raise an exception that Python can catch. Either the Pythonic `assert` or an API like `tl.device/host/static_assert` should be...

good first issue
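
A sketch of what the requested surface could look like (none of these tilelang functions exist yet; the names simply mirror the Triton APIs linked above):

```python
# Hypothetical API sketch; only Triton's tl.device_assert / tl.static_assert
# (linked above) are real today.

# Compile-time: rejected while building the kernel, like tl.static_assert.
T.static_assert(BLOCK_SIZE % 32 == 0, 'block size must be a multiple of 32')

# Device-side: checked on the GPU at runtime, like tl.device_assert.
T.device_assert(token_idx < num_tokens, 'token index out of range')

# Host-side: raises a normal Python exception that callers can catch.
T.host_assert(x.is_contiguous(), 'input must be contiguous')
```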

```python
import torch
import tilelang
from tilelang import language as T

@tilelang.jit(
    pass_configs={
        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    },
)
def get_buggy_kernel():
    num_tokens = T.symbolic('num_tokens')

    @T.prim_func
    def buggy_kernel(x: T.Tensor[(num_tokens, ), 'int64']):
        ...
```

```python
import torch
import tilelang
from tilelang import language as T

@tilelang.jit(
    pass_configs={
        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    },
)
def get_buggy_kernel():
    num_tokens = T.symbolic('num_tokens')
    num_threads = 128

    @T.prim_func
    def buggy_kernel(x: ...
```

arith

For lots of CUDA kernels, we conventionally write:

```cuda
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
```

But in tilelang:

```python
for ...
```
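
For readers less familiar with the convention, this pure-Python illustration shows which elements each thread touches under the strided CUDA loop above (it is not tilelang code):

```python
# Each thread starts at its own index and strides by the thread count,
# exactly as in `for (int i = thread_idx; i < numel; i += num_threads)`.
def strided_indices(thread_idx, numel, num_threads):
    return list(range(thread_idx, numel, num_threads))

num_threads, numel = 4, 10
for t in range(num_threads):
    print(t, strided_indices(t, numel, num_threads))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```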

```python
import torch
import tilelang
from tilelang import language as T

@tilelang.jit(
    pass_configs={
        tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
        tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
    },
)
def get_buggy_kernel(hidden):
    num_tokens = T.symbolic('num_tokens')

    @T.prim_func
    def buggy_kernel(x: T.Tensor[(num_tokens, hidden), 'float']):
        ...
```

bug