FindDefinition
## Problem Currently, the context parallel (CP) API has five problems. 1. It only supports applying CP to the whole model. If we have some cross-attention in the prep part of the model...
## New Feature We can use the memory instant events in the PyTorch profiler result to generate a nice GPU memory trace in Perfetto: This memory trace is aligned to Python...
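As a rough illustration of the idea, the sketch below turns memory instant events from an exported profiler trace into counter events (`"ph": "C"` in the Chrome Trace Event Format), which Perfetto draws as a counter track. The event name `[memory]`, the argument keys `Total Allocated`, `Total Reserved`, and `Device Type`, and the convention that device type `1` means GPU are assumptions about the trace layout, not something confirmed by this description. The helper `memory_instants_to_counters` is hypothetical and is not the implementation proposed here.

```python
import json


def memory_instants_to_counters(events, device_type=1):
    """Convert '[memory]' instant events into Chrome-trace counter events.

    Assumed (not confirmed) layout of each memory event:
      ph == "i", name == "[memory]", args holding "Device Type",
      "Total Allocated" and "Total Reserved".
    """
    counters = []
    for ev in events:
        # Keep only instant events that carry memory information.
        if ev.get("ph") != "i" or ev.get("name") != "[memory]":
            continue
        args = ev.get("args", {})
        # Assumption: device_type 1 identifies GPU allocations.
        if args.get("Device Type") != device_type:
            continue
        counters.append({
            "ph": "C",  # counter event, rendered as a track
            "name": "GPU memory",
            "pid": ev.get("pid", 0),
            "ts": ev["ts"],
            "args": {
                "allocated": args.get("Total Allocated", 0),
                "reserved": args.get("Total Reserved", 0),
            },
        })
    # Counter tracks are easiest to read when samples are in time order.
    counters.sort(key=lambda c: c["ts"])
    return counters


# Synthetic events standing in for a real exported trace.
sample_events = [
    {"ph": "i", "name": "[memory]", "pid": 7, "ts": 20,
     "args": {"Device Type": 1, "Total Allocated": 3072, "Total Reserved": 4096}},
    {"ph": "X", "name": "aten::mm", "pid": 7, "ts": 12, "dur": 5},
    {"ph": "i", "name": "[memory]", "pid": 7, "ts": 10,
     "args": {"Device Type": 1, "Total Allocated": 1024, "Total Reserved": 2048}},
    {"ph": "i", "name": "[memory]", "pid": 7, "ts": 15,
     "args": {"Device Type": 0, "Total Allocated": 512, "Total Reserved": 0}},
]

counters = memory_instants_to_counters(sample_events)
trace_json = json.dumps({"traceEvents": sample_events + counters})
```

With a real trace, `sample_events` would be replaced by the `traceEvents` list loaded from the exported JSON file, and `trace_json` would be written back to disk before opening it in Perfetto.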
Gluon, a tile-based low-level GPU programming language, has a core advantage over similar languages (such as tilelang and tilus): users can perform thread-level operations through **Linear Layout**, for...
### Describe the bug Compiler hint functions such as `tl.multiple_of` don't work when applied to the return value of a jit function. This bug also exists in Gluon. ```Python import triton...