[FEA]: Require ct.barrier for multi-stage kernels

Open ZhangZhiPku opened this issue 1 month ago • 0 comments

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

High

Please provide a clear description of problem this feature solves

In CUDA programming, we use atomic operations or cooperative groups to synchronize execution across blocks. cuTile could provide a similar mechanism so that developers can write complex multi-stage kernels more simply.

Feature Description

Example:

import torch
import cuda.tile as ct

@ct.kernel
def device_norm(
    x: ct.Array, y: ct.Array, workspace: ct.Array, 
    tile_size: ct.Constant, p: ct.Constant):
    # Create a barrier in global memory; expect p blocks to reach it.
    barrier = ct.barrier(p=p)
    block_id = ct.bid(0)
    
    tile = ct.load(x, index=(block_id, 0), shape=(1, tile_size))
    mean = ct.sum(tile) / tile_size
    
    ct.atomic_add(workspace, (0, ), mean)
    # Wait until all p blocks have reached this point.
    barrier.wait()

    global_mean = ct.load(workspace, (0, ), (1, ))
    global_mean = global_mean / p
    tile = tile - global_mean
    
    # Store the normalized tile (assumes y has the same 2D layout as x).
    ct.store(y, index=(block_id, 0), tile=tile)
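
For reference, the result the kernel above is meant to produce can be sketched on the host in plain Python. This only illustrates the intended semantics and is not cuTile code; the name `device_norm_reference` and the list-of-rows layout are assumptions made for the sketch.

```python
def device_norm_reference(x):
    """Host-side reference: x is a list of p rows, one row per block."""
    p = len(x)

    # Stage 1: each "block" reduces its row to a mean and accumulates it,
    # mirroring the atomic_add into workspace[0].
    workspace = 0.0
    for row in x:
        workspace += sum(row) / len(row)

    # A grid-wide barrier would sit here: stage 2 must not start
    # until every block has contributed to the workspace.

    # Stage 2: each "block" subtracts the global mean from its row.
    global_mean = workspace / p
    return [[v - global_mean for v in row] for row in x]


result = device_norm_reference([[1.0, 2.0], [3.0, 4.0]])
```

The two stages are separated only by the point where the barrier would go, which is why the kernel cannot be written as a single launch without some form of cross-block synchronization.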

Describe your ideal solution

Provide ct.barrier, or a similar feature, to make it easier for developers to write applications that require block-level synchronization.

There are multiple ways to implement ct.barrier:

  1. Allocate a region in global memory for synchronization, let each block atomically increment a counter when it reaches the barrier, and have it wait until the counter equals the expected number of blocks.
  2. Use cooperative groups.
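
The counter-based option can be sketched as a CPU-side simulation, where each thread stands in for a block and a lock-protected counter stands in for a global-memory atomic. `CounterBarrier` and `run_blocks` are hypothetical names for illustration, not part of cuTile. On a real GPU this approach is only safe if all p blocks are resident at the same time (which a CUDA cooperative launch requires); otherwise blocks that are already waiting can deadlock.

```python
import threading
import time


class CounterBarrier:
    """Single-use barrier built from an arrival counter (simulation only)."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self._lock = threading.Lock()  # stands in for a hardware atomic

    def wait(self):
        # Arrive: atomically increment the counter.
        with self._lock:
            self.count += 1
        # Wait: poll until every participant has arrived.
        while True:
            with self._lock:
                if self.count >= self.expected:
                    return
            time.sleep(0)  # yield to other threads


def run_blocks(rows):
    """Run one thread per row, mimicking the two-stage kernel."""
    p = len(rows)
    barrier = CounterBarrier(p)
    workspace = [0.0]
    ws_lock = threading.Lock()
    out = [None] * p

    def block(bid):
        row = rows[bid]
        mean = sum(row) / len(row)
        with ws_lock:  # plays the role of ct.atomic_add
            workspace[0] += mean
        barrier.wait()  # no block proceeds until all p have arrived
        global_mean = workspace[0] / p
        out[bid] = [v - global_mean for v in row]

    threads = [threading.Thread(target=block, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


normalized = run_blocks([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]])
```

A device-side version would additionally need a memory fence so that the workspace update is visible to other blocks before they pass the barrier, and a reset or generation scheme if the barrier is reused.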

Describe any alternatives you have considered

No response

Additional context

No response

Contributing Guidelines

  • [x] I agree to follow cuTile Python's contributing guidelines
  • [x] I have searched the open feature requests and have found no duplicates for this feature request

ZhangZhiPku, Dec 19 '25 08:12