nvidia-cuda-tutorial
Add section on Grid Groups and Grid sync
Grid groups and grid sync were added in Numba 0.53.1. A short section on using these to implement a global barrier would be good, perhaps based around the example kernel from the documentation:
from numba import cuda, int32, void

@cuda.jit(void(int32[:,::1]))
def sequential_rows(M):
    col = cuda.grid(1)
    g = cuda.cg.this_grid()

    rows = M.shape[0]
    cols = M.shape[1]

    for row in range(1, rows):
        opposite = cols - col - 1
        # Each row's elements are one greater than the previous row
        M[row, col] = M[row - 1, opposite] + 1
        # Wait until all threads have written their column element,
        # and that the write is visible to all other threads
        g.sync()
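
For the proposed section, it might also help to show the host side: how the kernel is launched and how the result can be checked. The following is only a sketch under stated assumptions. The helper names `expected_rows` and `run_on_gpu`, the 1024 x 1024 problem size, and the block size of 32 are illustrative choices, not part of the original example. As far as I understand the Numba documentation, a kernel that uses grid sync is launched as a cooperative launch implicitly, so no special launch syntax is needed; the whole grid must fit on the device at once, which is why the block and grid sizes are kept small here.

```python
def expected_rows(first_row, n_rows):
    """CPU reference (hypothetical helper): each row is the previous
    row reversed, with one added to every element."""
    rows = [list(first_row)]
    for _ in range(1, n_rows):
        rows.append([v + 1 for v in reversed(rows[-1])])
    return rows


def run_on_gpu(n=1024, blockdim=32):
    """Launch the example kernel and compare it with the CPU reference.

    Hypothetical helper. Needs Numba, NumPy, and a CUDA GPU that
    supports cooperative launches. Assumes n is a multiple of blockdim.
    """
    import numpy as np
    from numba import cuda, int32, void

    @cuda.jit(void(int32[:, ::1]))
    def sequential_rows(M):
        col = cuda.grid(1)
        g = cuda.cg.this_grid()
        rows = M.shape[0]
        cols = M.shape[1]
        for row in range(1, rows):
            opposite = cols - col - 1
            M[row, col] = M[row - 1, opposite] + 1
            # Global barrier: row `row` is complete before any thread
            # starts reading it to compute row `row + 1`
            g.sync()

    A = np.zeros((n, n), dtype=np.int32)
    # One thread per column, so the grid covers the columns exactly
    griddim = n // blockdim
    # Passing a NumPy array lets Numba copy it to the device and back
    sequential_rows[griddim, blockdim](A)

    # With an all-zero first row, row r should contain only the value r
    assert A.tolist() == expected_rows([0] * n, n)
    return A
```

On a machine with a suitable GPU, calling `run_on_gpu()` should return the filled array. Without `g.sync()`, a thread could read an element of the previous row before the thread responsible for it (possibly in another block) has written it, which is the case `cuda.syncthreads()` alone cannot cover.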