donghaku

Results 2 issues of donghaku

When a computing node does not have a cache worker, scheduling cache workers onto the computing nodes can speed up training. Is there any relevant documentation on this?

features

In 05-layer-norm.py:

    # Less than 64KB per feature: enqueue fused kernel
    MAX_FUSED_SIZE = 65536 // x.element_size()
    BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))

1. What is 65536? I have some guesses based on hardware specs, but it's...
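Going only by the comment in the quoted snippet, 65536 looks like 64 KB expressed in bytes (64 × 1024), so dividing by the element size would give the largest number of elements of one feature row that fits in that budget. The sketch below just reproduces that arithmetic in plain Python so it runs without Triton or PyTorch. The helper names `next_power_of_2` and `block_size` and the example element sizes are assumptions for illustration, not part of the tutorial, and the sketch does not say which hardware limit the 64 KB figure is tied to.

```python
# Plain-Python sketch of the arithmetic in the quoted snippet.
# Assumption: 65536 is a byte budget (64 KB), as the snippet's comment suggests.

BYTE_BUDGET = 65536  # 64 * 1024 bytes


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (stand-in for triton.next_power_of_2)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def block_size(n_features: int, element_size: int) -> int:
    """Mirror BLOCK_SIZE = min(MAX_FUSED_SIZE, next_power_of_2(N)).

    element_size is the number of bytes per element,
    e.g. 2 for float16 and 4 for float32.
    """
    max_fused_size = BYTE_BUDGET // element_size  # max elements within the budget
    return min(max_fused_size, next_power_of_2(n_features))


if __name__ == "__main__":
    for n, size in [(1000, 2), (1000, 4), (100_000, 4)]:
        print(f"N={n}, element_size={size} -> BLOCK_SIZE={block_size(n, size)}")
```

Under this reading, a float16 row is capped at 32768 elements and a float32 row at 16384, and any row whose padded length exceeds the cap gets clamped to it.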