guangzlu

Results 6 issues of guangzlu

I am writing a program with HIPRTC. I want to look at the ISA code to see whether I can improve its performance. I know that in CUDA...

Added client examples for bwd qloop v1, v2, light v1, and light v2. We can now profile the flash attention backward qloop.

Updated the dropout condition check. Performance improves when p_drop = 0. For G0 G1 M K = 54 16 512 64: before: 4.49336 ms, 32.2599 TFlops, 101.206 GB/s -> now:...

Added LSE storing to the flash attention forward path. Added the Philox device random number generator. Added blockwise dropout based on Philox, and applied dropout in the flash attention forward path. Flash...

WIP

### What happened + What you expected to happen Ray crashed when calling ray.init(num_gpus=2) **error info:** core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker...

bug
triage

### What happened + What you expected to happen I can't initialize Ray with num_cpus greater than 10. multiprocessing.cpu_count() reports 192 CPUs on the machine....

bug
triage
core