Hubert Lu

Results 6 issues of Hubert Lu

Hi, I am currently trying to run my program (which already works on a Cloud TPU v2-8/v3-8) on a Cloud TPU v2-32, which has 4 hosts, using JAX (jax==0.1.65, jaxlib==0.1.45)....

P0 (urgent)
NVIDIA GPU

Hi, I tested the native AllReduce (deepspeed.comm.all_reduce) and the compressed AllReduce (backend.compressed_allreduce) in DeepSpeed with [this test script](https://github.com/microsoft/DeepSpeed/blob/master/tests/onebit/test_nccl_perf.py). On a ROCm system, we observed a 414% performance improvement when switching from...

## Motivation The current SGLang on AMD GPUs fails to leverage vLLM's custom AR. To further remove the dependency on vLLM's custom AR in SGLang, we plan to maintain the...

## Motivation To enable aiter's fused allreduce kernel, please add `--enable-aiter-allreduce-fusion`. **With `--enable-aiter-allreduce-fusion`**: `void aiter::reduce_scatter_cross_device_store(aiter::RankData*, aiter::RankSignals, aiter::Signal*, int, int)` and `void aiter::local_device_load_rmsnorm_512n(aiter::RankSignals, __hip_bfloat16*, __hip_bfloat16*, __hip_bfloat16*, __hip_bfloat16*, float, int, int, int)`...

documentation
amd

## Motivation Added more tests to AMD CI and diffusion dependencies, along with a placeholder for a diffusion-related test in AMD CI ## Modifications ## Accuracy Tests ## Benchmarking and Profiling...

amd
dependencies
run-ci

## Motivation Add an 8-GPU MI35X test to AMD CI which uses [amd/DeepSeek-R1-MXFP4-Preview](https://huggingface.co/amd/DeepSeek-R1-MXFP4-Preview) with and without speculative decoding (MTP). ## Modifications ## Accuracy Tests ## Benchmarking and Profiling ## Checklist...

amd
dependencies
deepseek
run-ci