Li Li
Make @weihanmines's PR https://github.com/ROCmSoftwarePlatform/FBGEMM/pull/13 upstreamable. @sryap, would you please review the PR and consider converting it to a draft? Thank you.
Credit for the pipelined-HIP implementation belongs to @carlushuang.
**Describe the bug**

The following error appears when running the example from https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts#run:

```
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
```

```
│ /opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py:388 in _load_checkpoint │
│                                                                                              │
│   385...
```
`./tools/profiler/cutlass_profiler --m=16 --n=16 --k=1024 --A=fe5m2:\* --B=fe5m2:\*` works for me just fine, as does any other combination of fp8 types, layouts, etc. I also noticed that your A type is fp16 but...
Fix the v2 kernel `forward_test` for ROCm. One more unit test failure (`test_forward_gpu_uvm_cache_fp16`) is still under investigation.