Haiyang Huang comments

Results: 8 comments of Haiyang Huang

[BUG] Running DeepSpeed with MoE inference leads to CUDA illegal memory access and NaN activation

The problem seems to be rooted in the ds_qkv_gemm implementation under FP16. This kernel works fine when handling FP32 inputs. However, when running under FP16, only the inp_norm can be...

[BUG] Running DeepSpeed with MoE inference leads to CUDA illegal memory access and NaN activation

Here is a screenshot generated by the same script at different precisions. On the left are the results of a dense layer given FP32, and on the right are the results...

[BUG] Running DeepSpeed with MoE inference leads to CUDA illegal memory access and NaN activation

Sure, here is the script I'm using. I made some modifications to deepspeed/module_inject/replace_module.py to ensure the args and flags are respected by the deepspeed.init_inference() function. Besides the fp16 and kernel...

Solving environment: failed with initial frozen solve. Retrying with flexible solve.

Same problem here on Ubuntu.

Compiled PSMP too slow and fails the tests

Thank you for your reply! I got make test to pass by changing some of the configuration I was using, but I am not sure if I really solved all the problems I...

[Bug]: inter-token latency is lower than TPOT in serving benchmark result

I observed similar results in my experiments. It seems that TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include that final latency, as shown [here](https://github.com/vllm-project/vllm/blob/61e592747c28c9fbd6861e48b825c796e09da02f/benchmarks/backend_request_func.py#L264). Would...
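A minimal sketch of how this discrepancy can arise, assuming TPOT is computed as (end-to-end latency - TTFT) / (output tokens - 1) with the end-to-end latency taken when the final "[Done]" message arrives, and ITL as the gaps between consecutive token chunks. The function name `compute_metrics` and all timestamps are hypothetical illustrations, not taken from the vLLM benchmark code.

```python
# Hypothetical sketch: not the vLLM benchmark implementation.
# Assumed definitions:
#   TPOT = (latency - TTFT) / (n_tokens - 1), latency measured at the final "[Done]" message
#   ITL  = gaps between consecutive streamed token chunks (the "[Done]" gap is excluded)

def compute_metrics(start, chunk_times, done_time):
    """Return (ttft, tpot, mean_itl) in seconds under the assumed definitions.

    start:       time the request was sent
    chunk_times: arrival time of each streamed chunk (assumed one token per chunk)
    done_time:   arrival time of the final "[Done]" message
    """
    ttft = chunk_times[0] - start
    latency = done_time - start  # includes the wait for "[Done]"
    n_tokens = len(chunk_times)
    # TPOT spreads (latency - TTFT) over the remaining tokens, so it absorbs the "[Done]" gap.
    tpot = (latency - ttft) / (n_tokens - 1)
    # ITL only looks at gaps between token chunks.
    itls = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    mean_itl = sum(itls) / len(itls)
    return ttft, tpot, mean_itl


start = 0.0
chunk_times = [0.10, 0.12, 0.14, 0.16, 0.18]  # 5 tokens arriving 20 ms apart
done_time = 0.26  # hypothetical: "[Done]" arrives 80 ms after the last token

ttft, tpot, mean_itl = compute_metrics(start, chunk_times, done_time)
# With these made-up numbers, TPOT comes out at twice the mean ITL.
print(f"TTFT={ttft * 1000:.1f} ms  TPOT={tpot * 1000:.1f} ms  mean ITL={mean_itl * 1000:.1f} ms")
```

If `done_time` is set equal to the last chunk time (no extra wait for "[Done]"), the two metrics coincide in this sketch, which is consistent with the final latency being the source of the gap.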

[Bug]: inter-token latency is lower than TPOT in serving benchmark result

I understand that the current naming of ITL might be causing some confusion. However, interpreting ITL as the inter-packet latency seems to contradict the problem mentioned here. If ITL measured...

triton build failing to access https://tritonlang.blob.core.windows.net/llvm-builds/ with HTTP Error 409

> That URL was changed back in June: [06e6799](https://github.com/triton-lang/triton/commit/06e6799f4eba6035ec35c528e8fefd3d4d724b6f)
>
> Perhaps torch is on an older commit?

Thank you! Using the new URL solved this problem.