Haiyang Huang
The problem seems to be rooted in the ds_qkv_gemm implementation under FP16. This kernel works fine when handling FP32 inputs. However, when running under FP16, only the inp_norm can be...
Here is a screenshot produced by the same script at different precisions. On the left are the results of a dense layer given FP32, and on the right are the results...
Sure, here is the script I'm using. I made some modifications to deepspeed/module_inject/replace_module.py to ensure the args and flags are respected by the deepspeed.init_inference() function. Besides the fp16 and kernel...
Same problem here on Ubuntu.
Thank you for your reply! I passed the make test by changing some configuration I was using, but I am not sure if I really solved all the problems I...
Observed similar results in my experiments. It seems like TPOT is calculated with the final "[Done]" latency included, whereas ITL does not include the final latency, as shown [here](https://github.com/vllm-project/vllm/blob/61e592747c28c9fbd6861e48b825c796e09da02f/benchmarks/backend_request_func.py#L264). Would...
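To make the difference concrete, here is a minimal sketch of the two calculations as described above. It is not vLLM's benchmark code: the function name `tpot_and_itl`, its arguments, and the assumption of one token per streamed chunk are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def tpot_and_itl(start, chunk_times, done_time):
    """Hypothetical helper, not part of vLLM.

    start:       time the request was sent
    chunk_times: arrival times of the streamed token chunks
                 (assumed: one token per chunk, at least two chunks)
    done_time:   arrival time of the final "[DONE]" marker
    """
    ttft = chunk_times[0] - start

    # ITL: gaps between consecutive token chunks only; the wait for
    # the final "[DONE]" marker is not part of this list.
    itl = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]

    # TPOT: derived from the end-to-end latency, which here runs until
    # "[DONE]" arrives, so the final wait is folded into the average.
    latency = done_time - start
    tpot = (latency - ttft) / (len(chunk_times) - 1)
    return tpot, itl


# Four chunks 0.1 s apart, then a 0.3 s wait before "[DONE]".
tpot, itl = tpot_and_itl(0.0, [0.5, 0.6, 0.7, 0.8], 1.1)
print(f"TPOT={tpot:.3f}s  mean ITL={mean(itl):.3f}s")
```

With these made-up timestamps, TPOT comes out at about 0.2 s while the mean ITL is about 0.1 s, which is the kind of gap you would expect if only one of the two metrics includes the final latency.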
I understand that the current naming of ITL might be causing some confusion. However, interpreting ITL as the inter-packet latency seems to contradict the problem mentioned here. If ITL measured...
> That URL was changed back in June: [06e6799](https://github.com/triton-lang/triton/commit/06e6799f4eba6035ec35c528e8fefd3d4d724b6f)
>
> Perhaps torch is on an older commit?

Thank you! Using the new URL solved this problem.