Heyang Qin issues

Results: 12 issues of Heyang Qin

Add support for T5 for deepspeed-inference

This PR adds support for T5. It is still a work in progress. Much of the code is adapted from https://github.com/microsoft/DeepSpeed/pull/2451

Check device count before running dist tests

move load_params() from container to policy

improving int4 asymmetric quantization accuracy

Credits to Connor for this PR! This PR changes the way offset and scale are computed and applied for int4 asymmetric quantization to improve the quantization accuracy.
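For readers unfamiliar with the terms, here is a minimal sketch of generic int4 asymmetric quantization, where a scale and an offset map a float range onto the 16 codes 0..15. It is a textbook formulation written in plain Python, not the computation this PR introduces; the function names are hypothetical.

```python
def quantize_int4_asymmetric(values):
    """Map floats onto integer codes 0..15 using a scale and an offset.

    Generic textbook scheme (an assumption for illustration), not the
    DeepSpeed implementation.
    """
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels, so the range is split into 15 steps.
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_int4_asymmetric(codes, scale, offset):
    """Recover approximate floats from the integer codes."""
    return [c * scale + offset for c in codes]
```

With this scheme the round-trip error of each value is bounded by half a step, that is `scale / 2`, which is why the choice of scale and offset directly affects accuracy.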

make parameters status shared by all PartitionedParameterCoordinator instances

There could be multiple PartitionedParameterCoordinator instances, yet they currently manage the parameters in a standalone manner. Let's say we have PartitionedParameterCoordinator A and B. When A puts some parameters inflight,...

share inflight registry between PartitionedParameterCoordinators

This is a collaborative effort with the Lightning team to solve https://github.com/microsoft/DeepSpeed/issues/3068 and https://github.com/microsoft/DeepSpeed/issues/3156. More discussion is at https://github.com/Lightning-AI/lightning/issues/17523. There could be multiple PartitionedParameterCoordinator instances, yet they currently manage the parameters...
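The idea of several coordinators sharing one inflight registry can be sketched as follows. This is a simplified illustration in plain Python; the `Coordinator` class, its methods, and the dictionary registry are hypothetical stand-ins and do not reflect DeepSpeed's actual PartitionedParameterCoordinator code.

```python
class Coordinator:
    """Hypothetical stand-in for a parameter coordinator (not DeepSpeed's class)."""

    def __init__(self, name, inflight_registry):
        self.name = name
        # Every coordinator holds a reference to the same dict, so a
        # parameter marked inflight by one instance is visible to all.
        self.inflight = inflight_registry

    def fetch(self, param_id):
        """Mark a parameter inflight; return False if it already is."""
        if param_id in self.inflight:
            return False
        self.inflight[param_id] = self.name
        return True

    def release(self, param_id):
        """Remove a parameter from the shared registry once it is no longer inflight."""
        self.inflight.pop(param_id, None)
```

Creating `shared = {}` once and passing it to both `Coordinator("A", shared)` and `Coordinator("B", shared)` means B sees that a parameter is already inflight under A instead of tracking it separately.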

Bf16 with `tl.dot` and `tl.atomic_add`

Currently, some Triton primitives such as `tl.dot` and `tl.atomic_add` don't work with bf16. The straightforward workaround would be to convert the inputs to fp16 and cast the result back. But that is a non-trivial...

torch.compile makes triton kernel slower

### 🐛 Describe the bug

Hello PyTorch team, it is exciting to see the recent PRs that enable torch.compile on Triton kernels. I did a quick benchmark of torch.compile on...

oncall: pt2

module: user triton

Kernel launching overhead with `jit`

Hello Triton team, I did some quick profiling of the Triton matmul kernel https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py using the PyTorch profiler. ![image](https://github.com/openai/triton/assets/46639297/62064d78-d396-4610-9214-7c6625f48bbc) I did the warmup using a tensor of the same size/dtype so the...
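The warmup-then-measure pattern mentioned above can be sketched in plain Python. This is a generic timing helper with a hypothetical name, not the profiling setup used in the issue; for GPU kernels, a device synchronization (for example `torch.cuda.synchronize()`) would also be needed before each clock read.

```python
import time


def time_after_warmup(fn, warmup=3, iters=10):
    """Return the average wall-clock seconds per call of fn, after warmup calls.

    The warmup calls absorb one-time costs (such as compilation or caching)
    so that the timed loop reflects steady-state cost only.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters
```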