Heyang Qin

Results 12 issues of Heyang Qin

This PR adds support for T5. It is still a work in progress. Much of the code is adapted from https://github.com/microsoft/DeepSpeed/pull/2451

Credit to Connor for this PR! It changes how the offset and scale are computed and applied in int4 asymmetric quantization to improve quantization accuracy.
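For readers unfamiliar with the terms, here is a minimal sketch of generic textbook asymmetric 4-bit quantization, where the scale is the value range divided by 15 and the offset is the minimum value. This is not the scheme from the PR, whose exact computation is not described here; the function names are hypothetical.

```python
def quantize_asymmetric_int4(values):
    """Map floats to 4-bit codes in [0, 15] (illustrative, not DeepSpeed's code)."""
    q_max = 15  # 2**4 - 1 distinct steps above zero
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / q_max if hi > lo else 1.0
    offset = lo  # asymmetric: the grid starts at the minimum, not at -max
    codes = [min(q_max, max(0, round((v - offset) / scale))) for v in values]
    return codes, scale, offset


def dequantize_int4(codes, scale, offset):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [c * scale + offset for c in codes]


values = [-1.0, 0.0, 0.5, 2.0]
codes, scale, offset = quantize_asymmetric_int4(values)
restored = dequantize_int4(codes, scale, offset)
```

With this textbook scheme, each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the error a better choice of offset and scale tries to reduce.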

There could be multiple PartitionedParameterCoordinator instances, yet they currently manage the parameters in a standalone manner. Let's say we have PartitionedParameterCoordinator A and B. When A puts some parameters inflight,...

This is a collaborative effort with the Lightning team to solve https://github.com/microsoft/DeepSpeed/issues/3068 and https://github.com/microsoft/DeepSpeed/issues/3156. More discussion at https://github.com/Lightning-AI/lightning/issues/17523. There could be multiple PartitionedParameterCoordinator instances, yet they currently manage the parameters...

Currently some Triton primitives, such as `tl.dot` and `tl.atomic_add`, don't work with bf16. The straightforward workaround would be to convert to fp16 and cast the result back. But that is a non-trivial...

Hello PyTorch team, it is exciting to see the recent PRs that enable torch.compile on Triton kernels. I did a quick benchmark of torch.compile on...

oncall: pt2
module: user triton

Hello Triton team, I did some quick profiling of the Triton matmul kernel https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py using the PyTorch profiler. ![image](https://github.com/openai/triton/assets/46639297/62064d78-d396-4610-9214-7c6625f48bbc) I did the warmup using a tensor of the same size/dtype so the...