Chunyuan WU

Results 5 issues of Chunyuan WU

## Pitch
Optimize the decoder of RNN-T to support batch mode.
## Motivation
The current RNN-T decoder uses a for loop on batch_size and can only process `BS = 1`...

## Pitch
Enable bf16 support for mkldnn prepack conv2d in NNC.
## Performance
The BF16 conv performance has been evaluated in https://github.com/pytorch/pytorch/pull/82705.
## Additional context
This PR depends on BF16...

oncall: jit
open source
cla signed
release notes: jit

## Pitch
Enable Linear-Eltwise fusion in NNC.
## Description
The code change is similar to https://github.com/pytorch/pytorch/pull/77157, which enabled conv2d-related fusions. This PR adds a fusion pass to fuse...

oncall: jit
open source
cla signed

The current implementation of `drq` does not support **channels last** input since `view` has constraints on the input size and stride (https://pytorch.org/docs/stable/generated/torch.Tensor.view.html): ``` Cannot view a tensor with shape torch.Size([1,...
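The stride constraint mentioned above can be illustrated with a small pure-Python sketch. It does not call PyTorch; it models the condition stated in the linked `view` documentation, that dimensions can be merged without copying only when `stride[i] == stride[i+1] * size[i+1]`. The helper names and the example shape `(1, 3, 4, 4)` are hypothetical, chosen for illustration, and are not taken from the truncated error message.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides (in elements) of a row-major, NCHW-contiguous tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides of a 4-D tensor with logical shape NCHW stored as NHWC."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def can_flatten_as_view(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Check whether all dims can be merged into one without copying.

    Simplified model (an assumption of this sketch): size-1 dims are
    skipped because their stride does not affect the memory layout.
    """
    dims = [(size, stride) for size, stride in zip(shape, strides) if size != 1]
    return all(
        dims[i][1] == dims[i + 1][1] * dims[i + 1][0]
        for i in range(len(dims) - 1)
    )


shape = (1, 3, 4, 4)  # illustrative shape only
print(can_flatten_as_view(shape, contiguous_strides(shape)))     # NCHW layout
print(can_flatten_as_view(shape, channels_last_strides(shape)))  # NHWC layout
```

In this model the contiguous layout satisfies the merge condition and the channels-last layout does not, which is why flattening such a tensor with `view` is rejected. In PyTorch, `reshape` or a preceding `.contiguous()` call handles both layouts, copying the data when a view is not possible.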

## Motivation
This PR is a follow-up to https://github.com/sgl-project/sglang/issues/2807 and https://github.com/sgl-project/sglang/pull/5150 that adds an **fp8** gemm kernel for CPU. The bf16 and int8 gemm kernels were already added in https://github.com/sgl-project/sglang/pull/5150. This...