
Results: 19 issues by YanbingJiang

### Description
This PR adds NNC post-op fusion support in ideep for further NNC development. It includes:
- element-wise post-op fusion
- conv/matmul/linear + binary post...

triaged
open source
cla signed
intel

Hi, we found that the primitive-creation time of conv1d (nwc input) is much higher than that of conv2d (block format), especially for the first creation. Though, it has...

question

Add an inference test in benchmark/kernel. Add profiling in benchmark/inference.

feature
0 - Priority P0
benchmark

Currently, this is a draft PR that contains many print logs.

This PR fixes the issue of amp_bf16 training with staged_train_test on CPU. `forward_contexts` needs to be set correctly with `torch.cpu.amp.autocast(dtype=torch.bfloat16)`; otherwise, in staged_train_test, the model cannot run in bf16 successfully.

cla signed
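The bf16 fix above can be illustrated with a minimal sketch, assuming a PyTorch build with CPU bf16 autocast support; the model and tensors are illustrative and not taken from the PR:

```python
# Sketch: the same forward pass runs in fp32 outside the autocast
# context and in bf16 inside it, which is what the staged training
# loop needs to pick up.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
x = torch.randn(4, 8)

# Outside the context the linear layer computes in fp32.
out_fp32 = model(x)

# Inside torch.cpu.amp.autocast(dtype=torch.bfloat16) the same
# forward pass is executed in bf16 on CPU.
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out_bf16 = model(x)

print(out_fp32.dtype, out_bf16.dtype)
```

The key point of the fix is that each staged forward pass must be wrapped in this context, or the model silently stays in fp32.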

## Motivation
`TorchBench` is a collection of open-source benchmarks used to evaluate PyTorch performance. It provides a standardized API for benchmark drivers, both for evaluation (eager/jit) and training. Plenty of...

roadmap

This PR updates the build to C++17 for PyTorch 2.1.0.

Hi maintainers @yanboliang @Chillee, I encountered a codegen error when using `--compile_prefile` in int8 WoQ. Although it can still run, it could be confusing to users. Could you please fix...

This PR optimizes int8 WoQ in both gpt-fast and mixtral-moe. At the current stage, we use `torch.ops.aten._weight_int8pack_mm` as a workaround; this workaround will be removed when https://github.com/pytorch/pytorch/pull/120985...

CLA Signed

gpt-fast uses `torch.load` with `mmap=True` to load model checkpoints, which may help speed up model load time. However, mmap ends up not being used in bf16, because in https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L247,...