YanbingJiang
### Description This PR adds NNC post-op fusion support in ideep for further NNC development. It includes: - element-wise post-op fusion - conv/matmul/linear + binary post...
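The idea behind element-wise post-op fusion can be sketched in a few lines: instead of running the conv and the element-wise op as two separate passes over memory, the post-op is applied while each output element is still hot. The names and the naive conv below are illustrative, not the ideep/NNC implementation.

```python
def conv1d(x, w):
    """Naive 1-D valid convolution (cross-correlation) over a float list."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def relu(y):
    """Element-wise post-op applied as a separate pass."""
    return [max(v, 0.0) for v in y]

def conv1d_relu_fused(x, w):
    """Same math, but the element-wise post-op is fused into the conv loop."""
    k = len(w)
    return [max(sum(x[i + j] * w[j] for j in range(k)), 0.0)
            for i in range(len(x) - k + 1)]

x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [1.0, -1.0]
assert conv1d_relu_fused(x, w) == relu(conv1d(x, w))
```

The fused version produces bit-identical results here; the benefit in a real backend is avoiding the extra memory round-trip for the intermediate tensor.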
Hi, we found that the primitive-creation time of conv1d (nwc input) is much higher than that of conv2d (block format), especially for the first creation. Though, it has...
Adds an inference test in benchmark/kernel and profiling in benchmark/inference.
Currently, this PR is a draft that contains many print logs.
This PR fixes amp_bf16 training with staged_train_test on CPU. `forward_contexts` needs to be set correctly with `torch.cpu.amp.autocast(dtype=torch.bfloat16)`; otherwise, in staged_train_test, the model cannot run in bf16 successfully.
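The failure mode can be sketched with stdlib context managers: each stage's forward must run *inside* the autocast context collected in `forward_contexts`; if that list is missing the autocast entry, the forward silently runs in the default dtype. The names below (`autocast`, `staged_forward`) are toy stand-ins, not the TorchBench or PyTorch API.

```python
import contextlib

current_dtype = "float32"

@contextlib.contextmanager
def autocast(dtype):
    """Toy stand-in for torch.cpu.amp.autocast(dtype=...)."""
    global current_dtype
    prev, current_dtype = current_dtype, dtype
    try:
        yield
    finally:
        current_dtype = prev

def forward():
    return current_dtype  # the dtype the model actually runs in

def staged_forward(forward_contexts):
    """Enter every collected context before running the forward pass."""
    with contextlib.ExitStack() as stack:
        for ctx in forward_contexts:
            stack.enter_context(ctx())
        return forward()

assert staged_forward([]) == "float32"  # autocast never entered: stays fp32
assert staged_forward([lambda: autocast("bfloat16")]) == "bfloat16"
```

This mirrors the bug: with `forward_contexts` not populated, the staged run never enters bf16 even though the benchmark was configured for amp_bf16.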
## Motivation `TorchBench` is a collection of open-source benchmarks used to evaluate PyTorch performance. It provides a standardized API for benchmark drivers, both for evaluation (eager/jit) and training. Plenty of...
This PR updates the build to C++17 for PyTorch 2.1.0.
Hi Maintainers @yanboliang @Chillee , I encountered a codegen error when using `--compile_prefill` with int8 WOQ. Although it can still run, it could be confusing to users. Could you please fix...
This PR optimizes int8 WOQ in both gpt-fast and mixtral-moe. At the current stage, we use `torch.ops.aten._weight_int8pack_mm` as a workaround. This workaround will be removed when https://github.com/pytorch/pytorch/pull/120985...
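The math behind int8 weight-only quantization (WOQ) can be sketched as follows: weights are stored as int8 with one float scale per output channel, while activations stay in floating point, so the matmul result is rescaled per channel. This mirrors the idea behind `torch.ops.aten._weight_int8pack_mm`, not its actual packed-kernel implementation.

```python
def quantize_per_channel(w):
    """w: list of output-channel rows -> (int8-range rows, per-row scales)."""
    w_q, scales = [], []
    for row in w:
        s = (max(abs(v) for v in row) / 127.0) or 1.0  # avoid zero scale
        scales.append(s)
        w_q.append([round(v / s) for v in row])        # values in [-127, 127]
    return w_q, scales

def woq_matmul(x, w_q, scales):
    """y[i][o] = (sum_k x[i][k] * w_q[o][k]) * scales[o]."""
    return [[sum(xi[k] * row[k] for k in range(len(xi))) * s
             for row, s in zip(w_q, scales)]
            for xi in x]

w = [[0.5, -1.0], [2.0, 0.25]]
x = [[1.0, 2.0]]
w_q, scales = quantize_per_channel(w)
y = woq_matmul(x, w_q, scales)
ref = [[sum(xi[k] * row[k] for k in range(2)) for row in w] for xi in x]
assert all(abs(a - b) < 0.05 for a, b in zip(y[0], ref[0]))
```

Storing weights as int8 halves (vs fp16) or quarters (vs fp32) the memory traffic per weight, which is where the speedup in memory-bound decoding comes from.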
gpt-fast uses `torch.load` with `mmap=True` to load model checkpoints, which may speed up model load time. However, mmap ends up unused for bf16, because in https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L247,...
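Why `mmap=True` helps can be sketched with the stdlib `mmap` module: a mapped file is attached to the address space without copying it up front, and pages are faulted in lazily on first access. This is only an illustration of the mechanism, not `torch.load` internals.

```python
import mmap
import os
import tempfile

# Write a small "checkpoint" file: a 4 KiB header page plus a payload.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 4096 + b"tensor-bytes")

# Map it read-only: no upfront read of the whole file is required.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    payload = mm[4096:4096 + 12]  # only the touched pages are paged in
    mm.close()

os.remove(path)
assert payload == b"tensor-bytes"
```

The benefit disappears if the loaded tensors are immediately copied or converted (e.g. a post-load dtype cast), since that forces every page to be read anyway.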