ppetrushkov
I noticed that DNNL u8s8s32 single-core performance is slower than [FBGEMM](https://github.com/pytorch/FBGEMM) when m is small (m
**Describe the bug** Running inference with DeepSpeed using the GPT-NeoX 20B model produces garbage output, indicating an implementation bug. **To Reproduce** For example, this can be seen when running the example script: `deepspeed...
Currently, importing transformer_engine takes ~10s on my machine, and it also starts a background process pool because of all the JIT initialization like [here](https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/jit.py#L50-L54). It would be better if...
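One common way to avoid paying such costs at import time is to defer the expensive work until first use. As a minimal sketch (not transformer_engine's actual code, and the `LazyModule` name is hypothetical), a proxy module can delay the real import until an attribute is first accessed; the stdlib offers a similar mechanism in `importlib.util.LazyLoader`:

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Sketch: proxy that imports the real module only on first attribute access."""

    def __init__(self, name):
        super().__init__(name)
        self._real = None  # the real module, loaded lazily

    def __getattr__(self, attr):
        # Called only for attributes not found normally, i.e. anything
        # other than _real; triggers the actual import exactly once.
        if self._real is None:
            self._real = importlib.import_module(self.__name__)
        return getattr(self._real, attr)


# `json` stands in for an expensive-to-import package; nothing heavy
# happens until the first attribute access below.
json_lazy = LazyModule("json")
print(json_lazy.dumps({"a": 1}))
```

The same idea applies to JIT warm-up or spawning a process pool: wrap the work in a function that runs on first call instead of at module import.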