Transformer-like model inference acceleration
Is there a more advanced primitive? I found that enabling oneDNN does not greatly improve the performance of transformer-like models. I'm hoping for a specially optimized primitive like TensorRT's multihead_matmul.
Hi @baoachun, thank you for the question. oneDNN doesn't provide any complex API for model blocks, except for RNNs. Thus, we expect that any model (including Transformer-like ones) would be enabled through one of the frameworks oneDNN is integrated with. To provide better guidance, we would need the following information:
- How do you enable your model: on your own, using the oneDNN API directly, or through one of the frameworks?
- If the former, please tell us which part you consider slow and provide DNNL_VERBOSE output, pointing out the spot you would like to see improved.
- If the latter, please provide framework details, such as name and version, along with DNNL_VERBOSE output, so that we can check with the integration maintainers whether everything works as expected.
It would also help if you could share your vision of what a "specially optimized primitive like TensorRT's multihead_matmul" is and what it looks like from your perspective. Thanks.
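(For reference, the DNNL_VERBOSE trace requested above is normally collected by setting the `DNNL_VERBOSE=1` environment variable before running the workload. A minimal sketch of doing the same programmatically through the C++ API, in case the environment cannot be changed:)

```cpp
#include "dnnl.hpp"

int main() {
    // Equivalent to running with DNNL_VERBOSE=1: every primitive execution is
    // logged to stdout with its kind, shapes, implementation name and time.
    // Level 2 additionally logs primitive creation events.
    dnnl::set_verbose(1);

    // ... build and execute the model here; the printed trace is the output
    // requested above ...
    return 0;
}
```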
- matmul+scale
- fc+gelu
Hi @lidanqing-intel. If both of those combinations were the answer to

> what a "specially optimized primitive like TensorRT's multihead_matmul" is

then we definitely need more data, as requested above, in case you feel the performance is not sufficient, since both of these cases are already optimized.
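For context, both patterns above can be expressed with oneDNN's attribute/post-ops mechanism rather than a dedicated block primitive, so the fusion happens inside a single matmul (or inner-product) primitive. A minimal sketch of the fc+gelu case, assuming the oneDNN 2.x C++ API (the `append_eltwise` signature differs in later releases) and hypothetical f32 shapes:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Hypothetical FC shapes: activations [M x K], weights [K x N].
    const memory::dim M = 128, K = 768, N = 3072;
    auto src_md = memory::desc({M, K}, memory::data_type::f32, memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::f32, memory::format_tag::ab);
    auto dst_md = memory::desc({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Fuse GELU into the matmul via post-ops, so "fc+gelu" runs as one primitive.
    // The "matmul+scale" pattern is handled similarly, e.g. via output scales
    // or an eltwise_linear post-op.
    post_ops ops;
    ops.append_eltwise(1.f, algorithm::eltwise_gelu_tanh, 0.f, 0.f);
    primitive_attr attr;
    attr.set_post_ops(ops);

    auto pd = matmul::primitive_desc(matmul::desc(src_md, wei_md, dst_md), attr, eng);
    auto fused_fc = matmul(pd);

    auto src = memory(src_md, eng), wei = memory(wei_md, eng), dst = memory(dst_md, eng);
    fused_fc.execute(strm, {{DNNL_ARG_SRC, src},
                            {DNNL_ARG_WEIGHTS, wei},
                            {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```

When the fusion is picked up, the DNNL_VERBOSE trace shows a single matmul primitive with the post-op listed in its attributes rather than two separate primitives.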
Hi @dzarukin, thank you for your reply! I used Paddle + oneDNN for model inference, but turning on oneDNN did not bring much performance improvement. I think the likely reason is that the model contains many small operators and that matmul and the other computations take a long time. I learned that TensorRT has introduced the FasterTransformer approach to improve the inference performance of transformer-like models. This method merges all the operators in a block into a single operator and applies low-level optimizations specifically to the fused operator. In addition, I found that using OpenVINO for model inference greatly improves performance. I would like to know what optimizations OpenVINO has made, thank you!
Hi @baoachun. From what you share, it seems the problem is rather in the Paddle integration, since OpenVINO works fine for you. If this is really important, you may try to figure it out with @jczaja, but the issue you mentioned might help as well.
> This method merges all the operators in a block into a single operator and applies low-level optimizations specifically to the fused operator.

oneDNN is unlikely to ship such an API in a production environment since it is not flexible and is very model specific (a complete lack of generality). Though, who knows, something experimental of that kind might become available.
> I would like to know what optimizations OpenVINO has made

I'm not aware of what kind of optimizations are done on the OpenVINO side. You may try to clarify that here.
Thank you.
@baoachun It is doable to create a big fusion that merges a whole transformer block into one operator. It would be done inside Paddle, though, not on the oneDNN side.
We are targeting complex fusions with oneDNN Graph. Specifically for transformers, the multi-head attention (MHA) block is available in the oneDNN Graph technical preview.
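For illustration, the MHA pattern is described to oneDNN Graph as a small subgraph of ops, which the library can then return as a single fusible partition. A rough sketch, assuming the Graph C++ API roughly as it appears in later releases (the technical preview spells headers and attributes slightly differently) and hypothetical shapes:

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"
#include <vector>
using namespace dnnl::graph;
using dt = logical_tensor::data_type;
using lt = logical_tensor::layout_type;

int main() {
    const std::vector<int64_t> qkv = {1, 16, 384, 64};   // batch, heads, seq, head_dim
    const std::vector<int64_t> attn = {1, 16, 384, 384}; // attention scores

    // Logical tensors are placeholders; ids are arbitrary but must be unique.
    logical_tensor q {0, dt::f32, qkv, lt::strided};
    logical_tensor k {1, dt::f32, qkv, lt::strided};
    logical_tensor v {2, dt::f32, qkv, lt::strided};
    logical_tensor scale {3, dt::f32, std::vector<int64_t>{1}, lt::strided};
    logical_tensor scores {4, dt::f32, attn, lt::strided};
    logical_tensor scaled {5, dt::f32, attn, lt::strided};
    logical_tensor probs {6, dt::f32, attn, lt::strided};
    logical_tensor out {7, dt::f32, qkv, lt::strided};

    // Q*K^T -> scale -> softmax -> *V, written as separate graph ops.
    op qk {0, op::kind::MatMul, {q, k}, {scores}, "qk_matmul"};
    qk.set_attr<bool>(op::attr::transpose_b, true);
    op div {1, op::kind::Divide, {scores, scale}, {scaled}, "scale"};
    op sm {2, op::kind::SoftMax, {scaled}, {probs}, "softmax"};
    sm.set_attr<int64_t>(op::attr::axis, 3);
    op av {3, op::kind::MatMul, {probs, v}, {out}, "av_matmul"};

    graph g(dnnl::engine::kind::cpu);
    for (auto *o : {&qk, &div, &sm, &av}) g.add_op(*o);
    g.finalize();

    // If the pattern is recognized, the whole block comes back as one
    // partition that can be compiled and executed as a fused kernel.
    auto partitions = g.get_partitions();
    return 0;
}
```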
@vpirogov @dzarukin What about GPU? MHA optimizations like FlashAttention yield significant speedups on NVIDIA GPUs for large-sequence-length models (the Stable Diffusion UNet, etc.). There is another CUDA implementation in CUTLASS which is very popular in DL compiler / runtime projects. AMD also implements a fused MHA kernel in https://github.com/ROCmSoftwarePlatform/composable_kernel.
Such a fused kernel cannot be implemented at the graph level. I'm aware that oneDNN Graph has a graph compiler, but it doesn't support GPU at the moment: https://github.com/oneapi-src/oneDNN/blob/dev-graph/src/backend/graph_compiler/CMakeLists.txt#L58
@masahi, you are right, there's no GPU support in the compiler yet. For custom primitives like this on GPU, currently the only option is implementing them in SYCL.
Yeah, I've seen incredible performance out of oneDNN conv2d / gemm kernels on Arc GPU. So I'm looking forward to the availability of a fused MHA kernel for better performance. That would greatly help popular applications like https://github.com/bes-dev/stable_diffusion.openvino.
But since those kernels are implemented in custom assembly, I'm not sure about the feasibility of extending the matmul JIT kernel for fused MHA. Since it is unlikely to be possible to implement fused MHA by "composing" existing matmul kernels, I expect someone would need to implement a dedicated assembly kernel for fused MHA.
@masahi, I agree that it would be great to see these blocks better optimized. There are many different variants of the MHA block, with new ones appearing on a regular basis. This creates two issues for the library:
- It is hard to provide a stable API. This is addressed in oneDNN by the recently introduced Graph API, which allows expressing arbitrary subgraphs without breaking the API.
- A new implementation is still required to support each new variant. This is where we are exploring the Graph Compiler technology, as maintaining complex fusions for the general case is expensive.
Until we get there, I think the best option would be to implement a custom version for the cases you care about using SYCL-ESIMD. Such an implementation could potentially be used together with the oneDNN Graph API to expose it to the application in a consistent manner.
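To make that suggestion a bit more concrete, here is a purely illustrative sketch of the fused dataflow in plain SYCL (not ESIMD, not tuned, with hypothetical sizes and a naive layout): one work-item computes one query row's scores, softmax, and output, so the seq x seq attention matrix never leaves the kernel. A real kernel would tile, vectorize, and most likely be written with SYCL-ESIMD.

```cpp
#include <sycl/sycl.hpp>
#include <cmath>
#include <vector>

int main() {
    constexpr int heads = 8, seq = 128, head_dim = 64;     // hypothetical sizes
    const float scale = 1.0f / std::sqrt((float)head_dim);

    std::vector<float> q(heads * seq * head_dim, 0.01f);
    std::vector<float> k(heads * seq * head_dim, 0.02f);
    std::vector<float> v(heads * seq * head_dim, 0.03f);
    std::vector<float> o(heads * seq * head_dim, 0.f);

    sycl::queue queue;
    {
        sycl::buffer<float> qb(q), kb(k), vb(v), ob(o);
        queue.submit([&](sycl::handler &h) {
            sycl::accessor Q(qb, h, sycl::read_only);
            sycl::accessor K(kb, h, sycl::read_only);
            sycl::accessor V(vb, h, sycl::read_only);
            sycl::accessor O(ob, h, sycl::write_only);
            // One work-item per (head, query row).
            h.parallel_for(sycl::range<2>(heads, seq), [=](sycl::id<2> id) {
                const int head = id[0], row = id[1];
                const int base = head * seq * head_dim;
                float scores[seq]; // attention row kept in private memory

                // scores[col] = scale * dot(Q[row], K[col]); track the row max.
                float row_max = -1e30f;
                for (int col = 0; col < seq; ++col) {
                    float s = 0.f;
                    for (int d = 0; d < head_dim; ++d)
                        s += Q[base + row * head_dim + d] * K[base + col * head_dim + d];
                    scores[col] = s * scale;
                    row_max = sycl::fmax(row_max, scores[col]);
                }
                // Numerically stable softmax over the row.
                float denom = 0.f;
                for (int col = 0; col < seq; ++col) {
                    scores[col] = sycl::exp(scores[col] - row_max);
                    denom += scores[col];
                }
                // out[row] = softmax(scores) * V, still inside the same kernel.
                for (int d = 0; d < head_dim; ++d) {
                    float acc = 0.f;
                    for (int col = 0; col < seq; ++col)
                        acc += scores[col] * V[base + col * head_dim + d];
                    O[base + row * head_dim + d] = acc / denom;
                }
            });
        });
    } // buffers copy results back to the host vectors here
    return 0;
}
```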