
What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile?

pangr opened this issue 3 years ago • 4 comments

Description

What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile?

Environment

TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

pangr avatar Jul 11 '22 08:07 pangr

By default, TRT assumes that the network inputs/outputs are in FP32 linear (i.e. NCHW) format. However, many tactics in TRT require different formats, like NHWC8 or NC/32HW32 formats, so TRT automatically inserts Reformat layers to transform the format of the tensors.

If you want to eliminate the cost of these additional reformats for the network input, you may be able to specify the network input format. See the docs here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reformat-free-network-tensors
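A minimal sketch of what specifying the input format looks like with the TensorRT Python API. The FP16/HWC8 choice below is only illustrative and is an assumption, not something stated in this thread; pick whatever format the build log or profile suggests:

```python
# Sketch: declare the network input as FP16 HWC8 so TensorRT does not need to
# insert a reformat layer at the network boundary (format choice is illustrative).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... populate the network, e.g. with trt.OnnxParser ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

inp = network.get_input(0)
inp.dtype = trt.float16                                 # I/O data type
inp.allowed_formats = 1 << int(trt.TensorFormat.HWC8)   # I/O memory layout
```

With trtexec, the same intent can be expressed through the --inputIOFormats / --outputIOFormats flags (e.g. --inputIOFormats=fp16:hwc8).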

nvpohanh avatar Jul 11 '22 08:07 nvpohanh


Thank you for your reply. Another question: is there an example of removing QDQ nodes? I want to remove the QDQ nodes shown in the picture below.

[image: image2022-7-11_17-36-19]

pangr avatar Jul 11 '22 09:07 pangr

https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon and https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon/examples/04_modifying_a_model
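A rough sketch of how the Q/DQ pair in front of a Gemm could be bypassed with onnx-graphsurgeon, in the spirit of the linked example. File names and the exact Q/DQ pattern are assumptions and depend on how the model was exported:

```python
# Sketch: bypass the QuantizeLinear -> DequantizeLinear pair feeding a Gemm,
# so the Gemm stays in high precision. Names are illustrative.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_qat.onnx"))

for node in graph.nodes:
    if node.op != "Gemm":
        continue
    for i, inp in enumerate(node.inputs):
        dq = inp.inputs[0] if inp.inputs else None   # producer of this Gemm input
        if dq is None or dq.op != "DequantizeLinear":
            continue
        q = dq.inputs[0].inputs[0] if dq.inputs[0].inputs else None
        if q is None or q.op != "QuantizeLinear":
            continue
        node.inputs[i] = q.inputs[0]                 # reconnect past the Q/DQ pair

graph.cleanup().toposort()                           # drop the now-dangling Q/DQ nodes
onnx.save(gs.export_onnx(graph), "model_qat_no_gemm_qdq.onnx")
```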

zerollzeng avatar Jul 11 '22 09:07 zerollzeng


Sorry, that doesn't work. Is there a way to do this in pytorch_quantization? I just don't want to quantize “Gemm”.

pangr avatar Jul 11 '22 10:07 pangr


Which device does the Reformatting layer run on when TensorRT inserts it between DLA and GPU? I have hit a problem: the reformatting layer adds too much latency. And even if I set the input/output formats and directIO, TensorRT still inserts a reformatting layer. How can I solve this? I'm looking forward to your reply.

yuanl15 avatar Nov 06 '22 09:11 yuanl15

Which device does the Reformatting layer run on?

The GPU.

When does TensorRT insert the reformatting layer between DLA and GPU?

For example, if the GPU output is in FP16 linear format but the DLA input is FP16 CHW16, then TRT will insert a reformat layer. More generally, TRT inserts reformat layers whenever the source format does not match the target format.

And even if I set the input/output formats and directIO, TensorRT still inserts a reformatting layer. How can I solve this problem?

Check the log to find the DLA input format, then set it. See https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac3e115b1a2b1e578e8221ef99d27cd45a059cbe9f133a135c0f69fe41ae0a92e1
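A minimal sketch of declaring a DLA-friendly I/O format at build time with the Python API. The CHW16/FP16 choice here is only an example and an assumption on my part; use whatever format the verbose build log reports for the DLA input:

```python
# Sketch: enable DLA and declare the network input in a DLA-native format so
# TensorRT does not have to insert a GPU-side reformat in front of the DLA part.
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... populate the network ...

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)   # fall back to GPU for unsupported layers

inp = network.get_input(0)
inp.dtype = trt.float16
inp.allowed_formats = 1 << int(trt.TensorFormat.CHW16)  # match the DLA input format
```

The trtexec equivalent would be flags along the lines of --useDLACore=0 --fp16 --allowGPUFallback --inputIOFormats=fp16:chw16.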

zerollzeng avatar Nov 06 '22 11:11 zerollzeng

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

ttyio avatar Dec 06 '22 02:12 ttyio


Which device does the Reformatting layer run on when TensorRT inserts it between DLA and GPU? I have hit a problem: the reformatting layer adds too much latency. And even if I set the input/output formats and directIO, TensorRT still inserts a reformatting layer. How can I solve this?

I met the same problem. I want to know what your solution was... Thanks

realwenpeng avatar Nov 29 '23 06:11 realwenpeng