What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile?
Description
What does “Reformatting CopyNode for Input Tensor” mean in trtexec's dump profile?
Environment
TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Steps To Reproduce
By default, TRT assumes that the network inputs/outputs are in FP32 linear (i.e. NCHW) format. However, many tactics in TRT require different formats, like NHWC8 or NC/32HW32 formats, so TRT automatically inserts Reformat layers to transform the format of the tensors.
If you want to eliminate the cost of these additional reformats for the network input, you may be able to specify the network input format. See the docs here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#reformat-free-network-tensors
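To make that concrete, here is a minimal sketch with the TensorRT Python API (the ONNX path and the FP16/HWC8 choice are placeholders, not taken from your log; pick whatever format the verbose build log or profile reports for your first layer):

```python
import tensorrt as trt

# Sketch: declare the engine input as FP16 in HWC8 layout so TRT can consume it
# directly instead of inserting a reformat from FP32 NCHW.
# "model.onnx" and HWC8 are placeholders for this example.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Tell TRT the application will feed this tensor as FP16 HWC8.
inp = network.get_input(0)
inp.dtype = trt.float16
inp.allowed_formats = 1 << int(trt.TensorFormat.HWC8)

engine = builder.build_serialized_network(network, config)
```

With trtexec, the same intent can be expressed with flags along the lines of `--fp16 --inputIOFormats=fp16:hwc8`.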
Thank you for your reply. Another question: is there an example of removing Q/DQ nodes? I want to remove the QDQ nodes shown in the picture below.

https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon and https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon/examples/04_modifying_a_model
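As a minimal sketch of the graphsurgeon approach (assuming the usual X -> QuantizeLinear -> DequantizeLinear -> Gemm pattern; the file names here are placeholders):

```python
import onnx
import onnx_graphsurgeon as gs

# Sketch: bypass a QuantizeLinear -> DequantizeLinear pair that feeds a Gemm,
# so the Gemm reads the original float tensor again.
graph = gs.import_onnx(onnx.load("model_qdq.onnx"))

for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    dq = q.o()  # consumer of Q's output; assumed to be the matching DQ
    if dq.op != "DequantizeLinear":
        continue
    consumers = list(dq.outputs[0].outputs)
    if not any(c.op == "Gemm" for c in consumers):
        continue
    # Rewire every consumer of the DQ output to read Q's float input instead.
    float_tensor = q.inputs[0]
    for consumer in consumers:
        for i, t in enumerate(consumer.inputs):
            if t is dq.outputs[0]:
                consumer.inputs[i] = float_tensor
    # Detach the now-dangling Q/DQ pair so cleanup() drops them.
    q.outputs.clear()
    dq.outputs.clear()

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_no_gemm_qdq.onnx")
```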
Sorry, that doesn't work. Is there a reference for doing this in pytorch_quantization? I just don't want to quantize the “Gemm”.
Which device does the Reformatting layer run on when TensorRT inserts it between DLA and GPU? I have a problem: the reformatting layer adds too much latency. Even when I set the input/output formats and directIO, TensorRT still inserts a reformatting layer. How can I solve this? I'm looking forward to your reply.
Which device does the Reformatting layer run on?
GPU
When does TensorRT insert the reformatting layer between DLA and GPU?
For example, if the GPU output is in FP16 linear format but the DLA input is FP16 CHW16, then TRT will insert a reformat layer. More generally, TRT inserts reformat layers whenever the source format does not match the target format.
Even when I set the input/output formats and directIO, TensorRT still inserts a reformatting layer. How can I solve this problem?
Check the log to find the DLA input format, then set it. See https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#ac3e115b1a2b1e578e8221ef99d27cd45a059cbe9f133a135c0f69fe41ae0a92e1
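For example, a sketch of matching the DLA-side format at build time with the Python API (FP16/CHW16 and the model path are placeholders; read the real format from the verbose log first):

```python
import tensorrt as trt

# Sketch: build with DLA enabled and declare the engine input in the format
# the verbose log reports for the DLA subgraph (CHW16 is only an example).
logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # let unsupported layers fall back to GPU
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

inp = network.get_input(0)
inp.dtype = trt.float16
inp.allowed_formats = 1 << int(trt.TensorFormat.CHW16)  # match the DLA-side format

engine = builder.build_serialized_network(network, config)
```

The trtexec equivalent would be flags along the lines of `--useDLACore=0 --fp16 --allowGPUFallback --inputIOFormats=fp16:chw16`.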
Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!
I met the same problem. I'd like to know what your solution was... Thanks!