GridSample outputs incorrect results in ONNX opset 20 (opset 18 works correctly)

Open Alcoholrithm opened this issue 3 months ago • 0 comments

Description

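When a PyTorch model that uses torch.nn.functional.grid_sample is exported to ONNX and run in TensorRT, the result depends on the opset version:

Opset 18 → TensorRT output matches ONNX Runtime exactly

Opset 20 → The GridSample node produces large numerical errors (max absolute difference 0.26891, max relative difference 0.40182 in the Polygraphy log below)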

TensorRT build logs show no warnings or errors.

Environment

TensorRT Version: 10.13.3

NVIDIA GPU: RTX 5090

NVIDIA Driver Version: 575.64.03

CUDA Version: 12.9

CUDNN Version: 9.1.0

Operating System: Ubuntu 24.04

Python Version (if applicable): 3.10.19

PyTorch Version (if applicable): 2.9.0+cu128

ONNX IR: 0.0.10

Minimal Reproduction Code

import torch
from torch.nn import functional as F


class GridSampleTest(torch.nn.Module):
    def __init__(self, mode="bilinear", padding_mode="zeros", align_corners=False):
        super().__init__()
        self.mode = mode
        self.padding_mode = padding_mode
        self.align_corners = align_corners

    def forward(self, x, grid):
        return F.grid_sample(
            x, grid,
            mode=self.mode,
            padding_mode=self.padding_mode,
            align_corners=self.align_corners
        )

model = GridSampleTest().eval().cuda()

# Example inputs: x is [N, C, H, W], grid is [N, H_out, W_out, 2]
x = torch.randn(45056, 1, 1, 256, device='cuda')
g = torch.randn(45056, 1, 9, 2, device='cuda')
torch.onnx.export(
    model, (x, g), "grid_sample_test.onnx",
    input_names=["x", "grid"], output_names=["y"],
    opset_version=20,  # mismatch occurs with 20; with opset_version=18 the outputs match
    dynamic_axes={"x": {0: "B"}, "grid": {0: "B"}, "y": {0: "B"}}
)

Commands or scripts: polygraphy run grid_sample_test.onnx --onnxrt --trt --verbose --onnx-outputs mark all --trt-outputs mark all
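
The command above relies on Polygraphy's default data loader, which (per the warnings in the log below) generates both inputs in the range [0.0, 1.0] with a batch size of 1. Since grid_sample expects grid coordinates normalized to [-1, 1], that default only exercises part of the sampling range. A possible custom data loader is sketched here; the file name data_loader.py, the seed, the batch size of 1, and the [-1.5, 1.5] grid range (chosen to include some out-of-bounds samples) are assumptions for illustration, not part of the original report.

```python
# data_loader.py (hypothetical file name)
import numpy as np


def load_data(num_iterations=1, batch=1, seed=0):
    """Yield feed dicts keyed by the ONNX input names "x" and "grid".

    Assumptions: batch=1 mirrors the static profile shown in the log;
    the grid range [-1.5, 1.5] is an illustrative choice that covers the
    valid [-1, 1] range plus some out-of-bounds points (zeros padding).
    """
    rng = np.random.default_rng(seed)
    for _ in range(num_iterations):
        yield {
            "x": rng.standard_normal((batch, 1, 1, 256)).astype(np.float32),
            "grid": rng.uniform(-1.5, 1.5, size=(batch, 1, 9, 2)).astype(np.float32),
        }
```

If I recall the Polygraphy CLI correctly, such a script can be passed with --data-loader-script data_loader.py, and Polygraphy looks for a function named load_data by default.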

Have you tried the latest release?: No

Polygraphy Log


[V] Loaded Module: polygraphy | Version: 0.49.26 
[V] Loaded extension modules: []
[I] TF32 is disabled by default. Turn on TF32 for better performance with minor accuracy differences.
[I] onnxrt-runner-N0-11/20/25-10:34:53  | Activating and starting inference
[I] Loading model: grid_sample_test.onnx
[V] Loaded Module: onnx | Version: 1.19.1 
[V] Marking all ONNX tensors as outputs
[V] Loaded Module: onnxruntime | Version: 1.23.2
[I] Creating ONNX-Runtime Inference Session with providers: ['CPUExecutionProvider']
[V] Loading inputs from data loader
[V] Generating data using numpy seed: 1
[V] Loaded Module: numpy | Version: 2.2.6 
[W] Input tensor: x [shape=BoundedShape(['s38', 1, 1, 256], min=None, max=None)] | Will generate data of shape: [1, 1, 1, 256].
    If this is incorrect, please provide a custom data loader.
[V] Input tensor: x | Generating input data in range: [0.0, 1.0]
[W] Input tensor: grid [shape=BoundedShape(['s38', 1, 9, 2], min=None, max=None)] | Will generate data of shape: [1, 1, 9, 2].
    If this is incorrect, please provide a custom data loader.
[V] Input tensor: grid | Generating input data in range: [0.0, 1.0]
[I] onnxrt-runner-N0-11/20/25-10:34:53
    ---- Inference Input(s) ----
    {x [dtype=float32, shape=(1, 1, 1, 256)],
     grid [dtype=float32, shape=(1, 1, 9, 2)]}
[V] onnxrt-runner-N0-11/20/25-10:34:53  | Input metadata is: {x [dtype=float32, shape=('s38', 1, 1, 256)],
     grid [dtype=float32, shape=('s38', 1, 9, 2)]}
[V] Loaded Module: torch | Version: 2.9.0+cu128 
[I] onnxrt-runner-N0-11/20/25-10:34:53
    ---- Inference Output(s) ----
    {y [dtype=float32, shape=(1, 1, 1, 9)]}
[I] onnxrt-runner-N0-11/20/25-10:34:53  | Completed 1 iteration(s) in 0.1719 ms | Average inference time: 0.1719 ms.
[I] trt-runner-N0-11/20/25-10:34:53     | Activating and starting inference
[V] Loaded Module: tensorrt | Version: 10.13.3.9
[V] [MemUsageChange] Init CUDA: CPU +31, GPU +0, now: CPU 163, GPU 3148 (MiB)
[V] [MemUsageChange] Init builder kernel library: CPU +1552, GPU +4, now: CPU 1917, GPU 3152 (MiB)
[V] ----------------------------------------------------------------
[V] Input filename:   grid_sample_test.onnx
[V] ONNX IR version:  0.0.10
[V] Opset version:    20
[V] Producer name:    pytorch
[V] Producer version: 2.9.0+cu128
[V] Domain:
[V] Model version:    0
[V] Doc string:
[V] ----------------------------------------------------------------
[V] Executing postprocessing step [ModifyNetworkOutputs]
[V] Marking 1 tensors as outputs
[V] Setting TensorRT Optimization Profiles
[W] Input tensor: x (dtype=DataType.FLOAT, shape=(-1, 1, 1, 256)) | No shapes provided; Will use shape: [1, 1, 1, 256] for min/opt/max in profile.
[W] This will cause the tensor to have a static shape. If this is incorrect, please set the range of shapes for this input tensor.
[W] Input tensor: grid (dtype=DataType.FLOAT, shape=(-1, 1, 9, 2)) | No shapes provided; Will use shape: [1, 1, 9, 2] for min/opt/max in profile.
[V] Input tensor: x (dtype=DataType.FLOAT, shape=(-1, 1, 1, 256)) | Setting input tensor shapes to: (min=[1, 1, 1, 256], opt=[1, 1, 1, 256], max=[1, 1, 1, 256])
[V] Input tensor: grid (dtype=DataType.FLOAT, shape=(-1, 1, 9, 2)) | Setting input tensor shapes to: (min=[1, 1, 9, 2], opt=[1, 1, 9, 2], max=[1, 1, 9, 2])
[I] Configuring with profiles:[
        Profile 0:
            {x [min=[1, 1, 1, 256], opt=[1, 1, 1, 256], max=[1, 1, 1, 256]],
             grid [min=[1, 1, 9, 2], opt=[1, 1, 9, 2], max=[1, 1, 9, 2]]}
    ]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | []
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 32109.50 MiB, TACTIC_DRAM: 32109.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | [EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [PROFILE_SHARING_0806]
[V] Global timing cache in use. Profiling results in this builder pass will be stored.
[V] Compiler backend is used during engine build.
[V] Detected 2 inputs and 1 output network tensors.
[V] Total Host Persistent Memory: 80 bytes
[V] Total Device Persistent Memory: 0 bytes
[V] Max Scratch Memory: 0 bytes
[V] Total Activation Memory: 0 bytes
[V] Total Weights Memory: 0 bytes
[V] Compiler backend is used during engine execution.
[V] Engine generation completed in 0.213137 seconds.
[V] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 1 MiB
[I] Finished engine building in 0.236 seconds
[V] Loaded engine size: 0 MiB
[V] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[V] Found candidate CUDA libraries: ['/usr/local/cuda-12.9/lib64/libcudart.so.12.9.79', '/usr/local/cuda-12.9/lib64/libcudart.so', '/usr/local/cuda-12.9/lib64/libcudart.so.12']
[I] trt-runner-N0-11/20/25-10:34:53
    ---- Inference Input(s) ----
    {x [dtype=float32, shape=(1, 1, 1, 256)],
     grid [dtype=float32, shape=(1, 1, 9, 2)]}
[V] trt-runner-N0-11/20/25-10:34:53     | Input metadata is: {x [dtype=float32, shape=(1, 1, 1, 256)],
     grid [dtype=float32, shape=(1, 1, 9, 2)]}
[I] trt-runner-N0-11/20/25-10:34:53
    ---- Inference Output(s) ----
    {y [dtype=float32, shape=(1, 1, 1, 9)]}
[I] trt-runner-N0-11/20/25-10:34:53     | Completed 1 iteration(s) in 0.6733 ms | Average inference time: 0.6733 ms.
[V] Successfully ran: ['onnxrt-runner-N0-11/20/25-10:34:53', 'trt-runner-N0-11/20/25-10:34:53']
[I] Accuracy Comparison | onnxrt-runner-N0-11/20/25-10:34:53 vs. trt-runner-N0-11/20/25-10:34:53
[I]     Comparing Output: 'y' (dtype=float32, shape=(1, 1, 1, 9)) with 'y' (dtype=float32, shape=(1, 1, 1, 9))
[I]         Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         onnxrt-runner-N0-11/20/25-10:34:53: y | Stats: mean=0.30805, std-dev=0.12337, var=0.015221, median=0.3287, min=0.1003 at (0, 0, 0, 5), max=0.46257 at (0, 0, 0, 8), avg-magnitude=0.30805, p90=0.42193, p95=0.44225, p99=0.4585
[I]             ---- Values ----
                    [[[[0.39737493 0.10553478 0.3287046  0.2827478  0.4117708  0.10029647
                        0.4003233  0.28308854 0.46256799]]]]
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (0.1  , 0.157) |          2 | ##########################
                (0.157, 0.214) |          0 |
                (0.214, 0.271) |          0 |
                (0.271, 0.328) |          2 | ##########################
                (0.328, 0.385) |          1 | #############
                (0.385, 0.442) |          3 | ########################################
                (0.442, 0.499) |          1 | #############
                (0.499, 0.555) |          0 |
                (0.555, 0.612) |          0 |
                (0.612, 0.669) |          0 |
[I]         trt-runner-N0-11/20/25-10:34:53: y | Stats: mean=0.39464, std-dev=0.1704, var=0.029035, median=0.37008, min=0.12633 at (0, 0, 0, 5), max=0.66923 at (0, 0, 0, 6), avg-magnitude=0.39464, p90=0.57917, p95=0.6242, p99=0.66023
[I]             ---- Values ----
                    [[[[0.5566532  0.15679139 0.34473667 0.34473667 0.43367636 0.12632953
                        0.6692329  0.3700842  0.5495479 ]]]]
[I]             ---- Histogram ----
                Bin Range      |  Num Elems | Visualization
                (0.1  , 0.157) |          2 | ##########################
                (0.157, 0.214) |          0 |
                (0.214, 0.271) |          0 |
                (0.271, 0.328) |          0 |
                (0.328, 0.385) |          3 | ########################################
                (0.385, 0.442) |          1 | #############
                (0.442, 0.499) |          0 |
                (0.499, 0.555) |          1 | #############
                (0.555, 0.612) |          1 | #############
                (0.612, 0.669) |          1 | #############
[I]         Error Metrics: y
[I]             Minimum Required Tolerance: elemwise error | [abs=0.26891] OR [rel=0.40182] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.086598, std-dev=0.076889, var=0.005912, median=0.061989, min=0.016032 at (0, 0, 0, 2), max=0.26891 at (0, 0, 0, 6), avg-magnitude=0.086598, p90=0.1812, p95=0.22506, p99=0.26014
[I]                 ---- Values ----
                        [[[[0.15927827 0.05125661 0.01603207 0.06198886 0.02190557 0.02603306
                            0.2689096  0.08699566 0.08697993]]]]
[I]                 ---- Histogram ----
                    Bin Range        |  Num Elems | Visualization
                    (0.016 , 0.0413) |          3 | ########################################
                    (0.0413, 0.0666) |          2 | ##########################
                    (0.0666, 0.0919) |          2 | ##########################
                    (0.0919, 0.117 ) |          0 |
                    (0.117 , 0.142 ) |          0 |
                    (0.142 , 0.168 ) |          1 | #############
                    (0.168 , 0.193 ) |          0 |
                    (0.193 , 0.218 ) |          0 |
                    (0.218 , 0.244 ) |          0 |
                    (0.244 , 0.269 ) |          1 | #############
[I]             Relative Difference | Stats: mean=0.21012, std-dev=0.11188, var=0.012517, median=0.20607, min=0.046505 at (0, 0, 0, 2), max=0.40182 at (0, 0, 0, 6), avg-magnitude=0.21012, p90=0.34189, p95=0.37185, p99=0.39583
[I]                 ---- Values ----
                        [[[[0.28613555 0.3269096  0.04650526 0.17981511 0.05051133 0.20607264
                            0.40181768 0.23506992 0.15827543]]]]
[I]                 ---- Histogram ----
                    Bin Range       |  Num Elems | Visualization
                    (0.0465, 0.082) |          2 | ########################################
                    (0.082 , 0.118) |          0 |
                    (0.118 , 0.153) |          0 |
                    (0.153 , 0.189) |          2 | ########################################
                    (0.189 , 0.224) |          1 | ####################
                    (0.224 , 0.26 ) |          1 | ####################
                    (0.26  , 0.295) |          1 | ####################
                    (0.295 , 0.331) |          1 | ####################
                    (0.331 , 0.366) |          0 |
                    (0.366 , 0.402) |          1 | ####################
[E]         FAILED | Output: 'y' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)
[E]     FAILED | Mismatched outputs: ['y']
[E] Accuracy Summary | onnxrt-runner-N0-11/20/25-10:34:53 vs. trt-runner-N0-11/20/25-10:34:53 | Passed: 0/1 iterations | Pass Rate: 0.0%
[E] FAILED | Runtime: 2.230s | Command: polygraphy run grid_sample_test.onnx --onnxrt --trt --verbose --onnx-outputs mark all --trt-outputs mark all
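
The log above shows that the two runners disagree, but not which one is off. One way to check is an independent reference for the configuration used in the repro (mode="bilinear", padding_mode="zeros", align_corners=False). The following is a minimal pure-Python sketch for a single-channel 2D image, written from the documented grid_sample semantics as I understand them; the function names are hypothetical, and it is not the implementation used by PyTorch, ONNX Runtime, or TensorRT.

```python
import math


def unnormalize(coord, size):
    """Map a normalized coordinate in [-1, 1] to pixel space (align_corners=False)."""
    return ((coord + 1.0) * size - 1.0) / 2.0


def pixel_or_zero(image, row, col):
    """Return image[row][col], or 0.0 when the index is out of bounds (zeros padding)."""
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return image[row][col]
    return 0.0


def grid_sample_bilinear_zeros(image, points):
    """Bilinearly sample `image` (list of rows) at normalized (x, y) points.

    Each point is (gx, gy), where gx indexes the width and gy the height,
    matching the last dimension of the grid tensor.
    """
    height, width = len(image), len(image[0])
    samples = []
    for gx, gy in points:
        px = unnormalize(gx, width)
        py = unnormalize(gy, height)
        x0, y0 = math.floor(px), math.floor(py)
        wx, wy = px - x0, py - y0  # fractional offsets toward the next pixel
        samples.append(
            pixel_or_zero(image, y0, x0) * (1 - wx) * (1 - wy)
            + pixel_or_zero(image, y0, x0 + 1) * wx * (1 - wy)
            + pixel_or_zero(image, y0 + 1, x0) * (1 - wx) * wy
            + pixel_or_zero(image, y0 + 1, x0 + 1) * wx * wy
        )
    return samples


# Tiny example with H=1, W=4, similar in layout to the [1, 1, 1, 256] input.
row_image = [[1.0, 2.0, 3.0, 4.0]]
print(grid_sample_bilinear_zeros(row_image, [(0.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]))
# → [2.5, 0.5, 1.25]
```

Feeding the same x and grid values through this reference and comparing against both the ONNX Runtime and TensorRT outputs should indicate which side deviates at opset 20.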

Nov 20 '25 01:11 Alcoholrithm