
TensorRT Plugin gets incorrect input data when integrated into full model, but works fine in isolation

Open · niubiplus2 opened this issue 8 months ago • 1 comment

I am converting a custom PyTorch plugin (CorrSamplerPlugin) into a TensorRT plugin. When testing the plugin independently, the results are correct and consistent with PyTorch. However, after integrating the plugin into the full model and exporting it to a TensorRT engine, the plugin receives incorrect input data during inference, leading to wrong outputs.

Current status: The plugin is verified independently with correct input/output compared to PyTorch;

After full model integration, the volume input received by the plugin differs significantly from the PyTorch version;

The calling order of plugin layers seems wrong (e.g., CorrSampler_0, CorrSampler_3, CorrSampler_2, CorrSampler_1, CorrSampler_4);

During inference, volume input values show large differences (even signs are flipped), while coord values only match in early iterations and then diverge;

Confirmed:

No memory errors inside the plugin;

Input shapes and binding names match one-by-one;

FP16 and INT8 are disabled — FP32 is used;

supportsFormatCombination enforces kFLOAT and kLINEAR;

Here are the top 10 shapes and values of my test data from the C++ TensorRT inference:

Image

Here are the top 10 shapes and values of my test data from PyTorch:

Image

This is my op:

Image

Image

niubiplus2 · May 13 '25 06:05

Hi @niubiplus2 , thank you for the detailed issue and the snapshots.

The plugin is verified independently with correct input/output compared to PyTorch;

After full model integration, the volume input received by the plugin differs significantly from the PyTorch version;

It seems to me that the inputs to the plugin are wrong, but the plugin itself works, which points to an issue on the model-definition side. Could you share the relevant snippet of the TensorRT code where you're adding the plugin layer?

The calling order of plugin layers seems wrong (e.g., CorrSampler_0, CorrSampler_3, CorrSampler_2, CorrSampler_1, CorrSampler_4);

I see this from the attached snippet. It is possible for the stdout output to be re-ordered if the network has those nodes as topologically parallel (no data dependency between them), since TensorRT is then free to schedule them in any order. Are those layers supposed to be invoked sequentially, one after the other, in the way the model is defined?

During inference, volume input values show large differences (even signs are flipped), while coord values only match in early iterations and then diverge;

It is entirely possible that the layer(s) before the plugin are not 1:1 equivalent in behavior between PyTorch and TensorRT; I would know more based on how you construct the layers. Can you describe the various layers in your network? What are the non-plugin layers that might be involved?

I also suspect that the precisions of the operations in Torch vs TensorRT may not be the same. A small precision change can catastrophically diverge the values when applied repeatedly. It would be worth checking whether such exponential and division operations are done in half or full precision.

Thanks!

venkywonka · May 29 '25 23:05