
Operator spec of Split operator's attribute split is wrong

Open demuxin opened this issue 1 year ago • 6 comments

Description

My model has a Split operator that is expected to divide the input tensor into sub-tensors with the sizes specified in its split attribute.

As depicted in this setting (screenshot of the Split node's split attribute omitted):

Here is the Split operator link.

But TensorRT divided the tensor into equal-sized sub-tensors, that is to say, the split setting didn't take effect.

The error information is as follows:

[shapeContext.cpp::operator()::4017] Error Code 4: Shape Error (reshape wildcard -1 has no integer solution. Reshaping [1,18720,8,32] to [1,4990,-1].)

4990 is 24953 divided by 5 (rounded down).
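The "no integer solution" part of the error can be reproduced with plain arithmetic; a minimal sketch using the shapes from the error message above:

```python
import numpy as np

# Shapes from the TensorRT error message: reshaping [1,18720,8,32] to [1,4990,-1].
src = np.zeros((1, 18720, 8, 32), dtype=np.float32)
total = src.size                  # 1 * 18720 * 8 * 32 = 4_792_320 elements

# The -1 wildcard must resolve to total / (1 * 4990), which is not an integer:
print(total / 4990)               # 960.3847... -> "wildcard -1 has no integer solution"
print(total % 4990 == 0)          # False, so this reshape is impossible
```

This confirms the error is self-consistent: a tensor of 4,792,320 elements cannot be viewed as [1, 4990, -1].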

Environment

TensorRT Version: 10.0

NVIDIA GPU: GeForce RTX 3090

NVIDIA Driver Version: 353

CUDA Version: 12.2

CUDNN Version:

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.9.7

PyTorch Version (if applicable): 2.2

Baremetal or Container (if so, version): nvcr.io/nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04

demuxin avatar May 23 '24 09:05 demuxin

1*18720*8*32 / 4990 = 960.38477

has no integer solution.

Your ONNX export is wrong: some ops are dynamic, but you exported them with constant shapes/values. Check for Warning: and TracerWarning: messages during the export process.

lix19937 avatar May 23 '24 11:05 lix19937

Take note that axis=1. The expected operation is to split the dimension of size 24953 in [1, 24953, 8, 32] into [18720, 4680, 1170, 299, 84], so that I get five sub-tensors.

But TensorRT divided the tensor into equal-sized sub-tensors, that is to say, the split setting didn't take effect.

In other words:

24953 / 5 = 4990.6
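For reference, the expected size-based split versus the equal split TensorRT appears to perform can be sketched with NumPy (split sizes taken from the comment above):

```python
import numpy as np

x = np.zeros((1, 24953, 8, 32), dtype=np.float32)
sizes = [18720, 4680, 1170, 299, 84]   # sums to 24953

# Expected: split axis 1 into the given sizes (np.split takes cut indices,
# so convert the sizes into cumulative offsets).
parts = np.split(x, np.cumsum(sizes)[:-1], axis=1)
print([p.shape[1] for p in parts])     # [18720, 4680, 1170, 299, 84]

# What TensorRT appears to do instead: 5 equal parts of 24953,
# which is not even possible exactly (24953 / 5 = 4990.6).
print(24953 % 5 == 0)                  # False
```

The first print shows the five sub-tensor sizes the split attribute asks for; the second shows why an equal split of this axis cannot be exact.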

demuxin avatar May 24 '24 01:05 demuxin

The answer above explains it, but I'll reiterate.

From what OP shared, Split seems to be doing its job: splitting a tensor of shape [1,24953,8,32] along axis=1 (presumably) into 5 tensors. The first of them (say output1) will have shape [1,18720,8,32]. The problem seems to be coming from a reshape of output1 to shape [1,4990,-1], which is not possible. OP needs to figure out where that reshape is coming from.

A good way to narrow down whether the issue is in TRT itself is by running your ONNX model in Polygraphy using another backend (like ONNX Runtime): polygraphy run <your_model>.onnx --onnxrt

If your model has trouble running with onnxrt too, then there's a very good chance the problem is with the model itself. If not, please post here and consider sharing your model.

brb-nv avatar May 24 '24 01:05 brb-nv

Hi @brb-nv, I used this command and the result is fine:

polygraphy run model.onnx --onnxrt

But when I replace onnxrt with trt, it starts reporting errors.

I've shared the model via Google Drive.

demuxin avatar May 24 '24 04:05 demuxin

Thank you for verifying. Kindly provide access to the onnx file.

brb-nv avatar May 24 '24 06:05 brb-nv

Hi @brb-nv, I solved this issue by using polygraphy surgeon sanitize --fold-constants, but there is still an error:

[E] 2: [myelinBuilderUtils.cpp::getMyelinSupportType::1270] Error Code 2: Internal Error (ForeignNode does not support data-dependent shape for now.)
[!] Invalid Engine. Please ensure the engine was built correctly
[E] FAILED | Runtime: 16.287s | Command: /home/osmagic/anaconda3/envs/pytorch/bin/polygraphy run --trt weights/out_nms_0524_sim_san.onnx

Could you help me look into whether there is a solution? And do I need to create a new issue?

I've changed the access permissions on the model; the model link is now accessible.

demuxin avatar May 27 '24 01:05 demuxin