
Deep Learning model in TensorRT with SPARSE layers not accelerating inference speed

Michelvl92 opened this issue 1 year ago • 9 comments

My deep learning model was converted from PyTorch, pruned with NVIDIA's ASP (Automatic SParsity), and saved as an FP16 ONNX model. The model was then converted with trtexec using the sparsity option "force" and built with FP16 precision. However, when benchmarking on an A40 GPU, no latency/throughput improvement is observed. I thought this could be due to a small batch size, but the same holds for every batch size between 1 and 32; above batch size 32 I get an out-of-memory error. More detailed analysis info is given here
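For reference, the build-and-benchmark flow described above presumably looks something like the following (file and tensor names are placeholders; `--sparsity`, `--fp16`, `--shapes`, and `--saveEngine` are standard trtexec options):

```shell
# Build an FP16 engine, forcing 2:4 sparse kernels where available
trtexec --onnx=model_fp16.onnx --fp16 --sparsity=force \
        --saveEngine=model_sparse.engine

# Benchmark the built engine at a given input shape/batch size
trtexec --loadEngine=model_sparse.engine \
        --shapes=input:32x3x1024x1024
```

Comparing this against an otherwise identical `--sparsity=disable` build isolates the effect of the sparse kernels.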

Michelvl92 avatar Aug 08 '24 08:08 Michelvl92

Could you share the ONNX model? I don't have permission to access your files.

Also, our experiments showed that sparsity only has a benefit when the convolution channels are large enough (256 or above).
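For context, ASP prunes weights into the 2:4 structured-sparsity pattern that sparse tensor cores accelerate: in every group of four consecutive weights, at most two are nonzero. A minimal NumPy sketch of that pattern (the function name is mine, not part of ASP's API):

```python
import numpy as np

def prune_2_4(w):
    """Zero the two smallest-magnitude values in every group of 4
    consecutive weights along the flattened last axis -- the 2:4
    pattern required by sparse tensor cores."""
    w = w.copy()
    groups = w.reshape(-1, 4)
    # indices of the two smallest magnitudes within each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))   # e.g. one conv weight slice
ws = prune_2_4(w)
# every group of 4 now has at most 2 nonzeros (50% sparsity)
assert np.all((ws.reshape(-1, 4) != 0).sum(axis=1) <= 2)
```

This only illustrates the layout constraint; ASP additionally fine-tunes the network after pruning to recover accuracy.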

nvpohanh avatar Aug 08 '24 08:08 nvpohanh

I think a sparse conv may not necessarily be faster than a dense conv in many cases. TRT does not report specific names for each strategy/tactic it tries, only their time costs and the implementation of the tactic that was finally picked.
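To inspect which tactic TensorRT actually picked per layer (rather than just the candidates it timed), the engine can be built and profiled with verbose layer information; these are standard trtexec flags, the file name is a placeholder:

```shell
trtexec --onnx=model_fp16.onnx --fp16 --sparsity=force \
        --profilingVerbosity=detailed --dumpLayerInfo \
        --dumpProfile --separateProfileRun
```

`--dumpLayerInfo` prints the chosen tactic per layer, and `--dumpProfile` reports per-layer timings, which makes it possible to see whether a sparse implementation was actually selected.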

lix19937 avatar Aug 08 '24 14:08 lix19937

Thanks for the quick response. I have made the links accessible now, but just to be sure, here are the links:

It would be nice to understand why the inference speed does not improve. If this is due to dimensions that cannot be accelerated with sparsity, I would appreciate a link to the relevant documentation. Otherwise, I would love to know how I can improve inference speed by making use of sparsity.

Michelvl92 avatar Aug 12 '24 11:08 Michelvl92

@lix19937 @nvpohanh I have opened access to the models, could you have a look for me?

Michelvl92 avatar Aug 15 '24 17:08 Michelvl92

I checked the models, and most of the convs do not have "good" shapes. Due to hardware alignment requirements, sparse kernels require larger tile sizes than dense kernels. Therefore, if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128.

The channel counts of most convs in this model are: 48, 96, 192, 288, 576
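The rule of thumb stated above can be written as a one-line check (a sketch; the function name is mine). Note that none of the channel counts listed pass it, including 288 and 576, which are above 256 but not multiples of 128:

```python
def likely_sparse_speedup(in_ch, out_ch):
    """Heuristic from this thread: sparse conv kernels tend to beat
    dense ones only when both channel counts are >= 256 and are
    multiples of 128."""
    return all(c >= 256 and c % 128 == 0 for c in (in_ch, out_ch))

# channel counts reported for this model -- all fail the heuristic
for c in (48, 96, 192, 288, 576):
    print(c, likely_sparse_speedup(c, c))   # prints False for each
```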

nvpohanh avatar Aug 16 '24 06:08 nvpohanh

@nvpohanh, thanks for checking. Do you have links to documentation with more details? (I couldn't find any.) What are the technical reasons why the input/output channels of the convs must be "at least 256 and multiples of 128" to get a speedup?

Michelvl92 avatar Aug 16 '24 07:08 Michelvl92

Those are not hard-coded requirements; they are just based on our past observations of sparse vs. dense kernel performance.

nvpohanh avatar Aug 19 '24 04:08 nvpohanh

@nvpohanh So I ran some tests with the backbone of yolov8l (the first 20 layers), changing the backbone to follow the rule you mentioned: "if you really want to get the benefit of sparse kernels, make sure that the input/output channels of the convs are at least 256 and are multiples of 128" (see the layer listing below). This was done on the yolov8l model size (FP16). Just to be sure, I also used 1024x1024 as the input size and tested multiple batch sizes, but saw almost no difference. What could be the reason for this?

Input/output channels are the first two `Conv2d` arguments; consecutive identical layers are collapsed with a repeat count:

```
Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(320, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)    # x6
Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  # x12
Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  # x11
Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(1280, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  # x6
Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
```
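As a quick sanity check, the unique channel pairs from the listing above can be screened against the ">= 256 and multiple of 128" rule of thumb stated earlier in the thread (a sketch; the pair list is a de-duplicated transcription of the layer dump, not the full backbone with repeats):

```python
# Unique (in_channels, out_channels) pairs transcribed from the
# backbone listing above.
layers = [(3, 64), (64, 128), (128, 128), (320, 128), (64, 64),
          (128, 256), (256, 256), (1024, 256), (128, 128),
          (256, 512), (512, 512), (2048, 512), (256, 256),
          (512, 512), (1280, 512), (512, 256), (1024, 512)]

# heuristic from earlier in the thread: both channel counts must be
# >= 256 and multiples of 128 for sparse kernels to plausibly win
ok = [(i, o) for i, o in layers
      if all(c >= 256 and c % 128 == 0 for c in (i, o))]
print(f"{len(ok)}/{len(layers)} convs meet the heuristic")  # -> 10/17
```

Several early-stage convs (3/64/128/320 channels) still fail the heuristic, so even this modified backbone spends part of its runtime in layers where dense kernels are expected to be picked.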

Michelvl92 avatar Sep 30 '24 16:09 Michelvl92

@Michelvl92 Could you share the ONNX file for yolov8l? My general feeling is that the problem size is probably still too small for the sparse kernels to be faster than the dense kernels.

nvpohanh avatar Mar 06 '25 03:03 nvpohanh