server TensorRT model low throughput

Description

When CUDA shared memory is used with the HTTP/GRPC protocol, the client is expected to allocate CUDA memory on one of the devices and copy the data into it. On systems with multiple GPUs (in my case, a machine with ~20 GPUs), how should the client choose the device to copy the data to for best performance, given that the Triton server handles scheduling of jobs across the GPUs? If Triton decides to execute the inference request on a different GPU, would there be a significant penalty for copying the data over? How can this be avoided?

From perf_analyzer experiments we notice that the perf_analyzer client seems to copy the data to GPU 0, and it appears that the Triton server internally copies the data onto the right device for execution. (We can see that all 20 GPUs are being used when checking usage with nvidia-smi.) This aligns with the increased latency of the "Compute Input" step in the perf analysis when executed with 20 GPUs. How can we maximize throughput with perf_analyzer while running TensorRT models + CUDA shared memory on multiple devices?

TensorRT + Cuda Shared Mem

#perf_analyzer -m my_trt_model --shape inputimage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory cuda --output-shared-memory-size=16777216 --concurrency-range 20
Request concurrency: 20
  Client:
    Request count: 65449
    Throughput: 302.954 infer/sec
    Avg latency: 65973 usec (standard deviation 15656 usec)
    p50 latency: 36162 usec
    p90 latency: 65274 usec
    p95 latency: 506917 usec
    p99 latency: 538550 usec
    Avg HTTP time: 65955 usec (send/recv 142 usec + response wait 65813 usec)
  Server:
    Inference count: 65449
    Execution count: 65449
    Successful request count: 65449
    Avg request latency: 65392 usec (overhead 57 usec + queue 108 usec + compute input 25039 usec + compute infer 32872 usec + compute output 7314 usec)

If system shared memory is used instead for the TensorRT model, we see a huge hit in both Compute Input and Compute Output timings, which severely affects throughput.

TensorRT + System Shared Mem

#perf_analyzer -m my_trt_model --shape inputimage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory system --output-shared-memory-size=16777216 --concurrency-range 20
Request concurrency: 20
  Client:
    Request count: 18896
    Throughput: 87.4745 infer/sec
    Avg latency: 227984 usec (standard deviation 15816 usec)
    p50 latency: 53416 usec
    p90 latency: 1139611 usec
    p95 latency: 1394049 usec
    p99 latency: 1757626 usec
    Avg HTTP time: 227966 usec (send/recv 142 usec + response wait 227824 usec)
  Server:
    Inference count: 18896
    Execution count: 18896
    Successful request count: 18896
    Avg request latency: 227376 usec (overhead 55 usec + queue 1158 usec + compute input 112292 usec + compute infer 35535 usec + compute output 78335 usec)

The same issue is not seen with the TF backend and system shared memory for the same model. However, TF does not give us the inference speed-up that TensorRT does.

TensorFlow + System Shared Mem

#perf_analyzer -m my_tf_model --shape inputImage:2048,2048,1 --measurement-interval 60000 -i HTTP --shared-memory system --output-shared-memory-size=16777216 --concurrency-range 20
Request concurrency: 20
  Client:
    Request count: 81465
    Throughput: 377.05 infer/sec
    Avg latency: 53042 usec (standard deviation 3187 usec)
    p50 latency: 52593 usec
    p90 latency: 57166 usec
    p95 latency: 58773 usec
    p99 latency: 62349 usec
    Avg HTTP time: 53026 usec (send/recv 104 usec + response wait 52922 usec)
  Server:
    Inference count: 81465
    Execution count: 81465
    Successful request count: 81465
    Avg request latency: 52612 usec (overhead 54 usec + queue 2286 usec + compute input 3324 usec + compute infer 42926 usec + compute output 4022 usec)
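To make the comparison between the three runs easier to read, here is a minimal Python sketch that parses the server-side "Avg request latency" line from each run and reports how much of the request latency is spent in compute input plus compute output. The function names (`parse_server_line`, `io_share`) are made up for this illustration; the numbers are copied from the perf_analyzer output above.

```python
import re

# Server-side summary lines copied from the three runs above.
SERVER_LINES = {
    "TRT + CUDA shm": (
        "Avg request latency: 65392 usec (overhead 57 usec + queue 108 usec + "
        "compute input 25039 usec + compute infer 32872 usec + compute output 7314 usec)"
    ),
    "TRT + system shm": (
        "Avg request latency: 227376 usec (overhead 55 usec + queue 1158 usec + "
        "compute input 112292 usec + compute infer 35535 usec + compute output 78335 usec)"
    ),
    "TF + system shm": (
        "Avg request latency: 52612 usec (overhead 54 usec + queue 2286 usec + "
        "compute input 3324 usec + compute infer 42926 usec + compute output 4022 usec)"
    ),
}

STAGE_RE = re.compile(
    r"(overhead|queue|compute input|compute infer|compute output) (\d+) usec"
)
TOTAL_RE = re.compile(r"Avg request latency: (\d+) usec")


def parse_server_line(line):
    """Return (total_usec, {stage_name: usec}) for one summary line."""
    total = int(TOTAL_RE.search(line).group(1))
    stages = {name: int(usec) for name, usec in STAGE_RE.findall(line)}
    return total, stages


def io_share(line):
    """Fraction of request latency spent in compute input + compute output."""
    total, stages = parse_server_line(line)
    return (stages["compute input"] + stages["compute output"]) / total


for label, line in SERVER_LINES.items():
    print(f"{label}: {io_share(line):.1%} of request latency is input/output handling")
```

With the numbers above, input/output handling is roughly 49% of the request latency for TRT + CUDA shm, 84% for TRT + system shm, and 14% for TF + system shm.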

Triton Information

Triton Version - 24.01 (same behavior seen in older versions like 23.07 as well)
Driver: 545.23.08
Cuda: 12.3
GPU: T4

To Reproduce

Steps to reproduce the behavior.

Model used - A U-Net variant image segmentation model.
Backend - TensorRT (Comparison with TensorFlow backend)
Precision used - FP16
Model Config:

name: "my_trt_model",
platform: "tensorrt_plan",
backend: "tensorrt",
input: [
  {
    name: "inputimage",
    data_type: TYPE_FP32,
    format: FORMAT_NONE,
    dims: [ -1, -1, 1 ],
    is_shape_tensor: false,
    allow_ragged_batch: false
  }
],
output: [
  {
    name: "output",
    data_type: TYPE_FP32,
    dims: [ -1, -1 ],
    label_filename: "",
    is_shape_tensor: false
  }
]
instance_group [
  { count: 1, kind: KIND_GPU }
]
version_policy: { all { }}

model_warmup: [
  {
    name: "Warmup",
    batch_size: 1,
    inputs: {
      key: "inputimage",
      value: {
        data_type: TYPE_FP32,
        dims: [2048,2048,1],
        random_data: true
      }
    }
  }
]

Mar 13 '24 20:03 rs-ixz

@jbkyang-nvi @tanmayv25 Any thoughts?

Mar 13 '24 22:03 lkomali

@lkomali , @jbkyang-nvi , @tanmayv25 , any chance someone was able to look at this? Any recommendations would help us retire this risk of using TensorRT with Triton.

Mar 19 '24 04:03 rs-ixz

@rs-ixz The team is looking into it.

Mar 19 '24 18:03 lkomali

@rs-ixz Triton does not give users an option to force a request to execute on a specific model instance. Instead, Triton schedules each request to the next available instance. Incoming requests are held in queues within Triton core until a model instance becomes available. This is a simple solution for maximizing throughput. The execution time of each inference can vary, so this reactive approach ensures that the model instances running on each GPU are not starved and keep executing requests as long as there are any in the queue.

D2D copies across different GPUs incur some latency cost, but they are not as expensive as the H2D memory copies needed with system shared memory. Additionally, there is no efficient way to make sure all model instances stay busy when requests are directed to specific model instances. We might be able to avoid a cross-device memory copy if we let the client handle data and request placement; however, this might lead to instance starvation.
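The trade-off described above can be illustrated with a toy model. The sketch below is not Triton's scheduler; it is a hypothetical simulation that compares a shared queue (each request goes to whichever instance frees up first) against requests pinned to one instance by data locality. The per-request durations are invented values, and the function names are made up for this example.

```python
import heapq


def makespan_shared_queue(durations, n_instances):
    """Each request runs on whichever instance becomes free first."""
    free_at = [0.0] * n_instances
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)   # earliest-free instance
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def makespan_pinned(durations, placement, n_instances):
    """Each request must run on the instance index given in `placement`."""
    busy = [0.0] * n_instances
    for d, instance in zip(durations, placement):
        busy[instance] += d              # requests on one instance run back to back
    return max(busy)


# Hypothetical per-request execution times in milliseconds.
durations_ms = [30, 35, 32, 40, 31, 33, 36, 34]

shared = makespan_shared_queue(durations_ms, n_instances=4)
# Worst case for locality-based placement: every input lives on GPU 0.
pinned_to_gpu0 = makespan_pinned(durations_ms, [0] * len(durations_ms), n_instances=4)

print(f"shared queue: {shared} ms, all pinned to instance 0: {pinned_to_gpu0} ms")
```

In this toy case the shared queue finishes in 74 ms while pinning every request to instance 0 takes 271 ms, because three instances sit idle. The model ignores the cost of the cross-device copy itself, which is the other side of the trade-off.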

Mar 21 '24 18:03 tanmayv25

Thanks, @tanmayv25 , your response seems to suggest that the Throughput we see for CUDA shared memory is as expected.

My overarching problem is this: TensorRT models are not able to get high throughput although inference on GPU is faster than TensorFlow

System shared memory --> Tensorflow backend behaves well for all input sizes
System shared memory --> TensorRT backend suffers from really high compute input/compute output timing for large inputs (2048x2048x1) --> 3x slower than TensorFlow throughput. For smaller inputs (400x400x1), TensorRT outperforms the TensorFlow backend (3x faster than TensorFlow throughput)
Cuda shared memory --> TensorRT backend suffers from high compute input timing, and throughput matches TensorFlow at best. We lose the throughput gain we saw for small inputs in TensorRT with system shared memory

We would like to get the benefits of TensorRT (faster inference) but we are presently limited by either System Shared memory being slow for Large inputs or CUDA Shared memory being overall slow and only meeting TensorFlow numbers for all input sizes.

Please let me know if any more details about the above will help address the issue.

Mar 21 '24 18:03 rs-ixz

Adding @Tabrizian who was investigating implicitly selecting model instance based on the input tensor data locality.

Cuda shared memory --> TensorRT backend suffers from high compute input timing, and throughput matches TensorFlow at best. We lose the throughput gain we saw for small inputs in TensorRT with system shared memory

The scale of 20 GPUs might be saturating the inter-device bandwidth. Could this be a use case for resuming work on that feature?

@rs-ixz Can you share the perf_analyzer numbers for Tensorflow + Cuda Shared Memory as well?

Mar 21 '24 19:03 tanmayv25

@tanmayv25 , Here are the perf_analyzer numbers for TF + Cuda shm:

Request concurrency: 20 Client: Request count: 61919 Throughput: 286.613 infer/sec Avg latency: 69775 usec (standard deviation 7604 usec) p50 latency: 64728 usec p90 latency: 91601 usec p95 latency: 108960 usec p99 latency: 134397 usec Avg HTTP time: 69760 usec (send/recv 99 usec + response wait 69661 usec) Server: Inference count: 61919 Execution count: 61919 Successful request count: 61919 Avg request latency: 69369 usec (overhead 42 usec + queue 7150 usec + compute input 13990 usec + compute infer 42697 usec + compute output 5490 usec)

Following is a summary plot of the four combinations (TF/TRT + CUDA/System shm). We can clearly see TRT + Sys Shm is best for smaller input and for larger ones TF + Sys Shm is the best.

Mar 22 '24 18:03 rs-ixz

Hi @rs-ixz, thanks for sharing your observations and sorry for the delayed response.

TensorRT models are not able to get high throughput although inference on GPU is faster than TensorFlow

We would definitely want to bridge this gap. From the numbers that you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.

@rs-ixz I am going to create a ticket for investigation within the team. Meanwhile, can you share the models and exact steps for us to reproduce the issue? Additionally, can you reproduce the issue on a system with fewer GPUs, such as 8? Is there a threshold on the number of GPUs for observing this issue?

Apr 03 '24 22:04 tanmayv25

@tanmayv25 , thanks for getting back on this.

From the numbers that you have shared, there seems to be a bottleneck in preparing the input data and collecting the output data from the results.

Is this a comment regarding the TensorRT backend with system shared memory, CUDA shared memory, or both? Just to reiterate, system shared memory is our current baseline and performs well even with TensorRT, except for large inputs. CUDA shared memory underperforms at all input sizes, probably due to inter-device communication and the fact that we use 20 GPUs.

Meanwhile, can you share the models and exact steps for us to reproduce the issue?

I was able to reproduce the problem with an off-the-shelf U-Net model. Attached is an archive containing models and perf_analyzer results.

Here is the plot comparison for this model with system shared memory:

Additionally, can you reproduce the issue on a system with fewer GPUs, such as 8? Is there a threshold on the number of GPUs for observing this issue?

I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES with 8 devices on this same 20 GPU machine?

TRT_TritonSlowness.zip

Apr 10 '24 05:04 rs-ixz

I am working on getting this measured. Will it suffice if we set CUDA_VISIBLE_DEVICES with 8 devices on this same 20 GPU machine?

If you can observe the slowness with CUDA_VISIBLE_DEVICES restricted to 8 devices, then we can take it up from there. Are the above curves with CUDA_VISIBLE_DEVICES restricted to 8 devices?

Apr 11 '24 00:04 tanmayv25

@tanmayv25 , the above curves are still with 20 GPUs with System shared memory.

Apr 12 '24 05:04 rs-ixz

Hello @rs-ixz, can you share the exact models you're using for TRT and TensorFlow? This is so there's no confusion in reproducing your results. If CUDA_VISIBLE_DEVICES lists 8 devices, only those 8 GPUs are available for Triton to use even though there are 20 GPUs.

Apr 17 '24 18:04 jbkyang-nvi

@jbkyang-nvi the models are available in the archive I attached a few responses ago (TRT_TritonSlowness.zip)

@jbkyang-nvi / @tanmayv25 , for system shared memory, we see the crossover happening at around 4/5 GPUs.

Note - this plot is only for input size 2048x2048x3, with system shared memory. For each 'GPU Count', I set CUDA_VISIBLE_DEVICES to expose the corresponding number of devices. Attached are the perf_analyzer results.
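As a small illustration of the GPU-count sweep, the sketch below builds the CUDA_VISIBLE_DEVICES value for a given number of devices. Note that the variable takes a comma-separated list of device indices, so exposing 8 GPUs means `0,1,2,3,4,5,6,7`, whereas the literal value `8` would expose only the device with index 8. The helper names and the list of GPU counts are assumptions for this example, not the exact sweep used for the plot.

```python
import os


def visible_devices(n_gpus):
    """Return a CUDA_VISIBLE_DEVICES value exposing device indices 0..n_gpus-1."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return ",".join(str(i) for i in range(n_gpus))


def server_env(n_gpus):
    """Copy of the current environment with CUDA_VISIBLE_DEVICES overridden."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(n_gpus)
    return env


# Hypothetical sweep over GPU counts on the same 20-GPU machine.
for n in (1, 2, 4, 8, 20):
    print(f"{n:2d} GPUs -> CUDA_VISIBLE_DEVICES={visible_devices(n)}")
```

The dictionary returned by `server_env` could be passed as the `env` argument of `subprocess.Popen` when launching the server for each point of the sweep.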

GPUSpread.zip

Apr 17 '24 20:04 rs-ixz

@jbkyang-nvi the models are available in the archive I attached a few responses ago (TRT_TritonSlowness.zip)

Thanks. Sorry, I missed the zip file. How are you converting from the TensorFlow SavedModel to the TRT plan model, though? Are you going through ONNX?

Apr 17 '24 21:04 jbkyang-nvi

@jbkyang-nvi, no worries! And yes, we are going through the ONNX route to get the TRT plan.

Start with TF saved_model
Run tf2onnx
Run trtexec

example commands for tf2onnx and trtexec below:

Apr 17 '24 21:04 rs-ixz

Thanks for your quick response! While I'm working on a reproducer, can you try creating the model with

--optShapes flags to control the range of input shapes including batch size.

according to https://docs.nvidia.com/tao/tao-toolkit/text/trtexec_integration/index.html and see if that helps?

Apr 17 '24 21:04 jbkyang-nvi

@jbkyang-nvi, I see from the trtexec logs that optShapes is set to maxShapes (1x2512x2176x3) by default. This should suffice, right? From my recollection, I have tried generating a TRT engine plan for a single input size and the issue still occurred with it. Let me confirm this again.
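For reference, a minimal sketch of how the shape flags could be spelled out explicitly when building the engine. It only assembles a trtexec argument list and does not run anything. The file names, the tensor name and the minimum shape are hypothetical placeholders; only the maximum shape 1x2512x2176x3 and the 2048x2048x3 input size come from this thread.

```python
def shape_arg(tensor_name, dims):
    """Format a trtexec shape spec such as 'inputimage:1x2048x2048x3'."""
    return f"{tensor_name}:" + "x".join(str(d) for d in dims)


def trtexec_args(onnx_path, engine_path, tensor_name, min_dims, opt_dims, max_dims):
    """Build a trtexec command line with explicit min/opt/max input shapes."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--fp16",
        f"--minShapes={shape_arg(tensor_name, min_dims)}",
        f"--optShapes={shape_arg(tensor_name, opt_dims)}",
        f"--maxShapes={shape_arg(tensor_name, max_dims)}",
    ]


args = trtexec_args(
    onnx_path="model.onnx",        # hypothetical path
    engine_path="model.plan",      # hypothetical path
    tensor_name="inputimage",      # assumed input tensor name
    min_dims=(1, 400, 400, 3),     # hypothetical minimum shape
    opt_dims=(1, 2048, 2048, 3),   # tune for the size that matters most
    max_dims=(1, 2512, 2176, 3),   # maximum shape mentioned above
)
print(" ".join(args))
```

Setting the optimization shape to the input size that dominates the workload, rather than leaving it at the maximum, is the variation being suggested here; whether it changes the throughput numbers in this issue is not established in the thread.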

Apr 17 '24 23:04 rs-ixz

@rs-ixz can you also list the GPUs you are using for measuring perf?

Apr 18 '24 21:04 jbkyang-nvi

@jbkyang-nvi , all are T4 GPUs on a single server.

Apr 18 '24 23:04 rs-ixz

@jbkyang-nvi , any updates on reproducing the problem on your end?

Apr 24 '24 02:04 rs-ixz

@rs-ixz sorry for the delay. @indrajit96 is taking over this ticket and would let you know if he has some questions or findings. Thanks for your patience and prompt responses in this case!

Apr 25 '24 23:04 tanmayv25

Hi @rs-ixz, we are able to reproduce this. We will update you as soon as we have an RCA/fix/WAR. CC @tanmayv25

Apr 29 '24 18:04 indrajit96

Thank you, @indrajit96 and @tanmayv25 , this is highly encouraging. Just for my clarity, the investigation focus is going to be on System shared memory, correct?
And is there any chance you may be able to give a rough idea for the timeline of the investigation? Just so that we can plan TensorRT integration in our workflow accordingly.

May 01 '24 15:05 rs-ixz

Hi @rs-ixz, yes, we are focusing on system shared memory. We will provide an estimate once the RCA is done; it is actively in progress.

Thanks, Indrajit

May 01 '24 19:05 indrajit96

Sounds fair, thank you, @indrajit96

May 02 '24 03:05 rs-ixz

Hello @rs-ixz, we reproduced your issue on an 8-GPU setup with concurrency set to 20 in Perf Analyzer.

Fix: Set --pinned-memory-pool-byte-size to a suitably high value at Triton startup. The default is ~250MB. For 8 GPUs I set it to ~4GB.

Usage example:
tritonserver --model-repository=/mnt --pinned-memory-pool-byte-size=4684354560

RCA: We ran the repro with the NVTX flag enabled, which let us profile all GPU-related activity in Nsight. The NVTX traces showed multiple calls to cudaHostAlloc. At every execution, ProcessTensor calls FlushPendingPinned, which in turn calls BackendMemory::Create if there is not enough pinned CPU memory left to allocate from. This slows down inference, because cudaHostAlloc is slow and ends up being called for every inference. If --pinned-memory-pool-byte-size is set suitably high, the calls to cudaHostAlloc are reduced.
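A rough back-of-the-envelope check shows why the default pool can run out for this workload. The sketch below assumes, purely for illustration, that each in-flight request needs one pinned staging buffer for its input and one for its output, that the output is 2048x2048 FP32, and that the default pool is 256 MiB (the "~250MB" above). It is not a description of how Triton actually sizes or reuses pinned buffers.

```python
def tensor_bytes(dims, bytes_per_element):
    """Size in bytes of a dense tensor with the given dimensions."""
    n = 1
    for d in dims:
        n *= d
    return n * bytes_per_element


MIB = 1024 ** 2
FP32_BYTES = 4
DEFAULT_POOL_BYTES = 256 * MIB  # ASSUMPTION: stands in for the "~250MB" default

input_bytes = tensor_bytes((2048, 2048, 1), FP32_BYTES)  # from the --shape flag
output_bytes = tensor_bytes((2048, 2048), FP32_BYTES)    # ASSUMPTION: same H x W as input
concurrency = 20                                         # from --concurrency-range 20

# ASSUMPTION: one pinned input buffer and one pinned output buffer per in-flight request.
in_flight_bytes = concurrency * (input_bytes + output_bytes)

print(f"input:  {input_bytes // MIB} MiB per request")
print(f"output: {output_bytes // MIB} MiB per request")
print(f"in flight at concurrency {concurrency}: {in_flight_bytes // MIB} MiB")
print(f"exceeds assumed default pool: {in_flight_bytes > DEFAULT_POOL_BYTES}")
```

Under these assumptions a single request moves 16 MiB in and 16 MiB out, so 20 concurrent requests need about 640 MiB of staging space, well above the default pool. The 16 MiB figure also matches the --output-shared-memory-size=16777216 value used in the perf_analyzer commands.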

CC @tanmayv25 @GuanLuo

May 18 '24 01:05 indrajit96

Thank you @indrajit96 for the quick debug and fix details! I will try this out on our 20 GPU system to see if it fixes the throughput problem.

A follow-up question - are there any caveats to be aware of while increasing the cpu pinned memory pool? On a system with CPU memory of 256G, can we increase it to say 16G without any negative side-effects assuming 16G of memory is available for use always?

May 20 '24 15:05 rs-ixz

Hello @rs-ixz, did the suggested flag resolve your issue? If yes, we would like to close the issue. Also, regarding CPU pinned memory, we have not seen any known downsides of using it with Triton.

Jun 10 '24 17:06 indrajit96

Hi @indrajit96, apologies, I was traveling and couldn't get these tests done earlier. I tried sweeping through a range of CPU pinned memory sizes (default, 2GB, 4GB, 8GB, 16GB and 32GB). Raising the setting above the default does appear to improve throughput for the larger input cases where we saw issues earlier. However, we also see that throughput drops as the CPU pinned memory size is increased further; 2GB seems to be the best setting for larger inputs.
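A sweep like this can be scripted by generating one server command line per pool size. The sketch below only builds the command lines, using the two flags shown earlier in this thread; the `/mnt` model repository path is taken from the usage example above, and the sizes are interpreted as GiB, which is an assumption.

```python
GIB = 1024 ** 3


def tritonserver_cmd(model_repository, pinned_pool_bytes=None):
    """Build a tritonserver command line; None keeps the default pool size."""
    cmd = ["tritonserver", f"--model-repository={model_repository}"]
    if pinned_pool_bytes is not None:
        cmd.append(f"--pinned-memory-pool-byte-size={pinned_pool_bytes}")
    return cmd


# Default plus the sizes listed above (ASSUMPTION: interpreted as GiB).
sweep_gib = [None, 2, 4, 8, 16, 32]

for size in sweep_gib:
    pool = None if size is None else size * GIB
    print(" ".join(tritonserver_cmd("/mnt", pool)))
```

Each command would be paired with the same perf_analyzer invocation so that only the pool size changes between runs.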

However, I would also like to note that the throughput difference we see between TF FP32 and TRT FP16 is very minimal although the inference latency for TRT FP16 is ~4x faster than that of TF FP32. I have the perf results & models attached here. Any thoughts on this? Are we getting limited by I/O operations like earlier?

Uploading CPUPinnedMemoryTest.zip…

Jun 13 '24 15:06 rs-ixz

Hi @rs-ixz, I suspect the latency could be due to a max_batch_size mismatch between the models. Can you confirm both models have the same max_batch_size? You can check using:

curl localhost:8000/v2/models/<model name>/config
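The same check can be done from Python. The sketch below fetches the config from the endpoint used in the curl command and compares max_batch_size across models. The helper names are made up for this example, and treating a missing `max_batch_size` key as 0 is an assumption about how the JSON is serialized.

```python
import json
from urllib.request import urlopen


def fetch_model_config(model_name, base_url="http://localhost:8000"):
    """GET /v2/models/<model name>/config and return the parsed JSON."""
    with urlopen(f"{base_url}/v2/models/{model_name}/config") as resp:
        return json.load(resp)


def max_batch_sizes(configs):
    """Map model name -> max_batch_size (ASSUMPTION: missing key means 0)."""
    return {name: cfg.get("max_batch_size", 0) for name, cfg in configs.items()}


def batch_sizes_match(configs):
    """True when every model reports the same max_batch_size."""
    return len(set(max_batch_sizes(configs).values())) <= 1


def report(model_names, base_url="http://localhost:8000"):
    """Fetch configs from a running server and print the comparison."""
    configs = {name: fetch_model_config(name, base_url) for name in model_names}
    for name, size in max_batch_sizes(configs).items():
        print(f"{name}: max_batch_size={size}")
    print("match" if batch_sizes_match(configs) else "MISMATCH")


# Example call against a running server (model names from this thread):
# report(["my_trt_model", "my_tf_model"])
```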

Jun 13 '24 19:06 indrajit96