Xid 31 error in TensorRT 8.6.1.6 when running two CudaGraph captured ExecutionContexts concurrently on RTX 4070 or RTX A4500

Open soooch opened this issue 1 year ago • 3 comments

Description

I have two TensorRT plans compiled from ONNX using the standard TensorRT builder and ONNX parser.
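
For reference, a minimal sketch of roughly how each plan is built with the C++ builder and ONNX parser; the actual build code and models are in the repository linked below, and the paths, class names, and omitted error handling here are placeholders:

// Illustrative only: build a serialized TensorRT plan from an ONNX file.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>

class StderrLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::fprintf(stderr, "%s\n", msg);
    }
};

void buildPlan(const char* onnxPath, const char* planPath) {
    StderrLogger logger;
    auto* builder = nvinfer1::createInferBuilder(logger);
    auto* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto* config = builder->createBuilderConfig();
    auto* plan = builder->buildSerializedNetwork(*network, *config);

    // Write the serialized engine to disk as the .plan file.
    std::ofstream out(planPath, std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
}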

I can successfully capture the ExecutionContexts derived from these plans to CudaGraphs and launch these on Streams (with outputs as expected).
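
A minimal sketch of the capture step, assuming the enqueueV3 path and that all I/O tensor addresses were already bound with setTensorAddress (the real code is in the linked repository):

// Illustrative only: capture one ExecutionContext's enqueue into a CUDA graph.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

cudaGraphExec_t captureToGraph(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    // Warm-up enqueue outside of capture so lazy allocations don't end up in the graph.
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    // Capture the enqueued work and instantiate an executable graph.
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t graphExec = nullptr;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}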

However, when these operations are launched repeatedly in a loop and certain conditions are met, we eventually hit an Xid 31 error after an arbitrary, large number of iterations. The error surfaces in the program as CUDA error 700 (illegal memory access) when synchronizing the first stream.

The following conditions must all be true to trigger the error (a minimal sketch of the launch pattern follows the list):

  • The ExecutionContexts must be captured to graphs.
  • The two ExecutionContexts must be executing in parallel (on two Streams).
  • There must be other compute processes on the same GPU.
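
A minimal sketch of the failing launch pattern, assuming two graphs were instantiated as above (function and variable names are illustrative):

// Illustrative only: two instantiated graphs launched concurrently on two streams in a loop.
#include <cstdio>
#include <cuda_runtime_api.h>

void launchLoop(cudaGraphExec_t graphA, cudaGraphExec_t graphB,
                cudaStream_t streamA, cudaStream_t streamB) {
    for (;;) {
        cudaGraphLaunch(graphA, streamA);
        cudaGraphLaunch(graphB, streamB);

        // The failure is observed here as cudaErrorIllegalAddress (700).
        cudaError_t err = cudaStreamSynchronize(streamA);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "streamA sync failed: %s\n", cudaGetErrorString(err));
            break;
        }
        cudaStreamSynchronize(streamB);
    }
}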

compute-sanitizer (all tools) and cuda-memcheck (all tools) report no problems. The issue doesn't seem to occur when running under cuda-gdb. With CUDA_LAUNCH_BLOCKING=1 set, the error is still reported at synchronization.

Environment

TensorRT Version: 8.6.1.6
GPU Type: tested with RTX 4070 and RTX A4500
Nvidia Driver Version: 550.78 (RTX 4070) or 525.60.13 (RTX A4500)
CUDA Version: tested with 11.8 and 12.3.2
CUDNN Version: 8.9.7
Operating System + Version: tested with Linux 6.6 and Linux 6.1
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tested on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 and nvcr.io/nvidia/tensorrt:24.01-py3

Relevant Files

https://github.com/soooch/weird-trt-thing

Steps To Reproduce

git clone git@github.com:soooch/weird-trt-thing.git
cd weird-trt-thing
docker run --gpus all -it --rm -v .:/workspace nvcr.io/nvidia/tensorrt:24.01-py3

once inside container:

apt update
apt-get install -y parallel

make

# need at least 2, but will fail faster if more (hence 16)
parallel -j0 --delay 0.3 ./fuzzer ::: {1..16}
# wait up to ~ 10 minutes. usually much faster

soooch · Aug 07 '24 17:08

This issue has also been posted to the Nvidia Developer Forums: https://forums.developer.nvidia.com/t/xid-31-error-when-two-cudagraph-captured-executioncontexts-are-executed-concurrently/302553/1

soooch · Aug 07 '24 17:08

https://github.com/NVIDIA/TensorRT/issues/3633 sounds very similar.

@zerollzeng @oxana-nvidia any chance we could get confirmation on this being the same issue? And if so, is there any news on a fix?

soooch · Aug 07 '24 21:08

Yes, I think it is the same issue. I still don't have information on which CUDA version is planned to include the fix.

oxana-nvidia · Aug 07 '24 21:08