Xid 31 error in TensorRT 8.6.1.6 when running two CudaGraph captured ExecutionContexts concurrently on RTX 4070 or RTX A4500

Open soooch opened this issue 1 year ago • 3 comments

Description

I have two TensorRT plans compiled from ONNX using the standard TensorRT builder and ONNX parser.
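
For reference, a minimal sketch of roughly how each plan is built with the C++ builder and ONNX parser; the actual build code and models are in the repository linked below, and the paths, class names, and omitted error handling here are placeholders:

// Illustrative only: build a serialized TensorRT plan from an ONNX file.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>

class StderrLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::fprintf(stderr, "%s\n", msg);
    }
};

void buildPlan(const char* onnxPath, const char* planPath) {
    StderrLogger logger;
    auto* builder = nvinfer1::createInferBuilder(logger);
    auto* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto* config = builder->createBuilderConfig();
    auto* plan = builder->buildSerializedNetwork(*network, *config);

    // Write the serialized engine to disk as the .plan file.
    std::ofstream out(planPath, std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
}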

I can successfully capture the ExecutionContexts derived from these plans to CudaGraphs and launch these on Streams (with outputs as expected).
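
A minimal sketch of the capture step, assuming the enqueueV3 path and that all I/O tensor addresses were already bound with setTensorAddress (the real code is in the linked repository):

// Illustrative only: capture one ExecutionContext's enqueue into a CUDA graph.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

cudaGraphExec_t captureToGraph(nvinfer1::IExecutionContext* context, cudaStream_t stream) {
    // Warm-up enqueue outside of capture so lazy allocations don't end up in the graph.
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    // Capture the enqueued work and instantiate an executable graph.
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t graphExec = nullptr;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV3(stream);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}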

However, when these operations are launched repeatedly in a loop and certain conditions are met, we eventually hit an Xid 31 error after an arbitrary, large number of iterations. The error surfaces in the program as CUDA error 700 (illegal memory access) when synchronizing the first stream.

The following conditions must all be true to trigger the error (a minimal sketch of the launch pattern follows the list):

  • The ExecutionContexts must be captured to graphs.
  • The two ExecutionContexts must be executing in parallel (on two Streams).
  • There must be other compute processes on the same GPU.
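
A minimal sketch of the failing launch pattern, assuming two graphs were instantiated as above (function and variable names are illustrative):

// Illustrative only: two instantiated graphs launched concurrently on two streams in a loop.
#include <cstdio>
#include <cuda_runtime_api.h>

void launchLoop(cudaGraphExec_t graphA, cudaGraphExec_t graphB,
                cudaStream_t streamA, cudaStream_t streamB) {
    for (;;) {
        cudaGraphLaunch(graphA, streamA);
        cudaGraphLaunch(graphB, streamB);

        // The failure is observed here as cudaErrorIllegalAddress (700).
        cudaError_t err = cudaStreamSynchronize(streamA);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "streamA sync failed: %s\n", cudaGetErrorString(err));
            break;
        }
        cudaStreamSynchronize(streamB);
    }
}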

compute-sanitizer (all tools) and cuda-memcheck (all tools) report no problems. The issue doesn't seem to occur when running under cuda-gdb. With CUDA_LAUNCH_BLOCKING=1 set, the error is still reported at synchronization.

Environment

TensorRT Version: 8.6.1.6
GPU Type: tested with RTX 4070 and RTX A4500
Nvidia Driver Version: 550.78 (RTX 4070) or 525.60.13 (RTX A4500)
CUDA Version: tested with 11.8 and 12.3.2
CUDNN Version: 8.9.7
Operating System + Version: tested with Linux 6.6 and Linux 6.1
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tested on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 and nvcr.io/nvidia/tensorrt:24.01-py3

Relevant Files

https://github.com/soooch/weird-trt-thing

Steps To Reproduce

git clone git@github.com:soooch/weird-trt-thing.git
cd weird-trt-thing
docker run --gpus all -it --rm -v .:/workspace nvcr.io/nvidia/tensorrt:24.01-py3

once inside container:

apt update
apt-get install -y parallel

make

# need at least 2, but will fail faster if more (hence 16)
parallel -j0 --delay 0.3 ./fuzzer ::: {1..16}
# wait up to ~ 10 minutes. usually much faster

soooch · Aug 07 '24 17:08

This issue has also been posted to the Nvidia Developer Forums: https://forums.developer.nvidia.com/t/xid-31-error-when-two-cudagraph-captured-executioncontexts-are-executed-concurrently/302553/1

soooch · Aug 07 '24 17:08

https://github.com/NVIDIA/TensorRT/issues/3633 sounds very similar.

@zerollzeng @oxana-nvidia any chance we could get confirmation on this being the same issue? And if so, is there any news on a fix?

soooch · Aug 07 '24 21:08

Yes, I think it is the same issue. I still don't have information on which CUDA version is planned to include the fix.

oxana-nvidia · Aug 07 '24 21:08