[Question] Is `IOutputAllocator::reallocateOutput` guaranteed to be called before `context->enqueueV3` returns?
Description
I cannot find any information about when IOutputAllocator::reallocateOutput is called relative to context->enqueueV3. Is there any guarantee that this function is called before enqueueV3 returns, or should I explicitly synchronize the stream?
In other words, in the following pseudo-code:
context->setOutputAllocator(name, allocator);
// ...
context->enqueueV3(stream);
cudaStreamSynchronize(stream); // <-- Is this necessary?
// Memcpy device -> host; is it valid to ask the allocator for the device buffer without stream synchronization?
cudaMemcpyAsync(hostBuffer, allocator->getDeviceBuffer(), ...);
Should I explicitly synchronize the stream after enqueueV3 for the pointer returned by allocator->getDeviceBuffer() to be valid? Or is allocator->reallocateOutput guaranteed to be called before enqueueV3 returns, in which case stream synchronization is unnecessary?
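For reference, a minimal sketch of the kind of allocator I mean (assuming the TensorRT 8.5-era nvinfer1::IOutputAllocator interface; getDeviceBuffer is my own helper, not part of the TensorRT API):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

class MyOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    // TensorRT calls this once the output size is known; it may return a new pointer.
    void* reallocateOutput(char const* tensorName, void* currentMemory,
                           uint64_t size, uint64_t alignment) noexcept override
    {
        if (size > capacity_)
        {
            cudaFree(devicePtr_); // cudaFree(nullptr) is a no-op
            if (cudaMalloc(&devicePtr_, size) != cudaSuccess)
            {
                devicePtr_ = nullptr;
                capacity_ = 0;
                return nullptr;
            }
            capacity_ = size; // cudaMalloc's 256-byte alignment satisfies typical requests
        }
        return devicePtr_;
    }

    // TensorRT calls this to report the final output shape.
    void notifyShape(char const* tensorName, nvinfer1::Dims const& dims) noexcept override
    {
        dims_ = dims;
    }

    void* getDeviceBuffer() const noexcept { return devicePtr_; }

private:
    void* devicePtr_{nullptr};
    uint64_t capacity_{0};
    nvinfer1::Dims dims_{};
};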
Please refer to our API doc: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#aa174ba57c44df821625ce4d3317dd7aa
should I explicitly synchronize the stream?
Yes.
Should I explicitly synchronize the stream after enqueueV3 for the pointer returned by allocator->getDeviceBuffer() to be valid?
The pointer is always valid until you free the memory, but the correct output is ready only after synchronization is done.
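In other words, the safe pattern looks like this sketch (names follow the pseudo-code above; outputBytes is a placeholder for the output size):

context->setOutputAllocator(name, allocator);
context->enqueueV3(stream);

// Synchronize first; only after this is the output content guaranteed correct.
cudaStreamSynchronize(stream);

// Safe to read the allocator's pointer and its contents now.
cudaMemcpy(hostBuffer, allocator->getDeviceBuffer(),
           outputBytes, cudaMemcpyDeviceToHost);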
I think my question was more about the calling order of reallocateOutput and enqueueV3. Since enqueueV3 is asynchronous, is it possible that by the time cudaMemcpyAsync is called, reallocateOutput has still not been called by TensorRT, and therefore the device pointer is invalid (because reallocateOutput might return a different pointer)?
If there is a guarantee that reallocateOutput is always called by the time enqueueV3 returns, there is no need for an explicit synchronization before the memcpy; see the sketch below.
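That is, if the guarantee held, a single synchronization after the copy would suffice (a sketch of the pattern I have in mind, not a confirmed-safe idiom; outputBytes is a placeholder):

context->enqueueV3(stream);
// Only valid if reallocateOutput has already been called when we reach this line.
cudaMemcpyAsync(hostBuffer, allocator->getDeviceBuffer(),
                outputBytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream); // one sync covers both inference and the copy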
I'm having the following problem; what should I do? Here's my code.
void* buffers[2]{};
const int inputIndex = 0;
const int outputIndex = 1;
// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * IN_H * IN_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * 4 * sizeof(float))); // size must match the device-to-host copy below
// Pass the TensorRT buffers for input and output
context.setTensorAddress(IN_NAME, buffers[inputIndex]);
context.setTensorAddress(OUT_NAME, buffers[outputIndex]);
// Create stream
cudaStream_t stream{};
CHECK(cudaStreamCreate(&stream));
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * IN_H * IN_W * sizeof(float), cudaMemcpyHostToDevice, stream)); // destination first: device buffer <- host input
// Run inference
context.enqueueV3(stream);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * 4 * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream));
// Release stream and buffers
CHECK(cudaStreamDestroy(stream));
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));