[Question] [TREx] Unexpected TREx Layer-Sum vs End-to-End Latency Behavior in LLM Generation Phase (Qwen2-VL on NVIDIA Thor)
Description
I am analyzing the performance of the Qwen2-VL model on NVIDIA Thor using the TREx (TensorRT Engine Explorer) tool.
According to the README, when using trtexec to time individual layers, the sum of per-layer average latencies is expected to be higher than the end-to-end engine latency, due to measurement overhead. This matches what I observe on ViT and LLM Prefill workloads.
However, when analyzing the LLM Generation phase, I observe the opposite behavior: for FP8 and INT4 quantized engines, the sum of layer latencies reported by TREx is consistently lower than the end-to-end latency.
I manually re-computed latency statistics from the JSON file generated by trtexec and confirmed that TREx is accurately reflecting the JSON contents. Therefore, I would like to confirm whether this behavior is expected or if there may be an issue with how I invoked trtexec.
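For reference, the manual re-computation was done along these lines (a minimal sketch; `sum_layer_latencies` is an illustrative helper, and it assumes the `--exportProfile` output is a JSON array whose per-layer records carry an `"averageMs"` key, as in my output files):

```python
import json

def sum_layer_latencies(profile_json_path):
    """Sum the per-layer 'averageMs' values from a trtexec --exportProfile JSON.

    Records without an 'averageMs' key (e.g. a leading count record)
    are skipped.
    """
    with open(profile_json_path) as f:
        records = json.load(f)
    return sum(r["averageMs"] for r in records
               if isinstance(r, dict) and "averageMs" in r)
```

The result of this sum is then compared against the `"mean"` value under `"GPU Compute Time"` in `profile.metadata.json`.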
Environment
TensorRT Version: 10.13.1
NVIDIA GPU: NVIDIA Thor
NVIDIA Driver Version:
CUDA Version: 12.8
CUDNN Version:
Operating System:
Python Version (if applicable): 3.12.3
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Convert the fine-tuned Qwen2-VL model into a TensorRT engine.
I used the TensorRT-LLM workflow to build the engine from a fine-tuned Qwen2-VL checkpoint. During engine building, I enabled detailed profiling with:
config->setProfilingVerbosity(nvinfer1::ProfilingVerbosity::kDETAILED);
Generate profiling JSON outputs using trtexec.
After obtaining the engine, I executed trtexec with the following arguments (Python-style argument construction shown here):
trtexec_path,
"--verbose",
"--useCudaGraph",
"--separateProfileRun",
"--useSpinWait",
f"--useProfile={profile}",
f"--loadEngine={engine_path}",
f"--exportTimes={timing_json}",
f"--exportProfile={profiling_json}",
f"--exportLayerInfo={graph_json}",
f"--timingCacheFile={timing_cache}",
"--profilingVerbosity=detailed"
Adding --noDataTransfers resulted in:
sampleInference.cpp:1017: an illegal memory access was encountered
so this flag was removed.
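Put together, the invocation can be sketched as follows (`build_trtexec_cmd` and the argument names are illustrative, not the actual script; --noDataTransfers is deliberately omitted because of the error above):

```python
def build_trtexec_cmd(trtexec_path, engine_path, timing_json,
                      profiling_json, graph_json, timing_cache, profile=True):
    """Assemble the trtexec argument list used for the profiling runs."""
    return [
        trtexec_path,
        "--verbose",
        "--useCudaGraph",
        "--separateProfileRun",
        "--useSpinWait",
        f"--useProfile={profile}",
        f"--loadEngine={engine_path}",
        f"--exportTimes={timing_json}",
        f"--exportProfile={profiling_json}",
        f"--exportLayerInfo={graph_json}",
        f"--timingCacheFile={timing_cache}",
        "--profilingVerbosity=detailed",
    ]
```

The returned list is passed directly to `subprocess.run(cmd, check=True)`.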
Compare TREx results with the raw JSON output.
For the INT4 quantization, the "mean" value under "GPU Compute Time" in profile.metadata.json is:
7.42306 ms
After summing all "averageMs" values in profile.json across layers, the result is:
7.09662919 ms
which is lower than the end-to-end "GPU Compute Time" value.
TREx reports the same cumulative layer-time result as the JSON, confirming that its statistics match the raw trtexec output.
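For concreteness, the discrepancy amounts to roughly 0.33 ms, about 4.4% of the end-to-end mean (straightforward arithmetic on the two values above):

```python
# Gap between end-to-end GPU Compute Time and the layer-latency sum,
# using the two numbers reported above.
e2e_ms = 7.42306           # "mean" under "GPU Compute Time" in profile.metadata.json
layer_sum_ms = 7.09662919  # sum of "averageMs" across layers in profile.json

gap_ms = e2e_ms - layer_sum_ms
gap_pct = 100.0 * gap_ms / e2e_ms
print(f"gap = {gap_ms:.5f} ms ({gap_pct:.2f}% of end-to-end)")
```

This is the opposite sign of the overhead-driven gap described in the TREx README, which is what prompted this question.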
Here are profile.json and profile.metadata.json:
llm_int4_noneagle.engine.1.profile.json llm_int4_noneagle.engine.1.profile.metadata.json
Additionally, I would like to ask for clarification regarding the operator types "kgen" and "gemm" appearing in the TensorRT engine profile. While "gemm" clearly corresponds to matrix-multiplication kernels, I could not find any documentation on what "kgen" represents. Is "kgen" an internal code-generation/fused-kernel category used by TensorRT?