
How to set the initial KV cache length?


I want to test the following case: the initial KV cache length is 2048, and the LLM then iterates 2048 times, so output_tokens = 2048; the KV cache starts at length 2048 and grows to a final length of 4096 (2048 + 2048).

If I run:

FT_NVTX=ON /opt/nvidia/nsight-systems/2024.2.1/bin/nsys profile mpirun -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir ./benchmarks/cpp/temp/engine_out_builddocker_tp8/ --warm_up 1 --batch_size "64" --duration 0 --num_runs 1 --input_output_len "1,2048"

the initial KV cache length is 1, not 2048. So, how do I set the initial KV cache length?
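
To make the arithmetic concrete, here is a minimal sketch (plain Python, not TensorRT-LLM code) of the model of KV cache growth I have in mind: prefill fills the cache to the input length, then each decode step appends one entry.

def kv_cache_len(input_len: int, decode_steps: int) -> int:
    # Illustrative model only: cache is filled by prefill, then +1 per decode step.
    return input_len + decode_steps

# --input_output_len "1,2048": prefill writes 1 token, so decoding
# starts from a KV cache of length 1 (what I observe above).
print(kv_cache_len(1, 0))        # 1

# What I want: start from a KV cache of length 2048 and run 2048
# decode steps, ending at 2048 + 2048 = 4096.
print(kv_cache_len(2048, 2048))  # 4096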

liminn avatar May 11 '24 09:05 liminn

You should set --input_output_len "2048,2048".

byshiue avatar May 14 '24 07:05 byshiue

Sorry, I may not have expressed my meaning clearly. If I set --input_output_len "2048,2048", my understanding is that the measured time includes two parts:

  • part 1: one prefill inference (input sequence length is 2048, initial KV cache length is 0)
  • part 2: 2047 decoding iterations (input sequence length is actually 1, initial KV cache length is 2048), right? (See the sketch below.)
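
In code form, the decomposition I mean looks roughly like this (hypothetical timings, purely to illustrate the two parts):

t_prefill_ms = 100.0     # part 1: one prefill pass over the 2048 input tokens (made-up value)
t_decode_step_ms = 10.0  # part 2: one decode step with input length 1 (made-up value)

total_ms = t_prefill_ms + 2047 * t_decode_step_ms
print(f"total = {total_ms:.1f} ms, of which part 2 = {2047 * t_decode_step_ms:.1f} ms")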

However, I only want to measure the inference time of part 2. How can I set that up?

liminn avatar May 17 '24 09:05 liminn

There is no way to measure that directly. You could use nsys to profile the whole workflow and calculate the time of part 2 manually.
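
One way to do the manual calculation (a sketch, assuming you record the latency reported by two runs that differ only in the output length): run once with --input_output_len "2048,1" so the run is essentially prefill only, run once with "2048,2048", and subtract.

# latency_full_ms and latency_prefill_ms are placeholders for the two
# measured latencies; fill them in from your benchmark output.
latency_full_ms = 0.0     # "2048,2048": prefill + 2047 decode steps
latency_prefill_ms = 0.0  # "2048,1": prefill only (a single generated token)

part2_ms = latency_full_ms - latency_prefill_ms  # ~2047 decode steps
per_step_ms = part2_ms / 2047                    # mean time per decode step
print(f"part 2 total: {part2_ms:.2f} ms, per step: {per_step_ms:.4f} ms")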

byshiue avatar May 23 '24 07:05 byshiue

ok, thanks

liminn avatar May 23 '24 07:05 liminn