How to set the initial KV cache length?
I want to test the following case: the initial KV cache length is 2048 and the LLM runs 2048 iterations, so output_tokens=2048 and the final KV cache length is 4096 (2048+2048).
If I run:
FT_NVTX=ON /opt/nvidia/nsight-systems/2024.2.1/bin/nsys profile mpirun -n 8 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --engine_dir ./benchmarks/cpp/temp/engine_out_builddocker_tp8/ --warm_up 1 --batch_size "64" --duration 0 --num_runs 1 --input_output_len "1,2048"
then the initial KV cache length is 1, not 2048. So how do I set the initial KV cache length?
You should set --input_output_len "2048,2048".
Sorry, I may not have expressed my meaning clearly.
If I set --input_output_len "2048,2048", my understanding is that the measured time consists of two parts:
- part 1: one prefill pass (input sequence length is 2048, initial KV cache length is 0)
- part 2: 2047 decoding iterations (input sequence length is 1 per iteration, initial KV cache length is 2048), right?
However, I only want to test the inference time of part 2, so how can I set it?
There is no way to measure that directly. You could use nsys to measure the whole workflow, and calculate the time of part 2 manually.
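As a rough sketch of that manual calculation: assuming you have read the end-to-end latency and the duration of the first (prefill) step off the nsys timeline, the decoding-only time is simply the difference, and dividing by the number of decoding steps gives an average per-step latency. The function name and the numbers below are hypothetical and are not part of TensorRT-LLM. One way to approximate the prefill duration, which I have not verified, might be a separate run with --input_output_len "2048,1".

```python
def decode_only_time(total_ms: float, prefill_ms: float, output_len: int):
    """Split an end-to-end latency into its decoding-only portion.

    Hypothetical helper: total_ms and prefill_ms are values you measured
    yourself (e.g. read from the nsys timeline), not benchmark outputs.
    Returns (decode_total_ms, average_ms_per_decode_step).
    """
    if output_len < 2:
        raise ValueError("need at least one decoding step (output_len >= 2)")
    if prefill_ms > total_ms:
        raise ValueError("prefill time cannot exceed total time")

    # The prefill pass produces the first output token, so the remaining
    # output_len - 1 tokens come from decoding iterations.
    decode_steps = output_len - 1
    decode_ms = total_ms - prefill_ms
    return decode_ms, decode_ms / decode_steps


# Made-up numbers for illustration only, not real measurements.
decode_ms, per_step_ms = decode_only_time(
    total_ms=21000.0, prefill_ms=530.0, output_len=2048
)
print(f"decoding total: {decode_ms:.1f} ms, per step: {per_step_ms:.3f} ms")
```

Note that the per-step figure is only an average: each decoding step attends over a longer KV cache than the previous one, so later steps can be slower than earlier ones.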
ok, thanks