Benchmark toolkit support
What would you like to be added:
It would be great to support benchmarking LLM throughput and latency across different backends.
Why is this needed:
Provide performance evidence for users.
Completion requirements:
This enhancement requires the following artifacts:
- [x] Design doc
- [ ] API change
- [x] Docs update
The artifacts should be linked in subsequent comments.
/kind feature
An example would look like:
```yaml
metadata:
  name: llama3-405b-2024-07-01
  namespace: llm
spec:
  endpoint: llm-1.svc.local
  port: 8000
  performance:
    traffic-shape:
      req-rate: 10 qps
      model-type: instruction-tuned-llm/diffusion
      dataset: share-gpt
      input-length: 1024
      max-output-length: 1024
      total-prompts: 1000
      traffic-spike:
        burst: 10m
        req-rate: 20 qps
status:
  status: success
  results: gcs-bucket-1/llama3-405b-2024-07-01
```
Inspired by https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ/edit
Also see:
- https://github.com/ray-project/llmperf
- https://github.com/run-ai/llmperf
- https://github.com/kubernetes-sigs/inference-perf
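For reference, the `req-rate` / `traffic-spike` parameters in the example could be driven by a simple open-loop load generator. This is only a hypothetical sketch; `sendRequest` stands in for a real HTTP call to the inference endpoint:

```go
package main

import (
	"fmt"
	"time"
)

// sendRequest is a stand-in for a real call to the inference endpoint;
// a real generator would record per-request latency here.
func sendRequest(i int) {
	_ = i
}

// run fires `total` requests at a fixed rate of `reqRate` requests per
// second and returns the number of requests sent.
func run(reqRate float64, total int) int {
	interval := time.Duration(float64(time.Second) / reqRate)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	sent := 0
	for i := 0; i < total; i++ {
		<-ticker.C
		sendRequest(i)
		sent++
	}
	return sent
}

func main() {
	// Steady phase at 100 qps, then a spike at 200 qps, mirroring the
	// traffic-shape/traffic-spike split in the example spec.
	steady := run(100, 10)
	spike := run(200, 10)
	fmt.Printf("sent %d steady + %d spike requests\n", steady, spike)
}
```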
/help
We have the gateway right now, so I think we can push this forward.
/assign
Here are some more references:
- fw-ai/benchmark
- genai-perf
- huggingface/inference-benchmarker
- llm-d
- SGLang
I think we have to clarify the main target and scope of our benchmarking tool first: with the rapid growth of the LLM + Kubernetes community, there are already many tools for benchmarking LLM inference services. For example, llm-d leverages fmperf-project/fmperf to build its benchmark tool suite, and SGLang's OME defines a BenchmarkJob CRD that runs their genai-bench for benchmarking with fine-grained parameters.
Thanks @rudeigerc, your concern makes sense. I'll update the description later and discuss it with you if possible.