feat(llama2-70b): Add multinode to SUT_API.py for the offline scenario
## Motivation
The LLaMA-2-70B benchmark (Offline Scenario) currently does not have multinode support.
## Contents
This PR adds multinode inference support to the LLaMA-2-70B benchmark (Offline scenario) by enabling SUT_API.py to issue requests to multiple OpenAI-compatible endpoints (e.g., vLLM, TensorRT-LLM) simultaneously. Prompts are partitioned near-evenly across servers (see the sketch after the list below).
- Multi-server API mode for SUT_API (`--vllm`) with even prompt distribution across multiple OpenAI-compatible endpoints.
- Unit tests for API-related logic (`query_batch` and `query_servers`).
- Documentation updates and example commands for multinode usage.
- Additional dependencies specified in the READMEs.
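The distribution logic can be pictured roughly as follows. This is a minimal sketch, assuming a `requests`-based client and an OpenAI-compatible `/v1/completions` route; the actual `query_batch`/`query_servers` implementations live in SUT_API.py and may differ in signature and batching details.

```python
# Sketch only: function names match the PR's description, but bodies are assumed.
import concurrent.futures
import requests


def partition(prompts, n_servers):
    """Split prompts into n_servers near-equal contiguous chunks."""
    base, rem = divmod(len(prompts), n_servers)
    chunks, start = [], 0
    for i in range(n_servers):
        size = base + (1 if i < rem else 0)  # first `rem` chunks get one extra prompt
        chunks.append(prompts[start:start + size])
        start += size
    return chunks


def query_batch(server, prompts, model_name):
    """Send one chunk of prompts to a single OpenAI-compatible endpoint."""
    resp = requests.post(
        f"{server}/v1/completions",
        json={"model": model_name, "prompt": prompts, "max_tokens": 1024},
    )
    resp.raise_for_status()
    return [choice["text"] for choice in resp.json()["choices"]]


def query_servers(servers, prompts, model_name):
    """Fan prompt chunks out to all servers concurrently, preserving prompt order."""
    chunks = partition(prompts, len(servers))
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(query_batch, s, c, model_name)
                   for s, c in zip(servers, chunks)]
        results = []
        for f in futures:  # iterate in submission order to keep ordering stable
            results.extend(f.result())
    return results
```

A unit test of the partitioning invariant might look like this (hypothetical; the PR's actual tests are not reproduced here):

```python
def test_partition_even():
    prompts = list(range(10))
    chunks = partition(prompts, 3)
    assert [len(c) for c in chunks] == [4, 3, 3]      # near-even split
    assert [p for c in chunks for p in c] == prompts  # order preserved
```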
## User-facing Changes
### Usage Example (Offline + Multinode API mode)
```bash
python3 -u main.py --scenario Offline \
    --vllm \
    --api-model-name ${MODEL_NAME} \
    --api-server http://node1:8000 \
    --api-server http://node2:8000 \
    --api-server http://node3:8000 \
    --model-path ${CHECKPOINT_PATH} \
    --user-conf user.conf \
    --total-sample-count 24576 \
    --dataset-path ${DATASET_PATH} \
    --output-log-dir offline-logs
```
Each `--api-server` argument registers an endpoint; SUT_API distributes prompts across them automatically.
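The repeatable flag behaves like argparse's `append` action. A minimal sketch of how the endpoints might be collected (assumed; main.py's actual parser is not shown here):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--vllm", action="store_true",
                    help="Use the multi-server OpenAI-compatible API mode")
parser.add_argument("--api-server", action="append", default=[],
                    dest="api_servers",
                    help="Repeatable; each use registers one endpoint")

# Each repetition appends one endpoint URL to the list, in order.
args = parser.parse_args(
    ["--vllm", "--api-server", "http://node1:8000",
     "--api-server", "http://node2:8000"])
assert args.api_servers == ["http://node1:8000", "http://node2:8000"]
```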