# Add Eval Runs After Throughput Benchmarks
## TL;DR

- Adds optional eval runs (e.g. GSM8K) that run right after throughput benchmarks, reusing the same inference server.
- Evals are plumbed into all throughput workflows, but are opt-in (`RUN_EVAL=false` → no change in behavior).
- When enabled, the default eval suite is `gsm8k` via `lm-eval`, with support for `lighteval` as an alternative.
- To keep CI cost reasonable, evals only run for two representative points per config:
  - lowest TP per GPU with highest concurrency, and
  - highest TP per GPU with highest concurrency.
- All changes are contained under `benchmarks/*` (shared `benchmark_lib.sh` + runner scripts).
## Motivation

Throughput optimizations can quietly trade off accuracy. Without evals, a misconfigured server (aggressive truncation, bad decoding parameters, wrong endpoint settings) can still produce great throughput numbers but garbage answers.
This PR wires evals directly into the benchmarking flow so that:

- Each representative throughput config has an associated numerical accuracy check.
- We can align throughput numbers with SLAs and avoid "gaming" (e.g. lowering `max_new_tokens` or silently dropping tokens).
- Adding new eval suites in the future (beyond GSM8K) is straightforward and reuses the same plumbing.
## What This PR Changes

### 1. Optional evals for all throughput workflows

- All throughput workflows that call `benchmarks/*` can now run evals immediately after throughput.
- This is controlled via the matrix and an environment flag:
  - The matrix sets a boolean `FIELD_RUN_EVAL`.
  - Workflows export this as `RUN_EVAL` for each matrix entry.
- Behavior:
  - `RUN_EVAL` unset or `false` → only throughput runs (current behavior).
  - `RUN_EVAL=true` → throughput, then evals on the same server.

By default, no evals are run (opt-in), but the plumbing exists for all throughput workflows.

When evals are enabled, the default task is GSM8K:

- `EVAL_TASK` defaults to `gsm8k`.
- `EVAL_FRAMEWORK` defaults to `lm-eval`.
- Both can be overridden via env for future suites, as in the sketch below.
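For illustration, a minimal sketch of how those knobs could look in the shared shell library (the variable names come from this PR; the exact defaulting style in `benchmark_lib.sh` is an assumption):

```bash
# Hedged sketch: eval knobs consumed by the runner scripts.
# Variable names are from this PR; the defaulting style is an assumption.
RUN_EVAL="${RUN_EVAL:-false}"                # opt-in; workflows export FIELD_RUN_EVAL as RUN_EVAL
EVAL_TASK="${EVAL_TASK:-gsm8k}"              # default eval suite
EVAL_FRAMEWORK="${EVAL_FRAMEWORK:-lm-eval}"  # or "lighteval"
```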
### 2. Representative eval selection via matrix generation
To balance coverage and cost, we only run evals for two key points per configuration.
For each unique group, the matrix helper `mark_eval_entries`:

- Uses the group key `(model, runner, framework, precision, isl, osl)`.
- Within each group:
  - Finds the min and max TP.
  - For the max TP: identifies entries with that TP and, among them, marks the one with the highest concurrency (`FIELD_CONC`) for eval.
  - For the min TP (if different from the max TP): applies the same logic, marking the lowest-TP entry with the highest concurrency.

The selected entries get `entry[FIELD_RUN_EVAL] = True`.

This means evals run only at the highest concurrency for the lowest and highest TP per GPU for each (model, runner, framework, precision, ISL, OSL) combo.
Everything else runs throughput-only.
### 3. Eval integration in runner scripts (`benchmarks/*`)
All runner scripts follow the same pattern:

- Start the server (vLLM, TRT-LLM, sglang, etc.).
- Call `wait_for_server_ready`.
- Run throughput via `run_benchmark_serving`.
- Conditionally run evals:
  - Only when `RUN_EVAL=true`.
  - Use `run_eval` + `append_lm_eval_summary`, as sketched below.
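A condensed sketch of that flow, using the helper names described in this PR; the server-launch step (`start_server` here) is a stand-in for the framework-specific launch logic, not a real helper name:

```bash
# Hedged sketch of the common runner-script flow.
# wait_for_server_ready / run_benchmark_serving / run_eval / append_lm_eval_summary
# are the helpers described in this PR; start_server is a placeholder.
start_server "$@"            # framework-specific launch (vLLM / TRT-LLM / sglang), elided
wait_for_server_ready        # block until the endpoint responds

run_benchmark_serving        # throughput benchmark, unchanged from today

if [[ "${RUN_EVAL:-false}" == "true" ]]; then
  # Reuse the same, already-warm server for the accuracy check.
  run_eval --framework "${EVAL_FRAMEWORK:-lm-eval}"
  append_lm_eval_summary
fi
```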
### 4. Eval Frameworks
This PR supports two eval frameworks, with a unified entrypoint and local patching to handle reasoning tokens and OpenAI-compatible endpoints.
#### 1. lm-eval (lm-evaluation-harness)

##### 1.1 Installation & patching

- `_install_lm_eval_deps`:
  - Installs `lm-eval[api]`.
  - Pulls lm-evaluation-harness.
- `_patch_lm_eval`: injects a `sitecustomize.py` that:
  - Fixes `LocalChatCompletion.parse_generations`:
    - Handles responses where `message.content` is empty but `reasoning_content` contains the actual answer.
    - Avoids crashes and ensures text extraction works for reasoning-style models.
  - Fixes `TemplateAPI.apply_chat_template`:
    - Stops injecting `{"type": "text"}` into the payload when there is no tokenizer or a non-HF tokenizer.
    - This was breaking some TRT/vLLM endpoints with strict JSON schemas.

The patched behavior is wired in by adding the generated directory to `PYTHONPATH`.
##### 1.2 Running lm-eval (`run_lm_eval`)

`run_lm_eval` wraps the `lm_eval` CLI (sketched below):

- Defaults:
  - `task = ${EVAL_TASK:-gsm8k}`
  - `num_fewshot = ${NUM_FEWSHOT:-5}`
  - `concurrent_requests = 32`
  - `gen_max_tokens = 4096`
  - `temperature = 0`, `top_p = 1`
- Outputs are written under `${EVAL_RESULT_DIR}` (default: a new `/tmp/eval_out-XXXXXX`).
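Roughly what the wrapper's `lm_eval` invocation looks like under those defaults. Only the defaults listed above come from this PR; the `local-chat-completions` backend and the exact `--model_args` / `--gen_kwargs` keys are assumptions about how the OpenAI-compatible endpoint is wired up, and `MODEL_NAME` / `PORT` are placeholders:

```bash
# Hedged sketch of run_lm_eval; only the defaults (task, few-shot, concurrency,
# generation length, sampling) come from this PR. Backend name and arg keys are
# assumptions; MODEL_NAME and PORT are placeholders.
run_lm_eval() {
  local task="${EVAL_TASK:-gsm8k}"
  local num_fewshot="${NUM_FEWSHOT:-5}"
  local out_dir="${EVAL_RESULT_DIR:-$(mktemp -d /tmp/eval_out-XXXXXX)}"

  lm_eval \
    --model local-chat-completions \
    --model_args "model=${MODEL_NAME},base_url=http://localhost:${PORT}/v1/chat/completions,num_concurrent=32" \
    --tasks "${task}" \
    --num_fewshot "${num_fewshot}" \
    --gen_kwargs "max_tokens=4096,temperature=0,top_p=1" \
    --apply_chat_template \
    --output_path "${out_dir}"
}
```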
##### 1.3 Summarizing lm-eval results (`append_lm_eval_summary`)

- Writes `meta_env.json` describing:
  - `framework`
  - `precision`
  - `tp`
  - `ep`
  - `dp_attention`
  - `model`
- Runs `utils/lm_eval_to_md.py` to convert raw lm-eval results into `SUMMARY.md`.
- If running inside GitHub Actions:
  - Appends `SUMMARY.md` to `$GITHUB_STEP_SUMMARY` (in the same runner).
- Raw eval outputs remain under `/tmp` (they are not copied back into the repo workspace).
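A hedged sketch of that summarization step. The `meta_env.json` keys, the converter script, and the `$GITHUB_STEP_SUMMARY` append come from this PR; the environment variables feeding the JSON and the CLI of `utils/lm_eval_to_md.py` are assumptions:

```bash
# Hedged sketch of append_lm_eval_summary; the converter's CLI and the
# env vars feeding meta_env.json are assumptions.
append_lm_eval_summary() {
  local out_dir="${EVAL_RESULT_DIR}"

  # Record the benchmark context alongside the raw eval output.
  cat > "${out_dir}/meta_env.json" <<EOF
{"framework": "${FRAMEWORK}", "precision": "${PRECISION}", "tp": "${TP}",
 "ep": "${EP}", "dp_attention": "${DP_ATTENTION}", "model": "${MODEL_NAME}"}
EOF

  # Convert raw lm-eval results into a markdown summary.
  python utils/lm_eval_to_md.py "${out_dir}" > "${out_dir}/SUMMARY.md"

  # Surface the summary on the GitHub Actions job page when running in CI.
  if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
    cat "${out_dir}/SUMMARY.md" >> "${GITHUB_STEP_SUMMARY}"
  fi
}
```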
#### 2. lighteval + litellm

While `lm-eval` is the default, this PR also supports `lighteval` as an alternative backend via the unified `run_eval` wrapper.

##### 2.1 Installation & patching
- `_install_lighteval_deps`:
  - Installs `lighteval[api]` and `litellm`.
- `_patch_lighteval_litellm` via `sitecustomize.py`:
  - Disables `sglang` imports:
    - Some lighteval versions attempt to import `sglang`, which crashes with our version mismatches.
    - We patch `lighteval.utils.imports.is_package_available("sglang")` to always return `False`.
  - Patches `LiteLLMClient` to be OpenAI-server friendly:
    - Removes `response_format={"type": "text"}`, which interferes with vLLM endpoints.
    - Handles reasoning-only responses via `reasoning_content`.
    - Adds retry/backoff logic around `litellm` completions.
  - Switches parallel evaluation to threads:
    - Replaces async concurrency with `ThreadPoolExecutor(self.concurrent_requests)` to avoid stalls under high load.
  - Returns `ModelResponse` with `text` and `reasonings` separated for downstream extraction.
##### 2.2 Running lighteval (`run_lighteval_eval`)

- Expects `MODEL_NAME` to be set (will error otherwise).
- Wraps the model with an OpenAI-style prefix:
  - `lite_model="openai/${MODEL_NAME}"`
- Builds `MODEL_ARGS` for lighteval:
  - `model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p=1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}`
- Task specification:
  - `TASK_SPEC="${task}|${num_fewshot}"`
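A sketch of how `run_lighteval_eval` assembles those pieces. The `MODEL_ARGS` and `TASK_SPEC` contents are taken from this PR; the `lighteval endpoint litellm` invocation at the end is an assumption about the CLI shape, and `base_url`, `task`, `num_fewshot`, `concurrent_requests` are assumed to be set earlier in the function:

```bash
# Hedged sketch of run_lighteval_eval; only MODEL_ARGS / TASK_SPEC contents are
# from this PR. The final CLI call and the earlier variable setup are assumptions.
run_lighteval_eval() {
  : "${MODEL_NAME:?MODEL_NAME must be set for lighteval runs}"

  # base_url, task, num_fewshot, concurrent_requests: assumed set earlier (elided).
  local lite_model="openai/${MODEL_NAME}"
  local MODEL_ARGS="model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p=1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}"
  local TASK_SPEC="${task}|${num_fewshot}"

  # Assumed invocation shape; the exact lighteval subcommand/flags may differ.
  lighteval endpoint litellm "${MODEL_ARGS}" "${TASK_SPEC}"
}
```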
#### 3. Unified eval entrypoint (`run_eval`)

`run_eval` abstracts over frameworks (sketched below):

- Defaults:
  - `EVAL_FRAMEWORK=lm-eval`
  - `EVAL_TASK=gsm8k`
- Runner scripts can override via env or by passing `--framework` explicitly.
- All additional arguments (e.g. `--port`, `--concurrent-requests`, `--results-dir`) are forwarded to the underlying framework-specific function.
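A minimal sketch of the dispatch logic, assuming the function and env var names above; the `--framework` parsing and error handling are illustrative rather than the exact implementation:

```bash
# Hedged sketch of the run_eval dispatcher; names come from this PR,
# the flag parsing and error handling are assumptions.
run_eval() {
  local framework="${EVAL_FRAMEWORK:-lm-eval}"

  # Optional explicit override; all remaining args are forwarded untouched.
  if [[ "${1:-}" == "--framework" ]]; then
    framework="$2"
    shift 2
  fi

  case "${framework}" in
    lm-eval)   run_lm_eval "$@" ;;
    lighteval) run_lighteval_eval "$@" ;;
    *)         echo "Unknown EVAL_FRAMEWORK: ${framework}" >&2; return 1 ;;
  esac
}
```

A runner would then call, e.g., `run_eval --port "${PORT}" --results-dir "${EVAL_RESULT_DIR}"`, regardless of which backend is selected.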
## Future Work / Notes

- Currently the default behavior is unchanged for most users:
  - Evals are off by default (`RUN_EVAL=false`).
  - Only selected matrix entries (lowest & highest TP per GPU at max concurrency) enable `RUN_EVAL=true`.
- The plumbing is now in place to:
  - Add more eval suites (e.g. MMLU, Math, custom internal tasks) via `EVAL_TASK` and `utils/evals/*`.
  - Swap or augment frameworks (`lm-eval` vs `lighteval`) per job via `EVAL_FRAMEWORK`.
- Token count optimizations.
---

@Oseltamivir Can you pls fix the merge conflicts so I can review more holistically?
📊 Line Count Report
File: utils/matrix-logic/generate_sweep_configs.py
Total Lines: 1036
Base Lines: 968
Change: +68 lines 📈
Reviewing from the NV side. Please hold off on merging until we finish the review.