
Adding evals after throughput benchmarks

Open · cquil11 opened this issue 2 months ago · 1 comment

Add Eval Runs After Throughput Benchmarks

TL;DR

  • Adds optional eval runs (e.g. GSM8K) that run right after throughput benchmarks, reusing the same inference server.
  • Evals are plumbed into all throughput workflows, but are opt-in (RUN_EVAL=false → no change in behavior).
  • When enabled, the default eval suite is gsm8k via lm-eval, with support for lighteval as an alternative.
  • To keep CI cost reasonable, evals only run for two representative points per config:
    • Lowest TP per GPU with highest concurrency, and
    • Highest TP per GPU with highest concurrency.
  • All changes are contained under benchmarks/* (shared benchmark_lib.sh + runner scripts).

Motivation

Throughput optimizations can quietly trade off accuracy. Without evals, a misconfigured server (aggressive truncation, bad decoding parameters, wrong endpoint settings) can still produce great throughput numbers while returning garbage answers.

This PR wires evals directly into the benchmarking flow so that:

  • Each representative throughput config has an associated numerical accuracy check.
  • We can align throughput numbers with SLAs and avoid “gaming” (e.g. lowering max_new_tokens or silently dropping tokens).
  • Adding new eval suites in future (beyond GSM8K) is straightforward and reuses the same plumbing.

What This PR Changes

1. Optional evals for all throughput workflows

  • All throughput workflows that call benchmarks/* now have the ability to run evals immediately after throughput.
  • This is controlled via the matrix and an environment flag:
    • Matrix sets a boolean FIELD_RUN_EVAL.
    • Workflows export this as RUN_EVAL for each matrix entry.
  • Behavior:
    • RUN_EVAL unset or false → only throughput runs (current behavior).
    • RUN_EVAL=true → throughput then evals on the same server.

By default, no evals are run (opt-in), but the plumbing exists for all throughput workflows.

When evals are enabled, the default task is GSM8K:

  • EVAL_TASK defaults to gsm8k.
  • EVAL_FRAMEWORK defaults to lm-eval.
  • Both can be overridden via env for future suites.

2. Representative eval selection via matrix generation

To balance coverage and cost, we only run evals for two key points per configuration.

The matrix helper mark_eval_entries does the following for each unique group:

  • Group key: (model, runner, framework, precision, isl, osl).
  • Within each group:
    • Find the min TP and max TP.
    • For max TP:
      • Identify the entries with that TP.
      • Among them, pick the one with the highest concurrency (FIELD_CONC) → mark as eval.
    • For min TP (if different from max TP):
      • Apply the same logic: lowest TP + highest concurrency → mark as eval.

The selected entries get: entry[FIELD_RUN_EVAL] = True

This means evals run only at the highest concurrency for the lowest and the highest TP per GPU, for each (model, runner, framework, precision, ISL, OSL) combo (selection logic sketched below).

Everything else runs throughput-only.
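
The selection logic is small enough to sketch. The snippet below is an illustrative Python reconstruction based on the description above, not the actual helper; the field names (FIELD_TP, FIELD_CONC, FIELD_RUN_EVAL) and the entry layout are assumptions, and the real mark_eval_entries in utils/matrix-logic/generate_sweep_configs.py may differ in detail.

```python
# Hypothetical sketch of the selection logic; field names and entry layout
# are assumed, and the real helper in generate_sweep_configs.py may differ.
from collections import defaultdict

FIELD_TP = "tp"
FIELD_CONC = "conc"
FIELD_RUN_EVAL = "run_eval"
GROUP_FIELDS = ("model", "runner", "framework", "precision", "isl", "osl")


def mark_eval_entries(entries):
    """Mark two representative entries per group for eval runs."""
    groups = defaultdict(list)
    for entry in entries:
        groups[tuple(entry[f] for f in GROUP_FIELDS)].append(entry)

    for group in groups.values():
        tps = [e[FIELD_TP] for e in group]
        for tp in {min(tps), max(tps)}:  # one pick per distinct TP value
            candidates = [e for e in group if e[FIELD_TP] == tp]
            # Highest concurrency at this TP gets the eval run.
            best = max(candidates, key=lambda e: e[FIELD_CONC])
            best[FIELD_RUN_EVAL] = True

    return entries
```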


3. Eval integration in runner scripts (benchmarks/*)

All runner scripts follow the same pattern:

  1. Start the server (vLLM, TRT-LLM, SGLang, etc.).
  2. Call wait_for_server_ready.
  3. Run throughput via run_benchmark_serving.
  4. Conditionally run evals:
    • Only when RUN_EVAL=true.
    • Use run_eval + append_lm_eval_summary (see the sketch below).
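
For orientation, here is that flow rendered as Python pseudocode. The real scripts under benchmarks/* are shell and source benchmark_lib.sh; the server command, port, health endpoint, and the two placeholder functions below are assumptions used only to illustrate the ordering and the RUN_EVAL guard.

```python
# Illustrative rendering of the shell runner pattern; the actual scripts are
# bash, and the server command / endpoint below are placeholders.
import os
import subprocess
import time
import urllib.request

PORT = 8000
SERVER_CMD = ["vllm", "serve", os.environ.get("MODEL_NAME", "my-model"), "--port", str(PORT)]


def wait_for_server_ready(url: str, timeout_s: int = 1800) -> None:
    """Poll the endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return
        except OSError:
            time.sleep(10)
    raise TimeoutError(f"server at {url} never became ready")


def run_benchmark_serving() -> None:
    """Placeholder for the shared throughput benchmark step."""


def run_eval() -> None:
    """Placeholder for the unified eval entrypoint (lm-eval or lighteval)."""


server = subprocess.Popen(SERVER_CMD)                             # 1. start server
try:
    wait_for_server_ready(f"http://localhost:{PORT}/v1/models")   # 2. readiness gate
    run_benchmark_serving()                                       # 3. throughput
    if os.environ.get("RUN_EVAL", "false").lower() == "true":
        run_eval()                                                # 4. optional accuracy check
finally:
    server.terminate()
```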

4. Eval Frameworks

This PR supports two eval frameworks, with a unified entrypoint and local patching to handle reasoning tokens and OpenAI-compatible endpoints.

1. lm-eval (lm-evaluation-harness)

1.1 Installation & patching

  • _install_lm_eval_deps:
    • Installs lm-eval[api].
    • Pulls lm-evaluation-harness.
  • _patch_lm_eval: injects a sitecustomize.py that:
    • Fixes LocalChatCompletion.parse_generations:
      • Handles responses where message.content is empty but reasoning_content contains the actual answer.
      • Avoids crashes and ensures text extraction works for reasoning-style models.
    • Fixes TemplateAPI.apply_chat_template:
      • Stops injecting {"type": "text"} into the payload when there is no tokenizer / non-HF tokenizer.
      • This was breaking some TRT/vLLM endpoints with strict JSON schemas.

Patched behavior is wired by adding the generated directory to PYTHONPATH.
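
The content-extraction part of the patch boils down to a small fallback. The helper below is a standalone sketch of that logic (the real change monkeypatches LocalChatCompletion.parse_generations through the generated sitecustomize.py); the response shape assumed here is a standard OpenAI-style chat completion payload.

```python
# Standalone sketch of the reasoning_content fallback; the real patch applies
# this inside lm-eval's LocalChatCompletion.parse_generations.
from typing import Any, Dict, List


def extract_generations(response: Dict[str, Any]) -> List[str]:
    """Prefer message.content, fall back to reasoning_content when empty."""
    texts = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        content = message.get("content") or ""
        if not content.strip():
            # Reasoning-style models may put the answer here instead.
            content = message.get("reasoning_content") or ""
        texts.append(content)
    return texts
```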

1.2 Running lm-eval (run_lm_eval)

run_lm_eval wraps the lm_eval CLI (an example invocation is sketched after the list below):

  • Defaults:
    • task = ${EVAL_TASK:-gsm8k}
    • num_fewshot = ${NUM_FEWSHOT:-5}
    • concurrent_requests = 32
    • gen_max_tokens = 4096
    • temperature = 0, top_p = 1
  • Outputs are written under ${EVAL_RESULT_DIR} (default is a new /tmp/eval_out-XXXXXX).
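
For reference, an invocation along these lines is what the wrapper amounts to. The flags shown are standard lm-eval options for its local-chat-completions backend; the exact arguments passed by this PR's shell wrapper may differ, and the port and model name are placeholders.

```python
# Plausible lm_eval invocation against the already-running server; the exact
# arguments used by run_lm_eval may differ, port/model name are placeholders.
import os
import subprocess

model = os.environ.get("MODEL_NAME", "my-model")
base_url = "http://localhost:8000/v1/chat/completions"
out_dir = os.environ.get("EVAL_RESULT_DIR", "/tmp/eval_out")

subprocess.run(
    [
        "lm_eval",
        "--model", "local-chat-completions",
        "--model_args", f"model={model},base_url={base_url},num_concurrent=32,max_retries=3",
        "--tasks", os.environ.get("EVAL_TASK", "gsm8k"),
        "--num_fewshot", os.environ.get("NUM_FEWSHOT", "5"),
        "--gen_kwargs", "temperature=0,max_gen_toks=4096",
        "--apply_chat_template",
        "--output_path", out_dir,
    ],
    check=True,
)
```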

1.3 Summarizing lm-eval results (append_lm_eval_summary)

  • Writes meta_env.json describing:
    • framework
    • precision
    • tp
    • ep
    • dp_attention
    • model
  • Runs utils/lm_eval_to_md.py to convert raw lm-eval results into SUMMARY.md (sketched after this list).
  • If running inside GitHub Actions:
    • Appends SUMMARY.md into $GITHUB_STEP_SUMMARY (in the same runner).
  • Raw eval outputs remain under /tmp (they are not copied back into the repo workspace).
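
Roughly, the summary step does the following. In the sketch, the command-line shape of utils/lm_eval_to_md.py and the environment variable names for the metadata fields are assumptions; only the GITHUB_STEP_SUMMARY mechanism is standard GitHub Actions behavior.

```python
# Rough equivalent of append_lm_eval_summary; the lm_eval_to_md.py CLI and
# the metadata env var names are assumed.
import json
import os
import subprocess

result_dir = os.environ.get("EVAL_RESULT_DIR", "/tmp/eval_out")

# Record the configuration the eval ran under, next to the raw results.
meta = {
    "framework": os.environ.get("FRAMEWORK"),
    "precision": os.environ.get("PRECISION"),
    "tp": os.environ.get("TP"),
    "ep": os.environ.get("EP"),
    "dp_attention": os.environ.get("DP_ATTENTION"),
    "model": os.environ.get("MODEL_NAME"),
}
with open(os.path.join(result_dir, "meta_env.json"), "w") as f:
    json.dump(meta, f, indent=2)

# Convert raw lm-eval output into markdown (CLI shape assumed).
summary_md = os.path.join(result_dir, "SUMMARY.md")
subprocess.run(
    ["python", "utils/lm_eval_to_md.py", result_dir, "--output", summary_md],
    check=True,
)

# Inside GitHub Actions, appending to this file surfaces it in the job summary.
step_summary = os.environ.get("GITHUB_STEP_SUMMARY")
if step_summary:
    with open(summary_md) as src, open(step_summary, "a") as dst:
        dst.write(src.read())
```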

2. lighteval + litellm

While lm-eval is the default, this PR also supports lighteval as an alternative backend via the unified run_eval wrapper.

2.1 Installation & patching

  • _install_lighteval_deps:
    • Installs lighteval[api] and litellm.
  • _patch_lighteval_litellm via sitecustomize.py (core ideas sketched after this list):
    • Disables sglang imports:
      • Some lighteval versions attempt to import sglang, which crashes due to version mismatches in our environment.
      • We patch lighteval.utils.imports.is_package_available("sglang") to always return False.
    • Patches LiteLLMClient to be OpenAI-server friendly:
      • Removes response_format={"type": "text"} which interferes with vLLM endpoints.
      • Handles reasoning-only responses via reasoning_content.
      • Adds retry/backoff logic around litellm completions.
    • Switches parallel evaluation to threads:
      • Replaces async concurrency with ThreadPoolExecutor(self.concurrent_requests) to avoid stalls under high load.
    • Returns ModelResponse with text and reasonings separated for downstream extraction.
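
Two of those ideas, the retry/backoff and the reasoning_content fallback, are easy to show in isolation. The helper below is a sketch using litellm's public completion() API rather than lighteval's internal LiteLLMClient (which the real sitecustomize.py patch rewires); the thread-pool switch is not shown.

```python
# Retry/backoff plus reasoning_content fallback, sketched with litellm's
# public completion() API; the real patch lives inside lighteval's client.
import time

import litellm


def robust_completion(model, messages, base_url, max_attempts=5):
    """Query the OpenAI-compatible endpoint with retries; prefer
    message.content and fall back to reasoning_content when it is empty."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            resp = litellm.completion(
                model=model,              # e.g. "openai/<served-model-name>"
                messages=messages,
                api_base=base_url,
                temperature=0.0,
                max_tokens=2048,
            )
            message = resp.choices[0].message
            text = (message.content or "").strip()
            if not text:
                # Reasoning-style models may only populate this field.
                text = (getattr(message, "reasoning_content", None) or "").strip()
            return text
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff between retries
```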

2.2 Running lighteval (run_lighteval_eval)

  • Expects MODEL_NAME to be set (will error otherwise).
  • Wraps the model with an OpenAI-style prefix:
    • lite_model="openai/${MODEL_NAME}"
  • Builds MODEL_ARGS for lighteval: model_name=${lite_model},base_url=${base_url},api_key=${OPENAI_API_KEY},generation_parameters={temperature:0.0,top_p:1,max_new_tokens:2048},concurrent_requests=${concurrent_requests}
  • Task specification:
    • TASK_SPEC="${task}|${num_fewshot}"

3. Unified eval entrypoint (run_eval)

run_eval abstracts over frameworks (the dispatch logic is sketched after the list below):

  • Defaults:
    • EVAL_FRAMEWORK=lm-eval
    • EVAL_TASK=gsm8k
  • Runner scripts can override via env or by passing --framework explicitly.
  • All additional arguments (e.g. --port, --concurrent-requests, --results-dir) are forwarded to the underlying framework-specific function.
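
The real run_eval is a shell function in benchmark_lib.sh; the Python sketch below only mirrors its defaulting and dispatch behavior, with the two framework-specific functions stubbed out.

```python
# Mirrors run_eval's dispatch/defaulting; the actual implementation is a
# bash function, and the two callees here are stubs.
import os


def run_lm_eval(**kwargs):
    """Placeholder for the lm-eval wrapper."""


def run_lighteval_eval(**kwargs):
    """Placeholder for the lighteval wrapper."""


def run_eval(**kwargs):
    framework = os.environ.get("EVAL_FRAMEWORK", "lm-eval")
    kwargs.setdefault("task", os.environ.get("EVAL_TASK", "gsm8k"))
    if framework == "lm-eval":
        run_lm_eval(**kwargs)            # extra args (port, results dir, ...) pass through
    elif framework == "lighteval":
        run_lighteval_eval(**kwargs)
    else:
        raise ValueError(f"unknown EVAL_FRAMEWORK: {framework}")
```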

Future Work / Notes

  • Currently the default behavior is unchanged for most users:
    • Evals are off by default (RUN_EVAL=false).
    • Only selected matrix entries (lowest & highest TP per GPU at max concurrency) enable RUN_EVAL=true.
  • The plumbing is now in place to:
    • Add more eval suites (e.g. MMLU, Math, custom internal tasks) via EVAL_TASK and utils/evals/*.
    • Swap or augment frameworks (lm-eval vs lighteval) per job via EVAL_FRAMEWORK.
  • Token count optimizations.

cquil11 avatar Dec 01 '25 00:12 cquil11

@Oseltamivir Can you please fix the merge conflicts so I can review more holistically?

cquil11 avatar Dec 01 '25 14:12 cquil11

📊 Line Count Report

File: utils/matrix-logic/generate_sweep_configs.py

Total Lines: 1036

Base Lines: 968

Change: +68 lines 📈

github-actions[bot] avatar Dec 03 '25 20:12 github-actions[bot]


Reviewing from NV side. Please hold on merging till we finish review.

kedarpotdar-nv avatar Dec 05 '25 21:12 kedarpotdar-nv