
Run lm-eval-harness benchmarks during validation

Open bigximik opened this issue 11 months ago • 4 comments

🎯 Goal (What & Why)

Enable Fast-LLM to run structured evaluations using lm-eval-harness. This allows benchmarking Fast-LLM models across many standard tasks using the in-memory model during validation, leveraging the existing HuggingFace-compatible interface improved in #217.

Note that the current HuggingfaceGPTModelForCausalLM.from_pretrained(...) API always reloads the model from disk. This breaks the intended workflow, where we keep the model sharded and in memory across all GPUs. We want to integrate with lm-eval-harness while reusing the model already in memory, avoiding redundant loading, avoiding eviction, and reducing complexity.

🚀 Execution Plan

Step 1: Add from_existing_model() constructor

Add a new constructor method to HuggingfaceGPTModelForCausalLM that allows wrapping an existing GPTModel instance, e.g.

@classmethod
def from_existing_model(cls, model: GPTModel) -> "HuggingfaceGPTModelForCausalLM":
    config = HuggingfaceGPTModelConfig(fast_llm_config=model.config)
    obj = cls(config)
    obj._fast_llm_model = model
    return obj

Notes:

  • HuggingfaceGPTModelConfig already holds a GPTModelConfig, so the fast_llm_config can be taken directly from the existing GPTModel rather than constructed from scratch.
  • We also need to assign fields like .runner and .schedule on the wrapper, because they'll be used during generation.
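The wrapping pattern can be illustrated with stand-in classes. FakeGPTModel and FakeHFWrapper below are hypothetical stubs, not Fast-LLM APIs; the point is only that the wrapper reuses the live, in-memory object instead of reloading from disk:

```python
# Illustrative sketch of the from_existing_model() pattern using stand-in
# classes; FakeGPTModel and FakeHFWrapper are hypothetical stubs, not
# Fast-LLM APIs. The point: reuse the in-memory model, no disk I/O.
class FakeGPTModel:
    def __init__(self) -> None:
        self.config = {"hidden_size": 64}  # stands in for GPTModelConfig


class FakeHFWrapper:
    def __init__(self, config) -> None:
        self.config = config
        self._fast_llm_model = None
        # Per the notes above, fields used during generation must also be
        # carried over; they are left as placeholders here.
        self.runner = None
        self.schedule = None

    @classmethod
    def from_existing_model(cls, model: "FakeGPTModel") -> "FakeHFWrapper":
        obj = cls(config=model.config)
        obj._fast_llm_model = model  # same object, still sharded in memory
        return obj


model = FakeGPTModel()
wrapped = FakeHFWrapper.from_existing_model(model)
assert wrapped._fast_llm_model is model  # identical object, not a reload
```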

Step 2: Implement a TemplateLM subclass for Fast-LLM

Create a subclass of lm_eval.api.model.TemplateLM that wraps an instance of HuggingfaceGPTModelForCausalLM and provides the required methods:

  • tok_encode()
  • loglikelihood(), loglikelihood_rolling()
  • generate_until()
  • eot_token_id

Use the HuggingFace tokenizer paired with the Fast-LLM model. Assume greedy decoding only; chat templates and SFT-specific tokenization quirks are out of scope for now.
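A minimal shape for this subclass can be sketched as follows. To keep the sketch self-contained, a stub stands in for lm_eval.api.model.TemplateLM (the real base class also implements loglikelihood() on top of token-level scoring and handles request batching), and the toy tokenizer and constant return values are placeholders for the real HF tokenizer and Fast-LLM forward pass:

```python
from abc import ABC, abstractmethod
from typing import List

# Stand-in for lm_eval.api.model.TemplateLM; the real base class lives in
# lm_eval/api/model.py. Everything below is an illustrative stub.
class TemplateLMStub(ABC):
    @property
    @abstractmethod
    def eot_token_id(self) -> int: ...

    @abstractmethod
    def tok_encode(self, string: str) -> List[int]: ...

    @abstractmethod
    def loglikelihood_rolling(self, requests) -> List[float]: ...

    @abstractmethod
    def generate_until(self, requests) -> List[str]: ...


class FastLLMEvalWrapper(TemplateLMStub):
    """Would hold a HuggingfaceGPTModelForCausalLM and its HF tokenizer."""

    EOS_ID = 0  # placeholder; use the tokenizer's real EOS token id

    @property
    def eot_token_id(self) -> int:
        return self.EOS_ID

    def tok_encode(self, string: str) -> List[int]:
        # Toy whitespace "tokenizer"; the real one is the paired HF tokenizer.
        return [len(w) for w in string.split()]

    def loglikelihood_rolling(self, requests) -> List[float]:
        # Real version: sum token log-probs from the model's forward pass.
        return [0.0 for _ in requests]

    def generate_until(self, requests) -> List[str]:
        # Real version: greedy decoding until a stop sequence or EOS.
        return ["" for _ in requests]


lm = FastLLMEvalWrapper()
assert lm.eot_token_id == 0
assert lm.tok_encode("hello world") == [5, 5]
```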

Step 3: Integration test

  • Load a small model like HuggingFaceTB/SmolLM2-135M-Instruct.
  • Wrap the in-memory Fast-LLM model using from_existing_model(...).
  • Use lm_eval.simple_evaluate(...) to run one or more evaluation tasks (e.g., hellaswag, arc_challenge, winogrande).
  • Validate that results match expectations.
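The validation step can be sketched against the nested results shape produced by lm_eval.simple_evaluate ("results" → task name → "metric,filter" keys). Exact metric names vary by task and harness version, and the numbers below are fabricated for illustration:

```python
# Hedged sketch: pulling metrics out of a simple_evaluate(...)-style results
# dict. The nested shape ("results" -> task -> "metric,filter") follows
# lm-eval-harness conventions; the numbers below are made up for illustration.
def numeric_metrics(results: dict) -> dict:
    return {
        task: {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        for task, metrics in results.get("results", {}).items()
    }


fake_results = {
    "results": {
        "hellaswag": {"acc,none": 0.43, "acc_stderr,none": 0.01, "alias": "hellaswag"},
    }
}
summary = numeric_metrics(fake_results)
assert summary["hellaswag"]["acc,none"] == 0.43
assert "alias" not in summary["hellaswag"]  # non-numeric fields dropped
```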

Step 4: Extend Fast-LLM's validation config to support lm-eval-harness tasks

  • Extend the Fast-LLM config to accept a list of generative evaluation tasks using lm-eval-harness.
    • Fields to support:
      • tasks: list of task names (e.g. ["hellaswag", "arc_challenge"])
      • num_fewshot: number of few-shot examples to use per task.
  • Implement logic that:
    • Runs the lm-eval-harness only on global rank 0.
    • Constructs the TemplateLM wrapper for the in-memory Fast-LLM model.
    • Calls simple_evaluate(...) with the configured tasks.
    • Relies on Fast-LLM’s forward() for token-level inference, which is already distributed across GPUs and hosts.
  • Add support for logging results (e.g. to stdout and WandB), and disable lm-eval progress bars, since Fast-LLM typically runs in a headless environment.
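The rank-0 gating above can be sketched as follows. How Fast-LLM exposes the global rank is an assumption here; falling back to the RANK environment variable is a common distributed-launcher convention:

```python
import os
from typing import Callable, Optional

# Hedged sketch of the rank-0 gating described above. How Fast-LLM exposes
# the global rank is an assumption; reading the RANK environment variable is
# a common torch.distributed launcher convention, used here as a fallback.
def run_eval_on_rank_zero(run_eval: Callable[[], dict], rank: Optional[int] = None):
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        # Non-zero ranks skip the harness; they still participate in the
        # distributed forward() calls driven by rank 0's runner/schedule.
        return None
    results = run_eval()
    # Logging (stdout / WandB) would happen here, with progress bars disabled.
    return results


assert run_eval_on_rank_zero(lambda: {"ok": True}, rank=0) == {"ok": True}
assert run_eval_on_rank_zero(lambda: {"ok": True}, rank=1) is None
```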

📌 Acceptance Criteria (Must-Haves for Completion)

  • Must be able to wrap an in-memory GPTModel in a HuggingfaceGPTModelForCausalLM via from_existing_model() without disk I/O.
  • Must implement a subclass of TemplateLM that:
    • Uses Fast-LLM's HuggingFace-compatible model (HuggingfaceGPTModelForCausalLM) for all inference.
    • Implements generate_until, loglikelihood, and loglikelihood_rolling.
    • Uses the correct tokenizer, PAD token ID, and EOS token ID.
  • Must support calling lm_eval.simple_evaluate(...) using the wrapped model and produce correct results.
  • Must extend Fast-LLM's validation/evaluation configuration to support:
    • Specifying lm-eval-harness tasks by name.
    • Setting num_fewshot.
  • Must ensure lm-eval-harness runs only on global rank 0, while model.forward() is transparently distributed using Fast-LLM’s runner logic.
  • Must include:
    • A working test that evaluates at least one lm-eval task on a small model (SmolLM2-135M-Instruct or similar).
    • Logging of evaluation results (stdout and WandB).
  • Implementation must be documented:
    • Configs in docs that show how to run lm-eval's generative benchmarks.

📎 Relevant Links

  • lm-eval-harness interface guide
  • TemplateLM interface:
    https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/model.py#L253
  • Fast-LLM HF model entry point:
    https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/huggingface.py

🛠️ Project Management

  • [x] Assign the project to the Fast-LLM project.
  • [x] Set the Estimate field (in days) in the GitHub project.
  • [x] Use the Size field to categorize the PR size (Small/Medium/Large).
  • [x] Assign an owner when opening the issue.

bigximik avatar Mar 24 '25 12:03 bigximik

https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/api/model.py#L315

https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/models/huggingface.py#L1263

tscholak avatar Apr 01 '25 13:04 tscholak

https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L345

https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L2018C5-L2032C50

tscholak avatar Apr 01 '25 13:04 tscholak

https://github.com/ServiceNow/Fast-LLM/blob/21182c2d152729d3ed8a6d53024acdf76d468d2f/fast_llm/models/gpt/huggingface.py#L24C18-L41C30

tscholak avatar Apr 01 '25 13:04 tscholak

https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

tscholak avatar Apr 01 '25 14:04 tscholak