# Run lm-eval-harness benchmarks during validation

## 🎯 Goal (What & Why)
Enable Fast-LLM to run structured evaluations using lm-eval-harness. This allows benchmarking Fast-LLM models across many standard tasks using the in-memory model during validation, leveraging the existing HuggingFace-compatible interface improved in #217.
Note that the current `HuggingfaceGPTModelForCausalLM.from_pretrained(...)` API always reloads the model from disk. This breaks the intended workflow, where we keep the model sharded and in memory across all GPUs. We want to integrate with lm-eval-harness while reusing the model already in memory, avoiding redundant loading, avoiding eviction, and reducing complexity.
## 🚀 Execution Plan

### Step 1: Add `from_existing_model()` constructor
Add a new constructor method to `HuggingfaceGPTModelForCausalLM` that allows wrapping an existing `GPTModel` instance, e.g.
```python
@classmethod
def from_existing_model(cls, model: GPTModel) -> "HuggingfaceGPTModelForCausalLM":
    # Wrap the already-instantiated (and possibly sharded) model without disk I/O.
    config = HuggingfaceGPTModelConfig(fast_llm_config=model.config)
    obj = cls(config)
    obj._fast_llm_model = model
    return obj
```
Notes:

- `HuggingfaceGPTModelConfig` already holds a `GPTModelConfig`, so there is no need to construct it explicitly if we already have a `GPTModel`.
- We need to assign fields like `.runner` and `.schedule` because they'll be used during generation.
### Step 2: Implement a `TemplateLM` subclass for Fast-LLM

Create a subclass of `lm_eval.api.model.TemplateLM` that wraps an instance of `HuggingfaceGPTModelForCausalLM` and provides the required methods:
- `tok_encode()`
- `loglikelihood()`, `loglikelihood_rolling()`
- `generate_until()`
- `eot_token_id`
Use the HuggingFace tokenizer that pairs with the Fast-LLM model. Assume greedy decoding only. No need to support chat templates or SFT-specific tokenization quirks yet.
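A minimal sketch of the wrapper's shape is below. In the real implementation the class would subclass `lm_eval.api.model.TemplateLM`; it is written standalone here so the interface is visible without lm-eval installed, and the class name and attribute layout are assumptions, not the final design:

```python
# Sketch only: the real class would derive from lm_eval.api.model.TemplateLM.
# Model/tokenizer attribute names are illustrative.
class FastLLMEvalWrapper:
    def __init__(self, hf_model, tokenizer):
        self._model = hf_model        # HuggingfaceGPTModelForCausalLM instance
        self._tokenizer = tokenizer   # the HF tokenizer paired with the model

    @property
    def eot_token_id(self) -> int:
        # lm-eval uses this token to delimit and score sequences.
        return self._tokenizer.eos_token_id

    def tok_encode(self, string: str) -> list[int]:
        # No special tokens: lm-eval concatenates context + continuation itself.
        return self._tokenizer.encode(string, add_special_tokens=False)

    def loglikelihood(self, requests):
        # For each (context, continuation) pair: run the model forward on the
        # concatenated tokens and sum log-probs of the continuation tokens.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Full-document log-likelihood over sliding windows.
        raise NotImplementedError

    def generate_until(self, requests):
        # Greedy decoding only; stop when any of the task's "until" strings appears.
        raise NotImplementedError
```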
### Step 3: Integration test
- Load a small model like `HuggingFaceTB/SmolLM2-135M-Instruct`.
- Wrap the in-memory Fast-LLM model using `from_existing_model(...)`.
- Use `lm_eval.simple_evaluate(...)` to run one or more generative tasks (e.g., `hellaswag`, `arc_challenge`, `winogrande`).
- Validate that results match expectations.
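The test could be sketched as follows, assuming lm-eval is installed and using a hypothetical `FastLLMEvalWrapper` name for the Step 2 wrapper (fixture names are also placeholders):

```python
# Sketch of the integration test; FastLLMEvalWrapper is a hypothetical name
# for the TemplateLM subclass from Step 2.
import lm_eval

def test_lm_eval_smoke(fast_llm_model, tokenizer):
    # Reuse the in-memory model: no from_pretrained(), no disk I/O.
    hf_model = HuggingfaceGPTModelForCausalLM.from_existing_model(fast_llm_model)
    lm = FastLLMEvalWrapper(hf_model, tokenizer)
    results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag"], limit=16)
    # Sanity check: the task produced a metrics dict.
    assert "hellaswag" in results["results"]
```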
### Step 4: Extend Fast-LLM's validation config to support lm-eval-harness tasks
- Extend the Fast-LLM config to accept a list of generative evaluation tasks using lm-eval-harness.
- Fields to support:
    - `tasks`: list of task names (e.g. `["hellaswag", "arc_challenge"]`).
    - `num_fewshot`: number of few-shot examples to use per task.
- Implement logic that:
    - Runs lm-eval-harness only on global rank 0.
    - Constructs the `TemplateLM` wrapper for the in-memory Fast-LLM model.
    - Calls `simple_evaluate(...)` with the configured tasks.
    - Relies on Fast-LLM's `forward()` for token-level inference, which is already distributed across GPUs and hosts.
- Add support for logging results (e.g. to stdout and WandB), and disable lm-eval progress bars because Fast-LLM typically runs in a headless environment.
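For illustration, the config extension could look like the fragment below. The exact nesting and section names are assumptions; only the `tasks` and `num_fewshot` fields are specified by this issue:

```yaml
# Hypothetical config fragment -- the surrounding structure is not defined here.
validation:
  lm_eval:
    tasks: [hellaswag, arc_challenge]
    num_fewshot: 0
```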
## 📌 Acceptance Criteria (Must-Haves for Completion)
- Must be able to wrap an in-memory `GPTModel` in a `HuggingfaceGPTModelForCausalLM` via `from_existing_model()` without disk I/O.
- Must implement a subclass of `TemplateLM` that:
    - Uses Fast-LLM's HuggingFace-compatible model (`HuggingfaceGPTModelForCausalLM`) for all inference.
    - Implements `generate_until`, `loglikelihood`, and `loglikelihood_rolling`.
    - Uses the correct tokenizer, PAD token ID, and EOS token ID.
- Must support calling `lm_eval.simple_evaluate(...)` using the wrapped model and produce correct results.
- Must extend Fast-LLM's validation/evaluation configuration to support:
    - Specifying lm-eval-harness tasks by name.
    - Setting `num_fewshot`.
- Must ensure lm-eval-harness runs only on global rank 0, while `model.forward()` is transparently distributed using Fast-LLM's runner logic.
- Must include:
    - A working test that evaluates at least one lm-eval task on a small model (`SmolLM2-135M-Instruct` or similar).
    - Logging of evaluation results (stdout and WandB).
- Implementation must be documented:
    - Configs in docs that show how to run lm-eval's generative benchmarks.
## 📎 Relevant Links

- lm-eval-harness interface guide
- `TemplateLM` interface: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/model.py#L253
- Fast-LLM HF model entry point: https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/huggingface.py
## 🛠️ Project Management

- [x] Assign the project to the Fast-LLM project.
- [x] Set the `Estimate` field (in days) in the GitHub project.
- [x] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [x] Assign an owner when opening the issue.
Additional references:

- https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/api/model.py#L315
- https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/models/huggingface.py#L1263
- https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L345
- https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L2018C5-L2032C50
- https://github.com/ServiceNow/Fast-LLM/blob/21182c2d152729d3ed8a6d53024acdf76d468d2f/fast_llm/models/gpt/huggingface.py#L24C18-L41C30
- https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct