
Run lm-eval-harness benchmarks during validation

Open bigximik opened this issue 11 months ago • 4 comments

🎯 Goal (What & Why)

Enable Fast-LLM to run structured evaluations using lm-eval-harness. This allows benchmarking Fast-LLM models across many standard tasks using the in-memory model during validation, leveraging the existing HuggingFace-compatible interface improved in #217.

Note that the current HuggingfaceGPTModelForCausalLM.from_pretrained(...) API always reloads the model from disk. This breaks the intended workflow, where we keep the model sharded and in memory across all GPUs. We want to integrate with lm-eval-harness while reusing the model already in memory, avoiding redundant loading, avoiding eviction, and reducing complexity.

🚀 Execution Plan

Step 1: Add from_existing_model() constructor

Add a new constructor method to HuggingfaceGPTModelForCausalLM that allows wrapping an existing GPTModel instance, e.g.

@classmethod
def from_existing_model(cls, model: GPTModel) -> "HuggingfaceGPTModelForCausalLM":
    config = HuggingfaceGPTModelConfig(fast_llm_config=model.config)
    obj = cls(config)
    obj._fast_llm_model = model
    return obj

Notes:

  • HuggingfaceGPTModelConfig already holds a GPTModelConfig, so the fast_llm_config can be taken directly from the existing GPTModel rather than constructed from scratch.
  • We also need to assign fields like .runner and .schedule on the wrapper, because they'll be used during generation.
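The wrapping pattern can be illustrated with stand-in classes. FakeGPTModel and FakeHFWrapper below are hypothetical stubs, not Fast-LLM APIs; the point is only that the wrapper reuses the live, in-memory object instead of reloading from disk:

```python
# Illustrative sketch of the from_existing_model() pattern using stand-in
# classes; FakeGPTModel and FakeHFWrapper are hypothetical stubs, not
# Fast-LLM APIs. The point: reuse the in-memory model, no disk I/O.
class FakeGPTModel:
    def __init__(self) -> None:
        self.config = {"hidden_size": 64}  # stands in for GPTModelConfig


class FakeHFWrapper:
    def __init__(self, config) -> None:
        self.config = config
        self._fast_llm_model = None
        # Per the notes above, fields used during generation must also be
        # carried over; they are left as placeholders here.
        self.runner = None
        self.schedule = None

    @classmethod
    def from_existing_model(cls, model: "FakeGPTModel") -> "FakeHFWrapper":
        obj = cls(config=model.config)
        obj._fast_llm_model = model  # same object, still sharded in memory
        return obj


model = FakeGPTModel()
wrapped = FakeHFWrapper.from_existing_model(model)
assert wrapped._fast_llm_model is model  # identical object, not a reload
```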

Step 2: Implement a TemplateLM subclass for Fast-LLM

Create a subclass of lm_eval.api.model.TemplateLM that wraps an instance of HuggingfaceGPTModelForCausalLM and provides the required methods:

  • tok_encode()
  • loglikelihood(), loglikelihood_rolling()
  • generate_until()
  • eot_token_id

Use the HuggingFace tokenizer paired with the Fast-LLM model. Assume greedy decoding only; chat templates and SFT-specific tokenization quirks are out of scope for now.
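A minimal shape for this subclass can be sketched as follows. To keep the sketch self-contained, a stub stands in for lm_eval.api.model.TemplateLM (the real base class also implements loglikelihood() on top of token-level scoring and handles request batching), and the toy tokenizer and constant return values are placeholders for the real HF tokenizer and Fast-LLM forward pass:

```python
from abc import ABC, abstractmethod
from typing import List

# Stand-in for lm_eval.api.model.TemplateLM; the real base class lives in
# lm_eval/api/model.py. Everything below is an illustrative stub.
class TemplateLMStub(ABC):
    @property
    @abstractmethod
    def eot_token_id(self) -> int: ...

    @abstractmethod
    def tok_encode(self, string: str) -> List[int]: ...

    @abstractmethod
    def loglikelihood_rolling(self, requests) -> List[float]: ...

    @abstractmethod
    def generate_until(self, requests) -> List[str]: ...


class FastLLMEvalWrapper(TemplateLMStub):
    """Would hold a HuggingfaceGPTModelForCausalLM and its HF tokenizer."""

    EOS_ID = 0  # placeholder; use the tokenizer's real EOS token id

    @property
    def eot_token_id(self) -> int:
        return self.EOS_ID

    def tok_encode(self, string: str) -> List[int]:
        # Toy whitespace "tokenizer"; the real one is the paired HF tokenizer.
        return [len(w) for w in string.split()]

    def loglikelihood_rolling(self, requests) -> List[float]:
        # Real version: sum token log-probs from the model's forward pass.
        return [0.0 for _ in requests]

    def generate_until(self, requests) -> List[str]:
        # Real version: greedy decoding until a stop sequence or EOS.
        return ["" for _ in requests]


lm = FastLLMEvalWrapper()
assert lm.eot_token_id == 0
assert lm.tok_encode("hello world") == [5, 5]
```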

Step 3: Integration test

  • Load a small model like HuggingFaceTB/SmolLM2-135M-Instruct.
  • Wrap the in-memory Fast-LLM model using from_existing_model(...).
  • Use lm_eval.simple_evaluate(...) to run one or more evaluation tasks (e.g., hellaswag, arc_challenge, winogrande).
  • Validate that results match expectations.
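The validation step can be sketched against the nested results shape produced by lm_eval.simple_evaluate ("results" → task name → "metric,filter" keys). Exact metric names vary by task and harness version, and the numbers below are fabricated for illustration:

```python
# Hedged sketch: pulling metrics out of a simple_evaluate(...)-style results
# dict. The nested shape ("results" -> task -> "metric,filter") follows
# lm-eval-harness conventions; the numbers below are made up for illustration.
def numeric_metrics(results: dict) -> dict:
    return {
        task: {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        for task, metrics in results.get("results", {}).items()
    }


fake_results = {
    "results": {
        "hellaswag": {"acc,none": 0.43, "acc_stderr,none": 0.01, "alias": "hellaswag"},
    }
}
summary = numeric_metrics(fake_results)
assert summary["hellaswag"]["acc,none"] == 0.43
assert "alias" not in summary["hellaswag"]  # non-numeric fields dropped
```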

Step 4: Extend Fast-LLM's validation config to support lm-eval-harness tasks

  • Extend the Fast-LLM config to accept a list of generative evaluation tasks using lm-eval-harness.
    • Fields to support:
      • tasks: list of task names (e.g. ["hellaswag", "arc_challenge"])
      • num_fewshot: number of few-shot examples to use per task.
  • Implement logic that:
    • Runs the lm-eval-harness only on global rank 0.
    • Constructs the TemplateLM wrapper for the in-memory Fast-LLM model.
    • Calls simple_evaluate(...) with the configured tasks.
    • Relies on Fast-LLM’s forward() for token-level inference, which is already distributed across GPUs and hosts.
  • Add support for logging results (e.g. to stdout and WandB), and disable lm-eval progress bars, since Fast-LLM typically runs in a headless environment.
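The rank-0 gating above can be sketched as follows. How Fast-LLM exposes the global rank is an assumption here; falling back to the RANK environment variable is a common distributed-launcher convention:

```python
import os
from typing import Callable, Optional

# Hedged sketch of the rank-0 gating described above. How Fast-LLM exposes
# the global rank is an assumption; reading the RANK environment variable is
# a common torch.distributed launcher convention, used here as a fallback.
def run_eval_on_rank_zero(run_eval: Callable[[], dict], rank: Optional[int] = None):
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        # Non-zero ranks skip the harness; they still participate in the
        # distributed forward() calls driven by rank 0's runner/schedule.
        return None
    results = run_eval()
    # Logging (stdout / WandB) would happen here, with progress bars disabled.
    return results


assert run_eval_on_rank_zero(lambda: {"ok": True}, rank=0) == {"ok": True}
assert run_eval_on_rank_zero(lambda: {"ok": True}, rank=1) is None
```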

📌 Acceptance Criteria (Must-Haves for Completion)

  • Must be able to wrap an in-memory GPTModel in a HuggingfaceGPTModelForCausalLM via from_existing_model() without disk I/O.
  • Must implement a subclass of TemplateLM that:
    • Uses Fast-LLM's HuggingFace-compatible model (HuggingfaceGPTModelForCausalLM) for all inference.
    • Implements generate_until, loglikelihood, and loglikelihood_rolling.
    • Uses the correct tokenizer, PAD token ID, and EOS token ID.
  • Must support calling lm_eval.simple_evaluate(...) using the wrapped model and produce correct results.
  • Must extend Fast-LLM's validation/evaluation configuration to support:
    • Specifying lm-eval-harness tasks by name.
    • Setting num_fewshot.
  • Must ensure lm-eval-harness runs only on global rank 0, while model.forward() is transparently distributed using Fast-LLM’s runner logic.
  • Must include:
    • A working test that evaluates at least one lm-eval task on a small model (SmolLM2-135M-Instruct or similar).
    • Logging of evaluation results (stdout and WandB).
  • Implementation must be documented:
    • Configs in docs that show how to run lm-eval's generative benchmarks.

📎 Relevant Links

  • lm-eval-harness interface guide
  • TemplateLM interface:
    https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/model.py#L253
  • Fast-LLM HF model entry point:
    https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/huggingface.py

🛠️ Project Management

  • [x] Assign the project to the Fast-LLM project.
  • [x] Set the Estimate field (in days) in the GitHub project.
  • [x] Use the Size field to categorize the PR size (Small/Medium/Large).
  • [x] Assign an owner when opening the issue.

bigximik avatar Mar 24 '25 12:03 bigximik

https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/api/model.py#L315

https://github.com/EleutherAI/lm-evaluation-harness/blob/773dcd7f8fe95ae7ae73c687dec0a4dc1a6174b9/lm_eval/models/huggingface.py#L1263

tscholak avatar Apr 01 '25 13:04 tscholak

https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L345

https://github.com/huggingface/transformers/blob/24e311f42b54f5f5fab6efcaa0c82eebd5608ba3/src/transformers/generation/utils.py#L2018C5-L2032C50

tscholak avatar Apr 01 '25 13:04 tscholak

https://github.com/ServiceNow/Fast-LLM/blob/21182c2d152729d3ed8a6d53024acdf76d468d2f/fast_llm/models/gpt/huggingface.py#L24C18-L41C30

tscholak avatar Apr 01 '25 13:04 tscholak

https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct

tscholak avatar Apr 01 '25 14:04 tscholak