Sandbox for implementation of `generate` and integration of `lm_eval` (evaluation harness)
## ✨ Description

This draft PR will be split into 3 PRs.
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)
## 📝 Changes
List the key changes introduced in this PR:
- Change A
- Change B
## ✅ Checklist

Make sure the following tasks are completed before submitting the PR:

### General
- [ ] 📜 I have read and followed the contributing guidelines.
- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
- [ ] 🎉 The functionality is complete, and I have tested the changes.
- [ ] 📝 I have updated the documentation if needed.
- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
### Dependencies and Configuration
- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
### Testing
- [ ] 🧪 I have added or updated tests to cover my changes.
- [ ] ✔️ New and existing tests pass locally with my changes.
- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
### Performance Impact
- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
- [ ] ✅ The benchmarks show no performance regression.
- [ ] 🚀 The benchmarks indicate a potential performance improvement.
- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
## 📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
## 🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.
I have created a debugging sandbox with manual tests for now. The results are as follows:
Ignoring `attention_mask` and `position_ids`:
| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |
Converting `attention_mask` (from the HF `forward`) to `sequence_lengths`:
| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Fast-LLM empty output | Fast-LLM empty output | Different output |
| 2 | Fast-LLM empty output | Fast-LLM empty output | Different output |
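The per-configuration comparison behind these tables can be sketched as a simple equality check. The callables below are hypothetical stand-ins for the HF and Fast-LLM generation calls, not actual APIs from either library:

```python
import torch


def compare_outputs(hf_generate, fastllm_generate, input_ids):
    # hf_generate / fastllm_generate are stand-ins for the two models'
    # generation calls; both are expected to return token-id tensors.
    hf_out = hf_generate(input_ids)
    fl_out = fastllm_generate(input_ids)
    if hf_out.shape == fl_out.shape and torch.equal(hf_out, fl_out):
        return "Same output"
    return "Different output"
```

Each cell in the tables above is the result of one such check for a given batch size, dtype, and attention backend.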
It seems `sequence_lengths` is not supported for fused attention and does not improve Flash Attention. Could this be correct?
If `attention_mask` is a left-padded mask like this:

```
[[0, 0, 0, 1, 1, 1, 1], ...]
```

I convert it to `sequence_lengths = [[3, 4], ...]` as follows:
```python
import torch

# Index of the first non-zero entry per row (argmax returns 0 if the row
# has no padding, and also 0 for an all-zero, i.e. invalid, row).
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that the mask is left-padded: after the first 1, all remaining
# entries must be a contiguous run of 1s. This also rejects all-zero rows.
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

# A row with `el` padding tokens becomes [el, seq_len - el];
# a row with no padding becomes [seq_len].
sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
```
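For illustration, the same conversion can be wrapped in a small helper and applied to a toy batch. The function name `mask_to_sequence_lengths` is mine, not from this PR:

```python
import torch


def mask_to_sequence_lengths(attention_mask: torch.Tensor) -> list[torch.Tensor]:
    # Hypothetical helper wrapping the conversion shown above.
    first_non_zero = attention_mask.argmax(dim=1)
    # Verify the mask is left-padded (a contiguous run of 1s after the first 1).
    assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero)).all()
    seq_len = attention_mask.shape[1]
    return [
        torch.tensor([seq_len] if el == 0 else [el, seq_len - el], dtype=torch.int64)
        for el in first_non_zero.tolist()
    ]


# A batch with one left-padded row (3 padding tokens) and one unpadded row.
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]])
print([t.tolist() for t in mask_to_sequence_lengths(mask)])  # [[3, 4], [7]]
```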
@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to do the right things internally, such as building the mask and position IDs, but the outputs still do not match. Could you please comment on what might be wrong? Thanks!
Can we please break down this PR? Otherwise it will make reviewing too difficult. Let's keep this one about the minimalistic `generate`, and move the rest to the next PR.
Sure, eventually we can do that. @bigximik is currently iterating towards an end-to-end solution for running benchmarks, and he's solving issues as they arise. It makes sense for him to operate that way for the time being, but when the time comes to review the changes, we should separate the concerns.
@jlamypoirier, btw, we need your guidance in determining the best way to distribute generation across ranks. Concretely, we are looking to implement this lm-eval-harness API:
```python
@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
    """Generate greedily until a stopping sequence.

    :param requests: list[Instance]
        A list of Instance objects with property `args` which returns a tuple (context, gen_kwargs).
        context: str
            Context string.
        gen_kwargs: dict
            A dictionary of keyword arguments to pass to the generation function, e.g. top_k, until, etc.
    :return: list[str]
        A list of model generated continuations.
        continuation: str
            The generated continuation.
    """
    pass
```
where `generate_until(requests: list[Instance], ...)` is called from rank 0 and should distribute the `Instance`s across ranks, calling the Fast-LLM model's `generate(inputs: torch.Tensor, ...)`. An `Instance` is a prompt with fluff: https://github.com/EleutherAI/lm-evaluation-harness/blob/e4a7b69fe0fc6cb430e12cf15c4109bf28185124/lm_eval/api/instance.py#L11.
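One possible shape for that distribution, sketched with `torch.distributed` object collectives. The helper names and the round-robin sharding are assumptions on my part, not a settled design:

```python
import torch.distributed as dist


def shard(requests, world_size):
    # Round-robin split of the request list across ranks.
    return [requests[i::world_size] for i in range(world_size)]


def merge(shards, total):
    # Inverse of shard(): restore results to the original request order.
    results = [None] * total
    for rank, outs in enumerate(shards):
        for j, out in enumerate(outs):
            results[rank + j * len(shards)] = out
    return results


def generate_until_distributed(requests, generate_fn):
    # Hypothetical sketch: rank 0 scatters request shards, every rank runs
    # generate_fn (standing in for the Fast-LLM model's generate) on its
    # shard, and rank 0 gathers and reorders the continuations.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shards = shard(requests, world_size) if rank == 0 else None
    local = [None]
    dist.scatter_object_list(local, shards, src=0)
    outputs = [generate_fn(req) for req in local[0]]
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(outputs, gathered, dst=0)
    return merge(gathered, len(requests)) if rank == 0 else None
```

Object collectives pickle their payloads, so this only moves prompt strings and continuations between ranks; the heavy tensor work stays local to each rank's `generate` call.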
### Current State

- Implemented evaluation abstraction and `lm_eval` integration for single GPU.
- Made necessary changes to `generate()`.

### Next Steps

- Refactor the `lm_eval` integration to rely less on moved code.
- Explore the possibility of using a base vLLM integration class instead of Hugging Face for `lm_eval`.
- Implement full distributed model support for the `lm_eval` integration, including necessary changes to support distributed `generate()`.
I’ve finished working on this draft and will create 3 new PRs from it:

- Generate support
- Refactoring of evaluations
- `lm_eval` integration
In addition to the changes here, I’ll be adding tests and documentation updates as needed.
I’ll also be tracking this draft in case further discussion continues here.
Work on this prototype branch has been completed and moved to other feature branches. This PR can be safely closed.