Sandbox for implementation of `generate` and integration of `lm_eval` (evaluation harness)
## ✨ Description

This draft PR will be split into 3 PRs.
## 🔍 Type of change
Select all that apply:
- [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
- [ ] 🚀 New feature (non-breaking change that adds functionality)
- [ ] ⚠️ Breaking change (a change that could affect existing functionality)
- [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
- [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
- [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
- [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
- [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)
## 📝 Changes
List the key changes introduced in this PR:
- Change A
- Change B
## ✅ Checklist

Make sure the following tasks are completed before submitting the PR:

### General
- [ ] 📜 I have read and followed the contributing guidelines.
- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
- [ ] 🎉 The functionality is complete, and I have tested the changes.
- [ ] 📝 I have updated the documentation if needed.
- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
- [ ] 🧩 I have commented my code, especially in hard-to-understand areas.
### Dependencies and Configuration
- [ ] 🐋 I have updated the Docker configuration or dependencies, if applicable.
- [ ] 🔄 I have ensured compatibility with the existing setup after dependency changes.
### Testing
- [ ] 🧪 I have added or updated tests to cover my changes.
- [ ] ✔️ New and existing tests pass locally with my changes.
- [ ] 🚦 I have tested these changes on GPUs and verified training stability.
- [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.
### Performance Impact
- [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
- [ ] ✅ The benchmarks show no performance regression.
- [ ] 🚀 The benchmarks indicate a potential performance improvement.
- [ ] ⚠️ The benchmarks indicate a potential performance degradation.
- [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.
## 📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
## 🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.
I have created a debugging sandbox with manual tests for now. The results are as follows:
Ignoring `attention_mask` and `position_ids`:
| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Same output (same model via HF and Fast-LLM) | Same output | Different output |
| 2 | Different output | Different output | Different output |
Converting `attention_mask` (from the HF `forward`) to `sequence_lengths`:
| Batch Size | No Flash Attention (Float32) | No Flash Attention (BF16) | Flash Attention (BF16) |
|---|---|---|---|
| 1 | Fast-LLM empty output | Fast-LLM empty output | Different output |
| 2 | Fast-LLM empty output | Fast-LLM empty output | Different output |
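The per-configuration comparison behind these tables can be sketched as a simple equality check. The callables below are hypothetical stand-ins for the HF and Fast-LLM generation calls, not actual APIs from either library:

```python
import torch


def compare_outputs(hf_generate, fastllm_generate, input_ids):
    # hf_generate / fastllm_generate are stand-ins for the two models'
    # generation calls; both are expected to return token-id tensors.
    hf_out = hf_generate(input_ids)
    fl_out = fastllm_generate(input_ids)
    if hf_out.shape == fl_out.shape and torch.equal(hf_out, fl_out):
        return "Same output"
    return "Different output"
```

Each cell in the tables above is the result of one such check for a given batch size, dtype, and attention backend.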
It seems `sequence_lengths` is not supported for fused attention and does not improve Flash Attention. Could this be correct?
If `attention_mask` is a left-padded mask like this:

```
[[0, 0, 0, 1, 1, 1, 1], ...]
```

I convert it to `sequence_lengths = [[3, 4], ...]` as follows:
```python
import torch

# Index of the first non-zero entry per row (argmax returns 0 if the row
# has no padding, and also 0 for an all-zero, i.e. invalid, row).
first_non_zero_indexes = attention_mask.argmax(dim=1)

# Check that the mask is left-padded: after the first 1, all remaining
# entries must be a contiguous run of 1s. This also rejects all-zero rows.
assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero_indexes)).all()

# A row with `el` padding tokens becomes [el, seq_len - el];
# a row with no padding becomes [seq_len].
sequence_lengths = [
    torch.tensor(
        [attention_mask.shape[1]] if el == 0 else [el, attention_mask.shape[1] - el], dtype=torch.int64
    )
    for el in first_non_zero_indexes.tolist()
]
```
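For illustration, the same conversion can be wrapped in a small helper and applied to a toy batch. The function name `mask_to_sequence_lengths` is mine, not from this PR:

```python
import torch


def mask_to_sequence_lengths(attention_mask: torch.Tensor) -> list[torch.Tensor]:
    # Hypothetical helper wrapping the conversion shown above.
    first_non_zero = attention_mask.argmax(dim=1)
    # Verify the mask is left-padded (a contiguous run of 1s after the first 1).
    assert (attention_mask.sum(dim=1) == (attention_mask.shape[1] - first_non_zero)).all()
    seq_len = attention_mask.shape[1]
    return [
        torch.tensor([seq_len] if el == 0 else [el, seq_len - el], dtype=torch.int64)
        for el in first_non_zero.tolist()
    ]


# A batch with one left-padded row (3 padding tokens) and one unpadded row.
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]])
print([t.tolist() for t in mask_to_sequence_lengths(mask)])  # [[3, 4], [7]]
```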
@sohamparikh @jlamypoirier Hi, I am trying to use the cross-document attention prevention that @tscholak pointed me to (https://github.com/ServiceNow/Fast-LLM/pull/177/files) to mimic left padding for documents in a batch during generation. It appears to do the right things internally, such as building the mask and position IDs, but the outputs still do not match. Could you please comment on what might be wrong? Thanks!
Can we please break down this PR? Otherwise it will make reviewing too difficult. Let's keep this one about the minimalistic `generate`, and move the rest to the next PR.
Sure, eventually we can do that. @bigximik is currently iterating towards an end-to-end solution for running benchmarks, and he's solving issues as they arise. It makes sense for him to operate that way for the time being, but when the time comes to review the changes, we should separate the concerns.
@jlamypoirier, btw, we need your guidance in determining the best way to distribute generation across ranks. Concretely, we are looking to implement this lm-eval-harness API:
```python
@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
    """Generate greedily until a stopping sequence.

    :param requests: list[Instance]
        A list of Instance objects with property `args` which returns a tuple (context, gen_kwargs).
        context: str
            Context string.
        gen_kwargs: dict
            A dictionary of keyword arguments to pass to the generation function, e.g. top_k, until, etc.
    :return: list[str]
        A list of model generated continuations.
        continuation: str
            The generated continuation.
    """
    pass
```
where `generate_until(requests: list[Instance], ...)` is called from rank 0 and should distribute the `Instance`s across ranks, calling the Fast-LLM model's `generate(inputs: torch.Tensor, ...)`. An `Instance` is a prompt with fluff: https://github.com/EleutherAI/lm-evaluation-harness/blob/e4a7b69fe0fc6cb430e12cf15c4109bf28185124/lm_eval/api/instance.py#L11.
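One possible shape for that distribution, sketched with `torch.distributed` object collectives. The helper names and the round-robin sharding are assumptions on my part, not a settled design:

```python
import torch.distributed as dist


def shard(requests, world_size):
    # Round-robin split of the request list across ranks.
    return [requests[i::world_size] for i in range(world_size)]


def merge(shards, total):
    # Inverse of shard(): restore results to the original request order.
    results = [None] * total
    for rank, outs in enumerate(shards):
        for j, out in enumerate(outs):
            results[rank + j * len(shards)] = out
    return results


def generate_until_distributed(requests, generate_fn):
    # Hypothetical sketch: rank 0 scatters request shards, every rank runs
    # generate_fn (standing in for the Fast-LLM model's generate) on its
    # shard, and rank 0 gathers and reorders the continuations.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shards = shard(requests, world_size) if rank == 0 else None
    local = [None]
    dist.scatter_object_list(local, shards, src=0)
    outputs = [generate_fn(req) for req in local[0]]
    gathered = [None] * world_size if rank == 0 else None
    dist.gather_object(outputs, gathered, dst=0)
    return merge(gathered, len(requests)) if rank == 0 else None
```

Object collectives pickle their payloads, so this only moves prompt strings and continuations between ranks; the heavy tensor work stays local to each rank's `generate` call.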
### Current State

- Implemented evaluation abstraction and `lm_eval` integration for single GPU.
- Made necessary changes to `generate()`.

### Next Steps

- Refactor the `lm_eval` integration to rely less on moved code.
- Explore the possibility of using a base vLLM integration class instead of Hugging Face for `lm_eval`.
- Implement full distributed model support for the `lm_eval` integration, including necessary changes to support distributed `generate()`.
I’ve finished working on this draft and will create 3 new PRs from it:

- Generate support
- Refactoring of evaluations
- `lm_eval` integration
In addition to the changes here, I’ll be adding tests and documentation updates as needed.
I’ll also be tracking this draft in case further discussion continues here.
Work on this prototype branch has been completed and moved to other feature branches. This PR can be safely closed.