Joel Lamy-Poirier
Some benchmarking results, comparing several implementations:

1. `flash`: `flash_santacoder`, the current implementation.
2. `causal`: The `gpt_bigcode` model from HF transformers, run with `causal_lm`.
3. `vector`: The `gpt_bigcode` model from HF...
### Starcoder decode

* Similar to Santacoder, but `flash` is already inefficient at a batch size of 1, often even worse than `causal`.
* Latency for small batch sizes is...
> @jlamypoirier Thanks for great investigation.

Add support for cuda graphs, at least for decode. I already showed them to work with dynamic shapes (using a lot of graphs), and...
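The "lot of graphs" approach for dynamic shapes amounts to capturing one graph per shape bucket and replaying the cached one on subsequent calls. A minimal sketch of that caching pattern (with a placeholder standing in for the actual `torch.cuda.CUDAGraph` capture):

```python
graphs = {}

def capture_graph(batch_size):
    # Stand-in for capturing a CUDA graph at a fixed shape
    # (the real code would use torch.cuda.CUDAGraph / torch.cuda.graph).
    return ("captured_graph", batch_size)

def run_decode(batch_size):
    # Capture lazily the first time a batch size is seen, then replay.
    if batch_size not in graphs:
        graphs[batch_size] = capture_graph(batch_size)
    return graphs[batch_size]

# The second call with the same shape reuses the cached graph.
assert run_decode(4) is run_decode(4)
```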
> @jlamypoirier Amazing reports !! May I ask does sequence length indicate max_new_token? I got pretty high latency (about 4s) for starcoder when I set max_new_token=128

It's the time to...
We'd have to find out where the time is being spent; I suspect tokenization. It's a Hugging Face thing, so we don't have much control over it, but it seems...
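Finding out where the time goes can be as simple as timing each stage separately. A rough sketch with a stand-in function (in practice one would time `tokenizer(prompt)` and the generation call independently; the names here are illustrative):

```python
import time

def average_time(fn, *args, repeats=100):
    # Average wall-clock time of fn over several runs.
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Stand-in for the real tokenizer call; swap in the actual
# Hugging Face tokenizer to measure the real pipeline.
def tokenize(text):
    return text.split()

tokenize_avg = average_time(tokenize, "def hello_world(): pass")
assert tokenize_avg >= 0.0
```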
> `fast-llm type=GPTTrainer` is principled (because it taps into the override logic) but ugly (because spelling out `type=` is mandatory and because it's using class names as values). I think...
@tscholak I started working on more dynamic classes and realized user-friendly names are essential. So I implemented a simple solution where each class can have its own registry, and the...
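A per-class registry keyed by user-friendly names could look roughly like this (a minimal sketch; the decorator API and names are assumptions, not the actual Fast-LLM code):

```python
class Registry:
    """Maps user-friendly names to classes (one registry per base class)."""

    def __init__(self):
        self._classes = {}

    def register(self, name):
        # Decorator that records a class under a short, readable name.
        def decorator(cls):
            self._classes[name] = cls
            return cls
        return decorator

    def get(self, name):
        return self._classes[name]

trainer_registry = Registry()

@trainer_registry.register("gpt")
class GPTTrainer:
    pass

# A config value like `type=gpt` can now resolve to the class,
# instead of spelling out the class name itself.
assert trainer_registry.get("gpt") is GPTTrainer
```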
This is a Triton bug; our implementation of dropless MLP might not be able to handle that many experts. Fixing this will need an in-depth investigation and some implementation work...
Can we please break this PR down? Otherwise it will be too difficult to review. Let's keep this one about the minimalistic `generate`, and move the rest to the next PR.
AFAIK, all checks that can be done during validation are done there. But some of them can't really be done during validation because of missing information. The most important category...
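Splitting checks between config validation and a later point where the missing information becomes available might look roughly like this (a hypothetical sketch; the class and method names are illustrative, not the actual code):

```python
class BatchConfig:
    def __init__(self, batch_size, num_gpus=None):
        self.batch_size = batch_size
        self.num_gpus = num_gpus  # may only be known once the run starts

    def validate(self):
        # Checks that need only the config itself run at validation time.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")

    def check_runtime(self, num_gpus):
        # Checks that need runtime information have to run later.
        if self.batch_size % num_gpus != 0:
            raise ValueError("batch_size must divide evenly across GPUs")

config = BatchConfig(batch_size=8)
config.validate()            # static check passes
config.check_runtime(num_gpus=4)  # deferred check passes: 8 % 4 == 0
```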