LLM-Training-Puzzles

What would you do with 1000 H100s...

Results: 3 LLM-Training-Puzzles issues

For some of the puzzles, the memory (and time) targets are impossible to satisfy. E.g., for puzzle 0, I spent some time trying to get from `2621440` to under the...

Hi, this might be a minor thing, but I'm wondering: in distributed data parallel, when we aggregate `grad_weights` from all machines using `model.allgather`, since `allgather` performs a `sum` operation, shouldn't we...