Tianyang Lin comments

How are the 1B and 7B models initialized?

I went through the [config](https://github.com/allenai/OLMo/tree/main/configs) files and found that for the [official-0724](https://github.com/allenai/OLMo/tree/main/configs/official-0724) release the weights were initialized using the `mitchell` method, while for the [official-1124](https://github.com/allenai/OLMo/tree/main/configs/official-1124) release the weights were initialized with a truncated normal....

Test spec of Puzzle 9 seems to mismatch the problem setting?

Btw, `B={"B0":200}` seems to be problematic too, since we usually use `tl.arange(0, B0)` in the kernel to calculate offsets, and `tl.arange` only accepts ranges whose size is a power of 2.
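One common workaround, sketched here in plain Python rather than Triton so it runs anywhere, is to round the block size up to the next power of 2 and mask off the out-of-range offsets. The helper name `next_power_of_2` and the variable names below are illustrative assumptions, not taken from the puzzle code.

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


B0 = 200                      # block size from the test spec, not a power of 2
BLOCK = next_power_of_2(B0)   # padded size that an arange-style call could accept

# Emulate the offsets a kernel would compute, plus the mask that keeps
# loads and stores inside the real range [0, B0).
offsets = list(range(BLOCK))
mask = [o < B0 for o in offsets]
valid = sum(mask)             # number of lanes that touch real data
```

In an actual kernel the same idea would pair the padded range with a mask on the loads and stores, so the extra lanes do no work on real data.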

Performance issue

> tgs Because tgs is tokens **per GPU** per second
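To make the per-GPU normalization concrete, here is a minimal sketch of the arithmetic. The function name and the example numbers (batch size, sequence length, step time, GPU count) are hypothetical and only illustrate the definition above.

```python
def tokens_per_gpu_per_second(tokens_per_step: int,
                              step_time_s: float,
                              num_gpus: int) -> float:
    """Throughput normalized by GPU count: tokens / (seconds * GPUs)."""
    return tokens_per_step / (step_time_s * num_gpus)


# Hypothetical run: 1024 sequences of 4096 tokens per step,
# 8 seconds per step, 64 GPUs.
tokens_per_step = 1024 * 4096
tgs = tokens_per_gpu_per_second(tokens_per_step, step_time_s=8.0, num_gpus=64)
```

Because the GPU count is in the denominator, adding GPUs raises total throughput but leaves tgs flat (or lower, once communication overhead grows).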

Incomplete implementation of SparseGEMV

They mentioned in A4 that a single-batch setting is used. That said, I don't think it's appropriate to compare against 2:4 sparsity here, as 2:4 sparsity is not well suited to small-batch matmuls.