Tianyang Lin
I went through the [config](https://github.com/allenai/OLMo/tree/main/configs) files and found that for the [official-0724](https://github.com/allenai/OLMo/tree/main/configs/official-0724) release the weights were initialized using the `mitchell` method, while for the [official-1124](https://github.com/allenai/OLMo/tree/main/configs/official-1124) release the weights were initialized with a truncated normal....
Btw, `B={"B0":200}` seems to be problematic too: we usually use `tl.arange(0, B0)` in the kernel to calculate offsets, and `tl.arange` only accepts ranges whose size is a power of 2, which 200 is not.
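A common workaround is to round the block size up to the next power of 2 and mask off the extra lanes. Here is a minimal pure-Python sketch of that idea. `next_power_of_2`, `BLOCK`, `offsets` and `mask` are names I made up for illustration; in a real kernel `offsets` would be `tl.arange(0, BLOCK)` and `mask` would be passed to `tl.load` / `tl.store`.

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


B0 = 200                       # the non-power-of-2 size from the config
BLOCK = next_power_of_2(B0)    # padded size that a range like tl.arange would accept

# Plain-Python stand-ins for what the kernel would compute:
offsets = list(range(BLOCK))            # plays the role of tl.arange(0, BLOCK)
mask = [off < B0 for off in offsets]    # plays the role of `offsets < B0`

# Only the first B0 lanes are valid; the padded tail must be masked out
# on every load and store.
valid_lanes = sum(mask)
```

With `B0 = 200` the padded block is 256 lanes wide, so 56 lanes are masked. Whether that padding is acceptable depends on the kernel; keeping `B0` itself a power of 2 avoids the issue entirely.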
> tgs, because tgs is tokens **per gpu** per second
They mentioned in A4 that a single-batch setting is used. That said, I don't think it's appropriate to compare against 2:4 sparsity here, as 2:4 sparsity is not well suited to small-batch matmuls.