Tianyang Lin

4 comments by Tianyang Lin

I went through the [config](https://github.com/allenai/OLMo/tree/main/configs) files and found that for the [official-0724](https://github.com/allenai/OLMo/tree/main/configs/official-0724) release the weights were initialized using the `mitchell` method, while for the [official-1124](https://github.com/allenai/OLMo/tree/main/configs/official-1124) release the weights were initialized with a truncated normal....

Btw, `B={"B0":200}` seems to be problematic too, since we usually use `tl.arange(0, B0)` in the kernel to calculate offsets, and `tl.arange` only accepts a range whose size is a power of 2 (200 is not).
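One common way around this is to round the block size up to the next power of 2 and mask off the extra lanes. The snippet below is only a plain-Python sketch of that idea, not code from the kernel in question: `next_power_of_2` and `padded_offsets` are hypothetical helpers, and the lists stand in for what `tl.arange(0, BLOCK)` and `offsets < B0` would produce inside a Triton kernel.

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def padded_offsets(b0: int):
    """Emulate padded offsets plus a validity mask for a block of size b0.

    Hypothetical helper: the list of offsets plays the role of
    tl.arange(0, block), and the mask plays the role of `offsets < b0`.
    """
    block = next_power_of_2(b0)           # e.g. 200 -> 256
    offsets = list(range(block))          # stand-in for tl.arange(0, block)
    mask = [i < b0 for i in offsets]      # only the first b0 lanes are valid
    return block, offsets, mask


block, offsets, mask = padded_offsets(200)
print(block, sum(mask))
```

In an actual kernel the mask would typically be passed to `tl.load` / `tl.store` through their `mask=` argument, so the padded lanes never touch memory. If I remember correctly, Triton also ships a `triton.next_power_of_2` helper for the rounding step.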

> tgs Because tgs is tokens **per GPU** per second
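To make the per-GPU normalization concrete, here is a small arithmetic sketch. The function name and all the numbers are made up for illustration and are not taken from the thread.

```python
def tokens_per_gpu_per_second(total_tokens: int, num_gpus: int, seconds: float) -> float:
    """Throughput normalized by GPU count (hypothetical helper)."""
    if num_gpus <= 0 or seconds <= 0:
        raise ValueError("num_gpus and seconds must be positive")
    return total_tokens / (num_gpus * seconds)


# Illustrative numbers only: 4M tokens processed by 8 GPUs in 10 seconds.
tgs = tokens_per_gpu_per_second(total_tokens=4_000_000, num_gpus=8, seconds=10.0)
print(tgs)  # 50000.0
```

Because of the division by `num_gpus`, tgs stays roughly flat as GPUs are added if scaling is efficient, whereas the aggregate tokens per second keeps growing.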

They mentioned in A4 that a single-batch setting is used. That said, I don't think it's appropriate to compare 2:4 sparsity here, as 2:4 sparsity is not well suited to small-batch matmuls.