Samuel Kriman

Results: 2 issues by Samuel Kriman

I have been trying to replicate the results from the paper, but I'm confused about the number of training steps. The paper mentions 240k steps, but when running this code...

It seems that in this implementation you are only adding the "sink" token to the cache, and not using it in the original forward pass, so if you are using windowed...
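To make the concern concrete, here is a minimal sketch of what "using the sink token in the forward pass" could look like with windowed attention. It is not taken from the repository in question: the function name `window_mask_with_sink` and the parameters `window` and `num_sink` are hypothetical, and the sketch assumes a causal sliding-window mask in which the first `num_sink` positions stay visible to every later query.

```python
def window_mask_with_sink(seq_len, window, num_sink=1):
    """Build a boolean attention mask (hypothetical helper, for illustration).

    mask[i][j] is True when query position i may attend to key position j.
    A key is visible if it is causal (j <= i) and either falls inside the
    sliding window of size `window` or is one of the first `num_sink` tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = (i - j) < window
            is_sink = j < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Compare the last query row with and without the sink position.
    with_sink = window_mask_with_sink(6, window=2, num_sink=1)
    without_sink = window_mask_with_sink(6, window=2, num_sink=0)
    print(with_sink[5])
    print(without_sink[5])
```

With `num_sink=0` the mask reduces to plain windowed attention, which is the situation the issue describes: the sink position only becomes visible if it is added somewhere else, such as the cache at inference time.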