MarkYang
@hxs91 My hypothesis is that FoT uses a training strategy similar to the Recurrent Memory Transformer: if you want to train a local context of 2k with 4 segments, you...
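(For context, here is a rough sketch of how an RMT-style segmented pass typically works with a 2k local context and 4 segments. This is only an illustration, not the FoT or RMT code; the memory size, the zero-initialized memory, and the `model(seg_ids, memory=...)` interface are all assumptions.)

```python
# Hypothetical RMT-style segmented training step: an 8k sequence is processed
# as 4 segments of 2k, with a memory tensor carried across segments so that
# gradients flow through the whole sequence (BPTT over segments).
import torch
import torch.nn.functional as F

SEG_LEN, N_SEGMENTS, MEM_TOKENS, D_MODEL = 2048, 4, 16, 1024

def segmented_step(model, input_ids, labels):
    batch = input_ids.size(0)
    # Initial memory; zeros here purely for illustration (RMT uses learned memory tokens).
    memory = torch.zeros(batch, MEM_TOKENS, D_MODEL, device=input_ids.device)
    total_loss = 0.0
    for s in range(N_SEGMENTS):
        seg_ids = input_ids[:, s * SEG_LEN:(s + 1) * SEG_LEN]
        seg_labels = labels[:, s * SEG_LEN:(s + 1) * SEG_LEN]
        # Assumed model signature: returns logits for the segment and the updated memory.
        logits, memory = model(seg_ids, memory=memory)
        total_loss = total_loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), seg_labels.reshape(-1)
        )
    return total_loss / N_SEGMENTS
```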
@JasonZhu1313 Can you share your config for running flex attention on GSM8K?
Hi @jiaqiw09, I wonder if you ever fixed this issue?
@jiaqiw09 Thanks Jiaqi, I later managed to solve it by increasing TP (tensor parallelism) from 1 to 4, but I'll also try your method.
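(For anyone hitting the same thing: raising TP shards the model weights across more GPUs, which usually relieves per-GPU memory pressure. The thread doesn't name the framework, so this is a minimal sketch assuming the TP setting is exposed through vLLM; if your TP knob lives elsewhere, e.g. a Megatron or DeepSpeed config, the change is analogous.)

```python
# Hypothetical illustration of the TP 1 -> 4 change using vLLM, where tensor
# parallelism is set when the engine is constructed.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model name (assumption)
    tensor_parallel_size=4,            # previously 1
)
```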
@qingyujean Did you eventually fix this problem?