MarkYang

5 comments by MarkYang

@hxs91 My hypothesis is that FoT uses a training strategy similar to the Recurrent Memory Transformer's: if you want to train a local context of 2k with 4 segments, you...
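
If that reading is right, training would unroll a short local window across segments, carrying memory tokens from one segment to the next. Below is a minimal PyTorch sketch of that idea; it is my own illustration, not FoT's or RMT's actual code, and every module name and size in it is made up:

```python
import torch
import torch.nn as nn

class SegmentRecurrentLM(nn.Module):
    """Toy RMT-style wrapper: process one 2k-token segment at a time,
    prepending learned memory tokens and reading updated memory back."""
    def __init__(self, vocab=32000, d_model=256, n_mem=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.mem_init = nn.Parameter(torch.zeros(n_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)
        self.n_mem = n_mem

    def forward(self, tokens):  # tokens: (B, total_len), e.g. total_len = 4 * 2048
        B = tokens.size(0)
        mem = self.mem_init.unsqueeze(0).expand(B, -1, -1)
        logits = []
        # 4 segments of 2k local context -> 8k effective context;
        # gradients flow through `mem` across segments (BPTT).
        for seg in tokens.chunk(4, dim=1):
            x = torch.cat([mem, self.embed(seg)], dim=1)
            h = self.encoder(x)
            mem = h[:, :self.n_mem]            # memory carried to next segment
            logits.append(self.head(h[:, self.n_mem:]))
        return torch.cat(logits, dim=1)
```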

@JasonZhu1313 Can you share your config for running FlexAttention on GSM8K?

@jiaqiw09 Thanks Jiaqi. I later managed to solve it by increasing TP from 1 to 4, but I'll also try your method.
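
The comment doesn't say which framework this was; assuming vLLM-style tensor parallelism, bumping TP from 1 to 4 would look roughly like the sketch below, with the model name as a placeholder:

```python
from vllm import LLM

# Hedged sketch: the original setup isn't shown, so the model name is a
# placeholder. tensor_parallel_size=4 shards the model weights across
# 4 GPUs, cutting per-GPU memory, which is how raising TP can fix OOMs.
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=4)  # was 1
```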

@qingyujean Did you eventually fix this problem?