Zhihua.Liu
Thanks very much for your help; the model now runs fine. However, I ended up with a score of 66.3, which is still lower than the result reported in the paper.
I can also run training end to end, but the current training results are not very good. I'm trying to train further.
You can try moving the memory retrieval step to after `apply_rotary_pos_emb` and compare the training performance. I did not try this further myself, though.
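For what it's worth, here is a self-contained toy sketch of that reordering; `retrieve_memory` and the RoPE helper below are simplified stand-ins rather than the actual LWM attention code, so treat it only as an illustration of where the retrieval call would move:

```python
import torch

def apply_rotary_pos_emb(x, cos, sin):
    # Standard rotary embedding: rotate the two halves of the head dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def retrieve_memory(q, mem_k, mem_v):
    # Toy stand-in for memory retrieval: attend over a small memory bank
    # with the (rotated) queries and return the retrieved values.
    scores = torch.softmax(q @ mem_k.transpose(-1, -2) / q.size(-1) ** 0.5, dim=-1)
    return scores @ mem_v

B, T, M, D = 2, 8, 16, 64            # batch, seq len, memory slots, head dim
q = torch.randn(B, T, D)
k = torch.randn(B, T, D)
mem_k, mem_v = torch.randn(B, M, D), torch.randn(B, M, D)

# Precompute RoPE angles for the local sequence positions.
pos = torch.arange(T, dtype=torch.float32)
inv_freq = 1.0 / (10000 ** (torch.arange(0, D // 2, dtype=torch.float32) / (D // 2)))
angles = pos[:, None] * inv_freq[None, :]
cos, sin = angles.cos(), angles.sin()

# Suggested ordering: apply RoPE first, then run memory retrieval with the
# rotated queries (instead of retrieving before apply_rotary_pos_emb).
q_rot = apply_rotary_pos_emb(q, cos, sin)
k_rot = apply_rotary_pos_emb(k, cos, sin)
retrieved = retrieve_memory(q_rot, mem_k, mem_v)
print(retrieved.shape)  # torch.Size([2, 8, 64])
```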
> The results of flash attention are somehow amazing... keep an eye on this.

Thanks, I'll check it.
> And after reading the code, I have found that the ring attention should accept already-chunked qkv instead of the whole qkv. That is, qkv should be split into local...
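As an illustration of the quoted point, here is a minimal sketch of pre-chunking qkv along the sequence dimension before the per-rank call; the `ring_attention` call at the end is a placeholder for whichever implementation is actually used:

```python
import torch

world_size = 4                       # number of devices in the ring
B, T, H, D = 1, 8192, 8, 64          # T is the full (global) sequence length

q = torch.randn(B, T, H, D)
k = torch.randn(B, T, H, D)
v = torch.randn(B, T, H, D)

# Split along the sequence dimension; in a real run, rank r holds only chunk r.
q_chunks = q.chunk(world_size, dim=1)
k_chunks = k.chunk(world_size, dim=1)
v_chunks = v.chunk(world_size, dim=1)

rank = 0  # e.g. torch.distributed.get_rank() on a multi-GPU setup
q_local, k_local, v_local = q_chunks[rank], k_chunks[rank], v_chunks[rank]
print(q_local.shape)  # torch.Size([1, 2048, 8, 64]), the local chunk

# out_local = ring_attention(q_local, k_local, v_local)  # placeholder call
```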
> Lucidrains has a pytorch implementation of RingAttention https://github.com/lucidrains/ring-attention-pytorch

Have you tried this repo? I don't know whether the experimental results are as expected. Seems that the model posted on...
> What do you have in mind? Is this model suitable for tokenized ecosystem and bridging liquidity and creating a smart algorithm for bridging / blending / mending and growth...
I encountered a similar problem. When I use LWM-TEXT-512K (PyTorch), I get the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (42314 > 2048). Running...
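If that warning is coming from the tokenizer's `model_max_length` default (often 2048 in the tokenizer config) rather than a real limit of the model itself, one workaround is to override it when loading; note the checkpoint name below is an assumption, so substitute whatever path you actually use:

```python
from transformers import AutoTokenizer

# Override the tokenizer's default model_max_length so long inputs are not
# flagged. The checkpoint name is an assumption; replace it with the path
# you actually load.
tokenizer = AutoTokenizer.from_pretrained(
    "LargeWorldModel/LWM-Text-512K",
    model_max_length=524288,  # 512K context
)

long_text = "some very long document " * 20000
ids = tokenizer(long_text).input_ids
print(len(ids))  # no max-length warning for long sequences
```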
After setting repeat=True with different sequence_length values, I got the following results. Are these results as expected? (When seq_len=1024, there is an obvious difference in the gradient values; as seq_len increases,...
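To make the comparison concrete, this is roughly the kind of harness I'd use to quantify the output/gradient diff at different sequence lengths; the two callables compared below (naive softmax attention vs. torch's scaled_dot_product_attention) are just placeholders for whichever pair of implementations is actually being tested:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Plain softmax attention used here as the reference implementation.
    scores = (q @ k.transpose(-1, -2)) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def compare(attn_a, attn_b, seq_len, heads=4, dim=64):
    # Run both implementations on identical inputs and report the max
    # absolute difference of the outputs and of the q/k/v gradients.
    q = torch.randn(1, heads, seq_len, dim, dtype=torch.float64, requires_grad=True)
    k = torch.randn_like(q, requires_grad=True)
    v = torch.randn_like(q, requires_grad=True)

    out_a = attn_a(q, k, v)
    grads_a = torch.autograd.grad(out_a.sum(), (q, k, v))

    out_b = attn_b(q, k, v)
    grads_b = torch.autograd.grad(out_b.sum(), (q, k, v))

    out_diff = (out_a - out_b).abs().max().item()
    g_diff = max((a - b).abs().max().item() for a, b in zip(grads_a, grads_b))
    return out_diff, g_diff

for seq_len in (256, 1024, 2048):
    # Swap in the pair actually being compared (e.g. ring attention vs.
    # flash attention) in place of the two callables below.
    print(seq_len, compare(naive_attention, F.scaled_dot_product_attention, seq_len))
```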