I found differences similar to those shown in the figure; which one is more accurate?
Is the flash attention being used v1? Which version is it?
First, excellent work! I am trying to reproduce it with my own data and have changed some of your code. During training, at some steps I get negative rl_loss, reg_loss, pg_loss,...
I am installing flash-attn in an image; the container environment is as follows:
```
Ubuntu 16.04.6
pytorch image: nvcr.io/nvidia/pytorch:22.04-py3
PyTorch Version 1.12.0a0+bd13bc6
CUDA 11.6
My card is a V100-32g
```
Command `pip install...
I've read the DKPD paper, and the experimental results show that DKPD works, but I don't really see why DPO is applied to KD in the first place, or how it should improve...