Chen Jie
The first graph compares training with and without FlashAttention-2. The loss barely changes with FA2 enabled (the yellowish curve).
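For reference, here is a minimal sketch of how such a comparison can be set up with Hugging Face `transformers` (the `attn_implementation` argument requires the `flash-attn` package and a half-precision dtype; the model id below is just the Llama-3 base model used for illustration, not necessarily the exact checkpoint in the experiment):

```python
import torch
from transformers import AutoModelForCausalLM

# Baseline: the default attention implementation.
model_baseline = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative model id
    torch_dtype=torch.bfloat16,
)

# FlashAttention-2: same weights, faster fused attention kernels.
# Since FA2 computes exact (not approximate) attention, the loss
# trajectory is expected to match the baseline closely.
model_fa2 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```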
> will the training code be released?

We will organize and release the training code as soon as possible. The continual pre-training is run under the Slurm workload manager.
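As an illustration of what running under Slurm typically looks like on the Python side, here is a minimal sketch that initializes distributed training from the per-task environment variables Slurm sets. It assumes `MASTER_ADDR` and `MASTER_PORT` are exported in the sbatch script; the actual launch recipe in the released code may differ:

```python
import os
import torch
import torch.distributed as dist

def init_distributed_from_slurm() -> None:
    """Initialize torch.distributed from variables Slurm sets per task.

    Assumes MASTER_ADDR and MASTER_PORT are exported in the sbatch
    script; SLURM_PROCID/SLURM_NTASKS/SLURM_LOCALID are set by Slurm.
    """
    rank = int(os.environ["SLURM_PROCID"])        # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

    torch.cuda.set_device(local_rank)  # one GPU per task
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

if __name__ == "__main__":
    init_distributed_from_slurm()
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized")
```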
We are sorry for the delayed release of the code. We have just released the [code](https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/src) used for continual pre-training and data preparation; it includes detailed documentation comments.
[LLMBox](https://github.com/RUCAIBox/LLMBox).