Results: 2 comments by tic-top
Have a look at [tinyllama](https://github.com/jzhang38/TinyLlama), [phi-1](https://huggingface.co/papers/2306.11644) and phi-1.5. The datasets for phi-1 and phi-1.5 are not open-sourced, but there are many datasets based on their idea.
I'm using deepspeed 0.13.1 with torch 2.2.1 and CUDA 12.2 on one node with 8 * A100 (40G). I train this LM with bf16, and grad_norm is always nan.
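One way to narrow down a nan grad_norm is to check which parameters carry non-finite gradients. The sketch below is framework-free plain Python, not DeepSpeed's or torch's actual code: the helper names (`global_grad_norm`, `non_finite_params`), the parameter names, and the gradient values are all made up for illustration. It only shows that a single NaN in any gradient is enough to turn the global L2 norm into NaN, and how a per-parameter scan can point to the source.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values (plain-Python stand-in for a logged grad_norm)."""
    return math.sqrt(sum(g * g for values in grads.values() for g in values))

def non_finite_params(grads):
    """Return the names of parameters whose gradients contain NaN or inf."""
    return [
        name
        for name, values in grads.items()
        if any(not math.isfinite(g) for g in values)
    ]

# Hypothetical gradients: names and values are assumptions for illustration only.
grads = {
    "embed.weight": [0.1, -0.2],
    "layer0.attn.weight": [float("nan"), 0.3],
    "lm_head.weight": [0.05, 0.0],
}

norm = global_grad_norm(grads)   # one NaN entry poisons the whole sum
bad = non_finite_params(grads)   # names of the offending parameters
print(math.isnan(norm), bad)
```

In a real run the same idea would be applied to the model's actual gradients after the backward pass; if every parameter shows up as non-finite from the first step, the problem is more likely in the inputs, loss, or initialization than in one layer.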