ruihan0495
It's under the training/utils folder, called ds_utils.py.
Makes sense :D @s-isaev Do you have any suggestions for inference parameters?
Hard to say. We changed the model to our pretrained GPT, and it kept crashing during training. I guess there are a lot of things to tune in ds_utils.py to make...
Our model is a GPT model.
All right. In addition to this, after we turned off enable-hybrid-engine, we encountered another error: "Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run." The...
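For context, that exception comes from FP16 dynamic loss scaling hitting its floor. A minimal sketch of the fp16 block in a DeepSpeed config dict (like the one ds_utils.py builds) that controls this behavior; the values below are illustrative assumptions, not the settings we actually used:

```python
# Sketch: fp16 / dynamic loss-scaling knobs in a DeepSpeed config dict.
# Values are illustrative; adjust to your own setup.
ds_config = {
    "train_batch_size": 32,           # hypothetical
    "fp16": {
        "enabled": True,
        "loss_scale": 0,              # 0 = dynamic loss scaling
        "initial_scale_power": 16,    # start at 2**16
        "loss_scale_window": 1000,    # steps between scale-up attempts
        "hysteresis": 2,
        "min_loss_scale": 1,          # the "minimum" the error message refers to
    },
}
```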
Dear @LuciusMos, we found that our problem was due to the tokenizer encoding process. Say our tokenizer has a maximum length of 6666, but sometimes it encodes some input strings...
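A minimal sketch of what we mean, assuming a Hugging Face tokenizer (the model name here is just a placeholder): without explicit truncation the encoded sequence can end up longer than the model's maximum length and crash training.

```python
from transformers import AutoTokenizer

# Hypothetical checkpoint; substitute your own pretrained GPT.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
max_len = 6666  # the tokenizer's / model's maximum length

text = "some very long input string ..."

# Without truncation the encoding may exceed max_len.
ids = tokenizer(text)["input_ids"]

# Enforce the limit explicitly when encoding.
ids = tokenizer(text, max_length=max_len, truncation=True)["input_ids"]
assert len(ids) <= max_len
```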
You can do it in main.py when assembling the batch. If the loss is inf, there is probably a bug. If you have ruled out bugs, try tuning the lr and batch size, and you can also adjust things like the training precision in ds_config.
We tried using BF16 instead of FP16 during training; although the loss scale error goes away, training is very unstable. So the best way is still to stick to...
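In case it helps others, here is a rough sketch of how that precision toggle might look in the config dict ds_utils.py returns. The key names follow the DeepSpeed config schema; the helper name and surrounding structure are assumptions for illustration:

```python
def get_ds_precision_config(use_bf16: bool = False) -> dict:
    """Sketch: switch between FP16 (with dynamic loss scaling) and BF16
    in a DeepSpeed config dict. Illustrative only."""
    return {
        "fp16": {"enabled": not use_bf16, "loss_scale": 0},
        "bf16": {"enabled": use_bf16},
    }
```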
maybe see this issue https://github.com/microsoft/DeepSpeedExamples/issues/335#issuecomment-1521105300
> Has anyone successfully run transformers models such as ChatGLM or baichuan-7b? I ran into many bugs. I am about to try... have you fixed them already?