Finished training but the inference performance doesn't look good
Hi, I have finished training the following models:
facebook/opt-1.3b (steps 1, 2, and 3)
facebook/opt-6.7b (step 1)
Here is the example answer shown at the bottom of the chatbot.py script:
Human: what is edge?
Assistant:
Edge is a newer version of the Microsoft internet browser,
developed by Microsoft. It is focused on improving performance
and security, and offers a more modern user interface. Edge is
currently the most popular internet browser on the market, and is
also used heavily by Microsoft employees.
And here is the output from my trained models on the demo questions:
facebook/opt-6.7b (SFT) ->
Human: What is Edge?
Assistant:
edge is a web browser that is built into the web browser in most web-based
operating systems, such as Google Chrome, Firefox, and Firefox.
It is a very popular web browser, and is widely used for web browsing,
web shopping, and web-based email. It is also a popular web browser
for web-based email, and is a popular web browser for web-based email,
and is a popular web browser for web-based email, and is a popular web
browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based
facebook/opt-1.3b (SFT) ->
Human: What is Edge?
Assistant: edge is a web browser that is built into the web browser
you use on your computer. It is a very popular web browser, and is
widely used by web users. It is also a very popular web browser for web
users, and is widely used by web users. It is also a very popular web
browser for web users, and is widely used by web users. It is also a very
popular web browser for web users, and is widely used by web users.
It is also a very popular web browser for web users, and is widely used
by web users. It is also a very popular web browser
facebook/opt-1.3b (RLHF) ->
Human: What is Edge?
Assistant: edge is a web browser that is built into the web browser
in most web-based operating systems, such as Google Chrome, Firefox,
and Firefox. It is a very popular web browser, and is widely used for web
browsing, web shopping, and web-based email. It is also a popular web
browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based email, and is a popular
web browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based
Unlike the demo answer, which is clean and does not repeat itself, my trained models keep repeating themselves.
Note: I was using the default setup to train the models.
Has anyone experienced the same issue? Or do we need to adjust the temperature, top-k, etc. to get the same result?
I'm facing the same issue with the opt-1.3b model.
@alibabadoufu can you share more information about your environment? GPU type, # GPUs, torch version, transformers version, etc.? I will try my best to replicate your setup and see if the models produce the same type of repeated output on my side. If I recall correctly, the demo at the bottom of chat.py was generated with the 13b model, which may explain the difference you are seeing.
@mrwyattii Could you provide the detailed training configuration for the opt-1.3b conversation results presented in the README? I attempted to reproduce the results on 8 * V100, but unfortunately the resulting model seems to only repeat the last sentence or generate something like "I can help you with that." rather than giving an actual answer. Any assistance you could provide would be greatly appreciated.
I have the same problem with facebook/opt-1.3b.
Human: Hello,write me a poem about a wise frog.
Assistant: The poem is "The wise frog", by William Shakespeare. It's a beautiful poem about a wise frog, and the ways he helps people. It's a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a
Human: who are you
Assistant: I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to
Environment:
Ubuntu 22.04
Python 3.10.6
GPU: RTX 3090 24 GB x 1
CUDA 11.7.1
Driver 515.48.07
accelerate==0.18.0
datasets==2.11.0
deepspeed==0.9.0
protobuf==3.20.3
sentencepiece==0.1.98
transformers==4.29.0.dev0
torch==2.0.0
Add the parameter --per_device_train_batch_size 4 to each of the following scripts (a sketch of the change is shown after this list):
training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
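For reference, the change in each script is just appending the flag to the existing deepspeed ... main.py invocation. A rough sketch of the edited step-1 script is below; the other flags are only illustrative and may not match your copy of run_1.3b.sh:
# training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh (sketch)
OUTPUT=./output
ZERO_STAGE=0
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --zero_stage $ZERO_STAGE \
    --per_device_train_batch_size 4 \
    --deepspeed \
    --output_dir $OUTPUT &> $OUTPUT/training.log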
Commands:
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 1
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 2
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 3
python3 chat.py --path output/step3-models/1.3b/actor
Hi @alibabadoufu,
Thank you for bringing the issue to our attention. We observed a similar issue. Both steps 1 and 3 require careful setup.
For example, during step 1 training, we discovered that adding the openai/webgpt_comparisons and stanfordnlp/SHP datasets, as well as using weight decay, can cause a performance regression. Additionally, it may be necessary to adjust the number of training epochs to improve model convergence and quality.
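As a rough illustration, a step-1 launch along these lines reflects those suggestions; the exact argument names and dataset list here are from memory and may differ, so please check step1_supervised_finetuning/main.py for what your version accepts:
# Sketch of a step-1 (SFT) launch: the dataset list omits openai/webgpt_comparisons
# and stanfordnlp/SHP, weight decay is disabled, and the epoch count is a value to tune.
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --data_path Dahoas/rm-static Dahoas/full-hh-rlhf \
    --weight_decay 0.0 \
    --num_train_epochs 2 \
    --zero_stage 0 \
    --deepspeed \
    --output_dir ./output_sft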
Step 3 training also has several important factors. In our experiments, enabling critic model dropout was essential to prevent training loss overflow. Using the EMA checkpoint for evaluation can also improve generation quality, so we recommend turning it on.
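As a very rough sketch of the step-3 launch with EMA turned on (the flag names here, including --enable_ema, are from memory and may differ in your version, so please verify against step3_rlhf_finetuning/main.py):
# Sketch of the step-3 (RLHF) launch with the EMA checkpoint enabled.
# Only --enable_ema is the point here; the other arguments are placeholders,
# and a dropout-related flag for the critic also exists (check main.py for its name).
deepspeed --num_gpus 1 main.py \
    --actor_model_name_or_path ./output_sft \
    --critic_model_name_or_path ./output_rm \
    --enable_ema \
    --actor_zero_stage 0 \
    --critic_zero_stage 0 \
    --deepspeed \
    --output_dir ./output_rlhf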
The latest revision addresses some of these issues. Please try it out and let us know if you still experience repeated-sentence issues. You can find more information about our experiences here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md
Best, Minjia
Thanks Minjia! I will definitely try to train the opt model again after reading through your team's experiences. I will update here once I get any results. Thanks a lot for your detailed reply.