Finished training but the inference performance doesn't look good
Hi, I have finished training the following models:
facebook/opt-1.3b (steps 1, 2, and 3)
facebook/opt-6.7b (step 1)
Here is the example answer shown at the bottom of the chatbot.py script:
Human: what is edge?
Assistant:
Edge is a newer version of the Microsoft internet browser,
developed by Microsoft. It is focused on improving performance
and security, and offers a more modern user interface. Edge is
currently the most popular internet browser on the market, and is
also used heavily by Microsoft employees.
And here is the output from my trained models on the demo questions:
facebook/opt-6.7b (SFT) ->
Human: What is Edge?
Assistant:
edge is a web browser that is built into the web browser in most web-based
operating systems, such as Google Chrome, Firefox, and Firefox.
It is a very popular web browser, and is widely used for web browsing,
web shopping, and web-based email. It is also a popular web browser
for web-based email, and is a popular web browser for web-based email,
and is a popular web browser for web-based email, and is a popular web
browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based
facebook/opt-1.3b (SFT) ->
Human: What is Edge?
Assistant: edge is a web browser that is built into the web browser
you use on your computer. It is a very popular web browser, and is
widely used by web users. It is also a very popular web browser for web
users, and is widely used by web users. It is also a very popular web
browser for web users, and is widely used by web users. It is also a very
popular web browser for web users, and is widely used by web users.
It is also a very popular web browser for web users, and is widely used
by web users. It is also a very popular web browser
facebook/opt-1.3b (RLHF) ->
Human: What is Edge?
Assistant: edge is a web browser that is built into the web browser
in most web-based operating systems, such as Google Chrome, Firefox,
and Firefox. It is a very popular web browser, and is widely used for web
browsing, web shopping, and web-based email. It is also a popular web
browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based email, and is a popular
web browser for web-based email, and is a popular web browser for web-based
email, and is a popular web browser for web-based
Unlike the demo answer, which is clean and does not repeat itself, my trained models keep repeating themselves.
Note: I was using the default setup to train the models.
Has anyone experienced the same issue? Or do we need to adjust the temperature, top-k, etc. to get the same result?
I'm facing the same issue with the opt-1.3b model.
@alibabadoufu can you share more information about your environment? GPU type, # GPUs, torch version, transformers version, etc.? I will try my best to replicate your setup and see if the models produce the same type of repeated output on my side. If I recall correctly, the demo at the bottom of chat.py was generated with the 13b model, which may explain the difference you are seeing.
@mrwyattii Could you provide the detailed training configuration for the opt-1.3b conversation results presented in the README? I attempted to reproduce the results on 8 * V100, but unfortunately the resulting model seems to only repeat the last sentence or generate something like "I can help you with that." rather than giving an actual answer. Any assistance you could provide would be greatly appreciated.
I have the same problem with facebook/opt-1.3b.
Human: Hello,write me a poem about a wise frog.
Assistant: The poem is "The wise frog", by William Shakespeare. It's a beautiful poem about a wise frog, and the ways he helps people. It's a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a poem that's both funny and meaningful, and is a great example of a
Human: who are you
Assistant: I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to help people. I am a human who is trying to
Environment:
Ubuntu 22.04
Python 3.10.6
GPU: RTX 3090 24 GB x 1
CUDA 11.7.1
Driver 515.48.07
accelerate==0.18.0
datasets==2.11.0
deepspeed==0.9.0
protobuf==3.20.3
sentencepiece==0.1.98
transformers==4.29.0.dev0
torch==2.0.0
Add the parameter --per_device_train_batch_size 4 to each of the following scripts (a sketch of the change is shown after this list):
training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
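For reference, the change in each script is just appending the flag to the existing deepspeed ... main.py invocation. A rough sketch of the edited step-1 script is below; the other flags are only illustrative and may not match your copy of run_1.3b.sh:
# training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh (sketch)
OUTPUT=./output
ZERO_STAGE=0
mkdir -p $OUTPUT
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --zero_stage $ZERO_STAGE \
    --per_device_train_batch_size 4 \
    --deepspeed \
    --output_dir $OUTPUT &> $OUTPUT/training.log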
Commands:
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 1
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 2
python3 train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu --step 3
python3 chat.py --path output/step3-models/1.3b/actor
Hi @alibabadoufu,
Thank you for bringing the issue to our attention. We observed a similar issue. Both steps 1 and 3 require careful setup.
For example, during step 1 training, we discovered that adding the openai/webgpt_comparisons and stanfordnlp/SHP datasets, as well as using weight decay, can cause a performance regression. Additionally, it may be necessary to adjust the number of training epochs to improve model convergence and quality.
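As a rough illustration, a step-1 launch along these lines reflects those suggestions; the exact argument names and dataset list here are from memory and may differ, so please check step1_supervised_finetuning/main.py for what your version accepts:
# Sketch of a step-1 (SFT) launch: the dataset list omits openai/webgpt_comparisons
# and stanfordnlp/SHP, weight decay is disabled, and the epoch count is a value to tune.
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --data_path Dahoas/rm-static Dahoas/full-hh-rlhf \
    --weight_decay 0.0 \
    --num_train_epochs 2 \
    --zero_stage 0 \
    --deepspeed \
    --output_dir ./output_sft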
Step 3 training also has several important factors. In our experiments, enabling critic model dropout was essential to prevent training loss overflow. Using the EMA checkpoint for evaluation can also improve generation quality, so we recommend turning it on.
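As a very rough sketch of the step-3 launch with EMA turned on (the flag names here, including --enable_ema, are from memory and may differ in your version, so please verify against step3_rlhf_finetuning/main.py):
# Sketch of the step-3 (RLHF) launch with the EMA checkpoint enabled.
# Only --enable_ema is the point here; the other arguments are placeholders,
# and a dropout-related flag for the critic also exists (check main.py for its name).
deepspeed --num_gpus 1 main.py \
    --actor_model_name_or_path ./output_sft \
    --critic_model_name_or_path ./output_rm \
    --enable_ema \
    --actor_zero_stage 0 \
    --critic_zero_stage 0 \
    --deepspeed \
    --output_dir ./output_rlhf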
The latest revision addresses some of these issues. Please try it out and let us know if you still experience repeated-sentence issues. You can find more information about our experiences here: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/README.md
Best, Minjia
Thanks Minjia! I will definitely try to train the opt model again after reading through your team's experiences. I will update here once I get any results. Thanks a lot for your detailed reply.