Mixtral-8x7B repetitive answers
System Info
CPU Architecture: x86_64
GPU: 2 x NVIDIA H100
TensorRT-LLM: v0.9.0
Image: tritonserver:24.05-trtllm-python-py3
Model weights: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Who can help?
No response
Information
- [x] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I'm using nearly the same commands as in these instructions: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral. I use the following commands to convert the weights and build the engine.
python /data/TensorRT-LLM/examples/llama/convert_checkpoint.py \
--model_dir /data/Mixtral-8x7B-v0.1 \
--output_dir /data/tllm_checkpoint_mixtral_2gpu \
--dtype float16 \
--pp_size 2
trtllm-build \
--checkpoint_dir /data/tllm_checkpoint_mixtral_2gpu \
--output_dir /data/trt_engines/mixtral/pp2 \
--gemm_plugin float16
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py \
--engine_dir /data/trt_engines/mixtral/pp2 \
--tokenizer_dir /data/Mixtral-8x7B-v0.1 \
--max_output_len 512 \
--input_text "I'm sick but I have to work. What should I do?"
Expected behavior
The model should provide responses without repetition, i.e. higher-quality and more coherent answers.
Actual behavior
For many prompts I get a very repetitive answer. Here are two examples:
Example 1:
Command:
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py --engine_dir /data/trt_engines/mixtral/pp2 --tokenizer_dir /data/Mixtral-8x7B-v0.1 --max_output_len 512 --input_text " I'm sick but I have to work. What should I do"
Output:
"<s> I'm sick but I have to work. What should I do?"
Output [Text 0 Beam 0]: "
If you are sick, you should stay home. If you are sick and have to work, you should take precautions to prevent the spread of germs.
- Cover your mouth and nose with a tissue when you cough or sneeze.
- Put your used tissue in the waste basket.
- If you don't have a tissue, cough or sneeze into your upper sleeve, not your hands.
- Wash your hands after coughing, sneezing or blowing your nose.
- Try to avoid close contact with sick people.
- If you are sick, stay home and avoid close contact with others to keep from infecting them.
- If you are sick with flu-like illness, CDC recommends that you stay home for at least 24 hours after your fever is gone except to get medical care or for other necessities. (Your fever should be gone without the use of a fever-reducing medicine.)
- Keeping away from others will help prevent others from getting sick too.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay"
Example 2:
Command:
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py --engine_dir /data/trt_engines/mixtral/pp2 --tokenizer_dir /data/Mixtral-8x7B-v0.1 --max_output_len 512 --input_text "Give me a pizza recipe and explain it in terms of thermodynamics."
Output:
Input [Text 0]: "<s> Give me a pizza recipe and explain it in terms of thermodynamics."
Output [Text 0 Beam 0]: "
I’m not sure if this is a joke or not, but I’ll give it a shot.
Pizza is a type of food that is made by baking dough with toppings on it. The dough is usually made from flour, water, yeast, and salt. The toppings can be anything from cheese to vegetables to meat.
The thermodynamics of pizza can be explained in terms of the energy that is required to make the dough and the toppings. The dough needs to be mixed and kneaded, which requires energy. The toppings need to be cooked, which also requires energy.
The energy that is required to make pizza can be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, and c is the speed of light.
The energy that is required to make pizza can also be calculated using the following equation:
E = mgh
Where E is the energy, m is the mass of the dough and toppings, g is the acceleration due to gravity, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
Additional notes
The same problem occurs when using Llama3-8b-instruct and the triton-cli, where the conversion and engine creation are done automatically. The model returns very lyrical answers with a lot of repetition.
triton import --model llama-3-8b-instruct --backend tensorrtllm --model-repository /data/models
@BugsBuggy
Hi, can you try with the latest TRT-LLM to see whether the issue still exists? There have been recent fixes to MoE-related kernels.
Thanks,
June
Hi @BugsBuggy,
We did a reference run using "non TRT-LLM" deployment framework with the same Mixtral-8x7B checkpoints and configs (sampling config, max_output_len, etc) and observed the same repetitive answers as you shared.
In TRT-LLM, the default sampling config uses top_k=1 and top_p=0. If you change them to top_k=0 and top_p=1 to consider all tokens, the output has little to no repetition, as shown below:
TRT-LLM default uses top_k=1, which is a greedy search and can be more prone to generating repeated tokens. From the above, our current conclusion is that this is not a bug. Please try tweaking the sampling config and let us know if you see different behaviors than we described.
I set top_k = 50 and top_p = 1, and every output result is the same.
If I set top_k to -1 and top_p to 1, an error occurs:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: topK.value() >= 0 (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/executor/samplingConfig.cpp:231)
1 0x7fead35a0c01 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7fead35d35a2 /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7575a2) [0x7fead35d35a2]
3 0x7fead52ebfae tensorrt_llm::executor::SamplingConfig::SamplingConfig(int, std::optional
Hi @xiangxinhello, I tried again with tensorrt-llm 0.11.0 with Mixtral 8x7B and top_k=0 (the minimal value; it should be 0 instead of -1) and top_p=1, and it doesn't produce repetitive answers. Can you try these two flags with Mixtral and Qwen again?
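For anyone driving the engine from their own script rather than examples/run.py, the same sampling override looks roughly like the sketch below. This is only a sketch, assuming the tensorrt_llm.runtime.ModelRunner API that examples/run.py builds on around v0.9-v0.11; verify the exact signatures against your installed version, and launch it under mpirun -n 2 for the pp_size=2 engine, just like run.py.

import torch
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

# Paths reuse the reproduction setup above.
tokenizer = AutoTokenizer.from_pretrained("/data/Mixtral-8x7B-v0.1")
runner = ModelRunner.from_dir(
    engine_dir="/data/trt_engines/mixtral/pp2",
    rank=tensorrt_llm.mpi_rank(),  # needed when the engine spans multiple GPUs
)

prompt = "I'm sick but I have to work. What should I do?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].int()

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids],    # list of 1-D token-id tensors
        max_new_tokens=512,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # Mixtral's tokenizer has no pad token
        top_k=0,                        # 0 = no top-k restriction (must be >= 0)
        top_p=1.0,                      # keep the full distribution for sampling
        temperature=1.0,
        return_dict=True,
    )

# output_ids has shape [batch, beams, seq_len] and includes the prompt tokens.
print(tokenizer.decode(outputs["output_ids"][0][0], skip_special_tokens=True))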