Mixtral-8x7B repetitive answers
System Info
CPU Architecture: x86_64
GPU: 2 x NVIDIA H100
TensorRT-LLM: v0.9.0
Image: tritonserver:24.05-trtllm-python-py3
Model weights: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Who can help?
No response
Information
- [x] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I'm using nearly the same commands as in these instructions: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral. I use the following commands to convert the weights and build the engine.
python /data/TensorRT-LLM/examples/llama/convert_checkpoint.py \
--model_dir /data/Mixtral-8x7B-v0.1 \
--output_dir /data/tllm_checkpoint_mixtral_2gpu \
--dtype float16 \
--pp_size 2
trtllm-build \
--checkpoint_dir /data/tllm_checkpoint_mixtral_2gpu \
--output_dir /data/trt_engines/mixtral/pp2 \
--gemm_plugin float16
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py \
--engine_dir /data/trt_engines/mixtral/pp2 \
--tokenizer_dir /data/Mixtral-8x7B-v0.1 \
--max_output_len 512 \
--input_text "I'm sick but I have to work. What should I do?"
Expected behavior
The model should provide responses without repetition, i.e. higher-quality and more coherent answers.
Actual behavior
For many prompts I get a very repetitive answer. Here are two examples:
Example 1:
Command:
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py --engine_dir /data/trt_engines/mixtral/pp2 --tokenizer_dir /data/Mixtral-8x7B-v0.1 --max_output_len 512 --input_text " I'm sick but I have to work. What should I do"
Output:
"<s> I'm sick but I have to work. What should I do?"
Output [Text 0 Beam 0]: "
If you are sick, you should stay home. If you are sick and have to work, you should take precautions to prevent the spread of germs.
- Cover your mouth and nose with a tissue when you cough or sneeze.
- Put your used tissue in the waste basket.
- If you don't have a tissue, cough or sneeze into your upper sleeve, not your hands.
- Wash your hands after coughing, sneezing or blowing your nose.
- Try to avoid close contact with sick people.
- If you are sick, stay home and avoid close contact with others to keep from infecting them.
- If you are sick with flu-like illness, CDC recommends that you stay home for at least 24 hours after your fever is gone except to get medical care or for other necessities. (Your fever should be gone without the use of a fever-reducing medicine.)
- Keeping away from others will help prevent others from getting sick too.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay home.
- If you have a fever, you should stay"
Example 2:
Command:
mpirun -n 2 python3 /data/TensorRT-LLM/examples/run.py --engine_dir /data/trt_engines/mixtral/pp2 --tokenizer_dir /data/Mixtral-8x7B-v0.1 --max_output_len 512 --input_text "Give me a pizza recipe and explain it in terms of thermodynamics."
Output:
Input [Text 0]: "<s> Give me a pizza recipe and explain it in terms of thermodynamics."
Output [Text 0 Beam 0]: "
I’m not sure if this is a joke or not, but I’ll give it a shot.
Pizza is a type of food that is made by baking dough with toppings on it. The dough is usually made from flour, water, yeast, and salt. The toppings can be anything from cheese to vegetables to meat.
The thermodynamics of pizza can be explained in terms of the energy that is required to make the dough and the toppings. The dough needs to be mixed and kneaded, which requires energy. The toppings need to be cooked, which also requires energy.
The energy that is required to make pizza can be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, and c is the speed of light.
The energy that is required to make pizza can also be calculated using the following equation:
E = mgh
Where E is the energy, m is the mass of the dough and toppings, g is the acceleration due to gravity, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
The energy that is required to make pizza can also be calculated using the following equation:
E = mc2
Where E is the energy, m is the mass of the dough and toppings, c is the speed of light, and h is the height of the pizza.
Additional notes
The same problem occurs when using Llama3-8b-instruct and the triton-cli, where the conversion and engine creation are done automatically. The model returns very lyrical answers with a lot of repetition.
triton import --model llama-3-8b-instruct --backend tensorrtllm --model-repository /data/models
@BugsBuggy
Hi, can you try with the latest TRT-LLM to see whether the issue still exists? There have been recent fixes to MoE-related kernels.
Thanks,
June
Hi @BugsBuggy,
We did a reference run using "non TRT-LLM" deployment framework with the same Mixtral-8x7B checkpoints and configs (sampling config, max_output_len, etc) and observed the same repetitive answers as you shared.
In TRT-LLM, the default sampling config uses top_k=1 and top_p=0. If you change them to top_k=0 and top_p=1 to consider all tokens, the output has little to no repetition, as shown below:
TRT-LLM default uses top_k=1, which is a greedy search and can be more prone to generating repeated tokens. From the above, our current conclusion is that this is not a bug. Please try tweaking the sampling config and let us know if you see different behaviors than we described.
I set top_k = 50 and top_p = 1, and every output result is the same.
If I set top_k to -1 and top_p to 1, an error occurs:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: topK.value() >= 0 (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/executor/samplingConfig.cpp:231)
1 0x7fead35a0c01 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7fead35d35a2 /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7575a2) [0x7fead35d35a2]
3 0x7fead52ebfae tensorrt_llm::executor::SamplingConfig::SamplingConfig(int, std::optional
Hi @xiangxinhello, I tried again with tensorrt-llm 0.11.0 with Mixtral 8x7B and top_k=0 (the minimal value; it should be 0 instead of -1) and top_p=1, and it doesn't produce repetitive answers. Can you try these two flags with Mixtral and Qwen again?
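For anyone driving the engine from their own script rather than examples/run.py, the same sampling override looks roughly like the sketch below. This is only a sketch, assuming the tensorrt_llm.runtime.ModelRunner API that examples/run.py builds on around v0.9-v0.11; verify the exact signatures against your installed version, and launch it under mpirun -n 2 for the pp_size=2 engine, just like run.py.

import torch
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

# Paths reuse the reproduction setup above.
tokenizer = AutoTokenizer.from_pretrained("/data/Mixtral-8x7B-v0.1")
runner = ModelRunner.from_dir(
    engine_dir="/data/trt_engines/mixtral/pp2",
    rank=tensorrt_llm.mpi_rank(),  # needed when the engine spans multiple GPUs
)

prompt = "I'm sick but I have to work. What should I do?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].int()

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids=[input_ids],    # list of 1-D token-id tensors
        max_new_tokens=512,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,  # Mixtral's tokenizer has no pad token
        top_k=0,                        # 0 = no top-k restriction (must be >= 0)
        top_p=1.0,                      # keep the full distribution for sampling
        temperature=1.0,
        return_dict=True,
    )

# output_ids has shape [batch, beams, seq_len] and includes the prompt tokens.
print(tokenizer.decode(outputs["output_ids"][0][0], skip_special_tokens=True))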