
When will FP8 be available for Mixtral?

Open Pernekhan opened this issue 2 years ago • 11 comments

Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model?

cc: @Tracin

Pernekhan avatar Mar 04 '24 22:03 Pernekhan

Mixtral with FP8 is almost done and will be released in version 0.9.0.

Tracin avatar Mar 05 '24 07:03 Tracin

Does Mixtral support AWQ 4-bit?

vip-china avatar Mar 05 '24 08:03 vip-china

Mixtral with FP8 is almost done and will be released in version 0.9.0.

Hi @Tracin, it seems that I still can't export an FP8-quantized version of Mixtral 8x7B in version 0.9.0. Is there something I missed in the docs, or is this still not supported?

When running the Python script quantize.py like this:

python /app/tensorrt_llm/examples/quantization/quantize.py \
   --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1  \
   --output_dir /workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8 \
   --qformat fp8 \
   --tp_size 1  \
   --pp_size 1  \
   --kv_cache_dtype fp8

it exits with an error:

Warning: this is an old NPZ format and will be deprecated soon.
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/quantization/quantize.py", line 52, in <module>
    quantize_and_export(model_dir=args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
    with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8/rank0.safetensors"

It did generate two files, but I can't use them as input for trtllm-build:

$ ls /workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8/
unknown:MixtralForCausalLM_tp1.json  unknown:MixtralForCausalLM_tp1_rank0.npz

The generated JSON file does not look like the ones usually generated with convert_checkpoint. Please advise! :)
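For anyone hitting the same failure: a quick sanity check of the checkpoint directory before invoking trtllm-build can confirm whether the exporter produced the unified-checkpoint layout (config.json plus rank*.safetensors shards) or fell back to the legacy NPZ format seen above. This is only a diagnostic sketch, not part of the TensorRT-LLM tooling, and the function name is my own:

```python
import glob
import os

def check_trtllm_checkpoint(ckpt_dir: str) -> bool:
    """Return True if ckpt_dir looks like a unified TensorRT-LLM checkpoint."""
    has_config = os.path.isfile(os.path.join(ckpt_dir, "config.json"))
    rank_shards = glob.glob(os.path.join(ckpt_dir, "rank*.safetensors"))
    legacy_npz = glob.glob(os.path.join(ckpt_dir, "*.npz"))
    if has_config and rank_shards:
        print(f"OK: found config.json and {len(rank_shards)} rank shard(s)")
        return True
    if legacy_npz:
        # Matches the symptom in this thread: an *.npz file plus a JSON
        # that trtllm-build does not accept.
        print("Legacy NPZ export detected; quantizer likely fell back to the old format")
    return False
```

Running this against the output directory shown above would report the legacy NPZ fallback rather than a usable checkpoint.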

Marlinski avatar Apr 16 '24 12:04 Marlinski

Hi! I'm sorry, I couldn't tell from the release notes: is it released in 0.9.0? I can see that it now has OOTB support and is under the unified workflow, but does that mean it works in FP8 now?

hawkeoni avatar Apr 16 '24 12:04 hawkeoni

@hawkeoni @Marlinski Sorry guys, I think Mixtral-FP8 is delayed a little bit and will be released very soon.

Tracin avatar Apr 17 '24 06:04 Tracin

@hawkeoni @Marlinski Sorry guys, I think Mixtral-FP8 is delayed a little bit and will be released very soon.

Hi @Tracin, I suggest keeping a compatibility table somewhere between models and quantization techniques, so that at a glance it's easy to understand what is supposed to work and what isn't.

fedem96 avatar Apr 17 '24 06:04 fedem96

@hawkeoni @Marlinski Sorry guys, I think Mixtral-FP8 is delayed a little bit and will be released very soon.

Hi @Tracin, I suggest keeping a compatibility table somewhere between models and quantization techniques, so that at a glance it's easy to understand what is supposed to work and what isn't.

Good idea! You can check it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md. We will keep it up to date.

Tracin avatar Apr 17 '24 08:04 Tracin

@hawkeoni @Marlinski Sorry guys, I think Mixtral-FP8 is delayed a little bit and will be released very soon.

Hi @Tracin, I suggest keeping a compatibility table somewhere between models and quantization techniques, so that at a glance it's easy to understand what is supposed to work and what isn't.

Good idea! You can check it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md. We will keep it up to date.

Thanks for your answer. Is there any reason why Mixtral is not present in the table? Also, is the "Int4-FP8 AWQ (W4A8)" column absent because it's intended to be supported whenever both "FP8" and "W4A16 AWQ" are supported?

fedem96 avatar Apr 17 '24 11:04 fedem96

@fedem96 I think the table is not up to date. I was able to export and run Mixtral in FP16 as well as in W4A16, but it does not appear in the matrix.

Marlinski avatar Apr 18 '24 16:04 Marlinski

@fedem96 I think the table is not up to date. I was able to export and run Mixtral in FP16 as well as in W4A16, but it does not appear in the matrix.

Yes, I also think so. I'm able to run it in W4A16 using the convert_checkpoint.py script in the llama directory, but I'm not able to quantize it with the quantize.py script.

fedem96 avatar Apr 18 '24 16:04 fedem96

@Tracin I noticed that documentation for Mixtral FP8 has been added (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral#fp8-post-training-quantization). Thank you very much for your hard work on this feature!

I am able to quantize, build, and run Mixtral 8x7B-Instruct v0.1 in FP8 on H100s following these instructions, however I'm running into two different errors with beam search:

  1. When using beam_width > 1, an assertion error is thrown:
python {quantize_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --qformat fp8 \
                                --kv_cache_dtype fp8 \
                                --calib_size 512 \
                                --tp_size 2

trtllm-build --checkpoint_dir {checkpoint_dir} \
                 --output_dir {deploy_dir} \
                 --gemm_plugin float16 \
                 --workers 2 \
                 --strongly_typed

mpirun --allow-run-as-root -n 2 \
          python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
          --engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
          --tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
          --max_output_len 1024 \
          --input_text "{prompt}" \
          --no_prompt_template \
          --temperature 1.0 \
          --length_penalty 1.0 \
          --repetition_penalty 1.0 \
          --num_beams 2 \
          --early_stopping 0

[TensorRT-LLM][WARNING] cache_indirection: expected dim[1] = 1, provided dim[1] = 2
[TensorRT-LLM][WARNING] cache_indirection: expected dim[1] = 1, provided dim[1] = 2
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'cache_indirection' has invalid shape (1, 2, 2048), expected (1, 1, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:178)
1       0x7fbf5f47640a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fbf6121000e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 2318
3       0x7fbf6141a817 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupContext(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 359
4       0x7fbf6141aa5c tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 492
5       0x7fbf61427664 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2164
6       0x7fbf6144b7e4 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100
7       0x7fbf6144da6c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380
8       0x7fc096ab0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc096ab0253]
9       0x7fc215956ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc215956ac3]
10      0x7fc2159e7a04 clone + 68
  2. Settings that worked for Mixtral in FP16 (v0.9.0) cause a runtime CUDA malloc error when I attempt to run Mixtral in FP8 (v0.11.0.dev2024052100). Note that for both FP16 and FP8, I'm using 2xH100:
# This works using v0.9.0
python {convert_checkpoint_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --tp_size 2

trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --tp_size 2 \
             --pp_size 1 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --use_custom_all_reduce disable \
             --paged_kv_cache enable \
             --max_input_len 32768 \
             --max_batch_size 60 \
             --max_output_len 1024 \
             --max_beam_width 5 \
             --max_num_tokens 393456 \
             --strongly_typed \
             --use_paged_context_fmha disable

mpirun --allow-run-as-root -n 2 \
          python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
          --engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
          --tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
          --max_output_len 1024 \
          --input_text "{prompt}" \
          --no_prompt_template \
          --temperature 1.0 \
          --length_penalty 1.0 \
          --repetition_penalty 1.0 \
          --num_beams 5 \
          --early_stopping 0
# This does not work in v0.11.0.dev20240521
python {quantize_path} --model_dir {model_dir} \
                                --output_dir {checkpoint_dir} \
                                --dtype float16 \
                                --qformat fp8 \
                                --kv_cache_dtype fp8 \
                                --calib_size 512 \
                                --tp_size 2

trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --tp_size 2 \
             --pp_size 1 \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --remove_input_padding enable \
             --use_custom_all_reduce disable \
             --paged_kv_cache enable \
             --max_input_len 32768 \
             --max_batch_size 60 \
             --max_output_len 1024 \
             --max_beam_width 5 \
             --max_num_tokens 393456 \
             --strongly_typed \
             --use_paged_context_fmha disable

mpirun --allow-run-as-root -n 2 \
          python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
          --engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
          --tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
          --max_output_len 1024 \
          --input_text "{prompt}" \
          --no_prompt_template \
          --temperature 1.0 \
          --length_penalty 1.0 \
          --repetition_penalty 1.0 \
          --num_beams 5 \
          --early_stopping 0

Traceback (most recent call last):
  File "/triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py", line 571, in <module>
    main(args)
  File "/triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py", line 420, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 981, in from_dir
    return ModelRunnerCppExecutor.from_dir(engine_dir, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 110, in from_dir
    executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)
1       0x7fe04ed78555 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2       0x7fe0509459e4 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 324
3       0x7fe050a10814 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 644
4       0x7fe050c21f6d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1309
5       0x7fe050be27e0 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 976
6       0x7fe050c49767 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 727
7       0x7fe050c4a381 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2097
8       0x7fe050c40622 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
9       0x7fe0c6428a5d /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xaaa5d) [0x7fe0c6428a5d]
10      0x7fe0c63cdd33 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4fd33) [0x7fe0c63cdd33]
11      0x5648af1cb10e python3(+0x15a10e) [0x5648af1cb10e]
12      0x5648af1c1a7b _PyObject_MakeTpCall + 603
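Regarding error (1): the shape mismatch ("cache_indirection has invalid shape (1, 2, 2048), expected (1, 1, -1)") is consistent with an engine built at the default max_beam_width of 1 while run.py requests --num_beams 2, and the first trtllm-build invocation above indeed does not pass --max_beam_width. A pre-flight check along these lines could catch the mismatch before launch; the function is my own, and the config.json layout (a build_config.max_beam_width entry next to the engine) is an assumption that may differ across TensorRT-LLM versions:

```python
import json
import os

def check_beam_width(engine_dir: str, num_beams: int) -> None:
    """Fail fast if the requested beam count exceeds what the engine was built for.

    Assumes engine_dir contains a config.json with a build_config.max_beam_width
    field (written by trtllm-build; exact layout may vary by version).
    """
    with open(os.path.join(engine_dir, "config.json")) as f:
        cfg = json.load(f)
    max_beam_width = cfg.get("build_config", {}).get("max_beam_width", 1)
    if num_beams > max_beam_width:
        raise ValueError(
            f"--num_beams {num_beams} exceeds engine max_beam_width "
            f"{max_beam_width}; rebuild with --max_beam_width >= {num_beams}"
        )
```

If this is the cause, rebuilding the FP8 engine with an explicit --max_beam_width (as the FP16 command does) should make beam search work.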

Separately, I'm wondering if it should be possible to run a TensorRT-LLM engine for Mixtral 8x7B in FP8 on a single H100. I think that should be possible (the compiled engine is ~44 GB), but I'm running into an error message similar to (2) above.
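On the memory question, a back-of-the-envelope estimate can indicate whether the OOM is expected. The sketch below uses the publicly documented Mixtral 8x7B shapes (32 layers, 8 KV heads, head dim 128) and assumes the KV cache is sized roughly as tokens x layers x KV heads x head dim x 2 (K and V) x bytes per element, with KV heads sharded across TP ranks; the real TensorRT-LLM allocation differs (paging granularity, activation workspace), so treat this as an approximation only:

```python
def kv_cache_gib(tokens: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1,
                 tp_size: int = 1) -> float:
    """Rough FP8 (1 byte/elem) KV-cache size in GiB per TP rank."""
    # 2x for the K and V tensors; KV heads are split across tensor-parallel ranks.
    total_bytes = 2 * tokens * layers * (kv_heads // tp_size) * head_dim * bytes_per_elem
    return total_bytes / 2**30

# Worst case implied by the build settings above
# (max_batch_size 60, max_input_len 32768, max_output_len 1024, tp_size 2):
worst_case = kv_cache_gib(60 * (32768 + 1024), tp_size=2)
```

Under these assumptions the worst-case FP8 KV cache alone lands in the tens of GiB per GPU, on top of the engine weights, which would plausibly exhaust a single H100 and makes the cudaMallocAsync failure less surprising with max_batch_size 60 and max_input_len 32768.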

Thank you again for the work you've put into this feature!

njaramish avatar May 21 '24 23:05 njaramish

Hi @Pernekhan, do you still have any further issues or questions? If not, we'll close this soon.

nv-guomingz avatar Nov 14 '24 07:11 nv-guomingz

No questions. Thank you.

Pernekhan avatar Nov 14 '24 16:11 Pernekhan