When will FP8 be available for Mixtral?
Could you guys share a rough timeline for FP8 quantization support for the Mixtral (MoE) model?
cc: @Tracin
Mixtral with FP8 is almost done and will be released in version 0.9.0.
Does Mixtral support AWQ 4-bit?
Hi @Tracin, it seems that I still can't export an FP8-quantized version of Mixtral 8x7B in version 0.9.0. Is there something I missed in the docs, or is this still not supported?
When running the Python script quantize.py like this:
python /app/tensorrt_llm/examples/quantization/quantize.py \
--model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
--output_dir /workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8 \
--qformat fp8 \
--tp_size 1 \
--pp_size 1 \
--kv_cache_dtype fp8
it exits with an error:
Warning: this is an old NPZ format and will be deprecated soon.
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/quantization/quantize.py", line 52, in <module>
quantize_and_export(model_dir=args.model_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py", line 360, in quantize_and_export
with safetensors.safe_open(f"{export_path}/rank0.safetensors",
FileNotFoundError: No such file or directory: "/workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8/rank0.safetensors"
It did generate two files, but I can't use them as input for trtllm-build:
$ ls /workspace/checkpoint/mistralai_Mixtral-8x7B-Instruct-v0.1/tllm_checkpoint_fp8/
unknown:MixtralForCausalLM_tp1.json unknown:MixtralForCausalLM_tp1_rank0.npz
The generated JSON file does not look like the ones usually generated with convert_checkpoint. Please advise! :)
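For reference, this is the layout I would expect the checkpoint directory to have before running trtllm-build, judging from checkpoints produced by convert_checkpoint and from the rank0.safetensors path in the traceback above (my understanding, not an official spec), whereas I only get the two files listed above:

```
# Expected layout of a TensorRT-LLM checkpoint directory (hypothetical path):
# a config.json plus one rank<N>.safetensors per rank.
$ ls {some_converted_checkpoint_dir}
config.json  rank0.safetensors
```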
Hi! I'm sorry, I couldn't tell from the release notes: is this released in 0.9.0? I can see that Mixtral now has OOTB support and is under the unified workflow, but does that mean it works in FP8 now?
@hawkeoni @Marlinski Sorry guys, I think Mixtral-FP8 is delayed a little bit and will be released very soon.
Hi @Tracin, I suggest keeping a compatibility table between models and quantization techniques somewhere, so that at a glance it's easy to understand what is supposed to work and what is not.
Good idea! You can check it here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/precision.md. We will keep it updated over time.
Thanks for your answer. Is there any reason why Mixtral is not present in the table? Also, is the column for "Int4-FP8 AWQ (W4A8)" not present because it's intended to be supported when both "FP8" and "W4A16 AWQ" are supported?
@fedem96 I think the table is not up to date. I was able to export and run Mixtral in FP16 as well as in W4A16, but it does not appear in the matrix.
Yes, I also think so. I'm able to run it in W4A16 by using the convert_checkpoint.py script in the llama dir, but I'm not able to quantize it using the quantize.py script.
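For completeness, this is roughly what I run for W4A16; the flags are the ones I understand from examples/llama/convert_checkpoint.py, and the paths are placeholders:

```
# Weight-only INT4 (W4A16) conversion via the llama converter, which also
# handles Mixtral; paths are placeholders.
python examples/llama/convert_checkpoint.py \
    --model_dir mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --output_dir /workspace/checkpoint/mixtral_w4a16 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --tp_size 2

trtllm-build --checkpoint_dir /workspace/checkpoint/mixtral_w4a16 \
             --output_dir /workspace/engine/mixtral_w4a16 \
             --gemm_plugin float16 \
             --workers 2
```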
@Tracin I noticed that documentation for Mixtral FP8 has been added: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral#fp8-post-training-quantization! Thank you very much for your hard work on this feature!
I am able to quantize, build, and run Mixtral 8x7B-Instruct v0.1 in FP8 on H100s following these instructions; however, I'm running into two different errors when using beam search:
- When using beam_width > 1, an assertion error is thrown (a possible rebuild workaround is sketched after the second error below):
python {quantize_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir} \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512 \
--tp_size 2
trtllm-build --checkpoint_dir {checkpoint_dir} \
--output_dir {deploy_dir} \
--gemm_plugin float16 \
--workers 2 \
--strongly_typed
mpirun --allow-run-as-root -n 2 \
python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
--engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
--tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
--max_output_len 1024 \
--input_text "{prompt}" \
--no_prompt_template \
--temperature 1.0 \
--length_penalty 1.0 \
--repetition_penalty 1.0 \
--num_beams 2 \
--early_stopping 0
[TensorRT-LLM][WARNING] cache_indirection: expected dim[1] = 1, provided dim[1] = 2
[TensorRT-LLM][WARNING] cache_indirection: expected dim[1] = 1, provided dim[1] = 2
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
[TensorRT-LLM][ERROR] 3: [executionContext.cpp::setInputShape::2037] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2037, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'cache_indirection' has invalid shape (1, 2, 2048), expected (1, 1, -1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:178)
1 0x7fbf5f47640a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fbf6121000e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 2318
3 0x7fbf6141a817 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupContext(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 359
4 0x7fbf6141aa5c tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 492
5 0x7fbf61427664 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 2164
6 0x7fbf6144b7e4 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 100
7 0x7fbf6144da6c tensorrt_llm::executor::Executor::Impl::executionLoop() + 380
8 0x7fc096ab0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc096ab0253]
9 0x7fc215956ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc215956ac3]
10 0x7fc2159e7a04 clone + 68
- Settings that worked for Mixtral in FP16 (v0.9.0) cause a runtime CUDA malloc error when I attempt to run Mixtral in FP8 (v0.11.0.dev2024052100). Note that for both FP16 and FP8, I'm using 2xH100:
# This works using v0.9.0
python {convert_checkpoint_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir} \
--dtype float16 \
--tp_size 2
trtllm-build --checkpoint_dir {checkpoint_dir} \
    --output_dir {deploy_dir} \
    --gemm_plugin float16 \
    --workers 2 \
    --tp_size 2 \
    --pp_size 1 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --use_custom_all_reduce disable \
    --paged_kv_cache enable \
    --max_input_len 32768 \
    --max_batch_size 60 \
    --max_output_len 1024 \
    --max_beam_width 5 \
    --max_num_tokens 393456 \
    --strongly_typed \
    --use_paged_context_fmha disable
mpirun --allow-run-as-root -n 2 \
python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
--engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
--tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
--max_output_len 1024 \
--input_text "{prompt}" \
--no_prompt_template \
--temperature 1.0 \
--length_penalty 1.0 \
--repetition_penalty 1.0 \
--num_beams 5 \
--early_stopping 0
# This does not work in v0.11.0.dev20240521
python {quantize_path} --model_dir {model_dir} \
--output_dir {checkpoint_dir} \
--dtype float16 \
--qformat fp8 \
--kv_cache_dtype fp8 \
--calib_size 512 \
--tp_size 2
trtllm-build --checkpoint_dir {checkpoint_dir} \
    --output_dir {deploy_dir} \
    --gemm_plugin float16 \
    --workers 2 \
    --tp_size 2 \
    --pp_size 1 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable \
    --use_custom_all_reduce disable \
    --paged_kv_cache enable \
    --max_input_len 32768 \
    --max_batch_size 60 \
    --max_output_len 1024 \
    --max_beam_width 5 \
    --max_num_tokens 393456 \
    --strongly_typed \
    --use_paged_context_fmha disable
mpirun --allow-run-as-root -n 2 \
python3 /triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py \
--engine_dir {SCRATCH_MODEL_REPO}/{MODEL_NAME}/{MODEL_NAME}-tensorrt_llm/1 \
--tokenizer_dir {SCRATCH_RAW_MODELS}/{HF_MODEL_NAME} \
--max_output_len 1024 \
--input_text "{prompt}" \
--no_prompt_template \
--temperature 1.0 \
--length_penalty 1.0 \
--repetition_penalty 1.0 \
--num_beams 5 \
--early_stopping 0
Traceback (most recent call last):
File "/triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py", line 571, in <module>
main(args)
File "/triton-llm/trtllm-0.11.0.dev20240521/tensorrtllm_backend/tensorrt_llm/examples/run.py", line 420, in main
runner = runner_cls.from_dir(**runner_kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 981, in from_dir
return ModelRunnerCppExecutor.from_dir(engine_dir, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 110, in from_dir
executor = trtllm.Executor(
RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:118)
1 0x7fe04ed78555 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2 0x7fe0509459e4 tensorrt_llm::runtime::BufferManager::gpu(unsigned long, nvinfer1::DataType) const + 324
3 0x7fe050a10814 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(void const*, unsigned long, float, nvinfer1::ILogger&) + 644
4 0x7fe050c21f6d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1309
5 0x7fe050be27e0 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::executor::SchedulerConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 976
6 0x7fe050c49767 tensorrt_llm::executor::Executor::Impl::createModel(std::vector<unsigned char, std::allocator<unsigned char> > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 727
7 0x7fe050c4a381 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2097
8 0x7fe050c40622 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 50
9 0x7fe0c6428a5d /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xaaa5d) [0x7fe0c6428a5d]
10 0x7fe0c63cdd33 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x4fd33) [0x7fe0c63cdd33]
11 0x5648af1cb10e python3(+0x15a10e) [0x5648af1cb10e]
12 0x5648af1c1a7b _PyObject_MakeTpCall + 603
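Regarding the first error: the build command used for that run does not pass --max_beam_width, so the engine presumably defaults to a beam width of 1, which would match the expected cache_indirection shape (1, 1, -1) in the assertion. A rebuild with an explicit beam width might be all that's needed; this is only a guess I haven't verified yet:

```
# Same FP8 build as in the first error, plus an explicit beam width so the
# engine's cache_indirection tensor can accept num_beams > 1 (unverified guess).
trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --max_beam_width 2 \
             --strongly_typed
```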
Separately, I'm wondering if it should be possible to run a TensorRT-LLM engine for Mixtral 8x7B in FP8 on a single H100. I think that should be possible (the compiled engine is ~44 GB), but I'm running into an error message similar to (2) above.
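In case it helps narrow this down, here is a lower-memory variant of the FP8 build/run that I'd try next (just a sketch; I haven't confirmed it avoids the OOM, and I'm assuming this run.py version accepts --kv_cache_free_gpu_memory_fraction):

```
# Hypothetical lower-memory FP8 build/run: smaller batch and token budget at
# build time, and a smaller KV-cache pool at runtime.
trtllm-build --checkpoint_dir {checkpoint_dir} \
             --output_dir {deploy_dir} \
             --gemm_plugin float16 \
             --workers 2 \
             --max_input_len 32768 \
             --max_batch_size 8 \
             --max_output_len 1024 \
             --max_beam_width 5 \
             --max_num_tokens 65536 \
             --strongly_typed

mpirun --allow-run-as-root -n 2 \
    python3 examples/run.py --engine_dir {deploy_dir} \
                            --tokenizer_dir {tokenizer_dir} \
                            --max_output_len 1024 \
                            --num_beams 5 \
                            --kv_cache_free_gpu_memory_fraction 0.7 \
                            --input_text "{prompt}"
```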
Thank you again for the work you've put into this feature!
Hi @Pernekhan, do you still have any further issues or questions? If not, we'll close this soon.
No questions. Thank you.