TensorRT-LLM
Is FP8 quantization not supported for Codestral-22B?
I used the following command to quantize the Codestral-22B-v0.1 model:
python3 /data/trt_llm_code/trt_llm_v0.16.0/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/base_models/codestral-22b-v1.0-250311 --dtype bfloat16 --qformat fp8 --kv_cache_dtype fp8 --calib_dataset /data/cnn_dailymail --output_dir /data/trt_llm_quantize_files/nv-ada/trt-v16-codestral-22b-v1.0-250311-fp8/2-gpu --tp_size 2
I got the following errors:
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:40<00:00, 4.49s/it]
Generating train split: 287113 examples [00:04, 70579.50 examples/s]
Generating validation split: 13368 examples [00:00, 64444.24 examples/s]
Generating test split: 11490 examples [00:00, 71494.24 examples/s]
Inserted 1179 quantizers
/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/model_quant.py:71: DeprecationWarning: forward_loop should take model as argument, but got forward_loop without any arguments. This usage will be deprecated in future versions.
warnings.warn(
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [150,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion is repeated for threads [97,0,0] through [127,0,0] of block [150,0,0])
Traceback (most recent call last):
File "/data/trt_llm_code/trt_llm_v0.16.0/tensorrtllm_backend/tensorrt_llm/examples/quantization/quantize.py", line 150, in <module>
quantize_and_export(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 620, in quantize_and_export
model = quantize_model(model, quant_cfg, calib_dataloader, batch_size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 422, in quantize_model
mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/model_quant.py", line 229, in quantize
return calibrate(model, config["algorithm"], forward_loop=forward_loop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/model_quant.py", line 105, in calibrate
max_calibrate(model, forward_loop)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/model_calib.py", line 56, in max_calibrate
forward_loop(model)
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/model_quant.py", line 81, in forward_loop
return original_forward_loop()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 342, in calibrate_loop
model(**data)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1039, in forward
outputs = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/models/mistral/modeling_mistral.py", line 816, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/models/mistral/modeling_mistral.py", line 551, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/models/mistral/modeling_mistral.py", line 491, in forward
attn_output = self.o_proj(attn_output)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/nn/modules/quant_module.py", line 86, in forward
return super().forward(input, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/nn/modules/quant_module.py", line 41, in forward
output = super().forward(input, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/opt/dynamic.py", line 806, in __getattr__
return manager.get_da_cb(name)(self, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/opt/dynamic.py", line 83, in __call__
val = cb(self_module, val)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/nn/modules/quant_module.py", line 77, in _get_quantized_weight
return module.weight_quantizer(weight)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/nn/modules/tensor_quantizer.py", line 690, in forward
self._calibrator.collect(inputs)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/calib/max.py", line 65, in collect
local_amax = quant_utils.reduce_amax(x, axis=reduce_axis).detach()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/utils.py", line 69, in reduce_amax
output = torch.maximum(torch.abs(max_val), torch.abs(min_val))
^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/usr/lib/python3.12/tempfile.py:1075: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpwom4aby1'>
_warnings.warn(warn_message, ResourceWarning)
So does TensorRT-LLM not support FP8 quantization for Codestral-22B? Thanks.
Looks like the error is in a PyTorch module. Could you print `min_val` to see what happened?
@Tracin Thanks for your reply. Could you please tell me where and how to print `min_val`?
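For reference, here is one way to do that, sketched purely from the file paths in the traceback above (an assumption about where to patch, not an official instruction): temporarily add a print in modelopt's utils.py just before the line where the traceback ends, then rerun the quantization.

# A minimal debugging sketch, assuming the installed file matches the traceback:
# edit /usr/local/lib/python3.12/dist-packages/modelopt/torch/quantization/utils.py
# around line 69 and add a print directly above the failing torch.maximum call.
# min_val and max_val are the variable names shown in the traceback; remove the
# print again once the values have been inspected.
print("reduce_amax debug -> min_val:", min_val, "max_val:", max_val, flush=True)
output = torch.maximum(torch.abs(max_val), torch.abs(min_val))

Because the underlying failure is a device-side assert (`srcIndex < srcSelectDimSize`) that CUDA reports asynchronously, the printed tensors may themselves be unreadable; rerunning the quantize command with CUDA_LAUNCH_BLOCKING=1 set, as the error message itself suggests, should make the failing kernel surface at its real call site.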
@Tracin Hi, could you please help resolve this issue?