TensorRT-LLM
256 GB of memory is not enough (AWQ 4-bit LLaMA 70B)
TensorRT-LLM version: 0.6.1

Quantization command:

python quantize.py --model_dir ./hg_weight_3999/ --dtype float16 --qformat int4_awq --export_path ./quantized_int4-awq --calib_size 32

Log output:
Using pad_token, but it is not set yet.
[12/15/2023-01:48:38] The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [02:13<00:00, 8.91s/it]
Loading calibration dataset
Replaced 1683 modules to quantized modules
Caching activation statistics for awq_lite...
Searching awq_lite parameters...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
value = torch.tensor(value, device=self._pre_quant_scale.device)
torch.distributed not initialized, assuming single world_size. (message repeated several times)
current rank: 0, tp rank: 0, pp rank: 0
The quantization run then exhausted the host memory (all 256 GB) and the server crashed.
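For context, a rough back-of-the-envelope estimate (my own arithmetic, not from the TensorRT-LLM docs): the fp16 weights of a 70B-parameter model alone occupy roughly 130 GiB, and the AWQ calibration additionally keeps activation statistics and per-channel scales in memory, so 256 GB of host RAM leaves little headroom if any part of the model is held twice.

```python
# Back-of-the-envelope check of the host RAM needed just for the fp16 weights of
# a 70B-parameter model (ignores calibration activations, scales, and temporaries).
num_params = 70e9            # ~70 billion parameters
bytes_per_param = 2          # float16

weights_gib = num_params * bytes_per_param / 1024**3
print(f"fp16 weights alone: ~{weights_gib:.0f} GiB")  # -> ~130 GiB
```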
@jdemouth-nvidia Does anyone have a suggestion?
I ran into the same problem.
I ran into the same problem as well. Is there any solution? @busishengui @Hukongtao
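One possible direction, purely as an untested sketch: if the checkpoint is loaded through transformers (the actual loading path inside quantize.py for v0.6.1 may differ), loading it in fp16 with low_cpu_mem_usage=True streams the shards instead of materializing the full model in host RAM up front, which can lower the peak memory use.

```python
# Hypothetical sketch, not the actual quantize.py code: load the checkpoint in
# fp16 and stream the shards to reduce the peak host-RAM footprint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./hg_weight_3999/",        # same checkpoint directory as in the command above
    torch_dtype=torch.float16,  # avoid an fp32 load, which would roughly double the RAM needed
    low_cpu_mem_usage=True,     # stream shards instead of pre-allocating the whole model
)
```

Lowering --calib_size further or adding swap space are other knobs to try, though the fp16 weights of a 70B model stay around 130 GiB either way.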