TensorRT-LLM
256 GB of memory is not enough (AWQ 4-bit LLaMA 70B)
TensorRT-LLM version: 0.6.1

Quantization command:

python quantize.py --model_dir ./hg_weight_3999/ --dtype float16 --qformat int4_awq --export_path ./quantized_int4-awq --calib_size 32

Log output:
Using pad_token, but it is not set yet.
[12/15/2023-01:48:38] The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [02:13<00:00, 8.91s/it]
Loading calibration dataset
Replaced 1683 modules to quantized modules
Caching activation statistics for awq_lite...
Searching awq_lite parameters...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
value = torch.tensor(value, device=self._pre_quant_scale.device)
torch.distributed not initialized, assuming single world_size. (message repeated several times)
current rank: 0, tp rank: 0, pp rank: 0
The quantization run then exhausted the host memory (all 256 GB) and the server crashed.
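For context, a rough back-of-the-envelope estimate (my own arithmetic, not from the TensorRT-LLM docs): the fp16 weights of a 70B-parameter model alone occupy roughly 130 GiB, and the AWQ calibration additionally keeps activation statistics and per-channel scales in memory, so 256 GB of host RAM leaves little headroom if any part of the model is held twice.

```python
# Back-of-the-envelope check of the host RAM needed just for the fp16 weights of
# a 70B-parameter model (ignores calibration activations, scales, and temporaries).
num_params = 70e9            # ~70 billion parameters
bytes_per_param = 2          # float16

weights_gib = num_params * bytes_per_param / 1024**3
print(f"fp16 weights alone: ~{weights_gib:.0f} GiB")  # -> ~130 GiB
```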
@jdemouth-nvidia Does anyone have a suggestion?
I ran into the same problem.
I ran into the same problem as well. Is there any solution? @busishengui @Hukongtao
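One possible direction, purely as an untested sketch: if the checkpoint is loaded through transformers (the actual loading path inside quantize.py for v0.6.1 may differ), loading it in fp16 with low_cpu_mem_usage=True streams the shards instead of materializing the full model in host RAM up front, which can lower the peak memory use.

```python
# Hypothetical sketch, not the actual quantize.py code: load the checkpoint in
# fp16 and stream the shards to reduce the peak host-RAM footprint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./hg_weight_3999/",        # same checkpoint directory as in the command above
    torch_dtype=torch.float16,  # avoid an fp32 load, which would roughly double the RAM needed
    low_cpu_mem_usage=True,     # stream shards instead of pre-allocating the whole model
)
```

Lowering --calib_size further or adding swap space are other knobs to try, though the fp16 weights of a 70B model stay around 130 GiB either way.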