Tomás Domínguez Bolaño
I ran `git bisect` as suggested by @zoranbosnjak, using master as the good commit and v5.0.0 as the bad one. I used the following code to run it automatically: ```bash...
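The original script is truncated above. As a minimal sketch of the mechanism, `git bisect run` checks out each candidate commit and uses the exit code of a test command to classify it; the `test.sh` below is hypothetical, standing in for whatever build-and-test step reproduces the failure:

```bash
# Minimal git bisect automation sketch (test.sh is hypothetical)
git bisect start v5.0.0 master   # bad commit first, then the good one
git bisect run ./test.sh         # test.sh must exit 0 on good commits and
                                 # non-zero (except 125, which means "skip") on bad ones
git bisect reset                 # return to the original HEAD when done
```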
> Side note, but is there a reason that TensorRT only supports do-it-yourself quantization and not pre-quantized models like the ones TheBloke produces on Hugging Face? I can imagine a lot of...
I managed to quantize Mixtral 8x7B to 4 bpw. I first tried running this command:

```bash
model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
model_dir="/models/$model"
model_chkpt_dir="/models/$model--trt-chkpt"
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir "$model_dir" \
    --output_dir "$model_chkpt_dir" \
    --dtype float16...
```
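For reference, this is roughly what a full 4-bit weight-only conversion followed by an engine build looks like. Everything after `--dtype float16` is an assumption based on the documented options of the llama example and `trtllm-build`, not the exact truncated command above, and `model_engine_dir` is a hypothetical output path:

```bash
model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
model_dir="/models/$model"
model_chkpt_dir="/models/$model--trt-chkpt"
model_engine_dir="/models/$model--trt-engine"   # hypothetical engine output path

# Convert the HF checkpoint, quantizing weights to INT4 (~4 bpw)
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir "$model_dir" \
    --output_dir "$model_chkpt_dir" \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4

# Build the TensorRT engine from the converted checkpoint
trtllm-build \
    --checkpoint_dir "$model_chkpt_dir" \
    --output_dir "$model_engine_dir"
```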
Yes, this requires at least 100GB of VRAM. I ran the code on a system equipped with three Nvidia A100 GPUs, each with 40GB of VRAM, so 120GB in total...
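A quick way to confirm how much VRAM is actually available across the GPUs before attempting the conversion (the query fields are standard `nvidia-smi` options):

```bash
# List each GPU with its total and currently free memory
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv
```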