larin92
Yes, but this can be accomplished in a single execution of `session.run`. It looks like the example expects one of those lines to be commented out at a time
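For reference, a minimal sketch of what I mean, assuming onnxruntime's `InferenceSession` (the model path and tensor names below are made up): both outputs can be requested in a single `session.run` call instead of commenting one fetch out at a time.

```python
# minimal sketch, assuming onnxruntime; model path and tensor names are hypothetical
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_ids = np.ones((1, 8), dtype=np.int64)  # dummy input batch

# fetch both outputs in one run() call instead of one at a time
logits, hidden = session.run(
    ["logits", "last_hidden_state"],  # hypothetical output names
    {"input_ids": input_ids},         # hypothetical input name
)
```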
yes please, support for pre-quantized models from HuggingFace would be great. I'm not even sure I can use a multi-GPU setup for DIY quantization with TensorRT-LLM, as this file doesn't have...
> I managed to quantize Mixtral 8x7B to 4 bpw.
>
> I first tried running this command:
>
> ```shell
> model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
> model_dir="/models/$model"
> model_chkpt_dir="/models/$model--trt-chkpt"
>
> python3...
> ```
a bit off-topic, but Longformer support would be nice as well
same issue, CUDA 12.4, originally used torch==2.4, tried these (didn't help):
```
pip install torch==2.6.0.dev20240922+cu124 --index-url https://download.pytorch.org/whl/nightly/cu124;
```
```
pip install torch==2.5.0.dev20240905+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121;
```
```
pip install torch==2.6.0.dev20240923+cu121 --index-url...
```
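In case it helps with debugging: a quick sanity check (assuming a plain PyTorch install) to confirm which wheel and CUDA build actually ended up active:

```python
# quick environment check; assumes a plain PyTorch install
import torch

print(torch.__version__)          # e.g. 2.6.0.dev20240922+cu124
print(torch.version.cuda)         # CUDA version torch was built against
print(torch.cuda.is_available())  # whether torch can actually see the GPU
```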
for anyone bumping into this issue in the future: @mobicham explained on Discord that for `torchao` to work you need at least an Ampere GPU, and the same goes for `torch.compile`'ing the whole model
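A hedged sketch of that check, assuming plain PyTorch (the model below is a dummy placeholder): Ampere corresponds to compute capability 8.0, so you can gate `torch.compile` on `torch.cuda.get_device_capability()`:

```python
# minimal sketch, assuming plain PyTorch; the model is a dummy placeholder
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the real model

major, minor = torch.cuda.get_device_capability()
if major >= 8:
    # Ampere (SM 8.0) or newer: torchao and full-model torch.compile should work
    model = torch.compile(model)
else:
    print(f"compute capability {major}.{minor} < 8.0; skipping torch.compile")
```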
should probably remove `self` at [line 968](https://github.com/huggingface/transformers/pull/34237/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR968) as well
was thinking about implementing/using it for Longformers as well