larin92

8 comments by larin92

Yes, but this can be accomplished in a single `session.run` call. It looks like only one of those lines is meant to be active at a time (the other commented out)
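
For illustration, a minimal sketch (assuming ONNX Runtime; the model file, input name, and shapes are hypothetical placeholders) of fetching multiple outputs with one `session.run` call:

```python
# Minimal sketch: one session.run call can return several outputs at once,
# so there is no need to run the session separately per output.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # hypothetical model file
feed = {"input_ids": np.zeros((1, 8), dtype=np.int64)}  # hypothetical input name

# Passing None as output_names asks for every output the model defines.
outputs = session.run(None, feed)
first, second = outputs[0], outputs[1]
```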

Yes please, support for pre-quantized models from HuggingFace would be great. I'm not even sure I can use a multi-GPU setup for DIY quantization using TensorRT-LLM, as this file doesn't have...

> I managed to quantize Mixtral 8x7B to 4 bpw.
>
> I first tried running this command:
>
> ```shell
> model="models--mistralai--Mixtral-8x7B-Instruct-v0.1"
> model_dir="/models/$model"
> model_chkpt_dir="/models/$model--trt-chkpt"
>
> python3...
> ```

A bit off-topic, but Longformer support would be nice as well

Same issue, CUDA 12.4, originally used torch==2.4. Tried these (didn't help):

```
pip install torch==2.6.0.dev20240922+cu124 --index-url https://download.pytorch.org/whl/nightly/cu124;
```

```
pip install torch==2.5.0.dev20240905+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121;
```

```
pip install torch==2.6.0.dev20240923+cu121 --index-url...
```

For anyone bumping into this issue in the future: @mobicham explained on Discord that `torchao` requires at least an Ampere GPU to work, and the same goes for `torch.compile`'ing the whole model
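
As a rough illustration, a minimal sketch of gating those paths on compute capability (Ampere is SM 8.x); the tiny `torch.nn.Linear` model here is just a stand-in:

```python
# Minimal sketch: only enable torchao / full-model torch.compile on
# Ampere-or-newer GPUs (compute capability >= 8.0), per the note above.
import torch

def is_ampere_or_newer() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # SM 8.x (Ampere) and later

model = torch.nn.Linear(8, 8)  # stand-in for the real model
if is_ampere_or_newer():
    model = torch.compile(model)
```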

Should probably remove `self` at [line 968](https://github.com/huggingface/transformers/pull/34237/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR968) as well

Was thinking about implementing/using it for Longformer as well