Erin
Hi @atyshka, the TensorRT Model Optimizer team is aware of this and similar requests. We've started planning to publish quantized checkpoints and the exported models on the HuggingFace model hub. If you...
Hi @atyshka, we have a few llama models like https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8 uploaded, and we're uploading more (e.g., Medusa checkpoint). Legal clearance took a while.
Hi @BugsBuggy, we did a reference run using a non-TRT-LLM deployment framework with the same Mixtral-8x7B checkpoints and configs (sampling config, max_output_len, etc.) and observed the same repetitive answers as...
Hi @xiangxinhello, I tried again w/ `tensorrt-llm 0.11.0` with Mixtral 8x7B, `top_k=0` (the minimal value; it should be 0 rather than -1), and `top_p=1`, and it doesn't produce repetitive answers....
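To make the sampling-config convention above concrete, here is a minimal, self-contained sketch of top-k / top-p (nucleus) truncation in plain Python. This is an illustration of the semantics only, not TensorRT-LLM's actual implementation: `top_k=0` disables the top-k filter and `top_p=1.0` disables the nucleus filter, so greedy repetition is avoided only by the model itself, not by truncation.

```python
import math

def filter_logits(logits, top_k=0, top_p=1.0):
    """Illustrative top-k / top-p truncation (not TensorRT-LLM's code).

    top_k=0 disables top-k filtering; top_p=1.0 disables nucleus
    filtering, matching the convention discussed in the thread.
    Returns a dict {token_index: renormalized probability}.
    """
    # Softmax the logits into probabilities (numerically stable form).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    keep = set(order)
    if top_k > 0:
        # Top-k: keep only the k most likely tokens.
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Top-p: keep the smallest prefix whose cumulative mass >= top_p.
        cum, nucleus = 0.0, []
        for i in order:
            nucleus.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        keep &= set(nucleus)

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}
```

With `top_k=0, top_p=1.0` every token survives; with `top_k=2` only the two most likely tokens remain and their probabilities are renormalized to sum to 1.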
This has been merged. Thanks.
To verify the fix, let's see whether we can pass CI consistently, which will include the Ray stages w/ RPC. But this might be tricky since CI itself is quite...