Results 9 comments of njaramish

@teis-e I was able to get Llama 3 70B-Instruct with TensorRT-LLM v0.9.0 working with:
1) In `tokenizer_config.json`, change line 2055 to `"eos_token": "<|eot_id|>",`
2) `python {convert_checkpoint_path} --model_dir {model_dir} \`...
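In case it helps, a minimal sketch of the conversion and build steps; the paths, `tp_size`, and dtype are placeholders, and the flag names follow the examples/llama workflow in v0.9.0, so please verify them against your checkout:

```
# Convert the Hugging Face checkpoint into TensorRT-LLM format (placeholder paths).
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-70B-Instruct \
    --output_dir ./tllm_ckpt_llama3_70b_tp4 \
    --dtype float16 \
    --tp_size 4

# Build the engines from the converted checkpoint.
trtllm-build \
    --checkpoint_dir ./tllm_ckpt_llama3_70b_tp4 \
    --output_dir ./engines/llama3_70b_tp4 \
    --gemm_plugin float16
```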

@teis-e you need to use tp_size 2 or 4 since n_head must be divisible by tp_size. I have only tried FP8 quantization, but hopefully you would be able to...
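For reference, a rough sketch of an FP8 quantization invocation; the script path and flags are based on examples/quantization in the TensorRT-LLM repo, the paths and `tp_size` are placeholders, and the exact flag set may differ between releases:

```
# FP8 post-training quantization with tensor parallelism (illustrative; verify flags for your version).
python3 examples/quantization/quantize.py \
    --model_dir ./Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --tp_size 4 \
    --output_dir ./tllm_ckpt_llama3_70b_fp8_tp4
```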

@Tracin I noticed that documentation for Mixtral FP8 has been added: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mixtral#fp8-post-training-quantization! Thank you very much for your hard work on this feature! I am able to quantize, build, and...

@schetlur-nv thank you for the helpful information. Is adding support for priority levels possible/on the roadmap?

@schetlur-nv I see that the front page [README](https://github.com/triton-inference-server/tensorrtllm_backend/blob/f87ad6bf66c7b7862f8e1ef62a2b93d5c0069989/README.md?plain=1#L351) has been updated to include a --multi-model flag. Does this mean that this feature has been implemented, or is it still...
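For anyone else trying this, the kind of deployment I have in mind is roughly the following; the script and flag names come from the tensorrtllm_backend README, the model repository path is a placeholder, and the `--multi-model` behavior is exactly what I'm asking about, so treat this as a sketch:

```
# Launch Triton serving multiple TensorRT-LLM models from one model repository (sketch).
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /path/to/triton_model_repo \
    --multi-model
```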

Noting that when deploying with `--multi-model`, the /metrics endpoint loses some important tensorrtllm_backend-specific metrics, including:
- nv_trt_llm_request_metrics
- nv_trt_llm_runtime_memory_metrics
- nv_trt_llm_kv_cache_block_metrics
- nv_trt_llm_inflight_batcher_metrics
- nv_trt_llm_general_metrics

Would it be possible to...
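A quick way to check which backend-specific metrics are actually being exported is to scrape the Triton metrics endpoint directly (8002 is the default metrics port; adjust if you remapped it):

```
# List the TensorRT-LLM backend metrics currently exposed by Triton.
curl -s localhost:8002/metrics | grep nv_trt_llm
```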

@avianion I was able to get it working with two models that each had a batch size of 1 -- increasing the batch size beyond that caused OOM in some...

@byshiue thanks, I'll keep this in mind and create separate issues in the future. If others run into an issue similar to the one I described, the issue that I...