Dhruv Mullick

Results: 24 comments by Dhruv Mullick

Seeing the same issue. @byshiue were you able to check?

Keeping the geekIT vertically longer, using either a Symbol or smaller text, might be good. Similarly, placing the Done/NotDone box at the right extreme might look nice, as there's...

Likewise. Imposing constraints on beam search (like HF's decoding strategies) would be invaluable.
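
For reference, the HF-style constrained beam search I have in mind looks roughly like this (a minimal sketch with a placeholder model and prompt, not TRT-LLM code):

```python
# Sketch of HF transformers constrained beam search; "gpt2" and the prompt
# are placeholders, not anything from the TRT-LLM discussion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Force the beams to include this phrase somewhere in the generated text.
force_words_ids = [tokenizer("New York", add_special_tokens=False).input_ids]

inputs = tokenizer("The best city to visit is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,                      # constrained generation requires beam search
    force_words_ids=force_words_ids,  # lexical constraint on the output
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```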

Sure, will take this up. @omri374, can you give me write access for the PR?

We certainly need this functionality. With vLLM supporting [constrained decoding](https://outlines-dev.github.io/outlines/reference/models/vllm/), the lack of it could be a dealbreaker for some TRT-LLM users. Is this on the roadmap by any chance? (pinging @ncomly-nvidia in case...
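
For context, this is roughly what constrained decoding with outlines on top of vLLM looks like, based on the linked docs. The model name and regex are placeholders, and the outlines API may differ between versions:

```python
# Minimal sketch of outlines + vLLM constrained decoding (placeholder model/prompt).
import outlines

model = outlines.models.vllm("mistralai/Mistral-7B-Instruct-v0.2")

# Constrain the output to match a simple pattern, e.g. a signed integer.
generator = outlines.generate.regex(model, r"[-+]?\d+")
answer = generator("What is 12 * 12? Answer with just the number: ")
print(answer)
```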

Facing a similar issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/577. There's no use_custom_all_reduce build option now either, so I'm not sure how to resolve this.

@byshiue, is it possible to disable it, though? I'm facing similar problems with tp>1: https://github.com/triton-inference-server/tensorrtllm_backend/issues/577

Even tried without quantization, following the steps given in the [official examples](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md):

```
python convert_checkpoint.py --model_dir meta_llama_3_8B_instruct \
    --output_dir /tmp/tllm_checkpoint_2gpu_tp2 \
    --dtype bfloat16 \
    --tp_size 2

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_2gpu_tp2 \
...
```

I tried the official image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3, which was released two days ago, and built the TRT engines from it. The problem remains, though, even with reduce_fusion enabled. Logs below:

```
...
```

@imihic, after spending a week on this, I pivoted to vLLM.