Jason Ng

Results: 7 issues by Jason Ng

### Checked other resources - [X] I added a very descriptive title to this issue. - [X] I searched the LangChain documentation with the integrated search. - [X] I used...

Hi there! Thank you for the wonderful work, which greatly reduced the memory overhead and inference time for my use case. I noticed that the prompt compression...

question

**Description** Unable to run Triton Inference Server with TensorRT-LLM for Llama3-ChatQA-1.5-8B **Triton Information** v2.46.0 Are you using the Triton container or did you build it yourself? Using Triton container image...

Hi, I have built a TensorRT engine and tried running the command: `python3 run_server.py -p 9090 -b tensorrt -trt {path_to_engine}` but the only output that I have received...

**Description** I have noticed that there was a huge difference in memory usage for runtime buffers and decoder for llama3 and llama3.1. **Triton Information** What version of Triton are you...

**Description** I have noticed that there was a huge difference in memory usage for runtime buffers and decoder for llama3 and llama3.1. **Triton Information** What version of Triton are you...

Hi, I wonder whether any benchmarks have been done to compare the retrieval latency between using GPU and CPU? It would be great to understand the tradeoff in using LEANN on...