@Narsil seems like this is the thread on the memory leak. For others: I don't know if you've been running it for a long time, but eventually it fails. Currently only...
This does not fix it in my case. I don't get a CUDA OOM; rather, it's the pod's RAM OOM in the form of a "transport error". I tried cuda memory...
@ncomly-nvidia seconded on adding min-p - it makes a noticeable impact in production and doesn't seem too bad to implement compared to some of the other samplers.
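For context, min-p keeps only the tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalizes before sampling. A minimal numpy sketch of that filtering step, purely illustrative (the function name, the 0.05 default, and the toy logits below are made up for the example, not any backend's API):

```python
import numpy as np

def min_p_filter_and_sample(logits, min_p=0.05, rng=None):
    """Min-p sampling sketch: drop tokens whose probability falls below
    min_p * P(top token), renormalize, and sample from what's left."""
    rng = rng or np.random.default_rng()
    # softmax with max-subtraction for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # cutoff scales with the probability of the most likely token
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    kept /= kept.sum()
    return rng.choice(len(kept), p=kept)

# example: a peaked distribution leaves only a couple of candidates
token_id = min_p_filter_and_sample(np.array([5.0, 4.8, 1.0, -2.0]), min_p=0.1)
```

Unlike top-p, the cutoff adapts to how peaked the distribution is, which is presumably why it behaves noticeably better in production.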
Got it. I do think 8192 seems conservative - I see in your example performance benchmarks that a similar setup can run batch sizes of at least 64, and if 8192 is...
Maybe this is only part of it. Still seeing super long loading times (perhaps for the except branch?). It's been really hard to work with, with some requests taking 50+...
Tried the fix; it seems like it's reusing a bit less, but graph building times are still long. I definitely think there's room for improvement there and will try to simplify workflows so...
It seems like stop words also trigger this, nondeterministically. When end_id = 2 is used, or when \ is used as a stop word, it doesn't trigger the memory issue...
@byshiue if you want, I can privately share the fp8 engine we built for it (~70 GB), and you can run it directly along with the TRT backend settings...
From #448 it seems like it may be an issue with tensor parallelism. Could you try a small Llama with TP 4 and see if it runs into the same...
@jdemouth-nvidia @byshiue I am using the main branch (still the latest as of now); I built the TRTLLM & Triton backend images from source last week. Right now I'm crunched for...