@Narsil seems like this is the thread on the memory leak. For others: I don't know if you've been running it for a long time, but eventually it fails. Currently only...
This does not fix it in my case. I don't get a CUDA OOM; rather, it's the pod's RAM OOM in the form of a "transport error". I tried cuda memory...
@ncomly-nvidia seconded on adding min-p - it makes a noticeable impact in production and doesn't seem too bad to implement compared to some of the other samplers.
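For context, min-p keeps only the tokens whose probability is at least `min_p` times the probability of the most likely token, then renormalizes before sampling. A minimal numpy sketch of that filtering step, purely illustrative (the function name, the 0.05 default, and the toy logits below are made up for the example, not any backend's API):

```python
import numpy as np

def min_p_filter_and_sample(logits, min_p=0.05, rng=None):
    """Min-p sampling sketch: drop tokens whose probability falls below
    min_p * P(top token), renormalize, and sample from what's left."""
    rng = rng or np.random.default_rng()
    # softmax with max-subtraction for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # cutoff scales with the probability of the most likely token
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    kept /= kept.sum()
    return rng.choice(len(kept), p=kept)

# example: a peaked distribution leaves only a couple of candidates
token_id = min_p_filter_and_sample(np.array([5.0, 4.8, 1.0, -2.0]), min_p=0.1)
```

Unlike top-p, the cutoff adapts to how peaked the distribution is, which is presumably why it behaves noticeably better in production.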
Got it. I do think 8192 seems conservative - I see in your example performance benchmarks that a similar setup can run batch sizes of at least 64, and if 8192 is...
Maybe this is only part of it. Still seeing super long loading times (perhaps for the except branch?). It's been really hard to work with, with some requests taking 50+...
Tried the fix; it seems like it's reusing a bit less, but graph building times are still long. I definitely think there's room for improvement there and will try to simplify workflows so...
It seems like stop words also trigger this, nondeterministically. When end_id = 2 is used, or when \ is used as a stop word, it doesn't trigger the memory issue...
@byshiue if you want, I can privately share the fp8 engine we built for it (~70 GB), and you can run it directly along with the TRT backend settings...
From #448 it seems like it may be an issue with tensor parallelism. Could you try a small Llama with TP 4 and see if it runs into the same...
@jdemouth-nvidia @byshiue I am using the main branch (still the latest as of now); I built the TRTLLM & Triton backend images from source last week. Right now I'm crunched for...