Fanhai Lu
Thanks @imsujinpark! I got the same issue; after switching to the release version (v.0.109.0), I can connect my VMs.
More logs after skipping zero-length outputs (only 2 of 300 had zero length):
output_len is zero for 238th request
output_len is zero for 288th request
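For illustration, a minimal sketch of how a benchmark aggregator might skip and report zero-length outputs; the function and variable names are hypothetical, not from the actual script:

```python
# Hypothetical helper: drop zero-length outputs before computing benchmark
# stats, and log which requests produced them. Names are illustrative only.
def skip_zero_outputs(outputs):
    kept = []
    for i, out in enumerate(outputs):
        if len(out) == 0:
            print(f"output_len is zero for {i}th request")
        else:
            kept.append(out)
    print(f"skipped {len(outputs) - len(kept)} of {len(outputs)} outputs")
    return kept
```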
> > Any reason to add text back, I suggested we keep both str and id in response in #40. The answer is "don't want to decode it to...
> * When the input is text, return both text and token ids. Is it still streaming mode?
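As a rough sketch of the idea under discussion, a streamed chunk could carry both fields without breaking streaming semantics; the field names here are assumptions, not the project's actual schema:

```python
# Hypothetical per-chunk payload carrying both the raw token ids and the
# server-side decoded text. Each chunk stays incremental, so the response
# can still be streamed; the client consumes whichever field it needs.
from dataclasses import dataclass
from typing import List

@dataclass
class StreamChunk:
    token_ids: List[int]  # raw ids emitted by the model for this chunk
    text: str             # decoded text for the same ids

chunk = StreamChunk(token_ids=[4521, 291], text="Hello,")
print(chunk)
```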
> * Optimized TPU duty cycle (largest gap < 4ms)
> * Optimized TTFT: dispatch prefill tasks ASAP w/o unnecessary blocking in CPU, keep backpressure to enforce insert ASAP, return...
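A minimal sketch of the dispatch pattern described above, assuming an asyncio-style server; `run_prefill` and `insert_into_decode_batch` are stand-ins, not the real API:

```python
# Hypothetical sketch: dispatch prefill as soon as a request arrives, with a
# bounded queue providing backpressure so inserts happen ASAP without the
# CPU side blocking unnecessarily. All names here are placeholders.
import asyncio

async def run_prefill(req):
    await asyncio.sleep(0)            # stand-in for the real prefill work
    return f"kv_cache({req})"

def insert_into_decode_batch(kv_cache):
    pass                              # stand-in for the real insert step

async def producer(requests, queue: asyncio.Queue):
    for req in requests:
        task = asyncio.create_task(run_prefill(req))  # dispatch ASAP
        await queue.put(task)         # blocks only when the queue is full

async def consumer(queue: asyncio.Queue, n):
    for _ in range(n):
        task = await queue.get()
        insert_into_decode_batch(await task)
        queue.task_done()

async def main():
    queue = asyncio.Queue(maxsize=8)  # bounded queue => backpressure
    reqs = list(range(32))
    await asyncio.gather(producer(reqs, queue), consumer(queue, len(reqs)))

asyncio.run(main())
```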
Hi [richard](https://github.com/richardsliu), I tested Llama-2 7B with run_server_with_ray.py (--batch_size=32). Instead of sending requests one by one, I used a benchmark script to send 200 requests and got 198 responses back....
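For context, a benchmark client along these lines might send the requests concurrently instead of one by one; the endpoint, payload, and request count below are assumptions, not the actual script:

```python
# Hypothetical concurrent benchmark client: fire N requests at once and
# count how many come back. Endpoint and payload shape are placeholders.
import asyncio
import aiohttp

async def send_one(session, i):
    try:
        async with session.post("http://localhost:8000/generate",
                                json={"prompt": f"request {i}"}) as resp:
            return await resp.json()
    except aiohttp.ClientError:
        return None  # count this request as a missing response

async def main(n=200):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(send_one(session, i) for i in range(n)))
    ok = sum(r is not None for r in results)
    print(f"sent {n} requests, got {ok} responses back")

asyncio.run(main())
```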
@qihqi @wang2yn84 Let's revisit this issue now. Having a regression test is critical for catching performance degradation. @sixiang-google Since the infra is ready, could you work on a regression test for...
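As a sketch of what such a test could look like, a pytest-style check could fail the build when throughput drops below a recorded baseline; the baseline number and `run_benchmark` are placeholders for whatever the infra provides:

```python
# Hypothetical performance regression test. The baseline value and the
# benchmark hook are placeholders, not measured numbers or a real API.
BASELINE_TOKENS_PER_SEC = 1000.0
TOLERANCE = 0.10  # allow 10% run-to-run noise

def run_benchmark():
    # stand-in for invoking the real benchmark and parsing its output
    return 1050.0

def test_throughput_regression():
    throughput = run_benchmark()
    assert throughput >= BASELINE_TOKENS_PER_SEC * (1 - TOLERANCE), (
        f"throughput {throughput:.1f} tok/s regressed below baseline"
    )
```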