Performance optimized interleaved mode JetStream server

Open JoeZijunZhou opened this issue 1 year ago • 2 comments

  • Optimized TPU duty cycle (largest gap < 4 ms).
  • Optimized TTFT: dispatch prefill tasks ASAP without unnecessary blocking on the CPU, keep backpressure so that inserts happen ASAP, and return the first token ASAP.
  • Optimized TPOT: run the generate and detokenize tasks sequentially without unnecessary blocking on the CPU.
  • Optimized output token throughput: prioritize prefill and balance TTFT against decode in high-throughput situations.
  • Tested with the llama2-70b JetStream MaxText server on a v5e-8 VM.
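
For readers unfamiliar with the two latency metrics named above, here is a minimal sketch of how TTFT (time to first token) and TPOT (time per output token) can be computed from per-token timestamps. The function names and the timestamp inputs are assumptions made for illustration, and the TPOT formula follows one common convention (mean gap after the first token); it is not necessarily the exact definition used by the JetStream benchmarks.

```python
def ttft(request_time, token_times):
    """Time to first token: delay from request arrival to the first token."""
    if not token_times:
        raise ValueError("need at least one token to measure TTFT")
    return token_times[0] - request_time


def tpot(token_times):
    """Time per output token: mean gap between tokens after the first one."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in seconds: request at t=10.0, three tokens returned.
example_token_times = [10.5, 11.0, 11.5]
example_ttft = ttft(10.0, example_token_times)
example_tpot = tpot(example_token_times)
```

Under this convention, a slow prefill raises TTFT only, while gaps between generate steps raise TPOT.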

JoeZijunZhou avatar Jul 26 '24 10:07 JoeZijunZhou

Optimizing TTFT and optimizing output token throughput conflict with each other. Can we expose a parameter to tune the trade-off between the two?

FanhaiLu1 avatar Jul 29 '24 18:07 FanhaiLu1

Currently, the server prioritizes prefills in interleaved mode and applies the correct JAX blocking for the async copy to host, which reduces wasted wait time. One more optimization remains: ensure the result is returned immediately once the return channel has it (from the orchestrator).
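
As a rough illustration of what "prioritize prefills in interleaved mode" can mean, here is a toy scheduling loop in plain Python. It is a sketch under stated assumptions, not the JetStream orchestrator: `run_interleaved`, `max_slots`, and `tokens_per_request` are hypothetical names, every request is assumed to produce the same number of tokens, and the limit on decode slots stands in for the backpressure mentioned earlier in the thread.

```python
from collections import deque


def run_interleaved(prefill_requests, max_slots, tokens_per_request):
    """Simulate a prefill-first interleaved loop and return the step trace."""
    pending = deque(prefill_requests)  # requests still waiting for prefill
    active = {}                        # request id -> tokens left to decode
    trace = []
    while pending or active:
        if pending and len(active) < max_slots:
            # Prefill wins whenever a decode slot is free (backpressure
            # otherwise holds the request in the pending queue).
            req = pending.popleft()
            trace.append(f"prefill:{req}")
            # Prefill already yields the first token.
            if tokens_per_request > 1:
                active[req] = tokens_per_request - 1
        else:
            # One generate step advances every active request by one token.
            trace.append("generate:" + ",".join(sorted(active)))
            for req in list(active):
                active[req] -= 1
                if active[req] <= 0:
                    del active[req]
    return trace


# Three requests, two decode slots, three tokens each (hypothetical values).
trace = run_interleaved(["a", "b", "c"], max_slots=2, tokens_per_request=3)
```

In this toy model, request `c` waits until a slot frees up, which is where the tension between TTFT and output token throughput raised above shows up: a larger `max_slots` admits prefills sooner but makes each generate step serve more requests.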

JoeZijunZhou avatar Aug 05 '24 23:08 JoeZijunZhou