Parallelise Model Loading
Add parallel model preloading to improve inference startup time
Adds model preloading functionality to improve initial inference latency by allowing models to be loaded into memory before they're needed.
Key changes:
- Added `--preload-models` CLI arg to specify models for preloading
- Introduced `preload_model` method in the inference engine interface
- Implemented preloading in the MLX engine using existing shard loading
- Enhanced preemptive download to also preload models after download
- Added concurrent model preloading support in StandardNode (sketched below)
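A minimal sketch of how these pieces fit together. `preload_model`, the MLX engine, and `StandardNode` come from the change list above; `Shard`, `ensure_shard`, and all method bodies here are illustrative assumptions, not the actual diff:

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    # Placeholder for exo's shard descriptor (model id plus layer range).
    model_id: str


class InferenceEngine(ABC):
    """Engine interface, extended here with a preload hook."""

    @abstractmethod
    async def preload_model(self, shard: Shard) -> None:
        ...


class MLXEngine(InferenceEngine):
    async def preload_model(self, shard: Shard) -> None:
        # Reuse the existing shard-loading path so weights are resident
        # in memory before the first inference request arrives.
        await self.ensure_shard(shard)

    async def ensure_shard(self, shard: Shard) -> None:
        # Stand-in for the real MLX shard loader.
        await asyncio.sleep(0)


class StandardNode:
    def __init__(self, engine: InferenceEngine):
        self.engine = engine

    async def preload_models(self, shards: List[Shard]) -> None:
        # Load all requested models concurrently, so total preload time
        # is bounded by the slowest load rather than the sum of all loads.
        await asyncio.gather(*(self.engine.preload_model(s) for s in shards))


# Example: preload the two models from the test commands below.
asyncio.run(StandardNode(MLXEngine()).preload_models(
    [Shard("llama-3.2-1b"), Shard("llama-3.1-8b")]
))
```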
The primary motivation is reducing cold-start latency by loading models before they're needed, which is useful for deployments requiring predictable latency.
Tested with the MLX engine and verified working with preemptive downloads. Built on existing shard infrastructure; maintains backward compatibility.
Test using:

```sh
exo --preload-models model1,model2
exo --preload-models llama-3.2-1b,llama-3.1-8b
```
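For reference, the comma-separated value could be parsed with argparse roughly like this (a sketch; exo's actual CLI wiring may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Comma-separated list of model ids to load into memory at startup.
parser.add_argument(
    "--preload-models",
    type=lambda s: [m.strip() for m in s.split(",") if m.strip()],
    default=[],
)

args = parser.parse_args(["--preload-models", "llama-3.2-1b,llama-3.1-8b"])
print(args.preload_models)  # ['llama-3.2-1b', 'llama-3.1-8b']
```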
Previous PR: #360
@AlexCheema PTAL
Let me know if you need more changes.
@AlexCheema PTAL
@AlexCheema, should I run the formatter over the whole codebase, or just the files I edited?
Please respond to my review @vovw
Flooded with college work right now; I'll address these tomorrow.
Thanks so much for your contribution and for taking the time to open this PR.
Since this repository has been fully rewritten and the license has changed, I’m closing all existing open PRs to avoid confusion and to align with the new codebase.
I really appreciate your interest in the project. You're very welcome to open a new PR against the updated version if you'd like, and we look forward to reviewing it!