Loading model weights more efficiently
What would you like to be added:
Right now we can download model weights from the model hub directly, but each time we start or restart a pod, it will download the model weights again. Without loading accelerators like Fluid or Dragonfly, we should think of a way to tackle this more efficiently. Let's focus on three things:
- the initial model download should be as fast as possible
- the model weights should not be downloaded again when a pod restarts
- the model cache should be handled efficiently
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
- [ ] Design doc
- [ ] API change
- [ ] Docs update
The artifacts should be linked in subsequent comments.
/milestone v0.1.0
/kind feature
/assign
We may implement a simplified P2P network for efficient model distribution. See https://github.com/InftyAI/Manta
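To make the P2P idea concrete: one common way to decide which peers hold which model chunk without central coordination is rendezvous (highest-random-weight) hashing, where every node can deterministically compute a chunk's peer set. This is only an illustrative sketch of that general technique, not Manta's actual algorithm; `peers_for_chunk` and the chunk naming are hypothetical.

```python
import hashlib

def peers_for_chunk(chunk_id: str, peers: list[str], replicas: int = 2) -> list[str]:
    """Pick a stable set of peers for a chunk via rendezvous hashing."""
    def score(peer: str) -> int:
        # Hash (peer, chunk) together; the highest-scoring peers own the chunk.
        digest = hashlib.sha256(f"{peer}/{chunk_id}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(peers, key=score, reverse=True)[:replicas]
```

A useful property for pod churn: when one peer leaves, only the chunks it owned get reassigned, while every other chunk keeps its existing peer set.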
How the transformers library handles large models: https://huggingface.co/docs/transformers/big_models
/assign
/milestone v0.2.0 as Manta needs more development time.
Generally, we have several approaches here:
- without cache: leverage GPU streaming (see https://github.com/InftyAI/llmaz/issues/352) to accelerate model loading.
- with filesystem cache: we'll use P2P technologies like Manta for in-cluster model loading, and https://github.com/InftyAI/llmaz/issues/352 can still help here by reading tensors from disk into GPU memory directly; however, we need to find out whether this is inference-engine agnostic. Enterprise support: read tensors from peers into the GPU and sync the model weights as well, which will benefit pod restarts, since tensors no longer need to be read from remote storage again. Generally, this would also benefit a future fine-tuning and training system if we want to extend the scope.
- with OCI system cache: e.g. integration with Dragonfly and a model-spec implementation, see https://github.com/CloudNativeAI/modctl/blob/main/docs/getting-started.md
Let's focus on approach 1 first, for milestone v0.2.0 specifically.