Loading model weights more efficiently
What would you like to be added:
Right now we can download model weights from the model hub directly, but each time we start or restart a pod, it will download the model weights again. Without loading accelerators like Fluid or Dragonfly, we should think of a way to tackle this more efficiently. Let's focus on three things:
- the initial model download should be as fast as possible
- the model weights should not be downloaded again when a pod restarts
- the model cache should be handled efficiently
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
- [ ] Design doc
- [ ] API change
- [ ] Docs update
The artifacts should be linked in subsequent comments.
/milestone v0.1.0
/kind feature
/assign
We may implement a simplified P2P network for efficient model distribution. See https://github.com/InftyAI/Manta
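To make the P2P idea concrete: one common way to decide which peers hold which model chunk without central coordination is rendezvous (highest-random-weight) hashing, where every node can deterministically compute a chunk's peer set. This is only an illustrative sketch of that general technique, not Manta's actual algorithm; `peers_for_chunk` and the chunk naming are hypothetical.

```python
import hashlib

def peers_for_chunk(chunk_id: str, peers: list[str], replicas: int = 2) -> list[str]:
    """Pick a stable set of peers for a chunk via rendezvous hashing."""
    def score(peer: str) -> int:
        # Hash (peer, chunk) together; the highest-scoring peers own the chunk.
        digest = hashlib.sha256(f"{peer}/{chunk_id}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(peers, key=score, reverse=True)[:replicas]
```

A useful property for pod churn: when one peer leaves, only the chunks it owned get reassigned, while every other chunk keeps its existing peer set.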
How the transformers library handles large models: https://huggingface.co/docs/transformers/big_models
/assign
/milestone v0.2.0 as Manta needs more development time.
Generally, we have several approaches here:
- without cache: leverage GPU streaming (see https://github.com/InftyAI/llmaz/issues/352) to accelerate model loading.
- with filesystem cache: we'll use P2P technologies like Manta for in-cluster model loading, and https://github.com/InftyAI/llmaz/issues/352 can still help here by reading tensors from disk into GPU memory directly; however, we need to find out whether this is inference-engine agnostic. Enterprise support: read tensors from peers into the GPU and sync the model weights as well, which will benefit pod restarts, since tensors no longer need to be read from remote storage again. Generally, this would also benefit a future fine-tuning and training system if we want to extend the scope.
- with OCI system cache: e.g. integration with Dragonfly and a model-spec implementation, see https://github.com/CloudNativeAI/modctl/blob/main/docs/getting-started.md
Let's focus on approach 1 first, for milestone v0.2.0 specifically.