
Loading model weights more efficiently

Open kerthcet opened this issue 1 year ago • 9 comments

What would you like to be added:

Right now we can download model weights directly from the model hub, but each time we start or restart a pod, it downloads the model weights again. Without loading accelerators like Fluid or Dragonfly, we should find a way to tackle this more efficiently. Let's focus on three things:

  • the first download of a model should be as fast as possible
  • model weights should not be downloaded again when a pod restarts
  • the model cache should be managed efficiently
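The "no re-download on restart" goal can be sketched as a cache check keyed by model id, assuming the cache root sits on a volume that survives pod restarts (e.g. a PVC or hostPath mount). All names below are illustrative, not llmaz's actual API:

```python
import hashlib
from pathlib import Path

# Illustrative cache root; in a pod this should live on a persistent
# volume so it outlives restarts.
CACHE_ROOT = Path("/models/cache")

def cached_path(model_id: str) -> Path:
    # Key the cache by a hash of the model id so arbitrary hub ids
    # (e.g. "org/name") map to safe directory names.
    digest = hashlib.sha256(model_id.encode()).hexdigest()[:16]
    return CACHE_ROOT / digest

def ensure_model(model_id: str, download) -> Path:
    """Return the local weights dir, downloading only on a cache miss."""
    path = cached_path(model_id)
    marker = path / ".complete"  # marker avoids trusting a partial download
    if marker.exists():
        return path
    path.mkdir(parents=True, exist_ok=True)
    download(model_id, path)     # caller supplies the actual fetch logic
    marker.touch()
    return path
```

The `.complete` marker matters: a pod killed mid-download must not be treated as a cache hit on the next start.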

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • [ ] Design doc
  • [ ] API change
  • [ ] Docs update

The artifacts should be linked in subsequent comments.

kerthcet avatar Sep 02 '24 07:09 kerthcet

/milestone v0.1.0

kerthcet avatar Sep 02 '24 07:09 kerthcet

/kind feature

kerthcet avatar Sep 02 '24 07:09 kerthcet

/assign

kerthcet avatar Sep 14 '24 04:09 kerthcet

We may implement a simplified P2P network for efficient model distribution. See https://github.com/InftyAI/Manta
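Manta's actual design isn't spelled out here, but the core scheduling idea of a P2P layer can be sketched as a chunk registry: prefer pulling each chunk from an in-cluster peer that already holds it, and fall back to the hub otherwise. Everything below (class and method names) is hypothetical:

```python
from typing import Dict, List, Set

class ChunkRegistry:
    """Toy registry of which peers hold which chunks of a model.

    A real P2P layer would add peer discovery, transfer, and integrity
    checks; this only shows the fetch-planning idea.
    """

    def __init__(self) -> None:
        self._holders: Dict[str, Set[str]] = {}

    def announce(self, peer: str, chunk: str) -> None:
        # A peer advertises that it holds a chunk.
        self._holders.setdefault(chunk, set()).add(peer)

    def plan_fetch(self, chunks: List[str]) -> Dict[str, str]:
        # Map each chunk to a source: some peer if available, else "hub".
        plan: Dict[str, str] = {}
        for chunk in chunks:
            holders = self._holders.get(chunk)
            plan[chunk] = sorted(holders)[0] if holders else "hub"
        return plan
```

With such a plan, only chunks no peer holds hit the remote hub, which is where the "first download as fast as possible" and restart goals overlap.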

kerthcet avatar Sep 18 '24 07:09 kerthcet

How Transformers handles large models: https://huggingface.co/docs/transformers/big_models
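One relevant detail from that doc: big checkpoints are sharded, with an index file whose `weight_map` maps each tensor name to the shard file containing it. A small sketch of consuming such an index (grouping requested tensors by shard so each shard file is opened once; the function name is made up here):

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

def shards_for_tensors(index_file: Path, wanted: List[str]) -> Dict[str, List[str]]:
    """Group requested tensor names by the shard file that holds them,
    using a sharded-checkpoint index ("weight_map": tensor -> shard)."""
    index = json.loads(index_file.read_text())
    weight_map = index["weight_map"]
    by_shard: Dict[str, List[str]] = defaultdict(list)
    for name in wanted:
        by_shard[weight_map[name]].append(name)
    return dict(by_shard)
```

For caching, this sharding is useful: shards can be fetched, verified, and cached independently, which fits chunk-level P2P distribution well.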

kerthcet avatar Sep 20 '24 05:09 kerthcet

/assign

kerthcet avatar Oct 08 '24 07:10 kerthcet

/milestone v0.2.0 as Manta needs more development time.

kerthcet avatar Dec 24 '24 14:12 kerthcet

Generally, we have several approaches here:

  • Without a cache: leverage GPU streaming (see https://github.com/InftyAI/llmaz/issues/352) to accelerate model loading.
  • With a filesystem cache: use P2P technologies like Manta for in-cluster model loading. https://github.com/InftyAI/llmaz/issues/352 can still help here by reading tensors from disk directly into GPU memory, though we need to find out whether this is inference-engine agnostic. Enterprise support: read tensors from peers into the GPU and sync the model weights as well, which benefits pod restarts since tensors no longer need to be read from remote again. Generally, this would also benefit future fine-tuning and training systems if we extend the scope later.
  • With an OCI system cache: e.g. integration with Dragonfly and a model-spec implementation, see https://github.com/CloudNativeAI/modctl/blob/main/docs/getting-started.md
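The OCI-cache approach treats weights as content-addressed blobs. A minimal sketch of that idea (digest-keyed storage with dedup and integrity verification; this is not modctl's actual layout, just the principle):

```python
import hashlib
from pathlib import Path

class BlobStore:
    """Minimal content-addressed store in the spirit of an OCI layout:
    blobs are keyed by their sha256 digest, so identical weights are
    stored once and a restarted pod can verify and reuse them without
    re-downloading."""

    def __init__(self, root: Path) -> None:
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        path = self.root / digest.replace(":", "_")
        if not path.exists():  # dedup: identical content stored only once
            path.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        data = (self.root / digest.replace(":", "_")).read_bytes()
        # Verify integrity before trusting cached content.
        assert "sha256:" + hashlib.sha256(data).hexdigest() == digest
        return data
```

Content addressing is also what lets a Dragonfly-style distributor and a local cache agree on whether two nodes hold the same layer without comparing bytes.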

kerthcet avatar Apr 18 '25 09:04 kerthcet

Let's focus on approach 1 first, targeting milestone v0.2.0.

kerthcet avatar Apr 18 '25 09:04 kerthcet