
[ModelLoader] Some huggingface models may contain duplicated weights

Open kerthcet opened this issue 1 year ago • 5 comments

What would you like to be added:

Take Mistral for example: the repo contains not only the chunked (sharded) model weights but also a consolidated copy of the same weights. When downloading models from huggingface, we should pay attention to this, or we will download two replicas of the model weights.
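To illustrate the duplication, here is a small sketch that simulates how `ignore_patterns` (supported by `huggingface_hub.snapshot_download`) would filter such a repo. The file names below are a hypothetical listing, not the exact contents of any Mistral repo:

```python
from fnmatch import fnmatch

# Hypothetical listing of a repo that ships both sharded weights and a
# consolidated copy of the same weights.
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "consolidated.safetensors",
]

# snapshot_download skips files matching ignore_patterns; we simulate that
# filter locally to show the effect.
ignore_patterns = ["consolidated*"]

def keep(name, patterns):
    return not any(fnmatch(name, pat) for pat in patterns)

downloaded = [f for f in repo_files if keep(f, ignore_patterns)]
print(downloaded)  # the consolidated copy is skipped
```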

Why is this needed:

Fast model loading.

Completion requirements:

This enhancement requires the following artifacts:

  • [ ] Design doc
  • [ ] API change
  • [ ] Docs update

The artifacts should be linked in subsequent comments.

kerthcet avatar Sep 14 '24 07:09 kerthcet

/kind feature

kerthcet avatar Sep 14 '24 07:09 kerthcet

In another issue, https://github.com/InftyAI/llmaz/pull/175#issuecomment-2372716947, there is a new project which shares model weights across the cluster and may change the model-loading code.

So I want to know: is it still necessary to develop this feature? This project fetches models with Python, but the new project fetches models with Go.

qinguoyi avatar Sep 26 '24 09:09 qinguoyi

Yes, we need this, because Manta may leverage this code as well; we don't want to rewrite the client code in other languages.

What I'm concerned about is how to make this a more general approach. Maybe we can add two fields to the ModelHub, allow_patterns and ignore_patterns, which will be passed to the lib directly. You can refer to the huggingface snapshot_download function for details; modelScope has similar parameters as well.

I also have a few other suggestions:

  • Remove the ThreadPoolExecutor for modelScope, because there's only one thread anyway
  • When downloading a single file with the huggingface lib, let's use hf_hub_download
  • When downloading the whole repo with the huggingface lib, let's use snapshot_download, which downloads files concurrently, so we can remove the ThreadPoolExecutor there as well

WDYT?
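A minimal sketch of what that dispatch could look like, assuming a hypothetical `plan_download` helper (the function name and shape are illustrative, not llmaz's actual code; `hf_hub_download` and `snapshot_download` are real `huggingface_hub` APIs):

```python
def plan_download(repo_id, filename=None, allow_patterns=None, ignore_patterns=None):
    """Decide which huggingface_hub call a hypothetical loader would make."""
    if filename is not None:
        # Single file: hf_hub_download fetches exactly one file.
        return ("hf_hub_download", {"repo_id": repo_id, "filename": filename})
    # Whole repo: snapshot_download handles concurrency itself, so the
    # caller no longer needs its own ThreadPoolExecutor.
    return ("snapshot_download", {
        "repo_id": repo_id,
        "allow_patterns": allow_patterns,
        "ignore_patterns": ignore_patterns,
    })

print(plan_download("mistralai/Mistral-7B-v0.1", filename="config.json")[0])
print(plan_download("mistralai/Mistral-7B-v0.1", ignore_patterns=["consolidated*"])[0])
```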

kerthcet avatar Sep 26 '24 10:09 kerthcet

I agree with you; I will implement this feature soon.

qinguoyi avatar Sep 27 '24 01:09 qinguoyi

While developing, I found that we can download one or more files using snapshot_download with allow_patterns, so a separate single-file path isn't needed.
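For example, an exact filename works as an allow pattern, so `snapshot_download(repo_id, allow_patterns=["config.json"])` fetches just that one file. The sketch below simulates the filter locally over a hypothetical file listing rather than hitting the network:

```python
from fnmatch import fnmatch

# Hypothetical repo listing; allow_patterns keeps only matching files.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

allow_patterns = ["config.json"]  # an exact name is also a valid pattern

selected = [f for f in repo_files
            if any(fnmatch(f, p) for p in allow_patterns)]
print(selected)
```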

I pushed a pull request here: https://github.com/InftyAI/llmaz/pull/178#issue-2553977136 PTAL.

qinguoyi avatar Sep 28 '24 04:09 qinguoyi

Could we close this issue now? @kerthcet

qinguoyi avatar Oct 29 '24 08:10 qinguoyi

Absolutely, fixed by https://github.com/InftyAI/llmaz/pull/178 /close

kerthcet avatar Oct 29 '24 10:10 kerthcet