
[ModelLoader] Some huggingface models may contain duplicated weights

Open kerthcet opened this issue 1 year ago • 5 comments

What would you like to be added:

Take Mistral for example: the repo contains not only the chunked (sharded) model weights but also a consolidated copy of the same weights. When downloading models from huggingface, we should pay attention to this, or we will download two replicas of the model weights.
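To illustrate the duplication, here is a small sketch that simulates how `ignore_patterns` (supported by `huggingface_hub.snapshot_download`) would filter such a repo. The file names below are a hypothetical listing, not the exact contents of any Mistral repo:

```python
from fnmatch import fnmatch

# Hypothetical listing of a repo that ships both sharded weights and a
# consolidated copy of the same weights.
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "consolidated.safetensors",
]

# snapshot_download skips files matching ignore_patterns; we simulate that
# filter locally to show the effect.
ignore_patterns = ["consolidated*"]

def keep(name, patterns):
    return not any(fnmatch(name, pat) for pat in patterns)

downloaded = [f for f in repo_files if keep(f, ignore_patterns)]
print(downloaded)  # the consolidated copy is skipped
```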

Why is this needed:

Fast model loading.

Completion requirements:

This enhancement requires the following artifacts:

  • [ ] Design doc
  • [ ] API change
  • [ ] Docs update

The artifacts should be linked in subsequent comments.

kerthcet avatar Sep 14 '24 07:09 kerthcet

/kind feature

kerthcet avatar Sep 14 '24 07:09 kerthcet

In another issue, https://github.com/InftyAI/llmaz/pull/175#issuecomment-2372716947, there is a new project which shares model weights across the cluster and may change the model-loading code.

So I want to know: is it still necessary to develop this feature? This project fetches models with Python, but the new project fetches models with Go.

qinguoyi avatar Sep 26 '24 09:09 qinguoyi

Yes, we need this, because Manta may leverage this code as well; we don't want to rewrite the client code in other languages.

What I'm concerned about is how to make this a more general approach. Maybe we can add two fields to the ModelHub, allow_patterns and ignore_patterns, which will be passed to the lib directly. You can refer to the huggingface snapshot_download function for details; modelScope has similar parameters as well.

I also have a few other suggestions:

  • Remove the ThreadPoolExecutor for modelScope, because there's only one thread anyway
  • When downloading a single file with the huggingface lib, let's use hf_hub_download
  • When downloading the whole repo with the huggingface lib, let's use snapshot_download, which downloads files concurrently, so we can remove the ThreadPoolExecutor there as well

WDYT?
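A minimal sketch of what that dispatch could look like, assuming a hypothetical `plan_download` helper (the function name and shape are illustrative, not llmaz's actual code; `hf_hub_download` and `snapshot_download` are real `huggingface_hub` APIs):

```python
def plan_download(repo_id, filename=None, allow_patterns=None, ignore_patterns=None):
    """Decide which huggingface_hub call a hypothetical loader would make."""
    if filename is not None:
        # Single file: hf_hub_download fetches exactly one file.
        return ("hf_hub_download", {"repo_id": repo_id, "filename": filename})
    # Whole repo: snapshot_download handles concurrency itself, so the
    # caller no longer needs its own ThreadPoolExecutor.
    return ("snapshot_download", {
        "repo_id": repo_id,
        "allow_patterns": allow_patterns,
        "ignore_patterns": ignore_patterns,
    })

print(plan_download("mistralai/Mistral-7B-v0.1", filename="config.json")[0])
print(plan_download("mistralai/Mistral-7B-v0.1", ignore_patterns=["consolidated*"])[0])
```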

kerthcet avatar Sep 26 '24 10:09 kerthcet

I agree with you; I will implement this feature soon.

qinguoyi avatar Sep 27 '24 01:09 qinguoyi

While developing, I found that we can download one or more files using snapshot_download with allow_patterns, so a separate single-file path isn't needed.
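For example, an exact filename works as an allow pattern, so `snapshot_download(repo_id, allow_patterns=["config.json"])` fetches just that one file. The sketch below simulates the filter locally over a hypothetical file listing rather than hitting the network:

```python
from fnmatch import fnmatch

# Hypothetical repo listing; allow_patterns keeps only matching files.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

allow_patterns = ["config.json"]  # an exact name is also a valid pattern

selected = [f for f in repo_files
            if any(fnmatch(f, p) for p in allow_patterns)]
print(selected)
```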

I pushed a pull request here: https://github.com/InftyAI/llmaz/pull/178#issue-2553977136 PTAL.

qinguoyi avatar Sep 28 '24 04:09 qinguoyi

Could we close this issue now? @kerthcet

qinguoyi avatar Oct 29 '24 08:10 qinguoyi

Absolutely, fixed by https://github.com/InftyAI/llmaz/pull/178 /close

kerthcet avatar Oct 29 '24 10:10 kerthcet