What is the difference between mlx model and hugging face model?
What is the difference between an MLX model and a Hugging Face model? I notice there is a weight file, *.npz. Is this file part of the MLX model, and if I want to deploy the MLX model, should I include it?
There are different implementations of these models available, using either PyTorch or Hugging Face Transformers. The models in mlx-lm follow the Hugging Face Transformers implementation, which is why you can load Hugging Face models (unquantized) directly with the mlx-lm library.
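As for the *.npz file: it is a standard NumPy archive that MLX uses to store converted model weights, so yes, it is part of the MLX model and must ship with it for deployment. A minimal sketch of what such a checkpoint looks like, using a made-up weight name purely for illustration:

```python
import numpy as np

# Hypothetical weight tensor; real checkpoints map many parameter
# names (e.g. per-layer attention/MLP weights) to arrays like this.
weights = {"layers.0.attention.wq.weight": np.zeros((8, 8), dtype=np.float16)}

# An MLX *.npz checkpoint is just these name -> array pairs saved together.
np.savez("weights.npz", **weights)

# Loading it back gives you the same mapping, which MLX reads at startup.
loaded = np.load("weights.npz")
print(list(loaded.keys()))   # the parameter names stored in the archive
```

Because it is a plain NumPy format, you can open any MLX .npz file this way to inspect which parameters it contains and their shapes.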
In terms of deployment, I assume you want to host the fine-tuned model for some inference task. You can use the mlx-lm library for inference directly. If you want to use other tools such as TGI or llama.cpp, you have to use fuse.py to merge your adapter into the base model and de-quantize the weights, then follow the workflow of whichever tool you prefer.
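That workflow might look roughly like the following. This is a sketch, not a definitive recipe: the model name and paths are placeholders, and the flag names are taken from mlx-lm's fuse script, so check `--help` against your installed version.

```shell
# Merge the LoRA adapter into the base model (paths are placeholders).
# --de-quantize converts quantized weights back to full precision so that
# tools like llama.cpp or TGI can consume the result; skip it if the base
# model was never quantized.
python -m mlx_lm.fuse \
    --model mlx-community/My-Base-Model \
    --adapter-path ./adapters \
    --save-path ./fused_model \
    --de-quantize

# Sanity-check the fused model with mlx-lm itself before handing it
# off to another inference tool.
python -m mlx_lm.generate --model ./fused_model --prompt "Hello"
```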
If I didn't use convert.py to quantize the model, I think there is no need to de-quantize the weights in that last step.