Allow copying LLMs from the local network instead of downloading from Hugging Face
Implementing a feature that allows LLMs to be copied from the local network instead of downloaded from Hugging Face is an excellent way to optimize your setup, especially in multi-node environments. Here's a detailed approach to implementing it:
- Local File Server Setup:
  - Set up a lightweight file server (e.g., using Python's `http.server` or a more robust solution like nginx) on one of the nodes or a dedicated machine in your local network.
  - This server will host the model files once they're downloaded.
- Model Registry:
  - Create a simple "model registry" service that keeps track of which models are available locally.
  - This could be a simple key-value store or a database with model names/versions and their local network locations.
- Download and Share Process:
  - Modify the model loading process to follow these steps:
```python
import requests
from pathlib import Path


def get_model(model_name: str, local_cache_dir: Path) -> Path:
    registry_url = "http://registry-server:8000/models"
    local_file_server = "http://file-server:8080/models"

    # Check whether the model is already registered on the local network
    response = requests.get(f"{registry_url}/{model_name}")
    if response.status_code == 200:
        # Model is available locally
        model_url = f"{local_file_server}/{model_name}"
    else:
        # Model not available locally, fall back to Hugging Face
        model_url = f"https://huggingface.co/{model_name}/resolve/main/model.safetensors"

    # Download the model (streamed so it isn't held in memory all at once)
    response = requests.get(model_url, stream=True)
    response.raise_for_status()
    model_path = local_cache_dir / f"{model_name}.safetensors"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

    # If downloaded from Hugging Face, update the registry and copy to the file server
    if "huggingface.co" in model_url:
        requests.post(f"{registry_url}/{model_name}", json={"path": str(model_path)})
        copy_to_file_server(model_path, model_name)

    return model_path


def copy_to_file_server(model_path: Path, model_name: str):
    # Implementation depends on your file server setup.
    # This could be a file copy, an HTTP POST, or any other method to transfer the file.
    pass
```
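One possible way to fill in `copy_to_file_server`, assuming the directory served by the file server is also reachable from each node as a shared or mounted path (the `FILE_SERVER_DIR` path below is an assumption, not part of the original design):

```python
import shutil
from pathlib import Path

# Assumption: the directory the file server exposes is also mounted locally,
# e.g. over NFS or SMB. Adjust FILE_SERVER_DIR to your setup.
FILE_SERVER_DIR = Path("/mnt/model-store")


def copy_to_file_server(model_path: Path, model_name: str):
    dest = FILE_SERVER_DIR / f"{model_name}.safetensors"
    dest.parent.mkdir(parents=True, exist_ok=True)  # model names like org/repo create subdirectories
    shutil.copy2(model_path, dest)
```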
- Registry Server:
  - Implement a simple HTTP server that maintains the registry:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory registry; entries are lost on restart. Swap in a persistent
# store (e.g. SQLite or Redis) if that matters for your setup.
model_registry = {}


# The <path:...> converter lets model names containing "/" (e.g. org/repo) match.
@app.route('/models/<path:model_name>', methods=['GET'])
def check_model(model_name):
    if model_name in model_registry:
        return jsonify({"available": True, "path": model_registry[model_name]})
    return jsonify({"available": False}), 404


@app.route('/models/<path:model_name>', methods=['POST'])
def register_model(model_name):
    data = request.json
    model_registry[model_name] = data['path']
    return jsonify({"status": "registered"}), 201


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```
- File Server:
  - Set up a simple HTTP file server to serve the model files:
```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
import os


class ModelFileHandler(SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        # Serve files from MODEL_DIR (defaults to the current directory)
        super().__init__(*args, directory=os.environ.get('MODEL_DIR', '.'), **kwargs)


if __name__ == '__main__':
    httpd = HTTPServer(('0.0.0.0', 8080), ModelFileHandler)
    httpd.serve_forever()
```
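Assuming the script above is saved as, say, `file_server.py` (the filename is just an example), it can be pointed at the shared model directory via the `MODEL_DIR` environment variable, e.g. `MODEL_DIR=~/models python file_server.py`; for heavier use, nginx (mentioned above) would be the more robust choice.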
- Integration:
  - Modify your `MLXDynamicShardInferenceEngine` to use the `get_model` function:
```python
class MLXDynamicShardInferenceEngine(InferenceEngine):
    async def ensure_shard(self, shard: Shard):
        if self.shard == shard:
            return
        model_path = get_model(shard.model_id, Path.home() / '.cache' / 'huggingface' / 'hub')
        model_shard, self.tokenizer = await load_shard(model_path, shard)
        self.stateful_sharded_model = StatefulShardedModel(shard, model_shard)
        self.shard = shard
```
- Network Configuration:
  - Ensure all nodes can access the registry server and file server (a quick reachability check is sketched below).
  - Configure firewalls and network settings to allow this local traffic.
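The hostnames and ports below simply follow the examples above and would need adjusting to your network; each node can run a check like this before relying on the local path:

```python
import socket

# Hostnames/ports follow the examples above; adjust to your network.
ENDPOINTS = [("registry-server", 8000), ("file-server", 8080)]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK: {host}:{port} is reachable")
    except OSError as e:
        print(f"FAIL: {host}:{port} not reachable ({e})")
```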
- Error Handling and Fallback:
  - Implement robust error handling to fall back to Hugging Face if the local download fails (sketched below).
  - Add retry logic for network issues.
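A minimal sketch of that retry-and-fallback behaviour, wrapping the streamed download used in `get_model` above (the retry counts and timeouts are arbitrary choices, not part of the original design):

```python
import time
import requests


def download_with_retries(url: str, dest: str, retries: int = 3, backoff: float = 2.0) -> bool:
    """Try to stream `url` to `dest`, retrying on transient network errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return True
        except requests.RequestException as e:
            print(f"Attempt {attempt}/{retries} failed for {url}: {e}")
            time.sleep(backoff * attempt)
    return False


def fetch_model(local_url: str, hf_url: str, dest: str) -> None:
    # Prefer the local file server; fall back to Hugging Face only if that fails.
    if not download_with_retries(local_url, dest) and not download_with_retries(hf_url, dest):
        raise RuntimeError(f"Could not download model from {local_url} or {hf_url}")
```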
- Version Control:
  - Include version information in your model registry to ensure all nodes use the same model version (see the example below).
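For example, the registration payload could carry a revision identifier alongside the path (the field names here are just a suggestion; the revision could be the Hugging Face commit hash):

```python
import requests

# Hypothetical registration payload with version information.
requests.post(
    "http://registry-server:8000/models/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    json={
        "path": "/models/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit.safetensors",
        "revision": "main",  # pin a specific commit hash here for reproducibility
    },
)
```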
- Security Considerations:
  - Implement authentication for the registry and file server if needed (a sketch follows below).
  - Use HTTPS for local transfers if security is a concern.
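If authentication is needed, one lightweight option is a shared token checked on every registry request; this is only a sketch (the environment variable name and `/health` endpoint are made up), not a complete security solution:

```python
import os
from flask import Flask, request, abort, jsonify

app = Flask(__name__)
API_TOKEN = os.environ.get("MODEL_REGISTRY_TOKEN", "")  # shared secret distributed to nodes


@app.before_request
def check_token():
    # Reject any request that doesn't carry the expected bearer token.
    if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
        abort(401)


@app.route("/health")
def health():
    return jsonify({"status": "ok"})
```

Clients would then send an `Authorization: Bearer <token>` header with each `requests` call.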
This implementation allows nodes to check a local registry first, download from a local file server if available, and fall back to Hugging Face only when necessary. The first node to download a model will make it available to all other nodes, significantly reducing bandwidth usage and download times for subsequent nodes.
I'm currently copying the directories manually from the .cache/huggingface/hub models directories to the other machines using AirDrop. This works and is obviously faster than downloading from Hugging Face.

```
models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit
models--mlx-community--Meta-Llama-3.1-8B-Instruct-4bit
```
This would be great. I think the difficult part is doing this in a way that's compatible with one of exo's core design philosophies: node equality. I don't want to have a master file server. The way we could do it is: nodes first ask all their peers if they have a model file before going to Hugging Face, and send it p2p.
Related: #80 #70 #16
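A rough sketch of that peer-first lookup, with every node treated equally and Hugging Face only as the fallback; the `/has_model` and `/models` endpoints here are purely hypothetical, not an existing exo interface:

```python
import requests


def find_model_on_peers(model_id: str, peers: list[str]) -> str | None:
    """Ask each known peer whether it already has the model; return a download URL if one does."""
    for peer in peers:
        try:
            # Hypothetical endpoint every node could expose.
            r = requests.get(f"http://{peer}/has_model/{model_id}", timeout=2)
            if r.status_code == 200 and r.json().get("available"):
                return f"http://{peer}/models/{model_id}"
        except requests.RequestException:
            continue  # peer unreachable, try the next one
    return None


def resolve_model_url(model_id: str, peers: list[str]) -> str:
    # Prefer a peer copy; go to Hugging Face only if no peer has the model.
    return find_model_on_peers(model_id, peers) or (
        f"https://huggingface.co/{model_id}/resolve/main/model.safetensors"
    )
```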
@AlexCheema Hello, is there a simpler way to load models locally? I can't connect to Hugging Face, so I can't get this project running. Thanks.
There are also existing tools that could be bolted on, or that people could easily set up themselves for this scenario. SyncThing and LocalSend come to mind immediately. I think SyncThing can operate in a LAN-only setup and could possibly reuse the peer configuration already taking place in exo.
But Dave's Garage did mention that downloading the models took a while even on his super fast network, so I'm guessing Hugging Face has a speed limit.
I think in the future we could use direct p2p access for this.
I think that would work better than having one central server (Hugging Face would be the only central network).
I also think we could divide the downloading task amongst multiple peers. Think BitTorrent?
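Something BitTorrent-like could be approximated with HTTP Range requests, pulling different byte ranges of the same file from different peers that already have it; a sketch under the assumption that peers serve model files over HTTP with Range support:

```python
import requests
from concurrent.futures import ThreadPoolExecutor


def fetch_range(url: str, start: int, end: int) -> bytes:
    # Ask one peer for a specific byte range of the file.
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    r.raise_for_status()
    return r.content


def parallel_download(peer_urls: list[str], total_size: int, dest: str) -> None:
    """Split the file into one chunk per peer and fetch the chunks in parallel."""
    n = len(peer_urls)
    chunk = total_size // n
    ranges = [(i * chunk, (i + 1) * chunk - 1 if i < n - 1 else total_size - 1) for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(lambda job: fetch_range(*job),
                              [(url, s, e) for url, (s, e) in zip(peer_urls, ranges)]))
    with open(dest, "wb") as f:
        for part in parts:
            f.write(part)
```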