Allow copying LLMs from the local network instead of downloading from Hugging Face
Implementing a feature that allows LLMs to be copied from the local network instead of downloaded from Hugging Face is an excellent way to optimize your setup, especially in multi-node environments. Here's a detailed approach to implementing it:
- Local File Server Setup:
  - Set up a lightweight file server (e.g., using Python's `http.server` or a more robust solution like nginx) on one of the nodes or a dedicated machine in your local network.
  - This server will host the model files once they're downloaded.
- Model Registry:
  - Create a simple "model registry" service that keeps track of which models are available locally.
  - This could be a simple key-value store or a database with model names/versions and their local network locations.
- Download and Share Process:
  - Modify the model loading process to follow these steps:
```python
import requests
from pathlib import Path


def get_model(model_name: str, local_cache_dir: Path) -> Path:
    registry_url = "http://registry-server:8000/models"
    local_file_server = "http://file-server:8080/models"

    # Check whether the model is already registered on the local network
    response = requests.get(f"{registry_url}/{model_name}")
    if response.status_code == 200:
        # Model is available locally
        model_url = f"{local_file_server}/{model_name}"
    else:
        # Model not available locally, fall back to Hugging Face
        model_url = f"https://huggingface.co/{model_name}/resolve/main/model.safetensors"

    # Download the model (streamed so it isn't held in memory all at once)
    response = requests.get(model_url, stream=True)
    response.raise_for_status()
    model_path = local_cache_dir / f"{model_name}.safetensors"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

    # If downloaded from Hugging Face, update the registry and copy to the file server
    if "huggingface.co" in model_url:
        requests.post(f"{registry_url}/{model_name}", json={"path": str(model_path)})
        copy_to_file_server(model_path, model_name)

    return model_path


def copy_to_file_server(model_path: Path, model_name: str):
    # Implementation depends on your file server setup.
    # This could be a file copy, an HTTP POST, or any other method to transfer the file.
    pass
```
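One possible way to fill in `copy_to_file_server`, assuming the directory served by the file server is also reachable from each node as a shared or mounted path (the `FILE_SERVER_DIR` path below is an assumption, not part of the original design):

```python
import shutil
from pathlib import Path

# Assumption: the directory the file server exposes is also mounted locally,
# e.g. over NFS or SMB. Adjust FILE_SERVER_DIR to your setup.
FILE_SERVER_DIR = Path("/mnt/model-store")


def copy_to_file_server(model_path: Path, model_name: str):
    dest = FILE_SERVER_DIR / f"{model_name}.safetensors"
    dest.parent.mkdir(parents=True, exist_ok=True)  # model names like org/repo create subdirectories
    shutil.copy2(model_path, dest)
```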
- Registry Server:
  - Implement a simple HTTP server that maintains the registry:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory registry; entries are lost on restart. Swap in a persistent
# store (e.g. SQLite or Redis) if that matters for your setup.
model_registry = {}


# The <path:...> converter lets model names containing "/" (e.g. org/repo) match.
@app.route('/models/<path:model_name>', methods=['GET'])
def check_model(model_name):
    if model_name in model_registry:
        return jsonify({"available": True, "path": model_registry[model_name]})
    return jsonify({"available": False}), 404


@app.route('/models/<path:model_name>', methods=['POST'])
def register_model(model_name):
    data = request.json
    model_registry[model_name] = data['path']
    return jsonify({"status": "registered"}), 201


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```
- File Server:
  - Set up a simple HTTP file server to serve the model files:
```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
import os


class ModelFileHandler(SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        # Serve files from MODEL_DIR (defaults to the current directory)
        super().__init__(*args, directory=os.environ.get('MODEL_DIR', '.'), **kwargs)


if __name__ == '__main__':
    httpd = HTTPServer(('0.0.0.0', 8080), ModelFileHandler)
    httpd.serve_forever()
```
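Assuming the script above is saved as, say, `file_server.py` (the filename is just an example), it can be pointed at the shared model directory via the `MODEL_DIR` environment variable, e.g. `MODEL_DIR=~/models python file_server.py`; for heavier use, nginx (mentioned above) would be the more robust choice.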
- Integration:
  - Modify your `MLXDynamicShardInferenceEngine` to use the `get_model` function:
```python
class MLXDynamicShardInferenceEngine(InferenceEngine):
    async def ensure_shard(self, shard: Shard):
        if self.shard == shard:
            return
        model_path = get_model(shard.model_id, Path.home() / '.cache' / 'huggingface' / 'hub')
        model_shard, self.tokenizer = await load_shard(model_path, shard)
        self.stateful_sharded_model = StatefulShardedModel(shard, model_shard)
        self.shard = shard
```
- Network Configuration:
  - Ensure all nodes can access the registry server and file server (a quick reachability check is sketched below).
  - Configure firewalls and network settings to allow this local traffic.
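The hostnames and ports below simply follow the examples above and would need adjusting to your network; each node can run a check like this before relying on the local path:

```python
import socket

# Hostnames/ports follow the examples above; adjust to your network.
ENDPOINTS = [("registry-server", 8000), ("file-server", 8080)]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK: {host}:{port} is reachable")
    except OSError as e:
        print(f"FAIL: {host}:{port} not reachable ({e})")
```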
- Error Handling and Fallback:
  - Implement robust error handling to fall back to Hugging Face if the local download fails (sketched below).
  - Add retry logic for network issues.
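A minimal sketch of that retry-and-fallback behaviour, wrapping the streamed download used in `get_model` above (the retry counts and timeouts are arbitrary choices, not part of the original design):

```python
import time
import requests


def download_with_retries(url: str, dest: str, retries: int = 3, backoff: float = 2.0) -> bool:
    """Try to stream `url` to `dest`, retrying on transient network errors."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return True
        except requests.RequestException as e:
            print(f"Attempt {attempt}/{retries} failed for {url}: {e}")
            time.sleep(backoff * attempt)
    return False


def fetch_model(local_url: str, hf_url: str, dest: str) -> None:
    # Prefer the local file server; fall back to Hugging Face only if that fails.
    if not download_with_retries(local_url, dest) and not download_with_retries(hf_url, dest):
        raise RuntimeError(f"Could not download model from {local_url} or {hf_url}")
```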
- Version Control:
  - Include version information in your model registry to ensure all nodes use the same model version (see the example below).
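For example, the registration payload could carry a revision identifier alongside the path (the field names here are just a suggestion; the revision could be the Hugging Face commit hash):

```python
import requests

# Hypothetical registration payload with version information.
requests.post(
    "http://registry-server:8000/models/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    json={
        "path": "/models/mlx-community/Meta-Llama-3.1-8B-Instruct-4bit.safetensors",
        "revision": "main",  # pin a specific commit hash here for reproducibility
    },
)
```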
- Security Considerations:
  - Implement authentication for the registry and file server if needed (a sketch follows below).
  - Use HTTPS for local transfers if security is a concern.
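If authentication is needed, one lightweight option is a shared token checked on every registry request; this is only a sketch (the environment variable name and `/health` endpoint are made up), not a complete security solution:

```python
import os
from flask import Flask, request, abort, jsonify

app = Flask(__name__)
API_TOKEN = os.environ.get("MODEL_REGISTRY_TOKEN", "")  # shared secret distributed to nodes


@app.before_request
def check_token():
    # Reject any request that doesn't carry the expected bearer token.
    if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
        abort(401)


@app.route("/health")
def health():
    return jsonify({"status": "ok"})
```

Clients would then send an `Authorization: Bearer <token>` header with each `requests` call.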
This implementation allows nodes to check a local registry first, download from a local file server if available, and fall back to Hugging Face only when necessary. The first node to download a model will make it available to all other nodes, significantly reducing bandwidth usage and download times for subsequent nodes.
I'm currently copying the directories manually from the .cache/huggingface/hub models directories to the other machines using AirDrop. This works and is obviously faster than downloading from Hugging Face.

```
models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit
models--mlx-community--Meta-Llama-3.1-8B-Instruct-4bit
```
This would be great. I think the difficult part is doing this in a way that's compatible with one of exo's core design philosophies: node equality. I don't want to have a master file server. The way we could do it is: nodes first ask all their peers if they have a model file before going to Hugging Face, and send it p2p.
Related: #80 #70 #16
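A rough sketch of that peer-first lookup, with every node treated equally and Hugging Face only as the fallback; the `/has_model` and `/models` endpoints here are purely hypothetical, not an existing exo interface:

```python
import requests


def find_model_on_peers(model_id: str, peers: list[str]) -> str | None:
    """Ask each known peer whether it already has the model; return a download URL if one does."""
    for peer in peers:
        try:
            # Hypothetical endpoint every node could expose.
            r = requests.get(f"http://{peer}/has_model/{model_id}", timeout=2)
            if r.status_code == 200 and r.json().get("available"):
                return f"http://{peer}/models/{model_id}"
        except requests.RequestException:
            continue  # peer unreachable, try the next one
    return None


def resolve_model_url(model_id: str, peers: list[str]) -> str:
    # Prefer a peer copy; go to Hugging Face only if no peer has the model.
    return find_model_on_peers(model_id, peers) or (
        f"https://huggingface.co/{model_id}/resolve/main/model.safetensors"
    )
```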
@AlexCheema Hello, is there a simpler way to load models locally? I can't connect to Hugging Face, so I can't get this project running. Thanks.
There are also existing tools that could be bolted on, or that people could easily set up themselves for this scenario. SyncThing and LocalSend come to mind immediately. I think SyncThing can operate in a LAN-only setup and could possibly reuse the peer configuration already taking place in exo.
But Dave's Garage did mention that downloading the models took a while even on his super fast network, so I'm guessing Hugging Face has a speed limit.
I think in the future we could use direct p2p access for this.
I think that would work better than having one central server (Hugging Face would be the only central network).
I also think we could divide the downloading task amongst multiple peers. Think BitTorrent?
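Something BitTorrent-like could be approximated with HTTP Range requests, pulling different byte ranges of the same file from different peers that already have it; a sketch under the assumption that peers serve model files over HTTP with Range support:

```python
import requests
from concurrent.futures import ThreadPoolExecutor


def fetch_range(url: str, start: int, end: int) -> bytes:
    # Ask one peer for a specific byte range of the file.
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    r.raise_for_status()
    return r.content


def parallel_download(peer_urls: list[str], total_size: int, dest: str) -> None:
    """Split the file into one chunk per peer and fetch the chunks in parallel."""
    n = len(peer_urls)
    chunk = total_size // n
    ranges = [(i * chunk, (i + 1) * chunk - 1 if i < n - 1 else total_size - 1) for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(lambda job: fetch_range(*job),
                              [(url, s, e) for url, (s, e) in zip(peer_urls, ranges)]))
    with open(dest, "wb") as f:
        for part in parts:
            f.write(part)
```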