Add support for embedding models

ArvidSU opened this issue 3 months ago · 1 comment

Feature Request: Embedding Model Support

Summary

Add support for embedding models in exo, enabling local generation of text embeddings compatible with the OpenAI embeddings API. This extends exo beyond chat completions to support semantic search, retrieval-augmented generation (RAG), and other embedding-based workflows.

Implementation Plan

Add embedding support as a new task type alongside ChatCompletion, reusing the MLX engine infrastructure.

Changes Required

1. API Layer (src/exo/master/api.py):

  • Add POST /v1/embeddings endpoint (OpenAI-compatible)
  • Add EmbeddingTaskParams type (similar to ChatCompletionTaskParams)
  • Add EmbeddingResponse type (OpenAI-compatible format; both new types are sketched below)
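
A minimal sketch of the endpoint and types, assuming api.py is built on FastAPI and Pydantic (the route path and field names follow the OpenAI spec; everything else below is illustrative, not exo's actual code):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class EmbeddingTaskParams(BaseModel):
        # OpenAI accepts a single string or a list of strings
        input: str | list[str]
        model: str
        encoding_format: str = "float"  # "float" or "base64"

    class EmbeddingData(BaseModel):
        object: str = "embedding"
        embedding: list[float]
        index: int

    class EmbeddingUsage(BaseModel):
        prompt_tokens: int
        total_tokens: int

    class EmbeddingResponse(BaseModel):
        object: str = "list"
        data: list[EmbeddingData]
        model: str
        usage: EmbeddingUsage

    @app.post("/v1/embeddings")
    async def create_embeddings(params: EmbeddingTaskParams) -> EmbeddingResponse:
        texts = [params.input] if isinstance(params.input, str) else params.input
        # Real implementation: dispatch an Embedding command to a worker and
        # await its EmbeddingGenerated event. Stubbed with zero vectors here.
        data = [EmbeddingData(embedding=[0.0] * 768, index=i) for i in range(len(texts))]
        return EmbeddingResponse(
            data=data,
            model=params.model,
            usage=EmbeddingUsage(prompt_tokens=0, total_tokens=0),
        )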

2. Command/Event System (src/exo/shared/types/commands.py, tasks.py):

  • Add Embedding command (similar to ChatCompletion)
  • Add Embedding task (similar to ChatCompletion task)
  • Add EmbeddingGenerated event (similar to ChunkGenerated; sketched below)
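
A sketch of the new command and event types, using plain dataclasses as stand-ins (exo's actual base classes for commands/tasks/events, and these exact field names, are assumptions mirroring the ChatCompletion pattern):

    from dataclasses import dataclass

    @dataclass
    class Embedding:
        """Command: request embeddings for one or more inputs."""
        task_id: str
        model: str
        inputs: list[str]

    @dataclass
    class EmbeddingGenerated:
        """Event: emitted once per task with the finished vectors
        (no per-chunk streaming, unlike ChunkGenerated)."""
        task_id: str
        embeddings: list[list[float]]
        prompt_tokens: int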

3. Worker/Runner (src/exo/worker/runner/runner.py):

  • Add embedding task handling in runner main loop
  • Add MLX embedding generation function (simpler than chat generation: no streaming or sampling; see the sketch below)
  • Load embedding models via existing load_mlx_items infrastructure
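
A sketch of what the MLX embedding function could look like. It assumes `model` is an MLX encoder returning per-token hidden states and `tokenizer` is a Hugging Face-style tokenizer; mean pooling plus L2 normalization is the common recipe for BERT-style embedders such as BGE and nomic-embed:

    import mlx.core as mx

    def generate_embedding(model, tokenizer, text: str) -> list[float]:
        # Single forward pass: no sampling loop, no KV cache, no streaming.
        token_ids = tokenizer.encode(text)
        hidden = model(mx.array([token_ids]))  # shape: (1, seq_len, dim)
        pooled = hidden.mean(axis=1)           # mean-pool over tokens
        # L2-normalize so dot products double as cosine similarities
        normed = pooled / mx.linalg.norm(pooled, axis=-1, keepdims=True)
        return normed[0].tolist()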

4. Model Metadata (src/exo/shared/models/model_cards.py):

  • Add embedding model cards (e.g., nomic-ai/nomic-embed-text-v1, BAAI/bge-small-en-v1.5)
  • Mark models with model_type: "embedding" in metadata (see the sketch below)
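
For illustration, the two cards could look roughly like this (the exact model-card schema in model_cards.py is an assumption; the repo IDs and embedding dimensions are real):

    # Hypothetical card entries; adapt to the actual ModelCard schema.
    EMBEDDING_MODEL_CARDS = {
        "nomic-embed-text-v1": {
            "repo_id": "nomic-ai/nomic-embed-text-v1",
            "model_type": "embedding",
            "embedding_dim": 768,
        },
        "bge-small-en-v1.5": {
            "repo_id": "BAAI/bge-small-en-v1.5",
            "model_type": "embedding",
            "embedding_dim": 384,
        },
    }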

5. Placement Logic:

  • Embedding models typically fit on a single device
  • Can reuse existing placement with a min_nodes=1 constraint (sketched below)
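
The constraint amounts to a one-line wrapper; `place_instance` below is a hypothetical stand-in for exo's existing placement entry point:

    def place_embedding_instance(model_card, nodes, place_instance):
        # Pin both bounds to 1 so the existing placement logic never
        # shards an embedding model across the cluster.
        return place_instance(model_card, nodes, min_nodes=1, max_nodes=1)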

Considerations

  • Embedding models may not need distributed sharding (single-device is typical)
  • No streaming needed (embeddings are returned as complete vectors)
  • Simpler than chat generation (no sampling, temperature, etc.)

Technical Considerations

API Compatibility

  • Follow OpenAI embeddings API format:
    {
      "input": "text to embed",
      "model": "nomic-embed-text-v1",
      "encoding_format": "float"  // or "base64"
    }
    
  • Response format:
    {
      "object": "list",
      "data": [{
        "object": "embedding",
        "embedding": [0.1, 0.2, ...],
        "index": 0
      }],
      "model": "nomic-embed-text-v1",
      "usage": {"prompt_tokens": 10, "total_tokens": 10}
    }
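
Once the endpoint exists, the official openai Python client should work unchanged against a local exo server (the base URL/port below is an assumption; the API key is unused locally):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:52415/v1", api_key="not-needed")
    resp = client.embeddings.create(
        model="nomic-embed-text-v1",
        input="text to embed",
    )
    print(len(resp.data[0].embedding))  # vector dimension, e.g. 768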
    

Performance

  • Embedding models are typically small and fast
  • Batch multiple inputs into a single forward pass
  • No need for distributed sharding (single-device is sufficient)
  • Consider caching repeated inputs (a minimal sketch follows)
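
For the caching point, a minimal in-process sketch (illustrative only, not exo code) keyed on model and input:

    import hashlib

    _cache: dict[str, list[float]] = {}

    def cached_embedding(model_name: str, text: str, embed_fn) -> list[float]:
        key = hashlib.sha256(f"{model_name}:{text}".encode()).hexdigest()
        if key not in _cache:
            _cache[key] = embed_fn(text)  # compute only on a cache miss
        return _cache[key]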

Integration Points

  • Model download: reuse existing ShardDownloader
  • Instance management: reuse existing CreateInstance/DeleteInstance commands
  • Placement: reuse existing placement logic with single-device constraint
  • Dashboard: show embedding instances alongside chat instances

Open Questions

  1. Should embedding models support distributed inference, or is single-device sufficient?
  2. Should we support batch embedding requests, or process one at a time?
  3. Do we need embedding-specific metrics (latency, throughput) in the dashboard?
  4. Should embedding models be listed separately in /models, or mixed with chat models?
  5. Do we need embedding-specific placement strategies, or can we reuse existing logic?

Success Criteria

  • [ ] POST /v1/embeddings endpoint returns OpenAI-compatible responses
  • [ ] Support for at least 2 popular embedding models (e.g., nomic-embed, BGE-small)
  • [ ] Embedding instances can be created/deleted via existing instance management APIs
  • [ ] Embedding models appear in /models endpoint
  • [ ] Documentation with examples for RAG and semantic search use cases
  • [ ] Performance: <100 ms latency for a single embedding on an M-series Mac

References

  • OpenAI Embeddings API: https://platform.openai.com/docs/api-reference/embeddings
  • MLX Embedding Models: https://huggingface.co/mlx-community (search for "embed")
  • Nomic Embed: https://www.nomic.ai/blog/posts/nomic-embed-text-v1
  • Existing code patterns:
    • Chat completion API: src/exo/master/api.py:498-526
    • Chat completion task: src/exo/shared/types/tasks.py:51-57
    • MLX generation: src/exo/worker/engines/mlx/generator/generate.py

ArvidSU · Dec 30 '25

Edited for brevity

Evanev7 · Dec 30 '25