# Add support for embedding models
## Summary
Add support for embedding models in exo, enabling local generation of text embeddings compatible with the OpenAI embeddings API. This extends exo beyond chat completions to support semantic search, RAG, and other embedding-based workflows.
## Implementation Plan
Add embedding support as a new task type alongside `ChatCompletion`, reusing the existing MLX engine infrastructure.
### Changes Required
1. **API Layer** (`src/exo/master/api.py`):
   - Add a `POST /v1/embeddings` endpoint (OpenAI-compatible)
   - Add an `EmbeddingTaskParams` type (similar to `ChatCompletionTaskParams`)
   - Add an `EmbeddingResponse` type (OpenAI-compatible format); a sketch of both types follows this item
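   A minimal sketch of the two proposed types, assuming Pydantic models are used for request/response validation. The field names mirror the OpenAI spec; nothing here reflects existing exo code:

   ```python
   # Hypothetical sketch of the new API types (names from this proposal,
   # not existing exo code). Assumes Pydantic handles request validation.
   from pydantic import BaseModel


   class EmbeddingTaskParams(BaseModel):
       """Request body for POST /v1/embeddings (OpenAI-compatible)."""
       model: str
       input: str | list[str]          # a single text or a batch
       encoding_format: str = "float"  # "float" or "base64"


   class EmbeddingData(BaseModel):
       object: str = "embedding"
       embedding: list[float]
       index: int


   class EmbeddingUsage(BaseModel):
       prompt_tokens: int
       total_tokens: int


   class EmbeddingResponse(BaseModel):
       """Response body, mirroring OpenAI's embeddings response."""
       object: str = "list"
       data: list[EmbeddingData]
       model: str
       usage: EmbeddingUsage
   ```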
2. **Command/Event System** (`src/exo/shared/types/commands.py`, `tasks.py`):
   - Add an `Embedding` command (similar to `ChatCompletion`)
   - Add an `Embedding` task (similar to the `ChatCompletion` task)
   - Add an `EmbeddingGenerated` event (similar to `ChunkGenerated`); illustrative shapes follow this item
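   Illustrative shapes only: exo's actual command/task/event base classes and ID types aren't shown in this issue, so plain dataclasses stand in here:

   ```python
   # Sketch of the proposed command/task/event; base classes and wiring
   # into exo's event system are assumptions, not existing code.
   from dataclasses import dataclass, field
   import uuid


   @dataclass
   class EmbeddingCommand:
       """Submitted by the API layer, analogous to ChatCompletion."""
       model: str
       input: list[str]
       command_id: str = field(default_factory=lambda: str(uuid.uuid4()))


   @dataclass
   class EmbeddingTask:
       """Scheduled onto the worker holding the embedding model instance."""
       command_id: str
       model: str
       input: list[str]


   @dataclass
   class EmbeddingGenerated:
       """Emitted once per request; no streaming, unlike ChunkGenerated."""
       command_id: str
       embeddings: list[list[float]]
       prompt_tokens: int
   ```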
3. **Worker/Runner** (`src/exo/worker/runner/runner.py`):
   - Add embedding task handling to the runner main loop
   - Add an MLX embedding generation function (simpler than chat generation: no streaming or sampling); see the sketch after this item
   - Load embedding models via the existing `load_mlx_items` infrastructure
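   One way the worker-side function could look, as a sketch: `load_mlx_items` is assumed to hand back a model and tokenizer, the model is assumed to return per-token hidden states when called, and mean pooling with L2 normalization is one common pooling choice (not necessarily what every model card prescribes):

   ```python
   # Sketch of a worker-side embedding function. The model/tokenizer
   # interfaces are assumptions; mean pooling + L2 normalization is one
   # common choice for encoder-style embedding models.
   import mlx.core as mx


   def generate_embeddings(model, tokenizer, texts: list[str]) -> list[list[float]]:
       embeddings = []
       for text in texts:
           token_ids = mx.array([tokenizer.encode(text)])
           hidden = model(token_ids)       # assumed (1, seq_len, hidden_dim)
           pooled = hidden.mean(axis=1)    # mean-pool over the sequence
           normed = pooled / mx.linalg.norm(pooled, axis=-1, keepdims=True)
           embeddings.append(normed[0].tolist())
       return embeddings
   ```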
4. **Model Metadata** (`src/exo/shared/models/model_cards.py`):
   - Add embedding model cards (e.g., `nomic-ai/nomic-embed-text-v1`, `BAAI/bge-small-en-v1.5`)
   - Mark these models with `model_type: "embedding"` in metadata (example entries below)
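   Hypothetical card entries; the actual schema in `model_cards.py` isn't shown in this issue, so the field names here are illustrative (the dimensions are the published ones: 768 for nomic-embed-text-v1, 384 for bge-small-en-v1.5):

   ```python
   # Hypothetical model-card entries; field names are illustrative,
   # matching the model_type: "embedding" marker proposed above.
   EMBEDDING_MODEL_CARDS = {
       "nomic-embed-text-v1": {
           "repo_id": "nomic-ai/nomic-embed-text-v1",
           "model_type": "embedding",
           "embedding_dim": 768,
       },
       "bge-small-en-v1.5": {
           "repo_id": "BAAI/bge-small-en-v1.5",
           "model_type": "embedding",
           "embedding_dim": 384,
       },
   }
   ```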
5. **Placement Logic**:
   - Embedding models typically fit on a single device
   - Existing placement can be reused with a `min_nodes=1` constraint
### Considerations
- Embedding models may not need distributed sharding (single-device is typical)
- No streaming needed (embeddings are returned as complete vectors)
- Simpler than chat generation (no sampling, temperature, etc.)
## Technical Considerations
### API Compatibility

- Follow the OpenAI embeddings API request format (`encoding_format` may be `"float"` or `"base64"`):

  ```json
  {
    "input": "text to embed",
    "model": "nomic-embed-text-v1",
    "encoding_format": "float"
  }
  ```

- Response format:

  ```json
  {
    "object": "list",
    "data": [
      { "object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0 }
    ],
    "model": "nomic-embed-text-v1",
    "usage": { "prompt_tokens": 10, "total_tokens": 10 }
  }
  ```
### Performance
- Embedding models are typically small and fast
- Batch multiple inputs into a single forward pass where possible
- No need for distributed sharding (a single device is sufficient)
- Consider caching for repeated inputs (a sketch combining batching and caching follows)
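If both batching and caching are adopted, they compose naturally. A sketch, reusing the hypothetical `generate_embeddings` from the runner section above:

```python
# Illustrative request-level cache keyed by (model_name, text);
# generate_embeddings is the hypothetical worker function sketched earlier.
_cache: dict[tuple[str, str], list[float]] = {}


def embed_with_cache(model_name: str, model, tokenizer,
                     texts: list[str]) -> list[list[float]]:
    # Only run the model on texts we haven't embedded before.
    misses = [t for t in texts if (model_name, t) not in _cache]
    if misses:
        for text, vec in zip(misses, generate_embeddings(model, tokenizer, misses)):
            _cache[(model_name, text)] = vec
    return [_cache[(model_name, t)] for t in texts]
```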
### Integration Points

- Model download: reuse the existing `ShardDownloader`
- Instance management: reuse the existing `CreateInstance`/`DeleteInstance` commands
- Placement: reuse existing placement logic with a single-device constraint
- Dashboard: show embedding instances alongside chat instances
## Open Questions

- Should embedding models support distributed inference, or is single-device sufficient?
- Should we support batch embedding requests, or process one input at a time?
- Do we need embedding-specific metrics (latency, throughput) in the dashboard?
- Should embedding models be listed separately in `/models`, or mixed with chat models?
- Do we need embedding-specific placement strategies, or can we reuse existing logic?
## Success Criteria

- [ ] `POST /v1/embeddings` endpoint returns OpenAI-compatible responses
- [ ] Support for at least two popular embedding models (e.g., nomic-embed, BGE-small)
- [ ] Embedding instances can be created and deleted via the existing instance management APIs
- [ ] Embedding models appear in the `/models` endpoint
- [ ] Documentation with examples for RAG and semantic search use cases
- [ ] Performance: <100 ms latency for a single embedding on an M-series Mac
## References

- OpenAI Embeddings API: https://platform.openai.com/docs/api-reference/embeddings
- MLX embedding models: https://huggingface.co/mlx-community (search for "embed")
- Nomic Embed: https://www.nomic.ai/blog/posts/nomic-embed-text-v1
- Existing code patterns:
  - Chat completion API: `src/exo/master/api.py:498-526`
  - Chat completion task: `src/exo/shared/types/tasks.py:51-57`
  - MLX generation: `src/exo/worker/engines/mlx/generator/generate.py`