# Add support for embedding models
## Summary
Add support for embedding models in exo, enabling local generation of text embeddings compatible with the OpenAI embeddings API. This extends exo beyond chat completions to support semantic search, RAG, and other embedding-based workflows.
## Implementation Plan
Add embedding support as a new task type alongside `ChatCompletion`, reusing the existing MLX engine infrastructure.
### Changes Required
1. **API Layer** (`src/exo/master/api.py`):
   - Add a `POST /v1/embeddings` endpoint (OpenAI-compatible)
   - Add an `EmbeddingTaskParams` type (similar to `ChatCompletionTaskParams`)
   - Add an `EmbeddingResponse` type (OpenAI-compatible format); a sketch of both types follows this item
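   A minimal sketch of the two proposed types, assuming Pydantic models are used for request/response validation. The field names mirror the OpenAI spec; nothing here reflects existing exo code:

   ```python
   # Hypothetical sketch of the new API types (names from this proposal,
   # not existing exo code). Assumes Pydantic handles request validation.
   from pydantic import BaseModel


   class EmbeddingTaskParams(BaseModel):
       """Request body for POST /v1/embeddings (OpenAI-compatible)."""
       model: str
       input: str | list[str]          # a single text or a batch
       encoding_format: str = "float"  # "float" or "base64"


   class EmbeddingData(BaseModel):
       object: str = "embedding"
       embedding: list[float]
       index: int


   class EmbeddingUsage(BaseModel):
       prompt_tokens: int
       total_tokens: int


   class EmbeddingResponse(BaseModel):
       """Response body, mirroring OpenAI's embeddings response."""
       object: str = "list"
       data: list[EmbeddingData]
       model: str
       usage: EmbeddingUsage
   ```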
2. **Command/Event System** (`src/exo/shared/types/commands.py`, `tasks.py`):
   - Add an `Embedding` command (similar to `ChatCompletion`)
   - Add an `Embedding` task (similar to the `ChatCompletion` task)
   - Add an `EmbeddingGenerated` event (similar to `ChunkGenerated`); illustrative shapes follow this item
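   Illustrative shapes only: exo's actual command/task/event base classes and ID types aren't shown in this issue, so plain dataclasses stand in here:

   ```python
   # Sketch of the proposed command/task/event; base classes and wiring
   # into exo's event system are assumptions, not existing code.
   from dataclasses import dataclass, field
   import uuid


   @dataclass
   class EmbeddingCommand:
       """Submitted by the API layer, analogous to ChatCompletion."""
       model: str
       input: list[str]
       command_id: str = field(default_factory=lambda: str(uuid.uuid4()))


   @dataclass
   class EmbeddingTask:
       """Scheduled onto the worker holding the embedding model instance."""
       command_id: str
       model: str
       input: list[str]


   @dataclass
   class EmbeddingGenerated:
       """Emitted once per request; no streaming, unlike ChunkGenerated."""
       command_id: str
       embeddings: list[list[float]]
       prompt_tokens: int
   ```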
3. **Worker/Runner** (`src/exo/worker/runner/runner.py`):
   - Add embedding task handling to the runner main loop
   - Add an MLX embedding generation function (simpler than chat generation: no streaming or sampling); see the sketch after this item
   - Load embedding models via the existing `load_mlx_items` infrastructure
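   One way the worker-side function could look, as a sketch: `load_mlx_items` is assumed to hand back a model and tokenizer, the model is assumed to return per-token hidden states when called, and mean pooling with L2 normalization is one common pooling choice (not necessarily what every model card prescribes):

   ```python
   # Sketch of a worker-side embedding function. The model/tokenizer
   # interfaces are assumptions; mean pooling + L2 normalization is one
   # common choice for encoder-style embedding models.
   import mlx.core as mx


   def generate_embeddings(model, tokenizer, texts: list[str]) -> list[list[float]]:
       embeddings = []
       for text in texts:
           token_ids = mx.array([tokenizer.encode(text)])
           hidden = model(token_ids)       # assumed (1, seq_len, hidden_dim)
           pooled = hidden.mean(axis=1)    # mean-pool over the sequence
           normed = pooled / mx.linalg.norm(pooled, axis=-1, keepdims=True)
           embeddings.append(normed[0].tolist())
       return embeddings
   ```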
4. **Model Metadata** (`src/exo/shared/models/model_cards.py`):
   - Add embedding model cards (e.g., `nomic-ai/nomic-embed-text-v1`, `BAAI/bge-small-en-v1.5`)
   - Mark these models with `model_type: "embedding"` in metadata (example entries below)
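   Hypothetical card entries; the actual schema in `model_cards.py` isn't shown in this issue, so the field names here are illustrative (the dimensions are the published ones: 768 for nomic-embed-text-v1, 384 for bge-small-en-v1.5):

   ```python
   # Hypothetical model-card entries; field names are illustrative,
   # matching the model_type: "embedding" marker proposed above.
   EMBEDDING_MODEL_CARDS = {
       "nomic-embed-text-v1": {
           "repo_id": "nomic-ai/nomic-embed-text-v1",
           "model_type": "embedding",
           "embedding_dim": 768,
       },
       "bge-small-en-v1.5": {
           "repo_id": "BAAI/bge-small-en-v1.5",
           "model_type": "embedding",
           "embedding_dim": 384,
       },
   }
   ```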
5. **Placement Logic**:
   - Embedding models typically fit on a single device
   - Existing placement can be reused with a `min_nodes=1` constraint
### Considerations
- Embedding models may not need distributed sharding (single-device is typical)
- No streaming needed (embeddings are returned as complete vectors)
- Simpler than chat generation (no sampling, temperature, etc.)
## Technical Considerations
### API Compatibility

- Follow the OpenAI embeddings API request format (`encoding_format` may be `"float"` or `"base64"`):

  ```json
  {
    "input": "text to embed",
    "model": "nomic-embed-text-v1",
    "encoding_format": "float"
  }
  ```

- Response format:

  ```json
  {
    "object": "list",
    "data": [
      { "object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0 }
    ],
    "model": "nomic-embed-text-v1",
    "usage": { "prompt_tokens": 10, "total_tokens": 10 }
  }
  ```
### Performance
- Embedding models are typically small and fast
- Batch multiple inputs into a single forward pass where possible
- No need for distributed sharding (a single device is sufficient)
- Consider caching for repeated inputs (a sketch combining batching and caching follows)
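If both batching and caching are adopted, they compose naturally. A sketch, reusing the hypothetical `generate_embeddings` from the runner section above:

```python
# Illustrative request-level cache keyed by (model_name, text);
# generate_embeddings is the hypothetical worker function sketched earlier.
_cache: dict[tuple[str, str], list[float]] = {}


def embed_with_cache(model_name: str, model, tokenizer,
                     texts: list[str]) -> list[list[float]]:
    # Only run the model on texts we haven't embedded before.
    misses = [t for t in texts if (model_name, t) not in _cache]
    if misses:
        for text, vec in zip(misses, generate_embeddings(model, tokenizer, misses)):
            _cache[(model_name, text)] = vec
    return [_cache[(model_name, t)] for t in texts]
```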
### Integration Points

- Model download: reuse the existing `ShardDownloader`
- Instance management: reuse the existing `CreateInstance`/`DeleteInstance` commands
- Placement: reuse existing placement logic with a single-device constraint
- Dashboard: show embedding instances alongside chat instances
## Open Questions

- Should embedding models support distributed inference, or is single-device sufficient?
- Should we support batch embedding requests, or process one input at a time?
- Do we need embedding-specific metrics (latency, throughput) in the dashboard?
- Should embedding models be listed separately in `/models`, or mixed with chat models?
- Do we need embedding-specific placement strategies, or can we reuse existing logic?
## Success Criteria

- [ ] `POST /v1/embeddings` endpoint returns OpenAI-compatible responses
- [ ] Support for at least two popular embedding models (e.g., nomic-embed, BGE-small)
- [ ] Embedding instances can be created and deleted via the existing instance management APIs
- [ ] Embedding models appear in the `/models` endpoint
- [ ] Documentation with examples for RAG and semantic search use cases
- [ ] Performance: <100 ms latency for a single embedding on an M-series Mac
## References

- OpenAI Embeddings API: https://platform.openai.com/docs/api-reference/embeddings
- MLX embedding models: https://huggingface.co/mlx-community (search for "embed")
- Nomic Embed: https://www.nomic.ai/blog/posts/nomic-embed-text-v1
- Existing code patterns:
  - Chat completion API: `src/exo/master/api.py:498-526`
  - Chat completion task: `src/exo/shared/types/tasks.py:51-57`
  - MLX generation: `src/exo/worker/engines/mlx/generator/generate.py`