[Feature]: Add CLI to launch FastEmbed as an Embedding API service (OpenAI-compatible)
### What feature would you like to request?

Feature Request: Add a CLI to launch FastEmbed as an Embedding API service (OpenAI-compatible).

### Title

CLI to start FastEmbed as a local HTTP service providing a `/v1/embeddings` endpoint
### Background / Motivation

Currently, FastEmbed is a Python library: users need to write Python code to create embeddings.

In many production setups:

- Teams want to run FastEmbed as a self-contained service (like the OpenAI embeddings API), locally or on an internal server.
- Many existing apps/tools (LangChain, LlamaIndex, etc.) expect an OpenAI-style `/v1/embeddings` REST API, so they can work without code changes.
- This would also make FastEmbed easier to integrate into non-Python environments.
### Proposed Solution

Add a CLI command (e.g. `fastembed serve`) to start an HTTP server. The server will host a REST API compatible with OpenAI's embeddings format.
#### CLI Example

```bash
fastembed serve \
  --model-name BAAI/bge-small-zh-v1.5 \
  --device cuda \
  --model-path /path/to/local/model \
  --port 8080 \
  --host 0.0.0.0
```

Optional flags (see the sketch after this list):

- `--model-path`: local ONNX/quantized model file path (avoids re-download)
- `--model-name`: HuggingFace model name (falls back to remote download if no path is given)
- `--device`: `cpu` / `cuda`
- `--port`: listening port
- `--host`: bind address
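For illustration, a minimal sketch of how such an entrypoint could parse these flags with `argparse` and hand off to Uvicorn. The `fastembed.cli.serve:app` module path follows the layout proposed in the summary at the end of this issue; none of this exists in FastEmbed yet:

```python
# Sketch of the proposed `fastembed serve` entrypoint (nothing here exists
# in FastEmbed yet; flag names come from this proposal).
import argparse

import uvicorn


def main() -> None:
    parser = argparse.ArgumentParser(prog="fastembed serve")
    parser.add_argument("--model-name", default="BAAI/bge-small-zh-v1.5")
    parser.add_argument("--model-path", default=None)
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--host", default="127.0.0.1")
    args = parser.parse_args()

    # "fastembed.cli.serve:app" is the module layout proposed in the summary
    # at the end of this issue. Threading --model-name/--device/--model-path
    # into the app (e.g. via an app factory) is omitted in this sketch.
    uvicorn.run("fastembed.cli.serve:app", host=args.host, port=args.port)


if __name__ == "__main__":
    main()
```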
### API Specification (POST /v1/embeddings)

Request:

```json
{
  "input": [
    "Artificial intelligence is the future",
    "FastEmbed makes embeddings blazing fast"
  ]
}
```

Response:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.0246, -0.0536, -0.0010, ...]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [0.0123, -0.0456, 0.0231, ...]
    }
  ],
  "model": "BAAI/bge-small-zh-v1.5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}
```
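For illustration, a raw HTTP call against this spec could look like the following sketch, assuming a server started via the CLI example above is listening on `localhost:8080`:

```python
# Sketch of a client call against the proposed endpoint; assumes a server
# started via the CLI example above is listening on localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": ["Artificial intelligence is the future"]},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()["data"]
# Each item carries the input's index and its embedding vector.
print(data[0]["index"], len(data[0]["embedding"]))
```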
### Benefits

- Drop-in replacement for the OpenAI embeddings API: developers can point existing code at a local FastEmbed service by changing only `OPENAI_API_BASE` (see the sketch after this list).
- Works in multi-language environments (Java, Go, JS, etc.) via HTTP.
- No cloud dependency: the model runs locally and respects data privacy.
- Easy deployment:
  - Local dev via the CLI
  - Docker container for production use (`docker run ... fastembed serve`)
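As a sketch of that drop-in claim: the official `openai` Python SDK (v1+) pointed at a local server started with the CLI example above. The port and model name match that example, and the `api_key` is a placeholder the local service would not check:

```python
# Sketch: pointing the official OpenAI SDK (v1+) at a local FastEmbed server.
# Assumes the server from the CLI example above is running on port 8080;
# the api_key is a dummy value the local service would ignore.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
result = client.embeddings.create(
    model="BAAI/bge-small-zh-v1.5",
    input=["FastEmbed makes embeddings blazing fast"],
)
print(len(result.data[0].embedding))  # embedding dimensionality, e.g. 512
```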
### Alternatives

- Users manually wrap FastEmbed in Flask/FastAPI (requires extra coding).
- Keep embedding Python-only, which limits cross-language integration.
### Additional Context

- Similar approach in Ollama (`ollama serve`) and the HuggingFace Inference Server.
- Many users (including me) have internal projects that already use `/v1/embeddings`; with this feature, switching from OpenAI to FastEmbed would require zero code changes.
- Could later be extended with a `/v1/rerank` endpoint for ColBERT / BGE reranking models.
### Possible Implementation

Use FastAPI / Uvicorn internally:

```python
from fastapi import FastAPI
from fastembed import TextEmbedding

app = FastAPI()
embedder = TextEmbedding(model_name="BAAI/bge-small-zh-v1.5")


@app.post("/v1/embeddings")
async def create_embeddings(request: dict):
    inputs = request.get("input", [])
    # OpenAI's API also accepts a single string; normalize it to a list.
    if isinstance(inputs, str):
        inputs = [inputs]
    vectors = list(embedder.embed(inputs))
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec.tolist()}
            for i, vec in enumerate(vectors)
        ],
        "model": "BAAI/bge-small-zh-v1.5",
        "usage": {"prompt_tokens": 0, "total_tokens": 0},
    }
```
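The summary below also mentions `/v1/models` and `/health` endpoints; a minimal sketch extending the app above could look like this (the response shapes are assumptions, loosely modeled on OpenAI's model-list format):

```python
# Sketch of the auxiliary endpoints mentioned in the summary below.
# Response shapes are assumptions modeled on OpenAI's /v1/models format.
MODEL_NAME = "BAAI/bge-small-zh-v1.5"


@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{"id": MODEL_NAME, "object": "model", "owned_by": "fastembed"}],
    }


@app.get("/health")
async def health():
    return {"status": "ok"}
```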
### Is there any additional information you would like to provide?

_No response_
A summary of the changes CodeRabbit can apply:

Add a CLI entrypoint and full server mode to run FastEmbed as an OpenAI-compatible embeddings API by adding `fastembed/__main__.py`, a `fastembed/cli` package (`serve.py` implementing the FastAPI endpoints `/v1/embeddings`, `/v1/models`, and `/health`, a README, and a simple import test), new top-level docs (`docs/CLI_Server_Guide.md`), and updating `pyproject.toml` to register the `fastembed` console script and an optional "server" dependency group (FastAPI, Uvicorn, Pydantic).