Describe the issue

Use vllm to launch a local large model, in the style of openai，but it won't work

Steps to reproduce

step1:python -m vllm.entrypoints.openai.api_server --max-model-len 6144 --gpu-memory-utilization 0.95 --disable-log-stats --served-model-name Qwen2-7B-Instruct --model /mnt/workspace/Qwen2-7B-Instruct step2: start embedding

import os from contextlib import asynccontextmanager from typing import List, Union

import tiktoken import torch import uvicorn from fastapi import FastAPI from fastapi.middleware.cors import CORSMiddleware from pydantic import BaseModel from sentence_transformers import SentenceTransformer from sse_starlette.sse import EventSourceResponse

Set up limit request time

EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', '/mnt/workspace/m3e-base')

@asynccontextmanager async def lifespan(app: FastAPI): yield if torch.cuda.is_available(): torch.cuda.empty_cache() torch.cuda.ipc_collect()

app = FastAPI(lifespan=lifespan)

app.add_middleware( CORSMiddleware, allow_origins=[""], allow_credentials=True, allow_methods=[""], allow_headers=["*"], )

class CompletionUsage(BaseModel): prompt_tokens: int completion_tokens: int total_tokens: int

class EmbeddingResponse(BaseModel): data: list model: str object: str usage: CompletionUsage

class EmbeddingRequest(BaseModel): input: Union[List[str], str] model: str

@app.post("/v1/embeddings", response_model=EmbeddingResponse) async def get_embeddings(request: EmbeddingRequest): if isinstance(request.input, str): embeddings = [embedding_model.encode(request.input)] else: embeddings = [embedding_model.encode(text) for text in request.input] embeddings = [embedding.tolist() for embedding in embeddings]

def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding('cl100k_base')
    num_tokens = len(encoding.encode(string))
    return num_tokens

response = {
    "data": [
        {
            "object": "embedding",
            "embedding": embedding,
            "index": index
        }
        for index, embedding in enumerate(embeddings)
    ],
    "model": request.model,
    "object": "list",
    "usage": CompletionUsage(
        prompt_tokens=sum(len(text.split()) for text in request.input),
        completion_tokens=0,
        total_tokens=sum(num_tokens_from_string(text) for text in request.input),
    )
}
return response

if name == "main": # load Embedding embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda") uvicorn.run(app, host='0.0.0.0', port=8001, workers=1)

step3:pip install graphrag step4:mkdir -p ./ragtest/input step5:curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt step6:python -m graphrag.index --init --root ./ragtest step7: Modify the yml file

GraphRAG Config Used

No response

Logs and screenshots

encoding_model: cl100k_base skip_workflows: [] llm: api_key: ${GRAPHRAG_API_KEY} type: openai_chat # or azure_openai_chat model: Qwen2-7B-Instruct model_supports_json: false # recommended if this is available for your model. max_tokens: 2000 request_timeout: 180.0 api_base: http://localhost:8000/v1/

api_version: 2024-02-15-preview

organization: <organization_id>

deployment_name: <azure_model_deployment_name>

tokens_per_minute: 150_000 # set a leaky bucket throttle

requests_per_minute: 10_000 # set a leaky bucket throttle

max_retries: 10

max_retry_wait: 10.0

sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times

concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization: stagger: 0.3

num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:

parallelization: override the global parallelization settings for embeddings

async_mode: threaded # or asyncio llm: api_key: ${GRAPHRAG_API_KEY} type: openai_embedding # or azure_openai_embedding model: m3e-base api_base: http://localhost:8001/v1/ # api_version: 2024-02-15-preview # organization: <organization_id> # deployment_name: <azure_model_deployment_name> # tokens_per_minute: 150_000 # set a leaky bucket throttle # requests_per_minute: 10_000 # set a leaky bucket throttle # max_retries: 10 # max_retry_wait: 10.0 # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times # concurrent_requests: 25 # the number of parallel inflight requests that may be made # batch_size: 16 # the number of documents to send in a single request # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request # target: required # or optional

Additional Information

GraphRAG Version:
Operating System:
Python Version:
Related Issues:

Jul 18 '24 03:07 1444141859

The local search with embeddings from Ollama now works. You can read full guide here: https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f Here is the link to the repo: https://github.com/karthik-codex/autogen_graphRAG

Jul 18 '24 15:07 karthik-codex

你端口号是不是错了？

Jul 19 '24 09:07 Nomore0912

If you want to use open-source models, I've created a repository for deploying Hugging Face models to local endpoints, offering functionality similar to OpenAI APIs. You can find the repo here: https://github.com/rushizirpe/open-llm-server

Also, I've prepared a Colab notebook for the Graphrag Demo. You might want to take a look: https://colab.research.google.com/drive/1uhFDnih1WKrSRQHisU-L6xw6coapgR51?usp=sharing. If you don't have access to GPUs like the A100, you'll need a GROQ_API_KEY (which is free with certain limitations), you can obtain it from: https://console.groq.com/keys

Jul 20 '24 12:07 rushizirpe

Consolidating alternate model issues here: https://github.com/microsoft/graphrag/issues/657

Jul 22 '24 20:07 natoverse

[Issue]: <title> Local LLM and Loacl embedding error?

Describe the issue

Steps to reproduce

Set up limit request time

GraphRAG Config Used

Logs and screenshots

api_version: 2024-02-15-preview

organization: <organization_id>

deployment_name: <azure_model_deployment_name>

tokens_per_minute: 150_000 # set a leaky bucket throttle

requests_per_minute: 10_000 # set a leaky bucket throttle

max_retries: 10

max_retry_wait: 10.0

sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times

concurrent_requests: 25 # the number of parallel inflight requests that may be made

num_threads: 50 # the number of threads to use for parallel processing

parallelization: override the global parallelization settings for embeddings

Additional Information