Error during `create_base_entity_graph` stage in GraphRAG pipeline
Is there an existing issue for this?
- [X] I have searched the existing issues
- [X] I have checked #657 to validate if my issue is covered by community support
Describe the issue
I encountered an error when running the `graphrag.index` command. The error occurs during the `create_base_entity_graph` stage. Below are the detailed error message and logs.
Steps to reproduce
- Activate the `graphrag` conda environment (Python 3.11).
- Install GraphRAG with pip and set the OpenAI API key.
- Initialize the project: `python -m graphrag.index --init --root ./ragtest`
- Run the indexing pipeline: `python -m graphrag.index --root ./ragtest`
- See the error message during the `create_base_entity_graph` stage.
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4-turbo-preview
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Logs and screenshots
[42 rows x 5 columns]
🚀 create_base_extracted_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
entity_graph
0 <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠦ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.
Additional Information
- GraphRAG Version: 0.2.0
- Operating System: Ubuntu 20.04.6 LTS
- Python Version: 3.11
- Related Issues:
I had the same problem, but I worked it out. Check the log in the output directory. In my case, the OpenAI API key didn't have sufficient permissions; once I fixed the permissions, it worked.
What key permissions did you give exactly?
Please provide the `indexing-engine.log` file located in the reports directory inside the output directory. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.
This is the log file.
It looks like the error message in the log is: 'The model gpt-4-turbo-preview does not exist or you do not have access to it.'
Upon checking, I noticed that the `api_base` setting in your configuration is commented out. For the model and API base to work correctly, you need to ensure that these fields are properly set.
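As a quick way to confirm access, you can list the models your key can see with the OpenAI Python client. This is a minimal sketch, assuming the `openai` v1 SDK and that the `GRAPHRAG_API_KEY` environment variable holds the same key your config resolves:

```python
import os
from openai import OpenAI

# Assumes GRAPHRAG_API_KEY holds the key the GraphRAG config resolves at runtime.
client = OpenAI(api_key=os.environ["GRAPHRAG_API_KEY"])

# List the model IDs this key can access and check for the configured model.
available = {m.id for m in client.models.list()}
print("gpt-4-turbo-preview accessible:", "gpt-4-turbo-preview" in available)
```

If the model is missing from the list, the fix is on the account or key side rather than in GraphRAG.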
If you're using an Ollama-based local setup, you might want to use the following configuration:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral # your choice of model
  model_supports_json: true # recommended if available for your model
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1
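Before re-running the pipeline, it can save time to verify the Ollama endpoint responds at all. A sketch, assuming Ollama is running locally and the `mistral` model has been pulled (Ollama ignores the API key value, but the client requires one):

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on port 11434.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
)
print(resp.choices[0].message.content)
```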
Thank you for your help, but the same error still occurs even after trying that. I've attached the logs. Could you please take a look at them?
It looks like there might be an issue with the embedding setup. The response from the embedding API might not be in the expected format.
Try using the following configuration with LM Studio as the API provider and check if it resolves the issue:
embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF
    api_base: http://localhost:1234/v1/
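To see whether the embedding endpoint returns the shape GraphRAG expects, a direct call can help. A sketch, assuming LM Studio is serving on port 1234 with that embedding model loaded (the key is a placeholder):

```python
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible API; the key value is not checked.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.embeddings.create(
    model="nomic-ai/nomic-embed-text-v1.5-GGUF",
    input="hello world",
)
# A well-formed response carries a list of embedding objects, each with a
# float vector; anything else will break the indexer's embedding step.
print(len(resp.data), len(resp.data[0].embedding))
```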
Please provide the file located in the reports directory inside the output directory. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.
indexing-engine.log
Could someone help me take a look? Uploading indexing-engine.log…
same
@eeleedev do you have a recent indexing-engine.log? The last one posted was missing `api_base`, resulting in a 404, but I'm not sure whether that was fixed based on the following comments. Please double-check the getting-started page for the baseline required parameters.
This issue has now been resolved; I solved it by following this webpage:
https://fornewchallenge.tistory.com/entry/%F0%9F%93%8AGraphRAG-%EB%A7%88%EC%9D%B4%ED%81%AC%EB%A1%9C%EC%86%8C%ED%94%84%ED%8A%B8%EC%9D%98-%EA%B7%B8%EB%9E%98%ED%94%84%EA%B8%B0%EB%B0%98-RAG-%EC%A0%91%EA%B7%BC%EB%B2%95feat-Ollama