
Error during `create_base_entity_graph` stage in GraphRAG pipeline

Open eeleedev opened this issue 1 year ago • 12 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues
  • [X] I have checked #657 to validate if my issue is covered by community support

Describe the issue

I encountered an error when running the graphrag.index command. The error occurs during the create_base_entity_graph stage. Below is the detailed error message and logs.

Steps to reproduce

  1. Activate the graphrag conda environment (Python 3.11).
  2. Install GraphRAG with pip and set the OpenAI API key.
  3. Run the following command: python -m graphrag.index --init --root ./ragtest
  4. Observe the error during the create_base_entity_graph stage.

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4-turbo-preview
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
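
A side note on the `chunks` settings above (size 1200, overlap 100, counted in tokens): consecutive windows slide by `size - overlap` tokens, so neighbouring chunks share their boundary tokens. A minimal sketch with made-up token IDs (not GraphRAG's actual chunker, just the windowing arithmetic):

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token sequence into windows of `size` tokens,
    each starting `size - overlap` tokens after the previous one."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step) if tokens[i:i + size]]

# Toy example: 10 "tokens", size 4, overlap 1 -> windows stride by 3;
# the final window may be shorter than `size`.
chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

With the configured values (1200/100), each chunk repeats the last 100 tokens of the previous one, which helps entities near chunk boundaries get extracted with their context.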

Logs and screenshots

[42 rows x 5 columns]
🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠦ GraphRAG Indexer 
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

Additional Information

  • GraphRAG Version: 0.2.0
  • Operating System: Ubuntu 20.04.6 LTS
  • Python Version: 3.11
  • Related Issues:

eeleedev avatar Jul 30 '24 14:07 eeleedev

I had the same problem and managed to resolve it. Check the log in the output directory. In my case, my OpenAI API key didn't have sufficient permissions; once I fixed that, indexing completed.

hffei avatar Jul 30 '24 15:07 hffei
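
A quick way to rule out key problems like the one above: the `${GRAPHRAG_API_KEY}` placeholder in `settings.yaml` is substituted from the environment, so first confirm the variable actually resolves. A rough illustration using Python's `os.path.expandvars`, which mimics the substitution (GraphRAG uses its own loader, so this is only an approximation):

```python
import os

# Normally set in .env or the shell; set here only for the demo.
os.environ["GRAPHRAG_API_KEY"] = "sk-example"

raw = "api_key: ${GRAPHRAG_API_KEY}"
print(os.path.expandvars(raw))  # api_key: sk-example

# An unset variable is left untouched, which later surfaces as an
# invalid-key or permission error from the API.
print(os.path.expandvars("api_key: ${_GRAPHRAG_UNSET_DEMO_}"))
```

If the placeholder prints back unexpanded, the key was never picked up, and any "permissions" symptom is really a missing environment variable.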

> I had the same problem and managed to resolve it. Check the log in the output directory. In my case, my OpenAI API key didn't have sufficient permissions; once I fixed that, indexing completed.

What key permissions did you give exactly?

johnmendez2 avatar Jul 31 '24 13:07 johnmendez2

Please provide the indexing-engine.log file located in the report directory inside output dir. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.

9prodhi avatar Aug 01 '24 00:08 9prodhi

> Please provide the indexing-engine.log file located in the report directory inside the output dir. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.

This is the log file.

indexing-engine.log

eeleedev avatar Aug 01 '24 03:08 eeleedev

It looks like the error message in the log is: 'The model gpt-4-turbo-preview does not exist or you do not have access to it.'

Upon checking, I noticed that the api_base setting in your configuration is commented out. For the model and API base to work correctly, you need to ensure that these fields are properly set.

If you're using an Ollama-based local setup, you might want to use the following configuration:

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral # your choice of model
  model_supports_json: true # recommended if available for your model
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1

9prodhi avatar Aug 01 '24 11:08 9prodhi

> It looks like the error message in the log is: 'The model gpt-4-turbo-preview does not exist or you do not have access to it.'
>
> Upon checking, I noticed that the api_base setting in your configuration is commented out. For the model and API base to work correctly, you need to ensure that these fields are properly set.
>
> If you're using an Ollama-based local setup, you might want to use the following configuration:
>
> llm:
>   api_key: ${GRAPHRAG_API_KEY}
>   type: openai_chat # or azure_openai_chat
>   model: mistral # your choice of model
>   model_supports_json: true # recommended if available for your model
>   # max_tokens: 4000
>   # request_timeout: 180.0
>   api_base: http://localhost:11434/v1

Thank you for your help, but the same error still occurs even after trying that. I've attached the logs. Could you please take a look at them?

logs.json

eeleedev avatar Aug 03 '24 16:08 eeleedev

It looks like there might be an issue with the embedding setup. The response from the embedding API might not be in the expected format.

Try using the following configuration with LM Studio as the API provider and check if it resolves the issue:

embeddings:
  ## Parallelization: Override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio

  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF
    api_base: http://localhost:1234/v1/

9prodhi avatar Aug 03 '24 17:08 9prodhi
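
A malformed embedding response like the one suspected here can be surfaced early with a small shape check. The sketch below validates a payload against the OpenAI-style `{"data": [{"embedding": [...]}]}` layout that `openai_embedding`-compatible servers return; both payloads are mocked for illustration, not captured from the failing run:

```python
def extract_embeddings(payload):
    """Pull embedding vectors out of an OpenAI-style response dict,
    raising a clear error instead of failing deep inside the pipeline."""
    if not isinstance(payload, dict) or "data" not in payload:
        raise ValueError(f"unexpected embedding response: {payload!r}")
    return [item["embedding"] for item in payload["data"]]

# Well-formed response: a list of objects, each carrying an "embedding" vector.
ok = {"data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}]}
print(extract_embeddings(ok))  # [[0.1, 0.2], [0.3, 0.4]]

# What a misconfigured endpoint may return instead of embeddings.
bad = {"error": {"message": "model not found"}}
try:
    extract_embeddings(bad)
except ValueError as e:
    print(e)
```

Running the configured api_base through a check like this (or simply curling the endpoint) quickly shows whether the server is returning embeddings or an error object.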

> Please provide the indexing-engine.log file located in the report directory inside the output dir. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.

Could someone please take a look at this for me? Uploading indexing-engine.log…

night666e avatar Aug 09 '24 06:08 night666e

logs.json

night666e avatar Aug 09 '24 06:08 night666e

> Please provide the indexing-engine.log file located in the report directory inside the output dir. This file contains detailed information about the issue you're experiencing, which will help us replicate the problem and find a solution.

Could someone please take a look at this for me? Uploading indexing-engine.log…

indexing-engine.log

night666e avatar Aug 09 '24 06:08 night666e

same

Friman04 avatar Aug 09 '24 09:08 Friman04

@eeleedev do you have a recent indexing-engine.log? The last one posted was missing api_base, resulting in a 404, but I'm not sure whether that was fixed based on the follow-up comments. Please double-check the Getting Started page for the baseline required parameters.

natoverse avatar Aug 09 '24 19:08 natoverse

This issue has now been resolved; I followed the guide on the following webpage to solve it:

https://fornewchallenge.tistory.com/entry/%F0%9F%93%8AGraphRAG-%EB%A7%88%EC%9D%B4%ED%81%AC%EB%A1%9C%EC%86%8C%ED%94%84%ED%8A%B8%EC%9D%98-%EA%B7%B8%EB%9E%98%ED%94%84%EA%B8%B0%EB%B0%98-RAG-%EC%A0%91%EA%B7%BC%EB%B2%95feat-Ollama

eeleedev avatar Aug 10 '24 06:08 eeleedev