
[Issue]: Large Dataset (200k input files, 100M input tokens): out of memory issue (xml.etree.ElementTree.ParseError)

Open COPILOT-WDP opened this issue 1 year ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues
  • [X] I have checked #657 to validate if my issue is covered by community support

Describe the issue

Dear Team - I am currently processing a large dataset (a 200k-row CSV file, each row approx. 500 tokens). The initial steps were going well until create_summarized_entities, which stops immediately with an error message that seems to indicate an out-of-memory issue. I am really keen to reach the final stage so I can run searches, and I am hopeful I can sort this out despite the size of the dataset.
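(Editor's note, a generic sketch rather than a graphrag feature: one way to keep a single indexing run bounded is to shard the input CSV into smaller files first. The function name, shard size, and output paths below are all hypothetical.)

```python
import csv
import itertools

def split_csv(src_path: str, rows_per_shard: int = 50_000):
    """Split a large CSV into smaller shards, repeating the header row
    in each shard. Yields the path of every shard file it writes."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i in itertools.count():
            chunk = list(itertools.islice(reader, rows_per_shard))
            if not chunk:
                break
            shard_path = f"{src_path}.part{i}.csv"
            with open(shard_path, "w", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(chunk)
            yield shard_path
```

Each shard could then be indexed separately, trading one huge merged graph for several smaller ones.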

I have 64 GB of RAM on my workstation (32 GB were previously not enough to generate the merged graph), and I am guessing the dataset size is creating the challenge. Are there places in the code where I could make changes to handle the loading of the graph?
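(Editor's note on the graph-loading question: the traceback in the logs fails inside `ET.fromstring`, which parses the entire serialized GraphML string in one shot. As a generic, stdlib-only illustration of the alternative, `xml.etree.ElementTree.iterparse` consumes a file-like object element by element; the tiny GraphML snippet below is made up, and this is a sketch of the technique, not graphrag's code.)

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a serialized GraphML document (illustrative only).
xml_text = "<graphml><graph><node id='a'/><node id='b'/></graph></graphml>"

# ET.iterparse consumes a file-like object incrementally, so each element
# can be processed and discarded instead of holding the whole tree at once.
node_ids = []
for _event, elem in ET.iterparse(io.StringIO(xml_text), events=("end",)):
    if elem.tag == "node":
        node_ids.append(elem.get("id"))
    elem.clear()  # release the element's children to keep memory flat

print(node_ids)  # prints ['a', 'b']
```

For a real GraphML file with an `xmlns` declaration, tag names would carry a namespace prefix like `{http://graphml.graphdrawing.org/xmlns}node`, so the tag check would need adjusting.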


Thank you!

Logs and screenshots

indexing-engine.log

10:31:09,7 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "summarize_descriptions" in create_summarized_entities: out of memory: line 1, column 0 details=None
10:31:09,7 graphrag.index.run ERROR error running workflow create_summarized_entities
Traceback (most recent call last):
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
    result = await result
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/graphrag/index/verbs/entities/summarize/description_summarize.py", line 184, in summarize_descriptions
    await get_resolved_entities(row, semaphore) for row in output.itertuples()
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/graphrag/index/verbs/entities/summarize/description_summarize.py", line 122, in get_resolved_entities
    graph: nx.Graph = load_graph(cast(str | nx.Graph, getattr(row, column)))
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/graphrag/index/utils/load_graph.py", line 11, in load_graph
    return nx.parse_graphml(graphml) if isinstance(graphml, str) else graphml
  File "<class 'networkx.utils.decorators.argmap'> compilation 4", line 3, in argmap_parse_graphml_1
    import gzip
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/networkx/utils/backends.py", line 633, in call
    return self.orig_func(*args, **kwargs)
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/networkx/readwrite/graphml.py", line 372, in parse_graphml
    glist = list(reader(string=graphml_string))
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/site-packages/networkx/readwrite/graphml.py", line 854, in call
    self.xml = fromstring(string)
  File "/opt/miniconda3/envs/py312_graphrag/lib/python3.12/xml/etree/ElementTree.py", line 1323, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: out of memory: line 1, column 0
10:31:09,10 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None


COPILOT-WDP avatar Aug 10 '24 09:08 COPILOT-WDP

Any ideas or suggestions on how to scale?

COPILOT-WDP avatar Aug 18 '24 18:08 COPILOT-WDP

I have a dataset of 1.2 billion tokens, and the lack of response here is concerning.

Tipik1n avatar Oct 05 '24 10:10 Tipik1n