ERROR:root:Failed to parse JSON even after cleanup
Traceback (most recent call last):
File "G:\agent_service\outside_tools\PageIndex\run_pageindex.py", line 67, in <module>
toc_with_page_number = page_index_main(args.pdf_path, opt)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\agent_service\outside_tools\PageIndex\pageindex\page_index.py", line 1102, in page_index_main
return asyncio.run(page_index_builder())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\asyncio\runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python312\Lib\asyncio\base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "G:\agent_service\outside_tools\PageIndex\pageindex\page_index.py", line 1077, in page_index_builder
structure = await tree_parser(page_list, opt, doc=doc, logger=logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "G:\agent_service\outside_tools\PageIndex\pageindex\page_index.py", line 1037, in tree_parser
toc_with_page_number = await meta_processor(
^^^^^^^^^^^^^^^^^^^^^
File "G:\agent_service\outside_tools\PageIndex\pageindex\page_index.py", line 991, in meta_processor
raise Exception('Processing failed')
Exception: Processing failed
Hi, thanks for reporting this and sorry for the delayed response. We’re currently a bit short on manpower and catching up on issues.
From the traceback you shared, the error seems to come from dirty JSON being produced by the LLM during the parsing step, which then fails even after our cleanup attempts. This typically happens when the model outputs extra text or formatting artifacts (e.g. markdown code fences or surrounding prose) that break JSON parsing.
To help us reproduce and fix this, could you please share (if it’s not private or sensitive):
- The document (or a minimal excerpt of it) that triggered the error
- Any specific options/flags you used when running PageIndex
With that, we can better trace where the JSON output goes wrong and strengthen the cleanup logic.
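In the meantime, a more forgiving extraction pass along these lines might help. This is only a minimal sketch of the kind of cleanup that could be strengthened, not PageIndex's actual cleanup code; the `extract_json` helper is hypothetical:

```python
import json
import re

def extract_json(raw: str):
    """Best-effort extraction of a JSON value from LLM output.

    Hypothetical helper illustrating a more forgiving cleanup pass,
    not the cleanup logic PageIndex currently uses.
    """
    # Strip markdown code fences like ```json ... ``` if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)

    # Fall back to the first '{' or '[' as the start of the JSON value.
    start = min((i for i in (raw.find("{"), raw.find("[")) if i != -1),
                default=-1)
    if start == -1:
        raise ValueError("no JSON found in model output")

    # Shrink the candidate span from the right until it parses,
    # which discards trailing prose the model appended.
    for end in range(len(raw), start, -1):
        try:
            return json.loads(raw[start:end])
        except json.JSONDecodeError:
            continue
    raise ValueError("could not parse JSON from model output")
```

The right-to-left shrink is quadratic in the worst case, so it only makes sense as a last-resort fallback after a plain `json.loads` fails, but it tolerates both leading prose and trailing commentary around the JSON payload.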
Thanks again for your patience and for bringing this up!