If the file name has spaces, it will simply fail

Open sliderss opened this issue 9 months ago • 3 comments

(.venv) skype@192 PageIndex % python3 run_pageindex.py --pdf_path /Users/skype/Documents/GitHub/PageIndex/docs/Websocket vs SSE- OpenAI.pdf
usage: run_pageindex.py [-h] [--pdf_path PDF_PATH] [--model MODEL]
                        [--toc-check-pages TOC_CHECK_PAGES]
                        [--max-pages-per-node MAX_PAGES_PER_NODE]
                        [--max-tokens-per-node MAX_TOKENS_PER_NODE]
                        [--if-add-node-id IF_ADD_NODE_ID]
                        [--if-add-node-summary IF_ADD_NODE_SUMMARY]
                        [--if-add-doc-description IF_ADD_DOC_DESCRIPTION]
run_pageindex.py: error: unrecognized arguments: vs SSE- OpenAI.pdf

(.venv) skype@192 PageIndex % python3 run_pageindex.py --pdf_path /Users/skype/Documents/GitHub/PageIndex/docs/Regulation Best Interest_proposed rule.pdf
usage: run_pageindex.py [-h] [--pdf_path PDF_PATH] [--model MODEL]
                        [--toc-check-pages TOC_CHECK_PAGES]
                        [--max-pages-per-node MAX_PAGES_PER_NODE]
                        [--max-tokens-per-node MAX_TOKENS_PER_NODE]
                        [--if-add-node-id IF_ADD_NODE_ID]
                        [--if-add-node-summary IF_ADD_NODE_SUMMARY]
                        [--if-add-doc-description IF_ADD_DOC_DESCRIPTION]
run_pageindex.py: error: unrecognized arguments: Best Interest_proposed rule.pdf

sliderss avatar Apr 18 '25 02:04 sliderss

Hi sliderss, thanks for raising this point.

If a file name includes spaces, either quote the whole path or escape each space. For example:

python3 run_pageindex.py --pdf_path "./example report.pdf"
# or
python3 run_pageindex.py --pdf_path ./example\ report.pdf

Hope this works.
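
To illustrate why the unquoted path fails, here is a minimal sketch that splits a command line the way a POSIX shell would (using `shlex.split`) and feeds the result to `argparse`. The parser below is a hypothetical stand-in that only models `--pdf_path`; it is not the actual parser in run_pageindex.py.

```python
import argparse
import shlex

# Hypothetical stand-in for the real parser: only --pdf_path is modeled.
parser = argparse.ArgumentParser(prog="run_pageindex.py")
parser.add_argument("--pdf_path")


def parse(command_line):
    """Split a command line like a POSIX shell, then parse the arguments."""
    argv = shlex.split(command_line)[2:]  # drop "python3 run_pageindex.py"
    # parse_known_args returns leftover tokens instead of exiting with
    # "unrecognized arguments", which makes the problem easy to inspect.
    return parser.parse_known_args(argv)


unquoted = parse("python3 run_pageindex.py --pdf_path ./example report.pdf")
quoted = parse('python3 run_pageindex.py --pdf_path "./example report.pdf"')
escaped = parse(r"python3 run_pageindex.py --pdf_path ./example\ report.pdf")

print(unquoted[0].pdf_path, unquoted[1])  # path is cut at the space
print(quoted[0].pdf_path)                 # full path in one argument
print(escaped[0].pdf_path)                # full path in one argument
```

In the unquoted case the shell hands the script two separate tokens, so `--pdf_path` only receives the part before the space and the remainder is reported as an unrecognized argument.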

zmtomorrow avatar Apr 19 '25 03:04 zmtomorrow

I had previously resolved the issue, but after pulling the latest code, I encountered an error when running the following command:

python3 run_pageindex.py --pdf_path '/Users/skype/Documents/GitHub/PageIndex/docs/2023-annual-report.pdf'

No corresponding JSON file was generated, and an error occurred during execution.

Let me know if you'd like to include the specific error message in the description, and I can help you translate or format that as well!

(.venv) skype@192 PageIndex % python3 run_pageindex.py --pdf_path '/Users/skype/Documents/GitHub/PageIndex/docs/2023-annual-report.pdf'
Parsing PDF...
start find_toc_pages
toc found
start detect_page_index
index found
process_toc_with_page_numbers
start_index: 1
start toc_transformer
start toc_index_extractor
Traceback (most recent call last):
  File "/Users/skype/Documents/GitHub/PageIndex/run_pageindex.py", line 35, in <module>
    toc_with_page_number = page_index_main(args.pdf_path, opt)
  File "/Users/skype/Documents/GitHub/PageIndex/pageindex/page_index.py", line 1022, in page_index_main
    structure = asyncio.run(tree_parser(page_list, opt, doc=doc, logger=logger))
  File "/opt/homebrew/Cellar/python@3.9/3.9.22/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/homebrew/Cellar/python@3.9/3.9.22/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/Users/skype/Documents/GitHub/PageIndex/pageindex/page_index.py", line 978, in tree_parser
    toc_with_page_number = await meta_processor(
  File "/Users/skype/Documents/GitHub/PageIndex/pageindex/page_index.py", line 919, in meta_processor
    toc_with_page_number = process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=opt.toc_check_page_num, model=opt.model, logger=logger)
  File "/Users/skype/Documents/GitHub/PageIndex/pageindex/page_index.py", line 640, in process_toc_with_page_numbers
    toc_with_page_number = process_none_page_numbers(toc_with_page_number,page_list, model)
  File "/Users/skype/Documents/GitHub/PageIndex/pageindex/page_index.py", line 668, in process_none_page_numbers
    page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n"
TypeError: unsupported operand type(s) for -: 'int' and 'str'

log file: 2023-annual-report.pdf_20250423_102104.json

sliderss avatar Apr 23 '25 02:04 sliderss

Hi @sliderss, thanks so much for reporting this and for the detailed traceback!

The issue was introduced by a commit on April 18 in which start_index was passed incorrectly. It's now fixed; please pull the latest code and try again.
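
For readers hitting the same TypeError, the failure can be reproduced in isolation. The sketch below is not the actual fix that was committed; it only assumes, based on the traceback, that `start_index` reached the subtraction as a string while `page_index` was an integer. The helper `build_page_text` and the sample `pages` data are hypothetical.

```python
def build_page_text(page_list, page_index, start_index):
    """Wrap one page's text in physical-index tags, mirroring the failing line."""
    # Coerce both values so that a string such as "1" still works.
    page_index = int(page_index)
    start_index = int(start_index)
    text = page_list[page_index - start_index][0]
    return f"<physical_index_{page_index}>\n{text}\n<physical_index_{page_index}>\n\n"


# Hypothetical page list: each entry is indexable and holds the text first.
pages = [("first page", 10), ("second page", 12)]

page_index = 2
start_index = "1"  # a string where an integer was expected

error_message = None
try:
    pages[page_index - start_index]
except TypeError as exc:
    error_message = str(exc)  # same message as in the traceback

print(error_message)
print(build_page_text(pages, page_index, start_index))
```

Coercing with `int()` at the point of use is only one option; converting the value once where it is first read keeps the rest of the code free of type checks.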

We really appreciate your feedback and support. Let us know if anything else comes up! 🙏

rejojer avatar Apr 23 '25 10:04 rejojer