run_pageindex.py doesn't return the text of the content and has wrong indices

Open dokato opened this issue 3 months ago • 1 comment

I followed the README and ran:

$ python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

on my document. However, run_pageindex.py doesn't return the text content, only summaries:

{
'doc_name': 'referee_guidelines_arm.pdf',
 'structure': [{'title': 'Referee Guidelines – Football Tournament',
   'start_index': 1,
   'end_index': 1,
   'nodes': [{'title': 'Match Duration',
     'start_index': 1,
     'end_index': 1,
     'node_id': '0001',
     'summary': 'The partial document outlines referee guidelines for the Football Tournament. It covers match duration (20 minutes with a running clock), procedures for starting the match (coin flip and kick-off), restarts after goals, and rules for throw-ins/outs (played from the ground, no direct goals). Discipline rules include no yellow cards, temporary time-outs for unsporting behavior, and expulsion for serious misconduct. In case of a draw, a penalty shootout is conducted. The document emphasizes fair play, safety, respect, and the finality of referee decisions.'},]
#...
}

Also, start_index and end_index both being 1 looks kinda wrong, doesn't it?

Oct 10 '25 17:10 dokato

The output isn’t wrong. start_index and end_index are both inclusive page numbers, so a node with start_index = 1 and end_index = 1 simply sits entirely on page 1.

By default, it returns only summaries. If you want the text content, use the --add-node-text flag, or simply read it from the underlying list of page texts using those indices.
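
For the second option, here is a minimal sketch of reading node text by index. It is not code from the repository: node_text and walk are hypothetical helpers, pages is a made-up stand-in for a list holding one text string per PDF page, and the indices are assumed to be 1-based (the sample output starts at 1), so page p sits at list position p - 1.

```python
from typing import Dict, Iterator, List


def node_text(pages: List[str], node: Dict) -> str:
    """Join the text of pages start_index..end_index (inclusive, assumed 1-based)."""
    start, end = node["start_index"], node["end_index"]
    # Page p lives at list position p - 1. A slice's end is exclusive,
    # so using `end` unchanged keeps page `end` in the result.
    return "\n".join(pages[start - 1:end])


def walk(nodes: List[Dict]) -> Iterator[Dict]:
    """Yield every node in the tree, parents before children."""
    for node in nodes:
        yield node
        yield from walk(node.get("nodes", []))


# Hypothetical per-page text; in practice this comes from your PDF reader.
pages = ["text of page 1", "text of page 2", "text of page 3"]

# Trimmed-down version of the 'structure' list from the output above.
structure = [
    {
        "title": "Referee Guidelines",
        "start_index": 1,
        "end_index": 1,
        "nodes": [{"title": "Match Duration", "start_index": 1, "end_index": 1}],
    },
]

# Map each node title to the text of the pages it spans.
texts = {node["title"]: node_text(pages, node) for node in walk(structure)}
```

With both indices equal to 1, the slice is pages[0:1], i.e. exactly the first page, which is why the values in the output above are consistent.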

Nov 25 '25 10:11 rejojer