PageIndex
run_pageindex.py doesn't return the text of the content and has wrong indices
I followed the README and tried:
$ python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
on my document, however, run_pageindex.py doesn't return the text of the content, only summaries:
{
'doc_name': 'referee_guidelines_arm.pdf',
'structure': [{'title': 'Referee Guidelines – Football Tournament',
'start_index': 1,
'end_index': 1,
'nodes': [{'title': 'Match Duration',
'start_index': 1,
'end_index': 1,
'node_id': '0001',
'summary': 'The partial document outlines referee guidelines for the Football Tournament. It covers match duration (20 minutes with a running clock), procedures for starting the match (coin flip and kick-off), restarts after goals, and rules for throw-ins/outs (played from the ground, no direct goals). Discipline rules include no yellow cards, temporary time-outs for unsporting behavior, and expulsion for serious misconduct. In case of a draw, a penalty shootout is conducted. The document emphasizes fair play, safety, respect, and the finality of referee decisions.'},]
#...
}
Also, start_index and end_index both being 1 is kinda wrong, isn't it?
The output isn't wrong. start_index and end_index are both inclusive page numbers, so a node with start_index 1 and end_index 1 sits entirely on page 1.
By default it only returns summaries. If you want the text content, use the --add-node-text flag, or simply read it from the underlying page text list using those indices.
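For the second option, here is a minimal sketch of reading node text from the page list. It assumes you already have the page texts as a Python list with one string per page, in page order; the names `pages`, `node_text` and `walk` are made up for illustration and are not part of PageIndex, and the sample data below is a stand-in for your real document.

```python
def node_text(pages, node):
    """Join the text of every page a node covers.

    Assumption: pages[0] holds page 1, and start_index / end_index
    are 1-based inclusive page numbers, as described above.
    """
    start, end = node["start_index"], node["end_index"]
    # Shift the start to 0-based; the inclusive end is already the
    # correct exclusive slice bound.
    return "\n".join(pages[start - 1:end])


def walk(nodes):
    """Yield every node in the structure tree, depth first."""
    for node in nodes:
        yield node
        yield from walk(node.get("nodes", []))


# Hypothetical stand-ins for the real page texts and the tree in the output.
pages = ["text of page 1", "text of page 2"]
structure = [{
    "title": "Referee Guidelines",
    "start_index": 1,
    "end_index": 1,
    "nodes": [{
        "title": "Match Duration",
        "start_index": 1,
        "end_index": 1,
        "node_id": "0001",
    }],
}]

# Map each node title to the text of the pages it spans.
texts = {node["title"]: node_text(pages, node) for node in walk(structure)}
```

Because the indices are page numbers, a parent and its child that start on the same page get the same page text; the indices say where a section lives, not where it begins within the page.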