Can it be used for non-structured docs?
Greetings to the creators of PageIndex,
I recently discovered your library, and it seems to be an excellent solution.
From the description, it looks ideal for structured documents such as technical or financial reports. However, I would like to know if it can serve these use cases:
- Can it fetch context relevant to a user query when that context includes previous LLM chats or uploaded documents, out of the box, or is manual preprocessing needed? These sources are mostly unstructured and may contain spelling and grammatical errors, especially chat logs.
- Does it rely on the model's context limit? For example, if the model has a maximum input of 128k tokens but the context is over 2 million tokens, does the library handle this automatically, or would manual chunking and splitting be required?
- Is it limited to processing PDF or text-based content, or can it handle images and screenshots as well, whether embedded in PDFs or uploaded separately?
- Is it possible to save the tree structure for future use, to enable faster processing and retrieval?
I am considering using this library in my project because it appears to address context relevance challenges in long chats with multiple documents.
I look forward to your response!
Thanks and regards, Vineet Mangal
Hi Vineet,
Thank you for your thoughtful questions and for your interest in PageIndex! We are sorry for not getting back to you sooner: we are currently a bit short on manpower and catching up on issues.
- Handling unstructured text: PageIndex is optimized for structured documents like reports, but it can also index unstructured sources such as chat logs or free-form text. You don't need heavy preprocessing; basic cleanup (removing irrelevant metadata and formatting inconsistencies) is usually enough. Spelling or grammar errors aren't an issue, since retrieval is based on hierarchical structure and semantic signals, not strict keyword matching.
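For chat logs specifically, a light normalization pass before indexing is usually all you need. As a rough illustration of the kind of cleanup we mean (this helper is our own sketch, not part of the PageIndex API, and the metadata pattern is just an example):

```python
import re

def clean_chat_log(raw: str) -> str:
    """Light cleanup before indexing: drop metadata-only lines and
    normalize whitespace. Spelling errors are left as-is, since
    retrieval does not depend on exact keyword matches."""
    lines = []
    for line in raw.splitlines():
        # Skip timestamp/metadata lines like "[2024-01-05 10:32] joined"
        if re.fullmatch(r"\[[^\]]*\]\s*(joined|left)?", line.strip()):
            continue
        # Collapse runs of whitespace inside each line
        lines.append(re.sub(r"\s+", " ", line).strip())
    # Drop lines that ended up empty
    return "\n".join(l for l in lines if l)
```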
- Context length vs. model limits: PageIndex is designed to bypass traditional context length barriers. Instead of feeding the entire document into the model, it builds a navigable tree and only retrieves the most relevant segments in response to a query. This means you can index millions of tokens (or hundreds of pages) without worrying about the model's token limit; no manual chunking or splitting is required.
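Conceptually, the retrieval step works like a table-of-contents lookup: the tree is walked top-down, only promising branches are expanded, and only the matching leaves' text ever reaches the model. A minimal sketch of that idea (the `Node` class and `select` callback are illustrative stand-ins, not PageIndex's actual data structures; in practice an LLM plays the role of `select`):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    summary: str
    text: str = ""                                   # leaf content
    children: list["Node"] = field(default_factory=list)

def retrieve(node: Node, query: str, select) -> list[str]:
    """Walk the tree top-down; `select` stands in for an LLM call
    that picks which children look relevant to the query. Only the
    chosen branches are expanded, so total tree size is independent
    of the model's context window."""
    if not node.children:
        return [node.text]
    out = []
    for child in select(node.children, query):
        out.extend(retrieve(child, query, select))
    return out
```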
- Supported content types (PDF, text, images): At present, PageIndex works best with PDFs and text-based content. Images and screenshots embedded in PDFs can be handled if OCR is enabled, but standalone image support (e.g., uploaded PNG/JPEGs) is not native yet. That said, we are actively exploring multi-modal extensions.
- Saving the tree structure: Yes, once a document is processed into a PageIndex tree, the structure can be persisted and reused. This way, you don't need to re-process the same files every time; you can query them immediately, and retrieval is faster.
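Since the tree is plain hierarchical data, persisting it can be as simple as writing JSON to disk. A sketch under that assumption (the helper names and file layout here are ours, not the library's):

```python
import json
from pathlib import Path

def save_tree(tree: dict, path: str) -> None:
    """Write a generated index tree to disk so later sessions can
    query it immediately instead of re-processing the document."""
    Path(path).write_text(json.dumps(tree, indent=2), encoding="utf-8")

def load_tree(path: str) -> dict:
    """Load a previously saved index tree."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```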
- Long chats with multiple documents: This is exactly the kind of scenario PageIndex was built for. By treating chats and documents as a unified index, it allows you to retrieve the most relevant pieces of context from across different sources, without worrying about exceeding model limits.
We'd love to hear more about your project and your use case. PageIndex is under active development, and feedback like yours helps us prioritize improvements.
Thanks again for reaching out, and please don't hesitate to follow up with more questions!
Best regards, PageIndex Team