
feat: Add table extraction in `DOCXToDocument`

Open medsriha opened this issue 1 year ago • 4 comments

Is your feature request related to a problem? Please describe.
The current version of DOCXToDocument does not extract tables from DOCX documents.

Describe the solution you'd like
The ability to extract tables from documents while preserving their original structure.

medsriha avatar Sep 27 '24 16:09 medsriha

@julian-risch I'll self-assign this one and coordinate with @medsriha.

vblagoje avatar Oct 15 '24 08:10 vblagoje

@medsriha and @sjrl, if we iterate over the document's body elements:

import docx

document = docx.Document("test_files/docx/sample_docx.docx")
list(document.element.body)

we can get the following data:

[<CT_P '<w:p>' at 0x156f771b0>,
 <CT_P '<w:p>' at 0x156f77750>,
 <CT_P '<w:p>' at 0x156f74dc0>,
 <CT_P '<w:p>' at 0x156f767b0>,
 <CT_P '<w:p>' at 0x156f763a0>,
 <CT_P '<w:p>' at 0x156f740a0>,
 <CT_P '<w:p>' at 0x156f754a0>,
 <CT_P '<w:p>' at 0x156f752c0>,
 <CT_Tbl '<w:tbl>' at 0x156f8f520>,
 <CT_P '<w:p>' at 0x156f75090>,
 <CT_P '<w:p>' at 0x156f74460>,
 <CT_P '<w:p>' at 0x1579a8c80>,
 <CT_P '<w:p>' at 0x1579ab430>,
 <CT_P '<w:p>' at 0x1579a88c0>,
 <CT_P '<w:p>' at 0x1579abca0>,
 <CT_SectPr '<w:sectPr>' at 0x156f8eb20>]

If you load that test docx file in Word/Pages you can see that this is the natural order of paragraphs and tables in the document.

Therefore, I think we can adapt _extract_paragraphs_with_page_breaks in our DOCXToDocument to iterate over the body elements and inject tables as text at the appropriate locations. As @sjrl said, "table stays where it's meant to be and can then be processed by the LLM in the correct context".

What do you think of this approach?

vblagoje avatar Oct 15 '24 10:10 vblagoje

@vblagoje thanks for looking into this!

While we're here, could we inject all types of objects into the text? For example, I see that CT_SectPr is probably also missed right now, right?

For the tables, I think it might make sense to optionally allow them to be extracted separately. What do you think, @medsriha and @mathislucka?

sjrl avatar Oct 15 '24 10:10 sjrl

Yes, they are missed, @sjrl. This approach seems simpler and respects the natural order for context; we just need to convert each object to text. And yes, optional separate extraction shouldn't be hard either.
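To illustrate the optional part, here is a minimal sketch that assumes the body has already been turned into an ordered list of blocks. The `("paragraph", text)` / `("table", rows)` tuples, the `tables_inline` flag, and the CSV rendering are all hypothetical choices for this example, not an existing DOCXToDocument parameter or format:

```python
import csv
import io


def rows_to_csv(rows):
    """Render a list of rows (each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().rstrip("\n")


def assemble(blocks, tables_inline=True):
    """Join blocks in document order.

    blocks: list of ("paragraph", str) or ("table", list of rows) tuples.
    Returns (text, tables). With tables_inline=True the tables are injected
    into the text where they occur; otherwise they are returned separately.
    """
    text_parts, tables = [], []
    for kind, payload in blocks:
        if kind == "paragraph":
            text_parts.append(payload)
        elif kind == "table":
            rendered = rows_to_csv(payload)
            if tables_inline:
                text_parts.append(rendered)
            else:
                tables.append(rendered)
    return "\n".join(text_parts), tables
```

With `tables_inline=False` the text keeps only the paragraphs and the tables come back as a separate list, so both behaviours can share the same ordered traversal.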

vblagoje avatar Oct 15 '24 11:10 vblagoje