docx2python

Feature Request: Add support for "strict" format

Open Spectre5 opened this issue 1 year ago • 7 comments

It would be great if the strict format were supported for docx/docm files. I think it basically just requires different namespace tags. Here are the tags used in a similar project, mammoth: https://github.com/mwilliamson/python-mammoth/blob/master/mammoth/docx/office_xml.py. It does not include as many tags as docx2python does, though, so I'm not sure what the strict-format equivalents are for some of the other tags in this library's namespace.py file.

If you have a file that includes all of the tags for this library, then you could save it in strict format to see what those tags become.

Spectre5 avatar Jun 19 '24 17:06 Spectre5

What would the advantage be with regard to text extraction?

ShayHill avatar Jun 19 '24 19:06 ShayHill

Well, right now the library cannot extract text from a strict docx file. We have some automatically created docx files saved in the strict format, and I was hoping to parse their text.
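For context, here is a minimal sketch of why extraction can come back empty, assuming the parser looks elements up under a hard-coded Transitional namespace. The two URIs are the WordprocessingML main namespaces used by Transitional and Strict documents. The sample XML and the `paragraph_texts` helper are made up for illustration and are not docx2python code.

```python
import xml.etree.ElementTree as ET

# WordprocessingML "main" namespace in Transitional and Strict documents.
TRANSITIONAL_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_W = "http://purl.oclc.org/ooxml/wordprocessingml/main"

# Hypothetical, heavily reduced stand-in for word/document.xml.
TEMPLATE = (
    '<w:document xmlns:w="{ns}"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def paragraph_texts(xml_text: str, w_namespace: str) -> list[str]:
    """Return the text of each w:p, looking tags up under one fixed namespace."""
    root = ET.fromstring(xml_text)
    p_tag = f"{{{w_namespace}}}p"
    t_tag = f"{{{w_namespace}}}t"
    return [
        "".join(t.text or "" for t in p.iter(t_tag))
        for p in root.iter(p_tag)
    ]


strict_xml = TEMPLATE.format(ns=STRICT_W)

# Looking up Transitional tags in a Strict document finds no paragraphs.
print(paragraph_texts(strict_xml, TRANSITIONAL_W))
# Looking up Strict tags in the same document finds the text.
print(paragraph_texts(strict_xml, STRICT_W))
```

The element names (`p`, `r`, `t`) are the same in both flavors. Only the namespace URI behind the `w` prefix differs, which is why a fixed tag table silently matches nothing.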

Spectre5 avatar Jun 19 '24 19:06 Spectre5

I will have a look around. Thank you.

ShayHill avatar Jun 19 '24 19:06 ShayHill

I took a look at this. Currently, docx2python v2 explicitly defines namespaces. This is a legacy of docx2python v1, which used the xml module from the standard library. The right way to handle strict files, and other surprises, is to load the namespaces from each input document and create the tags dynamically. I want to do this, but I fear it might break some projects out there, so I am going to plan it for docx2python 3, which I might create over the next few weekends.
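One way to sketch that idea with only the standard library is shown below. This is an illustration, not necessarily how docx2python 3 does it. The helper names `read_namespaces` and `qualified` are hypothetical, and the sketch assumes the document declares the conventional `w` prefix.

```python
import io
import xml.etree.ElementTree as ET


def read_namespaces(xml_bytes: bytes) -> dict[str, str]:
    """Map each namespace prefix declared in the XML to its URI."""
    namespaces: dict[str, str] = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        # Keep the first declaration if a prefix is declared more than once.
        namespaces.setdefault(prefix, uri)
    return namespaces


def qualified(namespaces: dict[str, str], name: str) -> str:
    """Turn 'w:p' into '{uri}p' using the document's own namespace map."""
    prefix, local = name.split(":")
    return f"{{{namespaces[prefix]}}}{local}"


# Hypothetical, reduced Strict document used only to exercise the helpers.
sample = (
    b'<w:document xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main">'
    b"<w:body><w:p/></w:body></w:document>"
)
ns = read_namespaces(sample)
print(qualified(ns, "w:p"))
```

With a real file, `xml_bytes` would typically come from reading `word/document.xml` out of the docx archive with `zipfile.ZipFile`. Because the tags are built from whatever URIs the document declares, the same lookup code works for both Transitional and Strict files.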

ShayHill avatar Jun 26 '24 23:06 ShayHill

I agree that would be the best way to handle it. For what it's worth, that is what pylightxl does for .xlsx/.xlsm files, if you want some inspiration.

Spectre5 avatar Jun 27 '24 04:06 Spectre5

I uploaded a branch that should work with strict docx files.

https://github.com/ShayHill/docx2python/tree/v3

If you try it, please let me know if you have any files that don't work. I will release this on PyPI once I have made a few other v3 updates.

ShayHill avatar Jul 02 '24 13:07 ShayHill

I haven't had a chance to try it yet - but will try to soon.

Spectre5 avatar Jul 10 '24 06:07 Spectre5

Version 3.0.0 is now up on PyPI. It should work with strict Word files.

ShayHill avatar Jul 27 '24 22:07 ShayHill