Feature Request: Add support for "strict" format
It would be great if the strict format were supported for docx/docm files. I think it basically just requires different namespace tags to be used. Here are the tags used in a similar project, mammoth (https://github.com/mwilliamson/python-mammoth/blob/master/mammoth/docx/office_xml.py). It does not include as many tags as docx2python does, though, so I'm not sure what the strict-format equivalents are for some of the other tags in this library's namespace.py file.
If you have a file that includes all of the tags for this library, then you could save it in strict format to see what those tags become.
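To illustrate why a strict file yields no text today, here is a minimal sketch using only the standard library. The two namespace URIs are the main WordprocessingML namespaces as I understand them for transitional and strict files; please verify them against a real document. `make_document` and `find_text_hardcoded` are hypothetical helpers for this example, not part of docx2python or mammoth.

```python
import xml.etree.ElementTree as ET

# Assumed main WordprocessingML namespace URIs (check against a real file).
TRANSITIONAL_W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_W = "http://purl.oclc.org/ooxml/wordprocessingml/main"


def make_document(w_uri: str) -> bytes:
    """Build a tiny stand-in for word/document.xml using the given namespace."""
    return (
        f'<w:document xmlns:w="{w_uri}">'
        "<w:body><w:p><w:r><w:t>hello</w:t></w:r></w:p></w:body>"
        "</w:document>"
    ).encode("utf-8")


def find_text_hardcoded(xml_bytes: bytes) -> list:
    """Look up text runs with a tag that hard-codes the transitional URI."""
    root = ET.fromstring(xml_bytes)
    text_tag = f"{{{TRANSITIONAL_W}}}t"
    return [elem.text for elem in root.iter(text_tag)]


transitional_hits = find_text_hardcoded(make_document(TRANSITIONAL_W))
strict_hits = find_text_hardcoded(make_document(STRICT_W))
print(transitional_hits)  # ['hello']
print(strict_hits)        # []
```

The element names are identical in both documents; only the namespace URI differs, which is enough for a hard-coded qualified tag to match nothing.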
What would the advantage be with regard to text extraction?
Well, right now the library cannot extract text from a strict docx file. We have some automatically generated docx files saved in the strict format, and I was hoping to parse their text.
I will have a look around. Thank you.
I took a look at this. Currently, docx2python v2 explicitly defines namespaces. This is a legacy of docx2python v1, which used the xml module from the standard library. The way to handle strict and other surprises should be to load the namespaces from the input documents and dynamically create tags. I want to do this, but fear it might break some projects out there, so I am going to plan this for docx2python 3, which I might create over the next few weekends.
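A rough sketch of that idea, using the standard library rather than whatever docx2python 3 ends up doing: read the namespace declarations from the document itself, then build qualified tags from them. `read_namespaces`, `qualify`, and `extract_text` are hypothetical names, the sample documents are made up, and the sketch assumes the file uses the conventional `w` prefix, which is not guaranteed.

```python
import io
import xml.etree.ElementTree as ET


def read_namespaces(xml_bytes: bytes) -> dict:
    """Collect the prefix -> URI declarations found in the document."""
    nsmap = {}
    for _event, (prefix, uri) in ET.iterparse(
        io.BytesIO(xml_bytes), events=("start-ns",)
    ):
        # Keep the first declaration seen for each prefix.
        nsmap.setdefault(prefix, uri)
    return nsmap


def qualify(nsmap: dict, prefixed_name: str) -> str:
    """Turn 'w:t' into '{uri}t' using the document's own namespace map."""
    prefix, local = prefixed_name.split(":")
    return f"{{{nsmap[prefix]}}}{local}"


def extract_text(xml_bytes: bytes) -> list:
    """Return the text of every w:t element, whatever URI 'w' maps to."""
    nsmap = read_namespaces(xml_bytes)
    text_tag = qualify(nsmap, "w:t")
    root = ET.fromstring(xml_bytes)
    return [elem.text or "" for elem in root.iter(text_tag)]


# Made-up samples standing in for word/document.xml in each flavor.
STRICT_SAMPLE = (
    b'<w:document xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main">'
    b"<w:body><w:p><w:r><w:t>strict text</w:t></w:r></w:p></w:body>"
    b"</w:document>"
)
TRANSITIONAL_SAMPLE = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main">'
    b"<w:body><w:p><w:r><w:t>transitional text</w:t></w:r></w:p></w:body>"
    b"</w:document>"
)

print(extract_text(STRICT_SAMPLE))
print(extract_text(TRANSITIONAL_SAMPLE))
```

Because the tags are derived from each input file, the same extraction code handles both flavors without a table of hard-coded URIs.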
I agree that would be the best way to handle it. For what it's worth, that is what pylightxl does for .xlsx/.xlsm files, if you want some inspiration.
I uploaded a branch that should work with strict docx files.
https://github.com/ShayHill/docx2python/tree/v3
If you try it, please let me know if you have any files that don't work. I will release this on PyPI once I have made a few other v3 updates.
I haven't had a chance to try it yet, but I will soon.
Version 3.0.0 is now up on PyPI. It should work with strict Word files.