fix(ooxml): regexp does not replace the whole tag in every case
Hey !
I just figured out that the externalData is not a leaf. Thus, sometimes, the DOM was just fucked up because the replacement was a big #fail.
Tell me if you see anything wrong :-)
Thanks for the patch!
Do you believe this regex is "enough", or should we resort to some XML parsing library just to be sure?
& the regex does not match < c:externalData/> because of the extra space 😕
The more I look at the code the more I think it could be a nice option. I'm only afraid about how slow it could be to parse the DOM of every single file. But we should give it a try and see. Probably that disabling the DTD/schema checking would not be that slow.
What do you think?
If performance is really an issue, we might read the text looking for "externalData" (the check is already in place).
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // <-
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(inputFile); // Using the stream *might* improve performances, right?
NodeList nodes = doc.getElementsByTagName("externalData"); // <- Without the NS
for (int i = 0; i < nodes.getLength(); i++) {
// Remove? Empty it?
// getParentNode then removeChild?
}
I never used XML APIs in Java (and did not test the above code) so I can't judge the performance impact, but this does not depend on XPath tricks and should be fast enough, right? 😕
And we don't have to parse every files, do we? I don't have samples to check but I can only guess the "externalData" must be in a specific file, and using the relations we could find it. Am I right?