DocBleach icon indicating copy to clipboard operation
DocBleach copied to clipboard

fix(ooxml): regexp does not replace the whole tag in every case

Open PandiPanda69 opened this issue 8 years ago • 4 comments

Hey !

I just figured out that the externalData is not a leaf. Thus, sometimes, the DOM was just fucked up because the replacement was a big #fail.

Tell me if you see anything wrong :-)

PandiPanda69 avatar Aug 29 '17 17:08 PandiPanda69

Thanks for the patch!

Do you believe this regex is "enough", or should we resort to some XML parsing library just to be sure? & the regex does not match < c:externalData/> because of the extra space 😕

punkeel avatar Aug 29 '17 21:08 punkeel

The more I look at the code the more I think it could be a nice option. I'm only afraid about how slow it could be to parse the DOM of every single file. But we should give it a try and see. Probably that disabling the DTD/schema checking would not be that slow.

What do you think?

PandiPanda69 avatar Aug 31 '17 10:08 PandiPanda69

If performance is really an issue, we might read the text looking for "externalData" (the check is already in place).

DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // <-
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(inputFile); // Using the stream *might* improve performances, right?

NodeList nodes = doc.getElementsByTagName("externalData"); // <- Without the NS
for (int i = 0; i < nodes.getLength(); i++) {
// Remove? Empty it?
// getParentNode then removeChild?
}

I never used XML APIs in Java (and did not test the above code) so I can't judge the performance impact, but this does not depend on XPath tricks and should be fast enough, right? 😕

And we don't have to parse every files, do we? I don't have samples to check but I can only guess the "externalData" must be in a specific file, and using the relations we could find it. Am I right?

punkeel avatar Aug 31 '17 16:08 punkeel