
Parse internal DTDs in doctype declaration

Open paulbijnens opened this issue 9 years ago • 2 comments

Until now, the documentation has stated that an internal DTD inside the doctype declaration confuses HTML::Parser. Depending on the content of that internal DTD, the parser would return a text token instead, or sometimes a declaration token that also swallowed many of the elements and text appearing after the syntactically correct declaration. The old implementation did allow an empty internal DTD, as in: <!DOCTYPE abc SYSTEM "abc.dtd" [] >

This patch allows non-empty internal DTDs inside the square brackets of the doctype declaration and returns the whole internal DTD as a single token in the token list, similar to the token containing just "[]" in the old implementation. For example, it now correctly parses:

<!DOCTYPE abc SYSTEM "abc.dtd" [
<!-- even a simple comment here would confuse it -->
<!-- or quoted strings with special chars like ]> -->
<!ENTITY confuse "]>">
] >
<abc>Hello world</abc>
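
To make the intended tokenization concrete, here is a minimal Python sketch of the scanning idea. It is not the code of this patch (HTML::Parser itself is Perl/XS), and the function names `split_doctype` and `_skip_internal_subset` are hypothetical. It assumes that only comments and quoted strings can hide a "]" or ">" inside the internal DTD, and it keeps the whole bracketed subset as one token:

```python
def _skip_internal_subset(text, i):
    """Given text[i] == '[', return the index just past the matching ']'.

    Comments and quoted strings are skipped whole, so a ']' or '>' inside
    them does not end the subset. Returns -1 if the subset is unterminated.
    """
    n = len(text)
    i += 1
    while i < n:
        if text.startswith("<!--", i):          # comment: jump to its end
            end = text.find("-->", i + 4)
            if end == -1:
                return -1
            i = end + 3
        elif text[i] in "\"'":                  # quoted string: jump past it
            end = text.find(text[i], i + 1)
            if end == -1:
                return -1
            i = end + 1
        elif text[i] == "]":                    # real end of the subset
            return i + 1
        else:
            i += 1
    return -1


def split_doctype(text):
    """Split a leading <!DOCTYPE ...> declaration into tokens.

    Returns (tokens, rest), where rest is the text after the closing '>'.
    The internal subset, if present, is one token including its brackets.
    """
    if not text.upper().startswith("<!DOCTYPE"):
        raise ValueError("text does not start with a doctype declaration")
    tokens = ["DOCTYPE"]
    i, n = len("<!DOCTYPE"), len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == ">":                         # end of the declaration
            return tokens, text[i + 1:]
        elif ch in "\"'":                       # quoted literal, quotes kept
            end = text.find(ch, i + 1)
            if end == -1:
                break
            tokens.append(text[i:end + 1])
            i = end + 1
        elif ch == "[":                         # internal DTD as one token
            end = _skip_internal_subset(text, i)
            if end == -1:
                break
            tokens.append(text[i:end])
            i = end
        else:                                   # bare name or keyword
            j = i
            while j < n and not text[j].isspace() and text[j] not in "[>\"'":
                j += 1
            tokens.append(text[i:j])
            i = j
    raise ValueError("unterminated doctype declaration")


if __name__ == "__main__":
    tokens, rest = split_doctype('<!DOCTYPE abc SYSTEM "abc.dtd" [] >\n<abc/>')
    print(tokens)
    print(repr(rest))
```

Checking for a comment before checking for a quote matters in this sketch: an apostrophe inside a comment would otherwise be taken for the start of a quoted string.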

Paul (Ten years after my previous small patch, but still using this very nice Perl module, one of the only ones that allows for sane parsing of SGML-like files with errors in them.)

paulbijnens avatar Jun 01 '16 14:06 paulbijnens

Wait a moment -- still some bug.

paulbijnens avatar Jun 03 '16 15:06 paulbijnens

OK. Now it correctly parses all the possible forms of comments inside the internal DTD. Can you have a look now? Feedback welcome.

paulbijnens avatar Jun 06 '16 09:06 paulbijnens