huge_tree option for XML parser
I ran into a situation where I was asking pydantic-xml to parse very large XML documents, and was receiving errors like: lxml.etree.XMLSyntaxError: CData section too big found, line 66036, column 194.
According to https://lxml.de/apidoc/lxml.etree.html#lxml.etree.XMLParser, this limit can be lifted with the huge_tree=True parameter, which disables the parser's security restrictions and allows very deep trees and very long text content. However, there does not appear to be a way to enable this for the pydantic-xml parser.
I was able to solve my issues by monkey-patching like so:
    from pydantic_xml.model import BaseXmlModel
    from lxml import etree


    def _from_xml(cls, source, context=None):
        """
        Deserializes an xml string to an object of `cls` type.

        :param source: xml string
        :param context: pydantic validation context
        :return: deserialized object
        """
        parser = etree.XMLParser(huge_tree=True)
        return cls.from_xml_tree(etree.fromstring(source, parser), context=context)


    BaseXmlModel.from_xml = classmethod(_from_xml)
It would be nice if huge_tree were exposed as part of the interface when running from_xml. Is this change desirable? If so, I'd be happy to write a PR. Maybe even something a bit more general to be able to pass arbitrary arguments to the XMLParser creation.
@JordanBarnartt Thanks for the feedback.
I added this feature in 2.15.0.
Thanks for looking at this so quickly!
Looking at the PRs, this will definitely solve my issue, but I'm a bit worried that the way it's implemented may freeze the API of from_xml. If there's any chance that you will want to add some other explicit parameters to this function later, it would potentially break backward compatibility.
Maybe something like (untested):
    @classmethod
    def from_xml(
        cls: Type[ModelT],
        source: Union[str, bytes],
        context: Optional[Dict[str, Any]] = None,
        parser_options: Optional[Dict[str, Any]] = None,
    ) -> ModelT:
        """
        Deserializes an XML string to an object of `cls` type.

        :param source: XML string or bytes.
        :param context: Optional validation context.
        :param parser_options: Dictionary of options for XMLParser.
        :return: Deserialized object.
        """
        if parser_options is None:
            parser_options = {}
        parser = etree.XMLParser(**parser_options)
        tree = etree.fromstring(source, parser=parser)
        return cls.from_xml_tree(tree, context=context)
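To make the backward-compatibility worry concrete, here is a small self-contained sketch. It assumes the concern is about forwarding arbitrary keyword arguments to the parser; it does not reproduce the actual 2.15.0 code. Every name in it (fake_parse, from_xml_v1, from_xml_v2, from_xml_dict, the strict option) is a hypothetical stand-in.

```python
from typing import Any, Dict, Optional


def fake_parse(source: str, **options: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for a parser backend; it just echoes its inputs."""
    return {"source": source, "options": options}


def from_xml_v1(source: str, context: Optional[dict] = None, **kwargs: Any) -> Dict[str, Any]:
    # First release: every extra keyword is forwarded to the backend.
    return fake_parse(source, **kwargs)


def from_xml_v2(
    source: str,
    context: Optional[dict] = None,
    strict: bool = False,
    **kwargs: Any,
) -> Dict[str, Any]:
    # Later release: a new explicit 'strict' parameter is added. Callers that
    # used to pass strict=... through to the backend now hit this parameter
    # instead, and the backend no longer receives it.
    return fake_parse(source, **kwargs)


def from_xml_dict(
    source: str,
    context: Optional[dict] = None,
    parser_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Backend options live in their own dict, so new explicit parameters
    # can never collide with them.
    return fake_parse(source, **(parser_options or {}))


before = from_xml_v1("<a/>", strict=True)
after = from_xml_v2("<a/>", strict=True)
namespaced = from_xml_dict("<a/>", parser_options={"strict": True})
```

In this sketch the same call silently changes meaning between from_xml_v1 and from_xml_v2, while the dict-based variant keeps the two namespaces apart.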
@JordanBarnartt Hi,
The library is supposed to be XML-parser agnostic, so it must be able to work with different XML parser backends, not only lxml. Passing parser_options explicitly could cause problems if a new backend has no parser options or has a different API.
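The portability problem can be illustrated with the standard library alone. In the sketch below, std_backend and tolerant_backend are hypothetical stand-ins for two parser backends, not part of pydantic-xml; the point is only that one options dict cannot be assumed to work everywhere.

```python
import xml.etree.ElementTree as ET
from typing import Any


def std_backend(source: str, **options: Any) -> ET.Element:
    # Hypothetical strict backend: it has no huge_tree-style switch,
    # so any option it does not know about is rejected.
    if options:
        raise TypeError(f"unsupported parser options: {sorted(options)}")
    return ET.fromstring(source)


def tolerant_backend(source: str, huge_tree: bool = False) -> ET.Element:
    # Hypothetical backend that accepts a huge_tree flag. The flag is ignored
    # here because the stdlib parser exposes no such limit to toggle.
    return ET.fromstring(source)


options = {"huge_tree": True}
root = tolerant_backend("<a>1</a>", **options)  # accepted by this backend

try:
    std_backend("<a>1</a>", **options)  # same options, different backend
    rejected = False
except TypeError:
    rejected = True
```

Code written against one backend's options stops working as soon as the other backend is selected, which is the coupling the comment above warns about.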