pydantic-xml

huge_tree option for XML parser

Open JordanBarnartt opened this issue 11 months ago • 3 comments

When parsing very large XML documents with pydantic-xml, I get errors like: lxml.etree.XMLSyntaxError: CData section too big found, line 66036, column 194.

According to https://lxml.de/apidoc/lxml.etree.html#lxml.etree.XMLParser, this size limit can be lifted by passing huge_tree=True to the parser. However, there does not appear to be a way to enable this for the pydantic-xml parser.

I was able to solve my issues by monkey-patching like so:

from lxml import etree
from pydantic_xml.model import BaseXmlModel


def _from_xml(cls, source, context=None):
    """
    Deserializes an xml string to an object of `cls` type.

    :param source: xml string or bytes
    :param context: pydantic validation context
    :return: deserialized object
    """
    # huge_tree=True disables the parser's size restrictions on text and tree depth.
    parser = etree.XMLParser(huge_tree=True)
    return cls.from_xml_tree(etree.fromstring(source, parser), context=context)


# Replace the stock classmethod so every model parses with huge_tree enabled.
BaseXmlModel.from_xml = classmethod(_from_xml)

It would be nice if huge_tree were exposed as part of the interface when running from_xml. Is this change desirable? If so, I'd be happy to write a PR. Maybe even something a bit more general to be able to pass arbitrary arguments to the XMLParser creation.

JordanBarnartt avatar Mar 20 '25 16:03 JordanBarnartt

@JordanBarnartt thanks for the feedback

I added this feature in 2.15.0.

dapper91 avatar Mar 29 '25 08:03 dapper91

Thanks for looking at this so quickly!

Looking at the PRs, this will definitely solve my issue, but I'm a bit worried that the way it's implemented may freeze the API of from_xml. If there's any chance that you will want to add some other explicit parameters to this function later, it would potentially break backward compatibility.
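
To make the concern concrete, here is a minimal sketch. It assumes the merged implementation forwards extra keyword arguments straight to the parser, which this thread does not confirm. The functions from_xml_v1 and from_xml_v2 and the strict parameter are hypothetical stand-ins, not pydantic-xml's API:

```python
from typing import Any, Dict


def from_xml_v1(source: str, **parser_kwargs: Any) -> Dict[str, Any]:
    """Hypothetical first release: every extra keyword goes to the parser."""
    return {"parser": parser_kwargs, "strict": None}


def from_xml_v2(source: str, strict: bool = False, **parser_kwargs: Any) -> Dict[str, Any]:
    """Hypothetical later release that adds an explicit `strict` parameter."""
    return {"parser": parser_kwargs, "strict": strict}


# A caller who meant `strict` as a parser option is silently rerouted after the upgrade.
before = from_xml_v1("<a/>", huge_tree=True, strict=True)
after = from_xml_v2("<a/>", huge_tree=True, strict=True)

print(before["parser"])  # {'huge_tree': True, 'strict': True}
print(after["parser"])   # {'huge_tree': True}
```

The same call passes without error in both versions, but the option no longer reaches the parser in the second one.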

Maybe something like (untested):

from typing import Any, Dict, Optional, Type, TypeVar, Union

from lxml import etree

ModelT = TypeVar("ModelT")  # stand-in for the library's model type variable


# Meant as a method of BaseXmlModel.
@classmethod
def from_xml(
    cls: Type[ModelT],
    source: Union[str, bytes],
    context: Optional[Dict[str, Any]] = None,
    parser_options: Optional[Dict[str, Any]] = None,
) -> ModelT:
    """
    Deserializes an XML string to an object of `cls` type.

    :param source: XML string or bytes.
    :param context: Optional validation context.
    :param parser_options: Dictionary of options for XMLParser.
    :return: Deserialized object.
    """
    # Keep parser options in their own dict, separate from from_xml's parameters.
    parser = etree.XMLParser(**(parser_options or {}))
    tree = etree.fromstring(source, parser=parser)
    return cls.from_xml_tree(tree, context=context)

JordanBarnartt avatar Mar 29 '25 22:03 JordanBarnartt

@JordanBarnartt Hi,

The library is supposed to be XML-parser agnostic, so it must be able to work with different XML parser backends, not only lxml. Passing parser_options explicitly could cause problems if a new backend has no parser options or exposes a different API.
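
As a rough illustration of that point, the sketch below uses a single named flag that each backend can interpret or ignore. The helper parse_xml is hypothetical and is not part of pydantic-xml. It uses lxml when it is installed and otherwise falls back to the standard library's ElementTree, which has no huge_tree option:

```python
from typing import Union
import xml.etree.ElementTree as std_etree

try:
    from lxml import etree as lxml_etree  # optional dependency
except ImportError:
    lxml_etree = None


def parse_xml(source: Union[str, bytes], huge_tree: bool = False):
    """Hypothetical helper: parse `source` with whichever backend is available."""
    if lxml_etree is not None:
        # lxml understands huge_tree and relaxes its size restrictions when set.
        parser = lxml_etree.XMLParser(huge_tree=huge_tree)
        return lxml_etree.fromstring(source, parser)
    # The standard library parser has no such option, so the flag is ignored here.
    return std_etree.fromstring(source)


root = parse_xml(b"<doc><item>1</item></doc>", huge_tree=True)
print(root.tag, len(root))
```

A raw options dictionary would have nothing to map onto in the fallback branch, whereas a named flag can be ignored or translated per backend.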

dapper91 avatar Mar 30 '25 07:03 dapper91