types-lxml icon indicating copy to clipboard operation
types-lxml copied to clipboard

Replace factory functions with classes.

Open udifuchs opened this issue 1 year ago • 5 comments

Element() and ElementTree() were factory functions until now. They are now defined as classes.

With this change, we can annotate code with the public Element class, instead of using the private _Element class. For example:

def print_tree(tree: e.ElementTree[e.Element]) -> None: ...

There is a related PR for lxml: https://github.com/lxml/lxml/pull/405

In the lxml PR, _Element is a virtual subclass of Element. This means that when we create an Element it actually creates an _Element class, but the created object is also an instance of Element:

el = Element()
reveal_type(el)  # _Element
isinstance(element, Element)  # True
isinstance(element, _Element)  # True

I don't know if it is possible to have the equivalent of a virtual subclass in a type stub. Therefore, I just made Element a type alias of _Element.

The lxms stubs should work with all existing versions of lxml. But I am not sure it make sense to merge this PR until the related PR for lxml is merged.

udifuchs avatar Feb 13 '25 04:02 udifuchs

... not sure it make sense to merge this PR until the related PR for lxml is merged.

Yeah, it makes more sense to wait for lxml release which includes your change, then I make a new types-lxml release including this PR.

I have taken a glimpse of your lxml PR. If Element becomes a superclass, there is an alternative approach that I want to try out too. While making it a TypeAlias of _Element should most likely work fine, I'd like to try a more ambitious change -- allow Element to be the canonical type and phase out _Element wherever convenient. It has been a common complaint that a "private" type (with underscore) is used in mission critical way, where most code using lxml will probably fail to annotate if _Element and _ElementTree were not available. This is also the reason why PyCharm support is not very usable for now, as PyCharm has no method autocompletion for private classes.

Such change needs some time for testing; luckily there's spare time until lxml releases new version. In worst case, I can apply your PR first and refine later, or maybe have a review in the coming days.

abelcheung avatar Feb 22 '25 11:02 abelcheung

I was thinking of breaking this change over several PRs:

  1. Make Element a type alias of _Element instead of a factory function (this PR.)
  2. Replace most occurrences of _Element with Element in the stubs.
  3. Switch roles between Element and _Element, making Element a class and _Element a type alias of Element.
  4. Remove _Element completely from the type stubs.

I can easily create a PR for step 2 or add it to this PR. The only down side is that it changes almost 100 lines of code, so it makes it harder to see what are interesting changes in this PR.

Step 3 subtly breaks the runtime tests unless one uses a patched version of lxml. I think that the problem is in the tests, but this is not the important point. What is important is that step 3 has the potential of breaking existing code that does run-time type checking. It is probably very rare, but we shouldn't make a release of types-lxml with step 3 before lxml 6.0.0 (with the Element change) is released.

Step 4 would essentially break any existing code that uses types-lxml. It might be possible to implement it in a few years, but maybe an _Element type alias should be kept forever.

udifuchs avatar Feb 22 '25 22:02 udifuchs

I'm aware that your PR in lxml just got merged. Will give it some thought after I obtain a new wheel build of 6.0 alpha.

abelcheung avatar Apr 23 '25 08:04 abelcheung

Things don't seem right for me:

>>> from lxml import etree
>>> etree.LXML_VERSION
(6, 0, 0, -200)
>>> root = etree.Element('foo')
>>> type(root).__mro__
(<class 'lxml.etree._Element'>, <class 'object'>)

Are you expecting etree.Element to be a superclass of _Element?

abelcheung avatar Apr 24 '25 14:04 abelcheung

Are you expecting etree.Element to be a superclass of _Element?

Sort of. See https://github.com/lxml/lxml/pull/405 for the full details. etree.Element is an Abstract Base Class:

class Element(ABC):

_Element is not an actual subclass of Element, it is just registered as a virtual subclass of Element.

# Register _Element as a virtual subclass of Element
Element.register(_Element)

This way Element can be upgraded from a factory function to a class that behaves the way we expect a class to behave:

>>> element = Element("test")
>>> type(element)
<class 'lxml.etree._Element'>
>>> isinstance(element, Element)
True
>>> issubclass(_Element, Element)
True

And at the same time _Element stays a cython class, which is an internal implementation detail and does not depend directly on Element.

For the types stubs, it is a bit simpler. Element is just a type alias to _Element. Technically, it would be possible to get rid of _Element completely, but this would break backward compatibility.

udifuchs avatar Apr 24 '25 15:04 udifuchs

PR https://github.com/lxml/lxml/pull/461 - Enable subscripting generic types in annotations - was just merged into lxml. This means that with lxml-6.0 one could write:

element_tree: etree.ElementTree[etree.Element]

and Python would accept it even if you do not use: from __future__ import annotations

It would be really nice if the lxml stubs would also allow this syntax for the lxml-6.0 release.

udifuchs avatar May 12 '25 01:05 udifuchs

PR https://github.com/lxml/lxml/pull/461 - Enable subscripting generic types in annotations - was just merged into lxml.

Sure, I noticed the merge yesterday. It was a surprise, given that similar effort has been rejected before.

abelcheung avatar May 12 '25 13:05 abelcheung

I was able to reproduce the error locally by running the latest version of pyright. I am not sure what version of pyright is used by github actions.

The last commit removes the duplicate definition of makeelement from HTMLElement. There are other duplicate definitions in HTMLElement that could be removed. But this problem is not related to this PR and should be handled in a different PR.

udifuchs avatar May 12 '25 16:05 udifuchs

I was able to reproduce the error locally by running the latest version of pyright. I am not sure what version of pyright is used by github actions.

Yeah, the pyright errors only occur since 1.1.399, and that applies to main branch as well. So it's not your fault.

The last commit removes the duplicate definition of makeelement from HTMLElement. There are other duplicate definitions in HTMLElement that could be removed.

So I bet you don't want to only apply a single fix locally, only to conflict with a larger scale warning fix I'm going to apply to main. Maybe you can revert your last 2 commits and merge main branch again when I have commited the fix.

The duplicate definitions are another story. I added them for very specific reason. Take .makeelement() for example:

>>> from lxml.html import fromstring
>>> root = fromstring("<html><body><br/></body></html>")
>>> elem = root.makeelement("form")
>>> type(elem)
<class 'lxml.html.FormElement'>

>>> elem = root.makeelement("label")
>>> type(elem)
<class 'lxml.html.LabelElement'>

If root node is HtmlElement, its .makeelement() factory always produce HtmlElement or some of its subclass, but we can't tell which one from typing alone. So I add override to make sure it produces HtmlElement, not XML element (_Element). All other "duplicates" are there to mandate the methods output HtmlElement or expects HtmlElement as argument.

abelcheung avatar May 13 '25 22:05 abelcheung

Your example with makeelement only works for runtime checks, so the stub definitions are not relevant.

Static checks work correctly without the duplicate definition thanks to the Self mechanism:

from lxml.html import HtmlElement
html = HtmlElement()
reveal_type(html.makeelement)  # Revealed type is "type[lxml.html._element.HtmlElement]"

In any case, I reverted my last two commits like you asked.

udifuchs avatar May 14 '25 00:05 udifuchs

In any case, I reverted my last two commits like you asked.

Thanks, I have just committed a fix, and you can rebase your work on that.

Your example with makeelement only works for runtime checks, so the stub definitions are not relevant. Static checks work correctly without the duplicate definition thanks to the Self mechanism:

The issue here is, I am taking care of real world usage when designing those incompatible overrides. The usage of Self in _Element methods and non-usage in HtmlElement demonstrates 2 different user cases.

  • For _Element, developers are expected to not subclass or create a single subclass of _Element while parsing their XML tree. So Self can refer to either vanilla _Element or their own custom element as they see fit.
  • For HtmlElement, Self is detrimental to its usage. This time I use __getitem__ for easier understanding, which means elem[0] is the first immmediate child subelement of a parent element. If the override were not there, then the type of elem[0] for a subclass of HtmlElement, say FormElement, would be another FormElement in terms of static type. This implies illegal html where all elements of an HTML form are themselves forms.

When fixing stub for lxml.html submodule, the behavior of various subclasses of HtmlElement have to be taken into account too.

abelcheung avatar May 14 '25 03:05 abelcheung

Ah sorry, by "revert" I mean

git reset --hard 5d970bd9da39237ebb02ed5d7dae967cd85b0269

then force push the branch, so the conflict is completely gone.

abelcheung avatar May 14 '25 03:05 abelcheung

I merged your commits to my branch.

Notice that multi-subclass.patch did not have any conflicts, but it still references _ElementFactory. I am not sure how you use it, so I am not sure how to fix and test it.

udifuchs avatar May 14 '25 04:05 udifuchs

Notice that multi-subclass.patch did not have any conflicts, but it still references _ElementFactory.

No problem, I'll redo the patch and commit directly to your branch.

abelcheung avatar May 14 '25 11:05 abelcheung

lxml 6.0 was released a few weeks ago with the change that Element is now a class instead of a factory function.

Is there anything holding back merging this PR?

udifuchs avatar Jul 17 '25 03:07 udifuchs

@udifuchs Sorry for late reply.

Is there anything holding back merging this PR?

Yeah, a few reasons actually.

  1. Perhaps the PR started too long ago. I have found an existing feature removed during previous merge of repo head in this PR. Need some additional check to make sure there aren't any more stuff accidentally removed.
  2. All _ElementFactory definitions are converted to type like:
    -         makeelement: _ElementFactory[_ET_co],
    +         makeelement: type[_ET_co],
    
    Usage of type indicates makeelement is a class itself, while in reality they are only class methods. Although .makeelement and the new Element perform the same thing (calling _makeElement cython function internally), the nature of .makeelement is completely modified in PR. I'm a bit worried about that.
  3. This is the most important: the change is incompatible with lxml 5.x. I have to declare types-lxml supports lxml 6.0 only if I merge this PR right now. How to resolve this dilemma is open for discussion.

abelcheung avatar Jul 22 '25 16:07 abelcheung

2. All `_ElementFactory` definitions are converted to `type` like:
   ```diff
   -         makeelement: _ElementFactory[_ET_co],
   +         makeelement: type[_ET_co],
   ```
     
   Usage of `type` indicates `makeelement` is a class itself, while in reality they are only class methods. Although `.makeelement` and the new `Element` perform the same thing (calling `_makeElement` cython function internally), the nature of `.makeelement` is completely modified in PR. I'm a bit worried about that.

There is not much difference between a class and a factory function. Both are considered callable and both return a class instance. The only relevant difference is that only on a class you can call isinstance.

3. This is the most important: the change is incompatible with lxml 5.x. I have to declare `types-lxml` supports lxml 6.0 _only_ if I merge this PR right now. How to resolve this dilemma is open for discussion.

AFAIK, this PR is fully compatible with both lxml 5.x and 6.0. Do you have any examples of incompatible code?

udifuchs avatar Jul 22 '25 18:07 udifuchs

OK, I should have mentioned I was working on 2 PRs together, so it was not apparent to you: for example, mypy.stubtest checks the nature of lxml.etree.ElementTree itself, and it will complain about stub not implemented incorrectly for lxml 5.x -- etree.ElementTree is a class in stub now after merging this PR, but is a function when running tests against 5.x.

I can silent the warnings and see if it works in real world. As long as nobody complains, it's good to go.

abelcheung avatar Aug 08 '25 20:08 abelcheung

The merge with main was becoming a mess because there have been too many commits in between. Let me roll back to prestine state and work from there instead.

abelcheung avatar Aug 08 '25 21:08 abelcheung

@udifuchs It turns out rebasing a large change against almost 200 commits turns out to be disastrous, with way too many conflict resolution that I have lost track somewhere in between.

Can you rework the patch against current head and submit a new PR? BTW I have added a few patches to this PR, you will need to fix those parts in new PR.

abelcheung avatar Aug 10 '25 19:08 abelcheung

Obsoleted by #94

abelcheung avatar Aug 12 '25 11:08 abelcheung