datatree icon indicating copy to clipboard operation
datatree copied to clipboard

Name permanence

Open TomNicholas opened this issue 3 years ago • 5 comments

Xarray Datasets don't alter the names of the objects they store:

In [1]: import xarray as xr

In [2]: da = xr.DataArray(name="b")

In [3]: ds = xr.Dataset()

In [4]: ds['a'] = da

In [5]: ds["a"].name
Out[5]: 'a'

In [6]: da.name
Out[6]: 'b'

After #41 (and #115 ensures it) then DataTree objects behave similarly for data variables they store

In [7]: from datatree import DataTree

In [8]: root = DataTree()

In [9]: root["a"] = da

In [10]: root["a"].name
Out[10]: 'a'

In [11]: da.name
Out[11]: 'b'

However, currently DataTree objects do alter the name of child DataTree nodes that they store.

In [12]: subtree = DataTree(name="b")

In [13]: root = DataTree()

In [14]: root["a"] = subtree

In [15]: root["a"].name
Out[15]: 'a'

In [16]: subtree.name
Out[16]: 'a'

I noticed this in #115, but fixing it might be a bit complex.

TomNicholas avatar Jun 17 '22 20:06 TomNicholas

The reason fixing this is complicated is that Dataset doesn't actually store DataArray objects, instead it stores lower-level Variable objects, which are unnamed. It then recreates a new DataArray object when __getitem__ is called.

Implementing a similar solution with DataTree objects would require DataTree to not store other DataTree objects, but instead store a lower-level unnamed node object. Then when DataTree.__getitem__ was called we would recreate a new DataTree object from the underlying node object.

I'm trying to think about whether that design might cause any other consistency problems though...

TomNicholas avatar Jun 17 '22 21:06 TomNicholas

@shoyer wondering if you have an opinions on this? Do you think this name behaviour is a subtle but important feature of xarray or a unimportant detail?

TomNicholas avatar Jul 12 '22 15:07 TomNicholas

I do think this behavior is important. I would definitely consider alternative data structures for internal storage.

shoyer avatar Jul 12 '22 15:07 shoyer

Okay I think a design like this would work, where DataNode is analogous to Variable in that it's a private internal class that contains data but doesn't know it's own name.

Notice that you could technically have an entire tree of only DataNodes (because they still store the names of children via dict keys), but we would ensure the user always gets returned a DataTree which points to DataNodes.

from typing import TypeVar, Generic, Optional, OrderedDict, Dict, Mapping, Hashable, Any, Union

from xarray.core.variable import Variable
from xarray.core.utils import Frozen


Tree = TypeVar("Tree", bound="TreeNode")


class TreeNode(Generic[Tree]):
    """Unnamed tree node"""

    _parent: Optional[Tree]
    _children: OrderedDict[str, Tree]

    @property
    def children(self) -> Mapping[str, Tree]:
        return Frozen(self._children)


class NamedNodeMixin:
    """
    Enables a node to know its own name.

    Implements path-like relationships to other nodes in its tree.
    """

    _name: Optional[str]

    @property
    def name(self) -> Union[str, None]:
        """The name of this node."""
        return self._name

    @property
    def path(self) -> str:
        ...

    def relative_to(self: NamedNodeMixin, other: NamedNodeMixin) -> str:
        ...


class DataNode(TreeNode, Generic[Tree]):
    """Unnamed tree node which also stores data."""

    # TODO this could potentially inherit from Dataset?

    _parent: Optional[DataNode]
    _children: OrderedDict[str, DataNode]
    _variables: Dict[Hashable, Variable]
    ...


class DataTree(DataNode, NamedNodeMixin, Generic[Tree]):
    _name: Optional[str]
    _parent: Optional[DataTree]
    _children: OrderedDict[str, DataNode]
    _variables: Dict[Hashable, Variable]
    ...

    @property
    def children(self) -> Mapping[str, DataTree]:
        children_as_datatrees = {
            name: DataTree._construct_from_datanode(
                name=name, datanode=child,
            )
            for name, child in self._children
        }
        return Frozen(children_as_datatrees)

    @classmethod
    def _construct_direct(
            cls,
            name: str | None = None,
            parent: DataTree | None = None,
            children: OrderedDict[str, DataNode] = None,
            variables: dict[Any, Variable] = None,
    ) -> DataTree:
        """Shortcut around __init__ for internal use when we want to skip costly validation."""
        ...

    @classmethod
    def _construct_from_datanode(cls, name: str, datanode: DataNode) -> DataTree:
        return DataTree._construct_direct(
            name=name,
            parent=datanode._parent,
            children=datanode._children,
            variables=datanode._variables,
        )

    def _construct_datatree(self, name: str) -> DataTree:
        """Used when a single child node is requested"""
        child_datanode = self._children[name]
        return DataTree._construct_from_datanode(name=name, datanode=child_datanode)

TomNicholas avatar Nov 07 '22 20:11 TomNicholas

Alternatively Stephan suggested today that this might be achieved just by ensuring that group insertion occurs by making a shallow copy instead.

TomNicholas avatar Nov 09 '22 18:11 TomNicholas