Name permanence
Xarray Datasets don't alter the names of the objects they store:
In [1]: import xarray as xr
In [2]: da = xr.DataArray(name="b")
In [3]: ds = xr.Dataset()
In [4]: ds['a'] = da
In [5]: ds["a"].name
Out[5]: 'a'
In [6]: da.name
Out[6]: 'b'
After #41 (and #115 ensures it) then DataTree objects behave similarly for data variables they store
In [7]: from datatree import DataTree
In [8]: root = DataTree()
In [9]: root["a"] = da
In [10]: root["a"].name
Out[10]: 'a'
In [11]: da.name
Out[11]: 'b'
However, currently DataTree objects do alter the name of child DataTree nodes that they store.
In [12]: subtree = DataTree(name="b")
In [13]: root = DataTree()
In [14]: root["a"] = subtree
In [15]: root["a"].name
Out[15]: 'a'
In [16]: subtree.name
Out[16]: 'a'
I noticed this in #115, but fixing it might be a bit complex.
The reason fixing this is complicated is that Dataset doesn't actually store DataArray objects, instead it stores lower-level Variable objects, which are unnamed. It then recreates a new DataArray object when __getitem__ is called.
Implementing a similar solution with DataTree objects would require DataTree to not store other DataTree objects, but instead store a lower-level unnamed node object. Then when DataTree.__getitem__ was called we would recreate a new DataTree object from the underlying node object.
I'm trying to think about whether that design might cause any other consistency problems though...
@shoyer wondering if you have an opinions on this? Do you think this name behaviour is a subtle but important feature of xarray or a unimportant detail?
I do think this behavior is important. I would definitely consider alternative data structures for internal storage.
Okay I think a design like this would work, where DataNode is analogous to Variable in that it's a private internal class that contains data but doesn't know it's own name.
Notice that you could technically have an entire tree of only DataNodes (because they still store the names of children via dict keys), but we would ensure the user always gets returned a DataTree which points to DataNodes.
from typing import TypeVar, Generic, Optional, OrderedDict, Dict, Mapping, Hashable, Any, Union
from xarray.core.variable import Variable
from xarray.core.utils import Frozen
Tree = TypeVar("Tree", bound="TreeNode")
class TreeNode(Generic[Tree]):
"""Unnamed tree node"""
_parent: Optional[Tree]
_children: OrderedDict[str, Tree]
@property
def children(self) -> Mapping[str, Tree]:
return Frozen(self._children)
class NamedNodeMixin:
"""
Enables a node to know its own name.
Implements path-like relationships to other nodes in its tree.
"""
_name: Optional[str]
@property
def name(self) -> Union[str, None]:
"""The name of this node."""
return self._name
@property
def path(self) -> str:
...
def relative_to(self: NamedNodeMixin, other: NamedNodeMixin) -> str:
...
class DataNode(TreeNode, Generic[Tree]):
"""Unnamed tree node which also stores data."""
# TODO this could potentially inherit from Dataset?
_parent: Optional[DataNode]
_children: OrderedDict[str, DataNode]
_variables: Dict[Hashable, Variable]
...
class DataTree(DataNode, NamedNodeMixin, Generic[Tree]):
_name: Optional[str]
_parent: Optional[DataTree]
_children: OrderedDict[str, DataNode]
_variables: Dict[Hashable, Variable]
...
@property
def children(self) -> Mapping[str, DataTree]:
children_as_datatrees = {
name: DataTree._construct_from_datanode(
name=name, datanode=child,
)
for name, child in self._children
}
return Frozen(children_as_datatrees)
@classmethod
def _construct_direct(
cls,
name: str | None = None,
parent: DataTree | None = None,
children: OrderedDict[str, DataNode] = None,
variables: dict[Any, Variable] = None,
) -> DataTree:
"""Shortcut around __init__ for internal use when we want to skip costly validation."""
...
@classmethod
def _construct_from_datanode(cls, name: str, datanode: DataNode) -> DataTree:
return DataTree._construct_direct(
name=name,
parent=datanode._parent,
children=datanode._children,
variables=datanode._variables,
)
def _construct_datatree(self, name: str) -> DataTree:
"""Used when a single child node is requested"""
child_datanode = self._children[name]
return DataTree._construct_from_datanode(name=name, datanode=child_datanode)
Alternatively Stephan suggested today that this might be achieved just by ensuring that group insertion occurs by making a shallow copy instead.