add open_datatree to xarray (#8697)
* DAS-2060: Skips datatree_ CI

Adds additional ignore to mypy

Adds additional ignore to doctests

Excludes xarray/datatree_ from all pre-commit.ci

* DAS-2070: Migrate open_datatree into xarray.

First stab. Will need to add/move tests.

* DAS-2060: replace relative import of datatree to library

* DAS-2060: revert the exporting of NodePath from datatree

I mistakenly thought we wanted to use the hidden version of datatree_ and we do not.

* Don't expose open_datatree at top level

We do not want to expose open_datatree at top level until all of the code is migrated.

* Point datatree imports to xarray.datatree_.datatree

* Updates function signatures for mypy.

* Move io tests, remove undefined reference to documentation.

Also starts fixing simple mypy errors

* Pass bare-minimum tests.

* Update pyproject.toml to exclude imported datatree_ modules.

Add some typing for migrated tests.
Adds display_expand_groups to core options.

* Adding back type ignores

This is cargo-cult: some other CI check presumably wanted these ignores before
this directory was excluded at the top level. I'm putting them back until
migration into the main codebase.

* Refactor open_datatree back together.

Puts common parts in ``common.py``.

* Removes TODO comment

* typo fix

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* typo 2

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Call raised exception

* Add unpacking notation to kwargs

* Use final location for DataTree doc strings

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* fix comment from open_dataset to open_datatree

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* Revert "fix comment from open_dataset to open_datatree"

This reverts commit aab1744.

* Change Sphinx link from meth to func

* Update whats-new.rst

* Fix whats-new.rst formatting.

---------

Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
4 people authored Feb 14, 2024
1 parent 4806412 commit fffb03c
Showing 25 changed files with 263 additions and 193 deletions.
2 changes: 1 addition & 1 deletion doc/roadmap.rst
@@ -156,7 +156,7 @@ types would also be highly useful for xarray users.
By pursuing these improvements in NumPy we hope to extend the benefits
to the full scientific Python community, and avoid tight coupling
between xarray and specific third-party libraries (e.g., for
- implementing untis). This will allow xarray to maintain its domain
+ implementing units). This will allow xarray to maintain its domain
agnostic strengths.

We expect that we may eventually add some minimal interfaces in xarray
9 changes: 8 additions & 1 deletion doc/whats-new.rst
@@ -90,9 +90,16 @@ Internal Changes
when the data isn't datetime-like. (:issue:`8718`, :pull:`8724`)
By `Maximilian Roos <https://github.com/max-sixty>`_.

- - Move `parallelcompat` and `chunk managers` modules from `xarray/core` to `xarray/namedarray`. (:pull:`8319`)
+ - Move ``parallelcompat`` and ``chunk managers`` modules from ``xarray/core`` to ``xarray/namedarray``. (:pull:`8319`)
By `Tom Nicholas <https://github.com/TomNicholas>`_ and `Anderson Banihirwe <https://github.com/andersy005>`_.

- Imports the ``datatree`` repository and history into an internal
  location. (:pull:`8688`) By `Matt Savoie <https://github.com/flamingbear>`_
  and `Justus Magin <https://github.com/keewis>`_.

- Adds :py:func:`open_datatree` to ``xarray/backends``. (:pull:`8697`)
  By `Matt Savoie <https://github.com/flamingbear>`_.

.. _whats-new.2024.01.1:

v2024.01.1 (23 Jan, 2024)
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -96,6 +96,11 @@ warn_redundant_casts = true
warn_unused_configs = true
warn_unused_ignores = true

# Ignore mypy errors for modules imported from datatree_.
[[tool.mypy.overrides]]
module = "xarray.datatree_.*"
ignore_errors = true

# Much of the numerical computing stack doesn't have type annotations yet.
[[tool.mypy.overrides]]
ignore_missing_imports = true
29 changes: 29 additions & 0 deletions xarray/backends/api.py
@@ -69,6 +69,7 @@
T_NetcdfTypes = Literal[
"NETCDF4", "NETCDF4_CLASSIC", "NETCDF3_64BIT", "NETCDF3_CLASSIC"
]
from xarray.datatree_.datatree import DataTree

DATAARRAY_NAME = "__xarray_dataarray_name__"
DATAARRAY_VARIABLE = "__xarray_dataarray_variable__"
@@ -788,6 +789,34 @@ def open_dataarray(
return data_array


def open_datatree(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    **kwargs,
) -> DataTree:
    """
    Open and decode a DataTree from a file or file-like object, creating one
    tree node for each group in the file.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file
        or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include
        ``{"netcdf4", "h5netcdf", "zarr"}``.
    **kwargs : dict
        Additional keyword arguments passed to :py:func:`~xarray.open_dataset`
        for each group.

    Returns
    -------
    xarray.DataTree
    """
    if engine is None:
        engine = plugins.guess_engine(filename_or_obj)

    backend = plugins.get_backend(engine)

    return backend.open_datatree(filename_or_obj, **kwargs)
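The engine-dispatch pattern above is small enough to sketch in isolation. The registry, backend class, and helper names below are hypothetical stand-ins for xarray's plugin machinery (`plugins.guess_engine` / `plugins.get_backend`), not its real API:

```python
# Hypothetical sketch of the engine-dispatch pattern used by open_datatree.
# FakeNetCDF4Backend and BACKENDS are invented stand-ins, not xarray's API.

class FakeNetCDF4Backend:
    def guess_can_open(self, filename_or_obj) -> bool:
        return str(filename_or_obj).endswith(".nc")

    def open_datatree(self, filename_or_obj, **kwargs):
        return f"DataTree from {filename_or_obj}"

BACKENDS = {"netcdf4": FakeNetCDF4Backend()}

def guess_engine(filename_or_obj) -> str:
    # Ask each registered backend whether it recognises the input.
    for name, backend in BACKENDS.items():
        if backend.guess_can_open(filename_or_obj):
            return name
    raise ValueError(f"no backend found for {filename_or_obj!r}")

def open_datatree(filename_or_obj, engine=None, **kwargs):
    if engine is None:
        engine = guess_engine(filename_or_obj)
    return BACKENDS[engine].open_datatree(filename_or_obj, **kwargs)

print(open_datatree("example.nc"))  # DataTree from example.nc
```

The real function delegates exactly this way: guess an engine when none is given, fetch the matching `BackendEntrypoint`, and call its `open_datatree`.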


def open_mfdataset(
paths: str | NestedSequence[str | os.PathLike],
chunks: T_Chunks | None = None,
59 changes: 58 additions & 1 deletion xarray/backends/common.py
@@ -19,8 +19,12 @@
if TYPE_CHECKING:
from io import BufferedIOBase

from h5netcdf.legacyapi import Dataset as ncDatasetLegacyH5
from netCDF4 import Dataset as ncDataset

from xarray.core.dataset import Dataset
from xarray.core.types import NestedSequence
from xarray.datatree_.datatree import DataTree

# Create a logger object, but don't add any handlers. Leave that to user code.
logger = logging.getLogger(__name__)
@@ -127,6 +131,43 @@ def _decode_variable_name(name):
return name


def _open_datatree_netcdf(
    ncDataset: ncDataset | ncDatasetLegacyH5,
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    **kwargs,
) -> DataTree:
    from xarray.backends.api import open_dataset
    from xarray.datatree_.datatree import DataTree
    from xarray.datatree_.datatree.treenode import NodePath

    ds = open_dataset(filename_or_obj, **kwargs)
    tree_root = DataTree.from_dict({"/": ds})
    with ncDataset(filename_or_obj, mode="r") as ncds:
        for path in _iter_nc_groups(ncds):
            subgroup_ds = open_dataset(filename_or_obj, group=path, **kwargs)

            # TODO refactor to use __setitem__ once creation of new nodes by
            # assigning Dataset works again
            node_name = NodePath(path).name
            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
            tree_root._set_item(
                path,
                new_node,
                allow_overwrite=False,
                new_nodes_along_path=True,
            )
    return tree_root


def _iter_nc_groups(root, parent="/"):
    from xarray.datatree_.datatree.treenode import NodePath

    parent = NodePath(parent)
    for path, group in root.groups.items():
        gpath = parent / path
        yield str(gpath)
        yield from _iter_nc_groups(group, parent=gpath)
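The recursive traversal in `_iter_nc_groups` can be exercised without netCDF4 by substituting a plain object that exposes a `groups` mapping. `FakeGroup` and `iter_groups` below are hypothetical stand-ins (the real helper uses `NodePath` instead of `PurePosixPath`):

```python
# Hypothetical sketch of the recursive group traversal, using a
# plain-object stand-in for a netCDF4 group.
from pathlib import PurePosixPath

class FakeGroup:
    def __init__(self, groups=None):
        self.groups = groups or {}

def iter_groups(root, parent="/"):
    parent = PurePosixPath(parent)
    for name, group in root.groups.items():
        gpath = parent / name
        yield str(gpath)  # yield this group's full path...
        yield from iter_groups(group, parent=gpath)  # ...then its children

root = FakeGroup({"a": FakeGroup({"b": FakeGroup()}), "c": FakeGroup()})
print(list(iter_groups(root)))  # ['/a', '/a/b', '/c']
```

Depth-first order matters here: a parent path is always yielded before its children, so `_set_item(..., new_nodes_along_path=True)` never has to back-fill more than the current branch.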


def find_root_and_group(ds):
"""Find the root and group name of a netCDF4/h5netcdf dataset."""
hierarchy = ()
@@ -458,6 +499,11 @@ class BackendEntrypoint:
- ``guess_can_open`` method: it shall return ``True`` if the backend is able to open
``filename_or_obj``, ``False`` otherwise. The implementation of this
method is not mandatory.
- ``open_datatree`` method: it shall implement reading from file and variable
decoding, and it returns an instance of :py:class:`~datatree.DataTree`.
It shall take at least the ``filename_or_obj`` argument as input. The
implementation of this method is not mandatory. For more details see
<reference to open_datatree documentation>.
Attributes
----------
@@ -496,7 +542,7 @@ def open_dataset(
Backend open_dataset method used by Xarray in :py:func:`~xarray.open_dataset`.
"""

- raise NotImplementedError
+ raise NotImplementedError()

def guess_can_open(
self,
@@ -508,6 +554,17 @@ def guess_can_open(

return False

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs: Any,
    ) -> DataTree:
        """
        Backend open_datatree method used by Xarray in :py:func:`~xarray.open_datatree`.
        """

        raise NotImplementedError()
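A third-party backend opts into this protocol by overriding `open_datatree`. The sketch below uses a hypothetical stub base class and file format (`BackendEntrypointStub`, `MyFormatBackend`, `.myfmt`); a real backend would subclass `xarray.backends.BackendEntrypoint` and return an actual `DataTree`:

```python
# Hedged sketch of a third-party backend providing open_datatree.
# Everything named here is invented for illustration.

class BackendEntrypointStub:
    def open_dataset(self, filename_or_obj, **kwargs):
        raise NotImplementedError()

    def open_datatree(self, filename_or_obj, **kwargs):
        raise NotImplementedError()  # optional, mirrors the base class above

class MyFormatBackend(BackendEntrypointStub):
    def guess_can_open(self, filename_or_obj) -> bool:
        return str(filename_or_obj).endswith(".myfmt")

    def open_datatree(self, filename_or_obj, **kwargs):
        # A real implementation would decode each group and build a DataTree.
        return {"/": f"root of {filename_or_obj}"}

backend = MyFormatBackend()
assert backend.guess_can_open("data.myfmt")
```

Because the base-class method raises `NotImplementedError()`, backends that only support `open_dataset` keep working unchanged; `xarray.open_datatree` simply fails with a clear error for those engines.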


# mapping of engine name to (module name, BackendEntrypoint Class)
BACKEND_ENTRYPOINTS: dict[str, tuple[str | None, type[BackendEntrypoint]]] = {}
11 changes: 11 additions & 0 deletions xarray/backends/h5netcdf_.py
@@ -11,6 +11,7 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
)
from xarray.backends.file_manager import CachingFileManager, DummyFileManager
@@ -38,6 +39,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree


class H5NetCDFArrayWrapper(BaseNetCDF4Array):
@@ -423,5 +425,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        from h5netcdf.legacyapi import Dataset as ncDataset

        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)


BACKEND_ENTRYPOINTS["h5netcdf"] = ("h5netcdf", H5netcdfBackendEntrypoint)
11 changes: 11 additions & 0 deletions xarray/backends/netCDF4_.py
@@ -16,6 +16,7 @@
BackendEntrypoint,
WritableCFDataStore,
_normalize_path,
_open_datatree_netcdf,
find_root_and_group,
robust_getitem,
)
@@ -44,6 +45,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree

# This lookup table maps from dtype.byteorder to a readable endian
# string used by netCDF4.
@@ -667,5 +669,14 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        from netCDF4 import Dataset as ncDataset

        return _open_datatree_netcdf(ncDataset, filename_or_obj, **kwargs)


BACKEND_ENTRYPOINTS["netcdf4"] = ("netCDF4", NetCDF4BackendEntrypoint)
44 changes: 44 additions & 0 deletions xarray/backends/zarr.py
@@ -34,6 +34,7 @@

from xarray.backends.common import AbstractDataStore
from xarray.core.dataset import Dataset
from xarray.datatree_.datatree import DataTree


# need some special secret attributes to tell us the dimensions
@@ -1039,5 +1040,48 @@ def open_dataset( # type: ignore[override] # allow LSP violation, not supporti
)
return ds

    def open_datatree(
        self,
        filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
        **kwargs,
    ) -> DataTree:
        import zarr

        from xarray.backends.api import open_dataset
        from xarray.datatree_.datatree import DataTree
        from xarray.datatree_.datatree.treenode import NodePath

        zds = zarr.open_group(filename_or_obj, mode="r")
        ds = open_dataset(filename_or_obj, engine="zarr", **kwargs)
        tree_root = DataTree.from_dict({"/": ds})
        for path in _iter_zarr_groups(zds):
            try:
                subgroup_ds = open_dataset(
                    filename_or_obj, engine="zarr", group=path, **kwargs
                )
            except zarr.errors.PathNotFoundError:
                subgroup_ds = Dataset()

            # TODO refactor to use __setitem__ once creation of new nodes by
            # assigning Dataset works again
            node_name = NodePath(path).name
            new_node: DataTree = DataTree(name=node_name, data=subgroup_ds)
            tree_root._set_item(
                path,
                new_node,
                allow_overwrite=False,
                new_nodes_along_path=True,
            )
        return tree_root


def _iter_zarr_groups(root, parent="/"):
    from xarray.datatree_.datatree.treenode import NodePath

    parent = NodePath(parent)
    for path, group in root.groups():
        gpath = parent / path
        yield str(gpath)
        yield from _iter_zarr_groups(group, parent=gpath)


BACKEND_ENTRYPOINTS["zarr"] = ("zarr", ZarrBackendEntrypoint)
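The zarr method's try/except captures a useful pattern: groups that exist in the hierarchy but cannot be opened as datasets (for zarr, those raising `zarr.errors.PathNotFoundError`) become empty nodes rather than errors. A minimal sketch with hypothetical stand-ins (`PathNotFound`, `open_group_dataset`, `STORE` are invented; the real code calls `open_dataset(..., group=path)`):

```python
# Sketch of the per-group fallback: unreadable groups become empty nodes.

class PathNotFound(Exception):
    pass

STORE = {"/a": {"x": [1, 2]}, "/b": None}  # "/b" holds no arrays

def open_group_dataset(path):
    data = STORE.get(path)
    if data is None:
        raise PathNotFound(path)
    return data

tree = {}
for path in STORE:
    try:
        tree[path] = open_group_dataset(path)
    except PathNotFound:
        tree[path] = {}  # stand-in for an empty xarray.Dataset

print(tree)  # {'/a': {'x': [1, 2]}, '/b': {}}
```

This keeps purely structural groups (containers with only subgroups) in the tree, which the netCDF path does not need because netCDF groups always open as (possibly empty) datasets.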
3 changes: 3 additions & 0 deletions xarray/core/options.py
@@ -20,6 +20,7 @@
"display_expand_coords",
"display_expand_data_vars",
"display_expand_data",
"display_expand_groups",
"display_expand_indexes",
"display_default_indexes",
"enable_cftimeindex",
@@ -44,6 +45,7 @@ class T_Options(TypedDict):
display_expand_coords: Literal["default", True, False]
display_expand_data_vars: Literal["default", True, False]
display_expand_data: Literal["default", True, False]
display_expand_groups: Literal["default", True, False]
display_expand_indexes: Literal["default", True, False]
display_default_indexes: Literal["default", True, False]
enable_cftimeindex: bool
@@ -68,6 +70,7 @@ class T_Options(TypedDict):
"display_expand_coords": "default",
"display_expand_data_vars": "default",
"display_expand_data": "default",
"display_expand_groups": "default",
"display_expand_indexes": "default",
"display_default_indexes": False,
"enable_cftimeindex": True,
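The new `display_expand_groups` option follows the same three-state convention (`"default"`, `True`, `False`) as the other `display_expand_*` options. A hedged sketch of how display code might consume it; the `OPTIONS` dict, `expand_groups` helper, and threshold below are invented for illustration, not xarray's implementation:

```python
# Hypothetical sketch of consuming a three-state display option.

OPTIONS = {"display_expand_groups": "default"}

def expand_groups(n_groups: int, threshold: int = 10) -> bool:
    setting = OPTIONS["display_expand_groups"]
    if setting == "default":
        return n_groups <= threshold  # expand only reasonably small trees
    return bool(setting)  # user explicitly forced True or False

assert expand_groups(3) is True        # "default": small tree, expanded
OPTIONS["display_expand_groups"] = False
assert expand_groups(3) is False       # user forced collapsed display
```

In xarray the analogous switch would be flipped via `xr.set_options(display_expand_groups=...)`, matching the existing options added to `T_Options` above.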
10 changes: 0 additions & 10 deletions xarray/datatree_/datatree/__init__.py
@@ -1,25 +1,15 @@
# import public API
from .datatree import DataTree
from .extensions import register_datatree_accessor
- from .io import open_datatree
from .mapping import TreeIsomorphismError, map_over_subtree
from .treenode import InvalidTreeError, NotFoundInTreeError

- try:
- # NOTE: the `_version.py` file must not be present in the git repository
- # as it is generated by setuptools at install time
- from ._version import __version__
- except ImportError: # pragma: no cover
- # Local copy or not installed with setuptools
- __version__ = "999"

__all__ = (
"DataTree",
- "open_datatree",
"TreeIsomorphismError",
"InvalidTreeError",
"NotFoundInTreeError",
"map_over_subtree",
"register_datatree_accessor",
- "__version__",
)
3 changes: 2 additions & 1 deletion xarray/datatree_/datatree/datatree.py
@@ -16,6 +16,7 @@
List,
Mapping,
MutableMapping,
NoReturn,
Optional,
Set,
Tuple,
@@ -160,7 +161,7 @@ def __setitem__(self, key, val) -> None:
"use `.copy()` first to get a mutable version of the input dataset."
)

- def update(self, other) -> None:
+ def update(self, other) -> NoReturn:
raise AttributeError(
"Mutation of the DatasetView is not allowed, please use `.update` on the wrapping DataTree node, "
"or use `dt.to_dataset()` if you want a mutable dataset. If calling this from within `map_over_subtree`,"
3 changes: 0 additions & 3 deletions xarray/datatree_/datatree/formatting_html.py
@@ -10,9 +10,6 @@
datavar_section,
dim_section,
)
- from xarray.core.options import OPTIONS
-
- OPTIONS["display_expand_groups"] = "default"


def summarize_children(children: Mapping[str, Any]) -> str: