diff --git a/ci/requirements/doc.yml b/ci/requirements/doc.yml
index c041b5158e0..e3652fe6c35 100644
--- a/ci/requirements/doc.yml
+++ b/ci/requirements/doc.yml
@@ -8,6 +8,7 @@ dependencies:
   - bottleneck
   - cartopy
   - cfgrib
+  - kerchunk
   - dask-core>=2022.1
   - dask-expr
   - hypothesis>=6.75.8
diff --git a/doc/combined.json b/doc/combined.json
new file mode 100644
index 00000000000..345462e055f
--- /dev/null
+++ b/doc/combined.json
@@ -0,0 +1,30 @@
+{
+    "version": 1,
+    "refs": {
+        ".zgroup": "{\"zarr_format\":2}",
+        "foo/.zarray": "{\"chunks\":[4,5],\"compressor\":null,\"dtype\":\"
`_: Analysis of ensemble forecast models for climate prediction.
 - `geocube `_: Tool to convert geopandas vector data into rasterized xarray data.
 - `GeoWombat `_: Utilities for analysis of remotely sensed and gridded raster data at scale (easily tame Landsat, Sentinel, Quickbird, and PlanetScope).
+- `grib2io `_: Utility to work with GRIB2 files including an xarray backend, DASK support for parallel reading in open_mfdataset, lazy loading of data, editing of GRIB2 attributes and GRIB2IO DataArray attrs, and spatial interpolation and reprojection of GRIB2 messages and GRIB2IO Datasets/DataArrays for both grid to grid and grid to stations.
 - `gsw-xarray `_: a wrapper around `gsw `_ that adds CF compliant attributes when possible, units, name.
 - `infinite-diff `_: xarray-based finite-differencing, focused on gridded climate/meteorology data
 - `marc_analysis `_: Analysis package for CESM/MARC experiments and output.
diff --git a/doc/getting-started-guide/installing.rst b/doc/getting-started-guide/installing.rst
index ca12ae62440..823c50f333b 100644
--- a/doc/getting-started-guide/installing.rst
+++ b/doc/getting-started-guide/installing.rst
@@ -8,8 +8,8 @@ Required dependencies
 
 - Python (3.9 or later)
 - `numpy `__ (1.23 or later)
-- `packaging `__ (22 or later)
-- `pandas `__ (1.5 or later)
+- `packaging `__ (23.1 or later)
+- `pandas `__ (2.0 or later)
 
 .. _optional-dependencies:
diff --git a/doc/user-guide/io.rst b/doc/user-guide/io.rst
index 07de0619c73..1eb979e52f6 100644
--- a/doc/user-guide/io.rst
+++ b/doc/user-guide/io.rst
@@ -1033,6 +1033,59 @@ reads. Because this fall-back option is so much slower, xarray issues a
 instead of falling back to try reading non-consolidated metadata.
 
+
+.. _io.kerchunk:
+
+Kerchunk
+--------
+
+`Kerchunk `_ is a Python library
+that allows you to access chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS),
+many of which are the primary formats used by data archives, by viewing the
+whole archive as an ephemeral `Zarr`_ dataset, which allows for parallel, chunk-specific access.
+
+Instead of creating a new copy of the dataset in the Zarr spec/format or
+downloading the files locally, Kerchunk reads through the data archive, extracts the
+byte range and compression information of each chunk, and saves it as a ``reference``.
+These references are then stored as ``json`` or (more efficient) ``parquet`` files
+for later use. You can view some of these stored in the `references`
+directory `here `_.
+
+
+.. note::
+    These references follow this `specification `_.
+    Packages like `kerchunk`_ and `virtualizarr `_
+    help create and read these references.
+
+
+Reading these data archives becomes straightforward with ``kerchunk`` in combination
+with ``xarray``, especially when the archives are large. A single combined
+reference can refer to thousands of the original data files present in these archives,
+and the above packages let you view the whole dataset through this `combined reference`.
+
+The following example shows opening a combined reference generated from a ``.hdf`` file stored locally.
+
+.. ipython:: python
+
+    storage_options = {
+        "target_protocol": "file",
+    }
+
+    # add the `remote_protocol` key in `storage_options` if you're accessing a file remotely
+
+    ds1 = xr.open_dataset(
+        "./combined.json",
+        engine="kerchunk",
+        storage_options=storage_options,
+    )
+
+    ds1
+
+.. note::
+
+    You can refer to the `project pythia kerchunk cookbook `_
+    and the `pangeo guide on kerchunk `_ for more information.
+
+
 .. _io.iris:
 
 Iris
diff --git a/xarray/backends/plugins.py b/xarray/backends/plugins.py
index a62ca6c9862..f4890015040 100644
--- a/xarray/backends/plugins.py
+++ b/xarray/backends/plugins.py
@@ -204,6 +204,9 @@ def get_backend(engine: str | type[BackendEntrypoint]) -> BackendEntrypoint:
         if engine not in engines:
             raise ValueError(
-                f"unrecognized engine {engine} must be one of: {list(engines)}"
+                f"unrecognized engine {engine} must be one of: {list(engines)}\n"
+                "To install additional dependencies, see:\n"
+                "https://docs.xarray.dev/en/stable/user-guide/io.html\n"
+                "https://docs.xarray.dev/en/stable/getting-started-guide/installing.html"
             )
         backend = engines[engine]
     elif isinstance(engine, type) and issubclass(engine, BackendEntrypoint):
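The ``doc/combined.json`` file added by this diff follows the kerchunk "version 1" reference layout the new docs describe. As a minimal sketch of that layout using only the standard library: the array name ``foo`` comes from the diff, but the dtype, shape, target filename, offset, and length below are hypothetical (the original file is truncated here), and real references are generated by kerchunk's converters rather than written by hand.

```python
import json

# A minimal kerchunk "version 1" reference set, mirroring the shape of the
# doc/combined.json added in this diff.  Each key under "refs" holds either
# inline Zarr metadata (stored as a JSON string) or, for chunk keys such as
# "foo/0.0", a [target, offset, length] triple naming a byte range inside
# the original archive.  Target file, dtype, shape, offset, and length are
# hypothetical values for illustration only.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "foo/.zarray": json.dumps(
            {
                "chunks": [4, 5],
                "compressor": None,
                "dtype": "<f8",  # assumed little-endian float64
                "fill_value": None,
                "filters": None,
                "order": "C",
                "shape": [4, 5],
                "zarr_format": 2,
            }
        ),
        # byte-range reference into the original archive: [target, offset, length]
        "foo/0.0": ["saved_on_disk.h5", 20522, 160],
    },
}

# Serialize the reference set; a file like this is what the
# `engine="kerchunk"` example in doc/user-guide/io.rst opens.
with open("combined.json", "w") as f:
    json.dump(refs, f, indent=4)
```

A reference file of this shape is what the ``xr.open_dataset("./combined.json", engine="kerchunk", ...)`` snippet in the io.rst hunk above consumes; the inline-metadata-as-JSON-string convention is what makes each ``refs`` value either parseable Zarr metadata or a byte-range pointer.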