Skip to content

Commit

Permalink
Merge remote-tracking branch 'upstream/main' into DAS-2155-merge-data…
Browse files Browse the repository at this point in the history
…tree-docs
  • Loading branch information
flamingbear committed Aug 2, 2024
2 parents fa459d9 + c7cb9f7 commit e957e33
Show file tree
Hide file tree
Showing 6 changed files with 90 additions and 2 deletions.
1 change: 1 addition & 0 deletions ci/requirements/doc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ dependencies:
- bottleneck
- cartopy
- cfgrib
- kerchunk
- dask-core>=2022.1
- dask-expr
- hypothesis>=6.75.8
Expand Down
30 changes: 30 additions & 0 deletions doc/combined.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
{
"version": 1,
"refs": {
".zgroup": "{\"zarr_format\":2}",
"foo/.zarray": "{\"chunks\":[4,5],\"compressor\":null,\"dtype\":\"<f8\",\"fill_value\":\"NaN\",\"filters\":null,\"order\":\"C\",\"shape\":[4,5],\"zarr_format\":2}",
"foo/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\",\"y\"],\"coordinates\":\"z\"}",
"foo/0.0": [
"saved_on_disk.h5",
8192,
160
],
"x/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
"x/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}",
"x/0": [
"saved_on_disk.h5",
8352,
32
],
"y/.zarray": "{\"chunks\":[5],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[5],\"zarr_format\":2}",
"y/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\"],\"calendar\":\"proleptic_gregorian\",\"units\":\"days since 2000-01-01 00:00:00\"}",
"y/0": [
"saved_on_disk.h5",
8384,
40
],
"z/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
"z/0": "[\"a\",\"b\",\"c\",\"d\",\"|O\",[4]]",
"z/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}"
}
}
1 change: 1 addition & 0 deletions doc/ecosystem.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ Geosciences
- `climpred <https://climpred.readthedocs.io>`_: Analysis of ensemble forecast models for climate prediction.
- `geocube <https://corteva.github.io/geocube>`_: Tool to convert geopandas vector data into rasterized xarray data.
- `GeoWombat <https://github.com/jgrss/geowombat>`_: Utilities for analysis of remotely sensed and gridded raster data at scale (easily tame Landsat, Sentinel, Quickbird, and PlanetScope).
- `grib2io <https://github.com/NOAA-MDL/grib2io>`_: Utility to work with GRIB2 files including an xarray backend, DASK support for parallel reading in open_mfdataset, lazy loading of data, editing of GRIB2 attributes and GRIB2IO DataArray attrs, and spatial interpolation and reprojection of GRIB2 messages and GRIB2IO Datasets/DataArrays for both grid to grid and grid to stations.
- `gsw-xarray <https://github.com/DocOtak/gsw-xarray>`_: a wrapper around `gsw <https://teos-10.github.io/GSW-Python>`_ that adds CF compliant attributes when possible, units, name.
- `infinite-diff <https://github.com/spencerahill/infinite-diff>`_: xarray-based finite-differencing, focused on gridded climate/meteorology data
- `marc_analysis <https://github.com/darothen/marc_analysis>`_: Analysis package for CESM/MARC experiments and output.
Expand Down
4 changes: 2 additions & 2 deletions doc/getting-started-guide/installing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@ Required dependencies

- Python (3.9 or later)
- `numpy <https://www.numpy.org/>`__ (1.23 or later)
- `packaging <https://packaging.pypa.io/en/latest/#>`__ (22 or later)
- `pandas <https://pandas.pydata.org/>`__ (1.5 or later)
- `packaging <https://packaging.pypa.io/en/latest/#>`__ (23.1 or later)
- `pandas <https://pandas.pydata.org/>`__ (2.0 or later)

.. _optional-dependencies:

Expand Down
53 changes: 53 additions & 0 deletions doc/user-guide/io.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1033,6 +1033,59 @@ reads. Because this fall-back option is so much slower, xarray issues a
instead of falling back to try reading non-consolidated metadata.


.. _io.kerchunk:

Kerchunk
--------

`Kerchunk <https://fsspec.github.io/kerchunk/index.html>`_ is a Python library
that allows you to access chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS),
many of which are primary data formats for many data archives, by viewing the
whole archive as an ephemeral `Zarr`_ dataset which allows for parallel, chunk-specific access.

Instead of creating a new copy of the dataset in the Zarr spec/format or
downloading the files locally, Kerchunk reads through the data archive and extracts the
byte range and compression information of each chunk and saves as a ``reference``.
These references are then saved as ``json`` files or ``parquet`` (more efficient)
for later use. You can view some of these stored in the `references`
directory `here <https://github.com/pydata/xarray-data>`_.


.. note::
These references follow this `specification <https://fsspec.github.io/kerchunk/spec.html>`_.
Packages like `kerchunk`_ and `virtualizarr <https://github.com/zarr-developers/VirtualiZarr>`_
help in creating and reading these references.


Reading these data archives becomes really easy with ``kerchunk`` in combination
with ``xarray``, especially when these archives are large in size. A single combined
reference can refer to thousands of the original data files present in these archives.
You can view the whole dataset with from this `combined reference` using the above packages.

The following example shows opening a combined references generated from a ``.hdf`` file stored locally.

.. ipython:: python
storage_options = {
"target_protocol": "file",
}
# add the `remote_protocol` key in `storage_options` if you're accessing a file remotely
ds1 = xr.open_dataset(
"./combined.json",
engine="kerchunk",
storage_options=storage_options,
)
ds1
.. note::

You can refer to the `project pythia kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
and the `pangeo guide on kerchunk <https://guide.cloudnativegeo.org/kerchunk/intro.html>`_ for more information.


.. _io.iris:

Iris
Expand Down
3 changes: 3 additions & 0 deletions xarray/backends/plugins.py
Original file line number Diff line number Diff line change
Expand Up @@ -204,6 +204,9 @@ def get_backend(engine: str | type[BackendEntrypoint]) -> BackendEntrypoint:
if engine not in engines:
raise ValueError(
f"unrecognized engine {engine} must be one of: {list(engines)}"
"To install additional dependencies, see:\n"
"https://docs.xarray.dev/en/stable/user-guide/io.html \n"
"https://docs.xarray.dev/en/stable/getting-started-guide/installing.html"
)
backend = engines[engine]
elif isinstance(engine, type) and issubclass(engine, BackendEntrypoint):
Expand Down

0 comments on commit e957e33

Please sign in to comment.