Xarray round-tripping #22

CommonClimate · 2023-01-24T23:18:36Z

Enable Xarray round-trip with 2 functions Series.to_xarray() / Series.from_xarray(). Xarray is built on top of pandas, so there area already rules for how to do this, but they need to be adapted to deal with metadata, for instance.
Delicate points:

by default, pd.Series.to_xarray() creates an xr.DataArray instance, but some of our metadata are more adapted to the xr.DataSet construct. So Series.to_xarray() needs to create a xr.DataSet, not an xr.DataArray as would be the case by default.
will Xarray recognize non-ns dtypes by default, or do we need to make changes to that codebase?
we need to work out how to do MultipleSeries to xarray(). If all Series in the MS object share the same time axis, this is trivial (datafame to xr.DataSet). If different time axes are involved, probably to export different Xarray objects.

The text was updated successfully, but these errors were encountered:

khider · 2023-01-25T01:16:29Z

If we create a proper DataSet object within Pyleoclim from a Series, xarray will need to become a dependency of the package

CommonClimate · 2023-01-25T03:14:13Z

Didn't we already decide to bite that bullet when we wrote PaleoCube? I thought this was high on your list. If it's not, I'm happy to put that in another package (paleocube?) to enable a round-trip without needlessly enlarging Pyleoclim.

MarcoGorelli · 2023-01-26T11:02:18Z

It could be an optional dependency, which you only import within the to_xarray and from_xarray functions

khider · 2023-01-26T16:49:23Z

@MarcoGorelli I was thinking about doing that so it won't affect the package too much. Was just reading about the pros and cons.
But it seems like the best way to go.

CommonClimate · 2023-01-28T01:08:51Z

This point is moot as our current dependencies include

xarray=2022.6.0
netCDF4=1.5.8
Bottleneck=1.3.4
dask=2022.8.0
@khider I think we did this so that the hub container could be built from the same environment.yml file. Perhaps @kcpevey and @MarcoGorelli have suggestions for how to curate a lean set of dependencies for pyleoclim, and a heftier set for the hub? @jordanplanders will have plenty of suggestions for packages needed on the hub side to play nicely with the pangeo-forge stuff. Should likely be a new issue in the Pyleoclim repo.

kcpevey · 2023-01-30T16:34:27Z

I would keep two environment files. One that is unpinned for general package usage, and a specialized one for the hub that has pins to ensure things work.

Generally, its good practice to keep your environment as unpinned as possible. Reasoning:

People who are using your package are probably using other packages as well. This gives them the most flexibility.
It naturally stays up to date (CI will catch failures)
Its low maintenance

For your usage of pyleoclim on the hub, you need a different type of environment. This one really needs to be consistent and reliable. In this case, it should be pinned down. The caveat here is that it will slowly get out of date, so you need to have a maintenance place to revisit it every once in a while (maybe quarterly). To do this, you'll remove all the pins, solve, then put all the updated pins back.

kcpevey · 2023-01-30T18:59:53Z

Once pandas 2.0 is release, we would expect this to work:

import numpy as np
import pandas as pd

pds = pd.Series([10], index=np.array(['-2000-01-01']).astype('M8[s]'))
xra = pds.to_xarray()
xra.to_netcdf()

However, this fails with:

Traceback (most recent call last):
  File "/Users/kcp/projects/usc/git/Pyleoclim_util/example_xarray.py", line 103, in <module>
    xra.to_netcdf()
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/core/dataarray.py", line 3217, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/api.py", line 1210, in to_netcdf
    dump_to_store(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/common.py", line 263, in store
    variables, attributes = self.encode(variables, attributes)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/common.py", line 352, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 864, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 864, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 273, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 668, in encode
    (data, units, calendar) = encode_cf_datetime(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 604, in encode_cf_datetime
    units = infer_datetime_units(dates)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 386, in infer_datetime_units
    reference_date = format_cftime_datetime(reference_date)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 397, in format_cftime_datetime
    date.year,
AttributeError: 'numpy.datetime64' object has no attribute 'year'

Xarray is using cftime to handle dates (specifically dates outside of the 1678-2262 range) (https://docs.xarray.dev/en/stable/user-guide/weather-climate.html#non-standard-calendars-and-dates-outside-the-timestamp-valid-range) . Xarray isn't parsing the date properly so we need to open an issue there to raise awareness.

After playing with xarray itself for a bit to see if I can find a workaround, it looks like it can handle really large dates (e.g 9000) if they are specified properly, but it fails to parse -9000. So the fix may be multi-faceted. Small reproducer:

import numpy as np
import xarray as xr
dates = xr.cftime_range(start="-9000", periods=24, freq="S", calendar="proleptic_gregorian")
da = xr.DataArray(np.arange(24), coords=[dates], dims=["time"], name="foo")

Traceback (most recent call last):
  File "/Users/kcp/projects/usc/git/Pyleoclim_util/example_xarray.py", line 109, in <module>
    dates = xr.cftime_range(start="-9000", periods=24, freq="S", calendar="proleptic_gregorian")
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftime_offsets.py", line 1034, in cftime_range
    start = to_cftime_datetime(start, calendar)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftime_offsets.py", line 750, in to_cftime_datetime
    date, _ = _parse_iso8601_with_reso(get_date_type(calendar), date_str_or_date)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftimeindex.py", line 128, in _parse_iso8601_with_reso
    result = parse_iso8601_like(timestr)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftimeindex.py", line 118, in parse_iso8601_like
    raise ValueError(
ValueError: no ISO-8601 or cftime-string-like match for string: -9000

kcpevey · 2023-01-31T13:53:06Z

Issue opened in xarray: pydata/xarray#7493

MarcoGorelli · 2023-02-24T19:21:09Z

Do you need anything else from us (Quansight) on this issue?

khider · 2023-02-24T22:37:07Z

No, this is going to be with xarray directly.

CommonClimate · 2023-02-24T23:44:20Z

Indeed, we'll leave it open until we've resolved that issue, but QuanSight is relieved from duty on this one!

CommonClimate added the enhancement New feature or request label Jan 24, 2023

CommonClimate assigned khider and MarcoGorelli Jan 24, 2023

CommonClimate added this to Pandas integration Jan 24, 2023

CommonClimate added the priority_high high priority issue label Jan 24, 2023

kcpevey moved this to High Priority TODO in Pandas integration Jan 26, 2023

CommonClimate mentioned this issue Jan 31, 2023

Set up environment.yml to enable notebook tests LinkedEarth/LinkedEarthHub#1

Open

CommonClimate moved this from High Priority TODO to Todo in Pandas integration Feb 20, 2023

CommonClimate moved this from Todo to In Progress in Pandas integration Feb 22, 2023

MarcoGorelli removed their assignment Feb 25, 2023

CommonClimate moved this from In Progress to Todo in Pandas integration Apr 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Xarray round-tripping #22

Xarray round-tripping #22

CommonClimate commented Jan 24, 2023

khider commented Jan 25, 2023

CommonClimate commented Jan 25, 2023

MarcoGorelli commented Jan 26, 2023

khider commented Jan 26, 2023

CommonClimate commented Jan 28, 2023

kcpevey commented Jan 30, 2023

kcpevey commented Jan 30, 2023

kcpevey commented Jan 31, 2023

MarcoGorelli commented Feb 24, 2023

khider commented Feb 24, 2023

CommonClimate commented Feb 24, 2023

Xarray round-tripping #22

Xarray round-tripping #22

Comments

CommonClimate commented Jan 24, 2023

khider commented Jan 25, 2023

CommonClimate commented Jan 25, 2023

MarcoGorelli commented Jan 26, 2023

khider commented Jan 26, 2023

CommonClimate commented Jan 28, 2023

kcpevey commented Jan 30, 2023

kcpevey commented Jan 30, 2023

kcpevey commented Jan 31, 2023

MarcoGorelli commented Feb 24, 2023

khider commented Feb 24, 2023

CommonClimate commented Feb 24, 2023