Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xarray round-tripping #22

Open
CommonClimate opened this issue Jan 24, 2023 · 11 comments
Open

Xarray round-tripping #22

CommonClimate opened this issue Jan 24, 2023 · 11 comments
Assignees
Labels
enhancement New feature or request priority_high high priority issue

Comments

@CommonClimate
Copy link
Contributor

Enable Xarray round-trip with 2 functions Series.to_xarray() / Series.from_xarray(). Xarray is built on top of pandas, so there area already rules for how to do this, but they need to be adapted to deal with metadata, for instance.
Delicate points:

  • by default, pd.Series.to_xarray() creates an xr.DataArray instance, but some of our metadata are more adapted to the xr.DataSet construct. So Series.to_xarray() needs to create a xr.DataSet, not an xr.DataArray as would be the case by default.
  • will Xarray recognize non-ns dtypes by default, or do we need to make changes to that codebase?
  • we need to work out how to do MultipleSeries to xarray(). If all Series in the MS object share the same time axis, this is trivial (datafame to xr.DataSet). If different time axes are involved, probably to export different Xarray objects.
@CommonClimate CommonClimate added the enhancement New feature or request label Jan 24, 2023
@CommonClimate CommonClimate added the priority_high high priority issue label Jan 24, 2023
@khider
Copy link
Member

khider commented Jan 25, 2023

If we create a proper DataSet object within Pyleoclim from a Series, xarray will need to become a dependency of the package

@CommonClimate
Copy link
Contributor Author

Didn't we already decide to bite that bullet when we wrote PaleoCube? I thought this was high on your list. If it's not, I'm happy to put that in another package (paleocube?) to enable a round-trip without needlessly enlarging Pyleoclim.

@MarcoGorelli
Copy link

It could be an optional dependency, which you only import within the to_xarray and from_xarray functions

@khider
Copy link
Member

khider commented Jan 26, 2023

@MarcoGorelli I was thinking about doing that so it won't affect the package too much. Was just reading about the pros and cons.
But it seems like the best way to go.

@kcpevey kcpevey moved this to High Priority TODO in Pandas integration Jan 26, 2023
@CommonClimate
Copy link
Contributor Author

This point is moot as our current dependencies include

  • xarray=2022.6.0
  • netCDF4=1.5.8
  • Bottleneck=1.3.4
  • dask=2022.8.0
    @khider I think we did this so that the hub container could be built from the same environment.yml file. Perhaps @kcpevey and @MarcoGorelli have suggestions for how to curate a lean set of dependencies for pyleoclim, and a heftier set for the hub? @jordanplanders will have plenty of suggestions for packages needed on the hub side to play nicely with the pangeo-forge stuff. Should likely be a new issue in the Pyleoclim repo.

@kcpevey
Copy link
Collaborator

kcpevey commented Jan 30, 2023

I would keep two environment files. One that is unpinned for general package usage, and a specialized one for the hub that has pins to ensure things work.

Generally, its good practice to keep your environment as unpinned as possible. Reasoning:

  • People who are using your package are probably using other packages as well. This gives them the most flexibility.
  • It naturally stays up to date (CI will catch failures)
  • Its low maintenance

For your usage of pyleoclim on the hub, you need a different type of environment. This one really needs to be consistent and reliable. In this case, it should be pinned down. The caveat here is that it will slowly get out of date, so you need to have a maintenance place to revisit it every once in a while (maybe quarterly). To do this, you'll remove all the pins, solve, then put all the updated pins back.

@kcpevey
Copy link
Collaborator

kcpevey commented Jan 30, 2023

Once pandas 2.0 is release, we would expect this to work:

import numpy as np
import pandas as pd

pds = pd.Series([10], index=np.array(['-2000-01-01']).astype('M8[s]'))
xra = pds.to_xarray()
xra.to_netcdf()

However, this fails with:

Traceback (most recent call last):
  File "/Users/kcp/projects/usc/git/Pyleoclim_util/example_xarray.py", line 103, in <module>
    xra.to_netcdf()
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/core/dataarray.py", line 3217, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/api.py", line 1210, in to_netcdf
    dump_to_store(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/common.py", line 263, in store
    variables, attributes = self.encode(variables, attributes)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/backends/common.py", line 352, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 864, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 864, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/conventions.py", line 273, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 668, in encode
    (data, units, calendar) = encode_cf_datetime(
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 604, in encode_cf_datetime
    units = infer_datetime_units(dates)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 386, in infer_datetime_units
    reference_date = format_cftime_datetime(reference_date)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/times.py", line 397, in format_cftime_datetime
    date.year,
AttributeError: 'numpy.datetime64' object has no attribute 'year'

Xarray is using cftime to handle dates (specifically dates outside of the 1678-2262 range) (https://docs.xarray.dev/en/stable/user-guide/weather-climate.html#non-standard-calendars-and-dates-outside-the-timestamp-valid-range) . Xarray isn't parsing the date properly so we need to open an issue there to raise awareness.

After playing with xarray itself for a bit to see if I can find a workaround, it looks like it can handle really large dates (e.g 9000) if they are specified properly, but it fails to parse -9000. So the fix may be multi-faceted. Small reproducer:

import numpy as np
import xarray as xr
dates = xr.cftime_range(start="-9000", periods=24, freq="S", calendar="proleptic_gregorian")
da = xr.DataArray(np.arange(24), coords=[dates], dims=["time"], name="foo")
Traceback (most recent call last):
  File "/Users/kcp/projects/usc/git/Pyleoclim_util/example_xarray.py", line 109, in <module>
    dates = xr.cftime_range(start="-9000", periods=24, freq="S", calendar="proleptic_gregorian")
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftime_offsets.py", line 1034, in cftime_range
    start = to_cftime_datetime(start, calendar)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftime_offsets.py", line 750, in to_cftime_datetime
    date, _ = _parse_iso8601_with_reso(get_date_type(calendar), date_str_or_date)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftimeindex.py", line 128, in _parse_iso8601_with_reso
    result = parse_iso8601_like(timestr)
  File "/Users/kcp/miniconda3/envs/usc_JAN30/lib/python3.10/site-packages/xarray/coding/cftimeindex.py", line 118, in parse_iso8601_like
    raise ValueError(
ValueError: no ISO-8601 or cftime-string-like match for string: -9000

@kcpevey
Copy link
Collaborator

kcpevey commented Jan 31, 2023

Issue opened in xarray: pydata/xarray#7493

@MarcoGorelli
Copy link

Do you need anything else from us (Quansight) on this issue?

@khider
Copy link
Member

khider commented Feb 24, 2023

No, this is going to be with xarray directly.

@CommonClimate
Copy link
Contributor Author

Indeed, we'll leave it open until we've resolved that issue, but QuanSight is relieved from duty on this one!

@MarcoGorelli MarcoGorelli removed their assignment Feb 25, 2023
@CommonClimate CommonClimate moved this from In Progress to Todo in Pandas integration Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority_high high priority issue
Projects
Development

No branches or pull requests

4 participants