Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set up new default values for kwargs on open_mfdataset #10051

Closed
wants to merge 3 commits into from

Conversation

jsignell
Copy link
Contributor

  • Towards Stricter defaults for concat, combine, open_mfdataset, merge #8778
    • after a sufficiently long deprecation cycle there should be another PR that:
      • removes the option (but doesn't throw if people are still setting it)
      • removes all the warning logic in _get_default_open_mfdataset_kwargs
    • after even more time one last PR that:
      • encodes the new defaults right into the function signature
      • removes the extra warnings that throw on failure
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This PR attempts to throw warnings if and only if the output would change with the new kwarg defaults. These are the steps I am going through:

  1. Change the defaults and see which tests fail

    13 failed

    ========================================================================================= short test summary info =========================================================================================
    FAILED xarray/tests/test_backends.py::test_h5netcdf_storage_options - AssertionError: Left and right Dataset objects are not identical
    Differing data variables:
    L   var3     (time, dim3, dim1) float64 26kB 1.246 0.7902 ... -0.847 1.085
    R   var3     (dim3, dim1) float64 640B dask.array<chunksize=(10, 8), meta=np.ndarray>
    L   var2     (time, dim1, dim2) float64 23kB 0.007146 0.5344 ... -2.722 -0.6733
    R   var2     (dim1, dim2) float64 576B dask.array<chunksize=(8, 9), meta=np.ndarray>
    L   var1     (time, dim1, dim2) float64 23kB -1.424 1.264 ... -0.9074 -1.095
    R   var1     (dim1, dim2) float64 576B dask.array<chunksize=(8, 9), meta=np.ndarray>
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-different-nested-t] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-different-by_coords-None] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-different-nested-t] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-different-by_coords-None] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-different-nested-t] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-different-by_coords-None] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-different-nested-t] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-different-by_coords-None] - ValueError: Cannot specify both data_vars='different' and compat='override'.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_exact_join_raises_error[different-by_coords-None] - AssertionError: Regex pattern did not match.
     Regex: 'cannot align objects.*join.*exact.*'
     Input: "Cannot specify both data_vars='different' and compat='override'."
    FAILED xarray/tests/test_backends.py::TestDask::test_encoding_mfdataset - ValueError: cannot align objects with join='exact' where index/labels/sizes are not equal along these coordinates (dimensions): 't' ('t',)
    FAILED xarray/tests/test_backends.py::TestDask::test_open_single_dataset - AssertionError: Left and right Dataset objects are not identical
    Differing data variables:
    L   foo      (baz, x) float64 80B 1.764 0.4002 0.9787 ... -0.1514 -0.1032 0.4106
    R   foo      (x) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    FAILED xarray/tests/test_backends.py::TestDask::test_open_multi_dataset - AssertionError: Left and right Dataset objects are not identical
    Differing data variables:
    L   foo      (baz, x) float64 160B 1.764 0.4002 0.9787 ... -0.1032 0.4106
    R   foo      (x) float64 80B dask.array<chunksize=(10,), meta=np.ndarray>
    ==================================================== 13 failed, 18018 passed, 992 skipped, 185 xfailed, 19 xpassed, 2451 warnings in 266.75s (0:04:26) ====================================================
    
  2. Add warnings and switch back to old defaults

    13 failed

    ========================================================================================= short test summary info =========================================================================================
    FAILED xarray/tests/test_backends.py::test_h5netcdf_storage_options - FutureWarning: In a future version of xarray default value for data_vars  will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets have overlapping concat_dim values. To opt in to new defaults and get rid of these warnings now use `set_option(use_new_open_mfdataset_kwargs=True) or set data_vars explicitly.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-different-nested-t] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-different-by_coords-None] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-different-nested-t] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-different-by_coords-None] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-different-nested-t] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-different-by_coords-None] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-different-nested-t] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-different-by_coords-None] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_exact_join_raises_error[different-nested-t] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_exact_join_raises_error[different-by_coords-None] - FutureWarning: In a future version of xarray default value for compat  will change from compat='no_conflicts' to compat='override'. This is likely to lead to failures when used with data_vars='different'. The recommendation is to set compat='no_conflicts' explicitly for this case.
    FAILED xarray/tests/test_backends.py::TestDask::test_open_single_dataset - FutureWarning: In a future version of xarray default value for data_vars  will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when using a `DataArray` as the concat_dim. To opt in to new defaults and get rid of these warnings now use `set_option(use_new_open_mfdataset_kwargs=True) or set data_vars explicitly.
    FAILED xarray/tests/test_backends.py::TestDask::test_open_multi_dataset - FutureWarning: In a future version of xarray default value for data_vars  will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when using a `DataArray` as the concat_dim. To opt in to new defaults and get rid of these warnings now use `set_option(use_new_open_mfdataset_kwargs=True) or set data_vars explicitly.
    ==================================================== 13 failed, 18018 passed, 992 skipped, 185 xfailed, 19 xpassed, 2394 warnings in 226.98s (0:03:46) ====================================================
    
  1. TODO: Alter existing tests to make sure they still test what they were meant to test by passing in any required kwargs
  2. TODO: Add new tests that explicitly ensure that for a bunch of different inputs, using old defaults throws a warning OR output is the same with new and old defaults.

Notes

  • I used None to indicate a kwarg value that the user has not explicitly set, but it might be preferable to instead use a special indicator value so you can't actually set the kwargs to None. The benefit of using that approach is that it makes it more obvious that people should not be setting these kwargs to None in their own code.
  • I added a second set of warnings that tries to catch any errors that might be related to the new default kwargs. The thinking is that these warnings might provide people with some context on any new errors that crop up after they opt in to the new defaults. These only pop up when there is already an error, so they don't catch places where open_mfdataset succeeds but is different than before.

)

if "join" not in input_kwargs:
# TODO: Figure out how conditions that xarray/tests/test_backends.py::TestDask::test_encoding_mfdataset
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having a hard time figuring out how to catch the conditions that make xarray/tests/test_backends.py::TestDask::test_encoding_mfdataset fail with the new join kwarg. It seems like it has to do with the concat_dim in each dataset having a different encoding, but the values of the concat_dim in each dataset doe not overlap. So I'm a little wondering if the align index function is too strict.

f"in `open_mfdataset`. Previously the default was {old}",
category=UserWarning,
stacklevel=2,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the warnings that are meant to provide extra context on failures when you set use_new_open_mfdataset_kwargs=True and for a while after xarray switches to using the new defaults by default.

@TomNicholas
Copy link
Member

TomNicholas commented Feb 14, 2025

Thank you for working on this @jsignell !

Note that the scope of the changes implied by #8778 are larger than what you've done here - we also need to change the defaults in concat, merge, combine_nested, combine_by_coords, and possibly align. We need to do this all at the same time (see #8778 (comment)).

As open_mfdataset/combine_nested etc. are just composed of concat and merge under the hood, it might be easier to start by changing the defaults in those first and check that test_concat.py passes, then move to test_combine.py (i.e. bottom-up changes rather than top-down). Perhaps those should be separate PRs, but we also absolutely don't want to release a version of xarray where only some of these changes have been made.

@jsignell
Copy link
Contributor Author

Note that the scope of the changes implied by #8778 are larger than what you've done here - we also need to change the defaults in concat, merge, combine_nested, combine_by_coords, and possibly align. We need to do this all at the same time (see #8778 (comment)).

Yes! I had that comment in my head at one point, but somehow misread it when I actually started working on this this week 🤦🏻

As open_mfdataset/combine_nested etc. are just composed of concat and merge under the hood, it might be easier to start by changing the defaults in those first and check that test_concat.py passes, then move to test_combine.py (i.e. bottom-up changes rather than top-down).

I agree this might be easier. It also involves passing the context that the user did not explicitly set the kwarg values between functions. It seems kind of nice to me to keep this kind of default kwarg-setting pretty high up in the chain. So maybe it's just a matter of abstracting the _get_default_open_mfdataset_kwargs and using that within each of the top-level functions. I'll see what that looks like.

Perhaps those should be separate PRs, but we also absolutely don't want to release a version of xarray where only some of these changes have been made.

I think having it all be one PR will be fine.

@TomNicholas
Copy link
Member

abstracting the _get_default_open_mfdataset_kwargs and using that within each of the top-level functions.

That sounds like it could be a very neat way to abstract out this complexity.

@TomNicholas TomNicholas added the topic-combine combine/concat/merge label Feb 14, 2025
@jsignell
Copy link
Contributor Author

Ok new (still WIP) PR is up (#10062)!

@jsignell jsignell closed this Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-combine combine/concat/merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants