Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating fill_value logic for HDFVirtualBackend #414

Open
sharkinsspatial opened this issue Feb 3, 2025 · 2 comments
Open

Updating fill_value logic for HDFVirtualBackend #414

sharkinsspatial opened this issue Feb 3, 2025 · 2 comments
Labels
HDF reader Non-kerchunk-based HDF reader backend

Comments

@sharkinsspatial
Copy link
Collaborator

I'm working through trying to update the "fill value" (quoted to denote the massive ambiguity around this loaded term as described by @rabernat in pydata/xarray#5475 (comment)) for the HDFVirtualBackend but it seems this will be a multi-level issue. My hope is our logic can follow the interpretation that @rabernat set out in pydata/xarray#5475 (comment) so that the reader will use the HDF dataset fillvalue property for the the Zarr fill_value and if the HDF dataset has a _FillValue attr that will be preserved as an array attr for use by xarray's decoding logic to represent masked values. This is a transition from the the old kerchunk logic that was discussed and updated in fsspec/kerchunk#177.

This presents a bit of a chicken and egg problem for our transition to Zarr V3 #392 and specifically our transition to ArrayV3Metadata #411 as an internal representation. As outlined by @rabernat, Zarr v2 uses different semantics for fill_value and the spec is less rigorous about its encoding. The Zarr V3 specification has adopted @rabernat's new semantics and has more detailed guidelines for fill_value encoding.

IIUC as we transition the HDFVirtualBackend to use the new Zarr v3 fill_value semantics and encoding our kerchunk based roundtrip tests will be broken unless we add specific logic to our to_kerchunk serialization that transform v3 -> v2 semantics. I have not searched the new kerchunk v3 compatibility PR fsspec/kerchunk#516 (comment) exhaustively but it appears that the semantic fill_value logic used by the HDF reader has not changed and is still using the v2 logic so even serializing to v3 compatible kerchunk dictionaries would fail equivalence assertions.

I'll be working through a PR to update the HDFVirtualBackend semantics in our current codebase and I'll assess the extent of test failures and how they can be addressed.

Looking through all of the Zarr v3 PRs it looks like @TomAugspurger has been tackling a lot of the fill_value updates. Maybe he can weigh in here with some advice on the best path forward to adhere to the correct v3 behavior while still maintaining v2 semantic compatibility during our transition period.

@TomAugspurger
Copy link
Contributor

I'm forgetting most of the details, but IIRC the biggest problem for xarray was it layering on additional meaning to fill_value: that anything equal to fill_value was actually "missing" / NaN. That could only be resolved through a new keyword, which touched quite a few places (see https://github.com/pydata/xarray/blob/main/xarray/backends/zarr.py, search for use_zarr_fill_value_as_mask). I'm not sure whether that will affect virtualizarr or not.

LMK if there's any specifics I can help out with.

@jsignell
Copy link
Contributor

jsignell commented Feb 4, 2025

the reader will use the HDF dataset fillvalue property for the the Zarr fill_value and if the HDF dataset has a _FillValue attr that will be preserved as an array attr for use by xarray's decoding logic to represent masked values.

This seems to be what is going on with the use_zarr_fill_value_as_mask that Tom pointed out in xarray. This code in particular:
https://github.com/pydata/xarray/blob/d924d93c4ebe1e174e7b332d53ab5d9f94d789d4/xarray/backends/zarr.py#L842-L851

IIUC as we transition the HDFVirtualBackend to use the new Zarr v3 fill_value semantics and encoding our kerchunk based roundtrip tests will be broken unless we add specific logic to our to_kerchunk serialization that transform v3 -> v2 semantics.

I think we do want to specify custom logic in to_kerchunk to handle writing Zarr v2. Here is where the use_zarr_fill_value_as_mask gets set in xarray depending on the Zarr version:
https://github.com/pydata/xarray/blob/d924d93c4ebe1e174e7b332d53ab5d9f94d789d4/xarray/backends/zarr.py#L1794-L1802

while still maintaining v2 semantic compatibility during our transition period.

My understanding is that virtualizarr is intended to be able to write Zarr v2 indefinitely. It will just be using zarr-python v3 to do so.

I'll be working through a PR to update the HDFVirtualBackend semantics in our current codebase and I'll assess the extent of test failures and how they can be addressed.

I have some additional failures for you in #422 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HDF reader Non-kerchunk-based HDF reader backend
Projects
None yet
Development

No branches or pull requests

4 participants