-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is _FillValue
really the same as zarr's fill_value
?
#5475
Comments
I encountered this issue today and totally agree. In Zarr, The default Here is an MWE with ints: import zarr
import xarray as xr
store = zarr.DirectoryStore("zarr-xarray-example.zarr")
root = zarr.group(store=store, overwrite=True)
arr = root.create_dataset(name="arr", data=[0, 1, 2, 3, 4], dtype="i4")
arr.attrs["_ARRAY_DIMENSIONS"] = ["phony_dim_0"]
zarr.consolidate_metadata(store)
root = zarr.open("zarr-xarray-example.zarr")
print(root["arr"].fill_value)
dset = xr.open_dataset("zarr-xarray-example.zarr", engine="zarr")
dset.arr.data Returns: 0
array([nan, 1., 2., 3., 4.]) This was very confusing at first. Why did 0 become NaN? Why did the ints become floats? Upon digging around, I learned that this automatic conversion/masking can be turned off by passing the argument Now, this is a very specific and probably uncommon use case. But still, Zarr's To resolve this, I suggest that on write, Xarray writes the dataset However, this approach would break the default masking behavior when loading existing Zarr data that lacks this new |
Thanks @rly for coming back on this. I agree, this issue still persists and might become worse in Zarr v3. There's another aspect which might be problematic: From CF-Conventions §2.5.1:
So if @rly, why would you suggest to invent a new attribute name? CF-Convention says (same paragraph):
So I would "just" call the attribute I'd probably suggest something like this as a strategy to fix this:
|
I ran into the same problem as well. This simple code fails, because
The latter print gives a |
In the mean time, you can set |
@jhamman I'm looking into getting xarray ready for Zarr v3 and bumped into this pretty quickly. AFAICT, zarr v3 requires that I'm getting up to speed on xarray's Zarr reader now and the implications of the change, but for now it seems like just not using |
What is the issue? Are you seeing a roundtripping problem? |
Yes, here's the smallest reproducer I have (this won't run for you though. I have some other changes to zarr-python fixing some compatibility issues): import zarr
import xarray as xr
store = zarr.store.MemoryStore(mode="w-")
ds = xr.Dataset({"x": xr.DataArray([0, 2, 3], name="x")})
ds.to_zarr(store)
result = xr.open_dataset(store, engine="zarr", mode="w-")
xr.testing.assert_identical(ds, result) which fails with
The (valid) xarray/xarray/backends/zarr.py Lines 555 to 558 in cea354f
fill_value into the Array Metadata.
As far as I can tell, the risk with ignoring >>> store = {}
>>> ds = xr.Dataset({"x": xr.DataArray([0, np.nan, 3], name="x", attrs={"_FillValue": 0.0})})
>>> ds.to_zarr(store)
>>> result = xr.open_dataset(store, engine="zarr", mode="r")
>>> print(result.x.load())
<xarray.DataArray 'x' (dim_0: 3)> Size: 24B
array([nan, nan, 3.])
Dimensions without coordinates: dim_0 I can't really tell if that's expected according to CF-conventions or not, but it seems surprising. That said, that seems to be a fundamental limitation of encoding missing value "in-band". After encoding the data (by setting NaN to 0) we can't tell whether a 0 means 0 or NaN. This does match the behavior of I think it comes down to the decision on whether or not Zarr's |
The behaviour is correct from Xarray's POV but this example isn't well defined from a data POV. For one, in netCDF land And yeah, Xarray's fill value decoding is already lossy (it doesn't distinguish between
It's confusing and the meaning has become overloaded over time, but the original intent seem the same to me IMO (treating zarr's |
What are our options (spoiler: after writing this up I think 4 is most promising)?
This would effectively restore the Zarr v2 behavior: the "surprising" behavior of valid values potentially being masked as invalid would only affect users where I don't really like this option. Ideally zarr datasets are portable across implementations, and having some implementations interpret the same dataset differently because they happen to have different default fill values makes me uneasy.
The fact that zarr-python v3 uses
I don't know what the tolerance is here.
IIUC, the main purpose of this xarray/xarray/coding/variables.py Lines 221 to 232 in cc74d3a
Then we avoid the situation where you have some valid values cast to NaN (or whatever) since they happened to be equal to fill value. You wouldn't have any valid values in the chunk since it was never initialized. This might be better done in Zarr-Python, and IIRC there's some desire to move some of the CF related logic into Zarr Python as new codecs. |
TLDR: perhaps we change the tests instead? I'm not sure there is a good overarching solution.
Xarray's Responding to your suggestions In general, I think we should distinguish between what is recorded on disk, and what is interpreted as an in-memory data model. To me, it seems this issue is about two things:
This implies that we need to have two subtly different roundtripping tests: |
Thanks, I've been thinking this over since yesterday and am still not sure where I sit. I just keep coming back to this being a really bad behavior In [1]: import xarray as xr
In [2]: ds = xr.Dataset({"x": xr.DataArray([0, 1, 2], name="x")})
In [3]: store = {}
In [4]: ds.to_zarr(store)
Out[4]: <xarray.backends.zarr.ZarrStore at 0x10914d640>
In [5]: xr.open_dataset(store, engine="zarr").load()
Out[5]:
<xarray.Dataset> Size: 24B
Dimensions: (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
x (dim_0) float64 24B nan 1.0 2.0 So I guess we agree that xarray interpreting values equal to With that said, would xarray be open to me implementing #8359, which AFAICT proposes to only do the masking for values that happen to fall outside of valid range defined by those attributes, and so are "semantically invalid"? Under that proposal, my snippet would correctly round trip. To restore the old behavior, users would need to specify the zarr-developers/zarr-specs#133 has the discussion on making |
Yes, I'd agree the core issue is: xarray treats zarr's With the recent discussion, I guess I see that the original idea of netCDF's So if xarray's zarr ties
I'd guess the former would be simpler, the latter would be more correct / elegant. |
I am the one who originally implemented the current logic (equating Zarr's I believe we can and should fix this by changing Xarray's behavior, with minimal impact on users. A useful concept for me is to realize that the Zarr / Xarray stack actually has two distinct layers of encoding (Zarr's own "filters", as opposed to the NetCDF / Xarray stack, which only has one (implemented by Xarray). Using CF conventions with Zarr creates many ambiguities, as it is not always clear where in the stack the encoding / decoding should be applied Converting To recreate the behavior we have today, we could set the Zarr array attribute I would be inclined not to worry too much about backwards compatibility with existing behavior. This needs to be fixed and the sooner we do so, the better off everyone will be. |
I just spent the past hour re-reading all of the NetCDF and CF documentation on this, and reached the same conclusion that @dcherian did a few weeks ago (#5475 (comment)): this is a mess. It's truly ambiguous how to interpret From the CF Conventions:
This is how we do decoding today in Xarray--first mask _FillValue, then scale Lines 279 to 284 in cde720f
However,
This means that we should be reversing the order of the decoding operations depending on whether we are masking a xarray/xarray/coding/variables.py Lines 389 to 392 in cde720f
|
Final thought in my rant: another reason this is very hard to solve is the lack of true missing value support in Xarray. We have no good way to actually mask int data. Since int has no obvious sentinel value for missing values (unlike float, which has NaN), we have to resort to treating Zarr's Ok, enough complaining, now time to think about how to actually solve this. 😆 A simple workaround that would solve the most obvious bug (int array with 0 as missing values) could be to just stop interpreting A more comprehensive fix would involve a deep rethinking of how we handle missing data in both Xarray and Zarr. |
indeed, the best way to resolve this would involve having true missing value support in our in-memory arrays (and adapting There's two options (as described in NEP 26): masked arrays (i.e. a additional mask array directly on the array) and bit-pattern based missing values on the dtypes. For the former, there's several promising attempts, most recently For the latter, we'd need to have someone dive into the custom dtype API that has already been used to define the new string dtypes. I also played around with that earlier today, but it looks like it will take me a lot more time to figure out since I don't really know the python and numpy C APIs too well. (In any case I don't want to be the bottleneck, so if someone wants to help out / take over I'd be happy to collaborate – and I'm not involved in that but I guess the same goes for Edit: though if we want to keep this focused on encoding / decoding I'll be happy to continue this discussion in #1194. |
#9515 (comment) has a suggestion / question to set xarray's default zarr version to 2, to punt on this issue a bit. |
Tom, we are quite motivated to find a way to write V3, since it is the only thing that works with Icechunk. I guess I'm failing to see why this is such a problem for V3 but not V2. Is it that V3 doesn't allow an empty fill value? |
Ah of course I forgot about that. I was hesitant to suggest xarray not follow the library default in the first place. So let's work through the transition pains.
Correct. So it's a problem for both v2 and v3, just more pronounced in v3 since |
Sorry if what I'm saying below is obvious to people. Here's a summary of how it works in V2 vs V3: Zarr 2.17.1 import zarr
a = zarr.create(shape=10, chunks=10, dtype=int)
# fill_value set to 0 by default
assert a.fill_value == 0
assert a[0] == 0
# None allowed as fill_value
b = zarr.create(shape=10, chunks=10, dtype=int, fill_value=None)
assert b.fill_value is None
# however, 0 is still being used as a fill value (no data have been written); this seems weird
# can we count on this behavior?
assert b[0] == 0 The problem is that Xarray is using the presence of the fill_value attribute to determine whether to apply a mask, via setting the CF xarray/xarray/backends/zarr.py Lines 623 to 624 in cde720f
An, on the encoding side, setting fill_value as follows xarray/xarray/backends/zarr.py Lines 824 to 826 in cde720f
Here's how it works in V3 (3.0.0a7.dev4+ga3221245) import zarr
# fill_value set to 0 by default, same as V3
assert a.fill_value == 0 # passes; actually it is np.int64(0)
assert a[0] == 0
b = zarr.create(shape=10, chunks=10, dtype=int, fill_value=None)
# fails! it is still np.int64(0)
assert b.fill_value is None
# passes
assert b[0] == 0 So even though the V3 array behaves the same as the V2 array, it lacks the ability to set Having written this summary, the answer seems a bit more clear: we need another way for Zarr to tell Xarray whether to apply masking. Here's an idea: for V3 arrays (and maybe even also V2), we can use the fill_value = attrs.pop("_FillValue", None) ...we just do fill_value = attrs.get("_FillValue", None) If this attr is missing, we will never apply a mask, regardless of what the Zarr Some questions that arise:
|
zarr-developers/zarr-python#2274 is a draft PR restoring that behavior on zarr-python 3.x for zarr-v2. But for zarr-v3,
That sounds reasonable to me. Users who want the masking behavior can explicitly set I do think it has the potential to break reading some v2 datasets, where people were relying on this behavior. FWIW, this is surprising enough to me that I'd be comfortable calling it a bug fix, but I'll leave that up to the experts. |
I'm planning to take a swing at this over the weekend. |
I'd vote for this (I guess that's also what I tried to outline in #5475 (comment)).
I think it could break datasets (but maybe that's fine).
One might want to explicitly write
In my opinion, this should definitely be allowed. The two serve different purposes. One might however spend a little thought, if |
Over in #9552 I have implemented a fix for this issue, with the new approach to |
The zarr backend uses the
fill_value
of zarrs.zarray
key as if it would be the_FillValue
according to CF-Conventions:xarray/xarray/backends/zarr.py
Line 373 in 1a7b285
I think this interpretation of the
fill_value
is wrong and creates problems. Here's why:The zarr v2 spec is still a little vague, but states that
fill_value
isAccordingly this value should be used to fill all areas of a variable which are not backed by a stored chunk with this value. This is also different from what CF conventions state (emphasis mine):
The difference between the two is, that
fill_value
is only a background value, which just isn't stored as a chunk. But_FillValue
is (possibly) a background value and is interpreted as not being valid data. In my opinion, this mix of_FillValue
andmissing_value
could be considered a defect in the CF-Conventions, but probably that's far to late as many depend on this.Thinking of an example, when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the
fill_value
to0
and then store only chunks in regions of the atmosphere where clouds are actually present. In that case,0
(i.e. no drops) would be a perfectly valid value, which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save quite a bunch of storage. Other formats (e.g. OpenVDB) commonly use this trick.The issue gets worse when looking into the upcoming zarr v3 spec where
fill_value
is described as:Thus for boolean arrays, if the
fill_value
would be interpreted as a missing value indicator, only (missing,True
) or (False
, missing) arrays could be represented. A (False
,True
) array would not be possible. The issue applies similarly for integer types as well.The text was updated successfully, but these errors were encountered: