to_icechunk doesn't have a group option #341

Closed
rabernat opened this issue Dec 11, 2024 · 21 comments · Fixed by #383 or #391
Labels: good first issue (Good for newcomers), Icechunk 🧊 (Relates to Icechunk library / spec)

Comments

@rabernat (Collaborator)

This signature

def dataset_to_icechunk(
    ds: Dataset, store: "IcechunkStore", append_dim: Optional[str] = None
) -> None:

should include a group option to allow you to put the dataset somewhere other than at the root of the store hierarchy.

TomNicholas added the Icechunk 🧊 (Relates to Icechunk library / spec) and good first issue (Good for newcomers) labels on Dec 11, 2024
@TomNicholas (Member)

Right. I think this basically only requires passing the group kwarg down through to where we create the prefix here

key=f"{key_prefix}/c/{chunk_key}", # should be of form 'group/arr_name/c/0/1/2', where c stands for chunks

cc @mpiannucci

@chuckwondo (Contributor)

I'd like to take this one, if nobody else has claimed it.

@chuckwondo (Contributor) commented Jan 21, 2025

How would you like to see the group option added, given that the current function signature is the following?

def dataset_to_icechunk(
    ds: Dataset,
    store: IcechunkStore,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:

Offhand it seems like append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better. That starts to get messy, though: if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters.

@chuckwondo (Contributor)

> Right. I think this basically only requires passing the group kwarg down through to where we create the prefix here
>
> VirtualiZarr/virtualizarr/writers/icechunk.py, line 274 in fcdd5e4:
>
>     key=f"{key_prefix}/c/{chunk_key}",  # should be of form 'group/arr_name/c/0/1/2', where c stands for chunks
>
> cc @mpiannucci

Does this mean that we simply want to define the parameter to dataset_to_icechunk as Optional[Group] and simply pass down the root group if the group arg is None?

@TomNicholas (Member)

> simply pass down the root group if the group arg is None

Yep! Generally we want to make this as similar as possible to xarray.Dataset.to_zarr().

I agree about kw-only args. We're not too worried about backwards-compatibility at this point as the library is still new and evolving.
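
For comparison, here is a minimal example of how the group keyword behaves on xarray.Dataset.to_zarr (the store path and group name below are arbitrary; to_zarr creates the group if it doesn't already exist):

import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3])})
# Write the dataset under the "observations/v1" group rather than the store root;
# the group is created if it doesn't already exist.
ds.to_zarr("example.zarr", group="observations/v1", mode="w")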

@chuckwondo (Contributor)

>> simply pass down the root group if the group arg is None
>
> Yep! Generally we want to make this as similar as possible to xarray.Dataset.to_zarr().
>
> I agree about kw-only args. We're not too worried about backwards-compatibility at this point as the library is still new and evolving.

Do we want to create the group if it does not already exist, or raise an error in that case?

@TomNicholas (Member)

Ideally do whatever xarray.to_zarr does if the group does not already exist.
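
As a rough sketch of that behaviour (not the actual implementation), the writer could open or create the target group along these lines, assuming zarr-python's open_group; the _resolve_group helper name is made up:

import zarr


def _resolve_group(store, group=None):
    # Hypothetical helper: open the target group if it exists, otherwise create it,
    # which is roughly what xarray.Dataset.to_zarr does for a missing group.
    # group=None means the root of the store.
    return zarr.open_group(store=store, path=group, mode="a")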

@keewis (Contributor) commented Jan 22, 2025

> Offhand it seems like making append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better, but it feels like that's starting to get messy because if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters

for reference, if group would also be a positional-or-keyword (i.e. not positional-only) parameter with a default – like append_dim / last_updated_at – it would be possible to simply use group as a keyword argument:

dataset_to_icechunk(ds, store, group=...)

(so you don't have to make it a keyword-only parameter).

That said, keyword-only parameters allow making API use more uniform, which especially for optional parameters really improves the readability of any function calls (in my opinion, at least).
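
A small illustration of the difference, assuming ds and store are already defined and group gets a default of None:

# With group as a positional-or-keyword parameter that has a default,
# a caller can set it by keyword without passing the other optionals:
dataset_to_icechunk(ds, store, group="forecasts")

# Making group (and the other optionals) keyword-only additionally rules out
# the harder-to-read positional form:
# dataset_to_icechunk(ds, store, None, None, "forecasts")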

@chuckwondo (Contributor)

>> Offhand it seems like making append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better, but it feels like that's starting to get messy because if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters
>
> for reference, if group would also be a positional-or-keyword (i.e. not positional-only) parameter with a default – like append_dim / last_updated_at – it would be possible to simply use group as a keyword argument:
>
>     dataset_to_icechunk(ds, store, group=...)
>
> (so you don't have to make it a keyword-only parameter).
>
> That said, keyword-only parameters allow making API use more uniform, which especially for optional parameters really improves the readability of any function calls (in my opinion, at least).

Thank you. I realized that later (🤦), especially after I made append_dim and last_updated_at keyword-only and saw that all of the unit tests for dataset_to_icechunk still passed. All of those unit tests explicitly set append_dim and last_updated_at via keywords.

Given that, and given that @TomNicholas indicated that this function should be "as similar as possible to xarray.Dataset.to_zarr()," here is the signature for that function:

Dataset.to_zarr(store=None, chunk_store=None, mode=None, synchronizer=None,
group=None, encoding=None, *, compute=True, consolidated=None, append_dim=None,
region=None, safe_chunks=True, storage_options=None, zarr_version=None,
zarr_format=None, write_empty_chunks=None, chunkmanager_store_kwargs=None)

So shall I make dataset_to_icechunk like so?

def dataset_to_icechunk(
    ds: Dataset,
    store: IcechunkStore,
    group: Optional[Group] = None,
    *,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:

This would make the existing optional parameters keyword-only, but make the new, optional group param pos-or-kw to match the signature of xarray.Dataset.to_zarr().

Or shall I make group kw-only as well?

@keewis (Contributor) commented Jan 22, 2025

> Or shall I make group kw-only as well?

I personally would also make group keyword-only (if we were to rewrite Dataset.to_zarr I'd also vote for making all the parameters other than store keyword-only):

dataset_to_icechunk(ds, store, group=x)

is much harder to misunderstand than

dataset_to_icechunk(ds, store, x)

(I'm intentionally choosing a bad variable name to make my point)

@chuckwondo (Contributor)

>> Or shall I make group kw-only as well?
>
> I personally would also make group keyword-only (if we were to rewrite Dataset.to_zarr I'd also vote for making all the parameters other than store keyword-only):
>
>     dataset_to_icechunk(ds, store, group=x)
>
> is much harder to misunderstand compared to
>
>     dataset_to_icechunk(ds, store, x)
>
> (I'm intentionally choosing a bad variable name to make my point)

Thank you. I agree. I'll take that route unless someone strongly advises otherwise.
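
That route would give a signature roughly along these lines. This is just a sketch of the option under discussion, not the merged code, and whether group should be a path string or a zarr Group was still open at this point:

from datetime import datetime
from typing import Optional

from xarray import Dataset


def dataset_to_icechunk(
    ds: Dataset,
    store: "IcechunkStore",
    *,
    group: Optional[str] = None,  # path of the target group; None writes to the store root
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None: ...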

chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 23, 2025
chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 23, 2025
@maxrjones (Member)

@chuckwondo would you be interested in opening another PR that adds the group option to the to_icechunk method in the VirtualiZarrDatasetAccessor:

def to_icechunk(
    self,
    store: "IcechunkStore",
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:
    """
    Write an xarray dataset to an Icechunk store.

    Any variables backed by ManifestArray objects will be written as virtual references; any other variables will be loaded into memory before their binary chunk data is written into the store.

    If `append_dim` is provided, the virtual dataset will be appended to the existing IcechunkStore along the `append_dim` dimension.

    If `last_updated_at` is provided, it will be used as a checksum for any virtual chunks written to the store with this operation.
    At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised.
    This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed.
    This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

    Parameters
    ----------
    store: IcechunkStore
    append_dim: str, optional
        When provided, specifies the dimension along which to append the virtual dataset.
    last_updated_at: datetime, optional
        When provided, uses the provided datetime as a checksum for any virtual chunks written to the store with this operation.
        When not provided (default), no check is performed.

    Examples
    --------
    To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to ``last_updated_at``.

    >>> from datetime import datetime
    >>>
    >>> vds.virtualize.to_icechunk(
    ...     icechunkstore,
    ...     last_updated_at=datetime.now(),
    ... )
    """
    from virtualizarr.writers.icechunk import dataset_to_icechunk

    dataset_to_icechunk(self.ds, store, append_dim=append_dim)

This is a convenience method that allows users to use code like:

vds.virtualize.to_icechunk(store, group=group)

rather than needing to call dataset_to_icechunk().
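
A sketch of what the updated accessor method might look like once group is plumbed through (again just a sketch, not the merged change; it is a method of VirtualiZarrDatasetAccessor, shown standalone here):

from datetime import datetime
from typing import Optional


def to_icechunk(
    self,
    store: "IcechunkStore",
    *,
    group: Optional[str] = None,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:
    from virtualizarr.writers.icechunk import dataset_to_icechunk

    # Forward everything straight through to the writer function.
    dataset_to_icechunk(
        self.ds,
        store,
        group=group,
        append_dim=append_dim,
        last_updated_at=last_updated_at,
    )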

I think the best way to test this new code would be to modify the test you previously parameterized so that it uses the to_icechunk() method on the accessor rather than calling dataset_to_icechunk directly, because this will only add code coverage:

@pytest.mark.parametrize("group_path", [None, "", "/a", "a", "/a/b", "a/b", "a/b/"])
def test_write_new_virtual_variable(
    icechunk_filestore: "IcechunkStore",
    vds_with_manifest_arrays: xr.Dataset,
    group_path: Optional[str],
):
    vds = vds_with_manifest_arrays

    dataset_to_icechunk(vds, icechunk_filestore, group=group_path)

    # check attrs
    group = zarr.group(store=icechunk_filestore, path=group_path)
    assert isinstance(group, zarr.Group)
    assert group.attrs.asdict() == {"something": 0}

    # TODO check against vds, then perhaps parametrize?

    # check array exists
    assert "a" in group
    arr = group["a"]
    assert isinstance(arr, zarr.Array)

    # check array metadata
    assert arr.metadata.zarr_format == 3
    assert arr.shape == (2, 3)
    assert arr.chunks == (2, 3)
    assert arr.dtype == np.dtype("<i8")
    assert arr.order == "C"
    assert arr.fill_value == 0
    # TODO check compressor, filters?
    #
    # check array attrs
    # TODO somehow this is broken by setting the dimension names???
    # assert dict(arr.attrs) == {"units": "km"}

    # check dimensions
    if isinstance(arr.metadata, ArrayV3Metadata):
        assert arr.metadata.dimension_names == ("x", "y")

@chuckwondo (Contributor)

> @chuckwondo would you be interested in opening another PR that adds the group option to the to_icechunk method in the VirtualiZarrDatasetAccessor:

Sure!

@TomNicholas (Member)

Oops, thank you for catching that @maxrjones!

Yes that function isn't meant to be public API, only the accessors.

@chuckwondo (Contributor)

> Oops, thank you for catching that @maxrjones!
>
> Yes that function isn't meant to be public API, only the accessors.

@TomNicholas, do you mean that the dataset_to_icechunk function isn't meant to be part of the public API? If it's not, is there something you want to change to indicate that to users?

@TomNicholas (Member)

Yes - the fact that it's not public API is supposed to be clear from the fact that it isn't listed on the API docs page:

https://virtualizarr.readthedocs.io/en/stable/api.html

whereas the accessor method is mentioned there.

But I welcome suggestions of how to make this clearer.

@maxrjones (Member)

I think changing it to _dataset_to_icechunk would be worthwhile.

@chuckwondo (Contributor)

> I think changing it to _dataset_to_icechunk would be worthwhile.

Will do. I'll make this change in the context of the accessor update. @maxrjones, do you want me to create a new issue for the accessor work?

maxrjones reopened this on Jan 24, 2025
@chuckwondo (Contributor)

@TomNicholas, I was just chatting with @maxrjones about the usefulness (or lack thereof) of a Dataset accessor in this case.

I generally avoid "accessor registration" because of what I find to be degraded dev productivity due to the lack of "discoverability" of available accessors on an object (i.e., no code-completion help to discover registered accessors), and further lack of code-completion on a registered accessor (thus no ability to discover properties and methods of an accessor, forcing the need to dig up docs). Further, static type checking goes out the window since registered accessors are simply typed as Any.

Max pointed out that such accessors can be useful for adding state without the need to subclass, but since in this case, we're not adding state, do we want to "force" users to use an accessor instead of allowing them to use the dataset_to_icechunk function directly (thus retaining type-checking and auto-completion benefits)?

@TomNicholas (Member)

Thanks for the thoughtful Qs @chuckwondo. As this is orthogonal to the addition of the group kwarg I would prefer that we just plumb it through to the accessor for now (else right now the API is unclear + inconsistent), and I've raised #389 to discuss the general design choice of using accessors at all.

chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 27, 2025
@chuckwondo (Contributor)

> Thanks for the thoughtful Qs @chuckwondo. As this is orthogonal to the addition of the group kwarg I would prefer that we just plumb it through to the accessor for now (else right now the API is unclear + inconsistent), and I've raised #389 to discuss the general design choice of using accessors at all.

Excellent. Thanks for starting a separate discussion. I'll add my 2 cents over there.

In the meantime, I've plumbed things through as requested, in PR #391
