to_icechunk doesn't have a group option #341

Closed
rabernat opened this issue Dec 11, 2024 · 21 comments · Fixed by #383 or #391
Labels: good first issue (Good for newcomers), Icechunk 🧊 (Relates to Icechunk library / spec)

Comments

@rabernat (Collaborator)

This signature

def dataset_to_icechunk(
    ds: Dataset, store: "IcechunkStore", append_dim: Optional[str] = None
) -> None:

should include a group option to allow you to put the dataset somewhere other than at the root of the store hierarchy.

TomNicholas added the Icechunk 🧊 (Relates to Icechunk library / spec) and good first issue (Good for newcomers) labels on Dec 11, 2024
@TomNicholas (Member)

Right. I think this basically only requires passing the group kwarg down through to where we create the prefix here

key=f"{key_prefix}/c/{chunk_key}", # should be of form 'group/arr_name/c/0/1/2', where c stands for chunks

cc @mpiannucci

@chuckwondo (Contributor)

I'd like to take this one, if nobody else has claimed it.

@chuckwondo (Contributor) commented Jan 21, 2025

How would you like to see the group option added, given that the current function signature is the following?

def dataset_to_icechunk(
    ds: Dataset,
    store: IcechunkStore,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:

Offhand it seems like append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better. That starts to get messy, though: if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters.

@chuckwondo (Contributor)

> Right. I think this basically only requires passing the group kwarg down through to where we create the prefix here
>
> VirtualiZarr/virtualizarr/writers/icechunk.py, line 274 in fcdd5e4:
>
>     key=f"{key_prefix}/c/{chunk_key}",  # should be of form 'group/arr_name/c/0/1/2', where c stands for chunks
>
> cc @mpiannucci

Does this mean that we simply want to define the parameter to dataset_to_icechunk as Optional[Group] and simply pass down the root group if the group arg is None?

@TomNicholas (Member)

> simply pass down the root group if the group arg is None

Yep! Generally we want to make this as similar as possible to xarray.Dataset.to_zarr().

I agree about kw-only args. We're not too worried about backwards-compatibility at this point as the library is still new and evolving.
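
For comparison, here is a minimal example of how the group keyword behaves on xarray.Dataset.to_zarr (the store path and group name below are arbitrary; to_zarr creates the group if it doesn't already exist):

import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2, 3])})
# Write the dataset under the "observations/v1" group rather than the store root;
# the group is created if it doesn't already exist.
ds.to_zarr("example.zarr", group="observations/v1", mode="w")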

@chuckwondo (Contributor)

>> simply pass down the root group if the group arg is None
>
> Yep! Generally we want to make this as similar as possible to xarray.Dataset.to_zarr().
>
> I agree about kw-only args. We're not too worried about backwards-compatibility at this point as the library is still new and evolving.

Do we want to create the group if it does not already exist, or raise an error in that case?

@TomNicholas (Member)

Ideally do whatever xarray.to_zarr does if the group does not already exist.
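
As a rough sketch of that behaviour (not the actual implementation), the writer could open or create the target group along these lines, assuming zarr-python's open_group; the _resolve_group helper name is made up:

import zarr


def _resolve_group(store, group=None):
    # Hypothetical helper: open the target group if it exists, otherwise create it,
    # which is roughly what xarray.Dataset.to_zarr does for a missing group.
    # group=None means the root of the store.
    return zarr.open_group(store=store, path=group, mode="a")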

@keewis (Contributor) commented Jan 22, 2025

> Offhand it seems like making append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better, but it feels like that's starting to get messy because if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters

for reference, if group would also be a positional-or-keyword (i.e. not positional-only) parameter with a default – like append_dim / last_updated_at – it would be possible to simply use group as a keyword argument:

dataset_to_icechunk(ds, store, group=...)

(so you don't have to make it a keyword-only parameter).

That said, keyword-only parameters allow making API use more uniform, which especially for optional parameters really improves the readability of any function calls (in my opinion, at least).
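
A small illustration of the difference, assuming ds and store are already defined and group gets a default of None:

# With group as a positional-or-keyword parameter that has a default,
# a caller can set it by keyword without passing the other optionals:
dataset_to_icechunk(ds, store, group="forecasts")

# Making group (and the other optionals) keyword-only additionally rules out
# the harder-to-read positional form:
# dataset_to_icechunk(ds, store, None, None, "forecasts")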

@chuckwondo (Contributor)

>> Offhand it seems like making append_dim, last_updated_at, and group (to be added) might be better as kw-only args, but I suspect that would represent a breaking change, so adding group as an optional parameter at the end seems like it would play better, but it feels like that's starting to get messy because if a caller wants only to set group and not the other 2 optionals, they are forced to pass 2 Nones for those positional parameters
>
> for reference, if group would also be a positional-or-keyword (i.e. not positional-only) parameter with a default – like append_dim / last_updated_at – it would be possible to simply use group as a keyword argument:
>
>     dataset_to_icechunk(ds, store, group=...)
>
> (so you don't have to make it a keyword-only parameter).
>
> That said, keyword-only parameters allow making API use more uniform, which especially for optional parameters really improves the readability of any function calls (in my opinion, at least).

Thank you. I realized that later (🤦), especially after I made append_dim and last_updated_at keyword-only and saw that all of the unit tests for dataset_to_icechunk still passed. All of those unit tests explicitly set append_dim and last_updated_at via keywords.

Given that, and given that @TomNicholas indicated that this function should be "as similar as possible to xarray.Dataset.to_zarr()," here is the signature for that function:

Dataset.to_zarr(store=None, chunk_store=None, mode=None, synchronizer=None,
group=None, encoding=None, *, compute=True, consolidated=None, append_dim=None,
region=None, safe_chunks=True, storage_options=None, zarr_version=None,
zarr_format=None, write_empty_chunks=None, chunkmanager_store_kwargs=None)

So shall I make dataset_to_icechunk like so?

def dataset_to_icechunk(
    ds: Dataset,
    store: IcechunkStore,
    group: Optional[Group] = None,
    *,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:

This would make the existing optional parameters keyword-only, but make the new, optional group param pos-or-kw to match the signature of xarray.Dataset.to_zarr().

Or shall I make group kw-only as well?

@keewis (Contributor) commented Jan 22, 2025

> Or shall I make group kw-only as well?

I personally would also make group keyword-only (if we were to rewrite Dataset.to_zarr I'd also vote for making all the parameters other than store keyword-only):

dataset_to_icechunk(ds, store, group=x)

is much harder to misunderstand than

dataset_to_icechunk(ds, store, x)

(I'm intentionally choosing a bad variable name to make my point)

@chuckwondo (Contributor)

>> Or shall I make group kw-only as well?
>
> I personally would also make group keyword-only (if we were to rewrite Dataset.to_zarr I'd also vote for making all the parameters other than store keyword-only):
>
>     dataset_to_icechunk(ds, store, group=x)
>
> is much harder to misunderstand compared to
>
>     dataset_to_icechunk(ds, store, x)
>
> (I'm intentionally choosing a bad variable name to make my point)

Thank you. I agree. I'll take that route unless someone strongly advises otherwise.
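
That route would give a signature roughly along these lines. This is just a sketch of the option under discussion, not the merged code, and whether group should be a path string or a zarr Group was still open at this point:

from datetime import datetime
from typing import Optional

from xarray import Dataset


def dataset_to_icechunk(
    ds: Dataset,
    store: "IcechunkStore",
    *,
    group: Optional[str] = None,  # path of the target group; None writes to the store root
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None: ...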

chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 23, 2025
chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 23, 2025
@maxrjones (Member)

@chuckwondo would you be interested in opening another PR that adds the group option to the to_icechunk method in the VirtualiZarrDatasetAccessor:

def to_icechunk(
    self,
    store: "IcechunkStore",
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:
    """
    Write an xarray dataset to an Icechunk store.

    Any variables backed by ManifestArray objects will be written as virtual references; any other variables will be loaded into memory before their binary chunk data is written into the store.

    If `append_dim` is provided, the virtual dataset will be appended to the existing IcechunkStore along the `append_dim` dimension.

    If `last_updated_at` is provided, it will be used as a checksum for any virtual chunks written to the store with this operation.
    At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised.
    This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed.
    This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

    Parameters
    ----------
    store: IcechunkStore
    append_dim: str, optional
        When provided, specifies the dimension along which to append the virtual dataset.
    last_updated_at: datetime, optional
        When provided, uses the provided datetime as a checksum for any virtual chunks written to the store with this operation.
        When not provided (default), no check is performed.

    Examples
    --------
    To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to ``last_updated_at``.

    >>> from datetime import datetime
    >>>
    >>> vds.virtualize.to_icechunk(
    ...     icechunkstore,
    ...     last_updated_at=datetime.now(),
    ... )
    """
    from virtualizarr.writers.icechunk import dataset_to_icechunk

    dataset_to_icechunk(self.ds, store, append_dim=append_dim)

This is a convenience method that allows users to use code like:

vds.virtualize.to_icechunk(store, group=group)

rather than needing to call dataset_to_icechunk().
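
A sketch of what the updated accessor method might look like once group is plumbed through (again just a sketch, not the merged change; it is a method of VirtualiZarrDatasetAccessor, shown standalone here):

from datetime import datetime
from typing import Optional


def to_icechunk(
    self,
    store: "IcechunkStore",
    *,
    group: Optional[str] = None,
    append_dim: Optional[str] = None,
    last_updated_at: Optional[datetime] = None,
) -> None:
    from virtualizarr.writers.icechunk import dataset_to_icechunk

    # Forward everything straight through to the writer function.
    dataset_to_icechunk(
        self.ds,
        store,
        group=group,
        append_dim=append_dim,
        last_updated_at=last_updated_at,
    )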

I think the best way to test this new code would be to modify the test you previously parameterized so that it uses the to_icechunk() method on the accessor rather than calling dataset_to_icechunk directly, because this will only add code coverage:

@pytest.mark.parametrize("group_path", [None, "", "/a", "a", "/a/b", "a/b", "a/b/"])
def test_write_new_virtual_variable(
    icechunk_filestore: "IcechunkStore",
    vds_with_manifest_arrays: xr.Dataset,
    group_path: Optional[str],
):
    vds = vds_with_manifest_arrays

    dataset_to_icechunk(vds, icechunk_filestore, group=group_path)

    # check attrs
    group = zarr.group(store=icechunk_filestore, path=group_path)
    assert isinstance(group, zarr.Group)
    assert group.attrs.asdict() == {"something": 0}

    # TODO check against vds, then perhaps parametrize?

    # check array exists
    assert "a" in group
    arr = group["a"]
    assert isinstance(arr, zarr.Array)

    # check array metadata
    assert arr.metadata.zarr_format == 3
    assert arr.shape == (2, 3)
    assert arr.chunks == (2, 3)
    assert arr.dtype == np.dtype("<i8")
    assert arr.order == "C"
    assert arr.fill_value == 0
    # TODO check compressor, filters?
    #
    # check array attrs
    # TODO somehow this is broken by setting the dimension names???
    # assert dict(arr.attrs) == {"units": "km"}

    # check dimensions
    if isinstance(arr.metadata, ArrayV3Metadata):
        assert arr.metadata.dimension_names == ("x", "y")

@chuckwondo (Contributor)

> @chuckwondo would you be interested in opening another PR that adds the group option to the to_icechunk method in the VirtualiZarrDatasetAccessor:

Sure!

@TomNicholas (Member)

Oops, thank you for catching that @maxrjones!

Yes that function isn't meant to be public API, only the accessors.

@chuckwondo (Contributor)

> Oops, thank you for catching that @maxrjones!
>
> Yes that function isn't meant to be public API, only the accessors.

@TomNicholas, do you mean that the dataset_to_icechunk function isn't meant to be part of the public API? If it's not, is there something you want to change to indicate that to users?

@TomNicholas (Member)

Yes - the fact that it's not public API is supposed to be clear from the fact that it isn't listed on the API docs page:

https://virtualizarr.readthedocs.io/en/stable/api.html

whereas the accessor method is mentioned there.

But I welcome suggestions of how to make this clearer.

@maxrjones (Member)

I think changing it to _dataset_to_icechunk would be worthwhile.

@chuckwondo (Contributor)

> I think changing it to _dataset_to_icechunk would be worthwhile.

Will do. I'll make this change in the context of the accessor update. @maxrjones, do you want me to create a new issue for the accessor work?

maxrjones reopened this on Jan 24, 2025
@chuckwondo (Contributor)

@TomNicholas, I was just chatting with @maxrjones about the usefulness (or lack thereof) of a Dataset accessor in this case.

I generally avoid "accessor registration" because of what I find to be degraded dev productivity due to the lack of "discoverability" of available accessors on an object (i.e., no code-completion help to discover registered accessors), and further lack of code-completion on a registered accessor (thus no ability to discover properties and methods of an accessor, forcing the need to dig up docs). Further, static type checking goes out the window since registered accessors are simply typed as Any.

Max pointed out that such accessors can be useful for adding state without the need to subclass, but since in this case, we're not adding state, do we want to "force" users to use an accessor instead of allowing them to use the dataset_to_icechunk function directly (thus retaining type-checking and auto-completion benefits)?

@TomNicholas (Member)

Thanks for the thoughtful Qs @chuckwondo. As this is orthogonal to the addition of the group kwarg I would prefer that we just plumb it through to the accessor for now (else right now the API is unclear + inconsistent), and I've raised #389 to discuss the general design choice of using accessors at all.

chuckwondo added a commit to chuckwondo/VirtualiZarr that referenced this issue Jan 27, 2025
@chuckwondo (Contributor)

> Thanks for the thoughtful Qs @chuckwondo. As this is orthogonal to the addition of the group kwarg I would prefer that we just plumb it through to the accessor for now (else right now the API is unclear + inconsistent), and I've raised #389 to discuss the general design choice of using accessors at all.

Excellent. Thanks for starting a separate discussion. I'll add my 2 cents over there.

In the meantime, I've plumbed things through as requested, in PR #391
