Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open_consolidated raises ValueError when attempting to read a specific subgroup #2820

Closed
aladinor opened this issue Feb 12, 2025 · 9 comments
Labels
bug Potential issues with the zarr-python library

Comments

@aladinor
Copy link

Zarr version

v3.0.1

Numcodecs version

v0.15.0

Python Version

3.12

Operating System

Linux

Installation

conda

Description

Hi everyone,

I am working on fixing incompatibilities between Zarr-Python V3 and xarray DataTrees PR10020 and I found that when opening a consolidated group it will raise the following error,

Steps to reproduce

import zarr

# Create the root group in a directory
root = zarr.open('example.zarr', mode='w')
# Create group 'a' and add variable 'A'
group_a = root.create_group('a')
group_a.create_array('A', shape=(128, 256), dtype='float64', chunks=(64, 256))

# Create group 'c' and add variables 'w' and 'z'
group_c = root.create_group('c')
group_c.create_array('w', shape=(128,), dtype='float64')
group_c.create_array('z', shape=(128, 256), dtype='float64', chunks=(64, 256))

# Create subgroup 'd' under group 'c' and add variable 'G'
group_d = group_c.create_group('d')
group_d.create_array('G', shape=(128, 256), dtype='float64', chunks=(64, 256))

# Consolidate metadata
zarr.consolidate_metadata('example.zarr')

when using zarr.open_consolidated I got the following error

root = zarr.open_consolidated(
        'example.zarr',
        mode='r',
        path="/c"
    )

Traceback (most recent call last):
  File "/snap/pycharm-community/439/plugins/python-ce/helpers/pydev/pydevd.py", line 1570, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/snap/pycharm-community/439/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/media/alfonso/drive/Alfonso/python/xarray/delete_Zarr.py", line 38, in <module>
    open_zarr_store()
  File "/media/alfonso/drive/Alfonso/python/xarray/delete_Zarr.py", line 28, in open_zarr_store
    root = zarr.open_consolidated(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/synchronous.py", line 212, in open_consolidated
    sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/sync.py", line 142, in sync
    raise return_result
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 346, in open_consolidated
    return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/api/asynchronous.py", line 807, in open_group
    return await AsyncGroup.open(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/group.py", line 553, in open
    return cls._from_bytes_v3(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/alfonso/mambaforge/envs/xarray-tests/lib/python3.12/site-packages/zarr/core/group.py", line 611, in _from_bytes_v3
    raise ValueError(msg)
ValueError: Consolidated metadata requested with 'use_consolidated=True' but not found in 'c'.

Apparently, the open _consolidated function does not detect the consolidated metadata at each nested node.

Additional output

No response

@aladinor aladinor added the bug Potential issues with the zarr-python library label Feb 12, 2025
@aladinor aladinor changed the title The Zarr Group raises ValueError when attempting to read a specific subgroup with consolidated metadata open_consolidated raises ValueError when attempting to read a specific subgroup with consolidated metadata Feb 12, 2025
@aladinor aladinor changed the title open_consolidated raises ValueError when attempting to read a specific subgroup with consolidated metadata open_consolidated raises ValueError when attempting to read a specific subgroup Feb 12, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Feb 12, 2025

when you run consolidate_metadata('example.zarr'), it only creates consolidated metadata in the attributes of the root group, which means the group at example.zarr/c doesn't have any consolidated metadata, which is why you see that error message. If you change your invocation of open_consolidated to

root = zarr.open_consolidated(
        'example.zarr',
        mode='r',
        path=""
    )

then it should work.

@aladinor
Copy link
Author

Thanks @d-v-b for your comment.

But, what if I only need the want the '/c' node? Do we need to open the whole group and then extract only the wanted node?

@d-v-b
Copy link
Contributor

d-v-b commented Feb 14, 2025

But, what if I only need the want the '/c' node? Do we need to open the whole group and then extract only the wanted node?

That's the current design, yes. Being able to directly open that group via the path keyword argument would be nice though. Maybe this could work by doing a depth-first search along <root>/<nodes>/c for the consolidated metadata that describes the group c, but I think the intention of the open_consolidated function is to open a group that has consolidated metadata, not to open a group with consolidated metadata OR a group described by some parent group's consolidated metadata. thoughts @TomAugspurger ?

@aladinor aladinor reopened this Feb 14, 2025
@TomAugspurger
Copy link
Contributor

The only thing giving me pause here is the potential for consolidated metadata at different levels of the hierarchy to have different (potentially inconsistent) consolidated metadata. And I'd be uneasy about open_consolidated trying to search up a tree for where the metadata has been consolidated: consolidated metadata is supposed to avoid making many reads, so having a fallback mode where it does that would be surprising.

Do we need to open the whole group and then extract only the wanted node?

Performance wise, there won't really be a difference, unless your consolidated metadata file is huge. open_consolidated(store)["c"] and open_group(store, path="c") will both make a single read / HTTP request.

So IMO, this isn't worth trying to support.

@rabernat
Copy link
Contributor

I'll just chime in here and say that I think that consolidated metadata is a bad idea (even though I helped implement it!) and that we should be looking for other more robust solutions to performance and consistency.

We have a blog post coming out next week showing how Zarr 3 async concurrency allows us to quickly list large hierarchies without consolidated metadata.

@TomAugspurger
Copy link
Contributor

Yeah, the (lack of) consistency issues make it hard to reason about.

I do think that having ways to do in-memory operations on Group and Array structures (like open_group(store)["a"]["b"]) without triggering a read from the store on each operation is valuable.

@rabernat
Copy link
Contributor

I do think that having ways to do in-memory operations on Group and Array structures (like open_group(store)["a"]["b"]) without triggering a read from the store on each operation is valuable.

Me too! But doing that requires some notion of caching, which requires some understanding when you can safely cache.

When you follow this reasoning to its end, you reach Icechunk. 🙃

@d-v-b
Copy link
Contributor

d-v-b commented Feb 14, 2025

my preferred approach would be to have a very clear distinction between types / data structures that can do IO, and those that do not. The fact that we want zarr groups to be both IO-free models (via metadata consolidation and attribute caching) and objects that can do IO via array / group creation is a source of a lot of problems. There are ways of separating these two concerns, but it would be a big departure from what users expect.

@aladinor
Copy link
Author

aladinor commented Feb 14, 2025

Thanks @d-v-b, @TomAugspurger, and @rabernat for your comments and thoughts. I’ve refactored the code to first read the root metadata and then extract the required node, as shown here.

Please let me know if you have any further comments or suggestions. Otherwise, I’ll proceed to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

4 participants