Fixed consolidated Group getitem with multi-part key #2363

TomAugspurger · 2024-10-14T15:13:40Z

This fixes Group.__getitem__ when indexing with a key
like 'subgroup/array'. The basic idea is to rewrite the indexing
operation as group['subgroup']['array'] by splitting the key
and doing each operation independently. This is fine for consolidated
metadata which doesn't need to do IO.

There's a complication around unconsolidated metadata, though. What
if we encounter a node where Group.getitem returns a sub Group
without consolidated metadata. Then we need to fall back to
non-consolidated metadata. We've written _getitem_consolidated
as a regular (non-async) function so we need to pop back up to
the async caller and have it fall back.

Closes #2358

This fixes `Group.__getitem__` when indexing with a key like 'subgroup/array'. The basic idea is to rewrite the indexing operation as `group['subgroup']['array']` by splitting the key and doing each operation independently. This is fine for consolidated metadata which doesn't need to do IO. There's a complication around unconsolidated metadata, though. What if we encounter a node where `Group.getitem` returns a sub Group without consolidated metadata. Then we need to fall back to non-consolidated metadata. We've written _getitem_consolidated as a regular (non-async) function so we need to pop back up to the async caller and have *it* fall back. Closes zarr-developers#2358

TomAugspurger · 2024-10-14T15:18:14Z

src/zarr/core/group.py

@@ -97,6 +97,24 @@ def _parse_async_node(
        raise TypeError(f"Unknown node type, got {type(node)}")


+class _MixedConsolidatedMetadataException(Exception):


Let me know if this MixConsolidatedMetadataException stuff makes sense. The crux of the issue is that Group.__getitem__ on a Group with consolidated metadata can be IO-free. To make that explicit, we define it as a plain (non-async) function.

But this group['subgroup']['subarray'] presents a new challenge. What if group has consolidated metadata, but subgroup doesn't? Its consolidated metadata might be None, meaning we need to fall back to the non-consolidated metadata. We discover this fact in AsyncGroup._getitem_consolidated, which isn't an async function, so it can't call the async non-consolidated getitem.

To break through that tangle, I've added this custom exception. The expectation is that users never see it; we require (through docs & being careful?) that all the callers of _getitem_consolidated handle this case by catching the error and falling back if needed.

I forgot to mention: this "mixed" case isn't going to be common. Under normal operation, where users call consolidate_metadata and don't mutate the group afterwards we consolidate everything and so won't hit this. I believe the only time this can happen through the public API is when users add a new group to group with consolidated metadata, and don't then re-consolidate.

I believe the only time this can happen through the public API is when users add a new group to group with consolidated metadata, and don't then re-consolidate.

Do we need to support this case? I would think that users could expect that, when using consolidated metadata, then metadata lookups (beyond the first) will never trigger IO.

Some discussion on this here: #2363 (comment). I think there isn't really a clear answer. I'm sympathetic to users who want to know for certain that they're done with I/O after loading a Group with consolidated metadata.

If we decide not to support this, we do still have a decision to make: when we try to index into a nested Group and reach a node with non-consolidated metadata, do we raise a KeyError (saying we know this key isn't there) or some public version of this MixedConsolidatedMetadataError (saying we don't know whether or not this key is there)?

I'm not a user of consolidated metadata, so take this with a grain of salt, but I would default to being strict: if people open a group in consolidated metadata mode, then IMO the contents of the consolidated metadata should purport to provide the complete model of the hierarchy, even if that model happens to be wrong :) But I'd prefer to temper my strict default with whatever expectations consolidated metadata users have.

but I would default to being strict

Just to be clear: what does "strict" mean to you here?

I actually am having trouble constructing a hierarchy that hits this using the public API:

In [1]: import zarr In [2]: store = zarr.storage.MemoryStore(mode="w") In [3]: group = zarr.open_group(store=store, path="a/b/c/d") In [4]: g = zarr.consolidate_metadata(store) In [5]: g['a/b/c/d'].create_group("x") Out[5]: <Group memory://4480609344/a/b/c/d/x> In [6]: g['a/b/c/d'].metadata.consolidated_metadata Out[6]: ConsolidatedMetadata(metadata={}, kind='inline', must_understand=False)

I think the semantics around what exact state in after mutating a consolidated hierarchy, but before reconsolidating, that we can do whatever we want.

but I would default to being strict

Just to be clear: what does "strict" mean to you here?

for me, "strict" would mean that if a user opens a group with use_consolidated=True, then the consolidated metadata for that group will be the single source of truth about the hierarchy, and so attempting to access a sub-node that actually exists, but isn't in that document, would raise a KeyError. Not sure if this is actually a reasonable position

jhamman

Nice work here @TomAugspurger - The (somewhat messy) handling of mixed consolidated / non-consolidated metadata makes me wonder if we want to revisit supporting it at all. It would be pretty reasonable to ask callers to switch to consolidated=False if they go looking for a key outside of the range of the consolidated metadata.

But we can have that conversation another day.

…item

TomAugspurger · 2024-10-15T16:30:25Z

Makes sense. I think it depends on exactly how transparent a cache we want consolidated metadata to be. If it's really supposed to be transparent, then falling back is probably the right thing to do.

Really, the complexity comes from my stubbornness around keeping _getitem_consolidated non-async. We could make that async and just do await replace(self, consolidated_metadata=None).getitem(key) if we detect non-consolidated metadata.

But overall, I think the guarantee we get that _getitem_consolidated is actually I/O free makes jumping through hoops worth it.

An alternative to using Exceptions to signal an issue with mixed consolidated metadata, we could return a Result or Option type from _getitem_consolidated that is either the actual result or some object indicating we need to fall back to non-consolidated metadata. I think that'd be about the same level of complexity.

src/zarr/core/group.py

d-v-b · 2024-10-16T13:59:11Z

src/zarr/core/group.py

+        indexers.reverse()
+        metadata: ArrayV2Metadata | ArrayV3Metadata | GroupMetadata = self.metadata
+
+        while indexers:


wouldn't the flattened form of the consolidated metadata make this a lot simpler?

maybe the problem with that is that it requires that the consolidated metadata be complete? whereas the iterative approach can handle a group with 'live' metadata.

Sorry I missed this earlier. Flattened metadata would make this specific section simpler, but I think would complicate things a later since we'd when we need to "unflatten" it to put all of its children in its consolidated_metadata. Doable, but not obviously simpler in the end.

…-getitem' into tom/fix/cm-nested-getitem

…etitem

jhamman

Reapproving -- this is much cleaner and, I think, in line with user expectations for the scope of consolidated metadata

TomAugspurger · 2024-10-17T14:26:38Z

Thanks for the reviews!

…#2363) * Fixed consolidated Group getitem with multi-part key This fixes `Group.__getitem__` when indexing with a key like 'subgroup/array'. The basic idea is to rewrite the indexing operation as `group['subgroup']['array']` by splitting the key and doing each operation independently. Closes zarr-developers#2358 --------- Co-authored-by: Joe Hamman <joe@earthmover.io>

* move v3/tests to tests and fix various mypy issues * test(ci): change branch name in v3 workflows (#2368) * Use lazy % formatting in logging functions (#2366) * Use lazy % formatting in logging functions * f-string should be more efficient * Space before unit symbol From "SI Unit rules and style conventions": https://physics.nist.gov/cuu/Units/checklist.html There is a space between the numerical value and unit symbol, even when the value is used in an adjectival sense, except in the case of superscript units for plane angle. * Enforce ruff/flake8-logging-format rules (G) --------- Co-authored-by: Joe Hamman <joe@earthmover.io> * Move roadmap and v3-design documument to docs (#2354) * move roadmap to docs * formatting and minor copy editing * Multiple imports for an import name (#2367) Co-authored-by: Joe Hamman <joe@earthmover.io> * Enforce ruff/pycodestyle warnings (W) (#2369) * Apply ruff/pycodestyle rule W291 W291 Trailing whitespace * Enforce ruff/pycodestyle warnings (W) It looks like `ruff format` does not catch all trailing spaces. --------- Co-authored-by: Joe Hamman <joe@earthmover.io> * Apply ruff/pycodestyle preview rule E262 (#2370) E262 Inline comment should start with `# ` Co-authored-by: Joe Hamman <joe@earthmover.io> * Fix typo (#2382) Co-authored-by: Joe Hamman <joe@earthmover.io> * Imported name is not used anywhere in the module (#2379) * Missing mandatory keyword argument `shape` (#2376) * Update ruff rules to ignore (#2374) Co-authored-by: Joe Hamman <joe@earthmover.io> * Docstrings for arraymodule (#2276) * start to docstrings for arraymodule * incorporating toms edits, overriding mypy error... * fix attrs * Update src/zarr/core/array.py Co-authored-by: Sanket Verma <svsanketverma5@gmail.com> * fix store -> storage * remove properties from asyncarray docstring --------- Co-authored-by: Sanket Verma <svsanketverma5@gmail.com> Co-authored-by: Joe Hamman <joe@earthmover.io> * fix/normalize storage paths (#2384) * bring in path normalization function from v2, and add a failing test * rephrase comment * simplify storepath creation * Update tests/v3/test_api.py Co-authored-by: Joe Hamman <joe@earthmover.io> * refactor: remove redundant zarr format fixture * replace assertion with an informative error message * fix incorrect path concatenation in make_store_path, and refactor store_path tests * remove upath import because we don't need it * apply suggestions from code review --------- Co-authored-by: Joe Hamman <joe@earthmover.io> * Enforce ruff/flake8-pyi rule PYI013 (#2389) PYI013 Non-empty class body must not contain `...` Note that documentation is enough to fill the class body. * deps: remove fasteners from list of dependencies (#2386) * Enforce ruff/flake8-annotations rule ANN003 (#2388) ANN003 Missing type annotation Co-authored-by: Joe Hamman <joe@earthmover.io> * Enforce ruff/Perflint rules (PERF) (#2372) * Apply ruff/Perflint rule PERF401 PERF401 Use a list comprehension to create a transformed list * Enforce ruff/Perflint rules (PERF) * chore: update package maintainers (#2387) * chore: update package maintainers * Update pyproject.toml Co-authored-by: David Stansby <dstansby@gmail.com> --------- Co-authored-by: David Stansby <dstansby@gmail.com> * Fixed consolidated Group getitem with multi-part key (#2363) * Fixed consolidated Group getitem with multi-part key This fixes `Group.__getitem__` when indexing with a key like 'subgroup/array'. The basic idea is to rewrite the indexing operation as `group['subgroup']['array']` by splitting the key and doing each operation independently. Closes #2358 --------- Co-authored-by: Joe Hamman <joe@earthmover.io> * chore: add python 3.13 to ci / pyproject.toml (#2385) * chore: add python 3.13 to ci / pyproject.toml * update hatch matrix * remove references to dead test dir in pyproject.toml * remove v3 reference in test --------- Co-authored-by: Joe Hamman <joe@earthmover.io> Co-authored-by: Dimitri Papadopoulos Orfanos <3234522+DimitriPapadopoulos@users.noreply.github.com> Co-authored-by: Emma Marshall <55526386+e-marshall@users.noreply.github.com> Co-authored-by: Sanket Verma <svsanketverma5@gmail.com> Co-authored-by: David Stansby <dstansby@gmail.com> Co-authored-by: Tom Augspurger <tom.w.augspurger@gmail.com>

TomAugspurger commented Oct 14, 2024

View reviewed changes

TomAugspurger added 2 commits October 14, 2024 10:46

test for getitem

05cf652

fixed members

488d57f

TomAugspurger marked this pull request as ready for review October 14, 2024 16:16

jhamman approved these changes Oct 14, 2024

View reviewed changes

jhamman changed the base branch from v3 to main October 14, 2024 20:49

Merge remote-tracking branch 'upstream/v3' into tom/fix/cm-nested-get…

cbfc197

…item

Merge branch 'main' into tom/fix/cm-nested-getitem

b08917c

d-v-b reviewed Oct 16, 2024

View reviewed changes

src/zarr/core/group.py Outdated Show resolved Hide resolved

d-v-b reviewed Oct 16, 2024

View reviewed changes

TomAugspurger added 6 commits October 16, 2024 09:32

remove assert

a0c0ef6

Merge remott e-tracking branch 'refs/remotes/origin/tom/fix/cm-nested…

912e582

…-getitem' into tom/fix/cm-nested-getitem

Merge remote-tracking branch 'upstream/main' into tom/fix/cm-nested-g…

1cf5bd4

…etitem

trust the metadata

5533ac7

simplify

259e09e

fixup

446c893

jhamman approved these changes Oct 17, 2024

View reviewed changes

TomAugspurger merged commit 4d663cc into zarr-developers:main Oct 17, 2024
22 checks passed

TomAugspurger deleted the tom/fix/cm-nested-getitem branch October 17, 2024 14:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed consolidated Group getitem with multi-part key #2363

Fixed consolidated Group getitem with multi-part key #2363

TomAugspurger commented Oct 14, 2024

TomAugspurger Oct 14, 2024

TomAugspurger Oct 14, 2024

d-v-b Oct 16, 2024

TomAugspurger Oct 16, 2024

d-v-b Oct 16, 2024

TomAugspurger Oct 16, 2024

d-v-b Oct 16, 2024 •

edited

Loading

jhamman left a comment

TomAugspurger commented Oct 15, 2024

d-v-b Oct 16, 2024

d-v-b Oct 16, 2024

TomAugspurger Oct 17, 2024

jhamman left a comment

TomAugspurger commented Oct 17, 2024

		@@ -97,6 +97,24 @@ def _parse_async_node(
		raise TypeError(f"Unknown node type, got {type(node)}")


		class _MixedConsolidatedMetadataException(Exception):

Fixed consolidated Group getitem with multi-part key #2363

Fixed consolidated Group getitem with multi-part key #2363

Conversation

TomAugspurger commented Oct 14, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

d-v-b Oct 16, 2024 • edited Loading

Choose a reason for hiding this comment

jhamman left a comment

Choose a reason for hiding this comment

TomAugspurger commented Oct 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhamman left a comment

Choose a reason for hiding this comment

TomAugspurger commented Oct 17, 2024

d-v-b Oct 16, 2024 •

edited

Loading