Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest arrays use arrayv3metadata #429

Open
wants to merge 68 commits into
base: zarr-python-3.0
Choose a base branch
from

Conversation

abarciauskas-bgse
Copy link
Collaborator

@abarciauskas-bgse abarciauskas-bgse commented Feb 6, 2025

This is still very much a WIP - many tests and implementations still need to be fixed.

A few notes:

  • It was suggested we remove ZArray completely as a part of this work, as opposed to using a conversion function for ZArrays to ArrayV3Metadata. So we should be able to remove ZArray as a part of this pr.
  • It was suggested not to use zarr's _parse_chunk_encoding_v3 function since it is a private function and may change, which is why some of that logic is replicated in convert_to_codec_pipeline

Checklist

  • Closes ManifestArray should use zarr-python's ArrayV3Metadata #424
  • Manifest tests passing
  • Library (codecs, etc) tests passing
  • Reader tests passing
  • test_integration tests passing
  • test_xarray tests passing
  • Writer tests passing
  • Cleanup dead code
  • Consider reorganizing codecs and zarr modules
  • Tests added for new functions
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

virtualizarr/manifests/array.py Outdated Show resolved Hide resolved
virtualizarr/manifests/array.py Outdated Show resolved Hide resolved
virtualizarr/manifests/array_api.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas added zarr-python Relevant to zarr-python upstream internals labels Feb 6, 2025
@abarciauskas-bgse abarciauskas-bgse marked this pull request as ready for review February 12, 2025 23:54
@abarciauskas-bgse
Copy link
Collaborator Author

abarciauskas-bgse commented Feb 12, 2025

I'm going to continue to review this tomorrow but the tests are passing and I've done an initial reorganization of the code that was in zarr.py. So if any of @TomNicholas @norlandrhagen @ayushnag @sharkinsspatial @jsignell @mpiannucci want to start to review please go ahead 🙏🏽

I also changed the base to a new branch of main zarr-python-3.0,

@abarciauskas-bgse abarciauskas-bgse changed the base branch from main to zarr-python-3.0 February 12, 2025 23:58
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is absolutely great @abarciauskas-bgse ! Comments are really just minor.

ci/upstream.yml Show resolved Hide resolved
Comment on lines +218 to +219
@pytest.fixture
def array_v3_metadata():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reimplement this fixture to internally just call the array_v3_metadata_dict fixture below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this in 0712979 - I do like just having one method to create array v3 metadata for tests however it does mean tests have to be a bit more verbose as every codecs argument must include an ArrayBytesCodec (which is always {"name": "bytes", "configuration": {"endian": "little"}}).

But I'll think on ways to streamline this more...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a look at the signature of this function, which has a lot of sane defaults, and which works for v2 and v3 metadata: https://github.com/zarr-developers/zarr-python/blob/99621ecf0b81400e323828111363fe21cf0c7592/src/zarr/core/array.py#L4008-L4030. I think we could consider adding an ArrayV3Metadata.build method that has a signature like this, which should make creating metadata documents a lot easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revisited this upon also building a utils method for create_v3_array_metadata that readers can use. Now the fixture just calls that function and I removed the array_v3_metadata_dict altogether (and call to_dict in tests where I want to test the construction from dict functionality).

docs/usage.md Show resolved Hide resolved
virtualizarr/codecs.py Outdated Show resolved Hide resolved
def extract_codecs(
codecs: CodecPipeline,
) -> tuple[
tuple[ArrayArrayCodec, ...], ArrayBytesCodec | None, tuple[BytesBytesCodec, ...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty complicated type that I had to stare at to work out what it is. Use TypeAlias with an informative name?

Also is it definitely the right type? Seems weird that this would be valid: ((,), None, (,))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is correct, I have added a type alias, but essentially we just need a way of extracting out the various codec types so we can insert an ArrayBytesCodec (when there is none, which is always in this codebase) in-between ArrayArrayCodecs and BytesBytesCodecs

virtualizarr/tests/test_codecs.py Outdated Show resolved Hide resolved
virtualizarr/translators/kerchunk.py Show resolved Hide resolved
codecs = zarray._v3_codecs()

# create array if it doesn't already exist
# TODO: Should codecs be an argument to zarr's AsyncrGroup.create_array?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're asking for an upstream change here right?

@@ -67,13 +90,65 @@ def remove_file_uri_prefix(path: str):
return path


def convert_v3_to_v2_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this (a) not live in the kerchunk-specific writer module, (b) actually live in zarr-python upstream? Or is there no use for it upstream?

chunk_grid=RegularChunkGrid(chunk_shape=(2920, 25, 53)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
fill_value=np.float64(-327.67),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the full representation I looked into why the fill value was np.float64(-327.67) as that fill value was not in the xarray encoding but was the fill value of the h5py.Dataset, is it worth digging into why that float value wasn't in the xarray encoding? @sharkinsspatial

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing this in a separate thread, will follow up once resolved...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internals zarr-python Relevant to zarr-python upstream
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

ManifestArray should use zarr-python's ArrayV3Metadata
5 participants