-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manifest arrays use arrayv3metadata #429
base: zarr-python-3.0
Are you sure you want to change the base?
Manifest arrays use arrayv3metadata #429
Conversation
…not happy about this)
I'm going to continue to review this tomorrow but the tests are passing and I've done an initial reorganization of the code that was in zarr.py. So if any of @TomNicholas @norlandrhagen @ayushnag @sharkinsspatial @jsignell @mpiannucci want to start to review please go ahead 🙏🏽 I also changed the base to a new branch of main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is absolutely great @abarciauskas-bgse ! Comments are really just minor.
@pytest.fixture | ||
def array_v3_metadata(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reimplement this fixture to internally just call the array_v3_metadata_dict
fixture below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this in 0712979 - I do like just having one method to create array v3 metadata for tests however it does mean tests have to be a bit more verbose as every codecs argument must include an ArrayBytesCodec
(which is always {"name": "bytes", "configuration": {"endian": "little"}}
).
But I'll think on ways to streamline this more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have a look at the signature of this function, which has a lot of sane defaults, and which works for v2 and v3 metadata: https://github.com/zarr-developers/zarr-python/blob/99621ecf0b81400e323828111363fe21cf0c7592/src/zarr/core/array.py#L4008-L4030. I think we could consider adding an ArrayV3Metadata.build
method that has a signature like this, which should make creating metadata documents a lot easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revisited this upon also building a utils method for create_v3_array_metadata
that readers can use. Now the fixture just calls that function and I removed the array_v3_metadata_dict altogether (and call to_dict in tests where I want to test the construction from dict functionality).
virtualizarr/codecs.py
Outdated
def extract_codecs( | ||
codecs: CodecPipeline, | ||
) -> tuple[ | ||
tuple[ArrayArrayCodec, ...], ArrayBytesCodec | None, tuple[BytesBytesCodec, ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty complicated type that I had to stare at to work out what it is. Use TypeAlias
with an informative name?
Also is it definitely the right type? Seems weird that this would be valid: ((,), None, (,))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is correct, I have added a type alias, but essentially we just need a way of extracting out the various codec types so we can insert an ArrayBytesCodec (when there is none, which is always in this codebase) in-between ArrayArrayCodecs and BytesBytesCodecs
codecs = zarray._v3_codecs() | ||
|
||
# create array if it doesn't already exist | ||
# TODO: Should codecs be an argument to zarr's AsyncrGroup.create_array? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're asking for an upstream change here right?
@@ -67,13 +90,65 @@ def remove_file_uri_prefix(path: str): | |||
return path | |||
|
|||
|
|||
def convert_v3_to_v2_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this (a) not live in the kerchunk-specific writer module, (b) actually live in zarr-python upstream? Or is there no use for it upstream?
chunk_grid=RegularChunkGrid(chunk_shape=(2920, 25, 53)), | ||
chunk_key_encoding=DefaultChunkKeyEncoding(name='default', | ||
separator='/'), | ||
fill_value=np.float64(-327.67), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing the full representation I looked into why the fill value was np.float64(-327.67)
as that fill value was not in the xarray encoding but was the fill value of the h5py.Dataset
, is it worth digging into why that float value wasn't in the xarray encoding? @sharkinsspatial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussing this in a separate thread, will follow up once resolved...
Co-authored-by: Tom Nicholas <tom@cworthy.org>
for more information, see https://pre-commit.ci
Co-authored-by: Tom Nicholas <tom@cworthy.org>
for more information, see https://pre-commit.ci
This is still very much a WIP - many tests and implementations still need to be fixed.
A few notes:
_parse_chunk_encoding_v3
function since it is a private function and may change, which is why some of that logic is replicated inconvert_to_codec_pipeline
Checklist
docs/releases.rst
api.rst