Zarr v3 compatibility? #282
I am aware of zarr 3 and the very cool new things. Unfortunately, zarr 3 is incompatible with current tifffile on several levels: stores, codecs, dependencies. Tifffile does not depend directly on zarr (except one feature in the highest-level interface) and does not use asyncio. I would like to keep the core of tifffile that way. It would probably be better to re-implement …

Re "generate byte ranges": yes, that is currently implemented in the zarr 2 store and the …
That's unfortunate, but entirely reasonable.
This could be done.
Yes, a fully-implemented store is overkill for virtualizarr. Ultimately all virtualizarr needs is some function that takes a tiff file path and returns the byte range information for each chunk, plus the same array metadata that a zarr store would need. We basically refer to that function as a "virtualizarr reader". I did try writing a virtualizarr reader that uses …

It's a very powerful idea, and a virtualizarr reader unlocks the ability to open all that data in xarray efficiently. As this works even when the files are in object storage, you can also think of it as an alternative way to "cloud-optimize" tiff files, but without altering or duplicating the original files.
I am interested. Is there an even easier interface, though? I would like to avoid any direct usage of xarray, zarr, kerchunk, JSON, and string-type chunk keys. Suppose my various file readers provide image objects with a bunch of numpy/xarray/zarr-like properties. Would it be possible to initialize an xarray.Dataset using virtualizarr by duck typing, or by passing those fundamental properties as arguments? Something like:

```python
class Image:
    name: str
    dtype: numpy.dtype[Any]
    shape: tuple[int, ...]
    dims: tuple[str, ...]
    coords: dict[str, NDArray[Any]]
    attrs: dict[str, Any]
    compressor: dict[str, Any]
    chunks: tuple[int, ...]
    levels: tuple[Image, ...]

    def chunk_manifest(self) -> Iterator[tuple[tuple[int, ...], str, int, int]]:
        """Return iterator over chunk (index, path, offset, and length)."""


with Image() as im:
    ds = virtualizarr.open_duck(im)
    # or
    ds = virtualizarr.open(name=im.name, dtype=im.dtype, ..., chunk_manifest=im.chunk_manifest())
```

I just made up …
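To make the proposed `chunk_manifest` contract concrete, here is a minimal, self-contained sketch of what such an iterator could yield. The `Image` class, the file path, and all byte offsets/lengths below are fabricated for illustration; this is not tifffile's or virtualizarr's actual API:

```python
from typing import Iterator


class Image:
    """Hypothetical duck-typed image following the proposal above (all values made up)."""

    def __init__(self) -> None:
        self.name = "image0"
        self.shape = (2, 512, 512)        # two planes of 512x512 pixels
        self.chunks = (1, 512, 512)       # one chunk per plane
        self._offsets = [8, 262152]       # fabricated byte offsets of each chunk
        self._lengths = [262144, 262144]  # fabricated byte lengths of each chunk

    def chunk_manifest(self) -> Iterator[tuple[tuple[int, ...], str, int, int]]:
        """Yield (chunk index, file path, byte offset, byte length) for each chunk."""
        for i, (offset, length) in enumerate(zip(self._offsets, self._lengths)):
            yield (i, 0, 0), "path/to/file.tiff", offset, length


manifest = list(Image().chunk_manifest())
# manifest[0] -> ((0, 0, 0), "path/to/file.tiff", 8, 262144)
```

A consumer like virtualizarr could then build its own chunk-manifest structure from this iterator without tifffile ever importing zarr.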
Yep that can work.
Not sure exactly what you mean by that.
That is literally all I would need! Including the properties of your …
The current pattern would look more like this:

```python
from virtualizarr.readers.common import VirtualBackend
from virtualizarr import ChunkManifest, ManifestArray
from zarr.metadata import ArrayV3Metadata


class TiffVirtualBackend(VirtualBackend):
    @staticmethod
    def open_virtual_dataset(
        filepath: str,
    ) -> Dataset:
        """
        Take a path to a tiff file and return a virtual dataset containing
        chunk manifest information for every array in the file.
        """
        from tifffile import Image

        img = Image.from_file(filepath)

        # do this for each array in the file
        metadata: ArrayV3Metadata = translate_to_zarr_v3_metadata(img)
        manifest: ChunkManifest = translate_chunk_manifest(img.chunk_manifest())
        ma: ManifestArray = ManifestArray(manifest, metadata)

        # from now on it's the same as any other existing virtualizarr reader,
        # i.e. the ManifestArrays just get packaged up and returned
        ...
```

Then the data engineer wanting to "virtualize" the tiffs does this:

```python
from virtualizarr import open_virtual_dataset

# this could be defined in any library we like
import TiffVirtualBackend

# at this stage one could open and concatenate the chunk manifests of many
# tiff files to make one bigger zarr store pointing at many tiff files at once
vds = open_virtual_dataset('path/to/file.tiff', reader=TiffVirtualBackend)

# write out all the "virtual references" to a version-controlled Icechunk store
vds.virtualize.to_icechunk(icechunkstore)
```

Then the user wanting to access the zarr-ified tiff data via xarray (who may or may not be the same person as the data engineer) does this:

```python
# open and (lazily) load tiff chunks with the full async power of zarr-python v3
ds = xr.open_zarr(icechunkstore)

# normal xarray analysis happens
ds.foo.plot()
```

We're also looking at shortcutting from virtual dataset to normal in-memory dataset without going via icechunk first; see zarr-developers/VirtualiZarr#427.
That problem of what to do if there are millions and millions of chunks is a bit separate, and is discussed here.
But that doesn't solve the problem of virtualizarr requiring …
Technically tifffile doesn't use or import zarr (except that one-off use in the …
Thanks for all the feedback. So …
Zarr keys are strings …
That's promising!
Are you sure? We are using some kind of imagecodecs import with zarr 3 successfully in virtualizarr (see import here for example) - cc @sharkinsspatial
The idea of these functions is that they translate your representation into the types that virtualizarr needs. If you simply implemented the …
I'm not sure, but I presume it's just because every language can easily support some kind of key-value store with string keys. But either of these types is fine for virtualizarr; we can just convert on our end.
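The conversion mentioned here is mechanical. A small stdlib-only sketch of mapping between zarr v2-style string chunk keys (dot-separated indices) and the tuple indices a library like tifffile might prefer:

```python
def key_to_index(key: str) -> tuple[int, ...]:
    """Convert a zarr v2-style string chunk key like '0.2.1' to a tuple index."""
    return tuple(int(part) for part in key.split("."))


def index_to_key(index: tuple[int, ...]) -> str:
    """Convert a tuple chunk index back to a zarr v2-style string key."""
    return ".".join(str(i) for i in index)


# round-trips losslessly
assert key_to_index("0.2.1") == (0, 2, 1)
assert index_to_key((0, 2, 1)) == "0.2.1"
```

Note that the zarr v3 default chunk key encoding uses a different shape (slash-separated with a `c` prefix, e.g. `c/0/2/1`), but the conversion is equally trivial in either direction.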
Are you using zarr 3 (the library) with the zarr version 2 file format? It is my understanding that in that case the numcodecs compatible codecs in imagecodecs would still work (except for the issues discussed at cgohlke/imagecodecs#123).
Rather than providing an arbitrary mapping that would work for you, I would like to provide an interface (properties and functions) that is established in the numpy/xarray/dask/zarr ecosystem and also makes sense on its own. For …
That makes sense.
We're not really using the zarr version 2 file format at all. We're reading non-zarr data using zarr-python v3, which seems to be working fine.
That's fair, but the best way to do this would be to directly support zarr v3. (Also numpy/xarray/dask are ignorant of many of these, by design).
The standard IO interface for these is now zarr-python's …
Where are you getting that from? Zarr doesn't have a chunk manifest abstraction in it really - that's why I had to make VirtualiZarr separately.
Not in any consistent way, which is why VirtualiZarr created abstractions over the various ways they expose that information. The closest thing to a standardized way to represent a chunk manifest is virtualizarr's …
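Whatever the concrete type, the information being exchanged boils down to a mapping from string chunk keys to byte ranges. A stdlib-only sketch of that shape (the paths, offsets, and lengths here are fabricated, and this is a simplified stand-in, not virtualizarr's actual class):

```python
# A chunk manifest as plain dicts: string chunk keys -> where the bytes live.
manifest = {
    "0.0.0": {"path": "s3://bucket/file.tiff", "offset": 8, "length": 262144},
    "0.0.1": {"path": "s3://bucket/file.tiff", "offset": 262152, "length": 262144},
}

# e.g. total compressed bytes referenced by the manifest
total_bytes = sum(entry["length"] for entry in manifest.values())
```

A reader only has to produce this mapping once per array; concatenating manifests from many files is then just merging dicts with re-indexed keys.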
The proposal at zarr-developers/zarr-specs#287 …
You are right. But I got somewhat disappointed with zarr, having spent much time and effort providing zarr 2 stores and codecs that are now obsolete after just a few years. Performance also turned out to be disappointing in many cases. I am hesitant to directly support zarr 3.
That's (a) talking about on-disk, not in-memory representation, and (b) arguably outdated now that we have icechunk.
That is a shame. From my perspective I have been waiting years for zarr v3 to be fully released and I'm very pleased that it now is out! Regardless, it sounds like we still have a useful and concrete plan here:
…
The most immediate solution is to parse the output of the existing … I will experiment with …
Zarr-python v3 was recently fully released (see blog post), so no longer in beta.
Is tifffile planning to support zarr v3? It would enable a lot of cool things, such as creating a virtualizarr reader for any tiff file so they can be version-controlled with icechunk.
On that topic: can tifffile be used to generate byte ranges and offsets for chunks in tiff files directly? That's the information we would need in order to write a virtualizarr reader. I see you can do it by creating a `ZarrTiffStore` then using `.write_fsspec()` - is that the most direct way?

cc @maxrjones
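For reference, `write_fsspec()` emits kerchunk-style JSON references, where each chunk key maps to a `[path, offset, length]` triple (small values and metadata documents are stored inline as strings). A stdlib-only sketch of extracting byte ranges from such a reference file; the JSON below is a hand-written, simplified example rather than real tifffile output:

```python
import json

# Simplified kerchunk-style references, roughly the shape write_fsspec() produces.
# All offsets/lengths are fabricated for illustration.
refs_json = """
{
  "version": 1,
  "refs": {
    ".zgroup": "{\\"zarr_format\\": 2}",
    "0.0": ["file.tiff", 8, 262144],
    "0.1": ["file.tiff", 262152, 262144]
  }
}
"""

refs = json.loads(refs_json)["refs"]

# Chunk references are lists of [path, offset, length];
# inline metadata documents (plain strings) are skipped.
byte_ranges = {}
for key, value in refs.items():
    if isinstance(value, list) and len(value) == 3:
        path, offset, length = value
        byte_ranges[key] = (path, offset, length)
```

Parsing the references this way gives exactly the per-chunk byte-range table a virtualizarr reader needs, without importing zarr at all.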