
Zarr v3 compatibility? #282

Open
TomNicholas opened this issue Feb 14, 2025 · 13 comments
Labels
enhancement New feature or request

Comments

@TomNicholas

Zarr-python v3 was recently fully released (see blog post), so no longer in beta.

Is tifffile planning to support zarr v3? It would enable a lot of cool things, such as creating a virtualizarr reader for any tiff file so they can be version-controlled with icechunk.

On that topic: can tifffile be used to generate byte ranges and offsets for chunks in tiff files directly? That's the information we would need in order to write a virtualizarr reader. I see you can do it by creating a ZarrTiffStore then using .write_fsspec() - is that the most direct way?

cc @maxrjones

@cgohlke cgohlke added the enhancement New feature or request label Feb 14, 2025
@cgohlke
Owner

cgohlke commented Feb 14, 2025

I am aware of zarr 3 and the very cool new things. Unfortunately, zarr 3 is incompatible with the current tifffile on several levels: stores, codecs, dependencies. Tifffile does not depend directly on zarr (except for one feature in the highest-level interface) and does not use asyncio. I would like to keep the core of tifffile that way. It would probably be better to re-implement Zarr3TiffStore and Zarr3FileSequenceStore in a separate module or package. However, it took several months to implement the zarr 2 interfaces. A related feature missing from tifffile is an xarray-like interface for array metadata (coords, attrs). Then, there are dozens of imagecodecs codecs that need to be wrapped for the zarr 3 codec interface. Frankly, I won't be able to commit to those without funding.

Re "generate byte ranges": yes, that is currently implemented in the zarr 2 store and the write_fsspec function. Is there maybe a simpler interface than a fully implemented zarr3 store for virtualizarr to consume byte ranges and metadata? I quite like the idea of accessing various file formats through xarray syntax via "reference files" and am looking for easier ways to generate them also for other microscopy files (e.g. Leica LIF, Zeiss CZI, or PicoQuant PTU).

@TomNicholas
Author

Frankly, I won't be able to commit to those without funding.

That's unfortunate, but entirely reasonable.

It would probably be better to re-implement Zarr3TiffStore and Zarr3FileSequenceStore in a separate module or package

This could be done.

Is there maybe a simpler interface than a fully implemented zarr3 store for virtualizarr to consume byte ranges and metadata?

Yes, a fully-implemented store is overkill for virtualizarr. Ultimately all virtualizarr needs is some function that takes a tiff file path and returns the byte range information for each chunk + the same array metadata that a zarr store would need. We basically refer to that function as a "virtualizarr reader".

I did try writing a virtualizarr reader that uses tifffile directly (see zarr-developers/VirtualiZarr#291 (comment)), but the restriction is that virtualizarr will soon depend explicitly on zarr-python>=3.0.0 for various reasons. (It's annoying that Python doesn't allow different packages in the same environment to depend on different versions of the same library.)

I quite like the idea of accessing various file formats through xarray syntax via "reference files" and am looking for easier ways to generate them also for other microscopy files (e.g. Leica LIF, Zeiss CZI, or PicoQuant PTU).

It's a very powerful idea, and a virtualizarr reader unlocks the ability to open all that data in xarray efficiently. As this works even when the files are in object storage you can also think of it as an alternative way to "cloud-optimize" tiff files, but without altering or duplicating the original files.

@cgohlke
Owner

cgohlke commented Feb 15, 2025

Yes, a fully-implemented store is overkill for virtualizarr. Ultimately all virtualizarr needs is some function that takes a tiff file path and returns the byte range information for each chunk + the same array metadata that a zarr store would need. We basically refer to that function as a "virtualizarr reader".

I am interested. Is there an even easier interface, though? I would like to avoid any direct usage of xarray, zarr, kerchunk, json, and string type chunk keys. Suppose my various file readers provide image objects with a bunch of numpy/xarray/zarr-like properties. Would it be possible to initialize an xarray.Dataset using virtualizarr by duck typing or by passing those fundamental properties as arguments? Something like:

class Image:

    name: str
    dtype: numpy.dtype[Any]
    shape: tuple[int, ...]
    dims: tuple[str, ...]
    coords: dict[str, NDArray[Any]]
    attrs: dict[str, Any]
    compressor: dict[str, Any]
    chunks: tuple[int, ...]
    levels: tuple[Image, ...]

    def chunk_manifest(self) -> Iterator[tuple[tuple[int, ...], str, int, int]]:
        """Return iterator over chunk (index, path, offset, and length)."""

with Image() as im:
    ds = virtualizarr.open_duck(im)
    # or
    ds = virtualizarr.open(name=im.name, dtype=im.dtype, ..., chunk_manifest=im.chunk_manifest())

I just made up chunk_manifest to return an iterator over tuples (instead of one dict) because some WSI and LZW compressed OME-TIFFs may have millions of tiny chunks.
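For illustration only, here is a minimal sketch of what such a chunk_manifest generator could yield, assuming the simplest possible layout: a C-contiguous, uncompressed array stored as a regular grid of equal-sized chunks. (Real TIFF chunks would instead come from the tile/strip offset and byte-count tags, and compressed chunks would have varying lengths.)

```python
from itertools import product


def chunk_manifest(path, shape, chunks, itemsize, data_offset=0):
    """Toy manifest for a C-contiguous, uncompressed, regularly chunked
    array: yield (index, path, offset, length) for each chunk."""
    # number of chunks along each dimension (ceiling division)
    grid = [-(-s // c) for s, c in zip(shape, chunks)]
    chunk_nbytes = itemsize
    for c in chunks:
        chunk_nbytes *= c
    # enumerate chunk indices in row-major order, assigning consecutive
    # byte ranges starting at data_offset
    for n, index in enumerate(product(*(range(g) for g in grid))):
        yield index, path, data_offset + n * chunk_nbytes, chunk_nbytes
```

The iterator form keeps memory flat even for files with millions of chunks, which is the motivation stated above.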

@TomNicholas
Author

I would like to avoid any direct usage of xarray, zarr, kerchunk, json,

Yep that can work.

and string type chunk keys.

Not sure exactly what you mean by that.

I just made up chunk_manifest

That is literally all I would need! Including the properties of your Image class you listed.

Would it be possible to initialize an xarray.Dataset using virtualizarr by duck typing or by passing those fundamental properties as arguments?

The current pattern would look more like this:

from virtualizarr.readers.common import VirtualBackend
from virtualizarr import ChunkManifest, ManifestArray
from zarr.metadata import ArrayV3Metadata
from xarray import Dataset


class TiffVirtualBackend(VirtualBackend):
    @staticmethod
    def open_virtual_dataset(
        filepath: str,
    ) -> Dataset:
        """
        Take a path to a tiff file and return a virtual dataset containing chunk manifest information for every array in the file.
        """
        from tifffile import Image

        img = Image.from_file(filepath)

        # do this for each array in the file
        metadata: ArrayV3Metadata = translate_to_zarr_v3_metadata(img)
        manifest: ChunkManifest = translate_chunk_manifest(img.chunk_manifest())
        ma: ManifestArray = ManifestArray(manifest, metadata)

        # from now on it's the same as any other existing virtualizarr reader
        # i.e. the ManifestArrays just get packaged up and returned
        ...
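A hypothetical translate_chunk_manifest could be as simple as joining the index tuples into zarr-style string keys; this sketch assumes a ChunkManifest can be built from a mapping of string key to path/offset/length entries, which is an assumption about virtualizarr's internals:

```python
def translate_chunk_manifest(entries):
    """Hypothetical helper: turn an iterator of
    (index, path, offset, length) tuples into a dict keyed by
    zarr-style strings like "0.0.0", roughly the shape a
    virtualizarr ChunkManifest is constructed from."""
    return {
        ".".join(map(str, index)): {"path": path, "offset": offset, "length": length}
        for index, path, offset, length in entries
    }
```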

then the data engineer wanting to "virtualize" the Tiffs does this:

from virtualizarr import open_virtual_dataset

# TiffVirtualBackend could be defined in any library we like
from some_library import TiffVirtualBackend

# at this stage one could open and concatenate the chunk manifests of many tiff files to make one bigger zarr store pointing at many tiff files at once
vds = open_virtual_dataset('path/to/file.tiff', reader=TiffVirtualBackend)

# write out all the "virtual references" to a version-controlled Icechunk Store
vds.virtualize.to_icechunk(icechunkstore)

then the user wanting to access the zarr-ified tiff data via xarray (who may or may not be the same person as the data engineer) does this:

# open and (lazily) load tiff chunks with the full async power of zarr-python v3
ds = xr.open_zarr(icechunkstore)

# normal xarray analysis happens
ds.foo.plot()

We're also looking at shortcutting from virtual dataset to normal in-memory dataset without going via icechunk first, see zarr-developers/VirtualiZarr#427.

some WSI and LZW compressed OME-TIFFs may have millions of tiny chunks.

That problem of what to do if there are millions and millions of chunks is a bit separate, and is discussed here.

@TomNicholas
Author

But that doesn't solve the problem of virtualizarr requiring zarr-python>=3.0.0 (soon) and tifffile requiring zarr-python<3.0.0.

@cgohlke
Owner

cgohlke commented Feb 15, 2025

Technically tifffile doesn't use or import zarr (except that one-off use in the zarr_selection function). It's just that the Zarr stores can no longer be opened with Zarr 3. The other issue I mentioned is that there are no Zarr 3 format compatible imagecodecs codecs yet.

@cgohlke
Owner

cgohlke commented Feb 15, 2025

Thanks for all the feedback.

So translate_to_zarr_v3_metadata and translate_chunk_manifest are functions that need to be implemented and would possibly work with several of my file readers. I will start experimenting with my newer libraries, ptufile and liffile, because the required properties are already implemented, except compressor and the chunk_manifest function.

and string type chunk keys.

Not sure exactly what you mean by that.

Zarr keys are strings "0.0.0" not sequences of int (0, 0, 0). Do you happen to know the reason for that?

@TomNicholas
Author

Technically tifffile doesn't use or import zarr

That's promising!

The other issue I mentioned is that there are no Zarr 3 format compatible imagecodecs codecs yet.

Are you sure? We are using some kind of imagecodecs import with zarr 3 successfully in virtualizarr (see import here for example) - cc @sharkinsspatial

So translate_to_zarr_v3_metadata and translate_chunk_manifest are functions that need to be implemented and would possibly work with several of my file readers.

The idea of these functions is that they translate your representation into the types that virtualizarr needs. If you simply implemented the Image class and the .chunk_manifest method that would be enough for me to import and use in virtualizarr, without you having any external dependencies. I just need your representation to have enough info about the tiff file that I can easily see how it maps to the zarr model.

Zarr keys are strings "0.0.0" not sequences of int (0, 0, 0). Do you happen to know the reason for that?

I'm not sure, but I presume it's just because every language can easily support some kind of key-value store with string keys. But either of these types is fine for virtualizarr, we can just convert on our end.

@cgohlke
Owner

cgohlke commented Feb 15, 2025

We are using some kind of imagecodecs import with zarr 3 successfully in virtualizarr

Are you using zarr 3 (the library) with the zarr version 2 file format? It is my understanding that in that case the numcodecs compatible codecs in imagecodecs would still work (except for the issues discussed at cgohlke/imagecodecs#123).

I just need your representation to have enough info about the tiff file that I can easily see how it maps to the zarr model.

Rather than providing an arbitrary mapping that would work for you, I would like to provide an interface (properties and functions) that is established in the numpy/xarray/dask/zarr ecosystem and also makes sense on its own. For name, dtype, shape, dims, coords, attrs, chunks that is the case. I have to look into compressor and levels. My main concern is chunk_manifest. The type used in the Zarr 3 format would be dict[str, dict[str, Any]], while kerchunk uses dict[str, tuple[str, int, int]], and I proposed Iterator[tuple[tuple[int, ...], str, int, int]]. Seems confusing. How are other Python file reader libraries exposing that information?

I presume it's just because every language can easily support some kind of key-value store with string keys

That makes sense.

@TomNicholas
Author

Are you using zarr 3 (the library) with the zarr version 2 file format?

We're not really using the zarr version 2 file format at all. We're reading non-zarr data using zarr-python v3, which seems to be working fine.

Rather than providing an arbitrary mapping that would work for you, I would like to provide an interface (properties and functions) that is established in the numpy/xarray/dask/zarr ecosystem and also makes sense on its own.

That's fair, but the best way to do this would be to directly support zarr v3. (Also numpy/xarray/dask are ignorant of many of these, by design).

For name, dtype, shape, dims, coords, attrs, chunks that is the case. I have to look into compressor and levels.

The standard IO interface for these is now zarr-python's ArrayMetadata class (and the ArrayV3Metadata concrete implementation).

The type used in the Zarr 3 format would be dict[str, dict[str, Any]]

Where are you getting that from? Zarr doesn't have a chunk manifest abstraction in it really - that's why I had to make VirtualiZarr separately.

Seems confusing. How are other Python file reader libraries exposing that information?

Not in any consistent way, which is why VirtualiZarr created abstractions over the various ways they expose that information. The closest thing to a standardized way to represent a chunk manifest is virtualizarr's ChunkManifest class.

@cgohlke
Owner

cgohlke commented Feb 15, 2025

Where are you getting that from?

The proposal at zarr-developers/zarr-specs#287

the best way to do this would be to directly support zarr v3

You are right. But I got somewhat disappointed with zarr, having spent much time and effort providing zarr 2 stores and codecs that are now obsolete after just a few years. Performance also turned out to be disappointing in many cases. I am hesitant to directly support zarr 3.

@TomNicholas
Author

Where are you getting that from?

The proposal at zarr-developers/zarr-specs#287

That's (a) talking about on-disk, not in-memory representation, and (b) arguably outdated now that we have icechunk.

But, I got somewhat disappointed with zarr, having spent much time and effort providing zarr 2 stores and codecs that are now obsolete after just a few years. Performance also turned out to be disappointing in many cases. I am hesitant directly supporting zarr 3.

That is a shame. From my perspective I have been waiting years for zarr v3 to be fully released and I'm very pleased that it now is out!

Regardless, it sounds like we still have a useful and concrete plan here:

  • You add something like the Image class & .chunk_manifest methods to tifffile
    • Though ideally they would return virtualizarr.ChunkManifest and zarr.ArrayV3Metadata objects, they don't need to, which means you don't need to change tifffile to support zarr-python v3 directly
  • I wrap whatever you return inside a tiff reader for virtualizarr
  • The result is then that we can virtualize tiff files using virtualizarr into icechunk, with zarr-python v3 installed, all good!
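The duck-typed interface in the first step could be pinned down as a typing.Protocol, so tifffile never has to import virtualizarr or zarr at all. The names below mirror the Image sketch from earlier in the thread and are illustrative only, not an agreed API:

```python
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class VirtualImage(Protocol):
    """Illustrative duck-typed interface for a virtualizable image;
    any object with these members would satisfy it structurally."""

    name: str
    shape: tuple[int, ...]
    dims: tuple[str, ...]
    chunks: tuple[int, ...]
    attrs: dict[str, Any]

    def chunk_manifest(self) -> Iterator[tuple[tuple[int, ...], str, int, int]]:
        """Yield (index, path, offset, length) for each chunk."""
        ...
```

virtualizarr could then type its reader against VirtualImage while tifffile (or ptufile, liffile) implements it without any shared dependency.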

@cgohlke
Owner

cgohlke commented Feb 15, 2025

The most immediate solution is to parse the output of the existing ZarrTiffStore.write_fsspec function. That should contain all the information in a structured way and will work with zarr 3 installed.
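Parsing that output needs only the standard library, since write_fsspec emits a kerchunk-style JSON reference file. A sketch, assuming the version 1 reference layout where chunk entries are [url, offset, length] triples under a top-level "refs" mapping:

```python
import json


def iter_chunk_refs(refs_path):
    """Yield (key, path, offset, length) for each chunk entry in a
    kerchunk-style reference file."""
    with open(refs_path) as f:
        refs = json.load(f)["refs"]
    for key, value in refs.items():
        # metadata entries (".zarray", ".zattrs", ...) hold inline JSON
        # strings; chunk entries hold [url, offset, length] triples
        if isinstance(value, list) and len(value) == 3:
            path, offset, length = value
            yield key, path, offset, length
```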

I will experiment with chunk_manifest() and attributes in my other libraries first. Maybe the idea turns out impractical.
