Icechunk stores design doc #1
There is also https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGHHE_07/summary which I prefer as it has more information and links to the official documentation (apologies if you knew this already). Specifically, the linked technical documentation describes the data variables and that the introduction of the `Intermediate` group was to "minimize misinterpretation of variable names and reflect changes in the algorithm".
FYI my approach to this is to try and use lithops to parallelize the `open_virtual_dataset` call across serverless workers, then do the reduction on the client (because the vds objects themselves should be small). See zarr-developers/VirtualiZarr#349, and I also have a notebook using this that I need to publish.
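The map-on-workers, reduce-on-client pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked issue: the standard-library `ThreadPoolExecutor` stands in for lithops, and `open_reference`, `combine`, and `map_then_reduce` are hypothetical helpers, with `open_reference` standing in for `open_virtual_dataset`.

```python
from concurrent.futures import ThreadPoolExecutor


def open_reference(path):
    """Hypothetical stand-in for open_virtual_dataset.

    Returns a small metadata record and reads no chunk data.
    """
    return {"path": path, "n_refs": 1}


def combine(records):
    """Client-side reduction: merge the small per-file records into one."""
    return {
        "paths": [r["path"] for r in records],
        "n_refs": sum(r["n_refs"] for r in records),
    }


def map_then_reduce(paths, max_workers=4):
    # Map step: one task per file, run on the worker pool
    # (assumption: a thread pool in place of serverless workers).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        records = list(pool.map(open_reference, paths))  # keeps input order
    # Reduce step: runs locally because each record is small.
    return combine(records)


combined = map_then_reduce([f"file_{i}.nc" for i in range(3)])
```

The reduction stays on the client because only the small per-file records travel back from the workers, which is the same reasoning given above for the vds objects.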
🙌🏽
This is effectively a summary of where IC needs work to be able to support this use case better. The relevant IC issues are:
- `list_dir` relies on `list_prefix`, which traverses all chunks (icechunk#321)

I'm not going to work further on this until we have made progress on those issues.
Alternatively, we could move forward with demos using smaller-scale virtual datasets.
I would love to understand this better, specifically how the mapping from chunk indices to byte ranges and file names is stored in Icechunk and read by Zarr. Are all chunk references stored together? Is it possible to load just the chunk references that are required for a specific query?
I see these questions are somewhat answered below. My understanding now is that all chunk references are stored together, and that option 3 (manifest sharding) would be one solution that enables loading only selected chunk references.
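To make the manifest-sharding idea concrete, here is a minimal sketch of how splitting references into shards could let a query touch only the shards it needs. It is not how Icechunk implements manifests: the shard layout, `SHARD_SIZE`, `build_shards`, `lookup`, and the `s3://bucket/...` paths are all hypothetical.

```python
from collections import defaultdict

SHARD_SIZE = 2  # hypothetical: chunks per shard along the first axis


def build_shards(refs, shard_size=SHARD_SIZE):
    """Group chunk references by shard id = first chunk index // shard_size."""
    shards = defaultdict(dict)
    for index, ref in refs.items():
        shards[index[0] // shard_size][index] = ref
    return dict(shards)


def lookup(shards, wanted, shard_size=SHARD_SIZE):
    """Return refs for the wanted chunk indices, reading only needed shards."""
    needed = {index[0] // shard_size for index in wanted}
    loaded = {}
    for shard_id in sorted(needed):
        # In a real store this would be one object read per shard.
        loaded.update(shards[shard_id])
    return {index: loaded[index] for index in wanted}, needed


# Each reference maps a chunk index to (file name, byte offset, byte length).
refs = {(t, 0): (f"s3://bucket/file_{t}.nc", 1024 * t, 1024) for t in range(6)}
shards = build_shards(refs)
selected, shards_read = lookup(shards, [(4, 0), (5, 0)])
```

With all references stored together, the same query would have to load every entry; here it reads one shard out of three.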
I'm no Icechunk expert but this seems like the best option - let's discuss further.