This ticket is meant to track our ongoing investigation of the utility of icechunk at GES DISC. We have had a number of conversations about zarr virtualization with GES DISC folks (most notably Christine and Brianna, but also Lucas Sterzinger and Maha Hegde).
Objective
We have 2 objectives with this work:
Demonstrate icechunk's utility for NASA: We aim to showcase Icechunk's utility across NASA by identifying a high-impact use case within GES DISC, specifically focusing on datasets that could benefit from virtualization and accessibility as Zarr/ARCO formats. These datasets may serve Giovanni users or other applications, particularly where maintaining accessibility has been challenging.
Determine the challenges and limitations encountered in (1): Are there ways in which icechunk is still not useful or usable for NASA? What can we do to address those challenges?
Meetings
We met with Christine, Brianna and Hegde on Wednesday, October 23 and plan to meet with them every 3 weeks through the rest of this year (which is only 2 more meetings 🙀 )
Meeting Notes
Christine/GES DISC:
Uses Zarr stores but hesitates to make them public due to update/appending issues.
Interested in Icechunk but notes GES DISC currently uses a single-writer, multi-reader model with long chunking in time.
The main concern seems to be storage growth: right now icechunk keeps whole copies of chunks rather than chunk diffs, so regularly appending to the same chunk could generate a huge number of chunk copies.
It was noted that garbage collection is on icechunk's roadmap.
Highlights cases like GPM IMERG, popular analysis workflows (e.g., time-averaged maps, area-averaged time series), and challenges with non-time-dimensional data (e.g., AIRS3STD).
Sounded like they are also planning a lakeFS implementation, but not until Q2 next year (not sure if this is fiscal or calendar year Q2).
Sean:
Notes Icechunk’s current handling of chunk updates (whole chunk copies rather than diffs).
Suggests data batching based on acceptable data latency.
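To make the storage-growth concern and the batching suggestion concrete, here is a rough back-of-the-envelope sketch. It is not icechunk code and uses no icechunk API. It assumes a deliberately simple model: every commit that appends to a time chunk stores a new full-size copy of that chunk, superseded copies are kept until garbage collection exists, and commits align with chunk boundaries. All function names and numbers are hypothetical.

```python
import math


def chunk_copies(steps_per_chunk: int, steps_per_commit: int) -> int:
    """Number of copies of one time chunk written while it fills up.

    Assumption: each commit touching the chunk rewrites the whole chunk.
    """
    if steps_per_chunk < 1 or steps_per_commit < 1:
        raise ValueError("step counts must be positive")
    return math.ceil(steps_per_chunk / steps_per_commit)


def stored_bytes(chunk_bytes: int, steps_per_chunk: int, steps_per_commit: int) -> int:
    """Upper-bound bytes retained for one chunk if no old copy is ever removed."""
    return chunk_copies(steps_per_chunk, steps_per_commit) * chunk_bytes


def max_batch_for_latency(step_interval_minutes: int, acceptable_latency_minutes: int) -> int:
    """Largest number of time steps per commit that stays within the latency budget."""
    if step_interval_minutes < 1:
        raise ValueError("step interval must be positive")
    return max(1, acceptable_latency_minutes // step_interval_minutes)


# Hypothetical numbers: a 1 MiB chunk holding 48 time steps of 30 minutes each.
CHUNK_BYTES = 1024 * 1024
per_step = stored_bytes(CHUNK_BYTES, steps_per_chunk=48, steps_per_commit=1)
batch = max_batch_for_latency(step_interval_minutes=30, acceptable_latency_minutes=360)
batched = stored_bytes(CHUNK_BYTES, steps_per_chunk=48, steps_per_commit=batch)
```

Under these assumptions, committing every time step retains 48 copies of the chunk, while accepting a 6-hour latency (12 steps per commit) retains 4. Real numbers will depend on compression and on how icechunk actually stores partially filled chunks.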
Potential Use Cases
GPM IMERG (Near Real-Time): Update frequency poses a challenge; data must be available near real-time.
MERRA and Hydrology Data Rods: Popular data with challenging metadata for OPeNDAP or THREDDS emulation.
AIRS3STD (HDF4): Example of non-temporal data needing time dimension insertion.
Data Aggregation Needs: Use cases include OPeNDAP and metadata challenges for Giovanni.
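As an illustration of what "time dimension insertion" means for a product like AIRS3STD, here is a minimal sketch in plain Python. It assumes each granule is a 2D (lat x lon) grid with no time axis and that the granule date is known from elsewhere, for example the file name. The function name and the toy grids are hypothetical; a real workflow would use an array library and the actual granule metadata.

```python
from datetime import date


def stack_along_time(granules):
    """Combine date-keyed 2D grids into a (time, lat, lon) cube.

    `granules` maps a datetime.date to a list of rows (a 2D grid).
    Returns (times, cube), where times is the new, sorted time coordinate
    and cube[i] is the grid for times[i].
    """
    if not granules:
        raise ValueError("no granules given")
    # All granules must share one grid shape before they can be stacked.
    shapes = {(len(grid), len(grid[0])) for grid in granules.values()}
    if len(shapes) != 1:
        raise ValueError(f"inconsistent grid shapes: {sorted(shapes)}")
    times = sorted(granules)
    cube = [granules[t] for t in times]
    return times, cube


# Toy 2x3 grids standing in for daily granules, given out of order.
daily = {
    date(2024, 10, 24): [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    date(2024, 10, 23): [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
}
times, cube = stack_along_time(daily)
```

The point of the sketch is that the time coordinate does not exist inside the files and has to be supplied and ordered by whatever builds the aggregated store.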
Action Items
MERRA Product Details: Identify available products under TDS (contact: maha.hegde@nasa.gov).
FYI @sharkinsspatial @maxrjones @hrodmn