Merge branch 'main' into ian/docs/updates

earth-mover · Feb 19, 2025 · 6c653f6
2 parents f1d1d18 + cd5dac2
Showing 81 changed files with 5,425 additions and 902 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/design-docs/008-no-copy-serialization-formats.md b/design-docs/008-no-copy-serialization-formats.md
@@ -0,0 +1,127 @@
+# Evaluation of different serialization formats
+
+We want to move away from msgpack serialization for Icechunk metadata files.
+
+## Why
+
+* Msgpack requires an expensive parsing process upfront. If the user only wants
+to pull a few chunk refs from a manifest, they still need to parse the whole manifest.
+* Msgpack deserializes to Rust data structures. This is good for simplicity of code, but
+probably not good for memory consumption (more pointers everywhere).
+* Msgpack gives too many options on how to serialize things. There is no canonical way,
+so it's not easy to predict how `serde` is going to serialize our data structures, and
+the output could even change from version to version.
+* It's hard to explain in the spec what goes into the metadata files. We would need to go
+into the `rmp_serde` implementation, see what it does, and document that in the spec.
+
+## Other options
+
+There is a never-ending menu of options, from a custom binary format to Parquet and everything in between.
+We focused mostly on no-copy formats, because they address some of the issues enumerated above. We
+also prefer formats that have a tight schema and can be documented with
+some form of IDL.
+
+## Performance evaluation
+
+We evaluated the performance of msgpack, flatbuffers and capnproto. The evaluation looks at:
+
+* Manifest file size, for a big manifest with 1M native chunk refs.
+* Speed of writing.
+* Speed of reading.
+
+We wrote an example program in `examples/multithreaded_get_chunk_refs.rs`.
+This program writes a big repo to local file storage. It doesn't actually write the chunks,
+because we are not interested in benchmarking that. It executes purely in Rust, without using the Python interface.
+
+It writes a manifest with 1M native chunk refs, using zstd compression level 3. The writes are done
+from 1M concurrent async tasks.
+
+It then executes 1M chunk ref reads (only the refs are read; the chunks themselves were never written).
+Reads are executed from 4 threads with 250k concurrent async tasks each.
+
+Notes:
+
+* We are comparing on the local file system on purpose, to not account for network times.
+* We are comparing pulling refs only, not chunks, which is a worst case. In the real
+  world, read operations are dominated by the time taken to fetch the chunks.
+* The evaluation was done on an early state of the code, where many parts were unsafe,
+  but we have verified there are no huge differences.
+
+### Results for writes
+
+```sh
+nix run nixpkgs#hyperfine -- \
+  --prepare 'rm -rf /tmp/test-perf' \
+  --warmup 1 \
+  'cargo run --release --example multithreaded_get_chunk_refs -- --write /tmp/test-perf'
+```
+
+#### Flatbuffers
+
+Compressed manifest size:  27_527_680 bytes
+
+```
+Time (mean ± σ):      5.698 s ±  0.163 s    [User: 4.764 s, System: 0.910 s]
+Range (min … max):    5.562 s …  6.103 s    10 runs
+```
+
+#### Capnproto
+
+Compressed manifest size:  26_630_927 bytes
+
+```
+Time (mean ± σ):      6.276 s ±  0.163 s    [User: 5.225 s, System: 1.017 s]
+Range (min … max):    6.126 s …  6.630 s    10 runs
+```
+
+#### Msgpack
+
+Compressed manifest size: 22_250_152 bytes
+
+```
+Time (mean ± σ):      6.224 s ±  0.155 s    [User: 5.488 s, System: 0.712 s]
+Range (min … max):    6.033 s …  6.532 s    10 runs
+```
+
+### Results for reads
+
+```sh
+nix run nixpkgs#hyperfine -- \
+  --warmup 1 \
+  'cargo run --release --example multithreaded_get_chunk_refs -- --read /tmp/test-perf'
+```
+
+#### Flatbuffers
+
+```
+Time (mean ± σ):      3.676 s ±  0.257 s    [User: 7.385 s, System: 1.819 s]
+Range (min … max):    3.171 s …  4.038 s    10 runs
+```
+
+#### Capnproto
+
+```
+Time (mean ± σ):      5.254 s ±  0.234 s    [User: 11.370 s, System: 1.962 s]
+Range (min … max):    4.992 s …  5.799 s    10 runs
+```
+
+#### Msgpack
+
+```
+Time (mean ± σ):      3.310 s ±  0.606 s    [User: 5.975 s, System: 1.762 s]
+Range (min … max):    2.392 s …  4.102 s    10 runs
+```
+
+## Conclusions
+
+* The compressed manifest is about 24% larger with flatbuffers than with msgpack.
+* Flatbuffers is slightly faster for commits (5.7 s vs. 6.2 s mean write time).
+* Flatbuffers is slightly slower for reads (3.7 s vs. 3.3 s mean read time).
+* Timing differences are not significant for real-world scenarios, where performance
+is dominated by the time taken downloading or uploading chunks.
+* Manifest fetch time differences could be somewhat significant for workloads where
+latency to first byte is important. This is not the use case Icechunk optimizes for.
+
+## Decision
+
+We are going to use flatbuffers for our metadata on-disk format.
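
As a cross-check, the relative figures in the conclusions can be recomputed from the manifest sizes and mean timings reported in the results sections. The sketch below is not part of the benchmark program; the dictionary and function names are made up for illustration, and only the numbers are copied from the hyperfine output above.

```python
# Numbers copied from the results sections; all names here are illustrative.
compressed_size_bytes = {
    "flatbuffers": 27_527_680,
    "capnproto": 26_630_927,
    "msgpack": 22_250_152,
}
mean_write_seconds = {"flatbuffers": 5.698, "capnproto": 6.276, "msgpack": 6.224}
mean_read_seconds = {"flatbuffers": 3.676, "capnproto": 5.254, "msgpack": 3.310}


def percent_change(value: float, baseline: float) -> float:
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def compare_to_msgpack(table: dict[str, float]) -> dict[str, float]:
    """Percent change of every other format relative to msgpack."""
    baseline = table["msgpack"]
    return {
        name: percent_change(value, baseline)
        for name, value in table.items()
        if name != "msgpack"
    }


size_delta = compare_to_msgpack(compressed_size_bytes)
write_delta = compare_to_msgpack(mean_write_seconds)
read_delta = compare_to_msgpack(mean_read_seconds)

for label, delta in [("size", size_delta), ("write", write_delta), ("read", read_delta)]:
    for name, pct in delta.items():
        print(f"{label:5s} {name:12s} {pct:+.1f}% vs msgpack")
```

With these inputs, flatbuffers comes out roughly 24% larger than msgpack on disk, about 8% faster on writes and about 11% slower on reads. The read gap of about 0.37 s is smaller than the 0.606 s standard deviation reported for the msgpack read runs.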
diff --git a/docs/docs/icechunk-python/cheatsheets/git-users.md b/docs/docs/icechunk-python/cheatsheets/git-users.md
@@ -81,7 +81,7 @@ At this point, the tip of the branch is now the snapshot `198273178639187` and a
 In Icechunk, you can view the history of a branch by using the [`repo.ancestry()`](../reference/#icechunk.Repository.ancestry) command, similar to the `git log` command.
 
 ```python
-repo.ancestry(branch="my-new-branch")
+[ancestor for ancestor in repo.ancestry(branch="my-new-branch")]
 
 #[Snapshot(id='198273178639187', ...), ...]
 ```
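
Since `repo.ancestry()` now hands back an iterator instead of a list, callers can take only the entries they need, but they can also only walk the result once. The sketch below illustrates that with a made-up generator, `fake_ancestry`, and a hypothetical `FakeSnapshot` tuple standing in for the real repository objects, so it runs without Icechunk installed. Only the iteration pattern is meant to carry over.

```python
from itertools import islice
from typing import Iterator, NamedTuple


class FakeSnapshot(NamedTuple):
    """Hypothetical stand-in for a snapshot record; not the Icechunk class."""

    id: str
    message: str


def fake_ancestry() -> Iterator[FakeSnapshot]:
    """Stand-in for `repo.ancestry(...)`: yields three made-up snapshots."""
    yield FakeSnapshot("SNAP3", "third commit")
    yield FakeSnapshot("SNAP2", "second commit")
    yield FakeSnapshot("SNAP1", "first commit")


history = fake_ancestry()

# Take only the first two entries without materializing the rest.
first_two = list(islice(history, 2))

# The iterator keeps its position, so only the last entry is left.
remaining = list(history)

# Once exhausted, iterating again yields nothing. To start over, request
# a fresh iterator (here: call fake_ancestry() again).
exhausted = list(history)
```

Code that previously indexed into the returned list or called `len()` on it would need to wrap the call in `list(...)` first, as the updated docs do with a list comprehension.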
@@ -156,7 +156,7 @@ We can also view the history of a tag by using the [`repo.ancestry()`](../refere
 repo.ancestry(tag="my-new-tag")
 ```
 
-This will return a list of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
+This will return an iterator of snapshots that are ancestors of the tag. Similar to branches, we can look up the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
 
 ```python
 repo.lookup_tag("my-new-tag")

diff --git a/docs/docs/icechunk-python/version-control.md b/docs/docs/icechunk-python/version-control.md
@@ -27,7 +27,7 @@ repo = icechunk.Repository.create(icechunk.in_memory_storage())
 On creating a new [`Repository`](../reference/#icechunk.Repository), it will automatically create a `main` branch with an initial snapshot. We can take a look at the ancestry of the `main` branch to confirm this.
 
 ```python
-repo.ancestry(branch="main")
+[ancestor for ancestor in repo.ancestry(branch="main")]
 
 # [SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,19,52,41,592998, tzinfo=datetime.timezone.utc), message="Repository...")]
 ```
@@ -36,7 +36,7 @@ repo.ancestry(branch="main")
 
     The [`ancestry`](./reference/#icechunk.Repository.ancestry) method can be used to inspect the ancestry of any branch, snapshot, or tag.
 
-We get back a list of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
+We get back an iterator of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
 
 ## Creating a snapshot
 

diff --git a/docs/docs/spec.md b/docs/docs/spec.md
@@ -72,7 +72,6 @@ Finally, in an atomic put-if-not-exists operation, to commit the transaction, it
 This operation may fail if a different client has already committed the next snapshot.
 In this case, the client may attempt to resolve the conflicts and retry the commit.
 
-
 ```mermaid
 flowchart TD
     subgraph metadata[Metadata]
@@ -121,14 +120,14 @@ All data and metadata files are stored within a root directory (typically a pref
 - `$ROOT/snapshots/` snapshot files
 - `$ROOT/attributes/` attribute files
 - `$ROOT/manifests/` chunk manifests
+- `$ROOT/transactions/` transaction log files
 - `$ROOT/chunks/` chunks
 
 ### File Formats
 
 !!! warning
     The actual file formats used for each type of metadata file are in flux. The spec currently describes the data structures encoded in these files, rather than a specific file format.
 
-
 ### Reference Files
 
 Similar to Git, Icechunk supports the concept of _branches_ and _tags_.
@@ -149,9 +148,8 @@ Different client sessions may simultaneously create two inconsistent snapshots;
 
 References (both branches and tags) are stored as JSON files, the content is a JSON object with:
 
-* keys: a single key `"snapshot"`,
-* value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.
-
+- keys: a single key `"snapshot"`,
+- value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.
 
 Here is an example of a JSON file corresponding to a tag or branch:
 
@@ -186,6 +184,7 @@ Branch references are stored in the `refs/` directory within a subdirectory corr
 Branch names may not contain the `/` character.
 
 To facilitate easy lookups of the latest branch reference, we use the following encoding for the sequence number:
+
 - subtract the sequence number from the integer `1099511627775`
 - encode the resulting integer as a string using [Base 32 Crockford](https://www.crockford.com/base32.html)
 - left-padding the string with 0s to a length of 8 characters
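
The three encoding steps listed above can be sketched in a few lines. This is an illustration only, not the library's implementation: the function names are hypothetical, and the upper-case Crockford alphabet is assumed based on the snapshot ids shown elsewhere in the docs.

```python
# Crockford Base 32 alphabet: digits plus letters, excluding I, L, O and U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# The constant quoted in the spec text; it equals 2**40 - 1,
# which is the largest value that fits in 8 base-32 digits.
MAX_SEQUENCE = 1099511627775


def crockford_encode(value: int) -> str:
    """Encode a non-negative integer as a Crockford Base 32 string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    while True:
        value, remainder = divmod(value, 32)
        digits.append(CROCKFORD_ALPHABET[remainder])
        if value == 0:
            break
    return "".join(reversed(digits))


def encode_sequence_number(sequence: int) -> str:
    """Hypothetical helper: subtract, encode, then left-pad to 8 characters."""
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence number out of range")
    return crockford_encode(MAX_SEQUENCE - sequence).rjust(8, "0")


for seq in (0, 1, 2):
    print(seq, encode_sequence_number(seq))
```

Because the sequence number is subtracted from the maximum before encoding, a higher sequence number produces a string that sorts lexicographically earlier, so under this sketch the newest reference would appear first in a sorted listing.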
@@ -216,30 +215,8 @@ Tags cannot be deleted once created.
 
 The snapshot file fully describes the schema of the repository, including all arrays and groups.
 
-The snapshot file is currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
-in the snapshot file is through the data structure used internally by the Icechunk library for serialization. This data structure will most certainly change before the spec stabilization:
-
-```rust
-pub struct Snapshot {
-    pub icechunk_snapshot_format_version: IcechunkFormatVersion,
-    pub icechunk_snapshot_format_flags: BTreeMap<String, rmpv::Value>,
-
-    pub manifest_files: Vec<ManifestFileInfo>,
-    pub attribute_files: Vec<AttributeFileInfo>,
-
-    pub total_parents: u32,
-    pub short_term_parents: u16,
-    pub short_term_history: VecDeque<SnapshotMetadata>,
-
-    pub metadata: SnapshotMetadata,
-    pub started_at: DateTime<Utc>,
-    pub properties: SnapshotProperties,
-    nodes: BTreeMap<Path, NodeSnapshot>,
-}
-```
-
-To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/snapshot.rs#L97).
-
+The snapshot file is encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
+on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/snapshot.fbs).
 
 ### Attributes Files
 
@@ -248,37 +225,22 @@ Attribute files hold user-defined attributes separately from the snapshot file.
 !!! warning
     Attribute files have not been implemented.
 
-The on-disk format for attribute files has not been defined yet, but it will probably be a
-MessagePack serialization of the attributes map.
+The on-disk format for attribute files has not been defined in full yet.
 
 ### Chunk Manifest Files
 
 A chunk manifest file stores chunk references.
 Chunk references from multiple arrays can be stored in the same chunk manifest.
 The chunks from a single array can also be spread across multiple manifests.
 
-Manifest files are currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
-in the snapshot file is through the data structure used internally by the Icechunk library. This data structure will most certainly change before the spec stabilization:
-
-```rust
-pub struct Manifest {
-    pub icechunk_manifest_format_version: IcechunkFormatVersion,
-    pub icechunk_manifest_format_flags: BTreeMap<String, rmpv::Value>,
-    chunks: BTreeMap<(NodeId, ChunkIndices), ChunkPayload>,
-}
-
-pub enum ChunkPayload {
-    Inline(Bytes),
-    Virtual(VirtualChunkRef),
-    Ref(ChunkRef),
-}
-```
+Manifest files are encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
+on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/manifest.fbs).
 
 The most important part to understand from the data structure is the fact that manifests can hold three types of references:
 
-* Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
-* Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
-* Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
+- Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
+- Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
+- Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
   a chunk that is inside a NetCDF file in object store
 
 To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/manifest.rs#L106).

diff --git a/icechunk-python/benchmarks/test_benchmark_reads.py b/icechunk-python/benchmarks/test_benchmark_reads.py
@@ -48,7 +48,11 @@ def test_time_getsize_key(synth_dataset: Dataset, benchmark) -> None:
     @benchmark
     def fn():
         for array in synth_dataset.load_variables:
-            key = f"{synth_dataset.group or ''}/{array}/zarr.json"
+            if (group := synth_dataset.group) is not None:
+                prefix = f"{group}/"
+            else:
+                prefix = ""
+            key = f"{prefix}{array}/zarr.json"
             sync(store.getsize(key))
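
A note on the assignment expression used when building the key prefix: `:=` binds more loosely than comparison operators such as `is not`, so whether the condition is parenthesized changes what gets bound. The sketch below shows both forms in isolation. The function names are hypothetical and a plain `group` argument stands in for `synth_dataset.group`.

```python
from typing import Optional


def zarr_json_key(group: Optional[str], array: str) -> str:
    """Build `<group>/<array>/zarr.json`, omitting the prefix when group is None."""
    # Parentheses make `:=` bind the group value itself.
    if (name := group) is not None:
        prefix = f"{name}/"
    else:
        prefix = ""
    return f"{prefix}{array}/zarr.json"


def unparenthesized_binding(group: Optional[str]) -> object:
    """Return what `name` holds when the parentheses are left out."""
    # Parsed as: name := (group is not None), so `name` is a bool.
    if name := group is not None:
        return name
    return name


print(zarr_json_key("root", "temperature"))
print(zarr_json_key(None, "temperature"))
print(unparenthesized_binding("root"))
```

Without the parentheses, the bound name holds `True` or `False` rather than the group name, so an f-string built from it would start with `True/`.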
 
 

diff --git a/icechunk-python/python/icechunk/_icechunk_python.pyi b/icechunk-python/python/icechunk/_icechunk_python.pyi
@@ -901,13 +901,6 @@ class PyRepository:
     def save_config(self) -> None: ...
     def config(self) -> RepositoryConfig: ...
     def storage(self) -> Storage: ...
-    def ancestry(
-        self,
-        *,
-        branch: str | None = None,
-        tag: str | None = None,
-        snapshot: str | None = None,
-    ) -> list[SnapshotInfo]: ...
     def async_ancestry(
         self,
         *,

diff --git a/icechunk-python/python/icechunk/repository.py b/icechunk-python/python/icechunk/repository.py
@@ -1,6 +1,6 @@
 import datetime
-from collections.abc import AsyncIterator
-from typing import Self
+from collections.abc import AsyncIterator, Iterator
+from typing import Self, cast
 
 from icechunk._icechunk_python import (
     Diff,
@@ -197,7 +197,7 @@ def ancestry(
         branch: str | None = None,
         tag: str | None = None,
         snapshot: str | None = None,
-    ) -> list[SnapshotInfo]:
+    ) -> Iterator[SnapshotInfo]:
         """
         Get the ancestry of a snapshot.
 
@@ -219,7 +219,13 @@ def ancestry(
         -----
         Only one of the arguments can be specified.
         """
-        return self._repository.ancestry(branch=branch, tag=tag, snapshot=snapshot)
+
+        # the returned object is both an Async and Sync iterator
+        res = cast(
+            Iterator[SnapshotInfo],
+            self._repository.async_ancestry(branch=branch, tag=tag, snapshot=snapshot),
+        )
+        return res
 
     def async_ancestry(
         self,
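
The comment in `ancestry()` says the object returned by `async_ancestry` is both an async and a sync iterator, which is why a `cast` is enough to reuse it. The sketch below shows how one Python object can satisfy both protocols and that `typing.cast` does nothing at runtime. `DualIterator` and the helper functions are hypothetical stand-ins, not the actual Icechunk classes.

```python
import asyncio
from collections.abc import Iterator
from typing import cast


class DualIterator:
    """Hypothetical object implementing both iterator protocols."""

    def __init__(self, items: list[str]) -> None:
        self._items = iter(items)

    # Synchronous iterator protocol.
    def __iter__(self) -> "DualIterator":
        return self

    def __next__(self) -> str:
        return next(self._items)

    # Asynchronous iterator protocol.
    def __aiter__(self) -> "DualIterator":
        return self

    async def __anext__(self) -> str:
        try:
            return next(self._items)
        except StopIteration:
            # Async iteration signals the end with its own exception type.
            raise StopAsyncIteration from None


def as_sync_iterator(obj: DualIterator) -> Iterator[str]:
    # cast() only informs the type checker; it returns `obj` unchanged.
    return cast(Iterator[str], obj)


async def collect_async(obj: DualIterator) -> list[str]:
    return [item async for item in obj]


source = DualIterator(["c", "b", "a"])
same_object = as_sync_iterator(source) is source
sync_items = list(as_sync_iterator(source))
async_items = asyncio.run(collect_async(DualIterator(["c", "b", "a"])))
```

The cast is therefore only safe as long as the wrapped object really implements `__iter__` and `__next__`; the type checker will not verify that.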