Merge branch 'main' into ian/docs/updates
dcherian authored Feb 19, 2025
2 parents f1d1d18 + cd5dac2 commit 6c653f6
Showing 81 changed files with 5,425 additions and 902 deletions.
15 changes: 13 additions & 2 deletions Cargo.lock

127 changes: 127 additions & 0 deletions design-docs/008-no-copy-serialization-formats.md
@@ -0,0 +1,127 @@
# Evaluation of different serialization formats

We want to move away from msgpack serialization for Icechunk metadata files.

## Why

* Msgpack requires an expensive parsing process upfront. If the user only wants
to pull a few chunk refs from a manifest, they still need to parse the whole manifest.
* Msgpack deserializes to Rust data structures. This is good for simplicity of code, but
probably not good for memory consumption (more pointers everywhere).
* Msgpack gives too many options on how to serialize things and there is no canonical way,
so it's not easy to predict how `serde` is going to serialize our data structures, and
the output could even change from version to version.
* It's hard to explain in the spec what goes into the metadata files. We would need to go
into the `rmp_serde` implementation, see what it does, and document that in the spec.

## Other options

The menu of options is never-ending: from a custom binary format to Parquet, and everything else.
We focused mostly on no-copy formats, because they address some of the issues enumerated above.
There is also a preference for formats that have a tight schema and can be documented with
some form of IDL.

## Performance evaluation

We evaluated the performance of msgpack, flatbuffers and capnproto. The evaluation looks at:

* Manifest file size, for a big manifest with 1M native chunk refs.
* Speed of writing.
* Speed of reading.

We wrote an example program in `examples/multithreaded_get_chunk_refs.rs`.
This program writes a big repo to local file storage. It doesn't actually write the chunks,
because we are not interested in benchmarking that. It executes purely in Rust, without using the Python interface.

It writes a manifest with 1M native chunk refs, using zstd compression level 3. The writes are done
from 1M concurrent async tasks.

It then executes 1M chunk ref reads (only the refs are read; the chunks themselves were never written).
Reads are executed from 4 threads with 250k concurrent async tasks each.

Notes:

* We use the local file system on purpose, to exclude network time from the comparison.
* We are comparing pulling refs only, not chunks, which is a worst case. In the real
world, read operations are dominated by the time taken to fetch the chunks.
* The evaluation was done in an early state of the code, where many parts were unsafe,
but we have verified there are no huge differences.

### Results for writes

```sh
nix run nixpkgs#hyperfine -- \
--prepare 'rm -rf /tmp/test-perf' \
--warmup 1 \
'cargo run --release --example multithreaded_get_chunk_refs -- --write /tmp/test-perf'
```

#### Flatbuffers

Compressed manifest size: 27_527_680 bytes

```
Time (mean ± σ): 5.698 s ± 0.163 s [User: 4.764 s, System: 0.910 s]
Range (min … max): 5.562 s … 6.103 s 10 runs
```

#### Capnproto

Compressed manifest size: 26_630_927 bytes

```
Time (mean ± σ): 6.276 s ± 0.163 s [User: 5.225 s, System: 1.017 s]
Range (min … max): 6.126 s … 6.630 s 10 runs
```

#### Msgpack

Compressed manifest size: 22_250_152 bytes

```
Time (mean ± σ): 6.224 s ± 0.155 s [User: 5.488 s, System: 0.712 s]
Range (min … max): 6.033 s … 6.532 s 10 runs
```

### Results for reads

```sh
nix run nixpkgs#hyperfine -- \
--warmup 1 \
'cargo run --release --example multithreaded_get_chunk_refs -- --read /tmp/test-perf'
```

#### Flatbuffers

```
Time (mean ± σ): 3.676 s ± 0.257 s [User: 7.385 s, System: 1.819 s]
Range (min … max): 3.171 s … 4.038 s 10 runs
```

#### Capnproto

```
Time (mean ± σ): 5.254 s ± 0.234 s [User: 11.370 s, System: 1.962 s]
Range (min … max): 4.992 s … 5.799 s 10 runs
```

#### Msgpack

```
Time (mean ± σ): 3.310 s ± 0.606 s [User: 5.975 s, System: 1.762 s]
Range (min … max): 2.392 s … 4.102 s 10 runs
```

## Conclusions

* The compressed manifest is about 24% larger in flatbuffers than in msgpack
(27_527_680 vs. 22_250_152 bytes).
* Flatbuffers is slightly faster for commits (mean write time 5.698 s vs. 6.224 s).
* Flatbuffers is slightly slower for reads (mean 3.676 s vs. 3.310 s), a gap smaller than
the run-to-run spread of the msgpack measurement (σ = 0.606 s).
* Timing differences are not significant for real world scenarios, where performance
is dominated by the time taken downloading or uploading chunks.
* Manifest fetch time differences could be somewhat significant for workloads where
latency to first byte is important. This is not the use case Icechunk optimizes for.
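
The relative differences behind these conclusions can be recomputed from the figures reported above. The following is a small sketch of that arithmetic only; the helper name `pct_vs_msgpack` is made up for illustration, and the numbers are copied from the results sections, not re-measured.

```python
# Figures copied from the "Results for writes" and "Results for reads" sections.
sizes_bytes = {"flatbuffers": 27_527_680, "capnproto": 26_630_927, "msgpack": 22_250_152}
write_mean_s = {"flatbuffers": 5.698, "capnproto": 6.276, "msgpack": 6.224}
read_mean_s = {"flatbuffers": 3.676, "capnproto": 5.254, "msgpack": 3.310}


def pct_vs_msgpack(table: dict[str, float], fmt: str) -> float:
    """Percentage difference of `fmt` relative to msgpack.

    Positive means larger (sizes) or slower (times) than msgpack.
    """
    return (table[fmt] / table["msgpack"] - 1.0) * 100.0


for fmt in ("flatbuffers", "capnproto"):
    print(
        f"{fmt}: size {pct_vs_msgpack(sizes_bytes, fmt):+.1f}%, "
        f"write {pct_vs_msgpack(write_mean_s, fmt):+.1f}%, "
        f"read {pct_vs_msgpack(read_mean_s, fmt):+.1f}%"
    )
```

These are ratios of means from 10 runs each, so small differences should be read together with the reported standard deviations.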

## Decision

We are going to use flatbuffers for our metadata on-disk format.
4 changes: 2 additions & 2 deletions docs/docs/icechunk-python/cheatsheets/git-users.md
@@ -81,7 +81,7 @@ At this point, the tip of the branch is now the snapshot `198273178639187` and a
In Icechunk, you can view the history of a branch by using the [`repo.ancestry()`](../reference/#icechunk.Repository.ancestry) command, similar to the `git log` command.

```python
repo.ancestry(branch="my-new-branch")
[ancestor for ancestor in repo.ancestry(branch="my-new-branch")]

#[Snapshot(id='198273178639187', ...), ...]
```
@@ -156,7 +156,7 @@ We can also view the history of a tag by using the [`repo.ancestry()`](../refere
repo.ancestry(tag="my-new-tag")
```

This will return a list of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.
This will return an iterator of snapshots that are ancestors of the tag. Similar to branches we can lookup the snapshot that a tag is based on by using the [`repo.lookup_tag()`](../reference/#icechunk.Repository.lookup_tag) command.

```python
repo.lookup_tag("my-new-tag")
4 changes: 2 additions & 2 deletions docs/docs/icechunk-python/version-control.md
@@ -27,7 +27,7 @@ repo = icechunk.Repository.create(icechunk.in_memory_storage())
On creating a new [`Repository`](../reference/#icechunk.Repository), it will automatically create a `main` branch with an initial snapshot. We can take a look at the ancestry of the `main` branch to confirm this.

```python
repo.ancestry(branch="main")
[ancestor for ancestor in repo.ancestry(branch="main")]

# [SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,19,52,41,592998, tzinfo=datetime.timezone.utc), message="Repository...")]
```
@@ -36,7 +36,7 @@ repo.ancestry(branch="main")

The [`ancestry`](./reference/#icechunk.Repository.ancestry) method can be used to inspect the ancestry of any branch, snapshot, or tag.

We get back a list of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
We get back an iterator of [`SnapshotInfo`](../reference/#icechunk.SnapshotInfo) objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.

## Creating a snapshot

62 changes: 12 additions & 50 deletions docs/docs/spec.md
@@ -72,7 +72,6 @@ Finally, in an atomic put-if-not-exists operation, to commit the transaction, it
This operation may fail if a different client has already committed the next snapshot.
In this case, the client may attempt to resolve the conflicts and retry the commit.


```mermaid
flowchart TD
subgraph metadata[Metadata]
@@ -121,14 +120,14 @@ All data and metadata files are stored within a root directory (typically a pref
- `$ROOT/snapshots/` snapshot files
- `$ROOT/attributes/` attribute files
- `$ROOT/manifests/` chunk manifests
- `$ROOT/transactions/` transaction log files
- `$ROOT/chunks/` chunks

### File Formats

!!! warning
The actual file formats used for each type of metadata file are in flux. The spec currently describes the data structures encoded in these files, rather than a specific file format.


### Reference Files

Similar to Git, Icechunk supports the concept of _branches_ and _tags_.
@@ -149,9 +148,8 @@ Different client sessions may simultaneously create two inconsistent snapshots;

References (both branches and tags) are stored as JSON files, the content is a JSON object with:

* keys: a single key `"snapshot"`,
* value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.

- keys: a single key `"snapshot"`,
- value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.
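
The length arithmetic can be checked with a short sketch: 12 bytes are 96 bits, and at 5 bits per Base 32 symbol that takes 20 characters (19 symbols would cover only 95 bits). The spec text here fixes only the alphabet and the length, so the bit layout below (big-endian bytes, zero bits padded at the end) and the function name `encode_snapshot_id` are assumptions, not the library's actual implementation.

```python
import json
import secrets

# Crockford Base 32 alphabet: digits plus letters, excluding I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_snapshot_id(raw: bytes) -> str:
    """Encode a 12-byte id as 20 Crockford Base 32 characters.

    Assumption: bytes are read big-endian and the 96 bits are padded with
    4 trailing zero bits to reach 100 bits = 20 groups of 5.
    """
    if len(raw) != 12:
        raise ValueError("snapshot ids are 12 bytes long")
    bits = int.from_bytes(raw, "big") << 4
    # Walk the 100 bits from the most significant 5-bit group to the least.
    return "".join(CROCKFORD[(bits >> shift) & 0x1F] for shift in range(95, -1, -5))


snapshot_id = encode_snapshot_id(secrets.token_bytes(12))
ref_file_content = json.dumps({"snapshot": snapshot_id})
print(ref_file_content)
```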

Here is an example of a JSON file corresponding to a tag or branch:

@@ -186,6 +184,7 @@ Branch references are stored in the `refs/` directory within a subdirectory corr
Branch names may not contain the `/` character.

To facilitate easy lookups of the latest branch reference, we use the following encoding for the sequence number:

- subtract the sequence number from the integer `1099511627775`
- encode the resulting integer as a string using [Base 32 Crockford](https://www.crockford.com/base32.html)
- left-padding the string with 0s to a length of 8 characters
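
The three steps above can be sketched as follows. This is an illustration of the described encoding, not the library's code; the function name `encode_sequence` is hypothetical. Note that `1099511627775` is `2**40 - 1`, which is exactly the largest value that fits in 8 Base 32 characters.

```python
# Crockford Base 32 alphabet: digits plus letters, excluding I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
MAX_SEQUENCE = 1099511627775  # constant from the spec, equal to 2**40 - 1


def encode_sequence(sequence_number: int) -> str:
    """Encode a branch sequence number following the three steps in the spec."""
    if not 0 <= sequence_number <= MAX_SEQUENCE:
        raise ValueError("sequence number out of range")
    # Step 1: subtract from the maximum, so larger numbers give smaller values.
    value = MAX_SEQUENCE - sequence_number
    # Step 2: encode the integer in Base 32 Crockford.
    digits = ""
    while value:
        value, remainder = divmod(value, 32)
        digits = CROCKFORD[remainder] + digits
    # Step 3: left-pad with zeros to 8 characters.
    return digits.rjust(8, "0")


for n in (0, 1, 2):
    print(n, encode_sequence(n))
```

Because of the subtraction, a later sequence number encodes to a string that sorts earlier, so a lexicographically ordered listing of the branch directory would return the most recent reference first.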
@@ -216,30 +215,8 @@ Tags cannot be deleted once created.

The snapshot file fully describes the schema of the repository, including all arrays and groups.

The snapshot file is currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
in the snapshot file is through the data structure used internally by the Icechunk library for serialization. This data structure will most certainly change before the spec stabilization:

```rust
pub struct Snapshot {
pub icechunk_snapshot_format_version: IcechunkFormatVersion,
pub icechunk_snapshot_format_flags: BTreeMap<String, rmpv::Value>,

pub manifest_files: Vec<ManifestFileInfo>,
pub attribute_files: Vec<AttributeFileInfo>,

pub total_parents: u32,
pub short_term_parents: u16,
pub short_term_history: VecDeque<SnapshotMetadata>,

pub metadata: SnapshotMetadata,
pub started_at: DateTime<Utc>,
pub properties: SnapshotProperties,
nodes: BTreeMap<Path, NodeSnapshot>,
}
```

To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/snapshot.rs#L97).

The snapshot file is encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/snapshot.fbs)

### Attributes Files

@@ -248,37 +225,22 @@ Attribute files hold user-defined attributes separately from the snapshot file.
!!! warning
Attribute files have not been implemented.

The on-disk format for attribute files has not been defined yet, but it will probably be a
MessagePack serialization of the attributes map.
The on-disk format for attribute files has not been defined in full yet.

### Chunk Manifest Files

A chunk manifest file stores chunk references.
Chunk references from multiple arrays can be stored in the same chunk manifest.
The chunks from a single array can also be spread across multiple manifests.

Manifest files are currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
in the snapshot file is through the data structure used internally by the Icechunk library. This data structure will most certainly change before the spec stabilization:

```rust
pub struct Manifest {
pub icechunk_manifest_format_version: IcechunkFormatVersion,
pub icechunk_manifest_format_flags: BTreeMap<String, rmpv::Value>,
chunks: BTreeMap<(NodeId, ChunkIndices), ChunkPayload>,
}

pub enum ChunkPayload {
Inline(Bytes),
Virtual(VirtualChunkRef),
Ref(ChunkRef),
}
```
Manifest files are encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/manifest.fbs)

The most important part to understand from the data structure is the fact that manifests can hold three types of references:

* Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
* Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
* Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
- Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
- Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
- Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
a chunk that is inside a NetCDF file in object store

To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/manifest.rs#L106).
6 changes: 5 additions & 1 deletion icechunk-python/benchmarks/test_benchmark_reads.py
@@ -48,7 +48,11 @@ def test_time_getsize_key(synth_dataset: Dataset, benchmark) -> None:
@benchmark
def fn():
for array in synth_dataset.load_variables:
key = f"{synth_dataset.group or ''}/{array}/zarr.json"
if (group := synth_dataset.group) is not None:
prefix = f"{group}/"
else:
prefix = ""
key = f"{prefix}{array}/zarr.json"
sync(store.getsize(key))


7 changes: 0 additions & 7 deletions icechunk-python/python/icechunk/_icechunk_python.pyi
@@ -901,13 +901,6 @@ class PyRepository:
def save_config(self) -> None: ...
def config(self) -> RepositoryConfig: ...
def storage(self) -> Storage: ...
def ancestry(
self,
*,
branch: str | None = None,
tag: str | None = None,
snapshot: str | None = None,
) -> list[SnapshotInfo]: ...
def async_ancestry(
self,
*,
14 changes: 10 additions & 4 deletions icechunk-python/python/icechunk/repository.py
@@ -1,6 +1,6 @@
import datetime
from collections.abc import AsyncIterator
from typing import Self
from collections.abc import AsyncIterator, Iterator
from typing import Self, cast

from icechunk._icechunk_python import (
Diff,
@@ -197,7 +197,7 @@ def ancestry(
branch: str | None = None,
tag: str | None = None,
snapshot: str | None = None,
) -> list[SnapshotInfo]:
) -> Iterator[SnapshotInfo]:
"""
Get the ancestry of a snapshot.
@@ -219,7 +219,13 @@ def ancestry(
-----
Only one of the arguments can be specified.
"""
return self._repository.ancestry(branch=branch, tag=tag, snapshot=snapshot)

# the returned object is both an Async and Sync iterator
res = cast(
Iterator[SnapshotInfo],
self._repository.async_ancestry(branch=branch, tag=tag, snapshot=snapshot),
)
return res

def async_ancestry(
self,