Skip to content

Commit

Permalink
Flatbuffers for serialization (#733)
Browse files Browse the repository at this point in the history
* flatbuffers manifest

* wip

* working

* testing flatbuffers perf

* flatbuffers snapshots

* code quality on manifest

* Manifest working

* All tests pass wit the new flatbuffers snapshot

* Working on code qty, tests passing

* more code qty

* WIP: transaction log"

* Diffs and status working

* All tests passing

* Fix stateful test

* code qty

* Clean up

* Documentation and some name changes

* Better ManifestExtents type and serialization

* lint and tests
  • Loading branch information
paraseba authored Feb 18, 2025
1 parent 6e973a3 commit f1dc0bb
Show file tree
Hide file tree
Showing 73 changed files with 5,375 additions and 856 deletions.
15 changes: 13 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

127 changes: 127 additions & 0 deletions design-docs/008-no-copy-serialization-formats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# Evaluation of different serialization formats

We want to move away from msgpack serialization for Icechunk metadata files.

## Why

* Msgpack requires a expensive parsing process upfront. If the user only wants
to pull a few chunk refs from a manifest, they still need to parse the whole manifest.
* Msgpack deserializes to Rust datastructures. This is good for simplicity of code, but
probably not good for memory consumption (more pointers everywhere).
* Msgpack gives too many options on how to serialize things, there is no canonical way,
so it's not easy to predict how `serde` is going to serialize our detastructures, and
could even change from version to version.
* It's hard to explain in the spec what goes into the metadata files, we would need to go
into `rmp_serde` implementation, see what they do, and document that in the spec.

## Other options

There is a never ending menu. From a custom binary format, to Parquet, and everything else.
We focused mostly on no-copy formats, for some of the issues enumerated above. Also
there is a preference for formats that have a tight schema and can be documented with
some form of IDL.

## Performance evaluation

We evaluated performance of msgpack, flatbuffers and capnproto. Evaluation looks at:

* Manifest file size, for a big manifest with 1M native chunk refs.
* Speed of writing.
* Speed of reading.

We wrote an example program in `examples/multithreaded_get_chunk_refs.rs`.
This program writes a big repo to local file storage, it doesn't really write the chunks,
we are not interested in benchmarking that. It executes purely in Rust, not using the python interface.

It writes a manifest with 1M native chunk refs, using zstd compression level 3. The writes are done
from 1M concurrent async tasks.

It then executes 1M chunk ref reads (notice, the refs are read, not the chunks that are not there).
Reads are executed from 4 threads with 250k concurrent async tasks each.

Notice:

* We are comparing local file system on purpose, to not account for network times
* We are comparing pulling refs only, not chunks, which is a worst case. In the real
world, read operations are dominated by the time taken to fetch the chunks.
* The evaluation was done in an early state of the code, where many parts were unsafe,
but we have verified there are no huge differences.

### Results for writes

```sh
nix run nixpkgs#hyperfine -- \
--prepare 'rm -rf /tmp/test-perf' \
--warmup 1 \
'cargo run --release --example multithreaded_get_chunk_refs -- --write /tmp/test-perf'
```

#### Flatbuffers

Compressed manifest size: 27_527_680 bytes

```
Time (mean ± σ): 5.698 s ± 0.163 s [User: 4.764 s, System: 0.910 s]
Range (min … max): 5.562 s … 6.103 s 10 runs
```

#### Capnproto

Compressed manifest size: 26_630_927 bytes

```
Time (mean ± σ): 6.276 s ± 0.163 s [User: 5.225 s, System: 1.017 s]
Range (min … max): 6.126 s … 6.630 s 10 runs
```

#### Msgpack

Compressed manifest size: 22_250_152 bytes

```
Time (mean ± σ): 6.224 s ± 0.155 s [User: 5.488 s, System: 0.712 s]
Range (min … max): 6.033 s … 6.532 s 10 runs
```

### Results for reads

```sh
nix run nixpkgs#hyperfine -- \
--warmup 1 \
'cargo run --release --example multithreaded_get_chunk_refs -- --read /tmp/test-perf'
```

#### Flatbuffers

```
Time (mean ± σ): 3.676 s ± 0.257 s [User: 7.385 s, System: 1.819 s]
Range (min … max): 3.171 s … 4.038 s 10 runs
```

#### Capnproto

```
Time (mean ± σ): 5.254 s ± 0.234 s [User: 11.370 s, System: 1.962 s]
Range (min … max): 4.992 s … 5.799 s 10 runs
```

#### Msgpack

```
Time (mean ± σ): 3.310 s ± 0.606 s [User: 5.975 s, System: 1.762 s]
Range (min … max): 2.392 s … 4.102 s 10 runs
```

## Conclusions

* Compressed manifest is 25% larger in flatbuffers than msgpack
* Flatbuffers is slightly faster for commits
* Flatbuffers is slightly slower for reads
* Timing differences are not significant for real world scenarios, where performance
is dominated by the time taken downloading or uploading chunks.
* Manifest fetch time differences could be somewhat significant for workloads where
latency to first byte is important. This is not the use case Icechunk optimizes for.

## Decision

We are going to use flatbuffers for our metadata on-disk format.
62 changes: 12 additions & 50 deletions docs/docs/spec.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,6 @@ Finally, in an atomic put-if-not-exists operation, to commit the transaction, it
This operation may fail if a different client has already committed the next snapshot.
In this case, the client may attempt to resolve the conflicts and retry the commit.


```mermaid
flowchart TD
subgraph metadata[Metadata]
Expand Down Expand Up @@ -121,14 +120,14 @@ All data and metadata files are stored within a root directory (typically a pref
- `$ROOT/snapshots/` snapshot files
- `$ROOT/attributes/` attribute files
- `$ROOT/manifests/` chunk manifests
- `$ROOT/transactions/` transaction log files
- `$ROOT/chunks/` chunks

### File Formats

!!! warning
The actual file formats used for each type of metadata file are in flux. The spec currently describes the data structures encoded in these files, rather than a specific file format.


### Reference Files

Similar to Git, Icechunk supports the concept of _branches_ and _tags_.
Expand All @@ -149,9 +148,8 @@ Different client sessions may simultaneously create two inconsistent snapshots;

References (both branches and tags) are stored as JSON files, the content is a JSON object with:

* keys: a single key `"snapshot"`,
* value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.

- keys: a single key `"snapshot"`,
- value: a string representation of the snapshot id, using [Base 32 Crockford](https://www.crockford.com/base32.html) encoding. The snapshot id is 12 byte random binary, so the encoded string has 20 characters.

Here is an example of a JSON file corresponding to a tag or branch:

Expand Down Expand Up @@ -186,6 +184,7 @@ Branch references are stored in the `refs/` directory within a subdirectory corr
Branch names may not contain the `/` character.

To facilitate easy lookups of the latest branch reference, we use the following encoding for the sequence number:

- subtract the sequence number from the integer `1099511627775`
- encode the resulting integer as a string using [Base 32 Crockford](https://www.crockford.com/base32.html)
- left-padding the string with 0s to a length of 8 characters
Expand Down Expand Up @@ -216,30 +215,8 @@ Tags cannot be deleted once created.

The snapshot file fully describes the schema of the repository, including all arrays and groups.

The snapshot file is currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
in the snapshot file is through the data structure used internally by the Icechunk library for serialization. This data structure will most certainly change before the spec stabilization:

```rust
pub struct Snapshot {
pub icechunk_snapshot_format_version: IcechunkFormatVersion,
pub icechunk_snapshot_format_flags: BTreeMap<String, rmpv::Value>,

pub manifest_files: Vec<ManifestFileInfo>,
pub attribute_files: Vec<AttributeFileInfo>,

pub total_parents: u32,
pub short_term_parents: u16,
pub short_term_history: VecDeque<SnapshotMetadata>,

pub metadata: SnapshotMetadata,
pub started_at: DateTime<Utc>,
pub properties: SnapshotProperties,
nodes: BTreeMap<Path, NodeSnapshot>,
}
```

To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/snapshot.rs#L97).

The snapshot file is encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/snapshot.fbs)

### Attributes Files

Expand All @@ -248,37 +225,22 @@ Attribute files hold user-defined attributes separately from the snapshot file.
!!! warning
Attribute files have not been implemented.

The on-disk format for attribute files has not been defined yet, but it will probably be a
MessagePack serialization of the attributes map.
The on-disk format for attribute files has not been defined in full yet.

### Chunk Manifest Files

A chunk manifest file stores chunk references.
Chunk references from multiple arrays can be stored in the same chunk manifest.
The chunks from a single array can also be spread across multiple manifests.

Manifest files are currently encoded using [MessagePack](https://msgpack.org/), but this may change before Icechunk version 1.0. Given the alpha status of this spec, the best way to understand the information stored
in the snapshot file is through the data structure used internally by the Icechunk library. This data structure will most certainly change before the spec stabilization:

```rust
pub struct Manifest {
pub icechunk_manifest_format_version: IcechunkFormatVersion,
pub icechunk_manifest_format_flags: BTreeMap<String, rmpv::Value>,
chunks: BTreeMap<(NodeId, ChunkIndices), ChunkPayload>,
}

pub enum ChunkPayload {
Inline(Bytes),
Virtual(VirtualChunkRef),
Ref(ChunkRef),
}
```
Manifest files are encoded using [flatbuffers](https://github.com/google/flatbuffers). The IDL for the
on-disk format can be found in [the repository file](https://github.com/earth-mover/icechunk/tree/main/icechunk/flatbuffers/manifest.fbs)

The most important part to understand from the data structure is the fact that manifests can hold three types of references:

* Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
* Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
* Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
- Native (`Ref`), pointing to the id of a chunk within the Icechunk repository.
- Inline (`Inline`), an optimization for very small chunks that can be embedded directly in the manifest. Mostly used for coordinate arrays.
- Virtual (`Virtual`), pointing to a region of a file outside of the Icechunk repository, for example,
a chunk that is inside a NetCDF file in object store

To get full details on what each field contains, please refer to the [Icechunk library code](https://github.com/earth-mover/icechunk/blob/f460a56577ec560c4debfd89e401a98153cd3560/icechunk/src/format/manifest.rs#L106).
Expand Down
25 changes: 13 additions & 12 deletions icechunk-python/tests/data/test-repo/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,10 @@ compression: null
caching: null
storage: null
virtual_chunk_containers:
s3:
name: s3
url_prefix: s3://
store: !s3_compatible
region: us-east-1
endpoint_url: http://localhost:9000
anonymous: false
allow_http: true
gcs:
name: gcs
url_prefix: gcs
store: !gcs {}
az:
name: az
url_prefix: az
Expand All @@ -25,11 +21,16 @@ virtual_chunk_containers:
endpoint_url: https://fly.storage.tigris.dev
anonymous: false
allow_http: false
gcs:
name: gcs
url_prefix: gcs
store: !gcs {}
s3:
name: s3
url_prefix: s3://
store: !s3_compatible
region: us-east-1
endpoint_url: http://localhost:9000
anonymous: false
allow_http: true
file:
name: file
url_prefix: file
store: !local_file_system ''
manifest: null
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"FK0CX5JQH2DVDZ6PD6WG"}
{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"KCR7ES7JPCBY23X6MY3G"}
{"snapshot":"K1BMYVG1HNVTNV1FSBH0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"QY5JG2BWG2VPPDJR4JE0"}
{"snapshot":"RPA0WQCNM2N9HBBRHJQ0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"VNPWJSZWB9G990XV1V8G"}
{"snapshot":"6Q9GDTXKF17BGQVSQZFG"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"G0BR0G9NKT75ZZS7BWWG"}
{"snapshot":"949AXZ49X764TMDC6D4G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"FK0CX5JQH2DVDZ6PD6WG"}
{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"G0BR0G9NKT75ZZS7BWWG"}
{"snapshot":"949AXZ49X764TMDC6D4G"}
Original file line number Diff line number Diff line change
@@ -1 +1 @@
{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
{"snapshot":"SNF98D1SK7NWD5KQJM20"}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Loading

0 comments on commit f1dc0bb

Please sign in to comment.