Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Flatbuffers for serialization (#733)
* flatbuffers manifest
* wip
* working
* testing flatbuffers perf
* flatbuffers snapshots
* code quality on manifest
* Manifest working
* All tests pass with the new flatbuffers snapshot
* Working on code qty, tests passing
* more code qty
* WIP: transaction log
* Diffs and status working
* All tests passing
* Fix stateful test
* code qty
* Clean up
* Documentation and some name changes
* Better ManifestExtents type and serialization
* lint and tests
Showing 73 changed files with 5,375 additions and 856 deletions.
@@ -0,0 +1,127 @@
# Evaluation of different serialization formats

We want to move away from msgpack serialization for Icechunk metadata files.

## Why

* Msgpack requires an expensive parsing process upfront. Even if the user only wants
  to pull a few chunk refs from a manifest, they still need to parse the whole manifest.
* Msgpack deserializes to Rust data structures. This keeps the code simple, but it is
  probably not good for memory consumption (more pointers everywhere).
* Msgpack offers too many ways to serialize the same thing and has no canonical form,
  so it's not easy to predict how `serde` is going to serialize our data structures, and
  the output could even change from version to version.
* It's hard to explain in the spec what goes into the metadata files. We would need to go
  into the `rmp_serde` implementation, see what it does, and document that in the spec.
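
The first point can be illustrated with a toy sketch. It is written in Python for brevity, uses only the standard library, and involves neither msgpack nor flatbuffers. Here `json` stands in for any format that has to be decoded in full before a single entry can be read, and a hypothetical fixed-width record (`RECORD`, 16 bytes per chunk ref) stands in for a format that allows indexing straight into the buffer. The record layout and all function names are assumptions made for illustration, not the Icechunk on-disk format.

```python
import json
import struct

# Hypothetical fixed-width record, for illustration only:
# chunk index (u32), offset (u64), length (u32), little-endian, no padding.
# This is NOT the Icechunk or flatbuffers layout.
RECORD = struct.Struct("<IQI")


def build_refs(n):
    """Generate n made-up chunk refs as (index, offset, length) tuples."""
    return [(i, i * 1024, 1024) for i in range(n)]


def encode_parse_all(refs):
    """Encode refs in a format that must be decoded in full before use."""
    return json.dumps(refs).encode("utf-8")


def lookup_parse_all(buf, i):
    """Decode the whole buffer, then pick one ref out of the result."""
    return tuple(json.loads(buf)[i])


def encode_fixed(refs):
    """Encode refs as back-to-back fixed-width records."""
    return b"".join(RECORD.pack(*ref) for ref in refs)


def lookup_fixed(buf, i):
    """Read one ref directly at its byte offset; nothing else is decoded."""
    return RECORD.unpack_from(buf, i * RECORD.size)


refs = build_refs(1000)
parse_all_buf = encode_parse_all(refs)
fixed_buf = encode_fixed(refs)

# Both lookups return the same ref, but only the first one pays for
# decoding all 1000 entries.
print(lookup_parse_all(parse_all_buf, 42))
print(lookup_fixed(fixed_buf, 42))
```

The cost of `lookup_parse_all` grows with the size of the manifest, while `lookup_fixed` touches only the 16 bytes it needs. Real zero-copy formats are more elaborate than this, but the access pattern is the property of interest.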

## Other options

The menu of options is never-ending, from a custom binary format to Parquet and everything in between.
We focused mostly on zero-copy formats, because of some of the issues enumerated above. There is also
a preference for formats that have a tight schema and can be documented with
some form of IDL.

## Performance evaluation

We evaluated the performance of msgpack, flatbuffers and capnproto. The evaluation looks at:

* Manifest file size, for a big manifest with 1M native chunk refs.
* Speed of writing.
* Speed of reading.

We wrote an example program in `examples/multithreaded_get_chunk_refs.rs`.
This program writes a big repo to local file storage. It doesn't actually write the chunks,
since we are not interested in benchmarking that. It executes purely in Rust, without using the Python interface.

It writes a manifest with 1M native chunk refs, using zstd compression level 3. The writes are done
from 1M concurrent async tasks.

It then executes 1M chunk ref reads (note that only the refs are read, not the chunks, which don't exist).
Reads are executed from 4 threads with 250k concurrent async tasks each.

Notice:

* We use the local file system on purpose, to avoid accounting for network times.
* We compare pulling refs only, not chunks, which is a worst case. In the real
  world, read operations are dominated by the time taken to fetch the chunks.
* The evaluation was done at an early stage of the code, when many parts were unsafe,
  but we have verified there are no huge differences.
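
The shape of the read phase (a few threads, each driving many concurrent async tasks that look up refs only) can be sketched as follows. This is a scaled-down Python illustration, not a translation of `examples/multithreaded_get_chunk_refs.rs`: the in-memory `MANIFEST` dictionary and every function name are assumptions, and the task count per thread is reduced from 250k to 250 so that it runs instantly. It shows the structure only and is not suitable for timing anything.

```python
import asyncio
import threading

NUM_THREADS = 4         # same thread count as the benchmark described above
TASKS_PER_THREAD = 250  # scaled down from 250k for illustration

# Hypothetical in-memory stand-in for a manifest: chunk index -> (offset, length).
MANIFEST = {i: (i * 1024, 1024) for i in range(NUM_THREADS * TASKS_PER_THREAD)}


async def get_chunk_ref(index):
    """Look up one chunk ref; only the ref is read, never the chunk itself."""
    await asyncio.sleep(0)  # yield to the event loop, as an async I/O point would
    return MANIFEST[index]


async def read_range(start, count):
    """Run `count` concurrent lookups and return how many refs came back."""
    refs = await asyncio.gather(*(get_chunk_ref(start + i) for i in range(count)))
    return len(refs)


def worker(thread_id, results):
    """Each thread drives its own event loop over its own slice of indices."""
    start = thread_id * TASKS_PER_THREAD
    results[thread_id] = asyncio.run(read_range(start, TASKS_PER_THREAD))


def run_reads():
    """Start all worker threads, wait for them, and return the total read count."""
    results = [0] * NUM_THREADS
    threads = [
        threading.Thread(target=worker, args=(t, results)) for t in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


total_reads = run_reads()
print(total_reads)
```

Each thread gets a disjoint slice of chunk indices, so every ref is read exactly once across the whole run.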

### Results for writes

```sh
nix run nixpkgs#hyperfine -- \
  --prepare 'rm -rf /tmp/test-perf' \
  --warmup 1 \
  'cargo run --release --example multithreaded_get_chunk_refs -- --write /tmp/test-perf'
```

#### Flatbuffers

Compressed manifest size: 27_527_680 bytes

```
Time (mean ± σ):      5.698 s ±  0.163 s    [User: 4.764 s, System: 0.910 s]
Range (min … max):    5.562 s …  6.103 s    10 runs
```

#### Capnproto

Compressed manifest size: 26_630_927 bytes

```
Time (mean ± σ):      6.276 s ±  0.163 s    [User: 5.225 s, System: 1.017 s]
Range (min … max):    6.126 s …  6.630 s    10 runs
```

#### Msgpack

Compressed manifest size: 22_250_152 bytes

```
Time (mean ± σ):      6.224 s ±  0.155 s    [User: 5.488 s, System: 0.712 s]
Range (min … max):    6.033 s …  6.532 s    10 runs
```

### Results for reads

```sh
nix run nixpkgs#hyperfine -- \
  --warmup 1 \
  'cargo run --release --example multithreaded_get_chunk_refs -- --read /tmp/test-perf'
```

#### Flatbuffers

```
Time (mean ± σ):      3.676 s ±  0.257 s    [User: 7.385 s, System: 1.819 s]
Range (min … max):    3.171 s …  4.038 s    10 runs
```

#### Capnproto

```
Time (mean ± σ):      5.254 s ±  0.234 s    [User: 11.370 s, System: 1.962 s]
Range (min … max):    4.992 s …  5.799 s    10 runs
```

#### Msgpack

```
Time (mean ± σ):      3.310 s ±  0.606 s    [User: 5.975 s, System: 1.762 s]
Range (min … max):    2.392 s …  4.102 s    10 runs
```

## Conclusions

* The compressed manifest is about 24% larger in flatbuffers than in msgpack (27.5 MB vs. 22.3 MB).
* Flatbuffers is slightly faster for commits (writes).
* Flatbuffers is slightly slower for reads.
* Timing differences are not significant for real-world scenarios, where performance
  is dominated by the time taken downloading or uploading chunks.
* Manifest fetch time differences could be somewhat significant for workloads where
  latency to first byte is important. This is not the use case Icechunk optimizes for.
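
The relative differences behind these conclusions follow directly from the measurements in the results sections. The short Python calculation below recomputes them against msgpack as the baseline; the helper name `percent_vs_baseline` is only illustrative.

```python
# Measurements copied from the write and read results above.
size_bytes = {"flatbuffers": 27_527_680, "capnproto": 26_630_927, "msgpack": 22_250_152}
write_mean_s = {"flatbuffers": 5.698, "capnproto": 6.276, "msgpack": 6.224}
read_mean_s = {"flatbuffers": 3.676, "capnproto": 5.254, "msgpack": 3.310}


def percent_vs_baseline(values, fmt, baseline="msgpack"):
    """Return how much larger (positive) or smaller (negative) fmt is, in percent."""
    return (values[fmt] / values[baseline] - 1.0) * 100.0


for name, values in [("size", size_bytes), ("write", write_mean_s), ("read", read_mean_s)]:
    for fmt in ("flatbuffers", "capnproto"):
        delta = percent_vs_baseline(values, fmt)
        print(f"{name:5s} {fmt:11s} {delta:+6.1f}% vs msgpack")
```

For flatbuffers this gives a manifest roughly 24% larger, mean write time roughly 8% lower and mean read time roughly 11% higher than msgpack. The 0.366 s gap in mean read time is smaller than the 0.606 s standard deviation reported for the msgpack read runs, so the read difference should be taken with caution.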

## Decision

We are going to use flatbuffers for our metadata on-disk format.
Four files renamed without changes.
Binary file removed (-166 Bytes): icechunk-python/tests/data/test-repo/manifests/3AS90VB6T0GSE4J67XX0
Binary file added (+165 Bytes): icechunk-python/tests/data/test-repo/manifests/3C9WRKTE3PNDSNYBKD60
Binary file removed (-119 Bytes): icechunk-python/tests/data/test-repo/manifests/5ZW0V1ZQXQ16804S897G
Binary file added (+278 Bytes): icechunk-python/tests/data/test-repo/manifests/G94WC9CN23R53A63CRXG
Binary file added (+174 Bytes): icechunk-python/tests/data/test-repo/manifests/MWE7J4Y1V04W0DCXB8Z0
Binary file removed (-221 Bytes): icechunk-python/tests/data/test-repo/manifests/R666SBH9YHZMB04ZMARG
Binary file removed (-117 Bytes): icechunk-python/tests/data/test-repo/manifests/STWYFSPWFCD62MQTDM20
Binary file added (+241 Bytes): icechunk-python/tests/data/test-repo/manifests/T9PRDPYDRCEHC2GAVR8G
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.main/ZZZZZZZW.json
@@ -1 +1 @@
-{"snapshot":"FK0CX5JQH2DVDZ6PD6WG"}
+{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.main/ZZZZZZZX.json
@@ -1 +1 @@
-{"snapshot":"KCR7ES7JPCBY23X6MY3G"}
+{"snapshot":"K1BMYVG1HNVTNV1FSBH0"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.main/ZZZZZZZY.json
@@ -1 +1 @@
-{"snapshot":"QY5JG2BWG2VPPDJR4JE0"}
+{"snapshot":"RPA0WQCNM2N9HBBRHJQ0"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.main/ZZZZZZZZ.json
@@ -1 +1 @@
-{"snapshot":"VNPWJSZWB9G990XV1V8G"}
+{"snapshot":"6Q9GDTXKF17BGQVSQZFG"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.my-branch/ZZZZZZZX.json
@@ -1 +1 @@
-{"snapshot":"G0BR0G9NKT75ZZS7BWWG"}
+{"snapshot":"949AXZ49X764TMDC6D4G"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.my-branch/ZZZZZZZY.json
@@ -1 +1 @@
-{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
+{"snapshot":"SNF98D1SK7NWD5KQJM20"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/branch.my-branch/ZZZZZZZZ.json
@@ -1 +1 @@
-{"snapshot":"FK0CX5JQH2DVDZ6PD6WG"}
+{"snapshot":"A2RD2Y65PR6D3B6BR1K0"}
@@ -1 +1 @@
-{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
+{"snapshot":"SNF98D1SK7NWD5KQJM20"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/tag.it also works!/ref.json
@@ -1 +1 @@
-{"snapshot":"G0BR0G9NKT75ZZS7BWWG"}
+{"snapshot":"949AXZ49X764TMDC6D4G"}
2 changes: 1 addition & 1 deletion
icechunk-python/tests/data/test-repo/refs/tag.it works!/ref.json
@@ -1 +1 @@
-{"snapshot":"9W0W1DS2BKRV4MK2A2S0"}
+{"snapshot":"SNF98D1SK7NWD5KQJM20"}
Binary file added (+177 Bytes): icechunk-python/tests/data/test-repo/snapshots/6Q9GDTXKF17BGQVSQZFG
Binary file added (+787 Bytes): icechunk-python/tests/data/test-repo/snapshots/949AXZ49X764TMDC6D4G
Binary file removed (-490 Bytes): icechunk-python/tests/data/test-repo/snapshots/9W0W1DS2BKRV4MK2A2S0
Binary file added (+587 Bytes): icechunk-python/tests/data/test-repo/snapshots/A2RD2Y65PR6D3B6BR1K0
Binary file removed (-493 Bytes): icechunk-python/tests/data/test-repo/snapshots/FK0CX5JQH2DVDZ6PD6WG
Binary file removed (-646 Bytes): icechunk-python/tests/data/test-repo/snapshots/G0BR0G9NKT75ZZS7BWWG
Binary file added (+577 Bytes): icechunk-python/tests/data/test-repo/snapshots/K1BMYVG1HNVTNV1FSBH0
Binary file removed (-481 Bytes): icechunk-python/tests/data/test-repo/snapshots/KCR7ES7JPCBY23X6MY3G
Binary file removed (-429 Bytes): icechunk-python/tests/data/test-repo/snapshots/QY5JG2BWG2VPPDJR4JE0
Binary file added (+513 Bytes): icechunk-python/tests/data/test-repo/snapshots/RPA0WQCNM2N9HBBRHJQ0
Binary file added (+586 Bytes): icechunk-python/tests/data/test-repo/snapshots/SNF98D1SK7NWD5KQJM20
Binary file removed (-132 Bytes): icechunk-python/tests/data/test-repo/snapshots/VNPWJSZWB9G990XV1V8G
Binary file added (+172 Bytes): icechunk-python/tests/data/test-repo/transactions/949AXZ49X764TMDC6D4G
Binary file removed (-71 Bytes): icechunk-python/tests/data/test-repo/transactions/9W0W1DS2BKRV4MK2A2S0
Binary file added (+148 Bytes): icechunk-python/tests/data/test-repo/transactions/A2RD2Y65PR6D3B6BR1K0
Binary file removed (-72 Bytes): icechunk-python/tests/data/test-repo/transactions/FK0CX5JQH2DVDZ6PD6WG
Binary file removed (-141 Bytes): icechunk-python/tests/data/test-repo/transactions/G0BR0G9NKT75ZZS7BWWG
Binary file added (+235 Bytes): icechunk-python/tests/data/test-repo/transactions/K1BMYVG1HNVTNV1FSBH0
Binary file removed (-104 Bytes): icechunk-python/tests/data/test-repo/transactions/KCR7ES7JPCBY23X6MY3G
Binary file removed (-116 Bytes): icechunk-python/tests/data/test-repo/transactions/QY5JG2BWG2VPPDJR4JE0
Binary file added (+167 Bytes): icechunk-python/tests/data/test-repo/transactions/RPA0WQCNM2N9HBBRHJQ0
Binary file added (+148 Bytes): icechunk-python/tests/data/test-repo/transactions/SNF98D1SK7NWD5KQJM20