Compress models better #130
Conversation
generally looks good, main comment is to add more comments
cas_object/src/compression_scheme.rs (Outdated)

    Ok(regrouped.len() as u64)
}

fn bg4_split(data: &[u8]) -> [Vec<u8>; 4] {
Please leave a comment explaining this protocol/format; if there is a design doc, a link here as well would be nice.
https://github.com/huggingface/dedupe_estimator/blob/main/dedupe_estimator.cpp#L287C8-L314
But this code does seem different from the zipnn paper - https://arxiv.org/pdf/2411.05239 - where the sign bit is moved over (see the figure in the paper).
@seanses I think your code matches @ylow's dedupe_estimator, but it is worth checking whether the original zipnn idea will give even better results.
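For readers following the thread, here is a minimal, illustrative sketch of a byte-group-of-4 split and its inverse, assuming a simple strided layout; the function names and exact layout are illustrative, and the PR's actual bg4_split may differ in details:

```rust
// Illustrative sketch only, not the PR's code. BG4 groups every 4th byte of
// the input together so that corresponding bytes of consecutive 4-byte values
// (e.g. the exponent-carrying bytes of f32) land in the same group, which
// tends to compress better under lz4. Unlike the zipnn variant, the sign bit
// stays in place.
fn bg4_split_sketch(data: &[u8]) -> [Vec<u8>; 4] {
    let mut groups: [Vec<u8>; 4] = [vec![], vec![], vec![], vec![]];
    for (i, &b) in data.iter().enumerate() {
        groups[i % 4].push(b);
    }
    groups
}

// Inverse: interleave the four groups back into the original byte order.
fn bg4_regroup_sketch(groups: &[Vec<u8>; 4]) -> Vec<u8> {
    let total: usize = groups.iter().map(|g| g.len()).sum();
    (0..total).map(|i| groups[i % 4][i / 4]).collect()
}
```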
@port8080 Indeed, that's actually why I doubted in the first place whether ZipNN is applicable to our architecture. Moving the sign bit requires knowing where the sign bit is, which means 1) our chunker must produce clean cuts at f32/f64/f16/bf16 boundaries, and 2) we must use different algorithms for f32 vs. f16/bf16. Both require deep plumbing to learn the model file format from the Python library. So I think bit moving is more suitable for ad-hoc solutions than as a general approach. That said, I'd still like to try it and see how it performs, so I don't plan to merge this before a complete benchmark.
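To make the constraint concrete, here is a hypothetical sketch of the dtype/boundary knowledge that sign-bit extraction would need; the element size and endianness handling here are assumptions for illustration, not part of this PR:

```rust
// Hypothetical illustration, not part of this PR. For little-endian f16/bf16
// the sign bit is the top bit of the last byte of each 2-byte element; for
// f32 it is the top bit of the last byte of each 4-byte element. If a chunk
// does not start and end on element boundaries, or elem_size does not match
// the actual dtype, the extracted bits are garbage - hence the need to know
// the model file format.
fn extract_sign_bits(chunk: &[u8], elem_size: usize) -> Vec<bool> {
    chunk
        .chunks_exact(elem_size)
        .map(|elem| elem[elem_size - 1] & 0x80 != 0)
        .collect()
}
```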
As we discussed this morning, safetensors is ~30% of uploaded bytes, and we can dig into that format in particular to understand the layout. Of course, let us first measure to see if the advantage is worth it.
I strongly prefer not building something format-specific. Staying format-agnostic means this works automatically for dduf, gguf, TensorFlow files, PyTorch, etc., which in aggregate is a lot more than just safetensors.
Benchmarking request: I would love to see data for compression & decompression speeds, similar to Table 3 in the paper:
5.2 Compression and Decompression speed
We ran our tests on an Apple M1 Max machine with 10 cores and 64GB of RAM running macOS Sonoma 14.3. The tests run in a single process and on a single core. Table 3 shows the speed benefits of our method vs. vanilla compressors on 3 representative models.
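For what it's worth, here is a minimal sketch of how such single-core numbers could be gathered, assuming the lz4_flex crate; the repo's actual benchmarks may be set up differently:

```rust
use std::time::Instant;

// Hypothetical micro-benchmark: single-core lz4 compression/decompression
// throughput on one buffer, reported in MB/s. This only illustrates the
// measurement; the numbers quoted in this PR come from the repo's benchmarks.
fn measure_lz4_throughput(data: &[u8]) -> (f64, f64) {
    let t0 = Instant::now();
    let compressed = lz4_flex::compress_prepend_size(data);
    let compress_secs = t0.elapsed().as_secs_f64();

    let t1 = Instant::now();
    let decompressed = lz4_flex::decompress_size_prepended(&compressed).unwrap();
    let decompress_secs = t1.elapsed().as_secs_f64();
    assert_eq!(decompressed, data);

    let mb = data.len() as f64 / (1024.0 * 1024.0);
    (mb / compress_secs, mb / decompress_secs)
}
```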
Yeah, all coming along.
Improved compression ratio and speed. Added more benchmarks.
🚢
This is a wonderful addition to our toolbox. I will keep asking for more benchmarking numbers, but this code is good to go.
All numbers are available in the table at https://docs.google.com/spreadsheets/d/1SJ8Dv3EcNTuA41JXT-ggL4SzMnqs84hn/edit?usp=sharing&ouid=108235600614994105911&rtpof=true&sd=true
Implements ZipNN byte grouping compression based on lz4 (bg4-lz4).
For a comparison of compression ratio between bg4-lz4 and lz4, see the test, also copied below.
In summary, bg4-lz4 provides a compression ratio no worse than lz4's. bg4-lz4 achieves a 7.4% reduction for random f32 values in [-1, 1], 13% for those in [0, 2], 15% for random bf16 values in [-1, 1], and 26% for those in [0, 2], while lz4 yields none. bg4-lz4 doesn't compress random f64 values; I think we need 8 byte groups to make that compression happen. It doesn't compress random f16 values either; we would need to extract and shuffle the sign bit to make that compression happen.
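For reference, here is a hypothetical sketch of the kind of ratio comparison described above, assuming the rand and lz4_flex crates and compressing each byte group separately; the PR's actual test in cas_object/src/compression_scheme.rs may organize this differently:

```rust
use rand::Rng;

// Hypothetical sketch, not the PR's test: compare plain lz4 against
// "split into 4 strided byte groups, then lz4 each group".
fn compare_ratios(n: usize) {
    let mut rng = rand::thread_rng();
    // Random f32 values in [-1, 1], serialized as little-endian bytes.
    let bytes: Vec<u8> = (0..n)
        .flat_map(|_| rng.gen_range(-1.0f32..1.0).to_le_bytes())
        .collect();

    let lz4_len = lz4_flex::compress_prepend_size(&bytes).len();

    // BG4 split: byte i goes to group i % 4.
    let mut groups: [Vec<u8>; 4] = [vec![], vec![], vec![], vec![]];
    for (i, &b) in bytes.iter().enumerate() {
        groups[i % 4].push(b);
    }
    let bg4_lz4_len: usize = groups
        .iter()
        .map(|g| lz4_flex::compress_prepend_size(g).len())
        .sum();

    println!(
        "lz4 ratio: {:.3}, bg4-lz4 ratio: {:.3}",
        bytes.len() as f64 / lz4_len as f64,
        bytes.len() as f64 / bg4_lz4_len as f64
    );
}
```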
See benchmark results (single core) on Apple M2 Max here, also copied below.
Byte splitting occupies 2% - 34% of total compression time; for the data types where bg4-lz4 yields additional compression over lz4, splitting occupies 2% - 7% of total time. Byte regrouping occupies 17% - 57% of total decompression time. This can likely be optimized further with a group of "gather" SIMD instructions.
Will merge into xetcas before changing the default compression config in the cas client.