Compress models better #130
Conversation
generally looks good, main comment is to add more comments
cas_object/src/compression_scheme.rs (Outdated)

    Ok(regrouped.len() as u64)
}

fn bg4_split(data: &[u8]) -> [Vec<u8>; 4] {
Please leave a comment explaining this protocol/format; if there is a design doc, a link here as well would be nice.
https://github.com/huggingface/dedupe_estimator/blob/main/dedupe_estimator.cpp#L287C8-L314
But this code does seem different from the zipnn paper - https://arxiv.org/pdf/2411.05239 - where the sign bit is moved over (see the figure in the paper).
@seanses I think your code matches @ylow's dedupe_estimator, but it is worth checking whether the original zipnn idea will give even better results.
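For readers following the thread, here is a minimal, illustrative sketch of a byte-group-of-4 split and its inverse, assuming a simple strided layout; the function names and exact layout are illustrative, and the PR's actual bg4_split may differ in details:

```rust
// Illustrative sketch only, not the PR's code. BG4 groups every 4th byte of
// the input together so that corresponding bytes of consecutive 4-byte values
// (e.g. the exponent-carrying bytes of f32) land in the same group, which
// tends to compress better under lz4. Unlike the zipnn variant, the sign bit
// stays in place.
fn bg4_split_sketch(data: &[u8]) -> [Vec<u8>; 4] {
    let mut groups: [Vec<u8>; 4] = [vec![], vec![], vec![], vec![]];
    for (i, &b) in data.iter().enumerate() {
        groups[i % 4].push(b);
    }
    groups
}

// Inverse: interleave the four groups back into the original byte order.
fn bg4_regroup_sketch(groups: &[Vec<u8>; 4]) -> Vec<u8> {
    let total: usize = groups.iter().map(|g| g.len()).sum();
    (0..total).map(|i| groups[i % 4][i / 4]).collect()
}
```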
@port8080 Indeed, that's actually why I doubted in the first place whether ZipNN is applicable to our architecture. Moving the sign bit requires knowing where the sign bit is, which means 1) our chunker must produce clean cuts at f32/f64/f16/bf16 boundaries, and 2) we must use different algorithms for f32 vs. f16/bf16. Both require deep plumbing to learn the model file format from the Python library. So I think bit moving is more suitable for ad-hoc solutions than as a general approach. That said, I'd still like to try it and see how it performs, so I don't plan to merge this before a complete benchmark.
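To make the constraint concrete, here is a hypothetical sketch of the dtype/boundary knowledge that sign-bit extraction would need; the element size and endianness handling here are assumptions for illustration, not part of this PR:

```rust
// Hypothetical illustration, not part of this PR. For little-endian f16/bf16
// the sign bit is the top bit of the last byte of each 2-byte element; for
// f32 it is the top bit of the last byte of each 4-byte element. If a chunk
// does not start and end on element boundaries, or elem_size does not match
// the actual dtype, the extracted bits are garbage - hence the need to know
// the model file format.
fn extract_sign_bits(chunk: &[u8], elem_size: usize) -> Vec<bool> {
    chunk
        .chunks_exact(elem_size)
        .map(|elem| elem[elem_size - 1] & 0x80 != 0)
        .collect()
}
```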
As we discussed this morning, safetensors is ~30% of uploaded bytes, and we can dig into that format in particular to understand the layout. Of course, let us first measure to see if the advantage is worth it.
I strongly prefer not building something format-specific. Staying format-agnostic means this works automatically for dduf, gguf, TensorFlow files, PyTorch, etc., which in aggregate is a lot more than just safetensors.
Benchmarking request: I would love to see data for compression & decompression speeds, similar to Table 3 in the paper:
5.2 Compression and Decompression speed
We ran our tests on an Apple M1 Max machine with 10 cores and 64GB of RAM running macOS Sonoma 14.3. The tests run in a single process and on a single core. Table 3 shows the speed benefits of our method vs. vanilla compressors on 3 representative models.
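For what it's worth, here is a minimal sketch of how such single-core numbers could be gathered, assuming the lz4_flex crate; the repo's actual benchmarks may be set up differently:

```rust
use std::time::Instant;

// Hypothetical micro-benchmark: single-core lz4 compression/decompression
// throughput on one buffer, reported in MB/s. This only illustrates the
// measurement; the numbers quoted in this PR come from the repo's benchmarks.
fn measure_lz4_throughput(data: &[u8]) -> (f64, f64) {
    let t0 = Instant::now();
    let compressed = lz4_flex::compress_prepend_size(data);
    let compress_secs = t0.elapsed().as_secs_f64();

    let t1 = Instant::now();
    let decompressed = lz4_flex::decompress_size_prepended(&compressed).unwrap();
    let decompress_secs = t1.elapsed().as_secs_f64();
    assert_eq!(decompressed, data);

    let mb = data.len() as f64 / (1024.0 * 1024.0);
    (mb / compress_secs, mb / decompress_secs)
}
```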
Yeah, all coming along.
Improved compression ratio and speed. Added more benchmarks.
🚢
This is a wonderful addition to our toolbox. I will keep asking for more benchmarking numbers, but this code is good to go.
All numbers are available in the table at https://docs.google.com/spreadsheets/d/1SJ8Dv3EcNTuA41JXT-ggL4SzMnqs84hn/edit?usp=sharing&ouid=108235600614994105911&rtpof=true&sd=true
Implements ZipNN byte grouping compression based on lz4 (bg4-lz4).
For a comparison of compression ratio between bg4-lz4 and lz4, see the test, also copied below.
In summary, bg4-lz4 provides a compression ratio no worse than lz4's. bg4-lz4 achieves a 7.4% reduction for random f32 values in [-1, 1], 13% for those in [0, 2], 15% for random bf16 values in [-1, 1], and 26% for those in [0, 2], while lz4 yields none. bg4-lz4 doesn't compress random f64 values; I think we need 8 byte groups to make that compression happen. It doesn't compress random f16 values either; we would need to extract and shuffle the sign bit to make that compression happen.
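For reference, here is a hypothetical sketch of the kind of ratio comparison described above, assuming the rand and lz4_flex crates and compressing each byte group separately; the PR's actual test in cas_object/src/compression_scheme.rs may organize this differently:

```rust
use rand::Rng;

// Hypothetical sketch, not the PR's test: compare plain lz4 against
// "split into 4 strided byte groups, then lz4 each group".
fn compare_ratios(n: usize) {
    let mut rng = rand::thread_rng();
    // Random f32 values in [-1, 1], serialized as little-endian bytes.
    let bytes: Vec<u8> = (0..n)
        .flat_map(|_| rng.gen_range(-1.0f32..1.0).to_le_bytes())
        .collect();

    let lz4_len = lz4_flex::compress_prepend_size(&bytes).len();

    // BG4 split: byte i goes to group i % 4.
    let mut groups: [Vec<u8>; 4] = [vec![], vec![], vec![], vec![]];
    for (i, &b) in bytes.iter().enumerate() {
        groups[i % 4].push(b);
    }
    let bg4_lz4_len: usize = groups
        .iter()
        .map(|g| lz4_flex::compress_prepend_size(g).len())
        .sum();

    println!(
        "lz4 ratio: {:.3}, bg4-lz4 ratio: {:.3}",
        bytes.len() as f64 / lz4_len as f64,
        bytes.len() as f64 / bg4_lz4_len as f64
    );
}
```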
See benchmark results (single core) on Apple M2 Max here, also copied below.
Byte splitting occupies 2% - 34% of total compression time; for the data types where bg4-lz4 yields additional compression over lz4, splitting occupies 2% - 7% of total time. Byte regrouping occupies 17% - 57% of total decompression time. This can likely be optimized further with a group of "gather" SIMD instructions.
Will merge into xetcas before changing the default compression config in the cas client.