Compress models better #130

Merged: 10 commits merged into main, Jan 24, 2025
Conversation

@seanses (Collaborator) commented Dec 20, 2024:

Implements ZipNN byte grouping compression based on lz4 (bg4-lz4).
For a comparison of compression ratios between bg4-lz4 and lz4, see the test; results are also copied below.

In summary, bg4-lz4 provides a compression ratio no worse than plain lz4. bg4-lz4 achieves a 7.4% reduction for random f32 values in [-1, 1], 13% for those in [0, 2], 15% for random bf16 values in [-1, 1], and 26% for those in [0, 2], while lz4 yields none. bg4-lz4 doesn't compress random f64 values; I think we need 8 byte groups for compression to happen there. It doesn't compress random f16 values either; we would need to extract and shuffle the sign bit to make that work.

[table: compression ratio comparison between bg4-lz4 and lz4 over random f32/f64/f16/bf16 data]
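For readers new to byte grouping, here is a minimal sketch of the idea, illustrative only and not this PR's exact implementation (the lz4_flex crate is assumed purely for the example): de-interleave bytes by position modulo 4 so that the corresponding bytes of consecutive f32/bf16 values (e.g. the exponent-carrying bytes) land in the same group, then run lz4 over the regrouped buffer.

```rust
// Minimal sketch of BG4 byte grouping followed by lz4 (illustrative only).
// De-interleaving by index mod 4 puts "like" bytes of consecutive float values
// next to each other, which lz4 can exploit much better.
use lz4_flex::compress_prepend_size; // lz4 binding assumed for this sketch

fn bg4_split(data: &[u8]) -> [Vec<u8>; 4] {
    let mut groups: [Vec<u8>; 4] = Default::default();
    for (i, &b) in data.iter().enumerate() {
        groups[i % 4].push(b);
    }
    groups
}

fn bg4_lz4_compress(data: &[u8]) -> Vec<u8> {
    // Concatenate the four groups and compress the regrouped buffer as a whole.
    let regrouped: Vec<u8> = bg4_split(data).concat();
    compress_prepend_size(&regrouped)
}
```

Whether the PR compresses the concatenated groups as one buffer or each group separately is a detail of the actual implementation; the sketch only shows the grouping idea.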

See benchmark results (single core) on Apple M2 Max here, also copied below.

[table: single-core compression and decompression benchmark results on Apple M2 Max]

The byte split occupies 2%–34% of total compression time; for the data types where bg4-lz4 yields additional compression over lz4, the split occupies only 2%–7%. Byte regrouping occupies 17%–57% of total decompression time; this can likely be optimized further with "gather" SIMD instructions.
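For reference, here is a scalar sketch of the regrouping step on the decompression path (illustrative only; the actual code and any SIMD "gather" variant may differ):

```rust
// Scalar sketch of byte regrouping, the inverse of the split above (illustrative).
// A SIMD "gather" version would read the four groups with strided loads instead
// of this per-byte loop.
fn bg4_regroup(groups: &[Vec<u8>; 4]) -> Vec<u8> {
    let total: usize = groups.iter().map(|g| g.len()).sum();
    let mut out = Vec::with_capacity(total);
    for i in 0..total {
        // Byte i of the original buffer was placed at groups[i % 4][i / 4].
        out.push(groups[i % 4][i / 4]);
    }
    out
}
```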

Will merge into xetcas before changing the default compression config in the cas client.

@seanses changed the title from "Implements zipnn byte grouping compression based on lz4" to "Compress models better" on Dec 20, 2024
@seanses marked this pull request as ready for review on December 20, 2024
@assafvayner (Contributor) previously approved these changes on Dec 20, 2024:
generally looks good, main comment is to add more comments

Ok(regrouped.len() as u64)
}

fn bg4_split(data: &[u8]) -> [Vec<u8>; 4] {
A Contributor commented on the bg4_split code above:
please leave a comment explaining this protocol/format, if there is a design doc then a link here as well would be nice.

@port8080 (Contributor) replied:
https://github.com/huggingface/dedupe_estimator/blob/main/dedupe_estimator.cpp#L287C8-L314

But this code does seem different from the ZipNN paper (https://arxiv.org/pdf/2411.05239), where the sign bit is moved over; see the figure below.
[screenshot: byte grouping with sign-bit extraction, from the ZipNN paper]

@seanses I think your code matches @ylow's dedupe_estimator, but it is worth checking whether the original ZipNN idea will give even better results.

@seanses (Collaborator, Author) replied:
@port8080 Indeed, that's actually why I doubted in the first place whether ZipNN is applicable to our architecture. Moving the sign bit requires knowing where the sign bit is, which means 1) our chunker must produce clean cuts at f32/f64/f16/bf16 boundaries, and 2) we must use different algorithms for f32 and f16/bf16. Both require deep plumbing to learn the model file format from the Python library. So I think bit moving is more suitable for ad-hoc solutions than as a general approach. That said, I'd still like to try it and see how it performs, so I don't plan to merge this before a complete benchmark.
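To make the trade-off concrete, a ZipNN-style sign-bit split for bf16 might look roughly like the sketch below. This is illustrative only, not something this PR does and not the paper's exact format; it assumes the buffer is cleanly aligned to 2-byte bf16 boundaries and that the element type is known, which is exactly what the format-agnostic chunker cannot guarantee.

```rust
// Illustrative only: rough sketch of a sign-bit split for bf16, as discussed above.
// Assumes `data` starts on a bf16 boundary and the element type is known; neither
// holds for a format-agnostic chunker, which is the objection raised in this thread.
fn split_sign_bits_bf16(data: &[u8]) -> (Vec<u8>, Vec<u8>) {
    let n = data.len() / 2; // number of bf16 elements
    let mut signs = vec![0u8; (n + 7) / 8]; // packed sign bits, one per element
    let mut rest = Vec::with_capacity(data.len());
    for (i, pair) in data.chunks_exact(2).enumerate() {
        let hi = pair[1]; // little-endian bf16: pair[1] carries the sign bit + exponent
        if hi & 0x80 != 0 {
            signs[i / 8] |= 1 << (i % 8);
        }
        rest.push(pair[0]);
        rest.push(hi & 0x7f); // high byte with the sign bit cleared
    }
    (signs, rest)
}
```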

A Contributor replied:
As we discussed this morning, safetensors is ~30% of uploaded bytes, and we can dig into that format in particular to understand the layout. Of course, let us first measure to see if the advantage is worth it.

Another Contributor replied:
I strongly prefer not building something format-specific. Staying format-agnostic means this will work automatically for DDUF, GGUF, TensorFlow files, PyTorch, etc., which in aggregate is a lot more than just safetensors.

@port8080 (Contributor) left a review comment:
Benchmarking request: I would love to see data for compression and decompression speeds, similar to Table 3 in the paper.

5.2 Compression and Decompression speed
We ran our tests on an Apple M1 Max machine with 10 cores and 64GB of RAM running macOS Sonoma 14.3. The tests run in a single process and on a single core. Table 3 shows the speed benefits of our method vs. vanilla compressors on 3 representative models.


@seanses (Collaborator, Author) commented Dec 21, 2024, replying to the benchmarking request:

Yeah, all coming along.

@assafvayner dismissed their stale review on January 7, 2025: "final say from AB&YL"

@seanses requested a review from port8080 on January 21, 2025
@seanses (Collaborator, Author) commented Jan 21, 2025:

Improved compression ratio and speed. Added more benchmarks.

@seanses requested a review from hoytak on January 21, 2025
@port8080 (Contributor) left a review:
🚢 This is a wonderful addition to our toolbox. I will keep asking for more benchmarking numbers, but this code is good to go.

@seanses merged commit fb65c8b into main on Jan 24, 2025 (2 checks passed) and deleted the di/zipnn-compression branch on January 24, 2025.
@seanses (Collaborator, Author) commented Jan 28, 2025:

Results on real models (dedup against dev):
[chart: compression results on real models]


@seanses (Collaborator, Author) commented Jan 30, 2025:

Results on real datasets (dedup against dev):
[chart: compression results on real datasets]

This makes it clear that this algorithm should only be applied to models, not datasets.
