Up the chunk deduplication batch size. (#163)
Changed the number of chunks processed at once to 256 (increased from 1).
hoytak authored Feb 4, 2025
1 parent e37d424 commit 0bc9a42
Showing 1 changed file with 12 additions and 2 deletions.
data/src/clean.rs
```diff
@@ -39,12 +39,22 @@ use crate::remote_shard_interface::RemoteShardInterface;
 use crate::repo_salt::RepoSalt;
 use crate::PointerFile;
 
-// Chunking is the bottleneck, changing batch size doesn't have a big impact.
+// The tradeoff is the memory size of the buffer vs. the following benefits:
+// 1. Global dedup query -- when a chunk hash satisfies the condition for global dedup, we query
+// the global dedup server in parallel with chunking the rest of the data. We process the block
+// of chunks while that query executes, then wait for it to finish before either reprocessing
+// the chunks or exiting. A larger batch size means more work is done during the round trip if
+// the global dedup misses, and more is reprocessed if it hits.
+//
+// 2. When there are many shards, dedup from a single match proceeds as far as possible through
+// the chunks while still matching, which saves a lot of time with many shards and HMAC keys.
+//
+// 256 is chosen as a decent balance between memory and the above benefits.
 lazy_static! {
     pub static ref DEDUP_CHUNK_BATCH_SIZE: usize = std::env::var("XET_DEDUP_BATCHSIZE")
         .ok()
         .and_then(|s| s.parse().ok())
-        .unwrap_or(1);
+        .unwrap_or(256);
 }
 
 lazy_static! {
```
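For reference, the `XET_DEDUP_BATCHSIZE` environment variable still overrides the new default of 256, exactly as the `lazy_static` block above parses it. A minimal standalone sketch of that resolution; the function and `main` below are illustrative only, not code from the repository:

```rust
// Illustrative sketch (not part of the repo): resolve the dedup batch size
// the same way the lazy_static block above does.
fn dedup_chunk_batch_size() -> usize {
    std::env::var("XET_DEDUP_BATCHSIZE")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(256) // default raised from 1 to 256 by this commit
}

fn main() {
    // Prints 256 unless XET_DEDUP_BATCHSIZE is set to a parsable usize,
    // e.g. running with XET_DEDUP_BATCHSIZE=64 prints 64.
    println!("dedup batch size = {}", dedup_chunk_batch_size());
}
```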

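To make benefit (1) from the comment concrete: a larger batch gives the global-dedup round trip more local work to hide behind. The sketch below is a hypothetical thread-based stand-in for that overlap; the names, the sleep, and the hit test are all placeholders, not the crate's actual (async) dedup path:

```rust
use std::thread;
use std::time::Duration;

// Hypothetical illustration of benefit (1): overlap a global dedup query
// with local processing of the rest of the chunk batch.
fn process_batch(chunk_hashes: Vec<u64>, global_dedup_candidate: Option<u64>) {
    // Kick off the (simulated) global dedup round trip in the background.
    let query = global_dedup_candidate.map(|hash| {
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10)); // stand-in for the network round trip
            hash % 2 == 0 // placeholder: pretend even hashes hit the global index
        })
    });

    // Meanwhile, dedup the whole batch locally. With a batch size of 256 there
    // is far more useful work here to overlap the round trip with than with 1.
    for hash in &chunk_hashes {
        let _ = hash; // local shard lookups / hashing would happen here
    }

    // Wait for the query: on a hit the batch is reprocessed against the newly
    // fetched shard; on a miss the local results stand.
    if let Some(handle) = query {
        if handle.join().unwrap() {
            for hash in &chunk_hashes {
                let _ = hash; // reprocess against the fetched shard
            }
        }
    }
}

fn main() {
    process_batch((0..256).collect(), Some(42));
}
```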