perf: Reduce memory consumption for WARC reads and improve estimates (#3935)

This PR makes the following changes for `read_warc`:

- Reduces memory consumption
- Adds `WARC-Identified-Payload-Type` as an extracted metadata column
- Improves stats estimation for scan tasks that read WARC

## Reduced memory consumption

A single Common Crawl file is typically 1GB in size and decompresses to 5GB of data. Before this PR, Resident Set Size peaks at `5.15GB` while heap size peaks at `10.98GB`.

After this PR, Resident Set Size peaks at `4.3GB` while heap size peaks at `6.6GB`, which is more in line with expectations.

## Additional `WARC-Identified-Payload-Type` metadata column

For ease of filtering WARC records, we extract `WARC-Identified-Payload-Type` from the metadata as its own column. Since this header is optional, the column is often NULL.

## Stats estimation

A single Common Crawl .warc.gz file is typically 1GB in size, but takes up ~5GB of memory once decompressed. For a .warc.gz file with `145,717` records, before this PR we would estimate:

```
Stats = { Approx num rows = 9,912,769, Approx size bytes = 914.63 MiB, Accumulated selectivity = 1.00 }
```

After this PR, we now estimate:

```
Stats = { Approx num rows = 167,773, Approx size bytes = 4.34 GiB, Accumulated selectivity = 1.00 }
```

which is much closer to reality.

### Estimations with pushdowns

When doing `daft.read_warc("file.warc.gz").select("Content-Length")`, we estimate `1.32 MiB` and in reality store `1.13 MiB`.

When doing `daft.read_warc("cc-original.warc.gz").select("warc_content")`, we estimate `4.39 GiB` and in reality store `3.82 GiB`.
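To put the quoted figures side by side, here is a minimal sketch that computes how far each estimate is from the observed value. It is plain arithmetic on the numbers quoted in this message, not Daft's estimation logic; the helper name `overestimate_ratio` and the variable names are hypothetical and introduced only for illustration.

```python
# Figures copied from the commit message; this is NOT Daft's estimator,
# only arithmetic on the reported numbers. All names here are illustrative.

ACTUAL_ROWS = 145_717  # records in the example .warc.gz file


def overestimate_ratio(estimated: float, actual: float) -> float:
    """Return estimated / actual; 1.0 would be a perfect estimate."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return estimated / actual


# Row-count estimates before and after the change.
row_ratio_before = overestimate_ratio(9_912_769, ACTUAL_ROWS)
row_ratio_after = overestimate_ratio(167_773, ACTUAL_ROWS)

# Size estimates with column pushdowns (units cancel in the ratio).
content_length_ratio = overestimate_ratio(1.32, 1.13)  # MiB estimated vs stored
warc_content_ratio = overestimate_ratio(4.39, 3.82)    # GiB estimated vs stored

print(f"rows, before:          {row_ratio_before:.1f}x")
print(f"rows, after:           {row_ratio_after:.2f}x")
print(f"select Content-Length: {content_length_ratio:.2f}x")
print(f"select warc_content:   {warc_content_ratio:.2f}x")
```

On these figures the row estimate goes from roughly 68x the true record count to roughly 1.15x, and both pushdown size estimates land within about 15 to 17 percent of what is actually stored.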
1 parent 8163982, commit c7df611
Showing 10 changed files with 221 additions and 87 deletions.