
Read the footers in parallel when reading multiple Parquet files #17957

Open · wants to merge 2 commits into base: branch-25.04

Conversation

@vuule (Contributor) commented on Feb 7, 2025

Description

When reading multiple files, all data (i.e., page) IO is performed in the same "batch", allowing parallel IO operations (provided by kvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from a high level of parallelism.

This PR adds a global thread pool meant for any host-side work. This pool is used to read the Parquet file footers in parallel.
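As a minimal sketch of the approach, assuming the submit_task interface of the bshoshany/thread-pool library (the BS::thread_pool seen in the diff below) and placeholder source/footer types, parallel footer reads could look like this:

```cpp
#include <BS_thread_pool.hpp>

#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Placeholder types for illustration; the real reader works with cudf
// datasources and Parquet file metadata, not these stand-ins.
struct footer {};
struct source {
  footer read_footer() const { return {}; }
};

// Global pool for host-side work, as described above.
BS::thread_pool& host_worker_pool()
{
  static BS::thread_pool pool(std::min(32u, std::thread::hardware_concurrency()));
  return pool;
}

// Submit one footer-read task per source, then collect the results;
// the pool overlaps the per-file IO instead of reading footers serially.
std::vector<footer> read_footers(std::vector<source> const& sources)
{
  std::vector<std::future<footer>> tasks;
  tasks.reserve(sources.size());
  for (auto const& src : sources) {
    tasks.emplace_back(host_worker_pool().submit_task([&src] { return src.read_footer(); }));
  }

  std::vector<footer> footers;
  footers.reserve(tasks.size());
  for (auto& task : tasks) {
    footers.push_back(task.get());
  }
  return footers;
}
```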

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented on Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions (bot) added the libcudf and CMake labels on Feb 7, 2025
@vuule added the Performance, improvement, and non-breaking labels on Feb 8, 2025
@vuule (Contributor, author) commented on Feb 8, 2025

/ok to test

@mhaseeb123 self-requested a review on February 8, 2025

The review thread below is anchored on this diff hunk (truncated):

    BS::thread_pool& host_worker_pool()
    {
      static const std::size_t default_pool_size = std::min(32u, std::thread::hardware_concurrency());
Contributor:

Do we have a policy for how we choose the default thread pool size here? For a workload reading ~360 Parquet files of 128 MB each, the default pool size of 32 might be a little small: the overall workload took about 20 s to read the data. With LIBCUDF_NUM_HOST_WORKERS=256 it took 10 s, and with no parallelism (as on main) it took 60 s.
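For context, an environment override like the one benchmarked above might be wired up as in this sketch; only the variable name LIBCUDF_NUM_HOST_WORKERS comes from the comment, and the helper itself is hypothetical:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: use LIBCUDF_NUM_HOST_WORKERS if set, else the default.
std::size_t pool_size_or_env(std::size_t default_size)
{
  if (char const* env = std::getenv("LIBCUDF_NUM_HOST_WORKERS")) {
    return std::stoul(env);  // no validation here; a real impl should range-check
  }
  return default_size;
}
```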

Contributor (author):

I used 32 because it worked well for my host compression thread pool (where the tasks include H2D/D2H copies). I'm fine with just using hardware_concurrency, given that this pool is intended for host-only work.

Contributor:

I have an Intel i9-13900K, which has 8 performance cores and 16 efficiency cores; std::thread::hardware_concurrency() returns 32 for it. If I were reading a lot of files and the reader used all 32 threads, I'd have nothing left for anything else and the OS might stop responding. Perhaps 3/4 or 7/8 of hardware_concurrency() would be better?
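A sketch of that suggestion (not the PR's code; the 3/4 factor is just the example given above):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Leave roughly a quarter of the hardware threads free for the rest of
// the system; hardware_concurrency() may return 0, so clamp to at least 1.
std::size_t fractional_pool_size()
{
  auto const hc = std::thread::hardware_concurrency();  // e.g. 32 on an i9-13900K
  return std::max(1u, hc - hc / 4);                     // 3/4 of hc
}
```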

Contributor:

Another thought (feel free to ignore): this will be most noticeable for remote file systems, where we'll be network-bound and spending a lot of time doing nothing. In Python, a single thread making all the network requests asynchronously would likely work as well as a large thread pool. Would something similar be good here?

Contributor (author):

Yeah, we can do the same as we do when reading the actual data: loop over all sources in a single function. This would take some surgery, but IMO it's worth a try, given that this specific use of the thread pool requires more threads than we normally want.
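A rough sketch of that shape, with an invented non-blocking source API (kvikIO's actual interface differs, and the real change would go through cudf's datasource layer):

```cpp
#include <future>
#include <vector>

struct footer {};

// Hypothetical source whose read returns immediately with a future.
struct source {
  std::future<footer> read_footer_async() const
  {
    // Placeholder: a real implementation would issue non-blocking IO
    // instead of completing synchronously.
    std::promise<footer> p;
    p.set_value(footer{});
    return p.get_future();
  }
};

// One function starts every footer request before waiting on any of them,
// so a single caller overlaps all the round trips without a thread per file.
std::vector<footer> read_all_footers(std::vector<source> const& sources)
{
  std::vector<std::future<footer>> pending;
  pending.reserve(sources.size());
  for (auto const& src : sources) {
    pending.push_back(src.read_footer_async());
  }

  std::vector<footer> footers;
  footers.reserve(pending.size());
  for (auto& request : pending) {
    footers.push_back(request.get());
  }
  return footers;
}
```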

@vuule marked this pull request as ready for review on February 10, 2025
@vuule requested reviews from a team as code owners on February 10, 2025
@vuule added the DO NOT MERGE label on Feb 11, 2025