
Read the footers in parallel when reading multiple Parquet files #17957

Open · wants to merge 2 commits into base: branch-25.04

Conversation

@vuule (Contributor) commented on Feb 7, 2025

Description

When reading multiple files, all data (i.e., page) IO is performed in the same "batch", allowing parallel IO operations (provided by kvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from a high level of parallelism.

This PR adds a global thread pool meant for any host-side work. This pool is used to read the Parquet file footers in parallel.
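As a minimal sketch of the approach, assuming the submit_task interface of the bshoshany/thread-pool library (the BS::thread_pool seen in the diff below) and placeholder source/footer types, parallel footer reads could look like this:

```cpp
#include <BS_thread_pool.hpp>

#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Placeholder types for illustration; the real reader works with cudf
// datasources and Parquet file metadata, not these stand-ins.
struct footer {};
struct source {
  footer read_footer() const { return {}; }
};

// Global pool for host-side work, as described above.
BS::thread_pool& host_worker_pool()
{
  static BS::thread_pool pool(std::min(32u, std::thread::hardware_concurrency()));
  return pool;
}

// Submit one footer-read task per source, then collect the results;
// the pool overlaps the per-file IO instead of reading footers serially.
std::vector<footer> read_footers(std::vector<source> const& sources)
{
  std::vector<std::future<footer>> tasks;
  tasks.reserve(sources.size());
  for (auto const& src : sources) {
    tasks.emplace_back(host_worker_pool().submit_task([&src] { return src.read_footer(); }));
  }

  std::vector<footer> footers;
  footers.reserve(tasks.size());
  for (auto& task : tasks) {
    footers.push_back(task.get());
  }
  return footers;
}
```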

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented on Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@github-actions (bot) added the libcudf and CMake labels on Feb 7, 2025
@vuule added the Performance, improvement, and non-breaking labels on Feb 8, 2025
@vuule (Contributor, author) commented on Feb 8, 2025

/ok to test

@mhaseeb123 self-requested a review on February 8, 2025

The review thread below is anchored on this diff hunk (truncated):

    BS::thread_pool& host_worker_pool()
    {
      static const std::size_t default_pool_size = std::min(32u, std::thread::hardware_concurrency());
Contributor:

Do we have a policy for how we choose the default thread pool size here? For a workload reading ~360 Parquet files of 128 MB each, the default pool size of 32 might be a little small: the overall workload took about 20 s to read the data. With LIBCUDF_NUM_HOST_WORKERS=256 it took 10 s, and with no parallelism (as on main) it took 60 s.
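For context, an environment override like the one benchmarked above might be wired up as in this sketch; only the variable name LIBCUDF_NUM_HOST_WORKERS comes from the comment, and the helper itself is hypothetical:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: use LIBCUDF_NUM_HOST_WORKERS if set, else the default.
std::size_t pool_size_or_env(std::size_t default_size)
{
  if (char const* env = std::getenv("LIBCUDF_NUM_HOST_WORKERS")) {
    return std::stoul(env);  // no validation here; a real impl should range-check
  }
  return default_size;
}
```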

Contributor (author):

I used 32 because it worked well for my host compression thread pool (where the tasks include H2D/D2H copies). I'm fine with just using hardware_concurrency, given that this pool is intended for host-only work.

Contributor:

I have an Intel i9-13900K, which has 8 performance cores and 16 efficiency cores; std::thread::hardware_concurrency() returns 32 for it. If I were reading a lot of files and the reader used all 32 threads, I'd have nothing left for anything else and the OS might stop responding. Perhaps 3/4 or 7/8 of hardware_concurrency() would be better?
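A sketch of that suggestion (not the PR's code; the 3/4 factor is just the example given above):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Leave roughly a quarter of the hardware threads free for the rest of
// the system; hardware_concurrency() may return 0, so clamp to at least 1.
std::size_t fractional_pool_size()
{
  auto const hc = std::thread::hardware_concurrency();  // e.g. 32 on an i9-13900K
  return std::max(1u, hc - hc / 4);                     // 3/4 of hc
}
```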

Contributor:

Another thought (feel free to ignore): this will be most noticeable for remote file systems, where we'll be network-bound and spending a lot of time doing nothing. In Python, a single thread making all the network requests asynchronously would likely work as well as a large thread pool. Would something similar be good here?

Contributor (author):

Yeah, we can do the same as we do when reading the actual data: loop over all sources in a single function. This would take some surgery, but IMO it's worth a try, given that this specific use of the thread pool requires more threads than we normally want.
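A rough sketch of that shape, with an invented non-blocking source API (kvikIO's actual interface differs, and the real change would go through cudf's datasource layer):

```cpp
#include <future>
#include <vector>

struct footer {};

// Hypothetical source whose read returns immediately with a future.
struct source {
  std::future<footer> read_footer_async() const
  {
    // Placeholder: a real implementation would issue non-blocking IO
    // instead of completing synchronously.
    std::promise<footer> p;
    p.set_value(footer{});
    return p.get_future();
  }
};

// One function starts every footer request before waiting on any of them,
// so a single caller overlaps all the round trips without a thread per file.
std::vector<footer> read_all_footers(std::vector<source> const& sources)
{
  std::vector<std::future<footer>> pending;
  pending.reserve(sources.size());
  for (auto const& src : sources) {
    pending.push_back(src.read_footer_async());
  }

  std::vector<footer> footers;
  footers.reserve(pending.size());
  for (auto& request : pending) {
    footers.push_back(request.get());
  }
  return footers;
}
```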

@vuule marked this pull request as ready for review on February 10, 2025
@vuule requested reviews from a team as code owners on February 10, 2025
@vuule added the DO NOT MERGE label on Feb 11, 2025