Read the footers in parallel when reading multiple Parquet files #17957
base: branch-25.04
Conversation
/ok to test
BS::thread_pool& host_worker_pool()
{
  static const std::size_t default_pool_size = std::min(32u, std::thread::hardware_concurrency());
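As a reference point for the sizing discussion below, here is a minimal, self-contained sketch (not the PR's actual code) of how the default could be made overridable via the LIBCUDF_NUM_HOST_WORKERS environment variable mentioned in this thread; `host_worker_pool_size` is a hypothetical helper name:

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical sketch: pool size defaults to min(32, hardware_concurrency()),
// but can be overridden with the LIBCUDF_NUM_HOST_WORKERS environment variable.
std::size_t host_worker_pool_size()
{
  if (char const* env = std::getenv("LIBCUDF_NUM_HOST_WORKERS")) {
    return static_cast<std::size_t>(std::stoul(env));
  }
  return std::min<std::size_t>(32, std::thread::hardware_concurrency());
}
```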
Do we have a policy for how we choose the default thread pool size here? For a workload reading ~360 Parquet files of 128 MB each, the default pool size of 32 might have been a little small: the overall workload took about 20 s to read the data. With LIBCUDF_NUM_HOST_WORKERS=256 it took 10 s, and with no parallelism (as on main) it took 60 s.
I used 32 because it worked well for my host compression thread pool (where the tasks include H2D/D2H copies). I'm fine with just using hardware_concurrency, given that this is intended for host-only work.
So I have an Intel i9-13900K, which has 8 performance cores and 16 efficiency cores; std::thread::hardware_concurrency() returns 32 for it. If I were reading a lot of files and my machine tried to use 32 threads, I'd have nothing available for anything else and the OS might stop responding. Perhaps 3/4 or 7/8 of hardware_concurrency() would be better?
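The fractional-sizing idea above could be sketched as follows (assumption: a floor of one thread, so the pool never degenerates to zero; `reserved_pool_size` is a hypothetical name):

```cpp
#include <algorithm>
#include <thread>

// Hedged sketch: use 3/4 of the reported hardware threads for the pool,
// leaving the rest for the OS and other work; never go below one thread.
unsigned reserved_pool_size(unsigned hw_threads)
{
  return std::max(1u, hw_threads * 3 / 4);
}

// Example call site:
// unsigned n = reserved_pool_size(std::thread::hardware_concurrency());
```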
Another thought (feel free to ignore): this will be most notable for remote file systems, where we'll be network-bound and spending a lot of time doing nothing. In Python, a single thread making all the network requests asynchronously would likely work as well as a large thread pool. Would something similar be good here?
Yeah, we can do the same as we do when reading the actual data: loop over all sources in a single function. That would take some surgery, but IMO it's worth a try, given that this specific use of the thread pool requires more threads than we normally want.
Description
When reading multiple files, all data (i.e. page) IO is performed in the same "batch", allowing parallel IO operations (provided by kvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from a high level of parallelism.
This PR adds a global thread pool meant for any host-side work. This pool is used to read the Parquet file footers in parallel.
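A minimal sketch of the pattern this PR describes: submit one footer-read task per source and collect the results in order. This is not the PR's implementation; std::async stands in for the global thread pool, and `read_footer_bytes` is a placeholder for the real footer reader:

```cpp
#include <future>
#include <string>
#include <vector>

// Placeholder for the real footer reader; here it just tags the path.
std::string read_footer_bytes(std::string const& path)
{
  return "footer:" + path;
}

// Launch one task per file, then wait on the futures in order so the
// returned footers line up with the input paths.
std::vector<std::string> read_footers_in_parallel(std::vector<std::string> const& paths)
{
  std::vector<std::future<std::string>> tasks;
  tasks.reserve(paths.size());
  for (auto const& p : paths) {
    tasks.push_back(std::async(std::launch::async, read_footer_bytes, p));
  }
  std::vector<std::string> footers;
  footers.reserve(tasks.size());
  for (auto& t : tasks) { footers.push_back(t.get()); }
  return footers;
}
```

The serial-read cost the description mentions comes from waiting on each footer before starting the next request; launching all the reads first lets the IO latencies overlap.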
Checklist