This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Suboptimal parallelism #6

Open · jeromekelleher opened this issue Jul 17, 2020 · 3 comments

Comments

@jeromekelleher (Contributor)

I ran a conversion of a 45 GB PLINK file as described over in https://github.com/pystatgen/sgkit/issues/48

Overall it worked great, and I was super impressed by how smooth the process was. However, processor utilisation was a bit lower than I would have expected. Here's the output from /usr/bin/time:

```
24225.98user 8095.22system 6:36:25elapsed 135%CPU (0avgtext+0avgdata 27802828maxresident)k
218960383inputs+82680936outputs (214major+129016633minor)pagefaults 0swaps
```

I ran this on a server with 40 threads, and I would have expected the process to basically max them all out. Instead, usage rarely went over about 300% - it feels like there was a lot of lock contention or something similar. I don't think I/O was the problem: it was running off spinning disk, so you might expect the random access pattern to hurt, but I kept an eye on atop and the disk never looked like the bottleneck.
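In case it's useful for digging into this later, here's a minimal sketch of how the same run could be wrapped in Dask's local diagnostics to see per-second CPU usage (`build_conversion_graph` is a placeholder for whatever produces the conversion's Dask collections, not a real sgkit function):

```python
import dask
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Placeholder: build_conversion_graph stands in for whatever produces the
# conversion's Dask collections; it is not a real sgkit function.
arrays = build_conversion_graph()

# ResourceProfiler samples process-wide CPU and memory every `dt` seconds;
# Profiler records when each task ran and on which worker it executed.
with Profiler() as prof, ResourceProfiler(dt=1.0) as rprof:
    dask.compute(arrays)

# Produces an interactive Bokeh plot; sustained ~300% CPU on a 40-thread box
# shows up clearly against the task timeline.
visualize([prof, rprof], filename="profile.html", show=False)
```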

It's not particularly important to get into this now, I think: whether this takes 6 hours or 1 hour doesn't make much practical difference at the moment. It'll be something to keep an eye on at some point, though.

@eric-czech (Collaborator)

An update on this is at: https://github.com/pystatgen/sgkit/issues/48#issuecomment-666536828.

I think that largely solves this, but let me know if you find otherwise.

@tomwhite (Collaborator) commented Aug 3, 2020

If that solves it, we should think about how to make the 'processes' scheduler the default, or at least issue a warning to the user.
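For reference, here's a minimal sketch of the two ways a caller can opt into the multiprocessing scheduler with plain Dask today (nothing here is sgkit-specific):

```python
import dask
import dask.array as da

# A stand-in workload; in the conversion this would be the PLINK -> Zarr graph.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

# Per-call: route just this compute through the multiprocessing scheduler,
# which sidesteps the GIL contention seen with the default threaded scheduler.
result = x.mean().compute(scheduler="processes")

# Or for a whole block of code (what a library default would effectively do):
with dask.config.set(scheduler="processes"):
    result = x.mean().compute()
```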

@eric-czech (Collaborator)

+1 to a warning and an argument on a TBD export function. We should probably do the same in the readers, defaulting to the threads scheduler there, since I think splitting dataframes into arrays causes a lot of worker communication.
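A rough sketch of what that argument-plus-warning pattern could look like (the function name, signature, and defaults below are placeholders, not an agreed API):

```python
import warnings
import dask

def export_to_zarr(ds, store, scheduler=None):
    """Hypothetical export helper: write an xarray dataset to Zarr, with the
    Dask scheduler exposed as an explicit argument."""
    if scheduler is None:
        warnings.warn(
            "No Dask scheduler specified; the default threaded scheduler may "
            "leave most cores idle for this workload. Consider "
            "scheduler='processes' or a dask.distributed Client.",
            UserWarning,
        )
        scheduler = "threads"
    # Run the write under the chosen scheduler for this call only.
    with dask.config.set(scheduler=scheduler):
        ds.to_zarr(store, mode="w")
```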
