This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Suboptimal parallelism #6

Open · jeromekelleher opened this issue Jul 17, 2020 · 3 comments

Comments

@jeromekelleher (Contributor)

I ran a conversion of a 45 GB PLINK file as described over in https://github.com/pystatgen/sgkit/issues/48

Overall it worked great, and I was super impressed by how smooth the process was. However, processor utilisation was a bit lower than I would have expected. Here's the output from /usr/bin/time:

```
24225.98user 8095.22system 6:36:25elapsed 135%CPU (0avgtext+0avgdata 27802828maxresident)k
218960383inputs+82680936outputs (214major+129016633minor)pagefaults 0swaps
```

I ran this on a server with 40 threads, and I would have expected the process to basically max them all out. Instead, usage rarely went over about 300% - it feels like there was a lot of lock contention or something similar. I don't think I/O was the problem: it was running off spinning disk, so you might expect the random access pattern to hurt, but I kept an eye on atop and the disk never looked like the bottleneck.
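In case it's useful for digging into this later, here's a minimal sketch of how the same run could be wrapped in Dask's local diagnostics to see per-second CPU usage (`build_conversion_graph` is a placeholder for whatever produces the conversion's Dask collections, not a real sgkit function):

```python
import dask
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Placeholder: build_conversion_graph stands in for whatever produces the
# conversion's Dask collections; it is not a real sgkit function.
arrays = build_conversion_graph()

# ResourceProfiler samples process-wide CPU and memory every `dt` seconds;
# Profiler records when each task ran and on which worker it executed.
with Profiler() as prof, ResourceProfiler(dt=1.0) as rprof:
    dask.compute(arrays)

# Produces an interactive Bokeh plot; sustained ~300% CPU on a 40-thread box
# shows up clearly against the task timeline.
visualize([prof, rprof], filename="profile.html", show=False)
```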

It's not particularly important to get into this now, I think: whether this takes 6 hours or 1 hour doesn't make much practical difference at the moment. It'll be something to keep an eye on at some point, though.

@eric-czech (Collaborator)

An update on this is at: https://github.com/pystatgen/sgkit/issues/48#issuecomment-666536828.

I think that largely solves this, but let me know if you find otherwise.

@tomwhite (Collaborator) commented Aug 3, 2020

If that solves it, we should think about how to make the 'processes' scheduler the default, or at least issue a warning to the user.
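For reference, here's a minimal sketch of the two ways a caller can opt into the multiprocessing scheduler with plain Dask today (nothing here is sgkit-specific):

```python
import dask
import dask.array as da

# A stand-in workload; in the conversion this would be the PLINK -> Zarr graph.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))

# Per-call: route just this compute through the multiprocessing scheduler,
# which sidesteps the GIL contention seen with the default threaded scheduler.
result = x.mean().compute(scheduler="processes")

# Or for a whole block of code (what a library default would effectively do):
with dask.config.set(scheduler="processes"):
    result = x.mean().compute()
```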

@eric-czech (Collaborator)

+1 to a warning and an argument on a TBD export function. We should probably do the same in the readers, defaulting to the threads scheduler there, since I think splitting dataframes into arrays causes a lot of worker communication.
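A rough sketch of what that argument-plus-warning pattern could look like (the function name, signature, and defaults below are placeholders, not an agreed API):

```python
import warnings
import dask

def export_to_zarr(ds, store, scheduler=None):
    """Hypothetical export helper: write an xarray dataset to Zarr, with the
    Dask scheduler exposed as an explicit argument."""
    if scheduler is None:
        warnings.warn(
            "No Dask scheduler specified; the default threaded scheduler may "
            "leave most cores idle for this workload. Consider "
            "scheduler='processes' or a dask.distributed Client.",
            UserWarning,
        )
        scheduler = "threads"
    # Run the write under the chosen scheduler for this call only.
    with dask.config.set(scheduler=scheduler):
        ds.to_zarr(store, mode="w")
```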
