
Multiplexing mutex optimization for improvement of multi-GPU scalability #147

Merged · 2 commits · Jul 8, 2018

Conversation

dje-dev (Contributor) commented Jul 8, 2018

Please see the attached text document for a full description of this mutex optimization for multi-GPU scaling:
mutex_optimization_comment.txt

dubslow (Member) commented Jul 8, 2018

Contents of the text file:

Adjust SearchWorker::PickNodeToExtend() to take the lock on nodes_mutex_ only once, improving multi-GPU scalability

It has been noted that with the current multiplexing logic, multi-GPU scaling is poor (sharply sublinear, or possibly even below 1.0x). See, for example:
glinscott/leela-chess#687

This simple adjustment to SearchWorker::PickNodeToExtend() appears to dramatically improve multi-GPU performance. We simply take the lock on nodes_mutex_ only once (at the beginning of the method) and hold it, rather than performing three separate acquisitions and releases.
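For illustration, here is a minimal sketch of the before/after locking pattern, written with standard std::shared_mutex / std::shared_lock rather than lc0's own Mutex wrappers; the phase bodies are placeholders, not the actual selection code:

```cpp
#include <shared_mutex>

// Stand-in for Search::nodes_mutex_ (lc0 uses its own wrapper types).
std::shared_mutex nodes_mutex_;

// Old pattern: three separate acquire/release cycles per call.
void PickNodeToExtend_Old() {
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 1 */ }
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 2 */ }
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 3 */ }
}

// New pattern: acquire once at the top and hold it for the whole method.
void PickNodeToExtend_New() {
  std::shared_lock<std::shared_mutex> lock(nodes_mutex_);
  /* phase 1 */
  /* phase 2 */
  /* phase 3 */
}
```

Holding one lock for the whole pass trades slightly longer hold time for far fewer contended acquire/release cycles per playout.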

The following tests appear to confirm the value of this change:
-- The "nodes per second" figures in the various configurations shown below (using "go movetime 20000" from the starting position) show that fp16 performance scales by 1.96x in a 3-GPU configuration (from 53,001 to 103,178 nps), compared to the current scaling of 0.82x (from 48,140 to 39,236). In the case of fp32, the improvement is even better, with 3 GPUs yielding scaling of 2.95x.
-- In a test match of 40 games at 2 seconds per move, the 3-GPU version scored 4 wins and 0 losses (using fp16 for both).

[  NPS  ] [COMMAND LINE]
[ 25,488] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0)"
[ 48,140] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0)"
[ 25,460] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0)"
[ 53,001] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0)"

[ 46,024] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0),(backend=cudnn,gpu=1),(backend=cudnn,gpu=2)"
[ 39,236] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0),(backend=cudnnfp16,gpu=1),(backend=cudnnfp16,gpu=2)"
[ 75,047] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0),(backend=cudnn,gpu=1),(backend=cudnn,gpu=2)"
[103,178] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0),(backend=cudnnfp16,gpu=1),(backend=cudnnfp16,gpu=2)"

NOTES:

  • lc0_org is a Windows binary compiled from the codebase as of 7 July (source unmodified)
  • lc0_mutex is a Windows binary compiled from the codebase as of 7 July, with the single mutex-optimization modification in PickNodeToExtend
  • all 3 GPUs are Titan V (Volta)
  • the weights file is from network 395
  • the performance figures (nps) reported above are from a single run, but these tests were all performed three times and the timings were highly stable

mooskagh (Member) commented Jul 8, 2018

The idea was that most of the code runs behind a shared mutex, so multiple threads can do most of the work simultaneously, but apparently the locking overhead is too large when there is contention.

Would you mind also checking whether replacing search_->nodes_mutex_ with Mutex (rather than SharedMutex) and locking it exclusively everywhere would slow things down? That way we could get rid of the shared mutex completely.
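For reference, a minimal sketch of the trade-off being asked about, using standard std::shared_mutex / std::mutex rather than lc0's SharedMutex/Mutex wrappers:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex shared_nodes_mutex;  // readers may overlap; writers are exclusive
std::mutex plain_nodes_mutex;          // every holder is exclusive

void ReadUnderSharedMutex() {
  std::shared_lock<std::shared_mutex> lock(shared_nodes_mutex);
  // Any number of threads may hold the shared lock concurrently.
}

void WriteUnderSharedMutex() {
  std::unique_lock<std::shared_mutex> lock(shared_nodes_mutex);
  // Exclusive: blocks readers and other writers.
}

void AccessUnderPlainMutex() {
  std::lock_guard<std::mutex> lock(plain_nodes_mutex);
  // Always exclusive, but typically cheaper to acquire and release.
}
```

The shared variant allows concurrent readers but its acquire path is heavier under contention; the plain mutex serializes everything but is cheaper per acquisition, which is the overhead question raised above.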

dje-dev (Contributor, Author) commented Jul 8, 2018

As suggested, I ran an experiment replacing search_->nodes_mutex_ with Mutex and locking it exclusively everywhere. Performance tests showed clearly inferior results (easily statistically significant) in the 3-GPU case. For example, the last two NPS figures reported above decreased to 71,512 and 84,833 (from 75,047 and 103,178).
