
Multiplexing mutex optimization for improvement of multi-GPU scalability #147

Merged · 2 commits · Jul 8, 2018

Conversation

dje-dev (Contributor) commented Jul 8, 2018

Please see the attached text document for a full description of this mutex optimization for multi-GPU scaling:
mutex_optimization_comment.txt

dubslow (Member) commented Jul 8, 2018

Contents of the text file:

Adjust SearchWorker::PickNodeToExtend() to take the lock on nodes_mutex_ only once, improving multi-GPU scalability

It has been noted that with the current multiplexing logic, multi-GPU scaling is poor (sharply sublinear, or possibly even below 1.0x). See, for example:
glinscott/leela-chess#687

This simple adjustment to SearchWorker::PickNodeToExtend() appears to dramatically improve multi-GPU performance. We simply take the lock on nodes_mutex_ only once (at the beginning of the method) and hold it, rather than performing three separate acquisitions and releases.
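For illustration, here is a minimal sketch of the before/after locking pattern, written with standard std::shared_mutex / std::shared_lock rather than lc0's own Mutex wrappers; the phase bodies are placeholders, not the actual selection code:

```cpp
#include <shared_mutex>

// Stand-in for Search::nodes_mutex_ (lc0 uses its own wrapper types).
std::shared_mutex nodes_mutex_;

// Old pattern: three separate acquire/release cycles per call.
void PickNodeToExtend_Old() {
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 1 */ }
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 2 */ }
  { std::shared_lock<std::shared_mutex> lock(nodes_mutex_); /* phase 3 */ }
}

// New pattern: acquire once at the top and hold it for the whole method.
void PickNodeToExtend_New() {
  std::shared_lock<std::shared_mutex> lock(nodes_mutex_);
  /* phase 1 */
  /* phase 2 */
  /* phase 3 */
}
```

Holding one lock for the whole pass trades slightly longer hold time for far fewer contended acquire/release cycles per playout.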

The following tests appear to confirm the value of this change:
-- The "nodes per second" figures in the various configurations shown below (using "go movetime 20000" from the starting position) show that fp16 performance scales by 1.96x in a 3-GPU configuration (from 53,001 to 103,178 nps), compared to the current scaling of 0.82x (from 48,140 to 39,236). In the case of fp32, the improvement is even better, with 3 GPUs yielding scaling of 2.95x.
-- In a test match of 40 games at 2 seconds per move, the 3-GPU version scored 4 wins and 0 losses (using fp16 for both).

[  NPS  ] [COMMAND LINE]
[ 25,488] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0)"
[ 48,140] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0)"
[ 25,460] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0)"
[ 53,001] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0)"

[ 46,024] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0),(backend=cudnn,gpu=1),(backend=cudnn,gpu=2)"
[ 39,236] lc0_org -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0),(backend=cudnnfp16,gpu=1),(backend=cudnnfp16,gpu=2)"
[ 75,047] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnn,gpu=0),(backend=cudnn,gpu=1),(backend=cudnn,gpu=2)"
[103,178] lc0_mutex -t 6 --minibatch-size=1024 --backend=multiplexing "--backend-opts=(backend=cudnnfp16,gpu=0),(backend=cudnnfp16,gpu=1),(backend=cudnnfp16,gpu=2)"

NOTES:

  • lc0_org is a Windows binary compiled from the codebase as of 7 July (source unmodified)
  • lc0_mutex is a Windows binary compiled from the codebase as of 7 July, with the single mutex-optimization modification in PickNodeToExtend
  • all 3 GPUs are Titan V (Volta)
  • the weights file is from network 395
  • the performance figures (nps) reported above are from a single run, but these tests were all performed three times and the timings were highly stable

mooskagh (Member) commented Jul 8, 2018

The idea was that most of the code runs behind a shared mutex, so multiple threads can do most of the work simultaneously, but apparently the locking overhead is too large when there is contention.

Would you mind also checking whether replacing search_->nodes_mutex_ with Mutex (rather than SharedMutex) and locking it exclusively everywhere would slow things down? That way we could get rid of the shared mutex completely.
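For reference, a minimal sketch of the trade-off being asked about, using standard std::shared_mutex / std::mutex rather than lc0's SharedMutex/Mutex wrappers:

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex shared_nodes_mutex;  // readers may overlap; writers are exclusive
std::mutex plain_nodes_mutex;          // every holder is exclusive

void ReadUnderSharedMutex() {
  std::shared_lock<std::shared_mutex> lock(shared_nodes_mutex);
  // Any number of threads may hold the shared lock concurrently.
}

void WriteUnderSharedMutex() {
  std::unique_lock<std::shared_mutex> lock(shared_nodes_mutex);
  // Exclusive: blocks readers and other writers.
}

void AccessUnderPlainMutex() {
  std::lock_guard<std::mutex> lock(plain_nodes_mutex);
  // Always exclusive, but typically cheaper to acquire and release.
}
```

The shared variant allows concurrent readers but its acquire path is heavier under contention; the plain mutex serializes everything but is cheaper per acquisition, which is the overhead question raised above.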

dje-dev (Contributor, Author) commented Jul 8, 2018

As suggested, I ran an experiment replacing search_->nodes_mutex_ with Mutex and locking it exclusively everywhere. Performance tests showed clearly inferior results (easily statistically significant) in the 3-GPU case. For example, the last two NPS figures reported above decreased to 71,512 and 84,833 (from 75,047 and 103,178).
