Initialize Q = 0 instead of parent Q for self-play to match AGZ paper #344
Comments
Yes, I think your idea is very interesting.
I tried that already for play outside of training, and there it's worse (at least under my testing conditions): https://github.com/Videodr0me/leela-chess-experimental/wiki/Sanity-Tests. It's not clear what DeepMind used for A0, but based on my tests, for chess at least and outside of training, parent Q seems strongest. What works best in training (as opposed to play outside of training) is another question, but the strength difference is rather large, so there are pros and cons. It could be tried in training, but I would not expect miracles.
That's also what lc0 used back when lczero was still the official engine. It was also shown to be weaker.
Shown how? Which networks used training data that were generated with Q=0? |
I agree that having something for FPU can be very important for playing matches, but using that to infer the quality of training data is very misguided. For example, the randomness added by noise and temperature purposefully avoids playing at "full strength" so that the network can learn from better training data. Similarly, Q = 0 means there's a clearer difference in how training works for the losing vs. the winning side, instead of self-play games just trying to "do the usual thing." With Q = 0, the losing side's purpose is to search wide and find good moves that are hidden, while the winning side's purpose is to reinforce and validate whether a move is actually good.
Looking through the commit history, I don't see when lczero ever used FPU = 0:

- Add recursive search depth, remove FPU VL bug (jkiliani, April 29)
- Add fpu_dynamic_eval option and enable it (Tilps, April 11)
- Return fpu eval to use the static net eval as starting point (Tilps, April 11)
- Reduce first play urgency (jkiliani, March 28)
- Port UCTNode simplifications from Leela Zero (glinscott, January 12)
- Add files via upload (benediamond, December 21)
I am a little sceptical about this argument, especially if you make it about training and believe it does not hold for normal play. Why shouldn't, in normal play (outside training), the losing side's "purpose" be to search wide and the winning side's "purpose" be to "validate" whether a move is actually good? If this argument holds, then initializing Q to 0 should yield Elo in normal play, but it does not. This is not to say that there is no interaction between FPU, learning, and the ceiling of the fully trained NN. By all means, just try it: maybe you can train a smaller net locally and see what happens? I am curious.
It's because searching wider is not visit-efficient and leads to suboptimal visit usage, assuming the priors are accurate. When playing a match, the network shouldn't second-guess itself and should trust that the policy is good. As mentioned earlier, self-play settings should be different from match settings: self-play should maximize learning, while a match maximizes rating.
Close this as evidence suggests this is weaker? |
Now we have fpu-strategy=absolute (it turned out that A0 used -1).
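A minimal sketch of how that could be set on the command line, assuming lc0 exposes the strategy via `--fpu-strategy` and the absolute value via a `--fpu-value` option (option names are assumptions and may differ between versions; check `lc0 --help`):

```
# Assumed options: use an absolute FPU of -1 for unvisited moves
# instead of a reduction from the parent Q.
./lc0 --fpu-strategy=absolute --fpu-value=-1
```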
From the AGZ paper (expand-and-evaluate step): each edge (s_L, a) of a newly expanded leaf is initialized to N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a.

I realize that initializing to the parent Q, and then additionally reducing it with a first-play-urgency (FPU) reduction, is what Leela Zero does and what was copied to lczero/lc0, and that doing this and tuning it can improve match strength.
However, self-play training data quality can be reduced when using "match settings", which is why there are different default values for cpuct, softmax, FPU reduction, etc., as well as code that is turned on/off specifically for self-play, e.g., leela-zero/leela-zero#1083.
The behavior of Q = 0 instead of parent Q means that in winning positions, the first "good enough" move found will likely have a dominating Q > 0 relative to unvisited moves at Q = 0. Similarly, from a losing position, search will naturally go wider, as the usual "best" move probably still has Q < 0, leading to visits of the seemingly Q = 0 moves.
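To make the mechanism concrete, here is a minimal sketch of a simplified PUCT child selection with the two first-play Q strategies discussed here; the types, names, and cpuct handling are illustrative, not lc0's actual implementation:

```cpp
#include <cmath>
#include <vector>

// Illustrative per-move statistics from the parent's point of view
// (not lc0's actual data structures).
struct Edge {
  float prior = 0.0f;      // P(s, a) from the policy head
  int visits = 0;          // N(s, a)
  float value_sum = 0.0f;  // W(s, a), sum of backed-up values
};

enum class FpuMode { kParentQReduction, kAbsoluteZero };

// Q assigned to a move that has never been visited, under the two strategies.
float FirstPlayQ(FpuMode mode, float parent_q, float fpu_reduction) {
  return mode == FpuMode::kParentQReduction
             ? parent_q - fpu_reduction  // Leela Zero / lc0 style FPU reduction
             : 0.0f;                     // AGZ style: Q(s, a) = 0 when N(s, a) = 0
}

// Simplified PUCT selection: argmax over Q(s, a) + U(s, a).
int SelectChild(const std::vector<Edge>& edges, int parent_visits, float parent_q,
                FpuMode mode, float fpu_reduction, float cpuct) {
  const float sqrt_parent = std::sqrt(static_cast<float>(parent_visits));
  int best = -1;
  float best_score = -1e30f;
  for (int i = 0; i < static_cast<int>(edges.size()); ++i) {
    const Edge& e = edges[i];
    const float q = e.visits > 0 ? e.value_sum / e.visits
                                 : FirstPlayQ(mode, parent_q, fpu_reduction);
    const float u = cpuct * e.prior * sqrt_parent / (1.0f + e.visits);
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```

With kAbsoluteZero, a winning position keeps funneling visits into the move whose Q sits above 0, while in a losing position every visited move has Q < 0, so the unvisited moves at Q = 0 win the argmax and search widens, as described above.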
Almost all of the positions in #8 happen to be losing positions with one good move, so Q = 0 ends up finding all of the expected tactical moves without even needing 800 visits: once search starts going wide, it realizes the "hidden", very-low-prior tactical move is actually the best of all possible moves.
But even ignoring the fact that Q = 0 happens to improve tactical training in those select positions, it sounds like a main project goal is to "reproduce AZ as closely as possible." There are no details of Q = 0 or any other Q initialization in the AZ paper itself, so falling back to the earlier AGZ paper, it would seem to imply that unvisited moves should have Q = 0 for self-play.