Initialize Q = 0 instead of parent Q for self-play to match AGZ paper #344
Comments
Yes, I think your idea is very interesting.
I tried that already for play outside of training, and there it's worse (at least under my testing conditions): https://github.com/Videodr0me/leela-chess-experimental/wiki/Sanity-Tests. It's not clear what DeepMind used for A0, but based on my tests, for chess at least and outside of training, parent Q seems strongest. What works best in training (as opposed to play outside of training) is another question, but the strength difference is rather large, so there are pros and cons. It could be tried in training, but I would not expect miracles.
That's also what lc0 used back when lczero was still the official engine. It was also shown to be weaker.
Shown how? Which networks used training data that were generated with Q=0? |
I agree that having something for FPU can be very important for playing matches, but using that to infer the quality of training data is very misguided. For example, the randomness added by noise and temperature purposefully avoids playing at "full strength" so that the network can learn from better training data. Similarly, Q = 0 means there's a clearer difference in how training works for the losing vs. the winning side, instead of self-play games just trying to "do the usual thing." With Q = 0, the losing side's purpose is to search wide and find good moves that are hidden, while the winning side's purpose is to reinforce and validate whether a move is actually good.
Looking through the commit history, I don't see when lczero ever used FPU = 0:

- Add recursive search depth, remove FPU VL bug (jkiliani, April 29)
- Add fpu_dynamic_eval option and enable it (Tilps, April 11)
- Return fpu eval to use the static net eval as starting point (Tilps, April 11)
- Reduce first play urgency (jkiliani, March 28)
- Port UCTNode simplifications from Leela Zero (glinscott, January 12)
- Add files via upload (benediamond, December 21)
I am a little sceptical about this argument, especially if you make it about training and believe it does not hold for normal play. Why shouldn't, in normal play (outside training), the losing side's "purpose" be to search wide and the winning side's "purpose" be to "validate" whether a move is actually good? If this argument holds, then initializing Q to 0 should yield Elo in normal play, but it does not. This is not to say that there is no interaction between FPU, learning, and the ceiling of the fully trained NN. By all means, just try it: maybe you can train a smaller net locally and see what happens? I am curious.
It's because searching wider is not visit-efficient and leads to suboptimal visit usage, assuming the priors are accurate. When playing a match, the network shouldn't second-guess itself and should trust that the policy is good. As mentioned earlier, self-play settings should be different from match settings: self-play should maximize learning, while a match maximizes rating.
Close this as evidence suggests this is weaker? |
Now we have fpu-strategy=absolute (it turned out that A0 used -1).
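A minimal sketch of how that could be set on the command line, assuming lc0 exposes the strategy via `--fpu-strategy` and the absolute value via a `--fpu-value` option (option names are assumptions and may differ between versions; check `lc0 --help`):

```
# Assumed options: use an absolute FPU of -1 for unvisited moves
# instead of a reduction from the parent Q.
./lc0 --fpu-strategy=absolute --fpu-value=-1
```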
From the AGZ paper (expand-and-evaluate step): each edge (s_L, a) of a newly expanded leaf is initialized to N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a.

I realize that initializing to the parent Q, and then additionally reducing it with a first-play-urgency (FPU) reduction, is what Leela Zero does and what was copied to lczero/lc0, and that doing this and tuning it can improve match strength.
However, self-play training data quality can be reduced when using "match settings", which is why there are different default values for cpuct, softmax, FPU reduction, etc., as well as code that is turned on/off specifically for self-play, e.g., leela-zero/leela-zero#1083.
The behavior of Q = 0 instead of parent Q means that in winning positions, the first "good enough" move found will likely have a dominating Q > 0 relative to unvisited moves at Q = 0. Similarly, from a losing position, search will naturally go wider, as the usual "best" move probably still has Q < 0, leading to visits of the seemingly Q = 0 moves.
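To make the mechanism concrete, here is a minimal sketch of a simplified PUCT child selection with the two first-play Q strategies discussed here; the types, names, and cpuct handling are illustrative, not lc0's actual implementation:

```cpp
#include <cmath>
#include <vector>

// Illustrative per-move statistics from the parent's point of view
// (not lc0's actual data structures).
struct Edge {
  float prior = 0.0f;      // P(s, a) from the policy head
  int visits = 0;          // N(s, a)
  float value_sum = 0.0f;  // W(s, a), sum of backed-up values
};

enum class FpuMode { kParentQReduction, kAbsoluteZero };

// Q assigned to a move that has never been visited, under the two strategies.
float FirstPlayQ(FpuMode mode, float parent_q, float fpu_reduction) {
  return mode == FpuMode::kParentQReduction
             ? parent_q - fpu_reduction  // Leela Zero / lc0 style FPU reduction
             : 0.0f;                     // AGZ style: Q(s, a) = 0 when N(s, a) = 0
}

// Simplified PUCT selection: argmax over Q(s, a) + U(s, a).
int SelectChild(const std::vector<Edge>& edges, int parent_visits, float parent_q,
                FpuMode mode, float fpu_reduction, float cpuct) {
  const float sqrt_parent = std::sqrt(static_cast<float>(parent_visits));
  int best = -1;
  float best_score = -1e30f;
  for (int i = 0; i < static_cast<int>(edges.size()); ++i) {
    const Edge& e = edges[i];
    const float q = e.visits > 0 ? e.value_sum / e.visits
                                 : FirstPlayQ(mode, parent_q, fpu_reduction);
    const float u = cpuct * e.prior * sqrt_parent / (1.0f + e.visits);
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```

With kAbsoluteZero, a winning position keeps funneling visits into the move whose Q sits above 0, while in a losing position every visited move has Q < 0, so the unvisited moves at Q = 0 win the argmax and search widens, as described above.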
Almost all of the positions in #8 happen to be losing positions with one good move, so Q = 0 ends up finding all of the expected tactical moves without even needing 800 visits: once search starts going wide, it realizes the "hidden", very-low-prior tactical move is actually the best of all possible moves.
But even ignoring the fact that Q = 0 happens to improve tactical training in those select positions, it sounds like a main project goal is to "reproduce AZ as closely as possible." There are no details of Q = 0 or any other Q initialization in the AZ paper itself, so falling back to the earlier AGZ paper, it would seem to imply that unvisited moves should have Q = 0 for self-play.