AlphaZero searches much wider than lc0 at low visits #748
@Tilps Here's with #699 applied:
I think that it is the effect of multi-threaded search.
The AGZ paper does say as much. Perhaps the same batch size was used for AZ. It's just that for Go there are ~300 possible moves, and spilling over some evaluations in early batches still leaves many moves unvisited, while for chess, as in this position, there are ~30 moves, so most end up visited at least once.
Isn't there a comment by DM that they "go wider" on the first couple of plies because it's hard to fill the batches? I always wondered why we only cache potential nodes if a batch is not full. We could just backpropagate their results as well instead of caching; it seems that's what DM might be doing.
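As a rough sketch of what that could look like (my illustration with hypothetical names, not lc0's actual batching code):

```python
# Minimal sketch: when a batch isn't full, evaluate extra leaves anyway and
# back up their values instead of only caching them. All names hypothetical.

class Node:
    def __init__(self, parent=None):
        self.parent, self.n, self.w = parent, 0, 0.0

def backprop(leaf: Node, v: float) -> None:
    """Back up a value from `leaf` to the root, flipping sign each ply."""
    node = leaf
    while node is not None:
        node.n += 1
        node.w += v
        v = -v  # zero-sum game: side to move alternates
        node = node.parent

def evaluate_batch(batch, batch_size, pick_extra_leaf, net_eval):
    # Pad the batch with additional leaves (e.g. next-best by PUCT).
    while len(batch) < batch_size:
        extra = pick_extra_leaf()
        if extra is None:
            break
        batch.append(extra)
    # One NN call evaluates the whole batch; the padded entries are "free"
    # on a TPU/GPU, so back up their values too rather than caching them.
    for leaf, v in zip(batch, net_eval(batch)):
        backprop(leaf, v)
```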
Matthew provided some more details confirming that TPU batching allows "free" evaluation of additional moves that otherwise wouldn't have been picked earlier by MCTS:
Edit: He also pointed out that root children need at least a 1.4% prior to overcome Q = loss at 800 visits if the parent Q = draw:
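For what it's worth, that 1.4% figure can be reproduced from the standard PUCT selection rule; the following is my back-of-envelope, where cpuct = 2.5 is an assumption that happens to match the quoted number:

```python
import math

# Under standard PUCT, U = cpuct * P * sqrt(N_parent) / (1 + N_child).
# An unvisited child initialized to Q = -1 (loss), competing against
# siblings near Q = 0 (parent at a draw), needs U >= 1 to ever be picked.
cpuct, n_parent = 2.5, 800  # cpuct = 2.5 is an assumption
p_min = 1.0 / (cpuct * math.sqrt(n_parent))
print(f"{p_min:.2%}")  # 1.41%
```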
Since narrow/deep search is beneficial at the leaves, and wide search is beneficial at the root, shouldn't the FPU be a function of depth (something equal to init-to-parent, or Q, or even win, at the root, converging to init-to-loss with depth)? I think it would have a different effect than just scaling cpuct with visits (the two could actually be combined). Something like the sketch below.
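A minimal sketch of that depth-dependent FPU idea (my illustration, not an existing lc0 option; `fpu_by_depth` and the `ramp` knob are hypothetical):

```python
def fpu_by_depth(parent_q: float, depth: int, ramp: int = 4) -> float:
    """First-play urgency for an unvisited child; depth=0 is the root.

    Interpolates from FPU = parent Q at the root toward FPU = loss (-1)
    deeper in the tree; `ramp` controls how fast it converges.
    """
    t = min(depth, ramp) / ramp  # 0 at the root, 1 once depth >= ramp
    return (1.0 - t) * parent_q + t * -1.0
```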
Your premise isn't correct. Optimal tree shape for the root is different than for the rest of the tree, but for the rest of the tree, optimal tree shape can be determined by the nodes alone. (The root is different because you don't care about estimating the value of the root, just finding the best move.)
I see. So, using init FPU = Win or Q at the root, and FPU = Loss elsewhere, would make sense?
For now, that seems like a good option. (FPU reduction in-tree might be better.)
In fact, LZ Go in self-play has FPU reduction disabled at the root and enabled elsewhere, because it was thought that FPU reduction hinders exploration from the Dirichlet noise. (Now that we know init-to-loss works, it seems unnecessary to make this distinction.)
From Game Changer: page 81 has a table showing move priors and evals at "64" visits (which actually seems to be 67 total child visits) for the top 19 of 33 possible moves.

The table shows that Kb1/a1b1 with a 0.07% prior managed to get 3% of visits, which matches up with 2 visits of the "64". It looks like the table is sorted by N then P, so an assumption here is that, because a 0.07%-prior child was visited twice, the unshown children with 1 visit have priors below 0.26% and the unvisited children have priors below 0.07%.
To get 67 child visits, given that the table shows 19 children with 60 visits, most likely there are 7 unshown children with 1 visit each and 7 more unshown children with 0 visits each.
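A quick check of that accounting (my arithmetic, not from the book):

```python
# 33 legal moves, 19 shown in the table with 60 visits between them.
total_child_visits = 67            # the "64" appears to be 67 child visits
shown_visits, shown_children = 60, 19
unshown_children = 33 - shown_children                 # 14
one_visit = total_child_visits - shown_visits          # 7 with 1 visit each
zero_visit = unshown_children - one_visit              # 7 with 0 visits
print(one_visit, zero_visit)  # 7 7
```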
For reference, here's 32930 (policy-softmax-temp=1) forcing a visit to each child once:
The 7 unvisited children most likely match up with the first 7 moves above, which have a very negative Q, while the other children are at least draw-ish; consistent with that, their P is very low, at most 0.05%. So it seems quite likely AZ's P for these 7 children is very close to 0.00%.
The table shows Q values averaging near 0.657 (smallest Q in the table: 0.508), apparently on a [0, 1] scale; mapping to [-1, 1] via 2Q - 1 gives an average Q = 0.314.
Assuming AZ search indeed initialized unvisited children to Q = -1, then for AZ search to be visiting children with priors as low as 0.07% means:
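(The comparison that followed appears to have been an image; here is my reconstruction of the implied arithmetic, assuming the standard PUCT formula.)

```python
import math

# If AZ used plain PUCT with FPU = -1, the cpuct needed for a 0.07%-prior
# child to be selected at ~67 parent visits, against siblings averaging
# Q ~= 0.314, satisfies cpuct * P * sqrt(N) >= Q_gap.
q_gap = 0.314 - (-1.0)                      # ~1.314
p, n_parent = 0.0007, 67
cpuct_needed = q_gap / (p * math.sqrt(n_parent))
print(round(cpuct_needed))                  # ~229: implausibly large
```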
However, 32930 with policy-softmax-temp=1, FPU=-1, and 67 child visits only visits the top 4 highest-prior moves:
Here the Q+U of the 1.2%-prior move is -0.7, nowhere close to being picked, as the current max(Q+U) is 0.6. So somehow AZ is picking these very low-prior moves.
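As a sanity check on those numbers (my arithmetic; cpuct = 3.0 is an assumption for this run):

```python
import math

# An unvisited child has Q = FPU = -1 and U = cpuct * P * sqrt(N) / (1 + 0).
cpuct, p, n_parent = 3.0, 0.012, 67
print(round(-1.0 + cpuct * p * math.sqrt(n_parent), 2))  # -0.71, ~ the -0.7 above
```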
This behavior of AZ searching wide, assuming it also exists in self-play, would seem to fix #8 for finding tactics: often the network would have known from a single visit that a sacrifice tactic actually leads to a good position, but lc0's search ends up ignoring low-prior moves.