
AlphaZero searches much wider than lc0 at low visits #748

Closed · Mardak opened this issue Feb 20, 2019 · 10 comments · Fixed by #750

Mardak (Contributor) commented Feb 20, 2019

From Game Changer: page 81 has a table showing move priors and evals at "64" visits (which actually appears to be 67 total child visits), listing the top 19 of 33 possible moves.
[screenshot: the table from Game Changer, page 81]

position startpos moves g1f3 g8f6 c2c4 e7e6 b1c3 f8b4 d1c2 e8g8 a2a3 b4c3 c2c3 a7a5 b2b4 d7d6 e2e3 f6e4 c3c2 e4g5 b4b5 g5f3 g2f3 d8f6 d2d4 f6f3 h1g1 b8d7 f1e2 f3f6 c1b2 f6h4 g1g4 h4h2 g4g3 f7f5 e1c1 f8f7 e2f3 h2h4 d1h1 h4f6 c1b1 g7g6 g3g1 a5a4 b1a1 f7g7 e3e4 f5f4 c4c5 f6e7 g1c1 d7f6 e4e5 d6e5 h1e1 e5e4 f3e4 e7f8

[screenshot: the position after the moves above]

The table shows that Kb1 (a1b1), with a 0.07% prior, managed to get 3% of the visits, which matches 2 of the "64". The table appears to be sorted by N and then P, so the assumption here is that, because a child with a 0.07% prior was visited twice, the unshown children with 1 visit have priors below 0.26% and the unvisited children have priors below 0.07%.

To reach 67 child visits when the 19 children shown in the table account for 60, the most likely split is 7 unshown children with 1 visit each and 7 more unshown children with 0 visits.
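
A quick sanity check of that split, using only the numbers quoted above:

```python
# Numbers quoted from the Game Changer table and the text above.
total_child_visits = 67        # the "64" visits appear to be 67 child visits
legal_moves, shown_children, shown_visits = 33, 19, 60

print(round(0.03 * total_child_visits))        # 2  -> a1b1's 3% visit share
print(legal_moves - shown_children)            # 14 unshown children
print(total_child_visits - shown_visits)       # 7 visits left for them
# The simplest split: 7 unshown children with 1 visit and 7 with 0 visits.
```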

For reference, here's net 32930 (policy-softmax-temp=1) forcing a visit to each child once:

info string c2a4  (262 ) N:       1 (+ 0) (P:  0.02%) (Q: -0.85741) (D:  0.000) (U: 0.00209) (Q+U: -0.85532) (V: -0.8574) 
info string e4d5  (795 ) N:       1 (+ 0) (P:  0.05%) (Q: -0.98659) (D:  0.000) (U: 0.00406) (Q+U: -0.98253) (V: -0.9866) 
info string c2b3  (258 ) N:       1 (+ 0) (P:  0.02%) (Q: -0.97691) (D:  0.000) (U: 0.00193) (Q+U: -0.97498) (V: -0.9769) 
info string e4f5  (797 ) N:       1 (+ 0) (P:  0.05%) (Q: -0.96944) (D:  0.000) (U: 0.00445) (Q+U: -0.96500) (V: -0.9694) 
info string e1e3  (111 ) N:       1 (+ 0) (P:  0.05%) (Q: -0.92453) (D:  0.000) (U: 0.00415) (Q+U: -0.92038) (V: -0.9245) 
info string e4c6  (799 ) N:       1 (+ 0) (P:  0.03%) (Q: -0.71045) (D:  0.000) (U: 0.00288) (Q+U: -0.70758) (V: -0.7105) 
info string e4g6  (803 ) N:       1 (+ 0) (P:  0.03%) (Q: -0.47933) (D:  0.000) (U: 0.00292) (Q+U: -0.47641) (V: -0.4793) 
info string b2c3  (231 ) N:       1 (+ 0) (P:  0.13%) (Q: -0.02199) (D:  0.000) (U: 0.01067) (Q+U: -0.01132) (V: -0.0220) 
info string a1b1  (0   ) N:       1 (+ 0) (P:  0.09%) (Q: -0.00074) (D:  0.000) (U: 0.00764) (Q+U:  0.00690) (V: -0.0007) 
info string a1a2  (7   ) N:       1 (+ 0) (P:  0.09%) (Q:  0.03331) (D:  0.000) (U: 0.00781) (Q+U:  0.04112) (V:  0.0333) 
info string c1b1  (48  ) N:       1 (+ 0) (P:  0.11%) (Q:  0.07281) (D:  0.000) (U: 0.00963) (Q+U:  0.08244) (V:  0.0728) 
info string b5b6  (941 ) N:       1 (+ 0) (P:  0.21%) (Q:  0.10347) (D:  0.000) (U: 0.01815) (Q+U:  0.12162) (V:  0.1035) 
info string e1f1  (101 ) N:       1 (+ 0) (P:  0.12%) (Q:  0.11431) (D:  0.000) (U: 0.01032) (Q+U:  0.12463) (V:  0.1143) 
info string c2d2  (252 ) N:       1 (+ 0) (P:  0.26%) (Q:  0.16386) (D:  0.000) (U: 0.02225) (Q+U:  0.18611) (V:  0.1639) 
info string e1d1  (100 ) N:       1 (+ 0) (P:  0.15%) (Q:  0.17411) (D:  0.000) (U: 0.01290) (Q+U:  0.18700) (V:  0.1741) 
info string c2c3  (259 ) N:       1 (+ 0) (P:  0.16%) (Q:  0.19155) (D:  0.000) (U: 0.01336) (Q+U:  0.20491) (V:  0.1916) 
info string c2b1  (246 ) N:       1 (+ 0) (P:  0.21%) (Q:  0.19083) (D:  0.000) (U: 0.01780) (Q+U:  0.20862) (V:  0.1908) 
info string e4g2  (781 ) N:       1 (+ 0) (P:  0.79%) (Q:  0.16545) (D:  0.000) (U: 0.06753) (Q+U:  0.23298) (V:  0.1654) 
info string c2d1  (248 ) N:       1 (+ 0) (P:  0.21%) (Q:  0.22839) (D:  0.000) (U: 0.01758) (Q+U:  0.24597) (V:  0.2284) 
info string e1g1  (102 ) N:       1 (+ 0) (P:  0.48%) (Q:  0.21439) (D:  0.000) (U: 0.04065) (Q+U:  0.25504) (V:  0.2144) 
info string f2f3  (346 ) N:       1 (+ 0) (P:  0.52%) (Q:  0.22276) (D:  0.000) (U: 0.04454) (Q+U:  0.26730) (V:  0.2228) 
info string c2d3  (260 ) N:       1 (+ 0) (P:  0.25%) (Q:  0.24711) (D:  0.000) (U: 0.02098) (Q+U:  0.26809) (V:  0.2471) 
info string e1h1  (103 ) N:       1 (+ 0) (P:  0.32%) (Q:  0.24280) (D:  0.000) (U: 0.02710) (Q+U:  0.26991) (V:  0.2428) 
info string e1e2  (106 ) N:       1 (+ 0) (P:  0.19%) (Q:  0.25640) (D:  0.000) (U: 0.01583) (Q+U:  0.27224) (V:  0.2564) 
info string c1d1  (49  ) N:       1 (+ 0) (P:  0.18%) (Q:  0.28285) (D:  0.000) (U: 0.01569) (Q+U:  0.29854) (V:  0.2829) 
info string e4h1  (776 ) N:       1 (+ 0) (P:  1.19%) (Q:  0.19882) (D:  0.000) (U: 0.10067) (Q+U:  0.29949) (V:  0.1988) 
info string c2e2  (253 ) N:       1 (+ 0) (P:  0.47%) (Q:  0.29207) (D:  0.000) (U: 0.03964) (Q+U:  0.33172) (V:  0.2921) 
info string c2c4  (264 ) N:       1 (+ 0) (P:  1.20%) (Q:  0.28042) (D:  0.000) (U: 0.10174) (Q+U:  0.38217) (V:  0.2804) 
info string e4b7  (804 ) N:       1 (+ 0) (P:  0.03%) (Q:  0.38252) (D:  0.000) (U: 0.00285) (Q+U:  0.38537) (V:  0.3825) 
info string e4f3  (785 ) N:       1 (+ 2) (P: 15.74%) (Q:  0.26857) (D:  0.000) (U: 0.66853) (Q+U:  0.93710) (V:  0.2686) 
info string c5c6  (973 ) N:       1 (+ 5) (P: 23.27%) (Q:  0.37320) (D:  0.000) (U: 0.56480) (Q+U:  0.93801) (V:  0.3732) 
info string e4d3  (783 ) N:       1 (+ 5) (P: 35.75%) (Q:  0.10710) (D:  0.000) (U: 0.86772) (Q+U:  0.97482) (V:  0.1071) 
info string d4d5  (761 ) N:       1 (+ 8) (P: 17.61%) (Q:  0.69821) (D:  0.000) (U: 0.29924) (Q+U:  0.99745) (V:  0.6982) 

The 7 unvisited children most likely correspond to the first 7 moves above, which have very negative Q while the other children are at least draw-ish; matching that, their priors are very low, at most 0.05%. So it seems quite likely that AZ's priors for these 7 children are very close to 0.00%.

The table shows averaged Q values near 0.657, with the smallest Q in the table at 0.508; mapped from AZ's [0, 1] scale to lc0's [-1, 1] range, that average corresponds to Q ≈ 0.314.

Assuming AZ's search indeed initializes unvisited children to Q = -1, then for it to visit children with priors as low as 0.07% would require:

Q+U of the 0.07% child > max(Q+U) of the other children
-1 + U(0.07% child) > ~0.314 + U(best other child)
U(0.07% child) > ~1.3 + U(best other child)
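
For concreteness, here is a minimal sketch of PUCT child selection with FPU = -1 for unvisited children. The cpuct = 3 and the 67 parent visits are chosen to match the lc0 output below, and the [0, 1] to [-1, 1] Q conversion follows the previous paragraph; this is an illustration, not AZ's actual code:

```python
import math

def q_plus_u(prior, q, n_child, n_parent, cpuct=3.0, fpu=-1.0):
    """PUCT score; unvisited children fall back to the first-play-urgency value."""
    q_eff = q if n_child > 0 else fpu
    u = cpuct * prior * math.sqrt(n_parent) / (1 + n_child)
    return q_eff + u

n_parent = 67
avg_q = 2 * 0.657 - 1          # table average 0.657 on [0, 1] -> ~0.314 on [-1, 1]

score_low_prior = q_plus_u(prior=0.0007, q=0.0, n_child=0, n_parent=n_parent)
print(f"{score_low_prior:.3f}")  # ~ -0.983

# Even ignoring the U term of the visited children, their Q+U is at least ~0.314,
# so a 0.07% prior child initialized to a loss can never win the argmax here.
```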

However, net 32930 with policy-softmax-temp=1, FPU=-1, and 67 child visits only visits the 4 highest-prior moves:

…
info string c2e2  (253 ) N:       0 (+ 0) (P:  0.47%) (Q: -1.00000) (D:  0.000) (U: 0.11486) (Q+U: -0.88514) (V:  -.----) 
info string e1g1  (102 ) N:       0 (+ 0) (P:  0.48%) (Q: -1.00000) (D:  0.000) (U: 0.11777) (Q+U: -0.88223) (V:  -.----) 
info string f2f3  (346 ) N:       0 (+ 0) (P:  0.52%) (Q: -1.00000) (D:  0.000) (U: 0.12903) (Q+U: -0.87097) (V:  -.----) 
info string e4g2  (781 ) N:       0 (+ 0) (P:  0.79%) (Q: -1.00000) (D:  0.000) (U: 0.19564) (Q+U: -0.80436) (V:  -.----) 
info string e4h1  (776 ) N:       0 (+ 0) (P:  1.19%) (Q: -1.00000) (D:  0.000) (U: 0.29167) (Q+U: -0.70833) (V:  -.----) 
info string c2c4  (264 ) N:       0 (+ 0) (P:  1.20%) (Q: -1.00000) (D:  0.000) (U: 0.29477) (Q+U: -0.70523) (V:  -.----) 
info string d4d5  (761 ) N:       7 (+ 0) (P: 17.61%) (Q:  0.03270) (D:  0.000) (U: 0.54185) (Q+U:  0.57455) (V:  0.6982) 
info string e4f3  (785 ) N:       7 (+ 0) (P: 15.74%) (Q:  0.12017) (D:  0.000) (U: 0.48421) (Q+U:  0.60438) (V:  0.2686) 
info string e4d3  (783 ) N:      14 (+ 0) (P: 35.75%) (Q: -0.01088) (D:  0.000) (U: 0.58658) (Q+U:  0.57570) (V:  0.1071) 
info string c5c6  (973 ) N:      39 (+ 1) (P: 23.27%) (Q:  0.29288) (D:  0.000) (U: 0.13969) (Q+U:  0.43257) (V:  0.3732)

Here the Q+U of the 1.2% prior move is -0.7, nowhere close to being picked, since the current max(Q+U) is about 0.6. So somehow AZ is picking these very low prior moves.
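
Plugging the logged numbers back into lc0's U term, U = cpuct · P · √N_parent / (1 + N_child), makes the gap concrete; the cpuct ≈ 3 and N_parent ≈ 67 below are inferred from the output rather than read from the config:

```python
import math

cpuct, n_parent = 3.0, 67                 # both consistent with the U values above
scale = cpuct * math.sqrt(n_parent)       # ~24.6

# c2c4: P = 1.20%, N = 0 -> reproduces the logged U of ~0.295
print(round(scale * 0.0120 / (1 + 0), 3))

# To beat the current max(Q+U) of ~0.60 from an FPU of -1, an unvisited child
# would need U > 1.60, i.e. a prior above roughly:
print(round(1.60 / scale, 3))             # ~0.065, i.e. about a 6.5% prior
```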

This ability of AZ to search wide, assuming the behavior also exists in self-play, would seem to fix #8 for finding tactics, as the network often recognizes from a single visit that a sacrifice actually leads to a good position, but lc0's search ends up ignoring low-prior moves.

Mardak (Contributor, Author) commented Feb 20, 2019

@Tilps Here's the same position with #699 applied:

info string c2c4  (264 ) N:       0 (+ 0) (P:  1.20%) (Q: -1.00000) (U: 0.29696) (Q+U: -0.70304) (V:  -.----) 
info string d4d5  (761 ) N:       8 (+ 0) (P: 17.61%) (Q: -0.06259) (U: 0.48522) (Q+U:  0.42263) (V:  0.6982) 
info string e4f3  (785 ) N:      12 (+ 0) (P: 15.74%) (Q:  0.22110) (U: 0.30019) (Q+U:  0.52129) (V:  0.2686) 
info string e4d3  (783 ) N:      17 (+ 0) (P: 35.75%) (Q: -0.00050) (U: 0.49245) (Q+U:  0.49195) (V:  0.1071) 
info string c5c6  (973 ) N:      30 (+ 1) (P: 23.27%) (Q:  0.34096) (U: 0.18030) (Q+U:  0.52126) (V:  0.3732)

lp200 commented Feb 21, 2019

I think this is the effect of multi-threaded search. There is a similar effect in LZ Go:

--threads 1
 C18 ->      29 (V: 15.96%) (N: 40.81%) PV: C18 B18 L17 Q14 P14 P13 O14
  Q7 ->      13 (V: 16.32%) (N: 15.49%) PV: Q7 Q8 C18 B18 P7 O7 P8 P9
 L17 ->      11 (V: 14.33%) (N: 19.89%) PV: L17 Q14 C18 B18 P14
 L16 ->       7 (V: 11.92%) (N: 16.01%) PV: L16 Q14 C18 B18
 --threads 16 --batchsize 5
 C18 ->      25 (V: 57.51%) (N: 36.84%) PV: C18 B18 Q7
 L16 ->       9 (V: 16.17%) (N: 24.60%) PV: L16 Q14
  Q7 ->       3 (V: 47.76%) (N:  9.12%) PV: Q7 S5 S4
  F3 ->       2 (V: 11.51%) (N:  0.03%) PV: F3 L17
 F18 ->       1 (V: 17.21%) (N:  0.01%) PV: F18 
 T17 ->       1 (V: 13.25%) (N:  0.01%) PV: T17 
  C6 ->       1 (V:  9.06%) (N:  0.05%) PV: C6 
 D14 ->       1 (V:  8.70%) (N:  0.20%) PV: D14 
 Q13 ->       1 (V:  7.97%) (N:  9.83%) PV: Q13 
 P13 ->       1 (V:  7.94%) (N:  0.49%) PV: P13 
 L17 ->       1 (V:  7.82%) (N: 16.23%) PV: L17 
 K17 ->       1 (V:  7.60%) (N:  0.09%) PV: K17 
 P12 ->       1 (V:  6.69%) (N:  0.03%) PV: P12 
 R14 ->       1 (V:  6.68%) (N:  1.41%) PV: R14 
 Q14 ->       1 (V:  6.66%) (N:  0.01%) PV: Q14 
 K16 ->       1 (V:  6.56%) (N:  0.25%) PV: K16 
 M16 ->       1 (V:  6.20%) (N:  0.03%) PV: M16 
  P2 ->       1 (V:  5.82%) (N:  0.02%) PV: P2 
  P8 ->       1 (V:  5.43%) (N:  0.02%) PV: P8 
 M15 ->       1 (V:  4.98%) (N:  0.01%) PV: M15 
 Q12 ->       1 (V:  4.72%) (N:  0.01%) PV: Q12 
 M17 ->       1 (V:  4.61%) (N:  0.02%) PV: M17 
  O7 ->       1 (V:  3.20%) (N:  0.01%) PV: O7 
 M18 ->       1 (V:  3.14%) (N:  0.01%) PV: M18 
  P3 ->       1 (V:  2.76%) (N:  0.01%) PV: P3 
 M17 ->       1 (V:  4.90%) (N:  0.02%) PV: M17 
 O11 ->       1 (V:  4.44%) (N:  0.01%) PV: O11 
 L18 ->       1 (V:  2.82%) (N:  0.01%) PV: L18 
 D18 ->       1 (V:  1.29%) (N:  0.01%) PV: D18 
 B18 ->       1 (V:  0.98%) (N:  0.01%) PV: B18 

Mardak (Contributor, Author) commented Feb 21, 2019

The AGZ paper does say "Positions in the queue are evaluated by the neural network using a mini-batch size of 8", and the AZ paper says "The leaf node sL is added to a queue for neural network evaluation".

Perhaps the same batch size was used for AZ. For Go there are ~300 possible moves, so spilling over some evaluations in early batches still leaves many moves unvisited, while for chess, as in this position, there are ~30 moves, so most end up visited at least once.
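
A rough sense of scale for that spillover, assuming the same mini-batch size of 8 and the move counts above:

```python
# One spilled batch relative to the number of legal root moves.
batch = 8
for game, moves in (("Go", 300), ("chess", 30)):
    print(f"{game}: one batch covers ~{batch / moves:.0%} of the root moves")
# Go: ~3% vs chess: ~27%; a couple of early batches touch most chess root moves.
```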

Videodr0me (Contributor) commented:

Isn't there a comment by DM that they "go wider" on the first couple of plies because it's hard to fill the batches? I always wondered why we only cache potential nodes if a batch is not full; we could just backpropagate their results as well instead of caching. It seems that's what DM might be doing.

Mardak (Contributor, Author) commented Feb 21, 2019

Matthew provided some more details confirming that TPU batching allows "free" evaluation of additional moves that MCTS otherwise wouldn't have picked yet:

> Yes, at very early stages of the search, we evaluate wider than we would have otherwise because going deep requires previous results to come back (so we have the priors) before we can go another ply deeper. So in very early stages we fill batches with nodes that we otherwise wouldn't have searched because we basically get them for "free" due to performance characteristics of TPUs (evaluating 1 node isn't much faster than evaluating 16 at a time).
>
> We evaluate asynchronously. The search threads run simulations as fast as they can, and when they get to a leaf to expand, it gets added to an evaluation queue. When there's enough evaluation requests to fill a batch, it gets sent to the TPU. Virtual loss ensures that we will pick different nodes on next simulation, and that will get added to the same batch.
>
> Yes, this is the same for both selfplay and the evaluation player.
>
> It's true that we explore wider than usual due to this latency issue, but the effect is only significant at the very beginning - we find that at 800 simulations it will have visited every root children at least once anyways, even if we aren't doing this batching.
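
A minimal single-threaded sketch of the queue-plus-virtual-loss scheme described above; the batch size, the virtual-loss value, and all names here are illustrative assumptions, not AZ's implementation:

```python
import math, random

CPUCT, FPU, VIRTUAL_LOSS, BATCH = 3.0, -1.0, 1, 8

def select(children, n_parent):
    """Pick argmax Q+U, counting in-flight visits via virtual loss."""
    def score(c):
        n = c["n"] + c["vloss"]
        q = (c["w"] - c["vloss"]) / n if n else FPU
        u = CPUCT * c["p"] * math.sqrt(max(n_parent, 1)) / (1 + n)
        return q + u
    return max(children, key=score)

def run_batch(children):
    """Queue BATCH leaves before evaluating; virtual loss forces variety."""
    queue = []
    n_parent = sum(c["n"] for c in children)
    for _ in range(BATCH):
        c = select(children, n_parent + len(queue))
        c["vloss"] += VIRTUAL_LOSS          # discourage re-picking this child
        queue.append(c)
    for c in queue:                         # pretend the whole batch came back
        value = random.uniform(-1, 1)       # stand-in for the network eval
        c["vloss"] -= VIRTUAL_LOSS
        c["n"] += 1
        c["w"] += value

children = [{"p": p, "n": 0, "w": 0.0, "vloss": 0}
            for p in (0.36, 0.23, 0.18, 0.16, 0.07)]
run_batch(children)
print([c["n"] for c in children])   # a single batch spreads over several children
```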

Edit: And after I pointed out that root children need at least a 1.4% prior to overcome Q=loss at 800 visits if the parent is at Q=draw:

> That is true, I stand corrected.
>
> There is probably some value in visiting every child once.
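
A rough back-of-the-envelope check of that 1.4% figure, assuming the c_puct schedule from the AZ pseudocode (c_base = 19652, c_init = 1.25) and Q on AZ's [0, 1] scale, where a loss is 0 and a draw is 0.5:

```python
import math

n = 800
c_puct = math.log((1 + n + 19652) / 19652) + 1.25   # ~1.29 at 800 visits
q_gap = 0.5 - 0.0                                    # parent at draw vs child at loss
p_min = q_gap / (c_puct * math.sqrt(n))              # U must bridge the Q gap alone
print(f"{p_min:.1%}")                                # ~1.4%
```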

Ishinoshita commented Feb 26, 2019

Since narrow/deep search is beneficial at the leaves and wide search is beneficial at the root, shouldn't the FPU be a function of depth (something like init-to-parent-Q, or even init-to-win, at the root, converging to init-to-loss with depth)? I think it would have a different effect than just scaling cpuct with visits (the two could actually be combined).
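
A hypothetical sketch of such a depth-dependent FPU, interpolating from the parent's Q at the root toward a loss deeper in the tree; the linear blend and the depth scale are assumptions for illustration, not anything lc0 currently implements:

```python
def fpu_at_depth(depth, parent_q, depth_scale=4.0, loss=-1.0):
    """Blend from 'init to parent Q' at the root toward 'init to loss' with depth."""
    w = min(depth / depth_scale, 1.0)   # 0 at the root, 1 once depth >= depth_scale
    return (1 - w) * parent_q + w * loss

for d in range(6):
    print(d, round(fpu_at_depth(d, parent_q=0.0), 2))
# 0 0.0 ... 4 -1.0: wide at the root, init-to-loss (narrow and deep) near the leaves
```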

oscardssmith (Contributor) commented:

Your premise isn't correct. The optimal tree shape for the root is different from that for the rest of the tree, but for the rest of the tree the optimal shape can be determined from the nodes alone. (The root is different because you don't care about estimating the value of the root, just finding the best move.)

Ishinoshita commented:

I see. So using init FPU = win (or Q) at the root, and FPU = loss elsewhere, would make sense?

oscardssmith (Contributor) commented:

For now, that seems like a good option. (FPU reduction in-tree might be better.)

alreadydone commented:

As a point of reference, LZ Go in self-play has FPU reduction disabled at the root and enabled elsewhere, because it was thought that FPU reduction hinders exploration from the Dirichlet noise. (Now that we know init-to-loss works, it seems unnecessary to make this distinction.)
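
For illustration, LZ-style FPU reduction amounts to roughly the following; the 0.25 default, the exact formula, and the root exception are stated as I understand LZ's self-play settings, and the names are placeholders:

```python
import math

def first_play_urgency(parent_eval, visited_policy_mass, is_root,
                       fpu_reduction=0.25):
    """Parent eval minus a reduction that grows with the policy mass already
    visited; skipped at the root in self-play so Dirichlet noise can still
    pull in low-prior moves."""
    if is_root:
        return parent_eval
    return parent_eval - fpu_reduction * math.sqrt(visited_policy_mass)

print(round(first_play_urgency(0.55, 0.8, is_root=False), 3))  # ~0.326 in-tree
print(first_play_urgency(0.55, 0.8, is_root=True))             # 0.55 at the root
```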
