Separate exploration from training feedback (alternate method of lowering T) #342
This is very similar to what I've mentioned before, although I suggest doing it like experience replay (a common RL technique): after a game is played, go back and explore an alternate move from that game, and then send just the partial game starting after the alternate move back as training positions.
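For concreteness, a minimal sketch of that branching idea in Python, using python-chess and a uniformly random move as a stand-in for the real 800-playout search; the function names and structure are mine, not anything from the actual lc0 self-play code:

```python
import random
import chess  # python-chess

def search_move(board):
    """Stand-in for the engine move: uniformly random here; in the real
    pipeline this would be the normal 800-playout search."""
    return random.choice(list(board.legal_moves))

def replay_branch(moves):
    """Given the move list of a finished self-play game, branch at a random
    ply with an alternate move, play the rest out, and return only the
    post-branch positions -- the partial game that would be sent back
    as extra training data."""
    if not moves:
        return []
    ply = random.randrange(len(moves))
    board = chess.Board()
    for move in moves[:ply]:
        board.push(move)
    alternatives = [m for m in board.legal_moves if m != moves[ply]]
    if not alternatives:
        return []
    board.push(random.choice(alternatives))       # the explored alternate move
    positions = [board.fen()]
    while not board.is_game_over(claim_draw=True):
        board.push(search_move(board))
        positions.append(board.fen())
    return positions
```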
@RedDenver I really like the idea of experience replay. It seems to do a good job of exploring without distorting game results. Do you have any idea if this would be easy to implement?
@oscardssmith Both my idea and KD's require some change to the training data being sent back to the server, and probably to the way the server selects positions to train on. I'm not familiar with those parts of the code, so I'm not sure how much work that would entail.
Based on the Discord discussion about how to train the opening moves, I propose using high temp in the first 6 plies (119 million positions possible) to produce semi-random games since training is distributed, then playing the remainder of the game without temp and submitting that full game back to training just like we do now. Additional partial games can then be produced in a few ways:
At a high level, this is a bit similar to @dubslow's stochastic temperature, where the effect of temperature is reduced/removed after a random number of moves: #237 (comment) The approach here seems like it would require more changes to self-play and training, with special handling of opening positions or empty training data. Even if we say "N plies" can be 0, the proportion of training data that includes opening moves would be significantly smaller (though it's unclear if that's actually a problem, as perhaps training with history planes won't confuse the start position with other positions). One general issue with picking positions based on priors instead of including value from search is that, on the surface, it seems harder for the network to correct an incorrect position value. But I suppose that could eventually be covered by a separate run of "N-1" getting into the same "just before" position, with search generating prior training data that increases the likelihood of randomly picking the "N" position. So, in other words, multiple networks are needed for value to propagate into the priors, which then randomly play into those positions, which then update value. An even larger general question about randomness for exploration is whether it generates training data with a sufficient proportion of reasonable and/or desired positions, e.g., piece sacrifices, fortress or perpetual positions. But then again, a proposed design doesn't need to address everything at once.
I have wondered why AlphaZero (and Leela Zero and LC0 in its wake) does not use some form of temporal difference (TD) learning or Q-learning to obtain the training data for the value function; they use the value at the end of the simulated game. (It is different for the policy function; the data used to train the policy head is available as each move is generated, without playing out the game.) If some sort of TD learning were used, then one could completely divorce the generation of the training data from the generation of complete simulated games. One could obtain board positions in any way one likes and obtain the training data for just one board position at a time. It removes a lot of correlations in the data, and it removes the conflict between high noise in the self-play games (to obtain a diverse set of training configurations) and low noise (to obtain an accurate value). I discussed this from a somewhat remote perspective -- the generation of training data for fitting a molecular energy function -- in a post on the r/cbaduk reddit that may be of interest for the present discussion: "Generating neural network training data by self-play, sampling a state space in molecular simulations", posted on r/cbaduk on 2018-03-22.
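For reference, a minimal sketch of what a TD(λ)-style value target could look like, built from the network's own evaluations along a game rather than only from the final result; the λ value and the single-perspective simplification (ignoring the side-to-move sign flip) are my assumptions, not anything from the lc0 pipeline:

```python
def td_lambda_targets(values, result, lam=0.7):
    """values[t]: the network's evaluation of position t, all from one fixed
    perspective; result: the final game outcome from that same perspective.
    Returns one value target per position.  lam=1 reproduces the current
    scheme (every position is trained toward the game result); lam=0 gives
    pure one-step bootstrapping from the next position's evaluation."""
    targets = [0.0] * len(values)
    next_target = result
    for t in reversed(range(len(values))):
        bootstrap = values[t + 1] if t + 1 < len(values) else result
        next_target = (1 - lam) * bootstrap + lam * next_target
        targets[t] = next_target
    return targets
```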
Yes, and that "remove training data" seemed to be an artifact of not doing search in these earlier moves in order to quickly get to positions that we do want to train on, so there would be no visit-probability data to train with, and the removal then also requires special handling to actually train opening moves. AGZ, with its 30 moves of T=1 and the remainder at T=0, shows that early position values are probably pretty good even though there are multiple temperature-induced moves that might distort the true value. @killerducky was the intent of the proposal to "train value with zero temperature-induced plays" or just to reduce the likelihood? Either way is a reasonable proposal and would address "Get an accurate value of positions." -- the sub-point 2a just made it seem that removing early position value training wasn't an explicit goal of the proposal.
@Mardak I think exploration with alternative moves is still very useful in the middle and end game, but I don't think temp is the right tool to do it. Temp is very simple to implement and works well in the opening, though. So my suggestions are based on removing temp after the opening, and experience replay is very similar to how humans analyze games - looking at variations in games already played. It just makes sense to me.
Isn't this the same as #330 (comment)?
Indeed, the post and discussion by @amjshl under #330 look like a very good quantitative demonstration of the benefits of separating the sampling of the configurations from the evaluation of the training data for the value. This (from the comments section under #330) sums it up for me: "Yes, my assumption is that the result of T=0 at 800 playouts gives the most accurate value but that is very computationally intensive. As a compromise, running just 1, 10, or 50 playouts with T=0 is cheaper to compute but gives a better prediction accuracy than T=1 and 800 playouts used during training games. The challenge is that we still need T=1 to generate greater variety of positions and explore new moves, but use T=0 only for determining the value of a position, but not generate positions for training."
@MaurizioDeLeo @bjbraams This isn't the same as #330, as that one compares using temp for the entire game vs. only in the early game. This issue is discussing how to keep move exploration without using temp.
Now that an "uncertainty" head is being considered, it could be used to guide the exploration. A kind of curiosity-based learning: https://towardsdatascience.com/curiosity-driven-learning-made-easy-part-i-d3e5a2263359
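A rough sketch of what uncertainty-guided exploration might look like at move selection, assuming the uncertainty head yields one scalar per candidate move's resulting position; the multiplicative bonus and its weight are invented for illustration:

```python
import numpy as np

def pick_move_with_curiosity(policy, uncertainties, bonus_weight=0.5):
    """Boost the sampling probability of moves leading to positions the
    network is unsure about, then sample from the reweighted distribution."""
    scores = np.asarray(policy, dtype=float) * (
        1.0 + bonus_weight * np.asarray(uncertainties, dtype=float))
    probs = scores / scores.sum()
    return int(np.random.choice(len(probs), p=probs))
```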
Oh, I hadn't thought of that use. That's a cool idea.
One could train value on the game result only until the eval drops below 1 pawn (for the winning side), or whatever that translates to in win %, and train value on the search result for all moves previous to that.
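One way to read that suggestion, as a sketch; treating "1 pawn" as roughly Q = 0.25 and keeping everything from the winning side's perspective are placeholders, not established conversions:

```python
def split_value_targets(root_qs, result, cutoff_q=0.25):
    """root_qs[t]: search value of position t from the winning side's
    perspective; result: final outcome from the same perspective.
    Positions after the eval last dips below the cutoff keep the game
    result as their value target; every earlier position is trained
    toward its own search value instead."""
    switch = 0
    for t, q in enumerate(root_qs):
        if q < cutoff_q:
            switch = t + 1          # last ply where the game was still "close"
    return list(root_qs[:switch]) + [result] * (len(root_qs) - switch)
```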
I would like to work on something like this, but slightly different, involving a small amount of temp in the second phase plus a value guard. Still, I think we can set it up flexibly so that the exact use is still free to choose. I am not 100% sure that my understanding of the current code is correct, so please correct me if necessary. I suggest we try to do the following:
What do you guys think?
This could work, but it should not be in the initial version, as the value guard is orthogonal.
Thanks for the reply. Ok, I can agree. That means starting from that PR - which one is it? Just sending the positions to train on is most definitely better; it's probably a misunderstanding of mine about how the code works right now. Then, in the first version, I still think it's easiest to set temp and Dirichlet noise to zero after detection of a 'blunder'. It will not reduce the time required for training games, but it does not require a more extensive restructuring of the code to first play a complete game based on a 1-node search. If we could start from a version of the code that includes Q training, then we could still use the positions in phase one. Should be doable.
We don't want to use the phase 1 positions. The desired behavior is: An easy way to do step 1 might be to pick a random number between 1 and 450 and play that many plies (or restart if the game ends by then).
I'll drop talking about using the first phase, but I just want to mention that it's useful information when you use 800 nodes. I assume the first phase is with temperature and/or Dirichlet noise? Otherwise we would get nearly identical games until phase two starts. I'm not really happy with just choosing a fixed move number to start the second phase; game length statistics will vary a lot for different nets... What about a relatively flat exponential distribution for choosing the move number at which phase two starts? I would expect relatively short games using 1 node and temp/noise, and it would be a waste to have to restart many times because of a (too high) move number.
A flat exponential should work fine. You can't use temp as is, since with only 1 node temp is meaningless. Instead you want to choose proportionally to policy (with noise added).
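Putting those two pieces together as a sketch: an exponential draw for the number of phase 1 plies and, for each of those plies, one NN eval whose policy is mixed with Dirichlet noise and sampled from directly (the mean length, cap, and noise parameters are illustrative, not lc0 defaults):

```python
import numpy as np

def phase1_length(mean_plies=60, max_plies=450):
    """Relatively flat exponential for the number of exploration plies,
    capped so we rarely have to restart because the game ended first."""
    return int(min(np.random.exponential(mean_plies), max_plies))

def sample_phase1_move(policy, eps=0.25, alpha=0.3):
    """One NN eval at the root, no search: mix Dirichlet noise into the raw
    policy over legal moves and sample a move index proportionally."""
    policy = np.asarray(policy, dtype=float)
    noise = np.random.dirichlet([alpha] * len(policy))
    mixed = (1.0 - eps) * policy + eps * noise
    return int(np.random.choice(len(mixed), p=mixed / mixed.sum()))
```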
Ok, that's what I meant. What is meant by 1 node then?
Current training uses 800 NN evals to decide where to move. Since with this issue we don't train on the first part and only want to produce a variety of positions, we only need to run the NN once on the root position for each move and pick proportionally to policy.
I understand that we only do one NN eval at the root and look at policy, but I still don't understand the difference between what I call 1 node and what you call proportional to policy... Anyway, as I understand it (I just reread the DM papers to check), you can actually apply temp to the root policy (the distribution of move probabilities), basically sharpening (or softening, for temp > 1) the distribution. Adding Dirichlet noise really changes the distribution, so it can also change the move even for temp = 0. I guess it makes the most sense to use both in phase one, right?
Temp is applied at the root, but the formula talks about the number of visits each of the children got. At 1 node, there are no child nodes.
Ok, that makes sense. From the Go paper it appears that the policy is trained to resemble the visit distribution including the temp effect... I always thought this was only done just before move selection!? Anyway, sampling from the pure policy should be equivalent to some positive value of temp, so that already gives some variation. So I should look at the code to see if it's possible to add additional softening to the distribution, if that would be desirable. To really explore we would also add Dirichlet noise. If you could point out the version that includes changing parameters during game generation I would be grateful (I looked for "Hanse" but did not find it yet).
My apologies, but could you please explain what a "value guard" is (or link to some research paper)? My Google searches for "mcts value guard" and similar did not turn up anything.
When I say it's orthogonal, I mean it has nothing to do with search. It's mitigating the effects (both good and bad) of temp by only selecting moves that are close enough in quality.
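In other words, something like the following sketch, where the quality margin is an invented parameter:

```python
import numpy as np

def value_guarded_pick(moves, q_values, visit_counts, margin=0.05):
    """Keep temperature-style randomness, but only among moves whose search
    value is within `margin` of the best move; among those, sample
    proportionally to visit count as usual."""
    q = np.asarray(q_values, dtype=float)
    allowed = q >= q.max() - margin
    weights = np.where(allowed, np.asarray(visit_counts, dtype=float), 0.0)
    probs = weights / weights.sum()
    return moves[int(np.random.choice(len(moves), p=probs))]
```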
The policy distribution can still be shaped in a temperature-like manner, acting directly on the policy to sharpen or flatten it. This could be important for long phase 1 playouts if you want the resulting positions to be something reasonable. Making each move proportional to policy (without sharpening) might end up producing many odd moves in a long phase 1 playout. That might be fine, but it might not. It would be easy enough to implement a tau that defaults to 1 (does nothing). If tau is anything else there would be an exponentiation and renormalization at every ply, so there would be some cost, but most likely it is trivial compared to the NN.
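That shaping step is just an exponentiation and renormalization of the raw policy, along these lines:

```python
import numpy as np

def shape_policy(policy, tau=1.0):
    """Sharpen (tau < 1) or flatten (tau > 1) the raw policy before sampling;
    tau = 1 returns it unchanged, so the extra cost is only paid on request."""
    policy = np.asarray(policy, dtype=float)
    if tau == 1.0:
        return policy
    shaped = np.power(policy, 1.0 / tau)
    return shaped / shaped.sum()
```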
With #964, one could set a very high @Tilps is still adjusting those selfplay options, and practically it seems like there will still be some "usual temperature" that combines exploration and training feedback, but maybe it's good enough to close off this issue.
We have two competing goals:
A) Get an accurate value of positions.
B) Explore different positions so we learn about a variety of different positions.
For B we have T=1 for the entire game. But this reduces the accuracy of A.
Proposed method:
2a) For these moves do not record anything in the training data file
For picking N we want a good number of games that start near the opening so we can learn values of opening moves. But we also want to get into unusual middle/endgame positions.
Using this method, the time spent in step 1 is wasted in that we generate no data to feed back to our NN. But this step is 800 times faster because we only do one playout for each move in that step.
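To make "wasted but cheap" concrete, a back-of-the-envelope comparison; only the 800:1 playout ratio comes from the proposal, the ply counts are invented:

```python
# NN evals dominate self-play cost, so compare eval counts per phase.
phase1_plies, phase2_plies = 200, 80
phase1_evals = phase1_plies * 1        # one playout per exploration move
phase2_evals = phase2_plies * 800      # full search per recorded move
overhead = phase1_evals / (phase1_evals + phase2_evals)
print(f"exploration overhead: {overhead:.1%}")   # roughly 0.3%
```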