Improve training data for learning drawn endgames (with lower temperature) #237
Comments
Not sure if this needs to be explicitly stated, but the problem with having an incorrect value for these drawn positions is that search will play moves leading into positions that seem to favor one side, say a 60% win rate, instead of favoring a move that leads to an actual 55% win rate position. |
This is good analysis. We've talked about trying 0.5 temp and then 2x playouts each for a day at the end of test 10 to see if they solidify things.
|
That planning card was an open-ended cover-all for various sorts of things, including dynamic temperature. I guess this is reasonably persuasive that temperature blunders affect one side disproportionately more than the other; can you verify and expand upon that? My request would be that any solution to this problem needs to stay "Zero": things like starting temperature decay after a fixed move count, or targeting a temperature based on a specific game rule (such as the 50 move rule), aren't really acceptable to me. Something more game-independent would be to pick a (small) probability such that each move in a selfplay game has that chance to begin a temperature decay at a certain rate. For instance, by analogy to the noise's use of average branching factor, we could decide that average game length is also "fair game", and choose 1/average_game_length as the probability to start temperature decay, and also use that as the duration over which to decay temperature. (Maybe such a scheme would even be capable of supplanting resign?) |
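The stochastic decay idea above can be sketched as follows. This is an illustrative Python sketch, not lc0 code, and it assumes an average game length of 100 plies and a linear ramp down to zero:

```python
import random

AVG_GAME_LENGTH = 100  # assumed average selfplay game length, in plies

def stochastic_decay_temperatures(initial_temp=1.0):
    """Yield a temperature for each move. Every move has a
    1/AVG_GAME_LENGTH chance to trigger a decay that then ramps the
    temperature linearly to 0 over AVG_GAME_LENGTH moves."""
    decay_start = None
    move = 0
    while True:
        if decay_start is None and random.random() < 1.0 / AVG_GAME_LENGTH:
            decay_start = move
        if decay_start is None:
            temp = initial_temp
        else:
            progress = min(1.0, (move - decay_start) / AVG_GAME_LENGTH)
            temp = initial_temp * (1.0 - progress)
        yield temp
        move += 1
```

On average the decay begins around ply 100, so most games finish with a reduced temperature while no fixed trigger move is ever hard-coded.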
Also perhaps this should have gone in the training repository? This really isn't a topic that is relevant to the engine. |
Also also, keep in mind that perpetuals may be the sort of suitably subtle technique/position that requires suitably low LRs to be learned, and this project hasn't really ever fully trained a net through all the same LRs as DeepMind. For all we know, Leela may yet learn this correctly even at temp=1. |
There might need to be a small lc0 code change to support "average game length": Line 78 in 1b68b95
Again from http://lczero.org/stats the median game length looks to be around 100 ply with a tail towards 450, so potentially we'll need to support values over 100 moves for temp decay. |
In Leela Zero (Go), moves with only one visit will not be picked; how would this influence the probability? Another thing to consider is that noise levels are based on the average number of legal moves in the game, which implies the noise might be too high in positions with few legal moves. A Dirichlet noise that adapts depending on the actual number of legal moves might help. |
Could you base the temperature drop on the number of pieces left in the game or would that be considered non-zero? In my mind, it would be no different than basing it on average game length but I am sure some would disagree. |
I think we could just try temperature = 0.5 the whole game, and not have to worry about zero/non-zero or other more complicated schemes. |
A fixed temperature lower than 1 should be good to avoid blundering draws, and at least for the example position from the initial comment, a value of 0.89 is good enough. A value of 0.5 might be too low for game variety though. Here's an example distribution of visits from startpos:
And here are the probabilities of picking those moves with various temperatures:
So with T=1 and this particular network, 50 games out of 100 would play something other than c2c4, while with T=0.5, only 15 of 100 would play something different. Conveniently with T=0.8, the single-visit moves in my above output round to 0.0% probability of being picked, so there would be less of a need for special logic to filter out a "minimum number of visits for temperature move selection." An experiment with T=0.5 is reasonable, but it seems likely to overfit due to the lack of variety, although a formal experiment result would be good. For a single fixed temperature, I would suggest something between 0.8 and 0.9. |
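As a sanity check on these numbers, picking a move proportionally to visits^(1/T) can be computed directly. The visit counts below are illustrative, not the actual network output shown above:

```python
def move_probabilities(visits, temperature):
    """Probability of each root move being picked when sampling
    proportionally to visit_count ** (1 / temperature)."""
    weights = [v ** (1.0 / temperature) for v in visits]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative 800-visit distribution: one favored move plus a tail,
# including some single-visit moves.
visits = [400, 250, 100, 40, 5, 2, 1, 1, 1]
for t in (1.0, 0.8, 0.5):
    probs = move_probabilities(visits, t)
    print(f"T={t}: top move {probs[0]:.3f}, single-visit move {probs[-1]:.5f}")
```

Lowering the temperature concentrates probability on the high-visit moves, and already at T=0.8 the single-visit moves round to 0.0%.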
Getting more complicated in allowing more early-game variety would perhaps start with T>1 and then decay to an end-game target T>0. But perhaps that won't be necessary if the initial temperature is not too low. Separately, regarding the lessened issue of single-visit/low-visit moves with T<1: one of the main concerns in #8 about improving tactics by visiting potentially undesired moves was that these bad moves could be played with T=1. However, that concern should be reduced with, say, T=0.8, as the temperature favors the good moves. For example, if 2 visits were forced for every root move, and assuming 50 bad moves and 1 actually good move (so search puts 100 visits into the 50 wrong moves and 700 visits into the right move):
Similarly if there were 2 possible equally good moves and 50 bad moves:
|
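To quantify that forced-visit scenario, here is a quick sketch using the same visits^(1/T) sampling rule. At T=1 the chance of playing one of the forced bad moves is 100/800 = 12.5%, and at T=0.8 it drops to roughly 3%:

```python
def pick_probabilities(visits, temperature):
    """Sampling weights proportional to visit_count ** (1/temperature)."""
    weights = [v ** (1.0 / temperature) for v in visits]
    total = sum(weights)
    return [w / total for w in weights]

# 1 good move with 700 visits, 50 bad moves forced to 2 visits each.
visits = [700] + [2] * 50
for t in (1.0, 0.8):
    bad = sum(pick_probabilities(visits, t)[1:])
    print(f"T={t}: chance of playing a forced bad move = {bad:.1%}")
```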
I'm in favor of this T=0.8 idea. Finding a somewhat reasonable compromise between blunders and exploration should be good. The only problem is that the correct point probably depends on other parameters such as puct, softmax, number of visits per move... and thus this kind of analysis would be needed every time these parameters are changed. |
I think a lower value like that would work at least somewhat better with any puct. Visits would be more variable, since with more visits there is more chance of a bad node getting a visit or two, even just due to noise.
|
Why not play half games with low temperature and half with high temperature? Training can still visit new ideas, but will see older ideas in a truer light. |
That's not too far from stochastic temperature decay, which I posited on Discord |
@dubslow : A more distant question, although related to the effect of temperature injecting too much noise into value head training data: are you also seriously considering trying to train the value head against the average of Q and z? To start with, is the Q value currently captured in the training data? |
@dubslow I was curious to what you were referring to, so I searched around on Discord and found your messages. To encourage discussion, and for anyone else who is curious, here it is:
My thoughts: I like it. It theoretically gets us the best of all discussed options. I would like to see this tested. |
In looking at #290, there were some positions that 10520 really thought were favorable to one side, making it want to avoid a TB draw move.
According to lichess tb, white has 19 moves that are draw and 3 moves that lose DTZ 1. After c4a4, black has 6 moves that draw and 8 moves that lose DTZ1-15. After a black draw move, white has 20 moves that are draw and 2 moves that lose DTZ 1. So in these positions, if we assume each move is just equally likely to be played, black has ~60% chance to blunder and ~10% for white. Seems to somewhat line up with 10520 having a much higher win rate for white than the TB draw.
White has 13 moves to draw and 6 moves to lose DTZ1. After h3g4, black has 7 moves to draw and 8 moves to lose DTZ1-9. Then white has 16 moves to draw and 4 moves to lose DTZ1. So again, as long as white doesn't play a move that immediately sacrifices a piece, it's pretty safe, and search probably keeps visits away from these immediate sacrifice moves anyway, but black needs to try quite a bit harder to avoid blundering the draw.
Here before the 26th capture, lichess says any move white makes (out of 8 possible moves) should lead to a draw. After b3b4, black has 1 correct draw move and 10 losing moves DTZ1. Again white has 10 possible moves all leading to a draw. Then black has 2 draw moves and 7 losing moves DTZ1-7. So these positions seem to be very easy for white to maintain a draw while very difficult (in terms of possible random moves) for black to do the same. So for these positions, it does seem like lower temperature to correctly reach 50-move draw will teach the network the same thing that TB already knows -- it's a draw. |
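The tablebase counts above can be turned into per-move blunder chances under the stated assumption that each legal move is equally likely:

```python
def blunder_chance(draw_moves, losing_moves):
    """Chance a uniformly random move throws away the tablebase draw."""
    return losing_moves / (draw_moves + losing_moves)

# Counts from the lichess tablebase positions discussed above.
print(f"black after c4a4: {blunder_chance(6, 8):.0%}")   # ~57%, i.e. ~60%
print(f"white reply:      {blunder_chance(20, 2):.0%}")  # ~9%, i.e. ~10%
```

Of course search does not pick uniformly at random, but the asymmetry between the two sides is what matters here.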
I have some interesting data that tries to find the correlation between game results from T=1 and game results using T=0. It is related to this issue, but I suggest a slightly different method to improve the value head; I have opened a new issue for it: #330. My observations were that T=0 even with a small number of playouts gives a more accurate result than T=1 with 800 playouts. So we can continue to gain the benefits of T=1 by using T=1 to generate positions, and using T=0 with a small number of playouts to get a more accurate estimate of the end result for each of the positions. |
@Tilps It looks like the maximum game length lc0 currently allows is 450 ply: Line 86 in 3287af2
Although from CCCC Round 1, Lc0 vs Fire had 237 moves (and Crafty vs Lc0 had exactly 225 moves, matching up with the lc0 hardcoded game result cutoff). Assuming the 225 move limit is acceptable, then setting that for temp decay should be similarly acceptable? So move 1 for both white and black has temperature = 1 while move 100 has temperature = 0.55. I suppose if one wants to be more zero, the client could just keep playing past 450 ply, and the server keeps track of the longest game length to compute a new temp decay moves target? |
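A linear decay from 1.0 at the start to 0 at the 225-move cutoff reproduces the ~0.55 figure at move 100. This is an assumed formula for illustration, not necessarily lc0's exact schedule:

```python
def decayed_temperature(move, initial=1.0, decay_moves=225):
    """Linear temperature decay over decay_moves full moves
    (assumed schedule; lc0's implementation may differ)."""
    return max(0.0, initial * (1.0 - move / decay_moves))

print(decayed_temperature(1))    # ~0.996
print(decayed_temperature(100))  # ~0.556
```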
Digging into the history a bit more, 450 ply came from lczero: the intent there was to get << 1% of training games to hit that limit, and it sounds like currently we're at 0.27% of games stopping at move 225.
Rerunning the numbers with 11089 and various visits for 5th rank:
Those correspond to the continuing check being played with T=1: 96.9%, 97.6%, 98.3%, 98.7%. And 4th rank:
With respective 95.0%, 96.1%, 97.0%, 97.8% probabilities to continue the perpetual. And just calculating the probability of successful checking assuming a single averaged probability: instead of adjusting temperature, simply doubling visits should lead to significantly more games that correctly play to a draw even with T=1, because continuing the perpetual is the only reasonable play, and search will gladly put more visits into it. (For reference, in this position the check is Q: -0.1 and the next best move is Q: -0.8.) (Increasing visits improves the value head while keeping existing temperature randomness, and increasing visits also improves the policy head while keeping existing noise, without needing #8.) |
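Taking those per-move probabilities, and assuming roughly 25 consecutive correct checks are needed to reach the 50-move draw, the draw chance under T=1 roughly doubles when going from 800 to 6400 visits:

```python
# Per-move probabilities of continuing the perpetual (the 4th-rank
# numbers above) at 800, 1600, 3200, and 6400 visits.
per_move = [0.950, 0.961, 0.970, 0.978]
for visits, p in zip((800, 1600, 3200, 6400), per_move):
    print(f"{visits} visits: per-move {p:.1%} -> "
          f"25-move draw chance {p ** 25:.0%}")
```

The chance of completing 25 checks rises from about 28% at 800 visits to about 57% at 6400 visits.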
Doesn't this cut both ways? We will get fewer bad effects from temp, but we'll also explore less. |
Depends on the position, if there are multiple good moves, potentially there will be more exploration and less bias towards prior. For example same 11089 network but from startpos:
Notice how each doubling of visits increases the highest prior move's visits by less than double, i.e., other moves are more likely to be picked by T=1, so more diversity / exploration here. |
Oh, that's really interesting. To me this suggests that training on more nodes might change the preferred opening to e4.
With the conclusion that t53's 0 endgame temperature was weaker, I started looking at this position again from the original comment to see what temperature settings would lead to the most position variety while not blundering the draw: Curiously, I noticed 53316 frequently blundering black by walking the king down to rank 3, and it's because it doesn't think it's that bad for black!
Turns out the network would happily keep on checking even though there's an opportunity to exchange rooks to free the king. In this case, it's another example of the network knowing the move would be good if only the prior didn't hinder search, so the #8 out-of-order nature of visiting root children first would have allowed 53316 to direct search towards the capture:
@Tilps to be clear, fpu reduction of 0 at root would find h3c3 as well. Looks like t53 didn't learn this exchange tactic because it wouldn't have gotten itself into this position in the first place. Generally, this is one example of how avoiding blunders can prevent selfplay from generating valuable learning opportunities, e.g., uncovering these moves or not forgetting that a nearby position is indeed bad. Here's some analysis running from the original position above with various temperatures/offsets, with a high plain resign percentage so that draws play out to 3-fold while blunders for either side end quickly:
|
Here's an analysis similar to @Ttl's #710 (comment), starting from a position and generating selfplay games: might just be noise, but 0.6 temperature with a more negative offset seems to increase uniques as well as correct outcomes. Although the same happens for temperature 1 when going from 0 offset to -50 offset: 922 -> 984 uniques and 323 -> 882 correct. This might also be related to how I adjudicate the games soon after a blunder, so reducing blunders with more accurate outcomes and high temperature allows more variety of draws. Here's the most common drawn game and how many for each:
|
With #964, looks like at least the original position here with 59350 would split out all the bad moves that fail to draw. Here's search with 800 visits without noise as well as for some nearby positions:
Checking with noise added does sometimes result in a bad move getting bumped up to more visits, roughly 20% of the time, e.g.:
So overall, seems like badgame split should be able to help with these types of endgames by reducing the effect and likelihood of playing into bad moves. |
From TCEC Season 13 - Division 4 Game 46 DeusX 1.0 vs LCZero 16.10161, white can perpetually check from move 51 Rh8+. However, 10161 value and search have win rates that favor black for all these checked positions as white chases the king down and up.

Looking at these related positions as if it were self-play with 800 visits but without noise, here's the probability that white would check when the king is on a rank:
(Similarly, when the black king is on the 4th rank, it doesn't want to move to the 3rd rank as that would favor white, and with 800 visits, search wants to move back up to 5th rank with 737 visits = 92%.)
With the current move temperature set to 1.0 and white playing the perpetual check move at an average of 96% of visits, self-play will correctly draw these positions via the 50-move rule only if white correctly checks ~25 times in a row = 36% of the time. This means white is more likely than not to blunder due to temperature, leading to the network learning that this drawn position favors black.
The "average" training data for this position blunders the draw.
If instead the temperature was 0.5 for these moves, the probability that white plays the check move increases to 99.8%. And correctly playing that across 25 moves to draw happens 96% of the time.
If we set a target of half of self-play correctly playing out these positions, assuming the 4% individual move blunder rate, a temperature of 0.89 would make the correct move be picked 97.3% of the time, i.e., a 50% chance of it being picked 25 times in a row.
To get to 0.89, self-play could just play with a lower temperature instead of 1.0. Another way is with tempdecay: in this case, the perpetual started with move 51, so a tempdecay moves of 463 and an initial 1.0 temperature results in the desired value. (It looks like the maximum game length shown on the stats page is 450, so that might be a convenient number to pick.) Alternatively, if tempdecay moves is set to say 50 so that move 51 has 0 temperature, the server can tell half of self-play requests to play with 0 tempdecay and the other half with 50 moves. And of course there are many other ways to adjust temperature given the existing initial temperature, decay, and server response distribution values, as well as more complex approaches that require additional client and/or server code.
The main drawback of a lower temperature is losing the benefits of having temperature to begin with. The primary purpose of temperature seems to be to play out positions that search otherwise would not favor, so that future networks update the value head with the possibility of search then favoring those positions (as well as learning the search priors for these positions). Lower temperature means these "other" positions are less likely to be played, so there would be fewer game results for them; however, lower temperature also means the game results that do get fed to training are more accurate (hence this issue), so it's unclear what the actual tradeoff is.
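The temperature arithmetic in this comment can be reproduced with a two-outcome approximation: treat the checking move as holding 96% of the root visits, with sampling proportional to visits^(1/T). This sketch recovers the 36%, 50%, and 96% figures above:

```python
def check_probability(temperature, check_share=0.96):
    """Chance the checking move is picked when it holds check_share of
    root visits, sampling proportional to visits ** (1 / temperature)."""
    a = check_share ** (1.0 / temperature)
    b = (1.0 - check_share) ** (1.0 / temperature)
    return a / (a + b)

for t in (1.0, 0.89, 0.5):
    p = check_probability(t)
    print(f"T={t}: per-move {p:.1%}, 25 checks in a row {p ** 25:.0%}")
```

At T=1 the per-move probability is 96% and 25 in a row happens 36% of the time; T=0.89 gives 97.3% per move and ~50% overall; T=0.5 gives 99.8% per move and ~96% overall.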
@dubslow Are there more details for "Test temperature changes in training"?