Separate exploration from training feedback #720
Conversation
I had imagined something like this would not be done in the engine, but rather after we solved starting training games from opening books. But this way could work too.
@eddh that would be a valid implementation too, but the problem is that it is not possible to know, from the client's point of view, when a temperature-influenced move happened.
This is a small detail (also discussed in Discord), but it may be better to choose the "subgames" after the game is complete, instead of branching mid-game. This way you can choose the moves based on information about all moves of the game, instead of only the moves up to the point of the branch, which leaves flexibility for future improvements in the subgame selection process. Along this line of thinking, you may also consider a more sophisticated exploration strategy than just temperature. This is not my area of expertise, but there are a lot of papers online about "experience replay" which consider various methods, of which I think this is a special case. (My apologies if you are already familiar with these.)
The branched subgames are not started until the current game finishes, see: https://github.com/LeelaChessZero/lc0/pull/720/files#diff-679c205f3774da8da1796cfce32853adR247 so after a game is played a set of sub-games is returned. I'm going to prepare a histogram with the starting plys of each game; with that information it will be possible to pick an appropriate way of filtering the sub-games.
Sure, but my understanding is that the branches are selected while the game is being played, before it is finished. Is this correct? My suggestion is to do the branch selection after the game is finished, so that more information can be incorporated.
Yes, the branches are generated when a temperature-influenced move that does not have the highest number of visits is made.
For a first approach, I agree something like temperature is fine. However, it would take no extra work and be future-proof to do the selection after the game ends. Just do the random branching selection (using temperature) on the finished game, instead of the in-progress game. Tilps mentioned this in Discord and I agree. Does this make sense? Is there any advantage to doing the selection on an in-progress game that I am overlooking?
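For illustration only, here is a minimal sketch of the post-game selection suggested above, assuming the per-ply visit counts from search are retained until the game finishes. All types and names (`PlyStats`, `BranchPoint`, `SelectBranchesAfterGame`) are hypothetical, not lc0's actual API, and the temperature is fixed at 1 for brevity:

```cpp
// Sketch of selecting branch points only after the game is complete: walk the
// stored per-ply statistics and sample with temperature which plies to branch
// from, keeping only picks that differ from the move actually played.
#include <random>
#include <vector>

struct PlyStats {
  std::vector<int> moves;   // candidate moves at this ply (hypothetical encoding)
  std::vector<int> visits;  // visit counts from the search at this ply
  int played;               // index of the move actually played (zero-temp)
};

struct BranchPoint {
  int ply;
  int move;  // the alternative move to start the sub-game with
};

std::vector<BranchPoint> SelectBranchesAfterGame(const std::vector<PlyStats>& game,
                                                 int temp_cutoff, int max_branches,
                                                 std::mt19937& rng) {
  std::vector<BranchPoint> branches;
  for (int ply = temp_cutoff; ply < static_cast<int>(game.size()); ++ply) {
    if (static_cast<int>(branches.size()) >= max_branches) break;
    const PlyStats& s = game[ply];
    // Sample proportionally to visit counts (temperature = 1 for simplicity).
    std::discrete_distribution<int> dist(s.visits.begin(), s.visits.end());
    const int sampled = dist(rng);
    // Only branch when temperature would have picked something other than
    // the move that was actually played.
    if (sampled != s.played) branches.push_back({ply, s.moves[sampled]});
  }
  return branches;
}
```

The advantage claimed above is that this post-game variant can use information from the whole game (e.g. its final length or result) when filtering branches, without changing how the main game is played.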
How much thought has been given to overtraining with this PR? Let's say you have a split at move 7, 15, 36 and 89, or something like that. Since the longer a game goes, the more likely splits become and the more of them there will be, this will shift the dataset toward middlegame, and even more toward endgame positions.
Yes, the pool of sample positions will be modified. We could use that to benefit learning positions that are harder. Raw data of the graph: Note that currently lc0 starts all of its games from ply 0.
A shift toward more positions from mid/endgame in training might even be good, seeing the current weaknesses of Leela.
Comparison of training-data position plys in current master vs PR720: (See it in Plotly: https://plot.ly/~danieluranga/15/all-positions-plys/)
Before the temperature cutoff is reached use temperature moves the usual way.
Commit 4756198 made this sub-games mechanism only be used after the temperature cutoff. This way opening moves will use temperature moves normally, but endgame ones will be effectively zero-temp (like A0 did) without sacrificing exploration.
For moves up until the split you should take the best result. But for the sub-games with a bad result (draw/loss) you should keep the move that caused that result in training, but with the eval of loss or draw. The rest of the game after that sub-game should be discarded. Leela needs to learn which moves are bad too.
Is there anything to handle the case where a sub-game re-enters a position from the original game, such as with a move transposition? Is that rare enough not to matter?
Here is what I was thinking of how the tree would be processed. Is this correct @DanielUranga? Note that this is much more complex than any real game would get, but I thought I should make it a complex case. I think that in this complex case there is room for suggesting saving White's moves so that it learns too, and this is assuming both sides can make temp moves.
@Veedrac no, there is not, and that could be a problem. The good thing is that it isn't that bad; there will be some repeated positions, but I'm not sure if that is a very serious issue.
The way it works is as follows: first it gets to a random position by doing temp moves before the temp-move cutoff (exactly the same way training works now). After the cutoff, when a temp move happens, that move goes into its own sub-game and the current game is played to the end. @dtracers is that what you were asking?
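As a rough illustration of that description (hypothetical names, not the PR's actual code, temperature fixed at 1 for brevity): before the cutoff the sampled move is played directly; after the cutoff the main game always plays the most-visited move, and any differing temperature pick is queued as a sub-game to be resumed later.

```cpp
// Sketch of the in-game branching rule: temperature moves are played normally
// before the cutoff; afterwards, a temperature pick that is not the
// most-visited move spawns a queued sub-game while the main game continues
// with the zero-temperature (most-visited) move.
#include <queue>
#include <random>
#include <vector>

struct MoveStats {
  int move;    // encoded move (hypothetical encoding)
  int visits;  // visit count from search
};

struct SubGame {
  std::vector<int> moves;  // moves leading to the branch point, plus the temp move
};

int SampleWithTemperature(const std::vector<MoveStats>& stats, std::mt19937& rng) {
  // Sample proportionally to visit counts (temperature = 1 for simplicity).
  std::vector<double> weights;
  for (const auto& s : stats) weights.push_back(static_cast<double>(s.visits));
  std::discrete_distribution<int> dist(weights.begin(), weights.end());
  return dist(rng);
}

int BestIndex(const std::vector<MoveStats>& stats) {
  int best = 0;
  for (int i = 1; i < static_cast<int>(stats.size()); ++i)
    if (stats[i].visits > stats[best].visits) best = i;
  return best;
}

// Called once per ply; returns the move the main game plays and, after the
// cutoff, may push a branched sub-game onto the queue.
int PickMove(const std::vector<MoveStats>& stats, const std::vector<int>& history,
             int ply, int temp_cutoff, std::mt19937& rng,
             std::queue<SubGame>& pending_subgames) {
  const int sampled = SampleWithTemperature(stats, rng);
  const int best = BestIndex(stats);
  if (ply < temp_cutoff) return stats[sampled].move;  // normal temperature move
  if (sampled != best) {
    // Branch: remember the alternative, but the main game keeps the best move.
    SubGame branch;
    branch.moves = history;
    branch.moves.push_back(stats[sampled].move);
    pending_subgames.push(branch);
  }
  return stats[best].move;  // zero-temperature move for the main game
}
```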
Yes, that is what I am talking about.
So for example: if your pink result was a temp move by White and White wins the game, but the green was a lost game for White.
@dtracers ahh, kind of minimax. Could be an improvement maybe, but it seems a bit hard to get right and I'm not sure about the subtle implications it could have. For this PR implementation I will try the simpler and more straightforward approach, which should still work better than the current training game generation.
It seems to defeat the whole purpose of exploration though? The only reason to do a temp move is that it might be a sacrifice that looks losing but actually wins. So if we throw away the ones that work and the ones that lose, it does improve quality, but then there is no reason not to just have no temperature and save all this complication. It seems that we need to throw away most of them because they will normally fail, but keep the ones that work. Otherwise I think we can just copy the successful A0 approach that we haven't tried yet.
If a temp move wins, all of the positions after that previously seemingly losing move will be scored as winning. That way the network gets to see positions it wouldn't have played, but the accuracy of the scoring process is still the same as zero-temp.
I am confused as to how it would work any better if the temp moves have no chance to change the entire score.
The positions after the temperature move will have their value updated, so even if the moves before the temp move aren't scored differently, the search will be different the next time it reaches the same position, possibly causing it to have the move that was a "temperature blunder" as its first choice the second time.
But what I am saying is that it may not reach that position at all, because it may think it is losing even though it is winning.
The search will have the position after that as winning, and that will increase the probability of it being played. Also, it is not required to be so exact; the only requirement is that it should converge to optimal play. This is just to remove the bias of "waiting for a mistake to happen" in endgames, but without losing the exploration.
Added an option to select how many sub-games can be started in relation to the total amount of full games played.
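The exact semantics of that option aren't spelled out here, but a plausible reading is a cap on branched games as a fraction of completed full games. A tiny sketch of such a limiter (the struct and field names are made up for illustration):

```cpp
// Sketch of gating sub-game starts by a ratio to full games played, e.g.
// max_ratio = 0.5 allows at most one sub-game for every two finished games.
struct SubGameLimiter {
  double max_ratio;
  int full_games_played = 0;
  int sub_games_started = 0;

  void OnFullGameFinished() { ++full_games_played; }

  bool MayStartSubGame() const {
    return sub_games_started < static_cast<int>(max_ratio * full_games_played);
  }

  void OnSubGameStarted() { ++sub_games_started; }
};
```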
My problem is that this will not lead to optimal play, as in game 85 of TCEC: the non-temp move would lead to a draw (saccing the bishop like Leela wants), and this would cause a blunder much lower in the tree search.
@dtracers I agree with you; I have mentioned it in the Discord a few times before. I do think it's explained a bit confusingly in your last post, but a minimax should give the best move. The temp move should be evaluated from the side that played it. Black turns a White draw into a White loss? Score everything before as the loss. Black turns a White draw into a White win? Score everything before the split as a draw. However, in both those cases, if White made the temp move: draw and win respectively. This would much better suit the extra exploration; otherwise the exploration is simply wasted. If your temp move turns a loss into a win but everything before is still rated as a loss, then the NN is still discouraged from ever reaching that position again, even though it was winning.
I was hoping the graph picture I made would help explain that. But that is exactly what I mean. It also introduces some AB theories. Depending on temp, this could cause many min/max trees.
Min/max scoring, that makes sense, and it would be an improvement indeed; it would make it learn faster, I think.
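To make the min/max scoring idea concrete, here is a sketch in hypothetical terms (not an agreed lc0 change), assuming results are encoded from White's perspective as -1 (loss), 0 (draw), +1 (win):

```cpp
// Sketch of min/max scoring of the split: positions before the split are
// labelled with the result preferred by the side that made the temperature
// move, matching the examples above (Black temp move: draw->loss scores loss,
// draw->win scores draw; White temp move: draw and win respectively).
#include <algorithm>

enum class Side { kWhite, kBlack };

// main_result: result of the main (zero-temperature) game.
// branch_result: result of the sub-game started with the temperature move.
// temp_mover: the side that played the temperature move at the split.
int ResultBeforeSplit(int main_result, int branch_result, Side temp_mover) {
  return temp_mover == Side::kWhite ? std::max(main_result, branch_result)
                                    : std::min(main_result, branch_result);
}
```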
Skip adding a sub-game if the sub-game start position happened in the parent game, or if the sub-game start position is not reached through a capture or pawn push. This is done to maximize the overall position diversity.
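A rough sketch of that filter (helper names and position hashing are assumed, not taken from the PR): a branched sub-game is kept only if its starting position never occurred in the parent game and the branching move is irreversible (a capture or a pawn move), which rules out easy transpositions back into the parent game.

```cpp
// Sketch of the position-diversity filter described in the commit message.
#include <cstdint>
#include <unordered_set>

struct Move {
  bool is_capture;
  bool is_pawn_move;
};

bool ShouldKeepSubGame(const std::unordered_set<uint64_t>& parent_position_hashes,
                       uint64_t subgame_start_hash, const Move& branching_move) {
  // Reject positions already seen in the parent game.
  if (parent_position_hashes.count(subgame_start_hash) > 0) return false;
  // Reject branches not reached through a capture or pawn push.
  if (!branching_move.is_capture && !branching_move.is_pawn_move) return false;
  return true;
}
```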
Did a test generating 10k unique positions with this PR, one run allowing 50% of sub-games and the other allowing 0% (equivalent to simply using endgame temp=0). Results: So this PR improves the generation of unique positions.
What is this waiting on to be merged? This seems like a good candidate for T50 or the next small net.
I like the idea, but I'm not sure about the implementation; cloning trees seems heavyweight and unnecessary. I think it would be cleaner to combine this with the training-from-starting-positions idea (#541). For example:
1. Implement #541 to support a list of openings/positions to train from.
2. When a game split happens, the new position is pushed into that list.
"Ideal" would be to just have #541 and then have the server generate the lists of startpos itself, but server-side changes are probably more complicated. Also, the problem with the current approach (rather than the server deciding which parts of games to replay and sending positions) is that fast clients will generate 30 variants of quite similarly structured positions while slower clients won't, which may lead to overfits and other effects. We already have a similar problem of slow clients only generating short games (because they are killed by a new network before they finish a long game), and now the same problem will be multiplied.
I think Go has slow clients finish their games instead of running matches. We should only send matches to fast clients and let slow ones finish their game and then go straight to the next network. We probably don't even need logic to throw away the one game in the rare case the next network doesn't pass the gate, but it would be easy for the server to ignore games from bad networks. You could similarly have slow clients still do the same number of branches on the old network, but I also think the list of startpos is more flexible for fixing any kind of bugs or bias, or for adding positions other engines say are blind spots, whether zero or non-zero or TB.
Just for reference (although you may already know this paper): this PR reminds me of KataGo's game branching technique (section 6.2.2).
What is this waiting on to get merged in? We can't really test it at a large scale until after it is on training clients.
How does this PR stand with #964 being tested and found to be detrimental during early T59?
I believe badgame split (#964) is basically the same idea implemented. I'm closing the PR, but feel free to reopen if you think it makes sense.
Closing since we already have #964
This PR is intended to solve both #237 and #342.
The idea is to branch a new "sub-game" every time temperature causes the selection of a move that does not have the highest number of visits. The original game is then played with the zero-temp move. Once that game finishes, the branched sub-game is resumed from just after the temperature move.
Using this method, the game result value (Z) that is used to score any position is the result of playing with the equivalent of zero temperature, while still getting the exploration that temperature would normally cause.
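In other words, every position is labelled with the final result of the game it belongs to, and since each temperature pick is moved into its own sub-game, that result always corresponds to a (near-)zero-temperature continuation after the cutoff. A simplified sketch with hypothetical types:

```cpp
// Sketch of producing (position, Z) training pairs from a finished main game
// and its branched sub-games: each position gets the result of its own game.
#include <string>
#include <utility>
#include <vector>

struct FinishedGame {
  std::vector<std::string> positions;  // e.g. FENs of every position in the game
  int result;                          // -1 / 0 / +1 from White's perspective
};

std::vector<std::pair<std::string, int>> MakeTrainingPairs(
    const FinishedGame& main_game, const std::vector<FinishedGame>& sub_games) {
  std::vector<std::pair<std::string, int>> out;
  for (const auto& pos : main_game.positions) out.emplace_back(pos, main_game.result);
  for (const auto& sub : sub_games)
    for (const auto& pos : sub.positions) out.emplace_back(pos, sub.result);
  return out;
}
```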
Example of the output with added debugging prints:
This means that 2 sub-games were added: one will be continued from an alternative to `1. e2e4` and the other from an alternative to `3. b1c3`. After that the first game is resumed:
And then the second game:
I'm still not sure that the logic is completely polished; there might still be a few bugs, so I would like more eyes on it.
Thoughts?