
Separate exploration from training feedback #720

Closed

Conversation

DanielUranga
Member

This PR is intended to solve both #237 and #342.

The idea is to branch a new "sub-game" every time temperature causes the selection of a move that does not have the highest number of visits. The original game is then played with the zero-temp move. Once that game finishes, the branched sub-game is resumed from just after the temperature move.

Using this method, the game result value (Z) that is used to score any position is the result of playing with the equivalent of zero temperature, while still getting the exploration that temperature would normally cause.
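Roughly, the idea looks like this (a minimal sketch only, not the actual PR code; all types and names here are made up for illustration):

```cpp
#include <string>
#include <vector>

// Minimal illustrative types; not lc0's real classes.
struct BranchPoint {
  std::vector<std::string> moves_so_far;  // moves leading to the branch, incl. the temp move
};

struct SearchResult {
  std::string best_move;         // move with the most visits (zero-temp choice)
  std::string temperature_move;  // move actually sampled with temperature
};

// One self-play step: branch a sub-game whenever temperature picks a
// non-best move, but keep playing the main game with the best move.
std::string PlayOneMove(const std::vector<std::string>& game_so_far,
                        const SearchResult& search,
                        std::vector<BranchPoint>* pending_subgames) {
  if (search.temperature_move != search.best_move) {
    BranchPoint branch;
    branch.moves_so_far = game_so_far;
    branch.moves_so_far.push_back(search.temperature_move);
    pending_subgames->push_back(branch);  // resumed after the main game ends
  }
  return search.best_move;  // main game stays zero-temperature, so Z is "greedy" Z
}
```

When the main game finishes, each pending sub-game is resumed and played out under the same rule, producing its own training data.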

Example of the output with added debugging prints:

gameready trainingfile ./data-pgltswkheuie/game_000000.gz gameid 0 player1 black result draw moves e2e4 c7c6 g1f3 d7d5 b1c3 c8g4 h2h3 g4f3 d1f3 g8f6 g2g3 e7e6 f1g2 f8b4 a2a3 b4e7 d2d3 e8g8 e1g1 a7a5 a3a4 d8b6 f3e2 f8d8 e4d5 c6d5 c3b5 b8c6 c2c3 d5d4 c1f4 a8c8 a1c1 h7h6 f1d1 e7c5 h3h4 f6d5 f4d2 d8d7 d2e1 c8d8 c1b1 e6e5 b2b4 a5b4 c3b4 c5f8 d1c1 d8a8 e2d1 b6d8 d1b3 g7g6 b5a3 c6b4 e1b4 d5b4 a3c4 d8f6 c1e1 d7c7 c4e5 g8g7 e5c4 a8d8 a4a5 h6h5 b3d1 f6f5 g2e4 f5f6 d1d2 b4d5 b1b3 c7e7 e4d5 d8d5 e1e7 f6e7 d2b2 g6g5 b3b5 d5b5 b2b5 g5h4 g3h4 g7h6 b5b6 h6h7 b6d4 e7e6 g1g2 f8b4 d4f4 e6d5 f4f3 d5f3 g2f3 h7g6 f3e4 b4e7 c4e5 g6g7 e5c4 e7h4 c4d6 h4f2 d6b7 h5h4 e4f3 f2g1 a5a6 g7f6 f3g2 g1e3 b7d6 f6e6 d6b5 e6d5 g2f3 e3g1 a6a7 g1a7 b5a7 h4h3 a7b5 d5c5 b5c3 c5d4 c3e4 d4d3 e4f2 d3d4 f2h1 d4d5 f3f4 f7f5 h1f2 d5d4 f2h1 d4d5 h1g3 d5e6 g3h1 e6f6 h1f2 f6g7 f2h3 g7f8 h3f2 f8g7 f2h3 g7g6 f4e5 g6h6 e5d6 h6h5 d6e5 h5h4 h3f4 h4g4 f4g2 g4h5 e5d4 h5g6 d4e3 g6g5 g2f4 g5g4 f4e6 g4g3 e6f8 f5f4 e3e2 g3g2 f8g6 f4f3 e2e3 f3f2 g6f4 g2g1 f4h3 g1g2 h3f2

Adding game to resumable games, ply: 1, side to move: black.
Adding game to resumable games, ply: 5, side to move: black.

This means that 2 sub-games were added: one will be continued from an alternative to 1. e2e4 and the other from an alternative to 3. b1c3.

After that the first game is resumed:

Resuming game at ply 1, side to move: black.
gameready trainingfile ./data-pgltswkheuie/game_000001.gz gameid 1 player1 white result blackwon moves g2g3 c7c5 f1g2 b8c6 g1f3 g7g6 c2c4 f8g7 b1c3 e7e6 e2e3 g8e7 e1g1 e8g8 d2d4 c5d4 f3d4 d7d5 c4d5 e7d5 c3d5 e6d5 d4e2 c8f5 e2f4 d5d4 e3e4 f5d7 f4d5 a8c8 h2h4 h7h6 c1f4 d7e6 d1d2 g8h7 a1c1 d8d7 c1c5 b7b6 c5c1 c6e7 d5c7 e6a2 d2b4 a2e6 b4d6 f8d8 c7e6 d7d6 f4d6 d8d6 c1c8 e7c8 e6g7 h7g7 f1d1 d4d3 g2f1 d3d2 f2f3 f7f5 g1f2 f5e4 f3e4 g7f6 f2e3 f6e5 f1d3 d6d4 d1d2 c8d6 d2c2 d6e4 c2c6 e4f6 d3g6 d4g4 h4h5 g4g3 e3d2 g3g5 g6f7 g5g7 f7g6 g7d7 d2c2 f6d5 c2b3 d5e7 c6c1 e7g6 h5g6 e5f6 c1c6 f6g5 b3c4 d7g7 c4d3 g7g6 c6c7 a7a5 d3e4 h6h5 e4f3 g6f6 f3g3 h5h4 g3h2 f6f2 h2h3 f2b2 c7c5 g5f4 c5b5 b2b5 h3h2 b5d5 h2g1 f4g3 g1h1 d5d1

And then the second game:

Resuming game at ply 5, side to move: black.
gameready trainingfile ./data-pgltswkheuie/game_000002.gz gameid 2 player1 black result whitewon moves e2e4 c7c6 g1f3 d7d5 f3e5 d5e4 b1c3 d8d4 e5c4 g8f6 d2d3 e4d3 f1d3 d4g4 e1g1 g4d1 f1d1 g7g6 c3e4 f6e4 d3e4 f8g7 c1e3 b8d7 a2a4 d7e5 e3d4 e5c4 d4g7 h8g8 g7c3 c8f5 e4f5 g6f5 a4a5 c4d6 c3b4 e8c8 b4c5 a7a6 f2f3 d6c4 c5e7 d8e8 d1e1 c4b2 h2h4 g8g6 h4h5 g6e6 e1e6 f7e6 e7f6 b2c4 g2g4 f5g4 f3g4 e6e5 g4g5 c8d7 f6g7 d7e6 a1f1 c4e3 f1f6 e6d5 h5h6 e3g4 f6f7 e8e6 g7f8 e6g6 f7g7 e5e4 g7g6 h7g6 h6h7 g4e5 h7h8q e5f3 g1g2 f3e5 g2g3 e5d3 c2d3 e4e3 g3f4 d5e6 f4e4 c6c5 e4e3 b7b6 a5b6 a6a5 e3f4 e6d7 f4e4 a5a4 d3d4 c5c4 d4d5 a4a3 e4d4 a3a2 h8h1 d7c8 h1a1 c8b8 a1a2 b8b7 a2b1 b7a6 b6b7 a6a5 b1c1 a5a4 c1d1 a4b5 d1e1 b5b6 e1e2 c4c3 e2d1 b6a6 d1e1 a6a7 e1f1 a7b8 f1e1 b8a7 e1c1 a7b6 c1a3 c3c2 a3a4 c2c1r d4e5 c1c8 a4a5 b6b7 d5d6 c8c5 a5c5 b7b8 c5d5 b8c8 e5e6 c8d8 d5a8

I'm still not sure that the logic is completly polished, there might be still a few bugs, so would like more eyes on it.
Thoughts?

@remdu

remdu commented Feb 8, 2019

I had imagined something like this would be done not in the engine, but only after we solved starting training games from opening books. But this way could work too.

@DanielUranga
Member Author

@eddh that would also be a valid implementation, but the problem is that it is not possible to know, from the client's point of view, when a temperature-influenced move happened.

@ghost

ghost commented Feb 10, 2019

This is a small detail (also discussed in discord), but it may be better to choose the "subgames" after the game is complete, instead of branching mid-game. This way you can choose the moves based on information about all moves of the game, instead of moves up to the point of the branch, which leaves flexibility for future improvements in the subgame selection process.

Along this line of thinking, you may also consider a more sophisticated exploration strategy than just temperature. This is not my area of expertise, but there are a lot of papers online about "experience replay" which consider various methods, of which I think this is a special case. (My apologies if you are already familiar with these.)

@DanielUranga
Member Author

The branched sub-games are not started until the current game finishes, see: https://github.com/LeelaChessZero/lc0/pull/720/files#diff-679c205f3774da8da1796cfce32853adR247 — so after a game is played, a set of sub-games is returned.

I'm going to prepare a histogram of the starting plies of each game; with that information it will be possible to pick an appropriate way of filtering the sub-games.

@ghost

ghost commented Feb 10, 2019

Sure, but my understanding is that the branches are selected while the game is being played, before it is finished. Is this correct? My suggestion is to do the branch selection after the game is finished, so that more information can be incorporated.

@DanielUranga
Member Author

Yes, the branches are generated when a temperature-influenced move that does not have the highest number of visits is made.
Other strategies could be used, like picking the positions where Q was furthest away from Z, but for a first approach I would prefer to keep it as similar to the current implementation as possible.
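For example, the |Q − Z| idea could look roughly like this (just a sketch; the per-ply Q values and the final Z are assumed to be available from the training data, and are taken from the same side's perspective):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Pick the plies where the search evaluation Q disagreed most with the final
// game result Z; those would become the sub-game starting points instead of
// the temperature-selected ones. Purely illustrative.
std::vector<int> PliesWithLargestQZGap(const std::vector<float>& q_per_ply,
                                       float z, int max_branches) {
  std::vector<int> plies(q_per_ply.size());
  std::iota(plies.begin(), plies.end(), 0);
  std::sort(plies.begin(), plies.end(), [&](int a, int b) {
    return std::fabs(q_per_ply[a] - z) > std::fabs(q_per_ply[b] - z);
  });
  if (static_cast<int>(plies.size()) > max_branches) plies.resize(max_branches);
  return plies;  // plies sorted by |Q - Z|, biggest disagreement first
}
```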

@ghost

ghost commented Feb 10, 2019

For a first approach, I agree something like temperature is fine.

However, it would take no extra work and be future-proof to do the selection after the game ends. Just do the random branching selection (using temperature) on the finished game, instead of the in-progress game. Tilps mentioned this in discord and I agree.

Does this make sense? Is there any advantage to doing the selection on an in-progress game that I am overlooking?
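For concreteness, a sketch of what post-game selection could look like (purely illustrative; made-up record type, re-sampling with temperature from the stored per-ply visit counts of the finished game):

```cpp
#include <cmath>
#include <random>
#include <string>
#include <vector>

// Illustrative record of one move of a finished game (not lc0's format).
struct MoveRecord {
  std::vector<std::string> candidate_moves;
  std::vector<int> visit_counts;   // visits per candidate at that ply
  std::string played_move;         // the (zero-temp) move actually played
};

// Re-run the temperature sampling over the finished game and branch wherever
// the sample differs from the move that was actually played.
std::vector<int> SelectBranchPlies(const std::vector<MoveRecord>& game,
                                   double temperature, std::mt19937* rng) {
  std::vector<int> branch_plies;
  for (int ply = 0; ply < static_cast<int>(game.size()); ++ply) {
    const MoveRecord& rec = game[ply];
    std::vector<double> weights;
    for (int n : rec.visit_counts)
      weights.push_back(std::pow(n, 1.0 / temperature));
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    if (rec.candidate_moves[dist(*rng)] != rec.played_move)
      branch_plies.push_back(ply);  // candidate for a sub-game at this ply
  }
  return branch_plies;  // any extra filtering/capping can happen here
}
```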

@JB940

JB940 commented Feb 11, 2019

How much thought has been given to overtraining with this PR?

Let's say you have a split at move 7, 15, 36 and 89 or something.
Since at each split a sub-game is added, this means more positions.
If position 89 was endgame, you'd be functionally adding 5 games' worth of endgame positions, while only 1 game's worth of openings has been added. (Of course games are of different lengths, and it's much more complex.)

Since the longer a game goes, the more splits there are likely to be, this will cause a shift towards middlegame, and even more towards endgame, positions in the dataset.

@DanielUranga
Member Author

DanielUranga commented Feb 11, 2019

Yes, the pool of sampled positions will be modified. We could use that to benefit learning on positions that are harder.
But first we need to better understand the current situation. Here is a histogram plot of the starting ply of each game with the current state of this PR:
[plot: histogram of starting ply per game]
That was the result of running: ./lc0 selfplay --visits=200 --cpuct=2.5 --resign-percentage=4.0 --resign-playthrough=20 --temperature=1.1 --temp-endgame=0.45 --temp-cutoff-move=16 --temp-visit-offset=-0.25 --fpu-strategy=absolute --training=true on my 1060 for about an hour (200 visits to increase the game rate, since the 1060 is a bit slow).

Raw data of the graph:
starting_plies.txt

Note that currently lc0 starts all of its games from ply 0.

@DanielUranga
Member Author

Another histogram, counting the number of occurrences for each ply in the resulting training data:
(X=ply, Y=occurrences)
[plot: histogram of ply occurrences in the training data]

@remdu

remdu commented Feb 11, 2019

A shift toward more positions from the middlegame/endgame in training might even be good, given the current weaknesses of Leela.
On the other hand, there might be a need to reduce the sampling rate, because the positions seen in training would be more correlated, causing the value head to overfit more easily. But even with the chance of this happening, it might still be worth it.

@DanielUranga
Member Author

Comparison of training-data position plies in current master vs PR720:
[plot: all position plies, master vs PR720]

(See it in Plotly: https://plot.ly/~danieluranga/15/all-positions-plys/)

@DanielUranga
Member Author

In commit 4756198 I made this sub-games mechanism only be used after the temperature cutoff. This way opening moves will use temperature normally, but endgame ones will effectively be zero-temp (like A0 did), without sacrificing exploration.
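In other words, something along these lines (sketch only, re-using the made-up types from the earlier sketch; `temp_cutoff_move` mirrors the existing --temp-cutoff-move option, everything else is invented):

```cpp
// Before the temperature cutoff, temperature moves are played in the main
// game as today; after the cutoff, a temperature pick that differs from the
// best move spawns a sub-game and the main game continues zero-temp.
std::string ChooseMove(int move_number, int temp_cutoff_move,
                       const SearchResult& search,
                       const std::vector<std::string>& game_so_far,
                       std::vector<BranchPoint>* pending_subgames) {
  if (move_number < temp_cutoff_move) {
    return search.temperature_move;  // normal exploration in the opening
  }
  if (search.temperature_move != search.best_move) {
    BranchPoint branch;
    branch.moves_so_far = game_so_far;
    branch.moves_so_far.push_back(search.temperature_move);
    pending_subgames->push_back(branch);
  }
  return search.best_move;  // effectively zero-temp after the cutoff
}
```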

@dtracers

For moves up until the split you should take the best result. But for the sub-games with a bad result (draw/loss) you should keep the move that caused that result in training, but with the eval of loss or draw. The rest of the game after that sub-game should be discarded.

Leela needs to learn what moves are bad too.

@Veedrac

Veedrac commented Feb 13, 2019

Is there anything to handle the case where a sub-game re-enters a position from the original game, such as with a move transposition? Is that rare enough not to matter?

@dtracers

dtracers commented Feb 16, 2019

Here is how I was thinking the tree would be processed. Is this correct @DanielUranga?
[image: example tree diagram]

Note that this is much more complex than any real game would get... but I thought I should make it a complex case.

I think that in this complex case there is room to suggest saving white's moves so that it learns too, and this is assuming both sides can make temp moves.

@DanielUranga
Member Author

@Veedrac no, there is not, and that could be a problem. The good thing is that it isn't that bad: there will be some repeated positions, but I'm not sure that is a very serious issue.

@DanielUranga
Member Author

The way it works is as follows: first it gets to a random position by doing temp moves before the temp-move cutoff (exactly the same way training works now). After the cutoff, when a temp move happens, it goes into its own sub-game and the current game is played to the end.
After the temp cutoff, only moves with no temp propagate the game score.
[image: diagram of sub-game branching after the temp cutoff]

@dtracers is that what you were asking?

@dtracers

Yes, that is what I am talking about.
I am saying that we should not do it the way you are suggesting, but instead the way I drew the tree: it chooses the best possible result to propagate back all the way to the beginning, with the assumption that the opponent will choose the maximum result.

@dtracers

So, for example, suppose your pink result was a temp move by white and white wins the game, but the green was a lost game for white.
We should propagate back to the beginning that this is a winning game when the temp move is performed, and for the green a loss should be propagated back to the split.

@DanielUranga
Member Author

@dtracers ahh, kind of like minimax. It could be an improvement maybe, but it seems a bit hard to get right and I'm not sure about the subtle implications it could have.
For this PR's implementation I will try the simpler and more straightforward approach, which should still work better than the current training game generation.

@jjoshua2
Contributor

jjoshua2 commented Feb 17, 2019 via email

@DanielUranga
Member Author

If a temp move wins, all of the positions after that previously seemingly losing move will be scored as winning. That way the network gets to see positions it wouldn't have played, but the accuracy of the scoring process is still the same as with zero temp.

@dtracers

I am confused as to how it would work any better if the temp moves have no chance to change the entire score.
It could cause really good winning temp moves to never be reached, because the tree before them is seen as losing or drawn.

@DanielUranga
Member Author

The positions after the temperature move will have their values updated, so even if the moves before the temp move aren't scored differently, the search will be different the next time it reaches the same position, possibly causing it to have the move that was a "temperature blunder" as its first choice the second time.

@dtracers

But what I am saying is that it may not reach that position at all because it may think it is losing even though it is winning

@DanielUranga
Member Author

The search will see the position after that as winning, which will increase the probability of it being played. Also, it is not required to be so exact; the only requirement is that it should converge to optimal play. This is just to remove the bias of "waiting for a mistake to happen" in endgames, but without losing the exploration.

Added an option to select how many sub-games can be started in relation to the total amount of full games played.
@dtracers

My problem is that this will not lead to optimal play, like game 85 of TCEC.
https://lichess.org/study/RSk2SOkx

With the non-temp move it would lead to a draw (saccing the bishop like Leela wants). This would cause a blunder way lower in the tree search.
But the temp move could cause a very clear and obvious loss.
So passing the loss up the tree is much more appropriate.
(Of course Leela would never naturally reach such an unusual position.)

@JB940

JB940 commented Feb 21, 2019

@dtracers I agree with you, I have mentioned it in the discord a few times before. I do think it's explained a bit confusingly in your last post, but a minimax should be the best approach. The temp move should be evaluated from the side that played it. Black turns a white draw into a white loss? Score everything before as the loss. Black turns a white draw into a white win? Score everything before the split as a draw. However, in both those cases, if white made the temp move: draw and win respectively.

This would much better suit the extra exploration; otherwise the exploration is simply wasted. If your temp move turns a loss into a win but everything before is still rated as a loss, then the NN is still discouraged from ever reaching that position again, even though it was winning.
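A sketch of that scoring rule (illustrative only; results are +1/0/-1 from white's point of view, each branch's sub-game result is assumed known, and splits are processed from the last one in the game back to the first):

```cpp
#include <algorithm>
#include <vector>

// Game results from white's point of view: +1 white win, 0 draw, -1 black win.
struct Split {
  bool white_to_move;   // side that played the temperature move at the split
  int subgame_result;   // result of the branched sub-game
};

// Minimax-style back-propagation of the result through the splits: at each
// split, the side that chose the temp move is assumed to pick whichever
// continuation is better for it.
int BackpropagatedResult(int main_line_result,
                         const std::vector<Split>& splits_last_to_first) {
  int result = main_line_result;
  for (const Split& split : splits_last_to_first) {
    result = split.white_to_move ? std::max(result, split.subgame_result)
                                 : std::min(result, split.subgame_result);
  }
  return result;  // the value used to score everything before the first split
}
```

With this rule, black turning a white draw into a white loss gives min(0, -1) = -1 (score everything before as the loss), while black turning a draw into a white win gives min(0, +1) = 0 (a draw), matching the examples above.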

@dtracers

I was hoping the graph picture I made would help explain that. But that is exactly what I mean.

It also introduces some AB ideas. Depending on temp this could create many min/max trees.
If we are passing lots of minmax trees, I am of the opinion that the branch we do not choose to explore should not be passed to training in its entirety. We have already seen that this path is suboptimal; just the move that caused the suboptimal path should be passed to training.
There is less to learn in the suboptimal path.

@DanielUranga
Member Author

Min/max scoring, that makes sense, and it would indeed be an improvement; I think it would make it learn faster.
In order to try the simpler approach (since this PR is already quite complex), I will leave it as is for now, which should work as well: it would be the same as using endgame-temp=0 but with more middlegame and endgame positions sampled.

Skip adding a sub-game if the sub-game start position happened in the parent game, or if the sub-game start position is not reached through a capture or pawn push.
This is done to maximize the overall position diversity.
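A rough sketch of what such a filter could look like (illustrative only, not the actual diff; the hash list and the capture/pawn-push flags are assumed inputs):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Drop a candidate sub-game if its starting position already occurred in the
// parent game, or if the temperature move that creates it is neither a
// capture nor a pawn push.
bool ShouldKeepSubgame(uint64_t subgame_start_hash,
                       const std::vector<uint64_t>& parent_game_hashes,
                       bool branching_move_is_capture,
                       bool branching_move_is_pawn_push) {
  bool already_seen =
      std::find(parent_game_hashes.begin(), parent_game_hashes.end(),
                subgame_start_hash) != parent_game_hashes.end();
  if (already_seen) return false;  // no new diversity gained
  return branching_move_is_capture || branching_move_is_pawn_push;
}
```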
@DanielUranga
Member Author

New plies graph, with all the latest changes. The amount of 'sub-games' is limited to 20%, which seems too much; going to try with 50% probably.
[plot: all position plies, 20% sub-games limit]

@DanielUranga
Member Author

As promised, the results setting the sub-games percentage limit at 50%:
[plot: all position plies, 50% sub-games limit]

@DanielUranga
Member Author

Did a test generating 10k unique positions with this PR, one run allowing 50% sub-games and the other allowing 0% (equivalent to simply using endgame temp=0).
The Position::Hash() method was used to determine if two positions were equivalent.

Results:
[50% sub-games] Total: 10668, repeated: 668, repeated %: 6.261717285339333%
[00% sub-games] Total: 10859, repeated: 859, repeated %: 7.910488995303434%

So this PR improves the generation of unique positions.
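The duplicate counting itself is straightforward; roughly (illustrative sketch, assuming 64-bit position hashes collected from the generated games):

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Count how many of the generated positions repeat an earlier one, using a
// position hash as the identity (as with Position::Hash() above).
int CountRepeatedPositions(const std::vector<uint64_t>& position_hashes) {
  std::unordered_set<uint64_t> seen;
  int repeated = 0;
  for (uint64_t hash : position_hashes) {
    if (!seen.insert(hash).second) ++repeated;  // insert fails -> already seen
  }
  return repeated;  // e.g. 668 out of 10668 in the 50% sub-games run
}
```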

@oscardssmith
Contributor

What is this waiting on to be merged? This seems like a good candidate for t50 or next small net.

@mooskagh
Member

mooskagh commented Mar 8, 2019

I like the idea, but I'm not sure about the implementation; cloning trees seems heavyweight and unnecessary.
I think it would be cleaner to combine it with the training-from-starting-positions (#541) idea.
For example:

  1. Implement #541 (brainstorming: lc0 support for training from predefined positions / book) to support a list of openings/positions to train from.
  2. When a game split happens, the new position is pushed into that list.

"Ideal" would be just to have #541 and then server generating lists of startpos itself, but server-side changes are probably more complicated.

Also, the problem with the current approach (rather than the server deciding which parts of games to replay and sending positions) is that fast clients will generate 30 variants of quite similarly structured positions, while slower clients won't, which may lead to overfitting and other effects.

We already have a similar problem of slow clients only generating short games (because they are killed by a new network before they finish a long game), and now the same problem will be multiplied.

@jjoshua2
Contributor

jjoshua2 commented Mar 8, 2019 via email

@Ishinoshita

Just for reference, although you may already know this paper: this PR reminds me of KataGo's game branching technique (paragraph 6.2.2).

@dtracers

What is this waiting on to get merged in?

We can't really test it at a large scale until after it is on training clients.

@TesseractA TesseractA mentioned this pull request Jun 21, 2019
@Naphthalin
Contributor

How does this PR stand with #964 being tested and found to be detrimental during early T59?

@mooskagh
Member

I believe badgame split (#964) is basically the same idea implemented. I'm closing the PR, but feel free to reopen if you think it makes sense.

@mooskagh mooskagh closed this Apr 28, 2020
@DanielUranga
Member Author

Closing since we already have #964
