On Sat, Dec 14, 2019 at 01:12:34PM +0100, Øystein Schønning-Johansen wrote:
> The reinforcement learning that has been used up til now is plain temporal
> difference learning like described in Sutton and Barto (and done by several
> science projects) with TD(lambda=0).
I don't think this is the case (or the definition of TD is much wider
than what I thought).
The 1.0 version uses straightforward supervised training on a rolled-out
database of positions.
I wasn't involved at the time, but as far as I know:
Earlier versions, by Joseph Heled, used supervised training on a
database evaluated at 2-ply.
The very first versions, by Gary Wong, did indeed use TD training, but
this was abandoned when play seemed stuck at an intermediate level
(though the problem was probably not the training method itself, since
TD-Gammon before that and BGBlitz since then have done very well with TD).
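For reference, the generic TD(lambda) update with eligibility traces looks roughly like this. This is only a minimal sketch with a linear value function, not gnubg's actual training code; the function and variable names are mine:

```python
def td_lambda_update(w, x, reward, v_next, v, z, lam, alpha, gamma=1.0):
    """One TD(lambda) step for a linear value function V(s) = sum(w[i] * x[i]).

    w      : weight vector (list of floats)
    x      : feature vector of the current state
    v, v_next : value estimates for current and next state
    z      : eligibility trace vector (same length as w)
    lam    : the lambda parameter; lam = 0 gives plain TD(0), where only
             the current state's features receive credit for the error
    """
    delta = reward + gamma * v_next - v                    # TD error
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]    # decay trace, add features
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]  # credit recent states
    return w, z

# One step with lam = 0: the trace collapses to the current features,
# so only this state's weights move -- the TD(0) case discussed above.
w, z = td_lambda_update(w=[0.0, 0.0, 0.0], x=[1.0, 0.0, 0.5],
                        reward=0.0, v_next=0.2, v=0.0,
                        z=[0.0, 0.0, 0.0], lam=0.0, alpha=0.1)
```

With lam > 0 the trace keeps a decaying memory of earlier states, so the prediction error also updates weights for positions visited several moves ago, which is what the question about "planning ahead" is really probing.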
> Do you think that the engine can be better at planning ahead, if lambda is
> increased? Has anyone done a lot of experiments with lambda other than 0?
> (I don't think it's code in the repo to do anything else than lambda=0, so
> maybe someone with some other research code base on this can answer?) Or
> someone with general knowledge of RL can answer?
The engine doesn't "plan ahead", does it? It approximates the
probabilities of the game outcomes from the current position (or, for
simplicity, its equity).
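Concretely, the cubeless equity is just a weighted sum of those outcome probabilities. A small sketch, assuming the usual convention for the five network outputs (gammon and backgammon probabilities cumulative, i.e. a backgammon win also counts as a gammon win):

```python
def cubeless_equity(p_win, p_win_g, p_win_bg, p_lose_g, p_lose_bg):
    """Cubeless money equity from five outcome probabilities:
    P(win), P(win gammon), P(win backgammon),
    P(lose gammon), P(lose backgammon).
    Equivalent to scoring a plain win/loss as +/-1, a gammon as +/-2
    and a backgammon as +/-3, with cumulative probabilities.
    """
    return (2.0 * p_win - 1.0) + p_win_g + p_win_bg - p_lose_g - p_lose_bg

# A certain single win is worth +1, a coin-flip race is worth 0:
cubeless_equity(1.0, 0.0, 0.0, 0.0, 0.0)  # -> 1.0
cubeless_equity(0.5, 0.0, 0.0, 0.0, 0.0)  # -> 0.0
```

So "equity" here is purely a function of the estimated outcome distribution; there is no lookahead in the evaluation itself beyond whatever ply the search runs at.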
My understanding is that its potential accuracy depends on the neural
network (architecture + input features), while the training method
(including the training database, in the case of supervised learning)
determines how close to this potential one can get, and how fast.