Re: Temporal difference learning. Lambda parameter.

On Sun, Dec 22, 2019 at 11:07 PM boomslang <address@hidden> wrote:

Nikolaos Papahristou used TD(lambda) for his Palamedes bot, which plays several Greek backgammon variants. In "On the Design and Training of Bots to play Backgammon Variants", he writes:

"In the Plakoto variant, values of λ>0.6 resulted in divergence, whereas lower values sometimes became unstable. So it was decided to keep λ=0 for this variant.
For Portes and Fevga variants it was possible to increase the λ value without problems and this always resulted in faster learning, but unlike other reported results [16], final performance did not exceed experiments with λ=0."

Portes is essentially the same as standard backgammon; the main differences are: (1) the absence of the doubling cube and (2) the absence of triple wins.

He trained the Portes variant with lambda = 0.7 for the first 250k games, then continued with lambda = 0.

Perhaps he can tell us why it works in one variant but not in the others?
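
For readers unfamiliar with the lambda parameter, here is a minimal sketch of the schedule described above (0.7 for the first 250k games, then 0) together with a textbook TD(lambda) update using eligibility traces. It assumes a linear value function, no discounting, and a single reward at the end of the game; the function names, the learning rate and the feature vectors are hypothetical and are not taken from Palamedes or GNU Backgammon, both of which use neural networks.

```python
def lambda_for_game(game_index, switch_after=250_000,
                    early_lambda=0.7, late_lambda=0.0):
    """Lambda for a given self-play game (0-based index).

    Mirrors the schedule quoted in the thread: 0.7 for the first
    250k games, 0 afterwards. The parameter names are hypothetical.
    """
    return early_lambda if game_index < switch_after else late_lambda


def td_lambda_episode(weights, features, final_reward, lam, alpha=0.1):
    """One TD(lambda) pass over a finished game.

    Assumptions (not from the thread): linear value function
    v(x) = w . x, gamma = 1, reward only at the end of the game.
    features: one feature vector per visited position.
    """
    w = list(weights)
    trace = [0.0] * len(w)
    for t, x in enumerate(features):
        value = sum(wi * xi for wi, xi in zip(w, x))
        if t + 1 < len(features):
            # Bootstrap from the estimate of the next position.
            target = sum(wi * xi for wi, xi in zip(w, features[t + 1]))
        else:
            # Last position: use the actual game outcome.
            target = final_reward
        delta = target - value
        # Decay the old trace by lambda and add the current gradient
        # (for a linear value function the gradient is x itself).
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


if __name__ == "__main__":
    positions = [[1.0, 0.0], [0.0, 1.0]]  # two one-hot positions
    for lam in (0.0, 0.7):
        print(lam, td_lambda_episode([0.0, 0.0], positions, 1.0, lam))
```

With lambda = 0 only the weight of the last position moves after a win; with lambda = 0.7 part of the same error is also credited to the earlier position, which is one plausible reading of why a higher lambda "resulted in faster learning" early in training.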

On Sunday, 22 December 2019, 00:09:57 CET, Philippe Michel <address@hidden> wrote:

On Sat, Dec 14, 2019 at 01:12:34PM +0100, Øystein Schønning-Johansen wrote:

> The reinforcement learning that has been used up til now is plain temporal
> difference learning like described in Sutton and Barto (and done by several
> science projects) with TD(lambda=0).

I don't think this is the case (or the definition of TD is much wider
than what I thought).

The 1.0 version uses straightforward supervised training on a rolled-out
database.

I wasn't involved at the time, but as far as I know:

Earlier versions, by Joseph Heled, used supervised training on a
database evaluated at 2-ply.

The very first versions by Gary Wong did indeed use TD training, but this
was abandoned when it seemed stuck at an intermediate level of play
(but the problem was probably not due to the training method, since
TD-Gammon before that and BGBlitz since then did very well with TD).

> Do you think that the engine can be better at planning ahead, if lambda is
> increased? Has anyone done a lot of experiments with lambda other than 0?
> (I don't think there's code in the repo to do anything else than lambda=0, so
> maybe someone with some other research code base on this can answer?) Or
> someone with general knowledge of RL can answer?

The engine doesn't "plan ahead", does it? It approximates the
probabilities of the game outcomes from the current position (or, to
simplify, its equity).

My understanding is that its potential accuracy depends on the neural
network (architecture + input features), while the training method
(including the training database in the case of supervised learning)
influences how close to this potential one can get, and how fast.
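
To make the step from "probabilities of the game outcomes" to "equity" concrete, here is a small sketch of the usual cubeless money-game calculation. It assumes the evaluator returns five cumulative probabilities (win, win gammon or better, win backgammon, lose gammon or better, lose backgammon); that layout and the function name are assumptions for illustration and are not stated in this thread.

```python
def cubeless_equity(win, win_gammon, win_backgammon,
                    lose_gammon, lose_backgammon):
    """Cubeless equity in points per game from outcome probabilities.

    Assumes cumulative probabilities: `win` includes gammons and
    backgammons, `win_gammon` includes backgammons, and likewise
    for the losing side. This layout is an assumption.
    """
    lose = 1.0 - win
    return ((win - lose)
            + (win_gammon - lose_gammon)
            + (win_backgammon - lose_backgammon))


if __name__ == "__main__":
    # An even position with no gammon chances on either side.
    print(cubeless_equity(0.5, 0.0, 0.0, 0.0, 0.0))
```

For a variant without triple wins, such as Portes as described earlier in the thread, the two backgammon terms would simply be zero.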

From:	Nikos Papachristou
Subject:	Re: Temporal difference learning. Lambda parameter.
Date:	Thu, 2 Jan 2020 12:09:22 +0200