Re: [Bug-gnubg] TD(lambda) training for neural networks -- a question
boomslang
Thu, 21 May 2009 11:30:00 +0000 (GMT)
Hi Øystein / others,
thanks for your quick answer.
I didn't know gnubg used just TD(0). This does make things
easier for me. The Sutton/Barto you're referring to: is that
the book "Reinforcement Learning: An Introduction"?
I do have a question about this supervised training,
though. Could you give an indication of the number of games
it takes to get a good kick start with TD(0), and of how big
the database of positions/rollouts should be for the
supervised training?
Thanks again, I appreciate your help.
--boomslang
> --- On Thu, 21/5/09, Øystein Johansen <address@hidden> wrote:
> > From: Øystein Johansen <address@hidden>
> > Subject: Re: [Bug-gnubg] TD(lambda) training for neural networks -- a question
> > To: "boomslang" <address@hidden>
> > Cc: address@hidden
> > Date: Thursday, 21 May, 2009, 10:18 AM
> > boomslang wrote:
> > > Hi all,
> > >
> > > I have a question regarding TD(lambda) training by Tesauro (see
> > > http://www.research.ibm.com/massive/tdl.html#h2:learning_methodology).
> > >
> > > The formula for adapting the weights of the neural net is
> > >
> > > w(t+1)-w(t) = a * [Y(t+1)-Y(t)] * sum(lambda^(t-k) * nabla(w)Y(k); k=1..t).
> > >
> > > I would like to know if nabla(w)Y(k) in the formula above is the
> > > gradient of Y(k) with respect to the weights of the net at time t
> > > (i.e. the current net) or the weights of the net at time k. I assume
> > > the former.
> > That really doesn't matter much, I believe. I guess, as you do, that it
> > is the former. You can check this with Sutton/Barto.
> > However: This equation was never implemented in gnubg! All TD training
> > that was done in gnubg (and that was a long time ago, and abandoned at
> > an early stage) was done with lambda = 0. Notice how lambda = 0
> > simplifies the equation. There will only be one term -- when k = t.
> > This simplifies the implementation so that it only takes into account
> > the previous position when updating the weights. It can be done simply
> > with backprop.
> > Our experience is: TD is nice for kickstarting the training process. But
> > supervised training is the real thing. Make a big database of positions
> > and the rollout results corresponding to these positions, and train
> > supervised.
> >
> > If you still would like to do TD training with your system, I really
> > recommend looking at Sutton/Barto.
> >
> > Good luck!
> > -Øystein
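
For reference, the lambda = 0 simplification described in the quoted reply
can be sketched in a few lines of Python. This is only an illustration and
not gnubg code: the one-layer logistic "net", the function names and the
parameter names (alpha for the "a" in the formula, lam for lambda) are all
assumptions made for the example. The sum over k is kept as a running
trace, which means each gradient is taken with the weights in use at time
k; that is one possible reading of the formula, not a confirmed one.

```python
import math


def predict(w, x):
    """Output Y of a toy one-layer logistic 'net' (hypothetical model)."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def gradient(w, x):
    """nabla(w)Y for the toy model: Y * (1 - Y) * x."""
    y = predict(w, x)
    return [y * (1.0 - y) * xi for xi in x]


def td_lambda_step(w, trace, x_t, y_next, alpha, lam):
    """Apply w(t+1) = w(t) + alpha * [Y(t+1) - Y(t)] * trace.

    trace is the running value of sum(lambda^(t-k) * nabla(w)Y(k); k=1..t),
    updated as trace = lam * trace + nabla(w)Y(t).
    y_next is Y(t+1): the next prediction, or the game result at the end.
    """
    grad = gradient(w, x_t)
    trace = [lam * e + g for e, g in zip(trace, grad)]
    delta = y_next - predict(w, x_t)
    new_w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return new_w, trace


# With lam = 0 the old trace is discarded and only the k = t term remains,
# so the step is plain backprop towards the target Y(t+1).
w, trace = td_lambda_step([0.0, 0.0], [5.0, -3.0], [1.0, 2.0],
                          y_next=1.0, alpha=0.1, lam=0.0)
```

With lam = 0 the trace passed in has no effect on the result, which is why
a TD(0) implementation only needs the previous position and its gradient.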
