From: Timothy Y. Chow
Subject: Re: [gnubg] Temporal difference learning. Lambda parameter.
Date: Sun, 22 Dec 2019 16:59:24 -0500 (EST)
User-agent: Alpine 2.21 (LRH 202 2017-01-01)
Philippe Michel wrote:
The engine doesn't "plan ahead", does it? It approximates the probabilities of the game outcomes from the current position (or we can say its equity, for simplicity). My understanding is that its potential accuracy depends on the neural network (architecture + input features), and that the training method (including the training database, in the case of supervised learning) influences how close to this potential one can get, and how fast.
I haven't done any actual training of backgammon nets, but I think what Oystein was saying is that TD learning is a method of trying to figure out (crudely speaking) "where you made your mistake when you lost," and it works well when you don't have to "backtrack too far" when you're readjusting your weights. But for positions where there's "long-term planning" (e.g., rolling the prime around the board), one intuitively expects TD learning not to work so well.
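For concreteness, here is a minimal sketch of the TD(lambda) update being discussed, in Python. A linear model with a sigmoid output stands in for the real network; the feature count, learning rate, and the convention that the evaluator always returns White's win probability are all assumptions made for illustration, not gnubg's actual code. The lambda parameter from the subject line is what controls how far back along the game the credit for each TD error propagates.

# A minimal TD(lambda) sketch, NOT gnubg's implementation: a linear model
# with a sigmoid output stands in for the neural network, and the evaluator
# is assumed to return White's win probability for every position.
import numpy as np

N_FEATURES = 10                      # hypothetical feature count
w = np.zeros(N_FEATURES)             # evaluator weights
ALPHA, LAM = 0.1, 0.7                # learning rate and trace decay (lambda)

def value(x):
    """Estimated win probability for feature vector x."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def td_lambda_episode(features, outcome):
    """One self-play game: features is the list of position feature
    vectors in order; outcome is 1.0 for a White win, 0.0 for a loss."""
    global w
    z = np.zeros_like(w)                          # eligibility trace
    for t, x in enumerate(features):
        v = value(x)
        # Target: the next position's estimate, or the true result at the end.
        target = outcome if t == len(features) - 1 else value(features[t + 1])
        delta = target - v                        # TD error (no discounting)
        z = LAM * z + v * (1.0 - v) * x           # decay old credit, add gradient
        w += ALPHA * delta * z                    # push the error back along the trace

# Toy usage: a five-position "game" with random features that White wins.
rng = np.random.default_rng(0)
td_lambda_episode([rng.normal(size=N_FEATURES) for _ in range(5)], outcome=1.0)

With lambda near 0, each position learns only from its immediate successor, which is exactly why long-range plans like a rolling prime are hard to credit correctly; with lambda near 1 the update approaches learning directly from the final outcome.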
It's true that once you have a reasonably good network, you can "fine-tune" it using other methods. For example, for a perfect bot, 0-ply, 1-ply, 2-ply, etc., should all give the same answer, but an actual bot won't, so you can get some improvement just by forcing the bot to iron out these inconsistencies. This can be done using various supervised training methods and not necessarily TD learning. But my understanding (which could be flawed) is that TD learning still enters the picture at the very first step, when you're starting from scratch (with only the rules and no heuristics).
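As an illustration of that fine-tuning idea, here is a hedged sketch in the same toy setup: the deeper (1-ply) evaluation is used as a fixed supervised target for the static 0-ply output. The function names, the unweighted average over dice rolls, and the sign convention (the net returns the on-roll player's win probability) are simplifying assumptions, not gnubg's training code.

# A sketch of "ironing out ply inconsistencies": treat the 1-ply evaluation
# as the supervised target for the 0-ply network output. Assumptions: the
# net returns the on-roll player's win probability, and the per-roll weights
# (1/36 for doubles vs 2/36 otherwise) are omitted for brevity.
import numpy as np

def value(x, w):
    """Static (0-ply) evaluation: on-roll player's win probability."""
    return 1.0 / (1.0 + np.exp(-w @ x))

def one_ply(successors, w):
    """1-ply evaluation: for each dice roll, the caller supplies the best
    reply found at 0-ply; average over rolls. After our move the opponent
    is on roll, so our win probability in each successor is 1 - value."""
    return float(np.mean([1.0 - value(s, w) for s in successors]))

def consistency_step(x, successors, w, alpha=0.01):
    """One supervised step nudging the 0-ply output toward the 1-ply one.
    The target is held fixed (no gradient flows through the lookahead)."""
    target = one_ply(successors, w)
    v = value(x, w)
    grad = (v - target) * v * (1.0 - v) * x   # gradient of squared error
    return w - alpha * grad

# Toy usage with random data standing in for real positions.
rng = np.random.default_rng(1)
w0 = rng.normal(size=10)
x = rng.normal(size=10)
succ = [rng.normal(size=10) for _ in range(21)]  # one best reply per roll
w1 = consistency_step(x, succ, w0)

This is ordinary supervised regression against the bot's own deeper search, so it can only redistribute knowledge the network already has; it can't supply the kind of knowledge that from-scratch training is meant to discover.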
If there's some area of the game where your network is still doing very poorly, then you may need to do more "from scratch" training, rather than just bootstrapping off what you already have. I think this is why Oystein is suggesting revisiting TD learning.