Subject: Re: [Bug-gnubg] Alternative weights files and call for benchmarkers (long)
From: Philippe Michel
Date: Sun, 24 Jun 2012 21:37:35 +0200 (CEST)
User-agent: Alpine 2.00 (BSF 1167 2008-08-23)
On Sun, 24 Jun 2012, Joseph Heled wrote:
> I am very interested to know how those nets were generated?
They were trained with your gnubg-nn tools, but from improved training
data. This is basically how it went :
I first tried to train the crashed net. Since it seemed one of its
problems was dubious absolute equities in many positions and large
discrepancies between even and odd evaluations, I used the original set of
positions with the average of 3ply and 4ply evaluations.
Early results looked promising but it didn't go very far, the 0ply
errors going from :
checkers 771 cube 1088 (total errors in the 0.90.0 net)
to
checkers 747 cube 753
to
checkers 741 cube 776 (with the training set evaluated with the above net)
to
checkers 753 cube 787
Checking the worst positions (worst meaning the largest differences
between 3ply and 4ply), it was clear that although the large differences
had gone down from more than 4.0 in the old net to about 1.0, the equity
given by a rollout was often close to either the 3ply or the 4ply value,
and often outside the interval between them, so taking the average wasn't
converging.
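As a minimal sketch of the two checks described above: the first function builds the averaged 3ply/4ply training target, the second flags positions where a rollout equity falls outside the interval spanned by the two evaluations. The function names and the sample numbers are hypothetical and are not part of gnubg or gnubg-nn.

```python
def average_target(eq_3ply: float, eq_4ply: float) -> float:
    """Averaged odd/even-ply equity, as used for the first training attempt."""
    return (eq_3ply + eq_4ply) / 2.0


def rollout_outside_interval(eq_3ply: float, eq_4ply: float,
                             eq_rollout: float) -> bool:
    """True when the rollout equity lies outside [min, max] of the two plies.

    When this happens often, the average of the two plies cannot be a
    good estimate of the rollout value for those positions.
    """
    low, high = sorted((eq_3ply, eq_4ply))
    return eq_rollout < low or eq_rollout > high


# Hypothetical equities for a single position (illustration only).
print(average_target(0.20, 0.40))
print(rollout_outside_interval(0.20, 0.40, 0.55))
```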
At this point I started to roll out the whole crashed training database
(1296 trials, 0ply) using the 741/776 net. I used a slightly modified
gnubg since gnubg-nn, not using SSE, would be much slower.
Training from that led to a benchmark of checkers 766 cube 514.
Then I looked at what I could change in the training set to improve
checker play. Since it had been reported that the crashed net was bad at
containment play and rolling outside primes, making bizarre stacks in the
outfield, I started there.
Looking at the training positions, I found quite a few such positions:
stacks of 7 or 8 checkers on the 12 point, things like that. I tried to
remove them, but since you added them in pairs, I tried hard to remove
groups of related positions, not single ones. It was tedious and led only
to minimal improvement. I gave up and left all the original positions.
I then tried to add positions from rolling a prime from far away (playing
out from something like Advanced Backgammon's position 127 with a varying
number of men already off) against one or two checkers. I asked gnubg for
its 0ply hint and if its 2 or 3 favorites looked wrong I added them, as
well as my choice, to the training set. All in all, I added 700-800
positions. This worked quite well, decreasing the checker error to the
730s.
While investigating these 1-checker containment positions, I had noted
that the original training set was very unbalanced, with something like
1500 positions seen from the container's side and 30 from the runner's.
This is quite logical if most positions were added when 0ply and 2ply
plays differ, but I had the idea that the even/odd effect might be somehow
related to this, especially for crashed positions where checker mobility,
hence a training set more or less automatically generated, would be likely
to be very asymmetrical.
To test this, I did the same full rollouts on the race training database
as well as its positions with the other player on roll. It worked well,
improving the 0ply benchmark a little and the 1ply one a lot. The
swapped positions are not related pairs like most original ones but they
still help.
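Adding the swapped positions can be sketched as follows. This assumes a board is stored as two tuples of 25 counts (24 points plus the bar), each from that side's own point of view, with the player on roll first, so that exchanging the two tuples gives the same position with the other player on roll. The names `swap_sides` and `add_swapped` are hypothetical, not functions taken from gnubg-nn.

```python
from typing import List, Tuple

# (player on roll, opponent); each side holds 25 counts: points 1..24, then bar.
Side = Tuple[int, ...]
Board = Tuple[Side, Side]


def swap_sides(board: Board) -> Board:
    """Return the same position with the other player on roll."""
    on_roll, opponent = board
    return (opponent, on_roll)


def add_swapped(positions: List[Board]) -> List[Board]:
    """Append the swapped version of every position, skipping duplicates."""
    seen = set(positions)
    result = list(positions)
    for board in positions:
        swapped = swap_sides(board)
        if swapped not in seen:
            seen.add(swapped)
            result.append(swapped)
    return result


def make_side(checkers: dict) -> Side:
    """Build a 25-entry side from {point number: checker count}."""
    return tuple(checkers.get(point, 0) for point in range(1, 26))


# A one-sided toy position: a 6-point prime against a single back checker.
container = make_side({2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 12: 3})
runner = make_side({24: 1})
augmented = add_swapped([(container, runner)])
print(len(augmented))
```

The swapped positions still need their own rollout, since the equity with the other player on roll is not simply the negation of the original one.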
Redoing full rollouts on the crashed training set or even only on its
swapped positions was going to take more time than I wished so I settled
on doing a truncated rollout (324 trials, truncated at 8 ply then 2 ply
evaluation) of the whole new set (original + inverted positions).
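The general shape of such a truncated rollout can be sketched like this. It is a generic Monte Carlo loop, not gnubg's rollout code: the callbacks `step`, `is_over`, `final_value` and `evaluate` are hypothetical placeholders for "play one ply", "game finished", "actual result" and "static evaluation at the truncation point" (2ply in the setup above). All values are assumed to be expressed from the point of view of the player originally on roll, and refinements a real rollout engine would have, such as variance reduction, are left out.

```python
import random
from typing import Callable, TypeVar

State = TypeVar("State")


def truncated_rollout(
    start: State,
    step: Callable[[State, random.Random], State],
    is_over: Callable[[State], bool],
    final_value: Callable[[State], float],
    evaluate: Callable[[State], float],
    trials: int = 324,
    max_plies: int = 8,
    seed: int = 0,
) -> float:
    """Average the outcome of `trials` games, each cut off after `max_plies`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        state = start
        plies = 0
        # Play the game out, but stop at the truncation depth.
        while plies < max_plies and not is_over(state):
            state = step(state, rng)
            plies += 1
        # Finished games score their real result; truncated ones are estimated.
        total += final_value(state) if is_over(state) else evaluate(state)
    return total / trials


# Toy usage: a counter "game" that ends once it reaches 3.
result = truncated_rollout(
    start=0,
    step=lambda s, rng: s + 1,
    is_over=lambda s: s >= 3,
    final_value=lambda s: 1.0,
    evaluate=lambda s: 0.5,
)
print(result)
```

The quality of the result depends directly on `evaluate`: if the estimate at the truncation point is poor, a truncated rollout inherits that error however many trials are run.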
Xavier Dufaure from XG had claimed in the Bgonline forum that its roller++
evaluations (similar to the above truncated rollout) were generally about
as good as a full rollout. Later experience makes me think this is not
quite the case for gnubg (for starters, its equity estimates at the
truncation point are not good enough). But this seemed like a decent
compromise, which I used to reevaluate the contact training set (original +
inverted positions) as well. My thought was that it would somehow
diffuse the improved estimates from the crashed and race nets and the
easy late positions deeper than a simple 2ply evaluation would.
The weights attached to the earlier message are those resulting from the
above process. In summary :
- full rollout of the original race database + inverted positions using
the 0.90.0 net and train a race net from that
- truncated rollout of the original crashed database + inverted positions
+ new containment positions using an intermediate crashed net and the
above race net, and train from that
- truncated rollout of the contact database + inverted positions using the
0.90.0 contact net and above crashed and race nets, and train from that
After that I tried another pass of truncated rollouts on the contact and
crashed training sets but it didn't improve the benchmarks (or maybe it
did : this is when I realized the crashed benchmark was flawed, but the
improvement, if any, looked like it would be minimal).
Since I didn't see any obvious prospects for further quick improvements
but the nets seemed to be worthwhile, I trained corresponding pruning nets
and posted the weights files as they were.
At this point, I'm looking at redoing full rollouts of the contact and
crashed databases (1.8M positions!). I've done some tests on a few
thousand positions with the smallest and largest pip counts. No surprise here : the
former are fast but current data is already accurate, the latter take a
lot of time but are quite often much more plausible than the current
estimates.
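Picking test samples by pip count could look like the sketch below. It assumes each side is a tuple of 25 checker counts (points 1 to 24 from that side's point of view, then the bar counted as 25 pips); the helper names are hypothetical and the sample positions are only there to make the example run.

```python
from typing import List, Sequence, Tuple

Side = Tuple[int, ...]
Board = Tuple[Side, Side]


def pip_count(side: Sequence[int]) -> int:
    """Pips for one side: point number times checkers on it (index 24 = bar)."""
    return sum((index + 1) * count for index, count in enumerate(side))


def total_pips(board: Board) -> int:
    """Combined pip count of both sides."""
    return pip_count(board[0]) + pip_count(board[1])


def pip_extremes(positions: List[Board], k: int) -> Tuple[List[Board], List[Board]]:
    """Return the k positions with the smallest and the k with the largest pips."""
    ordered = sorted(positions, key=total_pips)
    return ordered[:k], ordered[max(0, len(ordered) - k):]


def make_side(checkers: dict) -> Side:
    """Build a 25-entry side from {point number: checker count}."""
    return tuple(checkers.get(point, 0) for point in range(1, 26))


start = make_side({6: 5, 8: 3, 13: 5, 24: 2})
late = make_side({1: 3, 2: 2})
positions = [(start, start), (late, late), (start, late)]
smallest, largest = pip_extremes(positions, 1)
print(pip_count(start), total_pips(smallest[0]), total_pips(largest[0]))
```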