Re: [Bug-gnubg] Alternative weights files and call for benchmarkers (long)
Mon, 25 Jun 2012 19:17:47 +1200
Hi again Philippe,
Did you find a way to show that the new net is indeed more balanced than the old one with regard to the odd-even ply syndrome?
On 25 June 2012 07:37, Philippe Michel <address@hidden>
On Sun, 24 Jun 2012, Joseph Heled wrote:
I am very interested to know how those nets were generated?
They were trained with your gnubg-nn tools, but from improved training data. This is basically how it went :
I first tried to train the crashed net. Since it seemed one of its problems was dubious absolute equities in many positions and large discrepancies between even and odd evaluations, I used the original set of positions with the average of 3ply and 4ply evaluations.
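As a rough illustration of the averaging step, here is a minimal Python sketch. It assumes each evaluation is a list of five probabilities (win, win gammon, win backgammon, lose gammon, lose backgammon); the function names and the numbers are hypothetical and are not taken from gnubg-nn.

```python
# Minimal sketch, not the gnubg-nn implementation.
# Assumption: an evaluation is five probabilities in the order
# (win, win gammon, win backgammon, lose gammon, lose backgammon).

def average_evaluations(eval_odd, eval_even):
    """Element-wise mean of two evaluations of the same position."""
    if len(eval_odd) != len(eval_even):
        raise ValueError("evaluations must have the same length")
    return [(a + b) / 2.0 for a, b in zip(eval_odd, eval_even)]


def cubeless_equity(evaluation):
    """Cubeless money equity computed from the five probabilities."""
    win, win_g, win_bg, lose_g, lose_bg = evaluation
    return 2.0 * win - 1.0 + (win_g - lose_g) + (win_bg - lose_bg)


# Hypothetical 3ply and 4ply evaluations of one position.
three_ply = [0.60, 0.20, 0.01, 0.10, 0.00]
four_ply = [0.50, 0.10, 0.01, 0.20, 0.02]

# The averaged vector would be used as the training target.
target = average_evaluations(three_ply, four_ply)
```

Averaging the five outputs rather than a single equity keeps the target in the same shape as what the net is trained to produce.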
Early results looked promising but it didn't go very far, the 0ply errors going from :
checkers 771 cube 1088 (total errors in the 0.90.0 net)
checkers 747 cube 753 to
checkers 741 cube 776 (with the training set evaluated with the above net)
checkers 753 cube 787
Checking the worst positions (worst in the sense of 3ply differing most from 4ply), it was clear that although the large differences went down from more than 4.0 in the old net to about 1.0, the equity given by a rollout was often close to either the 3ply or the 4ply value, and often outside the interval between them, so taking the average wasn't converging.
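The check described here could be sketched as follows. The record layout (keys "eq3", "eq4", "rollout") and all numbers are hypothetical; this only illustrates ranking by the 3ply/4ply gap and testing whether the rollout equity falls between the two.

```python
# Minimal sketch with hypothetical data; field names are assumptions.

def rank_by_ply_gap(positions):
    """Sort positions by |3ply - 4ply| equity gap, largest gap first."""
    return sorted(positions, key=lambda p: abs(p["eq3"] - p["eq4"]), reverse=True)


def rollout_outside_interval(p):
    """True if the rollout equity is not between the 3ply and 4ply equities."""
    low, high = sorted((p["eq3"], p["eq4"]))
    return not (low <= p["rollout"] <= high)


positions = [
    {"id": "a", "eq3": 0.10, "eq4": 0.30, "rollout": 0.20},   # inside the interval
    {"id": "b", "eq3": -0.50, "eq4": 0.50, "rollout": 0.70},  # above both values
    {"id": "c", "eq3": 0.40, "eq4": 0.00, "rollout": 0.39},   # inside, close to 3ply
]

ranked = rank_by_ply_gap(positions)
outside = [p["id"] for p in ranked if rollout_outside_interval(p)]
```

For a position like "b", the mean of the two evaluations is further from the rollout than the better of the two, which is the case where averaging cannot help.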
At this point I started to roll out the whole crashed training database (1296 trials, 0ply) using the 741/776 net. I used a slightly modified gnubg since gnubg-nn, not using SSE, would be much slower.
Training from that led to a benchmark of checkers 766 cube 514.
Then I looked at what I could change in the training set to improve checker play. Since it had been reported that the crashed net was bad at containment play and at rolling primes home, making bizarre stacks in the outfield, I started there.
Looking at the training positions, I found quite a few such positions: stacks of 7 or 8 checkers on the 12 point, things like that. I tried to remove them, but since you added them in pairs, I tried hard to remove groups of related positions, not single ones. It was tedious and led only to minimal improvement. I gave up and left all the original positions.
I then tried to add positions from rolling a prime from far away (playing out from something like Advanced Backgammon's position 127 with a varying number of men already off) against one or two checkers. I asked gnubg for its 0ply hint and if its 2 or 3 favorites looked wrong I added them, as well as my choice, to the training set. All in all, I added 700-800 positions. This worked quite well, decreasing the checker error to the 730s.
While investigating these 1-checker containment positions, I had noted that the original training set was very unbalanced, with something like 1500 positions seen from the container's side and 30 from the runner's. This is quite logical if most positions were added when 0ply and 2ply plays differ, but I had the idea that the even/odd effect might be somehow related to this, especially for crashed positions where checker mobility, and hence a training set more or less automatically generated, would be likely to be very asymmetrical.
To test this, I did the same full rollouts on the race training database as well as its positions with the other player on roll. It worked well, improving the 0ply benchmark a little and the 1ply one a lot. The swapped positions are not related pairs like most original ones but they still help.
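Adding the swapped positions could look roughly like the sketch below. It assumes a position is stored as a pair of 25-entry checker counts, the player on roll first and the opponent second, each counted from its own side of the board, so that putting the other player on roll just exchanges the two halves. The representation and the function names are assumptions made for illustration.

```python
# Minimal sketch; the position layout is an assumption:
# (on_roll, opponent), each a tuple of 25 checker counts seen from
# that player's own side of the board.

def swap_side_on_roll(position):
    """Return the same board with the other player on roll."""
    on_roll, opponent = position
    return (opponent, on_roll)


def add_swapped(positions):
    """Return the original positions followed by their swapped versions,
    skipping any swapped position that is already in the set."""
    seen = set(positions)
    result = list(positions)
    for pos in positions:
        swapped = swap_side_on_roll(pos)
        if swapped not in seen:
            seen.add(swapped)
            result.append(swapped)
    return result


def make_side(points):
    """Build a 25-entry tuple from a {index: count} mapping."""
    return tuple(points.get(i, 0) for i in range(25))


# Hypothetical race position: 15 checkers for each side.
leader = make_side({0: 5, 1: 5, 2: 5})
trailer = make_side({3: 5, 4: 5, 5: 5})
training_set = [(leader, trailer)]

augmented = add_swapped(training_set)
```

The swapped positions still need their own rollout, since the equity with the other side on roll is not simply the negation of the original one.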
Redoing full rollouts on the crashed training set or even only on its swapped positions was going to take more time than I wished so I settled on doing a truncated rollout (324 trials, truncated at 8 ply then 2 ply evaluation) of the whole new set (original + inverted positions).
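To give an idea of the statistical cost of cutting the number of trials, here is a back-of-the-envelope sketch. It only uses the fact that the standard error of a mean of independent trials scales as 1/sqrt(n); it ignores variance reduction, the bias from the evaluation at the truncation point, and the lower per-trial variance of truncated games, so it is not a description of what gnubg reports.

```python
import math

# Rough sketch: standard error of a mean over n independent trials
# scales as 1 / sqrt(n). Variance reduction and truncation bias are ignored.

def relative_standard_error(trials, reference_trials):
    """Standard error with `trials` games, relative to `reference_trials` games."""
    return math.sqrt(reference_trials / trials)


# 324 trials compared with the 1296 trials of the full rollouts.
ratio = relative_standard_error(324, 1296)
```

Under these simplifying assumptions, a quarter of the trials doubles the standard error of the estimate.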
Xavier Dufaure from XG had claimed in the Bgonline forum that its roller++ evaluations (similar to the above truncated rollout) were generally about as good as a full rollout. Later experience makes me think this is not quite the case for gnubg (for starters, its equity estimates at the truncation point are not good enough). But this seemed like a decent compromise, so I used it to reevaluate the contact training set (original + inverted positions) as well. My thoughts were that it would somehow diffuse the improved estimations from the crashed and race nets and the easy late positions deeper than a simple 2ply evaluation would.
The weights attached to the earlier message are those resulting from the above process. In summary :
- full rollout of the original race database + inverted positions using the 0.90.0 net and train a race net from that
- truncated rollout of the original crashed database + inverted positions + new containment positions using an intermediate crashed net and the above race net, and train from that
- truncated rollout of the contact database + inverted positions using the 0.90.0 contact net and above crashed and race nets, and train from that
After that I tried another pass of truncated rollouts on the contact and crashed training sets but it didn't improve the benchmarks (or maybe it did : this is when I realized the crashed benchmark was flawed, but the improvement, if any, looked like it would be minimal).
Since I didn't see any obvious prospects for further quick improvements but the nets seemed to be worthwhile, I trained corresponding pruning nets and posted the weights files as they were.
At this point, I'm looking at redoing full rollouts of the contact and crashed databases (1.8M positions!). I've done some tests on a few thousand positions with the smallest and largest pip counts. No surprise here : the former are fast but the current data is already accurate, the latter take a lot of time but are quite often much more plausible than the current estimates.