Re: [Bug-gnubg] The race training and benchmark datasets

From: Philippe Michel
Subject: Re: [Bug-gnubg] The race training and benchmark datasets
Date: Sun, 9 Jun 2019 23:37:59 +0200

On Fri, Jun 07, 2019 at 08:30:10PM +0200, Øystein Schønning-Johansen wrote:

> (Of course I remove any position duplicated in the two datasets, such that
> the training and validation set are disjoint.)

Is it really important (in general)? I know one shouldn't use the same
dataset, but is some limited random overlap really an issue? I haven't
verified how limited it is in the case of gnubg's databases, though...
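
A quick way to quantify such an overlap is sketched below. It assumes each position can be reduced to a hashable key (for instance its position ID string); the function name and the sample IDs are made up for illustration and are not taken from the actual databases.

```python
def overlap_report(train_ids, bench_ids):
    """Return (shared_count, fraction_of_train, fraction_of_bench).

    Both arguments are iterables of hashable position keys.
    Duplicates inside one dataset are collapsed before comparing.
    """
    train, bench = set(train_ids), set(bench_ids)
    shared = train & bench
    return len(shared), len(shared) / len(train), len(shared) / len(bench)


# Made-up keys, for illustration only.
train = ["A", "B", "C", "D"]
bench = ["C", "D", "E", "F", "G"]
count, frac_train, frac_bench = overlap_report(train, bench)
print(f"{count} shared: {frac_train:.1%} of training, {frac_bench:.1%} of benchmark")
```

If the shared fraction turns out to be tiny, its effect on the validation error should be correspondingly small.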

> I train a neural network. If I validate the training with a 10% fraction of
> the training dataset itself, I get a MSE error of about 1.0e-04. But if I
> validate against the dataset generated from train.bm-1.00.bz2 I get an MSE
> error of 7e-04. About 7 times higher!
> This makes me believe that the rolled out positions in the race-train-data
> file are rolled out in another way (different tool, different settings,
> different neural net?) than the positions in train.bm-1.00.bz2.

Different tool and different neural net.

For the benchmark databases, it is recorded as a comment at the beginning
of the file:

s version 1.93 weights 1.00 moves2plyLimit 20 rolloutLimit 5 nRollOutGames 1296 
cubeAway 7 include0Ply 1 evalPlies 2 shortCuts 1 osrGames 1296 osrInRoll 1

This is version 1.93 of the sagnubg tool, using the 1.00 weights file
(the current one). I rerolled the benchmark databases with it after the
new weights file was generated.
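
To compare the settings of several files, that comment can be turned into a dictionary. The sketch below assumes the leading "s" is a record marker and that the rest of the line alternates between names and values; this is inferred from the line quoted above, not from a format specification, and parse_header is a hypothetical helper.

```python
def parse_header(comment):
    """Split a settings comment into a {name: value} dict of strings."""
    tokens = comment.split()
    # Assumption: a leading "s" only marks the line type and is not a setting.
    if tokens and tokens[0] == "s":
        tokens = tokens[1:]
    # Pair up the remaining tokens: name, value, name, value, ...
    return dict(zip(tokens[0::2], tokens[1::2]))


header = (
    "s version 1.93 weights 1.00 moves2plyLimit 20 rolloutLimit 5 "
    "nRollOutGames 1296 cubeAway 7 include0Ply 1 evalPlies 2 "
    "shortCuts 1 osrGames 1296 osrInRoll 1"
)
settings = parse_header(header)
print(settings["version"], settings["weights"], settings["nRollOutGames"])
```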

The training database was rolled out with a slightly modified gnubg 
(merely to have gnubg -t print the rollout results in the right format).

This was done with earlier weights. I didn't keep notes, but I think I
used one intermediate weights set for the race and possibly more than
one for the crashed net (roll out the training database with the 0.90
net, train a new net, reroll the training database with it, etc...). For
the contact net I'm not sure.

In any case, this was with different weights than the current benchmark
databases.

> Joseph? Philippe? Ian? Others? Do you know how these data were generated?
> Is it maybe worth rolling these positions out again? I do remember that
> Joseph made a separate rollout tool, but I'm not sure what Philippe did?

It is likely that the different errors you got have another cause: as far
as I can see, the sagnubg tool used for creating the benchmark databases
doesn't use variance reduction.
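
As a rough illustration of why noisy targets matter: if the validation targets carry independent zero-mean noise of standard deviation s, the measured MSE is expected to grow by about s^2. The sketch below applies this to the two figures quoted above, under the strong assumption that the whole gap comes from rollout noise and that the lower figure is close to noise-free; neither is established here.

```python
import math


def implied_noise_sd(mse_noisy, mse_clean):
    """Noise standard deviation that would explain the MSE gap.

    Assumes E[MSE_noisy] = MSE_clean + sd**2, i.e. target noise that is
    independent of the network's own error.
    """
    return math.sqrt(mse_noisy - mse_clean)


# Figures quoted in the message: 7e-04 against the benchmark, 1.0e-04 held out.
print(f"implied noise sd: {implied_noise_sd(7e-4, 1e-4):.4f}")
```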

That should be enough of a reason to seriously consider rerolling them, 
but we would have to implement variance reduction in sagnubg first or 
use gnubg with some substantial pre- and post-processing.

> (I also remember that the original benchmark was move based, and it
> calculates the loss based on incorrect moves picked, and that it might not
> be that interesting if the rollout values are a bit wrong....)

I'm afraid they may not be just a bit wrong. It seems the standard
deviation of a 1296-trial rollout without variance reduction is larger
than the vast majority of the "errors" found when running the benchmark.
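
For an order of magnitude, here is a minimal sketch of the standard error of a win probability estimated from plain independent trials. It assumes simple binomial sampling and ignores any stratification of the first rolls, so it is only an approximation of what a real rollout does; the function name is illustrative.

```python
import math


def rollout_std_error(p_win, n_games):
    """Standard error of a win-probability estimate from n_games
    independent trials with true win probability p_win."""
    return math.sqrt(p_win * (1.0 - p_win) / n_games)


for p in (0.5, 0.7, 0.9):
    se = rollout_std_error(p, 1296)
    # With only +1/-1 outcomes (no gammons), equity = 2p - 1, so its
    # standard error is twice that of the win probability.
    print(f"p={p:.1f}  SE(win)={se:.4f}  SE(equity)={2 * se:.4f}")
```

For an even position this gives about 0.014 on the win probability (0.5 / 36), or about 0.028 in cubeless equity.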

