Re: [gnugo-devel] statistical regression


From: Evan Daniel
Subject: Re: [gnugo-devel] statistical regression
Date: Wed, 03 Mar 2004 13:20:01 -0500
User-agent: Mozilla Thunderbird 0.5 (X11/20040221)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The major problem with all of this is that gnugo-gnugo matches across
versions do not properly show the strength of the program as compared to
people or other programs.  Such matches emphasize certain portions of
the engine, mostly ignore others, and frequently allow broad classes of
mistakes to go unpunished.  I think the matches are still useful, and
certainly can catch some things, but any serious discussion of them
needs to be aware of this.

Other than that, it all looks correct (I haven't checked the math, but
it feels at least reasonable).

Your command line looks correct.

I find it a little odd that level 15 didn't do better, but it's not
terribly surprising.  I would guess level 15 is about a stone stronger,
maybe less.

Evan Daniel

Douglas Ridgway wrote:
| Hi all!
|
| After reading some of the discussion on r.g.g. as to whether --level 15
| is any improvement over --level 10, I did some work on statistics. The
| question is, based on the results of a series of games, is player A
| stronger than player B? From the point of view of setting up the test, the
| question is how many games are necessary to identify a difference in
| strength of a given size. I think people here have also run such tests.
|
| I constructed [1] a table using KGS's formula for converting a strength
| difference in stones to probability of victory, allowing a 5% chance of
| falsely identifying a difference when there is none, and a 10% chance of
| missing a real difference at the stated mismatch. N is the number of games
| that need to be played, and Nw is the number of games that the stronger
| player must win to get declared stronger.
|
| Stones   p      N     Nw
| 0.5      0.60   264   148
| 1.0      0.69   67    42
| 1.5      0.77   30    21
| 2.0      0.83   18    14
| 2.5      0.88   12    10
| 3.0      0.92   9     8
|
| The results are interesting. For a short series of 10 games or fewer,
| nothing less than a complete blowout is statistically significant, and we
| wouldn't expect to see that without a major difference in strength,
| perhaps 3 stones. Identifying a substantial strength difference of 1.5 to
| 2.0 stones requires 20 or 30 games, with the stronger player winning at
| least two-thirds of them. Being sure of a strength difference of less
| than a stone requires hundreds of games.
|
| One idea is to check that a change at least hasn't made the program worse.
| The short series are so dominated by noise that they may not be worth
| running at all. A run of 20 or 30 games, on the other hand, with a
| required winning fraction of about two-thirds, makes some sense. That at
| least gives a 90% chance of catching a mistake that costs 1.5 to 2.0
| stones, and some chance of identifying smaller changes, positive or
| negative.
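
The "90% chance of catching a mistake" claim above can be checked directly with an exact binomial sum. The following is a minimal sketch, assuming the logistic win-probability formula from footnote [1] and the 30-game, 21-win row of the table; the function names are made up for illustration.

```python
from math import comb, exp


def win_probability(stones, k=0.8):
    """Win probability of the stronger player, using the formula in footnote [1]."""
    return 1.0 / (1.0 + exp(-k * stones))


def prob_at_least(wins, games, p):
    """Exact binomial probability of winning at least `wins` out of `games`."""
    return sum(comb(games, i) * p**i * (1.0 - p) ** (games - i)
               for i in range(wins, games + 1))


# The 1.5-stone row of the table: 30 games, 21 wins required.
games, wins_needed = 30, 21
for stones in (0.0, 0.5, 1.0, 1.5, 2.0):
    p = win_probability(stones)
    chance = prob_at_least(wins_needed, games, p)
    print(f"{stones:.1f} stones: P(wins >= {wins_needed}) = {chance:.3f}")
```

The 0.0-stone line is the false-alarm rate of the test (two equal players), and the remaining lines show how often a real difference of that size would be detected.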
|
| I tried 3.5.3 at --level 15 (always white, receiving 6.5 komi) against
| --level 10.  Assuming I did it right [2], they split the series 10-10,
| indicating a strength difference of a stone or less, and no clue which one
| is stronger.
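
One way to see why a 10-10 split points to "a stone or less" is to put a confidence interval on the win rate and convert its ends back to stones. This is only a sketch: the Wilson score interval is my choice here, not something the original analysis used, and the conversion simply inverts the logistic formula from footnote [1].

```python
from math import log, sqrt


def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win rate (an assumed method)."""
    phat = wins / games
    denom = 1.0 + z * z / games
    center = (phat + z * z / (2 * games)) / denom
    half = z * sqrt(phat * (1 - phat) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


def stones_from_probability(p, k=0.8):
    """Inverse of p = 1 / (1 + exp(-k * stones)) from footnote [1]."""
    return log(p / (1.0 - p)) / k


low, high = wilson_interval(10, 20)
print(f"win rate between {low:.2f} and {high:.2f}")
print(f"strength difference between {stones_from_probability(low):+.2f} "
      f"and {stones_from_probability(high):+.2f} stones")
```

With an even split the interval is symmetric around zero stones, so the series bounds the size of the difference but says nothing about its sign.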
|
| doug.
|
| [1] For people who'd like to check the math, here's the Matlab code:
|
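| % Win probability of the stronger player at each strength difference (stones)
| p = 1./(1+exp(-0.8*[0.5:0.5:3.0]))
| % Games needed: z = 1.96 for the 5% error rate, z = 1.28 for the 10% miss rate
| Ns = ceil(((1.96*sqrt(p.*(1-p))+1.28*sqrt(.5*(1-.5)))./(p-.5)).^2)
| % Wins needed: one more than the 97.5% quantile of an even (p = 0.5) series
| Nw = binoinv(0.975, Ns, 0.5)+1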
|
|
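
For anyone without Matlab, the three lines of footnote [1] can be ported to plain Python. This is a sketch rather than a verified replacement: it copies the sample-size formula exactly as written, and `binoinv_fair` is a hypothetical stand-in for Matlab's `binoinv` restricted to p = 0.5.

```python
from math import ceil, comb, exp, sqrt


def win_probability(stones, k=0.8):
    """p = 1 / (1 + exp(-0.8 * stones)), as in the Matlab code."""
    return 1.0 / (1.0 + exp(-k * stones))


def games_needed(p, z_alpha=1.96, z_beta=1.28):
    """Sample size, transcribed term by term from the Matlab line for Ns."""
    numerator = z_alpha * sqrt(p * (1 - p)) + z_beta * sqrt(0.5 * (1 - 0.5))
    return ceil((numerator / (p - 0.5)) ** 2)


def binoinv_fair(q, n):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, 0.5)."""
    total = 2 ** n
    cumulative = 0
    for k in range(n + 1):
        cumulative += comb(n, k)  # exact integer count of outcomes with <= k wins
        if cumulative / total >= q:
            return k
    return n


print("Stones   p      N     Nw")
for stones in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    p = win_probability(stones)
    n = games_needed(p)
    nw = binoinv_fair(0.975, n) + 1
    print(f"{stones:<8.1f} {p:<6.2f} {n:<5d} {nw}")
```

If the Nw column differs from the table by a game for the longer series, the quantile convention is the likely place to look; the N column depends only on the closed-form expression.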
| [2] Does the command line
|
|  perl twogtp --white '/usr/local/bin/gnugo --mode gtp --level 15' \
|    --black '/usr/local/bin/gnugo --mode gtp --level 10' \
|    --komi 6.5 --games 20 --sgffile filename.sgf
|
| look about right?
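
Because each engine command has to reach twogtp as a single quoted argument, it can be less error-prone to build the argument list in a script and let the shell quoting be generated. A minimal sketch, using only the options that appear in the command above; `twogtp_command` and its defaults are made up for illustration.

```python
import shlex


def twogtp_command(white_level, black_level, games=20, komi=6.5,
                   gnugo="/usr/local/bin/gnugo", sgffile="filename.sgf"):
    """Build the twogtp argument list quoted in footnote [2]."""
    def engine(level):
        # Each engine command is one argument, spaces included.
        return f"{gnugo} --mode gtp --level {level}"

    return ["perl", "twogtp",
            "--white", engine(white_level),
            "--black", engine(black_level),
            "--komi", str(komi),
            "--games", str(games),
            "--sgffile", sgffile]


argv = twogtp_command(15, 10)
print(shlex.join(argv))  # shell-quoted form, suitable for pasting into a terminal
```

The list form can also be handed to `subprocess.run(argv)` unchanged, which avoids shell quoting altogether.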
|

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFARiHR0OcbTJFOafIRAofJAJ4gjxJ/4FdjJBgKDveUPdcQ8fEzGwCfTKbi
fsSGHkYYVivv+A3Na3/OML8=
=bnZn
-----END PGP SIGNATURE-----



