bug-gnubg

## Re: [gnubg] Help with a new MET

From: Joseph Heled
Subject: Re: [gnubg] Help with a new MET
Date: Tue, 12 Nov 2019 16:04:34 +1300

Hi Timothy,

Here is a stats question I encounter from time to time.

Suppose I run N BG games and collect the average win rates and gammon rates. That gives 4 estimates, which are dependent since they sum to 1.
How do I determine the confidence intervals for each? This is a 4d vector and it seems like a non-trivial question, but I assume it crops up a lot and must have a standard answer.

Thanks, Joseph

On Tue, 12 Nov 2019 at 15:17, Timothy Y. Chow <address@hidden> wrote:
Ian,

Thanks for putting all this effort into a new MET!

I don't know too much about the innards of GNU Backgammon, but I do know

In terms of how many matches you would have to play between GNU-old-MET
and GNU-new-MET, that depends on how much stronger GNU-new-MET is.
Suppose that GNU-new-MET has a 51%/49% edge over GNU-old-MET.  That means
that if you played 1000 matches, then you would expect a score of 510 to
490.  The problem is that if GNU-old-MET were playing against itself, the
standard deviation would be about 15.8.  So a 510 to 490 result would be
far from statistically significant.  You'd need about 10000 trials to
barely reach statistical significance: The expected score would be 5100 to
4900 and the standard deviation would be 50, so 5100 would be two standard
deviations away.  In general the formula for the standard deviation is
sqrt(n)/2 where n is the number of matches.
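
As a rough check on the arithmetic above, here is a minimal Python sketch. It assumes each match is an independent coin flip with win probability p, so the number of wins is binomial with standard deviation sqrt(n*p*(1-p)), which reduces to sqrt(n)/2 at p = 0.5. The function names `match_score_sd` and `z_score` are made up for illustration and are not part of GNU Backgammon.

```python
import math

def match_score_sd(n, p=0.5):
    """Standard deviation of the win count over n independent matches.

    Assumes a binomial model; at p = 0.5 this equals sqrt(n) / 2.
    """
    return math.sqrt(n * p * (1 - p))

def z_score(wins, n, p_null=0.5):
    """How many standard deviations the win count lies from an even score."""
    return (wins - n * p_null) / match_score_sd(n, p_null)

# The two scenarios from the text: a 51%/49% edge over 1000 and 10000 matches.
for n in (1000, 10000):
    wins = round(0.51 * n)
    print(n, wins, round(match_score_sd(n), 1), round(z_score(wins, n), 2))
```

Under this model, 510 wins out of 1000 is well under one standard deviation from even, while 5100 out of 10000 is two standard deviations away.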

There's another point to be cognizant of, which is that there is a
distinction between statistically significant evidence of the bare-bones
claim that "the new MET is better," and a good estimate of *how* much
stronger GNU-new-MET is than GNU-old-MET.  Let's say you played 10000
matches and the score was 5100 to 4900.  You could then claim that the new
MET is better, and say that this claim is significant at the two standard
deviation level.  But you *couldn't* claim that you are 95% confident that
the new MET gives you a 51%/49% edge over the old MET.  To get a good
estimate of the edge requires more trials.  How many trials you need would
depend on how sharp an estimate you want.
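
To illustrate the difference between detecting an edge and estimating it, here is a hedged Python sketch using the usual normal-approximation interval for a proportion. The 1.96 multiplier for roughly 95% confidence and the helper names `edge_interval` and `matches_needed` are assumptions made for this example, not anything taken from GNU Backgammon.

```python
import math

Z95 = 1.96  # assumed two-sided 95% quantile of the normal distribution

def edge_interval(wins, n, z=Z95):
    """Approximate confidence interval for the true win probability."""
    p_hat = wins / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def matches_needed(half_width, z=Z95, p=0.5):
    """Matches needed so the interval is about +/- half_width wide.

    Uses p = 0.5 by default, which is the worst case for the variance.
    """
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

low, high = edge_interval(5100, 10000)
print(f"5100/10000: win probability between {low:.4f} and {high:.4f}")
print("matches for +/- 0.5 percentage points:", matches_needed(0.005))
```

With 5100 wins out of 10000 the interval runs from just above 50% to a little under 52%, so the data are consistent with an edge far smaller or noticeably larger than 51%/49%. Halving the interval width requires about four times as many matches.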

I don't have as much insight into what might be going wrong with the
cubeful calculations.  It does sound to me that there might be a problem
with floating-point precision, but someone with knowledge of the code will
have to comment on that.

Tim