Date: 01 Aug 2008 10:40:57
From: Guest
Subject: How do you know if an engine is xyz% better???
One topic shows up in this group regularly.... namely, whether some engine
is now xyz% stronger than a previous version.

Hyatt has been doing some tests over the past year or so, and he's posted
occasionally on it.

He's just posted some data again...

http://www.talkchess.com/forum/viewtopic.php?t=22731&postdays=0&postorder=asc&topic_view=&start=0

For those who don't want to read the thread, here's the basic situation.

Let's say you have an engine that plays at some ELO 'abc'. (This rating is
established by serious testing, rated games, etc.) And you make some changes.

How can you tell if the changes make the program stronger or weaker?

The obvious answer is to play a few games and find out. (Often this is done
either with a standard set of program opponents or on the various internet
chess servers.)

The question that has often been raised in the TalkChess forums is "How
many games do you really need to play?"

Hyatt has published several responses to that question (often getting
arguments in return.)

Well, Hyatt has published some more data on answering that question... (And
you can bet your home he still has the hard game data to back up the
results. One thing you can always be sure of with him is that he does
serious testing and gets hard data.)

To answer the question... Even 800 games is *NOT* enough to determine if a
new version of your program is actually stronger than your old version.
(These were standard even opening positions, from both sides, against a
fixed set of program opponents.)

(This is about program changes. Not performance / speed improvements like
you'd get if you moved to faster hardware, etc.)

There's enough statistical variance and timing difference (clock skew, OS
overhead, random stuff, etc.) that even a small variation in NPS can lead to
wildly different results.

So if anybody is thinking that playing a handful of games is enough to say a
new version of their program is better than the previous version, you might
want to rethink.


Note that this does not mean that a handful of tests isn't enough to
detect *massive* changes in program strength. If your program suddenly
jumps 300 points, then that kind of change will be easier to detect. But
smaller changes, like you would get from refining your program, can become
very hard to detect.
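
To put some rough numbers on it, here's a back-of-the-envelope sketch in
Python (my own illustration, *not* Hyatt's code; it treats each game as an
independent win/loss trial and ignores draws, which only makes the bound
conservative) of the Elo band a match of n games leaves you with:

import math

def elo_interval(score, n, z=1.96):
    # score: fraction of points won (0..1); n: games played.
    # Returns an approximate 95% confidence interval in Elo.
    se = math.sqrt(score * (1.0 - score) / n)   # std. error of the score
    lo, hi = score - z * se, score + z * se
    to_elo = lambda s: -400.0 * math.log10(1.0 / s - 1.0)
    return to_elo(lo), to_elo(hi)

print(elo_interval(0.5, 800))    # roughly -24 to +24 Elo
print(elo_interval(0.5, 25000))  # roughly -4 to +4 Elo

So even a dead-even 800-game match leaves a band almost 50 Elo points wide,
which is far wider than the handful of points a typical refinement buys you.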



Well, I just thought I should post this and give interested people a "heads
up" to go over to TalkChess and follow the discussion.
Doubt it'll do any good, but.... (shrug)








 
Date: 03 Aug 2008 14:02:28
From: johnny_t
Subject: Re: How do you know if an engine is xyz% better???
The engines get tested thousands of times when released. They get
tested hundreds of times before they are released. Strength and variance
are just math.

People have been doing this correctly for a long time.

Try looking up CEGT or CCRL for the latest lists, methodologies,
variances, and ELO.

Sheesh.

Guest wrote:
> One topic shows up in this group regularly.... namely, whether some engine
> is now xyz% stronger than a previous version.
> [snip]


  
Date: 03 Aug 2008 18:14:05
From: Guest
Subject: Re: How do you know if an engine is xyz% better???
"johnny_t" <[email protected] > wrote in message
news:[email protected]...
> The engines get tested thousands of times when released. They get tested
> 100's of times before they are released. Strength and variance is just
> math.

It's "just math" if you have humans involved.

When it's computer vs. computer testing, though, things don't behave as
expected. There is more involved than the normal assumptions implied in the
"just math".

What Hyatt has shown (in that thread and several others) is that when he
plays matches consisting of hundreds of thousands of games (something very
few people can do), the results he gets do not match the expected results.

There is so much variance that you can't depend on running a few hundred
automated games to test changes in your program.

Throughout chess programming history, people have generally tuned their
engines in one of two ways.

1) Use some test positions, let your program generate a move, compare that
to the 'predicted' move, then adjust weights in various ways.

2) Play a few dozen to a few hundred games against a few (sometimes just
one) other programs. If the added idea causes worse play, then toss it out
or adjust the weight.

Number 1 can be used for casual tuning but it's not the most accurate.

Number 2 is what most people do. Hyatt has repeatedly shown (not just that
one thread, but several past threads) that doing a few dozen or even a few
hundred automated games is not enough to accurately determine if a
modification is better or not.
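
To see why a couple hundred games is so treacherous, here's a quick
simulation (again my own sketch, not Hyatt's setup): two literally
identical engines, no draws for simplicity, and a naive "keep the change
if it scores like +10 Elo" rule:

import random

def match_score(n_games, p_win=0.5):
    # Score of "version B" over n_games between two identical engines
    # (no draws, for simplicity).
    return sum(random.random() < p_win for _ in range(n_games)) / n_games

trials = 10000
threshold = 0.514   # expected score at roughly +10 Elo
false_positives = sum(match_score(200) >= threshold for _ in range(trials))
print(false_positives / trials)   # about 1 in 3 runs

So roughly a third of your 200-game matches will "confirm" an improvement
that does not exist. String a few of those verdicts together and you can
spend months tuning pure noise.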

Until recently, very very few people had access to computing clusters like
he does now. So very few people have ever been in a position to run such
massive testing. Not many people can run tests involving anywhere from
25,000 games on up to a couple million games.

The results he's getting do not sit well with many people. They violate
expected behavior, which is usually based on matches with humans involved, or
on computer vs. computer testing with only a few hundred games played.

But he is getting the results.




>
> People have been doing this correctly for a long time.

People *think* they've been doing this correctly for a long time.

That's not the same thing.

When you get testing results from a few hundred tests that don't agree with
the results you get from hundreds of thousands to millions of tests, then
you've got problems if all you can do is a few hundred tests.


> Try looking up CEGT or CCRL for the latest lists, methodologies,
> variances, and ELO.
>
> Sheesh.

I know about those kinds of tests.

And those aren't the same kinds of tests that Hyatt is doing.

He's doing it on a much more massive scale, and in exactly the way you'd do
it if you were trying to determine whether a program change is better or
worse than an older version. Read the thread.

And he's getting provable, repeatable results that do not agree with the
small tests that others have been doing.


People have been assuming that the results from a 'small' match involving a
few hundred computer vs. computer games are accurate.

Hyatt's tests have repeatedly shown they aren't. There's enough randomness
and computer vs. computer interaction that even fairly large scale testing
is too inaccurate to detect small changes in playing quality.

That's kind of the problem....


This is not the first thread he's done on this. He's been running tests
like this for a couple of years now, ever since his university added some
cluster computers. Nobody had done this kind of massive testing before.

And these unexpected results are definitely pissing off a lot of people,
because they just don't agree with their preferred beliefs and with how
everybody has been doing their testing over the years.

Unfortunately, not too many other people have the resources to run such
massive testing. So right now, nobody can repeat his experiment. We have to
talk with Bob, try to determine what might have caused such results, and
when nothing can reasonably explain them, assume they are right. Just like
the other reports he's done on massive testing.








 
Date: 02 Aug 2008 06:00:28
From: Sanny
Subject: Re: How do you know if an engine is xyz% better???
> Note that this does not mean that a handful of tests isn't enough to
> detect *massive* changes in program strength. If your program suddenly
> jumps 300 points, then that kind of change will be easier to detect. But
> smaller changes, like you would get from refining your program, can become
> very hard to detect.

When GetClub Chess improved by 30% I was unable to detect the
changes. But 4 months back, when the game doubled its strength, I was
able to see the improvement after watching a few games.

Bye
Sanny

Play Chess at: http://www.GetClub.com/Chess.html





  
Date: 02 Aug 2008 09:43:06
From: Guest
Subject: Re: How do you know if an engine is xyz% better???
>"Sanny" <[email protected]> wrote in message
>news:e20b61e1-c869-4676-8cd7-af84f9a946b9@r35g2000prm.googlegroups.com...
>> Note that this does not mean that a handful of tests isn't enough to
>> detect *massive* changes in program strength. If your program suddenly
>> jumps 300 points, then that kind of change will be easier to detect. But
>> smaller changes, like you would get from refining your program, can become
>> very hard to detect.
>
>When GetClub Chess improved by 30% I was unable to detect the
>changes.

Then how the expletive do you know it actually improved by 30%??!

Unless you mean as if you moved to hardware that was 30% faster. That
doesn't mean it was 30% better, though. Just that it ran 30% faster.


>changes. But 4 months back, when the game doubled its strength, I was
>able to see the improvement after watching a few games.

It depends on what you define as "double in strength".

If you mean going from 1000 to 2000 points, then yes, that would be easy to
detect with a high degree of confidence. (Or similar doubling in scales
that aren't linear.)
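
(For scale, here's a quick sketch of what various gaps mean in expected
score under the standard Elo logistic model; again just my own illustration:

def expected_score(elo_diff):
    # Expected score of the stronger side under the Elo logistic model.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(expected_score(1000))  # ~0.997: a 1000-point jump wins nearly every game
print(expected_score(20))    # ~0.529: a 20-point edge is barely above a coin flip

A jump like the first is visible in a handful of games; a 20-point
refinement is not.)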

If you mean program speed while getting the exact same result and searching
the exact same tree (as if you moved to faster hardware with no program
changes), then that too would be fairly easy to detect and no real need to
actually test, since that would be pure speed improvement.

If you mean any other kind of 'doubling in strength', then it's a worthless
estimate. Claims that it predicts move XYZ in 30% less time, or can hold its
own against program ABC in 30% less time, or plays games against program ABC
that are 30% longer, are utterly bogus. Pure crap. They're simply not valid
ways to measure playing improvements.

(The first two only say it gets the same old result in 30% less time, but
say nothing about whether it would actually play better given the full time.
It may change its mind to a worse move. The last one assumes there's a direct
correlation between game length and strength, but there's not. A long game
does not mean the program is strong, and an even longer game does not mean
the program is even stronger. By that reasoning, a game that is dragged out
for 200 moves would mean the programs are super super strong. Game length
can be related to strength but it says nothing about the quality of the
moves themselves.)

Elo (or other) ratings and raw speed are the only two measurements that have
any meaning.

Now, test sets (like the WAC set or even the classic Bratko-Kopec set, and
many others) have their uses. They can be used as a test to see if anything
was broken by your latest changes or if it can 'see' something new. And it
can be useful to keep track of improvements in those areas. There's nothing
wrong with reporting results from standardized tests, as long as you report
them as such.

But only full games can be used to gauge a program's strength with any sort
of accuracy. And based on Hyatt's results over the past few years, LOTS of
full games are needed.
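
How many is LOTS? A rough rule of thumb (my own back-of-the-envelope
figure, same simplifying assumptions as before: no draws, roughly even
opponents):

import math

def games_needed(elo_diff, z=1.96):
    # Approximate games needed to resolve a gap of elo_diff at 95% confidence.
    # Near a 50% score, 1 Elo is worth about 1/695 of a point per game.
    score_gap = elo_diff / 695.0
    return math.ceil((z * 0.5 / score_gap) ** 2)

print(games_needed(20))   # ~1,160 games for a 20-Elo change
print(games_needed(5))    # ~18,600 games for a 5-Elo refinement

Which is exactly why the 25,000-game-and-up runs Hyatt is doing matter.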

(Some may ask why Hyatt is seeing these results and not anybody else.
Certainly a valid question! It's possible there's a flaw in Hyatt's method.
But it's more likely that few chess programmers have access to the kind of
hardware he does and just can't / don't run hundreds of thousands of games
to try to detect small improvements. Most programs go their entire life
without ever playing a hundred thousand games. Hyatt, on the other hand,
can do this kind of testing almost casually.)



>
>Bye
>Sanny
>
>Play Chess at: http://www.GetClub.com/Chess.html






