Date: 10 Nov 2006 00:46:42
From: [email protected]
Subject: Test suites and ply depth
Would it be useful to have test suites which consist of a set of
positions which should be solvable in 12 ply, a set which should be
solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?

That way, one could quickly check a program by seeing how well it
performed on the 12 ply set, for example, since it takes much less time
at lower ply depth.

And one could see if there was a problem at a certain depth, say the
program works well up to 16 ply, but does poorly at higher ply.

There could be both tactical sets and strategic sets which should be
solvable by the time the search reaches a certain ply depth.

I don't know much about test suites (most likely the above has already
been done if it is a good idea, as it seems obvious enough), but was
thinking about the issue after reading about the recent experiment
where Capablanca did better than the other world champions when tested
against Crafty limited to a search depth of 12 ply. Obviously it would
be interesting to try the test again at greater ply depths, and with
other engines as well as that most admirable engine Crafty. More games
by other strong players, from sufficiently strong tournaments and
matches, could also be included. A correlation might then be established
between such results against engines and Elo ratings or similar rating
systems.

I was thinking an engine might be deliberately tuned to be consistent
with the style of a given player, like Capablanca, but then realized
the existing test suites probably already do even better than that at
tuning an engine's playing style.

To compensate for opening-book knowledge, perhaps a chess tree could be
annotated to show the date at which each move from a given position was
first played in high-level competition, and the engine test would begin
only with the first new move, one which hadn't been played before.





 
Date: 13 Nov 2006 22:34:17
From: Simon Waters
Subject: Re: Test suites and ply depth
> <[email protected]> wrote in message
> news:[email protected]...
>> Would it be useful to have test suites which consist of a set of
>> positions which should be solvable in 12 ply, a set which should be
>> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?
>
> No, not really.

Although a similar test is done: there are published tables of the
correct number of leaf nodes at specific depths from the
initial position. It's a reasonable way to detect flaws in the move generator.
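
For what it's worth, that node-counting test is easy to sketch. Assuming the third-party python-chess library for move generation (my assumption, not something mentioned in the thread), a minimal version looks like this:

```python
import chess  # third-party python-chess library


def perft(board, depth):
    """Count leaf nodes of the legal-move tree to the given depth."""
    if depth == 0:
        return 1
    nodes = 0
    for move in board.legal_moves:
        board.push(move)                  # make the move
        nodes += perft(board, depth - 1)
        board.pop()                       # unmake it
    return nodes


board = chess.Board()                     # standard initial position
print([perft(board, d) for d in (1, 2, 3)])  # → [20, 400, 8902]
```

The published values from the initial position are 20, 400, 8902, 197281, ... so any mismatch with the tables points straight at a move-generator bug.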


  
Date: 14 Nov 2006 10:10:36
From: Mr. Question
Subject: Re: Test suites and ply depth
"Simon Waters" <[email protected] > wrote in message
news:4558f2e9.0@entanet...
>> <[email protected]> wrote in message
>> news:[email protected]...
>>> Would it be useful to have test suites which consist of a set of
>>> positions which should be solvable in 12 ply, a set which should be
>>> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?
>>
>> No, not really.
>
> Although a similar test is done. There are published tables
> for the correct number of legal moves to specific depths from the
> initial position. A reasonable way to detect flaws in the move generator.

Yes, that's called "Perft". And actually, the initial position isn't a good
position to test with. It misses a lot of types of moves (en passant,
promotion, mate, etc.). Positions like "Kiwi Pete" are better to test with Perft.
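
As a sketch of that, here is "Kiwi Pete" run through a bare-bones perft (this is my illustration, assuming the third-party python-chess library; the FEN and the expected counts are the commonly published ones):

```python
import chess  # third-party python-chess library


def perft(board, depth):
    """Count leaf nodes of the legal-move tree to the given depth."""
    if depth == 0:
        return 1
    total = 0
    for move in board.legal_moves:
        board.push(move)
        total += perft(board, depth - 1)
        board.pop()
    return total


# "Kiwi Pete" stresses castling both ways, en passant, checks,
# and (at deeper plies) promotions -- the cases the startpos misses.
KIWIPETE = "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1"
board = chess.Board(KIWIPETE)
print(perft(board, 1), perft(board, 2))  # → 48 2039
```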

There are some limitations to Perft. Although it tests the move generator,
the make & unmake move routines, and the InCheck() stuff, that still leaves
a lot untested.

Also, the guy wanted to know about solving positions to find out how
good the programs are. Perft doesn't help with that.


One final comment about Perft.... Many people use perft as a benchmark, but
you should not do that. Its runtime behavior is *very* different from an
actual search's, so you can't use it as a real benchmark. You can use it as
a measure of whether your core routines are faster or slower than before,
but that doesn't relate well to actual search performance, because search
performance depends mostly on your move ordering, not the low-level
performance of individual routines.

(Perft stands for "Per"formance "T"uning. It was originally conceived as a
way to measure your low-level routine performance, as well as to debug your
core routines. The way it behaves is totally different from the way a
search works, so you can't relate perft performance to search performance or
to program strength.)

Also, you can't compare perft results among programs because they may be
doing things differently than what you do. They may be updating more
information (databases etc.) that they use in their evaluator or nifty
search extensions, or whatever. So it's possible for their makemove() to be
slower but their overall program to be faster and stronger. (And as a side
note, faster does not mean stronger. A program can be very fast but have a
stupid evaluator and play poorly. And a slow program doesn't mean it's
stupid, either.)

Finally, many people *cheat* in perft tests. Because they treat it as a
benchmark to compare against other programs, they will do things like
skipping the makemove() on the final ply of the perft test. Or they'll use
hash tables specifically designed for perft tests.
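
To illustrate the hash-table trick (my own sketch, not anyone's actual code, and again assuming the third-party python-chess library), memoizing on a Zobrist key makes every transposed subtree essentially free, which is exactly why hashed perft numbers aren't comparable to honest ones:

```python
import chess
import chess.polyglot  # provides zobrist_hash()

cache = {}


def perft_hashed(board, depth):
    """Perft with a transposition table keyed on (Zobrist hash, depth)."""
    if depth == 0:
        return 1
    # Transposed positions at the same depth are counted once and reused,
    # so far fewer makemove()/unmakemove() calls are actually executed.
    key = (chess.polyglot.zobrist_hash(board), depth)
    if key in cache:
        return cache[key]
    total = 0
    for move in board.legal_moves:
        board.push(move)
        total += perft_hashed(board, depth - 1)
        board.pop()
    cache[key] = total
    return total


print(perft_hashed(chess.Board(), 4))  # → 197281
```

The totals come out the same as an unhashed perft (ignoring Zobrist collisions), but the nodes-per-second figure is inflated and meaningless as a benchmark.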


Perft is not designed as a benchmark. It's designed as a debugging aid and
as a way to tell whether your own performance improvements in your makemove() &
unmakemove() are actually faster than before. It cannot be compared across
programs.


Now, having said all of that, I'm quite willing to admit that I too have
used perft as a benchmark to compare my program to others. Even though I
know I shouldn't, I've done it anyway....





----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----


 
Date: 10 Nov 2006 22:09:46
From: Mr. Question
Subject: Re: Test suites and ply depth
<[email protected] > wrote in message
news:[email protected]...
> Would it be useful to have test suites which consist of a set of
> positions which should be solvable in 12 ply, a set which should be
> solvable in 13 ply, ... 14 ply, ... and so on up as far as desirable?

No, not really.

Most programs do search depth differently. They don't simply search 'x'
plies deep and stop.

There are a variety of search modifications that can be done, including
null-move pruning, various reductions, etc.

Then there are the search extensions that can be done during the main
search. For example, not counting against the depth a move that deals with
being in check, or a move that promotes a pawn. Or whatever.

Then there are the search extensions during the q-search. Which nodes to
expand and which ones to ignore.
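
To make the in-check example concrete, here is a hedged sketch (my own illustration, assuming the third-party python-chess library and a toy material count, not any particular program's search) of a check extension inside a plain negamax:

```python
import chess

# Toy material values; a real evaluator is far more involved.
VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
          chess.ROOK: 500, chess.QUEEN: 900}


def evaluate(board):
    """Material-only score from the side to move's point of view."""
    score = sum(v * (len(board.pieces(p, chess.WHITE)) -
                     len(board.pieces(p, chess.BLACK)))
                for p, v in VALUES.items())
    return score if board.turn == chess.WHITE else -score


def negamax(board, depth):
    if board.is_check():
        depth += 1  # check extension: being in check doesn't use up a ply
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -10**9
    for move in board.legal_moves:
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best
```

A real search would add alpha-beta, quiescence, and the other tricks mentioned above; the point is only that a nominal "depth" stops meaning one fixed thing once extensions like this are in play.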

So search depth can't be compared among programs. One program's depth 8 may
be another program's depth 10 even though they may take about the same
length of time to search and seem to find the same moves.

It's only valid for a particular program, for comparison with other
versions, as a way of helping to gauge whether a modification is an
improvement or not. And even then, the usefulness is limited.


Nor can you depend on search time. Many programs are tuned for specific
types of architectures. Or are designed for SMP systems. Or certain types
of parallel systems. Or whatever. So comparing their performance on
non-native hardware isn't a reliable indicator of what they can do.

So even time-based tests are only valid when run on the program's preferred
hardware, and you can't compare those results to another program's even
when both run on the same hardware, because that hardware may not be what
the other program was designed for.


There are some 'standard' test suites that some people use. Bratko-Kopec,
Win At Chess, etc. But it's really hard to compare results from one program
to the next.

About all you can really say is something like "On my system (cpu=xyz,
mhz=abc, board=123, ram=fgh, etc.) I got these results...." Whether you
report the search time or the search depth, the results are only valid for
that system and your program.


There is so much variation among programs that it's not really easy to
compare their performance. The only reasonable way to do that is full
tournaments where each program plays dozens of games against the others.
And even then, there's more that could be said.





