Testing a chess engine from the ground up
Properly testing a chess engine is as hard as writing one. There are as many good ways to test the changes you make as there are bad ones. I would like to present a model that works for me, without trumpeting it as the right one; there probably isn't one right way, just good ones.
Furthermore, the data presented on this page was collected on a single standard quad PC running matches in 8 threads using hyperthreading. In other words, a poor man's testing environment.
When your first engine starts to take shape, take the time to include debug code for the search and especially the evaluation function. Via a #define my engine lists the most important evaluation details.
I am using Chesspartner from Lokasoft since it offers freely definable tabulators to display all the information I want. With this function I can scroll through a PGN database with one mouse click and check whether the values make sense for the board on the screen. For more information and how to implement this, check out the chapter Engine info at the end of the page.
A second hint for aspiring programmers is to play hundreds of one-ply games against your brainchild yourself, check whether each move makes sense given the one-ply limitation, and save positions that need improvement. Some self-discipline is required (say 5 games a day over a one-month period), but at this early stage of your engine it pays off much more than starting engine-engine matches immediately.
Engine - Engine Matches
I am using Arena for blitz 1+1 games and cutechess-cli for (depth-based) self-play bullet games in 8 threads. For that purpose make 8 copies of your engine folder called MAIN1-MAIN8 and 8 copies called WORK1-WORK8. The MAIN folders serve as the current base version and the WORK folders contain the new (hopefully better) engine. Install the 16 engines in Arena using the same nicknames as the folders, so MAIN1-MAIN8 and WORK1-WORK8.
Before you start playing your first engine-engine match, do a SANITY CHECK first. Start Arena, set up a match between MAIN1 and WORK1, set the level to a fixed depth (say 10 plies) and use WORK1.PGN (included in the download) as the input file. Start another Arena session and repeat the process for MAIN2 vs WORK2 using WORK2.PGN, until all 8 threads are running.
Closely watch whether the games are identical, abort each match once it has played an even number of games, and check that every match score is equal. If this is not the case, inspect the output PGN to find the reason, fix it and repeat the sanity check.
You are now ready to test your changes and run MAIN vs WORK matches. Repeat the sanity check on a regular basis.
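Manually comparing the scores of 8 Arena windows gets tedious. A small script can compare the match scores extracted from the output PGN files; this is a minimal sketch assuming standard PGN headers (the `match_score` helper and the sample nicknames are my own illustration, not part of any tool mentioned here):

```python
import re

def match_score(pgn_text, engine):
    """Sum `engine`'s score over all games in a PGN text:
    1 for a win, 0.5 for a draw, 0 for a loss."""
    whites = re.findall(r'\[White "([^"]+)"\]', pgn_text)
    results = re.findall(r'\[Result "([^"]+)"\]', pgn_text)
    score = 0.0
    for white, result in zip(whites, results):
        if result == "1/2-1/2":
            score += 0.5
        elif result in ("1-0", "0-1") and \
                (result == "1-0") == (white == engine):
            score += 1.0
    return score

# Two mirrored games: WORK1 wins one and draws one -> 1.5 points.
sample = (
    '[White "WORK1"]\n[Black "MAIN1"]\n[Result "1-0"]\n\n'
    '[White "MAIN1"]\n[Black "WORK1"]\n[Result "1/2-1/2"]\n'
)
print(match_score(sample, "WORK1"))   # 1.5
print(match_score(sample, "MAIN1"))   # 0.5
```

For the sanity check you would read each thread's output PGN and verify that all eight scores are equal.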
A useful tool is MATCH. Among other things it keeps track of the total score of the 8 threads.
Tuning the evaluation function
For tuning the evaluation I exclusively use cutechess-cli depth-based self-play matches of 8000 bullet games. On my simple i7-870 quad this takes around 10 hours, which is a reasonable time.
A depth-based testing system looks somewhat suspicious at first glance (and it is), but with a few whistles and bells it becomes an elegant, working system that excludes any form of user and/or Windows (and friends) interference, the kind that influences time-control-based testing. During depth-based matches you (or the operating system) can do as many things in the background as you wish; the match result will be the same.
The whistles and bells
Actually only one bell is needed: the introduction of a special parameter that sets the base (and minimum) fixed depth. In ProDeo I use the parameter [PLY = x], where x is the base depth for the middle-game phase, gradually increased when the game moves into later phases.
When for example using [PLY = 8] then:
- All middle-game positions are played at depth 8.
- When the queens are off, the base depth (8) is raised by 1 and becomes 9.
- Normal endgames are played at depth 10.
- In simple endgames (rook endings) the base depth is increased by 3 and becomes 11.
- Light-piece endings are searched with +4, thus 12 plies deep.
- And finally, pawn endings at 13 plies.
- We make one exception for queen endings: the depth increase is only 1, thus 9 plies, because of the many check extensions that usually play a role in queen endings.
Before the search starts the ply-depth is calculated based on the material on the board.
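I can't show ProDeo's actual material tests here, but the phase table above can be sketched roughly as follows. The piece-count arguments and the "normal endgame" threshold are my own assumptions for illustration:

```python
def fixed_depth(base, queens, rooks, minors):
    """Translate [PLY = base] into the actual search depth, following
    the phase table above. Counts are for both sides combined; the
    classification thresholds are guesses, not ProDeo's real code."""
    if queens == 0 and rooks == 0 and minors == 0:
        return base + 5        # pawn ending
    if queens == 0 and rooks == 0:
        return base + 4        # light-piece ending
    if queens == 0 and minors == 0:
        return base + 3        # rook ending
    if queens > 0 and rooks == 0 and minors == 0:
        return base + 1        # queen ending: only +1 (check extensions)
    if queens + rooks + minors <= 4:
        return base + 2        # normal endgame (hypothetical threshold)
    if queens == 0:
        return base + 1        # queens are off
    return base                # middle game

print(fixed_depth(8, 0, 0, 0))   # 13, pawn ending
print(fixed_depth(8, 0, 2, 0))   # 11, rook ending
print(fixed_depth(8, 2, 4, 6))   # 8, middle game
```

The function is called once per game, before the search starts, so every move of a game is played at the same depth.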
As such we have created a testing environment that relatively closely emulates the depth behavior of games played at a time control, without any interference from outside, thus excluding randomness from a process that is already very sensitive to randomness.
I am not claiming this is the way to tune your evaluation, just that it works for me.
Cutechess-cli is a command-line tool and is best operated via batch (*.bat) files. For instance, run a fixed-depth match MAIN vs WORK with cutechess-cli via the batch file:
c:\cc\cutechess-cli -engine name=MAIN1 cmd=yourengine.exe dir=C:\a\main1 proto=uci -engine name=WORK1 cmd=yourengine.exe dir=c:\a\work1 proto=uci -each tc=inf -draw 160 100 -resign 5 500 -rounds 1000 -repeat -pgnout c:\cc\all.pgn -pgnin c:\cc\1.pgn -pgndepth 20
Just 8 mouse clicks in the Windows File Manager start the 8 matches, each playing 1000 games at "tc=inf", controlled by the [PLY = x] parameter. All games are collected in ALL.PGN for use in MATCH.
The -concurrency option
Cutechess-cli has a powerful option called -concurrency [x], where x is the number of threads. This avoids all the hoopla regarding the creation and maintenance of 2 x 8 folders. The option -concurrency 8 starts 8 threads from the same folder. Example:
c:\cc\cutechess-cli -concurrency 8 -engine name=MAIN cmd=yourengine.exe dir=C:\a\main proto=uci -engine name=WORK cmd=yourengine.exe dir=c:\a\work proto=uci -each tc=inf -draw 160 100 -resign 5 500 -rounds 1000 -repeat -pgnout c:\cc\all.pgn -pgnin c:\cc\1.pgn -pgndepth 20
And now all runs with one mouse click.
The disadvantage of the -concurrency option is that it can't produce per-thread statistics, since all 8 running engines carry the same nickname. Why that matters becomes clear in the thread statistics in the chapter Introducing the monster below: ideally you want the 8 threads not to fluctuate too much and, in case of an improvement, all 8 threads to score above 50%.
Testing SEARCH is a different animal. It hardly makes sense to run depth-based matches, although in certain circumstances it can be extremely handy to do so, for instance to measure the speed gain (or loss) percentage of a search change.
From experience I learned that running a couple of hundred games at a reasonable fixed depth already gives a reliable speed gain (or loss) percentage, and I was able to quickly trace candidate improvements that would do well at regular time control too. An explanatory example is in order.
After removing a static reduction, the search slowed down by 10% but the elo gain was 35.
Engine WORK (elo 2500) vs Engine MAIN (elo 2500) estimated TPR 2535 (+35)
491-394-362 (1247) match score 688.0 - 559.0 (55.2%)
Won-loss 491-362 = 129 (1247 games) draws 31.6%
LOS = 100.0% Elo Error Margin +15 -15
WORK 8:08:12 (46.299M nodes) NPS = 1.581K
MAIN 7:23:35 (41.969M nodes) NPS = 1.577K
In the end, at regular time control, the change wasn't worth the full 35 elo of course, but that is irrelevant; I am just presenting an idea to quickly test possible improvements as an INDICATOR of a real elo improvement at regular time control.
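For reference, an Elo estimate like the TPR above follows from the standard logistic model relating score percentage to rating difference (MATCH's exact TPR formula may differ slightly in the details):

```python
import math

def elo_diff(score_pct):
    """Elo difference corresponding to a match score percentage,
    via the standard logistic rating model."""
    s = score_pct / 100.0
    return -400.0 * math.log10(1.0 / s - 1.0)

print(round(elo_diff(55.2)))   # 36, close to the +35 TPR shown above
print(round(elo_diff(50.0)))   # 0
```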
Always test your final version at regular time control.
Another way to test your search is to use a collection of special tactical and positional positions. Elo-wise this has very little meaning. It depends on the goal you have in mind for your engine: if you want to compete in the elo lists, don't give test sets much weight; focus on eng-eng matches instead.
However if your goal is to write an attractive playing engine then the Strategic Testset is an excellent tool.
Eng-eng matches are great for climbing the elo lists, but as the saying goes, every advantage has its disadvantage (and vice versa!). The disadvantage of eng-eng matches is losing control over the playing style of your program, since you hardly look at the games your brainchild plays any longer and tend to concentrate only on the match results.
Introducing the monster
One of the main problems of eng-eng testing is its random nature when pitting two engines of almost equal strength against each other, in self-play but also against other opponents. I recently saw a match that had a +48 score after 1200 games drop to +17 within half an hour. Another recent example is shown below, from the run-up to the latest ProDeo release.
# ENGINE : RATING POINTS PLAYED (%)
1 WORK1 : 2522.2 420.5 747 56.3%
2 WORK4 : 2509.9 392.0 742 52.8%
3 WORK8 : 2509.8 377.0 714 52.8%
4 WORK3 : 2509.7 400.0 758 52.8%
5 MAIN5 : 2504.1 369.0 721 51.2% <<< ---- look ---- <<<
6 WORK7 : 2503.5 388.5 762 51.0%
7 WORK2 : 2503.0 383.5 754 50.9%
8 WORK6 : 2500.9 374.5 745 50.3%
9 MAIN6 : 2499.1 370.5 745 49.7%
10 MAIN2 : 2497.0 370.5 754 49.1%
11 MAIN7 : 2496.5 373.5 762 49.0%
12 WORK5 : 2495.9 352.0 721 48.8%
13 MAIN3 : 2490.3 358.0 758 47.2%
14 MAIN8 : 2490.2 337.0 714 47.2%
15 MAIN4 : 2490.1 350.0 742 47.2%
16 MAIN1 : 2477.8 326.5 747 43.7%
Engine WORK (elo 2500) vs Engine MAIN (elo 2500) estimated TPR 2512 (+12)
2112-1952-1879 (5943) match score 3088.0 - 2855.0 (52.0%)
Won-loss 2112-1879 = 233 (5943 games) draws 32.8%
LOS = 100.0% Elo Error Margin +7 -7
WORK 34:44:23 (188.350M nodes) NPS = 1.506K
MAIN 33:26:34 (177.180M nodes) NPS = 1.472K
generated with MATCH 1.4
As the example shows, the WORK engine wins 7 of the 8 matches, resulting in a 12-13 elo gain. What if we had only played the WORK5 version against the MAIN5 version? We would have rejected the changes and thrown away 12-13 elo points. After all, the only difference between the versions is the 8 different PGN files with the start positions.
We are facing the monster of randomness: apparently (and way too often!) 721 games are not enough to show the superiority of good changes. How to deal with that?
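How big is the monster? A back-of-the-envelope estimate of the 95% Elo error margin for a given number of games shows why 721 games cannot resolve a 12 elo difference. The formula below is my own rough approximation, not MATCH's exact computation:

```python
import math

def elo_error_margin(games, draw_ratio=0.33, z=1.96):
    """Approximate 95% Elo error margin of a match score around 50%.
    Draws lower the per-game variance; draw_ratio ~0.33 matches the
    statistics shown above."""
    var = 0.25 * (1.0 - draw_ratio)           # variance of one game score
    se = math.sqrt(var / games)               # standard error of the mean score
    slope = 400.0 / (0.25 * math.log(10.0))   # ~695 Elo per unit of score at 50%
    return z * se * slope

print(round(elo_error_margin(721)))           # 21 elo: a +12 gain drowns in noise
print(round(elo_error_margin(5943, 0.328)))   # 7 elo, matching the +7/-7 above
```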
Volume is the way to weed out randomness, and the LOS is a helpful tool here.
LOS means: Likelihood of superiority.
A LOS of 95% means that the match result statistically gives 95% certainty that the version is superior. It does not say anything about the elo gain, just that you have 95% certainty the version is better by at least 0.00001 elo.
Among programmers there is by now a kind of consensus that 95% is the norm to maintain, provided enough games are played. "Enough" is still undefined, but 5000 games currently seems to be the lower limit, and I tend to agree.
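MATCH reports the LOS for you; for the curious, the commonly used closed form is a normal approximation over the decisive games only, since draws say nothing about which engine is stronger:

```python
import math

def los(wins, losses):
    """Likelihood of superiority: the probability that the true
    strength difference is positive, given only the decisive games."""
    return 0.5 * (1.0 + math.erf(
        (wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(round(los(2112, 1879), 4))   # 0.9999, the 100.0% from the table above
print(round(los(55, 45), 2))       # 0.84: a +10 edge in 100 games proves little
```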
The latter automatically brings us to the final and important question: when do you terminate a running match? A couple of pieces of advice:
- Don't give up too early on a version. A 40-60 score after 100 games means nothing statistically. If after 500 games the result is still 40%, it's high time to terminate.
- Don't accept changes, even with a LOS of 95%, before 1000 games.
Other testing philosophies
So far we have discussed self-play matches. Another good option is to focus on other engines. It's an old and classic discussion which system is better, self-play or tuning your engine against different engines. My experience has taught me there is hardly a difference, and if there is, it can't be proven.
The most important thing in testing is consistency. Once you have chosen a system that works, stick to it. A number of good options:
- Choose one reliable opponent that is rated 50-100 elo higher and exclusively tune against it.
- Choose a pool of reliable engines, say 2 that are 50-100 elo stronger, 2 that are of equal strength and 2 that are 50-100 elo weaker.
The opponents of your choice must be reliable; for instance, they should never lose a game on time forfeit, otherwise your results will not be reliable.
In an ideal world the above 2 options would give the same results when you change opponents, but unfortunately this is too often not the case.
We have discussed a testing environment on a simple quad PC with hyperthreading allowing us to play 8 eng-eng matches. We play:
- At least 8000 bullet games to test our evaluation changes with a LOS of 95% before we accept a change as an improvement.
- At least 2000 games at a reasonable time control [blitz 1+1] or [40/2] with a LOS of 95% before we accept a search change as an improvement. [blitz 1+1] takes less than one day.
- Better hardware is a huge advantage. If you own an 8-16 core PC or cluster hardware you can test much more accurately by increasing the number of games and/or playing matches at a more reliable time control than bullet games, the latter because of the diminishing-returns effect.
- You can download the PGN sets with starting positions I am using. These were made either with Kirr's Opening Sampler or with Protools. A short description:
4 x 100 | Early endgame positions without queens.
8 x 100 | Normal endgame positions.
4 x 40  | Endgames with light pieces (Knight/Bishop) only.
4 x 25  | Pawn endings.
6 x 100 | Rook endings.
8 x 500 | Games truncated after 10 moves of opening theory.
[ To Part II ]