Engine tuning
Part II
In search of something better
Phase one, selfplay equal versions.
On a 4-core PC with hyperthreading (so 8 threads) we play about 3600 games at 40/15 (40 moves in 15 seconds) on 4 threads, in order to estimate how closely the score settles on 50%. We also measure the number of equal and different games, and the average move number at which the 2 (totally equal) engines start to differ due to the fast time control.
Round I (graph)
Games: 3667 (1838 pairs)
Result: 50.2% / 50.7%
LOS: 51.3% / 85.0%
Threads: 4
Time control: 40/15
Run time: 14 hours
Equal games: 13
Different games: 1825
Average game length: 72 moves
Average equal moves: 6.8 moves
On an ideal PC one would expect 2 equal engines playing on the same time control to produce 100% equal games. This obviously is not the case.
As one can see, the operating system and the time control have a huge impact: on average only the first 6.8 moves are equal.
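The "equal games" and "average equal moves" figures can be computed by comparing the two games of each pair move by move. The following is a minimal sketch, not the tool used for this page; it assumes each game is available as a list of move strings, and the names `equal_moves` and `summarize` are made up for illustration.

```python
from itertools import takewhile


def equal_moves(game_a, game_b):
    """Count the leading moves that are identical in both games."""
    same = takewhile(lambda pair: pair[0] == pair[1], zip(game_a, game_b))
    return sum(1 for _ in same)


def summarize(pairs):
    """Return (equal games, different games, average equal moves) for game pairs."""
    prefixes = [equal_moves(a, b) for a, b in pairs]
    equal = sum(1 for a, b in pairs if a == b)
    return equal, len(pairs) - equal, sum(prefixes) / len(pairs)


# Tiny made-up example: one pair diverges on move 3, the other pair is identical.
pairs = [
    (["e4", "e5", "Nf3"], ["e4", "e5", "Nc3"]),
    (["d4", "d5"], ["d4", "d5"]),
]
print(summarize(pairs))
```

Whether a "move" here means a full move or a single ply depends on how the game records are parsed; the sketch simply counts list elements.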
While 50.2% is an acceptable deviation (just 1 elo point), given the low number of games we repeat the match. Sadly the twin match reports a 50.7% score (representing 3-4 elo), and thus we must conclude that 3600 games are unusable for fine tuning, which most of the time is about 1-3 elo. See the combined graph of the 2 matches.
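For reference, a score percentage can be translated into an elo difference with the usual logistic rating formula. The sketch below assumes that standard model; the function name `elo_from_score` is only illustrative.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


for pct in (50.2, 50.7):
    print(f"{pct:.1f}% -> {elo_from_score(pct / 100):+.1f} elo")
```

Under this model the two results come out at roughly 1.4 and 4.9 elo. Close to 50%, a rule of thumb is that each 1% of score is worth about 7 elo.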
Round II (graph)
Games: 3600 (1800 pairs)
Result: 50.2% / 48.8%
LOS: 59.1% / 3.1%
Threads: 3
Time control: 40/15
Run time: 18 hours
Equal games: 6
Different games: 1794
Average game length: 73 moves
Average equal moves: 7.7 moves
We repeat round I, but now using only 3 of the 4 cores (3 threads) in order to unburden the operating system.
The first run (50.2%) is an acceptable deviation, although the graph shows a quite turbulent course.
The second run shows an incredible loss of 8-9 elo points. It's not the operating system either; see the combined graph.
And now at 40/60 (4 times slower), as perhaps that will show a more consistent result.
Round III (graph)
Games: 1717 (858 pairs)
Result: 51.1%
LOS: 88.9%
Threads: 4
Time control: 40/60
Run time: 20 hours
Equal games: 1
Different games: 857
Average game length: 78 moves
Average equal moves: 7.7 moves
Only 1 equal game despite the 4 times slower time control; however, the average number of equal moves has increased from 6.8 to 7.7 moves.
Furthermore, this match shows that 1700 (time-consuming) games are by far not enough to prove 50.0%, as the 51.1% suggests a 7-8 elo improvement while the 2 engines are completely equal.
Conclusion: 3600 games are by far insufficient to measure small elo improvements. We either invest in extra hardware (and increase the number of games to 5000, 10.000 or 20.000) or we look for a better alternative.
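The size of the random swing can be estimated up front. The sketch below treats the games as independent and assumes a draw ratio of 30% between equal engines; both are assumptions of this example, not measurements from the matches above. It gives the 95% margin on the score for a given number of games, and the number of games needed for a given margin.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the normal distribution


def score_margin(games, draw_ratio=0.3):
    """95% margin on the score fraction for two equal engines."""
    variance = (1.0 - draw_ratio) / 4.0  # per-game variance when wins == losses
    return Z_95 * math.sqrt(variance / games)


def games_needed(margin, draw_ratio=0.3):
    """Games required to push the 95% margin down to `margin`."""
    variance = (1.0 - draw_ratio) / 4.0
    return math.ceil(variance * (Z_95 / margin) ** 2)


for n in (3600, 5000, 10000, 20000):
    print(f"{n} games: +/- {100 * score_margin(n):.2f}%")
```

With these assumptions 3600 games still leave a margin of a bit over one percentage point, which is several elo and therefore larger than the improvements fine tuning is looking for. Because the margin shrinks only with the square root of the number of games, halving it takes four times as many games.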
Phase two, selfplay equal versions based on nodes.
On a 4-core PC with hyperthreading (so 8 threads) we play 4000, 8000 and 16.000 games at 100.000 nodes per move. Since time is no longer an issue, we can run this on 8 threads within a reasonable time.
Round IV (graph)
Games: 4000 / 8000 / 16.000
Result: 50.3% / 50.3% / 50.2%
Threads: 8 / 8 / 8
Time control (nodes): 100.000 / 100.000 / 100.000
Runtime: 2 hours / 4 hours / 8 hours
Equal games: 1447 (4000-game match)
Different games: 553 (4000-game match)
Average game length: 63 moves (4000-game match)
Average equal moves: 56.6 moves (4000-game match)
The "nodes" system is attractive, considering one can run 16.000 games in 8 hours while also bypassing the time control issues. But as usual every advantage has its own disadvantage, and for the "nodes" system these are:
 When faced with a fail low, there will often not be enough time to find a better move. After 100.000 nodes it's just boom and the engine plays the bad move.
 The same applies when the engine is in the process of finding a better move and the search is suddenly terminated due to the 100.000 nodes limit, so the better move is not played.
While these kinds of problems don't exist with regular time control or fixed-depth testing, it's not unreasonable to assume these 2 disadvantages will be equally divided between both engines during a match of 4000 / 8000 / 16.000 games.
Besides, as programmer H.G. Muller stated, one can with some effort add a new time control option, something like 40 moves in 10M nodes, that solves the 2 problems above.
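One way such a "40 moves in 10M nodes" option could work is to treat nodes like clock time: each move gets an even share of the remaining budget, and the search may stretch beyond that share when it fails low. The sketch below is purely hypothetical; the class `NodeClock`, the stretch factor of 4 and the `min_nodes` reserve are assumptions of this example and do not describe any existing engine or GUI.

```python
class NodeClock:
    """Node-based analogue of a '40 moves in N' clock (hypothetical design)."""

    def __init__(self, moves_per_period=40, nodes_per_period=10_000_000,
                 min_nodes=10_000):
        self.moves_per_period = moves_per_period
        self.nodes_per_period = nodes_per_period
        self.min_nodes = min_nodes  # assumed floor kept back for each later move
        self.moves_left = moves_per_period
        self.nodes_left = nodes_per_period

    def soft_limit(self):
        """Normal budget: an even share of what is left in this period."""
        return self.nodes_left // self.moves_left

    def hard_limit(self, stretch=4):
        """Upper bound the search may use, e.g. after a fail low."""
        reserve = (self.moves_left - 1) * self.min_nodes
        return max(self.min_nodes,
                   min(stretch * self.soft_limit(), self.nodes_left - reserve))

    def move_played(self, nodes_used):
        """Book the nodes spent on a move and start a new period when due."""
        self.nodes_left -= nodes_used
        self.moves_left -= 1
        if self.moves_left == 0:
            self.moves_left = self.moves_per_period
            self.nodes_left += self.nodes_per_period  # unused nodes carry over


clock = NodeClock()
print(clock.soft_limit(), clock.hard_limit())
```

The point of the two limits is that a search in trouble is no longer cut off at a fixed count, while the total for the period stays fixed and therefore reproducible.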
Since the (my) current testing system (basically 40/15 matches) only covers the big elo changes and is not suitable for measuring fine tuning accurately, I am going to experiment with the "nodes" system. If it yields a number of positive changes, they should also produce a similar increase in playing strength at regular time controls.
Time will tell...
UPDATE
After collecting a number of improvements via the nodes testing method we pitted the new (1.87) version against the old (1.86) version, and the scaling at increasing time controls is excellent.
1.87 vs 1.86
Level: games, score, run time
Nodes=100.000: 16.000 games, score 54.0%, run time 8 hours
40 moves in 15 seconds: 8.555 games, score 53.3%, run time 28 hours
40 moves in 30 seconds: 3.018 games, score 53.1%, run time 21 hours
40 moves in 60 seconds: 2.034 games, score 53.8%, run time 27 hours
It seems we have created a more reliable testing method, as with 16.000 games (or 32.000 for that matter) we can now measure 1-2 elo improvements in a much faster setting.
Reliability (graph)
To test the above system for its reliability we play a fixed-depth match, which should produce 100% identical games. We get:
Games: 4000 (2000 pairs)
Result: 50.0%
Threads: 4
Time control: fixed depth
Equal games: 1995
Different games: 5
Average game length: 71 moves
Average equal moves: 71.3 moves
While 5 of the 2000 game pairs follow a different path (somewhere in the late endgame, I noticed), the overall impression is of a functional system fit for the purpose for which it was created.
Emulate matches
Emulate.exe is a small utility that assumes 2 identical engines, generates random results (win, draw, loss) and then calculates the game score and LOS. A typical output:
Games: 4000 / 8000 / 16.000 / 32.000 / 64.000 / 1000.000
Round 1: 48.7% (LOS 2.5%) / 49.1% (LOS 2.2%) / 49.6% (LOS 10.8%) / 49.6% (LOS 5.3%) / 50.0% / 49.8%
Round 2: 49.4% (LOS 18.7%) / 50.1% (LOS 60.3%) / 49.7% (LOS 14.7%) / 50.3% (LOS 89.5%) / 50.0% / 50.1%
Round 3: 50.2% (LOS 58.9%) / 49.5% (LOS 11.3%) / 50.4% (LOS 90.6%) / 49.5% (LOS 1%) / 50.0% / 49.9%
As we can see from the 16.000 and 32.000 matches, the score can still fluctuate by 0.8%, which is incompatible with our findings in real games (0.2%); see Round IV above. A likely explanation is that real games contain on average 60-80 moves, of which several are decisive for the end result, contrary to the flip-a-coin approach of this emulator. It's therefore reasonable to assume that the number of games of emulate.exe should be divided by a factor of 1.5 or even 2 to represent the real number of games. More research is needed.
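The flip-a-coin idea is easy to reproduce. The sketch below is not the source of emulate.exe; it is a minimal re-creation under stated assumptions: a draw ratio of 30%, equal win and loss chances, and the commonly used LOS formula based on the error function, in which draws do not count.

```python
import math
import random


def los(wins, losses):
    """Likelihood of superiority from wins and losses (draws are ignored)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def emulate(games, draw_ratio=0.3, rng=random):
    """Play `games` random games between 2 identical engines; return (score, LOS)."""
    win_p = (1.0 - draw_ratio) / 2.0  # assumed: equal chances to win and to lose
    wins = draws = losses = 0
    for _ in range(games):
        r = rng.random()
        if r < win_p:
            wins += 1
        elif r < win_p + draw_ratio:
            draws += 1
        else:
            losses += 1
    score = (wins + 0.5 * draws) / games
    return score, los(wins, losses)


rng = random.Random(1)  # fixed seed so a run can be repeated
for games in (4000, 8000, 16000, 32000):
    score, likelihood = emulate(games, rng=rng)
    print(f"{games} games: {100 * score:.1f}%  LOS {100 * likelihood:.1f}%")
```

Changing the seed gives a new "round"; changing `draw_ratio` shows how strongly the spread of the score depends on the share of drawn games.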
A graphical overview of 10.000 / 20.000 / 50.000 / 100.000 games running 5 rounds.
Comments welcome at the Programmer Forum.
Emulate, including source code, can be downloaded here.