Wisdom is knowing how little we know -- Socrates

About Company

Started in 1980 and retired in 2004, REBEL was baptized ProDeo, Latin for gratis according to Dutch tradition.


                                  Engine tuning               

                                      Part II

                                       In search for something better

 

Phase one, selfplay equal versions.

On a 4 core PC with hyperthreading (so 8 threads) we play about 3600 games at 40/15 (40 moves in 15 seconds) in 4 threads in order to estimate when the score settles on exactly 50%. We also measure the number of equal | different games and the average move number at which the 2 (totally equal) engines start to differ due to the fast time control.

 

Round I   (graph)

 Games  3667 (1838 pairs)
 Result  50.2%  | 50.7%
 LOS  51.3%  | 85.0%
 Threads  4
 Time control  40/15
 Run time  14 hours
 Equal games  13
 Different games  1825
 Average game length  72 moves
 Average equal moves  6.8 moves

On an ideal PC one would expect 2 equal engines playing on the same time control to produce 100% equal games. This obviously is not the case.

 

As one can see, the operating system and the time control have a huge impact: on average only the first 6.8 moves are equal.
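The "equal games" and "average equal moves" figures can be reproduced along the following lines. This is a minimal sketch, not the tool used for the matches: it assumes each game is available as a list of move strings, that a pair counts as equal only when both move lists are identical, and that the average is taken over all pairs. The names `equal_prefix` and `summarise_pairs` are made up for the illustration.

```python
from itertools import takewhile


def equal_prefix(game_a, game_b):
    """Number of leading moves that are identical in both games."""
    same = takewhile(lambda pair: pair[0] == pair[1], zip(game_a, game_b))
    return sum(1 for _ in same)


def summarise_pairs(pairs):
    """pairs: list of (game_a, game_b), each game a list of move strings."""
    equal_games = 0
    prefix_total = 0
    for game_a, game_b in pairs:
        shared = equal_prefix(game_a, game_b)
        prefix_total += shared
        # A pair is "equal" only if every move of both games matches.
        if shared == len(game_a) == len(game_b):
            equal_games += 1
    return {
        "equal_games": equal_games,
        "different_games": len(pairs) - equal_games,
        "average_equal_moves": prefix_total / len(pairs) if pairs else 0.0,
    }


# Hypothetical toy data, not taken from the matches above.
toy_pairs = [
    (["e4", "e5", "Nf3", "Nc6"], ["e4", "e5", "Nf3", "Nc6"]),
    (["d4", "d5", "c4", "e6"], ["d4", "d5", "Nf3", "Nf6"]),
]
print(summarise_pairs(toy_pairs))
```

Whether one list element stands for a full move or a single ply depends on how the games are parsed, so the average has to be read in whatever unit the input uses.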

 

While 50.2% is an acceptable deviation (just 1 elo point), given the low number of games we repeat the match. Sadly the twin match reports a 50.7% score (representing about 5 elo) and thus we must conclude that 3600 games are unusable for fine tuning, which most of the time is about 1-3 elo. See the combined graph of the 2 matches.
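For reference, a percentage score can be translated into an elo difference with the usual logistic formula. The sketch below assumes that this standard formula is the conversion meant on this page; the function name `elo_from_score` is just an illustration.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1), logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Scores reported in the rounds on this page.
for pct in (50.2, 50.7, 48.8, 51.1):
    print(f"{pct:.1f}% -> {elo_from_score(pct / 100.0):+.1f} elo")
```

Close to 50% the curve is almost linear, so each extra 1% of score is worth roughly 7 elo.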

 

Round II (graph)

 Games  3600 (1800 pairs)
 Result  50.2%  | 48.8%
 LOS  59.1%  |   3.1%
 Threads  3
 Time control  40/15
 Run time  18 hours
 Equal games  6
 Different games  1794
 Average game length  73 moves
 Average equal moves  7.7 moves

We repeat round I, but now using only 3 of the 4 cores (3 threads) in order to unburden the operating system.

 

The first run (50.2%) is an acceptable deviation, although the graph shows a quite turbulent course.

 

The second run shows an incredible loss of 8-9 elo points. It's not the operating system either, see the combined graph.
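The LOS (likelihood of superiority) values in the tables depend only on the number of wins and losses, draws play no role. A minimal sketch using the commonly quoted normal approximation follows; whether the match tool uses exactly this formula is an assumption, and the win/loss counts in the example are hypothetical because the page reports percentages only.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games (normal approximation)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


# Hypothetical counts, for illustration only.
print(f"LOS = {los(620, 600):.1%}")
```

Two equal engines should hover around 50%, so values like 85% or 3% between identical versions are noise and not evidence.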



 

And now at 40/60 (4 times slower), as perhaps that will show a more consistent result.

 

Round III  (graph)

 Games  1717 (858 pairs)
 Result  51.1%
 LOS  88.9%
 Threads  4
 Time control  40/60
 Run time  20 hours
 Equal games  1
 Different games  857
 Average game length  78 moves 
 Average equal moves  7.7 moves

Only 1 equal game despite the 4 times slower time control; however, the average number of equal moves has increased from 6.8 to 7.7 moves.

 

Furthermore this match shows that 1700 (time-consuming) games are by far not enough to prove 50.0%, as the 51.1% suggests a 7-8 elo improvement while the 2 engines are completely equal.

 

 

 


 

Conclusion - 3600 games are by far insufficient to measure small elo improvements. We either invest in extra hardware (and increase the number of games to 5000, 10.000 or 20.000) or we look for a better alternative.
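How the uncertainty shrinks with the number of games can be estimated with a few lines of statistics. The sketch below rests on stated assumptions: the games are independent, the engines are equal, the draw ratio is 30% (a guess, not a measured value) and the interval is the usual 95% one. The names `score_margin` and `games_needed` are made up for the illustration.

```python
import math

# Slope of the logistic elo curve at a 50% score (elo per unit of score fraction).
ELO_PER_SCORE_AT_50 = 400.0 / (math.log(10) * 0.25)


def score_margin(games: int, draw_ratio: float = 0.30, z: float = 1.96) -> float:
    """Half-width of the confidence interval on the score fraction."""
    # For equal engines the per-game standard deviation is 0.5 * sqrt(1 - draws).
    sigma_per_game = 0.5 * math.sqrt(1.0 - draw_ratio)
    return z * sigma_per_game / math.sqrt(games)


def games_needed(elo_margin: float, draw_ratio: float = 0.30, z: float = 1.96) -> int:
    """Games required to bring the interval down to +/- elo_margin."""
    score_target = elo_margin / ELO_PER_SCORE_AT_50
    sigma_per_game = 0.5 * math.sqrt(1.0 - draw_ratio)
    return math.ceil((z * sigma_per_game / score_target) ** 2)


for n in (3600, 5000, 10_000, 20_000):
    m = score_margin(n)
    print(f"{n:6d} games: +/- {m:.2%} score, about +/- {m * ELO_PER_SCORE_AT_50:.1f} elo")
```

The margin only shrinks with the square root of the number of games, which is why halving it takes four times as many games.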

 


 

Phase two, selfplay equal versions based on nodes.

On a 4 core PC with hyperthreading (so 8 threads) we are playing 4000, 8000 and 16.000 games at 100.000 nodes. Since time is no longer an issue we can run this in 8 threads within a reasonable time.

 

 Round IV (graph)

 Games  4000  8000  16.000
 Result  50.3%  50.3%  50.2% 
 Threads  8  8  8
 Time control (nodes)  100.000  100.000  100.000
 Runtime  2 hours  4 hours  8 hours
 Equal games  1447     
 Different games  553    
 Average game length  63 moves    
 Average equal moves  56.6 moves    

 

 

 

 


 

 

 

 

 

 



 

The "nodes" system is attractive considering one can run 16.000 games in 8 hours while also bypassing the time control issues. But as usual every advantage has its own disadvantage, and for the "nodes" system these are:

 

  1. When faced with a fail-low there will often not be enough nodes left to find a better move. After 100.000 nodes it's just boom and the engine plays the bad move.
     
  2. The same applies when the engine is in the process of finding a better move and the search is suddenly terminated by the 100.000 nodes limit, so the better move is not played.

 

While these kinds of problems don't exist with regular time control or fixed depth testing, it's not unreasonable to assume these 2 disadvantages will be equally divided between both engines during a match of 4000 | 8000 | 16.000 games.

 

Besides, as programmer H.G. Muller stated, one can with some effort add a new time control option, something like 40 moves in 10M nodes, that solves the 2 problems above.
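One way such a "40 moves in 10M nodes" control could work is sketched below. This is only an illustration of the idea, not H.G. Muller's design and not code from any engine: the class `NodeClock`, the `max_stretch` factor of 3 and the stop rule are all assumptions. The point is that nodes are treated like a clock, so a move facing a fail-low may borrow from the session budget instead of being cut off at a fixed count.

```python
class NodeClock:
    """Session clock counted in nodes instead of seconds (illustrative only)."""

    def __init__(self, moves_per_session=40, nodes_per_session=10_000_000,
                 max_stretch=3.0):
        self.moves_per_session = moves_per_session
        self.nodes_per_session = nodes_per_session
        self.max_stretch = max_stretch  # hypothetical cap on overspending one move
        self.moves_left = moves_per_session
        self.nodes_left = nodes_per_session

    def target(self) -> int:
        """Normal budget for the next move: an even share of what is left."""
        return max(1, self.nodes_left // self.moves_left)

    def hard_limit(self) -> int:
        """Most we may spend now while keeping at least 1 node per later move."""
        reserve = self.moves_left - 1
        stretched = int(self.target() * self.max_stretch)
        return max(1, min(stretched, self.nodes_left - reserve))

    def should_stop(self, nodes_searched, fail_low=False,
                    best_move_changing=False) -> bool:
        if nodes_searched >= self.hard_limit():
            return True
        if nodes_searched < self.target():
            return False
        # Past the target: keep searching only while the move is in trouble.
        return not (fail_low or best_move_changing)

    def move_played(self, nodes_used: int) -> None:
        self.nodes_left = max(0, self.nodes_left - nodes_used)
        self.moves_left -= 1
        if self.moves_left == 0:  # next session: fresh moves, extra nodes
            self.moves_left = self.moves_per_session
            self.nodes_left += self.nodes_per_session


clock = NodeClock()
print(clock.target(), clock.hard_limit())
print(clock.should_stop(250_000, fail_low=True))
```

Because the budget is still counted in nodes, the result stays independent of the load on the machine, which is the property that makes node-based testing attractive in the first place.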

 

Since my current testing system (basically 40/15 matches) only covers the big elo changes and is not suitable for measuring fine tuning accurately, I am going to experiment with the "nodes" system. If it yields a number of positive changes, they should also produce a similar increase in playing strength at regular time controls.

 

Time will tell.........

 

UPDATE

 

After collecting a number of improvements via the nodes testing method we pitted the new (1.87) version against the old (1.86) version, and the scaling at increasing time controls is excellent.

 

1.87 vs 1.86

 Level                    Games    Score   Run time
 Nodes=100.000            16.000   54.0%   8 hours
 40 moves in 15 seconds   8.555    53.3%   28 hours
 40 moves in 30 seconds   3.018    53.1%   21 hours
 40 moves in 60 seconds   2.034    53.8%   27 hours

 

It seems we have created ourselves a more reliable testing method: with 16.000 games (or 32.000 for that matter) we can now measure 1-2 elo improvements in a much faster setting.

 

 


 

Reliability  (graph)

To test the above system for its reliability we play a fixed ply match, which should produce 100% identical games. We get:

 

 Games  4000 (2000 pairs)
 Result  50.0%
 Threads  4
 Time control  fixed depth
 Equal games  1995
 Different games  5
 Average game length  71 moves
 Average equal moves  71.3 moves

While 5 of the 2000 game pairs follow a different path (somewhere in the late endgame, I noticed), the overall impression is of a functional system fit for the purpose for which it was created.

 

 

 

 

 

 

 

 


 

                                                      Emulate matches

 

Emulate.exe is a small utility that assumes 2 identical engines, generates random results (win, draw, loss) and then calculates the game score and LOS. A typical output:

 

 Games     4000 (Score | LOS)   8000            16.000          32.000          64.000   1.000.000
 Round-1   48.7% |  2.5%        49.1% |  2.2%   49.6% | 10.8%   49.6% |  5.3%   50.0%    49.8%
 Round-2   49.4% | 18.7%        50.1% | 60.3%   49.7% | 14.7%   50.3% | 89.5%   50.0%    50.1%
 Round-3   50.2% | 58.9%        49.5% | 11.3%   50.4% | 90.6%   49.5% |  1%     50.0%    49.9%

 

As we can see from the 16.000 and 32.000 matches the score can still fluctuate by 0.8%, which is incompatible with our findings in real games (0.2%), see Round IV above. A likely explanation is that real games contain on average 60-80 moves, of which several are decisive for the end result, contrary to the flip-a-coin approach of this emulator. It's therefore reasonable to assume the number of games of emulate.exe should be divided by a factor of 1.5 or even 2 to represent the real number of games. More research is needed.
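The flip-a-coin approach is easy to try out. The following is a minimal sketch in Python of what such an emulator might look like; it is not the source of emulate.exe. The 30% draw ratio, the function name `emulate` and the normal-approximation LOS formula are assumptions made for the illustration.

```python
import math
import random


def emulate(games: int, draw_ratio: float = 0.30, rng=None):
    """Simulate a match between 2 identical engines; return (score, los)."""
    rng = rng or random.Random()
    win_p = (1.0 - draw_ratio) / 2.0  # equal engines: wins and losses equally likely
    wins = draws = losses = 0
    for _ in range(games):
        r = rng.random()
        if r < win_p:
            wins += 1
        elif r < 2.0 * win_p:
            losses += 1
        else:
            draws += 1
    score = (wins + 0.5 * draws) / games
    decisive = wins + losses
    if decisive == 0:
        los = 0.5
    else:
        los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))
    return score, los


for round_no in range(1, 4):
    cells = []
    for n in (4000, 8000, 16_000, 32_000):
        score, los = emulate(n)
        cells.append(f"{score:.1%} | {los:.1%}")
    print(f"Round-{round_no}  " + "  ".join(cells))
```

The assumed draw ratio matters here: the more draws, the smaller the spread of the score, so a different draw ratio alone already changes how many emulated games correspond to a real match.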

 

A graphical overview of 10.000 | 20.000 | 50.000 | 100.000 games running 5 rounds.

 

Comments welcome at the Programmer Forum.

 

Emulate, including source code, can be downloaded here.

 

 

                                                        [ Back to Part I ]

Copyright ® 2013  Ed Schröder