If you can't explain it simply, you don't understand it well enough -- Albert Einstein

 

Adam Hair Column

My name is Adam Hair. I am a CCRL tester and a computer chess enthusiast.


I am deeply interested in discovering new information concerning computer chess.


    A Pairwise Comparison of Chess Engine Move Selections

     

    The following is a report on the data I have accumulated using Don Dailey's similarity tool. I successfully tested 99 unique engines (each by a different author) and 378 engines overall. The percentage of matched moves for each possible engine pair was then computed.

     

    All engines in the sample are considered to be unrelated by virtue of having different authors. Known clones and derivatives were not included in this sample. Engine pairs whose matched move percentage is much greater than the unique engine pair average (5 standard deviations or more) are then considered possibly related. The rationale for using 5 standard deviations as the threshold comes from the following observation:

     

    1000 unique engines is a generous upper limit for the total number of unique engines in existence. This means that there are no more than 999*1000/2 = 499,500 possible pairs of unique engines. 5 standard deviations corresponds to an event that we expect to occur 1 time in 3,488,556.

     

    Since the actual number of unique engine pairs is less than 499,500, and 499,500 is far smaller than 3,488,556, we would not expect to find a pairing whose matched move percentage is 5 standard deviations or more above the mean. Even with the full 499,500 pairs, only 499,500/3,488,556 ≈ 0.14 such pairs would be expected to occur by chance.

     

    An attempt was made to increase the accuracy of the similarity test by adjusting the thinking time used for each engine. The rationale is the assumption that two engines will be most similar in move selection when their thinking times are adjusted so that they are equal in strength. This assumption may be flawed, for it assumes that search speed (as measured by depth) is the most significant factor in determining strength. Yet, from informal testing, I have determined that adjustments in thinking time do increase the percentage of matched moves for engine pairs.

     

    I should note that in an earlier engine comparison, there was some correlation between the percentage of matched moves and the Elo difference (the correlation was negative), even after the time adjustments. I have not yet attempted to examine such a correlation in the current data, but I would not be surprised if some negative correlation exists.

    The individual engine thinking time per position was determined in the following manner:

     

    Houdini 1.5 was used to determine the times for the other engines. First, Houdini 1.5 was given 20 ms of thinking time per position. Then, for each engine, its Elo difference from Houdini 1.5 was estimated using the CCRL and CEGT blitz lists. The Elo difference was then used in this formula to determine thinking times:

     

    Y = 20 * (2^(Elo difference/120))

     

    The formula is based on discussions found in CCC and my personal tests. It has been generally established that each doubling of processor speed or thinking time increases the Elo of an engine by a certain amount. The amount normally stated (the source of which appears to be How Computers Play Chess, by David Levy and Monty Newborn) is 50 to 70 Elo.

     

    However, some authors have more recently reported finding higher Elo increases. I decided to determine for myself the approximate Elo increase one could expect, since a somewhat accurate estimate is needed for the formula. I found that for each doubling of thinking time (at thinking times approximately in the range used in this study), the increase is approximately 120 Elo.

     

    I have no expectation that this increase per doubling holds for longer thinking times nor for all engines in my study. I simply contend that it provides for a more accurate measurement of move selection correlation between pairs of engines as opposed to making no adjustments.
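
    As an illustration of the formula, here is a small Tcl sketch (mine, not part of the similarity tool) that computes the thinking time from the estimated Elo difference:

    # Thinking time Y = 20 * 2^(Elo difference/120), rounded to the
    # nearest millisecond. Positive differences mean the engine is
    # weaker than Houdini 1.5 and so receives more time.
    proc thinkTime {eloDiff} {
        expr { int(0.5 + 20.0 * pow(2.0, double($eloDiff) / 120.0)) }
    }

    # An engine estimated to be 500 Elo weaker than Houdini 1.5:
    # 20 * 2^(500/120) = 359 ms per position.
    puts [thinkTime 500]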

     

    Here is the sample of engines tested, along with their thinking time per position:

     

      1) Alaric 707 (time: 361 ms  scale: 1.0)
      2) Alfil 11 (time: 453 ms  scale: 1.0)
      3) Amy 0.87 (time: 1738 ms  scale: 1.0)
      4) Amyan 1.72 (time: 529 ms  scale: 1.0)
      5) AnMon 5.75 (time: 730 ms  scale: 1.0)
      6) Arasan 13.3 (time: 525 ms  scale: 1.0)
      7) Ares 1.004 (time: 2575 ms  scale: 1.0)
      8) Aristarch 4.51 (time: 622 ms  scale: 1.0)
      9) Atlas 3.14b (time: 1310 ms  scale: 1.0)
     10) Bison 9.11 (time: 219 ms  scale: 1.0)
     11) Bobcat 2.75 (time: 335 ms  scale: 1.0)
     12) Booot 5.1.0 (time: 136 ms  scale: 1.0)
     13) Cerebro 3.03d (time: 1034 ms  scale: 1.0)
     14) Cheese 1.3 (time: 2203 ms  scale: 1.0)
     15) Cheng 1.06 (time: 415 ms  scale: 1.0)
     16) Chess Tiger 2007.1 (time: 263 ms  scale: 1.0)
     17) Colossus 2008b (time: 399 ms  scale: 1.0)
     18) Comet B68 (time: 1842 ms  scale: 1.0)
     19) Crafty 23.4 (time: 183 ms  scale: 1.0)
     20) Critter 1.2 (time: 28 ms  scale: 1.0)
     21) CuckooChess 1.12 (time: 1016 ms  scale: 1.0)
     22) Cyrano 06b17 (time: 415 ms  scale: 1.0)
     23) DanaSah 4.60 (time: 825 ms  scale: 1.0)
     24) Deuterium 11.02.29.107 (time: 422 ms  scale: 1.0)
     25) Diablo 0.51 (time: 1810 ms  scale: 1.0)
     26) Dirty 25AUG2011 (time: 359 ms  scale: 1.0)
     27) Djinn 0.925x (time: 1885 ms  scale: 1.0)
     28) Dragon 4.6 (time: 1445 ms  scale: 1.0)
     29) ET Chess 130108 (time: 406 ms  scale: 1.0)
     30) Frenzee 3.5.19 (time: 202 ms  scale: 1.0)
     31) Fruit 2.1 (time: 290 ms  scale: 1.0)
     32) Gandalf 6.0 (time: 453 ms  scale: 1.0)
     33) GarboChess 3 (time: 583 ms  scale: 1.0)
     34) Gaviota 0.84 (time: 320 ms  scale: 1.0)
     35) Glass 1.8 (time: 526 ms  scale: 1.0)
     36) GNU Chess 5.07.173b (time: 511 ms  scale: 1.0)
     37) Godel 2.05 (time: 1641 ms  scale: 1.0)
     38) Green Light Chess 301.2.2 (time: 869 ms  scale: 1.0)
     39) GreKo 8.2 (time: 788 ms  scale: 1.0)
     40) Gromit 3.8.2 (time: 1040 ms  scale: 1.0)
     41) Gull 1.2 (time: 72 ms  scale: 1.0)
     42) Hamsters 0.71 (time: 577 ms  scale: 1.0)
     43) Hannibal 1.1 (time: 96 ms  scale: 1.0)
     44) Hermann 2.8 (time: 774 ms  scale: 1.0)
     45) Hiarcs 12 (time: 135 ms  scale: 1.0)
     46) Homer 2.01 (time: 2079 ms  scale: 1.0)
     47) Houdini 1.5 (time: 20 ms  scale: 1.0)
     48) Hussar 0.4 (time: 1364 ms  scale: 1.0)
     49) Jonny 4.00 (time: 500 ms  scale: 1.0)
     50) Junior 10.1 (time: 229 ms  scale: 1.0)
     51) King of Kings 2.56 (time: 1310 ms  scale: 1.0)
     52) Knight Dreamer D33 (time: 1594 ms  scale: 1.0)
     53) Komodo 3 (time: 130 ms  scale: 1.0)
     54) Ktulu 8 (time: 254 ms  scale: 1.0)
     55) LambChop 10.99 (time: 1356 ms  scale: 1.0)
     56) Leila 0.53h (time: 1689 ms  scale: 1.0)
     57) List 5.12 (time: 538 ms  scale: 1.0)
     58) Little Goliath Evolution 3.12 (time: 869 ms  scale: 1.0)
     59) Movei 0.08.438 (10_10_10) (time: 435 ms  scale: 1.0)
     60) N2 0.4 (time: 544 ms  scale: 1.0)
     61) nanoSzachy 3.9 (time: 948 ms  scale: 1.0)
     62) Naraku 1.31 (time: 381 ms  scale: 1.0)
     63) Naum 4.2 (time: 58 ms  scale: 1.0)
     64) Nemo 1.0 beta (time: 248 ms  scale: 1.0)
     65) Onno 1.04 (time: 134 ms  scale: 1.0)
     66) Patzer 3.80 (time: 1874 ms  scale: 1.0)
     67) Pepito 1.59 (time: 959 ms  scale: 1.0)
     68) Petir 4.39 (time: 633 ms  scale: 1.0)
     69) Pharaon 3.51 (time: 505 ms  scale: 1.0)
     70) Philou 3.60 (time: 686 ms  scale: 1.0)
     71) ProDeo 1.74 (time: 800 ms  scale: 1.0)
     72) Pupsi2 0.08 (time: 869 ms  scale: 1.0)
     73) Quark 2.35 (time: 1364 ms  scale: 1.0)
     74) Quazar 0.3a.exe (time: 1388 ms  scale: 1.0)
     75) RedQueen 1.0.0 (time: 337 ms  scale: 1.0)
     76) RobboLito 0.09 (time: 27 ms  scale: 1.0)
     77) Rotor 0.6 (time: 590 ms  scale: 1.0)
     78) Ruffian 2.10 (time: 478 ms  scale: 1.0)
     79) Rybka 4.1 (time: 28 ms  scale: 1.0)
     80) Shredder 11 (time: 125 ms  scale: 1.0)
     81) Sjeng WC 2008 (time: 131 ms  scale: 1.0)
     82) SlowChess Blitz WV2.1 (time: 479 ms  scale: 1.0)
     83) SmarThink 1.20 (time: 226 ms  scale: 1.0)
     84) Snitch 1.6.2 (time: 1420 ms  scale: 1.0)
     85) Spark-1.0 (time: 82 ms  scale: 1.0)
     86) SpiderChess 070603 (time: 1273 ms  scale: 1.0)
     87) Spike 1.4 (time: 87 ms  scale: 1.0)
     88) Stockfish 2.11 (time: 31 ms  scale: 1.0)
     89) Tao 5.6 (time: 1022 ms  scale: 1.0)
     90) The Baron 2.23 (time: 651 ms  scale: 1.0)
     91) The King 3.50 (time: 327 ms  scale: 1.0)
     92) Thinker 5.4d Inert (time: 102 ms  scale: 1.0)
     93) Thor's Hammer 2.28 (time: 2416 ms  scale: 1.0)
     94) Tornado 4.40 (time: 260 ms  scale: 1.0)
     95) Trace 1.37a (time: 874 ms  scale: 1.0)
     96) Ufim 8.02 (time: 757 ms  scale: 1.0)
     97) WildCat 8 (time: 427 ms  scale: 1.0)
     98) Yace 0.99.87 (time: 859 ms  scale: 1.0)
     99) Zappa Mexico II (time: 126 ms  scale: 1.0)

     

    Note: ProDeo 1.74 was given an additional ~300 ms per position to allow for lag in initiating its search. In additional testing, Komodo 4's time was increased from 26 ms to 200 ms for the same reason.

     

    There are 99*98/2 = 4851 pairs. The average matched move percentage was 45.16, and the standard deviation of the data was 2.86.

     



    The following table presents the odds of a pair matching more than a given percentage of moves, assuming that N(45.16, 2.86) is a good approximation of the matched move percentage distribution for the unique engines used in the sample.

     

    Standard     Percentage       Approximate odds that the matched move
    Deviations   Matched Moves    percentage for a pair exceeds Column 2

     0           45.16            1 chance in 2
     1           48.02            1 chance in 6
     2           50.88            1 chance in 44
     3           53.74            1 chance in 741
     4           56.60            1 chance in 31,574
     5           59.46            1 chance in 3,488,556
     6           62.32            1 chance in 1,013,594,635
     7           65.18            1 chance in 781,332,343,402
     8           68.04
     9           70.90
    10           73.76
    11           76.62
    12           79.48
    13           82.34
    14           85.20
    15           88.06
    16           90.92
    17           93.78
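
    The odds in the table can be approximated with a short Tcl calculation (my own sketch, not part of the tool). It uses the standard asymptotic expansion of the normal tail, which is only reliable for roughly 3 standard deviations and above:

    # Approximate odds against a result k standard deviations above the
    # mean, using the asymptotic series for the normal tail probability
    #   Q(k) ~ exp(-k^2/2)/(k*sqrt(2*pi)) * (1 - 1/k^2 + 3/k^4 - 15/k^6)
    # (accurate only for large k, roughly k >= 3).
    proc tailOdds {k} {
        set pi 3.14159265358979
        set phi [expr { exp(-$k * $k / 2.0) / sqrt(2.0 * $pi) }]
        set series [expr { 1.0 - 1.0 / ($k * $k) + 3.0 / pow($k, 4) - 15.0 / pow($k, 6) }]
        expr { $k / ($phi * $series) }
    }

    # tailOdds 5.0 gives about 3.49 million, matching the table's
    # 1 chance in 3,488,556.
    puts [format "1 chance in %.0f" [tailOdds 5.0]]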

     

     

    There are some pairs in the unique engine sample that have greater than 59.46% of moves matched. They are:

     

    Engine 1          Engine 2        % Moves Matched

    Alaric 707        Fruit 2.1            59.78
    Onno 1.04         Fruit 2.1            66.75
    Critter 1.2       Houdini 1.5          60.91
    RobboLito 0.09    Houdini 1.5          61.43
    RobboLito 0.09    Critter 1.2          59.99

     

    I decided not to remove these outliers, to avoid introducing bias.

     

    The following is a dendrogram (a diagram that represents the relationships, as measured by the matched move percentages) for the 99 unique engines:

     

    As noted in the first paragraph, I tested 378 engines, many of them related to each other. The 99 engines listed above can be used to judge the level of similarity between engines from the larger group and other, supposedly unrelated, engines.

     

    The method I used was to choose engines from the larger group and insert them in place of their known relatives in the base group. For example, to compare Thinker 4.7a and Crafty 18.15, I replaced Thinker 5.4d Inert and Crafty 23.4 with those two engines. The mean and standard deviation were then recalculated to determine how likely the measured percentage of matched moves would be if the engines were unrelated.
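
    The measure itself is simply the distance from the mean in standard deviation units. Here is a minimal sketch, using the base sample's mean and standard deviation for illustration; the recalculated values used in the table below differ slightly:

    # Number of standard deviations a pair's matched move percentage
    # lies above the mean of the unique-pair sample.
    proc zscore {pct mean sd} {
        expr { ($pct - $mean) / $sd }
    }

    # Crafty 18.15 vs Thinker 4.7a with the base sample statistics:
    # (68.78 - 45.16) / 2.86 = 8.26 standard deviations (the table
    # below reports 8.30 after recalculation).
    puts [format "%.2f" [zscore 68.78 45.16 2.86]]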

     

    Here are some additional notable pairs of engines (from the entire sample of 378 tested engines) that are not known to be from the same family, with the exception of the acknowledged Fruit/Toga derivatives and Strelka, which are included for illustrative purposes:

     

    Engine 1             Engine 2                 % Matched Moves   Standard Deviations from Mean

    Fruit 2.2.1          Loop 2007                     84.73            13.18
    Fruit 2.2.1          Loop 10.32f                   84.13            12.99
    Strelka 1.8          Rybka 1.0 Beta                73.11             9.49
    Fruit 2.1            TogaII 1.0                    72.43             9.24
    Fruit 2.1            Loop 2007                     70.81             8.97
    RobboLito 0.085d1    Houdini 1.00                  70.37             8.79
    Crafty 18.15         Thinker 4.7a                  68.78             8.30
    Rybka 2.1o           Naum 3.1                      68.66             8.12
    Onno 1.04            Loop 2007                     68.56             8.20
    Strelka 2.0 B        Rybka 1.0 Beta                68.52             7.88
    Houdini 1.5          Strelka 5.1                   67.43             7.72
    Fruit 2.1            Onno 1.04                     66.75             7.55
    Rybka 2.1o           Strelka 2.0 B                 65.96             7.01
    Fruit 2.0            Alaric 707                    64.57             6.83
    Fruit 2.1            Philou 3.14.1                 64.31             6.63
    Stockfish 1.51       Philou 3.51                   63.97             6.53
    Diablo 0.51          Alfil 7.6                     63.51             6.41
    Strelka 2.0 B        Thinker 5.4d Inert            63.41             6.24
    Strelka 2.0 B        Naum 4.2                      62.99             6.10
    Critter 1.4          Houdini 1.5                   61.91             5.87
    Strelka 2.0 B        Naum 3.1                      61.51             5.60
    Crafty 18.15         Aristarch 4.4.131             61.36             5.67
    Aristarch 4.4.131    List 5.12                     60.12             5.22
    RobboLito 0.085d1    Rybka 3                       59.81             5.08
    Fruit 2.0            Rotor 0.5                     59.64             5.10
    TogaII 1.2.1         Colossus 2008b                59.53             5.06
    TogaII 1.2.1         Naraku 1.31                   59.48             5.04
    Glaurung 1.2.1       Alfil 8.1.1 Optimized         59.47             5.05
    Rybka 3              Houdini 1.00                  59.37             4.91

     

    The following is an additional dendrogram which includes the exceptional pairs listed above:

     

     


     

    Interpretation of Dendrograms

     

    The dendrograms were made with the statistical package PAST. The algorithm used in constructing them is UPGMA (Unweighted Pair Group Method with Arithmetic mean). The following is a basic description of the algorithm, from which comes the explanation of how to interpret the dendrograms:

     

    The similarity tool produces a similarity matrix, whose entries are the matched move percentages for each pair of engines tested. I then constructed a distance matrix, whose entries were (100 - matched move percentage)/100; for example, 60% matched moves becomes a distance of 0.40. The distance matrix was then processed using the UPGMA algorithm.

     

    The first step in the process is to identify the pair with the smallest distance. This pair is then joined into a single cluster, and the distance matrix is recomputed, with the distance from any other engine to the joined pair being the arithmetic mean of the distances to each of its members. Then the next pair with the smallest distance is identified and joined, a new distance matrix is computed, and so on, until all engines have been connected.

     

    The distances in the dendrograms are interpreted in this manner:

     

    1. The distance of a node that connects two single engines is the distance as derived from the similarity matrix.
       
    2. The distance of a node that connects two groups of engines is the arithmetic mean of all the distances between pairs of engines from the two groups. For example, for two groups [A,B] and [C,D], the distance of the node connecting the two groups is defined as (d[A,C] + d[A,D] + d[B,C] + d[B,D])/4.
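
    For readers who want to experiment, here is a compact Tcl sketch of UPGMA (my own illustration, not the code used by PAST). It takes a dict of pairwise distances, keyed by alphabetically sorted engine-name pairs, and prints each join along with its distance; the example at the end also shows the (100 - matched move percentage)/100 conversion:

    # Illustrative UPGMA sketch -- mine, not the PAST implementation.
    # $d maps an alphabetically sorted pair of engine names to a distance.
    proc leafDist {d a b} {
        dict get $d [lsort [list $a $b]]
    }

    # Cluster distance: the arithmetic mean of all leaf-to-leaf distances.
    proc clusterDist {d c1 c2} {
        set sum 0.0
        foreach a $c1 {
            foreach b $c2 {
                set sum [expr { $sum + [leafDist $d $a $b] }]
            }
        }
        expr { $sum / ([llength $c1] * [llength $c2]) }
    }

    proc upgma {d engines} {
        # Start with one singleton cluster per engine.
        set clusters {}
        foreach e $engines { lappend clusters [list $e] }
        while { [llength $clusters] > 1 } {
            # Find the two clusters with the smallest mean distance.
            set bestDist 1e99
            for { set i 0 } { $i < [llength $clusters] } { incr i } {
                for { set j [expr { $i + 1 }] } { $j < [llength $clusters] } { incr j } {
                    set dd [clusterDist $d [lindex $clusters $i] [lindex $clusters $j]]
                    if { $dd < $bestDist } {
                        set bestDist $dd
                        set bi $i
                        set bj $j
                    }
                }
            }
            puts "join ([lindex $clusters $bi]) + ([lindex $clusters $bj]) at $bestDist"
            # Merge the two clusters, removing the later one first.
            set merged [concat [lindex $clusters $bi] [lindex $clusters $bj]]
            set clusters [lreplace $clusters $bj $bj]
            set clusters [lreplace $clusters $bi $bi $merged]
        }
    }

    # Example with four engines; the distances are derived from made-up
    # matched move percentages via (100 - percentage)/100.
    set pcts { {A B} 67 {A C} 55 {A D} 48 {B C} 56 {B D} 49 {C D} 60 }
    set d [dict create]
    foreach {pr pct} $pcts {
        dict set d [lsort $pr] [expr { (100.0 - $pct) / 100.0 }]
    }
    upgma $d {A B C D}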

     

    A short description of the UPGMA algorithm can be found here.

    A demonstration of the algorithm can be found here.

     

     


     

    Concerning The Use of Don Dailey's Similarity Tester

     

    The similarity tester is a small program, written in Tcl (Tool Command Language), that presents engines with 8238 positions and records the best move reported for each position. The commands encoded into the program are those of the UCI protocol.

     

    USAGE

     

    The program can be run (in Windows) from the command line or by way of a batch file. The following commands are recognized by the program:

     

    -t or -test
    -c or -config
    -l or -list
    -r or -report
    -m or -matrix

     

    The command sim03w64.exe -t MyProgram.exe {time in milliseconds - default 100 ms} is used to conduct the test with an engine.

     

    A better method is to use configuration files. Create a configuration file for each engine, and specify in it the values of all the UCI parameters you want to control; typically, those would be the hash size and the number of threads. Here is an example:

     

    exe = stockfish-21-64-ja.exe
    name = Stockfish 2.1
    Hash = 64
    Threads = 1

     

    Be sure that the name of the parameter exactly matches the name used by the engine.

    Then, the command sim03w64.exe -c {name of config file} {time in milliseconds - default 100 ms} is used to conduct the test.
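
    For example, a batch file might run the test for two engines and then print the results (the configuration file names here are my own invention; the times are the ones listed earlier):

    sim03w64.exe -c Stockfish211.cfg 31
    sim03w64.exe -c Fruit21.cfg 290
    sim03w64.exe -l
    sim03w64.exe -m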

     

    All data is collected in a file called similarity.data. When the commands -l, -r, or -m are used, the program reads the data and returns the list of engines tested (-l), the results for a particular engine (-r {# of engine as labeled in the list}), or the results for all engines in matrix form (-m).

     

    The results are the % of moves matched between pairs of engines.

     

    To test WinBoard engines, Odd Gunnar Malin's program WB2UCI must be used.

     

    HOW IT WORKS

     

    The similarity tool sends the command "go depth 50" and then sends "stop" after the user-specified time (or the default time) has elapsed. It then waits to receive the best move from the engine before initiating the next step in the loop. After all 8238 positions have been sent and the best moves received, the program sends "quit" and writes to the similarity.data file. Here is a snippet of the code:

     

     

    set setoptions [string trim $setoptions]

    set start 0

    # Launch the engine and put it into UCI mode.
    set fh [open "|$invoke" r+]
    fconfigure $fh -buffering line
    puts $fh "uci"

    # Read the engine's identification, appending the time and scale
    # settings to the reported name.
    while { 1 } {
        gets $fh s
        if { [string trim $s] == "uciok" } { break }
        if { [regexp {^id name (.*)$} $s dmy xx] } {
            if { $id == "" } {
                set id "$xx (time: $tme ms  scale: $scale)"
                break
            } else {
                set id "$id (time: $tme ms  scale: $scale)"
            }
        }
    }

    # Send any UCI option settings taken from the configuration file.
    if { $setoptions > "" } {
        puts $fh $setoptions
        after 250
    }

    # Apply the scale factor to the thinking time (minimum 1 ms).
    set tme [expr int(0.5 + $tme * $scale)]
    if { $tme < 1 } { set tme 1 }

    puts ""
    puts "    program: $id"
    puts "scaled time: $tme"

    # $rec accumulates the engine id followed by one move per position.
    set rec {}
    lappend rec $id

    puts $fh "isready"
    while { 1 } { gets $fh o ;  if { $o == "readyok" } { break } }

    set cc 0
    set e [expr $start + $TP]

    # Main loop: send each position, let the engine think for the
    # scaled time, stop it, and record its best move.
    for { set n $start } { $n < $e } { incr n } {

        # Print a simple progress indicator every 50 positions.
        if { ($n % 50) == 0 } {
            puts ""
            set perc [expr ($cc * 100.0) / ($TP * 1.0)]
            puts -nonewline [format "%5.1f percent  " $perc ]
        }
        puts -nonewline "."
        flush stdout
        incr cc

        puts $fh "position startpos moves [lindex $all $n]"
        puts $fh "go depth 50"

        after $tme
        puts $fh "stop"

        # Wait for the engine's reply and record the move.
        set mv "xxxx"
        while { 1 } {
            gets $fh s
            if { [regexp {^bestmove} $s] } {
                set mv [lindex $s 1]
                lappend rec $mv
                break
            }
        }
        puts $fh "isready"
        while { 1 } { gets $fh s ;  if { [regexp {^readyok} $s] } { break } }
    }
    # (remainder of the program, which writes the results to
    # similarity.data, omitted)

     

     

    EDITING PROGRAM

     

    The program, because it is written in Tcl, is not a typical .exe file. The application and the other necessary auxiliaries are wrapped together into a portable binary, which can be unwrapped. This means that the positions can be changed and the commands can be modified. To learn more, go to this site: http://wiki.tcl.tk/

     

    PROBLEMS

     

    Some engines will not conduct the test properly when given the command "go depth 50". Either they do not make full use of a core (when set to use 1 CPU) or they will not obey the "stop" command (or "exit" if using WB2UCI). In the first situation, using "go infinite" instead will usually cause the engine to make full use of the CPU, though unfortunately some engines still will not.

     

    In some cases, it appears that there is a lag before the engine starts thinking. In other cases, the engine will stop thinking after a small number of plies. Since at least some of these same engines do not display this behavior in Arena, I believe it may be a timing issue. In the second case, the best that can be done is to use "go depth {number}", where the number approximates the average depth the engine would search in the allotted time.

     

    In any case, one should check the CPU utilization of each engine while running the test.

     

    Also, some engines will crash while using the positions contained in the test. Sometimes it takes multiple runs to complete the test.

     

    In addition, some engines will not analyze, and thus will not conduct the test.

     

    It should also be noted that some engines, such as Stockfish, will ignore the "stop" command and continue to search until reaching a certain depth (~10 plies).

     

    Finally, some engines (notably Rybka 3 and later, RobboLito/IvanHoe, and Houdini) show surprisingly low self-similarity (~75%) when multiple runs of the test are compared. The typical engine shows 95+% self-similarity.

     



    As a Test To Detect Clones and Derivatives

     

    This tool, in conjunction with other tests, can be used to detect possible clones and derived engines. To be used most effectively, some preliminary work and some observations are needed. A database of unique engine pairs is needed to conduct comparisons.

     

    A result of 60% moves matched for a pair is meaningless without other pairwise results to compare the number to. Ideally, enough engines are used to construct the database so that the distribution of the matched move percentages is approximately normal.
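
    Computing the reference statistics from such a database is straightforward. A minimal Tcl sketch (mine, for illustration):

    # Sample mean and standard deviation of a list of matched move
    # percentages, one value per unique engine pair.
    proc meanSd {values} {
        set n [llength $values]
        set sum 0.0
        foreach v $values { set sum [expr { $sum + $v }] }
        set mean [expr { $sum / $n }]
        set ss 0.0
        foreach v $values { set ss [expr { $ss + ($v - $mean) * ($v - $mean) }] }
        list $mean [expr { sqrt($ss / ($n - 1)) }]
    }

    # Example: puts [meanSd {45.2 44.1 47.8 43.5}]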

     

    Also, it should be noted that the amount of time an engine has to think about each position has some effect on the moves chosen, and ultimately on the matched move percentages. One can choose to use a default time period, with some adverse effect on the accuracy of the test. Or one may choose an engine against which to calibrate the other engines' times.

     

    Once an engine is chosen and its thinking time (X) is set, then the other times can be determined by this formula: Y = X*(2^((Elo Diff)/120)).

     

    This formula works well for newer engines, but most likely less well for older engines. In any case, it does give more accurate matched move percentages than using a default time. For example, with X = 100 ms and an engine 240 Elo weaker than the calibration engine, Y = 100 * 2^(240/120) = 400 ms.

     

    An additional observation is that a minimum of 5 standard deviations should be used to judge whether a pair's percentage is beyond the norm. If 1000 unique engines (where unique means unrelated engines with distinct authors) is taken as an upper limit, then there are at most 999*1000/2 = 499,500 pairs of unique engines.

     

    4 standard deviations represents an event that occurs 1 time in approximately 31,600, or approximately 16 times in 499,500.

     

    5 standard deviations represents 1 time in 3,488,556, or approximately 0.14 times in 499,500.

     

    While not a guarantee of avoiding a false positive, the threshold of 5 standard deviations greatly reduces the chance of it occurring.

     

    The drawback of setting the false positive threshold so high is that more false negatives will occur (two similar engines would be deemed non-similar). However, there are several things to consider.

     

    1. The use of statistical methods assumes that the authors have access to a common pool of ideas, but that there are no interactions between authors/engines. In reality, authors/engines do interact.
        
    2. There are permissible methods by which one author can make his engine more similar to another engine. We have no standard for when some author goes too far. Thus, we have no way to determine an exact threshold.
        
    3. The need to avoid false accusation is greater than the need to determine authors who break the rules slightly. In other words, it is better to let lesser offenders slip through than to make accusations against innocent authors.
        
    4. This tool should not be used solely for determining derivatives and clones. Other methods should be used in conjunction with this tool. Ultimately, any accusation of cloning requires an examination of the code of the accused author.

     

    Sincerely,

    Adam Hair

     



    Downloads

     

    1. The complete work Similarity_Report (27.5 Mb)
        
    2. The compendious work Similarity_Report_Compendious (6.5 Mb)

     

    Copyright © 2012 Ed Schröder