Defining an appropriate fitness measure to rank our agents has proven difficult.
In principle, we defined a variant of fitness sharing by
giving points for doing better than average against a human player, and negative
points for doing worse than average. The fitness of agent a was defined as

$$F(a) = \sum_{\{h \,:\, p(h,a) > 0\}} \left( \frac{s(h,a)}{p(h,a)} - \frac{s(h)}{p(h)} \right) \left( 1 - e^{-p(h)/10} \right)$$

where s(h,a) is the score of a human opponent h against a, that is, the number of games h lost minus the number of games h won; p(h,a) is the total number of games between the two; s(h) is the total score of h against all agents; and p(h) is the total number of games that h has played. All games played are counted, not just those that belong to the current generation. The factor 1 - e^{-p(h)/10} is a confidence measure that devalues the average scores obtained against humans who have played only a small number of games.
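As an illustration of how this measure could be computed from a log of games, the following is a minimal sketch rather than the code used in the experiment. It assumes each game is recorded as a hypothetical tuple (human, agent, score), with score +1 when the human lost, -1 when the human won, and 0 otherwise. It also assumes that the confidence weight takes the saturating form 1 - exp(-p(h)/k); the function name `fitness_scores` and the parameter `k` are illustrative choices.

```python
import math
from collections import defaultdict


def fitness_scores(games, k=10.0):
    """Return {agent: fitness} from an iterable of (human, agent, score).

    score is +1 if the human lost the game, -1 if the human won
    (assumed encoding). k sets how fast the confidence weight
    saturates with the number of games a human has played
    (assumed form: 1 - exp(-p(h) / k)).
    """
    s_ha = defaultdict(int)  # s(h,a): score of h against a
    p_ha = defaultdict(int)  # p(h,a): games between h and a
    s_h = defaultdict(int)   # s(h): total score of h
    p_h = defaultdict(int)   # p(h): total games played by h

    for human, agent, score in games:
        s_ha[(human, agent)] += score
        p_ha[(human, agent)] += 1
        s_h[human] += score
        p_h[human] += 1

    fitness = defaultdict(float)
    # Only pairs that actually met appear in p_ha, so p(h,a) > 0 here.
    for (human, agent), played in p_ha.items():
        # How much better a did against h than the average agent did.
        relative = s_ha[(human, agent)] / played - s_h[human] / p_h[human]
        # Down-weight humans with few games overall.
        confidence = 1.0 - math.exp(-p_h[human] / k)
        fitness[agent] += relative * confidence
    return dict(fitness)
```

Under these assumptions, an agent that beats a given human more often than that human is beaten on average receives a positive contribution from that human, and the contribution shrinks toward zero when the human has played only a few games in total.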
A second part of the experiment tested a new definition of fitness, based on our statistical analysis of players' strengths. This problem is discussed in detail in Section 3.6.