Defining an appropriate fitness measure to rank our agents has proven difficult.
Initially we defined a variant of fitness sharing [8] by
giving points for doing better than average against a human player, and negative
points for doing worse than average. The fitness of agent *a* was defined
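as:

$$F(a) = \sum_{\{h \,:\, p(h,a) > 0\}} \left( \frac{s(h,a)}{p(h,a)} - \frac{s(h)}{p(h)} \right) \left( 1 - e^{-p(h)/10} \right)$$
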
where *s*(*h*,*a*) is the number of games lost minus the number of games
won (the *score*) by a human opponent *h* against *a*; *p*(*h*,*a*)
is the total number of games between the two; *s*(*h*) is the total
score of *h*; and *p*(*h*) is the number of games that *h*
has played. All games played are counted, not just those that belong to the
current generation. The factor $1 - e^{-p(h)/10}$
is
a confidence measure that devalues the average scores obtained against humans
who have played only a small number of games.
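
As an illustration, the fitness measure described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the code used in the experiment: the game log format (a list of `(human, agent, human_won)` tuples), the function name `fitness`, and the parameter `scale` are hypothetical, and the confidence factor is assumed to take the form 1 − exp(−*p*(*h*)/10). Every game is assumed to end in a win or a loss.

```python
import math
from collections import defaultdict


def fitness(games, agent, scale=10.0):
    """Fitness of `agent` relative to each human's average score.

    `games` is an iterable of (human, agent, human_won) tuples covering
    all games ever played, not only the current generation (assumed format).
    """
    s_h = defaultdict(int)   # s(h): losses minus wins of human h, all games
    p_h = defaultdict(int)   # p(h): total games played by human h
    s_ha = defaultdict(int)  # s(h, a): score of h against this agent only
    p_ha = defaultdict(int)  # p(h, a): games between h and this agent

    for human, played_agent, human_won in games:
        delta = -1 if human_won else 1  # score counts human losses as +1
        s_h[human] += delta
        p_h[human] += 1
        if played_agent == agent:
            s_ha[human] += delta
            p_ha[human] += 1

    total = 0.0
    for human, n in p_ha.items():  # only humans with p(h, a) > 0
        # Positive when the agent does better than average against h.
        relative = s_ha[human] / n - s_h[human] / p_h[human]
        # Assumed confidence factor: near 0 for humans with few games.
        confidence = 1.0 - math.exp(-p_h[human] / scale)
        total += relative * confidence
    return total
```

With this form, an agent that beats a human who otherwise wins most games earns positive points, but the contribution stays small until that human has played enough games for the average to be trusted.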

A second part of the experiment assayed a new definition of fitness, based on our statistical analysis of players' strengths. This problem is discussed in detail in section 3.6.