Evolution, as trial-and-error based learning methods, usually relies on the repeatability of an experience: Different behavioral alternatives are tested and compared with each other. But agents acting on real environments may not be able to choose which experience to live. Instead, the environment provides varying initial conditions for each trial.
In competitive games for example, it is difficult to compare players with each other if they are not able to choose their opponents. The analysis methodology we adopted for the Tron experiment (section 3.4.2) is a statistics-based approach to solving this problem.
In a coevolutionary environment, the Red Queen Effect  makes it difficult to evaluate progress, since the criterion for evaluation of one species is the other, and vice versa. A higher number of wins does not necessarily imply better performance.
Similar problems arise when one tries to compare the performances of past and present players. A well-known strategy for evaluating coevolutionary progress in presence of the Red Queen effect is to take a sample set, an advanced generation for example, and use them to evaluate all players [24,109]. This is impossible here: we cannot recreate the behavior of humans who played in the past.
Some fixed agents could conceivably be kept in the population for evaluation purposes, but even if one or a few agents were to be present in all generations, most people would play against them only a few times, yielding a measure of low confidence. At the onset of the experiment, we were not willing to sacrifice performance, nor slow down the evolutionary pace by keeping fixed losers inside the population (if they were winners, they would not have to be kept alive artificially, but without an oracle we could not choose them in advance).
Evaluating fitness in coevolution is a closely related problem. The most basic way to assign fitness to players in a competitive/coevolutionary environment is to sum up all wins [4,65,6]. More advanced is the use of fitness sharing strategies [8,75,118]. Different researchers have tried to reduce the number of games to be played in each generation: large savings can be obtained by matching players against a sample instead of the whole population, ``finding opponents worth beating'' [123,119]. The assumption, however, that one can choose the opponents, could not be upheld in our case, where human opponents come and go at will, and an entirely different approach to scoring was needed as well.
The Tron experiment assayed a fitness sharing-inspired fitness function (eq. 3.1 on page ). We knew that different agents would play against some of the same and some different humans, so simply adding up all wins would not suffice. Instead we compared winning ratios: according to this equation, agents get positive points when they do better than average against a human, and negative points for doing worse than average. The more experienced the human is, the more valuable those points are.
This function was relatively successful in finding good Tron agents, but had problems that we did not foresee. Over time, a strong group of agents formed that were reliably better than average, thus surviving for many generations. As these agents had seen hundreds of humans over their history, and were better than average, even though not necessarily the best, they had too many points to be challenged by newer ones.
The need for a more accurate evaluation of performance in coevolution was thus twofold: not only did we wish to study the evolution of the experiments, comparing today's and yesterday's humans and robots; we were also looking for a better measure to further evolve the artificial population in the future. After two years of Tron we had new insights that prompted us to reconfigure the system:
Some of these long-term survivors with high fitness are visible on fig. 3.6 as long vertical lines. Fig. 3.19 shows the relationship between fitness and RS strength. Agents' RS are plotted in the y axis, their fitness on the x axis. Graph (b) shows that there is a main cloud of points with an approximate linear correlation between fitness and strength; there is however (graph a) an important group of agents which deviate from the ``main sequence'', with higher fitness than they should. Graph (c) on this figure has a column showing the RS values of the top 100 agents with respect to fitness and the top 100 agents strength-wise. We conclude that with the paired comparisons method we have found a better algorithm for choosing our top players.