Consistency and Cautious Fictitious Play

Abstract: We study a variation of fictitious play, in which the probability of each action is an exponential function of that action’s utility against the historical frequency of opponents’ play. Regardless of the opponents’ strategies, the utility received by an agent using this rule is nearly the best that could be achieved against the historical frequency. Such rules are approximately optimal in i.i.d. environments, and guarantee nearly the minmax regardless of opponents’ behavior. Fictitious play shares these properties provided it switches “infrequently” between actions. We also study the long run outcomes when all players use consistent and cautious rules.


Introduction
There are three major views of why we might expect to see equilibrium in a game. The most traditional, introspective view has players study the rules closely and consider their opponents' motivations in order to calculate what strategy they should play. Evolutionary and learning models instead see equilibrium as the outcome of a process in which less than fully rational players grope for optimality over time. Evolutionary models focus on a population of players and on the non-modeled idea that the number of players playing actions that have historically been successful will increase over time at the expense of actions that have historically been less successful. Learning models take a more individualistic point of view, focusing on how an individual player might try to deduce from careful observation of opponents' past play how they will play in the future. The focus of this paper is on the issue of individual learning.
There are two types of questions that can be asked about particular learning rules. First, how well do they do? That is, how much utility do they generate in different environments? Second, what happens in a game if particular learning rules are used? The latter question has been the focus of a number of recent papers, including Fudenberg and Kreps (1993), Fudenberg and Levine (1993), Jordan (1993) and Young (1993), as well as the earlier literature on the process of fictitious play (Brown (1951), Shapley (1964), and so forth).
This paper focuses more on the former question: how well do learning rules do, and what are sensible criteria for evaluating the performance of a learning rule? We propose as desiderata for learning rules that they be "safe," meaning that they guarantee the player at least his minmax payoff, and "consistent," meaning that they should do at least as well as playing the best response to the empirical average of play if the opponents' play is given by independent draws from a fixed distribution. We then suggest that behavior rules should be not just consistent, but "universally consistent," meaning that the player should get at least the payoff of playing a best response to the empirical distribution whether or not the environment is in fact i.i.d. Such a universally consistent rule is both consistent and safe.
Standard fictitious play is consistent, but not safe. Our main result is that there is a very simple modification of fictitious play which is universally consistent and so both safe and consistent. We also show that fictitious play itself is consistent provided that it does not alternate "too quickly" between actions. In addition, we investigate the long-run consequences of both players using such rules.
We do not model the internal thought processes of the players, and instead phrase our conditions, assumptions, and results solely in terms of players' behavior. In particular, we will not make separate assumptions about how players update their beliefs on the one hand, and how they use their beliefs on the other. (However, the particular rules we construct can be interpreted as an "almost-best-response" to beliefs of the type used in fictitious play.) Consequently, the object of our analysis is the set of "behavior rules," by which we mean maps from observations to actions.
It should be emphasized that how well a behavior rule performs depends on the environment it is in. For example, consider the game of matching pennies: a single player must guess each period whether nature will choose "Heads" (H) or "Tails" (T). He earns a payoff of 1 if he guesses correctly, -1 otherwise. The rule "always guess H" will perform quite well if the environment is one in which nature always plays H. Of course in other environments, this rule will perform quite badly. Implicitly, behavior rules based on learning attempt to "learn" about the environment they are in, so that in the long run they perform well in a broad class of environments. Generally, this at least includes those environments that converge to long-run equilibrium.
One obvious question is whether there are behavior rules that perform well in the long run against all environments. If performing well means optimization against the true environment, it is well known that there can be no such rule. Indeed, Nachbar (1993a,b) has refined this result with a counter-example showing that no rule drawn from a sufficiently rich set of rules can be even approximately optimal against all rules in that set. One way to think about what is going on is to begin with Blackwell and Dubins' (1962) observation that a Bayesian optimizer will perform optimally in the long run against any environment to which his prior assigns positive probability. The problem shown by Nachbar is that such a Bayesian optimizer may, through his own behavior, generate behavior that did not receive positive weight in his own prior. That is, individuals may easily generate behavior more complicated than that which they contemplate as possible for their opponents. In particular, as a practical observation, in learning models in which learning rules fail to converge to an equilibrium, as for example in cobweb cycles, the behavior of agents may seem implausibly stupid.
In this paper we lower our sights somewhat, and look for rules that have sensible properties in all environments even though they are not asymptotically optimal in all environments. We begin with the observation that the world may be more complex than players contemplate in their models, and that players are aware of this. What then is a sensible criterion when a very complicated sequence of Heads and Tails has been observed? It may well be that the environment is generated by a complicated chaotic deterministic model, for example, but to figure this out may be difficult or impossible.

Hence, from the player's point of view, it may be sensible to view such a sequence as random, and to at least try to play optimally with respect to the frequencies of heads and tails. Put another way, a player may simply choose to ignore the order in which the observations occur, even though this information is potentially useful. This motivates our desideratum that the behavior rule be universally consistent, in the sense that the rule should (asymptotically) ensure that the player's realized average payoff is not much less than the payoff from playing the best response to the empirical distribution, uniformly over all possible environments.
If players know they are boundedly rational, they may also wish to allow for the possibility that they are playing against opponents who are cleverer than they are. One way that players might do this is to use only behavior rules that guarantee that their realized payoff is not much lower than their minmax payoff. It is fairly easy to see that any universally consistent rule will be "safe" in this sense, since the payoff from a best response to any distribution must be at least the minmax.
The "calibration" result of Foster and Vohra (1994) shows that universally consistent rules exist, but since the proof is existential, it does not indicate the forms that such rules might take. This paper shows that a particular randomized version of fictitious play, in which the probability of each action is an exponential function of its utility against the historical frequency (exponential fictitious play), is universally consistent. Moreover, such a policy can be implemented even in an extensive form game in which opponents' strategies are not observed.
Beyond this result, we explore the possible long-run outcomes when all players use rules that guarantee they do at least as well as playing a best response to the empirical frequency distribution. This gives rise to the notion of marginal best-response distributions, which are the only points such a learning process can pass through in the long run. We give some examples and results to show what these types of distributions are like.

The Model
We begin by considering a single agent. This agent repeatedly chooses a probability distribution α, called a mixed action, over a finite space of actions A. When the agent takes action a ∈ A and the outcome, drawn from an outcome space Y, is y ∈ Y, the agent receives a utility of u(a, y). We use ∆ to denote the space of probability distributions over a set. If α ∈ ∆(A) and γ ∈ ∆(Y), then the expected utility is also denoted u(α, γ). We say that α is an ε-best response to γ if u(α, γ) + ε ≥ u(α', γ) for all alternative mixed actions α'. If ε = 0, we refer simply to a best response.
Since we are considering a repeated situation, we define a history as a sequence of actions and outcomes h = (a_1, y_1, a_2, y_2, …, a_t, y_t). The number of actions (or outcomes) t(h) is called the length of the history. The history truncated by one period is (a_1, y_1, …, a_{t−1}, y_{t−1}). It is useful also to define the null history h_0 of zero length.
The space of all histories is denoted by H. The outcome frequency distribution of a non-null history is the empirical probability distribution over outcomes, and is denoted by γ(h).
The agent chooses a (mixed) behavior rule, which is a map from histories to probability distributions over actions, σ: H → ∆(A). An important example of a behavior rule is fictitious play: this requires a probability distribution over outcomes γ_0, called the prior, and an integer n_0, called the prior precision, such that σ(h) places weight only on actions that are best responses to (n_0 γ_0 + t(h) γ(h)) / (n_0 + t(h)), called the posterior. We will be particularly interested in rules that depend on the history only through the empirical distribution of outcomes, that is, rules of the form α: ∆(Y) → ∆(A), with the corresponding behavior rule given by σ(h) ≡ α(γ(h)). We call these stationary rules.
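As an illustration of the fictitious play rule just described, the following minimal sketch computes the posterior as the precision-weighted average of the prior and the empirical counts, and then the set of best responses to it. The function names, the dictionary representation of distributions, and the matching-pennies payoff used in the usage note are assumptions made for this example, not notation from the paper.

```python
from fractions import Fraction


def posterior(prior, n0, counts):
    """Posterior over outcomes: (n0 * prior + observed counts) / (n0 + t).

    prior  -- dict mapping each outcome to its prior probability
    n0     -- prior precision (weight given to the prior)
    counts -- dict mapping outcomes to the number of times observed
    """
    total = n0 + sum(counts.values())
    return {y: (n0 * prior[y] + counts.get(y, 0)) / total for y in prior}


def best_responses(actions, utility, gamma):
    """Return the actions maximizing expected utility against gamma."""
    expected = {
        a: sum(p * utility(a, y) for y, p in gamma.items()) for a in actions
    }
    best = max(expected.values())
    return [a for a in actions if expected[a] == best]


def matching_payoff(action, outcome):
    """Hypothetical matching-pennies payoff: +1 for a match, -1 otherwise."""
    return 1 if action == outcome else -1
```

For instance, with the assumed prior `{"H": Fraction(1, 4), "T": Fraction(3, 4)}` and `n0 = 1`, the best response before any observation is `"T"`; after a single observation of `"H"` the posterior puts weight 5/8 on heads and the best response becomes `"H"`.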
The agent is also faced with an unknown environment, which is a rule mapping histories to probability distributions over outcomes, ρ: H → ∆(Y). In our applications this environment will correspond to the behavior rules of other players. Notice that the outcome cannot depend on the current action taken by the agent. An important example of an environment is the i.i.d. environment, which is history independent: ρ(h) is the same distribution for every history h.
Our interest is in behavior rules that enable an agent to "learn" about an unknown environment. Our goal is to assess behavior rules by how well they perform. There are two performance criteria of interest: long-run performance, and the rate at which the learning behavior rule converges to this long-run payoff. In this paper we will focus solely on the long-run performance. Consequently our criterion for assessing performance will be the time-average payoff with a "long" time horizon. For any given behavior rule/environment pair, we may define a probability distribution over histories p(σ, ρ).
Definition 2.1: A behavior rule σ is ε-consistent if there exists a T̄ such that for any i.i.d. environment ρ and for any T ≥ T̄ there is a subset of histories of length T, H_T, with probability at least 1 − ε under p(σ, ρ), such that along every h ∈ H_T the time-average utility is at least max_a u(a, γ(h)) − ε. A behavior rule is consistent if it is ε-consistent for every positive ε.
In other words, a behavior rule is ε-consistent if in an i.i.d. environment it does about as well as playing a best response against the empirical distribution, or (equivalently for large T) the true probability distribution in that environment. Notice that this is a bit stronger than the usual notion of consistency in the statistics literature in that the rate of convergence here is independent of the environment ρ. However, for the multinomial distribution, it is well known that the rate of convergence is uniform, and the following proposition is immediate from Chebychev's inequality.
Proposition 2.1: If σ is a fictitious play behavior rule, then it is consistent.
The limitation of consistency is that there is no reason the agent should think that he is facing an i.i.d. environment. Indeed, if a game is played between agents who both use fictitious play, for most initial conditions on beliefs (those that do not begin at an exact equilibrium), the resulting environment will not be i.i.d. More to the point, a consistent behavior rule can be fooled quite badly by a clever opponent. One additional criterion beyond consistency that seems desirable is that of safety: a player should not get significantly less than his minmax payoff in the long run.

Definition 2.2:
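A behavior rule σ is ε-safe if there exists a T̄ such that for any environment ρ and for any T ≥ T̄ there is a subset of histories of length T, H_T, with probability at least 1 − ε under p(σ, ρ), such that along every h ∈ H_T the time-average utility is at least the minmax payoff min_γ max_a u(a, γ) less ε. A behavior rule is safe if it is ε-safe for every positive ε.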
Fictitious play is well known not to be safe. Suppose the game is matching pennies, with the agent trying to match the play of the environment. If fictitious play begins with a prior γ_0 that slightly favors tails and prior precision n_0 = 1, and the environment alternates deterministically between heads and tails, starting with heads, then fictitious play always plays the opposite of the environment, and the agent gets -1, considerably less than the minmax of 0. Indeed, fictitious play can fail in this way, not only against a clever opponent out to trick the agent, but also in a game in which all players use fictitious play: Fudenberg and Kreps (1993) give an example of this sort.
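The failure described above can be checked with a short simulation. This is a sketch under stated assumptions: the prior weight on heads (1/4) is a hypothetical value chosen only so that tails is the initial best response and no ties ever arise; it is not a value taken from the paper.

```python
def fictitious_play_vs_alternation(periods, prior_heads=0.25, n0=1.0):
    """Average payoff of fictitious play in matching pennies when the
    environment alternates H, T, H, T, ... starting with heads.

    prior_heads and n0 are illustrative assumptions (prior favors tails).
    """
    heads_weight = n0 * prior_heads          # prior pseudo-count for heads
    tails_weight = n0 * (1.0 - prior_heads)  # prior pseudo-count for tails
    total = 0
    for t in range(periods):
        outcome = "H" if t % 2 == 0 else "T"
        # The agent wants to match, so he guesses the more likely outcome.
        guess = "H" if heads_weight > tails_weight else "T"
        total += 1 if guess == outcome else -1
        # Update the posterior counts with the observed outcome.
        if outcome == "H":
            heads_weight += 1
        else:
            tails_weight += 1
    return total / periods
```

Under these assumptions the posterior always favors the outcome that was just observed, which is exactly the one the alternating environment will not play next, so the agent loses in every period.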
It is easy to see that fictitious play is not the only vulnerable rule. In particular, deterministic behavior rules can be exploited by an opponent who knows the rule, and chooses in each period an action that minimizes the agent's payoff given the action that the agent will play that period. (For example, if the agent uses a deterministic rule in matching pennies, consider the environment which always picks the exact opposite of the choice made by the agent.) This is essentially the point made by Oakes (1985).
There are obviously behavior rules that are safe: playing the maxmin every period is safe, for example. Unfortunately this does not have the minimal learning property of consistency. An obvious question is whether there is any behavior rule that is both safe and consistent. We will find such a behavior rule below, but first it is useful to define a property that combines both safety and consistency.
Definition 2.3: A behavior rule σ is ε-universally consistent if there exists a T̄ such that for any environment ρ and for any T ≥ T̄ there is a subset of histories of length T, H_T, with probability at least 1 − ε under p(σ, ρ), such that along every h ∈ H_T the time-average utility is at least max_a u(a, γ(h)) − ε.
For small ε, ε-universal consistency means doing more-or-less as well as playing a best response to the historical frequency distribution. In effect, the player ignores all information about the order in which the outcomes occur, and the extent to which they might be correlated with his own play.
There are potentially two problems with playing a best response to the frequency distribution. First, it ignores information about the way the agent's play influences the play of the environment. Suppose for example that the game is the Prisoner's Dilemma and the environment is one in which the opponent plays tit-for-tat. Then it is universally consistent to cheat all the time (this is a best response to any frequency distribution), but the opportunity to get a higher payoff by cooperating is being ignored.
However, ignoring causality in this way need not be troubling.In an environment where a large number of players interact anonymously either through market prices or through a random matching procedure, the actions of individual players can have essentially no effect on the future of prices or the population distribution of opponents.In such an environment there is no causality running from the agent's action to future outcomes, so such information is irrelevant.We refer to the learning problem in such an environment as a pure forecasting problem.
Note, though, that even with a pure forecasting problem, an agent who plays a best response to the empirical frequency distribution is ignoring the order in which observations occur: For example, in matching pennies if the environment alternates in a deterministic manner between H and T, a best response to the frequency distribution of 1/2-1/2 yields a payoff of 0. This, however, overlooks the opportunity to do even better by guessing correctly every period and getting a payoff of 1.
One desirable property of universally consistent learning rules is that they are safe.
Proof: This follows immediately from max_a u(a, γ(h)) ≥ min_γ max_a u(a, γ).
Because no deterministic rule is safe, no deterministic rule can be universally consistent. Moreover, simply adding an arbitrary form of noisy mixing does not make a behavior rule universally consistent. Again in matching pennies, suppose that the agent uses a modified version of fictitious play that assigns probability (1-ε) to the action that is the best response to the agent's posterior, and divides the remaining ε probability equally among the other actions. With the prior we gave earlier, and a "malicious" opponent, the agent still plays the "wrong" action with probability (1-ε) in each period, and so the agent's expected average payoff is only 2ε-1, which is less than the minmax value of 0.
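The 2ε-1 figure can be reproduced exactly with rational arithmetic. The sketch below assumes the alternating heads/tails environment as the "malicious" opponent and a hypothetical prior weight of 1/4 on heads; both are illustrative choices rather than values confirmed by the text.

```python
from fractions import Fraction


def noisy_fictitious_play_payoff(periods, epsilon,
                                 prior_heads=Fraction(1, 4), n0=1):
    """Expected average payoff of epsilon-noisy fictitious play in matching
    pennies against an environment alternating H, T, H, T, ...

    The best response to the posterior gets probability 1 - epsilon and the
    other action gets epsilon. prior_heads and n0 are assumed values.
    """
    heads_weight = n0 * prior_heads
    tails_weight = n0 * (1 - prior_heads)
    total = Fraction(0)
    for t in range(periods):
        outcome = "H" if t % 2 == 0 else "T"
        favored = "H" if heads_weight > tails_weight else "T"
        # Probability that the agent's realized guess matches the outcome.
        p_correct = 1 - epsilon if favored == outcome else epsilon
        total += p_correct - (1 - p_correct)  # +1 w.p. p_correct, else -1
        if outcome == "H":
            heads_weight += 1
        else:
            tails_weight += 1
    return total / periods
```

With `epsilon = Fraction(1, 10)` the function returns `Fraction(-4, 5)`, which is 2ε-1, for any horizon, since the favored action is wrong in every period.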
Intuitively, if the agent's play is very sensitive to small changes in the empirical average, then there are environments where the empirical average is converging, but the agent's play oscillates in such a way that the agent's realized payoff is lower than the best response to the limit of the empirical averages. Conversely, if the agent not only plays a mixed action, but also varies his mixing probabilities "smoothly" with changes in the empirical average, then (since the empirical average adjusts at the inverse of the sample size) the agent's play cannot oscillate wildly from period to period. This is the motivation for our restricting attention to smooth behavior rules in the next section, and also for Proposition 4.1, which shows that fictitious play performs well along histories where it exhibits "infrequent switches."
It is easy to give an existence proof showing that for every ε there is a behavior rule that is ε-universally consistent. The idea originates with Foster and Vohra (1993), who use the same idea to establish a stronger property called calibration, introduced by Dawid (1982). The idea is to consider a hypothetical perverse opponent whose objective is to choose a behavior rule that tricks the agent, in the sense that the universal consistency condition fails. This gives rise to a T-period zero sum game where the agent's payoff is the difference between his time-average utility and the payoff from a best response to the empirical distribution, and the opponent's payoff is the negative of this amount. The perverse opponent has a behavior rule that yields him the value of this zero sum game, and by the minmax theorem, the agent has a behavior rule that guarantees him this value. To calculate the value, we know that it is at least what the agent gets from playing any behavior rule against the perverse opponent's minmaxing behavior rule. In particular, the agent can in each period play a best response to the conditional distribution (given the history) of the perverse opponent's minmaxing behavior rule. This behavior rule for the agent yields approximately zero in a large sample by the weak law of large numbers. Thus we conclude that the value of the game is at worst −ε, where ε → 0 as T → ∞, which is the desired result. Note, though, that to actually find the desired behavior rule requires solving a dynamic stochastic zero sum game with a very long horizon, which is computationally impractical.
However, as we will see below, a very simple randomized and "smooth" variation of fictitious play has the desired property.

Cautious Fictitious Play
In light of our observations in the previous section, we are led to consider behavior rules where the agent's mixing probabilities depend smoothly on the empirical average. If the stationary behavior rule α is a smooth (twice continuously differentiable) ε-best response to the average, we say that it represents ε-cautious fictitious play. We will show that any such rule can be made ε'-universally consistent for any given ε' by taking ε small enough, if there are only two outcomes. (Remember that the outcomes correspond to the profile of opponents' actions in a game.) If there are more than two outcomes, then we cannot show that ε-cautious fictitious play is ε'-universally consistent even for very small ε. However, we can show that a particular variation on fictitious play called κ-exponential fictitious play is ε-universally consistent. A κ-exponential fictitious play is given by specifying fixed weights w_a > 0 and using the stationary rule
α(γ)[a] = w_a exp(κ u(a, γ)) / Σ_b w_b exp(κ u(b, γ)).
Notice that for fixed weights and κ sufficiently large, this scheme assures that the agent is playing an ε-best response to the historical average, so that this is indeed an ε-cautious fictitious play.
Proposition 3.1: (a) For all weights w_a and every ε' there exists a κ such that κ-exponential fictitious play is ε'-universally consistent.
(b) If there are only two outcomes, then for every ε' there exists an ε such that every ε-cautious fictitious play is ε'-universally consistent.
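To make the exponential rule concrete, here is a minimal sketch that plays each action with probability proportional to w_a exp(κ u(a, γ)) and evaluates it in matching pennies against an environment that alternates heads and tails. The function names, the uniform empirical distribution used before any data arrive, and the parameter values in the usage note are assumptions made for illustration only.

```python
import math


def exponential_rule(utilities, kappa, weights=None):
    """Mixed action with probability proportional to
    weights[a] * exp(kappa * utilities[a])."""
    if weights is None:
        weights = [1.0] * len(utilities)
    top = max(utilities)  # subtracting the max avoids overflow in exp
    scores = [w * math.exp(kappa * (u - top))
              for w, u in zip(weights, utilities)]
    norm = sum(scores)
    return [s / norm for s in scores]


def expected_payoff_vs_alternation(periods, kappa):
    """Expected time-average payoff in matching pennies when the outcome
    alternates H, T, H, T, ... and the agent uses the exponential rule
    against the empirical frequency of past outcomes."""
    heads = tails = 0
    total = 0.0
    for t in range(periods):
        n = heads + tails
        freq_heads = heads / n if n else 0.5  # assumption: uniform at start
        u_heads = 2.0 * freq_heads - 1.0      # payoff of guessing H
        u_tails = -u_heads                    # payoff of guessing T
        p_heads, p_tails = exponential_rule([u_heads, u_tails], kappa)
        outcome_is_heads = (t % 2 == 0)
        p_correct = p_heads if outcome_is_heads else p_tails
        total += 2.0 * p_correct - 1.0
        if outcome_is_heads:
            heads += 1
        else:
            tails += 1
    return total / periods
```

In this environment the best response to the empirical frequency earns about 0, and deterministic fictitious play can earn -1. With, say, `kappa = 10.0` and 10000 periods, the expected average payoff of the exponential rule is slightly negative but close to 0, because the probability of the "wrong" guess shrinks as the empirical frequency settles near 1/2.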
To prove the proposition, we use a method from stochastic approximation theory of approximating a system that involves averaging with a differential equation in virtual time. Fix σ, ρ, and consider the equation of motion for the time average of utility. From Lemma 3.2, the problem of universal consistency is reduced to the study of a differential equation. In the absence of the first two terms, this differential equation is stable, so that the distance between the optimized and actual payoff tends to be reduced. The second term is of order ε if α(γ) is an ε-best response to γ for all γ. If the first term were also small, it would follow that the solution to the differential equation would remain uniformly close to zero, which is the desired conclusion.
The first term is the product of the sensitivity of the payoff loss to the opponents' average play and the rate at which the average is changing; this rate can be viewed as the extent to which the opponent is trying to trick the agent. Since the exact best response and the smoothed response α may differ significantly, the payoff difference between them when being tricked may be quite large. (The fact that α is an ε-best response to γ only means that the payoff loss is small against distribution γ.)

The key idea of the proof is that the agent cannot be "substantially tricked" for a long time, as the exact and smoothed responses must be nearly the same, except in regions where several actions are nearly indifferent. To prove this we observe that over sufficiently short curves F does not change very much, and there is an obvious bound when α is ε-cautious fictitious play.
Proof: To show this we integrate (3.3) term by term. The first term is exact. The second term is bounded using the definition of ε-cautious fictitious play. The error in the remaining term is bounded by using the fundamental theorem of calculus and the mean value theorem, evaluated at some intermediate point t*. By Lemma 3.2 γ can be no greater than one in norm, and all remaining terms in (3.3) are also bounded independent of α. We conclude that F(t*) is bounded independent of α, so that the integral is of order τ² as desired. Note that the result would be completely trivial if not for the fact that O(τ²) means a uniform bound independent of α.

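Lemma 3.4: Under κ-exponential fictitious play, the integral depends on the path γ only through the endpoints of γ.
Proof: Differentiating the exponential rule with respect to γ gives a matrix of cross-partial derivatives, which is certainly symmetric.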
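The symmetry claim can be checked numerically. The sketch below treats the vector field with components Σ_a α(γ)[a] u(a, y), one for each outcome y, where α is the exponential rule, and approximates its Jacobian with respect to γ by central finite differences. The payoff matrix, weights, and evaluation point used in the usage note are arbitrary illustrative values, and the coordinates of γ are perturbed freely rather than along the simplex, which is an assumption of this check.

```python
import math


def exp_rule(u, gamma, kappa, w):
    """Exponential rule: probability of action a proportional to
    w[a] * exp(kappa * sum_y u[a][y] * gamma[y])."""
    scores = [
        w[a] * math.exp(kappa * sum(u[a][y] * gamma[y]
                                    for y in range(len(gamma))))
        for a in range(len(u))
    ]
    norm = sum(scores)
    return [s / norm for s in scores]


def field(u, gamma, kappa, w):
    """Vector with components sum_a alpha(gamma)[a] * u[a][y], one per y."""
    alpha = exp_rule(u, gamma, kappa, w)
    return [sum(alpha[a] * u[a][y] for a in range(len(u)))
            for y in range(len(gamma))]


def jacobian(u, gamma, kappa, w, step=1e-6):
    """Central finite-difference Jacobian of field() with respect to gamma."""
    n = len(gamma)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, down = list(gamma), list(gamma)
        up[j] += step
        down[j] -= step
        f_up = field(u, up, kappa, w)
        f_down = field(u, down, kappa, w)
        for i in range(n):
            jac[i][j] = (f_up[i] - f_down[i]) / (2.0 * step)
    return jac
```

For example, `jacobian([[2, 0, 1], [0, 1, 3], [1, 2, 0]], [0.2, 0.5, 0.3], 1.5, [1.0, 2.0, 0.5])` returns a matrix whose (i, j) and (j, i) entries agree up to finite-difference error. This is what one expects analytically: each entry equals κ times the covariance, under α(γ), of the payoffs against outcomes i and j.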
Finally, the integral over a straight line between the endpoints can be bounded (Lemma 3.5): for every δ and τ there exists an ε such that the bound holds whenever α is ε-cautious fictitious play.
In the two outcome case, let α be any ε-cautious fictitious play, and in the general case, let α be a κ-exponential fictitious play that is also ε-cautious. By virtue of Lemma 3.4, the restriction of the integral to a straight line is not a limitation, so this gives the further bound on F of F ≤ ε'/2. Now we may simply use Lemma 3.2 together with Chebychev's inequality to give the desired conclusion.
Since the use of exponential weights may seem somewhat mysterious, it may be useful to look back and see what role they played in the proof. The only use of the exponential weighting was in the proof of Lemma 3.4, where it was used to show that the derivative of the integrand of ∫_0^τ [∂u(α(γ̃), γ̃)/∂γ] (dγ̃/dt) dt was symmetric, and consequently that the integral itself is path independent. The remaining parts of the proof show that, to a good approximation, the loss from using the smoothed response α in place of the exact best response is made up of two parts: the loss from the fact that α is only an approximate best response to the historical average, and the "loss" from being tricked if the actual outcomes are not drawn from the historical average. This latter "loss" may actually be a gain, since the trick may actually favor the smoothed response over the exact best response, but in any case it is measured by the flow (3.4). Notice that the method of proof yields not only an upper bound on the loss, but a lower bound as well. Consequently if the integral of (3.4) is large, the loss will be large.
Moreover, if the integral fails to be path independent, there must be closed loops over which it has a non-zero integral: this integral will be positive in one direction and negative in the other.The implication is that a "tricky" opponent can create a continuing loss by moving γ repeatedly around the loop in the positive direction, and a continuing gain by moving it in the opposite direction.The greater the failure of path independence as measured by the size of this integral, the greater the potential loss or gain.
The conclusion we reach is that if we use the size of the integral (3.4) as a measure of the failure of symmetry, the greater the departure from symmetry, the greater the departure from universal consistency. On the other hand, we cannot conclude that a universally consistent strategy dominates an ε-cautious fictitious play, since the failure of path independence also guarantees that a "tricky" but "benevolent" opponent could actually provide a higher level of utility than merely the best response to the historical average.
Finally, we should add that the exponential weighting case is the only rule we know of that yields symmetry and so path independence.We do not know whether other such rules exist.

Fictitious Play
Before discussing learning in games in which different players are playing particular types of behavior rules, it will be helpful to establish a sufficient condition for fictitious play to be consistent.

Definition 4.1:
A behavior rule σ is ε-consistent against ρ if there exists a T̄ such that for any T ≥ T̄ there is a subset of histories of length T, H_T, with probability at least 1 − ε under p(σ, ρ), such that along every h ∈ H_T the time-average utility is at least max_a u(a, γ(h)) − ε.
Given a history h we define the frequency of switches η(h) to be the fraction of periods t in which a_t ≠ a_{t−1}.
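The frequency of switches is simple to compute from the sequence of actions in a history. In this sketch the function name is hypothetical, and dividing by the full length of the history (rather than by the number of consecutive pairs) is an assumption about the normalization.

```python
def switch_frequency(actions):
    """Fraction of periods t in which the action differs from the one
    taken in period t - 1 (normalized by the history length)."""
    if not actions:
        return 0.0
    switches = sum(
        1 for previous, current in zip(actions, actions[1:])
        if previous != current
    )
    return switches / len(actions)
```

For a history with actions H, H, T, T, H there are two switches in five periods. A rule whose switches become rarer as the history grows, as in the slowing cycles of fictitious play, drives this fraction toward zero.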

Definition 4.2:
A behavior rule σ exhibits infrequent switches against ρ if for every ε there exists a T̄ such that for any T ≥ T̄ there is a subset of histories of length T, H_T, with probability at least 1 − ε under p(σ, ρ), such that η(h) ≤ ε for every h ∈ H_T.
Proposition 4.1: If σ is fictitious play and exhibits infrequent switches against ρ, then for every ε > 0 it is ε-consistent against ρ.
Remark: This result has been independently obtained by Monderer, Samet, and Sela [1994]. To compare their result and ours, note that what we call "infrequent switches" they call "smooth," and that their "belief affirming process" are pairs (σ, ρ) such that each is consistent against the other.
Proof: Fictitious play plays a best response to the posterior beliefs formed by taking a weighted average of the empirical distribution γ(h) at the end of the previous period and the prior beliefs γ_0.
Fix the distribution over histories generated by (σ, ρ), and let h be any history to which this distribution assigns positive probability; in the following we will suppress the dependence on h to lighten the notation. Set T = t(h).
Define Û_T ≡ u(a(γ̂(h)), γ(h)), where γ̂(h) denotes the posterior beliefs and a(·) a best response, to be the payoff from playing a best response to posterior beliefs at the end of period T when the opponent's play is given by the empirical distribution γ(h), and define Ũ_T ≡ u(a(γ̂(h)), γ̂(h)), which is the payoff that the player "expects" to get when he believes the distribution of opponents' play is given by γ̂(h).
Also define U_T(U_0) recursively; this is the expected payoff that would result if the agent's action at each period t ∈ {1, …, T} is a best response to the end-of-period-t beliefs γ_t, averaged with an exogenous "initial utility" U_0. (Of course the agent does not have the information to actually implement this path; we use it only for an upper bound.) We will show inductively a chain of inequalities relating these payoffs. Suppose that the inequalities hold at date T − 1. From the definitions, the linearity of payoffs in the opponent's distribution, and the definition of U, the desired inequalities at date T follow directly from the inductive hypothesis.
This yields the conclusion of the proposition.
Obviously any behavior rule that is asymptotically the same as fictitious play has the same property.

Learning in Games
We turn now to a setting where a number of agents play each other in a game. We will assume that there are N agents i = 1, 2, …, N, and that each has an action space A_i.
The space of outcomes for agent i is simply the actions taken by opponents, Y_i = ×_{j≠i} A_j, and the payoff function is u_i.
We suppose that every agent is playing a universally consistent learning rule.
Denote distributions over action profiles by µ, with corresponding marginals over ×_{j≠i} A_j denoted by µ_{−i}. It is convenient also to denote the expected utility from the distribution as u_i(µ). Then to a good approximation in the long run each agent will be getting at least the same utility as he could get by playing a best response to the marginal empirical distribution of opponents' play. This motivates the following definition.
Definition 5.1: A (correlated) distribution µ has the marginal best-response property if for each agent i, max_{a_i} u_i(a_i, µ_{−i}) ≤ u_i(µ).
A behavior profile b specifies a behavior rule for each player i. Given behavior profile b, we can compute the resulting probability distribution p(b) over outcomes, and hence obtain probability distributions over the empirical distributions from period 1 to T for any T. Denote these empirical distributions by µ_T, and let ν_T denote the probability distribution over the µ_T. In general there is no reason to expect that the ν_T will converge, but since the space of measures on a compact set is compact (in the topology of weak convergence) we know the sequence will have accumulation points. Note moreover that these accumulation points need not be degenerate measures: for example, the long-run empirical distribution might take on one of two values, depending on the realization of play in the first period. However, if every player is using an ε-universally consistent rule, then except on a set of histories with probability ε the long-run empirical distribution must approximately have the marginal best-response property (Proposition 5.1).
In other words, if agents are universally consistent, in the long run we will see the empirical time average distribution over profiles move only within the set of distributions having the marginal best-response property. This suggests that it is of interest to understand how big the set of distributions having the marginal best-response property is and what it looks like. Of even greater importance is to understand the utilities that can arise from these distributions. We refer to such utility vectors as marginal best-response points. Moreover, if players actually use cautious fictitious play, and not some other universally consistent behavior rule, then the exact same method of proof that establishes that cautious fictitious play is universally consistent shows that players can do no more than ε better than playing a best response to the historical frequency. In this case Proposition 5.1 can be strengthened from marginal best-response to exact marginal best response. In other words, we may view the set of marginal best-response points as the set of asymptotic possibilities when players play some universally consistent behavior rules, and the set of exact marginal best-response points as the set when they play cautious fictitious play.
It is immediate from the definition that the set of correlated equilibria is a subset of the set of distributions with the marginal best-response property, while the set of Nash equilibria is a subset of the set of distributions with the exact marginal best-response property. We shall see below that the converses of these results are false.
In zero sum games we have a very quick result: since each player is getting at least the minmax, any distribution with the marginal best-response property must give each player at least (and so exactly) the value of the game. This in turn implies that the opponent must be playing a minmaxing behavior rule. In other words, the product of the marginals of µ is a Nash equilibrium.
Note however that this result cannot be strengthened to show that µ is actually a Nash equilibrium, that is, that play is independent. (This is true in 2x2 games.) Consider the following "Rock, Scissors and Paper" game

$$\begin{pmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}$$

The value of this zero sum game is 0, and the unique equilibrium point is (1/3,1/3,1/3).

Consider on the other hand the distribution over profiles given by

$$\begin{pmatrix} 1/9 & 0 & 2/9 \\ 0 & 2/9 & 1/9 \\ 2/9 & 1/9 & 0 \end{pmatrix}$$

It is easily checked that both marginals are (1/3,1/3,1/3), and since the matrix is symmetric, both players get an expected payoff of zero. In other words, this distribution has the exact marginal best-response property, but is not a Nash equilibrium.
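The claims in this example can be checked mechanically. The sketch below is only an illustration: it assumes the standard antisymmetric payoff matrix for the row player (actions ordered Rock, Scissors, Paper) and a symmetric distribution with entries 1/9, 0 and 2/9; the helper names are hypothetical.

```python
from fractions import Fraction as F

# Assumed row-player payoffs (Rock, Scissors, Paper); the column player
# receives the negative of each entry.
A = [[0, 1, -1],
     [-1, 0, 1],
     [1, -1, 0]]

# Assumed correlated distribution over profiles (a symmetric matrix).
mu = [[F(1, 9), F(0), F(2, 9)],
      [F(0), F(2, 9), F(1, 9)],
      [F(2, 9), F(1, 9), F(0)]]

n = 3
row_marginal = [sum(mu[i]) for i in range(n)]
col_marginal = [sum(mu[i][j] for i in range(n)) for j in range(n)]

# Expected payoff of the row player under the joint distribution.
u_joint = sum(A[i][j] * mu[i][j] for i in range(n) for j in range(n))

# Best payoff the row player could obtain against the column marginal.
u_best = max(sum(A[i][j] * col_marginal[j] for j in range(n)) for i in range(n))

# Play is independent only if mu is the product of its marginals.
is_product = all(mu[i][j] == row_marginal[i] * col_marginal[j]
                 for i in range(n) for j in range(n))

print(row_marginal, col_marginal, u_joint, u_best, is_product)
```

Under these assumptions the joint payoff equals the best payoff against the marginal, while the distribution is not a product of its marginals.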
Another interesting case to consider is the non-zero sum Shapley game (rows U, M, D; columns L, M, R)

$$\begin{pmatrix} 0,0 & 0,1 & 1,0 \\ 1,0 & 0,0 & 0,1 \\ 0,1 & 1,0 & 0,0 \end{pmatrix}$$

It has been shown by Shapley [1964] (see also Gaunersdorfer and Hofbauer [1994]) that in this game fictitious play cycles ever more slowly through (UM,DM,DL,ML,MR,UR).
Because switching between profiles drops in frequency to zero, the condition of Proposition 4.1 is satisfied, and fictitious play is consistent in this example. We conclude from Proposition 5.1 that when T is large, to a good approximation the empirical time average distribution of profiles (which never puts any weight on the diagonal) is always a distribution with the exact marginal best-response property. Obviously in this example there are many distributions with this property. Note moreover that this shows that the set of distributions that have the exact marginal best-response property is not a subset of the set of correlated equilibria, as it is known from Foster and Vohra (1993) that in the Shapley game utility remains bounded away from that at any correlated equilibrium.
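For readers who want to experiment with this example, the following is a minimal sketch of deterministic fictitious play. It assumes a Shapley payoff matrix chosen to be consistent with the cycle described above (rows U, M, D; columns L, M, R), starts play at (U, M), and breaks ties in favor of the lowest-indexed action; all of these are illustrative assumptions, and the tie-breaking rule in particular is arbitrary.

```python
# Assumed payoffs: P1[i][j] and P2[i][j] for row i in (U, M, D), column j in (L, M, R).
P1 = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
P2 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
# Player 2's payoffs indexed by own action first, then by player 1's action.
P2_BY_OWN_ACTION = [[P2[i][j] for i in range(3)] for j in range(3)]

def best_response(payoffs, opponent_counts):
    """Best response to the opponent's historical counts (ties: lowest index)."""
    scores = [sum(p * c for p, c in zip(row, opponent_counts)) for row in payoffs]
    return scores.index(max(scores))

def fictitious_play(periods, start=(0, 1)):
    """Return the empirical distribution over profiles after `periods` rounds."""
    counts1, counts2 = [0, 0, 0], [0, 0, 0]
    profile_counts = [[0] * 3 for _ in range(3)]
    a1, a2 = start
    for _ in range(periods):
        counts1[a1] += 1
        counts2[a2] += 1
        profile_counts[a1][a2] += 1
        # Each player best-responds to the other's empirical frequency so far.
        a1, a2 = (best_response(P1, counts2),
                  best_response(P2_BY_OWN_ACTION, counts1))
    return [[c / periods for c in row] for row in profile_counts]

freq = fictitious_play(1000)
for row in freq:
    print(row)
```

Printing `freq` for increasing horizons gives a rough picture of how the time average moves over profiles.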
This leaves the question of whether there are actually correlated equilibria that are not exact marginal best-responses. The following Battle-of-the-Sexes example shows there are:

$$\begin{pmatrix} 1,2 & 0,0 \\ 0,0 & 2,1 \end{pmatrix}$$

with rows U, D for player 1 and columns L, R for player 2. The distribution that places probability 1/2 on each of (U,L) and (D,R),

$$\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$$

is clearly a correlated equilibrium; indeed, it is a public randomization over Nash equilibria. Given the marginals, player 1 prefers to play D and player 2 prefers to play L. Each receives an expected utility of 1 against the marginal, compared with the correlated equilibrium payoff of 1.5.
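The arithmetic behind this comparison can be sketched as follows, assuming the distribution places weight 1/2 on each of (U,L) and (D,R); the variable names are illustrative only.

```python
from fractions import Fraction as F

# Payoffs with rows U, D (player 1) and columns L, R (player 2).
U1 = [[1, 0], [0, 2]]
U2 = [[2, 0], [0, 1]]

# Assumed correlated distribution: weight 1/2 on (U, L) and on (D, R).
mu = [[F(1, 2), F(0)], [F(0), F(1, 2)]]

row_marginal = [sum(mu[i]) for i in range(2)]
col_marginal = [mu[0][j] + mu[1][j] for j in range(2)]

# Payoffs under the correlated distribution itself.
u1_joint = sum(U1[i][j] * mu[i][j] for i in range(2) for j in range(2))
u2_joint = sum(U2[i][j] * mu[i][j] for i in range(2) for j in range(2))

# Payoff of each pure action against the opponent's marginal.
u1_vs_marginal = [sum(U1[i][j] * col_marginal[j] for j in range(2)) for i in range(2)]
u2_vs_marginal = [sum(U2[i][j] * row_marginal[i] for i in range(2)) for j in range(2)]

print(u1_joint, u2_joint, u1_vs_marginal, u2_vs_marginal)
```

Player 1's best action against the marginal is D (index 1) and player 2's is L (index 0), each yielding 1, which is strictly less than the 3/2 obtained under the correlated distribution.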
One crucial question is whether there are broad classes of games in which the marginal best-response property imposes no restrictions on payoffs, that is, in which the set of marginal best-response points is the entire socially feasible individually rational set.

Consider the generic case in which no pair of profiles yields exactly the same utility for all players. In this case extremal points in the socially feasible set can be achieved by only one distribution, which places all weight on a single profile. This implies that extremal points are marginal best-response points only if they are Nash equilibrium payoff vectors.
Combining this with the obvious fact that the set of marginal best-response points is closed yields the following proposition.
Proposition 5.2: Suppose that no pair of profiles yields exactly the same utility for all players. Then an extremal point that is not a Nash equilibrium is contained in an open set that has no marginal best-response points.
In other words, the set of marginal best-response points is bounded away from the extremal points.

Incomplete Observation
We wish to conclude by considering settings such as extensive form games and moral hazard models, in which the player does not actually observe the outcome y, but only a noisy signal that may depend on his own action. A useful example to have in mind is a two-period prisoner's dilemma. If the agent chooses cheat in the first period he will never learn how his opponent will respond to cooperation in the first period. We know, for example, from Fudenberg and Levine [1993] and Fudenberg and Kreps [1993] that in such models learning rules that do not experiment frequently may fail to learn a best response. However, cautious fictitious play experiments infinitely often, so it seems plausible that it could be modified to perform in a universally consistent manner, even with imperfect information.
We will consider the extreme case of the least amount of information that might be available to an agent about the outcome: we assume that the agent does not observe y but only his own utility u. Notice that exponential fictitious play requires only historical average utilities, and not actual observations of y. This motivates the definition of a κ-exponential fictitious play with respect to the utility rule $U_h(a)$ as the rule that plays each action a with probability proportional to $\exp(\kappa U_h(a))$. If $U_h(a)$ is asymptotically the same as $u(a, \gamma(h))$, this rule will have properties identical to those of κ-exponential fictitious play.
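As a rough illustration of an exponential rule driven by utility estimates alone, the sketch below maps a list of per-action estimates to choice probabilities proportional to the exponential of κ times each estimate. The function name and the sample numbers are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice that leaves the probabilities unchanged.

```python
import math

def exponential_rule(utility_estimates, kappa):
    """Probabilities proportional to exp(kappa * estimate) for each action."""
    top = max(utility_estimates)
    # Shifting by the maximum avoids overflow without changing the ratios.
    weights = [math.exp(kappa * (u - top)) for u in utility_estimates]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical estimates for three actions.
probs = exponential_rule([1.0, 2.0, 0.5], kappa=3.0)
print(probs)
```

Every action receives strictly positive probability for finite κ, which is what allows the rule to keep experimenting; at κ = 0 the rule is uniform, and larger κ concentrates weight on the action with the highest estimate.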
Consider a long period over which two actions are played with (approximately) fixed positive probabilities. Since the probabilities of the actions are fixed and positive, the frequency of outcomes conditional on either of the two actions is approximately the same over this period. Notice that this would not be the case if the action probabilities were time dependent: time dependent outcome frequencies can then cause the conditional frequencies to differ between the two actions.

Since each action has the same conditional frequency of outcomes, the only issue is the appropriate assignment of weights to the observations. If we update utility by weighting observations in inverse proportion to the likelihood that the action is taken, then asymptotically the utility average corresponding to each action is based on the same underlying frequency. In other words, if we use this inverse-probability-weighted updating rule, then universal consistency is achieved despite the fact that only the agent's own utility is observed.
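One way to implement the weighting described above is a running average in which each observed utility is divided by the probability with which the chosen action was played. The sketch below is an illustration under that assumption and is not necessarily the exact updating rule intended in the text; the function name and the two-period history are hypothetical.

```python
def update_estimates(estimates, t, action, utility, probs):
    """Return the period-t estimates (t is 1-indexed).

    Each estimate is a running average of utility / probability in periods
    where that action was played and of 0 otherwise, so observations of
    rarely played actions receive proportionally larger weight.
    """
    updated = []
    for a, old in enumerate(estimates):
        observation = utility / probs[a] if a == action else 0.0
        updated.append(old + (observation - old) / t)
    return updated

# Hypothetical history: both actions are played with probability 1/2.
estimates = [0.0, 0.0]
estimates = update_estimates(estimates, 1, action=0, utility=1.0, probs=[0.5, 0.5])
estimates = update_estimates(estimates, 2, action=1, utility=2.0, probs=[0.5, 0.5])
print(estimates)  # [1.0, 2.0]
```

In this toy history each action is observed once, and the weighted averages recover the utilities 1 and 2 even though each action was played in only half of the periods.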

Appendix A: Proof of Lemma 3.2
Lemma 3.2 follows directly from the linearity of the differential equation and Lemmas A.1 and A.2 below.
Lemma A.1: If α is smooth, then for any δ > 0, there exists T such that for any for some piecewise linear curve γ connecting $\gamma(h)$ and $\gamma(h')$ with $\tau = \log\bigl(t(h')/t(h)\bigr)$ and γ ≤ 1.
Proof: A standard weak law of large numbers calculation using Chebychev's inequality shows that Similarly for payoffs we have Next, we turn to the movement of the empirical distribution itself. We have

Proof: Follows from the fact that α is locally constant and changes only at points of indifference to γ. This is the virtual time analog of Proposition 4.1, and the interested reader may wish to refer to the proof of that proposition in the text. See also Monderer, Samet and Sela (1994). Let s be the total length of time over the subinterval of length τ a in which The desired result now follows by choosing a ≤ δ τ / Φ a and ε δ ≤ a .
line makes use of the fact that the per-period utility function is bilinear. To find a continuous virtual time approximation, consider a piecewise Lipschitz function α: Y in the space of probability measures over outcomes. The curve γ should be thought of as a continuous time approximation to the time average $\gamma_t$. Let F α γ , ~ be a solution to the differential equation analog of (3. To avoid having to keep track of inessential constants that depend only on the payoffs, we use the order notation. We say that a family of bounded random variables $\tilde r(x)$ is of order x, written $\tilde r(x) = O(x)$, if there are constants B, b independent of x and α such that $E|\tilde r(x)|^2 \le Bx$ if $x \le b$. Lemma 3.2: For any smooth α, δ > 0 and ( ) arg max ( , ) in Appendix A. Note that the conclusion of the lemma uses the fact that solutions to the differential equation (3.2) have the property that F from the linearity of the payoffs in α.) path independence is immediate. In the higher dimensional case, the result will follow provided D u

Figure 1

ε) the marginal best response property. Passing to the limit yields the following proposition.

Proposition 5.1: Consider a sequence of ε-universally consistent behavior rules with ε converging monotonically to 0, and let $T(\varepsilon) \to \infty$ be such that each $T(\varepsilon)$ is greater than the $T(\varepsilon)$ in the definition of universal consistency. For each ε, let $\nu_\varepsilon$ denote the probability distribution over empirical distributions from periods 1 to $T(\varepsilon)$ induced by the associated ε-universally consistent behavior profile. If $\nu^*$ is any accumulation point (in the topology of weak convergence) of the $\nu_\varepsilon$, then $\nu^*$ assigns probability 1 to distributions with the marginal best-response property.

For every δ and τ there exists an ε such that if α is ε-fictitious play ∂ Φ be the payoff matrix with elements Φ ay u a y = ( a is the vector of all components except a and Φ payoff difference between a and b against γ. Consequently we may write Because γ is restricted to varying along a straight line, the set on which a particular action is a best response is a (connected) subinterval. Consequently we may break the integral up into an integral over subintervals along which one action remains a best response. Since there are at most as many such subintervals as there are actions, it suffices to prove the desired bound separately in each such subinterval. Let a be the best response over some such subinterval: for γ in this subinterval either b is a best response also, in which case Φ words, if b is played with significant probability over a long subinterval where a is a best response, it must yield essentially the same payoff against γ * as a.