Learning from Collective Behavior

Michael Kearns
Computer and Information Science
University of Pennsylvania
[email protected]

Jennifer Wortman
Computer and Information Science
University of Pennsylvania
[email protected]

Abstract

Inspired by longstanding lines of research in sociology and related fields, and by more recent large-population human subject experiments on the Internet and the Web, we initiate a study of the computational issues in learning to model collective behavior from observed data. We define formal models for efficient learning in such settings, and provide both general theory and specific learning algorithms for these models.

1 Introduction

Collective behavior in large populations has been a subject of enduring interest in sociology and economics, and a more recent topic in fields such as physics and computer science. There is consequently now an impressive literature on mathematical models for collective behavior in settings as diverse as the diffusion of fads or innovation in social networks [10, 1, 2, 18], voting behavior [10], housing choices and segregation [22], herding behaviors in financial markets [27, 8], Hollywood trends [25, 24], critical mass phenomena in group activities [22], and many others. The advent of the Internet and the Web has greatly increased the number of both controlled experiments [7, 17, 20, 21, 8] and open-ended systems (such as Wikipedia and many other instances of "human peer-production") that permit the logging and analysis of detailed collective behavioral data. It is natural to ask if there are learning methods specifically tailored to such models and data.

The mathematical models of the collective behavior liter-

population after observing their interactions during a training phase of polynomial length. We assume that each agent i in a population of size N acts according to a fixed but unknown strategy ci drawn from a known class C. A strategy probabilistically maps the current population state to the next state or action for that agent, and each agent's strategy may be different. As is common in much of the literature cited above, there may also be a network structure governing the population interaction, in which case strategies may map the local neighborhood state to next actions.

Learning algorithms in our model are given training data of the population behavior, either as repeated finite-length trajectories from multiple initial states (an episodic model), or in a single unbroken trajectory from a fixed start state (a no-reset model). In either case, they must efficiently (polynomially) learn to accurately predict or simulate (properties of) the future behavior of the same population. Our framework may be viewed as a computational model for learning the dynamics of an unknown Markov process — more precisely, a dynamic Bayes net — in which our primary interest is in Markov processes inspired by simple models for social behavior.

As a simple, concrete example of the kind of system we have in mind, consider a population in which each agent makes a series of choices from a fixed set over time (such as what restaurant to go to, or what political party to vote for). Like many previously studied models, we consider agents who have a desire to behave like the rest of the population (because they want to visit the popular restaurants, or want to vote for "electable" candidates). On the other hand, each agent may also have different and unknown intrinsic preferences over the choices as well (based on cuisine and decor, or the actual policies of the candidates).
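To make the trajectory-generation process concrete, the following minimal Python sketch (our own illustration, not code from the paper; all function names are ours) draws a T-step collective trajectory when each agent i is given a strategy that probabilistically maps the current joint state to a distribution over that agent's next action.

```python
import random

def sample_trajectory(strategies, initial_state, T, rng=random.Random(0)):
    """Draw a T-trajectory: strategies[i] maps the current joint state
    (a tuple with one action per agent) to a dict {action: probability}
    giving agent i's distribution over its next action."""
    state = tuple(initial_state)
    trajectory = [state]
    for _ in range(T):
        next_state = []
        for strategy in strategies:
            dist = strategy(state)            # c_i applied to the current joint state
            actions, probs = zip(*dist.items())
            next_state.append(rng.choices(actions, weights=probs)[0])
        state = tuple(next_state)             # all agents update simultaneously
        trajectory.append(state)
    return trajectory
```

Each strategy here is an arbitrary callable; the crowd affinity and aversion strategies discussed later are special cases of this interface.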
We consider models in ature differ from one another in important details, such as the which each agent balances or integrates these two forces in extent to which individual agents are assumed to act accord- deciding how to behave at each step [12]. Our main question ing to traditional notions of rationality, but they generally is: Can a learning algorithm watching the collective behavior share the significant underlying assumption that each agent’s of such a population for a short period produce an accurate current behavior is entirely or largely determined by the re- model of their future choices? cent behavior of the other agents. Thus the collective behav- The assumptions of our model fit nicely with the litera- ior is a social phenomenon, and the population evolves over ture cited in the first paragraph, much of which indeed pro- time according to its own internal dynamics — there is no poses simple stochastic models for how individual agents re- exogenous “Nature” being reacted to, or injecting shocks to act to the current population state. We emphasize from the the collective. outset the difference between our interests and those com- In this paper, we introduce a computational theory of mon in multiagent systems and learning in games. In those learning from collective behavior, in which the goal is to fields, it is often the case that the agents themselves are accurately model and predict the future behavior of a large acting according to complex and fairly general learning al- gorithms (such as Q-learning [26], no-regret learning [9], large) class, but is otherwise unknown. The learner’s ulti- fictitious play [3], and so on), and the central question is mate goal is not to discover each individual agent strategy whether and when the population converges to particular, per se, but rather to make accurate predictions of the collec- “nice” states (such as Nash or correlated equilibria). In con- tive behavior in novel situations. trast, while the agent strategies we consider are certainly “adaptive” in a reactive sense, they are much simpler than 2.1 Agent Strategies and Collective Trajectories general-purpose learning algorithms, and we are interested We now describe the main components of our framework: in learning algorithms that model the full collective behavior no matter what its properties; there is no special status given State Space. At each time step, each agent i is in some • either to particular states nor to any notion of convergence. state si chosen from a known, finite set of size K. S Thus our interest is not in learning by the agents themselves, We often think of K as being large, and thus want al- but at the higher level of an observer of the population. gorithms whose running time scales polynomially in K Our primary contributions are: and other parameters. We view si as the action taken by agent i in response to the recent population behavior. The introduction of a computational model for learning The joint action vector ~s N describes the current • from collective behavior. global state of the collective.∈ S The developmentof some general theory for this model, Initial State Distribution. We assume that the initial • including a polynomial-time reduction of learning from • population state ~s 0 is drawn according to a fixed but collective behavior to learning in more traditional, unknown distribution P over N . During training, the single-target I.I.D. 
settings, and a separation between learner is able to see trajectoriesS of the collective behav- efficient learnability in collective models in which the ior in which the initial state is drawn from P , and as in learner does and does not see all intermediate popula- many standard learning models, must generalize with tion states. respect to this same distribution. (We also consider a no-reset variant of our model in Section 2.3.) The definition of specific classes of agent strategies, • including variants of the “crowd affinity” strategies Agent Strategy Class. We assume that each agent’s • sketched above, and complementary “crowd aversion” strategy is drawn from a known class of (typically C classes. probabilistic) mappings from the recent collective be- havior into the agent’s next state or action in . We S Provably efficient algorithms for learning from collec- mainly consider the case in which ci probabilisti- • tive behavior for these same classes. cally maps the current global state ~s into∈ C agent i’s next state. However, much of the theory we develop ap- The outline of the paper is as follows. In Section 2, we plies equally well to more complex strategies that might introduce our main model for learning from collective be- incorporate a longer history of the collective behavior havior, and then discuss two natural variants. Section 3 in- on the current trajectory, or might depend on summary troduces and motivates a number of specific agent strategy statistics of that history. classes that are broadly inspired by earlier sociological mod- els, and provides brief simulations of the collective behaviors Given these components, we can now define what is meant they can generate. Section 4 provides a general reduction of by a collective trajectory. learning from collective behavior to a generalized PAC-style model for learning from I.I.D. data, which is used subse- Definition 1 Let ~c N be the vector of strategies for the ∈ C quently in Section 5, where we give provably efficient algo- N agents, P be the initial state distribution, and T 1 be an ≥ rithms for learning some of the strategy classes introduced in integer. A T -trajectory of ~c with respect to P is a random Section 3. Brief conclusions and topics for further research variable ~s 0, , ~s T in which the initial state ~s 0 N are given in Section 6. is drawnh according· · · toiP , and for each t 1, ,T∈, S the t t ∈ { · · · } component si of the joint state ~s is obtained by applying t−1 2 The Model the strategy ci to ~s . (Again, more generally we may also allow the strategies ci to depend on the full sequence In this section we describe a learning model in which the ~s 0,...,~s t−1, or on summary statistics of that history.) observed data is generated from observations of trajectories (defined shortly) of the collective behavior of N interacting Thus, a collective trajectory in our model is simply a agents. The key feature of the model is the fact that each Markovian sequence of states that factors according to the agent’s next state or action is always determined by the recent N agent strategies — that is, a dynamic Bayes net [19]. Our actions of the other agents, perhaps combined with some in- interest is in cases in which this Markov process is generated trinsic “preferences” or behaviors of the particular agent. As by particular models of social behavior, some of which are we shall see, we can view our model as one for learning cer- discussed in Section 3. 
tain kinds of factored Markov processes that are inspired by models common in sociology and related fields. 2.2 The Learning Model Each agent may follow a different and possibly proba- We now formally define the learning model we study. In bilistic strategy. We assume that the strategy followed by our model, learning algorithms are given access to an oracle each agent is constrained to lie in a known (and possibly (~c, P, T ) that returns a T -trajectory ~s 0, , ~s T of OEXP h · · · i ~c with respect to P . This is thus an episodic or reset model, behavior if there exists an algorithm A such that for any in which the learner has the luxury of repeatedly observing population size N 1, any ~c N , any time horizon T , the population behavior from random initial conditions. It any distribution P over≥ N , and∈ any C ǫ> 0 and δ > 0, given S is most applicable in (partially) controlled, experimental set- access to the oracle EXP(~c, P, T ), algorithm A runs in time tings [7, 17, 20, 21, 8] where such “population resets” can polynomial in N, TO, dim( ), 1/ǫ, and 1/δ, and outputs a C be implemented or imposed. In Section 2.3 below we de- polynomial-time model Mˆ such that with probability at least fine a perhaps more broadly applicable variant of the model 1 δ, ε(QMˆ ,Q~c) ǫ. in which resets are not available; the algorithms we provide − ≤ can be adapted for this model as well (Section 5.3). We now discuss two reasonable variations on the model The goal of the learner is to find a generative model that we have presented. can efficiently produce trajectories from a distribution that is arbitrarily close to that generated by the true population. 2.3 A No-Reset Variant ˆ 0 Thus, let M(~s ,T ) be a (randomized) model output by a The model above assumes that learning algorithms are given 0 learning algorithm that takes as input a start state ~s and access to repeated, independent trajectories via the oracle time horizon T , and outputs a random T -trajectory, and let , which is analogous to the episodic setting of rein- ˆ OEXP QMˆ denote the distribution over trajectories generated by M forcement learning. As in that field, we may also wish to when the start state is distributed according to P . Similarly, consider an alternative “no-reset” model in which the learner let Q~c denote the distribution over trajectories generated by has access only to a single, unbroken trajectory of states gen- EXP(~c, P, T ). Then the goal of the learning algorithm is to erated by the Markov process. To do so we must formulate O find a model Mˆ making the distance ε(Q ,Q~c) between an alternative notion of generalization, since on the one hand, L1 Mˆ QMˆ and Q~c small, where the (distribution of the) initial state may quickly become ir- relevant as the collective behavior evolves, but on the other, ε(Q ,Q~c) Mˆ ≡ the state space is exponentially large and thus it is unrealistic 0 T 0 T to expect to model the dynamics from an arbitrary state in Q ˆ ( ~s , , ~s ) Q~c( ~s , , ~s ) . | M h · · · i − h · · · i | polynomial time. h~s 0,··· ,~s T i X One natural formulationallows the learner to observe any A couple of remarks are in order here. First, note that polynomially long prefix of a trajectory of states for training, we have defined the output of the learning algorithm to be and then to announce its readiness for the test phase. If ~s is a “black box” that simply produces trajectories from initial the final state of the training prefix, we can simply ask that states. 
Of course, it would be natural to expect that this black the learner output a model Mˆ that generates accurate T -step box operates by having good approximations to every agent trajectories forward from the current state ~s. In other words, strategy in ~c, and using collective simulations of these to pro- Mˆ should generate trajectories from a distribution close to duce trajectories, but we choose to define the output Mˆ in a the distribution over T -step trajectories that would be gener- more general way since there may be other approaches. Sec- ated if each agent continued choosing actions according to ond, we note that our learning criteria is both strong (see his strategy. The length of the training prefix is allowed to be below for a discussion of weaker alternatives) and useful, polynomial in T and the other parameters. in the sense that if ε(QMˆ ,Q~c) is smaller than ǫ, then we can While aspects of the general theory described below are sample Mˆ to obtain O(ǫ)-good approximations to the expec- particular to our main (episodic) model, we note here that the tation of any (bounded) function of trajectories. Thus, for in- algorithms we give for specific classes can in fact be adapted stance, we can use Mˆ to answer questions like “What is the to work in the no-reset model as well. Such extensions are expected number of agents playing the plurality action after discussed briefly in Section 5.3. T steps?” or “What is the probability the entire population is playing the same action after T steps?” (In Section 2.4 be- 2.4 Weaker Criteria for Learnability low we discuss a weaker model in which we care only about We have chosen to formulate learnability in our model us- one fixed outcome function.) ing a rather strong success criterion — namely, the ability to Our algorithmic results consider cases in which the agent (approximately) simulate the full dynamics of the unknown strategies may themselves already be rather rich, in which Markov process induced by the population strategy ~c. In or- case the learning algorithm should be permitted resources der to meet this strong criterion, we have also allowed the commensurate with this complexity. For example, the crowd learner access to a rather strong oracle, which returns all in- affinity models have a number of parameters that scales with termediate states of sampled trajectories. the number of actions K. More generally, we use dim( ) to There may be natural scenarios, however, in which we denote the complexity or dimension of ; in all of our imag-C are interested only in specific fixed properties of collective ined applications dim( ) is either the VCC dimension for de- behavior, and thus a weaker data source may suffice. For in- terministic classes, or one· of its generalizations to probabilis- stance, suppose we have a fixed, real-valued outcome func- tic classes (such as pseudo-dimension [11], fat-shattering di- tion F (~s T ) of final states (for instance, the fraction of agents mension [15], combinatorial dimension [11], etc.). playing the plurality action at time T ), with our goal being We are now ready to define our learning model. to simply learn a function G that maps initial states ~s 0 and a time horizon T to real values, and approximately minimizes Definition 2 Let be an agent strategy class over actions C 0 T . 
We say that is polynomially learnable from collective E 0 G(~s ,T ) E T [F (~s )] S C ~s ∼P − ~s   where ~s T is a random variable that is the final state of a gate of h at time T = D, pairs of the form ~s 0, F (~s T ) T -trajectory of ~c from the initial state ~s 0. Clearly in such a provide exactly the same data as the PAC modelh for h underi model, while it certainly would suffice, there may be no need D, and thus must be equally hard. to directly learn a full dynamical model. It may be feasible For the polynomial learnability of from collective be- to satisfy this criterion without even observing intermediate havior, we note that is clearly PACC learnable, since it is states, but only seeing initial state and final outcome pairs just Boolean combinationsC of 1 or 2 inputs. In Section 4 ~s 0, F (~s T ) , closer to a traditional regression problem. we give a general reduction from collective learning of any h It is noti difficult to define simple agent strategy classes agent strategy class to PAC learning the class, thus giving the for which learning from only ~s 0, F (~s T ) pairs is provably claimed result. intractable, yet efficient learningh is possiblei in our model. This idea is formalized in Theorem 3 below. Here the popu- Conversely, it is also not difficult to concoct cases in lation forms a rather powerful computational device map- which learning the full dynamics in our sense is intractable, ping initial states to final states. In particular, it can be but we can learn to approximate a specific outcome func- tion from only ~s 0, F (~s T ) pairs. Intuitively, if each agent thought of as a circuit of depth T with “gates” chosen from h i , with the only real constraint being that each layer of the strategy is very complex but the outcome function applied to circuitC is an identical sequence of N gates which are applied final states is sufficiently simple (e.g., constant), we cannot to the outputs of the previous layer. Intuitively, if only initial but do not need to model the full dynamics in order to learn states and final outcomes are provided to the learner, learn- to approximate the outcome. ing should be as difficult as a corresponding PAC-style prob- We note that there is an analogy here to the distinc- lem. On the other hand, by observing intermediate state vec- tion between direct and indirect approaches to reinforcement tors we can build arbitrarily accurate models for each agent, learning [16]. In the former, one learns a policy that is spe- which in turn allows us to accurately simulate the full dy- cific to a fixed reward function without learning a model of namical model. next-state dynamics; in the latter, at possibly greater cost, one learns an accurate dynamical model, which can in turn Theorem 3 Let be the class of 2-input AND and OR gates, be used to compute good policies for any reward function. and one-input NOTC gates. Then is polynomially learnable For the remainder of this paper, we focus on the model as we from collective behavior, but thereC exists a binary outcome formalized it in Definition 2, and leave for future work the function F such that learning an accurate mapping from investigation of such alternatives. start states ~s 0 to outcomes F (~s T ) without observing inter- mediate state data is intractable. 3 Social Strategy Classes

Proof: (Sketch) We first sketch the hardness construction. Before providing our general theory, including the reduction Let be any class of Boolean circuits (that is, with gates in from collective learning to I.I.D. learning, we first illustrate ) thatH is not polynomially learnable in the standard PAC and motivate the definitions so far with some concrete exam- model;C under standard cryptographic assumptions, such a ples of social strategy classes, some of which we analyze in class exists. Let D be a hard distribution for PAC learning detail in Section 5. . Let h be a Boolean circuit with R inputs, S gates, 3.1 Crowd Affinity: Mixture Strategies andH depth ∈D. H To embed the computation by h in a collective problem, we let N = R + S and T = D. We introduce The first class of agent strategies we discuss are meant to an agent for each of the R inputs to h, whose value after the model settings in which each individual wishes to balance initial state is set according to an arbitrary AND, OR, or NOT their intrinsic personal preferences with a desire to “follow gate. We additionally introduce one agent for every gate g the crowd.” We broadly refer to strategies of this type as in h. If a gate g in h takes as its inputs the outputs of gates crowd affinity strategies (in contrast to the crowd aversion g′ and g′′, then at each time step the agent corresponding to strategies discussed shortly), and examine a couple of natural g computes the corresponding function of the states of the variants. agents corresponding to g′ and g′′ at the previous time step. As a motivating example, imagine that there are K Finally, by we always have the Nth agent be the restaurants, and each week, every member of a population agent corresponding to the output gate of h, and define the chooses one of the restaurants in which to dine. On the one output function as F (~s) = sN . The distribution P over ini- hand, each agent has personal preferences over the restau- tial states of the N agents is identical to D on the R agents rants based on the cuisine, service, ambiance, and so on. On corresponding to the inputs of h, and arbitrary (e.g., inde- the other, each agent has some desire to go to the currently pendent and uniform) on the remaining S agents. “hot” restaurants — that is, where many or most other agents Despite the fact that this construction introduces a great have been recently. To model this setting, let be the set of N S deal of spurious computation (for instance, at the first time K restaurants, and suppose ~s is the population state ∈ S step, many or most gates may simply be computing Boolean vector indicating where each agent dined last week. We can functions of the random bits assigned to non-input agents), it summarize the population behavior by the vector or distribu- K is clear that if gate g is at depth d in h, then at time d in the tion f~ [0, 1] , where fa is the fraction of agents dining collective simulation of the agents, the corresponding agent in restaurant∈ a in ~s. Similarly, we might represent the per- has exactly the value computed by g under the inputs to h sonal preferences of a specific agent by another distribution K (which are distributed according to D). Because the outcome ~w [0, 1] in which wa represents the probability this agent function is the value of the agent corresponding to the output would∈ attend restaurant a in the absence of any information 100 100 20


Figure 1: Sample simulations of the (a) crowd affinity mixture model; (b) crowd affinity multiplicative model; (c) agent affinity model. Horizontal axis is population state; vertical axis is simulation time. See text for details. about what the population is doing. One natural way for the havior. Broadly speaking, it is such properties we would like agent to balance their preferences with the population behav- a learning algorithm to model and predict from sufficient ob- ior would be to choose a restaurant according to the mixture servations. distribution (1 α)f~ + α ~w for some agent-dependent mix- ture coefficient−α. Such models have been studied in the 3.2 Crowd Affinity: Multiplicative Strategies sociology literature [12] in the context of formation. One possible objection to the crowd affinity mixture strate- We are interested in collective systems in which every gies described above is that each agent can be viewed as agent i has some unknown preferences ~wi and mixture co- randomly choosing whether to entirely follow the population efficient αi, and in each week t chooses its next restaurant distribution (with probability 1 α)orto entirely follow their t − according to (1 αi)f~ + αi ~wi, which thus probabilisti- personal preferences (with probability α) at each time step. − cally yields the next population distribution f~t+1. How do A more realistic model might have each agent truly combine such systems behave? And how can we learn to model their the population behavior with their preferences at every step. macroscopic properties from only observed behavior, espe- Consider, for instance, how an American citizen might cially when the number of choices K is large? alter their anticipated presidential voting decision over time An illustration of the rich collective behavior that can al- in response to recent primary or polling news. If their first ready be generated from such simple strategies is shown in choice of candidate — say, an Independent or Libertarian Figure 1(a). Here we show a single but typical 1000-step candidate — appears over time to be “unelectable” in the simulation of collective behavior under this model, in which general election due to their inability to sway large num- N = 100 and each agent’s individual preference vector ~w bers of Democratic and Republican voters, a natural and typ- puts all of its weight on just one of 10 possible actions (rep- ical response is for the citizen to shift their intended vote to resented as colors); this action was selected independently at whichever of the front-runners they most prefer or least dis- random for each agent. All agents have an α value of just like. In other words, the low popularity of their first choice 0.01, and thus are selecting from the population distribution causes that choice to be dampened or eradicated; unlike the 99% of the time. Each row shows the population state at a mixture model above, where weight α is always given to per- given step, with time increasing down the horizontal axis of sonal preferences, here there may remain no weight on this the image. The initial state was chosen uniformly at random. candidate. One natural way of defining a general such class of It is interesting to note the dramatic difference between strategies is as follows. As above, let f~ [0, 1]K, where α = 0 (in which rapid convergence to a common color ∈ is certain) and this small value for α; despite the fact that fa is the fraction of agents dining in restaurant a in the current state ~s. 
Similar to the mixture strategies above, let almost all agents play the population distribution at every K ~wi [0, 1] be a vector of weights representing the intrinsic step, revolving horizontal waves of near-consensus to dif- ∈ ferent choices are present, with no final convergence in preferences of agent i over actions. Then define the prob- ability that agent i plays action a to be fa wi,a/Z(f,~ ~wi), sight. The slight “personalization” of population-only be- · havior is enough to dramatically change the collective be- where the normalizing factor is Z(f,~ ~wi) = fb wi,b. b∈S · P Thus, in such multiplicative crowd affinity models, the prob- For a fixed agent, such a strategy can be modeled by a ability the agent takes an action is always proportional to the weight vector ~w [0, 1]N , with one weight for each agent product of their preference for it and its current popularity. in the populationrather∈ than each action. We define the prob- Despite their similar motivation, the mixture and mul- ability that this agent takes action a if the current global state is N to be proportional to w . In this class of tiplicative crowd affinity strategies can lead to dramatically ~s i:si=a i different collective behavior. Perhaps the most obvious dif- strategies,∈ S the strength of the agent’s desire to take the same ference is that in the mixture case, if agent i has a strong action as agent i is determined byP how large the weight wi preference for action a there is always some minimum prob- is. The overall behavior of this agent is then probabilistically ability (αiwi,a) they take this action, whereas in the mul- determined by summing over all agents in the fashion above. tiplicative case even a strong preference can be eradicated In Figure 1(c), we show a single but typical simulation, from expression by small or zero values for the popularity again with N = 100 but now with a much shorter time hori- fa. zon of 200 steps and a much larger set of 100 actions. All In Figure 1(b), we again show a single but typical 1000- agents have random distributions as their preferences over step, N = 100 simulation for the multiplicative model other agents; this model is similar to traditional diffusion dy- in which agent’s individual preference distributions ~w are namics in a dense, random (weighted) network, and quickly chosen to be random normalized vectors over 10 actions. converges to global consensus. The dynamics are now quite different than for the additive We leave the analysis of this strategy class to future work, crowd affinity model. In particular, now there is never near- but remark that in the simple case in which K =2, learning consensus but a gradual dwindling of the colors represented this class is closely related to the problem of learning per- in the population — from the initial full diversity down to 3 ceptrons under certain noise models in which the intensity of colors remaining at approximately t = 100, until by t = 200 the noise increases with proximity to the separator [5, 4] and there is a stand-off in the population between red and light seems at least as difficult. green. Unlike the additive models, colorsdie out in the popu- lation permanently. There is also clear vertical structure cor- 3.5 Incorporating Network Structure responding to strong conditional preferences of the agents Many of the social models inspiring this work involve a net- once the stand-off emerges. work structure that dictates or restricts the interactions be- tween agents [18]. 
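To give a sense of how simulations like those in Figure 1(a) and 1(b) can be reproduced, here is a minimal sketch of the crowd affinity mixture and multiplicative strategies; the parameter defaults follow the text (N = 100, K = 10, alpha = 0.01 with single-peaked preferences for the mixture variant, random normalized preferences for the multiplicative variant), but the code itself and its function names are our own.

```python
import numpy as np

def simulate_crowd_affinity(N=100, K=10, T=1000, alpha=0.01,
                            multiplicative=False, seed=0):
    """Simulate crowd affinity dynamics (mixture or multiplicative variant)."""
    rng = np.random.default_rng(seed)
    if multiplicative:
        # random normalized preference vector per agent
        w = rng.random((N, K))
        w /= w.sum(axis=1, keepdims=True)
    else:
        # each agent's preference concentrated on one random action
        w = np.zeros((N, K))
        w[np.arange(N), rng.integers(K, size=N)] = 1.0
    state = rng.integers(K, size=N)               # uniform random initial state
    history = [state.copy()]
    for _ in range(T):
        f = np.bincount(state, minlength=K) / N   # current population distribution
        if multiplicative:
            p = f * w                             # f_a * w_{i,a}, normalized per agent
            p /= p.sum(axis=1, keepdims=True)
        else:
            p = (1 - alpha) * f + alpha * w       # mixture strategy
        # each agent samples its next action independently
        state = np.array([rng.choice(K, p=p[i]) for i in range(N)])
        history.append(state.copy())
    return np.array(history)
```

Plotting each row of the returned history as a line of colored cells reproduces the qualitative patterns described in the text: drifting waves of near-consensus in the mixture case, and the gradual, permanent extinction of actions in the multiplicative case.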
It is natural to ask if the strategy classes 3.3 Crowd Aversion and Other Variants discussed here can be extended to the scenario in which each agent is influenced only by his neighbors in a given network. It is easy to transform the mixture or multiplicative crowd Indeed, it is straightforward to extend each of the strategy affinity strategies into crowd aversion strategies — that is, in classes introduced in this section to a network setting. For which agents wish to balance or combine their personal pref- example, to adapt the crowd affinity and aversion strategy erences with a desire to act differently than the population at classes, it suffices to redefine fa for each agent i to be the large. This can be accomplished in a variety of simple ways. fraction of agents in the local neighborhoodof agent i choos- For instance, if f~ is the current distributions over actions in ing action a. To adapt the agent affinity and aversion classes, the population, we can simply define a kind of “inverse” to it is necessary only to require that wj = 0 for every agent j the distribution by letting ga = (1 fa)/(K 1), where outside the local neighborhood of agent . By making these − − i K 1= (1 fb) is the normalizing factor, and apply- simple modifications, the learning algorithms discussed in − b∈S − ing the strategies above to ~g rather than f~. Now each agent Section 5 can immediately be applied to settings in which a P exhibits a tendency to “avoid the crowd”, moderated as be- network structure is given. fore by their own preferences. Of course, there is no reason to assume that the entire 4 A Reduction to I.I.D. Learning population is crowd-seeking, or crowd-avoiding; more gen- Since algorithms in our framework are attempting to learn to erally we would allow there to be both types of individuals model the dynamics of a factored Markov process in which present. Furthermore, we might entertain other transforms each component is known to lie in the class , it is natural of the population distribution than just g above. For in- a to investigate the relationship between learningC just a single stance, we might wish to still consider crowd affinity, but to strategy in and the entire Markovian dynamics. One main first “sharpen” the distribution by replacing each f with f 2 a a concern mightC be effects of dynamic instability — that is, and normalizing, then applying the models discussed above that even small errors in models for each of the compo- to the resulting vector. This has the effect of magnifying the N nents could be amplified exponentially in the overall popula- attraction to the most popular actions. In general our algo- tion model. rithmic results are robust to a wide range of such variations. In this section we show that this can be avoided. More precisely, we prove that if the component errors are all small 3.4 Agent Affinity and Aversion Strategies compared to 1/(NT )2, the population model also has small In the two versions of crowd affinity strategies discussed error. Thus fast rates of learning for individual compo- above, an agent has personal preferences over actions, and nents are polynomially preserved in the resulting population also reacts to the current population behavior, but only in an model. aggregate fashion. 
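The crowd aversion and sharpening transforms described above are easy to state concretely; the following sketch (our own code) computes the "inverted" distribution g used by crowd aversion strategies and the "sharpened" distribution obtained by squaring and renormalizing.

```python
import numpy as np

def invert_distribution(f):
    """Crowd aversion transform: g_a = (1 - f_a) / (K - 1)."""
    K = len(f)
    return (1.0 - f) / (K - 1)

def sharpen_distribution(f):
    """Sharpening transform: replace each f_a with f_a**2 and renormalize."""
    g = f ** 2
    return g / g.sum()

# Example: f = [0.5, 0.3, 0.2] inverts to [0.25, 0.35, 0.40], so the least
# popular action becomes the most attractive to a crowd-averse agent.
```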
An alternative class of strategies that we To show this, we give a reduction showing that if a class call agent affinity strategies instead allows agents to prefer to of (possibly probabilistic) strategies is polynomially learn- agree (or disagree) in their choice with specific other agents. ableC (in a sense that we describe shortly) from I.I.D. data, then is also polynomially learnable from collective behav- 4.2 A General Reduction C ior. The key step in the reduction is the introduction of the Multiple analogs of the definition of learnability in the PAC experimental distribution, defined below. Intuitively, the ex- model have been proposed for distribution learning settings. perimental distribution is meant to capture the distribution The probabilistic concept model [15] presents a definition over states that are encountered in the collective setting over for learning conditional distributions over binary outcomes, repeated trials. Polynomial I.I.D. learning on this distribu- while later work [13] proposes a definition for learning un- tion leads to polynomial learning from the collective. conditional distributions over larger outcome spaces. We combine the two into a single PAC-style model for learn- 4.1 A Reduction for Deterministic Strategies ing conditional distributions over large outcome spaces from In order to illustrate some of the key ideas we use in the I.I.D. data as follows. more general reduction, we begin by examining the simple case in which the number of actions K = 2 and and each Definition 5 Let be a class of probabilistic mappings from strategy c is deterministic. We show that if is polyno- an input ~x toC an output y where is a finite set. We mially learnable∈C in the (distribution-free) PAC model,C then say that is∈polynomially X learnable∈ Y if thereY exists an algo- is polynomially learnable from collective behavior. C rithm A Csuch that for any c and any distribution D over In order to exploit the fact that is PAC learnable, it is , if A is given access to an∈C oracle producing pairs ~x, c(~x) first necessary to define a single distributionC over states on Xwith x distributed according to D, then for any ǫ,δh > 0, al-i which we would like to learn. gorithm A runs in time polynomial in 1/ǫ, 1/δ, and dim( ) and outputs a function cˆ such that with probability 1 δ, C − Definition 4 For any initial state distribution P , strategy vector ~c, and sequence length T , the experimental distribu- E~x∼D Pr(c(~x)= y) Pr(ˆc(~x)= y) ǫ . tion DP,~c,T is the distribution over state vectors ~s obtained  | − | ≤ y∈Y by querying (~c, P, T ) to obtain ~s 0, , ~s T , choos- X OEXP h · · · i   ing t uniformly at random from 0, ,T 1 , and setting We could have chosen instead to require that the expected t. { · · · − } ~s = ~s KL divergence between c and cˆ be bounded. Using Jensen’s inequality and Lemma 12.6.1 of Cover and Thomas [6], it We denote this distribution simply as D when P , ~c, and T is simple to show that if the expected KL divergence be- are clear from context. Given access to the oracle , we OEXP tween two distributions is bounded by ǫ, then the expected can sample pairs ~s, ci(~s) where ~s is distributed according h i 1 distance is bounded by 2 ln(2)ǫ. Thus any class that is to D using the following procedure: polynomiallyL learnable under this alternate definition is also polynomially learnable underp ours. 1. Query (~c, P, T ) to obtain ~s 0, , ~s T . OEXP h · · · i Theorem 6 For any class , if is polynomially learnable 2. Choose t 0, ,T 1 uniformly at random. 
∈{ · · · − } according to Definition 5, thenC C is polynomially learnable from collective behavior. C 3. Return ~s t,s t+1 . h i i Proof: This proof is very similar in spirit to the proof of the If is polynomially learnable in the PAC model, then by reduction for deterministic case. However, several tricks are definition,C with access to the oracle , for any δ,ǫ > 0, EXP needed to deal with the fact that trajectories are now random it is possible to learn a model cˆ suchO that with probability i variables, even given a fixed start state. In particular, it is no 1 (δ/N), − longer the case that we can argue that starting at a given start ǫ state and executing a set of strategies that are “close to” the Pr~s D[ˆci(~s) = ci(~s)] ∼ 6 ≤ NT true strategy vector usually yields the same full trajectory we would have obtained by executing the true strategies of each in time polynomial in N, T , 1/ǫ, 1/δ, and the VC dimension agent. Instead, due to the inherent randomness in the strate- of using the sampling procedure above; the dependence C gies, we must argue that the distribution over trajectories is on N and T come from the fact that we are requesting a similar when the estimated strategies are sufficiently close to confidence of 1 (δ/N) and an accuracy of ǫ/(TN). We − the true strategies. can learn a set of such strategies cˆi for all agents i at the cost To make this argument, we begin by introducing the idea of an additional factor of N. of sampling from a distribution P1 using a “filtered” version Consider a new sequence ~s 0, , ~s T returned by the of a second distribution P2 as follows. First, draw an out- oracle . By the union bound,h · · with · probabilityi 1 δ, OEXP − come ω Ω according to P2. If P1(ω) P2(ω), output the probability that there exists any agent i and any t ω. Otherwise,∈ output ω with probability P≤(ω)/P (ω), and t t ∈ 1 2 0, ,T 1 , such that cˆi(~s ) = ci(~s ) is less than ǫ. with probability 1 P (ω)/P (ω), output an alternate action { · · · − } t6 t 1 2 If this is not the case (i.e., if cˆi(~s ) = ci(~s ) for all i and − drawn according to a third distribution P3, where t) then the same sequence of states would have been reached 0 if we had instead started at state ~s and generated each ad- P1(ω) P2(ω) t t t−1 P3(ω)= − ′ ′ ditional state ~s by letting s = ci(~s ). This implies that ′ ′ ′ i ω :P2(ω ) P2(ωP), and P3(ω)=0 otherwise. It is easy to verify that the output of this filtering algo- in which each agent has an intrinsic set of preferences over rithm is indeed distributed according to P1. Additionally, actions, but simultaneously would prefer to choose the same notice that the probability that the output is “filtered” is actions chosen by other agents. Similar techniques can be applied to learn the crowd aversion strategies. P1(ω) 1 ~ P2(ω) 1 = P2 P1 1 . (1) Formally, let f be a vector representing the distribution − P2(ω) 2|| − || over current states of the agents; if ~s is the current state, then ω:P2(ωX)>P1(ω)   for each action a, fa = i : si = a /N is the fractionof the As in the deterministic case, we make use of the experi- population currently choosing|{ action}| a. (Alternately, if there mental distribution D as defined in Definition 4. If is poly- is a network structure governing interaction among agents, C nomially learnable as in Definition 5, then with access to the fa can be defined as the fraction of nodes in an agent’s local f oracle EXP, for any δ,ǫ > 0, it is possible to learn a model neighborhood choosing action a.) 
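For concreteness, the three-step sampling procedure that drives the reduction can be written as follows; the oracle argument is a stand-in (our own name) for the trajectory oracle, assumed to return a list of T+1 joint states.

```python
import random

def sample_pair(oracle, agent_i, T, rng=random.Random(0)):
    """Draw one (state, next action of agent i) pair distributed according
    to the experimental distribution D:
      1. query the oracle for a T-trajectory (a list of T+1 joint states),
      2. choose t uniformly from {0, ..., T-1},
      3. return (s^t, s_i^{t+1})."""
    trajectory = oracle(T)
    t = rng.randrange(T)
    return trajectory[t], trajectory[t + 1][agent_i]
```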
We denote by D the dis- O cˆi such that with probability 1 (δ/N), ~ − tribution over vectors f induced by the experimental distri- bution D over state vectors ~s. In other words, the probability 2 ǫ ~ f E~s∼D Pr(ci(~s)=s) Pr(ˆci(~s)=s) (2) of a vector f under D is the sum over all state vectors ~s " | − |#≤ NT ~ s∈S   mapping to f of the probability of ~s under D. X We focus on the problem of learning the parameters of in time polynomial in N, T , 1/ǫ, 1/δ, and dim( ) using the C the strategy of a single agent i in each of the models. We as- three-step sampling procedure described in the deterministic sume that we are presented with a set of samples , where case; as before, the dependence on N and T stem from the ~ M fact that we are requesting a confidence of 1 (δ/N) and an each instance m consists of a pair fm,am . Here ~ I ∈ M h i accuracy that is polynomial in both N and T−. It is possible fm is the distribution over states of the agents and am is the learn a set of such strategies cˆi for all agents i at the cost of next action chosen by agent i. We assume that the state dis- an additional factor of N. tributions f~m of these samples are distributed according to f If Equation 2 is satisfied for agent i, then for any τ 1, D . Given access to the oracle EXP, such samples could the probability of drawing a state ~s from D such that ≥ be collected, for example, usingO a three-step procedure like the one in Section 4.1. We show that each class is polyno- ǫ 2 f Pr(ci(~s)= s) Pr(ˆci(~s)= s) τ (3) mially learnable with respect to the distribution D induced | − |≥ NT s∈S by any distribution D over states, and so by Theorem 6, also X   polynomially learnable from collective behavior. is no more than 1/τ. While it may seem wasteful to gather only one data in- Consider a new sequence ~s 0, , ~s T returned by the stance for each agent i from each T -trajectory, we remark t h · · · i t+1 oracle EXP. For each ~s , consider the action si chosen that only small, isolated pieces of the analysis presented in by agentOi. This action was chosen according to the distribu- this section rely on the assumption that the state distributions f tion ci. Suppose instead we would like to choose this action of the samples are distributed according to D . In practice, according to the distribution cˆi using a filtered version of ci the entire trajectories could be used for learning with no im- as described above. By Equation 1, the probability that the pact on the structure of the algorithms. Additionally, while t+1 action choice of ci is “filtered” (and thus not equal to si ) the analysis here is geared towards learning under the experi- t t is half the 1 distance between ci(~s ) and cˆi(~s ). From mental distribution, the algorithms we present can be applied Equation 3,L we know that for any τ 1, with probability at without modification in the no-reset variant of the model in- least 1 1/τ, this probability is less≥ than τ(ǫ/(NT ))2, so troduced in Section 2.3. We briefly discuss how to extend − t+1 the probability of the new action being different from si the analysis to the no-reset variant in Section 5.3. is less than τ(ǫ/(NT ))2 + 1/τ. This is minimized when τ =2NT/ǫ, giving us a bound of ǫ/(NT ). 5.1 Learning Crowd Affinity Mixture Models By the union bound, with probability 1 δ, the proba- In Section 3.1, we introducedthe class of crowd affinity mix- bility that there exists any agent i and any t− 1, ,T , ture model strategies. 
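The "filtered" sampling construction used in the proof of Theorem 6 can also be made concrete. The sketch below is our own code: it draws from a target distribution p1 using a single draw from p2, altering the draw with probability equal to half the L1 distance between the two, matching the filtering probability in Equation 1.

```python
import random

def filtered_sample(p1, p2, rng=random.Random(0)):
    """Sample from p1 (a dict mapping actions to probabilities) by drawing
    once from p2 and 'filtering': the draw is kept with probability
    min(1, p1(w)/p2(w)), and otherwise replaced by a draw from P3,
    which redistributes the surplus mass of p1 over p2."""
    omega = rng.choices(list(p2), weights=list(p2.values()))[0]
    p1_omega = p1.get(omega, 0.0)
    if p1_omega >= p2[omega] or rng.random() < p1_omega / p2[omega]:
        return omega
    # P3 is supported on the actions where p1 exceeds p2
    surplus = {a: p1[a] - p2.get(a, 0.0) for a in p1 if p1[a] > p2.get(a, 0.0)}
    return rng.choices(list(surplus), weights=list(surplus.values()))[0]
```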
Such strategies are parameterized bya t+1 ∈ { · · · } (normalized) weight vector ~w and parameter α [0, 1]. The such that si is not equal to the action we get by sampling t ∈ cˆi(~s ) using the filtered version of ci must then be less than probability that agent i chooses action a given that the cur- ǫ. As in the deterministic version, if this is not the case, then rent state distribution is f~ is then αfa + (1 α)wa. In this the same sequence of states would have been reached if we section, we show that this class of strategies− is polynomially had instead started at state ~s 0 and generated each additional learnable from collective behavior and sketch an algorithm t t t−1 state ~s by letting si =c ˆi(~s ) filtered using ci. This im- for learning estimates of the parameters α and ~w. plies that with probability 1 δ, ε(QMˆ ,Q~c) ǫ, and is Let I(x) be the indicator function that is 1 if x is true and polynomially learnable from− collective behavior.≤ C 0 otherwise. From the definition of the model it is easy to see that for any m such that m , for any action a , I ∈M ∈ S E[I(am = a)] = αfa + (1 α)wa, where the expectation is 5 Learning Social Strategy Classes over the randomness in the− agent’s strategy. By linearity of We now turn our attention to efficient algorithms for learn- expectation, ing some of the specific social strategy classes introduced in Section 3. We focus on the two crowd affinity model classes. E I(am = a) =α fm,a +(1 α)wa . (4) − |M| Recall that these classes are designed to model the scenario "m # m :IXm∈M :IXm∈M Standard results from uniform convergence theory say We estimate this weight using that we can approximate the left-hand side of this equation I(am = a) αˆ fm,a arbitrarily well given a sufficiently large data set . Replac- m:Im∈M m:Im∈M wˆa = − . (7) ing the expectation with this approximation inM Equation 4 (1 αˆ)M P − P yields a single equation with two unknown variables, α and The following lemma shows that given sufficient data, wa. To solve for these variables, we must construct a pair of the error in these estimates is small when Z ∗ is large. equations with two unknown variables. We do so by splitting a ∗ the data into instances where fm,a is “high” and instances Lemma 7 Let a = argmaxa∈S Za, and let αˆ be calculated ∗ where it is “low.” as in Equation 5 with a = a . For each a , let wˆa be Specifically, let M = . For convenience of notation, calculated as in Equation 7. For sufficiently∈ large S M, for assume without loss of generality|M| that M is even; if M is any δ > 0, with probability 1 δ, low − odd, simply discard an instance at random. Define a to be the set containing the M/2 instances in withM the α αˆ (1/Za∗ ) ln((4 + 2K)/δ)/M , | − |≤ lowest values of f . Similarly, define highMto be the m,a a and for all actions a, p set containing the M/2 instances with theM highest values of low high fm,a. Replacing with a and a respectively in wa wˆa Equation 4 gives usM two linearM equationsM with two unknowns. | − | ((1 αˆ)Za∗ /√2+2) ln((4 + 2K)/δ) As long as these two equations are linearly independent, we − . 2 can solve the system of equations for α, giving us ≤ Za∗ (1 αˆ) √M (1 αˆ) ln((4 + 2K)/δ) − − −p The proof of this lemma, which isp in the appendix, relies E high I(am =a) low I(am =a) m:Im∈Ma m:Im∈Ma heavily on the following technical lemma for bounding the α= − . h high f low f i error of estimated ratios, which is used frequently throughout P m:I ∈M m,a Pm:Im∈M m,a m a − a the remainder of the paper. 
We can approximateP α from data inP the natural way, using Lemma 8 For any positive u, uˆ, v, vˆ, k, and ǫ such that high I(am=a) low I(am=a) m:Im∈M m:Im∈M ǫk 0, with probability 1 δ, − ln(4/δ)M Now that we have bounds on the error of the estimated α αˆ parameters, we can bound the expected 1 distance between | − | ≤ high fm,a m low fm,a L m:Im∈Ma p − :Im∈Ma the estimated model and the real model. = (1/Z ) ln(4/δ)/M , (6) P a P Lemma 9 For sufficiently large M, where p E ~ f (αfa + (1 α)wa) (ˆαfa + (1 αˆ)ˆwa) f∼D | − − − | 1 1 a∈S Za = fm,a fm,a . M/2 − M/2 X high m:I ∈Mlow 2 ln((4 + 2K)/δ) m:ImX∈Ma mX a ≤ Za∗ √M The quantity Za measures the difference between the p mean value of f among instances with “high” values of K(Z ∗ /√2+2) ln((4 + 2K)/δ) m,a + min a , fm,a and the mean valueof fm,a among instances with “low” (Za∗ (1 αˆ)√M ln((4 + 2K)/δ) values. While this quantity is data-dependent, standard uni- − −p form convergencetheory tells us that it is stable once the data 2(1 αˆ) . p set is large. From Equation 6, we know that if there is an ac- − o tion a for which this difference is sufficiently high, then it In this proof of this lemma, which appears in the appendix, is possible to obtain an accurate estimate of α given enough the quantity data. If, on the other hand, no such a exists, it follows that there is very little variance in the population distribution over (αfa + (1 α)wa) (ˆαfa + (1 αˆ)ˆwa) | − − − | the sample. We argue below that it is not necessary to learn a∈S α in order to mimic the behavior of an agent i if this is the X case. is bounded uniformly for all f~ using the error bounds. The For now, assume that Za is sufficiently large for at least bound on the expectation follows immediately. one value of a, and call this value a∗. We can use the estimate It remains to show that we can still bound the error when of α to obtain estimates of the weights for each action. From Za∗ is zero or very close to zero. We present a light sketch Equation 4, it is clear that for any a, of the argument here; more details appear in the appendix. Let ηa and µa be the true median and mean of the dis- I(a = a) α f tribution from which the random variables f are drawn. E m:Im∈M m m:Im∈M m,a m,a wa = − . high (1 α)M Let fa be the mean value of the distribution over fm,a P −  P ¯high conditioned on fm,a > ηa. Let fa be the empirical terms. The first, wa, is precisely the value we would like average of fm,a conditioned on fm,a > ηa. Finally, let to calculate. The second term is something that depends on ˆhigh the set of instances , but does not depend on action a. fa = (2/M) high fm,a be the empirical av- m:Im∈Ma This leads to the key observationatM the core of our algorithm. erage of fm,a conditioned on fm,a being greater than the P ˆhigh Specifically, if we have a second action b such that fm,b > 0 empirical median. We can calculate fa from data. for all m such that m , then We can apply standard arguments from uniform conver- I ∈M gence theory to show that f high is close to f¯high, and in turn m a a wa E m:I ∈M χa high high m that f¯ is close to fˆ . Similar statements can be made = m . a a wb E χ low ¯low ˆlow Pm:Im∈M b  for the analogous quantities fa , fa , and fa . By noting high low   that Za = fˆ fˆ this implies that if Za is small, then Although we do not knowP the values of these expec- a − a the probability that a random value of fm,a is far from the tations, we can approximate them arbitrarily well given mean µa is small. When this is the case, it is not necessary enough data. 
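A direct implementation of these estimators is short. The sketch below is our own code (variable names and the small numerical guards are ours): it splits the samples for each action into low and high halves of f_{m,a}, estimates alpha from the action a* with the largest gap Z_a, and then recovers the weights; when every Z_a is negligible it falls back to fitting empirical action frequencies, as discussed in the text.

```python
import numpy as np

def estimate_mixture_params(F, actions, K):
    """Estimate (alpha, w) for one agent's crowd affinity mixture strategy.
    F is an (M, K) array with F[m, a] = fraction of the population playing
    action a in sample m; actions[m] is the agent's observed next action."""
    M = (len(actions) // 2) * 2            # discard one sample if M is odd
    F, actions = F[:M], np.asarray(actions[:M])
    ind = np.eye(K)[actions]               # ind[m, a] = I(a_m = a)

    order = np.argsort(F, axis=0)          # per-action ordering of f_{m,a}
    Z = np.zeros(K)
    alpha_by_action = np.zeros(K)
    for a in range(K):
        low, high = order[: M // 2, a], order[M // 2 :, a]
        Z[a] = F[high, a].mean() - F[low, a].mean()
        denom = F[high, a].sum() - F[low, a].sum()
        if denom > 0:
            alpha_by_action[a] = (ind[high, a].sum() - ind[low, a].sum()) / denom

    a_star = int(np.argmax(Z))
    if Z[a_star] < 1e-6:
        # f_{m,a} barely varies: skip alpha and fit empirical frequencies
        return 0.0, ind.mean(axis=0)

    alpha = float(np.clip(alpha_by_action[a_star], 0.0, 1.0 - 1e-9))
    w = (ind.sum(axis=0) - alpha * F.sum(axis=0)) / ((1.0 - alpha) * M)
    return alpha, np.clip(w, 0.0, None)
```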
Since we have assumed (so far) that fm,a > 0 for all m , and we know that fm,a represents a fraction to estimate α directly. Instead, we set αˆ =0 and ∈M of the population, it must be the case that fm,a 1/N and 1 m for all . By a standard application≥ of Ho- wˆa = I(am = a) . χa [0,N] m M effding’s∈ inequality and the union bound, we see that for any m:Im∈M X δ > 0, with probability 1 δ, Applying Hoeffding’s inequality again, it is easy to show − that for each a, wˆa is very close to αµa + (1 α)wa, and N ln(2/δ) from here it can be argued that the distance− between the χm E χm . (8) L1 a − a ≤ 2 estimated model and the real model is small. m:Im∈M "m:Im∈M # s X X |M| Thus for any distribution D over state vectors, regardless This leads to the following lemma. We note that the role of of the corresponding value of Za∗ , it is possible to build an accurate model for the strategy of agent i in polynomial time. β in this lemma may appear somewhat mysterious. It comes By Theorem 6, this implies that the class is polynomially the fact that we are boundingthe errorof a ratio of two terms; learnable from collective behavior. an application of Lemma 8 using the bound in Equation 8 gives us a factor of χa,b + χb,a in the numerator and a factor Theorem 10 The class of crowd affinity mixture model of χb,a in the denominator. This is problematic only when strategies is polynomially learnable from collective behav- χa,b is significantly larger than χb,a. The full proof appears ior. in the appendix.

5.2 Learning Crowd Affinity Multiplicative Models Lemma 11 Suppose that fm,a > 0 and fm,b > 0 for all m such that . Then for any δ > 0, with probability In Section 3.2, we introduced the crowd affinity multiplica- m , forI any∈ M , if and , then if tive model. In this model, strategies are parameterized only 1 δ β > 0 χa,b βχb,a χb,a 1 − N ln(2/δ)/2, then ≤ ≥ by a weight vector ~w. The probability that agent i chooses |M| ≥ action a is simply f w / f w . m a a b∈S b b w χ (1 + β) N ln(2/δ) Although the motivation for this model is similar to that a m:Im∈M a . w χm for the mixture model, theP dynamics of the system are quite b − m:Im∈M b ≤ 2 N ln(2/δ) P |M| −p different (see the simulations and discussion in Section 3), P p p and a very different algorithm is necessary to learn individ- If we are fortunate enough to have a sufficient number of ual strategies. In this section, we show that this class is poly- data instances for which fm,a > 0 for all a , then this ∈ S nomially learnable from collective behavior, and sketch the lemma supplies us with a way of approximating the ratios corresponding learning algorithm. The algorithm we present between all pairs of weights and subsequently approximating is based on a simple but powerful observation. In particular, the weights themselves. In general, however, this may not be consider the following random variable: the case. Luckily, it is possible to estimate the ratio of the weights of each pair of actions a and b that are used together 1/fm,a if fm,a > 0 and am = a , frequently by the population using only those data instances χm = a 0 otherwise . in which at least one agent is choosingeach. Formally, define  Suppose that for all m such that m , it is the case that a,b = m : fm,a > 0,fm,b > 0 . I ∈M M {I ∈M } fm,a > 0. Then by the definition of the strategy class and Lemma 11 tells us that if a,b is sufficiently large, and there linearity of expectation, M is at least one instance m a,b for which am = b, then I ∈ M m 1 fm,awa we can approximate the ratio between wa and wb well. E χa = What if one of these assumptions does not hold? If we fm,a fm,sws "m # m s∈S :IXm∈M :IXm∈M   are not able to collect sufficiently many instances in which P1 fm,a > 0 and fm,b > 0, then standard uniform convergence = wa , s fm,sws results can be used to show that it is very unlikely that we m:Im∈M ∈S X see a new instance for which fa > 0 and fb > 0. This idea where the expectation is over the randomnessP in the agent’s is formalized in the following lemma, the proof of which is strategy. Notice that this expression is the product of two in the appendix. Lemma 12 For any M < , for any δ (0, 1), with Theorem 13 The class of crowd affinity multiplicative probability 1 δ, |M| ∈ model strategies is polynomially learnable from collective − behavior. Pr~ f [ a,b : fa > 0,fb > 0, a,b 0 serves a polynomially long prefix of a trajectory of states for and fm,b > 0. We will see that in this case, it is not important to learn the ratio between these two weights. training, and then must produce a generative model which Using these observations, we can accurately model the results in a distribution over the values of the subsequent T behavior of agent i. The model consists of two phases. First, states close to the true distribution. 
Using these observations, we can accurately model the behavior of agent $i$. The model consists of two phases. First, as a preprocessing step, we calculate a quantity
\[
\chi_{a,b} = \sum_{m : I_m \in \mathcal{M}_{a,b}} \chi^m_a
\]
for each pair $a, b \in S$. Then, each time we are presented with a state $\vec{f}$, we calculate a set of weights for all actions $a$ with $f_a > 0$ on the fly.

For a fixed $\vec{f}$, let $S' \subseteq S$ be the set of actions $a \in S$ such that $f_a > 0$. By Lemma 12, if the data set is sufficiently large, then we know that with high probability, $|\mathcal{M}_{a,b}| \geq M$ for all $a, b \in S'$, for some threshold $M$. Now, let $a^* = \mathrm{argmax}_{a \in S'} |\{ b : b \in S', \chi_{a,b} \geq \chi_{b,a} \}|$. Intuitively, if there is sufficient data, $a^*$ should be the action in $S'$ with the highest weight, or have a weight arbitrarily close to the highest. Thus for any $a \in S'$, Lemma 11 can be used to bound our estimate of $w_a / w_{a^*}$ with a value of $\beta$ arbitrarily close to 1. Noting that
\[
\frac{w_a}{\sum_{s \in S'} w_s} = \frac{w_a / w_{a^*}}{\sum_{s \in S'} w_s / w_{a^*}},
\]
we approximate the relative weight of action $a$ with respect to the other actions in $S'$ using
\[
\hat{w}_a = \frac{\chi_{a,a^*} / \chi_{a^*,a}}{\sum_{s \in S'} \chi_{s,a^*} / \chi_{a^*,s}},
\]
and simply let $\hat{w}_a = 0$ for any $a \not\in S'$. Applying Lemma 8, we find that for all $a \in S'$, with high probability,
\[
\left| \frac{w_a}{\sum_{s \in S'} w_s} - \hat{w}_a \right|
\leq \frac{(1+\beta) K \sqrt{N \ln(2K^2/\delta)}}{\sqrt{2M} - (1+\beta) K \sqrt{N \ln(2K^2/\delta)}}, \tag{9}
\]
where $M$ is the lower bound on $|\mathcal{M}_{a,b}|$ for all $a, b \in S'$, and $\beta$ is close to 1. With this bound in place, it is straightforward to show that we can apply Lemma 8 once more to bound the expected $L_1$ distance
\[
\mathbf{E}_{\vec{f} \sim D^f}\left[ \sum_{a \in S} \left| \frac{w_a f_a}{\sum_{s \in S} w_s f_s} - \frac{\hat{w}_a f_a}{\sum_{s \in S} \hat{w}_s f_s} \right| \right],
\]
and that the bound goes to 0 at a rate of $O(1/\sqrt{M})$ as the threshold $M$ grows. More details are given in the appendix.

Since it is possible to build an accurate model of the strategy of agent $i$ in polynomial time under any distribution $D$ over state vectors, we can again apply Theorem 6 to see that this class is polynomially learnable from collective behavior.

Theorem 13  The class of crowd affinity multiplicative model strategies is polynomially learnable from collective behavior.
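The sketch below outlines the two-phase procedure just described, assuming an illustrative data layout of our own (each instance is a dictionary mapping actions to fractions, paired with the agent's chosen action). The threshold check on $|\mathcal{M}_{a,b}|$ and the confidence calculations are omitted, and the handling of zero denominators is an arbitrary choice of ours, so this is a sketch rather than the paper's algorithm in full.

```python
from collections import defaultdict

def preprocess_pairwise_chi(samples):
    # Phase 1: for every ordered pair (a, b) that is jointly active on an
    # instance (f_a > 0 and f_b > 0), accumulate
    # chi[(a, b)] = sum of 1/f_a over those instances on which agent i chose a.
    chi = defaultdict(float)
    for f, chosen in samples:
        active = [a for a, fa in f.items() if fa > 0]
        for a in active:
            for b in active:
                if a != b and chosen == a:
                    chi[(a, b)] += 1.0 / f[a]
    return chi

def on_the_fly_weights(chi, f):
    # Phase 2: given a new state f, restrict to S' = {a : f_a > 0}, pick a
    # reference action a* winning the most pairwise comparisons, and set the
    # estimated relative weight of a proportional to chi[(a, a*)] / chi[(a*, a)].
    S_prime = [a for a, fa in f.items() if fa > 0]
    def wins(a):
        return sum(1 for b in S_prime if b != a and chi[(a, b)] >= chi[(b, a)])
    a_star = max(S_prime, key=wins)
    ratios = {}
    for a in S_prime:
        denom = chi[(a_star, a)]
        ratios[a] = 1.0 if a == a_star else (chi[(a, a_star)] / denom if denom > 0 else 0.0)
    total = sum(ratios.values())
    return {a: r / total for a, r in ratios.items()}
```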
5.3 Learning in the No-Reset Model

Finally, we turn to the no-reset model, in which the learning algorithm observes a polynomially long prefix of a trajectory of states for training, and then must produce a generative model which results in a distribution over the values of the subsequent $T$ states close to the true distribution.

When learning individual crowd affinity models for each agent in this setting, we again assume that we are presented with a set of samples $\mathcal{M}$, where each instance $I_m \in \mathcal{M}$ consists of a pair $\langle \vec{f}_m, a_m \rangle$. However, instead of assuming that the state distributions $\vec{f}_m$ are distributed according to $D^f$, we now assume that the state and action pairs represent a single trajectory. As previously noted, the majority of the analysis for both the mixture and multiplicative variants of the crowd affinity model does not depend on the particular way in which state distribution vectors are distributed, and thus carries over to this setting as is. Here we briefly discuss the few modifications that are necessary.

The only change required in the analysis of the crowd affinity mixture model relates to handling the case in which $Z_a$ is small for all $a$. Previously we argued that when this is the case, the distribution $D^f$ must be concentrated so that for all $a$, $f_a$ falls within a very small range with high probability. Thus it is not necessary to estimate the parameter $\alpha$ directly, and we can instead learn a single probability for each action that is used regardless of $\vec{f}$. A similar argument holds in the no-reset variant. If it is the case that $Z_a$ is small for all $a$, then it must be the case that for each $a$, the value of $f_a$ has fallen into the same small range for the entire observed trajectory. A standard uniform convergence argument says that the probability that $f_a$ suddenly changes dramatically is very small, and thus again it is sufficient to learn a single probability for each action that is used regardless of $\vec{f}$.

To adapt the analysis of the crowd affinity multiplicative model, it is first necessary to replace Lemma 12. Recall that the purpose of this lemma was to show that when the data set does not contain sufficient samples in which $f_a > 0$ and $f_b > 0$ for a pair of actions $a$ and $b$, the chance of observing a new state distribution $\vec{f}$ with $f_a > 0$ and $f_b > 0$ is small. This argument is actually much more straightforward in the no-reset case. By the definition of the model, it is easy to see that if $f_a > 0$ for some action $a$ at time $t$ in a trajectory, then it must be the case that $f_a > 0$ at all previous points in the trajectory. Thus if $f_a > 0$ on any test instance, then $f_a$ must have been strictly positive on every training instance, and we do not have to worry about the case in which there is insufficient data to compare the weights of a particular pair of actions.

One additional, possibly more subtle, modification is necessary in the analysis of the multiplicative model to handle the case in which $\chi_{a,b} = \chi_{b,a} = 0$ for all "active" pairs of actions $a, b \in S'$. This can happen only if agent $i$ has extremely small weights for every action in $S'$, and had previously been choosing an alternate action that is no longer available, i.e., an action $s$ for which $f_s$ had previously been strictly positive but suddenly is not. However, in order for $f_s$ to become 0, it must be the case that agent $i$ himself chooses an alternate action (say, action $a$) instead of $s$, which cannot happen since the estimated weight of action $a$ used by the model is 0. Thus this situation can never occur in the no-reset variant.
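As a quick sanity check of the support-monotonicity argument above, the following sketch simulates a population in which every agent plays a hypothetical multiplicative strategy given by a made-up weight matrix `W`. Inspecting the returned trajectory confirms that once $f_a = 0$ the action $a$ never reappears, so $f_a > 0$ at time $t$ indeed implies $f_a > 0$ at all earlier times. The handling of the degenerate case in which an agent's weights vanish on every surviving action is an arbitrary choice made here for illustration.

```python
import random

def simulate_multiplicative_population(W, T, seed=0):
    # Simulates T steps of a population in which agent i chooses action a
    # with probability proportional to f_a * W[i][a], starting from a
    # uniform random assignment of actions.  The support {a : f_a > 0} of
    # the returned state distributions can only shrink over time.
    rng = random.Random(seed)
    N, K = len(W), len(W[0])
    actions = [rng.randrange(K) for _ in range(N)]
    trajectory = []
    for _ in range(T):
        f = [actions.count(a) / N for a in range(K)]
        trajectory.append(f)
        new_actions = []
        for i in range(N):
            probs = [f[a] * W[i][a] for a in range(K)]
            if sum(probs) > 0:
                new_actions.append(rng.choices(range(K), weights=probs)[0])
            else:
                # degenerate case: zero weight on every surviving action;
                # arbitrarily keep the previous choice
                new_actions.append(actions[i])
        actions = new_actions
    return trajectory
```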
6 Conclusions and Future Work

We have introduced a computational model for learning from collective behavior, and populated it with some initial general theory and algorithmic results for crowd affinity models. In addition to positive or negative results for further agent strategy classes, there are a number of other general directions of interest for future research. These include extension of our model to agnostic [14] settings, in which we relax the assumption that every agent strategy falls in a known class, and to reinforcement learning [23] settings, in which the learning algorithm may itself be a member of the population being modeled, and wishes to learn an optimal policy with respect to some reward function.

Acknowledgments

We thank Nina Balcan and Eyal Even-Dar for early discussions on models of social learning, and Duncan Watts for helpful conversations and pointers to relevant literature.

References

[1] S. Bikhchandani, D. Hirshleifer, and I. Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of Political Economy, 100:992–1026, 1992.
[2] S. Bikhchandani, D. Hirshleifer, and I. Welch. Learning from the behavior of others: Conformity, fads, and informational cascades. The Journal of Economic Perspectives, 12:151–170, 1998.
[3] G. W. Brown. Iterative solutions of games by fictitious play. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation. Wiley, 1951.
[4] T. Bylander. Learning noisy linear threshold functions. Technical Report, 1998.

[5] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In 38th IEEE Annual Symposium on Foundations of Computer Science, 1997.
[6] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, NY, 1991.
[7] P. Dodds, R. Muhamad, and D. J. Watts. An experimental study of search in global social networks. Science, 301:828–829, August 2003.
[8] M. Drehmann, J. Oechssler, and A. Roider. Herding and contrarian behavior in financial markets: An Internet experiment. American Economic Review, 95(5):1403–1426, 2005.
[9] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7–35, 1999.

[10] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology, 61:1420–1443, 1978.
[11] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.
[12] P. Hedstrom. Rational imitation. In P. Hedstrom and R. Swedberg, editors, Social Mechanisms: An Analytical Approach to Social Theory. Cambridge University Press, 1998.
[13] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie. On the learnability of discrete distributions. In 26th Annual ACM Symposium on Theory of Computing, pages 273–282, 1994.
[14] M. Kearns, R. Schapire, and L. Sellie. Towards efficient agnostic learning. Machine Learning, 17:115–141, 1994.
[15] M. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
[16] M. Kearns and S. Singh. Finite-sample rates of convergence for Q-learning and indirect methods. In Advances in Neural Information Processing Systems 11, 1999.
[17] M. Kearns, S. Suri, and N. Montfort. A behavioral study of the coloring problem on human subject networks. Science, 313(5788):824–827, 2006.

[18] J. Kleinberg. Cascading behavior in networks: Algorithmic and economic issues. In N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, editors, Algorithmic Game Theory. Cambridge University Press, 2007.
[19] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[20] M. Salganik, P. Dodds, and D. J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.
[21] M. Salganik and D. J. Watts. Social influence, manipulation, and self-fulfilling prophecies in cultural markets. Preprint, 2007.
[22] T. Schelling. Micromotives and Macrobehavior. Norton, New York, NY, 1978.
[23] R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998.

[24] A. De Vany. Hollywood Economics: How Extreme Uncertainty Shapes the Film Industry. Routledge, London, 2004.
[25] A. De Vany and C. Lee. Quality signals in information cascades and the dynamics of the distribution of motion picture box office revenues. Journal of Economic Dynamics and Control, 25:593–614, 2001.
[26] C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
[27] I. Welch. Herding among security analysts. Journal of Financial Economics, 58:369–396, 2000.

A Proofs from Section 5.1

A.1 Proof of Lemma 8

For the first direction,
\begin{align*}
\frac{u}{v} - \frac{\hat{u}}{\hat{v}}
&\leq \frac{u}{v} - \frac{u - \epsilon}{v + k\epsilon}
= \frac{u}{v} - \frac{uv - \epsilon v}{v(v + k\epsilon)} \\
&= \frac{u}{v} - \frac{uv + uk\epsilon - \epsilon v - uk\epsilon}{v(v + k\epsilon)} \\
&= \frac{u}{v} - \frac{u(v + k\epsilon) - \epsilon(v + uk)}{v(v + k\epsilon)} \\
&= \frac{\epsilon(v + uk)}{v(v + \epsilon k)}
\;\leq\; \frac{\epsilon(v + uk)}{v(v - \epsilon k)}.
\end{align*}
Similarly,
\begin{align*}
\frac{\hat{u}}{\hat{v}} - \frac{u}{v}
&\leq \frac{u + \epsilon}{v - k\epsilon} - \frac{u}{v}
= \frac{uv + \epsilon v}{v(v - k\epsilon)} - \frac{u}{v} \\
&= \frac{uv - uk\epsilon + \epsilon v + uk\epsilon}{v(v - k\epsilon)} - \frac{u}{v} \\
&= \frac{u(v - k\epsilon) + \epsilon(v + uk)}{v(v - k\epsilon)} - \frac{u}{v}
= \frac{\epsilon(v + uk)}{v(v - \epsilon k)}.
\end{align*}

A.2 Proof of Lemma 7

We bound the error of these estimations in two parts. First, since from Equation 6 we know that for any $\delta_1 > 0$, with probability $1 - \delta_1$, $|\hat{\alpha} - \alpha| \leq (1/Z_{a^*}) \sqrt{\ln(4/\delta_1)/M}$, we have
\[
\left| \hat{\alpha} \sum_{m : I_m \in \mathcal{M}} f_{m,a} - \alpha \sum_{m : I_m \in \mathcal{M}} f_{m,a} \right|
\leq \frac{1}{Z_{a^*}} \sqrt{\frac{\ln(4/\delta_1)}{M}} \sum_{m : I_m \in \mathcal{M}} f_{m,a}
\leq \frac{1}{Z_{a^*}} \sqrt{M \ln(4/\delta_1)},
\]
and
\[
\left| (1 - \hat{\alpha}) M - (1 - \alpha) M \right| \leq \frac{1}{Z_{a^*}} \sqrt{M \ln(4/\delta_1)},
\]
we have by Lemma 8 that for sufficiently large $M$,
\begin{align*}
\left| \frac{\hat{\alpha} \sum_{m : I_m \in \mathcal{M}} f_{m,a}}{(1-\hat{\alpha}) M} - \frac{\alpha \sum_{m : I_m \in \mathcal{M}} f_{m,a}}{(1-\alpha) M} \right|
&\leq \frac{(1/Z_{a^*}) \sqrt{M \ln(4/\delta_1)} \left( (1-\hat{\alpha}) M + \hat{\alpha} \sum_{m : I_m \in \mathcal{M}} f_{m,a} \right)}{(1-\hat{\alpha})^2 M^2 - (1-\hat{\alpha}) M \sqrt{M \ln(4/\delta_1)} / Z_{a^*}} \\
&\leq \frac{\sqrt{M \ln(4/\delta_1)} \left( (1-\hat{\alpha}) M + \hat{\alpha} M \right)}{Z_{a^*} \left( (1-\hat{\alpha})^2 M^2 - (1-\hat{\alpha}) M \sqrt{M \ln(4/\delta_1)} \right)} \\
&= \frac{\sqrt{\ln(4/\delta_1)}}{Z_{a^*} \left( (1-\hat{\alpha})^2 \sqrt{M} - (1-\hat{\alpha}) \sqrt{\ln(4/\delta_1)} \right)}.
\end{align*}
Now, by Hoeffding's inequality and the union bound, for any $\delta_2 > 0$, with probability $1 - \delta_2$, for all $a$,
\[
\left| \sum_{m : I_m \in \mathcal{M}} I(a_m = a) - \mathbf{E}\left[ \sum_{m : I_m \in \mathcal{M}} I(a_m = a) \right] \right| \leq \sqrt{M \ln(2K/\delta_2)/2}.
\]
Setting $\delta_2 = K\delta_1/2$, we can again apply Lemma 8 and see that for sufficiently large $M$,
\begin{align*}
\left| \frac{\sum_{m : I_m \in \mathcal{M}} I(a_m = a)}{(1-\hat{\alpha}) M} - \frac{\mathbf{E}\left[\sum_{m : I_m \in \mathcal{M}} I(a_m = a)\right]}{(1-\alpha) M} \right|
&\leq \frac{\sqrt{M \ln(4/\delta_1)/2} \left( (1-\hat{\alpha}) Z_{a^*} M + \sqrt{2}\, M \right)}{Z_{a^*} \left( (1-\hat{\alpha})^2 M^2 - (1-\hat{\alpha}) M \sqrt{M \ln(4/\delta_1)} \right)} \\
&\leq \frac{\sqrt{\ln(4/\delta_1)/2} \left( (1-\hat{\alpha}) Z_{a^*} + \sqrt{2} \right)}{Z_{a^*} \left( (1-\hat{\alpha})^2 \sqrt{M} - (1-\hat{\alpha}) \sqrt{\ln(4/\delta_1)} \right)} \\
&= \frac{\left( (1-\hat{\alpha}) Z_{a^*} / \sqrt{2} + 1 \right) \sqrt{\ln(4/\delta_1)}}{Z_{a^*} \left( (1-\hat{\alpha})^2 \sqrt{M} - (1-\hat{\alpha}) \sqrt{\ln(4/\delta_1)} \right)}.
\end{align*}
Setting $\delta_1 = \delta/(1 + K/2)$ and applying the union bound yields the lemma.

A.3 Proof of Lemma 9

As long as $M$ is sufficiently large, for any fixed $\vec{f}$,
\begin{align*}
\sum_{a \in S} \left| (\alpha f_a + (1-\alpha) w_a) - (\hat{\alpha} f_a + (1-\hat{\alpha}) \hat{w}_a) \right|
&\leq \sum_{a \in S} |\alpha - \hat{\alpha}|\, f_a + \sum_{a \in S} \left| (1-\alpha) w_a - (1-\hat{\alpha}) \hat{w}_a \right| \\
&\leq |\alpha - \hat{\alpha}| + \sum_{a \in S} \left( |\alpha - \hat{\alpha}|\, \hat{w}_a + |w_a - \hat{w}_a| (1-\hat{\alpha}) - |\alpha - \hat{\alpha}| \cdot |w_a - \hat{w}_a| \right) \\
&\leq 2 |\alpha - \hat{\alpha}| + (1-\hat{\alpha}) \sum_{a \in S} |w_a - \hat{w}_a| - |\alpha - \hat{\alpha}| \sum_{a \in S} |w_a - \hat{w}_a| \\
&\leq \frac{2 \sqrt{\ln((4+2K)/\delta)}}{Z_{a^*} \sqrt{M}}
 + \min\left\{ \frac{K (Z_{a^*}/\sqrt{2} + 2) \sqrt{\ln((4+2K)/\delta)}}{Z_{a^*} \left( (1-\hat{\alpha}) \sqrt{M} - \sqrt{\ln((4+2K)/\delta)} \right)},\; 2(1-\hat{\alpha}) \right\}.
\end{align*}
Notice that this holds uniformly for all $\vec{f}$, so the same bound holds when we take an expectation over $\vec{f}$.

A.4 Handling the case where $Z_{a^*}$ is small

Suppose that for an action $a$, $Z_a < \epsilon$. Let $\eta_a$ and $\mu_a$ be the true median and mean respectively of the distribution from which the random variables $f_{m,a}$ are drawn.¹ Let $f_a^{high}$ be the mean value of the distribution over $f_{m,a}$ conditioned on $f_{m,a} > \eta_a$, and similarly let $f_a^{low}$ be the mean value conditioned on $f_{m,a} < \eta_a$, so $\mu_a = (f_a^{low} + f_a^{high})/2$. Let $\bar{f}_a^{high}$ be the empirical average of $f_{m,a}$ conditioned on $f_{m,a} > \eta_a$, and $\bar{f}_a^{low}$ be the empirical average of $f_{m,a}$ conditioned on $f_{m,a} < \eta_a$. (Note that we cannot actually compute $\bar{f}_a^{high}$ and $\bar{f}_a^{low}$ since $\eta_a$ is unknown.)

¹ Assume for now that it is never the case that $f_{m,a} = \eta_a$. This simplifies the explanation, although everything still holds if $f_{m,a}$ can be $\eta_a$.
Finally, we show that Pr(fm,a f + τ) + Pr(fm,a f τ) a a ≤ ≥ a ≤ a − this implies that if Za is small, then the probability that a ǫ′/(ǫ′ + τ) . random value of fm,a is far from ηa is small. This in turn ≤ Recall that when Za is small for all a, we set αˆ = 0 implies a small 1 distance between our estimated model L and wˆa = I(am = a). Let w¯a = αµa + (1 and the real model for each agent. m:Im∈M − high ¯high α)wa. Notice that E[ˆwa]=w ¯a. Applying Hoeffding’s in- To bound the difference between fa and fa , it is P equality yet again, with probability 1 δ , wˆa w¯a first necessary to lower bound the number of points in the − 5 | − | ≤ empirical samples with fm,a > ηa. Let zm be a random ln(2/δ5)/2M. For any τ > 0, variable that is 1 if f > η and 0 otherwise. Clearly m,a a p Pr(zm = 1) = Pr(zm =0)=1/2. By a straightforward E ~ f (αfa + (1 α)wa) (ˆαfa + (1 αˆ)ˆwa) f∼D | − − − | application of Hoeffding’s inequality, for any δ3, with prob- "a∈S # ability 1 δ3, X − = E ~ f [ αfa + (1 α)wa wˆa ] f∼D | − − | M ln(2/δ )M a∈S z 3 , X m 2 2 − ≤ r Ef~∼Df [ αfa + (1 α)wa w¯a + w¯a wˆa ] m:Im∈M ≤ | − − | | − | X a∈S ¯high X and so the numberof samplesaveragedto get fa is at least α E ~ f [ fa µa ]+ K ln(2/δ )/2M ≤ f∼D | − | 5 (M/2) ln(2/δ3)M/2. Applying Hoeffding’s inequality a∈S again, with− probability 1 δ , X p − 4 ǫ′ ǫ′ p K 1 (ǫ′ + τ)+ ≤ − ǫ′ + τ ǫ′ + τ high ¯high ln(2/δ4)    fa fa . − ≤ sM 2 ln(2/δ3)M +K ln(2/δ5)/2M − ′ pǫ ˆhigh p = K τ + + ln(2/δ5)/2M . Now, fa is an empirical average of M/2 values in ǫ′ + τ ¯high   [0, 1], while fa is an empirical average of the same points, p plus or minus up to ln(2/δ3)M/2 points. In the worst B Proofs from Section 5.2 case, f¯high either includes an additional ln(2/δ )M/2 a p 3 B.1 ProofofLemma 12 points with values lower than any point in high, or ex- pMa By the union bound, cludes the ln(2/δ )M/2 lowest values points in high. 3 Ma This implies that in the worst case, Prf~∼Df [ a,b : fa > 0,fb > 0, a,b 0,fb > 0, a,b 0,fb > 0] . By the triangle inequality, p p ≤ a,b∈S:|MXa,b| 0,fb > 0] + . M/p2 ln(2/δ3) f∼D ≤ 2 − |M| s |M| lowp ˆlowp 2 The same can be show for fa and fa . Hence if Za Noting that the number of pairs is less than K /2 and setting ǫ, then ≤ δ =2δ/K2 yields the lemma.

B.2 Bounding the $L_1$ distance

Here we show that with high probability (over the choice of $\mathcal{M}$ and the draw of $\vec{f}$), if $\mathcal{M}$ is sufficiently large,
\[
\sum_{a \in S} \left| \frac{w_a f_a}{\sum_{s \in S} w_s f_s} - \frac{\hat{w}_a f_a}{\sum_{s \in S} \hat{w}_s f_s} \right|
\leq \frac{2 (1+\beta) K N \sqrt{N \ln(2K/\delta)}}{\sqrt{2M} - (1+\beta) K (N+1) \sqrt{N \ln(2K/\delta)}}.
\]
First note that we can rewrite the expression on the left-hand side of the inequality above as
\[
\sum_{a \in S'} \left| \frac{w'_a f_a}{\sum_{s \in S'} w'_s f_s} - \frac{\hat{w}_a f_a}{\sum_{s \in S'} \hat{w}_s f_s} \right|,
\]
where for all $a$, $w'_a = w_a / \left( \sum_{s \in S'} w_s \right)$. We know from Equation 9 that with high probability (over the data set and the choice of $\vec{f}$), for all $a \in S'$,
\[
\left| w'_a f_a - \hat{w}_a f_a \right|
\leq \frac{f_a (1+\beta)\, |S'| \sqrt{N \ln(2K^2/\delta)}}{\sqrt{2M} - (1+\beta)\, |S'| \sqrt{N \ln(2K^2/\delta)}},
\]
and
\[
\left| \sum_{s \in S'} w'_s f_s - \sum_{s \in S'} \hat{w}_s f_s \right|
\leq \sum_{s \in S'} f_s \left| w'_s - \hat{w}_s \right|
\leq \frac{(1+\beta)\, |S'| \sqrt{N \ln(2K^2/\delta)}}{\sqrt{2M} - (1+\beta)\, |S'| \sqrt{N \ln(2K^2/\delta)}}.
\]
Thus, using the fact that $\sum_{s \in S'} \hat{w}_s f_s \geq 1/N$, we can apply Lemma 8 once again to get that
\[
\left| \frac{w'_a f_a}{\sum_{s \in S'} w'_s f_s} - \frac{\hat{w}_a f_a}{\sum_{s \in S'} \hat{w}_s f_s} \right|
\leq \left( f_a + \frac{\hat{w}_a f_a}{\sum_{s \in S'} \hat{w}_s f_s} \right)
\frac{(1+\beta) K N \sqrt{N \ln(2K/\delta)}}{\sqrt{2M} - (1+\beta) K (N+1) \sqrt{N \ln(2K/\delta)}},
\]
and
\[
\sum_{a \in S'} \left| \frac{w'_a f_a}{\sum_{s \in S'} w'_s f_s} - \frac{\hat{w}_a f_a}{\sum_{s \in S'} \hat{w}_s f_s} \right|
\leq \frac{2 (1+\beta) K N \sqrt{N \ln(2K/\delta)}}{\sqrt{2M} - (1+\beta) K (N+1) \sqrt{N \ln(2K/\delta)}}.
\]