The Neuroeconomic Theory of Learning

By Andrew Caplin and Mark Dean*

AEA Papers and Proceedings, Vol. 97, No. 2 (May 2007): 148–152

* Caplin: Department of Economics, New York University, 19 West 4th St., New York, NY 10011 (e-mail: [email protected]); Dean: Department of Economics, New York University, 19 West 4th St., New York, NY 10011 (e-mail: [email protected]). Deep thanks are extended to Paul Glimcher and to all of the economists, neuroscientists, and psychologists who contribute to the vibrancy of the Seminar.

Paul Glimcher (2003) and Colin Camerer, George Loewenstein, and Drazen Prelec (2005) make powerful cases in favor of neuroeconomic research. Yet in their equally powerful defense of standard "Mindless Economics," Faruk Gul and Wolfgang Pesendorfer (forthcoming) point to the profound language gap between the two contributing disciplines. For example, for an economist, risk aversion captures preferences among wealth lotteries. From the neuroscientific viewpoint, it is a broader concept related to fear responses and the amygdala. Furthermore, as economic models make no predictions concerning brain activity, neurological data can neither support nor refute these models. Rather than looking to connect such distinct abstractions, Gul and Pesendorfer (forthcoming) argue for explicit separation: "The requirement that economic theories simultaneously account for economic data and brain imaging data places an unreasonable burden on economic theories."

We share the conviction of Glimcher (2003) and Camerer, Loewenstein, and Prelec (2005) concerning the potential value of neuroeconomics, yet we believe that the field will live up to its potential only if a common conceptual language can be agreed upon. Hence, we face the Gul and Pesendorfer challenge head on by developing theories that simultaneously account for behavioral and brain imaging data. The principal novelty lies in our use of the decision theorists' standard axiomatic methodology in this highly nonstandard setting. This removes any linguistic confusion by defining concepts directly in terms of their empirical counterparts. It also allows us to pinpoint how to design experiments directed to the central tenets of the theory, rather than to particular parametrizations. If these experimental tests reveal the theory to be wanting, then knowing which axiom is violated will aid in the development of appropriate alternatives.

For our initial foray into this line of research, we focus on learning theory (Caplin and Dean 2007a). This represents an ideal test case for the integrative methodology, since neuroscientists have independently formulated a specific theory of neurological function, the "Dopaminergic Reward Prediction Error" (DRPE) hypothesis, which has important behavioral implications. Dopamine is a neurotransmitter whose release had previously been hypothesized to reflect "hedonia," as when a thirsty monkey is given a squirt of juice. Yet Wolfram Schultz, Paul Apicella, and Tomas Ljungberg (1993) found that if such a monkey learns to associate a tone with later receipt of fruit juice, the dopaminergic response occurs when the tone is heard, not when the juice is received. Dopamine somehow appears to signal changes in the anticipated value of rewards. Schultz, Peter Dayan, and P. Read Montague (1997) noted that a "prediction error" signal of this form is precisely what is needed in reinforcement algorithms to drive convergence toward a standard dynamic programming value function (Andrew Barto and Richard Sutton 1982). The DRPE hypothesis asserts that dopamine does measure a reward prediction error that is used to update an evolving value function. In addition to more standard tests, Mathias Pessiglione et al. (2006) have shown that neurological interventions aimed at the dopamine system can have an impact on the rate at which learning appears to take place, much as the theory suggests.

Economists have become interested in reinforcement learning for their own reasons. While the natural assumption within economics is that inference is Bayesian, this assumption has little predictive power in complex environments. The need for predictive models led economists toward the reinforcement model, and Ido Erev and Alvin E. Roth (1998) have demonstrated that simple variants of this model match the pattern of behavior in a wide variety of standard games. Camerer and Teck-Hua Ho (1999) enrich the model by adding counterfactual learning based on past outcomes. It is clear, however, that these simple "history-based" models will match behavior only in relatively stable environments. We believe that neuroscientific research has the potential to suggest next steps in modeling learning in more complex environments. Indeed, Anthony Dickinson and Bernard W. Balleine (2002) have begun to uncover evidence suggesting that dopaminergic reinforcement is but one of several neurological modules related to learning.

Unfortunately, there is a profound language barrier that largely prevents economists from embracing the growing neuroscientific evidence on learning. In economics, concepts such as utility and reward are inferred from observed choices, while neuroscientists interpret them in relation to intuitions concerning the flow of experience (e.g., a squirt of juice is assumed to be rewarding to a thirsty monkey). In fact, many neuroscientific tests of the DRPE hypothesis take the perspective of "classical" or "Pavlovian" conditioning, in which choice plays no role, rendering economic interpretation impossible. As with risk aversion, the fact that economic and psychological concepts have identical names does not imply identical interpretations.

Caplin and Dean (2007a) take an axiomatic perspective on the DRPE hypothesis and characterize its empirical implications for a data tape with combined information on choice and dopaminergic activity. If the data do not obey our axioms, then the DRPE model is fundamentally wrong, not merely misspecified. Our approach allows us to identify and rectify a problem in current quantitative tests of the DRPE hypothesis. In these tests, it is typical to treat neurologically measured dopaminergic signals as defined only up to linear transformations, with a quantitatively larger dopaminergic response identifying a larger reward difference. We pinpoint the somewhat harsh assumptions that are needed to justify this conclusion. Our central result justifies only an ordinal version of the DRPE hypothesis in which reward differences are ill-defined, just as marginal utility is ill-defined in ordinal characterizations. We strengthen the assumptions as required to justify use of dopamine as a measuring rod for differences in reward.

I. Basic Propositions

We develop the DRPE hypothesis for a case in which probabilities are objective and dopaminergic responses derive from realizations of specific lotteries over final prizes. An advantage of this simple lottery-outcome framework is that it avoids tying our formulation to a particular model of learning. We consider a setting in which the agent is either endowed with or chooses a specific lottery from which a prize is realized. We observe any initial act of choice among lotteries and the dopaminergic response when the prize is realized. Definition 1 lays out the various prize and lottery sets studied in the model, as well as our idealized measure of the dopamine response rate.

Definition 1: The set of prizes is a compact metric space Z with generic element z ∈ Z. The set of all simple lotteries over Z is denoted L, with generic element p ∈ L. We define e_z ∈ L as the degenerate lottery that assigns probability 1 to prize z ∈ Z, and the set L(z) to comprise all lotteries with prize z in their support,

(1) L(z) ≡ {p ∈ L : p_z > 0} ⊂ L.

The function d : M → R identifies the idealized dopamine response function (DRF), where M comprises all pairs (z, p) with z ∈ Z and p ∈ L(z).

The dopaminergic reward prediction error hypothesis hinges on the existence of a function defining the "expected" and the "experienced" reward associated with receipt of each possible prize from any given lottery. Under the assumption that the expected reward of a degenerate lottery is equal to the experienced reward of that prize, what we are looking for is a dopaminergic reward function r : L → R which defines both the expected reward associated with each lottery and the experienced reward associated with each prize. A basic assumption is that this reward function contains all the information that determines dopamine release.

Definition 2: A function r : L → R fully summarizes a DRF d : M → R if there exists a function E : r(Z) × r(L) → R such that, given (z, p) ∈ M,

(2) d(z, p) = E(r(e_z), r(p)),

where r(Z) comprises all values r(e_z) across degenerate lotteries, and r(L) identifies the range across all lotteries. In this case, we say that r and E represent the DRF.
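To connect this machinery back to the reinforcement algorithms discussed in the introduction, the prediction-error idea can be sketched as a standard temporal-difference update. This is a minimal illustration only: the learning rate and the reward stream are our own assumptions, not part of the article's model.

```python
# Minimal temporal-difference sketch of a dopaminergic reward prediction
# error (RPE). Illustration only: the learning rate (alpha) and the
# reward stream are assumed here, not taken from the article.

def td_update(value, reward, alpha=0.1):
    """One update step: the RPE is experienced minus expected reward."""
    prediction_error = reward - value          # positive if better than expected
    return value + alpha * prediction_error    # expectation drifts toward reward

value = 0.0                                    # initial expected reward of the cue
for _ in range(100):                           # repeated cue-juice pairings
    value = td_update(value, reward=1.0)

# Once the cue fully predicts the juice, the RPE at delivery is near zero,
# matching the finding that dopamine tracks surprise rather than reward itself.
```

With repeated pairings the value estimate converges to the delivered reward, so the error at delivery vanishes while the (unmodeled) response shifts to the predictive cue.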

A DRPE representation rests not only on the ability to use reward computations to understand all dopaminergic responses, but also on these responses having appropriate order properties. Intuitively, the dopaminergic response should be strictly higher for a more rewarding prize than it is for a less rewarding prize, and from a less rewarding lottery than from a more rewarding lottery. In addition to depending on the existence of an appropriate reward function, the DRPE hypothesis rests on the assumption that, if expectations are met, the dopaminergic response does not depend on what was expected. We say that a DRF admits a reward prediction error representation if we can find r and E functions that satisfy all three conditions.

Definition 3: Functions r and E, which represent a DRF, respect dopaminergic dominance if E is strictly increasing in its first argument and strictly decreasing in its second argument. They satisfy no surprise constancy if E(x, x) = E(y, y) for all x, y ∈ r(Z).

Definition 4: A DRF d : M → R admits a dopaminergic reward prediction error (DRPE) representation if there exist functions r : L → R and E : r(Z) × r(L) → R which (i) represent the DRF; (ii) respect dopaminergic dominance; and (iii) satisfy no surprise constancy.

It is clear from the definition that if r : L → R forms part of a DRPE representation of a DRF d : M → R, then so does any function r* : L → R that is a strictly increasing monotone transform of r. Hence, this representation does not allow one to treat dopamine as an invariant measure of reward differences. We develop an additive representation that embodies the minimum requirement for using dopaminergic response to animate the notion of reward differences. We also develop an expected reward representation that is entirely analogous to the expected utility representation from choice theory.

Definition 5: A DRF d : M → R admits a dopaminergic additive reward prediction error (DARPE) representation if there exist a function r : L → R and a strictly increasing function G : r(Z) − r(L) → R such that, given (z, p) ∈ M,

(3) d(z, p) = G(r(e_z) − r(p)),

where r(Z) − r(L) comprises all values r(e_z) − r(p) across (z, p) ∈ M.

Definition 6: A DRF d : M → R admits a dopaminergic expected reward prediction error (DERPE) representation if there exist functions r : L → R and E : r(Z) × r(L) → R that form a DRPE representation of the DRF in which r(p) = μ_p[u] for all p ∈ L, for some function u : Z → R, where μ_p[u] denotes the expected value of u : Z → R with respect to the lottery p.

For economic interest to be warranted, there must be a connection between the dopaminergic reward and choice. The simplest such connection occurs if choices among lotteries can be modeled as deriving from maximization of the DRPE reward function. While this case is of obvious interest to economists, a more standard scenario involves dopamine as simply one component of a richer overall process of learning and of choice.

Definition 7: The choice correspondence C is defined on domain Q = {X ∈ 2^L : |X| < ∞}, with C(X) ⊆ X denoting the set of lotteries chosen from any finite subset of the lottery space. A DRF d : M → R and a choice correspondence C admit a choice-consistent DRPE representation if d admits a DRPE representation and, for all X ∈ Q,

(4) C(X) = arg max_{p ∈ X} r(p).
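To fix ideas, Definitions 5 through 7 can be illustrated with a toy computation. Every prize, utility value, and lottery below is hypothetical, and we take G to be the identity; this is a sketch of the representation notions, not the authors' empirical procedure.

```python
# Toy instance of Definitions 5-7 (all prizes, utilities, and lotteries
# below are hypothetical). Lotteries are dicts mapping prize -> probability.

def degenerate(z):
    """The degenerate lottery e_z."""
    return {z: 1.0}

def r(p, u):
    """Expected reward of lottery p under prize utility u (Definition 6)."""
    return sum(prob * u[z] for z, prob in p.items())

def d(z, p, u, G=lambda x: x):
    """DARPE response (Definition 5): G applied to r(e_z) - r(p),
    with G strictly increasing (here the identity)."""
    return G(r(degenerate(z), u) - r(p, u))

def choose(X, u):
    """Choice-consistent DRPE (Definition 7): maximize r over the menu X."""
    best = max(r(p, u) for p in X)
    return [p for p in X if r(p, u) == best]

u = {"juice": 1.0, "water": 0.2}            # assumed experienced rewards
fifty_fifty = {"juice": 0.5, "water": 0.5}  # a 50-50 lottery over the prizes

assert d("juice", fifty_fifty, u) > 0            # better than expected
assert d("water", fifty_fifty, u) < 0            # worse than expected
assert d("juice", degenerate("juice"), u) == 0   # expectations met: no surprise
assert choose([fifty_fifty, degenerate("water")], u) == [fifty_fifty]
```

Note that replacing u by any strictly increasing transform of u changes the numbers but not their signs, which is exactly the ordinal indeterminacy discussed above.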
Caplin and Dean (2007a) characterize all the representations above. In the present paper, we outline conditions for the basic DRPE representation. There are three axioms that are intuitively necessary for d : M → R to admit a DRPE. The first ensures coherence in dopaminergic responses to prizes: if one prize is more of a positive surprise than another when received from some lottery, it must be so for any other lottery. The second provides the analog condition with respect to lotteries: if an outcome z leads to a bigger dopamine release when obtained from one given lottery than from some other lottery, the same must be true for any other outcome that is in the support of both lotteries. The third characterizes the dopamine function as having equivalent value all along the 45-degree line.

Axiom 1 (A1: Coherent Prize Dominance): Given (z, p), (z′, p′), (z′, p), (z, p′) ∈ M,

(5) d(z, p) > d(z′, p) ⟹ d(z, p′) > d(z′, p′).

Axiom 2 (A2: Coherent Lottery Dominance): Given (z, p), (z′, p′), (z′, p), (z, p′) ∈ M,

(6) d(z, p) > d(z, p′) ⟹ d(z′, p) > d(z′, p′).

Axiom 3 (A3: No Surprise Equivalence): Given z, z′ ∈ Z,

(7) d(z′, e_{z′}) = d(z, e_z).

While necessary, A1–A3 are not sufficient for a DRPE representation, due to the fact that the domain of the dopamine function differs across prizes. These domain differences allow A1–A3 to be consistent with cycles of apparent dopaminergic dominance and with other conditions inconsistent with the DRPE (Caplin and Dean 2007a). In that paper, we show that the following continuity conditions suffice to establish availability of such a representation.

Axiom 4 (A4: Uniform Continuity): The function d : M → R is uniformly continuous in the appropriate metric.
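Because A1–A3 restrict observable responses directly, they can be checked on any finite data set of (prize, lottery) observations. A minimal sketch of such a check on hypothetical data follows; A2 would be the mirror-image test with the roles of prizes and lotteries exchanged.

```python
# Direct finite-data check of A1 and A3 on hypothetical observations.
# Keys are (prize, lottery-name) pairs in M; the degenerate lottery for
# prize z is named "e_" + z. All response values are illustrative.

def coherent_prize_dominance(d):
    """A1: if prize z is more of a positive surprise than z' out of one
    lottery, it must be so out of every lottery where both are observed."""
    prizes = {z for z, _ in d}
    lotteries = {p for _, p in d}
    for z in prizes:
        for z2 in prizes:
            for p in lotteries:
                for p2 in lotteries:
                    keys = [(z, p), (z2, p), (z, p2), (z2, p2)]
                    if all(k in d for k in keys):
                        if d[(z, p)] > d[(z2, p)] and not (d[(z, p2)] > d[(z2, p2)]):
                            return False
    return True

def no_surprise_equivalence(d):
    """A3: responses along the 45-degree line (each prize from its own
    degenerate lottery) must all coincide."""
    on_line = {v for (z, name), v in d.items() if name == "e_" + z}
    return len(on_line) <= 1

obs = {("juice", "mix"): 0.4, ("water", "mix"): -0.4,
       ("juice", "e_juice"): 0.0, ("water", "e_water"): 0.0}

assert coherent_prize_dominance(obs)
assert no_surprise_equivalence(obs)
```

The quadruple loop only fires when all four required pairs lie in the observed domain, which mirrors the domain qualification "Given (z, p), (z′, p′), (z′, p), (z, p′) ∈ M" in the axiom statements.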

Axiom 5 (A5: Separation): Given (z, p), (z′, p) ∈ M,

(8) d(z, p) ≠ d(z′, p) ⟹ inf {|d(z, p′) − d(z′, p′)| : p′ ∈ L, (z, p′), (z′, p′) ∈ M} > 0.

Theorem 1: Under A4 and A5, a DRF d : M → R admits a DRPE representation if and only if it satisfies A1–A3.

II. Next Steps

Caplin and Dean (2007a) provide a theoretical framework within which to test the DRPE hypothesis. Our next step is to generate data with which to perform such a test. To this end, we are working with Glimcher at the Center for Neural Science at New York University on the laboratory implementation of our axiomatic framework. We are currently refining an experimental design that will allow us to record brain activity as lotteries are chosen and as they resolve. This takes place while the subject is lying in a functional Magnetic Resonance Imaging (fMRI) scanner. We use these data to construct an empirical analogue of the d function.

Assuming that we find supporting evidence for the DRPE hypothesis, our next step will be to extend our theoretical and empirical work to allow for dynamic environments in which learning takes place. Richer axioms will be developed to connect the observed patterns of choice and dopaminergic response, enabling us fully to characterize reinforcement learning models for such data.

The drive to deepen understanding of the neurological and behavioral aspects of learning will ultimately force researchers to look beyond the dopamine system. Dickinson and Balleine (2002) have developed a two-process theory of motivation, with only one of these two processes involving the dopamine system. In association with this theory, Balleine (2005) is currently investigating the development of "habits," as opposed to more flexible responses, and an associated "supervisory function" that determines the extent to which habitual behavior is called into play in any given situation. Depending on specific neurological manipulations, there can be wholesale changes in the extent to which past rewards dominate future behavior. In a multimodal framework, our understanding of behavior will be greatly enhanced if we can develop neurological evidence pinpointing the extent to which different modes are in operation.

We believe that the axiomatic approach to neuroeconomics may be of value in areas other than learning. Kenway Louie and Glimcher (2007) present intriguing preliminary evidence of a neurological switch that occurs in the process of decision making, and that may signal that the process of contemplation has come to an end. This suggests the value of studying the neurological and behavioral aspects of the process of "arriving at a decision." The hope is that neuroscientific measurements will provide systematic, if limited, access to the search process that presages the act of choice. This would provide us with new channels for understanding when and how information provided to a decision maker is mapped into the ultimate decision.

The broader backdrop to our research is our belief that the momentum to broaden the empirical basis of decision theory is unstoppable. We see the appropriate response as being to embrace rather than to resist this domain expansion. While Gul and Pesendorfer are correct to point out that language barriers impede interdisciplinary collaboration, these barriers are far from impenetrable. Our goal is not only to penetrate these barriers, but also to show that there is much to be learned through incorporation of nonstandard data into economic theory. Caplin and Dean (2007b) provide an entirely separate example, focused around the question of how long a subject takes to make a decision. Caplin (forthcoming) argues against imposing any ex ante constraints on the types of data that can potentially be included in axiomatic models, even if the only goal is to provide insight into standard choice behavior. Our current argument that the neuroeconomic language barrier can be breached in the case of a specific neurological measurement represents only the tip of the methodological iceberg.

References

Balleine, Bernard W. 2005. "Neural Bases of Food-Seeking: Affect, Arousal, and Reward in Corticostriatalimbic Circuits." Physiology and Behavior, 86(5): 717–30.

Balleine, Bernard W., A. Simon Killcross, and Anthony Dickinson. 2003. "The Effect of Lesions of the Basolateral Amygdala on Instrumental Conditioning." Journal of Neuroscience, 23(2): 666–75.

Barto, Andrew, and Richard Sutton. 1982. "Simulation of Anticipatory Responses in Classical Conditioning by a Neuron-Like Adaptive Element." Behavioral Brain Research, 4(3): 221–35.

Camerer, Colin, and Teck-Hua Ho. 1999. "Experience-Weighted Attraction Learning in Normal Form Games." Econometrica, 67(4): 827–74.

Camerer, Colin, George Loewenstein, and Drazen Prelec. 2005. "Neuroeconomics: How Neuroscience Can Inform Economics." Journal of Economic Literature, 43(1): 9–64.

Caplin, Andrew. Forthcoming. "Economic Theory and Psychological Data: Bridging the Divide." In Perspectives on the Future of Economics: Positive and Normative Foundations, Vol. I, ed. Andrew Caplin and Andrew Schotter. Oxford: Oxford University Press.

Caplin, Andrew, and Mark Dean. 2007a. "Dopamine and Reward Prediction Error: An Axiomatic Treatment." Unpublished.

Caplin, Andrew, and Mark Dean. 2007b. "Revealed Preference and Time to Decide." Unpublished.

Dickinson, Anthony, and Bernard W. Balleine. 2002. "The Role of Learning in the Operation of Motivational Systems." In Stevens' Handbook of Experimental Psychology, Volume 3: Learning, Motivation, and Emotion, ed. Randy Gallistel and Hal Pashler. New York: John Wiley & Sons.

Erev, Ido, and Alvin E. Roth. 1998. "Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique, Mixed Strategy Equilibria." American Economic Review, 88(4): 848–81.

Glimcher, Paul W. 2003. Decisions, Uncertainty, and the Brain: The Science of Neuroeconomics. Cambridge, MA and London: MIT Press.

Gul, Faruk, and Wolfgang Pesendorfer. Forthcoming. "The Case for Mindless Economics." In Perspectives on the Future of Economics: Positive and Normative Foundations, Vol. I, ed. Andrew Caplin and Andrew Schotter. Oxford: Oxford University Press.

Louie, Kenway, and Paul W. Glimcher. 2007. "Temporal Discounting Activity in Monkey Parietal Neurons during Intertemporal Choice." Unpublished.

Pessiglione, Mathias, Ben Seymour, Guillaume Flandin, Raymond J. Dolan, and Chris D. Frith. 2006. "Dopamine-Dependent Prediction Errors Underpin Reward-Seeking Behaviour in Humans." Nature, 442(7106): 1042–45.

Schultz, Wolfram, Paul Apicella, and Tomas Ljungberg. 1993. "Responses of Monkey Dopamine Neurons to Reward and Conditioned Stimuli During Successive Steps of Learning a Delayed Response Task." Journal of Neuroscience, 13(3): 914–23.

Schultz, Wolfram, Peter Dayan, and P. Read Montague. 1997. "A Neural Substrate of Prediction and Reward." Science, 275(5306): 1596–99.
