The Neuroeconomic Theory of Learning
By Andrew Caplin and Mark Dean*

AEA Papers and Proceedings, Vol. 97, No. 2, May 2007

* Caplin: Department of Economics, New York University, 19 West 4th St., New York, NY 10011; Dean: Department of Economics, New York University, 19 West 4th St., New York, NY 10011. Deep thanks are extended to Paul Glimcher and to all of the economists, neuroscientists, and psychologists who contribute to the vibrancy of the New York University Neuroeconomics Seminar.

Paul Glimcher (2003) and Colin Camerer, George Loewenstein, and Drazen Prelec (2005) make powerful cases in favor of neuroeconomic research. Yet in their equally powerful defense of standard “Mindless Economics,” Faruk Gul and Wolfgang Pesendorfer (forthcoming) point to the profound language gap between the two contributing disciplines. For example, for an economist, risk aversion captures preferences among wealth lotteries. From the neuroscientific viewpoint, it is a broader concept related to fear responses and the amygdala. Furthermore, as economic models make no predictions concerning brain activity, neurological data can neither support nor refute these models. Rather than looking to connect such distinct abstractions, Gul and Pesendorfer (forthcoming) argue for explicit separation: “The requirement that economic theories simultaneously account for economic data and brain imaging data places an unreasonable burden on economic theories.”

We share the conviction of Glimcher (2003) and Camerer, Loewenstein, and Prelec (2005) concerning the potential value of neuroeconomics, yet we believe that the field will live up to its potential only if a common conceptual language can be agreed upon. Hence, we face the Gul and Pesendorfer challenge head on by developing theories that simultaneously account for behavioral and brain imaging data. The principal innovation lies in our use of the decision theorists’ standard axiomatic methodology in this highly nonstandard setting. This removes any linguistic confusion by defining concepts directly in terms of their empirical counterparts. It also allows us to pinpoint how to design experiments directed to the central tenets of the theory, rather than to particular parametrizations. If these experimental tests reveal the theory to be wanting, then knowing which axiom is violated will aid in the development of appropriate alternatives.

For our initial foray into this line of research, we focus on learning theory (Caplin and Dean 2007a). This represents an ideal test case for the integrative methodology since neuroscientists have independently formulated a specific theory of neurological function, the “Dopaminergic Reward Prediction Error” (DRPE) hypothesis, which has important behavioral implications.

Dopamine is a neurotransmitter for which release had previously been hypothesized to reflect “hedonia,” as when a thirsty monkey is given a squirt of juice. Yet Wolfram Schultz, Paul Apicella, and Tomas Ljungberg (1993) found that if such a monkey learns to associate a tone with later receipt of fruit juice, the dopaminergic response occurs when the tone is heard, not when the juice is received. Dopamine somehow appears to signal changes in the anticipated value of rewards. Schultz, Peter Dayan, and P. Read Montague (1997) noted that a “prediction error” signal of this form is precisely what is needed in reinforcement algorithms to drive convergence toward a standard dynamic programming value function (Andrew Barto and Richard Sutton 1982). The DRPE hypothesis of neuroscience asserts that dopamine does measure a reward prediction error that is used to update an evolving value function. In addition to more standard tests, Mathias Pessiglione et al. (2006) have shown that neurological interventions aimed at the dopamine system can have an impact on the rate at which learning appears to take place, much as the theory suggests.

Economists have become interested in reinforcement learning for their own reasons. While the natural assumption within economics is that inference is Bayesian, this assumption has little predictive power in complex environments. The need for predictive models led economists toward the reinforcement model, and Ido Erev and Alvin E. Roth (1998) have demonstrated that simple variants of this model match the pattern of behavior in a wide variety of standard games. Camerer and Teck-Hua Ho (1999) enrich the model by adding counterfactual learning based on past outcomes. It is clear, however, that these simple “history-based” models will match behavior only in relatively stable environments. We believe that neuroscientific research has the potential to suggest next steps in modeling learning in more complex environments. Indeed, Anthony Dickinson and Bernard W. Balleine (2002) have begun to uncover evidence suggesting that dopaminergic reinforcement is but one of several neurological modules related to learning.

Unfortunately there is a profound language barrier that largely prevents economists from embracing the growing neuroscientific evidence on learning. In economics, concepts such as utility and reward are inferred from observed choices, while neuroscientists interpret them in relation to intuitions concerning the flow of experience (e.g., a squirt of juice is assumed to be rewarding to a thirsty monkey). In fact, many neuroscientific tests of the DRPE hypothesis take the perspective of “classical” or “Pavlovian” conditioning in which choice plays no role, rendering economic interpretation impossible. As with risk aversion, the fact that economic and psychological concepts have identical names does not imply identical interpretations.

Caplin and Dean (2007b) take an axiomatic perspective on the DRPE hypothesis and characterize its empirical implications for a data tape with combined information on choice and dopaminergic activity. If the data do not obey our axioms, then the DRPE model is fundamentally wrong, not merely misspecified. Our approach allows us to identify and rectify a problem in current quantitative tests of the DRPE hypothesis. In these tests, it is typical to treat neurologically measured dopaminergic signals as defined only up to linear transformations, with a quantitatively larger dopaminergic response identifying a larger reward difference. We pinpoint the somewhat harsh assumptions that are needed to justify this conclusion. Our central result justifies only an ordinal version of the DRPE hypothesis in which reward differences are ill-defined, just as marginal utility is ill-defined in ordinal characterizations. We strengthen the assumptions as required to justify use of dopamine as a measuring rod for differences in reward.

I. Basic Propositions

We develop the DRPE hypothesis for a case in which probabilities are objective and dopaminergic responses derive from realizations of specific lotteries over final prizes. An advantage of this simple lottery-outcome framework is that it avoids tying our formulation to a particular model of learning. We consider a setting in which the agent is either endowed with or chooses a specific lottery from which a prize is realized. We observe any initial act of choice among lotteries and the dopaminergic response when the prize is realized. Definition 1 lays out the various prize and lottery sets studied in the model as well as our idealized measure of the dopamine response rate.

Definition 1: The set of prizes is a compact metric space $Z$ with generic element $z \in Z$. The set of all simple lotteries over $Z$ is denoted $L$, with generic element $p \in L$. We define $e_z \in L$ as the degenerate lottery that assigns probability 1 to prize $z \in Z$ and the set $L(z)$ to comprise all lotteries with prize $z$ in their support,

$$L(z) \equiv \{\, p \in L \mid p_z > 0 \,\} \subset L. \tag{1}$$

The function $d : M \to \mathbb{R}$ identifies the idealized dopamine response function (DRF), where $M$ comprises all pairs $(z, p)$ with $z \in Z$ and $p \in L(z)$.

The dopaminergic reward prediction error hypothesis hinges on the existence of a function defining the “expected” and the “experienced” reward associated with receipt of each possible prize from any given lottery. Under the assumption that the expected reward of a degenerate lottery is equal to the experienced reward of that prize, what we are looking for is a dopaminergic reward function $r : L \to \mathbb{R}$ which defines both the expected reward associated with each lottery and the experienced reward associated with each prize. A basic assumption is that this reward function contains all the information that determines dopamine release.

Definition 2: A function $r : L \to \mathbb{R}$ fully summarizes a DRF $d : M \to \mathbb{R}$, if there exists a function $E : r(Z) \times r(L) \to \mathbb{R}$ such that, given $(z, p) \in M$,

$$d(z, p) = E\big(r(e_z), r(p)\big), \tag{2}$$

where $r(Z)$ comprises all values $r(e_z)$ across degenerate lotteries, and $r(L)$ identifies the range across all lotteries. In this case, we say that $r$ and $E$ represent the DRF.

A DRPE representation rests not only on the ability to use reward computations to understand all dopaminergic responses, but also on these responses having appropriate order properties.

(DARPE) representation if there exists a function $r : L \to \mathbb{R}$ and a strictly increasing function $G : r(Z) - r(L) \to \mathbb{R}$, such that, given $(z, p) \in M$,

$$d(z, p) = G\big(r(e_z) - r(p)\big), \tag{3}$$

where $r(Z) - r(L)$ comprises all values $r(e_z) - r(p)$ across $(z, p) \in M$.
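As a rough illustration of how the objects in Definitions 1 and 2 and the additive form in equation (3) fit together, the following sketch builds a toy DRF on a finite prize set. The prize names, the utility values in `U`, the choice of `tanh` for $G$, and the use of an expected value for the reward function $r$ are all assumptions made for this example only; they are not taken from the paper's data or tests.

```python
from math import tanh

# Hypothetical finite prize set Z with illustrative utility values u(z).
U = {"nothing": 0.0, "small": 1.0, "large": 3.0}


def degenerate(z):
    """e_z: the lottery that assigns probability 1 to prize z."""
    return {z: 1.0}


def reward(p):
    """r(p): ASSUMED here to be the expected value of u under lottery p."""
    return sum(prob * U[z] for z, prob in p.items())


def in_M(z, p):
    """(z, p) belongs to M only if z is in the support of p, i.e. p_z > 0."""
    return p.get(z, 0.0) > 0.0


def G(x):
    """A strictly increasing function; tanh is an arbitrary illustrative choice."""
    return tanh(x)


def drf(z, p):
    """Toy DRF of the form d(z, p) = G(r(e_z) - r(p)), defined only on M."""
    if not in_M(z, p):
        raise ValueError("prize is not in the support of the lottery")
    return G(reward(degenerate(z)) - reward(p))


fifty_fifty = {"nothing": 0.5, "large": 0.5}
mostly_large = {"nothing": 0.1, "large": 0.9}

# A more rewarding prize from the same lottery gives a higher response.
better_prize = drf("large", fifty_fifty) > drf("nothing", fifty_fifty)
# The same prize from a more rewarding lottery gives a lower response.
better_lottery = drf("large", mostly_large) < drf("large", fifty_fifty)
# A fully anticipated prize gives the same response whatever the prize.
no_surprise = {z: drf(z, degenerate(z)) for z in U}
```

In this toy setup both comparisons evaluate to `True` and every entry of `no_surprise` is zero, while calling `drf("small", fifty_fifty)` raises an error because that pair lies outside $M$.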
Intuitively, the dopaminergic response should be strictly higher for a more rewarding prize than it is for a less rewarding prize, and from a less rewarding lottery than from a more rewarding lottery. In addition to depending on the existence of an appropriate reward function, the DRPE hypothesis rests on the assumption that, if expectations are met, the dopaminergic response does not depend on what was expected.

Definition 6: A DRF $d : M \to \mathbb{R}$ admits a dopaminergic expected reward prediction error (DERPE) representation if there exist functions $r : L \to \mathbb{R}$, $E : r(Z) \times r(L) \to \mathbb{R}$ that form a DRPE representation of the DRF in which

$$r(p) \equiv \mu_p[u] \quad \text{all } p \in L,$$

for some function $u : Z \to \mathbb{R}$, where $\mu_p[u]$ denotes