
Integrated accounts of behavioral and neuroimaging data using flexible models

Amir Dezfouli^{1,2}, Richard Morris^{3}, Fabio Ramos^{3}, Peter Dayan^{4}, Bernard W. Balleine^{1}

^{1}UNSW Sydney   ^{2}Data61, CSIRO   ^{3}University of Sydney   ^{4}Gatsby Unit, UCL

Abstract

Neuroscience studies of human decision-making abilities commonly involve subjects completing a decision-making task while BOLD signals are recorded using fMRI. Hypotheses are tested about which regions mediate the effect of past experience, such as rewards, on future actions. One standard approach to this is model-based fMRI data analysis, in which a model is fitted to the behavioral data, i.e., a subject’s choices, and then the neural data are parsed to find brain regions whose BOLD signals are related to the model’s internal signals. However, the internal mechanics of such purely behavioral models are not constrained by the neural data, and therefore might miss or mischaracterize aspects of the brain. To address this limitation, we introduce a new method using recurrent neural network models that are flexible enough to be jointly fitted to the behavioral and neural data. We trained a model so that its internal states were suitably related to neural activity during the task, while at the same time its output predicted the next action a subject would execute. We then used the fitted model to create a novel visualization of the relationship between the activity in brain regions at different times following a reward and the choices the subject subsequently made. Finally, we validated our method using a previously published dataset. We found that the model was able to recover the underlying neural substrates that were discovered by explicit model engineering in the previous work, and also derived new results regarding the temporal pattern of brain activity.

1 Introduction

Decision-making circuitry in the brain enables humans and animals to learn from the consequences of their past actions to adjust their future choices. The role of different brain regions in this circuitry has been the subject of extensive research [Gold and Shadlen, 2007, Doya, 2008], with one of the main challenges being that decisions – and thus the neural activity that causes them – are affected not only by the immediate events in the task, but also by a potentially long history of previous inputs, such as rewards, actions and environmental cues. As an example, assume that subjects make choices in a bandit task while their brain activity is recorded using fMRI, and we seek to determine which brain regions are involved in reward processing. Key signals, such as reward prediction errors, are determined not only by the current reward, but also by a potentially extensive history of past inputs. Thus, it is inadequate merely to find brain regions showing marked BOLD changes just in response to reward.

An influential approach to this problem has been to use model-based analysis of fMRI data [e.g., O’Doherty et al., 2007, Cohen et al., 2017], which involves training a computational model using behavioral data and then searching the brain for regions whose BOLD activity is related to the internal signals and variables of the model. Examples include fitting a reinforcement-learning model to the choices of subjects (with learning-rates etc. as the model parameters) and then finding the brain regions that are related to the estimated value of each action or other variables of interest [e.g., Daw et al., 2006]. One major challenge for this approach is that, even if the model produces actions similar to those of the subjects, the variables and summary statistics that the brain explicitly tracks might not transparently correspond to those of the hypothetical model. In this case, either the relevant signals in the brain will be missed in the analysis, or the model will have to be altered manually in the hope that the new signals in the model resemble neural activity in the brain.

In contrast, here, we propose a new approach using a recurrent neural network as a type of model that is sufficiently flexible [Siegelmann and Sontag, 1995] to represent the potentially complex neural computations in the brain, while also closely matching subjects’ choice behavior. In this way, the model learns to learn the task such that (a) its output matches subjects’ choices; and (b) its internal mechanism tracks subjects’ brain activity. A model trained using this approach ideally provides an end-to-end model of neural decision-making circuitry that does not rely on manual engineering, but describes how past inputs are translated to future actions through a successive set of computations occurring in different brain regions.

Having introduced the architecture of this recurrent neural network meta-learner, we show how to interpret it by unrolling it over space and time to determine the role of each brain region at each time slice in the path from reward processing to action selection. We show that experimental results obtained using our method are consistent with the previous literature on the neural basis of decision-making and provide novel insights into the temporal dynamics of reward processing in the brain.

2 Related work

There are at least four types of previous approaches. In the first type, which includes model-based fMRI analysis and some work on complex non-linear recurrent dynamical systems [Sussillo et al., 2015], the models are trained on the behavioral data and are only then applied to the neural data. By contrast, we include neural data at the outset. In a second type, recurrent neural networks are trained to perform a task [e.g., to maximize reward; Song et al., 2017], but without the attention that we give to both the psychological and neural data. A third type aims to uncover the dynamics of the interaction between different brain regions by approximating the underlying neural activity (see Breakspear [2017] for review). However, unlike our protocol, these models are not trained on behavioral data. A fourth type relies on two separate models for the behavioral and neural data but, unlike model-based fMRI analyses, the free parameters of the two models are jointly modeled and estimated, e.g., by assuming that they follow a joint distribution [Turner et al., 2013, Halpern et al., 2018]. Nevertheless, similar to model-based fMRI, this approach requires manual model engineering and is limited by how well the hypothesized behavioral model characterizes its underlying neural processes.

3 The model

3.1 Data

We consider a typical neuroscience study of decision-making processes in humans, in which the data include the actions of a set of subjects while they are making choices and receiving rewards ($\mathcal{D}_{\text{BEH}}$) in a decision-making task, while their brain activity in the form of fMRI images is recorded ($\mathcal{D}_{\text{fMRI}}$). Behavioral data include the states of the environment (described by set $\mathcal{S}$), choices executed by the subjects in each state, and the rewards they receive. At each time $t \in \mathcal{T}$, subject $i$ observes state $s^i_t \in \mathcal{S}$ as an input, calculates and then executes action $a^i_t$ (e.g., presses a button on a computer keyboard; $a^i_t \in \mathcal{A}$, where $\mathcal{A}$ is the set of actions) and receives a reward $r^i_t \in \mathbb{R}$ (e.g., a monetary reward). The behavioral data can be described as,

$\mathcal{D}_{\text{BEH}} = \{ (s^i_t, a^i_t, r^i_t) \mid i = 1 \ldots N_{\text{SUBJ}},\; t \in \mathcal{T} \}.$   (1)

The second component of the data is the recorded brain activity in the form of 3D images taken by the scanner during the task. Each image can be divided into a set of voxels ($N_{\text{VOX}}$ voxels; e.g., 3mm x 3mm x 3mm cubes), each of which has an intensity (a scalar number) which represents the neural activity of the corresponding brain region at the time of image acquisition by the scanner.
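For concreteness, here is a minimal sketch of how the two data sets might be laid out in memory; the variable names, the dictionary layout and the TR value are illustrative, not from the paper (the real sizes, reported in Section 4.1, are noted in the comments):

```python
import numpy as np

# Real sizes reported in Section 4.1: N_SUBJ = 22, N_ACQ = 1136, N_VOX = 63191.
# Toy voxel count is used here so the sketch runs in a few MB; TR is an assumed value.
N_SUBJ, N_ACQ, N_VOX = 22, 1136, 500
TR = 2.0  # repetition time in seconds (illustrative)

# Behavioral data D_BEH: one (state, action, reward) record per subject and time step.
beh = {i: [] for i in range(N_SUBJ)}
beh[0].append({"state": 0, "action": 1, "reward": 0.0})  # (s_t^i, a_t^i, r_t^i)

# fMRI data D_fMRI: voxel intensities y_t^{i,v}, one image acquired every TR seconds.
fmri = np.zeros((N_SUBJ, N_ACQ, N_VOX), dtype=np.float32)
acquisition_times = np.arange(N_ACQ) * TR  # 0, TR, 2TR, ..., (N_ACQ - 1) TR
```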

Figure 1: Architecture of the model. The model has an RNN which consists of a set of GRU cells, and receives previous actions, rewards and the current state of the environment as inputs ($N_{\text{CELLS}}$ is the number of cells in the RNN layer). The outputs/states of the RNN layer ($h_t$) are connected to a middle layer (shown by red circles) with the same number of units as there are voxels ($N_{\text{VOX}}$); the outputs of the units $u_t$ are weighted sums over their inputs. Each component of $u_t$ is convolved with the HRF signal and is compared to the bias-adjusted intensity of its corresponding voxel in the fMRI recordings ($y_t$). Voxels are shown by the squares overlaying the brain, and three of them are highlighted (in blue) as an example of how they are connected to the units in the middle layer. The outputs of the GRU cells are also connected to a softmax layer (the green lines), which outputs the probability of selecting each action on the next trial (in this case, EYE and HAND are the available actions). $\mathcal{L}_{\text{BEH}}$ refers to the behavioral and $\mathcal{L}_{\text{fMRI}}$ to the fMRI loss function. The final loss function is denoted by $\mathcal{L}(\Theta)$, which is a weighted sum of the fMRI and the behavioral loss functions; $\Theta$ contains all the parameters.

Images are acquired at times $0, TR, 2TR, \ldots, (N_{\text{ACQ}}-1)TR$, where $TR$ refers to the repetition time of the scanner (the time between image acquisitions), and $N_{\text{ACQ}}$ is the total number of images. Let $y^{i,v}_t$ denote the intensity of voxel $v$ recorded at time $t$ for subject $i$. The fMRI data take the following form:

$\mathcal{D}_{\text{fMRI}} = \{ y^{i,v}_t \mid t = 0, TR, 2TR, \ldots, (N_{\text{ACQ}}-1)TR,\; i = 1 \ldots N_{\text{SUBJ}},\; v = 1 \ldots N_{\text{VOX}} \}.$   (2)

3.2 Network architecture

Actions taken by a subject at each point in time are affected by the history of previous rewards, actions and states experienced by the subject. Aspects of this history are encoded in neural activity in a persistent, albeit mutating, form, and enable subjects’ future choices to benefit from past experience. This process constitutes learning in the task; we aim to recover it by jointly modeling the behavioral and neural data. We first describe the network architecture and then explain how it can be interpreted to answer the questions of interest.

RNN layer. The model (Figure 1) is a specific form of recurrent neural network (RNN). The recurrent layer consists of a set of $N_{\text{CELLS}}$ GRU cells [Gated recurrent unit; Cho et al., 2014]; cell $c$ outputs its state $h^c_t$ at time $t$. We define $h_t$ as the state of the whole RNN network ($h_t = [h^1_t, \ldots, h^{N_{\text{CELLS}}}_t]^\top$). This state summarizes the past history of the inputs to the network and is updated as new inputs are received according to a function that we denote by $f$,

$h_t = f(a_{t-1}, r_{t-1}, s_t, h_{t-1}; \Theta),$   (3)

depending on parameters $\Theta$. We aim to train the parameters of this dynamical system to approximate the underlying neural computations in the brain that translate previous inputs to future actions during the task.

fMRI layer. To establish a correspondence between the underlying RNN and neural activity, one training signal for $\Theta$ comes from requiring the activity of each voxel at each point in time to be described as a (noisy) linear combination of GRU cell states (shown by the red connections in Figure 1). We denote the weights of this linear combination as $W \in \mathbb{R}^{N_{\text{VOX}} \times N_{\text{CELLS}}}$, and $u_t$ as a vector of size $N_{\text{VOX}}$ representing the predicted neural activity at each voxel at time $t$. Thus,

$u_t = W h_t.$   (4)
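A minimal Keras sketch of this architecture (the paper’s TensorFlow implementation and hyper-parameters are in its supplementary material; the sizes and layer names below are illustrative, and the softmax behavioral layer is the one described below and in Figure 1):

```python
import tensorflow as tf

N_CELLS, N_VOX, N_ACTIONS, INPUT_DIM = 16, 500, 2, 8  # illustrative sizes

# Input at step t: concatenation of previous action a_{t-1}, previous reward r_{t-1}
# and current state s_t, as in equation 3.
inputs = tf.keras.Input(shape=(None, INPUT_DIM))  # (batch, time, features)

# RNN layer: GRU cells whose states h_t summarize the input history.
h = tf.keras.layers.GRU(N_CELLS, return_sequences=True)(inputs)

# fMRI layer: u_t = W h_t (equation 4); the bias b is added later, in the fMRI loss.
u = tf.keras.layers.Dense(N_VOX, use_bias=False, name="voxel_readout")(h)

# Behavioral layer: softmax over the probability of selecting each action next.
pi = tf.keras.layers.Dense(N_ACTIONS, activation="softmax", name="policy")(h)

model = tf.keras.Model(inputs=inputs, outputs=[u, pi])
```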

For training the model, the predicted neural activity is compared with the actual activity recorded by the scanner. However, neural activity is not instantly reflected in the intensity recorded by the scanner, but is delayed according to the haemodynamic response function (HRF; Figure S3). To correct for this delay, the elements of $u_t$ are first convolved with the HRF [Henson and Friston, 2007], and, after adding a bias term $b$, are compared with the intensities of the corresponding voxels, to form the following loss function,

$\mathcal{L}_{\text{fMRI}}(\Theta) = \sum_t \| u_t \circledast \mathrm{HRF} + b - y_t \|^2, \quad t \in \{0, TR, \ldots, (N_{\text{ACQ}}-1)TR\},$   (5)

in which $\Theta$ denotes the model parameters ($W$ and the RNN parameters), and $y_t$ is a vector of size $N_{\text{VOX}}$ containing the recorded activity of each voxel at time $t$. The symbol $\circledast$ is the convolution operator. The above loss function can be thought of as the negative logarithm of a Gaussian likelihood function. Note that in this case the convolution operator acts on the output of the network, and so this is not a conventional convolutional neural network, in which convolutions act on the input.

Behavioral layer. To ensure that the RNN also captures the behavioral data, a second training signal for $\Theta$ comes from requiring it to produce actions similar to those of humans. This is achieved by connecting the output of the RNN network to a softmax layer (shown by the green lines in Figure 1), in which the weights of the connections determine the influence of each cell on the probability of selecting actions. Denoting by $\pi_t(a)$ the predicted probability of taking action $a$ at time $t$, we define the behavioral loss function as:

$\mathcal{L}_{\text{BEH}}(\Theta) = -\sum_{t \in \mathcal{T}'} \log \pi_t(a_t),$   (6)

in which $\mathcal{T}'$ refers to the time steps at which the subject was allowed to execute an action.

Training. We define the overall loss function as the weighted sum of the behavioral and fMRI loss functions,

$\mathcal{L}(\Theta) = \sum_{i=1}^{N_{\text{SUBJ}}} \left[ \mathcal{L}_{\text{BEH}}(\Theta; \mathcal{D}_i) + \lambda\, \mathcal{L}_{\text{fMRI}}(\Theta; \mathcal{D}_i) \right],$   (7)

with parameter $\lambda$ determining the contribution of the fMRI loss function, and $\mathcal{D}_i$ denoting the data of subject $i$. Note that the above loss function can be thought of as the negative logarithm of the product of a Gaussian likelihood function (for the fMRI part) – with $\lambda$ being related to the level of noise/variance in the likelihood function – and a multinomial likelihood function (for the behavioral part).
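A conceptual sketch of how one subject’s loss terms could be assembled (written in NumPy for clarity rather than training; the HRF kernel, the index arrays and the value of $\lambda$ are assumed inputs, and the signs follow equations 5–7):

```python
import numpy as np

def subject_loss(u, pi, y, actions, hrf, b, lam, acq_idx, choice_idx):
    """u: (T, N_VOX) predicted activity; pi: (T, N_ACTIONS) action probabilities;
    y: (N_ACQ, N_VOX) recorded intensities; actions: (n_choices,) indices of the
    actions the subject actually took at the choice time steps choice_idx."""
    # Equation 5: convolve each voxel's predicted time course with the HRF, then
    # compare (after adding the bias b) with the recorded intensities at the
    # acquisition time steps acq_idx.
    u_hrf = np.apply_along_axis(lambda x: np.convolve(x, hrf)[: len(x)], 0, u)
    l_fmri = np.sum((u_hrf[acq_idx] + b - y) ** 2)

    # Equation 6: negative log-likelihood of the chosen actions.
    l_beh = -np.sum(np.log(pi[choice_idx, actions]))

    # Equation 7: one subject's contribution to the overall loss.
    return l_beh + lam * l_fmri
```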

3.3 Interpreting the model

We seek to understand how the inputs to the network (previous rewards, actions, states) affect future actions through the medium of the brain’s neural activity. Although different methods have been suggested for investigating the way the inputs to a neural network determine its outputs, the most fundamental quantity is the gradient of the output with respect to the input, which represents how much the output changes by changing the input (as used, for instance, by Simonyan et al. [2013] in the context of an image classification task). Inspired by this, we defined two differential quantities relating rewards, actions and brain activity to each other.

There are at least two ‘layers’ to this: off- and on-policy. In the off-policy setting, which is conventionally studied in model-based imaging, there is a fixed sequence of inputs, whose effects on future predicted probabilities and neural activities we determine. In the on-policy setting, which is used in settings such as approximate Bayesian computation [Sunnåker et al., 2013], future choices, and thus future inputs, are also affected by past inputs. For the present, we consider the simpler, off-policy setting. This allows us to look, for instance, at the brain regions involved in mediating the effect of the reward that subject $i$ actually received at, say, time $t_1$ on the predicted probability of the action that the subject actually executed at, say, time $t_2$. For convenience, we drop the notation for the fixed inputs for the subject, and indeed for the subject number (since we fit a single model to the whole group).

The first measure represents the behavioral effect of reward on future actions, which can be calculated as the gradient of the predicted probabilities of actions at each time $t_2$ with respect to the input received at time $t_1$. For the case of binary choices, which are the focus of the current experiment, with EYE and HAND as the two available actions in the task, we only needed to calculate the probability for one of the actions. Let $\pi_{t_2}$ denote the probability of taking action EYE at time $t_2$. The effect of reward at time $t_1$ on the action at time $t_2$ can be calculated as follows,

$d^{\pi r}_{t_1,t_2} = \dfrac{\partial \pi_{t_2}}{\partial r_{t_1}}.$

This is a straightforward derivative (calculated using automatic differentiation), noting again that we consider the inputs received by the network between $t_1$ and $t_2$ to be fixed. $d^{\pi r}_{t_1,t_2}$ can be thought of as capturing how much the probability of taking action EYE at time $t_2$ increases as the result of increasing the magnitude of the reward earned at time $t_1$.

The second measure relates behavioral and fMRI data by exploiting the informational association between the predicted neural activity $u_t$ and the state of the RNN, $h_t$. First, note that, at each time $t$, $h_t$ is a Markov state for the RNN, in that, given $h_t$, RNN outputs after time $t$ are independent of their past. Thus, we can decompose:

$\dfrac{\partial \pi_{t_2}}{\partial r_{t_1}} = \sum_{k=1}^{N_{\text{CELLS}}} \dfrac{\partial \pi_{t_2}}{\partial h^k_t} \dfrac{\partial h^k_t}{\partial r_{t_1}}, \quad \text{for any } t \in \{t_1+1, \ldots, t_2\},$   (8)

as the effect that changing $r_{t_1}$ has on the predicted RNN state $h^k_t$ at time $t$, times the effect that a change in $h^k_t$ has on the action probability $\pi_{t_2}$ at $t_2$. Now, consider the case that $W^\top W$ is non-singular (note that $N_{\text{VOX}} \gg N_{\text{CELLS}}$). This implies that there is a one-to-one mapping between the RNN state and the predicted neural activity:

$h_t = (W^\top W)^{-1} W^\top u_t.$   (9)
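In code, recovering the RNN state from the predicted voxel activity amounts to a least-squares solve (a sketch; $W$ here is the fitted read-out matrix of equation 4):

```python
import numpy as np

def state_from_voxels(W, u_t):
    """W: (N_VOX, N_CELLS) read-out weights; u_t: (N_VOX,) predicted voxel activity.
    Returns h_t = (W^T W)^{-1} W^T u_t (equation 9); np.linalg.lstsq computes the same
    quantity as a least-squares solve, which is numerically more stable than forming
    the inverse explicitly."""
    h_t, *_ = np.linalg.lstsq(W, u_t, rcond=None)
    return h_t
```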

Thus, we can rewrite equation 8 in terms of the effect that changing $r_{t_1}$ has on the predicted neural activity $u^v_t$ in each voxel at time $t$, times what a change in $u^v_t$ implies about a change in $\pi_{t_2}$, operating implicitly via what the change in $u^v_t$ tells us about a change in $h_t$. We can write this as,

$\dfrac{\partial \pi_{t_2}}{\partial r_{t_1}} = \sum_{v=1}^{N_{\text{VOX}}} \dfrac{\partial \pi_{t_2}}{\partial u^v_t} \dfrac{\partial u^v_t}{\partial r_{t_1}}, \quad t = t_1+1, \ldots, t_2.$   (10)

Note that this is a correlational relationship – the direction of causality is from $h_t$ to $u_t$. Nevertheless, the individual terms in this sum:

$d^{\pi u r}_{v,t_1,t,t_2} = \dfrac{\partial \pi_{t_2}}{\partial u^v_t} \dfrac{\partial u^v_t}{\partial r_{t_1}}, \quad \text{for any } t \in \{t_1+1, \ldots, t_2\},$   (11)

combine the influence that voxel $u^v_t$ at time $t$ receives from the reward at time $t_1$ (which is $\partial u^v_t / \partial r_{t_1}$) with the covariation between the voxel activity and the action at time $t_2$ (which is $\partial \pi_{t_2} / \partial u^v_t$). This quantifies the intermediation of voxel $u^v_t$ between the reward at $t_1$ and the action at time $t_2$.

We make two remarks: (i) the joint fitting of the model to both the behavioral and fMRI data is important because, if there are behaviorally equivalent solutions, the one that can also fit the neural data should be chosen; and (ii) for equation 10 to hold it is necessary for the state of the network to be fully determined by the neural activity ($u_t$). This holds in the case of GRU cells (provided that the hidden units do not partition into separate behavioral and neural groups). In contrast, in LSTM cells [Long short-term memory; Hochreiter and Schmidhuber, 1997], the cell states and cell outputs are different and both are required to determine the outputs at the next time-step; therefore, equation 10 does not hold for LSTM cells.
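The sketch below illustrates how both quantities could be computed with automatic differentiation, assuming the RNN is unrolled with an explicit GRU cell so that each $h_t$ is a node in the computation graph. The layer sizes, the read-out matrix `W` and all variable names are illustrative; in practice `W` and the cell would be the fitted parameters, and the calculation would be vectorized over voxels rather than done one voxel at a time.

```python
import tensorflow as tf

N_CELLS, N_VOX, N_ACTIONS = 8, 200, 2            # illustrative sizes
cell = tf.keras.layers.GRUCell(N_CELLS)          # recurrent layer (equation 3)
policy = tf.keras.layers.Dense(N_ACTIONS, activation="softmax")  # behavioral layer
W = tf.Variable(tf.random.normal((N_VOX, N_CELLS)))              # fMRI read-out (eq. 4)


def intermediation(inputs, rewards, t1, t, t2, v, eye=0):
    """Returns d^{pi r}_{t1,t2} and d^{pi u r}_{v,t1,t,t2} (equations 8-11) for a fixed
    off-policy input sequence, with t1 < t <= t2. `inputs` is (T, D) and excludes the
    reward column; `rewards` is (T,) and is the only input we differentiate against."""
    rewards = tf.Variable(tf.cast(rewards, tf.float32))
    with tf.GradientTape(persistent=True) as tape:
        h = tf.zeros((1, N_CELLS))
        hs, pis = [], []
        for step in range(inputs.shape[0]):      # unroll so every h_t is a graph node
            x = tf.concat([tf.cast(inputs[step], tf.float32),
                           rewards[step:step + 1]], axis=0)
            _, state = cell(x[tf.newaxis, :], [h])
            h = state[0]
            hs.append(h)
            pis.append(policy(h))
        target = pis[t2][0, eye]                 # pi_{t2}: P(a_{t2} = EYE)
        h_t = hs[t]                              # Markov state at intermediate time t
        u_tv = tf.tensordot(W[v], h_t[0], 1)     # predicted activity of voxel v at t

    d_pi_r = tape.gradient(target, rewards)[t1]            # behavioral measure d^{pi r}
    dpi_dh = tape.gradient(target, h_t)[0]                 # d pi_{t2} / d h_t (eq. 8)
    du_dr = tape.gradient(u_tv, rewards)[t1]               # d u_t^v / d r_{t1}
    pinv = tf.linalg.inv(tf.transpose(W) @ W) @ tf.transpose(W)  # (W^T W)^{-1} W^T
    dpi_du = tf.tensordot(pinv[:, v], dpi_dh, 1)           # d pi_{t2} / d u_t^v (eqs. 9-10)
    return d_pi_r, dpi_du * du_dr                          # d^{pi u r} (equation 11)
```

Summing the per-voxel terms over all voxels recovers the decomposition of equation 10; a practical implementation would compute them for every voxel at once (e.g., via a Jacobian over $u_t$) rather than calling this function per voxel.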

4 Results

In this section we aim to show how the above measures can be used to study the neural substrates of decision-making in the brain.

Figure 2: The task. Each trial started with the presentation of a screen (the left-most square in the figure) and the subjects had 2.5 seconds to make an eye saccade to the red target circle or press a button with their right hand. After a delay (3.5 seconds) during which the screen showed only a fixation point, subjects received the outcome of their choice, which indicated whether their choice was rewarded. The next trial started after an inter-trial interval (ITI) that varied between 1 and 8 seconds. Figure reprinted with permission from Wunderlich et al. [2009]. Copyright (2009) National Academy of Sciences.

4.1 Task and subjects

The data used here were previously published in Wunderlich et al. [2009]. The structure of the decision-making task is shown in Figure 2. In each trial subjects had a choice between making a saccade (EYE) or pressing a button (HAND). Choices were rewarded with varying probabilities across the experiment. There were four trial types in the task: (i) free-choice trials (150 trials), in which subjects could choose between EYE and HAND; (ii) forced-choice trials in which subjects were instructed to choose EYE (50 trials) or (iii) HAND (50 trials); and (iv) null trials in which no reward was received irrespective of the action selected (50 trials). Forced-choice and null trials were randomly inserted between the free-choice trials. The environment consisted of two actions (EYE and HAND) and five states corresponding to the four trial types and one state in which the choice outcomes were shown (reward or no-reward). Actions and states are assumed to be coded using one-hot representations. Since time was discretized (see below), there were time points at which no action was taken or no visual stimulus was shown, in which case states and actions were coded using zero vectors; a sketch of this coding is given below.

The total number of subjects was $N_{\text{SUBJ}} = 22$, and in total $N_{\text{ACQ}} = 1136$ images were acquired by the scanner, each containing $N_{\text{VOX}} = 63191$ voxels. Therefore, the fMRI data can be summarized as a tensor of size $22 \times 1136 \times 63191$. Each subject made $\sim$300 choices. See the Supplementary Material for the details of fMRI preprocessing and model settings.
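A short sketch of the input coding described above (the ordering of the states and the choice of dtype are assumptions for illustration):

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # 4 trial types + 1 outcome state; EYE and HAND
EYE, HAND = 0, 1

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

state_free_choice = one_hot(0, N_STATES)   # e.g., cue for a free-choice trial
state_outcome = one_hot(4, N_STATES)       # outcome (reward / no-reward) screen
action_hand = one_hot(HAND, N_ACTIONS)     # the subject pressed the button

# At discretized time points where nothing is shown and no action is taken,
# states and actions are coded as zero vectors.
state_empty = np.zeros(N_STATES, dtype=np.float32)
action_empty = np.zeros(N_ACTIONS, dtype=np.float32)
```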
4.2 Model settings

All the methods were implemented in Tensorflow [Abadi et al., 2016] and gradients (for the optimization and interpretation of the model) were calculated using the automatic differentiation methods available in this package. See the Supplementary Material for the model settings.

4.3 From reward to action

Figure 3(a,b) shows two sets of off-policy simulations. In each simulation there are four choice states, the times of which are shown by the vertical grey patches in the top panels. The red patch following each grey ribbon shows the time at which the outcome was revealed following that choice. The first choice was rewarded (shown by ‘R’ in the graph), but the rest were not. In panel (a) action HAND was selected in all choice states whereas in panel (b) it was action EYE. Based on this, since in panel (a) the reward was earned when HAND was selected, we expected that choice to decrease the probability of selecting action EYE on the next choice. This is shown by the blue bars, which illustrate the gradient of the probability of selecting the EYE action at each subsequent choice with respect to the amount of reward earned after the first choice ($d^{\pi r}$). For panel (b), since the reward was earned as a consequence of choosing EYE in the first choice, we expected the reward to have a positive effect on the probability of selecting EYE on the next trials, which is consistent with the graph.

Next we asked about the intermediation of each brain region between the reward earned after the first choice ($t_1$) and the next choice ($t_2$), shown by the red arrow in Figure 3(a). To answer this question, we calculated $d^{\pi u r}$ for every voxel and every time-step between $t_1$ and $t_2$, and masked out the voxels that were not in the top one percent. By focusing only on the 99th percentile of $|d^{\pi u r}|$, we hoped to limit our analysis to the circuitry known to be involved in decision-making. The resulting voxel maps are shown in Figure 3(c) for the case of the HAND action corresponding to the inputs shown in panel (a), and Figure 3(d) shows the time-course of changes in $d^{\pi u r}$. See Figure S1(c,d) for the EYE action corresponding to the inputs shown in panel (b).

The results show that, for each action, the top 1% of voxels contain three key cortical and subcortical brain regions known to be critically involved in reward-processing and decision-making, i.e., (i) the striatum (associative, aStr; or ventral, vStr), (ii) the anterior cingulate cortex (ACC) and (iii) the supplementary motor area (SMA) [Rangel and Hare, 2010, Wunderlich et al., 2009]. We first note that these anatomical regions are among the same anatomical regions that Wunderlich et al. [2009] also identified as involved in decision-making in this task (see Figure S4 for the time course of changes in $d^{\pi u r}$ for the voxel coordinates reported in Wunderlich et al. [Table S3; 2009]). Secondly, we can see that not only are the identified regions consistent with the neural substrates of decision-making based on previous work, but the temporal order of engagement of these regions is also consistent with their functional role in decision-making. It has been argued that activity in subregions of the striatum reflects reward prediction-errors [O’Doherty et al., 2004] and that these errors serve to update action-values in the ACC [Dayan and Balleine, 2002, Wunderlich et al., 2009, Seo and Lee, 2007, Walton et al., 2004], which in turn must be compared in the SMA to determine the best action before a decision can be made [Wunderlich et al., 2009]. Such prior work has argued that these different decision-making signals are carried by separate regions in a corticostriatal loop, which is assumed to participate in a time course of events leading to action-selection [Balleine and O’Doherty, 2010, Hare et al., 2011].

Here we show for the first time the temporal dynamics between these critical regions in the striatum, anterior cingulate cortex and motor areas leading to action-selection. Figure 3(d) shows the time course of each region’s $d^{\pi u r}$ between the reward at 9.2 s ($t_1$) and the next response at 12.8 s ($t_2$). Note that since we took the probability of taking the EYE action as the reference, negative values of $d^{\pi u r}$ indicate a region’s role in selecting the HAND action. At reward receipt (9.2 s), $d^{\pi u r}$ of the ventral striatum begins below the zero baseline and then (negatively) peaks at 9.8 s, as it mediates the effect of reward prediction-errors on the subsequent hand response. The value of $d^{\pi u r}$ for the anterior cingulate then (negatively) peaks after 10.4 s, consistent with its role in updating action values with the new errors before the next response. Finally, $d^{\pi u r}$ for the large cluster in the motor area (including the supplementary motor area) controlling motor responses such as the HAND action negatively peaks at the time of the action (12.8 s), which marks the end of the decision process in the current task.
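A sketch of the masking step described above, assuming `d_pi_u_r` holds the intermediation measure for every voxel at a given time step:

```python
import numpy as np

def top_one_percent_mask(d_pi_u_r, q=99.0):
    """Keep only voxels whose |d^{pi u r}| is at or above the q-th percentile; the
    remaining voxels are set to NaN before being mapped back onto the brain volume."""
    magnitude = np.abs(d_pi_u_r)
    threshold = np.percentile(magnitude, q)
    return np.where(magnitude >= threshold, d_pi_u_r, np.nan)
```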
As part of our supplementary material, Figure S1(d) shows the time dynamics between the striatum, anterior cingulate and motor areas controlling EYE choices – corresponding to the inputs shown in panel (b). Here positive values of $d^{\pi u r}$ indicate a region’s role in selecting the EYE action. At reward receipt (9.2 s) the associative striatum is involved immediately in mediating the effect of reward on the subsequent action-selection. Then at 11 s the involvement of the anterior cingulate peaks, before a region in the motor area nearest the supplementary eye field peaks at the time of action (12.8 s). In sum, changes in $d^{\pi u r}$ over this time period mirror those for the HAND action, and are consistent with the hypothesized roles of these regions in the varying decision stages of the reward-learning task used here.

5 Discussion

We have introduced a new neural architecture for investigating the neural substrates of decision- making in the brain. Unlike previous methods, our approach does not require manual engineering and is able to learn computational processes directly from the data. We further showed that the model can be interpreted to uncover the temporal engagement of different brain regions in choice and reward processing. Besides being used as a standalone analysis tool, this approach can inform model-based fMRI analyses to investigate whether the model correctly tracks the brain’s internal mechanism. That is, if a brain region is found to be important in the current analysis, but not using the model-based fMRI analysis, this could mean that the model used to extract neural information is not representing all of the relevant neural signals involved in decision-making and requires further modification.


Figure 3: (a,b) The graphs show the effect of reward on actions in terms of $d^{\pi r}$. The choice states (between EYE and HAND actions) are shown by the grey shaded areas. In the left panel, action HAND (shown by ‘H’) was selected, and in the right panel action EYE (shown by ‘E’) was selected, at all of the choice states. The outcome of each choice (reward/no reward) was delivered in the red shaded area. The first choice was rewarded, as shown by ‘R’ in the graph, but the other choices were not followed by any reward. The blue bars show the effect of the reward received after the first choice on the subsequent choices ($d^{\pi r}$). (c) Voxel maps and the time-course of changes in $d^{\pi u r}$ in cortical and subcortical brain regions between reward of the HAND action at 9.2 s and the response at 12.8 s, shown by the red arrow in panel (a). Voxels below the 99th percentile were masked to reveal only the top one percent of voxels shown here. (d) The time course of each region, calculated from $d^{\pi u r}$ of the maximum voxel in that region at each time point (smoothed), selected within an anatomical mask from wfu_pickatlas. The y-axis represents $d^{\pi u r}$. ACC: anterior cingulate cortex; vStr: ventral striatum; PMC: primary motor cortex. See Figure S1 for the voxel maps and time course changes relating to the EYE action.

5 Discussion

We have introduced a new neural architecture for investigating the neural substrates of decision-making in the brain. Unlike previous methods, our approach does not require manual engineering and is able to learn computational processes directly from the data. We further showed that the model can be interpreted to uncover the temporal engagement of different brain regions in choice and reward processing. Besides being used as a standalone analysis tool, this approach can inform model-based fMRI analyses to investigate whether the model correctly tracks the brain’s internal mechanism. That is, if a brain region is found to be important in the current analysis, but not using model-based fMRI analysis, this could mean that the model used to extract neural information is not representing all of the relevant neural signals involved in decision-making and requires further modification.

Acknowledgments

AD and BWB were supported by funding from UNSW Sydney and the National Health and Medical Research Council of Australia GNT1079561. PD was funded by the Gatsby Charitable Foundation. Part of this work was conducted whilst PD was at Uber Technologies. Neither body played a part in its design, execution or communication. PD is affiliated with the Max Planck Institute for Biological Cybernetics, Tübingen, Germany ([email protected]).

References

Joshua I Gold and Michael N Shadlen. The neural basis of decision making. Annual review of neuroscience, 30, 2007.

Kenji Doya. Modulators of decision making. Nature neuroscience, 11(4):410–6, apr 2008.

John P O’Doherty, Alan Hampton, and Hackjin Kim. Model-based fMRI and its application to reward learning and decision making. Annals of the New York Academy of Sciences, 1104(1):35–53, 2007.

Jonathan D Cohen, Nathaniel Daw, Barbara Engelhardt, Uri Hasson, Kai Li, Yael Niv, Kenneth A Norman, Jonathan Pillow, Peter J Ramadge, Nicholas B Turk-Browne, and others. Computational approaches to fMRI analysis. Nature neuroscience, 20(3):304, 2017.

Nathaniel D Daw, John P O’Doherty, Peter Dayan, Ben Seymour, and Raymond J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–9, jun 2006.

Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. Journal of computer and system sciences, 50(1):132–150, 1995.

David Sussillo, Mark M. Churchland, Matthew T. Kaufman, and Krishna V. Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025–1033, 2015.

H. Francis Song, Guangyu R. Yang, and Xiao Jing Wang. Reward-based training of recurrent neural networks for cognitive and value-based tasks. eLife, 6:1–24, 2017.

Michael Breakspear. Dynamic models of large-scale brain activity. Nature neuroscience, 20(3):340, 2017.

Brandon M Turner, Birte U Forstmann, Eric-Jan Wagenmakers, Scott D Brown, Per B Sederberg, and Mark Steyvers. A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage, 72:193–206, 2013.

David Halpern, Shannon Tubridy, Hong Yu Wang, Camille Gasser, Pamela Osborn Popp, Lila Davachi, and Todd M Gureckis. Knowledge Tracing Using the Brain. In Proceedings of the 11th International Conference on Educational Data Mining, EDM, 2018.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Richard Henson and Karl J Friston. Chapter 14 - Convolution Models for fMRI. In Karl Friston, John Ashburner, Stefan Kiebel, Thomas Nichols, and William Penny, editors, Statistical Parametric Mapping, pages 178–192. Academic Press, London, 2007. ISBN 978-0-12-372560-8.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian Computation. PLOS Computational Biology, 9(1):1–10, 2013.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Klaus Wunderlich, Antonio Rangel, and John P O’Doherty. Neural computations underlying action-based decision making in the human brain. Proceedings of the National Academy of Sciences, 106(40):17199–17204, 2009.

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Antonio Rangel and Todd Hare. Neural computations associated with goal-directed choice. Current opinion in neurobiology, 20(2):262–270, 2010.

John P O’Doherty, Peter Dayan, Johannes Schultz, Ralf Deichmann, Karl J Friston, and Raymond J. Dolan. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304(5669):452–4, apr 2004.

Peter Dayan and Bernard W Balleine. Reward, motivation, and reinforcement learning. Neuron, 36(2):285–98, 2002. ISSN 0896-6273.

Hyojung Seo and Daeyeol Lee. Temporal filtering of reward signals in the dorsal anterior cingulate cortex during a mixed-strategy game. Journal of neuroscience, 27(31):8366–8377, 2007.

Mark E Walton, Joseph T Devlin, and Matthew F S Rushworth. Interactions between decision making and performance monitoring within prefrontal cortex. Nature neuroscience, 7(11):1259, 2004.

Bernard W Balleine and John P O’Doherty. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology, 35(1):48–69, jan 2010.

Todd A Hare, Wolfram Schultz, Colin F Camerer, John P O’Doherty, and Antonio Rangel. Transformation of stimulus value signals into motor commands during simple choice. Proceedings of the National Academy of Sciences, 108(44):18120–18125, 2011.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
