
Mapping value based planning and extensively trained choice in the human brain

Klaus Wunderlich, Peter Dayan & Raymond J Dolan

Wellcome Trust Centre for Neuroimaging, University College London, London, UK (K.W., R.J.D.); Gatsby Computational Neuroscience Unit, University College London, London, UK (P.D.). Correspondence should be addressed to K.W. ([email protected]).

Received 1 December 2011; accepted 14 February 2012; published online 11 March 2012; doi:10.1038/nn.3068

Nature Neuroscience, advance online publication. © 2012 Nature America, Inc. All rights reserved.

Investigations of the underlying mechanisms of choice in humans have focused on learning from prediction errors, leaving the behavioral and computational structure of value-based planning comparatively underexplored. Using analyses of a minimax decision task, we found that the computational processes underlying forward planning are expressed in the anterior caudate nucleus as values of individual branching steps in a decision tree. In contrast, values represented in the putamen pertain solely to values learned during extensive training. During actual choice, both striatal areas showed a functional coupling to ventromedial prefrontal cortex, consistent with this region acting as a value comparator. Our findings point toward an architecture in which segregated value systems operate in parallel in the striatum for planning and extensively trained choices, with medial prefrontal cortex integrating their outputs.

An overarching view of adaptive behavior is that humans and animals seek to maximize reward and minimize punishment in their choices. Solutions to such value-based decision problems fall along a spectrum. At one end, on-the-fly planning, based on a model of the relevant domain, can determine which of the available actions lead to a desired outcome. Finding optimal actions in this type of choice context, for instance by searching the branches of a decision tree for the best outcome, poses severe demands on computation and rapidly becomes intractable with growing complexity. This end of the spectrum is of particular importance when we have relatively little experience in an environment or where its aspects change quickly. In contrast, when subjects have extensive practice in a relatively stable domain, they can learn directly from experience the affective consequences of different actions. Decision-making at this end of the spectrum no longer need be based on a complex representational model of the world and can become highly automated. One of the main results in the field of reinforcement learning, indeed one of the earliest insights in artificial intelligence, is that it is possible to learn optimal actions in complex, but stable, domains by making and measuring errors in predictions over the extended course of experience.

A rich body of work on value-based decision-making in humans has focused on learning on the basis of prediction errors. Although there has been extensive investigation of various tasks involving planning, such as the Tower of London, these tasks have typically not focused on value and have not been designed to compare the two ends of the spectrum referred to above. More recent investigations targeting this spectrum have been revealing, but have not directly addressed the computational or neural mechanisms of value-based planning, or the encoding and integration of extensive training–based evaluation and choice.

We designed a value-based choice task for human subjects and used functional magnetic resonance imaging (fMRI) to examine the neural mechanisms underlying forward planning based choices and choices based on values learned after extensive behavioral training. Our task allowed us to separately index planning and extensively trained contexts, which allowed us to specifically investigate value representations in the brain associated with the computational processes of each type of choice. We found that medial striatum was more strongly engaged during planning and lateral striatum was more strongly engaged during choices in extensively trained contexts. Notably, the blood oxygen level–dependent (BOLD) signals in caudate pertained to individual computational components of planned values, whereas signals in posterior putamen selectively fluctuated with the values during extensively trained responses. Our results provide direct evidence in humans for multiple decision systems that operate independently and in parallel and recruit neural structures along a medio-lateral axis in basal ganglia. Furthermore, prefrontal cortex, specifically ventromedial prefrontal cortex (vmPFC), represented the value of the chosen option across systems, highlighting its possible role as a value comparator across both decision systems.

RESULTS
We asked 21 subjects to participate in a decision task in which decision values that were either derived from forward planning or learned through extensive training could be distinguished on a trial-by-trial basis. One component of the task required subjects to navigate a tree-shaped branching maze to reach one of several terminal states. Each available state was associated with distinct probabilities of obtaining reward, thus rendering the value components of individual branches in the decision tree computationally transparent. In pure planning trials (Fig. 1a), probabilities of reward were visually displayed, but could change on each trial.

Three consecutive choices led from the start state to the terminal state. Subjects planned the first and last choices; the middle choice was made by a predictable computer agent acting according to a fully disclosed rule (minimax, selecting the tree branch having the lower maximum value). This induced a tree search strategy for calculating planned values, whereas, for instance, a mere requirement to compare displayed values might fail to invoke sufficient forward planning. Thus, choice in the task required a form of dynamic programming, possibly involving the estimation of the best values at distinct stages in the tree exercised during planning.

Figure 1 Task and behavioral results. (a) Task flow in planning trials: subjects navigated a three-layer maze before reaching probabilistic rewards. Eight numbers (randomly changing from trial to trial) displayed the reward probabilities of each terminal room. The second layer choice was determined by a deterministic value minimizing computer agent that implemented the lowest value option. (b) Exemplary planning maze: nodes represent rooms and lines represent transitions between rooms. Subjects moved forward by freely choosing at the first and third levels (cyan circles); the computer determined the choice at the second level (gray circles). The optimal path (arrows) was determined by backward induction of state values using a minimax strategy. State (black) and action (red) values are shown along the choice path. (c) Prior training over 3 d in four single-level mazes with invariant contingencies and distinct reward probabilities (P = 0.15, 0.40, 0.65 and 0.90). Wall colors provided distinguishable contexts that allowed subjects to distinctively identify each trained maze. No explicit information about reward probabilities or contingencies was given. (d) Combination of planned and trained options in the same trial; colored doors transitioned into the trained maze of the same color, while the other door followed a reduced planning branch with four outcome states. No reward probabilities were shown above the colored doors. (e) Average fraction of correct choices according to a tree search planning strategy (plan) and two alternative heuristics: highest value (max) and higher average value (avg). Subjects' choice behavior pertained to the tree search planning strategy and could not be explained by any of the heuristics. Vertical lines represent s.e.m. See Supplementary Table 5 for individual subject behavior.

A second component of the overall task design involved trials that did not require forward planning and were instead extensively trained (Fig. 1c). In these, subjects had to make single choices between two available actions after having learned values from 3 d of behavioral training, with samples from a probabilistic reward delivery process. The inclusion of separate planning and extensively trained trials allowed us to investigate neural computations unique to each decision system. Subsequent to this, we examined mixed trials involving choices between either a planning branch and a trained branch (Fig. 1d) or between two trained branches.

Figure 2 Neural correlates of planning versus extensively trained choices. (a) Significant categorical effects for planning > trained trials (red) and trained > planning trials (blue). Medial sectors of basal ganglia, including medial caudate, thalamus, bilateral anterior insula, dorsomedial prefrontal cortex, bilateral medial frontal gyrus and precuneus, showed enhanced BOLD responses on planning compared with extensively trained trials. Lateral posterior putamen, posterior insula extending into the medial temporal gyrus and somatosensory cortex, including postcentral gyrus, were more activated when subjects made a response in the extensively trained context. (b) Effect sizes in a regression of planned values, convolved with a canonical HRF, against BOLD data at three time points: first choice, subjects' second choice and outcome. Signals in caudate pertained to the value difference between actual target and alternative values in choices along the traversed path, as indicated by both significant positive effects for target values and significant negative effects for the alternative values. Asterisks mark significant effects (P < 0.05; see Supplementary Table 2 for individual effect sizes); a.u., arbitrary units; vertical lines represent s.e.m. Posterior putamen did not significantly correlate with planned values. (c) Caudate activity related to classic reward prediction errors during trained trials. Posterior putamen showed significant value representations in extensively trained mazes at the time of choice.
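The backward induction that defines optimal play in the planning maze can be sketched in a few lines. The code below is an illustrative reconstruction, not the authors' implementation, and the displayed reward probabilities are hypothetical stand-ins for the numbers shown on a trial. The subject maximizes at layers 1 and 3, while the computer minimizes at layer 2 by entering the branch with the lower maximum value:

```python
def minimax_root_values(leaf_probs):
    """Backward induction in a 2x2x2 planning maze.

    leaf_probs: eight reward probabilities, one per terminal room, ordered
    left to right. The subject chooses (max) at layers 1 and 3; the computer
    chooses (min) at layer 2, always entering the branch whose best
    reachable outcome is lower. Returns one value per root door.
    """
    # Layer 3: the subject takes the better of each pair of terminal rooms.
    layer3 = [max(leaf_probs[i], leaf_probs[i + 1]) for i in range(0, 8, 2)]
    # Layer 2: the computer minimizes over the two layer-3 values it guards.
    layer2 = [min(layer3[i], layer3[i + 1]) for i in range(0, 4, 2)]
    return layer2  # root values; the optimal first choice is their argmax

probs = [0.9, 0.0, 0.4, 0.2, 0.65, 0.15, 0.5, 0.3]  # hypothetical display
root = minimax_root_values(probs)                    # [0.4, 0.5]
best_door = max(range(2), key=lambda d: root[d])     # door 1
```

Note that the single best room (0.9) lies behind door 0, yet the minimax-optimal first choice is door 1, because the computer would steer the subject away from the 0.9 room; this is exactly why a heuristic based only on the largest displayed value fails on such trees.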


From a normative perspective, the combination of both components in the same trial entails a direct comparison of planned values from one branch with values derived from the extensively trained task on the other branch.

Behavioral results
Subjects' choices were largely consistent with choice values over all trial types (Fig. 1e and Supplementary Table 1). On average, subjects chose optimally in 94% of planning trials and chose the rewarding door in 98% of extensively trained trials. Our task was designed such that only a tree search strategy would yield good performance in planning trials. To test whether participants indeed used a tree search strategy, we compared individual subjects' choices to the optimal minimax strategy and to other, simpler heuristics, such as picking the path with the largest maximum value later in the tree or picking the path with the highest average value in the leaf nodes. Subjects' behavior was better explained by the minimax planning strategy than by any of the alternative heuristics (P < 10−8, Wilcoxon rank sum test; Fig. 1e). Moreover, choices in every individual subject matched choices predicted by the planning strategy more closely than those predicted by the heuristics (Supplementary Fig. 1), confirming that planning was cognitively tractable, and consistent with subjects having learned values and action mappings in the trained trials.

Before undergoing fMRI, subjects were trained for 3 d on extensively trained trials to ensure that the associated values had stabilized. Over the course of training, the subjects' responses converged to the optimal action in each context; there was no difference between the rate of correct responses in higher and lower valued contexts from day 2 onwards (Supplementary Table 3).
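The three candidate strategies compared above can be made concrete on a toy maze of the same shape. The probabilities and function names below are hypothetical illustrations, not the fitted behavioral models from the study:

```python
def door_scores(leaf_probs, strategy):
    """Score the two root doors of a 2x2x2 maze under one choice strategy.

    'plan' : minimax backward induction (max at layers 1/3, min at layer 2)
    'max'  : largest leaf probability anywhere under the door
    'avg'  : mean leaf probability under the door
    """
    left, right = leaf_probs[:4], leaf_probs[4:]

    def plan(branch):
        layer3 = [max(branch[0], branch[1]), max(branch[2], branch[3])]
        return min(layer3)  # computer enters the lower-valued branch

    score = {
        "plan": plan,
        "max": max,
        "avg": lambda b: sum(b) / len(b),
    }[strategy]
    return [score(left), score(right)]

probs = [0.9, 0.0, 0.4, 0.2, 0.65, 0.15, 0.5, 0.3]  # hypothetical trial
predicted = {s: max((0, 1), key=lambda d: door_scores(probs, s)[d])
             for s in ("plan", "max", "avg")}       # door chosen per strategy
```

On this trial the 'max' heuristic disagrees with planning (it chases the 0.9 room that the computer would block); trials where the strategies disagree are the diagnostic ones for separating a tree searcher from a heuristic chooser.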
Categorical neural differences: planning versus trained choices
We first compared activity at the time of initial choice during planning trials with activity in trials involving extensively trained choices. Activity dissociated along an anteromedio-posterolateral axis in basal ganglia (Fig. 2a). Structures that were preferentially activated during planning included medial sectors of basal ganglia (medial caudate and striatum), thalamus, bilateral anterior insula, dorsomedial prefrontal cortex, dorsolateral prefrontal cortex, bilateral medial frontal gyrus and parietal cortex (precuneus extending into the intraparietal sulcus). In contrast, lateral posterior putamen, posterior insula extending into the medial temporal gyrus, vmPFC and somatosensory postcentral gyrus were more strongly activated in trained trials (all P < 0.05, familywise error corrected; Supplementary Table 2). Anatomically defined region of interest (ROI) analyses confirmed that BOLD responses in caudate increased significantly only during planning trials, whereas posterior putamen activity was selective to trained trials.

Neural correlates of choice relevant values
As choice crucially depends on value, we next investigated responses pertaining to valuations of available choices in two striatal regions that are strongly linked to decision-making, namely the anterior caudate nucleus, implicated in goal directed choices, and the posterior putamen, which has been associated with overtrained choices. In addition, we examined responses in vmPFC, a region also widely implicated in value-based choice (Supplementary Fig. 2). We delineated ROIs a priori (see Supplementary Table 2 for location details) based on previous research and anatomical criteria and regressed various values against fMRI signals in these regions. One set of values concerned the target (the choice leading to the best reachable outcome, taking account of the computer's minimax strategy) and the alternative choices along the traversed maze path. This was motivated by the fact that these are the values that need to be compared during tree search. Indeed, consistent with this hypothesis, fMRI signals in the caudate covaried with the difference between target and alternative values, as shown by significant positive effects for target values and negative effects for the alternatives (Fig. 2b). Notably, during the root choice, caudate activity related to several values relevant for a given choice, including those at the present choice (Vtarget and Valt.root) and the choice deeper in the tree (Valt.deep; Fig. 1b). Note that successful search of the decision tree required a consideration of the latter values even while at the root state. During the subjects' second forward choice (layer 3), caudate activity was still associated with the values of both alternatives at the current choice, but was no longer associated with the value of the previously rejected root branch. These value difference signals are likely to reflect the output of value comparisons, a predicted hallmark of a cognitive implementation of tree search. The effects seen in caudate for planning value components were not evident in putamen. Instead, the posterior putamen solely encoded values on extensively trained trials at the time of choice (Fig. 2c).

Choices comparing values from planning and from extensively trained trials require a comparison of the respective values represented in these two distinct clusters. We presented subjects with a choice between a planning branch and a trained branch (Fig. 1d), that is, a task in which subjects need to access both planned and trained values. By design, the value of the trained branch was uncorrelated with the values of the planning branch, which allowed us to distinguish the influence of both value types. The caudate consistently represented the difference between the planned target value and the value of the alternative option on the planning branch, consistent with it performing the same value difference computations as in pure planning trials (Fig. 3a). Notably, the caudate represented these planned values and not the values of the trained branch.

Figure 3 Comparing values from planning and values from extensively trained mazes. Value-based effect sizes at choice time in mixed planning/trained trials and in trials comparing two trained branches. Mixed trials are plotted separately, conditional on subjects' choices. (a) Caudate represented planned values of the planning branch regardless of choice. (b) Putamen fluctuated with values of the colored trained branch regardless of choice. (c) vmPFC encoded the value of the chosen option, representing the output of a comparison process. Vc = value of the chosen option, Vnc = value of the not chosen option. *P < 0.05; a.u., arbitrary units; vertical lines represent s.e.m.

Figure 4 Functional coupling between caudate-vmPFC and putamen-vmPFC is significantly increased during choice in mixed trials. (a) We tested the statistical significance of the PPI interaction contrast between our a priori defined ROIs, for which the effect size is shown as bar graphs (all P < 0.05). Vertical lines represent s.e.m. The increase in coupling was independent of actual choice, consistent with the hypothesis that vmPFC mediates in the decision process by accessing pre-choice values from both choice systems. (b) Shown are areas of increased coupling with both caudate and putamen during mixed choices (conjunction analysis). We observed distinct clusters.


In contrast, activity in putamen solely pertained to values of the trained branch, regardless of which branch was later chosen (Fig. 3b). The putamen also represented the stimulus values of both available trained branches in trials comparing values from two trained branches, also irrespective of later choice (Fig. 3b).

We observed that vmPFC activity covaried with the value of the chosen branch, a post choice signal, irrespective of whether it was planned or trained (Fig. 3c). Notably, we found no evidence for the representation of mere stimulus values in the vmPFC cluster: if vmPFC activity had related to some form of representation of both options' values (or their sum), then we would have expected to see a positive effect for both chosen and unchosen values in this contrast. Furthermore, we ruled out that the vmPFC signals represented the best option (maximum value) rather than the chosen option by re-estimating our general linear model (GLM) with a choice model performing maximum value comparison and by Bayesian model comparison between both GLMs. This analysis provided strong evidence for a choice-related signal in both mixed and trained/trained trials (exceedance probability >0.99).

The finding that activity in caudate and putamen covaried with planned values and values from the trained trials, respectively, even for actions that were not chosen, provides direct evidence for a parallel and independent operation of two separate controllers. In turn, this parallel operation afforded us the opportunity to examine how these systems compete at the time of choice. Similar to the striatal correlates of action values, planned values and values of the trained branches fulfill criteria for pre-choice values and are likely to serve as inputs to a final decision comparator. The region most commonly implicated in comparative valuation is the vmPFC.

Functional coupling between basal ganglia and vmPFC
In mixed trials, signals in caudate and putamen pertained to the value of their respective system, independent of choice, whereas activity in vmPFC pertained to the value that was modulated by choice. This suggests that the caudate and putamen are at an input stage to a value comparison process, whereas vmPFC is at an output stage.

We next examined how decisions in the two networks interact. To discriminate between alternative mechanisms for how choice values from both systems are compared, we used a psychophysiological interaction (PPI) connectivity analysis and examined the functional relationship between vmPFC, caudate and putamen during mixed choices. One possibility is that the competition between the planning and extensively trained systems is resolved in the basal ganglia and the outcome is transferred to vmPFC. This predicts that the PPI will show increased coupling of only the winning area (caudate or putamen) with vmPFC. The alternative hypothesis is that values from both areas are transferred to vmPFC and that competition is resolved in vmPFC. This predicts increased coupling between both precursor areas and vmPFC, regardless of choice.

The results of the PPI analysis support the latter hypothesis, revealing a significant increase in the strength of coupling of both putamen and caudate with vmPFC during the time of choice, independent of the action that was finally chosen (P < 0.05; Fig. 4). In contrast, we did not find areas that showed a differential increased coupling with caudate on trials in which subjects chose the planning branch, but not on trials in which they chose the trained branch, or a differential increased coupling with posterior putamen on trained branch choices, but not on planning branch choices.

DISCUSSION
We found that trials invoking forward planning and trials with extensively trained options evoke activity in distinct neural systems during computations associated with choice.
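The logic of a PPI can be illustrated in simplified form: the GLM gains an interaction regressor formed as the elementwise product of a seed region's timecourse and a psychological condition vector, and a nonzero weight on that regressor indicates condition-dependent coupling. Real fMRI PPIs additionally deconvolve the BOLD signal to a neural-level estimate and reconvolve with a hemodynamic response function; this simulated sketch omits those steps, and all signals and parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                    # number of scans (toy design)
choice = (np.arange(n) % 20) < 10          # 1 during mixed-choice periods
seed = rng.standard_normal(n)              # seed timecourse, e.g. caudate

# Simulated target region (e.g. vmPFC): coupled to the seed mainly
# during choice periods, plus weak baseline coupling and noise.
target = 0.8 * seed * choice + 0.1 * seed + 0.2 * rng.standard_normal(n)

# GLM with intercept, both main effects and the PPI interaction term.
X = np.column_stack([np.ones(n), seed, choice.astype(float), seed * choice])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)

ppi_weight = beta[3]  # > 0: coupling with the seed increases during choice
```

The key design point is that the interaction term is tested over and above both main effects, so a significant `ppi_weight` cannot be explained by overall seed activity or by the task condition alone.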

BOLD signals in the caudate pertained to values of the individual branches in a decision tree, whereas BOLD signals in the posterior putamen fluctuated with values associated with responses in an extensively trained context. Notably, during choices requiring a simultaneous comparison of values from both choice types, the individual striatal subsystems consistently represented their respective values regardless of the final choice. These findings suggest that two independent systems represent the two value types in our task. In contrast, activity in prefrontal cortex pertained to a value signal that depended on the actual decision made.

Converging evidence from animal and human studies has long suggested that two different learning processes govern behavior: one controlling the acquisition of goal-directed actions and one controlling the acquisition of habits. According to this dissociation, an association between actions and outcomes governs action selection in the goal-directed case, whereas, in the habitual case, selection is controlled through stimulus-response associations learned without any direct assessment of the outcome of actions. As such, goal-directed control is performed with regard to the consequences of actions, whereas habits are determined by the stimuli predicting the outcomes rather than by the outcomes themselves. Accounts suggesting a plurality of control are also supported by theoretical considerations of the computational mechanisms underlying different forms of reinforcement learning. The defining criterion in the more computationally centered literature has been a functional one, focused on the differences in the computational mechanisms underlying different types of learning. The dissociation that we used is between model-free temporal difference learning of cached values and model-based choice that predicts, on the fly, the immediate consequences of each action in a sequence. Our planning task, which was designed to be solvable only by searching the decision tree, typifies model-based control. The absence of a devaluation or contingency degradation test means that we cannot definitively prove that our extensive training created a true habit. Similarly, we cannot exclude the notion that subjects derived values in the extensively trained mazes by solving a decision tree during training and then memorizing those values so that they could be retrieved at the time of choice. However, previous studies have shown that learning in similar tasks through numerous repetitions in stable contexts is solved by a prediction error–based mechanism.

In the caudate, we observed value differences, which are likely correlates of the choice values during the planning process. The existence of planning value representations in anterior caudate is consistent with evidence for goal-directed impairment after caudate lesions in rodents. In addition, a human imaging study found elevated activity in anterior caudate when subjects were performing on a high-contingency schedule compared with when they were performing on a low-contingency schedule. Although most of our results are consistent with previous findings implicating the caudate in explicit planning, it is notable that the BOLD signal in this structure also correlated with the values of the relevant options in extensively trained trials. There are a number of possible explanations for this. The simplest of these is that this activity is epiphenomenal for choice. That is, the main claim of dual systems accounts is not that redundant systems do not calculate (if they have the information to do so), but rather that their calculations do not influence behavior. Thus, on extensively trained trials, a planning system might estimate values, but with no effect on behavior, or perhaps at most only a modest effect, improving the prediction errors available to the other system. When the planning system is engaged in its own unique computations, these calculations are no longer possible. We consider the mixed trials as showing this, although it would be interesting to design a more explicit test.
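The model-free half of this dissociation can be made concrete with a minimal delta-rule learner of the kind the extensive training phase is assumed to engage. The learning rate, trial count and sampling scheme below are illustrative choices, not fitted parameters; only the four reward probabilities are taken from the task (Fig. 1c):

```python
import random

def train_cached_values(p_reward, n_trials=3000, alpha=0.1, seed=1):
    """Model-free learning of one cached value per door from binary rewards.

    On each trial a random door is sampled, a reward is drawn with that
    door's probability, and the cached value moves along the prediction
    error delta = reward - Q[door]. No model of the maze is ever consulted.
    """
    rng = random.Random(seed)
    q = [0.0] * len(p_reward)
    for _ in range(n_trials):
        door = rng.randrange(len(p_reward))
        reward = 1.0 if rng.random() < p_reward[door] else 0.0
        q[door] += alpha * (reward - q[door])  # delta rule update
    return q

# The four extensively trained contexts used these reward probabilities.
q = train_cached_values([0.15, 0.40, 0.65, 0.90])
```

After training, each cached value hovers near its door's true reward probability (within noise set by the learning rate), so choosing by comparing cached values reproduces the trained preferences without any tree search.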

The putamen encoded values associated with the extensively trained choices, consistent with it acting as a cached memory for reinforcement learning. This is particularly clear in trials with two trained branches, where the values of both available options were simultaneously represented in putamen. This pattern of option values, but not of a value difference, suggests that the putamen does not itself compare values, but needs the vmPFC or caudate (where we do see such a difference between the chosen and unchosen option) to perform this task. It should also be noted that posterior putamen (unlike the caudate) did not reflect a prediction error at the time of outcome, which might underlie the persistence of extensively trained habits. Our data (including the behavioral effect that higher values in those trials reduce response times; Supplementary Table 1) also suggest that even extensively trained responses can still be influenced by the learned values of the associated actions, rather than depending only on the sort of more arbitrary action propensities found in certain reinforcement learning models (notably the actor-critic4,27,31).
A recent imaging study8 showed that cue-driven activation in dorsolateral posterior putamen increases with prolonged habitization and concluded that this region may contribute to the habitual control of behavior in humans. Although that study did not investigate value-related parametric effects in this region, we found neural representations of values for extensively trained choices in the same area. However, it is less clear whether there is a process of consolidation by which values migrate in the striatum over the course of overtraining. Our ROI (which was based on the coordinates in ref. 8) was posterior to the location of many previous studies reporting prediction error signals in putamen during basic learning tasks. When we tested for value signals in an anterior putamen ROI, we did not find a reliable representation of the values of the trained branches or of planned values. This is consistent with evidence from studies on procedural sequence learning32,33, suggesting a transfer of activity from rostral to more caudal parts of putamen with increasing learning.
Diametrically opposed to this interpretation is the possibility that the caudate actually controls choice, even in trials that we consider to be non-planning. We believe this is unlikely, as the value of the trained branch was conspicuously absent from caudate in mixed trials, whereas if subjects had based all choices on planning, then we should have seen a value difference in caudate similar to the pattern of activity observed in pure planning trials. A further test would be, for instance, to engage the planning system with a distractor task while subjects make extensively trained choices. In such a scenario, we would expect value-associated signals in caudate to vanish, or rather to pertain to the concurrent planning task, while leaving choice performance on the extensively trained task essentially unimpaired.
A third, and more radical, possibility is that the values from the trained branches are used to ground evaluations in the planning system. This interpretation would be most appropriate for trials involving two trained branches, as these values could then be compared by the planning system. Such an integration of values across systems has been widely predicted from the very earliest days of planning30, but has not previously been observed. Our task is not ideal for testing this possibility, but it suggests an important area for future work. On a similar note, although we found that prediction error-based learning of action values did not affect choice in planning trials, we cannot exclude some form of concomitant model-free learning in planning trials. However, in the absence of an overt expression of behavior from the model-free system, this would not be trivial to dissect using our methods. These questions are nevertheless important issues for future research.


Our finding that vmPFC increased its coupling with both caudate and putamen during choice, and that it encoded the outcome of a winning choice process (chosen value), is consistent with it having a putative role as a value comparator. These results challenge the view that prefrontal cortex is largely sensitive to model-derived values11,13, and instead suggest that the vmPFC is engaged whenever values are compared to prepare an action, regardless of whether those values derive from a planning computation or from extensive training. The absence of vmPFC value representations during extensively trained choice trials, which do not require a comparison, implies that subjects immediately initiate the action in these trials. This interpretation also explains why vmPFC does not represent values of the choice at the third stage of planning trials, as subjects might have already precomputed and stored deep choices at the root stage and then only executed the appropriate response at the deep stage. Our behavioral findings support this interpretation: subjects' response times increased with decision difficulty (measured as absolute value difference) at the first stage, but not at the second stage (Supplementary Table 1). In contrast, caudate represented values of the second stage choice both at the time when they were computed (root choice) and at the time when the associated action was executed (deep choice), consistent with a proposed role of organizing and representing the forward planning process. Our overall interpretation thus suggests a value comparison role for vmPFC, whereas caudate represents the planned actions (together with their values) as long as they are task relevant and until the required actions are initiated.
Furthermore, it is of interest that vmPFC pertained to the value difference between chosen and unchosen options during choices requiring a comparison of values from only one system (such as pure planning trials and trials between two extensively trained branches), but not in mixed trials. We cannot rule out the possibility that our test is insensitive to the negative effect of the unchosen option during the value comparison in mixed trials. However, an alternative explanation is that the brain employs different mechanisms for the value comparison in the two conditions. This hypothesis requires further investigation, as previous studies have reported vmPFC sensitivity both to chosen values16,36 and to value differences between chosen and unchosen options12,38. In addition to chosen values, a number of previous studies found evidence for goods values in medial PFC39,40, in which case vmPFC may also reflect stimulus values, although whether behavior was guided by planning or non-planning computations was not explicitly controlled for. This points to a goods or option comparison role for vmPFC after pre-choice values are transferred there from other structures such as the basal ganglia. It is also possible that vmPFC has a separate role in the economic valuation of goods, an abstract appraisal of real world items in a common value space; we used monetary rewards and our task therefore did not require such valuation. In summary, vmPFC might facilitate actual value comparisons whenever a choice requires them.
We note four caveats to our findings. First, nonsignificant results do not prove the absence of an effect. It should be mentioned, however, that neural signals in caudate and putamen did not pertain to just a singular value signal, but instead correlated with a specific set of computationally meaningful patterns across several different tasks. Second, we concentrated on BOLD signals at the time of the choice rather than at the outcome, because we had no expectation for a computation at that time in either system: the outcome is irrelevant for planning, as values change on a trial-by-trial basis and the computer opponent's strategy is instructed, and in the extensively trained context, substantial experience with fixed outcome probabilities (unlike the case in ref. 6) should render nugatory any prediction error. Third, although our results in relation to categorical differences between trial types (Supplementary Fig. 2) might be influenced by variations in difficulty between conditions, this would not affect the parametric analysis of values, as those potential confounds are encompassed by the associated categorical regressor.

Finally, all value signals are relative to the reference frame of the choosing agent, and a neural representation of values should ultimately reflect subjective values. We assumed that our subjects employed a linear transformation of reward probabilities to value, consistent with both subjects' choices and neural data showing a linear relationship between reward probability and BOLD in our ROIs (Supplementary Fig. 3).
Our findings add to recent investigations of value-based choices in other species, suggesting that there are conserved processes in the basal ganglia across species. Previous studies were limited with respect to the questions that we posed by either not dissociating the value representations of multiple controllers or not involving actual planning. We designed our task to minimize possible indirect interactions between the two forms of control on the planning trials; for instance, even if the planning system were (unnecessarily for it) to calculate temporal difference prediction errors, there would be little to do with them, as the values change on a trial-by-trial basis. Perhaps the most pressing possibility furnished by our results is to embed values derived from extensive training deeper in the tree. This would require those learned values to be assessed as part of a planning choice in a more thoroughgoing way than in our trained/trained trials. As mentioned above, that this actually happens is a critical prediction of theories of planning in extended domains, but has never been experimentally tested.

METHODS
Methods and any associated references are available in the online version of the paper at http://www.nature.com/natureneuroscience/.
Note: Supplementary information is available on the Nature Neuroscience website.

ACKNOWLEDGMENTS
We thank W. Yoshida and J. Oberg for help with data acquisition, and N. Daw and M. Guitart-Masip for their valuable and insightful comments on the manuscript. This study was supported by a Wellcome Trust Program Grant and Max Planck Award (R.J.D. and K.W.) and the Gatsby Charitable Foundation (P.D.). The Wellcome Trust Centre for Neuroimaging is supported by core funding from the Wellcome Trust (091593/Z/10/Z).

AUTHOR CONTRIBUTIONS
K.W. and P.D. conceived the study. K.W. designed the task, performed the experiments and analyzed the data. K.W., P.D. and R.J.D. wrote the paper.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Published online at http://www.nature.com/natureneuroscience/. Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Samuel, A.L. Some studies in machine learning using the game of checkers. IBM J. Res. Develop. 3, 210–229 (1959).
2. Sutton, R.S. & Barto, A.G. Reinforcement Learning: An Introduction (MIT Press, Cambridge, Massachusetts, 1998).
3. Shallice, T. Specific impairments of planning. Phil. Trans. R. Soc. Lond. B 298, 199–209 (1982).
4. O'Doherty, J.P., Dayan, P., Friston, K., Critchley, H. & Dolan, R.J. Temporal difference models and reward-related learning in the human brain. Neuron 38, 329–337 (2003).
5. Seymour, B. et al. Temporal difference models describe higher-order learning in humans. Nature 429, 664–667 (2004).
6. Gläscher, J., Daw, N., Dayan, P. & O'Doherty, J.P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66, 585–595 (2010).
7. Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P. & Dolan, R.J. Model-based influences on humans' choices and striatal prediction errors. Neuron 69, 1204–1215 (2011).
8. Tricomi, E., Balleine, B.W. & O'Doherty, J.P. A specific role for posterior dorsolateral striatum in human habit learning. Eur. J. Neurosci. 29, 2225–2232 (2009).
9. Tricomi, E.M., Delgado, M.R. & Fiez, J.A. Modulation of caudate activity by action contingency. Neuron 41, 281–292 (2004).
10. Tanaka, S.C., Balleine, B.W. & O'Doherty, J.P. Calculating consequences: brain systems that encode the causal effects of actions. J. Neurosci. 28, 6750–6755 (2008).
11. Hampton, A.N., Bossaerts, P. & O'Doherty, J.P. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367 (2006).
12. Hare, T.A., O'Doherty, J., Camerer, C.F., Schultz, W. & Rangel, A. Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors. J. Neurosci. 28, 5623–5630 (2008).
13. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B. & Dolan, R.J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
14. Lau, B. & Glimcher, P.W. Action and outcome encoding in the primate caudate nucleus. J. Neurosci. 27, 14502–14514 (2007).
15. Samejima, K., Ueda, Y., Doya, K. & Kimura, M. Representation of action-specific reward values in the striatum. Science 310, 1337–1340 (2005).
16. Boorman, E.D., Behrens, T.E., Woolrich, M.W. & Rushworth, M.F. How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron 62, 733–743 (2009).
17. Noonan, M.P. et al. Separate value comparison and learning mechanisms in macaque medial and lateral orbitofrontal cortex. Proc. Natl. Acad. Sci. USA 107, 20547–20552 (2010).
18. Basten, U., Biele, G., Heekeren, H.R. & Fiebach, C.J. How the brain integrates costs and benefits during decision making. Proc. Natl. Acad. Sci. USA 107, 21767–21772 (2010).
19. FitzGerald, T.H., Seymour, B. & Dolan, R.J. The role of human orbitofrontal cortex in value comparison for incommensurable objects. J. Neurosci. 29, 8388–8395 (2009).
20. Stephan, K.E., Penny, W.D., Daunizeau, J., Moran, R.J. & Friston, K.J. Bayesian model selection for group studies. Neuroimage 46, 1004–1017 (2009).
21. Balleine, B.W. & O'Doherty, J.P. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology 35, 48–69 (2010).
23. Redgrave, P. et al. Goal-directed and habitual control in the basal ganglia: implications for Parkinson's disease. Nat. Rev. Neurosci. 11, 760–772 (2010).
24. Daw, N.D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
25. Doya, K. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 12, 961–974 (1999).
26. Schultz, W., Dayan, P. & Montague, P.R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
27. Knutson, B. & Cooper, J.C. Functional magnetic resonance imaging of reward prediction. Curr. Opin. Neurol. 18, 411–417 (2005).
28. Berns, G.S., McClure, S.M., Pagnoni, G. & Montague, P.R. Predictability modulates human brain response to reward. J. Neurosci. 21, 2793–2798 (2001).
29. Yin, H.H., Ostlund, S.B., Knowlton, B.J. & Balleine, B.W. The role of the dorsomedial striatum in instrumental conditioning. Eur. J. Neurosci. 22, 513–523 (2005).
30. Sutton, R.S. First results with Dyna, an integrated architecture for learning, planning and reacting. in Neural Networks for Control (eds. Miller, T., Sutton, R.S. & Werbos, P.) 179–189 (MIT Press, Cambridge, Massachusetts, 1990).
31. Knutson, B., Taylor, J., Kaufman, M., Peterson, R. & Glover, G. Distributed neural representation of expected value. J. Neurosci. 25, 4806–4812 (2005).
32. Jueptner, M., Frith, C.D., Brooks, D.J., Frackowiak, R.S. & Passingham, R.E. Anatomy of motor learning. II. Subcortical structures and learning by trial and error. J. Neurophysiol. 77, 1325–1337 (1997).
33. Lehéricy, S. et al. Distinct basal ganglia territories are engaged in early and advanced motor sequence learning. Proc. Natl. Acad. Sci. USA 102, 12566–12571 (2005).
34. Barto, A.G. Adaptive critic and the basal ganglia. in Models of Information Processing in the Basal Ganglia (eds. Houk, J.C., Davis, J.L. & Beiser, D.G.) 215–232 (MIT Press, Cambridge, Massachusetts, 1995).
35. Valentin, V.V., Dickinson, A. & O'Doherty, J.P. Determining the neural substrates of goal-directed learning in the human brain. J. Neurosci. 27, 4019–4026 (2007).
36. Wunderlich, K., Rangel, A. & O'Doherty, J.P. Neural computations underlying action-based decision making in the human brain. Proc. Natl. Acad. Sci. USA 106, 17199–17204 (2009).
37. Tanaka, S.C. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat. Neurosci. 7, 887–893 (2004).
38. Padoa-Schioppa, C. & Assad, J.A. Neurons in the orbitofrontal cortex encode economic value. Nature 441, 223–226 (2006).
39. Chib, V.S., Rangel, A., Shimojo, S. & O'Doherty, J.P. Evidence for a common representation of decision values for dissimilar goods in human ventromedial prefrontal cortex. J. Neurosci. 29, 12315–12320 (2009).
40. Plassmann, H., O'Doherty, J. & Rangel, A. Orbitofrontal cortex encodes willingness to pay in everyday economic transactions. J. Neurosci. 27, 9984–9988 (2007).
41. Wunderlich, K., Rangel, A. & O'Doherty, J.P. Economic choices can be made using only stimulus values. Proc. Natl. Acad. Sci. USA 107, 15005–15010 (2010).
42. Kable, J.W. & Glimcher, P.W. The neural correlates of subjective value during intertemporal choice. Nat. Neurosci. 10, 1625–1633 (2007).
43. FitzGerald, T.H., Seymour, B., Bach, D.R. & Dolan, R.J. Differentiable neural substrates for learned and described value and risk. Curr. Biol. 20, 1823–1829 (2010).

ONLINE METHODS
doi:10.1038/nn.3068

Subjects. 21 healthy subjects (9 female, 18–35 years old) with no history of neurological or psychiatric illness participated in this study. None showed color vision deficiency in the Ishihara test. All subjects participated in 3-d learning of trained mazes and scan session 1 (see below). 20 of the 21 subjects participated in scan session 2. The Institute of Neurology (University College London) Research Ethics Committee approved the study.

Task. Our experiment consisted of four conditions: pure planning (P trials), extensively trained choices (E trials), choices requiring a comparison of planned and extensively trained values (PE trials), and choice between two extensively trained branches (EE trials).

Planning. Subjects navigated through a tree-shaped maze in search of maximal reward. Each state in the decision tree corresponded to a unique room in the maze, with state transitions implemented through left and right forward exit doors (backtracking was not possible). Depending on the chosen doors, subjects progressed along different branches in the tree maze until they reached a reward room at the end of each branch (Fig. 1b). All participants acquired correct mappings between room transitions and maze positions before functional imaging. Each reward room contained probabilistic reward, shown as a chest full with gold coins or an empty chest. The reward probabilities of all terminal rooms were clearly available to subjects throughout the trial as a display of eight numbers at the top of the screen. The reward probabilities fluctuated in discrete 0.1-wide steps between 0 and 1 and were shown to subjects as integer percentage numbers (in the range [0, 100]). Transitions from state to state within the maze (the spatial layout of the maze) were deterministic and constant throughout the entire experiment. However, the reward probabilities for the eight terminal states changed completely on every planning trial, thereby effectively preventing the successful application of model-free learning strategies.
To engage subjects in forward planning over and above a mere comparison of values, the choice at layer 2 in the tree was made by a deterministic value-minimizing computer agent. Before the experiment, subjects were explicitly instructed about the computer agent's choice rule and, to avoid subjects considering the computer a social agent, we emphasized that its choice rule would remain deterministic and predictable throughout the experiment. The best possible strategy in this task was to plan the transit through the maze using a minimax strategy45 to roll back state values44. This involves, already at the root choice, considerations of the choice at the third layer and the computer's choice in each of the two possible rooms in layer 2.

Choices in extensively trained contexts. Each of the four mazes consisted of one choice room with two doors and a reward room behind each door. Only one door led to probabilistic reward, and those contingencies never changed throughout the experiment. Different wall coloring (red, yellow, green and blue) provided distinct contexts in which subjects acquired separate value associations from a set of 15, 40, 65 or 90 percent reward contingencies. We counterbalanced mappings between color, reward contingencies and actions across subjects.

Training of values in colored mazes. To induce stable values in the colored mazes, we informed subjects that each color corresponds to a different maze with its own stable reward probabilities and then trained them on three consecutive days (720 trials in interleaved ordering) before the fMRI scan. We did not perform functional imaging during this training phase, but it is well established in numerous animal and human studies46 that such a task induces prediction error mediated learning.

Choices between planning and trained branches. Half of the decision tree was a planning branch with the same rules as in the planning maze; the doorframe at the root node of the other branch was colored, and its choice led into the trained maze of that color. This required subjects to directly compare a planned target value with one trained value. We matched transitions in the mixed branch to equate for time and effort traversing either branch. Note that this trial type did not provide subjects with an option to choose whether they prefer to engage in planning or a choice based on the previously trained mazes. Instead, rational choice always required performance of both a planning part to calculate values for the planning branch and a retrieval of a value for the trained branch, followed by a direct comparison between values from both systems.

Choice between two extensively trained branches. Finally, trials in a fourth condition involved a comparison between two learned values. The root room contained two colored doors, and choice of either door transitioned into the respective trained maze.

fMRI experiment. In scan session 1, we presented subjects with 96 P and 96 E trials, randomly intermixed, to measure choice related brain activity unique to either planning or decisions in extensively trained contexts. After a 15-min break outside the scanner, subjects participated in scan session 2, which contained 100 PE and 50 EE trials, intermixed. To prevent a deterioration of responses in the trained mazes (PE trials might stimulate formation of a new explicit value representation for each colored maze, inducing a change in strategy on subsequent E trials), we blocked our experiment into two parts and presented E and P trials first and PE and EE trials in a subsequent block. Subjects' payout was related to the earned rewards (£0.20 per reward during the fMRI session and £0.05 during training). In total, subjects accumulated approximately £60 in rewards (range £55–64).

Model predicted choice values. We used constant values of the true reward probabilities throughout the study. Due to the large number of training trials, and because subjects universally chose the better option toward the end of training, we can assume that the values subjects acquired during the training period had converged toward those true values for the colored mazes; trial-by-trial fluctuations in value at the time of the fMRI study would then be minimal due to a very small learning rate, adapted to the stable probabilities47.

Forward planning. We assumed that subjects would unroll values from the reward rooms (instructed on the screen) back to every prior state and then plan, in the root state, the optimal transit through the maze. We modeled this forward search with a maximizing strategy over available choices in the states under the subjects' control (layers 1 and 3) and a minimizing strategy in the states under the computer's control (layer 2), and calculated planned values for action a in each state s as V(s, a) = V(s'), where s' is the deterministic successor of taking action a in state s, V(s') = max_a' V(s', a') in layers under the subjects' control, V(s') = min_a' V(s', a') in the layer under the computer's control, and V at terminal states equals the instructed reward probability. The target value V_target is the planned value of the optimal root action.

Behavioral analysis. To investigate potential motivational influences (caused by a high target value) and difficulty based influences (originating from small differences between target and alternative values) on choice time, we regressed the logarithmic response time on the target value, the negative absolute value difference (−|V_target − V_alternative|) and trial number, separately for each trial type (Supplementary Table 1). Note that we neither instructed subjects to respond quickly nor was it the case that fast responses had any monetary benefit to subjects (except for finishing the experiment slightly sooner). We similarly analyzed the influence of these parameters on correct choice (Supplementary Table 1).

Stimuli. We programmed stimulus presentation in MATLAB using Cogent 2000 (http://www.vislab.ucl.ac.uk/cogent.php).

MRI data acquisition. Data were acquired with a 3T scanner (Trio, Siemens, Erlangen, Germany) using a 12-channel phased array head coil. Functional images were taken with a gradient echo T2*-weighted echo-planar sequence (repetition time = 3.128 s, flip angle = 90°, echo time = 30 ms, 64 × 64 matrix). Whole brain coverage was achieved by taking 46 slices in ascending order (2-mm thickness, 1-mm gap, in-plane resolution of 3 × 3 mm), tilted in an oblique orientation at −30° to minimize signal dropout in ventrolateral and medial frontal cortex. We also acquired a B0-fieldmap (double-echo FLASH, TE1 = 10 ms, TE2 = 12.46 ms, 3 × 3 × 2 mm resolution) and a high-resolution T1-weighted anatomical scan of the whole brain (MDEFT sequence, 1 × 1 × 1 mm resolution).

fMRI data analysis. We used SPM8 (rev. 4068; http://www.fil.ion.ucl.ac.uk/spm) for image analysis and applied standard preprocessing procedures: EPI realignment and unwarping using field maps, segmenting T1 images into gray matter, white matter and cerebrospinal fluid, using segmentation parameters to warp T1 images to the SPM Montreal Neurological Institute (MNI) template, and spatial smoothing of normalized functional data using an isotropic 8-mm full-width half-maximum Gaussian kernel.
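The minimax rollback described in the Planning and Forward planning sections can be sketched as follows. This is an illustrative reconstruction under the task structure stated above (three binary layers over eight terminal reward probabilities, with the computer controlling layer 2); the function and variable names are ours, not the authors' analysis code.

```python
# Illustrative minimax rollback for one planning trial.
# Layers 1 and 3 are chosen by the subject (max); layer 2 is chosen by
# the value-minimizing computer agent (min). Terminal values are the
# instructed reward probabilities (shown as percentages in the task).

def rollback(probs):
    """Return (V_target, V_root_alternative, layer2_values, layer3_values)."""
    assert len(probs) == 8
    # Layer 3 (subject): value of each layer-3 room = max over its two doors.
    layer3 = [max(probs[i], probs[i + 1]) for i in range(0, 8, 2)]   # 4 values
    # Layer 2 (computer): value of each layer-2 room = min over its two doors.
    layer2 = [min(layer3[i], layer3[i + 1]) for i in range(0, 4, 2)]  # 2 values
    # Layer 1 (subject, root): target value = max over the two branches.
    v_target = max(layer2)
    v_root_alternative = min(layer2)
    return v_target, v_root_alternative, layer2, layer3

# Made-up terminal probabilities, purely to illustrate the computation:
v_t, v_alt, l2, l3 = rollback([10, 40, 20, 90, 30, 50, 0, 70])
```

Because transitions are deterministic, the rollback reduces to alternating max/min operations over the tree, which is exactly the minimax evaluation the subjects were assumed to perform at the root.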
that such a task ( ) a u   in each state c l . a OSCI s c V ) . u target = k / 2 EN s p   , the ,    m C / E ­ - - ) s
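The prediction error mediated learning assumed for the colored training mazes can be illustrated with a simple Rescorla-Wagner style simulation. This is a sketch under our own assumptions (the learning rate value and the use of 720 trials per contingency are illustrative); it only demonstrates the standard point made in the Model predicted choice values section, that with a small learning rate and extensive training the learned values converge toward the fixed reward probabilities.

```python
# Sketch (not the authors' code): value learning via the standard
# prediction-error update V <- V + alpha * (r - V).
import random

def train_value(p_reward, n_trials=720, alpha=0.02, seed=0):
    """Simulate learning of one door's value from Bernoulli rewards.

    p_reward: fixed reward contingency of the door (e.g. 0.15 ... 0.90).
    alpha:    small learning rate (illustrative value).
    """
    rng = random.Random(seed)
    v = 0.0
    for _ in range(n_trials):
        r = 1.0 if rng.random() < p_reward else 0.0
        v += alpha * (r - v)   # prediction-error update
    return v

# Learned values end up close to the true contingencies used in the task:
learned = [train_value(p) for p in (0.15, 0.40, 0.65, 0.90)]
```

With alpha this small, trial-by-trial fluctuations around the asymptote are on the order of a few percentage points, matching the assumption that values were effectively stable by the time of scanning.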

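The response-time regression in the Behavioral analysis section can be sketched as an ordinary least-squares fit of log RT on the target value, the negative absolute value difference and trial number. The data below are synthetic and the variable names are ours; this only illustrates the regression design, not the authors' actual fitting code.

```python
# Sketch of the behavioral RT regression (synthetic data, assumed names).
import numpy as np

def rt_regression(rt, v_target, v_alternative, trial_number):
    """Regress log RT on value (motivation), -|value difference|
    (difficulty) and trial number; returns the OLS coefficients."""
    X = np.column_stack([
        np.ones_like(rt),                    # intercept
        v_target,                            # motivational regressor
        -np.abs(v_target - v_alternative),   # difficulty regressor
        trial_number,                        # time-on-task regressor
    ])
    beta, *_ = np.linalg.lstsq(X, np.log(rt), rcond=None)
    return beta  # [intercept, b_value, b_difficulty, b_trial]

# Synthetic check: RTs constructed to slow down when values are similar.
rng = np.random.default_rng(0)
vt = rng.uniform(0, 1, 500)
va = rng.uniform(0, 1, 500)
tr = np.arange(500.0)
rt = np.exp(1.0 + 0.5 * (-np.abs(vt - va)) + rng.normal(0, 0.01, 500))
```

A positive coefficient on the difficulty regressor corresponds to slower responses when the two option values are close, the pattern reported for the first-stage choice.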
Value modulated parametric analysis. We regressed the fMRI time series onto a composite GLM containing individual regressors representing the presentation of the root choice, the computer's second choice, the subject's choice/transition and the outcome. We modeled choice trials in all four conditions separately and further divided choices in the PE condition into plan chosen and train chosen trials. Additional regressors captured button presses and the motion parameters estimated from the realignment procedure. Regressors at the choice time and outcome were parametrically modulated by task relevant decision variables as described below. We did not apply orthogonalization when we entered regressors and parametric modulators into the design matrix, ensuring that the regressors of interest were not confounded with spurious correlations from signals pertaining to any of the other value signals48. We assessed statistical significance with a second-level random-effects analysis using a one-sample t test against zero on the effect sizes in individual subjects.
For the first scan session of the P trials, we hypothesized that the most salient value signals would be the value of the optimal path (target value) and the values of the two alternative branches that subjects pass along their way through the maze. We therefore expected to find neural representations of the target value and the alternative values at the root node and at the second choice, and, in the reward rooms, a response to the outcome. In the example shown in Figure 1a: V_target = 40, V_root_alternative = 20, V_deep_alternative = 30, and reward outcome = 100 on rewarded and 0 on non-rewarded trials. To investigate the temporal dynamics of value representations during planning over the entire trial, we modulated regressors at three time points: during the root choice, during the second choice in layer 3 and during presentation of the outcome. The regressor during outcome presentation was additionally modulated by the actual reward. There was a significant positive effect for the target value and a negative effect for the alternative values. Although the timing of the third choice and the outcome was fixed, the effects of expected value during choice (on a continuous scale) and of the response to the actual outcome (either 1 or 0) are still dissociable through the principle of competing variances in unorthogonalized regressors.
For the first scan session of E trials, we modulated regressors during presentation of the choice screen and at outcome with the true reward probability of the rewarding action (V_trained); the regressor at the time of the outcome was additionally modulated by the experienced reward. For the second scan session of the PE trials, we split trials according to subjects' choices and modeled plan chosen and train chosen trials separately. Regressors during choice were parametrically modulated with the target value in the planning branch (V_plan) and the value of the colored trained branch (V_trained), and the regressor at the time of the outcome was modulated by the experienced reward. For the second scan session of the EE trials, the regressor at presentation of the choice screen was modulated by the values of the chosen (V_chosen) and unchosen (V_unchosen) options, and the regressor at the time of the outcome was modulated by the experienced reward.
Notably, if there is a significant positive effect for a minuend (a) and a significant negative effect for a subtrahend (b), then a contrast testing for the difference a − b is necessarily also significant. Separate testing of the minuend and subtrahend components is therefore a more thorough test for a value difference representation than a direct regression against the difference: the latter might still be significant despite the signal actually being better explained by a alone, if a had a very strong effect.
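The construction of unorthogonalized parametric modulators described above can be sketched as follows. This is an illustrative simplification (no HRF convolution, assumed names); it shows only how each value regressor is mean-centered over its events and multiplied with the condition indicator, so that the regressors of interest compete for variance instead of being serially orthogonalized.

```python
# Sketch of unorthogonalized parametric modulator columns for a GLM.
import numpy as np

def parametric_columns(onsets, modulators):
    """onsets: 1 at event scans, 0 elsewhere (per-scan indicator).
    modulators: dict name -> per-scan parametric value (e.g. V_target).
    Returns stacked [onset, modulator, ...] columns, unorthogonalized."""
    o = np.asarray(onsets, dtype=float)
    ev = o > 0
    cols = [o]
    for v in modulators.values():
        v = np.asarray(v, dtype=float)
        # Mean-center the modulator over its events, zero elsewhere.
        cols.append(np.where(ev, v - v[ev].mean(), 0.0))
    return np.column_stack(cols)

# Toy example: three events carrying target values 40, 20 and 30.
onsets = np.array([0, 1, 0, 1, 0, 1], dtype=float)
v_target = np.array([0, 40, 0, 20, 0, 30], dtype=float)
X = parametric_columns(onsets, {"v_target": v_target})
```

Because no orthogonalization is applied, shared variance between modulators is assigned to neither regressor, which is the conservative property the analysis relies on.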

PPI analysis. We tested for choice-dependent changes in coupling between activity in caudate and putamen and other brain areas by estimating a GLM on the PPI term Y × P, with Y being the BOLD time course in either the caudate or the putamen ROI, and P an indicator variable for the times during which choices were made (P = 1 on plan chosen trials and −1 on trained chosen trials). We entered the seed region BOLD and the PPI interaction term, along with all regressors from our model-based parametric analysis (containing all value regressors), into a new GLM. Notably, this GLM also contained the parametric value signals for both branches, so any effect on the PPI would reveal increased coupling that could not be explained by the mutual correlation of seed and target region with the choice values. We computed this PPI both for a seed in caudate and for a seed in putamen, thereby separately identifying areas that showed a significant increase in coupling with both areas. The conjunction highlights common regions that played a role in mediating between both choice systems. Alternatively, we tested for choice-dependent changes in coupling, that is, areas that would differentially increase coupling with caudate on plan chosen trials but not on train chosen trials, and vice versa for putamen. This analysis did not reveal significant results anywhere in the brain, even at a lenient threshold of P < 0.005, uncorrected. However, when we added the interaction choice × value (to rule out the possibility that effects were solely due to mutual parametric correlations of seed and target areas with the choice values), we did not find significant remaining interactions (P < 0.005, uncorrected). We also tested the possibility that vmPFC correlated with the choice-dependent difference between activity in caudate and putamen by estimating a GLM on the PPI = Y × P, where Y = difference time course caudate − putamen; all value regressors were added as covariates of no interest to this PPI GLM.

ROI definition. Anterior caudate (x, y, z in mm): right: 9, 15, 3; left: −9, 15, 3; 6 mm radius, 66 voxels; sphere centered in the anterior caudate nucleus. Dorsolateral posterior putamen: right: 33, −24, 0; left: −33, −24, 0; 4 mm radius, 20 voxels; location based on a previous habit learning study. vmPFC: 0, 32, −13; 8 mm sphere, 65 voxels. Radii were chosen to fit anatomical boundaries.

ROI analysis. We analyzed value signals within a priori anatomically defined ROIs. For each region, we regressed our design matrix on a representative time course, calculated within the ROI as the first eigenvariate. This provides a very sensitive analysis, as only a single regression is performed per region (no multiple comparisons required).

Whole brain analysis. A whole brain parametric analysis confirmed a selective representation of planned target values during P trials in anterior caudate (Supplementary Fig. 4a) and of cached values during E trials within posterior putamen (Supplementary Fig. 4b). Besides precentral gyrus (putatively motor preparatory), we did not observe any other significant correlation with value signals outside of our a priori regions anywhere in the brain (FWE corrected) for any trial type.
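As a sketch of how a PPI interaction term enters a GLM, consider the toy example below. It uses invented data and ordinary least squares; the actual analysis was run in SPM, and a full PPI would additionally deconvolve the seed signal to the neural level before forming the interaction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                  # number of scans (toy)
y_seed = rng.standard_normal(n)          # seed ROI BOLD time course Y

p = np.zeros(n)                          # psychological variable P
p[:40] = 1                               # plan chosen periods
p[40:80] = -1                            # trained chosen periods

ppi = y_seed * p                         # interaction term Y x P

# GLM: target time course ~ intercept + seed + P + PPI
# (the paper additionally included all parametric value regressors)
X = np.column_stack([np.ones(n), y_seed, p, ppi])
y_target = rng.standard_normal(n)        # target region time course (toy)
beta, *_ = np.linalg.lstsq(X, y_target, rcond=None)
```

A reliable nonzero weight on the `ppi` column indicates seed-target coupling that changes with choice type, over and above the main effects of seed activity and choice.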

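The first eigenvariate used as the representative ROI time course can be sketched with a plain SVD on toy data; SPM's eigenvariate routine additionally adjusts for confounds and rescales the result:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((200, 66))       # scans x voxels within one ROI (toy)
Y = Y - Y.mean(axis=0)                   # center each voxel's time course

# Columns of U are orthonormal time courses ordered by explained variance;
# the first, scaled by its singular value, is the dominant shared signal.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
eigenvariate = U[:, 0] * s[0]
```

The design matrix is then regressed on this single summary time course, so each ROI yields one regression and no voxelwise multiple-comparison correction is needed.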
We analyzed value signals (results in Supplementary Fig. 2, Supplementary Fig. 4b, Supplementary Table 4 and Supplementary Table 5).

44. Friston, K.J., Rotshtein, P., Geng, J.J., Sterzer, P. & Henson, R.N. A critique of functional localisers. Neuroimage 30, 1077–1087 (2006).
45. Andrade, A., Paradis, A.L., Rouquette, S. & Poline, J.B. Ambiguous results in functional neuroimaging data analysis due to covariate correlation. Neuroimage 10, 483–486 (1999).
46. Bellman, R. On the theory of dynamic programming. Proc. Natl. Acad. Sci. USA 38, 716–719 (1952).
47. Behrens, T.E., Woolrich, M.W., Walton, M.E. & Rushworth, M.F. Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007).
48. Dickinson, A. & Balleine, B.W. The role of learning in the operation of motivational systems. in Stevens' Handbook of Experimental Psychology (eds. Pashler, H. & Gallistel, R.) 497–533 (John Wiley & Sons, New York, 2002).
49. von Neumann, J. & Morgenstern, O. Theory of Games and Economic Behavior (Princeton University Press, 1944).
50. Friston, K.J. et al. Psychophysiological and modulatory interactions in neuroimaging. Neuroimage 6, 218–229 (1997).

doi:10.1038/nn.3068