Modeling reward-guided decision-making with a biophysically plausible attractor network and the belief-dependent learning rule

Rebecca Sier

A master's project executed at the Neural Information Processing Group of the Technische Universität Berlin in partial fulfillment of the requirements for the Master of Science in Brain and Cognitive Sciences of the Universiteit van Amsterdam

January – September 2014
Supervisor: Audrey Houillon, PhD
UvA representative: Dr. Leendert van Maanen
Co-assessor: Dr. Leendert van Maanen
41 ECTS
Final date: September 24, 2014
Student ID: 5893011
Study track: Cognition

TABLE OF CONTENTS

Abstract 4
Acknowledgements 5

Introduction 6

Methods 9
  Theory on models of reward-guided decision-making 9
    Making decisions: an attractor model of perceptual decision-making 9
    Learning from decisions: the belief-dependent learning rule 10
    Approximations of the attractor model for reward-guided decision-making 13
  Experimental methods 15
    Participants 15
    Testing paradigm 15
    Behavioral and imaging data 17
  Computational model simulations 17
    The sigmoid probability function 17
    The reduced attractor model for decision-making 18

Results 22
  Training phase 22
    Estimating the noise parameter σ 22
    Learning behavior: age-independent 23
    Learning behavior: age-dependent 24
  Performance phase 26
    Fitting the noise parameter σ 27
    Age independent 27
    Age dependent 29
  Physiological results 31

Discussion 32
  Model performance 32


  Testing the model hypotheses by Frank et al. 33

References 35

Appendix 37
  Part A: surface plots for fitting the sigmoid function to training phase behavior 37
  Part B: learning curves and learning phase 39
    Learning curves with noise parameter σ = 0.0970 39
    Learning curves dependent on age, two values of sigma 39
  Part C: age independent performance behavior with σ = 0.15 41


ABSTRACT

Elderly people and patients suffering from Parkinson’s disease learn more from negative than from positive decision outcomes as compared to healthy controls. Previous research hypothesized that this error-avoidant behavior is partly caused by reduced levels of dopamine, and a model of this effect in the basal ganglia was constructed. However, the model is subject to discussion and hard to verify on the level of the basal ganglia. This thesis hypothesizes that the attractor model for decision-making combined with the belief-dependent learning rule can verify the basal ganglia model. It is shown how this combined model includes features of the basal ganglia model and explains them in greater detail. Simulations of the combined model make both correct and incorrect predictions of the reward-guided decision behavior and neurotransmitter values of subjects performing a probabilistic selection task. However, the empirical data are limited, and future research is needed to decide on the usefulness and the correct implementation of the model.


ACKNOWLEDGEMENTS

The research leading to this thesis was performed at the Neural Information Processing Group at the Technische Universität Berlin, during the Spring and Summer of 2014. The project was partly funded by Erasmus+ and the Spinozafonds of the Amsterdams Universiteitsfonds. I would like to thank the colleagues I had at the research group, who gave me a warm welcome to the TU Berlin and to their great city. Special thanks to Audrey Houillon, PhD, who supervised me greatly throughout the project, providing comments and asking me the right questions even during an Asian holiday. Thank you Prof. Dr. Klaus Obermayer for giving me the opportunity to work in your research group. And finally I thank co-supervisor Dr. Leendert van Maanen and the people from the University of Amsterdam’s research master Brain & Cognitive Sciences, who gave me the freedom to create my own, more computationally oriented program within the master.


INTRODUCTION

The ability to learn from the punishment or reward that follows a decision is an essential function of cognition. Consequently, reinforcement learning, or the ability to incorporate feedback to improve future actions, is a much-researched topic in psychology and neuroscience. A correct understanding of how reward-guided decision-making works in the brain is very useful, opening doors to better care for people with learning and decision-making disabilities and increasing fundamental knowledge of an essential part of human functioning. Deviations from normal reward-guided decision-making behavior provide first insights into its functioning. Compared to healthy controls, patients suffering from Parkinson’s disease (PD) off dopamine (DA) medication learn more from negative decision outcomes (Frank, Seeberger, & O’Reilly, 2004), while these patients learn less from positive decision outcomes. Similar results are found in older seniors, who have been shown to be more error-avoidant than younger seniors (Frank & Kong, 2008). Shared by both PD patients and elderly people is a depleted tonic DA level. When this deficit is made up for with DA medication, the learning bias towards negative outcomes is reversed: PD patients on DA medication learn more from correct than from incorrect decisions (Frank et al., 2004). Similarly, the DA hypothesis of cognitive aging states that cognitive changes or impairments related to aging are at least partly caused by a decline in DA level (Frank & Kong, 2008). Younger adults on drugs that reduce DA levels were shown to have the same error-avoidance bias as elderly people, while drugs that increase DA levels removed the negative bias (Frank & Kong, 2008). In line with the mentioned correlations between DA level and reward-guided decision-making, Frank et al. (Frank & Kong, 2008; Frank et al., 2004) hypothesized that DA plays a crucial role in reward-guided decision-making. Moreover, error-avoidant behavior is supposed to be partly caused by reduced levels of DA. Exactly how DA level can have this effect is described in a biologically based network model. In the model for reward-guided decision-making by Frank et al. (Frank & Kong, 2008; Frank et al., 2004; Frank, 2005) DA is supposed to play a modulatory role, influencing pathways through nuclei of the basal ganglia (BG) that are related to decision-making. According to the model, the “Go” and “NoGo” (or “direct” and “indirect”) pathways together act as a gate, respectively facilitating and suppressing the different competing response options (Mink, 2003; Nambu, Tokuno, & Takada, 2002; Nambu, 2004). Activity of a choice alternative’s “Go” pathway in the BG increases its probability of being chosen, while the corresponding “NoGo” pathway subtracts from this probability. With learning, DA modulates the pathways in order to change these probabilities differentially over the available stimulus-response options.


The BG model by Frank et al. correctly predicts how a lower tonic DA level disrupts reinforcement learning (Frank et al., 2004): small negative phasic DA changes suffice to reach the lower threshold for achieving synaptic plasticity, while the positive threshold is only reached in the case of large positive DA changes. This explains why PD patients, elderly people and healthy subjects on DA-reducing drugs learn more easily from negative than from positive reinforcement. The opposite is true for elevated amounts of tonic DA, which result in reward-seeking instead of error-avoidant decision-making. Despite the correct predictions the model by Frank et al. makes, it is hard to verify. The exact function of the BG is still under investigation and has been shown to be less clear than assumed in Frank et al.’s model (Utter & Basso, 2008). The two segregated pathways have been shown to be an oversimplification of the actual circuitry of the BG, even when a third “hyperdirect” pathway is included (Parent & Hazrati, 1995). A closer look at the anatomy of the BG shows how nuclei in the Go pathway are not strictly segregated from the NoGo pathway and vice versa. DA receptors D1 and D2 – which are supposed to be segregated per pathway according to the Frank et al. model – have been shown to be located at the inputs of both pathways. Due to its simplifications, the model by Frank et al. is as yet unable to give an analytical account of reward-guided decision-making. Model predictions of subjects’ learning and decision behavior are restricted to the qualitative level (Frank et al., 2004). In order to test the model predictions made by Frank et al., this thesis aims to construct and test an analytical model of reward-guided decision-making. Such a model requires detailed knowledge of the neural system, specifically when it comes to individual neurons’ electrochemical dynamics and interconnectivity. A biophysically plausible attractor model for perceptual decision-making created by Wang (2002) will be used, including parameters for individual neuron, neurotransmitter and synaptic gating dynamics. The model has been shown to successfully predict behavioral and electrophysiological decision-making data. A learning rule at the same, detailed level of analysis was created by Soltani, Lee and Wang (2006). In this paper it is hypothesized that a combination of the attractor model for decision-making and the learning rule can analytically predict reward-guided decision-making at the neural network level, capturing the effects of age and dopamine on learning behavior. This paper is structured as follows. The methods section first introduces the theory behind the attractor model for decision-making (Wang, 2002), the addition of a learning rule (Soltani et al., 2006) and the model approximations that are used in the current experiment. Secondly, the experimental paradigm that was used to empirically test the reward-guided decision-making behavior of human subjects is explained in detail. The empirical data consist of age-dependent reward-guided learning behavior and neurochemical activity assessed with magnetic resonance spectroscopy (MRS) and positron emission tomography (PET). The third part of the methods section provides the details of the computational model that was used to simulate the tested subjects’ behavior. Moreover, this section shows how the belief-dependent learning rule can be used in combination with a mean-field reduction of the attractor model for decision-making, which has not been done before.
After fitting behavioral data to the computational model, learning and decision behaviors are simulated and subsequently compared to the empirical data. Following the results section, the discussion serves to find how the created biophysically plausible model for reward-guided decision-making relates to the model by Frank et al.


METHODS

Theory on models of reward-guided decision-making

To construct and test an analytical model of reward-guided decision-making, an attractor model of perceptual decision-making (Wang, 2002) is used in combination with the “belief-dependent learning rule” (Soltani et al., 2006). However, since the attractor model needs about 7200 parameters, it is time-consuming to run simulations with it that are useful for making predictions. This disadvantage led to several approximations of the model. In this paper, two approximations are systematically used to model subjects’ decision behavior. To understand the use of these approximations, first the full attractor model and the belief-dependent learning rule are introduced, followed by the approximations themselves and the way in which they form part of the research.

Making decisions: an attractor model of perceptual decision-making. Experimental findings on physiological brain activity revealed several areas whose activity correlates with decision behavior, which Wang (2002) used to create his attractor model for decision-making. Specifically, the lateral intraparietal area (LIP) and parietal, prefrontal and premotor areas showed activity that correlates with the accumulation of stimulus information and the eventual decision choice (Hernández, Zainos, & Romo, 2002; Kim & Shadlen, 1999; Romo, Merchant, Zainos, & Hernández, 1997; Shadlen & Newsome, 1996, 2001). A model of decision-making should predict the characteristic activity found in these “decision areas”. Interestingly, the neurons in LIP and other decisional, information-accumulating areas show persistent elevated activity during a delay period between stimulus and response (Shadlen & Newsome, 1996, 2001). Once a decision is made, the result is maintained active for a while, ready for the decision to be executed at the right moment in time. This neural behavior of the LIP was also found during working memory tasks (Shadlen & Newsome, 2001), where working memory can be modeled with attractors – networks whose dynamics can end up in self-sustaining stable states. It led Wang to hypothesize that attractor networks can be manipulated to not only support working memory, but also perform perceptual evidence integration and categorical decision-making.


The recurrent network model is based on a combination of a neural network architecture with attractor dynamics (Amit & Brunel, 1997) and descriptions of synaptic currents (Wang, 1999). Details of the model can be found in Wang (2002). Summarized, the network model consists of N leaky integrate-and-fire neurons, of which 80% are excitatory and 20% inhibitory. The neurons are arranged in three excitatory pools and one inhibitory pool. Within each pool the neurons are recurrently connected. Since these connections are all excitatory, they amplify the pool’s activity and can sustain elevated activity. As shown in Figure 1, two of the excitatory pools are selective for choice A or B: neural activity in either of the selective pools increases the likelihood that the decision will be in favor of the corresponding alternative. The selective pools compete through the pool of inhibitory neurons: excitation of one of the selective pools inhibits the other selective pool, and vice versa. The third excitatory pool is non-selective. Each neuron in the model receives input and sends output through realistic AMPA, NMDA, and GABA receptors. In addition, all neurons receive a large amount of background Poisson inputs from afferent neurons, accounting for the stochasticity that is inherent to decision-making. Due to the recurrent connectivity within and mutual inhibition between the pools, one of the selective pools is bound to increase in activity, at the expense of activity in the other selective pool. At this point, the decision network has “chosen” the alternative whose selective pool has the highest activity. In other words, this architecture creates a winner-take-all dynamic, always ending up with one “winning” pool. Finally, the winning pool’s activity is sustained in an attractor. The model therefore achieves what Wang (2002) hypothesized: an attractor model is able to subsequently integrate evidence, form a categorical decision and keep the decision active for a period of time, as in working memory, in accordance with physiological measurements of the LIP and similar decision areas. Moreover, the model successfully replicated most of the psychophysical and physiological results given in Shadlen and Newsome (2001) and Roitman and Shadlen (2002).

Fig. 1. Schematic depiction of the attractor model for perceptual decision-making. Taken from Wang (2002).

Learning from decisions: the belief-dependent learning rule. Reward-guided decision-making is modeled by adding a reward-dependent learning rule (Soltani et al., 2006) to Wang's (2002) biophysically plausible attractor model for decision-making. For the model to learn, it is assumed that the synapses between afferent neurons and neurons from the selective pools are plastic (O’Connor, Wittenberg, & Wang, 2005). As depicted in Figure 2, reinforcement signals (e.g. phasic dopamine changes (Frank, 2005)) that follow a decision influence this plasticity by changing a fraction of the synapses from a depressed to a potentiated state, or vice versa. Decision-making depends on the states that the synapses in the selective pools are in. At a given moment in time, in each selective pool i a fraction c_i (the “synaptic strength”) of its synapses is in the potentiated state, while the rest (1 − c_i) of the synapses is in the depressed state. It is assumed that the firing rates of input neurons are similar for the two selective pools. Therefore, the difference in synaptic strengths of the two selective pools (c_A − c_B) determines which selective pool receives a larger net input current and is therefore more probable to win the decision race. The larger the synaptic strength of a selective neuron pool, the stronger the net input current received by the pool, the larger its activation, and the higher the probability for it to be the first of the pools to cross the decision threshold. The synaptic plasticity in the model is changed through Hebbian learning and gated by reinforcement (Soltani et al., 2006). In addition, synaptic plasticity depends on learning rates for rewarded and non-rewarded trials, q_r and q_n respectively. This means that when a decision is rewarded, potentiation of a synapse only happens when four conditions are met. First, for a synapse to be potentiated it needs to be part of the pool that is selective to the correct decision. Only synapses in this “correct” pool should be potentiated, since this pool is desired to become more strongly activated when confronted with the same, “correct” stimulus in the future. This increases the likelihood that the rewarding choice alternative will be chosen again. Second, both the pre- and post-synaptic neurons of the synapse need to be active. This condition of Hebbian learning makes sure that a synapse is potentiated when it contributed to the activation of its selective neuron pool, and thereby increased the likelihood for the correct alternative to be chosen. In learning it is desirable that on a future encounter the same, correct choice alternative is easily chosen again. This is achieved when the relevant synapse is potentiated, such that presynaptic activity easily passes through to the postsynaptic neuron, which is part of the selective neuron pool. Third, the synapse should be in the depressed state for it to be potentiated. Since synapses can only be in either the depressed or the potentiated state, and potentiated synapses cannot be potentiated any further, the depressed synapses are the only ones that can be potentiated.

Fig. 2. Schematic model architecture of how reward influences synaptic plasticity in the decision-making network. Taken from Soltani, Lee and Wang (2006).

As a fourth condition for a synapse to be potentiated, it should be part of the fraction q_r of depressed synapses within its selective pool. Since the trial is rewarded, the network learns with a learning rate q_r. A larger q_r means that a larger fraction of the depressed synapses in the selective pool is potentiated. As a result, the selective pool that corresponds to the rewarded option will be activated more strongly in a new encounter with the rewarded stimulus. In other words, a larger q_r makes the network learn faster. A special feature of this learning rule is that not only the synaptic strength of the selected neuron pool is updated, but also that of the other, unchosen pool. Neural activity in this pool is unwanted, since it would increase the probability of choosing the unrewarded alternative. Therefore, a fraction

q_r of the potentiated synapses in the selective neuron pool corresponding to the unselected choice alternative, whose pre- and post-synaptic neurons are active during the decision, is depressed. When the decision is not rewarded, a counteracting mechanism takes place, using the learning rate for unrewarded trials q_n. In this case, a fraction q_n of the potentiated synapses in the neuron pool that corresponds to the selected response option is depressed in accordance with Hebb’s rule. Simultaneously, a fraction q_n of the depressed synapses in the neuron pool corresponding to the unselected option is potentiated. Together, the rules form the belief-dependent learning rule:

A is selected and rewarded:

c_A(t + 1) = c_A(t) + (1 − c_A(t)) q_r
c_B(t + 1) = c_B(t) − c_B(t) q_r

A is selected but not rewarded:

c_A(t + 1) = c_A(t) − c_A(t) q_n
c_B(t + 1) = c_B(t) + (1 − c_B(t)) q_n.

Here, c_A and c_B are the synaptic strengths for the pools that are selective to choice alternatives A and B respectively, and t indicates the trial the network is in. On the one hand, when A is selected and rewarded, the new synaptic strength c_A(t + 1) for the pool that corresponds to choice alternative A is updated through an addition of newly potentiated synapses (1 − c_A(t)) q_r, i.e. the product of the fraction of depressed synapses (1 − c_A(t)) with the fraction of depressed synapses that are to be potentiated in the case of a reward, q_r. On the other hand, the synaptic strength c_B(t) of the pool that is not selected lessens by a loss of potentiated synapses c_B(t) q_r. Due to the double-update feature of this learning rule, in which the synaptic strengths of both the selected and the unselected neuron pools are updated, it was found to be similar to the reinforcement

learning model by Sutton and Barto (Soltani et al., 2006; Sutton & Barto, 1998). The double update ensures that the values given to all choice alternatives are stored in the synapses of the decision-making network, which is found to be a biophysically plausible scenario. Therefore, this rule was chosen to be combined with the attractor model for decision-making, together forming a model for reward-guided decision-making.
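To make the update concrete, the following Matlab sketch implements the belief-dependent update for a single trial. The function name, the representation of the synaptic strengths as plain scalars and the logical reward flag are illustrative choices made here, not part of the original implementation by Soltani et al. (2006).

% Belief-dependent learning rule for one trial (illustrative sketch).
% cSel, cUns : synaptic strengths of the selected and unselected pools
% rewarded   : logical flag, true if the trial was rewarded
% qr, qn     : learning rates for rewarded and non-rewarded trials
function [cSel, cUns] = beliefUpdate(cSel, cUns, rewarded, qr, qn)
if rewarded
    cSel = cSel + (1 - cSel) * qr;   % potentiate a fraction qr of the depressed synapses
    cUns = cUns - cUns * qr;         % depress a fraction qr of the potentiated synapses
else
    cSel = cSel - cSel * qn;         % depress the chosen pool
    cUns = cUns + (1 - cUns) * qn;   % potentiate the unchosen pool
end
end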

Approximations of the attractor model for reward-guided decision-making. In the full attractor model, the description of the activity of about 2000 neurons, including the synaptic gating dynamics of the noisy networks, requires about 7200 parameters (Wong & Wang, 2006). To actually simulate the model and create a set of results that is large enough to make useful predictions, Wong and Wang (2006) approximated the full model through a mean-field reduction and other simplifications of the synaptic currents and recurrent connections. A schematic representation of the full reduction is given in Figure 3. With the mean-field approach the net input to each neuron in the neural pools is treated as a Gaussian random process. This allows the mean activity of a neural population to be represented by a single unit that receives one input current (Wong & Wang, 2006). The mean-field approach further assumes the synaptic currents within the neuron pools to be constant, and the variance of the neurons’ membrane potentials is assumed to be caused mainly by the external input to each neuron. Contributions to the membrane potential due to recurrent connections within the neuron pools are assumed to average out, because of the averaging effects caused by the all-to-all connectivity and the long time constant of NMDA receptors. These approximations avoid the complex calculations that are associated with recurrent connections within the neuron pools.

Fig. 3. Schematic representation of the reduction of the full attractor model for decision-making. The full attractor model for decision-making (upper figure) requires 7200 parameters. A mean-field reduction and other simplifications of the synaptic currents and recurrent connections result in a reduced attractor model (lower figure) that requires 2 parameters for performing simulations. Taken from Wong & Wang (2006).

Other simplifications create a reduction of the model’s structure (Wong & Wang, 2006). First, the cells of the nonselective excitatory neuron unit (the mean-field reduced pool) are assumed to have a constant firing rate, since this firing rate was found to change only by a modest amount over a wide range of conditions. This approximation allows the nonselective excitatory unit to be left out of the model’s structure, reducing it to three neuron units. Second, an even smaller model structure is allowed when assuming that the input-output relations of the interneurons in the inhibitory neuron pool are linear. As such, the inhibitory system can be absorbed into direct, inhibitory effects of the two selective neuron units onto one another. Finally, the time constants of neuron membranes and synaptic gating variables are approximated. Since the neurons from the selective pools were found to fire instantaneously in response to a stimulus, the membrane time constant of the single cell is neglected. The time evolution of the system is approximated to be determined by the NMDA receptors, since this receptor's gating variable has the longest time constant of all receptors. The approximations reduce the number of parameters to two, thereby allowing for much less time-consuming simulations. Full details of the reduction are given in Wong and Wang (2006). The model itself is illustrated in the section “Computational model simulations”.

A second model reduction enables a direct calculation of the probability P_i of choosing decision alternative i. In Figure 4, the choice probability P_R for choosing right, calculated with the full attractor model (Wang, 2002), is plotted as a function of the difference in synaptic strengths (c_R − c_L). It shows that this relation forms a sigmoid. The choice probability P_R for option R only depends on the noise parameter σ and the difference between the synaptic strengths c_R − c_L of options R and L. The choice probability P_R of making decision R can therefore be expressed by a sigmoid function:

P_R = 1 / (1 + exp(−(c_R − c_L) / σ)).

Fig. 4. Choice behavior of the decision-making network as a function of the difference in synaptic strengths. Different overall synaptic strengths are represented by different symbols: c_R + c_L = 60% (plus), 100% (square), 140% (circle). A regression of the data points is shown with the red curve. Taken from Soltani et al. (2006).

Here, σ represents the randomness of the network’s choice behavior. A larger value of sigma denotes a network with more random choice or exploratory behavior, while a smaller sigma indicates less random choice or “exploitative” behavior. Its value is determined by the structure of the decision network: the number of presynaptic neurons N_p to each neuron in the selective pool, the firing rate of the presynaptic neurons f_st and the difference in peak conductances of the synapses in the up and down states (g_+ − g_−): σ = N_p f_st (g_+ − g_−). The sigmoid function can be used to avoid having to simulate the entire decision process (Soltani & Wang, 2006). A decision can be simulated simply by calculating the choice probability with the sigmoid function. The resulting probability determines the bias of a coin that is flipped to determine the choice of the network in that trial. Then, the feedback that follows from the decision, together with the belief-dependent learning rule, determines how the synaptic strengths of the relevant selective neuron pools are changed, which in turn changes the choice probability P_i of the following trial.
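In code, this shortcut amounts to only a few lines; the Matlab sketch below, with illustrative variable names and example values, computes P_R from the two synaptic strengths and flips the correspondingly biased coin.

sigma = 0.15;                           % noise parameter (example value)
cR = 0.6; cL = 0.5;                     % current synaptic strengths of the two pools
PR = 1 / (1 + exp(-(cR - cL) / sigma)); % sigmoid choice probability for option R
choseR = rand < PR;                     % biased coin flip decides the trial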

Experimental methods

Participants. Twenty-two healthy young participants (20–43 years, mean = 30 years, SD = 6.5) and 21 healthy older participants (45–80 years, mean = 65.4 years, SD = 10.2) took part in the study. Informed consent was obtained from all participants and the study was approved by the local ethics committee of the Charité University Hospital.

Testing paradigm. An adapted version of the probabilistic selection task from Frank et al. (2004) is used to test to what extent participants learn from positive versus negative decision outcomes. A schematic representation of the testing paradigm is given in Figure 5. On each trial participants are given one of three stimulus pairs (AB, CD, EF), from which they have to choose one of the two stimuli.


Feedback on the decision is probabilistic. For example, when choosing A the probability of receiving positive feedback (i.e. a reward) is 85%, while in those cases choosing B would be followed by negative feedback. In 15% of the cases A is the wrong pick and B is rewarded. Stimulus pairs CD and EF are less reliable: C is rewarded with a 75% chance, and E is rewarded in 65% of the cases. During the training phase the subject is supposed to learn through trial and error that stimuli A, C and E are the most likely to be rewarded. In the performance phase new stimulus combinations including either A or B are provided: AC, AD, AE, AF and BC, BD, BE, BF. Trials with stimulus A are given as often as trials with stimulus B. The subject again needs to pick the most rewarding stimulus. In this phase no feedback is provided. This testing paradigm makes it possible to measure to what extent participants learn from rewarding and non-rewarding trials. In the case that a subject learns more from reward than from errors, he or she should be more inclined to choose A than to avoid B in the performance phase. In the reversed case, the average number of performance phase trials in which B is avoided should exceed that of the trials in which A is chosen. In the final case, in which A is chosen as often as B is avoided during the performance phase, the subject learned as much from the rewards as from the errors that followed decisions in the training phase. The experiment was executed with a computerized version of the probabilistic selection task. The training phase consisted of 3 blocks of 90 trials, in which the three different stimulus pairs were presented in random order. The performance phase consisted of 36 trials for the first three subjects and 72 trials for all other subjects, in which the eight different stimulus pairs were presented in random order.

Fig. 5. A schematic representation of the testing paradigm.


Behavioral and imaging data. During the experiment participants’ decision behaviors in both the training and performance phases are recorded. The data make it possible to assess learning curves over training trials and to compute to what degree each subject learns from reward and error. During the performance phase reaction times are measured as well. In addition, the subjects were measured with MRS and fMRI during the learning phase and with fMRI and PET during the performance phase. The gathered data consist of MRS measurements of glutamate level in the anterior cingulate cortex, and PET measurements of dopamine level in the bilateral, left and right ventral striatum. BOLD-fMRI was performed on a 3T scanner (Siemens Trio system) and the signal extracted using a 5 mm sphere. Proton magnetic resonance spectroscopy (1H-MRS) at 3T was carried out in the same session, targeting absolute concentrations of glutamate, GABA and other metabolites. Measures were acquired using water-suppressed and unsuppressed spectra (128/8 averages, 90° flip angle, TE = 80 ms, TR = 3 s) and a 20x30x25 mm voxel individually placed in the ACC and striatum. Grey matter volume in the MRS voxels was acquired by segmentation of anatomical T1-weighted images into grey matter, white matter and cerebrospinal fluid, and extraction from individually shifted voxel positions. 6-[18F]fluoro-L-DOPA PET was carried out within one day of fMRI-MRS imaging to map dopamine synthesis capacity by calculating the net blood-brain clearance of the tracer. After realignment, correction and coregistration, the dynamic FDOPA emission recording was corrected for the brain-penetrating FDOPA plasma metabolite, using an inlet and outlet model.

Computational model simulations

To simulate subjects’ decision behaviors and reward-guided learning, the described sigmoid function, the reduced attractor model for decision-making (Wong & Wang, 2006) and the belief-dependent learning rule (Soltani et al., 2006) were implemented on a Linux (Ubuntu) workstation using MathWorks Matlab version R2012a.

The sigmoid probability function. First, the sigmoid was used to investigate how the model roughly learns over training trials and decides in the performance phase. The learning process is initialized with synaptic strengths c_i = 0.5 for i = [A, B, C, D, E, F]. In words, this means that all six stimuli are chosen with equal probability P_i. When confronted with a pair of stimuli AB, CD or EF, one of the stimuli is chosen according to the result of the flip of an unbiased coin. Feedback follows and, according to the belief-dependent learning rule, causes an update of the synaptic strengths of the selective neuron pools that correspond to the presented pair of stimuli. The changed synaptic strengths induce a bias towards one of the two stimuli, reflected in the choice probabilities P_i, such that the next trial with these stimuli amounts to the flipping of a biased coin. Each subject’s learning behavior was modeled by training the model with the stimulus sequence that was given to the subject during training. With each trial the synaptic strengths c_i are updated pairwise according to the belief-dependent learning rule. The final set of six synaptic strengths was used to predict behavior during the entire performance phase. This was again modeled with the sigmoid probability function.
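As an illustration of this procedure, the sketch below simulates a training phase with the sigmoid shortcut and the belief-dependent rule. The surrogate stimulus sequence, the reward schedule and the parameter values are placeholders; in the actual fits each subject’s own stimulus sequence and feedback were used.

% Sketch of one simulated training phase (illustrative values throughout).
rewardProb = [0.85 0.75 0.65];            % P(reward) for stimuli A, C and E
pairs = [1 2; 3 4; 5 6];                  % stimulus indices for the pairs AB, CD and EF
c = 0.5 * ones(1, 6);                     % initial synaptic strengths c_A ... c_F
sigma = 0.15; qr = 0.07; qn = 0.04;       % example parameter values
trialPairs = randi(3, 1, 270);            % surrogate for the subject's stimulus sequence
for t = 1:numel(trialPairs)
    p = trialPairs(t);
    best = pairs(p, 1); other = pairs(p, 2);
    Pbest = 1 / (1 + exp(-(c(best) - c(other)) / sigma));
    choseBest = rand < Pbest;                         % model's choice on this trial
    if choseBest
        sel = best; uns = other; rewarded = rand < rewardProb(p);
    else
        sel = other; uns = best; rewarded = rand < (1 - rewardProb(p));
    end
    if rewarded                                       % belief-dependent update
        c(sel) = c(sel) + (1 - c(sel)) * qr;  c(uns) = c(uns) - c(uns) * qr;
    else
        c(sel) = c(sel) - c(sel) * qn;        c(uns) = c(uns) + (1 - c(uns)) * qn;
    end
end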

Next to the relevant synaptic strengths c_i, the sigmoid function needs a noise parameter σ. Because the aim here was to find out how well the sigmoid function can predict subject behavior without any help from the reduced attractor model, the reduced attractor model was not used to provide data points on choice probabilities as a function of synaptic weights (as plotted in Figure 4). Consequently, no regression could be made to find the optimal σ, and a different method than sigmoidal regression was used to find a useful value of σ. The value of σ was estimated by computing the negative log likelihood of the subjects’ decisions in the training phase according to the choice probabilities calculated by the sigmoid function, which depend on the learning parameters q_r and q_n and the noise parameter σ. The learning parameters [q_r, q_n] were taken to range between 0.025 and 0.825 (Soltani et al., 2006), and the noise parameter σ took values between 0 and 0.5. Surface plots provided insight into which parameter values should result in an optimal fit. An analytical estimate of σ was then obtained by performing a maximum likelihood estimation (MLE), using the same negative log likelihood calculations. The MLE looks for the parameter combination that gives the best fit with respect to the behavioral data obtained from the experiment. Next to finding its optimal value, the MLE was used to find which time-dependent behavior of σ amounts to the best fit. For example, a better fit was expected if σ decayed during the training phase, reflecting how learning leads to higher confidence, less exploration and more exploitation. The value of σ was taken either (1) to be fixed and equal over both the training and performance phases, (2) to decay linearly or exponentially over the training phase trials and take the final training value for the performance phase, or (3) to be fixed and unequal for the two phases, with the σ for the training phase larger than for the performance phase. As stated in the following section, for simulating the reduced attractor model the optimal value of σ will be regressed through a sigmoid fit as in Figure 4. To be able to compare the simulations of the reduced attractor model with those of the sigmoid function, the latter are done not only with the estimated value of σ, but also with this regressed value of σ.
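A sketch of the negative log likelihood that such an MLE minimizes is given below, with the fixed-σ variant as an example. The function name, the way the subject’s choices are encoded and the call to fminsearch are illustrative assumptions rather than the exact implementation.

% Negative log likelihood of one subject's training choices (sketch).
% choices(t)/others(t) : indices of the chosen and unchosen stimuli on trial t
% rewards(t)           : logical flag, true if trial t was rewarded
function nll = trainingNLL(params, choices, others, rewards)
qr = params(1); qn = params(2); sigma = params(3);
c = 0.5 * ones(1, 6);
nll = 0;
for t = 1:numel(choices)
    sel = choices(t); uns = others(t);
    Psel = 1 / (1 + exp(-(c(sel) - c(uns)) / sigma));
    nll = nll - log(Psel);                            % likelihood of the observed choice
    if rewards(t)                                     % belief-dependent update
        c(sel) = c(sel) + (1 - c(sel)) * qr;  c(uns) = c(uns) - c(uns) * qr;
    else
        c(sel) = c(sel) - c(sel) * qn;        c(uns) = c(uns) + (1 - c(uns)) * qn;
    end
end
end
% e.g. fminsearch(@(p) trainingNLL(p, choices, others, rewards), [0.1 0.1 0.15])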

The reduced attractor model for decision-making. The subjects’ reward-guided decision behavior during the performance phase was modeled with the reduced attractor model for decision-making as well. Considering the higher level of (biophysically plausible) detail of this model, it is expected to give more accurate results than the sigmoid probability function. With the reduced attractor model the choice probability can be calculated as a function of the difference in synaptic strengths. The choice probability is found by averaging the model’s decisions over 100 simulations with the same initial conditions. For each difference in synaptic strengths (c_i − c_j) the probability of choosing i is calculated to create a plot like the sigmoid in Figure 4. From this plot a sigmoid regression was made to find the optimal value of σ.
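The regression itself can be done with a simple least-squares fit of the sigmoid to the simulated choice fractions, as sketched below; the data points are placeholders standing in for the fractions obtained from the reduced attractor simulations.

% Fitting the noise parameter sigma to simulated choice fractions (sketch).
dc    = -0.4:0.1:0.4;                                   % differences in synaptic strength
Pdata = [0.02 0.05 0.14 0.30 0.50 0.71 0.86 0.95 0.98]; % placeholder choice fractions
sse = @(s) sum((Pdata - 1 ./ (1 + exp(-dc / s))).^2);   % squared error of the sigmoid fit
sigmaHat = fminsearch(sse, 0.1);                        % regressed value of sigma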

However, it is inefficient to try to find the optimal learning parameters [q_r, q_n] with the reduced attractor model in the way it was done with the sigmoid function, since the reduced attractor model does not directly provide a choice probability that can be used in an MLE. The only way to find the choice probability with the reduced attractor model is by averaging over many decision instances simulated with the model. One decision takes the model about 0.13 seconds, which means that it would take the MLE an unreasonably long time, for this research, to search through the space spanned by the ranges of q_r and q_n for the optimal parameters. In addition, averaging over 100 instances turned out not to be enough, causing the MLE algorithm to get stuck in local maxima. Therefore, the optimal learning parameters q_r and q_n were estimated by conducting an MLE in which the choice probabilities were given by the sigmoid function. The only difference between this MLE and the MLE in the previous section is the availability of a regressed, optimal σ value. In this case it is therefore unnecessary to search for the optimal value of σ with an MLE. With the resulting optimal learning parameters, synaptic weights c_i were calculated per trial and per subject to find the correct input currents to the selective neuron pools in the reduced attractor model. Implementation of the reduced attractor model for decision-making is done similarly to the implementation in Hunt et al. (2012). To adapt it to the choice alternatives in the probabilistic selection task, the reduced model consists of six units, each selective to one of the stimuli i = A, B, C, D, E or F. A unit corresponds to the mean-field reduction of a selective neuron pool in Wang (2002). Each unit i has an excitatory recurrent coupling J_A,ii and an inhibitory coupling J_A,ij to other units j. A current I_syn,i(t) designates the net synaptic input to unit i at time t:

I_syn,i(t) = J_A,ii S_i − J_A,ij S_j + I_0 + I_input,i(t) + I_noise,i,

where S_i represents the NMDA synaptic gating variable of unit i, I_0 represents the synaptic input current from external inputs to both pools, I_input,i(t) represents the additional synaptic input current specific to unit i and I_noise,i represents the synaptic current due to background white noise, generated with an Ornstein-Uhlenbeck process that is filtered by the AMPA receptor’s decay time constant.

A new feature added to this model is the dependence of the synaptic input I_input,i(t) on time t.


With each new decision trial the synaptic weights c_i(t) are changed, resulting in a changed synaptic input to the selective pools i. In other words, the network learns, which needs to be reflected in the synaptic input current. Since the belief-dependent learning rule was originally implemented together with the full attractor model for decision-making, its use with the approximated version of the attractor model is redefined in this research. According to Soltani and Wang (2006) the conductivity G_i(t) of unit i is determined by its current synaptic strength c_i(t), the number of plastic synapses onto each neuron N_p, the firing rate of the presynaptic neurons f_st, the peak conductances of synapses in the potentiated and depressed states g_+ and g_−, and the decay time of AMPA currents τ_AMPA (Soltani & Wang, 2006):

G_i(t) = N_p f_st (c_i(t) g_+ + (1 − c_i(t)) g_−) τ_AMPA.

The conductivity G_i(t), combined with the peak voltage V, allows the selective external input to pool i to be calculated as I_input,i(t) = G_i(t) V.

As a next step in the algorithm, the total synaptic input I_syn,i(t) is used to find the firing rate in each selective neuron pool:

r_i(t) = f(I_syn,i(t)) = (a I_syn,i(t) − b) / (1 − exp(−d (a I_syn,i(t) − b))),

where a, b, and d determine the input-output relationship of a neuronal population (Wong & Wang, 2006).

The NMDA synaptic gating variables S_i for all populations i = 1, …, 6 represent slow synaptic currents due to NMDA receptor activation. This gating dynamic needs to be updated at every new time step t, using

dS_i/dt = −S_i/τ_NMDA + (1 − S_i) ξ f(I_syn,i),

where τ_NMDA is the NMDA receptor’s decay time constant and ξ represents a parameter that relates the presynaptic input firing rate to the synaptic gating variable. With a simulation of the above model, firing rates and total currents of all selective pools are calculated over time. When the firing rate of one of the selective pools crosses a decision threshold of 30 Hz, the decision is made in favor of that pool. In addition, the reduced attractor model makes it possible to calculate the reaction time (RT), i.e. the time between stimulus presentation and the moment a decision is made. To find how the experimental RTs relate to the simulated RTs, the RTs in the performance phase were calculated by averaging over 100 decision instances per stimulus pair and per individual.

Parameter values are taken from Hunt et al. (2012): J_A,ii = 0.3539, J_A,ij = 0.0966, I_0 = 0.3297 nA, noise amplitude I_noise,i = 0.009 nA, N_p = 4, f_st = 7.5 Hz, g_+ = 3.0 nS, g_− = 2.1 nS, τ_AMPA = 2 ms, τ_NMDA = 60 ms, ξ = 0.641, V = 0.1 V, a = 270, b = 108, d = 0.1540.
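For illustration, the sketch below integrates the above equations for a single trial with two competing units (the two presented stimuli). The Euler integration scheme, the unit conventions (currents in nA, conductances in nS, times in ms) and the restriction to two units are simplifying choices made here; the published implementations may differ in these details.

% Single-trial sketch of the reduced attractor model for one stimulus pair.
Jii = 0.3539; Jij = 0.0966; I0 = 0.3297;            % couplings and background current (nA)
noiseAmp = 0.009;                                   % noise amplitude (nA)
Np = 4; fst = 7.5e-3; gp = 3.0; gm = 2.1;           % plastic synapses: rate (kHz), conductances (nS)
tauAMPA = 2; tauNMDA = 60; gam = 0.641; V = 0.1;    % time constants (ms), gating factor, voltage (V)
a = 270; b = 108; d = 0.154;                        % f-I curve parameters
c = [0.7 0.4];                                      % learned synaptic strengths of the pair (example)
dt = 0.5; tMax = 2000;                              % time step and maximum trial duration (ms)
S = [0.1 0.1]; Inoise = [0 0]; choice = NaN; rt = NaN;
Iinput = Np * fst * (c * gp + (1 - c) * gm) * tauAMPA * V;   % selective input currents (nA)
for t = dt:dt:tMax
    Isyn = Jii * S - Jij * fliplr(S) + I0 + Iinput + Inoise; % net synaptic input to each unit
    x = a * Isyn - b;
    r = x ./ (1 - exp(-d * x));                               % firing rates (Hz)
    S = S + dt * (-S / tauNMDA + (1 - S) * gam * r / 1000);   % NMDA gating (rate converted to kHz)
    Inoise = Inoise + dt * (-Inoise / tauAMPA) ...
             + sqrt(dt / tauAMPA) * noiseAmp * randn(1, 2);   % Ornstein-Uhlenbeck background noise
    if max(r) > 30                                            % 30 Hz decision threshold
        [~, choice] = max(r); rt = t; break                   % decision and reaction time
    end
end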


RESULTS

Training phase

Behavior of the participants in the experiment is expressed as the likelihood of “chooseA” and “avoidB” trials. ChooseA trials are trials in which a participant was given stimulus A and decided in favor of that stimulus. In avoidB trials one of the two provided stimuli is stimulus B, and the participant chooses the other stimulus, thus avoiding stimulus B. The likelihoods of chooseA and avoidB trials are measures of how much a participant learns from reward and error, respectively. Similarly, one can express probabilities of chooseC, avoidD, chooseE and avoidF trials.

Estimating the noise parameter σ. The choice probability during the training phase is only simulated by use of the sigmoid function. As described in the methods section, the values of the parameters q_r, q_n and σ that provide the best fit of the sigmoid function to the subjects’ behaviors were guessed through inspection of surface plots and calculated with an MLE. Figure 6 shows surface plots for subject 1. It depicts the negative log likelihood of a sigmoid fit to the training decision behavior of subject 1, dependent on parameters q_r, q_n and σ. Surface plots for other subjects are given in the appendix (A). The surface plots dependent on q_r and σ show a small deflection which indicates a minimum around σ = 0.15 and q_r close to zero. However, the minimum NLL in the plot dependent on σ and q_n turns out to be rather problematic for estimating an overall minimum.

Fig. 6. Surface plots of the negative log likelihood of the decisions made by subject 1 during the training phase according to the sigmoid function, which depends on learning parameters q_r, q_n and noise parameter σ. The best fit with this model is obtained when using the parameters that amount to the lowest negative log likelihood.

In estimating the optimal parameter values with an MLE, the Akaike information criterion (AIC) showed that keeping σ fixed (during the entire training phase and for all subjects) at the estimated value of 0.15 results in the smallest penalty. As a result, every subject was modeled with σ = 0.15 and the values of q_r and q_n were fitted with the MLE per subject. The resulting set of parameters per subject was subsequently used to simulate each subject’s learning behavior.
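The comparison itself follows the standard AIC formula; the sketch below illustrates it with placeholder likelihood values and parameter counts for the three treatments of σ described in the methods section.

% Model comparison with the Akaike information criterion (sketch).
nll = [512.3 511.8 510.9];   % placeholder fitted NLLs: fixed, decaying and two-phase sigma
k   = [3 4 4];               % assumed number of free parameters per variant (qr, qn plus sigma terms)
aic = 2 * k + 2 * nll;       % AIC = 2k + 2*NLL
[~, best] = min(aic);        % variant with the smallest penalty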

Learning behavior: age independent. Probabilities for the subject to choose stimuli A, C or E and to avoid stimuli B, D and F change over trials, which is reflected by the changing synaptic strengths c_i for each of the neuron pools selective to stimuli i = [A, B, C, D, E, F]. This indicates how decision behavior is learned. In addition, the training phase only presents stimulus A in combination with stimulus B. During this phase, a change in the likelihood of chooseA trials therefore directly influences the likelihood of avoidB trials. Figure 7 shows both the experimental and simulated learning curves for choosing stimulus A, stimulus C and stimulus E, averaged over all subjects, for noise parameter σ = 0.15. Learning curves for the value of σ that was found through regression with the reduced attractor model are given in appendix (B). According to the experiment, the subject quickly learns that stimulus A has a higher probability of being rewarded than stimulus B. The same holds for learning to choose stimuli C and E over D and F respectively. Since stimulus E is less likely to result in a reward, learning to choose stimulus E is slower than learning to choose stimulus C, as expected. Moreover, the participants are slower to learn to choose stimulus C than stimulus A. By the end of the training phase all curves terminate in an asymptote close to the probabilities of the respective choice alternatives to result in a reward. Although the experimental learning curves depicted in Figure 7 are averages over all subjects, not all start at a probability of 0.5. However, this does not necessarily mean that the subjects are biased towards one of the stimuli, but rather that the number of subjects was too limited to filter out all noise in the data by means of averaging.

Fig. 7. Experimental and simulated learning curves of subject 1, showing the running average of choosing stimulus A (upper panel), stimulus C (middle panel) and stimulus E (lower panel) over training phase trials in which the stimulus pairs AB, CD and EF were presented respectively. σ = 0.15.

To simulate the learning curves, first the synaptic strengths c_i for i = [A, B, C, D, E, F] are calculated for each trial, using the belief-dependent learning rule, the subject’s most likely learning rates [q_r, q_n] and the sigmoid choice probability function. The initial set of synaptic weights was taken to be c_i = 0.5 for all i, which makes the decisions in which a stimulus pair is encountered for the first time unbiased. Then, the likelihood of choosing stimuli A, C or E is directly calculated with the sigmoid function. As in the experiment, the model learns that stimuli A, C and E are the most rewarding. Learning is fastest in the trials with stimulus pair AB and slowest in the trials with stimulus pair EF. The simulated curves end in an asymptote, where the subject is supposed to be done with learning.

Learning behavior: age dependent. In Figures 8 and 9 learning curves for different age groups are shown. Learning differences between subjects from the young and old age groups are not evident in the case of learning to choose stimuli A and C. In these cases both the learning rate and the final probability for choosing the stimuli are roughly equal for both groups.

However, a clear difference between learning by the younger and older age groups is found for learning to choose stimulus E. Clearly, both the experimental and simulated learning curves show that older subjects learn more slowly than younger subjects. In addition, older subjects get stuck at learning to choose stimulus E around a 70% chance level, while younger subjects on average learn to choose stimulus E with about 80% chance.

This result indicates that older people are capable of learning to choose a stimulus that was rewarded in the past, when the likelihood of that stimulus being rewarded is high. However, when a stimulus leads to a reward less reliably, as in the case of stimulus E, older subjects are less able to learn that it is the most profitable stimulus to choose.


Fig. 8. Learning curves according to experiment and to simulations with the sigmoid function, for choosing stimulus A (upper row) and stimulus C (lower row), averaged over all subjects in the old (left column) and the young group (right column).

This finding can be explained in terms of the DA hypothesis of aging as follows. Since older subjects are supposed to have a lower tonic DA level, each punishment that follows from choosing stimulus E results in a tonic DA level that is low enough for the synaptic strength c to be reduced. However, the more frequent case in which choosing stimulus E leads to a reward might not raise the DA level enough for the synapses to be potentiated in each of the rewarding cases. The larger probability of depression in the case of the less frequent punishments, combined with the smaller probability of potentiation in the case of the more frequent rewards, might balance out and therefore have a direct influence on the capability of older subjects to learn from rewards. However, this effect might not be as pronounced in the cases of choosing stimuli A and C, since these stimuli are rewarded in a larger proportion of cases. Thus, the balancing effect of punishing versus rewarding trials in the EF condition might not be present (as strongly) in the cases of choosing A and C.

Fig. 9. Learning curves according to experiment and to simulations with the sigmoid function, for choosing stimulus E, averaged over all subjects in the old (left column) and the young group (right column).

Performance phase

After learning in the training phase, the performance phase indicates how well subjects have learned to either avoid errors or seek rewards. This section serves to analyze the subjects’ behavior during the performance phase. Both the sigmoid function and the reduced attractor model are used here. Since the CD and EF combinations are less reliable in giving a reward or an error than stimulus pair AB, it is harder to know which of their stimuli is the better option. Since the number of trials in this study is relatively low, it is hard to get a reliable estimate of chooseC, avoidD, chooseE and avoidF behavior in the performance phase. Therefore, results on reward-guided learning are only assessed for stimuli A and B.

Fig. 10. The likelihood of choosing stimulus A as a function of the difference in synaptic strengths of the neuron pools selective for stimuli A and B. The likelihood was calculated by use of the reduced attractor model.

Fitting the noise parameter σ. As described in the methods section, the noise parameter σ is regressed by finding the sigmoid that best fits the data points calculated by the reduced attractor model. Figure 10 shows the regression, from which a value of σ = 0.0970 was obtained. Both values of σ obtained in this study (i.e. σ = 0.15 and σ = 0.0970) are used in simulations with both the sigmoid function and the reduced attractor model. This serves to compare the two models and their performance dependent on the noise parameter.

Age independent. Figure 11 qualitatively shows how well the sigmoid function (left column) and the reduced attractor model (right column) predict the subjects’ decision behaviors. The plots are obtained with σ = 0.0970. Simulation results with σ = 0.15 are given in appendix (C). Each circle in the scatter plots stands for a subject, positioned in the space spanned by the probabilities of chooseA or avoidB. Both models give predictions that are either close to the experiment or too high. Quantitatively, a paired Wilcoxon signed-rank test shows that the median of the experimentally found probability distribution of chooseA equals the median of the reduced attractor model’s distribution of chooseA probability (p = 0.3667). However, the two distributions were not found to be significantly correlated (p = 0.1003). The same holds for predictions made with the sigmoid function: the medians of the distributions were found to be equal (p = 0.5222), while there is no significant correlation with the experimental data (p = 0.1975). In sum, neither of the models is able to predict how individual subjects learned from reward, although the modeled distributions share their medians with the experimental distribution. The experimentally found distribution of the likelihood of avoidB trials does not share its median with the distributions of the sigmoid function (p = 0.0018) nor with those of the reduced attractor model (p = 0.0012). Also, there were no significant correlations found between the experimental data and either of the simulations (p = 0.2403 for the reduced attractor model; p = 0.0798 for the sigmoid function). In other words, the models are not able to simulate how subjects learn from punishments. The medians of the experimental chooseA and avoidB distributions are found to be significantly different from one another (p = 0.0352) and do not correlate (p = 0.2743). For 24 of the 43 subjects the probability of choosing stimulus A is larger than that of avoiding stimulus B. Also, the mean of choosing stimulus A over all subjects is larger than the mean of avoiding stimulus B. The sigmoid function confirms this finding (p = 0.0039), but the simulations by the reduced attractor model do not (p = 0.3579).
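For reference, comparisons of this kind can be run with Matlab’s Statistics Toolbox as in the sketch below; the per-subject probability vectors shown are placeholders for the experimental and modeled chooseA (or avoidB) probabilities.

% Sketch of the statistical comparisons used above (Statistics Toolbox).
pExp   = rand(43, 1);                   % placeholder: experimental chooseA probabilities
pModel = rand(43, 1);                   % placeholder: modeled chooseA probabilities
pMedian = signrank(pExp, pModel);       % paired Wilcoxon signed-rank test for equal medians
[rho, pCorr] = corr(pExp, pModel);      % correlation between data and model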


Fig. 11. The likelihood of chooseA (upper row) and avoidB (lower row) trials according to the sigmoid function (left column) and the reduced attractor model (right column) and according to the experiment. Each dot stands for a subject. Perfect modeling would position the dots near the diagonal line, where the experimental and modeled probabilities coincide. Both models are given a noise parameter σ = 0.0970.

Next to subjects’ chooseA and avoidB behaviors, the learning rate values indicate how subjects learn from reward and error. For 31 of the 43 subjects q_r (mean = 0.0716, SD = 0.0962) was found to be larger than q_n (mean = 0.0410, SD = 0.0438). In other words, 31 subjects learn faster after positive than after negative feedback. Figure 12 shows how the reduced attractor model is able to simulate the experimentally found reaction times. It shows that according to both the simulation and the experiment, reaction times are smallest when subjects need to pick a stimulus from the stimulus pairs AD and BC. This agrees with expectations, since these stimulus pairs are the easiest to choose from. It is easy to choose stimulus A, since A has an 85% probability of resulting in a reward, while this probability is only 25% in the case of stimulus D. Similarly, the probability of obtaining a reward when choosing C is 75%, while B would only give a reward in 15% of the cases. A paired Wilcoxon signed-rank test showed that no relation exists between the simulated and experimentally found RT distributions. Also, no correlations were found between experimental results and predictions per condition. However, a correlation was found when comparing the reaction times of all conditions together (p = 2.1859 × 10^−6). The median of the simulation is found to be significantly different from that of the experiment. Therefore, the reduced attractor model seems able to capture the overall pattern of reaction times, while a suboptimal set of parameter values might shift the simulated distribution, resulting in a different median.

Fig. 12. Simulated (reduced attractor) versus experimentally found reaction times for each subject.

Age dependent. To test whether the experimental and modeling results reflect the DA hypothesis of aging, the results are checked for aging effects. Specifically, it is interesting to look at how learning from error and reward depends on age. Subjects are divided by the median into two age groups: a younger group with ages ranging from 20 to 43, and an older group in which ages range from 45 to 80. Figure 13 shows how subjects from both age groups learn from reward and error, respectively reflected by the probabilities of chooseA and avoidB trials. Older subjects are hypothesized to have a smaller probability of choosing stimulus A than of avoiding stimulus B, while younger subjects are expected to be biased towards choosing A instead of avoiding B.


Fig. 13. Age dependent decision behavior, showing the likelihood of avoidB versus the likelihood of chooseA according to the experiment (top), the sigmoid function (middle) and the reduced attractor model (bottom).

A Wilcoxon rank sum test reveals that the chooseA probability distributions of younger and older subjects in the experiment have significantly different medians (p = 0.0186), indicating that younger subjects learn differently from rewards than older subjects do. Simulations with the sigmoid function confirm the significant difference (p = 0.0356), while the simulations with the reduced attractor model do not (p = 0.1400). The means of the data from the experiment (mean_young = 0.8709, SD_young = 0.1946; mean_old = 0.7599, SD_old = 0.1998), from the sigmoid function (mean_young = 0.8898, SD_young = 0.0331; mean_old = 0.8514, SD_old = 0.0721), and from the reduced attractor model (mean_young = 0.7765, SD_young = 0.0819; mean_old = 0.7619, SD_old = 0.0792) are all higher for the younger group than for the older group. In other words, younger subjects are found to learn significantly more from rewards than older subjects, as expected. For the probability of avoiding stimulus 퐵, no significant difference between the medians of the probability distributions of the two age groups was found in the experiment (p = 0.1369), nor in the simulations with the reduced attractor model (p = 0.3442), indicating that the young participants do not show a different tendency to learn from punishment than the subjects in the old group. In contrast, the sigmoid function did show a significant median difference between the avoidB behavior of the young and old groups (p = 0.0297), with a mean probability of avoidB that was larger for the younger than for the older group (mean_young = 0.8914, SD_young = 0.0327; mean_old = 0.8524, SD_old = 0.0721).

According to Wilcoxon signed-rank tests, the chooseA distributions of the younger and older groups as created by the sigmoid function have the same median as the corresponding distributions in the experimental data (p = 0.4651 for the young group, p = 0.1492 for the old group). The sigmoid function therefore seems to give a good indication of the chooseA probability per age group. The same holds for the young avoidB distribution (p = 0.1579). However, a significant difference between experiment and sigmoid predictions is found for the avoidB behavior of the old group (p = 0.0046). The chooseA distribution following from the reduced attractor model shares its median with the experimental distribution in both the young (p = 0.9107) and the old (p = 0.1538) age group. For avoidB, this only holds for the young age group (p = 0.0902); the avoidB distribution of the older age group differs significantly from that of the experimental data (p = 0.0070).

Furthermore, within the old and young groups, the chooseA and avoidB distributions from the experimental data were found to have the same median (p = 0.2664 for the older group; p = 0.0764 for the younger group), meaning that the chooseA behavior of a subject in either age group is not significantly different from its avoidB behavior. The same was found in the reduced attractor simulation results (p = 0.0991 for the older group; p = 0.8877 for the younger group). The sigmoid function only showed the same median for the older group (p = 0.1563), while the medians of the chooseA and avoidB distributions were found to differ significantly for the younger group (p = 0.0313).
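
As an illustration of how such comparisons could be carried out in practice, the sketch below (Python with SciPy, which is an assumption about tooling; the numbers are placeholders, not the thesis data) runs an unpaired rank sum test between age groups and a paired signed-rank test between experiment and model.

import numpy as np
from scipy import stats

# Hypothetical per-subject chooseA probabilities (placeholder values, not the
# thesis data), split at the median age.
choose_a_young = np.array([0.95, 0.88, 0.91, 0.72, 0.99, 0.85])
choose_a_old = np.array([0.80, 0.64, 0.77, 0.58, 0.83, 0.71])

# Between-group comparison: Wilcoxon rank sum test on independent samples.
stat, p_groups = stats.ranksums(choose_a_young, choose_a_old)
print(f"young vs. old chooseA: p = {p_groups:.4f}")

# Experiment vs. model comparison for the same subjects: paired Wilcoxon
# signed-rank test on matched per-subject values.
choose_a_model_young = np.array([0.90, 0.86, 0.93, 0.78, 0.95, 0.88])
stat, p_paired = stats.wilcoxon(choose_a_young, choose_a_model_young)
print(f"experiment vs. model (young group): p = {p_paired:.4f}")

The rank sum test compares the two independent age groups, while the signed-rank test pairs each subject's experimental value with the corresponding model prediction.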

Physiological results

No correlations were found between the measured glutamate and dopamine levels, the experimental data on decision behavior, and the learning rate parameters. When splitting the data by age, again no correlations were found between the biophysical measurements, the behavioral data and the learning rate parameters. Furthermore, age comparisons made with the Wilcoxon rank sum test showed no significant differences in the medians of the bilateral dopamine level distributions (p = 0.7246), the left dopamine distributions (p = 0.1624) or the right dopamine distributions (p = 0.9516).
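
Such correlation checks could, for example, be performed with a rank correlation across subjects. The sketch below is a minimal illustration (Python with SciPy; both variable names and values are hypothetical placeholders rather than the measured data):

import numpy as np
from scipy import stats

# Hypothetical per-subject values (placeholder numbers, not the measured data).
dopamine_level = np.array([2.1, 1.8, 2.5, 1.6, 2.9, 2.2, 1.9, 2.4])
q_r_fitted = np.array([0.05, 0.03, 0.09, 0.02, 0.11, 0.07, 0.04, 0.08])

# Spearman rank correlation between dopamine level and the reward learning rate.
rho, p_value = stats.spearmanr(dopamine_level, q_r_fitted)
print(f"rho = {rho:.2f}, p = {p_value:.4f}")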


DISCUSSION

To test the model predictions on reward-guided decision-making made by Frank et al. (Frank & Kong, 2008; Frank et al., 2004; Frank, 2005), a biophysically plausible network model for reward-guided decision-making was created and empirically tested. To create the model, a mean-field reduction of the biophysically plausible attractor model for decision-making by Wang (2002) was combined with the belief-dependent learning rule by Soltani et al. (2006). In addition, the reduced attractor model provided a sigmoid fit of the decision probability, which can be used to make fast predictions of decision and learning behavior, without the need to simulate the entire decision process. Both modeling attempts were implemented computationally and used to simulate the choice behavior of participants that perform a probabilistic selection task.

Model performance

It was hypothesized that the reduced attractor model for reward-guided decision-making could predict reward-guided learning and decision behavior of participants performing the probabilistic selection task. The reduced attractor model was expected to give more accurate predictions than the sigmoid probability function, since the latter is a simplification of the reduced attractor model that can only predict the outcome of a decision process, instead of simulating the decision process itself. However, both models gave poor predictions of the participants’ decisions, while at the same time giving good predictions of learning from reward and error. Both the sigmoid function and the reduced attractor model were shown to correctly predict how the average subject quickly learns to make decisions that were rewarding in the past. In addition, the model predictions replicate how, after learning, the probability of choosing a decision alternative depends on that alternative’s chance of being rewarded. When differentiating between older and younger subjects, both models predict the experimental finding that it is harder for old subjects than for young subjects to learn to choose a decision alternative when this alternative is unreliable but most likely to result in a reward. Simulations confirm the experimental finding that one learns differently from rewarding decisions than from decisions that are followed by an error signal. On average over all participants, the learning rates that follow a reward (푞푟) are found to be higher than the learning rates that follow a punishment (푞푛).

In addition, younger subjects are found to learn significantly more from rewards than older subjects. However, this difference in learning behavior between age groups was not found for learning from punishments.

The reduced attractor model and the sigmoid probability function were not able to model decisions in the performance phase. However, since the reaction time distributions and the learning curves were estimated correctly by the models, the poor predictions during the performance phase do not necessarily indicate that the model is incorrect. The poor performance might be due to an incorrect estimation of the reduced attractor model’s biophysical parameters. The biophysical parameters used here, for example the number of presynaptic neurons 푁푝 per selective neuron or the parameters for calculating the firing rate, were taken from a study by Hunt et al. (2012). In that study a probabilistic selection task similar to the one used in the present study was employed, and decision-making was likewise modeled with the reduced attractor model; the biophysical parameters from Hunt et al. (2012) were therefore adopted here. However, Hunt et al. (2012) did not incorporate reward-guided learning in their simulations, and therefore did not model the reward guidance that was modeled in the current study. In addition, Hunt et al. (2012) used behavioral measurements to calculate the input to the selective neuron pools in the reduced attractor model, while the current study used learning rates, synaptic strengths and conductance measures dependent on these synaptic strengths to determine the input to each of the selective neuron pools. Because of these differences, a useful step for future research in reward-guided decision-making would be to reevaluate the biophysical parameter values that should be used with the current experimental design and simulations. Another suggestion for future research is to let biophysical parameters, such as the synaptic gating constants for AMPA and DA receptors, be free parameters that are fitted to behavior in the training phase with MLE. The physiological measurements of dopamine and glutamate might correlate with these values, and thereby give more insight into the different learning behaviors of younger and older participants and the biophysical parameters that might relate to them.
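
As a rough illustration of how such free parameters could be fitted, the sketch below (Python; the sigmoid form, optimizer and data are assumptions made for this example, not the exact procedure used in this thesis) minimizes the negative log-likelihood of a training-phase choice sequence over the two learning rates.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, sigma=0.15):
    # Negative log-likelihood of a two-alternative choice sequence under a
    # sigmoid choice rule with reward/punishment learning rates q_r and q_n.
    # choices[t] is 1 if alternative A was chosen (0 for B); rewards[t] is 1
    # if that choice was rewarded.
    q_r, q_n = params
    c_A, c_B = 0.5, 0.5                      # initial synaptic strengths
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        p_A = 1.0 / (1.0 + np.exp(-(c_A - c_B) / sigma))
        p = p_A if choice == 1 else 1.0 - p_A
        nll -= np.log(max(p, 1e-12))
        # update the synapses of the chosen alternative (simplified sketch)
        c = c_A if choice == 1 else c_B
        c = c + q_r * (1 - c) if reward else c - q_n * c
        if choice == 1:
            c_A = c
        else:
            c_B = c
    return nll

# Hypothetical choice/reward sequence for one stimulus pair (placeholder data).
rng = np.random.default_rng(1)
choices = rng.integers(0, 2, size=60)
rewards = rng.integers(0, 2, size=60)
fit = minimize(neg_log_likelihood, x0=[0.05, 0.05], args=(choices, rewards),
               bounds=[(0.0, 1.0), (0.0, 1.0)], method="L-BFGS-B")
print("fitted q_r, q_n:", fit.x)

Extending the parameter vector with the biophysical constants of interest would allow the same machinery to fit them per subject, after which the fitted values could be correlated with the physiological measurements.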

Testing the model hypotheses by Frank et al.

Frank et al. hypothesized that differences in reward-guided learning are (at least in part) caused by differences in tonic DA level. According to their BG model for decision-making, learning to either seek rewarding choice alternatives or to avoid non-rewarding alternatives is driven by the phasic DA changes that follow a reward or an error, respectively. Rewards result in a peak in DA level, which, upon crossing a positive threshold, leads to long term potentiation (LTP) of the plastic synapses that actively contributed to the decision. In contrast, a lower DA threshold must be crossed to cause long term depression (LTD), as a result of the phasic DA dip that follows an error. The different learning behaviors that were found in patients suffering from PD (Frank et al., 2004) and in elderly people (Frank & Kong, 2008) were hypothesized to be caused by their relatively low tonic DA levels. As a result, negative phasic dopamine changes easily cross the negative threshold for LTD, while DA changes in the positive direction need to be relatively large to cross the positive threshold for LTP. This might explain how PD patients and elderly people show more error-avoidant and less reward-seeking behavior than people with normal or high tonic DA levels.

The lack of significant differences between the DA levels of the two age groups in the experiment makes it impossible to confirm Frank et al.’s hypotheses through the biophysically plausible network model for reward-guided decision-making created in this research. A significant difference in DA level might only be found when comparing older seniors (60 to 70 years) to very old seniors (>70 years), as was done in (Frank & Kong, 2008). Unfortunately, the sample in the present study was too small to make useful predictions with this age differentiation. Moreover, the age groups used here, divided at 43 years, show mixed results: while the young and old groups do show a significant difference in reward-seeking behavior, this is not the case for error-avoidant behavior.

However, it is possible to theoretically relate the model by Frank et al. to the biophysically plausible reward-guided decision-making model created here. The central, modulatory role given to DA by Frank et al. can be retraced in the belief-dependent learning rule. Synaptic changes due to phasic DA change should coincide with the adaptation of the synaptic strengths of the selective neuron pools. Accordingly, the tonic DA level that influences the degree to which one can learn from rewards and errors should be reflected in the two learning rates 푞푟 and 푞푛. For the decision-making process itself, each pair of Go and NoGo pathways in the BG corresponds to one of the selective neuron pools in the attractor model. On the one hand, lateral inhibition between the selective neuron pools can be traced back to lateral inhibition between the cortical response units in the model by Frank et al. On the other hand, recurrent activity within the neuron pools compares to the feedback loop between the output nucleus of the BG and the cortical response units, which strengthens activity in both locations in a nonlinear way (Frank, 2005).

Future research is needed to either confirm or reject the proposed relations between the model by Frank et al. and the model that was presented in this study. Experimentally, more and especially older participants are needed to replicate the study by Frank & Kong (2008) and relate the results to the model for reward-guided decision-making.
Furthermore, it is advised to reconsider the biophysical parameters that are used in the model, adjusting them to the characteristics of stimuli in the experimental task given to participants.
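
To make the threshold mechanism described above explicit, the following toy sketch (Python; all thresholds and magnitudes are arbitrary illustration values, not parameters from Frank et al.’s implementation) shows how a low tonic DA level biases plasticity towards LTD and hence towards error-avoidant learning.

def synaptic_change(tonic_da, phasic_da, ltp_threshold=0.8, ltd_threshold=0.2,
                    ltp_step=0.05, ltd_step=0.05):
    # A reward adds a phasic DA burst to the tonic level; crossing the high
    # LTP threshold strengthens the synapses that drove the choice. An error
    # produces a phasic dip; falling below the low LTD threshold weakens them.
    da = tonic_da + phasic_da
    if da > ltp_threshold:
        return +ltp_step          # long-term potentiation
    if da < ltd_threshold:
        return -ltd_step          # long-term depression
    return 0.0

# A low tonic DA regime (elderly / PD-like) versus a normal regime.
for tonic in (0.3, 0.5):
    burst = synaptic_change(tonic, +0.35)   # phasic burst after a reward
    dip = synaptic_change(tonic, -0.35)     # phasic dip after an error
    print(f"tonic DA {tonic}: reward -> {burst:+.2f}, error -> {dip:+.2f}")

In the combined model presented here, the asymmetry produced by such thresholds would be expected to show up as a lower 푞푟 relative to 푞푛 for subjects with low tonic DA.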


REFERENCES

Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7(3), 237–52. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/9143444
Frank, M. J. (2005). Dynamic dopamine modulation in the basal ganglia: a neurocomputational account of cognitive deficits in medicated and nonmedicated Parkinsonism. Journal of Cognitive Neuroscience, 17(1), 51–72. doi:10.1162/0898929052880093
Frank, M. J., & Kong, L. (2008). Learning to avoid in older age. Psychology and Aging, 23(2), 392–8. doi:10.1037/0882-7974.23.2.392
Frank, M. J., Seeberger, L. C., & O’Reilly, R. C. (2004). By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science, 306(5703), 1940–1943.
Hernández, A., Zainos, A., & Romo, R. (2002). Temporal evolution of a decision-making process in medial premotor cortex. Neuron, 33(6), 959–72. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11906701
Hunt, L. T., Kolling, N., Soltani, A., Woolrich, M. W., Rushworth, M. F. S., & Behrens, T. E. J. (2012). Mechanisms underlying cortical activity during value-guided choice. Nature Neuroscience, 15(3), 470–6, S1–3. doi:10.1038/nn.3017
Kim, J. N., & Shadlen, M. N. (1999). Neural correlates of a decision in the dorsolateral prefrontal cortex of the macaque. Nature Neuroscience, 2(2), 176–85. doi:10.1038/5739
Mink, J. W. (2003). The basal ganglia and involuntary movements: impaired inhibition of competing motor patterns. Archives of Neurology, 60(10), 1365–1368.
Nambu, A. (2004). A new dynamic model of the cortico-basal ganglia loop. Progress in Brain Research, 143, 461–6. doi:10.1016/S0079-6123(03)43043-4
Nambu, A., Tokuno, H., & Takada, M. (2002). Functional significance of the cortico-subthalamo-pallidal “hyperdirect” pathway. Neuroscience Research, 43(2), 111–7. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12067746
O’Connor, D. H., Wittenberg, G. M., & Wang, S. S.-H. (2005). Graded bidirectional synaptic plasticity is composed of switch-like unitary events. Proceedings of the National Academy of Sciences of the United States of America, 102(27), 9679–84. doi:10.1073/pnas.0502332102
Parent, A., & Hazrati, L. N. (1995). Functional anatomy of the basal ganglia. I. Brain Research Reviews, 20, 91–127.
Roitman, J. D., & Shadlen, M. N. (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. The Journal of Neuroscience, 22(21), 9475–89. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12417672
Romo, R., Merchant, H., Zainos, A., & Hernández, A. (1997). Categorical perception of somesthetic stimuli: psychophysical measurements correlated with neuronal events in primate medial premotor cortex. Cerebral Cortex, 7(4), 317–26. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/9177763
Shadlen, M. N., & Newsome, W. T. (1996). Motion perception: seeing and deciding. Proceedings of the National Academy of Sciences of the United States of America, 93(2), 628–33. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=40102&tool=pmcentrez&rendertype=abstract
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86(4), 1916–36. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11600651
Soltani, A., Lee, D., & Wang, X.-J. (2006). Neural mechanism for stochastic behaviour during a competitive game. Neural Networks, 19(8), 1075–90. doi:10.1016/j.neunet.2006.05.044
Soltani, A., & Wang, X.-J. (2006). A biophysically based neural model of matching law behavior: melioration by stochastic synapses. The Journal of Neuroscience, 26(14), 3731–44. doi:10.1523/JNEUROSCI.5159-05.2006
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Utter, A. A., & Basso, M. A. (2008). The basal ganglia: an overview of circuits and function. Neuroscience and Biobehavioral Reviews, 32(3), 333–42. doi:10.1016/j.neubiorev.2006.11.003
Wang, X.-J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36, 955–968.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: the importance of NMDA receptors to working memory. The Journal of Neuroscience, 19(21), 9587–603. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10531461
Wong, K.-F., & Wang, X.-J. (2006). A recurrent network mechanism of time integration in perceptual decisions. The Journal of Neuroscience, 26(4), 1314–28. doi:10.1523/JNEUROSCI.3733-05.2006


APPENDIX

Part A: surface plots for fitting the sigmoid function to training phase behavior

Fig. A.1. Surface plots of the negative log likelihood of the decisions made by subject 2 during the training phase according to the sigmoid function, which depends on learning parameters 푞푟, 푞푛 and noise parameter 휎.


Fig. A.2. Surface plots of the negative log likelihood of the decisions made by subject 3 during the training phase according to the sigmoid function, which depends on learning parameters 푞푟, 푞푛 and noise parameter 휎.


Part B: learning curves and learning phases

Learning curves with noise parameter 휎 = 0.0970

Figure B.1 shows how subjects on average learned to choose stimuli 퐴, 퐶 and 퐸. Here, the noise parameter 휎 was taken to be 0.0970, which is the value obtained by regression on data points generated with the reduced attractor model.

Compared to the learning curves that were computed with 휎 = 0.15, the simulations with 휎 = 0.0970 learn better and faster. However, this results in a worse prediction of the actual learning curves found in the experiment.

The sigmoid model might perform better with 휎 = 0.15, since the regressed value of 휎 = 0.0970 was obtained using the reduced attractor model rather than the sigmoid function. The sigmoid function might not be a perfect approximation of the choice probabilities that follow from the reduced attractor model. However, as Figure 10 shows, the sigmoid approximation is observed to be nearly perfect, which contradicts this explanation.
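
For reference, a minimal sketch of a sigmoid choice probability of the kind discussed here, assuming it depends on the difference in synaptic strengths and on the noise parameter 휎 (the exact functional form used in the thesis may differ):

import numpy as np

def p_choose_A(c_A, c_B, sigma):
    # Probability of choosing alternative A as a sigmoid of the difference in
    # synaptic strengths; a smaller sigma gives more deterministic choices.
    return 1.0 / (1.0 + np.exp(-(c_A - c_B) / sigma))

# With the same strength difference, the regressed sigma = 0.0970 yields
# sharper choices than sigma = 0.15, consistent with faster apparent learning.
for sigma in (0.0970, 0.15):
    print(f"sigma = {sigma}: P(choose A) = {p_choose_A(0.6, 0.5, sigma):.3f}")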

Another problem might lie in the biophysical parameters given to the reduced attractor model. These parameters, among which the synaptic time constants and the number of presynaptic neurons to each selective neuron, were taken from (Hunt et al., 2012), in which a similar decision task was modeled. However, those models did not include learning, and the decision task was comparable to, but not the same as, the probabilistic selection task used in this thesis. Therefore, future research might benefit from an investigation of the values given to the biophysical parameters in the model, adjusting them where necessary to the current task and conditions.

Fig. B.1. Learning curves for choosing A (upper), choosing C (middle) and choosing E (lower), according to the experiment and according to the sigmoid function with 휎 = 0.0970.

Learning curves dependent on age, two values of sigma

When differentiating between age groups, the learning curves show the same difference between the use of 휎 = 0.15 and 휎 = 0.0970: with the latter value for 휎 the model seems to learn to choose the most profitable stimulus 퐴, 퐶 or 퐸 with a higher probability than the subjects from the experiment, while modeling with 휎 = 0.15 results in predictions that fit the experimental data better.

Fig. B.2. Learning curves according to experiment and to simulations with the sigmoid function, for choosing stimulus 퐴 (upper row) or 퐶 (lower row) averaged over all subjects in the old (left column) and the young group (right column). A value of 휎 = 0.0970 was used.


Fig. B.3. Learning curves according to experiment and to simulations with the sigmoid function, for choosing stimulus 퐸 averaged over all subjects in the old (left column) and the young group (right column). A value of 휎 = 0.0970 was used.


Part C: age independent performance behavior with 휎 = 0.15

Fig C.1. The likelihood of chooseA (upper row) and avoidB (lower row) trials according to the sigmoid function (left column) and the reduced attractor model (right column), and according to the experiment. Each dot stands for a subject. Perfect modeling would position the dots near the diagonal line, where the experimental and modeled probabilities coincide. Both models were given a noise parameter 휎 = 0.15.

Figure C.1 shows the chooseA and avoidB behavior of the subjects as given by simulation with either of the two models and by the experiment. Statistically, the two models gave the same results. Both the sigmoid function (p = 0.3162) and the reduced attractor model (p = 0.3973) result in chooseA distributions whose median equals that of the distribution found in the experiment. However, no correlations were found between the experimental results and the reduced attractor model (p = 0.2077), nor the sigmoid function (p = 0.3497). This means that neither model is able to capture the individual subjects’ learning from reward.

The median of the experimental distribution for avoidB is found to be significantly different from that of the distribution predicted by the reduced attractor model (푝 = 3.8478 ∗ 10− ). The distributions did, however, correlate with one another (푝 = 0.0176). The distribution predicted by the sigmoid function coincides in median with that of the experiment (푝 = 0.0886), but no correlation was found (푝 = 0.2105). Therefore, the reduced attractor model does seem able to predict learning from punishments, although the resulting values do not agree with the experimental results; this might be solved by adjusting the values of the biophysical parameters. The sigmoid function is not found to be useful for predicting learning from punishments.
