bioRxiv preprint doi: https://doi.org/10.1101/699595; this version posted July 11, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Modeling Uncertainty-Seeking Behavior Mediated by Cholinergic Influence on Dopamine

Marwen Belkaid1,2,* and Jeffrey L. Krichmar3,4

Abstract

Recent findings suggest that acetylcholine mediates uncertainty-seeking behaviors through its projection to dopamine neurons – another neuromodulatory system known for its major implication in reinforcement learning and decision-making. In this paper, we propose a leaky-integrate-and-fire model of this mechanism. It implements a softmax-like selection with an uncertainty bonus by a cholinergic drive to dopaminergic neurons, which in turn influence synaptic currents of downstream neurons. The model is able to reproduce experimental data in two decision-making tasks. It also predicts that i) in the absence of cholinergic input, dopaminergic activity would not correlate with uncertainty, and that ii) the adaptive advantage brought by the implemented uncertainty-seeking mechanism is most useful when sources of reward are not highly uncertain. Moreover, this modeling work allows us to propose novel experiments which might shed new light on the role of acetylcholine in both random and directed exploration. Overall, this study thus contributes to a more comprehensive understanding of the roles of the cholinergic system and its involvement in decision-making in particular.

1 Sorbonne Université, CNRS UMR 7222, Institut des Systèmes Intelligents et de Robotique, ISIR, F-75005 Paris, France.
2 ETIS Laboratory, UMR 8051, Université Paris Seine, ENSEA, CNRS, Université de Cergy-Pontoise, F-95000 Cergy-Pontoise, France.
3 Department of Cognitive Sciences, University of California, Irvine, Irvine, CA 92697, USA.
4 Department of Computer Science, University of California, Irvine, Irvine, CA 92697, USA.
* Corresponding author: MB, [email protected]

1 Introduction

Animals constantly face uncertainty due to noisy and incomplete information about the environment. From the information-processing perspective, uncertainty is typically considered a burden, an issue that has to be resolved for the animal to behave correctly [Cohen et al., 2007; Rao, 2010]. In the framework of reinforcement learning, for example, to allow optimal exploitation and outcome maximization, agents must explore the environment and gather information about action–outcome contingencies [Sutton and Barto, 1998; Rao, 2010].

The neural mechanisms driving the decision to perform actions with uncertain outcomes are still poorly understood. In contrast, the processes by which individuals learn to perform successful actions have been extensively studied. Notably, the dopaminergic system is thought to play a key role in these processes, both in the learning- and in the motivation-related aspects [Schultz, 2002; Berridge, 2012; Berke, 2018]. Moreover, studies have reported dopaminergic activities that are correlated with the uncertainty of reward [Fiorillo et al., 2003; Linnet et al., 2012].

Another neuromodulatory system which has been largely implicated in the processing of novelty and uncertainty is the cholinergic system. For instance, Yu and Dayan [2005] suggested that acetylcholine (ACh) suppresses top-down, expectation-driven information relative to bottom-up, sensory-induced signals in situations of expected uncertainty, i.e. when expectations are known to be unreliable. Additionally, Hasselmo [1999, 2006] proposed that the level of ACh in the hippocampus

determines whether it is encoding new information or consolidating old memories. The cholinergic system also interacts with the dopaminergic system. In particular, there are cholinergic projections onto neurons in the ventral tegmental area (VTA), one of the two major sources of dopamine (DA) in the brain [Avery and Krichmar, 2017; Scatton et al., 1980]. In a recent study, Naudé et al. [2016] provided evidence that these projections might mediate the motivation to select uncertain actions.

The softmax rule, where the probability of choosing an action is a function of its estimated value, is generally thought to be a good model of human [Daw et al., 2006] and animal [Cinotti et al., 2019] decision-making. But Naudé et al. [2016] showed that the decisions made by wild-type (WT) mice exhibited an uncertainty-seeking bias and followed a softmax function which included an uncertainty bonus. In contrast, mice lacking the nicotinic acetylcholine receptors on the dopaminergic neurons in the VTA showed less uncertainty-seeking behavior and their decisions rather followed the standard softmax rule.

In neural networks, decision-making processes are generally modeled using competition mechanisms [Rumelhart and Zipser, 1985; Carpenter and Grossberg, 1988]. Such mechanisms can constitute a neural implementation of the softmax rule. In particular, Krichmar [2008] proposed a model where neuromodulators act upon different synaptic currents to modulate the network's sensitivity to differences in input values, much like the temperature parameter in the softmax model [Sutton and Barto, 1998]. In this paper, we propose a new version of this model using leaky-integrate-and-fire neurons and integrating an uncertainty bonus. We use this model, in comparison with three alternative models, to verify a set of hypotheses about how cholinergic projections to dopaminergic VTA neurons mediate uncertainty-seeking. We then perform additional experiments to assess the interest of such a mechanism for animals foraging in volatile environments. These simulations suggest that ACh affects behavior by translating uncertainty into a source of motivation, thus driving exploratory behaviors.

2 Background

2.1 Dopamine

Dopamine (DA) is involved in decision-making through its role in reward processing and motivation [Schultz, 2002; Berridge, 2012]. The largest group of dopaminergic neurons is found in the ventral tegmental area (VTA) [Scatton et al., 1980]. It projects to the basal ganglia (BG), in particular to the striatum, but also to the frontal cortex. The substantia nigra is also an important source of dopamine in the BG.

There is strong evidence of the role of dopamine in the learning of the value of actions, stimuli and states of the environment. In this context, Schultz and colleagues hypothesized that the activity of DA neurons encodes a reward prediction error [Schultz, 2002]. Indeed, phasic dopaminergic activities show strong correlations with an error in the prediction of conditioned stimuli after Pavlovian learning. Moreover, Berridge and colleagues suggest that DA is essential for "incentive salience" and "wanting", i.e. for motivation [Berridge and Kringelbach, 2008; Berridge, 2012]. For instance, DA-deprived rats are unable to generate the motivational arousal necessary for ingestive behavior and can starve to death although they are able to move and eat [Ungerstedt, 1971]. However, dopamine has also been suggested to signify novelty, which may be related to an uncertainty signal [Redgrave and Gurney, 2006]. In summary, the dopaminergic system seems to implement a series of mechanisms that reinforce and favor stimuli and actions that have been rewarding in the past, or that may be of interest in the future.

2.2 Acetylcholine

Acetylcholine (ACh) originates from various structures in the brain: the laterodorsal tegmental (LDT) and the pedunculopontine tegmental (PPT) mesopontine nuclei projecting to the VTA and other nuclei in the

thalamus, basal forebrain and basal ganglia [Mena-Segovia, 2016]; the medial septal nucleus, mainly targeting the hippocampus; and the nucleus basalis in the basal forebrain, mainly acting on the cortex [Baxter and Chiba, 1999]. In addition, striatal interneurons provide an internal source of ACh in the BG.

ACh has been largely implicated in the processing of novelty and uncertainty. Significant research highlighted this role in the septo-hippocampal cholinergic system, for instance. In this case, novelty detection increases the level of septal ACh: novel patterns elicit little recall, which reduces hippocampal inhibition of the septum and allows ACh neurons to discharge [Meeter et al., 2004]. In addition, Hasselmo [1999, 2006] proposed that high and low levels of ACh in the hippocampus – during active waking on the one hand, and quiet waking and slow-wave sleep on the other hand – respectively allow encoding new information and facilitate memory consolidation. Similarly, higher activity of the cholinergic neurons in the mesopontine nuclei and nucleus basalis has been shown to be associated with cortical activation during waking and paradoxical sleep [Jones, 2005] – a sleep phase physiologically similar to waking states. Thus, various computational models of the cholinergic system have focused on its role in learning and memory [Hasselmo, 2006; Pitti and Kuniyoshi, 2011; Grossberg, 2017].

A complementary theory was developed by Yu and Dayan [2005], suggesting that acetylcholine suppresses top-down, expectation-driven information relative to bottom-up, sensory-induced signals in situations of expected uncertainty, i.e. when expectations are known to be unreliable. To illustrate their theory, the authors modeled the so-called Posner task. Posner [1980] proposed this paradigm to study attentional processes. Typically, a cue is presented to the participants, followed by a target stimulus. Posner [1980] showed that individuals respond more rapidly and accurately on correctly cued trials (i.e. cue on the same side as the target) than on incorrectly cued trials (i.e. cue on the opposite side). The difference in response time between valid and invalid trials is termed the validity effect (VE). The model proposed by Yu and Dayan [2005] reproduced the results obtained by Phillips et al. [2000], which showed in rat experiments that the VE varies inversely with the level of ACh, which was manipulated pharmacologically. Additionally, ACh has been hypothesized to set the threshold for noradrenergic signaling of unexpected uncertainty [Yu and Dayan, 2005], which calls for more exploration by counterbalancing DA-driven exploitation [Cohen et al., 2007].

3 Methods

3.1 Bandit task

The experiment reported by Naudé et al. [2016] was a 3-armed bandit task adapted for mice. The setup was an open field in which three target locations were associated with a certain probability of reward (Figure 1A), which was delivered through intracranial self-stimulation (ICSS). Mice could not receive two consecutive ICSS at the same location. Thus, each time they were at a target location, they had to choose the next target among the two remaining alternatives. As in a classical bandit task, this is referred to as a gamble. Since the outcome is binary (i.e. reward delivered or not), the expected uncertainty was represented by the variance p(1 − p) of Bernoulli distributions (Figure 1B).

Naudé et al. [2016] used this task to study the influence of uncertainty on decision-making, and more specifically on the dopaminergic activity under the influence of cholinergic projections. Notably, they showed that while wild-type (WT) mice exhibited uncertainty-seeking behavior in their task, such behaviors were suppressed in mice with deleted nicotinic acetylcholine receptors in the dopaminergic neurons of the VTA (hereafter KO mice).

3.2 Neural Network Model of Uncertainty Seeking

We model the decision-making process involved in this task using an artificial neural network (Figure 1C). This network has three channels, each corresponding to one of


the targets. Similar to Krichmar [2008], the competition takes place in a decision layer where neurons have lateral excitatory and inhibitory connections in addition to extrinsic input from downstream layers. Neuromodulatory signals driven by the cholinergic and dopaminergic representative neurons modulate this competition. When the dopaminergic activity is low, the low signal-to-noise ratio in decision neurons leaves room for exploration. However, strong dopaminergic activity amplifies the efficacy of extrinsic input connections and those of inhibitory interconnections in order to achieve exploitative decisions. This thus implements a neuronal equivalent to the softmax decision policy based on the value of the target. Moreover, the cholinergic activity increases the firing of DA and introduces an uncertainty bias in the softmax-like neuromodulation of the competition. In the bandit task simulations, the value and uncertainty signals are manually provided to the model (see Section 3.2.1).

Figure 1: Bandit task and core model. A) Task setup used by Naudé et al. [2016]. B) Expected reward and uncertainty as a function of reward probability in this task. C) Neural network model of decision-making (see text for description).

All neurons are leaky-integrate-and-fire (LIF) neurons. The change in the membrane potential V is represented as follows:

    τ · dV(t)/dt = −V(t) + V_rest + I_in(t) · R    (1)

where τ is the time constant, V_rest is the resting potential, R the resistance of the membrane, and I_in the input current:

    I_in(t) = I_ext(t) + I_0(t)    (2)
    with I_0 ∼ N(µ_0, σ_0)    (3)

where I_ext and I_0 are respectively extrinsic and background input currents. The latter is modeled as a Gaussian distribution N(µ_0, σ_0) and accounts for spontaneous activities as well as possible other extrinsic inputs which are not specifically modeled here. When the membrane potential is higher than a threshold V_th, the neuron fires, i.e. the potential rises to V_spike then decreases to V_rest and a current I_out is transmitted to post-synaptic neurons:

    If V(t) > V_th then:  V(t) = V_spike,  V(t + 1) = V_rest,  I_out(t) = 1    (4)

A single trial of the experiment consists of a decision made between two target locations. For simplicity, neurons of the target identification layer (in blue in Figure 1C) were tuned such that they fire every two iterations (a spike is followed by a refractory period) with random initialization. Only the two target neurons corresponding to the current options are activated in each trial.

Similarly to the model used by Krichmar [2008], the extrinsic input current of the decision neurons I_ext^dec (in green in Figure 1C) is defined as follows:

    I_ext^dec(x, t) = w · (1 + nm(x, t)) · I_out^tar(x, t)
                    + Σ_{y≠x} w · I_out^dec(y, t − dt)
                    − Σ_{y≠x} w · (1 + nm(x, t)) · I_out^dec(y, t − dt)    (5)


where x ∈ {A, B, C} corresponds to the gambling options, w is a synaptic weight, and nm is a neuromodulation factor that we will define below.

As for the selection neurons (in orange in Figure 1C), the extrinsic input current is simply I_ext^sel(x, t) = I_out^dec(x, t). This layer implements a winner-takes-all readout of the decision: the first spike corresponds to the network's decision.

In this model, decisions are modulated by the dopaminergic system. To account for the difference between wild-type (WT) mice and mice in which nicotinic acetylcholine receptors in dopamine neurons were removed (KO), as reported by Naudé et al. [2016], we defined two variants of the neuromodulation component: the WT and KO models.

3.2.1 WT model

The ability to learn the reward probability and maximize the outcome is thought to be mediated by the dopaminergic system. Thus, in our model, DA activity is a function of the targets' value v(x) representing the reward probability. Besides, Naudé et al. [2016] observed an uncertainty-driven motivation in WT mice in addition to the motivation to maximize reward by choosing the target with the highest reward probability. They showed that this uncertainty-seeking behavior was dependent upon the cholinergic projections to DA neurons, which also modulate the dopaminergic activity. Since the expected uncertainty is thought to be encoded by ACh neurons, in our model, ACh activity is determined by the reward uncertainty u(x), which we represent as the variance of a Bernoulli distribution, v(x)(1 − v(x)) (Figure 1B). We define I_u = Σ_{x∈O} u(x) as an input current generated by the overall expected uncertainty in the current trial, and I_v = Σ_{x∈O} v(x) as an input current generated by the overall expected reward in the current trial. Thus, the activity in the neuromodulation network is determined by the following equations:

    I_ext^ACh(t) = I_u    (6)
    I_ext^DA(t) = I_v + I_out^ACh(t)    (7)
    nm(x, t) = I_out^DA(t) · (v(x) + u(x))    (8)

Introducing the output current of the ACh neuron as an input to the DA neuron is consistent with the increase of dopaminergic activity observed in the presence of cholinergic receptors [Graupner et al., 2013; Naudé et al., 2016]. In the bandit task, v(x) and u(x) of each target were manually fixed for simplicity. But in the foraging task, these variables were estimated by the model (see Section 3.5).

3.2.2 KO model

Naudé et al. [2016] showed that uncertainty-seeking was removed in KO mice. Their decisions were rather exploitative, similarly to a classical softmax policy. Thus, in this variant of the model, the cholinergic effect on decision is eliminated and the dopaminergic activity only depends on reward probabilities:

    I_ext^DA(t) = I_v    (9)
    nm(x, t) = I_out^DA(t) · v(x)    (10)

3.3 Alternative models

The proposed model assumes that: i) uncertainty is processed by ACh, and that ii) cholinergic projections to DA neurons not only increase the latter's firing rate but also the neuromodulation of the action selection process. To further validate these hypotheses, we tested three alternative models – all including a WT and a KO variant – introducing the following changes.

Alternative model 1. In this model, ACh simply increases the DA firing rate independently from uncertainty. It is set to fire at a similar rate as previously using a constant input, but uncertainty is processed by neither ACh nor DA. Hence, there is no difference in the neuromodulation term between the WT and KO variants, both using the form in Equation (10). The only difference between the WT and KO variants is whether ACh activates DA.
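The single-neuron dynamics of Equations (1)-(4) and the WT/KO drives of Equations (6)-(10) can be sketched in Python. Parameter values follow Table 1 where available; the Euler step of one iteration, the function names, and the direct injection of spike outputs are our assumptions, not part of the original implementation:

```python
import random

# Hyperparameters shared by all neurons (Table 1)
TAU, V_SPIKE, V_TH, V_REST = 20.0, 5.0, 1.0, -2.0
MU0, SIGMA0 = 0.15, 0.05

def lif_step(vm, i_ext, r, dt=1.0):
    """One Euler step of Eqs (1)-(4); returns (new potential, output current)."""
    i_in = i_ext + random.gauss(MU0, SIGMA0)     # Eqs (2)-(3): background noise
    vm = vm + dt / TAU * (-vm + V_REST + i_in * r)  # Eq (1): leaky integration
    if vm > V_TH:                                # Eq (4): spike (plotted as V_SPIKE), then reset
        return V_REST, 1.0
    return vm, 0.0

def drive_currents(v, u, model="WT", i_out_ach=0.0):
    """External drives to the ACh and DA neurons (Eqs 6-7 for WT, Eq 9 for KO).
    v, u map each offered target to its expected reward and uncertainty."""
    i_v, i_u = sum(v.values()), sum(u.values())
    if model == "WT":
        return i_u, i_v + i_out_ach              # ACh driven by uncertainty; ACh excites DA
    return 0.0, i_v                              # KO: no cholinergic input to DA

def nm(x, v, u, i_out_da, model="WT"):
    """Neuromodulation factor for target x (Eq 8 for WT, Eq 10 for KO)."""
    bonus = u[x] if model == "WT" else 0.0       # uncertainty bonus only in WT
    return i_out_da * (v[x] + bonus)

# One trial between the P=25% and P=50% targets
v = {"A": 0.25, "B": 0.50}
u = {t: p * (1 - p) for t, p in v.items()}       # Bernoulli variance (Figure 1B)
i_ach, i_da = drive_currents(v, u, model="WT", i_out_ach=1.0)
print(i_ach, i_da)                               # 0.4375 1.75
print(nm("B", v, u, i_out_da=1.0, model="WT"))   # 0.75 (value + uncertainty bonus)
print(nm("B", v, u, i_out_da=1.0, model="KO"))   # 0.5 (value only)

# Drive a DA-like neuron with this current (R_da = 5.5, Table 1) and count spikes
random.seed(0)
vm, spikes = V_REST, 0
for _ in range(200):
    vm, out = lif_step(vm, i_ext=i_da, r=5.5)
    spikes += int(out)
print(spikes, "spikes in 200 iterations")
```

With these values, the WT neuromodulation of the 50% target exceeds its KO counterpart by exactly the Bernoulli variance, which is the uncertainty bias discussed above.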


Alternative model 2. In this model, the ACh neuron also has a constant input independent from reward uncertainty. However, uncertainty is processed by DA neurons. Hence, there is no difference in the neuromodulation term between the WT and KO variants, this time both using the form in Equation (8). The only difference between the WT and KO variants is again whether ACh activates DA:

    I_ext^DA(t) = (I_v + I_u)/2 + I_out^ACh(t),  if WT
    I_ext^DA(t) = (I_v + I_u)/2,                 if KO    (11)

Alternative model 3. In this model, everything is similar to the WT model with one exception: ACh projections only increase the DA firing rate. Hence, there is again no difference in the neuromodulation term between the WT and KO variants, this time both using the form in Equation (10). The only difference between the WT and KO variants is again whether ACh activates DA. This model differs from Alternative model 1 in that ACh firing is not driven by a constant input but rather by the estimated uncertainty of reward.

3.4 Foraging task

Naudé et al. [2016] did an additional experiment with a dynamic setup simulating a volatile environment. More specifically, in each session, two of the targets were rewarding 100% of the time while the remaining one was not. The non-rewarding target changed from one session to another (Figure 4A), which required that animals detect the change of rule and learn the new reward probabilities.

3.5 Learning task statistics

In Equations (6)-(10), the expected reward probability v and uncertainty u for each target were manually fixed. But to model the dynamic foraging task, these statistics about the environment outcomes could no longer be hardwired and had to be learned by trial-and-error. To learn the expected reward probabilities, we used the Rescorla-Wagner rule [Rescorla and Wagner, 1972]:

    dv(x, t)/dt = α · δ(x, t)    (12)
    with δ(x, t) = r(t) − v(x, t)    (13)

where r is the reward function, equal to 1 when a reward is obtained, and to 0 otherwise. Additionally, the reward uncertainty could be estimated as follows [Balasubramani et al., 2014; Naudé et al., 2016]:

    du(x, t)/dt = α · (δ²(x, t) − u(x, t))    (14)

The hyperparameter α was set to 0.1.

3.6 Model fitting

The hyperparameters τ, V_spike, V_th, V_rest, µ_0 and σ_0 are common to all neurons and were set manually so as to determine the dynamics of the network (see values in Table 1). The values of R^ach and R^da, i.e. the membrane resistance in ACh and DA neurons respectively, were accordingly set to match the mean firing rate reported by Naudé et al. [2018]. We did not have data on the firing rate of ACh neurons in the mesopontine nuclei, but the obtained firing rate is within the range that has been reported for other areas.

The values of R^dec, R^sel and w, i.e. respectively the membrane resistance in the decision and selection layers and the baseline synaptic weight in the lateral connections within the decision layer, were optimized using a grid search (see ranges listed in Table 1) to fit the proportion of exploitative choices observed by Naudé et al. [2016] in WT and KO mice. The models' results were averaged over 30 runs comprising 300 trials each, and the fitness score S was calculated as follows:

    S = 100 − ( Σ_{g∈G} |X^WT_mice(g) − X^WT_model(g)| + Σ_{g∈G} |X^KO_mice(g) − X^KO_model(g)| ) / 6    (15)

where X is the average proportion of exploitative choices and G is the set of gambles.
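The update rules of Equations (12)-(14) can be sketched as a discrete-time loop, with α = 0.1 as in the paper. The function name and the simulated reward schedule below are illustrative:

```python
import random

ALPHA = 0.1  # learning rate used in the paper

def update_estimates(v, u, x, reward):
    """Discrete-time form of Eqs (12)-(14) for the chosen target x."""
    delta = reward - v[x]                  # prediction error, Eq (13)
    v[x] += ALPHA * delta                  # value update, Eq (12)
    u[x] += ALPHA * (delta ** 2 - u[x])    # uncertainty tracks the squared error, Eq (14)

# Learn the statistics of a 75%-rewarding target from scratch
random.seed(1)
v, u = {"A": 0.0}, {"A": 0.0}
for _ in range(2000):
    r = 1.0 if random.random() < 0.75 else 0.0
    update_estimates(v, u, "A", r)
# v["A"] fluctuates around 0.75 and u["A"] around p(1 - p) = 0.1875
print(round(v["A"], 2), round(u["A"], 2))
```

Because α is constant, the estimates keep fluctuating around the true values rather than converging exactly, which is what allows the model to track the rule changes of the foraging task.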


4 Results

4.1 Bandit task

In this task reported by Naudé et al. [2016], animals had to make binary choices (called gambles) among the remaining two out of three target locations that were set to deliver rewards with probabilities P = 25%, 50% and 100%, respectively (Figure 2A). We modeled this decision-making process with a neural network (Figure 1C). The hyperparameters determining the dynamics of the model were first manually set to match the mean firing rate reported by Naudé et al. [2018] in DA neurons in vivo (Figure 2B and C). The remaining hyperparameters of the model were optimized to fit the proportion of exploitative choices observed by Naudé et al. [2016] in WT and KO mice (Figure 2D). As a result, the model reproduced the experimental data (Figure 2D). Notably, the two groups had distinct profiles which respectively correspond to uncertainty-bonus and standard softmax decision rules. Interestingly, the WT and KO models also reproduced the repartition of choices among targets (i.e. the overall percentage of times each target was selected across trials) that was observed by Naudé et al. [2016] (Figure 2E).

Figure 2: The proposed model reproduces mice behavior. A) Schematic illustration of the task setup and the three possible gambles. B) Example of spike trains generated by the model. C) Mean firing rate produced by the WT and KO versions of the model. D) Percentage of exploitative transitions (i.e. choosing the option with the highest reward probability) in each gamble. WT and KO mice (Left) had distinct profiles, which the WT and KO models (Right) were able to reproduce. E) Percentage of target selections as a function of their reward probability. The model (Right) also reproduced the repartition of choices exhibited by mice (Left). F) Dwell time (i.e. time to decision) was also similar between targets with our model (Right), like in mice (Left). Mice results were plotted with data from Naudé et al. [2016].

    Hyperparameter   Value
    τ                20
    V_spike          5
    V_th             1
    V_rest           −2
    µ_0              0.15
    σ_0              0.05
    R^ach            60
    R^da             5.5

    Hyperparameter   Range      Step
    R^dec            [10, 60]   1
    R^sel            [5, 16]    1
    w                [0, 1]     0.05

Table 1: Hyperparameter values (when set) and ranges (when optimized). The hyperparameters τ, V_spike, V_th, V_rest, µ_0 and σ_0 were chosen manually. R^ach and R^da were set to fit experimentally observed firing rates of DA neurons. R^dec, R^sel and w were optimized through a grid search.
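The contrast between the two behavioral profiles (a standard softmax for KO versus a softmax with an uncertainty bonus for WT) can be illustrated directly. The inverse temperature β and bonus weight φ below are illustrative parameters, not fitted values from the paper:

```python
import math

def softmax_bonus(values, uncertainties, beta=3.0, phi=1.0):
    """Choice probabilities with P(x) ∝ exp(beta * (v(x) + phi * u(x))).

    phi = 0 recovers the standard softmax rule (KO-like profile);
    phi > 0 adds the uncertainty bonus (WT-like profile)."""
    scores = [beta * (v + phi * u) for v, u in zip(values, uncertainties)]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Gamble between the P=50% and P=100% targets
v = [0.50, 1.00]
u = [p * (1 - p) for p in v]                 # expected uncertainty (Figure 1B)
p_ko = softmax_bonus(v, u, phi=0.0)
p_wt = softmax_bonus(v, u, phi=1.0)
print(p_ko, p_wt)  # the bonus shifts probability toward the uncertain 50% target
```

The bonus leaves the certain 100% target untouched (u = 0) but boosts the 50% target, reproducing the reduced proportion of exploitative transitions seen in the WT profile.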



Figure 3: Alternative models fail to fully reproduce mice behavior. A) Schematic illustration of the differences in the neuromodulatory component between the core model and the three alternative models. B) Percentage of exploitative transitions. C) Percentage of target selections as a function of their reward probability across gambles. In B) and C), the alternative models did not fit the profiles observed in WT and KO mice.
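The neuromodulatory differences summarized in Figure 3A can be stated compactly in code. The sketch below is our reading of Equations (7)-(11); the function and model names are illustrative, and the spike outputs are passed in directly:

```python
def da_drive(model, variant, i_v, i_u, i_out_ach):
    """External DA input in each model.

    The core model, alt_1 and alt_3 drive DA with the overall value I_v
    (Eqs 7 and 9); alt_2 folds uncertainty into the DA drive itself (Eq 11).
    Every WT variant adds the ACh output; every KO variant does not."""
    base = (i_v + i_u) / 2 if model == "alt_2" else i_v
    return base + (i_out_ach if variant == "WT" else 0.0)

def nm_factor(model, variant, v_x, u_x, i_out_da):
    """Neuromodulation of target x's decision channel (Eq 8 vs Eq 10).

    Only the core WT model and both alt_2 variants apply the uncertainty
    bonus; all other variants modulate by value alone."""
    bonus = u_x if (model, variant) == ("core", "WT") or model == "alt_2" else 0.0
    return i_out_da * (v_x + bonus)

# Compare the four WT variants on the P=50% target (u = 0.25)
for m in ("core", "alt_1", "alt_2", "alt_3"):
    print(m,
          da_drive(m, "WT", i_v=0.75, i_u=0.4375, i_out_ach=1.0),
          nm_factor(m, "WT", v_x=0.5, u_x=0.25, i_out_da=1.0))
```

This makes the design of the comparison explicit: each alternative removes exactly one of the two assumed cholinergic effects (uncertainty coding, or uncertainty-dependent neuromodulation), so a failure to fit the data isolates the missing ingredient.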

As with the softmax rule, KO mice and the corresponding model selected targets proportionally to their probability of reward, whereas WT mice and the corresponding model exhibited a bias in favor of uncertainty in the case of reward probability 50%. Additionally, like in mice, there was no difference between the targets in terms of dwell time (i.e. time to make a decision, calculated in the model as the time of the first spike in the trial). In other words, there was no effect of the reward probability on the decision time (H = 4.70, p = 0.09, Kruskal-Wallis test; Figure 2F). Importantly, these two criteria (repartition of choices and dwell time) were not explicitly optimized by the model fitting procedure.

To further validate our model, we tested three alternative models introducing two types of changes in the neuromodulatory component: i) uncertainty could be either encoded by dopamine directly or not taken into account at all; ii) the absence of ACh receptors on DA only affected the latter's firing but not the neuromodulatory effect (Figure 3A; see 'Methods' for a more detailed explanation). Upon optimization, none of the alternative models was able to fully fit the behavioral data. Indeed, the fitness scores (calculated using Equation (15)) for these alternative models were significantly lower than our model's (model versus alt 1, t(29) = 3.38, p = 0.0013; model versus alt 2, t(29) = 5.63, p = 10^−6; model versus alt 3, t(29) = 3.35, p = 0.0014; t-test; Table 2). Fitness scores quantify the


ability of the WT and KO variants of a model to fit the proportion of exploitative transitions made by the corresponding group of mice. Lower scores can be explained by the fact that the WT variants of the alternative models did not follow the same linear increase from gamble 1 to 3 (Figure 3B). Moreover, qualitatively, the differences in exploitative transitions and probability of selection of each target between the WT and KO models were smaller than with our model (Figure 3B and C).

            R^dec   R^sel   w      Score
    model   12      12      0.7    96.19
    alt 1   59      5       1      94.95 **
    alt 2   43      7       0.6    94.72 ***
    alt 3   10      13      0.8    95.06 **

Table 2: Model fitting results. Optimized parameters and fitness scores. The proposed model has the highest score. R^dec and R^sel are the membrane resistance in the decision layer and the selection layer, respectively. Stars indicate the results of a t-test comparing the model's score to each of the alternative models' scores: ** p < 0.01, *** p < 0.001.

4.2 Foraging task

We also tested our model in a foraging task where only two of the targets were rewarding. The non-rewarding target changed from one session to another (Figure 4A). In such a volatile environment, animals must detect the changes in reward probabilities and adapt their decisions accordingly.

Figure 4: Model's results and predictions in the foraging task. A) Schematic illustration of the dynamic setup consisting of three sessions. Full circles indicate the two rewarding targets and empty circles indicate the non-rewarding target. B) Higher foraging efficacy with the WT model than the KO model. Efficacy is defined as the success rate, i.e. the average proportion of rewarded choices. C) Failure rates in the beginning and in the end of sessions show a decrease for both WT and KO models but are lower for WT. D) Reward probability v and uncertainty u were correctly estimated by the model throughout sessions. Dashed lines indicate the correct values. E) The model predicts that the difference in foraging efficacy between WT and KO mice vanishes in situations where the reward uncertainty is high. *** p < 0.001; n.s. not significant at p > 0.05.

We initially tested a setup in which rewarding targets had 100% probability, as in the original experiments [Naudé et al., 2016]. In line with the experimental results, we found that the KO model had a lower foraging efficacy (i.e. global reward rate) than the WT model (WT versus KO: t(29) = −3.92, p = 0.0002, t-test; Figure 4B). We split the sessions in half to analyze the model's behavior more closely (Figure 4C). The WT and KO models had similar failure rates in the beginning of sessions (WT versus KO, U = 4249.0, p = 0.566, Mann-Whitney test), and both significantly reduced their failure rates at the end of sessions (beginning versus end of session for WT, T = 854.5, p = 6 × 10^−11; for KO, T = 854.5,

p = 5 × 10^−5, Wilcoxon test). However, the failure rate was significantly lower at the end of sessions for the WT model (WT versus KO, U = 5695.5, p = 2 × 10^−6, Mann-Whitney test), suggesting that the KO model adapted more slowly to condition changes.

To assess the robustness of this effect on foraging efficacy, we further tested similar setups in which the reward probabilities of the two rewarding targets were lower (but still equal), resulting in higher uncertainty: 90%, 75% and 50% probability of reward, corresponding to mid-low, mid-high and high uncertainty respectively. The model successfully estimated the expected reward probability v and uncertainty u throughout sessions (Figure 4D; see Equations (12-14)). While the foraging efficacy was still higher for the WT model with a reward probability of 90% (WT versus KO: t(29) = −4.64, p = 2 × 10^−5, t-test; Figure 4E), the difference was no longer significant with a probability of 75% (WT versus KO: t(29) = −1.89, p = 0.06, t-test; Figure 4E) or 50% (WT versus KO: t(29) = −1.73, p = 0.08, t-test; Figure 4E). Overall, these results demonstrate the benefit of the uncertainty-seeking behaviors mediated by the cholinergic projections to VTA dopaminergic neurons, but suggest that the scope of this adaptively advantageous mechanism is limited to situations where the reward uncertainty is low.

5 Discussion

Prominent theories about the role of acetylcholine hold that it helps control the balance between the storage and the update of memory [Hasselmo, 1999], and between top-down expectation-driven and bottom-up stimulus-driven attention [Yu and Dayan, 2005; Cohen et al., 2007; Avery et al., 2012]. Accordingly, most computational models of this neuromodulator at the functional level focus on memory- and attention-related functions [Hasselmo, 2006; Pitti and Kuniyoshi, 2011; Grossberg, 2017; Yu and Dayan, 2005; Avery et al., 2012]. In this paper, we targeted another aspect of cholinergic action, which was highlighted in recent experimental studies [Naudé et al., 2016, 2018]. These studies suggest that, through its projections to dopaminergic neurons in the ventral tegmental area, acetylcholine also promotes exploratory uncertainty-seeking behaviors. In other words, this neuromodulator participates in the process by which individuals decide to perform actions associated with uncertain outcomes.

We modeled this process using a decision-making neural network under the influence of cholinergic and dopaminergic modulation. The model is based on the following hypotheses: a) dopamine encodes the estimated value; b) dopamine modulates the decision-making network so as to implement a softmax-like rule; c) acetylcholine encodes the estimated uncertainty; d) acetylcholine increases dopamine firing; and e) acetylcholine thereby introduces an uncertainty bonus into the softmax-like decision rule. The model was tested in two decision-making tasks – a bandit task and a foraging task – and successfully reproduced the behavioral results reported by Naudé et al. [2016]. In addition, the model fitted the experimental data from the bandit task better than three alternative models which differed in the expression of the neuromodulation component. Overall, these results validate the above-mentioned hypotheses, confirming that the cholinergic influence on dopamine mediates uncertainty-seeking behaviors. Moreover, the model makes the following predictions, which can be tested experimentally: i) the correlation of dopaminergic activity with reward uncertainty reported by Fiorillo et al. [2003] should not be observed in the absence of the cholinergic influence on DA neurons; ii) the adaptive advantage brought by the implemented uncertainty-seeking mechanism is most useful when sources of reward are not highly uncertain.

How animals generate variable decisions and manage the exploitation–exploration dilemma (i.e. choosing between predictably rewarding actions and other uncertain, possibly suboptimal options) is still poorly understood. It has been suggested that humans rely on two types of exploratory behaviors [Wilson et al., 2014]: directed exploration, in which uncertain actions are purposely chosen for the sake of information gathering; and random exploration, in which actions are selected regardless of their predicted outcome. Our model formally describes how these two exploratory processes can be implemented: the former via the uncertainty bonus driven by the cholinergic influence on dopamine, and the latter through a global decrease of dopaminergic modulation of decisions, which results in lower selectivity and higher sensitivity to noise. This model could thus be tested against other experimental data to further assess the validity of our hypotheses.

Moreover, some models suggest that the striatal cholinergic interneurons modulate the level of noise during action selection in the basal ganglia [Stocco, 2012]. This would imply a key role of acetylcholine not only in directed exploration, as we showed in this paper, but also in random exploration. We believe that new experimental studies are required to specifically investigate this possible dual implication of acetylcholine in exploratory processes. For instance, using tasks that leverage both random and directed exploration, lentiviral expression could selectively target cholinergic receptors in the striatum and in the VTA to evaluate their respective involvement in these behaviors, as well as possible interdependences. Furthermore, it is still unclear whether the cholinergic receptors in VTA dopamine neurons are required for learning the uncertainty bonus or solely for applying the bonus during action selection. These two alternatives could be differentiated experimentally via genetic-chemical manipulations rendering the cholinergic receptors light-controllable [Durand-de Cuttoli et al., 2018]. If the receptors were switched off during the initial sessions in which animals learn the statistics of reward delivery, and then switched on again, we should be able to find out whether the uncertainty-seeking behavior appears rapidly or requires additional learning.

This work is a step toward a more comprehensive understanding of the implication of the dopaminergic and cholinergic systems in decision-making. It highlights their role in motivation and in the execution of decisions. More effort is yet needed to further disentangle these neural mechanisms. For instance, more realistic neuron models could offer a complementary insight into the learning process [Deperrois and Gutkin, 2018]. It has also been suggested that, to account for both learning- and motivation-related processes, it is important to distinguish dopamine cell firing from local dopamine release at dopamine terminals [Berke, 2018]. Thus, a more detailed model of the decision-making network might be necessary to fully capture the role and functioning of these neuromodulators. By showing how ACh might drive uncertainty-seeking behavior through its influence on DA, the present model is a first step in that direction.

6 Acknowledgements

The authors would like to thank Jérémie Naudé and the Neuroscience Paris Seine laboratory (Sorbonne Université, INSERM, CNRS) for sharing the data from their study [Naudé et al., 2016]. They are also grateful to Jérémie Naudé, Philippe Faure, Olivier Sigaud and Andrea Soltoggio for fruitful discussions and comments. MB is thankful to the ETIS laboratory for supporting his visit to UCI. At Sorbonne Université, MB is supported by the Labex SMART (ANR-11-LABX-65), which is funded by French state funds managed by the ANR within the Investissements d'Avenir programme under reference ANR-11-IDEX-0004-02. JLK was supported in part by the United States Air Force under Contract No. FA8750-18-C-0103. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA.

7 Authors' contributions

MB and JLK designed the study. MB implemented the model, analyzed the data and wrote the initial draft. MB and JLK reviewed and edited the manuscript.

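The decision rule at the heart of the model – a softmax over estimated values with a cholinergic uncertainty bonus (hypotheses a-e above) – can be summarized in a short sketch. This is an illustrative abstraction, not the paper's leaky integrate-and-fire implementation: the class name and the parameters `alpha` (learning rate), `beta` (selectivity) and `phi` (bonus weight) are hypothetical, the value update is a simple Rescorla-Wagner rule, and the Bernoulli variance v(1 − v), which peaks at 50% reward probability, stands in as a proxy for the uncertainty estimate of Equations (12-14).

```python
import numpy as np

def softmax(x, beta):
    # Softmax with inverse temperature beta; max is subtracted for stability.
    z = np.exp(beta * (np.asarray(x, dtype=float) - np.max(x)))
    return z / z.sum()

class UncertaintySeekingAgent:
    """Hypothetical agent: softmax action selection with an uncertainty bonus.

    v[i] -- estimated reward probability of target i (the "dopamine" signal).
    u[i] -- estimated uncertainty (the "acetylcholine" signal); the Bernoulli
            variance v*(1-v) is used here as a simple proxy.
    WT condition: phi > 0 (bonus on); KO condition: phi = 0 (bonus off).
    """

    def __init__(self, n_targets=3, alpha=0.1, beta=10.0, phi=1.0, seed=0):
        self.v = np.full(n_targets, 0.5)  # prior reward-probability estimates
        self.alpha = alpha                # learning rate (illustrative value)
        self.beta = beta                  # softmax selectivity
        self.phi = phi                    # weight of the uncertainty bonus
        self.rng = np.random.default_rng(seed)

    @property
    def u(self):
        return self.v * (1.0 - self.v)    # Bernoulli variance as uncertainty

    def choice_probabilities(self):
        # Softmax over value plus uncertainty bonus (hypotheses b and e).
        return softmax(self.v + self.phi * self.u, self.beta)

    def step(self, reward_probs):
        # Choose a target, sample a Bernoulli reward, update the estimate.
        p = self.choice_probabilities()
        choice = int(self.rng.choice(len(p), p=p))
        reward = float(self.rng.random() < reward_probs[choice])
        self.v[choice] += self.alpha * (reward - self.v[choice])
        return choice, reward
```

With value estimates v = [0.9, 0.5], setting phi = 0 (the KO condition) yields near-deterministic exploitation of the 0.9 target, while phi = 2 (the WT condition) shifts a substantial share of choices toward the more uncertain 0.5 target, reproducing the qualitative uncertainty-seeking pattern discussed above.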

References

Avery, M. C. and Krichmar, J. L. (2017). Neuromodulatory systems and their interactions: A review of models, theories, and experiments. Frontiers in Neural Circuits, 11(108).

Avery, M. C., Nitz, D. A., Chiba, A. A., and Krichmar, J. L. (2012). Simulation of cholinergic and noradrenergic modulation of behavior in uncertain environments. Frontiers in Computational Neuroscience, 6:5.

Balasubramani, P. P., Chakravarthy, V. S., Ravindran, B., and Moustafa, A. A. (2014). An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning. Frontiers in Computational Neuroscience, 8:47.

Baxter, M. G. and Chiba, A. A. (1999). Cognitive functions of the basal forebrain. Current Opinion in Neurobiology, 9(2):178-183.

Berke, J. D. (2018). What does dopamine mean? Nature Neuroscience, page 1.

Berridge, K. C. (2012). From prediction error to incentive salience: Mesolimbic computation of reward motivation. European Journal of Neuroscience, 35(7):1124-1143.

Berridge, K. C. and Kringelbach, M. L. (2008). Affective neuroscience of pleasure: reward in humans and animals. Psychopharmacology, 199(3):457-480.

Carpenter, G. A. and Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21(3):77-88.

Cinotti, F., Fresno, V., Aklil, N., Coutureau, E., Girard, B., Marchand, A. R., and Khamassi, M. (2019). Dopamine blockade impairs the exploration-exploitation trade-off in rats. Scientific Reports, 9(1):6770.

Cohen, J. D., McClure, S. M., and Yu, A. J. (2007). Should I stay or should I go? How the brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 362(1481):933-942.

Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876.

Deperrois, N. and Gutkin, B. S. (2018). Nicotinic and cholinergic modulation of reward prediction error computations in the ventral tegmental area: a minimal circuit model. bioRxiv, page 423806.

Durand-de Cuttoli, R., Mondoloni, S., Marti, F., Lemoine, D., Nguyen, C., Naudé, J., d'Izarny-Gargas, T., Pons, S., Maskos, U., Trauner, D., et al. (2018). Manipulating midbrain dopamine neurons and reward-related behaviors with light-controllable nicotinic acetylcholine receptors. eLife, 7:e37487.

Fiorillo, C. D., Tobler, P. N., and Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299(5614):1898-1902.

Graupner, M., Maex, R., and Gutkin, B. (2013). Endogenous cholinergic inputs and local circuit mechanisms govern the phasic mesolimbic dopamine response to nicotine. PLoS Computational Biology, 9(8):e1003183.

Grossberg, S. (2017). Acetylcholine neuromodulation in normal and abnormal learning and memory: Vigilance control in waking, sleep, autism, amnesia, and Alzheimer's disease. Frontiers in Neural Circuits, 11:82.

Hasselmo, M. E. (1999). Neuromodulation: acetylcholine and memory consolidation. Trends in Cognitive Sciences, 3(9):351-359.

Hasselmo, M. E. (2006). The role of acetylcholine in learning and memory. Current Opinion in Neurobiology, 16(6):710-715.

Jones, B. E. (2005). From waking to sleeping: neuronal and chemical substrates. Trends in Pharmacological Sciences, 26(11):578-586.

Krichmar, J. L. (2008). The neuromodulatory system: a framework for survival and adaptive behavior in a challenging world. Adaptive Behavior, 16(6):385-399.

Linnet, J., Mouridsen, K., Peterson, E., Møller, A., Doudet, D. J., and Gjedde, A. (2012). Striatal dopamine release codes uncertainty in pathological gambling. Psychiatry Research: Neuroimaging, 204(1):55-60.

Meeter, M., Murre, J., and Talamini, L. (2004). Mode shifting between storage and recall based on novelty detection in oscillating hippocampal circuits. Hippocampus, 14(6):722-741.

Mena-Segovia, J. (2016). Structural and functional considerations of the cholinergic brainstem. Journal of Neural Transmission, 123(7):731-736.

Naudé, J., Didienne, S., Takillah, S., Prévost-Solié, C., Maskos, U., and Faure, P. (2018). Acetylcholine-dependent phasic dopamine activity signals exploratory locomotion and choices. bioRxiv.

Naudé, J., Tolu, S., Dongelmans, M., Torquet, N., Valverde, S., Rodriguez, G., Pons, S., Maskos, U., Mourot, A., Marti, F., et al. (2016). Nicotinic receptors in the ventral tegmental area promote uncertainty-seeking. Nature Neuroscience, 19(3):471.

Phillips, J. M., McAlonan, K., Robb, W. G., and Brown, V. J. (2000). Cholinergic neurotransmission influences covert orientation of visuospatial attention in the rat. Psychopharmacology, 150(1):112-116.

Pitti, A. and Kuniyoshi, Y. (2011). Modeling the cholinergic innervation in the infant cortico-hippocampal system and its contribution to early memory development and attention. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1409-1416. IEEE.

Posner, M. I. (1980). Orienting of attention. Quarterly Journal of Experimental Psychology, 32(1):3-25.

Rao, R. P. (2010). Decision making under uncertainty: a neural model based on partially observable Markov decision processes. Frontiers in Computational Neuroscience, 4.

Redgrave, P. and Gurney, K. (2006). The short-latency dopamine signal: a role in discovering novel actions? Nature Reviews Neuroscience, 7(12):967.

Rescorla, R. A. and Wagner, A. R. (1972). A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. In Classical Conditioning II: Current Research and Theory.

Rumelhart, D. E. and Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9(1):75-112.

Scatton, B., Simon, H., Le Moal, M., and Bischoff, S. (1980). Origin of dopaminergic innervation of the rat hippocampal formation. Neuroscience Letters, 18(2):125-131.

Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36(2):241-263.

Stocco, A. (2012). Acetylcholine-based entropy in response selection: a model of how striatal interneurons modulate exploration, exploitation, and response variability in decision-making. Frontiers in Neuroscience, 6:18.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

Ungerstedt, U. (1971). Adipsia and aphagia after 6-hydroxydopamine induced degeneration of the nigro-striatal dopamine system. Acta Physiologica Scandinavica, 82(S367):95-122.

Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A., and Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General, 143(6):2074.

Yu, A. J. and Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4):681-692.
