An Investigation of Decision Noise and Horizon-Adaptive Exploration in the Explore-Exploit Dilemma

Item Type text; Electronic Dissertation

Authors Wang, Siyu

Publisher The University of Arizona.

Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.


Link to Item http://hdl.handle.net/10150/642134

AN INVESTIGATION OF DECISION NOISE AND HORIZON-ADAPTIVE EXPLORATION IN THE EXPLORE-EXPLOIT DILEMMA

by

Siyu Wang

Copyright © Siyu Wang 2020

A Dissertation Submitted to the Faculty of the

DEPARTMENT OF PSYCHOLOGY

In Partial Fulfillment of the Requirements

For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2020


THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Siyu Wang, titled An Investigation of Decision Noise and Horizon-Adaptive Exploration in the Explore-Exploit Dilemma,

and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Robert C Wilson    Date: May 29, 2020

John JB Allen    Date: Jun 8, 2020

Lynn Nadel    Date: May 29, 2020

Jessica Andrews-Hanna    Date: May 29, 2020

Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

Robert C Wilson, Psychology    Date: May 29, 2020

Acknowledgements

I would first like to thank my advisor, Dr. Robert C. Wilson, for being both a great mentor and a role model as a scientist to me. I will also take this opportunity to thank

Dr. Jean-Marc Fellous, my second advisor in graduate school who opened the door of animal research to me.

Dr. Hashem Sadeghiyeh, for being a great colleague and valuable friend.

Jack-Morgan Mizell, Bryan Kromenacker, Dr. Jane Keung, Sarah Cook, Todd Hagen, for being such wonderful and supportive lab mates.

Blaine Harper, Maddie Souder, Kristine Gradisher, Yuxin Qin, Jerry Anderson, Zhuocheng Xiao, for all the training or help you offered in the Fellous lab, and Blake Gerken who helped me significantly in running experiments.

Ali Gilliland, Maggie Calder, Sylvia Zarnescu, Yifei Xiang, Weixi Kang, Zeyi Chen, Hannah Kyllo, Filipa Santos, Carlos Velazquez, Kathleen Ge, undergraduates who worked closely with me on various projects, as well as my fantastic research assistants who worked with me or helped me

run experiments, including Chrysta Andrade, Abigail Foley, Kathryn Kellohen, Daniel Carrera, Vera Thornton, Tausif Chowdhury, Audrey Fierro, Aidan Smith, Gabe Sulser, Nick Adragna, Kailyn Teel, Chelsea Goldberger, Haley Gordy, Julia Sochin, Cami Rivera, Michala Carlson, Colin Lynch, James Barley-Fuentes.

My committee members, Dr. Lynn Nadel, Dr. John Allen, Dr. Jessica Andrews-Hanna, Dr. Ying-hui Chou, for all the support and advice.

Beth Owens, Stephanie O’Donnell, Sarah Winters, for all the help and kindness.

My dear rats, Drs. Scragg Gradisher, Hachi Wang, Tianqi Wang, Ratzo Wang, Rizzo Wang, Gerald Gerken and Twenty Lu, for your significant sacrifice and contribution to science.

In addition, I’d like to thank my family and friends

Randy Spalding, who embraced me as part of his extended family and made Tucson my second home. Jim Cook, Michelle Morden, Patsy Spalding, Nancy Cook, Shubham Jain, Birkan Kolcu, Anault Allihien, Bob Cook, Friederike Almstedt, Cindy Cook, Thayer Keller, Dr. Adam Ussishkin, Dr. Andy Wedel, Aldo Wedel-Ussishkin, Dhruv Gajaria, Prajakta Vaishampayan, Sathyan Padmanabhan, members of this extended family and friends who significantly enriched my life in Tucson.

Yuru Zhu, Dewei Zhang, for standing with me in my ups and downs over the past years.

Mengtian Lu, for being the best friend I can hope for.

My parents, Wei Wang and Yaling Xu for everything.

Dedication

To my dog, Tudou, who passed away recently & my rats, Scragg, Hachi, Tianqi, Ratzo, Rizzo, Gerald, Twenty

Contents

Abstract

1 Introduction
  The explore-exploit tradeoff
  Multi-armed bandit problem
  Exploration strategies in humans and animals
  Current studies
  References

2 The nature of decision noise in random exploration
  Abstract
  Introduction
  Results
  Discussion
  Methods
  References

3 Deep exploration accounts for the stopping threshold and behavioral variability in an optimal stopping task
  Abstract
  Introduction
  Methods
  Results
  Discussion
  References

4 The importance of action in explore-exploit decisions
  Abstract
  Introduction
  Methods
  Results
  Discussion
  References

Abstract

Humans and animals constantly face the tradeoff between exploring something new and exploiting what they have learned to be good. In this dissertation, I studied the properties of the heuristics that humans use to make explore-exploit decisions. In the first study, I examined the nature of the randomness in human behavior that is adaptive in exploration. Human decision making is inherently variable. While this variability is often seen as a sign of sub-optimality in human behavior, recent work suggests that randomness can actually be adaptive. A little randomness in explore-exploit decisions is remarkably effective as it encourages us to explore options we might otherwise ignore. From a modeling perspective, behavioral variability is essentially the variance that cannot be explained by a model and is modeled as the level of decision noise. However, what we have called "decision noise" in previous research could actually just be deterministic components missing from the model, so it is difficult to tell whether decision noise truly arises from a stochastic process. Here we show that, while both random and deterministic noise drive variability in behavior, the noise driving random exploration is predominantly random. In the second study, we further asked where the randomness in behavior comes from. In particular, we examined one candidate theory, known as deep exploration, in which decisions are made through mental simulation and behavioral variability can potentially arise from the stochastic sampling process during that simulation. In the context of a stopping problem, we showed that deep exploration successfully accounts for both the strategic adaptation of the stopping threshold and the adaptation of the level of behavioral variability in the task, suggesting a potential mechanism by which adaptive behavioral variability in human behavior is achieved. In the third study, we examined factors that modulate the adaptation of strategy and behavioral variability to the horizon context in explore-exploit decisions. One key factor in explore-exploit decisions is the planning horizon, i.e. how far ahead one plans when making the decision. Previous work has shown that humans can adapt their level of exploration to the horizon context; specifically, people are more biased towards the less-known option (known as directed exploration) and behave more randomly (known as random exploration) in a longer horizon context. However, Sadeghiyeh et al. (2018) showed that this horizon-adaptive exploration critically depends on how the value information about the options is obtained by participants: participants show horizon-adaptive exploration only when the value information is gained through action-triggered responses (Active version), and do not show horizon adaptation if the information is presented without actions to retrieve it (Passive version). In the Passive version, participants showed no horizon-adaptive directed or random exploration. This is true even if the same participant has played the Active version first. I conducted a series of experiments to further investigate which behavioral factors eliminate horizon-adaptive exploration in the passive condition. This work reveals a more complicated nature of explore-exploit decisions and suggests an influence of action on how subjective utility is computed in the brain.

Introduction

Siyu Wang1

1Department of Psychology, University of Arizona, Tucson AZ USA

The explore-exploit tradeoff

Imagine you are deciding which restaurant to go to for dinner. Would you exploit your favorite restaurant, the one you always enjoy going to, or would you explore a new restaurant that you've never been to? This is an example of the so-called explore-exploit dilemma. Exploiting your favorite restaurant ensures a good meal, but you won't learn anything new, whereas exploring new restaurants has the potential of finding an even better restaurant that you can enjoy for the rest of your life, but at the risk of sometimes getting a bad meal. This type of explore-exploit dilemma is ubiquitous: birds face it when deciding whether to explore and forage for food (Krebs et al., 1978), lab rats face it when deciding whether to explore new routes in a maze to maximize food gain (Jackson et al., 2020), companies face it when deciding between exploiting the current business model and exploring new lines of business (Tushman and O'Reilly, 2011), management teams face it when deciding whether to explore and investigate innovation projects (Ericson and Kastensson, 2011), websites face it when trying to maximize user clicks by deciding whether to exploit links that are known to have high hit rates or explore new links (Agarwal et al., 2009), and recommendation systems face it when deciding whether to recommend novel items to users (Celma, 2016). Balancing exploration and exploitation is crucial to solving all of these problems. The explore-exploit dilemma is in general computationally intractable (see the discussion of optimal solutions below), yet humans and animals succeed in solving these problems all the time. As a result, in recent years there has been significant interest in understanding how humans and animals balance exploration and exploitation.

Multi-armed bandit problem

One family of problems, known as multi-armed bandit problems, has been used extensively to investigate how humans, animals, and machines solve the explore-exploit dilemma. Nearly all of the explore-exploit problems described above can be formally formulated as a variant of the multi-armed bandit problem.

Description of multi-armed bandit problem

The multi-armed bandit problem is an abstraction of the problem that gamblers face when choosing which slot machines to play, how many times to play each slot machine, in which order to play them, and whether to continue with the current slot machine, all in order to maximize their total gain. The original multi-armed bandit problem refers to the following learning problem: you are faced repeatedly with a choice among n different options (or actions, bandits); after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action chosen; and your goal is to maximize your expected total reward over some period of time (100 plays, for example, where each play refers to selecting one of the actions) (Sutton and Barto, 2018).

More formally, in the multi-armed bandit problem, at each play (or time point) t you have n actions a_1, a_2, ..., a_n to choose from, and each action leads to a random reward r^t_{a_t} drawn from a stationary probability distribution that depends on the action a_t you select. Your goal is to maximize the total reward you receive over a period of T trials, $\sum_{t=1}^{T} r^t_{a_t}$. In order to do this, you need to explore all the actions to get a good estimate of the expected reward Q(a|S) associated with each action a in environmental state S, and at the same time exploit the action with the best expected reward as much as you can to maximize the total gain. Hence a tradeoff between exploration and exploitation is necessary in solving the multi-armed bandit problem. More broadly, the term multi-armed bandit problem refers to a generalization of the problem described above (for commonly studied versions, see the experimental paradigms listed below).
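
To make this setup concrete, the following minimal Python sketch implements a stationary two-armed Gaussian bandit and accumulates the total reward $\sum_t r^t_{a_t}$ earned by an arbitrary policy over T plays. The means, standard deviation, and the random policy shown here are illustrative placeholders, not settings taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stationary two-armed Gaussian bandit: each action has a fixed mean payout.
means = np.array([45.0, 55.0])   # illustrative mean rewards for actions 0 and 1
sigma = 10.0                     # common reward standard deviation

def play(action):
    """Draw a reward r_a ~ Norm(mu_a, sigma) for the chosen action."""
    return rng.normal(means[action], sigma)

def run(policy, T=100):
    """Play T trials and return the total reward, i.e. the sum over t of r_{a_t}."""
    return sum(play(policy(t)) for t in range(T))

# Example: a policy that chooses at random and so never exploits what it has learned.
random_policy = lambda t: int(rng.integers(2))
print("total reward over 100 plays:", round(run(random_policy), 1))
```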

Optimal solutions to multi-armed bandit problem

In general, the optimal solution to the multi-armed bandit problem is mathematically intractable. The difficulty can be seen by considering the formal solution to the explore-exploit dilemma provided by Bellman (1954). To optimally solve the multi-armed bandit problem, we need to select the action that maximizes the long-term expected total reward. In this case, we compute the expected long-term gain of action a_t in state S at time t, Q_t(a_t|S). The optimal action to take is the a_t that satisfies

$$a_t = \arg\max_a Q_t(a|S) \qquad (1)$$

The difficulty is in estimating the Q_t(a_t|S) function. Q_t(a_t|S) can be written as a sum of two terms: an immediate reward associated with the action (how much I enjoy the meal today at restaurant A), and a predicted future value (how much I will enjoy coming back to restaurant A in the future):

$$Q_t(a_t|S) = r_t(a_t|S) + \gamma \cdot \sum_{S'} T(S'|S, a_t)\, V_{t+1}(S') \qquad (2)$$

Here r_t(a_t|S) is the immediate reward if action a_t is taken in state S. By trying out action a_t in state S a few times (n times), we can estimate this expected gain as $r_t(a_t|S) = \frac{r^1_{a_t} + r^2_{a_t} + \cdots + r^n_{a_t}}{n}$. γ is a discounting factor on future rewards. T(S'|S, a_t) is the state transition probability from S to S' when action a_t is taken. (For example, the amount of cash you have when ordering at a cash-only restaurant will obviously limit the range of dinner choices. Here, the state S refers to the remaining cash you have for food. Ordering a $50 steak at a high-end restaurant will then transfer you from a state of having $70 to a state of having $20, and you will have fewer actions available in the new state with less cash available for your next meal.) T(S'|S, a_t) is sometimes known to the decision maker, but at other times it needs to be learned through experience. V_{t+1}(S') is the expected total future reward from time t + 1 onward when you get to state S':

$$V_{t+1}(S') = \max_{a_{t+1}} Q_{t+1}(a_{t+1}|S') \qquad (3)$$

The future expected value at S' is essentially Q_{t+1}(a_{t+1}|S') with the best possible action a_{t+1} taken in state S' at time t + 1. If the goal is to maximize the total reward in a time period from t = 1 to t = T, then we can compute Q_T(a|S) first (this is easy to compute because we do not need to consider future value at the last time step T, so Q_T(a|S) = r_T(a|S)), and then use backward induction to calculate Q_{T-1}(a|S), ..., Q_1(a|S) (using equations 2 and 3). Using equation 1, we can then select the optimal action a_1 to take right now at t = 1. This process is known as dynamic programming (Bellman, 1954). In order to fully solve for the optimal action a_1, as discussed above, we need to compute Q_t(a|S) at all time points, for all action and state pairs involved. In other words, in order to compute the best action now, we need to simulate all future outcomes for all possible sequences of actions; this becomes mathematically intractable quickly, and impossible to solve if either the action space or the state space is infinite.
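
The backward induction described above can be written out directly when the reward and transition functions of a small problem are fully known. The sketch below uses a made-up three-state, two-action example (all tables are arbitrary, not taken from the dissertation) to compute Q_T first and then apply equations 2 and 3 backwards; the full tables it relies on are exactly what is unavailable, or intractably large, in realistic explore-exploit problems.

```python
import numpy as np

n_states, n_actions, T = 3, 2, 5
gamma = 0.9
rng = np.random.default_rng(1)

# Toy problem: r[s, a] is the immediate reward, trans[s, a, s'] the transition probability.
r = rng.uniform(0, 1, size=(n_states, n_actions))
trans = rng.uniform(0, 1, size=(n_states, n_actions, n_states))
trans /= trans.sum(axis=2, keepdims=True)   # normalize each row into a probability distribution

# Q[t, s, a]: expected total reward from taking a in s at time t and acting optimally afterwards.
Q = np.zeros((T + 1, n_states, n_actions))
Q[T] = r                                    # last step: Q_T(a|S) = r_T(a|S)
for t in range(T - 1, 0, -1):               # backward induction using equations 2 and 3
    V_next = Q[t + 1].max(axis=1)           # V_{t+1}(S') = max_a Q_{t+1}(a|S')
    Q[t] = r + gamma * trans @ V_next       # Q_t(a|S) = r(a|S) + gamma * sum_S' T(S'|S,a) V_{t+1}(S')

best_first_action = Q[1].argmax(axis=1)     # optimal action a_1 for each possible starting state
print("best first action per state:", best_first_action)
```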

Simple heuristics to balance exploration and exploitation in the multi-armed bandit problem

Given that the optimal solution to explore-exploit problems is generally not tractable, approximations and heuristics have been studied for balancing the exploration-exploitation tradeoff and solving the multi-armed bandit problem. Here we list some of the simplest heuristics through which we can balance exploration and exploitation. For illustration purposes, we evaluated each of the heuristics described below using a simple 2-armed Gaussian bandit problem (see the experimental paradigms described later): there are only 2 actions available in a single state, and the rewards associated with the two actions are drawn from Gaussian distributions with different average payouts. We ran 1000 simulations for each heuristic and calculated the percentage of times the best action is taken and the percentage of simulations in which the best action becomes the current exploit option.

ε-greedy policy (Sutton and Barto, 2018)

The simplest way is to explore a small fraction of the time and to exploit the current best option (the action that leads to the largest immediate reward) the rest of the time. In this heuristic, explorative choices are made ε of the time, whereas exploitative choices are made 1 − ε of the time.

Figure 1: Simulation of the ε-greedy policy at different levels of ε

The exploration rate ε controls the balance between exploration and exploitation. When ε = 0.01 is small, among the 1000 simulations the better action has been found after 100 trials only 80% of the time; when ε = 0.2 is large, the algorithm finds the better action by trial 100 more than 95% of the time, but only chooses the best action about 80% of the time because of the high exploration rate of 20%. The best performance comes from ε = 0.1, which is large enough to explore and find the better action while still choosing the better option most of the time. One obvious drawback of this algorithm is that even after the better action has been found, the algorithm keeps exploring at the same rate.
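
A minimal sketch of this kind of ε-greedy simulation is shown below. The reward means, trial count, and number of games are assumed placeholder values, so the exact percentages it prints will not match Figure 1, but the qualitative pattern (too little exploration often fails to find the better action; too much keeps choosing the worse one) should hold.

```python
import numpy as np

def epsilon_greedy_game(epsilon, T=100, means=(45.0, 55.0), sigma=10.0, seed=None):
    """Play one 2-armed Gaussian bandit game with an epsilon-greedy policy and
    return the fraction of trials on which the truly better action was chosen."""
    rng = np.random.default_rng(seed)
    est = np.zeros(2)      # running estimate of each action's mean reward
    count = np.zeros(2)    # how many times each action has been taken
    best = int(np.argmax(means))
    best_taken = 0
    for t in range(T):
        if rng.random() < epsilon:
            a = int(rng.integers(2))     # explore: pick an action at random
        else:
            a = int(np.argmax(est))      # exploit: pick the currently best-looking action
        reward = rng.normal(means[a], sigma)
        count[a] += 1
        est[a] += (reward - est[a]) / count[a]   # incremental sample-mean update
        best_taken += (a == best)
    return best_taken / T

# Average over many simulated games at a few exploration rates.
for eps in (0.01, 0.1, 0.2):
    rate = np.mean([epsilon_greedy_game(eps, seed=s) for s in range(1000)])
    print(f"epsilon = {eps:4.2f}: best action chosen on {rate:.0%} of trials")
```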

Softmax policy (Sutton and Barto, 2018)

Instead of forcing all options to be explored equally, in the softmax policy actions associated with higher immediate rewards are more likely to be chosen, with probability

$$p(a_k|S) = \frac{e^{r(a_k|S)/\sigma}}{\sum_i e^{r(a_i|S)/\sigma}}$$

Here r(a_i|S) is the utility of action a_i, and σ is known as the decision noise and controls the level of exploration (the higher σ is, the less predictable choices are given the expected rewards r(a_i|S) of the actions). In the extreme cases, the policy always exploits if σ = 0 (choices are fully determined by the reward estimates) and always explores (choosing all actions with equal probability) as σ → ∞.

Figure 2: Simulation of softmax policy at different levels of σ

The level of decision noise σ controls the balance between exploration and exploitation. When σ = 0.01 is small, among the 1000 simulations the better action is found after 100 trials less than 80% of the time; when σ = 20 is large, it takes the algorithm only about 30 trials to find the best action more than 90% of the time, but because of the high noise the best action is then taken only 60% of the time. The best performance comes from σ = 10, which is large enough to explore and find the better action while still choosing the better option most of the time once the best action is found. One limitation of the softmax policy is that uncertainty in the r(a|S) estimates does not affect the action selection policy; only the magnitude of the estimate r(a|S) itself matters, with higher r(a|S) giving a higher chance of being selected. So even after all the options have been sampled many times, the softmax algorithm still explores to the same extent. In contrast, Thompson sampling (described below) takes uncertainty in the estimates into consideration, and in Thompson sampling action selection is overall more random when uncertainty in the r(a|S) estimates is high.
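
A comparable sketch for the softmax policy is shown below, again with assumed placeholder values for the reward means and the number of simulated games rather than the exact settings behind Figure 2.

```python
import numpy as np

def softmax_choice(est, sigma, rng):
    """Choose an action with probability proportional to exp(r(a|S) / sigma)."""
    z = est / sigma
    p = np.exp(z - z.max())          # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(est), p=p))

def softmax_game(sigma, T=100, means=(45.0, 55.0), reward_sd=10.0, seed=None):
    """One 2-armed Gaussian bandit game under the softmax policy."""
    rng = np.random.default_rng(seed)
    est, count = np.zeros(2), np.zeros(2)
    best = int(np.argmax(means))
    best_taken = 0
    for t in range(T):
        a = softmax_choice(est, sigma, rng)
        reward = rng.normal(means[a], reward_sd)
        count[a] += 1
        est[a] += (reward - est[a]) / count[a]   # incremental sample-mean update
        best_taken += (a == best)
    return best_taken / T

for sig in (0.01, 10.0, 20.0):
    rate = np.mean([softmax_game(sig, seed=s) for s in range(500)])
    print(f"sigma = {sig:5.2f}: best action chosen on {rate:.0%} of trials")
```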

Thompson sampling (Thompson, 1933)

Thompson sampling considers the uncertainty in r(a|S) and keeps track of the probability distribution of r(a|S). In this case, we assume that r(a|S) follows a Gaussian distribution with mean $\hat{r}(a|S)$ and standard deviation σ_a; this is known as the posterior distribution of r(a|S). At time t,

$$r^t_a \sim \mathrm{Norm}\!\left(\hat{r}^t(a|S),\, {\sigma^t_a}^2\right)$$

To select an action, in Thompson sampling a random outcome r^t_a is drawn from the distribution $\mathrm{Norm}(\hat{r}^t(a|S), {\sigma^t_a}^2)$ for every available action a, and the action a_t with the best associated draw r^t_a is selected:

$$a_t = \arg\max_a r^t_a$$

Intuitively, when the uncertainties σ_a are high, there is a higher chance that a currently non-exploit action will be chosen, which encourages exploration, whereas small uncertainties σ_a encourage exploitation. Importantly, as action a gets selected over and over, the uncertainty term σ_a decreases, so there is naturally a transition from a more explorative state to a more exploitative state as play goes on. Once an action a_t is taken, a Kalman filter (Kalman, 1960) is used to update the posterior of r^t(a_t|S) and the uncertainty term σ_{a_t} for the chosen action:

$$r^{t+1}_{a_t} \sim \mathrm{Norm}\!\left(\hat{r}^{t+1}(a_t|S),\, {\sigma^{t+1}_{a_t}}^2\right)$$

In this simplified example,

$$\hat{r}^{t+1}(a_t|S) = \hat{r}^t(a_t|S) + \frac{{\sigma^t_{a_t}}^2}{{\sigma^t_{a_t}}^2 + \sigma_r^2}\left(R^t_{a_t} - \hat{r}^t(a_t|S)\right)$$

$$\frac{1}{{\sigma^{t+1}_{a_t}}^2} = \frac{1}{{\sigma^t_{a_t}}^2} + \frac{1}{\sigma_r^2}$$

Here, R^t_{a_t} is the reward actually received when action a_t was taken at time t, and σ_r is the true standard deviation of the Gaussian bandit (assumed to be known to the agent here). An initial uncertainty σ_0 is defined before the first action is taken. Figure 3 simulates learning for 3 levels of initial uncertainty σ_0.

When σ_0 is small, there is not enough exploration to begin with, the initial outcomes from the two actions largely determine the estimated r(a|S), and the learning agent only finds the better bandit around 70% of the time; when σ_0 is large enough, the agent can quickly find the best bandit and gradually transition from exploration to exploitation. Because σ_a decreases as action a gets selected, a large initial σ_0 will only encourage more exploration (and possibly more errors) at the beginning but will decrease and promote exploitation eventually. Compared to a smaller σ_0 of 5, a larger σ_0 of 100 converges to the best action sooner.
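
The sketch below implements Thompson sampling with the Gaussian posterior and Kalman-style updates described above, using assumed placeholder values for the bandit means and for σ_r; it is meant to illustrate the σ_0-dependent transition from exploration to exploitation rather than to reproduce Figure 3.

```python
import numpy as np

def thompson_game(sigma0, T=100, means=(45.0, 55.0), sigma_r=10.0, seed=None):
    """One 2-armed Gaussian bandit game under Thompson sampling with Kalman updates."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)                       # posterior mean of r(a|S) for each action
    var = np.full(2, float(sigma0) ** 2)   # posterior variance, starting at sigma0^2
    best = int(np.argmax(means))
    best_taken = 0
    for t in range(T):
        draws = rng.normal(mu, np.sqrt(var))     # one sample from each action's posterior
        a = int(np.argmax(draws))                # take the action with the best draw
        reward = rng.normal(means[a], sigma_r)
        gain = var[a] / (var[a] + sigma_r ** 2)              # Kalman gain
        mu[a] += gain * (reward - mu[a])                     # posterior mean update
        var[a] = 1.0 / (1.0 / var[a] + 1.0 / sigma_r ** 2)   # precision update
        best_taken += (a == best)
    return best_taken / T

for s0 in (1.0, 5.0, 100.0):
    rate = np.mean([thompson_game(s0, seed=s) for s in range(500)])
    print(f"sigma0 = {s0:6.1f}: best action chosen on {rate:.0%} of trials")
```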

Figure 3: Simulation of Thompson sampling at different levels of initial uncertainty in the estimation σ0

Upper confidence bound (UCB) (Auer et al., 2002)

The upper confidence bound policy explicitly accounts for differences in uncertainty of the estimated r(a|S): actions with larger uncertainty in the r(a|S) estimate are favored. Instead of selecting the action that maximizes r(a|S), in this policy the action is selected to maximize

$$Q(a|S) = r(a|S) + c \cdot U(a|S)$$

where U(a|S) is the upper bound of confidence (uncertainty) in the estimate of r(a|S), and c is a scale factor. c controls the tradeoff between exploration and exploitation: a larger c favors options with more uncertainty, i.e. actions that have not yet been explored enough, and a small c favors exploitation (c = 0 is pure exploitation). One commonly used formula for U(a|S) is

$$U(a|S) = \sqrt{\frac{\ln n + 1}{n_a}}$$

where n_a is the number of times action a has been taken, and $n = \sum_a n_a$ is the total number of plays.

Figure 4: Simulation of the upper confidence bound policy at different levels of the weight c on the uncertainty term

In this simulation, with a large c (c = 100) the algorithm quickly finds the best action, but it keeps alternating even after figuring out which action is better (the less often chosen option always carries an uncertainty bonus). With c = 0 there is no exploration, so the algorithm depends heavily on the initial outcomes and is only able to detect the best action 75% of the time. With an appropriate c (c = 10), this policy is able to balance exploration and exploitation and achieve better performance.
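
A corresponding sketch of the upper confidence bound policy is shown below. The handling of actions that have never been taken (each action is forced once at the start so n_a > 0) and the reward settings are my assumptions, not details of the simulations behind Figure 4.

```python
import numpy as np

def ucb_game(c, T=100, means=(45.0, 55.0), sigma=10.0, seed=None):
    """One 2-armed Gaussian bandit game under the upper confidence bound policy."""
    rng = np.random.default_rng(seed)
    est = np.zeros(2)        # estimated mean reward r(a|S)
    count = np.zeros(2)      # n_a: number of times each action has been taken
    best = int(np.argmax(means))
    best_taken = 0
    for a in (0, 1):         # assumption: try each action once so the estimates are defined
        est[a] = rng.normal(means[a], sigma)
        count[a] += 1
    for t in range(T - 2):
        n = count.sum()
        bonus = np.sqrt((np.log(n) + 1.0) / count)        # U(a|S) = sqrt((ln n + 1) / n_a)
        a = int(np.argmax(est + c * bonus))               # maximize Q(a|S) = r(a|S) + c * U(a|S)
        reward = rng.normal(means[a], sigma)
        count[a] += 1
        est[a] += (reward - est[a]) / count[a]
        best_taken += (a == best)
    return best_taken / (T - 2)

for c in (0.0, 10.0, 100.0):
    rate = np.mean([ucb_game(c, seed=s) for s in range(500)])
    print(f"c = {c:6.1f}: best action chosen on {rate:.0%} of trials")
```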

Directed and random exploration

In general, there are at least two classes of heuristics for guiding exploration: directed and random exploration. As a generalization of the upper confidence bound policy, directed exploration encourages exploration by biasing choices towards the more uncertain option. Mathematically, this can be formulated as adding an "uncertainty bonus" or "information bonus" to the less known option:

$$Q(a|S) = r(a|S) + IB(a|S)$$

In directed exploration, choices are biased towards the more uncertain option. IB(a|S) here is a function of the uncertainty in the r(a|S) estimate.

As a generalization of the ε-greedy, softmax and Thompson sampling policies, random exploration encourages exploration by increasing behavioral variability. Mathematically, this can be formulated as adding "decision noise" to the value of the options:

$$Q(a|S) = r(a|S) + n(a|S)$$

In random exploration, exploratory choices are driven by chance. n(a|S) is a zero-mean random signal sampled from some probability distribution.
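
The two formulations can be contrasted in a few lines of code. In the sketch below, the reward estimates, uncertainty values, and the bonus and noise scales are all illustrative placeholders; the point is only that the information bonus shifts choices deterministically towards the uncertain option, while zero-mean decision noise flips the choice only occasionally and unpredictably.

```python
import numpy as np

rng = np.random.default_rng(3)

r_est = np.array([55.0, 50.0])       # estimated rewards: option 0 looks better
uncertainty = np.array([0.0, 1.0])   # option 1 is the less well-known option

def choose_directed(info_bonus):
    """Directed exploration: Q(a|S) = r(a|S) + IB(a|S), a deterministic bias
    towards the more uncertain option."""
    Q = r_est + info_bonus * uncertainty
    return int(np.argmax(Q))

def choose_random(noise_sd):
    """Random exploration: Q(a|S) = r(a|S) + n(a|S), zero-mean decision noise
    that occasionally flips the choice towards the lower-valued option."""
    Q = r_est + rng.normal(0.0, noise_sd, size=2)
    return int(np.argmax(Q))

print("directed, bonus = 10: chooses option", choose_directed(10.0))
picks = np.array([choose_random(10.0) for _ in range(1000)])
print(f"random, noise sd = 10: option 1 chosen on {np.mean(picks == 1):.0%} of decisions")
```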

Summary

Although explore-exploit problems are generally computationally intractable (see the discussion of optimal solutions above), simple heuristics that either bias actions towards more uncertain options (directed exploration) or use decision noise to induce variability in action selection (random exploration) are able to balance exploration and exploitation. Humans and animals face and solve explore-exploit problems all the time, and there is significant interest in understanding the algorithms and heuristics that humans and animals use in making explore-exploit decisions.

Exploration strategies in humans and animals

Experimental paradigms of the multi-armed bandit problem

Many variants of the multi-armed bandit task have been designed to investigate how humans and animals solve the explore-exploit dilemma. Some of the most commonly studied paradigms include:

1. Gaussian bandit

Each action a is associated with a stationary Gaussian distribution of rewards with mean μ_a and standard deviation σ_a; every time action a is chosen, a random reward is drawn from the associated Gaussian distribution:

$$r_a \sim \mathrm{Norm}(\mu_a, \sigma_a)$$

2. Binary multi-armed bandit (or Bernoulli multi-armed bandit)

Each action a is associated with a reward of 1 with probability p_a, and 0 otherwise:

$$r_a \sim \mathrm{Bernoulli}(p_a)$$

3. Independent Markov machine (or drifting bandit)

Each time a bandit is played, the underlying reward structure of all bandits shifts to a new state according to Markov state evolution probabilities. Here, the probability distribution of reward associated with each action is no longer stationary but instead drifts in time. A drifting Gaussian bandit, for example, assumes that the underlying mean associated with each action, μ_a, changes over time:

$$r^t_a \sim \mathrm{Norm}(\mu^t_a, \sigma_a)$$

Often it is assumed that the change in μ_a at each time step also follows a Gaussian distribution with mean 0 and a fixed variance σ_0^2, i.e. $\epsilon_t \sim \mathrm{Norm}(0, \sigma_0^2)$ and $\mu^{t+1}_a = \mu^t_a + \epsilon_t$ (a minimal sketch of this generative process is shown after this list).

4. Infinite bandit (Agrawal, 1995)

A multi-armed bandit problem with infinitely many arms (bandits). In this case, there is no way to exhaust exploring all the options. The simplest case of the infinite bandit assumes that each arm (action) is associated with a fixed reward r_a, but there are infinitely many possible actions to choose from.

5. Contextual bandit

In this class of bandit problems, the reward associated with each action, r(a|S), depends on the state of the environment, and there is more than one state involved in the task. For example, the pizza in restaurant A is great but the pasta in restaurant A is bad, whereas in restaurant B the pizza is bad and the pasta is good. So the reward associated with choosing either pasta or pizza is restaurant dependent.
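
As referenced in the drifting bandit entry above, here is a minimal sketch of that generative process; the starting means, the drift and reward standard deviations, and the trial count are illustrative assumptions.

```python
import numpy as np

def drifting_gaussian_bandit(T=200, n_arms=2, sigma_a=4.0, sigma_0=2.0, seed=0):
    """Generate rewards from a drifting Gaussian bandit: each arm's mean mu_a follows
    a Gaussian random walk, mu_{t+1} = mu_t + eps_t with eps_t ~ Norm(0, sigma_0^2),
    and the reward available from each arm is r_t ~ Norm(mu_t, sigma_a)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(30.0, 70.0, size=n_arms)   # illustrative starting means
    means = np.zeros((T, n_arms))
    rewards = np.zeros((T, n_arms))
    for t in range(T):
        means[t] = mu
        rewards[t] = rng.normal(mu, sigma_a)                  # rewards on trial t
        mu = mu + rng.normal(0.0, sigma_0, size=n_arms)       # the random-walk step
    return means, rewards

means, rewards = drifting_gaussian_bandit()
print("arm means at start:", means[0].round(1), "and at end:", means[-1].round(1))
```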

Directed and random exploration in humans and animals

Because the optimal solution to explore-exploit decisions is computationally intractable, humans and animals are thought to use approximations or heuristics in making explore-exploit decisions.

Directed exploration is information driven: action is biased towards the more uncertain option, and directed exploration is often quantified as an "information bonus" (see above). A number of studies have reported observing an information bonus in humans (Banks et al., 1997, Frank et al., 2009, Lee et al., 2011, Meyer and Shi, 1995, Payzan-LeNestour and Bossaerts, 2012, Steyvers et al., 2009, Zhang and Yu, 2013) as well as in animals (Krebs et al., 1978), whereas other studies failed to observe such an uncertainty bonus (Daw et al., 2006, Payzan-Lenestour and Bossaerts, 2011). Wilson et al. (2014) pointed out that in many studies uncertainty and reward are confounded: higher-reward options are sampled more and hence have less uncertainty. By manipulating reward and uncertainty independently, Wilson et al. (2014) were able to confirm that humans use directed exploration in solving the explore-exploit dilemma.

Random exploration is variability driven: actions with lower estimated reward are sometimes sampled by chance, and random exploration is often quantified as "decision noise" (see above). There is evidence that humans and animals also use random exploration. For example, songbirds produce more variable songs during practice periods (exploration phase) than when singing to a female bird (exploitation phase) (Brainard and Doupe, 2002, Kao et al., 2005). It has been shown that humans also use random exploration (Gershman, 2018, Wilson et al., 2014). In addition, Gershman (2018, 2019) showed that relative uncertainty modulates directed exploration whereas total uncertainty modulates random exploration.

Effect of horizon on exploration

One factor that has been shown to modulate exploration is the time horizon. In the restaurant example, if you are leaving town tomorrow and this is your last meal, you probably want to exploit and make sure that you have a good last meal; however, if you are new in town, you are more likely to choose to explore. When choosing between two food stations (a binary bandit task in which each bandit has a fixed probability of giving a reward), great tits explore more in their first 50 choices in the long horizon condition, when they have a total of 250 choices to make, than in the short horizon condition, where the total number of choices is 50 (Kacelnik, 1979). In a 2-armed bandit task, humans have been shown to be more biased towards the uncertain option (directed exploration) and more random in their behavior (random exploration) in a long horizon context compared to a short horizon context (Wilson et al., 2014). However, none of the above-mentioned algorithms (UCB, ε-greedy, softmax or Thompson sampling) alone can account for this horizon adaptation. Wilson et al. (2020) proposed that deep exploration (Osband et al., 2016), the idea of making decisions through mental simulation, can provide a unifying account of horizon-adaptive directed and random exploration.

Neural correlates of directed vs random exploration

The frontopolar cortex (FPC) and intraparietal sulcus (IPS) have been linked to exploration in several fMRI studies (Daw et al., 2006, Laureiro-Martínez et al., 2014). Consistent with this, EEG studies showed increased bilateral activation in frontal and parietal areas during exploration (Bourdaud et al., 2008). In addition, Badre et al. (2012) showed that rostrolateral PFC activity is correlated with the information bonus in directed exploration; consistently, Cavanagh et al. (2012) showed that frontal and parietal theta oscillations in EEG correlate with the information bonus. Recent studies suggest that directed and random exploration may rely on dissociable neural systems. Tomov et al. (2020) showed that relative uncertainty (which relates to directed exploration) and total uncertainty (which relates to random exploration) are represented in right rostrolateral prefrontal cortex and right dorsolateral prefrontal cortex respectively. Transcranial magnetic stimulation that inhibits the right frontopolar cortex selectively inhibits directed exploration but not random exploration (Zajkowski et al., 2017).

Current studies

This dissertation consists of three separate studies.

Experiment 1: Is random exploration truly random?

In random exploration, having variability in behavior is beneficial for exploration; however, we ask whether this behavioral variability is stochastic or whether it comes from a deterministic source. From a modeling perspective, behavioral variability is essentially the variance that cannot be explained by a model and is modeled as the level of decision noise. However, what we have called "decision noise" in previous research could actually just be deterministic components missing from the model, so it is difficult to tell whether decision noise truly arises from a stochastic process. In this experiment, we investigate which source of noise, deterministic vs stochastic, drives random exploration in humans in a modified version of the Horizon Task (Wilson et al., 2014). To distinguish between the two types of noise, we have people make the exact same explore-exploit decision twice. If decision noise is purely deterministic, then people's choices should be identical both times, that is, their choices should be consistent, since the stimulus is the same both times. Meanwhile, if decision noise is truly random, their choices should be less consistent, since random noise can differ between the two occasions. Through both model-free and model-based estimation, we can quantify the magnitude of both random and deterministic noise in driving random exploration.

22 Experiment 2: What accounts for horizon-adaptive behavioral variability?

Here we test the theory of "deep exploration" (Osband et al., 2016) in making explore-exploit decisions. Deep exploration states that people make decisions by simulating future events and the outcomes of actions. We ask whether this mechanism can naturally give rise to behavioral variability in people's choices that is adaptive to the horizon context. We designed a simple card game that we refer to as the Card Stopping Task. In this task, participants are presented with a row of 5 or 10 face-down cards. Each card can have any number from 1 to 100 (uniformly distributed), which represents the amount of reward available if that card is chosen. On each trial one of the cards is flipped and participants must decide whether to accept or reject this card. If they accept the card the game stops and they receive a points reward equal to the value of the accepted card. If they reject the card, the next card in the sequence is flipped and the process repeats. A key factor in the Card Stopping Task is the 'horizon', the number of face-down cards remaining, which plays a central role in deciding whether to stop. Behavior in the Card Stopping Task can be quantified with two parameters: the stopping threshold, i.e. the card value above which people are more likely to stop than continue, and the decision noise, i.e. the variability in the stopping threshold. Preliminary fits of these parameters to the behavioral data suggested that as the horizon decreases (1) the stopping threshold decreases and (2) the decision noise increases. That is, as the game goes on, participants are more likely to accept low-valued cards, but are also more random in their choices. By fitting a deep exploration model with two free parameters, (1) the number of samples and (2) the planning horizon, we ask whether deep exploration can account for the horizon-dependent changes in stopping threshold and decision noise.
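
To illustrate the intuition (and only the intuition) behind this idea, the sketch below is my own toy implementation of decision-by-mental-simulation in a Card Stopping style task: the value of rejecting a card is estimated by sampling imagined future cards, so a small number of samples yields a noisy effective threshold, while the threshold itself falls as fewer cards remain. The treatment of the final card, the default of three samples, and all names here are assumptions, not the deep exploration model evaluated in this dissertation.

```python
import random

def value_of_rejecting(h, n_samples=3):
    """Monte-Carlo ('mental simulation') estimate of the value of rejecting the current
    card when h face-down cards would remain after rejecting it. Each imagined future
    card is itself accepted or rejected by the same rule. Cost grows like n_samples**h,
    so keep both small."""
    if h == 1:
        return 50.5                      # only the last card would remain, and it must be accepted
    total = 0.0
    for _ in range(n_samples):
        imagined = random.randint(1, 100)                        # simulate flipping the next card
        total += max(imagined, value_of_rejecting(h - 1, n_samples))
    return total / n_samples

def accept(card_value, h, n_samples=3):
    """Accept the flipped card if it beats the simulated value of continuing.
    h is the number of face-down cards remaining; if none remain, accepting is forced."""
    if h == 0:
        return True
    return card_value >= value_of_rejecting(h, n_samples)

# The simulated stopping threshold falls as fewer cards remain, and with only a few
# samples each individual estimate of it is noisy.
for h in (9, 4, 2):
    estimates = [value_of_rejecting(h) for _ in range(20)]
    print(f"{h} cards left: mean threshold {sum(estimates) / len(estimates):5.1f}, "
          f"spread {max(estimates) - min(estimates):4.1f}")
```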

Experiment 3: Resolving a failed replication, factors that modulate horizon-adaptive exploration

Sadeghiyeh et al. (2018) found that whether reward information is gained through action (active condition) or presented passively (passive condition) can moderate horizon-dependent exploration. The same information, presented either actively (requiring an action to reveal it) or passively, can drastically change how participants' exploration depends on the horizon context. Participants no longer show horizon-adaptive directed or random exploration in the passive presentation condition, whereas they do in the active condition.

In the active version of the Horizon Task, participants are instructed to choose between two one-armed bandits that give out random rewards from different Gaussian distributions whose means are initially unknown. Sometimes they need to make 1 choice (short horizon) and sometimes 6 choices (long horizon). To give participants some information about the relative value of the two bandits before they make their own decisions, in the original task participants are instructed to press the arrow keys according to a preset sequence to reveal some sample outcomes from the two bandits; these are referred to as "sample plays". After the sample plays, on their first free choice, people are more biased towards the less-known option (directed exploration) and behave more randomly (random exploration) in the longer horizon (Wilson et al., 2014). In the passive version of the Horizon Task, instead of actively pressing arrow keys to reveal the outcomes of the sample plays, all of the sample outcomes are presented passively to participants without key presses. In this Passive condition, people show no horizon-dependent directed or random exploration. This is true even if the same participant has played the Active version, i.e. the original Horizon Task, first (Sadeghiyeh et al., 2018). In this study, we investigate what factors drive this dichotomy in horizon-dependent exploration behavior using different variants of the Horizon Task; specifically, we tested how timing, sequential order, motor response and attention modulate horizon-adaptive exploration.

References

Agarwal, D., Chen, B. C., and Elango, P. (2009). Explore/exploit schemes for web content optimization. In Proceedings - IEEE International Conference on Data Mining, ICDM.

Agrawal, R. (1995). The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning.

Badre, D., Doll, B. B., Long, N. M., and Frank, M. J. (2012). Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron.

Banks, J., Olson, M., and Porter, D. (1997). An experimental analysis of the bandit problem. Economic Theory.

Bellman, R. (1954). The Theory of Dynamic Programming. Bulletin of the American Mathematical Society.

Bourdaud, N., Chavarriaga, R., Galán, F., and Millán, J. D. R. (2008). Characterizing the EEG correlates of exploratory behavior. IEEE Transactions on Neural Systems and Rehabilitation Engineering.

Brainard, M. S. and Doupe, A. J. (2002). What songbirds teach us about learning. Nature, 417(6886):351–358.

Cavanagh, J. F., Figueroa, C. M., Cohen, M. X., and Frank, M. J. (2012). Frontal theta reflects uncertainty and unexpectedness during exploration and exploitation. Cerebral Cortex.

Celma, O. (2016). The Exploit-Explore Dilemma in Music Recommendation.

Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., and Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature.

Ericson, Å. and Kastensson, Å. (2011). Exploit and explore: Two ways of categorizing innovation projects. In ICED 11 - 18th International Conference on Engineering Design - Impacting Society Through Engineering Design.

Frank, M. J., Doll, B. B., Oas-Terpstra, J., and Moreno, F. (2009). Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience.

Gershman, S. J. (2018). Deconstructing the human algorithms for exploration. Cognition, 173:34–42.

Gershman, S. J. (2019). Uncertainty and exploration. Decision.

Jackson, B. J., Fatima, G. L., Oh, S., and Gire, D. H. (2020). Many paths to the same goal: balancing exploration and exploitation during probabilistic route planning. eneuro.

Kacelnik, A. (1979). Studies of foraging, behaviour and time budgeting in great tits (Parus major). University of Oxford, PhD dissertation.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, Transactions of the ASME.

Kao, M. H., Doupe, A. J., and Brainard, M. S. (2005). Contributions of an avian basal ganglia-forebrain circuit to real-time modulation of song. Nature, 433(7026):638–643.

Krebs, J. R., Kacelnik, A., and Taylor, P. (1978). Test of optimal sampling by foraging great tits. Nature, 275(5675):27–31.

Laureiro-Martínez, D., Canessa, N., Brusoni, S., Zollo, M., Hare, T., Alemanno, F., and Cappa, S. F. (2014). Frontopolar cortex and decision-making efficiency: Comparing brain activity of experts with different professional background during an exploration-exploitation task. Frontiers in Human Neuroscience.

Lee, M. D., Zhang, S., Munro, M., and Steyvers, M. (2011). Psychological models of human and optimal performance in bandit problems. Cognitive Systems Research.

Meyer, R. J. and Shi, Y. (1995). Sequential Choice Under Ambiguity: Intuitive Solutions to the Armed- Bandit Problem. Management Science.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems.

Payzan-Lenestour, E. and Bossaerts, P. (2011). Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Computational Biology.

Payzan-LeNestour, É. and Bossaerts, P. (2012). Do not bet on the unknown versus try to find out more: Estimation uncertainty and "unexpected uncertainty" both modulate exploration. Frontiers in Neuroscience.

Sadeghiyeh, H., Wang, S., and Wilson, R. C. (2018). Lessons from a "failed" replication: The importance of taking action in exploration. PsyArXiv.

Steyvers, M., Lee, M. D., and Wagenmakers, E. J. (2009). A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.

Thompson, W. R. (1933). On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika.

Tomov, M. S., Truong, V. Q., Hundia, R. A., and Gershman, S. J. (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nature Communications.

Tushman, M. L. and O'Reilly, C. A. (2011). Organizational Ambidexterity in Action: How Managers Explore and Exploit. California Management Review.

Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A., and Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General.

Wilson, R. C., Wang, S., Sadeghiyeh, H., and Cohen, J. D. (2020). Deep exploration as a unifying account of explore-exploit behavior.

Zajkowski, W. K., Kossut, M., and Wilson, R. C. (2017). A causal role for right frontopolar cortex in directed, but not random, exploration. eLife, 6:1–18.

Zhang, S. and Yu, A. J. (2013). Forgetful Bayes and myopic planning: Human learning and decision- making in a bandit setting. In Advances in Neural Information Processing Systems.

The nature of decision noise in random exploration

Siyu Wang1 and Robert C. Wilson1,2

1Department of Psychology, University of Arizona, Tucson AZ USA 2Cognitive Science Program, University of Arizona, Tucson AZ USA

Abstract

Human decision making is inherently variable. While this variability is often seen as a sign of sub-optimality in human behavior, recent work suggests that randomness can actually be adaptive. An example arises when we must choose between exploring unknown options or exploiting options we know well. A little randomness in these 'explore-exploit' decisions is remarkably effective as it encourages us to explore options we might otherwise ignore. Moreover, people appear to use such 'random exploration' in practice, increasing their behavioral variability when it is more valuable to explore. From a modeling perspective, behavioral variability is essentially the variance that cannot be explained by a model and is modeled as the level of decision noise. However, what we have called "decision noise" in previous research could actually just be deterministic components missing from the model, so it is difficult to tell whether decision noise truly arises from a stochastic process. Here we show that, while both random and deterministic noise drive variability in behavior, the noise driving random exploration is predominantly random. This suggests that random exploration depends on adaptive noise processes in the brain which are subject to cognitive control.

Introduction

Imagine trying to decide where to go to dinner. You can go to your favorite restaurant, the one you really enjoy and always go to, or you can try a new restaurant that you know nothing about. Such decisions, in which we must choose between a well-known 'exploit' option and a lesser known 'explore' option, are known as explore-exploit decisions. From a theoretical perspective, making optimal explore-exploit choices, i.e. choices that maximize long-term reward, is computationally intractable in most cases (Basu et al., 2018, Gittins and Jones, 1974). In part because of this difficulty, there is considerable interest in how humans and animals solve the explore-exploit dilemma in practice (Auer et al., 2002, Banks et al., 1997, Bridle, 1990, Daw et al., 2006, Frank et al., 2009, Gittins, 1979, Krebs et al., 1978, Lee et al., 2011, Meyer and Shi, 1995, Payzan-LeNestour and Bossaerts, 2011, Payzan-Lenestour and Bossaerts, 2012, Steyvers et al., 2009, Thompson, 1933, Watkins, 1989, Wilson et al., 2014, Zhang and Yu, 2013).

One particularly effective strategy for solving the explore-exploit dilemma is choice randomization (Bridle, 1990, Thompson, 1933, Watkins, 1989). In this strategy, the decision process between exploration and exploitation is corrupted by 'decision noise,' meaning that high value 'exploit' options are not always chosen and exploratory choices are sometimes made by chance. In theory, such 'random exploration' is surprisingly effective and, if implemented correctly, can come close to optimal performance (Agrawal and Goyal, 2011, Bridle, 1990, Chapelle and Li, 2011, Thompson, 1933). It has recently been shown that humans appear to actually use random exploration, actively adapting their decision noise to solve simple explore-exploit problems and increasing that noise when it is more beneficial to explore (Gershman, 2018, Wilson et al., 2014). In one of these tasks, known as the Horizon Task, the key manipulation is the horizon condition, i.e. the number of decisions remaining for the participant to make. Increasing the horizon makes exploration more valuable as there is more time to use the information gained by exploration to maximize future rewards. For example, if you are leaving town tomorrow (short horizon), you will probably exploit the restaurant you know and love, but if you are in town for a while (long horizon), you would be more likely to explore the new restaurant. Using such a horizon manipulation it has been shown that people's behavior is more variable in long horizons than short horizons, suggesting that they use adaptive decision noise to solve the explore-exploit dilemma (Wilson et al., 2014).

It is, however, difficult in these tasks to tell whether what is measured as decision noise is really random; what we have called 'noise' in previous research could actually just be deterministic components missing from the model. Decision noise as defined in previous research is more or less a quantification of what is not predictable by the model. In the restaurant case, an example of deterministic noise would be if you happen to spot an old friend walking into one of the restaurants. If the model does not account for the agent deterministically favoring the behavior of following a friend, the behavior of going to a less favorable restaurant because of a friend will appear to be 'random' when it is really a deterministic effect; hence this deterministic factor will be modeled as random decision noise. Crucially, however, this 'deterministic noise' is very much in the stimulus, and if you saw the same friend go into the same restaurant at a later date you might follow them again. Conversely, truly 'random noise' would arise from stochastic mental processes tossing a metaphorical coin in your head. Such a process would not be influenced by the friend going into the restaurant, and if you saw the same friend again, you might make a different choice.

In this paper, we investigate which source of noise, deterministic vs random, drives random exploration in humans in a modified version of the Horizon Task. To distinguish between the two types of noise, we had people make the exact same explore-exploit decision twice. If decision noise is purely deterministic noise, then people's choices should be identical both times, that is, their choices should be consistent, since the stimulus is the same both times. Meanwhile, if decision noise is truly random their choices should be less consistent, since random noise can be different both times. By analyzing behavior on this task in both a model-free and model-based manner, we show that, while both types of noise are present in explore-exploit decisions, the variability related to random exploration is dominated by random noise. The missing deterministic component is much smaller than the non-deterministic component in random exploration.

Results

The Repeated-Games Horizon Task

We used a modified version of the ‘Horizon Task’ (Wilson et al., 2014) to show the influence of random vs deterministic noise on people’s decisions (Figure 1). In this task, participants make repeated choices between two slot machines, or ‘one-armed bandits,’ that pay out probabilistic rewards. Because they are initially unsure as to the mean payoff of each bandit, this task requires that participants carefully balance exploration of the lesser known bandit with exploitation of the better known bandit to maximize their

overall rewards. Crucially, before people make their first choice in the Horizon Task, they are given information about the mean payoff from each bandit in the form of four example plays distributed either unequally between bandits (i.e. 1 play of one bandit, 3 plays of the other, the [1 3] condition) or equally (2 plays each, the [2 2] condition). These example plays allow us to manipulate exactly what people know about each option before they make their first choice. Thus, by giving people the exact same example plays twice in two separate games (separated by several minutes in time so as to avoid detection), the example plays allow us to probe how participants respond to the exact same explore-exploit choice twice. These 'repeated games' are the key manipulation in this paper and allow us to distinguish between deterministic and random sources of noise. Specifically, if noise is deterministically driven, then choices on repeated games should be consistent. Conversely, if noise is randomly driven, then choices on repeated games should be independent and can be inconsistent.

Both behavioral variability and information seeking increase with horizon

Before discussing the results for repeated games, we first confirm that the basic behavior in this task is consistent with our previously reported results (Wilson et al., 2014). As in our previous work, we find evidence for two types of exploration in the Horizon Task: random exploration, which is the main focus of this paper, in which exploration is driven by noise, and directed exploration, in which exploration is driven by information. Random exploration is quantified in a model-free way as the probability of choosing the low mean option, p(low mean). This value increases with horizon in both conditions, consistent with the idea that behavior is more random in horizon 6 (t(64) = 6.55, p < 0.001 for [1 3], t(64) = 7.99, p < 0.001 for [2 2]). Directed exploration is measured as the probability of choosing the more informative option, p(high info), in the unequal, or [1 3], condition. Again this measure increases with horizon, showing that people are more information seeking in horizon 6 (t(64) = 6.92, p < 0.001).


Figure 1: Schematic of the experiment. (A) Dynamics of an example horizon 6 game. Here the first four trials are forced trials in which participants are instructed which option to play. After the forced trials, participants are free to choose between the two options for the remainder of the game. (B) Different possible states of the game after the first free choice over the course of the experiment. Overall participants play about 160 such games, with varying horizon (1 vs 6), uncertainty condition ([1 3] vs [2 2]) and observed rewards. In addition, all games are repeated (as Game 18 and 100 are here) such that participants will be faced with the exact same pattern of forced trials and exact same outcomes from those forced trials twice within each experiment. These repeated games allow us to compute the relative contribution of deterministic and random noise by analyzing the extent to which choices are consistent across the repeated games.

33 Figure 2: Replication of previous findings. Both p(low mean) (A) and p(high info) (B) increase with horizon suggesting that people use both random and directed exploration in this task.

Model-free analysis shows that random exploration may involve both random and deterministic noise

Next we asked whether participants' choices were consistent or inconsistent across the two repetitions of each game. The idea behind this measure is that purely deterministic noise should lead to consistent choices, as the deterministic stimulus is identical both times. Conversely, purely random noise should lead to independent choices, and hence to choices that are sometimes inconsistent across the two repetitions. To quantify choice inconsistency we computed the frequency with which participants made different responses for pairs of repeated games (Figure 3). Using this measure we found that participants made inconsistent choices in both the unequal ([1 3]) and equal ([2 2]) information conditions, suggesting that not all of the noise was deterministic (t-tests vs zero revealed that inconsistency was greater than zero for all horizon and uncertainty conditions: for the [1 3] condition, t(64) = 13.72, p < 0.001 for horizon 1 and t(64) = 16.71, p < 0.001 for horizon 6; for the [2 2] condition, t(64) = 9.55, p < 0.001 for horizon 1 and t(64) = 17.93, p < 0.001 for horizon 6). In addition, we found that choice inconsistency was higher in horizon 6 than in horizon 1 for both the [1 3] (t(64) = 5.41, p < 0.001) and [2 2] (t(64) = 6.26, p < 0.001) conditions, suggesting that at least some of the horizon-dependent noise is random.

34 Figure 3: Model-free analysis suggests that both deterministic and random noise contribute to the choice variability in random exploration. For both the [1 3] (A) and [2 2] (B) condition, people show greater choice inconsistency in horizon 6 than horizon 1. However, the extent to which their choices are inconsistent lies between what is predicted by purely deterministic and random noise, suggesting that both noise sources influence the decision.

To gain more quantitative insight into these results, we computed theoretical values for the choice inconsistency for the purely deterministic and purely random noise cases. For purely deterministic noise this computation is simple because people should make the exact same decisions each time in repeated games, meaning that p(inconsistent) = 0 in this case. For purely random noise, the two games should be treated independently, allowing us to compute the choice inconsistency in terms of the probability of choosing the low mean option, p(low mean), as

$$p(\text{consistent}) = p(\text{low mean})^2 + p(\text{high mean})^2 = p(\text{low mean})^2 + (1 - p(\text{low mean}))^2$$

hence,

$$p(\text{inconsistent}) = 1 - p(\text{consistent}) = 2\, p(\text{low mean})\,(1 - p(\text{low mean}))$$

As shown in Figure 3, people's behavior falls in between the pure deterministic noise prediction and the pure random noise prediction (behavior differs from the pure random noise prediction in both the [1 3] condition, t(64) = 8.66, p < 0.001 for horizon 1, t(64) = 9.48, p < 0.001 for horizon 6, and the [2 2] condition, t(64) = 6.94, p < 0.001 for horizon 1, t(64) = 7.47, p < 0.001 for horizon 6; and behavior differs from the pure deterministic noise prediction in both the [1 3] condition, t(64) = 13.72, p < 0.001 for horizon 1, t(64) = 16.71, p < 0.001 for horizon 6, and the [2 2] condition, t(64) = 9.55, p < 0.001 for horizon 1, t(64) = 17.93, p < 0.001 for horizon 6), suggesting that both deterministic and random noise contribute to driving this choice inconsistency. Since choice inconsistency only reflects random noise, Figure 3 suggests that random noise increases with horizon.
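
The pure-random-noise benchmark is easy to check numerically. The short sketch below, using an arbitrary illustrative value of p(low mean), simulates independent choices on pairs of repeated games and recovers the 2p(1 − p) prediction; the purely deterministic case gives p(inconsistent) = 0 by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
p_low = 0.2          # illustrative probability of choosing the low-mean option
n_pairs = 100_000    # number of simulated pairs of repeated games

# Purely random noise: the two plays of a repeated game are independent draws.
first = rng.random(n_pairs) < p_low
second = rng.random(n_pairs) < p_low
print("simulated p(inconsistent):", round(float(np.mean(first != second)), 3))
print("predicted 2p(1-p):        ", round(2 * p_low * (1 - p_low), 3))
```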

Model-based analysis shows that random exploration is dominated by random noise

To more precisely quantify random and deterministic noise, we turned to model fitting. We modeled behavior on the first free choice of the Horizon Task using a version of the logistic choice model in (Wilson et al., 2014) that was modified to differentiate random and deterministic noise. In particular, we assume that in repeated games, deterministic noise remains the same whereas random noise can change.

Overview of model

As with our model-free analysis, the model-based analysis focuses only on the first free-choice trial, since that is the only free choice for which we have full control over the experience participants have had with the two bandits. To model participants' choices on this first free-choice trial, we assume that they make decisions by computing the difference in value ∆Q between the right and left options, choosing right when ∆Q > 0 and left otherwise. Specifically, we write

∆Q = ∆R + A∆I + b + n_det + n_ran     (1)

where the experimentally controlled variables are ∆R = R_right − R_left, the difference between the means of the rewards shown on the forced trials, and ∆I, the difference in the information available for playing the two options on the first free-choice trial. For simplicity, and because information is manipulated categorically in the Horizon Task, we define ∆I to be +1, −1 or 0: +1 if one reward is drawn from the right option and three from the left in the [1 3] condition, −1 if one is drawn from the left and three from the right, and 0 in the [2 2] condition. n_det and n_ran are the deterministic and random noise respectively, both assumed to come from logistic distributions with mean 0. The subject-and-condition-specific parameters are: the spatial bias, b, which determines the extent to which participants prefer the option on the right; the information bonus A, which controls the level of directed exploration; n_det, the deterministic noise, which is identical on the repeat versions of each game; and n_ran, the random noise, which is uncorrelated between repeat plays and changes every game. For each pair of repeated games, the set of forced-choice trials is exactly the same, so the deterministic noise, n_det, should be the same while the random noise, n_ran, may differ. This is exactly how we distinguish deterministic noise from random noise. In symbolic terms, for repeated games i and j, n^i_det = n^j_det and n^i_ran ≠ n^j_ran.
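
A minimal simulation of Equation 1 for one pair of repeated games might look as follows; the parameter values are illustrative rather than fitted estimates, and the function name is ours, not the authors'.

    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_repeated_pair(dR, dI, A, b, sigma_det, sigma_ran):
        """Simulate the first free choice (1 = right, 0 = left) on both repeats of one game.

        Deterministic noise is drawn once and shared by the two repeats, whereas random
        noise is redrawn on every play. Both noise terms are logistic with mean zero.
        """
        n_det = rng.logistic(0.0, sigma_det)           # identical across the repeat pair
        choices = []
        for _ in range(2):                             # the two repetitions of the game
            n_ran = rng.logistic(0.0, sigma_ran)       # changes on every play
            dQ = dR + A * dI + b + n_det + n_ran       # Equation 1
            choices.append(int(dQ > 0))
        return choices

    # Example: right option looks 10 points worse but is less well known (a [1 3] game).
    print(simulate_repeated_pair(dR=-10.0, dI=1.0, A=5.0, b=0.0, sigma_det=3.0, sigma_ran=8.0))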

Model fitting

We used hierarchical Bayesian analysis to fit the parameters of the model (see Figure 6 for a graphical representation of the model in the style of Lee and Wagenmakers (2014a)). In particular, we fit values of the information bonus A, the spatial bias b, the variance of the random noise, σ^2_ran, and the variance of the deterministic noise, σ^2_det, for each participant in each horizon. Model fitting was performed using the MATJAGS and JAGS software (Depaoli et al., 2016, Steyvers, 2011), with full details given in the Methods.

Model fitting results

Posterior distributions over the group-level means of the deterministic and random noise variance are shown in Figure 4. Consistent with our model-free results, we see that both random and deterministic noise variances are non-zero and that random noise is about 2-3 times larger than the deterministic noise. In addition, we find that random noise increases dramatically with horizon (M = 4.55, 100% of samples showed an increase in random noise with horizon) whereas the increase in deterministic noise is smaller (M = 1.78, 98.12% of samples showed an increase in deterministic noise with horizon). Taken together, these results suggest that random exploration is dominated by random noise.

Figure 4: Model-based analysis showing the posterior distributions over the group-level mean of the standard deviations of random and deterministic noise. Both random (A, B) and deterministic (C, D) noise are nonzero (A, C) and change with horizon (B, D). However, random noise has both a greater magnitude overall (A, C) and a greater change with horizon (B, D) than deterministic noise.

Model comparison

The previous section suggests that behavioral variability in random exploration is dominated by random noise. To test this more explicitly, we built a series of models making different assumptions about the presence or absence of each type of noise and about whether each type of noise, if present, is horizon dependent (see Table 1). In models A-D, we assumed the existence of both random and deterministic noise: in models A and B, random noise is assumed to be horizon dependent, whereas in models A and C, deterministic noise is assumed to be horizon dependent. In model E, we assumed no deterministic noise. In model F, we assumed no random noise.

Model   Deterministic noise   Random noise
A       Horizon dependent     Horizon dependent
B       Fixed                 Horizon dependent
C       Horizon dependent     Fixed
D       Fixed                 Fixed
E       None                  Horizon dependent
F       Horizon dependent     None

Table 1: Model description.

To evaluate and compare the models, we simulated choice behavior using the subject-level parameters from the hierarchical Bayesian fits of each model. The same model-free analysis as described in the previous section was then applied to the 6 sets of simulated data for the 6 models respectively (see Figure 5). The original measure of random exploration, p(low mean), as used in Wilson et al. (2014), can be explained either by deterministic noise alone (Figure 5, Panel F2) or by random noise alone (Figure 5, Panel E2). The qualitative finding that participants exploit the high-mean option less and choose the low-mean option more in horizon 6 can be explained by either pure deterministic or pure random noise, as long as that noise is horizon dependent. If both deterministic and random noise are assumed to be the same in both horizons (Figure 5, Panel D), p(low mean) becomes completely flat and no horizon-dependent random exploration is observed. On the other hand, when we examine the percentage of inconsistent choices in repeated pairs of games, p(inconsistent), deterministic noise alone can no longer account for behavior (Figure 5, Panels F3, F4). Moreover, models C and D are ruled out because the increase of choice inconsistency with horizon can only be qualitatively accounted for when random noise is horizon dependent (Figure 5, Panels A, B, E). Among models A, B and E, in which random noise is horizon dependent, model A provides the best quantitative fit. If there is no deterministic noise (model E), then we overestimate the level of choice inconsistency in both horizons by a constant. In addition, horizon-dependent deterministic noise (model A) gives slightly better fits than assuming deterministic noise is the same in both horizons (model B). Overall, these model simulations confirm that the horizon dependence of random noise is the main source of random exploration.
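
The simulate-and-score logic behind this comparison can be sketched as follows for a few of the model variants; the noise values are arbitrary illustrations, not the fitted subject-level parameters.

    import numpy as np

    rng = np.random.default_rng(3)

    def simulate_pair(dR, sigma_det, sigma_ran):
        """One repeated-game pair in the [2 2] condition (no information bonus or bias).

        Returns the two first free choices, coded 1 if the low-mean option is chosen."""
        n_det = rng.logistic(0.0, sigma_det) if sigma_det > 0 else 0.0
        choices = []
        for _ in range(2):
            n_ran = rng.logistic(0.0, sigma_ran) if sigma_ran > 0 else 0.0
            choices.append(int(dR + n_det + n_ran > 0))  # dR < 0: low-mean option on the right
        return choices

    def score(sigma_det, sigma_ran, dR=-10.0, n_pairs=5000):
        pairs = np.array([simulate_pair(dR, sigma_det, sigma_ran) for _ in range(n_pairs)])
        p_low = pairs.mean()                            # p(low mean)
        p_incon = np.mean(pairs[:, 0] != pairs[:, 1])   # p(inconsistent)
        return round(p_low, 3), round(p_incon, 3)

    # Illustrative settings for models A (both noises), E (random only) and F (deterministic only).
    settings = {"A": {"h1": (2.0, 5.0), "h6": (4.0, 12.0)},
                "E": {"h1": (0.0, 5.0), "h6": (0.0, 12.0)},
                "F": {"h1": (5.0, 0.0), "h6": (12.0, 0.0)}}
    for model, horizons in settings.items():
        for h, (sd, sr) in horizons.items():
            print(model, h, score(sd, sr))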

Figure 5: Model comparison: model A - both deterministic and random noise are horizon dependent, model B - only random noise is horizon dependent, model C - only deterministic noise is horizon dependent, model D - neither random nor deterministic noise is horizon dependent, model E - only random noise is assumed to be present, model F - only deterministic noise is assumed to be present.

Discussion

In this paper, we investigated whether random exploration is driven by random noise, putatively arising in the brain, or by deterministic noise, arising from the environment. Using a version of the Horizon Task with repeated games, we found evidence for both types of noise in explore-exploit decisions. In addition, we see that both random and deterministic noise increase with horizon, but that the horizon effect is much larger for random noise. Taken together, our results suggest that random exploration, i.e. the use and adaptation of decision noise to drive exploration, is primarily driven by random noise. Perhaps the main limitation of this work is in the interpretation of the different types of noise as being random and deterministic. In particular, while we controlled many aspects of the stimulus across repeated games (e.g. the outcomes and the order of the forced trials), we could not perfectly control all stimuli the participant received, which would vary, for example, based on exactly what they were looking at or whether they were scratching their nose. Thus, our estimate of deterministic noise is likely a lower bound. Likewise, our estimate of random noise is likely an upper bound, as these ‘missing’ sources of deterministic noise would be interpreted as random noise in our model. Despite this, it seems hard to imagine that these additional noise sources could be enough to account for the large differences between random and deterministic noise that we found in Figure 4, where random noise is 2-3 times the size of deterministic noise. Taken at face value, the horizon-dependent increase in random noise is consistent with the idea that random exploration is driven by intrinsic variability in the brain. This is in line with work in the bird song literature in which variability during song learning has been tied to neural variability arising from specific areas of the brain (Brainard and Doupe, 2002, Kao et al., 2005). In addition, this work is consistent with a recent report from Ebitz et al. (2017) in which the behavioral variability of monkeys in an ‘explore’ state was also tied to random rather than deterministic sources of noise. Whether such a noise-controlling area exists in the human brain is less well established, but one candidate theory (Aston-Jones and Cohen, 2005) suggests that norepinephrine (NE) from the locus coeruleus may play a role in modulating the level of random noise. Indeed, changes in the NE system have been associated with changes in behavioral variability in both humans and other animals in a variety of tasks (Keung et al., 2018, Tervo et al., 2014). In addition, there is some evidence that NE plays a direct role in random exploration (Warren et al., 2017), although this finding is complicated by other work showing no effect of NE drugs on exploration (Jepma et al., 2012, Nieuwenhuis et al., 2005).

More generally, our finding that random noise dominates behavioral variability over deterministic noise is consistent with the findings of Drugowitsch et al. (2016). In particular, these authors show that randomness in behavior arises from imperfections in mental inference, which happen inside the brain, rather than in peripheral processes such as sensory processing and response selection. This suggests that most noise in behavior is generated randomly and may arise from computational errors in computing the correct strategy. In the context of the Horizon Task, such computational errors would likely be larger in the long horizon condition, as the correct course of action in these cases is much harder to compute.

Methods

Participants

80 participants (ages 18-25, 37 male, 43 female) from the University of Arizona undergraduate subject pool participated in the experiment. 15 were excluded on the basis of performance, using the same exclusion criterion as in (Wilson et al., 2014). This left 65 for the main analysis. Note that including the 15 poorly performing subjects did not change the main results (Supplementary Figures 1-3).

Task

The task was a modified version of the Horizon Task (Wilson et al., 2014). In this task, participants played a set of games in which they made choices between two slot machines (one-armed bandits) that paid out rewards from different Gaussian distributions. In each game they made multiple decisions between the two options. Each option paid out a random reward between 1 and 100 points sampled from a Gaussian distribution. The means of the underlying Gaussians were different for the two bandit options, remained the same within a game, but changed with each new game. One of the bandits always had a higher mean than the other. Participants were instructed to maximize the points earned over the entire task. To maximize their rewards in each game, participants needed to exploit the slot machine with the higher mean, but they could not identify this best option without exploring both options first. The number of games participants played depended on how well they performed, which acted as the primary incentive for performing the task. Thus, the better participants performed, the sooner they got to leave the experiment. On average, participants played 153.7 games (minimum = 90 games, maximum = 192 games) and the whole task lasted between 12.34 and 32.12 minutes (mean 22.75 minutes).

As in the original paper, the payoffs tied to the bandits were independent between games and drawn from Gaussian distributions with variable means and a fixed standard deviation of 8 points. The difference between the mean payouts of the two slot machines was set to either 4, 8, 12 or 20 points. One of the means was always equal to either 40 or 60 and the second was set accordingly. Participants were informed that in every game one of the bandits always had a higher mean reward than the other. The order of games was randomized, and mean sizes and order of presentation were counterbalanced. Each game consisted of 5 or 10 choices. Every game started with a fixation cross, after which a bar of boxes appeared indicating the horizon for that game. On the first 4 trials of each game (the instructed trials), we highlighted the box on one of the bandits to instruct the participant to choose that option, and they had to press the corresponding key to reveal the outcome. From the 5th trial, boxes on both bandits were highlighted and participants were free to make their own decision. There was no time limit for decisions. During free choices they could press either the left or right arrow key to indicate their choice of the left or right bandit. The score feedback was presented for 300 ms. The task was programmed using Psychtoolbox in MATLAB (Brainard, 1997, Pelli, 1997) (see Figure 1). The first four trials of each game were forced-choice trials, in which only one of the options was available for the participant to choose. We used these forced-choice trials to manipulate the relative ambiguity of the two options, by providing the participant with different amounts of information about each bandit before their first free choice. The four forced-choice trials set up two uncertainty conditions: unequal uncertainty (or [1 3]), in which one option was forced to be played once and the other three times, and equal uncertainty (or [2 2]), in which each option was forced to be played twice. After the forced-choice trials, participants made either 1 or 6 free choices (the two horizon conditions; Figure 1).
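
As a sketch of how the reward schedule described above could be generated (this is not the authors' MATLAB implementation; the rounding and clipping to the 1-100 range, and the direction of the mean offset, are assumptions):

    import numpy as np

    rng = np.random.default_rng(4)

    def make_game(n_trials):
        """Draw the two bandit means and a payout sequence for one game."""
        gap = rng.choice([4, 8, 12, 20])       # difference between the two mean payouts
        anchor = rng.choice([40, 60])          # one mean is always 40 or 60
        sign = rng.choice([-1, 1])             # whether the other mean is higher or lower (assumed)
        means = np.array([anchor, anchor + sign * gap], dtype=float)
        rng.shuffle(means)                     # randomize which side is the better bandit
        payouts = rng.normal(means, 8.0, size=(n_trials, 2))       # fixed SD of 8 points
        return means, np.clip(np.round(payouts), 1, 100).astype(int)

    means, payouts = make_game(n_trials=10)    # a horizon-6 game: 4 forced + 6 free trials
    print(means, payouts[:4])                  # the 4 forced-trial outcomes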

Data and code

Behavioral data, as well as MATLAB code to recreate the main figures from this paper, will be made available on the Dataverse website upon publication.

Model-based analysis

We modeled behavior on the first free choice of the Horizon Task using a version of the logistic choice model in (Wilson et al., 2014) that was modified to differentiate random and deterministic noise. In particular, we assume that in repeated games, deterministic noise remains the same whereas random noise can change.

Hierarchical Bayesian Model

To model participants' choices on this first free-choice trial, we assume that they make decisions by computing the difference in value ∆Q between the right and left options, choosing right when ∆Q > 0 and left otherwise. Specifically, we write

∆Q = ∆R + A∆I + b + n_det + n_ran     (2)

where the experimentally controlled variables are ∆R = R_right − R_left, the difference between the means of the rewards shown on the forced trials, and ∆I, the difference in the information available for playing the two options on the first free-choice trial. For simplicity, and because information is manipulated categorically in the Horizon Task, we define ∆I to be +1, −1 or 0: +1 if one reward is drawn from the right option and three from the left in the [1 3] condition, −1 if one is drawn from the left and three from the right, and 0 in the [2 2] condition. n_det and n_ran are the deterministic and random noise respectively. The other variables are: the spatial bias, b, which determines the extent to which participants prefer the option on the right; the information bonus A, which controls the level of directed exploration; n_det, the deterministic noise, which is identical on the repeat versions of each game; and n_ran, the random noise, which is uncorrelated between repeat plays and changes every game. Each subject's behavior in each horizon condition is described by 4 free parameters: the information bonus, A, the spatial bias, b, the standard deviation of the deterministic noise, σ_det, and the standard deviation of the random noise, σ_ran (Table 2, Figure 6). Each of the free parameters is fit to the behavior of each subject using a hierarchical Bayesian approach (Allenby et al., 2005). In this approach to model fitting, each parameter for each subject is assumed to be sampled from a group-level prior distribution whose parameters, the so-called ‘hyperparameters’, are estimated using a Markov Chain Monte Carlo (MCMC) sampling procedure. The hyperparameters themselves are assumed to be sampled from ‘hyperprior’ distributions whose parameters are defined such that these hyperpriors are broad. The particular priors and hyperpriors for each parameter are shown in Table 2. For example, we assume that the information bonus, A_is, for each horizon condition i and each participant s, is sampled from a Gaussian prior with mean µ^A_i and standard deviation σ^A_i. These prior parameters are sampled in turn from their respective hyperpriors: µ^A_i from a Gaussian distribution with mean 0 and standard deviation 10, and σ^A_i from an Exponential distribution with parameter 0.1.

Parameter                                    Prior                                 Hyperparameters                 Hyperpriors
information bonus, A_is                      A_is ∼ Gaussian(µ^A_i, σ^A_i)         θ^A_i = (µ^A_i, σ^A_i)          µ^A_i ∼ Gaussian(0, 100); σ^A_i ∼ Exponential(0.01)
spatial bias, b_is                           b_is ∼ Gaussian(µ^b_i, σ^b_i)         θ^b_i = (µ^b_i, σ^b_i)          µ^b_i ∼ Gaussian(0, 100); σ^b_i ∼ Exponential(0.01)
deviation of deterministic noise, σ^det_is   σ^det_is ∼ Gamma(k^det_i, λ^det_i)    θ^det_i = (k^det_i, λ^det_i)    k^det_i ∼ Exponential(0.01); λ^det_i ∼ Exponential(10)
deviation of random noise, σ^ran_is          σ^ran_is ∼ Gamma(k^ran_i, λ^ran_i)    θ^ran_i = (k^ran_i, λ^ran_i)    k^ran_i ∼ Exponential(0.01); λ^ran_i ∼ Exponential(10)

Table 2: Model parameters, priors, hyperparameters and hyperpriors.
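
To make the generative structure in Table 2 concrete, the sketch below forward-samples one subject's parameters in Python. It assumes Gaussian(mean, variance) notation (so Gaussian(0, 100) has a standard deviation of 10, consistent with the text) and Gamma(shape, rate) notation, and it is purely illustrative rather than the fitting code used here.

    import numpy as np

    rng = np.random.default_rng(5)

    def draw_group_level():
        """Sample one set of group-level hyperparameters from the hyperpriors in Table 2."""
        return {"mu_A": rng.normal(0.0, np.sqrt(100.0)),   # Gaussian(0, 100): sd = 10
                "sd_A": rng.exponential(1.0 / 0.01),       # Exponential(0.01): mean 100
                "mu_b": rng.normal(0.0, np.sqrt(100.0)),
                "sd_b": rng.exponential(1.0 / 0.01),
                "k_det": rng.exponential(1.0 / 0.01), "lam_det": rng.exponential(1.0 / 10.0),
                "k_ran": rng.exponential(1.0 / 0.01), "lam_ran": rng.exponential(1.0 / 10.0)}

    def draw_subject(g):
        """Sample one subject's condition-specific parameters given the group level."""
        return {"A": rng.normal(g["mu_A"], g["sd_A"]),
                "b": rng.normal(g["mu_b"], g["sd_b"]),
                "sigma_det": rng.gamma(g["k_det"], 1.0 / g["lam_det"]),  # Gamma(shape, 1/rate)
                "sigma_ran": rng.gamma(g["k_ran"], 1.0 / g["lam_ran"])}

    print(draw_subject(draw_group_level()))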

Model fitting using MCMC

The model was fit to the data using a Markov Chain Monte Carlo approach implemented in the JAGS package (Depaoli et al., 2016) via the MATJAGS interface (psiexp.ss.uci.edu/research/programs_data/jags). This package approximates the posterior distribution over model parameters by generating samples from this posterior distribution given the observed behavioral data. In particular, we used 4 independent Markov chains to generate 16000 samples from the posterior distribution over parameters (4000 samples per chain). Each chain had a burn-in period of 2000 samples, which were discarded to reduce the effects of initial conditions, and posterior samples were acquired at a thinning rate of 1. Convergence of the Markov chains was confirmed post hoc by eye.

The generative model shown graphically in Figure 6 is as follows: for each condition i, subject s and game g, deterministic noise n^det_isg ∼ Logistic(0, σ^det_is) is drawn once and shared by the two repeats r = 1, 2 of that game, while random noise n^ran_isgr ∼ Logistic(0, σ^ran_is) is drawn independently on every play; the decision variable is ∆Q_isgr = ∆R_isg + A_is ∆I_isg + b_is + n^ran_isgr + n^det_isg, and the observed choice is c_isgr ∼ Bernoulli(∆Q_isgr > 0). The subject-level parameters A_is, b_is, σ^ran_is and σ^det_is are drawn from the group-level priors and hyperpriors listed in Table 2.

Figure 6: Schematic of the hierarchical Bayesian model, using the notation of Lee and Wagenmakers (2014b).

Parameter recovery

To be sure that our fit parameter values were meaningful, we tested the ability of our model fitting procedure to recover parameters from simulated data. In particular, we simulated choices with the fitted parameters from the hierarchical Bayesian analysis, and then re-fit the simulated choices to see whether we could recover the parameters. Results of this parameter recovery procedure are shown in Figure 7. As is clear from this figure, parameter recovery is good for all parameters. The recovery of the noise parameters, σ_det and σ_ran, is slightly better for horizon 1 than horizon 6. This is because larger noise requires more trials to estimate, so with the same number of choices it is harder to recover the overall larger noise in horizon 6. In addition, we see better recovery for random noise than for deterministic noise, because we effectively have half as many trials for deterministic noise, since only one sample of deterministic noise is generated for each repeated-game pair. Overall, we are able to recover both deterministic and random noise with our model to a satisfactory extent.

Figure 7: Parameter recovery over the subject-level means of the information bonus, A, spatial bias, b, random noise variance, σ_ran, and deterministic noise variance, σ_det, for horizon 1 (left column) and horizon 6 (right column) games.

References

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem, 2011.

Greg Allenby, Peter Rossi, and Robert McCulloch. Hierarchical bayes models: A practitioners guide. 01 2005.

G. Aston-Jones and J. D. Cohen. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci., 28:403–450, 2005.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(235), 2002. URL https://doi.org/10.1023/A:1013689704352.

J. Banks, M. Olson, and D. Porter. An experimental analysis of the bandit problem. Economic Theory, 10:55, 1997.

Debabrota Basu, Pierre Senellart, and Stéphane Bressan. BelMan: Bayesian bandits on the belief–reward manifold, 2018.

D. H. Brainard. The Psychophysics Toolbox. Spat Vis, 10(4):433–436, 1997.

M. S. Brainard and A. J. Doupe. What songbirds teach us about learning. Nature, 417(6886):351–358, May 2002.

J.S. Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters. Advances in Neural Information Processing Systems, 2:211– 217, 1990.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.

N. D. Daw, J. P. O’Doherty, P. Dayan, B. Seymour, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879, Jun 2006.

Sarah Depaoli, James P. Clifton, and Patrice R. Cobb. Just another Gibbs sampler (JAGS): Flexible software for MCMC implementation. Journal of Educational and Behavioral Statistics, 41(6):628–649, 2016. doi: 10.3102/1076998616664876. URL https://doi.org/10.3102/1076998616664876.

J. Drugowitsch, V. Wyart, A. D. Devauchelle, and E. Koechlin. Computational Precision of Mental Inference as Critical Source of Human Choice Suboptimality. Neuron, 92(6):1398–1411, Dec 2016.

B. Ebitz, T. Moore, and T. Buschman. Bottom-up salience drives choice during exploration. Cosyne, 2017.

M. J. Frank, B. B. Doll, J. Oas-Terpstra, and F. Moreno. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nat. Neurosci., 12(8):1062–1068, Aug 2009.

Samuel J. Gershman. Deconstructing the human algorithms for exploration. Cognition, 2018. ISSN 18737838. doi: 10.1016/j.cognition.2017.12.014.

J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. J. R. Statist. Soc. B, 41(2):148–177, 1979.

J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, 1974.

M. Jepma, R. G. Verdonschot, H. van Steenbergen, S. A. Rombouts, and S. Nieuwenhuis. Neural mechanisms underlying the induction and relief of perceptual curiosity. Front Behav Neurosci, 6:5, 2012.

M. H. Kao, A. J. Doupe, and M. S. Brainard. Contributions of an avian basal ganglia-forebrain circuit to real-time modulation of song. Nature, 433(7026):638–643, Feb 2005.

Waitsang Keung, Todd A Hagen, and Robert C Wilson. Regulation of evidence accumulation by pupil-linked arousal processes. bioRxiv, 2018. doi: 10.1101/309526. URL https://www.biorxiv.org/content/early/2018/04/28/309526.

J.R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great tits. Nature, 275:27–31, 1978. doi: 10.1038/275027a0.

M.D. Lee, S. Zhang, M.N. Munro, and M. Steyvers. Psychological models of human and optimal performance on bandit problems. Cognitive Systems Research, 12:164–174, 2011.

Michael D. Lee and Eric-Jan Wagenmakers. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press, 2014a. doi: 10.1017/CBO9781139087759.

Michael D. Lee and Eric-Jan Wagenmakers. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press, 2014b. doi: 10.1017/CBO9781139087759.

R. Meyer and Y. Shi. Choice under ambiguity: Intuitive solutions to the armed-bandit problem. Manage- ment Science, 41:817, 1995.

S. Nieuwenhuis, D. J. Heslenfeld, N. J. von Geusau, R. B. Mars, C. B. Holroyd, and N. Yeung. Activity in human reward-sensitive brain areas is strongly context dependent. Neuroimage, 25(4):1302–1309, May 2005.

E. Payzan-LeNestour and P. Bossaerts. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Comput. Biol., 7(1):e1001048, Jan 2011.

E. Payzan-Lenestour and P. Bossaerts. Do not Bet on the Unknown Versus Try to Find Out More: Estimation Uncertainty and "Unexpected Uncertainty" Both Modulate Exploration. Front Neurosci, 6:150, 2012.

D. G. Pelli. The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spat Vis, 10(4):437–442, 1997.

M. Steyvers. matjags: An interface for MATLAB to JAGS, version 1.3. 2011. URL http://psiexp.ss.uci.edu/research/programs_data/jags/.

M. Steyvers, M. Lee, and E. Wagenmakers. A Bayesian analysis of human decision making on bandit problems. Journal of Mathematical Psychology, 53:168, 2009.

D. G. R. Tervo, M. Proskurin, M. Manakov, M. Kabra, A. Vollmer, K. Branson, and A. Y. Karpova. Behavioral variability through stochastic choice and its gating by anterior cingulate cortex. Cell, 159 (1):21–32, Sep 2014.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. ISSN 00063444. URL http://www.jstor.org/stable/2332286.

Christopher M. Warren, Robert C. Wilson, Nic J. van der Wee, Eric J. Giltay, Martijn S. van Noorden, Jonathan D. Cohen, and Sander Nieuwenhuis. The effect of atomoxetine on random and directed exploration in humans. PLOS ONE, 12(4):1–17, 04 2017. doi: 10.1371/journal.pone.0176034. URL https://doi.org/10.1371/journal.pone.0176034.

C. J. C. H. Watkins. Learning from delayed rewards. Ph.D thesis, Cambridge University, 1989.

R. C. Wilson, A. Geana, J. M. White, E. A. Ludvig, and J. D. Cohen. Humans use directed and random exploration to solve the explore-exploit dilemma. J Exp Psychol Gen, 143(6):2074–2081, Dec 2014.

S. Zhang and A. J. Yu. Forgetful bayes and myopic planning: Human learning and decision making in a bandit setting. Advances in Neural Information Processing Systems, 26:2607–2615, 2013.

Deep exploration accounts for the stopping threshold and behavioral variability in an optimal stopping task

Siyu Wang1, Ali Gilliland2, Maggie Calder2, and Robert C. Wilson1,3

1Department of Psychology, University of Arizona, Tucson AZ USA 2Neuroscience and Cognitive Science Program, University of Arizona, Tucson AZ USA 3Cognitive Science Program, University of Arizona, Tucson AZ USA

Abstract

Imagine you are on a road trip and looking to refuel. The next gas station is slightly overpriced; do you stop here to refuel or keep driving in the hopes of finding a lower price? This type of question is known as a “Stopping Problem”, and whether one is trying to find the best price on gas or the best person to fill a job, such problems occur frequently in daily life. Theoretical strategies for stopping problems have been extensively studied, but how humans solve these problems has received less attention. In this paper we designed a simple card game, which we refer to as the Card Stopping Task, to investigate how humans solve optimal stopping problems. In this task, a row of face-down cards is flipped one by one in front of the participant, and the participant has to decide when to stop flipping and take the last card. A key factor in the Card Stopping Task is the ‘horizon’, the number of face-down cards remaining, which plays a central role in deciding whether to stop. For example, if there are many gas stations in range, you may be more likely to pass the current gas station that is overpriced. But if your gas light is on, you wouldn’t hesitate to stop and refuel. Behavior in the Card Stopping Task can be quantified with two parameters: the stopping threshold, i.e. the card value above which people were more likely to stop than continue, and the decision noise, i.e. the variability in the stopping threshold. By fitting these parameters to the behavioral data we found that as the horizon decreases (1) the stopping threshold decreases and (2) the decision noise increases. That is, as the game goes on, participants are more likely to accept low-valued cards, but are also more random in their choices. This opposite horizon dependence of threshold and noise can be accounted for by a simple sampling model based on the idea of Deep Exploration (Osband et al., 2016). In this model, we assume that people make the accept/reject decision by simulating a small number (between 1 and 4) of possible futures if they were to reject the card. Comparing this simulated outcome with the current card, they stop if the current card is higher than the simulated outcome and continue otherwise. This model successfully accounts for the simultaneous decrease of threshold and increase of noise with horizon, suggesting a potential mechanism for how humans solve the optimal stopping problem.

Keywords— stopping problem, deep exploration, behavioral variability

Introduction

Imagine you are on a lengthy road trip. As you cross the country, you are constantly exposed to differing gas prices and always on the lookout to refuel for the lowest price. As your car gets low on gas, the next gas station is slightly overpriced; do you stop here to refuel or keep driving in the hopes of finding a lower price? This type of question is known as a ‘Stopping Problem’, where you have to choose whether to stop and take the current offer or continue to try and find something better. There is a rich history in statistics and mathematics of studying optimal stopping as a complex optimization problem. From hiring a secretary (Gilbert and Mosteller, 1966) to hunting for an apartment, stopping problems crop up in all sorts of situations, spawning a rich literature in statistics, mathematics, and economics (Rouder, 2014). In these cases, researchers often are interested in finding optimal solutions to stopping problems in different situations. Two situations, in particular, have received wide attention: rank-based stopping problems and exact-value stopping problems. In rank-based problems (an example of which is the classic ‘secretary problem’), only the relative rank of the current option among those already seen is known, and the goal is to maximize the rank one stops at. Conversely, in the exact-value paradigm, the actual value of the current option is known, and the goal is to maximize the value one stops at (Guan and Lee, 2014, Lee, 2006). Mathematically optimal solutions can be derived for both types of stopping problem under a variety of different conditions (Gilbert and Mosteller, 1966). In the rank-based version, depending on the number of candidates N, the optimal strategy is to reject the first r candidates and then take the first of the following candidates that is better than all of the first r. In the exact-value version, depending on the number of candidates N, the optimal strategy specifies a threshold value, above which you should stop and below which you should continue your search. Of course, how stopping problems should be solved in theory may offer little guidance as to how humans and animals solve stopping problems in practice. Indeed, previous studies in psychology and cognitive science found evidence that people make suboptimal stopping decisions (Bearden et al., 2006, Guan and Lee, 2014, Guan et al., 2015, Lee, 2006, von Helversen and Mata, 2012). However, despite being suboptimal overall, people's behavior does share some features with the optimal model (Guan and Lee, 2014). For example, in an exact-value paradigm, Guan and Lee (2014) found that, like the optimal model, people use a threshold model in which they are more likely to accept an offer if it exceeds their threshold. Moreover, they found that this threshold decreases as the number of remaining options decreases, in a manner that is at least qualitatively consistent with optimal behavior.

Later, in a similar exact-value paradigm, Baumann et al. (2018) argued that rather than applying an internal threshold, stopping behavior arose from a policy of maximizing the likelihood of a better future outcome instead of maximizing the average future outcome, which is what the optimal strategy usually optimizes. Implicit in all of this previous work is the idea that humans are noisy. That is, they do not rigidly adhere to a hard threshold, always rejecting offers below it and always accepting offers above it. Instead the threshold is soft, meaning they are more likely to reject, but will sometimes accept, an offer below it, and are more likely to accept, but will sometimes reject, an offer above it. This variability in behavior usually passes unremarked but, inspired by recent work showing that variability may be adaptive in similar decisions (Wilson et al., 2014), we wondered whether this variability has structure and whether it might serve some purpose. In this paper, we used a tightly controlled version of an exact-value stopping task. In this task, participants were presented with one face-up card and between 1 and 4 face-down cards. Their job was to decide whether to stop and take the value shown on the current face-up card (which could be between 1 and 100 points) or continue and flip the next card. Consistent with Guan and Lee (2014), we found that people used a threshold-like policy, and that their threshold decreased when there were fewer face-down cards. In addition, however, we also found the opposite pattern in people's variability: when there were fewer face-down cards, people's behavior was more variable. Furthermore, this simultaneous adaptation of threshold and behavioral variability can be accounted for by a simple sampling model based on the idea of Deep Exploration (Osband et al., 2016). In this model, we assume that people make the accept versus reject decision by simulating a small number (between 1 and 4) of possible futures if they were to reject the card. Comparing this simulated outcome with the current card, they would stop if the current card is higher than the simulated outcome and continue otherwise. This model successfully accounts for the simultaneous decrease of threshold and increase of noise with horizon, suggesting a potential mechanism for how humans solve the optimal stopping problem.

Methods

Participants

50 participants from the University of Arizona undergraduate subject pool participated in the experiment. 6 were excluded because they were under 18 (per our subject pool policy and Institutional Review Board). This left 44 for the analysis (ages 18-41, 12 male, 32 female).

Card Stopping Task

To investigate behavioral variability in stopping problems, we designed a simple card game that we refer to as the Card Stopping Task (Figure 1). In this task, participants are presented with a row of N face-down cards (N = 2, 3, 4, or 5). Each card carries a random number from 1 to 100 (uniformly distributed), which represents the amount of reward available if that card is chosen. The cards are flipped one by one, and after each card is flipped the participant must decide whether to accept or reject it. If they accept the card, the game stops and they receive the reward value on the accepted card. If they reject the card, the next card in the sequence is flipped and the process repeats. Once a participant elects to reject the current card and flip the next card, they cannot return to a previous card. If they flip the last card, i.e. the Nth card, they automatically accept the value of this last card. Participants are instructed to maximize the total reward they gain throughout the experiment (Figure 1). Participants start each game with N = 2, 3, 4, or 5 cards. Games with differing numbers of cards were interleaved and counterbalanced so that each level of N appeared equally often in the experiment. The first 19 participants played 400 total games in which each N level appeared 100 times; the remaining 25 participants played 360 total games in which each N level appeared 90 times (the game was shortened to decrease the length of the experiment). The task was implemented using Psychtoolbox in MATLAB (Brainard, 1997, Kleiner et al., 2007, Pelli, 1997).

Descriptive model

We used a descriptive model to quantify behavior on the Card Stopping Task. In this model we specify a separate stopping threshold and decision noise as a function of the number of cards remaining, H (H is also referred to as the horizon). We fit the behavior using a simple logistic model by maximum likelihood estimation. The model computes a ∆Q value and makes probabilistic decisions to stop or to continue based on this ∆Q value. ∆Q is the difference between the current card value V and the participant's stopping threshold θ:

∆Q = V − θ


Figure 1: An example of an N-card game (N = 5). At each step, participants choose whether to stop on the current card. If they choose to stop, the game ends and they receive the reward value on that card; if they choose to flip the next card (Go), the process repeats, but they are not allowed to go back to a previous card. If there are no more cards remaining, the participant is forced to accept the last card.

The likelihood function P_H is the probability that the participant stops when there are H cards remaining:

P_H = 1 / (1 + e^{−(V − θ)/σ})

where the experimentally controlled variable is V, the current card value, and the 2 free parameters are θ, the stopping threshold, i.e. the card value above which people were more likely to stop than continue, and σ, the decision noise, i.e. the variability in the stopping threshold. We quantified the stopping threshold and behavioral variability by fitting the θ and σ values for each participant and each horizon condition.
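
A minimal maximum-likelihood fit of this two-parameter model might look like the sketch below; the simulated choices and starting values are placeholders, not the actual data or fitting code.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)

    def p_stop(V, theta, sigma):
        """Probability of stopping on a card of value V, given threshold theta and noise sigma."""
        z = np.clip((V - theta) / sigma, -500, 500)   # clip the exponent to avoid overflow
        return 1.0 / (1.0 + np.exp(-z))

    # Placeholder choices for one participant in one horizon condition.
    true_theta, true_sigma = 65.0, 8.0
    V = rng.integers(1, 101, size=200).astype(float)
    stopped = (rng.random(200) < p_stop(V, true_theta, true_sigma)).astype(float)

    def neg_log_lik(params):
        theta, sigma = params
        p = np.clip(p_stop(V, theta, sigma), 1e-9, 1 - 1e-9)
        return -np.sum(stopped * np.log(p) + (1 - stopped) * np.log(1 - p))

    fit = minimize(neg_log_lik, x0=[50.0, 10.0], bounds=[(1.0, 100.0), (0.1, 50.0)])
    print(fit.x)   # recovered (theta, sigma), close to the generating values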

Deep exploration model

The deep exploration model provides a unified account of stopping threshold and behavioral variability in the card stopping problem. This model assumes that people make the stop/continue decision by simulating a small number of possible futures if they were to reject the current card. Comparing the card values gained

in the simulated futures with the current card, participants would choose to stop if the current card is higher than the simulated outcomes on average and continue otherwise. In formal terms, when the current card value is V and the number of remaining cards is H (i.e. the horizon), the participant first simulates the values of the remaining cards, V_1, V_2, ..., V_H, by randomly drawing from a uniform distribution U(1, 100). In the simplest model, the participant then computes the maximal value of these remaining card values, V_max = max(V_1, ..., V_H) (equivalently, V_max is drawn from the distribution of the maximum of H uniformly distributed random numbers from 1 to 100). By repeating the above process n times, the model generates n samples of such V_max: V_max^1, V_max^2, ..., V_max^n. Based on these simulations, when rejecting the current card, the participant expects to earn on average V̄_max = (1/n) Σ_{i=1}^{n} V_max^i points. If V̄_max > V, then the participant will choose to continue and flip the next card; otherwise the participant will stop on the current card. This deep exploration model has two free parameters: (1) the number of simulations, n, and (2) the subjective horizon, T (used in place of H above). Simulating the model for different combinations of n and T shows that the model simultaneously predicts that as H decreases, the stopping threshold θ decreases and the decision noise σ increases (Figure 2). Both n and T are fit to each participant's choices for each horizon condition.
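
The decision rule can be sketched as follows, with the maximum-of-remaining-cards policy assumed as described above; estimating p(continue) over many repetitions traces out the predicted choice curve for each horizon.

    import numpy as np

    rng = np.random.default_rng(7)

    def p_continue(V, H, n, n_reps=20000):
        """Estimated probability of rejecting a card of value V with H cards remaining,
        for an agent that averages n simulated 'best remaining card' values."""
        sims = rng.integers(1, 101, size=(n_reps, n, H))   # n simulated futures of H cards each
        v_max = sims.max(axis=2)                           # best card within each simulated future
        v_bar = v_max.mean(axis=1)                         # average over the n simulations
        return np.mean(v_bar > V)                          # continue if the average beats V

    # Shorter horizons give a lower effective threshold and a shallower (noisier) choice curve.
    for H in (1, 4):
        curve = [round(p_continue(V, H, n=2), 2) for V in (40, 60, 80)]
        print(f"H = {H}: p(continue) at V = 40, 60, 80 -> {curve}")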

Results

Stopping choice depends on the number of cards remaining (horizon) and does not depend on the total number of cards in the game.

Here we show that people's choices depend only on the number of cards remaining, H (also referred to as the horizon), and do not depend on the total number of cards in the game, N. We computed the percentage of stopping choices as a function of the card value for each combination of H and N where 1 ≤ H < N ≤ 5. The choice curves with the same N do not overlap with each other, whereas the choice curves with the same H overlap (Figure 3). This shows that the planning horizon H, i.e. the number of face-down cards remaining, plays a central role in deciding whether to stop in the Card Stopping Task.
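
The choice curves can be computed directly from trial-level data as in the sketch below; the simulated arrays and the 20-point value bins are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(8)
    n_trials = 5000
    # Placeholder trial-level data: card value, cards remaining H, game length N, stop (0/1).
    V = rng.integers(1, 101, size=n_trials)
    H = rng.integers(1, 5, size=n_trials)
    N = np.minimum(H + rng.integers(1, 3, size=n_trials), 5)
    stop = (rng.random(n_trials) < 1 / (1 + np.exp(-(V - 55 - 5 * H) / 10))).astype(int)

    bins = np.arange(0, 101, 20)
    for h in range(1, 5):
        for n in range(h + 1, 6):
            mask = (H == h) & (N == n)
            if mask.sum() == 0:
                continue
            idx = np.digitize(V[mask], bins)
            curve = [round(stop[mask][idx == b].mean(), 2) for b in np.unique(idx)]
            print(f"H = {h}, N = {n}: {curve}")   # curves sharing the same H should overlap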

Stopping threshold and behavioral variability as a function of horizon

Since people's choices depend only on the horizon H and not on the total number of cards N, we can collapse the curves across N for each horizon to improve the signal-to-noise ratio (Figure 4, Panel A).

Figure 2: Simulated choice curves using the deep exploration model for n = 1, 2, 3, 4 and H = 1, 2, 3, 4. The deep exploration model predicts the simultaneous decrease of stopping threshold and increase of decision noise as a function of horizon.

Figure 3: The choice curves with the same N do not overlap with each other, whereas the choice curves with the same H overlap.

As the number of cards remaining, H, decreases, the choice curve shifts to the left, showing that on average people are willing to stop at a lower value when the horizon is short. This is consistent with Guan and Lee (2014). Furthermore, the choice curve becomes slightly flatter as H decreases, which qualitatively shows that participants behave more randomly when there are fewer cards left. To quantify the stopping threshold and behavioral variability in the Card Stopping Task, we fitted the two-parameter descriptive model, with free parameters the stopping threshold θ and the decision noise σ, to each participant's behavioral data in each horizon condition. We found that as the horizon decreases (1) the stopping threshold θ decreases and (2) the decision noise σ increases. That is, as the game goes on, participants are more likely to accept low-valued cards, but are also more random in their choices (Figure 4, Panels B and C).

Figure 4: A. Choice curves. B. Threshold increases with horizon. C. Decision noise decreases with horizon.

In addition, comparing participants' stopping thresholds with the optimal threshold shows that participants are clearly suboptimal and used a stopping threshold higher than optimal; this is also consistent with Guan and Lee (2014) (Figure 4, Panel B). After the experiment we interviewed participants about the strategies they used to solve the task, and some voluntarily reported a strategy of using a particular threshold value to decide whether to stop. Our model-fitted stopping threshold θ values correlated strongly with participants' self-reported stopping thresholds. Overall, this simple 2-parameter logistic model fits people's choices well.

Deep exploration provides a unified account for the simultaneous decrease of stopping threshold and increase of behavioral variability as horizon decreases.

This opposite horizon dependence for threshold and noise can be accounted for by a simple sampling model based on the idea of Deep Exploration (Osband et al., 2016). In this model, we assume that people make the stop/continue decision by simulating a small number of possible futures if they were to reject the card. Comparing this simulated outcome with the current card, they would stop if the current card is higher than the simulated outcome and continue otherwise. This model successfully accounts for the simultaneous decrease of threshold and increase of decision noise with horizon, suggesting a potential mechanism for how humans solve the optimal stopping problem.

Discussion

In this paper, we assessed how humans solve the optimal stopping problem using a version of the exact-value paradigm. Consistent with previous findings (Guan and Lee, 2014), we showed that people actively decrease their stopping threshold as the horizon, i.e. the number of remaining options, decreases. In addition, we showed that people's behavioral variability increases as they decrease their stopping threshold. Finally, a simple deep exploration model, built on the idea of making decisions by simulating future events, can qualitatively account for the simultaneous decrease of stopping threshold and increase of behavioral variability as horizon decreases. In the field of explore-exploit decisions, an adaptation of behavioral variability as a function of horizon has also been observed (Wilson et al., 2014). However, in the explore-exploit setting people decrease their behavioral variability as horizon decreases, whereas in the current stopping task people increase their variability as horizon decreases. In the same task, people also simultaneously adapt the threshold and the behavioral variability. The deep exploration model can also account for the simultaneous change with horizon in the explore-exploit task (Wilson et al., 2020). Despite the opposite horizon dependence across tasks, deep exploration is able to account for the relationship between horizon and decision noise in both situations. There can be different deep exploration policies. Some alternative policies are: (a) simulate the best of the remaining cards and compare this highest card with the current card (see Methods); (b) simulate flipping the cards sequentially and stop on the first card that is higher than the current card; (c) nested simulation. All of these exploration policies can qualitatively account for behavior. With the two free parameters n and T, the deep exploration model is also capable of a quantitatively close fit to behavior. However, one limitation of such a model is that it does not capture the irrationally high threshold at short horizons (H = 2). In conclusion, this work suggests that mental simulation can be involved in complex decisions.

References

Christiane Baumann, Henrik Singmann, Vassilios E Kaxiras, Samuel Gershman, and Bettina von Helversen. Explaining Human Decision Making in Optimal Stopping Tasks. pages 1341–1346, 2018.

J. Neil Bearden, Amnon Rapoport, and Ryan O. Murphy. Experimental studies of sequential selection and assignment with relative ranks. Journal of Behavioral Decision Making, 2006. ISSN 10990771. doi: 10.1002/bdm.521.

David H. Brainard. The Psychophysics Toolbox. Spatial Vision, 1997. ISSN 01691015. doi: 10.1163/156856897X00357.

John P. Gilbert and Frederick Mosteller. Recognizing the Maximum of a Sequence. Journal of the American Statistical Association, 1966. ISSN 1537274X. doi: 10.1080/01621459.1966.10502008.

Maime Guan and Michael D Lee. Threshold Models of Human Decision Making on Optimal Stopping Problems in Different Environments. In Proceedings of the 36th Annual Conference of the Cognitive Science Society, 2014.

Maime Guan, Michael D Lee, and Joachim Vandekerckhove. A Hierarchical Cognitive Threshold Model of Human Decision Making on Different Length Optimal Stopping Problems. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, 2015.

Mario Kleiner, David H Brainard, Denis G Pelli, Chris Broussard, Tobias Wolf, and Diederick Niehorster. What’s new in Psychtoolbox-3? Perception, 2007. ISSN 0301-0066. doi: 10.1068/v070821.

Michael D. Lee. A hierarchical Bayesian model of human decision-making on an optimal stopping problem. Cognitive Science, 2006. ISSN 03640213.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.

Denis G. Pelli. The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 1997. ISSN 01691015. doi: 10.1163/156856897X00366.

Jeffrey N. Rouder. Optional stopping: no problem for Bayesians. Psychonomic bulletin & review, 2014. ISSN 15315320. doi: 10.3758/s13423-014-0595-4.

Bettina von Helversen and Rui Mata. Losing a dime with a satisfied mind: Positive affect predicts less search in sequential decision making. Psychology and Aging, 2012. ISSN 19391498.

R. C. Wilson, A. Geana, J. M. White, E. A. Ludvig, and J. D. Cohen. Humans use directed and random exploration to solve the explore-exploit dilemma. J Exp Psychol Gen, 143(6):2074–2081, Dec 2014.

Robert C Wilson, Siyu Wang, and Hashem Sadeghiyeh. Deep exploration as a unifying account of explore-exploit behavior. PsyArXiv, 2020.

The importance of action in explore-exploit decisions

Siyu Wang1, Hashem Sadeghiyeh1,3, and Robert C. Wilson1,2

1Department of Psychology, University of Arizona 2Cognitive Science Program, University of Arizona 3Department of Psychological Science, Missouri University of Science and Technology

Abstract

Humans and animals constantly face the tradeoff between exploring something new and exploiting what they have learned to be good. One key factor that moderates such explore-exploit decisions is the planning horizon, i.e. how far ahead one plans when making the decision. Previous work has shown that humans can adapt their level of exploration to the horizon context; specifically, people are more biased towards the less-known option (known as directed exploration) and behave more randomly (known as random exploration) in a longer horizon context (Wilson et al., 2014). However, Sadeghiyeh et al. (2018) showed that this horizon-adaptive exploration depends critically on how the value information about the options is obtained: participants only show horizon-adaptive exploration when the value information is gained through action-triggered responses (Active version), and do not show horizon adaptation when the information is presented without actions to retrieve it (Passive version). In the Passive version, participants showed no horizon-adaptive directed or random exploration. This is true even if the same participant has played the Active version first. To further investigate what kills the horizon-adaptive exploration in the passive condition, in this paper we performed a series of follow-up experiments to assess the influence of individual factors, such as processing time and the sequential presentation of value information, on horizon-adaptive exploration. This work reveals a more complicated nature of explore-exploit decisions and suggests an influence of action on how subjective utility is computed in the brain.

Keywords— explore-exploit decision, directed exploration, random exploration

Introduction

When deciding what to eat, will you order the pizza you always like or the new pasta on the menu? The tradeoff between exploiting the option you know and exploring new options is known as the explore-exploit dilemma. How far ahead one plans for the future, the “horizon”, plays a key role in making such explore-exploit decisions: you will be more likely to try the new pasta if you know you are coming back to the same restaurant. Previous work has shown that humans can adapt their level of exploration to the horizon context (Wilson et al., 2014), a result that has been replicated many times over the past few years (Somerville et al., 2017, Warren et al., 2017, Zajkowski et al., 2017). However, Sadeghiyeh et al. (2018), using a modified version of the same task as in (Wilson et al., 2014), showed that this horizon-adaptive exploration depends critically on how the value information is obtained. In the original task, participants are instructed to choose between two one-armed bandits that give out random rewards from different Gaussian distributions whose means are initially unknown. Sometimes they need to make 1 choice (short horizon) and sometimes 6 choices (long horizon). To give participants some information about the relative value of the two bandits before they make their own decisions, in the original task participants are instructed to press the arrow keys according to a preset sequence to reveal some sample outcomes from the two bandits – these are referred to as “sample plays”. After the sample plays, on their first free choice, people are more biased towards the less-known option (known as directed exploration) and behave more randomly (known as random exploration) in the longer horizon (Wilson et al., 2014). In the modified version, instead of actively pressing arrow keys to reveal the outcomes of the sample plays, all of the sample outcomes are presented passively to participants without key presses. In this Passive condition, people show no horizon-dependent directed or random exploration. This is true even if the same participant has played the Active version, i.e. the original Horizon Task, first. It remains unclear what modulates the ability to adapt exploration to the horizon context between the two conditions described above. It is also surprising that, with practically the same information at hand, people differ in whether they adapt their exploration to the horizon context. To further investigate what kills the horizon-adaptive exploration in the passive condition, we performed a series of follow-up experiments in which: (a) a forced delay was added before participants could make their first choice, (b) the outcomes of the sample plays were shown sequentially without a key press, (c) the outcomes of the sample plays were revealed upon pressing a neutral key (the space bar) instead of the arrow keys, (d) the outcomes of the sample plays were revealed upon pressing a different set of keys (”v” and ”b”), and (e) the active and passive versions were played in an interleaved rather than blocked design. We found increasing levels of horizon-adaptive exploration in versions (a), (b), (c) and (d); however, in none of these cases did we find the same level of horizon-adaptive exploration as in the active condition. In the interleaved version (e), however, people show an increase in exploration with horizon comparable to that in the active condition. This suggests that the same information is registered differently depending on how it is presented and on the mindset of the participants. This work reveals a more complicated nature of explore-exploit decisions and suggests an influence of action on how subjective utility is computed in the brain.

Methods

Participants

Participants in all experiments came from the University of Arizona Psychology Subject Pool. Participants were excluded based on their performance in the task using the same criterion as in (Wilson et al., 2014): those who did not choose the bandit with the higher underlying mean payout significantly above chance (α = 0.001) on the last trial of long horizon games were excluded (one way to implement this criterion is sketched below). The total number of participants and the number excluded for each experiment are listed in the table below.

Study Name total excluded remaining

Study 0: active/passive 92 31 61 Study 1: delayed passive 94 18 76 Study 2: active/sequential 79 32 47 Study 3: active/space bar 73 26 47 Study 3b: active/space bar (separate) 30 11 19 Study 4: left hand/space bar 188 46 142 Study 5: intermixed 42 7 35 Subtotal 640 178 462

Table 1: Participants in all studies.
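The exclusion rule above amounts to a one-sided binomial test on the last free choice of each long-horizon game. The following is a minimal Python sketch of that test, assuming the data have already been reduced to a 0/1 flag per game; the function name and data layout are hypothetical, not the actual analysis code.

```python
# Minimal sketch of the performance-based exclusion criterion (hypothetical
# data layout: one 0/1 flag per long-horizon game, indicating whether the
# last free choice picked the bandit with the higher underlying mean).
import numpy as np
from scipy.stats import binomtest

def passes_inclusion(last_trial_correct, alpha=0.001):
    """Keep a participant only if they pick the better bandit on the last
    trial of long-horizon games significantly above chance (p = 0.5)."""
    n_correct = int(np.sum(last_trial_correct))
    n_games = len(last_trial_correct)
    result = binomtest(n_correct, n_games, p=0.5, alternative="greater")
    return result.pvalue < alpha

# Example: better bandit chosen on the last trial in 62 of 80 long-horizon games
flags = [1] * 62 + [0] * 18
print(passes_inclusion(flags))  # True -> this simulated participant is kept
```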

Behavioral Task

We used different variations of the Horizon Task (Wilson et al., 2014) to investigate what modulates the horizon-dependent adaptation in exploration. In the original Horizon Task (Wilson et al., 2014), participants choose between two slot machines (which we will also refer to as bandits) that give out random points from 1 to 100, drawn from two respective Gaussian distributions with a fixed variance (8 points) but different means. The mean payouts of the two bandits change from game to game but remain constant within a game. Participants are instructed to maximize the total reward points they earn. In each game, one of the two bandits has a higher average payout than the other, and hence is the more rewarding option to choose in that game. In each game, before participants make their own free choices, there are 4 sample trials that show them some outcomes from both bandits. In the biased information condition, people see 1 outcome from one of the bandits and 3 outcomes from the other, whereas in the unbiased information condition, people see 2 outcomes from each bandit (Figure 1G). There are also two horizon conditions: sometimes there is only 1 free choice to make (short horizon condition), and sometimes there are 6 free choices (long horizon condition) (Figure 1G).
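To make the task structure concrete, the following Python sketch generates the reward schedule for a single game. It is illustrative only: the particular mean payouts (40 and 60 points), the interpretation of the 8-point noise as a Gaussian standard deviation, and the clipping of rewards to the 1-100 range are assumptions for illustration, not the exact parameters of the experiments.

```python
# Illustrative sketch of one game's generative structure (not experiment code).
import numpy as np

def make_game(horizon, info_condition, rng):
    """horizon: 1 or 6 free choices; info_condition: [1, 3] (biased) or
    [2, 2] (unbiased) sample plays for the two bandits."""
    # one bandit has a higher mean payout than the other (illustrative values)
    means = rng.permutation([40, 60])
    n_trials = 4 + horizon                     # 4 sample trials plus free choices

    def payout(bandit):
        # noisy reward around the bandit's mean (8-point noise, assumed to be
        # the standard deviation), clipped to the 1-100 point range
        return int(np.clip(round(rng.normal(means[bandit], 8)), 1, 100))

    # forced-choice (sample) trials in the given information condition
    forced = [0] * info_condition[0] + [1] * info_condition[1]
    rng.shuffle(forced)
    sample_outcomes = [(bandit, payout(bandit)) for bandit in forced]
    return {"means": means.tolist(), "horizon": horizon,
            "n_trials": n_trials, "sample_outcomes": sample_outcomes}

rng = np.random.default_rng(0)
print(make_game(horizon=6, info_condition=[2, 2], rng=rng))
```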

Figure 1: Different versions of the Horizon Task. (A) EXP0: active version (original Horizon Task); (B) EXP0b: passive version; (C) EXP1: delayed version; (D) EXP2: sequential version; (E) EXP3: space bar version; (F) EXP4: left hand version; (G) task conditions (biased and unbiased information conditions).

However, the different variants of the Horizon Task differ in how the sample trials are presented.

Experiment 0: active version (Original horizon task)

During the sample trials in the active version of the task (the original task), one of the two boxes is highlighted at each choice point, and participants have to press the corresponding arrow key to reveal the outcome; after that, the next trial starts. After the sample trials are over, on the 5th trial, both boxes are highlighted and participants use the arrow keys again to make their free choices (Figure 1A).

Experiment 0b: Passive version

The outcomes of the 4 sample trials are presented all at once at the beginning of each game (omitting the key presses of Experiment 0), and participants then go directly to their free choices. They make their free decisions using the arrow keys (Figure 1B).

Experiment 1: Delayed version

As in Experiment 0b, the outcomes of the 4 sample trials are presented at the beginning of each game; however, a 3s delay is imposed before participants are allowed to make their 1st decision. They make their free choices using the arrow keys (Figure 1C).

Experiment 2: Sequential version

During the sample trials, one of the two boxes is highlighted at each choice point. The outcome is revealed without a key press after a delay calculated from the median reaction times of previous participants who took part in Experiment 0. After the sample trials are over, on the 5th trial, both boxes are highlighted and participants use the arrow keys to make their free choices (Figure 1D).
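A minimal sketch of how the per-trial delays could be derived from Experiment 0 reaction times is shown below; pooling the median over participants is one reasonable reading of the procedure, and the column names are hypothetical.

```python
# Sketch: per-sample-trial delays from active-condition reaction times
# (hypothetical columns: 'subject', 'sample_trial', 'rt' in seconds).
import pandas as pd

def sequential_delays(active_rts: pd.DataFrame) -> pd.Series:
    """Median reaction time at each of the 4 sample trials, pooled over
    participants from the active condition (Experiment 0)."""
    return active_rts.groupby("sample_trial")["rt"].median()

# Example with made-up reaction times from two participants
df = pd.DataFrame({
    "subject":      [1, 1, 1, 1, 2, 2, 2, 2],
    "sample_trial": [1, 2, 3, 4, 1, 2, 3, 4],
    "rt":           [0.9, 0.7, 0.6, 0.6, 1.1, 0.8, 0.7, 0.5],
})
print(sequential_delays(df))  # one delay per sample-trial position
```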

Experiment 3: Space bar version

During the sample trials, one of the two boxes is highlighted at each choice point; instead of using the arrow keys, participants press a neutral key, the space bar, to reveal the outcome. After the sample trials are over, on the 5th trial, both boxes are highlighted and participants use the arrow keys to make their free choices (Figure 1E).

Experiment 3b: Space bar version (wide spread)

This version is the same as Experiment 3, except that the two bandits are placed at the two ends of the computer screen instead of side by side in the middle of the screen. This allows better precision in acquiring gaze data. Behavior from this version is analyzed together with Experiment 3.

Experiment 4: Left hand version

During the sample trials, one of the two boxes is highlighted at each choice point; instead of using the arrow keys, participants are instructed to press a different pair of keys with their left hand, i.e. "v" for left and "b" for right, to reveal the outcome. After the sample trials are over, on the 5th trial, both boxes are highlighted and participants use the arrow keys to make their free choices (Figure 1F).

Experiment 5: Interleaved version

In this experiment, active games and passive games are interleaved.

Gaze Data

Gaze data are recorded using an EyeTribe eye tracker. Gaze samples recorded when pupil diameter is 0 (i.e., the participant's eyes are closed) are excluded, as are gaze locations that fall outside the bounds of the screen.
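A minimal sketch of this gaze-cleaning step is shown below, assuming raw samples in a table with pupil diameter and x/y screen coordinates; the column names and the monitor resolution are assumptions for illustration.

```python
# Sketch of the gaze-sample exclusion rules (hypothetical column names).
import pandas as pd

SCREEN_W, SCREEN_H = 1920, 1080   # assumed monitor resolution in pixels

def clean_gaze(samples: pd.DataFrame) -> pd.DataFrame:
    """Drop samples where the eyes are closed (pupil diameter 0) or where
    the reported gaze location falls outside the monitor."""
    eyes_open = samples["pupil_diameter"] > 0
    on_screen = (samples["x"].between(0, SCREEN_W)
                 & samples["y"].between(0, SCREEN_H))
    return samples[eyes_open & on_screen]
```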

Results

In the Horizon Task, random exploration is quantified in a model-free way as the probability of choosing the low-mean option, p(low mean), in the equal, or [2 2], condition, whereas directed exploration is measured as the probability of choosing the more informative option, p(high info), in the unequal, or [1 3], condition. In the original task (active version), participants showed that both directed and random exploration increase as a function of horizon (Wilson et al., 2014); see Figure 2A. However, Sadeghiyeh et al. (2018) showed that people no longer show horizon-adaptive directed and random exploration in the passive version of the task (reproduced from Sadeghiyeh et al. (2018), Figure 2A). This finding is surprising because people receive exactly the same information and, on the face of it, should not behave any differently. In this work, we try to pin down what factor drives this behavioral difference in horizon-adaptive exploration between the active and passive conditions.
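These two model-free measures can be computed directly from the first free choice of each game. The sketch below assumes a per-game table with hypothetical column names; it is not the original analysis code.

```python
# Sketch: model-free directed and random exploration, by horizon.
import pandas as pd

def model_free_exploration(first_choices: pd.DataFrame) -> pd.Series:
    """first_choices: one row per game (first free choice only), with columns
    'info_condition' ('[1 3]' or '[2 2]'), 'horizon' (1 or 6),
    'chose_low_mean' and 'chose_high_info' (0/1 flags; the latter is only
    meaningful in the unequal condition)."""
    unequal = first_choices["info_condition"] == "[1 3]"
    equal = first_choices["info_condition"] == "[2 2]"
    # directed exploration: p(high info) in the [1 3] condition, by horizon
    p_high_info = first_choices[unequal].groupby("horizon")["chose_high_info"].mean()
    # random exploration: p(low mean) in the [2 2] condition, by horizon
    p_low_mean = first_choices[equal].groupby("horizon")["chose_low_mean"].mean()
    return pd.concat({"p_high_info": p_high_info, "p_low_mean": p_low_mean})
```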

Figure 2: Directed and random exploration in different versions of the Horizon Task.

Figure 3: Directed exploration, random exploration, and reaction time as a function of the difference in reward points between the two options.

Figure 4: Accuracy, reaction time, directed and random exploration, and percentage of choosing the alternative option from the previous trial.

Do people not have enough time in the passive version?

In the active version of the task, people have to press arrow keys to go through the 4 sample trials; in the passive version, however, people are presented with the information from all 4 sample trials at once. Are people simply overwhelmed by all the information at once, without enough time to process it, in the passive condition? To answer this question, a separate group of participants completed the delayed version of the Horizon Task (Figure 1C), in which we forced a 3s delay before participants could make the first choice. In the delayed condition, compared to the passive condition, people are less ambiguity averse and choose the uncertain option 46.18% of the time on average (as opposed to 38.99% in the passive condition). There is also a decrease in overall error rates (from 24.00% to 16.41%). There is also a slight increase in directed exploration, but it is not comparable to the level of increase in the active condition (Figure 2B), so timing is not the whole story.

Does sequential presentation of the outcomes in the sample trials matter?

The other obvious thing that is missing in the passive condition compared to the active condition is that the 4 pieces of sample-trial information are no longer presented sequentially but instead appear all at once. To test whether sequential presentation is critical, we calculated the median reaction time at each sample trial from previous participants in the active condition, and designed the sequential version of the Horizon Task in which the 4 sample trials are presented one after another, separated by delays equal to these median reaction times. Since participants in this version are not short of time, as in the delayed version, they are not ambiguity averse and their overall error rates are comparable with those in the active condition. However, we did not see significant increases in directed or random exploration (Figure 2C). So sequential presentation alone does not guarantee horizon-adaptive exploration.

Does self-paced timing matter?

In the sequential version, participants saw a movie of the sample trials, but they did not have control over the speed at which the movie played. We therefore asked: does self-paced timing matter? In the space bar version, during the sample trials, participants use a neutral key, the space bar, instead of the arrow keys to reveal the outcome. Here we see a significant increase in both directed and random exploration. However, the magnitude is still not comparable with the active condition (Figures 2D, 3D); the effect size is only about half as large as in the active condition. Furthermore, in the active condition, 69.12% of participants showed an increase in directed exploration and 75.00% showed an increase in random exploration, whereas in the space bar condition the corresponding figures are only 55.42% and 57.83%.
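The participant-level percentages quoted above are simply the fraction of participants whose exploration measure increases from horizon 1 to horizon 6; a minimal sketch, with made-up numbers:

```python
# Sketch: percentage of participants showing a horizon-dependent increase.
import numpy as np

def percent_increasing(horizon1, horizon6):
    """Percentage of participants whose exploration measure (e.g. p(high info)
    or p(low mean)) is larger in horizon 6 than in horizon 1."""
    h1, h6 = np.asarray(horizon1), np.asarray(horizon6)
    return 100.0 * np.mean(h6 > h1)

# Example: per-participant p(high info) in each horizon (made-up numbers)
print(percent_increasing([0.4, 0.5, 0.3], [0.6, 0.45, 0.5]))  # ~66.7
```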

Figure 5: Percentage of participants showing a horizon-dependent increase in directed and random exploration.

In addition, there are important reaction time differences between the space bar and the active condition. In the space bar condition, participants are significantly slower in the long horizon context than in the short horizon context, which we do not see in the active condition (Figure 3D). Participants are also significantly slower than in the active condition on the first free choice (Figure 4D); this is probably due to a hand/key switch, as they have to switch to their right hand to press the arrow keys. This suggests that people are still not processing information in the same way in the space bar condition as in the active condition.

Should it be two keys instead of the same key?

In the space bar condition, participants press the same key to reveal either bandit, whereas in the active condition they use two different keys. If only 1 key is used, participants do not need to pay attention to which side is cued in order to see the outcome. To test whether it makes a difference when participants have to press 2 different keys, we instructed participants to use a different pair of keys ("v" and "b") during the sample trials. In this left hand version, we see a significant increase in both directed and random exploration, and the magnitude is much closer to the active condition (Figure 5). As in the space bar condition, reaction time in horizon 6 is significantly longer than reaction time in horizon 1.

Interleaved active and passive condition

One important question is whether the active vs passive difference is only present when the conditions are blocked. To test this, we ran a version in which active and passive games are interleaved. When active and passive games are interleaved, we do see the horizon-dependent change in the passive condition: the increases in both directed and random exploration are at the same level as in the active condition. We still see significant ambiguity aversion in the passive condition because of the lack of time. Also, from the analysis of later trials, we can see that the active and passive conditions differ only on the first free choice and become the same from the 6th trial in the long horizon context, which we do not see for reaction time in the blocked version. Also, for the interleaved active vs passive dataset, we see a trend towards decreasing reaction time in the long horizon context, which is the opposite of what we observe in the space bar and left hand conditions.

Figure 6: Directed and random exploration in the interleaved version.

Figure 7: Directed and random exploration in the interleaved version.

Figure 8: Directed and random exploration in the interleaved version.

Discussion

In this paper, we examined the behavioral difference in horizon-adaptive exploration between the active condition of the Horizon Task (Somerville et al., 2017, Warren et al., 2017, Wilson et al., 2014, Zajkowski et al., 2017) and the passive condition of the Horizon Task (Sadeghiyeh et al., 2018). To figure out what kills the horizon-adaptive exploration in the passive condition, we performed a series of follow-up experiments in which: (a) a forced delay was added before participants could make the first choice, (b) the outcomes of sample plays were shown to people sequentially without a key press, (c) outcomes of sample plays were revealed upon pressing a neutral key (space bar) instead of the arrow keys, (d) outcomes of sample plays were revealed upon pressing a different set of keys ("v" and "b"), and (e) the active and passive versions were played in an interleaved instead of blocked setup. We found that (a) the delay is strongly related to ambiguity aversion but contributes little to the horizon dependence of exploration, (b) sequential presentation alone also contributes little to the horizon dependence of exploration, and (c/d) when participants have a control button for the sample trials (either the space bar or a pair of different keys, "v" and "b"), they show some level of horizon-adaptive exploration, but not at the level of the active condition. Finally, by interleaving active and passive games, we almost completely recover horizon-adaptive exploration in both the active and the passive condition. First, we know that participants are equally engaged in both versions of the explore-exploit task we use: across all these variations, participants do not differ significantly in the percentage of trials on which they pick the best bandit. The difference between the active and passive conditions is therefore not due to a few bad subjects, but instead reflects a more complicated picture of how explore-exploit decisions are made.

One way of viewing the contrast between the active and passive conditions is as experienced vs described learning. A similar contrast is observed for described vs experienced gambles: when choosing between gambles that are explicitly described, participants are risk averse for gains and risk seeking for losses (Tversky and Kahneman, 1973), whereas when the same gambles are not explicitly described but learned from experience, the pattern reverses and participants become risk seeking for gains and risk averse for losses (Hertwig et al., 2004). In our study, it is clear that timing has a large impact on overall ambiguity aversion regardless of horizon context. If the contrast we see here is indeed one of described vs experienced learning, then our results imply that timing may also affect experienced vs described gambling decisions. Our interleaved experiment shows that the difference is not just about the task setting; the participant's mindset also has a big impact, since the same task yields different results when it is blocked vs when it is interleaved. In the interleaved condition, one potential account is that people simulate the active condition while playing the passive condition. For the space bar condition, the looking strategy differs from the active or left hand condition, since in the latter two participants need to look at which side the cue appears in order to know which key to press. This difference may contribute to the diminished increase in directed and random exploration in the space bar condition. In addition, we observe an increase in decision time in the long horizon context in the space bar and left hand conditions that we do not see in the active or interleaved passive conditions; this delay may also contribute to the reduced scale of directed and random exploration in those tasks compared to the active condition.

References

Ralph Hertwig, Greg Barron, Elke U. Weber, and Ido Erev. Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15(8):534–539, 2004. doi: 10.1111/j.0956-7976.2004.00715.x.

Hashem Sadeghiyeh, Siyu Wang, and Robert C Wilson. Lessons from a "failed" replication: The importance of taking action in exploration. PsyArXiv, 2018.

Leah H Somerville, Stephanie F Sasse, Megan C Garrad, Andrew T Drysdale, Nadine Abi Akar, Catherine Insel, and Robert C Wilson. Charting the expansion of strategic exploratory behavior during adolescence. Journal of Experimental Psychology: General, 146(2):155–164, 2017. doi: 10.1037/xge0000250.

Amos Tversky and Daniel Kahneman. Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2):207–232, 1973. doi: 10.1016/0010-0285(73)90033-9.

Christopher M. Warren, Robert C. Wilson, Nic J. van der Wee, Eric J. Giltay, Martijn S. van Noorden, Jonathan D. Cohen, and Sander Nieuwenhuis. The effect of atomoxetine on random and directed exploration in humans. PLOS ONE, 12(4):e0176034, 2017. doi: 10.1371/journal.pone.0176034.

R. C. Wilson, A. Geana, J. M. White, E. A. Ludvig, and J. D. Cohen. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General, 143(6):2074–2081, 2014.

Wojciech K. Zajkowski, Malgorzata Kossut, and Robert C. Wilson. A causal role for right frontopolar cortex in directed, but not random, exploration. eLife, 6:e27430, 2017. doi: 10.7554/eLife.27430.
