Cognitive Psychology 125 (2021) 101360


Cognitive Psychology

journal homepage: www.elsevier.com/locate/cogpsych

Data-driven experimental design and model development using Gaussian process with active learning

Jorge Chang a,*, Jiseob Kim b, Byoung-Tak Zhang b, Mark A. Pitt a, Jay I. Myung a

a Department of Psychology, The Ohio State University, Columbus, OH 43210, USA
b School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Republic of Korea

ARTICLE INFO

Keywords:
Computational cognition
Data-driven cognitive modeling
Gaussian process
Active learning
Optimal experimental design
Delay discounting
Nonparametric Bayesian methods

ABSTRACT

Interest in computational modeling of cognition and behavior continues to grow. To be most productive, modelers should be equipped with tools that ensure optimal efficiency in data collection and in the integrity of inference about the phenomenon of interest. Traditionally, models in psychology have been parametric, which are particularly susceptible to model misspecification because their strong assumptions (e.g., parameterization, functional form) may introduce unjustified biases in data collection and inference. To address this issue, we propose a data-driven nonparametric framework for model development, one that also includes optimal experimental design as a goal. It combines Gaussian Processes, a stochastic process often used for regression and classification, with active learning, from machine learning, to iteratively fit the model and use it to optimize the design selection throughout the experiment. The approach, dubbed Gaussian process with active learning (GPAL), is an extension of the parametric, adaptive design optimization (ADO) framework (Cavagnaro, Myung, Pitt, & Kujala, 2010). We demonstrate the application and features of GPAL in a delay discounting task and compare its performance to ADO in two experiments. The results show that GPAL is a viable modeling framework that is noteworthy for its high sensitivity to individual differences, identifying novel patterns in the data that were missed by the model-constrained ADO. This investigation represents a first step towards the development of a data-driven cognitive modeling framework that serves as a middle ground between raw data, which can be difficult to interpret, and parametric models, which rely on strong assumptions.

1. Introduction

Experimentation is at the core of scientific research, whether one is interested in understanding how people trade off between small but immediate rewards and larger but delayed rewards in delay discounting or the neural basis of cognitive control in visual search. Advancement in empirical research depends critically on the collection of high-quality, informative data, from which one can draw inferences with confidence about the phenomenon under study. A challenge faced by researchers is that experiments can be difficult to design because the consequences of design decisions (e.g., stimulus values, task settings, and testing schedule) are not known prior to data collection. Efficiency of data collection also matters, especially when experiments are costly to perform in terms of time, money, and availability of participants, such as in brain imaging experiments, research with infants, and clinical research. Ideally, one strives to design experiments that yield the most informative data in order to achieve the experimental objective with the fewest observations (trials) possible.

Recent developments in statistical computing offer algorithm-based ways to achieve these goals. Specifically, computational methods of optimal experimental design (OED; Atkinson & Donev, 1992; Lindley, 1956) in Bayesian statistics can assist in improving scientific inference by efficiently searching the design space to identify the combination of design variables and parameters that are likely to be most informative trial after trial, making the experiment efficient. Concretely, in an optimized adaptive experiment, the values of design variables (e.g., reward amounts and time delays in a delay discounting experiment) are not predetermined but instead are computed iteratively, trial by trial, to be optimal in an information-theoretic sense by real-time analysis of participant responses from earlier trials. With a newly made observation using the optimal design, the adaptive process then repeats in the next trial. This is unlike traditional approaches in which experimental designs are fixed for all participants or vary across trials using a heuristic decision rule, such as the staircase method in adaptive threshold estimation or psychometric function estimation (e.g., Garcia-Perez, 1998).

Adaptive design optimization (ADO; Cavagnaro, Myung, Pitt, & Kujala, 2010; Myung, Cavagnaro, & Pitt, 2013) was developed as an OED framework for behavioral experiments and derives from Bayesian experimental design (Chaloner & Verdinelli, 1995) and active learning1 in machine learning (Cohn, Ghahramani, & Jordan, 1996). ADO is a general-purpose, model-based algorithm that exploits the predictions of a computational model of task performance to guide design selection on each trial in an adaptive manner. The top two panels in Fig. 1 illustrate the difference between a traditional experiment and an ADO-based experiment. A growing body of work is showing that ADO can significantly improve the informativeness and efficiency of data collection (e.g., Cavagnaro, Pitt, Gonzalez, & Myung, 2013; Cavagnaro, Pitt, & Myung, 2011; Gu et al., 2016).

One limitation of ADO, however, is the technical requirement that the assumed model is correctly specified, in that the modeling scheme represents the true data-generating model (e.g., the hyperbolic model is the most accurate description of the rate at which people discount future rewards). This assumption is unlikely to hold true in practice because all models are imperfect approximations of the underlying mental process under study. To the extent that the parametric modeling assumption is violated, ADO would be sub-optimal and not as efficient as it could be. In short, ADO is not robust with respect to the inaccuracies and uncertainties about the underlying system. If a model is wrong, it is considered misspecified, which can make the results from ADO experiments misleading. One way to address and resolve the poor robustness of ADO is to drop its parametric modeling requirement and adopt a nonparametric (i.e., data-driven) approach.
In parametric modeling, observations are assumed to be generated from some unknown (to be inferred) parameterized form of the model equation. In nonparametric modeling, on the other hand, the target model is inferred directly from the data collected in the experiment, without constraining it to a specific parametric family of functional forms. A nonparametric model, therefore, is highly flexible, containing virtually all possible functional forms (linear, nonlinear, cyclic, etc.) for describing any data pattern, and is, in a sense, a parametric model with a theoretically infinite number of parameters in which the number of parameters grows with the amount of data. Nonparametric modeling via optimal experimental design is the focus of the present work.

In this paper, we propose a data-driven approach to optimal experimental design (OED). It uses Gaussian Processes (GPs), a nonparametric Bayesian method that places priors over functions. GPs are a popular modeling tool in machine learning for regression and classification tasks (Rasmussen & Williams, 2006). Recently, researchers in psychology have also explored the use of GPs to model human behavior (e.g., Cox, Kachergis, & Shiffrin, 2012; Griffiths, Lucas, Williams, & Kalish, 2009; Schulz, Speekenbrink, & Krause, 2018; Song, Sukesan, & Barbour, 2018). Among these, the work that is most closely related to the present study is Schulz et al. (2018), who discuss a way of combining GPs with active learning into a unified OED framework (pp. 9–10).2 Here, we further develop this idea and apply it to a behavioral task. We refer to this framework as Gaussian Process with active learning (GPAL). GPAL is capable of simultaneously modeling the underlying function that generated the data while optimizing the experimental design to model that function efficiently. The GPAL algorithm begins with a rough approximation (i.e., prior) of the (initially unknown) data-generating model, and then continually updates and refines its approximation via Bayes rule after each observation in the experiment is collected.

GPAL, while similar to ADO, eliminates the need for a parameterized model, and instead seeks to learn the model entirely from data, without potentially misleading assumptions about its functional form. This data-driven model learning step of GPAL, illustrated in the bottom panel of Fig. 1, sets it apart from ADO, and thus GPAL may be viewed as a "model-free" version of ADO. The virtually unlimited flexibility of GPAL allows it to capture a much wider range of data patterns than ADO, thus showing higher sensitivity to individual differences. Further, and importantly, the usefulness of GPAL goes beyond its use as a design optimization tool. GPAL also lends itself naturally to a second role: a tool for model development that can assist in building "robust models" in cognitive science (Lee et al., 2019). Given that the process of creating a new model from scratch can be quite a daunting task (Shiffrin & Nobel, 1997), the data-driven approach of GPAL makes it attractive in the early, exploratory stages of model development, because one can be confident in the fidelity (accuracy, representativeness) of the data collected.
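To preview how the GPAL cycle operates before the technical treatment in Section 2, the following sketch pairs a simple GP regression model with an uncertainty-sampling heuristic: on each trial, the design whose predicted value is currently most uncertain is presented, a (here, simulated) response is observed, and the GP is refit. The discounting-like target function, the kernel hyperparameters, and the uncertainty-sampling criterion are our own illustrative assumptions rather than the specific utility function developed later in the paper.

```python
# A minimal sketch, using numpy, of a GPAL-style adaptive loop: fit a GP to the
# data collected so far, then query the design where the GP is most uncertain.
# All numerical values and the simulated participant are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(a, b, length=10.0, var=1.0):
    """Squared exponential kernel over 1-D designs (assumed hyperparameters)."""
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=0.05):
    """Posterior mean and variance of the GP at x_new given the observations."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(x_obs.size)
    K_s = rbf_kernel(x_obs, x_new)
    sol = np.linalg.solve(K, K_s)
    mean = sol.T @ y_obs
    var = np.diag(rbf_kernel(x_new, x_new) - K_s.T @ sol)
    return mean, var

true_f = lambda delay: 1.0 / (1.0 + 0.05 * delay)   # simulated participant
designs = np.linspace(0.0, 100.0, 101)              # candidate delays (days)
x_obs, y_obs = np.array([0.0]), np.array([true_f(0.0)])

for trial in range(20):
    _, var = gp_posterior(x_obs, y_obs, designs)
    d_star = designs[int(np.argmax(var))]           # most uncertain design
    y = true_f(d_star) + rng.normal(0.0, 0.02)      # noisy observation
    x_obs, y_obs = np.append(x_obs, d_star), np.append(y_obs, y)

print("designs presented:", np.round(x_obs, 1))
```

Section 2.2 replaces this heuristic with the GP machinery and utility computations that GPAL actually uses.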
As we demonstrate in this study, these two complementary attributes of GPAL, an experimental design tool and a model development tool, work hand in hand to identify optimal designs on the one hand and to extract the underlying regularities in the data on the other, both being carried out simultaneously in a unitary and model-free manner.

The remainder of the paper is organized as follows. In Section 2, we provide a brief overview of OED in general and ADO in particular as background to contrast with GPAL, and then describe the statistical foundations of GPAL. In Section 3, we report results from an example application of GPAL to two delay discounting experiments, including a discussion of two novel behavioral patterns observed that are unaccounted for by most parametric models. In Section 4, we present our vision for GPAL as a general-purpose, flexible tool for data-driven model development in cognitive modeling. To this end, we demonstrate the potential and usefulness of the tool in the domain of delay discounting. In particular, we propose plausible extensions to the traditional hyperbolic model to accommodate the novel behavior. We close the paper with a discussion of a few outstanding technical issues.

1 Not to be confused with active learning in the context of educational psychology.
2 We conceived and developed our approach unaware of Schulz et al. (2018).

Fig. 1. Schematic illustration of three approaches to experimentation. In a traditional paradigm, the experimental design tends to be fixed and not updated based on participant responses. In the ADO paradigm, design selection is revised based on participant responses. Design informativeness is updated on each trial via Bayes rule, considering the specified parametric model and a utility function. GPAL takes ADO one step further by being model-free at the outset of the experiment, learning the model through optimized design choices.

2. Methods of optimal experimental design

2.1. Adaptive Design Optimization (ADO)

ADO is a model-based, and thus parametric, framework for Bayesian optimal adaptive experimentation (Chaloner & Verdinelli, 1995) that can be used for parameter estimation of a single model as well as for model discrimination among a set of multiple models. As illustrated in the middle panel of Fig. 1, ADO consists of three iterative steps that repeat on each trial of an experiment: (1) Design optimization (finding the optimal design given the prior that summarizes the current state of knowledge about model parameters or models); (2) Experiment (conducting the experiment with the optimal design and subsequently obtaining an observation); and (3) Inference (combining the observed response with the prior to form a posterior, via Bayes rule, which becomes the new prior for the next iteration).

Specifically, the design optimization step of ADO involves identifying the optimal design d* that maximizes a real-valued function that quantifies the utility, or usefulness, of design d. This function, denoted U(d) and dubbed the "global" utility function, is defined for the problem of parameter estimation of a model as follows3:

$$U(d) = \iint u(d, \theta, y)\, p(y \mid \theta, d)\, p(\theta)\, dy\, d\theta. \tag{1}$$

In this equation, p(y|θ, d) is the likelihood function of the model, p(θ) is the prior distribution of model parameters, and u(d, θ, y) is the "local" utility function, which represents the utility of a hypothetical experiment with design d in which the model generates an outcome y from its parameter values θ. It is worth noting that the global utility U(d) is nothing but the mean local utility, obtained by averaging u(d, θ, y) over the outcome variable y and the model parameter θ, weighted by p(y|θ, d) and p(θ), respectively.

A second pillar of ADO, besides the Bayesian framework, is its information-theoretic foundation. Given a particular form of the local utility function, namely $u(d, \theta, y) = \log \frac{p(\theta \mid y, d)}{p(\theta)}$, the global utility function U(d) in Eq. (1) reduces to the following form, which leads to an information-theoretic interpretation:

$$U(d) = H(\Theta) - H(\Theta \mid Y(d)). \tag{2}$$
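To make Eq. (1) concrete, the sketch below evaluates the global utility of a single candidate design on a discrete parameter grid, using the log posterior-to-prior ratio as the local utility. The hyperbolic discounting model, the logistic choice rule, the parameter grid, and the design values are our own illustrative assumptions, not specifications from the paper.

```python
# A minimal sketch of Eq. (1) on a parameter grid, assuming a hyperbolic
# discounting model with a Bernoulli (choose the larger-later option) response.
import numpy as np

k_grid = np.linspace(0.001, 0.5, 200)          # candidate discount rates (theta)
prior = np.ones_like(k_grid) / k_grid.size     # uniform prior p(theta)

def p_choose_later(k, design):
    """P(choose larger-later) under a hypothetical hyperbolic/logistic rule."""
    amt_soon, amt_late, delay = design
    v_late = amt_late / (1.0 + k * delay)      # hyperbolically discounted value
    return 1.0 / (1.0 + np.exp(-(v_late - amt_soon)))

def global_utility(design):
    """U(d): expectation of u(d, theta, y) = log[p(theta|y,d) / p(theta)]."""
    lik_later = p_choose_later(k_grid, design)          # p(y=1 | theta, d)
    utility = 0.0
    for lik in (lik_later, 1.0 - lik_later):            # both outcomes y
        marginal = np.sum(lik * prior)                   # p(y | d)
        posterior = lik * prior / marginal               # p(theta | y, d)
        local_u = np.log(posterior / prior)              # local utility
        utility += np.sum(lik * prior * local_u)         # weight by p(y|theta,d) p(theta)
    return utility

# Utility of offering $10 now versus $20 after a 30-day delay (hypothetical).
print(global_utility((10.0, 20.0, 30.0)))
```

Because the local utility here is the log posterior-to-prior ratio, the weighted sum is exactly the mutual information form of Eq. (2).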

3 A similar form of the global utility function can be defined for ADO-based model discrimination with a set of models. See Cavagnaro et al. (2010) and Myung et al. (2013).


The right-hand side of the above expression is known in information theory as the mutual information between the outcome random variable Y(d) and the parameter random variable Θ (Cover & Thomas, 1991). The mutual information is defined as the difference between two entropies: the marginal entropy H(Θ), representing the overall uncertainty about the model parameter, and the conditional entropy H(Θ|Y(d)), representing the reduced uncertainty about the parameter given knowledge of the outcome event Y(d). Accordingly, from the information-theoretic standpoint, the optimal design d* that maximizes U(d) is the one expected to maximally reduce the uncertainty about the model parameter once an outcome is observed, or, put another way, the "most informative" design. For a detailed, more technical explanation of the ADO framework we direct readers to Myung et al. (2013).

The optimal design d* with the highest value of U(d) in Eq. (1) must be sought on each trial of an ADO-based experiment. Solving the equation, however, can be non-trivial, as it involves a high-dimensional integral that must be approximated numerically. To improve the accessibility of ADO to cognitive scientists at large, our lab has recently released an open-source Python package, ADOpy, that implements ADO using high-level, semantic-based commands. The package is available on GitHub (https://github.com/adopy), and an accompanying manuscript (Yang, Pitt, Ahn, & Myung, 2020) provides an overview of ADOpy and demonstrates its use with application examples in psychophysical, delay discounting, and risky choice tasks.
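The per-trial cycle described above can be sketched as follows: compute U(d), in its mutual information form of Eq. (2), for every candidate design, present the design with the highest utility, and update the posterior over parameters by Bayes rule given the observed response. As in the previous sketch, the grid-based hyperbolic model, the candidate designs, and the simulated participant are illustrative assumptions on our part; the ADOpy package provides a full, general implementation.

```python
# Hypothetical grid-based ADO loop: on each trial, choose the delay with the
# largest expected information gain (Eq. 2), observe a simulated choice, and
# update the parameter posterior by Bayes rule. Model values are assumptions.
import numpy as np

rng = np.random.default_rng(1)
k_grid = np.linspace(0.001, 0.5, 200)              # discount-rate grid
posterior = np.ones_like(k_grid) / k_grid.size     # starts as the prior

def p_later(k, delay, amt_soon=10.0, amt_late=20.0):
    """P(choose larger-later) under an assumed hyperbolic/logistic model."""
    return 1.0 / (1.0 + np.exp(-(amt_late / (1.0 + k * delay) - amt_soon)))

def mutual_information(delay, belief):
    """U(d) = H(Theta) - H(Theta | Y(d)), computed on the grid."""
    lik_later = p_later(k_grid, delay)
    info = 0.0
    for lik in (lik_later, 1.0 - lik_later):        # both possible responses
        p_y = np.sum(lik * belief)                  # p(y | d)
        post = lik * belief / p_y                   # p(theta | y, d)
        info += p_y * np.sum(post * np.log(post / belief))
    return info

candidate_delays = np.arange(1.0, 365.0, 7.0)       # designs to search over
true_k = 0.08                                       # simulated participant

for trial in range(15):
    utilities = [mutual_information(d, posterior) for d in candidate_delays]
    d_star = candidate_delays[int(np.argmax(utilities))]
    y = rng.random() < p_later(true_k, d_star)      # simulated response
    lik = p_later(k_grid, d_star) if y else 1.0 - p_later(k_grid, d_star)
    posterior = lik * posterior / np.sum(lik * posterior)   # Bayes update

print("posterior mean of k:", float(np.sum(k_grid * posterior)))
```

After a handful of adaptively chosen trials the posterior concentrates near the simulated discount rate; in a real experiment the simulated choice is replaced by the participant's response.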

2.2. Gaussian Process with Active Learning (GPAL)

GPAL is a nonparametric extension of ADO. In this section, we provide the background and technical details of the GPAL framework.

2.2.1. Gaussian process regression
Gaussian processes (GPs) were originally developed in the field of geostatistics for regression problems, where the technique was referred to as kriging (Matheron, 1960). While powerful, GPs were limited by the inference tools available at the time, making it hard to fit them to data. More recently, GPs gained a surge of interest after the introduction of approximate inference techniques such as MCMC (Neal, 1997) and are now ubiquitous in the machine learning literature.

GPs can be thought of as infinite-dimensional Gaussian distributions in which the covariance is determined by a kernel function, itself a function of the distance between two points. Thus, points that are closer together in the design space tend to be more strongly correlated than points that are far apart. Formally, a GP is a stochastic (random) process in which any subset of random variables has a multivariate Gaussian marginal distribution. For a set of observed value pairs (D, f) and a set of unobserved pairs (X, f̃), the joint posterior distribution under the GP is obtained as

$$\begin{bmatrix} f \\ \tilde{f} \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu_D \\ \mu_X \end{bmatrix}, \begin{bmatrix} K_{D,D} & K_{D,X} \\ K_{X,D} & K_{X,X} \end{bmatrix} \right) \tag{3}$$

In the above equation, each K is a submatrix whose elements are given by a kernel function that defines the covariance between points. In principle, any non-negative definite function can be used as a kernel, and this choice determines the properties of the resulting GP function. Thus, while we characterize this approach as "model-free," the kernel function does encode mild assumptions about the possible forms of the target function. This makes selecting an appropriate kernel a particularly important decision when building a GP model. The kernel function used in the present study is the squared exponential kernel, also commonly referred to as the radial basis function kernel:

$$K_{i,j} = \sigma^2 \exp\!\left( -\frac{\| x_i - x_j \|^2}{2 l^2} \right) \tag{4}$$

where l (> 0) is the length scale parameter that controls the smoothness of the function and σ² is the variance of the function around its mean. This kernel is a popular choice for relatively smooth functions and for applications that require a differentiable kernel. In contrast, kernel functions such as the Matérn kernel (Rasmussen & Williams, 2006, p. 84) offer additional flexibility but require significantly more trials to estimate the function.

The posterior in Eq. (3) can then be used to model f̃ using the conditional of the multivariate normal distribution. From here we can compute the expected posterior GP mean function, which we refer to as the GP model throughout this paper. This process is known as Gaussian process regression and is illustrated in Fig. 2. During inference, we use the maximum a posteriori estimates to optimize the parameters θ = (l, σ²). For the priors,4 we use a t-distribution with ν = 4 degrees of freedom for the parameter l, and a log-normal distribution with μ = 0 and σ² = 10 for the parameter σ².
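The conditioning step just described can be written in a few lines. The sketch below is a minimal illustration of Gaussian process regression with the squared exponential kernel of Eq. (4); the fixed hyperparameters and toy data are our own assumptions and stand in for the MAP estimation described above.

```python
# Minimal Gaussian process regression: condition the joint Gaussian of Eq. (3)
# on observed (x, f) pairs to obtain the posterior mean and variance at new
# inputs. Hyperparameters are fixed assumed values, not MAP estimates.
import numpy as np

def sq_exp_kernel(a, b, length=1.0, var=1.0):
    """Squared exponential (RBF) kernel, Eq. (4), for 1-D inputs."""
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_predict(x_obs, f_obs, x_new, noise=1e-4, length=1.0, var=1.0):
    K_dd = sq_exp_kernel(x_obs, x_obs, length, var) + noise * np.eye(x_obs.size)
    K_dx = sq_exp_kernel(x_obs, x_new, length, var)
    K_xx = sq_exp_kernel(x_new, x_new, length, var)
    alpha = np.linalg.solve(K_dd, K_dx)            # K_dd^{-1} K_dx
    mean = alpha.T @ f_obs                         # posterior mean (zero prior mean)
    cov = K_xx - K_dx.T @ alpha                    # posterior covariance
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

x_obs = np.array([0.0, 1.0, 3.0, 6.0])             # toy observed designs
f_obs = np.sin(x_obs)                              # toy observed function values
x_new = np.linspace(0.0, 6.0, 7)
mean, sd = gp_predict(x_obs, f_obs, x_new)
print(np.round(mean, 2), np.round(sd, 2))
```

Note that the posterior standard deviation shrinks to (nearly) zero at the observed inputs and grows between them, which is the uncertainty signal that active learning exploits.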

2.2.2. Gaussian process classification
In many tasks in cognitive science, such as delay discounting, it is difficult to observe f directly in human experiments. Instead, it is common practice to give participants choices that produce multinomial observations. In the case of delay discounting, participants are given two choice

4 Priors were selected based on simulations. The log-normal distribution was selected to promote smaller values of σ², which result in softer decision boundaries. Changing the parameter values of either distribution did not have a meaningful impact on the results.