
OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Handbook of Computational and Mathematical Psychology

EDITED BY JEROME R. BUSEMEYER, ZHENG WANG, JAMES T. TOWNSEND, AND AMI EIDELS

EDITOR-IN-CHIEF

Peter E. Nathan

AREA EDITORS

Clinical Psychology: David H. Barlow
Cognitive Neuroscience: Kevin N. Ochsner and Stephen M. Kosslyn
Cognitive Psychology: Daniel Reisberg
Counseling Psychology: Elizabeth M. Altmaier and Jo-Ida C. Hansen
Developmental Psychology: Philip David Zelazo
Health Psychology: Howard S. Friedman
History of Psychology: David B. Baker
Methods and Measurement: Todd D. Little
Neuropsychology: Kenneth M. Adams
Organizational Psychology: Steve W. J. Kozlowski
Personality and Social Psychology: Kay Deaux and Mark Snyder

Editor-in-Chief: Peter E. Nathan

The Oxford Handbook of Computational and Mathematical Psychology

Edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels


Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide.

Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016

© Oxford University Press 2015

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Oxford handbook of computational and mathematical psychology / edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels.
pages cm. – (Oxford library of psychology)
Includes bibliographical references and index.
ISBN 978-0-19-995799-6
1. Cognition. 2. Cognitive science. 3. Psychology–Mathematical models. 4. . I. Busemeyer, Jerome R.
BF311.O945 2015
150.1/51–dc23
2015002254

9 8 7 6 5 4 3 2 1
Printed in the United States of America on acid-free paper

Dedicated to the memory of Dr. William K. Estes (1919–2011) and Dr. R. Duncan Luce (1925–2012), two of the founders of modern mathematical psychology


SHORT CONTENTS

Oxford Library of Psychology ix

About the Editors xi

Contributors xiii

Table of Contents xvii

Chapters 1–390

Index 391


OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world’s oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.

Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level is a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact the field. Adding to its accessibility and ease of use, the Library will be published in print and, later on, electronically.

The Library surveys psychology’s principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. The initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology’s major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook’s table of contents and to offer informed anticipations of significant future developments in that field.

An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation’s and the world’s most productive and best-respected psychologists have agreed to edit Library handbooks or to write authoritative chapters in their areas of expertise.

For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested. Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.

In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology’s dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.

Peter E. Nathan Editor-in-Chief Oxford Library of Psychology

ABOUT THE EDITORS

Jerome R. Busemeyer is Provost Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include decision field theory and, more recently, pioneering the new field of quantum cognition.

Zheng Wang is Associate Professor at The Ohio State University and directs the Communication and Psychophysiology Lab. Much of her research tries to understand how our cognition, decision making, and communication are contextualized.

James T. Townsend is Distinguished Rudy Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include systems factorial technology and general recognition theory.

Ami Eidels is Senior Lecturer at the School of Psychology, University of Newcastle, Australia, and a principal investigator in the Newcastle Cognition Lab. His research focuses on human cognition, especially visual perception and attention, combined with computational and mathematical modeling.


CONTRIBUTORS

Daniel Algom
School of Psychological Sciences, Tel-Aviv University, Israel

F. Gregory Ashby
Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA

Joseph L. Austerweil
Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI

Scott D. Brown
School of Psychology, University of Newcastle, Callaghan, NSW, Australia

Jerome R. Busemeyer
Department of Psychological and Brain Sciences, Cognitive Science Program, Indiana University, Bloomington, IN

Amy H. Criss
Department of Psychology, Syracuse University, Syracuse, NY

Simon Dennis
Department of Psychology, The Ohio State University, Columbus, OH

Adele Diederich
Psychology, Jacobs University Bremen gGmbH, Bremen 28759, Germany

Chris Donkin
School of Psychology, University of New South Wales, Kensington, NSW, Australia

Ami Eidels
School of Psychology, University of Newcastle, Callaghan, NSW, Australia

Samuel J. Gershman
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA

Thomas L. Griffiths
Department of Psychology, University of California, Berkeley, Berkeley, CA

Todd M. Gureckis
Department of Psychology, New York University, New York, NY

Robert X. D. Hawkins
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN

Andrew Heathcote
School of Psychology, University of Newcastle, Callaghan, NSW, Australia

Marc W. Howard
Department of Psychological and Brain Sciences, Center for Memory and Brain, Boston University, Boston, MA

Brett Jefferson
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN

Michael N. Jones
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN

John K. Kruschke
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN

Yunfeng Li
Department of Psychological Sciences, Purdue University, West Lafayette, IN

Gordon D. Logan
Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN

Bradley C. Love
University College London, London, UK

Dora Matzke
Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands

Richard W. J. Neufeld
Departments of Psychology and Psychiatry, Neuroscience Program, University of Western Ontario, London, Ontario, Canada

Robert M. Nosofsky
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN

Thomas J. Palmeri
Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN

Zygmunt Pizlo
Department of Psychological Sciences, Purdue University, West Lafayette, IN

Timothy J. Pleskac
Center for Adaptive Rationality (ARC), Max Planck Institute for Human Development, Berlin, Germany

Emmanuel Pothos
Department of Psychology, City University London, London, UK

Babette Rae
School of Psychology, University of Newcastle, Callaghan, NSW, Australia

Roger Ratcliff
Department of Psychology, The Ohio State University, Columbus, OH

Tadamasa Sawada
Department of Psychology, Higher School of Economics, Moscow, Russia

Jeffrey D. Schall
Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN

Philip Smith
School of Psychological Sciences, The University of Melbourne, Parkville, VIC, Australia

Fabian A. Soto
Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA

Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA

James T. Townsend
Department of Psychological and Brain Sciences, Cognitive Science Program, Indiana University, Bloomington, IN

Joachim Vandekerckhove
Department of Cognitive Sciences, University of California, Irvine, Irvine, CA

Wolf Vanpaemel
Faculty of Psychology and Educational Sciences, University of Leuven, Leuven, Belgium

Eric-Jan Wagenmakers
Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands

Thomas S. Wallsten
Department of Psychology, University of Maryland, College Park, MD

Zheng Wang
School of Communication, Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH

Jon Willits
Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN


CONTENTS

Preface xix

1. Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology 1
   Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend

Part I • Elementary Cognitive Mechanisms

2. Multidimensional Signal Detection Theory 13
   F. Gregory Ashby and Fabian A. Soto
3. Modeling Simple Decisions and Applications Using a Diffusion Model 35
   Roger Ratcliff and Philip Smith
4. Features of Response Times: Identification of Cognitive Mechanisms through Mathematical Modeling 63
   Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend
5. Computational Reinforcement Learning 99
   Todd M. Gureckis and Bradley C. Love

Part II • Basic Cognitive Skills

6. Why Is Accurately Labeling Simple Magnitudes So Hard? A Past, Present, and Future Look at Simple Perceptual Judgment 121
   Chris Donkin, Babette Rae, Andrew Heathcote, and Scott D. Brown
7. An Exemplar-Based Random-Walk Model of Categorization and Recognition 142
   Robert M. Nosofsky and Thomas J. Palmeri
8. Models of Episodic Memory 165
   Amy H. Criss and Marc W. Howard

Part III • Higher Level Cognition

9. Structure and Flexibility in Bayesian Models of Cognition 187
   Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths
10. Models of Decision Making under Risk and Uncertainty 209
    Timothy J. Pleskac, Adele Diederich, and Thomas S. Wallsten
11. Models of Semantic Memory 232
    Michael N. Jones, Jon Willits, and Simon Dennis
12. Shape Perception 255
    Tadamasa Sawada, Yunfeng Li, and Zygmunt Pizlo

Part IV • New Directions

13. Bayesian Estimation in Hierarchical Models 279
    John K. Kruschke and Wolf Vanpaemel
14. Model Comparison and the Principle of Parsimony 300
    Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers
15. Neurocognitive Modeling of Perceptual Decision Making 320
    Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan
16. Mathematical and Computational Modeling in Clinical Psychology 341
    Richard W. J. Neufeld
17. Quantum Models of Cognition and Decision 369
    Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos

Index 391

PREFACE

Computational and mathematical psychology has enjoyed rapid growth over the past decade. Our vision for the Oxford Handbook of Computational and Mathematical Psychology is to invite and organize a set of chapters that review the most important developments, especially those that have impacted, and will continue to impact, other fields such as cognitive psychology, developmental psychology, clinical psychology, and neuroscience. Together with a group of dedicated authors, who are leading scientists in their areas, we believe we have realized our vision. Specifically, the chapters cover the key developments in elementary cognitive mechanisms (e.g., signal detection, information processing, reinforcement learning), basic cognitive skills (e.g., perceptual judgment, categorization, episodic memory), higher-level cognition (e.g., Bayesian cognition, decision making, semantic memory, shape perception), modeling tools (e.g., Bayesian estimation and other new model comparison methods), and emerging new directions (e.g., neurocognitive modeling, applications to clinical psychology, quantum cognition) in computational and mathematical psychology.

An important feature of this handbook is that it aims to engage readers with various levels of modeling experience. Each chapter is self-contained and written by authoritative figures in the topic area. Each chapter is designed to be a relatively applied introduction with a great emphasis on empirical examples (see the New Handbook of Mathematical Psychology (2014) by Batchelder, Colonius, Dzhafarov, and Myung for a more mathematically foundational and less applied presentation). Each chapter endeavors to immediately involve readers, inspire them to apply the introduced models to their own research interests, and refer them to more rigorous mathematical treatments when needed.

First, each chapter provides an elementary overview of the basic concepts, techniques, and models in the topic area.
Some chapters also offer a historical perspective of their area or approach. Second, each chapter emphasizes empirical applications of the models. Each chapter shows how the models are being used to understand human cognition and illustrates the use of the models in a tutorial manner. Third, each chapter strives to create engaging, precise, and lucid writing that inspires the use of the models. The chapters were written for a typical graduate student in virtually any area of psychology, cognitive science, and related social and behavioral sciences, such as consumer behavior and communication. We also expect it to be useful for readers ranging from advanced undergraduate students to experienced faculty members and researchers. Beyond being a handy reference book, it should be beneficial as

a textbook for self-teaching, and for graduate-level (or advanced undergraduate-level) courses in computational and mathematical psychology.

We would like to thank all the authors for their excellent contributions. We also thank the following scholars, who helped review the book chapters in addition to the editors (listed alphabetically): Woo-Young Ahn, Greg Ashby, Scott Brown, Cody Cooper, Amy Criss, Adele Diederich, Chris Donkin, Yehiam Eldad, Pegah Fakhari, Birte Forstmann, Tom Griffiths, Andrew Heathcote, Alex Hedstrom, Joseph Houpt, Marc Howard, Matt Irwin, Mike Jones, John Kruschke, Peter Kvam, Bradley Love, Dora Matzke, Jay Myung, Robert Nosofsky, Tim Pleskac, Emmanuel Pothos, Noah Silbert, Tyler Solloway, Fabian Soto, Jennifer Trueblood, Joachim Vandekerckhove, Wolf Vanpaemel, Eric-Jan Wagenmakers, and Paul Williams. The authors’ and reviewers’ efforts ensure our confidence in the high quality of this handbook.

Finally, we would like to express how much we appreciate the outstanding assistance and guidance provided by our editorial and production teams at Oxford University Press. The hard work of Joan Bossert, Louis Gulino, Anne Dellinger, A. Joseph Lurdu Antoine, the production team of Newgen Knowledge Works Pvt. Ltd., and others at Oxford University Press was essential for the development of this handbook. It has been a true pleasure working with this team!

Jerome R. Busemeyer Zheng Wang James T. Townsend Ami Eidels December 16, 2014

CHAPTER 1

Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology

Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend

Abstract

Computational and mathematical models of psychology all use some common mathematical functions and principles. This chapter provides a brief overview.

Key Words: mathematical functions, derivatives and integrals, probability theory, expectations, maximum likelihood estimation

We have three ways to build theories to explain and predict how variables interact and relate to each other in psychological phenomena: using natural verbal languages, using formal mathematics, and using computational methods. Human intuitive and verbal reasoning has many limitations. For example, Hintzman (1991) summarized at least 10 critical limitations, including our inability to imagine how a dynamic system works. Formal models, including both mathematical and computational models, can address these limitations of human reasoning.

Mathematics is a “radically empirical” science (Suppes, 1984, p. 78), with consistent and rigorous evidence (the proof) that is “presented with a completeness not characteristic of any other area of science” (p. 78). Mathematical models can help avoid the logic and reasoning errors that are typically encountered in human verbal reasoning. The complexity of theorizing and data often requires the aid of computers and computational languages. Computational models and mathematical models can be thought of as a continuum of a theorizing process. Every computational model is based on a certain mathematical theory, and almost every mathematical model can be implemented as a computational model. Psychological theories may start as a verbal description, which can then be formalized using mathematical language and subsequently coded into computational language. By testing the models using empirical data, the model-fitting outcomes can provide feedback to improve the models, as well as our initial understanding and verbal descriptions.

For readers who are newcomers to this exciting field, this chapter provides a review of basic concepts of mathematics, probability, and statistics used in computational and mathematical modeling of psychological representations, mechanisms, and processes. See Busemeyer and Diederich (2010) and Lewandowsky and Farrell (2010) for more detailed presentations.

Mathematical Functions

Mathematical functions are used to map a set of points called the domain of the function into a set of points called the range of the function, such that only one point in the range is assigned to each point in the domain.¹ As a simple example, the linear function is defined as f(x) = a·x, where the constant a is the slope of a straight line. In general, we use the notation f(x) to represent a function f that maps a domain point x into a range point y = f(x). If a function f(x) has the property that each range point y can be reached only by a single unique domain point x, then we can define the inverse function f^(-1)(y) = x that maps each range point y = f(x) back to the corresponding domain point x. For example, the quadratic function is defined as the map f(x) = x^2 = x·x, and if we pick the number x = 3.5, then f(3.5) = 3.5^2 = 12.25. The quadratic function is defined on a domain of both positive and negative real numbers, and it does not have an inverse because, for example, (−x)^2 = x^2, so there are two ways to get back from each range point y to the domain. However, if we restrict the domain to the non-negative real numbers, then the inverse of x^2 exists: it is the square root function defined on the non-negative real numbers, x = √y. There are, of course, a large number of functions used in mathematical psychology, but some of the most popular ones include the following.

The power function is denoted x^a, where the variable x is a positive real number and the constant a is called the power. A quadratic function can be obtained by setting a = 2, but we could instead choose a = 0.50, which gives the square root function x^0.50 = √x, or a = −1, which produces the reciprocal x^(−1) = 1/x, or any real number such as a = 1.37. Using a calculator, one finds that if x = 15.25 and a = 1.37, then 15.25^1.37 = 41.79. Important properties to remember about power functions are that x^a · x^b = x^(a+b), x^b · y^b = (x·y)^b, and (x^a)^b = x^(a·b). Also note that x^0 = 1. When working with the power function, the variable x appears in the base and the constant a appears as the power.

The exponential function is denoted e^x, where the exponent x is any real-valued variable and the constant base e stands for a special number that is approximately e ≅ 2.7183. Sometimes it is more convenient to use the notation e^x = exp(x) instead. Using a calculator, we can calculate e^2.5 = 2.7183^2.5 = 12.1825. Note that the exponent can be negative, x < 0, in which case we can write e^(−x) = 1/e^x. If x = 0, then e^0 = 1. The exponential function always returns a positive value, e^x > 0, and it approaches zero as x approaches negative infinity. More complex forms of the exponential are often used. For example, you will later see the function e^(−((x−μ)/σ)^2), where x is a variable and μ and σ are constants. In this case, it is more convenient to write this as exp(−((x−μ)/σ)^2), which tells you to first compute the squared deviation y = ((x−μ)/σ)^2 and then compute exp(−y). The exponential function obeys the properties e^x · e^y = e^(x+y) and (e^x)^a = e^(a·x). In contrast to the power function, the base of the exponential is a constant and the exponent is a variable.

The (natural) log function is denoted ln(x) for positive values of x. For example, using a calculator, for x = 10 we obtain ln(10) = 2.3026. (We normally use the natural base e = 2.7183. If instead we used base 10, then log10(10) = 1.) The log function obeys the rules ln(x·y) = ln(x) + ln(y) and ln(x^a) = a·ln(x). The log function is the inverse of the exponential function, ln(exp(x)) = x, and the exponential function is the inverse of the log function, exp(ln(x)) = x. The function a^x, where a is a constant and x is a variable, can be rewritten in terms of the exponential function: define b = ln(a); then e^(b·x) = (e^b)^x = exp(ln(a))^x = a^x.

Figure 1.1 illustrates the power, exponential, and log functions using different coefficient values. As can be seen, the coefficient changes the curvature of each function.

Last but not least are the trigonometric functions based on a circle. Figure 1.2 shows a circle with its center located at coordinates (0, 0) in an (X, Y) plane. Now imagine a line segment of radius r = 1 that extends from the center point to the circumference of the circle. This line segment intersects the circumference at coordinates (cos(t·π), sin(t·π)) in the plane. The coordinate cos(t·π) represents the projection of the point on the circumference down onto the X axis, and sin(t·π) is the projection of the point on the circumference onto the Y axis. The variable t (which, for example, can be time) moves this point around the circle, with positive values moving the point counterclockwise and negative values moving it clockwise. The constant π = 3.1416 equals one-half cycle around the circle, and 2π is the period of time it takes to go all the way around once. The two functions are related by a translation (called the phase) in time: cos(t·π − (π/2)) = sin(t·π). Note that cos is an even function because cos(t·π) = cos(−t·π), whereas sin is an odd function because −sin(t·π) = sin(−t·π). Also note that these functions are periodic, in the sense that, for example, cos(t·π) = cos(t·π + 2·k·π) for any integer k. We can generalize these two functions by changing the frequency and the phase: for example, cos(ω·t·π + θ) is a cosine function with a frequency ω (changing the time it takes to complete a cycle) and a phase θ (advancing or delaying the initial value at time t = 0).
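The worked examples and identities in this section are easy to verify numerically. The following short script is a sketch (in Python, using only the standard math module; it is not part of the original chapter) that checks the specific values and properties stated above:

```python
import math

# Quadratic example: f(3.5) = 3.5^2 = 12.25, and its restricted inverse (square root)
assert 3.5 ** 2 == 12.25
assert math.sqrt(12.25) == 3.5

# Power-function example from the text: 15.25 ** 1.37
print(round(15.25 ** 1.37, 2))  # 41.79

# Power-function identities: x^a * x^b = x^(a+b), (x^a)^b = x^(a*b), x^0 = 1
x, a, b = 2.0, 1.37, 0.5
assert math.isclose(x ** a * x ** b, x ** (a + b))
assert math.isclose((x ** a) ** b, x ** (a * b))
assert x ** 0 == 1

# Exponential and log are inverses of each other
assert math.isclose(math.log(math.exp(1.7)), 1.7)
assert math.isclose(math.exp(math.log(10.0)), 10.0)

# ln obeys ln(x*y) = ln(x) + ln(y) and ln(x^a) = a*ln(x)
assert math.isclose(math.log(6.0 * 7.0), math.log(6.0) + math.log(7.0))
assert math.isclose(math.log(6.0 ** a), a * math.log(6.0))

# a^x rewritten with the exponential: a^x = exp(ln(a) * x)
assert math.isclose(3.0 ** 2.5, math.exp(math.log(3.0) * 2.5))

print("all function identities hold")
```

The specific numbers (2.0, 1.37, 6.0, and so on) are arbitrary test values; the identities hold for any values in the stated domains.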


Fig. 1.1 Examples of three important functions, with various parameter values. From left to right: Power function, exponential function, and log function. See text for details.

1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2 ) x 0 0 Sin(

–0.2 Sin(Time) –0.2

–0.4 –0.4

–0.6 –0.6

–0.8 –0.8

–1 –1 –1 –0.8 –0.6 –0.4 –0.2 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 0 1.2 1.4 1.6 1.8 2 Cos(x) Time

Fig. 1.2 Left panel illustrates a point on a unit circle with a radius equal to one. Vertical line shows sine, horizontal line shows cosine. Right panel shows sine as a function of time. The point on the Y axis of the right panel corresponds to the point on the Y axis of the left panel.
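The circle relationships in Figure 1.2 can be verified directly. Below is a small Python sketch (not from the chapter; the test angle is an arbitrary choice) checking the phase, symmetry, and periodicity properties of sine and cosine, with the phase shift written as −π/2 so that sin(θ) = cos(θ − π/2):

```python
import math

theta = 0.73 * math.pi  # an arbitrary test angle t*pi

# Phase relation: sin(theta) equals cos(theta - pi/2)
assert math.isclose(math.sin(theta), math.cos(theta - math.pi / 2))

# cos is an even function, sin is an odd function
assert math.isclose(math.cos(theta), math.cos(-theta))
assert math.isclose(-math.sin(theta), math.sin(-theta))

# Periodicity: adding 2*k*pi for any integer k leaves cosine unchanged
for k in (1, 2, -3):
    assert math.isclose(math.cos(theta), math.cos(theta + 2 * k * math.pi))

# The point (cos, sin) always lies on the unit circle of radius r = 1
assert math.isclose(math.cos(theta) ** 2 + math.sin(theta) ** 2, 1.0)

print("all trigonometric identities hold")
```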

Derivatives and Integrals

A derivative of a continuous function is the rate of change, or the slope, of the function at some point. Suppose f(x) is some continuous function. For a small increment Δ, the change in this function is df(x) = f(x) − f(x − Δ), and the rate of change is the change divided by the increment, df(x)/Δ = (f(x) − f(x − Δ))/Δ. If the function is continuous, then as Δ → 0 this ratio converges to what is called the derivative of the function at x, denoted as d/dx f(x). The derivatives of many functions are derived in calculus (see Stewart (2012) or any calculus textbook for an introduction to calculus). For example, in calculus it is shown that d/dx e^(c·x) = c·e^(c·x), which says that the slope of the exponential function at any point x is proportional to the exponential function itself. As another example, it is shown in calculus that d/dx x^a = a·x^(a−1), which is the derivative of the power function. For example, the derivative of a quadratic function a·x^2 is the linear function 2·a·x, and the derivative of the linear function a·x is the constant a. The derivative of the cosine function is d/dt cos(t) = −sin(t), and the derivative of sine is d/dt sin(t) = cos(t).
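The difference-quotient definition suggests a direct numerical check: for small Δ, (f(x) − f(x − Δ))/Δ should approach the calculus results. The Python sketch below (the function choices, evaluation points, and step size are illustrative assumptions, not from the chapter) compares the backward-difference approximation with each analytic derivative stated above:

```python
import math

def numeric_deriv(f, x, delta=1e-6):
    """Backward-difference quotient (f(x) - f(x - delta)) / delta from the text."""
    return (f(x) - f(x - delta)) / delta

c, x = 0.8, 1.3
# d/dx e^(c*x) = c * e^(c*x)
assert abs(numeric_deriv(lambda u: math.exp(c * u), x) - c * math.exp(c * x)) < 1e-4

# d/dx x^a = a * x^(a-1), here with a = 1.37
a = 1.37
assert abs(numeric_deriv(lambda u: u ** a, x) - a * x ** (a - 1)) < 1e-4

# d/dt cos(t) = -sin(t) and d/dt sin(t) = cos(t)
t = 0.6
assert abs(numeric_deriv(math.cos, t) + math.sin(t)) < 1e-4
assert abs(numeric_deriv(math.sin, t) - math.cos(t)) < 1e-4

print("difference quotients match the analytic derivatives")
```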


Fig. 1.3 Illustration of the power function and its derivative. The curves in both panels show the power function. The slope of the dotted line (the tangent to the function) is given by the derivative of that function at the point of tangency (in this example, at x = 1).

Figure 1.3 illustrates the derivative of the power function at the value x = 1 for two different coefficients. The curved line shows the power function, and the straight line touches the curve at x = 1. The slope of this line is the derivative.

The integral of a continuous function is the area under the curve within some interval (see Fig. 1.4). Suppose f(x) is a continuous function of x within the interval [a, b]. A simple way to approximate this area is to divide the interval into N very small steps, with a small increment Δ being a step:

[x_0 = a, x_1 = a + Δ, x_2 = a + 2·Δ, ..., x_j = a + j·Δ, ..., x_{N−1} = b − Δ, x_N = b].

Then, compute the area of the rectangle within each step, Δ·f(x_j), and finally sum all the areas of the rectangles to obtain an approximate area under the curve:

A ≈ Δ·f(x_1) + Δ·f(x_2) + ··· + Δ·f(x_N) = Σ_{j=1}^{N} f(x_j)·Δ.

As the number of intervals becomes arbitrarily large and the increments get arbitrarily small, so that N → ∞ and Δ → 0, this sum converges to the integral:

A = ∫_a^b f(x)·dx.

If we allow the upper limit of the integral to be a variable, say z, then the integral becomes a function of the upper limit, which can be written as F(z) = ∫_a^z f(x)·dx. What happens if we take the derivative of an integral? Let’s examine the change in the area divided by the increment:

A(x_N) − A(x_{N−1}) = Δ·f(x_N),
(A(x_N) − A(x_{N−1}))/Δ = f(x_N).

This simple idea (proven more rigorously in a calculus textbook) leads to the first fundamental theorem of integrals, which states that d/dz F(z) = f(z), with F(z) = ∫_a^z f(x)·dx. The fundamental theorem can then be used to find the integral of a function. For example, the integral ∫_0^z x^a dx = (a + 1)^(−1)·z^(a+1) because d/dz (a + 1)^(−1)·z^(a+1) = z^a. The integral ∫^z e^(α·x) dx = α^(−1)·e^(α·z) because d/dz α^(−1)·e^(α·z) = e^(α·z). The integral ∫^z cos(t) dt = sin(z) because d/dz sin(z) = cos(z).

Computational and mathematical models are often described by difference or differential equations. These types of equations are used to describe how the state of a system changes with time. For example, suppose V(t) represents the strength of a neural connection between an input and an output at time t, and suppose x(t) is some reward signal that is guiding the learning process. A simple, discrete-time linear model of learning can be V(t) = (1 − α)·V(t − 1) + α·x(t), where 0 ≤ α ≤ 1 is the learning-rate parameter. We can rewrite this as a difference equation:

dV(t) = V(t) − V(t − 1) = −α·V(t − 1) + α·x(t) = −α·(V(t − 1) − x(t)).

This model states that the change in strength at time t is proportional to the negative of the error signal, which is defined as the difference between the previous strength and the new reward.


Fig. 1.4 The integral of the function is the area under the curve. It can be approximated as the sum of the areas of the rectangles (left panel). As the rectangles become narrower (middle), the sum of their areas converges to the true integral (right).

If we wish to describe learning as occurring more continuously in time, we can introduce a small time increment Δt into the model so that it states

ΔV(t) = V(t) − V(t − Δt)
  = −α·Δt·(V(t − Δt) − x(t)),

which says that the change in strength is proportional to the negative of the error signal, with the constant of proportionality now modified by the time increment. Dividing both sides by the time increment Δt we obtain

ΔV(t)/Δt = −α·(V(t − Δt) − x(t)),

and now if we allow the time increment to approach zero in the limit, Δt → 0, then the preceding equation converges to a limit that is the differential equation

d/dt V(t) = −α·(V(t) − x(t)),

which states that the rate of change in strength is proportional to the negative of the error signal. Sometimes we can solve the differential equation for a simple solution. For example, the solution to the equation d/dt V(t) = −α·V(t) + c is V(t) = c/α − a·e^{−α·t}, because when we substitute this solution back into the differential equation, it satisfies the equality of the differential equation:

d/dt (c/α − a·e^{−α·t}) = −α·(c/α − a·e^{−α·t}) + c.

A stochastic difference equation is frequently used in cognitive modeling to represent how a state changes across time when it is perturbed by noise. For example, if we assume that the strength of a connection changes according to the preceding learning model, but with some noise (denoted as ε(t)) added, then we can use the following stochastic difference equation:

ΔV(t) = −α·Δt·(V(t − Δt) − x(t)) + ε(t)·√Δt.

Note that the noise is multiplied by √Δt instead of Δt in a stochastic difference equation. This is required so that the effect of the noise does not disappear as Δt → 0, and the variance of the noise remains proportional to Δt (which is the key characteristic of Brownian motion processes). See Bhattacharya and Waymire (2009) for an excellent book on stochastic processes.

Elementary Probability Theory

Probability theory describes how to assign probabilities to events. See Feller (1968) for a review of probability theory. We start with a sample space, a set denoted as Ω, which contains all the unique outcomes that can be realized. For simplicity, we will assume (unless noted otherwise) that the sample space is finite. (There could be a very large number of outcomes, but the number is finite.) For example, if a person takes two medical tests, test A and test B, and each test can be positive or negative, then the sample space contains four mutually exclusive and exhaustive outcomes: all four combinations of positive and negative test results from tests A and B. Figure 1.5 illustrates the situation for this simple example. An event, such as the event A (e.g., test A is positive), is a subset of the sample space. Suppose for the moment that A, B are two events. The disjunctive event A or B (e.g., test A is positive or test B is positive) is represented as the union A ∪ B.
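These event operations can be illustrated by enumerating the four-outcome sample space of the two-test example. The joint probabilities below are invented for illustration:

```python
# Hypothetical probabilities for the four-outcome sample space of the
# two-test example (they must sum to one).
p = {('pos', 'pos'): 0.15, ('pos', 'neg'): 0.10,
     ('neg', 'pos'): 0.20, ('neg', 'neg'): 0.55}

event_a = {o for o in p if o[0] == 'pos'}   # event A: test A is positive
event_b = {o for o in p if o[1] == 'pos'}   # event B: test B is positive

def prob(event):
    # An event's probability is the sum over the outcomes it contains.
    return sum(p[o] for o in event)

p_union = prob(event_a | event_b)           # disjunction: set union
p_inter = prob(event_a & event_b)           # conjunction: set intersection
print(round(p_union, 2), round(p_inter, 2))  # 0.45 0.15
```

Note that the printed values satisfy p(A ∪ B) = p(A) + p(B) − p(A ∩ B).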


Fig. 1.5 Two ways to illustrate the probability space of events A and B. The contingency table (left) and the Venn diagram (right) correspond in the following way: Positive values on both tests in the table (the conjunctive event, A∩B) are represented by the overlap of the circles in the Venn diagram. Positive values on one test but not on the other in the table (the XOR event, A positive and B negative, or vice versa) are represented by the nonoverlapping areas of circles A and B. Finally, tests that are both negative (upper left entry in the table) correspond in the Venn diagram to the area within the rectangle (the so-called “sample space”) that is not occupied by any of the circles.

The conjunctive event A and B (e.g., test A is positive and test B is positive) is represented as the intersection A ∩ B. The impossible event (e.g., test A is neither positive nor negative), denoted ∅, is an empty set. The certain event is the entire sample space Ω. The complementary event "not A" is denoted Ā.

A probability function p assigns a number between zero and one to each event. The impossible event is assigned zero, and the certain event is assigned one. The other events are assigned probabilities 0 ≤ p(A) ≤ 1, and p(Ā) = 1 − p(A). However, these probabilities must obey the following additive rule: if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B). What if the events are not mutually exclusive, so that A ∩ B ≠ ∅? The answer is called the "or" rule, which follows from the previous assumptions:

p(A ∪ B) = p(A ∩ B) + p(A ∩ B̄) + p(Ā ∩ B)
  = [p(A ∩ B) + p(A ∩ B̄)] + [p(Ā ∩ B) + p(A ∩ B)] − p(A ∩ B)
  = p(A) + p(B) − p(A ∩ B).

Suppose we learn that some event A has occurred, and now we wish to define the new probability for event B conditioned on this known event. The conditional probability p(B|A) stands for the probability of event B given that event A has occurred, which is defined as p(B|A) = p(A ∩ B)/p(A). Similarly, p(A|B) = p(A ∩ B)/p(B) is the probability of event A given that B has occurred. Using the definition of conditional probability, we can then define the "and" rule for joint probabilities as follows: the probability of A and B equals p(A ∩ B) = p(A)·p(B|A) = p(B)·p(A|B).

An important theorem of probability is called Bayes' rule. It describes how to revise one's beliefs based on evidence. Suppose we have two mutually exclusive and exhaustive hypotheses denoted H1 and H2. For example, H1 could be that a certain disease is present and H2 that the disease is not present. Define the event D as some observed data that provide evidence for or against each hypothesis, such as a medical test result. Suppose p(D|H1) and p(D|H2) are known. These are called the likelihoods of the data for each hypothesis. For example, medical testing would be used to determine the likelihood of a positive versus negative test result when the disease is known to be present, and the likelihood of a positive versus negative test would also be known when the disease is not present. We define p(H1) and p(H2) as the prior probabilities of each hypothesis. For example, these priors may be based on base rates for the disease being present or not. Then, according to the conditional probability definition,

p(H1|D) = p(H1)·p(D|H1)/p(D)
  = p(H1)·p(D|H1)/[p(H1)·p(D|H1) + p(H2)·p(D|H2)].

The last line is Bayes' rule. The probability p(H1|D) is called the posterior probability of the hypothesis given the data. It reflects the revision from the prior produced by the evidence from the data. If there are M ≥ 2 hypotheses, then the rule is extended to be

p(H1|D) = p(H1)·p(D|H1) / Σ_{k=1}^{M} p(Hk)·p(D|Hk),

where the denominator is the sum across k = 1 to k = M hypotheses.

We often work with events that are assigned to numbers. A random variable is a function that assigns real numbers to events. For example, a person may look at an advertisement and then rate how effective it is on a nine-point scale. In this case, there are nine mutually exclusive and exhaustive categories to choose from on the rating scale, and each choice is assigned a number (say, 1, 2, ..., or 9).
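Before moving on, the two-hypothesis form of Bayes' rule above can be verified numerically. All numbers in this Python sketch are invented (a 2% base rate and an imperfect medical test):

```python
# Hypothetical numbers: a 2% base rate for the disease (the prior),
# a test with a 95% hit rate and a 10% false-positive rate.
p_h1 = 0.02                      # p(H1): disease present
p_h2 = 1.0 - p_h1                # p(H2): disease absent
p_d_given_h1 = 0.95              # p(D|H1): positive test given disease
p_d_given_h2 = 0.10              # p(D|H2): positive test given no disease

# Bayes' rule: posterior = prior * likelihood / p(D)
p_d = p_h1 * p_d_given_h1 + p_h2 * p_d_given_h2
posterior = p_h1 * p_d_given_h1 / p_d
print(round(posterior, 3))       # 0.162: the disease is still unlikely
```

Even after a positive test, the posterior stays modest because the low base rate dominates; this is exactly the revision from prior to posterior that Bayes' rule formalizes.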

Then we can define a random variable X(R), which is a function that maps the category event R onto one of the nine numbers. For example, if the person chooses the middle rating option, so that R = middle, then we assign X(middle) = 5. For simplicity, we often omit the event and instead write the random variable simply as X. For example, we can ask what is the probability that the random variable is assigned the number 5, which is written as p(X = 5). Then we assign a probability to each value of a random variable by assigning it the probability of the event that produces the value. For example, p(X = 5) equals the probability of the event that the person picks the middle value.

Suppose the random variable has N values x1, x2, ..., xi, ..., xN. In our previous example with the rating scale, the random variable had nine values. The function p(X = xi) (interpreted as the probability that the person picks a choice corresponding to value xi) is called the probability mass function for the random variable X. This function has the following properties:

0 ≤ p(X = xi) ≤ 1;
Σ_{i=1}^{N} p(X = xi) = p(X = x1) + p(X = x2) + ··· + p(X = xN) = 1.

The cumulative probability is then defined as

p(X ≤ xi) = p(X = x1) + p(X = x2) + ··· + p(X = xi) = Σ_{j=1}^{i} p(X = xj).

Often we measure more than one random variable. For example, we could present an advertisement and ask how effective it is for the participant personally but also ask how effective the participant believes it is for others. Suppose X is the random variable for the nine-point rating scale for self, and let Y be the random variable for the nine-point rating scale for others. Then we can define a joint probability p(X = xi, Y = yj), which equals the probability that xi is selected for self and that yj is selected for others. These joint probabilities form a two-way 9 × 9 table with p(X = xi, Y = yj) in each cell. This joint probability function has the following properties:

0 ≤ p(X = xi, Y = yj) ≤ 1;
Σ_j p(X = xi, Y = yj) = p(X = xi);
Σ_i p(X = xi, Y = yj) = p(Y = yj);
Σ_i Σ_j p(X = xi, Y = yj) = 1.

Finally, we can define the conditional probability of Y = yj given X = xi as

p(Y = yj | X = xi) = p(X = xi, Y = yj) / p(X = xi).

Often, we work with random variables that have a continuous rather than a discrete and finite distribution, such as the normal distribution. Suppose X is a univariate continuous random variable. In this case, the probability assigned to each real number is zero (there are uncountably infinitely many of them in any interval). Instead, we start by defining the cumulative distribution function F(x) = p(X ≤ x). Then we define the probability density at each value of x as the derivative of the cumulative probability function, f(x) = d/dx F(x). Using this definition for the density, we compute the probability of X falling in some interval [a, b] as p(X ∈ [a,b]) = ∫_a^b f(x)dx. The increment f(x)·dx for the continuous random variable is conceptually related to the mass function of the discrete case.

The probability density function for the normal distribution equals

f(x) = (1/(√(2π)·σ))·e^{−(x−μ)²/(2σ²)},

where μ is the mean of the distribution and σ is the standard deviation of the distribution. The normal distribution is popular because of the central limit theorem, which states that the sample mean X̄ = Σ Xi/N of N independent samples of a random variable X will approach normal as the number of samples becomes arbitrarily large, even if the original random variable X is not normal. We often work with sample means, which we expect to be approximately normal because of the central limit theorem.

Expectations

When working with random variables, we are often interested in their moments (i.e., their means, variances, or correlations). See Hogg and Craig (1970) for a reference on moments. These different moments are different concepts, but they are all defined as expected values of the random variables. The expectation of the random variable is defined as E[X] = Σ_i p(X = xi)·xi for the discrete case, and it is defined as E[X] = ∫ f(x)·x·dx for the continuous case. The mean of a random variable X, denoted μX, is defined as the expectation of the random variable: μX = E[X]. The variance of a random variable X, denoted σX², is defined as the expectation of the squared deviation around the mean: σX² = Var(X) = E[(X − μX)²]. For example, in the discrete case, this equals σX² = Σ_i p(X = xi)·(xi − μX)². The standard deviation is the square root of the variance, σX = √(σX²). The covariance between two random variables (X, Y), denoted σXY, is defined by the expectation of the product of deviations: σXY = cov(X,Y) = E[(X − μX)·(Y − μY)]. For example, in the discrete case, this equals σXY = Σ_i Σ_j p(X = xi, Y = yj)·(xi − μX)·(yj − μY). The correlation is defined as ρXY = σXY/(σX·σY).

Often we need to combine two random variables by a linear combination Z = a·X + b·Y, where a, b are two constants. For example, we may sum two scores, a = 1 and b = 1, or take a difference between two scores, a = 1, b = −1. There are two important rules for determining the mean and the variance of a linear transformation. The expectation operation is linear: E[a·X + b·Y] = a·E[X] + b·E[Y]. The variance operator, however, is not linear: var(a·X + b·Y) = a²·var(X) + b²·var(Y) + 2ab·cov(X,Y).

Maximum Likelihood Estimation

Computational and mathematical models of psychology contain parameters that need to be estimated from the data. For example, suppose a person can choose to play or not play a slot machine at the beginning of each trial. The slot machine pays the amount x(t) on trial t, but this amount is not revealed until the trial is over. Consider a simple model that assumes that the probability of choosing to gamble on trial t, denoted p(t), is predicted by the following linear learning model:

V(t) = (1 − α)·V(t − 1) + α·x(t)
p(t) = 1/(1 + e^{−β·V(t)}).

This model has two parameters, α and β, that must be estimated from the data. This is analogous to the estimation problem one faces when using multiple linear regression, where the linear regression is the model and the regression coefficients are the model parameters. However, computational and mathematical models, such as the earlier learning model example, are nonlinear with respect to the model parameters, which makes them more complicated, and one cannot use simple linear regression fitting routines.

The model parameters are estimated from the empirical experimental data. These experiments usually consist of a sample of participants, and each participant provides a series of responses to several experimental conditions. For example, a study of learning to gamble could obtain 100 choice trials at each of 5 payoff conditions from each of 50 participants. One of the first issues for modeling is the level of analysis of the data. On the one hand, a group-level analysis would fit a model to all the data from all the participants, ignoring individual differences. This is not a good idea if there are substantial individual differences. On the other hand, an individual-level analysis would fit a model to each individual separately, allowing arbitrary individual differences. This introduces a new set of parameters for each person, which is unparsimonious. A hierarchical model applies the model to all of the individuals, but it includes a model for the distribution of individual differences. This is a good compromise, but it requires a good model of the distribution of individual differences. Chapter 13 of this book describes the hierarchical approach. Here we describe the basic ideas of fitting models at the individual level using a method called maximum likelihood (see Myung, 2003, for a detailed tutorial on maximum likelihood estimation). Also see Hogg and Craig (1970) for the general properties of maximum likelihood estimates.

Suppose we obtain 100 choice trials (gamble, not gamble) from 5 payoff conditions from each participant. The above learning model has two parameters (α, β) that we wish to estimate using the 100 × 5 = 500 binary valued responses. We can put the 500 answers in a vector D = [x1, x2, ..., xt, ..., x500], where each xt is zero (not gamble) or one (gamble). If we pick values for the two parameters (α, β), then we can insert these into our learning model and compute the probability of gambling, p(t), for each trial from the model. Define p(xt, t) as the probability that the model predicts the value xt observed on trial t. For example, if xt = 1, then p(xt, t) = p(t), but if xt = 0, then p(xt, t) = 1 − p(t), where recall that p(t) is the predicted probability of choosing the gamble. Then we compute the likelihood of the observed sequence of data D given the model parameters (α, β) as follows:

L(D|α,β) = p(x1,1)·p(x2,2) ··· p(x500,500).
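This likelihood can be computed directly. The Python sketch below is illustrative only: the payoff sequence, the responses, and the parameter values are all invented, and the log of the product is computed to avoid numerical underflow:

```python
import math
import random

def log_likelihood(alpha, beta, payoffs, choices):
    # LnL(D|alpha,beta) = sum_t ln p(x_t, t) for the learning/choice model
    v, ll = 0.0, 0.0
    for x_t, c_t in zip(payoffs, choices):
        v = (1 - alpha) * v + alpha * x_t           # V(t) = (1-a)V(t-1) + a x(t)
        p = 1.0 / (1.0 + math.exp(-beta * v))       # p(t) = 1/(1 + e^{-beta V(t)})
        ll += math.log(p if c_t == 1 else 1.0 - p)  # ln p(x_t, t)
    return ll

random.seed(1)
payoffs = [random.choice([-1.0, 1.0]) for _ in range(500)]  # invented payoffs x(t)
choices = [1 if x > 0 else 0 for x in payoffs]              # invented responses
print(log_likelihood(0.3, 2.0, payoffs, choices))
```

In practice one would hand the negative of this function to a nonlinear optimizer to obtain the maximum likelihood estimates of α and β.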


Fig. 1.6 Example of maximum likelihood estimation. The histograms describe data sampled from a Gamma distribution with scale and shape parameters both equal to 20. Using maximum likelihood we can estimate the parameter values of a Gamma distribution that best fits the sample (they turn out to be 20.4 and 19.5, respectively), and plot its probability density function (solid line).
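The gamma fit shown in Figure 1.6 can be imitated with a deliberately crude, standard-library-only sketch: simulate data from a gamma distribution, then pick the (shape, scale) pair on a coarse grid that maximizes the gamma log likelihood. The grid, sample size, and seed are arbitrary choices; a real application would use a proper nonlinear optimizer:

```python
import math
import random

def gamma_loglik(shape, scale, n, sum_x, sum_log_x):
    # Sum of ln f(x) over the data, where f is the gamma density
    # f(x) = x^(shape-1) * exp(-x/scale) / (Gamma(shape) * scale^shape).
    return ((shape - 1) * sum_log_x - sum_x / scale
            - n * (math.lgamma(shape) + shape * math.log(scale)))

random.seed(0)
data = [random.gammavariate(20.0, 20.0) for _ in range(500)]  # shape 20, scale 20
n, sum_x, sum_log_x = len(data), sum(data), sum(math.log(x) for x in data)

# Coarse grid search over candidate shape and scale values.
candidates = [float(v) for v in range(10, 31)]
best = max(((s, sc) for s in candidates for sc in candidates),
           key=lambda pair: gamma_loglik(pair[0], pair[1], n, sum_x, sum_log_x))
print(best)  # close to the generating values (20.0, 20.0)
```

As in the figure, the recovered parameters land near the values that generated the sample, with some deviation due to sampling noise.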

To make this computationally feasible, we use the log likelihood instead:

LnL(D|α,β) = Σ_{t=1}^{500} ln p(xt, t).

This likelihood changes depending on our choice of α, β. Our goal is to pick the (α, β) that maximizes LnL(D|α,β). Nonlinear search algorithms available in computational software, such as MATLAB, R, Gauss, and Mathematica, can be used to find the maximum likelihood estimates. The log likelihood is a goodness-of-fit measure: higher values indicate better fit. Actually, the computer algorithms find the minimum of the badness-of-fit measure G² = −2·LnL(D|α,β).

Maximum likelihood is not restricted to learning models, and it can be used to fit all kinds of models. For example, if we observe a response time on each trial, and our model predicts the response time for each trial, then the preceding equation can be applied with xt equal to the observed response time on a trial, and with p(xt, t) equal to the predicted probability for the observed value of response time on that trial. Figure 1.6 shows an example in which a sample of response time data (summarized in the figure by a histogram) was fit by a gamma distribution model for response time using two model parameters.

Now suppose we have two different competing models. The model we just described has two parameters. Suppose the competing model is quite different and it is more complex, with four parameters. Also suppose the models are not nested, so that it is not possible to compute the same predictions for the simpler model using the more complex model. Then we can compare models by using a Bayesian information criterion (BIC; see Wasserman, 2000, for a review). This criterion is derived on the basis of choosing the model that is most probable given the data. (However, the derivation only holds asymptotically as the sample size increases indefinitely.) For each model we wish to compare, we can compute a BIC index:

BIC_model = G²_model + n_model·ln(N),

where n_model equals the number of model parameters estimated from the data and N = number of data points. The BIC index balances model fit with model complexity as measured by the number of parameters. (Note, however, that model complexity is more than the number of parameters; see Chapter 13.) It is a badness-of-fit index, and so we choose the model with the lowest BIC index. See Chapter 14 for a detailed review of model comparison.

Concluding comments

This handbook categorizes models into three levels based on the questions asked and addressed by the models, and the levels of analysis.
The first category includes models theorizing the elementary cognitive mechanisms, such as the signal detection process, the diffusion process, information processing, and reinforcement learning. They have been widely used in many formal models within and beyond psychology and cognitive science, ranging from basic learning to complex decision making. The second category covers models theorizing basic cognitive skills, such as perceptual identification, categorization, and episodic memory. The third category includes models theorizing higher level cognition, such as Bayesian cognition, decision making, semantic memory, and shape

perception. In addition, we provide two chapters on modeling tools, including Bayesian estimation in hierarchical models and model comparison methods. We conclude the handbook with three chapters on new directions in the field, including neurocognitive modeling, mathematical and computational modeling in clinical psychology, and cognitive and decision models based upon quantum probability theory.

The models reviewed in the handbook make use of many of the mathematical ideas presented in this review chapter. Probabilistic models appear in chapters covering signal detection theory (Chapter 2), probabilistic models of cognition (Chapters 9, 10, and 17), and clinical applications (Chapter 16). Stochastic models (i.e., models that are dynamic and probabilistic) appear in chapters covering information processing (Chapter 4), perceptual judgment (Chapter 6), and random walk/diffusion models of choice and response time in various cognitive tasks (Chapters 3, 7, 10, and 15). Learning and memory models are reviewed in Chapters 5, 7, 8, and 11. Models using vector spaces are introduced in Chapters 11, 12, and 17.

The basic concepts reviewed in this chapter should be helpful for readers who are new to mathematical and computational models to jumpstart reading the rest of the book. In addition, each chapter is self-contained, presents a tutorial-style introduction to the topic area exemplified by many applications, and provides a specific glossary of the basic concepts in the topic area. We believe you will have a rewarding reading experience.

Note
1. This chapter is restricted to real numbers.

References
Bhattacharya, R. N., & Waymire, E. C. (2009). Stochastic processes with applications (Vol. 61). Philadelphia, PA: SIAM.
Busemeyer, J. R., & Diederich, A. (2009). Cognitive modeling. Thousand Oaks, CA: SAGE.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes (Vol. 134). Boca Raton, FL: CRC Press.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed., Vol. 1). New York, NY: Wiley.
Hintzman, D. L. (1991). Why are formal models useful in psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 39–56). Hillsdale, NJ: Erlbaum.
Hogg, R. V., & Craig, A. T. (1970). Introduction to mathematical statistics (3rd ed.). New York, NY: Macmillan.
Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: SAGE.
Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90–100.
Stewart, J. (2012). Calculus (7th ed.). Belmont, CA: Brooks/Cole.
Suppes, P. (1984). Probabilistic metaphysics. Oxford: Basil Blackwell.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.

PART I

Elementary Cognitive Mechanisms

CHAPTER 2

Multidimensional Signal Detection Theory

F. Gregory Ashby and Fabian A. Soto

Abstract Multidimensional signal detection theory is a multivariate extension of signal detection theory that makes two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. The most widely studied version is known as general recognition theory (GRT). General recognition theory assumes that the percept on each trial can be modeled as a random sample from a multivariate probability distribution defined over the perceptual space. Decision bounds divide this space into regions that are each associated with a response alternative. General recognition theory rigorously defines and tests a number of important perceptual and cognitive conditions, including perceptual and decisional separability and perceptual independence. General recognition theory has been used to analyze data from identification experiments in two ways: (1) fitting and comparing models that make different assumptions about perceptual and decisional processing, and (2) testing assumptions by computing summary statistics and checking whether these satisfy certain conditions. Much has been learned recently about the neural networks that mediate the perceptual and decisional processing modeled by GRT, and this knowledge can be used to improve the design of experiments where a GRT analysis is anticipated. Key Words: signal detection theory, general recognition theory, perceptual separability, perceptual independence, identification, categorization

Introduction

Signal detection theory revolutionized psychophysics in two different ways. First, it introduced the idea that trial-by-trial variability in sensation can significantly affect a subject's performance. And second, it introduced the field to the then-radical idea that every psychophysical response requires a decision from the subject, even when the task is as simple as detecting a signal in the presence of noise. Of course, signal detection theory proved to be wildly successful, and both of these assumptions are now routinely accepted without question in virtually all areas of psychology.

The mathematical basis of signal detection theory is rooted in statistical decision theory, which itself has a history that dates back at least several centuries. The insight of signal detection theorists was that this model of statistical decisions was also a good model of sensory decisions. The first signal detection theory publication appeared in 1954 (Peterson, Birdsall, & Fox, 1954), but the theory did not really become widely known in psychology until the seminal article of Swets, Tanner, and Birdsall appeared in 1961. From then until 1986, almost all applications of signal detection theory assumed only one sensory dimension (Tanner, 1956, is the principal exception). In almost all cases, this dimension was meant to represent sensory magnitude. For a detailed description of this standard univariate

theory, see the excellent texts of either Macmillan and Creelman (2005) or Wickens (2002). This chapter describes multivariate generalizations of signal detection theory.

Multidimensional signal detection theory is a multivariate extension of signal detection to cases in which there is more than one perceptual dimension. It has all the advantages of univariate signal detection theory (i.e., it separates perceptual and decision processes) but it also offers the best existing method for examining interactions among perceptual dimensions (or components). The most widely studied version of multidimensional signal detection theory is known as general recognition theory (GRT; Ashby & Townsend, 1986). Since its inception, more than 350 articles have applied GRT to a wide variety of phenomena, including categorization (e.g., Ashby & Gott, 1988; Maddox & Ashby, 1993), similarity judgment (Ashby & Perrin, 1988), face perception (Blaha, Silbert, & Townsend, 2011; Thomas, 2001; Wenger & Ingvalson, 2002), recognition and source memory (Banks, 2000; Rotello, Macmillan, & Reeder, 2004), source monitoring (DeCarlo, 2003), attention (Maddox, Ashby, & Waldron, 2002), object recognition (Cohen, 1997; Demeyer, Zaenen, & Wagemans, 2007), perception/action interactions (Amazeen & DaSilva, 2005), auditory and speech perception (Silbert, 2012; Silbert, Townsend, & Lentz, 2009), haptic perception (Giordano et al., 2012; Louw, Kappers, & Koenderink, 2002), and the perception of sexual interest (Farris, Viken, & Treat, 2010).

Extending signal detection theory to multiple dimensions might seem like a straightforward mathematical exercise, but, in fact, several new conceptual problems must be solved. First, with more than one dimension, it becomes necessary to model interactions (or the lack thereof) among those dimensions. During the 1960s and 1970s, a great many terms were coined that attempted to describe perceptual interactions among separate components. None of these, however, were rigorously defined or had any underlying theoretical foundation. Included in this list were perceptual independence, separability, integrality, performance parity, and sampling independence. Thus, to be useful as a model of perception, any multivariate extension of signal detection theory needed to provide theoretical interpretations of these terms and show rigorously how they were related to one another.

Second, the problem of how to model decision processes when the perceptual space is multidimensional is far more difficult than when there is only one sensory dimension. A standard signal-detection-theory lecture is to show that almost any decision strategy is mathematically equivalent to setting a criterion on the single sensory dimension, then giving one response if the sensory value falls on one side of this criterion, and the other response if the sensory value falls on the other side. For example, in the normal, equal-variance model, this is true regardless of whether subjects base their decision on sensory magnitude or on likelihood ratio. A straightforward generalization of this model to two perceptual dimensions divides the perceptual plane into two response regions. One response is given if the percept falls in the first region and the other response is given if the percept falls in the second region. The obvious problem is that, unlike a line, there are an infinite number of ways to divide a plane into two regions. How do we know which of these has the most empirical validity?

The solution to the first of these two problems—that is, the sensory problem—was proposed by Ashby and Townsend (1986) in the article that first developed GRT. The GRT model of sensory interactions has been embellished during the past 25 years, but the core concepts introduced by Ashby and Townsend (1986) remain unchanged (i.e., perceptual independence, perceptual separability). In contrast, the decision problem has been much more difficult. Ashby and Townsend (1986) proposed some candidate decision processes, but at that time they were largely without empirical support. In the ensuing 25 years, however, hundreds of studies have attacked this problem, and today much is known about human decision processes in perceptual and cognitive tasks that use multidimensional perceptual stimuli.

Box 1 Notation
AiBj = stimulus constructed by setting component A to level i and component B to level j
aibj = response in an identification experiment signaling that component A is at level i and component B is at level j
X1 = perceived value of component A
X2 = perceived value of component B

14 elementary cognitive mechanisms

Box 1  Continued

fij(x1, x2) = joint likelihood that the perceived value of component A is x1 and the perceived value of component B is x2 on a trial when the presented stimulus is AiBj
gij(x1) = marginal pdf of component A on trials when stimulus AiBj is presented
rij = frequency with which the subject responded Rj on trials when stimulus Si was presented
P(Rj|Si) = probability that response Rj is given on a trial when stimulus Si is presented

General Recognition Theory

General recognition theory (see the Glossary for key concepts related to GRT) can be applied to virtually any task. The most common applications, however, are to tasks in which the stimuli vary on two stimulus components or dimensions. As an example, consider an experiment in which participants are asked to categorize or identify faces that vary across trials on gender and age. Suppose there are four stimuli (i.e., faces) that are created by factorially combining two levels of each dimension. In this case we could denote the two levels of the gender dimension by A1 (male) and A2 (female) and the two levels of the age dimension by B1 (teen) and B2 (adult). Then the four faces are denoted as A1B1 (male teen), A1B2 (male adult), A2B1 (female teen), and A2B2 (female adult).

As with signal detection theory, a fundamental assumption of GRT is that all perceptual systems are inherently noisy. There is noise both in the stimulus (e.g., photon noise) and in the neural systems that determine its sensory representation (Ashby & Lee, 1993). Even so, the perceived value on each sensory dimension will tend to increase as the level of the relevant stimulus component increases. In other words, the distribution of percepts will change when the stimulus changes. So, for example, each time the A1B1 face is presented, its perceived age and maleness will tend to be slightly different.

General recognition theory models the sensory or perceptual effects of a stimulus AiBj via the joint probability density function (pdf) fij(x1, x2) (see Box 1 for a description of the notation used in this article). On any particular trial when stimulus AiBj is presented, GRT assumes that the subject's percept can be modeled as a random sample from this joint pdf. Any such sample defines an ordered pair (x1, x2), the entries of which fix the perceived value of the stimulus on the two sensory dimensions. General recognition theory assumes that the subject uses these values to select a response.

In GRT, the relationship of the joint pdf to the marginal pdfs plays a critical role in determining whether the stimulus dimensions are perceptually integral or separable. The marginal pdf gij(x1) simply describes the likelihoods of all possible sensory values of X1. Note that the marginal pdfs are identical to the one-dimensional pdfs of classical signal detection theory. Component A is perceptually separable from component B if the subject's perception of A does not change when the level of B is varied. For example, age is perceptually separable from gender if the perceived age of the adult in our face experiment is the same for the male adult as for the female adult, and if a similar invariance holds for the perceived age of the teen. More formally, in an experiment with the four stimuli A1B1, A1B2, A2B1, and A2B2, component A is perceptually separable from B if and only if

g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1   (1)

Similarly, component B is perceptually separable from A if and only if

g11(x2) = g21(x2) and g12(x2) = g22(x2), for all values of x2   (2)

If perceptual separability fails, then A and B are said to be perceptually integral. Note that this definition is purely perceptual since it places no constraints on any decision processes.

Another purely perceptual phenomenon is perceptual independence. According to GRT, components A and B are perceived independently in stimulus AiBj if and only if the perceptual value of component A is statistically independent of the perceptual value of component B on AiBj trials. More specifically, A and B are perceived independently in stimulus AiBj if and only if

fij(x1, x2) = gij(x1) gij(x2)   (3)

for all values of x1 and x2. If perceptual independence is violated, then components A and B are perceived dependently. Note that perceptual independence is a property of a single stimulus, whereas perceptual separability is a property of groups of stimuli.
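For the Gaussian case used throughout this chapter, perceptual independence (Eq. 3) is easy to check numerically: the joint pdf factors into the product of its marginals exactly when the covariance is zero. A minimal sketch (hypothetical parameter values, plain Python, not code from the chapter):

```python
from math import exp, pi, sqrt

def bvn_pdf(x1, x2, m1, m2, s1, s2, cov):
    """Joint bivariate normal pdf f(x1, x2)."""
    rho = cov / (s1 * s2)
    det = 1.0 - rho ** 2
    z1, z2 = (x1 - m1) / s1, (x2 - m2) / s2
    q = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / det
    return exp(-q / 2) / (2 * pi * s1 * s2 * sqrt(det))

def norm_pdf(x, m, s):
    """Marginal (univariate) normal pdf g(x)."""
    return exp(-((x - m) / s) ** 2 / 2) / (s * sqrt(2 * pi))

# With zero covariance the joint pdf factors into its marginals (Eq. 3)...
joint = bvn_pdf(0.5, -0.3, 0, 0, 1, 1, 0.0)
product = norm_pdf(0.5, 0, 1) * norm_pdf(-0.3, 0, 1)
print(abs(joint - product) < 1e-12)      # True: perceptual independence holds

# ...but a nonzero covariance breaks the factorization of Eq. 3.
joint_dep = bvn_pdf(0.5, -0.3, 0, 0, 1, 1, 0.6)
print(abs(joint_dep - product) < 1e-12)  # False: perceptual independence violated
```

Because independence is a property of a single stimulus, this check would be applied one perceptual distribution at a time.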

multidimensional signal detection theory 15

A third important construct from GRT is decisional separability. In our hypothetical experiment with stimuli A1B1, A1B2, A2B1, and A2B2, and the two perceptual dimensions X1 and X2, decisional separability holds on dimension X1 (for example) if the subject's decision about whether stimulus component A is at level 1 or 2 depends only on the perceived value on dimension X1. A decision bound is a line or curve that separates regions of the perceptual space that elicit different responses. The only types of decision bounds that satisfy decisional separability are vertical and horizontal lines.

The Multivariate Normal Model

So far we have made no assumptions about the form of the joint or marginal pdfs. Our only assumption has been that there exists some probability distribution associated with each stimulus and that these distributions are all embedded in some Euclidean space (e.g., with orthogonal dimensions). There have been some efforts to extend GRT to more general geometric spaces (i.e., Riemannian manifolds; Townsend, Aisbett, Assadi, & Busemeyer, 2006; Townsend & Spencer-Smith, 2004), but much more common is to add more restrictions to the original version of GRT, not fewer. For example, some applications of GRT have been distribution free (e.g., Ashby & Maddox, 1994; Ashby & Townsend, 1986), but most have assumed that the percepts are multivariate normally distributed. The multivariate normal distribution includes two assumptions. First, the marginal distributions are all normal. Second, the only possible dependencies are pairwise linear relationships. Thus, in multivariate normal distributions, uncorrelated random variables are statistically independent.

A hypothetical example of a GRT model that assumes multivariate normal distributions is shown in Figure 2.1. The ellipses shown there are contours of equal likelihood; that is, all points on the same ellipse are equally likely to be sampled from the underlying distribution. The contours of equal likelihood also describe the shape a scatterplot of points would take if they were random samples from the underlying distribution. Geometrically, the contours are created by taking a slice through the distribution parallel to the perceptual plane and looking down at the result from above. Contours of equal likelihood in multivariate normal distributions are always circles or ellipses. Bivariate normal distributions, like those depicted in Figure 2.1, are each characterized by five parameters: a mean on each dimension, a variance on each dimension, and a covariance or correlation between the values on the two dimensions. These are typically catalogued in a mean vector and a variance-covariance matrix. For example, consider a bivariate normal distribution with joint density function f(x1, x2). Then the mean vector would equal

μ = ( μ1 )
    ( μ2 )   (4)

and the variance-covariance matrix would equal

Σ = ( σ1²    cov12 )
    ( cov21  σ2²   )   (5)

where cov12 is the covariance between the values on the two dimensions (note that the correlation coefficient is the standardized covariance; that is, the correlation ρ12 = cov12/(σ1 σ2)).

The multivariate normal distribution has another important property. Consider an identification task with only two stimuli, and suppose the perceptual effects associated with the presentation of each stimulus can be modeled as a multivariate normal distribution. Then it is straightforward to show that the decision boundary that maximizes accuracy is always linear or quadratic (e.g., Ashby, 1992). The optimal boundary is linear if the two perceptual distributions have equal variance-covariance matrices (and so the contours of equal likelihood have the same shape and are just translations of each other), and the optimal boundary is quadratic if the two variance-covariance matrices are unequal. Thus, in the Gaussian version of GRT, the only decision bounds that are typically considered are either linear or quadratic.

In Figure 2.1, note that perceptual independence holds for all stimuli except A2B2. This can be seen in the contours of equal likelihood. Note that the major and minor axes of the ellipses that define the contours of equal likelihood for stimuli A1B1, A1B2, and A2B1 are all parallel to the two perceptual dimensions. Thus, a scatterplot of samples from each of these distributions would be characterized by zero correlation and, therefore, statistical independence (i.e., in the special Gaussian case). However, the major and minor axes of the A2B2 distribution are tilted, reflecting a positive correlation and hence a violation of perceptual independence.

Next, note in Figure 2.1 that stimulus component A is perceptually separable from stimulus component B, but B is not perceptually separable from A. To see this, note that the marginal distributions for stimulus component A are the same, regardless of the level of component B.
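Because the marginals of a multivariate normal are themselves normal, perceptual separability (Eqs. 1 and 2) reduces, in the Gaussian model, to checking that the marginal means and variances on one dimension do not change with the level of the other component. A minimal sketch with hypothetical parameter values patterned after Figure 2.1 (A separable from B, but not conversely):

```python
# Hypothetical perceptual distributions for the four stimuli, stored as
# (mean_x1, mean_x2, var_x1, var_x2) tuples; the numbers are illustrative only.
stimuli = {
    "A1B1": (0.0, 0.0, 1.0, 1.0),
    "A1B2": (0.0, 1.5, 1.0, 1.0),   # same x1 marginal as A1B1
    "A2B1": (2.0, 0.5, 1.0, 1.0),   # x2 mean shifts with the level of A
    "A2B2": (2.0, 2.0, 1.0, 1.0),
}

def marginal(stim, dim):
    """Marginal (mean, variance) of the percept on dimension 1 or 2.
    For a multivariate normal, marginalizing just picks out the
    corresponding mean and variance."""
    m1, m2, v1, v2 = stimuli[stim]
    return (m1, v1) if dim == 1 else (m2, v2)

# Component A separable from B (Eq. 1): x1 marginals match across levels of B
a_sep = (marginal("A1B1", 1) == marginal("A1B2", 1)
         and marginal("A2B1", 1) == marginal("A2B2", 1))
# Component B separable from A (Eq. 2): x2 marginals match across levels of A
b_sep = (marginal("A1B1", 2) == marginal("A2B1", 2)
         and marginal("A1B2", 2) == marginal("A2B2", 2))
print(a_sep, b_sep)   # True False
```

The second check fails because, as in Figure 2.1, the x2 marginals shift when A changes level.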

That is, g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1. This means that the subject's perception of component A does not depend on the level of B and, therefore, stimulus component A is perceptually separable from B. On the other hand, note that the subject's perception of component B does change when the level of component A changes [i.e., g11(x2) ≠ g21(x2) and g12(x2) ≠ g22(x2) for most values of x2]. In particular, when A changes from level 1 to level 2, the subject's mean perceived value of each level of component B increases. Thus, the perception of component B depends on the level of component A, and therefore B is not perceptually separable from A.

Finally, note that decisional separability holds on dimension 1 but not on dimension 2. On dimension 1 the decision bound is vertical. Thus, the subject has adopted the following decision rule:

Component A is at level 2 if x1 > Xc1; otherwise component A is at level 1,

where Xc1 is the criterion on dimension 1 (i.e., the x1 intercept of the vertical decision bound). Thus, the subject's decision about whether component A is at level 1 or 2 does not depend on the perceived value of component B. So component A is decisionally separable from component B. On the other hand, the decision bound on dimension x2 is not horizontal, so the criterion used to judge whether component B is at level 1 or 2 changes with the perceived value of component A (at least for larger perceived values of A). As a result, component B is not decisionally separable from component A.

Fig. 2.1 Contours of equal likelihood, decision bounds, and marginal perceptual distributions from a hypothetical multivariate normal GRT model that describes the results of an identification experiment with four stimuli that were constructed by factorially combining two levels of two stimulus dimensions.

Applying GRT to Data

The most common applications of GRT are to data collected in an identification experiment like the one modeled in Figure 2.1. The key data from such experiments are collected in a confusion matrix, which contains a row for every stimulus and a column for every response (Table 2.1 displays an example of a confusion matrix, which will be discussed and analyzed later). The entry in row i and column j lists the number of trials on which stimulus Si was presented and the subject gave response Rj. Thus, the entries on the main diagonal give the frequencies of all correct responses and the off-diagonal entries describe the various errors (or confusions). Note that each row sum equals the total number of stimulus presentations of that type. So if each stimulus is presented 100 times, then the sum of all entries in each row will equal 100. This means that there is one constraint per row, so an n × n confusion matrix will have n × (n − 1) degrees of freedom.

General recognition theory has been used to analyze data from confusion matrices in two different ways. One is to fit the model to the entire confusion matrix. In this method, a GRT model is constructed with specific numerical values of all of its parameters and a predicted confusion matrix is computed. Next, values of each parameter are found that make the predicted matrix as close as possible to the empirical confusion matrix. To test various assumptions about perceptual and decisional processing (for example, whether perceptual independence holds), a version of the model that assumes perceptual independence is fit to the data, as well as a version that makes no assumptions about perceptual independence. This latter version contains the former version as a special case (i.e., one in which all covariance parameters are set to zero), so it can never fit worse. After fitting these two models, we conclude that perceptual independence is violated if the more general model fits significantly better than the more restricted model that assumes perceptual independence. The other method for using GRT to test assumptions about perceptual processing, which is arguably more popular, is to compute certain summary statistics from the empirical confusion matrix and then to check whether these satisfy certain conditions that are characteristic of perceptual separability or independence.
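The structure of a confusion matrix, and the n × (n − 1) degrees-of-freedom count, can be illustrated with a small hypothetical data set (the numbers below are invented for illustration, not taken from Table 2.1):

```python
# Hypothetical 4x4 confusion matrix for stimuli A1B1, A1B2, A2B1, A2B2
# (rows = stimuli, columns = responses a1b1, a1b2, a2b1, a2b2);
# each stimulus was presented 100 times, so every row sums to 100.
confusions = [
    [70, 10, 15,  5],
    [12, 68,  6, 14],
    [14,  8, 66, 12],
    [ 4, 13, 11, 72],
]

n = len(confusions)
# Correct responses lie on the main diagonal; row sums are fixed by design.
assert all(sum(row) == 100 for row in confusions)

# One constraint per row leaves n * (n - 1) degrees of freedom.
print(n * (n - 1))   # 12
```

The printed value of 12 matches the degrees-of-freedom count for the 2 × 2 factorial design discussed later in this section.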

Because these two methods are so different, we will discuss each in turn.

It is important to note, however, that regardless of which method is used, there are certain nonidentifiabilities in the GRT model that could limit the conclusions that it is possible to draw from any such analyses (e.g., Menneer, Wenger, & Blaha, 2010; Silbert & Thomas, 2013). The problems are most severe when GRT is applied to 2 × 2 identification data (i.e., when the stimuli are A1B1, A1B2, A2B1, and A2B2). For example, Silbert and Thomas (2013) showed that in 2 × 2 applications where there are two linear decision bounds that do not satisfy decisional separability, there always exists an alternative model that makes the exact same empirical predictions and satisfies decisional separability (and these two models are related by an affine transformation). Thus, decisional separability is not testable with standard applications of GRT to 2 × 2 identification data (nor can the slopes of the decision bounds be uniquely estimated). For several reasons, however, these nonidentifiabilities are not catastrophic.

First, the problems don't generally exist with 3 × 3 or larger identification tasks. In the 3 × 3 case the GRT model with linear bounds requires at least 4 decision bounds to divide the perceptual space into 9 response regions (e.g., in a tic-tac-toe configuration). Typically, two will have a generally vertical orientation and two will have a generally horizontal orientation. In this case, there is no affine transformation that guarantees decisional separability except in the special case where the two vertical-tending bounds are parallel and the two horizontal-tending bounds are parallel (because parallel lines remain parallel after affine transformations). Thus, in 3 × 3 (or higher) designs, decisional separability is typically identifiable and testable.

Second, there are simple experimental manipulations that can be added to the basic 2 × 2 identification experiment to test for decisional separability. In particular, switching the locations of the response keys is known to interfere with performance if decisional separability fails, but not if decisional separability holds (Maddox, Glass, O'Brien, Filoteo, & Ashby, 2010; for more information on this, see the section later entitled "Neural Implementations of GRT"). Thus, one could add 100 extra trials to the end of a 2 × 2 identification experiment where the response key locations are randomly interchanged (and participants are informed of this change). If accuracy drops significantly during this period, then decisional separability can be rejected, whereas if accuracy is unaffected then decisional separability is supported.

Third, one could analyze the 2 × 2 data using the newly developed GRT model with individual differences (GRT-wIND; Soto, Vucovich, Musgrave, & Ashby, in press), which was patterned after the INDSCAL model of multidimensional scaling (Carroll & Chang, 1970). GRT-wIND is fit to the data from all individuals simultaneously. All participants are assumed to share the same group perceptual distributions, but different participants are allowed different linear bounds, and they are assumed to allocate different amounts of attention to each perceptual dimension. The model does not suffer from the identifiability problems identified by Silbert and Thomas (2013), even in the 2 × 2 case, because with different linear bounds for each participant there is no affine transformation that simultaneously makes all these bounds satisfy decisional separability.

Fitting the GRT Model to Identification Data

computing the likelihood function

When the full GRT model is fit to identification data, the best-fitting values of all free parameters must be found. Ideally, this is done via the method of maximum likelihood; that is, numerical values of all parameters are found that maximize the likelihood of the data given the model. Let S1, S2, ..., Sn denote the n stimuli in an identification experiment and let R1, R2, ..., Rn denote the n responses. Let rij denote the frequency with which the subject responded Rj on trials when stimulus Si was presented. Thus, rij is the entry in row i and column j of the confusion matrix. Note that the rij are random variables. The entries in each row have a multinomial distribution. In particular, if P(Rj|Si) is the true probability that response Rj is given on trials when stimulus Si is presented, then the probability of observing the response frequencies ri1, ri2, ..., rin in row i equals

[ni! / (ri1! ri2! ··· rin!)] P(R1|Si)^ri1 P(R2|Si)^ri2 ··· P(Rn|Si)^rin   (6)

where ni is the total number of times that stimulus Si was presented during the course of the experiment.
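Eq. 6 is the ordinary multinomial probability, and it can be sketched directly (the row frequencies and response probabilities below are hypothetical):

```python
from math import factorial, prod

def row_probability(row, probs):
    """Multinomial probability (Eq. 6) of observing the response
    frequencies in one row of the confusion matrix, given the true
    response probabilities P(Rj | Si) for that stimulus."""
    n_i = sum(row)                     # presentations of stimulus Si
    coef = factorial(n_i)
    for r in row:
        coef //= factorial(r)          # ni! / (ri1! ri2! ... rin!)
    return coef * prod(p ** r for p, r in zip(probs, row))

# Hypothetical row: stimulus Si presented 10 times, four possible responses
print(row_probability([7, 2, 1, 0], [0.7, 0.15, 0.1, 0.05]))
```

The result is a single row's contribution to the joint likelihood of the whole confusion matrix, which is built in the next step.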

The probability or joint likelihood of observing the entire confusion matrix is the product of the probabilities of observing each row; that is,

L = ∏i=1..n [ ni! / ∏j=1..n rij! ] ∏j=1..n P(Rj|Si)^rij   (7)

General recognition theory models predict that P(Rj|Si) has a specific form. Specifically, they predict that P(Rj|Si) is the volume in the Rj response region under the multivariate distribution of perceptual effects elicited when stimulus Si is presented. This requires computing a multiple integral. The maximum likelihood estimators of the GRT model parameters are those numerical values of each parameter that maximize L. Note that the first term in Eq. 7 does not depend on the values of any model parameters. Rather, it depends only on the data. Thus, the parameter values that maximize the second term also maximize the whole expression. For this reason, the first term can be ignored during the maximization process. Another common practice is to take logs of both sides of Eq. 7. Parameter values that maximize L will also maximize any monotonic function of L (and log is a monotonic transformation). So, the standard approach is to find values of the free parameters that maximize

∑i=1..n ∑j=1..n rij log P(Rj|Si)   (8)

estimating the parameters

In the case of the multivariate normal model, the predicted probability P(Rj|Si) in Eq. 8 equals the volume under the multivariate normal pdf that describes the subject's perceptual experiences on trials when stimulus Si is presented over the response region associated with response Rj. To estimate the best-fitting parameter values using a standard minimization routine, such integrals must be evaluated many times. If decisional separability is assumed, then the problem simplifies considerably. For example, under these conditions, Wickens (1992) derived the first and second derivatives necessary to quickly estimate parameters of the model using the Newton-Raphson method. Other methods must be used for more general models that do not assume decisional separability. Ennis and Ashby (2003) proposed an efficient algorithm for evaluating the integrals that arise when fitting any GRT model. This algorithm allows the parameters of virtually any GRT model to be estimated via standard minimization software. The remainder of this section describes this method.

The left side of Figure 2.2 shows a contour of equal likelihood from the bivariate normal distribution that describes the perceptual effects of stimulus Si, and the solid lines denote two possible decision bounds in this hypothetical task. In Figure 2.2 the bounds are linear, but the method works for any number of bounds that have any parametric form. The shaded region is the Rj response region. Thus, according to GRT, computing P(Rj|Si) is equivalent to computing the volume under the Si perceptual distribution in the Rj response region. This volume is indicated by the shaded region in the figure. First note that any linear bound can be written in discriminant function form as

h(x1, x2) = h(x) = bx + c = 0   (9)

where (in the bivariate case) x is the column vector (x1, x2)′, b is the row vector (b1, b2), and c is a constant. The discriminant function form of any decision bound has the property that positive values are obtained if any point on one side of the bound is inserted into the function, and negative values are obtained if any point on the opposite side is inserted. So, for example, in Figure 2.2, the constants b1, b2, and c can be selected so that h1(x) > 0 for any point x above the h1 bound and h1(x) < 0 for any point below the bound. Similarly, for the h2 bound, the constants can be selected so that h2(x) > 0 for any point to the right of the bound and h2(x) < 0 for any point to the left. Note that under these conditions, the Rj response region is defined as the set of all x such that h1(x) > 0 and h2(x) > 0. Therefore, if we denote the multivariate normal (mvn) pdf for stimulus Si as mvn(μi, Σi), then

P(Rj|Si) = ∫∫_{h1(x) > 0, h2(x) > 0} mvn(μi, Σi) dx1 dx2   (10)

Ennis and Ashby (2003) showed how to quickly approximate integrals of this type. The basic idea is to transform the problem using a multivariate form of the well-known z transformation. Ennis and Ashby proposed using the Cholesky transformation. Any random vector x that has a multivariate normal distribution can always be rewritten as

x = Pz + μ,   (11)

where μ is the mean vector of x, z is a random vector with a multivariate z distribution (i.e., a multivariate normal distribution with mean vector 0 and variance-covariance matrix equal to the identity

matrix I), and P is a lower triangular matrix such that PP′ = Σ (i.e., the variance-covariance matrix of x). If x is bivariate normal, then

P = ( σ1          0                       )
    ( cov12/σ1    (σ2² − cov12²/σ1²)^1/2 )   (12)

The Cholesky transformation is linear (see Eq. 11), so linear bounds in x space are transformed to linear bounds in z space. In particular,

hk(x) = bx + c = 0

becomes

hk(Pz + μ) = b(Pz + μ) + c = 0,

or equivalently

hk*(z) = (bP)z + (bμ + c) = 0   (13)

Thus, in this way we can transform the Eq. 10 integral to

P(Rj|Si) = ∫∫_{h1(x) > 0, h2(x) > 0} mvn(μi, Σi) dx1 dx2
         = ∫∫_{h1*(z) > 0, h2*(z) > 0} mvn(0, I) dz1 dz2   (14)

Fig. 2.2 Schematic illustration of how numerical integration is performed in the multivariate normal GRT model via Cholesky factorization.

The right and left panels of Figure 2.2 illustrate these two integrals. The key to evaluating the second of these integrals quickly is to preload z-values that are centered in equal-area intervals. In Figure 2.2 each gray point in the right panel has a z1 coordinate that is the center of an interval with area 0.10 under the z distribution (since there are 10 points). Taking the Cartesian product of these 10 points produces a table of 100 ordered pairs (z1, z2) that are each the center of a rectangle with volume 0.01 (i.e., 0.10 × 0.10) under the bivariate z distribution. Given such a table, the Eq. 14 integral is evaluated by stepping through all (z1, z2) points in the table. Each point is substituted into Eq. 13 for k = 1 and 2, and the signs of h1*(z1, z2) and h2*(z1, z2) are determined. If h1*(z1, z2) > 0 and h2*(z1, z2) > 0, then the Eq. 14 integral is incremented by 0.01. If either or both of these signs are negative, then the value of the integral is unchanged. So the value of the integral is approximately equal to the number of (z1, z2) points that are in the Rj response region divided by the total number of (z1, z2) points in the table. Figure 2.2 shows a 10 × 10 grid of (z1, z2) points, but better results can be expected from a grid with higher resolution. We have had success with a 100 × 100 grid, which should produce approximations to the integral that are accurate to within 0.0001 (Ashby, Waldron, Lee, & Berkman, 2001).
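The grid approximation described above can be sketched in a few lines. The function below is an illustrative reimplementation, not Ennis and Ashby's own code: it builds the 2-D Cholesky factor of Eq. 12, maps equal-area z points through Eq. 11, and counts the proportion of points satisfying every bound. For convenience it checks the bounds in x space, which is equivalent to transforming them to z space via Eq. 13:

```python
from statistics import NormalDist
from math import sqrt

def response_probability(mu, Sigma, bounds, grid=100):
    """Approximate P(Rj | Si) (Eq. 14) for a bivariate normal percept
    distribution and linear bounds h(x) = b1*x1 + b2*x2 + c.
    `bounds` is a list of (b1, b2, c) triples; the response region is
    the set of x with every h(x) > 0."""
    z = NormalDist()
    m1, m2 = mu
    # Eq. 12: lower triangular P with P P' = Sigma
    s1 = sqrt(Sigma[0][0])
    p21 = Sigma[0][1] / s1
    p22 = sqrt(Sigma[1][1] - p21 ** 2)
    # z-values centered in equal-area intervals of the standard normal
    centers = [z.inv_cdf((k + 0.5) / grid) for k in range(grid)]
    hits = 0
    for z1 in centers:
        x1 = s1 * z1 + m1                    # x = Pz + mu (Eq. 11)
        for z2 in centers:
            x2 = p21 * z1 + p22 * z2 + m2
            if all(b1 * x1 + b2 * x2 + c > 0 for b1, b2, c in bounds):
                hits += 1
    # integral = (points inside the region) / (total points)
    return hits / grid ** 2

# Sanity check: standard bivariate normal, single vertical bound x1 > 0
p = response_probability([0, 0], [[1, 0], [0, 1]], [(1, 0, 0)])
print(round(p, 3))   # 0.5
```

With the 100 × 100 grid the sketch recovers P = 0.5 for a standard bivariate normal split by a vertical bound through its mean, as expected.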

evaluating goodness of fit

As indicated before, one popular method for testing an assumption about perceptual or decisional processing is to fit two versions of a GRT model to the data using the procedures outlined in this section. In the first, restricted version of the model, a number of parameters are set to values that reflect the assumption being tested. For example, fixing all correlations to zero would test perceptual independence. In the second, unrestricted version of the model, the same parameters are free to vary. Once the restricted and unrestricted versions of the model have been fit, they can be compared through a likelihood ratio test:

Λ = −2(log LR − log LU)   (15)

where LR and LU represent the likelihoods of the restricted and unrestricted models, respectively. Under the null hypothesis that the restricted model is correct, the statistic Λ has a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the restricted and unrestricted models.

If several non-nested models were fitted to the data, we would usually want to select the best candidate from this set. The likelihood ratio test cannot be used to select among such non-nested models. Instead, we can compute the Akaike information criterion (AIC; Akaike, 1974) or the Bayesian information criterion (BIC; Schwarz, 1978):

AIC = −2 log L + 2m   (16)
BIC = −2 log L + m log N   (17)

where m is the number of free parameters in the model and N is the number of data points being fit. When the sample size is small compared to the number of free parameters of the model, as in most applications of GRT, a correction factor equal to 2m(m + 1)/(n − m − 1) should be added to the AIC (see Burnham & Anderson, 2004). The best model is the one with the smallest AIC or BIC.

Because an n × n confusion matrix has n(n − 1) degrees of freedom, the maximum number of free parameters that can be estimated from any confusion matrix is n(n − 1). The origin and unit of measurement on each perceptual dimension are arbitrary. Therefore, without loss of generality, the mean vector of one perceptual distribution can be set to 0, and all variances of that distribution can be set to 1.0. Therefore, if there are two perceptual dimensions and n stimuli, then the full GRT model has 5(n − 1) + 1 free distributional parameters (i.e., n − 1 stimuli have 5 free parameters each, namely 2 means, 2 variances, and a covariance, and the distribution with mean 0 and all variances set to 1 has 1 free parameter, a covariance). If linear bounds are assumed, then another 2 free parameters must be added for every bound (e.g., slope and intercept). With a factorial design (e.g., as when the stimulus set is A1B1, A1B2, A2B1, and A2B2), there must be at least one bound on each dimension to separate each pair of consecutive component levels. So for stimuli A1B1, A1B2, A2B1, and A2B2, at least two bounds are required (e.g., see Figure 2.1). If, instead, there are 3 levels of each component, then at least 4 bounds are required. The confusion matrix from a 2 × 2 factorial experiment has 12 degrees of freedom. The full model has more free parameters than this, so it cannot be fit to the data from this experiment. As a result, some restrictive assumptions are required. In a 3 × 3 factorial experiment, however, the confusion matrix has 72 degrees of freedom (9 × 8) and the full model has 49 free parameters (i.e., 41 distributional parameters and 8 decision bound parameters), so the full model can be fit to identification data when there are at least 3 levels of each stimulus dimension. For an alternative to the GRT identification model presented in this section, see Box 2.

Box 2  GRT Versus the Similarity-Choice Model

The most widely known alternative identification model is the similarity-choice model (SCM; Luce, 1963; Shepard, 1957), which assumes that

P(Rj|Si) = ηij βj / ∑k ηik βk

where ηij is the similarity between stimuli Si and Sj and βj is the bias toward response Rj. The SCM has had remarkable success. For many years, it was the standard against which competing models were compared. For example, in 1992 J. E. K. Smith summarized its performance by concluding that the SCM "has never had a serious competitor as a model of identification data. Even when it has provided a poor model of such data, other models have done even less well" (p. 199). Shortly thereafter, however, the GRT model ended this dominance, at least for identification data collected from experiments with stimuli that differ on only a couple of stimulus dimensions. In virtually every such comparison, the GRT model has provided a substantially better fit than the SCM, in many cases with fewer free parameters (Ashby et al., 2001). Even so, it is important to note that the SCM is still valuable, especially in the case of identification experiments in which the stimuli vary on many unknown stimulus dimensions.
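The model-comparison statistics of Eqs. 15-17 are simple to compute once the maximized log-likelihoods are in hand (the log-likelihood values and parameter counts below are hypothetical):

```python
from math import log

def lr_statistic(logL_restricted, logL_unrestricted):
    """Eq. 15: likelihood ratio statistic for nested GRT models."""
    return -2.0 * (logL_restricted - logL_unrestricted)

def aic(logL, m):
    """Eq. 16: Akaike information criterion (m = free parameters)."""
    return -2.0 * logL + 2 * m

def bic(logL, m, N):
    """Eq. 17: Bayesian information criterion (N = data points fit)."""
    return -2.0 * logL + m * log(N)

# Hypothetical maximized log-likelihoods for a restricted model
# (e.g., perceptual independence imposed) and an unrestricted model:
print(round(lr_statistic(-520.3, -512.8), 2))   # 15.0
print(aic(-512.8, 20) < aic(-520.3, 15))        # True: unrestricted wins on AIC
```

For the likelihood ratio test, the printed statistic would be referred to a Chi-squared distribution with 20 − 15 = 5 degrees of freedom; AIC and BIC additionally allow non-nested models to be compared.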

The Summary Statistics Approach

The summary statistics approach (Ashby & Townsend, 1986; Kadlec & Townsend, 1992a, 1992b) draws inferences about perceptual independence, perceptual separability, and decisional separability by using summary statistics that are easily computed from a confusion matrix. Consider again the factorial identification experiment with 2 levels of 2 stimulus components. As before, we will denote the stimuli in this experiment as A1B1, A1B2, A2B1, and A2B2. In this case, it is convenient to denote the responses as a1b1, a1b2, a2b1, and a2b2. The summary statistics approach operates by computing certain summary statistics that are derived from the 4 × 4 confusion matrix that results from this experiment. The statistics are computed at either the macro- or micro-level of analysis.

Fig. 2.3 Diagram explaining the relation between macro-analytic summary statistics and the concepts of perceptual and decisional separability.

macro-analyses

Macro-analyses draw conclusions about perceptual and decisional separability from changes in accuracy, sensitivity, and bias measures computed for one dimension across levels of a second dimension. One of the most widely used summary statistics in macro-analysis is marginal response invariance, which holds for a dimension when the probability of identifying the correct level of that dimension does not depend on the level of any irrelevant dimensions (Ashby & Townsend, 1986). For example, marginal response invariance requires that the probability of correctly identifying that component A is at level 1 is the same regardless of the level of component B, or in other words that

P(a1 | A1B1) = P(a1 | A1B2)

Now in an identification experiment, A1 can be correctly identified regardless of whether the level of B is correctly identified, and so

P(a1 | A1B1) = P(a1b1 | A1B1) + P(a1b2 | A1B1)

For this reason, marginal response invariance holds on dimension X1 if and only if

P(aib1 | AiB1) + P(aib2 | AiB1) = P(aib1 | AiB2) + P(aib2 | AiB2)   (18)

for both i = 1 and 2. Similarly, marginal response invariance holds on dimension X2 if and only if

P(a1bj | A1Bj) + P(a2bj | A1Bj) = P(a1bj | A2Bj) + P(a2bj | A2Bj)   (19)

for both j = 1 and 2.

Marginal response invariance is closely related to perceptual and decisional separability. In fact, if dimension X1 is perceptually and decisionally separable from dimension X2, then marginal response invariance must hold for X1 (Ashby & Townsend, 1986). In the later section entitled "Extensions to Response Time," we describe how an even stronger test is possible with a response-time version of marginal response invariance. Figure 2.3 helps to understand intuitively why perceptual and decisional separability together imply marginal response invariance. The top of the figure shows perceptual distributions of four stimuli that vary on two dimensions. Dimension X1 is decisionally but not perceptually separable from dimension X2; the distance between the means of the perceptual distributions along the X1 axis is much greater for the top two stimuli than for the bottom two stimuli. The marginal distributions at the bottom of Figure 2.3 show that the proportion of correct responses, represented by the light-grey areas under the curves, is larger in the second level of X2 than in the first level. The result would be similar if perceptual separability held and decisional separability failed, as would be the case for X2 if its decision bound was not perpendicular to its main axis.
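Marginal response invariance (Eq. 18) can be checked directly from the confusion matrix. A sketch with hypothetical counts (100 presentations per stimulus):

```python
# Hypothetical confusion-matrix counts (100 presentations per stimulus);
# outer keys are stimuli, inner keys are responses.
counts = {
    "A1B1": {"a1b1": 70, "a1b2": 10, "a2b1": 15, "a2b2": 5},
    "A1B2": {"a1b1": 12, "a1b2": 68, "a2b1": 6, "a2b2": 14},
}

# Eq. 18 with i = 1: P(a1b1|A1B1) + P(a1b2|A1B1)
#                  = P(a1b1|A1B2) + P(a1b2|A1B2)
left = (counts["A1B1"]["a1b1"] + counts["A1B1"]["a1b2"]) / 100
right = (counts["A1B2"]["a1b1"] + counts["A1B2"]["a1b2"]) / 100
print(left == right)   # True: consistent with marginal response invariance on X1
```

A full test repeats this for i = 2 (the A2 stimuli) and, as described next, assesses the difference statistically rather than requiring exact equality.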

elementary cognitive mechanisms

if its decision bound was not perpendicular to its main axis.

To test marginal response invariance in dimension X1, we estimate the various probabilities in Eq. 18 from the empirical confusion matrix that results from this identification experiment. Next, equality between the two sides of Eq. 18 is assessed via a standard statistical test. These computations are repeated for both levels of component A, and if either of the two tests is significant, then we conclude that marginal response invariance fails, and, therefore, that either perceptual or decisional separability is violated.

The left side of Eq. 18 equals P(ai | AiB1) and the right side equals P(ai | AiB2). These are the probabilities that component Ai is correctly identified and are analogous to "hit" rates in signal detection theory. To emphasize this relationship, we define the identification hit rate of component Ai on trials when stimulus AiBj is presented as

Hai|AiBj = P(ai | AiBj) = P(aib1 | AiBj) + P(aib2 | AiBj)   (20)

The analogous false-alarm rates can be defined similarly. For example,

Fa2|A1Bj = P(a2 | A1Bj) = P(a2b1 | A1Bj) + P(a2b2 | A1Bj)   (21)

In Figure 2.3, note that the dark-grey areas in the marginal distributions equal Fa1|A2B2 (top) and Fa1|A2B1 (bottom). In signal detection theory, hit and false-alarm rates are used to measure stimulus discriminability (i.e., d′). We can use the identification analogues to compute marginal discriminabilities for each stimulus component (Thomas, 1999). For example,

d′ABj = Φ⁻¹(Ha2|A2Bj) − Φ⁻¹(Fa2|A1Bj)   (22)

where the function Φ⁻¹ is the inverse cumulative distribution function for the standard normal distribution. As shown in Figure 2.3, the value of d′ABj represents the standardized distance between the means of the perceptual distributions of stimuli A1Bj and A2Bj. If component A is perceptually separable from component B, then the marginal discriminabilities between the two levels of A must be the same for each level of B – that is, d′AB1 = d′AB2 (Kadlec & Townsend, 1992a, 1992b). Thus, if this equality fails, then perceptual separability is violated. The equality between two d′s can be tested using the following statistic (Marascuilo, 1970):

Z = (d′1 − d′2) / √(s²d′1 + s²d′2)   (23)

where

s²d′ = F(1 − F) / [nn · φ(Φ⁻¹(F))²] + H(1 − H) / [ns · φ(Φ⁻¹(H))²]   (24)

where φ is the standard normal probability density function, H and F are the hit and false-alarm rates associated with the relevant d′, nn is the number of trials used to compute F, and ns is the number of trials used to compute H. Under the null hypothesis of equal d′s, Z follows a standard normal distribution.

Marginal hit and false-alarm rates can also be used to compute a marginal response criterion. Several measures of response criterion and bias have been proposed (see Chapter 2 of Macmillan & Creelman, 2005), but perhaps the most widely used criterion measure in recent years (due to Kadlec, 1999) is:

cABj = Φ⁻¹(Fa1|A2Bj)   (25)

As shown in Figure 2.3, this measure represents the placement of the decision bound relative to the center of the A2Bj distribution. If component A is perceptually separable from component B, but cAB1 ≠ cAB2, then decisional separability must have failed on dimension X1 (Kadlec & Townsend, 1992a, 1992b). On the other hand, if perceptual separability is violated, then examining the marginal response criteria provides no information about decisional separability. To understand why this is the case, note that in Figure 2.3 the marginal c values are not equal, even though decisional separability holds. A failure of perceptual separability has affected measures of both discriminability and response criteria.

To test the difference between two c values, the following test statistic can be used (Kadlec, 1999):

Z = (c1 − c2) / √(s²c1 + s²c2)   (26)

where

s²c = F(1 − F) / [nn · φ(Φ⁻¹(F))²]   (27)
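All of the macro-analytic statistics above (Eqs. 18–27) are straightforward to compute from a confusion matrix. The sketch below is our own illustration, not code from the chapter; it uses only Python's standard library and the simulated counts from Table 2.1 later in this chapter (with A = age and B = gender), and its results match the age-dimension entries of Table 2.2 up to rounding:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()  # standard normal: nd.cdf, nd.pdf, nd.inv_cdf

# Confusion matrix from Table 2.1; rows = stimuli, columns = responses,
# both ordered A1B1, A1B2, A2B1, A2B2 (A = age: teen/adult; B = gender: male/female).
conf = [[140, 36, 34, 40],    # A1B1 = Male/Teen
        [89, 91, 4, 66],      # A1B2 = Female/Teen
        [85, 5, 90, 70],      # A2B1 = Male/Adult
        [20, 59, 8, 163]]     # A2B2 = Female/Adult
n = 250                       # presentations per stimulus

def z_mri(i):
    """Pooled two-proportion z test of Eq. 18 for component A at level i."""
    rows = (0, 1) if i == 1 else (2, 3)      # stimuli A_iB1 and A_iB2
    cols = (0, 1) if i == 1 else (2, 3)      # responses naming a_i
    p1, p2 = (sum(conf[r][c] for c in cols) / n for r in rows)
    p = (p1 + p2) / 2                        # pooled proportion (equal n per stimulus)
    return (p1 - p2) / sqrt(p * (1 - p) * 2 / n)

def dprime(H, F):                            # Eq. 22
    return nd.inv_cdf(H) - nd.inv_cdf(F)

def var_d(H, F):                             # Eq. 24, with n_s = n_n = n
    return (F * (1 - F) / (n * nd.pdf(nd.inv_cdf(F)) ** 2)
            + H * (1 - H) / (n * nd.pdf(nd.inv_cdf(H)) ** 2))

# Marginal rates for the age dimension: H = P(report adult | adult face),
# F = P(report adult | teen face), separately for male (B1) and female (B2) faces.
H1, F1 = (90 + 70) / n, (34 + 40) / n
H2, F2 = (8 + 163) / n, (4 + 66) / n

d1, d2 = dprime(H1, F1), dprime(H2, F2)                  # marginal d' per gender
z_d = (d1 - d2) / sqrt(var_d(H1, F1) + var_d(H2, F2))    # Eq. 23

# Marginal criteria (Eq. 25) and their test (Eqs. 26-27)
c1, c2 = nd.inv_cdf((85 + 5) / n), nd.inv_cdf((20 + 59) / n)
var_c = lambda F: F * (1 - F) / (n * nd.pdf(nd.inv_cdf(F)) ** 2)
z_c = (c1 - c2) / sqrt(var_c((85 + 5) / n) + var_c((20 + 59) / n))
```

Running this gives z values near −0.39 and −1.04 for marginal response invariance on age, d′ values near 0.89 and 1.06 with z ≈ −1.01, and c values near −0.36 and −0.48 with z ≈ 1.04 – the same figures reported for the age dimension in Table 2.2.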

Micro-analyses
Macro-analyses focus on properties of the entire stimulus ensemble. In contrast, micro-analyses test assumptions about perceptual independence and decisional separability by examining summary statistics computed for only one or two stimuli.

The most widely used test of perceptual independence is via sampling independence, which holds when the probability of reporting a combination of components, P(aibj), equals the product of the probabilities of reporting each component alone, P(ai)P(bj). For example, sampling independence holds for stimulus A1B1 if and only if

P(a1b1 | A1B1) = P(a1 | A1B1) × P(b1 | A1B1)
= [P(a1b1 | A1B1) + P(a1b2 | A1B1)] × [P(a1b1 | A1B1) + P(a2b1 | A1B1)]   (28)

Sampling independence provides a strong test of perceptual independence if decisional separability holds. In fact, if decisional separability holds on both dimensions, then sampling independence holds if and only if perceptual independence holds (Ashby & Townsend, 1986). Figure 2.4A gives an intuitive illustration of this theoretical result. Two cases are presented in which decisional separability holds on both dimensions and the decision bounds cross at the mean of the perceptual distribution. In the distribution to the left, perceptual independence holds and it is easy to see that all four responses are equally likely. Thus, the volume of this bivariate normal distribution in response region R4 = a2b2 is 0.25. It is also easy to see that half of each marginal distribution lies above its relevant decision criterion (i.e., the two shaded regions), so P(a2) = P(b2) = 0.5. As a result, sampling independence is satisfied, since P(a2b2) = P(a2) × P(b2). It turns out that this relation holds regardless of where the bounds are placed, as long as they remain perpendicular to the dimension that they divide. The distribution to the right of Figure 2.4A has the same variances as the previous distribution and, therefore, the same marginal response proportions for a2 and b2. However, in this case the covariance is larger than zero, and it is clear that P(a2b2) > 0.25.

Perceptual independence can also be assessed through discriminability and criterion measures computed for one dimension conditioned on the perceived value on the other dimension. Figure 2.4B shows the perceptual distributions of two stimuli that share the same level of component B (i.e., B1) and have the same perceptual mean on dimension X2. The decision bound perpendicular to X2 separates the perceptual plane into two regions: percepts falling in the upper region elicit an incorrect response on component B (i.e., a miss for B), whereas percepts falling in the lower region elicit a correct B response (i.e., a hit). The bottom of the figure shows the marginal distribution for each stimulus conditioned on whether B is a hit or a miss. When perceptual independence holds, as is the case for the stimulus to the left, these conditional distributions have the same mean. On the other hand, when perceptual independence does not hold, as is the case for the stimulus to the right, the conditional distributions have different means, which is reflected in different d′ and c values depending on whether there is a hit or a miss on B. If decisional separability holds, differences in the conditional d′s and cs are evidence of violations of perceptual independence (Kadlec & Townsend, 1992a, 1992b).

Conditional d′ and c values can be computed from hit and false-alarm rates for two stimuli differing in one dimension, conditioned on the reported level of the second dimension. For example, for the pair A1B1 and A2B1, conditioned on a hit on B, the hit rate for A is P(a1b1 | A1B1) and the false-alarm rate is P(a1b1 | A2B1). Conditioned on a miss on B, the hit rate for A is P(a1b2 | A1B1) and the false-alarm rate is P(a1b2 | A2B1). These values are used as input to Eqs. 22–27 to reach a statistical conclusion.

Note that if perceptual independence and decisional separability both hold, then the tests based on sampling independence and on equal conditional d′ and c should lead to the same conclusion. If only one of these two tests holds and the other fails, this indicates a violation of decisional separability (Kadlec & Townsend, 1992a, 1992b).

An Empirical Example
In this section we show with a concrete example how to analyze the data from an identification experiment using GRT. We will first analyze the data by fitting GRT models to the identification confusion matrix, and then we will conduct summary statistics analyses on the same data. Finally, we will compare the results from the two separate analyses. Imagine that you are a researcher interested in how the age and gender of faces interact during face recognition. You run an experiment in which subjects must identify four stimuli, the combination of two levels of age (teen and adult) and two levels of gender (male and female). Each stimulus is presented 250 times, for a total of 1,000

[Figure 2.4 appears here. Panel A shows two bivariate distributions, each partitioned into response regions R1–R4 by decision bounds. Panel B shows bivariate distributions f11 and f21 with hit and miss regions for component B, and below them the marginal distributions g11(X1 | B hit), g21(X1 | B hit), g11(X1 | B miss), and g21(X1 | B miss), with d′A|B hit, d′A|B miss, |cA|B hit|, and |cA|B miss| marked.]
Fig. 2.4 Diagram explaining the relation between micro-analytic summary statistics and the concepts of perceptual independence and decisional separability. Panel A focuses on sampling independence and Panel B on conditional signal detection measures.

trials in the whole experiment. The data to be analyzed are summarized in the confusion matrix displayed in Table 2.1. These data were generated by random sampling from the model shown in Figure 2.5A.

The advantage of generating artificial data from this model is that we know in advance what conclusions should be reached by our analyses. For example, note that decisional separability holds in the Figure 2.5A model. Also, because the distance between the "male" and "female" distributions is larger for "adult" than for "teen," gender is not perceptually separable from age. In contrast, the "adult" and "teen" marginal distributions are the same across levels of gender, so age is perceptually separable from gender. Finally, because all distributions show a positive correlation, perceptual independence is violated for all stimuli.

A hierarchy of models was fit to the data in Table 2.1 using maximum likelihood estimation (as in Ashby et al., 2001; Thomas, 2001). Because there are only 12 degrees of freedom in the data, some parameters were fixed for all models. Specifically, all variances were assumed to be equal to one and decisional separability was assumed for both dimensions. Figure 2.5C shows the hierarchy of models used for the analysis, together with the number of free parameters m for each of them. In this figure, PS stands for perceptual separability, PI for perceptual independence, DS for decisional separability, and 1_RHO describes a model with a single correlation parameter for all distributions. Note that several other models could be tested, depending on specific research goals and hypotheses, or on the results from summary statistics analysis.

The arrows in Figure 2.5C connect models that are nested within each other. The results of likelihood ratio tests comparing such nested models are displayed next to each arrow, with an asterisk representing significantly better fit for the more general model (lower in the hierarchy) and n.s. representing a nonsignificant difference in fit. Starting at the top of the hierarchy, it

[Figure 2.5 appears here. Panels A and B each plot Gender (Male, Female) against Age (Teen, Adult). Panel C shows the model hierarchy:
    Level 1: {PI, PS, DS} (m = 4)
    Level 2: {PI, PS(Gender), DS} (m = 6); {PI, PS(Age), DS} (m = 6); {1_RHO, PS, DS} (m = 5)
    Level 3: {PI, DS} (m = 8); {1_RHO, PS(Gender), DS} (m = 7); {1_RHO, PS(Age), DS} (m = 7); {PS, DS} (m = 8)
    Level 4: {1_RHO, DS} (m = 9); {PS(Gender), DS} (m = 10); {PS(Age), DS} (m = 10)
Arrows between nested models are marked * (significant likelihood ratio test) or n.s. (nonsignificant).]

Fig. 2.5 Results of analyzing the data in Table 2.1 with GRT. Panel A shows the GRT model that was used to generate the data. Panel B shows the recovered model from the model fitting and selection process. Panel C shows the hierarchy of models used for the analysis and the number of free parameters (m) in each. PI stands for perceptual independence, PS for perceptual separability, DS for decisional separability, and 1_RHO for a single correlation in all distributions.
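Navigating a hierarchy like this requires the maximized log-likelihood of each model. For confusion-matrix data the likelihood is multinomial, so ln L and AIC reduce to a few lines of code. The sketch below is our own illustration (the chapter does not give code); `pred` stands for the predicted response probabilities of an already-fitted GRT model, which we do not reproduce here:

```python
from math import log

def log_likelihood(conf, pred):
    """Multinomial log-likelihood: sum of n_ij * ln p_ij over the confusion
    matrix, where pred[i][j] is the model's P(response j | stimulus i)."""
    return sum(nij * log(pij)
               for row_n, row_p in zip(conf, pred)
               for nij, pij in zip(row_n, row_p) if nij > 0)

def aic(conf, pred, m):
    """AIC = 2m - 2 ln L for a model with m free parameters;
    the model with the smaller value is preferred."""
    return 2 * m - 2 * log_likelihood(conf, pred)
```

Comparing the non-nested {PS, DS} (m = 8) and {1_RHO, PS(Age), DS} (m = 7) models amounts to evaluating `aic` for each and keeping the smaller value; for nested models, 2(ln L_general − ln L_restricted) can instead be referred to a chi-square distribution with degrees of freedom equal to the difference in m.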

Table 2.1. Data from a simulated identification experiment with four face stimuli, created by factorially combining two levels of gender (male and female) and two levels of age (teen and adult).

                          Response
Stimulus        Male/Teen  Female/Teen  Male/Adult  Female/Adult
Male/Teen          140         36           34           40
Female/Teen         89         91            4           66
Male/Adult          85          5           90           70
Female/Adult        20         59            8          163
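As a quick check on these data, sampling independence (Eq. 28) can be evaluated directly from any row of Table 2.1. A minimal sketch for the Male/Teen stimulus (our own code, with gender and age playing the roles of the two components):

```python
row = [140, 36, 34, 40]   # Male/Teen row of Table 2.1
n = sum(row)              # 250 presentations of this stimulus

observed = row[0] / n                  # P(respond Male/Teen) = 0.56
p_male = (row[0] + row[2]) / n         # marginal P(report gender = male)
p_teen = (row[0] + row[1]) / n         # marginal P(report age = teen)
expected = p_male * p_teen             # about 0.49 if sampling independence held
```

These are the Expected and Observed proportions in the first row of the sampling-independence panel of Table 2.2; the discrepancy (0.56 vs. 0.49) is what the reported z test evaluates.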

Table 2.2. Results of the summary statistics analysis for the simulated Gender × Age identification experiment.

Macroanalyses

Marginal Response Invariance
Test                                        Result                 Conclusion
Equal P(Gender=Male) across all Ages        z = −0.09, p > .1      Yes
Equal P(Gender=Female) across all Ages      z = −7.12, p < .001    No
Equal P(Age=Teen) across all Genders        z = −0.39, p > .1      Yes
Equal P(Age=Adult) across all Genders       z = −1.04, p > .1      Yes

Marginal d′
Test                                  d′ for level 1  d′ for level 2  Result                Conclusion
Equal d′ for Gender across all Ages        0.84            1.74       z = −5.09, p < .001   No
Equal d′ for Age across all Genders        0.89            1.06       z = −1.01, p > .1     Yes

Marginal c
Test                                  c for level 1   c for level 2   Result                Conclusion
Equal c for Gender across all Ages        −0.33           −1.22       z = 6.72, p < .001    No
Equal c for Age across all Genders        −0.36           −0.48       z = 1.04, p > .1      Yes

Microanalyses

Sampling Independence
Stimulus       Response       Expected Prop.  Observed Prop.  Result                Conclusion
Male/Teen      Male/Teen          0.49            0.56        z = 1.57, p > .1      Yes
Male/Teen      Female/Teen        0.21            0.14        z = −2.05, p < .05    No
Male/Teen      Male/Adult         0.21            0.14        z = −2.24, p < .05    No
Male/Teen      Female/Adult       0.09            0.16        z = 2.29, p < .05     No
Female/Teen    Male/Teen          0.27            0.36        z = 2.14, p < .05     No
Female/Teen    Female/Teen        0.45            0.36        z = −2.01, p < .05    No
Female/Teen    Male/Adult         0.23            0.02        z = −7.80, p < .001   No
Female/Teen    Female/Adult       0.39            0.26        z = −3.13, p < .01    No
Male/Adult     Male/Teen          0.25            0.34        z = 2.17, p < .05     No
Male/Adult     Female/Teen        0.11            0.02        z = −4.09, p < .001   No
Male/Adult     Male/Adult         0.21            0.36        z = 3.77, p < .001    No
Male/Adult     Female/Adult       0.09            0.28        z = 5.64, p < .001    No
Female/Adult   Male/Teen          0.04            0.08        z = 2.15, p < .05     No
Female/Adult   Female/Teen        0.28            0.24        z = −1.14, p > .1     Yes
Female/Adult   Male/Adult         0.10            0.03        z = −3.07, p < .01    No
Female/Adult   Female/Adult       0.79            0.65        z = −3.44, p < .001   No

Conditional d′
Test                                    d′|Hit   d′|Miss   Result                Conclusion
Equal d′ for Gender when Age=Teen        0.84     1.48     z = −2.04, p < .05    No
Equal d′ for Gender when Age=Adult       1.83     2.26     z = −1.27, p > .1     Yes
Equal d′ for Age when Gender=Male        0.89     1.44     z = −1.80, p > .05    Yes
Equal d′ for Age when Gender=Female      0.83     1.15     z = −0.70, p > .1     Yes

Conditional c
Test                                    c|Hit    c|Miss    Result                Conclusion
Equal c for Gender when Age=Teen        −0.014   −1.58     z = 6.03, p < .001    No
Equal c for Gender when Age=Adult       −1.68    −0.66     z = −4.50, p < .001   No
Equal c for Age when Gender=Male        −0.04    −1.50     z = 6.05, p < .001    No
Equal c for Age when Gender=Female      −0.63     0.57     z = −4.46, p < .001   No
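The conditional entries of Table 2.2 can likewise be recomputed from Table 2.1. In the sketch below (our own code; note that "conditioning" on a hit or miss on B is read here as renormalizing the counts within the two response columns that share that reported level of B, an interpretation that reproduces the tabled values), we recover the conditional d′ values for age among male faces:

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf   # inverse standard normal CDF

# Table 2.1; rows/columns ordered Male/Teen, Female/Teen, Male/Adult, Female/Adult
conf = [[140, 36, 34, 40],
        [89, 91, 4, 66],
        [85, 5, 90, 70],
        [20, 59, 8, 163]]

def cond_dprime(noise_row, signal_row, a1_col, a2_col):
    """d' for age between two stimuli, conditioned on the gender response
    shared by columns a1_col (a teen response) and a2_col (an adult response)."""
    H = conf[signal_row][a2_col] / (conf[signal_row][a1_col] + conf[signal_row][a2_col])
    F = conf[noise_row][a2_col] / (conf[noise_row][a1_col] + conf[noise_row][a2_col])
    return inv(H) - inv(F)

# Age d' for male faces (rows 0 and 2), conditioned on a hit on gender
# (responses naming "male": columns 0 and 2) vs. a miss (columns 1 and 3)
d_hit = cond_dprime(0, 2, 0, 2)
d_miss = cond_dprime(0, 2, 1, 3)
```

Here `d_hit` comes out near 0.89 and `d_miss` near 1.44, matching the "Equal d′ for Age when Gender=Male" row of Table 2.2.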

is possible to find the best candidate models by following the arrows with an asterisk on them down the hierarchy. This leaves the following candidate models: {PS, DS}, {1_RHO, PS(Age), DS}, {1_RHO, DS}, and {PS(Gender), DS}. From this list, we eliminate {1_RHO, DS} because it does not fit significantly better than the more restricted model {1_RHO, PS(Age), DS}. We also eliminate {PS(Gender), DS} because it does not fit better than the more restricted model {PS, DS}. This leaves two candidate models that cannot be compared through a likelihood ratio test, because they are not nested: {PS, DS} and {1_RHO, PS(Age), DS}.

To compare these two models, we can use the BIC or AIC goodness-of-fit measures introduced earlier. The smallest corrected AIC was found for the model {1_RHO, PS(Age), DS} (2,256.43, compared to 2,296.97 for its competitor). This leads to the conclusion that the model that fits these data best assumes perceptual separability of age from gender, violations of perceptual separability of gender from age, and violations of perceptual independence. This model is shown in Figure 2.5B, and it perfectly reproduces the most important features of the model that was used to generate the data. However, note that the quality of this fit depends strongly on the fact that the assumptions used for all the models in Figure 2.5C (decisional separability and all variances equal) are correct in the true model. This will not be the case in many applications, which is why it is always a good idea to complement the model-fitting results with an analysis of summary statistics.

The results from the summary statistics analysis are shown in Table 2.2. The interested reader can directly compute all the values in this table from the data in the confusion matrix (Table 2.1). The macro-analytic tests indicate violations of marginal response invariance, and unequal marginal d′ and c values for the gender dimension, both of which suggest that gender is not perceptually separable from age. These results are uninformative about decisional separability. Marginal response invariance and equal marginal d′ and c values all hold for the age dimension, providing some weak evidence for perceptual and decisional separability of age from gender.

The micro-analytic tests show violations of sampling independence for all stimuli, and conditional c values that are significantly different for all stimulus pairs, suggesting possible violations of perceptual independence and decisional separability. Note that if we assumed decisional separability, as we did to fit models to the data, the results of the micro-analytic tests would lead to the conclusion of failure of perceptual independence.

Thus, the results of the model fitting and summary statistics analyses converge to similar conclusions, which is not uncommon for real applications of GRT. These conclusions turn out to be correct in our example, but note that several of them depend heavily on making correct assumptions about decisional separability and other features of the perceptual and decisional processes generating the observed data.

Extensions to Response Time
There have been a number of extensions of GRT that allow the theory to account both for response accuracy and response time (RT). These have differed in the amount of extra theoretical structure that was added to the theory described earlier. One approach was to add the fewest and least controversial assumptions possible that would allow GRT to make RT predictions. The resulting model succeeds, but it offers no process interpretation of how a decision is reached on each trial. An alternative approach is to add enough theoretical structure to make RT predictions and to describe the perceptual and cognitive processes that generated that decision. We describe each of these approaches in turn.

The RT-Distance Hypothesis
In standard univariate signal detection theory, the most common RT assumption is that RT

decreases with the distance between the perceptual effect and the response criterion (Bindra, Donderi, & Nishisato, 1968; Bindra, Williams, & Wise, 1965; Emmerich, Gray, Watson, & Tanis, 1972; Smith, 1968). The obvious multivariate analog of this, which is known as the RT-distance hypothesis, assumes that RT decreases with the distance between the percept and the decision bound. Considerable experimental support for the RT-distance hypothesis has been reported in categorization experiments in which there is only one decision bound and where more observability is possible (Ashby, Boynton, & Lee, 1994; Maddox, Ashby, & Gottlob, 1998). Efforts to incorporate the RT-distance hypothesis into GRT have been limited to two-choice experimental paradigms, such as categorization or speeded classification, which can be modeled with a single decision bound.

The most general form of the RT-distance hypothesis makes no assumptions about the parametric form of the function that relates RT and distance to bound. The only assumption is that this function is monotonically decreasing. Specific functional forms are sometimes assumed. Perhaps the most common choice is to assume that RT decreases exponentially with distance to bound (Maddox & Ashby, 1996; Murdock, 1985). An advantage of assuming a specific functional form is that it allows direct fitting to empirical RT distributions (Maddox & Ashby, 1996).

Even without any parametric assumptions, however, monotonicity by itself is enough to derive some strong results. For example, consider a filtering task with stimuli A1B1, A1B2, A2B1, and A2B2, and two perceptual dimensions X1 and X2, in which the subject's task on each trial is to name the level of component A. Let PFA(RTi ≤ t | AiBj) denote the probability that the RT is less than or equal to some value t on trials of a filtering task when the subject correctly classified the level of component A. Given this, then the RT analog of marginal response invariance, referred to as marginal RT invariance, can be defined as (Ashby & Maddox, 1994)

PFA(RTi ≤ t | AiB1) = PFA(RTi ≤ t | AiB2)   (29)

for i = 1 and 2 and for all t > 0.

Now assume that the weak version of the RT-distance hypothesis holds (i.e., where no functional form for the RT-distance relationship is specified) and that decisional separability also holds. Then Ashby and Maddox (1994) showed that perceptual separability holds if and only if marginal RT invariance holds for both correct and incorrect responses. Note that this is an if and only if result, which was not true for marginal response invariance. In particular, if decisional separability and marginal response invariance both hold, perceptual separability could still be violated. But if decisional separability, marginal RT invariance, and the RT-distance hypothesis all hold, then perceptual separability must be satisfied. The reason we get the stronger result with RTs is that marginal RT invariance requires that Eq. 29 holds for all values of t, whereas marginal response invariance only requires a single equality to hold. A similar strong result could be obtained with accuracy data if marginal response invariance were required to hold for all possible placements of the response criterion (i.e., the point where the vertical decision bound intersects the X1 axis).

Process Models of RT
At least three different process models have been proposed that account for both RT and accuracy within a GRT framework. Ashby (1989) proposed a stochastic interpretation of GRT that was instantiated in a discrete-time linear system. In effect, the model assumed that each stimulus component provides input into a set of parallel (and linear) mutually interacting perceptual channels. The channel outputs describe a point that moves through a multidimensional perceptual space during processing. With long exposure durations the percept settles into an equilibrium state, and under these conditions the model becomes equivalent to the static version of GRT. However, the model can also be used to make predictions in cases of short exposure durations and when the subject is operating under conditions of speed stress. In addition, this model makes it possible to relate properties like perceptual separability to network architecture. For example, a sufficient condition for perceptual separability to hold is that there is no crossing of the input lines and no crosstalk between channels.

Townsend, Houpt, and Silbert (2012) considerably generalized the stochastic model proposed by Ashby (1989) by extending it to a broad class of parallel processing models. In particular, they considered (almost) any model in which processing on each stimulus dimension occurs in parallel and the stimulus is identified as soon as processing finishes on all dimensions. They began by extending definitions of key GRT concepts, such as perceptual and decisional separability and perceptual independence, to this broad class of parallel models. Next, under the assumption that decisional separability holds, they developed many RT versions of the summary statistics tests considered earlier in this chapter.

Ashby (2000) took a different approach. Rather than specify a processing architecture, he proposed that moment-by-moment fluctuations in the percept could be modeled via a continuous-time multivariate diffusion process. In two-choice tasks with one decision bound, a signed distance is computed to the decision bound at each point in time; that is, in one response region simple distance-to-bound is computed (which is always positive), but in the response region associated with the contrasting response the negative of distance to bound is computed. These values are then continuously integrated, and this cumulative value drives a standard diffusion process with two absorbing barriers, one associated with each response. This stochastic version of GRT is more biologically plausible than the Ashby (1989) version (e.g., see Smith & Ratcliff, 2004) and it establishes links to the voluminous work on diffusion models of decision making.

Neural Implementations of GRT
Of course, the perceptual and cognitive processes modeled by GRT are mediated by circuits in the brain. During the past decade or two, much has been learned about the architecture and functioning of these circuits. Perhaps most importantly, there is now overwhelming evidence that humans have multiple neuroanatomically and functionally distinct learning systems (Ashby & Maddox, 2005; Eichenbaum & Cohen, 2004; Squire, 1992). And most relevant to GRT, the evidence is good that the default decision strategy of one of these systems is decisional separability.

The most complete description of two of the most important learning systems is arguably provided by the COVIS theory of category learning (Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Ashby, Paul, & Maddox, 2011). COVIS assumes separate rule-based and procedural-learning categorization systems that compete for access to response production. The rule-based system uses executive attention and working memory to select and test simple verbalizable hypotheses about category membership. The procedural system gradually associates categorization responses with regions of perceptual space via reinforcement learning. COVIS assumes that rule-based categorization is mediated by a broad neural network that includes the prefrontal cortex, anterior cingulate, head of the caudate nucleus, and the hippocampus, whereas the key structures in the procedural-learning system are the striatum and the premotor cortex.

Virtually all decision rules that satisfy decisional separability are easily verbalized. In fact, COVIS assumes that the rule-based system is constrained to use rules that satisfy decisional separability (at least piecewise). In contrast, the COVIS procedural system has no such constraints. Instead, it tends to learn decision strategies that approximate the optimal bound. As we have seen, decisional separability is optimal only under some special, restrictive conditions. Thus, as a good first approximation, one can assume that decisional separability holds if subjects use their rule-based system, and that decisional separability is likely to fail if subjects use their procedural system.
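Returning briefly to marginal RT invariance (Eq. 29): because the condition must hold at every t, a natural empirical check compares entire RT distributions rather than single summary values. The sketch below is our own illustration with hypothetical RTs; it computes the two-sample Kolmogorov–Smirnov statistic, the maximum gap between the two empirical CDFs, which should be small when Eq. 29 holds:

```python
def ks_statistic(rts1, rts2):
    """sup over t of |F1(t) - F2(t)| for two empirical RT distributions."""
    ecdf = lambda sample, t: sum(r <= t for r in sample) / len(sample)
    ts = sorted(set(rts1) | set(rts2))
    return max(abs(ecdf(rts1, t) - ecdf(rts2, t)) for t in ts)

# Hypothetical correct-response RTs (in seconds) on A_iB1 vs. A_iB2 trials
rt_b1 = [0.41, 0.45, 0.52, 0.58, 0.63, 0.70]
rt_b2 = [0.43, 0.46, 0.50, 0.59, 0.65, 0.72]
D = ks_statistic(rt_b1, rt_b2)   # small D is consistent with Eq. 29
```

A formal test would refer D to the Kolmogorov–Smirnov sampling distribution (available, for instance, in standard statistics packages); the sketch only computes the statistic itself.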
A large literature barriers—one associated with each response. This establishes conditions that favor one system over stochastic version of GRT is more biologically plau- the other. Critical features include the nature of sible than the Ashby (1989) version (e.g., see Smith the optimal decision bound, the instructions given & Ratcliff, 2004) and it establishes links to the to the subjects, and the nature and timing of voluminous work on diffusion models of decision the feedback, to name just a few (e.g., Ashby & making. Maddox, 2005, 2010). For example, Ashby et al. (2001) fit the full GRT identification model to Neural Implementations of GRT data from two experiments. In both, 9 similar Of course, the perceptual and cognitive processes stimuli were constructed by factorially combining modeled by GRT are mediated by circuits in the 3 levels of the same 2 stimulus components. Thus, brain. During the past decade or two, much has in stimulus space, the nine stimuli had the same been learned about the architecture and functioning 3×3 grid configuration in both experiments. In the of these circuits. Perhaps most importantly, there first experiment however, subjects were shown this is now overwhelming evidence that humans have configuration beforehand and the response keypad multiple neuroanatomically and functionally dis- had the same 3 × 3 grid as the stimuli. In the second tinct learning systems (Ashby & Maddox, 2005; experiment, the subjects were not told that the Eichenbaum, & Cohen, 2004; Squire, 1992). And stimuli fell into a grid. Instead, the 9 stimuli were most relevant to GRT, the evidence is good that the randomly assigned responses from the first 9 letters default decision strategy of one of these systems is of the alphabet. In the first experiment, where decisional separability. 
subjects knew about the grid structure, the best-fitting GRT model assumed decisional separability on both stimulus dimensions. In the second experiment, where subjects lacked this knowledge, the decision bounds of the best-fitting GRT model violated decisional separability. Thus, one interpretation of these results is that the instructions biased subjects to use their rule-based system in the first experiment and their procedural system in the second experiment.

As we have consistently seen throughout this chapter, decisional separability greatly simplifies applications of GRT to behavioral data. Thus, researchers who want to increase the probability

that their subjects use decision strategies that satisfy decisional separability should adopt experimental procedures that encourage subjects to use their rule-based learning system. For example, subjects should be told about the factorial nature of the stimuli, the response device should map onto this factorial structure in a natural way, working memory demands should be minimized (e.g., avoid dual tasking) to ensure that working memory capacity is available for explicit hypothesis testing (Waldron & Ashby, 2001), and the intertrial interval should be long enough so that subjects have sufficient time to process the meaning of the feedback (Maddox, Ashby, Ing, & Pickering, 2004).

Conclusions
Multidimensional signal detection theory in general, and GRT in particular, make two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. When signal detection theory was first proposed, both of these assumptions were controversial. We now know, however, that every sensory, perceptual, or cognitive process must operate in the presence of inherent noise. There is inevitable noise in the stimulus (e.g., photon noise, variability in viewpoint), at the neural level, and in secondary factors, such as attention and motivation. Furthermore, there is now overwhelming evidence that every volitional action requires a decision of some sort. In fact, these decisions are now being studied at the level of the single neuron (e.g., Shadlen & Newsome, 2001). Thus, multidimensional signal detection theory captures two fundamental features of almost all behaviors. Beyond these two assumptions, however, the theory is flexible enough to model a wide variety of decision processes and sensory and perceptual interactions. For these reasons, the popularity of multidimensional signal detection theory is likely to grow in the coming decades.

Acknowledgments
Preparation of this chapter was supported in part by Award Number P01NS044393 from the National Institute of Neurological Disorders and Stroke, by grant FA9550-12-1-0355 from the Air Force Office of Scientific Research, and by support from the U.S. Army Research Office through the Institute for Collaborative Biotechnologies under grant W911NF-07-1-0072.

Note
1. This is true except for the endpoints. Since these endpoint intervals have infinite width, the endpoints are set at the z-value that has equal area to the right and left in that interval (0.05 in Figure 2.2).

Glossary
Absorbing barriers: Barriers placed around a diffusion process that terminate the process upon first contact. In most cases there is one barrier for each response alternative.
Affine transformation: A transformation from an n-dimensional space to an m-dimensional space of the form y = Ax + b, where A is an m × n matrix and b is a vector.
Categorization experiment: An experiment in which the subject's task is to assign the presented stimulus to the category to which it belongs. If there are n different stimuli, then a categorization experiment must include fewer than n separate response alternatives.
d′: A measure of discriminability from signal detection theory, defined as the standardized distance between the means of the signal and noise perceptual distributions (i.e., the mean difference divided by the common standard deviation).
Decision bound: The set of points separating regions of perceptual space associated with contrasting responses.
Diffusion process: A stochastic process that models the trajectory of a microscopic particle suspended in a liquid and subject to random displacement because of collisions with other molecules.
Euclidean space: The standard space taught in high-school geometry, constructed from orthogonal axes of real numbers. Frequently, the n-dimensional Euclidean space is denoted by Rⁿ.
False Alarm: Incorrectly reporting the presence of a signal when no signal was presented.
Hit: Correctly reporting the presence of a presented signal.
Identification experiment: An experiment in which the subject's task is to identify each stimulus uniquely. Thus, if there are n different stimuli, then there must be n separate response alternatives. Typically, on each trial, one stimulus is selected randomly and presented to the subject. The subject's task is to choose the response alternative that is uniquely associated with the presented stimulus.
Likelihood ratio: The ratio of the likelihoods associated with two possible outcomes. If the two trial types are equally likely, then accuracy is maximized when the subject gives one response if the likelihood ratio is greater than 1 and the other response if the likelihood ratio is less than 1.
Multidimensional scaling: A statistical technique in which objects or stimuli are situated in a multidimensional space in such a way that objects that are judged or perceived as similar are placed close together. In most approaches, each object is represented as a single point and the space is constructed from some type of proximity data collected on the to-be-scaled objects. A common choice is to collect similarity ratings on all possible stimulus pairs.

multidimensional signal detection theory 31

Nested mathematical models: Two mathematical models are nested if one is a special case of the other, in which the restricted model is obtained from the more general model by fixing one or more parameters to certain specific values.

Nonidentifiable models: The case where two seemingly different models make identical predictions.

Perceptual dimension: A range of perceived values of some psychologically primary component of a stimulus.

Procedural learning: Learning that improves incrementally with practice and requires immediate feedback after each response. Prototypical examples include the learning of athletic skills and learning to play a musical instrument.

Response bias: The tendency to favor one response alternative in the face of equivocal sensory information. When the frequencies of different trial types are equal, a response bias occurs in signal detection theory whenever the response criterion is set at any point for which the likelihood ratio is unequal to 1.

Response criterion: In signal detection theory, the point on the sensory dimension that separates percepts that elicit one response (e.g., Yes) from percepts that elicit the contrasting response (e.g., No).

Speeded classification: An experimental task in which the subject must quickly categorize the stimulus according to the level of a single stimulus dimension. A common example is the filtering task.

Statistical decision theory: The statistical theory of optimal decision-making.

Striatum: A major input structure within the basal ganglia that includes the caudate nucleus and the putamen.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Amazeen, E. L., & DaSilva, F. (2005). Psychophysical test for the independence of perception and action. Journal of Experimental Psychology: Human Perception and Performance, 31, 170.

Ashby, F. G. (1989). Stochastic general recognition theory. In D. Vickers & P. L. Smith (Eds.), Human information processing: Measures, mechanisms and models (pp. 435–457). Amsterdam, Netherlands: Elsevier Science Publishers.

Ashby, F. G. (1992). Multidimensional models of categorization. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 449–483). Hillsdale, NJ: Erlbaum.

Ashby, F. G. (2000). A stochastic version of general recognition theory. Journal of Mathematical Psychology, 44, 310–329.

Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychological Review, 105, 442–481.

Ashby, F. G., Boynton, G., & Lee, W. W. (1994). Categorization response time with multidimensional stimuli. Perception & Psychophysics, 55, 11–27.

Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and categorization of multidimensional stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 33–53.

Ashby, F. G., & Lee, W. W. (1993). Perceptual variability as a fundamental axiom of perceptual science. In S. C. Masin (Ed.), Foundations of perceptual theory (pp. 369–399). Amsterdam, Netherlands: Elsevier Science.

Ashby, F. G., & Maddox, W. T. (1994). A response time theory of separability and integrality in speeded classification. Journal of Mathematical Psychology, 38, 423–466.

Ashby, F. G., & Maddox, W. T. (2005). Human category learning. Annual Review of Psychology, 56, 149–178.

Ashby, F. G., & Maddox, W. T. (2010). Human category learning 2.0. Annals of the New York Academy of Sciences, 1224, 147–161.

Ashby, F. G., Paul, E. J., & Maddox, W. T. (2011). COVIS. In E. M. Pothos & A. J. Wills (Eds.), Formal approaches in categorization (pp. 65–87). New York, NY: Cambridge University Press.

Ashby, F. G., & Perrin, N. A. (1988). Toward a unified theory of similarity and recognition. Psychological Review, 95, 124–150.

Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154–179.

Ashby, F. G., Waldron, E. M., Lee, W. W., & Berkman, A. (2001). Suboptimality in human categorization and identification. Journal of Experimental Psychology: General, 130, 77–96.

Banks, W. P. (2000). Recognition and source memory as multivariate decision processes. Psychological Science, 11, 267–273.

Bindra, D., Donderi, D. C., & Nishisato, S. (1968). Decision latencies of "same" and "different" judgments. Perception & Psychophysics, 3(2), 121–136.

Bindra, D., Williams, J. A., & Wise, J. S. (1965). Judgments of sameness and difference: Experiments on decision time. Science, 150, 1625–1627.

Blaha, L., Silbert, N., & Townsend, J. (2011). A general recognition theory study of race adaptation. Journal of Vision, 11, 567–567.

Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33, 261–304.

Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35, 283–319.

Cohen, D. J. (1997). Visual detection and perceptual independence: Assessing color and form. Attention, Perception, & Psychophysics, 59, 623–635.

DeCarlo, L. T. (2003). Source monitoring and multivariate signal detection theory, with a model for selection. Journal of Mathematical Psychology, 47, 292–303.

Demeyer, M., Zaenen, P., & Wagemans, J. (2007). Low-level correlations between object properties and viewpoint can cause viewpoint-dependent object recognition. Spatial Vision, 20, 79–106.

Eichenbaum, H., & Cohen, N. J. (2004). From conditioning to conscious recollection: Memory systems of the brain (No. 35). New York, NY: Oxford University Press.

Emmerich, D. S., Gray, C. S., Watson, C. S., & Tanis, D. C. (1972). Response latency, confidence and ROCs in auditory signal detection. Perception & Psychophysics, 11, 65–72.

Ennis, D. M., & Ashby, F. G. (2003). Fitting decision bound models to identification or categorization data. Unpublished manuscript. Available at http://www.psych.ucsb.edu/˜ashby/cholesky.pdf

Farris, C., Viken, R. J., & Treat, T. A. (2010). Perceived association between diagnostic and non-diagnostic cues of women's sexual interest: General Recognition Theory predictors of risk for sexual coercion. Journal of Mathematical Psychology, 54, 137–149.

Giordano, B. L., Visell, Y., Yao, H. Y., Hayward, V., Cooperstock, J. R., & McAdams, S. (2012). Identification of walked-upon materials in auditory, kinesthetic, haptic, and audio-haptic conditions. Journal of the Acoustical Society of America, 131, 4002–4012.

Kadlec, H. (1999). MSDA_2: Updated version of software for multidimensional signal detection analyses. Behavior Research Methods, 31, 384–385.

Kadlec, H., & Townsend, J. T. (1992a). Implications of marginal and conditional detection parameters for the separabilities and independence of perceptual dimensions. Journal of Mathematical Psychology, 36, 325–374.

Kadlec, H., & Townsend, J. T. (1992b). Signal detection analyses of multidimensional interactions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 181–231). Hillsdale, NJ: Erlbaum.

Louw, S., Kappers, A. M., & Koenderink, J. J. (2002). Haptic discrimination of stimuli varying in amplitude and width. Experimental Brain Research, 146, 32–37.

Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103–190). New York, NY: Wiley.

Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). Mahwah, NJ: Erlbaum.

Maddox, W. T., & Ashby, F. G. (1993). Comparing decision bound and exemplar models of categorization. Perception & Psychophysics, 53, 49–70.

Maddox, W. T., & Ashby, F. G. (1996). Perceptual separability, decisional separability, and the identification–speeded classification relationship. Journal of Experimental Psychology: Human Perception & Performance, 22, 795–817.

Maddox, W. T., Ashby, F. G., & Gottlob, L. R. (1998). Response time distributions in multidimensional perceptual categorization. Perception & Psychophysics, 60, 620–637.

Maddox, W. T., Ashby, F. G., Ing, A. D., & Pickering, A. D. (2004). Disrupting feedback processing interferes with rule-based but not information-integration category learning. Memory & Cognition, 32, 582–591.

Maddox, W. T., Ashby, F. G., & Waldron, E. M. (2002). Multiple attention systems in perceptual categorization. Memory & Cognition, 30, 325–339.

Maddox, W. T., Glass, B. D., O'Brien, J. B., Filoteo, J. V., & Ashby, F. G. (2010). Category label and response location shifts in category learning. Psychological Research, 74, 219–236.

Marascuilo, L. (1970). Extensions of the significance test for one-parameter signal detection hypotheses. Psychometrika, 35, 237–243.

Menneer, T., Wenger, M., & Blaha, L. (2010). Inferential challenges for General Recognition Theory: Mean-shift integrality and perceptual configurality. Journal of Vision, 10, 1211–1211.

Murdock, B. B. (1985). An analysis of the strength-latency relationship. Memory & Cognition, 13, 511–521.

Peterson, W. W., Birdsall, T. G., & Fox, W. C. (1954). The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory, PGIT-4, 171–212.

Rotello, C. M., Macmillan, N. A., & Reeder, J. A. (2004). Sum-difference theory of remembering and knowing: A two-dimensional signal-detection model. Psychological Review, 111, 588.

Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86, 1916–1936.

Shepard, R. N. (1957). Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22, 325–345.

Silbert, N. H. (2012). Syllable structure and integration of voicing and manner of articulation information in labial consonant identification. Journal of the Acoustical Society of America, 131, 4076–4086.

Silbert, N. H., & Thomas, R. D. (2013). Decisional separability, model identification, and statistical inference in the general recognition theory framework. Psychonomic Bulletin & Review, 20(1), 1–20.

Silbert, N. H., Townsend, J. T., & Lentz, J. J. (2009). Independence and separability in the perception of complex nonspeech sounds. Attention, Perception, & Psychophysics, 71, 1900–1915.

Smith, E. E. (1968). Choice reaction time: An analysis of the major theoretical positions. Psychological Bulletin, 69, 77–110.

Smith, J. E. K. (1992). Alternative biased choice models. Mathematical Social Sciences, 23, 199–219.

Smith, P. L., & Ratcliff, R. (2004). Psychology and neurobiology of simple decisions. Trends in Neurosciences, 27, 161–168.

Soto, F. A., Musgrave, R., Vucovich, L., & Ashby, F. G. (in press). General recognition theory with individual differences: A new method for examining perceptual and decisional interactions with an application to face perception. Psychonomic Bulletin & Review.

Squire, L. R. (1992). Declarative and nondeclarative memory: Multiple brain systems supporting learning and memory. Journal of Cognitive Neuroscience, 4, 232–243.

Swets, J. A., Tanner, W. P., Jr., & Birdsall, T. G. (1961). Decision processes in perception. Psychological Review, 68, 301–340.

Tanner, W. P., Jr. (1956). Theory of recognition. Journal of the Acoustical Society of America, 30, 922–928.

Thomas, R. D. (1999). Assessing sensitivity in a multidimensional space: Some problems and a definition of a general d′. Psychonomic Bulletin & Review, 6, 224–238.

Thomas, R. D. (2001). Perceptual interactions of facial dimensions in speeded classification and identification. Perception & Psychophysics, 63, 625–650.

Townsend, J. T., Aisbett, J., Assadi, A., & Busemeyer, J. (2006). General recognition theory and methodology for dimensional independence on simple cognitive manifolds. In H. Colonius & E. N. Dzhafarov (Eds.), Measurement and representation of sensations: Recent progress in psychophysical theory (pp. 203–242). Mahwah, NJ: Erlbaum.

Townsend, J. T., Houpt, J. W., & Silbert, N. H. (2012). General recognition theory extended to include response times: Predictions for a class of parallel systems. Journal of Mathematical Psychology, 56, 476–494.

Townsend, J. T., & Spencer-Smith, J. B. (2004). Two kinds of global perceptual separability and curvature. In C. Kaernbach, E. Schröger, & H. Müller (Eds.), Psychophysics beyond sensation: Laws and invariants of human cognition (pp. 89–109). Mahwah, NJ: Erlbaum.

Waldron, E. M., & Ashby, F. G. (2001). The effects of concurrent task interference on category learning: Evidence for multiple category learning systems. Psychonomic Bulletin & Review, 8, 168–176.

Wenger, M. J., & Ingvalson, E. M. (2002). A decisional component of holistic encoding. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 872–892.

Wickens, T. D. (1992). Maximum-likelihood estimation of a multivariate Gaussian rating model with excluded data. Journal of Mathematical Psychology, 36, 213–234.

Wickens, T. D. (2002). Elementary signal detection theory. New York, NY: Oxford University Press.

CHAPTER 3

Modeling Simple Decisions and Applications Using a Diffusion Model

Roger Ratcliff and Philip Smith

Abstract

The diffusion model is one of the major sequential-sampling models for two-choice decision-making and choice response time in psychology. The model conceives of decision-making as a process in which noisy evidence is accumulated until one of two response criteria is reached and the associated response is made. The criteria represent the amount of evidence needed to make each decision and reflect the decision maker's response biases and speed–accuracy trade-off settings. In this chapter we examine the application of the diffusion model in a variety of different settings. We discuss the optimality of the model and review its applications to a number of cognitive tasks, including perception, memory, and language tasks. We also consider its applications to normal and special populations, to the cognitive foundations of individual differences, to value-based decisions, and its role in understanding the neural basis of decision-making.

Key Words: diffusion model, sequential-sampling, drift rate, choice, decision time, accuracy, confidence, perceptual decision, memory decision, lexical decision

Diffusion Models for Rapid Decisions

Over the last 30 or 40 years, there has been a steady development of models for simple decision-making that deal with both the accuracy of decisions and the time taken to make them. The models assume that decisions are made by accumulating noisy information to decision criteria, one criterion for each possible choice. The models successfully account for the probability that each choice is made and the response time (RT) distributions for correct responses and errors. The models are highly constrained by the behavior of these dependent variables. The most frequent applications of these models have been to tasks that require two-choice decisions that are made reasonably quickly, typically with mean RTs less than 1.0–2.0 s. This is fast enough that one can assume that the decisions come from a single decision process and not from multiple, sequential processes (anything much slower and the single-process assumption would be suspect). The models have been applied successfully to many different tasks, including perceptual, numerical, and memory tasks, with a variety of subject populations, including older adults, children, dyslexics, and adults undergoing sleep deprivation, reduced blood sugar, or alcohol intoxication.

An important feature of human decision-making is that the processing system is very flexible because humans can switch tasks, stimulus dimensions, and output modalities very quickly, from one trial to the next. There are many different kinds of decisions that can be made about any stimulus. If the stimulus is a letter string, decisions can be made about whether it is a word or a nonword, whether it was studied earlier, whether the color is red or green, whether it is upper or lower case, and so on. Responses can be made in different modalities and in different ways in those modalities (for example, manually, vocally, or via eye movements). The same decision mechanism might operate for all these

tasks or the mechanism might be task and modality specific.

For two-choice tasks, the assumption usually made is that all decision-related information, that is, all the information that comes from a stimulus or memory, is collapsed onto a single variable, called drift rate, that characterizes the discriminative or preference information in the stimulus. In some situations, subjects may be asked to make judgments based on more than one dimension that cannot be combined in this way. In such cases, the systems factorial methods of Townsend and colleagues (e.g., Townsend, 1972; see the review in Townsend & Wenger, 2004) may be able to be used to determine whether processing on the different dimensions is serial or parallel, or some hybrid of the two.

In this chapter, we focus on one model of the class of sequential sampling models of evidence accumulation, the diffusion model (Ratcliff, 1978; Ratcliff & McKoon, 2008; Smith, 2000). A comparison of the diffusion model with other sequential-sampling models, such as the Poisson counter model (Townsend & Ashby, 1983), the Vickers accumulator model (Smith & Vickers, 1988; Vickers, 1970), and the leaky competing accumulator model (Usher & McClelland, 2001), can be found in Ratcliff and Smith (2004). In the diffusion model, for a two-choice task, noisy evidence accumulates from a starting point (Figure 3.1) toward one of two decision criteria, or boundaries, and the quality of the information that enters the decision process determines the rate of accumulation.

Fitting the model to data provides estimates of drift rates, decision boundaries, and a parameter representing the duration of nondecision processes. The model's ability to separate these components is one of its key contributions and places major constraints on its ability to explain data. Stimulus difficulty affects drift rate but not the criteria, and to a good approximation, speed–accuracy shifts are represented in the criteria, not drift rate. If difficulty varies, changes in drift rate alone must accommodate all the changes in performance, namely accuracy and the changes in the spreads and locations of the correct and error RT distributions. Likewise, changes in the criteria affect all the aspects of performance. In these ways, the model is tightly constrained by data.

In a perceptual task, drift rate depends on the quality of the perceptual information from a stimulus; in a memory task, it depends on the quality of the match between a test item and memory. In a brightness discrimination task, for example, if the accumulated evidence reaches the top boundary, a "bright" response is executed, and a "dark" response would then correspond to the bottom boundary.

Figure 3.1 shows an example, using a brightness discrimination task. Evidence accumulates from a stimulus to the "bright" boundary or to the "dark" boundary. The solid arrow shows the drift rate for a bright stimulus, the dashed arrow shows the drift rate for a less bright stimulus, and the dotted arrow shows the drift rate for a dark stimulus.

[Figure 3.1: evidence paths between a "Bright" (top) and a "Dark" (bottom) boundary; the vertical axis is labeled "Quality of Evidence from Perception or Memory."]

Fig. 3.1 The diffusion decision model with three simulated paths and three different drift rates.

The three paths in Figure 3.1 show three different outcomes, all with the same drift rate. Noise in the accumulation process produces errors when the accumulated evidence reaches the incorrect boundary, and it produces variable RTs that form a distribution of RTs that has the shape of empirically obtained distributions. In the figure, one path leads to a fast correct decision, one to a slow correct decision, and one to an error. Most responses are reasonably fast, but there are slower ones that spread out the right-hand tails of the distributions (as in the distribution at the top of Figure 3.1). As drift rate changes from a large value to near zero, the mean of the RT distribution for both correct and error responses increases because the tail of the RT distribution spreads out. Figure 3.2 shows simulated individual RTs from the model as a function of drift rate, which is assumed to vary from trial to trial. The shortest RTs change little with drift rate, and so a fast response says nothing about the difficulty of the trial. The probability of obtaining a slow response from a high drift rate is very small (e.g., Figure 3.2), and so conditions with the slowest responses come from lower drift rates (see Ratcliff, Philiastides, & Sajda, 2009).

Figure 3.1 shows the accumulation-of-evidence process. Besides this, there are processes that

encode stimuli, access memory, transform stimulus information into a decision-related variable that determines drift rate, and execute responses. These components of processing are combined into one "nondecision" component in the model, which has mean Ter. The total processing time for a decision is the sum of the time taken by the decision process and the time taken by the nondecision component.

[Figure 3.2: scatterplot of individual-trial RTs (vertical axis, roughly 600–1400 ms) against drift rate for the trial (horizontal axis, 0.0–0.8); the correlation between RT and drift rate is r = –0.336.]

Fig. 3.2 Plots of individual RTs as a function of drift rate for the trial. The parameters of the diffusion model were: boundary separation, a = 0.107; starting point, z = 0.048; duration of processes other than the decision process, Ter = 0.48 s; SD in drift rate across trials, η = 0.20; range in starting points, sz = 0.02; range in nondecision time, st = 0.18 s; drift rate, v = 0.3.

The boundaries of the decision process can be manipulated by instructions ("respond as quickly as possible" or "respond as accurately as possible"), differential rewards for the two choices, and the relative frequencies with which the two stimuli are presented in the experiment. Changes in instructions, rewards, or biases affect both RTs and accuracy, but in the model, to a good approximation, the effects on RTs and accuracy are due to shifts in boundary settings alone, not drift rates or nondecision time. (However, if subjects are pushed very hard to go fast, then nondecision time and drift rates can be lower; e.g., Starns, Ratcliff, & McKoon, 2012.)

A problem with early random walk models, which were precursors to the diffusion model, was that they predicted equal correct and error RT distributions if the drift rates for two stimuli were equal in magnitude but opposite in sign (Laming, 1968; Stone, 1960; but see Link & Heath, 1975). This prediction is also made by the diffusion model in the absence of across-trial variability in model parameters. In fact, the patterns of the relative speed of correct versus error responses are as follows: with accuracy instructions and/or difficult tasks, errors are slower than correct responses, and with speed instructions and/or easy tasks, errors are faster than correct responses (Luce, 1986).

In the diffusion model, the observed patterns of correct versus error RTs fall out naturally because there is trial-to-trial variability in drift rate and starting point (e.g., Ratcliff, 1981). Figure 3.4 illustrates how this mixing works with just two drift rates or two starting points instead of their full distributions. In Figure 3.4, left panel, the v1 drift rate produces high accuracy and fast responses, the v2 one lower accuracy and slow responses. The mixture of these produces errors slower than correct responses because 5% of the 400 ms process averaged with 20% of the 600 ms process gives a weighted mean of 560 ms, which is slower than the weighted mean for correct responses (491 ms). In Figure 3.4, right panel, the distributions to the left are for processes that start near the correct boundary (the dotted arrow shows the distance the process has to go to make an error; the larger the distance, the slower the response) and the distributions to the right are for processes that start further away from the correct boundary. Processes that start near to the correct boundary have few errors and those errors are slow, whereas processes that start further away have more errors and the errors are fast, leading to errors faster than correct responses. In practice, drift rate is assumed to be normally distributed from trial to trial and the starting point is uniformly distributed, but these specific functional forms are not critical (Ratcliff, 2013).
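The weighted-mean arithmetic behind this mixture argument can be checked directly. A minimal sketch in plain Python, using only the probabilities and RTs quoted in the text and Figure 3.4 (the helper function is ours, not the chapter's):

```python
def weighted_mean_rt(prob_rt_pairs):
    """Probability-weighted mean RT across equally likely component processes."""
    total_p = sum(p for p, _ in prob_rt_pairs)
    return sum(p * rt for p, rt in prob_rt_pairs) / total_p

# Left panel: two drift rates, with responses at 400 ms (v1) and 600 ms (v2).
correct = weighted_mean_rt([(0.95, 400), (0.80, 600)])   # ~491 ms
errors = weighted_mean_rt([(0.05, 400), (0.20, 600)])    # 560 ms: errors slower

# Right panel: two starting points, with responses at 350 ms and 450 ms.
correct2 = weighted_mean_rt([(0.95, 350), (0.80, 450)])  # ~396 ms
errors2 = weighted_mean_rt([(0.20, 350), (0.05, 450)])   # 370 ms: errors faster

print(round(correct), round(errors), round(correct2), round(errors2))
# prints: 491 560 396 370
```

The same two probability/RT pairs thus yield slow errors in one mixture (drift-rate variability) and fast errors in the other (starting-point variability), which is the point of Figure 3.4.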
Figure 3.3, left panel, shows boundaries moving in for speed relative to accuracy instructions, and the right panel shows how subjects can be biased toward the top response versus the bottom response by moving decision criteria from the dashed-line to the solid-line settings. It is also possible (Figure 3.3, right panel) to adjust the zero point of drift rate (the drift rate criterion) to accommodate biases between the two responses (see Leite & Ratcliff, 2011; Ratcliff, 1985; Ratcliff & McKoon, 2008, Figure 3.3).

Some researchers have argued that across-trial variability in the parameters is not needed (Palmer, Huk, & Shadlen, 2005; Usher & McClelland, 2001). However, it is unreasonable to assume that subjects can set their processing components to identical values on every equivalent trial of an experiment (i.e., ones with the same stimulus value). For drift rates, across-trial variability in drift rate is exactly analogous to variability in stimulus or memory strength in signal detection theory. Later

[Figure 3.3: two panels of boundary settings. Left panel: "Speed/Accuracy tradeoff: boundary separation changes." Right panel: "Bias towards top boundary (dashed) changes to bias towards bottom boundary (solid)."]

Fig. 3.3 In the left panel, boundary separation alone changes between speed and accuracy instructions. In the right panel, the starting point varies with bias.

[Figure 3.4: two panels of simulated processes. Left panel: two drift rates, v1 and v2, from a starting point z = a/2; correct responses Pr = 0.95 at 400 ms and Pr = 0.80 at 600 ms (weighted mean RT = 491 ms); error responses Pr = 0.05 at 400 ms and Pr = 0.20 at 600 ms (weighted mean RT = 560 ms). Right panel: one drift rate v with two starting points; correct responses Pr = 0.95 at 350 ms and Pr = 0.80 at 450 ms (weighted mean RT = 396 ms); error responses Pr = 0.20 at 350 ms and Pr = 0.05 at 450 ms (weighted mean RT = 370 ms).]

Fig. 3.4 Variability in drift rate and starting point and the effects on speed and accuracy. The left panel shows two processes with drift rates v1 and v2 and the starting point halfway between the boundaries, with correct and error RTs of 400 ms for v1 and of 600 ms for v2. Averaging these two illustrates the effects of variability in drift rate across trials and in the illustration yields error responses slower than correct responses. The right panel shows processes with two starting points and drift rate v. Averaging processes with starting point 0.5a+0.5 (high accuracy and short RTs) and starting point 0.5a−0.5 (lower accuracy and short RTs) yields error responses faster than correct responses.

we describe an EEG study of perceptual decision-making that provides independent evidence for across-trial variability in drift rate and mention another that provides evidence for variability in starting point.

It is important to understand that the diffusion model is highly falsifiable, not by mean RTs and accuracy values but by RT distributions. If empirical distributions are not right skewed, and do not shift and spread in exactly the right ways across experimental conditions, the model is falsified. Ratcliff (2002) generated sets of data with RT distributions that are plausible but never obtained in real experiments. For one set, the shapes and locations of the RT distributions were changed as a function of task difficulty, and for the other, the shapes and locations were changed as a function of speed versus accuracy instructions. For none of the resulting distributions was the model able to fit the data. In addition, the distributional predictions of the model are tested every time it is fit to empirical data.

Expressions for Accuracy and RT Distributions

For a two-boundary diffusion process with no across-trial variability in any of the parameters, the equation for accuracy, the proportion of responses terminating at the boundary at zero, is given by

P(v,a,z) = \frac{e^{-2va/s^2} - e^{-2vz/s^2}}{e^{-2va/s^2} - 1}   (1)

(or 1 − z/a if drift is zero), and the cumulative distribution of finishing times at the same boundary is given by

G(t,v,a,z) = P(v,a,z) - \frac{\pi s^2}{a^2}\, e^{-vz/s^2} \sum_{k=1}^{\infty} \frac{2k \sin\!\left(\frac{k\pi z}{a}\right) \exp\!\left[-\frac{1}{2}\left(\frac{v^2}{s^2} + \frac{k^2\pi^2 s^2}{a^2}\right)t\right]}{\frac{v^2}{s^2} + \frac{k^2\pi^2 s^2}{a^2}}   (2)

where a is boundary separation (the top boundary is at a, the bottom boundary is at 0 and the

38 elementary cognitive mechanisms distribution of finishing times is the distribution They obtained predictions for these models using at the bottom boundary), z is the starting point, the integral equation method. v is drift rate, and s is the SD in the normal Diederich and Busemeyer (2003) proposed a distribution of within-trial variability (square root matrix method for obtaining predictions for diffu- of the diffusion coefficient). sion models. In their approach, a continuous-time, These expressions can be derived as a solution continuous-state diffusion process is approximated of the partial differential equation for the first by a discrete-time, discrete-state birth-death pro- passage-time probability for the diffusion process cess. The probability that the process takes a step (Feller, 1968). The results are described in detail up or down at each time point is characterized by a in Ratcliff (1978) and Ratcliff and Smith (2004). transition matrix whose entries express the rules by Because Equation 2 contains an infinite sum, values which the process evolves over time. By approximat- of the RT density function need to be computed ing the process in this way, the problem of obtaining numerically. The series needs to be summed until RT distributions and response probabilities can be it converges; this means that terms have to be reduced to one of repeated matrix multiplication. added until subsequent terms become so small that This solution can be expensive computationally, but they does not affect the total. This is complicated can be made more efficient by solving the associated by the sine term, which can allow one value in algebraic eigenvalue problem, avoiding the need for the sum to be small, whereas the next one is not repeated matrix multiplication. The method can small. 
To deal with this practically, it is necessary also be applied to more complex problems that to require that two or three successive terms are cannot be solved using the method of Equation very small. 2 and has the advantage that it is very robust The predictions from the model are obtained computationally. by integrating the results from Equations 1 and In some situations, it is important to generate 2 over the distributions of the model’s across-trial predictions by simulation because simulated data variability parameters using numerical integration. can show the effects of all the sources of variability In the standard model, drift rate is normally on a subject’s RTs and accuracy. The number of distributed across trials with SD η, the starting simulated can be increased sufficiently point is uniformly distributed with range sz,and that the data approach the predictions that would nondecision time is uniformly distributed with be determined exactly from the numerical method. range st . The predicted values are “exact” numerical The expression for the update of evidence, x,on predictions in the sense that they can be made as every time step t during the decision process, accurate as necessary (e.g., 0.1 ms or better) by is determined by the drift rate, v, plus a noise using more and more steps in the infinite sum and term (Gaussian random variable, εi with SD σ)to more and more steps in the numerical integrations represent variability in processing: (packages that perform fitting are mentioned later). Alternative computational methods for obtain- √ x = v t + ση t (3) ing predictions for diffusion models have been i i i described by Smith (2000) and Diederich and Busemeyer (2003). The approach described by This equation provides the most straightforward Smith uses integral equation methods derived from method of simulating the diffusion process, but renewal theory. It was originally developed in it is not the most efficient. 
Tuerlinckx, Maris, mathematical biology to model the firing rates Ratcliff, & De Boeck (2001) examined four of integrate-and-fire neurons (Buonocore, Giorno, methods for simulating diffusion processes and Nobile, & Ricciardi, 1990). The method is more found that a random walk approximation is better computationally intensive than the infinite series than using Equation 3. They also showed that a approach of Equation 2, but has the advantage “rejection” method is even more efficient. However, that it can be applied to processes in which the if the process is nonstationary and complicated drift rates or decision criteria change over time (e.g., with time varying drift rate, or boundaries that or in which the accumulated information decays have some functional form) or there are several dif- during the course of a trial. Smith (1995) and fusion processes running to model multiple choice Smith and Ratcliff (2009) have proposed models in tasks, simulation is the simplest way to produce which drift rates depend on the outputs of visual predictions, and the random walk approximation is and memory processes that change during a trial. likely the most efficient.
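As a concrete illustration of the numerical issues discussed above, the series in Equation 2 can be evaluated with the stopping rule that requires several successive negligible terms. This is a minimal sketch rather than production fitting code: the function names are ours, the parameter values in the example are illustrative, and P(v, a, z) is written here in its standard closed form for the probability of absorption at the lower boundary.

```python
import math

def p_lower(v, a, z, s=0.1):
    """Closed-form probability that a diffusion with drift v, boundary
    separation a, and starting point z is absorbed at the lower boundary (0)."""
    if v == 0.0:
        return (a - z) / a
    return ((math.exp(-2.0 * v * z / s**2) - math.exp(-2.0 * v * a / s**2))
            / (1.0 - math.exp(-2.0 * v * a / s**2)))

def g_lower(t, v, a, z, s=0.1, tol=1e-12, max_terms=100_000):
    """Cumulative first-passage probability at the lower boundary by time t
    (Equation 2), summing the series until three successive terms are tiny."""
    vv = v**2 / s**2
    total, n_small = 0.0, 0
    for k in range(1, max_terms):
        denom = vv + k**2 * math.pi**2 * s**2 / a**2
        term = 2.0 * k * math.sin(k * math.pi * z / a) * math.exp(-0.5 * denom * t) / denom
        total += term
        # the sine makes a single small term unreliable as a convergence
        # signal, so require several successive small terms before stopping
        n_small = n_small + 1 if abs(term) < tol else 0
        if n_small >= 3:
            break
    prefactor = (math.pi * s**2 / a**2) * math.exp(-v * z / s**2)
    return p_lower(v, a, z, s) - prefactor * total
```

With v = 0.2, a = 0.1, z = a/2, and s = 0.1 (typical scaling in this literature), g_lower rises from 0 toward p_lower(v, a, z) ≈ 0.119 as t increases.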

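The update rule in Equation 3 also translates directly into a forward simulation. The sketch below (function names and parameter values are ours, for illustration) runs independent trials of the Euler scheme and tallies choices and decision times; as noted above, a random walk or rejection method would be more efficient in practice.

```python
import math
import random

def simulate_trial(v, a, z, s=0.1, dt=0.001, rng=random):
    """One trial of the Euler scheme of Equation 3: accumulate evidence
    until it crosses the upper boundary (a) or the lower boundary (0).
    Returns (hit_upper, decision_time)."""
    x, t = z, 0.0
    sqrt_dt = math.sqrt(dt)
    while 0.0 < x < a:
        x += v * dt + s * rng.gauss(0.0, 1.0) * sqrt_dt
        t += dt
    return x >= a, t

def simulate_trials(n, v, a, z, s=0.1, dt=0.001, seed=1):
    """Simulate n trials; return the proportion of upper-boundary choices
    and the mean decision time (nondecision time not included)."""
    rng = random.Random(seed)
    upper = times = 0.0
    for _ in range(n):
        hit_upper, t = simulate_trial(v, a, z, s, dt, rng)
        upper += hit_upper
        times += t
    return upper / n, times / n
```

For example, simulate_trials(4000, v=0.2, a=0.1, z=0.05) gives an upper-boundary choice proportion near the analytic value of about 0.88 for these parameters.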
modeling simple decisions and applications 39

In fitting the diffusion model to data, accuracy and RT distributions for correct and error responses for all the conditions of the experiment must be fit simultaneously, and the values of all of the components of processing estimated simultaneously. One commonly used fitting method uses quantiles of the RT distributions for correct and error responses for each condition (the 0.1, 0.3, 0.5, 0.7, and 0.9 quantile RTs). The model predicts the cumulative probability of a response at each RT quantile. Subtracting the cumulative probability at each quantile from that at the next higher quantile gives the proportion of responses between adjacent quantiles. For a chi-square computation, these are the expected values, to be compared to the observed proportions of responses between the quantiles (i.e., the proportions between 0.1, 0.3, 0.5, 0.7, and 0.9 are each 0.2, and the proportions below 0.1 and above 0.9 are both 0.1) multiplied by the number of observations. Summing (Observed − Expected)²/Expected over correct and error responses for each condition gives a single chi-square value that is minimized with a general SIMPLEX minimization routine. The parameter values for the model are adjusted by SIMPLEX until the minimum chi-square value is obtained.

In any data set, there is the potential problem of outlier RTs, which could be fast (e.g., fast guesses) or slow (e.g., inattention). The quantile-based method provides a good compromise that reduces the influence of outliers because the proportion of responses between the quantiles is used, and extreme RTs within the bins have no influence on fitting. To deal further with outliers, a model of such processes is used in some model-fitting approaches, so that the data are assumed to be a mixture of diffusion processes plus a small proportion of outliers. For details of the fitting methods for the standard diffusion model and of modeling outliers, see Ratcliff and Tuerlinckx (2002).

New methods for fitting the diffusion model have been developed recently and, over the last 6 or 7 years, fitting packages have been made available by Vandekerckhove and Tuerlinckx (2007) and Voss and Voss (2007). Also, Bayesian methods have been developed (Vandekerckhove, Tuerlinckx, & Lee, 2011), and a Bayesian package by Wiecki, Sofer, and Frank (2013) has been made available. These Bayesian methods also implement hierarchical modeling schemes, in which model parameters for individual subjects are assumed to be random samples from population distributions that are specified within the model. The means and variances of the population distributions, which are estimated in fitting, determine a range of probable values of drift rates and decision boundaries for individual subjects. Because all subjects are fit simultaneously using these methods, the parameters are constrained by the group parameters, especially with low numbers of observations. The application of these hierarchical methods is in its infancy, and some applications with large numbers of subjects, both simulated and real, that show their benefit over and above the more traditional methods are needed.

To show how well the diffusion model fits data, we plot RT quantiles against the proportions with which the two responses are made. The top panel of Figure 3.5 shows a histogram for an RT distribution. The 0.1–0.9 quantile RTs and the 0.005 and 0.995 quantiles are shown on the x-axis. The rectangles represent equal areas of 0.2 probability mass between the 0.1–0.3, 0.3–0.5, etc., quantile RTs (and, as can be seen, these represent the histogram reasonably well).

These quantiles can be used to construct a quantile-probability plot by plotting the 0.1–0.9 quantile RTs vertically, as in the second panel of Figure 3.5, against the response proportion of that condition on the x-axis. Usually, correct responses are to the right of 0.5 and errors to the left (if there is no bias toward one or the other of the responses). Example RT distributions constructed from the equal-area rectangles are also shown in grey. When there is a bias in starting point or when the two response categories are not symmetric (as in lexical decision and memory experiments), two quantile-probability plots are needed, one for each response category.

With quantile-probability plots, changes in RT distribution location and spread as a function of response proportion can be seen easily and compared with model fits. In the bottom panel of Figure 3.5, the 1–5 symbols are the data and the solid lines are the predictions from fits of the model to the data (with circles denoting the exact location of the predictions). As can be seen in this example, as response proportion changes from about 0.6 to near 1.0, the 0.1 quantile (leading edge) changes little, but the 0.9 quantile changes by as much as 400 ms. This is in line with the model predictions (e.g., Fig. 3.2). Also, as can be seen, error responses are slower than correct responses mainly in the spread, not in the leading-edge location. Thus, quantile-probability plots allow all the important aspects of the data to be read from a single plot.
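The quantile-based chi-square described above can be sketched as follows (the function name and the toy model in the check are ours, for illustration only): given the observed quantile RTs and a model's defective cumulative distribution for one response category, the expected proportions between quantiles are differences of predicted cumulative probabilities.

```python
def quantile_chi_square(n_obs, p_obs, quantile_rts, pred_cdf, p_pred):
    """Chi-square contribution for one response category in one condition.

    n_obs: number of observations in the condition
    p_obs: observed proportion of this response
    quantile_rts: observed 0.1, 0.3, 0.5, 0.7, 0.9 quantile RTs
    pred_cdf: model's defective CDF for this response
    p_pred: model's predicted probability of this response (CDF asymptote)
    """
    # observed probability mass in the six between/outside-quantile bins
    observed = [p_obs * w for w in (0.1, 0.2, 0.2, 0.2, 0.2, 0.1)]
    cum = [pred_cdf(t) for t in quantile_rts]
    expected = ([cum[0]]
                + [cum[i] - cum[i - 1] for i in range(1, len(cum))]
                + [p_pred - cum[-1]])
    return n_obs * sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Summing this quantity over correct and error responses and over conditions gives the objective that SIMPLEX minimizes. As a sanity check, if the observed quantiles coincide exactly with the model's quantiles, the statistic is zero.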

Fig. 3.5 The top panel shows an RT distribution overlaid with 0.1, 0.3, 0.5, 0.7, and 0.9 quantiles, where the area outside the .1 quantile ranges from 0.005 to 0.1 and the area outside the .9 quantile ranges from 0.9 to 0.995. The areas between each pair of middle quantiles are 0.2, and the areas below 0.1 and above 0.9 are 0.095. The quantile rectangles capture the main features of the RT distribution and are therefore a reasonable summary of overall distribution shape. The middle panel shows quantile RTs for the 0.1, 0.3, 0.5 (median), 0.7, and 0.9 quantiles (stacked vertically) plotted against response proportion for each of the six conditions. Correct responses are plotted to the right, and error responses to the left. The bottom panel shows a quantile-probability function from Ratcliff and Smith (2010, Experiment 2), with the numbers representing data and the lines representing model predictions.

Variants of the Standard Two-Choice Task
Up to this point, we have discussed how the diffusion model explains the results of experiments in which subjects respond with one of the two choices in their own time. The model has also been successfully applied to paradigms in which decision time is manipulated. Here we discuss three of these.

response signal and deadline tasks
For response signal and deadline tasks, a signal is presented after the stimulus, and a subject is required to respond as quickly as possible (in, say, 200–300 ms). For a deadline paradigm, the time between the stimulus and the signal is fixed across trials. For a response signal paradigm, the time varies from trial to trial (Reed, 1973; Schouten & Bekker, 1967; Wickelgren, 1977; Wickelgren, Corbett, & Dosher, 1980). With the deadline paradigm, subjects can adopt different strategies or criteria for different deadlines. This is not the case for the response signal paradigm, in which processing can be assumed to be the same up to the signal.

To apply the diffusion model to response signal data, Ratcliff (1988, 2006) assumed that there are response criteria just as for the standard two-choice task and that, at some signal lag, responses come from a mixture of processes: those that have terminated at one or the other of the boundaries and those that have not. This is in accord with subjects' intuitions that, at the long lags, the decision has already been made, the response has been chosen, and the subject is simply waiting for the signal. As the time between stimulus and signal decreases, a larger and larger proportion of processes will have failed to terminate. Differences among experimental conditions of different difficulties appear as differences in the proportions of accumulated information at the different lags. At the longest lags (2 or more seconds), all or almost all processes will have terminated. For nonterminated processes, there are two possibilities: that decisions are made on the basis of the partial information that has already been accumulated (Figure 3.6, top panel) or that they are simply guesses (Figure 3.6, middle panel). Ratcliff (2006) tested between these possibilities with a numerosity discrimination experiment (subjects decide whether the number of asterisks displayed on a PC monitor is greater than or less than 50). The same subjects participated in the response signal task and the standard task, and examples of the response signal data and model fits are shown in Figure 3.7. When the model

was fit to the two sets of data simultaneously, it fit well, and it fit equally well for the two possibilities for nonterminated processes. In other words, the "guessing" and partial information models could not be discriminated.

Fig. 3.6 The top two panels show two models for how the diffusion model accounts for response signal data. In the top panel, the proportion of "large" responses at time T1 is the sum of processes that have terminated at the "large" boundary (the black area above the boundary) and nonterminated processes (the black area still within the diffusion process), i.e., partial information. The middle panel shows the same assumption as the top panel except that if a process has not terminated, a guess is used instead of partial information. The bottom panel shows heat maps of simulated paths for the diffusion model. White corresponds to high path density and black to low path density. For the diffusion model, the distribution to the right corresponds to the asymptotic distribution of path positions after about 0.2 seconds (i.e., the vertically oriented distributions in the top panel).

meyer, irwin, osman, & kounios (1988) partial information paradigm
This paradigm used a variant of the response signal task in which, on each trial, subjects responded in the regular way unless a signal to respond

occurred, in which case they were to respond immediately. Thus, any trial could be a signal trial or a regular trial. Meyer et al. developed a method based on a race model that decomposes accuracy on signal trials (at each signal lag) into a component from fast-finishing regular trials and a component based on partial information. The predictions from the diffusion model matched those from Meyer et al. (1988). Results showed that partial information, in some tasks (see also Kounios, Osman, & Meyer, 1987), grew quickly and leveled off at about one-third the accuracy level of regular processes.

Fig. 3.7 Plots of response proportion as a function of response signal lag from a numerosity discrimination experiment (Ratcliff, 2006) for four subjects. The task required subjects to judge whether the number of dots in a 10x10 array was greater than 50 or less than or equal to 50. The digits 1–8 (in reverse order) and the eight lines represent eight groupings of numbers of dots (e.g., 13–20, 21–30, 31–40, 41–50, 51–60, 61–70, 71–80, and 81–87 dots).

Ratcliff (1988) examined the predictions of the diffusion model with the assumption that decisions on signal trials were a mixture of processes that terminated at a boundary and processes based on position in the decision process, that is, partial information. Therefore, if a process was above the starting point (i.e., the black area in the vertical distribution in the top panel of Figure 3.6), the decision corresponded to the choice at the upper boundary.

The bottom panel of Figure 3.6 shows a heat map of the evolution of simulated diffusion processes. The map shows the density of processes as they begin at the starting point and spread out to the boundaries. The hotter the color (whiter), the more processes in that region. As time goes by, the color becomes cooler because there are fewer and fewer processes that have not terminated. As in the top panel, the evolution of paths moves the mean position (the thick black line) from the starting point at 0.5 to a point a little above 0.6 by about 0.2 s. This produces an almost stationary distribution (the distribution to the right of the heat map), which gradually collapses over time (the two vertical distributions in the top panel of Figure 3.6).

For the case in which partial information is used in the decision, the expression for the distribution of the positions x of decision processes at time t is given by:

p(x, t) = e^{v(x - z)/s^2} \sum_{n=1}^{\infty} \frac{2}{a} \sin\left(\frac{n\pi z}{a}\right) \sin\left(\frac{n\pi x}{a}\right) e^{-\frac{1}{2}\left(\frac{v^2}{s^2} + \frac{n^2\pi^2 s^2}{a^2}\right) t} \qquad (4)

where s^2 is the diffusion coefficient, z is the starting point, a is the separation between the boundaries, and v is the drift rate. For model fitting, the expression in Equation 4 must be integrated over the normal distribution of drift rates and the uniform distribution of starting points to include variability in drift rate and starting point across trials. This can be accomplished with numerical integration using Gaussian quadrature. The series in

Equation 4 must be summed until it converges; this means that terms have to be added until subsequent terms become so small that they do not affect the total (i.e., the series has converged to within some criterion, e.g., 10^{-5}). Then, to obtain the probability of choosing each response alternative, the proportion of processes between 0 and a/2 (for the negative alternative) and between a/2 and a (for the positive alternative) is calculated by integrating the expression for the density over position.

time-varying processing
Ratcliff (1980) examined two cases in which drift rate changes across the time course of processing. For one, drift rate changes discretely at one fixed time. Because there is an explicit expression for the distribution of evidence at that time, this distribution can be used as a starting distribution for a second diffusion process. If the time at which evidence changes is not a fixed time but has a distribution over time, this can be integrated over. This allows both response signal and regular RT tasks to be modeled. For the other case, boundaries are removed completely and drift rate and the diffusion coefficient vary continuously over time. Only the first case has been used in modeling response signal data (as in Ratcliff, 1988, 2006).

go/nogo task
In the go/no go task, subjects are told to respond for one of the two choices but to make no response for the other choice. Withholding responses for one of the choices is similar to the response signal task, in which responses must be held until the signal. Gomez, Ratcliff, and Perea (2007) proposed that there are two response boundaries for the go/no go task just as for the standard task, but subjects make a response only when accumulated evidence reaches the "go" boundary. Gomez et al. successfully fit the model simultaneously to data from the standard task and data from the go/no go task. They also tested a variant for which there was only one boundary, the "go" boundary, but this variant could not fit the data well.

Application of the diffusion model simultaneously to the standard task and response signal task, or to the standard task and go/no go task, places powerful constraints on the model and, when it is successful, it offers new insights into the cognitive processes involved in these tasks. It also provides theoretical convergence between the three tasks, with two boundaries for all three tasks and withheld responses for the latter two.

The first conclusion is that applying models to multiple tasks simultaneously produces strong constraints on models that (if they successfully account for data) lead to new understanding of how the tasks are performed. In the context of the sequential sampling models discussed in this article, this approach yielded a new view of response signal performance: responses increase in accuracy over time mainly because the proportion of terminated processes increases, and the increase in accuracy does not come entirely from the increasing availability of partial information. Moreover, versions of the models that provide quite good fits to the data from the standard RT and response signal tasks individually would not account for both sets of data simultaneously with parameters that were consistent across tasks.

Optimality
In animal studies, performance has been described in terms of how close it comes to maximizing reward rate. This is part of a larger theme in neuroscience, which reprises the classical signal detection and sequential-sampling literatures, in which reward rate is used as a criterion for understanding whether neural computations approach optimality. For animals, how close performance is to optimal in terms of reward rate is a reasonable question to ask because animals are deprived of water or food and their overwhelming desire is to obtain them. Also, they are trained for many sessions, and so there is ample opportunity to optimize reward. However, when this kind of optimality is translated to human studies, the a priori reasonableness comes into question. This is because humans do not aim to get the most correct per unit time. Instead, they aim to get the most correct in the available time. If one student takes a 2-hour exam and obtains 60% correct in 1 hour, but another student gets 80% correct in 2 hours, the first has more correct per unit time, but the second would be more likely to pass the course.

Bogacz, Brown, Moehlis, Holmes, and Cohen (2006) performed extensive analyses of optimality and set the stage for analyses of data. They showed that optimality as defined by reward rate can be adjusted by changing boundary settings. If the boundaries are too far apart, subjects are accurate but slow, and so there are few correct responses per unit of time. If boundaries are too narrow, RT is short but accuracy is low, and there are few correct responses per unit of time. Thus, there is a boundary setting that maximizes the number

correct per unit of time, and it is possible to test whether subjects set criteria near to this value. Starns and Ratcliff (2012) tested undergraduate subjects on a simple numerosity discrimination task in which different groups of subjects were tested at different levels of difficulty. They were tested in blocks of trials that had a fixed total duration, for which they were instructed to get as many correct as possible in the time allowed, and in blocks of trials in which the number of trials was the same no matter how fast they went. Reward-rate optimality predicts that when difficulty increases, subjects should speed up and sacrifice accuracy. Results showed subjects did the opposite, slowing down with increases in difficulty. This is the result we might expect from years of academic training to spend more time on difficult problems.

Starns and Ratcliff (2010) analyzed several published data sets with young and older adults and found that young adults with accuracy feedback sometimes approached reward-rate optimality, but older adults rarely moved more than a few percent away from asymptotic accuracy. Young adults in the context of psychology experiments (or perhaps with practice with video games, some of which promote speed) will sometimes be able to optimize performance in terms of number correct per unit of time. In general, however, concerns about accuracy that have been trained for years appear to dominate.

Domains of Application
One criterion for how well a model performs is whether it simply reiterates what is already known from traditional analyses. Here we describe a number of applications, in some of which new insights into processing, individual differences, and differences among subject groups are obtained. But in other cases, even when the obvious results are obtained, the model integrates the three dependent variables, namely, accuracy and correct and error RT distributions, into a common theoretical framework that provides explanations of data that many hypothesis-testing approaches do not. Hypothesis-testing approaches usually select only accuracy or only mean RT as the dependent variable. In some cases, the two variables tell the same empirical story, but in other cases, they are inconsistent. The model-based approach helps to resolve such inconsistencies.

Perceptual Tasks
Recently, diffusion models have been applied to psychophysical discrimination tasks in which stimuli are presented very briefly, often at low levels of contrast, sometimes with backward masks to limit iconic persistence. The focus has been to understand the perceptual processes involved in the computation of drift rates. Psychophysical paradigms have historically been used mainly with threshold or accuracy measures, but recent studies have collected accuracy and RT data.

Ratcliff and Rouder (2000) and Smith, Ratcliff, and Wolfgang (2004) found that the diffusion model provided a good account of accuracy and distributions of RT from tasks with brief backward-masked stimuli. They compared the model with a constant drift rate from starting point to boundaries to the model with varying drift rate. Drift rates might be thought to decrease over time if they tracked stimulus information or were governed by a decaying perceptual trace. However, there was no evidence in either study of increased skewness in the RT distributions or very slow error RTs at short stimulus durations, as would have been expected if the decision process had been driven by a decaying perceptual trace. Instead, it appears that the information that drives the decision is relatively durable.

The standard application of the model assumes that, at some point in time after stimulus encoding, the decision process turns on and evidence is accumulated toward a decision. This time is assumed to be the same across conditions, and drift rate is assumed to be at a constant value from the point the process turns on. The assumption of a constant drift rate could be relaxed: Ratcliff (2002) generated predicted accuracy and RT quantiles for several conditions under the assumption that drift rate ramped up from zero to a constant level over 50 ms. He fit the standard model to these predicted values and found that the model fit well with nondecision time increased by 25 ms and with starting point and nondecision time variability increased. Thus, a ramped onset of drift rate over a small time range will be indistinguishable from an abrupt onset.

Smith and Ratcliff (2009) developed a model, the integrated system model, that is a continuous-flow model comprised of perceptual, memory, and decision processes operating in cascade. The perceptual encoding processes are linear filters (Watson, 1986), and the transient outputs of the filters are encoded in a durable form in visual short-term memory (VSTM), which is under the control of spatial attention. The strength of the VSTM trace determines the drift rate for the diffusion

modeling simple decisions and applications 45 process and the moment-to-moment variations in a meeting point between the decision process and trace strength act as a source of noise in the memory, specifically, the drift rate for a test item decision process. Because the VSTM trace in the represents the degree of match between a test item model increases over time (i.e., drift rate is time and memory. varying), predictions for the model are obtained In signal detection approaches to recognition using the integral equation methods described pre- memory, there has been considerable interest in viously (Smith, 2000). The model has successfully the relative standard deviations (SDs) in strength accounted for accuracy and RT distributions in between old and new test items, typically measured tasks with brief backward-masked stimuli. by confidence judgement paradigms. The common The main area of application of the integrated finding is that z-ROC functions (i.e., z-score system model has been to tasks in which spatial transformed receiver operating characteristics) are attention is manipulated by spatial cues. In many approximately linear with a slope less than 1 (e.g., cuing tasks, in which a single well-localized stimulus Ratcliff, Sheu, & Gronlund, 1992). There have is presented in an otherwise empty display, atten- been two interpretations of this finding. One is tion shortens RT but increases accuracy only when a single-process model that assumes the SD of stimuli are masked (Smith, Ratcliff, & Wolfgang, memory strength is normally distributed, but the 2004; Smith, Ellis, Sewell, & Wolfgang, 2010). SD for old items is larger than that for new The model assumes that attention increases the items. 
The other is a dual-process model in which efficiency with which perceptual information is the familiarity of old and new items comes from transferred to VSTM and that masks interrupt normal distributions with equal SDs but there is the process of VSTM trace formation before an additional recollection process (e.g., Yonelinas, it is complete. These two processes interact to 1997). produce a cuing effect in accuracy only when In fits of the diffusion model to recognition stimuli are masked but an unconditional effect memory data, it has been usually assumed that in RT. The model has successfully accounted for the SD in drift rate across trials is the same for the distributions of RT and accuracy in attention studied and new items. Starns and Ratcliff (2014) tasks in which the timing of stimulus localization performed an analysis of existing data sets that is manipulated via onset transients and localizing allowed the across-trial variability in drift rate to markers (Sewell & Smith, 2012). These studies have be different for studied and new items. They found helped illuminate the way in which performance is that the across-trial variability in drift rate was larger determined by perceptual, memory, attention, and (in about 66% of the cases for individual subjects) decision processes acting in concert. for studied items than for new items. It also turned Diederich and Busemeyer (2006) also considered out that the interpretations of the other model the effects of attention on decision-making in parameters did not change when variability was a diffusion-process framework, studying decisions allowed to differ. The advantage of this analysis is about multi-attribute stimuli for which it is plau- that the relative variability of studied and new items sible that people shift their attention sequentially was able to be determined from two-choice data and from one attribute of a stimulus to the next. did not require confidence judgments. 
They assumed that some attributes would provide more information than others and modeled this Lexical Decision successfully as a sequence of step changes in drift Much like recognition memory, a test item for rate during the course of a trial. lexical decision is matched against memory. The output is a value of how “wordlike” the item is. Recognition Memory For sequential sampling models, proposals about One of the early applications of the diffusion how lexical items are accessed in memory must model was to recognition memory. In global provide output values that, when mapped through memory models, a test item is matched against a sequential sampling model, produce RTs and all memory in parallel, and the output is a single accuracy that fit data (Ratcliff, Gomez, & McKoon, value of strength or familiarity (Gillund & Shiffrin, 2004). (Note that there are other models that have 1984; Hintzman, 1986; Murdock, 1982, and integrated RT and accuracy with lexical processes, later, Dennis & Humphreys, 2001; McClelland & in particular, Norris, 2006). Chappell, 1998; Shiffrin & Steyvers, 1997). From Often, lexical decision response time (RT) has this point of view, the diffusion model provides been interpreted as a direct measure of the speed

46 elementary cognitive mechanisms

with which a word can be accessed in the lexicon. For example, some researchers have argued that the well-known effect of word frequency—shorter RTs for higher frequency words—demonstrates the greater accessibility of high frequency words (e.g., their order in a serial search, Forster, 1976; the resting levels of activation in units representing the words in a parallel processing system, Morton, 1969). However, other researchers have argued, as we do here, against a direct mapping from RT to accessibility. For example, Balota and Chumbley (1984) suggested that the effect of word frequency might be a by-product of the nature of the task itself, and not a manifestation of accessibility. In the research presented here, the diffusion model makes explicit how such a by-product might come about.

Semantic and Recognition Priming Effects

For semantic priming, the task is usually a lexical decision. A target word is immediately preceded in a test list either by a word related to it (e.g., cat dog) or some other word (e.g., table dog). For recognition priming, the task is old/new recognition and a target word is immediately preceded by a word that was studied near to it in the list of items to be remembered or far from it. In the diffusion model, the simplest assumption about priming effects is that they result from higher drift rates for primed than unprimed items.

It has been hypothesized that the difference in drift rates between primed and unprimed items arises from the familiarity of compound cues to memory (McKoon & Ratcliff, 1992; McNamara, 1992, 1994; Ratcliff & McKoon, 1988, 1994). The compound cue for an item is a multiplicative combination of the familiarity of the target word and the familiarity of the prime (see examples in Ratcliff & McKoon, 1988). If the prime and target words are related in memory, the combination produces a higher value of the joint familiarity than if they were not related. For primed items, the prime and target share associates in memory, so the joint familiarity would be higher than if the prime and target did not share associates. This model was capable of explaining a number of phenomena in research on priming, including the range of priming, the decay in priming, the onset of priming, and so on.

McKoon and Ratcliff (2012) compared priming in word recognition to associative recognition. Subjects studied pairs of words and then performed either a single-word recognition task or an associative recognition task (see also Ratcliff, Thapar, & McKoon, 2011). For the associative recognition task, subjects decided whether two words of a test pair had or had not appeared in the same pair at study. In the single-word task, some test words were immediately preceded in the test list by the other word of their studied pair (primed) and some by a word from a different pair (unprimed). Data from the two tasks were fit with the diffusion model and the results showed parallel behavior: the drift rates for associative recognition and those for priming were parallel across ages and IQ, indicating that they are based, at least to some degree, on the same information in memory.

Value-Based Judgments

Busemeyer and Townsend (1993) developed a diffusion model called decision field theory to explain choices and decision times for decisions under uncertainty, and later Roe, Busemeyer, and Townsend (2001) extended it to multi-alternative and multi-attribute situations. According to the theory, at each moment in time, options are compared in terms of advantages/disadvantages with respect to an attribute, these evaluations are accumulated across time until a threshold is reached, and the first option to cross the threshold determines the choice that is made. The theory accounts for a number of findings that seem paradoxical from the perspective of rational choice theory. Usher and McClelland (2004) proposed another diffusion model to account for a similar range of findings. Milosavljevic, Malmaud, Huth, Koch, and Rangel (2010) examined several variants of diffusion models for value-based judgments. They found that the standard model with across-trial variability in model parameters provided a good account of data from their paradigm. More recently, Krajbich and Rangel (2011) have used a model similar in character to decision field theory. They examined value-based judgments for food items and had subjects choose which of two alternatives they preferred. They monitored eye fixations and, in modeling, they assumed evidence was accumulated at a higher rate for the fixated alternative. Their model accounted for RTs and accuracy and for the influence of which of the two choices was fixated and for how long.

Philiastides and Ratcliff (2013) examined value-based judgments of consumer choices with brand names presented on some trials as well as the items for which the choices were made. When the quality of the brand name was in conflict with the perceived quality of the item, the probability of choosing the

modeling simple decisions and applications 47

item was lower than when they were consistent. Application of the diffusion model showed that the effect of the brand was to alter drift rate but none of the other parameters of the model. This means that the value of the stimulus and brand name were processed as a whole.

Currently, there is a growing interest in the application of diffusion models to decision-making in marketing and economics, including neuroeconomics. Wide applications of diffusion models in this domain are in their infancy, but the potential for theoretical advancement is great, as is demonstrated by these examples.

Aging

The application of the diffusion model to studies of aging has been especially successful, producing a different view of the effects of aging on cognition than has been usual in aging research. The general finding in the literature has been that older adults are slower than young adults (but not necessarily less accurate) on most tasks, and this has been interpreted as a decline with age in all or almost all cognitive processes. However, application of the diffusion model showed that this is not correct (Ratcliff, Thapar, & McKoon, 2003, 2004, 2006, 2007; Ratcliff, Thapar, Gomez, & McKoon, 2004). For example, Ratcliff, Thapar, and McKoon (2010) tested old and young adults on numerosity discrimination, lexical decision, and recognition memory. What they found is that older adults had slower nondecision times and set their boundaries wider, but their drift rates were not lower than those of young adults. In contrast, in some tasks (associative recognition and letter discrimination), large declines in drift rate with age have been found (Ratcliff et al., 2011; Thapar et al., 2003).

Individual Differences

The diffusion model has been used to examine individual differences. To do so requires that the SDs in model parameters from estimation variability are smaller than the SDs between subjects. In the aging studies described earlier, with about 45 minutes of data collection, individual differences in drift rates, boundary settings, and nondecision time were three to five times larger than the SDs of the model parameters (see Ratcliff & Tuerlinckx, 2002, for tables of SDs in model parameters). Schmiedek, Oberauer, Wilhelm, Süß, and Wittmann (2007) analyzed data from eight choice-RT tasks (including verbal, numerical, and spatial tasks) from Oberauer, Süß, Wilhelm, and Wittmann (2003). They found that drift rates in the diffusion model mapped onto working memory, speed of processing, and reasoning ability measures (each of these was measured by aggregated performance on several tasks).

In aging studies by Ratcliff et al. (2010, 2011), IQs ranged from about 80 to about 140. Applying the model showed that drift rate varied with IQ (by as much as 2:1 for high versus lower IQ subjects) but boundary separation and nondecision time did not. This is the opposite of the pattern for aging. This dissociation provides strong support for the model because it extracts regularity from the three dependent variables (accuracy and correct and error RT distributions).

Individual differences across tasks in model parameters provide strong evidence for common abilities across tasks. In the Ratcliff et al. (2010) study, in the lexical decision, item recognition, and associative recognition tasks, there are strong correlations across subjects in drift rate, and these correlated with IQ as measured by WAIS vocabulary and matrix reasoning. Also, boundary separation correlated across tasks, as did nondecision time. These results show that the diffusion model extracts components of processing that show systematic individual differences across tasks.

Consistent boundary setting across tasks is of special interest because boundary settings are optional: they can be easily changed by instruction (e.g., go fast or be accurate). In most real-life situations, we rarely encounter more than single decisions on a particular stimulus class (except perhaps at Las Vegas or in psychology experiments). This means that there is little chance of adjusting decision criteria in real life because there is little extended experience with a task in which the decision maker can extract statistics from a long sequence of trials in which the structure of the trials does not change. The diffusion model assumes that a decision maker uses this decision mechanism across many tasks, and so we would expect to see correlations in boundary separation across tasks. This is a result that has been obtained whenever the comparison has been made.

Child Development

A natural extension from the aging studies is to test children on similar tasks to those performed with older adults to trace the course of development within the model framework. Ratcliff, Love,

Thompson, and Opfer (2012) tested several groups of children on a numerosity discrimination task and a lexical decision task. The results showed that relative to college age subjects, children's drift rates were lower, boundary separation was larger, and nondecision time was longer. These differences were larger for younger relative to older children. In other laboratories, drift rates have been found to be lower for ADHD and dyslexic children relative to normal controls (ADHD, Mulder et al., 2010; dyslexia, Zeguers et al., 2011). These studies show that the diffusion model can be applied to data collected from children, a domain in which there has been relatively little research with decision models.

Clinical Applications

In research on psychopathology and clinical populations, two-choice tasks are commonly used to investigate processing differences between patients and healthy controls. For highly anxious individuals, it is well established that they show enhanced processing of threat-provoking materials, but this is found reliably only when there are two or more stimuli competing for processing resources, not one. However, when White, Ratcliff, Vasey, and McKoon (2010) applied the diffusion model to the RT and accuracy data from a two-choice lexical decision task with single words that included threatening and control words, they found a consistent processing advantage for threatening words in high-anxious individuals, whereas traditional comparisons showed no significant differences. Because the diffusion model makes use of both RT and accuracy data, it has more power to detect differences among subject populations than simply RT or accuracy alone.

Studies of depression have had somewhat different patterns of results. Depressive symptoms are more closely linked with abnormal emotional processing, with a negative emotional bias in clinical depression, even-handedness (i.e., no emotional bias) in dysphoria, and a positive emotional bias in nondepressed individuals. However, item recognition and lexical decision tasks often fail to produce significant results. White, Ratcliff, Vasey, and McKoon (2009) used the diffusion model to examine emotional processing in dysphoric (i.e., moderately high levels of depressive symptoms) and nondysphoric college students, examining differences in memory and lexical processing of positive and negative emotional words (which were presented among many neutral filler words). They found positive emotional bias in nondysphoric subjects and even-handedness in dysphoric subjects in drift rates. As before, this pattern was not apparent with comparisons of reaction times or accuracy, consistent with previous null findings.

One limitation of these studies and similar ones is that there may be relatively few materials with the right kinds of properties or structures (also in language processing experiments, for example). The emotional word pools for the experiments only contained 30 words each. This left relatively few observations (especially for errors) to use in fitting the diffusion model, which would result in unreliable parameter estimates. To remedy this, the model was fit to all conditions simultaneously, including the neutral filler conditions, which had hundreds of observations. The only parameter that was allowed to vary between the conditions was drift rate. Estimates for the other parameters (e.g., nondecision time and boundary separation) were largely determined by the filler conditions because the fitting method essentially weighted estimation of the parameters common to all conditions by the number of observations for each condition. Thus, the filler conditions largely determined all model parameters except the drift rates for the critical conditions, resulting in an increase in power. The results showed a bias for positive emotional words in the nondysphoric participants, but not in the dysphoric participants (White et al., 2009). This difference in emotional bias was not significant when the diffusion model was fit only to the emotional conditions with few observations, nor was it significant in comparisons of mean RT or accuracy.

Another study examined the effects of aphasia in a lexical decision task. The impairments produce the exaggerated lexical decision reaction times typical of neurolinguistic patients. In diffusion model analyses, decision and nondecision processes were compromised, but the quality of the information upon which the decisions were based did not differ much from that of unimpaired subjects (Ratcliff, Perea, Colangelo, & Buchanan, 2004).

Manipulations of Homeostatic State

Ratcliff and Van Dongen (2009) looked at effects of sleep deprivation with a numerosity discrimination task, van Ravenzwaaij, Dutilh, and Wagenmakers (2012) looked at the effects of alcohol consumption with a lexical decision task, and Geddes et al. (2010) looked at the effects of reduced blood sugar with a numerosity

discrimination task. Applying the model to all of these studies, the main effect was a reduced drift rate but with either small or no effect on boundary separation and nondecision time.

These results show that the diffusion model is useful in providing interpretations of group differences among different subject populations. Furthermore, as noted earlier, the model can be used to examine individual differences (even with only 45 minutes of data collection for a task). This means that this modeling approach, when paired with the right tasks, may have a useful role to play in neuropsychological assessment.

Situations in Which the Standard Model Fails

There are several cases in which the standard diffusion model fails to account for experimental data. These fall into two classes: one involves dynamic noise and categorical stimuli and the other involves conflict paradigms. For both, the main way the model fails is that there are cases for which the onset of the RT distribution (i.e., the leading edge) for one condition is delayed relative to the onset for other conditions.

Ratcliff and Smith (2010) and Smith, Ratcliff, and Sewell (2014) tested letter discrimination, horizontal versus vertical bars discrimination, and Gabor patch orientation discrimination with stimuli degraded with either static noise or with dynamic noise. Noise was implemented by reversing the contrast polarity of some proportion of the pixels (randomly selected) for each of the letter, random bars, and Gabor patch stimuli. For dynamic noise, a different random sample of pixels was chosen on every frame of the display, whereas static noise used a single image with one random sample reversed. Dynamic noise and, to a lesser extent, static noise produced large shifts in the leading edges of the RT distributions. The shapes of the RT distributions were consistent with the model, but increasing noise increased estimates of the nondecision time parameter Ter. This finding is inconsistent with the hypothesis that noise increases RTs simply by reducing the rate at which evidence accumulates in the decision process. Instead, it implies that noise delays the onset of the diffusion process.

Smith, Ratcliff, and Sewell (2014) showed that shifts in onsets can be explained by Smith and Ratcliff's (2009) integrated system model, with the assumption that noise slows the process of forming a stable perceptual representation of the stimulus. In the integrated system model, drift rate and diffusion noise grow in proportion to one another to an asymptote. Unlike the standard model, in which the onset of evidence accumulation is abrupt, the onset of evidence accumulation in the integrated system model is gradual, controlled by the growth of diffusion noise. Smith, Ratcliff, and Sewell (2014) showed that this model could explain the shifts in the onsets of RT distributions found by Ratcliff and Smith (2010).

Smith et al. (2014) also considered a second, release-from-inhibition model, which was motivated, in part, by physiological principles. They modeled release from inhibition using an Ornstein-Uhlenbeck (OU) diffusion process with a time-varying decay coefficient. In the OU process, information accumulation is opposed by a decay term that pulls the process back toward its starting point. The larger the decay, the harder it is for the process to accumulate enough information to reach a criterion and trigger a response. In the standard OU process, decay is proportional to the distance of the process from its starting point but does not vary with time. Smith et al. (2014) assumed that decay was time-locked to the stimulus. At the start of the trial, before a perceptual representation of the stimulus is formed, the decay term is large and the process remains near its starting point with high probability. As stimulus information becomes available, the decay term progressively decreases, allowing information to accumulate in the same way as it does in the standard model. This model was also able to account for data like those reported by Ratcliff and Smith (2010). Because the inhibition process behaves somewhat like the standard model with variable starting point, the release-from-inhibition model was able to account for the fast errors found at high stimulus discriminability in dynamic noise tasks without the assumption of starting-point variability.

Ratcliff and Frank (2012) also found shifts in the leading edges of RT distributions in a reinforcement learning conflict experiment for which the stimuli were three pairs of letters (the same three throughout the experiment). On each trial, one of the pairs of letters was presented in random order and the subject had to choose and respond to one of the letters. One of the letters of the pair was reinforced more often than the other (in this case, reinforcement was simply a "correct" or "incorrect" message). After a training phase, on a small proportion of the trials, letters from different pairs were presented together. When the two letters

were the highly reinforced members of the pairs, they were chosen nearly equally often and there was no slowing of the RT distribution. But when the letters that were reinforced with low probability were presented together, there was a delay in the leading edge of the RT distribution, an average delay of over 100 ms. This was explained in two ways, one in terms of the basal ganglia model of Frank (2006), and one in terms of the diffusion model. For the diffusion model to explain the data, a delay in the onset of the decision process could be used to produce good fits to the data. But this was, to some degree, a redescription of the empirical result. The basal ganglia model explained these conflict trials by an increase in threshold in the neural circuitry. This was linked to the diffusion model by showing that a transient increase in boundary separation was also capable of explaining the result (the delay in onset of the RT distribution). It turned out that an increase in boundary separation with an exponential decay mimics a delayed onset.

White, Ratcliff, and Starns (2011) also found leading edge shifts in a flanker task. In their experiment, a target angle bracket was presented that pointed in the direction of the correct response. On conflict trials, the target bracket was embedded in a string pointing the other way. Again, RT distributions could not be explained with only a difference in drift rates, but a model with drift rate changing over the time course of the decision, starting by being dominated by the flankers and then gradually focusing on the central symbol, was successful.

All of these paradigms suggest that, in these conflict situations, drift rate is not stationary over time. It is necessary to go beyond the basic decision model and begin to integrate it with models of perceptual and cognitive processing.

Competing Two-Choice Models

The diffusion model described to this point is one of a class of sequential sampling models that share many features. They all give the same interpretations of the effects of independent variables, interpretations that are consistent across the models (e.g., Donkin, Brown, Heathcote, & Wagenmakers, 2011; Ratcliff, Thapar, Smith, & McKoon, 2005). This means, for example, that the effects of aging on model components are the same, whichever model is used.

The leaky competing accumulator (LCA) model (Usher & McClelland, 2001) was developed as an alternative to the diffusion model. Part of the motivation was to implement neurobiological principles that the authors believed should be incorporated into RT models, especially mutual inhibition mechanisms and decay of information across time.

In the LCA model, like the diffusion model, information is accumulated continuously over time. There are two accumulators, one for each response, as shown in Figure 3.8, and a response is made when the amount of information in one of the accumulators reaches its criterion amount. The rate of accumulation, the equivalent of drift rate in the diffusion model, is a combination of three components. The first is the input from the stimulus (v), with a different value for each experimental condition. If the input to one of the accumulators is v, the input to the other is 1−v, so that the sum of the two rates is 1. The second component is decay in the amount of accumulated information, k, with the size of decay growing as the amount of information in the accumulator grows, and the third is inhibition from the other accumulator, β, with the amount of inhibition growing as the amount of information in the other accumulator grows. If the amount of inhibition is large, the model exhibits features similar to the diffusion model because an increase in accumulated information for one of the response choices produces a decrease for the other choice.

Fig. 3.8 An illustration of the leaky competing accumulator. The model includes an inhibition term (−βxj) in which the increment to evidence in accumulator i is reduced as a function of activity in the other accumulator (xj), and a decay term (−kxi) in which the increment to evidence is reduced as a function of activity in the accumulator itself. The decision criteria for the two accumulators are c1 and c2, the accumulation rates are v1 and v2 (v1 + v2 = 1), and there is variability in the starting points that is uniformly distributed across trials with range sz. Variability in processing within a trial is normally distributed with standard deviation σ.
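The transient-boundary account of the conflict-trial data lends itself to a direct simulation. The sketch below is illustrative only (the parameter values are not those fitted by Ratcliff and Frank, 2012): a two-choice diffusion in which, on conflict trials, boundary separation starts elevated and decays exponentially back to its baseline. The delayed leading edge of the RT distribution falls out of the dynamics.

```python
import math
import random

def trial(v=0.25, a0=0.10, boost=0.08, tau=0.15, conflict=False,
          s=0.1, dt=0.001):
    """Two-choice diffusion with evidence x measured relative to the
    midpoint between the boundaries. On conflict trials, boundary
    separation is a0 + boost * exp(-t / tau) instead of the constant a0,
    so the boundaries start wide and collapse back to baseline."""
    x, t = 0.0, 0.0
    step_sd = s * math.sqrt(dt)  # within-trial noise per Euler step
    while True:
        a = a0 + (boost * math.exp(-t / tau) if conflict else 0.0)
        if abs(x) >= a / 2.0:
            return (x > 0.0), t  # (hit upper boundary?, decision time)
        x += v * dt + random.gauss(0.0, step_sd)
        t += dt

random.seed(5)
rt_plain = sorted(trial()[1] for _ in range(800))
rt_conflict = sorted(trial(conflict=True)[1] for _ in range(800))
# Leading edge (10th percentile): delayed on conflict trials, mimicking
# a shift in the onset of the decision process.
lead_plain = rt_plain[80]
lead_conflict = rt_conflict[80]
```

Because the elevated boundary has largely collapsed by the time slower trials terminate, the manipulation shifts the fast end of the distribution much more than the tail, which is the empirical signature described above.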

Just as in the diffusion model, the accumulation of information is assumed to be variable over the course of a trial, with a normal distribution with standard deviation σ. Because of the decay and inhibition in the accumulation rates, the tails of RT distributions are longer than they would be if produced without these factors (cf. Smith & Vickers, 1988; Vickers, 1970, 1979; Vickers, Caudrey, & Willson, 1971), which leads to good matches with the skewed shape of empirical distributions.

The expression for the change in the amount of accumulated information at time t in accumulator i is:

\[
\Delta x_i = \Big( v_i - k x_i - \beta \sum_{j \neq i} x_j \Big)\,\Delta t + \sigma \eta_i \sqrt{\Delta t}, \qquad i = 1, 2 \tag{5}
\]

The amount of accumulated information is not allowed to take on values below zero, so if it is computed to be below zero, it is reset to zero. This is theoretically equivalent to constraining the diffusion process with a reflecting boundary at zero.

The LCA model without across-trial variability in any of its components predicts errors slower than correct responses. To produce errors faster than correct responses, and the crossover pattern such that errors are faster than correct responses for easy conditions and slower for difficult conditions, Usher and McClelland assumed variability in the accumulators' starting points, just as is assumed in the diffusion model and by Laming (1968).

In the diffusion model, moving a boundary position is equivalent to moving the starting point. Moving the starting point an amount y toward one boundary is the same as moving that boundary an amount y toward the starting point and the other boundary an amount y away from the starting point. In the LCA model, changing the starting point is not equivalent to changing a boundary position because decay is a function of the distance of the accumulated amount of evidence from zero. Increasing the starting point by an amount y increases decay by an amount proportional to y, but with the starting point at zero, reducing the boundary by y has no effect on decay. Usher and McClelland (2001) implemented variability in starting point by assuming rectangular distributions of the starting points with minimums at zero.

No explicit solution is known for the pair of coupled equations in Eq. 5 when they are constrained by decision criteria and the requirement that the accumulated information remain positive. Thus, as in Usher and McClelland (2001), predictions from the model are obtained by simulation. There have been several analyses of this model. Bogacz et al. (2006) showed that the model could be reduced to a single diffusion process if leak and inhibition were balanced, and examined notions of optimality (but see van Ravenzwaaij, van der Maas, & Wagenmakers, 2012).

The linear ballistic accumulator (LBA; Brown & Heathcote, 2008) is similar to the LCA in that it uses two accumulators, but it has no within-trial variability, no decay, and no inhibition. The model assumes that the rate of evidence accumulation and the starting point for accumulation both vary randomly from trial to trial, but that the process of evidence accumulation itself is noise free. In essence, the model assumes that there is noise in the central nervous system on long, between-trial time scales, but none on the short, moment-to-moment time scales that govern evidence accumulation within a trial. This assumption appears incompatible with the single-cell recording literature that has linked processes of evidence accumulation with neural firing rates in the oculomotor control system, because such neural spike trains are typically noisy. To reconcile these kinds of data with noiseless evidence accumulation requires an argument to the effect that individual neurons are noisy but the neural ensemble as a whole is effectively noise free. However, it is not clear that firing rates in weakly coupled networks of neurons exhibit the kinds of central limit theorem type properties that this argument requires (Zohary, Shadlen, & Newsome, 1994), and so the status of the central limit argument is unclear.

Multichoice Decision-Making and Confidence Judgments

Recently, interest in the neuroscience domain in multichoice decision-making tasks has developed for visual search (Basso & Wurtz, 1998; Purcell et al., 2010) and motion discrimination (Niwa & Ditterich, 2008; Ditterich, 2010). In psychology, there have been investigations using generalizations of standard two-choice tasks (Leite & Ratcliff, 2010) and in absolute identification (Brown, Marley, Donkin, & Heathcote, 2008). In addition, confidence judgments in decision-making and memory tasks are multichoice decisions, and diffusion models are being applied in these domains (Pleskac & Busemeyer, 2010; Ratcliff & Starns, 2009, 2013; Van Zandt, 2002).
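Because no explicit solution to Eq. 5 is known under these constraints, predictions are obtained by simulating the coupled equations directly. The sketch below does this with the reflecting floor at zero; the parameter values are illustrative choices for demonstration, not Usher and McClelland's fitted values.

```python
import math
import random

def lca_trial(v1=0.6, k=4.0, beta=4.0, c=0.3, sigma=0.5, dt=0.01):
    """One trial of the leaky competing accumulator (Eq. 5) with two
    accumulators. Inputs sum to 1 (v2 = 1 - v1); k is the decay rate on
    an accumulator's own activity and beta is inhibition from the other
    accumulator. Accumulated information is floored at zero (the
    reflecting boundary); the first accumulator to reach criterion c
    determines the response."""
    x1 = x2 = 0.0
    t = 0.0
    step_sd = sigma * math.sqrt(dt)  # within-trial Gaussian noise
    while x1 < c and x2 < c:
        new_x1 = x1 + (v1 - k * x1 - beta * x2) * dt \
            + random.gauss(0.0, step_sd)
        new_x2 = x2 + ((1.0 - v1) - k * x2 - beta * x1) * dt \
            + random.gauss(0.0, step_sd)
        x1, x2 = max(0.0, new_x1), max(0.0, new_x2)  # reset below zero
        t += dt
    return (1 if x1 >= c else 2), t

random.seed(6)
outcomes = [lca_trial() for _ in range(2000)]
p_resp1 = sum(1 for resp, _ in outcomes if resp == 1) / len(outcomes)
mean_rt = sum(rt for _, rt in outcomes) / len(outcomes)
```

With k and beta this large, the mutual inhibition gives near winner-take-all dynamics: once one accumulator pulls ahead, it suppresses the other toward the floor, producing the diffusion-like trade-off between the two totals described above.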

52 elementary cognitive mechanisms

It is clear that there is no simple way to extend the two-choice model to tasks with three or more choices. But models with racing accumulators can be extended. Some models with racing accumulators become standard diffusion models when the number of choices is reduced to two.

Ratcliff and Starns (2013) proposed a model for confidence judgments in recognition memory tasks that uses a multiple-choice diffusion decision process with separate accumulators of evidence for each confidence choice. The accumulator that first reaches its decision boundary determines which choice is made. Ratcliff and Starns compared five algorithms for accumulating evidence and found that one of them produced choice proportions and full RT distributions for each choice that closely matched empirical data. With this algorithm, an increase in the evidence in one accumulator is accompanied by a decrease in the others, with the total amount of evidence in the system being constant.

Application of the model to the data from an earlier experiment (Ratcliff, McKoon, & Tindall, 1994) uncovered a relationship between the shapes of z-ROC functions and the behavior of RT distributions. For low-proportion choices, the RT distributions were shifted by as much as several hundred milliseconds relative to high-proportion choices. This behavior and the shapes of z-ROC functions were both explained in the model by the behavior of the decision boundaries.

For generality, Ratcliff and Starns (2013) also applied the decision model to a three-choice motion discrimination task in which one of the alternatives was the correct choice on only a low proportion of trials. As for the confidence judgment data, the RT distribution for the low-probability alternative was shifted relative to the higher-probability alternatives. The diffusion model with constant evidence accounted for the shift in the RT distribution better than a competing class of models.

Research on multichoice decision making, including confidence judgments, is a growing industry, but the constraints provided by RT distributions and response proportions for the different choices make the modeling quite challenging.

One-Choice Decisions
Relatively little work has been done recently on one-choice decisions. In these, there is only one key to press when a stimulus is detected. Ratcliff and Van Dongen (2011) tested a model that used a single diffusion process to represent the process of accumulating evidence. The main application was to the psychomotor vigilance task (PVT), for which a millisecond timer is displayed on a computer screen and starts counting up at intervals between 2 and 12 s after the subject's last response. The subject's task is to hit a key as quickly as possible to stop the timer. When the key is pressed, the counter is stopped, and the RT in milliseconds is displayed for 1 s.

In single-choice decision-making tasks, the data are a distribution of RTs for hitting the response key. The one-choice diffusion model assumes that evidence begins accumulating on presentation of a stimulus until a decision criterion is hit, upon which a response is initiated (Figure 3.9 illustrates the model). In the model, drift rate is assumed to vary from trial to trial. This relates it to the standard two-choice model, which makes this assumption to fit the relative speeds of correct and error responses. In application of the one-choice model to sleep deprivation data, across-trial variability in drift rate was needed to produce the long tails observed in the RT distributions.

Ratcliff and Van Dongen (2011) fit the model to RT distributions and their hazard functions from experiments with the PVT with over 2,000 observations per RT distribution per subject. With only changes in drift rate, they found that the model accounted for changes in the shape of RT distributions. In particular, changes in drift rate accounted for the change in hazard function shape, moving from a high tail under no sleep deprivation to a low tail with sleep deprivation. They also fit data for which the PVT was tested every 2 hours for 36 hours of sleep deprivation and found that drift rate was closely related to an independent measure of alertness, which provides an external validation of the model.

Neuroscience
One of the major advances in understanding decision making is in neuroscience applications using single cell recording in monkeys (and rats) and human neuroscience, including fMRI, EEG, and MEG. All these domains have had interactions between diffusion models and neuroscience measures.

Hanes and Schall (1996) made the first connection between theory and single cell recording data, and this was taken up in work by Shadlen and colleagues (e.g., Gold & Shadlen, 2001).
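The one-choice model discussed in the previous section can be sketched in a few lines of simulation code. The sketch below is a minimal illustration, not the fitted model of Ratcliff and Van Dongen (2011): all parameter values are assumed for illustration, and trials that have not terminated by a cutoff are simply censored there.

```python
import random
import statistics

def one_choice_rt(v_mean=0.35, eta=0.1, a=0.1, ter=0.3,
                  s=0.1, dt=0.001, max_t=5.0):
    """Simulate one trial of a one-choice diffusion model (illustrative).

    Drift is sampled once per trial from a normal distribution with mean
    v_mean and SD eta (across-trial drift variability); evidence then
    accumulates with Gaussian noise (diffusion coefficient s) until it
    reaches the criterion a. ter is nondecision time. Trials that have
    not terminated by max_t are censored at max_t.
    """
    v = random.gauss(v_mean, eta)        # across-trial variability in drift
    x, t = 0.0, 0.0
    step_sd = s * dt ** 0.5              # within-trial noise per time step
    while x < a and t < max_t:
        x += v * dt + random.gauss(0.0, step_sd)
        t += dt
    return ter + t

random.seed(1)
rts = [one_choice_rt() for _ in range(2000)]
# Occasional low-drift trials produce the long right tail described in
# the text: the mean RT exceeds the median RT.
print(statistics.mean(rts) > statistics.median(rts))
```

With eta set to 0 the distribution is still right-skewed (a first-passage-time property), but increasing eta stretches the tail, which is the role across-trial drift variability plays in fitting the sleep deprivation data.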

modeling simple decisions and applications 53

Fig. 3.9 An illustration of the one-choice diffusion model. Evidence is accumulated at a drift rate v with SD across trials η, until a decision criterion at a is reached after time Td. Additional processing times include stimulus encoding time Ta and response output time Tb; these sum to nondecision time Ter, which has uniform variability across trials with range st.

monkey neurophysiology
In both psychology and neuroscience, theories of decision processes have been developed that assume that evidence is gradually accumulated over time (Boucher, Palmeri, Logan, & Schall, 2007; Churchland, Kiani, & Shadlen, 2008; Ditterich, 2006; Gold & Shadlen, 2001, 2007; Grinband, Hirsch, & Ferrera, 2006; Hanes & Schall, 1996; Mazurek, Roitman, Ditterich, & Shadlen, 2003; Platt & Glimcher, 1999; Purcell et al., 2010; Ratcliff, Cherian, & Segraves, 2003; Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007; Roitman & Shadlen, 2002; Shadlen & Newsome, 2001). In these studies, cells in the lateral intraparietal cortex (LIP), frontal eye field (FEF), and superior colliculus (SC) exhibit behavior that corresponds to a gradual buildup in activity that matches the buildup in evidence in making simple perceptual decisions (see also Munoz & Wurtz, 1995; Basso & Wurtz, 1998). The neural populations that exhibit buildup behavior in LIP, FEF, and SC prior to a decision have been studied extensively. There is debate about where exactly the accumulation takes place, but it is clear that (at least) these three structures are part of a circuit that is involved in implementing the decision. These studies so far support the notion that there is a flow of information from LIP to FEF and then to SC prior to a decision.

In modeling the neurobiology of the decision process, there are a number of models applied to a range of different tasks. They all have the common theme that they assume evidence is accumulated to a decision criterion, or boundary, and that accumulated evidence corresponds to activity in populations of neurons corresponding to the decision alternatives. The models considered here have been explicitly proposed as models of oculomotor decision making in monkeys or argued to describe the evidence accumulation process in humans or monkeys. The models fall into several classes (Ratcliff & Smith, 2004; Smith & Ratcliff, 2004), including those that assume accumulation of a single evidence quantity taking on positive and negative values (Gold & Shadlen, 2000, 2001; Ratcliff, 1978; Ratcliff et al., 2003; Ratcliff, Van Zandt, & McKoon, 1999; Smith, 2000) and those that assume that evidence is accumulated in separate accumulators corresponding to separate decisions (Churchland et al., 2008; Ditterich, 2006; Mazurek et al., 2003; Ratcliff et al., 2007; Usher & McClelland, 2001). In this latter class of models, accumulation can be independent in separate accumulators, or it can be interactive, so that as evidence grows in one accumulator, it inhibits evidence accumulation in the other accumulator. The single accumulator model can be seen as implementing perfect inhibition, because a positive increment toward one boundary is an increment away from the other boundary.

The models with separate accumulators have the advantage that the two accumulators can be used to represent growth of activity in the populations of neurons corresponding to the two decisions. In the single diffusion process models, if the single process represented the aggregate activity in the two populations, then the growth of activity in the two populations would have to be perfectly negatively correlated. This is plausible if the resting activity level is relatively high in the neural populations (e.g., Roitman & Shadlen, 2002), but it is less plausible in populations in which the resting level is low (Hanes & Schall, 1996; Ratcliff et al., 2007).
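The racing-accumulator idea just described can be sketched with a minimal simulation. The sketch below follows the dual diffusion scheme discussed later in this section (competition at the input, with the two drift rates summing to a constant and no inhibition between accumulators); all parameter values and the `p_bright` stimulus coding are illustrative assumptions, not fitted values.

```python
import random

def dual_race(p_bright=0.7, a=0.1, s=0.1, dt=0.001, v_scale=0.4, max_t=2.0):
    """One trial of a racing-accumulator ("dual diffusion") decision.

    Two accumulators gather evidence for the 'bright' and 'dark'
    responses. Competition is at the input: the two drift rates sum to
    the constant v_scale, with no inhibition between accumulators.
    Parameter values are illustrative, not fitted.
    """
    v1 = v_scale * p_bright           # drift toward 'bright'
    v2 = v_scale * (1.0 - p_bright)   # drift toward 'dark'; v1 + v2 = v_scale
    x1 = x2 = t = 0.0
    step_sd = s * dt ** 0.5
    while t < max_t:
        x1 += v1 * dt + random.gauss(0.0, step_sd)
        x2 += v2 * dt + random.gauss(0.0, step_sd)
        t += dt
        if x1 >= a or x2 >= a:
            return ("bright" if x1 >= x2 else "dark"), t
    return None, max_t                # deadline reached without a decision

random.seed(2)
easy = sum(dual_race(p_bright=0.9)[0] == "bright" for _ in range(500))
hard = sum(dual_race(p_bright=0.6)[0] == "bright" for _ in range(500))
print(easy, hard)   # correct 'bright' decisions out of 500: easy beats hard
```

A single signed accumulator would instead track x1 - x2, which makes the perfect-inhibition point above concrete: every increment toward one boundary of the difference process is an increment away from the other.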

However, the two classes of models largely mimic each other at a behavioral level (Ratcliff, 2006; Ratcliff & Smith, 2004), and although models with racing diffusion processes seem to be superior in application to oculomotor responses in monkeys, this does not rule out the viability of the single accumulator model for human behavioral and neural data (Philiastides, Ratcliff, & Sajda, 2006; Ratcliff et al., 2009).

Ratcliff et al. (2007; see also Ratcliff, Hasegawa, et al., 2011) applied a dual diffusion model to a brightness discrimination task. In the dual diffusion model, evidence for the two responses is accumulated by a pair of racing diffusion processes. In Ratcliff et al.'s model, there was competition at input (drift rates summed to a constant) but no inhibition (i.e., Figure 3.8 without the inhibition). Two rhesus monkeys were required to make a saccade to one of two peripheral choice targets based on the brightness of a central stimulus. Neurons in the deep layers of the SC exhibited robust presaccadic activity when the stimulus specified a saccade toward a target within the neuron's response field, and the magnitude of this activity was unaffected by level of difficulty. Activity following brightness stimuli specifying saccades to targets outside the response field was affected by task difficulty, increasing as the task became more difficult, and this modulation correlated with performance accuracy. The model fit the full complexity of the behavioral data, accuracy and RT distributions for correct and error responses, over a range of levels of difficulty. Using the parameters from the fits to behavioral data, simulated paths of the process were generated, and these provided numerical predictions for the behavior of the firing rates in SC neurons that matched most but not all of the effects in the data.

Simulated paths from the model were compared to neuron activity. The assumption linking the paths to the neuron data is that firing rate is linearly related to position in the accumulation process: the nearer the decision process is to the boundary, the higher the firing rate. The firing rate data show delayed availability of discriminative information for fast, intermediate, and slow decisions when activity is aligned on the stimulus, and very small differences in discriminative information when activity is aligned on the saccade. The model produces exactly these patterns of results. The accumulation process is highly variable, allowing the process both to make errors, as is the case for the behavioral performance, and also to account for the firing rate results. Figure 3.10 shows sample results for the firing rate functions (black lines) and predicted firing rates (red lines).

There have also been significant modeling efforts to relate models based on spiking neurons to diffusion models (e.g., Deco, Rolls, Albantakis, & Romo, 2013; Roxin & Ledberg, 2008; Wong & Wang, 2006). Smith (2010) made an explicit connection between diffusion processes at a macro, behavioral level and shot noise processes at a slightly abstract neural level.

Smith (2010) sought to show how diffusive information accumulation at a behavioral level could arise by aggregating neural firing rate processes. He modeled the representation of stimulus information at the neural level as the difference between excitatory and inhibitory Poisson shot noise processes. The shot noise process describes the cumulative effects of a number of time-varying disturbances or perturbations, each initiated by a point event, with the point events arriving according to a Poisson process. The discrete pulses are assumed to decay exponentially, and so, in time, some of these decaying traces add, and this sum is the shot noise process (e.g., Figure 3.1, Smith, 2010). In his model, the disturbances represent the flux in postsynaptic potentials in a cell population in response to a sequence of action potentials. Smith showed that the time integral of such Poisson shot-noise pairs follows an integrated Ornstein-Uhlenbeck process, whose long-time-scale statistics are very similar to those assumed in the standard diffusion model. His analysis showed how diffusive information at a behavioral level could arise from Poisson-like representations at the neural level.

Subsequently, Smith and McKenzie (2011) investigated a simple model of how long-time-scale information accumulation could be realized at a neural level. Wang (2002) previously argued that models of decision making require information integration on a time scale that is an order of magnitude greater than any integration process found at a neural level. He argued that the most plausible substrate for such long-time-scale integration is persistent activity in reverberation networks. Smith and McKenzie considered a very simple model of a recurrent loop in which spikes cycle around the loop with exponentially distributed cycle times and new spikes are added by superposition. The activity in the loop could, therefore, be modeled as a superposition of Poisson processes.

They showed that a model based on such recurrent loops could realize the kind of long-time-scale integration process described by Wang and that it, too, exhibited a form of diffusive information accumulation that closely matches what is found behaviorally. In particular, the resulting model successfully predicted the RT distributions and choice probabilities from a signal detection experiment reported by Ratcliff and Smith (2004).
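A toy version of the shot noise construction is easy to write down. In the sketch below (illustrative parameters, not Smith's, and a Bernoulli approximation to the Poisson arrivals), point events each launch an exponentially decaying pulse; the shot noise is the superposition of the pulses, and its time integral is the quantity that behaves like an integrated Ornstein-Uhlenbeck process over long time scales.

```python
import math
import random
import statistics

def integrated_shot_noise(rate=800.0, tau=0.01, h=1.0, dt=0.0001, t_max=0.5):
    """Time integral of a Poisson shot noise process (toy sketch).

    Point events arrive at Poisson rate `rate` (approximated here by at
    most one event per small time step); each event launches a pulse of
    height h that decays exponentially with time constant tau. The shot
    noise x(t) is the superposition of the decaying pulses; the function
    returns the time integral of x(t) over [0, t_max].
    """
    x = integral = 0.0
    decay = math.exp(-dt / tau)
    for _ in range(int(t_max / dt)):
        if random.random() < rate * dt:   # Bernoulli approximation to Poisson arrivals
            x += h
        x *= decay                        # all active pulses decay together
        integral += x * dt
    return integral

random.seed(3)
samples = [integrated_shot_noise() for _ in range(200)]
# By Campbell's theorem, the stationary mean of the shot noise is
# rate * h * tau = 8, so the 0.5-s integral should average roughly 4.
print(round(statistics.mean(samples), 2))
```

Taking the difference of an excitatory and an inhibitory copy of this process, as in Smith's model, gives a signed evidence stream whose integral plays the role of the accumulating diffusion path.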

[Figure 3.10 appears here: panels show activity for correct and error responses (spikes/s) over 0-400 ms from stimulus onset, aligned on the stimulus, for easy and difficult conditions.]
Fig. 3.10 Neural firing rates averaged over cells, for firing rates aligned on the stimulus, for the two monkeys from Ratcliff, Hasegawa, et al. (2007). The firing rates are divided into thirds as a function of the behavioral response (fastest third, middle third, and slowest third). The left-hand column shows easy conditions (bright responses to 98% white pixels and dark responses to 98% black pixels) and the right-hand column shows difficult conditions (bright responses to 55% white pixels and dark responses to 55% black pixels). The first row shows firing rates for cells with the target corresponding to the correct response in their receptive field when the correct response is made (target cell). The second row shows firing rates for cells with the target corresponding to the incorrect response for the stimulus in their receptive field when a correct response is made (competitor cell). The solid lines are the data and the dashed lines are model predictions.

Human Neuroscience
Diffusion models are currently being combined with fMRI and EEG techniques to look for stimulus-independent areas that implement decision making (e.g., vmPFC; Heekeren, Marrett, Bandettini, & Ungerleider, 2004) and to map diffusion model parameters onto EEG signals (Philiastides et al., 2006).

eeg support for across-trial variability in drift rate
Philiastides, Ratcliff, and Sajda (2006) used a face/car discrimination task with briefly presented degraded pictures. They recorded EEGs from multiple electrodes during the task and then weighted and combined the electrical signals to obtain a single number, or regressor, that best discriminated between faces and cars. This was repeated over 60-ms windows from stimulus onset on up. The single-trial regressor was significant at two times, around 180 ms and around 380 ms. Ratcliff, Philiastides, and Sajda (2009) reasoned that, if the regressor was an index of difficulty, then in each condition of the experiment, responses could be sorted into those that the electrical signal said were more facelike and those that were more carlike. When responses were sorted and the diffusion model fit to the two halves of each condition, the drift rates for the two halves differed substantially, but only for the later component at 380 ms.

The diffusion model provides an estimate of nondecision time, which represents the duration of encoding and stimulus transformation processes prior to the decision time (as well as response output processes). This estimate shows that the decision process begins no earlier than 400 ms after stimulus onset, and so the late EEG signal component indexes difficulty on a trial-to-trial basis prior to the onset of the decision process.
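The logic of this regressor-sorting analysis can be mimicked in simulation. In the hedged sketch below (all parameters illustrative), the drift on each trial is drawn from a normal distribution, and the trial's true drift stands in for the single-trial EEG regressor; a median split on it separates trials with genuinely different drift rates, which shows up directly in accuracy, just as the drift rates fitted to the two sorted halves differed in the experiment.

```python
import random
import statistics

def two_choice_trial(v, a=0.12, z=0.06, s=0.1, dt=0.001, max_t=3.0):
    """One two-choice diffusion trial: evidence starts at z and runs
    between boundaries 0 and a; returns (correct, decision_time)."""
    x, t = z, 0.0
    step_sd = s * dt ** 0.5
    while 0.0 < x < a and t < max_t:
        x += v * dt + random.gauss(0.0, step_sd)
        t += dt
    return x >= a, t

random.seed(4)
trials = []
for _ in range(1000):
    v = random.gauss(0.25, 0.15)      # across-trial drift variability
    correct, _ = two_choice_trial(v)
    trials.append((v, correct))

# Median-split on each trial's drift, standing in for the single-trial
# EEG regressor (an index of evidence quality on that trial).
med = statistics.median(v for v, _ in trials)
hi = [c for v, c in trials if v >= med]
lo = [c for v, c in trials if v < med]
print(sum(hi) / len(hi), sum(lo) / len(lo))   # high half is more accurate
```

If drift did not vary across trials, the two halves of such a split would not differ systematically, which is why the sorted-halves result counts as evidence for across-trial drift variability.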

Therefore, these two features of the late component provide evidence that drift rate varies from trial to trial.

eeg support for across-trial variability in starting point
Bode, Sewell, Lilburn, Forte, Smith, and Stahl (2012) reported EEG evidence consistent with trial-to-trial biasing of the starting point of the diffusion process. They recorded EEG activity in a task requiring discrimination between briefly presented images of chairs or pianos that were presented in varying levels of noise and then backward masked. They applied a support vector machine pattern classifier to the EEG signals at successive time points and showed that decisions could be decoded (i.e., predicted) from the EEG several hundred milliseconds before the behavioral response. When the stimulus display contained only noise and no discriminative information, the decision outcome could still be predicted from the EEG, but only from the activity prior to stimulus presentation and not from any later time points. Bode et al. found that the RT distributions and accuracy in their task were well described by a diffusion model in which the starting point for evidence accumulation was biased toward the upper or lower boundary, depending on the participant's previous choice history. They proposed that the information in the prestimulus EEG was a neural correlate of the process of setting the starting point, which occurs prior to the start of evidence accumulation. When the display contained no stimulus information and the drift of the diffusion process was zero, the primary determinant of the decision outcome would be the participant's bias state: processes starting near the upper boundary would be more likely to terminate at that boundary, and similarly for the lower boundary.

structural mri
Studies that have examined structural connections between brain areas implicated in the control of decision making have found correlations between tract strength and decision-making variables. Forstmann et al. (2010) found a relationship between cortico-striatal connection strength and the ability of subjects to change their speed-accuracy tradeoff settings. Mulder, Boekel, Ratcliff, and Forstmann (2014) found correlations between subjects' ability to bias their responses in response to reward and vmPFC-STN connection strength. These studies are the beginning of a new approach to brain structure and processing.

fmri
A major problem with attempts to relate results from fMRI measurements to the growth of activity in decision-related brain areas is the sluggishness of the BOLD response. Despite this, there are many studies that use diffusion models in analyses of fMRI data. Mulder, Van Maanen, and Forstmann (2014) reviewed a number of studies of perceptual decision making using fMRI methods and found evidence for regions associated with different components of diffusion models. Although there was some convergence, maps of the peak coordinates of the activity for model components showed quite a large scatter across areas. This research would require a chapter by itself, but the notion that some brain areas accumulate noisy evidence from other areas is certainly a mainstream belief in neuroscience, and diffusion models are one theoretical framework that relates the neural to the behavioral level.

Conclusions
The use of diffusion models to represent simple decision making in a variety of domains is an area of research that is seeing significant advances. The view that evidence is accumulated over time to decision criteria seems settled. The competing models seem to produce about the same conclusions about processing within experimental paradigms, and so broad interpretations do not depend on the specific model being used. In psychological applications, the basic theory and experimental applications are well established and somewhat mature. But applications to individual differences (including neuropsychological testing) and to different subject and patient populations are in their infancy. Also, neuroscience applications in both experimental and theoretical research are blossoming, with a variety of experimental methods being used as well as a variety of variants on the basic models developed in psychology.

Author Note
Preparation of this chapter was supported by grants NIA R01-AG041176, AFOSR grant FA9550-11-1-0130, IES grant R305A120189, and ARC Discovery Grant DP140102970.

Glossary

Accumulator Model: A model in which positive increments are continuous random variables and the times at which the increments are made are discrete. The accumulators race to separate decision criteria.

Confidence Judgments: Tasks in which responses are made on a discrete scale using different response keys.

Decision Boundaries: These represent the amount of evidence needed to make a decision.

Decision criteria: The amount of evidence for one or the other alternative needed to make a decision. In diffusion models, the criteria are represented as boundaries on the evidence space.

Diffusion Model: A model that assumes continuously available evidence in continuous time. Evidence accumulates in one signed sum, and the process terminates when one of two decision criteria is reached.

Diffusion Process: A process in which continuously variable noisy evidence is accumulated in continuous time.

Drift rate: The average rate at which a diffusion process accumulates evidence.

Go/Nogo Tasks: Tasks in which subjects respond to one stimulus type but withhold their response until a timeout for the other stimulus type.

Leaky Competing Accumulator Model: A model in which evidence is continuously available in continuous time. Evidence is accumulated in separate accumulators (i.e., separate diffusion processes), and there is both decay in an accumulator and inhibition from other accumulators.

Nondecision Time: The duration of processes other than the decision process. These include encoding time, response output time, memory access time in memory tasks, and the time to transform the stimulus representation into a decision-based representation for perceptual tasks.

Optimality: Often defined in terms of "reward rate," or the number correct per unit time, in simple decision-making experiments, by analogy with animal experiments.

Ornstein-Uhlenbeck diffusion process: This describes a noisy evidence accumulation process with leakage or decay; the standard (Wiener or Brownian motion) diffusion process describes a process in which there is no leakage.

Poisson Counter Model: A model in which increments are discrete equal-sized units, but the times at which they arrive at the accumulators are Poisson distributed (exponential delays between counts).

Poisson shot noise process: A process in which each point event in a Poisson process generates a continuous, time-varying disturbance or perturbation. The shot noise process is the cumulative sum of the perturbations. It has been used as a model for a variety of phenomena, including the flow of electrons in vacuum tubes, the cumulative effects of earth tremors, and the flux in the postsynaptic potential in cell bodies in a neural population.

PVT: The psychomotor vigilance test, in which a counter starts counting up and the subject simply hits a key to stop it counting.

Random walk model: A discrete-time counterpart of the diffusion process. A diffusion process accumulates evidence in continuous time, whereas a random walk accumulates evidence at discrete time points.

Response Signal and Deadline Tasks: Tasks in which the subject is required to respond at an experimenter-determined time. The dependent variable is usually accuracy, and the task measures how it grows over time in the decision process.

Response Time Distributions: The distributions of times at which the decision process terminates (i.e., a histogram of times for data).

Single Cell Recording in Animals: Recordings from single neurons, often in awake, behaving animals.

References
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10, 340–357.
Basso, M. A., & Wurtz, R. H. (1998). Modulation of neuronal activity in superior colliculus by changes in target probability. Journal of Neuroscience, 18, 7519–7534.
Bode, S., Sewell, D. K., Lilburn, S., Forte, J. D., Smith, P. L., & Stahl, J. (2012). Predicting perceptual decisions from early brain activity. Journal of Neuroscience, 32, 12488–12498.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review, 113, 700–765.
Boucher, L., Palmeri, T., Logan, G., & Schall, J. (2007). Inhibitory control in mind and brain: An interactive race model of countermanding saccades. Psychological Review, 114, 376–397.
Brown, S. D., & Heathcote, A. J. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Brown, S. D., Marley, A. A. J., Donkin, C., & Heathcote, A. J. (2008). An integrated model of choices and response times in absolute identification. Psychological Review, 115, 396–425.
Buonocore, A., Giorno, V., Nobile, A. G., & Ricciardi, L. (1990). On the two-boundary first-crossing-time problem for diffusion processes. Journal of Applied Probability, 27, 102–114.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100, 432–459.
Churchland, A. K., Kiani, R., & Shadlen, M. N. (2008). Decision-making with multiple alternatives. Nature Neuroscience, 11, 693–702.
Deco, G., Rolls, E. T., Albantakis, L., & Romo, R. (2013). Brain mechanisms for perceptual and reward-related decision-making. Progress in Neurobiology, 103, 194–213.
Dennis, S., & Humphreys, M. S. (2001). A context noise model of episodic word recognition. Psychological Review, 108, 452–477.

Diederich, A., & Busemeyer, J. R. (2003). Simple matrix methods for analyzing diffusion models of choice probability, choice response time, and simple response time. Journal of Mathematical Psychology, 47, 304–322.
Diederich, A., & Busemeyer, J. R. (2006). Modeling the effects of payoff on response bias in a perceptual discrimination task: Bound-change, drift-rate-change, or two-stage-processing hypothesis. Perception & Psychophysics, 68, 194–207.
Ditterich, J. (2006). Computational approaches to visual decision making. In D. J. Chadwick, M. Diamond, & J. Goode (Eds.), Percept, decision, action: Bridging the gaps (p. 114). Chichester, UK: Wiley.
Ditterich, J. (2010). A comparison between mechanisms of multi-alternative perceptual decision making: Ability to explain human behavior, predictions for neurophysiology, and relationship with decision theory. Frontiers in Neuroscience, 4, 184.
Donkin, C., Brown, S., Heathcote, A., & Wagenmakers, E.-J. (2011). Diffusion versus linear ballistic accumulation: Different models for response time, same conclusions about psychological mechanisms? Psychonomic Bulletin & Review, 55, 140–151.
Feller, W. (1968). An introduction to probability theory and its applications. New York, NY: Wiley.
Forster, K. I. (1976). Accessing the mental lexicon. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms (pp. 257–287). Amsterdam, Netherlands: North-Holland.
Forstmann, B. U., Anwander, A., Schafer, A., Neumann, J., Brown, S., Wagenmakers, E.-J., Bogacz, R., & Turner, R. (2010). Cortico-striatal connections predict control over speed and accuracy in perceptual decision making. Proceedings of the National Academy of Sciences, 107, 15916–15920.
Frank, M. J. (2006). Hold your horses: A dynamic computational role for the subthalamic nucleus in decision making. Neural Networks, 19, 1120–1136.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–67.
Geddes, J., Ratcliff, R., Allerhand, M., Childers, R., Wright, R. J., Frier, B. M., & Deary, I. J. (2010). Modeling the effects of hypoglycemia on a two-choice task in adult humans. Neuropsychology, 24, 652–660.
Gold, J. I., & Shadlen, M. N. (2000). Representation of a perceptual decision in developing oculomotor commands. Nature, 404, 390–394.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Science, 5, 10–16.
Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annual Review of Neuroscience, 30, 535–574.
Gomez, P., Ratcliff, R., & Perea, M. (2007). A model of the go/no-go task. Journal of Experimental Psychology: General, 136, 347–369.
Grinband, J., Hirsch, J., & Ferrera, V. P. (2006). A neural representation of categorization uncertainty in the human brain. Neuron, 49, 757–763.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Heekeren, H. R., Marrett, S., Bandettini, P. A., & Ungerleider, L. G. (2004). A general mechanism for perceptual decision-making in the human brain. Nature, 431, 859–862.
Hintzman, D. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Krajbich, I., & Rangel, A. (2011). A multi-alternative drift diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences, 108, 13852–13857.
Kounios, J., Osman, A. M., & Meyer, D. E. (1987). Structure and process in semantic memory: New evidence based on speed-accuracy decomposition. Journal of Experimental Psychology: General, 116, 3–25.
Laming, D. R. J. (1968). Information theory of choice reaction time. New York, NY: Wiley.
Leite, F. P., & Ratcliff, R. (2010). Modeling reaction time and accuracy of multiple-choice decisions. Attention, Perception and Psychophysics, 72, 246–273.
Leite, F. P., & Ratcliff, R. (2011). What cognitive processes drive response biases? A diffusion model analysis. Judgment and Decision Making, 6, 651–687.
Link, S. W., & Heath, R. A. (1975). A sequential theory of psychological discrimination. Psychometrika, 40, 77–105.
Luce, R. D. (1986). Response times. New York, NY: Oxford University Press.
Mazurek, M. E., Roitman, J. D., Ditterich, J., & Shadlen, M. N. (2003). A role for neural integrators in perceptual decision-making. Cerebral Cortex, 13, 1257–1269.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A Bayesian approach to the effects of experience in recognition memory. Psychological Review, 105, 724–760.
McKoon, G., & Ratcliff, R. (1992). Spreading activation versus compound cue accounts of priming: Mediated priming revisited. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1155–1172.
McKoon, G., & Ratcliff, R. (2012). Aging and IQ effects on associative recognition and priming in item recognition. Journal of Memory and Language, 66, 416–437.
McNamara, T. P. (1992). Priming and constraints it places on theories of memory and retrieval. Psychological Review, 99, 650–662.
McNamara, T. P. (1994). Priming and theories of memory: A reply to Ratcliff and McKoon. Psychological Review, 101, 185–187.
Meyer, D. E., Irwin, D. E., Osman, A. M., & Kounios, J. (1988). The dynamics of cognition: Mental processes inferred from a speed-accuracy decomposition technique. Psychological Review, 95, 183–237.
Milosavljevic, M., Malmaud, J., Huth, A., Koch, C., & Rangel, A. (2010). The drift diffusion model can account for the accuracy and reaction times of value-based choice under high and low time pressure. Judgment and Decision Making, 5, 437–449.
Morton, J. (1969). The interaction of information in word recognition. Psychological Review, 76, 165–178.

Mulder, M. J., Boekel, W., Ratcliff, R., & Forstmann, B. U. (2014). Cortico-subthalamic connection predicts individual differences in value-driven choice bias. Brain Structure & Function, 219, 1239–1249.
Mulder, M. J., Bos, D., Weusten, J. M. H., van Belle, J., van Dijk, S. C., Simen, P., van Engeland, H., & Durston, S. (2010). Basic impairments in regulating the speed-accuracy tradeoff predict symptoms of attention-deficit/hyperactivity disorder. Biological Psychiatry, 68, 1114–1119.
Mulder, M., van Maanen, L., & Forstmann, B. U. (2014). Perceptual decision neurosciences: A model-based review. Neuroscience, 277, 872–884.
Munoz, D. P., & Wurtz, R. H. (1995). Saccade-related activity in monkey superior colliculus. I. Characteristics of burst and buildup cells. Journal of Neurophysiology, 73, 2313–2333.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626.
Niwa, M., & Ditterich, J. (2008). Perceptual decisions between multiple directions of visual motion. Journal of Neuroscience, 28, 4435–4445.
Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113, 327–357.
Oberauer, K., Süß, H.-M., Wilhelm, O., & Wittmann, W. W. (2003). The multiple faces of working memory: Storage, processing, supervision, and coordination. Intelligence, 31, 167–193.
Palmer, J., Huk, A. C., & Shadlen, M. N. (2005). The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision, 5, 376–404.
Philiastides, M., & Ratcliff, R. (2013). Influence of branding on preference-based decision making. Psychological Science, 24, 1208–1215.
Philiastides, M. G., Ratcliff, R., & Sajda, P. (2006). Neural representation of task difficulty and decision making during perceptual categorization: A timing diagram. Journal of Neuroscience, 26, 8965–8975.
Platt, M., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233–238.
Pleskac, T. J., & Busemeyer, J. R. (2010). Two-stage dynamic signal detection: A theory of choice, decision time, and confidence. Psychological Review, 117, 864–901.
Purcell, B. A., Heitz, R. P., Cohen, J. Y., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2010). Neurally constrained modeling of perceptual decision making. Psychological Review, 117, 1113–1143.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R. (1980). A note on modelling accumulation of information when the rate of accumulation changes over time. Journal of Mathematical Psychology, 21, 178–184.
Ratcliff, R. (1985). Theoretical interpretations of speed and accuracy of positive and negative responses. Psychological Review, 92, 212–225.
Ratcliff, R. (1988). A note on the mimicking of additive reaction time models. Journal of Mathematical Psychology, 32, 192–204.
Ratcliff, R. (2002). A diffusion model account of reaction time and accuracy in a two-choice brightness discrimination task: Fitting real data and failing to fit fake but plausible data. Psychonomic Bulletin and Review, 9, 278–291.
Ratcliff, R. (2006). Modeling response signal and response time data. Cognitive Psychology, 53, 195–237.
Ratcliff, R. (2013). Parameter variability and distributional assumptions in the diffusion model. Psychological Review, 120, 281–292.
Ratcliff, R., Cherian, A., & Segraves, M. (2003). A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of simple two-choice decisions. Journal of Neurophysiology, 90, 1392–1407.
Ratcliff, R., & Frank, M. (2012). Reinforcement-based decision making in corticostriatal circuits: Mutual constraints by neurocomputational and diffusion models. Neural Computation, 24, 1186–1229.
Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of the lexical-decision task. Psychological Review, 111, 159–182.
Ratcliff, R., Hasegawa, Y. T., Hasegawa, Y. P., Childers, R., Smith, P. L., & Segraves, M. A. (2011). Inhibition in superior colliculus neurons in a brightness discrimination task? Neural Computation, 23, 1790–1820.
Ratcliff, R., Hasegawa, Y. T., Hasegawa, Y. P., Smith, P. L., & Segraves, M. A. (2007). Dual diffusion model for single-cell recording data from the superior colliculus in a brightness-discrimination task. Journal of Neurophysiology, 97, 1756–1774.
Ratcliff, R., Love, J., Thompson, C. A., & Opfer, J. (2012). Children are not like older adults: A diffusion model analysis of developmental changes in speeded responses. Child Development, 83, 367–381.
Ratcliff, R., & McKoon, G. (1988). A retrieval theory of priming in memory. Psychological Review, 95, 385–408.
Ratcliff, R., & McKoon, G. (1994). Retrieving information from memory: Spreading activation theories versus compound cue theories. Psychological Review, 101, 177–184.
Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural Computation, 20, 873–922.
Ratcliff, R., McKoon, G., & Tindall, M. H. (1994). Empirical generality of data from recognition memory receiver-operating characteristic functions and implications for the global memory models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 763–785.
Ratcliff, R., Perea, M., Colangelo, A., & Buchanan, L. (2004). A diffusion model account of normal and impaired readers. Brain & Cognition, 55, 374–382.
Ratcliff, R., Philiastides, M. G., & Sajda, P. (2009). Quality of evidence for perceptual decision making is indexed by trial-to-trial variability of the EEG. Proceedings of the National Academy of Sciences, 106, 6539–6544.
Ratcliff, R., & Rouder, J. N. (2000). A diffusion model account of masking in letter identification. Journal of Experimental Psychology: Human Perception and Performance, 26, 127–140.
Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518–535.
Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111, 333–367.
Ratcliff, R., & Smith, P. L. (2010). Perceptual discrimination in static and dynamic noise: the temporal relation

60 elementary cognitive mechanisms between perceptual encoding and decision making. Journal Roxin, A., & Ledberg, A. (2008). Neurobiological models of Experimental Psychology: General, 139, 70–94. of two-choice decision making can be reduced to a one- Ratcliff, R., & Starns, J. J. (2009). Modeling confidence and dimensional nonlinear diffusion equation. PLoS Computa- response time in recognition memory. Psychological Review, tional Biology, 4, e1000046. 116, 59–83. Schmiedek, F., Oberauer, K., Wilhelm, O., Suβ, H-M., & Ratcliff, R., & Starns, J. J. (2013). Modeling confidence Wittmann, W. (2007). Individual differences in components judgments, response times, and multiple choices in decision of reaction time distributions and their relations to working making: recognition memory and motion discrimination. memory and intelligence. Journal of Experimental Psychology: Psychological Review, 120, 697–719. General, 136, 414–429. Ratcliff, R., Thapar, A., Gomez, P., & McKoon, G. (2004). A Schouten, J. F., & Bekker, J. A. M. (1967). Reaction time and diffusion model analysis of the effects of aging in the lexical- accuracy. Acta Psychologica, 27, 143–153. decision task. Psychology and Aging, 19, 278–289. Sewell, D. K., & Smith, P. L. (2012). Attentional control Ratcliff, R., Thapar, A. & McKoon, G. (2003). A dif- in visual signal detection: Effects of abrupt-onset and no- fusion model analysis of the effects of aging on bright- onset stimuli. Journal Of Experimental Psychology: Human ness discrimination. Perception and Psychophysics, 65, Perception And Performance, 38, 1043–1068. 523–535. Shadlen, M. N. & Newsome, W. T. (2001). Neural basis of a Ratcliff, R., Thapar, A., & McKoon, G. (2004). A diffusion perceptual decision in the parietal cortex (area LIP) of the model analysis of the effects of aging on recognition memory. rhesus monkey. Journal of Neurophysiology, 86, 1916–1935. Journal of Memory and Language, 50, 408–424. 1 Shiffrin, R. M., & Steyvers, M. (1997). 
A model for Ratcliff, R., Thapar, A., & McKoon, G. (2006). Aging, practice, recognition memory: REM: Retrieving effectively from and perceptual tasks: A diffusion model analysis. Psychology memory. Psychonomic Bulletin and Review, 4, 145–166. and Aging, 21, 353–371. Smith, P.L. (1995). Psychophysically principled models of visual Ratcliff, R., Thapar, A., & McKoon, G. (2007). Application simple reaction time. Psychological Review, 102, 567–593. of the diffusion model to two-choice tasks for adults 75–90 Smith, P. L. (2000). Stochastic dynamic models of response years old. Psychology and Aging, 22, 56–66. time and accuracy: A foundational primer. Journal of Ratcliff, R., Thapar, A., & McKoon, G. (2010). Individual Mathematical Psychology, 44, 408–463. differences, aging, and IQ in two-choice tasks. Cognitive Smith, P. L. (2010). From Poisson shot noise to the integrated Psychology, 60, 127–157. Ornstein-Uhlenbeck process: Neurally-principled models of Ratcliff, R., Thapar, A., & McKoon, G. (2011). Effects of diffusive evidence accumulation in decision-making and aging and IQ on item and associative memory. Journal of response time. Journal of Mathematical Psychology, 54, 266– Experimental Psychology: General, 140, 46–487. 283. Ratcliff, R., Thapar, A., Smith, P. L. & McKoon, G. (2005). Smith, P. L., Ellis, R., Sewell, D. K., & Wolfgang, B. J. (2010). Aging and response times: A comparison of sequential Cued detection with compound integration-interruption sampling models. In J. Duncan, P. McLeod, & L. Phillips masks reveals multiple attentional mechanisms. Journal of (Eds.), Speed, Control, and Age, Oxford, England: Oxford Vision, 10, 1–28. University Press. Smith, P. L., & McKenzie, C. (2011). Diffusive informa- Ratcliff, R., & Tuerlinckx, F. (2002). Estimating the parameters tion accumulation by minimal recurrent neural models of the diffusion model: Approaches to dealing with contam- of decision making. 
Neural Computation, 23, 2000– inant reaction times and parameter variability. Psychonomic 2031. Bulletin and Review, 9, 438–481. Smith, P. L., & Ratcliff, R. (2004). The psychology and Ratcliff, R. & Van Dongen, H. P. A. (2009). Sleep deprivation neurobiology of simple decisions, Trends in Neuroscience, 27, affects multiple distinct cognitive processes. Psychonomic 161–168. Bulletin and Review, 16, 742–751. Smith, P. L., & Ratcliff, R. (2009). An integrated theory of Ratcliff, R. & Van Dongen, H.P.A. (2011). A diffusion model attention and decision making in visual signal detection. for one-choice reaction time tasks and the cognitive effects Psychological Review, 116, 283–317. of sleep deprivation. Proceedings of the National Academy of Smith, P. L., Ratcliff, R., & Sewell, D. K. (2014). Modeling Sciences, 108, 11285–11290. perceptual discrimination in dynamic noise: Time-changed Ratcliff, R., Van Zandt, T., & McKoon, G. (1999). Connec- diffusion and release from inhibition. Journal of Mathemati- tionist and diffusion models of reaction time. Psychological cal Psychology, 59, 95–113. Review, 106, 261–300. Smith, P. L., Ratcliff, R., & Wolfgang, B. J. (2004). Attention Reed, A.V. (1973). Speed-accuracy trade-off in recognition orienting and the time course of perceptual decisions: memory. Science, 181, 574–576. response time distributions with masked and unmasked Roe, R. M., Busemeyer, J. R., & Townsend, J. T. (2001). displays. Vision Research, 44, 1297–1320. Multialternative decision field theory: A dynamic connec- Smith, P.L., & Vickers, D. (1988). The accumulator model tionist model of decision-making. Psychological Review, 108, of two-choice discrimination. Journal of Mathematical 370–392. Psychology, 32, 135–168. Roitman, J. D. & Shadlen, M. N. (2002). Response of neurons Sperling, G. & Dosher, B. A. (1986). Strategy and optimization in the lateral interparietal area during a combined visual in human information processing. In K. Boff, L. 
Kauf- discrimination reaction time task. Journal of Neuroscience, 22, man, and J. Thomas (Eds.), Handbook of perception and 9475–9489. performance. (Vol. 1, pp. 1–65). New York, NY: Wiley.

modeling simple decisions and applications 61 Starns, J. J., & Ratcliff, R. (2010). The effects of aging on Handbook of Experimental Psychology (3rd ed.), Volume 4: the speed-accuracy compromise: Boundary optimality in the Methodology in Experimental Psychology (pp. 461–516). diffusion model. Psychology and Aging, 25, 377–390. New York, NY: Wiley. Starns, J. J., & Ratcliff, R. (2012). Age-related differences in Vickers, D. (1970). Evidence for an accumulator model of diffusion model boundary optimality with both trial-limited psychophysical discrimination. Ergonomics, 13, 37–58. and time-limited tasks. Psychonomic Bulletin and Review, 19, Vickers, D. (1979). Decision processes in visual perception. 139–145. New York, NY: Academic . Starns, J. J., & Ratcliff, R. (2014). Validating the unequal- Vickers, D., Caudrey, D., & Willson, R. J. (1971). Discriminat- variance assumption in recognition memory using response ing between the frequency of occurrence of two alternative time distributions instead of ROC functions: A diffusion events. Acta Psychologica, 35, 151–172. model analysis. Journal of Memory and Language, 70, 36–52. Voss, A. & Voss, J. (2007) Fast-dm: A free program for efficient Starns, J. J., Ratcliff, R., & McKoon, G. (2012). Evaluating the diffusion model analysis. Behavior Research Methods, 39, unequal-variability and dual-process explanations of zROC 767–775. slopes with response time data and the diffusion model. Wang, X. J. (2002). Probabilistic decision making by slow Cognitive Psychology, 64, 1–34. reverberation in cortical circuits. Neuron, 36, 955–968. Stone, M. (1960). Models for choice reaction time. Psychome- Watson, A. B. (1986). Temporal sensitivity. In K. R. Boff, L. trika, 25, 251–260. Kaufman, & J. P.Thomas (Eds.), Handbook of perception and Thapar, A., Ratcliff, R., & McKoon, G. (2003). A diffusion human performance (pp 6–1 to 6–43). New York, NY: Wiley. model analysis of the effects of aging on letter discrimination. White, C. 
N., Ratcliff, R., & Starns, J. J. (2011). Psychology and Aging, 18, 415–429. Diffusion models of the flanker task: Discrete versus Townsend, J. T. (1972). Some results concerning the identi- gradual attentional selection. Cognitive Psychology, 63, fiability of parallel and serial processes. British Journal of 210–238. Mathematical and Statistical Psychology, 25, 168–197. White, C., Ratcliff, R., Vasey, M. & McKoon, G. (2009). Townsend, J. T., & Ashby, F. G. (1983). Stochastic Modeling Dysphoria and memory for emotional material: A diffusion of Elementary Psychological Processes. Cambridge: Cambridge model analysis. Cognition and , 23, 181–205. University Press. White, C. N., Ratcliff, R., Vasey, M. W., & McKoon, G. (2010). Townsend, J. T. & Wenger, M.J. (2004). A theory of interactive Using diffusion models to understand clinical disorders. parallel processing: New capacity measures and predictions Journal of Mathematical Psychology, 54, 39–52. for a response time inequality series. Psychological Review, Wickelgren, W. A. (1977). Speed-accuracy tradeoff and informa- 111, 1003–1035. tion processing dynamics. Acta Psychologica, 41, 67–85. Tuerlinckx, F., Maris, E., Ratcliff, R., & De Boeck, P. (2001). Wickelgren, W. A., Corbett, A. T., & Dosher, B. A. (1980). A comparison of four methods for simulating the diffusion Priming and retrieval from short-term memory: A speed process. Behavior, Research, Instruments, and Computers, 33, accuracy trade-off analysis. Journal of Verbal Learning and 443–456. , 19, 387–404. Usher, M. & McClelland, J. L. (2001). The time course of Wiecki, T. V., Sofer, I. and Frank, M. J. (2013). HDDM: perceptual choice: The leaky, competing accumulator model. Hierarchical Bayesian estimation of the Drift-Diffusion Psychological Review, 108, 550–592. Model in Python. Frontiers in Neuroinformatics, 7, Vandekerckhove, J. & Tuerlinckx, F. (2007) Fitting the Ratcliff 1–10. diffusion model to experimental data. 
Psychonomic Bulletin Wong, K.-F., & Wang, X.-J. (2006). A recurrent network &Review,14,1011–1026. mechanism for time integration in perceptual decisions. Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Journal of Neuroscience, 26, 1314–1328. Hierarchical diffusion models for two-choice response times. Yonelinas, A. P.(1997). Recognition memory ROCs for item and Psychological Methods, 16, 44–62. associative information: The contribution of recollection and van Ravenzwaaij, D., Dutilh, G., & Wagenmakers, E.-J. (2012). familiarity. Memory & Cognition, 25, 747–763. A diffusion model decomposition of the effects of alcohol on Zeguers, M. H. T., Snellings, P., Tijms, J., Weeda, W. D., perceptual decision making. , 219, 1017– Tamboer, P., Bexkens, A. & Huizenga, H.M. (2011). 1025. Specifying theories of developmental dyslexia: A diffusion van Ravenzwaaij, D., van der Maas, H. L. J., & Wagenmakers, model analysis of word recognition. Developmental Science, E.-J. (2012). Optimal decision making in neural inhibition 14, 1340–1354. models. Psychological Review, 119, 201–215. Zohary, E., Shadlen, M., & Newsome, W. (1994). Correlated Van Zandt, T. (2002). Analysis of response time distributions. neuronal discharge rate and its implications for psychophys- In J. T. Wixted (Vol. Ed.) & H. Pashler (Series Ed.), Stevens’ ical performance. Nature, 370, 140–143.

elementary cognitive mechanisms

CHAPTER 4
Features of Response Times: Identification of Cognitive Mechanisms through Mathematical Modeling

Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend

Abstract

Psychology is one of the most recent sciences to issue from the mother-tree of philosophy. One of the greatest challenges is that of formulating theories and methodologies that move the field toward theoretical structures that are not only sufficient to explain and predict phenomena but, in some vital sense, necessary for those purposes. Mathematical modeling is perhaps the most promising general strategy, but even under that aegis, the physical sciences have labored toward that end. The present chapter begins by outlining the roots of our approach in 19th century physics, physiology, and psychology. Then, we witness the renaissance in the 1960s of goals that were envisioned but not usually realizable in 19th century science and methodology. It could be contended that it is impossible to know the full story of what can be learned through scientific method in the absence of what cannot be known. This precept brings us into the slough of model mimicry, wherein even diametrically opposed physical or psychological concepts can be mathematically equivalent within specified observational theatres! Discussion of examples from close to half a century of research illustrates what we conceive of as unfortunate missteps from the psychological literature as well as what has been learned through careful application of the attendant principles. We conclude with a statement concerning the ongoing expansion of our body of approaches and what we might expect in the future.

Key Words: parallel processing, serial processing, mimicking, capacity, response times, stochastic processes, visual search, redundant targets, history of response time measurement

From Past to Future: Main Currents in the Evolution of Reaction Time as a Tool in the Study of Human Information Processing

If time has a history (Hawking, 1988), the timing of mental events certainly does. The idea that human sensations, feelings, or thoughts occur in real time seemed preposterous less than two centuries ago. When the idea finally gained traction, its gradual acceptance in psychology was often accompanied by much rancor that continued well beyond the development of the first attempts at measurement. After some early progress that had been made in harnessing latency or reaction time (RT) to the study of psychological processes, Titchener (1905, p. 363) was still pondering whether “we have any right to speak of the ‘duration’ of mental processes.” Putting the term duration in inverted commas indicates the recent origin of usage of the term as well as Titchener’s own doubts about its validity or serviceability.

Thirty years later, Robert Sessions Woodworth in his celebrated Experimental Psychology argued against acceptance of the first method to use reaction time. In a section poignantly titled,

“Discarding the subtraction method” (Woodworth 1938, p. 309), Woodworth expressed broader and deep-seated reservations, observing that because “we cannot break up the reaction into successive acts and obtain the time for each act, of what use is reaction time?” Even more recent is Johnson’s (1955, p. 5) assertion that, “The reaction-time experiment suggests a method for the analysis of mental processes that turned out to be unworkable.”

An onerous history granted, the use of RT is firmly established in modern cognitive psychology, not least due to the general conceptual framework provided by the domain known as the information-processing approach. Within this framework, RT is used in a systematic, theoretically guided fashion in the quest to isolate the underlying processes and their interactions activated by a given experimental task (cf. Laming 1968; Luce 1986; Townsend and Ashby 1983; Welford 1980).

Nevertheless, we would be remiss if we did not examine, if in passing, the essence of Woodworth’s reasoning. Woodworth’s concerns hark back to the forceful argument on the continuity of consciousness offered by William James in his seminal Principles of Psychology (see in particular, James 1890, Vol. 1, p. 244). In the chapter on the stream of thought, James contends that, due to its absolute continuity, thought or consciousness cannot be divided up for analysis. His attack is directed against the possibility of introspecting minute mental experiences, but the objection is equally cogent with respect to RT. When obtaining a value of RT, one measures the duration between two markers in time, usually that between some specified signal and the observer’s response. The RT is then taken to represent the time consumed by an internal process needed to perform a mental task. However, if mental processes are not amenable to partition, any pair of markers must be considered arbitrary. On a deeper level, the situation is a replica or subspecies of the relationship between nature and language as discussed by Friedrich Nietzsche (1873). Nature might well comprise a continuous whole, but human language (used to describe nature) is always discrete. How does one treat a continuous variable with discrete tools? Without dwelling on this issue in any depth, the upshot is clear. A fundamental, yet heretofore unarticulated, assumption underlying all RT-based models, serial or parallel, is this: natural mental functioning can be divided into separate, psychologically meaningful acts.

Returning to history, why did the idea that mental acts occur in real, hence measurable, time seem so incredible less than 200 years ago? The physiology of the human nervous system had made startling advances just around that time, but for many centuries the main thrust of attempts to understand the system along with the attendant sensations fell under the rubric of “vitalism.” Vitalism is the doctrine that there is a fundamental difference between living organisms and nonliving matter because the former entail something that is missing from the latter. Pinpointing just what this “something” was has proved elusive, yet the doctrine enjoyed widespread influence from antiquity (the Greek anatomist Galen held that vital spirits are necessary for life) to the 19th century (for all his great contributions to physiology, the towering figure of Johannes Müller subscribed to vitalism) to our own time (Freud’s “psychic energy,” “emerging property,” or even “mind” itself come to mind). Vitalism is best understood as opposition to the Cartesian extension of mechanistic explanations to biology (Bechtel and Richardson 1998; Rakover 2007). It is on this background of the strong influence of vitalism that researchers at the time believed that nerve conduction was instantaneous (in the order of the speed of light or faster) and that, at any rate, it was too fast to be measured.

Hermann von Helmholtz’s Measurement of the Speed of the Nerve

Therefore, Hermann von Helmholtz (1821–1894), along with his fellow students at Johannes Müller’s Berlin Institute of Physiology, had to summon their best judgment and blood (signing their antivitalism oath) to rebuff their teacher and espouse a strictly mechanistic position. Under the circumstances, it was a bold move on the part of Helmholtz and his peers to consider the moving nerve impulse as (merely) an event in space-time on a par with, say, that of a moving locomotive. Devising an ingenious method for measuring time, Helmholtz proceeded to measure the speed of the former. He stimulated a motor nerve in a frog’s leg and found that the latency of the muscular response depended on the distance of the stimulation from the muscle: the smaller the distance, the faster the response. Helmholtz’s calculations showed that the propagation of the impulse down the nerve was surprisingly slow, between 25 and 43 meters a second. Regardless of the value, it became evident that the speed of nerve conduction was finite and measurable! More boldly yet, Helmholtz turned to humans, asking participants to push a button

when they felt stimulation in their leg. Predictably enough, people reacted to stimulation in the toe slower than to stimulation in the thigh. Helmholtz estimated the speed of nerve conduction in humans to be between 43 and 150 meters per second. The large range is notable, attesting to considerable variability. It was this variability, within individuals as well as between individuals, that discouraged Helmholtz from further pursuing RT research as a reliable means of psychological investigation.

The last point is also notable because individual differences were the subject of a now-famous incident at the Greenwich observatory, which occurred half a century before Helmholtz’s measurements. Assistant astronomer David Kinnebrook was relieved of his job by his superior, Nevil Maskelyne, due to disagreement in reading the time that a star crossed the hairline in a telescope. The superior found that his assistant’s observations were a fraction of a second longer than his own. Twenty years later, this little-noticed incident (at the time) came to the attention of the German astronomer F. W. Bessel, who started to compare transit times by various astronomers. This first RT study revealed that all astronomers differed in their recordings. In order to cancel out individual variation from the astronomic calculations, Bessel set out to construct “personal equations” as a means to correct or equate differences among observers. Notice that the concept of “personal equation” assumes small (to nil) intra-individual variability in tandem with stable interindividual differences. Neither notion proved to be correct, as Helmholtz witnessed with his observers. It turns out that variability, whether of intra- or inter-individual species, is a fixture of RT measurement. It is at this juncture that models developed within the generic framework of human information processing become truly valuable, attempting to disentangle the various sources of RT variability.

Studies of Reaction Time in Wundt’s Laboratory: Moving from the Periphery to the Center

Note that for all his pioneering contribution, Helmholtz’s measurements were restricted to the periphery of the nervous system, to sensory and motor nerves transmitting impulses toward or from the brain (Fancher 1990). Even this result, as we recounted, was achieved after travelling a tortuous road. Nevertheless, barely a decade after Helmholtz’s measurements in 1850, the following intriguing question was posed (separately) by Wilhelm Wundt (1832–1920) and Franciscus Donders (1818–1889). Could RT measurement be refined to gauge the duration of central processes, presumably reflecting mental activity in the brain itself?

Wundt approached the question experimentally by probing the simultaneity of stimulus appearance in the conscious mind. Do stimuli presented at exactly the same (physical) time evoke similarly simultaneous sensations? In a simple experiment performed in his home in 1861, Wundt attached a calibrated scale to the end of the pendulum of his clock so that the pendulum’s position at any time could be determined with precision. A needle fastened to the pendulum perpendicularly at its middle would strike a bell at the very instant that the pendulum reached a predefined position on the scale. Using this makeshift (yet accurate for the time) instrument (Figure 4.1), Wundt was observing his own mind: at the sound of the bell, Wundt did not perceive the pendulum to be in the predetermined position but always away from there. Calculation based on the perceived distance of the pendulum from its original position showed the perceived time difference to be at around one-tenth of a second. Inevitably, Wundt concluded, people do not consciously experience the visual and auditory stimuli simultaneously, despite the fact that these stimuli occur at the same time.

Fig. 4.1 Schematic of Wundt’s thought meter.

Encouraged by such data, Wundt subsequently attempted to measure specific central processes. A favorite topic was “apperception,” an early term for what is now known as attention. Wundt found that the RT to a given stimulus was shorter by one-tenth of a second if the observer concentrated on the response rather than on the stimulus. The

reason is that one has first to perceive the stimulus and then to apperceive it, that is, to decide whether it is the appropriate one for responding. When focusing on the response, the second of these processes is gratuitous. Consequently, Wundt proposed that apperception takes about one-tenth of a second. Regardless of the particular results, the significance of Wundt’s early foray into RT measurement lies in his bold thrust to probe the duration of mental processes of consequence to cognitive science and everyday life alike. Cognizant of its potential, Wundt’s home apparatus has been depicted as a “thought meter,” and the title of his own report (including subsequent data) aptly read, “Die Geschwindigkeit des Gedankens” (The speed of thoughts; Wundt 1892).

Important work in Wundt’s laboratory was carried out on a related subject, the number of stimuli noticed simultaneously during a short glance. James McKeen Cattell (1860–1944), Wundt’s American student and assistant, first employed RT in the study of the visual span of attention or span of apprehension. However, a true pioneer in this domain was the Scottish philosopher Sir William Hamilton, whose observations are reported in a posthumous book published in 1859. Hamilton spread out marbles on the ground and concluded that, on average, the span of visual attention is limited to 6–7 items. However, if the marbles are arranged in groups (of, say, two, three, or four marbles a group), the person can comprehend many more marbles because the mind considers each group as a unit. These results and conclusions anticipated those of George Miller a century later in his famous article on the “magical number seven” and on the effects of “chunking” (Miller, 1956). The power of grouping was expounded by Cattell himself, who found that whole words could replace single unrelated letters, leaving invariant the number of units noticed within the span. Modern studies on the span of attention use short exposure times (at around 50 ms) in order to avoid eye movements and counting. As a result, observers actually report the contents of their short-term or “iconic” memory. George Sperling (1960), reviving interest in the subject in his groundbreaking studies on the information contained in brief visual presentations, concluded that the span was much larger than previously thought (in the order of 12–16 letters), but that it was also short lived. The very report by the observer can conceal the true size of the span; larger estimates are found when the deleterious effects of reporting are circumvented.

We surely have come a long way from Hamilton’s informal surmises. Nevertheless, his observations brought to the fore the idea of limited capacity (resources or attention) and even the idea of parallel processing. Murray (1988, p. 159), ever the keen reader, concluded that “Hamilton perceived consciousness as a kind of receptacle of limited capacity.” Needless to add, capacity and parallel processing are key concepts in the current approach known as human information processing.

Donders’ Complication Experiment and Method of Subtraction

We already mentioned Franciscus Donders, the true pioneer of RT measurement in psychology. This Dutch physiologist (founder of modern ophthalmology among sundry achievements) developed the first influential, hence lasting, procedure for measuring the duration of specific mental processes. Donders devised an experimental setup known as the complication experiment, with an associated method of RT measurement called the method of subtraction. The idea was to present tasks of increasing complexity and then to subtract the respective RTs in order to identify the duration of the added processes. The technique is best illustrated by the procedures used by Donders himself (Donders, 1868; we follow Murray’s 1988 depiction). In one variation, the a-method, a sound such as ki is presented by the experimenter and the observer reproduces it orally as quickly as possible (one should note that Donders was the first experimenter to use the human [his own] voice in RT studies). The a-task is a simple reaction time experiment, recording the time it takes the observer to react to a predetermined stimulus by a predetermined response. In the b-method, one of several sounds is presented on a trial, and the observer repeats the sound as fast as possible. This variation is dubbed choice reaction time: several different stimuli are presented and the observer responds to each of them differently. In the c-method, several sounds are given again, but the observer imitates only one of them and remains silent when the others are presented (this variation is now known as the go/no-go procedure). The differences between the respective RTs reflect the duration of the psychological processes involved. For example, the RT for the b-procedure entails both discrimination (or identification) of the stimulus presented and the selection of the appropriate response, whereas that for the c-procedure entails merely discrimination

(or recognition, see Luce, 1986, p. 213). The mean difference (c−a) was taken by Donders to measure the duration of recognition, whereas that of (b−c) estimated the time consumed by the need to make a choice between responses (see Figure 4.2 for an outline of the Donders experiment and for the logic of the method of subtraction).

Fig. 4.2 Illustration of the complication experiment and analysis by the method of subtraction. Top: A simple RT experiment (a single predetermined response made to a single predetermined stimulus) is complicated into a choice RT experiment (two different stimuli with a different response made to each). Bottom: The time it takes to perform the mental act of choice is estimated by subtracting the mean RT of the simple RT experiment from the mean RT of the choice RT experiment.

In the scheme developed by Donders, there is a chain of discrete nonoverlapping processing systems. The duration of each process is measurable, assuming that each added experimental task uniquely taps one and only one of the processing systems. If the assumptions hold, the procedure succeeds in inferring the duration and eventually the attendant architecture of the psychological system under test. Consequently, the idea of subtraction has exerted a profound influence on RT theory and experimentation. Townsend and Ashby (1983) paid well-deserved homage to Donders by designating psychological processes carried out in a serial fashion (i.e., sequential and without overlap in processing time) as Dondersian systems. This much granted, closer scrutiny of the method (in particular, its underlying assumptions) uncovered several problems, so that the method has not been wholeheartedly accepted by students of RT. The main criticisms are easily summarized because they are interconnected in the final analysis.

First, the experimental data collected by different investigators or by an individual investigator at different times proved extremely variable. For example, Donders (1868), Laming (1968), and Snodgrass, Luce, & Galanter (1967) reported vastly different RTs for (c−a) and (b−c). Over and above the variability, the order of the differences is not preserved: Donders found (b−c) longer than (c−a), but those subsequent investigators found the opposite pattern. Wundt, an early champion of the method, was so discouraged by the large intra-individual variability that he abandoned his RT studies altogether.

Second, the method requires that the added experimental task has no influence on any of the other tasks. The assumption of “pure insertion” (Sternberg, 1969a,b) asserts that the previous processes unfold in time precisely in the same fashion regardless of whether another process is inserted into the chain. If pure insertion is impossible in general or does not hold in particular cases, the assumptions of additivity and independence of the processes are also compromised. To compound the problem, the assumption of pure insertion is untestable with mean statistics although it might be with distributional statistics (Ashby & Townsend, 1980). The issue is not fully settled (cf. Luce, 1986, p. 215), and it is moot whether it can be fully settled with any mathematical or statistical test.

The third criticism is even more fundamental. It concerns the relationship between the experimental task and the unobservable psychological process or subprocesses that the task is supposed to tap. It is not prima facie clear that by calling a task “response choice/selection” or “stimulus discrimination” the underlying psychological process is that of choice or discrimination. It is not even clear that the task taps a single process, excluding all sorts of subprocesses. The raison d’être of the complication experiment is minimum complication, so that a single well-defined process is probed with each addition. This minimal-addition or single-process principle is not readily testable (certainly not at the level of the mean) and it is even more difficult to satisfy in experimental practice. After all, how can one decide that the added task comprised the smallest complication possible (Külpe, 1895; Sternberg, 1969a,b)? Symptoms of the problem have recurrently surfaced in the century following the Donders experiment. Where Donders called a given task “discrimination,” Wundt called the

same task, "cognition." Donders' c-task was conceived to tap stimulus recognition, but already in 1886 Cattell questioned its validity, arguing that the task entails processes beyond identification or recognition. More recently, Welford (1980), echoing Cattell's concerns, concluded that the difference between the b-task (originally thought to tap response selection) and the c-task is one of degree, and that both entail choice of the response. Wundt, acutely aware of the problem, conceived a new task, the d-procedure (meant to be a pure measure of recognition), to no avail. More than linguistic indeterminism is at stake. G. A. Smith (1977), for one, obtained data showing choice to be faster than recognition! How does one make choices among stimuli that one does not recognize? In the absence of a definite task-process association and theory, we cannot know with certainty the identity and order of the pertinent psychological processes. Given the problems, the method of subtraction was out of favor for many years with students of RT.

The succeeding section will bring us into the modern era of cognitive research. Subsequent sections will revisit many of the concepts with more quantitative detail, but still with emphasis on a friendly style.

Sternberg's Revival of the Donders Project: Inaugurating the Modern Study of Human Information Processing

Reminiscent of the tale of Sleeping Beauty, Dondersian procedures were lying dormant for over a century. The prince-investigator reviving the technique was Saul Sternberg (1966, 1969a,b), and the magic kiss awakening renewed interest was his memory scan experiment. The participants are first shown a number of items. Then, they decide whether a test item was or was not present in the set just shown. Prototypical results are given in Figure 4.3. Two features of the data are noteworthy. First, RT is a linear function, with a positive slope, of the size of the memory set shown. Adding a single member to the memory set increases RT by the same constant amount. Second, targets and foils produce the same increment in RT, so that the slope of the function is the same for yes and for no responses (in Figure 4.3 the intercept, reflecting stimulus encoding, base and residual time, incidentally is also the same; however, the important feature is the parallelism of the target-present and target-absent functions). Sternberg interpreted the linear function with the positive slope to reflect serial processing, such that the test item is compared with the memory representation of each of the items in the positive set, one item at a time. He interpreted the parallelism of the slopes to mean that the search continues until the entire memory set is exhausted, even if an early item in the positive set matches the probe stimulus. Sternberg's interpretation of his data is now known as the standard serial exhaustive search model. If search ceases as soon as a probe item is located, the process is said to self-terminate.

Fig. 4.3 Prototypical results of Sternberg's memory scan experiment. Mean RT is plotted against the size of the positive set, with separate points for positive and negative responses and lines of best fit.

Sternberg's original (1966) analyses were stronger than many of the scores of studies that followed, due not only to invoking several control conditions but also to helping rule out an important class of parallel models. Again, we will discuss this matter, as well as other topics in this section, in more quantitative detail subsequently.

Sternberg's conclusions seem compelling but, as subsequent research has revealed, neither is forced by the data. The positive slope appears to have all the earmarks of serial processing, but a moment of reflection suffices to show that the same result follows in a natural fashion from parallel processing. Think of horse races (actual ones, not modeling metaphors) with a different number of horses in each race. The referee reports back to the organizer once each race is over (i.e., when the slowest horse crosses the finish line). Clearly, each race is parallel and exhaustive. It requires only a little intuition to conclude that the larger the number of horses, the longer the expected duration between the common start and the finishing time of the slowest horse (i.e., the RT-set size function has a positive slope). Now, if every horse runs just as fast and with the same random variation no matter how many other horses are present, then it can be shown that the increasing duration for all

the horses to finish bends over (i.e., increases by less and less an amount as the number of horses increases; see Townsend & Ashby, 1983, p. 92, for a proof) rather than being straight. Such a system, whether run by horses or by parallel perceptual or cognitive channels, is said to be of unlimited capacity (e.g., Townsend, 1974; Townsend & Ashby, 1978). Sternberg's (1966) analyses did rule out this variety of parallel processing.

Formal models of memory- or perceptual-scanning have introduced the notion of limited capacity in performing the comparison process. In Townsend's capacity reallocation model (Townsend, 1969, 1974; Townsend & Ashby, 1983; see also Atkinson, Holmgren, & Juola, 1969), a finite amount of capacity is redistributed after completing the comparison of each item; the processing itself is always a parallel race between the remaining items. Such limited-capacity, parallel exhaustive search models yield precisely the same predictions as Sternberg's original model (e.g., positive parallel slopes for target-present and target-absent processing, absence of a serial position effect, and linear growth of variance with the number of items), some of which are not generally confirmed by experimental data. Following Townsend's early development (1969, 1971), several classes of parallel models have been shown to predict Sternberg's results (Corcoran, 1971; Murdock, 1971; Townsend, 1969, 1971a,b, 1972, 1974; Townsend & Ashby, 1983). Moreover, Sternberg's data can be predicted by self-terminating rather than exhaustive search, whether in parallel (e.g., Ratcliff, 1978) or even serial (e.g., Theios, Smith, Haviland, Traupmann, & Moy, 1973) models. The reader should consult Section 5, as well as Van Zandt and Townsend (1993) and Townsend and Colonius (1997), for more details on the topic of testing self-terminating versus exhaustive processing in parallel and serial models.

The interrogation of Sternberg's results also entailed (slight) experimental modifications. For example, the memory set can follow rather than precede the probe stimulus, thus initiating what are usually termed visual search (or early target) experiments. Early examples of these designs are found in the studies by Estes and Taylor (1969), Atkinson et al. (1969), and van der Heijden (1975).

A more consequential manipulation entails the inclusion of more than a single replica of the target stimulus in the search list. RT is found to decrease with the number of redundant targets (e.g., Baddeley & Ecob, 1973; Egeth, 1966; in bimodal perception, see Bernstein, 1970), a result inconsistent with the prediction of the standard serial exhaustive model. Regardless of this particular result (the violation can be dealt with fairly easily by slight modification of the pertinent models), redundant target designs proved a powerful tool in revealing virtually all aspects of human information processing.

Sternberg revived Donders' method of subtraction in a further profound way. In his method of additive factors (Sternberg, 1966, 1969a,b), one does not eliminate or bypass a stage (as in the method of subtraction) but rather affects it selectively. Think of the standard memory scan experiment for an illustration. In the additive factors scheme, the operation of comparison comprises a single stage affected by the factor of size of the search set. Suppose that one adds another stage, stimulus encoding, affected by degrading the quality of the visual presentation. The logic of the method is as follows. Varying the number of stimuli in the search set affects comparison (and response) processes, whereas degrading the quality of the stimuli affects perceptual encoding. Additivity (of the mean RTs) holds if indeed the manipulations influence the respective processes selectively. If one further assumes independence, the incremental effects of added stages should be additive over accumulated RTs, too. The expected result in this two-stage serial model is shown in Figure 4.4. The influence of set size is revealed by the positive slopes of the RT curves, and that of visual degradation by the longer RTs. Critically, the two factors do not interact, as is evident in the parallelism of the slopes.

Fig. 4.4 Hypothetical results in an additive factors experiment in which additivity is seen to hold. Mean RT increases with set size for both good- and poor-quality stimuli, and the poor-quality curve lies above the good-quality curve by a constant amount.
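The additive-factors logic lends itself to a quick simulation check. The sketch below is only an illustration (the stage-duration parameters are invented, not taken from the chapter): it generates mean RTs from a two-stage serial model in which stimulus quality affects only the encoding stage and set size affects only the comparison stage, and then confirms that the two factor effects are additive, that is, that the RT curves are parallel as in Figure 4.4.

```python
import random

def mean_rt(set_size, degraded, n_trials=20000, seed=1):
    """Mean RT from a two-stage serial model: RT = encoding + comparison.

    Stimulus quality affects only the encoding stage; memory set size
    affects only the comparison stage. All parameters are arbitrary.
    """
    rng = random.Random(seed)
    encode_mean = 150.0 if degraded else 100.0  # ms; quality -> encoding only
    compare_mean = 40.0 * set_size              # ms; set size -> comparison only
    total = 0.0
    for _ in range(n_trials):
        total += rng.expovariate(1 / encode_mean) + rng.expovariate(1 / compare_mean)
    return total / n_trials

# Effect of degrading the stimulus, measured at two set sizes:
quality_effect_small = mean_rt(2, True) - mean_rt(2, False)
quality_effect_large = mean_rt(6, True) - mean_rt(6, False)

# Additivity: the quality effect is essentially the same (about 50 ms
# with these parameters) at every set size, so the factors do not interact.
print(round(quality_effect_small), round(quality_effect_large))
```

If, instead, degrading the stimulus also slowed the comparison stage (a violation of selective influence), the two effects would diverge and the curves would no longer be parallel.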

The additive factors method, like the memory scan experiment, has engendered a very large amount of research, producing a wealth of valuable theorems (e.g., the independence of additivity and stochastic independence) and theoretical insights (e.g., success and failure in mimicry of serial systems by parallel systems). The last point will be particularly appreciated by those experiencing the frustration in convincing a graduate student (or a seasoned researcher!) that a positive slope does not, ipso facto, imply serial processing (Feature Integration Theory [Treisman & Gelade, 1980] is a poignant case in point). Criticisms and generalizations of the method unearthed further important information. For example, additivity does generally support separate processing stages, but interaction does not necessarily support a single stage. Statistical properties of the analysis of variance (ANOVA) might compromise, to an extent, its value as the (sole) diagnostic tool (cf. Townsend, 1984).

A really consequential feature of the method, in virtually all modifications and generalizations (but see Schweickert, 1982), is that it tells us nothing about the order of occurrence of the various stages (or underlying processes). The ensuing problems were already noticed with respect to the original method by Donders, but they are equally serious with the method of additive factors.

Sternberg's landmark studies, along with the almost concomitant works by Sperling, Estes, Nickerson, and Egeth and others, inaugurated the human information-processing approach in earnest. Where Donders, in his subtraction method, changed the nature of the tasks as well as the number of stimuli, Sternberg, in his memory experiment, did not change the task, only added items. It is easier to subtract numerical values of RT than entire psychological processes (cf. Marx & Cronan-Hillix, 1987). In his additive factors method, Sternberg showed that it was not even necessary to subtract processes, only to affect them experimentally in a selective way. Within a decade of Sternberg's seminal contribution, virtually all students of RT and roughly half the community of cognitive psychologists (Lachman, Lachman, & Butterfield, 1979) were conducting research employing or testing some aspect of Sternberg's theory and methodology.

Basic Issues Expressed Quantitatively

In the previous sections we surveyed some of the history of RT research. Several key issues were highlighted from historical and philosophical perspectives, all related to the notion of time and the role it plays in mental processes. First, mental events—our feelings, thoughts, and decisions—take time, and this time can be measured. Second, internal subprocesses can take place one at a time (and are hence called serial processes), or at the same time (parallel processes). Third, when several subprocesses take place, the system must await the completion of each and every one of these subprocesses before moving on to respond (exhaustive processing), or, conversely, it can finish before that, say, upon the termination of any one of the subprocesses (minimum-time).1 Fourth, subprocesses may be independent from one another (or not), and so may the time durations taken to complete each subprocess. And finally, we introduced the idea that people may have a limited capacity—a limited amount of resources (attention)—and hence can deal effectively with only a limited amount of processing at any given time. In what follows we provide a formal treatment of each of these basic issues, along with illustrative examples.

The first issue, regarding the temporal modeling of information processing, is ubiquitous in theoretical approaches to human cognition. We see this affirmed in several chapters of this book, such as Chapter 3 (Modeling Simple Decisions and Using a Diffusion Model) and Chapter 6 (A Past, Present, and Future Look at Simple Perceptual Judgment). Many models of perception and decision making are based on the premise that information, or evidence toward some target behavior, is accumulated over time. Thus, to answer Titchener's (1905) question, we have both the right and the obligation to speak about the duration of mental processes.

The remaining basic issues are discussed next in greater detail; the reader may find the following example helpful throughout this discussion. Suppose that you are a driver approaching an intersection. The sight of a red light or the sound of a policeman's whistle signals you to stop and give way. One can think of the visual signal and the auditory signal as being processed in separate subsystems, which we call channels. We denote the time to process and detect a signal in each of the channels by tA (for the visual channel) and tB (for the auditory channel). We further make the assumption that both signals are presented at exactly the same time (we can relax this assumption subsequently). What can we learn about the time course of information processing? What can we learn about the

relationship between the information-processing channels?

The critical properties of architecture, stopping rule, and independence will now be introduced with only a little mathematics. A rigorous mathematical statement regarding architecture (i.e., parallel and serial processes) appears in Section 4 of this chapter. For more quantitative detail on these features, the reader should consult Townsend and Ashby (1983) or Townsend and Wenger (2004b, for a more recent statement).

Architecture: Parallel Versus Serial Processing

As mentioned, two or more subprocesses can take place one at a time (serial), or at the same time (parallel). Figure 4.5 illustrates these modes of processing, where each arrow corresponds to a particular channel. It is convenient to consider the way the system operates—its architecture—through the prism of the time it takes to complete the processing of both signals. Suppose that the driver is unwilling to hit the brakes unless both signals are spotted, that is, she processes the two signals exhaustively. In the serial case (Panel a), the time to process both signals is the sum of the durations needed to process each channel, such that total tserial = tA + tB. In the parallel case, this time equals that needed to process the slower of the two processes, tparallel = max(tA, tB). It is tempting to think that parallel processing will yield a faster braking response compared with serial processing (and more generally that parallel processing is more efficient than serial processing), given that max(tA, tB) < tA + tB, for any tA, tB > 0. This intuitive notion is true only as long as we assume that tA (and similarly tB) is the same in the serial and the parallel cases.2

Is it realistic to expect our driver to bring her car to a stop only after she detects both sources of information? On intuitive grounds, one would prefer to act quickly on the basis of only one signal, whichever signal is detected first as a sign of danger. This issue is considered next.

Exhaustive Versus Minimum-Time Stopping Rule

Awaiting the completion of two subprocesses is referred to as exhaustive processing. The processing durations, tserial and tparallel, for that strategy were given earlier. It is also possible to stop as soon as the first process is completed; in our example, as soon as the driver detects the red light or hears the policeman's whistle. This strategy is referred to as minimum-time processing. The overall time it takes for a parallel system with a minimum-time rule is given by tparallel = min(tA, tB). For a serial system, the total duration depends on the order of processing: tserial = tA if A is processed first, and tserial = tB if B is processed first. Needless to add, in a serial system that stops as soon as the first channel completes (as soon as the first signal is detected), the second channel will not have a chance to operate at all. Although other stopping rules are also possible, the exhaustive and minimum-time stopping rules are of particular interest. They are illustrated in Figure 4.5 (Panels a–d). Processing times for the different systems are summarized in Table 4.1.
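The four architecture-by-stopping-rule combinations just described (and summarized in Table 4.1) reduce to a few lines of code. The sketch below is merely an illustration of those definitions; the channel durations in the example are made-up numbers.

```python
def completion_time(t_a, t_b, architecture, stopping_rule, first="A"):
    """Overall completion time for one trial given channel durations tA, tB.

    architecture: "serial" or "parallel".
    stopping_rule: "exhaustive" or "minimum".
    first: which channel a serial system processes first ("A" or "B").
    """
    if architecture == "parallel":
        return max(t_a, t_b) if stopping_rule == "exhaustive" else min(t_a, t_b)
    if stopping_rule == "exhaustive":      # serial: both channels, one at a time
        return t_a + t_b
    return t_a if first == "A" else t_b    # serial: stops after the first channel

# Driver example with illustrative values: the red light is detected in
# 300 ms and the whistle in 450 ms.
print(completion_time(300, 450, "parallel", "exhaustive"))    # 450 = max(tA, tB)
print(completion_time(300, 450, "serial", "exhaustive"))      # 750 = tA + tB
print(completion_time(300, 450, "parallel", "minimum"))       # 300 = min(tA, tB)
print(completion_time(300, 450, "serial", "minimum", "B"))    # 450 = tB
```

Note that in the serial minimum-time case the answer depends entirely on which channel happens to be processed first, exactly as in the text.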


Fig. 4.5 Illustrations of serial (Panels a, c), parallel (b, d), and coactive (e) systems. Panels a and b demonstrate exhaustive processing, where both processes A and B must finish before a decision and response can be made. Panels c and d show minimum-time processing, where processing ceases once process A is completed (but B had not finished, as indicated by the broken line). Panel e illustrates a coactive mode of processing, where activation from the two channels is summed before the decision stage.

Table 4.1. Summary of overall completion times for the various models. tA and tB denote the time to process signals in channels A and B, respectively.

    Model and stopping rule    Overall completion time
    Parallel exhaustive        max(tA, tB)
    Serial exhaustive          tA + tB
    Parallel minimum time      min(tA, tB)
    Serial minimum time        tA if channel A is processed first and B second;
                               tB if B is processed first and A second

Stochastic Independence

Two events are said to be statistically independent if the occurrence of one does not affect the probability of the other. For example, height and SAT score (standardized test score for college admissions in the United States) are independent if knowing the height of a person tells nothing about his or her SAT score. In the context of processing models, total completion times of channels A and B are independent if knowing one does not tell us a thing about the value of the other.

Our discussion of the architecture and stopping rule was simplified by the fact that we assumed that processing is deterministic, rather than stochastic (probabilistic). A deterministic process always yields a fixed result, such that the effect or phenomenon we observe has no variability. For example, a deterministic process predicts that the time taken to drive from Sydney to Newcastle is always fixed, or that the time to choose between chocolate and vanilla flavors is the same every time we stop at the ice-cream parlor. Under this assumption, we were able to represent the times for processing in channels A and B by the fixed values tA and tB.

However, observations of human performance (and Sydney's traffic) lead to the conclusion that behavior is quite variable and that it can probably be better described as a stochastic process. If so, processing time in any particular channel can no longer be characterized by a fixed value, but is represented by a random variable. A random variable does not have a single, fixed value but can rather take a set of possible values. These values can be characterized by probability distributions. The probability density function (pdf) is defined by f(t) = p(T = t), and gives the likelihood that some process, which takes random time T to complete, will actually be finished at time t.

We can use f(t) to define stochastic independence. In probability theory, two random variables are independent if knowing the value of one tells nothing whatsoever about the values of the other (e.g., Luce, 1986, chapter 1). In processing models, total completion times of channels A and B are independent if knowing one, say tA, tells us nothing about the likelihood of various values of tB. Thus, we can express independence in terms of the joint pdfs, fAB(tA, tB) = fA(tA) · fB(tB), which means that the joint density of process A finishing at time tA and process B finishing at time tB is equal to the product of the probability of A finishing at time tA and the probability of B finishing at time tB.

Workload Capacity and the Capacity Coefficient

We recounted earlier that the time to process multiple signals depends on the stopping rule and mode of processing (serial, parallel). Notably, processing also depends on the amount of resources available for processing, a notion that we call capacity. One may think of the cognitive system as performing some work, and the more subprocesses (channels) are engaged, the greater the amount of work there is to perform. We define workload capacity as the fundamental ability of a system to deal with ever heavier task duties (Townsend & Eidels, 2011; see also Townsend & Ashby, 1978, 1983). A ready example is the increase in load from processing one signal to processing two or more signals.

One may find it useful to think about work and capacity in terms of metaphors such as water pipes filling a pool, or tradesmen building a house. Suppose that the tradesmen operate in parallel (and, for illustration, deterministically) and that there is an infinite amount of resources (tools, building materials)—unlimited capacity. In that case, a twofold increase in the number of workers will cut in half the amount of time needed to build the house (assuming all tradesmen have the same work rate). Critically, adding more workers does not affect the labor rate of each individual worker. In a similar vein, increasing load on the cognitive system by increasing the number of to-be-processed items does not have an effect on the efficiency and time of processing each item alone. The time to process the visual signal (red light) when it is presented alone should be the same as the time to process the same signal when it is presented in tandem with the auditory signal (whistle by the policeman),

tA|A = tA|AB. To clarify the notation, the subscript A|A indicates processing of signal A given that only signal A is present, whereas A|AB indicates processing of signal A when A and B are both present. If several channels are working toward the same goal and capacity is unlimited, then adding more channels should facilitate processing.

It is possible, however, that capacity is limited. In one special case, the overall amount of processing resources, X, can be a fixed value. With more and more channels coming into play, fewer resources can be allocated to each channel and, consequently, the time to complete processing within each channel increases. So, for example, the time to process the visual signal is longer when the auditory signal is also present. Using the same notation as before, we can express this as tA|A < tA|AB and tB|B < tB|AB. Under limited capacity, performance with a given target is impaired as more targets are added to the task. Metaphorically, this is tantamount to tradesmen who are trying to work in parallel but share one set of tools. Worker A cannot work at the same rate that she did alone if she needs to await her partner handing over the hammer. Given that multiple workers or channels operate toward the same goal, a limited-capacity system can still complete processing faster than (or at least as fast as) any single channel alone (depending on the severity of the capacity limitation). However, a limited-capacity system cannot be faster than an otherwise identical unlimited-capacity system.

A third and at first curious case is that of supercapacity. It is possible in principle that, as more and more channels are called into action, the system recruits more resources (à la Kahneman, 1973) and is able to allocate to each of the channels more resources than what each channel originally had when it was working alone. In this case, tA|A > tA|AB and tB|B > tB|AB, and, moreover, the more signals (and channels) there are, the faster the system completes processing. Under supercapacity, performance with a given target is improved as more targets are added to the task.

We can model supercapacity by way of a system in which channels A and B pool their activation into a single buffer, in which evidence is then compared against a single criterion. In that sense, processing channels can also join efforts to satisfy a common goal, as could be the case in the tradesmen example. This mode of processing is often referred to as coactivation (e.g., Colonius & Townsend, 1997; Diederich & Colonius, 1991; Miller, 1978, 1982; Schwarz, 1994; Townsend & Nozawa, 1995; Townsend & Eidels, 2011) and is illustrated in Figure 4.5e. Clearly, this type of model benefits from an increase in the number of relevant signals. With auditory and visual signals contributing to a single pool, evidence accumulates more quickly and will surpass threshold faster. Thus, a coactive model is a natural candidate for supercapacity. However, it is not the only way supercapacity can be achieved in parallel systems, as we shall see (Eidels, Houpt, Altieri, Pei, & Townsend, 2011; Townsend & Wenger, 2004a).

Townsend and Nozawa (1995) offered a measure of workload capacity known as the capacity coefficient:

    COR(t) = log[SAB(t)] / log[SA(t) · SB(t)].    (1)

SA(t) and SB(t) are the survivor functions for completion times of processes A and B, and tell us the probability that channels A and B, respectively, did not finish processing by time t. SAB(t) is the survivor function for completion times of the system when channels A and B are both at work (e.g., when two targets are being processed simultaneously). We have already defined the pdf, f(t) = p(T = t), as the likelihood that a process that takes random time T to complete will actually be finished at time t. We can also define the probability that the process of interest is finished before or at time t, known as the cumulative distribution function (cdf), F(t) = p(T ≤ t). The survivor function is the complement of the cdf, S(t) = 1 − F(t) = p(T > t), and tells us the probability that this process had not yet finished by time t.

The capacity coefficient, COR(t), allows one to assess performance in a system that processes multiple signals by comparing the amount of work done by the system when it processes two signals with the amount of work it does when each of the signals is presented alone. The subscript OR indicates that processing terminates as soon as subprocess A or subprocess B finishes (i.e., minimum-time termination). Townsend and Wenger (2004a) developed a complementary capacity coefficient for the AND design, where the system can stop only after the two processes, A and B, are both finished:

    CAND(t) = log[FA(t) · FB(t)] / log[FAB(t)].    (2)

Equations 1 and 2 both apply to two channels, but the C(t) index can be easily generalized to account for more than two processes (Blaha & Townsend, 2006). The interpretation of COR(t)

features of response times 73

and C_AND(t) is the same, so that C(t) refers to both indices. Parallel-independent models are characterized by unlimited capacity, C(t) = 1. Capacity is C(t) < 1 in a limited-capacity model, and it is C(t) > 1 with supercapacity in force. Architecture (serial, parallel), stopping rule, and potential dependencies can also affect the capacity coefficient. For the effect of architecture, consider a serial model, which processes channel A first and then processes channel B. This model will take more time to complete, on average, than an otherwise identical parallel model in which processes A and B occur simultaneously. The former also results in C(t) < 1, that is, limited capacity. Breakdown of independence across channels also affects C(t) in a predictable manner. Townsend and Wenger (2004a) and Eidels et al. (2011) have shown that positive dependency (one channel "helps" the other) can lead to supercapacity, C(t) > 1, whereas negative dependency (one channel inhibits the other) can lead to limited capacity, C(t) < 1. The capacity coefficient is discussed further in the later section, Theoretical Distinctions, along with an illustrative example from the Stroop milieu. The interpretation of C(t) is particularly revealing when discussed with respect to the benchmark model that we describe next.

The Benchmark Model: Parallel, Independent, Unlimited Capacity

The standard parallel model can be considered the "industry's standard" in response-time modeling. This model is characterized by unlimited capacity and independent, parallel processing channels (attributes that yield the acronym UCIP; e.g., Townsend & Honey, 2007). If we further assume that the model can stop as soon as either one of the channels completes processing, we end up with an independent race model, illustrated earlier in Figure 4.5(d). Formally, the stochastic version of this model can be written as

S_AB(t) = S_A(t) · S_B(t). (3)

S_A(t) and S_B(t) are again the survivor functions for completion times of processes A and B and tell us the probability that channels A and B, respectively, did not finish by time t. Consider a model that stops processing as soon as either channel finishes (minimum-time processing), but will otherwise not stop as long as process A is still going on and process B is still going on (i.e., as long as both processes "survive," hence the term survivor function). Because processing channels A and B are independent, we can multiply the probabilities, so that the probability that the entire system does not stop by time t, S_AB(t), is given by the product of the probabilities of A and B not finishing (see Eq. 3 again).³ We note that this equation describes a model with only two channels, but it can be generalized to any number of channels. The probability that an independent race model with n parallel channels does not complete by time t is given by the product of the probabilities of no channel finishing,

S_minimum-time(t) = S_1(t) · S_2(t) · ... · S_n(t) = ∏_{i=1}^{n} S_i(t). (4)

Given a parallel model, it is possible that the system stops only when all of its channels have completed processing (exhaustive processing). In the example, the system will stop only when both channel A and channel B stop. Assuming again that the channels are independent, the probability that the model completes processing by (at or before) time t is equal to the product of the probabilities of channels A and B finishing,

F_AB(t) = F_A(t) · F_B(t) (5)

and, in the more general form with n channels,

F_exhaustive(t) = F_1(t) · F_2(t) · ... · F_n(t) = ∏_{i=1}^{n} F_i(t). (6)

Two well-known RT inequalities also define the benchmark model. Miller (1978, 1982) proposed an upper bound for performance in the OR design ("respond as soon as you detect A or detect B"), the race model inequality:

F_AB(t) ≤ F_A(t) + F_B(t). (7)

The inequality states that the cumulative distribution function for double-target displays, F_AB(t), cannot exceed the sum of the single-target cumulative distribution functions if processing is an ordinary race between parallel and independent channels. Violations of the inequality imply supercapacity of a rather strong degree (Townsend and Eidels 2011; Townsend and Wenger 2004a).

Grice, Canham, & Gwynne (1984) introduced a bound on limited capacity, often referred to as the Grice inequality:

F_AB(t) ≥ MAX[F_A(t), F_B(t)]. (8)
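The UCIP benchmark relations above (Eqs. 3-8) lend themselves to a quick numerical check. The sketch below is illustrative only: the two exponential channel rates are made-up values, and the final check assumes the standard OR-design form of the capacity coefficient, C(t) = log S_AB(t) / log[S_A(t) · S_B(t)], whose defining equation appears earlier in the chapter rather than in this excerpt:

```python
import math

# Illustrative exponential channel rates (assumed values, not from the chapter).
RATE_A, RATE_B = 0.02, 0.03  # completions per ms

def F(rate, t):
    """Cumulative distribution function of a single exponential channel."""
    return 1.0 - math.exp(-rate * t)

def S(rate, t):
    """Survivor function of a single channel, S(t) = 1 - F(t)."""
    return math.exp(-rate * t)

for t in range(50, 501, 50):
    S_AB = S(RATE_A, t) * S(RATE_B, t)   # Eqs. 3-4: minimum-time (race) survivor
    F_or = 1.0 - S_AB                    # race-model CDF for the OR design
    F_and = F(RATE_A, t) * F(RATE_B, t)  # Eq. 5: exhaustive (AND) CDF
    # Eq. 7 (Miller bound) and Eq. 8 (Grice bound) both hold for the UCIP race:
    assert F_or <= F(RATE_A, t) + F(RATE_B, t)
    assert F_or >= max(F(RATE_A, t), F(RATE_B, t))
    assert F_and <= F_or                 # exhaustive stopping is never faster
    # Assumed OR-form capacity coefficient; equals 1 for the UCIP benchmark:
    C_t = math.log(S_AB) / (math.log(S(RATE_A, t)) + math.log(S(RATE_B, t)))
    assert abs(C_t - 1.0) < 1e-9
print("UCIP benchmark checks passed")
```

For the benchmark model the Miller and Grice bounds hold and C(t) = 1 at every time point; in data, an OR-design F_AB(t) crossing the Miller bound would signal supercapacity of a strong degree, whereas one falling below the Grice bound would signal severely limited capacity.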

74 elementary cognitive mechanisms

This inequality states that performance on double-target trials, F_AB(t), should be faster than (or at least as fast as) that in the faster of the single-target channels. If this inequality is violated, the simultaneous processing of two target signals is highly inefficient and the system is of very limited capacity. An implication is that there are "no savings" or gains in moving from a single target to multiple targets (in OR designs). In Section 5 we shall demonstrate the use of the three assays of capacity in an OR design – C(t) and inequalities (7) and (8). Colonius and Vorberg (1994) proposed upper and lower bounds appropriate for AND tasks ("respond if you detect target A and target B"), which are analogous to OR tasks in the sense that their violations indicate supercapacity and limited capacity. Our benchmark model is, therefore, useful in serving as a gold standard against which performance can be compared and interpreted.

Box 1: Is human capacity limited?

We noted in this section that workload capacity—as measured by the capacity coefficient—could theoretically be limited, unlimited, or super, depending on whether the efficiency of processing decreases, is left unchanged, or increases with additional load (e.g., more signals to process). Cumulative evidence suggests that human capacity is limited (Kahneman, 1973), yet important and frequent situations of modern life, such as driving a car, require simultaneous processing of multiple signals. Therefore, a key question is whether human capacity is, in fact, limited, and what might be the consequences of such limitations in our everyday life.

Strayer and Johnston (2001) studied the effects of mobile-phone conversations on performance in a concurrent (simulated) driving task. They found that conversations with either a hand-held or a hands-free mobile phone while driving resulted in a failure to detect traffic signals and in slower reactions to these signals when they were detected. The findings clearly suggest that human capacity is limited. However, in a more recent driving-simulator study, Watson and Strayer (2010) were able to identify a group of individuals—referred to as "supertaskers"—who can perform multiple tasks without observed detriments. Although the majority of the participants showed significant performance decrements in the dual-task conditions (compared with a single-task condition of driving without distraction), a small minority of 2.5% showed no performance decrements. These supertaskers can be best characterized as having unlimited capacity (and possibly even supercapacity). The simulated-driving studies by Strayer and colleagues highlight some practical implications of uncovering latent mental constructs (capacity, in this example).

Conclusion

Information-processing models can be characterized by the following four features referring to the relations among processing channels: architecture (serial, parallel), stopping rule (minimum-time, exhaustive), capacity (limited, unlimited, super), and stochastic (in)dependence. Most of these properties are latent and cannot be observed directly. Response times are useful tools in uncovering these properties, but in some cases the result is not unique. Model mimicry is thus the focus of the upcoming section. The caveats granted, recent advances in response-time modeling of cognitive processes have proved useful in addressing some of the mimicking challenges (allowing researchers to identify critical features of human information processing). The later section on Theoretical Distinctions outlines some of the advances, followed by applications of novel techniques from the empirical literature. The reader might have noticed that some interesting topics, such as the stochastic form of serial models, were excluded from our discussion due to lack of space.⁴ However, the topics included in this chapter should give the reader a good understanding of elementary information-processing theory and a solid preparation for more specialized reading. Box 1 gives a practical illustration of the outstanding issues.

Model Mimicry

Possessing the building blocks (architecture, stopping rule, capacity, and independence), we now can expand our purview to the establishment of classes of models characterized by those properties. For example, an exhaustive stopping rule in a serial model with independent, identically distributed processing times will have a mean response time equal to the sum of the mean response times for each channel:

E[RT] = E[RT_Channel 1] + E[RT_Channel 2] + ··· + E[RT_Channel n] + E[T_0],

where T_0 is the base time to respond. Thus, for each channel added, we simply add its mean response time to the total average response time. But is this the only model with such a prediction?

In this section, we provide instances of overlap in predictions that arise from assuming various models. When one model can predict the results of another model, we face an instance of model mimicry. Though perhaps an obvious platitude, investigators rarely seem to concern themselves with the specter of mimicry. In this discussion, we emphasize total mimicry, that is, the existence of mathematical functions carrying the structure of one model to another in such a way as to render them completely equivalent. The upshot is that no data expressed at the same level as the mimicking equations can decide between competing models. Mimicry at other levels will be considered, as well as some remedies to the parallel-serial dilemma (in the following section).

Mean Response Time Predictions

Recall that mean RT has been a useful tool in helping to determine (or eliminate) models best suited for data. Sternberg (1966), discussed in Section 2, supported a positive linear relationship between mean RT and set size. An early extension of this paradigm to conditions where the items were on display (instead of being stored in memory) was carried out by Atkinson et al. (1969), with largely similar results. The evidence for exhaustive processing was supported by the lack of an effect of the serial position of the target in the list. On the other hand, Nickerson (1966) argued that these data could be taken to favor self-terminating processing. In a seminal study with a different type of visual paradigm, a same-different matching design with multiple targets, the data were interpreted as supporting a serial self-terminating process (Egeth, 1966). Even within the visual search paradigm, sometimes a self-terminating stopping rule is found and sometimes an exhaustive stopping rule is concluded. [See the section Theoretical Distinctions for further discussion of assessing the decisional stopping rule.]

However, in none of these pioneering studies was the potential for confounding by other processing characteristics, especially capacity, taken into account. As we recounted, the early standard model was a serial, exhaustive model with equal mean processing times for every item. If one additionally assumes that each item or stage possesses the same actual processing distribution (thus producing the equal mean processing times a fortiori) and that they are also independent, then one has the complete standard serial model as outlined earlier.

For simplicity, assume that the mean processing times for each of the single items are all equal. Assume further that the target has equal probability of appearing in any of the n positions. On target-present trials, participants process (n + 1)/2 items on average (yielding a positive linear relationship between mean RT and set size). On target-absent trials, participants have to process the entire list, so that the average RT is n times the mean RT for a single item. Therefore, on both target-present and target-absent trials, there is a positive linear relationship between mean RT and set size.

As we alluded to in Section 2, it can be shown that unlimited-capacity, independent parallel models do not generally make this prediction. These models, when using an exhaustive stopping rule, produce logarithmic-like functions that increase with set size, but not in a linear fashion (see Townsend and Ashby 1983, p. 92). In the case of minimum-time (i.e., race) stopping, they yield curvilinear decreasing mean RT functions. Interestingly, single-target self-terminating processing reveals a flat, straight-line mean RT function for these models. Yet the linear prediction of the standard Sternberg model is not unique to the serial class of models. Next, we introduce a particular parallel model, where the rate of processing depends on the number of items to be processed, that does yield the linear increase prediction. This model is just one of a multitude of models that can predict the linear relationship found in the data.

Mean response times are a common measure used in determining the processing mechanisms in a task. Although illuminating with respect to the manipulated variables, the model conclusions made from such observations must consider the possibility of mimicry.

Supporting Mathematics: Serial Model

Recall from the previous section, Basic Issues Expressed Quantitatively, that, in a serial model, items are processed one at a time. In minimum-time processing the target may appear in any of the available positions and processing stops when the target is found. As standard practice, E[I_i] denotes the mean processing time of the ith item. Then, using mathematical induction, for target-present trials, one has

E(Response Time for n positions)
  = (1/n)E[I_1] + (1/n)E[I_1 + I_2] + ··· + (1/n)E[I_1 + ··· + I_n]
  = (1/n)E[I_1] + ··· + (1/n)(E[I_1] + ··· + E[I_n])
  = (1/n)E[I_1] + (2/n)E[I_1] + ··· + (n/n)E[I_1]
  = ((n + 1)/2) E[I_1],

whereas on target-absent trials the result is simply

E(Response Time for n items) = E[I_1 + ··· + I_n] = nE[I_1].

There is, thus, a linear relationship between number of items and mean response times for positive and negative responses. Of course, the minimum-time serial prediction is simply E[I_1], a flat straight line.

Supporting Mathematics: Parallel Model

For clarity, a "stage" of processing is the time from one item finishing processing to the next item finishing processing. For example, in an exhaustive model with three items to be processed, any channel will have three stages: the time from start until the first item is processed, the time after the first item is processed to the time the second item is processed, and the time from the second item's completed processing until the remaining item is finished processing. In a parallel model, the distribution of stage processing time takes the form of a difference between item processing times, usually conditioned on channel information. Within-stage independence is defined as the statistical independence of stage processing times across two or more channels in the same stage, j. Across-stage independence assumes the independence of these times occurs within the same channel, but for different stages. Consider the within-stage independent parallel model with each item having a processing time following an exponential distribution with a rate inversely proportional to the number of items, n. In other words, the more items to be processed, the longer the actual processing time of each item will be. Thus, let g_{a_i j} = exp(−λ/(n − j + 1)) be the processing density for the ith item in stage j of processing. For example, stage-one processing on all items is g_{a_1 1} = g_{a_2 1} = ··· = g_{a_n 1} = exp(−λ/n), whereas stage-two processing has density function g_{a_1 2} = g_{a_2 2} = ··· = g_{a_{n−1} 2} = exp(−λ/(n − 1)).

We omit the reasoning due to space limitations, but the average processing time for each stage is 1/λ. So, the mean processing time for n items is (n + 1)/(2λ) (positive response) and n/λ (negative response). So, for λ = 1 this parallel model gives the same predictions as the aforementioned serial model for mean response times as functions of the number of items.

Intercompletion Time Equivalence

We refer to the time required for a stage of processing as the intercompletion time. So in a serial model, the intercompletion times are just the processing times. We now examine the issue of model mimicry with respect to the distribution of the intercompletion times. We will show cases in which equivalence can occur between two common models, the across-stage independent serial model and a large class of parallel models that assume within-stage independence. Across-stage independence is defined as the property that the probability density function of two or more stages of processing is the product of the component single-stage density functions.

Consider the case in which there are two channels, a and b, each dedicated to processing a particular item. To make the equivalence easy to follow, we write the serial model on the left side of the equations and the parallel model on the right. We use f for the pdf of the serial model, and g for the parallel model. p denotes the probability that a is processed first in the serial model. f_a1(t_a1) is the probability (density) that it takes a the exact time t_a1 to finish in the first stage of processing. G is the survivor function of the respective subscript (for a parallel model); so G_b1(t_a1) denotes the probability that the first stage of processing for b will fail to finish before time t_a1 in a parallel model.

Then, for the independent serial model to mimic the independent parallel model on all response time measurements it is necessary that:

p f_a1(t_a1) f_b2(t_b2 | t_a1) ≡ g_a1(t_a1) G_b1(t_a1) g_b2(t_b2 | t_a1), (9)

(1 − p) f_b1(t_b1) f_a2(t_a2 | t_b1) ≡ g_b1(t_b1) G_a1(t_b1) g_a2(t_a2 | t_b1). (10)

For mimicry on the level of intercompletion times, we need equivalence for each stage of processing. For example, in the case in which a is processed first (preceding Eq. (9)) one needs to define f and p so that

p f_a1(t_a1) ≡ g_a1(t_a1) G_b1(t_a1).

The three "equal" signs (≡) simply indicate that this equation must be true for all values of t_a1. This turns out to be readily done. Thus, there is a serial model that can completely mimic response time predictions from any given independent parallel model. This shows us that response time measurements are not enough to prove that there is a unique model for the processes involved in a task. Fortunately, there are distributions for the serial model that make parallel mimicry impossible. The upshot here is that this serial class of models is more general than that of the parallel models—the parallel class is mathematically contained within the serial class. This result provides one potential avenue for assessing architecture: Try to determine from the experimental data and appropriate statistics if processing satisfies serial but not parallel processing. If parallel models pass the tests, then these particular tests cannot discriminate (for that task) serial versus parallel architectures.

The Math Beneath the Mimicry

Note that by integrating with respect to t_b2, Eq. (9) reduces to

p f_a1(t_a1) ≡ g_a1(t_a1) G_b1(t_a1) [first-stage processing]

f_b2(t_b2 | t_a1) ≡ g_b2(t_b2 | t_a1). [second-stage processing]

The same conclusions hold for Eq. (10) by integrating with respect to t_a2. This means that if there is intercompletion time equivalence, then there is total model equivalence.

Proposition 1. Given any within-stage independent parallel model there is always a serial model that is completely equivalent to it.

Proof. This proof generalizes to cases where there are more than two processing positions (Townsend 1976a). Consider the following within-stage independent parallel model:

g_{a1,b2}(t_a1, t_b2, <a,b>) = g_a1(t_a1) G_b1(t_a1) g_b2(t_b2 | t_a1)

and

g_{b1,a2}(t_b1, t_a2, <b,a>) = g_b1(t_b1) G_a1(t_b1) g_a2(t_a2 | t_b1),

where <a,b> denotes that a finishes before b. To show equivalence one needs to define f_a1, f_b1, f_a2, f_b2, and p for a serial model so that each stage of processing gives equivalent intercompletion time predictions.

As above, for second-stage processing, simply set

f_b2(t_b2 | t_a1) = g_b2(t_b2 | t_a1)

and

f_a2(t_a2 | t_b1) = g_a2(t_a2 | t_b1).

Now we focus on f_a1 and p. For equivalence, it is sufficient that

p f_a1(t) = g_a1(t) G_b1(t). (1)

Integrating with respect to t,

p = ∫_0^∞ g_a1(t) G_b1(t) dt.

Dividing Eq. (1) by p,

f_a1(t) = g_a1(t) G_b1(t) / p = g_a1(t) G_b1(t) / ∫_0^∞ g_a1(t) G_b1(t) dt.

The remaining density, f_b1(t), can be solved in the same way as above, using the equation

(1 − p) f_b1(t) = g_b1(t) G_a1(t) (2)

and the fact that

1 − p = 1 − ∫_0^∞ g_a1(t) G_b1(t) dt = ∫_0^∞ g_b1(t) G_a1(t) dt.

Thus, the serial model that mimics the parallel is given by:

p = ∫_0^∞ g_a1(t) G_b1(t) dt

f_a1(t) = g_a1(t) G_b1(t) / ∫_0^∞ g_a1(t) G_b1(t) dt

f_b1(t) = g_b1(t) G_a1(t) / ∫_0^∞ g_b1(t) G_a1(t) dt

f_b2(t_b2 | t_a1) = g_b2(t_b2 | t_a1)

and

f_a2(t_a2 | t_b1) = g_a2(t_a2 | t_b1).

A Simple but Convincing Example

Assume the exponential distribution for each position and channel in both a serial and a parallel model. Assume across-stage independence, too. We follow Townsend and Ashby's (1983) convention and use different parameter notations for the serial (u) and parallel (v) models. We will derive parameter mappings [p = f(v_a1, v_a2, v_b1, v_b2)] that leave the serial and parallel equations equivalent.

Then the density function for the serial model is

f_{a1,b2}(t_a1, t_b2; <a,b>) = p u_a1 exp(−u_a1 t_a1) u_b2 exp(−u_b2 t_b2)

and the distribution for the parallel model is

g_{a1,b2}(t_a1, t_b2; <a,b>) = v_a1 exp[−(v_a1 + v_b1) t_a1] v_b2 exp(−v_b2 t_b2).

Step 1: Set u_b2 = v_b2 and u_a2 = v_a2. Second-stage equivalence achieved.

Step 2: Suppose the order is <a,b> for serial processing. Then, computing conditional means for the first stage of processing,

E^s(T_1 | <a,b>) = 1/u_a1,

whereas for <b,a>,

E^s(T_1 | <b,a>) = 1/u_b1.

But for the parallel process the mean for the first stage of processing is

E^p(T_1 | <a,b>) = 1/(v_a1 + v_b1) = E^p(T_1 | <b,a>).

So, for equivalence,

u_a1 = u_b1.

Step 3: Now turn to

p = Probability that a is first in a serial model = P^s(<a,b>).

Recall that

P^p(<a,b>) = ∫_0^∞ g_a1(t) G_b1(t) dt = ∫_0^∞ v_a1 exp[−(v_a1 + v_b1)t] dt = v_a1/(v_a1 + v_b1).

So, set

p = v_a1/(v_a1 + v_b1).

In sum, we have guided ourselves to the following propositions.

Proposition 2. Given an across-stage independent and exponential serial model such that u_a1 ≠ u_b1, there is no exponential, within-stage independent, and across-stage independent parallel model that is equivalent to it.

Although a logically sound statement, it may carry too many assumptions on the processing densities to serve for practical application. Below is a more general theorem.

Proposition 3. Given a serial model, then if there exists a within-stage independent parallel model that is equivalent to it, it can be found by setting

G_a1(t) = exp{ − ∫_0^t p f_a1(s) / [p(1 − F_a1(s)) + (1 − p)(1 − F_b1(s))] ds }

G_b1(t) = exp{ − ∫_0^t (1 − p) f_b1(s) / [p(1 − F_a1(s)) + (1 − p)(1 − F_b1(s))] ds }

g_b2(t_b2 | t_a1) = f_b2(t_b2 | t_a1)

and

g_a2(t_a2 | t_b1) = f_a2(t_a2 | t_b1).

Conclusions

In the two examples listed, one can see how making assumptions about the model can yield overlapping predictions in response times and in the relationship with the number of channels (or items). These conclusions obviously sound a warning siren with regard to drawing hasty inferences from the traditional logic concerning the behavior of architectures. It would appear that the best one could achieve would be to posit several classes of models, for example both parallel and serial architectures, which may explain the data together. However, subsequent sections will reveal how our metatheoretical approach can lead to experimental designs that assay such characteristics at a more fundamental level. For even more generality in model mimicry, see Townsend and Ashby (1983, Chapter 14).

Theoretical Distinctions

We now turn our attention toward theoretical conditions and measures under which models are not equivalent. Careful probing of these conditions will guide us to experimental designs that can overcome the parallel-serial dilemma and reliably distinguish between information-processing systems in a broad range of empirical settings. Because questions of processing characteristics have been motivated largely by the domain of visual and memory search, they provide the most natural examples. However, the field has recently moved toward more general applicability, and this section appropriately culminates in a discussion of the Double Factorial Paradigm, which allows the experimenter to evaluate architecture, stopping rule, capacity, and stochastic independence in a single block of trials and which can be applied in a multitude of perceptual and cognitive tasks.

Architecture Distinctions based on Generality

serial systems generality based on degenerate mimicking

First, we will tie up some loose ends from the previous section by using the results of Proposition 3 to point out an example of a serial model that is unable to be mimicked by a parallel model. When considered alongside the fact that any parallel model has an equivalent serial model (Proposition 1), we arrive at our first fundamental distinction, which we will later refine: serial models can be more general than parallel models. Later, we will discover ways in which parallel models can be more general than serial models.

Suppose we start with a serial model and want to check whether it can be mimicked by a parallel model. Recall that the four equations in Proposition 3 determine the form of that parallel model if it exists. Now, if either G_a1(t) or G_b1(t) fails to be a true survivor function (for example, if lim_{t→∞} G(t) > 0), then the corresponding parallel model is deficient, and the serial model cannot be mimicked. A serial exponential model in which the first-stage rates are not equal (u_a1 ≠ u_b1) is perhaps the simplest case in which the impossibility of mimicry can occur (Townsend, 1972). Following this line of reasoning, Townsend (1976a) derived a set of necessary and sufficient conditions for the existence of well-defined survivor functions. Given that every quantity in these conditions is in principle derivable from reaction time data, a testable mechanism is provided for rejecting the possibility of parallel mimicry and confirming a true serial architecture.

Ross and Anderson (1981) were among the first to apply these conditions to empirical data, testing an assumption of Anderson's Adaptive Character of Thought (ACT) model that the spread of activation in memory search is parallel and independent. In the process of applying Townsend's conditions, they encountered and addressed several obstacles. The authors had to consider extending the theoretical results to account for the possibility of having a mixture or convolution of different reaction time densities (e.g., in hybrid models) and also find techniques to analyze the tail behavior of empirical distributions. They overcame these obstacles, and their data indicated that they could have been produced by a parallel system of the type envisioned by the ACT model. As we have noted, however, the data could not rule out serial processing; they just failed to reject parallel processing.

generality based on partial processing

So far, we have considered processing models, even dynamic models, as functions mapping stimuli onto RT and response probabilities. At this point in our development, we must enrich these models with the concept of a state space. Simply put, the state space of a dynamic system is the set of values that the system may obtain. A variety of models, including sequential-sampling and random-walk models, traditional counting models, and multi-stage activation models (Ratcliff and Smith 2004), suppose that as some cognitive process unfolds dynamically in time, evidence is gathered from each element a_i in the stimulus space, and a threshold c_i must be reached before the element a_i has been completely processed. The state space, therefore, constrains the possible values that the amount of evidence can take at any given time. It turns out that probing the continuum of accumulated evidence reflected in the state space can be very informative in distinguishing between architectures. In fact, it reverses the mimicry-based serial systems generality proven earlier, such that we can use it to reject serial systems and legitimately confirm parallelism.

To formalize this reasoning, we denote the amount of information that has been sampled from the element a_i at some point in time by γ_i(t). This function can be either discrete or continuous, as required by the phenomena being modeled. In a traditional discrete-stage model, γ_i(t) could represent, say, the number of "features" from the feature set that have been sampled from an alphanumeric character (see Figure 4.6 for an example from visual search). Here, the state space is finite, since each character has a finite number of features that can be processed. A Poisson process, on the other hand, operates over a countably infinite state space, such that the possible amount of evidence gathered from each item can be put in correspondence with the integers. Finally, a continuous (uncountably infinite) scale for γ_i(t) is familiar from connectionist models of cognition, such as McClelland and Rumelhart's (1981; see also McClelland, Rumelhart, & the PDP Research Group, 1986) interactive activation model, where it is commonly referred to as the activation value of a node. This is also generally the case in diffusion processes.

Fig. 4.6 Illustration of the underlying state space in visual search. Panels: Serial (left), Parallel (right). The two panels represent stimulus displays for which a participant is instructed to "find the S." Individual features of each letter are grey and dotted if they have not yet been processed. If processing is cut off at some point—for instance, if the display terminates—the participant may be left with some letters in a partial state of processing. In the left panel, the letter "E" is in a partial state. Notice that the serial processor must treat the letters one at a time, so there is at most one stimulus in a partial state. In the right panel, on the other hand, all four of the letters are in partial states of processing because the parallel processor has no such restriction.

If the state space contains more than one value, then we can, in principle, consider partial processing of elements (in the discrete, feature-based case, this corresponds to some but not all of the features in an object being processed at the end of the trial; see Figure 4.6). This helps us distinguish between architectures in the following way. A serial system can only sample features from a single element a_i at once, and only moves on to begin sampling from another element after completing the first. Thus, there can be at most one element in a partial state of completion (i.e., 0 < γ_i(t) < c_i) in a serial system, whereas a parallel system can have arbitrarily many elements in such a state. One way to distinguish a parallel processor from a serial processor in an empirical setting, therefore, is to observe the underlying state space while more than one item could be in a partial state of completion.

Townsend and Evans (1983) developed a full-report experiment based on this premise. They collected second guesses from participants and examined the pattern of accuracy. Each underlying state (e.g., "item 1 totally processed; item 2 partially processed") maps onto some accuracy pattern (e.g., "item 1 correct and item 2 incorrect on first guess; item 2 correct on second guess"). However, serial models cannot produce underlying states with more than one item partially processed, so the two models predict observably different distributions of accuracies. The authors tested the hypothesis that processing was parallel against a null hypothesis where it was serial. In the statistical analysis, the predictions were passed through two progressively stricter "sieves," and the serial null hypothesis was unable to be rejected. This work was later expanded by Van Zandt (1988) to demonstrate patterns of individual differences in parallel and serial processing. One individual may perform a task in a serial mode, whereas a different individual experiencing identical experimental conditions may perform in a parallel mode.

serial systems generality based on order

The next fundamental distinction is the way in which order of completion is selected (Townsend and Ashby 1983, Chapters 3 and 15). As a concrete context for discussion, consider a typical visual search task in which a participant is instructed to decide whether a particular target is present in a display with distractors (e.g., the letter H in an array of other letters). Suppose that this search were carried out serially. It is plausible, then, that the participants could causally direct their attention, choosing at each stage which item will be examined next. They could even decide on some search strategy a priori: "Start with the left-hand stimulus on the top of the display and scan from left to right, top to bottom." In any case, it is apparent that the order of processing is not affected at all by the rates of completion of the various items.

If this task were carried out in parallel, on the other hand, order of completion is entirely determined by the relative rates of the different items. If item a can be processed faster than item b, then the two items will be completed in the order <a, b> more frequently than in the order <b, a> simply by the stochastic nature of processing, and no a priori decision is able to affect this ordering.

parallel systems generality based on the identity of items or channels

To summarize our conclusions thus far, we have seen that the rate of processing at each stage in both serial and parallel models can depend upon the identity and order of previously processed items, but only in serial models can the rates potentially depend upon a predetermined order of future items. Parallel models are capable of a different kind of flexibility, however—dependency on the identity of items that have started, but not yet finished

features of response times 81

completion (Townsend, 1976b). Consider a visual search with three items, a, b, and c. All items are being processed simultaneously, so if item c is particularly inscrutable and takes more effort to process, its presence in the display might slow down the completion time of items a and b, even if a or b end up finishing first. This is behavior that cannot be mimicked by a serial model. Note also that not all parallel models necessarily have this property; it is an additional degree of freedom we can draw upon when constructing a parallel model to fit observed data. If a is the target in a self-terminating UCIP model, for example, processing times would be independent of the identity of c by definition. However, a limited capacity parallel model predicts the desired behavior.

parallel-serial testing paradigm

These simple observations about generality based on order and identity are the foundation of the early Parallel-Serial Tester (PST) paradigm (Snodgrass & Townsend, 1980; Townsend, 1976b; Townsend & Ashby, 1983, Chapter 13; Townsend & Snodgrass, 1974). The PST paradigm is built on three separate conditions of a simple matching experiment, in which a participant must search through a list of two items for targets. In Condition 1, the participant gives Response 1 (R1) if the target item appears on the left of the list and Response 2 (R2) if the target item appears on the right. Condition 2 is a simple AND task, where the target must appear in both positions for response R1, and Condition 3 is a simple OR task, where the target may appear in either position for response R1. Condition 1 is used to get a baseline measure of order effects, which is compared with the cases of Conditions 2 and 3. The processing time of an item under serial processing cannot depend upon the identity of other uncompleted items, so each intercompletion time that we measure empirically must be the processing time of a single item. Intercompletion times under parallel processing face no such constraint. Although the mathematical details and precise predictions for both models are fleshed out in Townsend and Ashby (1983, Chapter 13; see also Townsend, 1976b), the basic result for serial systems forces the sum of mean reaction times in the two possible orders of Condition 1 to equal the sum of mean reaction times in the redundant conditions. If the two sums violate this equality, we have empirical evidence for parallel processing.

This paradigm was used successfully by Neufeld and McCarty (1994) to investigate the effect of stress (e.g., periodic high-intensity sound) on performance in the three conditions described above, with letters Q, R, T, and Z as stimuli. Contrary to expectations, they found that the presence of a stressor made the system more likely to operate in parallel. In his dissertation, Vollick (1994) applied PST to the clinical setting. It was suggested that the cognitive impairments found in paranoid schizophrenics do not stem from architectural issues, as they process stimuli in parallel the same as healthy individuals, but rather from inefficient deployment of their processing capacity.

Stopping Rule Distinctions based on Set-Size Functions

We take a short break from the serial-parallel dilemma to consider ways to distinguish between exhaustive and self-terminating stopping rules. Since Sternberg's classic work (see Section 2), it has been common to test the stopping rule by examining slope differences in response times to different set sizes. Consider again the class of standard serial models, where the processing random variable is the same across all experimental variables such as processing position, location in a display, identity, and so on. As we recounted, if processing is exhaustive in such models, one predicts that the slopes of the target-present and target-absent lines will be equal, since every item must be checked regardless of whether the target is present. In addition, it is predicted that no display position effects on RT will be found in the data. If the participant is able to terminate the process as soon as the target is found, on the other hand, one predicts the mean RT on positive trials to be lower than on negative trials.

Townsend and Ashby (1983, Chapter 7) pointed out that if processing times can vary with display position, processing position, or identity (e.g., whether or not an item matches the probe), then exhaustive models could actually violate the above predictions. However, subsequent theoretical effort discovered that exhaustive systems are nonetheless extremely limited in how far they can deviate from those standard serial model predictions. A series of "impossibility" theorems (Townsend & Colonius, 1997; Townsend & Van Zandt, 1990) has shown that large classes of exhaustive models of any architecture are incapable of producing significant slope differences. Further, they also showed that the ability of such models to evince strong display position effects is severely

82 elementary cognitive mechanisms

delimited. Commensurate with the other results of this section, these results offer us mechanisms to reject the possibility of exhaustive processing and confirm self-termination. Van Zandt and Townsend (1993) give a broad review of empirical applications of this test, ultimately concluding that participants employ a self-terminating stopping rule whenever they can properly do so in an overwhelming majority of experiments.

Distinctions based on Reaction Time Distributions

The search for an empirically tractable way to distinguish between underlying cognitive processing systems reached a milestone with the development of an experimental protocol called the Double Factorial Paradigm (DFP; Eidels, Townsend, & Algom, 2010; Townsend & Nozawa, 1995; Wenger & Townsend, 2001), which yields a singularly powerful test not only of the serial-parallel distinction but also of stopping rule and capacity (as well as stochastic independence, though less directly). This paradigm rests within the theoretical framework of systems factorial technology, an extensive generalization of Sternberg's additive factors methodology (Ashby & Townsend, 1980; Townsend, 1984; Townsend & Ashby, 1983; see also Dzhafarov, 2003; Kujala & Dzhafarov, 2008; Schweickert, 1978; Schweickert, Fisher, & Sung, 2012; Schweickert & Townsend, 1989).

double factorial paradigm

The DFP entails two concurrent manipulations, each creating a factorial design (hence the double factorial paradigm). The first manipulation varies the number of presented targets (workload) in the visual search task posed. This present-absent manipulation is ideal for probing the capacity of the system. The second manipulation (note that the designation "first" and "second" does not imply logical or temporal order) pertains to the salience of the stimulus features. The salience manipulation is ideal for probing the serial-parallel distinction along with further aspects of processing at both the mean and distribution levels.

Consider a Stroop display in which the words RED and GREEN are each presented in red or green ink, and a trial consists of a single word displayed in a single color (Eidels et al., 2010; Melara & Algom, 2003; Stroop, 1935). Suppose further that (any kind of) "redness" is defined as the target, so that RED in red ink comprises a redundant-target display, RED in green ink and GREEN in red ink comprise one-target displays, and GREEN in green ink is a no-target display. Note that the display can contain two, one, or zero targets, so that the effect of redundant targets is tested, too. The presence-absence factorial design thus created (WORD: target-present [RED], target-absent [GREEN], crossed with ink color: target-present [red], target-absent [green]), depicted at the bottom of Figure 4.7, enables the use of the capacity coefficient (Townsend & Nozawa, 1995) as well as important RT inequalities (Grice, Canham, & Gwynne, 1984; Miller, 1982; Colonius & Vorberg, 1994; see Luce, 1986; Townsend & Eidels, 2011).

To carry out the salience manipulation in our example, the target word RED can appear in a highly legible or in a poorly legible font and, similarly, the target red ink color can appear in a focal or in an off-focal wavelength. The goal is to selectively speed up or slow down the processing of the specific feature, that is, to manipulate one channel without affecting the other. This second factorial design (word-target salience [high, low] × color-target salience [high, low]), depicted at the top of Figure 4.7, turns out to be highly diagnostic with respect to serial-parallel distinctions when paired with the mathematical machinery of the systems factorial technology framework described next.

mean interaction contrast

Consider the subset of trials in which both targets are presented. Hypothetical reaction time results for the four salience combinations are presented in Figure 4.8. Panel B depicts an additive outcome, implying no interaction across different levels of salience for the two channels. Panels A and C depict the two different species of interactions that might arise: A is overadditive, implying that processing is slow only when both channels are slow, and C is the underadditive species, implying that processing is fast only when both channels are fast. Each factorial plot can be summarized by a simple statistic, the double difference (the difference of the pair of differences between the two values defining each line), or Mean Interaction Contrast (MIC):

MIC = (RT_LL − RT_HL) − (RT_LH − RT_HH),

where RT is the mean RT and the subscripts L and H denote the low and high salience conditions, respectively. The mean interaction contrast is zero for an additive outcome, negative for an underadditive interaction,
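The MIC just defined is a simple double difference of cell means, so it can be computed directly from the four condition means. A minimal sketch in Python (the RT values are hypothetical, not taken from any reported experiment):

```python
def mean_interaction_contrast(rt_ll, rt_hl, rt_lh, rt_hh):
    """MIC = (RT_LL - RT_HL) - (RT_LH - RT_HH), computed from the mean RT
    of each salience cell (L = low, H = high; first letter = word channel,
    second letter = color channel)."""
    return (rt_ll - rt_hl) - (rt_lh - rt_hh)

# Hypothetical cell means in ms: MIC > 0 signals an overadditive interaction,
# MIC < 0 an underadditive one, and MIC = 0 an additive outcome.
print(mean_interaction_contrast(rt_ll=600.0, rt_hl=480.0,
                                rt_lh=500.0, rt_hh=430.0))  # 50.0, overadditive
```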

[Figure 4.7 here. Top panel: the salience factorial design, word salience (high, low) × color salience (high, low). Bottom panel: the presence-absence factorial design, word (target, distractor) × ink color (target, distractor).]

Fig. 4.7 Schematics of the double factorial paradigm (DFP) experiment.

and positive for an overadditive interaction. These factorial plots at the level of the mean are already diagnostic with respect to the question of serial versus parallel processing. The diagnosis is qualified by the stopping rule in force, indicating whether processing is minimum time self-terminating or exhaustive (Townsend, 1974). We will now derive MIC predictions for four different models summarized in Figure 4.9: parallel self-terminating, serial self-terminating, parallel exhaustive, and serial exhaustive.

First, assume a parallel architecture with a self-terminating stopping rule. If either channel is fed a strong signal, then that channel completes processing quickly, implying that the overall response will be fast. If both channels are fed weak signals, though (like a race of two weak and old horses against each other), even the faster one will take a long time. This gives rise to an overadditive interaction: MIC > 0.

If we assume a serial architecture with the same stopping rule, the total processing time is just the average time taken at each stage, assuming different processing orders are equally likely. Thus, when both channels are fed strong signals, the response is fast, and when both channels are low intensity, the response is slow. On mixed intensity trials, the faster channel is processed first half the time and the slower channel is processed first in the remaining half; hence, the overall RT is the mean of these two completion times. This is consistent with an additive outcome: MIC = 0.

In an exhaustive architecture, the system must await processing of both signals (which practically means awaiting completion of processing of the low-intensity signal) before a decision is made. This means that, in the parallel case, the response is fast only when both channels are high intensity, an underadditive interaction: MIC < 0. Applying the same logic to serial processing reveals that the architecture is still additive (although completion times differ from those attained with a self-terminating stopping rule). Put succinctly, MIC is always zero in serial processing, positive in parallel processing with a minimum time stopping rule, and negative in parallel processing within an exhaustive architecture.

survivor interaction contrast

These distinctions at the level of the mean are useful, but their extension to the distribution level provides further constraints and information. The four RT distributions (i.e., different combinations of target salience) can be sliced into small time bins and a factorial plot, similar to those in Figure 4.8, can be derived at each time bin. The resulting MICs
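The qualitative MIC predictions for the four architectures can be checked with a toy calculation. Purely for illustration, assume each channel's finishing time is exponentially distributed with a hypothetical mean of 300 ms at low salience and 100 ms at high salience. Closed-form expectations for the channel minimum (parallel minimum-time), channel maximum (parallel exhaustive), and channel sum (serial exhaustive; the equal-mixture serial self-terminating mean is likewise additive) then reproduce the predicted signs:

```python
LOW, HIGH = 300.0, 100.0  # hypothetical mean channel finishing times (ms)

def e_min(ma, mb):
    # E[min] of two independent exponentials with means ma and mb
    return ma * mb / (ma + mb)

def e_max(ma, mb):
    # E[max] = E[A] + E[B] - E[min]
    return ma + mb - e_min(ma, mb)

def e_sum(ma, mb):
    # Serial exhaustive: the two processing times simply add
    return ma + mb

def mic(expected_rt):
    cell = {code: expected_rt(*means) for code, means in
            {"LL": (LOW, LOW), "HL": (HIGH, LOW),
             "LH": (LOW, HIGH), "HH": (HIGH, HIGH)}.items()}
    return (cell["LL"] - cell["HL"]) - (cell["LH"] - cell["HH"])

print(mic(e_min))  # positive: parallel, minimum-time stopping rule
print(mic(e_max))  # negative: parallel, exhaustive
print(mic(e_sum))  # zero: serial
```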

[Figure 4.8 here. Three panels of mean RT (ms) as a function of word salience (low, high), with separate lines for low and high color salience: A, over-additive; B, additive; C, under-additive.]
Fig. 4.8 Three different outcomes of the Mean Interaction Contrast in a Stroop experiment.

can then be plotted as a function of time, t, for the entire distribution. The mathematical predictions become simpler when applying the momentary values of the respective survivor functions, S(t) (rather than the MICs or CDFs), and the resulting curve is dubbed the Survivor Interaction Contrast (SIC; Townsend & Nozawa, 1995). The SIC permits a distinction between different species of parallel models (e.g., race versus co-activation), and Townsend and Nozawa (1995) derived fully diagnostic functions for various combinations of architecture and stopping rule (summarized in Figure 4.9).

selective influence

Analysis at the distribution level can also provide supporting evidence on selective influence, a critical stipulation for derivations based on the factorial manipulations. Selective influence means that a given factor or manipulation affects only the intended process or channel. A distinct ordering of the survivor functions (for the four RT distributions) is predicted with the following condition in force:

S(LOW, LOW) > S(LOW, HIGH), S(HIGH, LOW) > S(HIGH, HIGH)

at all times t. Recall that the first factor is the salience of the word and the second factor is the salience of the color. Violation of this ordering compromises interpretations based on the interaction contrasts. One should note, though, that the presence of the predicted ordering does not prove the assumption of selective influence. It is a necessary but not sufficient condition (i.e., the same ordering could have been obtained even if selective influence were violated).

workload capacity

Finally, consider briefly the other leg of the DFP, the factorial manipulation on the number of targets presented. This design is well suited to probe the sensitivity of the system to changes in workload. Recall the definition of the capacity coefficient that we gave in the section Basic Issues Expressed Quantitatively, keeping in mind that the log survivor function is, up to sign, identical to the integrated hazard function, H(t) = −log S(t). The numerator in our Stroop task example comprises the redundant-target trials in which the word RED is written in red ink. In the denominator, we have the same functions estimated from trials in which each channel appears in isolation. So:

C_OR(t) = H(RED in red)(t) / [H(RED in green)(t) + H(GREEN in red)(t)].

Conceptually, this can be thought of as measuring the processing relationship between the "whole" in the numerator and the "sum of its parts" in the denominator.

We can think of the capacity coefficient as measuring the workload capacity relative to the benchmark UCIP model. Reiterating the predictions from the section Basic Issues Expressed Quantitatively, when channels are independent and parallel (as in the standard UCIP model), then the ratio is 1 and the system is unlimited capacity. When presenting two (multiple) targets concurrently slows the rate of processing, C(t) < 1, the system is limited capacity, which is less efficient than UCIP. Furthermore, if the system is limited to the extent
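The capacity coefficient can be estimated nonparametrically from raw RTs by replacing each integrated hazard with the negative log of an empirical survivor function. A rough sketch with simulated data (the naive survivor estimator and exponential RTs are illustrative assumptions; since the redundant condition is built here as an independent race, the estimate should hover near the UCIP benchmark of 1):

```python
import math
import random

def integrated_hazard(rts, t):
    """H(t) = -log S(t), with S(t) estimated as the proportion of RTs above t."""
    survivor = sum(rt > t for rt in rts) / len(rts)
    return -math.log(survivor)

def capacity_or(rt_redundant, rt_single_a, rt_single_b, t):
    """C_OR(t) = H_redundant(t) / (H_a(t) + H_b(t)); 1 matches the UCIP
    benchmark, below 1 is limited capacity, above 1 is supercapacity."""
    return integrated_hazard(rt_redundant, t) / (
        integrated_hazard(rt_single_a, t) + integrated_hazard(rt_single_b, t))

random.seed(1)
n, mean_rt = 20000, 300.0
word_only = [random.expovariate(1 / mean_rt) for _ in range(n)]   # RED in green
color_only = [random.expovariate(1 / mean_rt) for _ in range(n)]  # GREEN in red
# Redundant-target trials (RED in red): the faster of two independent channels
redundant = [min(random.expovariate(1 / mean_rt), random.expovariate(1 / mean_rt))
             for _ in range(n)]
print(capacity_or(redundant, word_only, color_only, t=250.0))  # close to 1
```

In real applications the single-target hazards come from the one-target display conditions, and C_OR(t) is examined across a range of t rather than at a single time point.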

[Figure 4.9 here. MIC and SIC signatures for the four models: mean RT (ms) plotted against salience (L, H) and SIC curves plotted against time, for the parallel minimum-time, parallel exhaustive, serial minimum-time, and serial exhaustive architectures.]

Fig. 4.9 Each combination of architecture and stopping rule has an MIC and SIC signature, as described in the text. See Townsend and Nozawa (1995) for derivations.

that performance with double targets is worse than with the faster of the two individual targets, or, in terms of the capacity coefficient,

C_OR(t) ≤ log{MIN[S_A(t), S_B(t)]} / log[S_A(t) · S_B(t)],

then the Grice inequality is violated (Townsend & Eidels, 2011). In this case, the system is of severely limited capacity.

Conversely, if the presence of two targets speeds up the ability of the system to process each target, C(t) > 1, the system is supercapacity, which is more efficient than UCIP. Note that, when (target) signals are presented simultaneously to two channels, detection is usually faster than when a single signal is presented in one channel. This redundant target effect can derive from mere statistical facilitation (the minimum of two random variables has a smaller mean even than that of the faster of the individual random variables alone). However, if the channels interact, a larger redundant target effect can be expected. If the former is the case, the Miller- or race-model inequality (see the section Basic Issues Expressed Quantitatively) must hold:

F(RED in red)(t) ≤ F(RED in green)(t) + F(GREEN in red)(t),

where F denotes the RT CDFs for the respective redundant target and the two single target conditions. When this race-model inequality is violated, traditional thinking has been that all parallel race models are thereby falsified (also observe that satisfying it does not mean that the model is necessarily parallel). However, it is straightforward to construct parallel race models exhibiting super workload capacity that readily violate Miller's race model bound (for examples, see Townsend & Wenger, 2004a). Such models can be created through mutual channel facilitation. It has been suggested that configural perception (e.g., words,

faces, other Gestalts) may be explained by such parallel channel facilitation (e.g., Eidels, Townsend, & Pomerantz, 2008; Wenger & Townsend, 2001, 2006; but see Eidels et al., 2010, and Eidels, 2012).

In their original development, Townsend and Nozawa (1995) showed that the Miller and Grice inequalities can be recast as statements about the capacity of the system. When both hold, the system capacity falls between those extremes and, thus, could be of moderately limited capacity, unlimited capacity, or moderately supercapacity. If the Grice inequality is violated, the system is of severely limited capacity. If Miller's inequality is violated, the system is of supercapacity (at that t). In a related development, Colonius (1990) showed earlier that if the marginal probabilities F(RED in red)(t) and F(GREEN in red)(t) are invariant from the single target to the redundant target conditions, then the Miller and Grice inequalities correspond to maximum negative and positive dependence between parallel channels (see also Van Zandt, 2002). Subsequently, Townsend and Wenger (2004) showed that, interestingly, actual dynamic parallel systems whose channels interact by assisting one another (e.g., an increase in information in one channel leads to an increase in other channels) typically do not produce invariant probabilities, do produce supercapacity, and may violate Miller's inequality (in apparent, but not real, contradiction of Colonius' mathematically impeccable theorems). Conversely, typical dynamic parallel systems with mutually inhibiting channels evidence negative correlations, again cause failure of marginal invariance, and effect strong slow-downs of processing and possible violations of the Grice inequality.

Cognitive-Psychological Complementarity

How well or poorly have these developments been incorporated into mainstream cognitive psychology? Without attempting an exhaustive analysis of the vast volume of research in current cognitive science, it is fair to say that the influence on substantive theorizing and experimentation of the mathematical developments has been uneven at best. Although early work on parallel and serial processing by Estes, Murdock, Schweickert, Dzhafarov, Egeth, Bernstein, Biederman, and Townsend (among others) was well accepted in mainstream cognitive psychology of the time (early 1970s) and had a sobering influence on the field, this work was subsequently ignored and superseded by the following groundless logic:

If, in a search task, the mean RT is linearly related to the number of stimuli and this function has a positive slope, then serial processing is implied.

Theories based on this untenable statement engulfed the field to the extent that mathematical proofs and violations have been completely ignored. It is only during the past decade that cognitive psychology has finally overcome this detour from logic and mathematical rigor.

Ignoring Parallel-Serial Mimicry: The Case of a Linear RT-Set Size Function with a Positive Slope

Treisman's celebrated work on feature integration theory (e.g., Treisman & Gelade, 1980; Treisman & Gormican, 1988; Treisman & Schmidt, 1982) can serve as a convenient point of departure. This work suggests that when searching for a target that differs from nontargets in terms of a single conspicuous feature (e.g., color, orientation, or shape), the number of elements in the display matters little (feature search). However, when the target is defined in terms of a conjunction of features (such as a red vertical line among red tilted lines and green vertical lines), search time increases linearly with the number of elements in the display (conjunction search). In the theory, the main diagnostic tool to tell the two types of search apart is the slope of the respective RT-set size functions. The steep slopes obtained with conjunction targets are interpreted to implicate serial search, whereas the much shallower slopes with the one-dimension feature targets are interpreted to implicate parallel search.

Feature integration theory has had a tremendous impact on attention research: as of the printing of this chapter, those three articles alone combine for close to 10,000 citations in the literature. And this theory is mentioned as a major accomplishment in Treisman's achievement of the highest scientific honor the United States can offer, the coveted National Medal of Science. The associated burgeoning literature helped to uncover valuable aspects of the cognitive processes engaged when people search for a target (whether in a cluttered computer screen or a crowded airport terminal). A less salutary outcome of this trend has been neglect of the possibility of mimicry. Many investigators have ignored proof that a putatively serial (mean) RT function can be mimicked by a parallel one and vice versa. For a trivial yet telling example, Townsend (1971a) was referenced by Treisman and Gelade (1980), but the citation concerned the makeup of the stimuli,

not the assay of parallel versus serial processing! Generations of cognitive psychologists appear to have been rendered oblivious to the developments in mathematical psychology on the importance and (im)possibility of distinguishing between parallel and serial processing based on straight-line mean RT functions (cf. Townsend, 1971a, 1990a; Townsend & Wenger, 2004b).

The difference between feature search and conjunction search impacted ensuing research to the extent that quite severe violations of the original pattern of results and conclusions, undermining basic tenets of the theory, were largely overlooked. Consider the study by Pashler (1987) for an instructive example. In relatively small displays of up to eight items, Pashler found the same slope for target-present and target-absent trials. This result is consistent with a serial exhaustive search rather than with the serial self-terminating search suggested by the theory. In a further experiment, Pashler added a second target on some portion of the trials and found a redundancy gain: the mean RT was faster when the display included two targets than when there was a single target. Redundancy gain is incompatible with both exhaustive and self-terminating varieties of serial models (cf. Egeth & Mordkoff, 1991; see also Egeth, Virzi, & Garbart, 1984).

These findings themselves disconfirm the theory, but the fact remains that they (along with a fair number of similar results) generated little traction at the time. The bright side to the story, though, is the increased use of the redundant target heuristics. From a modest start in the late 1960s (e.g., Bernstein, Blaken, Randolph, & Hughes, 1968; Egeth, 1966), this type of experimental design evolved into a major tool not only for investigations into visual search but for uncovering aspects of elementary cognitive processes in general. Nevertheless, outside of the work of a few investigators such as H. Egeth and colleagues and our own research, the visual search literature and that focusing on redundancy have unfortunately been largely nonintersecting.

Egeth and Mordkoff (1991) used the redundant target design in tandem with Miller's race model inequality as a further means of theoretical resolution. They concluded that the large violations of the inequality found were incompatible with any species of serial processing (and with certain varieties of parallel processing).

In another interesting interrogation, Pashler and Badgio (1985) included trials in which all items were visually degraded and found the effects of set size and degradation to be additive. The additive pattern clearly refutes models of serial identification. A conceptually similar study (indeed, complementary to that by Pashler & Badgio, 1985; cf. Pashler, 1998) was conducted by Egeth and Dagenbach (1991). In their study, the observers searched two-element displays in which each item could be visually degraded independently of the other element. The authors found a subadditive pattern, confirming again a parallel process of letter identification.

Townsend and Nozawa (1995) investigated the redundant target paradigm along with a range of RT inequalities in a more general context, developing measures for the identification of different cognitive architectures. Survivor function interaction contrasts and processing capacity play key roles in this effort. In particular, Townsend and Nozawa showed that Miller's inequality (among other RT inequalities) is actually a statement about the capacity of the process under test. What these developments demonstrate is the futility of drawing strong conclusions based on any simple RT (detection) function, if for no other reason than the brutal reality that many if not all such functions can be mathematically mimicked (e.g., Dzhafarov, 1993, 1997). A broader angle of attack is needed, one guided by a system of theorems and associated tools (the SFT proved serviceable in that role) within the framework of which absolute or mean RTs or density functions serve as points of departure. In this respect, Townsend and Wenger (2004a) generalized the earlier results to include conjunctive, rather than only disjunctive, decisions and illustrated their findings within the large class of interacting-channels, parallel, linear stochastic systems.

A very selective review of the major findings in the field of speeded visual search during the past three decades reveals that a wealth of stimulus properties (spatial distribution of the items, target-distractor similarity, stimulus discriminability or task difficulty, practice, common shape and/or semantic category, and even particular attributes such as form and color) have increasingly replaced the number of stimuli (set size) as the variable of interest. The dichotomy of efficient versus inefficient search has been gradually superseding the dichotomy of parallel versus serial processing.

This course comprises a rather mixed bag. On the one hand, it reflects the growing recognition by cognitive scientists of the pertinent mathematical developments. In this respect, it took some 20 years

for psychologists to finally conclude that models are needed that move "beyond Treisman's original proposal that conjunction search always operate serially" (Pashler, 1998, p. 143). On the other hand, the same course also reflects a tendency of moving away from the issue of parallel versus serial processing altogether.

This is unfortunate because, for all the difficulties involved, the issue is tractable and it is consequential for a wide range of cognitive processes. What we need is a naturally emerging integration of a given RT model and a certain cognitive theory. RT data, especially those embedded within a larger system, provide rich information about cognitive processing. Nevertheless, given the prevalence of mathematical and statistical equivalence, RTs, even when sustained by explicit models, will not always be diagnostic (cf. Van Zandt, 2002). It is at this juncture that the need for substantive theory becomes pellucid. Van Zandt (2002, p. 506) concludes that it is, therefore, "very important that RT analyses be conducted in the context of . . . mechanistic . . . explanations of the process under study."

Speeded visual search continues to fascinate investigators because it is such a ubiquitous human activity (from locating your baggage on the conveyor belt in the terminal to picking up your favorite Cheerios in the crowded aisle of the grocery store to finding your article in the list of those appearing in the journal). The theory proposed and periodically revised by Wolfe (1994; Cave & Wolfe, 1990), Guided Search, suggests that all kinds of searches (whether feature or conjunction) involve two consecutive stages. The first stage entails the simultaneous activation of all potential target features. Activity in the second stage is guided by the outcome of the first (i.e., the distribution of activations of the various features), testing serially combinations of activated features until one matches the target. The theory entails the notions of parallel and serial processing, but envisages situations in which either one can become gratuitous. Incorporating further flexible features, the theory is able to account reasonably well for a broad range of data.

Another influential approach implicates similarity as the major determinant of search (Duncan & Humphreys, 1989). A little noticed aspect of the original Treisman experiments is that each nontarget shares a feature with the conjunction target (hence is similar to the target), whereas in feature search each nontarget is different from the target. Duncan and Humphreys showed that search is easy for a distinctive target on the background of relatively uniform distractors but is difficult on the background of highly diverse distractors. More recently, Ben-David and Algom (2009; see also Fific, Townsend, & Eidels, 2008) applied the machinery of SFT to uncover the influence of species of target and distractor similarity and sameness (physical, nominal, semantic) on various aspects of the architecture of the underlying process.

The additive factors method itself has been incorporated into mainstream cognitive research to the extent that, more often than not, Sternberg is no longer even referenced. Of the multitude of studies, the sustained program of research by Besner and his associates (see Risko et al., 2010, for a recent contribution) stands out for the methodic application of the additive factors method to probe reading processes. For example, in the study by Borowsky and Besner (1993), context or meaning was found to interact with word frequency, on the one hand, and with stimulus quality, on the other hand, yet the latter two factors were additive. The pattern of joint effects was accommodated by a multistage activation model. Nonetheless, it might be well worth employing the kinds of strategies outlined herein to falsify or accommodate the various types of models in a nonparametric fashion.

When discussing models, we should address (but space does not allow us to truly address) the issue of the degree to which processing across different stages is discrete or in cascade. That is, we conceive of processing on different items or subsystems as occurring in a sequential manner but possibly overlapping in time. These are often referred to as continuous flow systems. Taylor (1976) was one of the first to proceed to a quantitative analysis of such models, but others soon followed (e.g., McClelland, 1979; Miller, 1988). Let us just note that McClelland (1979) sanctioned the use of additive factor methodology to identify separable stages of processing, and that, separately, Ashby and Townsend (1980), Ashby (1982), and Roberts and Sternberg (1993), too, have demonstrated that purely cascaded models can produce additive effects on the mean RTs (provided certain boundary conditions are respected; see O'Malley & Besner, 2008). Schweickert and Mounts (1998) studied and made predictions from a quite general class of continuous flow systems. The issues are quite complicated and the interested reader should consult Logan (2002) on the broad distinction between discrete and continuous processing. More general and robust

features of response times 89

metatheoretical effort is required to experimentally and effectively segregate such systems from ordinary parallel and serial systems.

Extending the Metatheory to Encompass Accuracy

The great bulk of the theory enveloped in this chapter has featured response times. However, certain (included) theory-driven experimental designs were based on accuracy, such as the paradigms utilizing second-guesses. Recall that the second-guess strategy exploited the fundamental ability of parallel systems to represent many objects (items, features, channels) in partial states of completion, as opposed to strict serial systems being confined to a single object in a partial state of completion. Compared with factorial strategies, these latter techniques have been woefully underused but, as noted, have seen some renewed activity recently. There are some other directions that should be broached.

Extending Capacity Theory to Include Accuracy: Moving Beyond Simple Speed-Accuracy Tradeoff

The motivation to extend our capacity theory to include accuracy is twofold. Foremost, the more observable variables there are to constrain models and theories, the better. Additionally, measures that can simultaneously gauge and combine speed and accuracy can address important questions that neither alone is able to do. A specific manifestation of the relation of speed to accuracy arose in the 1960s and was called the speed-accuracy tradeoff (e.g., Pew, 1969; Pachella, 1974; Swanson & Briggs, 1969; Yellott, 1971). The idea here is that one must be wary when observing, say, a speed-up in one's data and drawing perceptual or cognitive conclusions. The reason is that the error rate may have increased, perhaps reflecting an alteration of a decision criterion rather than an improvement in cognitive efficiency.

It has been obligatory in cognitive psychology ever since, when either response times or accuracy changes, to check how the other is varying. For instance, if the experimenter increases workload in a task and errors increase, she/he makes sure that response times stay the same or increase. It is then concluded that there is no speed-accuracy trade-off. This inference is unwarranted. Consider the following possibility: The workload is harder and errors increase, but the participants have nonetheless also increased their decision criterion a modest amount, though not enough to offset the increase in errors. So, there has indeed been a speed-accuracy trade-off in the sense that even higher inaccuracy would have occurred had the participants not altered their criterion. This kind of subtlety requires a quantitative approach to be adjudicated.

Our tactic has been to extend the response-time based workload-capacity function developed earlier (see the Basic Issues Expressed Quantitatively and Theoretical Distinctions sections) to include accuracy (Townsend & Altieri, 2012). Detail is ruled out in this chapter, but the basic trick is to work out the predictions for the standard parallel class of models that are themselves enlarged to generate errors. In addition, for most of the speed-accuracy combinations, the value-loaded term capacity is inappropriate. For instance, is it higher capacity and, therefore, "better" to be fast and inaccurate or slow and accurate? For such reasons, it was necessary to introduce value-free terminology, in this case, the term assessment function, called A(t). Then the assessment functions are assembled, as was the traditional statistical function, in a distribution-free and nonparametric manner. A simplified, symbolic formula is

A(t) = Pobs(speed is fast and error occurs) / Ppar(speed is fast and error occurs),

where obs = observed from data and par = theoretical prediction on the basis of the standard parallel model.

Furthermore, either the numerator or denominator can be decomposed into separate accuracy and conditional response time elements. Thus, Pobs(speed is fast and error occurs) = Pobs(speed is fast | error occurs) · Pobs(error occurs). Then comparison of the observed and predicted quantities can aid in a number of useful theoretical inferences. For instance, initial analysis of an AND condition carried out by Ami Eidels (personal communication; see Townsend & Altieri, 2012) shows that the above A(t) > 1, indicating that the observed joint event of error-plus-speed, in terms of response times and inaccuracy, greatly exceeded that expected from the standard unlimited-capacity independent parallel model. Furthermore, error-plus-slow tended to move in the other direction; that is, toward the A(t) = 1 line. Next, the correct-plus-speed A(t) < 1 by a massive degree, whereas that for correct-plus-slow was a bit higher but still

90 elementary cognitive mechanisms

quite low. All this is predicted by parallel models which are limited capacity.

We next decomposed the statistics as just described and discovered that inaccuracy was considerably greater than that predicted by the standard model. This aspect then contributed to this specific A(t) > 1. However, it also transpired that, given that an error occurred, the likelihood of a fast response was also greater than that expected from the standard prediction. In fact, it appeared that, overall, speed was greater than expected whether conditioned on correct or incorrect performance, although most impressive when incorrect.

Some tentative interpretations of these results, as well as those for an OR condition, were offered in Townsend and Altieri (2012). However, it is simply the case that we know very little about how even parameterized models will reflect limited-capacity effects when both accuracy and response times are analyzed. For instance, suppose, as is usually assumed, that changes in difficulty, when accomplished within trial blocks, indicate only changes in correct and incorrect processing speeds, not any change in decision criteria. The truth is, we don't know to what extent an increase in errors, as in the AND data above, exhibits lower or higher (as earlier) speeds than would be predicted by the standard parallel model, all on the basis of only changes in processing rates.

We anticipate that analysis of traditional models of response times and accuracy, such as parallel diffusions and races as well as serial models, plus examination of many data sets, will show the way to a considerably enhanced understanding of speed and accuracy. As a single example of this type of progress, we mention a recent expansion of the accuracy-oriented general recognition theory (Ashby & Townsend, 1986) to stochastic parallel systems, which thereby includes response times as well as the patterns of confusion (Townsend, Houpt, & Silbert, 2012).

Model Mimicking in Psychological Science

Psychological, cognitive, and brain sciences are, outside the most ludicrous parody of vitalism, black box enterprises in which hypotheses about inner workings must be made and tested. The brain sciences are included here due to the brain's amazing complexity. Even if we were without society's ethical and moral strictures (and a good thing we are not), the brain's machinery is so mysterious that even with behavioral and neuroscientific approaches together, we are a little like a bunch of educated squirrels watching people drive, then poking about under the automobile's hood at night and drawing profound squirrel-science inferences about auto design, functioning, and maintenance. Nonetheless, the resources of mathematical modeling, neuroscientific knowledge and techniques, and excellent behavioral and neuropsychological experimental designs offer the best we can hope for.

Before moving on, and as stated in our introduction, all the warnings about pitfalls associated with various aspects of mathematical modeling pale in comparison with verbal theorizing. Fields such as electrical engineering have long possessed rigorous quantitative bodies of knowledge; we could call them metatheories, of how to infer the internal mechanisms and dynamics from observable behavior. One of the most elegant of these is that associated with deterministic finite-state automata. Of course, when the number of states becomes infinite or random aspects intrude, things get more complicated. Yet even these accommodate mathematically rigorous and applicable methodologies. Naturally, the obverse side of the coin of "degree of uniqueness" of a prospective description of an observed system is the "class of mimicking models" (using our terminology). Even within the class of finite-state automata, however much data is collected, there will always be an equivalence class of machines able to predict said data. If the average graduate student in psychology were a little better prepared in mathematics, at least one course in such a topic would provide a beneficial and sobering message regarding the challenges facing them in their careers.

The fact that even such diametrically opposed concepts as are embodied in parallel versus serial processing systems can readily be mathematically equivalent within common and popular paradigms should, along with the implicit forewarnings from other sciences, lead an incipient science like psychology to emphasize the study of such challenges in training its students and in planning and carrying out its own research programs. Alas, that prescription does not seem likely to eventuate in the foreseeable future. However, mathematical psychology can strive to better train its own people in these matters, and conduct its own research accordingly.

It should be evident that the use of metatheory to experimentally segregate large classes of models (e.g., all parallel models) within a certain domain comes up short with regard to specifying a highly

precise and detailed computational account. We have, therefore, proposed that researchers adopt a kind of successive sieve approach, where finer-grain models are probed at each step. Thus, after determining, say, serial processing in a memory task, one might then begin to assess certain particular process distributions. This approach is complementary to Platt's strong inference tactic, but not the same. Next, it is important to ponder the different echelons at which model and theory mimicking can take place, as we see in the following subsection.

Species of Mimicry

First, there is mathematical equivalence, such as we have discovered over many decades (e.g., Townsend 1972, 1976a; Townsend & Ashby, 1983; Townsend & Wenger, 2004b; Williams, Eidels, & Townsend, 2014). Little had been accomplished in terms of model identification in psychology outside of seminal work in mathematical learning theory by Greeno and Steiner (1968). Their rigorous efforts explored identifiability issues in learning and memory theories based on Markov chains. However, within the realm of closed-form proofs of mathematical equivalence, the work of Dzhafarov should be mentioned. Consider the quite general class of all models based on a race of two or more parallel channels. The Grice models (e.g., Grice et al., 1984) are members of this class. They place all the variance in the decision bounds. Other models, such as the Smith and Vickers (1988; see also Vickers, 1979) accumulator model or the race models of Townsend and Ashby (1983) or Smith and Van Zandt (2000), place the variance in the state space of the channel activations. Dzhafarov (1993) proved that, in the absence of the assumption of specific distributions, these classes are mathematically equivalent within the usual experimental designs. As we have seen, with a little luck and lots of hard work, one may aspire to employ the very metatheory used to demonstrate model mimicking to aid in the design of experiments that test the model classes at deeper levels.

Second, there is incomplete but exact mathematical mimicking to consider. For instance, a class of models might mimic a nonequivalent type of model at, say, the level of the first and/or second moments (i.e., mean and variance) but not be completely equivalent. A case in point is the prediction of mean response time additivity by standard serial models. As one could expect, this prediction can be made by a huge class of alternative models, as proven by Townsend (1990b). The constraints put on the mimicking model class are extraordinarily weak. Just including variances helps a lot but, of course, does not totally remediate the problem (e.g., Schneider & Shiffrin, 1977; Townsend & Ashby 1983, chapters 6, 7).

Third, there is mimicking by approximation. Though perhaps not so intriguing as mathematical equivalence, it is more widespread than the latter and at least as underappreciated by the average researcher. An example of this type of mimicking in the present venue is the ability of sequential, but not strictly serial, continuous flow dynamic models to predict approximate additivity of mean response times (e.g., McClelland 1979; see also Schweickert et al., 2012, chapter 6).

It is likely that the third type of mimicking is that which threatens the bulk of model testing in the literature. Ordinarily, two, sometimes more, parameterized models are compared in their ability to fit or predict numerical data. Now, if psychology possessed a level of precision of measurement even remotely close to that in, say, physics or many areas of chemistry, then this policy might be quite optimal. Why test entire classes of models when you can move directly to the precise model, with all its exact formulas, estimate parameters, and thus be done?

Unfortunately, the tolerance of psychological data is much too coarse for such a hope to be realized. Some help is afforded by way of comparison of models against one another, rather than simply fitting one's favorite model to the data. However, even here, the possibility exists that one or the other model will simply fail to fit the data, even as well as another model, due only to the specific quantitative formulation of the psychological precepts, rather than the fundamental characteristics of the latter. For instance, consider an investigator who has correctly pinpointed the proper architecture, stopping rule, and so on for a task but failed to employ the valid associated stochastic process. Thus, perhaps the architecture is standard parallel, with each channel being described by a race between the correct and incorrect alternatives, the race distributions being gamma. Our hapless investigator inappropriately has selected Weibull distributions to describe each channel's race. Meanwhile, a theoretical competitor might produce an incorrect specification of the architecture (e.g., serial), but employ a set of stochastic processes that adventitiously provide a superior fit.
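The first species, mathematical equivalence, can be made concrete with a small Monte Carlo sketch. The code below is our own illustration in the spirit of the parallel-serial equivalence proofs cited above, not code from the chapter: a serial exhaustive model, with two items each processed at exponential rate v, is pitted against a fixed-capacity parallel exhaustive model whose total rate v is split across channels and reallocated to the surviving channel once the first finishes.

```python
import random

random.seed(1)
N = 200_000
v = 1.0  # total processing rate; an arbitrary illustrative value

def serial_exhaustive():
    # Items processed one after the other; each stage exponential with rate v.
    return random.expovariate(v) + random.expovariate(v)

def parallel_fixed_capacity():
    # Fixed total capacity v split across two channels (rate v/2 each).
    # When the first channel finishes, the survivor receives the full
    # capacity: its rate doubles, halving its remaining completion time.
    a = random.expovariate(v / 2)
    b = random.expovariate(v / 2)
    first, second = min(a, b), max(a, b)
    return first + (second - first) / 2

serial_rts = [serial_exhaustive() for _ in range(N)]
parallel_rts = [parallel_fixed_capacity() for _ in range(N)]

mean_serial = sum(serial_rts) / N
mean_parallel = sum(parallel_rts) / N

def ecdf(sample, t):
    # Empirical cumulative distribution function at time t.
    return sum(rt <= t for rt in sample) / len(sample)

print(f"mean RT: serial {mean_serial:.3f} vs parallel {mean_parallel:.3f}")
for t in (1.0, 2.0, 3.0):
    print(f"F({t:.0f}): serial {ecdf(serial_rts, t):.3f} "
          f"vs parallel {ecdf(parallel_rts, t):.3f}")
```

Both mechanisms generate the same Gamma(2, v) distribution of exhaustive completion times, so their means and empirical cdfs agree to within sampling error; no amount of response time data from this design alone can separate the two architectures, which is precisely the predicament the metatheory is built to address.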

Another vital aspect of model testing in cognitive science has always been the issue of whether one model is merely more complex than another and thus can reach sectors of the data space that are unavailable to its competitors. Attempts to ameliorate this challenge have long depended on the assumption that if two models possess equal numbers of parameters, they must be of equal complexity. This is rarely the case, as has been recognized for some time. Indeed, we have shown that large classes of models of classification (e.g., identification, categorization) can be much more falsifiable than other classes, though they possess a huge number of parameters (Townsend & Landon, 1983). A special case of some interest is the well-known similarity choice model (Luce, 1963), which is much more flexible in its model-fitting ability (it is often referred to as the champion model of human pattern recognition) than competitors such as the overlap model (Townsend, 1971b), though the number of parameters is identical.

A deep and vital antidote, when it can be brought to bear, is the quantification of model flexibility to fit data, through cutting-edge theories of complexity (Myung & Pitt, 2009). This approach is able to quantify a model's complexity and compensate for it in model comparisons. The main obstacle here is that, so far, only models with certain types of pellucid specification are subject to this analysis, at least in realistic computational terms. Nonetheless, as computers' powers continue to augment, this strain of technology offers hope for the future of complex psychological science.

We have had little to say, beyond mere platitudes, concerning aspirations of merging principled quantitative modeling with cognitive and sensory neuroscience and especially neuroimaging. Some extremely credible senior psychological commentators are frankly skeptical of the contributive power of neuroscience, and in particular neuroimaging, to the cognitive sciences. We urge serious reflection on the observations of William Uttal (e.g., 2001) in this vein. Though we do not fully agree with his inferences, we are convinced that his astute scrutiny can only serve to improve the science and its associated methodologies.

The past several decades have seen a vibrant growth of modeling extending from basic psychophysics to higher mental processes to complex social phenomena. It seems inevitable that such multifaceted models, even when found to predict or fit data in an accurate manner, will be vulnerable to a broad spectrum of competitive, mimicking alternative conceptions. We believe that fields endeavoring to explain or predict human behavior, including those based on neurophysiology, will progress faster by developing appropriate metatheories of mimicking, and of how best to circumvent these experimentally.

Box 2 Model Mimicking in Psychology

From one vantage point, it might seem that model and theory mimicking reside in a fairly small and technical dominion of scientific psychology. Yet, in a broad sense, mimicking is ubiquitous. Every time two theoretical explanations are in contention, it is because the data at hand are not decisively in favor of one over the other. That is, mimicking at one or more levels is occurring, and it is up to the champions of either approach, or "innocent bystanders", to invent new observations to resolve the issue. Within the realm of verbalized theory, what often happens is that the two theoretical structures evolve, due to ministrations from their advocates, in order to conform to the latest data. The end result is that the theories may end up still handling the larger corpus of data, each being significantly more complex (and thus more difficult to falsify), and yet remaining empirically indistinguishable. A final famous case in point is the decades-long clash of the more behavioristic theory of Clark Hull versus the more cognitive theory of Edwin Tolman. Though fundamentally distinct in their theoretical foundations, their long-lasting struggle must be said to have eventually faded away inconclusively. Mathematical modeling and the explicit study of model mimicking seem a promising remedy for such ailments.

Conclusions and the Future

• The resurgence in the 1950s and 1960s of research attempting to identify mental mechanisms, and its continuation today, prove that cognitive psychology can be cumulative as well as scientific.

• Mathematical models have made substantive ingress into cognitive psychology since 1950 and made many contributions to rigorous theory building and testing.

• Nonetheless, even in the realm of mathematical modeling, mimicry of one model or

class of models by other models poses a formidable challenge to science building in this complex arena.

• Metatheory is, in our approach, a theoretical and quantitative enterprise that attempts to formulate highly general mathematical characterizations of psychological notions in such a way as to point toward development of robust experimental methodologies for systems identification.

• The reviewed body of research on metatheory has led to redoubtable methodologies for assessing various strategic aspects of elementary cognition in a manner that is resistant to model mimicking. Most of the new theory and technology is founded on distribution- and parameter-free theorems and methods.

• As of now, the metatheory and associated methodologies are largely segregated into those resting on RTs as observable variables and those relying on patterns of accuracy as observable variables.

• A major goal for the immediate future is to create a unified theory that merges both RTs and accuracy. First steps have been taken in that direction.

• Very little is known about the perils from model mimicking to incremental science in more complex spheres of cognitive science. It may be that such theoretical research will be indispensable to future methodologies, if cognitive science is to avoid devolvement into a maundering, largely inconclusive field.

Notes

1. Of course, in-between cases are often used, for instance, that of the highly popular single-target-among-distractors. In such a design, the processor may cease as soon as the target is located.

2. If tA, tB are independent random variables, then the previous statement will hold true. However, if we assign distinct random variables to the actual processing times in parallel and serial systems, then very broad questions can be asked and answered with regard to vital parallel-serial mimicking issues (Townsend & Ashby 1983, chapter 1).

3. From Eq. 1 we can also derive the pdf, f(t), of an independent race model with two parallel channels (i.e., the probability that channel A or channel B finishes at time t), which is fAB(t) = fA(t) · SB(t) + SA(t) · fB(t).

4. Stochastic serial models are a bit more complex, since one needs to take into account the order in which processes occur (e.g., channel A before B or vice versa). Full treatment is given in Townsend and Ashby (1983).

Glossary

Across-stage independence: Assumes the independence of intercompletion times in serial models. It is defined as the property that the probability density function of two or more stages of processing is the product of the component single-stage density functions.

Capacity coefficient: COR(t) = log[SAB(t)] / log[SA(t) · SB(t)], is a measure of processing efficiency as workload (the number of signals to process) increases. C(t) = 1 indicates unlimited capacity; performance is identical to that of a benchmark UCIP model (see later). C(t) < 1 and C(t) > 1 indicate limited and super-capacity, respectively. COR(t) is appropriate for OR tasks, while a comparable index, CAND(t), exists for the AND case, with a different formula but similar interpretation.

Cumulative distribution function (cdf): F(t) = p(T ≤ t), gives the probability that the process of interest is finished before or at time t.

Deterministic process: Always yields a fixed result, such that the effect or phenomenon we observe has no variability.

Exponential distribution: A probability distribution that describes the time between events in a Poisson process. It is very useful in response time modelling, and has the form f(t) = v·e^(−vt), where v is the rate parameter. It also has the "memory-less" property, meaning that the likelihood of an event occurring in the next instant of time is independent of how much time has already passed.

Grice inequality: FAB(t) ≥ MAX[FA(t), FB(t)]. This inequality states that performance on double-target trials, FAB(t), should be faster than (or at least as fast as) that in the faster of the single-target channels. If this inequality is violated, the simultaneous processing of two target-signals is highly inefficient and the system is very limited capacity. For instance, if FA(t) = FB(t) then COR(t) < 1/2. The special case when COR(t) = 1/2 is referred to as fixed capacity.

Intercompletion time: The time required for a stage of processing to be completed. In a serial model, the intercompletion times are just the processing times.

Mean Interaction Contrast (MIC): A test statistic for the interaction between two factors with two levels each, which allows assessment of architecture and stopping rule from mean RTs. Calculated as the difference between differences of mean RTs in a factorial experiment, MIC = (RTLL − RTHL) − (RTLH − RTHH), where RT is the mean RT and L and H denote low and high salience conditions, respectively. Because all stopping rules for serial models predict that MIC = 0, they cannot be distinguished for serial models by MIC.

Probability density function (pdf): f(t) = p(T = t), gives the likelihood that some process that takes random time T to complete will actually be finished at time t.

Race model (Miller's) inequality: FAB(t) ≤ FA(t) + FB(t). This inequality states that the cumulative distribution function for double-target displays, FAB(t), cannot exceed the sum of the single-target cumulative distribution functions if processing is a race between parallel channels, with the added constraint that the marginal distributions for A and B do not change from when one of the two channels is

presented to when both are presented. This stipulation is known as context invariance. When the upper bound implied in the inequality is violated, capacity must be super; that is, COR(t) > 1.

"Stage" of processing: The time from one item finishing processing to the next item finishing processing.

Stochastic independence: Two events are independent if the occurrence of one does not affect the probability of the other. This concept can be expressed in terms of the joint pdfs, fAB(tA, tB) = fA(tA) · fB(tB), which means that the joint density of processes A and B finishing is equal to the product of the probability of A finishing at time tA and the probability of B finishing at time tB.

Stochastic process: The events cannot be characterized by fixed values and should be represented by a random variable. A random variable does not have a single, fixed value but rather takes a set of possible values, with their likelihood characterized by a probability distribution.

Survivor function: S(t) = 1 − F(t) = p(T > t). This function is the complement of the cdf, and tells us the probability that the process of interest has not yet finished by time t.

Survivor Interaction Contrast [SIC(t)]: Same as MIC but calculated for survivor functions, S(t), rather than mean RTs, at each time bin t: SIC(t) = [SLL(t) − SHL(t)] − [SLH(t) − SHH(t)]. The SIC(t) functions predict distinctive curves for serial and parallel models under various stopping rules.

UCIP model: A processing model characterized by Unlimited Capacity and Independent Parallel processing channels.

Within-stage independence: The statistical independence of intercompletion times across two or more parallel channels in the same stage.

References

Ashby, F. G. (1982). Deriving exact predictions from the cascade model. Psychological Review, 89, 599–607.
Ashby, F. G., & Townsend, J. T. (1980). Decomposing the reaction time distribution: Pure insertion and selective influence revisited. Journal of Mathematical Psychology, 21, 93–123.
Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154–179.
Atkinson, R. C., Holmgren, J. E., & Juola, J. F. (1969). Processing time as influenced by the number of elements in a visual display. Perception & Psychophysics, 6, 321–326.
Baddeley, A. D., & Ecob, J. R. (1973). Reaction time and short term memory: Implications of repetition effects for the high-speed exhaustive scan hypothesis. Quarterly Journal of Experimental Psychology, 25, 229–240.
Bechtel, W., & Richardson, R. C. (1998). Vitalism. In E. Craig (Ed.), Encyclopedia of philosophy. London, England: Routledge.
Ben-David, B. M., & Algom, D. (2009). Species of redundancy in visual target detection. Journal of Experimental Psychology: Human Perception & Performance, 35, 958–976.
Bernstein, I. H. (1970). Can we see and hear at the same time? Some recent studies of the intersensory facilitation of reaction time. Acta Psychologica, 33, 21–35.
Bernstein, I. H., Blake, R., Randolph, R., & Hughes, M. H. (1968). Effects of time and event uncertainty upon sequential information processing. Perception & Psychophysics, 3, 177–184.
Blaha, L. M., & Townsend, J. T. (2006, May). Parts to wholes: Configural learning fundamentally changes the visual information processing system. Vision Sciences Society Annual Meeting, Sarasota, Florida.
Borowsky, R., & Besner, D. (1993). Visual word recognition: A multistage activation model. Journal of Experimental Psychology: Learning, Memory, & Cognition, 19, 813–840.
Cave, K. R., & Wolfe, J. M. (1990). Modeling the role of parallel processing in visual search. Cognitive Psychology, 22, 225–271.
Colonius, H. (1990). Possibly dependent probability summation of reaction time. Journal of Mathematical Psychology, 34, 253–275.
Colonius, H., & Townsend, J. T. (1997). Activation-state representation of models for the redundant signals effect. In A. A. J. Marley (Ed.), Choice, decision, and measurement: Essays in honor of R. Duncan Luce. Mahwah, NJ: Erlbaum.
Colonius, H., & Vorberg, D. (1994). Distribution inequalities for parallel models with unlimited capacity. Journal of Mathematical Psychology, 38, 35–58.
Corcoran, D. W. J. (1971). Pattern recognition. Middlesex, England: Penguin.
Diederich, A., & Colonius, H. (1991). A further test of the superposition model for the redundant-signals effect in bimodal detection. Perception & Psychophysics, 50, 83–86.
Donders, F. C. (1868). Die Schnelligkeit psychischer Processe. Archiv für Anatomie und Physiologie und wissenschaftliche Medizin, 657–681.
Duncan, J., & Humphreys, G. W. (1989). Visual search and stimulus similarity. Psychological Review, 96, 433–458.
Dzhafarov, E. N. (1993). Grice representability of response time distribution families. Psychometrika, 58, 281–314.
Dzhafarov, E. N. (1997). Process representations and decompositions of response times. In A. A. J. Marley (Ed.), Choice, decision, and measurement: Essays in honor of R. Duncan Luce (pp. 255–278). Mahwah, NJ: Erlbaum.
Dzhafarov, E. N. (2003). Selective influence through conditional independence. Psychometrika, 68, 7–25.
Egeth, H. E. (1966). Parallel versus serial processes in multidimensional stimulus discrimination. Perception & Psychophysics, 1, 245–252.
Egeth, H. E., & Dagenbach, D. (1991). Parallel versus serial processing in visual search: Further evidence from subadditive effects of visual quality. Journal of Experimental Psychology: Human Perception & Performance, 17, 551–560.
Egeth, H. E., Virzi, R. A., & Garbart, H. (1984). Searching for conjunctively defined targets. Journal of Experimental Psychology: Human Perception & Performance, 10, 32–39.
Egeth, H. E., & Mordkoff, J. T. (1991). Redundancy gain revisited: Evidence for parallel processing of separable dimensions. In J. Pomerantz & G. Lockhead (Eds.), The perception of structure (pp. 131–143). Washington, DC: APA.

Eidels, A. (2012). Independent race of colour and word can predict the Stroop effect. Australian Journal of Psychology, 64, 189–198.
Eidels, A., Townsend, J. T., & Algom, D. (2010). Comparing perception of Stroop stimuli in focused versus divided attention paradigms: Evidence for dramatic processing differences. Cognition, 114, 129–150.
Eidels, A., Houpt, J. W., Altieri, N., Pei, L., & Townsend, J. T. (2011). Nice guys finish fast and bad guys finish last: Facilitatory vs. inhibitory interaction in parallel systems. Journal of Mathematical Psychology, 55, 176–190.
Eidels, A., Townsend, J. T., & Pomerantz, J. R. (2008). Where similarity beats redundancy: The importance of context, higher order similarity, and response assignment. Journal of Experimental Psychology: Human Perception & Performance, 34, 1441–1463.
Estes, W. K., & Taylor, H. A. (1966). Visual detection in relation to display size and redundancy of critical elements. Perception & Psychophysics, 1, 9–16.
Fancher, R. E. (1990). Pioneers of psychology. New York, NY: Norton.
Fific, M., Townsend, J. T., & Eidels, A. (2008). Studying visual search using systems factorial methodology with target–distractor similarity as the factor. Perception & Psychophysics, 70, 583–603.
Greeno, J. G., & Steiner, T. E. (1968). Comments on "Markovian processes with identifiable states": General considerations and applications to all-or-none learning. Psychometrika, 33(2), 169–172.
Grice, G. R., Canham, L., & Gwynne, J. W. (1984). Absence of a redundant-signals effect in a reaction time task with divided attention. Attention, Perception, & Psychophysics, 36(6), 565–570.
Hawking, S. W. (1988). A brief history of time. London, England: Bantam.
James, W. (1890). Principles of psychology (2 vols.). New York, NY: Dover.
Johnson, D. M. (1955). The psychology of thought and judgment. New York, NY: Harper.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.
Kujala, J. V., & Dzhafarov, E. N. (2008). Testing for selectivity in the dependence of random variables on external factors. Journal of Mathematical Psychology, 52(2), 128–144.
Külpe, O. (1895). Grundriss der Psychologie [Outline of psychology] (E. B. Titchener, Trans.). New York, NY: Macmillan.
Lachman, R., Lachman, J. L., & Butterfield, E. C. (1979). Cognitive psychology and information processing: An introduction. Hillsdale, NJ: Erlbaum.
Laming, D. (1968). Information theory of choice-reaction times. London, England: Academic.
Logan, G. D. (2002). Parallel and serial processing. In J. Wixted (Ed.), Stevens' handbook of experimental psychology (Vol. 4, pp. 271–300). New York, NY: Wiley.
Luce, R. D. (1963). A threshold theory for simple detection experiments. Psychological Review, 70, 61–79.
Luce, R. D. (1986). Response times. New York, NY: Oxford University Press.
Marx, M. H., & Cronan-Hillix, W. A. (1987). Systems and theories in psychology. New York, NY: McGraw-Hill.
McClelland, J. L. (1979). On the time relations of mental processes: An examination of systems of processes in cascade. Psychological Review, 86, 287–330.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: I. An account of basic findings. Psychological Review, 88(5), 375.
McClelland, J. L., Rumelhart, D. E., & the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. II). Cambridge, MA: MIT Press.
Melara, R. D., & Algom, D. (2003). Driven by information: A tectonic theory of Stroop effects. Psychological Review, 110(3), 422–471.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81–97.
Miller, J. O. (1978). Multidimensional same-different judgments: Evidence against independent comparisons of dimensions. Journal of Experimental Psychology: Human Perception & Performance, 4, 411–422.
Miller, J. O. (1982). Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology, 14, 247–279.
Miller, J. O. (1988). Discrete and continuous models of information processing: Theoretical distinctions and empirical results. Acta Psychologica, 67, 191–257.
Murdock, B. B. (1971). Four channel effects in short-term memory. Psychonomic Science, 24, 197–198.
Murray, D. J. (1988). A history of western psychology. Englewood Cliffs, NJ: Prentice Hall.
Myung, J. I., & Pitt, M. A. (2009). Optimal experimental design for model discrimination. Psychological Review, 116, 499–518.
Nickerson, R. S. (1966). Response times with a memory-dependent decision task. Journal of Experimental Psychology, 72(5), 761–769. doi:10.1037/h0023788
Neufeld, R. W., & McCarty, T. S. (1994). A formal analysis of stressor and stress-proneness effects on simple information processing. British Journal of Mathematical and Statistical Psychology, 47(2), 193–226.
Nietzsche, F. (1873). Über Wahrheit und Lüge im außermoralischen Sinn [On truth and lies in an extra-moral sense]. In The birth of tragedy and other writings. Cambridge, England: Cambridge University Press, 1999.
O'Malley, S., & Besner, D. (2008). Reading aloud: Qualitative differences in the relation between stimulus quality and word frequency as a function of context. Journal of Experimental Psychology: Learning, Memory, & Cognition, 34, 1400–1411.
Pachella, R. (1974). The interpretation of reaction time in information processing research. In B. H. Kantowitz (Ed.), Human information processing: Tutorials in performance and cognition. Hillsdale, NJ: Erlbaum.
Pashler, H. (1987). Detecting conjunctions of color and form: Reassessing the serial search hypothesis. Perception & Psychophysics, 41, 191–201.

96 elementary cognitive mechanisms Pashler, H. (1998). The psychology of attention. Cambridge, MA: Smith, P. L., & Vickers, D. (1988). The accumulator MIT Press. model of two-choice discrimination. Journal of Mathematical Pashler, H., & Badgio, P. C. (1985). Visual attention and Psychology, 32, 135–168. stimulus identification. Journal of Experimental Psychology: Snodgrass, J. G., Luce, R. D., & Galanter, E. (1967). Some Human Perception & Performance, 11, 105–121. experiments on simple and choice reaction time. Journal of Pew, R. W. (1969). The speed-accuracy operating characteristic. Experimental Psychology, 75, 1–17. Attention and Performance II. Amsterdam, Netherlands: Snodgrass, J. G., & Townsend, J. T. (1980). Comparing serial North-Holland, 1969. and parallel models: Theory and implementation. Journal of Rakover, S. (2007). To Understand a Cat: Methodology and Experimental Psychology: Human Perception & Performance,6, Philosophy. Amsterdam, Netherlands: John Benjamins. 330–354. Ratcliff, R. A. (1978). A theory of memory retrieval. Psychological Sperling, G. (1960). The information available in brief visual Review, 85, 59–108. presentations. Psychological Monographs, 74, 1–29. Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential Sternberg, S. (1966). High speed scanning in human memory. sampling models for two-choice reaction time. Psychological Science, 153, 652–654. Review, 111(2), 333. Sternberg, S. (1969a). The discovery of processing stages: Risko, E. F., Stolz, J. A., & Besner, D. (2010). Spatial attention Extensions of Donders’ method. In W. D. Koster (Ed.), modulates feature cross talk in visual word processing Attention and performance II. Acta Psychologica, 30, 276–315. Attention Perception & Psychophysics, 72, 989–998. Sternberg, S. (1969b). Memory scanning: Mental processes Roberts, S., & Sternberg, S. (1993). The meaning of additive revealed by reaction-time experiments. 
American Scientist, reaction-time effects: Tests of three alternatives. In D. E. 57, 421–457. (Reprinted in J. S. Antrobus (Ed.), Cognition Meyer and S. Kornblum (Eds.) Attention and performance and Affect (pp. 13–58). Boston: Little, Brown). XIV: Synergies in experimental psychology, artificial intelligence, Strayer, D. L., & Johnston, W. A. (2001). Driven to distraction: and cognitive neuroscience (pp. 611–653). Cambridge, MA: Dual-task studies of simulated driving and conversing on a MIT Press. cellular telephone. Psychological science, 12, 462–466. Ross, B. H., & Anderson, J. R. (1981). A test of parallel Stroop, J. R. (1935). Studies of interference in serial verbal versus serial processing applied to memory retrieval. Journal reactions. Journal of experimental psychology, 18(6), 643. of Mathematical Psychology, 24(3), 183–223. Swanson, J. M. & Briggs, J. E. (1969). Information processing Schneider, W., & R. M. Shiffrin. (1977). Controlled and as a function of speed vs. accuracy. Journal of Experimental automatic human information processing: 1. Detection, Psychology, 81, 223–289. search, and attention. Psychological Review, 84, 1–66 Taylor, D. A. (1976). Stage analysis of reaction time. Psychological Schwarz, W. (1994). Diffusion, superposition, and the Bulletin, 83, 161–191. redundant-targets effect. Journal of Mathematical Psychology, Theios, J., Smith, P.G., Haviland, S. E., Traupmann, J., & Moy, 38, 504–520. M. C. (1973). Memory scanning as a serial self-terminating Schweickert, R. (1978). A critical path generalization of the process. Journal of Experimental Psychology, 97, 323–336. additive factor method: Analysis of a Stroop task. Journal of Titchener, E. B.(1905). Experimental psychology.NewYork,NY: Mathematical Psychology, 18(2), 105–139. Macmillan. Schweickert, R. (1982). Scheduling decisions in critical path Townsend, J. T. (1969). Mock parallel and serial models and networks of mental processes. 
Paper presented in the meeting experimental detection of these. Purdue centennial symposium of the Society for Mathematical Society. on information processing. Purdue, IN: Purdue University Schweickert, R., Fisher, D.L., & Sung, K. (2012). Discovering Press. by selectively influencing mental processes. Townsend, J. T. (1971a). A note on the identifiability of parallel Advanced Series on Mathematical Psychology. Singapore: and serial processes. Perception & Psychophysics, 10, 161–163. World Scientific. Townsend, J. T. (1971b). Theoretical analysis of an alphabetic Schweickert, R., & Mounts, J. R. W. (1998). Additive effects of confusion matrix. Perception & Psychophysics, 9, 40–50. factors on reaction time and evoked potentials in continuous Townsend, J. T. (1972). Some results concerning the identi- flow models. In C. Dowling, F. Roberts, and P. Theunes fiability of parallel and serial processes. British Journal of (Eds.) Recent progress in mathematical psychology (pp. 311– Mathematical and Statistical Psychology, 25, 168–199. 327). Mahwah, NJ: Erlbaum. Townsend, J. T. (1974). Issues and models concerning the Schweickert, R., & Townsend, J. T. (1989). A tri- processing of a finite number of inputs. In B. H. Kantowitz chotomy method: Interactions of factors prolonging sequen- (Ed.), Human information processing: Tutorials in performance tial and concurrent mental processes in stochastic PERT and cognition (pp. 133–168). Hillsdale, NJ: Erlbaum. networks. Journal of Mathematical Psychology, 33, 328–347. Townsend, J. T. (1976a). Serial and within-stage independent Smith, G. A. (1977). Studies of compatibility and models of parallel model equivalence on the minimum completion choice reaction time. In S. Dornic (Ed.), Attention and time. Journal of Mathematical Psychology, 14, 219–238. performance VI, Hillsdale, NJ: Erlbaum. Townsend, J. T. (1976b). A stochastic theory of matching Smith, P. L., & Van Zandt, T. (2000). Time-dependent Poisson processes. 
Journal of Mathematical Psychology, 14, 1–52. counter models of response latency in simple judgment. Townsend, J. T. (1984). Uncovering mental processes with British Journal of Mathematical and Statistical Psychology, 53, factorial experiments. Journal of Mathematical Psychology, 28, 293–315. 363–400.

features of response times 97 Townsend, J. T. (1990a). Serial vs. parallel processing: Some- Townsend, J. T., & Wenger, M. J. (2004b). The serial parallel times they look like tweedledum and tweedledee but they dilemma: A in a linkage of theory and method. can (and should) be distinguished. Psychological Science, 1, Psychological Bulletin& Review, 11, 391–418. 46–54. Treisman, A. M., & Gelade, G. (1980). A Feature-integration Townsend, J. T. (1990b). A potpourri of ingredients for Horse theory of attention. Cognitive Psychology, 12, 97–136. (Race) Soup. (Technical Report #32). Cognitive Science Treisman, A. M., & Gormican, S. (1988). Feature analysis in Program. Bloomington, Bloomington, IN. early vision: Evidence from search asymmetries. Psychological Townsend, J. T., & Altieri, N. (2012). An accuracy–response Review, 95, 15–48. time capacity assessment function that measures performance Treisman, A. M., & Schmidt, H. (1982). Illusory conjunctions against standard parallel predictions. Psychological Review, in the perception of objects. Cognitive Psychology, 14, 107– 119(3), 500–516. 141. Townsend, J. T., & Ashby, F. G. (1978). Methods of modeling Uttal, W. R. (2001). The new phrenology.Cambridge,MA:MIT capacity in simple processing systems. In J. Castellan and Press. F. Restle (Eds.), Cognitive Theory Vol. III (pp. 200–239). van der Heiden, A. H. C. (1975). Some evidence for limited Hillsdale, NJ: Erlbaum. capacity parallel self-terminating process in simple visual Townsend, J. T., & Ashby, F. G. (1983). The stochastic modeling search tasks. Acta Psychologica, 39, 21–41. of elementary psychological processes. Cambridge, England: Van Zandt, T. (1988). Testing Serial and Parallel Processing Cambridge University. Hypotheses in Visual Whole Report Experiments. (Master’s Townsend, J. T., & Colonius, H. (1997). Parallel processing thesis). Indiana University response times and experimental determination of the Van Zandt, T. (2002). 
Analysis of response time distributions. In stopping rule. Journal of Mathematical Psychology, 41, 392– J. Wixted (Ed.), Stevens’ handbook of experimental psychology, 397. (Vol 4, pp. 461–516). New York, NY: Wiley. Townsend, J.T., & Eidels, A., (2011). Workload capacity Van Zandt, T.,& Townsend, J. T. (1993). Self-terminating vs. spaces: A unified methodology for response time measures exhaustive processes in rapid visual, and memory search: of efficiency as workload is varied. Psychonomic Bulletin & An evaluative review. Perception & Psychophysics, 53(5) Review, 18, 659–681. 563–580. Townsend, J. T., & Evans, R. (1983). A systems approach Vickers, D. (1979). Decision processes in visual perception. New to parallel-serial testability and visual feature processing. In York, NY: Academic. H. G. Geissler (Ed.), Modern Issues in Perception (pp. 166– Vollick, D. N. (1994). Stochastic models of encoding-latency 189). Berlin: VEB Deutscher Verlag der Wissenschaften. means and variances in paranoid schizophrenia. (Doctoral Townsend, J. T., & Honey, C. J. (2007). Consequences of dissertation). Retrieved from ProQuest Dissertations and base time for redundant signals experiments. Journal of Theses. (Accession Order No. NN93247). University of Mathematical Psychology, 51, 242–265. Western Ontario, Canada. Townsend, J. T., Houpt, J. & Silbert, N. H. (2012). General Watson, J. M., & Strayer, D. L. (2010). Supertaskers: Profiles recognition theory extended to include response times: Pre- in extraordinary multitasking ability. Psychonomic Bulletin & dictions for a class of parallel systems. Journal of Mathematical Review, 17, 479–485. Psychology. Available at: http://dx.doi.org/10.1016/j.jmp. Welford, A. T. (1980). (Ed.), Reaction Times. London: Academic. 2012.09.001 Wenger, M. J.,& Townsend, J. T. (2001). Computational, Townsend, J. T., & Landon, D. E. (1983). 
Mathematical models Geometric, and Process Issues in Facial Cognition: Progress and of recognition and confusion in psychology. International Challenges. Mahwah, NJ: Erlbaum. Journal of Mathematical Social Sciences, 4, 25–71. Wenger, M. J.,& Townsend, J. T. (2006). On the costs and Townsend, J. T., & Nozawa, G. (1995). Spatio-temporal benefits of faces and words: Process characteristics of feature properties of elementary perception: An investigation of search in highly meaningful stimuli. Journal of Experi- parallel, serial and coactive theories. Journal of Mathematical mental Psychology: Human Perception & Performance, 32, Psychology, 39, 321–360. 755–779. Townsend, J. T., & Snodgrass, J. G. (1974). A serial vs. parallel Williams, P., Eidels, A., & Townsend, J. T. (2014). The testing paradigm when “same” and “different” comparison resurrection of Tweedledum and Tweedledee: Bimodality rates differ. Paper presented to Psychonomic Society, Boston, cannot distinguish serial and parallel processes. Psychonomic MA. Bulletin & Review. In press. Townsend, J. T., & Van Zandt, T. (1990). New theoretical results Wolfe, J. M. (1994). Guided search 2.0: A revised model of on testing self-terminating vs. exhaustive processing in rapid visual search. Psychonomic Bulletin & Review, 1, 202–238. search experiments. In H. G. Geissler, M. H. Müller, and Woodworth, R. S. (1938). Experimental Psychology.NewYork: W. Prinz (Eds), Psychophysical explorations of mental structures. Holt. Stuttgart: Hogrefe and Huber Publishers. Wundt, W. (1892). Die geschwindigkeit des gedankens (The Townsend, J. T., & Wenger, M. J. (2004a). A theory of velocity of thought).Die Gartenlaube, 26, 263–265. interactive parallel processing: New capacity measures and Yellott, J. I. (1971). Correction for fast guessing and the predictions for a response time inequality series. Psychological speed-accuracy trade-off in choice reaction time. Journal of Review, 111, 1003–1035. Mathematical Psychology, 8, 159–199.

CHAPTER 5

Computational Reinforcement Learning

Todd M. Gureckis and Bradley C. Love

Abstract

Reinforcement learning (RL) refers to the scientific study of how animals and machines adapt their behavior in order to maximize reward. The history of RL research can be traced to early work in psychology on instrumental learning behavior. However, the modern field of RL is a highly interdisciplinary area that lies at the intersection of ideas in computer science, machine learning, psychology, and neuroscience. This chapter summarizes the key mathematical ideas underlying this field, including the exploration/exploitation dilemma, temporal-difference (TD) learning, Q-learning, and model-based versus model-free learning. In addition, a broad survey of open questions in psychology and neuroscience is provided.

Key Words: reinforcement learning, explore/exploit dilemma, dynamic decision-making

Introduction

There are few general laws of behavior, but one may be that humans and other animals tend to repeat behaviors that have led to positive outcomes in the past and avoid those associated with punishment or pain. Such tendencies are on display in the behavior of young children, who learn to avoid touching hot stoves following a painful burn but behave in school when rewarded with toys. This basic principle exerts such a powerful influence on behavior that it manifests throughout our culture and laws. Behaviors that society wants to discourage are tied to punishment (e.g., prison time, fines, consumption taxes), whereas behaviors society condones are tied to positive outcomes (e.g., tax credits for fuel-efficient cars).

The scientific study of how animals use experience to adapt their behavior in order to maximize rewards is known as reinforcement learning (RL). Reinforcement learning differs from other types of learning behavior of interest to psychologists (e.g., unsupervised learning, supervised learning) since it deals with learning from feedback that is largely evaluative rather than corrective. A restaurant diner doesn't necessarily learn that eating at a particular business is "wrong," simply that the experience was less than exquisite. This particular aspect of RL – learning from evaluative rather than corrective feedback – makes it a particularly rich domain for studying how people adapt their behavior based on experience.

The history of RL can be traced to early work in behavioral psychology (Thorndike, 1911; Skinner, 1938). However, the modern field of RL is a highly interdisciplinary area at the crossroads of computer science, machine learning, psychology, and neuroscience. In particular, contemporary research on RL is characterized by detailed behavioral models that make predictions across a wide range of circumstances, as well as neuroscience findings that have linked aspects of these models to particular neural substrates. In many ways, RL today stands as one of the major triumphs of cognitive science in that it offers an integrated theory of behavior at the computational, algorithmic, and implementational (i.e., neural) levels (Marr, 1982).

The purpose of this chapter is to provide a general overview of RL and to illustrate how this approach is used to understand decision-making and learning behavior in humans and

other animals. We begin in the following section with a historical perspective, tracing the roots of contemporary reinforcement-learning research to early work on learning behavior in animals. In the section A Computational Perspective on Reinforcement Learning we introduce a general computational formalism for understanding RL, based largely on the work of Sutton & Barto (1998). Along the way we discuss some of the critical aspects of RL, including the trade-off between exploration and exploitation, credit assignment, and error-driven learning. The section on Neural Correlates of RL focuses on the neural basis of RL. The section Contemporary Issues in RL Research describes a variety of contemporary research questions as well as open areas for future investigation.

A Historical Perspective

The early roots of RL research can be traced to the work of Thorndike (1911). Thorndike studied learning behavior in animals, most notably cats. In perhaps his most well-known experiment, he placed a cat into a specially designed "puzzle box," which could be opened from the inside via various latches, strings, or other mechanisms (see Figure 5.1). The cat was encouraged to escape the box by the presentation of food on the outside of the box. The key issue for Thorndike was how long it would take the cat to escape.

Once placed in this situation, the cat usually began experimenting with different ways to escape (pressing levers, pulling cords, pawing at the door). After some time and effort, it would stumble on the particular mechanism that opened the cage. This process was repeated across a number of trials, each time recording the total time until the cat escaped. Over time, the cats would learn that particular actions (e.g., pressing a lever, or pulling a cord) would lead to the desired outcome (food) and engage in this behavior more rapidly, avoiding actions that had previously been unsuccessful at opening the box. In some sense, the correct behavior to escape was selected out of the full behavioral repertoire of the cat, whereas others were eliminated.

There are several key features of Thorndike's experiments that are central to RL, which we discuss throughout the chapter. The first is that successful escape from the puzzle box requires sufficient exploration of alternative actions. If the cat repeatedly tries the same unsuccessful action, escape is hopeless. On the other hand, after the cat escapes, it becomes better at engaging in less exploration and exploiting previously successful actions so that escape is faster. Of course, dropping off exploration too quickly could cause the cat to miss alternative means of escape that are less effortful or time consuming. Thus, effective behavior depends on a delicate balance between exploration and exploitation of options (see sections Balancing Exploration and Exploitation—A Computational View and Varieties of Exploration in Humans).

A second feature of Thorndike's experiments is that the goal (escaping the box to get the food) is a complex, multi-action sequence. When the cat first succeeds in escaping, this raises the question of credit assignment: which of the multiple actions the animal might have tried were responsible for the escape? This problem can be especially difficult in a sequential decision-making setting because an action taken at one point in time may only have the desired effect some time later. For example, perhaps the cat pulls a string at one point in time,

Fig. 5.1 Left: An illustration of Thorndike's puzzle box experiments. Right: The time to escape the box (in seconds) is reduced over repeated trials as the cat becomes more efficient at selecting the actions that lead to escape.

presses a lever, and then pulls the string a bit harder, which opens the box. On the next trial, should the lever be pressed or should the string be pulled? Contemporary research in computational RL provides a computational solution to how agents might solve this credit assignment problem (see the section A Computational Perspective on Reinforcement Learning).

Reinforcement learning is also closely related to the idea of instrumental conditioning and the operant conditioning paradigm pioneered by psychologists in the behaviorist tradition (e.g., Skinner, 1938). Operant conditioning represents a refinement of the basic experimental design utilized by Thorndike. In a typical experiment, a rat might be placed in an isolated chamber with a lever that can be depressed. The experimenter records the frequency with which the rat presses the lever and at various points in time provides reward (e.g., food pellets) or punishment (e.g., electric shocks). The main question of interest is how the voluntary behavior of the animal is modified by various reward or punishment schedules. There continue to be a variety of theories of operant conditioning, but all share a basic view that, through learning, associations are strengthened (or weakened in the case of punishment) between elements that follow one another in time.

Another major historical influence on modern RL was the work of Tolman (1948). Prior to Tolman's work, psychologists largely viewed animal behavior as a product of associative stimulus-response (S-R) learning and chaining of basic S-R behaviors. Tolman (1948) argued that many aspects of rodent behavior seemed to contradict this basic model. For example, Tolman showed that in maze tasks, rats could quickly re-route a path in a maze around a trained route that was experimentally blocked with an obstacle. This, he argued, could not be accomplished on the basis of pure stimulus-response learning, since the relevant stimulus-response pairings for the new routes were never directly experienced by the rat. Instead, it appeared that rats use a "cognitive map" of the maze that allowed them to flexibly plan goal-directed sequences of behavior. Although Tolman's work is often seen as antithetical to basic principles of conditioning, modern RL approaches have directly explored the distinction between more reactive, associative forms of reinforcement learning and more cognitive, planning-based approaches (a distinction referred to as model-based versus model-free RL; see section Model-Based versus Model-Free Learning). Box 1 describes common experimental approaches to studying RL.

A Computational Perspective on Reinforcement Learning

The ability to adaptively make decisions that avoid punishment and maximize rewards is a core feature of intelligent behavior. Computational RL (CRL) is a theoretical framework for understanding how agents (both natural and artificial) might learn to make such decisions. CRL is not simply a description of how agents decide and learn, but offers insight into the overall function or objective of adaptive decision making. This is sometimes known as a "rational analysis" of behavior, since it seeks to understand the purpose of some behavior independent of the exact mechanism by which it is accomplished (Marr, 1982; Anderson, 1990). Once the objective of learning has been made clear, CRL goes on to posit a family of related learning algorithms or mechanisms, all designed to allow an embodied, autonomous agent to control its environment in effective ways. The core methods in the field were developed jointly by researchers in computer science and psychology (Sutton, 1988; Sutton & Barto, 1981, 1998) as well as by Bertsekas & Tsitsiklis (1996).

The Goal

All else being equal, it seems likely that animals seek to avoid punishment and maximize reward through their choices and actions (e.g., Thorndike's "Law of Effect"). This basic idea is reflected in the first key assumption in CRL, namely that agents strive to maximize the long-term reward they experience from the environment. Formally, if the reward experienced at time t is r_t, then the goal of learning and decision making is to maximize the expectation of reward over the long-run future, that is,

$$E\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}\right] \tag{1}$$

The term γ in Equation 1 represents a discount factor that gives greater weight to rewards that are experienced sooner rather than later. This property is desirable for mathematical reasons (making the sum finite), but also because humans and other animals seem to have similar preferences for immediate over delayed rewards (Myerson & Green, 1995; Frederick, Loewenstein, & O'Donoghue, 2002).
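The discounted sum in Equation 1 is straightforward to compute over a finite horizon. A minimal sketch in Python (the reward sequence and γ values below are illustrative assumptions, not taken from the chapter):

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * r_{t+k+1} for a finite list of future rewards (Eq. 1)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# A reward of 1.0 arriving two steps in the future is worth less under discounting.
print(discounted_return([0.0, 0.0, 1.0], gamma=1.0))   # 1.0
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))   # 0.25
```

With γ near 1 the agent is far-sighted, weighting distant rewards almost as heavily as immediate ones; with γ near 0 it is myopic, valuing only the next reward.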

Defining the Decision Environment: Basic Definitions

Most applications of CRL, particularly those applied to human and animal behavior, assume that the world can be viewed as a particular type of dynamic process known as a Markov decision process, or MDP. The MDP assumption simply asserts that the world is composed of a set of finite states (S), actions (A), a set of transition probabilities (T), and rewards (R).

The states of an MDP refer to the distinct situations the agent might be in. The notion of a "state" in RL is sufficiently general to cover many different types of situations. For example, being in a particular location in a maze facing a particular direction might be one state. States might also involve elements internal to the agent, like "being hungry" or, for a robot, "having a low battery" (see the section The Influence of State Representations on Learning for a discussion of the notion of states in the context of human learning).

Actions are the decision options that are available to the agent across all different states of the world. In some MDP problems, all actions are available in all states, whereas in others, different subsets of actions might be available depending on the state. For example, if facing a wall at the end of a dead-end hallway, an agent's available actions are practically limited to those that allow it to turn around. Like the definition of states, actions can be internal to the agent (e.g., encode the currently attended object in memory).

Transition probabilities determine the dynamics of the environment. A transition probability can be defined as the probability of a new state s' on the next time step (t + 1) given the current state is s and the agent selects action a:

$$P(s_{t+1} = s' \mid s_t = s,\, a_t = a) \tag{2}$$

Note that the next state depends only on the current state and action and not on the full history of actions up to that point. This is known as the Markov assumption. Although this assumption may be violated in the real world, for many situations it appears reasonable, and it greatly simplifies the mathematics involved.

The reward function R determines how rewards or punishments are distributed in the environment. In a psychology experiment, the experimenter might manipulate how reward is provided to learners to alter their behavior. Similarly, roboticists who use RL to train autonomous systems often must adapt the reward function provided to the robot so the system meets certain engineering objectives (see section Varieties of Reward for a discussion of rewards in the context of human learning). In an MDP, rewards can be probabilistic (like the reward associated with buying a lottery ticket), thus it often makes sense to talk in terms of averages or expected values of rewards. In particular, we could summarize the expected value of the reward on trial t + 1 as

$$E[r_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s'] \tag{3}$$

The function R assigns the value of reward received for each possible state transition. Together, these four elements (S, A, T, and R) completely define the decision problem (or MDP) facing the agent. Any particular task or environment can be defined by providing particular parameters or numbers for these four quantities (usually each can be expressed as a function or a matrix). Critically, the description of a particular task as an MDP does not assume any particular agent has complete knowledge about these four quantities (in fact, in most realistic settings it would be impossible for an agent to completely know these aspects of the world). What is important in an MDP is simply defining the generic nature of the decision-making problem facing the agent. We will later discuss how CRL algorithms provide guidance on how agents learn to make effective decisions in such an environment given various amounts of knowledge about the world and experience.

Assigning Value to States and Actions

Given the description of the environment as an MDP, it is clear that states and actions in the world that lead to greater expected reward have greater value to the agent. However, it is not only the states and actions immediately resulting in reward that may have value to an agent. For example, an action that takes the agent closer to a rewarding state can also be valuable (in the long run) by virtue of what it means for the agent's future prospects.

One clear example of this is second-order conditioning (Rescorla, 1980). In second-order conditioning, a stimulus such as a light is first paired with a negative outcome like a shock. Next, a second stimulus such as a tone is paired with the light (but without the shock). At test, animals show anticipation of the shock following the presentation of the tone even though this item was never directly paired with the negative outcome. In other words, the tone appears to be a proxy for the negative

102 elementary cognitive mechanisms

outcome because tones tend to beget lights, and lights tend to beget shocks.

One way to quantify this "proxy" value is as the expected sum of future rewards available from each state, denoted V(s_t):

V(s_t = s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]    (4)

In the second-order conditioning example just described, the long-term expected value at the onset of the light is largely negative, but so is the long-term expected value at the onset of the tone, since they both ultimately will lead to shock. However, the state tied to the tone would have a slightly higher value since the punishment is delayed further in the future.

This notion of estimating the "proxy" value of states or actions that later lead to reward is what allows RL to offer a solution to the credit assignment problem described earlier. The value of actions that in turn lead to other actions that provide reward is captured by evaluations of long-term rather than immediate prospects. Of course, the valuation of particular states is very sensitive to the agent's discounting parameter, γ. If γ is very small, the agent cares only about immediate reward (i.e., is myopic), and thus actions or states that result in direct reward are highly valued. If γ is larger, then future prospects are taken into account.

One of the most important and interesting aspects of CRL is that the value of each state in Equation 4 can be defined recursively in terms of the value of other states:

V(s_t = s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
           = E[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s ]
           = E[ r_{t+1} + γ V(s_{t+1}) | s_t = s ].    (5)

In other words, the value of a state s_t is the expectation of the reward experienced leaving that state (r_{t+1}) plus a discounted estimate of the future reward available from the possible successor states, s_{t+1}. Making the expectation in Eq. 5 more explicit:

V(s_t = s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]    (6)

where π(s,a) is the agent's current probability of choosing action a in state s, P^a_{ss'} is the probability of transitioning from state s to s' given action a, and R^a_{ss'} is the reward expected from that same action (using the notation developed by Sutton & Barto, 1998). Finally, V(s') is the estimated value of future reward for the next state (i.e., Eq. 4).

This approach to averaging over possible outcomes weighted by their probability of occurrence is central to almost all work on judgment and decision-making (e.g., Bernoulli, 1954).

One helpful way to think about this relationship is as a tree (see Figure 5.2). At the root of the tree is the agent's current state (s). The value of that state will depend first on which action the agent takes (a), which is represented by the first set of branches in the figure. Which action is selected in turn will determine which of a variety of different rewards might be experienced (r) while transitioning to the new state s'. The value of the current state V(s) is thus a weighted average of the value of all possible successor states, based on both the agent's decision-making as well as the dynamics of the environment. This is similar to analyses of games like tic-tac-toe, where different actions lead to different game states, and thus the value of any particular game state will depend on the available actions and states going forward.

Fig. 5.2 The value of the current state s can be estimated by "looking forward" from the current state to the expected reward r and the value of the possible successor states s', averaged over all possible actions and state transitions.

The recursive relationship between successive states is called the Bellman equation and is an important feature of MDPs, as we will see next.

Making Good Decisions

One way an agent could maximize reward would be to learn a set of rules about how to behave in each state. Often this set of rules is known as a policy, which tells the agent which action to select in each state in order to maximize the expectation of long-term reward (Eq. 1). We first hinted at the idea of a policy in Eq. 6 when we needed to consider the agent's probability of selecting each possible action (π(s,a)). The complete policy for how to respond in each possible state of the environment is denoted simply as π.
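To make Eq. 6 concrete, the weighted average it describes can be computed by sweeping repeatedly over states until the values stop changing (iterative policy evaluation). The sketch below uses a hypothetical two-state MDP — the transition probabilities, rewards, and uniform policy are invented for illustration, not taken from the chapter:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (all numbers invented).
# P[(s, a)] lists (probability, next_state, reward) triples, i.e. the
# P^a_{ss'} and R^a_{ss'} terms of Eq. 6 rolled into one table.
P = {
    (0, 0): [(1.0, 1, 1.0)],
    (0, 1): [(0.5, 0, 0.0), (0.5, 1, 2.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, -1.0)],
}
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}   # uniform random policy pi(s, a)
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # One application of the Eq. 6 backup to every state.
    V_new = np.zeros(2)
    for s in (0, 1):
        for a in (0, 1):
            for prob, s2, r in P[(s, a)]:
                V_new[s] += pi[s][a] * prob * (r + gamma * V[s2])
    converged = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if converged:
        break
print(V)   # each V(s) now equals its own Eq. 6 weighted average
```

Because γ < 1, the backup is a contraction, so the loop settles on the unique self-consistent values; the dynamic programming methods mentioned later in the chapter rest on the same principle.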

computational reinforcement learning 103

Ideally, the agent would like to learn the optimal policy, which is the one that returns the most reward over the long term across all possible states of the environment. There are many different methods for learning optimal policies for MDPs, including dynamic programming and Monte-Carlo methods (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). However, some of these methods are less computationally practical for biological agents with finite resources or incomplete knowledge about the environment.

However, two particular methods of learning optimal (or near-optimal) policies have had an extremely important influence on research on learning and decision-making in humans and other animals: temporal difference (TD) learning and Q-learning. In the following sections, we describe the basic idea behind each of these algorithms. Later we will discuss how the features of these algorithms are related to the learning behavior of animals.

However, first it is worth pointing out how estimating the value of each state can help the agent make decisions. Since the values for each state represent the expected sum of future rewards available from that state (Eq. 4), it means the agent is maximizing reward (given the current policy) if it chooses actions that have the highest probability of transitioning to states with high values. For example, we can talk about the long-term value of making a decision in a particular situation (more specifically, a state-action pair) as:

Q(s_t = s, a_t = a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]    (7)

which is just a simplification of Eq. 6, which doesn't average over the different possible actions. A good strategy for the agent would be to choose the action in the current state s which maximizes the value of Q(s,a). This is trivially true since Q(s,a) measures the expected long-run reward available from taking that action. To maximize reward over the long term, the agent just needs to choose actions that maximize this quantity in each state. However, agents must first learn good estimates of V(s) or Q(s,a), which is what TD and Q-learning methods accomplish.

Temporal Difference (TD) Learning

As the previous section described, an agent who has learned the values for each state in the environment can behave optimally simply by choosing actions in each state that maximize the estimated long-term reward. However, the main issue is how to efficiently learn these values. One possibly naïve way to do this would be to follow a policy, π, keeping track of any rewards experienced along the way and at some point updating the value of V^π(s_t) based on that experience. The downside is that it will take a lot of experience before the agent knows anything about the value of any particular state (conditioned on the policy).

However, a more efficient, online, incremental learning rule that estimates V^π(s_t) can be derived by observing the relationship described in Eq. 5. The important insight of this equation is that the value of V^π(s_t) depends on the average value of the next state V^π(s_{t+1}). This fact can be used to "bootstrap" estimates of V^π(s_t) more quickly.

We will illustrate the basic idea with an example, then discuss the mathematics. In Figure 5.2, we showed how the value of a given state in an MDP is determined by the "tree" of successor outcomes available from that state. In Figure 5.3, we imagine that the agent begins in state s and has already estimated its long-term value to be 0.5, then selects action a. As a result of that selection, the agent receives a reward equal to r = 1.0 and transitions to a new state, s', which it has previously estimated to have a value of 1.2. Assuming the discount parameter γ is set to 0.9, this would appear to be a better outcome than the agent expected. In particular, the agent expects to receive on average 0.5 reward units going forward from state s, but on this choice the agent estimates and receives r + γV(s') = 1.0 + 0.9 · 1.2 = 2.08.

This discrepancy suggests that the agent might be wise to revise its estimate of V(s) to be higher (this is a better than expected state). One approach could be to simply replace the agent's estimate of V(s) so that it equals 2.08. However, remember that V(s) represents the long-term reward available from a state averaged over all possible actions, rewards, and successor states. Thus, it would be too drastic to replace the old value completely with a single experienced outcome. Instead, the agent could simply move its estimate of V(s) a step in the direction of r + γV(s'). As long as the step size isn't too large, it can be shown that this will allow the value of V(s) to eventually converge on the true long-term estimate. On some trials it might move a little too high, on some trials a little too low, but in the limit it should converge toward the mean of the different experiences.

Once the value of V(s) is updated, the current state is set to s' and the process continues. The important point is that the value of V(s) can be

[Figure 5.3: a state-transition diagram. The agent in state s (1) selects an action a, (2) experiences reward r = 1.0 and a new state s' with V(s') = 1.2, and (3) updates V(s) based on the outcome via V(s) = V(s) + α[r + γV(s') − V(s)].]
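The single update illustrated in Figure 5.3 can be checked numerically. This sketch uses the values from the worked example in the text (V(s) = 0.5, r = 1.0, V(s') = 1.2, γ = 0.9) together with an assumed learning rate of α = 0.1:

```python
alpha, gamma = 0.1, 0.9            # alpha is an assumed value for illustration
V = {"s": 0.5, "s_prime": 1.2}     # current estimates from the example

target = 1.0 + gamma * V["s_prime"]   # r + gamma * V(s') = 2.08
delta = target - V["s"]               # prediction error: 2.08 - 0.5 = 1.58
V["s"] += alpha * delta               # move a step toward the target
print(round(V["s"], 3))               # 0.658: nudged toward, not replaced by, 2.08
```

Repeating such small steps over many visits to s is what lets the estimate converge toward the mean of the experienced outcomes.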

Fig. 5.3 The key steps in the temporal-difference (TD) update. First, an action is selected from the current state based on the agent's current policy. Next, the reward is experienced along with information about the successor state. The agent can look up its estimate of the value of this state from memory or assign a default value if the successor state has never been visited. Next, the value of the original state is updated in light of the experienced outcome. In this way estimates of the value of each state can be "bootstrapped." When the agent's estimate of V(s) is accurate, the error between expected reward and experienced reward will drop to zero on average. Learning is based on a temporal difference between what was expected and what was actually experienced.

bootstrapped based on the outcome of individual choices. After the agent makes a choice, it compares the value of the reward it experienced to the long-term estimate going forward and adjusts that estimate, V(s), up or down accordingly.

Formally, the error in the agent's current estimate of V(s) can be denoted using an intermediate variable called the prediction error, commonly denoted

δ = r_{t+1} + γ V(s_{t+1}) − V(s_t),    (8)

which measures the difference between the experienced and estimated value of the current state. Prediction errors can be positive (when events worked out better than expected) or negative (when events worked out worse than expected). When this value drops to zero, it means that the current value estimates are all consistent with one another and thus represent the true values. In the section on the Neural Correlates of RL we will describe how neurons encoding prediction errors have been identified in the brains of animals.

An incremental rule for adjusting V(s_t) can be written using this prediction error:

V(s_t) = V(s_t) + α[ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
       = V(s_t) + αδ.    (9)

In this equation, α is known as the learning rate and represents the "step size" of the update to V(s). In other words, V(s) is moved slightly in the direction of the prediction error (assuming α < 1.0). This simple, incremental method of learning the values of states is known as temporal difference learning (or TD). The name is largely descriptive: estimates of the long-term value of a state are based on differences between what was expected and what was experienced in time.

Q-learning

Q-learning is a modification of the basic temporal-difference algorithm, which learns the value of state-action pairs, Q(s,a), directly (rather than estimating the value of individual states, V(s)). This is often a more useful quantity to estimate, since we are often interested in the value of particular choices. As mentioned earlier, given direct Q(s,a) estimates, computing a policy is simple: choose the action in the current state associated with the largest value of Q(s,a). As a result, Q-learning obviates the need for a separate representation of the policy. From the perspective of a biological agent, learning and choice are made more compatible, with fewer demands for extra processing (e.g., to convert V(s) estimates into Q(s,a) estimates via Eq. 7). In effect, the learned Q-values (i.e., Q(s,a)) tell the agent directly what choice to make.

Following Eq. 9, an incremental update rule for Q(s,a) can be defined:

Q(s_t, a_t) = Q(s_t, a_t) + αδ,    (10)

where δ takes a slightly different form than in standard TD learning:

δ = r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)    (11)

Q-learning is considered an "off-policy" learning algorithm because the prediction error for the current choice assumes that the agent will
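A minimal tabular Q-learning loop built from Eqs. 10 and 11 might look as follows. The two-state environment and all constants here are invented for illustration, and the ε-greedy choice rule used for exploration is the one discussed in the next section:

```python
import random

random.seed(0)
gamma, alpha, epsilon = 0.9, 0.1, 0.1
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Toy deterministic dynamics (assumed, not from the chapter):
    in state 0, action 1 pays 1.0 and moves to state 1;
    in state 1, action 0 returns to state 0 for free."""
    if s == 0:
        return (1, 1.0) if a == 1 else (0, 0.1)
    return (0, 0.0) if a == 0 else (1, -0.1)

s = 0
for _ in range(5000):
    if random.random() < epsilon:                    # explore
        a = random.randrange(n_actions)
    else:                                            # exploit
        a = max(range(n_actions), key=lambda x: Q[s][x])
    s2, r = step(s, a)
    delta = r + gamma * max(Q[s2]) - Q[s][a]         # Eq. 11 (off-policy target)
    Q[s][a] += alpha * delta                         # Eq. 10
    s = s2
print([[round(q, 2) for q in row] for row in Q])
```

After enough visits to every state-action pair, the greedy choice in each state picks out the reward-maximizing cycle, without the agent ever representing the policy separately.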

choose the best action in the next state (i.e., the max_a Q(s_{t+1}, a) in Eq. 11) rather than follow the current policy (which might include the possibility of choosing to explore a different action). Q-learning is particularly important in computer science applications of RL because it can be proven to converge on the optimal policy for the particular MDP given that all state-action pairs are visited infinitely often (Watkins, 1989). In addition, it has shown considerable success as a model of how humans learn in dynamic decision-making tasks in which past actions influence future rewards (e.g., Gureckis & Love, 2009a,b; Otto, Gureckis, Love, & Markman, 2009).

Balancing Exploration and Exploitation—A Computational View

In Thorndike's puzzle-box experiments (see the section A Historical Perspective), we noted that the cat needs to explore various actions to learn which allow escape from the box. However, after a bit of learning, it becomes better to "exploit" action sequences that are known to be effective.

An analogous situation arises in the TD and Q-learning algorithms just described. Choosing the action with the highest value of Q(s,a) in each state will lead to optimal long-term decision-making. However, this assumes that the current estimates of Q(s,a) are already accurate. At the start of learning in a novel task or environment this is unlikely to be the case. Instead, it will take many updates of Eq. 10 in each state to ensure that the estimated values have converged.

Thus, agents learning via TD and Q-learning must also balance the need to explore (in order to learn) and exploit (in order to actually maximize reward). Early in a task, the agent shouldn't necessarily trust the current estimates of Q(s,a) too much and should sometimes choose actions that actually have lower estimated values. This is because those estimates may be incorrect and, with experience, would be adjusted upward or downward.

One way to balance these competing concerns would be to choose the action associated with the highest value of Q(s,a) most of the time (exploit), but some fraction of the time to choose an action randomly (explore). As long as the probability of exploring is nonzero but not too large, this strategy can help the agent learn the true Q-values for any particular environment. This explore/exploit strategy is often known as the ε-greedy algorithm and entails choosing the option associated with the highest value of Q(s,a), but choosing randomly from the available alternatives with probability ε. This policy also ensures that each state-action pair will be visited infinitely often (assuming infinite time). The downside of ε-greedy is that it continues to explore with probability ε even after the agent has lots of experience with the environment (and the Q-values may have converged to their accurate values). Another approach, known as the "softmax" strategy, explores probabilistically but allows the current estimates of Q(s,a) to influence the probability of exploration (Sutton & Barto, 1998). According to the softmax rule, the probability of choosing action a_i in any state s_t is given by

P(a_i, s_t) = e^{Q(a_i, s_t)·τ} / Σ_{j=1}^{N} e^{Q(a_j, s_t)·τ}    (12)

where τ is a parameter that determines how closely the choice probabilities are biased in favor of the value of Q(a_i, s_t), and N is the number of available actions in state s_t. In general, the probability of choosing option a_i is an increasing function of the estimated value of that action, Q(a_i), relative to the other actions (see also Luce, 1959). However, the τ parameter controls how deterministic responding is. When τ → 0, each option is chosen randomly (the impact of learned values is effectively eliminated). Alternatively, as τ → ∞, the model will always select the highest valued option (also known as "greedy" action selection).

Interestingly, the probability of exploration in the softmax model is sensitive to the degree of "competition" between choices. Assuming τ > 0, when all possible actions have similar values of Q(s,a), exploration becomes more likely. However, when one action is greatly superior, it will tend to be selected more often. In addition, in the softmax rule, the value of τ might be adjusted as experience in the task accumulates (e.g., Busemeyer and Stout, 2002) to favor exploitation over exploration (similar to simulated annealing). The later section on The Varieties of Exploration in Humans discusses in more detail open issues surrounding how people balance exploration and exploitation.

Neural Correlates of RL

Interest in computational RL stems not only from its utility in computer science applications (e.g., Tesauro, 1994; Bagnell and Schneider, 2001), but from the fact that human and animal brains appear to use similar types of learning
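The softmax rule in Eq. 12 is easy to implement directly. In this sketch the Q-values are invented for illustration; note that, as written in Eq. 12, τ acts like an inverse temperature, so τ = 0 gives random choice and large τ approaches greedy selection:

```python
import math

def softmax_probs(q, tau):
    """Choice probabilities from Eq. 12 for one state's Q-values."""
    weights = [math.exp(qi * tau) for qi in q]
    total = sum(weights)
    return [w / total for w in weights]

q = [1.0, 1.2, 0.8]                  # hypothetical Q(s, a_i) estimates

print(softmax_probs(q, tau=0.0))     # uniform: learned values ignored
print(softmax_probs(q, tau=20.0))    # nearly greedy: mass on the best action
```

Because the probabilities depend on the differences among the Q-values, closely matched actions keep exploration likely, while one clearly superior action dominates — the "competition" effect described above.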

algorithms. Indeed, modern RL represents a powerful theoretical framework for thinking about systems-level neuroscience. On one hand, this outcome is not surprising, since the pioneering work in computer science on RL was directly inspired by psychological research on basic learning processes (Sutton, 1988; Sutton & Barto, 1998). However, the discovery of extremely close correspondences between the predictions of RL algorithms and the operation of particular neural systems in mammal brains represents a major scientific advance (see Niv, 2009, for an excellent review and history of these developments).

The Reward Prediction Error Hypothesis

Perhaps the most famous discovery related to RL and the brain is the contribution RL has made to understanding the computational role of dopamine in learning. As Niv (2009) describes, early theories of dopamine suggested it encoded the reward value of particular stimuli in the environment. For example, dopamine neurons recorded within the midbrains of monkeys while they performed simple conditioning experiments showed increased firing rates immediately following the delivery of rewarding stimuli such as food (e.g., Schultz, Apicella, & Ljungberg, 1993). However, if this rewarding stimulus was consistently preceded by a conditioned stimulus (CS, e.g., a light or a tone), then the reward-related firing pattern would extinguish across trials. Instead, the neurons would begin firing upon the presentation of the CS.

For example, Figure 5.4, Panel A shows an example trial prior to any learning in the experiment. Shortly after the delivery of the unexpected reward (R) there is a large phasic spike in neural firing. However, panel B shows the same neurons later in a trial after a CS is repeatedly paired (following a fixed delay) with the reward. After many trials, the neurons no longer respond vigorously to the onset of the reward (R) but instead show a phasic burst of activity shortly after the presentation of the CS. Finally, panel C shows the firing pattern of the neurons late in training when the reward is unexpectedly withheld after the presentation of the CS (i.e., "No R"). Here, there is a strong phasic burst of activity following the CS, but a noticeable drop in firing when the reward is withheld.

This puzzling pattern of neural firing was ultimately deciphered using the concept of a temporal-difference based prediction error, which we introduced in Eq. 8 (Montague, Dayan, Person, & Sejnowski, 1995; Montague, Dayan, & Sejnowski, 1996; Schultz et al., 1997). A critical assumption in the neurocomputational model is that different points in time represent different states in the MDP (see Daw, Courville, & Touretzky, 2006a; Ludvig, Sutton, & Kehoe, 2012, for contemporary discussions of the representation of time in RL models). Note that, in a given state (i.e., time point), s, prediction errors (δ) will be positive when experienced outcomes are better than expected (i.e., γV(s') + r > V(s)) and negative when experienced outcomes are worse than expected. Early in training, the unexpected delivery of a rewarding stimulus leads to a positive prediction error (see Figure 5.4, panel D). As training trials in a conditioning experiment are repeated, the learned values for such states increase until the prediction error drops to zero. At the same time, TD algorithms "pass back" the proxy value of one state to preceding states (assuming γ > 0). Eventually, the unexpected presentation of the positively valued CS (i.e., the start of the trial) prompts a new positive prediction error (the start of the trial is unexpected, but generally positive by proxy, since it will always lead to later reward). Likewise, withholding an expected delivery of reward following the CS results in a negative prediction error (see Figure 5.4, panel F, closely matching the drop in firing observed in panel C).

The importance of this finding is hard to overstate. Research in computer science and behavioral psychology had identified prediction error as a key signal for enabling incremental learning from experience (i.e., section on A Computational Perspective on Reinforcement Learning). The discovery that particular neurons in the brains of monkeys reliably code a similar signal suggested an understanding not only of the detailed empirical phenomena, but of the overall function that dopamine plays in learning from experience.

Model-Based Analysis of fMRI

Another, very general, way in which RL has informed our understanding of the brain is in the analysis of data collected from humans using fMRI (functional magnetic resonance imaging). Traditional studies using fMRI often contrasted the patterns of brain activity recorded during particular task-relevant states with a suitably defined baseline (e.g., passively looking at a fixation cross). However, RL researchers have pioneered the use of computational models to help structure fMRI data in a trial-by-trial fashion (Daw, 2011; Ashby, 2011). The idea

[Figure 5.4: six panels. A/D: early in training (no CS; R delivered). B/E: late in training, reward delivered (CS followed by R). C/F: late in training, reward withheld (CS, then "No R"). The time axis spans −1 to 2 s within a trial; the model traces in panels D–F (Trials 1, 5, 10) range from −1.0 to 1.0.]
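The qualitative pattern in panels D–F can be reproduced with a minimal TD(0) simulation in which each time step within a trial is its own state, the CS arrives at the first step, and reward at the last. This toy setup (all constants assumed) is only a sketch of the kind of model described in the text:

```python
alpha, gamma = 0.1, 1.0
T = 5                         # time steps (states) per trial; CS at t = 0
V = [0.0] * T

def run_trial(V, reward):
    """One sweep of TD(0) updates through the trial; returns per-step deltas."""
    deltas = []
    for t in range(T):
        r = reward if t == T - 1 else 0.0
        v_next = V[t + 1] if t < T - 1 else 0.0   # trial ends after last step
        delta = r + gamma * v_next - V[t]         # prediction error (Eq. 8)
        V[t] += alpha * delta                     # update (Eq. 9)
        deltas.append(delta)
    return deltas

early = run_trial(V, reward=1.0)       # cf. panel D: surprise at reward time
for _ in range(500):
    run_trial(V, reward=1.0)           # training: value passes back to the CS
omission = run_trial(V, reward=0.0)    # cf. panel F: negative error at "No R"
print(early[-1], round(omission[-1], 2), round(V[0], 2))   # 1.0 -1.0 1.0
```

Early in training the only nonzero error occurs at reward delivery; after training the error there vanishes, the CS-onset state itself carries the (proxy) value, and omitting the reward produces a negative error — mirroring the recorded dopaminergic responses.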

Fig. 5.4 Figure adapted from Niv (2009). Panels A–C are taken from Schultz, Dayan, and Montague (1997) and show raster plots of recorded dopaminergic neurons at various stages of a classical conditioning experiment. The rows along the bottom of each panel represent individual neurons. Time within the current trial flows from left to right across the page. Black dots in a row reflect a recorded neural spike at that point in time. Along the top is a histogram of the total firing rate summed across all recorded cells (essentially the marginal distribution of the points below). Panels D–F plot the predictions of the temporal difference algorithm (section Temporal Difference (TD) Learning), in particular the prediction error term from Eq. 8, at various points within a trial and as training progresses. Refer to the text in the section The Reward Prediction Error Hypothesis for a full description.

is that computational models inspired by RL algorithms posit the existence of certain latent variables. The trial-by-trial changes in prediction error described earlier are one example, but so are other latent variables assumed by RL algorithms, such as the values of particular state-action pairs (i.e., Q(s,a)). During learning, these variables fluctuate dynamically based on the experience and decisions of the learning agent. When first fit to human behavior (i.e., patterns of choices), these models provide excellent targets for structuring analyses of fMRI data. For example, regressors can be constructed representing the trial-by-trial fluctuations in prediction error (δ), which are then correlated with the fluctuations in measured blood oxygen level dependent (BOLD) signal (Platt & Glimcher, 1999; Sugrue, Corrado, & Newsome, 2004; Daw, O'Doherty, Seymour, Dayan, & Dolan, 2006b; Ahn, Krawitz, Kim, Busemeyer, & Brown, 2011).

Analyses of this type have revealed regions in the human brain that appear reliably responsive to prediction errors (O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003; McClure, Berns, & Montague, 2003), and have given insight into how people trade off between exploration and exploitation (Daw, O'Doherty, Seymour, Dayan, & Dolan, 2006b). Even outside the field of RL, such model-based analysis approaches are changing the way fMRI data are analyzed (e.g., see Davis et al., 2012a,b, in the domain of category learning, or Anderson et al., 2008, in […] and reasoning).

Contemporary Issues in RL Research

Model-Based versus Model-Free Learning

Temporal difference learning methods (such as TD and Q-learning described earlier) depend on direct experience to estimate the value of particular actions. For example, an agent learning via Q-learning will not understand that hot stoves should
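The logic of such a model-based analysis can be sketched in a few lines: simulate (or fit) a learner, record its trial-by-trial prediction errors, and correlate them with a measured signal. Everything below — the learning-rule parameters and the "BOLD" series — is synthetic, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_trials = 0.2, 200
rewards = rng.binomial(1, 0.7, size=n_trials).astype(float)

# A simple delta-rule learner supplies the latent prediction-error regressor.
v, deltas = 0.0, []
for r in rewards:
    delta = r - v                  # per-trial prediction error
    deltas.append(delta)
    v += alpha * delta

deltas = np.array(deltas)
# Fake "BOLD" series that partly reflects the prediction error plus noise.
bold = 0.5 * deltas + rng.normal(0.0, 0.1, n_trials)
corr = float(np.corrcoef(deltas, bold)[0, 1])
print(round(corr, 2))
```

In a real analysis the regressor would additionally be convolved with a hemodynamic response function and entered into a general linear model rather than a raw correlation, but the structure — latent model variable as regressor — is the same.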

not be touched until it tries grasping a hot stove and experiences the large negative reward associated with that state-action pair. In this sense, these algorithms are much like the early associative theories of conditioning described earlier: They learn stimulus-response associations between states, actions, and rewards based on direct experience.

However, as the work by Tolman (1948) and others showed, animals exhibit a much richer and more flexible class of learning behaviors. For example, the ability of a rat to re-route its path around a novel obstacle would seem to depend on a richer knowledge about the structure of the maze than is assumed by temporal-difference learning methods. In fact, TD and Q-learning are classified by computer scientists as "model-free" learning algorithms because they do not require the agent to know about the underlying structure of the environment (e.g., the set of transition probabilities, T, and rewards, R, defined earlier). In model-free RL, estimates of V(s) or Q(s,a) suffice to enable adaptive behavior. In this case, the model refers to something akin to a relatively rich mental model of the task environment.

In contrast, other types of RL algorithms (called model-based RL) emphasize the learning of the transition probabilities and rewards. Once the agent has a representation of these quantities, it becomes possible to compute "on the fly" the values of V(s) or Q(s,a) at any point in time (using the Bellman insight in Figure 5.2). In addition, representing the transition probabilities and rewards explicitly allows the agent to plan into the future, forecasting the outcome of possible action sequences. This is exactly the type of behavior exhibited by Tolman's clever maze-running rats, and it is likely an important part of human decision making as well (e.g., Sloman, 1996).

The distinction between model-based and model-free RL is not purely theoretical. In fact, a growing body of work has explored the idea that multiple learning systems in the brain specialize in these respective types of learning and decision making (Daw, Niv, & Dayan, 2005; Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Otto, Gershman, Markman, & Daw, 2013). In particular, the idea is that habitual behaviors (i.e., those that have been repeated many times and are executed without much thought) may be akin, computationally, to model-free learning. Such behaviors are acquired slowly and depend heavily on direct experience. In contrast, model-based RL is more akin to more cognitively effortful forms of planning and reasoning. Research in animal conditioning has pointed to a similar distinction between goal-directed and habitual behaviors (Dickinson, 1985; Dickinson & Balleine, 2004).

The precise computational definitions given to these two forms of learning allow model-based and model-free RL to make dissociable behavioral predictions in a variety of tasks. For example, Simon and Daw (2011a) present fMRI evidence for dissociable neural systems that separately represent model-based and model-free RL in a realistic spatial navigation task. Similarly, Simon & Daw (2011b) found evidence for the contribution of differentiable model-based and model-free learning systems in a simple sequential decision-making task that became more or less volatile over time. Consistent with the idea that the model-based system utilizes more cognitive resources such as working memory and executive control, Otto, Gershman, Markman, & Daw (2013) found that participants' learning behavior was better fit by model-free algorithms when they performed a sequential decision-making task under working memory load.

Computational RL may also provide insight into why these two, seemingly redundant, systems exist. For example, Daw, Niv, & Dayan (2005) argue the model-free system takes longer to learn (since getting good estimates for Q(s,a) can take considerable time and experience). On the other hand, model-based systems are able to guide behavior more quickly. However, the model-based system is more error prone, since it requires an accurate representation of the transition probabilities in the environment. Thus, there are times where the accuracy of the model-free system will exceed the performance of the model-based system. In effect, the two systems are specialized for learning at different time scales. The full details of various model-based RL algorithms are beyond the scope of the present chapter (see Daw, 2012, for a summary).

The Influence of State Representations on Learning

A critical component of almost all existing RL algorithms is the notion of a state (owing to the close relationship between these algorithms and the mathematics of Markov decision processes). In these models, a state is essentially a context or situation in which a certain set of choices are
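The model-based idea — compute values "on the fly" once T and R are known — can be sketched as a toy value iteration, repeatedly applying the Bellman backup to an explicitly represented (here, invented) deterministic model:

```python
import numpy as np

gamma = 0.9
# Learned model (hypothetical): T[s][a] = next state, R[s][a] = reward.
T = [[0, 1], [0, 1]]
R = [[0.1, 1.0], [0.0, -0.1]]

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup computed directly from the model.
    V = np.array([max(R[s][a] + gamma * V[T[s][a]] for a in (0, 1))
                  for s in (0, 1)])

# The plan is read off the model without any further direct experience.
plan = [max((0, 1), key=lambda a, s=s: R[s][a] + gamma * V[T[s][a]])
        for s in (0, 1)]
print(np.round(V, 2), plan)
```

If the environment changes, a model-based agent can simply re-run this computation on the updated T and R — the flexibility Tolman's rats displayed — whereas a model-free agent must relearn its cached values through new experience.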

available. For example, the idea of a "temporal difference" error signal implicitly assumes that the deviation between expected and experienced outcomes is conditioned on the current state or context (Schultz et al., 1997). However, a rigorous account of what exactly constitutes a state is often left aside in neurobiological models. To the degree that such models can eventually be extended to explain behavior in more complex learning situations, it is critical that the field adopt a better understanding of how the state representation that the learner adopts influences learning and how such representations are acquired.

The importance of state representation is particularly apparent in sequential decision-making contexts, where agents make a series of decisions in order to maximize reward. For example, a robot navigating a building could be in a particular state (e.g., a certain position and orientation in a particular hallway) and could make a decision (e.g., turn left), causing a change in state. According to the RL framework, the overall goal of the agent is to make the decision in each state that would maximize its long-term reward. However, the way that the agent represents the structure of the environment can strongly influence its ability to achieve this goal. For example, an agent with very limited sensors might have difficulty differentiating between various hallways and intersections and thus would have trouble deciding which action to take in any given case.

A similar problem faces human learners. For example, Gureckis and Love (2009b) studied a sequential decision-making task in which human participants had to learn to avoid an immediate, short-term gain in order to maximize long-term reward (see also Herrnstein, Lowenstein, Prelec, & Vaughan, 1993) (see Box 1 for an overview of these types of tasks). Critically, the task was structured such that the reward received on any trial depended on the history of the participant's responses on previous trials. Finding the optimal long-term strategy in such environments is difficult because the relationship between distinct task states (in this case, different patterns of prior responses) and the reward on any trial is unclear. Gureckis and Love suggested that the difficulty people show in such tasks may not stem entirely from an impulsive bias toward immediate gains, but instead from adopting the wrong state representation of the task. A series of experiments found that providing simple perceptual cues that help to disambiguate successive task states greatly improved participants' ability to find the optimal solution (Gureckis & Love, 2009a,b; Otto, Gureckis, Love, & Markman, 2009). Corresponding simulations with an artificial RL agent based on Q-learning (see the section on Q-learning) showed that, like humans in the experiment, enriching the discriminability of distinct task states greatly improves learning in the task.

In Gureckis and Love's (2009b) experiment, state cues were simple binary lights that unambiguously mapped to distinct task states. However, in the real world, such state cues are unlikely to be so direct. Consider a case in which you must decide which area of a lake is best for fishing. Your options might be the shallows by the shore or the deep sections in the center. However, a number of factors such as season, time of day, presence of long grass for breeding, or the turbidity of the water may also be relevant. Successful decision-making thus requires the integration of a variety of cues in order to identify the current state and enable effective behavior. Note that the question of how people combine multiple cues in order to make predictions about the environment has been extensively studied in the categorization literature. However, the close relationship between work in categorization and RL has only recently been acknowledged (Shohamy, Myers, Kalanithi, & Gluck, 2008; McDonnell & Gureckis, 2009; Gureckis & Love, 2009b; Redish, Jensen, Johnson, & Kurth-Nelson, 2007; Gershman, Blei, & Niv, 2010).

Varieties of Exploration in Humans

Effective RL often requires a delicate balance of exploratory and exploitative behavior (Sutton & Barto, 1998; Steyvers, Lee, & Wagenmakers, 2009). For example, consider the problem of choosing where to dine out from a set of competing options. The quality of restaurants often changes over time, such that one cannot be certain which restaurant is currently best. In this type of "nonstationary" environment, one either chooses the best-experienced restaurant so far (i.e., exploits) or visits a restaurant that was inferior in the past but may now have improved (i.e., explores). Even in stationary environments, knowing when and how much to explore is a difficult and important problem. For example, when outcomes are stationary in time but noisy, uncertainty about which option is best can complicate decision-making. Closely related problems occur in the foraging literature (Kamil,

110 elementary cognitive mechanisms Box 1 Experimental Approaches the tone might be lower since it means a puff of Given the description in the section on A air is coming soon. Computational Perspective on Reinforcement RL algorithms have also been applied Learning of common RL algorithms, it is to instrumental conditioning paradigms useful to consider the range of behavioral explaining issues like the effects of reward phenomena these theories have been used to rate on responding and response vigor (e.g., explain. Overall, the diversity of tasks that have Niv, Daw, Joel, & Dayan, 2007). Here it is been modeling using RL is a testament to the natural to consider learning algorithms like very general framework it provides for thinking Q-learning or actor-critic architectures (not about human and animal behavior. reviewed here but see Konda and Tsitsiklis, 1999) where agents explicitly are estimating Classical and Instrumental Conditioning the value of particular actions in different states As mentioned earlier, the contemporary field and adjusting decision policies based on these of RL was anticipated by early work in experiences. classical and instrumental conditioning. Such early studies identified notions of contingency, reward, and prediction error as possible deter- Bandit Tasks minants of learning (e.g., Rescorla and Wagner, A particular kind of instrumental conditioning 1972; Wagner and Rescorla, 1972). It is paradigm that has strongly influenced research not surprising then that RL algorithms have on RL in humans is the multi-armed bandit continued to be influential as theories of basic task. The basic experiment procedure is named conditioning phenomena. 
For example, the after the slot machines in casinos which deliver temporal difference methods described earlier payouts probabilistically when the “arm” of were largely developed by theorists to explain the machine is pulled (in fact, these machines continuous-time effects on conditioning as well are sometimes known as “one-armed bandits”). as second-order conditioning, two classical (i.e., In the multi-armed bandit task, a number of Pavlovian) phenomena that are difficult to choice options are presented to the subject. account for using standard models such as the On each trial the subject can select one of the Rescorla-Wagner model (Niv and Schoenbaum, bandits. Selected bandits pay out a random 2008). reward from a distribution specified by the When modeling classical conditioning, the experimenter. The goal of the subject is to standard TD learning algorithm described in maximize the total reward they earn across the section on Temporal Difference (TD) a finite number of trials (see Steyvers, Lee, learning is often used. This is natural since, & Wagenmakers, 2009; Lee, Zhang, Munro, in classical conditioning paradigms, the partic- & Steyvers, 2011, for discussion of human ipant does not have any control over the task experimentation and modeling). (e.g., there are no choices to be made). For Bandit tasks are simple experiments that example, in an eye-blink conditioning exper- expose many of the core aspects of RL described iment, a tone might sound moments before earlier. For example, the learner has to balance a puff of air is delivered to the participant’s exploration and exploitation between different eye. When applied to classical conditioning bandits in order to ensure they are consistently phenomena, particular assumptions are made choosing the best one. In addition, bandit tasks about the representations of “states” in the involve incremental learning of the valuation of system. For example, TD models of classical different options across multiple trials. 
In some conditioning (see the section on The Reward variants of a bandit task, the payout probability Prediction Hypothesis) assume that continuous of the different options drifts randomly over time is divided into arbitrarily small discrete time (sometimes called “restless bandits”) units, each of which constitutes a different state further encouraging continuous learning and in an MDP (Sutton, 1995; Daw, Courville, & exploration (e.g., Daw, O’Doherty, Seymour, Touretzky, 2006a; Ludvig, Sutton, & Kehoe, Dayan, & Dolan, 2006b; Yi, Steyvers, & Lee, 2012). The value of being in the state of hearing 2009).
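The bandit paradigm described above is straightforward to simulate. The sketch below is a toy illustration, not code from the chapter: the function names, payoff probabilities, and parameter values are our own assumptions. It pairs an incremental, delta-rule value learner with softmax action selection of the general kind the chapter refers to as Eq. 12.

```python
import math
import random

def softmax_choice(values, temperature=0.2, rng=random):
    """Pick an index with probability proportional to exp(value / temperature)."""
    prefs = [math.exp(v / temperature) for v in values]
    total = sum(prefs)
    r = rng.random() * total
    cumulative = 0.0
    for arm, p in enumerate(prefs):
        cumulative += p
        if r < cumulative:
            return arm
    return len(prefs) - 1  # guard against floating-point round-off

def run_bandit(payoff_probs, n_trials=2000, alpha=0.1, seed=0):
    """Incremental (delta-rule) value learning on a stationary Bernoulli bandit."""
    rng = random.Random(seed)
    q = [0.0] * len(payoff_probs)  # one value estimate per arm
    choices = []
    for _ in range(n_trials):
        arm = softmax_choice(q, rng=rng)
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        q[arm] += alpha * (reward - q[arm])  # move estimate toward the outcome
        choices.append(arm)
    return q, choices
```

With payoff probabilities of, say, 0.2 and 0.8, the value estimates drift toward the true payout rates and choices increasingly concentrate on the richer arm; raising `temperature` flattens the choice probabilities and produces more exploration.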

Box 1 Continued

Sequential Decision Making Tasks
A key feature of RL models is the ability to learn to make sequences of decisions to achieve some goal. Experiments assessing this behavior tend to use tasks with greater structure than a traditional bandit task. For example, researchers have considered variants on bandit tasks where the reward rates of different options are linked together in particular ways. For example, Knox, Otto, Stone, & Love (2011) explored a "leap-frog" type task in which the overall reward rates of two bandits take turns increasing in value at random points during the experiment. As discussed later, this additional structure favors different types of exploration strategies that are more structured in time (see the section on The Varieties of Exploration in Humans). Relatedly, some researchers have explored dynamic decision-making tasks where the payoffs available from different actions depend on the past history of choices made by the agent (Neth, Sims, & Gray, 2006; Gureckis & Love, 2009a,b; Otto, Gureckis, Love, & Markman, 2009; Otto, Markman, Gureckis, & Love, 2010; Otto & Love, 2010). In these cases, optimal, reward-maximizing strategies may require more complex sequential decisions similar to playing a game like tic-tac-toe or chess (see the section The Influence of State Representations on Learning). Researchers have even used complex video games that involve spatial navigation to interrogate basic learning and decision processes (e.g., Simon and Daw, 2011).

Animals must decide when to abandon a resource patch (e.g., a lake) and seek out a new resource. Optimal foraging requires keeping an estimate of the current reward rate and the expected reward rate for moving to a new resource patch (Kamil, Krebs, & Pulliams, 1987). Exploring when one should exploit and, conversely, exploiting when one should explore both incur costs. For example, an actor who excessively exploits will fail to notice when another action becomes superior. Conversely, an actor who excessively explores incurs an opportunity cost by frequently forgoing the high-payoff option. How often one should explore should vary as a function of the environment. For environments that are volatile and undergo frequent change, one should explore more often (Daw, Gershman, Seymour, Dayan, & Dolan, 2011; Knox, Otto, Stone, & Love, 2011). In contrast, there is little reason to explore a well-understood environment that never changes. Other factors that affect how often one should explore include the task horizon (i.e., how many more opportunities there are to make a choice) and how rewarding the environment is in general (Rich & Gureckis, 2014; Steyvers, Lee, & Wagenmakers, 2009). To return to the restaurant scenario, there is little reason on the last day of vacation to try a new restaurant when all the restaurants besides one's favorite have proven to be horrible. On the other hand, if it's early in the vacation and most restaurants are generally good, then it makes sense to explore.

The exploration methods considered earlier, such as softmax (Eq. 12) and ε-greedy (see the section on Balancing Exploration and Exploitation—A Computational View), do not explicitly consider these issues concerning exploration. However, other methods for regulating uncertainty do. Next, a number of methods of exploring in dynamic-decision environments are considered. The methods are arranged from least to most sophisticated. Each method has its place both in engineering applications and in modeling human behavior.

trial independent "random" exploration
This form of exploration is the simplest and most commonly used in psychology, neuroscience, and computer science (e.g., Sutton & Barto, 1998). On each trial, a value estimate is gathered for each possible action. Exploiting corresponds to choosing the highest value option, whereas exploring corresponds to choosing some other action. This form of exploration is referred to as trial independent because the probability of choosing an action is not influenced by what was chosen on past trials or what will be chosen on future trials. The only factor determining the probability of selecting an action on the current trial is its current value. Examples of choice procedures using this scheme are softmax (Eq. 12) and ε-greedy. The strengths of these forms of exploration include simplicity. For example, these methods do not require a complex analysis of the environment. Because every action has some chance of being sampled on every trial, a well-designed learning rule (e.g., Q-learning) will

eventually discover the optimal policy. A weakness is that there is no guarantee that one is exploring when one should.

systematic exploration
Trial independent random exploration does not conform to popular intuition about what exploration is. Consider a historic explorer, such as Christopher Columbus, sailing toward the new world. His ships did not move east one day, west the next, and then south on the third day, as they might according to a softmax exploration procedure where a choice is made on every trial. Instead, a series of linked and similar (i.e., repeated) actions were taken to truly move into uncharted territory. In certain tasks, exploration that reaches novel states requires consistent action across trials, which is not readily achievable with trial independent random exploration. Several consistent choices might move an agent to a novel state that several thousand "random" choices would never reach.

There are certain tasks in which people likely explore in a similar fashion. Otto, Markman, Gureckis, & Love (2010) conducted a dynamic decision task that had a local optimum that people could only escape by repeatedly taking actions that resulted in lower immediate reward, but higher subsequent reward (cf. Bogacz, McClure, Li, Cohen, & Montague, 2007). These authors found that, when people were in a motivational state that fosters cognitive flexibility (cf. Higgins, 2000), subjects tended to be streaky in their choices, repeating the same option for several trials in a row. In particular, these subjects were best modeled by a choice procedure that repeated the previous choice with probability p and, with probability 1 − p, let the softmax procedure determine the choice. This generally leads to systematic, streaky patterns of exploration.

optimal planning
The preceding methods for balancing exploration and exploitation are heuristic in nature. More ideally, one would calculate an overall plan in which exploration was integral to maximizing total reward. For example, for an ideal actor that uses its beliefs and plans ahead, there is no distinction between exploration and exploitation, as all actions follow a policy that maximizes expected value (Gittins, 1979).

This may seem like a philosophical distinction, but it is conceptually and practically important. The preceding methods acknowledge that there is information value in taking actions that currently have lower-than-expected reward. These methods are blind to what the true best policy is, so there is a need to explore in hopes of discovering it. In contrast, an ideal actor that performs optimal planning chooses to "explore" because it is the next move in the plan that has the highest expected value. In this sense, it is incorrect to say exploration ever occurs in an ideal actor model because every move is exploitative of long-term reward.

One issue is that computing the ideal actor's policy is computationally challenging and often can only be done for certain problems (cf. Kaelbling, Littman, & Cassandra, 1998). Optimal solutions can be calculated for relatively simple problems, such as a finite-horizon n-armed bandit task (Gittins, 1979; Steyvers, Lee, & Wagenmakers, 2009). Interestingly, recent work in psychology suggests that people actually can adopt such optimal decision policies in certain situations (e.g., Rich & Gureckis, 2014; Knox, Otto, Stone, & Love, 2011). Another example includes modeling navigation in mazes where people begin from a random starting position (Stankiewicz, Legge, Mansfield, & Schlicht, 2006).

Varieties of Reward
In RL, rewards are fundamental in that they define the task for the agent. Rewards would seem to be straightforward, transparent, and immutable: the reward is simply the signal the environment provides and the agent receives. In reality, the concept of reward in RL is much richer, more complex, and more subtle. Although rewards can follow in a straightforward manner from the environment, in many RL domains and applications, rewards are another aspect of the overall system that is manipulated by the agent designer or human experimenter.

For example, in Gureckis and Love (2009a), human and artificial agents both performed better in a difficult RL problem when low levels of noise were added to the reward signal. In other words, corrupting the reward signal with noise actually increased agent performance. The reason for this surprising outcome was that agents tended to underexplore initially, and the added noise in the reward signal had the effect of increasing agent exploration, which benefitted the agent in the long run. Although not presented this way in the original paper, this is a case in which a limitation in an agent can be addressed by the design of a reward signal.
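One mechanism behind this kind of result can be caricatured in a few lines. The sketch below is an illustrative toy, not the model from Gureckis and Love (2009a): the payoffs, noise level, and function names are our own assumptions. It trains a purely greedy delta-rule learner on a two-armed problem. With a noise-free reward signal, this agent settles on the first option it happens to try and never samples the alternative; with enough noise in the training signal, the ordering of the value estimates occasionally flips, forcing visits to the other arm.

```python
import random

def run_greedy_agent(train_noise_sd, n_trials=500, alpha=0.1, seed=3):
    """Greedy delta-rule learner on a two-armed bandit whose *training*
    signal is the true payoff corrupted by zero-mean Gaussian noise."""
    rng = random.Random(seed)
    true_mean = [0.3, 0.7]   # arm 1 is objectively better
    q = [0.0, 0.0]
    switches = 0             # times the agent changes arms: a crude exploration index
    last_arm = None
    for _ in range(n_trials):
        arm = 0 if q[0] >= q[1] else 1               # purely greedy: no built-in exploration
        payoff = true_mean[arm]
        noisy_feedback = payoff + rng.gauss(0.0, train_noise_sd)
        q[arm] += alpha * (noisy_feedback - q[arm])  # learn from the corrupted signal
        if last_arm is not None and arm != last_arm:
            switches += 1
        last_arm = arm
    return q, switches
```

Comparing `run_greedy_agent(0.0)` with a high-noise run such as `run_greedy_agent(2.0)`: the noise-free agent locks onto its first choice and never switches, whereas corrupted feedback drives at least occasional sampling of the other arm. This is only the exploration-inducing mechanism in isolation; whether the added noise helps or hurts overall performance depends on the task and parameters.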

This notion has been explored formally in machine learning (Singh, Lewis, Barto, & Sorg, 2010). When an agent is computationally bounded in some sense, there can be an alternative reward signal that will lead to the agent performing better than the external reward signal. Likewise, agents are often endowed with additional internal reward signals to create basic drives, such as curiosity (Schmidhuber, 1990; Oudeyer & Kaplan, 2007) or the desire for information (Gureckis & Markant, 2012). From a psychological perspective, models of human behavior have been formulated in which actions (and corresponding rewards) include internal actions, such as storing information into a working memory store (Dayan, 2012; Gray, Sims, Fu, & Schoelles, 2006; Todd, Niv, & Cohen, 2008). The costs associated with these internal operations can be optimized by learning algorithms, such as Q-learning, to capture human performance given task constraints.

Finally, alternative ways of training agents sidestep many of the demands of traditional RL schemes that require extensive exploration of undesirable states to converge on a policy. One approach is learning by demonstration (Abbeel, Coates, & Ng, 2013). In this line of work, a difficult task is demonstrated to an agent, which dramatically speeds learning. For example, an agent can learn to control a helicopter much faster by having certain maneuvers demonstrated (i.e., by observing the actions and corresponding outcomes) than by self-exploring the huge space of action sequences and subsequent rewards. Another approach is interactive, in which an outside teacher provides a reward signal to the agent rather than the environment providing the reward signal. Agents can master difficult tasks, such as playing Tetris, much more quickly by receiving evaluative feedback from an observing human teacher than by learning to play by exploring and experiencing subsequent rewards (Knox, Glass, Love, Maddox, & Stone, 2012; Knox & Stone, 2009). The human-supplied feedback effectively transforms an RL task into a supervised learning task in which only the immediate action is rewarded. Many of the inherent difficulties of RL problems, such as delayed rewards and learning the proper sequence of actions, are sidestepped by this approach.

Concluding remarks
Reinforcement learning represents a dynamic confluence of ideas from psychology, neuroscience, computer science, and machine learning. The power and generality of the framework is made clear by the diverse range of theoretical and practical ideas that it has helped to articulate. That said, there are many important open issues in the field that have been hinted at in this chapter.

Acknowledgments
The authors wish to thank Julie Hollifield for help with the figures and Nathaniel Daw, David Halpern, John McDonnell, Yael Niv, Alex Rich, and A. Ross Otto for helpful discussions in the preparation of the manuscript. TMG was supported by grant number BCS-1255538 from the National Science Foundation and contract D10PC20023 from the Intelligence Advanced Research Projects Activity (IARPA) via the Department of the Interior (DOI).

Notes
1. Sutton and Barto (1998) in particular provides a very clear technical introduction to the area.
2. This discussion implicitly assumes the learning rate, α, is constant; however, the learning rate might be adjusted based on experience in the task (e.g., dropping toward zero over time) or based on the overall volatility of the environment (see Sutton & Barto, 1998, chapter 2 for a further discussion). This issue is also of interest in the psychology and neuroscience literatures (Behrens, Woolrich, Walton, & Rushworth, 2007; Krugel, Biele, Mohr, Li, & Heekeren, 2009; Nassar & Gold, 2013).
3. In fact, both the TD algorithm described in the section Temporal Difference (TD) Learning and Q-learning are considered "temporal-difference methods," although the TD algorithm shares the name with this more general class.
4. This might sound like an extreme demand to make on optimal learning behavior for a biological agent, but it is a general requirement of all optimal convergence algorithms. In practice, algorithms such as Q-learning can converge on effective policies even in complex tasks given reasonable amounts of experience with the environment.

Glossary
Credit Assignment: The problem of assigning blame to actions when rewards are delayed in time. For example, if you leave your cup on the edge of a table, then three days later nudge the table with your hip and break the cup, which action is most responsible for the negative outcome? In one way it is the earlier action of leaving the cup in a dangerous place. However, traditional theories of conditioning might assume that more immediate actions are to blame. RL deals with this problem by having the value of actions depend not only on their immediate outcomes but also on the sum of future rewards available following that action.
Explore/Exploit Dilemma: In environments in which the rewards associated with different actions are unknown, agents face a decision dilemma to either return to options

previously known to be positive (exploit) or to explore relatively unknown options. The optimal balance between exploration and exploitation is computable in some environments but generally can be computationally intractable.
Temporal Difference (TD) Learning: A method of computational RL that learns the value of actions through experience. In particular, in TD, learning is based on a deviation between what is expected at one point in time and what is actually experienced. TD is explained in detail in the section on Temporal Difference (TD) Learning.
Policy: A set of rules that determines which action an agent should select in each possible state in order to maximize the expectation of long-term reward. An everyday example of a policy would be a list of directions for getting from location A to location B. At each intersection the directions (policy) tell the agent which decision to make.
State: Distinct situations an agent may be in. For example, being stopped at a red light on the corner of Houston St. and Broadway in NYC might be a state. States derive their importance in RL from the dependence of most RL methods on the underlying mathematics of Markov Decision Processes (MDPs).
Reward: Typically a scalar value that determines the quality or desirability of an outcome or situation. Typically rewards are designed by society, experimenters, or roboticists to guide the behavior of agents in particular ways. The goal of an RL agent is to maximize the reward received over the long term.
Myopic behavior: The behavior of RL agents often depends on the degree to which they value long-term versus short-term outcomes. A parameter (γ) in most RL models controls this trade-off. When γ is zero, RL agents make choices that only value immediate rewards. This is often described as "myopic" behavior because it disregards future consequences of actions.

References
Abbeel, P., Coates, A., & Ng, A. Y. (2013). Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research, 32, 458–482.
Ahn, W.-Y., Krawitz, A., Kim, W., Busemeyer, J., & Brown, J. (2011). A model-based fMRI analysis with hierarchical Bayesian parameter estimation. Journal of Neuroscience, Psychology, and Economics, 4(2), 95–110.
Anderson, J. (1990). The adaptive character of thought. Mahwah, NJ: Erlbaum.
Anderson, J., Carter, C., Fincham, J., Qin, Y., Ravizza, S., & Rosenberg-Lee, M. (2008). Using fMRI to test models of complex cognition. Cognitive Science, 32, 1323–1348.
Ashby, F. (2011). Statistical analysis of fMRI data. Cambridge, MA: MIT Press.
Bagnell, J., & Schneider, J. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In International Conference on Robotics and Automation (pp. 1615–1620). IEEE.
Behrens, T., Woolrich, M., Walton, M., & Rushworth, M. (2007). Learning the value of information in an uncertain world. Nature Neuroscience, 10(9), 1214–1221.
Bernoulli, D. (1954). Exposition of a new theory on the measurement of risk. Econometrica, 22, 23–36.
Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Bogacz, R., McClure, S., Li, J., Cohen, J., & Montague, P. (2007). Short-term memory traces for action bias in human reinforcement learning. Brain Research, 1153, 111–121.
Busemeyer, J., & Stout, J. (2002). A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task. Psychological Assessment, 14(3), 253–262.
Davis, T., Love, B., & Preston, A. (2012a). Learning the exception to the rule: Model-based fMRI reveals specialized representations for surprising category members. Cerebral Cortex, 22, 260–273.
Davis, T., Love, B., & Preston, A. (2012b). Striatal and hippocampal entropy and recognition signals in category learning: Simultaneous processes revealed by model-based fMRI. Journal of Experimental Psychology: Learning, Memory, & Cognition, 38, 821–839.
Daw, N. (2011). Trial-by-trial data analysis using computational models. In E. A. Phelps, T. W. Robbins, & M. Delgado (Eds.), Affect, learning and decision making, attention and performance XXIII. Oxford University Press.
Daw, N. (2012). Model-based reinforcement learning as cognitive search: Neurocomputational theories. In P. Todd, T. Hills, & T. Robbins (Eds.), Cognitive search: Evolution, algorithms, and the brain. Cambridge, MA: MIT Press.
Daw, N., Courville, A., & Touretzky, D. (2006a). Representation and timing in theories of the dopamine system. Neural Computation, 18, 1637–1677.
Daw, N., Gershman, S., Seymour, B., Dayan, P., & Dolan, R. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69, 1204–1215.
Daw, N., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12), 1704–1711.
Daw, N., O'Doherty, J., Seymour, B., Dayan, P., & Dolan, R. (2006b). Cortical substrates for exploratory decisions in humans. Nature, 441, 876–879.
Dayan, P. (2012). How to set the switches on this thing. Current Opinion in Neurobiology, 22(6), 1068–1074. doi: 10.1016/j.conb.2012.05.011
Dickinson, A. (1985). Actions and habits: The development of behavioural autonomy. Philosophical Transactions of the Royal Society B: Biological Sciences, 308, 67–78.
Dickinson, A., & Balleine, B. (2004). The role of learning in the operation of motivational systems. In R. Gallistel (Ed.), Stevens' handbook of experimental psychology: Vol. 3. Learning, motivation, and emotion (3rd ed.). Hoboken, NJ: Wiley.
Frederick, S., Loewenstein, G., & O'Donoghue, T. (2002). Time discounting and time preference: A critical review. Journal of Economic Literature, XL, 351–401.

Gershman, S., Blei, D., & Niv, Y. (2010). Context, learning, and extinction. Psychological Review, 117, 197–209.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41, 148–177.
Gray, W. D., Sims, C. R., Fu, W. T., & Schoelles, M. J. (2006). The soft constraints hypothesis: A rational analysis approach to resource allocation for interactive behavior. Psychological Review, 113(3), 461–482.
Gureckis, T., & Love, B. C. (2009a). Learning in noise: Dynamic decision-making in a variable environment. Journal of Mathematical Psychology, 53, 180–193.
Gureckis, T., & Love, B. C. (2009b). Short term gains, long term pains: How cues about state aid learning in dynamic environments. Cognition, 113(3), 293–313.
Gureckis, T., & Markant, D. (2012). A cognitive and computational perspective on self-directed learning. Perspectives in Psychological Science, 7, 464–481.
Herrnstein, R., Lowenstein, G., Prelec, D., & Vaughan, W. (1993). Utility maximization and melioration: Internalities in individual choice. Journal of Behavioral Decision Making, 6, 149–185.
Higgins, E. T. (2000). Making a good decision: Value from fit. American Psychologist, 55, 1217–1230.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
Kamil, A. C., Krebs, J. R., & Pulliams, H. R. (1987). Foraging behavior. London, England: Plenum.
Knox, W., Glass, B., Love, B., Maddox, W., & Stone, P. (2012). How humans teach agents. International Journal of Social Robotics, 4(4), 409–421.
Knox, W. B., Otto, A. R., Stone, P., & Love, B. C. (2011). The nature of belief-directed exploratory choice in human decision-making. Frontiers in Psychology, 2, 398. doi: 10.3389/fpsyg.2011.00398
Knox, W. B., & Stone, P. (2009). Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP 2009) (pp. 9–16).
Konda, V., & Tsitsiklis, J. (1999). Actor-critic algorithms. In Neural Information Processing Systems. Cambridge, MA: MIT Press.
Krugel, L., Biele, G., Mohr, P., Li, S.-C., & Heekeren, H. (2009). Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. Proceedings of the National Academy of Sciences, 106, 17951–17956.
Lee, M., Zhang, S., Munro, M., & Steyvers, M. (2011). Psychological models of human and optimal performance in bandit problems. Cognitive Systems Research, 12, 164–174.
Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. Westport, CT: Greenwood Press.
Ludvig, E., Sutton, R., & Kehoe, E. (2012). Evaluating the TD model of classical conditioning. Learning and Behavior, 40(3), 305–319.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York, NY: Freeman.
McClure, S., Berns, G., & Montague, P. (2003). Temporal prediction errors in a passive learning task activate human striatum. Neuron, 38(2), 339–346.
McDonnell, J., & Gureckis, T. (2009). How perceptual categories influence trial and error learning in humans. In Multidisciplinary Symposium on Reinforcement Learning. Montreal, Canada. http://www.gureckislab.org/papers/McDonnellGureckisMSRL.pdf
Montague, P., Dayan, P., Person, C., & Sejnowski, T. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377(6551), 725–728.
Montague, P., Dayan, P., & Sejnowski, T. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
Myerson, J., & Green, L. (1995). Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 64, 263–276.
Nassar, M., & Gold, J. (2013). A healthy fear of the unknown: Perspectives on the interpretation of parameter fits from computational models in neuroscience. PLOS Computational Biology, 9(4), e1003015.
Neth, H., Sims, C., & Gray, W. (2006). Melioration dominates maximization: Stable suboptimal performance despite global feedback. In R. Sun & N. Miyake (Eds.), Proceedings of the 28th Annual Meeting of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154.
Niv, Y., Daw, N., Joel, D., & Dayan, P. (2007). Tonic dopamine: Opportunity costs and the control of response vigor. Psychopharmacology, 191(3), 507–520.
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265–272.
O'Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. (2003). Temporal difference learning model accounts for responses in human ventral striatum and orbitofrontal cortex during Pavlovian appetitive learning. Neuron, 38, 329–337.
Otto, A., Gershman, S., Markman, A., & Daw, N. (2013). The curse of planning: Dissecting multiple reinforcement learning systems by taxing the central executive. Psychological Science, 24(5), 751–761.
Otto, A., Gureckis, T., Love, B., & Markman, A. (2009). Navigating through abstract decision spaces: Evaluating the role of state knowledge in a dynamic decision making task. Psychonomic Bulletin and Review, 16(5), 957–963.
Otto, A., & Love, B. (2010). You don't want to know what you're missing: When information about forgone rewards impedes dynamic decision making. Judgment and Decision Making, 5, 1–10.
Otto, A., Markman, A., Gureckis, T., & Love, B. (2010). Regulatory fit in a dynamic decision-making environment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(3), 797–804.
Oudeyer, P. Y., & Kaplan, F. (2007). What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1(6).

Platt, M., & Glimcher, P. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400(6741), 233–238.
Redish, A., Jensen, S., Johnson, A., & Kurth-Nelson, Z. (2007). Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling. Psychological Review, 114(3), 784–805.
Rescorla, R. (1980). Pavlovian second-order conditioning: Studies in associative learning. Hillsdale, NJ: Erlbaum.
Rescorla, R., & Wagner, A. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. Black & W. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York, NY: Appleton-Century-Crofts.
Rich, A. S., & Gureckis, T. M. (2014). The value of approaching bad things. In P. Bello, M. Guarini, M. McShane, & B. Scassellati (Eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
Schmidhuber, J. (1990). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 222–227). Cambridge, MA: MIT Press.
Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13(3), 900–913.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Shohamy, D., Myers, C., Kalanithi, J., & Gluck, M. (2008). Basal ganglia and dopamine contributions to probabilistic category learning. Neuroscience and Biobehavioral Reviews, 32(2), 219–236.
Simon, D., & Daw, N. (2011a). Neural correlates of forward planning in a spatial decision task in humans. Journal of Neuroscience, 31, 5526–5539.
Simon, D. A., & Daw, N. D. (2011b). Environmental statistics and the trade-off between model-based and TD learning in humans. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Weinberger (Eds.), Advances in Neural Information Processing Systems 24 (pp. 127–135).
Singh, S., Lewis, R. L., Barto, A. G., & Sorg, J. (2010). Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2), 70–82.
Skinner, B. (1938). The behavior of organisms: An experimental analysis. Oxford, England: Appleton-Century.
Sloman, S. (1996). The empirical case for two systems of reasoning. Psychological Bulletin, 119(1), 3–22.
Stankiewicz, B., Legge, G., Mansfield, J., & Schlicht, E. (2006). Lost in virtual space: Studies in human and ideal spatial navigation. Journal of Experimental Psychology: Human Perception and Performance, 32, 688–704.
Stephens, D., & Krebs, J. (1986). Foraging theory. Princeton, NJ: Princeton University Press.
Steyvers, M., Lee, M., & Wagenmakers, E. (2009). A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology, 53, 168–179.
Sugrue, L., Corrado, G., & Newsome, W. (2004). Matching behavior and the representation of value in the parietal cortex. Science, 304, 1782–1787.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the 12th International Conference on Machine Learning (pp. 531–539). Morgan Kaufmann.
Sutton, R., & Barto, A. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215–219.
Thorndike, E. (1911). Animal intelligence: Experimental studies. New York, NY: Macmillan.
Todd, M. T., Niv, Y., & Cohen, J. D. (2009). Learning to use working memory in partially observable environments through dopaminergic reinforcement. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 1689–1696).
Tolman, E. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189–208.
Wagner, A., & Rescorla, R. (1972). Inhibition in Pavlovian conditioning: Application of a theory. In R. Boakes & M. Halliday (Eds.), Inhibition and learning (pp. 301–336). London, England: Academic.
Watkins, C. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge University, Cambridge, England.
Yi, S., Steyvers, M., & Lee, M. (2009). Modeling human performance on restless bandit problems using particle filters. Journal of Problem Solving, 2(2), 33–53.

computational reinforcement learning 117

PART II
Basic Cognitive Skills

CHAPTER 6
Why Is Accurately Labeling Simple Magnitudes So Hard? A Past, Present, and Future Look at Simple Perceptual Judgment

Chris Donkin, Babette Rae, Andrew Heathcote, and Scott D. Brown

Abstract

Absolute identification is a deceptively simple task that has been the focus of empirical investigation and theoretical speculation for more than half a century. Since Miller's (1956) seminal paper, the puzzle of why people are severely limited in their capacity to accurately perform absolute identification has endured. Despite the apparent simplicity of absolute identification, many complicated and robust effects are observed in both response latency and accuracy, including capacity limitations, strong sequential effects, and effects of the position of a stimulus within the set. Constructing a comprehensive theoretical account of these benchmark effects has proven difficult, and existing accounts all have shortcomings. We review classical empirical findings, as well as some newer findings that challenge existing theories. We then discuss a variety of theories, with a focus on the most recent proposals, make some broad conclusions about general classes of models, and discuss the challenges ahead for each class.

Absolute or perfect pitch, the ability to identify the notes played on a musical instrument, is a very rare ability, and has been a subject of scientific study since the late 19th century (Ellis, 1876). It is surprising that identifying musical notes is so difficult, since humans routinely identify a huge number of things in day-to-day life, such as faces, voices, and places. Throughout the first half of the 20th century, however, psychology researchers uncovered that the difficulty most people experience in naming notes is representative of a general deficit in identifying simple stimuli that vary on just one dimension. This early research culminated in Miller's (1956) seminal "7 ± 2" paper, in which he argued that humans were capable of accurately identifying only 5 to 9 stimuli that varied on a single dimension, regardless of the modality of that stimulus. Miller's work also made prominent the field of study that is the focus of this chapter, absolute identification.

In an absolute identification task, a set of N stimuli that vary on a single physical dimension are assigned a set of labels (usually the numbers 1 through N). On any given trial, participants are presented with one stimulus and asked to produce the corresponding label. For example, absolute pitch is a version of absolute identification in which the stimuli are tones varying in frequency and the labels are the musical note names A#, C, and so on. Other common versions of absolute identification use lines varying in length, or pure tones varying in loudness.

Since Miller's (1956) work, studies of absolute identification using a variety of stimulus dimensions and perceptual modalities have revealed an intricate pattern of phenomena behind this puzzling limitation in human ability. The complexity of the behavior elicited by such a seemingly simple task has ensured the enduring interest of the area, for example, as summarized by Shiffrin

and Nosofsky (1994). Absolute identification has also been of interest because of its links to other key areas of psychology, such as categorization—absolute identification is just categorization with one item per category (Nosofsky, 1986, 1997), and magnitude production—the reverse of absolute identification: given a label, participants try to produce the stimulus (De Carlo & Cross, 1990; Zotov, Jones, & Mewhort, 2011). More tantalizing is the suggested link between absolute identification and basic short-term memory research. As Miller pointed out, both paradigms focus on memory, and share the same severe performance limit (7 ± 2), suggesting a common, and deep-seated, cognitive mechanism [this is also suggested by recently identified links in the sequential effects between short-term memory and absolute identification (Malmberg & Annis, 2012)].

This extensive study of absolute identification has yielded a wide range of robust benchmark phenomena, not only in terms of the accuracy of responses but also in dynamic aspects, such as the time to make responses and the effect of previous responses on subsequent responses. Such a richness of data makes absolute identification a difficult challenge for cognitive modeling. We begin this chapter by giving an overview of the benchmark phenomena. These classical benchmarks focus on response accuracy. We then summarize key theoretical approaches that have been applied to absolute identification and describe how the latest models within these approaches have expanded their explanatory reach to response time (RT) as well as accuracy. We finish by discussing some of the recent and ongoing issues in the field.

Benchmark Phenomena

Capacity Limitations

Early in the history of absolute identification, researchers were particularly intrigued by the difficulty of the task. A common method of assessing the performance limits was to increase the size of the stimulus set, beginning with just two stimuli, and observe classification accuracy. Perfectly accurate performance usually failed after as few as five items. Initially, this result was demonstrated in experiments where the range of the stimulus set is held constant. Figure 6.1 demonstrates the severe nature of the capacity limitation using data from Pollack (1952) and Garner (1953). One might think that increasing the stimulus range would facilitate performance but, surprisingly, the severe limit of 7 ± 2 persists even when the range of stimuli is increased dramatically. In fact, once adjacent stimuli are perceptually discriminable (i.e., the spacing between stimuli is well above the just-noticeable difference), further increases to the range effectively have no influence on performance (Braida & Durlach, 1972).

Fig. 6.1 Performance, in terms of information transmitted, as a function of the number of stimuli used in the experiment. The data are taken from two experiments in which severe limits on performance were observed, reported by Pollack (1952) and Garner (1953). Note that k bits transmitted corresponds to the perfect identification of 2^k stimuli. Reproduced from Figure 1, Stewart, Brown, and Chater (2005).

A stimulus set can be characterized by the number of bits of information necessary to classify its members. Perfect classification of two stimuli requires 1 bit of information, four stimuli require 2 bits of information, and in general N stimuli require log2 N bits of information. In a review of 25 studies using a wide variety of stimuli, including frequency and intensity of tones, taste, hue of colours, and magnitude of lines and areas, Stewart, Brown, and Chater (2005) found the mean information limit was 2.48 bits. Figure 6.1 shows how increasing the number of stimuli increases the amount of information required for perfect classification. As illustrated, classification performance falls below the information required for perfect classification as the number of stimuli increases beyond four, and asymptotes not much above two bits. Hence, accuracy becomes progressively worse as set size increases above four.
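The information-transmitted measure plotted in Figure 6.1 is the mutual information between stimulus and response, estimated from a stimulus-response confusion matrix. A minimal sketch of the computation (the confusion matrix here is invented for illustration, not data from Pollack, 1952, or Garner, 1953):

```python
# "Information transmitted" is the mutual information between stimulus
# and response, estimated from a confusion matrix; perfect identification
# of N stimuli transmits log2(N) bits.
from math import log2

def bits_transmitted(confusions):
    """Mutual information (bits) between stimulus (row) and response (column)."""
    total = sum(sum(row) for row in confusions)
    n = len(confusions)
    p_stim = [sum(row) / total for row in confusions]
    p_resp = [sum(confusions[i][j] for i in range(n)) / total for j in range(n)]
    mi = 0.0
    for i in range(n):
        for j in range(n):
            p_ij = confusions[i][j] / total
            if p_ij > 0:
                mi += p_ij * log2(p_ij / (p_stim[i] * p_resp[j]))
    return mi

# Perfect identification of 4 stimuli transmits exactly 2 bits.
perfect4 = [[25, 0, 0, 0], [0, 25, 0, 0], [0, 0, 25, 0], [0, 0, 0, 25]]
print(bits_transmitted(perfect4))  # 2.0

# An empirical limit of about 2.5 bits implies perfect identification of
# only about 2 ** 2.5, that is, five or six, stimuli.
```

On this measure, a flat asymptote near 2.5 bits in Figure 6.1 is exactly the capacity limitation described in the text.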

Historically, it has been found that the capacity limitation is remarkably resistant to practice. For example, Garner's (1953) participants performed around 12,000 identification trials with tones of different loudnesses, yet at the end of the experiment they were still limited to identifying the equivalent of just four stimuli correctly. Weber, Green, and Luce (1977) also had participants complete 12,000 trials identifying just six tones varying in loudness and found an improvement in response accuracy of just 6%. Final performance for these participants was well below ceiling, despite there being extensive practice, monetary incentives, and only six tones. Hartman's (1954) participants practiced for 8 weeks, and although they demonstrated substantial improvement, they still could only perfectly identify five stimuli, which is well within Miller's limit. Such results have established what has subsequently been treated as a truism about absolute identification: there is a severe limitation in human ability to identify unidimensional stimuli, and this limit is unaffected by practice. However, recent results suggest that practice is not always ineffectual, at least for some stimulus dimensions. We will return to this issue later.

Bow Effects

The bow effect is one of the most robust results in absolute identification tasks (Kent & Lamberts, 2005; Pollack, 1953; Weber et al., 1977). As shown in Figure 6.2, response accuracy follows a U-shape when it is plotted as a function of stimulus position. That is, in any set of to-be-identified stimuli, the smaller and larger stimuli are better identified than items in the middle of the stimulus range. This effect is particularly interesting because it is independent of the absolute size of the stimulus. That is, the ability to identify any particular stimulus depends on its relative position within the set of stimuli being identified. For example, a stimulus can go from being very accurately identified, when it is the smallest stimulus in some set, to later being identified at chance-level accuracy, when it is one of the central stimuli in a new set (Lacouture & Marley, 1995; Lacouture, 1997). The addition of new stimuli to a set causes existing stimuli in the set to be identified less accurately, but this effect varies greatly across the stimulus range. Stimuli near the edge of the range suffer much less from the introduction of additional stimuli than stimuli near the center of the range.

Fig. 6.2 Performance, in terms of proportion of correct responses (left) and mean response time (right) for each stimulus. The different lines correspond to different stimulus set size conditions. The data are taken from Lacouture and Marley (1995), and demonstrate the bow effect, and the consistent performance for stimuli towards the edge of the stimulus range despite the number of stimuli in the set. Reproduced from Figures 5 and 7 of Lacouture, Y., & Marley, A. A. J. (2004). Choice and response time processes in the identification and categorization of unidimensional stimuli. Perception & Psychophysics, 66, 1206–1226. With kind permission from Springer Science and Business Media.

Sequential Effects

There has been growing recent interest in sequential effects, that is, the influence of recent stimuli and responses on the current decision (Gilden, Thornton, & Mallon, 1995; Van Orden, Holden, & Turvey, 2003; Wagenmakers, Farrell, & Ratcliff, 2004, 2005). Absolute identification researchers were early pioneers in such analyses (Holland & Lockhead, 1968; Ward & Lockhead, 1970, 1971), showing that the occurrence of identification errors depends in a complicated way on previous stimuli and responses. These sequential effects come in two broad varieties, and turn out to be one of the most challenging aspects of absolute identification for theoretical accounts. Some researchers even believe these sequential effects are the defining feature of absolute identification (e.g., Laming, 1984; Stewart et al., 2005). In what follows, we refer to the current decision trial as "trial N", which allows us to refer to the preceding trials as N−1, N−2, and so on.

assimilation and contrast

The response made on trial N is "assimilated" toward (i.e., tends to be similar to) the stimulus presented on trial N−1. For example, suppose that stimulus number 5 is presented on trial N. If the stimulus on trial N−1 had been number 1 (which is smaller), then an error on trial N is more likely to be an underestimate (e.g., number 4) than an overestimate (number 6) (Holland & Lockhead, 1968; Luce, Nosofsky, Green, & Smith, 1982; Stewart et al., 2005; Ward & Lockhead, 1970).

Stimuli further back in the sequence of trials have an opposite influence on responses. The response on trial N tends to "contrast" with (i.e., be dissimilar to) the stimuli on trials N−2, N−3, and further back in the sequence. For example, incorrect responses tend to be overestimates of current stimuli when the stimuli were small on trial N−2, and vice versa. The contrast effect is usually smaller than the assimilation effect, and though it tends to decrease as trials further back are considered, contrast can persist until about trial N−5 (Holland & Lockhead, 1968; Lacouture, 1997; Ward & Lockhead, 1971).
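This lag-dependent pattern can be illustrated with a toy simulation (illustrative only, not any published model): responses are pulled toward the stimulus one trial back and pushed away from stimuli further back, and mean signed error is then tabulated by lag, in the style of the analyses of Holland and Lockhead (1968). All parameter values below are invented.

```python
# Toy simulation of assimilation (lag 1) and contrast (lags 2-5).
import random

random.seed(1)
N = 10                       # stimulus labels 1..10
trials = 20000
stimuli = [random.randint(1, N) for _ in range(trials)]
responses = []
for t, s in enumerate(stimuli):
    bias = 0.0
    if t >= 1:
        bias += 0.3 * (stimuli[t - 1] - s)         # assimilation to trial N-1
    for lag in range(2, 6):
        if t >= lag:
            bias -= 0.05 * (stimuli[t - lag] - s)  # contrast to earlier trials
    r = s + bias + random.gauss(0, 1)
    responses.append(min(N, max(1, round(r))))

def mean_error_by_lag(lag, small_cutoff=3):
    """Mean (response - stimulus) on trials where the stimulus `lag`
    trials back was small (label <= small_cutoff)."""
    errors = [responses[t] - stimuli[t]
              for t in range(lag, trials) if stimuli[t - lag] <= small_cutoff]
    return sum(errors) / len(errors)

print(mean_error_by_lag(1))  # negative: assimilation toward small stimuli
print(mean_error_by_lag(3))  # positive: contrast away from small stimuli
```

Conditioning the signed error on the magnitude of the stimulus X trials back reproduces the qualitative crossover between assimilation and contrast.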

Fig. 6.3 Assimilation (at lag X = 1) and contrast effects (at X > 1) in data from Holland & Lockhead (1968). When X = 1, the average error is negative for smaller stimuli (filled symbols) and positive for larger stimuli (unfilled symbols). That is, when the stimuli on the previous trial were small, stimulus magnitudes on the current trial were underestimated, whereas stimulus magnitudes were overestimated when the previous stimuli were large. When X > 1, the pattern reverses, and a contrast effect is observed. Reproduced from Figure 4 of Brown, Marley, Donkin, and Heathcote (2008).

One apparent difficulty in thinking about assimilation and contrast is that stimulus labels and responses are correlated, simply because participants typically perform above chance-level accuracy. This can make it difficult to establish whether responses on the current trial are assimilated toward the previous stimulus or the previous response. Nevertheless, progress has been made on this question by using autoregressive measurement models (e.g., De Carlo, 1992; De Carlo & Cross, 1990). These analyses have generally supported the idea that assimilation and contrast occur mostly as described earlier: as biases caused by previous stimuli, rather than responses. Intriguingly, recent work by Malmberg and Annis (2012) has shown that assimilation and contrast effects have close analogues in short-term memory decisions.

The combination of assimilation and contrast, shown in Figure 6.3, poses a particularly challenging set of results for theories of absolute identification. A successful theoretical account of absolute identification must predict that responses are biased toward the previous stimulus (assimilation) but that the bias switches direction (contrast) as that stimulus recedes in memory. Several possibilities for how this occurs have been proposed, and we now discuss them.

Theories of Absolute Identification

We now give a brief overview of the different theoretical accounts of the benchmark phenomena just described. Fascination with the counterintuitive and intricate nature of absolute identification has spawned many theories, and our coverage does not aim to be encyclopaedic (we recommend Stewart et al., 2005, who give an excellent and comprehensive coverage). Rather, we aim to cover prominent examples of several different conceptual classes of theories, beginning with Thurstonian models and then covering three types of theories that have yielded models still under active investigation.

Thurstonian Models

The basic Thurstonian model assumes that any given stimulus evokes a noisy representation of absolute magnitude on an internal scale. The scale is divided by decision criteria into a set of response categories. The criteria that divide the response categories are allowed to vary between trials because of imperfect memory or shifting response bias. The noisy response categories can account for information limits when the number of stimuli increases for a fixed range: as the bounds become closer, the noisy stimulus representation and the noisy category bounds lead to more errors in responding.

Durlach and Braida (1969) extended the basic Thurstonian model to account for the invariance of performance with changes in the range of stimuli. They proposed that, because of limits on memory, only recently presented stimuli are used to define the context within which the current stimulus is presented. This additional source of variability is modeled as being proportional to the range of stimuli, and as such can produce the required invariance.
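As a toy illustration of the Thurstonian account just described (a sketch with invented parameter values, not Durlach and Braida's actual model): noisy magnitudes plus evenly spaced criteria produce a set-size cost, and making the representation noise proportional to the stimulus range produces the required range invariance.

```python
# Toy Thurstonian simulation: each stimulus evokes a noisy magnitude,
# and evenly spaced criteria carve the scale into response categories.
# Range-proportional noise (the context variance described above) keeps
# accuracy from improving when the range grows.
import random

random.seed(0)

def identify_accuracy(n_stim, stim_range=1.0, noise_frac=0.04, trials=20000):
    positions = [stim_range * (i + 0.5) / n_stim for i in range(n_stim)]
    sigma = noise_frac * stim_range      # noise proportional to the range
    correct = 0
    for _ in range(trials):
        i = random.randrange(n_stim)
        percept = positions[i] + random.gauss(0, sigma)
        # response = the category whose criteria bracket the percept
        resp = min(n_stim - 1, max(0, int(percept / stim_range * n_stim)))
        correct += (resp == i)
    return correct / trials

print(identify_accuracy(4))                   # near ceiling
print(identify_accuracy(10))                  # well below ceiling
print(identify_accuracy(10, stim_range=5.0))  # a wider range does not help
```

With noise fixed in absolute terms instead, widening the range would (incorrectly) predict improved performance, which is the failure that motivates the range-proportional assumption.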

Basic Thurstonian models can also account for bow effects, but their account is not entirely satisfactory. The models predict greater accuracy for edge stimuli because Thurstonian models predict that erroneous responses are mostly adjacent to the correct response, and edge stimuli have only one adjacent response, whereas internal stimuli have two. This account is sometimes called a "response restriction" account, because improved accuracy is predicted for edge stimuli simply because they have fewer neighbouring stimuli to be confused with. Although this restricted response explanation certainly does play some role in bow effects, it fails to predict the observed gradual decrease in accuracy as stimuli become more internal to the set. Many researchers have used a d′-based measure of performance (Luce et al., 1982) in place of raw accuracy. The d′ measure is, under some assumptions, insensitive to response effects such as bias and response restriction. The d′ measure still reliably shows deep bow effects, and this is something that response restriction explanations of the bow effect cannot explain.

Braida et al. (1984) elaborated the basic Thurstonian model's account of bow effects to allow for greater variability for stimuli near the center of the range than at the ends. They explained this by proposing that stimuli are judged against two "anchors" at the extreme ends of the stimulus range, and the distances between these anchors and the presented stimulus are counted using a noisy measurement unit. The further a stimulus is from those anchors, the noisier the distance measurement becomes. This produces a gradual bow effect, with increasingly frequent identification errors toward the middle of the measurement range. Luce, Green, and Weber (1976) proposed a related model, suggesting that bow effects may be due to more attention being paid to the edges of the stimulus range, thus reducing the variability of stimulus representations in those regions.

Treisman (1985) modified the basic Thurstonian model in a different way, in order to accommodate sequential effects. Treisman and Williams's (1984) criterion-setting theory proposed that the criteria that set the bounds for response categories change on a trial-by-trial basis due to two factors: a stabilizing mechanism and a tracking mechanism. The tracking mechanism moves the criteria away from the previously observed stimulus. This is based on the assumption that, in the real world, stimuli do not occur randomly, but that, given that something is perceived, it is likely to reappear again sometime soon. The tracking mechanism moves the criterion away from the previous stimulus (thereby expanding the stimulus range for the response just given) so that when it is presented again, it is more often correctly identified. The stabilizing mechanism moves the criteria such that they are centered on the mean of the previous stimuli. The stabilizing mechanism acts to counteract the tracking mechanism, moving criteria back in a way that maintains balanced responding in the long run. Treisman assumed that the tracking mechanism is stronger than the stabilizing mechanism, but decays more quickly. In this way, Treisman's model accounts for assimilation to trial N−1, where the tracking mechanism dominates, and contrast away from previous trials, where the stabilizing mechanism dominates.

Exemplar Models

Exemplar models have proven very successful in accounting for categorization behavior, and this makes them promising candidates for theories of absolute identification because of the close similarity between absolute identification and categorization. Exemplar models assume absolute identification is accomplished by determining the similarity between a to-be-identified stimulus and the memory representations for previous stimuli. Each stimulus is assumed to be represented in memory along with its associated label. The probability of response i is proportional to the similarity between the current stimulus j and all exemplars for response i. In general, exemplar models face the problem of not being able to account for the fundamental information limit of absolute identification: when the range of the stimulus set is increased but the number of stimuli remains fixed, these models predict that the memory representations should be more easily discriminable, and so performance should improve. As Braida and Durlach (1972) showed, this does not happen.
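The summed-similarity choice rule that exemplar models share can be sketched as follows (an illustrative generalized-context-model-style rule with an invented similarity parameter c; not a fitted model):

```python
# Generic exemplar choice rule: response probability is proportional to
# the summed similarity between the current stimulus and the stored
# exemplars carrying each label.
from math import exp

def exemplar_choice_probs(stimulus, exemplars, c=2.0):
    """exemplars: list of (magnitude, label); returns {label: probability}."""
    strengths = {}
    for mag, label in exemplars:
        sim = exp(-c * abs(stimulus - mag))   # exponential similarity decay
        strengths[label] = strengths.get(label, 0.0) + sim
    total = sum(strengths.values())
    return {label: s / total for label, s in strengths.items()}

# Three labels with one exemplar each. Edge stimuli have fewer close
# competitors, so their correct-response probability is higher: one
# contributor to the bow effect discussed earlier.
store = [(1.0, 1), (2.0, 2), (3.0, 3)]
probs_edge = exemplar_choice_probs(1.0, store)
probs_mid = exemplar_choice_probs(2.0, store)
print(probs_edge[1] > probs_mid[2])  # True
```

Note that stretching the stimulus spacing in this sketch sharpens all the choice probabilities, which is exactly the (incorrect) range prediction described in the text.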

Nosofsky (1997) extended exemplar models to also make predictions about the time taken to make decisions, rather than just the decision that is made. His exemplar-based random walk model (EBRW; Nosofsky & Palmeri, 1997) assumes that the representations of stimuli in memory are normally distributed across the stimulus dimension. Upon presentation of a stimulus, the exemplars race to be retrieved from memory with a speed that is an increasing function of the similarity between the current stimulus and each exemplar. Each time an exemplar is retrieved, a counter for the associated response is incremented, whereas the counters associated with other responses are decremented. The race continues until one counter reaches a threshold and the corresponding response is given. Nosofsky's model accounts for bow effects by assuming that stimuli toward the edge of the range have smaller variance in their perceptual representation. The EBRW model makes no attempt to account for sequential effects or the information limit phenomena. It is worth noting, however, that the EBRW model is one of the few absolute identification models to consider response time. Indeed, the EBRW makes detailed and precise predictions for response times, including full response time distributions for both correct and incorrect responses.

Petrov and Anderson (2005) proposed an exemplar model of absolute identification called ANCHOR, based on the ACT-R architecture (Anderson, 1990; Anderson & Lebiere, 1998). They assume that unidimensional stimuli are encoded by a perceptual subsystem into an absolute magnitude. The magnitude is then processed by the central subsystem, which compares it with exemplars or "anchors" stored in long-term memory. The central subsystem is thought to be dynamic, evolving from trial to trial throughout the absolute identification task. The perceptual processing that creates the magnitude and the selection of the anchor exemplar are both stochastic, with selection of the exemplar based on similarity between the current stimulus and memory representations, as well as the previously presented stimulus.

The ANCHOR model accounted for bow effects in accuracy (Petrov & Anderson, 2005) via the restricted response set explanation of bow effects, and, therefore, it suffers the same difficulty as Thurstonian models in accounting for bow effects in d′. Though not explicitly tested, Stewart et al. (2005) suggest that the ANCHOR model would probably be able to account for the information limit when range and set size were varied, due to noisy parts of the model unrelated to the spacing of stimuli. The ANCHOR model accounts for assimilation by making exemplars that have been recently used more likely to be retrieved, but does not attempt to account for contrast effects.

Kent and Lamberts (2005) proposed an exemplar model of absolute identification based on an adaptation of Lamberts's (2000) extended generalized context model for response times (EGCM-RT). EGCM-RT, which in turn is based on Nosofsky's (1986) generalized context model, assumes that information about the current stimulus is sampled repeatedly until sufficient evidence is accumulated to choose a response. Kent and Lamberts' model successfully accounts not only for the basic bow effect that is observed in accuracy but also for the inverted bow effect that is observed in response time (i.e., responses to stimuli from the center of the range are slower). It does so because stimuli toward the end of the range are relatively more isolated. This adds a different mechanism to the restricted response set explanation of bow effects, because not only do edge stimuli have fewer competing response alternatives, they also have smaller summed similarity than stimuli in the center of the stimulus range. A gradual decrease in accuracy and increase in response time is predicted, because summed similarity increases gradually for stimuli that are closer to the center of the range.

Without modification, the EGCM-RT was not able to account for systematic differences in the shape of bow effects as a function of the number of stimuli to be identified (set size; see Figure 6.2). However, Kent and Lamberts (2005) were able to capture these differences by assuming that the amount of information sampled from a stimulus was a decreasing function of the number of pieces of information already sampled—a plausible assumption of "diminishing returns." By assuming that additional samples from a stimulus were decreasingly useful in discriminating between members of the stimulus set, an information limit was imposed on the basic EGCM-RT model. Kent and Lamberts did not attempt to account for sequential effects.

Relative Judgement Models

The previous Thurstonian and exemplar models tended to focus more heavily on explaining fundamental capacity limitations and bow effects, and paid less attention to explaining sequential effects in absolute identification data. The defining feature of relative judgment models, on the other hand, is that decisions are based on the difference between current and previous stimuli (or responses), rather than being based directly on representations of the absolute magnitudes of stimuli. This leads to a natural focus on sequential effects. Relative versus absolute judgment is not just a theoretical distinction in absolute identification modeling but is also important in some applied areas. For example, musicians are frequently trained in the skill of relative judgment (judging musical "intervals") but almost never in absolute judgment, because relative judgment is useful in musical performance, whereas absolute judgment is not (and may even be maladaptive). This difference may help explain the rarity of "perfect pitch" in musicians, which we discuss later.

Holland and Lockhead (1968) proposed that responses are made by combining feedback from the previous trial and the perceived distance between the current and previous stimuli. This basic relative judgment mechanism accounts for assimilation and contrast by simply assuming that the judged difference between the current and previous stimuli is biased toward the previous stimulus and away from earlier stimuli. For example, consider the case in which a small stimulus was presented on the previous trial (N−1). This means that, on average, the stimuli presented on earlier trials (N−2, N−3, ...) were probably larger stimuli. The memories for these larger earlier stimuli interfere with the judgment of the distance between the previous and current stimulus in a way that causes the distance to be underestimated. Hence, a response based on this distance will assimilate to the stimulus presented on trial N−1. As Stewart et al. (2005) point out, this approach can explain assimilation and contrast on average, but fails to provide a general account. To recount their example, imagine that a small stimulus on trial N−1, say 3, is followed by a smaller stimulus, 2. Now, the interference from more previous stimuli will cause an overestimate of the distance between the stimuli, and produce contrast, whereas assimilation is usually still observed in such cases.

Laming (1984) proposed a strict version of a relative judgment model, assuming that no absolute information is used, and that only the difference between the previous and current stimulus is considered. In particular, decisions are assumed to be made in a relatively coarse manner, such that the current stimulus is judged in terms of only five categories: "much less than," "less than," "equal to," "more than," or "much more than." Such limited categorical information provides a natural account of the fundamental capacity limits, within a relative judgment framework.
relative position within the set of stimuli is located Stewart et al. (2005) proposed the most within a context by reference to “anchors” located current and successful relative judgment model near the ends of the stimulus range. The relative of absolute identification. In their model, only position of the stimulus is then used to judge its the series of differences between each stimulus magnitude and subsequently assign a response label. and the next is represented internally. These Capacity limitations in the model are built into differences, along with feedback from the previ- the way the context is maintained. Attention to ous trial, are used to produce a response. The the stimulus range (context) of the experiment is formula used to generate a response, RN on trial DC assumed to be maintained by constant “rehearsal” of = + N ,N −1 +ρ N is: RN FN −1 λ Z,whereFN −1 that particular segment of the stimulus dimension. − C is the feedback on trial N 1, DN ,N −1 is the Rehearsal is assumed to operate as a Poisson pulse difference between the representations of differences process that directs activity to the relevant context, on trial N and N −1, λ is a scaling parameter, but this activation is assumed to passively decay.
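Stewart et al.'s (2005) response rule, given above, can be illustrated with a short simulation. The sketch below is ours, not the published model: the function name is invented, the raw stimulus difference stands in for D^C_{N,N−1}, and the rule used to shrink ρ near the edges of the range (to mimic the bow effect) is only one simple assumption about how ρ might behave.

```python
import random

def rjm_response(feedback_prev, d_current, n_stimuli, lam=1.0, sigma=0.5):
    """Illustrative relative-judgment response rule (after Stewart et al., 2005).

    feedback_prev : correct label given as feedback on trial N-1 (F_{N-1})
    d_current     : judged difference between the trial-N and trial-(N-1)
                    stimuli (a crude stand-in for D^C_{N,N-1})
    """
    implied = feedback_prev + lam * d_current
    # Assumed scaling: rho shrinks toward zero at the edges of the response
    # range and grows toward the middle, so noise hurts central stimuli most.
    distance_from_edge = min(implied - 1, n_stimuli - implied)
    rho = max(distance_from_edge, 0) / (n_stimuli / 2)
    z = random.gauss(0.0, sigma)  # the normally distributed noise term Z
    raw = feedback_prev + lam * d_current + rho * z
    # Responses are integer labels, clipped to the available range.
    return min(max(round(raw), 1), n_stimuli)
```

For example, with eight stimuli, feedback of 7 on the previous trial, and a judged difference of +1, the implied response (8) sits at the edge of the range, so the noise is fully down-weighted and the response is deterministic; an implied response near the middle of the range would be noisier, producing the bow effect in accuracy.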

why is accurately labeling simple magnitudes so hard? 127

The power of the rehearsal process is assumed to be fixed within an individual, thus yielding a limit on performance when either more stimuli are added, or the range of stimuli is increased.

Marley and Cook's (1984, 1986) model naturally predicts the bow effect in response accuracy and d′ because the relative magnitude of a stimulus is judged according to the amount of rehearsal activity between the current stimulus location and the closest anchor. The magnitude of stimuli in the center of the range will, therefore, be estimated with greater variance than edge stimuli, since the noise in the Poisson rehearsal process will have more influence on estimation (i.e., similar to Durlach & Braida, 1969).

Lacouture and Marley (1995, 2004) proposed an alternative means of explaining the information capacity limit and bow effect through their mapping model. The model assumes a basic and very simple absolute magnitude estimate as its input, and transforms it into a set of response-output strengths for each of the possible responses via an internal ("hidden") unit, in a similar way that a set of tuning curves might transform a magnitude estimate into a set of response tendencies. Since the hidden unit normalizes the input into a unit (0–1) range, a small and fixed amount of noise in the input magnitude estimate has greater influence when the number of response alternatives increases, as more stimuli are packed into the same psychological space. In Lacouture and Marley (2004), the output strengths coming from the mapping model were used to drive a leaky, competing accumulator (LCA) model (Usher & McClelland, 2001). This expanded the explanatory scope of their model to include RT, not only in terms of mean RT but also variability in RT (i.e., the full distribution of RT).

The restricted capacity models just reviewed make no attempt to account for any of the benchmark sequential effects in absolute identification. The SAMBA model (Brown et al., 2008) extends these earlier models, and incorporates elements of both the Poisson rehearsal and mapping processes from the previous two models, but uses a deterministic version of the LCA, Brown and Heathcote's (2005) ballistic accumulator model. The authors claimed SAMBA to be the most comprehensive theory of absolute identification yet proposed because they showed it was capable of accounting for all of the aforementioned benchmark phenomena in absolute identification, not just in terms of response choices, but also in terms of the full distribution of response times.

SAMBA is an acronym for the model's three stages. The Selective Attention aspect of Marley and Cook (1984) produces a magnitude estimate for a stimulus, which is then used as an input to the Mapping model of Lacouture and Marley (1995). The mapping process transforms the single magnitude estimate into N response strengths, which then drive a corresponding number of Ballistic Accumulators. The ballistic accumulators are a set of evidence accumulators that accrue activation at a rate determined by the output of the mapping stage of the model. Evidence in each accumulator increases ballistically (that is, deterministically, without moment-to-moment noise), suffers from passive decay, and is inhibited by the evidence in other accumulators.

The information limit and bow effects arise out of the first two stages of the SAMBA model. SAMBA provides an advance over earlier restricted capacity models by also accounting for assimilation and contrast. Assimilation is accounted for by assuming that the evidence in accumulators passively decays between trials, beginning from their level at the time at which a response was made. For example, this means that the accumulator that won the race on the previous trial will begin the current trial with the highest level of activation. This benefit for responses close to previous stimuli leads to assimilation toward responses made on the previous trial. Contrast is incorporated into the selective attention stage of the model. Brown et al. (2008) assumed that between trials, rehearsal activity is preferentially directed to the elements of the context corresponding to the stimulus presented on the previous trial. On subsequent trials, this additional activation around previous stimuli leads to contrast away from those stimuli. That is, since magnitude is estimated by summing the amount of activation between an anchor and the stimulus, increased activity will "push" the magnitude estimation away from the previous stimulus.

Table 6.1 summarizes the ability of each model class to account for existing benchmark phenomena. All the classes of models discussed so far are basically capable of accounting for what we are calling historical benchmarks: capacity limitations, bow effects, and sequential effects. Interested readers should see Stewart et al. (2005) for a more detailed table outlining the various models' ability to account for historical benchmark data. For the remainder of the chapter we will focus on recent issues in absolute identification. As seen in Table 6.1, these issues pose a challenge for many theories,

Table 6.1. A simplified summary of the ability of models within each class to account for historical and more recent empirical phenomena.

                 Historical Benchmarks               Current Issues
  Model Type     Capacity     Bow      Sequential    RT    Uneven    Inter-trial   Learning
                 Limitations  Effects  Effects             Spacing   Interval
  Thurstonian    Yes          Yes      Yes           No    No        No            No
  Exemplar       Yes          Yes      No            Yes   No        No            No
  Relative       Yes          Yes      Yes           No    Yes(a)    No            No
  Capacity       Yes          Yes      Yes           Yes   Yes       No(b)         No

(a) Depending on the definition one allows for a "purely relative" model; see Brown et al. (2009).
(b) But see Donkin, Chan, and Tran (2014) for a revised version of SAMBA capable of accounting for this phenomenon.

and there are a number of results that yet remain unresolved. We will not attempt to resolve these issues here, but instead hope to lay out a guide as to what should constrain model development moving forward.

Current Issues in Absolute Identification

Absolute and Relative Judgment

An unresolved tension in the modeling of absolute identification concerns the merits of "absolute" versus "relative" accounts. Recently, the line between absolute and relative models has become blurred, leading to some important tasks for future research. Most immediately, exactly what defines a purely relative versus a purely absolute model needs clarification. Subsequently, the limitations of each class need to be more clearly established, especially if the current "integrated" models, which include both absolute and relative aspects, are to be justified.

Absolute and relative judgment models are defined in a similar spirit to the way the two terms are used in the judgment of musical notes, earlier. An "absolute" account proposes that each new stimulus is assigned a label by comparison with some stable referents stored in memory, such as the extreme ends of the stimulus range. A "relative" account proposes that each new stimulus is judged instead against the memory of only very recently observed stimuli (perhaps just the most recent single stimulus). The earliest Thurstonian and exemplar models were purely absolute, in that each stimulus was judged only against long-term referents. Similarly, the earliest relative judgment models (Holland & Lockhead, 1968; Laming, 1984) were purely relative, in that stimuli were compared only to the most recently observed prior stimulus. These "pure" models had difficulty accounting for fundamental phenomena. The purely relative models could not easily account for bow effects in d′, or capacity limitations, and the purely absolute models could not easily account for any of the sequential effects.

Purely relative and purely absolute models can be envisaged as the endpoints on a continuum. For example, a purely relative model can be made a little less relative by allowing comparisons against the last two or three observed stimuli. The end point of such a process is an absolute model: an exemplar model, where all previous stimuli are used for comparison. Similarly, a purely absolute model can be made a little more relative by allowing a stimulus to be judged by comparison to recent stimuli, as well as to long-term referents.

To accommodate the growing list of complicated benchmark phenomena, recent models have moved away from purely relative or purely absolute accounts. For example, the SAMBA model (Brown et al., 2008) was able to accommodate almost all benchmark phenomena with purely absolute assumptions. However, one phenomenon was only able to be modeled within SAMBA's framework by including a relative process, judgment against the most recently observed magnitude. The phenomenon in question was the improved accuracy and response times observed for repeated stimuli, and for stimuli that are similar, but not identical, to the previous stimulus. It seems ad hoc to incorporate an otherwise unnecessary model component just to accommodate this one phenomenon, and so future work might investigate whether this phenomenon can be explained more parsimoniously. If it is possible to explain this phenomenon without recourse to a relative judgment process, this could simplify theoretical arguments, by once again having a purely absolute model that fits the data.

On the other hand, the purely relative models of Holland and Lockhead (1968) and Laming

why is accurately labeling simple magnitudes so hard? 129 (1984) were augmented by Stewart et al. (2005). Matthews (2009) explored an augmented version Stewart et al.’s relative judgment model includes of the relative judgment model, endowed with some elements which might or might not be classed knowledge of all the various stimulus spacings as “absolute”, depending on taste. For example, in between all pairs of stimuli. Although such a model Stewart et al.’s model, the amount of variability in can easily accommodate the data, what is less clear the internal representation of a magnitude depends is the status of that model on the absolute versus on the distance between that magnitude and the relative continuum. Proponents of relative models end of the range. This comparison with the end might construe such knowledge as purely relative, of the range sounds just like an absolute judgment as it involves only knowledge of differences between procedure, but it could avoid that label by instead stimuli. On the other hand, proponents of absolute supposing that magnitudes being represented are models might construe such knowledge as purely response magnitudes, rather than stimulus magni- absolute, as it assumes a long-term memory for the tudes. How this argument would hold up in general structure of the entire stimulus set. is unclear: for example, if non-numeric response The most attractive way forward for the debate labels were used, as is customary in categorization between absolute and relative accounts is probably tasks. by resolution of the exact definition of the basic Brown et al. (2009) tried to identify the short- terms. 
However, since that is likely a more difficult comings of relative judgment accounts without task than it appears, an alternative way forward is recourse to particular theories or detailed and by simply fitting the existing models to data, and questionable assumptions about precisely which comparing their performance on quantitative mea- mechanism are absolute vs. relative. They noted sures of goodness of fit. This approach has worked that relative judgment theories proceed by allowing well in other fields, and avoids many of the exactly one piece of absolute knowledge about problems inherent in the search for qualitative stimulus magnitude: the difference in magnitude differences between classes of models. between adjacent stimuli in the decision set. This knowledge is necessary to translate observed dif- Learning ferences in magnitude between successive stimuli The limit on people’s ability to learn to identify into predicted differences in their response labels. more than around seven unidimensional stimuli was For example, suppose a set of 10 lines are used, firmly established early in the history of absolute with each line differing in length by 2 cm from its identification research. However, the remarkable neighbors. If the previous trial used line 4 (which is levels of achievement that are displayed by ex- known, after the feedback is given) and the current perts after extended practice in a broad range of line is 6 cm shorter, then knowledge of the 2 cm domains—e.g., musicians with absolute pitch— spacing in the set can be used to deduce that the makes this limit puzzling. The following quote, current line must be three responses smaller (line taken from Shiffrin and Nosofsky (1994) sums up 1). Brown et al. reasoned that this kind of account this conundrum with capacity limits nicely. must break down in the more general case, when the spacing between adjacent stimuli in the set is As an anecdotal example, Robert M. Nosofsky started uneven. 
In that case, a relative difference of 6 cm his research career around 1980 in the acoustical between the current and previous lines implies a laboratory of David Green and R. Duncan Luce, two different number of response units depending on researchers who happened at the time to be studying which line was shown. absolute identification of loudness ... As a cocky Brown et al. (2009) re-ran the classic experiment young graduate student, Nosofsky “knew” that with a of Lockhead and Hinson (1986), in which absolute bit of practice, he could surely learn to perfectly identification was performed on three loudness identify a set of 12 loudnesses. After locking himself levels. In one condition, the loudness values in one of Green’s sound-proof booths for several were evenly spaced (2 dB apart) but in the other weeks, and hearing tone after tone after tone, his conditions, the gap between one pair of stimuli absolute identification performance remained was three times as large (6 dB). As shown in unchanged. He did succeed, however, in increasing Figure 6.4, a relative judgment account failed substantially his need for . to accommodate data from the unevenly-spaced conditions. Brown et al. interpreted this as a failure Research in the last 10 years, however, suggests of purely relative models. However, Stewart and that Professor Nosofsky was just unlucky to choose

[Figure 6.4 appears here. Top panel: schematic stimulus intensities (73–91 dB) in Low Spread, Even Spread, and High Spread arrangements. Bottom panels: Response Accuracy plotted against Previous Stimulus.]
Fig. 6.4 Top panel: Schematic illustration of the stimuli used in Brown et al.'s (2009) experiment. Bottom panel: Response accuracy (y-axis) for each stimulus (symbols) conditioned on the previous stimulus (x-axis). The upper row of panels shows data, the middle row shows predictions from the SAMBA model, and the lower row of panels shows predictions from the RJM. The three columns correspond to the low-spread, even-spread, and high-spread conditions. Reproduced from Figures 1 and 4, Brown et al. (2009).

the particular stimulus continuum (loudness), and that other stimulus dimensions are more amenable to learning, at least for some people. Rouder, Morey, Cowan, and Pfaltz (2004) showed that at least one of their participants, the redoubtable RM, was capable of far exceeding a limit of seven items in absolute identification of line length. Indeed, all three participants in their experiments were shown to display some learning. However, after extended practice, RM (one of the authors) displayed a remarkable ability to accurately identify up to almost 20 stimuli. The other two participants, despite not performing at quite the same level, both achieved performance equivalent to the perfect identification of around 10–13 items, both above Miller's limit of 9.

Dodds, Donkin, Brown, and Heathcote (2011) replicated the key finding from Rouder et al.'s (2004) study. Dodds et al. then extended the findings in a series of seven experiments. Many more participants were observed to show substantial learning, though none quite reached the high bar

[Figure 6.5 appears here. Two panels plot Information Transmitted (bits), with the equivalent number of correctly identified stimuli, against Practice Session.]

Fig. 6.5 Information transmission as a function of sessions of practice plotted for two experiments from Dodds et al. (2011). The left panel of the figure is a replication of the unusually large amount of learning observed for lines of varying length. In the right panel, however, we see the more standard limited capacity for tones of varying loudness that was more commonly reported in absolute identification experiments. Reproduced from Figures 1 and 5, Dodds et al. (2011).

for absolute identification of line lengths set by RM. Dodds et al. (2011) showed that the learning could be observed for stimulus modalities other than line lengths, including the separation between dots (which removed the confound of brightness with line length), and the degree of angular rotation of a line. As shown in Figure 6.5, auditory stimulus sets, particularly the tone loudnesses that repelled Nosofsky's determined assault, were shown to be much more difficult to learn. Tones of varying frequency presented an interesting case, given the popular belief in the existence of absolute or perfect pitch. Four of the six participants who practiced with tones of varying frequencies showed no substantial improvement. However, one participant showed learning on the order of that displayed by RM with line lengths, and another showed smaller, but still substantial, improvement. By the end of just 10 experimental sessions, the best performer could identify about 16 items and was showing no signs of a slowing in their learning rate, suggesting they could have achieved even better performance with further practice.

Dodds et al.'s (2011) results might be crudely described as showing that absolute identification performance can be improved for all kinds of stimulus sets with extended practice, except for stimulus sets based on tones of varying loudness. This result resolves the apparent contradiction between new and old findings on practice effects. These new results, however, raise a host of new issues. For one, there is the unresolved question of why tone loudness cannot be learned, whereas other stimulus types can; and further, why the ease of learning might differ between other stimulus sets. In further work, Dodds, Rae, and Brown (2012) found that those continua supporting learning may have a psychological representation of magnitude that is more complex than the simple, one-dimensional physical structure of the stimulus set (see also Dodds, Donkin, Brown, & Heathcote, 2010). For example, although lines vary on a single physical dimension (length), they may, after extended practice, develop a higher-dimensional psychological representation. This greater dimensionality provides additional information for the participants to learn, and may allow for continued improvement in performance (Rouder, 2001). For lines of various length, it is not clear what these extra dimensions of information might be. However, for some other stimulus sets, more is known. For example, tones that vary in pitch form a physically one-dimensional set, but this dimension is represented psychologically as a two-dimensional helix: the well-known separation of musical notes into chroma and octaves.

Dodds et al.'s (2011) study also revealed considerable individual differences in the ability to learn. The best predictor of such differences seems to be initial performance. That is, participants who performed well in the first few hundred trials were also those who showed the greatest improvement with subsequent practice. The direction of this correlation can be considered surprising. One might naively expect a negative correlation, since poorer initial performance leaves greater room for improvement. An open question about the nature of learning in absolute identification regards the

reason for the observed correlation. One possibility lies in a link between the psychological representation of stimulus sets and performance. Dodds et al. (2012) observed that extended practice was related to a move away from simple (one-dimensional) psychological representations toward more complex representations. Although there is a chicken-and-egg problem that requires attention here, it suggests that improved absolute identification performance occurs through reorganization of the psychological representation of the stimulus set.

An important unanswered question is how theories of absolute identification should incorporate learning. Accounting for learning is no trivial task, especially since most models were developed to account for exactly the opposite: severe and resistant capacity limitations. Dodds et al. (2011) made inroads to understanding how to account for the effects of practice by looking at what practice does to the previously described benchmark phenomena. Bow effects remain across practice with approximately the same magnitude. Similarly, the stimulus presented on trial N−1 maintains an assimilative effect at the beginning and end of practice. However, the contrast effect from stimuli further back in the trial sequence (N−2, N−3, etc.) is greatly reduced by practice. This reduction in contrast effects with practice can be seen as an adaptive behavior. Contrast effects represent incorrect responses, but such errors can be adaptive in situations where the to-be-identified stimuli drift slowly with time. Such drift did not occur in the experiment, which makes learning to reduce contrast effects adaptive. Incorporating learning into absolute identification models by modulating contrast mechanisms is, therefore, a good start to accounting for practice effects. However, changes in contrast are unlikely to be the sole locus of learning; for example, in the SAMBA model we have found that completely removing the contrast mechanism leads to an improvement in performance that was only about one-third of the size observed in some of Dodds et al.'s participants.

Clearly, practice effects provide a challenge to existing models claiming to aspire to a complete account of absolute identification. The set of empirical results we have summarized provides clues about the locus of the effect of practice, as well as providing concrete targets for model fitting. Further, it seems likely that the increase in performance for some modalities may be, at least partially, driven by the development of higher-order representations of stimuli. Finally, the magnitudes of individual differences in practice effects should be in some way related to initial performance.

Absolute Identification vs. Perfect Pitch

"Perfect pitch," also known as absolute pitch, is the ability to perform almost perfectly on an absolute identification task where the stimuli are a set of musical notes that must be labeled with their chroma and octave (e.g., "C4," also known as "middle C"). Absolute pitch is considered to be rare, with only 1 in 10,000 of the general population reported as having the ability (Bachem, 1955; Takeuchi & Hulse, 1993). Rates are said to rise to 1 in 1,500 among amateur musicians (Profita & Bidder, 1988), and up to 1 in 7 in highly accomplished musicians (Baharloo, Johnston, Service, Gitschier, & Freimer, 1998). However, those rare people who possess absolute pitch are able to accurately identify dozens of different stimuli, well beyond the limits observed in other stimulus sets. This raises questions about the difference between absolute pitch and absolute identification using other stimulus sets. Miller (1956) noted this apparent dichotomy, but was unable to explain it.

An obvious candidate explanation for the difference in performance between absolute identification tasks in general and absolute pitch tasks is the difference in the physical dimensionality of the stimuli. Live or recorded musical notes are sometimes used in absolute pitch tasks, but these stimuli are physically multidimensional, consisting of a fundamental frequency and a series of harmonic overtones. There are also marked variations in timbre, volume, resonance, and decay characteristics across the registers of the musical range (Lockhead & Byrd, 1981; Terhardt & Seewann, 1983). Any of these attributes might be used by observers to out-perform the usual limitations of absolute identification. However, even when the stimuli are truly one-dimensional (computer-generated sine tones), some people can still exhibit absolute pitch performance (Athos et al., 2007). This may represent an endpoint of the kind of learning observed by Dodds et al. (2011) and Dodds et al. (2012), where extended practice altered the psychological representation of pure tones, supporting improved identification. This explanation is further supported by the prevalence of "octave errors" in identification of pitch (where a note is mistaken for a note with the same chromatic name in a different octave).

Another possible explanation for good performance in absolute pitch tasks is that people,

especially musicians, have more exposure to the stimuli generally used in absolute pitch tasks, namely musical notes. Previous thinking considered absolute pitch to be an innate quality possessed by very few people, and a skill unable to be learned (e.g., Revesz, 1953). However, many recent researchers agree there is a critical period, under the age of approximately eight, when children can learn absolute pitch if exposed to enough musical training, with several researchers noting a correlation between absolute pitch ability and musical training during the critical period (e.g., Levitin & Rogers, 2005). However, exposure to musical training during this critical period is not sufficient to develop absolute pitch, as most people who are musically trained during this period do not develop the ability (Baharloo et al., 1998). There are conflicting views about whether learning is possible beyond the critical period. Some researchers maintain absolute pitch cannot be learned to any level of fluency after the critical period (e.g., Zatorre, 2003), whereas others argue that with enough practice, learning is possible (e.g., Lundin, 1963). Dodds et al.'s (2011) results support the latter view.

Evidence that absolute pitch is a learned ability, rather than innate, comes from the nonuniform performance across the range of notes. People with absolute pitch most accurately identify those notes to which they have had more exposure. Middle C and other white-key notes on the piano are often most accurately identified (Lundin, 1963; Takeuchi & Hulse, 1993), perhaps because these are the first notes learned in standard piano training (Athos et al., 2007; Miyazaki, 2004). Similarly, violin players are able to more accurately identify the open A string, as they are very familiar with this tone (Brammer, 1951), and in general musicians more accurately identify pitch on their own instrument (Takeuchi & Hulse, 1993).

Prior work on absolute pitch has often been confounded by the presence of feedback, and by the skill known as "relative pitch." Relative pitch is the ability to identify differences between notes, rather than individual notes in isolation (Miyazaki, 1995). The difference between two notes is known as a "musical interval," and musicians are trained to identify these intervals accurately. Relative pitch plays an important role in musicianship, whereas absolute pitch does not. Confounds arise when absolute pitch is tested in the usual way, by a succession of notes with feedback on the correct answer provided after each. Participants with good relative pitch skills can use the judged intervals, combined with knowledge of the previous note from feedback, to perform accurately, even without possessing absolute pitch skills. Some experimenters have attempted to separate the contribution of relative and absolute pitch by using interference tasks between trials (e.g., Zatorre & Beckett, 1989), or separating successive tones by long time intervals (Baharloo et al., 1998). A simpler technique to control for relative pitch is to remove feedback information, which makes knowledge of the intervals unhelpful. However, feedback must be given at some points, both for motivation and to stop poorly performing participants from becoming wildly inaccurate. A compromise involves alternate blocks with feedback and without feedback (Ward & Lockhead, 1970).

Speakers of tonal languages (e.g., Mandarin, Cantonese, Thai) have been reported to have an advantage in pitch perception due to their use of pitch in conveying the meaning of words (e.g., Deutsch, Henthorn, Marvin, & Xu, 2005). This hypothesis incorporates the ideas that extended practice with identifying pitch (when speaking a tonal language) and exposure to this task during childhood (when learning language) might both support the development of absolute pitch. Various studies have provided indirect support for the assumed link between tonal language and absolute pitch, such as Pfordresher and Brown (2009), who found that native tonal language speakers were better able to imitate and perceptually discriminate musical pitch. Deutsch et al. (2009) investigated the pitch ranges of female speech in two relatively isolated villages in China, and found that the pitch range of speech is heavily influenced by an absolute pitch template that is developed through long-term exposure. However, other studies have reported that tonal language speakers have no advantage in absolute pitch (e.g., Bent, Bradlow, & Wright, 2006; Schellenberg & Trehub, 2007).

In a recent experiment from our lab (not yet published), 60 native Chinese language speakers identified pure sine tones in a standard absolute identification paradigm. All 60 participants were university students at Sichuan University, Chengdu, China, 30 of whom were enrolled in the Music Faculty, and 30 of whom were enrolled in the Faculty of Administration. Although the music students outperformed the administration students when feedback was provided (p ≤ 0.01), there was no difference between the two groups when no feedback was given (p = 0.57). Further, both groups' mean performances for feedback and nonfeedback blocks were well below Miller's

134 basic cognitive skills (1956) upper limit of 9 stimuli, suggesting that stimuli that lay on the bounds of the gap between native Chinese [tonal] language speakers do not the two subsets of stimuli; they never confused have an advantage in identifying unidimensional stimuli 5 and 6. Despite this benefit in accuracy, tones. however, responses to the stimuli on the edge of the center gap were just as slow as when standard Response Times equally spaced stimuli are used. Response times used to be rarely collected in This, and similar, results are important because absolute identification experiments, and they have they cannot be accounted for by any model that not been nearly as often subjected to the detailed simply relies on a strict inverse relationship between analysis given to response choices. Previous research accuracy and response time. More generally, such has identified several effects in mean response times a dissociation between accuracy and response time that are analogous to effects in choice. These runs counter to the claim that response time is not include the effects already outlined: bow effects, in worthy of investigation in absolute identification. which responses to extreme stimuli are faster than The result also provides a strong challenge to those to middle stimuli Lacouture, 1997; Lacouture models of absolute identification. Donkin et al. & Marley, 1995, 2004), and capacity limitations, (2009) showed that the SAMBA model provides where RTs slow down as the number of stimuli a natural account of the dissociation between increases (Kent & Lamberts, 2005) (see Figure 6.2 accuracy and speed due to stimulus spacing (see for an example of these). Sequential effects on Figure6.6).Thecredit,however,mustgoto mean response times have also been observed due the mapping model from Lacouture and Marley to response repetition (Petrov & Anderson, 2005) (1995), because the dissociation arises out of the and assimilation (Lacouture, 1997). 
mapping stage of SAMBA. Recall that the mapping It could be argued that response times can be model takes a single magnitude estimate, z,and safely ignored because they fail to provide additional transforms it into a set of K response strengths, Ri, constraint on theories of absolute identification i = 1..K . This transformation is accomplished via = − − 2 + beyond the constraints provided by response choice the formula Ri (2Yi 1)z Yi 1, where Yi data. This argument rests on the assumption that measures the average magnitude estimate for stimuli effects on response time are always just the inverse classified with response i. The transformation is of effects on response choices (e.g., increased ac- parameter-free, and depends only on the relative curacy is always associated with decreased response position of each of the K stimuli in the stimulus time). Given this assumption, any theory that space (as recorded by Yn). When these Ynsrepresent can successfully account for the effect of some stimuli with a large, central gap, the mapping manipulation on accuracy will need only invert produces response strengths that yield high accuracy those predictions to also account for response times. for stimuli adjacent to the gap, but response times Such inversion can be accomplished by feeding the that remain slow. output of the model into evidence accumulators What made the fits of the SAMBA model to (like those used in the SAMBA model of Brown these data particularly compelling was that they et al., 2008). Large outputs drive the accumulators were essentially parameter free. The spacing experi- more quickly, resulting in the required faster and ments from Lacouture (1997) were part of a larger more accurate responses. experiment with other manipulations, including However, Donkin, Brown, Heathcote, and a control condition. 
In this baseline condition, Marley (2009) showed that an inverse relationship a standard set of 10 equally spaced stimuli were between accuracy and response time is not always used, and Brown et al. (2008) had already shown observed. In a re-analysis of data collected by that SAMBA could fit these control data. Donkin Lacouture (1997), manipulations of the distance et al. (2009) fixed SAMBA’s parameters to those between stimuli sometimes had large effects on estimated in the controlled condition, and then choice probability but no effects on response time. allowed only the representation of the stimulus For example, in one condition, participants were spacings (the Yis) to change across conditions, with presented with a stimulus set composed of five the further constraint that those Yis veridically smaller lines (1–5) and five longer lines (6–10), represented the stimulus spacings used in the with a much larger gap between stimuli 5 and experiment. With this setup, SAMBA gave accurate 6 than between other adjacent stimuli. As might quantitative predictions for the data from all the be expected, participants were more accurate with spacing conditions.
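The mapping just described is easy to state numerically. In the sketch below, the anchor values Y_i are placed on a unit interval purely for illustration; the formula itself is the parameter-free transformation given above, and the property demonstrated is that the strongest response belongs to the anchor closest to the magnitude estimate z.

```python
import numpy as np

def response_strengths(z, Y):
    """Map one magnitude estimate z onto K response strengths via
    R_i = (2*Y_i - 1)*z - Y_i**2 + 1 (the Lacouture & Marley, 1995,
    mapping), where Y_i is the average magnitude estimate for stimuli
    classified with response i. Since R_i - R_j = (Y_i - Y_j)(2z - Y_i - Y_j),
    the largest strength always belongs to the anchor Y_i closest to z."""
    Y = np.asarray(Y, dtype=float)
    return (2.0 * Y - 1.0) * z - Y ** 2 + 1.0

# Ten equally spaced anchors on [0, 1] (illustrative values only)
Y = np.linspace(0.05, 0.95, 10)
print(response_strengths(0.65, Y).argmax())  # → 6, the anchor at 0.65
```

With unequally spaced anchors (e.g., a large central gap), the same code shows how stimuli adjacent to the gap receive a clear winning strength, which is the source of the accuracy benefit discussed above.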

why is accurately labeling simple magnitudes so hard? 135

[Figure 6.6 appears here: four panels, one per spacing condition (C−L, C−S, E−L, E−S), plotting accuracy (0.3–1.0) and mean RT in seconds (1.0–1.8) against response (1–10).]
Fig. 6.6 Dashes above the top panels provide a schematic representation of the stimuli used in Lacouture's (1997) experiment. The second row shows response accuracy, and the third row shows mean RT for correct responses, both as functions of response. Data are shown as points with standard error bars, joined by solid lines; SAMBA fits are shown with dotted lines. Arrows between panels show where large gaps separate adjacent stimuli; these gaps lead to different effects on response times than on accuracy. Reproduced from Figure 2 of Donkin, C., Brown, S. D., Heathcote, A., & Marley, A. A. J. (2009). Dissociating speed and accuracy in absolute identification: The effect of unequal stimulus spacing. Psychological Research, 73, 308–316. With kind permission from Springer Science and Business Media.

Intertrial Interval and Sequential Effects
Different models of absolute identification make different predictions about the influence of manipulating the time between decisions. As noted by Matthews and Stewart (2009), there have been three basic accounts for sequential effects: criterion shifts, memory confusions, and selective attention. These three explanations correspond (approximately) to the Thurstonian, relative judgment, and restricted capacity model classes outlined earlier. The Thurstonian models assume that sequential effects arise out of shifts in criteria: a fast-moving tracking process that produces assimilation, and a relatively slow-moving stabilizing mechanism that produces contrast. Both of these shifting processes decay in strength over time. If the time between trials is increased, then the criteria will have more time to return to their original positions, resulting in a reduction in both assimilation and contrast.

The relative judgment models also predict a reduction in sequential effects with increased intertrial interval due to decay, but the decay is located in a different process. Sequential effects in relative-judgment models are caused by interference from memories of the previous stimuli. Assuming that memory decays with time, more time between stimulus presentations would reduce the confusion coming from memory.

In agreement with the other classes of models, the selective-attention-based SAMBA model also predicts that assimilation will decrease with intertrial interval, but it disagrees in predicting that contrast will actually increase. The reduction in assimilation again comes about because of decay, in this case in the activity of response accumulators in the decision stage of the model. More time will allow the starting activity in accumulators to return to baseline levels, thus reducing assimilation to the previous response. The increase in contrast is a by-product of SAMBA's existing explanation for how contrast occurs in general: the re-allocation of attention to the representation of the previous stimulus. Recall that the stimulus context is maintained via a rehearsal process in which attention is randomly allocated across the range of stimuli used in the current experiment. Contrast occurs because the location in which the previous stimulus was presented receives increased attention. So, although rehearsal activity decays across the rest of the stimulus space, the previous stimulus location is given preferential rehearsal. If this selective attention is allowed to continue for a longer time, due to an increased intertrial interval, the contrast effect will increase in magnitude.

Matthews and Stewart (2009) manipulated the time between trials to be either 0 or 5 seconds in an otherwise standard absolute identification experiment. They found that, relative to the small intertrial interval condition, assimilation was reduced and contrast increased when the time

between trials was longer. This pattern of results appears to align with the predictions from SAMBA and to contradict those from Thurstonian and relative judgment models. However, a more sophisticated analysis of the data, which used regression equations to separate out the influence of previous stimuli and previous responses, revealed a more challenging pattern of results for theories of absolute identification.

Box 1 Surprising limitations
Perhaps the most surprising aspect of absolute identification is how difficult it is. The task itself sounds almost trivially simple: remember the labels that correspond to, say, 10 stimuli. Despite weeks of training, perfect performance remains out of reach; remarkable, when the same person might be able to understand quantum mechanics, identify thousands of species of birds, and almost certainly have a lexicon of many tens of thousands of words. The key difference between absolute identification and the many tasks that paradoxically are much easier is that the stimuli being identified vary on only a single dimension. Ten lines that vary only in their length are very difficult to discriminate, even when the spacing between the lines is large enough that any pair of them can be told apart trivially easily. Of course, if these lines join together to produce shapes that differ on multiple dimensions, accurate identification becomes much easier. As such, the ability to store and retrieve information in long-term memory, as well as our ability to categorize our environment into such a rich structure, must depend heavily on the multidimensional nature of our world. Absolute identification is an important paradigm because it represents one of the simplest and clearest cases in which the capacity limits on human information processing can be studied.

Box 2 Sequential effects are everywhere
Previously encountered stimuli, and the responses they elicit, have a strong influence on the decision made on any given trial in an absolute identification task. Responses on trial N appear to be more like responses on the immediately preceding trial, and less like stimuli presented on trials further back in the sequence. Although perhaps most comprehensively studied within the field of absolute identification, sequential effects also appear in many other decision-making tasks. For example, perceptual contrast has been observed in categorization tasks (Jones, Love, & Maddox, 2006; Zotov et al., 2011), assimilation has been observed in recognition memory tasks (although it is argued to be of a different nature than that in absolute identification; Malmberg & Annis, 2012), and the sequence of trials has a systematic influence on performance in simple two-alternative choice tasks (e.g., Gilden, 2001; Jones, Curran, Mozer, & Wilder, 2013). Absolute identification provides an important window into understanding sequential effects, as the task itself is so simple. There exist a number of potential models for how previous stimuli and responses influence behavior within the realm of absolute identification. In contrast, most models of categorization, recognition memory, and even simple decision-making fail to account for sequential effects, despite their well-documented existence.

Responses from participants were fit with a regression equation in which both stimuli and responses were included as predictors:

R_N = r_0 + α_0 S_N + α_1 S_{N−1} + β_1 R_{N−1} + α_2 S_{N−2} + β_2 R_{N−2} + e_N,

where R_N is the response made on the Nth trial, S_N is the stimulus presented on the Nth trial, the α and β are regression coefficients, and e_N is a normally distributed error term. Based on this equation, Matthews and Stewart (2009) found that assimilation to responses does not decrease as intertrial interval increases (i.e., β_1 remains positive and of approximately the same magnitude in both short and long intertrial interval conditions), but rather that contrast to the previous stimuli increases as intertrial interval increases (i.e., both α_1 and α_2 become more negative).

The observed increase in contrast with intertrial interval is predicted by the selective attention explanation for contrast, but is inconsistent with the Thurstonian and relative judgment accounts of sequential effects. Recall, however, that SAMBA predicted a decrease in assimilation to the previous response with intertrial interval. Though the raw data appear to agree with SAMBA, the regression coefficients reported by Matthews and Stewart (2009) suggest that the observed lack of an assimilation effect in the long intertrial interval condition

is due to an increase in contrast to the previous stimulus, rather than to a reduction in assimilation to the previous response. This suggests either that the mechanism by which SAMBA produces assimilation is inappropriate and needs adjustment, or that the decay process is extremely slow.

Conclusion
Absolute identification is an apparently simple task: take a set of lines that vary only in length, or tones that vary only in loudness; label them in a straightforward way; and ask an observer to recall the correct labels when shown stimuli one at a time. Despite this apparent simplicity, there is a wealth of complicated and robust patterns that have been identified in absolute identification data over the past 60 years.

Early theoretical accounts of absolute identification were inspired by some of the most fundamental data patterns; for example, the relative judgment models were inspired by the ubiquitous sequential effects, and the limited capacity models by the observed performance limits. Since they were inspired by just one of the many fundamental phenomena, these theories all had problems in accounting for the full range of data. More recent theories are more complex, and share structure with more than one type of earlier theory. These new theories aim to provide a comprehensive account of the full range of absolute identification phenomena (Petrov & Anderson, 2005; Stewart et al., 2005; Brown et al., 2008).

This leaves some clear challenges for future empirical and theoretical research in absolute identification:

• What exactly constitutes a "purely relative" and "purely absolute" theoretical account? Can current models be modified to perfectly fit either category? If so, can a pure model explain the full range of data?
• What mechanism underpins the newly observed learning effects in absolute identification? Are tones of varying loudness the only stimulus set that is impossible to learn?
• To what extent are the sequential effects in absolute identification malleable? They are apparently altered substantially by practice, and by the timing of the experimental procedure. These findings may provide new constraints on existing theories.

Further into the future, a challenge for absolute identification research will be to build links with other paradigms. Some paradigms clearly are closely related to absolute identification, such as magnitude estimation, magnitude production, and categorization. It should be relatively simple for empirical work to explore the links between these fields, but a more difficult challenge will be to integrate and unify theoretical accounts in the different fields, where possible.

Note
1. Although the terms loudness and intensity, and the terms pitch and frequency, have different meanings, we use them almost interchangeably for reasons of clarity.

Glossary
Absolute Models: These theories assume that identification decisions rely entirely on long-lasting, internal representations. Simple forms of exemplar models, Thurstonian models, and restricted capacity models fall into this category of model.
ACT-R: A production-system based cognitive architecture explanation of human cognitive and perceptual systems that has been mapped to brain areas.
Assimilation: Responses are more likely to be closer to the response (and stimulus) made on the previous trial.
Bow Effect: Improved performance for stimuli toward the edge of the stimulus range, so called because performance plotted as a function of stimulus position has a bow-shaped curve.
Capacity: Processing resources used to identify stimuli. Performance in absolute identification tasks suggests that humans are severely limited in their capacity when identifying unidimensional stimuli.
Contrast: Responses are more likely to be further away from the stimulus (and response) from 2–5 trials previous.
d′: A signal-detection-theory based measure of how well two stimuli can be discriminated.
Stimulus Dimensionality: The number of physical continua on which a set of stimuli vary. Absolute identification research focuses on the classification of stimuli varying on just one physical dimension that is assumed to be mapped onto a single psychological dimension.
Exemplar Model: Theory assuming that stimuli are represented and stored as individual memory traces (i.e., exemplars). The similarity between the stimulus presented on a given trial and those stored in memory is critical for decision-making.
Evidence Accumulators: Hypothetical processing units used to account for choice and response time in decision-making tasks. Each potential response is given its own accumulator, wherein evidence is collected and accrues. The first accumulator to reach a threshold level of evidence gives the chosen response, and the time taken to reach threshold gives decision time.
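The race process described in the Evidence Accumulators entry can be sketched in a few lines. This is a generic illustration only — the drift rates, threshold, time step, and noise level below are made-up values, not the decision stage of any particular published model:

```python
import random

def accumulator_race(rates, threshold=1.0, dt=0.001, noise=0.05, seed=1):
    """One accumulator per candidate response; evidence accrues at each
    accumulator's rate plus Gaussian noise. The first accumulator to reach
    threshold gives the chosen response; the elapsed time is the decision time."""
    rng = random.Random(seed)
    acc = [0.0] * len(rates)
    t = 0.0
    while max(acc) < threshold:
        for i, rate in enumerate(rates):
            acc[i] += rate * dt + noise * (dt ** 0.5) * rng.gauss(0, 1)
        t += dt
    return acc.index(max(acc)), t

# Three candidate responses; the middle one has the largest drift rate,
# so it usually wins, and larger rates produce faster decisions.
choice, rt = accumulator_race([0.5, 2.0, 0.8])
print(choice, round(rt, 3))
```

Feeding larger model outputs (e.g., response strengths) in as rates yields both faster and more accurate responses, which is exactly the inverse accuracy–RT coupling discussed in the Response Times section.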

Glossary
Modality: The perceptual system to which a stimulus is presented. For example, tones varying in frequency are from the auditory modality.
Relative Models: These theories assume that the identification of unidimensional stimuli is based on the judgment of the difference between the current stimulus and temporary representations of recently presented stimuli (and/or the corresponding responses).
Sequential Effects: The influence of previous stimuli and responses on behavior on the current trial.

References
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Erlbaum.
Athos, E., Levinson, B., Kistler, A., Zemansky, J., Bostrom, A., Frimer, N., & Gitschier, J. (2007). Dichotomy and perceptual distortions in absolute pitch ability. Proceedings of the National Academy of Sciences, 104(37), 14795–14800.
Bachem, A. (1955). Absolute pitch. Journal of the Acoustical Society of America, 27, 1180–1185.
Baharloo, S., Johnston, P., Service, S., Gitschier, J., & Freimer, N. (1998). Absolute pitch: An approach for identification of genetic and nongenetic components. American Journal of Human Genetics, 62, 224–231.
Bent, T., Bradlow, A., & Wright, B. (2006). The influence of linguistic experience on the cognitive processing of pitch in speech and non-speech sounds. Journal of Experimental Psychology: Human Perception and Performance, 32, 97–103.
Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483–502.
Braida, L. D., Lim, J. S., Berliner, J. E., Durlach, N. I., Rabinowitz, W. M., & Purks, S. R. (1984). Intensity perception: XIII. Perceptual anchor model of context-coding. Journal of the Acoustical Society of America, 76, 722–731.
Brammer, L. (1951). Sensory cues in pitch judgment. Journal of Experimental Psychology, 41(5), 336–340.
Brown, S. D., & Heathcote, A. (2005). A ballistic model of choice response time. Psychological Review, 112, 117–128.
Brown, S. D., Marley, A., Dodds, P., & Heathcote, A. (2009). Purely relative models cannot provide a general account of absolute identification. Psychonomic Bulletin & Review, 16, 583–593.
Brown, S. D., Marley, A., Donkin, C., & Heathcote, A. J. (2008). An integrated model of choices and response times in absolute identification. Psychological Review, 115(2), 396–425.
De Carlo, L. (1992). Intertrial interval and sequential effects in magnitude scaling. Journal of Experimental Psychology: Human Perception and Performance, 18, 1080–1088.
De Carlo, L., & Cross, D. (1990). Sequential effects in magnitude scaling: Models and theory. Journal of Experimental Psychology: General, 119, 375–396.
Deutsch, D., Henthorn, T., Marvin, E., & Xu, H. (2005). Absolute pitch among American and Chinese conservatory students: Prevalence differences, and evidence for a speech-related critical period. Journal of the Acoustical Society of America, 119(2), 719–722.
Deutsch, D., Le, J., Dooley, K., Henthorn, T., Shen, J., & Head, B. (2009). Absolute pitch and tone language: Two new studies. Proceedings of the 7th Triennial Conference of the European Society for the Cognitive Sciences of Music, Finland, 2009, 69–73.
Dodds, P., Donkin, C., Brown, S. D., & Heathcote, A. (2010). Multidimensional scaling methods for absolute identification data. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society. Portland, OR: Cognitive Science Society.
Dodds, P., Donkin, C., Brown, S. D., & Heathcote, A. (2011). Increasing capacity: Practice effects in absolute identification. Journal of Experimental Psychology: Learning, Memory & Cognition, 37, 477–492.
Dodds, P., Rae, B., & Brown, S. D. (2012). Perhaps unidimensional is not unidimensional. Cognitive Science, 36(8), 1542–1555.
Donkin, C., Brown, S. D., Heathcote, A., & Marley, A. A. J. (2009). Dissociating speed and accuracy in absolute identification: The effect of unequal stimulus spacing. Psychological Research, 73, 308–316.
Donkin, C., Chan, V., & Tran, S. (2014). The effect of blocking inter-trial interval on sequential effects in absolute identification. Quarterly Journal of Experimental Psychology.
Durlach, N. I., & Braida, L. D. (1969). Intensity perception: I. Preliminary theory of intensity resolution. Journal of the Acoustical Society of America, 46, 372–383.
Ellis, A. J. (1876). On the sensitiveness of the ear to pitch and change of pitch in music. Journal of the Royal Musical Association, 3, 1–32.
Garner, W. R. (1953). An informational analysis of absolute judgments of loudness. Journal of Experimental Psychology, 46, 373–380.
Gilden, D. L. (2001). Cognitive emissions of 1/f noise. Psychological Review, 108, 33–56.
Gilden, D. L., Thornton, T., & Mallon, M. W. (1995). 1/f noise in human cognition. Science, 267, 1837–1839.
Hartman, E. (1954). The influence of practice and pitch-distance between tones on the absolute identification of pitch. American Journal of Psychology, 67, 1–14.
Holland, M. K., & Lockhead, G. R. (1968). Sequential effects in absolute judgments of loudness. Perception & Psychophysics, 3, 409–414.
Jones, M., Curran, T., Mozer, M. C., & Wilder, M. H. (2013). Sequential effects in response time reveal learning mechanisms and event representations. Psychological Review, 120, 628–666.
Jones, M., Love, B. C., & Maddox, W. T. (2006). Recency effects as a window to generalization: Separating decisional and perceptual sequential effects in category learning. Journal of Experimental Psychology: Learning, Memory & Cognition, 32, 316–332.
Kent, C., & Lamberts, K. (2005). An exemplar account of the bow and set size effects in absolute identification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 289–305.
Lacouture, Y. (1997). Bow, range, and sequential effects in absolute identification: A response-time analysis. Psychological Research, 60, 121–133.
Lacouture, Y., & Marley, A. (1995). A mapping model of bow effects in absolute identification. Journal of Mathematical Psychology, 39, 383–395.
Lacouture, Y., & Marley, A. A. J. (2004). Choice and response time processes in the identification and categorization of unidimensional stimuli. Perception & Psychophysics, 66, 1206–1226.
Lamberts, K. (2000). Information-accumulation theory of speeded classification. Psychological Review, 107, 226–260.
Laming, D. (1984). The relativity of "absolute" judgements. British Journal of Mathematical and Statistical Psychology, 37, 152–183.
Levitin, D., & Rogers, S. (2005). Absolute pitch: Perception, coding, and controversies. Trends in Cognitive Sciences, 9, 26–33.
Lockhead, G. R., & Byrd, R. (1981). Practically perfect pitch. Journal of the Acoustical Society of America, 70(2), 387–389.
Lockhead, G. R., & Hinson, J. M. (1986). Range and sequence effects in judgment. Perception & Psychophysics, 40, 53–61.
Luce, R. D., Green, D. M., & Weber, D. L. (1976). Attention bands in absolute identification. Perception & Psychophysics, 20, 49–54.
Luce, R. D., Nosofsky, R. M., Green, D. M., & Smith, A. F. (1982). The bow and sequential effects in absolute identification. Perception & Psychophysics, 32, 397–408.
Lundin, R. (1963). Can perfect pitch be learned? Music Educators' Journal, 49(5), 49–51.
Malmberg, K., & Annis, J. (2012). On the relationship between memory and perception: Sequential dependencies in recognition memory testing. Journal of Experimental Psychology: General, 141(2), 233–259.
Marley, A. A. J., & Cook, V. T. (1984). A fixed rehearsal capacity interpretation of limits on absolute identification performance. British Journal of Mathematical and Statistical Psychology, 37, 136–151.
Marley, A. A. J., & Cook, V. T. (1986). A limited capacity rehearsal model for psychological judgments applied to magnitude estimation. Journal of Mathematical Psychology, 30, 339–390.
Matthews, W., & Stewart, N. (2009). The effect of inter-stimulus interval on sequential effects in absolute identification. Quarterly Journal of Experimental Psychology, 62, 2014–2029.
Miller, G. (1956). The magical number seven, plus or minus two: Some limits on our capacity for information processing. Psychological Review, 63, 81–97.
Miyazaki, K. (1995). Perception of relative pitch with different references: Some absolute-pitch listeners can't tell musical interval names. Perception & Psychophysics, 57(7), 962–970.
Miyazaki, K. (2004). How well do we understand absolute pitch? Acoustical Science and Technology, 25(6), 426–432.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39–57.
Nosofsky, R. M. (1997). An exemplar-based random-walk model of speeded categorization and absolute judgment. In A. A. J. Marley (Ed.), Choice, decision and measurement (pp. 347–365). Hillsdale, NJ: Erlbaum.
Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological Review, 104, 266–300.
Petrov, A., & Anderson, J. R. (2005). The dynamics of scaling: A memory-based anchor model of category rating and absolute identification. Psychological Review, 112, 383–416.
Pfordresher, P., & Brown, S. (2009). Enhanced production and perception of musical pitch in tone language speakers. Attention, Perception, & Psychophysics, 71, 1385–1398.
Pollack, I. (1952). The information of elementary auditory displays. Journal of the Acoustical Society of America, 24, 745–749.
Pollack, I. (1953). The information of elementary auditory displays: II. Journal of the Acoustical Society of America, 25, 765–769.
Profita, J., & Bidder, T. (1988). Perfect pitch. American Journal of Medical Genetics, 29, 763–771.
Revesz, G. (1953). Introduction to the psychology of music. London: Longmans Green.
Rouder, J. N. (2001). Absolute identification with simple and complex stimuli. Psychological Science, 12, 318–322.
Rouder, J. N., Morey, R. D., Cowan, N., & Pfaltz, M. (2004). Learning in a unidimensional absolute identification task. Psychonomic Bulletin & Review, 11, 938–944.
Schellenberg, E., & Trehub, S. (2007). Is there an Asian advantage for pitch memory? Music Perception, 25, 241–252.
Shiffrin, R. M., & Nosofsky, R. M. (1994). Seven plus or minus two: A commentary on capacity limitations. Psychological Review, 101, 357–361.
Stewart, N., Brown, G. D. A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112, 881–911.
Stewart, N., & Matthews, W. (2009). Relative judgment and knowledge of the category structure. Psychonomic Bulletin & Review, 16, 594–599.
Takeuchi, A., & Hulse, S. (1993). Absolute pitch. Psychological Bulletin, 113(2), 345–361.
Terhardt, E., & Seewann, M. (1983). Aural key identification and its relationship to absolute pitch. Music Perception, 1, 63–83.
Treisman, M. (1985). The magical number seven and some other features of category scaling: Properties for a model of absolute judgment. Journal of Mathematical Psychology, 29, 175–230.
Treisman, M., & Williams, T. C. (1984). A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91, 68–111.
Usher, M., & McClelland, J. L. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.
Van Orden, G. C., Holden, J. G., & Turvey, M. T. (2003). Self-organization of cognitive performance. Journal of Experimental Psychology: General, 132, 331–350.
Wagenmakers, E.-J., Farrell, S., & Ratcliff, R. (2004). Estimation and interpretation of 1/f^α noise in human cognition. Psychonomic Bulletin & Review, 11, 579–615.
Wagenmakers, E.-J., Farrell, S., & Ratcliff, R. (2005). Human cognition and a pile of sand: A discussion on serial correlations and self-organized criticality. Journal of Experimental Psychology: General, 134, 108–116.
Ward, L. M., & Lockhead, G. R. (1970). Sequential effects and memory in category judgment. Journal of Experimental Psychology, 84, 27–34.
Ward, L. M., & Lockhead, G. R. (1971). Response system processes in absolute judgment. Perception & Psychophysics, 9, 73–78.
Weber, D. L., Green, D. M., & Luce, R. D. (1977). Effect of practice and distribution of auditory signals on absolute identification. Perception & Psychophysics, 22, 223–231.
Zatorre, R. (2003). Absolute pitch: A model for understanding the influence of genes and development on neural and cognitive function. Nature Neuroscience, 6(7), 692–695.
Zatorre, R., & Beckett, C. (1989). Multiple coding strategies in the retention of musical tones by possessors of absolute pitch. Memory & Cognition, 17(5), 582–589.
Zotov, V., Jones, M., & Mewhort, D. J. K. (2011). Contrast and assimilation in categorization and exemplar production. Attention, Perception, & Psychophysics, 73, 621–639.

CHAPTER 7
An Exemplar-Based Random-Walk Model of Categorization and Recognition

Robert M. Nosofsky and Thomas J. Palmeri

Abstract In this chapter, we provide a review of a process-oriented mathematical model of categorization known as the exemplar-based random-walk (EBRW) model (Nosofsky & Palmeri, 1997a). The EBRW model is a member of the class of exemplar models. According to such models, people represent categories by storing individual exemplars of the categories in memory, and classify objects on the basis of their similarity to the stored exemplars. The EBRW model combines ideas ranging from the fields of choice and similarity, to the development of automaticity, to response-time models of evidence accumulation and decision-making. This integrated model explains relations between categorization and other fundamental cognitive processes, including individual-object identification, the development of expertise in tasks of skilled performance, and old-new recognition memory. Furthermore, it provides an account of how categorization and recognition decision-making unfold through time. We also provide comparisons with some other process models of categorization. Key Words: categorization, recognition, exemplar model, response times, automaticity, random walk, memory search, expertise, similarity, practice effects

Introduction
A fundamental issue in cognitive psychology and cognitive science concerns the manner in which people represent categories and make classification decisions (Estes, 1994; Smith & Medin, 1981). A wide variety of process-oriented mathematical models of categorization have been proposed in the field. For example, according to prototype models (e.g., Posner & Keele, 1968; Smith & Minda, 1998), people represent categories by storing a summary representation, usually presumed to be the central tendency of the category distribution. Classification decisions are based on the similarity of a test item to the prototypes of alternative categories. According to decision-boundary models (e.g., Ashby & Maddox, 1993; McKinley & Nosofsky, 1995), people construct boundaries, usually assumed to be linear or quadratic in form, to divide a stimulus space into category response regions. If an object is perceived to lie in Region A of the space, then the observer emits a Category-A response. According to rule-plus-exception models (e.g., Davis, Love, & Preston, 2012; Erickson & Kruschke, 1998; Nosofsky, Palmeri, & McKinley, 1994), people construct low-dimensional logical rules for summarizing categories, and they remember occasional exceptions that may be needed to patch those rules.

In this chapter, however, our central focus is on exemplar models of classification. According to exemplar models, people represent categories by storing individual exemplars in memory, and classify objects on the basis of their similarity to the stored exemplars (Hintzman, 1986; Medin & Schaffer, 1978; Nosofsky, 1986). For instance, such models would assume that people represent the category of "birds" by storing in memory the vast collection of different robins, sparrows, eagles (and so forth) that they have experienced during their lifetimes. If a novel object were similar to some of these bird exemplars, then a person would tend to classify it as a bird.

Although alternative classification strategies are likely to operate across different experimental contexts, there are several reasons why we chose to focus on exemplar models in this chapter. One reason is that models from that class have provided a successful route to explaining relations between categorization and a wide variety of other fundamental cognitive processes, including individual-object identification (Nosofsky, 1986, 1987), the development of automaticity in tasks of skilled performance (Logan, 1988; Palmeri, 1997), and old-new recognition memory (Estes, 1994; Hintzman, 1988; Nosofsky, 1988, 1991). A second reason is that, in our view, for most "natural" category structures (Rosch, 1978) that cannot be described in terms of simple logical rules, exemplar models seem to provide the best-developed account of how categorization decision making unfolds over time. Thus, beyond predicting classification choice probabilities, exemplar models provide detailed quantitative accounts of classification response times (Nosofsky & Palmeri, 1997a). We now briefly expand on these themes before turning to the main body of the chapter.

One of the central goals of exemplar models has been to explain relations between categorization and other fundamental cognitive processes, including old-new recognition memory (Estes, 1994; Hintzman, 1988; Nosofsky, 1988, 1991; Nosofsky & Zaki, 1998). Whereas in categorization people organize distinct objects into groups, in recognition the goal is to determine whether some individual object is "old" (previously studied) or "new." Presumably, when people make recognition judgments, they evaluate the similarity of test objects to the individual previously studied items (i.e., exemplars). If categorization decisions are also based on similarity comparisons to previously stored exemplars, then there should be close relations between the processes of recognition and categorization.

A well-known model that formalizes these ideas is the generalized context model (GCM; Nosofsky, 1984, 1986, 1991). In the GCM, individual exemplars are represented as points in a multidimensional psychological space, and similarity between exemplars is a decreasing function of the distance between objects in the space (Shepard, 1987). The model presumes that both classification and recognition decisions are based on the "summed similarity" of a test object to the exemplars in the space. By conducting similarity-scaling studies, one can derive multidimensional scaling (MDS) solutions in which the locations of the exemplars in the similarity space are precisely located (Nosofsky, 1992). By using the GCM in combination with these MDS solutions, one can then achieve successful fine-grained predictions of classification and recognition choice probabilities for individual items (Nosofsky, 1986, 1987, 1991; Shin & Nosofsky, 1992).

A significant development in the application of the GCM has involved extensions of the model to explain how the categorization process unfolds through time. So, for example, the exemplar model not only predicts choice probabilities, but also predicts categorization and recognition response times (RTs). This development is important because RT data often provide insights into cognitive processes that would not be evident from examination of choice-probability data alone. Nosofsky and Palmeri's (1997a,b) exemplar-based random-walk (EBRW) model adopts the same fundamental representational assumptions as does the GCM. However, it extends that earlier model by assuming that retrieved exemplars drive a random-walk process (e.g., Busemeyer, 1985; Link, 1992; Ratcliff, 1978). This exemplar-based random-walk model allows one to predict the time course of categorization and recognition decision making.

In this chapter, we provide a review of the EBRW model and illustrate its applications to varieties of categorization and recognition choice-probability and RT data. In the section The Formal EBRW Model of Categorization RTs, we provide a statement of the formal model as applied to categorization. As will be seen, the EBRW model combines ideas ranging from the fields of choice and similarity, to the development of automaticity, to RT models of evidence accumulation and decision making. In the section Effects of Similarity and Practice on Speeded Classification, we describe applications of the model to speeded perceptual classification and illustrate how it captures fundamental phenomena including effects of similarity and practice. In the section Automaticity and Perceptual Expertise, we describe how the EBRW accounts for the development of automatic categorization and perceptual expertise. In the section Using Probabilistic Feedback Manipulations to Contrast the Predictions from the EBRW Model and Alternative Models, we describe experimental manipulations that have been used to try to

distinguish between the predictions of the EBRW model and some other major models of speeded categorization, including decision-boundary and prototype models. Finally, in the section Extending the EBRW Model to Predict Old-New Recognition RTs, we describe recent developments in which the EBRW model has been used to account for old-new recognition RTs. We then provide conclusions and questions for future research in the section Conclusions and New Research Goals.

Fig. 7.1 Schematic illustration of the ideas behind the exemplar-based random-walk model. Panel a: Exemplars are represented as points in a multidimensional psychological space. Similarity is a decreasing function of the distance between objects in the space. Presentation of a test probe causes the exemplars to be "activated" in accord with how similar they are to that probe. The activated exemplars "race" to be retrieved. Panel b: The retrieved exemplars drive a random-walk process for making categorization decisions. Each time that an exemplar from Category A is retrieved, the random walk takes a unit step toward the Category-A response threshold, and likewise for Category B. The retrieval process continues until one of the response thresholds is reached.

The Formal EBRW Model of Categorization RTs
The EBRW model and the GCM build upon classic theories in the areas of choice and similarity (Shepard, 1957, 1987). As described in the introduction, in the model, exemplars are represented as points in a multidimensional psychological space (for an illustration, see Figure 7.1a). The distance between exemplars i and j (d_ij) is given by the Minkowski power model,

d_ij = [ Σ_k w_k |x_ik − x_jk|^ρ ]^(1/ρ),  (1)

with the sum running over k = 1, ..., K, where x_ik is the value of exemplar i on psychological dimension k; K is the number of dimensions that define the space; ρ defines the distance metric of the space; and w_k (0 < w_k, Σ w_k = 1) is the weight given to dimension k in computing distance. In situations involving the classification of holistic or integral-dimension stimuli (Garner, 1974), which will be the main focus of the present chapter, ρ is set equal to 2, which yields the familiar Euclidean distance metric. The dimension weights w_k are free parameters that reflect the degree of "attention" that subjects give to each dimension in making their classification judgments (Carroll & Wish, 1974). In situations in which some dimensions are far more relevant than others in allowing subjects to discriminate between members of contrasting categories, these attention-weight parameters may play a significant role (e.g., Nosofsky, 1986, 1987). In the experimental situations considered in this chapter, however, all dimensions tend to be relevant and the attention weights turn out to play a relatively minor role.

The similarity of test item i to exemplar j is an exponentially decreasing function of their psychological distance (Shepard, 1987),

s_ij = exp(−c · d_ij),  (2)

where c is an overall sensitivity parameter that measures overall discriminability in the psychological space. The sensitivity governs the rate at which similarity declines with distance. When sensitivity is high, the similarity gradient is steep, so even objects that are close together in the space may be highly discriminable. By contrast, when sensitivity is low, the similarity gradient is shallow, and objects are hard to discriminate.
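As a concrete illustration, Eqs. 1 and 2 can be computed as in the following Python sketch. The function and variable names are ours, for exposition only, and are not part of the original model specification:

```python
import math

def minkowski_distance(x_i, x_j, w, rho=2.0):
    """Weighted Minkowski distance between exemplars (Eq. 1).
    With rho = 2 this is the Euclidean metric used for
    integral-dimension stimuli."""
    assert abs(sum(w) - 1.0) < 1e-9, "attention weights w_k must sum to 1"
    total = sum(wk * abs(a - b) ** rho for wk, a, b in zip(w, x_i, x_j))
    return total ** (1.0 / rho)

def similarity(x_i, x_j, w, c, rho=2.0):
    """Exponential similarity gradient (Eq. 2): s_ij = exp(-c * d_ij).
    Larger sensitivity c makes similarity fall off more steeply."""
    return math.exp(-c * minkowski_distance(x_i, x_j, w, rho))
```

For example, raising c from 0.5 to 3.0 sharply reduces the similarity of two nonidentical stimuli, which is how the model captures differences in overall discriminability.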

Each exemplar j is stored in memory (along with its associated category feedback) with memory strength m_j. As will be seen, the precise assumptions involving the memory strengths vary with the specific experimental paradigm that is tested. For example, in some paradigms, we attempt to model performance on a trial-by-trial basis, and assume that the memory strengths of individual exemplars decrease systematically with their lag of presentation. In other paradigms, we attempt to model performance in a transfer phase that follows a period of extended training. In that situation, we typically assume that the memory strengths are proportional to the frequency with which each individual exemplar was presented in common with given category feedback during the initial training phase.

When a test item is presented, it causes all exemplars to be "activated" (Figure 7.1a). The activation for exemplar j, given presentation of item i, is given by

a_ij = m_j · s_ij.  (3)

Thus, the exemplars that are most highly activated are those that have the greatest memory strengths and are highly similar to the test item. In modeling early-learning behavior, we also presume that "background" elements exist in memory at the start of training that are not associated with any of the categories. These background elements are presumed to have fixed activation b, independent of the test item that is presented.

Borrowing from Logan's (1988) highly influential instance theory of automaticity, the EBRW model assumes that presentation of a test item causes the activated stored exemplars and background elements to "race" to be retrieved. For mathematical simplicity, the race times are presumed to be independent exponentially distributed random variables with rates proportional to the degree to which exemplar j is activated by item i (Bundesen, 1990; Logan, 1997; Marley, 1992). Thus, the probability density that exemplar j completes its race at time t, given presentation of item i, is given by

f(t) = a_ij · exp(−a_ij · t).  (4)

This assumption formalizes the idea that although the retrieval process is stochastic, the exemplars that tend to race most quickly are those that are most highly activated by the test item. The exemplar (or background element) that "wins" the race is retrieved.

Whereas in Logan's (1988) model the response is based on only the first retrieved exemplar, in the EBRW model exemplars from multiple retrievals drive a random-walk evidence-accumulation process (e.g., Busemeyer, 1982; Ratcliff, 1978). In a two-category situation, the process operates as follows (for an illustration, see Figure 7.1b): First, there is a random-walk counter with initial setting zero. The observer establishes response thresholds representing the amount of evidence needed to make either a Category-A response (+A) or a Category-B response (−B). Suppose that exemplar x wins the race on a given step. If x received Category-A feedback during training, then the random-walk counter is increased by unit value in the direction of +A, whereas if x received Category-B feedback, the counter is decreased by unit value in the direction of −B. (If a background element wins the race, the counter's direction of change is chosen randomly.) If the counter reaches either threshold +A or −B, the corresponding categorization response is made. Otherwise, the races continue, another exemplar is retrieved (possibly the same one as on the previous step), and the random walk takes its next step.

Given the processing assumptions outlined earlier, Nosofsky and Palmeri (1997a) showed that, on each step of the random walk, the probability (p_i) that the counter is increased in the direction of threshold +A is given by

p_i = (S_iA + b)/(S_iA + S_iB + 2b),  (5)

where S_iA denotes the summed activation of all currently stored Category-A exemplars given presentation of item i (and likewise for S_iB). (The probability that the counter is decreased in the direction of Category B is given by q_i = 1 − p_i.) Thus, as the summed activation of Category-A exemplars increases, the probability of retrieving Category-A exemplars, and thereby the probability of moving the counter in the direction of +A, increases. The categorization decision time is determined jointly by the total number of steps required to complete the random walk and by the speed with which those individual steps are made.

Given these random-walk processing assumptions, it is straightforward to derive analytic predictions of classification choice probabilities and mean RTs for each stimulus at any given stage of the learning process (Busemeyer, 1982). The relevant equations are summarized by Nosofsky and Palmeri (1997a, pp. 269–270, 291–292). Here, we focus on some key conceptual predictions that follow from the model.
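To make the processing assumptions concrete, here is a minimal simulation sketch of a single EBRW decision. This is our own illustrative code: it uses the analytic step probability of Eq. 5 rather than explicitly simulating the exponential races, and it omits step durations, so it yields step counts rather than full RT predictions:

```python
import math
import random

def ebrw_trial(probe, exemplars, w, c, threshold, b=0.0, rng=random):
    """One simulated EBRW categorization decision (Eqs. 3-5).

    probe: coordinates of the test item in psychological space.
    exemplars: list of (coords, category, memory_strength), category 'A' or 'B'.
    threshold: magnitude of the +A and -B response criteria (assumed equal).
    Returns (response, n_steps); n_steps is one determinant of RT.
    """
    def sim(x_i, x_j):
        d = sum(wk * (a - v) ** 2 for wk, a, v in zip(w, x_i, x_j)) ** 0.5
        return math.exp(-c * d)

    # Summed activations (Eq. 3) of the stored exemplars of each category.
    S_A = sum(m * sim(probe, x) for x, cat, m in exemplars if cat == 'A')
    S_B = sum(m * sim(probe, x) for x, cat, m in exemplars if cat == 'B')
    # Probability that the counter steps toward +A (Eq. 5).
    p = (S_A + b) / (S_A + S_B + 2 * b)

    counter = n_steps = 0
    while abs(counter) < threshold:
        counter += 1 if rng.random() < p else -1
        n_steps += 1
    return ('A' if counter > 0 else 'B'), n_steps
```

When a probe is similar only to Category-A exemplars, p is near 1 and the walk marches directly to +A; when the summed activations are comparable, the walk meanders and n_steps grows, mirroring the model's slow-RT prediction.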

A first prediction is that rapid classification decisions should be made for items that are highly similar to exemplars from their own target category and that are dissimilar to exemplars from the contrast category. Under such conditions, all retrieved exemplars will tend to come from the target category, so the random walk will march consistently toward that category's response threshold. For example, if an item is highly similar to the exemplars from Category A, and dissimilar to the exemplars from Category B, then the value p_i in Eq. 5 will be large, so almost all steps in the random walk will move in the direction of +A. By contrast, if an item is similar to the exemplars of both Categories A and B, then exemplars from both categories will be retrieved; in that case, the random walk will meander back and forth, leading to a slow RT.

A second prediction is that practice in the classification task will lead to more accurate responding and faster RTs. Early in training, before any exemplars have been stored in memory, the background-element activations are large relative to the summed activations of the stored exemplars (see Eq. 5). Thus, the random-walk step probabilities (p_i) hover around .5, so responding is slow and prone to error. As the observer accumulates more category exemplars in memory, the summed activations S_iA and S_iB grow in magnitude, so responding is governed more by the stored exemplars. A second reason that responding speeds up is that the greater the number of exemplars stored in memory, the faster the "winning" retrieval times tend to be (cf. Logan, 1988). The intuition is that the greater the number of exemplars that participate in the race, the higher is the probability that some particular exemplar will finish quickly. Therefore, as more exemplars are stored, the speed of the individual steps in the random walk increases.

As discussed later in this chapter, many more fine-grained predictions follow from the model, some of which are highly diagnostic for distinguishing between the predictions of the EBRW model and alternative models. We describe these predictions in the context of the specific experimental paradigms in which they are tested.

One other important formal aspect of the model involves its predictions of classification choice probabilities. In particular, in the special case in which the response thresholds +A and −B are set an equal magnitude γ from the starting point of the random walk, the model predicts that the probability that item i is classified in Category A is given by

P(A|i) = (S_iA + b)^γ / [(S_iA + b)^γ + (S_iB + b)^γ].  (6)

This equation is the descriptive equation of choice probability that is provided by the GCM, one of the most successful formal models of perceptual classification (for discussion, see, e.g., Nosofsky, 1986; Wills & Pothos, 2012).¹ Thus, besides providing a formal account of classification RTs, the EBRW model provides a processing interpretation for the emergence of the GCM response rule.

Effects of Similarity and Practice on Speeded Classification
In their initial tests of the EBRW model, Nosofsky and Palmeri (1997a) conducted a speeded classification task using the category structure shown in Figure 7.2. The stimuli were a set of 12 Munsell colors of a constant red hue varying in their brightness and saturation.² As illustrated in the figure, half the exemplars were assigned by the experimenter to Category A (squares) and half to Category B (circles). On each trial, a single color was presented, the observer classified it into one of the two categories, and corrective feedback was then provided. Testing was organized into 150 blocks of 12 trials (30 blocks per day), with each color presented once in a random order in each block. (See Nosofsky and Palmeri, 1997a, pp. 273–274, for further methodological details.)

Fig. 7.2 Schematic illustration of the category structure tested by Nosofsky and Palmeri (1997a, Experiment 1).
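The choice-probability rule in Eq. 6 is simple to compute. In the following sketch (function and argument names are ours), the response-threshold magnitude γ behaves like a response-scaling parameter: larger thresholds yield more deterministic responding.

```python
def gcm_choice_prob(S_A, S_B, gamma, b=0.0):
    """Probability of a Category-A response (Eq. 6), given summed
    activations S_A and S_B, background activation b, and response
    thresholds set gamma steps from the random walk's starting point."""
    num = (S_A + b) ** gamma
    return num / (num + (S_B + b) ** gamma)
```

With equal summed activations the prediction is .5 regardless of γ; with unequal activations, raising γ pushes the predicted probability toward 0 or 1.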

The mean RTs observed for one of the participants are shown in Figure 7.3. The top panel illustrates the mean RT for each individual color, averaged across the final 120 blocks. The diameter of the circle enclosing each stimulus is linearly related to the stimulus's mean RT. To help interpret these data, the figure also displays a dotted boundary of equal summed similarity to the exemplars of Category A and Category B. Points falling to the upper right have greater summed similarity to Category A, and points falling to the lower left have greater summed similarity to Category B. As can be seen, the mean RTs tend to get faster as one moves farther away from the boundary of equal summed similarity. The bottom panel of Figure 7.3 provides another perspective on the data. This panel plots the overall mean RT for all 12 colors for each "grouped block" of practice in the task, where a grouped block corresponds to five consecutive individual blocks. It is evident that there is a speed-up with practice, with the lion's share of the speed-up occurring during the early blocks.

Fig. 7.3 Data from Experiment 1 of Nosofsky and Palmeri (1997a). (Top) Mean RTs for individual colors (RT proportional to the size of the circle). (Bottom) Mean RTs averaged across all colors as a function of grouped blocks of practice.

To apply the EBRW model to these data, we first derived a multidimensional scaling (MDS) solution for the colors by having the participant provide extensive similarity ratings between all pairs of the colors. A two-dimensional scaling solution yielded a good account of the similarity ratings. The solution is depicted in the top panel of Figure 7.3, where the center of each circle gives the MDS coordinates of the color. We then used the EBRW model in combination with the MDS solution to simultaneously fit the mean RTs associated with the individual colors (Figure 7.3, top panel) as well as the aggregated mean RTs observed as a function of grouped blocks of practice (Figure 7.3, bottom panel). Specifically, the MDS solution provides the coordinate values x_ik that enter into the EBRW model's distance function (Equation 1). For simplicity, in fitting the model, we assumed that on each individual block of practice, an additional token of each individual exemplar was stored in memory with strength equal to one. (To reduce the number of free parameters, we set the background-element activation b equal to zero.) The first step in the analysis was to use the model to predict the mean RT for each individual color in each individual block. Then, for each individual color, we averaged the predictions across Blocks 31–150 to predict the overall individual-color mean RTs. Likewise, we averaged the predicted mean RTs over all colors in each grouped block of practice to predict the speed-up curves. We fitted the model by searching for the free parameters that minimized the total sum of squared deviations between predicted and observed mean RTs across both data sets. These free parameters included the overall sensitivity parameter c; an attention-weight parameter w_1 (with w_2 = 1 − w_1); and a response-threshold parameter +A (with +A = |−B|). In addition, we estimated a mean residual-time parameter μ_R; a scaling constant v for transforming the number of steps in the random walk into milliseconds; and an individual step-time constant α (see Nosofsky & Palmeri, 1997a, pp. 268–270, for details).

The model-fitting results for the individual-color RTs are displayed in Figure 7.4, which plots observed against predicted mean RTs for each individual color. The model provides a reasonably good fit (r = 0.89), especially considering that it is constrained to simultaneously fit the speed-up curves.
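The fitting procedure just described is ordinary least squares over the model's free parameters. A toy version of that search, with a bare grid search standing in for the authors' actual optimizer and with hypothetical names, looks like this:

```python
def fit_by_grid_search(predict_rts, observed_rts, candidate_params):
    """Choose the parameter set minimizing the summed squared deviations
    between predicted and observed mean RTs, as in the fitting procedure
    described in the text. predict_rts(params) must return predictions
    aligned element-for-element with observed_rts."""
    best_params, best_sse = None, float('inf')
    for params in candidate_params:
        preds = predict_rts(params)
        sse = sum((p - o) ** 2 for p, o in zip(preds, observed_rts))
        if sse < best_sse:
            best_params, best_sse = params, sse
    return best_params, best_sse
```

In the actual application, `params` would bundle the sensitivity c, the attention weight w_1, the response threshold, and the timing constants μ_R, v, and α, and `predict_rts` would run the EBRW predictions for every color and block.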

The model predicts these results because colors far from the exemplar-based boundary (e.g., 2, 4, 7, and 10; see top panel of Figure 7.3) tend to be similar only to exemplars from their own category. Thus, only exemplars from a single category are retrieved, and the random walk marches in rapid fashion to the appropriate response threshold. By contrast, colors close to the boundary (e.g., 3, 9, and 12) are similar both to exemplars from their own category and to exemplars from the contrast category. Thus, exemplars from both categories are retrieved and the random walk meanders back and forth, leading to slow mean RTs.

Fig. 7.4 Scatterplot of observed individual-color mean RTs against the predictions from the EBRW model.

The fits to the speed-up curves are shown along with the observed data in the bottom panel of Figure 7.3. Again, the model captures these data reasonably well (r = 0.931). It predicts the speed-up for essentially the same reason as in Logan's (1988) model: As practice continues, an increased number of exemplars are stored in memory. The greater the number of exemplars that race to be retrieved, the faster are the winning retrieval times, so the individual steps in the random walk take place more quickly.³

Automaticity and Perceptual Expertise
The EBRW provides a general theory of categorization, automaticity, and the development of perceptual expertise (Nosofsky & Palmeri, 1997a; Palmeri, 1997; Palmeri, Wong, & Gauthier, 2004). In some real-world domains, novices learn to categorize objects by first learning to use a set of explicit verbal rules. This novice categorization can be a relatively slow, deliberate, attention-demanding process. With experience, as people develop perceptual expertise, categorization often becomes fast and automatic. Logan (1988) generally attributed the development of automaticity in cognitive skills to a shift from strategic algorithmic processes to exemplar-based memory retrieval. Palmeri (1997) specifically examined the development of automaticity in categorization as a shift from a rule-based process to the exemplar-based process assumed by the EBRW (see also Johansen & Palmeri, 2002).

In Palmeri (1997, 1999), subjects were told to use an explicit rule in a dot-counting categorization task. They were asked to categorize random patterns containing between six and eleven dots according to numerosity and did so over multiple sessions. RTs observed in the first session increased linearly with the number of dots, as shown in Figure 7.5A. The dot patterns were repeated across multiple sessions, giving subjects an opportunity to develop automaticity in the task. Indeed, RTs observed in the 13th session were flat with numerosity, a signature of automaticity. In a subsequent transfer test, shown in Figure 7.5B, new unrelated test patterns had categorization RTs that were a linear function of numerosity, just like the first session, whereas old training patterns continued to be categorized with the same RTs irrespective of numerosity, just like the last session. Dot patterns of low or moderate similarity to the training patterns were categorized with RTs intermediate between those for the new unrelated and the old training patterns.

Categorization RTs across sessions and numerosity were successfully modeled as a horse race between an explicit dot-counting process, whose stochastic finishing time increased linearly with the number of dots in the pattern, and an exemplar-based categorization process determined by the similarity of a dot pattern to stored exemplars of patterns previously categorized, an elaboration of the EBRW.⁴ When few patterns have been experienced, categorization is based entirely on the finishing times of the explicit counting process, predicting increased RTs with increased numerosity (Figure 7.5C). With experience, the memory strength m_j (Eq. 3) of exemplars in the EBRW increases, causing a faster accumulation of perceptual evidence to a response threshold. Faster categorizations based on exemplar retrieval imply an increased likelihood that those categorizations finish more quickly than explicit counting. With sufficient experience, the EBRW finishes before counting for nearly all categorizations, predicting flat RTs with numerosity. The similarity-based retrieval in the EBRW also predicts the transfer results, as shown in Figure 7.5D.
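The horse-race account can be caricatured in a few lines of code. This is our own toy sketch, not Palmeri's (1997) fitted model: the time constants are invented, the counting process finishes in time that grows linearly with numerosity, and exemplar retrieval speeds up as memory strength accumulates.

```python
import random

def dot_categorization_rt(numerosity, memory_strength, rng):
    """RT (ms) for one trial as the minimum of two racing processes:
    explicit counting (finishing time linear in the number of dots)
    versus exemplar-based retrieval (faster with greater memory
    strength). All constants are illustrative, not fitted values."""
    counting_time = 400.0 + 250.0 * numerosity + rng.gauss(0.0, 50.0)
    retrieval_time = 3000.0 / (1.0 + memory_strength) + rng.gauss(0.0, 50.0)
    return min(counting_time, retrieval_time)  # whichever process finishes first
```

Early in practice (memory strength near zero) the counting process almost always wins, so mean RT climbs with numerosity; after extensive practice the retrieval process wins at every numerosity, producing the flat RT signature of automaticity.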

Fig. 7.5 Data from Experiment 1 of Palmeri (1997). Subjects categorized dot patterns varying in numerosity over thirteen training sessions, followed by a transfer test. (A) Response time (ms) as a function of numerosity in each training session (1–13). (B) Response time (ms) as a function of numerosity during transfer for old patterns, new unrelated patterns, and patterns of low and moderate similarity to old patterns. (C and D) EBRW predictions.

A shift from rules to exemplars is just one reflect an initial stage of visual processing, perhaps common characteristic of the development of as early as object segmentation (Grill-Spector & perceptual expertise that EBRW can help explain Kanwisher, 2005; see also Jolicoeur, Gluck, & Koss- (see Palmeri et al., 2004; Palmeri & Cottrell, lyn, 1984). For novices, subordinate categorizations 2009). As one other brief example, consider are slow because basic-level categorizations must the well-known basic-level advantage. The seminal be made first. For experts, the stage of basic-level work of Rosch, Morris, Gray, Johnson, & Boyes- categorization is somehow bypassed. Braem (1976) showed that people are faster to EBRW provides an alternative account (see verify the category of an object at an intermediate Mack & Palmeri, 2011; Mack, Wong, Gauthier, basic level of abstraction (“bird”) than at more Tanoka, & Palmeri, 2007). Faster or slower superordinate (“animal”) or subordinate (“Northern categorization at different levels of abstraction Cardinal”) levels. With expertise, subordinate-level need not reflect stages of processing but may categorizations are verified as quickly as basic-level instead reflect differences in the speed of evidence categorizations (Tanaka & Taylor, 1991). One accumulation in the random walk. Mack et al. explanation for the basic-level advantage and related (2007) simulated basic-level and subordinate-level findings in novices is that basic-level categorizations categorization by novices and experts. For these

an exemplar-based random-walk model 149

simulations, subordinate categories were simply assumed to be individual identities of objects within clusters that were the basic-level categories. For moderate levels of the sensitivity parameter, c, which reflects the overall discriminability in the psychological space (Eq. 2), a basic-level advantage was predicted. But for high levels of sensitivity, reflecting the greater discriminability that may come with perceptual expertise, equivalent RTs for the subordinate and basic levels were predicted. Further empirical evidence has supported the hypothesis that differences in RTs at different levels of abstraction reflect how quickly perceptual evidence accumulates rather than differences in discrete visual processing stages (e.g., Mack, Gauthier, Sadr, & Palmeri, 2008; Mack & Palmeri, 2010).

Using Probabilistic Feedback Manipulations to Contrast the Predictions from the EBRW Model and Alternative Models

One of the major alternative modeling approaches to predicting multidimensional classification RTs arises from a general framework known as "decision-boundary theory" (e.g., Ashby & Townsend, 1986). According to this framework, people construct decision boundaries for dividing a perceptual space into category response regions. Test items are assumed to give rise to noisy representations in the multidimensional perceptual space. If a perceived representation falls in Region A of the space, then the observer classifies the item into Category A. Most past approaches to generating RT predictions from decision-boundary theory have involved applications of the RT-distance hypothesis (Ashby, Boynton, & Lee, 1994). According to this hypothesis, RT is a decreasing function of the distance of a stimulus from the decision boundary. More recent models that formalize the operation of decision boundaries posit perceptual-sampling mechanisms that drive random-walk or diffusion processes, similar to those of the EBRW model (e.g., Ashby, 2000; Fific, Little, & Nosofsky, 2010; Nosofsky & Stanton, 2005). For example, in Nosofsky and Stanton's (2005, p. 625) approach, on each step of a random walk, a percept is sampled randomly from the perceptual distribution associated with a presented stimulus. If the percept falls in Region A (as defined by the decision boundary), then the random walk steps toward threshold A; otherwise it steps toward threshold –B. The perceptual-sampling process continues until either threshold is reached. Such models provide process interpretations for why RT should get faster (and accuracy should increase) as distance of a stimulus from a decision boundary increases.

According to the EBRW model and decision-boundary models, the nature of the memory representations that are presumed to underlie categorization are dramatically different (i.e., stored exemplars versus boundary lines). Despite this dramatic difference, it is often difficult to distinguish between the predictions from the models. The reason is that distance-from-boundary and relative summed-similarity tend to be highly correlated in most experimental designs. For example, as we already described with respect to the Figure 7.3 (top panel) structure, items far from the boundary tend to be highly similar to exemplars from their own category and dissimilar to exemplars from the contrast category. Thus, both the distance-from-boundary model and the EBRW model tend to predict faster RTs and more accurate responding as distance from the boundary increases.

In one attempt to discriminate between the RT predictions of the models, Nosofsky and Stanton (2005) conducted a design that aimed to decouple distance-from-boundary and relative summed similarity (for a related approach, see Rouder & Ratcliff, 2004). The key idea in the design was to make use of probabilistic feedback manipulations associated with individual stimuli in the space. The design is illustrated in Figure 7.6. The stimuli were again a set of 12 Munsell colors of a constant hue, varying in their brightness and saturation. As illustrated in the figure, Colors 1–6 belonged to Category A, whereas Colors 7–12 belonged to Category B. To help motivate the predictions, we have drawn a diagonal linear decision boundary for separating the two categories of colors into response regions. Given reasonable assumptions (for details, see Nosofsky & Stanton, 2005), this boundary is the optimal (ideal-observer) boundary according to decision-boundary theory. That is, it is the boundary that would maximize an observer's percentage of correct categorization decisions. In generating predictions, decision-bound theorists often assume that observers will use a boundary with an optimal form. However, we will consider more general possibilities later in this section.

The key experimental manipulation was that, across conditions, either Stimulus Pair 4/8 or Stimulus Pair 5/9 received probabilistic feedback, whereas all other stimuli received deterministic feedback. Specifically, in Condition 4/8, Stimulus 4 received Category-A feedback with probability 0.75 and Category-B feedback with probability 0.25.
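The perceptual-sampling random walk described above is easy to sketch in simulation. The snippet below is a minimal illustration under assumed parameter values (a linear boundary x − y = 0, Gaussian perceptual noise, symmetric thresholds), not Nosofsky and Stanton's actual implementation:

```python
import random

def boundary_walk(stim, noise_sd=0.5, threshold=5, rng=random.Random(1)):
    """Perceptual-sampling random walk for a decision-boundary model.
    Each step samples a noisy percept around the stimulus; the counter
    steps toward +A if the percept falls on Category A's side of the
    (assumed) linear boundary x - y = 0, otherwise toward -B.  Returns
    the response and the number of steps (a proxy for RT)."""
    counter, steps = 0, 0
    while abs(counter) < threshold:
        px = stim[0] + rng.gauss(0, noise_sd)
        py = stim[1] + rng.gauss(0, noise_sd)
        counter += 1 if px - py > 0 else -1
        steps += 1
    return ("A" if counter > 0 else "B"), steps

# RT-distance prediction: a stimulus far from the boundary is classified
# faster, on average, than a stimulus near the boundary.
far = [boundary_walk((3.0, 0.0))[1] for _ in range(500)]
near = [boundary_walk((0.3, 0.0))[1] for _ in range(500)]
```

Averaging the step counts reproduces the RT-distance hypothesis: the mean of `far` is reliably smaller than the mean of `near`.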

150 basic cognitive skills

Conversely, Stimulus 8 received Category-B feedback with probability 0.75 and Category-A feedback with probability 0.25. Analogous probabilistic feedback assignments were given to Stimuli 5 and 9 in Condition 5/9. In each condition, we refer to the pair that received probabilistic feedback as the probabilistic pair and to the pair that received deterministic feedback as the deterministic pair.

The key conceptual point is that, because of the symmetric probabilistic assignment of stimuli to categories, the optimal boundary for partitioning the space into response regions is the same diagonal linear decision boundary that we have already illustrated in Figure 7.6. There is no way to adjust the boundary to achieve more accurate responding in the face of the probabilistic feedback assignments. Furthermore, because the probabilistic and deterministic pairs are an equal distance from the decision boundary, the most natural prediction from that modeling approach is that RTs for the probabilistic and deterministic pairs should be the same.

By contrast, the EBRW model predicts that the probabilistic pair should be classified more slowly (and with lower accuracy) than the deterministic pair. For example, in Condition 4/8, in cases in which Stimulus 4 is presented and tokens of exemplar 4 are retrieved from memory, 0.75 of the steps in the random walk will move in the direction of threshold +A, but 0.25 of the steps will move in the direction of threshold –B. By contrast, presentations of the deterministic pair will result in more consistent steps of the random walk, leading to faster RTs and more accurate responding.

Across two experiments (in which the instructions varied the relative emphasis on speed versus accuracy in responding), the qualitative pattern of results strongly favored the predictions from the EBRW model compared to the distance-from-boundary model. In particular, in both experiments, observers responded more slowly and with lower accuracy to the probabilistic pair than to the deterministic pair. These results were observed at both early and late stages of testing and were consistent across the majority of the individual participants. As a converging source of evidence, the EBRW model also provided better overall quantitative fits to the individual-subject choice-probability and RT data than did the distance-from-boundary model, including versions of the latter model in which the slope and y-intercept of the linear boundary were allowed to vary as free parameters.

As suggested by Nosofsky and Stanton (2005, p. 623), the results were particularly intriguing because they pointed toward a stubborn form of suboptimality in human performance: In the Figure 7.6 design, subjects would perform optimally by simply ignoring the probabilistic feedback and classifying objects based on the linear decision boundary. Nevertheless, despite being provided with monetary payoffs for correct responses, subjects' behavior departed from that optimal strategy in a manner that was well predicted by the EBRW model. Similar forms of evidence that favor the predictions from the EBRW model have been reported in related studies that manipulated the overall familiarity of individual study exemplars rather than probabilistic feedback assignments (e.g., Nosofsky & Palmeri, 1997a; Verguts, Storms, & Tuerlinckx, 2003).

Although the focus of Nosofsky and Stanton's (2005) study was to contrast predictions from the EBRW model and decision-bound models, the designs also yielded sharp contrasts between the EBRW model and prototype models.5 Specifically, Nosofsky and Stanton (2005, p. 610) formulated a prototype-based random-walk (PBRW) model, analogous in all respects to the EBRW model, except that the category representation was presumed to correspond to the central tendency of each category distribution rather than to the individual exemplars.

Fig. 7.6 Schematic illustration of the category structure used in Nosofsky and Stanton's (2005) probabilistic-feedback experiment.
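The contrasting EBRW prediction can be illustrated with a small simulation. In the hypothetical sketch below, a probabilistic-pair stimulus steps toward its majority threshold with probability 0.75 (mirroring the feedback probabilities), whereas a deterministic-pair stimulus is given a more consistent step probability; the value 0.95 is an arbitrary illustrative choice, since the exact step consistency would depend on the similarity structure:

```python
import random

def ebrw_trial(p_correct_step, threshold=4, rng=random.Random(2)):
    """One EBRW-style random walk: each retrieved exemplar steps the
    counter toward the majority-category threshold with probability
    p_correct_step.  Returns (correct_response, number_of_steps)."""
    counter, steps = 0, 0
    while -threshold < counter < threshold:
        counter += 1 if rng.random() < p_correct_step else -1
        steps += 1
    return counter >= threshold, steps

def summarize(p, n=2000):
    """Accuracy and mean number of steps (a proxy for RT) over n trials."""
    trials = [ebrw_trial(p) for _ in range(n)]
    acc = sum(correct for correct, _ in trials) / n
    mean_rt = sum(steps for _, steps in trials) / n
    return acc, mean_rt

acc_prob, rt_prob = summarize(0.75)   # probabilistic pair
acc_det, rt_det = summarize(0.95)     # deterministic pair (illustrative)
```

With these settings the deterministic pair is classified faster and at least as accurately as the probabilistic pair, which is the qualitative pattern observed in the data.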

It turns out that, for the stimulus spacings and probabilistic stimulus-category assignments used in the Figure 7.6 design, the central tendency of each category is equidistant to the probabilistic and deterministic stimulus pairs. Thus, the PBRW model predicted incorrectly that those pairs would show identical choice probabilities and RTs. Not surprisingly, therefore, the PBRW yielded far worse quantitative fits to the full sets of choice-probability and RT data than did the EBRW model.

Extending the EBRW Model to Predict Old-New Recognition RTs

Overview

As noted in the introduction, a central goal of exemplar models is to explain not only categorization, but other fundamental cognitive processes such as old-new recognition. The GCM has provided successful accounts of old-new recognition choice probabilities in wide varieties of experimental situations. When applied to recognition, the GCM assumes that each item from a study list is stored as a distinct exemplar in memory. The observer is presumed to sum the similarity of a test item to these stored exemplars. The greater the summed similarity, the more familiar is the test item, so the greater is the probability with which the observer responds "old." Indeed, the GCM can be considered a member of the general class of "global matching" models of old-new recognition (e.g., Clark & Gronlund, 1996; Gillund & Shiffrin, 1984; Hintzman, 1988; Murdock, 1982). Within this broad class, an important achievement of the model is that it predicts fine-grained differences in recognition probabilities for individual items based on their fine-grained similarities to individual exemplars in the study set (e.g., Nosofsky, 1988, 1991; Nosofsky & Zaki, 2003).

Just as has been the case for categorization, a major development in recent years has involved extensions of the GCM in terms of the EBRW model to allow it to account for recognition RTs (Donkin & Nosofsky, 2012a,b; Nosofsky, Little, Donkin, & Fific, 2011; Nosofsky & Stanton, 2006). In this section we describe these formal developments and illustrate applications to variants and extensions of the classic Sternberg (1966) short-term probe-recognition paradigm. In this paradigm, subjects are presented on each trial with a short list of items (the memory set) and are then presented with a test probe. The subjects judge, as rapidly as possible, while minimizing errors, whether the probe occurred in the memory set. Fundamental variables that are manipulated in the paradigm include the size of the memory set; whether the test probe is old (a "positive" probe) or new (a "negative" probe); and, if old, the serial position with which the positive probe occurred in the memory set. Whereas in the standard version of the Sternberg paradigm the to-be-recognized items are generally highly distinct entities, such as alphanumeric characters, recent extensions have examined recognition performance in cases involving confusable stimuli embedded in a continuous-dimension similarity space (e.g., Kahana & Sekuler, 2002). We focus on this type of extended similarity-based paradigm in the initial part of this section; however, we will illustrate applications of the EBRW model to the more standard paradigm as well.

The Formal Model

The EBRW-recognition model makes the same representational assumptions as does the categorization model: (a) exemplars are conceptualized as occupying points in a multidimensional similarity space; (b) similarity is a decreasing function of distance in the space (Eqs. 1 and 2); (c) activation of the exemplars is a joint function of their memory strength and their similarity to the test probe (Eq. 3); and (d) the stored exemplars race to be retrieved with rates proportional to their activations (Eq. 4).

In our previous applications to categorization, a single global level of sensitivity (c in Eq. 2) was assumed that applied to all exemplar traces stored in long-term memory. In application to short-term recognition paradigms involving high-similarity lures, however, allowance is made for a form of exemplar-specific sensitivity. In particular, an observer's ability to discriminate between test item i and a nonmatching exemplar-trace j will almost certainly depend on the recency with which exemplar j was presented: Discrimination is presumably much easier if an exemplar was just presented than if it was presented in the distant past. In the version of the model applied by Nosofsky et al. (2011), a separate sensitivity parameter cj was estimated for each individual lag j on the study list, where lag is counted backward from the presentation of the test probe to the memory-set exemplar. For example, for the case in which memory set-size is 4, the exemplar in the fourth serial position has Lag 1, the exemplar in the third serial position has Lag 2, and so forth. Likewise, the memory strengths of the individual exemplars (mj) were also assumed to depend on lag j: Presumably, the more recently an exemplar was presented on the study list, the greater its memory strength (e.g., McElree & Dosher, 1989; Monsell, 1978).
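The lag-dependent activations can be sketched numerically. The snippet below assumes an exponential-decay similarity function in the spirit of the equations referenced above (Eqs. 1–3, not reproduced here); the particular mj and cj values are illustrative stand-ins, not estimated parameters:

```python
import math

def exemplar_activation(dist, m_j, c_j):
    """Activation of exemplar trace j by a probe: memory strength times
    exponentially decaying similarity (an assumed form, in the spirit of
    Eqs. 1-3 of the chapter)."""
    return m_j * math.exp(-c_j * dist)

# Illustrative lag-dependent parameters: more recent exemplars (smaller
# lag) are assigned both higher memory strength and higher sensitivity.
m = {lag: lag ** -0.5 for lag in (1, 2, 3, 4)}
c = {lag: 2.0 + 1.0 / lag for lag in (1, 2, 3, 4)}

# A matching probe (distance 0) activates a Lag-1 trace more strongly
# than a Lag-4 trace, and the higher sensitivity of the recent trace
# makes a confusable lure (distance 0.5) relatively less activating.
match_recent = exemplar_activation(0.0, m[1], c[1])
match_old = exemplar_activation(0.0, m[4], c[4])
lure_recent = exemplar_activation(0.5, m[1], c[1])
lure_old = exemplar_activation(0.5, m[4], c[4])
```

Under these illustrative values, the lure-to-match activation ratio is smaller for the recent trace than for the old trace, which captures the easier discrimination of recently presented exemplars.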

(Although the effects are smaller, in modeling short-term recognition with the EBRW, allowance is also typically made for a primacy effect on memory strength. The memory strength of the item in the first serial position of the memory set is given by PM · mj, where mj is the memory strength for an item with lag j, and PM is a primacy-multiplier parameter.)

To adapt the EBRW model to the domain of old-new recognition, it is assumed that the observer establishes what are termed "criterion elements" in the memory system. These elements are similar to the "background elements" used for modeling the early stages of category learning. Just as is the case for the stored exemplars, upon presentation of a test probe, the criterion elements race to be retrieved. However, whereas the retrieval rates of the stored exemplars vary with their lag-dependent memory strengths and their similarity to the test probe, the retrieval rates of the criterion elements are independent of these factors. Instead, the criterion elements race with some fixed rate β, independent of the test probe that is presented. The setting of β is presumed to be, at least in part, under the control of the observer.

Finally, the retrieved exemplars and criterion elements drive a random-walk process that governs old-new recognition decisions. The observer sets response thresholds +OLD and –NEW that establish the amount of evidence needed for making an "old" or a "new" response. On each step of the random walk, if an old exemplar wins the retrieval race, then the random-walk counter takes a step in the direction of the +OLD response threshold; whereas if a criterion element wins the race, then the counter takes a step in the direction of the –NEW threshold. The retrieval process continues until one of the thresholds is reached.

Given the processing assumptions outlined earlier, on each step of the random walk, the probability that the counter steps in the direction of the +OLD threshold is given by

pi = Fi/(Fi + β), (7)

where Fi gives the summed activation ("familiarity") of the test probe to all old exemplars on the study list (and β is the fixed setting of criterion-element activation). Note that test probes that match recently presented exemplars (with high memory strengths) will cause high summed familiarity (Fi), leading the random walk to march quickly to the +OLD threshold and resulting in fast old RTs. By contrast, test probes that are highly dissimilar to the memory-set items will not activate the stored exemplars, so only criterion elements will be retrieved. In this case, the random walk will march quickly to the –NEW threshold, resulting in fast new RTs. Through experience in the task, the observer is presumed to learn an appropriate setting of the criterion-element activation β, such that summed activation (Fi) tends to exceed β when the test probe is old, but tends to be less than β when the test probe is new. In this way, the random walk will tend to step toward the appropriate response threshold on trials in which old versus new probes are presented.

Experimental Tests

In Nosofsky et al.'s (2011) initial experiment for testing the model, the stimuli were a set of 27 Munsell colors that varied along the dimensions of hue, brightness, and saturation. Similarity-scaling procedures were used to derive a precise MDS solution for the colors.

The design of the probe-recognition experiment involved a broad sampling of different list structures to provide a comprehensive test of the model. There were 360 lists in total. The size of the memory set on each trial was either 1, 2, 3, or 4 items, with an equal number of lists at each set size. For each set size, half the test probes were old and half were new. In the case of old probes, the matching item from the memory set occupied each serial position equally often. To create the lists, items were randomly sampled from the full set of stimuli, subject to the constraints described earlier. Thus, a highly diverse set of lists was constructed, varying not only in set size, old/new status of the probe, and serial position of old probes, but also in the similarity structure of the lists.

Because the goal was to predict performance at the individual-subject level, three subjects were each tested for approximately 20 one-hour sessions, with each of the 360 lists presented once per session. As it turned out, each subject showed extremely similar patterns of performance, and the fits of the EBRW model yielded similar parameter estimates for the three subjects. Therefore, for simplicity, and to reduce noise in the data, we report the results from the analysis of the averaged subject data.

In the top panels of Figure 7.7 we report summary results from the experiment. The top-right panel reports the observed mean RTs plotted as a function of (a) set size, (b) whether the probe was old or new (i.e., a lure), and (c) the lag with which old probes appeared in the memory set.
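Equation 7 and the retrieval race translate directly into a simulation. The sketch below folds the race into the step probability pi = Fi/(Fi + β); the familiarity values, β, and thresholds are arbitrary illustrations, not fitted parameters:

```python
import random

def step_prob(F_i, beta):
    """Eq. 7: probability that the walk steps toward the +OLD threshold."""
    return F_i / (F_i + beta)

def recognition_walk(F_i, beta=1.0, threshold=4, rng=random.Random(3)):
    """Random walk between +OLD and -NEW driven by the exemplar-versus-
    criterion-element retrieval race.  Returns (response, steps)."""
    p = step_prob(F_i, beta)
    counter, steps = 0, 0
    while -threshold < counter < threshold:
        counter += 1 if rng.random() < p else -1
        steps += 1
    return ("old" if counter >= threshold else "new"), steps

# A probe matching a recent, strong exemplar yields high summed
# familiarity (F_i > beta) and predominantly fast "old" responses; a
# dissimilar probe yields low familiarity (F_i < beta) and mostly fast
# "new" responses.
old_resps = [recognition_walk(3.0)[0] for _ in range(1000)]
new_resps = [recognition_walk(0.3)[0] for _ in range(1000)]
```

With Fi = β the step probability is 0.5, the walk drifts nowhere in particular, and both RTs and error rates increase, which is how the model produces slow responses to borderline probes.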

Fig. 7.7 Summary data from the short-term memory experiment of Nosofsky, Little, Donkin, and Fific (2011). (Top) Observed error rates and mean RTs. (Bottom) Predictions from the EBRW model.

For old probes, there was a big effect of lag: In general, the more recently a probe appeared on the study list, the shorter was the mean RT. Indeed, once one takes lag into account, there is little remaining effect of set size on the RTs for the old probes. That is, as can be seen, the different set-size functions are nearly overlapping. The main exception is a persistent primacy effect, in which the mean RT for the item at the longest lag for each set size is "pulled down." (The item at the longest lag occupies the first serial position of the list.) By contrast, for the lures, there is a big effect of set size, with longer mean RTs as set size increases. The mean proportions of errors for the different types of lists, shown in the top-left panel of Figure 7.7, mirror the mean RT data just described.

The goal of the EBRW modeling, however, was not simply to account for these summary trends. Instead, the goal was to predict the choice probabilities and mean RTs observed for each of the individual lists. Because there were 360 unique lists in the experiment, this goal entailed simultaneously predicting 360 choice probabilities and 360 mean RTs. The results of that model-fitting goal are shown in the top and bottom panels of Figure 7.8. The top panel plots, for each individual list, the observed probability that the subjects judged the probe to be "old" against the predicted probability from the model. The bottom panel does the same for the mean RTs. Although there are a few outliers in the plots, overall the model achieves a good fit to both data sets, accounting for 96.5% of the variance in the choice probabilities and for 83.4% of the variance in the mean RTs.

The summary-trend predictions that result from these global fits are shown in the bottom panels of Figure 7.7. It is evident from inspection that the EBRW does a good job of capturing these summary results. For the old probes, it predicts the big effect of lag on the mean RTs, the nearly overlapping set-size functions, and the facilitation in RT with primacy. Likewise, it predicts with good quantitative accuracy the big effect of set size on the lure RTs. The error-proportion data (left panels of Figure 7.7) are also well predicted, with the main exception that a primacy effect was predicted but not observed for the size-2 lists.

The explanation of these results in terms of the EBRW model is straightforward. According to the best-fitting parameters from the model (see Nosofsky et al., 2011, Table 2), more recently


Fig. 7.8 Scatterplots of observed and EBRW-predicted old recognition probabilities and mean RTs associated with individual lists from the short-term memory experiment of Nosofsky, Little, Donkin, and Fific (2011).

presented exemplars had greater memory strengths and sensitivities than did less recently presented exemplars. From a psychological perspective, this pattern seems highly plausible. For example, presumably, the more recently an exemplar was presented, the greater should be its strength in memory. Thus, if an old test probe matches a recently presented exemplar, it will give rise to greater overall activation, leading to faster mean old RTs. In the case of a lure, as set size increases, the overall summed activation yielded by the lure will also tend to increase. This pattern arises both because a greater number of exemplars will contribute to the sum, and because the greater the set size, the higher is the probability that at least one exemplar from the memory set will be highly similar to the lure. As summed activation yielded by the lures increases, the probability that the random walk takes correct steps toward the –NEW threshold decreases, so mean RTs for the lures get longer.

Beyond accounting well for these summary trends, inspection of the detailed scatterplots in Figure 7.8 reveals that the model accounts for fine-grained changes in choice probabilities and mean RTs depending on the fine-grained similarity structure of the lists. For example, consider the choice-probability plot (Figure 7.8, top panel) and the Lure-Size-4 items (open diamonds). Whereas performance for those items is summarized by a single point on the summary-trend figure (Figure 7.7), the full scatterplot reveals extreme variability in results across different tokens of the Lure-Size-4 lists. In some cases the false-alarm rates associated with these lists are very low, in other cases moderate, and in still other cases the false-alarm rates exceed the hit rates associated with old lists. The EBRW captures well this variability in false-alarm rates. In some cases, the lure might not be similar to any of the memory-set items, resulting in a low false-alarm rate; whereas in other cases the lure might be highly similar to some of the memory-set items, resulting in a high false-alarm rate.

The application reviewed earlier involved a version of a short-term probe-recognition paradigm that used confusable stimuli embedded in a continuous-dimension space. However, the EBRW model has also been applied successfully to more standard versions of such paradigms that involve easy-to-discriminate stimuli such as alphanumeric characters. In those applications, instead of adopting MDS approaches, a highly simplified model of similarity is used: The similarity of a probe to itself is set equal to one, whereas the similarity between a probe and any nonmatching item is set equal to a free parameter s. Not only has this simple version of the EBRW model captured many of the classic patterns of results involving mean RTs in such paradigms, it also accounts successfully for the detailed RT-distribution data that have been observed (Nosofsky et al., 2011; Donkin & Nosofsky, 2012a,b).

Furthermore, applications of the EBRW model to both continuous-similarity and discrete versions of the probe-recognition paradigm have led to the discovery of an interesting regularity involving memory strength. As noted earlier, in the initial tests of the model, separate memory-strength parameters were estimated corresponding to each individual lag on the study list. It turns out, however, that the estimated memory strengths follow almost a perfect power function of this lag. For example, in an experiment reported by Donkin and Nosofsky (2012a), participants studied 12-item lists consisting of either letters or words, followed by a test probe. Separate RT-distribution data for hits and misses for positive probes were collected at each study lag. (RT-distribution data for false alarms and correct rejections for negative probes were collected as well.) The EBRW model provided an excellent quantitative account of this complete set of detailed RT-distribution and choice-probability data.6 The discovery that resulted from the application of the model is illustrated graphically in Figure 7.9. The figure plots, for each of four individual participants who were tested, the estimated memory-strength parameters against lag. As shown in the figure, the magnitudes of the memory strengths are extremely well captured by a simple power function. Interestingly, other researchers have previously reported that a variety of empirical forgetting curves are well described as power functions (e.g., Anderson & Schooler, 1991; Wickelgren, 1974; Wixted & Ebbesen, 1991). For example, Wixted and Ebbesen (1991) reported that diverse measures of forgetting, including proportion correct of free recall of word lists, recognition judgments of faces, and savings in relearning lists of nonsense syllables, were well described as power functions of the retention interval. Wixted (2004) considered a variety of possible reasons for the emergence of these empirical power-function relations and concluded that the best explanation was that the strength of the memory traces themselves may exhibit power-function decay. The model-based results from Donkin and Nosofsky (2012a,b) lend support to Wixted's suggestion.

Fig. 7.9 Model-based results from the probe-recognition experiment of Donkin and Nosofsky (2012a). Estimated memory strengths (open circles) are plotted as a function of lag, along with the best-fitting power functions. Donkin, C., & Nosofsky, R. M. (2012). A power-law model of psychological memory strength in short-term and long-term recognition. Psychological Science, 23, 625–634. Adapted with permission of SAGE Publications.

These results now motivate the new research goal of unpacking the detailed psychological and neurological mechanisms that give rise to this discovered power law of memory strength.

Conclusions and New Research Goals

In sum, the EBRW is an important candidate model for explaining how the process of categorization unfolds over time. The model combines assumptions involving exemplar-based category representation and processes of evidence accumulation within a unified framework to account for categorization and recognition choice-probability and RT data. As reviewed in this chapter, it accounts successfully for a wide variety of fundamental effects in these domains, including effects of similarity, distance-from-boundary, familiarity, probabilistic feedback, practice, expertise, set size, and lag.

Although we were able to sample only a limited number of example applications in this single chapter, we should clarify that the exemplar model has been applied in a wide variety of stimulus domains and to varied category and study-list structures. The stimulus domains include colors, dot patterns, multidimensional cartoon drawings, geometric forms, schematic faces, photographs of faces, alphanumeric characters, words, and pictures of real-world objects. The category structures include small collections of continuous-dimension stimuli separated by boundaries of varying degrees of complexity, normally distributed category structures, high-dimensional stimuli composed of discrete dimensions, and categories generated from statistical distortions of prototypes (e.g., see Richler & Palmeri, 2014). Furthermore, growing neural evidence ranging from single-unit records to functional brain imaging supports a number of the processing assumptions embodied in models like EBRW (see Box 1). Finally, as illustrated in our chapter, a key theme of the theoretical approach is that, despite the dramatically different task goals, the processes of categorization and old-new recognition may be closely related (but see Box 2 for discussion of a major theoretical debate regarding this issue in the cognitive-neuroscience literature). A likely reason for the model's success is that it builds on the strengths of classic previous approaches for understanding processes of choice and similarity, the development of automaticity, and evidence accumulation in decision-making and memory.

Beyond accounting for categorization and recognition, we believe that the EBRW model can serve

Box 1 Neural Evidence Supports Mechanistic Assumptions in EBRW.

EBRW proposes many mechanistic assumptions, such as exemplar representations, attention weights along relevant dimensions, and accumulation of perceptual evidence. In most modeling, these are supported by evaluating predictions of behavioral data like accuracy and RTs. But we can now evaluate particular mechanistic assumptions using relevant neural data from brain regions hypothesized to instantiate those mechanisms.

For example, Mack, Preston, and Love (2013; see also Palmeri, 2014) turned to patterns of brain activity measured with fMRI to evaluate whether the brain represents categories using exemplars or prototypes. Subjects learned to classify objects into one of two categories and then in the scanner were tested on training and transfer objects without feedback (Medin & Schaffer, 1978). Typical fMRI analyses would correlate brain activity with stimuli or responses, for example highlighting regions that modulate with categorization difficulty. Instead, Mack and colleagues first fitted exemplar and prototype models to individual subjects' categorization responses; despite the fact that these models made fairly similar behavioral predictions, they differed in patterns of summed similarity to their respective exemplar or prototype representations. Mack and colleagues showed that patterns of individual subjects' brain activity were more consistent with patterns of summed similarity predicted by an exemplar model than those predicted by a prototype model.

According to exemplar models, learning to categorize objects can cause selective attention to relevant psychological dimensions, stretching psychological space to better allow subjects to discriminate between members of contrasting categories. Neurophysiology and fMRI have suggested that category-relevant dimensions can be emphasized in visual cortex. After monkeys learned to categorize multidimensional objects, neurons in inferotemporal cortex were more sensitive to variations along a relevant dimension than an irrelevant dimension (De Baene, Ons, Wagemans, & Vogels, 2008; Sigala & Logothetis, 2002; see also Gauthier & Palmeri, 2002). Similarly, after people learned object categories, psychological stretching of relevant dimensions was accompanied by neural stretching of relevant dimensions measured by fMRI (Folstein, Palmeri, & Gauthier, 2013; see also Folstein, Gauthier, & Palmeri, 2012).

Finally, mathematical psychology and systems neuroscience have converged on accumulation of perceptual evidence as a general theoretical framework to explain the time course of decision making (see Palmeri, Schall, & Logan, this volume). Some neurons show dynamics predicted by accumulator models, other neurons show activity consistent with encoded perceptual evidence to be accumulated over time, and an ensemble of neurons predicts the time course of decisions made by awake behaving monkeys (e.g., Purcell, Schall, Logan, & Palmeri, 2012; Zandbelt, Purcell, Palmeri, Logan, & Schall, 2014).

Box 2 The Exemplar Model Accounts for Dissociations Between Categorization and Recognition Demonstrated in the Cognitive-Neuroscience Literature.

Interestingly, in contrast to the theme emphasized in this chapter, the prevailing view in the cognitive-neuroscience literature is that separate cognitive/neural systems mediate categorization and recognition (Smith, 2008). The main source of evidence involves the demonstration of intriguing dissociations between categorization and recognition. For example, studies have demonstrated that amnesics with poor recognition memory perform at normal levels in categorization tasks involving the same types of stimuli (e.g., Knowlton & Squire, 1993). Nevertheless, formal modeling analyses have indicated that even these dissociations are consistent with the predictions from the exemplar model (e.g., Nosofsky, Denton, Zaki, Murphy-Knudson, & Unverzagt, 2012; Nosofsky & Zaki, 1998; Palmeri & Flanery, 1999, 2002; Zaki, Nosofsky, Jessup, & Unverzagt, 2003). The general approach in the modeling was to assume that amnesics have reduced ability to discriminate among distinct exemplars in memory. This reduced discriminability is particularly detrimental to old-new recognition, which may

158 basic cognitive skills

Box 2 Continued
require the observer to make fine-grained distinctions between old versus new items. However, the reduced discriminability is not very detrimental in typical categorization tasks, which may require only gross-level assessments of similarity. A more direct challenge to the exemplar-model hypothesis comes from brain-imaging studies showing that distinct brain regions are activated when observers engage in recognition versus categorization tasks (Reber et al., 1998a,b). Exemplar theorists have responded, however, by providing evidence that these brain-imaging dissociations may not reflect the operation of separate neural systems devoted to categorization versus recognition per se. Instead, the dissociations may reflect changes in stimulus-encoding strategies across task situations (Gureckis, James, & Nosofsky, 2011), differences in the precise stimuli that are tested (Nosofsky, Little, & James, 2012; Reber et al., 2003), as well as adaptive changes in parameter settings that allow observers to meet the competing task goals of categorization versus recognition (Nosofsky et al., 2012).

as a useful analytic device for assessing human performance. For example, Ratcliff's (1978) diffusion model has been applied to analyze choice behavior in various special populations, including elderly adults, sleep-deprived subjects, and so forth (see Chapter 2 of this volume). The model-based analyses provide a deeper understanding of the locus of the cognitive/perceptual deficits in such populations by tracing them to changes in diffusion-model drift rates, response-threshold settings, or residual times. The EBRW model has the potential to reveal even more fine-grained information along these lines. For example, in that model, the random-walk step probabilities (i.e., drift rates) emerge from cognitive/perceptual factors such as overall sensitivity, attention-weight settings, and the memory strengths of stored exemplars, each of which can be measured by fitting the model to data obtained in suitable categorization and recognition paradigms. Although exemplar models have been applied to help interpret the behavior of amnesic subjects and patients with mild memory-based cognitive impairment (see Box 2), we have only scratched the surface of such potential applications to many more clinical groups.

Finally, an important theme in the categorization literature is that there may be multiple systems of categorization (e.g., Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Erickson & Kruschke, 1998; Johansen & Palmeri, 2002; Nosofsky, Palmeri, & McKinley, 1994). A classic idea, for example, is that many categories may be represented and processed by forming and evaluating logical rules. In some modern work that pursues this avenue, researchers have considered the RT predictions of logical-rule models of classification. Furthermore, such approaches have been used to develop sharp contrasts between the predictions of the EBRW model and rule-based forms of category representation and processing (Fific, Little, & Nosofsky, 2010; Lafond, Lacouture, & Cohen, 2009; Little, Nosofsky, & Denton, 2011). In domains involving highly separable-dimension stimuli, in which the category structures can be described in terms of exceedingly simple logical rules, evidence has been mounting that the logical-rule models provide better accounts of the detailed patterns of RT data than does the EBRW model. An important target for future research is to develop a deeper understanding of these multiple forms of categorization, to learn about the experimental factors that promote the use of each strategy, and to explain the manner in which exemplar-based and rule-based systems may interact.

Acknowledgments
This work was partially supported by NSF grants SMA-1041755 and SBE-1257098 and by AFOSR grant FA9550-14-1-0307.

Notes
1. In the context of the GCM, the parameter γ is referred to as a response-scaling parameter. When γ = 1, observers respond by "probability-matching" to the relative summed similarities. As γ grows greater than 1, observers respond more deterministically with the category that yields the greater summed similarity (Ashby & Maddox, 1993; McKinley & Nosofsky, 1995).
2. The version of the EBRW model described in this chapter is applicable to "integral-dimension" stimuli, which are encoded and perceived holistically. A common example of such integral-dimension stimuli is colors varying in hue, brightness, and saturation. Because there has been extensive previous scaling work indicating that similarity relations among these stimuli are extremely well described in terms of these dimensions, we often use these stimuli in our tests of the EBRW model. An extended version of the EBRW model has also been developed that is applicable to separable-dimension stimuli (Cohen & Nosofsky, 2003). In this version, rather than encoding stimuli in holistic fashion, the encoding of individual stimulus dimensions is a stochastic process, and similarity

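The chapter's closing discussion notes that in the EBRW model the random-walk step probabilities (i.e., drift rates) emerge from summed similarities, which are in turn governed by overall sensitivity, attention weights, and exemplar memory strengths. As a rough illustration of that machinery, here is a minimal Python sketch; the parameter values, the weighted city-block distance, and the simple ±1 counter are illustrative assumptions for demonstration, not the authors' implementation:

```python
import math
import random

# Illustrative sketch only: parameter values, the city-block metric, and the
# single +/-1 counter are demonstration assumptions, not the chapter's code.

def gcm_similarity(x, exemplar, weights, c):
    """GCM-style similarity: exponential decay with weighted distance,
    where c is the overall sensitivity parameter."""
    dist = sum(w * abs(a - b) for w, a, b in zip(weights, x, exemplar))
    return math.exp(-c * dist)

def step_prob_toward_a(x, exemplars_a, exemplars_b, weights, c):
    """EBRW-style step probability: the chance that the next retrieved
    exemplar belongs to category A, proportional to summed similarity."""
    s_a = sum(gcm_similarity(x, e, weights, c) for e in exemplars_a)
    s_b = sum(gcm_similarity(x, e, weights, c) for e in exemplars_b)
    return s_a / (s_a + s_b)

def ebrw_trial(x, exemplars_a, exemplars_b, weights, c, threshold=5, seed=0):
    """Random walk: step +1 toward A when an A exemplar is retrieved,
    -1 toward B otherwise; respond when a threshold is crossed.
    Returns (choice, n_steps); n_steps is a proxy for decision time."""
    rng = random.Random(seed)
    p = step_prob_toward_a(x, exemplars_a, exemplars_b, weights, c)
    counter = steps = 0
    while abs(counter) < threshold:
        counter += 1 if rng.random() < p else -1
        steps += 1
    return ("A" if counter > 0 else "B"), steps
```

A stimulus close to category A's exemplars yields a step probability near 1 and hence fast, accurate "A" responses, while a stimulus near the category boundary yields a probability near .5 and a slow, variable walk — the qualitative route by which the model links similarity to RT.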
relations between a test item and the stored exemplars change dynamically during the time course of processing (see also Lamberts, 2000).
3. Because accuracies were near ceiling in the present experiment, we focused our analysis primarily on the patterns of RT data. However, in a follow-up study, Nosofsky and Alfonso-Reese (1999) tested conditions that allowed examination of how both speed and accuracy changed during the early stages of learning. By including the background-activation parameter b in its arsenal, the EBRW model provided good quantitative fits not only to the speed-up in mean RTs, but to the improvements in choice-accuracy data as well. (As noted by Nosofsky and Palmeri, 1997a, p. 291, with b = 0, the EBRW model does not predict changes in response accuracy.)
4. The original EBRW model (Nosofsky & Palmeri, 1997a) applies to two-alternative forced-choice responses. There is a single accumulator whose value increases or decreases as evidence accumulates in the random walk until an upper or lower response threshold is reached. Numerosity judgments in Palmeri (1997) permitted six possible responses, so EBRW was extended to allow multiple alternatives. Each response alternative was associated with its own counter, so with six numerosity responses there were six counters. Whenever an exemplar was retrieved with the label associated with a particular counter, the value of that counter was incremented. A response was made whenever the value of one of the counters exceeded all the rest by some relative amount. With only two alternatives, this multiple-counter model with a relative-threshold response rule generally mimics a standard random-walk model with one accumulator with a positive and negative threshold.
5. There is a long history of debate, too extensive to be reviewed in this chapter, between the proponents of exemplar and prototype models. For examples of recent research that has argued in favor of the prototype view, see Minda and Smith (2001), Smith and Minda (1998, 2000), and Smith (2002). For examples from the exemplar perspective, see Nosofsky (2000), Nosofsky and Zaki (2002), Palmeri and Flanery (2002), and Zaki and Nosofsky (2007).
6. The version of the exemplar-recognition model reported by Donkin and Nosofsky (2012a) assumed a linear-ballistic accumulation process (Brown & Heathcote, 2008) instead of a random-walk accumulation process. However, the same evidence for a power-law relation between memory strength and lag was obtained regardless of the specific accumulation process that was assumed. We should note as well that, in fitting complete RT distributions for correct and error responses, such as occurred in the Donkin and Nosofsky (2012a) experiment, the exemplar model makes provision for drift-rate variability and response-threshold variability across trials (e.g., Donkin & Nosofsky, 2012a; Nosofsky & Stanton, 2006), in a manner analogous to the approach used in Ratcliff's diffusion model (e.g., Ratcliff, Van Zandt, & McKoon, 1999).

Glossary
Attention-weight parameters: a set of parameters in the GCM and EBRW models that describe the extent to which each dimension is weighted when computing distances among objects.
Automaticity: ability to perform some task at a satisfactory level without requiring conscious attention or effort and without limits in capacity.
Background element: a hypothetical construct in the GCM and EBRW models that describes initial background noise in people's memories for members of alternative categories.
Basic-level of categorization: an intermediate level of a category hierarchy that is hypothesized to lead to privileged forms of cognitive processing.
Categorization: process in which observers classify distinct objects into groups.
Criterion element: a hypothetical entity in the EBRW recognition model. Retrieval of criterion elements leads the random walk to step in the direction of the NEW response threshold. Biases in the random-walk step probabilities are determined by the strength of the criterion elements.
Decision-boundary model: model of categorization that assumes that people form boundaries to divide a stimulus space into category-response regions.
Exemplar model: model of categorization that assumes that observers store individual examples of categories in memory.
Exemplar-based random-walk (EBRW) model: an extension of the generalized context model that explains how the processes of categorization and recognition unfold over time. Exemplars stored in memory "race" to be retrieved, and the retrieved exemplars drive a random-walk decision-making process.
Exponential distribution: a probability distribution that describes the time between events, in which the events occur continuously and independently at a constant average rate.
Generalized context model (GCM): a member of the class of exemplar models. In the GCM, exemplars are represented as points in a multidimensional psychological space, and similarity is a decreasing function of distance in the space.
Integral-dimension stimuli: stimuli composed of individual dimensions that combine into unitary, integral wholes.
Memory strength parameters: parameters in the GCM and EBRW models that describe the strength with which the exemplars are stored in memory.
Minkowski power model: a model for computing distances between points in a space.
Multidimensional scaling: a modeling technique for representing similarity relations among objects. The objects are represented as points in a multidimensional psychological space, and similarity is a decreasing function of the distance between the points in the space.
Prototype model: model of categorization that assumes that observers represent categories by forming a summary representation, usually assumed to be the central tendency of the category distribution.

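Note 4 above describes the multiple-counter extension used for six-alternative numerosity judgments: each retrieved exemplar increments the counter matching its label, and a response is made once one counter leads all the others by some relative amount. The following is a hedged Python sketch of that response rule; the retrieval probabilities and the lead k below are illustrative assumptions (in the full model, retrieval probabilities would be derived from summed similarities of the probe to the exemplars bearing each label):

```python
import random

# Sketch of the relative-threshold counter race described in Note 4.
# retrieval_probs are illustrative placeholders, not values from the chapter.

def counter_race(retrieval_probs, k=3, seed=1):
    """One counter per response alternative; each retrieval increments the
    counter whose label was sampled. Respond when some counter exceeds all
    others by k. Returns (winning_label, n_retrievals)."""
    rng = random.Random(seed)
    n_alt = len(retrieval_probs)
    counters = [0] * n_alt
    retrievals = 0
    while True:
        # sample the label of the next retrieved exemplar
        r, cum = rng.random(), 0.0
        for label, p in enumerate(retrieval_probs):
            cum += p
            if r < cum:
                counters[label] += 1
                break
        retrievals += 1
        # check the relative-threshold stopping rule
        for label in range(n_alt):
            others = max(counters[j] for j in range(n_alt) if j != label)
            if counters[label] - others >= k:
                return label, retrievals
```

With only two alternatives, incrementing one counter is equivalent to a step away from the other, so this relative-threshold rule behaves much like the single-accumulator random walk with upper and lower thresholds, as the note observes.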
Prototype-based random-walk (PBRW) model: a model that is analogous in all respects to the EBRW model, except that the category representation corresponds to the central tendency of each category distribution rather than to the individual exemplars.
Random walk: a mathematical model that describes a path of outcomes consisting of a sequence of random steps.
Response thresholds: parameters in evidence-accumulation models that determine how much evidence is required before a response is initiated.
Recognition memory: process in which observers decide whether objects are "old" (previously experienced) or "new."
Response-scaling parameter: a parameter in the GCM that describes the extent to which observers respond using probabilistic versus deterministic response rules.
RT-distance hypothesis: hypothesis that categorization RT is a decreasing function of the distance of a stimulus from the decision boundary.
Rule-plus-exception model: model of categorization that assumes that people classify objects by forming simple logical rules and remembering occasional exceptions to those rules.
Sensitivity parameter: a parameter in the GCM and EBRW models that describes overall discriminability among distinct items in the multidimensional psychological space.
Short-term probe-recognition task: task in which observers are presented with a short list of to-be-remembered items followed by a test probe. The observers judge as rapidly as possible, while trying to minimize errors, whether the probe is old or new.

References
Anderson, J. R., & Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2, 396–408.
Ashby, F. G. (2000). A stochastic version of general recognition theory. Journal of Mathematical Psychology, 44, 310–329.
Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychological Review, 105, 442–481.
Ashby, F. G., Boynton, G., & Lee, W. W. (1994). Categorization response time with multidimensional stimuli. Perception & Psychophysics, 55, 11–27.
Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar, and decision bound models of categorization. Journal of Mathematical Psychology, 37, 372–400.
Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154–179.
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice reaction time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Bundesen, C. (1990). A theory of visual attention. Psychological Review, 97, 523–547.
Busemeyer, J. R. (1982). Choice behavior in a sequential decision-making task. Organizational Behavior and Human Performance, 29, 175–207.
Busemeyer, J. R. (1985). Decision making under uncertainty: A comparison of simple scalability, fixed-sample, and sequential-sampling models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 538–564.
Carroll, J. D., & Wish, M. (1974). Models and methods for three-way multidimensional scaling. In D. H. Krantz, R. C. Atkinson, R. D. Luce, & P. Suppes (Eds.), Contemporary developments in mathematical psychology (Vol. 2). San Francisco: W. H. Freeman.
Clark, S. E., & Gronlund, S. D. (1996). Global matching models of recognition memory: How the models match the data. Psychonomic Bulletin & Review, 3, 37–60.
Cohen, A. L., & Nosofsky, R. M. (2003). An extension of the exemplar-based random-walk model to separable-dimension stimuli. Journal of Mathematical Psychology, 47, 150–165.
Davis, T., Love, B. C., & Preston, A. R. (2012). Learning the exception to the rule: Model-based fMRI reveals specialized representations for surprising category members. Cerebral Cortex, 22, 260–273.
De Baene, W., Ons, B., Wagemans, J., & Vogels, R. (2008). Effects of category learning on the stimulus selectivity of macaque inferior temporal neurons. Learning & Memory, 15(9), 717–727.
Donkin, C., & Nosofsky, R. M. (2012a). A power-law model of psychological memory strength in short-term and long-term recognition. Psychological Science, 23, 625–634.
Donkin, C., & Nosofsky, R. M. (2012b). The structure of short-term memory scanning: An investigation using response-time distribution models. Psychonomic Bulletin & Review, 19, 363–394.
Erickson, M. A., & Kruschke, J. K. (1998). Rules and exemplars in category learning. Journal of Experimental Psychology: General, 127, 107–140.
Estes, W. K. (1994). Classification and cognition. New York: Oxford University Press.
Fific, M., Little, D. R., & Nosofsky, R. M. (2010). Logical-rule models of classification response times: A synthesis of mental-architecture, random-walk, and decision-bound approaches. Psychological Review, 117, 309–348.
Folstein, J., Gauthier, I., & Palmeri, T. J. (2012). Not all morph spaces stretch alike: How category learning affects object perception. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38(4), 807–820.
Folstein, J., Palmeri, T. J., & Gauthier, I. (2013). Category learning increases discriminability of relevant object dimensions in visual cortex. Cerebral Cortex, 23(4), 814–823.
Garner, W. R. (1974). The processing of information and structure. Potomac, MD: Erlbaum.
Gauthier, I., & Palmeri, T. J. (2002). Visual neurons: Categorization-based selectivity. Current Biology, 12, R282–R284.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–65.
Grill-Spector, K., & Kanwisher, N. (2005). Visual recognition: As soon as you know it is there, you know what it is. Psychological Science, 16, 152–160.
Gureckis, T. M., James, T. W., & Nosofsky, R. M. (2011). Re-evaluating dissociations between implicit and explicit

category learning: An event-related fMRI study. Journal of Cognitive Neuroscience, 23, 1697–1709.
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528–551.
Jolicoeur, P., Gluck, M. A., & Kosslyn, S. M. (1984). Pictures and names: Making the connection. Cognitive Psychology, 16, 243–275.
Johansen, M. K., & Palmeri, T. J. (2002). Are there representational shifts during category learning? Cognitive Psychology, 45, 482–553.
Kahana, M. J., & Sekuler, R. (2002). Recognizing spatial patterns: A noisy exemplar approach. Vision Research, 42, 2177–2192.
Knowlton, B., & Squire, L. (1993). The learning of categories: Parallel brain systems for item memory and category knowledge. Science, 262(5140), 1747–1749.
Lafond, D., Lacouture, Y., & Cohen, A. L. (2009). Decision tree models of categorization response times, choice proportions, and typicality judgments. Psychological Review, 116, 833–855.
Lamberts, K. (2000). Information-accumulation theory of speeded categorization. Psychological Review, 107, 227–260.
Link, S. W. (1992). The wave theory of difference and similarity. Hillsdale, NJ: Erlbaum.
Little, D. R., Nosofsky, R. M., & Denton, S. E. (2011). Response-time tests of logical-rule models of categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 1–27.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527.
Logan, G. D. (1997). The CODE theory of visual attention: An integration of space-based and object-based attention. Psychological Review, 103, 603–649.
Mack, M. L., Gauthier, I., Sadr, J., & Palmeri, T. J. (2008). Object detection and basic-level categorization: Sometimes you know it is there before you know what it is. Psychonomic Bulletin & Review, 15(1), 28–35.
Mack, M. L., Preston, A. R., & Love, B. C. (2013). Decoding the brain's algorithm for categorization from its neural implementation. Current Biology, 23(20), 2023–2027.
Mack, M. L., & Palmeri, T. J. (2010). Decoupling object detection and categorization. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1067–1079.
Mack, M. L., & Palmeri, T. J. (2011). The timing of visual object categorization. Frontiers in Psychology, 2:165.
Mack, M. L., Wong, A. C.-N., Gauthier, I., Tanaka, J. W., & Palmeri, T. J. (2009). Time-course of visual object categorization: Fastest does not necessarily mean first. Vision Research, 49, 1961–1968.
Mack, M. L., Wong, A. C.-N., Gauthier, I., Tanaka, J. W., & Palmeri, T. J. (2007). Unraveling the time-course of perceptual categorization: Does fastest mean first? In Proceedings of the Twenty-Ninth Annual Meeting of the Cognitive Science Society.
Marley, A. A. J. (1992). Developing and characterizing multidimensional Thurstone and Luce models for identification and preference. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 299–333). Hillsdale, NJ: Erlbaum.
McElree, B., & Dosher, B. A. (1989). Serial position and set size in short-term memory: The time course of recognition. Journal of Experimental Psychology: General, 118, 346–373.
McKinley, S. C., & Nosofsky, R. M. (1995). Investigations of exemplar and decision bound models in large, ill-defined category structures. Journal of Experimental Psychology: Human Perception and Performance, 21(1), 128.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207–238.
Minda, J. P., & Smith, J. D. (2001). Prototypes in category learning: The effects of category size, category structure, and stimulus complexity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27, 775–799.
Monsell, S. (1978). Recency, immediate recognition memory, and reaction time. Cognitive Psychology, 10, 465–501.
Murdock, B. B., Jr. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626.
Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10, 104–114.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39–57.
Nosofsky, R. M. (1987). Attention and learning processes in the identification and categorization of integral stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 87–109.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 700–708.
Nosofsky, R. M. (1991). Tests of an exemplar model for relating perceptual classification and recognition memory. Journal of Experimental Psychology: Human Perception and Performance, 17, 3–27.
Nosofsky, R. M. (1992). Similarity scaling and cognitive process models. Annual Review of Psychology, 43, 25–53.
Nosofsky, R. M. (2000). Exemplar representation without generalization? Comment on Smith and Minda's (2000) "Thirty categorization results in search of a model". Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1735–1743.
Nosofsky, R. M., & Alfonso-Reese, L. A. (1999). Effects of similarity and practice on speeded classification response times and accuracies: Further tests of an exemplar-retrieval model. Memory & Cognition, 27(1), 78–93.
Nosofsky, R. M., Denton, S. E., Zaki, S. R., Murphy-Knudsen, A. F., & Unverzagt, F. W. (2012). Studies of implicit prototype extraction in patients with mild cognitive impairment and early Alzheimer's disease. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 860–880.
Nosofsky, R. M., Little, D. R., Donkin, C., & Fific, M. (2011). Short-term memory scanning viewed as exemplar-based categorization. Psychological Review, 118, 280–315.

Nosofsky, R. M., Little, D. R., & James, T. W. (2012). Activation in the neural network responsible for categorization and recognition reflects parameter changes. Proceedings of the National Academy of Sciences, 109, 333–338.
Nosofsky, R. M., & Palmeri, T. J. (1997a). An exemplar-based random walk model of speeded classification. Psychological Review, 104, 266–300.
Nosofsky, R. M., & Palmeri, T. J. (1997b). Comparing exemplar-retrieval and decision-bound models of speeded perceptual classification. Perception & Psychophysics, 59, 1027–1048.
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101, 53–79.
Nosofsky, R. M., & Stanton, R. D. (2005). Speeded classification in a probabilistic category structure: Contrasting exemplar-retrieval, decision-boundary, and prototype models. Journal of Experimental Psychology: Human Perception and Performance, 31, 608–629.
Nosofsky, R. M., & Stanton, R. D. (2006). Speeded old-new recognition of multidimensional perceptual stimuli: Modeling performance at the individual-participant and individual-item levels. Journal of Experimental Psychology: Human Perception and Performance, 32, 314–334.
Nosofsky, R. M., & Zaki, S. R. (1998). Dissociations between categorization and recognition in amnesic and normal individuals: An exemplar-based interpretation. Psychological Science, 9(4), 247–255.
Nosofsky, R. M., & Zaki, S. R. (2002). Exemplar and prototype models revisited: Response strategies, selective attention, and stimulus generalization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 924–940.
Nosofsky, R. M., & Zaki, S. R. (2003). A hybrid-similarity exemplar model for predicting distinctiveness effects in perceptual old-new recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 1194–1209.
Palmeri, T. J. (1997). Exemplar similarity and the development of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 324–354.
Palmeri, T. J. (1999). Theories of automaticity and the power law of practice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 543–551.
Palmeri, T. J. (2014). An exemplar of model-based cognitive neuroscience. Trends in Cognitive Sciences, 18(2), 67–69.
Palmeri, T. J., & Cottrell, G. (2009). Modeling perceptual expertise. In D. Bub, M. Tarr, & I. Gauthier (Eds.), Perceptual expertise: Bridging brain and behavior. Oxford University Press.
Palmeri, T. J., & Flanery, M. A. (1999). Learning about categories in the absence of training. Psychological Science, 10(6), 526–530.
Palmeri, T. J., & Flanery, M. A. (2002). Memory systems and perceptual categorization. In B. H. Ross (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 41, pp. 141–189). San Diego: Academic.
Palmeri, T. J., Schall, J. D., & Logan, G. D. (2014). Neurocognitive modeling of perceptual decision making. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), Oxford handbook of computational and mathematical psychology. Oxford University Press.
Palmeri, T. J., Wong, A. C.-N., & Gauthier, I. (2004). Computational approaches to the development of perceptual expertise. Trends in Cognitive Sciences, 8, 378–386.
Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353–363.
Purcell, B. A., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2012). From salience to saccades: Multiple-alternative gated stochastic accumulator model of visual search. Journal of Neuroscience, 32(10), 3433–3446.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R., Van Zandt, T., & McKoon, G. (1999). Connectionist and diffusion models of reaction time. Psychological Review, 106, 261–300.
Reber, P. J., Gitelman, D. R., Parrish, T. B., & Mesulam, M. M. (2003). Dissociating explicit and implicit category knowledge with fMRI. Journal of Cognitive Neuroscience, 15(4), 574–583.
Reber, P. J., Stark, C. E. L., & Squire, L. R. (1998a). Cortical areas supporting category learning identified using fMRI. Proceedings of the National Academy of Sciences, 95, 747–750.
Reber, P. J., Stark, C. E. L., & Squire, L. R. (1998b). Contrasting cortical activity associated with category memory and recognition memory. Learning and Memory, 5, 420–428.
Richler, J. J., & Palmeri, T. J. (2014). Visual category learning. Wiley Interdisciplinary Reviews: Cognitive Science, 5, 75–94.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization. Hillsdale, NJ: Erlbaum.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Rouder, J. N., & Ratcliff, R. (2004). Comparing categorization models. Journal of Experimental Psychology: General, 133, 63–82.
Shepard, R. N. (1957). Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22, 325–345.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Shin, H. J., & Nosofsky, R. M. (1992). Similarity scaling studies of dot pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278–304.
Sigala, N., & Logothetis, N. K. (2002). Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, 415(6869), 318–320.
Smith, E. E. (2008). The case for implicit category learning. Cognitive, Affective, & Behavioral Neuroscience, 8, 3–16.
Smith, E. E., & Medin, D. L. (1981). Categories and concepts. Cambridge, MA: Harvard University Press.
Smith, J. D. (2002). Exemplar theory's predicted typicality gradient can be tested and disconfirmed. Psychological Science, 13, 437–442.
Smith, J. D., & Minda, J. P. (1998). Prototypes in the mist: The early epochs of category learning. Journal of

Experimental Psychology: Learning, Memory, and Cognition, 24, 1411–1436.
Smith, J. D., & Minda, J. P. (2000). Thirty categorization results in search of a model. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 3–27.
Sternberg, S. (1966). High speed scanning in human memory. Science, 153, 652–654.
Tanaka, J. W., & Taylor, M. (1991). Object categories and expertise: Is the basic level in the eye of the beholder? Cognitive Psychology, 23, 457–482.
Verguts, T., Storms, G., & Tuerlinckx, F. (2003). Decision-bound theory and the influence of familiarity. Psychonomic Bulletin & Review, 10, 141–148.
Wickelgren, W. A. (1974). Single-trace fragility theory of memory dynamics. Memory & Cognition, 2, 775–780.
Wills, A. J., & Pothos, E. M. (2012). On the adequacy of current empirical evaluations of formal models of categorization. Psychological Bulletin, 138, 102–125.
Wixted, J. T. (2004). On common ground: Jost's (1897) law of forgetting and Ribot's (1881) law of retrograde amnesia. Psychological Review, 111, 864–879.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2, 409–415.
Zaki, S. R., & Nosofsky, R. M. (2007). A high-distortion enhancement effect in the prototype-learning paradigm: Dramatic effects of category learning during test. Memory & Cognition, 35, 2088–2096.
Zaki, S. R., Nosofsky, R. M., Jessup, N. M., & Unverzagt, F. W. (2003). Categorization and recognition performance of a memory-impaired group: Evidence for single-system models. Journal of the International Neuropsychological Society, 9(3), 394–406.
Zandbelt, B. B., Purcell, B. A., Palmeri, T. J., Logan, G. D., & Schall, J. D. (2014). Response times from ensembles of accumulators. Proceedings of the National Academy of Sciences, 111(7), 2848–2853.

CHAPTER 8

Models of Episodic Memory

Amy H. Criss and Marc W. Howard

Abstract
Episodic memory refers to memory for specific episodes from one's life, such as working in the garden yesterday afternoon while enjoying the warm sun and chirping birds. In the laboratory, the study of episodic memory has been dominated by two tasks: single item recognition and recall. In single item recognition, participants are presented with a cue and asked whether they remember it appearing during the event in question (e.g., a specific flower from the garden); in free recall, they are asked to generate all aspects of the event. Models of episodic memory have focused on describing detailed patterns of performance in these and other laboratory tasks believed to be sensitive to episodic memory. This chapter reviews models of recognition, with a specific emphasis on REM (Shiffrin & Steyvers, 1997), and models of recall, with a focus on TCM (Howard & Kahana, 2002). We conclude that the current state of affairs, with no unified model of multiple memory tasks, is unsatisfactory, and offer suggestions for addressing this gap.

Key Words: episodic memory, computational modeling, free recall, recognition

Tulving (1972, 1983, 2002) coined the term episodic memory to refer to the ability to vividly remember specific episodes from one's life. Episodic memory is often framed in contrast to other forms of memory which are not accompanied by the same experience. For instance, asking a subject "what did you have for breakfast?" usually elicits an episodic memory. In the process of answering the question, subjects will sometimes report re-experiencing the event as if they were present. They might remember being in their kitchen, with the morning sun shining in and the sound of the radio, the smell of the coffee, in the process of retrieving the information that they had a bagel for breakfast. Other times, subjects may not have memory for all associated details, but instead have a fuzzy general memory for the event. Both vivid and fuzzy episodic memories are situated in a particular spatio-temporal context. The nature of these two types of episodic experiences is under debate (see Box 1). In contrast, subjects can frequently answer factual questions, such as "what is the capital of France?" without any knowledge about the specific moment when they learned that piece of information. The association of a memory with a specific spatial and temporal context is considered the hallmark of episodic memory.

This chapter is about mathematical and computational models of episodic memory. This is something of an unusual topic. One could argue that there are no mathematical models of episodic memory as defined here. To date, there are no quantitative models that have attempted to describe how or why the distinctive internal experience associated with episodic memory sometimes takes place and sometimes does not. In contrast, models of episodic memory have focused on describing detailed patterns of performance in a set of laboratory memory tasks believed to be sensitive

to episodic memory. This is extremely important because the experimenter must have control over the stimuli the participant has experienced in order to evaluate the success or failure of memory retrieval.

In the laboratory, the study of episodic memory has been dominated by two tasks: single item recognition and recall. In both recognition and recall tasks, subjects are typically presented with a series of stimuli—a list—and then tested on their memory for that experience later. In recognition, participants discriminate between studied targets and unstudied foils. In recall, participants are asked to generate some member of the stimulus set. There exist successful process models for both individual tasks but no single model that captures a wide range of theoretically and empirically important data in both tasks. In part, this is because a number of variables differentially affect performance in these two tasks, and in part because the methodological details in the two domains often vary in ways that preclude direct comparison. The division of this chapter into recognition and recall sections reflects the fact that efforts to provide a common modeling framework have not been successful thus far. This is a major gap in our understanding—a unified model of episodic memory that provides a quantitative description of the data from the various paradigms would be much preferable. In this chapter, we first present an overview of the important data and models in each area.

Box 1. Dual process models of recognition

You go to the grocery store and pass many other shoppers. You pass one shopper who seems familiar. You consider saying hello, but you can't quite figure out how you know this person. If you were asked to perform an item recognition test ("have you seen this person before?"), you would have been much more likely to say yes for this shopper than for one of the other shoppers. Perhaps you would even express high confidence that you had seen the familiar shopper before. After thinking about it for a while, you might later remember that you met this person at a meeting last semester. You might remember his or her name, position, and even details of your interaction and be able to report these pieces of information.

The ability to distinguish this one familiar face from all the other unfamiliar faces you passed in the grocery store certainly requires some form of memory. Similarly, the ability to remember the details about your experience with that person also requires some form of memory. The question is whether those two abilities are best understood as points along a continuum or as distinct forms of memory, typically referred to as familiarity—a general sense of knowing that a probe is old—and recollection—the vivid recall of specific details about the probe. This question has been a major source of disagreement. It has been actively pursued in mathematical modeling (e.g., Klauer & Kellen, 2010; DeCarlo, 2003), behavioral studies (e.g., Hintzman & Curran, 1994; Rotello, Macmillan, Reeder, & Wong, 2005), and a wide variety of cognitive neuroscience techniques (Fortin, Agster, & Eichenbaum, 2002; Rugg & Curran, 2007; Staresina, Fell, Dunn, Axmacher, & Henson, 2013; Wilding, Doyle, & Rugg, 1995; Wixted & Squire, 2011). A major problem leading to this debate has been difficulty in extracting satisfactory measures of these two putative latent processes using observable data. Although signal detection approaches to this problem have been popular (Wixted, 2007; Yonelinas & Parks, 2007), signal detection is by no means definitive (Malmberg, 2002; Province & Rouder, 2012).

Models of Recognition Memory

Recognition memory tests are among the most widely used experimental paradigms for the study of episodic memory. Here, a to-be-remembered event is created, typically in the form of a list of individually presented items (words, pictures, etc.). After a delay ranging from a few seconds to 7 or more days, memory is tested by intermixing the studied items (from the to-be-remembered list) along with foils that did not occur during the study episode. In forced choice recognition, a target item is presented alongside a foil or foils and participants are instructed to select the target. In single item recognition, test items are presented one by one and the participant is asked to endorse studied items and reject foil items. Multiple measures describe performance: a hit is correctly endorsing a studied item, a correct rejection is correctly identifying a foil, a false alarm is incorrectly endorsing a foil as having been

studied, and a miss is incorrectly rejecting a studied item. Although measuring accuracy in recognition memory is itself an active topic of research, in general, the larger the difference between the hit rate (the proportion of hits to old probes) and the false alarm rate (the proportion of false alarms to new probes), the more accurate is episodic memory. In forced choice, the measurement of performance is simply percent correct. Based on these and other measures, such as the time for providing a response or the confidence associated with the response, a number of detailed mathematical models have been developed to explain episodic memory.

Global matching models

The global matching models were successful for decades (Gillund & Shiffrin, 1984; Hintzman, 1984; Humphreys, Bain, & Pike, 1989; Murdock, 1982). The premise of these models was that the search of episodic memory included a comparison to a relevant set of items and that the memory decision was based on the overall or global match between the probe and the set to which it was compared.

the matched filter model

The basic idea of global matching models is concisely illustrated by Anderson's matched filter model (Anderson, 1973). Let the list be composed of N unique items represented as vectors f_i, where the subscript denotes which of the N vectors is referred to. Let the vectors be randomly chosen Gaussian deviates such that

E[ f_i^T f_j ] = δ_ij    (1)

(where δ_ij = 1 if i = j and δ_ij = 0 otherwise) and the similarity of the vectors to have variance given by

Var( f_i^T f_j ) = σ².

We will treat σ as a free parameter but, in general, it will be specified by the distribution of the features of f_i. Now, in the matched filter model, the list is represented simply as the sum of the list items. At each time step, the sum m_i is updated according to

m_i = m_{i−1} + f_i,    (2)

such that at the end of the list

m_N = Σ_{i=1}^{N} f_i.    (3)

To model recognition, we take the vector corresponding to a probe and match it against the list. Denoting the vector corresponding to the probe stimulus as f_p, we have

D_p = f_p^T m_N = f_p^T Σ_i f_i.    (4)

Notice that this decision variable has different distributions depending on whether the probe was on the list. If the probe is old, then f_p should match one of the list items, resulting in

E[D_p] = 1 if the probe is old, 0 if it is new.    (5)

Moreover, the variance of D_p is a function of the length of the list:

Var(D_p) = N σ².    (6)

The matched filter model is an example of a global matching model. It is a matching model because the similarity of the probe stimulus to the contents of memory is calculated and drives the decision process. It is a global matching model because the match is calculated not only to information stored about the probe stimulus during study; rather, the match from other study items also contributes to the decision. This concept is perhaps best understood in contrast to direct access models (Dennis & Humphreys, 2001; Glanzer & Adams, 1990) that posit a direct comparison of the test item to its corresponding memory trace.

In global matching models, errors in memory result from similarity between the test item and memory traces with similar features. More formally, the probability that D_p is greater than some criterion is higher for old probes than for new probes; for a fixed criterion c, P(D_p > c | old) > P(D_p > c | new). However, in order to correspond to performance in the task (which is typically far from perfect), the criterion (and the variances) must be chosen such that the match from some new probes exceeds the criterion. According to the matched filter model, and global matching models more generally, this happens because a particular new probe happens to match well with the study list.

We can generate several other predictions from global matching models from these expressions as well. First, Eq. 6 tells us that the discriminability of the decision goes down with σ and with the length of the list N. If the vectors are chosen such that the similarity is a normal deviate, then the discriminability between the old and the new distributions is

d′ = 1 / (√N σ).

This makes a straightforward experimental prediction—that accuracy should go down as the length of the study list increases.
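The predictions in Eqs. 4–6 are easy to verify numerically. The sketch below is illustrative only; the dimensionality, list lengths, random seed, and number of simulated lists are arbitrary choices, not values from the chapter. It builds the memory vector m_N from random item vectors and shows that old probes yield a mean D_p near 1, new probes a mean near 0, and that discriminability falls as the list grows.

```python
import numpy as np

# A numerical sketch of the matched filter model (Eqs. 1-6).
# All parameter values here are arbitrary illustrative choices.
rng = np.random.default_rng(0)

def simulate(n_items, dim=1000, n_lists=500):
    """Return samples of the decision variable D_p for old and new probes."""
    d_old, d_new = [], []
    for _ in range(n_lists):
        # Entries scaled so E[f_i^T f_i] = 1 and E[f_i^T f_j] = 0 (Eq. 1).
        f = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(n_items, dim))
        m = f.sum(axis=0)                      # memory vector m_N (Eq. 3)
        new_probe = rng.normal(0.0, 1.0 / np.sqrt(dim), size=dim)
        d_old.append(f[0] @ m)                 # D_p = f_p^T m_N (Eq. 4)
        d_new.append(new_probe @ m)
    return np.array(d_old), np.array(d_new)

def dprime(d_old, d_new):
    pooled_sd = np.sqrt(0.5 * (d_old.var() + d_new.var()))
    return (d_old.mean() - d_new.mean()) / pooled_sd

# Eq. 5: mean D_p is about 1 for old probes and about 0 for new probes.
d_old, d_new = simulate(n_items=20)
print(d_old.mean(), d_new.mean())

# Eq. 6: variance grows with list length N, so discriminability
# d' = 1/(sqrt(N) sigma) shrinks as the study list gets longer.
short_dprime = dprime(*simulate(n_items=10))
long_dprime = dprime(*simulate(n_items=80))
print(short_dprime, long_dprime)
```

Because Var(D_p) grows linearly with N while the old-item mean stays fixed at 1, the estimated d′ for the longer list comes out well below that for the shorter list.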

This finding is typically observed in recognition memory experiments (Criss & Shiffrin, 2004; Ohrt & Gronlund, 1999; Shiffrin, Huber, & Marinelli, 1995; but see (21)).

the demise of global matching models

Two observations contributed to the demise of the global matching models for recognition memory: the generality of the mirror effect and the null list strength effect. The mirror effect refers to the finding that when the hit rate is higher for a particular experimental variable, the false alarm rate is lower. This is a challenge for global matching models because the strength of target items appears to leapfrog the foil items. Suppose we perform an experiment in which we observe some hit rate and false alarm rate. Now, we change the experiment such that each studied item is presented five times rather than just once. Now, the mean of D_p for the old probes will be 5 rather than 1, but the mean of D_p for the new probes will still be zero. The variance of the distributions should also increase, to 5Nσ². If the criterion c is fixed across experiments, we would expect the hit rate, P(D_p > c | old), to be higher for the list with the repeated stimuli. However, the false alarm rate should also increase, in contrast to the experimental results. In order to account for the mirror effect in the context of the matched filter model, we would have to assume that the criterion, c, changes across experiments, which is akin to unprincipled curve-fitting. The mirror effect is quite general and is observed for a wide variety of experimental variables including repetition, changes in presentation time, and word frequency (e.g., Glanzer & Adams, 1985, 1990). Most global matching models did not attempt to account for the mirror effect at all, and those that did relied on a changing criterion (Gillund & Shiffrin, 1984), which is largely inconsistent with the empirical data (e.g., Glanzer, Adams, Iverson, & Kim, 1993).

The null list strength effect posed a more fundamental challenge to global matching models. The null list strength effect is the finding that the strength of other study items does not affect recognition memory accuracy (Shiffrin, Ratcliff, & Clark, 1990; Ratcliff, Clark, & Shiffrin, 1990). To make this more concrete, consider two experiments. In one experiment, as before, we present all the items five times. We would expect accuracy to be much higher in this pure-strong condition relative to the pure-weak condition, where all the items are presented only once. But now consider a mixed list experiment in which half the list items are presented five times and half are presented only once. We refer to probes presented five times during study in the mixed list as mixed-strong probes and the probes presented only once as mixed-weak probes. Note that the mean of D_p for the pure-strong and mixed-strong probes should be identical. However, the variance should not be the same. The mixed-strong probes should be subject to less noise from the weak items in the mixed list and should thus have higher accuracy than the pure-strong probes. Following similar logic, we would expect accuracy to be lower for the mixed-weak probes than for the pure-weak probes due to additional interference from the strong items on the study list. In contrast to this very strong prediction, this pattern of results does not hold (Ratcliff et al., 1990; Shiffrin et al., 1990), reflecting a fundamental problem for global matching models.

The Retrieving Effectively from Memory (REM) Model

The pervasiveness of the mirror effect and the discovery of the null list strength effect created a paradigm shift wherein a new set of models incorporating Bayesian principles was developed to account for recognition memory.

The REM model (Shiffrin & Steyvers, 1997) is the most thoroughly explored of these approaches and we will focus on it extensively here. As in the global matching models, a probe is compared to each of the traces in memory. There are two key insights that allow REM to overcome the weaknesses of the global matching models. First, the comparison between the probe and the contents of memory incorporates both positive evidence of a match and negative evidence for nonmatch. That is, rather than relying simply on the absence of evidence, REM can incorporate positive evidence for absence. This is a powerful assumption. As a stimulus is studied more extensively, that stimulus can provide both more positive evidence that the trace matches an old probe and more negative evidence that it does not match a new probe. This provides a mechanism for altering the new item distribution rather than assuming that encoding is restricted to altering the target distribution. Second, the decision rule takes into account the nature of the environment and expected memory evidence based on that prior knowledge.

REM is a Bayesian model with the core assumption that processes underlying memory are optimal

given noisy information on which to base a decision. There are two types of memory traces in REM: lexical-semantic and episodic. Lexical-semantic traces are accumulated across the lifespan and are thus complete, accurate, and de-contextualized relative to episodic traces. Episodic traces are formed during a given episode and are updated with item, context, and sometimes associative features during each presentation in a given context. REM is a simulation model wherein a set of episodic and lexical-semantic traces is generated for each simulated subject, as described next.

representation

A memory trace consists of multiple types of information. Item features represent a broad range of information about the stimulus, including the meaning of the stimulus and orthographic-phonological units. Context represents the internal and external environment at the time of encoding. Associative features are sometimes generated and represent information relating multiple items (e.g., a stimulus-specific association formed during or prior to the experiment). All features are drawn from a geometric distribution with parameter g. The probability that a feature takes the value ν is

P(ν) = g (1 − g)^{ν−1}.    (7)

A geometric distribution assures that some features will be relatively common and others will be relatively rare. Evidence provided by a matching feature is a function of the base rate of that feature: matching a common feature provides less evidence than matching a rare feature.

storage

During each experience with a stimulus, the lexical-semantic trace for the stimulus is retrieved from memory and updated with the current context features. In a typical recognition memory experiment the lexical-semantic traces are used solely for the purposes of generating and testing episodic traces. Therefore, the theoretical principle of updating lexical-semantic traces with current context is typically not implemented in a simulation (cf. Schooler, Shiffrin, & Raaijmakers, 2001). An episodic memory trace is formed by storing each lexical-semantic feature and the context feature with some probability (u) per unit of time (t). Given that a feature is stored, the correct value is stored with some probability (c). Otherwise, a random value from the geometric distribution is stored. Features that are not stored during encoding are denoted by a zero, indicating a lack of information. Thus, episodic memory is incomplete (i.e., some features are not stored), prone to error (i.e., an incorrect feature value may be stored), and context-bound (i.e., contains a set of features representing the context).

Study of a pair results in the concatenation of two sets of item features and shared context features. Depending on the goals at encoding, associative features that capture relationships between the two stimuli may also be encoded in the vector (e.g., Criss & Shiffrin, 2004, 2005). Processes at retrieval are necessarily different for different tasks, as they depend on the information provided as a cue and the required output. Next we consider only recognition memory, but note that REM has been applied to multiple memory tasks including judgments of frequency (Malmberg, Holden, & Shiffrin, 2004), free recall (Malmberg & Shiffrin, 2005), cued recall (Diller, Nobel, & Shiffrin, 2001), and associative recognition (Criss & Shiffrin, 2004, 2005).

retrieval

According to REM, there are an immense number of traces that have been laid down over an extremely long time. In order to restrict the comparison to the relevant event, reinstated context features identifying the context to be searched are used to define the activated set. In a typical experiment with a single study list, this step simply limits the comparisons of the test cue to the study list and is often implemented by assumption for simplicity, as we do here. The basis for a memory decision in REM is the likelihood computation. The likelihood reflects evidence in favor of the test cue as the ratio of the probability that the cue matches a trace in memory given the data to the probability that the cue does not match an item in memory given the data. Here, the data refer to the match between the cue and the contents of the activated subset of episodic memory. The item features from test cue j are retrieved from its lexical-semantic trace and compared to each item in the activated set, indexed by i. A likelihood ratio, indicating how well test cue j matches memory trace i, is computed:

λ_ij = (1 − c)^{n_q} ∏_{ν=1}^{∞} [ (c + (1 − c) g_sys (1 − g_sys)^{ν−1}) / (g_sys (1 − g_sys)^{ν−1}) ]^{n_νm}    (8)
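The encoding and retrieval assumptions above can be sketched in a few lines of code. This is a simplified illustration, not the published REM implementation: it sets g_sys equal to the generating g, uses a single storage probability in place of separate u and t parameters, and all numerical values are arbitrary choices.

```python
import numpy as np

# A minimal sketch of REM's representation, storage, and likelihood
# computation (Eqs. 7 and 8). Parameter values are illustrative only.
rng = np.random.default_rng(1)
N_FEATURES = 20
G = 0.4        # geometric parameter generating item features (Eq. 7)
G_SYS = 0.4    # system's long-run base-rate estimate used in Eq. 8
STORE_P = 0.8  # probability a feature is stored (stands in for u and t)
C = 0.7        # probability a stored feature is copied correctly

def make_item():
    # Feature values nu = 1, 2, 3, ... with P(nu) = g (1 - g)^(nu - 1)
    return rng.geometric(G, size=N_FEATURES)

def encode(item):
    # Store each feature with prob STORE_P; copy it correctly with prob C,
    # otherwise store a random geometric value. Zero = nothing stored.
    trace = np.zeros(N_FEATURES, dtype=int)
    stored = rng.random(N_FEATURES) < STORE_P
    correct = rng.random(N_FEATURES) < C
    noise = rng.geometric(G, size=N_FEATURES)
    trace[stored & correct] = item[stored & correct]
    trace[stored & ~correct] = noise[stored & ~correct]
    return trace

def likelihood_ratio(cue, trace):
    # Eq. 8: each mismatching nonzero feature multiplies in (1 - c); each
    # matching feature with value nu multiplies in a ratio that is larger
    # for rarer (higher) values. Zeros (unstored features) are ignored.
    lam = 1.0
    for nu_cue, nu_trace in zip(cue, trace):
        if nu_trace == 0:
            continue
        if nu_cue == nu_trace:
            p_nu = G_SYS * (1 - G_SYS) ** (nu_trace - 1)
            lam *= (C + (1 - C) * p_nu) / p_nu
        else:
            lam *= 1 - C
    return lam

# Study a short list, then probe with a target and with an unstudied foil.
study = [make_item() for _ in range(8)]
memory = [encode(item) for item in study]
target, foil = study[0], make_item()

phi_target = np.mean([likelihood_ratio(target, t) for t in memory])
phi_foil = np.mean([likelihood_ratio(foil, t) for t in memory])
print(phi_target, phi_foil)  # endorse "old" when the mean odds exceed 1
```

The target's own trace contributes a large likelihood ratio to the average, while all of a foil's comparisons match only by chance, so the mean odds separate old from new probes.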

The g_sys parameter is the long-run base rate. This is a fixed value, estimated by the system based on experience. This base rate value may differ from the g values in Eq. 7, which gives the value of the g parameter for the stimulus itself. The number of nonzero features that mismatch is n_q, and the number of features that match and have the value ν is n_νm. Missing features (value of zero) are ignored. Note that the amount of evidence provided by a matching feature depends on the feature value. This is one way in which prior knowledge contributes to the decision. Specifically, in a geometric distribution, low values are common and are therefore likely to match by chance. These values provide little evidence when they match. In contrast, large values are uncommon and therefore unlikely to match by chance, providing greater evidence when they match. Thus, knowledge about the statistics of the environment (i.e., rarity of features, estimated by g_sys) learned over the course of life contributes to the evidence of match between a test cue and the contents of each memory trace. For single item recognition, the decision that test cue j was present during the relevant context is based on the average of the likelihood ratios. If the average exceeds a criterion, the item is endorsed as from the list; otherwise it is rejected.

Word Frequency and Null List Strength Effects in REM

REM can provide an account of the mirror effect based on properties of the words themselves, such as word frequency. The probability of a given feature value (g) and the expectation of feature values (g_sys) are both specified in REM. REM has used the g parameter to model the effects of normative word frequency. Specifically, low frequency (LF) words are assumed to have more uncommon features (i.e., a lower value of g is used to generate the stimuli). In contrast, high frequency (HF) words have relatively common features. The common features of HF words tend to match features of other memory traces by chance, increasing the false alarm rate. Additionally, the likelihood ratio includes prior information by taking into account the base rate of features (g_sys), such that matching unexpected features contributes more evidence in favor of endorsing the test item, increasing the hit rate for LF words. Together, the stimulus representation (g) and prior expectations (g_sys) generate a word frequency mirror effect, consistent with empirical data.

The null list strength account in REM is based on differentiation. The Subjective Likelihood Model (SLiM) of McClelland and Chappell (1998) shares the mechanism of differentiation and for that reason also predicts a null list strength effect for recognition. Differentiation refers to the idea that the more that is known about an item, the less confusable that item is with any other randomly chosen item. Obvious applications are a bird expert or a radiologist, who have such knowledge in their area of expertise that they can quickly and accurately identify a Rusty Blackbird or a cancerous tumor, whereas a novice simply sees a bird or a blurry grayscale image. In episodic memory, an item becomes differentiated by being well practiced within a specific contextual episode, for example by repetition in an experiment. Within the differentiation models, an episodic memory trace is updated during repetition, which causes the memory trace to be more accurate and more complete. Note that updating was a departure from the standard assumption of storing additional exemplars or additional memory traces with repetition (e.g., Hintzman, 1986). In REM, differentiation is implemented by assuming that if an item is recognized as having been previously experienced in a given context, then the best matching trace is updated such that any missing (zero-valued) features have the potential of being replaced in accordance with the encoding mechanism described earlier. If an item is not recognized as a repetition, then a new memory trace is stored. In the original REM model, updating only occurred during study for simplicity. However, more recent applications incorporate updating and encoding at test (Criss, Malmberg, & Shiffrin, 2010).

When a memory trace is stored with higher quality, that is, when more features are stored, not only is it a better match to a later comparison with the lexical-semantic trace from which it was generated, but it is also a poorer match to other test items. In Eq. 8, note that both the matching and mismatching features contribute to the overall evidence in favor of the test item. As the total number of stored features in a given memory trace increases, due to additional encoding, so too does the total number of features matching the target trace. Critically, the total number of mismatching features for any item other than the corresponding target also increases. The net result is that although a strengthened target item matches better and will be better remembered, it is not at the cost of the other studied items.
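This argument can be checked numerically. The sketch below is a deliberately stripped-down illustration of Eq. 8, with arbitrary parameters and noiseless partial traces rather than the published implementation: a trace that stores more features produces a larger likelihood ratio for its own item and, on average, a smaller one for any other item.

```python
import numpy as np

# A numerical illustration of differentiation via Eq. 8 (a sketch, not
# the published REM simulation). Parameter values are arbitrary.
rng = np.random.default_rng(2)
N_FEATURES, G, C = 20, 0.4, 0.7

def likelihood_ratio(cue, trace):
    # Eq. 8 with g_sys = g for simplicity; zeros (unstored) are ignored.
    lam = 1.0
    for nu_cue, nu_trace in zip(cue, trace):
        if nu_trace == 0:
            continue
        if nu_cue == nu_trace:
            p_nu = G * (1 - G) ** (nu_trace - 1)
            lam *= (C + (1 - C) * p_nu) / p_nu
        else:
            lam *= 1 - C              # each mismatch multiplies in (1 - c)
    return lam

def partial_trace(item, k):
    # A noiseless partial copy: the first k features stored, rest missing.
    trace = np.zeros(N_FEATURES, dtype=int)
    trace[:k] = item[:k]
    return trace

item = rng.geometric(G, size=N_FEATURES)
weak, strong = partial_trace(item, 8), partial_trace(item, 16)

# More stored features -> a better match to the trace's own item ...
lam_own_weak = likelihood_ratio(item, weak)
lam_own_strong = likelihood_ratio(item, strong)

# ... and, on average, a poorer match to unrelated test items, because the
# extra stored features mostly mismatch them (log-likelihoods averaged
# over many random foils).
foils = [rng.geometric(G, size=N_FEATURES) for _ in range(500)]
log_foil_weak = np.mean([np.log(likelihood_ratio(f, weak)) for f in foils])
log_foil_strong = np.mean([np.log(likelihood_ratio(f, strong)) for f in foils])
print(lam_own_weak, lam_own_strong, log_foil_weak, log_foil_strong)
```

The strengthened trace helps its own item and actively hurts every other comparison, which is exactly the logic behind the null list strength prediction.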

Differentiation models correctly predict that recognition memory is not harmed by increasing the strength of the other items on the study list. In fact, in some cases, the models predict a small negative list strength effect such that, for a given item, memory may slightly improve as the strength of other studied items increases (see Shiffrin et al., 1990).

In summary, the differentiation models were developed to address shortcomings of the global matching models, in particular their failure to capture the robust empirical findings of the word frequency (WF) mirror effect and the null list strength effect. In REM, the WF effect is due to the assumed distribution of features along with a Bayesian decision rule that gives more weight to unexpected matches or, alternatively, downplays expected matches. The null list strength effect is due to differentiation of well-learned items caused by updating memory traces.

The Empirical Consequences of Updating

The updating mechanism in REM was necessary to produce differentiation and to account for empirical data. Auspiciously, this same mechanism makes critical a priori predictions that appeared in the literature after the model was conceived. First, updating memory traces during the encoding of test items results in output interference. Output interference (OI) is the finding that memory accuracy decreases over the course of testing (Murdock & Anderson, 1975; Roediger & Schmidt, 1980; Tulving & Arbuckle, 1963, 1966; Wickens, Born, & Allen, 1963). Output interference is not a new finding, but a detailed understanding of the manifestation of OI in recognition memory is (Criss, Malmberg, & Shiffrin, 2011; Malmberg, Criss, Gangwani, & Shiffrin, 2012). Figure 8.1 shows a typical pattern of OI in recognition testing (left panel) along with predictions from REM. The middle panel shows predictions for REM where remembered items cause the best matching episodic trace to be updated, as described earlier. The right panel shows predictions for a version of REM where updating does not occur; instead, a new trace is added to memory for each test item. Both the predictions of REM with updating and the data show a dramatic decrease in the rate of endorsing target items as old as a function of test position and a nearly flat function for foils. In contrast, the multitrace version of REM, in which additional traces are stored with each test item, predicts a shallow decrease in the hit rate along with an increase in the false-alarm rate. Both implementations of REM predict output interference in the sense that overall accuracy (e.g., d-prime) decreases across test position. However, only the updating model predicts the precise pattern of observed data.

A second prediction that follows directly from the differentiation mechanism is a higher hit rate and lower false alarm rate following a strongly encoded list compared to a weakly encoded list. This finding is called the strength-based mirror effect (SBME) and has been widely replicated (Cary & Reder, 2003; Criss, 2006, 2009, 2010; Glanzer & Adams, 1985; Starns, White, & Ratcliff, 2010; Stretch & Wixted, 1998). The WF mirror effect is related to the nature of the stimuli, whereas the SBME is related to the encoding conditions. Both findings co-occur (Criss, 2010; Stretch & Wixted, 1998) and, despite the shared label of mirror effect, they result from entirely different mechanisms in REM.

Increasing the strength with which a study list is encoded via levels of processing, repetition, or study time increases the number of stored features and produces a distribution of λ that is higher for strongly than weakly encoded targets. The same fact—that strongly encoded memory traces contain more information—produces a distribution of lower λ for foil items. Foils match a strongly encoded memory trace less well than a weakly encoded memory trace. Thus, a list containing all strongly encoded targets will match any given foil poorly, reducing the false alarm rate. Not only are the HR and FAR patterns that make up the SBME well predicted by REM, but REM makes additional specific predictions that have been confirmed with behavioral experiments (see Criss & Koop, in press, for a review). For one, the actual distribution of estimated memory strength follows the pattern predicted by REM (Criss, 2009). Further, the interaction between target-foil similarity and encoding strength presents just as predicted by REM (Criss, 2006; Criss, Aue, & Kilic, 2014). If one conceives of the λ values as the driving force behind a random walk or diffusion model, then the rate at which the walk reaches a boundary is consistent with REM; that is, targets and foils following a strongly encoded list have a steeper approach (e.g., larger drift rate) to the decision bound (Criss, 2010; Criss, Wheeler, & McClelland, 2013).
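The SBME logic can be sketched in the same simplified REM style as before (again with arbitrary parameter values, not published fits): encoding a list strongly, here by raising the storage probability, should raise the hit rate and lower the false alarm rate under a fixed criterion on the mean likelihood ratio.

```python
import numpy as np

# A sketch of the strength-based mirror effect in a simplified REM-style
# simulation. Parameters are illustrative only; g_sys = g for simplicity.
rng = np.random.default_rng(3)
N_FEATURES, LIST_LEN, G, C = 20, 8, 0.4, 0.7

def likelihood_ratio(cue, trace):
    lam = 1.0
    for nu_cue, nu_trace in zip(cue, trace):
        if nu_trace == 0:
            continue
        if nu_cue == nu_trace:
            p_nu = G * (1 - G) ** (nu_trace - 1)
            lam *= (C + (1 - C) * p_nu) / p_nu
        else:
            lam *= 1 - C
    return lam

def encode(item, store_p):
    # store_p plays the role of encoding strength (study time, repetition).
    trace = np.zeros(N_FEATURES, dtype=int)
    stored = rng.random(N_FEATURES) < store_p
    correct = rng.random(N_FEATURES) < C
    noise = rng.geometric(G, size=N_FEATURES)
    trace[stored & correct] = item[stored & correct]
    trace[stored & ~correct] = noise[stored & ~correct]
    return trace

def hr_far(store_p, n_lists=300):
    hits = fas = 0
    for _ in range(n_lists):
        study = [rng.geometric(G, size=N_FEATURES) for _ in range(LIST_LEN)]
        memory = [encode(item, store_p) for item in study]
        target, foil = study[0], rng.geometric(G, size=N_FEATURES)
        # Fixed criterion: endorse "old" when the mean odds exceed 1.
        hits += np.mean([likelihood_ratio(target, t) for t in memory]) > 1
        fas += np.mean([likelihood_ratio(foil, t) for t in memory]) > 1
    return hits / n_lists, fas / n_lists

hr_weak, far_weak = hr_far(store_p=0.3)
hr_strong, far_strong = hr_far(store_p=0.9)
print(hr_weak, far_weak, hr_strong, far_strong)
```

Strong encoding raises the hit rate because targets match their own traces better, and it lowers the false alarm rate because foils mismatch the more complete traces more badly, with no criterion shift required.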

Fig. 8.1 Panel A shows data (Koop, Criss, & Malmberg, in press) that are representative of output interference. The panel gives the probability of endorsing old probes (targets) and new probes (foils) for several test blocks. The hit rate is P(OLD) for targets; the false-alarm rate is P(OLD) for foils. The data reveal a steep decline in the hit rate and flat false alarms across test position. Panel B shows the standard REM model, where remembered items cause updating of an episodic memory trace. Updating produces patterns of data consistent with empirical findings. Panel C shows a version of REM where each test item causes the storage of a new memory trace (i.e., no updating). Such a model predicts a shallow decrease in the hit rate and an increase in the false alarm rate, inconsistent with observed data.

In summary, the global matching models were found to be inadequate on the basis of several findings, critically the WF mirror effect and null list strength effect. REM was developed to account for these data and others. Two key features of REM are updating of a single context-bound episodic trace and a Bayesian decision rule that takes into account positive and negative evidence from the stimulus as well as expectations based on the environment. These properties not only accounted for the problematic data but also led to specific and fortuitous predictions. Differentiation, the same mechanism that was required to predict the data that led to the demise of a whole class of models, predicted the observed patterns of output interference and strength-based mirror effects.

An Alternative Idea: Context-Noise Models

Context-noise models (Dennis & Humphreys, 2001) were also developed to account for data problematic for the global matching models. However, they took a very different approach. Context-noise models assume that memory evidence is based on the similarity between the test context and the previous contexts in which the test item was encountered. There is no comparison between the test item and any other studied item; in fact, no other items from episodic memory enter the decision process. Briefly, the model works as follows. Each time a word is encoded, the current context is bound to the word. During test, the item causes retrieval of its prior contexts. Those contexts are compared to the context in question, the test context in a typical recognition experiment. The recognition decision is made based on how well the retrieved contexts of the test item match the test context. It should be clear that context information is the only factor that contributes to memory evidence and, thus, such models predict no effects of list composition per se. Neither the number nor the strength of the other studied items affects the decision because those items are never compared to the test item; thus the model easily predicts a null list strength effect.
The WF mirror effects. mirror effect is predicted on the assumption that common words have more prior contexts that in- terfere with the ability to isolate the tested context. An Alternative Idea: Context-Noise Models Although the context noise models are limited—for Context-noise models (Dennis & Humphreys, example, they only apply to recognition and must 2001) were also developed to account for data prob- generate post-hoc explanations for many findings lematic for the global matching models. However, including output interference and the SBME— they took a very different approach. Context-noise they certainly advanced the field by making context models assume that memory evidence is based on a touchstone for models of recognition. Unlike the similarity between the test context and the models of recall that emphasize context, models previous contexts in which the test item was en- of recognition have largely neglected context. The countered. There is no comparison between the test item-context-noise debate sparked by the intro- item and any other studied item, in fact no other duction of the Dennis & Humphreys model of items from episodic memory enter the decision pro- has brought to the forefront the fact that context cess. Briefly, the model works as follows. Each time must be taken seriously in recognition models a word is encoded, the current context is bound to but also raised questions about the nature of the word. During test, the item causes retrieval of context.

172 basic cognitive skills

Models of Episodic Recall
In recall tasks, subjects must report their memory for an event by producing a stimulus. Recall tasks most commonly use words as stimuli. In cued recall, subjects are given a probe stimulus as a cue to retrieve a particular word from the list. Most commonly, pairs of words are studied, with one member of the pair serving as a cue for recall of the other (but see Phillips, Shiffrin, & Atkinson, 1967; Nelson, McEvoy, & Schreiber, 1990; Tehan & Humphreys, 1996, for other possibilities). In free recall, subjects are presented with a list of words, typically one at a time, and then asked to recall the words in the order they come to mind. In serial recall, subjects are asked to produce the stimuli in order, typically starting at the beginning of the list. Although there is a well-developed literature modeling serial recall, the serial recall task is most commonly described as a function of working memory rather than as a function of episodic memory,1 and we will not discuss it here.

The fundamental question in episodic recall is to determine what constitutes the cue. In cued recall, this might seem like an obvious question. Given the pair DOG-QUEEN, the subject is presented DOG as a cue at test and correctly recalls QUEEN. Is it not sufficient to understand this as a simple, almost Pavlovian, association between some distributed representation of the word DOG and the word QUEEN? This is not sufficient. If the subject is asked instead to recall the first word that comes to mind when hearing the word DOG, it is likely the subject would recall CAT. If asked to recall a word that rhymes with dog, the subject would likely recall LOG. If asked to remember a specific event from their life that involved a dog (or the word DOG), it is unlikely that the subject would recall QUEEN. All these tasks take the same nominal cue but result in very different responses. From this we conclude that the cue stimulus itself is not enough to account for the subjects' responses. In free recall the problem is even more acute; in free recall there is no external cue whatsoever. Free recall must proceed solely on the basis of some set of internal cues. Several concepts—fixed-list context, variable context, short-term memory, and temporally varying context—have been introduced into detailed models of recall tasks to attempt to solve these problems.

Cued recall
In cued recall, subjects are presented with pairs of stimuli, such as ABSENCE-HOLLOW, PUPIL-RIVER, and so forth. At test, the subject is given one of the stimuli, such as ABSENCE, and asked to produce the corresponding member of the pair, i.e., say (or write) HOLLOW. One way to approach cued recall is to form an association between the stimuli composing a pair. We will see that this assumption is ultimately limited for recall more broadly, but it illustrates several important properties of models of recall. For that reason, we will spend some time examining an extremely simple model of association in memory.

simple linear association
In the matched filter model we constructed a memory vector that was the sum of the vectors corresponding to the list items. In a linear associator, we again form a sum, but now of a set of outer product matrices. Each matrix provides the outer product of the first member of a pair with the second member of a pair. These associations can be understood as changing the synaptic weights between two vector spaces according to a Hebbian rule. Such associations between distinct items are referred to as heteroassociative.

Here we follow the heteroassociative model of J. A. Anderson, Silverstein, Ritz, and Jones (1977). Let us refer to the vector corresponding to the first member of the ith studied pair as f_i and the second member as g_i. Now, the matrix storing the associations between each stimulus and each response can be described by

    M_i = M_{i-1} + f_i g_i^T.   (9)

To model the association, we can probe the matrix with a probe stimulus, M g_p. Here we find this to be

    M_i g_p = Σ_i f_i (g_i^T g_p).   (10)

That is, the output is a combination of the vectors for each first member f_i weighted by the degree to which the paired stimulus g_i matches the probe vector. Equations 9 and 10 can be understood as describing a simple neural network connecting f and g, with M understood as a simple Hebbian matrix describing the connections between them (see Figure 8.2).

In recall tasks, subjects report one word at a time, rather than a mixture of words. This can be reconciled with a deblurring mechanism in which one takes an ambiguous stimulus and perceives one of several possibilities. This is somewhat analogous to the problem of perception, in which one identifies a particular stimulus from a blurry input. One can imagine a number of physical processes that could be used to accomplish this
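The linear associator of Eqs. 9 and 10 is simple enough to simulate directly. The sketch below is illustrative only, not the Anderson et al. (1977) implementation: the dimensionality, number of pairs, and random unit-vector representation are arbitrary choices. It stores a sum of outer products and shows that probing with one pair's g returns a vector dominated by the paired f.

```python
import random

random.seed(1)
DIM = 50       # dimensionality of the distributed representations (arbitrary)
N_PAIRS = 5    # number of studied pairs (arbitrary)

def rand_vec(dim):
    """Random feature vector normalized to unit length."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def outer(f, g):
    """Outer product f g^T, returned as a list of rows."""
    return [[fi * gj for gj in g] for fi in f]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Study: Eq. 9, M_i = M_{i-1} + f_i g_i^T, summed over the list of pairs.
fs = [rand_vec(DIM) for _ in range(N_PAIRS)]
gs = [rand_vec(DIM) for _ in range(N_PAIRS)]
M = [[0.0] * DIM for _ in range(DIM)]
for f, g in zip(fs, gs):
    M = mat_add(M, outer(f, g))

# Test: Eq. 10, M g_p = sum_i f_i (g_i^T g_p).  Probe with the g of pair 2.
out = mat_vec(M, gs[2])

# The output is dominated by f_2, plus crosstalk from the other pairs.
sims = [dot(out, f) for f in fs]
assert max(range(N_PAIRS), key=lambda i: sims[i]) == 2
```

The crosstalk terms (g_i^T g_p for i ≠ 2) are small but nonzero for random vectors, which is why the output is a noisy blend that must be deblurred, as discussed above.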

models of episodic memory 173

Fig. 8.2 Schematic of a neural network interpretation of Eq. 9. Two sets of "neurons," f and g, are connected by a weight matrix M. As each pair is presented, the values of the elements in f and g are set to the values corresponding to the stimulus in that pair. The connections between an individual element in g and an individual element in f are strengthened according to the product of those two elements, that is, the corresponding element of the outer product of the two patterns.

deblurring (e.g., J. A. Anderson et al., 1977; Sederberg, Howard, & Kahana, 2008), but in many applications researchers simply assume that the probability of recalling a particular word is some phenomenological function of its activation relative to the activation of competing words (e.g., Howard & Kahana, 2002a) or a sampling and recovery process (e.g., Raaijmakers & Shiffrin, 1980).

A series of models can be developed that can be thought of as variations on the basic theme illustrated by Eq. 10. These models differ in the associative mechanism but are similar in that they store the associations between each of the members of the pairs and provide a noisy output when given a probe. TODAM utilizes a convolution/correlation associative mechanism rather than the outer product (e.g., Murdock, 1982). The matrix model (Humphreys et al., 1989) utilizes a triple association between the two members of the pair and a fixed-list context vector to form a three-tensor that can be probed with the conjunction of a probe item and a test context.

Free recall
Free recall raises two computational problems that are probably central to the question of episodic memory. First, how do subjects initiate recall in the absence of a particular cue? Second, how do those cues change across the unfolding process of retrieval? Models of free recall have not focused on differences in associative mechanisms, nor on detailed assumptions about how items are represented, but on representations that are used to initiate recall and how those representations change across retrieval attempts. These two problems can be concisely summarized by two classes of empirically observable phenomena: the probability that subjects initiate recall with a particular word and the probability of transitions between stimuli after recall has been initiated.

empirical properties of recall initiation in free recall
The finding that subjects can direct their recall to a specific region of time has also had a large effect on hypotheses about the cue used to initiate free recall. Shiffrin (1970) gave subjects a series of lists of varying lengths for free recall. However, rather than having subjects recall the most recent list, he had subjects direct their recall to the list before the most recent list. Remarkably, the probability of recall depended on the length of the target list rather than the length of the intervening list. This suggested a representation of the list per se that was used to focus the subjects' retrieval attempts.2

But the most dramatic effect in the initiation of free recall is the recency effect manifest in the probability of first recall. The probability of first recall gives the probability that the first word the subject recalls came from each of the positions within the list. When the test is immediate, this measure shows a dramatic advantage for the last items in the list (Figure 8.3a). This, combined with the fact that the recency effect is reduced when a delay intervenes between presentation of the last word and the test (Glanzer & Cunitz, 1966; Postman & Phillips, 1965), led many researchers to attribute the recency effect in immediate free recall to the presence of a short-term memory buffer (e.g., Atkinson & Shiffrin, 1968). However, the recency effect measured in the probability of first recall is present even when the time between the last item in the list and the test is much longer. Similarly, when subjects make errors in free recall by recalling a word from a previous list, the intrusions are more likely for recent lists (Zaromb et al., 2006). The debate among researchers modeling free recall in the last several years has focused on whether these longer-term recency effects depend on a different memory store than the short-term effects (Davelaar, Goshen-Gottstein, Ashkenazi, Haarmann, & Usher, 2005; Lehman & Malmberg, 2012) or whether recency effects across time scales reflect a common retrieval mechanism (Sederberg et al., 2008; Shankar & Howard, 2012).

context as a cue for recall
To solve the problem of initiating recall, researchers have appealed to a representation of "context," some information that is not identical to

the stimuli composing the list but that nonetheless enables the subject to focus their memory search on a subset of the stimuli that could potentially be generated. Context can function as a cue for retrieval of items from the appropriate list if it is associated to the list stimuli during learning. We can think of a straightforward extension of Eq. 9:

    M_i = M_{i-1} + f_i c_i^T,   (11)

where c_i is the state of the context vector at time step i. The "context" available at the time of test can be used as a probe for recall from memory (Figure 8.4). In much the same way that words in memory were activated to the extent that they were paired with the probe word in Eq. 10, the words in memory will be activated to the extent that the cue context resembles the state of context available when they were encoded:

    M_i c = Σ_i f_i (c_i^T c),   (12)

where c is the context available at the time of test. We can see the power of proposing a context representation from Eq. 12: each studied item f_i is activated to the extent that its encoding context c_i overlaps with the probe context.

Fig. 8.3 The recency effect in recall initiation across time scales. The probability of first recall gives the probability that the first word the subject free recalls came from each position within the list. a. In immediate free recall, the test comes immediately after presentation of the last list item. A dramatic recency effect results. In delayed free recall, a distractor task intervenes after presentation of the last word in the list. The recency effect is sharply attenuated. In continuous distractor free recall, a distractor is presented after the last item, but also between presentation of each item in the list. The recency effect in the probability of first recall recovers. Here the distractor interval was approximately 16 s. After Howard & Kahana (1999). b. Subjects studied and recalled 48 lists of words. At the end of the experiment, they recalled all the words they could remember from all lists. Probability of first recall is shown as a function of the list the word came from. Here the recency effect extends over a few hundred seconds. After Howard, Youker, & Venkatadass (2008). a. After Howard & Kahana (1999), © 2008, Elsevier; b. After Howard, Youker, & Venkatadass (2008), with kind permission from Springer Science and Business Media.

Obviously, the choice of how context varies has a tremendous effect on the behavioral model that results. One choice is to have the state of context be constant within a list but completely different from the state of context that obtains when one studies the next list. Models that exploit a fixed list context have been successful in describing many detailed aspects of free recall performance (Raaijmakers & Shiffrin, 1980, 1981). If there is a binary difference between the context of each list, then it is not possible to describe recency effects across lists (Glenberg, Bradley, Stevenson, Kraus, Tkachuk, & Gretz, 1980; Howard et al., 2008; Zaromb et al., 2006). Similarly, if the context is constant within a list, then fixed list context cannot be utilized to account for recency effects within the list.

Another choice is to have context change gradually across lists (J. R. Anderson & Bower, 1972), or across time per se. Following Estes (1955), Mensink and Raaijmakers (1988) introduced a model of interference effects in paired associate learning where the state of context gradually changed over intervals of time. The state of context at test is the cue for retrieval of words from the list. If the state changes gradually during presentation of the list items, then the state at the time of test will be a better cue for words from the end of the list than for words presented earlier. As a consequence, this produces a recency effect. This approach has been applied to describing the recency effect in free
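The context-matching rule of Eq. 12, and its contrast with a fixed list context, can be illustrated numerically. The sketch below uses made-up random vectors (dimensionality and list length are arbitrary): probing with a particular study context activates the item encoded in that context, whereas a single shared context weights all list items equally and so cannot produce within-list recency.

```python
import random

random.seed(2)
DIM = 40      # arbitrary dimensionality
N_ITEMS = 6   # arbitrary list length

def rand_vec(dim):
    """Random feature vector normalized to unit length."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Study: each item f_i is bound to its context state c_i (Eq. 11).  By the
# identity in Eq. 12, probing the memory matrix with a context cue c weights
# each stored item f_i by the overlap (c_i^T c), so we can work with these
# weights directly rather than materializing the matrix.
contexts = [rand_vec(DIM) for _ in range(N_ITEMS)]

def activations(probe_context):
    """Weight on each stored item when memory is probed with a context cue."""
    return [dot(c_i, probe_context) for c_i in contexts]

# Probing with the context from study position 4 activates item 4 most:
# its encoding context overlaps the probe perfectly, the others only by chance.
w = activations(contexts[4])
assert max(range(N_ITEMS), key=lambda i: w[i]) == 4

# With a fixed list context (one shared c for every item), all items receive
# identical weights, so this cue cannot produce a within-list recency effect.
shared = rand_vec(DIM)
w_fixed = [dot(shared, shared) for _ in range(N_ITEMS)]
assert all(abs(x - w_fixed[0]) < 1e-12 for x in w_fixed)
```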

recall both within and across lists (Davelaar et al., 2005; Howard & Kahana, 1999; Sirotin, Kimball, & Kahana, 2005). If one can arrange for the state of context to change gradually over long periods of time, recency effects can be observed over similarly long periods of time.

Murdock (1997) used a particularly tractable model of variable context that illustrates this idea. At time step i, the context representation is updated as

    c_i = ρ c_{i-1} + √(1 − ρ²) η_i,   (13)

where η_i is a vector of random features chosen at time step i (Figure 8.4b). The noise vectors are chosen such that the expectation value of the inner product of any two vectors is zero and the expectation value of the inner product of a noise vector with itself is 1. Now, it is easy to verify that the expectation of the inner product of states of context falls off exponentially:

    E[c_i^T c_j] = ρ^{|i−j|}.   (14)

Here we can see that the parameter ρ controls the rate at which context drifts in this formulation. As a consequence, Eq. 13, coupled with Eq. 12, results in an exponentially decaying activation for list items.

Fig. 8.4 Context as an explanatory concept. a. Stimuli f are associated to states of context c. Compare to Figure 8.2, Eqs. 9, 11. b. In contextual variability models, context changes gradually from moment to moment by integrating a source of external noise. See Eq. 13. c. In retrieved context models, the changes in context from moment to moment are caused by the input stimuli themselves. Repeating an item can also cause the recovery of a previous state of temporal context. See Eqs. 15, 16.

empirical properties of recall transitions in free recall
After the first recall is generated, transitions from one word to the next also show lawful properties. Broadly speaking, recall transitions show sensitivity both to the study context of the presented words and to similarities between the words themselves. For instance, given that the subject has just recalled some word from the list, the next word recalled is more likely to be from a nearby position within the list than from a distant position within the list, showing a sensitivity to relationships induced by the study context. Similarly, the subject is more likely to recall a word from the list that is semantically related to the just-recalled word than to recall an unrelated word from the list, showing a sensitivity to the properties of the words themselves.

The tendency to recall words from nearby serial positions in sequence is referred to as the contiguity effect (Kahana, 1996; Sederberg, Miller, Howard, & Kahana, 2010). The contiguity effect is manifest not only in free recall, but in a wide variety of other episodic memory tasks as well (see Kahana, Howard, & Polyn, 2008, for a review). Like the recency effect, the contiguity effect is also manifest across time scales (Howard & Kahana, 1999; Howard et al., 2008; Kiliç, Criss, & Howard, 2013). In addition to the sensitivity to the temporal context in which words are studied, subjects' recall transitions also reflect the spatial context in which words were studied. Miller, Lazarus, Polyn, and Kahana (2013) had subjects study a list of objects while traveling on a controlled path within a virtual reality environment
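The exponential falloff in Eq. 14 can be checked by simulating the drift process of Eq. 13 directly. The dimensionality and ρ below are arbitrary choices; with high-dimensional noise vectors, the sample inner products land close to their expectations.

```python
import math
import random

random.seed(3)
DIM = 2000   # high dimension so sample inner products are near expectations
RHO = 0.8    # drift parameter (arbitrary choice)
STEPS = 8

def noise_vec(dim):
    """Noise with E[eta^T eta] = 1 and E[eta_i^T eta_j] = 0 for i != j."""
    s = (1.0 / dim) ** 0.5
    return [random.gauss(0.0, s) for _ in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Eq. 13: c_i = rho * c_{i-1} + sqrt(1 - rho^2) * eta_i
drift = math.sqrt(1.0 - RHO ** 2)
c = noise_vec(DIM)
states = [c]
for _ in range(STEPS):
    eta = noise_vec(DIM)
    c = [RHO * x + drift * e for x, e in zip(c, eta)]
    states.append(c)

# Eq. 14: E[c_i^T c_j] = rho^{|i-j|}.  Compare sample values to expectations.
for lag in (1, 3, 5):
    assert abs(dot(states[0], states[lag]) - RHO ** lag) < 0.1
```

Coupled with Eq. 12, these exponentially decaying overlaps translate directly into exponentially decaying activations for list items when the end-of-list context is the probe.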

to navigate to different locations where stimuli were experienced. After exploration, free recall of the stimuli was tested. Because the sequence of locations was chosen randomly, the sequential contiguity of the stimuli and the spatial contiguity of the stimuli were decorrelated. At test, the recall transitions that subjects exhibited reflected not only the temporal proximity along the path, but also the spatial proximity within the environment. Moreover, Polyn, Norman, and Kahana (2009a, 2009b) had subjects study concrete nouns using one of two orienting tasks. For some words, subjects would make a rating of size ("Would this object fit in a shoebox?"); for other words, subjects would rate animacy. Polyn et al. (2009a, 2009b) found that recall transitions between words studied using the same orienting task were more common than transitions between words studied using different orienting tasks.

Because the words in the list are randomly assigned, the preceding effects must reflect new learning during the study episode. In addition, participants' memory search also reflects properties of the words themselves acquired from learning prior to the experimental session. It has long been known that, given a list including pairs of associated words (TABLE, CHAIR) randomly assigned to serial positions, the pairs are more likely to be recalled together (Bousfield, 1953). This effect generalizes to lists chosen from several categories—when the words are presented randomly, subjects nonetheless organize them into categories during recall (e.g., Cofer, Bruce, & Reicher, 1966). Interestingly, the time taken to retrieve words within a category is faster than the retrieval time necessary to transition from one category to another (Pollio, Kasschau, & DeNise, 1968; Pollio, Richards, & Lucas, 1969). Semantic relatedness is also a major factor affecting recall errors (e.g., Deese, 1959; Roediger & McDermott, 1995). The effect of semantic relatedness on memory retrieval can even be seen when the words do not come from well-defined semantic categories. Semantic similarity can be estimated between arbitrary pairs of words using automatic computational methods such as latent semantic analysis (Landauer & Dumais, 1997) or the word association space (Steyvers, Shiffrin, & Nelson, 2004). There are elevated transition probabilities even between words with relatively low values of semantic similarity (Howard & Kahana, 2002b; Sederberg et al., 2010).

retrieved context models
In the previous subsection we saw that many researchers have appealed to a representation of "context" that is distinct from the representations of the words in the list to explain free-recall initiation. We also saw that properties of context could be used to account for recency effects within and across lists if the states of context changed gradually. But a context that is independent of the list items seems like a poor choice to account for transitions between subsequent recalls, which seem to reflect the properties of the items themselves, both learned and pre-experimental.

One approach is to have multiple cues contribute to retrieval. That is, analogous to the Humphreys et al. (1989) model discussed earlier, one could have both direct item-to-item associations, as in Eq. 9, and context-to-item associations, as in Eq. 11. When an item is available, either because it is provided as an experimental cue or because it has been successfully retrieved in free recall, one can use both the item and the context to focus retrieval. These two sources of information can then be combined, perhaps multiplicatively, to select candidate words for recall (Raaijmakers & Shiffrin, 1980, 1981). This approach could readily account for semantic associations if the similarity of the vectors corresponding to different words reflects the semantic similarity between the meanings of those words (Kimball, Smith, & Kahana, 2007; Sirotin et al., 2005).

One could account for the contiguity effect over shorter time scales via direct item-to-item associations if associations are formed in a short-term memory during study of the list. That is, rather than having the two simultaneously presented members of a pair be associated to one another in Eq. 9, one could form associations between all the words simultaneously active in short-term memory. At any one moment during study of the list, the last several items are likely to remain active in short-term memory. As a consequence, a particular word in the list is likely to have strengthened associations to words from nearby positions within the list. Recall of that item would provide a boost in the accessibility of other words from nearby in the list. However, in much the same way that short-term memory has difficulty accounting for the recency effect across long time scales, this still leaves the question of how to account for contiguity effects across longer time scales.
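The multiplicative combination of item and context cues described above can be sketched with purely illustrative numbers (nothing below is taken from Raaijmakers & Shiffrin's parameterization): a word becomes a strong retrieval candidate only if both cues support it.

```python
# Hypothetical retrieval strengths for five candidate words given two cues:
# the just-recalled item (item-to-item association, as in Eq. 9) and the
# list context (context-to-item association, as in Eq. 11).  All numbers
# here are made up for illustration.
item_strength = [0.1, 2.0, 0.5, 2.0, 0.1]      # word 3 is a strong item match
context_strength = [1.0, 1.0, 1.0, 0.1, 0.1]   # ...but from the wrong list

# Multiplicative combination: a word is a strong candidate only if BOTH cues
# point to it, which keeps retrieval focused on the current list.
combined = [a * b for a, b in zip(item_strength, context_strength)]
total = sum(combined)
sampling_prob = [c / total for c in combined]

# Word 1 (strong on both cues) dominates; word 3, despite its strong
# item association, is suppressed by its weak context match.
assert max(range(5), key=lambda j: sampling_prob[j]) == 1
assert sampling_prob[3] < sampling_prob[1]
```

The design choice matters: an additive combination would let a strong item association override a mismatching context, while the multiplicative rule vetoes candidates that fail on either cue.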

Fig. 8.5 Transitions between words in free recall are affected by the order in which the words were presented. Here the probability of a recall transition from one word to the next is estimated as a function of the distance between the two words. Suppose that the 10th word in the study list has just been recalled. The lag-CRP at position +2 estimates the probability that the next word recalled will be from position 12; the lag-CRP at position −3 estimates the probability that instead the next word recalled will be from position 7. a. The lag-CRP is shown for four conditions of delayed free recall. There was always approximately 16 s between the presentation of the last word and the time of test. The duration of the distractor interval between words (the IPI) was manipulated across conditions. After Howard & Kahana (1999). b. Subjects studied and recalled 48 lists of words. At the end of the experiment, they recalled all the words they could remember from all lists. This plot estimates the excess probability of a transition between lists, expressed as a z-score, as a function of the distance between the lists. That is, given that the just-recalled word came from list 10, a list lag of −3 corresponds to a transition to a word from list 7. Here the contiguity effect extends over a few hundred seconds. After Howard et al. (2008). a. After Howard & Kahana (1999), © 2008, Elsevier; b. After Howard, Youker, & Venkatadass (2008), with kind permission from Springer Science and Business Media.

Retrieved context models address this problem by postulating a coupling between words and a gradually changing state of context. Rather than contextual drift resulting from random fluctuations, context is driven by the presented items. Each word i provides some input c_i^IN:

    c_i = ρ c_{i-1} + √(1 − ρ²) c_i^IN.   (15)

For a random list of once-presented words, these inputs are uncorrelated, resulting in contextual drift analogous to Eq. 14. However, because the inputs are caused by the stimuli, the model is able to account for contiguity effects (Figure 8.4c). The key idea enabling the contiguity effect is the assumption that retrieving a word also results in recovery of the state of context in which that word was encoded. That is, if the word presented at time step i is repeated at some later time step r, then

    c_r^IN = γ c_{i−1} + (1 − γ) c_i^IN.   (16)

In addition to the input that the stimulus caused when it was initially presented, it also enables recovery of the state of context present when it was presented, c_{i−1}. Because this state resembles the context when neighboring items were presented, a contiguity effect naturally results. If context changes gradually over long periods of time, the contiguity effect naturally persists over those same periods of time (Howard & Kahana, 2002a; Sederberg et al., 2008). Computational models describing details of free recall dynamics, including semantic transition effects, have been developed (Sederberg et al., 2008; Polyn et al., 2009a).

autonomous search models
In retrieved context models, retrieval of an item causes recovery of a previous state of temporal context, resulting in the contiguity effect. This is not the only possibility, however. Consider a brief thought experiment. Try to recall as many of the 50 United States as possible (if the reader is unfamiliar with U.S. geography, the experiment should work just as well with any well-learned geographical region with more than a dozen or so entities). Most subjects recall geographically contiguous states (MAINE, VERMONT, NEW HAMPSHIRE ...).3 Examining the recall protocols, we would observe a spatial analog of a contiguity effect—if the subject has just recalled a word from a particular spatial location, the next word the subject retrieves would also tend to be from a nearby spatial location. However, it is not necessarily the case that remembering one state (Michigan) caused recovery of a nearby state (Wisconsin). Rather, both words might have been recalled because the search

happened to encounter a part of the map containing both of those states (the Great Lakes region). Autonomous search models provide an account of the temporal contiguity effect that is similar in spirit to that described above. In the Davelaar et al. (2005) account of the contiguity effect, retrieving a word has no effect on the state of memory used as a cue for the next retrieval. Rather, a state of context evolves according to some dynamics during study. It is reset to the state at the beginning of the list during recall and evolves according to the same dynamics during recall. Because it tends to revisit states in a similar order, the sequence of retrievals is correlated with the study order. Similarly, Farrell (2012) described temporal contiguity effects as resulting from the retrieval dynamics of hierarchical groups of chunked contexts (see also Ezzyat & Davachi, 2011).

Autonomous search models have difficulty accounting for genuinely associative contiguity effects. For instance, in cued recall, the experimenter chooses the cue—the fact that the correct pair is retrieved cannot be attributed to autonomous retrieval dynamics. Rather, the cue is utilized to recover the correct trace. The contiguity effect is observed under circumstances where the cue is randomized, eliminating the possibility that correlations between study and retrieval cause the contiguity effect (Kiliç et al., 2013; Howard, Venkatadass, Norman, & Kahana, 2007; Schwartz, Howard, Jing, & Kahana, 2005). The strong form of autonomous models—that memory search is independent of the products of previous memory search—must be false. Nonetheless, the geographical search thought experiment is compelling. A challenge going forward is to mechanistically describe the representations that could support a temporally defined search through memory analogous to the spatial search in the geography thought experiment.

Summary and Conclusions
• A variety of detailed process models of performance have been developed for a variety of episodic memory tasks.
• In recognition, differentiation and Bayesian decision rules have been important steps in advancing our understanding of recognition.
• The major driver of models in recall has been an attempt to understand the nature of the context representation—more broadly, what constitutes the cue in free-recall tasks.

Open Problems and Future Directions
• If nothing else, the diversity of models demonstrates that behavioral data alone are not sufficient to result in a consensus model of any of the tasks we have considered, let alone a general theory of episodic memory. Going forward, early steps to use neurobiological constraints on process models of memory (e.g., Criss, Wheeler, & McClelland, 2013; Howard, Viskontas, Shankar, & Fried, 2012; Manning, Polyn, Litt, Baltuch, & Kahana, 2011) must be expanded. The results of these experiments must also affect the hypothesis space of models going forward.
• The textbook definition of episodic memory is the experience of a "jump back in time" such that the subject vividly reexperiences a particular moment from his life. One of the major limitations in constructing models of episodic memory is that we do not have a coherent idea about how to represent time—context in the models we have discussed here may change gradually over time but is nonetheless ahistorical. Richer representations of temporal history (Shankar & Howard, 2012) may be able to provide a more unified approach to episodic memory (Howard, Shankar, Aue, & Criss, in press).
• It is frustrating that there are so many differences between the item recognition models we have discussed and the recall models. The contiguity effect may provide a point of contact that could lead to the unification of these classes of models. Successful recognition of a list item during test seems to leave neighboring items in an elevated state of availability. Schwartz et al. (2005) presented travel scenes for item recognition testing. They systematically manipulated the lag between successively tested old probes. That is, after presenting old item i, they tested old item i + lag. They found that when |lag| was small, memory for the second old probe was enhanced, but only when subjects endorsed the first probe with high confidence. The recovery of previous states given an old probe has been the focus of connectionist models of recall and recognition (Norman & O'Reilly, 2003; Hasselmo & Wyble, 1997), suggesting a point of contact between mathematical models of memory and connectionist modeling, perhaps via dual-process assumptions (see Box 1).

Acknowledgments

The work was supported by the U.S. National Science Foundation grants to AHC (0951612) and MH (1058937).

Notes

1. For instance, amnesia patients with essentially complete loss of episodic memory are typically unimpaired at serial recall of a short list of digits.
2. Recent studies have significantly elaborated this empirical story (Jang & Huber, 2008; Unsworth, Spillers, & Brewer, 2012; Ward & Tan, 2004).
3. Occasionally, subjects will try to recall in alphabetical order (ALABAMA, ALASKA, ARIZONA ...) or according to some other idiosyncratic retrieval strategy, but that is not central to the point.

References

Anderson, J. A. (1973). A theory for the recognition of items from short memorized lists. Psychological Review, 80(6), 417–438.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Anderson, J. R., & Bower, G. H. (1972). Recognition and retrieval processes in free recall. Psychological Review, 79(2), 97–123.
Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), The psychology of learning and motivation (Vol. 2, pp. 89–105). New York, NY: Academic.
Bousfield, W. A. (1953). The occurrence of clustering in the recall of randomly arranged associates. Journal of General Psychology, 49, 229–240.
Cary, M., & Reder, L. M. (2003). A dual-process account of the list-length and strength-based mirror effects in recognition. Journal of Memory and Language, 49(2), 231–248.
Cofer, C. N., Bruce, D. R., & Reicher, G. M. (1966). Clustering in free recall as a function of certain methodological variations. Journal of Experimental Psychology, 71, 858–866.
Criss, A. H. (2006). The consequences of differentiation in episodic memory: Similarity and the strength based mirror effect. Journal of Memory and Language, 55(4), 461–478.
Criss, A. H. (2009). The distribution of subjective memory strength: List strength and response bias. Cognitive Psychology, 59(4), 297–319. doi: 10.1016/j.cogpsych.2009.07.003
Criss, A. H. (2010). Differentiation and response bias in episodic memory: Evidence from reaction time distributions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(2), 484–499.
Criss, A. H., Aue, W. R., & Kılıç, A. (2014). Age and response bias: Evidence from the strength based mirror effect. Quarterly Journal of Experimental Psychology, 67(10), 1910–1924.
Criss, A. H., & Koop, G. J. (in press). Differentiation in episodic memory. In J. Raaijmakers, A. H. Criss, R. Goldstone, R. Nosofsky, & M. Steyvers (Eds.), Cognitive modeling in perception and memory: A Festschrift for Richard M. Shiffrin. Psychology Press.
Criss, A. H., Malmberg, K. J., & Shiffrin, R. M. (2011). Output interference in recognition memory. Journal of Memory and Language, 64, 316–326.
Criss, A. H., & Shiffrin, R. M. (2004). Context noise and item noise jointly determine recognition memory: A comment on Dennis and Humphreys (2001). Psychological Review, 111(3), 800–807.
Criss, A. H., & Shiffrin, R. M. (2005). List discrimination in associative recognition and implications for representation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(6), 1199–1212. doi: 10.1037/0278-7393.31.6.1199
Criss, A. H., Wheeler, M. E., & McClelland, J. L. (2013). A differentiation account of recognition memory: Evidence from fMRI. Journal of Cognitive Neuroscience, 25(3), 421–435. doi: 10.1162/jocn_a_00292
Davelaar, E. J., Goshen-Gottstein, Y., Ashkenazi, A., Haarmann, H. J., & Usher, M. (2005). The demise of short-term memory revisited: Empirical and computational investigations of recency effects. Psychological Review, 112(1), 3–42.
DeCarlo, L. T. (2003). Source monitoring and multivariate signal detection theory, with a model for selection. Journal of Mathematical Psychology, 47, 292–303.
Deese, J. (1959). On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of Experimental Psychology, 58, 17–22.
Dennis, S., & Humphreys, M. S. (2001). A context noise model of episodic word recognition. Psychological Review, 108(2), 452–478.
Dennis, S., Lee, M. D., & Kinnell, A. (2008). Bayesian analysis of recognition memory: The case of the list-length effect. Journal of Memory and Language, 59(3), 361–376.
Diller, D. E., Nobel, P. A., & Shiffrin, R. M. (2001). An ARC-REM model for accuracy and response time in recognition and recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(2), 414–435.
Estes, W. K. (1955). Statistical theory of spontaneous recovery and regression. Psychological Review, 62, 145–154.
Ezzyat, Y., & Davachi, L. (2011). What constitutes an episode in episodic memory? Psychological Science, 22(2), 243–252. doi: 10.1177/0956797610393742
Farrell, S. (2012). Temporal clustering and sequencing in short-term memory and episodic memory. Psychological Review, 119(2), 223–271. doi: 10.1037/a0027371
Fortin, N. J., Agster, K. L., & Eichenbaum, H. B. (2002). Critical role of the hippocampus in memory for sequences of events. Nature Neuroscience, 5(5), 458–462.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–67.
Glanzer, M., & Adams, J. K. (1985). The mirror effect in recognition memory. Memory & Cognition, 13(1), 8–20.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 5–16.
Glanzer, M., & Cunitz, A. R. (1966). Two storage mechanisms in free recall. Journal of Verbal Learning and Verbal Behavior, 5, 351–360.

Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100, 546–567.
Glenberg, A. M., Bradley, M. M., Stevenson, J. A., Kraus, T. A., Tkachuk, M. J., & Gretz, A. L. (1980). A two-process account of long-term serial position effects. Journal of Experimental Psychology: Human Learning and Memory, 6, 355–369.
Hasselmo, M. E., & Wyble, B. P. (1997). Free recall and recognition in a network model of the hippocampus: Simulating effects of scopolamine on human memory function. Behavioural Brain Research, 89(1–2), 1–34.
Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16(2), 96–101.
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Hintzman, D. L., & Curran, T. (1994). Retrieval dynamics of recognition and frequency judgments: Evidence for separate processes of familiarity and recall. Journal of Memory and Language, 33, 1–18.
Howard, M. W., & Kahana, M. J. (1999). Contextual variability and serial position effects in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 923–941.
Howard, M. W., & Kahana, M. J. (2002a). A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3), 269–299.
Howard, M. W., & Kahana, M. J. (2002b). When does semantic similarity help episodic retrieval? Journal of Memory and Language, 46(1), 85–98.
Howard, M. W., Shankar, K. H., Aue, W., & Criss, A. H. (in press). A distributed representation of internal time. Psychological Review. doi: 10.1037/a0037840
Howard, M. W., Venkatadass, V., Norman, K. A., & Kahana, M. J. (2007). Associative processes in immediate recency. Memory & Cognition, 35, 1700–1711.
Howard, M. W., Viskontas, I. V., Shankar, K. H., & Fried, I. (2012). A neural signature of mental time travel in the human MTL. Hippocampus, 22, 1833–1847.
Howard, M. W., Youker, T. E., & Venkatadass, V. (2008). The persistence of memory: Contiguity effects across several minutes. Psychonomic Bulletin & Review, 15, 58–63.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208–233.
Jang, Y., & Huber, D. E. (2008). Context retrieval and context change in free recall: Recalling from long-term memory drives list isolation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(1), 112–127. doi: 10.1037/0278-7393.34.1.112
Kahana, M. J. (1996). Associative retrieval processes in free recall. Memory & Cognition, 24, 103–109.
Kahana, M. J., Howard, M., & Polyn, S. (2008). Associative processes in episodic memory. In H. L. Roediger III (Ed.), Cognitive psychology of memory (Vol. 2 of Learning and memory: A comprehensive reference, J. Byrne, Ed.; pp. 476–490). Oxford: Elsevier.
Kılıç, A., Criss, A. H., & Howard, M. W. (2013). A causal contiguity effect that persists across time scales. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(1), 297–303. doi: 10.1037/a0028463
Kimball, D. R., Smith, T. A., & Kahana, M. J. (2007). The fSAM model of false recall. Psychological Review, 114(4), 954–993.
Klauer, K. C., & Kellen, D. (2010). Toward a complete decision model of item and source recognition: A discrete-state approach. Psychonomic Bulletin & Review, 17, 465–478.
Koop, G. J., Criss, A. H., & Malmberg, K. J. (in press). The role of mnemonic processes in pure-target and pure-foil recognition memory. Psychonomic Bulletin & Review.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Lehman, M., & Malmberg, K. J. (2012). A buffer model of memory encoding and temporal correlations in retrieval. Psychological Review. doi: 10.1037/a0030851
Malmberg, K. J. (2002). On the form of ROCs constructed from confidence ratings. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(2), 380–387.
Malmberg, K. J., Criss, A. H., Gangwani, T. H., & Shiffrin, R. M. (2012). Overcoming the negative consequences of interference from recognition memory testing. Psychological Science, 23(2), 115–119. doi: 10.1177/0956797611430692
Malmberg, K. J., Holden, J. E., & Shiffrin, R. M. (2004). Modeling the effects of repetitions, similarity, and normative word frequency on old-new recognition and judgments of frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2), 319–331.
Malmberg, K. J., & Shiffrin, R. M. (2005). The "one-shot" hypothesis for context storage. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(2), 322–336.
Manning, J. R., Polyn, S. M., Litt, B., Baltuch, G., & Kahana, M. J. (2011). Oscillatory patterns in temporal lobe reveal context reinstatement during memory search. Proceedings of the National Academy of Sciences, USA, 108(31), 12893–12897.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105(4), 724–760.
Mensink, G. J. M., & Raaijmakers, J. G. W. (1988). A model for interference and forgetting. Psychological Review, 95, 434–455.
Miller, J. F., Lazarus, E. M., Polyn, S. M., & Kahana, M. J. (2013). Spatial clustering during memory search. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(3), 773–781. doi: 10.1037/a0029684
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626.
Murdock, B. B. (1997). Context and mediators in a theory of distributed associative memory (TODAM2). Psychological Review, 104(2), 839–862.

Murdock, B. B., & Anderson, R. E. (1975). Encoding, storage and retrieval of item information. In R. L. Solso (Ed.), Information processing and cognition: The Loyola Symposium (pp. 145–194). Hillsdale, NJ: Erlbaum.
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (1990). Encoding context and retrieval conditions as determinants of the effects of natural category size. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 31–41.
Norman, K. A., & O'Reilly, R. C. (2003). Modeling hippocampal and neocortical contributions to recognition memory: A complementary-learning-systems approach. Psychological Review, 110(4), 611–646.
Ohrt, D. D., & Gronlund, S. D. (1999). List-length effects and continuous memory: Confounds and solutions. In C. Izawa (Ed.), On human memory: Evolution, progress, and reflections on the 30th anniversary of the Atkinson-Shiffrin model (pp. 105–125). Mahwah, NJ: Erlbaum.
Phillips, J. L., Shiffrin, R. J., & Atkinson, R. C. (1967). The effects of list length on short-term memory. Journal of Verbal Learning and Verbal Behavior, 6, 303–311.
Pollio, H. R., Kasschau, R. A., & DeNise, H. E. (1968). Associative structure and the temporal characteristics of free recall. Journal of Verbal Learning and Verbal Behavior, 10, 190–197.
Pollio, H. R., Richards, S., & Lucas, R. (1969). Temporal properties of category recall. Journal of Verbal Learning and Verbal Behavior, 8, 529–536.
Polyn, S. M., Norman, K. A., & Kahana, M. J. (2009a). A context maintenance and retrieval model of organizational processes in free recall. Psychological Review, 116, 129–156.
Polyn, S. M., Norman, K. A., & Kahana, M. J. (2009b). Task context and organization in free recall. Neuropsychologia, 47(11), 2158–2163.
Postman, L., & Phillips, L. W. (1965). Short-term temporal changes in free recall. Quarterly Journal of Experimental Psychology, 17, 132–138.
Province, J. M., & Rouder, J. N. (2012). Evidence for discrete-state processing in recognition memory. Proceedings of the National Academy of Sciences, USA, 109(36), 14357–14362. doi: 10.1073/pnas.1103880109
Raaijmakers, J. G. W., & Shiffrin, R. M. (1980). SAM: A theory of probabilistic search of associative memory. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 14, pp. 207–262). New York: Academic.
Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of associative memory. Psychological Review, 88, 93–134.
Ratcliff, R., Clark, S. E., & Shiffrin, R. M. (1990). The list-strength effect: I. Data and discussion. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 163–178.
Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 803–814.
Roediger, H. L., & Schmidt, S. R. (1980). Output interference in the recall of categorized and paired-associate lists. Journal of Experimental Psychology: Human Learning and Memory, 6(1), 91–102.
Rotello, C. M., Macmillan, N. A., Reeder, J. A., & Wong, M. (2005). The remember response: Subject to bias, graded, and not a process-pure indicator of recollection. Psychonomic Bulletin & Review, 12(5), 865–873.
Rugg, M. D., & Curran, T. (2007). Event-related potentials and recognition memory. Trends in Cognitive Sciences, 11(6), 251–257.
Schooler, L. J., Shiffrin, R. M., & Raaijmakers, J. G. (2001). A Bayesian model for implicit effects in perceptual identification. Psychological Review, 108(1), 257–272.
Schwartz, G., Howard, M. W., Jing, B., & Kahana, M. J. (2005). Shadows of the past: Temporal retrieval effects in recognition memory. Psychological Science, 16(11), 898–904.
Sederberg, P. B., Howard, M. W., & Kahana, M. J. (2008). A context-based theory of recency and contiguity in free recall. Psychological Review, 115, 893–912.
Sederberg, P. B., Miller, J. F., Howard, M. W., & Kahana, M. J. (2010). The temporal contiguity effect predicts episodic memory performance. Memory & Cognition, 38, 689–699.
Shankar, K. H., & Howard, M. W. (2012). A scale-invariant representation of time. Neural Computation, 24, 134–193.
Shiffrin, R. M. (1970). Forgetting: Trace erosion or retrieval failure? Science, 168, 1601–1603.
Shiffrin, R. M., Huber, D. E., & Marinelli, K. (1995). Effects of category length and strength on familiarity in recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(2), 267–287.
Shiffrin, R. M., Ratcliff, R., & Clark, S. E. (1990). List-strength effect: II. Theoretical mechanisms. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(2), 179–195.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166.
Sirotin, Y. B., Kimball, D. R., & Kahana, M. J. (2005). Going beyond a single list: Modeling the effects of prior experience on episodic free recall. Psychonomic Bulletin & Review, 12, 787–805.
Staresina, B. P., Fell, J., Dunn, J. C., Axmacher, N., & Henson, R. N. (2013). Using state-trace analysis to dissociate the functions of the human hippocampus and perirhinal cortex in recognition memory. Proceedings of the National Academy of Sciences, 110(8), 3119–3124. doi: 10.1073/pnas.1215710110
Starns, J. J., White, C. N., & Ratcliff, R. (2010). A direct test of the differentiation mechanism: REM, BCDMEM, and the strength-based mirror effect in recognition memory. Journal of Memory and Language, 63(1), 18–34.
Steyvers, M., Shiffrin, R. M., & Nelson, D. L. (2004). Word association spaces for predicting semantic similarity effects in episodic memory. In A. Healy (Ed.), Experimental cognitive psychology and its applications: Festschrift in honor of Lyle Bourne, Walter Kintsch, and Thomas Landauer (pp. 237–249). Washington, DC: American Psychological Association.
Stretch, V., & Wixted, J. T. (1998). On the difference between strength-based and frequency-based mirror

effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24(6), 1379–1396.
Tehan, G., & Humphreys, M. S. (1996). Cuing effects in short-term recall. Memory & Cognition, 24(6), 719–732.
Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of memory (pp. 381–403). New York: Academic.
Tulving, E. (1983). Elements of episodic memory. New York, NY: Oxford University Press.
Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology, 53, 1–25.
Tulving, E., & Arbuckle, T. (1963). Sources of intratrial interference in immediate recall of paired associates. Journal of Verbal Learning and Verbal Behavior, 1(5), 321–334.
Tulving, E., & Arbuckle, T. Y. (1966). Input and output interference in short-term associative memory. Journal of Experimental Psychology, 72, 145–150.
Unsworth, N., Spillers, G. J., & Brewer, G. A. (2012). Evidence for noisy contextual search: Examining the dynamics of list-before-last recall. Memory, 20(1), 1–13. doi: 10.1080/09658211.2011.626430
Ward, G., & Tan, L. (2004). The effect of the length of to-be-remembered lists and intervening lists on free recall: A reexamination using overt rehearsal. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(6), 1196–1210.
Wickens, D. D., Born, D. G., & Allen, C. K. (1963). Proactive inhibition and item similarity in short-term memory. Journal of Verbal Learning and Verbal Behavior, 2(5), 440–445.
Wilding, E. L., Doyle, M. C., & Rugg, M. D. (1995). Recognition memory with and without retrieval of context: An event-related potential study. Neuropsychologia, 33(6), 743–767.
Wixted, J. T. (2007). Spotlighting the probative findings: Reply to Parks and Yonelinas (2007). Psychological Review, 114, 203–209.
Wixted, J. T., & Squire, L. R. (2011). The familiarity/recollection distinction does not illuminate medial temporal lobe function: Response to Montaldi and Mayes. Trends in Cognitive Sciences, 15(8), 340–341. doi: 10.1016/j.tics.2011.06.006
Yonelinas, A. P., & Parks, C. M. (2007). Receiver operating characteristics (ROCs) in recognition memory: A review. Psychological Bulletin, 133(5), 800–832.
Zaromb, F. M., Howard, M. W., Dolan, E. D., Sirotin, Y. B., Tully, M., Wingfield, A., & Kahana, M. J. (2006). Temporal associations and prior-list intrusions in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 792–804.


PART III Higher Level Cognition

CHAPTER 9 Structure and Flexibility in Bayesian Models of Cognition

Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths

Abstract

Probability theory forms a natural framework for explaining the impressive success of people at solving many difficult inductive problems, such as learning words and categories, inferring the relevant features of objects, and identifying functional relationships. Probabilistic models of cognition use Bayes's rule to identify probable structures or representations that could have generated a set of observations, whether the observations are sensory input or the output of other psychological processes. In this chapter we address an important question that arises within this framework: How do people infer representations that are complex enough to faithfully encode the world but not so complex that they "overfit" noise in the data? We discuss nonparametric Bayesian models as a potential answer to this question. To do so, first we present the mathematical background necessary to understand nonparametric Bayesian models. We then delve into nonparametric Bayesian models for three types of hidden structure: clusters, features, and functions. Finally, we conclude with a summary and discussion of open questions for future research.

Key Words: inductive inference, Bayesian modeling, nonparametrics, bias-variance tradeoff, categorization, feature representations, function learning, clustering

Introduction

Probabilistic models of cognition explore the mathematical principles behind human learning and reasoning. Many of the most impressive tasks that people perform—learning words and categories, identifying causal relationships, and inferring the relevant features of objects—can be framed as problems of inductive inference. Probability theory provides a natural mathematical framework for inductive inference, generalizing logic to incorporate uncertainty in a way that can be derived from various assumptions about rational behavior (e.g., Jaynes, 2003).

Recent work has used probabilistic models to explain many aspects of human cognition, from memory to language acquisition (for a representative sample, see Chater and Oaksford, 2008). There are existing tutorials on some of the key mathematical ideas behind this approach (Griffiths and Yuille, 2006; Griffiths, Kemp, & Tenenbaum, 2008a) and its central theoretical commitments (Tenenbaum, Griffiths, & Kemp, 2006; Tenenbaum, Kemp, Griffiths, & Goodman, 2010a; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010a). In this chapter, we focus on a recent development in the probabilistic approach that has received less attention—the capacity to support both structure and flexibility.

One of the striking properties of human cognition is the ability to form structured representations in a flexible way: we organize our environment into meaningful clusters of objects, identify discrete features that those objects possess, and learn relationships between those features, without apparent hard constraints on the complexity of these representations. This kind of learning poses a challenge for models of cognition: how can we define models that exhibit the same capacity for structured, flexible learning? And how do we identify the right level of flexibility, so that we only postulate the appropriate level of complexity? A number of recent models have explored answers to these questions based on ideas from nonparametric Bayesian statistics (Sanborn, Griffiths, & Navarro, 2010; Austerweil & Griffiths, 2013; see Gershman and Blei, 2012, for a review), and we review the key ideas behind this approach in detail.

Probabilistic models of cognition tend to focus on a level of analysis that is more abstract than that of many of the other modeling approaches discussed in the chapters of this handbook. Rather than trying to identify the cognitive or neural mechanisms that underlie behavior, the goal is to identify the abstract principles that characterize how people solve inductive problems. This kind of approach has its roots in what Marr (1982) termed the computational level, focusing on the goals of an information-processing system and the logic by which those goals are achieved, and was implemented in Shepard's (1987) search for universal laws of cognition and Anderson's (1990) method of rational analysis. But it is just the starting point for gaining a more complete understanding of human cognition—one that tells us not just why people do the things they do, but how they do them. Although we will not discuss it in this chapter, research on probabilistic models of cognition is beginning to consider how we can take this step, bridging different levels of analysis. We refer the interested reader to one of the more prominent strategies for building such a bridge, namely rational process models (Sanborn et al., 2010).

The plan of this chapter is as follows. First, we introduce the core mathematical ideas that are used in probabilistic models of cognition—the basics of Bayesian inference—and a formal framework for characterizing the challenges posed by flexibility. We then turn to a detailed presentation of the ideas behind nonparametric Bayesian inference, looking at how this approach can be used for learning three different kinds of representations—clusters, features, and functions.

Mathematical Background

In this section, we present the necessary mathematical background for understanding nonparametric Bayesian models of cognition. First, we describe the basic logic behind using Bayes's rule for inductive inference. Then, we explore two of the main types of hypothesis spaces for possible structures used in statistical models: parametric and nonparametric models.1 Finally, we discuss what it means for a nonparametric model to be "Bayesian" and propose nonparametric Bayesian models as methods combining the benefits of both parametric and nonparametric models. This sets up the remainder of the article, where we compare the solution given by nonparametric Bayesian methods to how people (implicitly) solve this dilemma when learning associations, categories, features, and functions.

Basic Bayes

After observing some evidence from the environment, how should an agent update her beliefs in the various structures that could have produced the evidence? Given a set of candidate structures and the ability to describe the degree of belief in each structure, Bayes's rule prescribes how an agent should update her beliefs across many normative standards (Oaksford and Chater, 2007; Robert, 1994). Bayes's rule simply states that an agent's belief in a structure or hypothesis h after observing data d from the environment, the posterior P(h | d), should be proportional to the product of two terms: her prior belief in the structure, the prior P(h), and how likely the observed data d would be had it been produced by the candidate structure, called the likelihood P(d | h). This is given by

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\,P(h')},$$

where $\mathcal{H}$ is the space of possible hypotheses or latent structures. Note that the summation in the denominator is the normalization constant, which ensures that the posterior probability is still a valid probability distribution (sums to one). In addition to specifying how to calculate the posterior probability of each hypothesis, a Bayesian model prescribes how an agent should update her belief in observing new data $d_{\mathrm{new}}$ from the environment given the previous observations d:

$$P(d_{\mathrm{new}} \mid d) = \sum_{h} P(d_{\mathrm{new}} \mid h)\,P(h \mid d) = \sum_{h} P(d_{\mathrm{new}} \mid h)\,\frac{P(d \mid h)\,P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\,P(h')}.$$

The fundamental assumptions of Bayesian models (i.e., what makes them "Bayesian") are (a) agents express their expectations over structures as probabilities and (b) they update their expectations according
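When the hypothesis space is small and discrete, both of these equations can be evaluated directly by summation. The following sketch (our own illustrative example, not one from the chapter) uses three candidate biases of a coin as the hypothesis space:

```python
# Bayes's rule over a discrete hypothesis space, plus the posterior
# predictive distribution for a new observation.
# Hypotheses: three candidate coin biases (an illustrative choice).

hypotheses = [0.25, 0.50, 0.75]                          # P(heads) under each h
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior P(h)

def likelihood(data, h):
    """P(d | h) for a sequence of flips, coded 1 = heads, 0 = tails."""
    p = 1.0
    for x in data:
        p *= h if x == 1 else (1.0 - h)
    return p

def posterior(data):
    """P(h | d) = P(d | h) P(h) / sum over h' of P(d | h') P(h')."""
    unnorm = {h: likelihood(data, h) * prior[h] for h in hypotheses}
    z = sum(unnorm.values())                  # the normalization constant
    return {h: v / z for h, v in unnorm.items()}

def predictive(data, x_new):
    """P(d_new | d) = sum over h of P(d_new | h) P(h | d)."""
    post = posterior(data)
    return sum(likelihood([x_new], h) * post[h] for h in hypotheses)

post = posterior([1, 1, 0, 1])    # three heads, one tail
# The posterior favors the heads-biased hypothesis h = 0.75, and the
# predictive probability of another head exceeds 0.5.
```

Because the denominator is shared by every hypothesis, only the relative sizes of prior times likelihood determine the posterior; the summation merely rescales them so that they sum to one.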

to the laws of probability. These are not uncontroversial assumptions in psychology (e.g., Bowers and Davis, 2012; Jones and Love, 2011; Kahneman et al., 1982; McClelland et al., 2010; but also see the replies: Chater et al., 2011; Griffiths, Chater, Norris, & Pouget, 2012; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010b). However, they are extremely useful because they provide methodological tools for exploring the consequences of adopting different assumptions about the kind of structure that appears in the environment.

Parametric and Nonparametric

One of the first steps in formulating a computational model is a specification of the possible structures that could have generated the observations from the environment, the hypothesis space H. As discussed in the Basic Bayes subsection, each hypothesis h in a hypothesis space H is defined as a probability distribution over the possible observations. So, specifying the hypothesis space amounts to defining the set of possible distributions over events that the agent could observe. To make the model Bayesian, a prior distribution over those hypotheses also needs to be specified.

From a statistical point of view, a Bayesian model with a particular hypothesis space (and prior over those hypotheses) is a solution to the problem of density estimation, which is the problem of estimating the probability distribution over possible observations from the environment.2 In fact, it is the optimal solution given that the hypothesis space faithfully represents how the environment produces observations, and the environment randomly selects a hypothesis to produce observations with probability proportional to the prior distribution.

In general, a probability distribution is a function over the space of observations, which can be continuous, and thus is specified by an infinite number of parameters. So, density estimation involves identifying a function specified by an infinite number of parameters, as theoretically, it must specify the probability of each point in a continuous space. From this perspective, a function is analogous to a hypothesis, and the space of possible functions constructed by varying the values of the parameters defines a hypothesis space. Different types of statistical models make different assumptions about the possible functions that define a density, and statistical inference amounts to estimating the parameters that define each function.

The statistical literature offers a useful classification of different types of probability density functions, based on the distinction between parametric and nonparametric models (Bickel and Doksum, 2007). Parametric models take the set of possible densities to be those that can be identified with a fixed number of parameters. An example of a parametric model is one that assumes the density follows a Gaussian distribution with a known variance, but unknown mean. This model estimates the mean of the Gaussian distribution based on observations, and its estimate of the probability of new observations is their probability under a Gaussian distribution with the estimated mean. One property of parametric models is that they assume there exists a fixed set of possible structures (i.e., parameterizations) that does not change regardless of the amount of data observed. For the earlier example, no matter how much data the model is given that is inconsistent with a Gaussian distribution (e.g., a bimodal distribution), its density estimate would still be a Gaussian distribution because it is the only function available to the model.

In contrast, nonparametric models make much weaker assumptions about the family of possible structures. For this to be possible, the number of parameters of a nonparametric model increases with the number of data points observed. An example of a nonparametric statistical model is a Gaussian kernel model, which places a Gaussian distribution at each observation, and its density estimate is the average over the Gaussian distributions associated with each observation. In essence, the parameters of this statistical model are the observations, and so the parameters of the model grow with the number of data points. Although "nonparametric" suggests that nonparametric models do not have any parameters, this is not the case. Rather, the number of parameters in a nonparametric model is not fixed with respect to the amount of data.

One domain within cognitive science where the distinction between parametric and nonparametric models has been useful is category learning (Ashby and Alfonso-Reese, 1995). The computational problem underlying category learning is identifying a probability distribution associated with each category label. Prototype models (Posner and Keele, 1968; Reed, 1972) approach this problem parametrically, by estimating the mean of a Gaussian distribution for each category. Alternatively, exemplar models (Medin and Schaffer, 1978; Nosofsky, 1986) are nonparametric, using each observation as a parameter; each category's density estimate for
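The contrast between the two estimators can be made concrete in a few lines of code. This sketch assumes one-dimensional observations and a fixed unit variance (both our own simplifying choices): the parametric estimator fits only a mean, while the kernel estimator keeps every observation as a parameter.

```python
import math

def gaussian_pdf(x, mean, var=1.0):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def parametric_density(data):
    """Parametric estimate: a single Gaussian whose mean is fit to the data."""
    mean = sum(data) / len(data)
    return lambda x: gaussian_pdf(x, mean)

def kernel_density(data):
    """Nonparametric estimate: the average of Gaussians centered on each
    observation, so the number of 'parameters' grows with the data."""
    return lambda x: sum(gaussian_pdf(x, obs) for obs in data) / len(data)

# A bimodal sample: the parametric model is forced to place its single
# Gaussian between the modes, while the kernel model tracks both clumps.
data = [-3.0, -2.8, -3.2, 3.0, 2.9, 3.1]
p = parametric_density(data)   # fitted mean is 0.0, between the two modes
k = kernel_density(data)       # assigns far more density near x = 3
```

In the category-learning analogy, `parametric_density` plays the role of a prototype model (one summary statistic per category), and `kernel_density` the role of an exemplar model (every stored observation contributes to the estimate).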

structure and flexibility in bayesian models of cognition 189

a new observation is a function of the sum of the distances between the new observation and previous observations. See Figure 9.1(a) and 9.1(b) for examples of parametric and nonparametric density estimation, respectively.

Fig. 9.1 Density estimators from the same observations (displayed in blue) for three types of statistical models: (a) parametric, (b) nonparametric, and (c) nonparametric Bayesian models. The parametric model estimates the mean of a Gaussian distribution from the observations, which results in the Gaussian density (displayed in black). The nonparametric model averages over the Gaussian distributions centered at each data point (in red) to yield its density estimate (displayed in black). The nonparametric Bayesian model puts the observations into three groups, each of which gets its own Gaussian distribution with mean centered at the center of the observations (displayed in red). The density estimate is formed by averaging over the Gaussian distributions associated with each group (displayed in black).

One limitation of parametric models is that the structure inferred by these models will not be the true structure producing the observations if the true structure is not in the model's hypothesis space. On the other hand, many nonparametric models are guaranteed to infer the true structure given enough (i.e., infinite) observations, which is a property known as consistency in the statistics literature (Bickel and Doksum, 2007). Thus, as the number of observations increases, nonparametric models have lower error than parametric models when the true structure is not in the hypothesis space of the parametric model. However, nonparametric models typically need more observations to arrive at the true hypothesis when the true hypothesis is in the parametric model's hypothesis space. Box 1 describes the bias-variance trade-off, which expounds this intuition and provides a formal framework for understanding the benefits and problems with each approach.

Putting Them Together: Nonparametric Bayesian Models

Although people are clearly biased toward certain structures by their prior expectations (like parametric models), the bias is a soft constraint, meaning that people seem to be able to infer seemingly arbitrarily complex models given enough evidence (like nonparametric models). In the remainder of the article, we propose nonparametric Bayesian models, which are nonparametric models with prior biases toward certain types of structures, as a computational explanation for how people infer structure but maintain flexibility.

Box 1 The bias-variance trade-off

Given that nonparametric models are guaranteed to infer the true structure, why would anyone use a parametric model? Although it is true that nonparametric models will converge to the true structure, they are only guaranteed to do so in the limiting case of an infinite number of observations. However, people do not get an infinite number of observations. Thus, the more appropriate question for cognitive scientists is how do people infer structures from a small number of observations, and which type of model is more appropriate for understanding human performance? There are many structures consistent with the limited and noisy evidence typically observed by people. Furthermore, when observations are noisy, it becomes difficult for an agent to distinguish between noise and systematic variation due to the underlying structure, a problem known as "overfitting." When there are many structures available to the agent, as is the case for nonparametric models, this becomes a serious issue. So, although nonparametric models have the upside of guaranteed convergence to the appropriate structure, they have the downside of being

190 higher level cognition

Box 1 continued

prone to overfitting. In this box, we discuss the bias-variance trade-off, which provides a useful framework for understanding the trade-off between ultimate convergence to the true structure and overfitting.

The bias-variance trade-off is a mathematical result, which demonstrates that the amount of error an agent is expected to make when learning from observations can be decomposed into the sum of two components (Geman, Bienenstock, & Doursat, 1992; Griffiths et al., 2010): bias, which measures how close the expected estimated structure is to the true structure, and variance, which measures how sensitive the expected estimated structure is to noise (how much it is expected to vary across different possible observations). Intuitively, increasing the number of possible structures by using a larger parametric or a fully nonparametric model reduces the bias of the model, because a larger hypothesis space increases the likelihood that the true structure is available to the model. However, it also increases the variance of the model, because it will be harder to choose among them given noisy observations from the environment. On the other hand, decreasing the possible number of structures available by using a small parametric model increases the bias of the model, because unless the true structure is one of the structures available to the parametric model, it will not be able to infer the true structure. Furthermore, using a small parametric model reduces the variance, because there are fewer structures available to the model that are likely to be consistent with the noisy observations. This decomposition presents a genuine trade-off: reducing the bias of a model by using a nonparametric model with fewer prior constraints comes at the cost of less efficient inference and increased susceptibility to overfitting, resulting in larger variance.

How do people resolve the bias-variance trade-off? In some domains, they are clearly biased, because some structures are much easier to learn than others (e.g., linear functions in function learning; Brehmer 1971, 1974). So in some respects people act like parametric models, in that they use strong constraints to infer structures. However, given enough training, experience, and the right kind of information, people can infer extremely complex structures (e.g., McKinley and Nosofsky 1995). Thus, in other respects, people act like nonparametric models. How to reconcile these two views remains an outstanding question for theories of human learning. Hierarchical Bayesian models offer one possible answer, where agents maintain multiple hypothesis spaces and infer the appropriate hypothesis space to perform Bayesian inference over, using the distribution of stimuli in the domain (Kemp, Perfors, & Tenenbaum, 2007) and the concepts agents learn over the stimuli (Austerweil and Griffiths, 2010a). In principle, a hierarchical Bayesian model could be formulated that includes both parametric and nonparametric hypothesis spaces, thereby inferring which is appropriate for a given domain. Formulating such a model is an interesting challenge for future research.

Nonparametric Bayesian models are Bayesian because they put prior probabilities over the set of possible structures, which typically include arbitrarily complex structures. They posit probability distributions over structures that can, in principle, be infinitely complex, but they are biased towards "simpler" structures (those representable using a smaller number of units), which reduces the variance that plagues classical nonparametric models. The probability of data under a structure, which can be very large for complex structures that encode each observation explicitly (e.g., each observation in its own category), is traded off against a prior bias toward simpler structures, which allow observations to share parameters. This bias toward simpler structures is a soft constraint, allowing models to adopt more complex structures as new data arrive (this is what makes these models "nonparametric"). Thus, nonparametric Bayesian models combine the benefits of parametric and nonparametric models: a small variance (by using Bayesian priors) and a small bias (by adapting their structure nonparametrically). See Figure 9.1(c) for an example of nonparametric Bayesian density estimation.

Nonparametric Bayesian models can be classified according to the type of hidden structure they posit. For the previously discussed category learning example, the hidden structure is a probability distribution over the space of observations. Thus, the prior is a probability distribution over probability distributions. A common choice for this prior is the

Dirichlet process (Ferguson, 1973), which induces a set of discrete clusters, where each observation belongs to a single cluster and each cluster is assigned a randomly drawn value. Combining the Dirichlet process with a model of how observed features are generated by clusters, we obtain a Dirichlet process mixture model (Antoniak, 1974). As we discuss in the following section, elaborations of the Dirichlet process mixture model have been applied to many psychological domains as varied as category learning (Anderson, 1991; Sanborn, Canini, & Navarro, 2008b), word segmentation (Goldwater, Griffiths, & Johnson, 2009), and associative learning (Gershman, Blei, & Niv, 2010; Gershman and Niv, 2012).

Although many of the applications of nonparametric Bayesian models in cognitive science have focused on the Dirichlet process mixture model, other nonparametric Bayesian models, such as the Beta process (Hjort, 1990; Thibaux and Jordan, 2007) and the Gaussian process (Rasmussen and Williams, 2006), are more appropriate when people infer probability distributions over observations that are encoded using multiple discrete units or continuous units. For example, feature learning is best described by a hidden structure with multiple discrete units. The Beta process (Griffiths and Ghahramani, 2005, 2011; Hjort, 1990; Thibaux and Jordan, 2007) is one appropriate nonparametric Bayesian model for this example, and as we discuss in the section Inferring Features: What Is a Perceptual Unit?, elaborations of the Beta process have been applied to model feature learning (Austerweil and Griffiths, 2011, 2013), multimodal learning (Yildirim and Jacobs, 2012), and choice preferences (Görür, Jäkel; Miller, Griffiths). Finally, Gaussian processes are appropriate when each observation is encoded using one or more continuous units; we discuss their application to function learning in the section Learning Functions: How Are Continuous Quantities Related? (Griffiths, Lucas, Williams, & Kalish, 2009).

Inferring Clusters: How Are Observations Organized into Groups?

One of the basic inductive problems faced by people is organizing observations into groups, sometimes referred to as clustering. This problem arises in many domains, including category learning (Clapper and Bower, 1994; Kaplan and Murphy, 1999; Pothos and Chater, 2002), motion perception (Braddick, 1993), causal inference (Kemp, Tenenbaum, Niyogi, & Griffiths, 2010), word segmentation (Werker and Yeung, 2005), semantic representation (Griffiths, Steyvers, & Tenenbaum, 2007), and associative learning (Gershman et al., 2010). Clustering is challenging because in real-world situations the number of clusters is often unknown. For example, a child learning language does not know a priori how many words there are in the language. How should a learner discover new clusters?

In this section, we show how clustering can be formalized as Bayesian inference, focusing in particular on how the nonparametric concepts introduced in the previous section can be brought to bear on the problem of discovering new clusters. We then describe an application of the same ideas to associative learning.

A Rational Model of Categorization

Categorization can be formalized as an inductive problem: given the features of a stimulus (denoted by x), infer the category label c. Using Bayes's rule, the posterior over category labels is given by:

P(c|x) = P(x|c)P(c) / Σ_c′ P(x|c′)P(c′) = P(x,c) / Σ_c′ P(x,c′).

From this point of view, category learning is fundamentally a problem of density estimation (Ashby and Alfonso-Reese, 1995), because people are estimating a probability distribution over the possible observations from each category. Probabilistic models differ in the assumptions they make about the joint distribution P(x,c). Anderson (1991) proposed that people model this joint distribution as a mixture model:

P(x,c) = Σ_z P(x,c|z)P(z),

where z ∈ {1,...,K} denotes the cluster assigned to x (z is the traditional notation for a cluster, and is analogous to the hypothesis h in the previous section). From a generative perspective, observations are generated by a mixture model from the environment according to the following process: to sample observation n, first sample its cluster z_n from P(z), and then sample the observation x_n and its category c_n from the joint distribution specified by the cluster, P(x,c|z_n). Each distribution specified by a cluster might be simple (e.g., a Gaussian), but their mixture can approximate arbitrarily complicated distributions. Because each observation only belongs to one cluster, the assignments z_1:n = {z_1,...,z_n} encode a partition of the items into K distinct clusters, where a partition is a grouping of items into mutually

exclusive clusters. When the value of K is specified, this generative process defines a simple probabilistic model of categorization, but what should be the value of K?

To address the question of how to select the value of K, Anderson assumed that K is not known a priori, but rather learned from experience, such that K can be increased as new data are observed. As the number of clusters grows with observations and each cluster has associated parameters defining its probability distribution over observations, this rational model of categorization is nonparametric. Anderson proposed a prior on partitions that sequentially assigns observations to clusters according to:

P(z_n = k | z_1:n−1) = m_k / (n − 1 + α)   if m_k > 0 (i.e., k is old)
P(z_n = k | z_1:n−1) = α / (n − 1 + α)   if m_k = 0 (i.e., k is new)

where m_k is the number of items in z_1:n−1 assigned to cluster k, n is the total number of items observed so far, and α ≥ 0 is a parameter that governs the total number of clusters.3 As pointed out by Neal (2000), the process proposed by Anderson is equivalent to a distribution on partitions known as the Chinese restaurant process (CRP; Aldous, 1985; Blackwell and MacQueen, 1973). Its name comes from the following metaphor (illustrated in Figure 9.2): Imagine a Chinese restaurant with an unbounded number of tables (clusters), where each table can seat an unbounded number of customers (observations). The first customer enters and sits at the first table. Subsequent customers sit at an occupied table with a probability proportional to how many people are already sitting there (m_k), and at a new table with probability proportional to α. Once all the customers are seated, the assignment of customers to tables defines a partition of observations into clusters.

The CRP arises in a natural way from a nonparametric mixture modeling framework (see Gershman and Blei, 2012, for more details). To see this, consider a finite mixture model where the cluster assignments are drawn from:

θ ∼ Dirichlet(α/K, ..., α/K)
z_i ∼ Multinomial(θ), for i = 1, ..., n,

where Dirichlet(·) denotes the K-dimensional Dirichlet distribution, and θ roughly corresponds to the relative weight of each block in the partition. As the number of clusters increases to infinity (K → ∞), this distribution on z_1:n is equivalent to the CRP. Another view of this model is given by a seemingly unrelated process, the Dirichlet process (Ferguson, 1973), which is a probability distribution over discrete probability distributions. It directly generates the partition and the parameters associated with each block in the partition. Marginalizing over all the possible ways of getting the same partition from a Dirichlet process defines a related distribution, the Pólya urn (Blackwell and MacQueen, 1973), which is equivalent to the CRP when the parameter associated with each block is ignored. For this reason, a mixture model with a CRP prior on partitions is known as a Dirichlet process mixture model (Antoniak, 1974).

One of the original motivations for developing the rational model of categorization was to reconcile two important observations about human category learning. First, in some cases, the confidence with which people assign a new stimulus to a category is inversely proportional to its distance from the average of the previous stimuli in that category (Reed, 1972). This, in conjunction with other data on central tendency effects (e.g., Posner and Keele, 1968), has been interpreted as people abstracting a "prototype" from the observed stimuli. On the other hand, in some cases, people are sensitive to specific stimuli (Medin and Schaffer, 1978), a finding that has inspired exemplar models that memorize the entire stimulus set (e.g., Nosofsky, 1986). Anderson (1991) pointed out that his rational model of categorization can capture both of these findings, depending on the inferred partition structure: when all items are assigned to the same cluster, the model is equivalent to forming a single

Fig. 9.2 (a) The culinary representation of the Chinese restaurant process and (b) the cluster assignments implied by it.

prototype, whereas when all items are assigned to unique clusters, the model is equivalent to an exemplar model. However, in most circumstances, the rational model of categorization will partition the items in a manner somewhere between these two extremes. This is a desirable property that other recent models of categorization have adopted (Love, Medin, & Gureckis, 2004). Finally, the CRP serves as a useful component in many other cognitive models (see Figure 7 and Box 2).

Associative Learning

As its name suggests, associative learning has traditionally been viewed as the process by which associations are learned between two or more stimuli. The paradigmatic example is Pavlovian conditioning, in which a cue and an outcome (e.g., a tone and a shock) are paired together repeatedly. When done with rats, this procedure leads to the rat freezing in anticipation of the shock whenever it hears the tone.

Associative learning can be interpreted in terms of a probabilistic causal model, which calculates the probability that one variable, called the cue (e.g., a tone), causes the other variable, called the outcome (e.g., a shock). Here y encodes the presence or absence of the outcome, and x similarly encodes the presence or absence of the cue. The model assumes that y is a noisy linear function of x: y ∼ N(wx, σ²). The parameter w encodes the associative strength between x and y, and σ² parameterizes the variability of their relationship. This model can be generalized to the case in which multiple cues are paired with an outcome by assuming that the associations are additive: y ∼ N(Σ_i w_i x_i, σ²), where i ranges over the cues. The linear-Gaussian model also has an interesting connection to classical learning theories such as the Rescorla-Wagner model (Rescorla and Wagner, 1972), which can be interpreted as assuming a Gaussian prior on w and carrying out Bayesian inference on w (Dayan, Kakade, & Montague, 2000; Kruschke, 2008).

Despite the successes of the Rescorla-Wagner model and its probabilistic variants, they incorrectly predict that there should only be learning when the prediction error is nonzero (i.e., when y − Σ_i w_i x_i ≠ 0), but people and animals can still learn in some cases. For example, in sensory preconditioning (Brogden, 1939), two cues (A and B) are presented together without an outcome; when A is subsequently paired with an outcome, cue B acquires associative strength despite never being paired with the outcome. Because A and B presumably start out with zero associative strength and there are no prediction errors during the preconditioning phase, the Rescorla-Wagner model predicts that there should be no learning. The model fails to explain how preconditioning enables B to subsequently acquire associative strength from A-outcome training.

Box 2 Composing Richer Nonparametric Models

Although this chapter has focused on several of the most basic nonparametric Bayesian models and their applications to basic psychological processes, these components can also be composed to build richer accounts of more sophisticated forms of human learning and reasoning (bottom half of Figure 7). These composites greatly extend the scope of phenomena that can be captured in probabilistic models of cognition. In this box, we discuss a number of these composite models.

In many domains, categories are not simply a flat partition of entities into mutually exclusive classes. Often they have a natural hierarchical organization, as in a taxonomy that divides life forms into animals and plants; animals into mammals, birds, fish, and other forms; mammals into canidae, felines, primates, rodents, ...; canidae into dogs, wolves, foxes, coyotes, ...; and dogs into many different breeds. Category hierarchies can be learned via nonparametric models based on nested versions of the CRP, in which each category at one level gives rise to a CRP of subcategories at the level below (Griffiths et al., 2008b; Blei, Griffiths, & Jordan, 2010).

Category learning and feature learning are typically studied as distinct problems (as discussed in the sections Inferring Clusters: How Are Observations Organized into Groups? and Inferring Features: What Is a Perceptual Unit?), but in many real-world situations people can jointly discover how best to organize objects into classes and which features best support these categories. Nonparametric models that combine the CRP and IBP can capture these joint inferences (Austerweil and Griffiths, 2013). The model of Salakhutdinov et al. (2012) extends this idea to hierarchies, using the nested CRP combined with a hierarchical Dirichlet process topic model to jointly learn hierarchically structured object categories and

Box 2 continued

hierarchies of part-like object features that support these categories.

The CrossCat model (Shafto et al., 2011) allows us to move beyond another limitation of typical models of categorization (parametric or nonparametric): the assumption that there is only a single best way to categorize a set of entities. Many natural domains can be represented in multiple ways: animals may be thought of in terms of their taxonomic groupings or their ecological niches; foods may be thought of in terms of their nutritional content or social role; products may be thought of in terms of function or brand. CrossCat discovers multiple systems of categories of entities, each of which accounts for a distinct subset of the entities' observed attributes, by nesting CRPs over entities inside CRPs over attributes.

Traditional approaches to categorization treat each entity individually, but richer semantic structure can be found by learning categories in terms of how groups of entities relate to each other. The Infinite Relational Model (IRM; Kemp et al. 2006, 2010) is a nonparametric model that discovers categories of objects that not only share similar attributes, but also participate in similar relations. For instance, a data set for consumer choice could be characterized in terms of which consumers bought which products, which features are present in which products, which demographic properties characterize which users, and so on. IRM could then discover how to categorize products and consumers (and perhaps also features and demographic properties), and simultaneously uncover regularities in how these categories relate (for example, that consumers in class X tend to buy products in class Y).

Nonparametric models defined over graph structures, such as the graph-based GP models of Kemp and Tenenbaum (2008, 2009), can capture how people reason about a wider range of dependencies between the properties of entities and the relations between entities, allowing that objects in different domains can be related in qualitatively different ways. For instance, the properties of cities might be best explained by their relative positions in a two-dimensional map, the voting patterns of politicians by their orientation along a one-dimensional liberal-conservative axis, and the properties of animals by their relation in a taxonomic tree. We could also distinguish animals' anatomical and physiological properties, which are best explained by the taxonomy, from behavioral and ecological properties that might be better explained by their relation in a directed graph such as a food web.

Perhaps most intriguingly, nonparametric Bayesian models can be combined with symbolic grammars to account for how learners could explore the broad landscape of different model structures that might describe a given domain and arrive at the best model (Kemp and Tenenbaum, 2008; Grosse et al., 2012). A grammar is used to generate a space of qualitatively different model families, ranging from simple to complex, each of which defines a predictive model for the observed data based on a GP, CRP, IBP, or other nonparametric process. These frameworks have been used to build more human-like machine learning and discovery systems, but they remain to be tested as psychological accounts of how humans learn domain structures.

Sensory preconditioning and other related findings have prompted consideration of alternative probabilistic models for associative learning. Courville, Daw, and Touretzky (2006) proposed that people and animals posit latent causes to explain their observations in Pavlovian conditioning experiments. According to this idea, a single latent cause generates both the cues and outcomes. Latent cause models are powerful because they can explain why learning occurs in the absence of prediction errors. For example, during sensory preconditioning, the latent cause captures the covariation between the two cues; subsequent conditioning of A increases the probability that B will also be accompanied by the outcome.

An analogous question about how to pick the number of clusters in a mixture model arises in associative learning: How many latent causes should there be in a model of associative learning? To address this question, Gershman and colleagues (Gershman et al., 2010; Gershman and Niv, 2012) used the CRP as a prior on latent causes. This allows the model to infer new latent causes when the sensory statistics change, but otherwise it prefers a small number of latent causes. Unlike previous models that define the number of latent causes

a priori, Gershman et al. (2010) showed that this model could explain why extinction does not tend to erase the original association: Extinction training provides evidence that a new latent cause is active. For example, when conditioning and extinction occur in different contexts, the model infers a different latent cause for each context; upon returning to the conditioning context, the model predicts a renewal of the conditioned response, consistent with empirical findings (see Bouton, 2004). By addressing the question of how agents infer the number of latent causes, the model offered new insight into a perplexing phenomenon.

Inferring Features: What Is a Perceptual Unit?

The types of latent structures that people use to represent a set of stimuli can be far richer than clustering the stimuli into groups. For example, consider the following set of animals: domestic cats, dogs, goldfish, sharks, lions, and wolves. Although they can be represented as clusters (e.g., PETS and WILD ANIMALS, or FELINES, CANINES, and SEA ANIMALS), another way to represent the animals is using features, or multiple discrete units per animal (e.g., a cat might be represented with the following features: HAS TAIL, HAS WHISKERS, HAS FUR, IS CUTE, etc.). Feature representations can be used to solve problems arising in many domains, including choice behavior (Tversky, 1972), similarity (Nosofsky, 1986; Tversky, 1977), and object recognition (Palmer, 1999; Selfridge and Neisser, 1960). Analogous to clustering, the appropriate feature representation, or even the number of features for a domain, is not known a priori. In fact, a common criticism of feature-based similarity is that there is an infinite number of potential features that can be used to represent any stimulus, and that human judgments are mostly determined by the features selected to be used in a given context (Goldmeier, 1972; Goodman, 1972; Murphy and Medin, 1985). How do people infer the appropriate features to represent a stimulus in a given context?

In this section, we describe how the problem of inferring feature representations can be cast as a problem of Bayesian inference, where the hypothesis space is the space of possible feature representations. Because there is an infinite number of feature representations, the model used to solve this problem will be a nonparametric Bayesian model. Then, we illustrate two psychological applications.

A Rational Model of Feature Inference

Analogous to other Bayesian models of cognition, defining a rational model of feature inference amounts to applying Bayes's rule to a specification of the hypothesis space, how hypotheses produce observations (the likelihood), and the prior probability of each hypothesis (the prior). Following previous work by Austerweil and Griffiths (2011), we first define these three components for a Bayesian model with a finite feature repository and then define a nonparametric Bayesian model by allowing an infinite repository of features. This allows the model to infer a feature representation without presupposing the number of features ahead of time.

The computational problem of feature representation inference is as follows: Given the D-dimensional raw primitives for a set of N stimuli X (each object is a D-dimensional row vector x_n), infer a feature representation that encodes the stimuli and adheres to some prior expectations. We decompose a feature representation into two components: an N × K feature ownership matrix Z, which is a binary matrix encoding which of the K features each stimulus has (i.e., z_nk = 1 if stimulus n has feature k, and z_nk = 0 otherwise), and a K × D feature image matrix Y, which encodes the consequence of a stimulus having each feature. In this case, the hypothesis space is the Cartesian product of possible feature ownership matrices and possible feature image matrices. As we discuss later in further detail, the precise format of the feature image matrix Y depends on the format of the observed raw primitives. For example, if the stimuli are images of objects and the primitives are D binary pixels that encode whether light was detected in each part of the retina, then a stimulus and a feature image, x and y respectively, are both D-dimensional binary vectors. So if x is the image of a mug, y could be the image of its handle.

Applying Bayes's rule and assuming that the feature ownership and image matrices are independent a priori, inferring a feature representation amounts to optimizing the product of three terms:

P(Z,Y|X) ∝ P(X|Y,Z)P(Y)P(Z),

where P(X|Y,Z), the likelihood, encodes how well each object x_n is reconstructed by combining together the feature images Y of the features the object has (which is given by z_n), P(Y) encodes prior expectations about feature images (e.g., Gestalt laws), and P(Z) encodes prior expectations about feature ownership matrices.4 As the likelihood and feature image prior are more straightforward to

196 higher level cognition define and specific to the format of the observed at least one object), and Kh is the number of features primitives, we first derive a sensible prior distri- with history h, where a history can be thought bution over all possible feature ownership matrices of as the column of the feature interpreted as a before returning our attention to them. binary number. The term containing the history Beforedelvingintothecaseofaninfinite penalizes features that have equivalent patterns of number of potential features, we derive a prior ownership and it is a method for indexing features distribution on feature ownership matrices that has with equivalent ownership patterns. a finite and known number of features K .Asthe Analogous to the CRP, the probability distri- elements of a feature ownership matrix are binary, bution given by this limiting process is equivalent we can define a probability distribution over the to the probability distribution on binary matrices matrix by flipping a weighted coin with bias πk implied by a simple sequential culinary metaphor. for each element znk. We do not observe πk and In this culinary metaphor, “customers,” correspond- so, we assume a Beta distribution as its prior. This ing to the stimuli or rows of the matrix, enter an corresponds to the following generative process Indian buffet and take dishes, corresponding to π iid∼ α/ = ... the features or columns of the matrix, according k Beta( K ,1), for k 1, ,K to a series of probabilistic decisions based on how iid the previous customers took dishes. When the first z |π ∼ Bernoulli(π ), for n = 1,...,N . nk k k customer enters the restaurant, she takes a number Due to the conjugacy of Bernoulli likelihoods and of new dishes sampled from a Poisson distribution Beta priors, it is relatively simple to integrate out with parameter α. As customers sequentially enter π ... 
π 1, , K to arrive at the following probability the restaurant, each customer n takes a previously distribution, P(Z|α), over finite feature ownership / sampled dish k with probability mk n and then representations. See Bernardo and Smith (1994) samples a number of new dishes sampled from a and Griffiths and Ghahramani (2011) for details. Poisson distribution with parameter α/n. Analogous to the method discussed earlier for Figure 9.3(a) illustrates an example of a potential constructing the CRP as the infinite limit of a finite state of the IBP after three customers have entered |α →∞ model, taking the limit of P(Z )asK the restaurant. The first customer entered the yields a valid probability distribution over feature restaurant and sampled two new dishes from the ownership matrices with an infinite number of Poisson probability distribution with parameter α. 5 →∞ potential features. Note that as K ,the Next, the second customer entered the restaurant prior on feature weights gets concentrated at zero and took each of the old dishes with probability 1/2 α/ → (because K 0). This results in an infinite and sampled one new dish from the Poisson proba- number of columns that simply contain zeroes, and bility distribution with parameter α/2. Then, the thus, these features will have no consequence for third customer entered the restaurant and took the set of stimuli we observed (as they are not the first dish with probability 2/3, did not take assigned to any stimuli). Because both the number the second dish with probability 1/3, and did →∞ of columns K and the probability of an not take the third dish with probability 2/3. = object taking a feature (probability that znk 1) The equivalent feature ownership matrix repre- π → k 0 at corresponding rates, there is a finite, sented by this culinary metaphor is shown in but random, number of columns have at least one Figure 9.3(b). 
nonzero element (the features that have been taken As previously encountered features are sampled by at least one stimulus). This limiting distribution with probability proportional to the number of is called the Indian buffet process (IBP; Griffiths and times they were previously taken and the proba- Ghahramani 2005, 2011), and it is given by the bility of new features decays as more customers following equation % & enter the restaurant (it is Poisson distributed with αK+ N parameter given by α/N where N is the number of |α =$ −α −1 P(Z ) n− exp n customers), the IBP favors feature representations 2 1 K h=1 h n=1 that are sparse and have a small number of features. K+ (N − m )(m − 1) Thus, it encodes a natural prior expectation toward × k k , feature representations with a few features, and can N k=1 be interpreted as a simplicity bias. where K+ is the number of columns with at least Now that we have derived the feature ownership one nonzero entry (the number of features taken by prior P(Z), we turn to defining the feature image
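The sequential dish-taking scheme just described is easy to simulate. Below is a minimal sketch of an IBP forward sampler; it is not from the chapter, and the function and variable names are our own illustrative choices.

```python
import numpy as np

def sample_ibp(n_customers, alpha, seed=0):
    """Draw a feature ownership matrix Z from the Indian buffet process.

    Customer n takes each previously sampled dish k with probability
    m_k / n (m_k = number of previous customers who took dish k), then
    samples Poisson(alpha / n) brand-new dishes.
    """
    rng = np.random.default_rng(seed)
    columns = []  # one list of 0/1 ownership indicators per dish
    for n in range(1, n_customers + 1):
        for col in columns:  # revisit every existing dish
            m_k = sum(col)
            col.append(int(rng.random() < m_k / n))
        for _ in range(rng.poisson(alpha / n)):  # brand-new dishes
            columns.append([0] * (n - 1) + [1])
    if not columns:
        return np.zeros((n_customers, 0), dtype=int)
    return np.array(columns, dtype=int).T  # shape (n_customers, K+)
```

Because new dishes arrive at rate α/n, the expected total number of sampled features after N customers is α ∑_{n=1}^N 1/n, which grows only logarithmically in N; this is the simplicity bias in action.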

structure and flexibility in bayesian models of cognition 197

Fig. 9.3 (a) The culinary representation of the Indian buffet process and (b) the feature ownership matrix implied by it.

Now that we have derived the feature ownership prior P(Z), we turn to defining the feature image prior P(Y) and the likelihood P(X | Y, Z), which will finish the specification of the nonparametric Bayesian model. Remember that the feature image matrix contains the consequences of a stimulus having a feature in terms of the raw primitives. Thus, in the nonparametric case, the feature image matrix can be thought of as containing the D-dimensional consequence of each of the K+ used features. The infinite unused features can be ignored because a feature only affects the representation of a stimulus if it is used by that stimulus. In most applications to date, the feature image prior is mostly "knowledge-less." For example, the standard prior used for stimuli that are binary images is an independent Bernoulli prior, where each pixel is on with probability φ independent of the other pixels in the image. One exception is one of the simulations by Austerweil and Griffiths (2011), who demonstrated that using a simple proximity bias (an Ising model that favors adjacent pixels sharing values; Geman and Geman, 1984) as the feature image prior results in more psychologically plausible features: The feature images inferred without the proximity bias were speckled rather than contiguous, whereas the feature images inferred with the proximity bias were contiguous. For grayscale and color images (Austerweil and Griffiths, 2011; Hu, Zhai, Williamson, & Boyd-Graber, 2012), the standard "knowledge-less" prior generates each pixel from a Gaussian distribution independent of the other pixels in the image.

Analogous to the feature image priors, the choice of the likelihood depends on the format of the raw dimensional primitives. In typical applications, the likelihood assumes that the reconstructed stimuli are given by the product of the feature ownership and image matrices, ZY, and penalizes the deviation between the reconstructed and observed stimuli (Austerweil and Griffiths, 2011). For binary images, the noisy-OR likelihood (Pearl, 1988; Wood, Griffiths, & Ghahramani, 2006) is used, which amounts to thinking of each feature as a "hidden cause" and has support as a psychological explanation for how people reason about observed effects being produced by multiple hidden causes (Cheng, 1997; Griffiths and Ghahramani, 2005). For grayscale images, the linear-Gaussian likelihood is typically used (Griffiths and Ghahramani, 2005; Austerweil and Griffiths, 2011), which is optimal under the assumption that the reconstructed stimuli are given by ZY and that the metric of success is the sum squared error between the reconstructed and observed stimuli. Recent work in machine learning has started to explore more complex likelihoods, such as explicitly accounting for depth and occlusion (Hu, Zhai, Williamson, & Boyd-Graber, 2012). Formalizing more psychologically valid feature image priors and likelihoods is a mostly unexplored area of research that demands attention.

After specifying the feature ownership and image priors and the likelihood, a feature representation can be inferred for a given set of observations using standard machine learning inference techniques, such as Gibbs sampling (Geman and Geman, 1984) or particle filtering (Gordon, Salmond, & Smith, 1993). We refer the reader to Austerweil and Griffiths (2013), who discuss Gibbs sampling and particle filtering for feature inference models and analyze their psychological plausibility.

What features should people use to represent the image in Figure 9.4(a)? When the image is in the context of the images in Figure 9.4(b), Austerweil and Griffiths (2011) found that people and the IBP model infer a single feature to represent it, namely the object itself, which is shown in Figure 9.4(d). Alternatively, when the image is in the context of the images in Figure 9.4(c), people and the IBP model infer a set of features to represent it, which are three of the six parts used to create the images, shown in Figure 9.4(e).⁶
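To make the binary-image case concrete, here is one common parameterization of the noisy-OR likelihood described above (following the general form used by Wood, Griffiths, & Ghahramani, 2006). The parameter names `lam` (cause strength) and `eps` (baseline rate) are our own illustrative choices, not the chapter's notation.

```python
import numpy as np

def noisy_or_loglik(X, Z, Y, lam=0.9, eps=0.01):
    """log P(X | Z, Y) treating each owned feature as a hidden cause.

    A pixel turns on if any of the features the stimulus owns "causes"
    it (each independently with probability lam) or via a baseline rate
    eps:  P(x_nd = 1 | Z, Y) = 1 - (1 - lam) ** (Z @ Y)[n, d] * (1 - eps)
    """
    X, Z, Y = (np.asarray(a) for a in (X, Z, Y))
    p_on = 1.0 - (1.0 - lam) ** (Z @ Y) * (1.0 - eps)
    return float(np.sum(np.where(X == 1, np.log(p_on), np.log1p(-p_on))))
```

A Gibbs sampler of the kind mentioned above would flip individual entries of Z and score each candidate with this log-likelihood plus the IBP prior term.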


Fig. 9.4 Effects of context on the features inferred by people to represent an image (from Experiment 1 of Austerweil & Griffiths, 2011). What features should be used to represent the image shown in (a)? When (a) is in the context of the images shown in (b), participants and the nonparametric Bayesian model represent it with a single feature, the image itself, which is shown in (d). Conversely, when (a) is in the context of the images shown in (c), it is represented as a conjunction of three features, shown in (e). Participants and the nonparametric Bayesian model generalized differently due to using different representations.

Importantly, Austerweil and Griffiths (2011) demonstrated that two of the most popular machine-learning techniques for inferring features from a set of images, principal component analysis (Hyvärinen, Karhunen, & Oja, 2001) and independent component analysis (Hyvärinen et al., 2001), did not explain their experimental results. This suggests that the simplicity bias given by the IBP is similar to the prior expectations people use to infer feature representations, although future work is necessary to more precisely test and formally characterize the biases people use to infer feature representations. Furthermore, this contextual effect was replicated in grayscale 3-D rendered images and in a conceptual domain, suggesting that it may be a domain-general capability.

Unlike the IBP and most other computational models of feature learning, people tend to use features that are invariant over transformations (e.g., translations or dilations), because properties of features observed from the environment do not occur identically across appearances (e.g., the retinal image after eye or head movements). By modifying the culinary metaphor such that each time a customer takes a dish she draws a random spice from a spice rack, the transformed IBP (Austerweil and Griffiths, 2010b) can infer features invariant over a set of transformations. Austerweil and Griffiths (2013) explain a number of previous experimental results and outline novel contextual effects predicted by the extended model that they subsequently confirm through behavioral experiments. Other extensions to this same framework have been used to explain multimodal representation learning (Yildirim and Jacobs, 2012), and in one extension, the IBP and CRP are used together to infer features diagnostic for categorization (Austerweil and Griffiths, 2013).

Choice Behavior

According to feature-based choice models, people choose between different options depending on their preference for the features (called aspects) that each option has. One influential feature-based choice model is the elimination by aspects model (Tversky, 1972). Under this model, the preference for a feature is assumed to be independent of the other features and defined by a single, positive number, called a weight. Choices are made by repeatedly selecting a random feature, weighted by preference, and removing all options that do not contain the feature until only one option remains. For example, when a person is choosing which television to purchase, she is confronted with a large array of possibilities that vary on many features, such as IS LCD, IS PLASMA, IS ENERGY EFFICIENT, IS HIGH DEFINITION, and so on, each with its own associated weight. Because the same person may make different choices even when confronted with the same set of options, the probability of choosing option i over option j, p_ij, is proportional to the weights of the features that option i has but option j does not have, relative to the features that option j has but option i does not have. Formally, this is given by

p_ij = ∑_k w_k z_ik (1 − z_jk) / [∑_k w_k z_ik (1 − z_jk) + ∑_k w_k (1 − z_ik) z_jk],

where z_ik = 1 denotes that option i has feature k and w_k is the preference weight given to feature k.
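For the binary-choice case, the expression above is straightforward to compute. Here is a small sketch (the function name is our own illustrative choice):

```python
import numpy as np

def eba_prob(z_i, z_j, w):
    """P(choose i over j) under elimination by aspects (two options).

    Shared features are eliminated from consideration: p_ij weighs the
    aspects unique to option i against the aspects unique to option j.
    """
    z_i, z_j, w = (np.asarray(a, dtype=float) for a in (z_i, z_j, w))
    u_i = np.sum(w * z_i * (1 - z_j))  # weights of features only i has
    u_j = np.sum(w * (1 - z_i) * z_j)  # weights of features only j has
    return u_i / (u_i + u_j)
```

Note that any feature both televisions share (e.g., both IS HIGH DEFINITION) cancels out of the choice probability entirely, a defining property of the model.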

structure and flexibility in bayesian models of cognition 199

Although modeling human choice behavior using the elimination by aspects model is straightforward when the features of each option and their preference weights are known, it is not straightforward to infer what features a person uses to represent a set of options, and their weights, given only the person's choice behavior. To address this issue, Görür et al. (2006) proposed the IBP as a prior distribution over possible feature representations and a Gamma distribution as a prior over the feature weights, which is a flexible distribution that assigns probability only to positive numbers. They demonstrated that human choices of which celebrities participants from the early 1970s would prefer to chat with (Rumelhart and Greeno, 1971) are described just as well by the elimination by aspects model when the features are inferred with the IBP as the prior over possible feature representations as when the features are defined by a modeler. Because participants are familiar with the celebrities, and the celebrities are related to each other according to a hierarchy (i.e., politicians, actors, and athletes), Miller et al. (2008) extended the IBP such that it infers features in a manner that respects the given hierarchy (i.e., options are more likely to have the same features to the degree that they are close to each other in the hierarchy). They demonstrated that the extended IBP explains human choice judgments better and uses fewer features to represent the options. Although the IBP-based extensions help the elimination by aspects model overcome some of its issues, the extended models are unable to account for the full complexity of human choice behavior (e.g., attraction effects; Huber, Payne, & Puto, 1982). Regardless, exploring choice models that include a feature-inference process is a promising direction for future research, because such models can potentially be incorporated with more psychologically valid choice models (e.g., sequential sampling models of preferential choice; Roe, Busemeyer, & Townsend, 2001).

Learning Functions: How Are Continuous Quantities Related?

So far, we have focused on cases in which the latent structure to be inferred is discrete: either a category or a set of features. However, latent structures can also be continuous. One of the most prominent examples is function learning, in which a relationship is learned between two (or more) continuous variables. This is a problem that people often solve without even thinking about it, as when learning how hard to press the pedal to yield a certain amount of acceleration when driving a rental car. Nonparametric Bayesian methods also provide a solution to this problem that can learn complex functions in a manner that favors simple solutions.

Viewed abstractly, the computational problem behind function learning is to learn a function y = f(x) from a set of real-valued observations x_N = (x_1, ..., x_N) and t_N = (t_1, ..., t_N), where t_n is assumed to be the true value y_n obscured by some kind of additive noise (i.e., t_n = y_n + ε_n, where ε_n is some type of noise). In machine learning and statistics, this is referred to as a regression problem. In this section, we discuss how this problem can be solved using Bayesian statistics, and how the result of this approach is related to a class of nonparametric Bayesian models known as Gaussian processes. Our presentation follows that in Williams (1998).

Bayesian linear regression

Ideally, we would seek to solve our regression problem by combining some prior beliefs about the probability of encountering different kinds of functions in the world with the information provided by x_N and t_N. We can do this by applying Bayes's rule, with

p(f | x_N, t_N) = p(t_N | f, x_N) p(f) / ∫_F p(t_N | f, x_N) p(f) df,    (1)

where p(f) is the prior distribution over functions in the hypothesis space F, p(t_N | f, x_N) is the likelihood of observing the values of t_N if f were the true function, and p(f | x_N, t_N) is the posterior distribution over functions given the observations x_N and t_N. In many cases, the likelihood is defined by assuming that the values of t_n are independent given f and x_n, and that each follows a Gaussian distribution with mean y_n = f(x_n) and variance σ_t². Predictions about the value of the function f for a new input x_{N+1} can be made by integrating over this posterior distribution.

Performing the calculations outlined in the previous paragraph for a general hypothesis space F is challenging, but it becomes straightforward if we limit the hypothesis space to certain specific classes of functions. If we take F to be all linear functions of the form y = b_0 + x b_1, then we need to define a prior p(f) over all linear functions. As these functions can be expressed in terms of the parameters b_0 and b_1, it is sufficient to define a prior over the vector b = (b_0, b_1), which we can do by assuming that b follows a multivariate Gaussian distribution with mean

zero and covariance matrix Σ_b. Applying Eq. 1, then, results in a multivariate Gaussian posterior distribution on b (see Rasmussen and Williams, 2006, for details) with

E[b | x_N, t_N] = (σ_t² Σ_b⁻¹ + X_N^T X_N)⁻¹ X_N^T t_N

cov[b | x_N, t_N] = (Σ_b⁻¹ + (1/σ_t²) X_N^T X_N)⁻¹,

where X_N = [1_N x_N] (i.e., a matrix with a vector of ones horizontally concatenated with x_N). Because y_{N+1} is simply a linear function of b, the predictive distribution is Gaussian, with y_{N+1} having mean [1 x_{N+1}] E[b | x_N, t_N] and variance [1 x_{N+1}] cov[b | x_N, t_N] [1 x_{N+1}]^T. The predictive distribution for t_{N+1} is similar, but with the addition of σ_t² to the variance.

Basis Functions and Similarity Kernels

Although considering only linear functions might seem overly restrictive, linear regression actually gives us the basic tools we need to solve this problem for more general classes of functions. Many classes of functions can be described as linear combinations of a small set of basis functions. For example, all kth degree polynomials are linear combinations of functions of the form 1 (the constant function), x, x², ..., x^k. Letting φ^(1), ..., φ^(k) denote a set of basis functions, we can define a prior on the class of functions that are linear combinations of this basis by expressing such functions in the form f(x) = b_0 + φ^(1)(x) b_1 + ··· + φ^(k)(x) b_k and defining a prior on the vector of weights b. If we take the prior to be Gaussian, we reach the same solution as outlined in the previous paragraph, substituting Φ = [1_N φ^(1)(x_N) ... φ^(k)(x_N)] for X_N and [1 φ^(1)(x_{N+1}) ... φ^(k)(x_{N+1})] for [1 x_{N+1}], where φ(x_N) = [φ(x_1) ... φ(x_N)]^T.

If our goal were merely to predict y_{N+1} from x_{N+1}, y_N, and x_N, we might consider a different approach, by simply defining a joint distribution on y_{N+1} given x_{N+1} and conditioning on y_N. For example, we might take y_N and y_{N+1} to be jointly Gaussian, with covariance matrix

K_{N+1} = [ K_N          k_{N,N+1}
            k_{N,N+1}^T  k_{N+1}   ],

where K_N depends on the values of x_N, k_{N,N+1} depends on x_N and x_{N+1}, and k_{N+1} depends only on x_{N+1}. If we condition on y_N, the distribution of y_{N+1} is Gaussian with mean k_{N,N+1}^T K_N⁻¹ y_N and variance k_{N+1} − k_{N,N+1}^T K_N⁻¹ k_{N,N+1}. This approach to prediction is often referred to as using a Gaussian process, since it assumes a stochastic process that induces a Gaussian distribution on y based on the values of x. This approach can also be extended to allow us to predict y_{N+1} from x_{N+1}, t_N, and x_N by adding σ_t² I_N to K_N, where I_N is the N × N identity matrix, to take into account the additional variance associated with the observations t_N.

The covariance matrix K_{N+1} is specified using a function whose argument is a pair of stimuli, known as a kernel, with K_ij = K(x_i, x_j). Any kernel that results in an appropriate (symmetric, positive-definite) covariance matrix for all x can be used. One common kernel is the radial basis function, with

K(x_i, x_j) = θ_1² exp(−(x_i − x_j)²/θ_2²),    (2)

indicating that values of y for which values of x are close are likely to be highly correlated. See Schölkopf and Smola (2002) for further details regarding kernels. Gaussian processes thus provide an extremely flexible approach to regression, with the kernel being used to define which values of x are likely to have similar values of y. Some examples are shown in Figure 9.5.

Just as we can express a covariance matrix in terms of its eigenvectors and eigenvalues, we can express a given kernel K(x_i, x_j) in terms of its eigenfunctions φ and eigenvalues λ, with

K(x_i, x_j) = ∑_{k=1}^{∞} λ_k φ^(k)(x_i) φ^(k)(x_j)

for any x_i and x_j. Using the results from the previous paragraph, any kernel can be viewed as the result of performing Bayesian linear regression with a set of basis functions corresponding to its eigenfunctions, and a prior with covariance matrix Σ_b = diag(λ). These equivalence results establish an important duality between Bayesian linear regression and Gaussian processes: For every prior on functions, there exists a kernel that defines the similarity between values of x, and for every kernel, there exists a corresponding prior on functions that yields the same predictions. This result is a consequence of Mercer's theorem (Mercer, 1909). Thus, Bayesian linear regression and prediction with Gaussian processes are just two views of the same class of solutions to regression problems.
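The conjugate updates for the linear case can be sketched in a few lines. This is a toy implementation under the Gaussian prior and noise assumptions of Eq. 1; the function and argument names are our own.

```python
import numpy as np

def blr_posterior(x, t, sigma_t, Sigma_b):
    """Posterior over b = (b0, b1) for y = b0 + x*b1, with Gaussian
    prior b ~ N(0, Sigma_b) and observation noise t_n ~ N(y_n, sigma_t**2)."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    X = np.column_stack([np.ones_like(x), x])        # X_N = [1_N  x_N]
    A = sigma_t**2 * np.linalg.inv(Sigma_b) + X.T @ X
    mean = np.linalg.solve(A, X.T @ t)               # E[b | x_N, t_N]
    cov = sigma_t**2 * np.linalg.inv(A)              # cov[b | x_N, t_N]
    return mean, cov

def blr_predict(x_new, mean, cov, sigma_t):
    """Predictive mean and variance for t at a new input x_new."""
    phi = np.array([1.0, x_new])                     # [1  x_{N+1}]
    return phi @ mean, phi @ cov @ phi + sigma_t**2
```

With small observation noise and data generated from an exact line, the posterior mean recovers the intercept and slope, and the predictive variance grows for inputs far from the training data.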

[Figure 9.5, panels A and B: Output (y) plotted against Input (x).]

Fig. 9.5 Modeling functions with Gaussian processes. The data points (crosses) are the same in both panels. (A) Inferred posterior using a radial basis covariance function (Eq. 2) with θ_2 = 1/4. (B) Same as panel A, but with θ_1 = 1 and θ_2 = 1/8. Notice that as θ_2 gets smaller, the posterior mean more closely fits the observed data points, and the posterior variance is larger in regions far from the data.
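Posteriors of the kind shown in Figure 9.5 can be reproduced in sketch form with a few lines of code. This toy implementation uses the radial basis kernel of Eq. 2 and the noisy conditioning formulas from the text; the hyperparameter values and names are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, theta1=1.0, theta2=0.5):
    """K(x_i, x_j) = theta1**2 * exp(-(x_i - x_j)**2 / theta2**2)  (Eq. 2)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return theta1**2 * np.exp(-np.subtract.outer(a, b)**2 / theta2**2)

def gp_predict(x_train, t_train, x_test, sigma_t=0.1, theta1=1.0, theta2=0.5):
    """Posterior mean and variance of y at x_test given noisy t_train.

    Adds sigma_t**2 * I to K_N for observation noise, then conditions:
    mean = k^T (K_N + sigma_t^2 I)^-1 t_N, with the matching variance.
    """
    K = rbf_kernel(x_train, x_train, theta1, theta2)
    K += sigma_t**2 * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_test, theta1, theta2)     # shape (N, M)
    mean = k_star.T @ np.linalg.solve(K, np.asarray(t_train, float))
    var = theta1**2 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var
```

Shrinking theta2, as in panel (B) of Figure 9.5, makes each training point influence only nearby inputs, so the posterior mean hugs the data while the variance grows quickly away from it; far from all data the prediction reverts to the prior (mean zero, variance θ_1²).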

Modeling Human Function Learning

The duality between Bayesian linear regression and Gaussian processes provides a novel perspective on human function learning. Previously, theories of function learning had focused on the roles of different psychological mechanisms. One class of theories (e.g., Carroll, 1963; Brehmer, 1974; Koh and Meyer, 1991) suggests that people are learning an explicit function from a given class, such as the polynomials of degree D. This approach attributes rich representations to human learners, but has traditionally given limited treatment to the question of how such representations could be acquired. A second approach (e.g., DeLosh, Busemeyer, & McDaniel, 1997) emphasizes the possibility that people could simply be forming associations between similar values of variables. This approach has a clear account of the underlying learning mechanisms, but it faces challenges in explaining how people generalize beyond their experience. More recently, hybrids of these two approaches have been proposed (e.g., Kalish, Lewandowsky, & Kruschke, 2004; McDaniel and Busemeyer, 2005). For example, the population of linear experts model (POLE; Kalish et al., 2004) uses associative learning to learn a set of linear functions and their expertise over regions of dimensional space.

Bayesian linear regression resembles explicit rule learning, estimating the parameters of a function, whereas the idea of making predictions based on the similarity between predictors (as defined by a kernel) that underlies Gaussian processes is more in line with associative accounts. The fact that, at the computational level, these two ways of viewing regression are equivalent suggests that these competing mechanistic accounts may not be as far apart as they once seemed. Just as viewing category learning as density estimation helps us understand that prototype and exemplar models correspond to different types of solutions of the same statistical problem, viewing function learning as regression reveals the shared assumptions behind rule learning and associative learning.

Gaussian process models also provide a good account of human performance in function learning tasks. Griffiths et al. (2009) compared a Gaussian process model with a mixture of kernels (linear, quadratic, and radial basis) to human performance. Figure 9.6 shows mean human predictions when trained on a linear, exponential, and quadratic function (from DeLosh et al., 1997), together with


the predictions of the Gaussian process model.

Fig. 9.6 Extrapolation performance in function learning. Mean predictions on linear, exponential, and quadratic functions for (a) human participants (from DeLosh et al., 1997) and (b) a Gaussian process model with linear, quadratic, and nonlinear kernels. Training data were presented in the region between the vertical lines, and extrapolation performance was evaluated outside this region. Figure reproduced from Griffiths et al. (2009).

The regions to the left and right of the vertical lines represent extrapolation regions, being input values for which neither people nor the model were trained. Both people and the model extrapolate near optimally on the linear function, and reasonably accurate extrapolation also occurs for the exponential and quadratic functions. However, there is a bias toward a linear slope in the extrapolation of the exponential and quadratic functions, with extreme values of the quadratic and exponential functions being overestimated.

Conclusions

Probabilistic models form a promising framework for explaining the impressive success that people have in solving different inductive problems. As part of performing these feats, the mind constructs structured representations that flexibly adapt to the current set of stimuli and context. In this chapter, we reviewed how these problems can be described from a statistical viewpoint. To define models that infer representations that are both flexible and structured, we described three main classes of nonparametric Bayesian models and how the format of the observed stimuli determines which of the three classes of models should be used.

Our presentation of these three classes of models is only the beginning of an ever-growing literature using nonparametric Bayesian processes in cognitive science. Each class of model can be thought of as providing primitive units, which can be composed in various ways to form richer models. For example, the IBP can be interpreted as providing primitive units that can be composed together using logical operators to define a categorization model, which learns its own features along with a propositional rule to define a category. Figure 9.7 presents some of these applications and Box 2 provides a detailed discussion of them.

Concluding remarks

1. Although it is possible to define probabilistic models that can infer any desired structure, due to the bias-variance trade-off, prior expectations over a rich class of structures are needed to capture the structures inferred by people when given a limited number of observations.

2. Nonparametric Bayesian models are probabilistic models over arbitrarily complex structures that are biased toward simpler structures.

Representational primitive | Key Psychological Application | Nonparametric Bayesian Process | Key References
Partitions | Category assignment | Chinese restaurant process (CRP) | Aldous (1985)
Probability distributions over competing discrete units | Category learning/Density estimation | Dirichlet process (DP) | Ferguson (1973)
Probability distributions over independent discrete units | Feature assignment | Indian buffet process (IBP)/Beta process (BP)¹ | Griffiths and Ghahramani (2005); Thibaux and Jordan (2007)
Distributions over continuous units | Function learning | Gaussian process (GP) | Rasmussen and Williams (2006)
Composites | Hierarchical category learning | Nested CRP | Griffiths et al. (2008b); Blei et al. (2010)
Composites | Jointly learning categories and features | CRP + IBP, nested CRP + hierarchical DP topic model | Austerweil and Griffiths (2013); Salakhutdinov et al. (2012)
Composites | Cross-cutting category learning | CRP over entities embedded inside CRP over attributes | Shafto et al. (2011)
Composites | Relational category learning | Product of CRPs over multiple entity types | Kemp et al. (2006, 2010)
Composites | Property induction | GP over latent graph structure | Kemp and Tenenbaum (2009)
Composites | Domain structure learning | GP/CRP/IBP + grammar over model forms | Kemp and Tenenbaum (2008); Grosse et al. (2012)

Fig. 9.7 Different assumptions about the type of structure generating the observations from the environment result in different types of nonparametric Bayesian models. The most basic nonparametric models define distributions over core representational primitives, while more advanced models can be constructed by composing these primitives with each other and with other probabilistic modeling and knowledge representation elements (see Box 2). Typically, researchers in cognitive science do not distinguish between the CRP and DP, or the IBP and BP. However, they are all distinct mathematical objects: the CRP and IBP are distributions over the assignment of stimuli to units, whereas the DP and BP are distributions over both the assignment of stimuli to units and the parameters associated with those units. The probability distribution given by considering only the number of stimuli assigned to each unit by a DP and BP yields a distribution over assignments equivalent to the CRP and IBP, respectively.

Thus, they form a middle ground between the two extremes of models that infer overly simple structures (parametric models) and models that infer overly complex structures (nonparametric models).

3. Using different nonparametric models results in different assumptions about the format of the hidden structure. When each stimulus is assigned to a single latent unit, multiple latent units, or continuous units, the Dirichlet process, Beta process, and Gaussian process are appropriate, respectively. These processes are compositional in that they can be combined with each other and other models to infer complex latent structures, such as relations and hierarchies.

Some Future Questions

1. How similar are the inductive biases defined by nonparametric Bayesian models to those people use when inferring structured representations?

2. What are the limits on the complexity of representations that people can learn? Are nonparametric Bayesian models too powerful?

3. How do nonparametric Bayesian models compare to other computational frameworks that adapt their structure with experience, such as neural networks?

Notes

1. Note that we define parametric, nonparametric, and other statistical terms from the Bayesian perspective. We refer the reader interested in the definition of these terms from the frequentist perspective, and a comparison of frequentist and Bayesian perspectives, to Young and Smith (2010).

2. Our definition of "density estimation" includes estimating any probability distribution over a discrete or continuous space, which is slightly broader than its standard use in statistics, estimating any probability distribution over a continuous space.

3. Our formulation departs from Anderson's by adopting the notation typically used in the statistics literature. However, the two formulations are equivalent.

4. We have written the posterior probability as proportional to the product of three terms because the normalizing constant (the denominator) for this example is intractable to compute when there is an infinite repository of features.

5. Technically, ensuring that the infinite limit of P(Z|α) is valid requires defining all feature ownership matrices that differ only in the order of the columns to be equivalent. This is due to identifiability issues and is analogous to the arbitrariness of the cluster (or table) labels in the CRP.

6. Austerweil and Griffiths (2011) tested whether people represent the objects with the parts as features by seeing whether they were willing to generalize a property of the set of objects (being found in a cave on Mars) to a novel combination of three of the six parts used to create the images. See Austerweil and Griffiths (2011) and Austerweil and Griffiths (2013) for a discussion of this methodology and the theoretical implications of these results.

7. Although we collapsed the distinction between the IBP and BP, they are distinct nonparametric Bayesian processes. See the caption of Figure 9.7 and the glossary for more details.

204 higher level cognition

Glossary
Beta process: a stochastic process that assigns a real number between 0 and 1 to a countable set of units, which makes it a natural prior for latent feature models (interpreting the number as a probability)
bias: the error between the true structure and the average structure inferred based on observations from the environment
bias-variance trade-off: to reduce the error in generalizing a structure to new observations, an agent has to reduce both its bias and variance
Chinese restaurant process: a culinary metaphor that defines a probability distribution over partitions, which yields an equivalent distribution on partitions as the one implied by a Dirichlet process when only the number of stimuli assigned to each block is considered
computational level: interpreting the behavior of a system as the solution to an abstract computational problem posed by the environment
consistency: given enough observations, the statistical model infers the true structure producing the observations
Dirichlet process: a stochastic process that assigns a set of non-negative real numbers that sum to 1 to a countable set of units, which makes it a natural prior for latent class models (interpreting the assigned number as the probability of that unit)
exchangeability: a sequence of random variables is exchangeable if and only if their joint probability is invariant to reordering (does not change)
Gaussian process: a stochastic process that defines a joint distribution on a set of variables that is Gaussian, which makes it a natural prior for function learning models
importance sampling: approximating a distribution by sampling from a surrogate distribution and then reweighting the samples to compensate for the fact that they came from the surrogate rather than the desired distribution
Indian buffet process: a culinary metaphor for describing the probability distribution over the possible assignments of observations to multiple discrete units, which yields an equivalent distribution on discrete units as the one implied by a Beta process when only the number of stimuli assigned to each unit is considered
inductive inference: a method for solving a problem that has more than one logically possible solution
likelihood: the probability of some observed data given that a particular structure or hypothesis is true
Markov chain Monte Carlo: approximating a distribution by setting up a Markov chain whose limiting distribution is that distribution
Monte Carlo: using random number sampling to solve numerical problems
nonparametric: a model whose possible densities belong to a family that includes arbitrary distributions
parametric: a model that assumes the possible densities belong to a family that is parameterized by a fixed number of variables
particle filtering: a sequentially adapting importance sampler where the surrogate distribution is based on the approximated posterior at the previous time step
partition: division of a set into nonoverlapping subsets
posterior probability: an agent's belief in a structure or hypothesis after some observations
prior probability: an agent's belief in a structure or hypothesis before any observations
rational analysis: interpreting the behavior of a system as the ideal solution to an abstract computational problem posed by the environment, usually with respect to some assumptions about the system's environment
rational process models: process models that are statistical approximations to the ideal solution given by probability theory
variance: the degree to which the inferred structure changes across different possible observations from the environment

References
Aldous, D. (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII (pp. 1–198). Berlin: Springer.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.
Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–1174.
Ashby, F. G. & Alfonso-Reese, L. A. (1995). Categorization as probability density estimation. Journal of Mathematical Psychology, 39, 216–233.
Austerweil, J. L. & Griffiths, T. L. (2010a). Learning hypothesis spaces and dimensions through concept learning. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 73–78). Austin, TX: Cognitive Science Society.
Austerweil, J. L. & Griffiths, T. L. (2010b). Learning invariant features using the transformed Indian buffet process. In R. Zemel & J. Shawe-Taylor (Eds.), Advances in Neural Information Processing Systems (Vol. 23, pp. 82–90). Cambridge, MA: MIT Press.
Austerweil, J. L. & Griffiths, T. L. (2011). A rational model of the effects of distributional information on feature learning. Cognitive Psychology, 63, 173–209.
Austerweil, J. L. & Griffiths, T. L. (2013). A nonparametric Bayesian framework for constructing flexible feature representations. Psychological Review, 120, 817–851.
Bernardo, J. M. & Smith, A. F. M. (1994). Bayesian theory. New York, NY: Wiley.
Bickel, P. J. & Doksum, K. A. (2007). Mathematical statistics: Basic ideas and selected topics. Upper Saddle River, NJ: Pearson.
Blackwell, D. & MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1, 353–355.
Blei, D. M., Griffiths, T. L., & Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57, 1–30.
Bouton, M. (2004). Context and behavioral processes in extinction. Learning & Memory, 11(5), 485–494.
Bowers, J. S. & Davis, C. J. (2012). Bayesian just-so stories in psychology and neuroscience. Psychological Bulletin, 138(3), 389–414.
Braddick, O. (1993). Segmentation versus integration in visual motion processing. Trends in Neurosciences, 16(7), 263–268.
Brehmer, B. (1971). Subjects' ability to use functional rules. Psychonomic Science, 24, 259–260.
Brehmer, B. (1974). Hypotheses about relations between scaled variables in the learning of probabilistic inference tasks. Organizational Behavior and Human Decision Processes, 11, 1–27.
Brogden, W. (1939). Sensory pre-conditioning. Journal of Experimental Psychology, 25(4), 323–332.
Carroll, J. D. (1963). Functional learning: The learning of continuous functional mappings relating stimulus and response continua. Princeton, NJ: Educational Testing Service.
Chater, N., Goodman, N., Griffiths, T. L., Kemp, C., Oaksford, M., & Tenenbaum, J. B. (2011). The imaginary fundamentalists: The unshocking truth about Bayesian cognitive science. Behavioral and Brain Sciences, 34(4), 194–196.
Chater, N. & Oaksford, M. (2008). The probabilistic mind: Prospects for Bayesian cognitive science. New York, NY: Oxford University Press.
Cheng, P. (1997). From covariation to causation: A causal power theory. Psychological Review, 104, 367–405.
Clapper, J. & Bower, G. (1994). Category invention in unsupervised learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(2), 443–460.
Courville, A., Daw, N., & Touretzky, D. (2006). Bayesian theories of conditioning in a changing world. Trends in Cognitive Sciences, 10(7), 294–300.
Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience, 3, 1218–1223.
DeLosh, E. L., Busemeyer, J. R., & McDaniel, M. A. (1997). Extrapolation: The sine qua non for abstraction in function learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4), 968–986.
Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.
Freedman, D. & Diaconis, P. (1983). On inconsistent Bayes estimates in the discrete case. The Annals of Statistics, 11(4), 1109–1118.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias-variance dilemma. Neural Computation, 4, 1–58.
Geman, S. & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gershman, S., Blei, D., & Niv, Y. (2010). Context, learning, and extinction. Psychological Review, 117(1), 197–209.
Gershman, S. & Niv, Y. (2012). Exploring a latent cause theory of classical conditioning. Learning & Behavior, 40(3), 255–268.
Gershman, S. J. & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1–12.
Ghosal, S. (2010). The Dirichlet process, related priors, and posterior asymptotics. In N. L. Hjort, C. Holmes, P. Müller, & S. G. Walker (Eds.), Bayesian nonparametrics (pp. 35–79). Cambridge, UK: Cambridge University Press.
Goldmeier, E. (1972). Similarity in visually perceived forms. Psychological Issues, 8, 1–136. (Original work written in German and published in 1936.)
Goldwater, S., Griffiths, T. L., & Johnson, M. (2009). A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112, 21–54.
Goodman, N. (1972). Seven strictures on similarity. In N. Goodman (Ed.), Problems and projects. New York, NY: Bobbs-Merrill.
Gordon, N., Salmond, J., & Smith, A. (1993). A novel approach to non-linear/non-Gaussian Bayesian state estimation. IEE Proceedings F: Radar and Signal Processing, 140, 107–113.
Görür, D., Jäkel, F., & Rasmussen, C. E. (2006). A choice model with infinitely many latent features. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006) (pp. 361–368). New York, NY: ACM Press.
Griffiths, T., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. (2010a). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14, 357–364.
Griffiths, T., Steyvers, M., & Tenenbaum, J. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244.
Griffiths, T. L. (2010). Bayesian models as tools for exploring inductive biases. In M. Banich & D. Caccamise (Eds.), Generalization of knowledge: Multidisciplinary perspectives. New York, NY: Psychology Press.
Griffiths, T. L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010b). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8), 357–364.
Griffiths, T. L., Chater, N., Norris, D., & Pouget, A. (2012). How the Bayesians got their beliefs (and what those beliefs actually are): Comment on Bowers and Davis (2012). Psychological Bulletin, 138(3), 415–422.
Griffiths, T. L. & Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process (Technical Report 2005-001). Gatsby Computational Neuroscience Unit.
Griffiths, T. L. & Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.
Griffiths, T. L., Kemp, C., & Tenenbaum, J. B. (2008a). Bayesian models of cognition. In R. Sun (Ed.), Cambridge handbook of computational cognitive modeling. Cambridge, England: Cambridge University Press.
Griffiths, T. L., Lucas, C., Williams, J. J., & Kalish, M. L. (2009). Modeling human function learning with Gaussian processes. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 21 (pp. 556–563). Red Hook, NY: Curran.
Griffiths, T. L., Sanborn, A. N., Canini, K. R., & Navarro, D. J. (2008b). Categorization as nonparametric Bayesian density estimation. In N. Chater & M. Oaksford (Eds.), The probabilistic mind. Oxford, England: Oxford University Press.
Griffiths, T. L. & Yuille, A. (2006). A primer on probabilistic inference. Trends in Cognitive Sciences, 10(7), 1–11.
Grosse, R. B., Salakhutdinov, R., Freeman, W. T., & Tenenbaum, J. B. (2012). Exploiting compositionality to explore a large space of model structures. In N. de Freitas & K. Murphy (Eds.), Conference on Uncertainty in Artificial Intelligence (pp. 306–315). Corvallis, OR: AUAI Press.
Hjort, N. L. (1990). Nonparametric Bayes estimators based on Beta processes in models for life history data. The Annals of Statistics, 18, 1259–1294.
Hu, Y., Zhai, K., Williamson, S., & Boyd-Graber, J. (2012). Modeling images using transformed Indian buffet processes. International Conference on Machine Learning, Edinburgh, UK.
Huber, J., Payne, J. W., & Puto, C. (1982). Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, 9(1), 90–98.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York, NY: Wiley.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge, England: Cambridge University Press.
Jolliffe, I. T. (1986). Principal component analysis. New York, NY: Springer.
Jones, M. & Love, B. C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain Sciences, 34, 169–231.
Kahneman, D., Slovic, P., & Tversky, A. (Eds.) (1982). Judgment under uncertainty: Heuristics and biases. Cambridge, England: Cambridge University Press.
Kalish, M. L., Lewandowsky, S., & Kruschke, J. K. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111(4), 1072–1099.
Kaplan, A. & Murphy, G. (1999). The acquisition of category structure in unsupervised learning. Memory & Cognition, 27(4), 699–712.
Kemp, C., Perfors, A., & Tenenbaum, J. B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3), 307–321.
Kemp, C., Tenenbaum, J., Niyogi, S., & Griffiths, T. (2010). A probabilistic model of theory formation. Cognition, 114(2), 165–196.
Kemp, C. & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692.
Kemp, C. & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1), 20–58.
Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., & Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Y. Gil & R. J. Mooney (Eds.), Proceedings of the 21st National Conference on Artificial Intelligence (pp. 381–388). Menlo Park, CA: AAAI Press.
Koh, K. & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus–response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5), 811–836.
Kruschke, J. (2008). Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3), 210–226.
Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111, 309–332.
Marr, D. (1982). Vision. San Francisco, CA: W. H. Freeman.
McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T. T., Seidenberg, M. S., & Smith, L. B. (2010). Letting structure emerge: Connectionist and dynamical systems approaches to cognition. Trends in Cognitive Sciences, 14(8), 348–356.
McDaniel, M. A. & Busemeyer, J. R. (2005). The conceptual basis of function learning and extrapolation: Comparison of rule-based and associative-based models. Psychonomic Bulletin & Review, 12(1), 24–42.
McKinley, S. C. & Nosofsky, R. M. (1995). Investigations of exemplar and decision bound models in large, ill-defined category structures. Journal of Experimental Psychology: Human Perception and Performance, 21(1), 128–148.
Medin, D. L. & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207–238.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society A, 209, 415–446.
Miller, K. T., Griffiths, T. L., & Jordan, M. I. (2008). The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features. In D. McAllester & P. Myllymäki (Eds.), Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (pp. 403–410). Corvallis, OR: AUAI Press.
Murphy, G. L. & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289–316.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.
Oaksford, M. & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. New York, NY: Oxford University Press.
Palmer, S. E. (1999). Vision science. Cambridge, MA: MIT Press.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco, CA: Morgan Kaufmann.
Posner, M. I. & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77(3, Pt. 1), 353.
Pothos, E. & Chater, N. (2002). A simplicity principle in unsupervised human categorization. Cognitive Science, 26(3), 303–343.
Rasmussen, C. E. & Williams, C. K. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 393–407.
Rescorla, R. A. & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. Black & W. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York, NY: Appleton-Century-Crofts.
Robert, C. P. (1994). The Bayesian choice: A decision-theoretic motivation. New York, NY: Springer.
Roe, R. M., Busemeyer, J. R., & Townsend, J. T. (2001). Multialternative decision field theory: A dynamic connectionist model of decision making. Psychological Review, 108(2), 370–392.
Rumelhart, D. & Greeno, J. (1971). Similarity between stimuli: An experimental test of the Luce and Restle choice models. Journal of Mathematical Psychology, 8, 370–381.
Salakhutdinov, R., Tenenbaum, J. B., & Torralba, A. (2012). Learning to learn with compound hierarchical-deep models. In Advances in Neural Information Processing Systems.
Sanborn, A. N., Griffiths, T. L., & Navarro, D. J. (2010). Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, 117(4), 1144–1167.
Schölkopf, B. & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Selfridge, O. G. & Neisser, U. (1960). Pattern recognition by machine. Scientific American, 203, 60–68.
Shafto, P., Kemp, C., Mansinghka, V., & Tenenbaum, J. B. (2011). A probabilistic model of cross-categorization. Cognition, 120(1), 1–25.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10, 309–318.
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331, 1279–1285.
Thibaux, R. & Jordan, M. I. (2007). Hierarchical Beta processes and the Indian buffet process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007) (pp. 564–571).
Tversky, A. (1972). Elimination by aspects: A theory of choice. Psychological Review, 79, 281–299.
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.
Werker, J. & Yeung, H. (2005). Infant speech perception bootstraps word learning. Trends in Cognitive Sciences, 9(11), 519–527.
Williams, C. K. (1998). Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. Jordan (Ed.), Learning in graphical models (pp. 599–621). Cambridge, MA: MIT Press.
Wood, F., Griffiths, T. L., & Ghahramani, Z. (2006). A nonparametric Bayesian method for inferring hidden causes. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI '06) (pp. 536–543).
Yildirim, I. & Jacobs, R. A. (2012). A rational analysis of the acquisition of multisensory representations. Cognitive Science, 36, 305–332.
Young, G. A. & Smith, R. L. (2010). Essentials of statistical inference. Cambridge, UK: Cambridge University Press.

CHAPTER 10
Models of Decision Making under Risk and Uncertainty

Timothy J. Pleskac, Adele Diederich, and Thomas S. Wallsten

Abstract
Formal models have a long and important history in the study of human decision-making. They have served as normative standards against which to compare real choices, as well as precise descriptions of actual choice behavior. This chapter begins with an overview of the historical development of decision theory and rational choice theory and then reviews how models have been used in their normative and descriptive capacities. Models covered include prospect theory, rank- and sign-dependent utility theories and their descendants, as well as cognitive models of human decision-making like Decision Field Theory and the Leaky Competing Accumulator Model, which are based on basic psychological principles rather than assumptions of rationality.

Key Words: subjective probability, utility, rational choice models, choice axioms, cognitive model, expected utility theory, subjective expected utility theory, prospect theory, decision field theory, risk, uncertainty, independence, transitivity, stochastic dominance

Introduction
Formal models serve three separate but intertwined roles in the study of human decision-making.1 First, rational choice models, especially when posed as sets of axioms, serve as standards against which to compare real choices made by sentient beings. Indeed, many of the axioms can be cast as putative empirical laws to be tested in suitable experiments (Krantz, 1972). Such comparisons invoke a second, closely related role of formal models, which is to describe actual behavior. Finally, to the extent that a descriptive model is accurate, it provides a means for measuring the values of latent constructs underlying choice.

This chapter provides an overview of formal decision and choice models in the first two of these three roles. For illustrations of the third role, see the chapter by Pleskac (in press), which includes examples of how descriptive models of choice have been used to measure latent constructs involved in decision-making with the aim of characterizing decision-making deficits associated with clinical and neurological disorders.

We begin with a historical introduction to the rational choice models of expected utility theory (EUT; von Neumann & Morgenstern, 1944) and subjective expected utility theory (SEUT; Savage, 1954), followed by a discussion of empirical tests of the underlying axioms. Descriptive failures of the axioms laid the groundwork for prospect theory, rank- and sign-dependent utility theories, and their descendants. Following a presentation of these newer models, which represent generalizations of EUT and SEUT, we turn to formal models rooted in cognitive psychology that represent alternative approaches to understanding decision behavior (Busemeyer & Diederich, 2010). For conciseness, we refer to these as cognitive models of choice or decision-making.2

209 Pascal’s solution was that the players should be Box 1 Types of uncertainty paid based on their expected earnings based on their The terms risk and uncertainty in the title of this play so far, which eventually led to an expected- chapter arise from a distinction that economists value theory of decision-making. This is the idea make between risky and uncertain situations, that gambles and risky alternatives should be valued the former when probabilities of the outcomes according to their expected value and ultimately in a situation are known (and not 0 or 1) one should choose according to this value.3 For and the latter when the probabilities are not example, consider an individual deciding whether well specified (Knight, 1921; Luce & Raiffa, to take Gamble A1 or A2, either of which pays off 1957). Applicable rational models are EUT in on the basis of a coin flip. The situation is illustrated the former case and SEUT in the latter. in the payoff matrix in Table 10.1, in which the A related distinction often found in the rows are the alternatives between which the decision behavioral literature is between precise and maker (DM) must choose (A1 or A2)andthe ambiguous probabilities, the former applying columns are the states of nature that might occur when probability distributions are described or (coin lands heads, s1, or tails, s ). Each state has a well learned from the environment and the 2 probability pj of occurring. The values in the cells latter when they are not (Ellsberg, 1961). The cij are the cash payoffs that the individual receives term ambiguous is really a misnomer in this when choosing alternative i and state sj occurs. context, in that something is ambiguous if it Which alternative should the DM choose? 
Ac- can take on one of a number of well-defined cording to expected value theory, letting p(sj) denote meanings, which is not what is intended the probability of state sj, the DM should choose when decision researchers refer to ambiguity the alternative with the largest expected value, v(Ai), or ambiguous probability. The terms vague calculated as or imprecise would be more appropriate, but generally have not taken hold in this literature. v(Ai) = p sj · cij.(1) See Budescu and Wallsten (1987) for further j discussion. Structuring the decision problem in this payoff A third distinction, this one made by matrix illustrates a common technique in modeling philosophers and risk analysts, is between risky decisions, which is to represent a risky decision aleatory and epistemic uncertainty, the former as a choice between two or more well-defined gam- when the uncertainty is due to stochastic factors bles (Coombs, Dawes, & Tversky, 1970; Savage, in the environment and the latter when it is due 1954). Without a doubt, this quantitative represen- to lack of knowledge (Hacking, 1975). tation facilitated the development of normative and descriptive theories of risky choice. Distilling risky choice into these crucial dimensions also provides a small, constrained world in which one can ask Historical Development of Decision Theory how actual DMs use the dimensions of payoffs The academic study of decision-making under and probabilities to make a decision. For example, risk can trace its origins to the 18th century nearly 100 years after Pascal and Fermat discussed when the gentleman-gambler Chevalier de Méré the problem of the points, Daniel Bernoulli (1954; asked prominent mathematicians of the time for originally 1738) used another lottery puzzle, the St. a solution to what became known as the Problem Petersburg paradox, to demonstrate why valuing a of Points (Hacking, 1975; Todhunter, 1865). 
In gamble—and by extension a risky alternative—by the problem, two people are playing a game of its objective expectation does not seem quite right. chance with a goal of reaching a pre-determined In the St. Petersburg gamble, a fair coin is tossed set of points to win a pot of money. Méré was curious about how the stakes sould be divided between the players if the game ended before Table 10.1. Payoff matrix either one reached the threshold. Blaise Pascal States of nature and Pierre de Fermat took up this problem, and in their correspondence, Pascal developed the Head (s1) Tail (s2) basic ideas of what would later be recognized as Alternatives Gamble A1 c11 c12 probability theory to solve the puzzle (Hacking, Gamble A2 c21 c22 1975).
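Equation 1 applied to the payoff matrix of Table 10.1 can be sketched in a few lines of code. The payoffs below are hypothetical numbers chosen for illustration, not values from the chapter:

```python
def expected_value(payoffs, probs):
    """Expected value of one alternative: v(A_i) = sum_j p(s_j) * c_ij (Eq. 1)."""
    return sum(p * c for p, c in zip(probs, payoffs))

# A fair coin gives p(s1) = p(s2) = 0.5 for heads and tails.
probs = [0.5, 0.5]
payoff_matrix = {               # hypothetical cash payoffs c_ij
    "A1": [100, 0],             # Gamble A1: $100 if heads, $0 if tails
    "A2": [40, 40],             # Gamble A2: $40 regardless of the coin
}

values = {a: expected_value(c, probs) for a, c in payoff_matrix.items()}
best = max(values, key=values.get)
print(values)   # {'A1': 50.0, 'A2': 40.0}
print(best)     # A1 -- expected-value theory picks the larger v(A_i)
```

Note that expected-value theory is indifferent to the spread of the payoffs: it prefers the risky A1 to the sure $40 simply because 50 > 40, which is exactly the feature Bernoulli's St. Petersburg argument calls into question.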

210 higher level cognition

until tails appears. At this point, the player is paid a sum equal to 2^n ducats (a gold coin), where n is the number of tosses that occurred. Note that, by substituting in Eq. 1, the expected value of this gamble,

v(G) = Σ_{j=1}^{∞} 0.5^j · 2^j,

is infinite.

The paradox was, why does intuition suggest that "any fairly reasonable man would sell his chance, with great pleasure, for twenty ducats" (Bernoulli, 1954, p. 31) when its expected value is infinite? Bernoulli's insight was that the operative values were not the cash amounts, c, but their "moral value," or what today commonly is termed their utility, u(c). He persuasively argued that the moral value (utility) of wealth is a concave downward function of ducats (monetary value), representing the idea that as wealth increases, there is a diminishing return to its utility. Specifically, based on reasonable assumptions about human behavior, he deduced that the utility function must be logarithmic. This insight can resolve the paradox because taking the expectation of this (logarithmic) moral value leads to a finite price for the St. Petersburg gamble. In other words, according to Bernoulli, the value to the DM of the St. Petersburg gamble G was its expected utility, u(G), calculated as

u(G) = Σ_{j=1}^{∞} 0.5^j · u(2^j).

More generally, and moving beyond Bernoulli's log function, Eq. 1 is modified to

u(A_i) = Σ_j p(s_j) · u(c_ij).   (2)

Thus, subjective values of wealth entered into economic theory in the form of EUT via lotteries. This concept of a subjective value of a physical stimulus influenced later psychological theorists like Fechner, Weber, and Thurstone (Masin, Zudini, & Antonelli, 2009).

Brilliant as it was, a problem with Bernoulli's expected utility principle was that it explained a phenomenon ex post. Moreover, the explanation depended crucially on the untestable assumption that DMs made choices that maximized expected utility. In a development that provides the modern form of EUT, von Neumann and Morgenstern (1944) provided an ex ante justification based on first principles. Focusing on monetary gambles, they proposed a set of qualitative (i.e., non-numerical) axioms4 such that if a decision maker's pattern of choices over selected pairs of lotteries satisfied the axioms, then a utility value, unique up to a positive linear transformation,5 could be assigned to each outcome, and the DM can be said to have chosen so as to maximize expected utility as represented in Eq. 2, but without the restriction that u(c_ij) is a log function or, indeed, that the c_ij even have cash values at all.

Savage (1954) generalized EUT to situations in which the probabilities are not explicitly given, which Knight (1921) called decisions under uncertainty.6 His solution, known as SEUT, provided the axiomatic basis for rational choice among lotteries with unstated or unknown probabilities. A DM whose choices are consistent with Savage's axioms is acting as though she assigns utilities to outcomes, personal numerical probabilities to the relevant possible states of the world, and is choosing so as to maximize subjectively expected utility.

Both von Neumann and Morgenstern and Savage proposed axiom sets that can be taken as bases for rational choice models, not as foundations of models of human behavior. Nevertheless, these developments opened up a larger set of issues concerning whether humans (or other animals) make choices that correspond with the axioms of EUT and SEUT or whether alternative theories are required to describe behavior. This question has been the entry point for the development of models that better describe the choices people actually make under risk and uncertainty. A prominent approach has been to generalize the expected utility expression in Eq. 2. Another approach has been more computational in nature. Instead of relying on maximization models that consist of, or are generalizations of, rational choice models, cognitive models of choice tend to build on psychologically plausible processes that may govern how DMs deliberate among choice alternatives over time. Following a brief discussion of axiom tests, we review these various approaches and their implications.

Empirical Tests of Choice Axioms

This chapter is not the place for an extensive review of the experimental literature on the adequacy of the choice axioms as descriptive of human behavior, but a few examples are useful as illustrations of this work, which has so profoundly affected decision modeling. We consider the axioms of independence and of transitivity, because they are utterly foundational to rational models and necessary for a pattern of choices to be considered rational under any axiomatic definitions.7
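Bernoulli's resolution of the St. Petersburg paradox described above is easy to verify numerically. The sketch below is ours, not the chapter's (the function names are illustrative): truncated partial sums of the expected value in Eq. 1 grow without bound, whereas the expected log utility converges to 2·log 2 ≈ 1.39, a certainty equivalent of only about 4 ducats.

```python
import math

def st_petersburg_ev(n_terms):
    # Truncated expected value: every term 0.5**j * 2**j equals 1,
    # so the partial sum simply equals the number of terms summed.
    return sum(0.5**j * 2**j for j in range(1, n_terms + 1))

def st_petersburg_eu(utility, n_terms=200):
    # Expected utility u(G) = sum_j 0.5**j * u(2**j); converges when u = log.
    return sum(0.5**j * utility(2**j) for j in range(1, n_terms + 1))

ev_50 = st_petersburg_ev(50)          # 50.0: the expected value diverges
eu_log = st_petersburg_eu(math.log)   # finite: approaches 2*log(2) ≈ 1.386
```

Because each term of the expected-value series contributes exactly 1, the divergence is linear in the number of terms, while the log-utility series is dominated by a geometric tail and converges quickly.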

models of decision making under risk and uncertainty 211

Readers desiring more complete coverage of this issue should consult Rapoport and Wallsten (1972) for an early discussion of the empirical status of the axioms and Mellers, Schwartz, and Cooke (1998) and Shafir and LeBoeuf (2002) for more recent reviews. The report by MacCrimmon and Larsson (1979) contains a very thorough summary of the von Neumann and Morgenstern as well as the Savage axioms, along with empirical tests of them. For a discussion of the merits of testing decision-making theories via critical tests of their axioms, see Birnbaum (2011).

Independence axioms

Independence axioms come in many forms, but they all capture the principle that adding identical outcomes to choice alternatives should not alter the direction of preference between the alternatives. One version is Savage's (1954) sure-thing principle, which states that if gambles A and B provide the same outcome contingent on state s_j, then the DM's preference between A and B is independent of that outcome (Wakker, 2010, p. 115). Maurice Allais provided an early and powerful test of the sure-thing principle with the Allais Paradox (Allais, 1953, 1979). The basic design is illustrated in Table 10.2.

Table 10.2. Two choice situations that illustrate the Allais paradox

Choice 1                          Choice 2
Gamble A*    $1M    100%          Gamble A′    $1M    11%
                                               $0     89%
Gamble B     $1M    89%           Gamble B′*   $0     90%
             $0     1%                         $5M    10%
             $5M    10%

* = Majority choice, M = Million

Note that Gamble A can be derived from A′ and B can be derived from B′ by in each case adding an 89% chance of winning $1M. Thus, according to the sure-thing principle, the directions of preference should be the same for Choices 1 and 2, i.e., either A ≻ B and A′ ≻ B′ or B ≻ A and B′ ≻ A′. To see that EUT (and SEUT) requires precisely one of these two choice patterns, observe that if, e.g., A ≻ B, then according to EUT (Eq. 2 with u(0) = 0), it must be the case that

u(A) > u(B) ⟹ u($1M) > 0.89u($1M) + 0.10u($5M).

Subtracting 0.89u($1M) from both sides of the inequality yields

0.11u($1M) > 0.10u($5M) ⟹ u(A′) > u(B′),

implying that A′ ≻ B′. But, in fact, the most commonly observed pattern is A ≻ B and B′ ≻ A′, which is inconsistent with the predictions of EUT (e.g., MacCrimmon & Larsson, 1979). Although the existence, magnitude, and even the direction of the Allais effect can be manipulated (e.g., Barron & Erev, 2003), its overall robustness calls into question the generality of EUT or SEUT as descriptive theories of choice. Kahneman and Tversky (1979) introduced prospect theory as a generalization of EUT in response to this and other choice patterns that violate the normative rational models.

The axiom of transitivity

Preferences among options A, B, and C are said to be transitive if and only if

(A ≻ B and B ≻ C) ⟹ A ≻ C.

It is easy to see that Eq. 2 implies transitivity among all triples of options. If it systematically fails, then EUT and SEUT cannot hold, nor for that matter can any model that assumes or implies a weak ordering of preferences.

One would think that empirically testing transitivity is a straightforward matter. After all, what could be simpler than establishing a triple of preferences? The difficulty arises from the fact that choice, as with so much of behavior, is stochastic,8 so that when confronted with the same pair of options, A and B, on two or more ostensibly identical occasions, an individual will not necessarily choose the same option each time. Stochastic behavior often conflicts with the deterministic nature of the axioms, causing statistical difficulties when testing the descriptive validity of the axioms.9 There have been numerous attempts to develop statistical techniques especially suited to axiom testing, but until recently (Regenwetter, Dana, & Davis-Stober, 2011) none has proved satisfactory for a variety of deep reasons (Luce, 1995). We return to Regenwetter et al. shortly, but note first that, early on, researchers distinguished three levels of stochastic transitivity: strong (SST), moderate (MST), and weak (WST).
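The algebra behind the Allais argument above can be checked directly. Under Eq. 2, the two choices must agree for any utility function with u(0) = 0, because the two expected-utility differences are term-by-term identical. A minimal sketch (the gamble encoding and the particular concave utility are our illustrative choices, not the chapter's):

```python
def eu(gamble, u):
    # Expected utility per Eq. 2: sum over (outcome, probability) pairs.
    return sum(p * u(c) for c, p in gamble)

u = lambda c: c ** 0.5  # any utility with u(0) = 0; the conclusion holds for all

A  = [(1_000_000, 1.0)]
B  = [(1_000_000, 0.89), (5_000_000, 0.10), (0, 0.01)]
A2 = [(1_000_000, 0.11), (0, 0.89)]            # A' in Table 10.2
B2 = [(5_000_000, 0.10), (0, 0.90)]            # B' in Table 10.2

# eu(A)-eu(B) and eu(A')-eu(B') differ only by the common term 0.89*u($1M),
# so EUT forces the two preferences to point the same way.
same_sign = (eu(A, u) > eu(B, u)) == (eu(A2, u) > eu(B2, u))
```

The modal human pattern (A ≻ B together with B′ ≻ A′) is therefore unreachable for any expected-utility maximizer, which is the content of the paradox.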

Letting P(i ≻ j) denote the probability that i is preferred to j, assume that P(A ≻ B) > 0.5 and P(B ≻ C) > 0.5. Then,

• SST holds when P(A ≻ C) > max[P(A ≻ B), P(B ≻ C)],
• MST holds when P(A ≻ C) > min[P(A ≻ B), P(B ≻ C)], and
• WST holds when P(A ≻ C) > 0.5.

By relaxing the assumptions of Thurstone's (1927) Case V Law of Comparative Judgment, Morrison (1963) developed 42 distinct choice models, which differed, among other ways, in the form of stochastic transitivity that each implied. Within that framework, Morrison (1963) pointed out that for n > 3 options in a choice set, it is impossible for preferences to be intransitive for all triples of options. And in fact, as n increases, the maximum proportion of triples that can be intransitive approaches the expected proportion given random responding, making actual tests very difficult. This result caused Morrison (1963) to conclude that a proper demonstration of intransitivity requires one to determine in advance which triples of options will produce the violations.

Tversky (1969), in what has become a classic paper, took up that challenge and demonstrated systematic intransitivities precisely where he expected to see them based on an additive-difference choice model. In contrast to expected utility types of models, this model assumed that DMs compare options dimension by dimension, for example, first compare the probabilities, then compare the outcomes, and so on, weight the difference on each dimension according to a weighting function, sum the transformed differences, and choose accordingly. Depending on the nature of the weighting functions, this model can be rendered indistinguishable from SEUT, or it can be very different indeed, to the point of predicting intransitivities.

Tversky (1969) constructed sets of five gambles, each displayed on a card as a circular spinner radially divided into a black and a white sector of proportions p and 1 − p, respectively, with a positive dollar amount, x, above the black sector. Thus, the card represented the gamble "win $x with probability p, otherwise nothing." From cards a through e in a set, values of p increased by units of 1/24, say from 7/24 to 11/24 of the circle, whereas payoffs decreased by small amounts, say from $5.00 to $4.00 in steps of $0.25. Tversky assumed that when presented with pairs of cards, DMs would first compare black sectors, that is, perceived values of p. If they were sufficiently different, then the DM would choose the gamble with the greater probability. Otherwise, the DM would choose according to the value of x. This lexicographic semi-order (LS) choice rule is a particular form of the additive-difference model and predicted a certain pattern of intransitive choices. Specifically, Tversky (1969) proposed that probabilities in adjacent cards, a and b, b and c, and so on, would be perceptually so similar that respondents would consider them essentially equal and compare the outcomes. As the outcomes were expressed numerically and easily distinguished, they would provide the basis for the choice between adjacent cards in the set. Sufficiently nonadjacent cards, say a and e, would differ noticeably in perceptual probabilities and, therefore, they would provide the basis for choice. Hence a pattern of choices might be a ≺ b, b ≺ c, c ≺ d, d ≺ e but e ≺ a. Using participants pretested for a propensity toward such circular choices, Tversky (1969) used maximum-likelihood chi-squared tests to compare the likelihoods of observed choice patterns under LS and under WST. He interpreted the data to strongly favor the former. This conclusion has predominated despite the fact that Tversky preselected his subjects, leaving open the possibility that the results might not generalize, and that Iverson and Falmagne (1985) pointed out problems with the data analysis stemming from the inappropriateness of the chi-squared distribution in this case. When they used the proper asymptotic distribution, the pattern of results, with one exception, became not significantly different from that expected under WST. Subsequently, Myung, Karabatsos, and Iverson (2005) used a Bayesian approach to find greater support for Tversky's (1969) model than for a variety of other reasonable models. Thus, although it appears that Tversky (1969) obtained the predicted pattern of intransitivities, the supporting data might be best considered as weakly supporting violations of transitivity.

Returning now to Regenwetter et al. (2011), they discuss in depth various shortcomings in earlier formulations of tests of transitivity. Their work subsumes and goes substantially beyond Iverson and Falmagne (1985) in proposing new theoretical, statistical, and empirical approaches to investigating the descriptive validity of transitivity and other choice axioms. Using the developments in Regenwetter et al. (2011), Regenwetter, Dana, and Davis-Stober (2010) provided data in support of a particular mixture model of transitive preference.
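The lexicographic semi-order is simple to simulate. In the sketch below (ours, not Tversky's code; the 3/24 just-noticeable difference in p and the exact probability-payoff assignment are illustrative assumptions), adjacent cards are decided by the easily read payoffs while the extreme pair is decided by the now-noticeable probability difference, which closes the intransitive cycle described above:

```python
from fractions import Fraction

# Five spinner gambles in the spirit of Tversky (1969):
# p rises by 1/24 from card a to card e while the payoff falls by $0.25.
cards = {label: (Fraction(7 + i, 24), 5.00 - 0.25 * i)
         for i, label in enumerate("abcde")}

JND = Fraction(3, 24)  # assumed just-noticeable difference in p (illustrative)

def ls_prefers(i, j):
    """Lexicographic semi-order: use p only when its difference is noticeable,
    otherwise fall back on the dollar amounts."""
    (pi, xi), (pj, xj) = cards[i], cards[j]
    if abs(pi - pj) >= JND:
        return i if pi > pj else j
    return i if xi > xj else j

adjacent = [ls_prefers(i, j) for i, j in zip("abcd", "bcde")]  # payoffs decide
extreme = ls_prefers("a", "e")  # probability difference is noticeable: e wins
```

With this assignment the cycle runs a ≻ b ≻ c ≻ d ≻ e yet e ≻ a; the direction of the loop in the chapter's ≺ notation depends on which dimension increases across cards, but the intransitivity itself is the point.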

The essence of the mixture model is that an individual may have multiple transitive preference states, only one of which is active at any given moment. A set of observed choices, therefore, likely arises from many different states. In this formulation, the preference states, and not the choices they imply, are stochastic and define the parameter space. This model has a good deal of structure and, therefore, is highly constrained, rendering only a small fraction of logically possible response patterns consistent with it. In careful individual-level analyses of data from 18 respondents, they found little support for intransitive preference states. We, in turn, must conclude that when subjected to rigorous tests, transitivity has not been shown to fail. This is a fortunate outcome, because it is implied by most decision models.

Generalizing Rational Choice Models

One common response to the violation of the independence axiom and others has been to seek ways to generalize the rational models to account for human behavior. Indeed Luce (2000) suggested, "...any descriptive theory of decision making should always include as special cases any locally rational theory for the same topic" (p. 108). Prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992) has been perhaps the most successful theory in this spirit. At a very general level, prospect theory not only introduced psychologically more plausible assumptions about the properties of the utility function,10 but it also did away with the linear treatment of probabilities in estimating the value of risky options. Kahneman and Tversky (1979) initially expressed prospect theory in the form of a probability weighting function, generalizing Eq. 2 to

u(A_i) = Σ_j w[p(s_j)] · u(c_ij).   (3)

Next, we examine how Kahneman and Tversky generalized the utility and weighting functions first in terms of prospect theory and then in terms of cumulative prospect theory to better account for people's choices. For more complete coverage of these generalizations readers should consult Luce (2000) and Wakker (2010).

Probability weighting function

The assumption of linear sensitivity to probabilities in EUT and SEUT largely relegates all descriptive power to the utility function, ascribing, for example, risk aversion to the concavity of the utility function. However, starting with Preston and Baratta (1948), psychologists explicitly entertained the hypothesis that an expectation-based descriptive model of risky choice required both a utility function and a function describing the psychological value of probabilities (see also Edwards, 1954). Kahneman and Tversky's (1979) prospect theory made perhaps the best case for the probability weighting function by illustrating that nonlinear probability weighting can account for the Allais paradox. To see how, according to Eq. 3, the preference A ≻ B in Choice 1 in Table 10.2 implies

u(A) > u(B) ⟹ u($1M) > w(0.89) · u($1M) + w(0.10) · u($5M).

With some algebra, the right-hand side of the arrow can be rewritten as

u($1M)/u($5M) > w(0.10)/[1 − w(0.89)].

The preference B′ ≻ A′ in Choice 2 implies

u(B′) > u(A′) ⟹ w(0.10) · u($5M) > w(0.11) · u($1M).

The right-hand side of the arrow here can be rewritten as

w(0.10)/w(0.11) > u($1M)/u($5M).

Thus, these two inequalities imply

1 > w(0.11) + w(0.89).

Kahneman and Tversky (1979) referred to this relationship as subcertainty, and it implies that w(p) is regressive with respect to p. That is, the Allais paradox would imply that the weighting function for intermediate levels of probabilities is less sensitive to variations of probability than dictated by the expected utility model.

There are other properties of the weighting function besides the subcertainty property. For example, one of the more common properties is overweighting of rare events. Kahneman and Tversky (1979) demonstrated this property when they asked participants to choose between a risky gamble of $5,000 with a probability of 0.001 and a sure thing of $5. Participants expressed a preference for the risky gamble, much as is observed with people playing the lottery. Assuming concavity of the utility function, this preference implies overweighting of rare events: w(0.001) > u($5)/u($5,000) > 0.001.
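Both subcertainty and the overweighting of rare events follow from a sufficiently inverse-S-shaped weighting function. As an illustration (the chapter does not commit to a specific parametric form here), Tversky and Kahneman's (1992) one-parameter function with their median gain-domain estimate γ = 0.61 satisfies both properties:

```python
def w(p, gamma=0.61):
    # Tversky & Kahneman's (1992) one-parameter inverse-S weighting function;
    # gamma = 0.61 is their median estimate for gains, used here for illustration.
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

subcertainty = w(0.11) + w(0.89)  # < 1: the Allais pattern's subcertainty
rare = w(0.001)                   # > 0.001: rare events are overweighted
```

The function is regressive in the midrange (flatter than the identity) while rising steeply near p = 0 and p = 1, which is exactly the shape the Allais and lottery-choice data demand.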

Fig. 10.1 Prospect theory's inverse-S probability weighting function (Panel A: w(p) against p) and utility function with loss aversion (Panel B: u(c) against c).

These gambles, as well as others, lead to the conclusion that the probability weighting function on average takes an inverse S-shape like the one shown in Figure 10.1A. Analyses at the individual level also support this conclusion (Gonzalez & Wu, 1999).

It is important to note that investigations into the shape of the probability weighting function have primarily focused on decisions under risk. An important question is, what is the shape of the weighting function for decisions under uncertainty, when events do not have explicitly stated probabilities? Tversky and Fox (1995) showed that the inverse-S shape of the probability weighting function generalized to decisions under uncertainty. In particular, they found that, in evaluating the worth of gambles in domains like sporting events, the stock market, and temperatures, events that turned an impossibility into a possibility or a possibility into a certainty had greater impact than when the same event made a possibility more or less likely (see also Tversky & Wakker, 1995).

Psychologically, Tversky and Fox (1995) interpreted these results as consistent with a two-stage process for making decisions under uncertainty (see also Fox & Tversky, 1998). In their theory, people first estimate the probability of an uncertain event based on the support it receives relative to the alternative events. This support can be understood as a measure of the strength of evidence in favor of the event arrived at via memory search or some other heuristic process (Tversky & Koehler, 1994). The probability estimate then is transformed into a decision weight during the decision process.

Recently, psychologists have been examining an alternative approach to making decisions under uncertainty, whereby, instead of searching memory to form a probability estimate, DMs form the estimate by sampling directly from the available alternatives. This process has been called decisions from experience (Barron & Erev, 2003; Hertwig, Barron, Weber, & Erev, 2004). One interesting empirical consequence of making decisions in this manner is that instead of rare events being overweighted, as they appear to be when probabilities are described, they appear to be underweighted, that is, to have less impact on choice than they deserve according to their objective probabilities. In other words, the probability weighting function seems to take a different form for decisions made from experience than from description (see Box 2).

Box 2 Experience-Based Decision Making

Often DMs do not have convenient descriptions of the probabilities and payoffs associated with different choice options. Instead they must use their experience with the options to form a statistical estimate of these values. A natural question is, do decisions made from descriptions differ from decisions made on the basis of samples of information, decisions from experience?

Choice 1                      Choice 2
Gamble A    $4    80%         Gamble A′    $4    20%
            $0    20%                      $0    80%
Gamble B    $3    100%        Gamble B′    $3    25%
                                           $0    75%

To investigate this question, it is necessary to investigate choices in description and experience conditions.
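One sampling-based intuition for the underweighting of rare events in experience can be sketched analytically: in a small sample of draws the rare outcome is often never encountered at all, in which case it cannot influence choice. The calculation below is our illustrative simplification (independent draws; see Fox & Hadar, 2006 for the sampling-error account), not a model proposed in the chapter:

```python
# Probability that n experience samples never contain an outcome of probability p.
# An event that is never observed cannot affect choice, mimicking underweighting.
def p_never_seen(p_rare, n_samples):
    return (1 - p_rare) ** n_samples

miss_small = p_never_seen(0.1, 7)    # ≈ 0.48: small samples often omit the event
miss_large = p_never_seen(0.1, 50)   # ≈ 0.005: large samples almost never do
```

With the short sampling sequences typical of experience experiments, nearly half of participants would never see a p = .1 outcome at all, so the gap shrinks, though it does not vanish, as samples grow.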

Box 2 Continued

In the description condition, people are presented with verbal descriptions of options as in Choice 1. In the experience condition, the descriptions of the gambles are replaced with computerized buttons. Clicking either button produces a random draw from the respective unknown gamble.

This change in presentation impacts preferences. For example, when faced with Choice 1 in the description condition, approximately 70 to 80% of people choose the sure thing (Option B). In comparison, when the same gambles are learned via experience, then 70 to 80% of people prefer the gamble (Option A) (see for example Hertwig et al., 2004). Moreover, when the probabilities of the nonzero payoffs are divided by 4 (see Choice 2), then people in the description condition tend to prefer Option A′, but people in the experience condition tend to prefer Option B′.

Because the probabilities between the two choices are a common ratio of each other, preferences should be consistent, so that, for example, if A is preferred to B then A′ should be preferred to B′. Thus, the choice patterns in both the description and experience conditions are each violations of the independence axiom of expected utility theory. Like the Allais paradox, the switch in preference can be understood in terms of a non-linear probability weighting function. However, they imply differently shaped functions: choices in the description condition imply overweighting of rare events (Kahneman & Tversky, 1979), and the choices in the experience condition imply underweighting of rare events (Hertwig et al., 2004). A question of great interest is what can account for this apparent gap between description and experience. Part of the gap is due to the sampling error that occurs in the experience condition (Fox & Hadar, 2006). However, there are now several studies that show even after reducing sampling error (Hau, Pleskac, & Hertwig, 2010) or controlling for it (Ungemach et al., 2009) the gap remains. The source of the gap is still an active area of research, but current research points to learning and memory processes, estimation error of the event probabilities, and format dependence of the decision algorithm used (see Hertwig & Erev, 2009).

Utility function

Prospect theory also introduced several psychological principles regarding the utility function. One, which is foundational, is that the function is defined over gains and losses relative to a reference point (usually the status quo; see Panel B of Figure 10.1), not over total final wealth. For example, consider the two choice problems in Table 10.3 given to two different groups of participants. A majority of the participants who received Choice 1 chose the sure thing, Gamble A. In contrast, a majority of the participants who received Choice 2 chose the risky option, Gamble D. As one can see, if participants made their choice based on final wealth, as required by the utility theories, the choices should have been the same. Instead the actual choices imply that the DMs treated the outcomes as gains or losses relative to a reference point (in this case, and often, the status quo). Although historically the valuation of outcomes has often been modeled as changes from a reference point, disregarding initial wealth, prospect theory formally included this property as a core principle (see Wakker, 2010, p. 234; Luce, 2000, p. 1).

In prospect theory, the distinction between gains and losses brought with it two further psychological implications in terms of the shape of the utility function. One is that the function is concave over gains and convex over losses (see Panel B in Figure 10.1), reflecting the hypothesis that as the magnitude of gains and losses increases, people become increasingly less sensitive to changes. Originally, support for this hypothesis came from choices among gambles like the ones in Table 10.4, which Kahneman and Tversky (1979) presented to participants. Using the model in Eq. 3, the majority choices among these gambles imply that u(6,000) < u(4,000) + u(2,000) and u(−6,000) > u(−4,000) + u(−2,000). However, as we will see shortly, this interpretation of the preferences in terms of Eq. 3 has given way to a slightly different version called rank dependent utility models. The rank dependent models do not allow such a clean interpretation of the preferences in Table 10.4 (see Luce, 2000, p. 82). Nevertheless, analyses of a wide range of situations reveal that there is a mixture of concave and convex utility functions over gains and losses, but the most common pattern is concave over gains and convex over losses (Fishburn & Kochenberger, 1979).

A second property that emerges from prospect theory's distinction between gains and losses is loss aversion, the hypothesis that losses loom larger than the equivalent gains in risky choices.
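The reference-dependence pattern of Table 10.3 and loss aversion can both be reproduced with a parametric value function. The sketch below is illustrative only, using Tversky and Kahneman's (1992) median estimates (α = 0.88, λ = 2.25) rather than anything fitted in this chapter:

```python
ALPHA, LAMBDA = 0.88, 2.25  # Tversky & Kahneman (1992) medians (illustrative)

def v(x):
    # Prospect theory value function: concave for gains, convex and steeper
    # (by the loss-aversion factor LAMBDA) for losses.
    return x**ALPHA if x >= 0 else -LAMBDA * (-x)**ALPHA

# Table 10.3 pattern: risk averse for gains, risk seeking for losses.
prefers_sure_gain   = v(500) > 0.5 * v(1000)     # True: concavity over gains
prefers_loss_gamble = 0.5 * v(-1000) > v(-500)   # True: convexity over losses

# Loss aversion: a symmetric (x, .50; -x, .50) bet has negative value.
symmetric_bet = 0.5 * v(100) + 0.5 * v(-100)     # < 0
```

Note that λ cancels in the pure-loss comparison, so the Choice 2 reversal is driven by convexity alone, while the unattractiveness of the symmetric bet is driven by λ alone.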

Table 10.3. Two Choice Situations Used to Illustrate Reference Dependence

Choice 1: In addition to whatever you own, you have been given $1,000.
Gamble A*    $500     100%
Gamble B     $1000    50%
             $0       50%

Choice 2: In addition to whatever you own, you have been given $2,000.
Gamble C     $-500    100%
Gamble D*    $-1000   50%
             $0       50%

* = majority choice

Thus, as Kahneman and Tversky (1979) argued, "...most people find symmetric bets of the form (x, .50; −x, .50) distinctly unattractive" (p. 279). Formally, loss aversion results in the utility function having a steeper slope for losses than for gains (see Figure 10.1, Panel B). Perhaps the strongest evidence in support of loss aversion comes from the endowment effect, in which individuals demand significantly more to sell a good that they own than they would be willing to pay to buy the same good when they do not own it (Kahneman, Knetsch, & Thaler, 1990). In risky contexts, the certainty equivalents people place on mixed gambles suggest loss aversion (Tom, Fox, Trepel, & Poldrack, 2007; Tversky & Kahneman, 1992). However, the evidence is a bit more mixed when participants are given the option to choose between symmetric bets. Battalio, Kagel, and Jiranyakul (1990) reported mixed results, with a majority of people choosing the gamble over a zero-change alternative when the gamble was for $10, but more people choosing the zero-change option when the gamble was for $20 (see also Ert & Erev, 2008). Such mixed evidence has led to calls for alternative views of losses as modulators of attention, rather than as a source of loss aversion in the valuation of prospects (Yechiam & Hochman, 2013).

Cumulative Prospect Theory

A potential drawback of original prospect theory, or any other theory that assumes a subadditive weighting function, is that it predicts violations of stochastic dominance. Gamble A stochastically dominates Gamble B when A's probability distribution over outcomes is as favorable or more favorable than (i.e., dominates) B's. Formally, for two nonidentical gambles A and B, A stochastically dominates B if and only if P(x > t | A) ≥ P(x > t | B) for all t, where P(x > t | A) is the probability that an outcome of Gamble A exceeds t. Prospect theory originally described choices as occurring in two phases: first editing and then evaluation. The editing phase provides for, among other operations, the removal of dominated alternatives from consideration. Without this editing phase, violations of stochastic dominance can occur. For example, consider the numerical example from Weber (1994) shown in Table 10.5.

Gamble A stochastically dominates Gamble B. However, by allowing subadditivity in the weighting function, prospect theory permits w(0.8) > w(0.4) + w(0.4). Consequently, so long as the difference between w(0.8) and w(0.4) + w(0.4) outweighs the greater utility difference between $200 and $210, prospect theory will predict a greater value for Gamble B than for A and predict a violation of stochastic dominance. More generally, violations of stochastic dominance can occur when the decision weights for a set of outcomes do not sum to 1 (Fishburn, 1978).

Table 10.4. Gambles illustrating diminishing sensitivity in gains and losses.

Choice 1                          Choice 2
Gamble A     $6000    25%         Gamble C*    $-6000    25%
             $0       75%                      $0        75%
Gamble B*    $4000    25%         Gamble D     $-4000    25%
             $2000    25%                      $-2000    25%
             $0       50%                      $0        50%

* = majority choice

Table 10.5. Gambles illustrating prospect theory's violation of stochastic dominance.

Gamble A     $210     40%
             $200     40%
             $0       20%
Gamble B     $200     80%
             $0       20%

Empirically there is evidence that when dominance is not transparent, people do make choices that violate stochastic dominance (Birnbaum & Navarrete, 1998; Lopes, 1984; Mellers, Weiss, & Birnbaum, 1992; Tversky & Kahneman, 1986). In fact, later in this chapter, we review how decision field theory predicts violations of stochastic dominance, and empirical data support the prediction.

Nevertheless, a choice theory that does not violate stochastic dominance seemed desirable. One reason was that stochastic dominance (as well as transitivity) is among the properties that, in the words of Quiggin (1982), commanded "...virtually unanimous assent even among those who violated them" (p. 325). A second reason was that adherence to dominance principles is assumed by a large body of empirically supported economic theories of decision-making under certainty, and thus a theory of decision-making under risk and uncertainty that was consistent with these theories also seemed useful. On a separate note, when there are a large number of outcomes or even a continuous distribution over outcomes, the application of a weighting function to individual event probabilities is cumbersome.

These reasons led to the development of rank-dependent utility theories (Quiggin, 1982). The key idea behind rank-dependent utility theories is that the weighting function is applied not to the single event probabilities but to the (de)cumulative probabilities, that is, the probability of obtaining a given outcome or worse (better) from a specific alternative. The transformed (de)cumulative probabilities are then used to determine the weight each outcome receives (i.e., the marginal contribution of each outcome) in assessing the value of the alternative. Cumulative prospect theory is one instance of a rank dependent utility theory (Tversky & Kahneman, 1992). For a broader review of rank dependent utility theories see Luce (2000).

Cumulative prospect theory applies the rank-dependent transformation separately to gains and losses, when they are present in a specific gamble, and then sums the two to form an overall evaluation of a gamble. More formally, suppose the possible outcomes for Gamble A are x_1 ≤ ··· ≤ x_k ≤ 0 ≤ x_{k+1} ≤ ··· ≤ x_n, which occur with probabilities p_1, ..., p_k, p_{k+1}, ..., p_n. According to cumulative prospect theory, the value of A is

u(A) = Σ_{i=1}^{k} π⁻_i · u(x_i) + Σ_{i=k+1}^{n} π⁺_i · u(x_i).   (4)

The decision weights π⁻_i and π⁺_i are found according to the following formulas:

π⁻_1 = w⁻(p_1),
π⁻_i = w⁻(p_1 + ··· + p_i) − w⁻(p_1 + ··· + p_{i−1}),   2 ≤ i ≤ k;

π⁺_n = w⁺(p_n),
π⁺_i = w⁺(p_i + ··· + p_n) − w⁺(p_{i+1} + ··· + p_n),   k + 1 ≤ i ≤ n.

The probability weighting functions w⁻ and w⁺ are the same inverse S-shaped functions shown in Figure 10.1, Panel A. In words, as mentioned earlier, according to cumulative prospect theory, gains and losses are evaluated separately. For losses, the cumulative probabilities (i.e., the probability of obtaining an outcome as bad or worse) are transformed via a probability weighting function w⁻. Then, the decision weights are estimated by taking the difference between the transformed cumulative probability of obtaining outcome i and of obtaining outcome i − 1. For gains, the decumulative probabilities (i.e., the probability of obtaining an outcome at least as good as a specific value) are first transformed, and then the decision weights are found via the difference between the transformed decumulative probability of obtaining outcome i and outcome i + 1.

In the end, the decision weights π⁻_i and π⁺_i can be understood as controlling the marginal contribution of each outcome to the overall value of the alternative. When the probability weighting functions w⁻ and w⁺ are identity functions, the decision weights are the probabilities associated with each outcome. Thus, expected utility theory (Eq. 2) is a special case of cumulative prospect theory (Eq. 4). The psychological intuition behind this generalization is that the impact an outcome has on the overall value of the risky alternative is a function of both the probability of the outcome occurring and its rank relative to the other possible outcomes (Diecidue & Wakker, 2001).

To see how cumulative prospect theory works, consider the gambles in the Allais paradox (see Table 10.2). The majority preference in Choice 1 implies

u(A) > u(B) ⟹ w⁺(1) · u($1M) > w⁺(0.10) · u($5M) + [w⁺(0.99) − w⁺(0.10)] · u($1M) + [w⁺(1.0) − w⁺(0.99)] · u(0).

First, note that the right-hand side of the inequality demonstrates that the decision weights sum to 1:

π⁺(0.10) + π⁺(0.89) + π⁺(0.01) = w⁺(0.10) + [w⁺(0.99) − w⁺(0.10)] + [w⁺(1.0) − w⁺(0.99)] = w⁺(1.0) = 1.0.
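The rank-dependent weighting scheme of Eq. 4 and the formulas above can be written as a short function. The sketch below is our illustration (not the chapter's code; it assumes w(0) = 0). Note that with the same w(p) = p² that made original prospect theory violate dominance in Table 10.5, the cumulative form ranks the dominant gamble higher, and its decision weights telescope to w(1) = 1.

```python
def cpt_value(gamble, u, w_plus, w_minus=None):
    """Cumulative prospect theory (Eq. 4) for a gamble given as (outcome, prob)
    pairs. Gains use decumulative probabilities; losses use cumulative ones."""
    w_minus = w_minus or w_plus
    gains  = sorted([(x, p) for x, p in gamble if x >= 0])   # ascending
    losses = sorted([(x, p) for x, p in gamble if x < 0])    # worst loss first
    total = 0.0
    tail = 0.0
    for x, p in reversed(gains):     # best gain first: pi = w(tail + p) - w(tail)
        total += (w_plus(tail + p) - w_plus(tail)) * u(x)
        tail += p
    head = 0.0
    for x, p in losses:              # worst loss first: pi = w(head + p) - w(head)
        total += (w_minus(head + p) - w_minus(head)) * u(x)
        head += p
    return total

w = lambda p: p**2   # same weighting function that broke original prospect theory
A = [(210, 0.4), (200, 0.4), (0, 0.2)]
B = [(200, 0.8), (0, 0.2)]
respects_dominance = cpt_value(A, lambda x: x, w) > cpt_value(B, lambda x: x, w)
```

With linear utility, Gamble A is now worth 0.16·210 + (0.64 − 0.16)·200 = 129.6 against 128 for B, so the dominant gamble wins regardless of the curvature of w.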

218 higher level cognition + + + w (1.0) − w (0.99) One of those accounts is a cognitive model of + decision-making called decision field theory (DFT; = w (1.0) = 1.0 Busemeyer & Townsend, 1993), which will be Thus, revealing how cumulative prospect theory described in the next section. and rank dependent utility theories obey stochastic Another account that successfully predicts spe- dominance. Even with this constraint, cumulative cific violations of stochastic dominance is a different prospect theory can account for the Allais paradox. generalization of expected utility theory from that It does so via the nonlinearity of the probability described earlier called the transfer of attention weighting function. To see how, the preceding exchange model (TAX; Birnbaum & Naverette, inequality can be rewritten so that 1998). The TAX model treats the utility of an ( ) + ( ) alternative as a weighted average of the utilities u $1M > w 0.10 of the possible outcomes. Similar, to cumulative ( ) + ( ) − + ( ) + + ( ) u $5M w 1.0 w 0.99 w 0.10 prospect theory, the weights are a function of both Choice 2 in the Allais paradox implies the probability of obtaining the specific outcome and the rank of that outcome, with lower valued > =⇒ + ( ) · ( ) u B u A w 0.10 u $5M outcomes having some of the weight transferred + > w (0.11) · u($1M). from higher ranked outcomes. For more details, see Birnbaum (2008b). The more important point is This inequality in turn implies that although cumulative prospect theory appears to + ( ) provide a fairly comprehensive account of a range of w (0.10) > u $1M + decisionmaking phenomena, there still is room for w (0.11) u($5M) improvement. Thus, according to the majority preferences in Choice 1 and Choice 2 of the paradox Cognitive Models of Decision Making + + ( ) Now, we review some cognitive models of w (0.10) > w 0.10 w+(0.11) w+ (1.0) − w+ (0.99) + w+ (0.10) human decision-making under risk and uncertainty. 
Rather than being grounded in compact sets of which in turn implies axioms, these models are developed from cognitive + + + + process theories. Consequently, the variety of mod- w (0.11) − w (0.10) < w (1.0) − w (0.99). els mirrors the variety of available approaches and In words, cumulative prospect theory accounts theories. One cognitive approach to understanding for the Allais paradox by allowing the change decision-making has been via choice heuristics. + + from w (0.99) to w (1.0) to exceed the change These accounts began primarily as verbal theories + + from w (0.10) to w (0.11), attributing the Allais (see Shah & Oppenheimer, 2008), but increasingly paradox and the certainty effect to the steepness of have been implemented as computational and/or the weighting function at p = 1. This property mathematical models. Examples of computational can obviously occur with the convex component implementation of heuristic choice theories are of the inverse S-shape of the probability weighting by Brandstätter, Gigerenzer, & Hertwig, 2006; function. Gigerenzer, Hertwig, & Pachur, 2011; Payne, Cumulative prospect theory has been tested Bettman, & Johnson, 1993; Thorngate, 1980; empirically in different fashions. One approach has examples of mathematical models are by Rieskamp been to explore the shapes of the utility and weight- (2008) and Tversky (1972). For discussion of ing functions (Gonzalez & Wu, 1999; Tversky & issues that arise in testing these models, see Kahneman, 1992). A second approach has been Brandstätter, Gigerenzer, & Hertwig (2008); Birn- to further test critical properties of the theory. baum (2008a); Johnson, Schulte-Mecklenbeck, & For tests of an independence type assumption that Willemsen (2008); Pachur, Hertwig, Geigerenzer, rank-dependent theories like cumulative prospect & Brandstätter (2013) and Rieger & Wang (2008). theory make, see Wakker, Erev, and Weber (1994). 
Dynamic decision models represent theories of A specific focus has been given to contexts in how decision processes unfold over time. Exam- which people tend to, in fact, violate stochastic ples include decision-by-sampling models (Stewart, dominance (Birnbaum, 2004, 2008c; Diederich & 2009; Stewart, Chater, & Brown, 2006) and Busemeyer, 1999). These violations are consistent more all-encompassing models based on complete with alternative descriptive accounts of choice. cognitive architectures such as ACT-R (e.g., Gonzalez
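The rank-dependent bookkeeping of Eq. 4 and the Allais derivation above can be sketched in a few lines of code. The inverse-S weighting form and the curvature value c = 0.61 follow Tversky and Kahneman (1992), while the steeply concave utility exponent is our own illustrative assumption; none of the numerical choices below come from the chapter itself.

```python
import math

def w_plus(p, c=0.61):
    """Inverse-S probability weighting function for gains,
    w+(p) = p^c / (p^c + (1 - p)^c)^(1/c).
    Form and curvature c = 0.61 follow Tversky & Kahneman (1992)."""
    if p <= 0.0:
        return 0.0
    return p ** c / (p ** c + (1.0 - p) ** c) ** (1.0 / c)

def cpt_value(outcomes, probs, u, c=0.61):
    """CPT value of a gains-only gamble (the pi+ sum of Eq. 4).
    Each decision weight is a difference of w+ at adjacent decumulative
    (rank-dependent) probabilities, so the weights sum to w+(1) = 1."""
    value, better = 0.0, 0.0  # 'better' = prob. of a strictly better outcome
    for x, p in sorted(zip(outcomes, probs), reverse=True):  # best first
        value += (w_plus(better + p, c) - w_plus(better, c)) * u(x)
        better += p
    return value

# Steeply concave utility; the exponent 0.2 is our illustrative assumption.
u = lambda x: (x / 1e6) ** 0.2

A1 = cpt_value([1_000_000], [1.0], u)                             # Choice 1, A
B1 = cpt_value([0, 1_000_000, 5_000_000], [0.01, 0.89, 0.10], u)  # Choice 1, B
A2 = cpt_value([0, 1_000_000], [0.89, 0.11], u)                   # Choice 2, A
B2 = cpt_value([0, 5_000_000], [0.90, 0.10], u)                   # Choice 2, B
```

With these assumed parameters, B1 < A1 while B2 > A2: the sure $1M wins in Choice 1 and the $5M gamble wins in Choice 2, precisely because w+ is much steeper near p = 1 than between 0.10 and 0.11.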

For the remainder of this chapter, we will focus on one specific dynamic decision model called decision field theory (DFT; Busemeyer & Townsend, 1993). We focus on this model for several reasons. First, as we will show shortly, DFT can be understood as a reinterpretation of the earlier expectation-based models (Eq. 2). Second, DFT goes beyond the expectation models and predicts distributions of response times, and it can be extended to account for other measures of preference, such as willingness to sell or buy. Finally, the DFT framework is arguably the most comprehensive dynamic framework for modeling choice, covering not only decisions under risk and uncertainty but also decisions under certainty.

Fig. 10.2 Sequential sampling process to describe the dynamic process of deliberation when deciding between two choice alternatives A and B. [Figure: preference P plotted against time t; the upper bound marks the criterion for choosing alternative A, the lower bound the criterion for choosing alternative B.] Each trajectory refers to the preference process of a single trial. Note that for one trial A is chosen, for another B is chosen, and for the third trial no decision criterion has yet been reached.

Decision Field Theory

Decision field theory (DFT) can be understood as a dynamic generalization of the constituents of the expectation framework (Eq. 2). To see this, we start with a binary choice problem similar to the one presented in Table 10.1, with each choice characterized by two attributes and possible consequences c: for example, winning c11 or losing c12 when choosing alternative A1, and winning c21 or losing c22 when choosing alternative A2. The subjective values of the consequences are denoted m(cij) = mij. Note that these subjective values are very similar to utility values. However, we use different notation to denote the fact that these values are not grounded in an axiomatic system.

The subjective probability pij assigned to an event associated with consequence cij, or more precisely, its probability weighting function w(pij), is interpreted psychologically as the amount of attention given to the possible consequence. Importantly, the attention weight is assumed to be a random variable and, moreover, a dynamic random variable W. The idea is that during the course of making a decision, the DM focuses attention on different attributes or events; the amount of attention during deliberation is indexed by W. This dynamic process of deliberation gives rise to a sequential sampling process according to which the subjective values of the consequences mij are accumulated sequentially over time until a preset criterion is reached and a response is initiated. Figure 10.2 shows the process for three trials. The graphs are called trajectories, and they reflect the randomness in the accumulation process toward the preset criteria for alternatives A and B.

More formally, we represent the momentary attention allocated to attribute j at time t as Wj(t). Attention is focused at time t on attribute 1 if W1(t) > W2(t); if one moment (t + τ) later W2(t + τ) > W1(t + τ), then attention is focused on attribute 2. It is easy to see that, with the Wj as random variables,

Ui(t) = W1(t) · mi1 + W2(t) · mi2 (5)

is the stochastic version of the classic weighted additive utility model (Eq. 2); that is, the attention weights are stochastic rather than deterministic, which illustrates how DFT can be understood as a dynamic generalization of the expectation models described earlier. The expected values of the attention weights over time, E[Wj(t)], correspond to the deterministic weights wj of the weighted additive model, and Equation 5 can be written as

Ui(t) = Σ_j wj(t) · mij + εj(t), (6)

where εj refers to the random component.

Alternative A1 is chosen over alternative A2 if the value (utility) for A1 is higher than the value for A2, that is, if the difference in value for alternatives A1 and A2, V = U1 − U2, is positive. If the difference is negative, alternative A2 is chosen. In DFT, the difference in value, called the valence, changes from moment to moment:

V(t) = U1(t) − U2(t).

A positive (negative) valence V(t) indicates an advantage (disadvantage) for alternative A1 under the current focus of attention.

The valence is integrated over time into a preference state P(t + τ).11 That is, the difference in utility is not static, as in the previous models, but evolves dynamically. This dynamic property allows DFT to account for response times in addition to choices.
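The deliberation process just described can be sketched as a small Monte Carlo simulation: attention switches stochastically between the two attributes, the valence is the momentary difference in subjective value plus noise, and the preference state accumulates the valence until a preset criterion ±θ is reached. All parameter values below are illustrative assumptions, not the chapter's.

```python
import random

def simulate_dft_trial(m, w1_prob, gamma=0.0, tau=1.0, theta=10.0,
                       noise=1.0, max_steps=10_000, rng=random):
    """One trial of binary DFT with an optional stopping rule.

    m[i][j]  : subjective value of alternative i on attribute j
    w1_prob  : probability that attention W(t) is on attribute 1 at any moment
    Returns (choice, decision_time_in_steps); choice is None on timeout."""
    p = 0.0  # preference state P(t)
    for t in range(1, max_steps + 1):
        j = 0 if rng.random() < w1_prob else 1                  # momentary attention
        valence = (m[0][j] - m[1][j]) + rng.gauss(0.0, noise)   # V(t) plus noise
        p = (1.0 - tau * gamma) * p + valence                   # P(t + tau) update
        if p >= theta:
            return "A1", t
        if p <= -theta:
            return "A2", t
    return None, max_steps
```

Repeating the trial many times yields choice proportions together with a full distribution of decision times, the two quantities DFT predicts jointly.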

It is also consistent with recent behavioral and neurological data suggesting that an evidence-accumulation process underlies decision-making (see, for example, Gold & Shadlen, 2007; Krajbich & Rangel, 2011; Liu & Pleskac, 2011).

DFT also specifies how memory and motivational processes can impact the preference state. It does so by weighting the previous preference state P(t) in the accumulation process by a constant γ, such that

P(t + τ) = (1 − τγ)P(t) + V(t + τ).

Thus, the new preference state P(t + τ) is a weighted combination of the previous state of preference and the new input valence. When γ = 0, the model is a standard drift diffusion process. When 0 < γτ < 1, the accumulation process slows down (relative to when γ = 0) because the increments to the preference state have less impact with increasing position in the preference state. This can be understood as a recency effect, avoidance behavior, or both. Technically, when 0 < γτ < 1, this process is an Ornstein-Uhlenbeck process with drift (e.g., Bhattacharya & Waymire, 1990). When γτ < 0, the stochastic process speeds up, producing primacy effects, approach behavior, or both.12 For more details on separating the memory and motivational components, see Busemeyer and Townsend (1993).

An important characteristic of any information-processing system is its stopping rule: How and when does one stop information search and make a choice (see also Townsend & Ashby, 1983)? Two stopping rules (and some combinations of them) have been proposed for dynamic systems of choice. One is a fixed stopping time, also called an externally controlled decision threshold, in which the time to make the decision is predetermined, for example, by a deadline. The other is an optional stopping time, also called an internally controlled decision threshold, in which the DM makes a choice when preference meets or exceeds the threshold, |P(t)| ≥ θ. Alternative A1 is chosen when P(t) ≥ θ, and alternative A2 is chosen when P(t) ≤ −θ. In this case, the decision time is a random variable, and thus DFT predicts choice proportions and response-time distributions.

Mathematical formulas for computing choice probabilities and distributions of response times have been derived (see Busemeyer & Townsend, 1992; Diederich & Busemeyer, 2003). The model can also be simulated with Monte Carlo methods to generate predictions. DFT has been applied to a variety of traditional decision-making problems, including decision-making under risk and uncertainty (Busemeyer & Townsend, 1993), selling prices, certainty equivalents, and preference reversal phenomena (Busemeyer & Goldstein, 1992; Johnson & Busemeyer, 2005). DFT has been extended to account for phenomena that arise in multi-attribute (Diederich, 1997) and multialternative (Roe, Busemeyer, & Townsend, 2001) choice problems, including changes in preference under time pressure (Diederich, 2003) and violations of stochastic dominance (Diederich & Busemeyer, 1999), shown next.

Example: Multi-attribute Decision Field Theory (Predicted and Observed Violations of Stochastic Dominance)

Most decision alternatives are not unidimensional, for example, purely monetary. Instead, most alternatives are made up of multiple, sometimes conflicting, dimensions. Keeney and Raiffa (1993) provided an axiomatic foundation for extending expected utility theories to these situations. Cognitive models like DFT can also be extended to these more realistic choice alternatives, and, in fact, doing so reveals new predictions about where rational models fail. For example, Diederich and Busemeyer (1999) showed an extension to multi-attribute alternatives that reveals systematic violations of stochastic dominance.

To illustrate the violation, consider the two hypothetical medical decisions in Table 10.6. Each decision offers a choice between two treatments whose outcomes depend on the state of the world that occurs. Focusing first on Situation 1, one can see that almost certainly, in terms of preference,

x1 = (190 days of recovery; $255,000 cost) ≺ x2 = (180 days of recovery; $250,000 cost) ≺ x3 = (14 days of recovery; $5,500 cost) ≺ x4 = (7 days of recovery; $5,000 cost).

We also know the probability of obtaining a consequence as good as xi or better for treatment A, GA, and for treatment B, GB. They are as follows:

GA(x1) = GB(x1) = 1.0, GA(x2) = 1.0 > GB(x2) = 0.5, GA(x3) = GB(x3) = 0.5, GA(x4) = 0.5 > GB(x4) = 0.
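The decumulative values GA and GB above follow mechanically from the outcome ranking and each treatment's state probabilities; a minimal sketch (outcome labels x1 through x4 as in the text):

```python
def decumulative(assignments, ranking):
    """G(x) = P(consequence at least as good as x) for each outcome x.
    assignments: (outcome_label, probability) pairs for one treatment;
    ranking: outcome labels ordered from worst to best."""
    rank = {x: i for i, x in enumerate(ranking)}
    return {x: sum(p for o, p in assignments if rank[o] >= rank[x])
            for x in ranking}

def stochastically_dominates(g_a, g_b):
    """A dominates B if G_A >= G_B everywhere and G_A > G_B somewhere."""
    return (all(g_a[x] >= g_b[x] for x in g_a)
            and any(g_a[x] > g_b[x] for x in g_a))

# Situation 1 of Table 10.6: each state occurs with probability 0.50.
ranking = ["x1", "x2", "x3", "x4"]                       # worst -> best
G_A = decumulative([("x4", 0.5), ("x2", 0.5)], ranking)  # treatment A
G_B = decumulative([("x1", 0.5), ("x3", 0.5)], ranking)  # treatment B
```

Because Situation 2 assigns treatment B the same two outcomes with the same probabilities, the same decumulative functions, and hence the same dominance relation, hold there as well.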

Thus, treatment A stochastically dominates treatment B. The same is true for Situation 2. Because the two situations are identical with respect to the probabilities of the preference-ordered list, cumulative prospect theory (Eq. 4) and other rank-dependent utility theories predict that in both cases treatment A should be chosen over B.

However, upon closer inspection, the situations seem psychologically quite different: Situation 1 appears more conflicted than Situation 2. That is because in Situation 1 the treatments are negatively correlated in terms of the outcomes that can be obtained for a given state. For example, for state S1, treatment A has low cost and a short recovery time, whereas treatment B has high cost and a long recovery time, and vice versa for state S2. This negative correlation between outcomes suggests that if DMs focus on a particular state to evaluate the two treatments, they will be conflicted between the two possible actions. In comparison, in Situation 2 the actions are positively correlated in terms of the outcomes that can be obtained for a given state, so that, for each Si, treatments A and B both have low (high) cost and short (long) recovery times. This implies that, if DMs focus on a given state to evaluate the possible treatments, the choice will be straightforward regardless of the state that occurs, and thus the DM should experience less conflict in this situation.

Multi-attribute DFT, in fact, instantiates this idea of conflict and predicts that stochastic dominance is perfectly satisfied in Situation 2 (the positively correlated, no-conflict condition), whereas violations are expected in Situation 1 (the negatively correlated, high-conflict condition). This is because in DFT the valence V(t) fluctuates as the DM shifts attention from moment to moment, switching back and forth between the states S1 and S2 and from the attribute "treatment cost" to the attribute "recovery duration" during the deliberation process. In Situation 1, the valence V(t) changes back and forth from positive to negative as attention switches from state S1, favoring alternative A, to state S2, favoring alternative B. In Situation 2, the valence is always such that it favors alternative A. An experiment with real consequences (money to win or lose, and the duration of a noisy sound) and with probabilities learned from experience showed that stochastic dominance was indeed violated in Situation 1 but not in Situation 2, as predicted by the model (Diederich & Busemeyer, 1999). Moreover, in the same experiment, response times for the choices were collected, and multi-attribute DFT was shown to account for the distributions of response times as well. This ability to provide an account not only of choice but also of response times demonstrates an added advantage of this dynamic approach to choice over and above static theories like prospect theory and cumulative prospect theory.

Decision Field Theory for Multialternative Choice Problems

DFT's grounding in basic cognitive processes also allows it to be extended to multi-attribute choice during decisions under certainty, thus providing an account not only of decisions made under risk and uncertainty but also of decisions under certainty. This extension also shows how DFT can be extended to model choices with more than two alternatives. To see how, we start with a binary choice situation with optional stopping times. The decision process for a binary choice stops when |P(t + τ)| ≥ θ, with Alternative 1 chosen if P(t + τ) ≥ θ and Alternative 2 chosen if P(t + τ) ≤ −θ. This assumption is equivalent to assuming that the accumulation process is represented by a two-dimensional diffusion process of the form (cf. Heath, 1981)

P1(t + τ) = [(1 − τγ1)P1(t) − τγ2P2(t)] + V1(t + τ)

for choosing Alternative 1 and

P2(t + τ) = [−τγ2P1(t) + (1 − τγ1)P2(t)] + V2(t + τ) (7)

for choosing Alternative 2. Each choice alternative is evaluated relative to the other one, and the valences for the two alternatives sum to V1 + V2 = 0. Accordingly, for a choice problem with n > 2 alternatives, the valence for alternative i is produced by contrasting the utility (Eq. 5) of alternative i against the average utility of the remaining n − 1 alternatives,

Vi(t) = Ui(t) − [Σ_{k≠i} Uk(t)] / (n − 1). (8)

The valence vector V can then be written as the product of three matrices,

V(t) = C M W(t),

where the elements of C are defined as cii = 1 and cij = −1/(n − 1) for i ≠ j. The matrix M contains the subjective values mij of alternative i on attribute j, and the matrix W(t) contains the momentary attention weights allocated to attribute j at time t.
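The matrix form V(t) = C M W(t) and Eq. 8 can be sketched with plain lists; the specific M and W values below are illustrative.

```python
def contrast_matrix(n):
    """C from the text: c_ii = 1 and c_ij = -1/(n-1) for i != j,
    so that (C u)_i = u_i minus the mean of the other utilities (Eq. 8)."""
    return [[1.0 if i == j else -1.0 / (n - 1) for j in range(n)]
            for i in range(n)]

def matvec(a, x):
    """Plain matrix-vector product."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in a]

def valences(m, w):
    """V(t) = C M W(t): utilities U = M W (Eq. 5), then contrasts via C."""
    u = matvec(m, w)
    return matvec(contrast_matrix(len(m)), u)

# Three alternatives on two attributes, attention currently on attribute 1.
M = [[3.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
V = valences(M, [1.0, 0.0])
```

For any attention state the valences sum to zero, the multialternative analogue of V1 + V2 = 0 in the binary case.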

Table 10.6. Two choice situations with different correlations between outcomes.

                 State S1 (p = 0.50)                   State S2 (p = 0.50)
Situation 1 (negatively correlated outcomes)
  Treatment A    $5,000 cost, 7 days of recovery       $250,000 cost, 180 days of recovery
  Treatment B    $255,000 cost, 190 days of recovery   $5,500 cost, 14 days of recovery
Situation 2 (positively correlated outcomes)
  Treatment A    $5,000 cost, 7 days of recovery       $250,000 cost, 180 days of recovery
  Treatment B    $5,500 cost, 14 days of recovery      $255,000 cost, 190 days of recovery

The preference state at time t + τ for all choice alternatives can be written as

P(t + τ) = S · P(t) + V(t + τ),

with S = [I − τΓ], where I is the identity matrix and Γ is the matrix containing the γ's. The diagonal elements of S provide, for instance, memory for previous states of the system (see earlier). The off-diagonal values allow for competitive interactions among competing alternatives that are a function of the distance between the alternatives in a psychological space (see Hotaling, Busemeyer, & Li, 2010). See Busemeyer and Diederich (2002) and Diederich and Busemeyer (2003) for derivations. The extended DFT adds theoretical power, as it reveals how the same deliberation system can account for multi-alternative context effects such as the similarity, compromise, and attraction effects.

Box 3. A Connectionist Interpretation of Decision Field Theory

DFT can be interpreted as a connectionist model with four layers (Roe et al., 2001). The first layer, the input, corresponds to the evaluations of each alternative on each attribute (M), feeding into a second layer via the attentional weights (W); activation is determined by Eq. 6. Note that, in this context, at any time step wj is either 1 or 0, depending on which attribute the decision maker attends to. The third layer computes the valences by comparing the weighted value for alternative i with the average of the other (n − 1) alternatives (C) via Eq. 8. Because the weights are random variables and fluctuate over time, the valences for the choice alternatives fluctuate as well. The fourth layer is a recursive network with positive self-recurrence for each choice alternative and negative lateral inhibition between choice alternatives (S); competition between choice alternatives takes place here. Figure 10.4 shows the architecture of this network. The preference state for alternative i from a set of n alternatives is computed according to (compare Equation 7)

Pi(t + τ) = sii · Pi(t) − Σ_{k≠i} sik · Pk(t) + Vi(t + τ), (9)

where sii reflects self-recurrence (the decay or leakage) and sik lateral inhibition. Note that the strength of the lateral-inhibition connection is a decreasing function of the dissimilarity (distance) between a pair of alternatives; see Hotaling, Busemeyer, and Li (2010) for a specification of the function.

There are now several different dynamic decision models that treat decision-making as a sequential sampling process. Another strong competitor to DFT is the leaky competing accumulator (LCA; Usher & McClelland, 2001, 2004). The model shares a number of similar assumptions with DFT (see Box 4).
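A single update of Eq. 9 can be sketched as follows. The Gaussian fall-off of lateral inhibition with distance and all parameter values are our illustrative assumptions; the text only requires that sik decrease with dissimilarity.

```python
import math

def feedback_matrix(positions, s_self=0.95, k_inh=0.1, d0=2.0):
    """S with self-recurrence s_ii on the diagonal and lateral-inhibition
    strengths s_ik that decrease with inter-alternative distance in the
    attribute space (illustrative Gaussian fall-off)."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    n = len(positions)
    return [[s_self if i == k else
             k_inh * math.exp(-(dist(positions[i], positions[k]) / d0) ** 2)
             for k in range(n)] for i in range(n)]

def dft_step(S, P, V):
    """Eq. 9: P_i(t+tau) = s_ii P_i(t) - sum_{k != i} s_ik P_k(t) + V_i(t+tau)."""
    n = len(P)
    return [S[i][i] * P[i]
            - sum(S[i][k] * P[k] for k in range(n) if k != i)
            + V[i]
            for i in range(n)]

# Alternatives A and B far apart, D placed close to A (hypothetical layout).
pos = [(1.0, 3.0), (3.0, 1.0), (0.9, 2.7)]
S = feedback_matrix(pos)
P1 = dft_step(S, [1.0, 1.0, -1.0], [0.0, 0.0, 0.0])
```

Starting from a negative preference state for the dominated alternative D placed near A, one step raises A relative to B: the inhibitory bolstering mechanism behind the attraction effect.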

Fig. 10.3 Choice alternatives characterized by two attributes. [Figure: alternatives plotted in a two-dimensional attribute space running from negative (−) to positive (+) on each axis; S and D lie near A, and C lies between A and B.]

Fig. 10.4 Connectionist network for decision field theory (layers: weights, contrasts, valences, preferences). The first layer represents the subjective values of the consequences, mij, for the i = A, B, C alternatives with the j = 1, 2 attributes. In the fourth layer, inhibition takes place, symbolized by the circle-headed lines.

More recently, three other models have been proposed that also share similar properties: the 2N-ary choice tree model for N-alternative preferential choice (Wollschläger & Diederich, 2012), the association accumulation model (Bhatia, 2013), and the multi-attribute linear ballistic accumulator (MLBA) model (Trueblood, Brown, & Heathcote, 2014). For details and a comparison of some of these models, see Marley and Regenwetter (in press).

Box 4. The Relationship Between the Leaky Competing Accumulator Model and Decision Field Theory

The leaky competing accumulator model (LCA; Usher & McClelland, 2001, 2004) shares many assumptions with DFT but differs from it on a few important points. The model is defined as a four-layered connectionist network as well. The first two layers of both models are identical, and the main differences between DFT and LCA are with respect to the third layer (i.e., the evaluation of alternatives) and the fourth layer (i.e., the dynamics of the choice process). Figure 10.5 shows the architecture of the network. Each of the choice alternatives serves as a reference point in evaluating each of the other alternatives, and at the third layer of LCA, the advantage/disadvantage of alternative i relative to all remaining alternatives j, j ≠ i, is

Ii = Σ_{j≠i} F(dij) + I0, (10)

where dij is the advantage or disadvantage of alternative i relative to alternative j with respect to the chosen attribute (dimension) (only one dimension is active at a time; compare wj = 1 or 0 in DFT), F is a nonlinear function accounting for loss aversion, and I0 is a positive constant promoting the alternatives in the choice process and preventing the Ii of inferior alternatives from becoming too negative. The function F is consistent with Tversky and Kahneman's (1991) reference-dependent model and Tversky and Simonson's (1993) context-advantage model, in which losses and disadvantages have greater impact on preferences than gains and advantages. In particular,

F(x) = log(1 + x) for x > 0,
F(x) = −[log(1 + |x|) + log(1 + |x|)²] for x < 0, (11)

with x > 0 for relative advantages and x < 0 for relative disadvantages. The fourth layer determines the response activation (the preference state in DFT) for each choice alternative following an iterative procedure:

Ai(t + 1) = λ Ai(t) + (1 − λ)[Ii(t) − β Σ_{j≠i} Aj(t) + ξi(t)],

where λ is a neural decay constant (the leakage), β is a global inhibition parameter, and ξ is a noise term.

LCA also accounts simultaneously for the multi-alternative context effects, that is, the similarity, compromise, and attraction effects shown as an example later. To see the similarities and differences between DFT and LCA, we give an example of choosing alternative A1 among three possible choice alternatives. For convenience, we use the DFT notation:

Under DFT,

P1(t + 1) = s11 P1(t) − s12 P2(t) − s13 P3(t) + V1(t + 1)
= (1 − γ11) P1(t) − (1 − γ12) P2(t) − (1 − γ13) P3(t) + U1(t + 1) − [U2(t + 1) + U3(t + 1)] / 2,

and under LCA,

P1(t + 1) = γ P1(t) + (1 − γ){V1(t) − β[P2(t) + P3(t)] + ε1(t)}
= γ P1(t) − (1 − γ)β P2(t) − (1 − γ)β P3(t) + (1 − γ){F[U1(t) − U2(t)] + F[U1(t) − U3(t)]} + (1 − γ)ε1(t).

As seen from the equations, the decay γ in DFT is local, separate for each choice alternative, whereas for LCA it is global, the same for all choice alternatives. In DFT, the choice alternatives are contrasted by determining the difference between the value of an alternative and the average value of the remaining alternatives with respect to the attribute under consideration. In LCA, advantages and disadvantages between choice alternatives are determined and transformed by a function accounting for risk attitudes. For a detailed comparison, see Busemeyer, Townsend, Diederich, and Barkan (2005) and Tsetsos, Usher, and Chater (2010).

The LCA model explains the compromise and attraction effects as a direct outcome of the loss-aversion advantage function. In both conditions, the asymmetric function (Eq. 11) penalizes the distant alternatives (B in the attraction condition and both A and B in the compromise condition). That is, the advantages of the more distant alternative decrease (Eq. 10), and alternative C, which has two shorter-distance alternatives, is preferred in the compromise condition. In the attraction condition, A is preferred over B, because B has two longer-distance alternatives (A and D), as opposed to one for A (a longer distance to B and a short one to D). The explanation of the similarity effect is similar to that of DFT: the positive correlation between the similar alternatives A and S leads to high activations for both, and therefore the choices between them are shared. For a formal implementation of the effects in the model and numerical examples, see Usher and McClelland (2004).

Fig. 10.5 Connectionist network for the leaky competing accumulator model (layers: weights, differences, preferences). The first layer represents the subjective values of the consequences, mij, for the i = A, B, C alternatives with the j = 1, 2 attributes. In the fourth layer, inhibition takes place, symbolized by the circle-headed lines.

Example: Context effects - similarity, attraction, and compromise

Accounting simultaneously for three specific context effects has become a criterion for evaluating multi-alternative choice models. Next, we briefly review the three effects and how DFT gives an account of all three. We do this to demonstrate an important attribute of DFT, and of cognitive models in general: they offer the possibility of providing a parsimonious account of a range of decision-making phenomena. The three effects (similarity, compromise, and attraction) refer to effects on choice probabilities produced by adding a third alternative to an existing choice set of two. Which effect occurs depends on the characteristics of the third alternative compared to the other two alternatives. Suppose that in a binary choice between A and B, the two alternatives, characterized by different levels of the same two attributes, are about equally attractive and are chosen equally frequently, Pr[A | {A,B}] = Pr[B | {A,B}].

The similarity effect is produced by adding an alternative, S, to the choice set that is similar to one of the two alternatives of the earlier choice set, say A. In this case, the probability of choosing A decreases because the new choice alternative S takes away shares from its neighbor A; the probability of choosing B remains unaffected, and Pr[B | {A,B,S}] > Pr[A | {A,B,S}]. This result violates the principle of independence from irrelevant alternatives (see Tversky, 1972, for a discussion).

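The LCA recursion in Box 4 can be sketched directly from Eqs. 10 and 11; the parameter values and the three utility values below are our illustrative assumptions.

```python
import math, random

def F(x):
    """Asymmetric advantage function of Eq. 11: losses and disadvantages
    loom larger than equivalent gains and advantages."""
    if x >= 0:
        return math.log(1.0 + x)
    a = math.log(1.0 + abs(x))
    return -(a + a * a)

def lca_step(A, utils, lam=0.9, beta=0.2, I0=0.5, sigma=0.1, rng=random):
    """One LCA iteration (Box 4): A_i(t+1) = lam*A_i(t)
    + (1-lam)*(I_i(t) - beta*sum_{j != i} A_j(t) + noise), with
    I_i = sum_{j != i} F(d_ij) + I0 (Eq. 10); 'utils' are the values on the
    currently attended dimension, so d_ij = utils[i] - utils[j]."""
    n = len(A)
    new_A = []
    for i in range(n):
        I_i = sum(F(utils[i] - utils[j]) for j in range(n) if j != i) + I0
        inhib = beta * sum(A[j] for j in range(n) if j != i)
        new_A.append(lam * A[i] + (1.0 - lam) * (I_i - inhib + rng.gauss(0.0, sigma)))
    return new_A
```

Because F penalizes disadvantages more than it rewards advantages, in a run like this the lowest-valued alternative is driven strongly negative while I0 keeps moderate alternatives in play.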
The attraction effect is produced by adding an alternative, D, to the choice set that is similar to but dominated by one of the two alternatives of the earlier choice set, say A. In this case, the probability of choosing A increases, Pr[A | {A,B,D}] > Pr[A | {A,B}] (Huber, Payne, & Puto, 1982), violating the principle of regularity, which asserts that the probability of choosing an alternative from a given set cannot be increased by enlarging the offered set (Tversky, 1972, p. 289).

The compromise effect is produced by adding to the choice set an alternative, C, midway between alternatives A and B. Suppose that all the binary choice probabilities are equal, so that Pr[A | {A,B}] = Pr[A | {A,C}] = Pr[B | {B,C}] = 0.50. When presenting all three alternatives together, the probability of choosing C increases such that Pr[C | {A,B,C}] > Pr[A | {A,B,C}] = Pr[B | {A,B,C}]. This result also violates the principle of independence from irrelevant alternatives (Simonson, 1989).

Elimination by aspects (Tversky, 1972) can account for the similarity effect, and the context-dependent preference model (Tversky & Simonson, 1993) can account for the attraction and compromise effects, but neither of these models can explain all three effects simultaneously. In contrast, both the DFT and the LCA models can handle all three effects. To see how they accomplish this feat, it is helpful to consider the alternatives presented in a two-dimensional space with negative (−) to positive (+) attribute values on each attribute (Figure 10.3). Each choice alternative is represented as a point in this two-dimensional space; a minus sign (−) indicates negative values and a plus sign (+) indicates positive values for the attribute level. Alternatives S and D are similar to alternative A, with A dominating D but not S.

For the similarity effect, consider the choice alternatives A, B, and S in Figure 10.3. Multialternative DFT explains the effect as follows. Whenever attention is focused on attribute 1, alternative B gets an advantage and both alternatives A and S get disadvantages; whenever attention is focused on attribute 2, both alternatives A and S get advantages and alternative B gets a disadvantage. The valences of A and S are thus positively correlated with each other and negatively correlated with the valence of B. When focusing attention more on attribute 1, alternative B is chosen; when focusing more on attribute 2, alternative A or S is chosen. That is, adding S to the choice set takes choices away from A but not from alternative B.

For the attraction effect, consider the choice alternatives A, B, and D in Figure 10.3. Comparison of the dominated alternative D with the average of the other two alternatives (Eq. 8) results in a negative valence, VD(t), and eventually, during deliberation, in a negative preference state PD(t) for D (Eq. 9). The negative preference state PD(t) feeds through a negative inhibitory connection to the similar but dominant alternative, A, producing a positive (bolstering) effect on it. Thus, the dominant alternative A looks better in light of the dominated alternative D. The effect does not work for alternative B, because B is too far away from D and, therefore, the lateral inhibition is not strong enough.

Although the attention-switching process is mainly responsible for producing the similarity effect and the lateral-inhibition process is predominantly responsible for producing the attraction effect, they nevertheless operate in synchrony to generate both effects. This interaction is essential for producing the compromise effect. If attention is focused on some irrelevant features favoring the compromise alternative C, lateral inhibition is sent to alternatives A and B, decreasing their strength and simultaneously increasing the strength of C. For a formal implementation of the effects in the model and for numerical examples, see Roe et al. (2001).

Final Thoughts

It is clear from this review that modern behavioral decision theory has roots in both rational choice theory and cognitive psychology. Rational choice theory, in its axiomatic form, provides a set of normative principles against which to evaluate actual behavior. Descriptive failures of axiomatic choice theory, specifically of EUT and SEUT, resulted in their being generalized to other models, such as prospect and cumulative prospect theory, rank- and sign-dependent utility theories, and TAX. We barely touched on stochastic versions of these theories, but see the review by Marley and Regenwetter (in press). However, even these generalizations are insufficient to fully understand the descriptive failures of the various normative theories and, arguably more importantly, to understand how observed choice behavior arises. Models, especially dynamic stochastic models, that apply what is known about cognition to human choice behavior address this gap and go beyond it to examine other variables, such as choice time and confidence ratings (Pleskac & Busemeyer, 2010), in addition to choice probabilities.

confidence ratings (Pleskac & Busemeyer, 2010), in addition to choice probabilities.

Acknowledgments
We thank Shuli Yu for her detailed comments on an earlier draft. We express our gratitude to Mitchell Uitvlugt and Kyle Bort for their editorial assistance in preparing this chapter. A grant from the National Science Foundation (0955410) supported TJP while writing this chapter.

Notes
1. Terms that may be unfamiliar to readers new to this field are italicized the first time we introduce them and are defined in the glossary.
2. We recognize that not all cognitive models are formal in nature. But in this chapter, we consider only those that are.
3. Christiaan Huygens (1629–1695) is responsible for using the word expectation when speaking of the value of a gamble (Hacking, 1975). Huygens, in fact, wrote the first text of probability theory, developing a set of axioms for the science of mathematical expectation with the sole purpose of finding the fair price of a gamble (David, 1962).
4. A particularly readable description of the axioms can be found in Chapter 2 of Luce and Raiffa (1957) or in Kreps (1988).
5. A positive linear transformation is of the form u′(xᵢ) = a + b · u(xᵢ) with b > 0. Commonly, the value of a is fixed at 0 by setting u′(0) = u(0) = 0, but that is not necessary.
6. See uncertain situations in the glossary.
7. Only patterns of choices, never an individual choice, can be considered rational or irrational within the framework of EUT or SEUT. For more on this point, see Wakker (2010).
8. The symbol ≻ denotes empirical preference, in contrast to the symbol >, which denotes numerical order, i.e., larger than. Thus, A ≻ B is read as "A is preferred to B."
9. The stronger the violation of the axiom, i.e., the bigger the effect, the less problematic is the presence of an error term. Thus, for example, the stochastic nature of choice does not hinder the demonstration of the Allais paradox in most cases.
10. Note Kahneman and Tversky (1979) used the term value function instead of utility function, presumably to highlight the different psychological properties, such as reference dependence, of the two functions. However, to maintain consistency across this chapter, and echoing Wakker (2010), we will refer to it as a utility function.
11. The parameter τ is an arbitrary small time unit. The preference state approximates a diffusion process as τ approaches zero.
12. Strictly speaking, when γτ < 0, this process can be unstable in the long run and is not an Ornstein–Uhlenbeck process with drift.

Glossary
Axiom: A proposition or principle that is assumed without proof and taken as self-evident.
Axiomatic model: A model that derives necessary consequences from a set of independent and non-conflicting axioms. (See model in this glossary.)
Choice: See decision in this glossary.
Cognitive architecture: A unified theory of cognition underlying intelligent behavior, often implemented as a computational model that provides an integrated account of the structure of the mind.
Cognitive model: A formal model of one or more basic cognitive processes that describes the flow of information when accomplishing a task.
Computational model: A formal model applying computer algorithms and programs, implementing them on a computer and performing computer simulations to derive the predictions of the model and also to generate data. (See model in this glossary.)
Connectionist model: A model that assumes neural systems pass activation among simple, interconnected processing units. These models are also referred to as neural network models.
Decision: A discrete response that explicitly identifies an action selected from a larger set of possible actions, each of which has consequences in the form of an outcome or set of outcomes.
Deterministic model: A model that has no randomness in its representation, so that it always produces the same output from a given set of inputs.
Dynamic model: A model of the time-dependent changes in the state of the system.
Empirical law: An observed regularity between independent and dependent variables.
Expected utility theory (EUT): A special case of rational choice models. See rational choice models.
Formal model: A model that uses mathematical and statistical methods, formal logic, or computer simulation to represent a phenomenon, situation, or system in the real world. (See model in this glossary.)
Heuristics: Simple procedures or rules of thumb that provide approximate solutions to a problem.
Independence axiom: A principle that adding identical outcomes to choice alternatives does not alter the direction of preference between the alternatives.
Latent construct: A theoretical dimension or variable that affects behavior but cannot be observed directly and can be measured via a formal model that links the dimension to observable behavior. Utility and subjective probability are two important latent constructs in decision theory.
Mathematical model: A formal model taking on various structures, such as algebraic or geometric, and various forms, such as axiomatic and systems of equations.
Model: An abstraction representing some aspects of a phenomenon, situation, or system in the real world.
Preference: An ordered relation between two or more outcomes or choice alternatives in terms of their desirability or value.
Probability weighting function: A nonlinear transformation of probabilities into decision weights capturing the impact of probabilities on decisions made under risk and uncertainty.
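The invariance claimed in note 5 is easy to verify numerically: transforming a utility function by u′(x) = a + b·u(x) with b > 0 leaves the expected-utility preference order unchanged. The gambles and the utility function below are invented for illustration, not taken from the chapter.

```python
# Illustrative check that expected-utility preferences are invariant
# under a positive linear transformation u'(x) = a + b*u(x), b > 0.
# Gambles are lists of (outcome, probability) pairs; values are made up.

def expected_utility(gamble, u):
    return sum(p * u(x) for x, p in gamble)

u = lambda x: x ** 0.5            # an arbitrary concave utility function
u_prime = lambda x: 3 + 2 * u(x)  # positive linear transformation (a=3, b=2)

A = [(100, 0.5), (0, 0.5)]   # hypothetical gamble A
B = [(40, 1.0)]              # hypothetical sure thing B

prefers_A = expected_utility(A, u) > expected_utility(B, u)
prefers_A_transformed = expected_utility(A, u_prime) > expected_utility(B, u_prime)
print(prefers_A, prefers_A_transformed)  # → False False (order agrees)
```

Because the probabilities in each gamble sum to one, the transformed expected utility is just a + b times the original, so comparisons between gambles are preserved for any b > 0.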

models of decision making under risk and uncertainty

Probabilistic model: A model that incorporates randomness in its representation of a complex phenomenon or situation, so that inputs result in probability distributions over outputs or actions. Outputs are not deterministic. (See also stochastic model.)
Prospect theory: A generalization of expected utility theory in which the utilities are over gains or losses relative to a frame of reference, usually the status quo, and the probabilities map into nonadditive decision weights.
Rank- and sign-dependent choice models: Generalizations of expectation-based models, such as expected utility, subjective expected utility, and prospect theories, in which decision weights are associated with decumulative probability distributions and, moreover, the weights are different for losses than for gains relative to a frame of reference, usually the status quo.
Rational choice models: Models derived from axioms sufficient to prescribe patterns of choices such that, if satisfied, a utility value, unique up to a positive linear transformation, can be assigned to each outcome, and the decision maker can be said to have chosen so as to maximize expected utility (when the probability distribution over states is known) or subjectively expected utility (when it is not known).
Response time: Duration between a specified start point, usually the onset of the stimulus, and an observer's response; sometimes referred to as latency.
Risky situation: A situation in which the probabilities of different events occurring are known and are not 0 or 1.
Static model: A time-insensitive model of a system.
Stochastic: The property of a system that is nondeterministic, so that its state at any given time is determined probabilistically.
Stochastic dominance: A relation between nonidentical gambles such that the probability distribution over outcomes of one gamble is consistently at least as favorable as that of the other gamble. Formally, A stochastically dominates B iff F(X|A) ≥ F(X|B) for all outcome values X, where F denotes the decumulative probability distribution.
Subjective probability: This term has different, but related, meanings in rational and behavioral decision models. In rational choice models, subjective probability refers to the belief structure of a rational choice agent. In behavioral decision models, subjective probability refers to an individual's underlying belief structure. If that belief structure is coherent, that is, conforms to the axioms of probability theory, then it can be considered rational within the framework of rational choice theory.
Subjectively expected utility theory (SEUT): A special case of rational choice models. See that entry.
Transitivity axiom: A principle of most models of decision-making that states that when option x is preferred to y and y is preferred to z, then x must be preferred to z.
Uncertain situation: A situation in which the probabilities of different events occurring are not known or not given.
Utility: The subjective value of outcomes obtained from choices or actions. In axiomatic choice models, utilities are inferred from the pattern of choices.
Utility function: A function that maps objective into subjective values within the context of EUT and SEUT.
Value function: A function that maps objective into subjective changes in values relative to the status quo (or other frame of reference) within the context of prospect theory. In this chapter, following Wakker (2010), we use the term utility function to serve that purpose. See footnote 10.

References
Allais, M. (1953). Le comportement de l'homme rationnel devant le risque: Critique des postulats et axiomes de l'école Américaine. Econometrica, 21(4), 503–546. doi:10.2307/1907921
Allais, M. (1979). The so-called Allais paradox and rational decisions under uncertainty. In M. Allais & O. Hagen (Eds.), Expected utility hypotheses and the Allais paradox: Contemporary discussions of decisions under uncertainty with Allais' rejoinder (pp. 437–681). Dordrecht, Netherlands: D. Reidel.
Barron, G., & Erev, I. (2003). Small feedback-based decisions and their limited correspondence to description-based decisions. Journal of Behavioral Decision Making, 16(3), 215–233. doi:10.1002/bdm.443
Battalio, R. C., Kagel, J. H., & Jiranyakul, K. (1990). Testing between alternative models of choice under uncertainty: Some initial results. Journal of Risk and Uncertainty, 3(1), 25–50. doi:10.1007/bf00213259
Bernoulli, D. (1954/1738). Exposition of a new theory on the measurement of risk. Econometrica, 22(1), 23–36. doi:10.2307/1909829
Bhatia, S. (2013). Associations and the accumulation of preference. Psychological Review, 120(3), 522–543. doi:10.1037/a0032457
Bhattacharya, R. N., & Waymire, E. C. (1990). Stochastic processes with applications. New York, NY: Wiley.
Birnbaum, M. H. (2004). Causes of Allais common consequence paradoxes: An experimental dissection. Journal of Mathematical Psychology, 48(2), 87–106. doi:10.1016/j.jmp.2004.01.001
Birnbaum, M. H. (2008a). Evaluation of the priority heuristic as a descriptive model of risky decision making: Comment on Brandstätter, Gigerenzer, and Hertwig (2006). Psychological Review, 115(1), 253–260. doi:10.1037/0033-295x.115.1.253
Birnbaum, M. H. (2008b). New paradoxes of risky decision making. Psychological Review, 115(2), 463–501. doi:10.1037/0033-295X.115.2.463
Birnbaum, M. H. (2008c). New tests of cumulative prospect theory and the priority heuristic: Probability-outcome tradeoff with branch splitting. Judgment and Decision Making Journal, 3(4), 304–316.
Birnbaum, M. H. (2011). Testing theories of risky decision making via critical tests. Frontiers in Psychology, 2(315), 1–3. doi:10.3389/fpsyg.2011.00315
Birnbaum, M. H., & Navarrete, J. B. (1998). Testing descriptive utility theories: Violations of stochastic dominance and

cumulative independence. Journal of Risk and Uncertainty, 17(1), 49–79. doi:10.1023/a:1007739200913
Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making choices without trade-offs. Psychological Review, 113(2), 409–432. doi:10.1037/0033-295X.113.2.409
Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2008). Postscript: Rejoinder to Johnson et al. (2008) and Birnbaum (2008). Psychological Review, 115(1), 289–290. doi:10.1037/0033-295X.115.1.289
Budescu, D. V., & Wallsten, T. S. (1987). Subjective estimation of precise and vague uncertainties. In G. Wright & P. Ayton (Eds.), Judgmental forecasting (pp. 63–81). Chichester, England: Wiley.
Busemeyer, J. R., & Diederich, A. (2002). Survey of decision field theory. Mathematical Social Sciences, 43(3), 345–370. doi:10.1016/S0165-4896(02)00016-1
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: SAGE.
Busemeyer, J. R., & Goldstein, W. M. (1992). Linking together different measures of preference: A dynamic model of matching derived from decision field theory. Organizational Behavior and Human Decision Processes, 52(3), 370–396. doi:10.1016/0749-5978(92)90026-4
Busemeyer, J. R., & Townsend, J. T. (1992). Fundamental derivations from decision field theory. Mathematical Social Sciences, 23, 255–282. doi:10.1016/0165-4896(92)90043-5
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100(3), 432–459. doi:10.1037//0033-295X.100.3.432
Busemeyer, J. R., Townsend, J. T., Diederich, A., & Barkan, R. (2005). Contrast effects or loss aversion? Comment on M. Usher and J. L. McClelland's (2004) "Loss aversion and inhibition in dynamical models of multi-alternative choice." Psychological Review, 112(1), 253–255. doi:10.1037/0033-295X.112.1.253
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.
David, F. N. (1962). Games, gods, and gambling: The origins and history of probability and statistical ideas from the earliest times to the Newtonian era. London, England: Griffin.
Diecidue, E., & Wakker, P. P. (2001). On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3), 281–298. doi:10.1023/a:1011877808366
Diederich, A. (1997). Dynamic stochastic models for decision making under time constraints. Journal of Mathematical Psychology, 41(3), 260–274. doi:10.1006/jmps.1997.1167
Diederich, A. (2003). MDFT account of decision making under time pressure. Psychonomic Bulletin & Review, 10(1), 157–166. doi:10.3758/bf03196480
Diederich, A., & Busemeyer, J. R. (1999). Conflict and the stochastic-dominance principle of decision making. Psychological Science, 10(4), 353–359. doi:10.1111/1467-9280.00167
Diederich, A., & Busemeyer, J. R. (2003). Simple matrix methods for analyzing diffusion models of choice probability, choice response time, and simple response time. Journal of Mathematical Psychology, 47(3), 304–322. doi:10.1016/s0022-2496(03)00003-8
Edwards, W. (1954). The theory of decision making. Psychological Bulletin, 51(4), 380–417. doi:10.1037/h0053870
Ellsberg, D. (1961). Risk, ambiguity, and the Savage axioms. Quarterly Journal of Economics, 75(4), 643–669. doi:10.2307/1884324
Ert, E., & Erev, I. (2008). The rejection of attractive gambles, loss aversion, and the lemon avoidance heuristic. Journal of Economic Psychology, 29(5), 715–723. doi:10.1016/j.joep.2007.06.003
Fishburn, P. C. (1978). On Handa's "New theory of cardinal utility" and the maximization of expected return. Journal of Political Economy, 86(2), 321–324. doi:10.1086/260670
Fishburn, P. C., & Kochenberger, G. A. (1979). Two-piece von Neumann–Morgenstern utility functions. Decision Sciences, 10(4), 503–518. doi:10.1111/j.1540-5915.1979.tb00043.x
Fox, C. R., & Hadar, L. (2006). "Decisions from experience" = sampling error plus prospect theory: Reconsidering Hertwig, Barron, Weber & Erev (2004). Judgment and Decision Making Journal, 1(2), 159–161.
Fox, C. R., & Tversky, A. (1998). A belief-based account of decision under uncertainty. Management Science, 44(7), 879. doi:10.1287/mnsc.44.7.879
Gigerenzer, G., Hertwig, R., & Pachur, T. (2011). Heuristics: The foundations of adaptive behavior. Oxford, England: Oxford University Press.
Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annual Review of Neuroscience, 30, 535–574. doi:10.1146/annurev.neuro.29.051605.113038
Gonzalez, C., & Dutt, V. (2011). Instance-based learning: Integrating sampling and repeated decisions from experience. Psychological Review, 118(4), 523–551. doi:10.1037/A0024558
Gonzalez, R., & Wu, G. (1999). On the shape of the probability weighting function. Cognitive Psychology, 38(1), 129–166. doi:10.1006/cogp.1998.0710
Hacking, I. (1975). Emergence of probability: A philosophical study of early ideas about probability, induction, and statistical inference. Cambridge, England: Cambridge University Press.
Hau, R., Pleskac, T. J., & Hertwig, R. (2010). Decisions from experience and statistical probabilities: Why they trigger different choices than a priori probabilities. Journal of Behavioral Decision Making, 23(1), 48–68. doi:10.1002/bdm.665
Heath, R. A. (1981). A tandem random walk model for psychological discrimination. British Journal of Mathematical

and Statistical Psychology, 34, 79–92. doi:10.1111/j.2044-8317.1981.tb00619.x
Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect of rare events in risky choice. Psychological Science, 15(8), 534–539. doi:10.1111/j.0956-7976.2004.00715.x
Hertwig, R., & Erev, I. (2009). The description-experience gap in risky choice. Trends in Cognitive Sciences, 13(12), 517–523. doi:10.1016/j.tics.2009.09.004
Hotaling, J. M., Busemeyer, J. R., & Li, J. (2010). Theoretical developments in decision field theory: Comment on Tsetsos, Usher, and Chater (2010). Psychological Review, 117(4), 1294–1298. doi:10.1037/a0020401
Huber, J., Payne, J. W., & Puto, C. (1982). Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, 9(1), 90–98. doi:10.1086/208899
Iverson, G., & Falmagne, J.-C. (1985). Statistical issues in measurement. Mathematical Social Sciences, 10(2), 131–153. doi:10.1016/0165-4896(85)90031-9
Johnson, E. J., Schulte-Mecklenbeck, M., & Willemsen, M. C. (2008). Process models deserve process data: Comment on Brandstätter, Gigerenzer, and Hertwig (2006). Psychological Review, 115(1), 263–272. doi:10.1037/0033-295x.115.1.263
Johnson, J. G., & Busemeyer, J. R. (2005). A dynamic, stochastic, computational model of preference reversal phenomena. Psychological Review, 112(4), 841–861. doi:10.1037/0033-295X.112.4.841
Kahneman, D., Knetsch, J. L., & Thaler, R. H. (1990). Experimental tests of the endowment effect and the Coase theorem. Journal of Political Economy, 1325–1348. doi:10.1086/261737
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263–292. doi:10.2307/1914185
Keeney, R. L., & Raiffa, H. (1993). Decisions with multiple objectives: Preferences and value tradeoffs. Cambridge, England: Cambridge University Press.
Knight, F. H. (1921). Risk, uncertainty, and profit. New York, NY: Sentry Press.
Krajbich, I., & Rangel, A. (2011). Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences of the United States of America, 108(33), 13852–13857. doi:10.1073/pnas.1101328108
Krantz, D. H. (1972). Measurement structures and psychological laws. Science, 175(4029), 1427–1435. doi:10.1126/science.175.4029.1427
Kreps, D. M. (1988). Notes on the theory of choice. Boulder, CO: Westview Press.
Liu, T., & Pleskac, T. J. (2011). Neural correlates of evidence accumulation in a perceptual decision task. Journal of Neurophysiology, 106, 2383–2398. doi:10.1152/jn.00413.2011
Lopes, L. L. (1984). Risk and distributional inequality. Journal of Experimental Psychology: Human Perception and Performance, 10(4), 465–485. doi:10.1037/0096-1523.10.4.465
Luce, R. D. (1995). Four tensions concerning mathematical modeling in psychology. Annual Review of Psychology, 46, 1–27. doi:10.1146/annurev.ps.46.020195.000245
Luce, R. D. (2000). Utility of gains and losses: Measurement-theoretical and experimental approaches. Mahwah, NJ: Erlbaum.
Luce, R. D., & Raiffa, H. (1957). Games and decisions: Introduction and critical survey. New York, NY: Wiley.
MacCrimmon, K. R., & Larsson, S. (1979). Utility theory: Axioms versus "paradoxes." In M. Allais & O. Hagen (Eds.), Expected utility hypotheses and the Allais paradox: Contemporary discussions of decisions under uncertainty with Allais' rejoinder (pp. 333–409). Dordrecht, Netherlands: D. Reidel.
Marley, A. A. J., & Regenwetter, M. (in press). Choice, preference, and utility: Probabilistic and deterministic representations. In W. Batchelder, H. Colonius, E. Dzhafarov, & J. Myung (Eds.), The new handbook of mathematical psychology. Cambridge, UK: Cambridge University Press.
Masin, S. C., Zudini, V., & Antonelli, M. (2009). Early alternative derivations of Fechner's law. Journal of the History of the Behavioral Sciences, 45(1), 56–65. doi:10.1002/jhbs.20349
Mellers, B., Weiss, R., & Birnbaum, M. (1992). Violations of dominance in pricing judgments. Journal of Risk and Uncertainty, 5(1), 73–90. doi:10.1007/bf00208788
Mellers, B. A., Schwartz, A., & Cooke, A. D. J. (1998). Judgment and decision making. Annual Review of Psychology, 49, 447–477. doi:10.1146/annurev.psych.49.1.447
Morrison, H. W. (1963). Testable conditions for triads of paired comparison choices. Psychometrika, 28(4), 369–390. doi:10.1007/bf02289558
Myung, J., Karabatsos, G., & Iverson, G. (2005). A Bayesian approach to testing decision making axioms. Journal of Mathematical Psychology, 49, 205–225.
Pachur, T., Hertwig, R., Gigerenzer, G., & Brandstätter, E. (2013). Testing process predictions of models of risky choice: A quantitative model comparison approach. Frontiers in Psychology, 4. doi:10.3389/fpsyg.2013.00646
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1993). The adaptive decision maker. New York, NY: Cambridge University Press.
Pleskac, T. J. (in press). Decision making and learning. In G. Keren & G. Wu (Eds.), Handbook of judgment and decision making. Hoboken, NJ: Wiley-Blackwell.
Pleskac, T. J., & Busemeyer, J. R. (2010). Two-stage dynamic signal detection: A theory of choice, decision time, and confidence. Psychological Review, 117(3), 864–901. doi:10.1037/a0019737
Preston, M. G., & Baratta, P. (1948). An experimental study of the auction-value of an uncertain outcome. American Journal of Psychology, 61(2), 183–193. doi:10.2307/1416964
Quiggin, J. (1982). A theory of anticipated utility. Journal of Economic Behavior & Organization, 3(4), 323–343. doi:10.1016/0167-2681(82)90008-7
Rapoport, A., & Wallsten, T. S. (1972). Individual decision behavior. Annual Review of Psychology, 23, 131–176. doi:10.1146/annurev.ps.23.020172.001023
Regenwetter, M., Dana, J., & Davis-Stober, C. P. (2010). Testing transitivity of preferences on two-alternative forced

choice data. Frontiers in Psychology, 1(148), 1–15. doi:10.3389/fpsyg.2010.00148
Regenwetter, M., Dana, J., & Davis-Stober, C. P. (2011). Transitivity of preferences. Psychological Review, 118(1), 42–56. doi:10.1037/a0021150
Rieger, M. O., & Wang, M. (2008). What is behind the priority heuristic? A mathematical analysis and comment on Brandstätter, Gigerenzer, and Hertwig (2006). Psychological Review, 115(1), 274–280. doi:10.1037/0033-295x.115.1.274
Rieskamp, J. (2008). The probabilistic nature of preferential choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(6), 1446–1465. doi:10.1037/A0013646
Roe, R. M., Busemeyer, J. R., & Townsend, J. T. (2001). Multialternative decision field theory: A dynamic connectionist model of decision making. Psychological Review, 108(2), 370–392. doi:10.1037//0033-295X.108.2.370
Savage, L. J. (1954). The foundations of statistics. New York, NY: Wiley.
Shafir, E., & LeBoeuf, R. A. (2002). Rationality. Annual Review of Psychology, 53, 491–517. doi:10.1146/annurev.psych.53.100901.B2513
Shah, A. K., & Oppenheimer, D. M. (2008). Heuristics made easy: An effort-reduction framework. Psychological Bulletin, 134(2), 207–222. doi:10.1037/0033-2909.134.2.207
Simonson, I. (1989). Choice based on reasons: The case of attraction and compromise effects. Journal of Consumer Research, 16(2), 158–174. doi:10.1086/209205
Stewart, N. (2009). Decision by sampling: The role of the decision environment in risky choice. Quarterly Journal of Experimental Psychology, 62(6), 1041–1062. doi:10.1080/17470210902747112
Stewart, N., Chater, N., & Brown, G. D. A. (2006). Decision by sampling. Cognitive Psychology, 53(1), 1–26. doi:10.1016/j.cogpsych.2005.10.003
Thorngate, W. (1980). Efficient decision heuristics. Behavioral Science, 25, 219–225. doi:10.1002/bs.38302506
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. doi:10.1037/h0070288
Todhunter, I. (1865). A history of the mathematical theory of probability from the time of Pascal to that of Laplace. Cambridge and London, England: Macmillan.
Tom, S. M., Fox, C. R., Trepel, C., & Poldrack, R. A. (2007). The neural basis of loss aversion in decision-making under risk. Science, 315(5811), 515–518. doi:10.1126/science.1134239
Townsend, J. T., & Ashby, F. G. (1983). Stochastic modeling of elementary psychological processes. New York, NY: Cambridge University Press.
Trueblood, J. S., Brown, S., & Heathcote, A. (2014). The multi-attribute linear ballistic accumulator model of context effects in multi-alternative choice. Psychological Review, 121, 179–205. doi:10.1037/a0030137
Tsetsos, K., Usher, M., & Chater, N. (2010). Preference reversal in multiattribute choice. Psychological Review, 117(4), 1275–1291. doi:10.1037/a0020580
Tversky, A. (1969). Intransitivity of preferences. Psychological Review, 76(1), 31–48. doi:10.1037/h0026750
Tversky, A. (1972). Elimination by aspects: A theory of choice. Psychological Review, 79(4), 281–299. doi:10.1037/h0032955
Tversky, A., & Fox, C. R. (1995). Weighing risk and uncertainty. Psychological Review, 102(2), 269–283. doi:10.1037//0033-295X.102.2.269
Tversky, A., & Kahneman, D. (1986). Rational choice and the framing of decisions. Journal of Business, 59(4), S251–S278. doi:10.1086/296365
Tversky, A., & Kahneman, D. (1991). Loss aversion in riskless choice: A reference-dependent model. The Quarterly Journal of Economics, 106(4), 1039–1061. doi:10.2307/2937956
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4), 297–323. doi:10.1007/bf00122574
Tversky, A., & Koehler, D. J. (1994). Support theory: A nonextensional representation of subjective probability. Psychological Review, 101(4), 547–567. doi:10.1037//0033-295X.101.4.547
Tversky, A., & Simonson, I. (1993). Context-dependent preferences. Management Science, 39(10), 1179–1189. doi:10.1287/mnsc.39.10.1179
Tversky, A., & Wakker, P. (1995). Risk attitudes and decision weights. Econometrica, 63(6), 1255–1280. doi:10.2307/2171769
Ungemach, C., Chater, N., & Stewart, N. (2009). Are probabilities overweighted or underweighted when rare outcomes are experienced (rarely)? Psychological Science, 20(4), 473–479. doi:10.1111/j.1467-9280.2009.02319.x
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550–592. doi:10.1037/0033-295X.108.3.550
Usher, M., & McClelland, J. L. (2004). Loss aversion and inhibition in dynamical models of multialternative choice. Psychological Review, 111(3), 757–769. doi:10.1037/0033-295X.111.3.757
von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.
Wakker, P. (2010). Prospect theory for risk and ambiguity. Cambridge, England: Cambridge University Press.
Wakker, P., Erev, I., & Weber, E. U. (1994). Comonotonic independence: The critical test between classical and rank-dependent utility theories. Journal of Risk and Uncertainty, 9(3), 195–230. doi:10.1007/Bf01064200
Weber, E. U. (1994). From subjective probabilities to decision weights: The effect of asymmetric loss functions on the evaluation of uncertain outcomes and events. Psychological Bulletin, 115, 228–242. doi:10.1037/0033-2909.115.2.228
Wollschläger, L. M., & Diederich, A. (2012). The 2N-ary choice tree model for N-alternative preferential choice. Frontiers in Psychology, 3(189), 1–11. doi:10.3389/fpsyg.2012.00189
Yechiam, E., & Hochman, G. (2013). Losses as modulators of attention: Review and analysis of the unique effects of losses over gains. Psychological Bulletin, 139(2), 497–518. doi:10.1037/a0029383

CHAPTER 11

Models of Semantic Memory

Michael N. Jones, Jon Willits, and Simon Dennis

Abstract
Meaning is a fundamental component of nearly all aspects of human cognition, but formal models of semantic memory have classically lagged behind many other areas of cognition. However, computational models of semantic memory have seen a surge of progress in the last two decades, advancing our knowledge of how meaning is constructed from experience, how knowledge is represented and used, and what processes are likely culprits in disorders characterized by semantic impairment. This chapter provides an overview of several recent clusters of models and trends in the literature, including modern connectionist and distributional models of semantic memory, and contemporary advances in grounding semantic models with perceptual information and models of compositional semantics. Several common lessons have emerged from both the connectionist and distributional literatures, and we attempt to synthesize these themes to better focus future developments in semantic modeling.

Key Words: semantic memory, semantic space model, distributional semantics, connectionist network, concepts, cognitive model, latent semantic analysis

Introduction
Meaning is simultaneously the most obvious feature of memory—we can all compute it rapidly and automatically—and the most mysterious aspect to study. In comparison to many areas of cognition, relatively little is known about how humans compute meaning from experience. Nonetheless, a mechanistic account of semantics is an essential component of all major theories of language comprehension, reading, memory, and categorization. Semantic memory is necessary for us to construct meaning from otherwise meaningless words and utterances, to recognize objects, and to interact with the world in a knowledge-based manner.

Semantic memory typically refers to memory for word meanings, facts, concepts, and general world knowledge. For example, you know that a panther is a jungle cat, is more like a tiger than a corgi, and you know better than to try to pet one. The two common types of semantic information are conceptual and propositional knowledge. A concept is a mental representation of something, such as a panther, and knowledge of its similarity to other concepts. A proposition is a mental representation of conceptual relations that may be evaluated to have a truth value, for example, that a panther is a jungle cat, or has four legs, and knowledge that panthers do not have gills.

In Tulving's (1973) classic modular taxonomy, declarative memory was subdivided into episodic and semantic memory, the former containing memory for autobiographical events, and the latter dedicated to generalized memory not linked to a specific event. Although you may have a specific autobiographical memory of the last time you saw a panther at the zoo, you do not have a specific memory of when you learned that a panther was a jungle cat, was black, or how it is similar to a tiger. In this

sense, semantic memory gained a reputation as the more miscellaneous and mysterious of the memory systems. Although episodic memory could be studied with experimental tasks such as list learning and could be measured quantitatively by counting the number of items correctly recognized or recalled, semantic memory researchers focused more on tasks such as similarity judgments, proposition verification, semantic priming, and free association. Unlike episodic memory, there existed no mechanistic account of how semantic memory was constructed as a function of experience. However, the field has advanced a considerable amount in the past 25 years.

A scan of the contemporary literature reveals a large number of formal models that aim to understand the mechanisms that humans use to construct semantic memory from repeated episodic experience. Modern semantic models have made truly impressive progress at elucidating how humans learn and represent semantic information, how semantic memory is recruited and used in cognitive processing, and even how complex functions like semantic composition may be accomplished by relatively simple cognitive mechanisms. Many of the current advances build from classic ideas, but only relatively recently has computational hardware advanced to a scale where we can actually simulate and evaluate these systems. Advances in semantic modeling also are indebted to excellent interdisciplinary collaboration, building in part on developments in computational linguistics, machine learning, and information retrieval.

The goal of this chapter is to provide an overview of recent advances in models of semantic memory. We will first provide a brief synopsis of classic models and themes in semantic memory research, but will then focus on computational developments. In addition, the focus of the chapter is on models that have a formal instantiation that may be tested quantitatively. Hence, although there are several exciting new developments in verbal conceptual theory (e.g., Louwerse's (2011) Symbol Interdependency Hypothesis), we focus exclusively on models that are explicitly expressed by computer code or mathematical expressions. In addition, the chapter assumes a sufficient understanding of the empirical literature on semantic memory. For an overview of contemporary experimental findings, we refer the reader to a companion chapter on semantic memory by McRae and Jones (2014).

There are several potential ways to organize a review of the literature, and no single structure will satisfy all theorists. We opt here to follow two major clusters of cognitive models that have been prominent: distributional models and connectionist models. The division may also be broadly thought of as a division between models that specify how concepts are learned from statistical experience (distributional models), and models that specify how propositions are learned or that use conceptual representations in cognitive processes (connectionist models). Obviously, there are exceptions in both clusters that cross over, but the two literatures have had different foci. Next, we summarize some classic models of semantic memory and common theoretical debates that have extended to the contemporary models. Following the historical trends in the literature, we then discuss advances in connectionist models, followed by distributional models. Finally, we discuss hybrid approaches and new directions in models of grounded semantics and compositional semantics, and attempt to synthesize common lessons that have been learned across the literature.

Classic Models and Themes in Semantic Memory Research
The three classic models of semantic memory most commonly discussed are semantic networks, feature-list models, and spatial models. These three models deserve mention here, both because they have each seen considerable attention in the literature, and because features of each have clearly evolved into modern computational models.

The semantic network has traditionally been one of the most common theoretical frameworks used to understand the structure of semantic memory. Collins and Quillian (1969) originally proposed a hierarchical model of semantic memory in which concepts were nodes and propositions were labeled links (e.g., the nodes for dog and animal were connected via an "isa" link). The superordinate and subordinate structure of the links produced a hierarchical tree structure (animals were divided into birds, fish, etc., and birds were further divided into robin, sparrow, etc.), and allowed the model to explain both conceptual and propositional knowledge within a single framework. Accessing knowledge required traversal of the tree to the critical branch, and the model was successful in this manner of explaining early sentence verification data from humans (e.g., the speed to verify that "a canary can sing"). A later version of the semantic network model proposed by Collins and Loftus (1975) deemphasized the hierarchical nature of the network in favor of the process of spreading

activation through all network links simultaneously to account for semantic priming phenomena—in particular, the ability to produce fast negative responses. Early semantic networks can be seen as clear predecessors to several modern connectionist models, and features of them can also be seen in modern probabilistic and graphical models as well.

A competing model was the feature-comparison model of Rips, Shoben, and Smith (1973). In this model, a word's meaning is encoded as a list of binary descriptive features, which were heavily tied to the word's perceptual referent. For example, the feature wings would be turned on for a robin, but off for a beagle. Smith, Shoben, and Rips (1974) proposed two types of semantic features: defining features that all concepts have, and characteristic features that are typical of the concept, but are not present in all cases. For example, all birds have wings, but not all birds fly. Processing in the model was accomplished by computing the feature overlap between any two concepts, and the features were allowed to vary in their contribution of importance to the concept, although how particular features came to be and how they were ranked was not fully specified. Modern versions of feature-list models use aggregate data collected from human raters in property generation tasks (e.g., McRae, de Sa, & Seidenberg, 1997).

A third type was the spatial model, which emerged from Osgood's (1952, 1971) early attempts to empirically derive semantic features using semantic differential ratings. Osgood had humans rate words on a Likert scale against a set of polar opposites (e.g., rough-smooth, heavy-light), and a word's meaning was then computed as a coordinate in a multidimensional semantic space. Distance between words in the space was proposed as a process for semantic comparison.¹ Featural and spatial representations have been contrasted as models of human similarity judgments (e.g., Tversky & Gati, 1982), and the same contrast applies to spatial versus featural representations of semantic representations. We will see the feature versus space debate emerge again with modern distributional models. Early spatial models can be seen as predecessors of modern semantic space models of distributional semantics (but co-occurrences in text corpora are used as the data on which the space is constructed rather than human ratings).

One issue with all three of these classic models is that none ever did actually learn anything. Each model relied on representations that were hand-coded based on the theorist's intuition (or subjective ratings) of semantic structure, but none formally specified the cognitive mechanisms by which the representations were constructed. As Hummel and Holyoak (2003) have noted, this type of intuitive modeling may have serious consequences: "The problem of hand-coded representations is the most serious problem facing computational modeling as a scientific enterprise. All models are sensitive to their representation, so the choice of representation is among the most powerful wildcards at the modeler's disposal" (p. 247). As we will see later in the chapter, this is exactly the concern that modern distributional models address.

Connectionist Models of Semantic Memory

Connectionist models were among the first to specify how semantic representations might come to be learned, and how those representations might interact with other cognitive processes. Modern connectionism is a framework used to model mental and behavioral phenomena as an emergent process—one that arises out of the behavior of networks of simple interconnected units (Rumelhart, McClelland, & the PDP Group, 1986). Connectionism is a very broad enterprise. Connectionist models can be used to explicitly model the interaction of different brain regions or neural processes (O'Reilly, Munakata, Frank, Hazy, & Contributors, 2012) or they can be used to model cognition and behavior from a "neurally inspired" perspective, which values the way in which the models exhibit parallel processing, interactivity, and emergentism (Rumelhart et al., 1986; Rogers & McClelland, 2006). Connectionist models have made a very large contribution to simulating and understanding the dynamic nature of semantic knowledge and how semantic knowledge interacts with other cognitive processes.

Connectionist models represent knowledge in terms of weighted connections between interconnected units. A model's set of units, its connections, and how they are organized is called the model's architecture. Research involving connectionist models has studied a wide range of architectures, but most connectionist models share a few common features. Most models have at least one set of units designated as input units, as well as at least one set of units designated as target or output units. Most connectionist models also have one or more sets of intervening units between the input and output units, which are often referred to as hidden layers.
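The generic architecture just described can be sketched in a few lines of code. What follows is a minimal illustration rather than an implementation of any published model: the layer sizes, the logistic activation function, the learning rate, and the toy input-output patterns are all arbitrary assumptions, and the weight update is a simple error-reducing gradient rule of the kind discussed below.

```python
import numpy as np

# Minimal sketch of a feed-forward connectionist network. All specifics here
# (layer sizes, logistic activation, learning rate, toy patterns) are arbitrary
# illustrative assumptions, not taken from any published model.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FeedForwardNet:
    def __init__(self, n_in, n_hidden, n_out):
        # Knowledge lives entirely in these weight matrices, initialized randomly.
        self.w_ih = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.w_ho = rng.normal(0.0, 0.5, (n_hidden, n_out))

    def forward(self, x):
        # Activation propagates from input units through hidden units to outputs.
        h = sigmoid(x @ self.w_ih)
        o = sigmoid(h @ self.w_ho)
        return h, o

    def train_step(self, x, target, lr=0.5):
        # Supervised learning: compare the produced output with the target and
        # adjust both weight matrices to reduce the squared output error.
        h, o = self.forward(x)
        err_o = (o - target) * o * (1.0 - o)
        err_h = (err_o @ self.w_ho.T) * h * (1.0 - h)
        self.w_ho -= lr * np.outer(h, err_o)
        self.w_ih -= lr * np.outer(x, err_h)
        return float(np.sum((o - target) ** 2))

# Three arbitrary input -> output associations for the network to learn.
patterns = [
    (np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0])),
    (np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0])),
    (np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0])),
]

net = FeedForwardNet(3, 4, 2)
first = sum(net.train_step(x, t) for x, t in patterns)
for _ in range(2000):
    for x, t in patterns:
        net.train_step(x, t)
last = sum(float(np.sum((net.forward(x)[1] - t) ** 2)) for x, t in patterns)
assert last < first  # repeated weight adjustment reduces the output error
```

The only thing that changes with experience is the pair of weight matrices, and asking what the trained network outputs for a given input is exactly the question posed in the text that follows.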

A connectionist model represents knowledge in terms of the strength of the weighted connections between units. Activation is fed into the input units, and that activation in turn activates (or suppresses) the units to which the input units are connected, as a function of the weighted connection strength between the units. Activation eventually propagates to the output units, with one important question of interest being: What output units will a connectionist model activate given a particular input? In this sense, the knowledge in connectionist models is typically thought of as representing the function or relationship between a set of inputs and a set of outputs. Connectionist models should not, however, be confused with models that map simple stimulus-response relationships; the hidden layers between input and output layers in connectionist networks allow them to learn very complex internal representations. Models with an architecture such as the one just described, where activation flows from input units to hidden units to output units, are typically referred to as feed-forward networks.

A key aspect of connectionist models is that they are often used to study the learning process itself. Typically, the weights between units in a connectionist network are initialized to a random state. The network is then provided with a training phase, in which the model is provided with inputs (typically involving some sort of expected input from the environment), and the weights are adjusted as a function of the particular inputs the network received. Learning (adjusting the weights) is accomplished in either an unsupervised or a supervised fashion. In unsupervised learning, weights are typically adjusted according to some sort of associative principle, such as Hebbian learning (Grossberg, 1976; Hebb, 1949), where weights between units are increased the more often the two units are active at the same time. In supervised learning, weights are adjusted by observing which output units the network activated given a particular input pattern, and comparing that to some goal or target output given those inputs. The weights are then adjusted so as to reduce the amount of error the network makes in terms of its activation of the "correct" and "incorrect" outputs (Kohonen, 1982; Rosenblatt, 1959; Rumelhart, Hinton, & Williams, 1986; Widrow & Hoff, 1960).

Rumelhart Networks

An illustrative example of a connectionist model of semantic memory (shown in Figure 11.1a) was first presented by Rumelhart and Todd (1993) and studied in detail by Rogers and McClelland (2006). This network has two sets of input units: (1) a set of units meant to represent words or concepts (e.g., robin, canary, sunfish, etc.), and (2) a set of units meant to represent different types of relations (e.g., is-a, can, has, etc.). The network learns to associate conjunctions of those inputs (e.g., robin+can) with outputs representing semantic features (e.g., fly, move, sing, grow, for robin+can). The model accomplishes this using supervised learning, having robin+can activated as inputs, observing what a randomly initialized version of the model produces as an output, and then adjusting the weights so as to make the activation of the correct outputs more likely. The model is not merely learning associations between inputs and outputs; in the Rumelhart network, the inputs and outputs are mediated by two sets of hidden units, which allow the network to learn complex internal representations for each input.

A critical property of connectionist architectures using hidden layers is that the same hidden units are being used to create internal representations for all possible inputs. In the Rogers et al. example, robin, oak, salmon, and daisy all use the same hidden units; what differentiates their internal representations is that they instantiate different distributed patterns of activation. But because the network is using overlapping distributed representations for all the concepts, this means that during the process of learning, changing the connection weights as a result of learning about one input could potentially affect how the network represents all other items. When the network learns an internal representation (i.e., hidden unit activation state) for the input robin+can, and learns to associate the outputs sing and fly with that internal representation, this will mean that other inputs whose internal representations are similar to robin (i.e., have similar hidden unit activation states, such as canary) will also become more associated with sing and fly. This provides these networks with a natural mechanism for categorization, generalization, and property induction. This behavior allows researchers using connectionist models to study how these networks categorize, and to compare the predictions of the model to human behaviors.

Rogers and McClelland (2006) extensively studied the behavior of the Rumelhart networks, and found that the model provides an elegant account of a number of aspects of human concept acquisition and representation. For example, they found that as the model acquires concepts through increasing
[Figure 11.1 appears here. Panel (a): the network architecture, with Item input units (pine, oak, rose, daisy, robin, canary, sunfish, salmon) and Relation/Query units (ISA, is, can, has) feeding through two hidden layers to attribute outputs (e.g., living thing, plant, animal, tree, flower, bird, grow, move, swim, fly, sing, bark, petals, wings, feathers, scales, gills, roots, skin). Panel (b): a scatterplot of the first two components of the network's hidden-unit representations for the eight items.]
Fig. 11.1a (left). The network architecture used by Rogers and McClelland (2006), showing the output activation of appropriate semantic features given an input concept and an input relation. Fig. 11.1b (right). A graph of the network's learning trajectory, obtained by performing a multidimensional scaling on the network's hidden unit activations for each input. As the network obtains more experience with each concept, it progressively learns to make finer and finer distinctions between categories. Timothy T. Rogers and James L. McClelland, Semantic Cognition: A Parallel Distributed Processing Approach, figures 2.2 & 3.3, © Massachusetts Institute of Technology, by permission of The MIT Press.

amounts of experience, the internal representations for the concepts show progressive differentiation, learning broader distinctions first and more fine-grained distinctions later, similar to the distinctions children show (Mandler, Bauer, & McDonough, 1991). In the model, this happens because the network is essentially performing something akin to a principal component analysis, learning the different features in the order of the amount of variance in the input that they explain. Rogers and McClelland argued that this architecture, which combines simple learning principles with the expected structure of the environment, can be used to understand how certain features (those that have rich covariational structure) become the features that organize categories, and how conceptual structure can become reorganized over the course of concept acquisition. The general (and somewhat controversial) conclusion that Rogers and McClelland draw from their study of this model is that a number of properties of the semantic system, such as the taxonomic structure of categories (Bower, 1970) and the role of causal knowledge in semantic reasoning (Keil, 1989), can be explained as an emergent consequence of simple learning mechanisms combined with the expected structure of the environment, and that these structural factors do not necessarily need to be explicitly built into models of semantic memory.

Feed-forward connectionist models have only been used in a limited fashion to study the actual structure of semantic memory. However, these models have been used extensively to study how semantic structure interacts with various other cognitive processes. For example, feed-forward models have been used to simulate and understand the word learning process (Gasser & Smith, 1998; Regier, 2005). These word-learning models have been used to show that many details about the representation of word meanings (like hierarchical structure), learning constraints (such as mutual exclusivity and shape bias), and empirical phenomena (such as the vocabulary spurt that children show around two years of age) emerge naturally from the structure of the environment with a simple learning algorithm, and do not need to be explicitly built into the model. Feed-forward models have also been used to model consequences of brain damage (Farah & McClelland, 1991; Rogers et al., 2004; Tyler, Durrant-Peatfield, Levy, Voice, & Moss, 2000), Alzheimer's disease (Chan, Salmon, & Butters, 1998), schizophrenia (Braver, Barch, & Cohen, 1999; Cohen & Servan-Schreiber, 1992; Nestor et al., 1998), and a number of other disorders that involve impairments to semantic memory (see Aakerlund & Hemmingsen, 1998, for a review). These models typically study brain disorders by lesioning the network (i.e., removing units or connections), or otherwise causing the network to behave in suboptimal ways, and then studying the consequences of this disruption. Connectionist models provide accounts of a wide range of impairments and disorders, and have also been used to show that many semantic consequences of impairments and disorders, such as the selective impairment of certain categories, can be explained in terms of emergent processes deriving from the interaction of low-level features, rather than requiring explicit instantiations in the model (such as creating modular memory systems for living and nonliving things; see McRae and Cree, 2002, for a review).

Dynamic Attractor Networks

In addition to feed-forward models such as the Rumelhart network, a considerable amount of semantic memory research has explored the use of dynamical connectionist models (Hopfield, 1982). A connectionist model becomes a dynamical model when its architecture involves some sort of bi-directionality, feedback, or recurrent connectivity. Dynamical networks allow investigations into how the activation of representations may change over time, as well as how semantic representations interact with other cognitive processes in an online fashion.

For example, Figure 11.2a shows McLeod, Shallice, and Plaut's (2000) dynamical network for pronouncing printed words. The network has a layer of units for encoding orthographic representations (grapheme units), a layer of units for encoding phonological representations (phoneme units), and an intervening layer between the two that encodes the words' semantic features (sememe units), as well as additional layers of hidden units between each of these layers. Critically, the activation in this network is allowed to flow in both directions, from phonemes to sememes to graphemes, and from graphemes to sememes to phonemes. The network also has recurrent connections (the loops in Figure 11.2a) connecting the grapheme, sememe, and phoneme layers to themselves.
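Such recurrent connectivity allows activation to circulate until the network settles into a stable state. That settling behavior can be illustrated with a far simpler relative of these models, a Hopfield (1982) style network. The sketch below is an illustrative toy (one layer of 32 units and three hand-picked stored patterns, nothing like the multilayer architecture of Figure 11.2a), but it shows the same qualitative behavior: a degraded input falls back into a stored attractor.

```python
import numpy as np

# Minimal Hopfield-style sketch of attractor dynamics. The network size and the
# stored patterns are arbitrary illustrations, far simpler than the McLeod,
# Shallice, and Plaut (2000) architecture described in the text.
n = 32
t = np.arange(n)
# Three mutually orthogonal +/-1 patterns (square waves of different periods),
# standing in for the stable "word" states a trained network might hold.
patterns = np.array([np.where((t // p) % 2 == 0, 1.0, -1.0) for p in (16, 8, 4)])

# Hebbian outer-product storage: units that are active together gain positive weights.
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0.0)

def settle(state, steps=10):
    # Repeatedly recompute every unit from its weighted recurrent input; the
    # state descends into the nearest attractor and then stops changing.
    s = state.copy()
    for _ in range(steps):
        s = np.where(W @ s >= 0.0, 1.0, -1.0)
    return s

# Corrupt a stored pattern in four positions (cf. a noisy or masked input).
probe = patterns[0].copy()
probe[[0, 5, 11, 20]] *= -1

recovered = settle(probe)
assert np.array_equal(recovered, patterns[0])        # fell back into the original attractor
assert np.array_equal(settle(recovered), recovered)  # ...which is a stable state
```

Corrupting the probe more heavily, or storing more patterns than such a network can hold, makes it settle into a wrong or spurious attractor, a one-layer caricature of the error trajectories described for the McLeod et al. network.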
The combination et al., 2004; Tyler, Durrant-Peatfield, Levy, Voice, of the bidirectional connections and recurrent & Moss, 2000), Alzheimer’s disease (Chan, Salmon, connectivity allows the McLeod et al. (2000) & Butters, 1998), schizophrenia (Braver, Barch, network to establish a dynamical system where & Cohen, 1999; Cohen and Servan-Schreiber, the activation at the various levels will feed back

[Figure 11.2 appears here. Panel (a): the network architecture, with 28 grapheme units, 40 hidden units, 68 sememe units, 40 hidden units, and 33 phoneme units. Panel (b): an energy landscape depicting an initial semantic pattern, the initial influence of orthography, mask onset, and normal and error trajectories settling toward the cat, dog, and log attractors.]
Fig. 11.2a (left). A prototypical example of a semantic attractor network, from McLeod, Shallice, & Plaut (2000). Fig. 11.2b (right). An example of the network's behavior, simulating the experience of a person reading the word "dog". The weights of the network have created a number of attractor spaces, determined by words' orthographic, phonological, and semantic similarity. Disrupting the input (such as presenting the participant with a stimulus mask) at different stages has different effects. Early masking leads to a higher likelihood of falling into the wrong perceptual attractor (LOG instead of DOG). Later masking leads to a higher likelihood of falling into the wrong semantic attractor (CAT instead of DOG). (left) After McLeod, Shallice, and Plaut (2000), Cognition © Elsevier Inc.

and forth eventually settling into a stable attractor state. The result is that these attractor networks can allow multiple constraints (e.g., the weights that establish the network's knowledge of the links between orthography and semantics, and semantics and phonology) to compete, eventually settling into a state that satisfies the most likely constraints for a given input.

As an illustration of how this works, consider an example using the McLeod et al. network, shown in Figure 11.2b. Here, the network is simulating the experience of a person reading words. The figure depicts a three-dimensional space, where the vertical direction (labeled "energy") represents the stability of the network's current state (versus its likelihood to switch to a new state) as activity circulates through the network. In an attractor network, only a small number of possible states are stable. These stable states are determined by the network's knowledge about the likelihood of certain orthographic, phonological, and semantic states to co-occur. And given any input, the network will eventually settle into one of these stable states. For example, if the network receives a clear case of the printed word DOG as input, and this input is not disrupted, the network will quickly settle into the corresponding DOG state in its orthographic, phonological, and semantic layers. Alternatively, if the network received a nonword like DAG as an input, it would eventually settle into a neighboring attractor state (like DOG or DIG or DAD). Similarly, if the network receives DOG as an input, but this input is impoverished (e.g., noisy, with errors in the input signal), or disrupted (simulating masking such as might happen in a psychology experiment), this can affect the network's ability to settle into the correct attractor. In a manner corresponding well to the disruption effects that people show in behavioral experiments, an early disruption (before the network has had a chance to settle into an orthographic attractor basin) can lead the network to make a form-based error (settling into the LOG basin instead). A later disruption—happening after the orthographic layer has settled into its basin but before the semantic layer has done so—can lead the network to make a semantic error, activating a code of semantic features corresponding to CAT.

Attractor networks have been used to study a very wide range of semantic-memory related phenomena. Rumelhart et al. (1986) used an attractor network to show how schemas (e.g., one's representations for different rooms) can emerge naturally out of the dynamics of co-occurrence of lower-level objects (e.g., items in the rooms), without needing to build explicit schema representations into the model (see also Botvinick & Plaut, 2004). Like the McLeod example already described, attractor networks have been extensively used to study how semantic memory affects lexical access (Harm & Seidenberg, 2004; McLeod et al., 2000) as well as to model semantic priming (Cree, McRae, & McNorgan, 1999; McRae et al., 1997; Plaut & Booth, 2000). Dynamical models have also been used to study the organization
and development of the child lexicon (Horst, McMurray, & Samuelson, 2006; Li, Zhao, & MacWhinney, 2007), the bilingual lexicon (Li, 2009), and children's causal reasoning using semantic knowledge (McClelland & Thompson, 2007), and how lexical development differs in typical and atypical developmental circumstances (Thomas & Karmiloff-Smith, 2003).

Dynamical connectionist models have also simulated various ways that semantic knowledge impacts and interacts with sentence production and comprehension, including how semantic constraints impact the grammaticality of sentences (Allen & Seidenberg, 1999; Dell, Chang, & Griffin, 1999; McClelland, St. John, & Taraban, 1989; Tabor & Tanenhaus, 1999; Taraban & McClelland, 1988), and how semantic knowledge assists in the learning of linguistic structure (Borovsky & Elman, 2006; Chang, Dell, & Bock, 2006; Rohde & Plaut, 2000). As with feed-forward models, dynamical models have also been used extensively to study many developmental and brain disorders such as dyslexia and brain damage (Devlin, Gonnerman, Anderson, & Seidenberg, 1998; Hinton & Shallice, 1991; Kinder & Shanks, 2003; Lambon Ralph, McClelland, Patterson, Galton, & Hodges, 2001; Plaut, 1999, 2002).

Distributional Models of Semantic Memory

There are now a large number of computational models in the literature that may be classified as distributional. Other terms commonly used to refer to these models are corpus-based, semantic-space, or co-occurrence models, but distributional is the most appropriate term common to all the models in that it fairly describes the environmental structure all learning mechanisms capitalize on (i.e., not all are truly spatial models, and most do not capitalize merely on direct co-occurrences). The various models differ greatly in the cognitive mechanisms they posit that humans use to construct semantic representations, ranging from Hebbian learning to probabilistic inference. But the unifying theme common to all these models is that they hypothesize a formal cognitive mechanism to learn semantics from repeated episodic experience in the linguistic environment (typically a text corpus).

The driving theory behind modern distributional models of semantic representation is certainly not a new one, and dates back at least to Wittgenstein (1953). The most famous and commonly used phrase to summarize the approach is Firth's (1957) "you shall know a word by the company it keeps," and this idea was further developed by Harris (1970) into the distributional hypothesis of contextual overlap. For example, robin and egg may become related because they tend to co-occur frequently with each other. In contrast, robin and sparrow become related because they are frequently used in similar contexts (with the same set of words), even if they rarely co-occur directly. Ostrich may be less related to robin due to a lower overlap of their contexts compared to sparrow, and stapler is likely to have very little contextual overlap with robin. Formal models of distributional semantics differ in their learning mechanisms, but they all have the same overall goal of formalizing the construction of semantic representations from statistical redundancies in language.

A taxonomy of distributional models is very difficult now given the large number of them and the range of learning mechanisms. The models can be loosely clustered based on their notion of context (e.g., documents, words, time, etc.), or the learning mechanism they employ. We opt for the latter organization here, and just present some standard exemplars of each model type; an exhaustive description of all models is beyond the scope of this chapter (for reviews, see Bullinaria & Levy, 2007; Riordan & Jones, 2011; Turney & Pantel, 2010).

Latent Semantic Analysis

Perhaps the best-known distributional model is Latent Semantic Analysis (LSA; Landauer & Dumais, 1997). LSA begins with a term-by-document frequency matrix of a text corpus, in which each row vector is a word's frequency distribution over documents. A document is simply a "bag of words" in which transitional information is not represented. Next, a word's row vector is transformed by its log frequency in the document and its information entropy over documents (−∑ p(x) log₂ p(x); cf. Salton & McGill, 1983). Finally, the matrix is factorized using singular-value decomposition (SVD) into three component matrices, U, Σ, and V. The U matrix represents the orthonormal basis for a space in which each word is a point, V represents an analogous orthonormal document space, and Σ is a diagonal matrix of singular values (cf. eigenvalues) weighting dimensions in the space (see Landauer, McNamara, Dennis, & Kintsch, 2007, for a tutorial). The original transformed term-by-document matrix, M, may be reconstructed as:

M = UΣVᵀ,  (1)

where Vᵀ is the transpose of V.
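The pipeline just described can be sketched end to end on a toy example. Every specific choice below is an illustrative assumption: the six miniature "documents," this particular form of log-entropy weighting (implementations vary in the exact scheme), and the retention of only two dimensions, whereas real applications use large corpora and roughly 300 dimensions.

```python
import numpy as np

# Toy corpus: six tiny "bag-of-words" documents (an illustrative assumption).
docs = [
    "boat river water", "ship river sea", "boat sea water",
    "robin sparrow bird", "robin bird sings", "sparrow bird wings",
]
vocab = sorted({w for d in docs for w in d.split()})
widx = {w: i for i, w in enumerate(vocab)}

# Term-by-document frequency matrix: each row is one word's distribution over documents.
F = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        F[widx[w], j] += 1.0

# Log-entropy transform: log frequency, down-weighted for words whose occurrences
# are spread evenly (high entropy) across documents. One common variant of the scheme.
P = F / F.sum(axis=1, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(P > 0, P * np.log2(P), 0.0)
H = -plogp.sum(axis=1)
M = np.log1p(F) * (1.0 - H / np.log2(len(docs)))[:, None]

# Eq. (1): singular-value decomposition, M = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, U @ np.diag(S) @ Vt)  # full reconstruction recovers M exactly

# Retain only the top N singular values (N = 2 here); each word becomes a point
# in an N-dimensional latent semantic space.
N = 2
word_space = U[:, :N] * S[:N]

def sim(w1, w2):
    a, b = word_space[widx[w1]], word_space[widx[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# boat and ship never occur in the same document (their rows are orthogonal),
# yet they end up proximal in the reduced space, unlike boat and robin.
assert sim("boat", "ship") > sim("boat", "robin")
```

Here boat and ship never occur in the same document, so their rows of M are orthogonal, yet the reduced space places them close together, which is exactly the higher-order similarity discussed next.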

More commonly, only the top N singular values of Σ are retained, where N is usually around 300. This dimension reduction allows an approximation of the original "episodic" matrix to be reconstructed, and has the effect of bringing out higher-order statistical relationships among words more sophisticated than mere direct co-occurrence. A word's semantic representation is then a pattern across the N latent semantic dimensions, and is often projected as a point in N-dimensional semantic space (cf. Osgood, 1952). Even though two words (e.g., boat and ship) might have had zero similarity in the original M matrix, indicating that they do not co-occur in the same documents, they may nonetheless be proximal in the reduced space, reflecting their deeper semantic similarity (contextual similarity but not necessarily contextual overlap).

The application of SVD in LSA is quite similar to common uses of principal component analysis (a type of SVD) in questionnaire research. Given a pattern of observable scale responses to items on a personality questionnaire, for example, the theorist may apply SVD to infer a small number of latent components (e.g., extroversion, neuroticism) that are causing the larger number of observable response patterns. Similarly, LSA uses SVD to infer a small number of latent semantic components in language that explain the pattern of observable word co-occurrences across contexts. In this sense, LSA was the first model to successfully specify a function mapping semantic memory to episodic context. Landauer and Dumais (1997) were careful not to claim that humans use exactly SVD as a learning mechanism, but rather that the brain uses some dimensional reduction mechanism akin to SVD to create abstract semantic representations from experience.

The semantic representations constructed by LSA have demonstrated remarkable success at simulating a wide range of human behavioral data, including judgments of semantic similarity (Landauer & Dumais, 1997), word categorization (Laham, 2000), and discourse comprehension (Kintsch, 1998), and the model has also been applied to the automated scoring of essay quality (Landauer, Laham, Rehder, & Schreiner, 1997). One of the most publicized feats of LSA was its ability to achieve a score on the Test of English as a Foreign Language (TOEFL) that would allow it entrance into most U.S. colleges (Landauer & Dumais, 1997). A critically important insight from the TOEFL simulation was that the model's performance peaked at the reduced 300 dimensions compared to fewer or even the full dimensionality of the matrix. Even though the ability of the model (from an algebraic perspective) to reconstruct the original M matrix diminishes monotonically as dimensionality is reduced, its ability to simulate the human semantic data was better at the reduced dimensionalities. This finding supports the notion that semantic memory may simply be supported by a mental dimension reduction mechanism applied to episodic contexts. The dimension reduction operation brings out higher-order abstractions by glossing over variance that is idiosyncratic to specific contexts. The astute reader will note the similarity of this notion to the emergent behavior of the hidden layers of a connectionist network that also performs some dimensional reduction operation; we will return to this similarity in the discussion.

The influence of LSA on the field of semantic modeling cannot be overstated. Several criticisms of the model have emerged over the years (see Perfetti, 1998), including the lack of incremental learning, neglect of word-order information, issues about what exact cognitive mechanisms would perform SVD, and concerns over its core assumption that meaning can be represented as a point in space. However, LSA clearly paved the way for a rapid sequence of advances in semantic models in the years since its publication.

Moving Window Models

An alternative approach to learning distributional semantics is to slide an N-word window across a text corpus, and to apply some lexical association function to the co-occurrence counts within the window at each step. Although LSA represents a word's episodic context as a document, moving-window models operationalize a word's context in terms of the other words that it is commonly seen with in temporal contexts. Compared to LSA's batch-learning mechanism, this allows moving-window models to gradually develop semantic structure from simple co-occurrence counting (cf. Hebbian learning), because a text corpus is experienced in a continuous fashion. In addition, several of these models inversely weight co-occurrence by how many words intervene between a target word and its associate, allowing them to capitalize on word-order information.

The prototypical exemplar of a moving-window model is the Hyperspace Analogue to Language model (HAL; Lund & Burgess, 1996). In HAL,
data, including judgments of semantic similarity Compared to LSA’s batch-learning mechanism, this (Landauer & Dumais, 1997), word categorization allows moving-window models to gradually de- (Laham, 2000), and discourse comprehension velop semantic structure from simple co-occurrence (Kintsch, 1998), and the model has also been counting (cf. Hebbian learning), because a text applied to the automated scoring of essay quality corpus is experienced in a continuous fashion. (Landauer, Lahma, Rehder, & Schreiner, 1997). In addition, several of these models inversely One of the most publicized feats of LSA was its weight co-occurrence by how many words intervene ability to achieve a score on the Test of English between a target word and its associate, allowing as a Foreign Language (TOEFL) that would allow them to capitalize on word-order information. it entrance into most U.S. colleges (Landauer The prototypical exemplar of a moving-window & Dumais, 1997). A critically important insight model is the Hyperspace Analogue to Language from the TOEFL simulation was that the model’s model (HAL; Lund & Burgess, 1996). In HAL,

240 higher level cognition

a co-occurrence window (typically, the 10 words preceding and succeeding the target word) is slid across a text corpus, and a global word-by-word co-occurrence matrix is updated at each one-word increment of the window. HAL uses a ramped window in which co-occurrence magnitudes are weighted inversely proportional to distance from the target word. A word's semantic representation in the model is simply a concatenation of its row and column vectors from the global co-occurrence matrix. The row and column vectors reflect the weighted frequency with which each word preceded and succeeded, respectively, the target word in the corpus. Obviously, the word vectors in HAL are both high dimensional and very sparse. Hence, it is common to only use the column vectors with the highest variance (typically about 10% of all words are then retained as "context" words; Lund & Burgess, 1996). Considering its simplicity, HAL has been very successful at accounting for human behavior in semantic tasks, including semantic priming (Lund & Burgess, 1996) and asymmetric semantic similarity, as well as higher-order tasks such as problem solving (Burgess & Lund, 2000).

In HAL, words are most similar if they have appeared in similar positions relative to other words (paradigmatic similarity; e.g., bee-wasp). In fact, Burgess and Lund (2000) have suggested that the structure learned by HAL is very similar to what an SRN (Elman, 1990) would learn if it could scale up to such a large linguistic dataset. In contrast, it is known that LSA gives stronger weight to syntagmatic relations (e.g., bee-honey) than does HAL, since LSA ignores word order, and both types of similarity are important factors in human semantic representation (Jones, Kintsch, & Mewhort, 2006).

Several recent modifications to HAL have produced models with state-of-the-art performance at simulating human data. One concern in the original model was that chance frequencies can produce spurious similarities in the global matrix: A higher frequency word has a greater chance of randomly occurring with any other word and, hence, high-frequency words end up being more semantically similar to a target independent of semantic similarity. Recent versions of HAL, such as HiDEx (Shaoul & Westbury, 2006), factor out chance occurrence by weighting co-occurrence by inverse frequency of the target word, which is similar to LSA's application of log-entropy weighting but after learning the matrix. A second modification to HAL was proposed by Rohde, Gonnerman, and Plaut (2004) in their COALS model (Correlated Occurrence Analogue to Lexical Semantics). In COALS, there is no preceding/succeeding distinction within the moving window, and the model uses a co-occurrence association function based on Pearson's correlation to factor out the confounding of chance co-occurrence due to frequency. Hence, the similarity between two words is their normalized covariational pattern over all context words. In addition, COALS performs SVD on this matrix. Although these are quite straightforward modifications to HAL, COALS heavily outperforms its predecessor on human tasks such as semantic categorization (Riordan & Jones, 2011).

A similar moving window model was used by McDonald and Lowe (1998) to simulate semantic priming. In their model, there is no predecessor/successor distinction, but all words are simply represented by their co-occurrence in the moving window with a small number of predefined "context words." Although many applications of HAL tabulate the entire matrix and then discard the 90% of column vectors with the least amount of variance, McDonald and Lowe's context-word approach specifies the context words (columns) a priori, and it tabulates row vectors for each target word but only in relation to the predefined context words. This context word approach, in which as few as 100 context words are used as the columns, has also been successfully used by Mitchell et al. (2008) to predict fMRI brain activity associated with humans making semantic judgments about nouns. Slightly more modern versions of these context-word models use log likelihood or log odds rather than raw co-occurrence frequency as matrix elements (Lowe & McDonald, 2000), and some even apply SVD to the word-by-word matrix (e.g., Budiu, Royer, & Pirolli, 2007) to bring out latent word relationships.

Moving window models such as HAL have surprised the field with the array of "deep" semantic tasks they can explain with relatively simple learning algorithms based on counting repetitions. They also tie large-scale models of statistical semantics with other learning models such as compound cuing (McKoon & Ratcliff, 1992) and cross-situational word learning (Smith & Yu, 2008).

Random Vector Models

An entirely different take on contextual representation is seen in models that use random representations for words that gradually develop semantic structure through repeated episodes of the

models of semantic memory 241

word in a text corpus. The mechanisms used by these models are theoretically tied to mathematical models of associative memory. For this reason, random vector models tend to capitalize on both contextual co-occurrence as LSA does, and also associative position relative to other words as models like HAL and COALS do, representing both in a composite vector space.

In the Bound Encoding of the Aggregate Language Environment model (BEAGLE; Jones & Mewhort, 2007), semantic representations are gradually acquired as text is experienced in sentence chunks. The model is based heavily on mechanisms from Murdock's (1982) theory of item and associative memory. The first time a word is encountered, it is assigned a random initial vector known as its environmental vector, e_i. This vector is the same each time the word is experienced in the text corpus, and is assumed to represent the relatively stable physical characteristics of perceiving the word (e.g., its visual form or sound). The random vector assumption is obviously an oversimplification, assuming that all words are equally similar to one another in their environmental form (e.g., dog is as similar to dug as it is to carburetor), but see Cox, Kachergis, Recchia, and Jones (2010) for a version of the model that builds in preexisting orthographic structure.

In BEAGLE, each time a word is experienced in the corpus, its memory vector, m_i, is updated as the sum of the random environmental vectors for the other words that occurred in context with it, ignoring high-frequency function words. Hence, in the short phrase "A dog bit the mailman," the memory representation for dog is updated as m_dog = e_bit + e_mailman. In the same sentence, m_bit = e_dog + e_mailman and m_mailman = e_dog + e_bit are encoded. Even though the environmental vectors are random, the memory vectors for each word in the phrase have some of the same random environmental structure summed into their memory representations. Hence, m_dog, m_bit, and m_mailman all move closer to one another in memory space each time they directly co-occur in contexts. In addition, latent similarity naturally emerges in the memory matrix; even if dog and pitbull never directly co-occur with each other, they will become similar in the memory space if they tend to occur with the same words (i.e., similar contexts). This allows higher-order abstraction, achieved in LSA by SVD, to emerge in BEAGLE naturally from simple Hebbian summation. Rather than reducing dimensionality after constructing a matrix, BEAGLE sets dimensionality a priori, and the semantic information is distributed across dimensions evenly. If fewer or more dimensions are selected (provided a critical mass is used), the information is simply distributed over fewer or more dimensions. Multiple runs of a model on the same corpus may produce very different vectors (unlike LSA or HAL), but the overall similarity structure of the memory matrix on multiple runs will be remarkably similar. In this sense, BEAGLE has considerable similarity to unsupervised connectionist models.

The use of random environmental representations allows BEAGLE to learn information as would LSA, but in a continuous fashion and without the need for SVD. But the most interesting aspect of the model is that the random representations allow the model to encode word order information in parallel by applying an operation from signal processing known as convolution to bind together vectors for words in sequence. Convolution-based memory models have been very successful as models of both vision and paired-associate memory, and BEAGLE extends this mechanism to encode n-gram chunk information in the word's representation. The model uses circular convolution, which binds together two vectors, with dimensionality n, into a third vector of the same dimensionality:

for i = 0 to n − 1:  z_i = Σ_{j=0}^{n−1} x_{j mod n} ∗ y_{(i−j) mod n}    (2)

BEAGLE applies this operation recursively to create an order vector representing all the environmental vectors that occur in sequences around the target word, and this order vector is also summed into the word's memory vector. Hence, the memory vector becomes a pattern of elements that reflects the word's history of co-occurrence with, and position relative to, other words in sentences. Words that appear in similar contexts and similar syntactic roles within sentences will become progressively more similar. Jones et al. (2006) have demonstrated how this integration of context and order information in a single vector representation allows the model to better account for patterns in semantic priming data.

An additional benefit of having order information encoded in a word's memory vector is that the convolution mechanism used to encode sequence information may be inverted to decode sequential expectancies for a word from its learned history. This decoding operates in a similar fashion to how Murdock (1982) retrieves an associated target given a cue in paired-associate learning. The model can make inferences about likely transitions preceding

or following a word and can build up expectancies for which words should be upcoming in sentence processing tasks using the same associative mechanism it uses for learning (see Jones & Mewhort, 2007). Although it only learns lexical semantic structure, BEAGLE naturally displays complex rulelike syntactic behavior as an emergent property of its lexicon. Further, it draws a theoretical bridge between models of lexical semantics and associative memory, suggesting that they may be based on the same cognitive mechanisms.

A similar approach to BEAGLE, known as random indexing, has been taken by Kanerva and colleagues (Kanerva, 2009; Kanerva, Kristoferson, & Holst, 2000). Random indexing uses similar principles to BEAGLE's summation of random environmental vectors, but is based on Kanerva's (1988) theory of sparse distributed memory. The initial vector for a word in random indexing is a sparse binary representation, a very high dimensional vector in which most elements are zeros with a small number of random elements switched to ones (a.k.a., a "spatter code"). A word's memory representation is then a sum of initial vectors for the other words with which it has appeared in contexts. Words that are semantically similar will tend to be additive on the random elements that they share nonzero values on, which leads to a similarity structure remarkably similar to LSA, but without the need for SVD.

Sahlgren, Holst, & Kanerva (2008) have extended random indexing to encode order information as does BEAGLE in their Random Permutation Model (RPM). The RPM encodes contextual information the same way as standard random indexing. Rather than convolution, it uses a permutation function to encode the order of words around a target word. The permutation function may be applied recursively to encode multiple words at multiple positions around the target word, and this order vector is also added to the word's memory representation. Like BEAGLE, a word's memory vector is a distributed pattern that contains information about its co-occurrence with and position relative to other words. However, in RPM this representation is a sparse hyperdimensional vector that contains less noise than does BEAGLE's dense Gaussian vectors. In comparative simulations, RPM has been shown to outperform BEAGLE on simple associative tasks (Recchia, Jones, Sahlgren, & Kanerva, 2010).

Howard and colleagues (e.g., Howard, Shankar, & Jagadisan, 2011) have taken a different approach to learning semantic representations, binding local item representations to a gradually changing representation of context by modifying the Temporal Context Model (TCM; Howard & Kahana, 2002) to learn semantic information from a text corpus. The TCM uses static vectors representing word form, similar to RPM's initial vectors or BEAGLE's environmental vectors. However, the model binds words to temporal context, a representation that changes gradually with time, similar to oscillator-based systems. In this sense, the model is heavily inspired by hippocampal function. Encountering a word reinstates its previous temporal contexts when encoding its current state in the corpus. Hence, whereas LSA, HAL, and BEAGLE all treat context as a categorical measure (documents, windows, and sentences, respectively, are completely different contexts), TCM treats context as a continuous measure that is gradually changing over time. In addition, although all the aforementioned models are essentially batch learners or ignore previous semantic learning when encoding a word, a word's learned history in TCM contributes to its future representation. This is a unique and important feature of TCM compared to other models.

Howard et al. (2011) trained a predictive version of TCM (pTCM) on a text corpus to compare to established semantic models. The pTCM continuously attempts to predict upcoming words based on reinstated temporal context. In this sense, the model has many features in common with both BEAGLE and SRNs (Elman, 1990), allowing it to represent both context and order information within the same composite representation. Howard et al. demonstrate impressive performance from pTCM on linguistic association tasks. In addition, the application of TCM in general to semantic representation makes a formal link to mechanisms of episodic memory (which at its core, TCM is), as well as findings in cognitive neuroscience (see Polyn & Kahana, 2008).

Probabilistic Topic Models

Considerable attention in the cognitive modeling literature has recently been placed on Bayesian models of cognition (see Austerweil et al., this volume), and mechanisms of Bayesian inference have been successfully extended to semantic memory as well. Probabilistic topic models (Blei, Ng, & Jordan, 2003; Griffiths, Steyvers, & Tenenbaum, 2007) operate in a similar fashion to LSA, performing statistical inference to reduce the dimensionality of a term-by-document matrix. However, the

theoretical mechanisms behind the inference and representation in topic models differ markedly from LSA and other spatial models.

An assumption of a topic model is that documents are generated by mixtures of latent "topics," in which a topic is a probability distribution over words. Although LSA makes a similar assumption that latent semantic components can be inferred from observable co-occurrences across documents, topic models go a step further, specifying a fully generative model for documents (a procedure by which documents may be generated). The assumption is that when constructing documents, humans are sampling a distribution over universal latent topics. For example, one might construct a document about a recent beetle infestation by mixing topics about insects, forests, the ecosystem, etc., with varying weights. To generate each word within this document, one samples a topic according to the document's mixture weights, and then samples words from that topic's probability distribution over words.

To train the model, Bayesian inference is used to reverse the generative process: Assuming that topic mixing is what generates documents, the task of the model is to invert the process and statistically infer the set of topics that were responsible for generating a given set of documents. The formal instantiation of a topic model can be technically intimidating to the novice modeler, based on Latent Dirichlet Allocation algorithms, Markov Chain Monte Carlo algorithms, etc. (see Griffiths et al., 2007; Griffiths, Steyvers, Blei, & Tenenbaum, 2005). However, it is important to note that the theoretical ideas underlying the model are actually quite simple and elegant and are based on the same ideas posited for how children infer unseen causes for observable events (Tenenbaum, Kemp, Griffiths, & Goodman, 2011).

Consider the analogy of a dermatologist: Given that disease X is present, symptoms A, B, and C are expected to manifest. The task of a dermatologist is one of causal inference, however: given a set of co-occurring symptoms she must infer the unseen disease or diseases that produced the observed data. Over many instances of the same co-occurring symptoms, she can infer the likelihood that they are the result of a common cause. The topic model works in an analogous way, but on a much larger scale of inference and with mixtures of causal variables. Given that certain words tend to co-occur in contexts and this pattern is consistent over many contexts, the model infers the likely latent "topics" that are responsible for generating the co-occurrence patterns, in which each document is a probabilistic mixture of these topics. Each topic is a probability distribution over words, and a word's meaning can be captured by the probability that it was generated by each topic (just as each disease would be a probability distribution over symptoms, and a symptom is a probability distribution over possible diseases that generated it).

Figure 11.3, reproduced from Steyvers and Griffiths (2007), illustrates this process. Assuming that document co-occurrences are being generated by the process on the left, the topic model attempts to statistically infer (on the right) the most likely topics and mixtures that would have generated the observed data. It is important to note that, like LSA, topic models tend to assume a simple bag-of-words representation of a document, neglecting word-order information (but see Andrews & Vigliocco, 2010; Griffiths et al., 2005). Similar to LSA, each document in the original co-occurrence matrix may be reconstructed by determining the document's distribution over N topics (reflecting its gist, g), using this distribution to select a topic for each word w_i, and then generating a word from the distribution of words conditioned on the topic:

P(w_i | g) = Σ_{z_i=1}^{N} P(w_i | z_i) P(z_i | g),    (3)

where w is the distribution of words over topics, and z is the distribution of topics over words. In practice, topic models construct a prior value on the degree of mixing of topics in a document, and then estimate the probability distributions of topics over words and documents over topics using Gibbs sampling (Griffiths & Steyvers, 2004).

The probabilistic inference machinery behind topic models results in at least three major differences in topic models when compared to other distributional models. First, as mentioned earlier, topic models are generative. Second, it is often suggested that the topics themselves have a meaningful interpretation, such as finance, medicine, theft, and so on, whereas the components of LSA are difficult to interpret, and the components of models like BEAGLE are purposely not interpretable in isolation from the others. It is important to note, however, that since the number of topics (and the value of the priors) is set a priori by the theorist, there is often a considerable amount of hand-fitting and intuition that can go into constructing topics that are meaningful (similar to "labeling" factors

Fig. 11.3 Illustration of the generative process (left) and the problem of statistical inference (right) underlying topic models. (Reproduced from Steyvers & Griffiths, 2007). Steyvers, M., & Griffiths, T. (2008). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.) Handbook of Latent Semantic Analysis, NJ: Erlbaum. With kind permission from the Taylor and Francis Group.
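The generative story in Figure 11.3 is easy to state as code. The sketch below uses the figure's money/bank/river/stream vocabulary, but the topic distributions, mixture weights, and function names are illustrative toy choices, not fitted values from any published model:

```python
import random

# Two toy topics in the spirit of Figure 11.3: each topic is a
# probability distribution over words (values are illustrative).
TOPICS = {
    1: {"money": 0.4, "bank": 0.4, "loan": 0.2},
    2: {"river": 0.4, "bank": 0.3, "stream": 0.3},
}

def generate_document(mixture, length, seed=None):
    """Left panel: for each token, sample a topic from the document's
    mixture weights, then sample a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        topic = rng.choices(list(mixture), weights=mixture.values())[0]
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=dist.values())[0])
    return words

def word_probability(word, mixture):
    """Eq. 3: P(w | g) is the mixture-weighted sum, over topics z,
    of each topic's probability of generating the word."""
    return sum(p_z * TOPICS[z].get(word, 0.0) for z, p_z in mixture.items())

doc = generate_document({1: 0.5, 2: 0.5}, length=8, seed=0)
p_bank = word_probability("bank", {1: 0.5, 2: 0.5})  # 0.5*0.4 + 0.5*0.3 = 0.35
```

Statistical inference (the right panel of the figure) runs this process in reverse: given only documents like `doc`, Gibbs sampling recovers likely values for the topics and for each document's mixture weights.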

Box 1 So Which Model Is Right?

It is tempting to think of different distributional models as competing "brands." However, a potentially more fruitful approach is to consider each specific model as a point in a parameter space, as one would with other cognitive models. Each model is really just a particular set of decisions made to formalize the distributional theory of "knowing a word by the company it keeps" (Firth, 1957), and no single model has emerged victorious at accounting for the wide range of semantic behavioral data. Each model has its merits and shortcomings.

How should distributional models be compared? If a model is being proposed as a psychological model, it is important to identify the model's subprocesses. How do those subprocesses contribute to how the model works? How are they related to other psychological theories? And how do they contribute to the model's ability to predict behavioral data? For example, LSA and HAL vary in a large number of ways (see Table 11.1). Studies that perform simple model comparisons end up confounding these differences, leaving us unsure what underlying psychological claim is being tested. Most model differences can be ascribed to one of three categories, each corresponding to important differences in the underlying psychological theory:

1. Representational structure: What statistical information does the model pay attention to, and how is this information initially represented?
2. Representational transformations: By what function are the representations transformed to produce a semantic space?
3. Comparison process: How is the semantic space queried, and how is the semantic information, relations, or similarity used to model behavioral data?

The HAL model defines its representations in terms of a word-by-word co-occurrence matrix, whereas the LSA model defines its representation in terms of a word-counts-within-documents matrix. This difference corresponds to a long tradition of different psychological theories. HAL's word-word co-occurrences are akin to models that propose representations based on associations between specific stimuli (such as classical associationist theories of learning). In contrast, LSA's word-by-document representation proposes representations based on associations between a stimulus and abstract pointers to the event in which those stimuli participate (similar to classic context association theories of learning).

A number of studies have begun comparing model performance as a function of differences in these subprocesses (e.g., Bullinaria & Levy, 2012; Shaoul & Westbury, 2010), but much more research is needed before any firm conclusions can be made.
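The three subprocess categories of Box 1 can be made concrete as a pipeline of swappable functions. The sketch below is a minimal illustration rather than any published model: the windowed counting, conditional-probability normalization, and city-block comparison are just one arbitrary path through the parameter space Box 1 describes, and the toy corpus and window size are invented for the example.

```python
from collections import Counter

def represent(tokens, window=2):
    """Subprocess 1, representational structure: a word-by-word
    co-occurrence count matrix over a small moving window (here
    symmetric, unlike HAL's ordered preceding/succeeding variant)."""
    counts = {w: Counter() for w in set(tokens)}
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[target][tokens[j]] += 1
            counts[tokens[j]][target] += 1
    return counts

def transform(counts):
    """Subprocess 2, representational transformation: row-normalize
    raw counts into conditional probabilities."""
    out = {}
    for w, row in counts.items():
        total = sum(row.values()) or 1
        out[w] = {c: n / total for c, n in row.items()}
    return out

def compare(space, w1, w2, vocab):
    """Subprocess 3, comparison process: city-block distance between
    two words' vectors (smaller means more similar)."""
    return sum(abs(space[w1].get(v, 0) - space[w2].get(v, 0)) for v in vocab)

tokens = "the dog bit the mailman and the cat bit the mailman".split()
space = transform(represent(tokens))
d = compare(space, "dog", "cat", set(tokens))
```

Swapping `represent` for a word-by-document counter, or `compare` for cosine similarity, moves the pipeline toward other cells of Table 11.1; the point of Box 1 is that such swaps can be tested one subprocess at a time.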

in a factor analysis). Third, words in a topic model are represented as probability distributions rather than as points in semantic space; this is a key distinction between topic models and the earlier spatial models. It allows topic models to naturally display asymmetric associations, which are commonly seen in free association data but require additional assumptions to explain with spatial models (Griffiths et al., 2007; but see Jones, Gruenenfelder, & Recchia, 2011). Representing a word's meaning with a probability distribution also naturally allows for polysemy in the representation compared to vector representation models that collapse multiple meanings to a single point. For these reasons, topic models have been shown to produce better fits to free association data than LSA, and they are able to account for disambiguation, word-prediction, and discourse effects that are problematic for LSA (Griffiths et al., 2007).

Box 2 Semantic Memory Modeling Resources

A chapter on semantic models would seem incomplete without some code! Testing models of semantic memory has become much easier due to an increase in semantic modeling resources. There are now a wide variety of software packages that provide the ability to construct and test semantic models. The software packages vary in terms of their ease of installation and use, flexibility, and performance. In addition to the software packages, a limited number of Web-based resources exist for doing simple comparisons online. You may test models on standardized datasets, train them on your own corpora for semantic exploration, or use them for generating stimuli.

Software Packages

• HiDEx (http://www.psych.ualberta.ca/westburylab/downloads/HiDEx.download.html): A C++ implementation of the HAL model; it is useful for constructing large word-by-word co-occurrence matrices and testing a wide variety of possible parameters.
• SuperMatrix (http://semanticore.org/supermatrix/): A python implementation of a large number of semantic space model transformations (including PCA/SVD, Latent Dirichlet Allocation, and Random Vector Accumulation) on both word-by-word and word-by-document spaces. SuperMatrix was designed to emphasize the exchangeability of various sub-processes within semantic models (see Box 1), to allow isolation and testing of the effects of specific model components.
• GenSim (http://radimrehurek.com/gensim/): A python module that is very fast and efficient for constructing and testing word-by-document models, including LSA (reduced using SVD) and Topics (reduced using Latent Dirichlet Allocation).
• S-Space (https://github.com/fozziethebeat/S-Space): A Java-based implementation of a large number of semantic space models, including HAL, LSA, BEAGLE, and COALS.
• SEMMOD (http://mall.psy.ohio-state.edu/wiki/index.php/Semantic_Models_Package_(SEMMOD)): A python package to implement and compare many of the most common semantic models.
• Word-Similarity (https://code.google.com/p/wordsimilarity/wiki/train): A tool to explore and visualize semantic spaces, displayed as directed graphical networks.

Web-Based Resources

• http://lsa.colorado.edu: The original LSA website provides the ability to explore Latent Semantic Analysis with a wide variety of different metrics, including word-word similarities, similarities of passages of text to individual words, and similarities of passages of texts to each other.
• http://semanticore.org: The Semanticore website is a web portal designed to bring data from many semantic models and psycholinguistic databases under one roof. Users can obtain frequency and co-occurrence statistics from a wide variety of corpora, as well as semantic similarities from a number of different semantic memory models, including HAL, LSA, BEAGLE, and Probabilistic Topic Models.

Retrieval-Based Semantics

Kwantes (2005) proposed an alternative approach to modeling semantic memory from distributional structure. Although not named in his

publication, Kwantes's model is commonly referred to as the constructed semantics model (CSM), a name that is paradoxical given that the model posits that there is no such thing as semantic memory. Rather, semantic behavior exhibited by the model is an emergent artifact of retrieval from episodic memory. Although all other models put the semantic abstraction mechanism at encoding (e.g., SVD, Bayesian inference, vector summation), CSM actually encodes the episodic matrix and performs abstraction as needed when a word is encountered. CSM is based heavily on Hintzman's (1986) Minerva 2 model, which was used as an existence proof that a variety of behavioral effects that had been used to argue for two distinct memory stores (episodic and semantic) could naturally be produced by a model that only had memory for episodes. So-called "prototype effects" were simply an artifact of averaging at retrieval in the model, not necessarily evidence of a semantic store. CSM extends Minerva 2 almost exactly to a text corpus. In CSM, the memory matrix is the term-by-document matrix (i.e., it assumes perfect memory of episodes). When a word is encountered in the environment, its semantic representation is constructed as an average of the episodic memories, weighted by their contextual similarity to the target. The result is a vector that has higher-order semantic similarities accumulated from the lexicon. This semantic vector is similar in structure to the memory vector learned in BEAGLE by context averaging, but the averaging is done on the fly; it is not encoded or stored.

Although retrieval-based models have received less attention in the literature than models like LSA, they represent a very important link to other instance-based models, especially exemplar models of recognition memory and categorization (e.g., Nosofsky, 1986). The primary reason limiting their uptake in model applications is likely due to the heavy computational expense required to actually simulate their process (Stone, Dennis, & Kwantes, 2011).

Grounding Semantic Models

Semantic models, particularly distributional models, have been criticized as psychologically implausible because they learn from only linguistic information and do not contain information about sensorimotor perception, contrary to the grounded cognition movement (for a review, see de Vega, Glenberg, & Graesser, 2008). Hence, representations in distributional models are not a replacement for feature norms. Feature-based representations contain a great deal of sensorimotor properties of words that cannot be learned from purely linguistic input, and both types of information are core to human semantic representation (Louwerse, 2008). Riordan and Jones (2011) recently compared a variety of feature-based and distributional models on semantic clustering tasks. Their results demonstrated that whereas there is information about word meaning redundantly coded in both feature norms and linguistic data, each has its own unique variance and the two information sources serve as complementary cues to meaning.

Research using recurrent networks trained on child-directed speech corpora has found that pretraining a network with features related to children's sensorimotor experience produced significantly better word learning when subsequently trained on linguistic data (Howell, Jankowicz, & Becker, 2005). Durda, Buchanan, and Caron (2009) trained a feed-forward network to associate LSA-type semantic vectors with their corresponding activation of features from McRae, Cree, Seidenberg, and McNorgan's (2005) norms. Given the semantic representation for dog, the model attempts to activate the correct output features and inhibit incorrect output features. After training, the network was able to infer the correct pattern of perceptual features for words that were not used in training because of their linguistic similarity to words that were learned.

Several recent probabilistic topic models have also explored parallel learning of linguistic and featural information (Andrews, Vigliocco, & Vinson, 2009; Baroni, Murphy, Barba, & Poesio, 2010; Steyvers, 2009). Given a word-by-document representation of a text corpus and a word-by-feature representation of feature production norms, the models learn a word's meaning by simultaneously considering inference across documents and features. This enables learning from joint distributional information: If the model learns from the feature norms that sparrows have beaks, and from linguistic experience that sparrows and mockingbirds are distributionally similar, it will infer that mockingbirds also have beaks, despite having no feature vector for mockingbird. Integration of linguistic and sensorimotor information allows the models to better fit human semantic data than a model trained with only one source (Andrews et al., 2009). This information integration is not unique to Bayesian models but can also be accomplished

Table 11.1. Highly cited semantic models and the specific subprocesses that comprise the models.

Model Representational Structure Representational Transformation Comparison Process Unit Type Unit Span Normalization Abstraction HAL Ordered word-by-word Distance weighted 10- Conditional probabilities None City block distance similarity co-occurrence matrix word window (matrix row sum) COALS Ordered word-by-word 10-word window Correlational normalization Principle components analysis Correlational similarity co-occurrence matrix LSA Unordered word-by-document Predefined document Log entropy Singular value decomposition Cosine similarity count matrix Topic Models Unordered word-by-document Predefined document None Latent Dirichlet allocation Inner product similarity count matrix BEAGLE Ordered word-by-word matrix Sentence window None Random vector accumulation Cosine similarity within random vector models (Jones & Recchia, structure of the environment and to understand 2010; Vigliocco, Vinson, Lewis, & Garrett, 2004) the mechanisms that may be used to construct and retrieval-based models (Johns & Jones, 2012). semantic representations. Connectionist models are an excellent compliment, elucidating our under- standing of semantic processing, and how semantic Compositional Semantics structure interacts with other cognitive systems and The models we have considered thus far are tasks. An obvious and important requirement for designed to extract the meaning of individual terms. the future is to start to bring these insights together, However, the sentence “John loves Mary” is not and several hybrid models are now emerging in the just the sum of the words it contains. Rather literature. “John” is bound to the role LOVER and “Mary” Several important themes have emerged that is bound to the role LOVEE. The study of how are common to both the connectionist and dis- sentence structure determines these bindings is tributional literatures. The first is data reduction. called compositional semantics. 
Recent work has Whatever specific mechanism humans are using to begun to explore mechanisms for compositional construct conceptual and propositional knowledge semantics by applying learning mechanisms to the from experience, it is likely that this mechanism already learned lexicon of a distributional model learns by focusing on important statistical factors (Mitchell & Lapata, 2010). that are constant across contexts, and by throwing Dennis (2004, 2005) argued that extracting away factors that are idiosyncratic to specific propositional structure from sentences revolves contexts. In a sense, capacity constraints on human around the distinction between syntagmatic and encoding, storage, and retrieval may give rise to our paradigmatic associations. Syntagmatic associations incredible ability to construct and use meaning. occur between words that appear together in utter- ances (e.g., run fast). Paradigmatic associations oc- A second common theme is the importance cur between words that appear in similar contexts, of data scale in semantic modeling. In both but not necessarily in the same utterances (e.g., connectionist and distributional models, the issue deep and shallow). The syntamatic paradigmatic of data scale versus mechanistic complexity has model proposes that syntagmatic associations are been brought to the forefront of discussion in the used to determine that words could have filled literature. A consistent emerging theme is that a particular slot within a sentence. The set of simpler models tend to give the best explanation these words form role vectors that are then bound of human data, both in terms of parsimony and to fillers by paradigmatic associations to form a quantitative fit to the data, when they are trained propositional representation of the sentence. The on linguistic data that is on a realistic scale to syntagmatic paradigmatic mechanism has been what humans experience. 
For example, although shown to be capable of accounting for a wide range simple context-word moving-window models are of sentence-processing phenomena. Furthermore, considerably simpler than LSA and do not perform it is capable of taking advantage of regularities well at small data scales, they are capable of in the overlap of role patterns to create implicit scaling up to learn from human-scale amounts of inferences that Dennis (2005) claimed are respon- linguistic data (a scale not necessarily possible to sible for the robustness of human commonsense learn with LSA), and consistently outperform more reasoning. complex models such as LSA with large data (e.g., Louwerse, 2011; Recchia & Jones, 2009). This leads to potential concern that earlier theoretical Common Lessons and Future Directions advancements with models trained on so-called “toy Models of semantic memory have seen impres- datasets” (artificial language corpora constructed to sive developments over the past two decades that test the model’s structural learning) may have been have greatly advanced our understanding of how overly complex. To fit human behavioral data with humans create, represent, and use meaning from a corpus that is far smaller and without the deep experience. These developments have occurred complexities inherent in real language, the model thanks in part to advances in other areas, such as must necessarily be building complexity into the ar- machine learning and to better large-scale norms chitecture and mechanism whereas humans may be of semantic data on which to fit and compare the using a considerably simpler mechanism, offloading models. In general, distributional models have been considerable statistical complexity already present in successfully used to better explore the statistical their linguistic environment.
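The subprocess recipe in Table 11.1 can be made concrete with a miniature LSA-style pipeline. The sketch below is our own toy illustration, not code from any of the cited models: the four-document corpus and all variable names are invented, and the log-entropy normalization step is omitted for brevity. It builds an unordered word-by-document count matrix, abstracts it with a truncated singular value decomposition, and compares words by cosine similarity.

```python
import numpy as np

# Toy corpus: each "document" is one string (illustrative only).
docs = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the boat sailed the sea",
    "the ship sailed the sea",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Unordered word-by-document count matrix (LSA's representational structure).
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        M[idx[w], j] += 1

# Abstraction step: singular value decomposition, M = U S V^T,
# truncated to the k largest latent dimensions.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]        # low-dimensional word representations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that occur in similar documents acquire similar latent vectors,
# even though "boat" and "ship" never co-occur in the same document.
print(cosine(word_vecs[idx["boat"]], word_vecs[idx["ship"]]))  # ~1.0
print(cosine(word_vecs[idx["boat"]], word_vecs[idx["dog"]]))   # ~0.0
```

Note the key LSA point the example preserves: the similarity between "boat" and "ship" is induced entirely by the dimensionality reduction, since their raw count vectors are orthogonal.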

A third common theme is that complex semantic structures and behaviors may be an emergent property of the lexicon. Emergence is a key property of connectionist models, and we have seen that complex representations of schemata, hierarchical categories, and syntactic processing may be emergent consequences of many connectionist models (e.g., Rogers & McClelland, 2004). But emergence is also a natural consequence of distributional models. In several cases, the same mechanisms used to learn semantic representations may be applied to the learned representations to simulate complex behaviors, such as BEAGLE's ability to model sentence comprehension as an emergent property of order information distributed across the lexicon (Jones & Mewhort, 2007). Topic models also possess a natural mechanism for producing asymmetric similarity and polysemous processing through conditional inference.

Learning to organize the mental lexicon is one of the most important cognitive functions across development, laying the fundamental structure for future semantic learning and communicative behavior. Semantic modeling has a very promising future, with potential to further our understanding of the basic cognitive mechanisms that give rise to complex meaning structures, and of how these mental representations are used in a wide range of higher-order cognitive tasks.

Acknowledgments

This work was supported by National Science Foundation grant BCS-1056744 and National Institutes of Health grant R01MH096906 to MNJ, and Defence Research and Development Canada grant W7711-067985 to SD. JW was supported by postdoctoral training grant NIH T32 DC000012.

Note

1. One interpretation of feature comparison given by Rips et al. (1973) was also spatial distance.

Glossary

Compositional Semantics: The process by which a complex expression (e.g., a phrase or sentence) is constructed from the meanings of its constituent concepts.

Concept: A mental representation generalized from particular instances, and knowledge of its similarity to other concepts.

Connectionist Model: A model that represents knowledge as a weighted network of interconnected units. Behavioral phenomena are an emergent process of the full network.

Context: In semantic models, context is typically considered the "window" within which two words may be considered to co-occur, and it is one of the major theoretical differences between distributional models. Context may be considered to be discrete units, such as sentences or documents, or it may be more continuous, such as in moving-window or temporal context models.

Distributional Model: A general approach to concept learning and representation from statistical redundancies in the environment.

Dynamic Network: A connectionist network whose architecture involves bi-directionality, feedback, or recurrent connectivity.

Feature Comparison Model: A classic model of semantic memory that represents concepts as vectors of binary features representing the presence or absence of features. For example, the has_wings element would be turned on for ROBIN, but off for GOLDEN RETRIEVER.

Paradigmatic and Syntagmatic Relations: Paradigmatic similarity between two words emphasizes their synonymy or substitutability (bee-wasp), whereas syntagmatic similarity emphasizes associative or event relations (e.g., bee-honey, wasp-sting).

Proposition: A mental representation of conceptual relations that may be evaluated to have a truth value, for example, knowledge that birds have wings.

Random Vector Model: A semantic model that begins with some sort of randomly generated vector to initially represent a concept. Over linguistic experience, an aggregating function gradually produces similar vector patterns among words that are semantically related. Such models allow for study of the time course of semantic acquisition.

Semantic Memory: Memory for word meanings, facts, concepts, and general world knowledge. Typically not tied to a specific event.

Semantic Network: A classic graphical model of semantic memory that represents concepts as nodes and semantic relations as labeled edges between the nodes. Often, the hypothetical process of spreading activation is used to simulate behavioral data such as semantic priming from a semantic network.

Singular-Value Decomposition: A statistical method of factorizing an m × n matrix, M, into an m × m unitary matrix, U, an m × n diagonal matrix, Σ, with diagonal entries that are the singular values, and an n × n unitary matrix, V. The original matrix may be recomposed as M = UΣV^T, where V^T is the transpose of V.

Spatial Model: A model that represents word meaning as a point in a multidimensional space, and that typically applies a geometric function to express conceptual similarity.

Supervised and Unsupervised Learning: Supervised learning typically trains the model on a set of labeled exemplars (i.e., the true output of each training exemplar is known), whereas in unsupervised learning the model must

250 higher level cognition

discover structure in the data without the benefit of known labels.

Topic Model: A generative probabilistic model that uses Bayesian inference to abstract the mental "topics" used to compose a set of documents.

References

Aakerlund, L., & Hemmingsen, R. (1998). Neural networks as models of psychopathology. Biological Psychiatry, 43, 471–482.
Allen, J., & Seidenberg, M. S. (1999). The emergence of grammaticality in connectionist networks. In B. MacWhinney (Ed.), Emergentist approaches to language: Proceedings of the 28th Carnegie symposium on cognition (pp. 115–151). Hillsdale, NJ: Erlbaum.
Andrews, M., & Vigliocco, G. (2010). The hidden-Markov topic model: A probabilistic model of semantic representation. Topics in Cognitive Science, 2, 101–113.
Andrews, M., Vigliocco, G., & Vinson, D. P. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116, 463–498.
Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34, 222–254.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Borovsky, A., & Elman, J. (2006). Language input and semantic categories: A relation between cognition and early word learning. Journal of Child Language, 33(4), 759–790.
Botvinick, M., & Plaut, D. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired sequential action. Psychological Review, 111, 395–429.
Bower, G. H. (1970). Organizational factors in memory. Cognitive Psychology, 1, 18–46.
Braver, T. S., Barch, D. M., & Cohen, J. D. (1999). Cognition and control in schizophrenia: A computational model of dopamine and prefrontal function. Biological Psychiatry, 46, 312–328.
Budiu, R., Royer, C., & Pirolli, P. L. (2007). Modeling information scent: A comparison of LSA, PMI and GLSA similarity measures on common tests and corpora. Proceedings of Recherche d'Information Assistée par Ordinateur (RIAO), Pittsburgh, PA.
Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39, 510–526.
Burgess, C., & Lund, K. (2000). The dynamics of meaning in memory. In E. Dietrich & A. B. Markman (Eds.), Cognitive dynamics: Conceptual and representational change in humans and machines (pp. 117–156). Mahwah, NJ: Erlbaum.
Chan, A. S., Salmon, D. P., & Butters, N. (1998). Semantic network abnormalities in patients with Alzheimer's disease. In Fundamentals of neural network modeling: Neuropsychology and cognitive neuroscience (p. 381).
Chang, F., Dell, G. S., & Bock, K. (2006). Becoming syntactic. Psychological Review, 113(2), 234–272.
Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychological Review, 99, 45–77.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82, 407–428.
Collins, A. M., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning & Verbal Behavior, 8(2), 240–247.
Cox, G. E., Kachergis, G., Recchia, G., & Jones, M. N. (2011). Toward a scalable holographic word-form representation. Behavior Research Methods, 43(3), 602–615.
Cree, G. S., McRae, K., & McNorgan, C. (1999). An attractor model of lexical conceptual processing: Simulating semantic priming. Cognitive Science, 23, 371–414.
Dell, G. S., Chang, F., & Griffin, Z. M. (1999). Connectionist models of language production: Lexical access and grammatical encoding. Cognitive Science, 23(4), 517–542.
Dell, G. S., Juliano, C., & Govindjee, A. (1993). Structure and content in language production: A theory of frame constraints in phonological speech errors. Cognitive Science, 17, 149–195.
Dennis, S. (2004). An unsupervised method for the extraction of propositional information from text. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5206–5213.
Dennis, S. (2005). A memory-based theory of verbal cognition. Cognitive Science, 29, 145–193.
De Vega, M., Glenberg, A. M., & Graesser, A. C. (2008). Symbols and embodiment: Debates on meaning and cognition. Oxford, UK: Oxford University Press.
Devlin, J. T., Gonnerman, L. M., Anderson, E. S., & Seidenberg, M. S. (1998). Category specific semantic deficits in focal and widespread brain damage: A computational account. Journal of Cognitive Neuroscience, 10, 77–94.
Durda, K., Buchanan, L., & Caron, R. (2009). Grounding co-occurrence: Identifying features in a lexical co-occurrence model of semantic memory. Behavior Research Methods, 41, 1210–1223.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Farah, M. J., & McClelland, J. L. (1991). A computational model of semantic memory impairment: Modality-specificity and emergent category-specificity. Journal of Experimental Psychology: General, 120, 339–357.
Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In Philological Society (Great Britain) (Ed.), Studies in linguistic analysis. Oxford, England: Blackwell.
Gasser, M., & Smith, L. B. (1998). Learning nouns and adjectives: A connectionist account. Language and Cognitive Processes, 13(2-3), 269–306.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.
Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems, 17, 537–544.

Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114, 211–244.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Harm, M. W., & Seidenberg, M. S. (2004). Computing the meanings of words in reading: Cooperative division of labor between visual and phonological processes. Psychological Review, 111, 662–720.
Harris, Z. (1970). Distributional structure. In Papers in structural and transformational linguistics (pp. 775–794). Dordrecht, Holland: D. Reidel.
Hebb, D. (1946). The organization of learning.
Hinton, G. E., & Shallice, T. (1991). Lesioning an attractor network: Investigations of acquired dyslexia. Psychological Review, 98, 74–95.
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411–428.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Horst, J. S., McMurray, B., & Samuelson, L. K. (2006). Online processing is essential for learning: Understanding fast mapping and word learning in a dynamic connectionist architecture. In R. Sun (Ed.), Proceedings of the 28th meeting of the Cognitive Science Society (pp. 339–334).
Howard, M. W., & Kahana, M. J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46, 269–299.
Howard, M. W., Shankar, K. H., & Jagadisan, U. K. (2011). Constructing semantic representations from a gradually changing representation of temporal context. Topics in Cognitive Science, 3, 48–73.
Howell, S., Jankowicz, D., & Becker, S. (2005). A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. Journal of Memory and Language, 53, 258–276.
Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110, 220–264.
Johns, B. T., & Jones, M. N. (2012). Perceptual inference from global lexical similarity. Topics in Cognitive Science, 4(1), 103–120.
Jones, M. N., Gruenenfelder, T. M., & Recchia, G. (2011). In defense of spatial models of lexical semantics. In Proceedings of the 33rd annual conference of the Cognitive Science Society (pp. 3444–3449).
Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55, 534–552.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114, 1–37.
Jones, M. N., & Recchia, G. (2010). You can't wear a coat rack: A binding framework to avoid illusory feature migrations in perceptually grounded semantic models. Proceedings of the 32nd Annual Cognitive Science Society, 877–882.
Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.
Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed representations with high-dimensional random vectors. Cognitive Computation, 1, 139–159.
Keil, F. C. (1989). Concepts, kinds, and cognitive development. Cambridge, MA: MIT Press.
Kinder, A., & Shanks, D. R. (2003). Neuropsychological dissociations between priming and recognition: A single-system connectionist account. Psychological Review, 110, 728–744.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. New York, NY: Cambridge University Press.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin & Review, 12, 703–710.
Laham, R. D. (2000). Automated content assessment of text using latent semantic analysis to simulate human cognition (Doctoral dissertation, University of Colorado).
Landauer, T., & Dumais, S. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of latent semantic analysis and humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th Annual Meeting of the Cognitive Science Society (pp. 412–417). Mahwah, NJ: Erlbaum.
Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). (2013). Handbook of latent semantic analysis. Psychology Press.
Li, P. (2009). Lexical organization and competition in first and second languages: Computational and neural mechanisms. Cognitive Science, 33(4), 629–664.
Li, P., Zhao, X., & MacWhinney, B. (2007). Dynamic self-organization and early lexical development in children. Cognitive Science, 31(4), 581–612.
Louwerse, M. M. (2008). Embodied relations are encoded in language. Psychonomic Bulletin & Review, 15(4), 838–844.
Louwerse, M. M. (2011). Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3, 273–302.
Lowe, W., & McDonald, S. (1998). Modelling functional priming and the associative boost. In M. A. Gernsbacher & S. D. Derry (Eds.), Proceedings of the 20th Annual Meeting of the Cognitive Science Society (pp. 675–680). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203–208.
Mandler, J. M., Bauer, P. J., & McDonough, L. (1991). Separating the sheep from the goats: Differentiating global categories. Cognitive Psychology, 23, 263–298.
McClelland, J. L., & Thompson, R. M. (2007). Using domain-general principles to explain children's causal reasoning abilities. Developmental Science, 10(3), 333–356.

McClelland, J. L., St. John, M., & Taraban, R. (1989). Sentence comprehension: A parallel distributed processing approach. Language and Cognitive Processes, 4, 287–335.
McDonald, S., & Lowe, W. (1998). Modelling functional priming and the associative boost. Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 667–680).
McKoon, G., & Ratcliff, R. (1992). Spreading activation versus compound cue accounts of priming: Mediated priming revisited. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1155–1172.
McLeod, P., Shallice, T., & Plaut, D. C. (2000). Visual and semantic influences in word recognition: Converging evidence from acquired dyslexic patients, normal subjects, and a computational model. Cognition, 74, 91–114.
McRae, K., & Cree, G. (2002). Factors underlying category-specific semantic deficits. In E. M. E. Forde & G. Humphreys (Eds.), Category-specificity in mind and brain. East Sussex, England: Psychology Press.
McRae, K., Cree, G., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547–559.
McRae, K., de Sa, V., & Seidenberg, M. S. (1997). On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126, 99–130.
McRae, K., & Jones, M. N. (2014). Semantic memory. In D. Reisberg (Ed.), The Oxford Handbook of Cognitive Psychology.
Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429.
Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K., Malave, V. L., Mason, R. A., & Just, M. A. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191–1195.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626.
Nestor, P. G., Akdag, S. J., O'Donnell, B. F., Niznikiewicz, M., Law, S., Shenton, M. E., & McCarley, R. W. (1998). Word recall in schizophrenia: A connectionist model. American Journal of Psychiatry, 155(12), 1685–1690.
Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115(1), 39.
O'Reilly, R. C., Munakata, Y., Frank, M. J., Hazy, T. E., and Contributors (2012). Computational cognitive neuroscience. Wiki Book, 1st Edition. URL: http://ccnbook.colorado.edu.
Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49, 197–237.
Osgood, C. E. (1971). Explorations in semantic space: A personal diary. Journal of Social Issues, 27, 5–62.
Perfetti, C. (1998). The limits of co-occurrence: Tools and theories in language research. Discourse Processes, 25, 363–377.
Plaut, D. C. (1999). A connectionist approach to word reading and acquired dyslexia: Extension to sequential processing. Cognitive Science, 23, 543–568.
Plaut, D. C. (2002). Graded modality-specific specialisation in semantics: A computational account of optic aphasia. Cognitive Neuropsychology, 19(7), 603–639.
Plaut, D. C., & Booth, J. R. (2000). Individual and developmental differences in semantic priming: Empirical and computational support for a single-mechanism account of lexical processing. Psychological Review, 107, 786–823.
Polyn, S. M., & Kahana, M. J. (2008). Memory search and the neural representation of context. Trends in Cognitive Sciences, 12(1), 24–30.
Recchia, G. L., & Jones, M. N. (2009). More data trumps smarter algorithms: Training computational models of semantics on very large corpora. Behavior Research Methods, 41, 657–663.
Recchia, G., Jones, M., Sahlgren, M., & Kanerva, P. (2010). Encoding sequential information in vector space models of semantics: Comparing holographic reduced representation and random permutation.
Regier, T. (2005). The emergence of words: Attentional learning of form and meaning. Cognitive Science, 29, 819–865.
Riordan, B., & Jones, M. N. (2011). Redundancy in linguistic and perceptual experience: Comparing distributional and feature-based models of semantic representation. Topics in Cognitive Science, 3, 1–43.
Rips, L. J., Shoben, E. J., & Smith, E. E. (1973). Semantic distance and the verification of semantic relations. Journal of Verbal Learning & Verbal Behavior, 12, 1–20.
Rogers, T. T., & McClelland, J. L. (2006). Semantic cognition. Cambridge, MA: MIT Press.
Rogers, T. T., Lambon Ralph, M. A., Garrard, P., Bozeat, S., McClelland, J. L., Hodges, J. R., & Patterson, K. (2004). The structure and deterioration of semantic memory: A neuropsychological and computational investigation. Psychological Review, 111, 205–235.
Rohde, D. L. T., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109.
Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2004). An introduction to COALS. Retrieved February 4, 2008.
Rosenblatt, F. (1959). Two theorems of statistical separability in the perceptron. In M. Selfridge (Ed.), Mechanisation of thought processes: Proceedings of a symposium held at the National Physical Laboratory. London: HM Stationery Office.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Rumelhart, D. E., McClelland, J. L., & the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Volume I. Cambridge, MA: MIT Press.
Rumelhart, D. E., & Todd, P. M. (1993). Learning and connectionist representations. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience (pp. 3–30). Cambridge, MA: MIT Press.
Sahlgren, M., Holst, A., & Kanerva, P. (2008). Permutations as a means to encode order in word space. Proceedings of the 30th Conference of the Cognitive Science Society (pp. 1300–1305).
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York, NY: McGraw-Hill.
Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.
Smith, E. E., Shoben, E. J., & Rips, L. J. (1974). Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81, 214–241.
Smith, L., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106, 1558–1568.
Steyvers, M. (2009). Combining feature norms and text data with topic models. Acta Psychologica, 133, 234–243.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis. Mahwah, NJ: Erlbaum.
Stone, B., Dennis, S., & Kwantes, P. J. (2011). Comparing methods for single paragraph similarity analysis. Topics in Cognitive Science, 3(1), 92–122.
Tabor, W., & Tanenhaus, M. K. (1999). Dynamical theories of sentence processing. Cognitive Science, 23, 491–515.
Taraban, R., & McClelland, J. L. (1988). Constituent attachment and thematic role assignment in sentence processing: Influences of content-based expectations. Journal of Memory & Language, 27, 597–632.
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279–1285.
Thomas, M. S. C., & Karmiloff-Smith, A. (2003). Modeling language acquisition in atypical phenotypes. Psychological Review, 110(4), 647–682.
Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of memory (pp. 381–403). London: Academic Press.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Tversky, A., & Gati, I. (1982). Studies of similarity. In E. Rosch & B. Lloyd (Eds.), On the nature and principle of formation of categories (pp. 79–98). Hillsdale, NJ: Erlbaum.
Tyler, L. K., Durrant-Peatfield, M. R., Levy, J. P., Voice, J. K., & Moss, H. E. (1996). Distinctiveness and correlations in the structure of categories: Behavioral data and a connectionist model. Brain and Language, 55, 89–91.
Vigliocco, G., Vinson, D. P., Lewis, W., & Garrett, M. (2004). Representing the meanings of object and action words: The featural and unitary semantic space hypothesis. Cognitive Psychology, 48(4), 422–488.
Widrow, B., & Hoff, M. E., Jr. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Part 4 (pp. 96–104). New York, NY: IRE.
Wittgenstein, L. (1953). Philosophical investigations. Blackwell.

CHAPTER 12

Shape Perception

Tadamasa Sawada, Yunfeng Li, and Zygmunt Pizlo

Abstract

This chapter provides a review of topics and concepts that are necessary to study and understand 3D shape perception. This includes projections and their invariants; model-based invariants; Euclidean, affine, and projective geometry; symmetry; inverse problems; the simplicity principle; Fechnerian psychophysics; regularization theory; Bayesian inference; shape constancy and shape veridicality; shape recovery; perspective and orthographic projections; camera models; as well as definitions of shape. All concepts are defined and illustrated, and the reader is provided with references providing mathematical and computational details. The material presented here will be a good starting point for students and researchers who plan to study shape, as well as for those who simply want to get prepared for reading the contemporary literature on the subject.

Key Words: 3-dimensional shape perception, shape constancy, perspective projection, simplicity constraints, inverse problems, symmetry, shape veridicality, regularization methods, Bayesian model, definition of shape

Introduction: Uniqueness of Shape

Shape is unique among other perceptual features such as color, position, orientation, size, and speed because shape is characterized by a high degree of (a) complexity and (b) spatial regularity. These two characteristics, complexity and regularity, allow the visual system to apply very sophisticated and very effective computations resulting in veridical perception. By "veridical" we mean that we see shapes the way they are "out there." Whereas all other perceptual features are represented in a space of a few dimensions (3 dimensions for color, 3 dimensions for position, 3 dimensions for orientation, 1 dimension for size, and 1 dimension for speed), shape requires dozens if not hundreds of dimensions. Consider first a 2-dimensional (2D) polygonal shape. The position of each vertex of a polygon is represented by two coordinates. Hence, a polygon can be represented as a point in a 2N dimensional space and its shape in a (2N – 4) dimensional space, where N is the number of vertices of the polygon (here, by shape we mean a figure whose 2D position, orientation, and size are irrelevant). Clearly, to represent an arbitrary 2D polygon, one will need a high-dimensional space. To represent an arbitrary smooth curve, an infinite-dimensional space is needed. Obviously, the dimensionality will also be high with 3-dimensional (3D) shapes. Complexity of shape is essential in recognizing objects from their retinal or camera images. We will never confuse a horse with a bookshelf when presented with a single 2D image of one of these two objects, regardless of which viewing direction the image is taken from. A horse and a bookshelf will never produce identical retinal images because of the complexity of their shapes. Confusion, on the other hand, is very likely with very simple figures such as ellipses or triangles.

The second unique aspect of 3D shapes is that they are characterized by spatial regularities.
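The counting argument for the (2N – 4) dimensional shape space can be illustrated with a small sketch (our own illustration, not from the chapter; all names are invented). An N-vertex polygon starts with 2N coordinates; quotienting out 2D position (2 degrees of freedom), size (1), and orientation (1) leaves a canonical representative with 2N – 4 degrees of freedom. Treating vertices as complex numbers makes each normalization a single line.

```python
import numpy as np

def normalize_shape(P):
    """Map an N-vertex polygon (N x 2 array) to a canonical point that is
    invariant to 2D position, orientation, and size, illustrating why
    shape has 2N - 4 degrees of freedom."""
    Z = P[:, 0] + 1j * P[:, 1]            # vertices as complex numbers
    Z = Z - Z.mean()                      # remove position    (2 dof)
    Z = Z / np.linalg.norm(Z)             # remove size        (1 dof)
    Z = Z * np.exp(-1j * np.angle(Z[0]))  # remove orientation (1 dof)
    return Z

square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])

# The same square, but translated, rotated 30 degrees, and doubled in size.
t = np.deg2rad(30)
R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
moved = 2.0 * square @ R.T + np.array([5.0, -3.0])

# Both polygons map to the same point in shape space.
print(np.allclose(normalize_shape(square), normalize_shape(moved)))  # True
```

The canonical point still has 2N real coordinates, but they satisfy four constraints (zero mean contributes two, unit norm one, and a real-valued first vertex one), which is the sense in which shape occupies a (2N – 4) dimensional space.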

Natural objects are not a random scatter of points. Objects are volumetric, and their surfaces are piecewise smooth. Even more importantly, natural objects are characterized by symmetries. Plants are symmetrical because of the way they grow. Animals are symmetrical because of the way they move. Man-made objects are symmetrical because of the functions they serve.

Until very recently, veridical perception of 3D shapes was considered an unsolved problem. In fact, many, if not most, students of vision have assumed that veridical perception of shape never actually happens. The difficulty of the problem is related to the fact that the depth dimension is lost in the projection from a 3D space to the 2D retinal or camera image. It follows that the task of recovering a 3D shape from its 2D retinal image is underconstrained: a 2D image allows for infinitely many 3D interpretations. Yet, we always perceive 3D natural shapes veridically, and our percept does not depend on the direction from which we view the object. When the viewing direction changes, the shape of the 2D retinal image changes as well. The fact that our percept of the 3D shape does not change is called shape constancy.

In this chapter we will describe the contemporary approach to studying 3D shape perception. We will review studies of shape constancy (Review of Psychophysical Results on Shape Constancy) and will explain conventional (Fechnerian Paradigm Is Inadequate: Shape Perception Should Be Viewed as an Inverse Problem) and new (Solving Inverse Problems: The Role of Constraints in Regularization and Bayesian Methods) paradigms of shape perception, the geometry of projections from a 3D to a 2D space (Geometry of the Eye and of the Camera), and invariants of these projections (Invariants of Perspective and Orthographic Projection). Then, we will show how shape perception can be studied using the new approach (Symmetry of a 3D Shape and Its 2D Orthographic and Perspective Projections; Any 2D Image Is Consistent with a 3D Symmetrical Interpretation; and Models of Recovering a 3-Dimensional Mirror-Symmetrical Shape from One or Two 2D Images). Finally, we present our new definition of shape (Conclusion: A New Definition of Shape).

Review of Psychophysical Results on Shape Constancy

Shape constancy, which is illustrated in Figure 12.1, is of fundamental importance in our everyday life. As an example, imagine going to a car dealer after checking photographs of cars that are available in your favorite magazine. You can easily recognize the cars when you see them at the dealership after you had seen them in the photos. This phenomenological observation has been confirmed in well-controlled psychophysical experiments.

In a typical shape constancy experiment, the subject is shown a single object twice, or two different objects sequentially. The subject is asked to judge whether the shape was the same in the two presentations. Note that when the same shape is shown twice, the 3D orientation of the shape in the second presentation must be different from that in the first presentation so that the 2D images are different (see Figure 12.1). This is accomplished by rotating the object in depth. A rotation around the line of sight does not change the 2D retinal shape; it only affects the 2D orientation of the retinal shape. As a result, such a rotation does not allow testing shape constancy. Some parts of the object that are visible in the first view become invisible in the second view, when the object is rotated in depth, and other parts of the object appear only in the second view. If the rotation in depth is small, the 2D images of the object are similar. If the rotation in depth is large, the 2D images will be very different.

The first experiment reporting perfect constancy with 3D shapes was performed only 20 years ago (Biederman & Gerhardstein, 1993). Their 3D objects were composed of simple parts that were 3D mirror- or translationally symmetrical (see section Symmetry of a 3D Shape and Its 2D Orthographic and Perspective Projections). Shape constancy was perfect even with very large rotations in depth, as long as the 3D parts were at least partially visible. Very reliable shape constancy was also demonstrated with 3D mirror-symmetrical objects that did not have distinctive parts (Stevenson, Li, Chan, & Pizlo, 2006; Li & Pizlo, 2011; Pizlo & Stevenson, 1999). Shape constancy fails, however, with irregular and unstructured 3D objects.

The first systematic study of the role of spatial regularities in shape constancy was performed by Pizlo and Stevenson (1999). Their stimuli are shown in Figure 12.2. Some of these 3-dimensional stimuli were mirror-symmetrical, some of them were volumetric, and some of them had piecewise planar contours. Shape constancy was reliable with 3D mirror-symmetrical polyhedra whose faces were planar (Figure 12.2a), and it completely failed with 3D asymmetrical polyhedra whose faces were nonplanar (Figure 12.2d), as well as with 3D irregular polygonal lines (Figure 12.2e). For other

Fig. 12.1 Five images of a 3D shape. If you see the same 3D shape, you achieved shape constancy.

stimuli, the degree of shape constancy was intermediate. Interestingly, monocular and binocular performances were highly correlated, suggesting a common underlying mechanism. In particular, binocular shape constancy was not reliable unless monocular performance was reliable, as well. These results indicate that 3D mirror symmetry, volume, and planarity of faces play a critical role in shape perception. Note that all three of these characteristics refer to abstract aspects of objects, rather than to particular exemplars of objects. Also, these characteristics can be found in natural 3D objects, but not in the 2D retinal images. It follows that symmetry, volume, and planarity are used by the human visual system as constraints, not as visual cues. The distinction between visual cues and constraints is best expressed in the language of inverse problems theory.

Fig. 12.2 Stereoscopic images of the six types of stimuli used in our prior studies on shape constancy. The left and the center images are used for uncrossed fusion, and the center and the right images are used for crossed fusion. (a) Mirror-symmetrical polyhedron with planar faces. (b) Mirror-symmetrical polyhedron with nonplanar faces. (c) Asymmetrical polyhedron with planar faces. (d) Asymmetrical polyhedron with nonplanar faces. (e) Irregular polygonal 3D contour containing no volume. (f) 16 points, the vertices of a 3D symmetrical polyhedron.

Fechnerian Paradigm Is Inadequate: Shape Perception Should Be Viewed as an Inverse Problem

According to the commonly accepted paradigm attributed to Gustav Theodor Fechner (1860/1966), the "percept" can be best understood as the result of a causal chain of events. In the case of 3-dimensional shape perception, the chain begins with the shape of an object "out there" (called

a distal stimulus), followed by the formation of the 2-dimensional retinal image (called a proximal stimulus), transduction, which changes the light energy into electrical energy in the nervous system, brain processes, and finally the percept of the distal stimulus. In this view, the percept is assumed to be a mental measurement of the physical stimulus. Although it is easy to treat brightness or loudness perception as a mental measurement of light or sound intensity, perception of a 3D shape produced by looking at a single 2D image can only be viewed as an inference based on incomplete data, not as a measurement. The visual system must go beyond the information given in the 2D retinal image. How this can be done is described in the theory of inverse problems (Pizlo, 2001; Poggio, Torre, & Koch, 1985; Tikhonov & Arsenin, 1977).

Consider an orthographic projection from a 3-dimensional space to a 2-dimensional image. This type of projection will be explained in detail in the section Geometry of the Eye and of the Camera. For now, the illustration in Figure 12.3, as well as Box 1, should be sufficient. Let a set of 3D points represent a 3D object. The coordinates of these 3D points form a matrix P_3×N. The 2D orthographic image of these points forms a matrix p_2×N. The projection from P_3×N to p_2×N is called a forward problem. It is accomplished by dropping the z-coordinates of the points. In matrix form, this is expressed as follows:

p_2×N = [ 1 0 0 ; 0 1 0 ] · P_3×N,   (1)

The binary matrix A_2×3 in Eq. (1) represents an orthographic projection. It is obvious that this forward problem always has a solution: given P_3×N, it is easy to compute a unique p_2×N. Now, the 3D percept produced by a 2D image is equivalent to solving an inverse problem. The name is related to the fact that if A_2×3 were a square nonsingular matrix, its inverse A_2×3^−1 would exist, and P_3×N could be computed from p_2×N simply by premultiplying both sides of Eq. (1) by A_2×3^−1. This cannot be done in our case because A_2×3 is not a square matrix. This inverse problem has infinitely many solutions, represented by assigning arbitrary z-coordinates to each of the N retinal points. This fact had already been known to Bishop Berkeley (1709/1910), but it did not receive the attention it deserves until the second half of the 20th century.

This inverse problem is called ill-posed. A problem is well-posed when the solution exists, it is unique, and it depends continuously on the data. When one or more of these criteria are not met, the problem is ill-posed. In 3D shape perception, it is the nonuniqueness of the solution that is the reason for ill-posedness. The only way to transform an ill-posed problem into a well-posed problem is to impose constraints on the family of 3D interpretations. In plain English, the visual system must know something about the 3D shapes "out there" and combine this knowledge with the 2D retinal image in producing a unique and veridical 3D percept. For example, given a 2D image of a 3D rectangular box, a shoebox for example, the 3D shape of the shoebox can be recovered veridically by maximizing the 3D symmetries of the 3D interpretation. It might be a little surprising, but it is true nonetheless, that applying 3D symmetries to the 2D image of a 3D opaque shoebox allows recovering the entire box, its invisible back parts as well as its visible front parts. The next section will describe the method of solving ill-posed inverse problems formally.

Solving Inverse Problems: The Role of Constraints in Regularization and Bayesian Methods

Although the fact that the percept is a result of applying a simplicity principle, or constraint, to the retinal image(s) has been known for at least 100 years (Köhler, 1920/1938; Mach, 1906/1939; Wertheimer, 1923/1958), a formal theory of how an inverse problem can be solved did not appear in the vision literature until 30 years ago (Poggio, Torre, & Koch, 1985). According to this new paradigm, visual perception is an inference about (recovery of) the 3D environment "out there" based on incomplete data provided by a 2D retinal image. Consider forming a 2D retinal image I_2D by a 3D object η_3D:

I_2D = F_projection(η_3D)   (2)

where F_projection is a function representing the projection from η_3D to I_2D. Eq. (2) is a more general case of Eq. (1). Specifically, F_projection could be either a perspective or an orthographic mapping (see section Geometry of the Eye and of the Camera and Box 1). Note that the retinal image I_2D always has some noise due to uncertainty in identifying points and extracting contours. Hence, Eq. (2) can be modified as follows:

I_2D = F_projection(η_3D) + ε_2D,   (3)

where ε_2D is the noise inherent in the retinal image I_2D or in the higher visual representations in the brain hierarchy.
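The forward problem of Eq. (1), and the reason its inverse is ill-posed, can be shown in a few lines of NumPy (a sketch of our own, not code from the chapter):

```python
import numpy as np

# Forward problem (Eq. 1): orthographic projection drops the z-coordinates.
A = np.array([[1, 0, 0],
              [0, 1, 0]])             # the binary 2x3 projection matrix

P = np.array([[0.0, 1.0, 1.0, 0.0],   # x-coordinates of N = 4 points
              [0.0, 0.0, 1.0, 1.0],   # y-coordinates
              [2.0, 3.0, 5.0, 7.0]])  # z-coordinates (lost in projection)

p = A @ P                             # the unique 2xN image: easy to compute
print(p)

# The inverse problem is ill-posed: ANY choice of z reproduces the same image.
z_arbitrary = np.array([9.0, 9.0, 1.0, 4.0])
P_alternative = np.vstack([p, z_arbitrary])
print(np.allclose(A @ P_alternative, p))  # True
```

Because A is not square, it has no inverse, and every assignment of depths yields a 3D interpretation consistent with the same image, which is exactly the nonuniqueness described in the text.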

Fig. 12.3 (a) A 2D orthographic image of a 3D cube. This image consists of three sets of four parallel line segments. (b) Geometry of an orthographic projection of the cube to an image plane I. Here, the image plane is represented by a line. The 3D vertices of the cube are projected to I along projection lines (dotted lines). The projection lines are parallel to one another and orthogonal to I.

The process of recovering the 3D object η_3D from the 2D retinal image I_2D can be written as follows:

η_3D = F_projection^−1(I_2D).   (4)

Recall that there are infinitely many 3D interpretations for any given I_2D. It is impossible to determine F_projection^−1 uniquely unless some additional information is available. Hence, recovering the 3D object from the 2D retinal image is an ill-posed problem.

Before an ill-posed inverse problem is solved, it must be transformed into a well-posed problem by imposing a priori constraints on the family of possible interpretations. For the percept to be veridical, the constraints must reflect regularities present in the natural world. What is the nature of these regularities? Symmetries are arguably the most important regularities present in our natural world. As already pointed out in the beginning of this chapter, all natural objects are symmetrical.

The main idea behind a regularization method of solving inverse problems is fairly simple. We look for a 3D shape η_3D that satisfies two requirements. First, the 3D shape η_3D has to be geometrically consistent with the retinal image I_2D. In other words, the 2D image I_2D can be produced by η_3D. Second, the 3D shape η_3D has to satisfy the a priori constraints. Typically, there is no 3D shape that exactly satisfies these two requirements. This may happen because of the noise ε_2D in the retinal image (e.g., a 2D digital image of a perfect sphere is never a perfect ellipse) or because the original 3D shape η_3D does not exactly satisfy the constraints (e.g., no human face is perfectly mirror-symmetrical). A good (perhaps the best) way to resolve this compromise is to use a cost function consisting of two terms representing these two requirements (Tikhonov & Arsenin, 1977):

E_total(η_3D, I_2D) = ||F_projection(η_3D) − I_2D||² + λ·||e_constraint(η_3D)||²,   (5)

where λ is a regularization parameter. The first term in the cost function E_total evaluates how consistent the projected image of η_3D is with the retinal image I_2D. The second term evaluates how well η_3D satisfies the a priori constraints. If the retinal image (visual data) is reliable, λ should be small. Otherwise, λ should be large. Recovering a 3D shape that best satisfies both of these requirements is equivalent to finding the global minimum of E_total(η_3D, I_2D) in the space of all 3D shapes. Sometimes there are two (or more) local minima of the cost function.

There is a stochastic version of the regularization theory involving Bayes's rule (Bouman & Sauer, 1993; Kersten, 1999; Knill & Richards, 1996; Pizlo, 2001; Poggio et al., 1985). Bayes's rule allows the computation of the posterior probability p(η_3D | I_2D) as follows:

p(η_3D | I_2D) = p(I_2D | η_3D)·p(η_3D) / p(I_2D),   (6)

where p(η_3D) is the prior probability distribution for the 3D shape η_3D, and it represents a priori knowledge about shapes in the environment. p(I_2D | η_3D) is the likelihood function for η_3D. This function is the probability of producing the retinal image I_2D by a 3D shape η_3D. Finally, p(I_2D) is the probability of obtaining I_2D. The inverse problem of determining the 3D shape based on the information provided by the 2D retinal image can be solved by finding the η_3D that maximizes the posterior probability p(η_3D | I_2D).
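The structure of the cost function in Eq. (5) can be illustrated with a generic Tikhonov-style toy problem (our own illustration, not the chapter's shape-recovery model): recover a depth profile from noisy measurements, with a smoothness term standing in for the a priori constraint e_constraint. Under Gaussian noise and a Gaussian prior, the minimizer of such a cost is also the maximum of the posterior in Eq. (6).

```python
import numpy as np

def regularized_recovery(d, lam):
    """Minimize E(z) = ||z - d||^2 + lam * ||D2 z||^2, the same
    data-term + weighted-constraint structure as Eq. (5), in closed form."""
    n = len(d)
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]      # discrete second derivative
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, d)

rng = np.random.default_rng(0)
true_z = np.linspace(0.0, 1.0, 50)              # a planar (linear) depth profile
noisy = true_z + 0.1 * rng.standard_normal(50)  # data with noise, as in Eq. (3)

recovered = regularized_recovery(noisy, lam=100.0)
print(np.abs(noisy - true_z).mean())      # error of the raw data
print(np.abs(recovered - true_z).mean())  # smaller: the constraint term helps
```

A large λ here expresses strong confidence in the smoothness constraint, and because the true profile actually satisfies the constraint (it is linear), the regularized solution is closer to the truth than the raw data, mirroring the chapter's argument about why constraints that reflect real-world regularities produce veridical recovery.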

Such an η_3D is called a maximum a posteriori (MAP) estimator. Note that p(I_2D) is a constant for the given image I_2D in Eq. (6), and so it can be ignored in solving the inverse problem. Also note that, typically, priors are included in both the prior probability distribution and in the likelihood function. If a prior is expected to be satisfied exactly, it is put in the likelihood function and called an assumption, or an implicit constraint. Otherwise, the prior is called an explicit constraint.

The relation between Eqs. (5) and (6) can be seen by taking the logarithm of both sides of Eq. (6) (ignoring the term p(I_2D)):

−log p(η_3D | I_2D) = −log p(I_2D | η_3D) − log p(η_3D).   (7)

Note that Eq. (7) is analogous to Eq. (5). The logarithm of the likelihood function p(I_2D | η_3D) corresponds to the first term of Eq. (5), which evaluates how consistent η_3D is with I_2D. The logarithm of the prior probability distribution p(η_3D) corresponds to the second term of Eq. (5), which evaluates how well η_3D satisfies the constraints. The regularization parameter λ in Eq. (5) is implicitly represented in Eq. (7) by the ratio of the variances of p(I_2D | η_3D) and p(η_3D). In fact, under some assumptions about the probability distributions, maximizing the posterior probability p(η_3D | I_2D) is equivalent to minimizing the cost function E_total(η_3D, I_2D) (Pizlo, 2001).

Before we show, in the section Models of Recovering a 3-Dimensional Shape from One or Two 2-Dimensional Images, how the cost function (Eq. 5) is used to recover 3D shapes, we will describe the geometry of the eyeball and the rules of geometrical optics representing the forward problem. This will be followed by a description of invariants of perspective and orthographic projection. These invariants are essential in determining if and how the a priori constraints can be applied. The section Symmetry of a 3D Shape and Its 2D Orthographic and Perspective Projections will describe 3D symmetries of shapes, as well as three other a priori shape constraints.

Box 1 Projections (from a 3D Space to a 2D Image)

Orthographic: The projecting lines are orthogonal to the image plane (http://shapebook.psych.purdue.edu/3.1/).
x = X
y = Y

Perspective: The projecting lines intersect at the center of perspective projection (http://shapebook.psych.purdue.edu/3.2/).
x = f(X/Z)
y = f(Y/Z)

Geometry of the Eye and of the Camera

The geometry of a human eye is complex because the eye is anatomically composed of several optical parts, and some of them are even flexible. Hence, it is quite common to use simplified models. Two different models of the human eye are particularly common: a pinhole camera with (a) a spherical and with (b) a planar retina. These two models are essentially equivalent because the 2D coordinates of the position on the spherical and the planar retina can be uniquely transformed to one another (Figure 12.4a). In most geometrical analyses, the planar retina leads to simpler equations. So, we will use the pinhole camera model with a planar retina in this chapter.

The geometry of the pinhole camera is characterized by its focal length and the principal point. The focal length is the distance between the center of perspective projection (the pinhole) and the image plane (Figure 12.4). The principal point is the center of the image plane. In the human eye, the principal point is the center of the fovea. The line connecting the principal point and the center of projection is normal to the image plane, and this line is called the principal axis. According to the rules of geometrical optics, the center of projection (the pinhole) is always on the line connecting a point in a 3D scene and its 2D image. This line is called a projection line.

Let [0, 0, 0]^t be the center of perspective projection and Z = f be the image plane, where |f| is the focal length of the camera. The x- and y-axes of the 2D Cartesian coordinate system on the image plane are parallel to the X- and Y-axes of the 3D Cartesian coordinate system. The origin of the 2D coordinate system is set at [0, 0, f]^t. The principal point has 2D coordinates [0, 0]^t. Then, a point V_3D = [X_3D, Y_3D, Z_3D]^t in a 3D scene projects to a point v_2D = [x_2D, y_2D]^t on the image plane according to the following matrix equation:

[x_2D, y_2D]^t = [f·X_3D/Z_3D, f·Y_3D/Z_3D]^t.   (8)

Equation (8) represents the rules of a perspective projection.
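Eq. (8) can be implemented directly; a minimal sketch with made-up values (our own illustration, not the chapter's code). Note two things: the two 3D points below lie on the same projection line, so they produce the same image point, which is the depth ambiguity discussed earlier, and the image coordinates have signs opposite to the scene coordinates because the retinal image is upside down.

```python
import numpy as np

def perspective_project(V, f):
    """Eq. (8): project 3D points (rows of an N x 3 array) through a pinhole
    at the origin onto the image plane Z = f."""
    V = np.asarray(V, dtype=float)
    return f * V[:, :2] / V[:, 2:3]

# Image plane behind the pinhole (f < 0), scene in front of it (Z > 0):
points = np.array([[1.0, 2.0, 10.0],
                   [2.0, 4.0, 20.0]])   # both lie on the same projection line
image = perspective_project(points, f=-1.0)
print(image)   # [[-0.1 -0.2]
               #  [-0.1 -0.2]]  one shared, sign-inverted image point
```

The fact that doubling all three coordinates of a point leaves its image unchanged is another way of seeing why the inverse of F_projection is not unique.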


Fig. 12.4 Geometry of a perspective projection. A point V_3D in a 3D scene projects to v_2D on an image plane I through a center of projection C. (a) Models with a spherical and a planar "retina," which are behind the center of projection C. Note that the correspondence between v_retina on the spherical image and v_2D on the planar image is unique for a given center of projection C. This means that these two models of the eye are essentially equivalent. (b) The image plane I is placed in front of C. This setting is often used in computational models.

According to the rules of optics, the 3D point should be on the opposite side of the center of projection compared to the image plane (Figure 12.4a):

f·Z_3D < 0.   (9)

The signs of x_2D and y_2D are always opposite to those of X_3D and Y_3D, respectively. This reflects the fact that the 2D retinal image of the 3D scene is always upside down (more precisely, rotated 180°) in the human eye or in a camera. When f·Z_3D > 0, the 3D point is behind the eye and becomes invisible. It is quite common to "put" the image plane in front of the center of projection in computational models of a camera. This simplifies the computations in the sense that the computed image is right side up (see Figure 12.4b). We will follow this convention here.

Perspective projection is a nonlinear transformation, which makes it difficult to use. In particular, matrix algebra cannot be directly applied to a perspective projection. Furthermore, even if we restrict a perspective projection to a one-to-one mapping by considering the case of an image of a planar figure slanted relative to the eye, perspective projection is not a group, and so it does not have invariants (Pizlo & Rosenfeld, 1992). As a reminder, a set of transformations is a group if four axioms are satisfied (see Glossary entry Group of Transformations): (a) an inverse transformation is in the set, (b) the identity transformation is in the set, (c) a composition of two transformations from the set belongs to the set, and (d) transformations are associative. A perspective projection between two planes is not a group because it violates the closure axiom (c). The smallest group of transformations containing a perspective projection is a projective transformation (Coxeter, 1987; Mundy & Zisserman, 1992). It turns out that a projective transformation corresponds to an uncalibrated camera, that is, a camera whose focal length and principal point are unknown (they are free parameters). It is important to emphasize that the projective group applies only to the case of camera images of planar figures, because this case is a one-to-one mapping. It is customary to use the terminology of projective geometry in the case of a 3D to 2D mapping as well, even though this mapping is not one-to-one. The 3D to 2D mapping does not have a unique inverse, and so it is not a group. It follows that this mapping is not a projective transformation.

Before we describe the relation between the eye and a projective transformation in some more detail, we will introduce an orthographic projection, which is an important and useful approximation to a perspective projection. Using an orthographic approximation to a perspective projection is a way to circumvent the problems of nonlinearity and of nonexistence of invariants. Practically, an orthographic projection is a good approximation to a perspective projection when the range in depth of an object is less than 10% of the viewing distance. Formally, an orthographic projection can be introduced as follows. Consider moving the center of perspective projection to infinity along the Z-axis, relative to the image plane and to the 3D object. This is equivalent to translating both the image plane and the 3D object together to infinity in the opposite direction. As a result, the projecting lines become parallel to the principal

axis (the Z-axis) and perpendicular to the image plane (Figure 12.3). In such a case, Eq. (8) takes the following form:

[x_2D, y_2D]^t = lim_{|Z_translation|→∞} [ (f + Z_translation)·X_3D/(Z_3D + Z_translation), (f + Z_translation)·Y_3D/(Z_3D + Z_translation) ]^t = [X_3D, Y_3D]^t,   (10)

where Z_translation is the translation along the Z-axis. Note that Eq. (10) is equivalent to Eq. (1). The relation between a perspective and an orthographic projection is illustrated in our demo: http://shapebook.psych.purdue.edu/3.2/ (from Pizlo, Li, Sawada, & Steinman, 2014). You see a 3D object (a chair) inside a rectangular box defining the x, y, z directions. The center of perspective projection is at F, whose position can be changed using a slider at the bottom. Use your mouse to rotate the "entire scene" relative to you, or to rotate the "inner object" relative to the image plane. Moving F toward infinity makes the perspective projection closer and closer to an orthographic projection. The orthographic projection is also illustrated in another demo: http://shapebook.psych.purdue.edu/3.1/.

Next, consider a general case of the viewing geometry, where the coordinate system of the 3D scene and the coordinate system of the image plane are not necessarily aligned. We assume a perspective projection, which means that we will elaborate Eq. (8). This equation will take a more complex form because both the camera parameters (called intrinsic parameters) and the information about the orientation of the 3D object relative to the camera (extrinsic parameters) must be used on the right-hand side. This is done by putting all these parameters in a "camera matrix" P_perspective, which is a 3×4 matrix. The projection from a 3D point V = [X_3D, Y_3D, Z_3D]^t to its 2D perspective projection v = [x_2D, y_2D]^t can be written as follows:

v* = P_perspective·V*   (11)

where V* = [X*_3D, Y*_3D, Z*_3D, W*_3D]^t = W*_3D·[X_3D, Y_3D, Z_3D, 1]^t and v* = [x*_2D, y*_2D, w*_2D]^t = w*_2D·[x_2D, y_2D, 1]^t. The vectors V* and v* are called the homogeneous coordinates of V and v. The homogeneous coordinates allow expressing a nonlinear perspective projection by using matrix notation. In homogeneous coordinates, the last value of a vector can be an arbitrary nonzero number. If the last value is equal to zero, the vector represents a point at infinity. In Equation (11), W*_3D of V* can usually be set to 1. This way, 3D homogeneous coordinates are trivially obtained from 3D Euclidean coordinates (and vice versa). Note that Eq. (11) is equivalent to Eq. (8) when:

P_perspective = [ f 0 0 0 ; 0 f 0 0 ; 0 0 1 0 ]   (12)

In this chapter, some equations use both homogeneous and Euclidean coordinates. To avoid confusion, symbols with asterisks represent the homogeneous coordinates, and those without asterisks represent the Euclidean coordinates.

The "camera matrix" P_perspective can be decomposed into the following two matrices:

K_perspective = [ α_x s u_0 ; 0 α_y v_0 ; 0 0 1 ]  and  Q = [ R_3×3 | −R_3×3·C_3×1 ]   (13)

where P_perspective = K_perspective·Q. The matrix K_perspective is called the "intrinsic matrix," which defines the camera's intrinsic properties, such as focal length. The position of the principal point on the image plane is [u_0, v_0]^t. If the pixels in the camera are squares, α_x = α_y and s = 0. The focal length in our representation is equal to α_x and to α_y. If s ≠ 0, the coordinate system is skewed (Figure 12.5), and s is equal to the cosine of the angle between the x and y axes.

Fig. 12.5 Geometry of a camera pixel.

Matrix Q is called the "extrinsic matrix," and it represents the transformation between the world coordinate system and the camera coordinate system. This transformation consists of a 3D translation (−C_3×1) followed by a 3D rotation (R_3×3). Note that the vector C_3×1 represents the position of the center of projection expressed in the world coordinate system.
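The homogeneous-coordinate machinery of Eqs. (11) through (13) can be checked numerically. The sketch below (our own values; extrinsics set to the identity, image plane in front of the center of projection) builds P_perspective = K·Q, projects a homogeneous point, and confirms agreement with Eq. (8):

```python
import numpy as np

f = 2.0
K = np.array([[f, 0.0, 0.0],
              [0.0, f, 0.0],
              [0.0, 0.0, 1.0]])      # intrinsic matrix (Eq. 13): no skew,
                                     # principal point at the origin
R = np.eye(3)                        # camera aligned with the world axes
C = np.zeros((3, 1))                 # center of projection at the origin
Q = np.hstack([R, -R @ C])           # extrinsic matrix (Eq. 13)
P_cam = K @ Q                        # the 3x4 camera matrix

V_h = np.array([3.0, 1.5, 6.0, 1.0]) # homogeneous 3D point [X, Y, Z, 1]
v_h = P_cam @ V_h                    # Eq. (11)
v = v_h[:2] / v_h[2]                 # back to Euclidean image coordinates

print(v)                                                # [1.  0.5]
print(np.allclose(v, [f * 3.0 / 6.0, f * 1.5 / 6.0]))   # True, matches Eq. (8)
```

With these trivial extrinsics, P_cam equals the matrix in Eq. (12); replacing R and C with a real rotation and camera position gives the general aligned-or-not viewing geometry described in the text.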

The "camera matrix" for the case of an orthographic projection can be defined in a similar manner:

v = P_orthographic·V* = K_orthographic·Q·V*   (14)

where:

K_orthographic = [ α_x s 0 ; 0 α_y 0 ]   (15)

Here, W*_3D of V* should be set to 1, and the homogeneous coordinate of v is ignored. The extrinsic matrix Q is the same as in Eq. (13). Estimation of Q and K_perspective (or K_orthographic) is called camera calibration and is typically performed based on multiple images of a reference scene whose geometry is known (see Hartley & Zisserman, 2003, for more details of camera calibration). When a pair of identical cameras forms a binocular (stereo) system, such a system can be calibrated in an analogous way.

Invariants of Perspective and Orthographic Projection

Projection from a 3D space to a 2D retinal image removes information about depth, but it does not eliminate information about 3D shape. As pointed out in the first section, Introduction: Uniqueness of Shape, the more complex the 3D shape is, the more information about the 3D shape is preserved in the 2D image. Mathematically, this is represented by invariants of perspective projection. There are two main types of invariants: conventional (or group) invariants and model-based invariants (Mundy & Zisserman, 1992, 1993; Pizlo, 2008; Rothwell, 1995; Weiss, 1988, 1993). Conventional invariants are defined for groups of transformations. It follows that the transformation for which we seek invariants must be a one-to-one mapping: a line to a line, a plane to a plane, or a 3D space to a 3D space. So, the most interesting case representing vision does not belong here, because 3D vision involves a mapping between a 3D space and a 2D plane. This mapping can only be handled by model-based invariants (see later).

We first discuss invariants of an orthographic projection of a planar (2D) figure to a 2D image. This projection can be described by a group of 2D affine transformations, which is composed of a rotation, a translation, size scaling, and stretch. Using matrix notation, a 2D affine transformation can be written as follows:

[x′, y′]^t = [ g_A11 g_A12 ; g_A21 g_A22 ]·[x, y]^t + [g_A13, g_A23]^t   (16)

In this transformation, parallel lines remain parallel, and the ratio of lengths of two parallel line segments stays the same, as does the ratio of areas of two regions. These three properties are invariants of a 2D affine transformation and of an orthographic projection of a planar figure. This is illustrated in Figures 12.6a and b, where a regular checkerboard and its orthographic image are shown. The square regions of the checkerboard project to parallelograms. All these parallelograms are identical, which illustrates the operation of affine invariants.

Next, consider a 2D perspective image of a planar object slanted relative to the camera. The focal length of the camera (or the eye) and the principal point are fixed and known. This is the case relevant to human vision. In such a case, a perspective projection does not form a group, because a composition of two perspective projections is not a perspective projection. Instead, it is a projective transformation (Springer, 1964). It follows that a perspective projection does not have its own invariants.

Box 2 Transformation Groups (2D Examples)

1. Rigid motion: rotation and translation.
x′ = x·cos φ − y·sin φ + d_x
y′ = x·sin φ + y·cos φ + d_y
Invariants: distances, angles, area, volume.

2. Similarity: rigid motion plus uniform size scaling.
x′ = (x·cos φ − y·sin φ)·s + d_x
y′ = (x·sin φ + y·cos φ)·s + d_y
Invariants: angles, ratio of lengths of line segments, ratio of areas, ratio of volumes.

3. Affine: similarity plus stretch along an arbitrary direction.
x′ = a·x + b·y + c
y′ = d·x + e·y + f
Invariants: ratio of lengths of parallel line segments, ratio of areas, ratio of volumes.

4. Projective: a continuous transformation that maps straight lines into straight lines and planes into planes.
x′ = (a·x + b·y + c)/(g·x + h·y + i)
y′ = (d·x + e·y + f)/(g·x + h·y + i)
Invariants: cross-ratio of four collinear points, cross-ratio of areas and of volumes.
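The affine invariants listed in Box 2 (item 3) are easy to verify numerically. Below, a made-up affine map of the form given in Eq. (16) preserves the ratio of lengths of two parallel segments, even though it distorts lengths and angles (our own sketch, not code from the chapter):

```python
import numpy as np

def affine(pts, A, t):
    """Apply the 2D affine map x' = A x + t (Eq. 16) to an array of points."""
    return pts @ A.T + t

A = np.array([[2.0, 0.7],
              [0.3, 1.5]])           # invertible; includes stretch and shear
t = np.array([4.0, -1.0])

# Two parallel segments, the second twice as long as the first:
seg1 = np.array([[0.0, 0.0], [1.0, 1.0]])
seg2 = np.array([[3.0, 0.0], [5.0, 2.0]])

def length(seg):
    return np.linalg.norm(seg[1] - seg[0])

r_before = length(seg2) / length(seg1)          # 2.0
s1, s2 = affine(seg1, A, t), affine(seg2, A, t)
r_after = length(s2) / length(s1)

print(np.isclose(r_before, r_after))  # True: the ratio survives the affine map
```

The individual lengths change under the map, but because both segments share a direction, they are stretched by the same factor, which is exactly why the ratio of lengths of parallel segments is an affine (and orthographic) invariant.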

shape perception 263 (a)

(c) (b)

Fig. 12.6 (a) A square with a regular (checkerboard) pattern. (b) An orthographic projection of (a) with a slant of 60 degrees. The resulting pattern is not rectangular, but parallel lines remain parallel and all regions that are images of squares are parallelograms. (c) A perspective projection of (a). Parallel lines do not project to parallel lines and the regions that are images of the squares are not identical. invariant under a 2D perspective projection are also (Burns, Weiss, & Riseman 1990). Specifically, there invariant under a larger set of transformations that is nothing that can be computed from a set of N form a 2D projective group. The 2D projective points in a 3D space that remains invariant in a transformation is represented by the following 2D image (orthographic or perspective) of these equation using homogeneous coordinates: points, except for the number of these points. ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ Unlike a perspective projection between two planes, ∗ x gP11 gP12 gP13 where one can use invariants of a larger, projective ⎣ ∗ ⎦ ⎣ ⎦ x ⎣ ⎦ y = gP21 gP22 + gP23 . set, such invariants do not exist in a 3D to 2D ∗ y w gP31 gP32 gP33 projection. (17) If one is willing to restrict the class of ob- The Euclidean coordinates of the point jects, model-based invariants can be derived for [x , y ]t after the transformation are obtained as a perspective projection between planes, as well ∗ ∗ ∗ ∗ follows: [x , y ]t =[x /w , y /w ]t .Under as for a perspective and orthographic projection the 2D projective transformation, the parallelism from a 3D space to a 2D image. Consider a of lines is not invariant (Figure 12.6c). Neither is perspective image of a planar shape. When the the ratio of areas. 
What is invariant under a 2D shape and its perspective image are represented projective transformation is a ratio of ratios (cross using a contour-based descriptor, called the psi ratio) of 4 areas and a ratio of ratios of lengths of function, certain ratios in this representation are line segments defined by 4 collinear points (called approximately invariant (Pizlo, 1991; Pizlo & cross ratio of 4 collinear points). Rosenfeld, 1992). Most importantly, these ratios are The fact that there are no invariants char- not invariant under an arbitrary projective or ortho- acterizing a perspective projection between two graphic image. Pizlo called these ratios “perspective planes could be considered only a minor in- invariants” and he showed that the human visual convenience because one could always use 2D system uses them in achieving shape constancy projective invariants, which are preserved under (Pizlo, 1994). 2D perspective projection, as well. Mathematically Next, consider a projection from a 3D space to this is true, but perceptually it is not. Namely, a 2D image. Take a set of lines that are parallel the human visual system achieves shape constancy to one another in a 3D scene and not necessarily for perspective but not for projective images of coplanar. Their 2D perspective projections are lines planar figures (Pizlo, 1994; 2008). This empirical that converge at a vanishing point in the image (see fact has played a critical role in the recent history an example in Figure 12.7). Note that we can say of research on human vision because it attracted that parallel lines “out there” converge at a point attention of the vision community to another type in infinity. This means that lines that intersect at of invariants, called model-based invariants. Model- a single point in the 3D space, will intersect at based invariants are of fundamental importance in a single point in a 2D image, as well. 
So, the vision because these are the only invariants that intersection point is an invariant of the projection. can be applied to a 2D image of a 3D scene, the It is a model-based invariant because this invariant most relevant type of projection in vision. A 3D cannot be used with an arbitrary configuration of to 2D projection violates all group axioms and, as points and lines out there. This invariant can only a result, it does not have conventional invariants be used when a set of lines intersect at a single

264 higher level cognition

Fig. 12.7 Converging 2D lines that are 2D projections of parallel lines in a 3D scene.

point, in particular, when all lines are parallel in a 3D space.

We are all familiar with images like that in Figure 12.7. Vanishing points have been used by painters since the Renaissance, or more precisely, since the formal introduction of the rules of perspective projection by Brunelleschi in 1413. Conventionally, perception textbooks refer to this invariant as a "perspective cue." The label "cue" is less than ideal, but it is justified by the fact that lines that converge in the plane of the image are interpreted by human observers as parallel lines "out there." It is easy to see that parallelism of lines is induced by a mirror-symmetry in the 3D configuration. Specifically, in every 3D mirror-symmetrical object, pairs of corresponding (symmetrical) points form parallel line segments. It follows that the vanishing point in a 2D perspective image is a model-based invariant of a 3D to 2D projection of a mirror-symmetrical object out there. It is "model-based" because this invariant can only be used with a restricted set of objects that are mirror-symmetrical. It cannot be used with an arbitrary object. In shape perception of natural objects, the "symmetry restriction" does not really restrict its use, because all objects out there are symmetrical. Animal bodies and many man-made objects are mirror-symmetrical. Limbs of animal bodies and parts of man-made objects are translationally symmetrical. Flowers are rotationally symmetrical. Each type of symmetry has its own model-based invariant represented by eigenvectors characterizing the symmetry transformation (Li, Sawada, Shi, Steinman, & Pizlo, 2013). In this chapter we focus on 3D mirror-symmetrical shapes, arguably the most important type of symmetry. However, the same approach applies to translational and rotational symmetries as well. The next section will summarize what we know about mirror-symmetrical objects and their images.

Symmetry of a 3D Shape and Its 2D Orthographic and Perspective Projections

Consider a 3D mirror-symmetrical object whose symmetry plane π_s has the following equation: [X, Y, Z]^t · N_s + d_s = 0, where d_s is a constant scalar, N_s = [X_Ns, Y_Ns, Z_Ns]^t is the normal to π_s, and |N_s| = 1. A line segment connecting a pair of symmetrical points is bisected by π_s and is normal to π_s (Figure 12.8). This line segment is called a symmetry line segment. Then, any point V_i = [X_i, Y_i, Z_i]^t of the object corresponds to its mirror-symmetrical counterpart V_pair(i) = [X_pair(i), Y_pair(i), Z_pair(i)]^t that satisfies


Fig. 12.8 Three different orthographic views of a 3D mirror-symmetrical object. Pairs of corresponding points are connected by dotted line segments, which are projections of symmetry line segments. The symmetry line segments are bisected by the symmetry plane (grey plane) and are normal to the symmetry plane.

the following equation:

\[ V_{pair(i)} = V_i - 2\,(V_i \cdot N_s + d_s)\,N_s \qquad (18) \]
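Eq. (18) is easy to verify numerically. The sketch below uses arbitrary example values for N_s and d_s (not taken from the chapter) to reflect a point through the symmetry plane and check the two properties stated in the text: the symmetry line segment is parallel to N_s, and its midpoint lies on π_s.

```python
import numpy as np

# Symmetry plane [X,Y,Z]·Ns + ds = 0 with |Ns| = 1; the values are illustrative.
Ns = np.array([1.0, 0.0, -1.0]) / np.sqrt(2.0)
ds = -0.5

def mirror_pair(V):
    """Mirror-symmetrical counterpart of a 3D point (Eq. 18)."""
    return V - 2.0 * (V @ Ns + ds) * Ns

V = np.array([2.0, 1.0, 3.0])
Vp = mirror_pair(V)

seg = V - Vp            # the symmetry line segment
mid = 0.5 * (V + Vp)    # its midpoint
print(np.cross(seg, Ns))   # ~[0, 0, 0]: the segment is parallel to Ns
print(mid @ Ns + ds)       # ~0: the midpoint lies on the symmetry plane
```

Reflecting twice returns the original point, which is a quick sanity check that Eq. (18) is an involution.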

The symmetry line segments are parallel to N_s, and (V_i · N_s + d_s) is equal to the distance between V_i and π_s along N_s.

If a 3D object is mirror-symmetrical and its 2D image is given, a "virtual image" of the same 3D object from a different viewpoint can be computed directly from the original (real) 2D image (Vetter & Poggio, 1994). The virtual image of the 3D mirror-symmetrical object is computed by reflecting the original 2D image with respect to an arbitrary line (Figure 12.9). It is important to point out that the virtual image is computed from the given real 2D image without knowing the 3D object. Note that the original and virtual images can be used to recover the object using multiple-view geometry (see the section Models of Recovering a 3D Mirror-Symmetrical Shape From One or Two 2D Images for computational details).

Fig. 12.9 An image (a) of a 3D mirror-symmetrical object and its virtual image (b). The virtual image was produced by reflecting the original image with respect to a vertical line.

The symmetry line segments are parallel to one another in the 3D representation. It follows that the images of symmetry line segments are also parallel in an orthographic image, and they converge at a vanishing point in a perspective image. Note that in order to verify whether the symmetry line segments are parallel or intersect at a single point, one has to solve the "symmetry correspondence problem," which refers to establishing which pairs of points in the 2D image are images of pairs of symmetrical points in the 3D space. We already know that the visual system solves this problem in the case of curves, but not in the case of unconnected points (Sawada & Pizlo, 2008; Pizlo, Sawada, Li, Kropatsch, & Steinman, 2010; Wagemans, 1992, 1993). This fact is illustrated in Figure 12.10. The number of pairs of dots that must be tried in order to solve the symmetry correspondence problem in a dotted stimulus grows very quickly with the number of dots. In contrast,



Fig. 12.10 It is almost impossible to see any object when only dots representing vertices of a polygon in (b) and polyhedron in (d) are shown, as in (a) and (c).

the number of pairs of dots (feature points) that must be tried in a stimulus composed of one or more closed contours is proportional to n, where n is the number of feature points, such as corners or points of maximal curvature.

Once the symmetry correspondence problem is solved, the 3D symmetrical shape can be recovered. However, before we describe how to recover a 3D symmetrical shape, we will discuss limitations of the symmetry a priori constraint.

Any 2D Image Is Consistent with a 3D Symmetrical Interpretation

Look at the following demo: http://shapebook.psych.purdue.edu/4.1/

The curves do not look three-dimensional and they do not look mirror-symmetrical. This makes sense because they were drawn in a haphazard way on the 2D image. Interestingly, and completely surprisingly, there is a pair of 3D mirror-symmetrical curves that can produce this 2D image. Use your mouse to rotate the 3D interpretation. It turns out that there is always a 3D symmetrical interpretation no matter how the 2D curves are drawn. This statement was formulated as a theorem and proven by Sawada, Li, & Pizlo (2011). This means that the symmetry constraint is not sufficient to recover a unique 3D symmetrical shape, because every pair of curves on the retinal image and any correspondence may lead to a 3D symmetrical interpretation. Using the terminology of signal detection, the symmetry constraint would lead to hits with probability 1, but also to false alarms with probability 1. This is obviously unacceptable. Fortunately, this is not what the human visual system does. The fact that you do not see 3D symmetrical curves when you look at the original image in this demo illustrates that the visual system does not recover symmetry from every 2D retinal image. Two questions arise. First, how does the visual system do this? Second, how well does it do it? These two questions will be answered next.

Now look at the next demo: http://shapebook.psych.purdue.edu/4.2/

Again, you see two unrelated 2D curves, but now you know that a 3D symmetrical interpretation of these 2D curves exists. You can see this interpretation by using your mouse to rotate the 3D curves. You see that the 3D symmetry was hidden in the depth direction. Indeed, so much information is lost in the projection from a 3D space to a 2D image that one can hide almost anything there. In this case, one can always hide 3D symmetry. The fact that 3D symmetry is hidden in the depth direction is related to the concept of a degenerate view. A degenerate view is a view in which two or more points or edges "out there" project to a single point or edge in the 2D image. In other words, a point or an edge can hide behind another point or edge. This fact has been known for a while, but now it receives a new meaning and new importance. Degenerate views have not received a lot of attention in the past because they practically never happen. By "never" it is meant "probability zero." Indeed, the set of viewing directions from which two points in a 3D space project to the same point in the 2D retinal image is of measure zero, and thus, for a wide range of probability distributions of viewing directions, the probability

of a degenerate view is zero. So, degenerate views hide some information entirely, but because they almost never happen, the visual system, human or machine, need not worry about them. However, our case is different, because we can always compute a 3D symmetrical interpretation that corresponds to a degenerate view when the actual 3D object is asymmetrical. In other words, when the 3D stimulus is asymmetrical, we can always produce an incorrect (spurious) 3D symmetrical interpretation, and this interpretation will correspond to a degenerate view. This will happen not only when the 3D object is asymmetrical, but also when unrelated parts of the 3D scene are incorrectly interpreted as a single 3D object, or when the symmetry correspondence is not established correctly.

This brings us to the next demo: http://shapebook.psych.purdue.edu/4.3/

The two examples show two different 3D symmetrical interpretations computed from the same 2D image. These two different interpretations were computed by assuming two different symmetry correspondences. In the example on the left, the pairs of corresponding points form horizontal lines, whereas in the example on the right, they form a -25 degree angle with the horizontal axis. It follows that we can compute infinitely many different 3D symmetrical interpretations just by choosing the symmetry correspondence arbitrarily. However, you, as an observer, perceive only one 3D symmetrical interpretation, and what you see is what we show you on the left in the demo. The 3D interpretation that you see has two characteristics: (1) it does not correspond to a degenerate view, and (2) it is simple. By "simple" we mean, in this case, a pair of planar curves. A planar curve is almost always simpler than a 3D curve, because when a curve is planar, the torsion of the curve is zero everywhere along this curve (Hilbert & Cohn-Vossen, 1952). Using the intuitions behind Attneave's (1954) use of information theory to describe shapes, a 3D curve has more uncertainty than a planar curve, and more uncertainty translates to a more complex description.

The next demo suggests that the concept of simplicity of the 3D interpretation may be more general than the concept of a degenerate view: http://shapebook.psych.purdue.edu/4.4/

When you use your mouse to rotate the 3D symmetrical interpretation, you realize that nothing was hidden in the original view. The original view was not degenerate, but you could not see the 3D symmetry that we produced from this 2D image. The only way to explain why the visual system does not produce a 3D symmetrical interpretation in this case is that such an interpretation is too complex. At this point, we cannot offer a general definition of simplicity, a definition that would explain all human 3D percepts, but we think that a planarity constraint is a good start. This is explained next.

Consider a pair of planar mirror-symmetrical curves in a 3D space. It follows that these two curves are identical up to a 3D reflection. As a result, their 2D perspective images are related by a 2D projective transformation, as illustrated by the diagram shown in Figure 12.11. More specifically, they are related by a 5-parameter subset of the 2D projective transformations. If the image is produced by an orthographic, rather than a perspective, projection, the curves in the image are related by a 4-parameter subset of the affine transformations. We will explain this fact in some detail next.

First, consider an orthographic projection. A 3D mirror-symmetrical pair of planar contours projects to a pair of 2D contours that are related by a 2D affine transformation. This is obvious, considering the fact that the two planar curves "out there" are identical up to a 3D rigid motion, which includes reflection. The lines connecting pairs of corresponding points between the 2D contours are parallel to one another, because they are 2D projections of the symmetry line segments that are parallel in the 3D scene. This parallelism is preserved only by a subset of the 2D affine transformations that is composed of stretch, shear, and translation along the direction of the 2D parallel lines (Figure 12.12). This subset of 2D affine transformations can be written as follows:

\[ R^t_{2D}(\theta) \begin{bmatrix} x_{pair(i)} \\ y_{pair(i)} \end{bmatrix} = \begin{bmatrix} m_{o11} & m_{o12} \\ 0 & 1 \end{bmatrix} R^t_{2D}(\theta) \begin{bmatrix} x_i \\ y_i \end{bmatrix} + \begin{bmatrix} m_{o13} \\ 0 \end{bmatrix} \qquad (19) \]

\[ R_{2D}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \]

where v_i = [x_i, y_i]^t and v_pair(i) = [x_pair(i), y_pair(i)]^t are a pair of corresponding points on the 2D curves, θ is the orientation of a line connecting v_i and v_pair(i), and R_2D is a 2D rotation. The three components of the subset of 2D affine transformations are represented by m_o11 (stretch), m_o12 (shear), and m_o13 (translation), and the orientation of the transformation is θ. Note that θ represents the orientation of the 2D parallel lines, which are the projections of the symmetry line segments. We showed that Eq. (19) is the necessary and sufficient

condition for a pair of 2D curves to have a 3D interpretation that is a 3D symmetrical pair of planar curves (Sawada, Li, & Pizlo, 2014). The fact that the two curves in the image are related by an affine transformation implies that one can use affine invariants, such as the ratio of lengths of parallel line segments and ratios of areas. Affine invariance provides a necessary condition for satisfying the relation [see Eq. (19)].

Fig. 12.11 A pair of planar mirror-symmetrical curves in a 3D space project to two curves that are related by a subset of 2D projective transformation.

When a perspective projection is used, the resulting transformation on the 2D image is a 5-parameter projectivity. Let the focal length and the principal point of the image be f and [0, 0]^t, respectively. The lines connecting the corresponding pairs of points converge at a vanishing point v_vp = [x_vp, y_vp]^t. The relation between the 2D contours that are images of a pair of planar mirror-symmetrical curves in 3D is as follows (see Sawada, Li, & Pizlo, 2014, for derivation):

\[ R^T_{3DY}(\theta_Y)\, R^T_{3DZ}(\theta_Z) \begin{bmatrix} x^*_{pair(i)} \\ y^*_{pair(i)} \\ w^*_{pair(i)} \end{bmatrix} = \begin{bmatrix} m_{p11} & m_{p12} & m_{p13} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} R^T_{3DY}(\theta_Y)\, R^T_{3DZ}(\theta_Z) \begin{bmatrix} x^*_i \\ y^*_i \\ w^*_i \end{bmatrix} \qquad (20) \]

\[ R_{3DZ}(\theta_Z) = \begin{bmatrix} \cos\theta_Z & -\sin\theta_Z & 0 \\ \sin\theta_Z & \cos\theta_Z & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad R_{3DY}(\theta_Y) = \begin{bmatrix} \cos\theta_Y & 0 & \sin\theta_Y \\ 0 & 1 & 0 \\ -\sin\theta_Y & 0 & \cos\theta_Y \end{bmatrix} \]

where v_i = [x_i, y_i]^t = [f x*_i / w*_i, f y*_i / w*_i]^t, v_pair(i) = [f x*_pair(i) / w*_pair(i), f y*_pair(i) / w*_pair(i)]^t, cos θ_Z = x_vp / |v_vp|, sin θ_Z = y_vp / |v_vp|, and θ_Y = -atan(f / |v_vp|). Again, one can use projective invariants as a necessary condition for two curves to satisfy the relation (20). The 5th parameter represents |v_vp|, which is the distance of the vanishing point v_vp from the principal point of the image.

There is a simple but interesting qualitative invariant implied by these relations [Eq. (19) and Eq. (20)]. When we consider a pair of curves on a 2D image, the curvatures must have the same (or opposite) sign for all pairs of corresponding points if these 2D curves are images of 3D planar symmetrical curves (see Figure 12.2). This qualitative criterion provides only a necessary condition for the existence of a 3D interpretation consisting of a pair of planar curves. The sufficient condition is provided by Eqs. (19) and (20). The advantage of the qualitative criterion is that it can be used with images of approximately planar 3D symmetrical curves as well.

Now that you know that the symmetry constraint should not be applied to an arbitrary 2D image with an arbitrary correspondence of points, we will explain how this constraint actually works.

Models of Recovering a 3D Mirror-Symmetrical Shape From One or Two 2D Images

Let's begin with a demo illustrating the 3D shape recovery: http://shapebook.psych.purdue.edu/1.2/jeep/

On the bottom left, you see a 3D synthetic model of a jeep. On the top left, you see a single 2D orthographic image of the jeep. The jeep is opaque, so the contours and surfaces on the back are not visible. On the top right, you see 2D contours extracted by hand. Only the front, visible contours have been marked. On the bottom right, you see the 3D contours recovered from the 2D image on the top right. The recovery is perfect! Note that we recovered the back, invisible contours as well as the front, visible ones. This shows that one of the main claims of David Marr (1982), that we see only the front visible surfaces (2.5D sketch), is not necessarily true. Our model can see the entire 3D shape, and we are sure that you do, too. Go here to see a few more examples: http://shapebook.psych.purdue.edu/1.2/

Now we will explain how our model recovers a 3D symmetrical shape from a single 2D image. The model uses four constraints: 3D mirror-symmetry, planarity of contours, maximum 3D compactness, and minimum surface area. We will focus on recovering a 3D mirror-symmetrical shape from its single 2D orthographic image. An orthographic image is an approximation to the true perspective

shape perception 269 (a) (b) (c) (d)

Fig. 12.12 Pairs of 2D contours whose relation can be represented by stretch (a, b), shear (c), and translation (d) along a horizontal direction. The stretch and shear transformations were applied in all four cases. The stretch factor is positive in (a) and negative in (b).

image, and it is also mathematically less informative than a perspective image. The orthographic image is computationally more interesting, because most real images of real objects are so close to orthographic images that it is difficult to capitalize on the mathematical advantage of a perspective projection. Let Z = 0 be the plane of the image in the 3D scene, and let the x- and y-axes of the 3D Cartesian coordinate system be the 2D coordinate system on the image plane. We assume that the symmetry correspondence has been solved (see Li, Sawada, Latecki, Steinman, & Pizlo, 2012, for an algorithm that can do this). Recall that the symmetry line segments connecting mirror-symmetrical pairs of points of the 3D object are parallel to one another. The symmetry line segments project to line segments that are also parallel to one another in the 2D image. Let's set the direction of the x-axis so that it is parallel to the projections of the symmetry line segments.

Note that the symmetry line segments in the image may not be exactly parallel if the object is not exactly mirror-symmetrical, if the image has visual noise, or if the image was produced by a perspective projection. In such cases, the x-axis is set to a direction that is as parallel to the 2D line segments as possible in the least squares sense:

\[ \min \sum_i \left( y'_i - y'_{pair(i)} \right)^2 \qquad (21) \]

where v'_i = [x'_i, y'_i]^t and v'_pair(i) = [x'_pair(i), y'_pair(i)]^t are projections of a mirror-symmetrical pair of points i and pair(i) of the 3D object. The y-coordinates of the mirror-symmetrical pairs of points in the image are modified (corrected) so that the line segments connecting them become parallel to one another (Zabrodsky & Weinshall, 1997):

\[ v_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix} = \begin{bmatrix} x'_i \\ \frac{y'_i + y'_{pair(i)}}{2} \end{bmatrix}, \qquad v_{pair(i)} = \begin{bmatrix} x_{pair(i)} \\ y_{pair(i)} \end{bmatrix} = \begin{bmatrix} x'_{pair(i)} \\ \frac{y'_i + y'_{pair(i)}}{2} \end{bmatrix} \qquad (22) \]

where v_i = [x_i, y_i]^t and v_pair(i) = [x_pair(i), y_pair(i)]^t are the projections of the points i and pair(i) after the correction. Note that y_i = y_pair(i). If the projections of the symmetry line segments are exactly parallel to one another in the image, v_i = v'_i and v_pair(i) = v'_pair(i). Now, the corrected orthographic image is consistent with an interpretation of a perfectly mirror-symmetrical 3D object.

Let the 3D coordinates of the points i and pair(i) be V_i = [X_i, Y_i, Z_i]^t and V_pair(i) = [X_pair(i), Y_pair(i), Z_pair(i)]^t. Note that the X- and Y-coordinates of V_i and V_pair(i) are identical to those of v_i and v_pair(i), respectively, under the orthographic projection. Then, the 2D symmetry line segments are perpendicular to the Y-axis, because Y_i = y_i = y_pair(i) = Y_pair(i). Note that the symmetry line segments of the 3D object are perpendicular to the symmetry plane π_s of the 3D object and are bisected by π_s. So, the tilt τ_s of the symmetry plane π_s is zero, and π_s is represented by the following equation:

\[ Z = X \tan\sigma_s + d_s \qquad (23) \]

where σ_s and d_s are the slant and the depth of π_s. The symmetry line segments are normal to π_s. Hence, they are parallel to a vector N_s = [X_Ns, Y_Ns, Z_Ns]^t = [1, 0, -cos σ_s / sin σ_s]^t that is normal to π_s. A 2D vector from v_pair(i) to v_i is l_i [1, 0]^t, where l_i = x_i - x_pair(i) and |l_i| is the length of the 2D vector. Then, a 3D vector from V_pair(i) to V_i can be recovered as:

\[ l_i N_s = l_i \left[ 1,\; 0,\; -\cos\sigma_s / \sin\sigma_s \right]^t \qquad (24) \]
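The correction of Eq. (22) and the lifting of Eq. (24) are straightforward to implement. A minimal sketch, with made-up point values, assuming the rotation of the x-axis (Eq. 21) has already been applied:

```python
import numpy as np

def correct_image(vp, vp_pair):
    """Eq. (22): average the y-coordinates of each corresponding pair so that the
    2D symmetry line segments become exactly parallel to the x-axis."""
    y_mean = 0.5 * (vp[:, 1] + vp_pair[:, 1])
    return (np.column_stack([vp[:, 0], y_mean]),
            np.column_stack([vp_pair[:, 0], y_mean]))

# noisy orthographic projections of three symmetric point pairs (illustrative values)
vp      = np.array([[0.0, 1.00], [0.5, 2.03], [1.0, -0.49]])
vp_pair = np.array([[3.0, 1.02], [2.5, 1.97], [4.0, -0.51]])
v, v_pair = correct_image(vp, vp_pair)
print(v[:, 1] - v_pair[:, 1])   # [0. 0. 0.]: segments now parallel to the x-axis

# Eq. (24): for a hypothesized slant sigma_s, each 2D segment lifts to a 3D vector l_i*Ns
sigma_s = np.deg2rad(40.0)
Ns = np.array([1.0, 0.0, -np.cos(sigma_s) / np.sin(sigma_s)])
l = v[:, 0] - v_pair[:, 0]      # l_i = x_i - x_pair(i)
segments_3d = l[:, None] * Ns   # all recovered 3D segments are parallel to Ns
```

Different choices of sigma_s produce different members of the one-parameter family discussed below.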

Let the midpoint of the 3D symmetry line segment between V_i and V_pair(i) be M_i = [X_Mi, Y_Mi, Z_Mi]^t. The midpoint M_i projects to the midpoint m_i = [x_mi, y_mi]^t between v_i and v_pair(i). The X- and Y-coordinates of M_i are also identical with those of m_i under the orthographic projection. Hence, m_i = [(x_i + x_pair(i))/2, y_i]^t and M_i = [(x_i + x_pair(i))/2, y_i, (Z_i + Z_pair(i))/2]^t. Note that M_i is on π_s, because the symmetry line segment is bisected by π_s (see Figure 12.8). Hence, the Z-coordinate of M_i can be recovered from m_i as:

\[ Z_{Mi} = x_{mi}\,\frac{\sin\sigma_s}{\cos\sigma_s} + d_s \qquad (25) \]

Fig. 12.13 A 3D mirror-symmetrical object with planar faces. The position and the orientation of the planar face enclosed by a dotted contour can be determined if the 3D coordinates of three points on the plane are known. In this image, five points on the planar face can be recovered with their visible counterparts (open circles) using Eq. (26). The remaining points on this face can be recovered as intersections of the projection lines and this plane.

The 3D coordinates of the points i and pair(i) can be recovered from their midpoint M_i and the 3D vector from V_pair(i) to V_i: V_i = M_i + l_i N_s / 2 and V_pair(i) = M_i - l_i N_s / 2. In particular, the Z-coordinates of V_i and V_pair(i) can be recovered as¹:

\[ Z_i = Z_{Mi} + \frac{l_i}{2} Z_{Ns} = x_{mi}\,\frac{\sin\sigma_s}{\cos\sigma_s} + d_s - \frac{l_i}{2}\,\frac{\cos\sigma_s}{\sin\sigma_s} = \frac{x_{pair(i)} - x_i \cos 2\sigma_s}{\sin 2\sigma_s} + d_s \]

\[ Z_{pair(i)} = Z_{Mi} - \frac{l_i}{2} Z_{Ns} = x_{mi}\,\frac{\sin\sigma_s}{\cos\sigma_s} + d_s + \frac{l_i}{2}\,\frac{\cos\sigma_s}{\sin\sigma_s} = \frac{x_i - x_{pair(i)} \cos 2\sigma_s}{\sin 2\sigma_s} + d_s \qquad (26) \]

Note that σ_s and d_s are free parameters in Eq. (26). d_s does not affect the recovered 3D object, but only its position along the Z-axis. So, we will ignore this parameter from now on. The slant σ_s affects the aspect ratio of the recovered 3D shape through its effect on Z_i and Z_pair(i). It means that the mirror-symmetrical 3D objects that are consistent with the 2D orthographic image form a one-parameter family controlled by σ_s. This family is illustrated with one example in the following demo: http://shapebook.psych.purdue.edu/2.2/

The slant σ_s can be any value in the range between 0 and 90 degrees. The 3D recovery cannot be performed when σ_s = 0 degrees or 90 degrees.

Before we choose a unique 3D shape from the one-parameter family, we have to recover points whose symmetrical counterparts are not visible. To recover these points, additional constraints are required, for example, planarity of faces (Li et al., 2009; Pizlo et al., 2010). In order to use a planarity constraint, at least three points of the plane on which the visible point is located have to be recovered first. These three points define the plane, which means that the Z-coordinate of the visible point can be computed as an intersection of this plane and the projection line emanating from the image of this visible point (see Figure 12.13). The hidden counterpart of the visible point is recovered as a mirror-reflection with respect to the symmetry plane of the 3D shape.

Now, we are ready to select a unique 3D shape from the one-parameter family. This is done by using maximum 3D compactness and minimum surface area of the 3D shape as additional constraints. 3D compactness is defined as follows (Hildebrandt & Tromba, 1996):

\[ \frac{Volume^2(\eta)}{Surface^3(\eta)} \qquad (27) \]

where η represents the 3D shape, and Volume(η) and Surface(η) are the volume and the surface area of η. The next demo shows 8 members from the one-parameter family of 3D shapes produced from the 2D line drawing shown in the center: http://shapebook.psych.purdue.edu/2.3/

Some of the shapes are tall, whereas others are wide. The shape at 3 o'clock has maximal 3D compactness, which essentially means that it has the most volume for a given surface area. We showed that human monocular performance in recovering 3D mirror-symmetrical shapes is best simulated by selecting a 3D object that maximizes the following criterion (Li et al., 2009, 2011):

\[ \frac{Volume(\eta)}{Surface^3(\eta)} \qquad (28) \]

This criterion is a combination of (or a compromise

between) maximizing the 3D compactness of the 3D object and minimizing its surface area. In other words, the observer's percept corresponds to a 3D shape that is slightly thinner in the depth direction compared to the maximally compact 3D shape. In the demo you just saw, the shape in the bottom right corner maximizes criterion (28). If it is difficult to determine the surface and the volume of the object, the compactness criterion (28) is applied to the 3D convex hull containing the entire object. This method was used to recover a 3D shape from a hand drawing of a real object (praying mantis): http://shapebook.psych.purdue.edu/1.2/mantis/

If the original image was corrected using Eq. (22), the recovered object is not consistent with the original image. This means that the recovered object must be distorted (un-corrected) so that the object agrees with the original image: V'_i = V_i + Δ_i, where Δ_i is a 3D distortion and V'_i is the position of the point i after the distortion. Let the 3D coordinates of Δ_i be [ΔX_i, ΔY_i, ΔZ_i]^t. From Eq. (22), Δ_i = [0, (y'_i - y'_pair(i))/2, ΔZ_i]^t, where ΔZ_i can be arbitrary. The magnitude of the 3D distortion |Δ_i| is minimized when ΔZ_i = 0. The result is a maximally symmetrical 3D shape consistent with the 2D image.²

Next, we will explain binocular shape recovery. In order to motivate the formulation of the binocular model, look at Figure 12.14a, which shows the results of monocular and binocular 3D shape recovery of one of our subjects (Li et al., 2011). The abscissa shows the slant of a stationary "reference" 3D shape that was viewed monocularly or binocularly, depending on the viewing condition. We used 3D symmetrical polyhedral objects consisting of 16 vertices. The subject's task was to adjust the aspect ratio of a "test" 3D shape so that it perceptually matched the 3D reference shape. The test shapes were taken from the one-parameter family of 3D shapes produced from the image of the reference shape. The test shape was viewed monocularly and was rotating around the vertical axis in order to show the subject many views of this test shape. The ordinate shows the error of the match on a log scale. Zero means a perfect (veridical) match. Negative (or positive) 1.0 means that the adjusted 3D shape had an aspect ratio that was different from the true aspect ratio by a factor of 2.

Monocular adjustment of the aspect ratio is close to veridical when the slant of the symmetry plane is not too far from 45 degrees. For slants of 15 degrees and 75 degrees, the errors were larger, in the direction of 3D shapes that were thinner along the depth direction. It is important to point out, however, that the 3D shapes in our experiment were characterized by as many as 20 parameters (8 vertices characterizing one-half of the object, times 3 coordinates (x, y, z), plus 3 parameters for the orientation of the symmetry plane, minus 7 parameters characterizing rigid motion in a 3D space and size scaling). The fact that the subjects were able to match the 3D shapes perceptually means that they recovered 19 of the 20 parameters veridically, and the error existed only in the aspect ratio.

Now, look at the black curve in the left panel of Figure 12.14. This curve, which shows the binocular performance of our subject, essentially coincides with the x-axis. The binocular shape percept was absolutely veridical for all slants tested! This is the first such result in the entire history of research on 3D shape perception. The main reason for why we

[Figure 12.14 here: two panels, (a) subject YS and (b) "YS: Model," each plotting Dissimilarity (from −2 to 2) against Slant (15, 30, 45, 60, and 75 degrees), with monocular and binocular matching curves.]

Fig. 12.14 Results of a subject, YS, (a) and results of our “YS: Model” (b) that simulated the performance of this subject. Monocular performance is shown in grey and binocular performance is shown in black.

observed veridical perception, while all others did not, is that we, but no one else, used symmetrical objects presented at nondegenerate viewing directions. Using Brunswik's (1956) terminology, our stimuli were ecologically valid: they captured all important characteristics of natural objects in our natural environment. Natural objects are symmetrical and volumetric. Unstructured stimuli such as bent paper clips (among others) that dominated prior research have no resemblance to natural objects. Results obtained with such unnatural stimuli do not allow making any generalizations, whatsoever, to how we see in everyday life. The argument is the same as that used in explaining why the responses of a nonlinear system to one set of inputs do not allow predicting the response of the system to other inputs. Everyone knows that the human visual system is highly nonlinear, but somehow this fact has been ignored when "shape" stimuli were designed in prior laboratory experiments and generalizations to natural shapes were drawn from such limited experimental results.

Formulating a binocular model that achieves perfect performance is not easy. It is much more difficult than formulating a model that performs poorly. There are two types of binocular abilities: judging depth intervals and judging depth order. The former is very sloppy, but the latter is excellent. Judging depth order belongs to hyperacuities because it is performed with precision that is an order of magnitude better than the distance between the retinal receptors (Blakemore, 1970; Westheimer, 1979; Regan, 2000). All prior models of binocular shape perception were based on binocular depth intervals. Our model uses the depth order. How can a depth order lead to veridical perception of a metric shape? The way we do it is not very different from the way multidimensional scaling (MDS) works. We combine depth order with an a priori constraint. Whereas MDS used a fairly weak constraint, namely that the ordinal judgments come from a metric space, we used a much stronger constraint of mirror-symmetry. The way 3D symmetry interacts with depth order is shown in the following interactive demo: http://shapebook.psych.purdue.edu/5.2/

Use your mouse to "walk" through the one-parameter family of 3D symmetrical shapes produced from the image shown on the left. All these 3D shapes produce the same image, which is illustrated by the fact that the horizontal projection lines do not change. By changing the slant of the symmetry plane, the aspect ratio of the 3D shape changes, resulting in a frequent change of the depth order of the vertices. The depths of the vertices are marked by colored vertical lines. If the depth order is correctly perceived by the observer, the aspect ratio of the 3D shape is restricted to a very small range. Our Bayesian model puts the symmetry and planarity priors in the likelihood function, which also evaluates the probability of the depth order of many pairs of vertices. The prior is a Gaussian distribution whose maximum corresponds to the maximum of the criterion shown in formula (28). The product of these two functions is a posterior whose maximum is taken as the model's "percept." When the information about depth order is less reliable, for example due to a larger viewing distance, the maximum of the posterior gets closer to the maximum of the prior. In the extreme case, when the model "views" the shape monocularly, the posterior is equivalent to the prior. Both the monocular and binocular performance of our model is shown in Figure 12.14b. Note how close the model's performance is to the performance of our subject.

Recovery of a 3D shape is mathematically simpler when a 2D perspective, rather than orthographic, image is used. After solving the symmetry correspondence problem in the 2D image, the vanishing point for the symmetry line segments is estimated. This vanishing point allows a unique 3D shape recovery, which means that the compactness constraint is not needed anymore, or it is less critical. However, reliable estimation of the vanishing point is computationally difficult unless other constraints, such as the direction of gravity and the position of the horizon in the 2D image, are used (Li et al., 2012).

Conclusion: a new definition of shape

All natural objects in our natural environment have a great deal of symmetry. Once one appreciates this fact, it follows that any meaningful definition of shape must be based on the concept of symmetry. We claim that symmetry is the sine qua non of shape (Li et al., 2013; Pizlo, Li, Sawada, & Steinman, 2014). In plain English, there is only as much shape in an object as there is symmetry in it. It implies that some objects have more shape than others and there may be objects with no shape, whatsoever. Consider completely irregular objects, such as a crumpled newspaper and a bent paperclip, or a random set of points. Such stimuli were quite common in prior studies on shape perception. These studies showed that shape constancy fails with those objects

(see Pizlo, 2008 for review). This is not surprising because these objects did not have shape to begin with.

Symmetry refers to spatial self-similarity, which means that one part of an object is produced by copying, transforming, and pasting another part. If the transformation is a reflection, we produce mirror-symmetry, the type of symmetry characterizing most animal bodies. If the transformation is a rotation, we produce rotational symmetry, the type of symmetry characterizing flowers. If the transformation is a translation, we produce generalized cones, like those used by Biederman (1987) to define his geons. As a reminder, a generalized cone is produced by sweeping a planar cross-section in a 3D space along an axis. The size of the cross-section does not have to be constant, hence the name "cone." A circular cone is produced by sweeping a circle along a straight line segment and reducing linearly the circle's diameter. If a square is used instead of a circle, we produce a square pyramid. And so on. If the cross-section is rotated around and translated along the axis, we produce a spiral symmetry like that characterizing shells of snails. Spiral symmetry is also present in the arrangement of leaves around the stem of a flower.

Note that no real object is exactly symmetrical. In such a case, the shape of the object can be represented as: symmetry + distortion. One-half of a human face is never an exact reflection of the other half. The difference is usually so small that it is hard to detect. But when a person makes an asymmetrical grimace, we do not cease to see the face as symmetrical, in the sense that we know what needs to be changed in the face to make it perfectly symmetrical again. In fact, we perceive the face as the same even when the grimace changes in time, and we perceive the person as the same even when the person walks or runs. As long as symmetry is not completely removed from a 3D nonrigid object, the shape can be recognized as the same. This is illustrated in the following three demos: http://shapebook.psych.purdue.edu/1.1/.

It should be obvious to the reader at this point that using symmetry to define shape has great advantages, because it allows us to use the mathematical tools of group theory, including group transformations and their invariants, to describe real objects that we encounter in our natural environment. The same tools will surely be used to explain how our visual system recognizes shapes, remembers them, and uses them to identify objects and their functions. The diagram in Figure 12.11 illustrated how a symmetry group characterizing the 3D shape "out there" is transformed into another symmetry group characterizing model-based invariants in the 2D retinal image. This is just a glimpse of what and how much can be done when shape is defined the way we propose to do it, namely by the object's symmetries. Finally, and most important, symmetry is a natural prior in the task of recovery of 3D shapes from 2D images. It is the kind of prior that need not be learned and need not be updated during our experience with real objects in our natural environment. It allows us to see 3D shapes of objects veridically regardless of whether the objects are familiar or not. As a mathematical concept, symmetry exists, so to speak, independently from real objects, and there is no reason to believe that it can be improved or "updated" during our interaction with the physical environment.

Notes
1. It is also possible to derive the 3D coordinates using multiple-view geometry (Vetter & Poggio, 1994; Li, Pizlo, & Steinman, 2009). Assume a single 2D orthographic image of a 3D mirror-symmetric shape is given. Then, a virtual image of the same 3D shape from a different viewpoint can be computed by reflecting the original 2D image with respect to an arbitrary line (Vetter & Poggio, 1994; see the section Symmetry of a 3D Shape and Its 2D Orthographic and Perspective Projections). The 3D shape can be computed from these two 2D images of the shape using multiple-view geometry, resulting in Eq. (26) (Sawada, 2010).
2. You can see the application of this method here: http://shapebook.psych.purdue.edu/2.4/ In these three examples, we recovered the skeleton of a human body. The compactness criterion (24) was applied to the 3D convex hull containing the skeleton.

Glossary
Bayesian method: A method of solving inverse problems. The posterior is proportional to the product of a likelihood function and a prior. The likelihood function is the probability that the 3D object can produce the 2D image. The prior is the a priori probability of the 3D object. The 3D object for which the posterior has a global maximum is taken as the solution.
Degenerate view: Two or more 3D points project to the same 2D point.
Forward problem: Mapping from a 3D object (model) to a 2D image (data). The forward problem is well-posed: a solution exists, is unique, and depends continuously on the data.
Group of transformations: A set of transformations is a group if the following four axioms are satisfied:
1. The identity transformation is in the set.
2. Any transformation in the set can be undone by an inverse transformation.

3. The composition of two transformations from the set is also in this set (closure).
4. The result of a composition of three transformations does not depend on how we group these operations (associativity).
Group invariants: A set of transformations has invariants if and only if the set is a group. An invariant is a property that remains unchanged under the transformation. Invariants exist only if the transformations are one-to-one.
Homogeneous coordinates: Generalizations of Euclidean coordinates that make it possible to use matrix operations in the analysis of perspective and projective transformations.
Inverse problem: Mapping from a 2D image (data) to a 3D object (model). The inverse problem is ill-posed. Solving an inverse problem is equivalent to an inference problem.
Model-based invariants: Invariants of a set of transformations that is not a group. These invariants are typically formulated by restricting the types of objects.
Regularization method: A method of solving inverse problems. A cost function evaluates how close the 3D object is to the 2D image and how well the 3D object satisfies a priori simplicity constraints. The 3D object for which the cost function has a global minimum is taken as the solution.
Shape: Refers to all spatially global self-similarities (symmetries) of an object.
Shape constancy: Refers to the fact that the perceived shape of a 3D object is constant despite changes in the shape of the 2D retinal image produced by changes in the 3D viewing direction.
Symmetry: Refers to all spatial self-similarities of an object. There are three common types of symmetries in natural objects: mirror (reflectional), rotational, and translational.
Symmetry correspondence problem: Establishing which pairs (more generally, which sets) of 2D points, features, or contours form 3D symmetrical configurations.
Symmetry lines: In a mirror-symmetrical object, symmetry lines connect pairs of symmetrical points. These lines are parallel in the 3D object and they are bisected by the plane of symmetry. In a 2D orthographic image symmetry lines are also parallel. In a 2D perspective image symmetry lines intersect at the vanishing point, which is a projection of a point at infinity.
Vanishing point: In a 2D perspective image the vanishing point is a projection of a 3D point at infinity that represents the intersection of parallel lines in the 3D space.
Veridical perception: Refers to the fact that we see the 3D objects and 3D scenes the way they are "out there."

References
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61, 183–193.
Berkeley, G. (1910). A new theory of vision. New York, NY: Dutton.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115–147.
Biederman, I. & Gerhardstein, P. C. (1993). Recognizing depth-rotated objects: Evidence and conditions from three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception & Performance, 19, 1162–1182.
Blakemore, C. (1970). The range and scope of binocular depth discrimination in man. Journal of Physiology, 211, 599–622.
Bouman, C. & Sauer, K. (1993). A generalized Gaussian image model for edge-preserving MAP estimation. IEEE Transactions on Image Processing, 2, 296–310.
Brunswik, E. (1956). Perception and the representative design of psychological experiments. Berkeley, CA: University of California Press.
Burns, J. B., Weiss, R. & Riseman, E. M. (1990). View variation of point set and line segment features. Proceedings of DARPA Image Understanding Workshop, 650–659.
Chan, M. W., Stevenson, A. K., Li, Y. & Pizlo, Z. (2006). Binocular shape constancy from novel views: the role of a priori constraints. Perception & Psychophysics, 68, 1124–1139.
Coxeter, H. S. M. (1987). Projective geometry (second edition). New York, NY: Springer.
Fechner, G. (1860/1966). Elements of psychophysics. New York, NY: Holt, Rinehart & Winston.
Hartley, R. & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge, MA: Cambridge University Press.
Hilbert, D., & Cohn-Vossen, S. (1952). Geometry and the imagination. New York, NY: Chelsea.
Hildebrandt, S., & Tromba, A. (1996). The parsimonious universe. New York, NY: Springer.
Kersten, D. (1999). High level vision as statistical inference. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 353–363). Cambridge, MA: MIT Press.
Knill, D. C. & Richards, W. (1996). Perception as Bayesian inference. Cambridge, MA: Cambridge University Press.
Köhler, W. (1920/1938). Physical Gestalten. In W. D. Ellis (Ed.), A source book of Gestalt psychology (pp. 17–54). New York, NY: Humanities Press.
Li, Y., Pizlo, Z. & Steinman, R. M. (2009). A computational model that recovers the 3D shape of an object from a single 2D retinal representation. Vision Research, 49, 979–991.
Li, Y. & Pizlo, Z. (2011). Depth cues vs. simplicity principle in 3D shape perception. Topics in Cognitive Science, 3, 667–685.
Li, Y., Sawada, T., Shi, Y., Kwon, T. & Pizlo, Z. (2011). A Bayesian model of binocular perception of 3D mirror symmetric polyhedra. Journal of Vision, 11(4), 1–20.
Li, Y., Sawada, T., Shi, Y., Steinman, R. M. & Pizlo, Z. (2013). Symmetry is the sine qua non of shape. In S. Dickinson & Z. Pizlo (Eds.), Shape perception in human and computer vision. New York, NY: Springer.
Li, Y., Sawada, T., Latecki, L. M., Steinman, R. M. & Pizlo, Z. (2012). Visual recovery of the shapes and sizes of objects, as well as distances among them, in a natural 3D scene. Journal of Mathematical Psychology, 56, 217–231.
Mach, E. (1906/1959). The analysis of sensations. New York, NY: Dover.
Marr, D. (1982). Vision. New York, NY: W.H. Freeman.

Mundy, J. L. & Zisserman, A. (1992). Geometric invariance in computer vision. Cambridge, MA: MIT Press.
Mundy, J. L., Zisserman, A. & Forsyth, D. (1993). Applications of invariance in computer vision. New York, NY: Springer-Verlag.
Pizlo, Z. (1991). Shape constancy in human beings and computers based on a perspective invariant. Doctoral dissertation, Department of Psychology, University of Maryland, College Park, MD.
Pizlo, Z. (1994). A theory of shape constancy based on perspective invariants. Vision Research, 34, 1637–1658.
Pizlo, Z. (2001). Perception viewed as an inverse problem. Vision Research, 41, 3145–3161.
Pizlo, Z. (2008). 3D shape: its unique place in visual perception. Cambridge, MA: MIT Press.
Pizlo, Z., Li, Y., Sawada, T. & Steinman, R. M. (2014). Making a machine that sees like us. New York, NY: Oxford University Press.
Pizlo, Z. & Rosenfeld, A. (1992). Recognition of planar shapes from perspective images using contour-based invariants. Computer Vision, Graphics, and Image Processing: Image Understanding, 56, 330–350.
Pizlo, Z. & Stevenson, A. K. (1999). Shape constancy from novel views. Perception & Psychophysics, 61, 1299–1307.
Pizlo, Z., Sawada, T., Li, Y., Kropatsch, W., & Steinman, R. M. (2010). New approach to the perception of 3D shape based on veridicality, complexity, symmetry and volume. Vision Research, 50, 1–11.
Poggio, T., Torre, V. & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Regan, D. (2000). Human perception of objects. Sunderland, MA: Sinauer.
Rothwell, C. A. (1995). Object recognition through invariant indexing. Oxford, England: Oxford University Press.
Sawada, T. (2010). Visual detection of symmetry in 3D shapes. Journal of Vision, 10(6), 4, 1–22.
Sawada, T. & Pizlo, Z. (2008). Detection of skewed symmetry. Journal of Vision, 8(5), 1–18.
Sawada, T., Li, Y. & Pizlo, Z. (2011). Any pair of 2D curves is consistent with a 3D symmetric interpretation. Symmetry, 3, 365–388.
Sawada, T., Li, Y. & Pizlo, Z. (2014). Detecting 3-D mirror symmetry in a 2-D camera image for 3-D shape recovery. Proceedings of the IEEE, 102, 1588–1606.
Springer, C. (1964). Geometry and analysis of projective spaces. New York, NY: W.H. Freeman.
Tikhonov, A. N. & Arsenin, V. Y. (1977). Solutions of ill-posed problems. New York, NY: Wiley.
Vetter, T. & Poggio, T. (1994). Symmetric 3D objects are an easy case for 2D object recognition. Spatial Vision, 8(4), 443–453.
Wagemans, J. (1992). Perceptual use of nonaccidental properties. Canadian Journal of Psychology, 46, 236–279.
Wagemans, J. (1993). Skewed symmetry: A nonaccidental property used to perceive visual forms. Journal of Experimental Psychology: Human Perception and Performance, 19, 364–380.
Weiss, I. (1988). Projective invariants of shapes. Proceedings of DARPA Image Understanding Workshop (pp. 1125–1134). Cambridge, MA.
Weiss, I. (1993). Geometric invariants and object recognition. International Journal of Computer Vision, 10, 207–231.
Wertheimer, M. (1923/1958). Principles of perceptual organization. In D. C. Beardslee & M. Wertheimer (Eds.), Readings in perception (pp. 115–135). New York, NY: Van Nostrand.
Westheimer, G. (1979). Cooperative neural processes involved in stereoscopic acuity. Experimental Brain Research, 36, 585–597.
Zabrodsky, H., & Weinshall, D. (1997). Using bilateral symmetry to improve 3D reconstruction from image sequences. Computer Vision and Image Understanding, 67, 48–57.
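The four axioms in the Glossary entry for a group of transformations can be checked numerically for a concrete set of transformations. The sketch below is our illustration, not from the chapter: it verifies identity, inverse, closure, and associativity for planar rotations, which do form a group and therefore admit invariants (such as length).

```python
import math

def rot(theta):
    """2D rotation matrix, represented as a nested tuple."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def compose(a, b):
    """Matrix product: apply transformation b, then a."""
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
                 for i in range(2))

def close(a, b, tol=1e-9):
    """Entry-wise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

A, B, C = rot(0.3), rot(1.1), rot(-0.7)
I = rot(0.0)

# 1. The identity transformation is in the set.
assert close(compose(A, I), A)
# 2. Every rotation is undone by an inverse rotation in the set.
assert close(compose(A, rot(-0.3)), I)
# 3. Closure: composing two rotations yields another rotation.
assert close(compose(A, B), rot(0.3 + 1.1))
# 4. Associativity of composition.
assert close(compose(A, compose(B, C)), compose(compose(A, B), C))
```

The same kind of check fails for sets that are not groups (e.g., perspective projections composed with rigid motions), which is exactly why the Glossary introduces model-based invariants for them.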

PART IV
New Directions

CHAPTER 13 Bayesian Estimation in Hierarchical Models

John K. Kruschke and Wolf Vanpaemel

Abstract Bayesian data analysis involves describing data by meaningful mathematical models, and allocating credibility to parameter values that are consistent with the data and with prior knowledge. The Bayesian approach is ideally suited for constructing hierarchical models, which are useful for data structures with multiple levels, such as data from individuals who are members of groups which in turn are in higher-level organizations. Hierarchical models have parameters that meaningfully describe the data at their multiple levels and connect information within and across levels. Bayesian methods are very flexible and straightforward for estimating parameters of complex hierarchical models (and simpler models too). We provide an introduction to the ideas of hierarchical models and to the Bayesian estimation of their parameters, illustrated with two extended examples. One example considers baseball batting averages of individual players grouped by fielding position. A second example uses a hierarchical extension of a cognitive process model to examine individual differences in attention allocation of people who have eating disorders. We conclude by discussing Bayesian model comparison as a case of hierarchical modeling. Key Words: Bayesian statistics, Bayesian data analysis, Bayesian modeling, hierarchical model, model comparison, Markov chain Monte Carlo, shrinkage of estimates, multiple comparisons, individual differences, cognitive psychometrics, attention allocation

The Ideas of Hierarchical Bayesian Estimation

Bayesian reasoning formalizes the reallocation of credibility over possibilities in consideration of new data. Bayesian reasoning occurs routinely in everyday life. Consider the logic of the fictional detective Sherlock Holmes, who famously said that when a person has eliminated the impossible, then whatever remains, no matter how improbable, must be the truth (Doyle, 1890). His reasoning began with a set of candidate possibilities, some of which had low credibility a priori. Then he collected evidence through detective work, which ruled out some possibilities. Logically, he then reallocated credibility to the remaining possibilities. The complementary logic of judicial exoneration is also commonplace. Suppose there are several unaffiliated suspects for a crime. If evidence implicates one of them, then the other suspects are exonerated. Thus, the initial allocation of credibility (i.e., culpability) across the suspects was reallocated in response to new data.

In data analysis, the space of possibilities consists of parameter values in a descriptive model. For example, consider a set of data measured on a continuous scale, such as the weights of a group of 10-year-old children. We might want to describe the set of data in terms of a mathematical normal distribution, which has two parameters, namely the mean and the standard deviation. Before collecting the data, the possible means and standard deviations have some prior credibility, about which

we might be very uncertain or highly informed. After collecting the data, we reallocate credibility to values of the mean and standard deviation that are reasonably consistent with the data and with our prior beliefs. The reallocated credibilities constitute the posterior distribution over the parameter values.

We care about parameter values in formal models because the parameter values carry meaning. When we say that the mean weight is 32 kilograms and the standard deviation is 3.2 kilograms, we have a clear sense of how the data are distributed (according to the model). As another example, suppose we want to describe children's growth with a simple linear function, which has a slope parameter. When we say that the slope is 5 kilograms per year, we have a clear sense of how weight changes through time (according to the model). The central goal of Bayesian estimation, and a major goal of data analysis generally, is deriving the most credible parameter values for a chosen descriptive model, because the parameter values are meaningful in the context of the model.

Bayesian estimation provides an entire distribution of credibility over the space of parameter values, not merely a single "best" value. The distribution precisely captures our uncertainty about the parameter estimate. The essence of Bayesian estimation is to formally describe how uncertainty changes when new data are taken into account.

Hierarchical Models Have Parameters with Hierarchical Meaning

In many situations, the parameters of a model have meaningful dependencies on each other. As a simplistic example, suppose we want to estimate the probability that a type of trick coin, manufactured by the Acme Toy Company, comes up heads. We know that different coins of that type have somewhat different underlying biases to come up heads, but there is a central tendency in the bias imposed by the manufacturing process. Thus, when we flip several coins of that type, each several times, we can estimate the underlying biases in each coin and the typical bias and consistency of the manufacturing process. In this situation, the observed heads of a coin depend only on the bias in the individual coin, but the bias in the coin depends on the manufacturing parameters. This chain of dependencies among parameters exemplifies a hierarchical model (Kruschke, 2015, Ch. 9).

As another example, consider research into childhood obesity. The researchers measure weights of children in a number of different schools that have different school lunch programs, and from a number of different school districts that may have different but unknown socioeconomic statuses. In this case, a child's weight might be modeled as dependent on his or her school lunch program. The school lunch program is characterized by parameters that indicate the central tendency and variability of weights that it tends to produce. The parameters of the school lunch program are, in turn, dependent on the school's district, which is described by parameters indicating the central tendency and variability of school-lunch parameters across schools in the district. This chain of dependencies among parameters again exemplifies a hierarchical model.

In general, a model is hierarchical if the probability of one parameter can be conceived to depend on the value of another parameter. Expressed formally, suppose the observed data, denoted D, are described by a model with two parameters, denoted α and β. The probability of the data is a mathematical function of the parameter values, denoted by p(D|α,β), which is called the likelihood function of the parameters. The prior probability of the parameters is denoted p(α,β). Notice that the likelihood and prior are expressed, so far, in terms of combinations of α and β in the joint parameter space. The probability of the data, weighted by the probability of the parameter values, is the product, p(D|α,β)p(α,β). The model is hierarchical if that product can be factored as a chain of dependencies among parameters, such as p(D|α,β)p(α,β) = p(D|α)p(α|β)p(β).

Many models can be reparameterized, and conditional dependencies can be revealed or obscured under different parameterizations. The notion of hierarchical has to do with a particular definition of a model structure that expresses dependencies among parameters in a meaningful way. In other words, it is the semantics of the parameters when factored in the corresponding way that makes a model hierarchical. Ultimately, any multiparameter model merely has parameters in a joint space, whether that joint space is conceived as hierarchical or not. Many realistic situations involve natural hierarchical meaning, as illustrated by the two major examples that will be described at length in this chapter.

One of the primary applications of hierarchical models is describing data from individuals within

groups. A hierarchical model may have parameters for each individual that describe each individual's tendencies, and the distribution of individual parameters within a group is modeled by a higher-level distribution with its own parameters that describe the tendency of the group. The individual-level and group-level parameters are estimated simultaneously. Therefore, the estimate of each individual-level parameter is informed by all the other individuals via the estimate of the group-level distribution, and the group-level parameters are more precisely estimated by the jointly constrained individual-level parameters. The hierarchical approach is better than treating each individual independently, because the data from different individuals meaningfully inform one another. And the hierarchical approach is better than collapsing all the individual data together, because collapsed data may blur or obscure trends within each individual.

Advantages of the Bayesian Approach

Bayesian methods provide tremendous flexibility in designing models that are appropriate for describing the data at hand, and Bayesian methods provide a complete representation of parameter uncertainty (i.e., the posterior distribution) that can be directly interpreted. Unlike the frequentist interpretation of parameters, there is no construction of sampling distributions from auxiliary null hypotheses. In a frequentist approach, although it may be possible to find a maximum-likelihood estimate (MLE) of parameter values in a hierarchical nonlinear model, the subsequent task of interpreting the uncertainty of the MLE can be very difficult. To decide whether an estimated parameter value is significantly different from a null value, frequentist methods demand construction of sampling distributions of arbitrarily-defined deviation statistics, generated from arbitrarily-defined null hypotheses, from which p values are determined for testing null hypotheses. When there are multiple tests, frequentist decision rules must adjust the p values. Moreover, frequentist methods are unwieldy for constructing confidence intervals on parameters, especially for complex hierarchical nonlinear models that are often the primary interest for cognitive scientists.1 Furthermore, confidence intervals change when the researcher's intention changes (e.g., Kruschke, 2013). Frequentist methods for measuring uncertainty (as confidence intervals from sampling distributions) are fickle and difficult, whereas Bayesian methods are inherently designed to provide clear representations of uncertainty. A thorough critique of frequentist methods such as p values would take us too far afield. Interested readers may consult many other references, such as articles by Kruschke (2013) or Wagenmakers (2007).

Some Mathematics and Mechanics of Bayesian Estimation

The mathematically correct reallocation of credibility over parameter values is specified by Bayes' rule (Bayes & Price, 1763):

p(α|D) = p(D|α) p(α) / p(D)    (1)

where p(α|D) is the posterior, p(D|α) is the likelihood, p(α) is the prior, and

p(D) = ∫ dα p(D|α) p(α)    (2)

is called the "marginal likelihood" or "evidence." The formula in Eq. 1 is a simple consequence of the definition of conditional probability (e.g., Kruschke, 2015), but it has huge ramifications when applied to meaningful, complex models.

In some simple situations, the mathematical form of the posterior distribution can be analytically derived. These cases demand that the integral in Eq. 2 can be mathematically derived in conjunction with the product of terms in the numerator of Bayes' rule. When this can be done, the result can be especially pleasing because an explicit, simple formula for the posterior distribution is obtained.

Analytical solutions for Bayes' rule can rarely be achieved for realistically complex models. Fortunately, instead, the posterior distribution is approximated, to arbitrarily high accuracy, by generating a huge random sample of representative parameter values from the posterior distribution. A large class of algorithms for generating a representative random sample from a distribution is called Markov chain Monte Carlo (MCMC) methods. Regardless of which particular sampler from the class is used, in the long run they all converge to an accurate representation of the posterior distribution. The bigger the MCMC sample, the finer-resolution picture we have of the posterior distribution. Because the sampling process uses a Markov chain, the random sample produced by the MCMC process is often called a chain.
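The simplest member of the MCMC family is the Metropolis algorithm. The sketch below is our minimal illustration, not the chapter's software: the data (9 heads in 12 coin flips, with a flat prior) and the proposal width are made up for the demo. Note that the sampler only ever needs the posterior up to its normalizing constant, so the intractable integral of Eq. 2 never has to be computed.

```python
import math, random

random.seed(1)

def log_post(theta):
    """Unnormalized log-posterior: Bernoulli likelihood for 9 heads in 12
    flips, with a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 9 * math.log(theta) + 3 * math.log(1 - theta)

# Metropolis sampling: propose a random jump, accept it with probability
# min(1, posterior ratio); otherwise stay put. The visited values form the chain.
theta, chain = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.1)
    if random.random() < math.exp(min(0.0, log_post(proposal) - log_post(theta))):
        theta = proposal
    chain.append(theta)

burned = chain[2000:]                     # discard warm-up steps
post_mean = sum(burned) / len(burned)

# With a flat prior the true posterior is Beta(10, 4), whose mean is 10/14,
# so the chain's average should land close to that value.
assert abs(post_mean - 10 / 14) < 0.05
```

Because this toy posterior happens to be conjugate, the chain's answer can be checked against the exact Beta(10, 4) result; for the realistic hierarchical models discussed in this chapter no such closed form exists, and the chain is all we have.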

bayesian estimation in hierarchical models 281

It is important to understand that the MCMC "sample" or "chain" is a huge representative sample of parameter values from the posterior distribution. The MCMC sample is not to be confused with the sample of data. For any particular analysis, there is a single fixed sample of data, and there is a single underlying mathematical posterior distribution that is inferred from the sample of data. The MCMC chain typically uses tens of thousands of representative parameter values from the posterior distribution to represent the posterior distribution. Box 1 provides more details about assessing when an MCMC chain is a good representation of the underlying posterior distribution.

Contemporary MCMC software works seamlessly for complex hierarchical models involving nonlinear relationships between variables and non-normal distributions at multiple levels. Model-specification languages such as BUGS (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2013; Lunn, Thomas, Best, & Spiegelhalter, 2000), JAGS (Plummer, 2003), and Stan (Stan, 2013) allow the user to specify descriptive models to satisfy theoretical and empirical demands.

Box 1. MCMC Details
Because the MCMC sampling is a random walk through parameter space, we would like some assurance that it successfully explored the posterior distribution without getting stuck, oversampling, or undersampling zones of the posterior. Mathematically, the samplers will be accurate in the long run, but we do not know in advance exactly how long is long enough to produce a reasonably good sample.
There are various diagnostics for assessing MCMC chains. It is beyond the scope of this chapter to review their details, but the ideas are straightforward. One type of diagnostic assesses how "clumpy" the chain is, by using a descriptive statistic called the autocorrelation of the chain. If a chain is strongly autocorrelated, successive steps in the chain are near each other, thereby producing a clumpy chain that takes a long time to smooth out. We want a smooth sample to be sure that the posterior distribution is accurately represented in all regions of the parameter space. To achieve stable estimates of the tails of the posterior distribution, one heuristic is that we need about 10,000 independent representative parameter values (Kruschke, 2015, Section 7.5.2). Stable estimates of central tendencies can be achieved by smaller numbers of independent values. A statistic called the effective sample size (ESS) takes into account the autocorrelation of the chain and suggests what would be an equivalently sized sample of independent values.
Another diagnostic assesses whether the MCMC chain has gotten stuck in a subset of the posterior distribution, rather than exploring the entire posterior parameter space. This diagnostic takes advantage of running two or more distinct chains, and assessing the extent to which the chains overlap. If several different chains thoroughly overlap, we have evidence that the MCMC samples have converged to a representative sample.

Example: Shrinkage and Multiple Comparisons of Baseball Batting Abilities

American baseball is a sport in which one person, called a pitcher, throws a small ball as quickly as possible over a small patch of earth, called home plate, next to which is standing another person holding a stick, called a bat, who tries to hit the ball with the bat. If the ball is hit appropriately into the field, the batter attempts to run to other marked patches of earth arranged in a diamond shape. The batter tries to arrive at the first patch of earth, called first base, before the other players, called fielders, can retrieve the ball and throw it to a teammate attending first base.

One of the crucial abilities of baseball players is, therefore, the ability to hit a very fast ball (sometimes thrown more than 90 miles [145 kilometers] per hour) with the bat. An important goal for enthusiasts of baseball is estimating each player's ability to bat the ball. Ability cannot be assessed directly but can only be estimated by observing how many times a player was able to hit the ball in all his opportunities at bat, or by observing hits and at-bats from other similar players.

There are nine players in the field at once, who specialize in different positions. These include the pitcher, the catcher, the first baseman, the second baseman, the third baseman, the shortstop, the left fielder, the center fielder, and the right fielder. When one team is in the field, the other team is at bat. The teams alternate being at bat and being in the field. Under some rules, the pitcher does not have to bat when his team is at bat.

Because different positions emphasize different skills while on the field, not all players are prized

282 new directions

for their batting ability alone. In particular, pitchers and catchers have specialized skills that are crucial for team success. Therefore, based on the structure of the game, we know that players with different primary positions are likely to have different batting abilities.

The Data

The data consist of records from 948 players in the 2012 regular season of Major League Baseball who had at least one at-bat.2 For player i, we have his number of opportunities at bat, AB_i, his number of hits, H_i, and his primary position when in the field, pp(i). In the data, there were 324 pitchers with a median of 4.0 at-bats, 103 catchers with a median of 170.0 at-bats, and 60 right fielders with a median of 340.5 at-bats, along with 461 players in six other positions.

Box 2. Model Assumptions
For the analysis of batting abilities, we assume that a player's batting ability, θ_i, is constant for all at-bats, and that the outcome of any at-bat is independent of other at-bats. These assumptions may be false, but the notion of a constant underlying batting ability is a meaningful construct for our present purposes. Assumptions must be made for any statistical analysis, whether Bayesian or not, and the conclusions from any statistical analysis are conditional on its assumptions. An advantage of Bayesian analysis is that, relative to 20th-century frequentist techniques, there is greater flexibility to make assumptions that are appropriate to the situation. For example, if we wanted to build a more elaborate analysis, we could incorporate data about when in the season the at-bats occurred, and estimate temporal trends in ability due to practice or fatigue. Or, we could incorporate data about which pitcher was being faced in each at-bat, and we could estimate pitcher difficulties simultaneously with batter abilities. But these elaborations, although possible in the Bayesian framework, would go far beyond our purposes in this chapter.

The Descriptive Model with Its Meaningful Parameters

We want to estimate, for each player, his underlying probability θ_i of hitting the ball when at bat. The primary data to inform our estimate of θ_i are the player's number of hits, H_i, and his number of opportunities at bat, AB_i. But the estimate will also be informed by our knowledge of the player's primary position, pp(i), and by the data from all the other players (i.e., their hits, at-bats, and positions). For example, if we know that player i is a pitcher, and we know that pitchers tend to have θ values around 0.13 (because of all the other data), then our estimate of θ_i should be anchored near 0.13 and adjusted by the specific hits and at-bats of the individual player. We will construct a hierarchical model that rationally shares information across players within positions, and across positions within all major league players.3

We denote the ith player's underlying probability of getting a hit as θ_i. (See Box 2 for discussion of assumptions in modeling.) Then the number of hits H_i out of AB_i at-bats is a random draw from a binomial distribution that has success rate θ_i, as illustrated at the bottom of Figure 13.1. The arrow pointing to H_i is labeled with a "∼" symbol to indicate that the number of hits is a random variable distributed as a binomial distribution.

To formally express our prior belief that different primary positions emphasize different skills and hence have different batting abilities, we assume that the player abilities θ_i come from distributions specific to each position. Thus, the θ_i's for the 324 pitchers are assumed to come from a distribution specific to pitchers, which might have a different central tendency and dispersion than the distribution of abilities for the 103 catchers, and so on for the other positions. We model the distribution of θ_i's for a position as a beta distribution, which is a natural distribution for describing values that fall between zero and one, and is often used in this sort of application (e.g., Kruschke, 2015). The mean of the beta distribution for primary position pp is denoted μ_pp, and the narrowness of the distribution is denoted κ_pp. The value of μ_pp represents the typical batting ability of players in primary position pp, and the value of κ_pp represents how tightly clustered the abilities are across players in primary position pp. The κ parameter is sometimes called the concentration or precision of the beta distribution.4 Thus, an individual player whose primary position is pp(i) is assumed to have a batting ability θ_i that comes from a beta distribution with mean μ_pp(i) and precision κ_pp(i). The values of μ_pp and κ_pp are estimated simultaneously with all the θ_i. Figure 13.1 illustrates this aspect of the model by showing an arrow pointing to θ_i

from a beta distribution. The arrow is labeled with "∼ ... i" to indicate that the θ_i have credibilities distributed as a beta distribution for each of the individuals. The diagram shows beta distributions as they are conventionally parameterized by two shape parameters, denoted a_pp and b_pp, that can be algebraically redescribed in terms of the mean μ_pp and precision κ_pp of the distribution: a_pp = μ_pp κ_pp and b_pp = (1 − μ_pp) κ_pp.

To formally express our prior knowledge that all players, from all positions, are professionals in major league baseball, and, therefore, should mutually inform each other's estimates, we assume that the nine position abilities μ_pp come from an overarching beta distribution with mean μ_μ and precision κ_μ. This structure is illustrated in the upper part of Figure 13.1 by the split arrow, labeled with "∼ ... pp", pointing to μ_pp from a beta distribution. The value of μ_μ in the overarching distribution represents our estimate of the batting ability of major league players generally, and the value of κ_μ represents how tightly clustered the abilities are across the nine positions. These across-position parameters are estimated from the data, along with all the other parameters.

The precisions of the nine distributions are also estimated from the data. The precisions of the position distributions, κ_pp, are assumed to come from an overarching gamma distribution, as illustrated in Figure 13.1 by the split arrow, labeled with "∼ ... pp", pointing to κ_pp from a gamma distribution. A gamma distribution is a generic and natural distribution for describing non-negative values such as precisions (e.g., Kruschke, 2015). A gamma distribution is conventionally parameterized by shape and rate values, denoted in Figure 13.1 as s_κ and r_κ. We assume that the precisions of each position can mutually inform each other; that is, if the batting abilities of catchers are tightly clustered, then the batting abilities of shortstops should probably also be tightly clustered, and so forth. Therefore, the shape and rate parameters of the gamma distribution are themselves estimated.

At the top level in Figure 13.1 we incorporate any prior knowledge we might have about general properties of batting abilities for players in the

Fig. 13.1 The hierarchical descriptive model for baseball batting ability. The diagram should be scanned from the bottom up. At the bottom, the number of hits by the ith player, H_i, is assumed to come from a binomial distribution with maximum value being the at-bats, AB_i, and probability of getting a hit being θ_i. See text for further details.
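As a concrete sketch of the generative structure in Figure 13.1, the following simulation draws position means and precisions, then player abilities, then hit counts, from the top of the hierarchy down. The numeric constants are illustrative stand-ins of our own choosing (in the actual analysis these quantities are estimated from the data under broad, noncommittal priors), and the code is ours, not the chapter's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Top-level constants (illustrative stand-ins only).
mu_mu, kappa_mu = 0.23, 100.0   # overarching beta on the position means
shape_k, rate_k = 2.0, 0.05     # overarching gamma on the position precisions

n_positions, players_per_position, at_bats = 9, 20, 100

# Position level: each primary position gets a mean mu_pp and precision kappa_pp.
mu_pp = rng.beta(mu_mu * kappa_mu, (1.0 - mu_mu) * kappa_mu, size=n_positions)
kappa_pp = rng.gamma(shape_k, 1.0 / rate_k, size=n_positions)  # numpy takes scale = 1/rate

# Player level: theta_i ~ beta(a, b) with a = mu*kappa and b = (1 - mu)*kappa,
# then the observed hit count H_i ~ binomial(AB_i, theta_i).
a = np.repeat(mu_pp * kappa_pp, players_per_position)
b = np.repeat((1.0 - mu_pp) * kappa_pp, players_per_position)
theta = rng.beta(a, b)
hits = rng.binomial(at_bats, theta)

print(theta.shape, hits.shape)  # 180 simulated players across 9 positions
```

Bayesian estimation runs this generative story in reverse: given the observed hits and at-bats, it reallocates credibility across all of the latent quantities simultaneously.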

major leagues, such as evidence from previous seasons of play. Baseball aficionados may have extensive prior knowledge that could be usefully implemented in a Bayesian model. Unlike baseball experts, we have no additional background knowledge, and, therefore, we will use very vague and noncommittal top-level prior distributions. Thus, the top-level beta distribution on the overall batting ability is given parameter values A = 1 and B = 1, which make it uniform over all possible batting abilities from zero to one. The top-level gamma distributions (on precision, shape, and rate) are given parameter values that make them extremely broad and noncommittal, such that the data dominate the estimates, with minimal influence from the top-level prior.

There are 970 parameters in the model altogether: 948 individual θ_i, plus μ_pp and κ_pp for each of nine primary positions, plus μ_μ and κ_μ across positions, plus s_κ and r_κ. The Bayesian analysis yields credible combinations of the parameters in the 970-dimensional joint parameter space.

We care about the parameter values because they are meaningful. Our primary interest is in the estimates of individual batting abilities, θ_i, and in the position-specific batting abilities, μ_pp. We are also able to examine the relative precisions of abilities across positions to address questions such as, Are batting abilities of catchers as variable as batting abilities of shortstops? We will not do so here, however.

Results: Interpreting the Posterior Distribution

We used MCMC chains with total saved length of 15,000 after adaptation of 1,000 steps and burn-in of 1,000 steps, using 3 parallel chains called from the runjags package (Denwood, 2013), thinned by 30 merely to keep a modest file size for the saved chain. The diagnostics (see Box 1) assured us that the chains were adequate to provide an accurate and high-resolution representation of the posterior distribution. The effective sample size (ESS) for all the reported parameters and differences exceeded 6,000, with nearly all exceeding 10,000.

check of robustness against changes in top-level prior constants

Because we wanted the top-level prior distribution to be noncommittal and have minimal influence on the posterior distribution, we checked whether the choice of prior had any notable effect on the posterior by conducting the analysis with different constants in the top-level gamma distributions. Whether all gamma distributions used shape and rate constants of 0.1 and 0.1, or 0.001 and 0.001, the results were essentially identical. The results reported here are for gamma constants of 0.001 and 0.001.

comparisons of positions

We first consider the estimates of hitting ability for different positions. Figure 13.2, left side, shows the marginal posterior distributions for the μ_pp parameters for the positions of catcher and pitcher. The distributions show the credible values of the parameters generated by the MCMC chain. These marginal distributions collapse across all other parameters in the high-dimensional joint parameter space. The lower-left panel in Figure 13.2 shows the distribution of differences between catchers and pitchers. At every step in the MCMC chain, the difference between the credible values of catcher μ and pitcher μ was computed, to produce a credible value for the difference. The result is 15,000 credible differences (one for each step in the MCMC chain).

For each marginal posterior distribution, we provide two summaries: its approximate mode, displayed on top, and its 95% highest density interval (HDI), shown as a black horizontal bar. A parameter value inside the HDI has higher probability density (i.e., higher credibility) than a parameter value outside the HDI. The total probability of parameter values within the 95% HDI is 95%. The 95% HDI indicates the 95% most credible parameter values.

The posterior distribution can be used to make discrete decisions about specific parameter values (as explained in Box 3). For comparing catchers and pitchers, the distribution of credible differences falls far from zero, so we can say with high credibility that catchers hit better than pitchers. (The difference is so big that it excludes any reasonable ROPE around zero that would be used in the decision rule described in Box 3.)

The right side of Figure 13.2 shows the marginal posterior distributions of the μ_pp parameters for the positions of right fielder and catcher. The lower-right panel shows the distribution of differences between right fielders and catchers. The 95% HDI of differences excludes a difference of zero, with 99.8% of the distribution falling above zero. Whether or not we reject zero as a credible

[Figure 13.2 appears here. Left column: Catcher, μ_pp=2 (mode 0.241; 95% HDI 0.233 to 0.250); Pitcher, μ_pp=1 (mode 0.130; 95% HDI 0.120 to 0.141); Difference μ_pp=2 − μ_pp=1 (mode 0.111; 95% HDI 0.0976 to 0.125; 0% < 0 < 100%; 0% in ROPE). Right column: Right Field, μ_pp=9 (mode 0.260; 95% HDI 0.251 to 0.269); Catcher, μ_pp=2 (mode 0.241; 95% HDI 0.233 to 0.250); Difference μ_pp=9 − μ_pp=2 (mode 0.0183; 95% HDI 0.0056 to 0.031; 0.2% < 0 < 99.8%; 10% in ROPE).]
Fig. 13.2 Comparison of estimated batting abilities of different positions. In the data, there were 324 pitchers with a median of 4.0 at-bats, 103 catchers with a median of 170.0 at-bats, and 60 right fielders with a median of 340.5 at-bats, along with 461 players in six other positions. The modes and HDI limits are all indicated to three significant digits, with a trailing zero truncated from the display. In the lowest row, a difference of 0 is marked by a vertical dotted line annotated with the amount of the posterior distribution that falls below or above 0. The limits of the ROPE are marked with vertical dotted lines and annotated with the amount of the posterior distribution that falls inside it. The subscripts such as “pp=2” indicate arbitrary indexical values for the primary positions, such as 1 for pitcher, 2 for catcher, and so forth.
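The panel summaries in Figure 13.2 can be reproduced mechanically once a chain is in hand: form the difference of the two parameters at every MCMC step, find the 95% HDI as the narrowest interval containing 95% of the steps, and compare the HDI with a ROPE (Box 3). Here is a hedged sketch of that machinery; the Gaussian stand-in chains below merely mimic the catcher and pitcher marginals and are not the chapter's actual MCMC output.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    s = np.sort(samples)
    n_in = int(np.ceil(mass * len(s)))
    widths = s[n_in - 1:] - s[:len(s) - n_in + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + n_in - 1]

def rope_decision(samples, rope=(-0.01, 0.01), mass=0.95):
    lo, hi = hdi(samples, mass)
    if lo > rope[1] or hi < rope[0]:
        return "reject null value"    # HDI entirely outside the ROPE
    if lo >= rope[0] and hi <= rope[1]:
        return "accept null value"    # HDI entirely inside the ROPE
    return "withhold decision"        # HDI overlaps the ROPE

# Stand-in chains for the catcher and pitcher position means (values chosen
# to mimic Figure 13.2, not taken from the chapter's JAGS output).
rng = np.random.default_rng(2)
mu_catcher = rng.normal(0.241, 0.005, size=15_000)
mu_pitcher = rng.normal(0.130, 0.006, size=15_000)
diff = mu_catcher - mu_pitcher    # one credible difference per MCMC step

print(rope_decision(diff))        # the HDI lies far above the ROPE
```

Applied to the catcher-versus-pitcher difference, the HDI lies far above the ROPE, reproducing the decision reported in the text; the same functions applied to θ-difference chains yield the individual-player decisions discussed later.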

difference depends on our decision rule. If we use a ROPE from −0.01 to +0.01, as shown in Figure 13.2, then we would not reject a difference of zero, because the 95% HDI overlaps the ROPE. The choice of ROPE depends on what is practically equivalent to zero as judged by aficionados of baseball. Our choice of ROPE shown here is merely for illustration.

Box 3. Decision Rules for Bayesian Posterior Distribution
The posterior distribution can be used for making decisions about the viability of specific parameter values. In particular, people might be interested in a landmark value of a parameter, or a difference of parameters. For example, we might want to know whether a particular position's batting ability exceeds 0.20, say. Or we might want to know whether two positions' batting abilities have a non-zero difference.
The decision rule involves using a region of practical equivalence (ROPE) around the null or landmark value. Values within the ROPE are equivalent to the landmark value for practical purposes. For example, we might declare that for batting abilities, a difference of less than 0.04 is practically equivalent to zero. To decide that two positions have credibly different batting abilities, we check that the 95% HDI excludes the entire ROPE around zero. Using a ROPE also allows accepting a difference of zero: if the entire 95% HDI falls within the ROPE, it means that all the most credible values are practically equivalent to zero (i.e., the null value), and we decide to accept the null value for practical purposes. If the 95% HDI overlaps the ROPE, we withhold decision. Note that it is only the landmark value that is being rejected or accepted, not all the values inside the ROPE. Furthermore, the estimate of the parameter value is given by the posterior distribution, whereas the decision rule merely declares whether the parameter value is practically equivalent to the landmark value.
We will illustrate use of the decision rule in the results from the actual analyses. In some cases we will not explicitly specify a ROPE, leaving some nonzero-width ROPE implicit. In general, this allows flexibility in decision making when limits of practical equivalence may change as competing theories and instrumentation change (Serlin & Lapsley, 1993). In some cases, the posterior distribution falls so far away from any reasonable ROPE that it is superfluous to specify a specific ROPE. For more information about the application of a ROPE, under the somewhat different terms "range of equivalence," "indifference zone," and "good-enough belt," see, e.g., Carlin and Louis (2009); Freedman, Lowe, and Macaskill (1984); Hobbs and Carlin (2008); Serlin and Lapsley (1985, 1993); Spiegelhalter, Freedman, and Parmar (1994).
Notice that the decision rule is distinct from the Bayesian estimation itself, which produces the complete posterior distribution. We are using a decision rule only in case we demand a discrete decision from the continuous posterior distribution. There is another Bayesian approach to making decisions about null values that is based on comparing a "spike" prior on the landmark value against a diffuse prior, which we discuss in the final section on model comparison, but for the purposes of this chapter we focus on using the HDI with ROPE.

In Figure 13.2, the triangle on the x-axis indicates the ratio in the data of total hits divided by total at-bats for all players in that position. Notice that the modes of the posterior are not centered exactly on the triangles. Instead, the modal estimates are shrunken toward the middle between the pitchers (who tend to have the lowest batting averages) and the catchers (who tend to have higher batting averages). Thus, the modes of the posterior marginal distributions are not as extreme as the proportions in the data (marked by the triangles). This shrinkage is produced by the mutual influence of data from all the other players, because they influence the higher-level distributions, which in turn influence the lower-level estimates. For example, the modal estimate for catchers is 0.241, which is less than the ratio of total hits to total at-bats for catchers. This shrinkage in the estimate for catchers is caused by the fact that there are 324 pitchers who, as a group, have relatively low batting ability, and pull down the overarching estimate of batting ability for major-league players (even with the other seven positions taken into account). The overarching estimate in turn affects the estimates of all positions and, in particular, pulls down the estimate of batting ability for catchers. We see in the upper right of Figure 13.2 that the estimate of batting ability for right fielders is also shrunken, but not as much as for catchers. This is because the right fielders tend to be at bat much more often than the catchers, and, therefore, the estimate of ability for right fielders more closely matches their data proportions. In the next section we examine results for individual players, and the concepts of shrinkage will become more dramatic and more clear.

comparisons of individual players

In this section we consider estimates of the batting abilities of individual players. The left side of Figure 13.3 shows a comparison of two individual players with the same record, 1 hit in 3 at-bats, but who play different positions, namely catcher and pitcher. Notice that the triangles are at the same place on the x-axes for the two players, but there are radically different estimates of their probability of getting a hit because of the different positions they play. The data from all the other catchers inform the model that catchers tend to have θ values of around 0.241. Because this particular catcher has so few data to inform his estimate, the estimate from the higher-level distribution dominates. The same is true for the pitcher, but the higher-level distribution says that pitchers tend to have θ values of around 0.130. The resulting distribution of differences, in the lowest panel, suggests that these two players have

[Figure 13.3 appears here. Left column: Tim Federowicz (Catcher), 1 hit in 3 at-bats (θ_263 mode 0.241; 95% HDI 0.191 to 0.297), and Casey Coleman (Pitcher), 1 hit in 3 at-bats (θ_169 mode 0.132; 95% HDI 0.0905 to 0.176); difference θ_263 − θ_169 (mode 0.111; 95% HDI 0.0419 to 0.178; 0.1% < 0 < 99.9%; 2% in ROPE). Right column: Mike Leake (Pitcher), 18 hits in 61 at-bats (θ_494 mode 0.157; 95% HDI 0.119 to 0.209), and Wandy Rodriguez (Pitcher), 4 hits in 61 at-bats (θ_754 mode 0.112; 95% HDI 0.0825 to 0.156); difference θ_494 − θ_754 (mode 0.0365; 95% HDI −0.0151 to 0.105; 5.6% < 0 < 94.4%; 46% in ROPE).]

Fig. 13.3 Comparison of estimated batting abilities of different individual players. The left column shows two players with the same actual records of 1 hit in 3 at-bats, but very different estimates of batting ability, because they play different positions. The right column shows two players with rather different actual records (18/61 and 4/61) but similar estimates of batting ability, because they play the same position. Triangles show actual ratios of hits/at-bats. Bottom histograms display an arbitrary ROPE from −0.04 to +0.04; different decision makers might use a different ROPE. The subscripts on θ indicate arbitrary identification numbers of different players, such as 263 for Tim Federowicz.

credibly different hitting abilities, even though their actual hits and at-bats are identical. In other words, because we know the players play these particular different positions, we can infer that they probably have different hitting abilities.

The right side of Figure 13.3 shows another comparison of two individual players, both of whom are pitchers, with seemingly quite different batting averages of 18/61 and 4/61, as marked by the triangles on the x-axis. Despite the players' different hitting records, the posterior estimates of their hitting probabilities are not very different. Notice the dramatic shrinkage of the estimates toward the mode of players who are pitchers. Indeed, in the lower panel, we see that a difference of zero is credible, as it falls within the 95% HDI of the differences. The shrinkage is produced because there is a huge amount of data, from 324 pitchers, informing the position-level distribution about the hitting ability of pitchers. Therefore, the estimates of two individual pitchers with only modest numbers of at-bats are strongly shrunken toward the group-level mode. In other words, because we know that the players are both pitchers, we can infer that they probably have similar hitting abilities.

The amount of shrinkage depends on the amount of data. This is illustrated in Figure 13.4,

[Figure 13.4 appears here. Left column: Andrew McCutchen (Center Field), 194 hits in 593 at-bats (θ_159 mode 0.304; 95% HDI 0.274 to 0.335), and Brett Jackson (Center Field), 21 hits in 120 at-bats (θ_844 mode 0.233; 95% HDI 0.194 to 0.278); difference θ_159 − θ_844 (mode 0.0643; 95% HDI 0.0171 to 0.122; 0.5% < 0 < 99.5%; 14% in ROPE). Right column: ShinSoo Choo (Right Field), 169 hits in 598 at-bats (θ_573 mode 0.276; 95% HDI 0.246 to 0.302), and Ichiro Suzuki (Right Field), 178 hits in 629 at-bats (θ_428 mode 0.275; 95% HDI 0.248 to 0.304); difference θ_573 − θ_428 (mode −0.00212; 95% HDI −0.0398 to 0.039; 50.8% < 0 < 49.2%; 95% in ROPE).]

Fig. 13.4 The left column shows two individuals with rather different actual batting ratios (194/593 and 21/120) who both play center field. Although there is notable shrinkage produced by playing the same position, the quantity of data is sufficient to exclude a difference of zero from the 95% HDI on the difference (lower histogram); although the HDI overlaps the arbitrary ROPE shown here, different decision makers might use a different ROPE. The right column shows two right fielders with very high and nearly identical actual batting ratios. The 95% HDI of their difference falls within the ROPE in the lower right histogram. Note: Triangles show actual batting ratios of hits/at-bats.

which shows comparisons of players from the same position, but for whom there are much more personal data from more at-bats. In these cases, although there is some shrinkage caused by position-level information, the amount of shrinkage is not as strong, because the additional individual data keep the estimates anchored closer to the data.

The left side of Figure 13.4 shows a comparison of two center fielders with 593 and 120 at-bats, respectively. Notice that the shrinkage of the estimate for the player with 593 at-bats is not as extreme as for the player with 120 at-bats. Notice also that the width of the 95% HDI for the player with 593 at-bats is narrower than for the player with 120 at-bats. This again illustrates the concept that the estimate is informed by both the data from the individual player and by the data from all the other players, especially those who play the same position. The lower left panel of Figure 13.4 shows that the estimated difference excludes zero (but still overlaps the particular ROPE used here).

The right side of Figure 13.4 shows right fielders with huge numbers of at-bats and nearly the same batting average. The 95% HDI of the difference falls almost entirely within the ROPE, so we might decide to declare that the players have identical probability of getting a hit for practical purposes; that is, we might decide to accept the null value of zero difference.
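The dependence of shrinkage on the amount of data has a simple conjugate illustration: hold the position-level distribution fixed as a beta prior (mean 0.13 and precision 100, loosely mimicking the pitcher-level estimates; in the chapter these position-level values are estimated jointly rather than fixed) and watch the posterior mean for an individual player move from the prior mean toward the player's own batting ratio as his at-bats increase. The numbers below are illustrative only.

```python
# Conjugate beta-binomial sketch of shrinkage (illustrative constants only).
# Prior: theta ~ beta(a, b) with mean mu and precision kappa, so a = mu*kappa
# and b = (1 - mu)*kappa. Posterior mean = (a + hits) / (kappa + at_bats).
mu, kappa = 0.13, 100.0   # stand-in position-level mean and precision
a, b = mu * kappa, (1.0 - mu) * kappa

for hits, at_bats in [(3, 10), (30, 100), (300, 1000)]:
    post_mean = (a + hits) / (kappa + at_bats)
    # The data ratio is 0.30 in every case; with more at-bats the posterior
    # mean moves from the prior mean (0.13) toward the data ratio.
    print(at_bats, round(post_mean, 3))
```

With 10, 100, and 1,000 at-bats the posterior mean is about 0.145, 0.215, and 0.285, respectively: the same batting ratio is shrunken strongly when it rests on little data and barely at all when it rests on much data.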

bayesian estimation in hierarchical models 289

Shrinkage and Multiple Comparisons
In hierarchical models with multiple levels, there is shrinkage of estimates within each level. In the model of this section (Figure 13.1), there was shrinkage of the player-position parameters toward the overall central tendency, as illustrated by the pitcher and catcher distributions in Figure 13.2, and there was shrinkage of the individual-player parameters within each position toward the position central tendency, as shown by various examples in Figures 13.3 and 13.4. The model also provided some strong inferences about player abilities based on position alone, as illustrated by the estimates for individual players with few at-bats in the left column of Figure 13.3.

There were no corrections for multiple comparisons. We conducted all the comparisons without computing p values, and without worrying whether we might intend to make additional comparisons in the future, which is quite likely given that there are 9 positions and 948 players in whom we might be interested.

It is important to be clear that Bayesian methods do not prevent false alarms. False alarms are caused by accidental conspiracies of rogue data that happen to be unrepresentative of the true population, and no analysis method can fully mitigate false conclusions from unrepresentative data. There are two main points to be made with regard to false alarms in multiple comparisons from a Bayesian perspective.

First, the Bayesian method produces a posterior distribution that is fixed, given the data. The posterior distribution does not depend on which comparisons are intended by the analyst, unlike traditional frequentist methods. Our decision rule, using the HDI and ROPE, is based on the posterior distribution, not on a false alarm rate inferred from a null hypothesis and an intended sampling/testing procedure.

Second, false alarms are mitigated by shrinkage in hierarchical models (as exemplified in the right column of Figure 13.3). Because of shrinkage, it takes more data to produce a credible difference between parameter values. Shrinkage is a rational, mathematical consequence of the hierarchical model structure (which expresses our prior knowledge of how parameters are related) and the actually observed data. Shrinkage is not related in any way to corrections for multiple comparisons, which do not depend on the observed data but do depend on the intended comparisons. Hierarchical modeling is possible with non-Bayesian estimation, but frequentist decisions are based on auxiliary sampling distributions instead of the posterior distribution.

Example: Clinical Individual Differences in Attention Allocation
Hierarchical Bayesian estimation can be applied straightforwardly to more elaborate models, such as information-processing models typically used in cognitive science. Generally, such models formally describe the processes underlying behavior in tasks such as thinking, remembering, perceiving, deciding, learning, and so on. Cognitive models are increasingly finding practical uses in a wide variety of areas outside cognitive science. One of the most promising uses of cognitive process models is the field of cognitive psychometrics (Batchelder, 1998; Riefer, Knapp, Batchelder, Bamber, & Manifold, 2002; Vanpaemel, 2009), where cognitive process models are used as psychometric measurement models. These models have become important tools for quantitative clinical cognitive science (see Neufeld, chapter 16, this volume).

In our second example of hierarchical Bayesian estimation, we use data from a classification task and a corresponding cognitive model to assess young women's attention to other women's body size and facial affect, following the research of Treat, Nosofsky, McFall, and Palmeri (2002). Rather than relying on self-reports, Viken et al. (2002) collected performance data in a prototype-classification task involving photographs of women varying in body size and facial affect. Furthermore, rather than using generic statistical models for data analysis, the researchers applied a computational model of category learning designed to describe underlying psychological properties. The model, known as the multiplicative prototype model (MPM: Nosofsky, 1987; Reed, 1972), has parameters that describe how much perceptual attention is allocated to body size or facial affect. The modeling made it possible to assess how participants in the task allocated their attention.

To measure attention allocation, Viken et al. (2002) tapped into women's perceived similarities of photographs of other women. The women in the photographs varied in their facial expressions of affect (happy to sad) and in their body size (light to heavy). We focus here on a particular categorization task in which the observer had to classify a target photo as belonging with reference photo X or with reference photo Y. In one version of the experiment, reference photo X was of a light, happy woman and reference photo Y was of a heavy, sad woman. In another version, not discussed here, the features of the reference photos were reversed.

Suppose the target photo t showed a heavy, happy woman. If the observer was paying attention mostly to affect, then photo t should tend to be classified with reference photo X, which matched on affect. If the observer was paying attention mostly to body size, then photo t should tend to be classified with reference photo Y, which matched on body size. A schematic representation of the perceptual space for photographs is shown in Figure 13.5. In the actual experiment, there were many different target photos from throughout the perceptual space. By recording how each target photo was categorized by the observer, the observer's attention allocation can be inferred.

290 new directions

Fig. 13.5 The perceptual space for photographs of women who vary on body size (horizontal axis) and affect (vertical axis). Photo X shows a prototypical light, happy woman and photo Y shows a prototypical heavy, sad woman. The test photo, t, is categorized with X or Y according to its relative perceptual proximity to those prototypes. In the left panel, attention to body size (denoted w in the text) is low, resulting in compression of the body-size axis, and, therefore, test photo t tends to be classified with prototype X. In the right panel, attention to body size is high, resulting in expansion of the body-size axis, and, therefore, test photo t tends to be classified with prototype Y.

Viken et al. (2002) were interested in whether women suffering from the eating disorder bulimia allocated their attention differently than normal women. Bulimia is characterized by bouts of overconsumption of food with a feeling of loss of control, followed by self-induced vomiting or abuse of laxatives to prevent weight gain. The researchers were specifically interested in how bulimics allocated their attention to other women's facial affect and body size, because perception of body size has been the focus of past research into eating disorders, and facial affect is relevant to social perception but is not specifically implicated in eating disorders. An understanding of how bulimics allocate attention could have implications for both the etiology and treatment of the disease.

Viken et al. (2002) collected data from a group of women who were high in bulimic symptoms, and from a group that was low. Viken et al. then used likelihood-ratio tests to compare a model that used separate attention weights in each group to a model that used a single attention weight for both groups. Their model-comparison approach revealed that high-symptom women, relative to low-symptom women, display enhanced attention to body size and decreased attention to facial affect.

In contrast to their non-Bayesian, nonhierarchical, nonestimation approach, we use a Bayesian hierarchical estimation approach to investigate the same issue. The hierarchical nature of our approach means that we do not assume that all subjects within a symptom group have the same attention to body size. Bayesian inference and decision-making implies that we do not require assumptions about sampling intentions and multiple tests that are required for computing p values. Moreover, our use of estimation instead of only model comparison ensures that we will know how much the groups differ.

The Data
Viken et al. (2002) obtained classification judgments from 38 women on 22 pictures of other women, varying in body size (light to heavy) and facial affect (happy to sad). Symptoms of bulimia were also measured for all of the women. Eighteen of these women had BULIT scores exceeding 88, which is considered to be high in bulimic symptoms (Smith & Thelen, 1984). The remaining 20 women had BULIT scores lower than 45, which is considered to be low in bulimic symptoms. Each woman performed the classification task described earlier, in which she was instructed to freely classify each target photo t as one of two types of women exemplified by reference photo X and reference photo Y. No corrective feedback was provided. Each target photo was presented twice; hence, for each woman i, the data include the frequency of classifying stimulus t as a type X, ranging between 0 and 2. Our goal is to use these data to infer a meaningful measure of attention allocation for each individual observer, and simultaneously to infer an overall measure of attention allocation for women high in bulimic symptoms and for women low in bulimic symptoms. We will rely

on a hierarchical extension of the MPM, as described next.

The Descriptive Model with Its Meaningful Parameters
Models of categorization take perceptual stimuli as input and generate precise probabilities of category assignments as output. The input stimuli must be represented formally, and many leading categorization models assume that stimuli can be represented as points in a multidimensional space, as was suggested in Figure 13.5. Importantly, the models assume that attention plays a key role in categorization, and formalize the attention allocated to perceptual dimensions as free parameters (for a review see, e.g., Kruschke, 2008). In particular, the MPM (Nosofsky, 1987) determines the similarity between a target item and a reference item by multiplicatively weighting the separation of the items on each dimension by the corresponding attention allocated to each dimension. The higher the similarity of a stimulus to a reference category prototype, relative to other category prototypes, the higher the probability of assigning the stimulus to the reference category.

For each trial in which a target photo t is presented with reference photos X and Y, the MPM produces the probability, p_i(X|t), that the ith observer classifies stimulus t as category X. This probability depends on two free parameters. One parameter is denoted w_i, which indicates the attention that the ith observer pays to body size. The value of w_i can range from 0 to 1. Attention to affect is simply 1 − w_i. The second parameter is denoted c_i and called the "sensitivity" of observer i. The sensitivity can be thought of as the observer's decisiveness, which is how strongly the observer converts a small similarity advantage for X into a large choice advantage for X. Note that attention and sensitivity parameters can differ across observers, but not across stimuli, which are assumed to have fixed locations in an underlying perceptual space.

Formally, the MPM posits that the probability that photo t will be classified with reference photo X instead of reference photo Y is determined by the similarity of t to X relative to the total similarity:

    p_i(X|t) = s_tX / (s_tX + s_tY).   (3)

The similarity between target and reference is, in turn, determined as a nonlinearly decreasing function of the distance between t and X, d_tX, in the psychological space:

    s_tX = exp(−c_i d_tX),   (4)

where c_i > 0 is the sensitivity parameter for observer i. The psychological distance between target t and reference X is given by the weighted distance between the corresponding points in the 2-dimensional psychological space:

    d_tX = [ w_i |x_tb − x_Xb|^2 + (1 − w_i) |x_ta − x_Xa|^2 ]^(1/2),   (5)

where x_ta denotes the position of the target on the affect dimension, and x_tb denotes the position of the target on the body-size dimension. These positions are normative average ratings of the photographs on two 10-point scales: body size (1 = underweight, 10 = overweight) and affect (1 = unhappy, 10 = happy), as provided by a separate sample of young women. The free parameter 0 < w_i < 1 corresponds to the attention weight on the body-size dimension for observer i. It reflects the key assumption of the MPM that the structure of the psychological space is systematically modified by selective attention (see Figure 13.5).

The Hierarchical Structure
We construct a hierarchical model that has parameters to describe each individual, and parameters to describe the overall tendencies of the bulimic and normal groups. The hierarchy is analogous to the baseball example discussed earlier: Just as individual players were nested within fielding positions, here individual observers are nested within bulimic-symptom groups. (One difference, however, is that we do not build an overarching distribution across bulimic-symptom groups, because there are only two groups.) With this hierarchy, we express our prior expectation that bulimic women are similar but not identical to each other, and nonbulimic women are similar but not identical to each other, but the two groups may be different. The hierarchical model allows the parameter estimates for an individual observer to be rationally influenced by the data from other individuals within their symptom group.

In our model, the individual attention weights are assumed to come from an overarching distribution that is characterized by a measure of central tendency and of dispersion. The overarching distributions for the high-symptom and low-symptom groups are estimated separately. As the attention weights w_i are constrained to range between 0 and 1,

we assume the parent distribution for the w_i's is a beta distribution, parameterized by mean μ_w^[g] and precision κ_w^[g], where [g] indexes the group membership (i.e., high symptom or low symptom). The individual sensitivities, c_i, are also assumed to come from an overarching distribution. Since the sensitivities are non-negative, a gamma distribution is a convenient parent distribution, parameterized by mode mo_c^[g] and standard deviation σ_c^[g], where [g] again indicates the group membership (i.e., high symptom or low symptom). The group-level parameters (i.e., μ_w^[g], κ_w^[g], mo_c^[g], and σ_c^[g]) are assumed to come from vague, noncommittal uniform distributions. There are 84 parameters altogether, including w_i and c_i for the 38 observers and the 8 group-level parameters. Figure 13.6 summarizes the hierarchical model in an integrated diagram. The caption provides details.

The parameters of most interest are the group-level attention to body size, μ_w^[g], for g ∈ {low, high}. Other meaningful questions could focus on the relative variability among groups in attention, which would be addressed by considering the κ_w^[g] parameters, but we will not pursue these here.

Fig. 13.6 The hierarchical model for attention allocation. At the bottom of the diagram, the classification data are denoted as X_t|i = 1 if observer i says "X" to target t, and X_t|i = 0 otherwise. The responses come from a Bernoulli distribution that has its success parameter determined by the MPM, as defined in Eqs. 3, 4, and 5 in the main text. The ellipsis on the arrow pointing to the response indicates that this relation holds for all targets within every individual. Scanning up the diagram, the individual attention parameters, w_i, come from an overarching group-level beta distribution that has mean μ_w and concentration κ_w (hence shape parameters of a_w = μ_w κ_w and b_w = (1 − μ_w) κ_w, as was indicated explicitly for the beta distributions in Figure 13.1). The individual sensitivity parameters c_i come from an overarching group-level gamma distribution that has mode mo_c and standard deviation σ_c (with shape and rate parameters that are algebraic combinations of mo_c and σ_c; see Kruschke, 2015, Section 9.2.2). The group-level parameters all come from noncommittal, broad uniform distributions. This model is applied separately to the high-symptom and low-symptom observers.

Results: Interpreting the Posterior Distribution
The Bayesian hierarchical approach to estimation yields attention weights for each observer, informed by all the other observers in the group. At the same time, it provides an estimate of the attention weight at the group level. Further, for every individual estimate and the group-level estimates, a measure of uncertainty is provided, in the form of a credible interval (95% HDI), which can be used as part of a decision rule to decide whether or not there are credible differences between individuals or between groups.

The MCMC process used 3 chains with a total of 100,000 steps after a burn-in of 4,000 steps. It produced a smooth (converged) representation of the 84-dimensional posterior distribution. We use the MCMC sample as an accurate and high-resolution representation of the posterior distribution.

Check of Robustness Against Changes in the Top-Level Prior Constants
We conducted a sensitivity analysis by using different constants in the top-level uniform distributions, to check whether they had any notable influence on the resulting posterior distribution. Whether all uniform distributions assumed an upper bound of 10 or 50, the results were essentially identical. The results reported here are for an upper bound of 10.

Comparison Across Groups of Attention to Body Size
Figure 13.7 shows the marginal posterior distribution for the group-level parameters of most interest. The left side shows the distribution of the central tendency of attention to body size for each group as well as the distribution of their difference. In particular, the bottom-left histogram shows that the low-symptom group has an attention weight on body size about 0.36 lower than the high-symptom

group, and this difference is credibly nonzero. The right side shows that the most credible difference of sensitivities is near zero.

Fig. 13.7 Marginal posterior distribution of group-level parameters for the prototype classification task. The left column shows the group-level central tendency of the attention weight on body size, μ_w^[g] (μ_w^low: mode = 0.526, 95% HDI [0.335, 0.718]; μ_w^high: mode = 0.919, 95% HDI [0.79, 0.976]; μ_w^low − μ_w^high: mode = −0.36, 95% HDI [−0.583, −0.15], with 99.9% of the distribution below zero). The bottom-left histogram reveals a credibly nonzero difference between groups, with low-symptom observers allocating about 0.36 less attention to body size than high-symptom observers. The 95% HDI is so far away from a difference of zero that any reasonable ROPE would be excluded; therefore, we do not specify a particular ROPE. The right column shows the group-level central tendency of the sensitivity parameter, mo_c^[g] (mo_c^low: mode = 0.606, 95% HDI [0.0637, 0.851]; mo_c^high: mode = 0.573, 95% HDI [0.432, 0.685]; mo_c^low − mo_c^high: mode = 0.0347, 95% HDI [−0.505, 0.325], with 55.1% of the distribution below zero). The bottom-right histogram shows that zero difference is squarely among the most credible differences.

The conclusions from our hierarchical Bayesian estimation agree with those of Viken et al. (2002), who took a non-Bayesian, nonhierarchical, model-comparison approach. We also find that high-symptom women, relative to low-symptom women, show enhanced attention to body size and decreased attention to facial affect, but no differences in their sensitivities. However, our hierarchical Bayesian estimation approach has provided explicit distributions on the credible differences between the groups.

Comparisons Across Individual Women's Attention to Body Size
Although the primary question of interest involves the group-level central tendencies, hierarchical Bayesian estimation also automatically provides estimates of the attention weights of individual women. Figure 13.8 shows the estimates of individual attention weights w_i for three women, based on the hierarchical Bayesian estimation that shares information across all observers to inform the estimate of each individual observer. Figure 13.8 also shows the individual estimates from a nonhierarchical MLE, which derives each individual estimate from the data of a single observer only.
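As a concrete illustration of Eqs. 3–5 and of the nonhierarchical MLE just mentioned, the sketch below fits w and c for a single observer by exhaustive grid search over the MPM likelihood. Everything here is hypothetical: the prototype coordinates, target photos, and response counts are invented for illustration (the "data" are expected counts generated from known parameters, so the fit should recover them), and grid search merely stands in for whatever optimizer the original researchers used.

```python
import math

def mpm_prob(w, c, target, ref_x, ref_y):
    """Eqs. 3-5: probability of classifying `target` with ref_x.
    Each stimulus is (body_size, affect) on the 10-point rating scales."""
    def dist(t, r):  # Eq. 5: attention-weighted Euclidean distance
        return math.sqrt(w * (t[0] - r[0]) ** 2 + (1 - w) * (t[1] - r[1]) ** 2)
    s_x = math.exp(-c * dist(target, ref_x))   # Eq. 4: similarity to X
    s_y = math.exp(-c * dist(target, ref_y))
    return s_x / (s_x + s_y)                   # Eq. 3

# Hypothetical prototypes: X = light/happy, Y = heavy/sad; hypothetical targets.
REF_X, REF_Y = (2.0, 8.0), (8.0, 2.0)
TARGETS = [(3.0, 3.0), (7.0, 7.0), (5.0, 2.0), (2.0, 5.0), (8.0, 8.0), (6.0, 4.0)]

# Fake an observer with true w = 0.8, c = 2.0; with 2 presentations per
# target, use the expected count of "X" responses as the data.
W_TRUE, C_TRUE = 0.8, 2.0
data = [(t, 2.0 * mpm_prob(W_TRUE, C_TRUE, t, REF_X, REF_Y)) for t in TARGETS]

def log_lik(w, c):
    """Binomial log-likelihood of the "X" counts (2 trials per target)."""
    total = 0.0
    for t, n_x in data:
        p = mpm_prob(w, c, t, REF_X, REF_Y)
        total += n_x * math.log(p) + (2.0 - n_x) * math.log(1.0 - p)
    return total

# Nonhierarchical MLE for this one observer, by grid search over (w, c).
grid_w = [i / 100 for i in range(1, 100)]   # 0.01 .. 0.99
grid_c = [j / 10 for j in range(1, 41)]     # 0.1 .. 4.0
w_hat, c_hat = max(((w, c) for w in grid_w for c in grid_c),
                   key=lambda wc: log_lik(*wc))
print(w_hat, c_hat)   # recovers the generating values 0.8 and 2.0
```

Because the estimate uses only this one observer's (here, 12) responses, it is exactly the kind of no-pooling estimate that the hierarchical posterior modes in Figure 13.8 are contrasted with.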

294 new directions

[Figure 13.8 panel values: w_33 mode = 1, 95% HDI [0.821, 1]; w_16 mode = 0.999, 95% HDI [0.837, 1]; w_12 mode = 0.00012, 95% HDI [1.76e−53, 0.172].]

Fig. 13.8 Posterior of attention weights w_i of three individual observers. The vertical mark on the HDI indicates the MLE of the attention weight based on the individual's data only. Observer 33 is a high-symptom woman, whose estimate is shrunk upward (toward one). Observers 16 and 12 are both low-symptom women, whose estimates are shrunken in different directions (upward for 16, downward for 12).
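The outward shrinkage visible in Figure 13.8 can be mimicked with a deliberately simplified stand-in for the full model: treat each observer's "X" responses as a binomial rate and combine them with a fixed, bimodal Beta(0.5, 0.5) group-level distribution. The counts and the fixed prior below are hypothetical; in the real analysis the group-level beta distribution is itself estimated from all observers jointly.

```python
def posterior_mode(k, n, a, b):
    """Mode of the Beta(k + a, n - k + b) posterior for a binomial rate.
    Interior-mode formula, valid here because both updated shape
    parameters exceed 1 for the counts used below."""
    a_post, b_post = k + a, n - k + b
    return (a_post - 1) / (a_post + b_post - 2)

# Bimodal group-level distribution: Beta(0.5, 0.5) has shape parameters
# below 1, hence modes at 0 and 1, like the estimated distribution for
# the low-symptom women described in the text.
A, B = 0.5, 0.5

for k, n in [(13, 14), (1, 14)]:
    mle = k / n                        # data-only estimate (no pooling)
    post = posterior_mode(k, n, A, B)  # estimate combined with group prior
    print(f"MLE = {mle:.3f}  posterior mode = {post:.3f}")
```

An observer at 13/14 moves from 0.929 up toward 1, and an observer at 1/14 moves from 0.071 down toward 0: with a U-shaped group distribution, estimates are pulled away from the middle and toward the modes, mirroring the outward shrinkage of observers 16 and 12.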

Figure 13.8 illustrates that in hierarchical models, data from one individual influence inferences about the parameters of the other individuals. Technically, this happens because each woman's data influence the group-level parameters, which affect all the individual-level parameter estimates. For example, the hierarchical Bayesian modal estimate of the attention weight for observer 33, a high-symptom woman, is 1, which is larger than the nonhierarchical MLE of 0.89. This shrinkage in the hierarchical estimate is caused by the fact that most other high-symptom women tend to have relatively high attention weights, thereby pulling up the group-level estimate of the attention weight and the estimates for each individual high-symptom woman. Shrinkage also occurs for the low-symptom women, shown in the other panels of Figure 13.8. The second panel shows that for observer 16, the hierarchical Bayesian modal estimate of the attention weight is 1, which is higher than the nonhierarchical estimate of 0.93. For observer 12, however, shrinkage is in the opposite direction: the hierarchical Bayesian modal estimate of the attention weight is smaller than the MLE based on individual data (0 vs. 0.07). These opposite directions in shrinkage of the estimates are caused by the fact that the overarching beta distribution for low-symptom women is bimodal (i.e., has shape parameters less than 1.0), with one mode near 0 and a second mode near 1, indicating that low-symptom women tend to have either a low attention weight or a high attention weight. This bimodality is evident in the data and is not merely an artifact of the model, insofar as many women classify as if paying most attention to either one dimension or the other. Women with MLEs close to 0 have hierarchical Bayesian estimates even closer to 0, whereas women with MLEs close to 1 have hierarchical Bayesian estimates even closer to 1. Shrinkage for the low-symptom women thus highlights that shrinkage is not necessarily inward, toward the middle of the higher-level distribution; it can also be outward, and always toward the modes.

Model Comparison as a Case of Estimation in Hierarchical Models
In the examples discussed earlier, Bayesian estimation was the reallocation of credibility across the space of parameter values, for continuous parameters. We can think of each distinct parameter value (or joint combination of values in a multiparameter space) as a distinct model of the data. Because the parameter values are on a continuum, there is a continuum of models. Under this conceptualization, Bayesian parameter estimation is model comparison for an infinity of models.

Often, however, people may think of different models as being distinct, discrete descriptions, not on a continuum. This conceptualization of models makes little difference from a Bayesian perspective. When models are discrete, there is still a parameter that relates them to each other, namely an indexical parameter that has value 1 for the first model, 2 for the second model, and so on. Bayesian model comparison is then Bayesian estimation, as the reallocation of credibility across the values of the indexical parameter. The posterior probabilities of the models are simply the posterior probabilities of the indexical parameter values. Bayesian inference operates, mathematically, the same way regardless of whether the parameter that relates models is continuous or discrete.

Figure 13.9 shows a hierarchical diagram for comparing two models. At the bottom of the

diagram are the data, D. Scanning up the diagram, the data are distributed according to the likelihood function p(D|θ, m = 1) when the model index m is 1. The likelihood function for model 1 involves a parameter θ, which has a prior distribution specified by p(θ|m = 1). All the terms involving the parameter θ are enclosed in a dashed box, which indicates the part of the overall hierarchy that depends on the higher-level indexical parameter, m, having value m = 1. Notice, in particular, that the prior on the parameter θ is an essential component of the model; that is, the model is not only the likelihood function but also the prior. When m = 2, the data are distributed according to the model on the right of the diagram, involving a likelihood function and prior with parameter φ. At the top of the hierarchy is a categorical distribution that specifies the prior probability of each indexical value of m, that is, the prior probability of each model as a discrete entity. This hierarchical diagram is analogous to the previous hierarchical diagrams in Figures 13.1 and 13.6, but the top-level distribution is discrete, and lower-level parameters and structures can change discretely instead of continuously when the top-level parameter value changes.

Fig. 13.9 Model comparison as hierarchical modeling. Each dashed box encloses a discrete model of the data, and the models depend on a higher-level indexical parameter at the top of the diagram. See text for further details.

The sort of hierarchical structure diagrammed in Figure 13.9 can be implemented in the same MCMC sampling software we used for the baseball and categorization examples earlier. The MCMC algorithm generates representative values of the indexical parameter m, together with representative values of the parameter θ (when m = 1) and the parameter φ (when m = 2). The posterior probability of each model is approximated accurately by the proportion of steps that the MCMC chain visited each value of m. For a hands-on introduction to MCMC methods for Bayesian model comparison, see Chapter 10 of Kruschke (2015) and Lodewyckx et al. (2011). Examples of Bayesian model comparison are also provided by Vandekerckhove, Matzke, and Wagenmakers in chapter 14, this volume.

When comparing models, it is crucially important to set the prior distributions within each model appropriately, because the estimation of the model index can be very sensitive to the choice of prior. In the context of Figure 13.9, we mean that it is crucial to set the prior distributions, p(θ|m = 1) and p(φ|m = 2), so that they accurately express the priors intended for each model. Otherwise it is trivially easy to favor one model over the other, perhaps inadvertently, by setting one prior to values that accommodate the data well while setting the other prior to values that do not accommodate the data well. If each model comes with a theory or previous research that specifically informs the model, then that theory or research should be used to set the prior for the model. Otherwise, the use of generic default priors can unwittingly favor one model over the other. When there are not strong theories or previous research to set priors for each model, a useful approach for setting priors is as follows: Start each model with vague default priors. Then, using some modest amount of data that represent consensually accepted previous findings, update all models with those data. The resulting posterior distributions in each model are then used as the priors for the model comparison, using the new data. The priors, by being modestly informed, have mitigated the arbitrary influence of inappropriate default priors, and have set the models on more equal playing fields by being informed by the same prior data. These and other issues are discussed in the context of cognitive model comparison by Vanpaemel (2010) and Vanpaemel and Lee (2012).

A specific case of the hierarchical structure in Figure 13.9 occurs when the two models have the same likelihood function, and hence the same parameters, but different prior distributions. In this case, the model comparison is really a comparison of two competing choices of prior distribution for the parameters. A common application for this specific

case is null hypothesis testing. The null hypothesis is expressed as a prior distribution with all its mass at a single value of the parameter, namely the "null" value, such as θ = 0. If drawn graphically, the prior distribution would look like a spike-shaped distribution. The alternative hypothesis is expressed as a prior distribution that spreads credibility over a broad range of the parameter space. If drawn graphically, the alternative prior might resemble a thin (i.e., short) slab-shaped distribution. Model comparison then amounts to computing the posterior probabilities of the spike-shaped (null) prior and the slab-shaped (alternative) prior. This approach to null-hypothesis assessment depends crucially on the meaningfulness of the chosen alternative-hypothesis prior, because the posterior probability of the null-hypothesis prior is not absolute but merely relative to the chosen alternative-hypothesis prior. The relative probability of the null-hypothesis prior can change dramatically for different choices of the alternative-hypothesis prior. Because of this sensitivity to the alternative-hypothesis prior, we recommend that this approach to null-hypothesis assessment be used with caution, and only when it is clearly meaningful to entertain the possibility that the null value could be true and a justifiable alternative-hypothesis prior is available. In such cases, the prior-comparison approach can be very useful. However, in the absence of such meaningful priors, null-value assessment most safely proceeds by explicit estimation of parameter values within a single model, with decisions about null values made according to the HDI and ROPE as exemplified earlier. For discussion of Bayesian approaches to null-value assessment, see, for example, Kruschke (2011), Kruschke (2013, Appendix D), Morey and Rouder (2011), Wagenmakers (2007), and Wetzels, Raaijmakers, Jakab, and Wagenmakers (2009).

Conclusion
In this chapter we discussed two examples of hierarchical Bayesian estimation. The baseball example (Figure 13.1) illustrated multiple levels with shrinkage of estimates within each level. We chose this example because it clearly illustrates the effects of hierarchical structure in rational inference of individual and group parameters. The categorization example (Figure 13.6) illustrated the use of hierarchical Bayesian estimation for psychometric assessment via a cognitive model. The parameters are meaningful in the context of the cognitive model, and Bayesian estimation provides a complete posterior distribution of credible parameter values for individuals and groups. Other examples of hierarchical Bayesian estimation can be found, for instance, in articles by Bartlema, Lee, Wetzels, and Vanpaemel (2014), Lee (2011), Rouder and Lu (2005), Rouder, Lu, Speckman, Sun, and Jiang (2005), and Shiffrin, Lee, Kim, and Wagenmakers (2008).

The hierarchical Bayesian method is very attractive because it allows the analyst to define meaningfully structured models that are appropriate for the data. For example, there is no artificial dilemma of deciding between doing separate individual analyses or collapsing across all individuals, both of which have serious shortcomings (Cohen, Sanborn, & Shiffrin, 2008). When collapsing the data across participants in each group, it is implicitly assumed that all participants within a group behave identically. Such an assumption is often untenable. The other extreme of analyzing every individual separately, with no pooling across individuals, can be highly error prone, especially when each participant contributed only small amounts of data. A hierarchical analysis provides a middle ground between these two strategies, by acknowledging that people are different, without ignoring the fact that they represent a common group or condition. The hierarchical structure allows information provided by one participant to flow rationally to the estimates of other participants. This sharing of information across participants via hierarchical structure occurs in both the classification and baseball examples of this chapter.

A second key attraction of hierarchical Bayesian estimation is that software for expressing complex, nonlinear hierarchical models (e.g., Lunn et al., 2000; Plummer, 2003; Stan Development Team, 2012) produces a complete posterior distribution for direct inferences about credible parameter values, without need for p values or corrections for multiple comparisons. The combination of ease of defining specifically appropriate models and ease of direct inference from the posterior distribution makes hierarchical Bayesian estimation an extremely useful approach to modeling and data analysis.

Acknowledgments
The authors gratefully acknowledge Rick Viken and Teresa Treat for providing data from Viken, Treat, Nosofsky, McFall, and Palmeri (2002). Appreciation is also extended to E.-J.

bayesian estimation in hierarchical models 297

Wagenmakers and two anonymous reviewers who provided helpful comments that improved the presentation. Correspondence can be addressed to John K. Kruschke, Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington IN 47405-7007, or via electronic mail to [email protected]. Supplementary information about Bayesian data analysis can be found at http://www.indiana.edu/∼kruschke/

Notes
1. The most general definition of a confidence interval is the range of parameter values that would not be rejected according to a criterion p value, such as p < 0.05. These limits depend on the arbitrary settings of other parameters, and can be difficult to compute.
2. Data retrieved December 22, 2012 from http://www.baseball-reference.com/leagues/MLB/2012-standard-batting.shtml
3. This analysis was summarized at http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multi-level-hierarchical.html
4. In the context of a normal distribution, instead of a beta distribution, the "precision" is the reciprocal of variance. Intuitively, it refers to the narrowness of the distribution for either the normal or beta distributions.

Glossary
Hierarchical model: A formal model that can be expressed such that one parameter is dependent on another parameter. Many models can be meaningfully factored this way, for example when there are parameters that describe data from individuals, and the individual-level parameters depend on group-level parameters.
Highest density interval (HDI): The highest density interval summarizes the interval under a probability distribution where the probability densities inside the interval are higher than the probability densities outside the interval. A 95% HDI includes the 95% of the distribution with the highest probability density.
Markov chain Monte Carlo (MCMC): A class of stochastic algorithms for obtaining samples from a probability distribution. The algorithms take a random walk through parameter space, favoring values that have higher probability. With a sufficient number of steps, the values of the parameter are visited in proportion to their probabilities, and therefore the samples can be used to approximate the distribution. Widely used examples of MCMC are the Gibbs sampler and the Metropolis-Hastings algorithm.
Posterior distribution: A probability distribution over parameters derived via Bayes' rule from the prior distribution by taking into account the targeted data.
Prior distribution: A probability distribution over parameters representing the beliefs, knowledge, or assumptions about the parameters without reference to the targeted data. The prior distribution and the likelihood function together define a model.
Region of practical equivalence (ROPE): An interval around a parameter value that is considered to be equivalent to that value for practical purposes. The ROPE is used as part of a decision rule for accepting or rejecting particular parameter values.

References
Bartlema, A., Lee, M. D., Wetzels, R., & Vanpaemel, W. (2014). A Bayesian hierarchical mixture approach to individual differences: Case studies in selective attention and representation in category learning. Journal of Mathematical Psychology, 59, 132–150.
Batchelder, W. H. (1998). Multinomial processing tree models and psychological assessment. Psychological Assessment, 10, 331–344.
Bayes, T., & Price, R. (1763). An essay towards solving a problem in the doctrine of chances. By the Late Rev. Mr. Bayes, F.R.S. Communicated by Mr. Price, in a Letter to John Canton, A.M.F.R.S. Philosophical Transactions, 53, 370–418. doi: 10.1098/rstl.1763.0053
Carlin, B. P., & Louis, T. A. (2009). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: CRC Press.
Cohen, A. L., Sanborn, A. N., & Shiffrin, R. M. (2008). Model evaluation using grouped or individual data. Psychonomic Bulletin & Review, 15, 692–712.
Denwood, M. J. (2013). runjags: An R package providing interface utilities, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, (in review). http://cran.r-project.org/web/packages/runjags/
Doyle, A. C. (1890). The sign of four. London, England: Spencer Blackett.
Freedman, L. S., Lowe, D., & Macaskill, P. (1984). Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.
Hobbs, B. P., & Carlin, B. P. (2008). Practical Bayesian design and analysis for drug and device clinical trials. Journal of Biopharmaceutical Statistics, 18(1), 54–80.
Kruschke, J. K. (2008). Models of categorization. In R. Sun (Ed.), The Cambridge Handbook of Computational Psychology (pp. 267–301). New York, NY: Cambridge University Press.
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. doi: 10.1037/a0029146
Kruschke, J. K. (2015). Doing Bayesian data analysis, second edition: A tutorial with R, JAGS, and Stan. Waltham, MA: Academic Press/Elsevier.
Lee, M. D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.
Lodewyckx, T., Kim, W., Lee, M. D., Tuerlinckx, F., Kuppens, P., & Wagenmakers, E. J. (2011). A tutorial on Bayes factor estimation with the product space method. Journal of Mathematical Psychology, 55(5), 331–347.
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2013). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: CRC Press.
Lunn, D., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4), 325–337.
Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4), 406–419.
Nosofsky, R. M. (1987). Attention and learning processes in the identification and categorization of integral stimuli. Journal of Experimental Psychology: Learning, Memory and Cognition, 13, 87–108.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Vienna, Austria.
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382–407.
Riefer, D. M., Knapp, B. R., Batchelder, W. H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: Assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14, 184–201.
Rouder, J. N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12(4), 573–604.
Rouder, J. N., Lu, J., Speckman, P., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin & Review, 12(2), 195–223.
Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40(1), 73–83.
Serlin, R. C., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 199–228). Hillsdale, NJ: Erlbaum.
Shiffrin, R. M., Lee, M. D., Kim, W., & Wagenmakers, E. J. (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32(8), 1248–1284.
Smith, M. C., & Thelen, M. H. (1984). Development and validation of a test for bulimia. Journal of Consulting and Clinical Psychology, 52, 863–872.
Spiegelhalter, D. J., Freedman, L. S., & Parmar, M. K. B. (1994). Bayesian approaches to randomized trials. Journal of the Royal Statistical Society, Series A, 157, 357–416.
Stan Development Team. (2012). Stan: A C++ library for probability and sampling, version 1.1. Retrieved from http://mc-stan.org/citations.html
Vanpaemel, W. (2009). BayesGCM: Software for Bayesian inference with the generalized context model. Behavior Research Methods, 41(4), 1111–1120.
Vanpaemel, W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54, 491–498.
Vanpaemel, W., & Lee, M. D. (2012). Using priors to formalize theory: Optimal attention and the generalized context model. Psychonomic Bulletin & Review, 19, 1047–1056.
Viken, R. J., Treat, T. A., Nosofsky, R. M., McFall, R. M., & Palmeri, T. J. (2002). Modeling individual differences in perceptual and attentional processes related to bulimic symptoms. Journal of Abnormal Psychology, 111, 598–609.
Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.
Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E. J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review, 16(4), 752–760.

CHAPTER 14
Model Comparison and the Principle of Parsimony

Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers

Abstract

According to the principle of parsimony, model selection methods should value both descriptive accuracy and simplicity. Here we focus primarily on Bayes factors and minimum description length, explaining how these procedures strike a balance between goodness-of-fit and parsimony. Throughout, we demonstrate the methods with an application on false memory, evaluating three competing multinomial processing tree models of interference in memory.

Key Words: model selection, goodness of fit, parsimony, inference, Akaike's information criterion (AIC), Bayesian information criterion (BIC), minimum description length (MDL), Bayes factor (BF), Akaike weights, Jeffreys weights, Rissanen weights, Schwarz weights

Introduction

At its core, the study of psychology is concerned with the discovery of plausible explanations for human behavior. For instance, one may observe that "practice makes perfect": as people become more familiar with a task, they tend to execute it more quickly and with fewer errors. More interesting is the observation that practice tends to improve performance such that most of the benefit is accrued early on, a pattern of diminishing returns that is well described by a power law (Logan, 1988; but see Heathcote, Brown, & Mewhort, 2000). This pattern occurs across so many different tasks (e.g., cigar rolling, maze solving, fact retrieval, and a variety of standard psychological tasks) that it is known as the "power law of practice." Consider, for instance, the lexical decision task, a task in which participants have to decide quickly whether a letter string is an existing word (e.g., sunscreen) or not (e.g., tolphin). When repeatedly presented with the same stimuli,1 participants show a power law decrease in their mean response latencies; in fact, they show a power law decrease in the entire response time distribution, that is, both the fast responses and the slow responses speed up with practice according to a power law (Logan, 1992).

The observation that practice makes perfect is trivial, but the finding that practice-induced improvement follows a general law is not. Nevertheless, the power law of practice only provides a descriptive summary of the data and does not explain the reasons that practice should result in a power law improvement in performance. In order to go beyond direct observation and statistical summary, it is necessary to bridge the divide between observed performance on the one hand and the pertinent psychological processes on the other. Such bridges are built from a coherent set of assumptions about the underlying cognitive processes—a theory. Ideally, substantive psychological theories are formalized as quantitative models (Busemeyer & Diederich, 2010; Lewandowsky & Farrell, 2010). For example, the power law of practice has been explained by instance theory (Logan, 1992, 2002). Instance theory stipulates that earlier experiences are stored in memory as individual traces or instances; upon presentation of a stimulus, these instances race to be retrieved, and the winner of the race initiates

a response. Mathematical analysis shows that, as instances are added to memory, the finishing time of the winning instance decreases as a power function. Hence, instance theory provides a simple and general explanation of the power law of practice.

For all its elegance and generality, instance theory has not been the last word on the power law of practice. The main reason is that single phenomena often afford different competing explanations. For example, the effects of practice can also be accounted for by Rickard's component power laws model (Rickard, 1997), Anderson's ACT-R model (Anderson, 2004), Cohen et al.'s PDP model (Cohen, Dunbar, & McClelland, 1990), Ratcliff's diffusion model (Dutilh, Vandekerckhove, Tuerlinckx, & Wagenmakers, 2009; Ratcliff, 1978), or Brown and Heathcote's linear ballistic accumulator model (Brown & Heathcote, 2005, 2008; Heathcote & Hayes, 2012). When various models provide competing accounts of the same data set, it can be difficult to choose between them. The process of choosing between models is called model comparison, model selection, or hypothesis testing, and it is the focus of this chapter.

A careful model comparison procedure includes both qualitative and quantitative elements. Important qualitative elements include the plausibility, parsimony, and coherence of the underlying assumptions, the consistency with known behavioral phenomena, the ability to explain rather than describe data, and the extent to which model predictions can be falsified through experiments. Here we ignore these important aspects and focus solely on the quantitative elements. The single most important quantitative element of model comparison relates to the ubiquitous trade-off between parsimony and goodness-of-fit (Pitt & Myung, 2002). The motivating insight is that the appeal of an excellent fit to the data (i.e., high descriptive adequacy) needs to be tempered to the extent that the fit was achieved with a highly complex and powerful model (i.e., low parsimony).

The topic of quantitative model comparison is as important as it is challenging; fortunately, the topic has received—and continues to receive—considerable attention in the field of statistics, and the results of those efforts have been made accessible to psychologists through a series of recent special issues, books, and articles (e.g., Grünwald, 2007; Myung et al., 2000; Pitt & Myung, 2002; Wagenmakers & Waldorp, 2006). Here we discuss several procedures for model comparison, with an emphasis on minimum description length and the Bayes factor. Both procedures entail principled and general solutions to the trade-off between parsimony and goodness of fit.

The outline of this chapter is as follows. The first section describes the principle of parsimony and the unavoidable trade-off with goodness-of-fit. The second section summarizes the research of Wagenaar and Boer (1987), who carried out an experiment to compare three competing multinomial processing tree models (MPTs; Batchelder & Riefer, 1980); this model comparison exercise is used as a running example throughout the chapter. The third section outlines different methods for model comparison and applies them to Wagenaar and Boer's (1987) MPT models. We focus on two popular information criteria, the AIC and the BIC, on the Fisher information approximation of the minimum description length principle, and on Bayes factors as obtained from importance sampling. The fourth section contains conclusions and take-home messages.

The Principle of Parsimony

Throughout history, prominent philosophers and scientists have stressed the importance of parsimony. For instance, in the Almagest—a famous second-century book on astronomy—Ptolemy writes: "We consider it a good principle to explain the phenomena by the simplest hypotheses that can be established, provided this does not contradict the data in an important way." Ptolemy's principle of parsimony is widely known as Occam's razor (see Box 1); the principle is intuitive as it puts a premium on elegance. In addition, most people feel naturally attracted to models and explanations that are easy to understand and communicate. Moreover, the principle also gives grounds to reject propositions that are without empirical support, including extrasensory perception, alien abductions, or mysticism. In an apocryphal interaction, Napoleon Bonaparte asked Pierre-Simon Laplace why the latter's book on the universe did not mention its creator, only to receive the curt reply "I had no need of that hypothesis."

However, the principle of parsimony finds its main motivation in the benefits that it bestows on those who use models for prediction. To see this, note that empirical data are assumed to be composed of a structural, replicable part and an idiosyncratic, nonreplicable part. The former is known as the signal, and the latter is known as the noise (Silver, 2012). Models that capture all

the signal and none of the noise provide the best possible predictions to unseen data from the same source. Overly simplistic models, however, fail to capture part of the signal; these models underfit the data and provide poor predictions. Overly complex models, on the other hand, mistake some of the noise for actual signal; these models overfit the data and again provide poor predictions. Thus, parsimony is essential because it helps discriminate the signal from the noise, allowing better prediction and generalization to new data. Note that parsimony plays a role even in classical null-hypothesis significance testing, where the simpler null hypothesis is retained unless the data provide sufficient grounds for its rejection.

Box 1. Occam's razor
Occam's razor (sometimes Ockham's) is named after the English philosopher and Franciscan friar Father William of Occam (c. 1288–c. 1348), who wrote "Numquam ponenda est pluralitas sine necessitate" (plurality must never be posited without necessity) and "Frustra fit per plura quod potest fieri per pauciora" (it is futile to do with more what can be done with less). Occam's metaphorical razor symbolizes the principle of parsimony: by cutting away needless complexity, the razor leaves only theories, models, and hypotheses that are as simple as possible without being false.
Throughout the centuries, many other scholars have espoused the principle of parsimony; the list predating Occam includes Aristotle, Ptolemy, and Thomas Aquinas ("it is superfluous to suppose that what can be accounted for by a few principles has been produced by many"), and the list following Occam includes Isaac Newton ("We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, so far as possible, assign the same causes."), Bertrand Russell, Albert Einstein ("Everything should be made as simple as possible, but no simpler"), and many others.
In the field of statistical reasoning and inference, Occam's razor forms the foundation for the principle of minimum description length (Grünwald, 2000, 2007). In addition, Occam's razor is automatically accommodated through Bayes factor model comparisons (e.g., Jefferys & Berger, 1992; Jeffreys, 1961; MacKay, 2003). Both minimum description length and Bayes factors feature prominently in this chapter as principled methods to quantify the trade-off between parsimony and goodness of fit.

Goodness of fit
"From the earliest days of statistics, statisticians have begun their analysis by proposing a distribution for their observations and then, perhaps with somewhat less enthusiasm, have checked on whether this distribution is true. Thus over the years a vast number of test procedures have appeared, and the study of these procedures has come to be known as goodness-of-fit" (D'Agostino & Stephens, 1986, p. v).

The goodness of fit of a model is a quantity that expresses how well the model is able to account for a given set of observations. It addresses the following question: Under the assumption that a certain model is a true characterization of the population from which we have obtained a sample, and given the best-fitting parameter estimates for that model, how well does our sample of data agree with that model?

Various ways of quantifying goodness of fit exist. One common expression involves a Euclidean distance metric between the data and the model's best prediction (the least squared error or LSE metric is the most well known of these). Another measure involves the likelihood function, which expresses the likelihood of observing the data under the model, and is maximized by the best-fitting parameter estimates (Myung, 2000).

Parsimony
Goodness of fit must be balanced against model complexity in order to avoid overfitting—that is, to avoid building models that explain the data at hand well, but fail in out-of-sample predictions. The principle of parsimony forces researchers to abandon complex models that are tweaked to the observed data in favor of simpler models that can generalize to new data sets.

A common example is that of polynomial regression. Figure 14.1 gives a typical example. The observed data are the circles in both the left and right panels. Crosses indicate unobserved, out-of-sample data points to which the model should generalize. In the left panel, a quadratic function is fit to the 8 observed data points, whereas

Fig. 14.1 [Panels (a) and (b) plot a criterion against a predictor, showing visible data, hidden (out-of-sample) data, and the best-fitting 2nd-degree (left) and 7th-degree (right) polynomial.] A polynomial regression of degree d is characterized by ŷ = Σ_{i=0}^{d} a_i x^i. This model has d + 1 free parameters a_i; hence, in the right panel, a polynomial of degree 7 perfectly accounts for the 8 visible data points. This 7th-order polynomial, however, accounts poorly for the out-of-sample data points.

the right panel shows a 7th-order polynomial function fitted to the same data. Since a polynomial of degree 7 can be made to contain any 8 points in the plane, the observed data are perfectly captured by the best-fitting polynomial. However, it is clear that this function generalizes poorly to the unobserved samples, and it shows undesirable behavior for larger values of x.

In sum, an adequate model comparison method needs to discount goodness of fit with model complexity. But how exactly can this be accomplished? As we will describe shortly, several model comparison methods are currently in vogue, all resulting from principled ideas on how to obtain measures of generalizability,2 meaning that these methods attempt to quantify the extent to which a model predicts unseen data from the same source (cf. Figure 14.1). Before outlining the details of various model-comparison methods, we now introduce a data set that serves as a working example throughout the remainder of the chapter.

Example: Competing Models of Interference in Memory

For an example model comparison scenario, we revisit a study by Wagenaar and Boer (1987) on the effect of misleading information on the recollection of an earlier event. The effect of misleading postevent information was first studied systematically by Loftus, Miller, and Burns (1978); for a review of the relevant literature, see Wagenaar and Boer (1987) and references therein.

Wagenaar and Boer (1987) proposed three competing theoretical accounts of the effect of misleading postevent information. To evaluate the three accounts, Wagenaar and Boer set up an experiment and introduced three quantitative models that translate each of the theoretical accounts into a set of parametric assumptions that together give rise to a probability density over the data, given the parameters.

Abstract Accounts

Wagenaar and Boer (1987) outlined three competing theoretical accounts of the effect of misleading postevent information on memory. Loftus's destructive updating model (DUM) posits that the conflicting information replaces and destroys the original memory. A coexistence model (CXM) asserts that an inhibition mechanism suppresses the original memory, which nonetheless remains viable though temporarily inaccessible. Finally, a no-conflict model (NCM) simply states that misleading postevent information is ignored, except when the original information was not encoded or already forgotten.

Experimental Design

The experiment by Wagenaar and Boer (1987) proceeded as follows. In Phase I, a total of 562 participants were shown a sequence of events in the form of a pictorial story involving a pedestrian-car collision. One picture in the story would show a car

Fig. 14.2 A pair of pictures (a, b) from the third phase (i.e., the recognition test) of Wagenaar and Boer (1987), reprinted with permission, containing the critical episode at the intersection. After Wagenaar and Boer (1987), Acta Psychologica © Elsevier Inc.

at an intersection, and a traffic light that was either red, yellow, or green. In Phase II, participants were asked a set of test questions with (potentially) conflicting information: Participants might be asked whether they remembered a pedestrian crossing the road when the car approached the "traffic light" (in the consistent group), the "stop sign" (in the inconsistent group), or the "intersection" (in the neutral group). Then, in Phase III, participants were given a recognition test about elements of the story using picture pairs. Each pair would contain one picture from Phase I and one slightly altered version of the original picture. Participants were then asked to identify which of the pair had featured in the original story. A picture pair is shown in Figure 14.2, where the intersection is depicted with either a traffic light or a stop sign. Finally, in Phase IV, participants were informed that the correct choice in Phase III was the picture with the traffic light, and were then asked to recall the color of the traffic light. By design, this experiment should yield different response patterns depending on whether the conflicting postevent information destroys the original information (destructive-updating model), only suppresses it temporarily (coexistence model), or does not affect the original information unless it is unavailable (no-conflict model).

Concrete Models

Wagenaar and Boer (1987) developed a series of MPT models (see Box 2) to quantify the predictions of the three competing theoretical accounts. Figure 14.3 depicts the no-conflict MPT model in the inconsistent condition. The figure is essentially a decision tree that is navigated from left to right. In Phase I of the collision narrative, the traffic light is encoded with probability p, and if so, the color is encoded with probability c. In Phase II, the stop sign is encoded with probability q. In Phase III, the answer may be known or may be guessed correctly with probability 1/2, and in Phase IV the answer may be known or may be guessed correctly with probability 1/3. The probability of each path is given by the product of all the encountered probabilities, and the total probability of a response pattern is the summed probability of all branches that lead to it. For example, the total probability of getting both questions wrong is (1 − p) × q × 2/3 + (1 − p) × (1 − q) × 1/2 × 2/3. We would then, under the no-conflict model, expect that proportion of participants to fall in the response pattern with two errors.

The destructive updating model (Figure 2 in Wagenaar & Boer, 1987) extends the three-parameter no-conflict model by adding a fourth parameter d: the probability of destroying the traffic-light information, which may occur whenever the stop sign was encoded. The coexistence model (Figure 3 in Wagenaar & Boer, 1987), on the other hand, posits an extra probability s that the traffic light is suppressed (but not destroyed) when the stop sign is encoded. A critical difference between the latter two is that a destruction step will lead to chance accuracy in Phase IV if every piece of information was encoded, whereas a suppression step will not affect the underlying memory and will lead to accurate responding. Note here that, if s = 0, the coexistence model reduces to the no-conflict model, as does the destructive-updating model with d = 0. The models only make different predictions in the inconsistent condition, so that, for the consistent and neutral conditions, the trees are identical.
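The branch arithmetic just described is easy to check numerically. The following sketch is our own illustration, not code from Wagenaar and Boer (1987); the function name is ours, and the Table 14.1 estimates are used only as example inputs:

```python
def p_two_errors(p, q):
    """Total probability of getting both test questions wrong under the
    no-conflict model in the inconsistent condition.

    Two branches lead to this response pattern:
      * traffic light not encoded (1 - p) but stop sign encoded (q):
        the Phase III answer is wrong outright, and the Phase IV color
        guess is wrong with probability 2/3;
      * neither encoded ((1 - p) * (1 - q)): the Phase III guess is
        wrong with probability 1/2, the Phase IV guess with 2/3.
    """
    return (1 - p) * q * (2 / 3) + (1 - p) * (1 - q) * (1 / 2) * (2 / 3)

# Example with the no-conflict point estimates (p = .50, q = .50):
print(p_two_errors(0.50, 0.50))  # ≈ 0.25
```

Under those estimates, roughly one in four participants is expected to land in the two-errors cell.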

Fig. 14.3 Multinomial-processing tree representation of the inconsistent condition according to the no-conflict model, adapted from Wagenaar and Boer (1987). [The tree branches on "Traffic light encoded?" (p), "Color encoded?" (c), and "Stop sign encoded?" (q), with guessing probabilities 1/2 ("Traffic light guessed?", Phase III) and 1/3 ("Color guessed?", Phase IV) on the branches where the answer is not known.] After Wagenaar and Boer (1987), Acta Psychologica © Elsevier Inc.
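Because each response pattern's probability is a sum of products along branches, a tree like that of Figure 14.3 can be enumerated in a few lines. The sketch below is our own illustration of that bookkeeping for the no-conflict model in the inconsistent condition (the function name is ours, and branches that differ only in stop-sign encoding while sharing the same outcomes are collapsed):

```python
def ncm_pattern_probs(p, c, q):
    """Probabilities of the four (Phase III correct?, Phase IV correct?)
    response patterns. The traffic light is encoded with probability p
    (its color with probability c, given that); the stop sign is encoded
    with probability q. Unknown answers are guessed correctly with
    probability 1/2 (Phase III) and 1/3 (Phase IV)."""
    patterns = {(a, b): 0.0 for a in (True, False) for b in (True, False)}

    def add(branch_prob, p3_correct, p4_correct):
        # Distribute one encoding branch over the four outcome patterns.
        for a in (True, False):
            for b in (True, False):
                w3 = p3_correct if a else 1 - p3_correct
                w4 = p4_correct if b else 1 - p4_correct
                patterns[(a, b)] += branch_prob * w3 * w4

    add(p * c, 1.0, 1.0)                  # light and color encoded: both known
    add(p * (1 - c), 1.0, 1 / 3)          # light encoded, color guessed
    add((1 - p) * q, 0.0, 1 / 3)          # stop sign recalled: Phase III wrong
    add((1 - p) * (1 - q), 1 / 2, 1 / 3)  # nothing encoded: guess both answers
    return patterns

probs = ncm_pattern_probs(0.50, 0.57, 0.50)    # NCM estimates from Table 14.1
assert abs(sum(probs.values()) - 1.0) < 1e-12  # patterns form a distribution
print(round(probs[(False, False)], 4))         # two-errors cell: 0.25
```

The two-errors cell reproduces the expression (1 − p) × q × 2/3 + (1 − p) × (1 − q) × 1/2 × 2/3 given in the text, and the four pattern probabilities sum to one, as a well-formed multinomial model requires.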

Box 2. Popularity of multinomial processing tree models
Multinomial processing tree models (Batchelder & Riefer, 1980; Chechile, 1973; Chechile & Meyer, 1976; Riefer & Batchelder, 1988) are psychological process models for categorical data. MPT models are used in two ways: as a psychometric tool to measure unobserved cognitive processes, and as a convenient formalization of competing psychological theories. Over time, MPTs have been applied to a wide range of psychological tasks and processes. For instance, MPT models are available for recognition, recall, source monitoring, perception, priming, reasoning, consensus analysis, the process dissociation procedure, implicit attitude measurement, and many other phenomena. For more information about MPTs, we recommend the review articles by Batchelder and Riefer (1999, 2007) and Erdfelder et al. (2009). The latter review article also discusses different software packages that can be used to fit MPT models. Necessarily missing from that list is the recently developed R package MPTinR (Singmann & Kellen, 2013), with which we have good experiences. As will become apparent throughout this chapter, however, our preferred method for fitting MPT models is Bayesian (Chechile & Meyer, 1976; Klauer, 2010; Lee & Wagenmakers, 2013; Matzke, Dolan, Batchelder, & Wagenmakers, in press; Rouder, Lu, Morey, Sun, & Speckman, 2008; Smith & Batchelder, 2010).

Previous Conclusions

After fitting the three competing MPT models, Wagenaar and Boer (1987) obtained the parameter estimates in Table 14.1. Using a χ2 model fit index, they concluded that "a distinction among the three model families appeared to be impossible in actual practice" (p. 304), after noting that the no-conflict model provides "an almost perfect fit" to the data. They propose, then, "to accept the most parsimonious model, which is the no-conflict model." In the remainder of this chapter, we re-examine

Table 14.1. Parameter point estimates from Wagenaar and Boer (1987).

                                      p      c      q      d      s
No-conflict model (NCM)              0.50   0.57   0.50    na     na
Destructive-updating model (DUM)     0.50   0.57   0.50   0.00    na
Coexistence model (CXM)              0.55   0.55   0.43    na    0.20

this conclusion using various model comparison methods.

Three Methods for Model Comparison
Many model comparison methods have been developed, all of them attempts to address the ubiquitous trade-off between parsimony and goodness of fit. Here we focus on three main classes of interrelated methods: (1) AIC and BIC, the most popular information criteria; (2) minimum description length; (3) Bayes factors. Below we provide a brief description of each method and then apply it to the model comparison problem that confronted Wagenaar and Boer (1987).

Information Criteria
Information criteria are among the most popular methods for model comparison. Their popularity is explained by the simple and transparent manner in which they quantify the trade-off between parsimony and goodness of fit. Consider for instance the oldest information criterion, AIC ("an information criterion"), proposed by Akaike (1973, 1974a):

    AIC = −2 ln p(y | θ̂) + 2k.    (1)

The first term, ln p(y | θ̂), is the log maximum likelihood that quantifies goodness of fit, where y is the data set and θ̂ the maximum-likelihood parameter estimate; the second term, 2k, is a penalty for model complexity, measured by the number of adjustable model parameters k. The AIC estimates the expected information loss incurred when a probability distribution f (associated with the true data-generating process) is approximated by a probability distribution g (associated with the model under evaluation). Hence, the model with the lowest AIC is the model with the smallest expected information loss between reality f and model g, where the discrepancy is quantified by the Kullback-Leibler divergence I(f, g), a distance metric between two probability distributions (for full details, see Burnham & Anderson, 2002). The AIC is unfortunately not consistent: as the number of observations grows infinitely large, AIC is not guaranteed to choose the true data-generating model. In fact, there is cause to believe that the AIC tends to select complex models that overfit the data (O'Hagan & Forster, 2004; for a discussion see Vrieze, 2012).

Another information criterion, the BIC ("Bayesian information criterion"), was proposed by Schwarz (1978):

    BIC = −2 ln p(y | θ̂) + k ln n.    (2)

Here, the penalty term is k ln n, where n is the number of observations.³ Hence, the BIC penalty for complexity increases with sample size, outweighing that of AIC as soon as n ≥ 8. The BIC was derived as an approximation of a Bayesian hypothesis test using default parameter priors (the "unit information prior"; see later for more information on Bayesian hypothesis testing, and see Raftery, 1995, for more information on the BIC). The BIC is consistent: as the number of observations grows infinitely large, BIC is guaranteed to choose the true data-generating model. Nevertheless, there is evidence that, in practical applications, the BIC tends to select simple models that underfit the data (Burnham & Anderson, 2002).

Now consider a set of candidate models, Mi, i = 1, ..., m, each with a specific IC (AIC or BIC) value. The model with the smallest IC value should be preferred, but the extent of this preference is not immediately apparent. For better interpretation we can calculate IC model weights (Akaike, 1974; Burnham & Anderson, 2002; Wagenmakers & Farrell, 2004). First, we compute, for each model i, the difference in IC with respect to the IC of the best candidate model:

    Δi = ICi − min IC.    (3)

This step is taken to increase numerical stability, but it also serves to emphasize the point that only differences in IC values are relevant.
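In code, Eqs. 1 and 2 amount to one line each. A minimal sketch (ours, not the chapter's), checked against the no-conflict model's entries in Table 14.2 (log likelihood −24.41, k = 3, n = 562 participants):

```python
import math

def aic(log_lik, k):
    # Eq. 1: AIC = -2 ln p(y | theta_hat) + 2k
    return -2 * log_lik + 2 * k

def bic(log_lik, k, n):
    # Eq. 2: BIC = -2 ln p(y | theta_hat) + k ln n
    return -2 * log_lik + k * math.log(n)

# No-conflict model, Table 14.2: log likelihood -24.41, k = 3, n = 562
print(round(aic(-24.41, 3), 2))       # 54.82
print(round(bic(-24.41, 3, 562), 2))  # ~67.81 (Table 14.2 reports 67.82,
                                      # computed from the unrounded log likelihood)
```

Note that BIC's ln n ≈ 6.33 penalty per parameter dwarfs AIC's constant 2, which is why the two criteria can rank the models differently.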

306 new directions

[Figure 14.4: two panels plotting subjective intensity (0 to 1) against objective intensity (0 to 3); left panel (Fechner's law) shows curves for k = 0.4, β = 1.0 and k = 0.7, β = 0.5; right panel (Stevens's law) shows curves for c = 0.1, b = 2.0 and c = 0.5, b = 0.5.]

Next we obtain the model weights by transforming back to the likelihood scale and normalizing:

    wi = exp(−Δi/2) / Σ_{m=1..M} exp(−Δm/2).    (4)

The resulting AIC and BIC weights are called Akaike weights and Schwarz weights, respectively. These weights not only convey the relative preference among a set of candidate models (i.e., they express the degree to which we should prefer one model from the set as superior) but also provide a method to combine predictions across multiple models using model averaging (Hoeting, Madigan, Raftery, & Volinsky, 1999).

Both AIC and BIC rely on an assessment of model complexity that is relatively crude, because it is determined entirely by the number of free parameters but not by the shape of the function through which they make contact with the data. To illustrate the importance of the functional form in which the parameters participate, consider the case of Fechner's law and Stevens's law of psychophysics. Both of these laws transform objective stimulus intensity to subjective experience through a two-parameter nonlinear function.⁴ According to Fechner's law, the perceived intensity W(I) of stimulus I is given by the logarithmic function k ln(I + β). Stevens's law describes perceived intensity as a power function: S(I) = c·I^b. Although both laws have the same number of parameters, Stevens's law is more complex because it can cover a larger number of data patterns (see Figure 14.4).

Fig. 14.4 Two representative instances of Fechner's law (left) and Stevens's law (right). Although Fechner's law is restricted to nonlinear functions that level off as stimulus intensity increases, Stevens's law can additionally take shapes with accelerating slopes.

application to multinomial-processing tree models
In order to apply AIC and BIC to the three competing MPTs proposed by Wagenaar and Boer (1987), we first need to compute the maximum log likelihood. Note that the MPT model parameters determine the predicted probabilities for the different response outcome categories (cf. Figure 14.3 and Box 3); these predicted probabilities are deterministic parameters of a multinomial probability density function. Hence, the maximum log likelihood parameter estimates for an MPT model produce multinomial parameters that maximize the probability of the observed data (i.e., the occurrence of the various outcome categories).

Several software packages exist that can help find the maximum log likelihood parameter estimates for MPTs (Singmann & Kellen, 2013). With these estimates in hand, we can compute the information criteria described in the previous section. Table 14.2 shows the maximum log likelihood as well as AIC, BIC, and their associated weights (wAIC and wBIC; from Eq. 4).

Interpreting wAIC and wBIC as measures of relative preference, we see that the results in Table 14.2 are mostly inconclusive. According to wAIC, the no-conflict model and coexistence model are virtually indistinguishable, though both are preferable to the destructive-updating model. According to wBIC, however, the no-conflict model should be preferred over both the destructive-updating model and the coexistence model. The extent of this preference is noticeable but not decisive.

Minimum Description Length
The minimum description length principle is based on the idea that statistical inference centers around capturing regularity in data; regularity, in turn, can be exploited to compress the data. Hence, the goal is to find the model that compresses the data the most (Grünwald, 2007). This is related to the concept of Kolmogorov complexity: for a sequence of numbers, Kolmogorov complexity is the length of the shortest program that prints that sequence and then halts (Grünwald, 2007). Although Kolmogorov complexity cannot be calculated, a suite of concrete methods is available based on the idea of model selection through

model comparison and the principle of parsimony 307

Table 14.2. AIC and BIC for the Wagenaar & Boer MPT models.

                                      log likelihood   k*   AIC     wAIC   BIC     wBIC
    No-conflict model (NCM)           −24.41           3    54.82   0.41   67.82   0.86
    Destructive-updating model (DUM)  −24.41           4    56.82   0.15   74.15   0.04
    Coexistence model (CXM)           −23.35           4    54.70   0.44   72.03   0.10

Note: k is the number of free parameters.
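The weight columns of Table 14.2 can be reproduced from its AIC and BIC columns via Eqs. 3 and 4. A short sketch (our illustration):

```python
import math

def ic_weights(ics):
    # Eqs. 3-4: subtract the minimum IC for numerical stability,
    # transform back to the likelihood scale, and normalize
    deltas = [ic - min(ics) for ic in ics]
    terms = [math.exp(-d / 2) for d in deltas]
    total = sum(terms)
    return [t / total for t in terms]

# AIC values for NCM, DUM, CXM from Table 14.2
w_aic = ic_weights([54.82, 56.82, 54.70])
print([round(w, 2) for w in w_aic])  # [0.41, 0.15, 0.44]

# BIC values for NCM, DUM, CXM from Table 14.2
w_bic = ic_weights([67.82, 74.15, 72.03])
print([round(w, 2) for w in w_bic])  # [0.86, 0.04, 0.10]
```

Because only the differences Δi enter Eq. 4, adding any constant to all IC values leaves the weights unchanged.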

data compression. These methods, most of them developed by Jorma Rissanen, fall under the general heading of minimum description length (MDL; Rissanen, 1978, 1987, 1996, 2001). In psychology, the MDL principle has been applied and promoted primarily by Grünwald (2000), Grünwald, Myung, and Pitt (2005), and Grünwald (2007), as well as Myung, Navarro, and Pitt (2006), Pitt and Myung (2002), and Pitt, Myung, and Zhang (2002).

Here we mention three versions of the MDL principle. First, there is the so-called crude two-part code (Grünwald, 2007); here, one sums the description of the model (in bits) and the description of the data encoded with the help of that model (in bits). The penalty for complex models is that they take many bits to describe, increasing the summed code length. Unfortunately, it can be difficult to define the number of bits required to describe a model.

Second, there is the Fisher information approximation (FIA; Pitt et al., 2002; Rissanen, 1996):

    FIA = −ln p(y | θ̂) + (k/2) ln(n/2π) + ln ∫ √(det I(θ)) dθ,    (5)

where I(θ) denotes the Fisher information matrix of sample size 1 (Ly, Verhagen, Grasman, & Wagenmakers, 2014). I(θ) is a k × k matrix whose (i, j)th element is

    I_{i,j}(θ) = E[ (∂ ln p(y | θ)/∂θi) (∂ ln p(y | θ)/∂θj) ],

where E(·) is the expectation operator. Note that FIA is similar to AIC and BIC in that it includes a first term that represents goodness of fit, and additional terms that represent a penalty for complexity. The second term resembles that of BIC, and the third term reflects a more sophisticated penalty that represents the number of distinguishable probability distributions that a model can generate (Pitt et al., 2002). Hence, FIA differs from AIC and BIC in that it also accounts for functional form complexity, not just complexity due to the number of free parameters. Note that FIA weights (or Rissanen weights) can be obtained by multiplying FIA by 2 and then applying Eqs. 3 and 4.

The third version of the MDL principle discussed here is normalized maximum likelihood (NML; Myung et al., 2006; Rissanen, 2001):

    NML = p(y | θ̂(y)) / ∫_X p(x | θ̂(x)) dx.    (6)

This equation shows that NML tempers the enthusiasm about a good fit to the observed data y (i.e., the numerator) to the extent that the model could also have provided a good fit to random data x (i.e., the denominator). NML is simple to state but can be difficult to compute; for instance, the denominator may be infinite, and this requires further measures to be taken (for details see Grünwald, 2007). Additionally, NML requires an integration over the entire set of possible data sets, which may be difficult to define as it depends on unknown decision processes in the researchers (Berger & Berry, 1988). Note that, because the computation of NML depends on the likelihood of data that might have occurred but did not, the procedure violates the likelihood principle, which states that all information about a parameter θ obtainable from an experiment is contained in the likelihood function for θ for the given y (Berger & Wolpert, 1988).

application to multinomial processing tree models
Using the parameter estimates from Table 14.1 and the code provided by Wu, Myung, and Batchelder (2010), we can compute the FIA for the three competing MPT models considered by Wagenaar and Boer (1987).⁵ Table 14.3 displays, for each model, the FIA along with its associated complexity measure (the other of its two constituent components, the maximum log likelihood, can be found in Table 14.2).

Table 14.3. Minimum description length values for the Wagenaar & Boer MPT models.

                                      Complexity   FIA     wFIA
    No-conflict model (NCM)           6.44         30.86   0.44
    Destructive-updating model (DUM)  7.39         31.80   0.17
    Coexistence model (CXM)           7.61         30.96   0.39
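Eq. 5 is hard to evaluate for most models, but for a single binomial rate θ it has a closed form: the Fisher information is I(θ) = 1/(θ(1 − θ)), and ∫₀¹ √I(θ) dθ = B(1/2, 1/2) = π. The sketch below illustrates the formula on this toy case (our example; Table 14.3 itself was computed with the Wu, Myung, & Batchelder code):

```python
import math

def fia_binomial(y, n):
    # Eq. 5 for a single binomial rate theta (k = 1):
    # FIA = -ln p(y | theta_hat) + (k/2) ln(n / (2 pi)) + ln(integral of sqrt(I))
    # where I(theta) = 1 / (theta (1 - theta)) and the integral of
    # 1 / sqrt(theta (1 - theta)) over [0, 1] equals pi.
    theta_hat = y / n  # maximum-likelihood estimate (assumes 0 < y < n)
    log_max_lik = (math.log(math.comb(n, y)) + y * math.log(theta_hat)
                   + (n - y) * math.log(1 - theta_hat))
    return -log_max_lik + 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

print(round(fia_binomial(7, 10), 3))  # 2.698
```

As with AIC and BIC, smaller FIA values are better; unlike them, the last term changes if the model is reparameterized into a functionally more flexible form.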

The conclusions from the MDL analysis mirror those from the AIC measure, expressing a slight disfavor for the destructive-updating model, and approximately equal preference for the no-conflict model versus the coexistence model.

Bayes Factors
In Bayesian model comparison, the posterior odds for models M1 and M2 are obtained by updating the prior odds with the diagnostic information from the data:

    p(M1 | y) / p(M2 | y) = [p(M1) / p(M2)] × [m(y | M1) / m(y | M2)].    (7)

Equation 7 shows that the change from prior odds p(M1)/p(M2) to posterior odds p(M1 | y)/p(M2 | y) is given by the ratio of marginal likelihoods m(y | M1)/m(y | M2) (see later for the definition of the marginal likelihood). This ratio is known as the Bayes factor (Jeffreys, 1961; Kass & Raftery, 1995). The log of the Bayes factor is often interpreted as the weight of evidence provided by the data (Good, 1985; for details see Berger & Pericchi, 1996; Bernardo & Smith, 1994; Gill, 2002; O'Hagan, 1995).

Thus, when the Bayes factor BF12 = m(y | M1)/m(y | M2) equals 5, the observed data y are 5 times more likely to occur under M1 than under M2; when BF12 equals 0.1, the observed data are 10 times more likely under M2 than under M1. Even though the Bayes factor has an unambiguous and continuous scale, it is sometimes useful to summarize the Bayes factor in terms of discrete categories of evidential strength. Jeffreys (1961, Appendix B) proposed the classification scheme shown in Table 14.4. We replaced the labels "not worth more than a bare mention" with "anecdotal," "decisive" with "extreme," and "substantial" with "moderate." These labels facilitate scientific communication but should be considered only as an approximate descriptive articulation of different standards of evidence.

Table 14.4. Evidence categories for the Bayes factor BF12 (based on Jeffreys, 1961).

    Bayes factor BF12    Interpretation
    > 100                Extreme evidence for M1
    30–100               Very strong evidence for M1
    10–30                Strong evidence for M1
    3–10                 Moderate evidence for M1
    1–3                  Anecdotal evidence for M1
    1                    No evidence
    1/3–1                Anecdotal evidence for M2
    1/10–1/3             Moderate evidence for M2
    1/30–1/10            Strong evidence for M2
    1/100–1/30           Very strong evidence for M2
    < 1/100              Extreme evidence for M2

Bayes factors negotiate the trade-off between parsimony and goodness of fit and implement an automatic Occam's razor (Jefferys & Berger, 1992; MacKay, 2003; Myung & Pitt, 1997). To see this, consider that the marginal likelihood m(y | M(·)) can be expressed as ∫ p(y | θ, M(·)) p(θ | M(·)) dθ: an average across the entire parameter space, with the prior providing the averaging weights. It follows that complex models with high-dimensional parameter spaces are not necessarily desirable—large regions of the parameter space may yield a very poor fit to the data, dragging down the average. The marginal likelihood will be highest for parsimonious models that use only those parts of the parameter space that are required to provide an adequate account of the data (Lee & Wagenmakers, 2013). By using the marginal likelihood, the Bayes factor punishes models that hedge their bets and make vague predictions. Models can hedge their bets in different ways: by including extra parameters, by assigning very wide prior distributions to the model parameters, or by using parameters that participate in the likelihood through a complicated functional form. By computing a weighted average likelihood across the entire parameter space, the marginal likelihood (and, consequently, the Bayes factor) automatically takes all these aspects into account.
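Eq. 7 in miniature (our illustration): starting from prior indifference, a Bayes factor of 5 moves the posterior probability of M1 to 5/6.

```python
# Eq. 7: posterior odds = prior odds x Bayes factor
prior_odds = 1.0      # p(M1) / p(M2): prior indifference between the models
bayes_factor = 5.0    # m(y | M1) / m(y | M2): data favor M1 five-fold
posterior_odds = prior_odds * bayes_factor

# convert posterior odds to a posterior model probability
p_m1 = posterior_odds / (1 + posterior_odds)
print(round(p_m1, 3))  # 0.833
```

The same two lines, run with a different prior_odds, implement the recomputation under a prior preference mentioned later in the chapter.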

model comparison and the principle of parsimony 309 Bayes factors represent “the standard Bayesian so- application to multinomial processing tree lution to the hypothesis testing and model selection models problems” (Lewis & Raftery, 1997, p. 648) and In order to compute the Bayes factor we seek to “the primary tool used in Bayesian inference for determine each model’s marginal likelihood m(y | hypothesis testing and model selection” (Berger, M(·)). As indicated earlier, the marginal likelihood 2006, p. 378), but their application is not without m(y | M(·)) is given by integrating the likelihood challenges (Box 3). Next we show how these over the prior: challenges can be overcome for the general class of MPT models. Then we compare the results of m(y | M(·)) = p y | θ,M(·) p θ | M(·) dθ. our Bayes factor analysis with those of the other (8) model comparison methods using Jeffreys weights The most straightforward manner to obtain (i.e., normalized marginal likelihoods). m(y | M(·)) is to draw samples from the prior p(θ | M(·)) and average the corresponding values for p(y | θ,M(·)): Box 3 Two challenges for Bayes factors N Bayes factors (Jeffreys, 1961; Kass & Raftery, 1 m(y | M · ) ≈ p y | θi,M · ,θi ∼ p(θ). (9) 1995) come with two main challenges, one ( ) N ( ) i=1 practical and one conceptual. The practical challenge arises because Bayes factors are de- For MPT models, this brute force integration fined as the ratio of two marginal likelihoods, approach may often be adequate. An MPT model each of which requires integration across the usually has few parameters, and each is conveniently entire parameter space. This integration process bounded from 0 to 1. However, brute-force integra- can be cumbersome and hence the Bayes factor tion is inefficient, particularly when the posterior can be difficult to obtain. 
Fortunately, there is highly peaked relative to the prior: in this case, θ | M are many approximate and exact methods to draws from p( (·)) tend to result in low facilitate the computation of the Bayes factor likelihoods and only few chance draws may have (e.g., Ardia, Ba¸stürk, Hoogerheide, & van high likelihood. This problem can be overcome by a Dijk, 2012; Chen, Shao, & Ibrahim, 2002; numerical technique known as importance sampling Gamerman & Lopes, 2006); in this chapter (Hammersley & Handscomb, 1964). we focus on BIC (a crude approximation), the In importance sampling, efficiency is increased Savage-Dickey density ratio (applies only to by drawing samples from an importance density θ θ |M nested models), and importance sampling. The g( ) instead of from the prior p( (·)). Consider θ conceptual challenge that Bayes factors bring is an importance density g( ). Then, that the prior on the model parameters has a g(θ) pronounced and lasting influence on the result. m(y | M · ) = p y | θ,M · p θ | M · dθ ( ) ( ) ( ) g(θ) This should not come as a surprise: the Bayes | θ M θ | M factor punishes models for needless complexity, p y , (·) p (·) = g(θ)dθ and the complexity of a model is determined g(θ) in part by the prior distributions that are N | θ M θ | M 1 p y i, (·) p i (·) assigned to the parameters. The difficulty arises ≈ , N g(θ ) because researchers are often not very confident i=1 i about the prior distributions that they specify. θi ∼ g(θ). (10) To overcome this challenge one can either θ = θ | M spend more time and effort on the specification Note that if g( ) p( (·)), the importance of realistic priors, or else one can choose sampler reduces to the brute-force integration θ = θ | default priors that fulfill general desiderata (e.g., shown in Eq. 9. Also note that if g( ) p( M Jeffreys, 1961; Liang, Paulo, Molina, Clyde, & y, (·)), a single draw suffices to determine p(y) Berger, 2008). Finally, the robustness of the exactly. 
conclusions can be verified by conducting a In sum, when the importance density equals the sensitivity analysis in which one examines the prior, we have brute force integration, and when effect of changing the prior specification (e.g., it equals the posterior, we have a zero-variance Wagenmakers, Wetzels, Borsboom, & van der estimator. However, in order to compute the Maas, 2011). posterior, we would have to be able to compute the normalizing constant (i.e., the marginal likelihood),
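Eqs. 9 and 10 can be illustrated on a toy binomial model (our example, not the chapter's MPT code). With a uniform Beta(1,1) prior and y = 7 successes in n = 10 trials, the marginal likelihood is known analytically to be 1/(n + 1) = 1/11, so both estimators can be checked:

```python
import math
import random

def likelihood(theta, y=7, n=10):
    # binomial likelihood p(y | theta)
    return math.comb(n, y) * theta ** y * (1 - theta) ** (n - y)

def beta_pdf(x, a, b):
    if x <= 0 or x >= 1:
        return 0.0
    log_p = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    return math.exp(log_p)

random.seed(1)
draws = 100_000

# Eq. 9: brute force -- average the likelihood over draws from the prior
brute = sum(likelihood(random.random()) for _ in range(draws)) / draws

# Eq. 10: importance sampling from a Beta mixture (in the spirit of Box 4):
# w * Beta(1,1) + (1 - w) * Beta(8,4), where Beta(8,4) is the exact posterior
w = 0.2
total = 0.0
for _ in range(draws):
    if random.random() < w:
        theta = random.random()           # uniform component
    else:
        theta = random.betavariate(8, 4)  # posterior-shaped component
    g = w * 1.0 + (1 - w) * beta_pdf(theta, 8, 4)  # importance density g(theta)
    total += likelihood(theta) * 1.0 / g  # prior density p(theta) = 1
importance = total / draws

print(round(brute, 3), round(importance, 3))  # both close to 1/11 ~ 0.091
```

The two estimators agree in expectation; the importance sampler simply reaches the same answer with far less variance per draw.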

However, in order to compute the posterior, we would have to be able to compute the normalizing constant (i.e., the marginal likelihood), which is exactly the quantity we wish to determine. In practice, then, we want to use an importance density that is similar to the posterior, is easy to evaluate, and is easy to draw samples from. In addition, we want to use an importance density with tails that are not thinner than those of the posterior; thin tails cause the estimate to have high variance. These desiderata are met by the Beta mixture importance density described in Box 4: a mixture between a Beta(1,1) density and a Beta density that provides a close fit to the posterior distribution. Here we use a series of univariate Beta mixtures, one for each separate parameter, but acknowledge that a multivariate importance density is potentially even more efficient, as it accommodates correlations between the parameters.

Box 4: Importance sampling for MPT models using the Beta mixture method
Importance sampling was invented by Stan Ulam and John von Neumann. Here we use it to estimate the marginal likelihood by repeatedly drawing samples and averaging—the samples are, however, not drawn from the prior (as per Eq. 9, the brute force method), but instead they are drawn from some convenient density g(θ) (as per Eq. 10; Andrieu, De Freitas, Doucet, & Jordan, 2003; Hammersley & Handscomb, 1964). The parameters in MPT models are constrained to the unit interval and, therefore, the family of Beta distributions is a natural candidate for g(θ). The middle panel of Figure 14.5 shows an importance density (dashed line) for MPT parameter c in the no-conflict model for the data from Wagenaar and Boer (1987). This importance density is a Beta distribution that was fit to the posterior distribution for c using the method of moments. The importance density provides a good description of the posterior (the dashed line tracks the posterior almost perfectly) and, therefore, is more efficient than the brute force method illustrated in the left panel of Figure 14.5, which uses the prior as the importance density. Unfortunately, Beta distributions do not always fit MPT parameters so well; specifically, the Beta importance density may sometimes have tails that are thinner than the posterior, and this increases the variability of the marginal likelihood estimate. To increase robustness and ensure that the importance density has relatively fat tails, we can use a Beta mixture, shown in the right panel of Figure 14.5. The Beta mixture consists of a uniform component (i.e., the Beta(1,1) prior as in the left panel) and a Beta posterior component (i.e., a Beta distribution fit to the posterior, as in the middle panel). In this example, the mixture weight for the uniform component is w = 0.2. Small mixture weights retain the efficiency of the Beta posterior approach but avoid the extra variability due to thin tails. It is possible to increase efficiency further by specifying a multivariate importance density, but the present univariate approach is intuitive, easy to implement, and appears to work well in practice. The accuracy of the estimate can be confirmed by increasing the number of draws from the importance density, and by varying the w parameter.

In our application to MPT models, we assume that all model parameters have uniform Beta(1,1) priors. For most MPT models this assumption is fairly uncontroversial. The uniform priors can be thought of as a default choice; in the presence of strong prior knowledge one can substitute more informative priors. The uniform priors yield a default Bayes factor that can serve as a reference point for an analysis with more informative priors, if such an analysis is desired (i.e., when reliable prior information is available, such as can be elicited from experts or derived from earlier experiments).

monte carlo sampling for the posterior distribution
Before turning to the results of the Bayes factor model comparison, we first inspect the posterior distributions. The posterior distributions were approximated using Markov chain Monte Carlo sampling implemented in JAGS (Plummer, 2003) and WinBUGS (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012).⁶ All code is available on the authors' websites. Convergence was confirmed by visual inspection and the R̂ statistic (Gelman & Rubin, 1992).

The top panel of Figure 14.6 shows the posterior distributions for the no-conflict model. Although there is slightly more certainty about parameter p than there is about parameters q and c, the posterior distributions for all three parameters are relatively wide considering that they are based on data from as many as 562 participants.

The middle panel of Figure 14.6 shows the posterior distributions for the destructive-updating model. It is important to realize that when d = 0 (i.e., no destruction of the earlier memory) the destructive-updating model reduces to the no-conflict model.

Fig. 14.5 Three different importance sampling densities (dashed lines) for the posterior distribution (solid lines) of the c parameter in the no-conflict model as applied to the data from Wagenaar and Boer (1987). Left panel: a uniform Beta importance density (i.e., the brute-force method); middle panel: a Beta posterior importance density (i.e., a Beta distribution that provides the best fit to the posterior); right panel: a Beta mixture importance density (i.e., a mixture of the uniform Beta density and the Beta posterior density, with a mixture weight w = 0.2 on the uniform component).
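The Beta posterior component of Figure 14.5 can be constructed as in Box 4: fit a Beta distribution to posterior draws by the method of moments, then mix it with a Beta(1,1) component using weight w = 0.2. A sketch (our illustration; the Beta(8,4) "posterior" draws are a synthetic stand-in, not the actual posterior for c):

```python
import math
import random

def fit_beta_moments(samples):
    # method-of-moments estimates for Beta(a, b):
    # a = m (m (1 - m) / v - 1),  b = (1 - m) (m (1 - m) / v - 1)
    m = sum(samples) / len(samples)
    v = sum((x - m) ** 2 for x in samples) / (len(samples) - 1)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

def beta_pdf(x, a, b):
    if x <= 0 or x >= 1:
        return 0.0
    log_p = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    return math.exp(log_p)

random.seed(2)
posterior_draws = [random.betavariate(8, 4) for _ in range(100_000)]
a, b = fit_beta_moments(posterior_draws)  # recovers roughly (8, 4)

def importance_density(x, w=0.2):
    # Beta mixture: the uniform component guards against thin tails
    return w * 1.0 + (1 - w) * beta_pdf(x, a, b)
```

The uniform component guarantees that the mixture density never falls below w anywhere on the unit interval, so the importance weights in Eq. 10 stay bounded.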

Compared to the no-conflict model, parameters p, q, and c show relatively little change. The posterior distribution for d is very wide, indicating considerable uncertainty about its true value. A frequentist point estimate yields d̂ = 0 (Wagenaar & Boer, 1987; see also Table 14.1), but this does not convey the fact that this estimate is highly uncertain.

The lower panel of Figure 14.6 shows the posterior distributions for the coexistence model. When s = 0 (i.e., no suppression of the earlier memory), the coexistence model reduces to the no-conflict model. Compared to the no-conflict model and the destructive-updating model, parameters p, q, and c again show relatively little change. The posterior distribution for s is very wide, indicating considerable uncertainty about its true value.

Fig. 14.6 Posterior distributions for the parameters of the no-conflict MPT model, the destructive-updating MPT model, and the coexistence MPT model, as applied to the data from Wagenaar and Boer (1987).

The fact that the no-conflict model is nested under both the destructive-updating model and the coexistence model allows us to inspect the extra parameters d and s and conclude that we have not learned very much about their true values. This suggests that, despite having tested 562 participants, the data do not firmly support one model over the other. We will now see how Bayes factors can make this intuitive judgment more precise.

importance sampling for the bayes factor
We applied the Beta mixture importance sampling method to estimate marginal likelihoods for the three models under consideration.

312 new directions independent implementations by the authors, and nested and non-nested models alike. For the models by comparison to the Savage-Dickey density ratio under consideration, however, there exists a nested test presented later. Table 14.5 shows the results. structure that allows one to obtain the Bayes factor From the marginal likelihoods and the Jeffreys through a computational shortcut. weights we can derive the Bayes factors for the pair wise comparisons; the Bayes factor is 2.77 in the savage-dickey approximation to the favor of the no-conflict model over the destructive- bayes factor for comparing nested models updating model, the Bayes factor is 1.39 in favor of Consider first the comparison between the no- the coexistence model over the no-conflict model, conflict model MNCM and the destructive-updating and the Bayes factor is 3.86 in favor of the model MDUM. As shown earlier, we can obtain M M coexistence model over the destructive-updating the Bayes factor for NCM versus DUM by com- model. The first two of these Bayes factors are puting the marginal likelihoods using importance anecdotal or “not worth more than a bare mention” sampling. However, because the models are nested, (Jeffreys, 1961), and the third one just makes the we can also obtain the Bayes factor by considering criterion for “moderate” evidence, although any only MDUM, and dividing the posterior ordinate enthusiasm about this level of evidence should be at d = 0 by the prior ordinate at d = 0. This tempered by the realization that Jeffreys himself surprising result was first published by Dickey and described a Bayes factor as high as 5.33 as “odds Lientz (1970), who attributed it to Leonard that would interest a gambler, but would be J. “Jimmie” Savage. The result is now generally hardly worth more than a passing mention in a known as the Savage-Dickey density ratio (e.g., scientific paper” (Jeffreys, 1961, pp. 256–257). 
Dickey, 1971; for extensions and generalizations In other words, the Bayes factors are consistent see Chen, 2005; Verdinelli & Wasserman, 1995; with the intuitive visual assessment of the posterior Wetzels, Grasman, & Wagenmakers, 2010; for an distributions: the data do not allow us to draw introduction for psychologists see Wagenmakers, strong conclusions. Lodewyckx, Kuriyal, & Grasman, 2010; a short We should stress that Bayes factors apply to mathematical proof is presented in O’Hagan & a comparison of any two models, regardless of Forster, 2004, pp. 174–177).7 Thus, we can exploit M M whether or not they are structurally related or nested the fact that NCM is nested in DUM and use (i.e., where one model is a special, simplified version the Savage-Dickey density ratio to obtain the Bayes of a larger, encompassing model). As is true for factor: the information criteria and minimum description length methods, Bayes factors can be used to | M = | M = m(y NCM) = p(d 0 y, DUM) compare structurally very different models, such BFNCM,DUM . m(y | MDUM) p(d = 0 | MDUM) as for example REM (Shiffrin & Steyvers, 1997) (11) versus ACT-R (Anderson, 2004), or the diffusion model (Ratcliff, 1978) versus the linear ballistic The Savage-Dickey density ratio test is visualized accumulator model (Brown & Heathcote, 2008). in Figure 14.7; the posterior ordinate at d = 0is In other words, Bayes factors can be applied to higher than the prior ordinate at d = 0, indicating
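Eq. 11 can be verified on a toy nested pair (our illustration, not the chapter's models): a binomial rate θ with a uniform prior versus the nested restriction θ = 1/2. The Savage-Dickey ratio of posterior to prior ordinate at θ = 1/2 matches the ratio of marginal likelihoods exactly:

```python
import math

def beta_pdf(x, a, b):
    log_p = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    return math.exp(log_p)

y, n = 7, 10

# Bayes factor for the nested model via marginal likelihoods:
# m(y | theta = 0.5) / m(y | free theta under a uniform prior)
m_null = math.comb(n, y) * 0.5 ** n
m_full = math.comb(n, y) * math.exp(
    math.lgamma(y + 1) + math.lgamma(n - y + 1) - math.lgamma(n + 2))
bf_marginal = m_null / m_full

# Savage-Dickey (Eq. 11): posterior ordinate at theta = 0.5 (the posterior
# is Beta(y + 1, n - y + 1)) divided by the prior ordinate (uniform: 1)
bf_savage_dickey = beta_pdf(0.5, y + 1, n - y + 1) / 1.0

print(round(bf_marginal, 4), round(bf_savage_dickey, 4))  # identical: 1.2891 1.2891
```

Only the encompassing model's posterior is needed, which is why the shortcut is attractive whenever MCMC draws are already available.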

Table 14.5. Bayesian evidence (i.e., the logarithm of the marginal likelihood), Jeffreys weights, and pairwise Bayes factors computed from the Jeffreys weights or through the Savage-Dickey density ratio, for the Wagenaar and Boer MPT models.

                                      Bayesian    Jeffreys   Bayes factor (Savage-Dickey)
                                      evidence    weight     Over NCM      Over DUM      Over CXM
    No-conflict model (NCM)           −30.55      0.36       1             2.77 (2.81)   0.72 (0.80)
    Destructive-updating model (DUM)  −31.57      0.13       0.36 (0.36)   1             0.26 (0.28*)
    Coexistence model (CXM)           −30.22      0.51       1.39 (1.25)   3.86 (3.51*)  1

∗ Derived through transitivity: 2.81 × 1/0.80 = 3.51.
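The starred entries can be reproduced from the reported ordinates (posterior ordinate 2.81 at d = 0, posterior ordinate 0.80 at s = 0, prior ordinates equal to 1), as a quick check of the transitivity used in the table:

```python
bf_ncm_dum = 2.81 / 1.0  # Savage-Dickey: posterior / prior ordinate at d = 0
bf_cxm_ncm = 1.0 / 0.80  # reciprocal of the posterior ordinate at s = 0

bf_cxm_dum = bf_ncm_dum * bf_cxm_ncm  # transitivity
print(round(bf_cxm_ncm, 2), round(bf_cxm_dum, 2))  # 1.25 3.51
```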

The Savage-Dickey density ratio test is visualized in Figure 14.7; the posterior ordinate at d = 0 is higher than the prior ordinate at d = 0, indicating that the data have increased the plausibility that d equals 0. This means that the data support M_NCM over M_DUM. The prior ordinate equals 1, and hence BF_NCM,DUM simply equals the posterior ordinate at d = 0. A nonparametric density estimator (Stone, Hansen, Kooperberg, & Truong, 1997) that respects the bound at 0 yields an estimate of 2.81. This estimate is close to 2.77, the estimate from the importance sampling approach.

Fig. 14.7 Illustration of the Savage-Dickey density ratio test. The dashed and solid lines show the prior and the posterior distribution for parameter d in the destructive-updating model. The black dots indicate the height of the prior and the posterior distributions at d = 0.

The Savage-Dickey density ratio test can be applied similarly to the comparison between the no-conflict model M_NCM versus the coexistence model M_CXM, where the critical test is at s = 0. Here, the posterior ordinate is estimated to be 0.80 and, hence, the Bayes factor for M_CXM over M_NCM equals 1/0.80 = 1.25, close to the Bayes factor obtained through importance sampling, BF_CXM,NCM = 1.39.

With these two Bayes factors in hand, we can immediately derive the Bayes factor for the comparison between the destructive-updating model M_DUM versus the coexistence model M_CXM through transitivity, that is, BF_CXM,DUM = BF_NCM,DUM × BF_CXM,NCM. Alternatively, we can also obtain BF_CXM,DUM by directly comparing the posterior density for d = 0 against that for s = 0:

    BF_CXM,DUM = BF_NCM,DUM × BF_CXM,NCM
               = [p(d = 0 | y, M_DUM) / p(d = 0 | M_DUM)] × [p(s = 0 | M_CXM) / p(s = 0 | y, M_CXM)]
               = p(d = 0 | y, M_DUM) / p(s = 0 | y, M_CXM),    (12)

where the second step is allowed because we have assigned uniform priors to both d and s, so that p(d = 0 | M_DUM) = p(s = 0 | M_CXM). Hence, the Savage-Dickey estimate for the Bayes factor between the two non-nested models M_DUM and M_CXM equals the ratio of the posterior ordinates at d = 0 and s = 0, resulting in the estimate BF_CXM,DUM = 3.51, close to the importance sampling result of 3.86.

Comparison of Model Comparisons
We have now implemented and performed a variety of model comparison methods for the three competing MPT models introduced by Wagenaar and Boer (1987): we computed and interpreted the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the Fisher information approximation of the minimum description length principle (FIA), and two computational implementations of the Bayes factor (BF).

The general tenor across most of the model comparison exercises has been that the data do not convincingly support one particular model. However, the destructive-updating model is consistently ranked the worst of the set. Looking at the parameter estimates, it is not difficult to see why this is so: the d parameter of the destructive-updating model (i.e., the probability of destroying the memory through updating) is estimated at 0, thereby reducing the destructive-updating model to the no-conflict model and yielding an identical fit to the data (as can be seen in the likelihood column of Table 14.2). Since the no-conflict model counts as a special case of the destructive-updating model, it is by necessity less complex from a model-selection point of view—the d parameter is an unnecessary entity, the inclusion of which is not warranted by the data. This is reflected in the inferior performance of the destructive-updating model according to all measures of generalizability.

Note that the BF judges the support for the NCM to be "anecdotal" even though the NCM and DUM provide similar fit and have a clear difference in complexity—one might expect the principle of parsimony to tell us that, given the equal fit and clear complexity difference, there is massive evidence for the simpler model, and the BF appears to fail to implement Occam's razor here. The lack of clear support for the NCM over the DUM is explained by the considerable uncertainty regarding the value of the parameter d: even though the posterior mode is at d = 0, much posterior variability is visible in the middle panel of Figure 14.6.
With more data 314 new directions and a posterior for d that is more peaked near 0, methods to the scenario of three competing MPT the evidence in favor of the simpler model would models introduced by Wagenaar and Boer (1987). increase. Despite collecting data from 562 participants, the The difference between the no-conflict model model comparison methods indicate that the data and the coexistence model is less clear-cut. are somewhat ambiguous; at any rate, the data do Following AIC, the two models are virtually not support the destructive updating model. This indistinguishable—compared to the coexistence echoes the conclusions drawn by Wagenaar and model, the no-conflict model sacrifices one unit Boer (1987). of log-likelihood for two units of complexity (one It is important to note that the model- parameter). As a result, both models perform comparison methods discusses in this chapter can be equally well under the AIC measure. Under the BIC applied regardless of whether the models are nested. measure, however, the penalty for the number of This is not just a practical nicety; it also means that free parameters is more substantial, and here the the methods are based on principles that transcend no-conflict model trades a unit of log likelihood for the details of a specific model implementation. log(N ) = 6.33 units of complexity, outdistancing In our opinion, a method of inference that is both the destructive updating model and the necessarily limited to the comparison of nested coexistence model. The BIC is the exception in models is incomplete at best and misleading at clearly preferring the no-conflict model over the worst. It is also important to realize that model coexistence model. The MDL, like the AIC, would comparison methods are relative indices of model have us hedge on the discriminability of the no- adequacy; when, say, the Bayes factor expresses an conflict model and the coexistence model. 
extreme preference for model A over model B, this The BF, finally, allows us to express evidence does not mean that model A fits the data at all well. for the models using standard probability theory. Figure 14.8 shows a classic but dramatic example Between any two models, the BF tells us how of the inadequacy of simple measures of relative much the balance of evidence has shifted due to model fit. the data. Using two methods of computing the Because it would be a mistake to base inference BF, we determined that the odds of the coexistence on a model that fails to describe the data, a complete model over the destructive updating model almost inference methodology features both relative and ≈ quadrupled (BFCXM,DUM 3.86), but the odds of absolute indices of model adequacy. For the MPT the coexistence model over the no-conflict model models under consideration here, Wagenaar and ≈ barely shifted at all (BFCXM,NCM 1.39). Finally, Boer (1987) reported that the no-conflict model we can use the same principles of probability to provided “an almost perfect fit” to the data.8 compute Jeffreys weights, which express, for each The example MPT scenario considered here model under consideration, the probability that it was relatively straightforward. More complicated is true, assuming prior indifference. Furthermore, MPT models contain order-restrictions, feature we can easily recompute the probabilities in case individual differences embedded in a hierarchi- we wish to express a prior preference between the cal framework (Klauer, 2010; Matzke, Dolan, candidate models (for example, we might use the Batchelder, & Wagenmakers, in press), or con- prior to express a preference for sparsity, as was tain a mixture-model representation with differ- originally proposed by Jeffreys, 1961). ent latent classes of participants (for application to other models see Frühwirth-Schnatter, 2006; Concluding Comments Scheibehenne, Rieskamp, & Wagenmakers, 2013). 
Model comparison methods need to implement In theory, it is relatively easy to derive Bayes the principle of parsimony: goodness-of-fit has to be factors for these more complicated models. In discounted to the extent that it was accomplished practice, however, Bayes factors for complicated by a model that is overly complex. Many methods models may require the use of numerical techniques of model comparison exist (Myung et al., 2000; more involved than importance sampling. Never- Wagenmakers & Waldorp, 2006), and our selective theless, for standard MPT models the Beta mixture review focused on methods that are popular, easy- importance sampler appears to be a convenient to-compute approximations (i.e., AIC and BIC) and reliable tool to obtain Bayes factors. We and methods that are difficult-to-compute “ideal” hope that this methodology will facilitate the solutions (i.e., minimum description length and principled comparison of MPT models in future Bayes factors). We applied these model comparison applications.
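The AIC and BIC trade-offs discussed above are easy to reproduce numerically. The sketch below uses hypothetical log-likelihoods and parameter counts, chosen only to mimic the qualitative pattern reported in the chapter (NCM and DUM fit identically, CXM gains one log-likelihood unit for one extra parameter); these are not the values from Table 14.2, and the function names are our own.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: -2 ln L + 2k (smaller is better)."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 ln L + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)

def model_weights(criteria):
    """Akaike/Schwarz-style weights: normalize exp(-0.5 * delta),
    where delta is each criterion minus the best (smallest) one."""
    best = min(criteria)
    raw = [math.exp(-0.5 * (c - best)) for c in criteria]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical log-likelihoods and free-parameter counts (NOT the
# actual values from Table 14.2).
models = {"NCM": (-1000.0, 3), "DUM": (-1000.0, 4), "CXM": (-999.0, 4)}
n = 562  # participants, as in the chapter; note log(562) ~ 6.33

aics = {m: aic(ll, k) for m, (ll, k) in models.items()}
bics = {m: bic(ll, k, n) for m, (ll, k) in models.items()}
print("AIC weights:", dict(zip(models, model_weights(list(aics.values())))))
print("BIC weights:", dict(zip(models, model_weights(list(bics.values())))))
```

With these toy numbers, NCM and CXM receive equal AIC weight (their AIC values tie), while the larger per-parameter BIC penalty of ln(562) gives NCM the clear BIC preference, mirroring the pattern described in the text.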

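The Savage-Dickey computation described earlier can likewise be sketched in a few lines. The snippet below draws hypothetical posterior samples from a Beta(1, 3) distribution as a stand-in for MCMC output (these are not the chapter's actual posterior draws), and estimates the posterior ordinate at the boundary with a crude one-sided window rather than the polynomial-spline estimator of Stone et al. (1997) used in the chapter.

```python
import random

def ordinate_at_zero(samples, h=0.02):
    """Crude density estimate at the boundary: the fraction of samples
    falling in [0, h), divided by the window width h. (The chapter
    instead uses a boundary-respecting spline estimator.)"""
    return sum(1 for s in samples if s < h) / (len(samples) * h)

random.seed(1)
# Hypothetical stand-in for MCMC output: posterior draws for d under
# the destructive updating model, piling up near the boundary d = 0.
post_d = [random.betavariate(1.0, 3.0) for _ in range(100_000)]

prior_ordinate = 1.0  # a uniform Beta(1, 1) prior has density 1 at d = 0
bf_nested = ordinate_at_zero(post_d) / prior_ordinate
print(f"Savage-Dickey BF in favor of the nested model: {bf_nested:.2f}")
```

Because the prior ordinate is 1, the Bayes factor reduces to the posterior ordinate at d = 0, exactly as in the text; only the density estimator at the boundary needs care.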
model comparison and the principle of parsimony 315

[Figure 14.8, "Anscombe's Quartet": four scatterplots, each annotated with r = 0.816; the x axes run from 5 to 20 and the y axes from 4 to 12.]

Fig. 14.8 Anscombe's Quartet is a set of four bivariate data sets whose basic descriptive statistics are approximately identical. In all cases, the mean of X is 9, the variance of X is 11, the mean of Y is 7.5, the variance of Y is 4.1, and the best-fitting linear regression line is y_i^est = 3 + 0.5 x_i, which explains R^2 = 66.6% of the variance in Y. However, in two of the four cases, the linear regression is clearly a poor account of the data. The relative measure of model fit (R^2) gives no indication of this radical difference between the data sets, and an absolute measure of fit (even one as rudimentary as a visual inspection of the regression line) is required. (Figure downloaded from Flickr, courtesy of Eric-Jan Wagenmakers.)
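The caption's statistics can be checked directly. The script below uses data set I of the quartet (values as published by Anscombe, 1973) and recovers the advertised means, variance, regression line, and correlation; the other three data sets yield the same summary numbers despite their very different shapes.

```python
import math

# Anscombe's quartet, data set I (Anscombe, 1973).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)          # sum of squared deviations of x
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

slope = sxy / sxx                 # least-squares slope, ~0.50
intercept = my - slope * mx       # least-squares intercept, ~3.00
r = sxy / math.sqrt(sxx * syy)    # Pearson correlation, ~0.816
print(f"mean(x) = {mx:.1f}, var(x) = {sxx / (n - 1):.1f}, "
      f"fit: y = {intercept:.2f} + {slope:.2f}x, r = {r:.3f}")
```

The point of the figure survives the arithmetic: these summaries are identical across all four data sets, so only an absolute check of fit (here, a plot) can tell them apart.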

Notes
1. This work was partially supported by the starting grant "Bayes or Bust" awarded by the European Research Council to EJW, and NSF grant #1230118 from the Methods, Measurements, and Statistics panel to JV.
2. This terminology is due to Pitt and Myung (2002), who point out that measures often referred to as "model fit indices" are in fact more than mere measures of fit to the data: they combine fit to the data with parsimony and hence measure generalizability. We adopt their more accurate terminology here.
3. Note that for hierarchical models, the definition of sample size n is more complicated (Pauler, 1998; Raftery, 1995).
4. For a more in-depth treatment, see Townsend (1975).
5. Analysis using the MPTinR package by Singmann and Kellen (2013) gave virtually identical results. Technical details for the computation of the NML for MPTs are provided in Appendix B of Klauer and Kellen (2011).
6. The second author used WinBUGS; the first and third authors used JAGS.
7. Note that the Savage-Dickey density ratio requires that when d = 0 the prior for the common parameters p, c, and q is the same for M_DUM and M_NCM. That is, p(p, c, q | d = 0, M_DUM) = p(p, c, q | M_NCM).
8. We confirmed the high quality of fit in a Bayesian framework using posterior predictives (Gelman & Hill, 2007), results not reported here.

Glossary
Akaike's information criterion (AIC): A quantity that expresses the generalizability of a model, based on the likelihood of the data under the model and the number of free parameters in the model.
Akaike weights: A quantity that conveys the relative preference among a set of candidate models, using AIC as a measure of generalizability.
Anscombe's quartet: A set of four bivariate data sets whose statistical properties are virtually indistinguishable until they are displayed graphically, and a canonical example of the importance of data visualization.
Bayes factor (BF): A quantity that conveys the degree to which the observed data sway our beliefs towards one or the other model. Under a priori indifference between two models M1 and M2, the BF expresses the a posteriori relative probability of the two.

Bayesian information criterion (BIC): A quantity that expresses the generalizability of a model, based on the likelihood of the data under the model, the number of free parameters in the model, and the amount of data.
Fisher information approximation (FIA): One of several approximations used to compute the MDL.
Goodness of fit: A quantity that expresses how well a model is able to account for a given set of observations.
Importance sampling: A numerical algorithm to efficiently draw samples from a distribution by factoring it into an easy-to-compute function over an easy-to-sample density.
Jeffreys weights: A quantity that conveys the relative preference among a set of candidate models, using BF as a measure of generalizability.
Likelihood principle: A principle of modeling and statistics that states that all information about a certain parameter that is obtainable from an experiment is contained in the likelihood function of that parameter for the given data. Many common statistical procedures, such as hypothesis testing with p-values, violate this principle.
Minimum description length (MDL): A quantity that expresses the generalizability of a model, based on the extent to which the model allows the observed data to be compressed.
Monte Carlo sampling: A general class of numerical algorithms used to characterize (i.e., compute descriptive measures of) an arbitrary distribution by drawing large numbers of random samples from it.
Nested models: Model M1 is nested in Model M2 if there exists a special case of M2 that is equivalent to M1.
Overfitting: A pitfall of modeling whereby the proposed model is too complex and begins to account for irrelevant particulars (i.e., random noise) of a specific data set, causing the model to generalize poorly to other data sets.
Parsimony: A strategy against overfitting, and a fundamental principle of model selection: all other things being equal, simpler models should be preferred over complex ones; or: greater model complexity must be bought with greater explanatory power. Often referred to as Occam's razor.
Rissanen weights: A quantity that conveys the relative preference among a set of candidate models, using FIA as a measure of generalizability.
Savage-Dickey density ratio: An efficient method for computing a Bayes factor between nested models.
Schwartz weights: A quantity that conveys the relative preference among a set of candidate models, using BIC as a measure of generalizability.

References
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Akaike, H. (1974a). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Akaike, H. (1974b). On the likelihood of a time series model. The Statistician, 27, 217–235.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111, 1036–1060.
Andrieu, C., De Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.
Ardia, D., Bastürk, N., Hoogerheide, L., & van Dijk, H. K. (2012). A comparative study of Monte Carlo methods for efficient evaluation of marginal likelihood. Computational Statistics and Data Analysis, 56, 3398–3414.
Batchelder, W. H., & Riefer, D. M. (1980). Separation of storage and retrieval factors in free recall of clusterable pairs. Psychological Review, 87, 375–397.
Batchelder, W. H., & Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6, 57–86.
Batchelder, W. H., & Riefer, D. M. (2007). Using multinomial processing tree models to measure cognitive deficits in clinical populations. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 19–50). Washington, DC: American Psychological Association.
Berger, J. O. (2006). Bayes factors. In S. Kotz, N. Balakrishnan, C. Read, B. Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (2nd ed., Vol. 1, pp. 378–386). Hoboken, NJ: Wiley.
Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–165.
Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91, 109–122.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York, NY: Wiley.
Brown, S. D., & Heathcote, A. (2005). Practice increases the efficiency of evidence accumulation in perceptual choice. Journal of Experimental Psychology: Human Perception and Performance, 31, 289–298.
Brown, S. D., & Heathcote, A. J. (2008). The simplest complete model of choice reaction time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer Verlag.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage.
Chechile, R. A. (1973). The relative storage and retrieval losses in short-term memory as a function of the similarity and amount of information processing in the interpolated task. Unpublished doctoral dissertation, University of Pittsburgh.
Chechile, R. A., & Meyer, D. L. (1976). A Bayesian procedure for separately estimating storage and retrieval components of forgetting. Journal of Mathematical Psychology, 13, 269–295.
Chen, M.-H. (2005). Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica, 59, 16–29.

Chen, M.-H., Shao, Q.-M., & Ibrahim, J. G. (2002). Monte Carlo methods in Bayesian computation. New York, NY: Springer.
Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332–361.
D'Agostino, R. B., & Stephens, M. A. (1986). Goodness-of-fit techniques. New York, NY: Marcel Dekker.
Dickey, J. M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, 42, 204–223.
Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics, 41, 214–226.
Dutilh, G., Vandekerckhove, J., Tuerlinckx, F., & Wagenmakers, E.-J. (2009). A diffusion model decomposition of the practice effect. Psychonomic Bulletin & Review, 16, 1026–1036.
Erdfelder, E., Auer, T.-S., Hilbig, B. E., Aßfalg, A., Moshagen, M., & Nadarevic, L. (2009). Multinomial processing tree models: A review of the literature. Zeitschrift für Psychologie, 217, 108–124.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. New York, NY: Springer.
Gamerman, D., & Lopes, H. F. (2006). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge, England: Cambridge University Press.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–472.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Boca Raton, FL: CRC Press.
Good, I. J. (1985). Weight of evidence: A brief survey. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics 2 (pp. 249–269). New York, NY: Elsevier.
Grünwald, P. (2000). Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152.
Grünwald, P. (2007). The minimum description length principle. Cambridge, MA: MIT Press.
Grünwald, P., Myung, I. J., & Pitt, M. A. (Eds.). (2005). Advances in minimum description length: Theory and applications. Cambridge, MA: MIT Press.
Hammersley, J. M., & Handscomb, D. C. (1964). Monte Carlo methods. London, England: Methuen.
Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185–207.
Heathcote, A., & Hayes, B. (2012). Diffusion versus linear ballistic accumulation: Different models for response time with different conclusions about psychological mechanisms? Canadian Journal of Experimental Psychology, 66, 125–136.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417.
Jefferys, W. H., & Berger, J. O. (1992). Ockham's razor and Bayesian analysis. American Scientist, 80, 64–72.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, England: Oxford University Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Klauer, K. C. (2010). Hierarchical multinomial processing tree models: A latent-trait approach. Psychometrika, 75, 70–98.
Klauer, K. C., & Kellen, D. (2011). The flexibility of models of recognition memory: An analysis by the minimum-description length principle. Journal of Mathematical Psychology, 55(6), 430–450.
Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian modeling for cognitive science: A practical course. Cambridge, England: Cambridge University Press.
Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: Sage.
Lewis, S. M., & Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92, 648–655.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103, 410–423.
Loftus, E. F., Miller, D. G., & Burns, H. J. (1978). Semantic integration of verbal information into a visual memory. Journal of Experimental Psychology: Human Learning and Memory, 4, 19–31.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527.
Logan, G. D. (1992). Shapes of reaction-time distributions and shapes of learning curves: A test of the instance theory of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 883–914.
Logan, G. D. (2002). An instance theory of attention and memory. Psychological Review, 109, 376–400.
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: Chapman & Hall/CRC.
Ly, A., Verhagen, A. J., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2014). A tutorial on Fisher information. Manuscript submitted for publication.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, England: Cambridge University Press.
Matzke, D., Dolan, C. V., Batchelder, W. H., & Wagenmakers, E.-J. (in press). Bayesian estimation of multinomial processing tree models with heterogeneity in participants and items. Psychometrika.
Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204.
Myung, I. J., Forster, M. R., & Browne, M. W. (2000). Model selection [Special issue]. Journal of Mathematical Psychology, 44, 1–2.
Myung, I. J., Navarro, D. J., & Pitt, M. A. (2006). Model selection by normalized maximum likelihood. Journal of Mathematical Psychology, 50, 167–179.

Myung, I. J., & Pitt, M. A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95.
O'Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society B, 57, 99–138.
O'Hagan, A., & Forster, J. (2004). Kendall's advanced theory of statistics, Vol. 2B: Bayesian inference (2nd ed.). London, England: Arnold.
Pauler, D. K. (1998). The Schwarz criterion and related methods for normal linear models. Biometrika, 85, 13–27.
Pitt, M. A., & Myung, I. J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421–425.
Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop on distributed statistical computing. Vienna, Austria.
Raftery, A. E. (1995). Bayesian model selection in social research. In P. V. Marsden (Ed.), Sociological methodology (pp. 111–196). Cambridge, England: Blackwells.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Rickard, T. C. (1997). Bending the power law: A CMPL theory of strategy shifts and the automatization of cognitive skills. Journal of Experimental Psychology: General, 126, 288–311.
Riefer, D. M., & Batchelder, W. H. (1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318–339.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 445–471.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society B, 49, 223–239.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712–1717.
Rouder, J. N., Lu, J., Morey, R. D., Sun, D., & Speckman, P. L. (2008). A hierarchical process dissociation model. Journal of Experimental Psychology: General, 137, 370–389.
Scheibehenne, B., Rieskamp, J., & Wagenmakers, E.-J. (2013). Testing adaptive toolbox models: A Bayesian hierarchical approach. Psychological Review, 120, 39–64.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166.
Silver, N. (2012). The signal and the noise: The art and science of prediction. London, England: Allen Lane.
Singmann, H., & Kellen, D. (2013). MPTinR: Analysis of multinomial processing tree models with R. Behavior Research Methods, 45, 560–575.
Smith, J. B., & Batchelder, W. H. (2010). Beta-MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167–183.
Stone, C. J., Hansen, M. H., Kooperberg, C., & Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling (with discussion). The Annals of Statistics, 25, 1371–1470.
Townsend, J. T. (1975). The mind-body equation revisited. In C. Cheng (Ed.), Philosophical aspects of the mind-body problem (pp. 200–218). Honolulu, Hawaii: Honolulu University Press.
Verdinelli, I., & Wasserman, L. (1995). Computing Bayes factors using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association, 90, 614–618.
Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17, 228.
Wagenaar, W. A., & Boer, J. P. A. (1987). Misleading postevent information: Testing parameterized models of integration in memory. Acta Psychologica, 66, 291–306.
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11, 192–196.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognitive Psychology, 60, 158–189.
Wagenmakers, E.-J., & Waldorp, L. (2006). Model selection: Theoretical developments and applications [Special issue]. Journal of Mathematical Psychology, 50(2), 1–2.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100, 426–432.
Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2010). An encompassing prior generalization of the Savage-Dickey density ratio test. Computational Statistics & Data Analysis, 54, 2094–2102.
Wu, H., Myung, J. I., & Batchelder, W. H. (2010). Minimum description length model selection of multinomial processing tree models. Psychonomic Bulletin & Review, 17, 275–286.
CHAPTER 15
Neurocognitive Modeling of Perceptual Decision Making

Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan

Abstract

Mathematical psychology and systems neuroscience have converged on stochastic accumulator models to explain decision making. We examined saccade decisions in monkeys while neurophysiological recordings were made within their frontal eye field. Accumulator models were tested on how well they fit response probabilities and distributions of response times to make saccades. We connected these models with neurophysiology. To test the hypothesis that visually responsive neurons represented perceptual evidence driving accumulation, we replaced perceptual processing time and drift rate parameters with recorded neurophysiology from those neurons. To test the hypothesis that movement related neurons instantiated the accumulator, we compared measures of neural dynamics with predicted measures of accumulator dynamics. Thus, neurophysiology both provides a constraint on model assumptions and data for model selection. We highlight a gated accumulator model that accounts for saccade behavior during visual search, predicts neurophysiology during search, and provides insights into the locus of cognitive control over decisions.

Key Words: accumulator models, decision making, response time, visual search, stop task, countermanding, neurophysiology, computational modeling, neural modeling, frontal eye field, superior colliculus

Introduction

We make decisions all the time. Whom to marry? What car to buy? What to eat? Whether to turn left or right? Some are easy. Some are hard. Some involve uncertainty. Some involve risk or reward. Decision-making requires integrating our perceptions of the current environment with our knowledge and past experience and our assessments of uncertainty and risk in order to select a possible action from a set of alternatives. Behavioral research on decision-making has had a long and distinguished history in psychology (e.g., Kahneman & Tversky, 1984). We now have powerful computational and mathematical models of how decisions are made (e.g., Brown & Heathcote, 2008; Busemeyer & Townsend, 1993; Dayan & Daw, 2008; Ratcliff & Rouder, 1998). And we know more about the brain areas involved in a range of decision-making tasks (Glimcher & Rustichini, 2004; Heekeren, Marrett, & Ungerleider, 2008; Schall, 2001; Shadlen & Newsome, 2001). To develop an integrated understanding of decision-making mechanisms, new efforts aim to combine behavioral and neural measures with cognitive modeling (e.g., Forstmann, Wagenmakers, Eichele, Brown, & Serences, 2011; Gold & Shadlen, 2007; Palmeri, in press; Smith & Ratcliff, 2004), an approach we aim to illustrate in some detail here.

We focus on perceptual decisions. Perceptual decision-making involves perceptually representing the world with respect to current task goals and using perceptual evidence to inform the selection of an action. A broad class of accumulator models of perceptual decision-making assume that perceptual

evidence accumulates over time to a response threshold (e.g., Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006; Brown & Heathcote, 2008; Link, 1992; Nosofsky & Palmeri, 1997; Palmeri, 1997; Ratcliff & Rouder, 1998; Ratcliff & Smith, 2004; Ratcliff & Smith, in press; Smith & Van Zandt, 2000; Usher & McClelland, 2001; see also Nosofsky & Palmeri, 2015). These models have provided excellent accounts of observed behavior, including the choices people make and the time it takes them to decide. Moreover, the observation that the pattern of spiking activity of certain neurons resembles an accumulation to threshold (Hanes & Schall, 1996) has sparked exciting synergies of mathematical and computational modeling with systems neuroscience (e.g., Boucher, Palmeri, Logan, & Schall, 2007a; Churchland & Ditterich, 2012; Cisek, Puskas, & El-Murr, 2009; Ditterich, 2006, 2010; Mazurek, Roitman, Ditterich, & Shadlen, 2003; Purcell, Heitz, Cohen, Schall, Logan, & Palmeri, 2010; Purcell, Schall, Logan, & Palmeri, 2012; Ratcliff, Cherian, & Segraves, 2003; Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007; Wong, Huk, Shadlen, & Wang, 2007; Wong & Wang, 2006).

[Fig. 15.1 Illustration of the macaque cerebral cortex. Frontal eye field (FEF) is a key brain area involved in the production of saccadic eye movements and the focus of our recent work. It receives projections from numerous posterior visual areas, including the middle temporal area (MT), visual area V4, inferotemporal areas TE and TEO, and the lateral intraparietal area (LIP). FEF projects to the superior colliculus (SC). Both FEF and SC project to the brainstem saccade generators that ultimately control the muscles of the eyes. Not shown are connections between FEF and prefrontal cortical areas and areas of the basal ganglia. (Adapted from Purcell et al., 2010.)]
In this article, we provide a general review of our contributions to these efforts. We use variants of accumulator models to explain neural mechanisms, use neurophysiology to constrain model assumptions, and use neural and behavioral data as a tool for model selection.

Our specific focus has been on perceptual decisions about where and when to make a saccadic eye movement to objects in the visual field. The first section of this article, Perceptual Decisions by Saccades, provides an overview of behavior, neuroanatomy, and neurophysiology of the primate saccade system, with an emphasis on the frontal eye field (FEF). There are numerous practical advantages to studying perceptual decisions made by saccades over perceptual decisions made by finger, hand, or limb movement, and we can also capitalize on over two decades of careful systems neuroscience research with awake behaving monkeys characterizing the response properties of neurons in FEF and the interconnected network of other brain areas involved in saccadic eye movements (Figure 15.1). FEF itself provides physiologists and theoreticians a unique window on perceptual decision-making. FEF receives projections from a wide range of posterior brain areas involved in visual perception, projects to subcortical brain areas involved directly in the production of eye movements, and is modulated by prefrontal brain areas involved in cognitive control. Indeed, one class of visually responsive neurons in FEF represents the task-relevant salience of objects in the visual field, whereas another class of movement-related neurons increase their activity in a manner consistent with accumulation-of-evidence models and modulate their activity according to changing task demands (e.g., see Schall, 2001, 2004).

One form of an accumulator model is illustrated in Figure 15.2. Accumulator models assume that perceptual processing takes some amount of time. The product of perceptual processing is perceptual evidence that is accumulated over time to make a perceptual decision. The rate of accumulation is often called drift rate, and this drift rate can be variable within a trial, across trials, or both (e.g., Brown & Heathcote, 2008; Ratcliff & Rouder, 1998). Variability in the accumulation of perceptual evidence to a threshold is a major contributor to variability in predicted behavior.

In their most general form, accumulator models assume drift rates to be free parameters that can be optimized to fit a set of observed behavioral data. There has been concern that unrestricted assumptions about drift rate and its variability may imbue these models with too much flexibility (Jones & Dzhafarov, 2014; but see also Ratcliff,

neurocognitive modeling of perceptual decision making 321
Fig. 15.2 (a) Illustration of a classic stochastic accumulator model of perceptual decision-making, highlighting some of the key free parameters. Perceptual processing of a visual stimulus takes some variable amount of time with mean TR. The outcome of perceptual processing is noisy perceptual evidence in favor of competing decisions with some mean drift rate. Perceptual evidence is accumulated over time, originating at some variable starting point (z), and accumulating until some threshold is reached, determined by θ. Illustrated here is a drift-diffusion model, but different architectures for the perceptual decision-making process can be assumed (see Figure 15.5). Variability in the accumulation of evidence to a threshold is a key constituent in predicting variability in RT. A motor response is made with some time TM, which for saccadic eye movements is on the order of 10–20 ms. (b) Our recent work has tested whether many of the free parameters can be constrained by the observed physiological dynamics of one class of neurons in FEF (see Figure 15.5) and whether predicted model dynamics of the stochastic accumulator can predict observed physiological dynamics of another class of neurons in FEF (see Figure 15.8).
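The accumulation process sketched in Figure 15.2a can be made concrete in a few lines of code. The following is a minimal, hypothetical sketch, not the authors' implementation; all parameter values (drift rate, drift variability, threshold, nondecision time) are illustrative stand-ins.

```python
import random

def ddm_trial(drift=0.3, drift_sd=0.1, theta=1.0, z=0.0,
              noise_sd=1.0, dt=0.001, t_r=0.3):
    """One simulated two-choice drift-diffusion trial.

    Evidence starts at z and accumulates with a drift rate drawn fresh
    on every trial (across-trial drift variability); t_r stands in for
    perceptual-encoding plus motor time (TR + TM in Figure 15.2).
    Returns (choice, response_time_in_seconds).
    """
    v = random.gauss(drift, drift_sd)      # trial-specific drift rate
    x, t = z, 0.0
    while abs(x) < theta:                  # accumulate to +theta or -theta
        x += v * dt + random.gauss(0.0, noise_sd) * dt ** 0.5
        t += dt
    return (1 if x > 0 else 0), t + t_r

# Variability in the accumulation produces a full RT distribution:
rts = [ddm_trial()[1] for _ in range(1000)]
```

Because both the trial-specific drift rate and the within-trial noise vary, repeated calls produce a distribution of choices and response times rather than a single value, which is the sense in which variability in the accumulation is "a major contributor to variability in predicted behavior."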

2013). One important step in theory development has been to significantly constrain these models by creating theories of the drift rates driving the accumulation of evidence, linking models of perceptual decision making with models of perceptual processing (e.g., Ashby, 2000; Logan & Gordon, 2001; Mack & Palmeri, 2010, 2011; Nosofsky & Palmeri, 1997; Palmeri, 1997; Palmeri & Cottrell, 2009; Palmeri & Tarr, 2008; Schneider & Logan, 2005, 2009; Smith & Ratcliff, 2009). As a first step toward a neural theory of drift rates, we hypothesized that activity of visually responsive neurons in FEF represents perceptual evidence driving the accumulation to threshold. To test this hypothesis, as described in the section titled A Neural Locus of Drift Rates, we replaced perceptual processing-time and drift-rate parameters directly with recorded neurophysiology from these neurons (see Figures 15.2 and 15.5), testing whether any model architecture for accumulation of perceptual evidence could then quantitatively account for observed saccade response probabilities and response time distributions.

A number of different model architectures have been proposed that all involve some accumulation of perceptual evidence to a threshold (e.g., see Bogacz et al., 2006; Smith & Ratcliff, 2004). For example, as their name implies, independent race models assume that evidence for each alternative decision accumulates independently (Smith & Van Zandt, 2000; Vickers, 1970). Drift-diffusion models (Ratcliff, 1978; Ratcliff & Rouder, 1998) and random walk models (Laming, 1968; Link, 1992; Nosofsky & Palmeri, 1997; Palmeri, 1997) assume that perceptual evidence in favor of one alternative counts as evidence against competing alternatives. Competing accumulator models (Usher & McClelland, 2001) assume that support for various alternatives is mutually inhibitory, so as evidence in favor of one alternative grows, it inhibits the others, often in a winner-take-all fashion (Grossberg, 1976). Different models can vary in other respects as well, such as whether integration of evidence is perfect or leaky. We describe these alternative model architectures and how well they account for observed response probabilities and response time distributions in the section Architectures for Perceptual Decision-Making.

We also tested the hypothesis that movement-related neurons in FEF instantiate an accumulator (Hanes & Schall, 1996). As described in the section Predicting Neural Dynamics, we quantitatively compared measured metrics of neural dynamics with predicted metrics of accumulator dynamics. Neurophysiology and modeling are synergistic in that we test quantitatively whether movement-related neurons have dynamics predicted by accumulator models, and we use the measured neural dynamics of movement-related neurons as an additional tool to select between competing model architectures. Finally, in a complementary way, in the section Control over Perceptual Decisions, we test whether competing hypotheses about cognitive control mechanisms can predict observed behavior as well as the observed modulation of movement-related neuron dynamics.

Perceptual Decisions by Saccades

Significant insights into the neurophysiological basis of perceptual decision-making have come from research on decisions about where and when to move the eyes (e.g., Gold & Shadlen, 2007; Schall, 2001, 2004; Smith & Ratcliff, 2004). Although the majority of human research on perceptual decisions has used manual key-press responses, a neurophysiological focus on saccadic eye movements is justified on several grounds: From the perspective of effector dynamics and motor control, eye movements have relatively few degrees of freedom, far fewer than limb movements, allowing fairly direct links between neurophysiology and behavior to be established (Scudder, Kaneko, & Fuchs, 2002). Saccadic eye movements are also relatively ballistic, with movement dynamics quite stereotyped depending on the direction, starting point, and distance the eyes need to move (Gilchrist, 2011), unlike limb movement, which can reach the same endpoint using a multitude of different trajectories having vastly different temporal dynamics (Rosenbaum, 2009). Moreover, from the perspective of understanding the mechanisms by which perceptual evidence is used to produce a perceptual decision, the saccade system is also a choice candidate to study because of the Frontal Eye Field (FEF), an area where visual perception, motor production, and cognitive control come together in the primate brain (Schall & Cohen, 2011).

FEF has long been known to play a role in the production of saccadic eye movements (e.g., Bruce, Goldberg, Bushnell, & Stanton, 1985; Ferrier, 1874). This is reflected by its direct and indirect connectivity with the superior colliculus (SC) and brain stem nuclei necessary for the production of saccadic eye movement (e.g., Munoz & Schall, 2004; Scudder et al., 2002; Sparks, 2002), as illustrated in Figure 15.1. Also as illustrated, FEF is innervated by numerous dorsal and ventral stream areas of extrastriate visual cortex (Schall, Morel, King, & Bullier, 1995). Not illustrated are connections between FEF and brain areas implicated in cognitive control, such as medial frontal and dorsolateral prefrontal cortex (e.g., Stanton, Bruce, & Goldberg, 1995) and basal ganglia (Goldman-Rakic & Porrino, 1985; Hikosaka & Wurtz, 1983). Neuroanatomically, FEF lies at a juncture of perception, action, and control. This bears out functionally, as various neurons within FEF reflect the importance of objects in the visual field, signal the selection and timing of saccadic eye movements, and modulate in a controlled manner according to changing task conditions (e.g., Heitz & Schall, 2012; Murthy, Ray, Shorter, Schall, & Thompson, 2009; Thompson, Biscoe, & Sato, 2005).

At the start of each neurophysiological session, once a neuron in FEF has been isolated, a memory-guided saccade task is used to classify its response properties (Bruce & Goldberg, 1985). As illustrated in Figure 15.3, the monkey fixates a spot in the center of the screen while a target is flashed in the periphery. To earn reward, the monkey must maintain fixation for a variable amount of time after which the fixation spot disappears, and then the monkey must make a single saccade to the remembered target location. When the target is in the receptive field of the FEF neuron, that neuron is classified as a visually responsive neuron (or visual neuron) if it shows a vigorous response to the appearance of the target, perhaps with a tonic response during the delay period, but with no significant saccade-related modulation. The neuron is classified as a movement-related neuron (or movement neuron, sometimes referred to as a buildup neuron) if it shows no or very weak modulation to the appearance of the target but pronounced growth of spike rate immediately

Fig. 15.3 Illustration of two saccade decision tasks discussed in this article. (a) In a memory-guided saccade task, the monkey fixates a central point while a peripheral target is quickly flashed; the location of the target is guided by the receptive field properties of the isolated neuron for a given experimental session. The monkey is required to maintain fixation for 400–1000 ms, after which the fixation spot disappears. To earn reward, the monkey must make a single saccade to the remembered location of the peripheral target. (b) In a visual search task, the monkey first maintains fixation on a central point. An array of visual objects is then presented and to earn reward the monkey must make a single saccade to the target object and not one of the distractor objects. In this case, the reward target was an L and the distractors were variously rotated Ts, with the particular reward target changed from session to session. Various experiments manipulated the number of distractors (set size), the similarity between targets and distractors, and the particular dimensions on which targets and distractors differed (shape, color, or motion).

Fig. 15.4 Illustration of response properties of visually responsive and movement-related neurons in FEF (Hanes, Patterson, & Schall, 1998; Hanes & Schall, 1996; Purcell et al., 2010). Recordings were made while monkeys engaged in a visual search task where the target either appeared among dissimilar distractors (easy search) or among similar distractors (hard search). Plots display normalized spike rate as a function of time (ms). Visually responsive neuron activity aligned on visual search array onset time illustrated in panel (a), movement-related neuron activity aligned on visual search array onset time illustrated in panel (b), and movement-related neuron activity aligned on saccade time illustrated in panel (c). Solid lines are trials in which the target was in the visual neuron's receptive field or movement neuron's movement field (target in), and dashed lines are trials in which the target was outside the neurons' response fields (target out). (Adapted from Purcell et al., 2010.)

preceding saccade production. Other neurons in FEF show other response properties (e.g., Sato & Schall, 2003), but our recent work has focused primarily on visual and movement neurons, which we might loosely characterize as the incoming input signal and outgoing output signal from FEF (see also Pouget et al., 2009).

Once visually responsive neurons and movement-related neurons are identified, their response properties can be measured during a primary perceptual decision task. For example, in a visual search task, as illustrated in Figure 15.3, after the monkey fixates a central spot, a search array is shown containing a target (in this case an L) and several distractors (in this case rotated Ts) and the monkey must make a single saccade to the target in order to receive reward. During visual search, visually responsive and movement-related neurons display characteristic dynamics. Figure 15.4 shows the normalized spiking activity of representative neurons recorded during easy and hard visual search trials when the target (solid) or a distractor (dashed) was in the neuron's receptive field. For some time after the visual search array appears, visually responsive neurons (Figure 15.4a) show no discrimination between a target and a distractor. However, spiking activity eventually discriminates between target and distractor, with generally faster and more significant discrimination with easy compared to hard visual search trials (Bichot & Schall, 1999; Sato, Murthy, Thompson, & Schall, 2001) and small compared to large set sizes (Cohen, Heitz, Woodman, & Schall, 2009). We note that the particular shape of the trajectories taken to achieve this neural discrimination can be somewhat heterogeneous across different neurons, but virtually all visually responsive neurons discriminate target from distractor over time. We emphasize that this discrimination concerns the "targetness" of the object in the neuron's receptive field, not particular features or dimensions of the object like its color or shape, except under unique circumstances (Bichot, Schall, & Thompson, 1996). Visually responsive neurons display these same characteristic dynamics regardless of whether a saccade is made, such as when the monkey withholds or cancels an eye movement

because of a stop signal (Hanes, Patterson, & Schall, 1998) or when the monkey is trained to maintain fixation and respond with a limb movement and not an eye movement (Thompson, Biscoe, & Sato, 2005).

Normalized activity of a representative movement-related neuron is shown aligned on the onset time of the visual search array (Figure 15.4b) and aligned on the time of the saccade (Figure 15.4c). When the monkey makes a saccade to the object in the receptive field (movement field) of the neuron, there is a characteristic buildup of activity some time after array onset; there is far less activity when the nonselected object is in the receptive field, although the precise nature of those dynamics varies somewhat from neuron to neuron. We see clearly that, when aligned on saccade initiation time, activity reaches a relatively constant threshold level immediately prior to the eye movement (Hanes & Schall, 1996), and this pattern of activity holds across search difficulty and set size (Woodman, Kang, Thompson, & Schall, 2008). Movement-related neuron activity does not reach threshold if the monkey withholds or cancels an eye movement because of a stop signal (Hanes et al., 1998; Murthy et al., 2009) or makes a response to the target using a limb movement and not an eye movement (Thompson, Biscoe, & Sato, 2005). We discuss more detailed aspects of the temporal dynamics of movement-related neurons later in this article. One of our primary goals has been to develop models that both predict the saccade behavior of the monkey and predict the temporal dynamics of movement-related neurons in FEF.

A Neural Locus of Drift Rates

Movement-related neurons increase in spike rate over time and reach a constant level of activity immediately prior to a saccade being initiated (Figure 15.4). The dynamics of movement-related neurons appear consistent with the dynamics of models that assume a stochastic accumulation of perceptual evidence to a threshold (Hanes & Schall, 1996; Ratcliff et al., 2003; Schall, 2001; Smith & Ratcliff, 2004). This insight raises several questions that we have begun to address in our recent work: If movement-related neurons instantiate an accumulator model, what kind of accumulator model do they instantiate? What kind of an accumulator model can predict the fine-grained dynamics of movement-related neurons? What drives the accumulator model? We begin with the last question.

A broad class of models of perceptual decision-making assumes that perceptual evidence is accumulated over time to a threshold (Figure 15.2; see also Ratcliff & Smith, 2015). The rate at which perceptual evidence is accumulated, the drift rate, can vary across objects, conditions, and experience. When accumulator models are tested by fitting them to observed behavior, it is not uncommon to assume that different drift rates across different experimental conditions are free parameters that are optimized to maximize or minimize some fit statistic (e.g., Brown & Heathcote, 2008; Boucher et al., 2007a; Ratcliff & Rouder, 1998; Usher & McClelland, 2001). But other theoretical work has aimed to connect models of perceptual decision-making to models of perceptual processing by developing a theory of the drift rates.

For example, Nosofsky and Palmeri (1997; Palmeri, 1997) proposed an exemplar-based random walk model (EBRW) that combined the generalized context model of categorization (Nosofsky, 1986) with the instance theory of automaticity (Logan, 1988) to develop a theory of the drift rates driving a stochastic accumulation of evidence. Briefly, EBRW assumes that a perceived object activates previously stored exemplars in visual memory, the probability and speed of exemplar retrieval is governed by similarity, and repeated exemplar retrievals determine the direction and rate of accumulation to a response threshold. EBRW predicts the effects of similarity, experience, and expertise on response probabilities and response times for perceptual decisions about visual categorization and recognition (see Nosofsky & Palmeri, 2015; Palmeri & Cottrell, 2009; Palmeri, Wong, & Gauthier, 2004). Other theorists have similarly connected visual perception and visual attention mechanisms to accumulator models of perceptual decision making by creating theories of drift rate (e.g., Ashby, 2000; Logan, 2002; Mack & Palmeri, 2010; Schneider & Logan, 2005; Smith & Ratcliff, 2009).

As a first step toward a neural theory of drift rates, we recently proposed a neural locus of drift rates when decisions are made by saccades (Purcell et al., 2010, 2012). We hypothesize that the accumulation of evidence is reflected in the firing rate of FEF movement-related neurons and the perceptual evidence driving this accumulation is reflected in the firing rate of FEF visually responsive

Fig. 15.5 Illustration of simulation model architectures tested in Purcell et al. (2010, 2012). Spike trains were recorded from FEF visually-responsive neurons during a saccade visual search task. Trials were sorted into two populations according to whether the target or a distractor was within the neuron's response field. Spike trains were randomly sampled from each population to generate a normalized activation function that served as the dynamic model input associated with a target (vT) and a distractor (vD) on a given simulated trial, as illustrated. Different architectures for perceptual decision-making were systematically tested. Decision units (mT) could integrate evidence or not, and they could be leaky (k) or not. Decision units could integrate a difference between the inputs (u) or not, the stochastic input could be gated (g) or not, and the units could compete with one another (β) or not. Here, only two decision units are shown, one for a target and one for a distractor. In Purcell et al. (2012) there were eight accumulators, one for each possible stimulus location in the visual search array.

neurons. One way to test this hypothesis would be to develop a model of the dynamics of visually responsive neurons, a model of how those dynamics are translated into drift rates, and then use those drift rates to drive a model of the accumulation of perceptual evidence. We chose a different approach. Rather than model the dynamics of visually responsive neurons, we used the observed firing rates of those neurons directly as a dynamic neural representation of the perceptual evidence that was accumulated over time.

Figure 15.5 illustrates our general approach. Activity of visually responsive neurons was recorded from FEF of monkeys performing a visual search task. In Figure 15.4, we illustrate spike density functions of a representative neuron when a target or distractor appeared in its receptive field during easy or hard visual search. For our modeling, we did not use the mean activity of neurons as input but, instead, generated thousands of simulated spike-density functions by subsampling from the full set of individually recorded trials of visually responsive neurons. Specifically, on each simulated trial, we first randomly sampled, with replacement, a set of spike trains recorded from individual neurons. We subsampled from trials when the target was in the receptive fields of the neurons to simulate perceptual evidence in favor of the target location and trials when a distractor was in the receptive field to simulate perceptual evidence in favor of each of the distractor locations. Along its far left, Figure 15.5 illustrates raster plots for example neurons, with individual trials arranged sequentially along the y axis, time along the x axis, and each black dot indicating the incidence of a recorded spike on a given trial for that neuron. The gray thick bars illustrate a random sampling from those recorded neurons. These sampled spike trains were convolved with a temporally asymmetric doubly exponential function (Thompson, Hanes, Bichot, & Schall, 1996), averaged together, and normalized to create dynamic drift rates associated with target and distractor locations (Purcell et al., 2010, 2012), as illustrated in the middle of Figure 15.5; the resulting input functions are mathematically similar to a Poisson shot noise process (Smith, 2010). Different inputs were defined according to the experimental condition under which the visually responsive neurons were recorded on each trial, such as easy versus hard search or small versus large set sizes. Arguably, this approach allows the most direct test of whether the dynamics of visually responsive

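The sample-convolve-average procedure for turning recorded spike trains into dynamic model inputs can be sketched as follows. This is a schematic reconstruction, not the published code: the kernel follows the asymmetric growth/decay form attributed to Thompson et al. (1996), but the time constants and the per-trial normalization used here are illustrative assumptions.

```python
import math
import random

def spike_density(spike_times_ms, t_max=300, tau_g=1.0, tau_d=20.0):
    """Convolve one spike train with a temporally asymmetric doubly
    exponential kernel, R(t) = (1 - exp(-t/tau_g)) * exp(-t/tau_d)."""
    sdf = [0.0] * t_max
    for s in spike_times_ms:
        for t in range(int(s), t_max):
            lag = t - s
            sdf[t] += (1.0 - math.exp(-lag / tau_g)) * math.exp(-lag / tau_d)
    return sdf

def simulated_input(recorded_trials, n_sample=10, t_max=300):
    """Sample spike trains with replacement from the recorded trials,
    average their density functions, and normalize to create one
    dynamic input function v(t) for a simulated trial."""
    sampled = random.choices(recorded_trials, k=n_sample)
    densities = [spike_density(trial, t_max) for trial in sampled]
    avg = [sum(d[t] for d in densities) / n_sample for t in range(t_max)]
    peak = max(avg)
    return [a / peak for a in avg] if peak > 0 else avg
```

On real data, separate pools of target-in and distractor-in trials would be sampled to produce vT(t) and vD(t), and normalization would plausibly be relative to the full recorded population rather than each simulated trial, as assumed here for simplicity.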
neurons provide a sufficient representation of perceptual evidence to predict where and when the monkey moves its eyes. If no model can predict saccade behavior using visually responsive neurons as input, then some other neural signal must be significantly modulating behavior of the monkey. Furthermore, as illustrated by contrasting Figures 15.2a and 15.2b, this novel approach imposes significant constraints on possible models by replacing free parameters governing the mean and variability of perceptual processing time, starting point of accumulation, and drift with observed neurophysiology. Finally, because the neurophysiological signal from visually responsive neurons is continuous in time, the models cannot merely assume that perceptual processing and perceptual decisions constitute discrete stages, as typical for many accumulator models.

Architectures for Perceptual Decision-Making

Within the broad class of perceptual decision-making models assuming an accumulation of perceptual evidence to a threshold, a variety of different model architectures have been proposed (e.g., see Ratcliff & Smith, 2004; Smith & Ratcliff, 2004). We instantiated several of these competing architectures, and using drift rates defined by recorded spiking activity of visually responsive neurons as inputs, evaluated how well each could fit observed response probabilities and response times of monkeys making saccades during a visual search task (Purcell et al., 2010, 2012).

Figure 15.5 illustrates the common architectural framework. Drift rates defined by neurophysiology constitute the input nodes labeled vT (target) and vD (distractor). We assume an accumulator associated with the target location (mT) and distractor locations (mD). Figure 15.5 shows only one target and one distractor accumulator (Purcell et al., 2010) but we have extended this framework to multiple accumulators, one for every possible target location in the visual field (Purcell et al., 2012). Each accumulator is governed by the following stochastic differential equation:

$$dm_i(t) = \frac{dt}{\tau}\left[\left(v_i(t) - u\sum_{j \neq i} v_j(t) - g\right)^{+} - \beta\sum_{k \neq i} m_k(t) - k\,m_i(t)\right] + \sqrt{\frac{dt}{\tau}}\,\xi.$$

The mi(t) are rectified to be greater than or equal to zero because we later compare the dynamics of these accumulators to the observed spike rates of movement-related neurons, and those spike rates are greater than zero by definition. ξ represents Gaussian noise intrinsic to each accumulator with mean 0 and standard deviation σ; in all of our simulations, this intrinsic accumulator variability could be assumed to be quite small relative to the variability of the visual inputs vi(t). All accumulators, mi(t), are assumed to race against one another to be the first to reach their threshold θ. The winner of that race between accumulators determines which saccade response is made on that simulated trial and the response time is given by the time to reach threshold plus a small ballistic time of 10–20 ms.

If k > 0, these are leaky accumulators; otherwise they are perfect integrators. If β = 0 and u = 0, we have a version of a simple horse race model. If β > 0, these are competing accumulators, and combined with leakage, k > 0, we have the leaky competing accumulator model (Usher & McClelland, 2001). If u > 0, then weighted differences are accumulated by each mi(t). In the case of only two accumulators, one for a target and the other for a distractor, and assuming u = 1, both mi(t) accumulate the difference between the evidence for a target versus evidence for a distractor, which is quite similar to a standard drift-diffusion model (see Bogacz et al., 2006; Ratcliff et al., 2007; Usher & McClelland, 2001), and when assuming positive leakage (k > 0) is quite similar to an Ornstein-Uhlenbeck process (Smith, 2010); this similarity can become mathematical identity with some added assumptions (Bogacz et al., 2006; Usher & McClelland, 2001).

Finally, we also proposed a novel aspect to this general architecture, which we called a gated accumulator (Purcell et al., 2010, 2012). When g > 0 and the input is positive-rectified, as indicated by the + in the equation, then only inputs that are sufficiently large can enter into the accumulation. For example, consider a gated accumulator assuming u > 0; this would mean that the differences in the evidence in favor of the target over the distractors must be sufficiently large before those differences will accumulate. Recall that we assumed that the inputs are defined by neurophysiology, which has no beginning or ending, apart from the birth or death of the organism. Intuitively, the gate forces the accumulators to accumulate signal, not merely noise, and noise is all that is present before

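The gated accumulator equation above can be simulated with a simple Euler scheme. The sketch below is illustrative rather than the authors' code: parameter values are arbitrary, the inputs vi(t) are stand-in step functions rather than sampled neurophysiology, and time is stepped in 1-ms increments.

```python
import math
import random

def gated_race(v_funcs, u=1.0, g=0.1, beta=0.2, k=0.1, theta=1.0,
               tau=10.0, dt=1.0, sigma=0.01, t_max=2000, t_ballistic=15.0):
    """Euler simulation of the gated accumulator race.

    Per step, each unit i receives (v_i - u * sum_other_v - g)^+ as its
    gated, rectified input, minus lateral inhibition (beta) and leakage
    (k), plus sqrt(dt/tau)-scaled intrinsic Gaussian noise. Returns
    (winning unit index, RT in ms), or (None, None) if no accumulator
    reaches threshold within t_max ms.
    """
    n = len(v_funcs)
    m = [0.0] * n
    for step in range(t_max):
        v = [f(step * dt) for f in v_funcs]
        m_new = []
        for i in range(n):
            gated = max(v[i] - u * (sum(v) - v[i]) - g, 0.0)
            drive = gated - beta * (sum(m) - m[i]) - k * m[i]
            mi = m[i] + (dt / tau) * drive + math.sqrt(dt / tau) * random.gauss(0.0, sigma)
            m_new.append(max(mi, 0.0))        # rectified, like a spike rate
        m = m_new
        for i in range(n):
            if m[i] >= theta:                 # first accumulator to threshold wins
                return i, step * dt + t_ballistic
    return None, None

# Step inputs standing in for target and distractor visually responsive neurons:
v_target = lambda t: 0.9 if t >= 50 else 0.0
v_distractor = lambda t: 0.3 if t >= 50 else 0.0
choice, rt = gated_race([v_target, v_distractor])
```

Before the inputs turn on, the gate g keeps the rectified drive at zero, so the accumulators integrate signal rather than baseline noise; setting g = 0, beta = 0, or k = 0 recovers the ungated, non-competing, or non-leaky special cases discussed above.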
Fig. 15.6 In Purcell et al. (2010), models (Figure 15.5) were tested on how well they could account for observed RT distributions of the onset of saccades in an easy visual search, where the target and distractors were dissimilar, and in a hard visual search, where the target and distractors were similar. Each panel shows observed cumulative RT distributions (symbols) for easy and hard search. Best-fitting model predictions for a subset of the models tested in Purcell et al. (2010) are shown for illustration, ranging left-to-right from a nonaccumulator model that does not integrate perceptual evidence over time, a perfect integrator model with no leakage, a leaky accumulator model, and a gated accumulator model. (Adapted from Purcell et al., 2010.)

perceptual processing has begun to discriminate targets from distractors.

We evaluated the fits of competing model architectures to observed response probabilities and distributions of response times using standard model fitting techniques (e.g., Ratcliff & Tuerlinckx, 2002; Van Zandt, 2000). We systematically compared models assuming a horse race, a diffusion-like difference accumulation process, or competition via lateral inhibition, factorially combined with various leaky, nonleaky, or gated accumulators. For example, Figure 15.6 displays observed response time distributions for easy versus hard visual search along with a sample of predictions from some of the model architectures evaluated by Purcell et al. (2010); for these particular data (Bichot, Thompson, Rao, & Schall, 2001; Cohen et al., 2009), there were very few errors. As shown in the left two panels, models assuming no integration at all, meaning that the current value of mi(t) simply reflects the current inputs at time t, and models assuming perfect integration without leakage, provided a relatively poor fit to the observed behavioral data. Although these particular behavioral data were fairly limited, with only a response-time distribution for easy and hard visual search, we could rule out some potential model architectures. However, other competing models, including those with leakage or gate, assuming a competition or an accumulation of differences, all provided reasonable quantitative accounts of the behavioral data, a couple of examples of which are shown in the two right panels of Figure 15.6.

Purcell et al. (2012) evaluated fits of these models to a more comprehensive dataset where set size was systematically manipulated and where the search was difficult enough to produce significant errors (Cohen et al., 2009). Models were required to fit correct- and error-response probabilities as well as distributions of correct- and error-response times. These data are shown in Figure 15.7. Also shown are the predictions of the best-fitting model, which was a gated accumulator model that assumed both significant leakage and competition via lateral inhibition. Likely because this dataset was larger, it also provided a greater challenge to other models, since many horse-race models and diffusion-like models failed to provide adequate fits to the observed data, whether they included leakage or gating (see Purcell et al., 2012).

Just based on the quality of fits to observed data, models with leakage and competition via lateral inhibition provided comparable fits whether those models included gating or not in both Purcell et al. (2010) and Purcell et al. (2012). So based on parsimony, a nongated version, which is essentially a leaky competing accumulator model (Usher & McClelland, 2001), would win the theoretical competition. But our goal was also to test whether the accumulators in the competing models could provide a theoretical account of the movement-related neurons in FEF. To do that, we also tested whether the dynamics measured in the accumulators could predict the dynamics measured in movement-related neurons (see also Boucher et al., 2007a; Ratcliff et al., 2003, 2007).

Predicting Neural Dynamics

Until now, the work we have described follows a long tradition of developing and testing computational and mathematical models of cognition.
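Model comparisons of the kind described above rest on comparing predicted and observed cumulative RT distributions. A minimal sketch of that comparison is shown below; the sum-of-squared-error statistic is a simplified stand-in for the chi-square or likelihood-based statistics used in practice, and the RT values are invented for illustration.

```python
def cumulative_rt(rts, t_grid, p_response=1.0):
    """Defective cumulative RT distribution: the proportion of all trials
    producing this response by each time t, scaled by the probability of
    the response (so correct and error distributions jointly sum to 1)."""
    n = len(rts)
    return [p_response * sum(1 for r in rts if r < t) / n for t in t_grid]

def sse(observed, predicted):
    """Sum of squared error between two distributions evaluated on a
    common time grid; a stand-in for the fit statistic minimized when
    optimizing model parameters."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

grid = [100, 200, 300, 400, 500]          # ms
observed = cumulative_rt([150, 250, 250, 350], grid)
predicted = cumulative_rt([140, 260, 240, 360], grid)
fit = sse(observed, predicted)
```

In an actual fit, `predicted` would come from many simulated trials of one of the accumulator architectures, and the statistic would be minimized over the model's free parameters for correct and error responses at each condition.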

Fig. 15.7 In Purcell et al. (2012), models (Figure 15.5) were tested on how well they could account for correct- and error-response probabilities and correct- and error-response time distributions of saccades in a visual search task with three levels of set size: 2 (blue), 4 (green), or 8 (red) objects in the visual array. Predictions from the best-fitting gated accumulator model are shown. (a) Mean observed (symbols) and predicted (lines) correct- (solid) and error- (dashed) response times as a function of set size. (b) Mean observed (symbols) and predicted (lines) probability correct as a function of set size. (c) Observed (symbols) and predicted (lines) cumulative RT distributions of correct responses at each set size. (d) Observed (symbols) and predicted (lines) cumulative RT distributions of error responses at each set size. (Adapted from Purcell et al., 2012.)

Competing models are evaluated on their ability to predict behavioral data by optimizing parameters in order to maximize or minimize the fit of each model to the observed data, and then statistical tests are performed for nested or nonnested model comparison (e.g., see Busemeyer & Diederich, 2010; Lewandowsky & Farrell, 2010). We go beyond this approach to evaluate linking propositions (Schall, 2004; Teller, 1984) that aim to map particular cognitive model mechanisms onto observable neural dynamics. Specifically, we evaluate the linking proposition that movement-related neurons in FEF instantiate an accumulation of evidence to a threshold. We do this by testing how well the simulated dynamics of accumulators in the various model architectures described in the previous section predict the observed dynamics in movement-related neurons. Although the qualitative relationship between accumulator dynamics and movement neuron dynamics has long been recognized (e.g., Hanes & Schall, 1996; Ratcliff et al., 2003; Smith & Ratcliff, 2004), we go beyond noting qualitative relationships to test quantitative predictions.

Following the approach used by Woodman et al. (2008), we evaluated how several key measures of neural dynamics varied according to the measured response time of a saccade. The top row of Figure 15.8 illustrates several hypotheses for how variability in response time is related to variability in the underlying neural dynamics. Fast responses could be associated with an early initial onset of the neural activity from baseline, whereas slow responses could be associated with a delayed onset. Alternatively, fast responses could be associated with high growth rate in spiking activity to threshold, whereas slow responses could be associated with low growth rate. Fast responses could be associated with an increased baseline firing rate or decreased threshold, whereas slow responses could be associated with a decreased baseline firing rate or increased threshold. To evaluate these proposals, the onset

neurocognitive modeling of perceptual decision making 329
time, growth rate, baseline, and threshold of neural activity were all measured within bins of trials defined by response times from fastest to slowest, both within conditions and across conditions (see Purcell et al., 2010, 2012, for details). The middle row shows the relationship between onset time, growth rate, baseline, and threshold of neural activity and mean response time for each bin of an RT distribution for a representative neuron in a representative condition. The bottom row shows the mean correlation of neural measures with RT as a function of set size from Purcell et al. (2012), with a significant relationship between onset time and response time observed in neural activity in movement-related neurons in FEF.

Using analogous methods, we also measured the relationship between onset time, growth rate, baseline, and threshold of accumulator dynamics and response time predicted by each of the competing model architectures that we simulated. Shown in Figure 15.8 are the predictions of the gated accumulator model from Purcell et al. (2012), illustrating a good match between model and neurons. These are true model predictions, not model fits. After the model was fitted to behavioral data, the accumulator dynamics using the best-fitting model parameters were measured and compared directly with the observed neural dynamics. All other models failed to predict the observed neural dynamics. For example, models without a gate typically predicted a significant negative correlation between baseline and response time that was completely absent in the observed data. Part of the reason for this is that, with nongated models, the accumulators are allowed to accumulate noise in the input defined by visually responsive neurons. Although a leakage term may be sufficient to keep a weak noise signal from leading to a premature accumulation to threshold, it cannot prevent significant differences in baseline activity from being correlated with differences in predicted response time when the accumulators reach threshold, at least without significantly compromising fits to the observed behavior.

Control over Perceptual Decisions
We have also considered the neurophysiological basis of cognitive control over perceptual decisions. Mirroring our other research, we used cognitive models to better understand neural mechanisms and used neural data to constrain competing cognitive models. Perhaps the most widely used task for studying normal and dysfunctional cognitive control is the stop-signal task (Lappin & Eriksen, 1966; Logan & Cowan, 1984). Saccade variants of this task have been used with monkeys, and neurophysiological activity has been recorded from neurons in FEF (Hanes et al., 1998). The basic stop-signal task with saccades is in certain ways a converse of the memory-guided saccade task illustrated in Figure 15.4. Monkeys initially fixate the center of the screen. After a variable amount of time, the fixation spot disappears and a peripheral target appears somewhere in the visual field, and the monkey must make a single saccade to the target in order to earn reward. This is the primary task, or go signal. On a fraction of trials, some time after the peripheral target appears, the fixation spot is reilluminated, and the monkey is rewarded for cancelling its saccade, maintaining fixation. This is the stop signal. The interval between the appearance of the go signal, the peripheral target, and the stop signal, the fixation point, is called the stop signal delay (SSD). Monkeys' ability to inhibit their saccade is probabilistic due to the stochastic variability of go and stop processes and depends on SSD.

Figure 15.9 displays the key behavioral data observed in the saccade stop-signal paradigm (Hanes et al., 1998). Figure 15.9a displays the probability of responding to the go signal (y axis), despite the presence of a stop signal at a particular SSD (x axis). When the stop signal illuminates shortly after the appearance of the target, at a short SSD, the probability of responding to the go signal is quite small. Control over the saccade as a consequence of the stop signal has been successful. In contrast, for a long SSD, the probability of successfully inhibiting the saccade is rather small. Figure 15.9b displays distributions of response times for primary go trials with a stop signal (signal-respond trials), in which a saccade was erroneously made, shaded gray according to SSD (see figure caption). These response times are significantly faster than response times without any stop signal (no-stop-signal trials) in black. Behavioral data in the stop-signal paradigm have long been accounted for by an independent race model (Logan & Cowan, 1984), which assumes that performance is the outcome of a race between a go process, responsible for initiating the movement, and a stop process, responsible for inhibiting the movement (see also Becker & Jürgens, 1979; Boucher, Stuphorn, Logan, Schall, & Palmeri, 2007b; Camalier et al., 2007;


Fig. 15.8 Comparing observed neural dynamics and predicted model dynamics. Top row: Four possible hypotheses for how variability in RT is related to variability in neural or accumulator dynamics: from left to right, variability in RT could be correlated with variability in the onset time, growth rate, baseline, or threshold. Middle row: Following Woodman et al. (2008), correct RTs were binned in groups from fastest to slowest and within each bin the onset time, growth rate, baseline, and threshold of the spike density functions were calculated. The relationship between RT and neural measure (left to right: onset time, growth rate, baseline, and threshold) is shown for one representative neuron in set size 4 for one of the monkeys tested; the correlation between RT and neural measure and its associated p-value are also shown. Bottom row: Average correlation between RT and neural measure (left to right: onset time, growth rate, baseline, and threshold) as a function of set size observed in neural dynamics and predicted in model dynamics for the gated accumulator model. (Adapted from Purcell et al., 2012.)
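The binning-and-correlation analysis summarized in the Figure 15.8 caption can be sketched as follows. The data here are synthetic stand-ins (onset is constructed to covary with RT, threshold is not), and the helper name binned_correlation is ours, not from the published analysis code.

```python
import numpy as np

rng = np.random.default_rng(1)

def binned_correlation(rt, measure, n_bins=5):
    """Sort trials by RT, split them into n_bins groups from fastest to
    slowest, and correlate the per-bin mean of a neural measure with the
    per-bin mean RT (Pearson correlation across bin means)."""
    order = np.argsort(rt)
    bins = np.array_split(order, n_bins)
    mean_rt = np.array([rt[idx].mean() for idx in bins])
    mean_measure = np.array([measure[idx].mean() for idx in bins])
    return float(np.corrcoef(mean_rt, mean_measure)[0, 1])

# Synthetic trials: onset time tracks RT; threshold is unrelated to RT.
rt = rng.uniform(150, 400, size=200)
onset = 0.5 * rt + rng.normal(0, 10, size=200)
threshold = rng.normal(40, 3, size=200)

r_onset = binned_correlation(rt, onset)          # strongly positive
r_threshold = binned_correlation(rt, threshold)  # near zero
```

The same machinery applies identically to observed spike-density measures and to simulated accumulator trajectories, which is what allows the direct model-to-neuron comparison described in the text.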

0.5 0.5 Normalized Activation Normalized cancel time Probability (signal-respond) Probability 0.0 0.0 50 100 150 200 250 0 100 200 Stop signal delay (ms) Time from stimulus (ms)

(b) (d) 1.0 0.4 Movement Neurons in: frontal eye field superior colliculus 0.3 Model Simulation: 0.5 Go Process 0.2 P(RT < t) 0.1 Normalized Proportion Normalized 0.0 0.0 200 300 400 –50 0 50 Reaction time (ms) Cancel time (ms)

Fig. 15.9 (a) Observed inhibition function (gray line) and simulated inhibition function from the interactive race model (black line). (b) Observed (thin lines) and simulated (thick lines) cumulative RT distributions from no-stop-signal trials (black line) and signal-respond trials with progressively longer stop-signal delays (progressively darker gray lines). (c) Illustration of simulated activity in the interactive race model of the go unit and stop unit activation on signal-inhibit (thick solid line) and latency-matched no-stop-signal trials (thin solid lines) with stop-signal delay (SSD) and stop-signal reaction time (SSRT) indicated. Cancel time is indicated by the downward arrow. (d) Histogram of cancel times of the go unit predicted by the interactive race model compared with the histogram of cancel times measured for movement-related neurons in FEF and SC. (Adapted from Boucher et al., 2007a.)
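The independent race account behind the inhibition function in panel (a) can be sketched in a few lines. The Gaussian finish-time distributions and parameter values below are hypothetical placeholders, not the distributions fitted by Boucher et al. (2007a): a saccade escapes whenever the go process finishes before the stop process, which starts SSD ms after the go signal.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_respond(ssd, n=10_000, go_mu=300.0, go_sd=50.0, ssrt_mu=100.0, ssrt_sd=20.0):
    """Independent race model of the stop-signal task.

    Probability of (erroneously) responding on stop-signal trials at a given
    stop-signal delay: the fraction of trials on which the go finish time
    beats SSD plus the stop finish time. All parameters are illustrative,
    and go and stop finish times are drawn independently.
    """
    go = rng.normal(go_mu, go_sd, size=n)
    stop = ssd + rng.normal(ssrt_mu, ssrt_sd, size=n)
    return float(np.mean(go < stop))

# Inhibition function: P(respond despite the stop signal) rises with SSD.
inhibition_fn = [p_respond(ssd) for ssd in (50, 150, 250)]
```

With these placeholder parameters the probability of responding rises from near zero at short delays toward one at long delays, mirroring the shape of the observed inhibition function in Figure 15.9a.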

Logan, Van Zandt, Verbruggen, & Wagenmakers, 2014; Ollman, 1973). Boucher et al. (2007a) addressed an apparent paradox of how seemingly interacting neurons in the brain could produce behavior that appears to be the outcome of independent processes. Mirroring the general model architectures described earlier and illustrated in the right half of Figure 15.5, they instantiated and tested models that assumed stochastic accumulators for the go process and for the stop process that were either an independent race or that assumed competitive, lateral interactions between stop and go. Outstanding fits to observed behavioral data were observed for both the independent race model and the interactive race model. Figures 15.9a and 15.9b show fits of the interactive race model, but fits of the independent race model were virtually identical. Parsimony would favor the independent race. But neural data favored the interactive race.

In the absence of a stop signal, visually responsive neurons in FEF select the target, and movement-related neurons in FEF increase their activity until a threshold level is reached, shortly after which a saccade is made (Hanes & Schall, 1996), just as they do on memory-guided saccade tasks or visual search tasks. On trials with a stop signal, the dynamics of visually responsive neurons are unaffected (Hanes et al., 1998). For movement-related neurons, we can distinguish activity when a stop was successful (signal-inhibit trials) from activity when a stop was unsuccessful (signal-respond trials). On signal-respond trials, the activity of movement-related neurons is qualitatively the same as the activity on no-signal trials, with neurons reaching a threshold level before a saccade is made. Even more striking, the activity on signal-respond trials is quantitatively indistinguishable from activity on no-signal trials that are equated for response time (latency-matched trials). On signal-inhibit trials, the activity increases in a manner indistinguishable from latency-matched no-signal trials until some time after the SSD, at which point the activity of movement-related neurons is reduced back to baseline without

reaching the threshold. The saccade has been inhibited.

Figure 15.9c displays the predicted accumulator dynamics of the interactive race model (Boucher et al., 2007a). The dynamics of the go accumulator in the interactive race precisely mirrors the description of the dynamics of movement-related neurons provided earlier, with dynamics not observed in the independent race model. For signal-inhibit trials and latency-matched no-signal trials, activity increases for some time after SSD, after which activity on signal-inhibit trials returns to baseline while activity on latency-matched no-signal trials continues to threshold. The accumulator dynamics in the interactive race model qualitatively captures the neural dynamics of movement-related neurons. But we could go further than that. We also calculated a metric called cancel time (Hanes et al., 1998), which is a function of the time at which the dynamics statistically diverge between signal-inhibit trials and latency-matched no-signal trials. This time can be calculated from movement-related neurons. It can also be calculated from accumulator dynamics. And as shown in Figure 15.9d, these measures from neurons and the model nicely converge. We emphasize that, as was the case for Purcell et al. (2010, 2012), these are true model predictions. Boucher et al. (2007a) fitted models to behavioral data, then calculated the cancel time predicted by the models, and compared that to the observed cancel time in neurons. Parameters were not adjusted to maximize the correspondence.

The hypothesized locus of control in Boucher et al. (2007a) is inhibition of a stop process on the go process, with the stop process identified as activity of fixation-related neurons and the go process identified as activity of movement-related neurons. The gate in the gated accumulator model (Purcell et al., 2010, 2012) could be another hypothesized locus of control over perceptual decisions. In recent work, we have suggested that blocking the input to the go unit, rather than actively inhibiting it via a stop unit, could be an alternative mechanism for stopping. Indeed, a blocked input model predicted observed data and distributions of cancel times at least as well as the interactive race model (Logan, Schall, & Palmeri, 2015; Logan, Yamaguchi, Schall, & Palmeri, in press). One suggestion we made was that the stop process could raise a gate between visual neurons that select the target and movement neurons that generate a movement to it, blocking input to the movement neurons and thereby preventing them from reaching threshold.

As another example, in a stop-signal task, both humans and monkeys adapt their performance from trial to trial, for example, producing longer RTs after successfully inhibiting a planned movement (e.g., Bissett & Logan, 2011; Nelson, Boucher, Logan, Palmeri, & Schall, 2010; Verbruggen & Logan, 2008). For monkeys, within FEF, activity of visually responsive neurons is unaffected by these trial-to-trial adjustments, but the onset time of activity of movement-related neurons is significantly delayed (Pouget et al., 2011). Purcell et al. (2012) suggested that strategic adjustment in the level of the gate could explain the delayed onset of movement-related neurons in the absence of any modulation of visually responsive neurons. Moreover, they demonstrated that this strategic adjustment of the gate could be couched in terms of optimality. It has been previously suggested that strategic modulation of the accumulator threshold could maximize reward rate, which is defined as the proportion of correct responses per unit time (e.g., Gold & Shadlen, 2002; Lo & Wang, 2006). We observed that strategic modulation of the level of the gate could maximize reward rate in much the same way (Purcell et al., 2012).

Summary and Conclusions
Here we reviewed some of our contributions to a growing synergy of mathematical psychology and systems neuroscience. Our starting point has been a class of successful cognitive models of perceptual decision-making that assume a stochastic accumulation of perceptual evidence to a threshold over time (Figure 15.2). Models of this sort have long provided excellent accounts of response probabilities and distributions of response times in a wide range of perceptual decision-making tasks and manipulations (e.g., see Nosofsky & Palmeri, 2015; Ratcliff & Smith, 2015). We have extended these models to account for response probabilities and distributions of response times for awake behaving monkeys making saccades to target objects in their visual field (Boucher et al., 2007a; Pouget et al., 2011; Purcell et al., 2010, 2012). Applying techniques common to mathematical psychology, we instantiated different model architectures and ruled out models that provided poor fits to observed data.

These models have free parameters that govern theoretical quantities like perceptual processing

time, the starting point of accumulation, the drift rate of accumulation, and the response threshold. We constrained many of these parameters using neurophysiology. Unlike some approaches that constrain parameter values based on neurophysiology, often based on neural findings with rather large confidence intervals, we replaced parameterized model assumptions directly with recorded neurophysiology. Specifically, we sampled from neural activity recorded from visually responsive neurons in FEF, feeding these spike trains directly into stochastic accumulator models, thereby creating a largely nonparametric neural theory of perceptual processing time and the drift rate of accumulation. Not only did this approach constrain computational modeling, it also provided a direct test of the hypothesis that the activity of visually responsive neurons in FEF encodes perceptual evidence: This neural code can be accumulated over time to predict where and when the monkey moves its eyes (Purcell et al., 2010, 2012).

We also tested the hypothesis that movement-related neurons in FEF instantiate a stochastic accumulation of evidence. Although it has long been acknowledged that these neurons behave in a way consistent with accumulator models (e.g., Hanes & Schall, 1996; Schall, 2001), we went beyond qualitative description to test whether movement neuron dynamics can be quantitatively predicted by accumulator model dynamics. We measured how the onset of activity, baseline activity, rate of growth, and threshold vary with behavioral response time in both movement-related neurons and model accumulators, and we found close correspondences for some models.

Not only does this test a hypothesis about the theoretical role of FEF movement-related neurons in perceptual decision-making, it also provides a powerful means of contrasting models that otherwise make indistinguishable behavioral predictions. Our gated accumulator model, which enforces accumulation of discriminative neural signals from visually responsive neurons, not only accounted for the detailed saccade behavior of monkeys, but also predicted quantitatively the dynamics observed in movement-related neurons in FEF, whereas other models could not (Purcell et al., 2010, 2012; see also Boucher et al., 2007a). This gated accumulator model also suggests a potential locus of cognitive control over perceptual decisions. Increasing the gate may account for speed-accuracy tradeoffs (Purcell et al., 2012) as well as stopping behavior and trial history effects described by Boucher et al. (2007a) and Pouget et al. (2011), respectively.

Turning to more general issues, our work has confronted a common challenge in the development of mathematical and computational models of cognition where competing models reach a point where they make very similar predictions, examples of which are discussed in other chapters in this volume (Busemeyer, Wang, Townsend, & Eidels, 2015). This could be a consequence of true mimicry, where models assuming vastly different mechanisms nonetheless produce mathematically identical predictions that cannot be distinguished behaviorally. Often, however, it is that the current corpus of experimental manipulations and measures is insufficient to discriminate between competing models. Cognitive modelers have long turned to predicting additional complexity in behavioral data to resolve mimicry, going from predicting accuracy alone to predicting response probabilities as well as response times, and from predicting mean response times to predicting response time distributions, including those for correct and error responses. Indeed, in our work reviewed here, jointly predicting response probabilities and response time distributions yielded considerable traction in discriminating between competing models. Unfortunately, outside the mathematical psychology community, it is not uncommon to hear researchers state with complete confidence that response time distributions yield no more useful information than response time means, sadly unknowledgeable about the state of reality (e.g., see Townsend, 1990). That said, recognition is emerging, for example, that response time distributions are key aspects of data that theories of visual cognition need to account for (e.g., Palmer, Horowitz, Torralba, & Wolfe, 2011; Wolfe, Palmer, & Horowitz, 2010), that response time distributions provide challenging constraints for low-level spiking neural models (e.g., Lo, Boucher, Paré, Schall, & Wang, 2009), and more generally that considerations of behavioral variability can yield insights into neural processes (e.g., Churchland et al., 2011; Purcell, Heitz, Cohen, & Schall, 2012). But even joint modeling of response probabilities and response-time distributions may be insufficient to contrast competing models.

Our work illustrates how neurophysiological data can also help distinguish between models. We have described cases in which two models fit behavioral data equally well (Boucher et al., 2007a;

Purcell et al., 2010, 2012) but one model is more complex than the other. With only behavioral data and an appeal to parsimony, we would have demanded the exclusion of the more complex model in favor of the simpler one. However, in order to successfully map observed neural dynamics onto predicted model dynamics, the assumptions of the more complex model were required. Key here is that we believe it is important to map between neural dynamics and model dynamics, not between neural dynamics and model parameters (see also, e.g., Davis, Love, & Preston, 2012). Variation in model parameters need not uniquely map onto variation in neural dynamics, but predicted variation in model dynamics must. And while we have demonstrated the theoretical usefulness of neural data in adjudicating between competing models, we do not believe that neural data have any particular empirical primacy. Just as mimicry issues can emerge when examining only behavioral measures like accuracy and response time, analogous mimicry issues may be found at the level of neurophysiology and neural dynamics. Neural data are not necessarily more intrinsically informative than behavioral data, but more data provide additional constraints for distinguishing between competing models.

Box 1 Top-down versus Bottom-up Theoretical Approaches
Computational cognitive neuroscience aims to understand the relationship between brain and behavior using computational and mathematical models of cognition. One approach is bottom up. Theorists begin with fairly detailed mathematical models of neurons based on current understanding of cellular and molecular neurobiology. A common approach is to develop and test a single model of a neural network built up from these detailed models of neurons along with hypotheses about their excitatory and inhibitory connectivity. Although these neural models provide excellent accounts of spiking and receptor dynamics of individual neurons and may also account well for emergent network activity, they may provide only fairly coarse accounts of observed behavior, have somewhat limited generalizability, and be impractical to rigorously simulate and evaluate quantitatively.
Another approach is top down (e.g., Forstmann et al., 2011; Palmeri, 2014). Cognitive models account for details of behavior across multiple conditions, have significant generalizability across tasks and subject populations, and are often relatively easy to simulate and evaluate. It is common to evaluate multiple competing models and to test the necessity and sufficiency of model assumptions with nested model comparison techniques. Although these models do not provide the same level of detailed predictions of spiking and receptor dynamics, they can provide predictions about the temporal dynamics of neural activity at the same level of precision as commonly summarized in neurophysiological investigations, as we illustrated in our review. In fact, Carandini (2012) suggested that bridging between brain and behavior can only be done by considering intermediate-level theories, that the gap between low-level neural models and behavior is simply a "bridge too far." Although he considered linear filtering and divisive normalization as example computations that may be carried out across cortex (Carandini & Heeger, 2012), we consider accumulation of evidence as a similar computation that may be carried out in various brain areas, including FEF. These computations can simultaneously explain behavioral and neural dynamics.

More generally, our work allies with a growing body of research supporting accumulator models of perceptual decision making (e.g., Nosofsky & Palmeri, 1997; Ratcliff & Rouder, 1998; Ratcliff & Smith, 2004; Usher & McClelland, 2001), not just as models that explain behavior but also as models that explain brain activity measured using neurophysiology (e.g., Boucher et al., 2007a; Churchland & Ditterich, 2012; Purcell et al., 2010, 2012; Ratcliff et al., 2003; but see Heitz & Schall, 2012, 2013), EEG (e.g., Philiastides, Ratcliff, & Sajda, 2006), and fMRI (e.g., Turner et al., 2013; van Maanen et al., 2011; White, Mumford, & Poldrack, 2012). The relative simplicity of cognitive models like accumulator models is a virtue in that they are computationally tractable, making them easily applicable across a wide range of phenomena and levels of analysis.

Making explicit links to brain mechanisms does expose complexities. Our focus here has been largely on FEF, but other brain areas have neurons with dynamics that are visually responsive or movement-related, including SC (Hanes & Wurtz, 2001; Paré & Hanes, 2003) and LIP (Gold & Shadlen, 2007; Mazurek et al., 2003; Shadlen & Newsome, 2001). Compared to the relative simplicity of most

neurocognitive modeling of perceptual decision making 335 stochastic accumulator models, there isa network of brain areas involved in evidence accumulations perceptual evidence, turning a perfect integrator of percep- for perceptual decision making (Gold & Shadlen, tual evidence into a leaky integrator of perceptual evidence. 2007; Heekeren et al., 2008; Schall, 2001; movement-related neurons: Neurons in FEF that show 2004). Such mechanisms involving accumulation little or no modulation to the appearance of the target in the visual field but pronounced growth of spike rate of evidence for perceptual decision-making may immediately preceding the production of a saccade. be replicated across different sensory and effector perceptual decision-making: Perceptual decision-making systems in the brain, such as those for visually requires representing the world with respect to current task guided saccades, but there may also be domain- goals and using perceptual evidence to inform the selection general mechanisms as well (e.g., Ho, Brown, of a particular action. & Serences, 2009). Although the dynamics of saccade: A ballistic eye movement of some angle and specific individual neurons within particular brain velocity to a particular location in the visual field. areas mirror the dynamics of accumulators in stochastic accumulator model: A class of computational models, we also know that, within any given models that assume that noisy perceptual evidence is brain area, ensembles of tens of thousands of accumulated over time from a starting point to a threshold, neurons are involved in the generation of any allowing predictions of both response probabilities and distributions of response times. perceptual decision. We need to understand the scaling relations from simple accumulator models stop-signal task: A classic cognitive control paradigm in which a primary go task is occasionally interrupted with a to complex ensembles of thousands of neural stop signal. 
accumulators (Zandbelt, Purcell, Palmeri, Logan, visually responsive neurons: Visually responsive neurons & Schall, 2014) and how to map the relatively few are neurons in FEF that respond to the appearance of an parameters that define simple accumulator models object in their receptive field relative to that object’s salience onto the great number of parameters that define with respect to current task goals but show little or no complex neural dynamics (Umakantha, Purcell, & change in activity prior to the onset of a saccade Palmeri, 2014).

Acknowledgments This work was supported by NIH R01- References EY021833, NSF Temporal Dynamics of Learn- Ashby, F. G. (2000). A stochastic version of general recog- nition theory. Journal of Mathematical Psychology, 44, ing Center SMA-1041755, NIH R01-MH55806, 310–329. NIH R01-EY008890, NIH P30-EY08126, NIH Becker, W., & Jürgens, R. (1979). An analysis of the saccadic P30-HD015052, and by Robin and Richard system by means of double step stimuli. Vision Research, 19, Patton through the E. Bronson Ingram Chair in 976–983. Neuroscience. Address correspondences to Thomas Bichot, N. P., & Schall. J. D. (1999). Effects of similarity and J. Palmeri, Department of Psychology, Vanderbilt history on neural mechanisms of visual selection. Nature University, Nashville TN 37203. Electronic Neuroscience, 2, 549–554. Bichot, N. P., Schall, J. D., & Thompson, K. G. (1996). Visual mail may be addressed to thomas.j.palmeri@ feature selectivity in frontal eye fields induced by experience vanderbilt.edu. in mature macaques. Nature, 381, 697–699. Bichot, N. P., Thompson, K. G., Rao, S. C., & Schall, J. Glossary D. (2001). Reliability of macaque frontal eye field neurons signaling saccade targets during visual search. Journal of drift rate: The mean rate of perceptual evidence accu- Neuroscience, 21, 713–725. mulation in a stochastic accumulator model of perceptual Bissett, P. G., & Logan, G. D. (2011). Balancing cognitive decision-making. demands: Control adjustments in the stop-signal paradigm. frontal eye field: An area of prefrontal cortex that governs Journal of Experimental Psychology: Learning, Memory and whether, where, and when the eyes moves to a new location Cognition, 37, 392–404. in the visual field. Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. gated accumulator: A stochastic accumulator model that (2006). 
The physics of optimal decision making: A formal includes a gate that enforces accumulation of discriminative analysis of models of performance in two-alternative forced- neural signals, a model which quantitatively accounts choice tasks. Psychological Review, 113, 700–765. for both behavioral and neural dynamics of saccadic eye Boucher, L., Palmeri, T. J., Logan, G. D., & Schall, movement. J. D. (2007a). Inhibitory control in mind and brain: leakage: A weighted self-inhibition on the accumulation of An interactive race model of countermanding saccades. Psychological Review, 114, 376–397.

Boucher, L., Stuphorn, V., Logan, G. D., Schall, J. D., & Palmeri, T. J. (2007b). Stopping eye and hand movements: Are the processes independent? Perception & Psychophysics, 69, 785–801.
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Bruce, C. J., & Goldberg, M. E. (1985). Primate frontal eye fields: I. Single neurons discharging before saccades. Journal of Neurophysiology, 53, 603–635.
Bruce, C. J., Goldberg, M. E., Bushnell, M. C., & Stanton, G. B. (1985). Primate frontal eye fields: II. Physiological and anatomical correlates of electrically evoked eye movements. Journal of Neurophysiology, 54, 714–734.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage Publications.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100(3), 432–459.
Busemeyer, J. R., Wang, Z., Townsend, J. T., & Eidels, A. (2015). Mathematical and computational models of cognition. Oxford, UK: Oxford University Press.
Camalier, C. R., Gotler, A., Murthy, A., Thompson, K. G., Logan, G. D., Palmeri, T. J., & Schall, J. D. (2007). Dynamics of saccade target selection: Race model analyses of double step and search step saccade production in human and macaque. Vision Research, 47, 2187–2211.
Carandini, M. (2012). From circuits to behavior: A bridge too far? Nature Neuroscience, 15, 507–509.
Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 51–62.
Churchland, A. K., & Ditterich, J. (2012). New advances in understanding decisions among multiple alternatives. Current Opinion in Neurobiology, 22(6), 920–926.
Churchland, A. K., Kiani, R., Chaudhuri, R., Wang, X. J., Pouget, A., & Shadlen, M. N. (2011). Variance as a signature of neural computations during decision making. Neuron, 69(4), 818–831.
Cisek, P., Puskas, G. A., & El-Murr, S. (2009). Decisions in changing conditions: The urgency-gating model. Journal of Neuroscience, 29(3), 11560–11571.
Cohen, J. Y., Heitz, R. P., Woodman, G. F., & Schall, J. D. (2009). Neural basis of the set-size effect in frontal eye field: Timing of attention during visual search. Journal of Neurophysiology, 101, 1699–1704.
Davis, T., Love, B. C., & Preston, A. R. (2012). Learning the exception to the rule: Model-based fMRI reveals specialized representations for surprising category members. Cerebral Cortex, 22, 260–273.
Dayan, P., & Daw, N. D. (2008). Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4), 429–453.
Ditterich, J. (2006). Stochastic models of decisions about motion direction: Behavior and physiology. Neural Networks, 19, 981–1012.
Ditterich, J. (2010). A comparison between mechanisms of multi-alternative perceptual decision making: Ability to explain human behavior, predictions for neurophysiology, and relationship with decision theory. Frontiers in Neuroscience, 4.
Ferrier, D. (1874). The localization of function in brain. Proceedings of the Royal Society of London, 22, 229–232.
Forstmann, B. U., Wagenmakers, E. J., Eichele, T., Brown, S., & Serences, J. T. (2011). Reciprocal relations between cognitive neuroscience and formal cognitive models: Opposites attract? Trends in Cognitive Sciences, 15(6), 272–279.
Gilchrist, I. D. (2011). Saccades. In S. P. Liversedge, I. P. Gilchrist, & S. Everling (Eds.), Oxford handbook on eye movements (pp. 85–94). Oxford, UK: Oxford University Press.
Glimcher, P. W., & Rustichini, A. (2004). Neuroeconomics: The consilience of brain and decision. Science, 306(5695), 447–452.
Gold, J. I., & Shadlen, M. N. (2002). Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36, 299–308.
Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annual Review of Neuroscience, 30, 535–560.
Goldman-Rakic, P. S., & Porrino, L. J. (1985). The primate mediodorsal (MD) nucleus and its projection to the frontal lobe. Journal of Comparative Neurology, 242, 535–560.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics, 23, 187–202.
Hanes, D. P., Patterson, W. F., II, & Schall, J. D. (1998). Role of frontal eye fields in countermanding saccades: Visual, movement, and fixation activity. Journal of Neurophysiology, 79, 817–834.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.
Hanes, D. P., & Wurtz, R. H. (2001). Interaction of the frontal eye field and superior colliculus for saccade generation. Journal of Neurophysiology, 85(2), 804–815.
Heekeren, H. R., Marrett, S., & Ungerleider, L. G. (2008). The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6), 467–479.
Heitz, R. P., & Schall, J. D. (2012). Neural mechanisms of speed-accuracy tradeoff. Neuron, 76, 616–628.
Heitz, R. P., & Schall, J. D. (2013). Neural chronometry and coherency across speed-accuracy demands reveal lack of homomorphism between computational and neural mechanisms of evidence accumulation. Philosophical Transactions of the Royal Society of London B, 368, 20130071.
Hikosaka, O., & Wurtz, R. H. (1983). Visual and oculomotor functions of monkey substantia nigra pars reticulata: IV. Relation of substantia nigra to superior colliculus. Journal of Neurophysiology, 49, 1285–1301.
Ho, T. C., Brown, S., & Serences, J. T. (2009). Domain general mechanisms of perceptual decision making in human cortex. The Journal of Neuroscience, 29(27), 8675–8687.
Jones, M., & Dzhafarov, E. N. (2014). Unfalsifiability of major modeling schemes for choice reaction time. Psychological Review, 121, 1–32.
Kahneman, D., & Tversky, A. (1984). Choices, values, and frames. American Psychologist, 39(4), 341–350.
Laming, D. R. J. (1968). Information theory of choice-reaction times. New York, NY: Academic.
Lappin, J. S., & Eriksen, C. W. (1966). Use of a delayed signal to stop a visual reaction-time response. Journal of Experimental Psychology, 72, 805–811.
Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: Sage.
Link, S. W. (1992). The wave theory of difference and similarity. Hillsdale, NJ: Erlbaum.
Lo, C.-C., Boucher, L., Paré, M., Schall, J. D., & Wang, X.-J. (2009). Proactive inhibitory control and attractor dynamics in countermanding action: A spiking neural circuit model. Journal of Neuroscience, 29, 9059–9071.
Lo, C.-C., & Wang, X.-J. (2006). Cortico–basal ganglia circuit mechanism for a decision threshold in reaction time tasks. Nature Neuroscience, 9, 956–963.
Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527.
Logan, G. D. (2002). An instance theory of attention and memory. Psychological Review, 109, 376.
Logan, G. D., & Cowan, W. B. (1984). On the ability to inhibit thought and action: A theory of an act of control. Psychological Review, 91, 295–327.
Logan, G. D., & Gordon, R. D. (2001). Executive control of visual attention in dual-task situations. Psychological Review, 108, 393–434.
Logan, G. D., Schall, J. D., & Palmeri, T. J. (2015). Neural models of stopping and going. Manuscript in preparation. To appear in B. Forstmann & E. J. Wagenmakers (Eds.), An introduction to model-based cognitive neuroscience. Springer Neuroscience.
Logan, G. D., Van Zandt, T., Verbruggen, F., & Wagenmakers, E.-J. (2014). On the ability to inhibit thought and action: General and special theories of an act of control. Psychological Review, 121(1), 66–95.
Logan, G. D., Yamaguchi, M., Schall, J. D., & Palmeri, T. J. (in press). Inhibitory control in mind and brain 2.0: A blocked-input model of saccadic countermanding. Psychological Review.
Mack, M. L., & Palmeri, T. J. (2010). Modeling categorization of scenes containing consistent versus inconsistent objects. Journal of Vision, 10(3):11, 1–11.
Mack, M. L., & Palmeri, T. J. (2011). The timing of visual object categorization. Frontiers in Perception Science.
Mazurek, M. E., Roitman, J. D., Ditterich, J., & Shadlen, M. N. (2003). A role for neural integrators in perceptual decision making. Cerebral Cortex, 13, 1257–1269.
Munoz, D. P., & Schall, J. D. (2004). Concurrent, distributed control of saccade initiation in the frontal eye field and superior colliculus. In W. C. Hall & A. Moschovakis (Eds.), The superior colliculus: New approaches for studying sensorimotor integration (pp. 55–82). Boca Raton, FL: CRC Press.
Murthy, A., Ray, S., Shorter, S. M., Schall, J. D., & Thompson, K. G. (2009). Neural control of visual search by frontal eye field: Effects of unexpected target displacement on visual selection and saccade preparation. Journal of Neurophysiology, 101(5), 2485–2506.
Nelson, M. J., Boucher, L., Logan, G. D., Palmeri, T. J., & Schall, J. D. (2010). Impact of nonstationary response time in stopping and stepping saccade tasks. Attention, Perception, & Psychophysics, 72, 1913–1929.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39–57.
Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological Review, 104, 266–299.
Nosofsky, R. M., & Palmeri, T. J. (2015). Exemplar-based random walk model. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), Mathematical and computational models of cognition. Oxford, UK: Oxford University Press.
Palmer, E. M., Horowitz, T. S., Torralba, A., & Wolfe, J. M. (2011). What are the shapes of response time distributions in visual search? Journal of Experimental Psychology: Human Perception and Performance, 37, 58.
Palmeri, T. J. (1997). Exemplar similarity and the development of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 324–354.
Palmeri, T. J. (2014). An exemplar of model-based cognitive neuroscience. Trends in Cognitive Sciences, 18(2), 67–69.
Palmeri, T. J., & Cottrell, G. (2009). Modeling perceptual expertise. In I. Gauthier, M. Tarr, & D. Bub (Eds.), Perceptual expertise: Bridging brain and behavior. Oxford, UK: Oxford University Press.
Palmeri, T. J., & Tarr, M. (2008). Visual object perception and long-term memory. In S. Luck & A. Hollingworth (Eds.), Visual memory (pp. 163–207). Oxford, UK: Oxford University Press.
Palmeri, T. J., Wong, A. C.-N., & Gauthier, I. (2004). Computational approaches to the development of perceptual expertise. Trends in Cognitive Sciences, 8, 378–386.
Paré, M., & Hanes, D. P. (2003). Controlled movement processing: Superior colliculus activity associated with countermanded saccades. Journal of Neuroscience, 23(16), 6480–6489.
Philiastides, M. G., Ratcliff, R., & Sajda, P. (2006). Neural representation of task difficulty and decision making during perceptual categorization: A timing diagram. Journal of Neuroscience, 26(35), 8965–8975.
Pouget, P., Logan, G. D., Palmeri, T. J., Boucher, L., Paré, M., & Schall, J. D. (2011). Neural basis of adaptive response time adjustment during saccade countermanding. Journal of Neuroscience, 31(35), 12604–12612.
Pouget, P., Stepniewska, I., Crowder, E. A., Leslie, M. W., Emeric, E. E., Nelson, M. J., & Schall, J. D. (2009). Visual and motor connectivity and the distribution of calcium-binding proteins in macaque frontal eye field: Implications for saccade target selection. Frontiers in Neuroanatomy, 3, 2.
Purcell, B. A., Heitz, R. P., Cohen, J. Y., & Schall, J. D. (2012). Response variability of frontal eye field neurons modulates with sensory input and saccade preparation but not visual search salience. Journal of Neurophysiology, 108, 2737–2750.
Purcell, B. A., Heitz, R. P., Cohen, J. Y., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2010). Neurally constrained modeling of perceptual decision making. Psychological Review, 117, 1113–1143.
Purcell, B. A., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2012). From salience to saccades: Multiple-alternative gated stochastic accumulator model of visual search. Journal of Neuroscience, 32, 3433–3446.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R. (2013). Parameter variability and distributional assumptions in the diffusion model. Psychological Review, 120, 281–292.
Ratcliff, R., Cherian, A., & Segraves, M. (2003). A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of two-choice decisions. Journal of Neurophysiology, 90, 1392–1407.
Ratcliff, R., Hasegawa, Y. T., Hasegawa, R. P., Smith, P. L., & Segraves, M. A. (2007). Dual diffusion model for single-cell recording data from the superior colliculus in a brightness-discrimination task. Journal of Neurophysiology, 97, 1756–1774.
Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for two-choice decisions. Psychological Science, 9, 347–356.
Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111, 333–367.
Ratcliff, R., & Smith, P. L. (2015). Modeling simple decisions and applications using a diffusion model. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), Mathematical and computational models of cognition. Oxford, UK: Oxford University Press.
Ratcliff, R., & Tuerlinckx, F. (2002). Estimating parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. Psychonomic Bulletin & Review, 9, 438–481.
Rosenbaum, D. A. (2009). Human motor control (2nd ed.). New York, NY: Academic.
Sato, T., Murthy, A., Thompson, K. G., & Schall, J. D. (2001). Search efficiency but not response interference affects visual selection in frontal eye field. Neuron, 30, 583–591.
Sato, T., & Schall, J. D. (2003). Effects of stimulus-response compatibility on neural selection in frontal eye field. Neuron, 38(4), 637–648.
Schall, J. D. (2001). Neural basis of deciding, choosing and acting. Nature Reviews Neuroscience, 2, 33–42.
Schall, J. D. (2004). On building a bridge between brain and behavior. Annual Review of Psychology, 55, 23–50.
Schall, J. D., & Cohen, J. Y. (2011). The neural basis of saccade target selection. In S. P. Liversedge, I. P. Gilchrist, & S. Everling (Eds.), Oxford handbook on eye movements. Oxford, UK: Oxford University Press.
Schall, J. D., Morel, A., King, D., & Bullier, J. (1995). Topography of visual cortex connections with frontal eye field in macaque: Convergence and segregation of processing streams. Journal of Neuroscience, 15, 4464–4487.
Schneider, D. W., & Logan, G. D. (2005). Modeling task switching without switching tasks: A short-term priming account of explicitly cued performance. Journal of Experimental Psychology: General, 134, 343–367.
Schneider, D. W., & Logan, G. D. (2009). Selecting a response in task switching: Testing a model of compound cue retrieval. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 122–136.
Scudder, C. A., Kaneko, C. S., & Fuchs, A. F. (2002). The brainstem burst generator for saccadic eye movements: A modern synthesis. Experimental Brain Research, 142, 439–462.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86(4), 1916–1936.
Smith, P. L. (2010). From Poisson shot noise to the integrated Ornstein-Uhlenbeck process: Neurally principled models of information accumulation in decision-making and response time. Journal of Mathematical Psychology, 54, 266–283.
Smith, P. L., & Ratcliff, R. (2004). Psychology and neurobiology of simple decisions. Trends in Neurosciences, 27, 161–168.
Smith, P. L., & Ratcliff, R. (2009). An integrated theory of attention and decision making in visual signal detection. Psychological Review, 116, 283.
Smith, P. L., & Van Zandt, T. (2000). Time-dependent Poisson counter models of response latency in simple judgment. British Journal of Mathematical & Statistical Psychology, 53, 293–315.
Sparks, D. L. (2002). The brainstem control of saccadic eye movements. Nature Reviews Neuroscience, 3, 952–964.
Stanton, G. B., Bruce, C. J., & Goldberg, M. E. (1995). Topography of projections to posterior cortical areas from the macaque frontal eye fields. Journal of Comparative Neurology, 353, 291–305.
Teller, D. Y. (1984). Linking propositions. Vision Research, 24, 1233–1246.
Thompson, K. G., Biscoe, K. L., & Sato, T. R. (2005). Neuronal basis of covert spatial attention in the frontal eye field. Journal of Neuroscience, 25, 9479–9487.
Thompson, K. G., Hanes, D. P., Bichot, N. P., & Schall, J. D. (1996). Perceptual and motor processing stages identified in the activity of macaque frontal eye field neurons during visual search. Journal of Neurophysiology, 76, 4040–4055.
Townsend, J. T. (1990). The truth and consequences of ordinal differences in statistical distributions: Toward a theory of hierarchical inference. Psychological Bulletin, 108, 551–567.
Turner, B. M., Forstmann, B. U., Wagenmakers, E. J., Brown, S. D., Sederberg, P. B., & Steyvers, M. (2013). A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage, 72, 193–206.
Umakantha, A., Purcell, B. A., & Palmeri, T. J. (2014). Mapping between a spiking neural network model and the diffusion model of perceptual decision making.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108, 550–592.
van Maanen, L., Brown, S. D., Eichele, T., Wagenmakers, E. J., Ho, T., Serences, J., & Forstmann, B. U. (2011). Neural correlates of trial-to-trial fluctuations in response caution. Journal of Neuroscience, 31(48), 17488–17495.
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin & Review, 7, 424–465.
Verbruggen, F., & Logan, G. D. (2008). Response inhibition in the stop-signal paradigm. Trends in Cognitive Sciences, 12, 418–424.
White, C. N., Mumford, J. A., & Poldrack, R. A. (2012). Perceptual criteria in the human brain. Journal of Neuroscience, 32(47), 16716–16724.
Wolfe, J. M., Palmer, E. M., & Horowitz, T. S. (2010). Reaction time distributions constrain models of visual search. Vision Research, 50, 1304–1311.
Wong, K. F., Huk, A. C., Shadlen, M. N., & Wang, X. J. (2007). Neural circuit dynamics underlying accumulation of time-varying evidence during perceptual decision making. Frontiers in Computational Neuroscience, 1, 1–11.
Wong, K. F., & Wang, X. J. (2006). A recurrent network mechanism of time integration in perceptual decisions. Journal of Neuroscience, 26, 1314–1328.
Woodman, G. F., Kang, M. S., Thompson, K., & Schall, J. D. (2008). The effect of visual search efficiency on response preparation: Neurophysiological evidence for discrete flow. Psychological Science, 19, 128–136.
Zandbelt, B. B., Purcell, B. A., Palmeri, T. J., Logan, G. D., & Schall, J. D. (2014). Response times from ensembles of accumulators. Proceedings of the National Academy of Sciences, 111, 2848–2853.
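The accumulator vocabulary defined in the glossary above (drift rate, gate, leakage) can be illustrated with a minimal simulation. The update rule and all parameter values below are illustrative assumptions, not the gated accumulator model as actually fitted in the chapter: input below the gate is blocked, leakage acts as weighted self-inhibition, and crossing a fixed threshold triggers the saccade.

```python
import math
import random

random.seed(7)

def gated_accumulator_rt(drift, gate, leak, threshold=1.0,
                         dt=0.001, noise=0.3, max_t=2.0):
    """Sketch of a gated, leaky stochastic accumulator (hypothetical
    parameters, not the chapter's fitted model). Input below `gate` is
    blocked; `leak` is weighted self-inhibition on accumulated activity.
    Returns response time in seconds, or None if threshold is never reached."""
    a = 0.0
    for step in range(int(max_t / dt)):
        gated_input = max(drift - gate, 0.0)          # gate blocks weak input
        a += (gated_input - leak * a) * dt + noise * math.sqrt(dt) * random.gauss(0.0, 1.0)
        a = max(a, 0.0)                               # firing rates cannot go negative
        if a >= threshold:
            return (step + 1) * dt
    return None

rts = [t for t in (gated_accumulator_rt(3.0, 0.5, 1.0) for _ in range(300))
       if t is not None]
print(f"{len(rts)} responses, mean RT ≈ {sum(rts) / len(rts):.3f} s")
```

Raising the gate or the leak lowers the accumulator's asymptotic activity, slowing and eventually preventing threshold crossing; that trade-off between speed and selectivity is what the gated accumulator framework exploits when fitting both behavioral and neural dynamics.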

CHAPTER 16
Mathematical and Computational Modeling in Clinical Psychology

Richard W. J. Neufeld

Abstract

This chapter begins with an introduction to the basic ideas behind clinical mathematical and computational modeling. In general, models of normal cognitive-behavioral functioning are titrated to accommodate performance deviations accompanying psychopathology; model features remaining intact indicate functions that are spared; those that are perturbed are triaged as signifying functions that are disorder affected. Distinctions and interrelations among forms of modeling in clinical science and assessment are stipulated, with an emphasis on analytical, mathematical modeling. Preliminary conceptual and methodological considerations are presented. Concrete examples illustrate the benefits of modeling as applied to specific disorders. Emphasis in each case is on clinically significant information uniquely yielded by the modeling enterprise. Implications for the functional side of clinical functional neuroimaging are detailed. Challenges to modeling in the domain of clinical science and assessment are described, as are tendered solutions. The chapter ends with a description of continuing challenges and future opportunities.

Key Words: clinical mathematical modeling, clinical cognitive modeling, analytical modeling, psychological assessment, stochastic modeling, clinical Bayesian modeling, clinical quantitative cognition, cognitive deficit, method of titration

Introduction

"The important point for methodology of psychology is that just as in statistics one can have a reasonably precise theory of probable inference, being 'quasi-exact about the inherently inexact,' so psychologists should learn to be sophisticated and rigorous in their metathinking about open concepts at the substantive level. . . . In social and biological science, one should keep in mind that explicit definition of theoretical entities is seldom achieved in terms of initial observational variables of those sciences, but it becomes possible instead by theoretical reduction or fusion" (Meehl, 1978, p. 815).

Mathematical and computational modeling of clinical-psychological phenomena can elucidate clinically significant constructs by translating them into variables of a quantitative system, and lending them meaning according to their operation within that very system (e.g., Braithwaite, 1968). New explanatory properties are availed, as are options for clinical-science measurement, and tools for clinical-assessment technology. This chapter is designed to elaborate on these assets by providing examples where otherwise intractable or hidden clinical information has been educed. Issues of practicality and validity, indigenous to the clinical setting, are examined, as is the potentially unique contribution of clinical modeling to the broader modeling enterprise. Emphasis is on currently prominent domains of application, and exemplary instances within each. Background material for the current developments is available in several sources (e.g., Busemeyer & Diederich, 2010; Neufeld, 1998, 2007a).

We begin by considering an overall epistemic strategy of clinical psychological modeling. Divisions of modeling in the clinical domain are then distinguished. Exemplary implementations are presented, as are certain challenges sui generis to this domain.

Figure 16.1 summarizes the overall epistemic strategy of clinical psychological modeling. In its basic form, quantitative models of normal performance, typically on laboratory tasks, are titrated to accommodate performance deviations occurring with clinical disturbance. The requisite model tweaking, analogous to a reagent of chemical titration, in principle discloses the nature of change to the task-performance system taking place with clinical disturbance. Aspects of the model remaining intact are deemed as aligning with functions spared with the disturbance, and those that have been perturbed are triaged as pointing to functions that have been affected.

Accommodation of performance deviations by the appropriated model infrastructure, in turn, speaks to validity of the latter. Successful accommodation of altered performance among clinical samples becomes a source of construct validity, over and against an appropriated model's failure or strain in doing so. This aspect of model evaluation, an instance of "model-generalization testing" (Busemeyer & Wang, 2000), is one in which performance data from the clinical setting can play an important role.

To illustrate the preceding strategy, consider a typical memory-search task (Sternberg, 1969). Such a task may be appropriated to tap ecologically significant processes: cognitively preparing and transforming (encoding) environmental stimulation into a format facilitating collateral cognitive operations; extracting and manipulating material in short-term or working memory, on which informed responding rests; and preparing and delivering the information-determined response. During each trial of the preceding task, a prememorized set of items (memory set), such as alphanumeric characters, is to be scanned, in order to ascertain "as quickly and accurately as possible" the presence or absence of a subsequently presented item (probe item). The subspan memory-set size varies from trial to trial, with the items within each set size also possibly varying over their presentations, or alternatively remaining constant within each set size (variable- versus fixed-set procedures). Other manipulations may be directed, for instance, to increasing probe-encoding demands (whereby, say, the font of the probe item mismatches, rather than matches, the font of the memory-set items). The principal response property tenably is latency from probe-item onset, accuracy being high, and not compromising the validity of latency-based inferences (e.g., no speed-accuracy trade-off).

Quantitatively informed convergent experimental evidence may point to an elongated probe-encoding process as being responsible for delayed trial-wise response latencies in the clinical group. The encoding process may be represented by a portion of the trial-performance model. This portion may stipulate, for example, constituent encoding operations of the encoding process, there being k in number (to use model-parameter notation consistent with the clinical modeling literature). Such subprocesses may correspond with observable stimulus features, such as curves, lines, and intersections of the alphanumeric probe, extracted in the service of mental template matching to members of the trial's memory set. The intensity of processing applicable to each of the respective k subprocesses (loosely, speed of completion, or number transacted per unit time; e.g., Rouder, Sun, Speckman, Lu, & Zhou, 2003; Townsend & Ashby, 1983) is denoted v.
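The k-versus-v logic can be sketched numerically. Purely for illustration (an assumption of this sketch, not the chapter's actual fitted model), suppose total encoding latency is the sum of k exponential subprocesses, each completed at rate v, so that it is gamma-distributed with mean k/v and variance k/v²:

```python
import random
from statistics import fmean, pvariance

random.seed(1)

def encoding_latencies(k, v, n=50_000):
    """Total encoding time as the sum of k exponential subprocesses,
    each at rate v (an illustrative gamma assumption)."""
    return [random.gammavariate(k, 1.0 / v) for _ in range(n)]

control = encoding_latencies(k=4, v=8.0)      # normal performance
elevated_k = encoding_latencies(k=8, v=8.0)   # more encoding subprocesses
reduced_v = encoding_latencies(k=4, v=4.0)    # diminished processing capacity

# Both alterations double mean latency (k/v), but intertrial variance
# (k/v**2) separates them: elevating k doubles it; halving v quadruples it.
for name, x in [("control", control), ("elevated k", elevated_k),
                ("reduced v", reduced_v)]:
    print(f"{name:10s} mean={fmean(x):.3f}  var={pvariance(x):.4f}")
```

Matching which predicted mean-and-variance signature the clinical data follow is the titration step: convergence under elevated k, but not reduced v, would indict the number of encoding operations while exonerating processing speed.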
[Figure 16.1 diagram: boxes labeled "Mathematical and Computational Psychology," "Clinical Mathematical and Computational Psychology," "Mathematical and Computational Modeling of Experimental-Task Performance," "Expression of Performance Deviation according to Model Titration," and "Model-Generalization Testing."]
Fig. 16.1 Relations between clinical and nonclinical mathematical and computational psychology.

Decomposing clinical-sample performance through the model-based analysis potentially exonerates the parameter v, but indicts k as the source of encoding protraction. Specifically, model-predicted configurations of performance latency, and intertrial variability, converge with the pattern of empirical deviations from control data, upon elevation in k, but not reduction in v. By way of parameter assignment in the modeling literature, strictly speaking, cognitive capacity (in the sense of processing speed) is intact, but efficiency of its deployment has suffered. The preceding example of mapping of formal theory onto empirical data

342 new directions is known in clinical mathematical modeling, and Box 1 Abductive Reasoning in Clinical elsewhere, as “abductive reasoning” (see Box 1). Cognitive Science When it comes to potential “value added” beyond the immediate increment in information, Scientific rigor does not demand that theo- the following come to the fore. Ecological signif- retical explanation for empirical findings be icance is imbued in terms of assembling environ- restricted to a specific account from a set of mental information in the service of establishing those bearing on the study that have been necessary responses, including those germane to singled out before the study takes place. In fact, self-maintenance activities, and meeting environ- the value of certain formal developments to un- mental demands. On this note, basic deficit may derstanding obtained data configurations may ramify to more elaborate operations in which become apparent only after the latter present the affected process plays a key role (e.g., where themselves. It is epistemologically acceptable judgments about complex multidimensional stimuli to explanatorily retrofit extant (formal) theory are built up from encoded constituent dimen- to empirical data (e.g., in the text, changing sions). Deficits in rudimentary processes moreover the clinical sample’s value of k versus v), a may parlay into florid symptom development or method known as abductive reasoning (Haig, maintenance, as where thought-content disorder 2008). Abductive reasoning not only has a (delusions and thematic hallucinations) arise from bona-fide place in science, but it is economical insufficient encoding of cues that normally anchor in its applicability to already-published data the interpretation of other information. Additional (Novotney, 2009) and/or proffered conjectures implications pertain to memory, where heightened about clinically significant phenomena. 
retrieval failure is risked owing to protraction of On the note of rigorous economy, the clin- initial item encoding. Debility parameterization ical scientist can play with model properties to also may inform other domains of measurement, explore the interactions of model-identified or such as neuroimaging. A tenable model may other proposed sources, say, of pathocognition demarcate selected intratrial epochs of cognitive with various clinically significant variables, such tasks when a clinically significant constituent pro- as psychological stress. For example, it has been cess is likely to figure prominently. In this way, conjectured that some depressive disorders can times of measurement interest may complement be understood in terms of highly practiced, brain-regions of interest, for a more informed automatic negative thoughts supplanting oth- navigation of space-time coordinates in functional erwise viable competitors, and also that psycho- neuroimaging. Symptom significance thus may be logical stress enhances ascendancy of the former brokered to imaged neurocircuitry via formally on the depressed individual’s cognitive land- modeled cognitive abnormalities. scape (Hartlage, Alloy, Vasquez & Dykman, 1993). Translating these contentions into terms Modeling Distinctions in established in mainstream quantitative cogni- Psychological Clinical Science tive science, however, discloses that psycho- There are several versions of “modeling” in logical stress instead reduces the ascendancy psychological clinical science. Nonformal models, of well-practiced negative thoughts, at least such as flow-diagrams and other organizational within this highly-defensible assumptive frame- schemata, nevertheless, are ubiquitously labeled work (Neufeld, 1996; Townsend & Neufeld, “models” (cf. McFall, Townsend & Viken, 1995). 2004). 
Our consideration here is restricted to formal mod- The quantitative translation begins with els, where the progression of theoretical statements expressing the dominance of well-practiced (so- is governed by precisely stipulated rules of successive called automatic) negative versus less practiced statement transitions. Most notable, and obviously (so-called effortful) non-negative thought con- dominating throughout the history of science are tent, as higher average completion rates for the mathematical models. Formal languages for the- former. With these rate properties in tow, the ory development other than mathematics include well-practiced and less-practiced thought con- symbolic logic and computer syntax. Within the tent then enter a formally modeled “horse race,” formal modeling enterprise, then is mathemati- where the faster rates for negative-thought cal modeling, computational modeling [computer generation evince higher winning probabilities, simulation, including “connectionist,” “(neural) for all race durations. Note that although these

mathematical and computational 343

Box 1 Continued
derivations result from basic computations in integral calculus, they nevertheless yield precise predictions, and lay bare their associated assumptions.
Differentiating the above horse-race expression of the probability of negative-thought victory, with respect to a parameter conveying the effects of stress on processing capacity, leads to a negative derivative. Formalized in this way, then, the result shows psychological stress actually to handicap the negative-thought contender.
It is conceivable that reduction in the ascendancy of well-practiced negative thoughts, in the face of stressing environmental demands and pressures, in favour of less-practiced but more adaptive cognitive processes, conveys a certain protective function. In all events, this example illustrates the hazards of depending on unaided verbal reasoning in attempting to deal with complex intervariable relations (including stress effects on psychopathology), and exemplifies the disclosure, through available formal modeling, of subtleties that are both plausible and clinically significant—if initially counterintuitive (Staddon, 1984; Townsend, 1984).

network," "cellular automata," and "computational informatics" modeling], and nonlinear dynamical systems modeling ("chaos-theoretic" modeling, in the popular vernacular). There is, of course, an arbitrary aspect to such divisions. Mathematical modeling can recruit computer computations (later), whereas nonlinear dynamical systems modeling entails differential equations, and so on. Like many systems of classification, the present one facilitates exposition, in this case of modeling activity within the domain of psychological clinical science. Points of contact and overlap among these divisions, as well as unique aspects, should become more apparent with the more detailed descriptions that follow.

The respective types of formal modeling potentially provide psychopathology-significant information unique to their specific level of analysis (Marr, 1982). They also may inform each other, and provide across-level-analysis construct validity. For example, manipulation of connectionist-model algorithm parameters may be guided by results from mathematical modeling. Connectionist-modeling results, in turn, may lend construct validity to mathematical model titration (earlier).

Before delineating subsets of formal modeling, a word is in order about so-called statistical modeling (see, e.g., Rodgers, 2010). Statistical modeling, such as structural-equation modeling (including confirmatory factor analysis), hierarchical linear modeling, mixture growth modeling, and taxometric analysis (mixture-model testing for staggered, or quasi-staggered, latent distributions of clinical and nonclinical groups), supplies a platform for data organization, and inferences about its resulting structure. To be sure, parameters such as path weights and factor loadings are estimated using methods shared with formal models, as demarcated here. Contra the present emphasis, however, the format of the proposed model structure (typically one of multivariate covariance), and the computational methods, are transcontent and generic, and do not germinate within the staked-out theoretical-content domain with its problem-specific depth of analysis. In the case of formal modeling, it might be said that measurement models and empirical-testing methods are part and parcel of process models of observed responses and data production (see also Box 2). Extended treatment of formal-model distinctives and assets in clinical science is available in alternate venues (e.g., Neufeld, 2007b; Shanahan, Townsend & Neufeld, in press).

Forms of Modeling in Clinical Science

mathematical modeling
Clinical mathematical modeling is characterized by analytically derived accounts of cognitive-behavioral abnormalities of clinical disorders. In most instances, models are stochastic, meaning they provide for an intrinsic indeterminacy of the modeled phenomenon (not unlike Brownian motion being modeled by a Wiener process, in physics). Doob (1953) has described a stochastic model as a "mathematical abstraction of an empirical process whose development is governed by probabilistic laws" (p. v). Roughly, built into the structure of stochastic models is a summary of nature's perturbation of empirical values from one observation to the next. Predictions, therefore, by and large, are directed to properties of the distributions of observations, such as those of response latencies over cognitive-task trials. Model

344 new directions

expressions of clinical-sample deviations in distribution properties, therefore, come to the fore. Such properties may include summary statistics, such as distribution moments (notably means and intertrial variances), but can also include distribution features as detailed as the distribution's probability density function (density function, for short; proportional to the relative frequency of process completion at a particular time since its commencement; see, e.g., Evans, Hastings & Peacock, 2000; Townsend & Ashby, 1983; Van Zandt, 2000).

Grasping the results of mathematical modeling's analytical developments can be challenging, but it can be aided by computer computations. Where the properties of derivations are not apparent from inspection of their structures themselves, the formulae may be explored numerically. Doing so often invokes three-dimensional response surfaces. The predicted response output expressed by the formula, plotted on the Y axis, is examined as the formula's model parameters are varied on the X and Z axes. In the earlier example, for instance, the probability of finalizing a stimulus-encoding process within a specified time t may be examined as parameters k and v are varied. Note in passing that expression of laws of nature typically is the purview of analytical mathematics (e.g., Newton's law of gravity).

computational modeling (computer simulation)
Computational modeling expresses recurrent interactions (reciprocal influences) among activated, and activating, units that have been combined into a network architecture. The interactions are implemented through computer syntax (e.g., Farrell & Lewandowsky, 2010). Some of its proponents have advanced this form of modeling as a method uniquely addressing the computational capacity of the brain, and as such, have viewed the workings of a connectionist network as a brain metaphor. Accordingly, the network units can stand for neuronal entities (e.g., multineuron modules, or "neurodes"), whose strengths of connection vary over the course of the network's ultimate generation of targeted output values. Variation in network architecture (essentially paths and densities of interneurode connections), and/or connection activation, present themselves as potential expressions of cognitive abnormalities.

Exposition of neuro-connectionist modeling in clinical science has been relatively extensive (e.g., Bianchi, Klein, Caviness, & Cash, 2012; Carter & Neufeld, 2007; Hoffman & McGlashan, 2007; Phillips & Silverstein, 2003; Siegle & Hasselmo, 2002; Stein & Young, 1992). Typically directed to cognitive functioning, the connectionist-network approach recently has been extended to the study of intersymptom relations, providing a unique perspective, for example, on the issue of co-morbidity (Borsboom & Cramer, 2013; Cramer, Waldorp, van der Maas & Borsboom, 2010).

nonlinear dynamical system (chaos-theoretic) modeling
This form of modeling again entails interconnected variables ("coupled system dimensions"), but in clinical science, by and large, these are drastically fewer in number than usually is the case with computational network modeling (hence the common distinction between "high-dimensional networks" and "low-dimensional nonlinear networks"). Continuous interactions of the system variables over time are expressed in terms of differential equations. The latter stipulate the momentary change of each dimension at time t, as determined by the extant values of other system dimensions, potentially including that of the dimension whose momentary change is being specified (e.g., Fukano & Gunji, 2012). The nonlinear properties of the system can arise from the differential equations' cross-product terms, conveying continuous interactions, or from nonlinear functions of individual system dimensions, such as raising extant momentary status to a power other than 1.0.

The status of system variables at time t is available via solution to the set of differential equations. Because of the nonlinearity-endowed complexity, solutions virtually always are carried out numerically, meaning computed cumulative infinitesimal changes are added to starting values for the time interval of interest. System dimensions are endowed with their substantive significance according to the theoretical content being modeled (e.g., dimensions of subjective fear, and physical symptoms, in dynamical modeling of panic disorder; Fukano & Gunji, 2012). A variation on the preceding description comprises the progression of changes in system-variable states over discrete trials (e.g., successive husband-wife interchanges; Gottman, Murray, Swanson, Tyson & Swanson, 2002). Such sequential transitions are implemented through trial-wise difference equations, which now take the place of differential equations.
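The numerical-solution idea just described, where computed cumulative small changes are added to starting values over the time interval of interest, can be sketched in a few lines. The two-dimensional system below is purely illustrative (hypothetical equations and coefficients chosen here for demonstration, not those of Fukano & Gunji, 2012); it includes one cross-product term and one power term as the sources of nonlinearity, integrated by the forward-Euler method:

```python
def simulate_coupled_system(x0, y0, dt=0.001, t_end=5.0,
                            a=1.0, b=0.5, c=0.8, d=0.3):
    """Forward-Euler integration of a hypothetical coupled nonlinear system:
        dx/dt = -a*x + b*x*y    (cross-product term: continuous interaction)
        dy/dt = -c*y + d*x**2   (power other than 1.0: nonlinear self-function)
    Small increments dt are accumulated onto the starting values, as the
    text describes for numerical solution of differential-equation systems."""
    x, y = x0, y0
    steps = int(t_end / dt)
    for _ in range(steps):
        dx = (-a * x + b * x * y) * dt
        dy = (-c * y + d * x * x) * dt
        x, y = x + dx, y + dy
    return x, y

# With these illustrative coefficients the system relaxes toward rest.
print(simulate_coupled_system(1.0, 1.0))
```

The trial-wise difference-equation variant mentioned above simply replaces the dt-scaled increments with whole per-trial updates of the system state.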

Model-predicted trajectories can be tested against trajectories of empirical observations, as obtained, say, through diary methods involving on-line data-gathering sites (e.g., Qualtrics; Bolger, Davis & Rafaeli, 2003). In addition, empirical data can be evaluated for the presence of dynamical signatures ("numerical diagnostics") generated by system-equation-driven, computer-simulated time series. It is possible, moreover, to search empirical data for time-series numerical diagnostics that are general to system complexity of a nonlinear-dynamical-systems nature. This latter endeavour, however, has been criticized when undertaken without a precisely tendered model of the responsible system, ideally buttressed with other forms of modeling (notably mathematical modeling, earlier)—such a more informed undertaking as historically exemplified in physics (Wagenmakers, van der Maas & Farrell, 2012).

A branch of nonlinear dynamical systems theory, 'catastrophe theory,' has been implemented notably in the analysis of addiction relapse. For example, therapeutically significant dynamical aspects of aptitude-treatment intervention procedures (e.g., Dance & Neufeld, 1988) have been identified through a catastrophe-theory-based reanalysis of data from a large-scale multisite treatment-evaluation project (Witkiewitz, van der Maas, Hufford & Marlatt, 2007). Catastrophe theory offers established sets of differential equations ("canonical forms") depicting sudden jumps in nonlinear dynamical system output (e.g., relapse behaviour) occurring in response to gradual changes in input variables (e.g., incremental stress). Nonlinear dynamical systems modeling—whether originating in the content domain, or essentially imported, as in the case of catastrophe theory—often may be the method of choice when it comes to macroscopic, molar clinical phenomena, such as psychotherapeutic interactions (cf. Molenaar, 2010). Extended exposition of nonlinear dynamical systems model development, from a clinical-science perspective, including that of substantively significant patterns of dimension trajectories ("dynamical attractors"), has been presented in Gottman et al. (2002), Levy et al. (2012), and Neufeld (1999).

Model Parameter Estimation in Psychological Clinical Science

A parameter is "an arbitrary constant whose value affects the specific nature but not the properties of a mathematical expression" (Borowski & Borwein, 1989, p. 435). In modeling abnormalities in clinical samples, typically it is the values of model parameters, rather than the model structure (the model's mathematical organization), that are found to shift away from control values.

The clinical significance of the shift depends on the meaning of the parameter(s) involved. Parameters are endowed with substantive significance according to their roles in the formal system in which they are positioned, and their mathematical properties displayed therein. For example, a parameter of a processing model may be deemed to express "task-performance competence." Construct validity for this interpretation may be supported if the statistical moments of modeled performance-trial latencies (e.g., their mean or variance) are sent to the realm of infinity, barring a minimal value for the parameter. If this value is not met, a critical corpus of extremely long, or incomplete, task trials ensues, incurring infinite response-latency moments in the formalized system. From a construct-validity standpoint, this type of effect on system behavior—severe performance impairment, or breakdown—is in keeping with the parameter's ascribed meaning (Neufeld, 2007b; see also Pitt, Kim, Navarro & Myung, 2006).

Added to this source of parameter information—analytical construct validity—is experimental construct validity. Here, estimated parameter values are selectively sensitive to experimental manipulations, diagnostic-group differences, or variation in psychometric measures on which they purportedly bear (e.g., additional constituent operations of a designated cognitive task, other demands being equal, resulting in elevation specifically in the parameter k', earlier).

In clinical science and assessment, values of model parameters to a large extent are estimated from empirical data. Such estimation differs from methods frequently used in the physical sciences, which more often have the luxury of direct measurement of parameter values (e.g., measuring parameters of liquid density and temperature, in modeling sedimentation in a petroleum-extraction tailings pond). Parameter estimation from the very empirical data being modeled—of course, with associated penalization in computations of empirical model fit—however, is only superficially suspect. As eloquently stated by Flanagan (1991):

In physics there are many explanatory constructs, electrons for example, which cannot be measured independently of the observable situations in which they figure explanatorily. What vindicates the

explanatory use of such constructs is the fact that, given everything else we know about nature, electrons best explain the observable processes in a wide array of experimental tests, and lead to successful predictions (p. 380).

How, then, might parameter values be estimated, with an eye to possible constraints imposed in the clinical arena? Multiple methods of parameter estimation in clinical science variously have been used, depending on desired statistical properties (e.g., maximum likelihood, unbiasedness, Bayes; see, e.g., Evans et al., 2000) and data-acquisition constraints. Note that selection of parameter-estimation methods is to be distinguished from methods of selecting from among competing models or competing model variants (for issues of model selection, see especially Wagenmakers & Vandekerckhove, this volume).

moment matching
One method of parameter estimation consists of moment matching. Moments of some stochastic distributions can be algebraically combined to provide a direct estimate. For example, the mean of the Erlang distribution of performance-trial latencies, expressed in terms of its parameter implementation k' and v, earlier, is k'/v (e.g., Evans et al., 2000). The intertrial variance is k'/v². From these terms, an estimate of v is available as the empirical mean divided by the empirical variance, and an estimate of k' is available as the mean, squared, divided by the variance.

maximum likelihood
Maximum-likelihood parameter estimation means initially writing a function expressing the likelihood of obtained data, given the data-generation model. The maximum-likelihood estimate is the value of the model parameter that would make the observed data maximally likely, given that model. Maximum-likelihood estimates can be obtained analytically, by differentiating the written likelihood function with respect to the parameter in question, setting the derivative to 0, and solving (the second derivative being negative). For example, the maximum-likelihood estimate of v in the Erlang distribution, earlier, is

Nk' / Σ_{i=1}^{N} t_i,

where the size of the empirical sample of latency values t_i is N. With multiple parameters, such as v and k', the multiple derivatives are set to 0, followed by solving simultaneously.

As model complexity increases, with respect to the number of parameters and possibly model structure, numerical solutions may be necessary. Solutions now are found by computationally searching for the likelihood-function maximum, while varying constituent parameters iteratively and reiteratively. Such search algorithms are available through R, through the MATLAB Optimization Toolbox, computer-algebra programs such as Waterloo Maple, and elsewhere.

As with other methods that rely exclusively on the immediate data, stability of estimates rests in good part on extensiveness of data acquisition. Extensive data acquisition, say, on a cognitive task, possibly amounting to hundreds or even thousands of trials, obtained over multiple sessions, may be prohibitive when it comes to distressed patients. For these reasons, Bayesian estimates may be preferred (later).

moment fitting and related procedures
Another straightforward method of parameter estimation is that of maximizing the conformity of model-predicted moments (typically means and intertrial standard deviations, considering instability of higher-order empirical moments; Ratcliff, 1979) across performance conditions and groups. An unweighted least-squares solution minimizes the sum of squared deviations of model predictions from the empirical moments. Although analytical solutions are available in principle (deriving minima using differential calculus, similar to deriving maxima in the case of maximum-likelihood solutions), use of a numerical search algorithm most often is the case.

Elaborations on unweighted least-squares solutions include minimization of Pearson χ², where data are response-category frequencies fr_i:

min Σ_{i=1}^{p} (fr_{i,observed} − fr_{i,model-predicted})² / fr_{i,model-predicted},

where there are p categories. Note that frequencies can include those of performance-trial latencies falling within successive time intervals ("bins"). Where the chief data are other than categorical frequencies, as in the case of moments, parameter values may be estimated by constructing and minimizing a pseudo-χ² function. Here, the theoretical and observed frequencies are replaced with the observed and theoretical moments (Townsend, 1984; Townsend & Ashby, 1983, chap. 13). As with the Pearson χ², the squared differences between the

model-predicted and observed values are weighted by the inverse of the model-predicted values.

It is important to take account of logistical constraints when it comes to parameter estimation and other aspects of formal modeling in the clinical setting. In this setting, where available data can be sparse, the use of moments may be necessary for stability of estimation. Although coarse, as compared to other distribution properties, moments encompass the entire distribution of values, and can effectively attenuate estimate-destabilizing noise (cf. Neufeld & Gardner, 1990).

Observe that, along with likelihood maximization, the present procedures are an exercise in function optimization. In the case of least squares and variants thereon, the exercise is one of function minimization. In fact, often an unweighted least-squares solution also is the maximum likelihood. Similarly, where likelihood functions enter or approximate χ² computations (likelihood-ratio G²), the minimum χ² is also maximum likelihood.

Evaluation of the adequacy of parameter estimation amid constraints that are intrinsic to psychological clinical science and assessment ultimately rests with empirical tests of model performance. Performance obviously will suffer with inaccuracy of parameter estimation.

bayesian parameter estimation
Recall that Bayes's theorem states that the probability of an estimated entity A, given entity-related evidence B (posterior probability of A), is the pre-evidence probability of A (its prior probability) times the conditional probability of B, given A (likelihood), divided by the unconditional probability of B (normalizing factor):

Pr(A|B) = Pr(A) Pr(B|A) / Pr(B).   (1)

As applied to parameter estimation, A becomes the candidate parameter value θ, and B becomes data D theoretically produced by the stochastic model in which θ participates. Recognizing θ as continuous, Eq. (1) becomes

g(θ|D) = f(θ) Pr(D|θ) / ∫_{−∞}^{+∞} f(θ) Pr(D|θ) dθ,   (2)

where g and f denote density functions, over and against discrete-value probabilities Pr. The data D may be frequencies of response categories, such as correct versus incorrect item recognition. Note that the data D as well may be continuous, as in the case of measured process latencies. If so, the density functions again replace discrete-value probabilities. For an Erlang distribution, for instance, the density function for a latency datum t_i is

v(vt_i)^{k'−1} e^{−vt_i} / (k'−1)!.

Allowing k' to be fixed, for the present purposes of illustration (e.g., directly measured, or set to 1.0, as with the exponential distribution), and allowing θ to stand for v, then for a sample of N independent values, Pr(D|θ) in Eq. (2) becomes the joint conditional density function of the N t_i values, given v and k',

∏_{i=1}^{N} v(vt_i)^{k'−1} e^{−vt_i} / (k'−1)!.

The posterior density function of θ [e.g., Eq. (2)] can be computed for all candidate values of θ, the tendered estimate then being the mean of this distribution (the statistical property of this estimate is termed "Bayes"; see especially Kruschke & Vanpaemel, this volume). If an individual participant or client is the sole source of D, the Bayes estimate of θ de facto has been individualized accordingly.

Bayesian parameter estimation potentially endows clinical science and practice with demonstrably important advantages. Allowance for individual differences in parameter values can be built into the Bayesian architecture, specifically in terms of the prior distribution of performance-model parameters (Batchelder, 1998). In doing so, the architecture also handles the issue of overdispersion in performance data, meaning greater variability than would occur to fixed parameter values for all participants (Batchelder & Riefer, 2007).

Selecting a prior distribution of θ depends in part on the nature of θ. Included are strong priors, such as those from the Generalized Gamma family (Evans et al., 2000), where 0 ≤ θ; the Beta distribution, where 0 ≤ θ ≤ 1.0; and the normal or Gaussian distribution. Included as well are gentle (neutral) priors, notably the uniform distribution, whose positive height spans the range of possible non-0 values of θ (see also Berger, 1985, for Jeffreys' uninformed, and other, prior distributions). Grounds for prior-distribution candidacy also include "conjugation"

with the performance process model. The practical significance of distributions being conjugate essentially is that the resulting posterior distribution becomes more mathematically tractable, allowing a closed-form solution for its probability density function.

Defensible strong priors (defensible on theoretical grounds, and those of empirical model fit) can add to the arsenal of clinically significant information, in and of themselves. To illustrate, the Bayesian prior distribution of θ values has its own governing parameters (hyperparameters). Hyperparameters, in turn, can be substantively significant within the addressed content domain (e.g., the taskwise competence parameter, described at the beginning of this section; or another expressing psychological-stress effects on processing speed).

Because of the information provided by a strong Bayesian prior, the estimate of θ can be more precise and stable in the face of a smaller data set D than would be necessary when the estimate falls entirely to the data at hand (see, e.g., Batchelder, 1998). This variability-reducing influence on parameter estimation is known as Bayesian shrinkage (e.g., O'Hagan & Forster, 2004). Bayesian shrinkage can be especially valuable in the clinical setting, where it may be unreasonable to expect distressed participants or patients to offer up more than a modest reliable specimen of task performance. Integrating the performance sample with the prior-endowed information is analogous to referring a modest blood sample to the larger body of hematological knowledge in the biomedical assay setting. Diagnostic richness of the proffered specimen is exploited because its composition is subjected to the preexisting body of information that is brought to bear. Other potentially compelling Bayesian-endowed advantages to clinical science and assessment are described in the section, Special Considerations Applying to Mathematical and Computational Modeling in Psychological Clinical Science and Assessment, later. (See also Box 2 for a unique method of assembling posterior density functions of parameter values to ascertain the probability of parameter differences between task-performance conditions.)

Illustrative Examples of Contributions of Mathematical Psychology to Clinical Science and Assessment

Information conveyed by rigorous quantitative models of clinically significant cognitive-behavioral systems is illustrated in the following examples. Focus is on uniquely disclosed aspects of system functioning. Results from generic data-theory empirical analyses, such as group-by-performance-conditions statistical interactions, are elucidated in terms of their modeled performance-process underpinnings. In addition to illuminating results from conventional analyses, quantitative modeling can uncover group differences in otherwise conflated psychopathology-germane functions (e.g., Chechile, 2007; Riefer, Knapp, Batchelder, Bamber & Manifold, 2002). Theoretically unifying seemingly dissociated empirical findings through rigorous dissection of response configurations represents a further contribution of formal modeling to clinical science and assessment (White, Ratcliff, Vasey & McKoon, 2010a).

Moreover, definitive assessment of prominent conjectures on pathocognition is availed through experimental paradigms exquisitely meshing with key conjecture elements (Johnson, Blaha, Houpt & Townsend, 2010). Such paradigms emanate from models addressing fundamentals of cognition, and carry the authority of theorem-proof continuity, and closed-form predictions (Townsend & Nozawa, 1995; see also Townsend & Wenger, 2004a). At the same time, measures in common clinical use have not been left behind (e.g., Fridberg, Queller, Ahn, Kim, Bishara & Busemeyer, 2010; Bishara, Kruschke, Stout, Bechara, McCabe & Busemeyer, 2010; Yechiam, Veinott, Busemeyer, & Stout, 2007). Mathematical modeling effectively has quantified cognitive processes at the root of performance on measures such as the Wisconsin Card Sorting Test, the Iowa Gambling Task, and the Go–No-Go task (taken up under Cognitive Modeling of Routinely Used Measures in Clinical Science and Assessment, later).

Further, formal models of clinical-group cognition can effectively inform findings from clinical neuroimaging. Events of focal interest in "event-related imaging" are neither the within- nor between-trial transitions of physical stimuli embedded in administered cognitive paradigms but, rather, the covert mental processes to which such transitions give rise. Modeled stochastic trajectories of the symptom-significant component processes that transact cognitive performance trials can stipulate intratrial epochs of special neuroimaging interest. The latter can complement brain regions of interest, together facilitating the calibration of space-time measurement coordinates in neuroimaging studies.

mathematical and computational 349

Multinomial Processing Tree Modeling of Memory and Related Processes; Unveiling and Elucidating Deviations Among Clinical Samples

Possibly the most widely used mathematical modeling in clinical psychology is multinomial processing tree modeling (MPTM). Essentially, MPTM models the production of categorical responses, such as recall or recognition of previously studied items, or judgments of items about their earlier source of presentation (e.g., auditory versus visual, thereby potentially bearing on the nature of hallucinations; Batchelder & Riefer, 1990, 1999). Responses in such categories are modeled as having emanated from a sequence of stages, or processing operations. For example, successful recall of a pair of semantically linked items, such as "computer, Internet," entails storage and retrieval of the duo, retrieval itself necessarily being dependent on initial storage. The tree in MPTM consists of branches emanating from nodes; processes branching from nodes proceed from other processes, on which the branching process is conditional (as in the case of retrieving a stored item).

Each process in each successive branch has a probability of successful occurrence. The probability of a response in a specific category (e.g., accurate retrieval of an item pair) having taken place through a specific train of events is the product of the probabilities of those events (e.g., probability of storage times the conditional probability of retrieval, given storage).

Parameters conveying the event probabilities are viewed as "capacities of the associated processes." Criteria for success of constituent processes implicitly are strong, in that execution of a process that took place earlier in the branching is a sufficient precondition for the process being fed; the probability of failure of the subsequent process falls to that process itself. In this way, MPTM isolates functioning of the individual response-producing operations, and ascertains deficits accordingly.

The model structure of MPTM is conceptually tractable, thanks to its straightforward processing tree diagrams. It has, however, a strong analytical foundation, and indeed has spawned innovative methods of parameter estimation (Chechile, 1998), and notably rigorous integration with statistical science (labeled "cognitive psychometrics"; e.g., Batchelder, 1998; Riefer et al., 2002). Computer software advances have accompanied MPTM's analytical developments (see Moshagen, 2010, for current renderings). Exposition of MPTM measurement technology has been substantial (e.g., Batchelder & Riefer, 1990; Chechile, 2004; Riefer & Batchelder, 1988), including that tailored to clinical-science audiences (Batchelder, 1998; Batchelder & Riefer, 2007; Chechile, 2007; Riefer et al., 2002; Smith & Batchelder, 2010).

Issues in clinical cognition potentially form natural connections with the parameters of MPTM. Just as the categories of response addressable with MPTM are considerable, so are the parameterized constructs it accommodates. In addition to essential processes of memory, perception, and learning, estimates are available for the effects of guessing, and for degrees of participants' confidence in responding. Moreover, MPTM has been extended to predictions of response latency (Hu, 2001; Schweickert, 1985; Schweickert & Han, 2012).

Riefer et al. (2002) applied MPTM in two experiments to decipher the nature of recall performance among schizophrenia and brain-damaged alcoholic participants. In each experiment, the clinical group and controls (nonpsychotic patients, and non–organic-brain-syndrome alcoholics) received six trials of presentation and study of semantically related item pairs (earlier), each study period being followed by recall of items in any order. In both experiments, the research design comprised a "correlational experiment" (Maher, 1970). A correlational experiment consists of the diagnostic groups under study performing under multiple conditions of theoretical interest—a prominent layout in psychological clinical science (e.g., Yang, Tadin, Glasser, Hong & Park, 2013).

Initial analyses of variance (ANOVA) were conducted on sheer proportions of items recalled. In each instance, significant main effects of groups and study-recall trials were obtained; a statistically significant trials-by-groups interaction, the test of particular interest, was reported only in the case of the brain-damaged alcoholic participants and their controls (despite liberal degrees of freedom for within-subjects effects). As noted by Riefer et al. (2002), this generic empirical analysis betrayed its own shortcomings for tapping potentially critical group differences in faculties subserving recall performance. Indeed, Riefer et al. simply but forcefully showed how reliance on typical statistical treatments of data from correlational experiments can generate demonstrably misleading inferences about group differences in experimentally addressed processes of clinical and other interest.
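The multiplication of branch probabilities described above can be made concrete with a short sketch. The tree below is a hypothetical two-branch storage-retrieval illustration with an added guessing branch; the parameter names s, r, and g are assumptions for the example, not those of any published model:

```python
def mpt_category_probs(s, r, g):
    """Category probabilities for a hypothetical storage-retrieval tree:
    s = P(pair is stored), r = P(retrieval | stored),
    g = P(correct guess | not stored)."""
    paths = {
        # category -> list of branches; each branch is its chain of link probabilities
        "recalled":     [[s, r], [1 - s, g]],
        "not_recalled": [[s, 1 - r], [1 - s, 1 - g]],
    }
    probs = {}
    for category, branches in paths.items():
        total = 0.0
        for factors in branches:
            p = 1.0
            for f in factors:
                p *= f          # branch probability = product of its links
            total += p          # category probability = sum over its branches
        probs[category] = total
    return probs

probs = mpt_category_probs(s=0.8, r=0.6, g=0.1)
# recalled = 0.8*0.6 + 0.2*0.1 = 0.50, and the categories exhaust the tree
```

Because every path through the tree terminates in exactly one response category, the category probabilities always sum to one, which is a convenient check on any hand-built tree.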

350 new directions

Riefer et al.'s theory-disciplined measures precisely teased apart storage and recall-retrieval processes, but went further in prescribing explicitly theory-driven significance tests on group differences (see also Link, 1982; Link & Day, 1992; Townsend, 1984). Pursuant to the superficially parallel group performance changes across trials, schizophrenia participants failed to match the controls in improvement of storage efficiency specifically over the last 3 trials of the experiment. Moreover, analysis of a model parameter distinguishing the rate of improvement in storage accuracy as set against its first-trial "baseline" revealed greater improvement among the schizophrenia participants during trials 2 and 3, but a decline relative to controls during the last 3 trials. In other words, this aspect of recall-task performance arguably was decidedly spared by the disorder, notably during the initial portions of task engagement. Analysis of the model parameter distinguishing rate of improvement in retrieval, as set against its first-trial baseline, now indicated a significantly slower rate among the schizophrenia participants throughout. The interplay of these component operations evidently was lost in the conventional analysis of proportion of items recalled.

It goes without saying that precisely profiling disorder-spared and affected aspects of functioning, as exemplified here, can inform the navigation of therapeutic intervention strategies. It also can round out the "functional" picture brought to bear on possible functional neuroimaging measurement obtained during recall task performance.

Applying the substantively derived measurement model in the study on alcoholics with organicity yielded potentially important processing specifics buried in the nevertheless now-significant groups-by-trials ANOVA interaction. As in the case of schizophrenia, the brain-damaged alcoholic participants failed to approximate the controls in improvement of storage operations, specifically over the last 3 trials of the experiment. Also, significant deficits in retrieval again were observed throughout. Further, the rate of improvement in retrieval, relative to the trial-1 baseline, more or less stalled over trials. In contrast to retrieval operations, the rate of improvement in storage among the sample of alcoholics with organicity kept up with that of controls—evidently a disorder-spared aspect of task execution.

The mathematically derived performance-assessment MPTM methodology demonstrably evinced substantial informational added value, in terms of clinically significant measurement and explanation. Multiple dissociation in spared and affected elements of performance was observed within diagnostic groups, moreover with further dissociation of these patterns across groups.

Estimates of model parameters in these studies were accompanied by estimated variability therein (group and condition-wise standard deviations). Inferences additionally were strengthened with flanking studies supporting construct validity of parameter interpretation, according to selective sensitivity to parameter-targeted experimental manipulations. In addition, validity of parameter-estimation methodology was attested to through large-scale simulations, which included provision for possible individual differences in parameter values (implemented according to "hierarchical mixture structures").

In like fashion, Chechile (2007) applied MPTM to expose disorder-affected memory processes associated with developmental dyslexia. Three groups were formed according to psychometrically identified poor, average, and above-average reading performance. Presented items consisted of 16 sets of 6 words, some of which were phonologically similar (e.g., blue, zoo), semantically similar (bad, mean), orthographically similar (slap, pals), or dissimilar (e.g., stars, race). For each set of items, 6 pairs of cards formed a 2 × 6 array. The top row consisted of the words to be studied, and the second row was used for testing. A digit-repetition task intervened between study and test phase, controlling for rehearsal of the studied materials. Testing included that of word-item recall or word position, in the top row of the array. For recall trials, a card in the second row was turned over revealing a blank side, and the participant was asked what word was in the position just above. For recognition trials, the face-down side was exposed to reveal a word, with the participant questioned about whether that word was in the corresponding position in the top row. In some instances, the exposed word was in the corresponding position, and in others it was in a different position.

A 6-parameter MPTM model was applied to response data arranged into 10 categories. The parameters included two storage parameters, one directed to identification of a previous presentation, and one reserved for the arguably deeper storage required to detect foils (words in a different position); a recall-retrieval parameter; two guessing parameters; and a response-confidence parameter. As in the case of Riefer et al. (2002), parameterized memory-process deviations proprietary to model implementation were identified.
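The logic of the parameter-recovery simulations mentioned above can be sketched in miniature. The tree below is a storage-retrieval pair-clustering structure of the kind used in this literature (after Batchelder and Riefer); treating idealized expected category frequencies as data and recovering the generating parameters by grid-search maximum likelihood conveys the essential idea, though published validations are far more elaborate:

```python
import math

def category_probs(c, r, u):
    """Pair-clustering MPT (after Batchelder & Riefer): c = P(pair stored as a
    cluster), r = P(cluster retrieved | stored), u = P(retrieving an
    unclustered item)."""
    return [
        c * r,                                  # both items recalled adjacently
        (1 - c) * u * u,                        # both recalled, not adjacent
        (1 - c) * 2 * u * (1 - u),              # exactly one item recalled
        c * (1 - r) + (1 - c) * (1 - u) ** 2,   # neither item recalled
    ]

def fit_by_grid(counts, step=50):
    """Maximize the multinomial log-likelihood over a coarse parameter grid."""
    grid = [i / step for i in range(1, step)]
    best, best_ll = None, -math.inf
    for c in grid:
        for r in grid:
            for u in grid:
                ll = sum(n * math.log(p)
                         for n, p in zip(counts, category_probs(c, r, u)))
                if ll > best_ll:
                    best, best_ll = (c, r, u), ll
    return best

true = (0.6, 0.7, 0.4)
expected = [1000 * p for p in category_probs(*true)]  # idealized "data"
print(fit_by_grid(expected))                          # recovers (0.6, 0.7, 0.4)
```

With noiseless expected frequencies the likelihood peaks exactly at the generating values; a fuller simulation would draw multinomial samples and examine the spread of the recovered estimates, as in the validation work cited in the text.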

Apropos of the present purposes, a highlight among other noteworthy group differences occurred as follows. Compared to above-average readers, performance data of poor readers produced lower values for the previous-presentation storage parameter, but higher values for recall retrieval. This dissociation of process strength and deficit was specific to the orthographically similar items. These inferences, moreover, were supported with a decidedly model-principled method of deriving the probability of two groups differing in their distributions of parameter values (see Box 2).

The reciprocal effects of stronger retrieval and weaker storage on the poor readers' performance with the orthographically similar items evinced a nonsignificant difference from the above-average readers on raw recall scores (p > .10). Without the problem-customized measurement model, not only would group differences in memory functions have gone undetected, but the nature of these differences would have remained hidden. Again, a pattern of strength and weakness, and the particular conditions to which the pattern applied (encountering orthographically similar memory items), were powerfully exposed.

In this report as well, findings were fortified with model-validating collateral studies. Selective sensitivity to parameter-targeting experimental manipulations lent construct validity to parameter interpretation. In addition, the validity of the model structure hosting the respective parameters was supported with preliminary estimation of coherence between model properties and empirical data [Pr(coherence); Chechile, 2004]. Parameter recovery as well was ascertained through simulations for the adopted sample size. Furthermore, parameter estimation in this study employed an innovation developed by Chechile, called "Population Parameter Mapping" (detailed in Chechile, 1998; 2004; 2007).1

unification of disparate findings on threat-sensitivity among anxiety-prone individuals through a common-process model

Random-walk models (Cox & Miller, 1965) are stochastic mathematical models that have a rich history in cognitive theory (e.g., Busemeyer & Townsend, 1993; Link & Heath, 1975; Ratcliff, 1978; see Ratcliff & Smith, this volume). Diffusion modeling (Ratcliff, 1978) presents itself as another form of modeling shown to be of extraordinary value to clinical science. This mathematical method allows the dissection of decisional performance into component processes that act together to generate choice responses and their latencies (Ratcliff, 1978). Application of diffusion modeling has been used to advantage in simplifying explanation, and in unifying observations on sensitivity to threat-valenced stimulation among anxiety-prone individuals (White, Ratcliff, Vasey & McKoon, 2010a; for diffusion-model software developments, see Wagenmakers, van der Maas, Dolan & Grasman, 2008).

Increased engagement of threatening stimulus content (e.g., words such as punishment or accident) among higher anxiety-prone (HA) as compared to lower anxiety-prone (LA) individuals has been demonstrated across multiple paradigms (e.g., the Stroop and dichotic listening tasks; and the dot-probe task, where for HA individuals, detection of the probe is disproportionately increased with its proximity to threatening versus neutral items in a visual array).

Consistency of findings of significant HA-LA group differences, however, by and large has depended on presentation of the threat items in the company of nonthreat items. Such differences break down when items are presented singly. Specifically, the critical HA-LA by threat–nonthreat item interaction (group-item second-order difference, or two-way interaction) has tended to be statistically significant specifically when the two types of stimuli have occurred together. This pattern of findings has generated varyingly complex conjectures about responsible agents. The conjectures have emphasized processing competition between the two types of stimuli, and associated cognitive-control operations. Group differences have been attributed, for example, to differential tagging of the threat items, or to HA participant threat-item disengagement deficit (reviewed in White, et al., 2010a). The difficulty in obtaining significant second-order differences with presentation of singletons has led some investigators to question the importance, or even existence, of heightened threat-stimulus sensitivity as such among HA individuals. Others have developed neuro-connectionist computational models (e.g., Williams & Oaksford, 1992), and models comprising neuro-connectionist computational-analytical amalgams (Frewen, Dozois, Joanisse & Neufeld, 2008), expressly stipulating elevated threat sensitivity among HA individuals.2 If valid, such sensitivity stands to ramify into grosser clinical symptomatology (Neufeld & Broga, 1981).
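The random-walk/diffusion decomposition just introduced can be illustrated with a minimal simulation: noisy evidence accumulates between two absorbing boundaries, and drift rate, boundary separation, and starting point jointly determine choices and latencies. The numerical settings below are arbitrary choices for the sketch, not values from White et al. (2010a):

```python
import math
import random

random.seed(7)

def diffusion_trial(drift, boundary=2.0, start=1.0, dt=0.01, max_steps=100_000):
    """One Euler-discretized diffusion trial: returns (hit_upper, decision_time)."""
    x, t = start, 0.0
    sigma = 1.0                       # within-trial noise standard deviation
    for _ in range(max_steps):
        x += drift * dt + random.gauss(0.0, sigma * math.sqrt(dt))
        t += dt
        if x >= boundary:
            return True, t            # e.g., a "word" response
        if x <= 0.0:
            return False, t           # e.g., a "nonword" response
    return False, t

def accuracy(drift, n=2000):
    """Proportion of trials absorbed at the upper (correct) boundary."""
    return sum(diffusion_trial(drift)[0] for _ in range(n)) / n

high, low = accuracy(0.30), accuracy(0.05)
# a larger drift rate yields a reliably higher proportion of upper-boundary
# responses, holding boundary separation and starting point fixed
```

The point of the decomposition is visible even in this toy version: changing only the drift parameter moves choice proportions, so group differences confined to drift rate can be separated from differences in response caution or bias.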

Greater threat-stimulus sensitivity defensibly exists among HA individuals; but for reasons that are relatively straightforward, such sensitivity may be more apparent in the company of nonthreat items, as follows: The cognitive system brought to bear on the processing task stands to be one of limited capacity (see Houpt & Townsend, 2012; Townsend & Ashby, 1983; Wenger & Townsend, 2000). When items are presented together, a parallel processing structure arguably is in place for both HA and LA participants (Neufeld & McCarty, 1994; Neufeld, et al., 2007). With less prepotent salience of the threat item for the LA participants, more attentional capacity potentially is drawn off by the nonthreat item, attenuating their difference in processing latency between the two items. A larger interitem difference would occur for the HA participants, assuming their greater resistance to the erosion of processing capacity away from the threat item (see White, et al., 2010a, p. 674).

This proposition lends itself to the following simple numerical illustration. We invoke an independent parallel, limited-capacity processing system (IPLC), and exponentially distributed item-completion times (Townsend & Ashby, 1983). Its technical specifics aside, the operation of this system makes for inferences about the present issue that are easy to appreciate. The resources of such a system illustratively are expressed as a value of 10 arbitrary units (essentially, the rate per unit time at which task elements are transacted) for both HA and LA participants. In the case of a solo presentation, a threat item fully engages the system resources of an HA participant, and 90% thereof in the case of an LA participant. The solo presentation of the nonthreat item engages 50% of the system's resources for both participants. By the IPLC model, the second-order difference in mean latency then is (1/10 − 1/9) − 0 = −0.0111 (that is, latency varies inversely as capacity, expressed as a processing-rate parameter).

Moving to the simultaneous-item condition, 80% of system-processing resources hypothetically are retained by the threat item in the case of the HA participant, but are evenly divided in the case of the LA participant. The second-order difference now is (1/8 − 1/5) − (1/2 − 1/5) = −0.375. Statistical power obviously will be greater for an increased statistical effect size accompanying such a larger difference.3

It should be possible, nevertheless, to detect the smaller effect size for the single-item condition, with a refined measure of processing. On that note, the larger second-order difference in raw response times attending the simultaneous-item condition (earlier) itself may be attenuated due to the conflation of processing time with collateral cognitive activities, such as item encoding and response organization and execution. On balance, a measure denuded of such collateral processes may elevate the statistical effect size of the solo-presentation second-order difference at least to that of the paired-presentation raw reaction time.

Such a refined measure of processing speed per se was endowed by the diffusion model as applied by White, et al. (2010a) to a lexical decision task (yes-no about whether presented letters form a word). Teased apart were speed of central decisional activities (diffusion-model drift rate), response style (covert accumulation of evidence pending a decision), bias in favor of stating the presence of an actual word, and encoding (initial preparation and transformation of raw stimulation). Analysis was directed to values for these parameters, as well as to those of raw latency and accuracy.

In three independent studies, analysis of drift rates consistently yielded significant group-by-item-type second-order differences, whereas analysis of raw latency and accuracy rates consistently fell short. The significant second-order difference also was parameter-selective, being restricted to drift-rate values, even when manipulations were conducive to drawing out possible response-propensity differences.4

Here too, findings were buttressed with supporting analyses. Included was construct-validity augmenting selective parameter sensitivity to parameter-directed experimental manipulations. A further asset accompanying use of this model is its demonstrable parametric economy, in the following way: Parameter values have been shown to be uncorrelated, attesting to their conveyance of independent information. Information also is fully salvaged inasmuch as both correct and incorrect response times are analyzed (see also Link, 1982).

Parameter estimates were accompanied by calculations of their variability (standard deviations and ranges), for the current conditions of estimation. Diagnostic efficiency statistics (sensitivity, specificity, and positive and negative predictive power) were used to round out description of group separation on the drift-rate, as well as raw data values, employing optimal cut-off scores for predicted classification. In each instance, the drift-rate parameter decidedly outperformed the latency mean and median, as well as raw accuracy. These results were endorsed according to signal-detection
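The arithmetic of the IPLC illustration above is compact enough to script. Mean latency for an exponential channel is the reciprocal of its processing rate, so the second-order (group-by-item) differences follow directly from the assumed capacity allocations; the allocation figures below repeat those of the illustration:

```python
CAPACITY = 10.0  # arbitrary capacity units, as in the illustration

def latency(share):
    # exponential channel: mean latency = 1 / rate; rate = allocated capacity
    return 1.0 / (CAPACITY * share)

def second_order_difference(ha_alloc, la_alloc):
    """(Threat minus nonthreat latency for HA) minus (the same for LA);
    each alloc is (threat share, nonthreat share) of system capacity."""
    ha = latency(ha_alloc[0]) - latency(ha_alloc[1])
    la = latency(la_alloc[0]) - latency(la_alloc[1])
    return ha - la

# solo presentation: each item processed alone, engaging the stated share
solo = second_order_difference(ha_alloc=(1.0, 0.5), la_alloc=(0.9, 0.5))
# simultaneous presentation: the two items divide the same fixed capacity
simultaneous = second_order_difference(ha_alloc=(0.8, 0.2), la_alloc=(0.5, 0.5))
print(round(solo, 4), round(simultaneous, 4))   # -0.0111 -0.375
```

The two printed values reproduce the illustration's second-order differences, making plain how the shared-capacity condition magnifies the HA-LA contrast by more than an order of magnitude.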

analysis, where the "signal" was the presence of higher anxiety proneness.

Altogether, the previously described developments make a strong case for model-delivered parsimony. Seemingly enigmatic and discordant findings are shown to cohere, as products of a common underlying process.

measurement technology emanating from theoretical first principles: assessing fundamentals of cognition in autism spectrum disorders

As averred by Meehl (1978; see quotation at the outset of this chapter), measurement technology emanating from formal theory of longer-established disciplines has emerged from the formal theory itself [writ large in the currently prominent Higgs-boson-directed Large Hadron Collider; Close (2011); see McFall & Townsend (1998) for a still-current update of Meehl's appraisal of measurement methods in clinical science]. Systems Factorial Technology (SFT; Townsend & Nozawa, 1995; see also Townsend & Wenger, 2004a, and Chapter 4, this volume) comprises such a development in cognitive science, and has been used to notable advantage in clinical cognitive science (Johnson, et al., 2010; Neufeld, et al., 2007; Townsend, Fific, & Neufeld, 2007). Identifiability of fundamentals of cognition has been disclosed by a series of elegant theorem-proof continuities addressed to temporal properties of information processing (see Townsend & Nozawa, 1995 for details; see also Townsend & Altieri, 2012 for recent extensions incorporating the dual response properties of latency and accuracy). The axiomatic statements from which the proofs emanate, moreover, ensure that results are general, when it comes to candidate distributions of processing durations; continuity of underlying population distributions is assumed, but results transcend particular parametric expressions thereof (e.g., exponential, Weibull, etc.; see e.g., Evans, et al., 2000). The distribution-general feature is particularly important because it makes for robustness across various research settings, something especially to be welcomed in the field of clinical science.

Elements of cognitive functioning exposed by SFT include: (a) the architecture, or structure, of the information-processing system; (b) the system's cognitive workload capacity; (c) selected characteristics of system control; and (d) independence versus interdependence of constituent cognitive operations carried out by system components.

Architecture pertains to whether the system is designed to handle task constituents concurrently (in parallel channels) or successively, in a serial fashion (e.g., encoding curves, lines, and intersections of alphanumeric characters, simultaneously or sequentially). Within the parallel division, moreover, alternate versions can be entertained. The channels can function as segregated units, with the products of their processing remaining distinct from one another in task completion (regular parallel architecture). Alternately, the channels can act as tributaries to a common conduit that receives and conveys the sum of their contributions, dispatching the collective toward task finalization (co-active parallel architecture).

Cognitive workload capacity is estimated in SFT through an index related to work and energy in physics (Townsend & Wenger, 2004b). The index registers the potential of the system to undertake cognitive transactions per unit time [analogous to the rate of dispatching Shannon-Weaver (1949) bits of information]. An important aspect of system control entails cessation of processing upon sufficiency for informed responding, over and against extracriterial continuation (operative stopping rules). Furthermore, independence versus interdependence of system components refers to absence versus presence of either mutual facilitation or cross-impedance of system channels devoted to discrete task constituents (e.g., channels handling separate alphanumeric items or possibly item features).

Significantly, SFT mathematically disentangles these key elements of cognition. For example, cognitive-workload capacity is isolated from stopping rules and system architecture. Such elements are conflated in macroscopic speed and/or accuracy, whose relative resistance to increased task load (e.g., added items of processing; or concomitant-secondary versus single-task requirements) typically is taken to indicate system capacity (see Neufeld, et al., 2007). Disproportionate change in such behavioral data may occur, however, for reasons other than limitation in system workload capacity. Uneconomical stopping rules may be at work, such as exhaustive processing (task constituents on all system channels are finalized), when self-terminating processing will suffice (informed responding requires completion of only one, or a subset, of task constituents). It also is possible that healthy participants' seemingly greater workload capacity actually is attributable to a more efficient architecture (e.g., the presence of co-active parallel processing).
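The workload-capacity index can be sketched numerically. For first-terminating ("OR") processing, the capacity coefficient of Townsend and Wenger compares the cumulative hazard of redundant-target responding with the sum of the single-target cumulative hazards, C(t) = H_AB(t) / [H_A(t) + H_B(t)]; values near 1 match the unlimited-capacity independent parallel benchmark. The exponential channels and their rates below are assumptions for the demonstration:

```python
import math
import random

random.seed(1)

def cum_hazard(samples, t):
    # empirical cumulative hazard H(t) = -ln S(t), S = survivor function
    surviving = sum(1 for x in samples if x > t) / len(samples)
    return -math.log(surviving)

n = 5000
rate_a, rate_b = 2.0, 3.0
single_a = [random.expovariate(rate_a) for _ in range(n)]   # channel A alone
single_b = [random.expovariate(rate_b) for _ in range(n)]   # channel B alone
# unlimited-capacity independent parallel channels: redundant-target
# completion time is the faster (minimum) of the two channel times
redundant = [min(random.expovariate(rate_a), random.expovariate(rate_b))
             for _ in range(n)]

t = 0.15
C = cum_hazard(redundant, t) / (cum_hazard(single_a, t) + cum_hazard(single_b, t))
# C(t) near 1: capacity at the benchmark; C < 1 limited; C > 1 super capacity
```

Because nothing in C(t) depends on the exponential form (any continuous completion-time distribution yields the same benchmark logic), the index illustrates the distribution-general character of SFT noted in the text.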

This quantitatively disciplined measurement infrastructure takes on increased significance for clinical cognitive science, when it is realized that certain highly prominent constructs therein align with cognitive elements measured by SFT. Especially noteworthy in the study of schizophrenia, for example, is the construct of cognitive capacity (see, e.g., Neufeld, et al., 2007). In addition, system-control stopping rules impinge on so-called executive function, a construct cutting across the study of multiple disorders. Implicated by cognitive-control stopping rules are cognitive-resource conservation, and the robustness of selective inhibition. It should not go unnoticed that SFT fundamentals of cognition also are at the heart of the "automatic-controlled processing" construct.2 This construct arguably trumps all others in cognitive-clinical-science frequency of usage.

In identifying variants of the cognitive elements enumerated earlier, the stringent mathematical developments of SFT are meshed with relatively straightforward experimental manipulations, illustrated as follows. In the study of processing mechanisms in autism spectrum disorder (ASD), Johnson et al. (2010) instantiated SFT as "double factorial technology" (Townsend & Nozawa, 1995). A designated visual target consisted of a figure of a right-pointing arrow in a visual display. Manipulations included the presence or absence of such a figure. The target figure could be present in the form of constituent items of the visual array being arranged into a pattern forming a right-pointing arrow (global target), the items themselves consisting of right-pointing arrows (local target), or both (double target). This manipulation is incorporated into quantitative indexes discerning the nature of system workload capacity. The specific target implementation appropriated by Johnson et al. is ideally suited to the assessment of processing deviation in ASD, because prominent hypotheses about ASD cognitive performance hold that more detailed (read "local") processing is favored.

An additional mathematical-theory-driven manipulation entails target salience in the double-target condition. The right-pointing item arrangement can be of high or low salience, as can the right-pointing items making up the arrangement, altogether resulting in four factorial combinations. The combinations, in lockstep with SFT's mathematical treatment of associated processing-latency distributions, complement the capacity analysis given earlier by discerning competing system architectures, stopping rules, and in(inter)dependence of processing channels handling the individual targets.

A microanalysis of task-performance latency distributions (errors being homogeneously low for both Johnson et al.'s ASD and control participants) was undertaken via the lens of systems-factorial assessment technology.5 Mathematically authorized signatures of double-target facilitation, over and against single-target facilitation of processing ("redundancy gain"), were in evidence for all ASD and control participants alike. This aspect of processing evidently was spared with the occurrence of ASD.

Contra prominent hypotheses, which were described earlier, all ASD participants displayed a speed advantage for global-target processing over local-target processing. In contrast, 4 of the controls exhibited a local-target advantage or approximate equality of target speed. On balance, the verdict from quantitatively disciplined diagnostics was that this property of performance was directly opposite to that predicted by major hypotheses about ASD cognitive functioning. At minimum, a global-target processing advantage was preserved within this ASD sample.

Less prominent in the literature have been conjectures about cognitive control in ASD. However, exhaustive target processing was detected as potentially operative among 5, and definitively operative for 2, of the sample of 10 ASD participants (one case being inconclusive). In contrast, for a minority of the 11 controls—4 in number—exhaustive processing was either possibly or definitively operative. The analysis, therefore, revealed that postcriterial continuation of target processing (with possible implications for conservation of processing resources, and for the processing apparatus's inhibition mechanism) may be disorder affected. System workload capacity, chronometrically measured in its own right, nevertheless was at least that of controls—an additional component of evidently spared functioning.

Observed violations of selective influence of target-salience manipulations, notably among the control participants, indicated the presence of cross target-processing-channel interactions. The violations impelled the construction of special performance-accommodating theoretical architectures. Certain candidate structures thus were mandated by the present clinical-science samples. The upshot is an example where clinical cognitive science reciprocates to nonclinical cognitive science, in this case by possibly hastening the uncovering

of potentially important structures in human cognition.

cognitive modeling of routinely used measures in clinical science and assessment

Measurement in clinical science and assessment frequently has been aimed at the important cognitive-behavioral domain of decision and choice. Examples include the assembling of physical objects based on a judged organizing principle, executing risky gambles, and withholding versus emitting a response to a presenting cue. These decision-choice scenarios are instantiated in the Wisconsin Card Sorting Test (WCST; Berg, 1948), which targets frontal lobe "executive function"; the Iowa Gambling Task (Bechara, Damasio, Damasio & Anderson, 1994), which is directed to decisions potentially abetted by accompanying affect; and the Go/No-Go Discrimination Task (see, e.g., Hoaken, Shaughnessy & Pihl, 2003), which is thought to engage inhibitory aspects of cognitive control. Deficits in decisional operations are poised to be ecologically consequential, when it comes to social, occupational, and self-maintenance activities [see Neufeld & Broga, 1981, for a quantitative portrayal of "critical" (versus "differential") deficit, a concept recently relabelled "functional deficit"; e.g., Green, Horan & Sugar, 2013].

The Expectancy Valence Learning Model (EVL; Busemeyer & Myung, 1992; Busemeyer & Stout, 2002; Yechiam, Veinott, Busemeyer, & Stout, 2007; see also Bishara et al., 2010, and Fridberg et al., 2010, for related sequential learning models) is a stochastic dynamic model (see Busemeyer & Townsend, 1993) that supplies a formal platform for interpreting performance on such measurement tasks. The model expresses the dynamics of decisional behaviors in terms of the progression of expected values accrued by task alternatives, as governed by the record of outcomes rendered by choice-responses to date. Dynamic changes in alternative-expectations are specified by the model structure, in which are embedded the psychological forces—model parameters—operative in generating selection likelihoods at the level of constituent selection trials. Parameters of the EVL model deliver notable psychometric "added value" when it comes to the interpretation and clinical-assessment utility of task-measure data.

Box 2. A Novel Statistical Test for Model-Parameter Differences

As indicated in the text, mathematical modeling can prescribe its own measures, experimentation, and tests. Sometimes, proposed tests can transcend the specific domain from which they emerge. This is the case for a statistical test for inequality of model properties between groups, devised by Chechile (2007; 2010).

Considerations begin with a "horse race" model of cognitive processes occurring in parallel (Townsend & Ashby, 1983, e.g., p. 249). At any time point t′ since the start of the race, the probability density of the first process of the pair winning is its density function f1(t′) times the probability of the second process remaining incomplete, S2(t′), or f1(t′)S2(t′). Integrating this expression from t′ = 0 to t′ = t gives the probability of the first completion being earlier than the second, as evaluated to time t. Integrating across the entire range of values (t′ = 0 to t′ = ∞) gives the unconditional probability of the first process having a shorter time than the second. Chechile has adapted this reasoning to the construction of a statistical test expressing the probability that a model-parameter value under one condition of cognitive-behavioral performance is less than its value under a comparison condition, Pr(θ1 < θ2).

The method uses Bayesian posterior distributions of θ1 and θ2 (with qualifications described and illustrated empirically in Chechile, 2007). Slicing the range of θ into small segments, the proportion of the distribution of θ1 within a particular segment, times the proportion of the θ2 distribution beyond that segment, gives the probability of θ1 lying within the segment and that of θ2 lying beyond, analogous to f1(t′) and S2(t′). Summing the products over the entire range of θ (if θ1 and θ2 themselves are probabilities, then θ ranges from 0 to 1.0) directly gives the unconditional probability of θ1 being lower than θ2. Such a probability of .95, for instance, corresponds to a 95% level of confidence that θ1 < θ2.
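Box 2's slice-and-sum computation can be sketched directly. Here the two posteriors are taken to be Beta distributions (a convenient assumption for the illustration; Chechile's Population Parameter Mapping yields its own posteriors), and the segment-wise products are accumulated exactly as the box describes:

```python
import math

def beta_pdf(x, a, b):
    # Beta(a, b) density; normalizing constant from the gamma function
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

def prob_less(a1, b1, a2, b2, n_seg=2000):
    """Pr(theta1 < theta2) for independent Beta posteriors, by slicing [0, 1]
    into segments and summing P(theta1 in segment) * P(theta2 beyond it)."""
    w = 1.0 / n_seg
    mids = [(i + 0.5) * w for i in range(n_seg)]
    mass1 = [beta_pdf(m, a1, b1) * w for m in mids]
    mass2 = [beta_pdf(m, a2, b2) * w for m in mids]
    beyond = [0.0] * (n_seg + 1)      # beyond[i] = P(theta2 in segments >= i)
    for i in range(n_seg - 1, -1, -1):
        beyond[i] = beyond[i + 1] + mass2[i]
    # sum of P(theta1 in segment i) * P(theta2 strictly beyond segment i)
    return sum(m * beyond[i + 1] for i, m in enumerate(mass1))

# posteriors centered near .3 (condition 1) and .7 (condition 2)
p = prob_less(3, 7, 7, 3)
```

With these well-separated posteriors, p lands well above .9; halving the segment width changes the answer only at the margin of the discretization error, which is the practical check on the segment count.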

Model parameters tap motivational, learning, and response domains of decision and choice. The first ascertains relative sensitivity to positive versus negative outcomes to selected alternatives (e.g., payoff versus loss to a card-selection gamble). A second performance parameter encapsulates the comparative strength of more versus less recent choice outcomes in influencing current responses. And the third parameter conveys the degree to which responding is governed by accumulated information, as opposed to momentary, ephemeral influences (informational grounding, versus impulsivity). The model, therefore, contextualizes the dynamics of choice responding in terms of rigorously estimated, psychologically meaningful constructs.

The EVL model, and its closely aligned sequential learning derivatives (e.g., Ahn, Busemeyer, Wagenmakers & Stout, 2008), have successfully deciphered sources of choice-selection abnormalities in several clinical contexts. Studied groups have included those with Huntington's and Parkinson's disease (Busemeyer & Stout, 2002), bipolar disorder (Yechiam, Hayden, Bodkins, O'Donnell & Hetrick, 2008), various forms of substance abuse (Bishara et al., 2010; Fridberg et al., 2010; Yechiam et al., 2007), and autism spectrum disorders (Yechiam, Arshavsky, Shamay-Tsoory, Yanov & Aharon, 2010). Attestation to the value of EVL analyses, among other forms, has been that of multiple dissociation of parameterized abnormalities across the studied groups.

For example, among Huntington's disease individuals, the influence of recent outcome experiences in IGT selections has ascended over that of more remote episodes, and responding has been less consistent with extant information as transduced into modeled alternative-outcome expectations (Busemeyer & Stout, 2002; Yechiam, Veinott, Busemeyer & Stout, 2007). Judgments of both stimulant- and alcohol-dependent individuals, on the other hand, have tracked negative feedback on WCST selections to a lesser degree than have those of controls. Although similarly less affected by negative feedback, stimulant-dependent individuals have been more sensitive than alcohol-dependent individuals to positive outcomes attending their selection responses (Bishara et al., 2010).

Through the psychological significance of their constituent parameters, sequential learning models have endowed the routinely used measures, described earlier, with incremental content validity and construct representation (Embretson, 1983). Conventional measures have been understood with regard to their dynamical-process underpinnings. For instance, WCST performance (e.g., perseverative errors, whereby card sorting has not transitioned to a newly imposed organizing principle) has been simulated by modeled sequential learning (Bishara et al., 2010). Errors of commission on the go–no-go task have been associated with elevated attention to reward outcomes, and errors of omission have been associated with greater attention to punishing outcomes (Yechiam et al., 2006).

Model parameters also have lent incremental nomothetic span (Embretson, 1983) to routinely used measures. Specifically, in addition to producing theoretically consistent diagnostic-group correlates (e.g., elevated sensitivity to IGT-selection rewards among cocaine users; Yechiam et al., 2007), parameters have been differentially linked to relevant multi-item psychometric measures (Yechiam et al., 2008). Moreover, model-based measures have displayed diagnostic efficacy incremental to conventional measures. Individual differences in model parameters have added to the prediction of bipolar disorder, over and above that of cognitive-functioning and personality/temperament inventories (Yechiam et al., 2008). In addition to its explanatory value for conventionally measured WCST performance, model parameters comprehensively have captured conventional measures' diagnostic sensitivity to substance abuse, but conventional measures have failed to capture that of model parameters (Bishara et al., 2010).

In addition to the credentials enumerated, performance of the EVL model and its affiliates has survived the crucible of competition against competing model architectures (e.g., Yechiam et al., 2007). Applied versions of EVL and closely related sequential learning models also have been vindicated against internal variations on the ultimately applied versions (e.g., attempts at either reduced or increased parameterization; Bishara et al., 2010; Yechiam et al., 2007).

formal modeling of pathocognition, and functional neuroimaging (the case of stimulus-encoding elongation in schizophrenia)

Mathematical and computational modeling of cognitive psychopathology can provide vital information when it comes to the functional component
of clinical functional neuroimaging (functional magnetic resonance imaging, magnetic resonance spectroscopy, electroencephalography, and magnetoencephalography). Cognitive-process unfolding, as mathematically mandated by a tenable process model, can speak to the why, where, when, and how of neurocircuitry estimation. In so doing, it can provide an authoritative antidote to the vexatious problem of reverse inference, whereby functions whose neurocircuitry is purported as being charted are inferred from the monitored neuro-circuitry itself (e.g., Green et al., 2013; see Poldrack, 2011).

So-called event-related neuroimaging seeks to track sites of neuro-activation, and intersite co-activation, aligned with transitions taking place in experimental cognitive-performance paradigms (e.g., occurrence of a probe item, for ascertainment of its presence among a set of previously memorized items, in a memory-search task; or presentation of a visual array, for ascertainment of target presence, in a visual-search task). Events of actual interest, however, are the covert cognitive operations activated by one's experimental manipulations. Estimating the time course of events that are cognitive per se arguably necessitates a tenable mathematical model of their stochastic dynamics. Stipulation of such temporal properties seems all the more important when a covert process of principal interest defensibly is accompanied by collateral processes within a cognitive-task trial (e.g., encoding the probe item for purposes of memory-set comparison, along with response operations, as noted earlier). Motivation for model construction is fueled further if the designated intratrial process has symptom and ecological significance. The upshot is that mathematical modeling of cognitive abnormality is uniquely poised to supply information as to the why, where, and when of clinical and other neuroimaging. It arguably also speaks to the how of vascular- and electro-neurophysiological signal processing, in terms of selection from among the array of signal-analysis options.

Clinical importance of neurocircuitry measures can be increased through their analytical interlacing with clinical symptomatology. Among individuals with schizophrenia, for example, elongated duration of encoding (cognitively preparing and transforming presenting stimulation into a format that facilitates collateral processes, e.g., those taking place in so-called working memory) can disproportionately jeopardize the intake of contextual cues that are vital to anchoring other input in its objective significance (i.e., "context deficit"; e.g., Dobson & Neufeld, 1982; George & Neufeld, 1985). This combination potentially contributes to schizophrenia thought-content disorder (delusions and thematic hallucinations; Neufeld, 1991; 2007c). Formally prescribed cognitive psychopathology thus can help broker symptom significance to monitored neurocircuitry.

Monitored neurocircuitry likewise can be endowed with increased ecological significance. Intact encoding, for example, may be critical to negotiating environmental stressors and demands, especially when coping is cognition intensive (e.g., where threat minimization is contingent on predictive judgments surrounding coping alternatives; e.g., Neufeld, 2007b; Shanahan & Neufeld, 2010). Impairment of this faculty therefore may be part and parcel of stress susceptibility (Norman & Malla, 1994). Compromised stress resolution in turn may further impair encoding efficiency, and so on (Neufeld et al., 2010).

Turning to the where of neuro-imaging measurement, targeted cognitive functions in principle are used to narrow down neuro-anatomical regions of measurement interest, according to locations on which the functions are thought to supervene. Precision of function specification constrained by formal modeling in principle should reduce ambiguity about what cognitively is being charted neurophysiologically. Explicit quantitative stipulation of functions of interest should sharpen imaging's functionally guided spatial zones of brain exploration.

Formal modeling of cognitive-task performance seems virtually indispensable to carving out intratrial times of neuro-(co)activation measurement interest. Stochastic dynamic models specify probabilistic times of constituent-process occurrence. For instance, encoding a probe item for its comparison to memory-held items would have a certain time trajectory following the probe's presentation. With encoding as the singled-out process, imaging signals identified with a trial's model-estimated epoch, during which the ascendance of encoding was highly probable, would command special attention. A tenable stochastic model of trial performance can convey the stochastic dynamics of a designated symptom- and ecologically significant process, necessary for a reasonable estimate of its corresponding within-trial neuro-circuitry (additional details are available in Neufeld, 2007c, the mathematical and neuro-imaging specifics of which are illustrated in Neufeld et al., 2010).
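The notion of a model-estimated epoch can be made concrete with a toy calculation. Assuming, purely for illustration, that the encoding stage's latency is exponentially distributed (the distribution defined in this chapter's Glossary) with a hypothetical rate ν, a window within which encoding is plausibly still under way follows directly from the distribution's quantiles:

```python
import math

# Hypothetical rate for an exponentially distributed encoding stage
# (illustrative value; not an estimate from the chapter).
nu = 4.0  # completions per second

def p_encoding_incomplete(t):
    """Survivor function: P(encoding not yet finished by time t) = e^(-nu*t)."""
    return math.exp(-nu * t)

# Bound the measurement epoch by the 10th and 90th percentiles of the
# encoding-completion time, using the exponential quantile function.
t_lo = -math.log(1 - 0.10) / nu
t_hi = -math.log(1 - 0.90) / nu

print(f"epoch: {t_lo:.3f}s to {t_hi:.3f}s after probe onset")
print(f"probability encoding still in progress at t_hi: {p_encoding_incomplete(t_hi):.2f}")
```

Imaging signals time-locked to such an interval would, on the model's account, be the ones most informative about encoding-related neuro-(co)activation; with a full trial model the same logic extends to any constituent process.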

A delimited intratrial measurement epoch stipulated by a stochastic-model-specified time trajectory commands considerable temporal resolution of processed imaging signals. Among analysis options, covariance of activation between measurement sites, across time ("seed-voxel, time series covariance"), has been shown empirically and mathematically to have the temporal resolution necessary to monitor neurocircuitry during the designated epoch (Neufeld, 2012; see also Friston, Fletcher, Josephs, Holmes & Rugg, 1998). In this way, formal modeling of cognitive performance speaks to the how of imaging-signal processing by selecting out or creating neuro-imaging-signal analysis options with requisite temporal resolution. Note that success at ferreting out target-process neurocircuitry is supported by the selective occurrence of differential neuro-(co)activation to the modeled presence of the triaged process in the clinical sample and, conversely, its nonoccurrence otherwise (also known as "multiple dissociation").

Model-informed times of measurement interest stand to complement tendered regions of interest in navigating space-time coordinates of neuroimaging measurement. Moreover, symptom- and ecologically significant functions remain in the company of collateral functions involved in task-trial transaction (e.g., item encoding occurs alongside the memory-scanning and response processes, for which it exists). Keeping the assembly of trial-performance operations intact, while temporally throwing into relief the target process, augurs well for preserving the integrity of the target process as it functions in situ. Doing so arguably increases the ecological validity of findings, over and against deconstructing the performance apparatus by experimentally separating out constituent processes, thereby risking distortion of their system-level operation.

Special Considerations Applying to Mathematical and Computational Modeling in Psychological Clinical Science and Assessment

Clinical mathematical and computational modeling is a relatively recent development in the history of modeling activity (see, e.g., Townsend, 2008). It goes without saying that opportunities unlocked by clinical implementations are accompanied by intriguing challenges. These take the form of upholding rigor of application amid exigencies imposed by clinical constraints, along with faithfully engaging major substantive issues found in clinical science and assessment.

Methodological Challenges

One of the first challenges to clinical formal modeling consists of obtaining sufficient data for stability of model-property estimation. This challenge is part and parcel of the field's long-standing concern with ensuring the applicability of inferences to a given individual or, at least, to a homogeneous class of individuals to which the one at hand belongs (Davidson & Costello, 1969; Neufeld, 1977).

Mainline mathematical modeling too has been concerned with this issue. The solution by and large has been to model performance based on a large repertoire of data obtained from a given participant, tested over the course of a substantial number of multitrial sessions. Clinical exigencies (e.g., participant availability or limited endurance), however, may proscribe such extensiveness of data accumulation; the trade-off between participants and trial number may have to be tipped toward more participants and fewer trials.6 The combination of fewer performance trials but more participants requires built-in checks for relative homogeneity of individual data protocols, to ensure that the modeled aggregate is representative of its respective contributors.

Collapsing data across participants within a testing session, then, risks folding systematic performance differences into their average, with a resultant centroid that is unrepresentative of any of its parts (see Estes, 1956). Note in passing that similar problems could attend participant-specific aggregation across experimental sessions, should systematic intersession performance drift or re-configuration occur. Methods have been put forth for detecting systematic heterogeneity (Batchelder & Riefer, 2007; Carter & Neufeld, 1999; Neufeld & McCarty, 1994), or for ascertaining its possibility to be inconsequential to the integrity of modeling results (Riefer, Knapp, Batchelder, Bamber & Manifold, 2002). If heterogeneity is present, methods of disentangling, and separately modeling, the conflated clusters of homogeneous data have been suggested (e.g., Carter, Neufeld, & Benn, 1998).

Systematic individual differences in model properties also can be accommodated by building them into the model architecture itself. Mixture models implement heterogeneity in model operations across individuals and/or performance trials. For example, model parameters may differ across
individual participants within diagnostic groups, the parameters in principle forming their own random distribution. As parameter values now are randomly distributed across individuals (hyperdistribution), a parameter value's probability (discrete-value parameters) or probability density (continuous-value parameters) takes on the role of its Bayesian prior probability (density) for the current population of individuals. Combined with a sample of empirical task performance, on whose distribution the parameter bears (base distribution), an individualized estimate of the parameter is available through its Bayesian posterior parameter distribution (see Bayesian Parameter Estimation, earlier). A mixture-model approach can be especially appealing in cases of overdispersion, as indicated, say, by a relatively large coefficient of variation (standard deviation of individual mean performance values divided by their grand mean; see overdispersion under Bayesian Parameter Estimation, earlier).7

Also availed through mixture models are customized (again, Bayesian posterior) performance distributions. If the latter apply to process latencies, for example, they can be consulted to estimate strategic times of neuro-imaging measurement within subgroups of parametrically homogeneous participants (detailed in Neufeld et al., 2010, Section 7).

Apropos of the limitations indigenous to the clinical enterprise, mixture models potentially moderate demands on participants when it comes to their supplying a task-performance sample. A relatively modest sample tenably is all that is required, because of the stabilizing influence on participant-specific estimates bestowed by the mixing distribution's prior-held information. This obviously is a boon in the case of clinical modeling, considering the possible practical constraints in obtaining a stable performance sample from already distressed individuals.

Mixture models also open up related avenues of empirical model testing. For purposes of assessing model validity, predictions of individuals' performance, mediated by modest performance samples, can be applied to separately obtained individual validation samples. In this way, model verisimilitude can be monitored, including the model's functioning at the level of individual participants (for details exposited with empirical data, see Neufeld et al., 2010). Because of the stabilizing influence of the Bayesian prior distribution of base-distribution properties, this strategy provides a tack to the thorny issue of small-N model testing (see Bayesian shrinkage, under Bayesian Parameter Estimation, earlier).

An additional asset conferred by a mixture-model platform bridges methodological and clinical substantive issues, as follows. Emerging from this model design is a statistical—and cognitive-behavioral—science-principled measurement infrastructure for dynamically monitoring individual treatment response. The method exploits the distinct prior distributions of base-model properties obtained from varyingly symptomatic and healthy groups. Now Bayes' theorem allows the profiling of an individual at hand, with respect to the latter's current probability of belonging to group g, g = 1, 2, . . . , G (G being the number of referent groups), given obtained performance sample {*}: Pr(g|{*}). Such a profile can be updated with the progression of treatment.

Moreover, closely related procedures can be applied to the assessment of treatment regimens. In this case, treatment efficacy is monitored by charting the degree to which the treated sample at large is being edged toward, or possibly away from, healthier functioning (computational specifics are illustrated with empirical data in Neufeld, 2007b, and Neufeld et al., 2010).

Clinical substantive issues

A substantive issue interlaced with measure-theoretic considerations is the so-called differential-deficit, psychometric-artefact problem (Chapman & Chapman, 1973). Abnormalities that are more pronounced than others may have special etiological importance, in part because of associated neuro-physiological substrates. False inferences of differential deficit, however, are risked because the relative amounts by which examined faculties are disorder-affected are conflated with the psychometric properties of the instruments used to measure them. This issue retains much currency in the contemporary clinical-science literature (e.g., Gold & Dickinson, 2013).

Frailties of recommended solutions, consisting of intermeasure calibration toward equality on classical measurement properties (reliability and observed-score variance), have been noted almost since the recommendations' inception (e.g., Neufeld, 1984a; Neufeld & Broga, 1981; with augmenting technical reports, Neufeld & Broga, 1977; Neufeld, 1984b). Note that the arguments countering the original classical-measurement recommendations have been demonstrated to extend from diagnostic-group to
continuous-variable designs (as used, e.g., by Kang & MacDonald, 2010; see Neufeld & Gardner, 1990).

It can be shown that transducing classical psychometric partitioning of variance into formally modeled sources not only lends a model-based substantive interpretation to the partitioned variance, but also renders the psychometric-artefact issue (which continues to plague reliance on off-the-shelf and subquantitative, often clinically rather than contemporary-cognitive-science contrived, measures) essentially obsolete (Neufeld, 2007b; see also McFall & Townsend, 1998; Neufeld, Vollick, Carter, Boksman, & Jette, 2002; Silverstein, 2008). Partitioned sources of variance in task performance now are specified according to a governing formal performance model, which incorporates an analytical stipulation of its disorder-affected and spared features (see Figure 16.1, and related prose, in the Introduction of this chapter). These sources of variance include classical measurement-error variance (within-participant variance); within-group interparticipant variance; and intergroup variance.

Another major substantive issue encountered in clinical science concerns the integration of so-called cold and hot cognition. Cold cognition, in this context, pertains more or less to the mechanisms of information processing, and hot cognition pertains to their informational product, notably with respect to its semantic and affective properties—roughly, how is the thought delivered (cold cognition), and what does the delivery consist of (hot cognition). Deviations in the processing apparatus leading to interpretations of one's environment cum responses that bear directly on symptomatology hold special clinical significance. The synthesis of cold and hot cognition perhaps is epitomized in the work on risk of sexual coercion, and eating and other disorders, by Teresa Treat, Richard Viken, Richard McFall, and their prominent cognitive-scientist collaborators. Examples include Treat et al. (2002), and Treat, Viken, Kruschke & McFall (2010; see also Treat & Viken, 2010).

These investigators have shown how basic perceptual processes (perceptual organization) can ramify to other clinically cogent cognitive-behavioral operations, including those of classification, and memory. This program of research exemplifies integrative, translational, and collaborative psychological science. It has been rigorously eclectic in its recruitment of modeling methodology. Included have been multidimensional scaling; classification paradigms; perceptual-independence—covariation paradigms; and memory paradigms—all work that brooks no compromise on state-of-the-art mathematical and computational developments. Deviations in cognitive mapping, classification, and memory retrieval surrounding presented items bearing on clinically important problems, comprising eating disorders and risk of sexual aggression, have been productively studied.

A further point of contact between formal process modeling and substantive clinical issues involves relations between the former and multi-item psychometric inventories. Empirical associations between model properties and multi-item measures reciprocally contribute to each other's nomothetic-span construct validity. Correlations with process-model properties furthermore lend construct-representation construct validity to psychometric measures. Psychometric measures, in turn, can provide useful estimates of model properties with which they sufficiently correlate, especially if their economy of administration exceeds that of direct individual-level estimates of the model properties (Carter, Neufeld & Benn, 1998; Wallsten, Pleskac & Lejuez, 2005).

Furthermore, to liaise with multi-item psychometric inventories, responding to a test item can be viewed as an exercise in dynamical information processing. As such, modeling option selection via item-response theory can be augmented with modeling item-response time, the latter as the product of a formally portrayed dynamic stochastic process (e.g., Neufeld, 1998; van der Maas, Molenaar, Maris, Kievit & Borsboom, 2011; certain direct parallels between multi-item inventories and stochastic process models have been developed in Neufeld, 2007b).

Note that a vexing contaminant of scores on multi-item inventories is the influence of social desirability (SD), that is, the pervasive inclination to respond to item options in a socially desirable direction. Indeed, a prime mover of contemporary methods of test construction, Douglas N. Jackson, has stated that SD is the g factor of personality and clinical psychometric measures (e.g., Helmes, 2000). Interestingly, it has been shown that formal theory and its measurement models can circumvent the unwanted influence of SD, because of the socially neutral composition of the measures involved (Koritzky & Yechiam, 2010).
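The Bayesian profiling Pr(g|{*}) introduced earlier, under the mixture-model discussion, is computationally a direct application of Bayes' theorem to group-specific likelihood functions. The sketch below assumes, for illustration only, two referent groups whose base distributions of response latency are exponential with different rates; every numerical value is hypothetical:

```python
import math

# Hypothetical referent groups with exponential latency base
# distributions; rates (nu) and prior weights are illustrative only.
groups = {"control": 5.0, "clinical": 2.0}
priors = {"control": 0.5, "clinical": 0.5}

def posterior_group_probs(latencies, groups, priors):
    """Pr(g | {*}) via Bayes' theorem over an obtained performance sample."""
    # Log-likelihood of the sample under each group's base distribution:
    # exponential density nu * exp(-nu * t) for each latency t.
    loglik = {
        g: sum(math.log(nu) - nu * t for t in latencies)
        for g, nu in groups.items()
    }
    unnorm = {g: math.exp(loglik[g]) * priors[g] for g in groups}
    z = sum(unnorm.values())                 # normalizing factor
    return {g: unnorm[g] / z for g in groups}

sample = [0.61, 0.45, 0.74, 0.52]            # hypothetical latencies (seconds)
profile = posterior_group_probs(sample, groups, priors)
```

As treatment proceeds, the profile is simply recomputed on each new performance sample, optionally carrying the current posterior forward as the next sample's prior.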

Conclusion

Mathematical and computational modeling arguably is essential to accelerating progress in clinical science and assessment. Indeed, rather than its being relegated to the realms of an esoteric enterprise or quaint curiosity, a strong case can be made for the central role of quantitative modeling in lifting the cognitive side of clinical cognitive neuroscience out of current quagmires and avoiding future ones.

Clinical formal modeling augurs well for progress in the field, owing to the explicitness of theorizing to which the modeling endeavor is constrained. Returns on research investment are elevated because rigor of expression exposes empirical shortcomings of extant formulations; blind alleys are expressly tagged as such, thanks to the definiteness of derived formulas, so that wasteful retracing is avoided; and needed future directions often are indicated, because of the forced exactness of one's quantitative formulations. Otherwise unavailable or intractable information is released. Moreover, existing data can be exploited by freshly viewing and analyzing them through methods opened up by a valid mathematical, computational infrastructure (Novotney, 2009).

Formal developments also stand to enjoy an elongated half-life of contribution to the discipline. Staying power of the value of rigorous and substantively important achievements has been seen in mathematical modeling generally. For instance, early fundamental work on memory dynamics (e.g., Atkinson, Brelsford & Shiffrin, 1967) recently has been found useful in the study of memory in schizophrenia (Brown et al., 2007). A similar durability stands to occur for clinical mathematical psychology, within the field of clinical science itself, and beyond.

Certain suspected agents of disturbance may resist study through strictly experimental manipulation of independent variables (e.g., non-model-informed correlational experiments; see the earlier sections Multinomial Processing Tree Modeling of Memory and Related Processes, and Unveiling and Elucidating Deviations in Clinical Samples). Such may be the case on ethical grounds, or because the theoretically tendered agent may elude experimental induction (e.g., organismically endogenous stress activation suspected as generating cognition-detracting intrusive associations and/or diminished cognitive-workload capacity). Tenability of such conjectures nevertheless may be tested by examining improvement in model fit to data from afflicted individuals, through quantitatively inserting the otherwise inaccessible constructs into the model composition.

Current challenges can be expected to spawn vigorous research activity in several future directions. Among others, included are: (a) capitalizing on modeling opportunities in multiple areas of clinical science poised for modeling applications (e.g., stress, anxiety disorders, and depression); (b) surmounting constraints intrinsic to the clinical setting that pose special barriers to the valid extraction of modeling-based information (e.g., estimating and providing for potentially confounding effects of medication on cognitive-behavioral task performance); (c) developing sound methods for bridging mathematical and computational clinical science to clinical assessment and intervention; and (d) creating methods of tapping model-disciplined information from currently available published data, in the service of model-informed, substantively meaningful meta-analyses.

It may be said that evidence-based practice is best practice only if it is based on best evidence. Best evidence goes hand in hand with maximizing methodological options, which compels the candidacy of those that stem from decidedly quantitative theorizing.

Acknowledgments

Manuscript preparation, and the author's own work reported in the manuscript, were supported by grants from the Social Sciences and Humanities Research Council of Canada (author, Principal Investigator), the Medical Research Council of Canada (author, Principal Investigator), and the Canadian Institutes of Health Research (author, co-investigator). I thank Mathew Shanahan for helpful comments on an earlier version, and Lorrie Lefebvre for her valuable assistance with manuscript preparation.

Notes

1. See also Chechile (2010) for the introduction of a novel procedure for individualizing MPTM parameter values, which is especially pertinent to clinical assessment technology. See also Smith & Batchelder (2010) on computational methods for dealing with the clinical-science issue of group-data variability, owing to individual differences in parameter values (the problem of "overdispersion").

2. Still others (Ouimet, Gawronski & Dozois, 2009) have tendered extensive verbally conveyed schemata and flow diagrams [cf. McFall, Townsend & Viken (1995) for a poignant demarcation of models qua models] entailing dual-system cognition (extrapolation of "automatic" and "controlled" processing properties; Schneider & Shiffrin, 1977; Shiffrin & Schneider,
362 new directions 1977), and threat-stimulus disengagement deficit [possibly suggesting serial item processing; but cf. Neufeld & McCarty Glossary (1994); Neufeld, Townsend & Jette (2007); White, et al., 2010a, analytical construct validity: Support for the interpre- p. 674]. tation of a model parameter that expresses an aspect of 3. An independent parallel, limited-capacity processing sys- individual differences in cognitive performance, according tem, with exponentially distributed processing latencies, po- to the parameter’s effects on model predictions. tentially characterizing the processing architectures of both HA base distribution: Distribution of a cognitive-behavioral and LA individuals at least during early processing (Neufeld performance variable (e.g., speed or accuracy) specified by a & McCarty, 1994; Neufeld, Townsend & Jette, 2007): (a) task-performance model. parsimoniously can be shown to cohere with reaction-time construct-representation construct validity: Support for findings on HA threat bias, including those taken to indicate the interpretation of a measure in terms of the degree threat stimulus disengagement deficit; and, (b) also can be shown to which it comprises mechanisms that are theoretically to cohere with collateral oculomotor data (e.g., Mogg, Millar & meaningful in their own right. Bradley, 2000; Neufeld, Mather, Merskey & Russell, 1995; see exponential distribution: A stochastic distribution of also, Townsend & Ashby, 1983, chapter 5). process latencies whose probability density function is −ν 4. Note that encoding and response processes typically have νe t ,wheret is time (arbitrary units), and v is a been labeled “base processes,” and often have been relegated to parameter; latencies are loaded to the left (i.e., toward t =0, “wastebasket” status. They may embody important differences, their maximum probability density being at t =0),andthe however, between clinical samples and controls. 
…By examining "boundary separation," a response-criterion parameter of the Ratcliff diffusion model, for example, White, Ratcliff, Vasey and McKoon (2010b) have discovered that HA, but not LA, participants increase their level of caution in responding after making an error (an otherwise undetected "alarm reaction"). See also Neufeld, Boksman, Vollick, George and Carter (2010) for analyses of potentially symptom-significant elongation of stimulus encoding—a process subserving collateral cognitive operations—among individuals with a diagnosis of schizophrenia.

5. Extensions to mathematically prescribed signatures of variants on the cognitive elements described earlier recently have included statistical significance tests and charting of statistical properties (e.g., "unbiasedness" and "statistical consistency" of signature estimates; Houpt & Townsend, 2010, 2012) and measures of theoretic and methodological unification of concepts in the literature related to cognitive-workload capacity (known as "Grice" and "Miller inequalities"), all under qualitative properties of SFT's quantitative diagnostics (Townsend & Eidels, 2011). Bayesian extensions also currently are in progress. A description of software for SFT, including a tutorial on its use, has been presented by Houpt, Blaha, McIntire, Havig and Townsend (2014).

6. The challenge of requisite data magnitude may be overblown. Meaningful results at the individual level of analysis, for example, can be obtained with as few as 160 trials per experimental condition, if the method of analysis is mandated by mathematical theory (see Johnson et al., 2010). Furthermore, clinicians who may balk at the apparent demands of repeated-trial cognitive paradigms nevertheless may have no hesitation in asking patients to respond to 567 separate items of a multi-item inventory.

7. Of late, cognitive science has witnessed a burgeoning interest in mixture models, dubbed "hierarchical Bayesian analysis" (e.g., Lee, 2011). The use of such models in cognitive science nevertheless can be traced back at least two and one-half decades (e.g., Morrison, 1979; Schweickert, 1982). Mixture models of various structures (infinite and finite, continuous and discrete), along with their Bayesian assets, moreover have enjoyed a certain history of addressing prominent issues in clinical cognitive science (e.g., Batchelder, 1998; Carter et al., 1998; Neufeld, 2007b; Neufeld, Vollick, Carter, Boksman, & Jetté, 2002; Neufeld & Williamson, 1996), some of which are stated in the text.

Glossary (continued)

…distribution has a long tail.

hyperdistribution: A stochastic distribution governing the probabilities or probability densities of properties (e.g., parameters) of a base distribution (see section Bayesian Parameter Estimation in the text).

likelihood function: A mathematical expression conveying the probability or probability density of a set of observed data, as a function of the cognitive-behavioral task-performance model (see section Bayesian Parameter Estimation in the text).

mixing distribution: See hyperdistribution.

mixture model: A model of task performance, expressed in terms of a base distribution of response values, whose base-distribution properties (e.g., parameters) are randomly distributed according to a mixing distribution (see section Bayesian Parameter Estimation in the text).

multiple dissociation: Cognitive-behavioral performance patterns or estimated neuro-circuitry occurring for a clinical group under selected experimental conditions, with contrasting patterns occurring under other conditions. The profile obtained for the clinical group is absent or opposite for control or other clinical groups.

nomothetic-span construct validity: Support for the interpretation of a measure according to its network of associations with variables to which it is theoretically related.

normalizing factor: Overall probability or probability density of data or evidence, all model-prescribed conditions (e.g., parameter values) considered (see section Bayesian Parameter Estimation in the text).

overdispersion: Greater variability in task-performance data than would be expected if a fixed set of model parameters were operative across all participants.

prior distribution: In the Bayesian framework, the probability distribution before data are collected (see hyperdistribution).

posterior distribution: In the Bayesian framework, the distribution corresponding to the prior distribution after data are collected. Bayes' theorem is used to update the prior distribution, following data acquisition. The key agent of this updating is the likelihood function (see section Bayesian Parameter Estimation in the text).
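Several of the glossary terms above (prior distribution, likelihood function, normalizing factor, posterior distribution, and overdispersion under a mixing distribution) can be illustrated with a brief numerical sketch. The following is a minimal illustration, not taken from the chapter: the flat prior, the 12-of-20 trial counts, and the Beta(6, 4) mixing distribution are hypothetical choices made only for the example.

```python
import numpy as np

# Grid of candidate values for a task-performance parameter theta.
theta = np.linspace(0.01, 0.99, 99)

# Prior distribution: beliefs about theta before data are collected
# (a flat prior over the grid; an illustrative assumption).
prior = np.ones_like(theta) / theta.size

# Hypothetical data: 12 correct responses in 20 trials.
n_trials, n_correct = 20, 12

# Likelihood function: probability of the observed data as a function of
# theta (binomial coefficient omitted; it cancels in the normalization).
likelihood = theta**n_correct * (1 - theta)**(n_trials - n_correct)

# Normalizing factor: overall probability of the data, all candidate
# parameter values considered.
normalizer = np.sum(likelihood * prior)

# Posterior distribution: the prior updated by the likelihood (Bayes' theorem).
posterior = likelihood * prior / normalizer

print(theta[np.argmax(posterior)])  # posterior mode, near 12/20 = 0.6

# Overdispersion: if theta itself varies across participants according to a
# mixing distribution, response counts vary more than a single fixed theta
# predicts.
rng = np.random.default_rng(1)
fixed = rng.binomial(n_trials, 0.6, size=10_000)
mixed = rng.binomial(n_trials, rng.beta(6, 4, size=10_000))  # E[theta] = 0.6
print(fixed.var() < mixed.var())  # mixing inflates variance
```

With the flat prior, the posterior mode coincides with the maximum-likelihood value; a more informative prior would pull the posterior toward its own mode.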

mathematical and computational 363

References

Ahn, W. Y., Busemeyer, J. R., Wagenmakers, E. J., & Stout, J. C. (2008). Comparison of decision learning models using the generalization criterion method. Cognitive Science, 32, 1376–1402.
Atkinson, R. C., Brelsford, J. W., & Shiffrin, R. M. (1967). Multiprocess models for memory with applications to a continuous presentation task. Journal of Mathematical Psychology, 4, 277–300.
Batchelder, W. H. (1998). Multinomial processing tree models and psychological assessment. Psychological Assessment, 10, 331–344.
Batchelder, W. H., & Riefer, D. M. (1990). Multinomial processing models of source monitoring. Psychological Review, 97, 548–564.
Batchelder, W. H., & Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modelling. Psychonomic Bulletin & Review, 6, 57–86.
Batchelder, W. H., & Riefer, D. M. (2007). Using multinomial processing tree models to measure cognitive deficits in clinical populations. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 19–50). Washington, DC: American Psychological Association.
Bechara, A., Damasio, A. R., Damasio, H., & Anderson, S. W. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50, 7–15.
Berg, E. A. (1948). A simple objective technique for measuring flexibility in thinking. Journal of General Psychology, 39, 15–22.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York, NY: Springer.
Bianchi, M. T., Klein, J. P., Caviness, V. S., & Cash, S. S. (2012). Synchronizing bench and bedside: A clinical overview of networks and oscillations. In M. T. Bianchi, V. S. Caviness, & S. S. Cash (Eds.), Network approaches to diseases of the brain: Clinical applications in neurology and psychiatry (pp. 3–12). Oak Park, IL: Bentham Science Publishers.
Bishara, A., Kruschke, J., Stout, J., Bechara, A., McCabe, D., & Busemeyer, J. (2010). Sequential learning models for the Wisconsin card sort task: Assessing processes in substance dependent individuals. Journal of Mathematical Psychology, 54, 5–13.
Bolger, N., Davis, A., & Rafaeli, E. (2003). Diary methods: Capturing life as it is lived. Annual Review of Psychology, 54, 579–616.
Borowski, E. J., & Borwein, J. M. (1989). The Harper Collins dictionary of mathematics (2nd ed.). New York, NY: Harper Collins.
Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91–121.
Braithwaite, R. B. (1968). Scientific explanation. London, England: Cambridge University Press.
Brown, G. G., Lohr, J., Notestine, R., Turner, T., Gamst, A., & Eyler, L. T. (2007). Performance of schizophrenia and bipolar patients on verbal and figural working memory tasks. Journal of Abnormal Psychology, 116, 741–753.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage.
Busemeyer, J. R., & Myung, I. J. (1992). An adaptive approach to human decision making: Learning theory, decision theory, and human performance. Journal of Experimental Psychology: General, 121, 177–194.
Busemeyer, J. R., & Stout, J. C. (2002). A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task. Psychological Assessment, 14, 253–262.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100, 432–459.
Busemeyer, J. R., & Wang, Y. (2000). Model comparisons and model selections based on generalization test methodology. Journal of Mathematical Psychology, 44(1), 171–189.
Carter, J. R., & Neufeld, R. W. J. (2007). Cognitive processing of facial affect: Neuro-connectionist modeling of deviations in schizophrenia. Journal of Abnormal Psychology, 116, 290–305.
Carter, J. R., Neufeld, R. W. J., & Benn, K. D. (1998). Application of process models in assessment psychology: Potential assets and challenges. Psychological Assessment, 10, 379–395.
Chapman, L. J., & Chapman, J. P. (1973). Problems in the measurement of cognitive deficit. Psychological Bulletin, 79, 380–385.
Chechile, R. A. (1998). A new method for estimating model parameters for multinomial data. Journal of Mathematical Psychology, 42, 432–471.
Chechile, R. A. (2004). New multinomial models for the Chechile–Meyer task. Journal of Mathematical Psychology, 48, 364–384.
Chechile, R. A. (2007). A model-based storage-retrieval analysis of developmental dyslexia. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 51–79). Washington, DC: American Psychological Association.
Chechile, R. A. (2010). Modeling storage and retrieval processes with clinical populations with applications examining alcohol-induced amnesia and Korsakoff amnesia. Journal of Mathematical Psychology, 54, 150–166.
Close, F. (2011). The infinity puzzle: How the hunt to understand the universe led to extraordinary science, high politics, and the large hadron collider. Toronto, Canada: A. Knopf.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes. London, England: Methuen.
Cramer, A. O. J., Waldorp, L. J., van der Maas, H., & Borsboom, D. (2010). Comorbidity: A network perspective. Behavioral and Brain Sciences, 33, 137–193.
Dance, K., & Neufeld, R. W. J. (1988). Aptitude-treatment interaction in the clinical setting: An attempt to dispel the patient-uniformity myth. Psychological Bulletin, 104, 192–213.
Davidson, P. O., & Costello, G. G. (1969). N=1: Experimental studies of single cases. New York, NY: Van Nostrand Reinhold.
Dobson, D., & Neufeld, R. W. J. (1982). Paranoid-nonparanoid schizophrenic distinctions in the implementation of external conceptual constraints. Journal of Nervous and Mental Disease, 170, 614–621.
Doob, J. L. (1953). Stochastic processes. New York, NY: Wiley.
Embretson, W. S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Estes, W. K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53, 134–140.
Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions (3rd ed.). New York, NY: Wiley.
Farrell, S., & Lewandowsky, S. (2010). Computational models as aids to better reasoning in psychology. Current Directions in Psychological Science, 19, 329–335.
Flanagan, O. (1991). Science of the mind (2nd ed.). Cambridge, MA: MIT Press.
Frewen, P. A., Dozois, D., Joanisse, M., & Neufeld, R. W. J. (2008). Selective attention to threat versus reward: Meta-analysis and neural-network modeling of the dot-probe task. Clinical Psychology Review, 28, 308–338.
Fridberg, D. J., Queller, S., Ahn, W.-Y., Kim, W., Bishara, A. J., Busemeyer, J. R., Porrino, L., & Stout, J. C. (2010). Cognitive mechanisms underlying risky decision-making in chronic cannabis users. Journal of Mathematical Psychology, 54, 28–38.
Friston, K. J., Fletcher, P., Josephs, O., Holmes, A., & Rugg, M. D. (1998). Event-related fMRI: Characterizing differential responses. Neuroimage, 7, 30–40.
Fukano, T., & Gunji, Y. P. (2012). Mathematical models of panic disorder. Nonlinear Dynamics, Psychology and Life Sciences, 16, 457–470.
George, L., & Neufeld, R. W. J. (1985). Cognition and symptomatology in schizophrenia. Schizophrenia Bulletin, 11, 264–285.
Gold, J. M., & Dickinson, D. (2013). "Generalized cognitive deficit" in schizophrenia: Overused or underappreciated? Schizophrenia Bulletin, 39, 263–265.
Gottman, J. M., Murray, J. D., Swanson, C. C., Tyson, R., & Swanson, K. R. (2002). The mathematics of marriage: Dynamic nonlinear models. Cambridge, MA: MIT Press.
Green, M. F., Horan, W. P., & Sugar, C. A. (2013). Has the generalized deficit become the generalized criticism? Schizophrenia Bulletin, 39, 257–262.
Haig, B. D. (2008). Scientific method, abduction, and clinical reasoning. Journal of Clinical Psychology, 64, 1013–1127.
Hartlage, S., Alloy, L. B., Vázquez, C., & Dykman, B. (1993). Automatic and effortful processing in depression. Psychological Bulletin, 113(2), 247–278.
Helmes, E. (2000). The role of social desirability in the assessment of personality constructs. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment: Honoring Douglas N. Jackson at seventy. Norwell, MA: Kluwer.
Hoaken, P. N. S., Shaughnessy, V. K., & Pihl, R. O. (2003). Executive cognitive functioning and aggression: Is it an issue of impulsivity? Aggressive Behavior, 29, 15–30.
Hoffman, R. E., & McGlashan, T. H. (2007). Using a speech perception neural network simulation to study normal neurodevelopment and auditory hallucinations in schizophrenia. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 239–262). Washington, DC: American Psychological Association.
Houpt, J. W., & Townsend, J. T. (2010). The statistical properties of the survivor interaction contrast. Journal of Mathematical Psychology, 54, 446–453.
Houpt, J. W., & Townsend, J. T. (2012). Statistical measures for workload capacity analysis. Journal of Mathematical Psychology, 56, 341–355.
Houpt, J. W., Blaha, L. M., McIntire, J. P., Havig, P. R., & Townsend, J. T. (2014). Systems factorial technology with R. Behavior Research Methods, Instrumentation and Computing, 46, 307–330. (Available online from http://link.springer.com/article/10.3758%2Fs13428-013-0377-3.)
Hu, X. (2001). Extending general processing tree models to analyze reaction time experiments. Journal of Mathematical Psychology, 45, 603–634.
Johnson, S. A., Blaha, L. M., Houpt, J. W., & Townsend, J. T. (2010). Systems factorial technology provides new insights on global-local information processing in autism spectrum disorders. Journal of Mathematical Psychology, 54, 53–72.
Kang, S. S., & MacDonald, A. W., III (2010). Limitations of true score variance to measure discriminating power: Psychometric simulation study. Journal of Abnormal Psychology, 119, 300–306.
Koritzky, G., & Yechiam, E. (2010). On the robustness of description and experience based decision tasks to social desirability. Journal of Behavioral Decision Making, 23, 83–99.
Lee, M. D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.
Levy, L. R., Yao, W., McGuire, G., Vollick, D. N., Jetté, J., Shanahan, M. J., & Neufeld, R. W. J. (2012). Nonlinear bifurcations of psychological stress negotiation: New properties of a formal dynamical model. Nonlinear Dynamics, Psychology and Life Sciences, 16, 429–456.
Link, S. W. (1982). Correcting response measures for guessing and partial information. Psychological Bulletin, 92, 469–486.
Link, S. W., & Day, R. B. (1992). A theory of cheating. Behavior Research Methods, Instruments, & Computers, 24, 311–316.
Link, S. W., & Heath, R. A. (1975). A sequential theory of psychological discrimination. Psychometrika, 40, 77–105.
Maher, B. (1970). Introduction to research in psychopathology. New York, NY: McGraw-Hill.
Marr, D. (1982). Vision. San Francisco, CA: Freeman.
McFall, R. M., & Townsend, J. T. (1998). Foundations of psychological assessment: Implications for cognitive assessment in clinical science. Psychological Assessment, 10, 316–330.
McFall, R. M., Townsend, J. T., & Viken, R. J. (1995). Diathesis stress model or "just so" story? Behavioral and Brain Sciences, 18(3), 565–566.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–843.
Mogg, K., Millar, N., & Bradley, B. P. (2000). Biases in eye movements to threatening facial expressions in generalized anxiety disorder and depressive disorder. Journal of Abnormal Psychology, 109, 695–704.
Molenaar, P. C. M. (2010). Note on optimization of psychotherapeutic processes. Journal of Mathematical Psychology, 54, 208–213.
Morrison, D. G. (1979). An individual-difference pure-extinction process. Journal of Mathematical Psychology, 19, 307–315.
Moshagen, M. (2010). multiTree: A computer program for the analysis of multinomial processing tree models. Behavior Research Methods, 42(1), 42–54.
Neufeld, R. W. J. (1977). Clinical quantitative methods. New York, NY: Grune & Stratton.
Neufeld, R. W. J. (1984a). Re: The incorrect application of traditional test discriminating power formulations to diagnostic-group studies. Journal of Nervous and Mental Disease, 172, 373–374.
Neufeld, R. W. J. (1984b). Elaboration of incorrect application of traditional test discriminating power formulations to diagnostic-group studies. Department of Psychology Research Bulletin No. 599. London, ON: University of Western Ontario.
Neufeld, R. W. J. (1991). Memory in paranoid schizophrenia. In P. Magaro (Ed.), The cognitive bases of mental disorders: Annual review of psychopathology (Vol. 1, pp. 231–261). Newbury Park, CA: Sage.
Neufeld, R. W. J. (1996). Stochastic models of information processing under stress. Research Bulletin No. 734. London, ON: Department of Psychology, University of Western Ontario.
Neufeld, R. W. J. (1998). Intersections and disjunctions in process-model applications. Psychological Assessment, 10, 396–398.
Neufeld, R. W. J. (1999). Dynamic differentials of stress and coping. Psychological Review, 106, 385–397.
Neufeld, R. W. J. (2007a). Introduction. In R. W. J. Neufeld (Ed.), Advances in the clinical cognitive science: Formal modeling of processes and symptoms (pp. 3–18). Washington, DC: American Psychological Association.
Neufeld, R. W. J. (2007b). Composition and uses of formal clinical cognitive science. In B. Shuart, W. Spaulding, & J. Poland (Eds.), Modeling complex systems: Nebraska Symposium on Motivation, 52, 1–83. Lincoln, NE: University of Nebraska Press.
Neufeld, R. W. J. (2007c). On the centrality and significance of encoding deficit in schizophrenia. Schizophrenia Bulletin, 33, 982–993.
Neufeld, R. W. J. (2012). Quantitative clinical cognitive science, cognitive neuroimaging, and tacks to fMRI signal analysis: The case of encoding deficit in schizophrenia. Paper presented at the 45th Annual Meeting of the Society for Mathematical Psychology, Columbus, Ohio, July 21–24, 2012.
Neufeld, R. W. J., Boksman, K., Vollick, D., George, L., & Carter, J. (2010). Stochastic dynamics of stimulus encoding in schizophrenia: Theory, testing, and application. Journal of Mathematical Psychology, 54, 90–108.
Neufeld, R. W. J., & Broga, M. I. (1977). Fallacy of the reliability-discriminability principle in research on differential cognitive deficit. Department of Psychology Research Bulletin No. 360. London, ON: University of Western Ontario.
Neufeld, R. W. J., & Broga, M. I. (1981). Evaluation of information-sequential aspects of schizophrenic performance, II: Methodological considerations. Journal of Nervous and Mental Disease, 169, 569–579.
Neufeld, R. W. J., Vollick, D., Carter, J. R., Boksman, K., & Jetté, J. (2002). Application of stochastic modelling to group and individual differences in cognitive functioning. Psychological Assessment, 14, 279–298.
Neufeld, R. W. J., & Gardner, R. C. (1990). Data aggregation in evaluating psychological constructs: Multivariate and logical-deductive considerations. Journal of Mathematical Psychology, 34, 276–296.
Neufeld, R. W. J., & McCarty, T. (1994). A formal analysis of stressor and stress-proneness effects on basic information processing. British Journal of Mathematical and Statistical Psychology, 47, 193–226.
Neufeld, R. W. J., & Williamson, P. (1996). Neuropsychological correlates of positive symptoms: Delusions and hallucinations. In C. Pantelis, H. E. Nelson, & T. R. E. Barnes (Eds.), Schizophrenia: A neuropsychological perspective (pp. 205–235). London, England: Wiley.
Neufeld, R. W. J., Townsend, J. T., & Jetté, J. (2007). Quantitative response time technology for measuring cognitive-processing capacity in clinical studies. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling and assessment of processes and symptoms (pp. 207–238). Washington, DC: American Psychological Association.
Neufeld, R. W. J., Mather, J. A., Merskey, H., & Russell, N. C. (1995). Multivariate structure of eye-movement dysfunction in schizophrenia. Multivariate Experimental Clinical Research, 11, 1–21.
Norman, R. M. G., & Malla, A. K. (1994). A prospective study of daily stressors and symptomatology in schizophrenic patients. Social Psychiatry and Psychiatric Epidemiology, 29, 244–249.
Novotney, A. (2009). Science on a shoestring. Monitor on Psychology, 40(1), 42–44.
O'Hagan, A., & Forster, J. (2004). Kendall's advanced theory of statistics: Vol. 2B. Bayesian inference (2nd ed.). London, England: Arnold.
Ouimet, A. J., Gawronski, B., & Dozois, D. J. A. (2009). Cognitive vulnerability to anxiety: A review and an integrative model. Clinical Psychology Review, 29, 459–470.
Phillips, W. A., & Silverstein, S. M. (2003). Convergence of biological and psychological perspectives on cognitive coordination in schizophrenia. Discussion. Behavioral and Brain Sciences, 26, 65–138.
Pitt, M. A., Kim, W., Navarro, J., & Myung, J. I. (2006). Global model analysis by parameter space partitioning. Psychological Review, 113, 57–83.
Poldrack, R. A. (2011). Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding. Neuron, 72, 692–697.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin, 86, 446–461.
Riefer, D. M., & Batchelder, W. H. (1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318–339.
Riefer, D. M., Knapp, B. R., Batchelder, W. H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: Assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14, 184–201.
Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65(1), 1–12.
Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68(4), 589–606.
Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information processing: I. Detection, search, and attention. Psychological Review, 84, 1–66.
Schweickert, R. (1982). The bias of an estimate of coupled slack in stochastic PERT networks. Journal of Mathematical Psychology, 26, 1–12.
Schweickert, R. (1985). Separable effects of factors on speed and accuracy: Memory scanning, lexical decisions and choice tasks. Psychological Bulletin, 97, 530–548.
Schweickert, R., & Han, H. J. (2012). Reaction time predictions for factors selectively influencing processes in processing trees. Paper presented at the Mathematical Psychology meeting, Columbus, OH, July 2012.
Shanahan, M. J., & Neufeld, R. W. J. (2010). Coping with stress through decisional control: Quantification of negotiating the environment. British Journal of Mathematical and Statistical Psychology, 63, 575–601.
Shanahan, M. J., Townsend, J. T., & Neufeld, R. W. J. (in press). Mathematical modeling in clinical psychology. In R. Cautin & S. Lilienfeld (Eds.), R. Zinbarg (Section Ed.), Wiley-Blackwell's Encyclopedia of Clinical Psychology. London, England: Wiley-Blackwell.
Shannon, C. E. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Shiffrin, R. M., & Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory. Psychological Review, 84, 127–190.
Siegle, G. J., & Hasselmo, M. E. (2002). Using connectionist models to guide assessment of psychological disorder. Psychological Assessment, 14, 263–278.
Silverstein, S. S. (2008). Measuring specific, rather than generalized, cognitive deficits and maximizing between-group effect size in studies of cognition and cognitive change. Schizophrenia Bulletin, 34, 645–655.
Smith, J. B., & Batchelder, W. H. (2010). Beta-MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167–183.
Staddon, J. E. R. (1984). Social learning theory and the dynamics of interaction. Psychological Review, 91(4), 502–507.
Stein, D. J., & Young, J. E. (Eds.). (1992). Cognitive science and clinical disorders. New York, NY: Academic.
Sternberg, S. (1969). The discovery of processing stages: Extensions of Donders' method. In W. G. Koster (Ed.), Attention and performance II. Acta Psychologica, 30, 276–315.
Townsend, J. T. (1984). Uncovering mental processes with factorial experiments. Journal of Mathematical Psychology, 28(4), 363–400.
Townsend, J. T. (2008). Mathematical psychology: Prospects for the 21st century: A guest editorial. Journal of Mathematical Psychology, 52, 269–280.
Townsend, J. T., & Ashby, F. G. (1983). Stochastic modelling of elementary psychological processes. Cambridge, England: Cambridge University Press.
Townsend, J. T., & Altieri, N. (2012). An accuracy-response time capacity assessment function that measures performance against standard parallel predictions. Psychological Review, 119, 500–516.
Townsend, J. T., & Eidels, A. (2011). Workload capacity spaces: A unified methodology for response time measures of efficiency as workload is varied. Psychonomic Bulletin & Review, 18, 659–681.
Townsend, J. T., Fific, M., & Neufeld, R. W. J. (2007). Assessment of mental architecture in clinical/cognitive research. In T. A. Treat, R. R. Bootzin, & T. B. Baker (Eds.), Psychological clinical science: Papers in honor of Richard M. McFall (pp. 223–258). Mahwah, NJ: Erlbaum.
Townsend, J. T., & Neufeld, R. W. J. (2004). Mathematical theory-driven methodology and experimentation in the emerging quantitative clinical cognitive science: Toward general laws of individual differences. Paper presented at the Association for Psychological Science-sponsored symposium on Translational Psychological Science in Honor of Richard M. McFall, Chicago, Illinois, May 2004.
Townsend, J. T., & Nozawa, G. (1995). Spatio-temporal properties of elementary perception: An investigation of parallel, serial, and coactive theories. Journal of Mathematical Psychology, 39, 321–359.
Townsend, J. T., & Wenger, M. J. (2004a). The serial-parallel dilemma: A case study in a linkage of theory and method. Psychonomic Bulletin & Review, 11, 391–418.
Townsend, J. T., & Wenger, M. J. (2004b). A theory of interactive parallel processing: New capacity measures and predictions for a response time inequality series. Psychological Review, 111, 1003–1035.
Treat, T. A., & Viken, R. J. (2010). Cognitive processing of weight and emotional information in disordered eating. Current Directions in Psychological Science, 19, 81–85.
Treat, T. A., McFall, R. M., Viken, R. J., Nosofsky, R. M., MacKay, D. B., & Kruschke, J. K. (2002). Assessing clinically relevant perceptual organization with multidimensional scaling techniques. Psychological Assessment, 14, 239–252.
Treat, T. A., Viken, R. J., Kruschke, J. K., & McFall, R. M. (2010). Role of attention, memory, and covariation-detection processes in clinically significant eating-disorder symptoms. Journal of Mathematical Psychology, 54, 184–195.
Van der Maas, H. L. J., Molenaar, D., Maris, G., Kievit, R. A., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339–356.
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin and Review, 7, 424–465.
Wagenmakers, E.-J., van der Maas, H. L. J., Dolan, C., & Grasman, R. P. P. P. (2008). EZ does it! Extensions of the EZ-diffusion model. Psychonomic Bulletin & Review, 15, 1229–1235.
Wagenmakers, E.-J., van der Maas, H. L. J., & Farrell, S. (2012). Abstract concepts require concrete models: Why cognitive scientists have not yet embraced nonlinearly-coupled, dynamical, self-organized critical, synergistic, scale-free, exquisitely context-sensitive, interaction-dominant, multifractal, interdependent brain-body-niche systems. TopiCS, 4, 87–93.
Wallsten, T. S., Pleskac, T. J., & Lejuez, C. W. (2005). Modeling a sequential risk-taking task. Psychological Review, 112, 862–880.
Wenger, M. J., & Townsend, J. T. (2000). Basic tools for attention and general processing capacity in perception and cognition. Journal of General Psychology: Visual Attention, 127, 67–99.
White, C. N., Ratcliff, R., Vasey, M. W., & McKoon, G. (2010a). Anxiety enhances threat processing without competition among multiple inputs: A diffusion model analysis. Emotion, 10, 662–677.
White, C. N., Ratcliff, R., Vasey, M. W., & McKoon, G. (2010b). Using diffusion models to understand clinical disorders. Journal of Mathematical Psychology, 54, 39–52.
Williams, J. M. G., & Oaksford, M. (1992). Cognitive science, anxiety and depression: From experiments to connectionism. In D. J. Stein & J. E. Young (Eds.), Cognitive science and clinical disorders (pp. 129–150). San Diego, CA: Academic.
Witkiewitz, K., van der Maas, H. J., Hufford, M. R., & Marlatt, G. A. (2007). Non-normality and divergence in post-treatment alcohol use: Re-examining the Project MATCH data "another way." Journal of Abnormal Psychology, 116, 378–394.
Yang, E., Tadin, D., Glasser, D. M., Hong, S. W., Blake, R., & Park, S. (2013). Visual context processing in schizophrenia. Clinical Psychological Science, 1, 5–15.
Yechiam, E., Arshavsky, O., Shamay-Tsoory, S. G., Yaniv, S., & Aharon, J. (2010). Adapted to explore: Reinforcement learning in autistic spectrum conditions. Brain and Cognition, 72, 317–324.
Yechiam, E., Goodnight, J., Bates, J. E., Busemeyer, J. R., Dodge, K. A., Pettit, G. S., & Newman, J. P. (2006). A formal cognitive model of the Go/No-Go discrimination task: Evaluation and implications. Psychological Assessment, 18, 239–249.
Yechiam, E., Hayden, E. P., Bodkins, M., O'Donnell, B. F., & Hetrick, W. P. (2008). Decision making in bipolar disorder: A cognitive modeling approach. Psychiatry Research, 161, 142–152.
Yechiam, E., Veinott, E. S., Busemeyer, J. R., & Stout, J. C. (2007). Cognitive models for evaluating basic decision processes in clinical populations. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling and assessment of processes and symptoms (pp. 81–111). Washington, DC: APA Publications.

CHAPTER 17

Quantum Models of Cognition and Decision

Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos

Abstract

Quantum probability theory provides a new formalism for constructing probabilistic and dynamic systems of cognition and decision. The purpose of this chapter is to introduce psychologists to this fascinating theory. This chapter is organized into six sections. First, some of the basic psychological principles supporting a quantum approach to cognition and decision are summarized; second, some notations and definitions needed to understand quantum probability theory are presented; third, a comparison of quantum and classical probability theories is presented; fourth, quantum probability theory is used to account for some paradoxical findings in the field of human probability judgments; fifth, a comparison of quantum and Markov dynamic theories is presented; and finally, a quantum dynamic model is used to account for some puzzling findings of decision-making research. The chapter concludes with a summary of advantages and disadvantages of a quantum probability theoretical framework for modeling cognition and decision.

Key Words: quantum probability, classical probability, Hilbert space, the law of total probability, conjunction fallacy, disjunction fallacy, dynamic models of decision, Markov models, quantum interference, two-stage gambles, dynamic consistency

Reasons for a Quantum Approach to et al., 2009), and perception (Atmanspacher, Filk, Cognition and Decision & Romer, 2004; Conte et al., 2009). Several This chapter is not about quantum physics review articles (Pothos & Busemeyer, 2013; Wang, 1 per se. Instead, it explores the application of Busemeyer, Atmanspacher, & Pothos, 2013) and probabilistic dynamic systems derived from quan- books (Busemeyer & Bruza, 2012; Ivancevic & tum theory to a new domain – cognition and Ivancevic, 2010; Khrennikov, 2010) provide a decision making behavior. Applications of quantum summary of this new program of research. Before theory have appeared in judgment (Aerts & Aerts, presenting the formal ideas, let us first examine why 1994; Busemeyer, Pothos, Franco, & Trueblood, quantum theory should be applicable to human 2011; Franco, 2009; Pothos, Busemeyer, & cognition and decision behavior. Trueblood, 2013; Wang & Busemeyer, 2013), decision making (Bordley & Kadane, 1999; Judgments Are Based upon Indefinite Busemeyer, Wang, & Townsend, 2006; Khrennikov and Uncertain Cognitive States & Haven, 2009; Lambert-Mogiliansky, Zamir, & Models commonly used in psychology assume Zwirn, 2009; La Mura, 2009; Pothos & Busemeyer, the cognitive system changes from moment to 2009; Trueblood & Busemeyer, 2011; Yukalov & moment, but at any specific moment it is in a Sornette, 2011), conceptual combinations (Aerts, definite state with respect to some judgment to 2009; Aerts & Gabora, 2005; Blutner, 2009), be made. For example, suppose a juror has just memory (Brainerd, Wang, & Reyna, 2013; Bruza heard conflicting evidence from the prosecutor and

the defense, and the juror has to consider two mutually exclusive and exhaustive hypotheses: guilty or not guilty. A Bayesian model would assign a probability distribution over the two hypotheses: a probability p(G|evidence) is assigned to guilty, and a probability 1 − p(G|evidence) is assigned to not guilty. Therefore, the juror's subjective probability with respect to the question of guilty or not boils down to a state represented by a point lying somewhere between zero and one on the probability scale at each moment. This probability may change from moment to moment to produce a definite trajectory of probability for guilt across time. However, at each moment, this subjective probability either favors guilt, p(G|evidence) > .50, or it favors not guilty, p(G|evidence) < .50, or it is exactly at p(G|evidence) = .50. At a single moment, the juror cannot both favor guilt, p(G|evidence) > .50, and at the same time favor not guilty, p(G|evidence) < .50.

In contrast, quantum theory assumes that during deliberation the juror is in an indefinite (superposition) state at each moment. While in an indefinite state, the juror does not necessarily favor guilty, and at the same time the juror does not necessarily favor not guilty. Instead, the juror is in a superposition state that leaves the juror conflicted, ambiguous, confused, or uncertain about the guilty status. The potential for saying guilty may be greater than the potential for saying not guilty at one moment, and these potentials may change from one moment to the next, but either hypothesis could potentially be chosen at each moment. In quantum theory, there is no single trajectory or sample path across time before making a decision. When asked to make a decision, the juror would be forced to commit to either guilty or not guilty.

Judgments Create Rather Than Record a Cognitive State

Models commonly used in psychology assume that what we record at a particular moment reflects the state of the cognitive system as it existed immediately before we inquired about it. For example, if a person watches a scene of an exciting car chase and is asked "Are you afraid?" then the answer "Yes, I am afraid" reflects the person's cognitive state with respect to that question just before we asked it.

In contrast, quantum theory assumes that taking a measurement of a system creates rather than records a property of the system (Wang, Busemeyer, Atmanspacher, & Pothos, 2013). For example, the person may be ambiguous about his or her feelings after watching the scene, but the answer "Yes, I am afraid" is constructed from the interaction of this indefinite state and the question, which results in a now definitely "afraid" state. This is, in fact, the basis for modern psychological theories of emotion (Schachter & Singer, 1962). Decision scientists also have shown evidence that beliefs and preferences are constructed online rather than simply being read straight out of memory (Payne, Bettman, & Johnson, 1992), and that expressing choices and opinions can change preferences (Sharot, Velasquez, & Dolan, 2010).

Judgments Disturb Each Other Producing Order Effects

According to quantum theory, the answer to a question can change a state from an indefinite to a definite state, and this change causes one to respond differently to subsequent questions. Intuitively, the answer to the first question sets up a context that changes the answer to the next question, and this produces order effects of the measurements. Order effects make it impossible to define a joint probability of answers to questions A and B (unless one conditionalizes the conjunction on an order parameter), and instead it is necessary to assign a probability to the sequence of answers to question A followed by question B. In quantum theory, if A and B are two measurements, and the probabilities of the outcomes depend on the order of the measurements, then the two measurements are noncommutative. Many of the mathematical properties of quantum theory, such as Heisenberg's famous uncertainty principle (Heisenberg, 1958), arise from developing a probabilistic model for noncommutative measurements. Question order effects are a major concern for attitude researchers, who struggle for a theoretical understanding of these effects similar to that achieved in quantum theory (Feldman & Lynch, 1988). Of course, quantum theory is not the only theory that can explain order effects. Markov models, for example, can also produce order effects. Quantum theory, however, provides a more natural, elegant, and built-in set of principles (as opposed to ad hoc assumptions) for explaining order effects (Wang, Solloway, Shiffrin, & Busemeyer, 2014).

Judgments Do Not Always Obey Classical Logic

Probabilistic models commonly used in psychology are based on the Kolmogorov axioms (1933/1950), which define events as sets that

obey the axioms of Boolean logic. One important axiom is the distributive axiom: if {G, T, F} are events, then G ∩ (T ∪ F) = (G ∩ T) ∪ (G ∩ F). Consider, for example, the concept that a boy is good (G), and the pair of concepts that the boy told the truth (T) versus the boy did not tell the truth (F). According to classical Boolean logic, the event G can only occur in one of two ways: either (G ∩ T) occurs or (G ∩ F) occurs, exclusively. From this distributive axiom, one can derive the law of total probability, p(G) = p(T)p(G|T) + p(F)p(G|F).

Quantum probability theory is derived from the von Neumann axioms (1932/1955), which define events as subspaces that obey different axioms from those of set theory. In particular, the distributive axiom does not always hold (Hughes, 1989). For example, according to quantum logic, when you try to decide whether a boy is good without knowing if he is truthful or not, you are not forced to have only two thoughts: he is good and he is truthful, or he is good and he is not truthful. You can remain ambiguous or indeterminate over the truthful or not truthful attributes, which can be represented by a superposition state. The fact that quantum logic does not always obey the distributive axiom implies that the quantum model does not always obey the law of total probability (Khrennikov, 2010).

Preliminary Concepts, Definitions, and Notations

Quantum theory is based on geometry and is defined on a Hilbert space. (Hilbert spaces are complex vector spaces with certain convergence properties.) Paul Dirac developed an elegant notation for expressing the abstract elements of the theory, which is used in this chapter. This chapter is restricted to finite spaces for simplicity, but note that the theory is also applicable to infinite dimensional spaces. In fact, to keep examples simple, this section introduces the ideas using only a three-dimensional space in order to present the ideas visually. Figure 17.1 shows a particular vector labeled S that lies within a three-dimensional space spanned by three basis vectors labeled A, B, and C.

Fig. 17.1. Three-dimensional vector space spanned by basis vectors A, B, and C.

For example, a simple attitude model could interpret S as the state of opinion of a person with regard to the beauty of an artwork using three mutually exclusive evaluations, "good," "mediocre," or "bad," which are represented by the basis vectors A, B, and C, respectively.

A finite Hilbert space is an N-dimensional vector space defined on a field of complex numbers and endowed with an inner product. The space has a basis, which is a set of N orthonormal basis vectors χ = {|X₁⟩, ..., |X_N⟩} that span the space. The symbol |X⟩ represents an arbitrary vector in an N-dimensional vector space, which is called a "ket." This vector can be expressed by its coordinates with respect to the basis χ as follows:

|X⟩ = Σᵢ₌₁ᴺ xᵢ |Xᵢ⟩.

The coordinates xᵢ are complex numbers. The N coordinates representing the ket |X⟩ with respect to a basis χ form an N × 1 column matrix

X = [x₁, ..., x_N]ᵀ.

Referring to Figure 17.1, the coordinates for the specific vector |S⟩ with respect to the {|A⟩, |B⟩, |C⟩} basis equal

S = [0.696, 0.696, 0.1765]ᵀ.

Referring back to our simple attitude model, the coordinates of S represent the potentials for each of the opinions.

The symbol ⟨X| represents a linear functional in an N-dimensional (dual) vector space, which is called a "bra." Each ket |X⟩ has a corresponding bra ⟨X|. The conjugate transpose operation, |X⟩† = ⟨X|, changes a ket into a bra. The N coordinates representing the bra ⟨X| with respect to a basis χ form a 1 × N row matrix

X† = [x₁*, ..., x_N*].

The * symbol indicates complex conjugation.
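The ket and bra bookkeeping above can be sketched numerically. The following is a minimal numpy illustration using the coordinates of |S⟩ given in the text (variable names are ours, not from the chapter):

```python
import numpy as np

# Ket |S> in the {|A>, |B>, |C>} basis: the coordinates are the
# "potentials" for the evaluations good / mediocre / bad.
S = np.array([0.696, 0.696, 0.1765])

# The bra <S| is the conjugate transpose of the ket |S>.  With real
# coordinates, conjugation has no effect.
bra_S = S.conj().T

# The bra-ket <S|S> is the inner product; it should be ~1,
# confirming that |S> is (approximately) unit length.
inner = np.dot(bra_S, S)
print(inner)  # ~1.0
```

For complex coordinates the same code applies unchanged; `conj()` then does real work.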
For example, the bra ⟨S| corresponding to the ket |S⟩ has the matrix representation

S† = [0.696, 0.696, 0.1765].

(Here the numbers in the example are real, so conjugation has no effect.)

Hilbert spaces are endowed with an inner product. Psychologically, the inner product is a measure of similarity between two vectors. The inner product is a scalar formed by applying the bra to the ket to form a bra-ket:

⟨Y|X⟩ = Σᵢ₌₁ᴺ yᵢ* · xᵢ.

For example,

⟨S|S⟩ = S† · S = [0.696, 0.696, 0.1765] · [0.696, 0.696, 0.1765]ᵀ = 1.

This shows that the ket |S⟩ is unit length.

The outer product, denoted |X⟩⟨Y|, is a linear operator, which is used to make transitions from one state to another. In particular, assuming that the kets are unit length, the outer product |X⟩⟨Y| maps the ket |Y⟩ to the ket |X⟩ as follows: (|X⟩⟨Y|) · |Y⟩ = |X⟩⟨Y|Y⟩ = |X⟩. Assuming |X⟩ is unit length, the outer product |X⟩⟨X| projects |X⟩ to itself, |X⟩⟨X| · |X⟩ = |X⟩⟨X|X⟩ = |X⟩ · 1, and |X⟩⟨X| projects any other ket |Y⟩ onto the ray spanned by |X⟩ as follows: |X⟩⟨X|Y⟩ = ⟨X|Y⟩ · |X⟩. For these reasons, |X⟩⟨X| is called the projector for the ray spanned by |X⟩, which is also symbolized as M_X = |X⟩⟨X|. Projectors correspond to subspaces that represent events in quantum theory. They are Hermitian and idempotent.

Referring to Figure 17.1, the coordinates for the basis vector |A⟩ (with respect to the {|A⟩, |B⟩, |C⟩} basis) simply equal A = [1, 0, 0]ᵀ, and the matrix representation of the projector for this basis vector equals

A · A† = [1, 0, 0]ᵀ · [1, 0, 0] = [[1, 0, 0], [0, 0, 0], [0, 0, 0]].

The projector M_A = A · A† corresponds to the subspace representing the event A. In our simple attitude model, M_A would be used to represent the event that the person decides the artwork to be "good" (which corresponds to event A). The matrix representation of the projection of the ket |S⟩ onto the ray spanned by the basis vector |A⟩ then equals

A · A† · S = [0.696, 0, 0]ᵀ.

In our simple attitude model, this projection is used to determine the probability that the person decides the artwork is "good." According to quantum theory, the squared length of this projection, 0.696² = 0.4844, equals the probability that the person will decide that the artwork is "good."

Similarly, the coordinates for the basis vector |B⟩ (with respect to the {|A⟩, |B⟩, |C⟩} basis) simply equal B = [0, 1, 0]ᵀ, so the matrix representation (with respect to the {|A⟩, |B⟩, |C⟩} basis) of the projector for |B⟩ equals

B · B† = [[0, 0, 0], [0, 1, 0], [0, 0, 0]].

In our simple attitude model, this projector is used to represent the event that the person decides the artwork to be "mediocre" (which corresponds to event B). In addition, the horizontal plane shown in Figure 17.1 is spanned by the {|A⟩, |B⟩} basis vectors, and the projector that projects vectors onto this plane equals M_A + M_B = |A⟩⟨A| + |B⟩⟨B|, which has the matrix representation

A · A† + B · B† = [[1, 0, 0], [0, 1, 0], [0, 0, 0]].

In our simple attitude example, this corresponds to the event that the person thinks the artwork is "good" or "mediocre." The vector labeled T in Figure 17.1 is the projection of the vector |S⟩ onto the plane spanned by the {|A⟩, |B⟩} basis vectors, which has the matrix representation

T = (A · A† + B · B†) · S = [[1, 0, 0], [0, 1, 0], [0, 0, 0]] · [0.696, 0.696, 0.1765]ᵀ = [0.696, 0.696, 0]ᵀ.

The squared length of this projection, T†T = 2 · 0.696² = 0.969, equals the probability that the person decides the artwork to be "good" or "mediocre." This is also the probability that the person thinks the artwork is not "bad" (1 − 0.1765² = 0.969).

Referring back to our simple attitude model, suppose that instead of asking whether the artwork is beautiful, we ask what kind of moral message it conveys, and once again there are three answers: "good," "neutral," or "bad." Now the person needs to evaluate the same artwork with respect to a new point of view. In quantum theory, this new point of view is represented as a change in the basis. Figure 17.2 illustrates three new orthonormal vectors, labeled U, V, and W, within the same three-dimensional space.

Fig. 17.2. New basis U, V, W for representing the three-dimensional vector space.

Now the basis vectors U, V, and W represent a "good," "neutral," or "bad" moral message, respectively. The state S now represents the person's opinion with respect to this new moral-message point of view. With respect to the {|A⟩, |B⟩, |C⟩} basis, the coordinates for these three vectors are as follows:

U = [√(1/2), √(1/2), 0]ᵀ, V = [1/2, −1/2, √(1/2)]ᵀ, W = [−1/2, 1/2, √(1/2)]ᵀ.

These three vectors form another orthonormal basis for spanning the three-dimensional space. The projector M_V = |V⟩⟨V| projects vectors onto the ray spanned by the basis vector |V⟩ as follows: M_V|X⟩ = |V⟩⟨V|X⟩. Using the coordinates defined above for |V⟩, we obtain

V · V† · S = V · [1/2, −1/2, √(1/2)] · [0.696, 0.696, 0.1765]ᵀ = (0.125) · V,

which indicates that 0.125 is the coordinate of the vector |S⟩ on the |V⟩ basis vector. This is the projection of S on the V basis vector, and the squared length of this projection, 0.125² = 0.0156, equals the probability of this event, that is, the probability that the person decides that the artwork is "neutral." Repeating this procedure for |U⟩ and |W⟩, we obtain the coordinates for the vector |S⟩ with respect to the {|U⟩, |V⟩, |W⟩} basis:

Y = [0.9843, 0.125, 0.125]ᵀ.

In sum, the same vector |S⟩ can be expressed by the coordinates X using the {|A⟩, |B⟩, |C⟩} basis or by the coordinates Y using the {|U⟩, |V⟩, |W⟩} basis.

Note that the event "morally good" is represented by the vector U in Figure 17.2. This vector lies along the diagonal line of the A, B plane. Here we see an interesting feature of quantum theory. If a person is definite that the piece of artwork is "morally good" (represented by the vector U), then the person must be uncertain about whether its beauty is good versus mediocre (because U has a 45-degree angle with respect to each of the A, B vectors). However, if the person is certain that the artwork is "morally good," then the person is certain that its beauty is not "bad" (because U is contained in the A, B plane).

Quantum Compared to Classical Probabilities

This section presents the quantum probability axioms formulated by Paul Dirac (1958) and John von Neumann (1932/1955), and compares them systematically with the axioms of classical Kolmogorov probability theory (1933/1950) (see Box 1 for a summary). For simplicity, we restrict this presentation to finite spaces in this chapter. Although the space is finite, the number of dimensions can be very large. The general theory is applicable to infinite dimensional spaces. See Chapter 2 in Busemeyer and Bruza (2012) for a more comprehensive introduction.

Box 1. A brief comparison of
classical Kolmogorov probability theory and quantum probability theory.

• Kolmogorov theory:
• Sample space χ: a set of points; an event A is represented as a subset of χ.
• State: a probability function p; p(A) = probability assigned to event A.
• If A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B).
• p(A|B) = p(A ∩ B)/p(B).
• p(A ∩ B) = p(B ∩ A).

• Quantum theory:
• Hilbert vector space; an event A is represented as a subspace corresponding to a projector M_A.
• State: a vector |S⟩ in the Hilbert space; q(A) = ‖M_A|S⟩‖².
• If M_A M_B = 0, then q(A ∨ B) = q(A) + q(B).
• q(A|B) = ‖M_A M_B|S⟩‖²/q(B).
• ‖M_A M_B|S⟩‖² ≠ ‖M_B M_A|S⟩‖² in general.

Events

Classical probability postulates a sample space χ, which we will assume contains a finite number of points, N (and N may be very large). The set of points in the sample space is defined as χ = {X₁, ..., X_N}. An event A is a subset of this sample space, A ⊆ χ. If A ⊆ χ is an event and B ⊆ χ is an event, then the intersection A ∩ B is an event; also, the union A ∪ B is an event.

Quantum theory postulates a Hilbert space, which we will assume has a finite dimension, N (and again N may be very large). The space is spanned by an orthonormal set of basis vectors χ = {|X₁⟩, ..., |X_N⟩} that form a basis for the space. An event A is a subspace spanned by a subset χ_A ⊆ χ of basis vectors. This event corresponds to a projector M_A = Σ_{i∈A} |Xᵢ⟩⟨Xᵢ|. If A is an event spanned by χ_A ⊆ χ and B is an event spanned by χ_B ⊆ χ, then the meet (infimum) A ∧ B is an event spanned by χ_A ∩ χ_B; also, the join (supremum) A ∨ B is an event spanned by χ_A ∪ χ_B. For example, in Figure 17.1, the event A is represented by the ray spanned by the vector |A⟩, and "A or B" is represented by the horizontal plane spanned by the two vectors {|A⟩, |B⟩} for the quantum model.

System State

Classical probability postulates a probability function p that maps points in the sample space χ into positive real numbers that sum to unity. The empty set is mapped into zero, the sample space is mapped into one, and all other events are mapped into the interval [0, 1]. If the pair of events {A ⊆ χ, B ⊆ χ} are mutually exclusive, A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B). The probability of the event "not A," denoted Ā, equals p(Ā) = 1 − p(A).

Quantum probability postulates a unit-length state vector |X⟩ in the Hilbert space. The probability of an event A spanned by χ_A ⊆ χ is defined by q(A) = ‖M_A|X⟩‖². For later work, it will be convenient to express ‖M|X⟩‖² as the inner product corresponding to a projector M:

‖M|X⟩‖² = ⟨X|M†M|X⟩ = ⟨X|M|X⟩,

where the last step made use of the Hermitian, idempotent properties of the projector, M† = M = MM. If the pair of events {A, B}, both spanned by subsets of basis vectors from χ, are mutually exclusive, χ_A ∩ χ_B = ∅, then it follows from orthogonality that q(A ∨ B) = ‖(M_A + M_B)|X⟩‖² = ‖M_A|X⟩‖² + ‖M_B|X⟩‖² = q(A) + q(B). The event Ā corresponds to the subspace that is the orthogonal complement of the subspace for the event A, and its probability equals q(Ā) = ‖(I − M_A)|X⟩‖² = 1 − q(A). For example, in Figure 17.1, the probability of the event A equals ‖M_A|S⟩‖² = ‖A · A† · S‖² = |0.696|², and the probability of the event "A or B" equals ‖(M_A + M_B)|S⟩‖² = ‖(A · A† + B · B†) · S‖² = ‖T‖² = 2 · |0.696|².

State Revision

According to classical probability, if an event A is observed, then a new conditional probability function is defined by the mapping p(Xᵢ|A) = p(Xᵢ ∩ A)/p(A). The normalizing factor in the denominator is used to guarantee that the probability assigned to the entire sample space remains equal to one.

According to quantum probability, if an event A is observed, then the new revised state is defined by |X_A⟩ = M_A|X⟩/‖M_A|X⟩‖. The normalizing factor in the denominator is used to guarantee that the revised state remains unit length. The new revised state is then used (as described earlier) to compute probabilities for events. This is called Lüder's rule.
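The event probabilities and Lüder's rule can be checked numerically against the running attitude example. This is an illustrative numpy sketch (the variable names are ours), using the state |S⟩ and projectors defined in the text:

```python
import numpy as np

S = np.array([0.696, 0.696, 0.1765])   # state |S> in the {A, B, C} basis
A = np.array([1.0, 0.0, 0.0])          # basis ket |A> ("good")
B = np.array([0.0, 1.0, 0.0])          # basis ket |B> ("mediocre")

# Projectors as outer products, M_X = |X><X|.
M_A = np.outer(A, A)
M_B = np.outer(B, B)

# q(A) = ||M_A |S>||^2 : probability of judging the artwork "good".
q_A = np.linalg.norm(M_A @ S) ** 2          # ~0.4844

# q(A or B) = ||(M_A + M_B)|S>||^2 : "good" or "mediocre".
q_AorB = np.linalg.norm((M_A + M_B) @ S) ** 2   # ~0.969

# Lüder's rule: revise the state after observing "A or B".
T = (M_A + M_B) @ S
S_AorB = T / np.linalg.norm(T)

# Probability of A given that "A or B" was observed.
q_A_given = np.linalg.norm(M_A @ S_AorB) ** 2   # 0.50
print(q_A, q_AorB, q_A_given)
```

The revised state lands on the diagonal of the A, B plane, which is why the conditional probability of A comes out to exactly one half.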

For example, in Figure 17.1, if the event "A or B" is observed, then the matrix representation of the revised state equals

S_AorB = T/‖T‖ = [√(1/2), √(1/2), 0]ᵀ = U.

The probability of event A, given that "A or B" was observed, equals ‖A · A† · S_AorB‖² = .50.

Commutativity

Classical probability assumes that for any given experiment there is only one sample space, χ, and all events are contained in this single sample space. Consequently, the intersection between two events and the union of two events are always well defined. A single probability function p is sufficient to assign probabilities to all events of the experiment. This is called the principle of unicity (Griffiths, 2003). It follows from the commutative property of sets that probabilities are commutative: p(A) · p(B|A) = p(A ∩ B) = p(B ∩ A) = p(B) · p(A|B).

Quantum probability assumes that there is only one Hilbert space and all events are contained in this single Hilbert space. For a single fixed basis, such as χ = {|Xᵢ⟩, i = 1, ..., N}, the meet and the join of two events spanned by a common set of basis vectors in χ are always well defined, and a probability function q can be used to assign probabilities to all the events defined with respect to the basis χ. When a common basis is used to define all the events, the events are compatible.

The beauty of a Hilbert space is that there are many choices for the basis that one can use to describe the space. For example, in Figure 17.2, a new basis using vectors {|U⟩, |V⟩, |W⟩} was introduced to represent the state |S⟩, which was obtained by rotating the original basis {|A⟩, |B⟩, |C⟩} used in Figure 17.1. Suppose χ′ = {|Yᵢ⟩, i = 1, ..., N} is another orthonormal basis for the Hilbert space. If event A is spanned by χ_A ⊂ χ, and event B is spanned by χ′_B ⊂ χ′, then the meet for these two events is not defined; the join for these two events is not defined either (Griffiths, 2003). In this case, the events are not compatible. That is, the projectors for these two events do not commute, M_A M_B ≠ M_B M_A, and the projectors for these two events do not share a common set of eigenvectors. In this case, it is not meaningful to assign a probability simultaneously to the pair of events {A, B} (Dirac, 1958). When the events are incompatible, the principle of unicity breaks down and the events cannot all be described within a single sample space. The events spanned by χ, which are all compatible with each other, form one sample space; the events spanned by χ′, which are compatible with each other, form another sample space; but the events from χ are not compatible with the events from χ′. In this case, there are two stochastically unrelated sample spaces (Dzhafarov & Kujala, 2012), and quantum theory provides a single state |S⟩ that can be used to assign probabilities to both sample spaces.

For noncommutative events, probabilities are assigned to histories or sequences of events using Lüder's rule. Suppose A is an event spanned by χ_A ⊆ χ, and event B is spanned by χ′_B ⊆ χ′. Consider the probability for the sequence of events A followed by B. The probability of the first event A equals q(A) = ‖M_A|X⟩‖²; the revised state, conditioned on observing this event, equals |X_A⟩ = M_A|X⟩/‖M_A|X⟩‖; and the probability of the second event, conditioned on the first event, equals q(B|A) = ‖M_B|X_A⟩‖². Therefore, the probability of event A followed by event B equals

q(A) · q(B|A) = ‖M_A|X⟩‖² · ‖M_B (M_A|X⟩/‖M_A|X⟩‖)‖² = ‖M_B M_A|X⟩‖².   (1)

If the projectors do not commute, then order matters because

q(A) · q(B|A) = ‖M_B M_A|X⟩‖² ≠ ‖M_A M_B|X⟩‖² = q(B) · q(A|B).

For example, referring to Figures 17.1 and 17.2, the probability of event "A or B" and then event V equals

‖M_V(M_A + M_B)|S⟩‖² = ‖V · V† · T‖² = 0.

The probability of the event V and then the event "A or B" equals

‖(M_A + M_B)M_V|S⟩‖² = ‖[0.125/2, −0.125/2, 0]ᵀ‖² = 0.008.

If all events are compatible, then quantum probability theory reduces to classical probability theory. In this sense, quantum probability is a generalization of classical probability theory (Gudder, 1988).
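The order dependence in the worked example can be reproduced directly. The sketch below, assuming the basis vectors of Figure 17.2 as given in the text, computes the two sequence probabilities (variable names are ours):

```python
import numpy as np

r = np.sqrt(0.5)
S = np.array([0.696, 0.696, 0.1765])        # state |S> in the {A, B, C} basis
A = np.array([1.0, 0.0, 0.0])
B = np.array([0.0, 1.0, 0.0])
V = np.array([0.5, -0.5, r])                # "neutral" moral message (Fig. 17.2)

M_AorB = np.outer(A, A) + np.outer(B, B)    # projector for "A or B"
M_V = np.outer(V, V)                        # projector for event V

# "A or B" first, then V: the revised state lands on U, orthogonal to V.
p_ab_then_v = np.linalg.norm(M_V @ M_AorB @ S) ** 2     # 0

# V first, then "A or B": small but nonzero.
p_v_then_ab = np.linalg.norm(M_AorB @ M_V @ S) ** 2     # ~0.008

print(p_ab_then_v, p_v_then_ab)
```

The two orders give different answers because M_V and M_AorB do not commute; with commuting projectors both lines would print the same number.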

Violations of the Law of Total Probability

The quantum axioms do not have to obey the classical law of total probability, in the following manner. Consider an experiment with two different conditions. The first condition simply measures whether event B occurs. The second condition first measures whether A occurs, and then measures whether B occurs. For both conditions, we compute the probability of the event B. For the first condition, this is simply q(B) = ‖M_B|X⟩‖². For the second condition, we could observe the sequence with event A and then event B, with probability q(A) · q(B|A) = ‖M_B M_A|X⟩‖², or we could observe the sequence with event Ā and then event B, with probability q(Ā) · q(B|Ā) = ‖M_B M_Ā|X⟩‖², and so the total probability for event B in the second condition equals the sum of these two ways: q_T(B) = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖². The interference produced in this experiment is defined as the probability of event B observed in the first condition minus the total probability of event B observed in the second condition. According to quantum probability theory, the interference equals Int = q(B) − q_T(B). To analyze this more closely, let us decompose the probability from the first condition as follows:

q(B) = ‖M_B|X⟩‖²
     = ‖M_B(M_A + M_Ā)|X⟩‖²
     = ‖(M_B M_A|X⟩) + (M_B M_Ā|X⟩)‖²
     = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖² + Int,   (2)

Int = ⟨X|M_A M_B M_Ā|X⟩ + ⟨X|M_A M_B M_Ā|X⟩*.

An interference cross-product term, denoted Int, appears in the probability q(B) from the first condition, which produces deviations from the total probability q_T(B) computed from the second condition. This interference term can be positive (constructive interference), negative (destructive interference), or zero (no interference). If the two projectors commute, so that M_A M_B = M_B M_A, then the interference is zero. There is also an interference term associated with the complementary probability:

q(B̄) = ‖M_B̄ M_A|X⟩‖² + ‖M_B̄ M_Ā|X⟩‖² − Int.   (3)

The interference term associated with q(B̄) must be the negative of the interference term associated with q(B), because we must finally obtain

1 = q(B) + q(B̄) = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖² + ‖M_B̄ M_A|X⟩‖² + ‖M_B̄ M_Ā|X⟩‖².

A skeptic might argue that the preceding rules for assigning probabilities to events defined as subspaces are ad hoc, and that maybe there are many other rules one could use. In fact, a famous theorem by Gleason (1957) proves that these are the only rules one can use to assign probabilities to events defined as subspaces using an additive measure (at least for vector spaces of dimension greater than 2).

Now let us turn to a couple of example applications of this theory. Quantum cognition and decision has been applied to a wide range of findings in psychology (see Box 2). In this chapter, we only have space to show two illustrations: an application to probability judgment errors, and another application to violations of rational decision making.

Application to Probability Judgment Errors

Quantum theory provides a unified and coherent account for a broad range of puzzling findings in the area of human judgment. The theory has provided accounts for order effects on attitude judgments (Wang & Busemeyer, 2013; Wang et al., 2014), inference (Trueblood & Busemeyer, 2010), and causal reasoning (Trueblood & Busemeyer, 2011). The same theory has also been used to account for conjunction and disjunction errors found with probability judgments (Franco, 2009), as well as overextension and underextension errors found in conceptual combinations (Aerts, 2009). This section briefly describes how the theory accounts for conjunction and disjunction errors in probabilistic judgments (Busemeyer et al., 2011).

Conjunction and disjunction probability judgment errors are very robust, and they have been found with a wide variety of examples (Tversky & Kahneman, 1983). Here we consider an example in which a judge is provided with a brief story about a hypothetical woman named Linda (circa 1980s):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in antinuclear demonstrations.
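Before turning to the model, the interference bookkeeping of Eqs. 2 and 3 can be verified numerically. The sketch below uses two noncommuting ray projectors in a two-dimensional real space; the angles are our own illustrative choices, not parameters from the chapter:

```python
import numpy as np

def ray_projector(theta):
    """Projector onto the ray at angle theta in a 2-D real space."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(v, v)

X = np.array([1.0, 0.0])                 # initial state |X>
I2 = np.eye(2)

M_A = ray_projector(0.3)                 # event A (hypothetical angle)
M_B = ray_projector(1.2)                 # event B (hypothetical angle)
M_Abar, M_Bbar = I2 - M_A, I2 - M_B      # complements

# Condition 1: measure B directly.
q_B = np.linalg.norm(M_B @ X) ** 2
# Condition 2: measure A (or not-A) first, then B; sum the two paths.
q_T = (np.linalg.norm(M_B @ M_A @ X) ** 2
       + np.linalg.norm(M_B @ M_Abar @ X) ** 2)
Int_B = q_B - q_T                        # interference term, Eq. 2

# The complementary event carries the opposite interference (Eq. 3).
q_Bbar = np.linalg.norm(M_Bbar @ X) ** 2
q_Tbar = (np.linalg.norm(M_Bbar @ M_A @ X) ** 2
          + np.linalg.norm(M_Bbar @ M_Abar @ X) ** 2)
Int_Bbar = q_Bbar - q_Tbar
print(Int_B, Int_Bbar)                   # nonzero, and Int_B == -Int_Bbar
```

Setting the two angles equal (commuting projectors) drives both interference terms to zero, recovering the classical law of total probability.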

Then the judge is asked to rank the likelihood of the following events: Linda is (a) active in the feminist movement, (b) a bank teller, (c) active in the feminist movement and a bank teller, (d) active in the feminist movement and not a bank teller, and (e) active in the feminist movement or a bank teller. The conjunction fallacy occurs when option c is judged to be more likely than option b (even though it can be argued from the classical perspective that the latter contains the former), and the disjunction fallacy occurs when option a is judged to be more likely than option e (again, even though it can be argued that the latter contains the former). For example, in a study by Morier and Borgida (1984), the mean probability judgments were ordered as follows, using J(A) to denote the mean probability judgment for event A: J(feminist) = .83 > J(feminist or bank teller) = .60 > J(feminist and bank teller) = .36 > J(bank teller) = .26 (N = 64 observations per mean, and all pairwise differences are statistically significant). These results violate classical probability theory, which is the reason they are called fallacies. What follows is a simple yet general model for these types of findings.

Box 2. Applications of quantum theory to cognition and decision

1. Choice and decision time (Busemeyer, Wang, & Townsend, 2006; Fuss & Navarro, 2013)
2. Violations of rational decision theory (Pothos & Busemeyer, 2009; Yukalov & Sornette, 2011)
3. Categorization and decision (Busemeyer, Wang, & Lambert-Mogiliansky, 2009)
4. Probability judgment errors (Busemeyer, Pothos, & Trueblood, 2012)
5. Similarity judgments (Pothos, Busemeyer, & Trueblood, 2013)
6. Causal reasoning (Trueblood & Busemeyer, 2011)
7. Bistable perception (Atmanspacher & Filk, 2010)
8. Conceptual combinations (Aerts, Gabora, & Sozzo, 2013)
9. Concept vagueness (Blutner, Pothos, & Bruza, 2013)
10. Associative memory (Bruza, Kitto, Nelson, & McEvoy, 2009)
11. Memory recognition (Brainerd, Wang, & Reyna, 2013)
12. Attitude question order effects (Wang & Busemeyer, 2013; Wang, Solloway, Shiffrin, & Busemeyer, 2014)
13. Order effects on inference (Trueblood & Busemeyer, 2010)
14. (Kvam, Lambert-Mogiliansky, & Busemeyer, 2013)

The first assumption is that after reading the story about Linda, a person forms an initial belief state |S⟩ that represents the person's beliefs about features or properties that may or may not be true of Linda. Formally, this belief state is a vector within an N-dimensional vector space. This belief state is used to answer any possible question that might be asked about Linda.

The second assumption is that a question such as "Is Linda a feminist?" is represented by an N_F < N dimensional subspace of the N-dimensional vector space. This subspace corresponds to a projector M_F that projects the state vector onto the subspace representing the feminist question. The question "Is Linda a bank teller?" is represented by another subspace, of dimension N_B < N, with a corresponding projector M_B.

The third assumption is that the projectors M_F and M_B do not commute, M_F M_B ≠ M_B M_F, and thus the order of their application matters. The reason these two projectors do not commute is the following. The two concepts (feminist, bank teller) are rarely experienced together, and so the person has not formed a compatible representation of beliefs about combinations of both concepts using a common basis. A person may have formed one basis representing features related to feminists, but this basis differs from the basis used to represent features related to bank tellers. The concepts do not share the same basis and so they are incompatible, and the person needs to change from one basis to the other sequentially in order to answer questions about each concept.

The fourth assumption concerns the order in which the concepts are processed when a person is asked "Is Linda a feminist and a bank teller?" Given that the events are incompatible, the person has to pick an order in which to process them. It is assumed that the more likely event is processed first. It is quite easy to judge each individual question, such that J(feminist) > J(bank teller). But the question about

quantum models of cognition and decision 377 “feminist bank teller” is subtler, and this is not as by first projecting Psi onto feminist (F), and then easy as the previous two questions. The judgment projecting onto bank teller, which is shown as the for the conjunction requires forming an additional longer dark grey vertical segment. Note that the and subtler judgment about the conditional proba- projection for the conjunction (dark grey vertical bility J(bank teller given feminist). segment) exceeds the projection for the bank teller These assumptions are now used to derive the alone (light grey vertical segment). quantum predictions for the probability of bank The same theory can also account for the teller (using Eq. 2) : disjunction effect. The event that “Linda is a bank 2 teller or a feminist” is the same as the denial of the q(B) =MB |S ' ' event that “Linda is not a bank teller and not a = | 2 + ' | '2 + feminist.” According to the quantum model, the MBMF S MBMF¯ S Int 1 2 1 2∗ probability of the event “not a bank teller and = | | + | | ¯ ¯ ¯ Int S MF MBMF¯ S S MF MBMF¯ S not a feminist” equals q(B) · q(F|B), and so the − ¯ · ¯ | ¯ According to the quantum model, a conjunction probability of the denial equals 1 q(B) q(F B). error occurs when The disjunction error is predicted to occur when the ' ' probability of “feminist” exceeds the disjunction, =  | 2 + ' | '2 + < ¯ ¯ ¯ ¯ q(B) MBMF S MBMF¯ S Int that is, when q(F) > 1 − q(B) · q(F|B) → q(F) < ' ' ¯ ¯ ¯ ' '2 q(B) · q(F|B). Therefore, according to the quantum M M |S2 → Int < − M M ¯ |S . B F B F model, a disjunction error occurs when Formally, the negative interference term produces ' ' ' ' = ' | '2 + ' | '2 the conjunction error. Intuitively, the Linda story q(F) MF¯ MB S MF¯ MB¯ S ' ' produces a belief state that is almost orthogonal to + < ' | '2 Int MF¯ MB¯ S the subspace for the bank teller event. However, ' ' ' '2 if this state is first projected onto the feminist → Int < − M ¯ MB |S . 
1 F2 1 2 subspace (eliminating some details about Linda that = | | + | | ∗ make it impossible for her to be a bank teller), Int S MB¯ MF¯ MB S S MB¯ MF¯ MB S then it becomes a bit more likely to think that this To account for both of these fallacies using the same feminist can be a bank teller too. principles and parameters, the model must predict Figure 17.3 illustrates how this works using a the following order effect (see Busemeyer et al., simple two-dimensional example (though we stress 2011, appendix) that the specification of the model is general and  | 2 >  | 2 not restricted to one-dimensional subspaces). The MBMF S MF MB S . probability for the bank teller alone question is Intuitively, the probability obtained by first con- determined from the direct projection from the sidering whether Linda is a feminist and then initial state Psi to the bank teller (BT) axis, which considering whether she is a bank teller must is shown as the shorter light grey vertical segment. be greater than the probability obtained by the The probability for the conjunction is represented opposite order. Order effects in this direction have been reported—asking people to judge “is Linda a bank teller” before asking them to judge “is Linda BT a feminist and a bank teller” significantly reduces the size of the conjunction error as compared to the F opposite order (Stolarz-Fantino, Fantino, Zizzo, & Ψ Wen, 2003). The fact that the quantum model can account for both conjunction and disjunction errors using ~BT the same principles and same parameters is a definite advantage over other accounts, such as an averaging model. As described in Busemeyer et al. (2011), there are many other qualitative predictions ~F that can be derived from this model. In particular, because q(F) > q(F)q(B|F), this model cannot Fig. 17.3 Example of conjunction fallacy for a special case with produce double conjunction errors, in which the two dimensions. conjunction is greater than both individual events.
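The two-dimensional geometry just described can be sketched numerically. The angles below are illustrative assumptions (nothing in the chapter fixes them); the point is only that a state nearly orthogonal to the bank-teller ray, but close to the feminist ray, yields a larger probability for the sequential conjunction than for the single event:

```python
import numpy as np

def ray_projector(theta):
    """Projector onto the one-dimensional subspace (ray) at angle theta."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(v, v)

# Illustrative geometry (our assumption, not fitted values): bank-teller axis
# at 0, feminist axis at 40 degrees, Linda state at 80 degrees (nearly
# orthogonal to bank teller, close to feminist).
psi = np.array([np.cos(np.radians(80)), np.sin(np.radians(80))])
M_B = ray_projector(0.0)
M_F = ray_projector(np.radians(40))

q_B = np.linalg.norm(M_B @ psi) ** 2           # bank teller alone
q_FB = np.linalg.norm(M_B @ (M_F @ psi)) ** 2  # feminist, then bank teller

print(round(q_B, 3), round(q_FB, 3))  # → 0.03 0.344
assert q_FB > q_B   # the sequential conjunction exceeds the single event
```

Because M_B·M_F ≠ M_F·M_B, reversing the projection order changes the result, which is the source of the order effect described in the text.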

378 new directions

Empirically, indeed single conjunction errors are much more common and double conjunction errors are infrequent (Yates & Carlson, 1986). Another prediction from this model is that, assuming the conjunction error occurs so that q(F)·q(B|F) > q(B), then q(B|F) ≥ q(B), because q(B|F) ≥ q(F)·q(B|F) > q(B). The intuition here is that, given the detailed knowledge about Linda, it is almost impossible for Linda to be a bank teller; but given that she is viewed more generally as a feminist, it is more likely to think that a feminist also can be a bank teller. This is an important prediction that needs further empirical testing. (See Tentori & Crupi, 2013, for arguments against this prediction.)

The model presented in this section can account for conjunction errors, disjunction errors, averaging effects, and order effects. It is, however, only one of many possible ways to build models of probability judgments using quantum principles. In particular, Aerts (Aerts, 2009) and his colleagues (Aerts & Gabora, 2005) have developed alternative quantum models that account for conjunction and disjunction errors in conceptual combinations. Importantly, their model can produce double conjunction errors; but unfortunately it must change parameters to account for differences between conjunction and disjunction errors. In summary, the quantum axioms provide a common set of general principles that can be implemented in different ways to construct more specific and competing quantum models of the same phenomena. Each of the specific quantum models can be compared with each other and with other classical models with respect to their ability to account for empirical results.

Quantum Dynamics

This section presents the quantum dynamical principles and compares them with Markov processes used in classical dynamical systems. Markov theory provides the basis for constructing a wide variety of classical probability models in cognitive science (e.g., random walk/diffusion models of decision-making). Once again, we restrict this presentation to finite dimensional systems in this chapter. Although finite, the number of dimensions can be very large, and both quantum and Markov processes can readily be extended to infinite dimensional systems. See Busemeyer et al. (2006) and Chapter 7 in Busemeyer and Bruza (2012) for a more comprehensive treatment.

State Space

Both quantum and Markov models begin with a set of N states χ = {|X₁⟩, ..., |X_N⟩}, where the number of states, N, can be very large. According to the Markov model, a state such as |X_i⟩ represents all the information required to characterize the system at some moment, and χ represents the set of all the possible characterizations of the system across time. At any moment in time, the Markov system is exactly located at some specific state in χ, and across time the state changes from one element to another in χ. In comparison, according to the quantum model, a state such as |X_i⟩ represents a basis vector used to describe the system, and the set χ is a set of basis vectors that span an N-dimensional vector space. At any moment in time, the system is in a superposition state, |ψ⟩, which is a point within the vector space spanned by χ, and across time the point |ψ⟩ moves around in the vector space (until a measurement occurs, which reduces the state to the observed basis vector).

Initial State

According to the Markov model, the system starts at some particular element of χ. However, this initial state may be unknown to the investigator, in which case a probability, denoted 0 ≤ φ_i(0) ≤ 1, is assigned to each state |X_i⟩. The N initial probabilities form an N × 1 column matrix

φ(0) = [φ₁(0), ..., φ_N(0)]ᵀ.

It will be convenient to define a 1 × N row matrix J = [1 ⋯ 1], which is used for summation. More generally, φ(t) represents the probability distribution across states in χ at time t. The Markov model requires this probability distribution to sum to unity: J·φ(t) = 1.

According to the quantum model, the system starts in a superposition state |ψ(0)⟩ = Σ_i ψ_i(0)·|X_i⟩, where ψ_i is the coordinate (called an amplitude) assigned to the basis vector |X_i⟩. The N amplitudes for the initial state form an N × 1 column matrix

ψ(0) = [ψ₁(0), ..., ψ_N(0)]ᵀ.

More generally, ψ(t) represents the amplitude distribution across basis vectors in χ at time t. The quantum model requires the squared length of this amplitude distribution to equal unity: ψ(t)†·ψ(t) = 1.
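As a minimal check of the two normalization conventions, one can verify J·φ(0) = 1 for a probability distribution and ψ(0)†·ψ(0) = 1 for an amplitude distribution; the uniform four-state values below are our illustrative choice:

```python
import numpy as np

# Markov state: an N = 4 probability distribution; entries sum to one.
phi0 = np.array([.25, .25, .25, .25])
J = np.ones(4)                     # the 1 x N row matrix of ones
assert np.isclose(J @ phi0, 1.0)   # J . phi(0) = 1

# Quantum state: an N = 4 amplitude distribution; squared length equals one.
psi0 = np.array([.5, .5, .5, .5])  # amplitudes may in general be complex
assert np.isclose(np.vdot(psi0, psi0).real, 1.0)  # psi(0)^dagger . psi(0) = 1

# The same uniform belief gives probabilities .25 but amplitudes .5 = sqrt(.25):
# squaring the moduli of the amplitudes recovers the probabilities.
assert np.allclose(np.abs(psi0) ** 2, phi0)
```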

State Transitions

According to the Markov model, the probability distribution across states evolves across time according to the linear transition law

φ(t + τ) = T(t + τ, t)·φ(t),

where T(t + τ, t) is a transition matrix with element T_ij representing the probability of transiting to a state in row i from a state in column j. The transition matrix of a Markov model is called a stochastic matrix because the columns of T(t + τ, t) must sum to one, which guarantees that the resulting probability distribution continues to sum to one, that is, J·φ(t + τ) = 1 (recall that J = [1 1 ⋯ 1]). (The rows, however, are not required to sum to one.) In many applications, it is assumed that the transition matrix is stationary, so that T(t₂ + τ, t₂) = T(t₁ + τ, t₁) = T(τ) for all t and τ.

According to the quantum model, the amplitude distribution evolves across time according to the linear transition law

ψ(t + τ) = U(t + τ, t)·ψ(t),

where U(t + τ, t) is a unitary matrix with element U_ij representing the amplitude for transiting to row i from column j. The unitary matrix must satisfy the unitary property U†·U = I (I is the identity matrix) in order to guarantee that ψ(t)†·ψ(t) = 1. That is, the columns are unit length and each pair of columns is orthogonal. A transition matrix can be formed from the unitary matrix by taking the squared modulus of each of the cell entries of U(t + τ, t). The transition matrix formed in this manner is doubly stochastic: both the rows and columns of this transition matrix must sum to unity. This is a more restrictive constraint on the transition matrix as compared to the Markov model. In many applications, it is assumed that the unitary matrix is stationary, so that U(t₂ + τ, t₂) = U(t₁ + τ, t₁) = U(τ) for all t and τ.

According to the Markov model, the stationary transition matrix obeys the Kolmogorov forward equation

(d/dt) T(t) = K·T(t),

where K is the intensity matrix, with element K_ij satisfying K_ij ≥ 0 for i ≠ j and with each column summing to zero, to guarantee that T(t) remains a transition matrix.

According to the quantum model, the stationary unitary matrix obeys the Schrödinger equation

(d/dt) U(t) = −i·H·U(t),

where H is the Hamiltonian matrix, which is a Hermitian matrix, H† = H, to guarantee that U(t) is a unitary matrix. This is where complex numbers enter in a significant way into quantum models.

For the Markov model, the solution to the Kolmogorov forward equation is the following matrix exponential function

T(t) = exp(t·K).

For the quantum model, the solution to the Schrödinger equation is the following complex matrix exponential function

U(t) = exp(−i·t·H).

In summary, the probability distribution across states for the Markov model at time t equals

φ(t) = exp(t·K)·φ(0),

and likewise the amplitude distribution across states for the quantum model at time t equals

ψ(t) = exp(−i·t·H)·ψ(0).

The most important step for building a dynamic model is specifying the intensity matrix for the Markov model or specifying the Hamiltonian matrix for the quantum model. Here psychological science enters by developing a mapping from the psychological factors onto the parameters that define these matrices. An example is provided following this section to illustrate this model development process.

Response Probabilities

Consider the probability of observing the response R_k at time t, which is denoted p(R(t) = R_k). In this section, we use the same choice probability notation for both the Markov and quantum models. Assume that φ(t) is the current probability distribution for the Markov model and ψ(t) is the current amplitude distribution for the quantum model at time t. Both the Markov and quantum models determine the probability of a response by evaluating the set of states that map onto that particular response. Suppose a subset of states, χ_k ⊂ χ, is mapped onto a response R_k. Define M_k as an N × N indicator matrix, which is a diagonal matrix with ones on the diagonal corresponding to

the states mapped onto the response R_k, and zeros everywhere else. Then according to the Markov model, the response probability equals (recall J = [1 1 ⋯ 1])

p(R(t) = R_k) = J·M_k·φ(t).

If, in fact, the response R_k is observed at time t, then the new probability distribution, conditioned on this observation, equals

φ(t|R_k) = M_k·φ(t) / p(R(t) = R_k).

According to the quantum model, the response probability equals

p(R(t) = R_k) = ‖M_k·ψ(t)‖².

If, in fact, the response R_k is observed at time t, then the new amplitude distribution, conditioned on this observation, equals

ψ(t|R_k) = M_k·ψ(t) / √(p(R(t) = R_k)).

The conditional states, φ(t|R_k) for the Markov model and ψ(t|R_k) for the quantum model, then become the "initial" states to be used for further evolution and future observations.

Application to Decision Making

This section examines two puzzling findings from decision research. One is the violation of the "sure thing" principle (Tversky & Shafir, 1992). Savage (1954) introduced the "sure thing" principle as a rational axiom for the foundation of decision theory. According to the sure thing principle, if under state of the world X you prefer action A over B, and if under the complementary state of the world X̄ you also prefer action A over B, then you should prefer action A over B even when you do not know the state of the world. A violation of the sure thing principle occurs when A is preferred over B for each known state of the world, but the opposite preference occurs when the state of the world is unknown.

The other puzzling finding is the violation of the principle of dynamic consistency, called dynamic inconsistency. Dynamic consistency is considered in standard theory to be a rational principle for making dynamic decisions involving a sequence of actions and events over time. According to the backward induction algorithm used to form optimal plans with dynamic decisions, a person works backward, making plans at the end of the sequence in order to decide actions at the beginning of the sequence. To be dynamically consistent, when reaching the decisions at the end of the sequence, one should follow through on the plan that was used to make the decision at the beginning of the sequence. Violations of dynamic consistency occur when people change plans and fail to follow through on a plan once they arrive at the final decisions.

Two-Stage Gambling Paradigm

Tversky and Shafir (1992) experimentally investigated the sure thing principle using a two-stage gamble. They presented 98 students with a target gamble that had an equal chance of winning $200 or losing $100 (they used hypothetical money). The students were asked to imagine that they had already played the target gamble once, and now they were asked whether they wanted to play the same gamble a second time. Each person experienced three conditions that were separated by a week and mixed with other decision problems to produce independent decisions. They were asked if they wanted to play the gamble a second time, given that they won the first play (Condition 1: known win), given that they lost the first play (Condition 2: known loss), and when the outcome of the first play was unknown (Condition 3: unknown). If they thought they won the first gamble, the majority (69%) chose to play again; if they thought they lost the first gamble, then again the majority (59%) chose to play again; but if they didn't know whether they won or lost, then the majority chose not to play (only 36% wanted to play again). Tversky and Shafir (1992) explained these findings by claiming that people fail to follow through on consequential reasoning. When a person knows she/he has won the first gamble, then a reason to play again arises from the fact that she/he has extra house money to play with. When the person knows she/he has lost the first gamble, then a reason to play again arises from the fact that she/he needs to recover her/his losses. When the person does not know the outcome of the first play, these reasons fail to arise. However, why not?

Pothos and Busemeyer (2009) explained these and other results found by Shafir and Tversky (1992) using the concept of quantum interference.
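Before turning to the application, the dynamical and measurement rules of the two preceding sections can be sketched end to end. The intensity matrix K and Hamiltonian H below are arbitrary illustrative choices (assumptions, not psychologically derived), and the matrix exponential is computed by eigendecomposition so the sketch stays self-contained:

```python
import numpy as np

def expm_eig(A, t=1.0):
    """Matrix exponential exp(t*A) via eigendecomposition (A diagonalizable)."""
    lam, V = np.linalg.eig(A)
    return (V * np.exp(t * lam)) @ np.linalg.inv(V)

# --- Markov model: intensity matrix K (off-diagonals >= 0, columns sum to 0)
K = np.array([[-1.0,  0.5],
              [ 1.0, -0.5]])
T = expm_eig(K, t=2.0).real             # T(t) = exp(t*K) at t = 2
phi = T @ np.array([.5, .5])            # phi(t) = T(t) . phi(0)
assert np.allclose(T.sum(axis=0), 1.0)  # columns of T(t) sum to one
assert np.isclose(phi.sum(), 1.0)       # J . phi(t) = 1

# --- Quantum model: Hermitian Hamiltonian H
H = np.array([[1.0, 0.7],
              [0.7, -1.0]])
U = expm_eig(H, t=-2.0j)                # U(t) = exp(-i*t*H) at t = 2
psi = U @ np.sqrt(np.array([.5, .5]))   # psi(t) = U(t) . psi(0)
assert np.allclose(U.conj().T @ U, np.eye(2))   # unitary: U^dagger U = I
P = np.abs(U) ** 2                      # squared moduli of the entries of U
assert np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)  # doubly stochastic

# --- Response probabilities and conditioning, with M_k picking out state 0
M_k = np.diag([1.0, 0.0])
p_markov = np.ones(2) @ M_k @ phi               # p = J . M_k . phi(t)
phi_cond = (M_k @ phi) / p_markov               # conditioned distribution
p_quantum = np.linalg.norm(M_k @ psi) ** 2      # p = ||M_k . psi(t)||^2
psi_cond = (M_k @ psi) / np.sqrt(p_quantum)     # conditioned amplitudes
assert np.isclose(phi_cond.sum(), 1.0)
assert np.isclose(np.vdot(psi_cond, psi_cond).real, 1.0)
```

Note how the same pipeline (initial state, matrix exponential, indicator matrix, renormalization) serves both models; only the normalization rule and the use of complex numbers differ.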

Referring back to the section Violations of the Law of Total Probability, define the event B as deciding to play the gamble on the second stage, define event A as winning the first play, and define event Ā as losing the first play. Then Eq. 2 expresses the probability of playing the gamble on the second stage for the unknown condition in terms of the total probability, q_T(B), of playing the second stage on either of the two known conditions, plus the interference term Int. Given that the probability of winning the first stage equals .50, a violation of the sure thing principle is predicted whenever q_T(B) > .50 and the interference term Int is sufficiently negative so that q(B) < .50. But what determines the interference term? To answer this question, Pothos and Busemeyer (2009) developed a dynamic quantum model to account for the violation of the sure thing principle. This model is described in detail later; but before presenting these modeling details, let us first examine the second puzzling finding regarding violations of dynamic consistency. The same model is used to explain both findings.

Barkan and Busemeyer (1999, 2003) used the same two-stage gambling paradigm to study another phenomenon called dynamic inconsistency, which occurs whenever a person changes plans during decision-making. Each study included a total of 100 people, and each person played a series of gambles twice. Each gamble had an equal chance of producing a win or a loss (e.g., an equal chance to win 200 points or lose 100 points, where each point was worth $0.01). Different gambles were formed by changing the amounts to win or lose. For each gamble, the person was forced to play the first round, and then, contingent on the outcome of the first round, they were given a choice whether to take the same gamble on the second round. Choices were made under two conditions: a planned versus a final choice. For the planned choice, contingent on winning the first round, the person had to select a plan about whether to take or reject the gamble on the second round; contingent on losing the first round, the person had to make another plan about whether to take or reject the gamble on the second round. Then the first-stage gamble was actually played out and the actual win or loss was revealed. For the final choice, after actually experiencing the win on the first round, the person made a final decision to take or reject the second round. Likewise, after actually experiencing a loss on the first round, the person had to decide whether to take or reject the gamble on the second round. The plan and the final decisions were made equally valuable because the experimenter randomly selected either the planned action or the final action to determine the final payoff with real money at stake.

The results showed that people violate the dynamic consistency principle: Following an actual win, they changed from planning to take to finally rejecting the second stage; following an actual loss, they changed from planning to reject to finally taking the second stage. For example, Table 17.1 shows the results from the four gambles used by Barkan and Busemeyer (1999). The first two columns show the amounts to win or lose, the next two columns show the probability of taking the gamble under the plan (conditioned on a planned win or loss), and the last two columns show the probability to take the gamble for the final decision (conditioned on an experienced win or loss). Similar results were found by Barkan and Busemeyer (2003) using 17 different gambles. For later reference, we will denote the amount of the win by x_W and the amount of the loss by x_L. So, for example, in the first row, x_W = 80 and x_L = 100.

It is worth mentioning that the results shown in Table 17.1 once again demonstrate a violation of the classical law of total probability in the following way. If the law of total probability holds, then the probability of taking the gamble during the plan (denoted p(T|P) and shown in the columns labeled "Plan Win" and "Plan Loss") equals the probability of winning the first play (denoted p(W), which was stated to be equal to .50) times the probability to take the gamble after a win (denoted p(T|W) and shown under the column "Final Win" in Table 17.1), plus the probability of losing the first play (denoted p(L) = 1 − p(W)) times the probability of taking the gamble following a loss (denoted p(T|L) and shown under the column "Final Loss" in Table 17.1), so that p(T|P) = p(W)·p(T|W) + p(L)·p(T|L). All the gambles have the same probability of winning, and p(W) is fixed across gambles and is stated in the problem to be equal to .50. However, these assumptions fail to reproduce the findings shown in Table 17.1. For example, we require p(W) = .64 to reproduce the data in the first row, but we require p(W) = .43 to reproduce the third row, and we require p(W) = .31 to reproduce the data in the fourth row; and, even worse, no legitimate value of p(W) can be found to reproduce the second row.

Markov Dynamic Model for Two-Stage Gambles

First let us construct a general Markov model for this two-stage gambling task.

Table 17.1. Barkan and Busemeyer (1999).

Win   Lose   Plan Win   Plan Loss   Final Win   Final Loss
 80    100     .25        .26         .20         .35
 80     40     .76        .72         .69         .73
200    100     .68        .68         .60         .75
200     40     .84        .86         .76         .89
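The failure described in the text can be checked by solving p(T|P) = p(W)·p(T|W) + (1 − p(W))·p(T|L) for p(W) in each row of Table 17.1. Averaging the two plan columns into a single p(T|P) is our simplifying assumption (the chapter does not state how its quoted values were computed, so the implied values below differ slightly from the .64, .43, and .31 quoted in the text):

```python
# Each row of Table 17.1: (win, lose, plan_win, plan_loss, final_win, final_loss)
rows = [
    (80, 100, .25, .26, .20, .35),
    (80,  40, .76, .72, .69, .73),
    (200, 100, .68, .68, .60, .75),
    (200,  40, .84, .86, .76, .89),
]

implied = []
for _win, _lose, plan_w, plan_l, t_w, t_l in rows:
    plan = (plan_w + plan_l) / 2                 # assumed single plan probability
    implied.append((plan - t_l) / (t_w - t_l))   # solve for p(W)

print([round(p, 2) for p in implied])  # → [0.63, -0.25, 0.47, 0.31]
# No single p(W) = .50 works, and row 2 demands a negative "probability":
assert not 0 <= implied[1] <= 1
assert all(0 <= p <= 1 for i, p in enumerate(implied) if i != 1)
```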

The Markov model uses a four-dimensional state space with states

{|B_W A_T⟩, |B_W A_R⟩, |B_L A_T⟩, |B_L A_R⟩},

where |B_W A_T⟩ represents the state "believe you win the first gamble and act to take the second gamble," |B_W A_R⟩ represents the state "believe you win the first gamble and act to reject the second gamble," |B_L A_T⟩ represents the state "believe you lose the first gamble and act to take the second gamble," and |B_L A_R⟩ represents the state "believe you lose the first gamble and act to reject the second gamble." The probability distribution over states is represented by a 4 × 1 column matrix (that sums to unity) composed of two parts

φ = φ_W + φ_L,  φ_W = [φ_WT, φ_WR, 0, 0]ᵀ,  φ_L = [0, 0, φ_LT, φ_LR]ᵀ.

Before evaluating the payoffs of the gamble, the decision maker has an initial state represented by φ(0). This initial state depends on information about the outcome of the first play. If the outcome of the first play is unknown (i.e., the planning stage), then the initial state is set equal to φ(0) = φ_U, which has coordinates φ_WT = φ_WR = φ_LT = φ_LR = 1/4. If the first play is known to be a win, then the initial state is set equal to φ(0) = φ_W with coordinates φ_WT = φ_WR = 1/2, φ_LT = φ_LR = 0. If the first play is known to be a loss, then the initial state is set equal to φ(0) = φ_L with coordinates φ_WT = φ_WR = 0, φ_LT = φ_LR = 1/2.

The probabilities of taking the gamble, depending on the win or lose first-play belief states, are then determined by a transition matrix

T(t) = [T_W 0; 0 T_L],

where T_W is a 2 × 2 transition matrix conditioned on winning, and T_L is a 2 × 2 transition matrix conditioned on losing. The matrix that picks out the states corresponding to the action of "taking the gamble" is represented by

M = [M_T 0; 0 M_T],  M_T = [1 0; 0 0].

Recall that J = [1 1 1 1] sums across states to obtain the probability of a response. Finally, the Markov model predicts:

p(T|W) = J·M·T(t)·φ_W = J·M_T·T_W·[.50, .50]ᵀ,

p(T|L) = J·M·T(t)·φ_L = J·M_T·T_L·[.50, .50]ᵀ,

and

p(T|U) = J·M·T(t)·φ_U
       = J·M·T(t)·([.25, .25, 0, 0]ᵀ + [0, 0, .25, .25]ᵀ)
       = (.50)·J·M_T·T_W·[.50, .50]ᵀ + (.50)·J·M_T·T_L·[.50, .50]ᵀ
       = (.50)·p(T|W) + (.50)·p(T|L).

The last line shows that the Markov model must satisfy the law of total probability. Note that the Markov model must always obey the law of total probability and, thus, already qualitatively fails to account for the violation of the sure thing principle and the dynamic inconsistency results described earlier.
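The last identity can be checked directly: for any pair of 2 × 2 column-stochastic matrices T_W and T_L (the numerical values below are arbitrary assumptions), the Markov prediction for the unknown condition equals the average of the two known conditions.

```python
import numpy as np

J = np.ones(4)                              # row of ones, sums across states
M_T = np.array([[1.0, 0.0], [0.0, 0.0]])    # picks out the "take" state
Z = np.zeros((2, 2))
M = np.block([[M_T, Z], [Z, M_T]])

def col_stochastic(p_take_if_take, p_take_if_reject):
    """A 2 x 2 column-stochastic transition matrix (illustrative values)."""
    return np.array([[p_take_if_take, p_take_if_reject],
                     [1 - p_take_if_take, 1 - p_take_if_reject]])

T_W = col_stochastic(0.9, 0.5)              # assumed values, not fitted
T_L = col_stochastic(0.4, 0.2)
T = np.block([[T_W, Z], [Z, T_L]])

phi_W = np.array([.5, .5, 0, 0])
phi_L = np.array([0, 0, .5, .5])
phi_U = np.array([.25, .25, .25, .25])

p_T_W = J @ M @ T @ phi_W
p_T_L = J @ M @ T @ phi_L
p_T_U = J @ M @ T @ phi_U
# the law of total probability holds regardless of the choice of T_W and T_L:
assert np.isclose(p_T_U, .5 * p_T_W + .5 * p_T_L)
```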

Quantum Dynamic Model for Two-Stage Gambles

Pothos and Busemeyer (2009) developed a quantum dynamic model that has been applied to the two-stage gambling task. The quantum model also uses a four-dimensional vector space spanned by four basis vectors

{|B_W A_T⟩, |B_W A_R⟩, |B_L A_T⟩, |B_L A_R⟩},

where |B_W A_T⟩ represents the event "believe you win the first gamble and act to take the second gamble," |B_W A_R⟩ represents the event "believe you win the first gamble and act to reject the second gamble," |B_L A_T⟩ represents the event "believe you lose the first gamble and act to take the second gamble," and |B_L A_R⟩ represents the event "believe you lose the first gamble and act to reject the second gamble." The decision-maker's state is a superposition over these four basis states:

|ψ⟩ = ψ_WT·|B_W A_T⟩ + ψ_WR·|B_W A_R⟩ + ψ_LT·|B_L A_T⟩ + ψ_LR·|B_L A_R⟩.

The matrix representation of this superposition state is the 4 × 1 column matrix (length equal to one) composed of two parts

ψ = ψ_W + ψ_L,  ψ_W = [ψ_WT, ψ_WR, 0, 0]ᵀ,  ψ_L = [0, 0, ψ_LT, ψ_LR]ᵀ.

Before evaluating the payoffs of the gamble, the decision maker has an initial state represented by ψ(0). This initial state depends on information about the outcome of the first play. If the outcome of the first play is unknown (i.e., the planning stage), then the initial state is set equal to ψ(0) = ψ_U, which has coordinates ψ_WT = ψ_WR = ψ_LT = ψ_LR = .50. If the first play is known to be a win, then the initial state is set equal to ψ(0) = ψ_W with coordinates ψ_WT = ψ_WR = √.50, ψ_LT = ψ_LR = 0. If the first play is known to be a loss, then the initial state is set equal to ψ(0) = ψ_L with coordinates ψ_WT = ψ_WR = 0, ψ_LT = ψ_LR = √.50.

Evaluation of the gamble payoffs causes the initial state ψ(0) to evolve into a final state ψ(t) after a period of deliberation time t, and this final state is used to decide whether to take or reject the gamble at the second stage. The Hamiltonian H used for this evolution is H = H₁ + H₂, where

H₁ = [H_W 0; 0 H_L],

H_W = (1/√(1 + h_W²))·[h_W 1; 1 −h_W],

H_L = (1/√(1 + h_L²))·[h_L 1; 1 −h_L],

H₂ = (−c/√2)·[1 0 1 0; 0 −1 0 1; 1 0 −1 0; 0 1 0 1].

The matrix H_W in the upper left corner of H₁ rotates the state toward taking or rejecting the gamble depending on the final payoffs (x_W + x_W, x_W − x_L), given an initial win of the amount x_W from the first play. The coefficients h_W and h_L in the Hamiltonian H₁ are supposed to range between −1 and +1, so we need to map the utility differences into this range. The hyperbolic tangent provides a smooth S-shaped mapping. We then define h_W in terms of the utility difference following a win as follows:

h_W = 2/(1 + e^(−D_W)) − 1,  D_W = u_W − x_W^a,

u_W = (1/2)·(x_W + x_W)^a + (1/2)·(x_W − x_L)^a   if x_W > x_L,
u_W = (1/2)·(x_W + x_W)^a − (1/2)·b·(x_L − x_W)^a   if x_L > x_W.

The variable u_W is the utility of playing the gamble after a win, which uses a risk-aversion parameter a and a loss-aversion parameter b. The variable D_W is the difference between the utility of taking and rejecting the gamble after a win (rejecting after a win leaves the gain x_W, with utility x_W^a). The matrix H_L in the bottom right corner of H₁ rotates the state toward taking or rejecting the gamble depending on the final payoffs (x_W − x_L, −x_L − x_L), given an initial loss of the amount x_L from the first play. Once again, using the hyperbolic tangent, we map the utility differences following a loss into h_L as follows:

h_L = 2/(1 + e^(−D_L)) − 1,  D_L = u_L + b·x_L^a,

u_L = (1/2)·(x_W − x_L)^a − (1/2)·b·(x_L + x_L)^a   if x_W > x_L,
u_L = −(1/2)·b·(x_L − x_W)^a − (1/2)·b·(x_L + x_L)^a   if x_L > x_W.

The variable u_L is the utility of playing the gamble after a loss (rejecting after a loss leaves the loss −x_L, with utility −b·x_L^a), which uses the same risk aversion

parameter a and the same loss-aversion parameter b. The variable D_L is the difference between the utility of taking and rejecting the gamble after a loss.

The matrix H₂ is designed to align beliefs with actions. This produces a type of "hot hand" effect. The parameter c determines the extent to which beliefs can change from their initial values during the evaluation process, and it is critical for producing interference effects. Critically, if the parameter c is set to zero, then the quantum model reduces to a special case of a Markov model, the law of total probability holds, and there are no interference effects. According to the quantum-model hypothesis, a nonzero value of this parameter c is expected, which will reproduce the 17 different quantum interference terms for the 17 different gambles.

The initial state evolves to the final state according to the unitary evolution

ψ(t) = exp(−i·t·H)·ψ(0).

Following Pothos and Busemeyer (2009), the deliberation time was set equal to t = π/2, because at this time point the evolution of preference first reaches an extreme point. The projector for choosing to gamble is represented by the indicator matrix that picks out the "take gamble" action

M = [M_T 0; 0 M_T],  M_T = [1 0; 0 0].

The probability of taking the gamble for the known win, known loss, and unknown (plan) conditions then equals

p(T|W) = ‖M·exp(−i·t·H)·ψ_W‖²,

p(T|L) = ‖M·exp(−i·t·H)·ψ_L‖²,

p(T|U) = ‖M·exp(−i·t·H)·ψ_U‖².

The parameter c is critical for producing violations of the law of total probability. If we set the quantum-model parameter c = 0, then H = H₁ is block diagonal, and the quantum model predicts

p(T|U) = ‖ [M_T 0; 0 M_T] · [exp(−i·t·H_W) 0; 0 exp(−i·t·H_L)] · ([.50, .50, 0, 0]ᵀ + [0, 0, .50, .50]ᵀ) ‖²
       = (.50)·‖M_T·exp(−i·t·H_W)·[√.50, √.50]ᵀ‖² + (.50)·‖M_T·exp(−i·t·H_L)·[√.50, √.50]ᵀ‖²
       = (.50)·p(T|W) + (.50)·p(T|L).

Therefore, if c ≠ 0, the quantum model violates the law of total probability; but if c = 0, the quantum model satisfies the law of total probability. In fact, when c = 0, the Markov model can reproduce the predictions of the quantum model by setting each element of the first row of the transition matrix T_W equal to the p(T|W) predicted by the quantum model, and by setting each element of the first row of the transition matrix T_L equal to the p(T|L) predicted by the quantum model. In other words, we can obtain a Markov model from the quantum model by setting c = 0.

Model Comparisons

Next, we compare the Markov model (obtained by setting c = 0 in the quantum model) and the quantum model (allowing c ≠ 0) with respect to their fits to the Barkan and Busemeyer (2003) results in three different ways. The first is to compare least-squares fits to the 17 (gambles with different payoff conditions) × 2 (plan versus final) = 34 mean choice proportions reported in Barkan and Busemeyer (2003). The second is to compare the models using maximum likelihood estimates at the individual level and using AIC and BIC methods. The third is to estimate the hierarchical Bayesian posterior distribution for the critical parameter c that distinguishes the two models.

The models are first compared using R² = 1 − SSE/TSS and adjusted R² = 1 − (SSE/TSS)·(34 − 1)/(34 − n). The latter index includes a penalty term for extra parameters. These statistics were computed with SSE = Σ(P_i − p_i)² and TSS = Σ(P_i − P̄)², where P_i is the observed mean proportion of trials to choose gamble i = 1, ..., 34, p_i is the predicted mean proportion, P̄ is the grand mean, and 34 = 17 (payoff conditions) × 2 (plan versus final stage choices) is the number of observed choice proportions being fit. The quantum model has n = 3 parameters, and the best-fitting parameters (minimizing the sum of squared error) are a = 0.71 (risk aversion), b = 2.54 (loss aversion), and c = −4.40. The risk-aversion parameter a is a bit below one, as expected, and the loss-aversion parameter b exceeds one, as it should be.
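The model can be assembled directly from the equations above. The payoffs below use the Tversky and Shafir amounts (win $200, lose $100) scaled to units of $100; that scaling, and the sign convention for D_L, are our reading of the text, not stated choices of the authors. With c = 0 the law of total probability must hold exactly, while c ≠ 0 (here the best-fitting c = −4.40) produces an interference term:

```python
import numpy as np

def unitary(H, t):
    """exp(-i*t*H) for a real symmetric (Hermitian) H, via eigendecomposition."""
    lam, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * t * lam)) @ V.conj().T

def hamiltonian(x_W, x_L, a, b, c):
    # utilities of playing again after a win / after a loss (Eqs. above)
    if x_W > x_L:
        u_W = .5 * (x_W + x_W)**a + .5 * (x_W - x_L)**a
        u_L = .5 * (x_W - x_L)**a - .5 * b * (x_L + x_L)**a
    else:
        u_W = .5 * (x_W + x_W)**a - .5 * b * (x_L - x_W)**a
        u_L = -.5 * b * (x_L - x_W)**a - .5 * b * (x_L + x_L)**a
    D_W = u_W - x_W**a        # taking minus rejecting, after a win
    D_L = u_L + b * x_L**a    # taking minus rejecting, after a loss (our sign reading)
    h_W = 2 / (1 + np.exp(-D_W)) - 1
    h_L = 2 / (1 + np.exp(-D_L)) - 1
    Z = np.zeros((2, 2))
    H_w = np.array([[h_W, 1], [1, -h_W]]) / np.sqrt(1 + h_W**2)
    H_l = np.array([[h_L, 1], [1, -h_L]]) / np.sqrt(1 + h_L**2)
    H1 = np.block([[H_w, Z], [Z, H_l]])
    H2 = (-c / np.sqrt(2)) * np.array([[1, 0, 1, 0], [0, -1, 0, 1],
                                       [1, 0, -1, 0], [0, 1, 0, 1]])
    return H1 + H2

M = np.diag([1.0, 0.0, 1.0, 0.0])        # picks out the two "take" states
psi_W = np.array([1, 1, 0, 0]) / np.sqrt(2)
psi_L = np.array([0, 0, 1, 1]) / np.sqrt(2)
psi_U = np.full(4, .5)
t = np.pi / 2                            # deliberation time from the text

def p_take(H, psi0):
    return np.linalg.norm(M @ unitary(H, t) @ psi0) ** 2

for c in (0.0, -4.40):                   # best-fit a, b, c from the text
    H = hamiltonian(x_W=2.0, x_L=1.0, a=0.71, b=2.54, c=c)
    pW, pL, pU = (p_take(H, s) for s in (psi_W, psi_L, psi_U))
    print(f"c = {c}: p(T|U) = {pU:.3f}, average of knowns = {(pW + pL) / 2:.3f}")
    if c == 0.0:  # block-diagonal H: law of total probability holds exactly
        assert np.isclose(pU, (pW + pL) / 2)
```

The c = 0 run reproduces the derivation above (no interference); the c = −4.40 run generally yields p(T|U) different from the average of the two known conditions, which is the quantum interference term Int.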

The model produced an R² = 0.8234 and an adjusted R² = 0.8120. The Markov model, obtained by setting c = 0 in the quantum model, has only two parameters, and it produced an R² = 0.7854 and an adjusted R² = 0.7787, which are lower than those of the quantum model.

Next, the models were compared using AIC and BIC methods based on maximum likelihood fits to individuals. For person i on trial t we observe a data pattern X_i(t) = [x_TT(t), x_TR(t), x_RT(t), x_RR(t)], defined by x_jk(t) = 1 if event (j, k) occurs and zero otherwise, where TT is the event "planned to take the gamble and finally took the gamble," TR is the event "planned to take the gamble but finally rejected the gamble," RT is the event "planned to reject the gamble but finally took the gamble," and RR is the event "planned to reject the gamble and finally rejected the gamble." To allow for possible dependencies between a pair of choices within a single trial, an additional memory recall parameter, m, was included in each model. For both models, it was assumed that there is some probability 0 ≤ m ≤ 1 that the person simply recalls and repeats the planned choice during the final choice, and there is some probability 1 − m that the person forgets or ignores the planned choice when making the final choice. After including this memory parameter, the prediction for each event becomes

p_TT = p(T|plan)·(m·1 + (1 − m)·p(T|final)),
p_TR = p(T|plan)·(1 − m)·p(R|final),
p_RT = p(R|plan)·(1 − m)·p(T|final),
p_RR = p(R|plan)·(m·1 + (1 − m)·p(R|final)).

Using these definitions for each model, the log likelihood function for the 33 trials (with a pair of plan and final choices on each trial) from a single person can be expressed as

lnL(X_i(t)) = Σ_{j,k} x_jk(t)·ln(p_jk),

lnL(X_i) = Σ_{t=1}^{33} lnL(X_i(t)).

The log likelihood from each person was converted into G_i² = −2·ln(L_i), which indexes the lack of fit, and the parameters that minimized G_i² were found for each person. The quantum model has one more parameter than the Markov model. In this case, the AIC badness-of-fit index is defined as G_i² + 2, where 2 is the penalty for the one extra parameter. Using AIC, 48 out of the 100 participants produced AIC indices favoring the quantum model over the Markov model. The BIC penalty depends on the number of observations, which is 33 for each person, and so for one extra parameter the penalty equals log(33) = 3.4965. Using the more conservative BIC index, 22 out of the 100 participants produced BIC indices favoring the quantum model over the Markov model. Thus, a majority of participants were adequately fit by the Markov model, but a substantial percentage of participants were better fit by the quantum model.

One final method used to compare the models is to examine the posterior distribution of the parameter c when estimated by hierarchical Bayesian methods. The details for this analysis are described in Busemeyer, Wang, and Trueblood (2012), and the results are only briefly summarized here. The hierarchical Bayesian estimation method starts by assuming a prior distribution over the individuals for each of the four quantum model parameters. Then, the likelihoods from the individual fits are used to update the prior distribution into a posterior distribution over the individuals for the four parameters. The posterior distribution of the critical quantum parameter c is shown in Figure 17.4. The entire distribution lies below zero, and the mean of the distribution equals −2.67. This supports the hypothesis that the critical quantum parameter, c, is not zero, and the model does not reduce to the Markov model.

It is worth noting that the same quantum model also accounts for the violations of the sure-thing principle, whereas the Markov model cannot explain this violation. Furthermore, the same quantum model described here was used to explain two other puzzling findings (not reviewed here). One is concerned with order effects on inference (Trueblood & Busemeyer, 2010), and the other is the interference of categorization on decision making (Busemeyer et al., 2009). In sum, the same quantum model has been successfully applied to four distinct puzzling judgment and decision findings, which builds confidence in the broad applicability of the model.

Concluding Comments

This chapter provides a brief introduction to the basic principles of quantum theory and a few major paradoxical judgment and decision findings that

386 new directions

[Figure 17.4 appears here: a histogram with posterior probability (0 to 0.2) on the y-axis and the quantum parameter c (−5 to 5) on the x-axis.]

Fig. 17.4 Posterior distribution of quantum model parameter c across individuals.

the theory has been used to explain. The theory is new and needs further testing, but the initial successful applications demonstrate its viability and theoretical potential. Busemeyer and Bruza (2012) provide a more detailed presentation of the basic principles, and they also describe in detail a much larger number of empirical applications. Also, Pothos and Busemeyer (2013) summarize applications of quantum theory to cognitive science. Finally, special issues on quantum cognition have recently appeared in Journal of Mathematical Psychology (Bruza, Busemeyer, & Gabora, 2009) and Topics in Cognitive Science (Wang, Busemeyer, Atmanspacher, & Pothos, 2013).

What are the advantages and disadvantages of the quantum approach as compared to traditional cognitive theories? First, let us consider some of the disadvantages. One is that the concepts and mathematics are very new and unfamiliar to psychologists, and learning how to use them requires an investment of time and effort. Second, because of this unfamiliarity, it may seem difficult at first to intuitively connect these ideas to traditional concepts of cognitive psychology, such as memory, attention, and information processing. Finally, applications of quantum theory to cognition have to overcome the skepticism that naturally arises when a revolutionary new scientific idea is introduced.

However, now consider the advantages. First, the mathematics is not as difficult as it seems: it requires only knowledge of linear algebra and differential equations. Second, once one becomes familiar with the mathematics and concepts, it becomes apparent that quantum theory provides a conceptually elegant and innovative way to formalize and represent some of the major concepts from cognition. For example, the superposition principle provides a natural way to represent the parallel processing of uncertain information and to capture deeply ambiguous feelings. Moreover, the consideration of quantum models allows the (re)introduction of new and useful theoretical principles into psychology, such as incompatibility, interference, and entanglement. In all, the main advantage of quantum theory is that a small set of principles provides a coherent explanation for a wide variety of puzzling results that had never before been connected under a single theoretical framework (see Box 2).

Notes
1. In particular, this chapter does not rely on the quantum brain hypothesis (Hammeroff, 1998).
2. This section makes little use of complex numbers, but the section on dynamics requires their use.
3. This chapter follows the Dirac representation of the state as a vector rather than the more general von Neumann representation of the state as a density matrix.
4. The basis χ is an arbitrary choice, and there are many other choices for a basis. Initially, we restrict ourselves to one arbitrarily chosen basis. Later we discuss issues arising from using different choices for the basis.
5. Kolmogorov assigned a single sample space to the outcomes of an experiment. This allows one to use a different sample space
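The mixture predictions and the AIC/BIC comparison described in the model-comparison section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' fitting code: the function names, parameter values, and simulated 33-trial data are all hypothetical, and `p_take_final` simply stands in for whichever final-choice probability the quantum or Markov model supplies.

```python
import math
import random

def event_probs(p_take_plan, p_take_final, m):
    """Mixture predictions for the four plan/final patterns (TT, TR, RT, RR).

    With probability m the person recalls and repeats the planned choice;
    with probability 1 - m the final choice is generated anew by the model.
    Assumes p(R|.) = 1 - p(T|.).
    """
    pT, qT = p_take_plan, p_take_final
    return {
        "TT": pT * (m + (1 - m) * qT),
        "TR": pT * (1 - m) * (1 - qT),
        "RT": (1 - pT) * (1 - m) * qT,
        "RR": (1 - pT) * (m + (1 - m) * (1 - qT)),
    }

def g2(data, probs):
    """Lack-of-fit index G^2 = -2 ln L summed over one person's trials."""
    return -2 * sum(math.log(probs[pattern]) for pattern in data)

# Hypothetical 33-trial session: one observed pattern per trial.
random.seed(0)
patterns = ["TT", "TR", "RT", "RR"]
quantum_probs = event_probs(p_take_plan=0.6, p_take_final=0.55, m=0.3)
data = random.choices(patterns,
                      weights=[quantum_probs[k] for k in patterns], k=33)

g2_quantum = g2(data, quantum_probs)              # model with c free
g2_markov = g2(data, event_probs(0.6, 0.6, 0.3))  # stand-in for c = 0 fit

# The quantum model carries one extra parameter, so it must improve G^2 by
# more than the AIC penalty (2) or the stiffer BIC penalty (log 33 = 3.4965).
aic_favors_quantum = g2_quantum + 2 < g2_markov
bic_favors_quantum = g2_quantum + math.log(33) < g2_markov
```

Repeating this per participant and counting how often each inequality holds yields the kind of tally reported in the text (e.g., 48 of 100 by AIC, 22 of 100 by BIC).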

for each experiment. But then the problem is that these separate sample spaces are left stochastically unrelated.
6. 16 gambles were played twice; one other gamble was played only once.
7. A surprising feature was found with the log likelihood function of the quantum model as a function of the key quantum parameter c. The log likelihood function forms a damped oscillation that converges at a reasonably high log likelihood at the extremes, and this is true both for the average across participants and for individual participants.

References
Aerts, D. (2009). Quantum structure in cognition. Journal of Mathematical Psychology, 53(5), 314–348.
Aerts, D., & Aerts, S. (1994). Applications of quantum statistics in psychological studies of decision processes. Foundations of Science, 1, 85–97.
Aerts, D., & Gabora, L. (2005). A theory of concepts and their combinations II: A Hilbert space representation. Kybernetes, 34, 192–221.
Aerts, D., Gabora, L., & Sozzo, S. (2013). Concepts and their dynamics: A quantum-theoretic modeling of human thought. Topics in Cognitive Science, 5, 737–773.
Atmanspacher, H., & Filk, T. (2010). A proposed test of temporal nonlocality in bistable perception. Journal of Mathematical Psychology, 54, 314–321.
Atmanspacher, H., Filk, T., & Romer, H. (2004). Quantum Zeno features of bistable perception. Biological Cybernetics, 90, 33–40.
Barkan, R., & Busemeyer, J. R. (1999). Changing plans: Dynamic inconsistency and the effect of experience on the reference point. Psychonomic Bulletin & Review, 10, 353–359.
Barkan, R., & Busemeyer, J. R. (2003). Modeling dynamic inconsistency with a changing reference point. Journal of Behavioral Decision Making, 16, 235–255.
Blutner, R. (2009). Concepts and bounded rationality: An application of Niestegge's approach to conditional quantum probabilities. In L. Accardi et al. (Eds.), Foundations of probability and physics-5 (Vol. 1101, pp. 302–310).
Blutner, R., Pothos, E. M., & Bruza, P. (2013). A quantum probability perspective on borderline vagueness. Topics in Cognitive Science, 5(4), 711–736.
Bordley, R. F., & Kadane, J. B. (1999). Experiment-dependent priors in psychology. Theory and Decision, 47(3), 213–227.
Brainerd, C. J., Wang, Z., & Reyna, V. (2013). Superposition of episodic memories: Overdistribution and quantum models. Topics in Cognitive Science, 5(4), 773–799.
Bruza, P., Kitto, K., Nelson, D., & McEvoy, C. (2009). Is there something quantum-like in the human mental lexicon? Journal of Mathematical Psychology, 53, 362–377.
Bruza, P. D., Busemeyer, J., & Gabora, L. (Eds.). (2009). Special issue on quantum cognition (Vol. 53). Journal of Mathematical Psychology.
Busemeyer, J. R., & Bruza, P. D. (2012). Quantum models of cognition and decision. Cambridge University Press.
Busemeyer, J. R., Pothos, E. M., Franco, R., & Trueblood, J. S. (2011). A quantum theoretical explanation for probability judgment errors. Psychological Review, 118(2), 193–218.
Busemeyer, J. R., Wang, Z., & Lambert-Mogiliansky, A. (2009). Empirical comparison of Markov and quantum models of decision making. Journal of Mathematical Psychology, 53(5), 423–433.
Busemeyer, J. R., Wang, Z., & Townsend, J. (2006). Quantum dynamics of human decision making. Journal of Mathematical Psychology, 50(3), 220–241.
Busemeyer, J. R., Wang, Z., & Trueblood, J. S. (2012). Hierarchical Bayesian estimation of quantum decision model parameters. In J. R. Busemeyer, F. DuBois, A. Lambert-Mogiliansky, & M. Melucci (Eds.), Quantum interaction. Lecture Notes in Computer Science, Vol. 7620 (pp. 80–89). Springer.
Conte, E., Khrennikov, A. Y., Todarello, O., Federici, A., Mendolicchio, L., & Zbilut, J. P. (2009). Mental states follow quantum mechanics during perception and cognition of ambiguous figures. Open Systems and Information Dynamics, 16, 1–17.
Dirac, P. A. M. (1958). The principles of quantum mechanics. Oxford University Press.
Dzhafarov, E., & Kujala, J. V. (2012). Selectivity in probabilistic causality: Where psychology runs into quantum physics. Journal of Mathematical Psychology, 56, 54–63.
Feldman, J. M., & Lynch, J. G. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73(3), 421–435.
Franco, R. (2009). Quantum amplitude amplification algorithm: An explanation of availability bias. In P. Bruza, D. Sofge, W. Lawless, K. van Rijsbergen, & M. Klusch (Eds.), Quantum interaction (pp. 84–96). Springer.
Fuss, I. G., & Navarro, D. J. (2013). Open parallel cooperative and competitive decision processes: A potential provenance for quantum probability decision models. Topics in Cognitive Science, 5(4), 818–843.
Gleason, A. M. (1957). Measures on the closed subspaces of a Hilbert space. Journal of Mathematics and Mechanics, 6, 885–893.
Griffiths, R. B. (2003). Consistent quantum theory. Cambridge University Press.
Gudder, S. P. (1988). Quantum probability. Academic Press.
Hammeroff, S. R. (1998). Quantum computation in brain microtubules? The Penrose–Hameroff "Orch OR" model of consciousness. Philosophical Transactions of the Royal Society London (A), 356, 1869–1896.
Heisenberg, W. (1958). Physics and philosophy. Harper and Row.
Hughes, R. I. G. (1989). The structure and interpretation of quantum mechanics. Harvard University Press.
Ivancevic, V. G., & Ivancevic, T. T. (2010). Quantum neural computation. Springer.
Khrennikov, A. Y. (2010). Ubiquitous quantum structure: From psychology to finance. Springer.
Kolmogorov, A. N. (1933/1950). Foundations of the theory of probability. New York: Chelsea Publishing Co.
Lambert-Mogiliansky, A., Zamir, S., & Zwirn, H. (2009). Type indeterminacy: A model of the "KT" (Kahneman-Tversky)-man. Journal of Mathematical Psychology, 53(5), 349–361.
La Mura, P. (2009). Projective expected utility. Journal of Mathematical Psychology, 53(5), 408–414.
Morier, D. M., & Borgida, E. (1984). The conjunction fallacy: A task specific phenomenon? Personality and Social Psychology Bulletin, 10, 243–252.
Payne, J., Bettman, J. R., & Johnson, E. J. (1992). Behavioral decision research: A constructive processing perspective. Annual Review of Psychology, 43, 87–131.
Peres, A. (1998). Quantum theory: Concepts and methods. Kluwer Academic.
Pothos, E. M., & Busemeyer, J. R. (2013). Can quantum probability provide a new direction for cognitive modeling? Behavioral and Brain Sciences, 36, 255–274.
Pothos, E. M., Busemeyer, J. R., & Trueblood, J. S. (2013). A quantum geometric model of similarity. Psychological Review, 120(3), 679–696.
Savage, L. J. (1954). The foundations of statistics. John Wiley & Sons.
Schachter, S., & Singer, J. E. (1962). Cognitive, social, and physiological determinants of emotional state. Psychological Review, 69(5), 379–399.
Shafir, E., & Tversky, A. (1992). Thinking through uncertainty: Nonconsequential reasoning and choice. Cognitive Psychology, 24, 449–474.
Stolarz-Fantino, S., Fantino, E., Zizzo, D. J., & Wen, J. (2003). The conjunction effect: New evidence for robustness. American Journal of Psychology, 116(1), 15–34.
Tentori, K., & Crupi, V. (2013). Why quantum probability does not explain the conjunction fallacy. Behavioral and Brain Sciences, 36(3), 308–310.
Trueblood, J. S., & Busemeyer, J. R. (2010). A quantum probability account for order effects on inference. Cognitive Science, 35, 1518–1552.
Trueblood, J. S., & Busemeyer, J. R. (2011). A quantum probability model of causal reasoning. Frontiers in Cognitive Science, 3, 138.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293–315.
Tversky, A., & Shafir, E. (1992). The disjunction effect in choice under uncertainty. Psychological Science, 3, 305–309.
Von Neumann, J. (1932/1955). Mathematical foundations of quantum mechanics. Princeton University Press.
Wang, Z., & Busemeyer, J. R. (2013). A quantum question order model supported by empirical tests of an a priori and precise prediction. Topics in Cognitive Science, 5, 689–710.
Wang, Z., Busemeyer, J. R., Atmanspacher, H., & Pothos, E. M. (2013). The potential of using quantum theory to build models of cognition. Topics in Cognitive Science, 5, 672–688.
Wang, Z., Solloway, T., Shiffrin, R. M., & Busemeyer, J. (2014). Context effects produced by question orders reveal quantum nature of human judgments. Proceedings of the National Academy of Sciences, 111(26), 9431–9436.
Yates, J. F., & Carlson, B. W. (1986). Conjunction errors: Evidence for multiple judgment procedures, including 'signed summation'. Organizational Behavior and Human Decision Processes, 37, 230–253.
Yukalov, V. I., & Sornette, D. (2011). Decision theory with prospect interference and entanglement. Theory and Decision, 70, 283–328.

quantum models of cognition and decision 389

INDEX

Abductive reasoning in clinical descriptive model and Bayes’ rule, 6, 281–282 cognitive science, 343–344 parameters, 292–293 BEAGLE (Bound Encoding of the Absolute identification overview, 290–291 Aggregate Language absolute and relative judgment, posterior distribution Environment) model, 129–130 interpretation, 293–295 243–244, 248 intertrial interval and sequential Attention-weight parameters, 144 Bellman equation, 103 effects, 136–138 Attraction, as context effect in Benchmark model, 74–75 learning, 130–133 DFT, 225–226 Benchmark phenomena, in perfect pitch versus, 133–135 Austerweil, J. L., 187 perceptual judgment, response times, 135 Autism spectrum disorders, 122–124 theories of, 124–129 354–356 Berlin Institute of Physiology, 64 Absorbing barriers, 30 Automaticity, 143, 148–150, 325 Bernoulli, Daniel, 210–211 Accumulator models, 321–322, Autonomous search models, Bessel, F. W., 65 327–328 178–179 Bias-variance trade-off, 190–191 Across-trial variability, 37–38, 46, BIC (Bayesian information 56–57 Bandit tasks, 111 criterion), 9, 21, 306–308 Actions, in Markov decision Basa ganglia model, 51 Blood sugar reduction, 50 process (MDP), 102–103 Baseball batting example, 282–290 Bootstrapping, 105 ACT-R architectures, 126, 219, data, 283 Boundary setting across tasks, 48 301 descriptive model and BoundEncodingoftheAggregate Additive factors method, 69–70, parameters, 283–285 Language Environment 89 overview, 282–283 (BEAGLE) model, ADHD, 49–50 posterior distribution 243–244, 248 Affine transformation, 28 interpretation, 285–290 Bow effects, in absolute Aging studies, diffusion shrinkage and multiple identification, 123, 128 models in, 48 comparisons, 290 Brown, S. D., 121 Akaike information criterion Basis functions, 201 Brown and Heathcote’s linear (AIC), 306–308 Bayesian information criterion ballistic accumulator Alcohol consumption, 50 (BIC), 9, 21, 306–308 model, 301 Aleatory uncertainty, 210 Bayesian models. 
See also BUGS modeling specification Algom, D., 63 Hierarchical models, language, 282 Allais paradox, 219 Bayesian estimation in Busemeyer, J. R., 1, 369 ANCHOR-based exemplar model of cognition, 187–208 of absolute identification, clustering observations, 126 192–196 Calculus, 3–5 Anderson’s ACT-R model, 126, conclusions, 203–204 Candidate decision processes, 14 219, 301 continuous quantities, Capacity coefficient, 72–74 Anxiety, diffusion models of, 49 200–203 Capacity limitations, in absolute Anxiety-prone individuals, threat features as perceptual units, identification, 122–123 sensitivity modeling in, 196–200 Capacity reallocation model, 69 352–354 future directions, 204 Capacity theory, 90–91 Aphasia, 49–50 mathematical background, Catastrophe theory, 346 Ashby, F. G., 13 188–192 Categorization, 29, 325. See also Assimilation and contrast, in overview, 187–188 Bayesian models; absolute identification, overview, 40, 169 Exemplar-based random 123–124, 128 parsimony principle in, walk (EBRW) model Associative learning, 194–196 309–314 Category learning, 30–31, 189 Associative recognition, 47 of shape perception, 258–260 Cattell, James McKeen, 66 Attention allocation differences Bayesian parameter estimation, Chaos-theoretic modeling, data, 291–292 348–349 345–346

391 Child development, diffusion Cognitive-psychological COVIS theory of category models in, 48–49 complementarity, 87–90 learning, 30–31 Chinese restaurant process (CRP) Cognitive psychometrics, 290 Credit assignment problem, in metaphor, 193–195 Cohen’s PDP model, 301 reinforcement learning, Choice axiom testing, Cold cognition, 361 100–101, 103 211–214 Commutativity, 375 Criss, A. H., 165 Choice behavior, 199–200 Competing accumulator models, CRL (Computational Cholesky transformation, 20 322 reinforcement learning). “Chunking,” 66 Complication experiment, 66–68 See Computational Clinical psychology, mathematical Component power laws model, reinforcement learning and computational 301 (CRL) modeling in, 341–368 Compositional semantics, 249 CrossCat model, 195 contributions of, 349–359 Compromise, as context effect in CRP (Chinese restaurant process) cognition in autism spectrum DFT, 225–226 metaphor, 193–195 disorders, 354–356 Computational reinforcement Crude two-part code, in MDL, cognitive modeling of learning (CRL), 99–117 307–308 routinely used measures, decision environment, 102 CSM (constructed semantics 356–357 exploration and exploitation model), 247 Cued recall models of episodic multinomial processing tree balance, 106 memory, 173–174 modeling of memory, goal of, 101 Cumulative prospect theory, 350–352 good decision making, 103–104 217–219 in pathocognition and historical perspective, 100–101 CXM (coexistence model), functional neuroimaging, neural correlates of, 106–108 303–304, 306, 308–309, 357–359 Q-learning, 105–106 313 threat sensitivity modeling of research issues, 108–114 anxiety-prone individuals, human exploration varieties, 352–354 110–113 Deadline tasks, 41–42 distinctions in, 343–346 model-based versus model-free Decisional separability, 15–16, 22f, overview, 341–343 learning, 108–109 23 parameter estimation in, reward varieties, 113–114 Decision-boundary models, 30, 142 346–349 state representation influence, special considerations, 
109–110 Decision field theory (DFT), 220–225 359–361 temporal difference learning, Decision-making models, Clustering observations, 104–105 209–231. See also 192–196 values for states and actions, Computational Coactivation, 73 102–103 reinforcement learning; COALS (Correlated Occurrence Conditioning, 101–103, 111, 195 Perceptual decision Analogue to Lexical Confidence judgments, 52–53 making, neurocognitive Semantics) model, 241, Conjunction probability judgment modeling of; Quantum 248 errors, 376–379 models of cognition and Coexistence model (CXM), Connectionist models decision 303–304, 306, 308–309, decision field theory as, 223, choice axiom testing, 211–214 313 225 cognitive models Cognition. See Bayesian models; of semantic memory, 234–239 context effects example, Quantum models of Constancy, in shape perception, 225–226 cognition and decision 256–257 decision field theory, 220–221 Cognitive control of perceptual Constructed semantics model decision field theory for decisions, 330, 334 (CSM), 247 multialternative choice Cognitive modeling, 219–226 Context, 175–178 problems, 222–225 of clinical science measures, Context-noise models, 172 multi-attribute decision field 356–357 Contingency table, 6f theory, 221–222 context effects example, Continuous quantities, overview, 219–220 225–226 relationships of, 200–203 historical development of, decision field theory Contrast and assimilation, in 210–211 for multialternative choice absolute identification, overview, 209–210 problems, 222–225 123–124, 128 rational choice models, 214–219 multi-attribute, 221–222 Correlated Occurrence Analogue Decision rules for Bayesian overview, 220–221 to Lexical Semantics posterior distribution, 287 “horse race,” 356 (COALS) model, 241, 248 Dennis, S., 232

392 index Density estimation, in Bayesian Domain of the function, 1 old-new recognition RTs models, 188–190 Donders, Franciscus, 65–66 predicted by, 152–157 Depression, diffusion models of, Donkin, C., 121 overview, 142–144 49 Double factorial paradigm, 83 probabilistic feedback to Derivatives and integrals, 3–5 Drift rates contrast predictions, Destructive updating model accumulator model assumptions 150–152 (DUM), 303–304, 306, about, 321–322 research goals, 157–159 308–309, 313 across-trial variability in, 56–57 in response times, 144–146 Deterministic processes, 72 in perceptual decision making, similarity and practice effects, Diederich, A., 1, 209 36–38, 45, 325–327 146–148 Differential-deficit, Dual process models of in perceptual decision making, psychometric-artifact recognition, 166 325 problem, 360 DUM (destructive updating Exemplar models of absolute Differential equations, 4–5 model), 303–304, 306, identification, 125–126, Diffusion models, 35–62 308–309, 313 129 in aging studies, 48 Dynamic attractor networks, Exhaustive processing, 71–72 in child development, 48–49 237–239 Expectancy Valence Learning in clinical applications, 49–50, Dynamic decision models, Model (EVL), 356–357 352 219–220. 
See also Expectations, 7–8 competing two-choice models, Decision-making models Expected utility theory (EUT), 51–56 Dynamic programming, 104 209, 211 failure of, 50–51 Dyslexia, 50 Experience-based decision making, in homeostatic state 215–216 manipulations, 49–50 Effective sample size (ESS) statistic, Experimental Psychology 282 in individual differences studies, (Woodworth), 83 EGCM (extended generalized 48 Exploration/exploitation balance context model) of absolute in lexical decision, 46–47 experiments in, 100 identification, 126, 143 optimality, 44–45 human varieties of, 110–113 Eidels, A., 1, 63 in perceptual tasks, 45–46 in reinforcement learning, 106 Emotional bias, 49 Exponential functions, 2 for practice effects, 301 Episodic memory, 165–183 Extended generalized context for rapid decisions, 35–44 cued recall models of, 173–174 model (EGCM) of absolute accuracy and RT distribution free recall models of, 174–179 identification, 126, 143 expressions, 38–41 future directions, 179 Eye movements, saccadic, drift rate, 36–38 overview, 165–166 323–325 overview, 35–36 recognition memory models, standard two-choice task, 166–172 False alarm rates, 23 41–44 context-noise models, 172 Feature inference, 196–199 in recognition memory, 46 global matching models, Feature integration theory, 87 in semantic and recognition 167–168 Feature-list models, 233–234 priming effects, 47 retrieving effectively from Fechnerian paradigm, 257–258 in value-based judgments, memory (REM) model, Fechner’s law of psychophysics, 47–48 168–171 307 Diffusion process, 30 updating consequences, Feed-forward networks, 235 Díríchlet-process mixture model, 171–172 FEF (frontal eye field), 321, 323 192–194, 244 Epistemic uncertainty, 210 Fermat, Pierre de, 210 Disjunction probability judgment Error signal, 4–5 Flexibility-to-fit data, of models, errors, 376–379 ESS (effective sample size) statistic, 93 Dissociations, in categorization 282 fMRI and recognition, 158–159 EUT (expected utility theory), 
category-relevant dimensions Distributional models of semantic 209, 211 shown by, 158 memory, 239–247 EVL (Expectancy Valence Learning in clinical psychology, 357–359 latent semantic analysis, Model), 356–357 context word approach and, 241 239–240 Exemplar-based random walk diffusion models and, 57 moving window models, (EBRW) model model-based analysis of, 240–241 of absolute identification, 107–108 probabilistic topic models, 125–126 Free recall models of episodic 243–246 of categorization and memory, 174–179 random vector models, 241–243 recognition, 142–164 Frequentist methods, 281 retrieval-based semantics, automaticity and perceptual Frontal eye field (FEF), 321, 323 246–247 expertise, 148–150 Functions, mathematical, 1–3

index 393 Gabor patch orientation descriptive model and Independent race model, 74–75 discrimination, 50 parameters, 292–293 Indian buffet process (IBD) Galen, 64 overview, 290–291 metaphor, 194–195, Gate accumulator model, 327 posterior distribution 197–200, 203 Gaussian distribution, 189, 192 interpretation, 293–295 Individual differences studies, Generalizability, measures of, 303 baseball batting example, diffusion models in, 48 Generalized context model 282–290 Infinite Relational Model (IRM), (GCM), 126, 143, 152, data, 283 195 325 descriptive model and Information criteria, in model General recognition theory (GRT) parameters, 283–285 comparison, 306–307 application of, 14 overview, 282–283 Instance theory, 301, 325 applied to data, 17–21 posterior distribution Institute for Collaborative empirical example, 24–28 interpretation, 285–290 Biotechnologies, 31 multivariate normal shrinkage and multiple Instrumental conditioning, 101, distributions assumed by, comparisons, 290 111 16 comparison of, 295–297 Integrals and derivatives, 3–5 neural implementations of, ideas in, 279–282 Integrate-and-fire neurons, 39 30–31 Higher anxiety-prone-lower Intercompletion time equivalence, overview, 15–16 anxiety-prone (HA-LA) 77–78 response accuracy and response group differences, 352–353 Intertrial interval, sequential effects time accounted for, 28–30 Highest density interval (HDI), and, 136–138 summary statistics approach, 285 Inverse problem, shape perception 22–24 Hilbert space, in quantum theory, as, 256–263 GenSim software for semantic 371–372, 374–375 Iowa Gambling Task, 349, 356 memory modeling, 246 Histograms, 9 IPLC (independent parallel, Gershman, S. J., 187 Homeostatic state manipulations, limited-capacity) Global matching models, 49–50 processing system, 353 167–168 “Horse race” model of cognitive IRM (Infinite Relational Model), Go/No-Go Discrimination Task, processes, 356 195 44, 349, 356 Hot cognition, 361 Goodness of fit evaluation, 20–21, Howard, M. 
W., 165 James, William, 64 302 Human function learning, Jefferson, B., 63 Grice inequality, 74–75, 92 202–203 Jeffreys weights, 313 Jones, M. N., 232 Griffiths, T. L., 187 Human information processing, 63–70 Grouping, power of, 66 Donder’s complication Kinnebrook, David, 65 GRT (general recognition theory). experiment, 66–68 Kolmogorov axioms, 307, 370, See General recognition Sternberg’s work in, 68–70 373–374 theory (GRT) von Helmholtz’s measurement of Kruschke, J. K., 279 Guided search, 89 nerve impulse speed, Kullback-Leibler divergence, Gureckis, T. M., 99 64–65 306 Wundt’s reaction time studies, HA-LA (higher 65–66 Languages, tonal, 134 anxiety-prone-lower Human neuroscience, diffusion Latent Díríchlet Allocation anxiety-prone) group models for, 56–58 algorithms, 244 differences, 352–353 Hyperspace Analogue to Language Latent semantic analysis (LSA), HAL (Hyperspace Analogue to (HAL) model, 240–241, 239–240, 245, 248–249 Language) model, 245, 248 Law of total probability, 376 240–241, 245, 248 LBA (Linear Ballistic Accumulator) Hamilton, Sir William, 66 IBP (Indian buffet process) model, 52, 301 Hawkins, R. X. D., 63 metaphor, 194–195, Leaky competing accumulator HDI (highest density interval), 285 197–200, 203 (LCA) model, 36, 51–52, Heathcote, A., 121 Identification data, fitting GRT to, 128, 223–225, 327 Hebbian learning, 235 18–21 Learning. See also Computational HiDEx software for semantic Identification hit rate, 23 reinforcement learning memory modeling, 246 Importance sampling for Bayes absolute identification in, Hierarchical models, Bayesian factor, 312–313 130–133 estimation in, 279–299 Independence, axioms of, 211–212 associative, 194–196 attention allocation differences Independent parallel, Hebbian, 235 example, 290–295 limited-capacity (IPLC) modeling human function, data, 291–292 processing system, 353 202–203

394 index
  procedural, 30–31
  relationships in continuous quantities, 200–203
Lexical decisions, diffusion models in, 46–47
Lexicographic semi-order (LS) choice rule, 213
Li, Y., 255
Likelihood function, 18–19, 280, 296
Likelihood ratio test, 28
Limited capacity, 70, 73
Linear Ballistic Accumulator (LBA) model, 52, 301
Linear functions, 1–2
Linear regression, 8, 200–202
Logan, G. D., 320
Love, B. C., 99
LSA (latent semantic analysis), 239–240, 245, 248–249
LS (lexicographic semi-order) choice rule, 213
Lüder's rule, 374
"Magical number seven," 66
Mapping, functions for, 1
Marginal discriminabilities, 23
Marginal response invariance, 22
Markov Chain Monte Carlo (MCMC) algorithms, 244, 281–282, 293
Markov decision process (MDP), 102–104
Markov dynamic model for two-stage gambles, 382–383, 385–387
Maskelyne, Nevil, 65
Matched filter model, 167, 173
Mathematical concepts, review of, 1–10
  derivatives and integrals, 3–5
  expectations, 7–8
  mathematical functions, 1–3
  maximum likelihood estimation, 8–9
  probability theory, 5–7
Matrix reasoning, 48
Matzke, D., 300
Maximum likelihood estimation (MLE), 8–9, 281, 347
MCMC (Markov Chain Monte Carlo) algorithms, 244, 281–282, 293
MDL (minimum description length), in model comparison, 307–309
MDP (Markov decision process), 102–104
MDS (multidimensional scaling), 143
Mean interaction contrast, 83–84
Measures of generalizability, 303
Memory, 350–352. See also Episodic memory; Semantic memory
Memory interference models example, 303–306
Méré, Chevalier de, 210
Meyer, Irwin, Osman, and Kounios partial information paradigm, 42–44
Miller, George, 66
Minimum description length (MDL), in model comparison, 307–309
Minimum-time stopping rule, exhaustive processing versus, 71–72
Minkowski power model, 144
MLE (maximum likelihood estimation), 8–9, 281, 347
Model-based versus model-free learning, 108–109
Modeling. See Parsimony principle in model comparison; specifically named models
Model mimicking
  degenerative, 80
  ignoring parallel-serial, 87–90
  prediction overlaps from, 75–78
  in psychological science, 91–93
Moderate stochastic transitivity (MST), 212–213
Moment matching, in parameter estimation, 347–348
Monte-Carlo methods, 104, 311–313
Movement-related neurons, in FEF, 321, 323, 325–326, 328
Moving window models, 240–241
MPM (multiplicative prototype model), 290, 292
MPTs (multinomial processing tree models). See Multinomial processing tree models (MPTs)
MST (moderate stochastic transitivity), 212–213
Müller, Johannes, 64
Multialternative choice problems, decision field theory for, 222–225
Multi-armed bandit tasks, 111
Multi-attribute decision field theory, 221–222
Multichoice-decision-making, 52–53
Multidimensional scaling (MDS), 143, 273
Multidimensional signal detection theory, 13–34
  general recognition theory
    applied to data, 17–21
    empirical example, 24–28
    neural implementations of, 30–31
    overview, 15–16
    response accuracy and response time accounted for, 28–30
    summary statistics approach, 22–24
  multivariate normal model, 16–17
  overview, 13–15
Multinomial processing tree models (MPTs), 301, 304–305, 307–311, 350–352
Multiple comparisons, shrinkage and, 290
Multiple linear regression, 8
Multiplicative prototype model (MPM), 290, 292
Multivariate normal model, 16–17
Myopic behavior, of agents, 103
National Institute of Neurological Disorders and Stroke, 31
Natural log functions, 2
NCM (no-conflict model), 303, 305–306, 308–309, 313
Nested models, comparing, 313–314
Neufeld, R. W. J., 341
Neural evidence
  of computational reinforcement learning, 106–108
  of exemplar-based random walk, 158
  of GRT, 30–31
  in perceptual decision making, 325–330
Neurocognitive modeling of perceptual decision making. See Perceptual decision making, neurocognitive modeling of
Neuro-connectionist modeling, 345
Neuroeconomics, 48
Neuroscience, decision making understanding from, 53–58
Newton-Raphson method, 19
Nietzsche, Friedrich, 64
No-conflict model (NCM), 303, 305–306, 308–309, 313
Noise, in perceptual systems, 15, 36

Nondecision time, 48–50, 57
Nonlinear dynamical system modeling, 345–346
Nonparametric models, 189–192, 194–195
Normal distribution, 7
Nosofsky, R. M., 142
Null list strength effects in (REM) model, 170–171
Numerosity discrimination task, 50
Observations, clustering, 192–196
Occam's razor, 301–302
One-choice decisions, 53
Operant conditioning, 101
Optimality, 44–45
Optimal planning, 113
Ornstein-Uhlenbeck (OU) diffusion process, 50, 55
Overfitting, 190–191
Palmeri, T. J., 142, 320
Parallelism, 68
Parallel processing
  in benchmark model, 74–75
  mathematics supported by, 77
  parallel-serial mimicry ignored, 87–90
  partial processing as basis of, 80–81
  serial processing versus, 71
Parallel-Serial Tester (PST) paradigm, 82
Parametric models, 189–190
Parsimony principle in model comparison, 300–319
  Bayes factors, 309–314
  comparison of model comparisons, 314–315
  information criteria, 306–307
  memory interference models example, 303–306
  minimum description length, 307–309
  overview, 300–303
Partial information paradigm, 42–44
Pascal, Blaise, 210
Pathocognition, 357–359
Pavlovian conditioning, 195
PBRW (prototype-based random walk) model, 151–152
Perceptual decision making, neurocognitive modeling of, 320–340
  architectures for, 327–328
  conclusions, 333–336
  control over, 330–333
  neural dynamics, predictions of, 328–330
  neural locus of drift rates, 325–327
  overview, 320–323
  saccadic eye movements and, 323–325
Perceptual expertise, automaticity and, 148–150
Perceptual independence, 15–16, 23
Perceptual judgment, 121–141
  absolute identification issues, 129–139
    absolute and relative judgment, 129–130
    absolute identification versus perfect pitch, 133–135
    intertrial interval and sequential effects, 136–138
    learning, 130–133
    response times, 135
  absolute identification theories, 124–129
  benchmark phenomena, 122–124
  overview, 121–122
Perceptual separability, 15, 22f
Perceptual tasks, diffusion models in, 36, 45–46
Perceptual units, features as, 196–200
Perfect pitch, absolute identification versus, 133–135
Perspective. See Shape perception
Pizlo, Z., 255
Pleskac, T. J., 209
Poisson counter model, 36, 55
Poisson shot noise process, 55
Policies, in decision making, 103–104
Polynomial regression, 302–303
Posterior distribution
  in attention allocation differences, 293–295
  in baseball batting example, 285–290
  Monte Carlo sampling for, 311–313
  in tests for model-parameter differences, 356
Pothos, E., 369
Power functions, 2
Power law, 300
Practice effects, 146–148, 300–301
Prediction error, 105, 107–108
Principles of Psychology (James), 64
Probabilistic topic models, 243–246, 248, 250
Probability density function, 72
Probability judgment error, 377–379
Probability mass function, 7
Probability theory, 5–7, 373–377. See also Bayesian models; Decision-making models
Probability weighting function, 214–216
Problem of Points, 210
Procedural learning, 30–31
Prospect theory, 209, 214, 216–219
Prototype-based random walk (PBRW) model, 151–152
Prototype models, 142
PST (Parallel-Serial Tester) paradigm, 82
Psychology, mathematical and computational modeling in. See Clinical psychology, mathematical and computational modeling in
Psychomotor vigilance task (PVT), 53
Q-learning, 104–106, 109, 111
Quadratic functions, 2
Quantile-probability plots, 40
Quantum models of cognition and decision, 369–389
  classical probabilities versus, 373–377
  concepts, definitions, and notation, 371–373
  decision making applications, 381–387
    Markov dynamic model for two-stage gambles, 382–383
    model comparisons, 385–387
    quantum dynamic model for two-stage gambles, 384–385
    two-stage gambling paradigm, 381–382
  dynamical principles, 379–381
  probability judgment error applications, 377–379
  reasons for, 369–371
Race model inequality, 74
Rae, B., 121
Random Permutations Model (RPM), 244
Random variables with continuous distribution, 7–8
Random vector models, 241–243
Random walk, 39. See also Exemplar-based random walk (EBRW) model

Range of the function, 1
Rank-dependent utility theory, 209
Rapid decisions. See Diffusion models
Rational analysis, 188
Ratcliff, R., 35
Ratcliff's diffusion model, 301
Rational choice models, 47, 214–219
Reaction time distributions, 83–87
Recognition and categorization. See Exemplar-based random walk (EBRW) model
Recognition memory models, 46, 166–172
Region of practical equivalence (ROPE), in decision rules, 286
Regularization methods, in shape perception, 258–260
Reinforcement learning (RL). See Computational reinforcement learning
Relative judgment models of absolute identification, 126–127, 129t, 136
Release-from-inhibition model, 50–51
REM (retrieving effectively from memory) model. See Retrieving effectively from memory (REM) model
Rescorla-Wagner model, 111, 194
Response accuracy, 28–30, 90–91
Response signal tasks, 41–42
Response times (RT)
  absolute identification and, 128, 135
  cognitive-psychological complementarity, 87–90
  in diffusion models, 38–41
  example of, 78–79
  exemplar-based random walk model of, 144–146, 152–157
  GRT to account for, 28–30
  human information processing studied by, 63–70
    Donders' complication experiment, 66–68
    Sternberg's work in, 68–70
    von Helmholtz's measurement of nerve impulse speed, 64–65
    Wundt's reaction time studies, 65–66
  metatheory expansion to encompass accuracy, 90–91
  model mimicking, 75–78, 91–93
  quantitative expressions of, 70–75
  stopping rule distinctions based on set-size functions, 82–87
  theoretical distinctions, 79–82
Restricted capacity models of absolute identification, 127–128, 129t, 136
Retrieval-based semantics, 246–247
Retrieved context models, 177–178
Retrieving effectively from memory (REM) model
  consequences of updating in, 171–172
  overview, 168–170
  word frequency and null list strength effects in, 170–171
Reward prediction error hypothesis, 107
Reward-rate optimality, 45
Reward varieties, in reinforcement learning, 113–114
Rickard's component power laws model, 301
Risk in decision making. See Decision-making models
RL (reinforcement learning). See Computational reinforcement learning
ROPE (region of practical equivalence), in decision rules, 286
RPM (Random Permutations Model), 244
RT (response times). See Response times (RT)
RT-distance hypothesis, 29, 150
Rule-plus-exception models, 142
Rumelhart networks, 235–237
Saccadic eye movements, 323–325
SAMBA (Selective Attention, Mapping, and Ballistic Accumulators) model of absolute identification, 128–129, 133, 135–138
Sampling independence test, 24
Savage-Dickey approximation to Bayes factor, 313–314
Sawada, T., 255
SBME (strength-based mirror effect), 171–172
Schall, J. D., 320
Schizophrenia, stimulus-encoding elongation in, 357–359
SCM (similarity-choice model), 21
SD (social desirability) contamination of scores, 361
Second-order conditioning, 102–103, 111
Selective Attention, Mapping, and Ballistic Accumulators (SAMBA) model of absolute identification, 128–129, 133, 135–138
Selective influence, 85
Semantic and recognition priming effects, 47
Semantic memory, 232–254
  compositional semantics, 249
  connectionist models of, 234–239
  distributional models of, 239–247
    latent semantic analysis, 239–240
    moving window models, 240–241
    probabilistic topic models, 243–246
    random vector models, 241–243
    retrieval-based semantics, 246–247
  future directions, 249–250
  grounding semantic models, 247–249
  overview, 232–233
  research models and themes, 233–234
Semantic networks, 233–234
SEMMOD software for semantic memory modeling, 246
Sensory preconditioning, 194–195
Sequential effects, 123–124, 136–138
Sequential-sampling models. See Diffusion models
Serial processing
  mathematics supported by, 76–77
  parallel processing versus, 71
  parallel-serial mimicry ignored, 87–90
  parallel-serial testing paradigm, 82
SEUT (subjective expected utility theory), 209, 211
SFT (Systems Factorial Technology), 354–355
Shape perception, 255–276
  constancy, 256–257
  constraints in regularization and Bayesian methods, 258–260

Shape perception (Cont.)
  eye and camera geometry, 260–263
  Fechnerian paradigm inadequacy, 257–258
  new definition of, 273–274
  perspective and orthographic projection, 263–265
  3D mirror-symmetrical shape recovery from 2D images, 269–273
  3D symmetry and 2D orthographic and perspective projections, 265–269
  uniqueness, 255–256
Shrinkage and multiple comparisons, 286–288, 290
Sichuan University, 134
Signal detection theory. See Multidimensional signal detection theory
Sign-dependent utility theory, 209
Similarity
  as context effect in DFT, 225–226
  kernels of, 201
  practice effects and, 146–148
  as search determinant, 89
  similarity-choice model (SCM), 21
Single cell recording data, 54
Sleep deprivation, 50
Social desirability (SD) contamination of scores, 361
Soto, F. A., 13
Span of attention, 66
Spatial models, 233–234
Speed-accuracy tradeoff, 90–91
Speeded classification, 29, 146–148
Speeded visual search, 89
Sperling, George, 66
S-Space software for semantic memory modeling, 246
SST (strong stochastic transitivity), 212–213
State representation influence, in reinforcement learning, 109–110
States, in Markov decision process (MDP), 102–103
Sternberg, Saul, 68
Stevens' law of psychophysics, 307
Stimulus dimensionality, 133
Stimulus-response learning, 101
Stochastic difference equation, 5
Stochastic dominance, 217–218, 221–222
Stochastic independence, 72
Stochastic transitivity, 212–213
Stopping rule
  exhaustive processing versus, 71–72
  set-size functions and, 82–87
Stop signal paradigm, 330, 332–333
St. Petersburg paradox, 210–211
Strength-based mirror effect (SBME), 171–172
Strong inference tactic, 92
Strong stochastic transitivity (SST), 212–213
Structural MRI, 57
Subjective expected utility theory (SEUT), 209, 211
Subtraction, method of, 66–69
Summary statistics approach to GRT, 22–24
Super capacity, 73
SuperMatrix software for semantic memory modeling, 246
Supertaskers, 75
Survivor interaction contrast, 84–85
Symmetry. See Shape perception
Systematic exploration, 113
Systems Factorial Technology (SFT), 354–355
TAX (transfer of attention exchange) model, 219
Temporal Context Model (TCM), 244
Temporal difference learning, 104–105, 109, 111
Tenenbaum, J. B., 187
Test of English as a Foreign Language (TOEFL), 240
Theorizing process, 1
Threat sensitivity modeling in anxiety-prone individuals, 352–354
3D symmetry. See Shape perception
Thurstonian models of absolute identification, 124–125, 129, 136
Time-varying processing, 44
TOEFL (Test of English as a Foreign Language), 240
Tolman, Edwin, 93
Tonal languages, 134
Topic models, 243–246, 248, 250
Total probability, law of, 375–376
Townsend, J. T., 1, 63
Townsend's capacity reallocation, 69
Transfer of attention exchange (TAX) model, 219
Transformations, perspective projection as, 261
Transition probabilities, in Markov decision process (MDP), 102
Transitivity, axioms of, 211–214
Trial independent "random" exploration, 112–113
Trial-to-trial variability, 37
Trigonometric functions, 2–3
2D orthographic and perspective projections. See Shape perception
Two-choice models, diffusion models versus, 51–56
Two-choice tasks, 41–44
Two-stage gambles
  Markov dynamic model for, 382–383
  model comparisons for, 385–387
  overview, 381–382
  quantum dynamic model for, 384–385
Uncertainty, 281. See also Decision-making models
Uniqueness, in shape perception, 255–256
Unlimited capacity and independent, parallel processing channels (UCIP), 74–75
U.S. Army Research Office, 31
Utility function, 211, 216–217
Value-based judgments, diffusion models in, 47–48
Vandekerckhove, J., 300
Vanpaemel, W., 279
Venn diagrams, 6
Veridical perception, 255, 258
Vickers accumulator model, 36
Visually responsive neurons, in FEF, 321, 323, 325–326
Visual search experiments, 69
Visual short-term memory (VSTM), 45
Vitalism, 64
von Helmholtz, Hermann, 64
von Neumann axioms, 371, 373
VSTM (visual short-term memory), 45
Wagenmakers, E.-J., 300
WAIS vocabulary, 48
Wallsten, T. S., 209
Wang, Z., 1, 369

Weak stochastic transitivity (WST), 212–213
Weighted additive utility model, 220
Wickens, T. D., 14, 19
William of Occam, 302
Willits, J., 232
Wisconsin Card Sorting Test, 349, 356–357
Woodworth, R. S., 63–64
Word frequency effects in (REM) model, 170–171
Word recognition, 47
Word-Similarity software for semantic memory modeling, 246
Workload capacity, 72–74, 85–87
Wundt, Wilhelm, 65
