OXFORD LIBRARY OF PSYCHOLOGY
EDITED BY JEROME R. BUSEMEYER ZHENG WANG JAMES T. TOWNSEND & AMI EIDELS
The Oxford Handbook of COMPUTATIONAL and MATHEMATICAL PSYCHOLOGY The Oxford Handbook of Computational and Mathematical Psychology OXFORD LIBRARY OF PSYCHOLOGY
EDITOR-IN-CHIEF
Peter E. Nathan
AREA EDITORS
Clinical Psychology: David H. Barlow
Cognitive Neuroscience: Kevin N. Ochsner and Stephen M. Kosslyn
Cognitive Psychology: Daniel Reisberg
Counseling Psychology: Elizabeth M. Altmaier and Jo-Ida C. Hansen
Developmental Psychology: Philip David Zelazo
Health Psychology: Howard S. Friedman
History of Psychology: David B. Baker
Methods and Measurement: Todd D. Little
Neuropsychology: Kenneth M. Adams
Organizational Psychology: Steve W. J. Kozlowski
Personality and Social Psychology: Kay Deaux and Mark Snyder

OXFORD LIBRARY OF PSYCHOLOGY
Editor-in-Chief: Peter E. Nathan

The Oxford Handbook of Computational and Mathematical Psychology

Edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide.
Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto
With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.
Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016
© Oxford University Press 2015
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.
You must not circulate this work in any other form and you must impose this same condition on any acquirer.
Library of Congress Cataloging-in-Publication Data Oxford handbook of computational and mathematical psychology / edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels. pages cm. – (Oxford library of psychology) Includes bibliographical references and index. ISBN 978-0-19-995799-6 1. Cognition. 2. Cognitive science. 3. Psychology–Mathematical models. 4. Psychometrics. I. Busemeyer, Jerome R. BF311.O945 2015 150.1 51–dc23 2015002254
9 8 7 6 5 4 3 2 1
Printed in the United States of America on acid-free paper

Dedicated to the memory of Dr. William K. Estes (1919–2011) and Dr. R. Duncan Luce (1925–2012), two of the founders of modern mathematical psychology
SHORT CONTENTS
Oxford Library of Psychology ix
About the Editors xi
Contributors xiii
Table of Contents xvii
Chapters 1–390
Index 391
OXFORD LIBRARY OF PSYCHOLOGY
The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world’s oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need. Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level are a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact on the field. Adding to its accessibility and ease of use, the Library will be published in print and, later on, electronically. The Library surveys psychology’s principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisci- plines. The initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology’s major subdisciplines with breadth, comprehensiveness, and exemplary scholarship. In addition to these broadly- conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook’s table of contents and to offer informed anticipations of significant future developments in that field.
ix An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation’s and world’s most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise. For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested. Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated. In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology’s dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities, As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.
Peter E. Nathan Editor-in-Chief Oxford Library of Psychology
ABOUT THE EDITORS
Jerome R. Busemeyer is Provost Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include decision field theory and, more recently, pioneering the new field of quantum cognition.
Zheng Wang is Associate Professor at the Ohio State University and directs the Communication and Psychophysiology Lab. Much of her research tries to understand how our cognition, decision making, and communication are contextualized.
James T. Townsend is Distinguished Rudy Professor of Psychology at Indiana University. He was the president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include systems factorial technology and general recognition theory.
Ami Eidels is Senior Lecturer at the School of Psychology, University of Newcastle, Australia, and a principal investigator in the Newcastle Cognition Lab. His research focuses on human cognition, especially visual perception and attention, combined with computational and mathematical modeling.
CONTRIBUTORS
Daniel Algom Ami Eidels School of Psychological Sciences School of Psychology Tel-Aviv University University of Newcastle Israel Callaghan, NSW F. Gregory Ashby Australia Department of Psychological and Brain Sciences Samuel J. Gershman University of California, Santa Barbara Department of Brain and Cognitive Sciences Santa Barbara, CA Massachusetts Institute of Technology Joseph L. Austerweil Cambridge, MA Department of Cognitive Thomas L. Griffiths Linguistic, and Psychological Sciences Department of Psychology Brown University University of California, Berkeley Providence, RI Berkeley, CA Scott D. Brown Todd M. Gureckis School of Psychology Department of Psychology University of Newcastle NewYorkUniversity Callaghan, NSW New York, NY Australia Robert X. D. Hawkins Jerome R. Busemeyer Department of Psychological Department of Psychological and Brain Sciences and Brain Sciences Indiana University Cognitive Science Program Bloomington, IN Indiana University Andrew Heathcote Bloomington, IN School of Psychology Amy H. Criss University of Newcastle Department of Psychology Callaghan, NSW Syracuse University Australia Syracuse, NY Marc W. Howard Simon Dennis Department of Psychological Department of Psychology and Brain Sciences The Ohio State University Center for Memory and Brain Columbus, OH Boston University Adele Diederich Boston, MA Psychology Brett Jefferson Jacobs University Bremen gGmbH Department of Psychological Bremen 28759 and Brain Sciences Germany Indiana University Chris Donkin Bloomington, IN School of Psychology Michael N. Jones University of New South Wales Department of Psychological and Brain Sciences Kensington, NSW Indiana University Australia Bloomington, IN
xiii John K. Kruschke Emmanuel Pothos Department of Psychological Department of Psychology and Brain Sciences City University London Indiana University London, UK Bloomington, IN Babette Rae Yunfeng Li School of Psychology Department of Psychological University of Newcastle Sciences Callaghan, NSW Purdue University Australia West Lafayette, IN Roger Ratcliff Gordon D. Logan Department of Psychology Department of Psychology The Ohio State University Vanderbilt Vision Research Center Columbus, OH Center for Integrative and Tadamasa Sawada Cognitive Neuroscience Department of Psychology Vanderbilt University Higher School of Economics Nashville, TN Moscow, Russia Bradley C. Love Jeffrey D. Schall University College London Department of Psychology Experimental Psychology Vanderbilt Vision Research Center London, UK Center for Integrative and Dora Matzke Cognitive Neuroscience Department of Psychology Vanderbilt University University of Amsterdam Nashville, TN Amsterdam, the Netherlands Philip Smith Robert M. Nosofsky School of Psychological Sciences Department of Psychological The University of Melbourne and Brain Sciences Parkville, VIC Indiana University Australia Bloomington, IN Fabian A. Soto Richard W. J. Neufeld Department of Psychological Departments of Psychology and Psychiatry, and Brain Sciences Neuroscience Program University of California, Santa Barbara University of Western Ontario Santa Barbara, CA London, Ontario Joshua B. Tenenbaum Canada Department of Brain and Cognitive Sciences Thomas J. Palmeri Massachusetts Institute of Technology Department of Psychology Cambridge, MA Vanderbilt Vision Research Center James T. Townsend Center for Integrative and Department of Psychological Cognitive Neuroscience and Brain Sciences Vanderbilt University Cognitive Science Program Nashville, TN Indiana University Zygmunt Pizlo Bloomington, IN Department of Psychological Joachim Vandekerckhove Sciences Department of Cognitive Sciences Purdue University University of California, Irvine West Lafayette, IN Irvine, CA Timothy J. Pleskac Wolf Vanpaemel Center for Adaptive Rationality (ARC) Faculty of Psychology Max Planck Institute for Human and Educational Sciences Development University of Leuven Berlin, Germany Leuven, Belgium xiv contributors Eric-Jan Wagenmakers Zheng Wang Department of Psychology School of Communication University of Amsterdam Center for Cognitive and Brain Sciences Amsterdam, the Netherlands The Ohio State University Thomas S. Wallsten Columbus, OH Department of Psychology Jon Willits University of Maryland Department of Psychological and Brain Sciences College Park, MD Indiana University Bloomington, IN
CONTENTS
Preface xix
1. Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology 1 Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend
Part I • Elementary Cognitive Mechanisms 2. Multidimensional Signal Detection Theory 13 F. Gregory Ashby and Fabian A. Soto 3. Modeling Simple Decisions and Applications Using a Diffusion Model 35 Roger Ratcliff and Philip Smith 4. Features of Response Times: Identification of Cognitive Mechanisms through Mathematical Modeling 63 Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend 5. Computational Reinforcement Learning 99 Todd M. Gureckis and Bradley C. Love
Part II • Basic Cognitive Skills 6. Why Is Accurately Labeling Simple Magnitudes So Hard? A Past, Present, and Future Look at Simple Perceptual Judgment 121 Chris Donkin, Babette Rae, Andrew Heathcote, and Scott D. Brown 7. An Exemplar-Based Random-Walk Model of Categorization and Recognition 142 Robert M. Nosofsky and Thomas J. Palmeri 8. Models of Episodic Memory 165 Amy H. Criss and Marc W. Howard
Part III • Higher Level Cognition 9. Structure and Flexibility in Bayesian Models of Cognition 187 Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths
10. Models of Decision Making under Risk and Uncertainty 209 Timothy J. Pleskac, Adele Diederich, and Thomas S. Wallsten 11. Models of Semantic Memory 232 Michael N. Jones, Jon Willits, and Simon Dennis 12. Shape Perception 255 Tadamasa Sawada, Yunfeng Li, and Zygmunt Pizlo
Part IV • New Directions 13. Bayesian Estimation in Hierarchical Models 279 John K. Kruschke and Wolf Vanpaemel 14. Model Comparison and the Principle of Parsimony 300 Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers 15. Neurocognitive Modeling of Perceptual Decision Making 320 Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan 16. Mathematical and Computational Modeling in Clinical Psychology 341 Richard W. J. Neufeld 17. Quantum Models of Cognition and Decision 369 Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos
Index 391
PREFACE
Computational and mathematical psychology has enjoyed rapid growth over the past decade. Our vision for the Oxford Handbook of Computational and Mathematical Psychology is to invite and organize a set of chapters that review these most important developments, especially those that have impacted— and will continue to impact—other fields such as cognitive psychology, developmental psychology, clinical psychology, and neuroscience. Together with a group of dedicated authors, who are leading scientists in their areas, we believe we have realized our vision. Specifically, the chapters cover the key developments in elementary cognitive mechanisms (e.g., signal detection, information processing, reinforcement learning), basic cognitive skills (e.g., perceptual judgment, categorization, episodic memory), higher-level cognition (e.g., Bayesian cognition, decision making, semantic memory, shape perception), modeling tools (e.g., Bayesian estimation and other new model comparison methods), and emerging new directions (e.g., neurocognitive modeling, applications to clinical psychology, quantum cognition) in computation and mathematical psychology. An important feature of this handbook is that it aims to engage readers with various levels of modeling experience. Each chapter is self-contained and written by authoritative figures in the topic area. Each chapter is designed to be a relatively applied introduction with a great emphasis on empirical examples (see New Handbook of Mathematical Psychology (2014) by Batchelder, Colonius, Dzhafarov, and Myung for a more mathematically foundational and less applied presentation). Each chapter endeavors to immediately involve readers, inspire them to apply the introduced models to their own research interests, and refer them to more rigorous mathematical treatments when needed. First, each chapter provides an elementary overview of the basic concepts, techniques, and models in the topic area. Some chapters also offer a historical perspective of their area or approach. Second, each chapter emphasizes empirical applications of the models. Each chapter shows how the models are being used to understand human cognition and illustrates the use of the models in a tutorial manner. Third, each chapter strives to create engaging, precise, and lucid writing that inspires the use of the models. The chapters were written for a typical graduate student in virtually any area of psychology, cognitive science, and related social and behavioral sciences, such as consumer behavior and communication. We also expect it to be useful for readers ranging from advanced undergraduate students to experienced faculty members and researchers. Beyond being a handy reference book, it should be beneficial as
xix a textbook for self-teaching, and for graduate level (or advanced undergraduate level) courses in computational and mathematical psychology. We would like to thank all the authors for their excellent contributions. Also we thank the following scholars who helped review the book chapters in addition to the editors (listed alphabetically): Woo-Young Ahn, Greg Ashby, Scott Brown, Cody Cooper, Amy Criss, Adele Diederich, Chris Donkin, Yehiam Eldad, Pegah Fakhari, Birte Forstmann, Tom Griffiths, Andrew Heathcote, Alex Hedstrom, Joseph Houpt, Marc Howard, Matt Irwin, Mike Jones, John Kruschke, Peter Kvam, Bradley Love, Dora Matzke, Jay Myung, Robert Nosofsky, Tim Pleskac, Emmanuel Pothos, Noah Silbert, Tyler Solloway, Fabian Soto, Jennifer Trueblood, Joachim Vandekerckhove, Wolf Vanpaemel, Eric-Jan Wagenmakers, and Paul Williams. The authors and reviewers’ effort ensure our confidence in the high quality of this handbook. Finally, we would like to express how much we appreciate the outstanding assistance and guidance provided by our editorial team and production team at Oxford University Press. The hard work provided by Joan Bossert, Louis Gulino, Anne Dellinger, A. Joseph Lurdu Antoine and the production team of Newgen Knowledge Works Pvt. Ltd., and others at the Oxford University Press are essential for the development of this handbook. It has been a true pleasure working with this team!
Jerome R. Busemeyer Zheng Wang James T. Townsend Ami Eidels December 16, 2014
CHAPTER 1 Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology
Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend
Abstract Computational and mathematical models of psychology all use some common mathematical functions and principles. This chapter provides a brief overview. Key Words: mathematical functions, derivatives and integrals, probability theory, expectations, maximum likelihood estimation
We have three ways to build theories to explain and predict how variables interact and relate to each other in psychological phenomena: using natural verbal languages, using formal mathematics, and using computational methods. Human intuitive and verbal reasoning has many limitations. For example, Hintzman (1991) summarized at least 10 critical limitations, including our inability to imagine how a dynamic system works. Formal models, including both mathematical and computational models, can address these limitations of human reasoning.

Mathematics is a "radically empirical" science (Suppes, 1984, p. 78), with consistent and rigorous evidence (the proof) that is "presented with a completeness not characteristic of any other area of science" (p. 78). Mathematical models can help avoid logic and reasoning errors that are typically encountered in human verbal reasoning. The complexity of theorizing and data often requires the aid of computers and computational languages. Computational models and mathematical models can be thought of as a continuum of a theorizing process. Every computational model is based on a certain mathematical model, and almost every mathematical model can be implemented as a computational model.

Psychological theories may start as a verbal description, which then can be formalized using mathematical language and subsequently coded into computational language. By testing the models using empirical data, the model fitting outcomes can provide feedback to improve the models, as well as our initial understanding and verbal descriptions. For readers who are newcomers to this exciting field, this chapter provides a review of basic concepts of mathematics, probability, and statistics used in computational and mathematical modeling of psychological representation, mechanisms, and processes. See Busemeyer and Diederich (2010) and Lewandowsky and Farrell (2010) for more detailed presentations.

Mathematical Functions
Mathematical functions are used to map a set of points called the domain of the function into a set of points called the range of the function such that only one point in the range is assigned to each point in the domain.¹ As a simple example, the linear function is defined as f(x) = a·x, where the constant a is the slope of a straight line. In general, we use the notation f(x) to represent a function f that maps a domain point x into a range point y = f(x).
If a function f(x) has the property that each range point y can only be reached by a single unique domain point x, then we can define the inverse function f⁻¹(y) = x that maps each range point y = f(x) back to the corresponding domain point x. For example, the quadratic function is defined as the map f(x) = x² = x·x, and if we pick the number x = 3.5, then f(3.5) = 3.5² = 12.25. The quadratic function is defined on a domain of both positive and negative real numbers, and it does not have an inverse because, for example, (−x)² = x², and so there are two ways to get back from each range point y to the domain. However, if we restrict the domain to the non-negative real numbers, then the inverse of x² exists and it is the square root function defined on non-negative real numbers, y = √(x²) = x. There are, of course, a large number of functions used in mathematical psychology, but some of the most popular ones include the following.

The power function is denoted x^a, where the variable x is a positive real number and the constant a is called the power. A quadratic function can be obtained by setting a = 2, but we could instead choose a = 0.50, which is the square root function x^0.50 = √x, or we could choose a = −1, which produces the reciprocal x^−1 = 1/x, or we could choose any real number such as a = 1.37. Using a calculator, one finds that if x = 15.25 and a = 1.37, then 15.25^1.37 = 41.8658. One important property to remember about power functions is that x^a · x^b = x^(a+b) and x^b · y^b = (x·y)^b and (x^a)^b = x^(a·b). Also note that x^0 = 1. Note that when working with the power function, the variable x appears in the base, and the constant a appears as the power.

The exponential function is denoted e^x, where the exponent x is any real valued variable and the constant base e stands for a special number that is approximately e ≅ 2.7183. Sometimes it is more convenient to use the notation e^x = exp(x) instead. Using a calculator, we can calculate e^2.5 = 2.7183^2.5 = 12.1825. Note that the exponent can be negative, x < 0, in which case we can write e^−x = 1/e^x. If x = 0, then e^0 = 1. The exponential function always returns a positive value, e^x > 0, and it approaches zero as x approaches negative infinity. More complex forms of the exponential are often used. For example, you will later see the function e^(−((x−μ)/σ)²), where x is a variable and μ and σ are constants. In this case, it is more convenient to write this as e^(−((x−μ)/σ)²) = exp(−((x−μ)/σ)²). This tells you to first compute the squared deviation y = ((x−μ)/σ)² and then compute the reciprocal 1/exp(y) = exp(−y). The exponential function obeys the property e^x · e^y = e^(x+y) and (e^x)^a = e^(a·x). In contrast to the power function, the base of the exponential is a constant and the exponent is a variable.

The (natural) log function is denoted ln(x) for positive values of x. For example, using a calculator, for x = 10 we obtain ln(10) = 2.3026. (We normally use the natural base e = 2.7183. If instead we used base 10, then log₁₀(10) = 1.) The log function obeys the rules ln(x·y) = ln(x) + ln(y) and ln(x^a) = a·ln(x). The log function is the inverse of the exponential function, ln(exp(x)) = x, and the exponential function is the inverse of the log function, exp(ln(x)) = x. The function a^x, where a is a constant and x is a variable, can be rewritten in terms of the exponential function: define b = ln(a), then e^(b·x) = (e^b)^x = exp(ln(a))^x = a^x.

Figure 1.1 illustrates the power, exponential, and log functions using different coefficient values for each function. As can be seen, the coefficient changes the curve of the functions.

Last but not least are the trigonometric functions based on a circle. Figure 1.2 shows a circle with its center located at coordinates (0,0) in an (X,Y) plane. Now imagine a line segment of radius r = 1 that extends from the center point to the circumference of the circle. This line segment intersects with the circumference at coordinates (cos(t·π), sin(t·π)) in the plane. The coordinate cos(t·π) represents the projection of the point on the circumference down onto the X axis, and the point sin(t·π) is the projection of the point on the circumference onto the Y axis. The variable t (which, for example, can be time) moves this point around the circle, with positive values moving the point counterclockwise and negative values moving it clockwise. The constant π = 3.1416 equals one-half cycle around the circle, and 2π is the period of time it takes to go all the way around once. The two functions are related by a translation (called the phase) in time: cos(t·π − π/2) = sin(t·π). Note that cos is an even function because cos(t·π) = cos(−t·π), whereas sin is an odd function because −sin(t·π) = sin(−t·π). Also note that these functions are periodic in the sense that, for example, cos(t·π) = cos(t·π + 2·k·π) for any integer k.
Fig. 1.1 Examples of three important functions, with various parameter values. From left to right: Power function, exponential function, and log function. See text for details.
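These function rules are easy to check numerically. The short sketch below (plain Python, standard library only; the particular numbers are arbitrary) reproduces the calculator examples from the text and verifies several of the stated identities.

```python
import math

# Reproduce the calculator examples from the text.
x, a = 15.25, 1.37
print(x ** a)             # power function: about 41.8658
print(math.exp(2.5))      # exponential function: about 12.1825
print(math.log(10.0))     # natural log: about 2.3026

# Numerically verify several of the stated properties.
b = 0.5
assert math.isclose(x**a * x**b, x**(a + b))                    # x^a * x^b = x^(a+b)
assert math.isclose(math.log(math.exp(2.5)), 2.5)               # ln(exp(x)) = x
assert math.isclose(math.exp(math.log(7.0)), 7.0)               # exp(ln(x)) = x
assert math.isclose(math.log(3.0 * 4.0), math.log(3.0) + math.log(4.0))  # ln(xy) = ln x + ln y
```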
Fig. 1.2 Left panel illustrates a point on a unit circle with a radius equal to one. Vertical line shows sine, horizontal line shows cosine. Right panel shows sine as a function of time. The point on the Y axis of the right panel corresponds to the point on the Y axis of the left panel.
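A quick numerical check of the unit-circle picture and of the phase and periodicity relations stated above (a minimal sketch; the sampled values of t are arbitrary):

```python
import math

# Points (cos(t*pi), sin(t*pi)) all lie on the unit circle.
for t in (0.0, 0.25, 0.5, 1.0, 1.5, 2.0):
    px, py = math.cos(t * math.pi), math.sin(t * math.pi)
    assert math.isclose(px**2 + py**2, 1.0)
    print(f"t = {t:4.2f}   point = ({px: .3f}, {py: .3f})")

# Phase and periodicity relations from the text.
t = 0.37                                                   # arbitrary value of t
assert math.isclose(math.cos(t*math.pi - math.pi/2), math.sin(t*math.pi))
assert math.isclose(math.cos(t*math.pi), math.cos(t*math.pi + 2*math.pi))
```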
We can generalize these two functions by changing the frequency and the phase. For example, cos(ω·t·π + θ) is a cosine function with a frequency ω (changing the time it takes to complete a cycle) and a phase θ (advancing or delaying the initial value at time t = 0).

Derivatives and Integrals
A derivative of a continuous function is the rate of change or the slope of a function at some point. Suppose f(x) is some continuous function. For a small increment Δ, the change in this function is df(x) = f(x) − f(x−Δ), and the rate of change is the change divided by the increment, [f(x) − f(x−Δ)]/Δ. If the function is continuous, then as Δ → 0, this ratio converges to what is called the derivative of the function at x, denoted as (d/dx) f(x). The derivatives of many functions are derived in calculus (see Stewart (2012) or any calculus textbook for an introduction to calculus). For example, in calculus it is shown that (d/dx) e^(c·x) = c·e^(c·x), which says that the slope of the exponential function at any point x is proportional to the exponential function itself. As another example, it is shown in calculus that (d/dx) x^a = a·x^(a−1), which is the derivative of the power function. For example, the derivative of a quadratic function a·x² is a linear function 2·a·x, and the derivative of the linear function a·x is the constant a. The derivative of the cosine function is (d/dt) cos(t) = −sin(t), and the derivative of sine is (d/dt) sin(t) = cos(t).
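These derivative rules can be checked directly from the rate-of-change definition above. A minimal sketch, assuming an arbitrary step size Δ = 1e-6 and arbitrary evaluation points; the helper name numeric_derivative is ours, not the chapter's:

```python
import math

def numeric_derivative(f, x, delta=1e-6):
    """Rate of change of f over a small backward increment, as in the definition above."""
    return (f(x) - f(x - delta)) / delta

a, c, x = 1.37, 0.8, 2.0
print(numeric_derivative(lambda u: u**a, x), a * x**(a - 1))                  # power rule
print(numeric_derivative(lambda u: math.exp(c * u), x), c * math.exp(c * x))  # exponential rule
print(numeric_derivative(math.cos, x), -math.sin(x))                          # cosine rule
```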
Fig. 1.3 Illustration of the power function and its derivatives. The curved lines in both panels show the power function. The slope of the dotted line (the tangent to the function) is given by the derivative of that function (in this example, at x = 1).
Figure 1.3 illustrates the derivative of the power function at the value x = 1 for two different coefficients. The curved line shows the power function, and the straight line touches the curve at x = 1. The slope of this line is the derivative.

The integral of a continuous function is the area under the curve within some interval (see Fig. 1.4). Suppose f(x) is a continuous function of x within the interval [a, b]. A simple way to approximate this area is to divide the interval into N very small steps, with a small increment Δ being a step:

[x_0 = a, x_1 = a + Δ, x_2 = a + 2·Δ, ..., x_j = a + j·Δ, ..., x_(N−1) = b − Δ, x_N = b].

Then, compute the area of the rectangle within each step, Δ·f(x_j), and finally sum all the areas of the rectangles to obtain an approximate area under the curve:

A ≈ Δ·f(x_1) + Δ·f(x_2) + ··· + Δ·f(x_N) = Σ_{j=1}^{N} Δ·f(x_j).

As the number of intervals becomes arbitrarily large and the increments get arbitrarily small, so that N → ∞ and Δ → 0, this sum converges to the integral

A = ∫_a^b f(x)·dx.

If we allow the upper limit of the integral to be a variable, say z, then the integral becomes a function of the upper limit, which can be written as F(z) = ∫_a^z f(x)·dx. What happens if we take the derivative of an integral? Let's examine the change in the area divided by the increment:

A(x_N) − A(x_(N−1)) = Δ·f(x_N),
[A(x_N) − A(x_(N−1))]/Δ = f(x_N).

This simple idea (proven more rigorously in a calculus textbook) leads to the first fundamental theorem of integrals, which states that (d/dz) F(z) = f(z), with F(z) = ∫_a^z f(x)dx. The fundamental theorem can then be used to find the integral of a function. For example, the integral ∫_0^z x^a dx = (a+1)^(−1)·z^(a+1) because (d/dz) (a+1)^(−1)·z^(a+1) = z^a. The integral ∫^z e^(α·x) dx = (1/α)·e^(α·z) because (d/dz) (1/α)·e^(α·z) = e^(α·z). The integral ∫^z cos(t) dt = sin(z) because (d/dz) sin(z) = cos(z).
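The rectangle approximation described above is straightforward to implement. In the sketch below (the integrand cos and the interval [0, 1] are arbitrary illustrative choices), the sum converges to the exact value sin(1) as the rectangles get narrower:

```python
import math

def riemann_area(f, a, b, n):
    """Sum of n rectangle areas delta * f(x_j), with x_j = a + j * delta."""
    delta = (b - a) / n
    return sum(delta * f(a + j * delta) for j in range(1, n + 1))

# The area under cos(t) on [0, 1] should converge to sin(1) as the steps shrink.
for n in (10, 100, 10_000):
    print(n, riemann_area(math.cos, 0.0, 1.0, n))
print("exact:", math.sin(1.0))
```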
Fig. 1.4 The integral of the function is the area under the curve. It can be approximated as the sum of the areas of the rectangles (left panel). As the rectangles become narrower (middle), the sum of their areas converges to the true integral (right).

Computational and mathematical models are often described by difference or differential equations. These types of equations are used to describe how the state of a system changes with time. For example, suppose V(t) represents the strength of a neural connection between an input and an output at time t, and suppose x(t) is some reward signal that is guiding the learning process. A simple, discrete-time linear model of learning can be V(t) = (1 − α)·V(t−1) + α·x(t), where 0 ≤ α ≤ 1 is the learning rate parameter. We can rewrite this as a difference equation

dV(t) = V(t) − V(t−1) = −α·V(t−1) + α·x(t) = −α·(V(t−1) − x(t)).

This model states that the change in strength at time t is proportional to the negative of the error signal, which is defined as the difference between the previous strength and the new reward. If we wish to describe learning as occurring more continuously in time, we can introduce a small time increment Δt into the model so that it states

dV(t) = V(t) − V(t−Δt) = −α·Δt·(V(t−Δt) − x(t)),

which says that the change in strength is proportional to the negative of the error signal, with the constant of proportionality now modified by the time increment. Dividing both sides by the time increment Δt, we obtain

dV(t)/Δt = −α·(V(t−Δt) − x(t)),

and now if we allow the time increment to approach zero in the limit, Δt → 0, then the preceding equation converges to a limit that is the differential equation

(d/dt) V(t) = −α·(V(t) − x(t)),

which states that the rate of change in strength is proportional to the negative of the error signal. Sometimes we can solve the differential equation for a simple solution. For example, the solution to the equation (d/dt) V(t) = −α·V(t) + c is V(t) = c/α − a·e^(−α·t), because when we substitute this solution back into the differential equation, it satisfies the equality of the differential equation:

(d/dt) (c/α − a·e^(−α·t)) = −α·(c/α − a·e^(−α·t)) + c.

A stochastic difference equation is frequently used in cognitive modeling to represent how a state changes across time when it is perturbed by noise. For example, if we assume that the strength of a connection changes according to the preceding learning model, but with some noise (denoted as ε(t)) added, then we can use the following stochastic difference equation:

dV(t) = −α·Δt·(V(t−Δt) − x(t)) + ε(t)·√Δt.

Note that the noise is multiplied by √Δt instead of Δt in a stochastic difference equation. This is required so that the effect of the noise does not disappear as Δt → 0, and the variance of the noise remains proportional to Δt (which is the key characteristic of Brownian motion processes). See Bhattacharya and Waymire (2009) for an excellent book on stochastic processes.
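A small simulation makes the deterministic and stochastic versions of this learning model concrete. The sketch below uses simple Euler steps; the parameter values, the constant reward signal, and the noise level are all arbitrary choices, not values from the chapter.

```python
import random

random.seed(0)
alpha, dt, noise_sd = 2.0, 0.01, 0.2     # arbitrary learning rate, step size, noise level
x = 1.0                                  # constant reward signal x(t)
V_det = V_noisy = 0.0
for _ in range(500):                     # 500 Euler steps of size dt = 5 time units
    V_det += -alpha * dt * (V_det - x)                       # deterministic update
    V_noisy += (-alpha * dt * (V_noisy - x)
                + random.gauss(0.0, noise_sd) * dt ** 0.5)   # noise scaled by sqrt(dt)
print(V_det)    # approaches x = 1, the equilibrium of dV/dt = -alpha * (V - x)
print(V_noisy)  # fluctuates around 1 because of the accumulated noise
```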
Fig. 1.5 Two ways to illustrate the probability space of events A and B. The contingency table (left) and the Venn diagram (right) correspond in the following way: Positive values on both tests in the table (the conjunctive event, A∩B) are represented by the overlap of the circles in the Venn diagram. Positive values on one test but not on the other in the table (the XOR event, A positive and B negative, or vice versa) are represented by the nonoverlapping areas of circles A and B. Finally, tests that are both negative (upper left entry in the table) correspond in the Venn diagram to the area within the rectangle (the so-called “sample space”) that is not occupied by any of the circles.
Elementary Probability Theory
Probability theory describes how to assign probabilities to events. See Feller (1968) for a review of probability theory. We start with a sample space, a set denoted as Ω, which contains all the unique outcomes that can be realized. For simplicity, we will assume (unless noted otherwise) that the sample space is finite. (There could be a very large number of outcomes, but the number is finite.) For example, if a person takes two medical tests, test A and test B, and each test can be positive or negative, then the sample space contains four mutually exclusive and exhaustive outcomes: all four combinations of positive and negative test results from tests A and B. Figure 1.5 illustrates the situation for this simple example. An event, such as the event A (e.g., test A is positive), is a subset of the sample space. Suppose for the moment that A, B are two events. The disjunctive event A or B (e.g., test A is positive or test B is positive) is represented as the union A ∪ B. The conjunctive event A and B (e.g., test A is positive and test B is positive) is represented as the intersection A ∩ B. The impossible event (e.g., test A is neither positive nor negative), denoted ∅, is an empty set. The certain event is the entire sample space Ω. The complementary event "not A" is denoted Ā.

A probability function p assigns a number between zero and one to each event. The impossible event is assigned zero, and the certain event is assigned one. The other events are assigned probabilities 0 ≤ p(A) ≤ 1 and p(Ā) = 1 − p(A). However, these probabilities must obey the following additive rule: if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B). What if the events are not mutually exclusive, so that A ∩ B ≠ ∅? The answer is called the "or" rule, which follows from the previous assumptions:

p(A ∪ B) = p(A ∩ B) + p(A ∩ B̄) + p(Ā ∩ B)
= [p(A ∩ B) + p(A ∩ B̄)] + [p(Ā ∩ B) + p(A ∩ B)] − p(A ∩ B)
= p(A) + p(B) − p(A ∩ B).

Suppose we learn that some event A has occurred, and now we wish to define the new probability for event B conditioned on this known event. The conditional probability p(B|A) stands for the probability of event B given that event A has occurred, which is defined as p(B|A) = p(A ∩ B)/p(A). Similarly, p(A|B) = p(A ∩ B)/p(B) is the probability of event A given that B has occurred. Using the definition of conditional probability, we can then define the "and" rule for joint probabilities as follows: the probability of A and B equals p(A ∩ B) = p(A)·p(B|A) = p(B)·p(A|B).

An important theorem of probability is called Bayes rule. It describes how to revise one's beliefs based on evidence. Suppose we have two mutually exclusive and exhaustive hypotheses denoted H_1 and H_2. For example, H_1 could be that a certain disease is present and H_2 is that the disease is not present. Define the event D as some observed data that provide evidence for or against each hypothesis, such as a medical test result. Suppose p(D|H_1) and p(D|H_2) are known. These are called the likelihoods of the data for each hypothesis. For example, medical testing would be used to determine the likelihood of a positive versus negative test result when the disease is known to be present, and the likelihood of a positive versus negative test would also be known when the disease is not present. We define p(H_1) and p(H_2) as the prior probabilities of each hypothesis. For example, these priors may be based on base rates for disease present or not. Then, according to the conditional probability definition,

p(H_1|D) = p(H_1)·p(D|H_1)/p(D) = p(H_1)·p(D|H_1) / [p(H_1)·p(D|H_1) + p(H_2)·p(D|H_2)].

The last line is Bayes' rule. The probability p(H_1|D) is called the posterior probability of the hypothesis given the data. It reflects the revision from the prior produced by the evidence from the data. If there are M ≥ 2 hypotheses, then the rule is extended to be

p(H_1|D) = p(H_1)·p(D|H_1) / Σ_{k=1}^{M} p(H_k)·p(D|H_k),

where the denominator is the sum across k = 1 to k = M hypotheses.
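Bayes' rule for the two-hypothesis medical-test example is one line of arithmetic. In the sketch below, the base rate and the two likelihoods are made-up illustrative numbers.

```python
# Bayes' rule for the two-hypothesis medical-test example in the text.
p_h1 = 0.02                  # prior: disease present (made-up base rate)
p_h2 = 1.0 - p_h1            # prior: disease absent
p_d_given_h1 = 0.95          # likelihood of a positive test when the disease is present
p_d_given_h2 = 0.10          # likelihood of a positive test when the disease is absent

p_d = p_h1 * p_d_given_h1 + p_h2 * p_d_given_h2      # total probability of the data
posterior_h1 = p_h1 * p_d_given_h1 / p_d             # Bayes' rule
print(posterior_h1)          # about 0.16: a positive test raises 0.02 to roughly 0.16
```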
6 review of basic mathematical concepts scale, and each choice is assigned a number (say, 0 ≤ p X = xi,Y = yj ≤ 1 1, 2,..., or 9). Then we can define a random = = = = variable X (R), which is a function that maps the p X xi,Y yj p(X xi) category event R onto one of the nine numbers. j For example, if the person chooses the middle p X = xi,Y = yj = p(Y = yj) rating option, so that R = middle, then we assign i ( ) = X middle 5. For simplicity, we often omit = = = the event and instead write the random variable p X xi,Y yj 1. i j simply as X . For example, we can ask what is the probability that the random variable is assigned Finally, we can define the conditional probability = the number 5, which is written as p(X 5). of Y = yj given X = xi as p Y = yj|X = xi = Then we assign a probability to each value of a p(X =xi,Y =yj) ( = ) . random variable by assigning it the probability of p X xi the event that produces the value. For example, Often, we work with random variables that have p(X = 5) equals the probability of the event that the a continuous rather than a discrete and finite distri- bution, such as the normal distribution. Suppose X person picks the middle value. Suppose the random is a univariate continuous random variable. In this variable has N values x1,x2, ...,xi,...,xN .In our previous example with the rating scale, the case, the probability assigned to each real number is random variable had nine values. The function zero (there are uncountably infinite many of them p(X = x ) (interpreted as the probability that in any interval). Instead, we start by defining the i = ( ≤ ) the person picks a choice corresponding to value cumulative distribution function F(x) p X x . Then we define the probability density at each value xi) is called the probability mass function for the random variable X . This function has the following of x as the derivative of the cumulative probability = d properties: function, f (x) dx F(x). Using this definition for the density, we compute the probability of X falling ( ∈ ) = b ≤ ( = ) ≤ in some interval [a,b] as p X [a,b] a f (x)dx . 0 p X xi 1; The increment f (x) · dx for the continuous random N variable is conceptually related to the mass function ( = ) = ( = ) + ( = ) p X xi p X x1 p X x2 p(X = xi). i=1 The probability density function for the normal 2 − x−μ +···+ ( = ) = 1. σ p X xN distribution equals f (x) = √ 1 e ,where 2πσ μ is the mean of the distribution and σ is the The cumulative probability is then defined as standard deviation of the distribution. The normal distribution is popular because of the central limit ¯ theorem, which states that the sample mean X = p(X ≤ xi) = p(X = x1) + p(X = x2) Xi/N of N independent samples of a random i variable X will approach normal as the number +···+ ( = ) = = p X xi p X xj . of samples becomes arbitrarily large, even if the j=1 original random variable X is not normal. We often work with sample means, which we expect to be Often we measure more than one random variable. approximately normal because of the central limit For example, we could present an advertisement theorem. and ask how effective it is for the participant personally but also ask how effective the participant believes it is for others. Suppose X is the random Expectations variable for the nine-point rating scale for self, and When working with random variables, we are let Y be the random variable for the nine-point often interested in their moments (i.e., their means, rating scale for others. 
Often, we work with random variables that have a continuous rather than a discrete and finite distribution, such as the normal distribution. Suppose X is a univariate continuous random variable. In this case, the probability assigned to each real number is zero (there are uncountably infinitely many of them in any interval). Instead, we start by defining the cumulative distribution function F(x) = p(X ≤ x). Then we define the probability density at each value of x as the derivative of the cumulative probability function, f(x) = (d/dx) F(x). Using this definition for the density, we compute the probability of X falling in some interval [a, b] as p(X ∈ [a,b]) = ∫_a^b f(x)dx. The increment f(x)·dx for the continuous random variable is conceptually related to the mass function p(X = x_i) for the discrete case.

The probability density function for the normal distribution equals f(x) = [1/(√(2π)·σ)]·exp(−(x − μ)²/(2σ²)), where μ is the mean of the distribution and σ is the standard deviation of the distribution. The normal distribution is popular because of the central limit theorem, which states that the sample mean X̄ = Σ_i X_i/N of N independent samples of a random variable X will approach normal as the number of samples becomes arbitrarily large, even if the original random variable X is not normal. We often work with sample means, which we expect to be approximately normal because of the central limit theorem.

Expectations
When working with random variables, we are often interested in their moments (i.e., their means, variances, or correlations). See Hogg and Craig (1970) for a reference on mathematical statistics. These different moments are different concepts, but they are all defined as expected values of the random variables. The expectation of the random variable is defined as E[X] = Σ_i p(X = x_i)·x_i for the discrete case, and it is defined as E[X] = ∫ f(x)·x·dx for the continuous case. The mean of a random variable X, denoted μ_X, is defined as the expectation of the random variable, μ_X = E[X]. The variance of a random variable X, denoted σ²_X, is defined as the expectation of the squared deviation around the mean: σ²_X = Var(X) = E[(X − μ)²]. For example, in the discrete case, this equals σ²_X = Σ_i p(X = x_i)·(x_i − μ_X)². The standard deviation is the square root of the variance, σ_X = √(σ²_X). The covariance between two random variables (X, Y), denoted σ_XY, is defined by the expectation of the product of deviations: σ_XY = cov(X, Y) = E[(X − μ_X)·(Y − μ_Y)]. For example, in the discrete case, this equals σ_XY = Σ_i Σ_j p(X = x_i, Y = y_j)·(x_i − μ_X)·(y_j − μ_Y). The correlation is defined as ρ_XY = σ_XY/(σ_X·σ_Y).

Often we need to combine two random variables by a linear combination Z = a·X + b·Y, where a, b are two constants. For example, we may sum two scores, a = 1 and b = 1, or take a difference between two scores, a = 1, b = −1. There are two important rules for determining the mean and the variance of a linear transformation. The expectation operation is linear: E[a·X + b·Y] = a·E[X] + b·E[Y]. The variance operator, however, is not linear: var(a·X + b·Y) = a²·var(X) + b²·var(Y) + 2ab·cov(X, Y).
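The linear-combination rules can be checked on simulated scores. In the sketch below, the distributions of X and Y and the constants a and b are arbitrary; the point is only that the sample moments obey the two rules.

```python
import random

random.seed(1)
xs = [random.gauss(5.0, 2.0) for _ in range(100_000)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]   # Y correlated with X
a, b = 1.0, -1.0                                      # difference score Z = X - Y
zs = [a * x + b * y for x, y in zip(xs, ys)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((u - m) ** 2 for u in v) / len(v)

def cov(v, w):
    mv, mw = mean(v), mean(w)
    return sum((p - mv) * (q - mw) for p, q in zip(v, w)) / len(v)

print(mean(zs), a * mean(xs) + b * mean(ys))                               # means agree
print(var(zs), a**2 * var(xs) + b**2 * var(ys) + 2 * a * b * cov(xs, ys))  # variances agree
```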
Maximum Likelihood Estimation
Computational and mathematical models of psychology contain parameters that need to be estimated from the data. For example, suppose a person can choose to play or not play a slot machine at the beginning of each trial. The slot machine pays the amount x(t) on trial t, but this amount is not revealed until the trial is over. Consider a simple model that assumes that the probability of choosing to gamble on trial t, denoted p(t), is predicted by the following linear learning model:

V(t) = (1 − α)·V(t−1) + α·x(t),
p(t) = 1 / (1 + e^(−β·V(t))).

This model has two parameters, α and β, that must be estimated from the data. This is analogous to the estimation problem one faces when using multiple linear regression, where the linear regression is the model and the regression coefficients are the model parameters. However, computational and mathematical models, such as the earlier learning model example, are nonlinear with respect to the model parameters, which makes them more complicated, and one cannot use simple linear regression fitting routines.

The model parameters are estimated from the empirical experimental data. These experiments usually consist of a sample of participants, and each participant provides a series of responses to several experimental conditions. For example, a study of learning to gamble could obtain 100 choice trials at each of 5 payoff conditions from each of 50 participants. One of the first issues for modeling is the level of analysis of the data. On the one hand, a group-level analysis would fit a model to all the data from all the participants, ignoring individual differences. This is not a good idea if there are substantial individual differences. On the other hand, an individual-level analysis would fit a model to each individual separately, allowing arbitrary individual differences. This introduces a new set of parameters for each person, which is unparsimonious. A hierarchical model applies the model to all of the individuals, but it includes a model for the distribution of individual differences. This is a good compromise, but it requires a good model of the distribution of individual differences. Chapter 13 of this book describes the hierarchical approach. Here we describe the basic ideas of fitting models at the individual level using a method called maximum likelihood (see Myung, 2003, for a detailed tutorial on maximum likelihood estimation). Also see Hogg and Craig (1970) for the general properties of maximum likelihood estimates.

Suppose we obtain 100 choice trials (gamble, not gamble) from 5 payoff conditions from each participant. The above learning model has two parameters (α, β) that we wish to estimate using the 100 × 5 = 500 binary valued responses. We can put the 500 answers in a vector D = [x_1, x_2, ..., x_t, ..., x_500], where each x_t is zero (not gamble) or one (gamble). If we pick values for the two parameters (α, β), then we can insert these into our learning model and compute the probability of gambling, p(t), for each trial from the model. Define p(x_t, t) as the probability that the model predicts the value x_t observed on trial t. For example, if x_t = 1, then p(x_t, t) = p(t), but if x_t = 0, then p(x_t, t) = 1 − p(t), where recall that p(t) is the predicted probability of choosing the gamble. Then we compute the likelihood of the observed sequence of data D given the model parameters (α, β) as follows:

L(D|α,β) = p(x_1, 1)·p(x_2, 2) ··· p(x_500, 500).
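Given a reward sequence, a choice sequence, and candidate values of α and β, this likelihood can be computed by stepping through the trials. The sketch below follows the two model equations as printed; the ten trials of data and the parameter values are made up, and the function names are ours.

```python
import math

def choice_probabilities(alpha, beta, rewards):
    """p(t) for each trial, following the two model equations as printed above."""
    V, probs = 0.0, []
    for x_t in rewards:
        V = (1 - alpha) * V + alpha * x_t            # V(t) = (1 - alpha) V(t-1) + alpha x(t)
        probs.append(1.0 / (1.0 + math.exp(-beta * V)))
    return probs

def likelihood(alpha, beta, rewards, choices):
    """Product over trials of p(x_t, t): p(t) if the gamble was chosen, 1 - p(t) otherwise."""
    L = 1.0
    for p, c in zip(choice_probabilities(alpha, beta, rewards), choices):
        L *= p if c == 1 else 1.0 - p
    return L

# Ten made-up trials: the payoff x(t) on each trial and the observed choice (1 = gamble).
rewards = [1, -1, 1, 1, -1, 1, 1, 1, -1, 1]
choices = [1,  0, 1, 1,  0, 1, 1, 1,  1, 0]
print(likelihood(alpha=0.3, beta=2.0, rewards=rewards, choices=choices))
```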
Fig. 1.6 Example of maximum likelihood estimation. The histograms describe data sampled from a Gamma distribution with scale and shape parameters both equal to 20. Using maximum likelihood we can estimate the parameter values of a Gamma distribution that best fits the sample (they turn out to be 20.4 and 19.5, respectively), and plot its probability density function (solid line).
To make this computationally feasible, we use the log likelihood instead:

LnL(D|α,β) = Σ_{t=1}^{500} ln p(x_t, t).

This likelihood changes depending on our choice of α, β. Our goal is to pick (α, β) that maximizes LnL(D|α,β). Nonlinear search algorithms available in computational software, such as MATLAB, R, Gauss, and Mathematica, can be used to find the maximum likelihood estimates. The log likelihood is a goodness-of-fit measure. Higher values indicate better fit. Actually, the computer algorithms find the minimum of the badness-of-fit measure G² = −2·LnL(D|α,β).

Maximum likelihood is not restricted to learning models, and it can be used to fit all kinds of models. For example, if we observe a response time on each trial, and our model predicts the response time for each trial, then the preceding equation can be applied with x_t equal to the observed response time on a trial, and with p(x_t, t) equal to the predicted probability for the observed value of response time on that trial. Figure 1.6 shows an example in which a sample of response time data (summarized in the figure by a histogram) was fit by a gamma distribution model for response time using two model parameters.

Now suppose we have two different competing learning models. The model we just described has two parameters. Suppose the competing model is quite different and it is more complex, with four parameters. Also suppose the models are not nested, so that it is not possible to compute the same predictions for the simpler model using the more complex model. Then we can compare models by using a Bayesian information criterion (BIC; see Wasserman, 2000, for a review). This criterion is derived on the basis of choosing the model that is most probable given the data. (However, the derivation only holds asymptotically as the sample size increases indefinitely.) For each model we wish to compare, we can compute a BIC index: BIC_model = G²_model + n_model·ln(N), where n_model equals the number of model parameters estimated from the data and N = number of data points. The BIC index is an index that balances model fit with model complexity as measured by the number of parameters. (Note, however, that model complexity is more than the number of parameters; see Chapter 13.) It is a badness-of-fit index, and so we choose the model with the lowest BIC index. See Chapter 14 for a detailed review on model comparison.
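As a worked example of the comparison, the sketch below computes BIC for a hypothetical two-parameter model and a hypothetical four-parameter model fit to the same N = 500 responses; the log-likelihood values are invented for illustration.

```python
import math

def bic(log_likelihood, n_params, n_data):
    """BIC = G^2 + n_params * ln(N), with G^2 = -2 * log-likelihood."""
    g2 = -2.0 * log_likelihood
    return g2 + n_params * math.log(n_data)

# Hypothetical fits of the two competing models to the same N = 500 responses.
print(bic(log_likelihood=-310.4, n_params=2, n_data=500))   # about 633.2
print(bic(log_likelihood=-306.9, n_params=4, n_data=500))   # about 638.7
# The simpler model has the lower BIC here, so it would be preferred.
```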
review of basic mathematical concepts 9 perception. In addition, we provide two chapters applications, and provides a specific glossary list of on modeling tools, including Bayesian estimation the basic concepts in the topic area. We believe you in hierarchical models and model comparison will have a rewarding reading experience. methods. We conclude the handbook with three chapters on new directions in the field, including Note neurocognitive modeling, mathematical and com- 1. This chapter is restricted to real numbers. putational modeling in clinical psychology, and cognitive and decision models based upon quantum probability theory. References The models reviewed in the handbook make Battacharya, R. N. & Waymire, E. C. (2009). Stochastic processes use of many of the mathematical ideas presented with applications (Vol. 61). Philadelphia, PA: Siam. Busemeyer, J. R., & Diederich, A. (2009). Cognitive modeling. in this review chapter. Probabilistic models ap- Thousand Oaks, CA: SAGE. pear in chapters covering signal detection theory Cox, D. R. & Miller, H. D. (1965). Thetheoryofstochastic (Chapter 2), probabilistic models of cognition processes (Vol. 134). Boca Raton, FL: CRC Press. (Chapter 9), decision theory (Chapters 10 and 17), Feller, W. (1968). on An introduction to probability theory and its and clinical applications (Chapter 16). Stochas- applications (3rd ed., Vol. 1). New York, NY: Wiley. tic models (i.e., models that are dynamic and Hintzman D.L. (1991). Why are formal models useful in probabilistic) appear in chapters covering informa- psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of tion processing (Chapter 4), percpetual judgment Bennet B. Murdock (pp. 39–56). Hillsdale, NJ: Erlbaum. (Chapter 6), and random walk/diffusion models Hogg, R. V., & Craig, A. T. (1970). Introduction to mathematical of choice and response time in various cogntive statistics (3rd ed.). New York, NY: Macmillan. tasks (Chapters 3, 7, 10, and 15). Learning and Lewandowsky, S. & Ferrel, S. (2010). Computational modeling memory models are reviewed in Chapters 5, 7, 8, in cognition: Principles and practice. Thousand Oaks, CA: and 11. Models using vector spaces and geometry SAGE. are introduced in Chapters 11, 12, and 17. Myung, I. J. (2003). Tutorial on maximum likelihood estima- tion. Journal of Mathematical Psychology, 47, 90–100. The basic concepts reviewed in this chapter Stewart, J. (2012). Calculus (7th ed.) Belmont, CA: Brooks/Cole. should be helpful for readers who are new to math- Suppes, P. (1984). Probabilistic metaphysics. Oxford: Basil ematical and computational models to jumpstart Blackwell. reading the rest of the book. In addition, each Wasserman, L. (2000) Bayesian model selection and model chapter is self-contained, presents a tutorial style averaging. Journal of Mathematical Psychology, 44(1), 92– introduction to the topic area exemplified by many 107.
PART I Elementary Cognitive Mechanisms
CHAPTER 2 Multidimensional Signal Detection Theory
F. Gregory Ashby and Fabian A. Soto
Abstract Multidimensional signal detection theory is a multivariate extension of signal detection theory that makes two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. The most widely studied version is known as general recognition theory (GRT). General recognition theory assumes that the percept on each trial can be modeled as a random sample from a multivariate probability distribution defined over the perceptual space. Decision bounds divide this space into regions that are each associated with a response alternative. General recognition theory rigorously defines and tests a number of important perceptual and cognitive conditions, including perceptual and decisional separability and perceptual independence. General recognition theory has been used to analyze data from identification experiments in two ways: (1) fitting and comparing models that make different assumptions about perceptual and decisional processing, and (2) testing assumptions by computing summary statistics and checking whether these satisfy certain conditions. Much has been learned recently about the neural networks that mediate the perceptual and decisional processing modeled by GRT, and this knowledge can be used to improve the design of experiments where a GRT analysis is anticipated. Key Words: signal detection theory, general recognition theory, perceptual separability, perceptual independence, identification, categorization
Introduction
Signal detection theory revolutionized psychophysics in two different ways. First, it introduced the idea that trial-by-trial variability in sensation can significantly affect a subject's performance. And second, it introduced the field to the then-radical idea that every psychophysical response requires a decision from the subject, even when the task is as simple as detecting a signal in the presence of noise. Of course, signal detection theory proved to be wildly successful and both of these assumptions are now routinely accepted without question in virtually all areas of psychology.
The mathematical basis of signal detection theory is rooted in statistical decision theory, which itself has a history that dates back at least several centuries. The insight of signal detection theorists was that this model of statistical decisions was also a good model of sensory decisions. The first signal detection theory publication appeared in 1954 (Peterson, Birdsall, & Fox, 1954), but the theory did not really become widely known in psychology until the seminal article of Swets, Tanner, and Birdsall appeared in Psychological Review in 1961. From then until 1986, almost all applications of signal detection theory assumed only one sensory dimension (Tanner, 1956, is the principal exception). In almost all cases, this dimension was meant to represent sensory magnitude. For a detailed description of this standard univariate
theory, see the excellent texts of either Macmillan and Creelman (2005) or Wickens (2002). This chapter describes multivariate generalizations of signal detection theory.
Multidimensional signal detection theory is a multivariate extension of signal detection to cases in which there is more than one perceptual dimension. It has all the advantages of univariate signal detection theory (i.e., it separates perceptual and decision processes) but it also offers the best existing method for examining interactions among perceptual dimensions (or components). The most widely studied version of multidimensional signal detection theory is known as general recognition theory (GRT; Ashby & Townsend, 1986). Since its inception, more than 350 articles have applied GRT to a wide variety of phenomena, including categorization (e.g., Ashby & Gott, 1988; Maddox & Ashby, 1993), similarity judgment (Ashby & Perrin, 1988), face perception (Blaha, Silbert, & Townsend, 2011; Thomas, 2001; Wenger & Ingvalson, 2002), recognition and source memory (Banks, 2000; Rotello, Macmillan, & Reeder, 2004), source monitoring (DeCarlo, 2003), attention (Maddox, Ashby, & Waldron, 2002), object recognition (Cohen, 1997; Demeyer, Zaenen, & Wagemans, 2007), perception/action interactions (Amazeen & DaSilva, 2005), auditory and speech perception (Silbert, 2012; Silbert, Townsend, & Lentz, 2009), haptic perception (Giordano et al., 2012; Louw, Kappers, & Koenderink, 2002), and the perception of sexual interest (Farris, Viken, & Treat, 2010).
Extending signal detection theory to multiple dimensions might seem like a straightforward mathematical exercise, but, in fact, several new conceptual problems must be solved. First, with more than one dimension, it becomes necessary to model interactions (or the lack thereof) among those dimensions. During the 1960s and 1970s, a great many terms were coined that attempted to describe perceptual interactions among separate stimulus components. None of these, however, were rigorously defined or had any underlying theoretical foundation. Included in this list were perceptual independence, separability, integrality, performance parity, and sampling independence. Thus, to be useful as a model of perception, any multivariate extension of signal detection theory needed to provide theoretical interpretations of these terms and show rigorously how they were related to one another.
Second, the problem of how to model decision processes when the perceptual space is multidimensional is far more difficult than when there is only one sensory dimension. A standard signal-detection-theory lecture is to show that almost any decision strategy is mathematically equivalent to setting a criterion on the single sensory dimension, then giving one response if the sensory value falls on one side of this criterion, and the other response if the sensory value falls on the other side. For example, in the normal, equal-variance model, this is true regardless of whether subjects base their decision on sensory magnitude or on likelihood ratio. A straightforward generalization of this model to two perceptual dimensions divides the perceptual plane into two response regions. One response is given if the percept falls in the first region and the other response is given if the percept falls in the second region. The obvious problem is that, unlike a line, there are an infinite number of ways to divide a plane into two regions. How do we know which of these has the most empirical validity?
The solution to the first of these two problems (that is, the sensory problem) was proposed by Ashby and Townsend (1986) in the article that first developed GRT. The GRT model of sensory interactions has been embellished during the past 25 years, but the core concepts introduced by Ashby and Townsend (1986) remain unchanged (i.e., perceptual independence, perceptual separability). In contrast, the decision problem has been much more difficult. Ashby and Townsend (1986) proposed some candidate decision processes, but at that time they were largely without empirical support. In the ensuing 25 years, however, hundreds of studies have attacked this problem, and today much is known about human decision processes in perceptual and cognitive tasks that use multidimensional perceptual stimuli.

Box 1 Notation
AiBj = stimulus constructed by setting component A to level i and component B to level j
aibj = response in an identification experiment signaling that component A is at level i and component B is at level j
X1 = perceived value of component A
X2 = perceived value of component B
fij(x1, x2) = joint likelihood that the perceived value of component A is x1 and the perceived value of component B is x2 on a trial when the presented stimulus is AiBj
gij(x1) = marginal pdf of component A on trials when stimulus AiBj is presented
rij = frequency with which the subject responded Rj on trials when stimulus Si was presented
P(Rj|Si) = probability that response Rj is given on a trial when stimulus Si is presented
General Recognition Theory
General recognition theory (see the Glossary for key concepts related to GRT) can be applied to virtually any task. The most common applications, however, are to tasks in which the stimuli vary on two stimulus components or dimensions. As an example, consider an experiment in which participants are asked to categorize or identify faces that vary across trials on gender and age. Suppose there are four stimuli (i.e., faces) that are created by factorially combining two levels of each dimension. In this case we could denote the two levels of the gender dimension by A1 (male) and A2 (female) and the two levels of the age dimension by B1 (teen) and B2 (adult). Then the four faces are denoted as A1B1 (male teen), A1B2 (male adult), A2B1 (female teen), and A2B2 (female adult).
As with signal detection theory, a fundamental assumption of GRT is that all perceptual systems are inherently noisy. There is noise both in the stimulus (e.g., photon noise) and in the neural systems that determine its sensory representation (Ashby & Lee, 1993). Even so, the perceived value on each sensory dimension will tend to increase as the level of the relevant stimulus component increases. In other words, the distribution of percepts will change when the stimulus changes. So, for example, each time the A1B1 face is presented, its perceived age and maleness will tend to be slightly different.
General recognition theory models the sensory or perceptual effects of a stimulus AiBj via the joint probability density function (pdf) fij(x1, x2) (see Box 1 for a description of the notation used in this article). On any particular trial when stimulus AiBj is presented, GRT assumes that the subject's percept can be modeled as a random sample from this joint pdf. Any such sample defines an ordered pair (x1, x2), the entries of which fix the perceived value of the stimulus on the two sensory dimensions. General recognition theory assumes that the subject uses these values to select a response.
In GRT, the relationship of the joint pdf to the marginal pdfs plays a critical role in determining whether the stimulus dimensions are perceptually integral or separable. The marginal pdf gij(x1) simply describes the likelihoods of all possible sensory values of X1. Note that the marginal pdfs are identical to the one-dimensional pdfs of classical signal detection theory.
Component A is perceptually separable from component B if the subject's perception of A does not change when the level of B is varied. For example, age is perceptually separable from gender if the perceived age of the adult in our face experiment is the same for the male adult as for the female adult, and if a similar invariance holds for the perceived age of the teen. More formally, in an experiment with the four stimuli, A1B1, A1B2, A2B1, and A2B2, component A is perceptually separable from B if and only if
g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1. (1)
Similarly, component B is perceptually separable from A if and only if
g11(x2) = g21(x2) and g12(x2) = g22(x2), (2)
for all values of x2. If perceptual separability fails then A and B are said to be perceptually integral. Note that this definition is purely perceptual since it places no constraints on any decision processes.
Another purely perceptual phenomenon is perceptual independence. According to GRT, components A and B are perceived independently in stimulus AiBj if and only if the perceptual value of component A is statistically independent of the perceptual value of component B on AiBj trials. More specifically, A and B are perceived independently in stimulus AiBj if and only if
fij(x1, x2) = gij(x1) gij(x2) (3)
for all values of x1 and x2. If perceptual independence is violated, then components A and B are perceived dependently. Note that perceptual independence is a property of a single stimulus, whereas perceptual separability is a property of groups of stimuli.
A third important construct from GRT is decisional separability. In our hypothetical experiment with stimuli A1B1, A1B2, A2B1, and A2B2, and two perceptual dimensions X1 and X2, decisional separability holds on dimension X1 (for example) if the subject's decision about whether stimulus component A is at level 1 or 2 depends only on the perceived value on dimension X1. A decision bound is a line or curve that separates regions of the perceptual space that elicit different responses. The only types of decision bounds that satisfy decisional separability are vertical and horizontal lines.
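In the Gaussian version of GRT described in the next section, Eqs. 1 to 3 reduce to simple constraints on the means, standard deviations, and correlations of the percept distributions. The following sketch is ours, not from the chapter, and the parameter layout (a dictionary of hypothetical Gaussian parameters per stimulus) is only an illustration:

# Each stimulus AiBj is assigned a bivariate normal percept distribution.
params = {
    ("A1", "B1"): {"mean": (0.0, 0.0), "sd": (1.0, 1.0), "rho": 0.0},
    ("A1", "B2"): {"mean": (0.0, 1.5), "sd": (1.0, 1.0), "rho": 0.0},
    ("A2", "B1"): {"mean": (1.0, 0.0), "sd": (1.0, 1.0), "rho": 0.0},
    ("A2", "B2"): {"mean": (2.0, 1.5), "sd": (1.0, 1.0), "rho": 0.4},
}

def a_separable_from_b(p):
    # Eq. 1: the marginal distribution of component A (dimension x1) must not
    # depend on the level of B. For Gaussians this means equal means and SDs on x1.
    return all(
        p[("A%d" % i, "B1")]["mean"][0] == p[("A%d" % i, "B2")]["mean"][0]
        and p[("A%d" % i, "B1")]["sd"][0] == p[("A%d" % i, "B2")]["sd"][0]
        for i in (1, 2)
    )

def perceptual_independence(p, stim):
    # Eq. 3: for a bivariate normal, independence is equivalent to zero correlation.
    return p[stim]["rho"] == 0.0

print(a_separable_from_b(params))                       # False in this example
print(perceptual_independence(params, ("A2", "B2")))    # False in this example

Perceptual separability of B from A (Eq. 2) would be checked the same way on the second dimension, and decisional separability simply requires that the bound on each dimension be a vertical or horizontal line.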
The Multivariate Normal Model
So far we have made no assumptions about the form of the joint or marginal pdfs. Our only assumption has been that there exists some probability distribution associated with each stimulus and that these distributions are all embedded in some Euclidean space (e.g., with orthogonal dimensions). There have been some efforts to extend GRT to more general geometric spaces (i.e., Riemannian manifolds; Townsend, Aisbett, Assadi, & Busemeyer, 2006; Townsend & Spencer-Smith, 2004), but much more common is to add more restrictions to the original version of GRT, not fewer. For example, some applications of GRT have been distribution free (e.g., Ashby & Maddox, 1994; Ashby & Townsend, 1986), but most have assumed that the percepts are multivariate normally distributed. The multivariate normal distribution includes two assumptions. First, the marginal distributions are all normal. Second, the only possible dependencies are pairwise linear relationships. Thus, in multivariate normal distributions, uncorrelated random variables are statistically independent.
A hypothetical example of a GRT model that assumes multivariate normal distributions is shown in Figure 2.1. The ellipses shown there are contours of equal likelihood; that is, all points on the same ellipse are equally likely to be sampled from the underlying distribution. The contours of equal likelihood also describe the shape a scatterplot of points would take if they were random samples from the underlying distribution. Geometrically, the contours are created by taking a slice through the distribution parallel to the perceptual plane and looking down at the result from above. Contours of equal likelihood in multivariate normal distributions are always circles or ellipses. Bivariate normal distributions, like those depicted in Figure 2.1, are each characterized by five parameters: a mean on each dimension, a variance on each dimension, and a covariance or correlation between the values on the two dimensions. These are typically catalogued in a mean vector and a variance-covariance matrix. For example, consider a bivariate normal distribution with joint density function f(x1, x2). Then the mean vector would equal
\[ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \tag{4} \]
and the variance-covariance matrix would equal
\[ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \mathrm{cov}_{12} \\ \mathrm{cov}_{21} & \sigma_2^2 \end{bmatrix} \tag{5} \]
where cov12 is the covariance between the values on the two dimensions (i.e., note that the correlation coefficient is the standardized covariance; that is, the correlation \( \rho_{12} = \mathrm{cov}_{12}/(\sigma_1 \sigma_2) \)).
The multivariate normal distribution has another important property. Consider an identification task with only two stimuli and suppose the perceptual effects associated with the presentation of each stimulus can be modeled as a multivariate normal distribution. Then it is straightforward to show that the decision boundary that maximizes accuracy is always linear or quadratic (e.g., Ashby, 1992). The optimal boundary is linear if the two perceptual distributions have equal variance-covariance matrices (and so the contours of equal likelihood have the same shape and are just translations of each other) and the optimal boundary is quadratic if the two variance-covariance matrices are unequal. Thus, in the Gaussian version of GRT, the only decision bounds that are typically considered are either linear or quadratic.
Fig. 2.1 Contours of equal likelihood, decision bounds, and marginal perceptual distributions from a hypothetical multivariate normal GRT model that describes the results of an identification experiment with four stimuli that were constructed by factorially combining two levels of two stimulus dimensions.
In Figure 2.1, note that perceptual independence holds for all stimuli except A2B2. This can be seen in the contours of equal likelihood. Note that the major and minor axes of the ellipses that define the contours of equal likelihood for stimuli A1B1, A1B2, and A2B1 are all parallel to the two perceptual dimensions. Thus, a scatterplot of samples from each of these distributions would be characterized by zero correlation and, therefore, statistical independence (i.e., in the special Gaussian case). However, the major and minor axes of the A2B2 distribution are tilted, reflecting a positive correlation and hence a violation of perceptual independence.
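To make a Figure 2.1-style model concrete, the sketch below (our illustration; all parameter values are invented) draws random percepts from each stimulus distribution and classifies them with one vertical and one horizontal decision bound, producing the kind of confusion matrix that is analyzed in the next section:

import numpy as np

rng = np.random.default_rng(1)
stimuli = ["A1B1", "A1B2", "A2B1", "A2B2"]

# Hypothetical means, SDs, and correlations for the four percept distributions
means = {"A1B1": (0, 0), "A1B2": (0, 2), "A2B1": (2, 0), "A2B2": (2, 2)}
sds = {s: (1.0, 1.0) for s in stimuli}
rhos = {"A1B1": 0.0, "A1B2": 0.0, "A2B1": 0.0, "A2B2": 0.5}

x1_crit, x2_crit = 1.0, 1.0          # decisionally separable bounds (vertical, horizontal)
n_trials = 250

confusion = {s: {r: 0 for r in stimuli} for s in stimuli}
for s in stimuli:
    (m1, m2), (s1, s2), rho = means[s], sds[s], rhos[s]
    cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
    percepts = rng.multivariate_normal([m1, m2], cov, size=n_trials)
    for x1, x2 in percepts:
        resp = ("A2" if x1 > x1_crit else "A1") + ("B2" if x2 > x2_crit else "B1")
        confusion[s][resp] += 1

for s in stimuli:
    print(s, [confusion[s][r] for r in stimuli])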
Next, note in Figure 2.1 that stimulus component A is perceptually separable from stimulus component B, but B is not perceptually separable from A. To see this, note that the marginal distributions for stimulus component A are the same, regardless of the level of component B [i.e., g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1]. Thus, the subject's perception of component A does not depend on the level of B and, therefore, stimulus component A is perceptually separable from B. On the other hand, note that the subject's perception of component B does change when the level of component A changes [i.e., g11(x2) ≠ g21(x2) and g12(x2) ≠ g22(x2) for most values of x2]. In particular, when A changes from level 1 to level 2 the subject's mean perceived value of each level of component B increases. Thus, the perception of component B depends on the level of component A and therefore B is not perceptually separable from A.
Finally, note that decisional separability holds on dimension 1 but not on dimension 2. On dimension 1 the decision bound is vertical. Thus, the subject has adopted the following decision rule:
Component A is at level 2 if x1 > Xc1; otherwise component A is at level 1,
where Xc1 is the criterion on dimension 1 (i.e., the x1 intercept of the vertical decision bound). Thus, the subject's decision about whether component A is at level 1 or 2 does not depend on the perceived value of component B. So component A is decisionally separable from component B. On the other hand, the decision bound on dimension x2 is not horizontal, so the criterion used to judge whether component B is at level 1 or 2 changes with the perceived value of component A (at least for larger perceived values of A). As a result, component B is not decisionally separable from component A.

Applying GRT to Data
The most common applications of GRT are to data collected in an identification experiment like the one modeled in Figure 2.1. The key data from such experiments are collected in a confusion matrix, which contains a row for every stimulus and a column for every response (Table 2.1 displays an example of a confusion matrix, which will be discussed and analyzed later). The entry in row i and column j lists the number of trials on which stimulus Si was presented and the subject gave response Rj. Thus, the entries on the main diagonal give the frequencies of all correct responses and the off-diagonal entries describe the various errors (or confusions). Note that each row sum equals the total number of stimulus presentations of that type. So if each stimulus is presented 100 times then the sum of all entries in each row will equal 100. This means that there is one constraint per row, so an n × n confusion matrix will have n × (n − 1) degrees of freedom.
General recognition theory has been used to analyze data from confusion matrices in two different ways. One is to fit the model to the entire confusion matrix. In this method, a GRT model is constructed with specific numerical values of all of its parameters and a predicted confusion matrix is computed. Next, values of each parameter are found that make the predicted matrix as close as possible to the empirical confusion matrix. To test various assumptions about perceptual and decisional processing (for example, whether perceptual independence holds), a version of the model that assumes perceptual independence is fit to the data as well as a version that makes no assumptions about perceptual independence. This latter version contains the former version as a special case (i.e., in which all covariance parameters are set to zero), so it can never fit worse. After fitting these two models, we assume that perceptual independence is violated if the more general model fits significantly better than the more restricted model that assumes perceptual independence. The other method for using GRT to test assumptions about perceptual processing, which is arguably more popular, is to compute certain summary statistics from the empirical confusion matrix and then to check whether these satisfy certain conditions that are characteristic of perceptual separability or
independence. Because these two methods are so different, we will discuss each in turn.
It is important to note, however, that regardless of which method is used, there are certain nonidentifiabilities in the GRT model that could limit the conclusions that are possible to draw from any such analyses (e.g., Menneer, Wenger, & Blaha, 2010; Silbert & Thomas, 2013). The problems are most severe when GRT is applied to 2 × 2 identification data (i.e., when the stimuli are A1B1, A1B2, A2B1, and A2B2). For example, Silbert and Thomas (2013) showed that in 2 × 2 applications where there are two linear decision bounds that do not satisfy decisional separability, there always exists an alternative model that makes the exact same empirical predictions and satisfies decisional separability (and these two models are related by an affine transformation). Thus, decisional separability is not testable with standard applications of GRT to 2 × 2 identification data (nor can the slopes of the decision bounds be uniquely estimated). For several reasons, however, these nonidentifiabilities are not catastrophic.
First, the problems don't generally exist with 3 × 3 or larger identification tasks. In the 3 × 3 case the GRT model with linear bounds requires at least 4 decision bounds to divide the perceptual space into 9 response regions (e.g., in a tic-tac-toe configuration). Typically, two will have a generally vertical orientation and two will have a generally horizontal orientation. In this case, there is no affine transformation that guarantees decisional separability except in the special case where the two vertical-tending bounds are parallel and the two horizontal-tending bounds are parallel (because parallel lines remain parallel after affine transformations). Thus, in 3 × 3 (or higher) designs, decisional separability is typically identifiable and testable.
Second, there are simple experimental manipulations that can be added to the basic 2 × 2 identification experiment to test for decisional separability. In particular, switching the locations of the response keys is known to interfere with performance if decisional separability fails but not if decisional separability holds (Maddox, Glass, O'Brien, Filoteo, & Ashby, 2010; for more information on this, see the section later entitled "Neural Implementations of GRT"). Thus, one could add 100 extra trials to the end of a 2 × 2 identification experiment where the response key locations are randomly interchanged (and participants are informed of this change). If accuracy drops significantly during this period, then decisional separability can be rejected, whereas if accuracy is unaffected then decisional separability is supported.
Third, one could analyze the 2 × 2 data using the newly developed GRT model with individual differences (GRT-wIND; Soto, Vucovich, Musgrave, & Ashby, in press), which was patterned after the INDSCAL model of multidimensional scaling (Carroll & Chang, 1970). GRT-wIND is fit to the data from all individuals simultaneously. All participants are assumed to share the same group perceptual distributions, but different participants are allowed different linear bounds and they are assumed to allocate different amounts of attention to each perceptual dimension. The model does not suffer from the identifiability problems identified by Silbert and Thomas (2013), even in the 2 × 2 case, because with different linear bounds for each participant there is no affine transformation that simultaneously makes all these bounds satisfy decisional separability.

Fitting the GRT Model to Identification Data
computing the likelihood function
When the full GRT model is fit to identification data, the best-fitting values of all free parameters must be found. Ideally, this is done via the method of maximum likelihood; that is, numerical values of all parameters are found that maximize the likelihood of the data given the model. Let S1, S2, ..., Sn denote the n stimuli in an identification experiment and let R1, R2, ..., Rn denote the n responses. Let rij denote the frequency with which the subject responded Rj on trials when stimulus Si was presented. Thus, rij is the entry in row i and column j of the confusion matrix. Note that the rij are random variables. The entries in each row have a multinomial distribution. In particular, if P(Rj|Si) is the true probability that response Rj is given on trials when stimulus Si is presented, then the probability of observing the response frequencies ri1, ri2, ..., rin in row i equals
\[ \frac{n_i!}{r_{i1}!\, r_{i2}! \cdots r_{in}!}\; P(R_1 \mid S_i)^{r_{i1}}\, P(R_2 \mid S_i)^{r_{i2}} \cdots P(R_n \mid S_i)^{r_{in}} \tag{6} \]
where ni is the total number of times that stimulus Si was presented during the course of the experiment.
The probability or joint likelihood of observing the entire confusion matrix is the product of the probabilities of observing each row; that is,
\[ L = \prod_{i=1}^{n} \left[ \frac{n_i!}{\prod_{j=1}^{n} r_{ij}!} \prod_{j=1}^{n} P(R_j \mid S_i)^{r_{ij}} \right] \tag{7} \]
General recognition theory models predict that P(Rj|Si) has a specific form. Specifically, they predict that P(Rj|Si) is the volume in the Rj response region under the multivariate distribution of perceptual effects elicited when stimulus Si is presented. This requires computing a multiple integral. The maximum likelihood estimators of the GRT model parameters are those numerical values of each parameter that maximize L. Note that the first term in Eq. 7 does not depend on the values of any model parameters. Rather it only depends on the data. Thus, the parameter values that maximize the second term also maximize the whole expression. For this reason, the first term can be ignored during the maximization process. Another common practice is to take logs of both sides of Eq. 7. Parameter values that maximize L will also maximize any monotonic function of L (and log is a monotonic transformation). So, the standard approach is to find values of the free parameters that maximize
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} r_{ij} \log P(R_j \mid S_i) \tag{8} \]
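Concretely, the quantity in Eq. 8 is straightforward to compute from the confusion matrix once a matrix of predicted response probabilities is available. A minimal sketch (ours; the GRT-specific step of producing predicted_P from distributional and decision-bound parameters is described in the next subsection):

import numpy as np

def log_likelihood(confusion, predicted_P, eps=1e-12):
    # Kernel of the multinomial log-likelihood in Eq. 8.
    # confusion[i, j]   = frequency of response Rj to stimulus Si (r_ij)
    # predicted_P[i, j] = model's predicted P(Rj | Si)
    # The multinomial coefficient of Eq. 7 is omitted because it does not
    # depend on the model parameters.
    predicted_P = np.clip(predicted_P, eps, 1.0)
    return np.sum(confusion * np.log(predicted_P))

Maximum likelihood estimation then amounts to handing the negative of this function to a general-purpose optimizer (e.g., scipy.optimize.minimize) that varies the parameters used to build predicted_P.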
estimating the parameters
In the case of the multivariate normal model, the predicted probability P(Rj|Si) in Eq. 8 equals the volume under the multivariate normal pdf that describes the subject's perceptual experiences on trials when stimulus Si is presented over the response region associated with response Rj. To estimate the best-fitting parameter values using a standard minimization routine, such integrals must be evaluated many times. If decisional separability is assumed, then the problem simplifies considerably. For example, under these conditions, Wickens (1992) derived the first and second derivatives necessary to quickly estimate parameters of the model using the Newton-Raphson method. Other methods must be used for more general models that do not assume decisional separability. Ennis and Ashby (2003) proposed an efficient algorithm for evaluating the integrals that arise when fitting any GRT model. This algorithm allows the parameters of virtually any GRT model to be estimated via standard minimization software. The remainder of this section describes this method.
The left side of Figure 2.2 shows a contour of equal likelihood from the bivariate normal distribution that describes the perceptual effects of stimulus Si, and the solid lines denote two possible decision bounds in this hypothetical task. In Figure 2.2 the bounds are linear, but the method works for any number of bounds that have any parametric form. The shaded region is the Rj response region. Thus, according to GRT, computing P(Rj|Si) is equivalent to computing the volume under the Si perceptual distribution in the Rj response region. This volume is indicated by the shaded region in the figure. First note that any linear bound can be written in discriminant function form as
h(x1, x2) = h(x) = b′x + c = 0 (9)
where (in the bivariate case) x and b are the vectors
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad\text{and}\quad \mathbf{b}' = [\, b_1 \;\; b_2 \,] \]
and c is a constant. The discriminant function form of any decision bound has the property that positive values are obtained if any point on one side of the bound is inserted into the function, and negative values are obtained if any point on the opposite side is inserted. So, for example, in Figure 2.2, the constants b1, b2, and c can be selected so that h1(x) > 0 for any point x above the h1 bound and h1(x) < 0 for any point below the bound. Similarly, for the h2 bound, the constants can be selected so that h2(x) > 0 for any point to the right of the bound and h2(x) < 0 for any point to the left. Note that under these conditions, the Rj response region is defined as the set of all x such that h1(x) > 0 and h2(x) > 0. Therefore, if we denote the multivariate normal (mvn) pdf for stimulus Si as mvn(µi, Σi), then
\[ P(R_j \mid S_i) = \iint_{h_1(\mathbf{x})>0,\; h_2(\mathbf{x})>0} \mathrm{mvn}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, dx_1\, dx_2 \tag{10} \]
Ennis and Ashby (2003) showed how to quickly approximate integrals of this type. The basic idea is to transform the problem using a multivariate form of the well-known z transformation. Ennis and Ashby proposed using the Cholesky transformation. Any random vector x that has a multivariate normal distribution can always be rewritten as
x = Pz + µ, (11)
where µ is the mean vector of x, z is a random vector with a multivariate z distribution (i.e., a multivariate normal distribution with mean vector 0 and variance-covariance matrix equal to the identity matrix I), and P is a lower triangular matrix such that PP′ = Σ (i.e., the variance-covariance matrix of x). If x is bivariate normal then
\[ P = \begin{bmatrix} \sigma_1 & 0 \\[4pt] \dfrac{\mathrm{cov}_{12}}{\sigma_1} & \sqrt{\sigma_2^2 - \dfrac{\mathrm{cov}_{12}^2}{\sigma_1^2}} \end{bmatrix} \tag{12} \]
The Cholesky transformation is linear (see Eq. 11), so linear bounds in x space are transformed to linear bounds in z space. In particular,
hk(x) = b′x + c = 0
becomes
hk(Pz + µ) = b′(Pz + µ) + c = 0
or equivalently
hk*(z) = (b′P)z + (b′µ + c) = 0 (13)
Thus, in this way we can transform the Eq. 10 integral to
\[ P(R_j \mid S_i) = \iint_{h_1(\mathbf{x})>0,\; h_2(\mathbf{x})>0} \mathrm{mvn}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, dx_1\, dx_2 = \iint_{h_1^*(\mathbf{z})>0,\; h_2^*(\mathbf{z})>0} \mathrm{mvn}(\mathbf{0}, I)\, dz_1\, dz_2 \tag{14} \]
Fig. 2.2 Schematic illustration of how numerical integration is performed in the multivariate normal GRT model via Cholesky factorization. (Left panel: the bounds h1(x1, x2) = 0 and h2(x1, x2) = 0 and the Respond Rj region in (x1, x2) space; right panel: the transformed bounds h1*(z1, z2) = 0 and h2*(z1, z2) = 0 in (z1, z2) space.)
The right and left panels of Figure 2.2 illustrate these two integrals. The key to evaluating the second of these integrals quickly is to preload z-values that are centered in equal area intervals. In Figure 2.2 each gray point in the right panel has a z1 coordinate that is the center of an interval with area 0.10 under the z distribution (since there are 10 points). Taking the Cartesian product of these 10 points produces a table of 100 ordered pairs (z1, z2) that are each the center of a rectangle with volume 0.01 (i.e., 0.10 × 0.10) under the bivariate z distribution. Given such a table, the Eq. 14 integral is evaluated by stepping through all (z1, z2) points in the table. Each point is substituted into Eq. 13 for k = 1 and 2 and the signs of h1*(z1, z2) and h2*(z1, z2) are determined. If h1*(z1, z2) > 0 and h2*(z1, z2) > 0 then the Eq. 14 integral is incremented by 0.01. If either or both of these signs are negative, then the value of the integral is unchanged. So the value of the integral is approximately equal to the number of (z1, z2) points that are in the Rj response region divided by the total number of (z1, z2) points in the table. Figure 2.2 shows a 10 × 10 grid of (z1, z2) points, but better results can be expected from a grid with higher resolution. We have had success with a 100 × 100 grid, which should produce approximations to the integral that are accurate to within 0.0001 (Ashby, Waldron, Lee, & Berkman, 2001).
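A compact sketch of this grid approximation is given below. It is our rendering of the procedure (with hypothetical parameter values): it builds the Cholesky factor of Eq. 12, transforms the linear bounds as in Eq. 13, and counts the fraction of equal-area grid points that satisfy every inequality.

import numpy as np
from scipy.stats import norm

def response_probability(mu, Sigma, bounds, grid=100):
    # Approximate P(Rj | Si) of Eq. 14 for a region defined by b'x + c > 0 bounds.
    # mu, Sigma : mean vector and covariance matrix of the percept distribution
    # bounds    : list of (b, c) pairs; the response region is where every b @ x + c > 0
    P = np.linalg.cholesky(Sigma)              # lower triangular, P @ P.T = Sigma
    # z-values centered in equal-area intervals of the standard normal
    z = norm.ppf((np.arange(grid) + 0.5) / grid)
    z1, z2 = np.meshgrid(z, z)
    zs = np.stack([z1.ravel(), z2.ravel()])    # 2 x grid**2 points
    inside = np.ones(zs.shape[1], dtype=bool)
    for b, c in bounds:
        b = np.asarray(b, dtype=float)
        # Eq. 13: h*(z) = (b'P) z + (b'mu + c)
        inside &= (b @ P) @ zs + (b @ mu + c) > 0
    return inside.mean()                        # each point carries mass 1/grid**2

# Hypothetical stimulus distribution and two linear bounds (x1 > 1 and x2 < 1)
mu = np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
bounds = [(np.array([1.0, 0.0]), -1.0),
          (np.array([0.0, -1.0]), 1.0)]
print(response_probability(mu, Sigma, bounds))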
evaluating goodness of fit
As indicated before, one popular method for testing an assumption about perceptual or decisional processing is to fit two versions of a GRT model to the data using the procedures outlined in this section. In the first, restricted version of the model, a number of parameters are set to values that reflect the assumption being tested. For example, fixing all correlations to zero would test perceptual independence. In the second, unrestricted version of the model, the same parameters are free to vary. Once the restricted and unrestricted versions of the model have been fit, they can be compared through a likelihood ratio test:
λ = −2(log LR − log LU), (15)
where LR and LU represent the likelihoods of the restricted and unrestricted models, respectively. Under the null hypothesis that the restricted model is correct, the λ statistic has a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the restricted and unrestricted models.
If several non-nested models were fitted to the data, we would usually want to select the best candidate from this set. The likelihood ratio test cannot be used to select among such non-nested models. Instead, we can compute the Akaike information criterion (AIC, Akaike, 1974) or the Bayesian information criterion (BIC, Schwarz, 1978):
AIC = −2 log L + 2m (16)
BIC = −2 log L + m log N (17)
where m is the number of free parameters in the model and N is the number of data points being fit. When the sample size is small compared to the number of free parameters of the model, as in most applications of GRT, a correction factor equal to 2m(m + 1)/(n − m − 1) should be added to the AIC (see Burnham & Anderson, 2004). The best model is the one with the smallest AIC or BIC.
Because an n × n confusion matrix has n(n − 1) degrees of freedom, the maximum number of free parameters that can be estimated from any confusion matrix is n(n − 1). The origin and unit of measurement on each perceptual dimension are arbitrary. Therefore, without loss of generality, the mean vector of one perceptual distribution can be set to 0, and all variances of that distribution can be set to 1.0. Therefore, if there are two perceptual dimensions and n stimuli, then the full GRT model has 5(n − 1) + 1 free distributional parameters (i.e., n − 1 stimuli have 5 free parameters (2 means, 2 variances, and a covariance), and the distribution with mean 0 and all variances set to 1 has 1 free parameter (a covariance)). If linear bounds are assumed, then another 2 free parameters must be added for every bound (e.g., slope and intercept). With a factorial design (e.g., as when the stimulus set is A1B1, A1B2, A2B1, and A2B2), there must be at least one bound on each dimension to separate each pair of consecutive component levels. So for stimuli A1B1, A1B2, A2B1, and A2B2, at least two bounds are required (e.g., see Figure 2.1). If, instead, there are 3 levels of each component, then at least 4 bounds are required. The confusion matrix from a 2 × 2 factorial experiment has 12 degrees of freedom. The full model has more free parameters than this, so it cannot be fit to the data from this experiment. As a result, some restrictive assumptions are required. In a 3 × 3 factorial experiment, however, the confusion matrix has 72 degrees of freedom (9 × 8) and the full model has 49 free parameters (i.e., 41 distributional parameters and 8 decision bound parameters), so the full model can be fit to identification data when there are at least 3 levels of each stimulus dimension. For an alternative to the GRT identification model presented in this section, see Box 2.

Box 2 GRT Versus the Similarity-Choice Model
The most widely known alternative identification model is the similarity-choice model (SCM; Luce, 1963; Shepard, 1957), which assumes that
\[ P(R_j \mid S_i) = \frac{\eta_{ij}\,\beta_j}{\sum_k \eta_{ik}\,\beta_k}, \]
where ηij is the similarity between stimuli Si and Sj and βj is the bias toward response Rj. The SCM has had remarkable success. For many years, it was the standard against which competing models were compared. For example, in 1992 J. E. K. Smith summarized its performance by concluding that the SCM "has never had a serious competitor as a model of identification data. Even when it has provided a poor model of such data, other models have done even less well" (p. 199). Shortly thereafter, however, the GRT model ended this dominance, at least for identification data collected from experiments with stimuli that differ on only a couple of stimulus dimensions. In virtually every such comparison, the GRT model has provided a substantially better fit than the SCM, in many cases with fewer free parameters (Ashby et al., 2001). Even so, it is important to note that the SCM is still valuable, especially in the case of identification experiments in which the stimuli vary on many unknown stimulus dimensions.
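The model-comparison statistics of Eqs. 15 to 17 are one-line computations once the maximized log-likelihoods are in hand. A hedged sketch (our code; the log-likelihood values and degrees of freedom are hypothetical):

import math
from scipy.stats import chi2

def lr_test(logL_restricted, logL_unrestricted, df):
    # Eq. 15: nested-model likelihood ratio test
    stat = -2.0 * (logL_restricted - logL_unrestricted)
    return stat, chi2.sf(stat, df)

def aic(logL, m):
    # Eq. 16
    return -2.0 * logL + 2.0 * m

def bic(logL, m, N):
    # Eq. 17
    return -2.0 * logL + m * math.log(N)

# Hypothetical maximized log-likelihoods for a restricted and an unrestricted model
stat, p = lr_test(-1250.3, -1243.8, df=3)
print(stat, p, aic(-1243.8, 8), bic(-1243.8, 8, N=1000))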
The Summary Statistics Approach
The summary statistics approach (Ashby & Townsend, 1986; Kadlec and Townsend, 1992a, 1992b) draws inferences about perceptual independence, perceptual separability, and decisional separability by using summary statistics that are easily computed from a confusion matrix. Consider again the factorial identification experiment with 2 levels of 2 stimulus components. As before, we will denote the stimuli in this experiment as A1B1, A1B2, A2B1, and A2B2. In this case, it is convenient to denote the responses as a1b1, a1b2, a2b1, and a2b2. The summary statistics approach operates by computing certain summary statistics that are derived from the 4 × 4 confusion matrix that results from this experiment. The statistics are computed at either the macro- or micro-level of analysis.

macro-analyses
Macro-analyses draw conclusions about perceptual and decisional separability from changes in accuracy, sensitivity, and bias measures computed for one dimension across levels of a second dimension. One of the most widely used summary statistics in macro-analysis is marginal response invariance, which holds for a dimension when the probability of identifying the correct level of that dimension does not depend on the level of any irrelevant dimensions (Ashby & Townsend, 1986). For example, marginal response invariance requires that the probability of correctly identifying that component A is at level 1 is the same regardless of the level of component B, or in other words that
P(a1 | A1B1) = P(a1 | A1B2)
Now in an identification experiment, A1 can be correctly identified regardless of whether the level of B is correctly identified, and so
P(a1 | A1B1) = P(a1b1 | A1B1) + P(a1b2 | A1B1)
For this reason, marginal response invariance holds on dimension X1 if and only if
P(aib1 | AiB1) + P(aib2 | AiB1) = P(aib1 | AiB2) + P(aib2 | AiB2) (18)
for both i = 1 and 2. Similarly, marginal response invariance holds on dimension X2 if and only if
P(a1bj | A1Bj) + P(a2bj | A1Bj) = P(a1bj | A2Bj) + P(a2bj | A2Bj) (19)
for both j = 1 and 2.
Fig. 2.3 Diagram explaining the relation between macro-analytic summary statistics and the concepts of perceptual and decisional separability.
Marginal response invariance is closely related to perceptual and decisional separability. In fact, if dimension X1 is perceptually and decisionally separable from dimension X2, then marginal response invariance must hold for X1 (Ashby & Townsend, 1986). In the later section entitled "Extensions to Response Time," we describe how an even stronger test is possible with a response time version of marginal response invariance. Figure 2.3 helps to understand intuitively why perceptual and decisional separability together imply marginal response invariance. The top of the figure shows the perceptual distributions of four stimuli that vary on two dimensions. Dimension X1 is decisionally but not perceptually separable from dimension X2; the distance between the means of the perceptual distributions along the X1 axis is much greater for the top two stimuli than for the bottom two stimuli. The marginal distributions at the bottom of Figure 2.3 show that the proportion of correct responses, represented by the light-grey areas under the curves, is larger in the second level of X2 than in the first level. The result would be similar if perceptual separability held and decisional separability failed, as would be the case for X2
if its decision bound was not perpendicular to its main axis.
To test marginal response invariance in dimension X1, we estimate the various probabilities in Eq. 18 from the empirical confusion matrix that results from this identification experiment. Next, equality between the two sides of Eq. 18 is assessed via a standard statistical test. These computations are repeated for both levels of component A and if either of the two tests is significant, then we conclude that marginal response invariance fails, and, therefore, that either perceptual or decisional separability are violated.
The left side of Eq. 18 equals P(ai | AiB1) and the right side equals P(ai | AiB2). These are the probabilities that component Ai is correctly identified and are analogous to "hit" rates in signal detection theory. To emphasize this relationship, we define the identification hit rate of component Ai on trials when stimulus AiBj is presented as
\[ H_{a_i \mid A_i B_j} = P(a_i \mid A_i B_j) = P(a_i b_1 \mid A_i B_j) + P(a_i b_2 \mid A_i B_j) \tag{20} \]
The analogous false alarm rates can be defined similarly. For example,
\[ F_{a_2 \mid A_1 B_j} = P(a_2 \mid A_1 B_j) = P(a_2 b_1 \mid A_1 B_j) + P(a_2 b_2 \mid A_1 B_j) \tag{21} \]
In Figure 2.3, note that the dark grey areas in the marginal distributions equal F(a1 | A2B2) (top) and F(a1 | A2B1) (bottom). In signal detection theory, hit and false-alarm rates are used to measure stimulus discriminability (i.e., d′). We can use the identification analogues to compute marginal discriminabilities for each stimulus component (Thomas, 1999). For example,
\[ d'_{AB_j} = \Phi^{-1}\!\left(H_{a_2 \mid A_2 B_j}\right) - \Phi^{-1}\!\left(F_{a_2 \mid A_1 B_j}\right) \tag{22} \]
where the function Φ⁻¹ is the inverse cumulative distribution function for the standard normal distribution. As shown in Figure 2.3, the value of d′(ABj) represents the standardized distance between the means of the perceptual distributions of stimuli A1Bj and A2Bj. If component A is perceptually separable from component B, then the marginal discriminabilities between the two levels of A must be the same for each level of B; that is, d′(AB1) = d′(AB2) (Kadlec & Townsend, 1992a, 1992b). Thus, if this equality fails, then perceptual separability is violated. The equality between two d′s can be tested using the following statistic (Marascuilo, 1970):
\[ Z = \frac{d'_1 - d'_2}{\sqrt{s^2_{d'_1} + s^2_{d'_2}}} \tag{23} \]
where
\[ s^2_{d'} = \frac{F(1-F)}{n_n\,[\phi(\Phi^{-1}(F))]^2} + \frac{H(1-H)}{n_s\,[\phi(\Phi^{-1}(H))]^2} \tag{24} \]
where φ is the standard normal probability density function, H and F are the hit and false-alarm rates associated with the relevant d′, nn is the number of trials used to compute F, and ns is the number of trials used to compute H. Under the null hypothesis of equal d′s, Z follows a standard normal distribution.
Marginal hit and false-alarm rates can also be used to compute a marginal response criterion. Several measures of response criterion and bias have been proposed (see Chapter 2 of Macmillan & Creelman, 2005), but perhaps the most widely used criterion measure in recent years (due to Kadlec, 1999) is:
\[ c_{AB_j} = \Phi^{-1}\!\left(F_{a_1 \mid A_2 B_j}\right) \tag{25} \]
As shown in Figure 2.3, this measure represents the placement of the decision bound relative to the center of the A2Bj distribution. If component A is perceptually separable from component B, but c(AB1) ≠ c(AB2), then decisional separability must have failed on dimension X1 (Kadlec & Townsend, 1992a, 1992b). On the other hand, if perceptual separability is violated, then examining the marginal response criteria provides no information about decisional separability. To understand why this is the case, note that in Figure 2.3 the marginal c values are not equal, even though decisional separability holds. A failure of perceptual separability has affected measures of both discriminability and response criteria.
To test the difference between two c values, the following test statistic can be used (Kadlec, 1999):
\[ Z = \frac{c_1 - c_2}{\sqrt{s^2_{c_1} + s^2_{c_2}}} \tag{26} \]
where
\[ s^2_c = \frac{F(1-F)}{n_n\,[\phi(\Phi^{-1}(F))]^2}. \tag{27} \]
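These macro-analytic statistics are simple functions of the confusion-matrix cells. The sketch below is ours; the counts and the stimulus/response ordering of the matrix are assumptions made only for illustration. It computes the marginal d′ (Eq. 22) and c (Eq. 25) for component A at one level of B:

import numpy as np
from scipy.stats import norm

# Rows = stimuli A1B1, A1B2, A2B1, A2B2; columns = responses a1b1, a1b2, a2b1, a2b2
C = np.array([[180,  20,  40,  10],
              [ 30, 160,  10,  50],
              [ 45,   5, 170,  30],
              [ 10,  40,  25, 175]], dtype=float)
P = C / C.sum(axis=1, keepdims=True)          # row-wise response probabilities

def marginal_d_and_c(P, level_B):
    # Marginal d' (Eq. 22) and c (Eq. 25) for component A at level_B (1 or 2).
    j = level_B - 1
    # Hit rate: respond a2 to A2Bj; false-alarm rate: respond a2 to A1Bj (Eqs. 20-21)
    hit = P[2 + j, 2] + P[2 + j, 3]
    fa = P[0 + j, 2] + P[0 + j, 3]
    d_prime = norm.ppf(hit) - norm.ppf(fa)
    # Eq. 25: criterion relative to the A2Bj distribution, from P(a1 | A2Bj)
    c = norm.ppf(1.0 - hit)
    return d_prime, c

print(marginal_d_and_c(P, 1), marginal_d_and_c(P, 2))

The two d′ (or c) values can then be compared with the Z statistic of Eqs. 23 and 24 (or Eqs. 26 and 27).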
micro-analyses
Macro-analyses focus on properties of the entire stimulus ensemble. In contrast, micro-analyses test assumptions about perceptual independence and decisional separability by examining summary statistics computed for only one or two stimuli.
The most widely used test of perceptual independence is via sampling independence, which holds when the probability of reporting a combination of components P(aibj) equals the product of the probabilities of reporting each component alone, P(ai)P(bj). For example, sampling independence holds for stimulus A1B1 if and only if
P(a1b1 | A1B1) = P(a1 | A1B1) × P(b1 | A1B1) = [P(a1b1 | A1B1) + P(a1b2 | A1B1)] × [P(a1b1 | A1B1) + P(a2b1 | A1B1)] (28)
Sampling independence provides a strong test of perceptual independence if decisional separability holds. In fact, if decisional separability holds on both dimensions, then sampling independence holds if and only if perceptual independence holds (Ashby & Townsend, 1986). Figure 2.4A gives an intuitive illustration of this theoretical result. Two cases are presented in which decisional separability holds on both dimensions and the decision bounds cross at the mean of the perceptual distribution. In the distribution to the left, perceptual independence holds and it is easy to see that all four responses are equally likely. Thus, the volume of this bivariate normal distribution in response region R4 = a2b2 is 0.25. It is also easy to see that half of each marginal distribution lies above its relevant decision criterion (i.e., the two shaded regions), so P(a2) = P(b2) = 0.5. As a result, sampling independence is satisfied since P(a2b2) = P(a2) × P(b2). It turns out that this relation holds regardless of where the bounds are placed, as long as they remain perpendicular to the dimension that they divide. The distribution to the right of Figure 2.4A has the same variances as the previous distribution, and, therefore, the same marginal response proportions for a2 and b2. However, in this case, the covariance is larger than zero and it is clear that P(a2b2) > 0.25.
Fig. 2.4 Diagram explaining the relation between micro-analytic summary statistics and the concepts of perceptual independence and decisional separability. Panel A focuses on sampling independence and Panel B on conditional signal detection measures.
Perceptual independence can also be assessed through discriminability and criterion measures computed for one dimension conditioned on the perceived value on the other dimension. Figure 2.4B shows the perceptual distributions of two stimuli that share the same level of component B (i.e., B1) and have the same perceptual mean on dimension X2. The decision bound perpendicular to X2 separates the perceptual plane into two regions: percepts falling in the upper region elicit an incorrect response on component B (i.e., a miss for B), whereas percepts falling in the lower region elicit a correct B response (i.e., a hit). The bottom of the figure shows the marginal distribution for each stimulus conditioned on whether B is a hit or a miss. When perceptual independence holds, as is the case for the stimulus to the left, these conditional distributions have the same mean. On the other hand, when perceptual independence does not hold, as is the case for the stimulus to the right, the conditional distributions have different means, which is reflected in different d′ and c values depending on whether there is a hit or a miss on B. If decisional separability holds, differences in the conditional d′s and cs are evidence of violations of perceptual independence (Kadlec & Townsend, 1992a, 1992b).
Conditional d′ and c values can be computed from hit and false alarm rates for two stimuli differing in one dimension, conditioned on the reported level of the second dimension. For example, for the pair A1B1 and A2B1, conditioned on a hit on B, the hit rate for A is P(a1b1 | A1B1) and the false alarm rate is P(a1b1 | A2B1). Conditioned on a miss on B, the hit rate for A is P(a1b2 | A1B1) and the false alarm rate is P(a1b2 | A2B1). These values are used as input to Eqs. 22 to 27 to reach a statistical conclusion.
Note that if perceptual independence and decisional separability both hold, then the tests based on sampling independence and equal conditional d′ and c should lead to the same conclusion. If only one of these two tests holds and the other fails, this indicates a violation of decisional separability (Kadlec & Townsend, 1992a, 1992b).
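Sampling independence is equally easy to check from a confusion matrix. The sketch below is ours, using hypothetical response proportions for a single stimulus; it compares the observed probability of a joint response with the product of the marginal report probabilities, as in Eq. 28:

import numpy as np

# Response proportions to stimulus A1B1, in the order a1b1, a1b2, a2b1, a2b2 (hypothetical)
p = np.array([0.55, 0.15, 0.15, 0.15])

observed_a1b1 = p[0]
p_a1 = p[0] + p[1]                 # probability of reporting A at level 1
p_b1 = p[0] + p[2]                 # probability of reporting B at level 1
expected_a1b1 = p_a1 * p_b1        # Eq. 28 prediction under sampling independence
print(observed_a1b1, expected_a1b1)

A z test for the difference between the observed and expected proportions (as reported in Table 2.2) then decides whether the discrepancy is statistically reliable.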
An Empirical Example
In this section we show with a concrete example how to analyze the data from an identification experiment using GRT. We will first analyze the data by fitting GRT models to the identification confusion matrix, and then we will conduct summary statistics analyses on the same data. Finally, we will compare the results from the two separate analyses. Imagine that you are a researcher interested in how the age and gender of faces interact during face recognition. You run an experiment in which subjects must identify four stimuli, the combination of two levels of age (teen and adult) and two levels of gender (male and female). Each stimulus is presented 250 times, for a total of 1,000 trials in the whole experiment. The data to be analyzed are summarized in the confusion matrix displayed in Table 2.1. These data were generated by random sampling from the model shown in Figure 2.5A.
The advantage of generating artificial data from this model is that we know in advance what conclusions should be reached by our analyses. For example, note that decisional separability holds in the Figure 2.5A model. Also, because the distance between the "male" and "female" distributions is larger for "adult" than for "teen," gender is not perceptually separable from age. In contrast, the "adult" and "teen" marginal distributions are the same across levels of gender, so age is perceptually separable from gender. Finally, because all distributions show a positive correlation, perceptual independence is violated for all stimuli.
A hierarchy of models were fit to the data in Table 2.1 using maximum likelihood estimation (as in Ashby et al., 2001; Thomas, 2001). Because there are only 12 degrees of freedom in the data, some parameters were fixed for all models. Specifically, all variances were assumed to be equal to one and decisional separability was assumed for both dimensions. Figure 2.5C shows the hierarchy of models used for the analysis, together with the number of free parameters m for each of them. In this figure, PS stands for perceptual separability, PI for perceptual independence, DS for decisional separability and 1_RHO describes a model with a single correlation parameter for all distributions. Note that several other models could be tested, depending on specific research goals and hypotheses, or on the results from summary statistics analysis.
The arrows in Figure 2.5C connect models that are nested within each other. The result of likelihood ratio tests comparing such nested models are displayed next to each arrow, with an asterisk representing significantly better fit for the more general model (lower in the hierarchy) and n.s. representing a nonsignificant difference in fit. Starting at the top of the hierarchy, it
Fig. 2.5 Results of analyzing the data in Table 2.1 with GRT. Panel A shows the GRT model that was used to generate the data. Panel B shows the recovered model from the model fitting and selection process. Panel C shows the hierarchy of models used for the analysis and the number of free parameters (m) in each. PI stands for perceptual independence, PS for perceptual separability, DS for decisional separability and 1_RHO for a single correlation in all distributions. (Panels A and B plot the Gender × Age perceptual distributions; the Panel C hierarchy, from top to bottom, contains: {PI, PS, DS} m = 4; {PI, PS(Gender), DS} m = 6; {PI, PS(Age), DS} m = 6; {1_RHO, PS, DS} m = 5; {PI, DS} m = 8; {1_RHO, PS(Gender), DS} m = 7; {1_RHO, PS(Age), DS} m = 7; {PS, DS} m = 8; {1_RHO, DS} m = 9; {PS(Gender), DS} m = 10; {PS(Age), DS} m = 10.)
Table 2.1. Data from a simulated identification experiment with four face stimuli, created by factorially combining two levels of gender (male and female) and two levels of age (teen and adult).

                            Response
Stimulus        Male/Teen   Female/Teen   Male/Adult   Female/Adult
Male/Teen          140           36            34            40
Female/Teen         89           91             4            66
Male/Adult          85            5            90            70
Female/Adult        20           59             8           163
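As a check on the summary statistics reported below in Table 2.2, the expected and observed proportions for the sampling-independence test can be reproduced directly from Table 2.1. A small sketch (ours):

import numpy as np

# Rows and columns ordered as in Table 2.1
C = np.array([[140, 36, 34, 40],
              [ 89, 91,  4, 66],
              [ 85,  5, 90, 70],
              [ 20, 59,  8, 163]], dtype=float)
P = C / C.sum(axis=1, keepdims=True)

# Sampling independence for stimulus Male/Teen and response Male/Teen (Eq. 28)
p = P[0]
p_male = p[0] + p[2]      # probability of reporting "male" (Male/Teen or Male/Adult)
p_teen = p[0] + p[1]      # probability of reporting "teen" (Male/Teen or Female/Teen)
print(round(p_male * p_teen, 2), round(p[0], 2))   # expected 0.49 vs. observed 0.56

The macro-analytic entries (e.g., the marginal d′ for Gender at each level of Age) follow in the same way from the marginal hit and false-alarm rates of Eqs. 20 to 22.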
Table 2.2. Results of the summary statistics analysis for the simulated Gender × Age identification experiment.

Macroanalyses

Marginal Response Invariance
Test                                        Result                Conclusion
Equal P(Gender=Male) across all Ages        z = −0.09, p > .1     Yes
Equal P(Gender=Female) across all Ages      z = −7.12, p < .001   No
Equal P(Age=Teen) across all Genders        z = −0.39, p > .1     Yes
Equal P(Age=Adult) across all Genders       z = −1.04, p > .1     Yes

Marginal d′
Test                                  d′ for level 1   d′ for level 2   Result                Conclusion
Equal d′ for Gender across all Ages        0.84             1.74        z = −5.09, p < .001   No
Equal d′ for Age across all Genders        0.89             1.06        z = −1.01, p > .1     Yes

Marginal c
Test                                  c for level 1    c for level 2    Result                Conclusion
Equal c for Gender across all Ages        −0.33            −1.22        z = 6.72, p < .001    No
Equal c for Age across all Genders        −0.36            −0.48        z = 1.04, p > .1      Yes

Microanalyses

Sampling Independence
Stimulus       Response        Expected Proportion   Observed Proportion   Result                Conclusion
Male/Teen      Male/Teen              0.49                  0.56           z = 1.57, p > .1      Yes
Male/Teen      Female/Teen            0.21                  0.14           z = −2.05, p < .05    No
Male/Teen      Male/Adult             0.21                  0.14           z = −2.24, p < .05    No
Male/Teen      Female/Adult           0.09                  0.16           z = 2.29, p < .05     No
Female/Teen    Male/Teen              0.27                  0.36           z = 2.14, p < .05     No
Female/Teen    Female/Teen            0.45                  0.36           z = −2.01, p < .05    No
Female/Teen    Male/Adult             0.23                  0.02           z = −7.80, p < .001   No
Female/Teen    Female/Adult           0.39                  0.26           z = −3.13, p < .01    No
Male/Adult     Male/Teen              0.25                  0.34           z = 2.17, p < .05     No
Male/Adult     Female/Teen            0.11                  0.02           z = −4.09, p < .001   No
Male/Adult     Male/Adult             0.21                  0.36           z = 3.77, p < .001    No
Male/Adult     Female/Adult           0.09                  0.28           z = 5.64, p < .001    No
Female/Adult   Male/Teen              0.04                  0.08           z = 2.15, p < .05     No
Female/Adult   Female/Teen            0.28                  0.24           z = −1.14, p > .1     Yes
Female/Adult   Male/Adult             0.10                  0.03           z = −3.07, p < .01    No
Female/Adult   Female/Adult           0.79                  0.65           z = −3.44, p < .001   No

Conditional d′
Test                                    d′ | Hit   d′ | Miss   Result                Conclusion
Equal d′ for Gender when Age=Teen         0.84       1.48      z = −2.04, p < .05    No
Equal d′ for Gender when Age=Adult        1.83       2.26      z = −1.27, p > .1     Yes
Equal d′ for Age when Gender=Male         0.89       1.44      z = −1.80, p > .05    Yes
Equal d′ for Age when Gender=Female       0.83       1.15      z = −0.70, p > .1     Yes
Table 2.2. Continued

Conditional c
Test                                    c | Hit    c | Miss   Result                Conclusion
Equal c for Gender when Age=Teen         −0.014     −1.58     z = 6.03, p < .001    No
Equal c for Gender when Age=Adult        −1.68      −0.66     z = −4.50, p < .001   No
Equal c for Age when Gender=Male         −0.04      −1.50     z = 6.05, p < .001    No
Equal c for Age when Gender=Female       −0.63       0.57     z = −4.46, p < .001   No