Research Methods in Psycholinguistics and the Neurobiology of Language

Guides to Research Methods in Language and Linguistics
Series Editor: Li Wei, Centre for Applied Linguistics, University College London

The science of language encompasses a truly interdisciplinary field of research, with a wide range of focuses, approaches, and objectives. While linguistics has its own traditional approaches, a variety of other intellectual disciplines have contributed methodological perspectives that enrich the field as a whole. As a result, linguistics now draws on state‐of‐the‐art work from such fields as computer science, biology, neuroscience and cognitive science, sociology, music, philosophy, and anthropology. The interdisciplinary nature of the field presents both challenges and opportunities to students who must understand a variety of evolving research skills and methods. The Guides to Research Methods in Language and Linguistics addresses these skills in a systematic way for advanced students and beginning researchers in language science. The books in this series focus especially on the relationships between theory, methods, and data—the understanding of which is fundamental to the successful completion of research projects and the advancement of knowledge.

1. The Blackwell Guide to Research Methods in Bilingualism and Multilingualism. Edited by Li Wei and Melissa G. Moyer
2. Research Methods in Child Language: A Practical Guide. Edited by Erika Hoff
3. Research Methods in Second Language Acquisition: A Practical Guide. Edited by Susan M. Gass and Alison Mackey
4. Research Methods in Clinical Linguistics and Phonetics: A Practical Guide. Edited by Nicole Müller and Martin J. Ball
5. Research Methods in Sociolinguistics: A Practical Guide. Edited by Janet Holmes and Kirk Hazen
6. Research Methods in Sign Language Studies: A Practical Guide. Edited by Eleni Orfanidou, Bencie Woll, and Gary Morgan
7. Research Methods in Language Policy and Planning: A Practical Guide. Edited by Francis Hult and David Cassels Johnson
8. Research Methods in Intercultural Communication: A Practical Guide. Edited by Zhu Hua
9. Research Methods in Psycholinguistics and the Neurobiology of Language: A Practical Guide. Edited by Annette M. B. de Groot and Peter Hagoort

Research Methods in Psycholinguistics and the Neurobiology of Language

A Practical Guide

Edited by Annette M. B. de Groot and Peter Hagoort

This edition first published 2018
© 2018 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Annette M. B. de Groot and Peter Hagoort to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication data applied for

Hardback: 9781119109846
Paperback: 9781119109853

Cover image: (Figure) Designed by Hartebeest, Nijmegen, The Netherlands
Cover design by Wiley

Set in 10/12pt Sabon by SPi Global, Pondicherry, India

10 9 8 7 6 5 4 3 2 1

Contents

List of Figures vii
List of Tables ix
Notes on Contributors x
Preface xvi
1 Habituation Techniques 1
Christopher T. Fennell
2 Visual Preference Techniques 18
Roberta Michnick Golinkoff, Melanie Soderstrom, Dilara Deniz Can, and Kathy Hirsh‐Pasek
3 Assessing Receptive and Expressive Vocabulary in Child Language 40
Virginia A. Marchman and Philip S. Dale
4 Eye‐Movement Tracking During Reading 68
Reinhold Kliegl and Jochen Laubrock
5 The Visual World Paradigm 89
Anne Pier Salverda and Michael K. Tanenhaus
6 Word Priming and Interference Paradigms 111
Zeshu Shao and Antje S. Meyer
7 Structural Priming 130
Holly P. Branigan and Catriona L. Gibb
8 Conversation Analysis 151
Elliott M. Hoey and Kobin H. Kendrick
9 Virtual Reality 174
Daniel Casasanto and Kyle M. Jasmin
10 Studying Psycholinguistics out of the Lab 190
Laura J. Speed, Ewelina Wnuk, and Asifa Majid
11 Computational Modeling 208
Ping Li and Xiaowei Zhao

12 Corpus Linguistics 230
Marc Brysbaert, Paweł Mandera, and Emmanuel Keuleers
13 Electrophysiological Methods 247
Joost Rommers and Kara D. Federmeier
14 Hemodynamic Methods: fMRI and fNIRS 266
Roel M. Willems and Alejandrina Cristia
15 Structural Neuroimaging 288
Stephanie J. Forkel and Marco Catani
16 Lesion Studies 310
Juliana V. Baldo and Nina F. Dronkers
17 Molecular Genetic Methods 330
Carolien G. F. de Kovel and Simon E. Fisher
Index 354

List of Figures

Figure 1.1 Examples of various infant language habituation tasks 5
Figure 1.2 Mean looking times across various trial types in Fennell and Byers‐Heinlein (2014) 12
Figure 2.1 The Intermodal Preferential Looking Paradigm 22
Figure 2.2 Means of single longest look in seconds to infant‐directed (IDS) and adult‐directed (ADS) speech stimuli 25
Figure 2.3 The Interactive Intermodal Preferential Looking Paradigm 26
Figure 2.4 Visual fixation to original label, new label, and recovery trials by condition 28
Figure 2.5 Eye gaze shifts toward and away from target in looking‐while‐listening task by age 30
Figure 2.6 The Headturn Preference Procedure 33
Figure 4.1 Typical eye tracker set up 71
Figure 4.2 Illustration of the gaze‐contingent moving‐window (top) and boundary (bottom) paradigms 73
Figure 4.3 Velocity‐based saccade detection 75
Figure 4.4 Determination of word boundaries with PRAAT 80
Figure 4.5 Main effect of eye‐voice span and its interaction with predictability 81
Figure 5.1 Example of a screen‐based visual world paradigm experimental set up 90
Figure 5.2 Example visual display modeled after Altmann and Kamide (1999) 91
Figure 5.3 Timing of target fixations for each trial, for one participant and fixation proportions computed for same data 100
Figure 5.4 Proportion of fixations over time (from target‐word onset) to target (goat), cohort competitor (goal), and distractor in neutral and constraining verb conditions in Experiment 1 in Dahan and Tanenhaus (2004) 104
Figure 6.1 An illustration of the trial structure in Meyer and Schvaneveldt (1971) 113
Figure 6.2 An illustration of the prime‐target pairs used in Glaser and Düngelhoff (1984) 114
Figure 6.3 Results obtained by Glaser and Düngelhoff (1984) 115
Figure 6.4 Illustration of trial structures in the masked and unmasked conditions in de Wit and Kinoshita (2015) 119

Figure 7.1 Example trial in a picture‐matching comprehension priming paradigm 138
Figure 7.2 Example trial in a picture‐matching and picture‐description production priming paradigm 140
Figure 7.3 Example trial in a sentence recall production priming paradigm 142
Figure 10.1 Comparison of cut and break verbs in Chontal, Hindi, and Jalonke 195
Figure 11.1 The basic architecture of a Simple Recurrent Network (SRN) 213
Figure 11.2 A sketch of the probabilistic model that incorporates distributional statistics from cross‐situational observation and prosodic and attentional highlights from social gating 219
Figure 11.3 A sketch of the DevLex‐II model 221
Figure 11.4 Vocabulary spurt simulated by DevLex‐II (591 target words) 223
Figure 13.1 Idealized example of an event‐related potential waveform in response to a visual stimulus, with labeled positive and negative peaks 248
Figure 13.2 Grand average ERPs from three parietal channels, elicited by the final words in the three conditions 257
Figure 13.3 Simulated EEG data illustrating the difference between ERPs and time‐frequency analyses in their sensitivity to phase‐locked (evoked) and non‐phase‐locked (induced) activity 260
Figure 14.1 An anatomical scan of the head and the brain (A), and functional MRI images (B) 269
Figure 14.2 Example of an idealized BOLD curve, sometimes called the hemodynamic response function (HRF) 271
Figure 14.3 A statistical map overlaid on an anatomical brain scan 276
Figure 14.4 Image of a 5‐month‐old infant wearing a fNIRS cap, including a schematic illustration of the path of light between a source (star) and a detector (circle), through the scalp (dashed line) and cortical tissue (in gray) 278
Figure 14.5 Sample of signal in fNIRS studies 280
Figure 15.1 Imaging of an acute patient presenting with anomia following left inferior parietal and frontal lobe stroke 293
Figure 15.2 Lesion mapping based on T1‐weighted data (A), on a diffusion tractography atlas (B), and an example of extracting tract‐based measurements from tractography (C) 299
Figure 15.3 Anatomical variability in perisylvian white matter anatomy and its relation to post‐stroke language recovery 302
Figure 16.1 A schematic illustration showing the steps involved in a VLSM analysis 317
Figure 16.2 Overlay of patients' lesions 320
Figure 16.3 Power analysis map showing the degree of power in our sample, given a medium effect size and alpha set at p < .05 321
Figure 16.4 VLSM results showing neural correlates of auditory word recognition with varying levels of correction 322
Figure 17.1 Transmission of DNA between generations 332
Figure 17.2 Visualization of Sanger sequencing results 338
Figure 17.3 Next generation sequencing 339
Figure 17.4 Visualization of SNP‐chip results 340

List of Tables

Table 1.1 Mock habituation data from four experiments with looking time as the dependent variable 8
Table 1.2 Steps in data collection and analyses 9
Table 2.1 Visual and linguistic stimuli used to teach two novel words in either infant‐directed or adult‐directed speech 24
Table 2.2 Ten- to 12‐month‐old infants saw two types of discrimination trials, one to test for path discrimination and one for actor discrimination 31
Table 3.1 Overview of instruments/analysis tools for studying vocabulary development in children 45
Table 3.2 Example transcript from CHILDES 48
Table 4.1 Definitions of location and duration eye‐tracking measures 77
Table 4.2 Practical issues related to eye‐tracking during reading 82
Table 7.1 Example structural alternations studied in structural priming experiments 134
Table 7.2 Stimulus materials for a hypothetical small clause study 144
Table 7.3 Hypothetical results for a small clause study 145
Table 8.1 Questions and assessments from Extracts 8.1 to 8.3 161
Table 12.1 Excerpt from the SUBTLEX‐US database for the word "appalled" 234
Table 12.2 Stimuli used in a semantic priming experiment by de Mornay Davies (1998) 239
Table 17.1 Example of genotyping chip results for four individuals and five polymorphisms 340

Notes on Contributors

Juliana V. Baldo is Research Scientist at Veterans Affairs Northern California Health Care System and Adjunct Professor of Psychology at California State University East Bay. She specializes in research related to language and neuropsychological disorders arising from brain injury, including both stroke and traumatic brain injury. Dr. Baldo has also published a number of articles on language impairments in aphasia and associated cognitive deficits, and has utilized various brain imaging methodologies to better understand the neural basis of these impairments.

Holly P. Branigan is Professor of Psychology of Language and Cognition at the University of Edinburgh. Her research uses a wide range of experimental psycholinguistic methods to investigate language production in monologue and dialogue, with a particular focus on syntactic processing and representation in adults and in typically and atypically developing children.

Marc Brysbaert is Professor of Psychology at Ghent University. In recent years his word recognition research has shifted to big data, including the calculation and validation of improved word frequency measures, running megastudies to establish word processing times, collecting subjective measures of word features (concreteness, valence, arousal, age‐of‐acquisition), and investigating the use and validation of semantic vectors.

Daniel Casasanto is Associate Professor of Human Development and Psychology at Cornell University, Ithaca, NY. He studies how physical and social experiences shape our brains and minds.

Marco Catani holds a joint affiliation as clinical senior lecturer and honorary consultant psychiatrist with the Department of Forensic and Neurodevelopmental Sciences and the Department of Neuroimaging at King's College London. He studies the lateralization of human brain networks and their implications for post‐stroke recovery from aphasia and neglect.

Alejandrina Cristia is a researcher at the Centre National de la Recherche Scientifique, affiliated with the Laboratoire de Sciences Cognitives et Psycholinguistique (ENS, EHESS, CNRS), Département d'Etudes Cognitives, Ecole Normale Supérieure, PSL Research University. She studies early language acquisition.

Philip S. Dale is Professor Emeritus of Speech & Hearing Sciences at the University of New Mexico and Visiting Professor at King’s College London. He is a co‐developer of the MacArthur‐Bates Communicative Development Inventories. He has conducted research on the assessment, genetic and environmental causes, and consequences of individual differences in early language development, with a special interest in late talkers. He also conducted research on the effectiveness of intervention programs for young children.

Annette M. B. de Groot is Professor of Psycholinguistics at the University of Amsterdam. Her early research focused on priming effects on word recognition, the structure of the mental lexicon, and the psychology of reading and spelling. Later her research shifted toward bilingualism and multilingualism, studying bilingual word recognition and word production, foreign‐language vocabulary acquisition, translation and simultaneous interpreting, and the influence of bilingualism on various aspects of verbal and non‐verbal cognition.

Carolien G. F. de Kovel is a researcher at the Max Planck Institute for Psycholinguistics in Nijmegen, the Netherlands. She studies the genetic background of lateralization in humans. Carolien received her PhD in Biology at the University of Utrecht, the Netherlands.

Dilara Deniz Can is a scientist/practitioner who has obtained her PhD and Educational Specialist degrees from the University of Delaware in School Psychology. She completed a post‐doctoral research fellowship at the University of Washington’s Institute for Learning and Brain Sciences working with vulnerable children ages 3 to 5, studying the links between brain, environment, and language development. She has worked as a school psychologist in public schools of WA State, completing psycho‐educational evaluations for young children and adolescents.

Nina F. Dronkers is a VA Research Career Scientist and Director of the Center for Aphasia and Related Disorders with the Department of Veterans Affairs Northern California Health Care System. She is also an Adjunct Professor at the University of California, Davis in the Department of Neurology. She received her interdisciplinary Ph.D. degree in Neuropsychology from the University of California, Berkeley, and has since used novel techniques to identify new brain structures that play critical roles in the processing of speech and language, and studies how these relate to other cognitive skills.

Kara D. Federmeier is a Professor in the Department of Psychology, the Program in Neuroscience, and the Beckman Institute for Advanced Science and Technology at the University of Illinois. Her research uses event‐related potentials, EEG, and eye tracking to understand the mechanisms involved in language comprehension and meaning processing, the nature of hemispheric differences in cognitive processing, the impact of age‐related changes on language and memory functioning, and the effects of literacy on cognitive processing in adulthood.

Christopher T. Fennell is an Associate Professor at the University of Ottawa and Director of the Language Development Lab. His research focuses on speech perception, phonological development, and lexical acquisition in monolingual and bilingual infants. He has published numerous articles on infant language development, many using habituation methods, in journals such as Child Development, Developmental Science, Infancy, and Bilingualism: Language and Cognition.

Simon E. Fisher is a Director of the Max Planck Institute for Psycholinguistics and Professor of Language and Genetics at the Donders Institute, Radboud University, in Nijmegen, the Netherlands. He obtained a Natural Sciences degree from Cambridge University, UK, followed by a DPhil in Human Genetics at Oxford University, UK. His research uses genes as molecular windows into the basis of human cognitive traits, with a particular focus on speech, language, and reading skills.

Stephanie J. Forkel is a senior neuroimaging research scientist in the Department of Neuroimaging at King's College London. She investigates the lateralization of human brain networks and their implications for post‐stroke recovery from aphasia and neglect.

Catriona L. Gibb was a PhD student in the Psychology Department at the University of Edinburgh. Her research used experimental methods to study the psycholinguistics of bilingualism, most recently focusing on the nature of syntactic processing and syntactic representation in early and late bilinguals.

Roberta Michnick Golinkoff is Unidel H. Rodney Sharp Professor at the University of Delaware. She has received numerous awards for her research and her dissemination work. Funded by federal agencies, she has over 150 publications, as well as 16 books and monographs. Passionate about the dissemination of psychological science for improving our schools and families' lives, her latest book is Becoming Brilliant: What Science Tells Us About Raising Successful Children (APA Press).

Peter Hagoort is Academy Professor of the Royal Netherlands Academy of Arts and Sciences, and Professor of Cognitive Neuroscience at Radboud University. He is a Director of the Max Planck Institute for Psycholinguistics and of the Donders Institute for Brain, Cognition, and Behaviour. His research focuses on the neurobiological infrastructure for language with the help of advanced neuroimaging methods such as fMRI, MEG, and TMS.

Kathy Hirsh‐Pasek is the Stanley and Debra Lefkowitz Faculty Fellow in the Department of Psychology at Temple University and a Senior Fellow at the Brookings Institution. Author of 14 books and hundreds of publications, she is the recipient of numerous awards, is President of the International Society for Infant Studies, and served as an Associate Editor of Child Development. An expert in early learning (language, literacy, STEM), she is dedicated to translating basic science for public consumption.

Elliott M. Hoey is a PhD student in the Language and Cognition Department at the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands. He uses conversation analytic and interactional linguistic methods to study the multimodal constitution of mundane social settings. His recent research has addressed the interactional uses of sighing and drinking, and participants' conduct during extended silences in conversation.

Kyle M. Jasmin is a postdoctoral researcher at University College London. He studies the cognitive neuroscience of language and communication, in typical and atypical populations.

Kobin H. Kendrick is a Lecturer in the Department of Language and Linguistic Science at the University of York. His research uses conversation analysis to investigate basic organizations of talk‐in‐interaction such as turn‐taking, action‐sequencing, and repair. A recent line of research has examined the multimodal practices that participants in interaction use to “recruit” others to assist them with troubles that emerge in everyday activities.

Emmanuel Keuleers is an Assistant Professor in the Department of Communication and Information Sciences at Tilburg University, the Netherlands. He has done extensive research on visual word recognition and computational modeling of morphology. In his current research he is particularly interested in effects of age and multilingualism on vocabulary growth and in the application and interpretation of crowd‐based lexical measures to language processing.

Reinhold Kliegl is Professor of Psychology, Department of Psychology, University of Potsdam, Germany. His research focuses on how the dynamics of language‐related, perceptual, and oculomotor processes subserve attentional control, using reading as one experimental venue. He also specializes in applied multivariate statistics, especially linear mixed models. He is an active promoter of Open Science with the Potsdam Mind Research Repository (PMR2; http://read.psych.uni-potsdam.de/pmr2/).

Jochen Laubrock is Senior Research Scientist, Department of Psychology, University of Potsdam, Germany. His research focuses on how perceptual, attentional, and (oculo‐)motor processes interact in the planning of goal‐related behavior. A special interest is in the co‐operation of foveal, parafoveal, and peripheral processing for the control of saccade timing and target selection in reading, related tasks (RAN), and scene perception, and how these processes operating at quite different time‐scales cooperate when reading graphic literature.

Ping Li is Professor of Psychology, Linguistics, and Information Sciences and Technology, Associate Director of the Institute for CyberScience, and Co‐Director of the Center for Brain, Behavior, and Cognition at Pennsylvania State University. He holds a Ph.D. (1990) in psycholinguistics from the University of Leiden. His research is focused on the neurocognitive and computational mechanisms of language acquisition and bilingualism.

Asifa Majid is Professor of Language, Communication, and Cultural Cognition at the Centre for Language Studies, Radboud University Nijmegen and Affiliated Principal Investigator at the Max Planck Institute for Psycholinguistics and Donders Institute for Brain, Cognition, and Behaviour in Nijmegen, The Netherlands. She investigates concepts in language and cognition by conducting cross‐cultural and developmental studies. At the heart of her research program lie the questions: Where do our categories come from, and how widely are they shared across languages and cultures?

Paweł Mandera is a Postdoctoral Researcher at the Department of Experimental Psychology, Ghent University. In his research he brings together methods from computer science and psychology to study how text corpora and other sources of behavioral data can be used to advance our understanding of human language processing.

Virginia A. Marchman is a Research Associate in Psychology at Stanford University and Adjunct Associate Professor in the School of Behavioral and Brain Sciences at the University of Texas at Dallas. She is a member of the Advisory Board of the MacArthur‐Bates Communicative Development Inventories and a contributing member of Wordbank. Her research focuses on the causes and consequences of individual differences in language processing efficiency and vocabulary development in monolingual and bilingual children.

Antje S. Meyer (PhD Radboud University) is a professor at Radboud University and director at the Max Planck Institute for Psycholinguistics in Nijmegen. Before taking up her appointments in Nijmegen in 2010, she was a professor of psycholinguistics at the University of Birmingham, UK. Meyer has worked on various aspects of psycholinguistics, in particular word and sentence production, dialogue, and the relationship between visual‐conceptual and linguistic processing.

Joost Rommers is a postdoctoral researcher in the Psychology Department and the Beckman Institute for Advanced Science and Technology, University of Illinois. His research uses electrophysiological and eye‐tracking methods to investigate language comprehension and production. One focus concerns the mechanisms and consequences of predicting upcoming language input.

Anne Pier Salverda is a Research Associate in the Department of Brain and Cognitive Sciences at the University of Rochester. He did his graduate work at the Max Planck Institute for Psycholinguistics in Nijmegen. His research focuses on speech perception and spoken‐word recognition.

Zeshu Shao (PhD Radboud University) is a research scientist at the Max Planck Institute for Psycholinguistics in Nijmegen. She has worked on speech production, specifically on the attention control mechanism influencing the planning and production of spoken words in a wide range of populations, and the effects of social network structure on lexical choice.

Melanie Soderstrom is Associate Professor and Associate Head in the Department of Psychology at the University of Manitoba. She has published a number of studies using the Headturn Preference Procedure on infants' sensitivity to the prosodic characteristics of speech and their understanding of grammatical dependencies. She is currently active in initiatives to automate analysis of large scale recordings of children's real‐world language experiences.

Laura J. Speed is Postdoctoral Researcher at the Centre for Language Studies, Radboud University Nijmegen, The Netherlands. She conducts psychological research on the interplay between language and the senses. Her PhD thesis investigated an embodied account of language comprehension: How perception and action systems contribute to the understanding of words and sentences. Her current work focuses on language and olfaction—how we talk about smell and understand language about smell, and how language and information from multiple perceptual modalities can influence odour cognition.

Michael K. Tanenhaus is the Beverly Petterson Bishop and Charles W. Bishop Professor of Brain and Cognitive Sciences at the University of Rochester and a Chair Professor of Nanjing Normal University. His research with the Visual World Paradigm has spanned topics in spoken language processing ranging from speech perception to interactive conversation.

Roel M. Willems is Associate Professor at the Centre for Language Studies and Donders Institute, Radboud University, Nijmegen. He studies the role of mental simulation during narrative comprehension.

Ewelina Wnuk is Postdoctoral Researcher at the Centre for Language Studies, Radboud University Nijmegen. She conducts fieldwork‐based research among the speakers of Maniq—an Austroasiatic language spoken by a group of nomadic hunter‐gatherers in Thailand. Her research interests include semantics, grammar, the relationship between language and culture, and the language of perception. In her recent work, she has been focusing on the language of smell and its relationship to cognition.

Xiaowei Zhao is Associate Professor of Psychology at Emmanuel College, Boston. He holds a B.S. (1998) as well as a Ph.D. (2003) in physics from Nankai University in China. Dr. Zhao works in the field of computational modeling of language development, knowledge representation, and bilingualism. He is the President‐elect of the Society for Computers in Psychology (term 2016‐2017).

Preface

In many respects the human language system is a unique support system for communication and thinking. Ways to investigate this complex cognitive capacity were traditionally restricted to observational and behavioral methods in healthy people and neuropsychological patients with a language disorder. In recent decades this picture has changed dramatically. Partly due to technological developments and partly as a result of developments in other fields of research, methods to study language and communication have seen a vast increase in number and level of sophistication. Due to the technological progress in computing power, we are now able to build far more advanced computational models of language processing than ever before. Thanks to developments in neuroimaging and genetic sequencing, we are able to study the neural basis and the genetic underpinnings of the language‐ready brain in an unprecedented manner.

These developments, however, come at a price. To be able to appreciate research findings or actively participate in this research field, one has to be acutely aware of the ins and outs of the research methods that are currently available. Until now, a volume that summarizes and discusses all available methods in this field of research has been missing. Research Methods in Psycholinguistics and the Neurobiology of Language intends to fill this gap. It provides a comprehensive overview of all relevant methods currently used in research on human language and communication. Some of them have their roots in psycholinguistics; others were introduced from other fields of science, such as the biological sciences. Some require highly specialized technical knowledge and skills, whereas others take little time and effort to learn. For some methods a modest, inexpensive laboratory infrastructure suffices, whereas others depend on equipment that takes millions to acquire and interdisciplinary groups of specialists to operate. Some are offline methods that only measure the outcome of mental processing, whereas others continuously monitor mental processes as they unfold in real time, producing information‐rich and dense datasets. Presenting this diverse collection of methods, we anticipate that this book will be a useful guide for doctoral students, postdocs, and active researchers in our field who want to inform themselves about the basics, advantages, and disadvantages of available research methods, and to get for each of them pointers to additional method‐related information and best‐practice examples.

While conceiving this book we wondered how the great diversity of methods used in the study of language—its acquisition, use, neural and genetic basis, and disorders—could be covered within the limited space available. The solution was not to focus on the specific type of research methods called tasks, of which an innumerable variety exists, but on a broader notion of what research methods are. A task is what participants in an experiment are asked to do, for instance, to name the objects on a set of pictures shown to them. The participants' behavioral and/or brain responses are registered and constitute the database from which the researcher subsequently extracts information.
A research method in the more general sense that we had in mind for this volume is a much broader construct, one that covers a complex of procedures to study the question of interest (e.g., designing a study, constructing stimulus materials, and collecting and analysing the data), and that also includes the technical apparatus, tools, and instruments that support these procedures. Although many methods in this broad sense include data gathering by having participants perform some task, other methods do without this altogether, because the data already exist (corpus linguistics; Chapter 12) or because the method produces artificially generated data (computational modeling; Chapter 11). There are also methods that elicit data from participants without the latter explicitly being asked to perform some task (e.g., the habituation techniques and visual preference techniques presented in Chapters 1 and 2, respectively). Other methods can be combined with a multitude of different tasks (e.g., word priming and interference paradigms, Chapter 6; structural priming, Chapter 7; the electrophysiological and hemodynamic neuroimaging methods presented in Chapters 13 and 14, respectively). All this shows that tasks and methods are not the same things.

A feature that characterizes many methods in the broad sense of the word is that they are domain‐nonspecific. Those developed within psycholinguistics can typically be used in several of its sub‐fields: They are suitable to address questions concerning more than one, or all three, of psycholinguistics' main areas of study (language acquisition, comprehension, and production) and/or to answer questions about multiple linguistic domains (e.g., phonology, morphology, syntax, and semantics). The neurobiological methods included in this volume are even more multipurpose, not being restricted to studying language but domain‐general through and through, also applicable in studying other areas of cognition and other aspects of human (and animal) behavior.

While the majority of the 17 contributions to this book present domain‐nonspecific methods, a couple of them deal with domain‐specific methods: Chapter 3 presents three approved methods for assessing vocabulary in children (language sampling, parent report, and direct assessment); Chapter 4 discusses the ins and outs of the presumably most ecologically valid behavioral research method for examining the reading process: the tracking of eye‐movements; Chapter 8 exclusively deals with conversation analysis. But even these domain‐specific methods allow variability in how they are used and are thus able to inform multiple aspects of language processing. For instance, having the participants read complete paragraphs is what qualifies eye‐movement tracking as an ecologically valid method to study reading, but the stimulus does not need to be a whole paragraph. Sentences, even single words, may also serve as stimuli and, when they do, inform accounts of syntactic parsing, semantic analysis, and word recognition. Similarly, though the primary goal of conversation analysis is to study human social interactions and how people perform actions through talking, the database on which the analyses are done, often a corpus of naturally occurring conversations, contains information on all aspects of the conversational partners' language use and, thus, on phonology, vocabulary, and more.

In addition to guaranteeing a broad coverage of relevant research methods by predominantly selecting domain‐general methods, the volume's coverage was increased yet further by inviting authors to present several related methods within a single chapter, directing the readers to these methods' similarities and differences. For instance, the authors of Chapter 2 contrast multiple conceptually related variants of the visual‐preference technique to study language development in very young children, at an age at which they do not yet produce language or their verbal productions are still incomprehensible. The differences between the various implementations of the general method are often subtle and could easily escape readers if not presented in opposition. Similarly, the authors of Chapter 14 discuss two non‐invasive functional neuroimaging methods, fMRI and fNIRS, that both make use of the fact that neural activity leads to changes in the local cerebral blood flow in the brain and that can both reveal which parts of the brain are activated while participants perform a particular task. Contrasting the pros and cons of these two related techniques within a single chapter will help readers to make a well‐informed choice between the two during the planning of their own research project. Likewise, after detailing the specifics of the EEG/ERP methodology, in which electrical brain activity can be measured with a temporal resolution in the order of milliseconds, the authors of Chapter 13 contrast it with MEG, which provides a record of the magnetic activity of the brain. Chapter 15 differentiates multiple non‐invasive techniques for structural neuroimaging based on MRI, which reveals the neuroanatomy of language with good spatial resolution. Among the presented methods is tractography, a novel technique for visualizing white matter pathways in the living human brain. Chapter 16 also presents various structural neuroimaging methods, but whereas in Chapter 15 the major focus is on the healthy brain, in this contribution the emphasis is on the lesioned brain. Yet another example of a chapter that presents several related methods is Chapter 17, where inter‐individual variability in language skills is linked to genetic variation. The specific method used depends on whether the studied trait is suspected to be monogenic (due to a single genetic variant) or multifactorial (resulting from the combined effects of multiple genes). Finally, the chapters dealing with the Visual World Paradigm (Chapter 5) and priming (Chapters 6 and 7) actually concern families of related methods (e.g., masked priming and cross‐modal priming).

The inevitable consequence of choosing domain‐nonspecific methods as themes for the separate chapters was that ways of organizing them that appeared obvious at first sight turned out to be neither feasible nor appropriate on second thoughts: The chapters could not be organized according to the main areas of language study, input and output modalities, or the various structural subsystems that languages consist of. After all, most of the presented methods are not specifically tied to any such subdivision of study. A presentation according to the type of measures used, behavioral or neurobiological, would be more appropriate and feasible but is complicated by the fact that studies using neurobiological methods generally encompass behavioral measures as well, and the opposite also occurs.
This is shown in many of the chapters, for instance in Chapter 16, where the authors illustrate the "two‐pronged" nature of most lesion studies, which combine structural neuroimaging data and a diversity of behavioral data that index patients' linguistic performance. Another example concerns Chapter 7 on structural priming. Though in its early days this method only involved behavioral measures, it increasingly uses brain measures such as ERPs and the BOLD response that indexes brain activation in fMRI. Still, for most language researchers it makes sense to qualify methods as behavioral or neurobiological (and computational as a third category), so this is how we ordered the chapters, from primarily behavioral (Chapters 1-10) and computational methods (Chapters 11 and 12) to neurobiological methods (Chapters 13-17). But because the partitions between these classes of methods are not clear‐cut and a continued growth in interdisciplinary research will likely result in their further integration, we have decided against explicitly labeling these three subsections in the table of contents.

In the preceding paragraphs almost all chapters have been introduced, however briefly. The exceptions were Chapter 9, on virtual reality, and Chapter 10, which presents ways for studying language outside the laboratory. These chapters were saved for now, where we mention two limitations of many traditional methods for studying language processing: Their ecological validity and external validity are often low; that is, their findings cannot easily be generalized to real‐world settings and to other populations and situations. The main reason why much traditional research lacks ecological validity is that in order to obtain reliable data and make sense of them, strict control over the experimental variables is required. Such control can generally only be secured by using laboratory tasks that are impoverished substitutes of the real phenomena under study, the latter being stripped of many of their essentials, including the context in which they take place. The authors of Chapter 9 show how with virtual‐reality techniques it is possible to realize ecological validity in the laboratory while at the same time controlling numerous experimental variables. The authors of Chapter 10 describe ways to enhance ecological validity and external validity by, for example, taking the experiment out of the laboratory into institutionalized public spaces such as museums, by crowdsourcing data on the internet, or by conducting cross‐cultural fieldwork. But unlike in research that makes use of virtual reality, in such studies maintaining experimental control is a real challenge.

A final feature that characterizes this volume is that many of its chapters contain the same or very similar sections, a result of our instructions to the authors. They were asked to explain the underlying assumptions and rationale of "their" method; to describe the required apparatus, the nature of the stimuli and data, the way the data are collected and analysed; and to discuss the method's strengths and weaknesses in comparison to related methods. We also asked them to illustrate the method with an exemplary study so that the actual research practices and tools could be more vividly pictured, and to provide a glossary for easy accessibility of the method's central concepts and features.
We are confident that the broad collection of research methods presented in this volume is varied enough for all beginning researchers interested in human language processing to find a topic to their liking and get going, and for researchers already active in language studies to become familiar with techniques they have not yet practiced themselves.

Annette M. B. de Groot and Peter Hagoort

1 Habituation Techniques

Christopher T. Fennell

Abstract

This chapter presents the general aspects of the habituation technique. This technique has helped to address various language acquisition questions over the past half‐century. While discussing implementations using different behavioural responses, the chapter focuses on the most common measure of habituation in language acquisition research: looking time (LT). Issues in implementing the method and potential problems are discussed. The simplicity of the habituation procedure in both its design and implementation, along with its long history in the field, makes this method one of the fundamental tools that psycholinguists can use to uncover nascent, emerging, and maturing language skills during infancy and early childhood.

Assumptions and Rationale

One of the biggest challenges of determining what an infant knows about language is actually tied to language itself. Unlike Piaget (1926), who famously asked older children to reflect on and discuss their understanding of the meaning of words, we have no such luxury of interviewing a 12‐month‐old regarding their word‐referent links. Even for developmentally simpler skills, we cannot get a 6‐month‐old to give a simple yes or no answer to the question of whether they discriminate two language sounds. It is somewhat paradoxical that language itself is a barrier to understanding language development in infants. The fact that infants have little or no lexical production requires researchers to often turn to tasks that require no language output from the child. Further, infants' limited motor skills restrict the measures that can reveal underlying linguistic abilities. Tasks must take advantage of gross motor abilities, such as full head turns (fine pointing or manual selection are difficult); congenitally organized behaviors that infants have strong control over since birth, such as looking or sucking; or basic psychophysiological responses, such as heart rate.

One of the most valid and reliable tools we have to examine the perceptual skills related to infant language is the habituation task. Habituation is a decrease in a response to a stimulus after repeated presentations. This produces what is termed the habituation curve, a monotonically decreasing behavior in response to a repeated target stimulus. It is a task with a very long history in our field, stretching back to the nineteenth century (for a review, see Thompson, 2009). Indeed, Thompson highlights that the concept is reflected in antiquity: in Aesop's fables, a fox is quite frightened of a lion upon first meeting him, but becomes less alarmed upon each subsequent viewing. Perhaps Aesop's example was prescient. Habituation tasks were primarily used for decades with animals (and continue to be used with these populations), with everything from amoebas to dogs showing habituation responses (Harris, 1943). Considering the long history of the task and ubiquitous nature of the habituation response across other non‐verbal beings (i.e., animals), it is unsurprising that the method was extended to infants in the early twentieth century (see Humphrey, 1933).

However, simply habituating an infant to a stimulus is necessarily a bit limiting with respect to what one can say about learning. If, for example, an 8‐month‐old had a reduced behavioral response to the repeated presentation of a phoneme, one could argue that they have formed a memory of that particular sound. But it could also be that the infant is simply tiring. The key to demonstrating that the infant has formed a representation of or learned something about the presented stimulus is dishabituation—an increase in behavioral response to a novel stimulus.

Sokolov's (1963) comparator model is the classic formulation of this approach. The infant (or adult) has an orienting response to a novel or unexpected non‐threatening stimulus (e.g., becoming still, looking at the stimulus, reduced heart rate). As it repeats, the infant builds an internal representation of the stimulus. The increasing strength of the representation leads to a greater match between the internal percept and the repeating external stimulus. The initially large orienting response correspondingly reduces as the internal/external match increases. But, if the external stimulus does not match the established internal representation (i.e., a novel stimulus), the infant's orienting response should reoccur.
Thus, habituation is one of the optimal tasks for testing pre‐verbal infants as it does not rely on overt productions, but rather on implicit cognitive measures such as those mentioned earlier (e.g., looking time, sucking, heart rate, among others). Further, based on the comparator model, it allows researchers to determine the nature of infants' percepts and concepts by testing differing levels of novelty from the habituated stimulus (e.g., changing a habituated word form by one phoneme, or multiple sound changes). If the infants' behavioral responses increase to the novel stimulus, it can be concluded that they have the ability to differentiate the habituated and novel stimuli. In this regard, habituation is fundamentally a method to index discrimination ability.

Fantz's (1964) article in Science on visual habituation in the human infant broadly introduced using this task with very young participants to psychological researchers. However, it is important to note that previous studies had already used habituation with infants, including studies on auditory habituation. For example, Bartoshuk (1962) demonstrated that newborns habituate to tones and dishabituate to tones of a differing intensity, using heart rate as his dependent measure. Once it was determined that infants could habituate and dishabituate to auditory tones, it was a straight road for researchers to examine language sound (i.e., phoneme) discrimination using similar methods. In one of the seminal works on infant language perception, Eimas, Siqueland, Jusczyk, and Vigorito (1971) used a habituation task with sucking as their measure to investigate 1‐ and 4‐month‐old infants' discrimination of consonants, specifically a voicing contrast. Consonants produced in the same place and manner can differ in the timing of the vibration of the vocal folds. For example, /b/ and /p/ are both produced from the lips and are stops, but they differ in voicing. Vocal cord vibrations occurring approximately 25 ms after the air burst from the mouth (or later) sound like a /p/ to English speakers. Vibrations starting before that mark sound like a /b/. In Eimas et al., infants heard a repeated sound contingent on strong sucks. Once their sucking rate decreased by 20%, a novel stimulus was presented in two experimental conditions. In one condition, the novel sound was from the same phonological category (i.e., a new /b/ sound that differed by 20 ms in voicing from the original stimulus) and in the other condition the novel sound came from a different category (i.e., a 20 ms voicing change that crossed the boundary from /b/ to /p/). In the control condition, the same sound was played after the 20% reduction in behavior. Only infants in the differing category condition had a dishabituation response—increased sucking when the sound change occurred.

The above experiment highlights some important aspects of infant habituation. First, habituation allows the researcher to test categorical perception in that we can determine if an acoustically different stimulus will engender a continued habituated response or a dishabituation response. As Thompson and Spencer (1966) highlight in their classic list of the characteristics of habituation, "habituation of response to a given stimulus exhibits stimulus generalization to other stimuli" (p. 19).
Thus, we can assume that the lack of dishabituation to an acoustically different stimulus means that the infant considered it to fall into the same category, or that the distinction is too weak to detect. This second explanation is unlikely if you include a condition where a difference of similar magnitude elicits a dishabituation response due to its crossing of a category boundary, as in the Eimas et al. work. Second, the use of a criterion equated the processing of the stimuli across infants. Based on Sokolov's (1963) theory, the reduction in the target behavior is commensurate with the increasing robustness of the infant's memory trace for the stimulus. But different infants would potentially, and probably, have differing timing with respect to the building of the stimulus memory trace due to individual differences in cognitive skills, particularly attention. By requiring them to reach the same relative decrease, the researcher can assume that they have reached similar processing levels for the stimulus in question.
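The comparator model's logic can be illustrated with a toy simulation. The Python sketch below is purely illustrative (the numeric stimulus coding, the learning rate, and the function names are our assumptions, not part of Sokolov's formulation), but it shows how a response proportional to the internal/external mismatch yields a monotonically decreasing habituation curve and a rebound, that is, dishabituation, when a novel stimulus appears.

```python
# A minimal sketch of Sokolov's (1963) comparator logic. All numbers
# (learning rate, stimulus codes) are illustrative assumptions.

def orienting_response(internal, stimulus):
    """Response is proportional to the internal/external mismatch."""
    return abs(stimulus - internal)

def simulate(trials, learning_rate=0.5):
    """Present a sequence of numerically coded stimuli and return the
    orienting response on each trial."""
    internal = 0.0  # no representation before the first trial
    responses = []
    for stimulus in trials:
        responses.append(orienting_response(internal, stimulus))
        # Each presentation pulls the internal representation toward
        # the external stimulus (building of the memory trace).
        internal += learning_rate * (stimulus - internal)
    return responses

# Six habituation trials with the same stimulus, then a novel one.
print(simulate([1.0] * 6 + [2.0]))
# Responses shrink monotonically (the habituation curve), then jump
# on the final, novel trial (dishabituation).
```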

Apparatus

Testing infants' language skills via habituation typically involves relatively little technology in comparison to some other methods present in the literature, like event‐related potentials (ERP), functional near infrared spectroscopy (fNIRS), or functional magnetic resonance imaging (fMRI). As such, the necessary apparatus can be implemented relatively quickly and inexpensively in the lab. Primarily, a researcher requires devices to present stimuli in a controlled manner and to measure the target behavior. As the measurements of the latter need to feed back to control the former in habituation, we typically use the same device to do both: habituation software on a lab computer. A widely used freeware program is available for such purposes, called Habit 2 (Oakes, Sperka, & Cantrell, 2015). The program will control stimuli presentation, compute habituation criteria, and accumulate behavioral data. Stimuli are usually played from digitized files on the computer and are sent to the display and speaker in the testing room. The experimenter, who should be blind to the audio stimuli being presented and to whether a trial was a habituation or test trial, remotely monitors the infant's behaviors via key presses.

As alluded to in the section above, many of the early studies in infant habituation used sucking or heart rate as the dependent measure. Measuring infant sucking strength and rates requires the experimenter to have a pressure transducer within a pacifier, and the corresponding connected equipment to measure the output from the transducer. Heart rate measures usually require three electrodes to be placed on the infant's chest and abdomen, with the electrodes again connected to equipment to measure their output. While these measures are still used, the typical behavior being measured in modern habituation research is looking time (LT) to a visual display, even if one is testing habituation/dishabituation of auditory language stimuli. There is a positive relationship between attention to an auditory stimulus and visual fixation (Horowitz, 1975). Unlike the measures above, LT requires nothing to be in physical contact with the infants, which is an advantage. The researcher—appropriately blinded to the condition—simply needs to record, by pressing buttons on a keyboard connected to the same software, where the infant is looking, typically through watching the infant via a closed‐circuit video camera. The use of a video camera also allows for a record that can confirm the real‐time measurements when coded post‐experiment. The ease of measuring looking behavior has led to its wide application.
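To make concrete how the coder's key presses become the dependent measure, the following sketch accumulates total looking time for one trial from a timestamped log of look/look‐away events. It is a hypothetical post‐processing illustration; the event format is our assumption, and programs such as Habit 2 perform this bookkeeping internally in their own way.

```python
# Hypothetical example: computing total looking time (LT) for one
# trial from a coder's timestamped key events. The event format and
# values are assumptions for illustration only.

def total_looking_time(events, trial_end):
    """events: list of (timestamp_s, kind) pairs, with kind 'look'
    when the coder's key goes down (infant looks at the display) and
    'away' when it goes up. Returns accumulated LT in seconds."""
    lt = 0.0
    look_start = None
    for t, kind in events:
        if kind == "look" and look_start is None:
            look_start = t
        elif kind == "away" and look_start is not None:
            lt += t - look_start
            look_start = None
    if look_start is not None:  # infant still looking at trial end
        lt += trial_end - look_start
    return lt

events = [(0.4, "look"), (3.1, "away"), (4.0, "look"), (9.5, "away")]
print(total_looking_time(events, trial_end=10.0))  # 2.7 + 5.5 = 8.2 s
```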

Nature of the Stimuli

Due to the nature of the task, many habituation studies investigating language development have involved basic auditory stimuli, such as changes to the acoustic form of simple syllables. For example, tracking infants' phonetic and phonological development has been the focus of many infant habituation studies, starting with Eimas et al. (1971). Using an example from Polka and Werker (1994), infants in such studies are habituated to one syllable (e.g., /dYt/) and then given a syllable involving a single phoneme change for the novel stimulus at test (e.g., /dut/). If infants dishabituate to the novel syllable, they are able to distinguish the target phonemic contrast. See Figure 1.1 for a visualization of such a study. Such studies contributed to the finding that infants are initially universal listeners, able to distinguish sounds from both their native and from non‐native languages, but then become language‐specific listeners over the first year—failing to dishabituate to non‐native contrasts.

[Figure 1.1 Examples of various infant language habituation tasks: pretest, habituation‐phase, test‐phase (familiar and novel), and posttest trials for speech discrimination, single‐object word learning, and two‐object word learning designs. The two‐object version of the task is known as the Switch task. (See insert for color representation of the figure.)]

Habituation studies examining language development are not limited to simple syllables, however. For example, Mehler, Jusczyk, Lambertz, Halsted, Bertoncini, and Amiel‐Tison (1988) used a habituation method where the target stimuli were narrative auditory passages from rhythmically similar and dissimilar languages (recorded from fluent bilinguals so that the voice did not differ). Using sucking rate as the dependent variable, they showed that infants of 2 months can distinguish their native language from a non‐native language based on their rhythmic class, but not two non‐native languages.

One can even use visual stimuli to demonstrate language discrimination. In a novel twist, Weikum, Vouloumanos, Navarra, Soto‐Faraco, and Sebastián‐Gallés (2007) presented infants with silent video clips of fluent French‐English bilinguals reciting passages in each language. Infants of 4 and 6, but not 8, months dishabituated to French clips after being habituated to English ones, and vice versa. This shows that infants have an early ability to visually discriminate non‐native from native languages before perceptually narrowing to their native language in the visual domain. Interestingly, French‐English bilingual infants, for whom both languages were native, were able to discriminate the languages at the older age.

Finally, rather than focusing on audio or visual stimuli, some habituation experiments explore the connection between the two by pairing objects and word forms during habituation, and then test infants on novel word‐object associations (see Figure 1.1). As such, these studies contribute to a major area of early language development: early word learning. For example, a simple way to invoke word learning is to replace a visual pattern typically used in discrimination studies with an object that affords naming. But infants may succeed by ignoring the object and simply focusing on the change in label during the novel trial in the test phase. Werker, Cohen, Lloyd, Casasola, and Stager (1998) corrected for this by creating an audio‐visual variation of the habituation procedure called the Switch task (see Figure 1.1). Infants are habituated to two word‐object associations (e.g., Object A – Word A; Object B – Word B) and then tested on two trials: a Same trial comprising one of the habituated pairings (e.g., Object A – Word A) and a Switch trial where an incorrect pairing is presented (e.g., Object A – Word B). Importantly, the Switch trial consists of a habituated object and a habituated word, but linked in a novel way. In this manner, infants should only dishabituate if they have appropriately linked the word and object.

Methodological Structure

Now that we have discussed the nature of the work using habituation tasks, we can turn to the typical structure of these tasks. Infant habituation studies in language research typically involve four phases: pretest, habituation, test, and posttest. Figure 1.1 outlines these four phases across three different studies. Each of these phases comprises discrete trials wherein a visual and an audio stimulus are concurrently presented. Trials can be preceded by what is termed an attention‐getter in order to get the infant to orient to the screen. Various attention‐getters have been used in past research. Some examples include: a silent, flashing light; a silent, morphing, colourful shape; and the face of a baby with giggling as the accompanying audio track. Once the infant looks to the screen, the relevant trial commences. Trials can be of fixed length or infant‐controlled. The latter involves setting a criterion via which the trial will end if the infant disengages attention. For example, if an infant looks away from the stimulus for 2 seconds, the trial ends and the next trial commences.
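To make the infant‐controlled trial logic concrete, the following is a minimal sketch of such a trial loop in Python. It assumes a hypothetical gaze_on_screen() sampler (e.g., driven by an experimenter's key press or an eye tracker); none of these names come from Habit or any real package.

```python
# Minimal sketch of an infant-controlled trial (hypothetical, illustrative only).
import time

def run_trial(gaze_on_screen, max_dur_s=20.0, lookaway_end_s=2.0):
    """End the trial once the infant looks away for lookaway_end_s seconds
    (infant-controlled), or at max_dur_s (a fixed-length cap)."""
    start = time.monotonic()
    last = start
    lookaway_start = None
    total_looking = 0.0
    while True:
        now = time.monotonic()
        if gaze_on_screen():                  # infant is looking at the screen
            total_looking += now - last
            lookaway_start = None
        elif lookaway_start is None:          # a look-away just began
            lookaway_start = now
        elif now - lookaway_start >= lookaway_end_s:
            break                             # infant-controlled ending
        if now - start >= max_dur_s:
            break                             # fixed-length cap
        last = now
        time.sleep(0.02)                      # poll gaze at roughly 50 Hz
    return total_looking                      # looking time (LT) for this trial
```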

Pretest

In the pretest phase, the infant experiences a stimulus that is different from the one repeated during the upcoming habituation phase. The reasoning behind this trial is that infants need to become accustomed to the presentation method. Thus, this phase serves as a warm‐up prior to presenting the stimuli of interest in the study.

Habituation

The habituation phase follows and is, of course, key to the experiment. One important point to consider for this phase is the intensity of the audio stimuli. As Thompson and Spencer (1966) highlight in their list of habituation characteristics, "strong stimuli may yield no significant habituation" (p. 19). For example, it would be hard to habituate to a blaring, variable siren. Therefore, the audio stimulus is typically delivered at approximately 65 decibels: loud enough for infants to hear, but not so loud as to prevent habituation. Following similar logic, the visual display that is shown should also be only moderately engaging. Another setting the researcher must decide upon is the habituation criterion. A common criterion is a 50% reduction in LT (Ashmead & Davis, 1996), although some researchers advocate using more stringent criteria with younger infants (e.g., 70%), as they are more cognitively immature and therefore may require more presentations to fully process stimuli (e.g., Flom & Pick, 2012). Another consideration is the window over which the decrease in response is based. If one stimulus is repeated, many researchers opt for a window of three trials. For example, if the infant looks for a total of 50 seconds over the first three trials, looking must fall below 25 seconds in a subsequent window of three back‐to‐back trials to reach habituation. Ashmead and Davis recommended this window size based on their computer modeling, which showed it to be more stable than a window of two trials.

Two other important considerations are related to these windows. First, the researcher can opt for a fixed or a sliding window. A sliding window keeps a running total of LT to determine habituation (e.g., trials 2, 3, and 4 are first compared to trials 1, 2, and 3). A fixed window compares subsequent blocks of three trials to the criterion block (e.g., trials 4, 5, and 6 are first compared to trials 1, 2, and 3). Oakes (2010) recommends using the sliding window whenever possible, as it necessarily leads to shorter experiments on average, and shorter habituation phases should result in less attrition. However, if the infant is being habituated to two types of stimuli (e.g., two word‐referent combinations), the fixed window is necessary, despite the chance of increased attrition, to ensure that infants receive an equal number of examples of each stimulus type during habituation (and the window needs to be increased to four trials: two of each stimulus type per block). The second consideration is whether to base the habituation criterion on the first block of trials, which typically, but not always, has the highest infant behavioral response to the stimuli, or on the block of trials with the highest behavioral response, regardless of when it occurs in the experiment. Most researchers use the first block, as infants may show an increased response in a later block due to a factor unrelated to the habituation curve (e.g., a baby surprises himself with a sneeze and reorients to the stimuli due to increased arousal). Despite the presence of a criterion, the researcher should cap the number of possible trials in an experiment. Without such a cap, the experiment would not end in a reasonable amount of time for some infants, as they would never reach criterion. Dannemiller (1984) recommended a maximum of 15 trials for an infant habituation study; at that number, according to his Monte Carlo modelling, there is a 5% risk that infants habituate by chance. The trade‐off of a small maximum is that fewer infants will reach criterion within the trial limit, whereas increasing the maximum number of trials increases both the chance of random habituation and the chance of attrition. Oakes (2010) recommends piloting infants and/or examining similar studies in the literature to determine the optimal maximum number of trials for a particular study.
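To illustrate the criterion logic, here is a minimal sketch in Python of both window types, assuming per‐trial looking times in seconds and a baseline taken from the first window; the function and its parameters are illustrative and not taken from Habit.

```python
# Minimal sketch of a habituation-criterion check (illustrative only).
def has_habituated(looking_times, window=3, criterion=0.5, sliding=True):
    """True once total looking in a later window of `window` back-to-back
    trials falls to `criterion` (e.g., 50%) of the first (baseline) window."""
    if len(looking_times) < 2 * window:
        return False
    threshold = criterion * sum(looking_times[:window])   # e.g., 25 s if baseline is 50 s
    if sliding:
        # Sliding window: trials 2-4, 3-5, 4-6, ... are each compared to baseline.
        starts = range(1, len(looking_times) - window + 1)
    else:
        # Fixed window: discrete blocks (trials 4-6, 7-9, ...) are compared to baseline.
        starts = range(window, len(looking_times) - window + 1, window)
    return any(sum(looking_times[s:s + window]) <= threshold for s in starts)

# Example: 50 s of looking over trials 1-3 sets a 25 s threshold; trials 4-6
# total 24 s, so the criterion is met under either window type.
print(has_habituated([20.0, 17.0, 13.0, 11.0, 8.0, 5.0]))  # True
```

In practice, the cap on the total number of trials (e.g., 15) would sit around this check, and the baseline could instead be taken from the block with the maximum response.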

Test

The test phase of the experiment should include both the novel stimulus and a repetition of the familiar stimulus from habituation, with the order counterbalanced across participants (e.g., Werker et al., 1998). Why not present only the novel stimulus and compare it to the last habituation block? The major issue with that approach is that behavioral responses in the final block of habituation trials are necessarily low, and may be artificially so (see Cohen, 2004). This could be due to an infant reducing attention to the stimuli for a reason unrelated to habituation, such as a distraction in the room like the parent shifting in their seat. Thus, this comparison may falsely indicate dishabituation. By presenting both the novel stimulus and a repetition of the familiar stimulus, the experimenter can determine whether the infants can detect the difference between the habituated stimuli and something new. Some researchers run this manipulation as a between‐subjects design, but this is not recommended, for both statistical and practical reasons: it introduces more error into the design related to individual differences between the groups, and it requires doubling the number of participants.

Posttest

Finally, the posttest trial is presented last and should be maximally different from the habituation and test trials. If infants are still engaged in the experiment, looking time is expected to recover to near pretest level during this final trial.

Collecting and Analysing Data

As mentioned earlier, the best method to collect the data is via habituation‐specific software, like Habit. The program is designed to continuously compare an infant's behavioral response to the stimulus, in this case LT, in a block of trials to previous blocks to determine when the infant has reached the habituation criterion. Importantly, before delving into statistical analyses of the habituation and test phases, one has to establish the reliability of the experimenter's coding of the target variable (e.g., manual keystrokes in response to LT) if it is not a direct measure (e.g., eye tracker for LT, ECG for heart rate). One standard is for another coder to recode the target trials of 25% of the participants from the video records of those experiments. In this case, the original coding would be considered reliable if a Pearson product‐moment correlation of the two coders' measures is equal to or greater than .95. A more exact method is to have two coders score all the video records using a frame‐by‐frame analysis with freeware such as SuperCoder (Hollich, 2005). One should also report analyses of the habituation phase prior to testing for novelty and familiarity effects. To determine whether infants maintained interest throughout the experiment and recovered from habituation, one possibility is to run a series of planned orthogonal comparisons to first compare pretest to posttest and, if these two trials are found to be the same, to then compare these trials to the last habituation block. If infants were still engaged in the experiment, looking time should recover to near pretest level during the posttest; thus, there should be no significant difference between the pretest and posttest. However, the pretest and posttest should be significantly different from the last habituation block to demonstrate recovery. One can then compare the first habituation block to the last via a paired‐sample t‐test to confirm a significant drop in looking time across the habituation phase. Finally, a full descriptive analysis of the habituation phase should be reported (i.e., mean number of habituation trials, mean looking time during habituation). If there are multiple conditions or groups in the experiment, these habituation variables should be compared in a mixed ANOVA to ensure similar habituation across conditions or groups.

Table 1.1 Mock habituation data from four experiments with looking time as the dependent variable.

Experiment   Habituation Trials   Habituation Time   Familiar Trial   Novel Trial
1            8                    112 s              11.5 s           12.2 s
2            8                    180 s              8.1 s            12.8 s
3            12                   200 s              7.9 s            12.4 s
4            12                   210 s              8.3 s            8.1 s

Note: All data represent the mean.

For example, in Table 1.1, infants in Experiment 1 and those in Experiment 2 have a similar number of habituation trials, but the latter group has significantly more active looking during habituation. This may explain their success in the task (i.e., significantly higher looking times to novel over familiar stimuli). On the other hand, infants across Experiments 3 and 4 have similar habituation scores, but behave differently at test. Thus, these results cannot be attributed to habituation differences. To determine if infants have dishabituated to the novel stimulus at test, researchers should compare the novel test trial to the familiar test trial. As test trial is a within‐subjects variable, a paired‐sample t‐test is often used. If a between‐subjects variable is key to the interpretation of results, a mixed ANOVA would be appropriate. For example, gender is often included in habituation studies testing language skills due to the oft‐reported female advantage for language. Indeed, research involving infants' habituation to word‐object associations (i.e., word learning tasks) has found gender differences in the form of a female advantage (e.g., Fennell, Byers‐Heinlein, & Werker, 2007). Another common between‐subjects variable in infant language research using habituation is age, in order to track developmental changes (e.g., Werker et al., 2002).
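As a minimal illustration of the dishabituation comparison, the sketch below runs a paired‐sample t‐test on fabricated per‐infant looking times using SciPy; the numbers are invented for illustration.

```python
# Minimal sketch of the dishabituation test (fabricated data).
import numpy as np
from scipy.stats import ttest_rel

familiar = np.array([8.1, 7.4, 9.0, 6.8, 8.5, 7.9])    # LT (s) on familiar test trial
novel = np.array([12.3, 11.0, 13.1, 9.9, 12.6, 11.4])  # LT (s) on novel test trial

t, p = ttest_rel(novel, familiar)
print(f"t = {t:.2f}, p = {p:.4f}")  # significant novel > familiar indicates dishabituation
```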

Table 1.2 summarizes all steps in collecting and analysing habituation data that were expanded upon in this section.

Table 1.2 Steps in data collection and analyses (in temporal order).

Real‐Time Habituation Data Collection: To determine if and when habituation occurs, a specialized computer program (e.g., Habit) collects behavioral response data in real time via experimenter input (e.g., button press when infant is looking). It constantly compares blocks of responses to determine habituation.

Objective Data Collection: Optimally, experimenters should simultaneously collect the relevant behavioral response via a completely objective method (e.g., video recording of infant looking).

Coding: 1) Confirm the accuracy of the real‐time habituation data. Code the raw data for a minimum of 25% of the participants. For looking time, one can use video software (e.g., SuperCoder) to examine infants' looking in each video frame. Confirm accuracy via a Pearson product‐moment correlation between the habituation and coded raw measures that is equal to or greater than .95. 2) It is recommended to code all key trials for all participants via an exact measure (e.g., frame‐by‐frame coding) to maximize accuracy.

Disconfirm Fatigue: Run a planned orthogonal comparison on the pretest, posttest, and last habituation block. The pretest and posttest should be equal, confirming that the baby did not become fatigued over the experiment. The posttest should differ from the last habituation block to confirm recovery to a large change.

Confirm Habituation: Compare the first habituation block to the last using a paired‐sample t‐test to confirm a significant drop in behavioral response.

Confirm Dishabituation: Compare the novel test trial to the familiar test trial via a paired‐sample t‐test. Typically, a significant difference will reveal that the novel trial exceeds the familiar; rarely, it is the reverse (see discussion in chapter). If a between‐subjects variable is key to the interpretation of results, use a mixed ANOVA that includes that factor (e.g., gender).
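The reliability check in the coding step is straightforward to compute. Below is a minimal sketch on fabricated per‐trial looking times from two coders, using NumPy's Pearson product‐moment correlation:

```python
# Minimal sketch of the inter-coder reliability standard (fabricated data).
import numpy as np

coder1 = np.array([12.1, 8.4, 15.0, 6.2, 9.8, 11.3])   # original coder's LTs (s)
coder2 = np.array([12.0, 8.6, 14.7, 6.1, 10.1, 11.5])  # second coder's recoded LTs (s)

r = np.corrcoef(coder1, coder2)[0, 1]  # Pearson product-moment correlation
print(f"Pearson r = {r:.3f}")
if r < 0.95:
    print("Coding does not meet the .95 reliability standard; recode or retrain.")
```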

An Exemplary Study

To illustrate the use of the habituation task to investigate early language development, a study that involved multiple conditions and populations will be described. Before turning to that study (Fennell & Byers‐Heinlein, 2014), some background is in order. Using the Switch task, Byers‐Heinlein, Fennell, and Werker (2012) had shown that monolingual and bilingual infants have similar word‐object associative skills, reliably noticing the incorrect links in habituated word‐object pairings around 14 months when the words involved are phonologically distinct (e.g., /lɪf/ and /nim/). In contrast to this finding with dissimilar‐sounding words, Werker, Fennell, Corcoran, and Stager (2002) found that infants growing up in a monolingual environment have difficulty learning similar‐sounding words in the Switch task up until 17 months of age. Fennell, Byers‐Heinlein, and Werker (2007) extended the latter study to bilinguals. Bilinguals of 14, 17, and 20 months received two word‐object pairings during habituation: a multi‐coloured crown object paired with the nonsense word "bih" and a blue‐green molecule object paired with "dih." Unlike monolingual infants in the previous work, bilingual infants did not dishabituate to the switch in the pairings (e.g., crown paired with "dih") until 20 months of age. Thus, it appeared that bilinguals might be "delayed" in incorporating their emerging phonology into word learning, in that they were accepting a similar‐sounding, but incorrect, word as an appropriate label for the object. Fennell et al. argued that ignoring detail might be adaptive for bilinguals (less is more): by incorporating less information into the word form, they could match their monolingual peers on word learning. However, a subsequent study by Mattock, Polka, Rvachew, and Krehm (2010) indicated that there was more to the story. Fennell et al. had used recordings of a monolingual English speaker in their study. Mattock and her colleagues showed that French‐English bilingual infants of 17 months succeeded if given a mix of French and English tokens produced by a French‐English bilingual speaker, whereas monolingual infants failed with these mixed stimuli. Thus, they discovered the inverse of Fennell et al.'s results: bilingual success and monolingual failure. Mattock et al. argued that bilinguals may have enhanced flexibility in their phonological representations, allowing for success with the mixed tokens in the face of monolinguals' failures. However, Mattock et al. did not test bilingual infants on monolingual‐like stimuli. Fennell and Byers‐Heinlein (2014) hypothesized that both previous interpretations may be incorrect. There may be no bilingual delay or advantage in learning similar‐sounding words. Instead, both monolinguals and bilinguals may simply have difficulty with a "non‐native" speaker, as Fennell et al. used a monolingual speaker and Mattock et al. used a bilingual one. Bilingual and monolingual speakers have differing productions, even in the same language (Antoniou, Best, Tyler, & Kroos, 2011). This hypothesis is in line with recent work showing that infants have difficulty with accented speech (Schmale, Hollich, & Seidl, 2011). To determine if our hypothesis was correct, we implemented a fully crossed design where English monolingual and English‐French bilingual infants were tested on monolingual‐ and bilingual‐produced target words. Infants were tested in the standard Switch task (see Figure 1.1).
All trials were fixed at 20 seconds each (i.e., not infant‐controlled). The first trial was a pretest trial consisting of a word‐object pairing (a spinning waterwheel toy paired with /nib/) completely dissimilar to those presented in the habituation phase. The two habituation pairings were the crown object paired with /kεm/ and the molecule object paired with /gεm/. Each pairing was presented twice within a four‐trial block, with no more than three consecutive trials of the same type. When average looking time across a four‐trial block decreased to 65% of the maximum looking time across a previous block, the habituation phase ended. There was a maximum of 24 habituation trials, the typical number for the Switch task. During the test phase, infants received one Same and one Switch trial in one of eight testing orders that counterbalanced trial order (Same‐Switch/Switch‐Same) and the particular pairings presented. The posttest was a repetition of the very dissimilar stimuli presented during the pretest. Infants sat on their parents' laps during testing. All video stimuli appeared on a screen directly in front of the infant and audio stimuli were presented via speakers below the screen at approximately 65 dB. Infant looks were recorded via a hidden video camera below the screen. Videos of all test trials were coded frame‐by‐frame (i.e., each video frame was examined to determine if infants were looking to the target) by an experienced coder, blind to condition, and then a second coder recoded 25% of those trials to ensure high reliability. Sixty‐one infants of 17 months successfully completed the study: 31 monolingual and 30 bilingual. Sixteen infants in each group heard tokens of the target words produced by a bilingual speaker and 14 infants in each group heard tokens produced by a monolingual speaker. An additional 24 infants were tested but not included in the analyses due to fussiness, parental interference, distraction during testing, or being off‐camera during the test trials. This is a normal attrition rate for this age group. The habituation and test trial data are presented in Figure 1.2. The first analysis confirmed that infants across all conditions habituated to the stimuli. A 2 (habituation block: first versus last) × 2 (token: monolingual or bilingual) × 2 (infant: monolingual or bilingual) ANOVA revealed a significant decrease in looking time in the last habituation block compared to the first in all conditions, with no interactions. The number of trials and the total looking time needed to reach the criterion were statistically equivalent across conditions. Thus, any differences at test could not be attributed to differences in exposure during habituation. All groups of infants also showed significant recovery during the posttest as compared to the last habituation block. Thus, infants were not fatigued or generally disinterested in the task. Since all habituation measures were normal, an analysis of the test trials could be undertaken. The key factors in the test trial analysis were the test trials themselves, whether the stimuli matched the infant's language environment (stimuli match; e.g., a monolingual infant hearing monolingual tokens), and the language background of the infant.
A 2 (trial type: Same versus Switch) × 2 (stimuli match: yes versus no) × 2 (infant language background: bilingual versus monolingual) mixed ANOVA revealed a significant main effect of trial type, moderated by a significant interaction between trial type and stimuli match. No other effects were found. Thus, infants' comparative looking across the Same and Switch trials depended on whether the tokens they were hearing matched their learning environment; monolinguals and bilinguals showed the same pattern, as there was no effect of language background. Follow‐up t‐tests comparing the Same and Switch trials in each condition, with the appropriate corrections for multiple tests, revealed that infants only detected the novel pairing (i.e., a change in minimally different labels) if they heard tokens that matched their language background: bilinguals hearing bilingual tokens and monolinguals hearing monolingual tokens. Thus, the relatively simple habituation task revealed that bilingual infants are neither advantaged nor disadvantaged relative to monolinguals, and helped to clarify the phonological development of all infants by showing that the optimal stimuli for learning words are those presented in a manner similar to infants' everyday language environment.

Figure 1.2 Mean looking times across trial types (first and last habituation blocks, pretest, Same, Switch, and posttest trials) for monolingual and bilingual infants hearing monolingual or bilingual tokens in Fennell and Byers‐Heinlein (2014). Source: Fennell and Byers‐Heinlein 2014. Reproduced with permission of SAGE Publications.

Problems and Pitfalls/Advantages and Disadvantages

Some of the advantages of habituation tasks have already been highlighted. The reliability of the method is a key advantage, especially when dealing with a child population. Multiple studies have demonstrated that infants' habituation responses are stable in the short‐term (e.g., Bornstein & Benasich, 1986) and the long‐term (e.g., Miller et al., 1979). As Bornstein and Benasich emphasized, this psychometric reliability gives strong validity to the use of habituation as a methodological tool in infant research (p. 97). The simplicity of the method and the relatively low technology needs make such tasks very easy to implement. For example, if one were testing phoneme discrimination, the use of a habituation technique would involve much less cost than implementing an ERP study, another valid method for determining discrimination skills. This cost imbalance holds both in terms of the technology needed, with ERP equipment (multiple computers, software, EEG caps, etc.) running in the thousands of dollars, and in terms of the time involved (training, coding, analyzing, etc.). Habituation tasks also have a long history both inside and outside of their use with infants. This provides the researcher with a large literature base from which to design and explore relevant tasks, and to interpret possible results. Newer methods, such as fNIRS, do not yet have a rich literature, leading to more guesswork. Another advantage of habituation is its possible use across all ages. For example, the Conditioned Head Turn procedure, another method that has often been used to determine infants' phoneme discrimination skills, has a limited age range. In this technique, infants are trained to turn their head upon detecting a phoneme change. Werker, Polka, and Pegg (1997) suggest that the procedure is optimal for a small developmental window of between 6 and 10 months, as infants below 6 months have limited head control and older infants become too "mobile" to sit through the long task involving training and testing. Habituation can be used throughout the lifespan, from foetus to adult, because it is a basic psychological process and is tailored to the individual's cognitive skills. However, it should be noted that infants do become more restless in a habituation task as they become older and more mobile. The application of a habituation criterion appropriate to the infant's age (based on similar past studies with that age group) and close measurement of their responses should alleviate this issue. Another technique often used in infant language studies is preferential looking (e.g., Looking‐While‐Listening, Intermodal Preferential Looking Procedure; see Chapter 2). These tasks often involve presenting two visual stimuli simultaneously to a child and measuring looking to each while the child hears an auditory stimulus associated with one of the visual displays. Often, researchers use this method to test word knowledge (e.g., an infant sees a dog and a cat on a screen and looks longer to the dog when hearing "dog"). Another variant is the Head Turn Preference procedure (see Chapter 2), where the infant turns her head toward a visual stimulus to hear an auditory stimulus, which ends when she turns away. Infants' preference for one stimulus over another (e.g., native versus non‐native language) is inferred from the amount of time they listen to each. The above examples reveal that these techniques are often used to test infants' preference for stimuli learned in their natural language environment.
Habituation, on the other hand, necessarily teaches the infant new information (e.g., new words) or re‐familiarizes them with information from their environment (e.g., language sounds), always to a criterial level. As such, the advantage of habituation is the ability to ensure that infants have processed the stimuli immediately prior to testing. A comparative disadvantage is that habituation involves more memory load at test: it presents test trials sequentially, whereas preferential looking presents choices simultaneously so that all information is available. Hybrid tasks that use a habituation phase followed by a visual preference phase maximize the benefits of both techniques (Yoshida, Fennell, Swingley, & Werker, 2009). One large advantage of visual preference tasks is that a researcher can present multiple stimuli during the course of an experiment (e.g., test knowledge of multiple words), as no training phase is required. Habituation is limited to comparing a familiar versus a novel stimulus. A technique similar to habituation is familiarization. Studies that present stimuli until infants reach a certain total LT (e.g., 5 minutes) or a certain number of trials are not habituation experiments, but rather familiarization studies.

These studies suffer from a major issue: Unlike habituation, they fail to tailor the learning phase to the individual participant. All infants do not require the same amount of time to learn about a stimulus. Some may take 30 seconds to process the information presented, while others may take 2 minutes. By failing to control for individual learning via a criterion, these studies have a greater chance of producing strange effects, such as a preference for a familiar stimulus over a novel stimulus (to which we will turn next), or of producing null results, since a segment of the participants may not have processed the stimuli. An apparent problem one can encounter when using the habituation task is the presence of what is called a familiarity effect. After habituation to a stimulus, infants may demonstrate a reduction of a behavioral response to a novel stimulus, but an increase or maintenance of the response to a repeated presentation of the habituation stimulus. This runs counter to the novelty effect typically found in habituation: increased behavioral response to a novel stimulus and decreased or maintained response to a repetition of the habituation stimulus. However, this atypical response may not be a real problem at all, but rather an informative reflection of infants' processing of the stimuli. Familiarity and novelty effects are tied to the level of difficulty infants encounter in processing habituation stimuli. Infants avoid or show no interest in stimuli that are not at their optimal level of stimulation (Cohen, 2004; Hunter & Ames, 1988). Since infants are cognitively immature, they initially actively avoid a complex stimulus and only start preferring it as they become familiar with its properties. Infants exposed to overly complex information during habituation may show a "preference" for the familiar stimulus at test, as they are still trying to process its components and actively reject new complex information (i.e., the novel stimulus). For example, if we return to Figure 1.2, the looking time to the Same trial in the condition where monolinguals were hearing tokens from bilingual speakers appears to be higher than expected based on the data from the other conditions. This difference may reflect a weak familiarity effect due to increased complexity in that one condition: infants being raised in a monolingual environment may have had few opportunities to hear bilingual speakers, whereas bilingual infants often hear monolingual speakers. Of course, the other conditions were also less difficult because infants were hearing typical voices from their environment. It should be noted that Cohen (2004) argues that we should adopt strict habituation criteria, such as a 50% reduction in a behavioral response, to avoid possible familiarity effects. Of course, the use of a strict criterion can lead to the problem of increased attrition, as infants can cross the line from bored to extremely restless and non‐participatory. An examination of past similar research with infants of the same age should give researchers solid ideas of both an appropriate habituation criterion and what stimuli are optimal for infants that age. Another issue is that infants can have preferences for some habituation stimuli, which would interfere with or prevent habituation, or can lead to false familiarity or novelty preferences at test (Oakes, 2010).
Again, the appropriate choice of a habituation stimulus is key: Do not use stimuli for which infants have strong preferences, and ensure that the familiar and novel stimuli have similar preference strengths in the population being tested. These preferences can be determined via pilot testing, or from a literature review of the target area. Finally, one cannot automatically treat habituators and non‐habituators as the same population. Non‐habituators are those infants who reach the maximum

number of habituation trials without meeting the criterion of a reduced behavioral response to a set percentage. Cohen (2004) highlights that, in comparison to habituators, there is a greater chance that non‐habituators will alter the results of a study by demonstrating a familiarity preference at test, due to their incomplete processing of the stimulus. For example, Werker et al. (1998) found that, at 14 months of age, only habituators showed evidence of learning a word‐object pairing at test; non‐habituators did not. Having highlighted these differences, it is important to note that some studies have found no group differences between habituators and non‐habituators (e.g., Byers‐Heinlein, Fennell, & Werker, 2012). Nevertheless, studies should always compare habituators and non‐habituators to determine whether any differences are present. In conclusion, the simplicity of the habituation procedure in both its design and implementation, along with its long history in our field, makes this task one of the fundamental tools that psycholinguists can use to uncover nascent, emerging, and maturing language skills during infancy and early childhood.

Key Terms

Familiarity preference An uncommon response in a true habituation task where the participant attends more to the familiar (i.e., habituation) stimulus at test than to a novel stimulus. This usually would indicate that the wrong habituation criterion was employed and/or that the familiar stimulus was too complex for the participant.

Familiarization study Unlike a habituation experiment with its individualized criterion, this is a study where every participant experiences the target stimuli for the same predetermined amount of time.

Familiar stimulus The stimulus that the participants receive throughout habituation.

Habituation The progressive reduction of an organism's behavior in response to a repeated stimulus.

Habituation criterion The set percentage to which the participant's behavioral response must decrease from the maximum response during habituation (sometimes the maximum during the first block of trials only) before the test phase begins.

Habituation curve The pattern of the participant's responses over the habituation phase, typically an exponential decrease in response (thus the term "curve").

Novel stimulus A test stimulus distinct from the familiar, or habituated, stimulus. Participants should increase their target response to this stimulus over the familiar one if they have adapted to or learned the latter stimulus from habituation (see Novelty preference).

Novelty preference The classic test response in a habituation task where the participant attends more to a novel stimulus than to a familiar one post‐habituation.

Switch procedure An associative word‐learning variant of the habituation task where participants receive two word‐object associations throughout habituation (Object A – Word A; Object B – Word B) and are tested on two test trials: a familiar pairing (Object A – Word A) and a novel one (Object A – Word B). If the participants learned the associative link, they will show a novelty response.

References

Antoniou, M., Best, C. T., Tyler, M. D., & Kroos, C. (2011). Inter‐language interference in VOT production by L2‐dominant bilinguals: Asymmetries in phonetic code‐switching. Journal of Phonetics, 39, 558–570. doi: 10.1016/j.wocn.2011.03.001

Ashmead, D. H., & Davis, D. L. (1996). Measuring habituation in infants: An approach using regression analysis. Child Development, 67, 2677–2690.

Bartoshuk, A. K. (1962). Human neonatal cardiac acceleration to sound: Habituation and dishabituation. Perceptual and Motor Skills, 15, 15–27.

Bornstein, M. H., & Benasich, A. A. (1986). Infant habituation: Assessments of individual differences and short‐term reliability at five months. Child Development, 57, 87–99.

Byers‐Heinlein, K., Fennell, C. T., & Werker, J. F. (2012). The development of associative word learning in monolingual and bilingual infants. Bilingualism: Language and Cognition, 16, 198–205. doi: 10.1017/S1366728912000417

Cohen, L. B. (2004). Uses and misuses of habituation and related preference paradigms. Infant and Child Development, 13, 349–352.

Dannemiller, J. L. (1984). Infant habituation criteria: A Monte Carlo study of the 50% decrement criterion. Infant Behavior & Development, 7, 147–166.

Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303–306.

Fantz, R. L. (1964). Visual experiences in infants: Decreased attention to familiar patterns relative to novel ones. Science, 146, 668–670.

Fennell, C. T., & Byers‐Heinlein, K. (2014). You sound like mommy: Bilingual and monolingual infants learn words best from speakers typical of their language environments. International Journal of Behavioral Development, 38, 309–316.

Fennell, C. T., Byers‐Heinlein, K., & Werker, J. F. (2007). Using speech sounds to guide word learning: The case of bilingual infants. Child Development, 78, 1510–1525. doi: 10.1111/j.1467‐8624.2007.01080.x

Flom, R., & Pick, A. D. (2012). Dynamics of infant habituation: Infants' discrimination of musical excerpts. Infant Behavior and Development, 35, 697–704.

Harris, J. D. (1943). Habituatory response decrement in the intact organism. Psychological Bulletin, 40, 385.

Hollich, G. (2005). SuperCoder: A program for coding preferential looking (Version 1.5) [Computer software]. West Lafayette, IN: Purdue University.

Horowitz, F. D. (1975). Visual attention, auditory stimulation, and language discrimination in young infants. Monographs of the Society for Research in Child Development, 39, i–x, 1–140.

Humphrey, G. (1933). The nature of learning in its relation to the living system. New York: Harcourt, Brace.

Hunter, M. A., & Ames, E. W. (1988). A multifactor model of infant preferences for novel and familiar stimuli. In C. Rovee‐Collier & L. P. Lipsitt (Eds.), Advances in infancy research (Vol. 5, pp. 69–95). Norwood, NJ: Ablex.

Mattock, K., Polka, L., Rvachew, S., & Krehm, M. (2010). The first steps in word learning are easier when the shoes fit: Comparing monolingual and bilingual infants. Developmental Science, 13, 229–243. doi: 10.1111/j.1467‐7687.2009.00891.x

Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel‐Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29, 143–178.

Miller, D. J., Ryan, E. B., Aberger, E., Jr., McGuire, M. D., Short, E. J., & Kenny, D. A. (1979). Relationships between assessments of habituation and cognitive performance in the early years of life. International Journal of Behavioral Development, 2, 159–170.

Oakes, L. M. (2010). Using habituation of looking time to assess mental processes in infancy. Journal of Cognition and Development, 11, 255–268.

Oakes, L. M., Sperka, D. J., & Cantrell, L. (2015). Habit 2 [Unpublished software]. Center for Mind and Brain, University of California, Davis.

Piaget, J. (1926). The language and thought of the child. New York: Harcourt, Brace & Company.

Polka, L., & Werker, J. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20, 421–435.

Schmale, R., Hollich, G., & Seidl, A. (2011). Contending with foreign accent in early word learning. Journal of Child Language, 38, 1096–1108. doi: 10.1017/S0305000910000619

Sokolov, E. N. (1963). Higher nervous functions: The orienting reflex. Annual Review of Physiology, 25, 545–580.

Thompson, R. F. (2009). Habituation: A history. Neurobiology of Learning and Memory, 92, 127–134.

Thompson, R. F., & Spencer, W. A. (1966). Habituation: A model phenomenon for the study of neuronal substrates of behavior. Psychological Review, 73, 16–43.

Weikum, W. M., Vouloumanos, A., Navarra, J., Soto‐Faraco, S., Sebastián‐Gallés, N., & Werker, J. F. (2007). Visual language discrimination in infancy. Science, 316(5828), 1159.

Werker, J. F., Cohen, L., Lloyd, V., Casasola, M., & Stager, C. (1998). Acquisition of word‐object associations by 14‐month‐old infants. Developmental Psychology, 34, 1289–1309.

Werker, J. F., Fennell, C. T., Corcoran, K. M., & Stager, C. L. (2002). Infants' ability to learn phonetically similar words: Effects of age and vocabulary size. Infancy, 3, 1–30.

Werker, J. F., Polka, L., & Pegg, J. E. (1997). The conditioned head turn procedure as a method for testing infant speech perception. Early Development and Parenting, 6, 171–178.

Yoshida, K. A., Fennell, C. T., Swingley, D., & Werker, J. F. (2009). Fourteen‐month‐old infants learn similar‐sounding words. Developmental Science, 12, 412–418.

Further Reading

Cohen, L. B. (2004). Uses and misuses of habituation and related preference paradigms. Infant and Child Development, 13, 349–352.

Hunter, M. A., & Ames, E. W. (1988). A multifactor model of infant preferences for novel and familiar stimuli. In C. Rovee‐Collier & L. P. Lipsitt (Eds.), Advances in infancy research (Vol. 5, pp. 69–95). Norwood, NJ: Ablex.

Oakes, L. M. (2010). Using habituation of looking time to assess mental processes in infancy. Journal of Cognition and Development, 11, 255–268.

Oakes, L. M., Sperka, D. J., & Cantrell, L. (2015). Habit 2 [Unpublished software]. Center for Mind and Brain, University of California, Davis.

Werker, J. F., Cohen, L., Lloyd, V., Casasola, M., & Stager, C. (1998). Acquisition of word‐object associations by 14‐month‐old infants. Developmental Psychology, 34, 1289–1309.

2 Visual Preference Techniques

Roberta Michnick Golinkoff, Melanie Soderstrom, Dilara Deniz Can, and Kathy Hirsh‐Pasek

Abstract

The rationale and purpose of the Intermodal Preferential Looking Paradigm and the Head Turn Preference Procedure are described. The development, instrumentation, and utility of each method are showcased, including variants of the original paradigms that have evolved over time and use. Also discussed are the different types of questions these methods address and how they have advanced the field. Advantages and disadvantages of the methods are also presented.

The eyes shout what the lips fear to say. William Henry (1729–1786)

Introduction

The purpose of using visual preference methods to study language acquisition is revealed by changing one word in the quotation above: The eyes shout what the lips cannot say. Children know much about language before they can produce it. Prior to the advent of visual preference methods, the field made progress through diary studies (e.g., Brown, 1973) and experiments with older children (e.g., Berko, 1958). In retrospect

and from the vantage point of a new millennium, two things were needed to propel the field further. The first was a theory of language that considered more than its surface manifestation. Noam Chomsky's books Syntactic Structures (1957) and Aspects of the Theory of Syntax (1965) provided that theory and gained prominence in the field of psychology with writings by George Miller (Miller, 1965). Once we knew that children had rich language structures hidden beneath their meager productions, the second change was the introduction of new methods designed to unearth this knowledge. By the time a child says the two‐word utterance, "Mommy sock," for example, an enormous amount of language acquisition has already occurred. Researchers like Martin Braine (1963), Lois Bloom (1970), and Roger Brown (1973) began to analyze children's early productions for their grammatical properties and their putative underlying structure. To study the process by which language emerged, researchers recognized that they needed to start earlier. What did young children know about language before it emerged in speech? How could we get purchase on this question when infants could neither talk nor respond on command? Part of the new methodology appearing at that time was videotape. The ability to record dynamic events was a boon to the field in several ways. It allowed researchers to test for motion verb comprehension: as verb knowledge is a key component of grammatical knowledge, researchers could now probe how children viewed the events that verbs would label (see Hirsh‐Pasek & Golinkoff, 2006). Further, videotape gave participants' performance a permanent record, so the reliability of visual fixation coding could be calculated offline.

Development, Assumptions, and Rationale

Modern visual preference techniques have their roots in the work of Robert Fantz (1958, 1964). Interested in visual acuity, he showed that infants respond differentially to stripes of different widths. Around this time, researchers speculated that visual fixation might be a window for understanding language development (Colombo & Bundy, 1981; Horowitz, 1975). Horowitz (1975) used a visual preference method with infants and discovered that infants would look more to visual displays when they were accompanied by language than when they were presented in silence. Then in 1987, Golinkoff, Hirsh‐Pasek, Cauley, and Gordon, in a paper aptly titled "The Eyes Have It," adapted a method to study language acquisition employed by Spelke (1979). Spelke presented 4‐month‐old infants with a dynamic, intermodal version of Fantz's paired‐comparisons method. While Fantz's studies were mostly on visual acuity, Spelke's adaptation was to see if infants knew which sights went with which sounds. Babies saw two events side by side (e.g., a donkey jumping on a table and a person clapping hands) accompanied by a single auditory stimulus that matched only one of the actions (e.g., the sound of hands clapping). Infants looked more at the event that matched the auditory stimulus than at the event that did not. Golinkoff et al. (1987) realized that this method could be adapted to study language comprehension. Would children look more toward a scene that matched the language they were hearing than toward a scene that did not match? When they did, it suggested that infants naturally looked for sights that matched the sounds they heard, an attribute that would be useful for uncovering hidden aspects of language learning. This early work (and Hirsh‐Pasek & Golinkoff, 1996) indicated that infants knew more about language than their meager productions revealed. Infants by 16 months were matching words to a visual representation of their meanings (viz., a picture of a shoe to the word shoe), and latent grammar was available as well. Sixteen‐month‐old children, saying as few as two words, were already watching an event that matched a seven‐word sentence, as in "Where is Big Bird tickling Cookie Monster?", rather than seeking out and watching an event that did not match what they were hearing (Hirsh‐Pasek & Golinkoff, 1996; Golinkoff, Ma, Song, & Hirsh‐Pasek, 2013). Simultaneously with the development of the Intermodal Preferential Looking Paradigm (hereafter referred to as IPLP), another procedure was also being developed: the Headturn Preference Procedure (Hirsh‐Pasek et al., 1987; Kemler Nelson et al., 1995). While preferential looking tasks explicitly test infants' pairing of words with particular visual stimuli, the Headturn Preference Procedure (hereafter HPP) is designed to probe what infants know about the phonological properties of language, without requiring infants to understand the meaning of what they are hearing. A single visual display (usually a flashing light, sometimes a display on a television screen such as a flashing circle or checkerboard pattern) is used to measure infants' relative preference for different auditory stimuli. Since it is not possible to directly measure which auditory stimulus an infant is attending to through behavioral measures, the auditory stimulus is paired with the visual display, and the infant's attention to the visual display is used as a proxy measure for their auditory attention.
Hirsh‐Pasek envisioned this method while watching her own infant son turn his head back and forth to follow whichever loudspeaker was playing. Hirsh‐Pasek et al. (1987) first used this measure to examine infants' perception of the prosodic characteristics of clauses. All pauses greater than 1 second were removed from a passage said in infant‐directed speech, and 1‐second artificial pauses were inserted either at clause boundaries (coincident with the edges of prosodic units) or at other places within clauses (not coincident with prosodic units, but not starkly disruptive, as in the middle of words). Infants (7–10 months old) preferred the sentences in which the artificial pauses were aligned with clause boundaries, suggesting that they were attuned to the cues that coincide with major linguistic constituents. A few years later, Jusczyk and Aslin (1995) added a familiarization phase to the basic procedure, allowing the researcher not only to test the preferences that infants bring into the lab from their everyday experiences, but also to introduce biases through experiences in the lab that are then tested within the same paradigm. The primary advantage that the IPLP and the HPP share is that the infant's response is minimal but meaningful. Shifts in eye gaze or a turn of the head require very little from the infant, thereby reducing the need for complex motor behaviors to carry out commands or engage in decision making. These methods rely on the idea that infant looking behavior captures low‐level affinities for finding structure in the world, such as "matching/non‐matching," "coherent/incoherent," "familiar/unfamiliar," and "coincident/non‐coincident." They also rest on an even more fundamental assumption: that infants are motivated to respond to language stimuli well before they can talk. Indeed, infants appear to be seeking out the regularities language contains. Both the IPLP and the HPP rely on the key assumption that infants' looking behavior toward a visual stimulus can be used to make inferences about their linguistic sensitivity. In the case of Preferential Looking methods, infants' looking behavior while hearing an auditory stimulus is compared across two visual displays, as a means of determining whether infants consider the auditory stimulus to be a better "match" with one of the two displays. In the case of HPP methods, looking toward a visual display is used as a proxy measure for infants' attention to or preference for one type of auditory stimulus over the other. Because there are numerous variants of each, we will start out with a description of the "basic" methodology, and then describe some of the variants that have been used.
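As a concrete illustration of the stimulus manipulation in Hirsh‐Pasek et al. (1987) described above, here is a minimal sketch of inserting 1‐second artificial pauses into a speech waveform with NumPy. The 10‐second noise stand‐in for speech and the boundary times are hypothetical placeholders.

```python
# Minimal sketch of HPP stimulus construction: inserting artificial pauses
# either at clause boundaries or within clauses (illustrative only).
import numpy as np

def insert_pauses(wave, sr, pause_times_s, pause_dur_s=1.0):
    """Return a copy of `wave` with pause_dur_s of silence inserted at each time (s)."""
    silence = np.zeros(int(sr * pause_dur_s), dtype=wave.dtype)
    pieces, prev = [], 0
    for t in sorted(pause_times_s):
        cut = int(sr * t)
        pieces += [wave[prev:cut], silence]
        prev = cut
    pieces.append(wave[prev:])
    return np.concatenate(pieces)

sr = 44100
speech = np.random.randn(sr * 10).astype(np.float32)  # stand-in for a 10 s passage
clause_boundaries = [2.4, 5.1, 7.8]   # hypothetical prosodic boundary times (s)
within_clauses = [1.3, 4.0, 6.5]      # hypothetical within-clause times (s)
coincident = insert_pauses(speech, sr, clause_boundaries)
non_coincident = insert_pauses(speech, sr, within_clauses)
```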

The Intermodal Preferential Looking Paradigm (IPLP)

The IPLP (Golinkoff et al., 1987) enables the exploration of language questions that cannot be tackled otherwise in participants who cannot yet follow directions. It is used to assess infants' emergent language knowledge in a number of ways. First, it tests for children's receptive vocabulary knowledge and the grammatical structures they understand. Second, the IPLP explores the process by which children learn new lexical items (e.g., Ma, Golinkoff, Houston, & Hirsh‐Pasek, 2011). Third, it can probe whether infants attend to the phonological properties of word forms, as in a study by White and Morgan (2008), who presented 19‐month‐olds with mispronunciations of familiar words (e.g., "tup" instead of "cup"). Infants showed graded amounts of looking toward the cup versus an unfamiliar object (i.e., a new object that might be called a "tup") depending on the degree of mispronunciation, indicating fine‐grained sensitivity to the featural properties of phonemes. Fourth, it has been used to ask when children not yet producing morphological affixes (such as the plural /s/) notice these affixes and comprehend their function (e.g., Jolly & Plunkett, 2008). Finally, it has been used to assess how infants interpret linguistic constructions, such as transitive and intransitive sentences (Hirsh‐Pasek & Golinkoff, 1996).

Overview of Method, Apparatus, and Data Analysis

Picture a child sitting on a parent's lap and seeing the images in Figure 2.1 (a car and a dog). The voiceover asks "Where's the car? Find the car!" The infant's visual fixation to the two simultaneous images is videotaped by a hidden camera for off‐line coding by a trained researcher who is blind to the location of the target visual stimulus. The hypothesis is that infants will allocate more looking time to the car than to the dog if they understand the word "car." After all, in more natural environments children are encouraged to look at objects and actions that are being talked about. The prediction is never that children will allocate all their looking to the matching image, since the stimuli are balanced such that each has interesting properties. In fact, the stimuli are created to be of equivalent salience by controlling for a number of parameters, including the size of the images, the degree of movement (when relevant), and the degree of affect if faces are seen, to name just a few.

Figure 2.1 The Intermodal Preferential Looking Paradigm (see text for details). (See insert for color representation of the figure.)

The original IPLP used two separate but time‐locked television monitors with the auditory stimulus delivered through a central speaker. With the arrival of big‐screen televisions, stimulus displays can appear on a single screen (see Figure 2.1) and the auditory stimuli can emanate from the television itself.1 This advance, in and of itself, proved critical, as the timing of the video presentations could be more closely controlled.

1 Caution is suggested here if the acoustic stimuli have been artificially modified, such as by low‐pass filtering: preprocessing in television speakers can sometimes alter the intended output, a problem one of the authors encountered.

Infants are seated on a parent's lap facing a large monitor, and visual stimuli are shown as left and right split‐screen displays at approximately the infant's eye level. To avoid "Clever Hans" effects (i.e., where the infant is influenced by the mother's unconscious behaviors), the parent is asked to close their eyes, to wear blacked‐out sunglasses, or to use some other eye obstruction. Another important feature of the setup is that a single central stimulus (sometimes a video of a laughing baby or a flashing light) is used during intertrial intervals to attract infants' attention back to the middle of the screen. This is done to lure infants away from focusing only on a single side of the television, and because it invites comparison between the stimuli starting from a central fixation spot.

Data analysis is usually accomplished via measurements of total visual fixation time to the matching versus the non‐matching image during test trials. Information from a coder judging the location of the infant's eye gaze is fed into a computer program that cumulates the amount of time an infant looks to the matching display, to the non‐matching display, and away from the screen. Reliability between coders is usually very high and can be easily measured by having two researchers code the same video separately. Given that chance predicts looking time of 50% to each display, differences from chance are also calculated. A range of dependent variables, in addition to raw visual fixation time, are also used: 1) a comparison of the single longest look in each trial (Ma et al., 2011; Schafer & Plunkett, 1998); 2) the proportion of looking toward the target versus the distractor across the trial (Tincoff & Jusczyk, 1999); and 3) using only the first two seconds of a trial, when infants' gaze is more likely to reflect what they consider new (Roseberry, Hirsh‐Pasek, Parish‐Morris, & Golinkoff, 2009). Statistical tests are typically analyses of variance that compare mean looking times across trial types. Multiple trials are offered, and an IPLP test can total 3–4 minutes and still maintain many infants' attention. Researchers often also include a count of how many participants show an overall preference for the target across trials, as a check to ensure that the effect is not a result of a small subset of children within the sample. Sometimes non‐parametric tests are used for this reason. Once children are old enough to reliably follow directions (generally around 24 months of age), children's pointing to one of the displays can be used as the dependent variable (Maguire, Hirsh‐Pasek, Golinkoff, & Brandone, 2008).
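A minimal sketch of a common IPLP analysis follows: compute each infant's proportion of looking to the matching display and test the group against the 50% chance level with a one‐sample t‐test. The data are fabricated for illustration.

```python
# Minimal sketch of an IPLP chance-level comparison (fabricated data).
import numpy as np
from scipy.stats import ttest_1samp

match_s = np.array([4.1, 3.6, 5.0, 3.9, 4.4, 3.2])     # looking to matching display (s)
nonmatch_s = np.array([2.9, 3.1, 2.8, 3.5, 2.6, 3.0])  # looking to non-matching display (s)

prop_match = match_s / (match_s + nonmatch_s)          # per-infant proportion to match
t, p = ttest_1samp(prop_match, 0.5)                    # chance = 50% to each display
print(f"mean proportion = {prop_match.mean():.2f}, t = {t:.2f}, p = {p:.4f}")
```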

An Exemplary IPLP Study

Ma et al. (2011) utilized the IPLP method to train children to learn two new words offered in either infant‐directed or adult‐directed sentences. Although prior studies had documented that infants prefer hearing infant‐directed speech, none had shown that its use actually advanced word learning (Golinkoff, Deniz Can, Soderstrom, & Hirsh‐Pasek, 2015). In a first experiment, monolingual, English‐learning 21‐month‐olds were randomly assigned to either an infant‐directed or an adult‐directed speech condition (IDS versus ADS) describing two novel objects. As Table 2.1 shows, the study began with a task familiarization phase using familiar objects. The child was asked to look at each of the known objects (book or ball) on one trial each, to get them used to looking where asked. A salience trial of the two objects to be seen at test followed. The purpose of this trial is to test whether infants have an a priori preference for either of the test objects prior to training and test. Four training trials followed, during which toddlers were shown a single novel object and told its name on sequential trials. For example, when children were shown one novel object they were told, "Blick! Where's the blick? Look at the blick! There's the blick!" At the same time, the novel object was programmed to drop down to the bottom of the screen, to bounce, and to engage in other movements designed to motivate children to continue to watch. During the test phase, infants saw a static version of the two novel objects side‐by‐side while being directed to look at one of the objects on half the trials and the other object on the other half (e.g., "Blick! Where's the blick?"). There were two blocks of testing (four trials in each block) with a reminder trial in between. During the last second of each test trial, the "target" (i.e., the named object) bounced to reinforce or encourage looking to it. The two reminder trials offered children another opportunity to learn the novel names. The reminder trials were followed by the second block of four test trials. Visual fixation was coded frame‐by‐frame, with the dependent variable being the single longest look at the target and non‐target in each test trial. Using total raw visual fixation time to test trials yielded the same outcome. In addition, the caregivers completed the Short Form of the MCDI (MacArthur Communicative Development Inventory): Words and Sentences (Fenson et al., 2000; see Chapter 3) to examine links between children's lexical knowledge and their performance on IPLP tasks.

Table 2.1 Visual and linguistic stimuli used to teach two novel words in either infant-directed or adult-directed speech, from Ma et al. (2011).

Task familiarization phase. Visual: the book on one side of the monitor (the other side blank); audio: “Book! Look for the book! Can you find the book? That’s the book.” Then the ball on one side (the other side blank); audio: “Ball! Look for the ball! Can you find the ball? That’s the ball.”

Salience. Visual: the two novel objects side by side. No audio.

Training (four trials: 24 seconds each; the two trials, modi and blick, repeat). Visual: animations of the objects. Audio: “Look here! It’s a modi! See the modi. That’s the modi. Look what the modi is doing? Now the modi is going over here. Where’s the modi going? Where’s the modi? Modi! There’s the modi!” and “Look here! It’s a blick! See the blick. That’s the blick. Look what the blick is doing? Now the blick is going over here. Where’s the blick going? Where’s the blick? Blick! There’s the blick!”

Test block 1 (four trials: two for each word; seven seconds each). Visual: the two novel objects side by side. Audio: “Modi! Where’s the modi? Look at the modi! There’s the modi.” and “Blick! Where’s the blick? Look at the blick! There’s the blick.”

Reminder 1 (two trials: seven seconds each). Audio: “Modi! That’s the modi. See the modi. It’s a modi!” and “Blick! That’s the blick. See the blick. It’s a blick!”

Test block 2 (four trials: two for each word; seven seconds each). Visual and audio as in Test block 1.

Note: In the original table, an empty cell means that one side of the monitor is blank. Name assignment (modi and blick) and side of presentation of the two novel objects were counterbalanced in four conditions in infant-directed and adult-directed speech, respectively. Source: Ma et al. (2011). Reproduced with permission of Taylor & Francis.
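To make the dependent variable concrete, here is a minimal sketch of deriving the single longest look from frame-by-frame gaze codes. The 30-fps frame rate and the gaze-code format are assumptions for illustration, not the authors’ actual coding software.

```python
# Derive the single-longest-look DV from frame-by-frame gaze codes.
# Assumed codes: 'T' = target, 'N' = non-target, 'A' = away/between.
from itertools import groupby

FRAME_S = 1 / 30  # seconds per coded video frame (assumed 30 fps)

def single_longest_look(codes, side):
    """Return the longest continuous run of `side` codes, in seconds."""
    runs = [sum(1 for _ in run) for code, run in groupby(codes) if code == side]
    return max(runs, default=0) * FRAME_S

trial = "AATTTTTTNNTTTTTTTTTA"  # one invented 20-frame test trial
print(single_longest_look(trial, "T"))  # longest continuous look at target
print(single_longest_look(trial, "N"))  # longest continuous look at non-target
```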

Figure 2.2 Means of single longest look (in seconds) to target and non-target across test blocks 1 and 2, in the infant-directed (IDS) and adult-directed (ADS) speech conditions, for 21-month-olds (IDS and ADS) and 27-month-olds (ADS). From Ma et al. (2011). Source: Reproduced with permission of Taylor & Francis.

Infants’ mean longest looks to the target and the nontarget across blocks for each condition are displayed in Figure 2.2 (Ma et al., 2011). A 2 (condition: IDS versus ADS) × 2 (stimulus type: target versus nontarget) × 2 (test block: 1 versus 2) repeated-measures ANOVA was conducted. A significant interaction was found between condition (IDS, ADS) and stimulus type (target, nontarget), suggesting that children performed differently with IDS and ADS. Planned t-tests revealed that infants in the IDS condition looked significantly longer at the target than at the nontarget in Block 1 and in Block 2. Infants in the ADS condition, however, did not look significantly longer at the target than at the nontarget in either Block 1 or Block 2. These results suggested that the children had learned the words in IDS, but not in ADS. A second experiment tested children who were 27 months of age in the ADS condition only. As Figure 2.2 shows, children of that age were able to learn from ADS.
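For illustration, here is a hedged sketch of the planned target versus non-target comparisons. The file name and column layout are invented, and the omnibus 2 × 2 × 2 mixed ANOVA (condition between subjects; stimulus type and block within subjects) would typically be run in a dedicated statistics package.

```python
# Planned target vs. non-target comparisons per speech condition and block.
import pandas as pd
from scipy.stats import ttest_rel

# Assumed layout: one row per infant x block, with columns infant,
# condition ("IDS"/"ADS"), block (1/2), and the single longest look in
# seconds to target (target_look) and non-target (nontarget_look).
df = pd.read_csv("iplp_looking.csv")  # hypothetical file

for (cond, block), g in df.groupby(["condition", "block"]):
    t, p = ttest_rel(g["target_look"], g["nontarget_look"])
    print(f"{cond}, block {block}: t({len(g) - 1}) = {t:.2f}, p = {p:.3f}")
```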

Variants of the Intermodal Preferential Looking Paradigm

Interactive Intermodal Preferential Looking Paradigm (IIPLP)

One limitation of the original IPLP is that all stimuli are presented on a screen, which restricts the study of how social cues influence language learning. Hollich, Hirsh‐Pasek, and Golinkoff (2000) introduced a three-dimensional version of the IPLP, called the “Interactive Intermodal Preferential Looking Paradigm” (IIPLP), to address this problem.

Overview of Method and Data Analysis

During the IIPLP (see Figure 2.3), in contrast to the IPLP, a human experimenter delivers the stimuli, allowing researchers to examine the role of social cues in language learning (Golinkoff et al., 2013). The design of the study can be very similar to that used in the IPLP, as shown in Table 2.1.


Figure 2.3 The Interactive Intermodal Preferential Looking Paradigm (see the text for details). From Hollich et al. (2000). Source: Reproduced with permission of John Wiley & Sons.

Stimuli are real objects affixed by Velcro, 20 cm from the top and 12.5 cm from either side, to a wooden, black “flipboard” that can rotate (Fagan, Holland, & Wheeler, 2007). On one side of the table, an infant sits on the parent’s lap and the parent closes his or her eyes. On the opposite side of the table, the experimenter stands behind the board. Because the board rotates, the experimenter can face the child and attach or remove objects on her side of the board. She can rotate the board to reveal the objects to the infant, tightly controlling the time of exposure. Because these babies can be as young as 10 months of age, the experimenter needs to prompt them to look at the objects before starting to speak, using finger snapping, tapping on the flipboard, or calling the infant by name. When labeling an object, the experimenter can provide social cues such as enthusiastically looking back and forth between the object and the child’s eyes. For later coding and reliability testing, a camera films a mirror on the wall behind the parent and child to capture the child’s visual fixation of the objects and what she is seeing on the flipboard. During test trials, the experimenter ducks down behind the flipboard so as not to influence where children choose to look. Coding is done off-line from the videotape with the sound muted, and high inter-coder reliability is typically achieved. The time children look at the target versus the non-target object is coded for each test trial and analyzed via analysis of variance, as in the IPLP, since there are usually the same types of trials, viz., salience, training, and test.
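The chapter does not prescribe a particular reliability statistic; Cohen’s kappa is one common choice for categorical gaze codes, sketched below on toy data (our illustration, not the original authors’ procedure).

```python
# Toy inter-coder agreement check on categorical gaze codes.
from sklearn.metrics import cohen_kappa_score

coder1 = ["left", "left", "right", "away", "right", "right", "left", "away"]
coder2 = ["left", "left", "right", "right", "right", "right", "left", "away"]

print(cohen_kappa_score(coder1, coder2))  # 1.0 would be perfect agreement
```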

An Exemplary IIPLP Study

Pruden, Hirsh‐Pasek, Golinkoff, and Hennon (2006) tested the Emergentist Coalition Model (ECM) of word learning (Hollich et al., 2000) by asking whether 10-month-old infants might first use perceptual cues (in preference to social cues like eye gaze) to identify which of two objects a speaker is naming. If babies are more likely to attach a word to the perceptually salient object in the environment, they might systematically mismap an offered name, taking it for the name of the most interesting object in the room. Each infant sat on a blindfolded parent’s lap, 75 cm back from the center of the flipboard resting on a table. The stimuli consisted of four novel, unfamiliar objects that varied in perceptual salience. Two of these objects were “interesting” (brightly colored and with moving parts) and the other two objects, dull in both color and appearance, were designated as “boring.” Four perceptually distinct labels were chosen (“modi,” “glorp,” “dawnoo,” and “blicket”) to pair randomly with the objects. Participants completed four phases in which visual fixation time was the dependent variable. In the exploration phase, the infants played sequentially with one interesting and one boring novel object. In the salience phase, infants saw these same two objects side by side on the flipboard for 6 seconds, with the prediction that the interesting objects would elicit longer looking than the boring objects. The independent variables were whether infants were in the coincidental condition or the conflict condition and which of the objects they looked at during test trials. Infants in the coincidental condition saw the experimenter look at and label the interesting object. In the conflict condition, the experimenter looked at and labeled the boring object. For the training phase, either the interesting or the boring object was labeled, depending on the child’s condition, as in “Jordan, look! A modi!” The experimenter only spoke after obtaining the infant’s attention and establishing eye contact. Finally, the testing phase had four trials, each 6 seconds in duration. During the first two test trials (original-label test trials), the experimenter hid behind the center of the testing board, and the infants saw the two objects side by side on the board. The experimenter asked the child: “Jordan, where is the modi? Can you find the modi?” In addition to the original-label test trials, a third and fourth test trial were included to assess whether infants were truly pairing a label with an object rather than simply attending to this object because it was the more interesting one. In the third trial, the new-label test trial, infants were asked to look at the “glorp” rather than the “modi.” If they had attached a label to the interesting object during the training, then they should look away from the interesting object upon hearing a new label. In the fourth test trial, named the recovery trial, infants were again asked to look at the “modi.” If they had learned the name of the modi object, they should look at it again when offered the original label. Recall that infants went through the procedure twice with different objects and labels. The results are shown in Figure 2.4. An independent-samples t-test showed that 10-month-olds performed the same way regardless of condition in the salience trial, making it possible to pool the data. Data for the test trials were analyzed in a repeated-measures ANOVA.
Data from the two trials that comprised the original-label test trials were averaged. Neither a main effect of condition (coincidental versus conflict) nor an interaction between condition and test trial was found. However, a main effect of test trial did emerge. One-sample t-tests revealed that infants paid significantly more attention to the interesting object during the original-label trials, looked less at the interesting object during the new-label trial, and then renewed their looking to the interesting object during the recovery test trial. In other words, infants attached the new label to the interesting object, regardless of which object the speaker labeled. That is, at 10 months of age, infants always linked the word to the interesting object. Perceptual preference determined how infants connected word to object, and they made systematic mismappings. The finding that 10-month-olds at the cusp of word learning looked away from the interesting object in the new-label trials and renewed their looking to this object on the recovery trials provided compelling evidence that they had attached a label to the interesting object.
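A minimal sketch of this analysis (the file and column names are invented): the proportion of looking to the interesting object per trial type, tested against the 0.5 chance level.

```python
# One-sample t-tests of looking proportions against chance (0.5).
import pandas as pd
from scipy.stats import ttest_1samp

# Assumed layout: one row per infant x trial type, with columns infant,
# trial_type ("original"/"new"/"recovery"), and looking times in seconds
# to the interesting (interesting_s) and boring (boring_s) objects.
df = pd.read_csv("iiplp_test.csv")  # hypothetical file
df["prop_interesting"] = df.interesting_s / (df.interesting_s + df.boring_s)

for trial_type, g in df.groupby("trial_type"):
    t, p = ttest_1samp(g["prop_interesting"], 0.5)
    print(f"{trial_type}: mean = {g.prop_interesting.mean():.2f}, p = {p:.3f}")
```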


Figure 2.4 Visual fixation to original-label, new-label, and recovery trials by condition. Proportion of looking time: below 0.5, looking to the boring object; above 0.5, looking to the interesting object. Top panel: looking times during test trials in the Coincidental condition (interesting object named); the “V” pattern indicates learning of the novel object’s name (see text for explanation). Bottom panel: Conflict condition (boring object named). Infants attached the novel name to the interesting object in both conditions. Source: Pruden et al. (2006). Reproduced with permission of John Wiley & Sons.

However, infants only learned the name of the interesting object and did not learn the boring object’s label (see Figure 2.4). When the experimenter looked at and named the boring object, 10-month-olds systematically mismapped that word to the interesting object, apparently ignoring the speaker’s social cues (Pruden et al., 2006). This study, combined with the results of Hollich et al. (2000), illustrates how the cues infants use to map words to referents change over the first two years of life, moving from a reliance on perceptual salience to the use of social and linguistic cues.

The Looking‐While‐Listening Paradigm (LWL)

One significant expansion of the IPLP retains much the same basic setup but introduces a different form of analysis that allows for detailed timecourse explorations of infants’ eye gaze. The unit of analysis is the time it takes for a child to land on the match and remain on the match during a trial, rather than cumulated looking times across trials. This variant, referred to as the “looking-while-listening paradigm,” has been instrumental in showing relationships between toddlers’ speed in finding named targets and a host of other variables such as parental input and vocabulary acquisition (e.g., Fernald, Perfors, & Marchman, 2006).

Overview of Method and Data Analysis

During the LWL, as with the IPLP, two pictures of objects (matched for attractiveness) are typically presented next to each other, accompanied by a sentence that matches only one of the objects (e.g., “Where is the doggie?”). The timecourse analysis (typically at a 33 ms resolution due to video frame rates) is aligned to the onsets of particular targets. For example, the coding starts slightly before the onset of the first phoneme (the d in doggie in “Where is the doggie?”). If infants comprehend the word “doggie” and they are looking at the picture of the dog, they should stay there. But if they are looking at the distracter (say, a cookie), their gaze should shift to the dog upon hearing the word (Fernald et al., 2006). Timecourse analysis allows for detailed comparisons that take into account both the proportion of trials in which infants are looking toward a given visual display at each point in the timecourse and the speed of shifting.
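As an illustration of what such a timecourse analysis involves, here is a sketch (hypothetical file and column names) of one standard summary: the proportion of trials with gaze on the target at each coded frame, aligned to target-word onset.

```python
# Frame-by-frame timecourse profile for a looking-while-listening data set.
import pandas as pd

# Assumed layout: one row per trial x frame, with columns infant, trial,
# t_ms (time from target-word onset, in 33-ms steps), and gaze
# ("target", "distracter", or "away").
frames = pd.read_csv("lwl_frames.csv")  # hypothetical file

window = frames[(frames.t_ms >= 0) & (frames.t_ms <= 1800)]
profile = (window.assign(on_target=window.gaze.eq("target"))
                 .groupby("t_ms")["on_target"].mean())
print(profile.head())  # proportion of trials on target, frame by frame
```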

An Exemplary LWL Study

Fernald and colleagues (2006) used the LWL method to test speech-processing efficiency and vocabulary growth across the second year of life. Speed of processing was operationalized as the latency with which infants shifted their gaze to the named target picture. As in the original IPLP, two computer monitors, separated horizontally, each displayed one picture. Three seconds of silence was followed by a speech stimulus. An entire test session lasted about 4 minutes. Coders analyzed the infant’s gaze patterns frame by frame: Were the infant’s eyes oriented to the left or right picture, between the pictures, or away from both pictures? The correct response differed depending on the nature of the trial. For distracter-initial trials (i.e., when the child was looking at the cookie but the dog was requested), the child was expected to shift to the target picture. But if children were already looking at the target (the dog) and it was requested, the child should remain on the target picture and not shift away. The same infants’ speed and accuracy were assessed at 15, 18, 21, and 25 months using repeated-measures ANOVAs. When a correct shift occurred on distracter-initial trials (within the 300-1800 ms interval following word onset), mean reaction time was calculated. The mean proportion of correct shifts from the distracter to the target picture and the mean proportion of incorrect shifts away from the target to the distracter were computed. Correct and incorrect shifts were then compared in a 4 (age) × 2 (trial type: target-initial versus distracter-initial) repeated-measures ANOVA. Figure 2.5 illustrates that there were significant main effects of age and trial type, as well as an Age × Trial Type interaction. Correct shifts to the target picture on distracter-initial trials increased with age. Importantly, speed and accuracy at 25 months were related to lexical and grammatical development across a range of measures (e.g., the number of produced words, grammatical complexity) from 12 to 25 months (Fernald et al., 2006), indicating that the LWL procedure can detect individual differences in infants’ language capabilities.
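A sketch of how shift latency and accuracy might be derived for distracter-initial trials (assumed data layout; this is not the published analysis code):

```python
# Shift latency and accuracy on distracter-initial LWL trials.
import pandas as pd

# Assumed layout: one row per trial, with columns infant, age_mo,
# initial_fixation ("target"/"distracter"), and shift_ms: latency of the
# first shift to the other picture (NaN if the child never shifted).
trials = pd.read_csv("lwl_trials.csv")  # hypothetical file

dist = trials[trials.initial_fixation == "distracter"]
correct = dist.query("300 <= shift_ms <= 1800")  # correct-shift window
mean_rt = correct.groupby(["infant", "age_mo"])["shift_ms"].mean()
prop_correct = len(correct) / len(dist)          # proportion of correct shifts
print(mean_rt.head())
print(prop_correct)
```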


Figure 2.5 Eye gaze shifts toward and away from the target in the looking-while-listening task, by age (15, 18, 21, and 25 months). From Fernald, Perfors, and Marchman (2006). Gray bars represent correct shifts (the measure of accuracy); white bars represent incorrect shifts. These data suggest that while the rate of error remains roughly constant, the proportion of children who shift correctly increases with age. Source: Fernald et al. (2006). Reproduced with permission of the American Psychological Association.

Preferential Looking Paradigm Without Language (PLP)

Another variant of the IPLP shows infants videos in silence to probe how they segment and analyze the nonlinguistic motion events that will ultimately be encoded by verbs and prepositions. This work brings together theorizing in linguistics and the burgeoning field of event perception in psychology. One question addressed is when infants can discriminate between actions like running, walking, and jumping. If children are to learn different names for these actions, they must both discriminate between them and form categories of them, regardless of the agent performing the action, the location, or the duration of the action.

Overview of Method and Data Analysis

PLP studies are typically identical to IPLP studies in design except for the absence of language. Studies may start out with a salience trial showing infants what they will see at test, to establish that there is no a priori preference for the event that will be “new” at test. A familiarization phase often follows to show infants either a repeating identical scene or different exemplars that belong to the same action or event category (say, multiple actors jumping, as in Song, Pruden, Golinkoff, & Hirsh‐Pasek, 2016). During test trials, infants are shown the same two dynamic visual stimuli that they saw during the salience trial.

During the PLP, children are expected to show discrimination or categorization by looking longer at the novel event. Because the PLP allows children to compare two simultaneously presented events at test, thus minimizing memory demands, it may heighten their attention to the differences between the test events. Simultaneous presentation of test events thus affords children the opportunity to detect differences that they might not detect with sequential presentation (Pruden, Shallcross, Hirsh‐Pasek, & Golinkoff, 2008).

An Exemplary PLP Study

One study tested both discrimination and categorization of the action of marching. Song et al. (2016) asked whether 10- to 12-month-old infants could form a category of marching when performed by different actors and across different paths, for example, marching across the screen or in a circle. To evidence categorization, infants must first show that they can discriminate between the different instances that make up the category. Experiment 1 asked about discrimination between different exemplars; Experiment 2 tested for categorization. The dependent variable was the same in both studies: the proportion of time infants looked at the novel event divided by the time they looked at the novel and the old event. In Experiment 1, infants were first shown, for example, a single, 6-second event of the same actor performing a marching action across the screen, repeated 10 times. Attention stayed high during familiarization, declining only to a mean of 88% visual fixation by trial 10. Two different pairs of test trials followed in counterbalanced order. As Table 2.2 shows, in one pair, children saw the same actor marching along the same path versus the same actor marching on a different path. In the other, they saw the same actor marching along the same path versus a new actor marching along the same path. Infants watched the change of path more than the old path. They also noticed the change of actor in the first half of the other trial. In Experiment 2, categorization was tested by showing four different actors each marching along a different path during familiarization. At test, an in-category action was shown of a new actor marching along a new path versus an out-of-category action of that same new actor hopping along the same new path. Infants watched the novel action (hopping) significantly more at test, after seeing marching repeatedly during familiarization.

Table 2.2 Ten- to 12-month-old infants saw two types of discrimination trials, one to test for path discrimination and one for actor discrimination. Study 1: Design of discrimination movies.

Familiarization trials (to 10). Visual stimuli: A marches across (full screen). Duration: 6 s per trial.
Test 1 (path). Visual stimuli: A marches across (left); A marches in a fixed position (right). Duration: 12 s.
Test 2 (actor). Visual stimuli: B marches across (left); A marches across (right). Duration: 12 s.

Note: A and B refer to the two actors. Test order and target side were counterbalanced across infants. Source: Song et al. (2016). Reproduced with permission of Elsevier.

This study demonstrates how the PLP can inform us about when and how children can form the categories of actions that verbs will name.

Headturn Preference Procedure (HPP)

Not all questions in language acquisition are about the mappings between sights and sounds. Researchers who wished to uncover what infants know about the structure of the language (independent of meaning) created the HPP so that auditory stimuli could be presented without meaningful visual displays. Infants’ sensitivity to aspects of language structure is, for example, measured by asking whether they prefer one kind of auditory stimulus over another, as an index of discriminative skills, or whether they prefer hearing their own name over other names.

Overview of Method and Data Analysis

In the HPP, infants (typically between the ages of 4 and 18 months) are seated in a three-sided booth, with a flashing light or other visual display to their front and on both sides (Figure 2.6). Auditory stimuli play from one side of the booth at a time, from a speaker adjacent to the flashing light. The infant is usually seated on a caregiver’s lap, although sometimes an infant seat is used. If the caregiver is present, they typically wear headphones that play music and/or speech sounds with characteristics similar to the test stimuli, to mask the sounds the infant is hearing. An experimenter, usually located in an adjacent room and therefore blind to the exact condition of a given trial, watches via a closed-circuit camera. The camera is located in the center panel, below the flashing light. By pressing a button, the experimenter records whether the infant is looking toward or away from one of the two speakers. Children’s responses are taken as an indirect indicator of their preference for the sound originating from a speaker. Each test trial begins with a light flashing at the front to orient the infant forward. Once the infant looks forward, the front light is extinguished and one of the side lights appears. Side of presentation is randomized across trials and stimulus types to avoid side biases. When the infant orients to a particular side, the auditory stimulus continues to play until the infant looks away for a criterion time (usually 2 s) or the maximum trial length is reached (usually 20-30 s). Infants’ looking time toward a side light (excluding any short looks away that are less than the criterion time) is used as the dependent measure and is assumed to reflect infants’ interest in the auditory stimulus. Usually there are 2-4 warm-up trials prior to the presentation of 8-16 test trials. Warm-up trials are typically either additional trials similar to the test trials that are excluded from analysis, or consist of music. The total number of trials is kept short, as infant boredom becomes a significant factor after a relatively small number of trials. Test trials are divided into two to four categories (e.g., ungrammatical versus grammatical, familiar versus unfamiliar) and a mean looking time is calculated across all the test trials of each category.
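The trial logic is easy to express in code. Below is a sketch of scoring a single HPP trial, assuming 30-fps coding and the 2-s/20-s parameters given above; it is illustrative, not any lab’s actual software.

```python
# Score one HPP trial from a per-frame record of the coder's button state.
FRAME_S = 1 / 30     # assumed video frame rate
CRITERION_S = 2.0    # continuous look-away that terminates the trial
MAX_TRIAL_S = 20.0   # maximum trial length

def trial_looking_time(gaze_frames):
    """gaze_frames: iterable of booleans, True = looking toward the light."""
    looking = away = elapsed = 0.0
    for is_looking in gaze_frames:
        elapsed += FRAME_S
        if is_looking:
            looking += FRAME_S
            away = 0.0  # a look-away shorter than criterion is forgiven,
        else:           # but it does not count toward looking time
            away += FRAME_S
            if away >= CRITERION_S:
                break   # trial ends on a criterion look-away
        if elapsed >= MAX_TRIAL_S:
            break
    return looking

# 9 s looking, a 1-s glance away, 3 s looking, then a long look-away:
frames = [True] * 270 + [False] * 30 + [True] * 90 + [False] * 90
print(round(trial_looking_time(frames), 2))  # 12.0 s of looking
```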

Figure 2.6 The Headturn Preference Procedure (see the text for details). Source: Courtesy of Melanie Soderstrom. (See insert for color representation of the figure.)

An Exemplary HPP Study

One of the classic HPP studies is that of Mandel, Jusczyk, and Pisoni (1995), who were interested in examining when infants recognize their own names. They tested 24 4.5-month-olds, presenting them with test trials of four types: repetitions of the child’s own name (e.g., Harry), a stress-matched foil (e.g., Peter), and two different-stress foils (e.g., Gerard, Emil). Each trial consisted of 15 repetitions of the name, produced in lively infant-directed speech by a female speaker. Infants were tested seated on their caregiver’s lap in a 4 × 6 foot three-sided enclosure. Mandel et al. used music during the warm-up phase to acquaint the infant with the contingency between the flashing lights and the sounds. They used a criterion of listening at least 40 s to the music. (Nowadays it is more common to have a fixed short number of warm-up trials unless the “warm-up” provides critical stimuli for the test phase; see the “modified” version below.) After the warm-up phase, researchers presented three blocks of 4 test trials, for a total of 12 test trials. To analyze their findings, Mandel et al. took the average listening time across the three repetitions of each test trial type. They used a repeated-measures ANOVA across the four trial types and found a significant effect of trial type. Planned contrasts then revealed longer average looking times for the infant’s own name (e.g., Harry, mean = 16.4 s) compared with the stress-matched name (e.g., Peter, mean = 13.0 s) or each of the opposite-stress names (e.g., Gerard, Emil, mean = 12.3 s). Using these relatively simple looking time measures, Mandel et al. thus demonstrated that infants are familiar with their own names quite early.
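For illustration, the core analysis can be reproduced with a repeated-measures ANOVA in statsmodels; the file and column names below are assumptions, and Mandel et al. did not, of course, use this software.

```python
# Repeated-measures ANOVA over the four HPP trial types.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Assumed layout: one row per infant x trial, with columns infant,
# trial_type ("own", "stress_matched", "foil1", "foil2"), and listen_s
# (listening time in seconds).
df = pd.read_csv("hpp_names.csv")  # hypothetical file

# aggregate_func="mean" averages the three repetitions of each trial type,
# mirroring the averaging step described above.
res = AnovaRM(df, depvar="listen_s", subject="infant",
              within=["trial_type"], aggregate_func="mean").fit()
print(res)
```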

Variants of the HPP

Two significant changes are sometimes implemented with the HPP. The first is that some recent versions use a television screen as the visual display on each of the three sides (see Figure 2.6) rather than a flashing light. The screen could display a flashing circle or a checkerboard pattern. Second, just as the IPLP can be used to teach new words or grammatical structures, the “modified” HPP familiarizes infants with a particular target stimulus that is then presented in some of the test trials. Typically, infants accumulate a certain amount of listening time (~30 s) to each target stimulus during the familiarization phase. For example, in one study (Bortfeld, Morgan, Golinkoff, & Rathbun, 2005), 6-month-old infants heard two target words (e.g., “bike” and “cup”) embedded in a six-sentence passage during the familiarization phase. In one passage, the infant’s own name was followed by one of these novel words (e.g., “Harry’s bike had big black wheels”). In the second passage, all of the sentences contained another name, balanced for number of syllables and stress pattern (e.g., “Peter’s cup was bright and shiny”). At test, children heard “bike,” “cup,” and two other words they had not heard in the passages. Would infants prefer to hear words that came after their own name compared to words that received an equal amount of exposure? Even at 6 months, babies indicated recognition of the word that followed their own names compared to all types of foils. These findings suggest that well before infants can speak, they are storing information about the acoustic properties of the language stream.

Advantages and Disadvantages of the IPLP and HPP

Advantages

Because these methods do not require infants to respond to commands or perform any overt action, they have made it possible to examine questions about infants’ linguistic knowledge and perceptual capabilities well before they produce words and sentence structures. They have therefore significantly advanced our knowledge of some of the earliest stages of language development and have caused a proliferation of research on infant speech perception. The popularity of these methods stems in particular from their relative simplicity (in methodology and equipment) compared with methods such as habituation (see Chapter 1) or conditioned head turn. The assumptions underlying the behavioral measures are straightforward and justified both theoretically and in practice: Infants will continue to look longer at stimuli that are of interest to them. Hardware consists of basic audiovisual and computer equipment that can be purchased off the shelf at any local electronics store. One difficulty posed for the resource-poor researcher, however, is that there has been no off-the-shelf software available to run the basic methods. Individual labs have developed in-house software to run the procedure and are often happy to share it. Another innovation impacting this methodology is the increasing affordability of eye-tracking equipment, which allows for the automation of the coding of infants’ looking behavior. As these automated methods become more reliable, portable, and affordable, they are increasingly a high-tech option for implementing what has traditionally been a low-tech procedure. The HPP and IPLP are attractive also because the statistical analyses needed to interpret the findings are direct and accessible. Although there is a trend away from p-values and hypothesis testing toward effect sizes, t-tests and analyses of variance are still the most common means of evaluating statistical findings in preference studies. In large part, this is because what is typically important is simply a “yes or no” answer to a question like “Do infants of a particular age prefer stimulus X over stimulus Y?”, rather than the size of the difference between groups. One exception to this is the timecourse analysis of LWL studies, which addresses the relative time it takes infants to look at a particular visual stimulus accompanied by language. The IPLP and the HPP enable the study of underlying mechanisms associated with language learning. Both methods enable researchers to probe how infants analyze the language they are hearing prior to producing speech. This has had a profound impact on the field. The discovery that language development is occurring at a prodigious rate prior to the production of the first word has changed the field’s view of the “prelinguistic” child. In addition, the discovery of the infant’s burgeoning language skill underscores the importance of early experience for language development. Practices such as talking with children and reading to them are now seen as mattering earlier for children’s future success than before these findings emerged (Hoff, 2013; Hirsh‐Pasek et al., 2015). Another benefit of these methods is their use for assessments. Because these methods lend themselves to probing children’s early language competencies, the IPLP was adapted to test vocabulary knowledge (Friend & Keplinger, 2008), and Pace et al.
(in preparation) created the Quick Interactive Language Screener (QUILS) for 3- to 5-year-olds to test vocabulary and grammar, as well as processes of language learning.

Disadvantages

One issue to consider is that infants’ looking behaviors are driven by a host of uncontrolled factors in addition to the preference being examined within a study, contributing to the variance. Small differences in equipment setup, such as light levels, sound levels, or the structure of the test trials, can have unintended effects on infant behavior and drive differences between studies in ways that we do not yet understand well. The number of familiarization trials, for example, can apparently cause children to exhibit either a familiarity preference or a novelty preference at test (see Chapter 1 for details), as Thiessen, Hill, and Saffran (2005) showed in a study on the role infant-directed speech plays in word segmentation. There is at present no way to predict which type of preference will occur. While it is important to deal head-on with these issues, the insights generated by these conceptually elegant methodologies have radically altered our understanding of early language development, and continue to drive a broad spectrum of research programs. Furthermore, by now there have been a large number of replications and extensions of research findings using these methods (e.g., Golinkoff et al., 2013). Although these methods are powerful laboratory tools, they paradoxically may overestimate children’s knowledge (Golinkoff et al., 2013). When presented with two alternatives, children may solve the task through the process of elimination or mutual exclusivity (e.g., “I know this one, so it must be the other one”) (Halberda, 2006; Markman & Wachtel, 1988). For the HPP, demonstrations that infants prefer one stimulus over another do not tell us why they have this preference, and these preferences may be quite superficial. It is therefore important not to overinterpret HPP findings but to follow up with additional research to probe the source of effects. Another way to say this is that we do not really understand the mechanisms underlying infants’ responses. Another potential limitation of both methods is that they allow for only a limited number of items, given infants’ short attention span. Finally, the fact that both methods indicate that language analysis and comprehension precede language production may not hold to the same degree in some non-Western societies (Bornstein & Hendricks, 2012).

Conclusion

We have described the goals, methodology, analyses, and questions addressed by two popular visual preference methods used with infants and toddlers to study language acquisition. Despite the advent of neurological measures, we hypothesize that these methods, relatively inexpensive and easy to implement, will continue to provide us with significant new insights into the process of language acquisition.

Acknowledgments

R. M. Golinkoff and K. Hirsh‐Pasek’s participation in this project was supported by Institute of Education Sciences grants (R305A090525; R305A150435; R305A100215).

Key Terms

Headturn Preference Procedure (HPP) A method of examining infants’ relative preference for two or more auditory (usually speech) stimuli.
Interactive Intermodal Preferential Looking Procedure (IIPLP) A live-action, three-dimensional version of the IPLP used for testing the influence of social cues (such as eye gaze and object handling) on infant word learning.

Intermodal Preferential Looking Procedure (IPLP) A method that presents infants with two visual stimuli and an auditory stimulus that matches only one of the visual displays. Its purpose is to use language comprehension as a way to understand early language development.
Looking-While-Listening (LWL) A version of the IPLP in which detailed timecourse analysis is introduced.
Preferential Looking Procedure (PLP) A visual-only variant of the IPLP used to test conceptual distinctions that underlie language understanding.
Visual Preference Relative infant interest in one of two visual displays, used as a measure of interest in the display itself (PLP), of infants’ ability to pair a visual display with an auditory stimulus (IPLP, IIPLP, or LWL), or as a proxy measure for interest in a set of auditory stimuli (HPP).

References

Berko, J. (1958). The child’s learning of English morphology. Word, 14, 150–177.
Bloom, L. (1970). Language development: Form and function in emerging grammars. Cambridge, MA: MIT Press.
Bornstein, M., & Hendricks, C. (2012). Basic language comprehension and production in >100,000 children from sixteen developing nations. Journal of Child Language, 39, 899–918.
Bortfeld, H., Morgan, J. L., Golinkoff, R. M., & Rathbun, K. (2005). Mommy and me: Familiar names help launch babies into speech-stream segmentation. Psychological Science, 16, 298–304.
Braine, M. (1963). The ontogeny of English phrase structure: The first phase. Language, 39, 1–13.
Brown, R. (1973). A first language. Cambridge, MA: Harvard University Press.
Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Colombo, J., & Bundy, R. S. (1981). A method for the measurement of infant auditory selectivity. Infant Behavior & Development, 4, 229–231.
Fagan, J. F., Holland, C. R., & Wheeler, K. (2007). The prediction, from infancy, of adult IQ and achievement. Intelligence, 35, 225–231.
Fantz, R. (1958). Pattern vision in young infants. The Psychological Record, 8, 43–47.
Fantz, R. (1964). Visual experience in infants: Decreased attention to familiar patterns relative to novel ones. Science, 146, 668–670.
Fenson, L., Pethick, S., Renda, C., Cox, J., Dale, P. S., & Reznick, J. S. (2000). Short-form versions of the MacArthur Communicative Development Inventories. Applied Psycholinguistics, 21, 95–116.
Fernald, A., Perfors, A., & Marchman, V. A. (2006). Picking up speed in understanding: Speech processing efficiency and vocabulary growth across the second year. Developmental Psychology, 42, 98–116.
Friend, M., & Keplinger, M. (2008). Reliability and validity of the Computerized Comprehension Test (CCT): Data from English and Mexican Spanish infants. Journal of Child Language, 35, 77–98.
Golinkoff, R. M., Hirsh‐Pasek, K., Cauley, K. M., & Gordon, L. (1987). The eyes have it: Lexical and syntactic comprehension in a new paradigm. Journal of Child Language, 14, 23–45.
Golinkoff, R. M., Ma, W., Song, L., & Hirsh‐Pasek, K. (2013). Twenty-five years using the intermodal preferential looking paradigm to study language acquisition: What have we learned? Perspectives on Psychological Science, 8, 316–339.

Golinkoff, R. M., Deniz Can, D., Soderstrom, M., & Hirsh‐Pasek, K. (2015). (Baby)talk to me: The social context of infant-directed speech and its effects on early language acquisition. Current Directions in Psychological Science, 24, 339–344.
Halberda, J. (2006). Is this a dax which I see before me? Use of the logical argument disjunctive syllogism supports word-learning in children and adults. Cognitive Psychology, 53, 310–344.
Hirsh‐Pasek, K., Kemler Nelson, D. G., Jusczyk, P. W., Wright Cassidy, K., Druss, B., & Kennedy, L. (1987). Clauses are perceptual units for young infants. Cognition, 26, 269–286.
Hirsh‐Pasek, K., & Golinkoff, R. M. (1996). The origins of grammar. Cambridge, MA: MIT Press.
Hirsh‐Pasek, K., & Golinkoff, R. M. (Eds.). (2006). Action meets word: How children learn verbs. New York, NY: Oxford University Press.
Hirsh‐Pasek, K., Adamson, L. B., Bakeman, R., Owen, M. T., Golinkoff, R. M., Pace, A., Yust, P. K. S., & Suma, K. (2015). Quality of early communication matters more than quantity of word input for low-income children’s language success. Psychological Science, 26, 1071–1083.
Hoff, E. (2013). Interpreting the early language trajectories of children from language minority homes: Implications for closing achievement gaps. Developmental Psychology, 49, 4–14.
Hollich, G. J., Hirsh‐Pasek, K., & Golinkoff, R. M. (with Hennon, E., Chung, H. L., Rocroi, C., Brand, R. J., & Brown, E.). (2000). Breaking the language barrier: An emergentist coalition model for the origins of word learning. Monographs of the Society for Research in Child Development, 65(3, Serial No. 262).
Horowitz, F. D. (1975). Visual attention, auditory stimulation, and language discrimination in young infants. Monographs of the Society for Research in Child Development, 39, 1–140.
Jolly, H., & Plunkett, K. (2008). Inflectional bootstrapping in 2-year-olds. Language and Speech, 51, 45–59.
Jusczyk, P. W., & Aslin, R. N. (1995). Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29, 1–23.
Kemler Nelson, D., Jusczyk, P. W., Mandel, D. R., Myers, J., Turk, A. E., & Gerken, L. (1995). The headturn preference procedure for testing auditory perception. Infant Behavior & Development, 18, 111–116.
Ma, W., Golinkoff, R. M., Houston, D., & Hirsh‐Pasek, K. (2011). Word learning in infant- and adult-directed speech. Language Learning and Development, 7, 209–225.
Maguire, M., Hirsh‐Pasek, K., Golinkoff, R. M., & Brandone, A. (2008). Focusing on the relation: Fewer exemplars facilitate children’s initial verb learning and extension. Developmental Science, 11, 628–634.
Mandel, D. R., Jusczyk, P. W., & Pisoni, D. B. (1995). Infants’ recognition of the sound patterns of their own names. Psychological Science, 6, 314–317.
Markman, E. M., & Wachtel, G. F. (1988). Children’s use of mutual exclusivity to constrain the meaning of words. Cognitive Psychology, 20, 121–157.
Miller, G. A. (1965). Some preliminaries to psycholinguistics. American Psychologist, 20, 15–20. http://dx.doi.org/10.1037
Pace, A., Morini, G., Luo, Golinkoff, R. M., de Villiers, J., Hirsh‐Pasek, K., Iglesias, A., & Wilson, M. (in preparation). The QUILS: An interactive language screener for children 3 through 5 bears on fundamental questions in language development.
Pruden, S. M., Hirsh‐Pasek, K., Golinkoff, R. M., & Hennon, E. A. (2006). The birth of words: Ten-month-olds learn words through perceptual salience. Child Development, 77, 266–280.
Pruden, S. M., Shallcross, W. L., Hirsh‐Pasek, K., & Golinkoff, R. M. (2008). Foundations of verb learning: Comparison helps infants abstract event components. In H. Chan, H. Jacob, & E. Kapia (Eds.), Proceedings of the 32nd Annual Boston University Conference on Language Development (pp. 402–414). Somerville, MA: Cascadilla Press.
Roseberry, S., Hirsh‐Pasek, K., Parish‐Morris, J., & Golinkoff, R. M. (2009). Live action: Can young children learn verbs from video? Child Development, 80, 1360–1375.
Schafer, G., & Plunkett, K. (1998). Rapid word learning by fifteen-month-olds under tightly controlled conditions. Child Development, 69, 309–320.
Song, L., Pruden, S., Golinkoff, R. M., & Hirsh‐Pasek, K. (2016). Prelinguistic foundations of verb learning: Infants discriminate and categorize dynamic human actions. Journal of Experimental Child Psychology, 151, 77–95.
Spelke, E. S. (1979). Perceiving bimodally specified events in infancy. Developmental Psychology, 15, 626–636.
Thiessen, E. D., Hill, E. A., & Saffran, J. R. (2005). Infant-directed speech facilitates word segmentation. Infancy, 7, 53–71.
Tincoff, R., & Jusczyk, P. W. (1999). Some beginnings of word comprehension in 6-month-olds. Psychological Science, 10, 172–175.
White, K. S., & Morgan, J. L. (2008). Sub-segmental detail in early lexical representations. Journal of Memory and Language, 59, 114–132.

Further Reading

Fernald, A., & Weisleder, A. (2011). Early language experience is vital to developing fluency in understanding. In S. Neuman & D. Dickinson (Eds.), Handbook of early literacy research (Vol. 3, pp. 3–20). New York, NY: Guilford.
Swingley, D. (2012). The looking-while-listening procedure. In E. Hoff (Ed.), Research methods in child language: A practical guide (pp. 29–42). UK: Blackwell.

3 Assessing Receptive and Expressive Vocabulary in Child Language

Virginia A. Marchman and Philip S. Dale

Abstract

In this chapter, we focus on a core component of language structure, receptive and expressive vocabulary, which can be examined with a wide range of methods. We first review some general issues in the study of early vocabulary, and then discuss three general types of methods that are appropriate for use with young children: language sampling, parent report, and direct assessment. The goals of the chapter are to overview the strengths and limitations of each method and to provide a ‘consumer guide’ for their use.

Introduction

Unlike other chapters in this volume, which are focused on a specific research method, the present chapter is defined by a core component of language structure, namely vocabulary, which can be examined with a wide range of methods. In fact, it is not unusual for multiple methods to be used within the same study. In this chapter we examine the strengths and limitations of three general types of methods, language sampling, parent report, and direct assessment, with the goal of providing a “consumer guide” for their use.


The Purposes of Studying and Assessing Vocabulary

Vocabulary, or lexicon, is a core component of language. Although even smaller units, the morphemes, can carry meaning (compare dog with dogs), in early child language words are typically monomorphemic, so we are in practice examining the smallest units of meaning. Vocabulary is also large; estimates vary greatly, but are typically in the range of tens of thousands. Thus acquiring a vocabulary is not only essential, but challenging. These facts motivate a range of research questions and designs, most of which can be classified as follows:

Vocabulary as an Object of Study in its Own Right

Which words do children learn first? How many words do children know at various points in development? Are there similarities among children learning the same language, or even across languages? Are there regular patterns of individual differences in vocabulary, either quantitatively (how many words) or qualitatively (which words)? Can vocabulary be thought of as a single broad category, or are there important subtypes with a different developmental history? Are there theoretically significant linkages between words, such that some words are learned systematically earlier (or later) than other words? An example of this category of research is Bates et al. (1994), who classified early vocabulary into nominals (names for things), predicates (verbs and adjectives), and closed-class words (prepositions, articles, question words, etc.). They mapped the changing composition of vocabulary from 8 to 30 months, and also described regular patterns of variation across children.

Vocabulary as Antecedent

The emergence of words in production is one of the very first aspects of language to be directly observable. Does either the quantity of words or the composition of early vocabulary predict later language, literacy, cognitive, or academic measures? An example of this category of research is Lee (2011), who examined the correlation of early vocabulary with later language and literacy. In fact, a recent large-scale study documented the broader importance of oral language skills, finding that early vocabulary is a significant predictor not only of children’s later reading, but also of mathematics achievement and “non-academic” skills, such as behavior regulation and externalizing behaviors (Morgan, Farkas, Hillemeier, Hammer, & Maczuga, 2015). It should be kept in mind, however, that even if vocabulary does predict later skills, this is only correlational evidence; the earlier and later measures might be causally related, but alternatively they might be independent consequences of underlying differences in the child or of stable characteristics of the environment. An important subset of this research is focused on the early identification of language impairment on the basis of vocabulary measures, and on how to increase the validity of such identification (Rescorla & Dale, 2013).

Vocabulary as Consequent

Probably the largest subcategory of research that includes vocabulary measures uses those measures to investigate the effects on vocabulary of various genetic (e.g., normal variation, specific genetic and chromosomal abnormalities), physiological (e.g., preterm birth, hearing impairment), naturally occurring environmental (e.g., prenatal drug exposure, parental language input, bilingual input), and intervention factors (e.g., specific language intervention practices such as sentence recasts, broader parent-training and preschool programs). An example of this category of research is Feldman et al. (2003), who correlated the amount of time with otitis media (middle-ear infection) with vocabulary at 3 years of age. Different purposes call for different tools. Our goal here is to provide information to help researchers select the most appropriate technique (or techniques) for examining vocabulary.

What Does it Mean to Know a Word?

Knowing a word includes knowledge of its pronunciation sufficient to recognize the word and also to produce it with sufficient accuracy that it can be recognized by others. It also includes knowledge of its syntactic category (part of speech) and of how it can be combined with other words to communicate sentence meaning. For the purposes of the present chapter, however, we focus on aspects of word knowledge that concern meaning. In addition, due to limitations of length, we also exclude those aspects of vocabulary learning which are related to a child’s skill in accessing a word’s meaning or identifying its referent in real time (Fernald, Perfors, & Marchman, 2006). Vocabulary meaning can appear deceptively simple. To determine if a child knows a word receptively, we can say the word, for example, dog, and ask the child to point to the picture of a dog from a set of, say, four pictures. For productive knowledge, we can wait to hear the word, or we can simply show the child a picture of a dog and ask what it is. An easy way to see how much this misses is to imagine trying to teach someone new to English the following seven words: dog, green, love, some, in, can, the. The concept of “pointing to the world,” which seems to work for dog, becomes less and less useful as you work down the list. Different grammatical categories seem to have meaning in different ways. Even for specific, individual words, there are multiple levels of knowledge. For most words, understanding appears to occur prior to production. Politeness forms such as thank you (Berko Gleason, Perlmann, & Grief, 1984) and color words (Rice, 1984) seem to be the most common exceptions. The gap between comprehension and production can vary depending on the word, and the overall gap can vary across children (Bates, Dale, & Thal, 1995). Independent of modality (production versus comprehension) are distinctions of meaning emerging from research on the philosophy of language (Carroll, 2008). One such distinction is reference (or extension), which refers to knowledge of which entities, actions, or qualities in the world a word can apply to, versus meaning (or intension), which is the concept of the word. The distinction can be seen at a phrase level by comparing the present queen of England with Elizabeth II. These two phrases have the same reference (they point to the same person) but different meaning, and therefore it is informative to say “the present queen of England is Elizabeth II.” Meaning is closely related to definition, although one can know the meaning of a word without necessarily being able to give a formal definition. Judgment of synonyms is often taken as a measure of knowledge of meaning which does not require formal definition. Most words, including the majority of common nouns, verbs, and adjectives, refer to categories, not just single examples. Cognitive psychological research has demonstrated that most common natural categories (unlike scientifically defined ones) are organized something like a normal curve, with a best example (often called a prototype), examples which clearly belong to the category but are not as good, and borderline examples (Rosch, Mervis, Gray, Johnson, & Boyes‐Braem, 1976). For the natural category of bird, robin is a prototype, turkey a less good but clearly valid example, and penguin closer to borderline.
Words are often applied first to prototypical examples, and the category word can be applied more quickly to prototypes than to less central examples, among other psychological differences. The relevant point here is that being able to apply a word to a prototype example does not mean that the learner has a broader category at all, or, if there is a category, that its boundary is in the right place. Both reference and meaning are aspects of the denotation of words, the objective or dictionary definition. Words also may carry connotation, aspects of meaning which are suggested by a word though they are not part of the definition. Connotations often have an evaluative or other emotional sense. Compare frugal with cheap, and strong-willed with stubborn; in each pair the denotations are very similar, but the connotations are quite different. Most common words of English and other languages have multiple meanings, for example, watch, right, and long. Assessment of meaning is usually focused on one specific meaning; less is known about the acquisition of multiple meanings for individual words. What we do know is that children in preschool can typically demonstrate knowledge of two or more meanings (e.g., watch as an action and watch as a device for keeping time) by identifying the correct meaning in specific contexts, without having conscious awareness of the ambiguity. Conscious awareness of ambiguity reflects metalinguistic awareness, which typically emerges around the beginning of school and is the basis of much verbal humor. We have distinguished these aspects of meaning (and there are many others, such as relationships among words: synonymy, antonyms, superordinates (Carroll, 2008)) to make clear that methods for assessing vocabulary knowledge vary in which aspects of meaning are addressed.

Some Core Issues in Vocabulary Assessment

We reserve the term “vocabulary assessment” for procedures that attempt to evaluate a learner’s lexicon quantitatively, as opposed to methods that are more focused on qualitative dimensions, such as specific components of meaning and relationships (networks) among words. The following distinctions should be kept in mind when considering assessment procedures:

(a) Is the goal to determine the overall size of vocabulary, or to obtain information about vocabulary composition? The latter is likely to require a longer procedure and careful design of lists. Composition has been most often analyzed with respect to syntactic or semantic subcategories of words, such as the contrast between nominals, predicates, and closed-class words discussed above (Bates et al., 1994). A different analysis scheme, more often applied to parental input than to children directly, is common versus rare (or diverse) words (Weizman & Snow, 2001). Providing a richer set of words has been proposed to be particularly relevant for future academic growth.
(b) Is the meaning of a word being assembled “on the spot” or was it pre-existing? In English, monomorphemic root words (dog) are outnumbered by inflected words (sleeping), derived words (unhappy, stillness), and compounds (blackboard, smartphone). Using a variety of measures including definition, Anglin (1993) estimated the number of dictionary entry words known at ages 6, 8, and 10 years to be approximately 10,000, 20,000, and 40,000 words, respectively. A coding of the actual test responses suggested that, instead of having been learned previously, nearly half their estimated recognition vocabulary may have been “morphologically solved” on the spot. Development in this stage is very different, then, from the earliest vocabulary, which is primarily composed of monomorphemic words.
(c) Although it is common to assess children with developmental difficulties relative to population-wide norms, this amounts to the assumption that these children are developing in the same way as typically developing children, just more slowly. That assumption may be valid for some clinical categories, but not for others. For example, children with Autism Spectrum Disorders learn some words earlier than expected based on their overall vocabulary size, and some words later than expected (Lazenby et al., 2015). This consideration is especially relevant for methods which assess only a relatively small number of words and attempt to extrapolate from that small set to estimate total vocabulary size.
(d) Many children (perhaps the majority of the world’s children, and an ever-growing proportion of children in the United States and other developed nations) grow up learning two languages (Hoff, 2013). An assessment of vocabulary in only one language is clearly inadequate. But how is information about vocabulary from the two languages to be combined? One proposal is simply to add the two vocabularies (Total Vocabulary, or TV); another is to count words in different languages with the same or very similar meaning, for example, English cat and Spanish gato, as a single item (Total Conceptual Vocabulary, or TCV; Pearson & Fernández, 1994). Both methods give results suggesting that bilingual children’s development of expressive vocabulary is comparable to that of monolingual children, although TV generally shows more comparable scores than TCV (Core, Hoff, Rumiche, & Señor, 2013). A toy sketch of the two scoring schemes appears after this list.
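As promised above, here is a toy sketch of TV versus TCV scoring; the word lists and the word-to-concept mapping are invented for illustration and are not drawn from any published norm.

```python
# Total Vocabulary (TV) vs. Total Conceptual Vocabulary (TCV) for a
# hypothetical Spanish-English bilingual toddler.
english = {"cat", "dog", "milk", "shoe"}
spanish = {"gato", "leche", "agua"}

# Language-neutral concept labels for each word (illustrative only).
concept = {"cat": "CAT", "gato": "CAT", "dog": "DOG", "milk": "MILK",
           "leche": "MILK", "shoe": "SHOE", "agua": "WATER"}

tv = len(english) + len(spanish)                    # 7: every word counts
tcv = len({concept[w] for w in english | spanish})  # 5: CAT and MILK once each
print(tv, tcv)
```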

We now review three general categories of vocabulary assessment methods: Language Sampling, Parent Report, and Direct Assessments. For easy reference, Table 3.1 lists the instruments and tools discussed in each section.

Table 3.1 Overview of instruments/analysis tools for studying vocabulary development in children.

Language Sampling
  Child Language Data Exchange System (CHILDES): http://childes.psy.cmu.edu (pp. 48–49)
  Systematic Analysis of Language Transcripts (SALT): http://www.saltsoftware.com (p. 48)
  EUDICO Linguistic Annotator (ELAN): https://tla.mpi.nl/tools/tla‐tools/elan/ (p. 48)
  Language Environment Analysis (LENA)™: http://www.lenafoundation.org (pp. 46–47)

Parent Report
  MacArthur‐Bates Communicative Development Inventory: http://mb‐cdi.stanford.edu (pp. 50–57)
  Language Development Survey (LDS): Rescorla (1989) (p. 52)
  Developmental Vocabulary Assessment for Parents (DVAP): Libertus, Odic, Feigenson, & Halberda (2015) (p. 51)
  Cross‐Linguistic Lexical Norms (CLEX): http://www.cdi‐clex.org/ (p. 55)
  Wordbank: http://wordbank.stanford.edu (p. 55)

Direct Assessments
  Peabody Picture Vocabulary Test, 4th Edition (PPVT‐4): http://www.pearsonclinical.com/language/products/100000501/peabody‐picture‐vocabulary‐test‐fourth‐edition‐ppvt‐4.html (p. 58)
  Receptive/Expressive One Word Vocabulary Tests (ROWPVT/EOWPVT): http://www.proedinc.com/customer/productView.aspx?ID=2166 (p. 58)
  Expressive Vocabulary Test, 2nd Edition (EVT‐2): http://www.pearsonclinical.com/language/products/100000416/expressive‐vocabulary‐test‐second‐edition‐evt‐2.html (p. 58)
  Computerized Comprehension Task (CCT): Friend & Keplinger (2003) (p. 59)
  NIH Picture Vocabulary Test (NPVT): http://www.nihtoolbox.org/WhatAndWhy/Cognition/Language/Pages/NIH‐Toolbox‐Picture‐Vocabulary‐Test.aspx (p. 59)
  Quick Interactive Language Screener (QUILS): Brookes Publishing (p. 59)

Language Sampling

Assumptions and Rationale

Observing what children say when interacting with others is a classic way of determining children’s vocabulary knowledge. Indeed, the use of language diaries goes back to the earliest studies, in which parents documented the language production of their own children (e.g., Darwin, 1877), and several prominent modern studies have provided remarkably detailed pictures of language development over time (e.g., Dromi, 1987). The advent of audio‐ and video‐recording technology greatly facilitated this process, allowing the observer to gain a permanent record of the child’s language and accompanying behavior, rather than relying on fleeting memories and distilled observational notes. This strategy has been applied in several important studies, such as the classic longitudinal study of Adam, Eve, and Sarah by Roger Brown (Brown, 1973), and larger studies of more diverse samples (Hart & Risley, 1995; Pan, Rowe, Singer, & Snow, 2005).

Observing children interacting with their parents or an experimenter/clinician is considered an ecologically valid way to assess child vocabulary because the settings are child‐friendly and involve child‐centered activities. This technique can be used with a broad age range of children; however, the child should be old enough to engage in a play activity and, ideally, to produce some spontaneous language (e.g., older than 1½ years). After early elementary school ages, the method is less appropriate because contexts vary in their “pull” for more advanced language, and play may be less effective than other contexts.

Language samples are often viewed as an unbiased way to assess a child’s vocabulary, especially for children from diverse populations or who are learning multiple languages, and especially when conducted with a caregiver or other familiar adult (Craig & Washington, 2000). However, it should be kept in mind that some toys or activities may be more or less typical of the kinds of toys or activities children engage in on a regular basis.
Although language samples are generally not appropriate for studying vocabulary comprehension, a thorough language sample records not only what words or sentences the child produces, but also what the child is doing (e.g., eye gaze), how the child responds to the language of others, and the frequency and nature of the language that the child hears from caregivers (e.g., quantity and quality of input), all of which shed light on comprehension.

Apparatus

Compared to other methods, the technical requirements for a language sample are simple, typically consisting of a small audio recording device and/or video camera, a tripod, and a high‐quality microphone. Video‐recordings allow the researcher to capture not only what is being said, but also non‐verbal interactions and the objects or events to which the speakers may be referring. The lighting should be sufficient, and the camera should allow access to the details of the activities but not be so close that the child or caregiver may veer out of the frame. A camera placed behind a one‐way mirror can reduce its influence on the activities. When video‐recording, do not underestimate the importance of a high‐quality audio‐recording device, and choose a camera that can accommodate an external microphone. An additional wireless microphone placed on the caregiver will typically provide sufficiently clear audio for both caregiver and child. Recommendations for recording equipment are available on the TalkBank website (http://talkbank.org/info/dv.html).

Using only audio‐recording may be appropriate for some questions. Audio recording equipment is less intrusive than video, and in some cases the parent can operate the device themselves. For example, in Hoff‐Ginsberg (1991), families were provided with a recording device and asked to record when they were engaged in different activities (e.g., mealtimes, dressing), providing a broader sample of contexts than is typically available in the laboratory or when an experimenter is present. A popular new audio‐recording technology, LENA™ (Oller et al., 2010; http://www.lenafoundation.org), consists of a digital recorder in the chest pocket of specialized clothing worn by the child. This device enables unobtrusive recordings of up to 16 hours of speech, as well as the child’s own vocalizations. The automated speech‐recognition software provides an estimate of the number of child vocalizations, the number of words used in proximity to the child, and the number of “conversational turns” in which the child engages with caregivers. It does not identify the actual words, however. To facilitate interpretation of the audio‐recordings at a later date, caregivers can log the locations in which the recording was conducted, who was present, and the main activities (Weisleder & Fernald, 2013).

Nature of the Data/Collecting Data

Beyond equipment, there are many factors to consider, such as where the recordings will be made (home versus laboratory), in what kind of activities the child and the interlocutor will engage (unstructured versus semi‐structured), how many people (and who) will be present, and how long to record. For longitudinal studies, one must further determine how often to record and over what time period; for example, it is often appropriate to record more often at earlier ages because of rapid development.

Recording sessions in a laboratory or clinic with a standard set of child‐friendly toys allow full control over the lighting, sound, and other factors that affect the quality of the recording (e.g., ambient noise) or the nature of the interactions being observed (e.g., other children or activities that could distract the child). One disadvantage is that although the environment is child‐friendly and supportive, the context and particular toys are nevertheless unfamiliar. Some children may be shy and require considerable time to “warm up,” and the toys or activities may be more familiar to children from some backgrounds than others. One might therefore decide to record in the more familiar context of the child’s own home, where the child may be more comfortable and need less time to “warm up.” Note, however, that there will be less control of the environment (e.g., the presence of television), and standardizing the procedures may be more difficult.

A language sample reflects the child’s language in the context of a particular set of activities and a particular caregiver‐child pair. For example, book reading with a caregiver will elicit a very different sample of the child’s language than free play with a school bus or tea party set (Hoff‐Ginsberg, 1991). Since caregivers vary in their skill in eliciting speech from young children, one may choose to have a researcher, rather than the caregiver, engage the child; this will also standardize interlocutors across a study. Training should cover ways to make the child and the caregiver feel comfortable, ways to structure the toys/activities, and techniques to “draw out” a child using humor or surprise.

For many questions, a standard set of toys is provided to structure the activity, typically chosen to encourage communication between the child and interlocutor around a joint activity or pretend play. These activities tend to elicit names of objects, commands/requests, or answers to questions. Common examples are a farmhouse and animals, a tea set with plates/utensils, or a bedroom set with a teddy bear. Activities that encourage more physical play (balls, bubbles), while excellent for engaging children of different ability levels, may be less optimal for obtaining a sample of the child’s vocabulary. For older children, free play with toys may be less effective at eliciting language than having the child tell a personal narrative or story, describe a set of pictures, or recall a past event (Southwood & Russell, 2004).

In either a home or laboratory recording, a common instruction for parents is to play “as they normally do,” with no additional instructions. This may produce individual variability in parental behavior, or it may be less meaningful to some parents. Some studies standardize the play interactions by giving parents “bags” of toys and asking them to play with the toys in each “bag” in a given order (e.g., Hirsh‐Pasek et al., 2015). Other studies observe children engaging in activities that occur naturally at home but are not practical in most laboratory settings, such as meal or bath time routines (Hoff‐Ginsberg, 1991). One advantage of the LENA™ all‐day recordings is that samples of speech are captured in various activities (e.g., mealtimes) without “pre‐staging” the context.

Typically, language sample recordings last 10‐30 minutes, although some researchers record up to 90 minutes (e.g., Pan et al., 2005). For clinical purposes, it is generally recommended that the sample consist of at least 50 child utterances (Miller, 1981). Transcription and analyses are time‐consuming, and therefore the length of the language sample, and the corresponding transcription, will be determined by the specific research question and resource availability. For some research or clinical purposes, a single language sample is sufficient. However, if the goal is to examine trajectories of language development, it is necessary to obtain multiple language samples over time. For detailed examination of the emergence of specific words or structures, some researchers use “dense sampling,” creating dense databases (DDBs) in which speech samples are collected at a much higher rate, for example, 5 hours per week at multiple time points over several years (e.g., Maslen, Theakston, Lieven, & Tomasello, 2004).

Since language samples are the mainstay of child language research, many systems have been developed to standardize the transcription process, including Systematic Analysis of Language Transcripts (SALT, Miller, 2012; http://www.saltsoftware.com/), EUDICO Linguistic Annotator (ELAN, http://www.lat‐mpi.eu/tools/tools/elan), and Codes for the Human Analysis of Transcripts (CHAT), which is part of the Child Language Data Exchange System (CHILDES, MacWhinney, 2000; http://childes.psy.cmu.edu). Table 3.2 provides an excerpt from a caregiver‐child interaction transcribed using CHAT (from MacWhinney, 2000).

Table 3.2 Example transcript from CHILDES (from MacWhinney, 2000).

@Begin
@Languages: eng
@Participants: CHI Ross Child, FAT Brian Father
@ID: eng|macwhinney|CHI|2;10.10||||Target_Child|||
@ID: eng|macwhinney|FAT|35;2.||||Target_Child|||
*ROS: why isn’t Mommy coming?
%com: Mother usually picks Ross up around 4 PM.
*FAT: don’t worry.
*FAT: she’ll be here soon.
*CHI: good.
@End

These systems employ user interfaces that facilitate data entry and that connect easily with the analysis tools. Nevertheless, transcribing children’s speech is very difficult, since young children may speak softly and have immature phonological systems. Transcribing a language sample is time‐consuming, taking 8‐10 hours per hour of recording, depending on the level of detail desired. Most studies of lexical development transcribe at the level of the word; however, it may be more appropriate to transcribe exactly what is said, rather than simply writing down the closest target.

Computer‐based analysis systems (e.g., the Computerized Language Analysis (CLAN) system in CHILDES) provide many different measures of vocabulary production, including the number of words (tokens), the number of different words (types), the number of words per utterance or conversation unit (e.g., mean length of utterance), and others. Computing and interpreting these variables again requires many decisions about the relevant units of analysis (e.g., words versus morphemes) and what constitutes an utterance. The reader is encouraged to follow the guidelines provided in the sources describing these coding systems. These systems also provide tools to generate counts of various aspects of the transcripts, for example, the number of utterances, the type‐token ratio of words (TTR), mean length of utterance (MLU), and the rate of utterances per minute. It should be noted that TTR, which is often used as a measure of vocabulary diversity in a language sample and hence an index of vocabulary size, is substantially affected by the size of the sample used. An alternative measure, VOCD, which is available in the CHILDES system, is much less affected by sample size and is preferable when samples vary in size (MacWhinney, 2000).
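The following Python sketch illustrates, in toy form, how token counts, type counts, TTR, and a word‐based MLU can be computed for one speaker from a CHAT‐style transcript such as the one in Table 3.2. It is a simplified illustration, not the CLAN implementation; CLAN’s actual rules for tokenization and for words versus morphemes are far more detailed.

```python
# Minimal sketch (not the CLAN software): simple lexical measures for
# one speaker from a CHAT-style transcript such as Table 3.2.
import re

transcript = """\
*ROS: why isn't Mommy coming?
%com: Mother usually picks Ross up around 4 PM.
*FAT: don't worry.
*FAT: she'll be here soon.
*CHI: good.
"""

def utterances(chat_text, speaker):
    # Keep only the main tiers (*XXX:) of the chosen speaker;
    # dependent tiers (%com, %mor, ...) are skipped.
    prefix = f"*{speaker}:"
    return [line.split(":", 1)[1].strip()
            for line in chat_text.splitlines()
            if line.startswith(prefix)]

def measures(utts):
    # Crude apostrophe-aware tokenization; real transcription systems
    # apply far more detailed rules (e.g., words versus morphemes).
    words = [w for u in utts for w in re.findall(r"[a-z']+", u.lower())]
    tokens, types = len(words), len(set(words))
    return {
        "tokens": tokens,
        "types": types,
        "TTR": types / tokens if tokens else 0.0,      # type-token ratio
        "MLU_words": tokens / len(utts) if utts else 0.0,
    }

print(measures(utterances(transcript, "FAT")))
# {'tokens': 6, 'types': 6, 'TTR': 1.0, 'MLU_words': 3.0}
```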

An Exemplary Study

The enormous contribution of the language sample methodology to the study of early vocabulary development is exemplified by CHILDES, a component of the larger TalkBank project (http://www.talkbank.org). Not only does CHILDES provide a framework for applying a standardized set of transcription and analysis tools to any video‐ or audio‐taped language interaction, but members of the CHILDES consortium also contribute their language samples by sharing them publicly with the child language community. This open‐source project is one of the first of its kind, conceived in the early 1980s, when computing technology and the infrastructure for handling large databases were in their infancy. At the time of this writing, the CHILDES database has grown to include several hundred language samples in English, including many classic samples, such as those of Adam, Eve, and Sarah from Brown (1973). The database has also grown to include dozens of other languages, as well as samples from children learning in bilingual contexts and from several clinical populations. Researchers should carefully review the conditions under which each language sample was collected, and the characteristics of the population, before analyzing the archived data. As described above, the system also provides a suite of tools that enable automated analyses (http://childes.psy.cmu.edu), such as word and utterance counts and MLU. In addition to providing access to the archived data, MacWhinney (2000) offers extensive guidelines that researchers can follow for transcription, which then allow them to analyze their own data using the tools available in the system. The system is also widely used in teaching and clinical training contexts. The CHILDES project has been the inspiration for several other data‐sharing projects, including Wordbank for parent report data (http://wordbank.stanford.edu) and HomeBank for daylong home recordings using the LENA system (http://homebank.talkbank.org).

Problems and Pitfalls

Transcription of child language is a time‐consuming task that requires attention to detail, a good ear, and, in many cases, considerable training in phonetic/phonemic analysis. Naturalistic observation also carries the risk of underestimating the child’s knowledge, owing to the lack of opportunity for production of some words. One must also keep in mind that the individual words that are produced are especially subject to a frequency bias, in that high‐frequency words are more likely to occur in a sample than low‐frequency words. It is also the case that some children may be inhibited during interactions with an unfamiliar adult. If other conversational partners are being recorded (e.g., a parent), they too may be inhibited by the knowledge that they are being recorded. Naturalistic interactions in the home, or recordings using less intrusive strategies, may to some extent alleviate these issues.

Assessment of Vocabulary by Parent Report

Assumptions and Rationale

As discussed elsewhere in this chapter, technologically complex methods have been developed for the study of language development, including computerized analysis of language samples and digital audio recorders that allow day‐long naturalistic recording. In contrast, this section is focused on the revival and improvement of a very old and “low tech” approach, one that is not only practical and cost‐effective but, for certain purposes, simply better than the alternatives. It is parent report: the systematic utilization of the extensive experience of parents (and potentially other caregivers) with their children (Dale, 1996). Professionals concerned with the assessment of individual children’s development have also relied on parent report, especially for purposes of initial screening. Motivation to use parent report in the United States was greatly increased by the Amendments to the Education for All Handicapped Children Act of 1986 (P.L. 99‐457), which mandated increased parental involvement in the development of programs for young children.

However, there has been a reluctance to use parental report as the primary basis for assessment. Most parents do not have specialized training in language development and may not be sensitive to subtle aspects of language structure and use. Furthermore, a natural pride in the child and a failure to critically test their impressions may cause parents to overestimate the child’s ability; conversely, frustration in the case of delayed language may lead to underestimates. In recent decades, however, carefully designed parent report has been shown to provide reliable and valid information on vocabulary and other components of language.

Parent report has a number of inherent advantages over the major alternative assessment methods. These include freedom from the need for compliance by the child (as in structured testing) and from the time and sophisticated training required to analyze language samples. Most important is the fact that parent report is based on experiences with the child that are not only more extensive than any researcher or clinician can obtain, but also more representative of the child’s ability. Parents have experience with children at play, at meals, at bath‐ and bedtime, at tantrums—in short, with the full range of the child’s life and therefore with the full range of language structures used in these contexts. They have also had opportunities to hear the child interact with other people: the other parent, grandparents, siblings, and friends.

Because parent report represents an aggregation over much time and many situations, it is less influenced by factors that affect performance, such as shyness, or that affect sampling, such as word frequency. As Bates, Bretherton, and Snyder (1988, p. 57) point out, “parental report is likely to reflect what a child knows, whereas [a sample of] free speech reflects those forms that she is more likely to use.”

Another important advantage of parent report is that it makes possible the collection of data from far larger samples of children than would be possible with tests or naturalistic observation. Information from more adequate samples, especially in the form of norms, can benefit both clinical practice and research. Fenson et al. (2007), for example, used the norming data from the MacArthur‐Bates Communicative Development Inventories (CDI)—a sample of 2,550 children aged 8 to 30 months—to address questions about variability in communicative development. Large samples are especially needed to provide an accurate statistical description of extreme scores; that is, what score corresponds to the 10th percentile? Research on questions such as environmental influences on language development can also benefit from large samples. Correlational research is hampered by the problem of multicollinearity: predictor variables such as parental education, number of books in the home, family size, and use of questions versus imperatives are likely to be intercorrelated, making it difficult to separate the effects of each of them individually. Large samples in which there is a substantial amount of non‐overlapping variance are essential for addressing these questions.

Clearly there are legitimate concerns about the ability of parents to provide detailed and specific knowledge about their children’s language. However, many of the reservations that have been expressed may have more to do with how parental experience is accessed than with the validity of that perspective in general. Parent report is most likely to be accurate under three general conditions (Bates et al., 1988):

1. when assessment is limited to current behaviors
2. when assessment is focused on emergent behaviors
3. when a primarily recognition format is used

Each of these conditions acts to reduce demands on the respondent’s memory. The first condition reflects the fact that parents are better able to report on their child’s language at present than at times past. The second reflects the fact that parents are better able to report on, for example, the animal names in their child’s vocabulary at the age at which the child is actively learning new animal words. In typically developing samples, parents can track their child’s receptive vocabulary to about 16‐18 months, after which it becomes too large to monitor; expressive vocabulary can be monitored until about 2½‐3 years, after which it, too, becomes too large. The recognition strategy capitalizes on the greater ease of recognition as contrasted with recall. That is, it is better to ask parents to report on their child’s vocabulary by selecting words from a comprehensive list than to have them write down all the words they can recall hearing their child use (or, even worse, to ask the global question “Does your child know at least 50 words?”). A promising extension of the parent report method to older children (2 to 7 years) has been developed, the Developmental Vocabulary Assessment for Parents (DVAP; Libertus et al., 2015), showing good validity and reliability.

Apparatus and Instruments

In principle, assessment of vocabulary by parent report requires the least supporting material of any method: simply the printed form. As discussed below, it is sometimes appropriate to have a trained interviewer administer the form for parents with low literacy. Online and other electronic administration methods are also emerging to facilitate efficiency, but these do not affect the basic method (Kristoffersen et al., 2013). The core ‘work’ of this form of assessment has been done in the process of developing and norming the form, and in particular in identifying a list of words that includes the great majority of words learned by young children. (There will inevitably be words learned by individual children that are not on the list, reflecting individual differences in environment and child interest.)

At present there are two major parent report measures of early language for English (both have been adapted for numerous other languages). Rescorla’s Language Development Survey (LDS; Rescorla, 1989) was originally designed as a brief expressive language‐screening instrument for children between 12 and 24 months, though it has more recently been normed for a wider age range. It contains a 310‐word expressive vocabulary checklist, along with a section requesting that the parent write out three of the child’s longest recent sentences or phrases. The LDS demonstrates excellent reliability, including internal consistency, as well as validity as a screening device (Rescorla, Ratner, Jusczyk, & Jusczyk, 2005).

The most fully developed set of parent report measures for language are the MacArthur‐Bates (originally MacArthur) Communicative Development Inventories (CDIs; Fenson et al., 2007; http://mb‐cdi.stanford.edu). The CDIs are designed to measure vocabulary across the full range of ability levels, as well as additional dimensions of communicative development. The CDI:Words & Gestures (CDI:WG) was designed for typically developing children between 8 and 18 months. On the 396‐item vocabulary checklist, the parent is asked to indicate whether the child “understands” or “understands and says” each word. The CDI:Words & Sentences (CDI:WS) was designed for typically developing children between 16 and 30 months. For each word on the 680‐item checklist, the parent is asked to indicate whether the child says (and understands) the word. Both measures have been used with somewhat older children with a variety of developmental delays.

In addition to these long forms, there are short‐form instruments at each of these two developmental levels (Fenson et al., 2000). Each includes roughly 100 vocabulary items that have been shown to predict long‐form vocabulary scores with impressive accuracy. The short‐form version of the CDI:WS also asks whether the child is combining words. Finally, the CDI‐III, designed for children between 30 and 37 months, includes a 100‐item vocabulary checklist appropriate to that developmental level. These short forms are useful for comparing individual children with the population overall, but cannot provide information on vocabulary composition or on aspects of language other than vocabulary, such as gestures and grammar. The choice between long and short forms must therefore be made carefully, in light of the goals of the research or clinical work.

Both the LDS and the CDIs have been adapted to numerous other languages, although not all projects have advanced to actual collection of norming data (http://mb‐cdi.stanford.edu/adaptations.html).
It is essential that these be adaptations, not translations, of the original American English instruments, reflecting the linguistic and cultural contexts that influence the early acquisition of vocabulary and other aspects of language (see http://mb‐cdi.stanford.edu/AdaptationsInformation2015.pdf and AdaptationsNotTranslations2015.pdf). Owing to variation in language structure and also to the interests of the developers, these instruments vary somewhat in structure. However, adaptations of the CDI:WG generally include gestures as well as vocabulary comprehension and production, while adaptations of the CDI:WS generally include vocabulary production and some measure of morphology and combinatorial syntax. These adaptations are valuable both for the study of monolingual development and for research on bilingualism. Furthermore, they make it possible to obtain some information on a bilingual child’s first language when no other method is available, for example, for a child of a Turkish‐speaking immigrant family in the United States. Users should consult the manuals for those particular instruments for development and normative information.

Collecting Data

Generally speaking, one should follow the age and procedure guidelines of the developers of the instruments. Note that the forms can be used with children older than the specified age ranges, as long as the children are likely to score within the expected developmental levels, for example, children with developmental delays or children who are learning more than one language.

Ideally, the CDI should be completed by one or more caregivers who are in the best position to judge the child’s vocabulary abilities in a particular language. Although the typical respondent is the child’s mother, in some circumstances another caregiver (father, grandmother) is the more appropriate choice if that individual is the child’s primary caregiver. In other cases, one parent may have access to only a portion of the situations in which the child’s language abilities are demonstrated, for example, when the child is attending a day‐care center. In this situation, one might choose to ask multiple caregivers who are familiar with the child to complete the form (e.g., mother and grandmother; parent and teacher) (De Houwer, Bornstein, & Leach, 2005). We suggest tracking single versus multiple reporters by asking parents to indicate on the front cover of the Inventory which individuals contributed the information. The issue of which and how many caregivers should be involved in completing the form is especially critical in bilingual or multilingual situations, in which a single caregiver may not be able to provide a comprehensive assessment of the child’s abilities in both (or all) of their languages.

A complete account of children’s early vocabulary knowledge is available only when all of the languages they are learning are assessed. For example, in the case of children learning English and Spanish, it is recommended to administer both the English and the Spanish CDIs, each completed by one or more caregivers who are familiar with the child’s ability in that language. The scores from both forms can be combined to reflect Total Vocabulary (i.e., all words produced in either language) or Total Conceptual Vocabulary (i.e., all of the concepts for which the child has a word in one or both languages). The choice of scoring will depend on the user’s goal and research or clinical question (Core et al., 2013). Total conceptual scoring is also available for children learning both English and French (Marchman & Friend, 2013, 2014).

While some users ask parents to complete the form during an experimental session (i.e., in the “waiting room”), this administration procedure may result in parents feeling rushed or distracted, yielding less reliable estimates (Jackson‐Maldonado, Thal, Marchman, Bates, & Gutierrez‐Clellen, 1993). For some populations with low literacy, and most easily with short forms, the CDIs can be administered orally in a face‐to‐face interview format (Alcock et al., 2014). In most cases, it is recommended that parents take the form home and complete it at their leisure. It can also be suggested that parents fill out the questionnaire during a quiet time away from the child, for example, during naptime. One should remind parents that they do not need to complete the form in a single sitting; they can return to the form as often as they like.

While parent report instruments are intended to be easy to administer, an additional cover sheet summarizing the instructions may be helpful (see Appendix 3.1). These written instructions are all that many caregivers will need; however, it is recommended that the instructions also be explained verbally. In particular, it is of the utmost importance that parents not attempt to test the child’s ability to imitate a word or gesture (e.g., Billy, can you say “banana”?). Rather, they should mark only words or gestures they have heard the child use spontaneously, without a direct model. This idea is sometimes difficult to get across, and so it may be helpful to explain verbally that parents should mark words their child says “on their own, not just when they repeat back what you say.” Asking parents to complete the form when the child is not with them (e.g., when the child is sleeping) is an excellent way to eliminate this source of error. Parents should also be reminded to give their child credit for a word even if it is pronounced in a child‐like way (e.g., “banky” for blanket). In addition, some families may use a different variant of a word than the one on the form because of dialect or regional or personal preferences (e.g., “nana” for grandmother, “lorry” for truck); these are acceptable substitutes for the items listed on the form. Researchers and clinicians working with particular populations should be familiar with possible variants of the words on the checklist and highlight those for parents when appropriate.

It is strongly recommended that the examiner confirm the child’s date of birth and the date the form was completed. Note that different countries may use different date conventions, for example, MM/DD/YY versus DD/MM/YY, a source of critical errors when working internationally (a simple safeguard is sketched below). Upon receipt from the caregiver, the examiner should check that the form was filled out completely and that no pages were skipped or left blank. The reader is referred to Chapters 2 and 3 in Fenson et al. (2007) for more suggestions.
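Because a birth date recorded as 03/04 can mean March 4 or April 3 depending on the convention, one simple safeguard, sketched below, is to store dates in an unambiguous format and compute age programmatically. This is an illustrative suggestion, not a procedure from the CDI manuals.

```python
# Illustrative safeguard (not a procedure from the CDI manuals):
# store dates in unambiguous ISO format (YYYY-MM-DD) and compute
# the child's age in completed months programmatically.
from datetime import date

def age_in_months(birth: date, completed: date) -> int:
    months = (completed.year - birth.year) * 12 \
           + (completed.month - birth.month)
    if completed.day < birth.day:
        months -= 1  # the current month is not yet completed
    return months

# A child born 2015-03-20, form completed 2017-01-15:
print(age_in_months(date(2015, 3, 20), date(2017, 1, 15)))  # 21
```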

Nature of the Data/Scoring

Obtaining raw vocabulary scores for the CDIs is straightforward: counting the words marked “understands and says” yields a production score, while adding the words marked “understands and says” and those marked “understands” yields a comprehension vocabulary raw score (see the sketch at the end of this section). These raw scores can be converted to percentiles using tables in the manual. Percentiles are provided for boys and girls separately, or combined, and may be applied depending on the user’s preference. Because raw vocabulary scores are often quite skewed, the developers prefer the use of percentiles rather than standard scores, which assume normality, especially for clinical work. For research purposes, however, various transformations that produce normally distributed derived scores have sometimes been used. Because the long‐form CDI vocabulary checklists are relatively comprehensive, the raw scores have an inherent “criterion‐referenced” meaning, as estimates of total vocabulary, while the percentiles from the manual have “norm‐referenced” meaning. Early vocabulary assessed by parent report with long‐form checklists is nearly unique in producing both kinds of measures, which greatly increases its uses (see the exemplary study below).

While the tallying of responses is relatively straightforward, counting items and looking up the corresponding percentiles can be a time‐consuming and error‐prone process. Depending on the number of forms involved, one could consider using the CDI Scoring Program (http://mb‐cdi.stanford.edu/scoring_db.htm). The program provides a template for hand entry of responses at the item or section level; it then tallies the scores and looks up the percentile in the appropriate table. In addition, the program generates a summary report for sharing with a parent and enables exporting the item, summary, and percentile scores in tabular format that can then be imported into another program for data analysis. The CDI Scoring Program also links item‐level responses across forms for a single child with an available English and Spanish or French CDI, automatically computing the number of items indicated in English only, in Spanish/French only, and in both English and Spanish/French, yielding TV and TCV, defined earlier. For more information on scoring, see Chapter 2 in Fenson et al. (2007).

Following in the spirit of CHILDES for language sample sharing, researchers have developed systems for compiling parent report data, in particular the MacArthur‐Bates CDIs, across research laboratories and languages. One effort of this sort was the Cross‐Linguistic Lexical Norms site (CLEX; http://www.cdi‐clex.org/; Jørgensen, Dale, Bleses, & Fenson, 2010), which archives normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups. Like its predecessor (Dale & Fenson, 1996), this system allows users to query the number of children who are reported to understand or produce a word or set of words at a given age. More recently, a new system, Wordbank (Frank, Braginsky, Yurovsky, & Marchman, 2016; http://wordbank.stanford.edu), has been developed that also compiles CDIs from multiple research groups. Wordbank builds directly on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. Wordbank’s additional goal is to extend beyond the norming data of individual CDIs by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. The resulting datasets have the potential to be considerably larger and more representative than the norming datasets taken individually. While a novel and useful resource for many applications, it is not recommended that Wordbank‐generated statistics be used for research or clinical purposes in which the goal is to evaluate children’s performance against an established normative standard. For those applications, users should refer to the norms and guidelines published in the manuals for the relevant languages.
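As a concrete illustration of the raw‐score rules described at the start of this section, the short Python sketch below tallies production and comprehension scores from item‐level CDI:WG‐style responses. It is a toy example, not the CDI Scoring Program, and the item set is invented; converting the resulting raw scores to percentiles still requires the normed tables in the manual.

```python
# Toy illustration of CDI raw scoring (not the CDI Scoring Program).
# Item responses follow the CDI:Words & Gestures options: None
# (not marked), "understands", or "understands and says".
responses = {   # invented items for illustration
    "dog": "understands and says",
    "cat": "understands",
    "ball": "understands and says",
    "shoe": None,
}

production = sum(r == "understands and says" for r in responses.values())
comprehension = sum(r in ("understands", "understands and says")
                    for r in responses.values())

print(production, comprehension)  # 2 3
# Percentile conversion requires the normed tables in the manual
# (Fenson et al., 2007); it is not reproduced here.
```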

An Exemplary Study

Because vocabulary is a core component of all languages, it is well suited for cross‐linguistic research. A good example is Bleses et al. (2008), who compiled data from 18 languages and dialects for which CDI:WG and CDI:WS norming had been completed: Basque, Chinese‐Mandarin, Croatian, Danish, Dutch, English‐US, English‐British (Hamilton et al.), English‐British (Klee), Finnish, French, Galician, German, Hebrew, Icelandic, Italian, Spanish‐European, Spanish‐Mexican, and Swedish. Many conclusions hold across the range of languages studied, including very great variability in rate of development, a positive acceleration in the second year, a partial dissociation between comprehension and production, gestural communication related more strongly to receptive than to productive vocabulary, and vocabulary closely related to grammatical development (Bates & Goodman, 1997). Differences also occur, such as in the balance of nominals to non‐nominals, which is related to grammatical structure.

Bleses et al. focused on rate of development; they observed that early receptive vocabulary growth was slower in Danish than in any of the other languages studied, from 12 months on. This pattern was not observed for productive vocabulary. Interestingly, it has long been observed anecdotally, and is now confirmed by empirical research, that Danish is the most difficult Scandinavian language for other adult Scandinavians to understand, despite the close typological relationship among these languages. Indeed, Norwegian and Danish are nearly identical lexically and grammatically. Danish is characterized by some highly distinctive phonological reduction processes, which greatly reduce the frequency of obstruents and more generally lead to “an indistinct syllable structure which in turn results in blurred vowel‐consonant, syllable and word boundaries. In particular, word endings are often indistinctly pronounced…” (Bleses et al., 2008, p. 623). In other analyses, the authors were able to provide evidence against the alternative view that Danish parents are simply more reluctant to respond “yes”: there were no differences on either gestures or word production. They conclude that the phonological structure of Danish poses an initial obstacle to breaking into the stream of speech, although the children do eventually catch up. More generally, this study underscores the importance of sound for language and language acquisition (see Bleses, Basbøll, Lum, & Vach, 2011, for an interesting follow‐up study).

Challenges and Related Issues

One important consideration is the validity of parent report for the assessment of comprehension rather than production. Since correlations between CDI:WG vocabulary comprehension and structured tests of receptive language are of the same order of magnitude as those seen for CDI:WG and CDI:WS productive vocabulary, the CDI:WG receptive vocabulary measure appears to be quite valid. However, this may not be true at the youngest ages. As Tomasello and Mervis (1994) point out, for the youngest children, especially 8‐10 month olds, the vocabulary comprehension scores are surprisingly high and likely implausible. They suggest that this is due to a lack of clarity in the term “understands” for parents of children at this young age, and that caution be used in interpreting such comprehension scores in any absolute way. Additional explanation of the term “understands” should be given.

A second issue concerns clinical versus research applications. In many respects, clinical validity is a more stringent requirement than research validity, as decisions are being made about individual children rather than a group. Another difference is that most research is interested in variability across the full range of scores, whereas clinical applications are focused primarily on the low end.

Parent report measures in particular are likely to be used for screening for language delay. One common convention (originally suggested by Rescorla, 1989) is to refer a child for further assessment and possible intervention if the parent reports fewer than 50 words (approximately the lowest 5%) or no word combinations by 24 months (approximately the lowest 14%). This criterion is tied to a specific age; for utility at other ages, a criterion of the lowest 10% is often suggested (Rescorla & Dale, 2013). To evaluate clinical validity, measures of diagnostic accuracy, such as sensitivity and specificity, or likelihood ratios (Dollaghan, 2007), are more appropriate than correlations across the full range (see the worked example at the end of this section). It should also be noted that a substantial proportion of children who are late talkers will spontaneously catch up. Thus the low predictive validity seen over that period (Dollaghan, 2013) is a genuine research finding, not a limitation of the method (Fenson et al., 2000). The CDIs can also be used with somewhat older children with language delay. As long as the child’s score does not exceed the median level for 30 month olds, a “language age” can be derived using the 50th percentile row of the existing tables; scores above this level cannot be interpreted with confidence. Finally, detailed examination of responses may help therapists design intervention programs for individual children, and the CDIs may be used as one evaluation measure for intervention effects.

A third issue is related to, but distinct from, the one just discussed. For either research or clinical purposes, how valid are parent report measures such as the CDI for specific clinical populations? The question is not one of identification (whether the child falls below a predefined criterion) but of simple assessment of the degree of impairment, and of potential improvement after intervention. The validity studies reviewed in Fenson et al. (2007) suggest that validity correlations are at least as high, if not higher (perhaps because of greater variability within the clinical population), for late talking children, children post‐cochlear implant, and children with Down Syndrome or Autism Spectrum Disorder. Clearly, many clinical populations have not yet been studied in this way, but the available evidence overall is very encouraging.

Finally, are parent report measures valid for children from lower‐SES backgrounds, in particular from families with lower levels of education or literacy? In an important early study, Arriaga, Fenson, Cronan, and Pethick (1998) compared reports of vocabulary by lower‐ and higher‐SES families, showing that the more disadvantaged children scored consistently lower on nearly all of the major vocabulary and grammar scales of the CDIs. These differences in scores could reflect valid delays in children’s language development that parallel those obtained with other methods, such as naturalistic observation or standardized tests (e.g., Hammer, Farkas, & Maczuga, 2010). But it is also possible that lower scores might be attributable to parental misjudgment (Roberts, Burchinal, & Durham, 1999). Several studies have questioned whether parent report tools are valid in families from diverse populations, in light of the fact that children from lower‐SES families actually score better than their higher‐SES counterparts on those portions of the CDIs that require judgments of comprehension in younger children (Feldman et al., 2000).
Later studies have shown that, for children over 2 years, patterns of validity are consistent across lower‐ and higher‐SES groups (Feldman et al., 2005; Reese & Read, 2000). The reader is advised to follow the guidelines above to ensure that parents understand the instructions and complete the form appropriately.
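As a worked example of the diagnostic‐accuracy measures mentioned above, the sketch below computes sensitivity, specificity, and a positive likelihood ratio for a hypothetical screening cutoff. All counts are invented for illustration; they are not results from any of the studies cited.

```python
# Worked example of diagnostic accuracy for a screening cutoff.
# The 2x2 counts below are invented for illustration only.
true_pos, false_neg = 18, 6     # delayed children: flagged / missed
false_pos, true_neg = 12, 164   # typical children: flagged / passed

sensitivity = true_pos / (true_pos + false_neg)   # 18/24   = 0.75
specificity = true_neg / (true_neg + false_pos)   # 164/176 ~ 0.93
positive_lr = sensitivity / (1 - specificity)     # ~ 11.0

print(sensitivity, specificity, positive_lr)
```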

Direct Assessment

Assumptions and Rationale

A third way to assess vocabulary is to use standardized or researcher‐designed experimental tasks, which allow the researcher to test the child’s knowledge directly in a controlled context. There are many standardized tests available to assess children’s vocabulary knowledge directly, asking the child to identify a picture of a named object (for comprehension) or to name an object or picture when asked (e.g., “What’s this called?”; for production). Commercially available standardized assessments have the advantage of being normed on a population according to clearly defined characteristics, such as ethnicity, geographic location, or family educational level. In addition, some standardized assessments of vocabulary comprehension have been co‐normed with tests of vocabulary production, for example, the Peabody Picture Vocabulary Test (PPVT; Dunn & Dunn, 2012) and the Expressive Vocabulary Test (EVT; Williams, 1997), or the Receptive and Expressive One Word Vocabulary Tests (ROWPVT/EOWPVT; Martin & Brownell, 2011). Such co‐norming allows direct comparison of receptive and expressive vocabulary skills (“profile analysis”) within a single child to evaluate differences between the two measures. While some tests are specifically designed to assess only vocabulary (e.g., PPVT), others assess vocabulary production or comprehension as a subscale of a larger battery (e.g., the Clinical Evaluation of Language Fundamentals, CELF; Secord, Semel, & Wiig, 2003).

Standardized assessments provide targeted ways to assess the child’s vocabulary knowledge following standard procedures and using a common set of objects or pictures. Many of these assessments require that the child generate a response, such as pointing to a picture (comprehension) or naming a picture or object or using a word in a sentence (production). While generally straightforward to administer and score, these tasks may require responses that impose significant demands on young children. Even at older ages, some children may be more comfortable than others interacting with an unfamiliar adult and being asked questions to which the answer is apparently already known (i.e., test questions).

For some research questions, it may be appropriate for a researcher to assess vocabulary using a task or items designed specifically to address the particular question of interest. For example, one may be interested in whether a child knows the names for particular objects (e.g., animals) or attributes (e.g., colors, sizes). Such tasks do not assess the size of the child’s vocabulary, but rather the child’s depth of knowledge of particular classes of words or the relations among different kinds of words. Experimenter‐designed tasks are often modeled after the procedures used in standardized tests, for example, asking children to pick out an appropriate picture or to name an object. In designing an assessment protocol, it is critical to ensure that the tasks chosen are appropriate to the age and developmental level of the target participants.

It was noted earlier that language samples are biased in that they are especially sensitive to high‐frequency words, which are primarily but not exclusively closed‐class words such as auxiliary verbs, prepositions, and articles. Direct assessment of vocabulary using picture‐pointing responses for comprehension and picture‐ or object‐naming for production has a complementary bias.
The methodology is especially suitable for concrete nouns, action verbs, and adjectives that describe perceptual qualities such as size, shape, and color. Dale (1991) found that vocabulary measures derived from a language sample and from a direct assessment accounted for partially independent variance in a parent report measure (the CDI) that included both kinds of words.

An alternative method for direct assessment of vocabulary is to ask the child to define a word, for example, “What does it mean to imitate?” or “What is an envelope?” This method has been used in a number of intelligence tests, such as the WISC‐III (Wechsler, 1991). An advantage of this method is that it widens the range of words that can be assessed beyond the directly concrete, although closed‐class words are still generally not suited to it. However, it requires considerable training and skill in scoring the responses. In addition, because defining words requires substantial expressive skill as well as metalinguistic awareness, this method cannot be described as assessing vocabulary comprehension specifically. It is not used extensively in psycholinguistic research.

Apparatus and Instruments

Standardized tests require the particular test materials as provided by the manufacturer, typically consisting of a test booklet or real objects (e.g., a doll) or toys (e.g., blocks). Testing typically occurs with the child and experimenter situated face‐to‐face across a table (for very young children, sometimes next to each other). While many tests can be scored by the experimenter in real time (i.e., during testing), the sessions can be audio‐ or, preferably, video‐taped and later checked.

Experimenter‐designed tasks require similar materials, but the researcher is free to develop those materials to suit their individual purposes. Researchers are now beginning to take advantage of electronic platforms that allow the creation of customized instruments and that also facilitate administration and scoring (e.g., Frank, Sugarman, Horowitz, Lewis, & Yurovsky, 2016). For example, the Computerized Comprehension Task (CCT), modeled after the PPVT (Dunn & Dunn, 2012), adapts the standard picture‐pointing paradigm to touch screen technology (Friend & Keplinger, 2003) to assess children’s vocabulary comprehension using haptic (pointing) responses. Touch screen technology has also been applied in the adaptation of the Picture Vocabulary Test available from the NIH Toolbox (http://www.nihtoolbox.org/WhatAndWhy/Cognition/Language/Pages/NIH‐Toolbox‐Picture‐Vocabulary‐Test.aspx) and in the recently developed Quick Interactive Language Screener (QUILS).

Collecting Data

Direct assessments of all kinds require an appropriate, child‐friendly setting in which the child and the experimenter can interact without distraction. The area should be spatially and visually separate from alternative activities. Some children may readily engage in the tasks, whereas others may require more encouragement or persuasion. For all types of direct assessments, a visual schedule is sometimes helpful to keep a child on task, for example, a piece of paper with a graphic image of each “game” that the child will play and in what order; the child can then place a sticker next to each task upon completion. It can be challenging for the experimenter to adhere to the task protocols while managing a young child’s behavior or attention. It is also critical that the experimenter resist providing cues to the correct answers with eye gaze or body movements or by shaping the child’s behavior in any way. Several practice sessions and pilot participants should be run before data collection begins, especially when the experimenter is new to the materials and when the target population includes younger, preschool‐aged children or children with attentional limitations.

The manuals for standardized assessments provide useful information about administration and scoring, as well as guidelines on how to convert raw scores to scaled or standardized scores based on the child’s age (illustrated schematically in the sketch below). Users should also ensure that they are following the recommended guidelines regarding positioning of the experimenter and child, as well as the use of prompts and corrective feedback. As with the parent report methodology, there are parallel versions of some assessments in English and Spanish that can be combined using a type of conceptual scoring with children learning both languages (Gross et al., 2014). Users of experimenter‐developed protocols should develop their own set of guidelines prior to beginning data collection.
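To illustrate the kind of age‐based conversion described above, here is a toy Python sketch that maps a raw score to a standard score with mean 100 and SD 15, using hypothetical age‐band norms. Real tests publish normed lookup tables in their manuals, derived from the norming sample, rather than applying a formula of this kind directly; the age bands and raw‐score statistics below are invented.

```python
# Toy sketch of converting a raw score to a standard score
# (mean 100, SD 15) from hypothetical age-band norms. Real tests
# publish normed lookup tables in their manuals; the bands and
# raw-score statistics here are invented.
norms = {  # (low, high) age in months -> (mean, SD) of raw scores
    (36, 41): (62.0, 11.0),
    (42, 47): (71.0, 12.0),
}

def standard_score(raw: float, age_months: int) -> float:
    for (lo, hi), (mean, sd) in norms.items():
        if lo <= age_months <= hi:
            return 100 + 15 * (raw - mean) / sd
    raise ValueError("no norms for this age band")

print(round(standard_score(75, 38)))  # 118
```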

An Exemplary Study

Standardized measures of vocabulary are a frequent choice in many large‐scale, nationally representative studies, for example, the Children of the National Longitudinal Study of Youth (Farkas & Beron, 2004). The major reason for this choice is that their psychometric properties and reliability/validity estimates are well documented, making the measures suitable for complex statistical methods, such as structural equation modeling. Moreover, since most direct assessments are appropriate only for children who are preschool‐age or older, they are sometimes applied in combination with the other methods described here, such as parent report or language sampling, thereby enabling researchers to track vocabulary development longitudinally from toddlerhood into school age.

An example is a recent study by Rowe and colleagues (Rowe, Raudenbush, & Goldin‐Meadow, 2012) in which children’s vocabulary growth was tracked longitudinally between 14 and 46 months based on words produced in a language sample, and their receptive vocabulary outcomes were then assessed using a standardized assessment (PPVT‐III, Dunn & Dunn, 1997) at 54 months. The results indicated that children’s growth in vocabulary predicted later vocabulary scores on the direct assessment, especially for children from low‐SES backgrounds. Such findings suggest that there is continuity in oral vocabulary knowledge across the first several years of life, and that interventions that accelerate early vocabulary growth may have the potential to improve children’s oral vocabulary outcomes at school entry. For a similar study with English/Spanish bilingual children, see Hoff, Rumiche, Burridge, Ribot, and Welsh (2014), in which expressive vocabulary from 18 to 30 months was assessed using the English and Spanish versions of the CDIs, and expressive vocabulary at 48 months was assessed using the English and Spanish versions of the EOWPVT (Brownell, 2001; Martin & Brownell, 2011).

Problems and Pitfalls/Advantages

Direct assessments of vocabulary involve some time commitment (e.g., 20‐30 minutes) and require that the experimenters have some level of training in engaging with children. Most tasks also require that the examiner be adept at carefully adhering to a protocol while effectively engaging the child in the target activities and managing the child’s behavior to keep the child on task. In general, direct assessments involve an active response on the part of the child, for example, a verbal response or a point to a picture, which is likely to be more difficult for young children or for children who are not used to engaging in this way with an unfamiliar adult. Therefore, such assessments are most often used successfully with children who have some familiarity with the context and who are older than 3 years, when they are more likely to follow verbal prompts and comply with examiner instructions. Users of standardized direct assessments should be especially cautious when applying normative scores to children from diverse populations and to children who are learning a language other than English at home (Bedore & Pena, 2008). Some instruments have analogous versions in English and Spanish (e.g., the PPVT and TVIP); however, comparing across these instruments is not straightforward, since the items and norming populations can differ.

Conclusions

The study of vocabulary development is fortunate in having multiple methods available to the researcher, each providing a wealth of information regarding this important domain of children’s language development. Of course, each method also has key limitations, for example, constraints on the ages of the children for which the method is appropriate, the requirements on the researcher in the data collection or analysis process, and the particular aspects of vocabulary that are examined. Thus, both the formulation of the research question and the interpretation of results require the researcher to consider carefully the choice of method or methods, and how that choice might affect the results obtained. It is also the case that a number of important aspects of vocabulary meaning discussed early in this chapter are currently difficult to capture using these methods; these include category boundaries, multiple meanings for words, word connotation, and nonliteral semantics, such as idioms and sarcasm. These represent key challenges for future research.

Key Terms

Direct assessment Assessment of a child’s expressive or receptive language through a structured interaction between the child and a clinician or researcher.
Experimenter‐developed assessment A subtype of direct assessment in which a child’s expressive or receptive language is assessed using a protocol of interaction and scoring that has been developed by the researchers for the purpose of a specific project or research program. Typically, such instruments are narrowly focused and do not have normative data.
Language sampling Assessment of a child’s expressive language through observation, recording, and analysis of minimally structured interaction between the child and a parent, clinician, or researcher. Language samples vary in length, but are based on continuous observation episodes.
Nominals, predicates, closed‐class words A widely used set of broad word categories used in evaluating the composition of early vocabulary, both within and across languages. Nominals are typically defined as common nouns, excluding games and routines, names for people, and locations; predicates as main verbs and adjectives, excluding demonstrative and pronominal adjectives; and closed‐class words as pronouns, prepositions, question words, quantifiers, articles, auxiliaries, and connectives.
Parent report Assessment of a child’s expressive and/or receptive language through completion of a structured questionnaire by a parent or other knowledgeable person, primarily utilizing a recognition format.
Reference and meaning The referent of a word is the category of objects, events, persons, or qualities to which it applies; meaning is the concept expressed by the word.
Standardized tests A subtype of direct assessment in which a child’s expressive or receptive language is assessed using a conventional protocol of interaction and scoring, which is generally available, often commercially, and for which normative data are available that make it possible to evaluate the child relative to the population.
Total conceptual vocabulary The total set of words that are expressed (or understood) across two or more languages, modified by the principle that when the same, or a very similar, meaning is expressed by a word in both languages, it is counted only once.
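The last definition can be made concrete with a small Python sketch. The word lists and concept labels below are invented; the only assumption is that translation equivalents can be mapped to a shared, language‐neutral concept label, which is also the core of the conceptual scoring used with bilingual children (Gross et al., 2014).

# Total vs. total conceptual vocabulary for a hypothetical bilingual child.
# Each checked word is mapped to an invented language-neutral concept label;
# translation equivalents ("dog"/"perro") share a label.
english = {"dog": "DOG", "milk": "MILK", "ball": "BALL"}
spanish = {"perro": "DOG", "leche": "MILK", "zapato": "SHOE"}

total_vocabulary = len(english) + len(spanish)  # every word counts: 6
conceptual_vocabulary = len(set(english.values()) | set(spanish.values()))

print(total_vocabulary)       # 6
print(conceptual_vocabulary)  # 4: DOG and MILK are counted only once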

References

Alcock, K. J., Rimba, K., Holding, P., Kitsao‐Wekulo, A., Abubakar, A., & Newton, C. R. J. C. (2014). Developmental inventories using illiterate parents as informants: Communicative Development Inventory (CDI) adaptation for two Kenyan languages. Journal of Child Language. http://doi.org/10.1017/S0305000914000403
Anglin, J. M. (1993). Knowing versus learning words. Monographs of the Society for Research in Child Development, 58, 176–186.
Arriaga, R. I., Fenson, L., Cronan, T., & Pethick, S. J. (1998). Scores on the MacArthur Communicative Development Inventory of children from low and middle‐income families. Applied Psycholinguistics, 19, 209. http://doi.org/10.1017/S0142716400010043
Bates, E., Bretherton, I., & Snyder, L. (1988). From first words to grammar: Individual differences and dissociable mechanisms. Cambridge, MA: Cambridge University Press.
Bates, E., Dale, P. S., & Thal, D. J. (1995). Individual differences and their implications for theories of language development. In P. Fletcher & B. MacWhinney (Eds.), Handbook of Child Language (pp. 96–151). Oxford, UK: Basil Blackwell.
Bates, E., & Goodman, J. C. (1997). On the inseparability of grammar and the lexicon: Evidence from acquisition, aphasia and real‐time processing. Language and Cognitive Processes, 12, 507–584. http://doi.org/10.1080/016909697386628

Bates, E., Marchman, V. A., Thal, D. J., Fenson, L., Dale, P. S., Reznick, J. S., … Hartung, J. (1994). Developmental and stylistic variation in the composition of early vocabulary. Journal of Child Language, 21, 85–123.
Bedore, L. M., & Pena, E. D. (2008). Assessment of bilingual children for identification of language impairment: Current findings and implications for practice. International Journal of Bilingual Education and Bilingualism, 11, 1–29. http://doi.org/10.2167/beb392.0
Berko Gleason, J., Perlmann, R., & Grief, E. (1984). What’s the magic word: Learning language through politeness routines. Discourse Processes, 7, 493–502.
Bleses, D., Basbøll, H., Lum, J., & Vach, W. (2011). Phonology and lexicon in a cross‐linguistic perspective: The importance of phonetics—A commentary on Stoel‐Gammon’s “Relationships between lexical and phonological development in young children.” Journal of Child Language, 38, 61–68. http://doi.org/10.1017/s0305000910000437
Bleses, D., Vach, W., Slott, M., Wehberg, S., Thomsen, P., Madsen, T. O., & Basbøll, H. (2008). Early vocabulary development in Danish and other languages: A CDI‐based comparison. Journal of Child Language, 35, 619–650. http://doi.org/10.1017/S0305000908008714
Brown, R. (1973). A first language: The early stages. Cambridge, MA: Harvard University Press.
Brownell, R. (2001). Expressive One Word Vocabulary Test: English‐Spanish bilingual version. Novato, CA: Academic Therapy Publications.
Carroll, D. W. (2008). Psychology of Language. Belmont, CA: Wadsworth.
Core, C., Hoff, E., Rumiche, R., & Señor, M. (2013). Total and conceptual vocabulary in Spanish–English bilinguals from 22 to 30 months: Implications for assessment. Journal of Speech, Language, and Hearing Research, 56, 1637–1649. http://doi.org/10.1044/1092‐4388(2013/11‐0044)
Craig, H. K., & Washington, J. A. (2000). An assessment battery for identifying language impairment in African American children. Journal of Speech, Language, and Hearing Research, 43, 366–379.
Dale, P. S. (1991). The validity of a parent report measure of vocabulary and syntax at 24 months. Journal of Speech and Hearing Research, 34, 565–571. http://doi.org/10.1016/0165‐5876(92)90087‐6
Dale, P. S. (1996). Parent report assessment of language and communication. In K. Cole, P. S. Dale, & D. J. Thal (Eds.), Assessment of Communication and Language (pp. 161–182). Baltimore, MD: Brookes Publishing Co.
Dale, P. S., & Fenson, L. (1996). Lexical development norms for young children. Behavior Research Methods, Instruments, & Computers, 28, 125–127. http://doi.org/10.3758/BF03203646
Darwin, C. (1877). A biographical sketch of an infant. Mind: A Quarterly Review of Psychology and Philosophy, 2, 285–294.
De Houwer, A., Bornstein, M. H., & Leach, D. B. (2005). Assessing early communicative ability: A cross‐reporter cumulative score for the MacArthur CDI. Journal of Child Language, 32, 735–758. http://doi.org/10.1017/S0305000905007026
Dollaghan, C. A. (2007). The handbook for evidence‐based practice in communication disorders. Baltimore, MD: Brookes Publishing Co.
Dollaghan, C. A. (2013). Late Talkers as a clinical category: A critical evaluation. In L. Rescorla & P. S. Dale (Eds.), Late Talkers: Language development, assessment, intervention (pp. 91–112). Baltimore, MD: Brookes Publishing Co.
Dromi, E. (1987). Early lexical development. Cambridge: Cambridge University Press.
Dunn, L. M., & Dunn, D. M. (1997). The Peabody Picture Vocabulary Test‐III (3rd Edition). Johannesburg: Pearson Education Inc.
Dunn, L. M., & Dunn, D. M. (2012). Peabody Picture Vocabulary Test (PPVT™‐4) (4th Edition). Johannesburg: Pearson Education Inc.
Farkas, G., & Beron, K. (2004). The detailed age trajectory of oral vocabulary knowledge: Differences by class and race. Social Science Research, 33, 464–497. http://doi.org/10.1016/j.ssresearch.2003.08.001

Feldman, H. M., Campbell, T. F., Kurs‐Lasky, M., Rockette, H. E., Dale, P. S., Colborn, D. K., & Paradise, J. L. (2005). Concurrent and predictive validity of parent reports of child language at ages 2 and 3 years. Child Development, 76, 856–868.
Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Colborn, D. K., Janosky, J., Kurs‐Lasky, M., … Paradise, J. L. (2003). Parent‐reported language skills in relation to Otitis Media during the first 3 years of life. Journal of Speech, Language & Hearing Research, 46, 273–287.
Feldman, H. M., Dollaghan, C. A., Campbell, T. F., Kurs‐Lasky, M., Janosky, J. E., & Paradise, J. L. (2000). Measurement properties of the MacArthur communicative development inventories at ages one and two years. Child Development, 71, 310–322. http://doi.org/10.1111/1467‐8624.00146
Fenson, L., Bates, E., Dale, P. S., Goodman, J. C., Reznick, J. S., & Thal, D. J. (2000). Measuring variability in early child language: Don’t shoot the messenger. Child Development, 71, 323–328.
Fenson, L., Marchman, V. A., Thal, D. J., Dale, P. S., Reznick, J. S., & Bates, E. (2007). MacArthur‐Bates Communicative Development Inventories: User’s guide and technical manual (2nd Edition). Baltimore, MD: Brookes Publishing Co.
Fenson, L., Pethick, S. J., Renda, C., Cox, J. L., Dale, P. S., & Reznick, J. S. (2000). Short‐form versions of the MacArthur Communicative Development Inventories. Applied Psycholinguistics, 21, 95–115. http://doi.org/10.1017/S0142716400001053
Fernald, A., Perfors, A., & Marchman, V. A. (2006). Picking up speed in understanding: Speech processing efficiency and vocabulary growth across the 2nd year. Developmental Psychology, 42, 98–116.
Frank, M. C., Braginsky, M., Yurovsky, D., & Marchman, V. A. (2016). Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, (May), 1–18. http://doi.org/10.1017/S0305000916000209
Frank, M. C., Sugarman, E., Horowitz, A. C., Lewis, M. L., & Yurovsky, D. (2016). Using tablets to collect data from young children. Journal of Cognition and Development, 17, 1–17. http://doi.org/10.1017/CBO9781107415324.004
Friend, M., & Keplinger, M. (2003). An infant‐based assessment of early lexicon acquisition. Behavior Research Methods, Instruments, & Computers, 35, 302–309. http://doi.org/10.3758/BF03202556
Gross, M., Buac, M., & Kaushanskaya, M. (2014). Conceptual scoring of receptive and expressive vocabulary measures in simultaneous and sequential bilingual children. American Journal of Speech‐Language Pathology, 23, 574–586. http://doi.org/10.1044/2014
Hammer, C. S., Farkas, G., & Maczuga, S. (2010). The language and literacy development of Head Start children: A study using the Family and Child Experiences Survey database. Language, Speech, and Hearing Services in Schools, 41, 70–83. http://doi.org/10.1044/0161‐1461(2009/08‐0050)
Hart, B., & Risley, T. R. (1995). Meaningful differences in the everyday experience of young American children. Baltimore, MD: Brookes Publishing Co.
Hirsh‐Pasek, K., Adamson, L. B., Bakeman, R., Owen, M. T., Golinkoff, R. M., Pace, A., … Suma, K. (2015). The contribution of early communication quality to low‐income children’s language success. Psychological Science. http://doi.org/10.1177/0956797615581493
Hoff, E. (2003). The specificity of environmental influence: Socioeconomic status affects early vocabulary development via maternal speech. Child Development, 74, 1368–1378. http://doi.org/10.1111/1467‐8624.00612
Hoff, E. (2012). Interpreting the early language trajectories of children from low‐SES and language minority homes: Implications for closing achievement gaps. Developmental Psychology, 46, 899–909. http://doi.org/10.1037/a0027238
Hoff, E., Rumiche, R., Burridge, A., Ribot, K. M., & Welsh, S. N. (2014). Expressive vocabulary development in children from bilingual and monolingual homes: A longitudinal study from two to four years. Early Childhood Research Quarterly, 29, 433–444.

Hoff‐Ginsberg, E. (1991). Mother‐child conversation in different social classes and communicative settings. Child Development, 62, 782–796.
Jackson‐Maldonado, D., Thal, D. J., Marchman, V. A., Bates, E., & Gutierrez‐Clellen, V. (1993). Early lexical development in Spanish‐speaking infants and toddlers. Journal of Child Language, 20, 523–549.
Jørgensen, R. N., Dale, P. S., Bleses, D., & Fenson, L. (2010). CLEX: A cross‐linguistic lexical norms database. Journal of Child Language, 37, 419–428. http://doi.org/10.1017/S0305000909009544
Kristoffersen, K. E., Simonsen, H. G., Bleses, D., Wehberg, S., Jørgensen, R. N., Eiesland, E. A., & Henriksen, L. Y. (2013). The use of the Internet in collecting CDI data – an example from Norway. Journal of Child Language, 40, 567–585. http://doi.org/10.1017/S0305000912000153
Lazenby, D. C., Sideridis, G. D., Huntington, N., Prante, M., Dale, P. S., Curtin, S., … Tager‐Flusberg, H. (2015). Language differences at 12 months in infants who develop Autism Spectrum Disorder. Journal of Autism and Developmental Disorders. http://doi.org/10.1007/s10803‐015‐2632‐1
Lee, J. (2011). Size matters: Early vocabulary as a predictor of language and literacy competence. Applied Psycholinguistics, 32, 69–92. http://doi.org/10.1017/S0142716410000299
Libertus, M. E., Odic, D., Feigenson, L., & Halberda, J. (2015). A Developmental Vocabulary Assessment for Parents (DVAP): Validating parental report of vocabulary size in 2‐7‐year‐old children. Journal of Cognition and Development, 16, 442–454. http://doi.org/10.1080/15248372.2013.835312
MacWhinney, B. (2000). The CHILDES project. Mahwah, NJ: Lawrence Erlbaum Associates.
Marchman, V. A., & Friend, M. (2013). MacArthur Communicative Development Inventories scoring program for Canadian French and French‐English bilinguals.
Marchman, V. A., & Friend, M. (2014). MacArthur Communicative Development Inventories scoring program for European French and French‐English bilinguals.
Marchman, V. A., & Martínez‐Sussmann, C. (2002). Concurrent validity of caregiver/parent report measures of language for children who are learning both English and Spanish. Journal of Speech, Language, and Hearing Research, 45, 983–997. http://doi.org/10.1044/1092‐4388(2002/080)
Martin, N. A., & Brownell, R. (2011). Expressive One‐word Picture Vocabulary Test‐4. Austin, TX: Pro Ed, Inc.
Maslen, R. J. C., Theakston, A. L., Lieven, E. V. M., & Tomasello, M. (2004). A dense corpus study of past tense and plural overregularization in English. Journal of Speech, Language, and Hearing Research, 47, 1319–1333. http://doi.org/10.1044/1092‐4388(2004/099)
Miller, J. F. (1981). Assessing language production in children: Experimental procedures. University Park Press.
Miller, J. F. (2012). Systematic Analysis of Language Transcripts (Version 2012).
Morgan, P. L., Farkas, G., Hillemeier, M. M., Hammer, C. S., & Maczuga, S. (2015). 24‐month‐old children with larger oral vocabularies display greater academic and behavioral functioning at Kindergarten entry. Child Development, 86, 1351–1370. http://doi.org/10.1111/cdev.12398
Oller, D. K., Niyogi, P., Gray, S., Richards, J. A., Gilkerson, J., Xu, D., … Warren, S. F. (2010). Automated vocal analysis of naturalistic recordings from children with autism, language delay, and typical development. Proceedings of the National Academy of Sciences of the United States of America, 107, 13354–13359. http://doi.org/10.1073/pnas.1003882107
Owen, A. J., & Leonard, L. B. (2002). Lexical diversity in the spontaneous speech of children with Specific Language Impairment: Application of D. Journal of Speech, Language, and Hearing Research, 45, 927–937. http://doi.org/10.1044/1092‐4388(2002/075)

Pan, B. A., Rowe, M. L., Singer, J. D., & Snow, C. E. (2005). Maternal correlates of growth in toddler vocabulary production in low‐income families. Child Development, 76, 763–782. http://doi.org/10.1111/j.1467‐8624.2005.00876.x
Pearson, B. Z., & Fernández, S. C. (1994). Patterns of interaction in the lexical growth in two languages of bilingual infants and toddlers. Language Learning, 44, 617–653. http://doi.org/10.1111/j.1467‐1770.1994.tb00633.x
Pearson, B. Z., Fernández, S. C., & Oller, D. K. (1995). Cross‐language synonyms in the lexicons of bilingual infants: One language or two? Journal of Child Language, 22, 345–368. http://doi.org/10.1017/S030500090000982X
Reese, E., & Read, S. (2000). Predictive validity of the New Zealand MacArthur Communicative Development Inventory: Words and Sentences. Journal of Child Language, 27, 255–266. http://doi.org/10.1017/S0305000900004098
Rescorla, L. (1989). The Language Development Survey: A screening tool for delayed language in toddlers. Journal of Speech and Hearing Disorders, 54, 587–599.
Rescorla, L., & Dale, P. S. (2013). Late talkers: Language development, interventions, and outcomes. Baltimore, MD: Brookes Publishing Co.
Rescorla, L., Ratner, N. B., Jusczyk, P., & Jusczyk, A. M. (2005). Concurrent validity of the language development survey: Associations with the MacArthur‐Bates Communicative Development Inventories: Words and Sentences. American Journal of Speech‐Language Pathology, 14, 156–163. http://doi.org/10.1044/1058‐0360(2005/016)
Rice, M. (1984). A cognition account of differences between children’s comprehension and production of language. Western Journal of Speech Communication, 48, 145–154.
Roberts, J. E., Burchinal, M., & Durham, M. (1999). Parents’ report of vocabulary and grammatical development of African American preschoolers: Child and environmental associations. Child Development, 70, 92–106.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes‐Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439. http://doi.org/10.1016/0010‐0285(76)90013‐X
Rowe, M. L., Raudenbush, S. W., & Goldin‐Meadow, S. (2012). The pace of vocabulary growth helps predict later vocabulary skill. Child Development, 83, 508–525. http://doi.org/10.1111/j.1467‐8624.2011.01710.x
Secord, W., Semel, E., & Wiig, E. (2003). Clinical Evaluation of Language Fundamentals. San Antonio, TX: Pearson Education Inc.
Southwood, F., & Russell, A. F. (2004). Comparison of conversation, freeplay, and story generation as methods of language sample elicitation. Journal of Speech, Language, and Hearing Research, 47, 366–376. http://doi.org/10.1044/1092‐4388(2004/030)
Tomasello, M., & Mervis, C. B. (1994). The instrument is great, but measuring comprehension is still a problem. Monographs of the Society for Research in Child Development, 59, 174–179.
Vagh, S. B., Pan, B. A., & Mancilla‐Martinez, J. (2009). Measuring growth in bilingual and monolingual children’s English productive vocabulary development: The utility of combining parent and teacher report. Child Development, 80, 1545–1563. http://doi.org/10.1111/j.1467‐8624.2009.01350.x
Wechsler, D. (1991). Wechsler Intelligence Scale for Children‐3rd Edition (3rd Edition). San Antonio, TX: The Psychological Corporation.
Weisleder, A., & Fernald, A. (2013). Talking to children matters: Early language experience strengthens processing and builds vocabulary. Psychological Science, 24, 2143–2152. http://doi.org/10.1177/0956797613488145
Weizman, Z. O., & Snow, C. E. (2001). Lexical input as related to children’s vocabulary acquisition: Effects of sophisticated exposure and support for meaning. Developmental Psychology, 37, 265–279. http://doi.org/10.1037/0012‐1649.37.2.265
Williams, K. T. (1997). Expressive Vocabulary Test Second Edition (EVT™ 2). Journal of the American Academy of Child & Adolescent Psychiatry, 42, 864–872.

Appendix 3.1

Instructions for Completing the Language Inventory

• Try to complete the inventory when you have at least 30 quiet minutes, without interruptions. An example might be when your child is sleeping.
• You do not have to complete the inventory in one sitting. If you are interrupted, it is ok to put it down and come back to it when you have more time.
• Write the date you completed the inventory on the form.
• Ask others (e.g., other family members, nanny, child care providers) to help you fill out this form. Please mark everyone who helped complete the inventory on the front of the form.
• Please read all of the instructions on the inventory carefully, and make sure you complete all of the pages.

Remember:

• For the Words & Gestures form, mark the words your child Understands OR Understands and Says in English. For the Words & Sentences form, mark ONLY the words your child Understands and Says.
• Mark only the words that your child uses on his or her own. Do not mark imitations. Do not read the words on the inventory to your child and ask him/her to repeat them.
• Give your child credit for mispronounced or childish words (e.g., “pasketti” for “spaghetti” or “raffe” for “giraffe”).
• Mark words on the inventory for which your child uses a different word with the same meaning (e.g., “carriage” for “stroller” or “nana” for “grandmother”).

If any questions come up while completing the inventory, please call us!

Thank you! We appreciate your time and effort!

4 Eye‐Movement Tracking During Reading

Reinhold Kliegl and Jochen Laubrock

Abstract

Eye movements during reading are mostly tracked with video-based pupil monitoring systems, with isolated words, sentences, paragraphs, or texts serving as stimuli. Technical issues and potential problems are described. Fixation durations and locations yield many measures that are sensitive to language-related processing difficulty. Gaze-contingent display changes are used to determine the size of the perceptual span (McConkie paradigm) and afford the isolation of language-related effects in parafoveal preview (Rayner paradigm). Multivariate statistics, e.g., linear mixed models, can also be used to assess these effects. This is illustrated with an analysis of eye-voice span effects on fixation durations during oral reading.

Assumptions and Rationale

One of the most stunning dissociations between human behavior and phenomenal experience occurs during reading. Our experience tells us that the eyes move smoothly across the line of text most of the time, disrupted only by return sweeps to the next line or occasionally by a jump back to an earlier word in the sentence or text when, for example, psycholinguistically speaking, “the parser crashed” as a consequence of some garden pathing. At the behavioral level, however, there is nothing smooth about the movements of the eyes; they are characterized by an alternation of quick jerky movements (“saccades”), lasting between 10 and 30 ms, and relatively stable phases (“fixations”), lasting between 30 and more than 500 ms. A second dissociation is embedded in the first one: There is an almost complete suppression of visual input during saccades, presumably to suppress motion blur. As everyone knows, these physiologically based “blackouts” escape our awareness. And, last but not least, there is a third dissociation: Even if we think that we don’t move the eyes, for example, when we consciously try to fixate a word, the eyes are nevertheless engaged in so‐called fixational movements (i.e., tremor, drift, and microsaccades) related to control of the six ocular muscles, maintaining the eyes in synchrony (i.e., minimizing disparity), and preventing bleaching of the receptors (see Further Readings for pointers to literature about these fundamental results). These three dissociations between behavior and phenomenal experience force the conclusion that what we “perceive” during reading is not the movement of the eyes, but the movement of attention. Indeed, it is the link of fixation durations and fixation locations to attention, and our increasing understanding of these relations, that has made eye tracking a prime method of choice for many questions asked in psycholinguistic research. Attention is a key theoretical construct at the heart of information processing, with a rather straightforward perspective as far as psycholinguistic research is concerned: If processing is difficult, fixation durations increase and the distances between their locations (i.e., saccade amplitudes) decrease. The basic assumption is that the location of the eye provides information about the focus of attention. These are the default expectations; they have served the research community well and will continue to do so. Occasionally, data are reported where the opposite results are obtained. These, then, are cases of surprising findings that, once reconciled with theory, usually represent a major leap forward in our attempts to come up with a coherent theoretical account of reading; even more productive (but much rarer) are counterintuitive theoretical predictions subsequently confirmed with experiments (Kliegl & Engbert, 2013). One reason for counterintuitive results may be a fourth dissociation: The fixation location, that is, the direction of gaze, and the focus of attention are usually, but not always, identical, as is evident when we are engaged in a boring conversation, with an attractive alternative conversation partner standing nearby. It is an active field of research to determine the conditions under which gaze and covert attention dissociate and exactly how this dissociation is implemented, for example, as a zoom‐lens (Risse et al., 2014; Schad & Engbert, 2012) or a spotlight (Schotter, Reichle, & Rayner, 2014) model of attention. We will describe research methods for the assessment of attention‐gaze dissociation in this chapter.

There is another distinction to be kept in mind. From a psycholinguistic perspective, reading is of interest due to its inherent relation with processing of written language.
The primary interest is to obtain as reliable and as valid indicators of language‐related processes as possible. Natural reading, of course, involves not only language‐related processing, but also the programming of saccades. Both language‐related processing and oculomotor programming are heavily restricted by perceptual constraints due to crowding (Bouma, 1970). Combined with attentional processes, this constraint causes a limited perceptual span, which extends much further to the right than to the left for languages with left‐to‐right reading direction (McConkie & Rayner, 1975). Of course, the asymmetry of the perceptual span is itself strong evidence for the relevance of attention.

A final basic assumption about natural reading concerns the timing of language‐related processes and the start of programming of saccades. If the cognitive system were designed by psycholinguists, they would probably ask that these processes be scheduled in a strict sequence. At the outset of a fixation we would have the system take care of language‐related issues. Once this is done, the saccade could be programmed to the next word. Finally, after the program is assembled, the eyes are carried onward and the cycle starts over. In this case, the psycholinguist’s task would simply be to determine the timelines, extract various language‐related components, and be done with it. Unfortunately, although this might be a convenient architecture for psycholinguistics, the system is the result of evolutionary tinkering, optimizing allocation of attention and gaze control for survival in an environment in which communication by written language was unknown. Indeed, we are using an architecture that was initially completely unrelated to reading, which is the product of a much more recent cultural evolution. Moreover, if the system had been implemented in a strictly sequential manner, it would also make for highly inefficient reading. Rather, in dealing with acuity and working‐memory constraints, readers are extremely efficient in scheduling these processes in parallel such that, in the ideal case, language‐related processing at the currently fixated word (which usually involves processing the last, the fixated, and the next word) is finished just in time, when the motor program to carry the eye forward (or backward) is ready as well. Unraveling the dynamics, that is, the degree and the conditions under which language and oculomotor processes are scheduled in parallel, is at the core of theoretical controversies reflected in the differences between computational models such as E‐Z Reader (Reichle et al., 1998), Glenmore (Reilly & Radach, 2006), and SWIFT (Engbert et al., 2005; see Further Readings for an overview of current research). The reason for bringing this issue up at the outset is that none of the indicators that we derive from the eye‐tracking record is a process‐pure measure; they all contain information about language, vision, attention, and oculomotor demands. Obviously, the measures may differ in the degree to which they reflect the different processes, and this weight may itself depend on differences related to instructions, materials, or readers.

Apparatus

There are many alternatives for tracking a reader’s eye. Best known are surface electrodes, infrared corneal reflections, search coils attached to the surface of the eyes, infrared dual‐Purkinje image tracking, and video‐based pupil monitoring. While search coils and dual‐Purkinje image tracking used to be considered the gold standard in terms of accuracy and temporal resolution, they have a number of strong disadvantages in terms of intrusiveness and usability. Video‐based systems aided by corneal reflections have improved considerably, and in a direct comparison with a search coil, Kimmel, Mammo, and Newsome (2012) report that “leading optical performance now rivals that of the search coil, rendering optical systems appropriate for many if not most applications.” In practice, video‐based eye trackers with high sampling rates and a fairly large tracking area clearly dominate today, as they are easy to use and relatively accurate and inexpensive. This is especially true for psycholinguistic research.

Figure 4.1 Typical eye tracker set up. Eye tracker logo by Aenne Brielmann, https://thenounproject.com/term/eye‐tracking/89896/. Used under CC BY 3.0 US, https://creativecommons.org/licenses/by/3.0/us/.

Video‐based trackers typically combine an infrared light source, high‐speed cameras sensitive to visible and infrared light, and computer‐based image processing to detect the pupil in the eye image as well as the corneal reflection, that is, the “first Purkinje image” or reflection of the infrared light from the outer cornea. The vector between pupil center and corneal reflection can be used to compute the gaze location on the screen. Given the short duration of saccades, a high sampling rate is needed for detection of fixations and saccades; current state‐of‐the‐art technology often samples eye position at 1000 Hz, additionally allowing measurement of fixational eye movements as well as implementation of fast gaze‐contingent display changes. Within the class of video‐based trackers, there are different types such as head‐mounted, tower‐mounted, desktop‐mounted, and mobile trackers. Mobile trackers, allowing the participant to move freely, extend the range of situations in which eye tracking can be used, but they have the major disadvantage that the stimulus is not under control of the experimenter. For example, whereas the location of each word on the screen is precisely known in a tower‐mounted system and gaze position can therefore easily be mapped to words, it has to be laboriously recovered from a video recording of the participant’s field of view in a mobile eye tracker. For reading research, tower‐mounted setups often give the best compromise of accuracy and usability. Figure 4.1 shows a typical eye tracker set up. Host (experimenter) and display (participant) computers should ideally be arranged in an L shape. They communicate via Ethernet, for example, to allow for local control and gaze‐contingent experiments. To accomplish display changes during a saccade, it is advisable to use a display with a fast refresh rate. Windows or other bright light sources that could cause reflections on the host and display monitors should be avoided.

Experimental Paradigms

Stimuli

In principle, eye tracking works with any stimulus, even viewing single words in isolation—in which case it can be used to control for fixation location. However, fluent reading consists of more than processing of isolated words; it also involves integration of successive words into a context of discourse, while oculomotor programs are scheduled in parallel towards the next saccade target. Therefore, reading of sentences or paragraphs provides a richer picture of the reading process; fixations are influenced by processing of past, present, and future words. In practical reading research with spatially extensive stimuli, the specifications of the system need to be taken into consideration. One limitation is the tracking range, which is typically on the order of 60 by 40 degrees; this is not a problem for a computer monitor at a normal viewing distance, but may limit the use in situations with very large displays. The standard experimental paradigm is borrowed from single‐word presentation studies and consists of the presentation of single sentences or text passages containing a target word. Variables of interest such as length, frequency, and predictability of the target word are orthogonally varied or held constant across conditions in order to achieve optimal statistical power relative to the number of subjects and number of items of the design; this also aligns the experimental design with familiar ANOVA‐based statistical inference. Of course, such quasi‐experimental control of stimuli implies a lack of generalizability across the full range of word properties. Therefore, a second line of studies uses all words of sentences (e.g., Kliegl, Nuthmann, & Engbert, 2006) or text passages extracted from newspapers (e.g., Kennedy, Hill, & Pynte, 2003) and bases statistical inference on advanced multivariate statistics to deal with correlated predictors. The higher the correlation, the smaller the statistical power to detect hypothesized effects. To some degree lower statistical power can be compensated with increasing sample sizes of subjects and items. For a discussion of the merits of the two approaches we refer to the exchange between Kliegl et al. (2006), Rayner et al. (2007), and Kliegl (2007). Obviously, any systematic differences between the two approaches must be resolved; the most likely explanations are related to selection effects with respect to the word material or, in the case of finding an effect or not, to differences in statistical power. One very interesting feature of eye‐movement recording is that the signal is available on‐line and can be fed back to the participant. Such closed‐loop control is sometimes also used in other domains such as EEG biofeedback, but is much more common in eye‐tracking studies. The approach was developed in the 1970s (Rayner, 1975; McConkie & Rayner, 1975; Watanabe, 1971) and has been very successfully employed ever since to investigate issues such as the size of the perceptual span in reading or the order in which properties of upcoming text are processed.

Moving Window (McConkie) Paradigm

How much information can we extract from text during a single fixation? The best method to measure the size of the effective visual field (or the perceptual span) is the moving window paradigm (McConkie & Rayner, 1975), in which a window of normal text moves in real time with the reader’s gaze. Text outside the window is masked, and the width of the window is under experimental control. By varying window size, the perceptual span can be estimated, either as the point at which performance no longer deviates from a static control condition, or by fitting an asymptotic nonlinear growth curve (Sperlich, Meixner, & Laubrock, 2016). Masks vary in the extent to which spaces and letter features are preserved, as illustrated in Figure 4.2 (upper part). Studies have consistently shown that the size of the perceptual span is much smaller than one would intuitively think. In alphabetic languages, the span extends about 14‐15 characters in the reading direction for picking up low‐level visual information such as word boundaries, and only about 9‐10 characters for letter identity; the span is asymmetric and considerably smaller at 3‐4 characters against the reading direction. This implies that it extends to the left in languages like Hebrew or Arabic, where reading is from right to left. Information density of a writing system has a large influence; in terms of characters, the span is considerably smaller in Chinese (about three characters) or Japanese (about five to six characters), whereas in terms of bits of information transmitted it is comparable between languages. Within a writing system, developing readers have a smaller span than mature readers, and the span is also momentarily influenced by cognitive demands; for example, it gets smaller when a low‐frequency word is fixated (Meixner, Nixon, & Laubrock, 2017).

Figure 4.2 Illustration of the gaze‐contingent moving‐window (top) and boundary (bottom) paradigms. The asterisk (*) indicates gaze location. The example of the McConkie paradigm uses a symmetric 9‐character window and different mask conditions: (a) an x‐mask, (b) an x‐mask preserving spaces, (c) a letter mask preserving letter shapes and vowel/consonant classes. In the example of the Rayner paradigm, the invisible boundary is indicated by the pipe symbol (|); (a) illustrates a semantically (“arises”) and (b) a phonologically (“ekkers”) related preview; the target word is always “occurs.”
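The logic of the moving‐window manipulation can be sketched in a few lines of Python. This is only a schematic computation of the masked display for a single gaze sample (mask type b of Figure 4.2, with window parameters chosen to mimic the asymmetric span); a real implementation updates the display gaze‐contingently within a few milliseconds of each saccade.

# Schematic moving-window mask: letters outside the window are replaced
# by 'x', spaces are preserved (mask type b in Figure 4.2).
def moving_window(text, gaze_index, span_left=4, span_right=14):
    lo, hi = gaze_index - span_left, gaze_index + span_right
    return "".join(
        ch if (lo <= i <= hi or ch == " ") else "x"
        for i, ch in enumerate(text)
    )

sentence = "A stunning dissociation between behavior and experience"
print(moving_window(sentence, gaze_index=12))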

Boundary (Rayner) Paradigm

What information is extracted from an upcoming parafoveal word before it is fixated? A related gaze‐contingent method called the boundary paradigm (Rayner, 1975) is very useful to answer what properties are pre‐processed and to what extent.

Rather than moving a window, static text is presented, but with one target word replaced by a preview. When the gaze crosses an invisible boundary just before the target location, the preview is changed to the target, as illustrated in Figure 4.2 (lower part; target word “occurs”). Preview benefit can be computed as the difference in fixation durations on the target word after related (or identical) versus unrelated previews (for a recent review see Schotter & Rayner, 2015). Preview benefit is influenced by linguistic relatedness of preview and target; in English, orthographically and phonologically related previews generate a sizeable preview benefit; in other languages, semantic relatedness also generates preview benefit, which in Chinese is even larger than its phonological cousin. Variants of the boundary paradigm in which a gaze‐contingent trigger is combined with timed presentation are, for example, the disappearing‐text (Rayner et al., 2003) and (parafoveal) fast‐priming (Hohenstein, Laubrock, & Kliegl, 2010) paradigms. In the former, the fixated word is made to disappear after a brief period; in the latter, the gaze triggers an unrelated preview to be replaced by a prime, which is only visible for an experimentally manipulated duration from the beginning of a fixation before being replaced by the target. The Further Readings provide encompassing descriptions of other variants of these experimental paradigms.
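The display‐change logic of the boundary paradigm can likewise be sketched schematically. The function and variable names are illustrative only; a real implementation evaluates this condition on every eye‐tracker sample and must complete the swap during the saccade, while visual input is suppressed.

# Schematic boundary-paradigm logic: show the preview until the gaze
# crosses the invisible boundary, then show the target word.
def displayed_word(gaze_x, boundary_x, preview, target):
    return preview if gaze_x < boundary_x else target

# Example with a phonologically related preview for the target "occurs";
# pixel coordinates are invented.
for x in (410.0, 495.0, 560.0):
    print(x, displayed_word(x, boundary_x=500.0,
                            preview="ekkers", target="occurs"))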

Collecting and Analyzing Data

Data Collection

Eye tracking systems are limited in accuracy. For example, an average spatial accuracy of 0.25 to 0.5 degrees means that with a typical font size there is a sizeable statistical chance that the reported fixation location is one letter off; this might complicate the assignment of fixations to words during reading of tiny fonts such as in graphic novels. Since the error is often somewhat larger in the vertical direction, one practical implication is that vertical line spacing in stimulus texts should be increased in order to facilitate unambiguous assignment of gaze to rows. These problems were much more severe in the past, which might be a reason why most research on eye movements during reading has been carried out with single‐line sentences. With current technology, presentation of paragraphs is feasible. Lab equipment like head and chin rests is advisable to minimize head movements and thereby increase measurement accuracy. At the beginning of an eye tracking session, the system needs to be calibrated in order to establish a mapping between screen coordinates and measurements. This is achieved by asking the participant to fixate a sequence of calibration points. A mapping can then be computed from the correspondences between stimulus locations and measurements of the pupil‐corneal reflection vector, for example, by estimating parameters in a polynomial fit. After calibration, gaze position is available in screen coordinates. Usually calibration is followed by a validation run, determining whether the estimated eye position is indeed close to the known position of new targets. In most cases, calibration and validation are accomplished in a few minutes and are supported by high‐level routines of the manufacturer’s software. It is recommended common practice to present additional validation points, or “fixation checks,” during an experiment and to re‐calibrate in case of failure.
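The polynomial calibration fit can be illustrated with a hedged sketch, a simplified stand‐in for the manufacturer’s routines: a second‐order polynomial mapping from pupil–corneal‐reflection vectors to screen pixels is estimated by least squares from nine calibration points. All numbers are invented.

# Simplified polynomial calibration (not a manufacturer's algorithm).
import numpy as np

# Measured pupil-CR vectors for a 3 x 3 calibration grid and the known
# screen coordinates of the targets; values are fake, with a little noise.
v = np.array([[vx, vy] for vx in (-1.0, 0.0, 1.0) for vy in (-1.0, 0.0, 1.0)])
screen = v * [640, 480] + [640, 480] + np.random.normal(0, 2, v.shape)

# Second-order polynomial design matrix: 1, x, y, xy, x^2, y^2
X = np.column_stack([np.ones(len(v)), v[:, 0], v[:, 1],
                     v[:, 0] * v[:, 1], v[:, 0] ** 2, v[:, 1] ** 2])
coef, *_ = np.linalg.lstsq(X, screen, rcond=None)  # 6 x 2 parameter matrix

def gaze_to_pixels(vx, vy):
    """Map a new pupil-CR vector to screen coordinates."""
    return np.array([1.0, vx, vy, vx * vy, vx ** 2, vy ** 2]) @ coef

print(gaze_to_pixels(0.5, -0.25))  # roughly (960, 360) for these fake data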

Data Reduction

Eye trackers provide time‐stamped x‐ and y‐coordinates of the measured gaze in screen coordinates for one or two eyes; especially in video‐based systems, some measure of pupil diameter is also part of the default record. Obviously, the time stamp depends on the temporal resolution of the eye tracker. The time series of gaze coordinates is usually classified into event periods. During reading, fixations, saccades, and blinks are the main event classes, but with moving stimuli (e.g., scrolling text), smooth pursuit is also important. Given a sufficiently fast sampling rate, saccades can be detected based on a velocity threshold, often combined with an acceleration criterion. Usually, high‐velocity noise caused, for example, by quantization is removed from the velocity time series by using some low‐pass filter before applying the threshold. Figure 4.3 shows raw (x‐) position data, the transformation to a smoothed velocity time series, and the result of a saccade detection algorithm. The software suites accompanying most commercially available eye‐trackers include event parsers. Since common dependent variables, such as fixation duration and saccadic amplitude, depend critically on the choice and parameters of the filter and detection algorithm, it is important that these proprietary implementations are well documented, so that comparison across studies and labs is possible. Access to the raw data is still important, as criteria might change and even new classes be introduced; with the widespread availability of high‐speed trackers, the post‐saccadic wobble termed “glissade” that follows about every other saccade and lasts for about 20 ms is now sometimes regarded as a separate class (Nyström & Holmqvist, 2010), whereas previously it might have been assigned to the neighboring fixation or saccade events. The classification of fixational eye movements also requires access to the raw data, which should be recorded binocularly with a high sampling rate (Engbert & Kliegl, 2003). In the age of Open Science, storage of raw eye‐movement data is mandatory in reading research. For reading research, the most important output of the event detection algorithm is a sequence of fixation durations, each of which is assigned to a specific letter in the material read. Thus, the sequence of fixations in the data file corresponds to their temporal occurrence during reading. From this mapping of fixations to letters, all the dependent measures typically used in research on eye‐movement control during reading can be computed. This is not the case if eye‐movement measures are initially computed with respect to words as basic units, defined, for example, as areas of interest, or if the data file is organized by words of the experimental material. Table 4.1, a slightly modified combination of Tables 1 and 2 in Radach and Kennedy (2004; also Inhoff & Radach, 1998), provides definitions of the most common measures derived from fixation locations and fixation durations.

Figure 4.3 Velocity‐based saccade detection. The upper panel illustrates the x position of the eye during reading of a single sentence sampled at 500 Hz, and the lower panel the smoothed eye velocity. Red dots indicate points classified as belonging to a saccade as output from an event detection algorithm (Engbert & Kliegl, 2003), and vertical lines indicate beginning and end of the corresponding saccade and fixation intervals. (See insert for color representation of the figure.)
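A simplified, one‐dimensional sketch of velocity‐based saccade detection in the spirit of Engbert and Kliegl (2003) follows. The published algorithm operates on two‐dimensional velocities with an elliptic, median‐based threshold, so this should be read as an illustration of the principle rather than as the reference implementation.

# Simplified 1D velocity-based saccade detection (illustrative only).
import numpy as np

def detect_saccades(x, rate=500, lam=6.0, min_samples=3):
    """Return (start, end) sample indices of detected saccades."""
    # velocity in units/s, smoothed with a 5-sample moving average
    v = np.convolve(np.gradient(x) * rate, np.ones(5) / 5, mode="same")
    # median-based velocity SD: robust against the saccades themselves
    sigma = np.sqrt(np.median(v ** 2) - np.median(v) ** 2)
    above = np.abs(v) > lam * sigma
    saccades, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                          # saccade onset
        elif not flag and start is not None:
            if i - start >= min_samples:       # minimum-duration criterion
                saccades.append((start, i - 1))
            start = None
    return saccades

Fixations are then simply the intervals between successive detected saccades (and blinks), and their durations follow from the sampling rate.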

Locations

The measures listed in the top part of the table are related to fixation locations. Their meaning should be self‐explanatory, and obviously most of them are correlated, some of them very highly. For example, large mean saccade amplitude correlates positively with launch distance and skipping and negatively with fixation frequency and refixation probability. In the context of inferential statistics, all of them have been used to capture effects of fixation locations on language‐related and oculomotor‐related processing; that is, all of them have been used as a dependent variable in some context. At the same time, all of them have also served as covariates (predictors, independent variables) in the explanation of each other and of the various measures of fixation duration listed in the bottom part of the table. This is necessarily an everlasting source of confusion but, given the heterogeneity and diversity of theoretical and practical contexts of language‐related and oculomotor‐related reading research, hardly to be avoided. Obviously, it is primarily one’s theoretical framework that determines whether a measure is to be used as an independent or a dependent variable. The justification is that the chosen conceptualization delivers a coherent account that is convincing to the scientific community.

Durations

The most frequently used measures based on fixation durations are listed in the bottom part of Table 4.1. Again, we consider the descriptions sufficiently clear to forego their repetition here. Most of these measures are computed only for first‐pass reading, meaning a fixation location and its associated duration is only included in the analyses when the word on which it is measured is entered for the first time with a saccade in reading direction. Of course, in case of substantial rereading, one may also compute them for second‐pass reading. With some qualifications, what we wrote about measures derived from fixation locations can also be said for the various measures of fixation durations: Depending on the theoretical context, fixation durations may serve as dependent or independent variables; in the psycholinguistic context they are mostly used as dependent measures.

Table 4.1 Definitions of location (top) and duration (bottom) eye‐tracking measures.

Based on fixation locations
Saccade amplitude (length): Distance between two successive fixation locations.
Fixation (skipping) probability: Relative frequency with which a word is fixated (skipped).
Fixation position (location): Position within the word; empty space between words is coded as 0.
Launch distance (site): Distance between the prior fixation and the beginning (or center) of the currently fixated word.
Fixation frequency: Mean absolute number of fixations per word for the current pass (defined as first, second, etc. encounter with specified text).
Initial/first fixation duration: Duration of the first fixation on a word, irrespective of the number of fixations on the word during the current pass.
Refixation probability: Relative frequency of at least two fixations before leaving a word.
Regression probability: Relative frequency of a saccade to a previous word in the sentence.

Based on fixation durations
Single fixation duration: Duration of fixation on a word, if the word is read with one fixation during the current pass.
Refixation duration: Summed duration of additional fixations within the current pass prior to an exit from the word.
Gaze duration: Summed duration of all fixations before leaving the word during the current pass (usually first pass).
Re‐reading time: Summed duration of all fixations made after leaving the word for the first time.
Total reading time: Summed duration of all fixations made on the critical word.
Go‐past time: Sum of all fixations from entering a region during first‐pass reading until the eye leaves in reading direction.
Regression‐path duration: Sum of all fixations from entering a region during second‐pass reading until the eye leaves in reading direction.
Reading rate (words per minute): Aggregate of spatial and temporal measures; arguably the criterion that the reader’s cognitive and eye‐movement control system attempts to optimize (typical values: 200‐300 wpm).

Note. The relevant metric for positions, amplitudes, and distances is usually characters, not degrees of visual angle. Aggregated durations (except reading rate) are usually computed without the duration of saccades. Modified after Inhoff & Radach, 1998, and Radach & Kennedy, 2004, Tables 1 and 2.
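To illustrate how several of the measures in the bottom part of Table 4.1 derive from a single fixation sequence, here is a minimal Python sketch. The fixation data are invented, and, as a simplification, the sketch does not exclude words that are first entered by a regressive saccade from the first‐pass measures.

# First-fixation duration, first-pass gaze duration, and total reading
# time per word from a temporally ordered list of (word_index, ms) pairs.
fixations = [(0, 210), (1, 180), (1, 160), (3, 250), (2, 300), (3, 190)]

first_fix, gaze, total, left = {}, {}, {}, set()
prev_word = None
for word, dur in fixations:
    first_fix.setdefault(word, dur)
    total[word] = total.get(word, 0) + dur
    if word not in left:                 # still in the first pass
        gaze[word] = gaze.get(word, 0) + dur
    if prev_word is not None and prev_word != word:
        left.add(prev_word)              # the previous word has been exited
    prev_word = word

print(first_fix)  # {0: 210, 1: 180, 3: 250, 2: 300}
print(gaze)       # word 1: 340 ms (two first-pass fixations)
print(total)      # word 3: 440 ms (first pass plus re-reading)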

Moreover, by definition, the measures are not independent of each other; obviously, when a word is read with a single fixation, single‐fixation duration will equal the gaze duration for this word. Similarly, first fixations are part of gaze durations, too. Conceptually, these distinctions were motivated by attempts to distinguish between early and late effects of processing, with gaze durations being considered the upper bound of early processing. Consequently, separate analyses are often reported for several of these measures and the significance of effects is scanned for consistency across all of them. From a data‐analytic perspective such inclusive definitions are highly undesirable because, obviously, such analyses do not provide independent evidence; rather, one may wonder whether spuriously significant results are more likely to be reported in this scenario. There is no easy solution to this problem without a major break with the past research tradition. Thus, without a convincing and encompassing new data‐analytic framework, progress will depend very much on direct and conceptual replications of critical results (which is not bad either).

Inferential Statistics

The distributed processing reflected in fixation locations and durations during reading presents considerable challenges for statistical inference about experimental or quasi‐experimental effects. Traditionally, separate analyses of variance using subjects and items as random factors (F1/F2 ANOVA) were the method of choice for the analysis of data with uncorrelated independent variables (e.g., experiments with orthogonal factorial designs built around manipulations of target words in the boundary paradigm). Only measures on the target word were entered as dependent variables; possibly separate F1/F2‐ANOVAs were reported for surrounding fixations. Moreover, as mentioned in the last paragraph, this set of analyses was repeated for a subset of the duration‐ and location‐based measures listed in Table 4.1. During the last 10 years, the advent of linear mixed models (LMMs; Baayen, Davidson, & Bates, 2008; Kliegl, Risse, & Laubrock, 2007) cut in half the number of analyses with the specification of subjects and items as crossed random factors in one analysis. There are additional costs and benefits associated with using LMMs. In terms of costs, considerable responsibility in model specification is returned to the data analyst in comparison with the largely automated ANOVA procedures. This concerns both the specification of hypotheses as single‐degree‐of‐freedom contrasts, ideally a priori, for fixed effects and the specification of the random‐effects structure (i.e., variance components and correlation parameters) for within‐subject and within‐item effects (Bates et al., 2015). In terms of benefits, LMMs adequately handle the pervasive problem of missing‐at‐random data in the eye‐movement records and allow a seamless integration of factor and numeric covariates varying within or between subjects and items (Kliegl, 2007). With LMMs the classic distinction between experimental and “correlational” analysis is breaking down, even for the analysis of data collected in experimental paradigms. For example, in an analysis of semantic preview benefit in Chinese, Yan et al. (2012) reported an interaction of type of preview and pre‐boundary fixation duration for target fixation duration: Semantic preview benefit was large for short preview durations and absent for long preview durations. Obviously, as preview duration is not under experimental control, this interaction severely constrains the interpretation of a preview benefit (or its absence) as an experimental effect. In addition to testing the significance of differences between experimental conditions and their interactions, LMMs assess the reliability of interindividual differences and differences between items in these effects. These advances in statistical inference are possible because eye tracking yields a very high density of behavioral observations during reading. LMMs are but heralds of other advanced multivariate statistical techniques such as linked LMMs (Hohenstein, Matuschek, & Kliegl, in press), generalized additive mixed models (Matuschek, Kliegl, & Holschneider, 2015), nonlinear mixed models (Sperlich et al., 2016), quantile regression analyses (Risse & Kliegl, 2014), survival analyses

With LMMs the classic distinction between experimental and "correlational" analysis is breaking down, even for the analysis of data collected in experimental paradigms. For example, in an analysis of semantic preview benefit in Chinese, Yan et al. (2012) reported an interaction of type of preview and pre-boundary fixation duration for target fixation duration: Semantic preview benefit was large for short preview durations and absent for long preview durations. Obviously, as preview duration is not under experimental control, this interaction severely constrains the interpretation of a preview benefit (or its absence) as an experimental effect.

In addition to testing the significance of differences between experimental conditions and their interactions, LMMs assess the reliability of interindividual differences and differences between items in these effects. These advances in statistical inference are possible because eye tracking yields a very high density of behavioral observations during reading. LMMs are but heralds of other advanced multivariate statistical techniques being adopted for the analysis of eye-tracking data during reading, such as linked LMMs (Hohenstein, Matuschek, & Kliegl, in press), generalized additive mixed models (Matuschek, Kliegl, & Holschneider, 2015), nonlinear mixed models (Sperlich et al., 2016), quantile regression analyses (Risse & Kliegl, 2014), survival analyses (Reingold et al., 2012), and, probably of special interest to those studying complex syntactic structures, scan-path analysis (von der Malsburg & Vasishth, 2011). With these techniques we are getting closer to how the dynamics of processing unfold over time.

An Exemplary Study: The Eye‐Voice Span During Oral Reading

The core interest of psycholinguistics is in language-related processes. Eye movements during silent reading tap into the perception of language via the conversion of written script, but also into language production via sub-articulation, varying in degree, for example, with task demand and reading skill. Language production is manifest in oral reading, which historically and ontogenetically precedes silent reading. Moreover, there is little doubt that during oral reading the voice strongly regulates saccade programs (Buswell, 1920; Laubrock & Kliegl, 2015). Indeed, the dynamics of language-related processing difficulty are reflected in how far the eye travels ahead of the voice: the easier the processing, the larger the eye-voice span (EVS). There can be no doubt about the large potential of this method for addressing theoretical problems in psycholinguistics. We are convinced that the sparseness of research on oral reading is due to the technical difficulties of simultaneously recording and classifying eye and voice, as well as the challenges of analyzing not one but two dynamically related time series. We describe these difficulties, but also the potential of EVS research, in a synopsis of a study reported in Laubrock and Kliegl (2015); technical details about simultaneous recording and identification of word boundaries are quoted literally from this paper.

Coregistration of Eye and Voice

In the example described here, sentences were presented on a 22-inch Iiyama Vision Master Pro 514 CRT monitor with a resolution of 1280 × 960 pixels. Voice was recorded to hard disk using a Sennheiser K6 series condenser microphone connected to an ASIO-compatible SoundBlaster Audigy sound card inside the PC, ensuring a fixed audio latency of 5 ms. Eye movements were registered using the EyeLink 1000 tower mount (SR Research, Ottawa, ON, Canada). The head was stabilized and a viewing distance of 60 cm was ensured with a headrest, but the usual additional chinrest was removed to allow for easy articulation. Eye movements and voice protocols were synchronized by sending trigger signals to the eye tracker at the beginning and end of each sound recording; these were recorded in the series of eye-tracker time stamps and later adjusted for the audio output delay.
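The synchronization logic reduces to simple bookkeeping once the trigger timestamps are logged. The following sketch (ours, not the study's code; function and variable names are hypothetical) maps an event labeled in the audio file into eye-tracker time:

    # Map a time within a trial's sound recording into eye-tracker time,
    # given the tracker timestamp logged at recording onset and the fixed
    # latency of the audio path (5 ms in the set-up described above).
    AUDIO_LATENCY_MS = 5.0

    def audio_to_tracker_time(t_audio_ms, trigger_start_ms):
        """Convert ms-from-recording-onset into the eye tracker's clock.
        Whether the latency is added or subtracted depends on whether it
        sits on the recording or the playback side of the audio path."""
        return trigger_start_ms + t_audio_ms - AUDIO_LATENCY_MS

    # e.g., a word onset labeled at 1234.0 ms into the recording, with the
    # start trigger logged at tracker time 5602118.0 ms:
    print(audio_to_tracker_time(1234.0, 5602118.0))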

Identification of Word Boundaries

The biggest technical challenge is to identify word boundaries in the oral recording protocol. A Praat (Boersma & Weenink, 2010) script was prepared that looped over subjects and sentences and presented each sentence (divided into words) together with its associated sound recording, showing a representation of the waveform together with a spectrogram, formants, and intensity and pitch contours. The script attempted to locate the beginning and end of spoken parts by crossings of an intensity threshold, and initially distributed word boundaries across the spoken part in proportion to word length. Human scorers then manually dragged word boundaries to the subjective real boundary locations by repeatedly listening to stretches of the speech signal. Several zoom levels were available, and scorers were instructed to zoom in so far that only the word in question and its immediate neighbors were visible (and audible) for the ultimate adjustment (Figure 4.4). In the case of ambiguous boundaries due to co-articulation, scorers were instructed to locate the boundary in the middle of such ambiguous stretches. Only articulated word durations from sentences that were read without error were used in further analyses.

Figure 4.4 Determination of word boundaries with Praat software. The computer program presents text, voice, and proportionally distributed word boundaries; the human coder zooms into the voice record and manually adjusts word boundaries.
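The proportional seeding of word boundaries that the script performs before manual correction is easy to sketch. The function below is our illustration, not the authors' Praat script, and assumes the spoken stretch has already been located via the intensity-threshold crossings:

    # Distribute provisional word boundaries across a located stretch of
    # speech in proportion to word length (in characters); human scorers
    # then drag these to the perceived boundaries.
    def initial_boundaries(words, speech_onset_s, speech_offset_s):
        """Return len(words) + 1 provisional boundary times in seconds."""
        total_chars = sum(len(w) for w in words)
        duration = speech_offset_s - speech_onset_s
        bounds, t = [speech_onset_s], speech_onset_s
        for w in words:
            t += duration * len(w) / total_chars
            bounds.append(t)
        return bounds

    print(initial_boundaries(["the", "eye", "travels"], 0.42, 1.80))
    # -> [0.42, 0.738..., 1.057..., 1.80]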

An Exemplary LMM Interaction Based on Two Numeric Covariates

Merging voice onsets and voice offsets for the pronunciation of words with the sequence of fixations described above yields measures of spatial and temporal EVS relative to the onset and offset of fixations. In an LMM with single-fixation duration (SFD) as the dependent variable, there was a very strong and linear effect of spatial EVS on SFD (Figure 4.5, top panel), suggesting that fixation durations are prolonged when the EVS gets too large. This effect was stronger for fixated words that were highly predictable from the prior sentence context than for words with low predictability (Figure 4.5, bottom panel): prediction usually leads to shorter fixations, but only if the EVS is not too large.

Figure 4.5 Main effect of eye-voice span and its interaction with predictability. Top: partial main effect of onset EVS on SFD; dots represent observed scores. Bottom: visualization of the partial interaction effect between onset EVS and predictability of the fixated word; the LMM estimate of the interaction was based on two continuous numeric covariates; binning into low versus high predictability is only for visualization of the interaction (part of Laubrock & Kliegl, 2015, Figure 4). Source: Laubrock and Kliegl, http://journal.frontiersin.org/article/10.3389/fpsyg.2015.01432/full. Used under CC BY 4.0, https://creativecommons.org/licenses/by/4.0/.

EVS at fixation onset was one of the strongest predictors of SFD, and had a substantial linear influence that was larger than well-established effects such as launch site, word frequency, or word predictability. The plots show partial effects after statistically controlling for 28 other covariates in the LMM, as well as taking into account shrinkage correction due to differences between subjects and differences between sentences. Aside from documenting this and various other theoretically relevant results about how eye-voice span and word-related properties co-determine fixation durations during oral reading, we consider Laubrock and Kliegl (2015) a tutorial paper on how one goes about identifying and documenting a parsimonious LMM for a very complex set of data. For example, the final LMM, based on 11,709 fixations, 32 subjects, and 144 sentences, included 66 fixed effects (covariates were estimated with quadratic and some cubic trends), 12 variance components (incl. residual variance), and 3 correlation parameters.
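To make the merged data structure concrete, here is a hedged sketch (ours, with hypothetical data) of how a spatial EVS at fixation onset might be derived once the fixation sequence and the word articulation times are joined. For simplicity it counts the span in words, whereas Laubrock and Kliegl (2015) compute spatial EVS in letters:

    # Spatial EVS at fixation onset: index of the fixated word minus the
    # index of the word being articulated at that moment.
    def onset_evs(fix_word_idx, fix_onset_ms, voice_onsets_ms, voice_offsets_ms):
        """Return the spatial EVS in words, or None if the voice is silent
        (between words) at fixation onset."""
        spans = zip(voice_onsets_ms, voice_offsets_ms)
        for spoken_idx, (on, off) in enumerate(spans):
            if on <= fix_onset_ms < off:
                return fix_word_idx - spoken_idx
        return None

    # The eye fixates word 9 at t = 3150 ms while word 5 is articulated:
    onsets = [2000, 2300, 2550, 2800, 3000, 3100, 3400]
    offsets = [2290, 2540, 2790, 2990, 3090, 3390, 3700]
    print(onset_evs(9, 3150, onsets, offsets))  # -> 4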

Practical Issues

Problems and Pitfalls

There are a number of potential pitfalls in eye-tracking research. The most frequent problems associated with infrared video-based eye trackers are listed in Table 4.2. They range from mundane subject characteristics through technical limitations of eye trackers to conceptual distinctions to be kept in mind when interpreting fixation durations. As far as technical issues are concerned, we consider most of them to be self-explanatory, but want to mention that CRTs with fast screen cycles are no longer produced. Paradoxically, in this case technological advances have made it increasingly difficult to implement fast gaze-contingent display changes with precise control of timing (a simple latency budget is sketched after Table 4.2). Since this problem is even more severe in other areas of research, such as near-threshold priming or visual perception, some manufacturers now provide specialized display hardware.

Table 4.2 Practical issues related to eye-tracking during reading.

Subject characteristics: Eye color may impact calibration. Varifocals and lenses may impact calibration. Mascara may generate spurious reflections and measures. Special populations (e.g., infants, older adults) may show larger variability.

Display: Position error increases with eccentricity. Horizontal accuracy is higher than vertical accuracy.

Eye blinks: Loss of measurement. Saccade artifacts (the closing eyelid causes a quick downward shift of the pupil's center of gravity).

Event detection (time/space): Post-saccadic wobble/glissades: are they part of the fixation or the saccade? Eye tracker: a minimum of 250 Hz is required for gaze-contingent display changes; faster is better. Display monitor: a minimum of 100 Hz is required for gaze-contingent display changes during saccades; faster is better. Assignment of gaze to line of text (see Display).

Interpretation: Interest is in the focus of attention, but point of gaze may not indicate the focus of attention.
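To see why the sampling-rate minimums in Table 4.2 matter, consider a rough latency budget for a gaze-contingent display change: the change must be completed while the eye is still in flight, that is, within a reading saccade of roughly 20-50 ms. The numbers below, including the 2 ms processing allowance, are illustrative assumptions of ours, not values from the chapter:

    # Worst-case latency of a gaze-contingent display change: the boundary
    # crossing is detected up to one sample late, processed, and then the
    # new display waits for the next screen refresh.
    def worst_case_latency_ms(tracker_hz, monitor_hz, processing_ms=2.0):
        sample_delay = 1000.0 / tracker_hz
        refresh_delay = 1000.0 / monitor_hz
        return sample_delay + processing_ms + refresh_delay

    print(worst_case_latency_ms(250, 100))   # 4 + 2 + 10 = 16 ms
    print(worst_case_latency_ms(1000, 144))  # about 10 ms

At the stated minimums the change just fits inside a typical saccade; slower hardware does not.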

There is, of course, a practical problem with the theoretical interpretation of fixation durations with respect to the potential dissociation between point of gaze and focus of attention. From a naïve perspective, one might hope for the validity of the strong eye-mind hypothesis, that is, that there is no appreciable lag between what is fixated and what is being processed (Just & Carpenter, 1980). The many results on parafoveal processing in the perceptual span, as well as those related to the eye-voice span, very clearly indicate that this cannot be the case, except possibly under some artificially constrained settings. In general, covert attention has no inertia and moves faster than the eye. Nevertheless, although attention can move to a certain degree independently of gaze, attention shifts obligatorily precede gaze shifts (Deubel & Schneider, 1996). In this sense, gaze shifts are indeed indicators of attention shifts. Furthermore, even covert attention shifts leave traces in fixational eye movements, suggesting that the oculomotor system is tightly coupled with the system implementing spatial attention (Laubrock, Engbert, & Kliegl, 2005).

There is another twist to the story. A fixation does not necessarily guarantee that attention was focused long enough at a location to process the stimulus, even if the fixation duration was long. For example, studies of mindless reading show that although the pattern of fixation durations changes, the basic pattern of eye movements remains superficially similar (Schad, Nuthmann, & Engbert, 2012).

Advantages and Disadvantages as Compared with Related Methods

Arguably, eye-tracking during reading captures the reading process in its most natural way and therefore also in its utmost complexity. There is no evidence that the presence of eye-tracking equipment limits the generalizability of results; in other words, eye-tracking during reading has high external validity. One of the main advantages, and at the same time disadvantages, is that fixation durations measure processing effects not only for the fixated word, but also for the preceding and following words. Unsurprisingly, there is a tradeoff with respect to the internal validity of eye-movement measures due to high collinearity. Eye-movement research has embraced three methodologies to deal with this problem: control by (quasi-)experimental design, control by multivariate statistics, and computational modeling. In principle, all three can be applied not only to the analysis of data from natural reading, but also to tasks that in one way or another reduce the dynamics of language-related and oculomotor processes by eliminating the latter. We briefly describe three such paradigms: naming and lexical decision tasks for isolated word recognition, rapid serial visual presentation, and self-paced reading.

Naming/Lexical Decision Task

The conceptually most reductionist approach to the study of reading is implemented in experimental paradigms investigating processes of isolated word recognition. There are two main paradigms. In the naming task, words that differ in some critical feature or are presented in the context of different priming conditions must be named as rapidly as possible. In the lexical decision task (LDT), the speed of distinguishing between nonwords and words is the primary dependent variable. Typically, the latter task can use the same words under the same experimental conditions as the naming task, and often responses to nonwords are not even analyzed. As there is no need for eye movements, these measures reflect the efficiency of reading in the absence of effects due to oculomotor programs or saccades. In comparison with the technical complexities associated with eye-tracking measures (Table 4.2), LDTs deliver simple and powerful indicators of language-related processes, albeit restricted to isolated word recognition, that is, typically in the absence of sentence context and parafoveal processing of upcoming words.

Rapid Serial Visual Presentation (RSVP)

By definition, isolated word recognition lacks context, arguably one of the most important influences on reading. In the RSVP paradigm, words are typically presented one after another at a pace of 100-700 ms/word at the same display location. Again, the primary goal is to isolate the effects of language-related processes by eliminating the need for saccades. RSVP with longer intervals (300 to 700 ms/word) is typically used when event-related potentials (ERPs) are measured during reading of sentences, because the task ensures the absence of oculomotor artifacts in the brain measures. In addition, with low presentation rates ERPs can be determined for individual words without overlap between waves triggered by other words. There has also been success with co-registration of eye movements and brain potentials during natural reading. In this paradigm, the onset of a fixation, rather than the presentation of the word on the display, is used as the trigger for computing what is called a fixation-related potential. Dimigen et al. (2011) contains an elaborate tutorial on how to deal with the technical, data-analytic, and conceptual problems one encounters with co-registration of eye movements and fixation-related brain potentials during reading.

Self‐Paced Reading

Arguably, the closest simulation of natural reading involving eye movements without tracking them is self-paced reading, where readers' button presses initiate the successive presentation of words or phrases in their usual physical location. Here the assumption is that inspection times yield direct information about language-related processing without the technical complexities associated with the collection and analysis of eye movements. Obviously, given the presence of eye movements, processes related to oculomotor programming are involved in the task, but processing of parafoveal information is disabled.

Psycholinguistic research is driven by an interest in language-related processing. Eye-movement tracking during reading is one window through which we can observe some of the most intricate orchestrations of cognitive processing. There are technical, data-analytic, and, last but not least, conceptual issues that need to be overcome, as in any other productive field of science. A unique contribution of this psycholinguistic research method is that it brings us into direct contact with the embodiment of the dynamics of mind and behavior.

Key Terms

Boundary paradigm (Rayner paradigm): A gaze-contingent experimental paradigm used to measure when a specific type of parafoveal information is processed; a preview changes into a target when the gaze crosses an invisible boundary; preview benefit is indicated by shorter fixations with related (or identical) than with unrelated previews.

Calibration: Alignment of gaze and screen coordinates.

Corpus analysis: Analysis of a large number of observations collected for the same material, which is usually large itself, aiming for generalizability of effects across the full range of word properties.

Eye-voice span (EVS): Difference between the fixated and the pronounced word during oral reading; computed in a metric of letters (spatial EVS) or time (temporal EVS).

Moving window paradigm (McConkie paradigm): A gaze-contingent experimental paradigm to measure the useful field of view in reading (the perceptual span), where text is only visible in a window of controlled width that moves in real time with the reader's gaze.

Perceptual span: The asymmetric region around the fixation location, extending about 3 letters to the left and 6 letters to the right for identification of letters, and up to 15 letters to the right for picking up low-level visual information such as the empty spaces between words. The perceptual span is usually determined with the McConkie paradigm.

Rapid serial visual presentation: Presentation of stimuli in rapid succession, usually at a fixed location in the center of the screen; used to study sentence integration processes without eye movements.

Saccade detection: Parsing of the raw time series into saccades and other events (fixations, blinks, smooth pursuit).

Self-paced reading: Word-by-word presentation of sentences triggered by the subject's button presses; usually words appear cumulatively in their regular position in the sentence, thus preserving the spatial layout.

Video-based eye tracking: The most commonly used and most widely available technique for measuring eye movements during reading, based on pupil detection in a video stream of eye images, usually improved by tracking the corneal reflection of an infrared light source.

References

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv:1506.04967.
Boersma, P., & Weenink, D. (2010). Praat: Doing phonetics by computer [Computer program]. Version 5.1. Available at: http://www.praat.org/
Bouma, H. (1970). Interaction effects in parafoveal letter recognition. Nature, 226, 177-178.
Buswell, G. T. (1920). An experimental study of the eye-voice span in reading. Supplementary Educational Monographs No. 17. Chicago: Chicago University Press.

Deubel, H., & Schneider, W. X. (1996). Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36, 1827-1837.
Dimigen, O., Sommer, W., Hohlfeld, A., Jacobs, A. M., & Kliegl, R. (2011). Co-registration of eye movements and EEG in natural reading: Analyses and review. Journal of Experimental Psychology: General, 140, 552-572. doi:10.1037/a0023885
Engbert, R., & Kliegl, R. (2003). Microsaccades uncover the orientation of covert attention. Vision Research, 43, 1035-1045.
Engbert, R., Nuthmann, A., Richter, E., & Kliegl, R. (2005). SWIFT: A dynamical model of saccade generation during reading. Psychological Review, 112, 777-813.
Hohenstein, S., Laubrock, J., & Kliegl, R. (2010). Semantic preview benefit in eye movements during reading: A parafoveal fast-priming study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36, 1150-1170.
Hohenstein, S., Matuschek, H., & Kliegl, R. (in press). Linked linear mixed models: A joint analysis of fixation locations and fixation durations in natural reading. Psychonomic Bulletin & Review. doi:10.3758/s13423-016-1138-y
Inhoff, A. W., & Radach, R. (1998). Definition and computation of oculomotor measures in the study of cognitive processes. In G. Underwood (Ed.), Eye guidance in reading and scene perception (pp. 29-53). Oxford, UK: Elsevier.
Just, M. A., & Carpenter, P. A. (1980). A theory of reading: From eye fixations to comprehension. Psychological Review, 87, 329-354.
Kennedy, A., Hill, R., & Pynte, J. (2003). The Dundee corpus. Proceedings of the 12th European Conference on Eye Movements. Dundee: University of Dundee.
Kimmel, D. L., Mammo, D., & Newsome, W. T. (2012). Tracking the eye non-invasively: Simultaneous comparison of the scleral search coil and optical tracking techniques in the macaque monkey. Frontiers in Behavioral Neuroscience, 6, 49.
Kliegl, R. (2007). Towards a perceptual-span theory of distributed processing in reading: A reply to Rayner, Pollatsek, Drieghe, Slattery, & Reichle (2007). Journal of Experimental Psychology: General, 136, 530-537.
Kliegl, R., & Engbert, R. (2013). Evaluation of a computational model of eye-movement control during reading. In U. Gähde, S. Hartmann, & J. H. Wolf (Eds.), Models, simulations, and the reduction of complexity (pp. 153-178). Berlin: De Gruyter.
Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135, 13-35.
Kliegl, R., Risse, S., & Laubrock, J. (2007). Preview benefit and parafoveal-on-foveal effects from word n + 2. Journal of Experimental Psychology: Human Perception and Performance, 33, 1250-1255.
Laubrock, J., Engbert, R., & Kliegl, R. (2005). Microsaccade dynamics during covert attention. Vision Research, 45, 721-730.
Laubrock, J., & Kliegl, R. (2015). The eye-voice span during reading aloud. Frontiers in Psychology, 6, 1432.
von der Malsburg, T., & Vasishth, S. (2011). What is the scanpath signature of syntactic reanalysis? Journal of Memory and Language, 65, 109-127.
Matuschek, H., Kliegl, R., & Holschneider, M. (2015). Smoothing spline ANOVA decomposition of arbitrary splines: An application to eye movements in reading. PLoS ONE, 10, e0119165. doi:10.1371/journal.pone.0119165
McConkie, G. W., & Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17, 578-586.
Meixner, J., Nixon, J., & Laubrock, J. (2017). The perceptual span is locally modulated by word frequency early in reading development. Under review.
Nyström, M., & Holmqvist, K. (2010). An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behavior Research Methods, 42, 188-204.

Radach, R., & Kennedy, A. (2004). Theoretical perspectives on eye movements in reading: Past controversies, current issues and an agenda for future research. European Journal of Cognitive Psychology, 16, 3-26.
Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7, 65-81. doi:10.1016/0010-0285(75)90005-5
Rayner, K., Liversedge, S. P., White, S. J., & Vergilino-Perez, D. (2003). Reading disappearing text. Psychological Science, 14, 385-388.
Rayner, K., Pollatsek, A., Drieghe, D., Slattery, T. J., & Reichle, E. D. (2007). Tracking the mind during reading via eye movements: Comments on Kliegl, Nuthmann, and Engbert (2006). Journal of Experimental Psychology: General, 136, 520-529.
Reichle, E. D., Pollatsek, A., Fisher, D. L., & Rayner, K. (1998). Towards a model of eye movement control in reading. Psychological Review, 105, 125-157.
Reilly, R. G., & Radach, R. (2006). Some empirical tests of an interactive activation model of eye movement control in reading. Journal of Cognitive Systems Research, 7, 34-55.
Reingold, E., Reichle, E., Glaholt, M., & Sheridan, H. (2012). Direct lexical control of eye movements in reading: Evidence from a survival analysis of fixation durations. Cognitive Psychology, 64, 177-206. doi:10.1016/j.cogpsych.2012.03.001
Risse, S., Hohenstein, S., Kliegl, R., & Engbert, R. (2014). A theoretical analysis of the perceptual span based on SWIFT simulations of the n + 2 boundary paradigm. Visual Cognition, 22, 283-308.
Risse, S., & Kliegl, R. (2014). Dissociating preview validity and preview difficulty in parafoveal processing of word n + 1 during reading. Journal of Experimental Psychology: Human Perception and Performance, 40, 653-668.
Schad, D. J., & Engbert, R. (2012). The zoom lens of attention: Simulating shuffled versus normal text reading using the SWIFT model. Visual Cognition, 20, 391-421.
Schad, D. J., Nuthmann, A., & Engbert, R. (2012). Your mind wanders weakly, your mind wanders deeply: Objective measures reveal mindless reading at different levels. Cognition, 125, 179-194.
Schotter, E. R., & Rayner, K. (2015). The work of the eyes during reading. In A. Pollatsek & R. Treiman (Eds.), The Oxford handbook of reading (pp. 44-62). Oxford, UK: Oxford University Press.
Schotter, E. R., Reichle, E. D., & Rayner, K. (2014). Rethinking parafoveal processing in reading: Serial-attention models can explain semantic preview benefit and N + 2 preview effects. Visual Cognition, 22, 309-333.
Sperlich, A., Meixner, J., & Laubrock, J. (2016). Development of the perceptual span in reading: A longitudinal study. Journal of Experimental Child Psychology, 146, 181-201.
Watanabe, A. (1971). Fixation points and the eye movements. Oyo Buturi, 40, 330-334 (in Japanese).
Yan, M., Risse, S., Zhou, X., & Kliegl, R. (2012). Preview fixation duration modulates identical and semantic preview benefit in Chinese reading. Reading and Writing: An Interdisciplinary Journal, 25, 1093-1111.

Further Reading

Duchowski, A. T. (2007). Eye tracking methodology: Theory and practice. London: Springer.
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & van de Weijer, J. (2011). Eye tracking: A comprehensive guide to methods and measures. Oxford, UK: Oxford University Press.

Liversedge, S., Gilchrist, I., & Everling, S. (2011). The Oxford handbook of eye movements. Oxford: Oxford University Press.
Rayner, K., Pollatsek, A., Ashby, J., & Clifton, C., Jr. (2012). The psychology of reading (2nd ed.). New York, NY: Psychology Press.
Rayner, K., Pollatsek, A., & Schotter, E. R. (2012). Reading: Word identification and eye movements. In A. Healy (Ed.), Handbook of psychology, Volume 4: Experimental psychology (pp. 548-577). Hoboken: Wiley.

5 The Visual World Paradigm

Anne Pier Salverda and Michael K. Tanenhaus

Abstract

The visual world paradigm (VWP) is a family of experimental methods for studying real-time language processing in language comprehension and production that can be used with participants of all ages and most special populations. Participants' eye movements to objects in a visual workspace or pictures in a display are monitored as they listen to, or produce, spoken language that is about the contents of the visual world. Eye movements in the VWP provide a sensitive, time-locked response measure that can be used to investigate a wide range of psycholinguistic questions on topics running the gamut from speech perception to interactive conversation in collaborative task-oriented dialogue.

Introduction

The visual world paradigm (VWP) is a family of experimental methods in which participants' eye movements to real objects in a visual workspace, or to pictures on a display, are monitored as they listen to spoken language or produce language. Figure 5.1 shows an example of the experimental set-up. The term, coined by Tanenhaus and colleagues (Allopenna, Magnuson, & Tanenhaus, 1998), emphasizes that the visual workspace defines a circumscribed context that the language is about.


Figure 5.1 Example of a screen-based visual world paradigm experimental set-up.

In 1974, in a remarkable article titled "The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing," Roger Cooper reported experiments that used a Dual-Purkinje eye-tracker to measure participants' eye movements as they listened to stories while looking at a display of pictures. Participants shifted their gaze to pictures that were named in the stories and to pictures associated with those names. Fixations were often generated before the spoken word ended, suggesting a tight coupling of visual and linguistic processing.

More than 20 years later, Tanenhaus, Spivey-Knowlton, Eberhard, and Sedivy (1995) used a head-mounted video-based eye-tracker to monitor participants' eye movements as they followed experimenter-generated spoken instructions to pick up and move objects arranged on a table (e.g., Put the apple that is on the towel in the box). Their task-based approach was influenced by pioneering work at Rochester that used eye movements to study vision in natural tasks (see Hayhoe & Ballard, 2005, for a review). Tanenhaus et al. found evidence for rapid integration of visual and linguistic information in word recognition, reference resolution, and syntactic processing (parsing). The latter was the focus of their report. Allopenna, Magnuson, and Tanenhaus (1998) is the first VW study to use a screen-based presentation to study the time course of spoken-word recognition in continuous speech. Trueswell, Sekerina, Hill, and Logrip (1999) demonstrated that the VWP could be used to study sentence comprehension in pre-literate children, using a variant of the set-up in Tanenhaus et al.

Many current VW studies follow the methods and rationale introduced by Cooper (1974), who did not use an explicit task. Altmann and Kamide (1999) is the foundational "look-and-listen" study. They presented displays with clipart of a person (e.g., a boy) and a set of four objects (e.g., a cake, a toy car, a ball, and a toy train), together with a spoken utterance, for example, The boy will eat the cake (see Figure 5.2). Participants were more likely to generate a saccade, that is, to make an "anticipatory eye movement" to the target object, as the verb unfolded when the semantics of the verb were consistent with only one of the objects (eat; only the cake is edible, as opposed to move; all of the objects are movable).

Figure 5.2 Example visual display. Modeled after Altmann & Kamide (1999).

Two seminal studies provided the foundation for using the VWP in language production. Meyer, Sleiderink, and Levelt (1998) demonstrated that eye movements are closely time-locked to utterance planning during the production of simple noun phrases. Griffin and Bock (2000) monitored eye movements with schematic scenes that could be described using active or passive constructions (e.g., a picture of lightning striking a house) and demonstrated a tight coupling between fixations and utterance planning.

Assumptions, Logic, and Terminology

All VW experiments use similar logic and variations of the same design. A visual workspace contains real objects, or a display depicts an array of objects, a schematic scene, or a real-world scene. With screen displays, pictures are typically used, but some studies use printed words instead (McQueen & Viebahn, 2007). Participants' eye movements are monitored as speech unfolds. Of interest is at what point in time, with respect to some acoustic landmark in the speech signal (e.g., the onset of a word), a shift in the participant's visual attention occurs, as measured by a saccadic eye movement to an object or picture.

Behavioral and neuroimaging measures require a linking hypothesis that maps the dependent measure, in this case eye movements, onto hypothesized underlying processes. The most general form of the VW linking hypothesis is that as visual attention shifts to an object in the workspace, as a consequence of planning or comprehending an utterance, there is a high probability that a saccadic eye movement will rapidly follow to bring the attended area into foveal vision. Where a participant is looking, and in particular when and to where saccadic eye movements are launched in relation to the speech, can provide insights into real-time language processing. We return later to considerations about how linking hypotheses affect the interpretation and analysis of VW studies.

Across studies, the characteristics of the language, the contents and structure of the visual workspace, and the instructions and/or task vary. For this discussion, we assume that the potential referents are pictures displayed on a screen. Each picture may be referred to one or more times as the spoken language unfolds. The picture of interest, at a particular point in time, is the target. Experimenters are primarily interested in when looks to the target diverge from looks to the other pictures. The properties of one or more of the non-target pictures are often manipulated such that they are more related to the target than the other non-target pictures along some specified dimension, which could include participation in an implied event. Those pictures are then typically labelled competitors and the unrelated pictures distractors.

Competitors are labelled by the dimension(s) along which they differ from the target. For example, if the names of two of the pictures begin with the same syllable, for instance, candle and candy, and the participant hears the instruction, Click on the candle, then the candle would be the target and the candy would be the phonological competitor (or alternatively, the cohort competitor). Competitors can differ along any number of dimensions, ranging from how their names differ from the target (e.g., cohort, rhyme, or voice-onset time (VOT) competitors) to how similar they are along visual and/or conceptual dimensions. For example, two depicted objects of the same type might differ along a dimension such as size or color, or in a feature such as having stripes or stars.

In comprehension studies, the point in the speech signal at which only one picture is consistent with the integration of information in the sentence and the affordances of the objects in the visual world is sometimes referred to as the Point of Disambiguation (POD). The POD can serve as a reference point, defining the earliest point in the speech signal where a participant could identify the target if he or she was using all of the information available.
However, POD is also sometimes used to refer to the point in time where looks to the target actually begin to differ from looks to competitors. The competitor terminology is not typically used in production studies, but the logic is similar, with researchers examining the relationship between looks to a region of interest (e.g., a potential agent or patient) and aspects of the utterance, for example, when a picture is mentioned, and in what grammatical or thematic role (e.g., subject or object, and agent or patient, respectively).

Apparatus

The biggest decision one faces when setting up a lab is what type of eye-tracker to choose. Here we describe the two most commonly used systems. In determining which system is most suitable for a given type of experimental paradigm or experiment, factors to be taken into account include: properties of the experiment (the nature of the task, e.g., the form of interaction with the visual world); requirements for temporal and spatial sensitivity (an eye-tracker with a high temporal sampling frequency may be desired when subtle differences in the timing of effects are of interest, while a system with low spatial resolution may be used when the number of regions of interest in the display is small and these regions are spatially distinct); the population(s) that will be tested; whether automatic coding of the data is desired; and affordability.

The simplest, least expensive, and most portable system is a video camera, which records an image of the participant's eyes. The camera can be mounted above or below a computer screen, or positioned in the center of a platform with real objects (Snedeker & Trueswell, 2004). Eye movements are coded through frame-by-frame examination of the video recording. Temporal resolution is limited by the video equipment, which usually records at 30 or 60 Hz. The objects in the visual display need to be located such that fixations to each of the objects result in clearly distinct images of the eye. An important limitation is that participants are required to keep their eyes positioned in front of the camera.

Many eye-tracking systems use optical sensors to infer gaze location by measuring the orientation of the eye in its orbit. An image of one or both eyes is recorded by one or two eye cameras, which are either head-mounted or remote. The image is processed by dedicated hardware, and gaze location is established on the basis of the image of the pupil, or by computing the vector between the center of the dark pupil and the corneal reflection. The latter is obtained by exposing the eyes to invisible near-infrared light originating from an illuminator. Importantly, gaze location is contingent on both eye orientation and the orientation of the head relative to the visual display. Most optical systems compensate for head movements (e.g., remote systems track the shape of a small sticker attached to the participant's forehead to record head position and orientation).

Optical eye trackers typically generate output in the form of a stream of XY coordinates reflecting the participant's gaze location. If this output is in the form of screen coordinates, coding of eye movements to regions of interest in the visual world can be automatized. Some optical systems use an additional scene camera and produce video output in which the participant's gaze location is superimposed on a video recording of the visual workspace. Head-mounted systems typically operate with a higher sampling rate and spatial resolution than remote eye trackers. However, spatial resolution for remote eye trackers can be improved by using some form of head stabilization, for example, a chin rest.

Common Variations Across Experiments

Language

The language can differ along any number of dimensions, from manipulations of fine-grained acoustic-phonetic features (duration, VOT, formant structure, fundamental frequency, etc.) to properties of words (syntactic category, semantic features, frequency of occurrence, etc.) to linguistic structure (syntactic structure, information structure, semantic and pragmatic properties such as implicating and questioning, etc.). The source of the speech is important. The language often comes from a disembodied voice, which provides a narrative (e.g., The doctor will hand the scalpel to the nurse) or an instruction (e.g., Put the large candle above the fork). The default assumption is that the speaker and the listener have access to the same information in the visual world. In more interactive tasks, naïve participants and/or confederates generate the utterances of interest.

Visual World

The characteristics of the workspace play an important role in determining the questions that can be asked in a VW experiment. The most frequently used set-up is a screen display depicting an array of pictures, a schematic scene, or a real-world scene. The workspace can also contain real objects arranged on a tabletop or a more complex apparatus. When real-world objects are used in conjunction with instructions to manipulate them, one can ask research questions such as how affordances of objects interact with the language, which might be less natural with screen displays. These questions could be asked in a more controlled environment by using virtual reality, which would allow for a wide range of interesting manipulations, including sophisticated saccade-contingent changes to the virtual environment.

More complex workspaces are useful for asking questions about perspective-taking and for generating a variety of utterance types. For example, control of what information is shared and what information is privileged between participants can be achieved by constructing an appropriate physical apparatus, for example, one with cubbyholes that are open or occluded such that only one interlocutor can see one or more of the objects (Keysar, Barr, Balin, & Brauner, 2000).

Task

There are two common variants of VW experiments. Task- or action-based studies borrow from the vision-in-natural-tasks literature. Participants interact with real-world objects or, more typically, interact with pictures in a screen-based workspace to perform a motor task: typically clicking and dragging pictures to follow explicit instructions (Put the clown above the star), clicking on a picture when its name is mentioned, or manipulating real objects (e.g., Pick up the apple. Now put it in the box). Explicit goal-directed motor tasks encourage the participant to rapidly identify and fixate the target object of the linguistic expression. Participants typically generate a saccade to the referent (or maintain an earlier fixation), and keep fixating it until the mouse cursor or hand approaches the goal (visually guided reaching). The choice indicates the final interpretation, which can be used for response-contingent analyses (e.g., analyzing trials with looks to the voiced competitor beach when the participant chooses the voiceless target peach upon hearing a token with a particular VOT). The earliest language-mediated fixations occur 200-250 ms after the relevant acoustic landmark that could establish a POD (Salverda, Kleinschmidt, & Tanenhaus, 2014). Throughout a trial, a high proportion of the fixations are controlled by the goal, including fixations to objects that are relevant to establishing reference as the language unfolds (Salverda, Brown, & Tanenhaus, 2011; for discussion of an alternative, activation-based hypothesis, see Altmann & Kamide, 2007).

Look-and-listen studies (sometimes misleadingly called passive listening studies) do not require participants to perform an explicit task other than to look at the computer screen. Because the interpretation of the language is co-determined by information in the scene, participants' attention is drawn to referents, including pictures that the listener anticipates will be mentioned or pictures associated with implied events (e.g., an action that will take place in the future). In a variation introduced by Altmann (2004), a blank screen replaces the schematic scene at some point in the narrative.

There is a paucity of work that directly compares “task‐based” and “look‐and‐listen” studies that are designed to address the same question, which makes claims about the strengths and weaknesses of each approach somewhat speculative.

General Considerations Affecting Design and Interpretation

Many first-time users want to know what steps to follow to design and analyze VW experiments. We find an analogy to cooking helpful. Everyone cooks to some degree, but expertise varies. Some people rarely cook and know almost nothing about cooking techniques. If you are one of those people, you can feed yourself, but you cannot create anything new. And if you get adventurous and try a recipe, it is unlikely to turn out well; even the most detailed recipe requires knowledge of some basic cooking techniques. In contrast, master chefs have expertise in preparing a wide range of dishes in multiple genres of cooking; they are also aware of the molecular processes involved in cooking and the latest technology. Whereas master chefs rarely make mistakes when preparing established dishes, their novel creations are not always successful. When a dish does fail, however, they have good intuitions about what went wrong and how to correct it. One need not be a master chef to use the VW paradigm. But being the equivalent of someone who rarely cooks and occasionally tries to follow a recipe is likely to be problematic.

Every VW experiment combines aspects of both spoken language and vision. Successful use of the paradigm therefore requires some basic knowledge about, and sensitivity to, properties of both systems. This is challenging because few psycholinguists are knowledgeable about vision. Moreover, many psycholinguists who study higher-level processes (e.g., syntactic processing, interpretation, inference and implicature) have limited experience with the speech signal. Conversely, many who are knowledgeable about the speech signal have only cursory knowledge of how it is impacted by higher-level factors. In what follows, we present some of the factors in speech and in vision in natural tasks that strongly impact the design, analysis, and interpretation of VW studies.

Speech and Spoken Language

Speech is a temporal, rapidly changing signal. Acoustic cues are transient, and there are no acoustic signatures that correspond one-to-one to linguistic categories. A category, or even a phonetic feature such as voicing, is signaled by multiple cues, many of which arrive asynchronously and are impacted by both high- and low-level linguistic subsystems. Linking eye movements to relevant linguistic information in the speech signal is therefore critically dependent on having some understanding of where, when, and why the speech signal provides information about linguistic structure.

Time-locking eye movements to an acoustic landmark typically requires determining the onset of a speech sound or spoken word. This task is straightforward when a target word is presented in isolation; for instance, the word beaker starts with the release of the plosive /b/. However, most studies use spoken sentences where the target word is embedded in continuous speech, for instance, Click on the beaker. Words in continuous speech can have very different characteristics than words spoken alone. Determining when a target word starts in continuous speech can be complicated, and we therefore recommend consulting with a phonetician. For example, in Click on the beaker, the release of the plosive /b/ does not correspond to the onset of beaker. The closure preceding the release is an integral part of the articulation of plosives in continuous speech, and the onset of the closure therefore constitutes the onset of beaker.

Coarticulation—the temporal and spatial overlap in the articulation of two or more speech sounds—is a ubiquitous property of speech. At any moment in time, the speech signal provides information about multiple speech sounds, with the strength of coarticulation depending on many factors. This has consequences for the time-locking between speech and eye movements, especially under conditions where it is essential to estimate the earliest information in the speech signal that might influence a language-mediated eye movement. Careful examination with a speech editor (using a spectrogram) or evaluation of the stimuli using incremental auditory presentation can improve the quality of the segmentation of a linguistic event (such as a speech sound). The influence of coarticulation can be reduced by using cross-spliced materials when possible, and otherwise by carefully choosing the stimuli.

Speech is determined by constraints at multiple levels. The same acoustic cues that provide information about phonemic segments may also generate expectations about syntax, information structure, and pragmatics. Many aspects of these higher-level processes are manifested in prosody and intonation, which affect acoustic cues (such as duration) that are also used in processing phonemes and spoken words. Thus, higher-level information may be available earlier than one might otherwise think. It is therefore important to consider the locus and extent of various cues to aspects of linguistic structure in the speech tokens used in a VW study. Moreover, manipulation of speech cues may impact interpretation at multiple, and perhaps mutually constraining, levels of linguistic representation.

Eye Movements in Natural Tasks

While the classic literature on visual search with simple displays and, more recently, scenes is informative for VW researchers, a newer literature on vision in natural tasks is arguably more relevant (Salverda, Brown, & Tanenhaus, 2011). Traditional visual-search studies focused on the role of low-level perceptual features (e.g., color, orientation, and shape) in pre-attentive visual processing and in the subsequent allocation of visual attention. These studies used simple, static, and largely unstructured displays, on the assumption that these elementary perceptual features would have similar effects on visual attention in complex real-life scenes. Given this assumption, basic stimulus features should be key predictors of the deployment of visual attention. Indeed, in the absence of a task, global estimates of visual salience derived by integrating multiple feature values at each location within a screen correlate with gaze patterns during viewing of a scene (Parkhurst, Law, & Niebur, 2002).

Feature-based salience, however, is a poor predictor of gaze patterns when a participant is engaged in a well-defined task (Tatler, Hayhoe, Land, & Ballard, 2011). In studies of everyday visuomotor behaviors, such as preparing tea, making sandwiches, and driving, the vast majority of fixations, typically 90% or more, can clearly be attributed to task-based goals. Participants have a strong tendency to fixate objects immediately before they become relevant to the execution of a task subgoal (e.g., fixating an object immediately prior to reaching for it). Moreover, participants direct their fixations to those parts of an object that are behaviorally most relevant (e.g., the spout of a tea kettle during the pouring of hot water).

In addition to influencing the location and timing of fixations, cognitive goals play a key role in determining the information encoded during fixations and the retrieval, during a fixation, of information that is stored in memory. Importantly, aspects of the task that a participant performs, including those that change dynamically, can strongly influence the time and resources available for accessing information, and thus the information that is encoded during a fixation. For instance, as task complexity increases in a block-sorting task, participants begin to rely less on working memory and more on the external environment (Droll & Hayhoe, 2007).

The most general implication for VW studies is that where and when participants look will be strongly determined by both explicit and implicit task goals. For example, one might be interested in using the proportion of looks to a previously mentioned picture as an indication that it is being considered as a potential referent for a referring expression. However, a participant who already knows the location and the properties of that object might not look at a picture even though it is being considered as a possible referent, and even when the picture is interpreted as the most likely referent of a referring expression (Yee & Heller, 2012). This does not mean that the VW paradigm is poorly suited to studying pronoun resolution; indeed, some of the most elegant and influential VW studies have done so. But it does mean that one has to be careful about interpreting the absence of looks to an object or picture. More generally, this highlights the importance of not confusing the dependent measure with an underlying process. While this might seem obvious, it commonly occurs, especially when one assumes that there are "signature" data patterns that are diagnostic of a particular cognitive process (Tanenhaus, 2004). Finally, in the absence of a specific goal structure, it can be problematic to "back-engineer" explanations based on fixation patterns.

Nature of Stimuli

Visual World

Each trial in a VW study begins with the presentation of a display that includes the target and typically one or more competitors (see Figure 5.1). Unrelated distractors provide a baseline for the assessment of speech-driven effects in the eye movements, which are revealed by differences in fixations to the target, competitor, and distractors. In order to avoid baseline differences that complicate interpretation and increase noise in the data, distractor objects should not have any direct or indirect relationship, along phonological, semantic, or visual dimensions, to the relevant information that might be activated (even temporarily) by the linguistic stimulus. Distractors with visual properties that might attract the participant's attention irrespective of the language should also be avoided.

The structure of the visual world varies across experiments, from a grid with objects to less structured visual scenes and workspaces. To facilitate coding of eye movements, objects should be situated some distance from each other. Systematic patterns in exploratory fixations (e.g., the tendency to fixate the top-left picture in a search array early in a trial; Dahan, Tanenhaus, & Salverda, 2007) can be counteracted by randomizing or counterbalancing object positions. Unless there are other compelling reasons, we recommend against instructing participants to fixate a specific location at the start of a trial (e.g., by using a fixation cross). Maintaining fixation is resource-intensive. Moreover, asking participants to control their initial fixation can reduce the number of eye movements, with some participants maintaining fixation until just before they initiate an action.

In production studies the characteristics of the display are often manipulated to examine how fixations to different objects affect lexical choice and grammatical encoding. Participants' attention is sometimes manipulated by a transient visual stimulus in a specified location. Some studies use a preview phase, where objects are presented one at a time along with their intended name. Familiarization is useful when constraints on item selection result in pictures that may not be readily associated with the intended name.

Linguistic Stimuli

On each trial, a spoken instruction or sentence refers to one or more objects in the visual world. Utterances are designed such that there are clear predictions about how the combination of visual and linguistic information will yield different patterns of fixations as the language unfolds, given a particular set of hypotheses. The time course of information integration can be examined in carefully chosen designs that use minimal differences in the timing and/or availability of linguistic information between experimental conditions (see the Example Study section).

Timing

Comprehension studies typically use pre-recorded speech that is segmented and labeled with a speech editor. Time codes corresponding to the onset and offset of acoustic landmarks (e.g., onset and offset of the target word) are provided to the experiment software, so that eye-movement data can be aligned relative to particular linguistic material. Appropriate segmentation of the speech stimuli has direct consequences for the interpretation of eye movements during the unfolding of the linguistic stimulus (see also the section General Considerations Affecting Design and Interpretation). Systematic language-mediated fixations earlier than 200 ms after an acoustic landmark are likely due to biasing coarticulatory information before the marked event (Salverda, Kleinschmidt, & Tanenhaus, 2014; see also the section Nature of Stimuli). In production studies, the experimenter typically records the participant's utterances and then uses speech-editing software to identify landmarks that are time-locked to the onset of the display or to looks to a particular location on the screen.

In most VW studies, the presentation of the linguistic stimulus follows the display with a brief delay of about a second, to allow participants to identify the objects in the display without giving them much opportunity to engage in strategic behavior. The complexity of the display is a factor in determining the appropriate duration of the preview.

Data Collection and Analysis

The primary VW eye-movement data are a stream of gaze locations recorded at the sampling rate of the eye-tracker. These data are superimposed on a video recording of the visual world and/or stored in a digital file as XY coordinates. The latter type of output includes time-stamped messages that provide essential information about the trial, including the identity and position of the objects and the timing of acoustic landmarks in the speech stream (e.g., target-word onset and offset). A digital sequence of XY coordinates can be parsed into a sequence of fixations, saccades, and blinks using dedicated software.
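As a toy illustration of what such dedicated software does, the sketch below labels samples as fixation or saccade with a fixed velocity threshold. Published algorithms (e.g., Nyström & Holmqvist, 2010) are adaptive and handle noise, blinks, and glissades far more carefully, and every numeric value here is a made-up placeholder:

    # Velocity-threshold event detection over a gaze stream.
    import numpy as np

    def fixation_samples(x, y, t_ms, vel_thresh_deg_s=30.0, px_per_deg=30.0):
        """True where the sample belongs to a fixation, False where the
        sample-to-sample velocity exceeds the saccade threshold."""
        dt_s = np.diff(t_ms) / 1000.0
        dist_px = np.hypot(np.diff(x), np.diff(y))
        vel_deg_s = (dist_px / px_per_deg) / dt_s
        return np.concatenate([[True], vel_deg_s < vel_thresh_deg_s])

    t = np.arange(6) * 2.0  # 500 Hz sampling
    x = np.array([512.0, 512.4, 512.1, 580.0, 640.0, 640.2])
    y = np.full(6, 384.0)
    print(fixation_samples(x, y, t))  # -> [ True  True  True False False  True]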

Coding

In order to assess what the participant was looking at throughout a trial, the experimenter defines regions of interest (ROIs) in the visual world, each of which is associated with one or more objects. We recommend extending regions of interest beyond the edges of objects (e.g., to the cell of a grid within which a picture appears) because visual attention is focused on a region, not a point, in space, and because gaze location as estimated by the eye‐tracker is subject to error. A coder or automated coding procedure then scores each fixation as directed at one of the ROIs, or as not directed at any ROI. Saccades can be scored too, even though the visual system receives minimal input during a saccade—a phenomenon known as saccadic suppression. Because a saccade is triggered by a shift in visual‐spatial attention to a new location, that location can be considered the locus of attention during a saccade. Similarly, a sequence of saccades and fixations to one ROI can be scored as one long fixation to that region, and blinks can be scored as continuing fixations if the same object is fixated prior to and following the blink. Eye movements can be scored until the end of the trial or until the point in time when the participant performs an action indicating that they arrived at a definitive interpretation of the spoken input (e.g., the moment that a participant clicks on the target object, or the onset of the preceding mouse movement).
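The following minimal sketch (in Python) illustrates automated ROI scoring with regions extended beyond object edges, as recommended above. The ROI names, coordinates, and margin are illustrative assumptions.

```python
# Score a fixation as falling inside one of the (margin-extended) ROIs.
# Coordinates are in pixels; after extension, regions should not overlap.
ROI_MARGIN = 50  # pixels added on each side of an object's bounding box

ROIS = {
    # name: (left, top, right, bottom) bounding box of the object
    "target":     (100, 100, 300, 300),
    "competitor": (500, 100, 700, 300),
}

def score_fixation(x, y, rois=ROIS, margin=ROI_MARGIN):
    """Return the name of the ROI containing the fixation, or None."""
    for name, (left, top, right, bottom) in rois.items():
        if (left - margin <= x <= right + margin
                and top - margin <= y <= bottom + margin):
            return name
    return None

print(score_fixation(120, 150))  # -> 'target'
print(score_fixation(400, 400))  # -> None (not directed at any ROI)
```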

Visualization

A widely used method for summarizing results of VW studies plots the proportion of fixations to different objects throughout a trial (see Figure 5.3; see the Example Study section for another illustration). A proportion‐of‐fixations plot represents, at each moment in time throughout a time window, the proportion of trials with a look to each type of picture, averaged across participants (or items). Over the course of a trial, fixation proportions change in response to the processing of linguistic information and the integration of this information with information in the visual world. For instance, a rise in fixation proportions to an object reflects increased evidence for a particular linguistic interpretation associated with that object. Proportion‐of‐fixations plots are useful because they provide a comprehensive (though by no means exhaustive) representation of the eye‐movement record.

Figure 5.3 A. Timing of target fixations for each trial, for one participant (data from Salverda, Kleinschmidt, & Tanenhaus, 2014). B. Fixation proportions computed for the same data. [Axes: time (ms) relative to word onset, 0–1000; panel A plots looks to the target by trial number (1–29), panel B the proportion of target fixations (0.0–1.0).]

Changes in the distribution of gaze to different types of pictures in the display over time reveal important aspects of the eye‐movement data. They are also useful for some first‐pass checks: Are objects fixated to the degree expected? Are only a small proportion of looks not directed at any of the ROIs? Do looks converge on the target picture? Are there baseline differences in fixation proportions? More generally, if the results of statistical analyses are inconsistent with what can be seen in proportion‐of‐fixations plots, then something has gone awry. As discussed below, it is inappropriate to first look at proportion‐of‐fixations plots and then define an analysis region based on where one sees the biggest effects.

Proportion‐of‐fixations plots are constructed by taking a specific time window and computing, for each moment in time (limited by the sampling rate), the proportion of all relevant trials on which each of the objects is fixated. Figure 5.3 presents data from one participant in Experiment 1 of a study by Salverda et al. (2014) where the participant saw a display with a target picture and three distractors and followed a simple spoken instruction to click on the target. Figure 5.3A presents, for each trial, looks to the target during a time interval of one second beginning at target‐word onset. Proportions of fixations to the target are presented in Figure 5.3B. For instance, at 200 ms, the target was fixated on 7 out of 29 trials, resulting in a fixation proportion of 7/29 = 0.24. After the data have been aggregated across participants, it can be useful for purposes of data inspection or presentation to bin fixation proportions (e.g., using 20‐ms bins for data recorded at 250 Hz; see Figure 5.4 in the Example Study section for an example). Such "down‐sampling" reduces the influence of incidental moment‐by‐moment variation in the proportion of fixations observed.

Proportion‐of‐fixations plots usually present data aligned to a relevant linguistic event, which typically requires temporal realignment of the data across trials. For instance, in Figure 5.3, zero ms corresponds to wherever the target word started on each of the trials. For the evaluation of data in proportion‐of‐fixations plots it is important to take into account that information in the speech signal influences eye movements with a delay of approximately 200–250 ms (Salverda et al., 2014).

An important issue arises when the amount of eye‐movement data in a time window of interest varies across trials. For instance, if a participant's response terminates the trial, there are no eye‐movement data from that moment onwards. When fixation proportions are computed for such data, early fixation proportions reflect data from all trials, whereas later fixation proportions reflect only the subset of trials on which the participant has not made or initiated a response. A frequently used solution is to extend the final fixation of each trial as an ongoing look in accordance with the participant's response, for example, a look to the picture that was selected. The rationale is that this "artificial" look reflects the participant's final interpretation of the speech signal. Extending the final fixation ensures that each trial contributes the same amount of information to the statistical analysis of fixation proportions across time.
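The computation just described is straightforward to implement. The sketch below (in Python, using NumPy) computes a proportion‐of‐fixations curve from a trials‐by‐samples array of looks and then bins it into 20‐ms bins; the data are randomly generated purely for illustration.

```python
import numpy as np

# `looks` is an illustrative trials x samples array of 0/1 codes
# (1 = fixating the target), already realigned so that column 0 is
# target-word onset; 250 Hz sampling assumed (one sample every 4 ms).
rng = np.random.default_rng(0)
n_trials, n_samples = 29, 250  # 250 samples = 1000 ms at 250 Hz
looks = (rng.random((n_trials, n_samples)) < 0.3).astype(int)

# If a response terminated a trial early, the final fixation would be
# extended through the end of the window before this step (see text).
proportions = looks.mean(axis=0)  # one proportion per 4-ms sample,
                                  # e.g., 7 looks on 29 trials -> 0.24

# Down-sample into 20-ms bins (5 samples per bin at 250 Hz) to reduce
# incidental moment-by-moment variation.
bin_size = 5
usable = n_samples // bin_size * bin_size
binned = proportions[:usable].reshape(-1, bin_size).mean(axis=1)
print(binned[:5])  # first five 20-ms bins
```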

Statistical Analyses

VW eye‐movement data can be analyzed with a range of statistical analyses on dependent measures that provide information about the speed and ease of target identification and the degree to which the participant considers competing interpretations. The most basic types of analyses examine the timing or occurrence of saccades to the target and competitor(s), such as the time it takes to generate a saccade to the target (on trials on which it was not already fixated), or the likelihood of making a saccade to the target or competitor during a time window. Analyses of mean fixation proportions across time windows can yield a more focused and detailed measure of the degree to which a picture is looked at over a temporal region. (Note that fixation proportions are bounded between 0 and 1 and thus violate distributional assumptions of many statistical tests and models. In such cases, an appropriate data transformation, such as log odds or empirical log odds, is required; see Barr, 2008, and Jaeger, 2008.) An important limitation of mean fixation proportions is that they do not capture trends in changes in fixation proportions across the window for which they are computed. Some analysis methods model the proportion‐of‐fixations curves directly (e.g., growth‐curve analysis, Mirman, Dixon, & Magnuson, 2008, and Mirman, 2014; generalized additive mixed models, Nixon et al., 2016; bootstrapped difference of time series, Oleson, Cavanaugh, McMurray, & Brown, in press). Vandeberg, Bouwmeester, Bocanegra, and Zwaan (2013) introduced a different type of analysis, which predicts the likelihood of eye‐movement transitions from one type of picture to another as a function of time.

In most studies, researchers are interested in eye movements in response to the presentation of relevant linguistic information in the speech stream, which translates to temporal windows that are time‐locked to particular linguistic events (e.g., a window that captures eye movements during the presentation of the target word). For example, if one is interested in looks that could be triggered by "put" in Put the large apple before effects of "large," then the region might extend from the onset of "put" plus 200 ms to the onset of "large" plus 200 ms. If there was a theoretical reason to focus on the region before "apple," then the region that began with the onset of "put" would end 200 ms after the onset of "apple." Note that these regions must be calculated for each item.

Researchers often want to compare two or more conditions over an extended time interval, starting with the onset of a word. Here one can use any size window. However, the choice of window size should be motivated and fixed before analysis. Any change in window size should be acknowledged as a post‐hoc choice, and the windows that did not show significant effects should be reported. Selectively reporting statistically significant results for post‐hoc time windows is a form of "p‐hacking" (cherry‐picking the analyses one reports to obtain a statistically significant result), which sharply increases the odds that results will not replicate. Perhaps the most dangerous form of p‐hacking arises when one first inspects a proportion‐of‐fixations plot and then chooses the most promising windows.
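Two of the steps just discussed, computing per‐item analysis windows time‐locked to linguistic events and transforming bounded proportions before analysis, can be sketched as follows (in Python). The word‐onset times are illustrative assumptions; the empirical‐logit formula follows Barr (2008).

```python
import math

# (1) Per-item analysis windows time-locked to linguistic events.
PROCESSING_DELAY_MS = 200  # approximate delay between signal and saccade

# Hypothetical per-item onsets (ms) of "put" and "large" in
# 'Put the large apple ...'; real values come from the segmented speech.
item_onsets = {"item01": (0, 620), "item02": (0, 580)}
windows = {item: (put_onset + PROCESSING_DELAY_MS,
                  large_onset + PROCESSING_DELAY_MS)
           for item, (put_onset, large_onset) in item_onsets.items()}
print(windows)  # {'item01': (200, 820), 'item02': (200, 780)}

# (2) Empirical-logit transformation of a window mean (Barr, 2008),
# used because raw proportions are bounded between 0 and 1.
def empirical_logit(k, n):
    """Empirical logit of k target-fixation samples out of n samples."""
    return math.log((k + 0.5) / (n - k + 0.5))

# e.g., the target was fixated on 40 of 100 samples in an item's window:
print(empirical_logit(40, 100))  # -> about -0.40
```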
If there are more looks to a related object (the target or competitor) relative to an unrelated object, this suggests that the listener perceived evidence for the linguistic information uniquely associated with the related object. In production studies, looks are taken as evidence that the participant attended to, and therefore likely linguistically encoded, that object. When contrasting looks to multiple objects within the same display, it may be necessary to compute a single measure in the form of a ratio for some types of statistical analyses that require independent measures. For example, the following ratio evaluates whether the mean proportion of fixations to the competitor is higher than that to a distractor (in which case the result is larger than .5):

$$\frac{\text{proportion of fixations to competitor}}{\text{proportion of fixations to competitor} + \text{proportion of fixations to distractor}}$$

Variation in the degree of evidence in favor of a particular linguistic interpretation as a function of experimental condition can be assessed by comparing looks to the same target or competitor object across conditions. For instance, in the Example Study section, we discuss a VW study by Dahan and Tanenhaus (2004), who predicted (and found) a statistically significant difference in cohort competition between two experimental conditions.

It is important to note that current analyses do not map onto a generative model of the primary data that are evaluated in VW studies, which come from saccadic eye movements to real or depicted objects. These saccades are events, and they are state‐dependent. At the very least, where and when a saccade is executed is affected by the spatial relationship among objects (e.g., distance, and whether a vertical, horizontal, or oblique trajectory is required to shift gaze to a new location). However, current methods analyze where people are looking and not the events that underlie looks. We believe that advances in the analysis of VW data will come from the application of generative statistical models that predict events at the trial level, as a function of linguistic input, time, and the eye‐movement record up to that point in time (i.e., the sequence of saccades and fixations, and their durations). While no such analyses currently exist, if and when they are developed, common practice may change.

Example Study

In this section we discuss an experiment that combines aspects of sentence processing and word recognition. Dahan and Tanenhaus (2004) conducted a VW study in Dutch to examine the effect of verb‐based semantic constraints on lexical competition. Listeners heard spoken sentences that mentioned one of four depicted objects (the target) in the context of a semantic constraint that was introduced either before or after the target word. Their task was to click on the target object. Dahan and Tanenhaus took advantage of the fact that in Dutch, a verb can precede or follow its subject. When the verb precedes the noun, as in Nog nooit klom een bok zo hoog (Never before climbed a goat so high), it creates a constraining context that is consistent with the target bok (goat) but inconsistent with the cohort competitor bot (bone). When the verb follows the noun, Nog nooit is een bok zo hoog geklommen (Never before has a goat climbed so high), the context preceding the target noun is neutral with respect to the target and the cohort competitor. (For ease of exposition we will use the English target "goat" and substitute the word "goal" as a cohort competitor, because the English words "goat" and "bone" do not overlap at onset.)

The experimental manipulation involved a repeated‐measures design, in which each participant was exposed to multiple trials in each experimental condition. Issues that could arise from repeated presentation of pictures or target words, in particular across conditions, were avoided by presenting each item once and splitting the items across experimental conditions. For each participant, each item occurred in only one of the experimental conditions (neutral verb or constraining verb), and the assignment of items to conditions was counterbalanced across participants. Filler trials were designed to counteract contingencies in the experimental trials and included sentences with a verb that was semantically consistent with two of the pictures in the display (e.g., melt; ice cream/butter). In a subset of the fillers, the two distractors were phonologically similar, to discourage participants from developing the expectation that pictures with phonologically similar names were likely targets. The order of trials was randomized (a sketch of such constrained randomization follows Figure 5.4 below). (Note that with some setups, it can be helpful to have practice trials at the start of the experiment to familiarize the participant with the experimental task and procedure.)

Figure 5.1 (shown at the beginning of this chapter) presents an example of a visual display including a target (goat), a cohort competitor (goal), an unrelated distractor (mirror), and a semantic competitor (spider). The latter was included to provide a baseline to separate effects of processing the target from effects that are due only to the verb. Figure 5.4 presents the proportion of fixations to the target, cohort competitor, and distractor. In the neutral‐verb condition, competitor fixation proportions increased from about 100 to 400 ms after the onset of the target word, and then dropped until they merged with distractor fixations. (The early looks might reflect coarticulation and/or information from the preceding verb.) This suggests that the cohort competitor was temporarily considered for recognition during the presentation of the target word. In the constraining‐verb condition, a strikingly different pattern was obtained: Competitor fixation proportions did not increase significantly above their baseline level. This suggests that listeners made immediate use of verb‐semantic constraints made available by the verb climb to eliminate the cohort competitor goal from the set of candidate words upon hearing the target word goat.

Figure 5.4 Proportion of fixations over time (from target‐word onset) to target (goat), cohort competitor (goal), and distractor in neutral and constraining verb conditions in Experiment 1 in Dahan and Tanenhaus (2004). Adapted from Dahan & Tanenhaus (2004). Reproduced with permission of the American Psychological Association. [Panels: neutral verb and constraining verb; curves for target, competitor, and distractor, proportion of fixations 0.0–1.0 against time 0–1000 ms.]
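As a concrete illustration of the constrained trial ordering mentioned above, the following sketch (in Python) reshuffles a trial list until no two consecutive trials share a condition. The condition labels and the specific constraint are illustrative assumptions; published studies often use additional constraints (e.g., on the spacing of fillers).

```python
import random

# Pseudo-randomization by rejection sampling: shuffle the trial list and
# accept the first order in which no two adjacent trials share a condition.
def pseudo_randomize(trials, key=lambda t: t["condition"], max_tries=1000):
    order = trials[:]
    for _ in range(max_tries):
        random.shuffle(order)
        if all(key(a) != key(b) for a, b in zip(order, order[1:])):
            return order
    raise RuntimeError("No valid order found; relax the constraint.")

trials = ([{"item": i, "condition": "neutral"} for i in range(8)]
          + [{"item": i, "condition": "constraining"} for i in range(8)]
          + [{"item": i, "condition": "filler"} for i in range(8)])
for t in pseudo_randomize(trials)[:6]:
    print(t)
```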

Advantages and Common Applications

Unlike other on‐line psycholinguistic paradigms, the VWP is intrinsically referential: Language‐mediated eye movements to objects and locations in the visual workspace occur because processing the language makes the object or region of the workspace potentially relevant. A particular advantage of the VWP is its versatility. The VWP can be used in a wide range of natural (goal‐based) tasks, with minimal restrictions. It can be used with a range of populations, including infants (using a variant of the preferential looking paradigm, see Chapter 2), elderly adults, and patients (e.g., aphasics). It has proved particularly useful in studying sentence processing in pre‐literate children. It can also be used to study most topics in language comprehension (and to a lesser extent, language production) at multiple levels, ranging from phonetic to pragmatic processing. We briefly outline some of the most common applications.

The VWP is frequently used as a real‐time measure in speech perception and spoken word recognition in continuous speech because it is extremely sensitive to fine‐grained manipulations of the speech signal, including small variations in sub‐phonemic acoustic/phonetic variables, for example, 5 ms within‐category differences in VOT (McMurray, Tanenhaus, & Aslin, 2002). We note that, while they are related, sensitivity and sampling rate are not equivalent. A dependent measure can have a high sampling rate, yet not be sensitive to a within‐category 5 ms manipulation in VOT.

The VWP is used to study a wide spectrum of questions in sentence processing at multiple linguistic levels. In comprehension it is used in investigations of prosody and intonation, parsing, reference and discourse, and issues in experimental semantics and pragmatics. It is also well suited for studying the interaction of constraints across different linguistic levels, including asynchronous information. In language production, the VWP has been used to study lexical and grammatical encoding, and the interface between message planning, message updating, and utterance formulation.

The VWP is frequently used to study interactive task‐based dialogue in conjunction with goal‐based tasks such as the Edinburgh MAP task and targeted language games—a term introduced by Brown‐Schmidt and Tanenhaus (2008). The MAP task is a collaborative task in which speakers sit opposite one another, with each having their own map. The instructor, who has a route, directs the follower to reproduce the route. Targeted language games are a type of interactive referential communication task constructed so that the conditions that one might design as experimental trials in a factorial experiment emerge spontaneously and with sufficient frequency to conduct informative analyses.

Disadvantages, Limitations, and Concerns

There are some intrinsic limitations to the VWP, both in the types of questions that can naturally be asked with VW designs and in the types of inferences that can be drawn from VW data. Some of these limitations are obvious and have to do with domains of applicability and inquiry. For example, the VWP cannot be used for the study of (a) language that is not at least partially related to the visual world; (b) language that is about events and entities that cannot easily be depicted (but cf. the printed‐words paradigm); and (c) reading.

Other limitations are more nuanced. Many questions in sentence processing focus on "processing difficulty." Because the VWP is a referential task, there is no transparent mapping between the time to fixate a potential referent and a theoretical construct hypothesized to underlie processing difficulty. For example, to test theoretically motivated hypotheses, an experimenter could manipulate "surprisal" and see whether it affects the likelihood of fixating a mentioned target, the duration of fixations, and the time from an acoustic landmark (e.g., word onset) to when a saccade is launched. However, there is no clear linking hypothesis that would map surprisal onto any of these measures. VW studies can be used to address questions about when different types of information are used and integrated. However, one cannot attribute a fixation to a particular process (word recognition, parsing, inference, etc.), nor infer a processing stage (e.g., pre‐ or post‐bottleneck) from the timing of a saccade.

Perhaps the broadest concern about the VWP is that because the visual world creates a restricted set of possible referents, it might introduce task‐specific strategies that bypass "normal" language processing. This issue has been directly addressed in studies of spoken‐word recognition. Three important results are incompatible with the concern that normal processing is bypassed. First, there are effects of lexical frequency (Dahan, Magnuson, & Tanenhaus, 2001). Second, there are neighborhood effects: Words that are similar to many other words (neighbors) are harder to process than words with fewer neighbors (Magnuson, Dixon, Tanenhaus, & Aslin, 2007). Third, target fixations are sensitive to frequency and neighborhood in so‐called "hidden competitor" designs in which all of the non‐target pictures are unrelated distractors and none of the words and pictures are repeated (Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Magnuson et al., 2007).

A related concern is that because most language use is not about concrete co‐present referents, conclusions drawn from VW studies will not generalize to less constrained situations. To the best of our knowledge, there is no evidence suggesting that this might be the case. Rather, insights from studying language processing in constrained situations using the VWP seem to scale up to language that is not about a restricted visual context (for discussion see Tanenhaus & Brown‐Schmidt, 2008).

Conclusion

The Visual World Paradigm provides a sensitive, time‐locked response measure that can be used to investigate a wide range of psycholinguistic questions in language production and language comprehension, ranging from speech perception to collaborative, task‐oriented dialogue. The VWP can be used with participants of all ages, including special populations.

In VW studies, eye movements to objects or pictures in a visual workspace are monitored as the participant produces and/or comprehends spoken language that is about the co‐present "visual world." As visual attention shifts to an object in the workspace, there is a high probability that a saccadic eye movement will rapidly follow to bring the attended area into foveal vision. Where a participant is looking, and in particular when and to where saccadic eye movements are launched in relationship to information in the speech signal, can therefore provide insights into real‐time language processing.

The VWP combines spoken language processing and visual search. Therefore, users need to take into account how different aspects of language impact the speech signal. They also need to be cognizant of results about the relationship between eye movements and visual attention from the relatively new literature on vision in natural tasks.

Acknowledgments

We thank Delphine Dahan, Bob McMurray, and John Trueswell for helpful comments.

Key Terms

Competitor Object in the visual workspace that is related to the target along some specified dimension.
Distractor Object in the visual workspace that is unrelated to the target.
Look‐and‐listen VWP The participant is not given an explicit task.
Point‐of‐disambiguation Point in time at which speech and visual context uniquely specify the target; also: point in time at which the proportion‐of‐fixations curves diverge in favor of the target.
Proportion of fixations Proportion of trials on which the participant looks at a particular type of picture.
Target Object in the visual workspace that is the referent of the linguistic expression.
Task‐based VWP The participant performs a well‐defined action in the VW.
Visual world paradigm (VWP) Experimental paradigm that monitors eye movements to objects in a visual workspace as participants listen to, or produce, spoken language about elements of the workspace.

References

Allopenna, P., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419–439.
Altmann, G. T. M. (2004). Language‐mediated eye movements in the absence of a visual world: The 'blank screen paradigm'. Cognition, 93, B79–B87.
Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73, 247–264.
Altmann, G. T. M., & Kamide, Y. (2007). The real‐time mediation of visual attention by language and world knowledge: Linking anticipatory (and other) eye movements to linguistic processing. Journal of Memory and Language, 57, 502–518.
Barr, D. J. (2008). Analyzing 'visual world' eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59, 457–474.
Brown‐Schmidt, S., & Tanenhaus, M. K. (2008). Real‐time investigation of referential domains in unscripted conversation: A targeted language game approach. Cognitive Science, 32, 643–684.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real‐time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84–107.
Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken‐word recognition: Evidence from eye movements. Cognitive Psychology, 42, 317–367.
Dahan, D., Magnuson, J. S., Tanenhaus, M. K., & Hogan, E. (2001). Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes, 16, 507–534.
Dahan, D., & Tanenhaus, M. K. (2004). Continuous mapping from sound to meaning in spoken‐language comprehension: Immediate effects of verb‐based thematic constraints. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 498–513.
Dahan, D., Tanenhaus, M. K., & Salverda, A. P. (2007). The influence of visual processing on phonetically driven saccades in the "visual world" paradigm. In R. P. G. van Gompel, R. H. Fischer, W. S. Murray, & R. L. Hill (Eds.), Eye movements: A window on mind and brain (pp. 471–486). Oxford: Elsevier.
Droll, J. A., & Hayhoe, M. M. (2007). Trade‐offs between gaze and working memory use. Journal of Experimental Psychology: Human Perception and Performance, 33, 1352–1365.
Griffin, Z. M., & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11, 274–279.
Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9, 188–194.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Keysar, B., Barr, D. J., Balin, J. A., & Brauner, J. S. (2000). Taking perspective in conversation: The role of mutual knowledge in comprehension. Psychological Science, 11, 32–38.
Magnuson, J. S., Dixon, J. A., Tanenhaus, M. K., & Aslin, R. N. (2007). The dynamics of lexical competition during spoken word recognition. Cognitive Science, 31, 133–156.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within‐category phonetic variation on lexical access. Cognition, 86, B33–B42.
McQueen, J. M., & Viebahn, M. C. (2007). Tracking recognition of spoken words by tracking looks to printed words. Quarterly Journal of Experimental Psychology, 60, 661–671.

Meyer, A. S., Sleiderink, A. M., & Levelt, W. J. M. (1998). Viewing and naming objects: Eye movements during noun phrase production. Cognition, 66, B25–B33.
Mirman, D. (2014). Growth curve analysis and visualization using R. Chapman and Hall/CRC.
Mirman, D., Dixon, J. A., & Magnuson, J. S. (2008). Statistical and computational models of the visual world paradigm: Growth curves and individual differences. Journal of Memory and Language, 59, 475–494.
Nixon, J. S., van Rij, J., Mok, P., Baayen, R. H., & Chen, Y. (2016). The temporal dynamics of perceptual uncertainty: Eye movement evidence from Cantonese segment and tone perception. Journal of Memory and Language, 90, 103–125.
Oleson, J. J., Cavanaugh, J. E., McMurray, B., & Brown, G. (in press). Detecting time‐specific differences between temporal nonlinear curves: Analyzing data from the visual world paradigm. Statistical Methods in Medical Research.
Parkhurst, D., Law, K., & Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42, 107–123.
Salverda, A. P., Brown, M., & Tanenhaus, M. K. (2011). A goal‐based perspective on eye movements in visual world studies. Acta Psychologica, 137, 172–180.
Salverda, A. P., Kleinschmidt, D., & Tanenhaus, M. K. (2014). Immediate effects of anticipatory coarticulation in spoken‐word recognition. Journal of Memory and Language, 71, 145–163.
Snedeker, J., & Trueswell, J. C. (2004). The developing constraints on parsing decisions: The role of lexical‐biases and referential scenes in child and adult sentence processing. Cognitive Psychology, 49, 238–299.
Tanenhaus, M. K. (2004). On‐line sentence processing: Past, present, and future. In M. Carreiras & C. Clifton Jr. (Eds.), On‐line sentence processing: ERPs, eye movements and beyond (pp. 371–392). New York: Psychology Press.
Tanenhaus, M. K., Spivey‐Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
Tatler, B. W., Hayhoe, M. M., Land, M. F., & Ballard, D. H. (2011). Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11, 1–23.
Trueswell, J. C., Sekerina, I., Hill, N. M., & Logrip, M. L. (1999). The kindergarten‐path effect: Studying on‐line sentence processing in young children. Cognition, 73, 89–134.
Yee, E., & Heller, D. (2012). Looking more when you know less: Goal‐dependent eye movements during reference resolution. Poster presented at the Annual Meeting of the Psychonomic Society, Minneapolis, MN.

Further Reading and Resources

For an historical review of foundational VW studies: Spivey, M. J., & Huette, S. (2016). Towards a situated view of language. In P. Knoeferle, P. Pyykkönen‐Klauck, & M. W. Crocker (Eds.), Visually situated language comprehension (pp. 1–30). Amsterdam/Philadelphia: John Benjamins Publishing.
For a more comprehensive review: Huettig, F., Rommers, J., & Meyer, A. S. (2011). Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica, 137, 151–171.
As a methodological tool for interactive conversation: Tanenhaus, M. K., & Trueswell, J. C. (2005). Eye movements as a tool for bridging the language‐as‐product and language‐as‐action traditions. In J. C. Trueswell & M. K. Tanenhaus (Eds.), Approaches to studying world‐situated language use: Bridging the language‐as‐product and language‐as‐action traditions (pp. 3–37). Cambridge, MA: MIT Press.

Vision and eye movements in natural tasks: Land, M. F. (2009). Vision, eye movements, and natural behavior. Visual Neuroscience, 26, 51–62.
R packages for processing and visualizing visual‐world data: Dink, J. W., & Ferguson, B. F. (2015). eyetrackingR: An R library for eye‐tracking data analysis (R package version 0.1.6). Retrieved from http://www.eyetrackingr.com. Porretta, V., Kyröläinen, A., van Rij, J., & Järvikivi, J. (2016). VWPre: Tools for preprocessing visual world data (R package version 0.5.0). Retrieved from https://cran.rstudio.com/web/packages/VWPre/

6 Word Priming and Interference Paradigms

Zeshu Shao and Antje S. Meyer

Abstract

In word priming and interference studies, researchers typically present participants with pairs of words (called primes and targets) and assess how the processing of the targets (e.g., "nurse") is affected by different types of primes (e.g., semantically related and unrelated primes, such as "doctor" and "spoon"). Priming and interference paradigms have been used to study a broad range of issues concerning the structure of the mental lexicon and the ways linguistic representations are accessed during word comprehension and production. In this chapter, we illustrate the use of the paradigms in two exemplary studies, and then discuss the factors researchers need to take into account when selecting their stimuli, designing their experiments, and analyzing the results.

Introduction

In order to talk to each other, people need to have a shared vocabulary. It has long been known that our repository of words, the mental lexicon, is not a random heap of words, but has a complex internal structure. There is plenty of anecdotal evidence illustrating this. For instance, we can easily provide associates ("chicken – hen", "red – fire"), opposites ("tall – short", "good – bad"), or rhymes of words ("cat – mat", "bay – day"). This shows that our memory representations of associates, opposites, and rhymes are somehow linked. These links can work against us, for instance when we find ourselves asserting the opposite of the intended meaning ("I hereby declare the meeting closed, eh, opened"), or when we are in a tip‐of‐the‐tongue state, where similar sounding words appear to block access to a target ("it's not Rutherford, Remington, … Rubicon!"). These observations show that the mental lexicon represents not only properties of individual words but also multiple relationships between them. Describing these relationships and understanding their development and their impact on language production and comprehension have been key issues in psycholinguistics (Gaskell, 2007). Among the most important tools in this research area are word priming and interference paradigms. Their properties are discussed in the following sections.

Assumptions and Rationale

The goal of word priming studies is to observe the effect of a first stimulus, the prime, on the response speed (measured in milliseconds) and/or accuracy (measured as proportion of correct responses) to another stimulus, the target. The prime may, for instance, be the word “cat” and the target the word “mouse.” In order to establish the effect of a prime, one needs to include a suitable baseline condition with a neutral or unrelated prime in the experiment (e.g., a row of “xxxx” or an unrelated word, such as “fork,” for the target “cat”). The goal of interference studies is exactly the same: To observe the effect of a first stimulus, the distractor, on the speed and/or accuracy of responding to another stimulus, the target. Prototypical priming and interference studies differ in the timing of the stimuli, with primes preceding the targets and distractors co‐occurring with the targets; and they also differ in the direction of the effects, with priming studies typically yielding faster and/or more accurate responses in the related relative to the unrelated condition, and interference studies yielding slower and/or less accurate responses in the related condition. However, as neither the timing of the stimuli nor the directions of the observed effects distinguishes clearly between the two types of studies, we consistently refer to primes and priming studies in this chapter. The underlying assumptions of word priming studies are straightforward: To affect the response to the target, the prime must have been processed, and the activated mental representation of the prime must be related in some way to the representation of the target. Therefore, priming studies can be used in two ways, namely, first, to study the processing of stimuli and, second, to determine the properties of mental representations and the relationships among them.
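The following minimal sketch (in Python) illustrates the basic computation implied by this rationale: mean response speed and accuracy per prime condition, with reaction times computed over correct responses only, and the priming effect as the difference between conditions. The trial records are invented purely for illustration.

```python
# Compute per-condition mean RT (correct trials only) and accuracy, then
# the priming effect (unrelated minus related; positive = facilitation).
trials = [
    {"condition": "related",   "rt_ms": 855, "correct": True},
    {"condition": "related",   "rt_ms": 840, "correct": True},
    {"condition": "unrelated", "rt_ms": 940, "correct": True},
    {"condition": "unrelated", "rt_ms": 951, "correct": False},
]

def condition_means(trials, condition):
    same = [t for t in trials if t["condition"] == condition]
    correct = [t for t in same if t["correct"]]
    mean_rt = sum(t["rt_ms"] for t in correct) / len(correct)
    accuracy = len(correct) / len(same)
    return mean_rt, accuracy

rt_rel, acc_rel = condition_means(trials, "related")
rt_unrel, acc_unrel = condition_means(trials, "unrelated")
print("priming effect (ms):", rt_unrel - rt_rel)
print("accuracy:", acc_rel, "vs", acc_unrel)
```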

Two Exemplary Studies

To illustrate the use of priming paradigms we describe two classic studies, a word recognition study by D. E. Meyer and Schvaneveldt (1971) and a picture naming study by Glaser and Düngelhoff (1984). Meyer and Schvaneveldt were interested in the memory search processes underlying lexical decision, that is, the decision whether or not a written string of letters is a word. The trials of their experiments had the following structure: At trial onset, the participants saw the word "ready" on the screen, followed first by a fixation box and then by a pair of stimuli (see Figure 6.1). These stimuli remained on the screen until the participant reacted. After 2 seconds, the next trial began. The stimuli were either two words, two nonwords, or a word and a nonword. Nonwords (for instance MARB) were derived from existing English words, mostly by replacing a single letter. Importantly, the words shown together were either associatively related (as in "bread" – "butter") or unrelated ("nurse" – "butter").

Figure 6.1 An illustration of the trial structure in Meyer and Schvaneveldt (1971). The presentation time for "READY" is described as "brief" in the text.

In the first experiment, participants pressed one button on a push‐button panel when both stimuli were words and another button when one or both stimuli were nonwords. Twelve participants were tested. The authors recorded the accuracy of their responses, measured as the proportion of correct word and nonword responses, and the response speed for correct responses, measured from the onset of the word pair. The error rates for related and unrelated pairs were 6.3% and 8.7%, respectively; the corresponding reaction times were 855 ms and 940 ms, respectively. The 85 ms difference between the two conditions was statistically significant. In the second experiment, the participants pressed one button when the two stimuli were both words or both nonwords, and another button when one of them was a word and the other was a nonword. Again, accuracy and response speed for correct responses were recorded. As in the first experiment, responses to word stimuli were more likely to be correct and faster when the words shown together were related than when they were unrelated. To account for these findings, Meyer and Schvaneveldt proposed that there might be passive spread of activation between associated words in the lexicon, so that in the related condition reading the first word facilitated access to the second word, or that the second word might be faster to access from a nearby (associated) location in the lexicon than from a location farther away.

The second classic study to be described was carried out by Glaser and Düngelhoff (1984, Experiment 1). They presented participants with word‐picture combinations as shown in Figure 6.2 and asked them either to name the pictures or to read aloud the words. Earlier studies had shown that speakers are slower to name pictures accompanied by semantically related than by unrelated written words. Thus, there is a semantic interference effect for picture naming. By contrast, naming written words (reading aloud) is not hindered by the presence of related compared to unrelated pictures.

Figure 6.2 An illustration of the prime‐target pairs used in Glaser and Düngelhoff (1984). [Panels A–D: XXXXXX (neutral prime), Car (incongruent prime), Church (category‐congruent prime), and House (concept‐congruent prime), each superimposed on the target picture.]

This pattern had been linked to the greater speed and automaticity of word naming compared to picture naming. To assess the importance of the speed of access to the meanings of the stimuli for the occurrence of the semantic interference effect in picture naming, Glaser and Düngelhoff varied the time interval between the onsets of the picture and the word (the stimulus onset asynchrony, SOA), giving either the word or the picture a head start. Participants saw four types of prime‐target pairs, which the authors called neutral (a row of "xxxxxx" combined with a picture, as in "xxxxxx" – "house"), incongruent ("car" – "house"), category congruent ("church" – "house"), and concept congruent ("house" – "house"). The written stimulus was superimposed upon the picture as shown in Figure 6.2. The presentation of the two stimuli either began at the same time (i.e., with an SOA of 0 ms), or the presentation of the word began 100, 200, 300, or 400 ms before or after picture onset. Both stimuli disappeared 200 ms after the onset of the response. One group of 18 participants had to name the pictures ignoring the words, and another group of 18 participants named the words ignoring the pictures. Glaser and Düngelhoff recorded the accuracy of the responses, that is, whether or not the participants named the word or picture correctly, and the reaction times for correct responses, measured from the onset of the target.

The results obtained for the response latencies are summarized in Figure 6.3. The top panel shows the results for the picture naming task. Compared to the neutral prime baseline, concept‐congruent primes speeded up the responses. This was true for primes presented at any time between 400 ms prior to target onset until 200 ms after target onset. In the same broad time window, incongruent primes slowed down target naming relative to neutral ones. Most importantly, in a narrower time window, with primes presented at picture onset or 100 ms later, category‐congruent primes interfered with target naming more (i.e., slowed it down more) than incongruent primes. Thus, in this time window there was a semantic interference effect. The results obtained for word naming are shown in the bottom panel of the figure. Here, there was little difference in the effects of the different primes, regardless of the SOA. Thus, even when given a head start, semantically related pictures did not interfere with word naming. This shows that variables other than the speed of access to meaning representations must be responsible for the fact that there is a semantic interference effect in picture naming but not in word naming.

Figure 6.3 Results obtained by Glaser and Düngelhoff (1984). Average target naming latencies (in milliseconds, error bars represent standard errors of the mean) per SOA (ms) and stimulus type (incongruent, neutral, category congruent, and concept congruent) for picture naming (top panel) and word naming (bottom panel).

In sum, the goal of priming studies is to observe the effects of different types of primes on the processing of targets. As will be further illustrated below, priming experiments can be designed such that specific hypotheses can be tested concerning the representations of words in the mental lexicon and concerning the processes involved in accessing these representations.

Apparatus

For a standard priming experiment, no specialized apparatus is required. The stimuli can be presented using any laptop or desktop computer, and the experiment can be controlled using standard experimental software packages, such as Presentation® software (Version 0.70, www.neurobs.com) or E‐prime (Schneider, Eschman, & Zuccolotto, 2012). For masked priming experiments (see below) tight control of ambient lighting in the experimental room and of the timing of the stimuli is required, which needs to be kept in mind when choosing the monitor for stimulus presentation. Speech onset latencies in priming experiments using vocal responses are often measured online using voice keys associated with experimental software packages, which register the onset and offset of speech. However, given the poor accuracy of most voice keys, researchers often record the responses and measure the speech onset latencies off‐line, using software packages such as Praat (Boersma, 2001) or Audacity® software (Version 1.2.2, http://audacity.sourceforge.net/). Specialized equipment is, of course, required for fMRI, MEG, and EEG experiments using priming paradigms.
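As an illustration of off‐line speech‐onset measurement, the following sketch (in Python, standard library only) finds the first point at which the recorded amplitude exceeds a threshold for a sustained stretch. The file name, threshold, and minimum duration are illustrative assumptions; this is not the algorithm used by Praat or Audacity, and real measurements are usually verified by hand.

```python
import struct
import wave

# Simple amplitude-threshold onset detection for a recorded response.
def speech_onset_ms(path, threshold=0.05, min_duration_ms=10):
    """Return the estimated speech onset (ms from recording start), or
    None if no speech is detected. Assumes 16-bit mono WAV input."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1 and w.getsampwidth() == 2
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    samples = [s / 32768.0
               for s in struct.unpack("<%dh" % (len(raw) // 2), raw)]
    run_needed = int(rate * min_duration_ms / 1000)  # sustained stretch
    run = 0
    for i, s in enumerate(samples):
        run = run + 1 if abs(s) > threshold else 0
        if run >= run_needed:
            onset_sample = i - run_needed + 1
            return 1000.0 * onset_sample / rate
    return None

# print(speech_onset_ms("response_trial01.wav"))  # hypothetical file name
```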

Designing Priming Experiments

In designing priming experiments, researchers need to decide on the modality of the primes and targets, their properties, the relationships between them, the timing of the events during a trial and in the entire experiment, and the types of responses to the stimuli (e.g., naming or categorization). These decisions depend largely on the hypotheses to be investigated. In this section we describe some of the options to be considered in making each decision.

Modality

A first decision concerns the modalities of primes and targets. The stimuli can be spoken sounds or words, or they can be visual stimuli, that is, strings of letters or written words, signed words, or pictures. Primes and targets can be presented in the same modality or in different modalities. For example, a written prime word may be followed by a written or a spoken target word; or a spoken prime word may be followed by a target picture or a signed word. When prime and target are presented in different modalities, the experiment is a cross‐modal priming experiment.

The choice of stimulus modality depends on the goals of the study and on the researcher's theory about the processing of stimuli in different modalities. For instance, studies of lexical access during speaking often use picture naming tasks, whereas reading studies typically use written stimuli. Studies of spoken word recognition often use spoken primes and targets, or spoken primes and written targets (Marslen‐Wilson & Zwitserlood, 1989). Presenting primes and targets in different modalities is often useful because the stimuli can then be presented simultaneously without causing mutual sensory masking.

For many research questions, the modality of the stimuli is not critical. For instance, researchers interested in the representation of semantic knowledge that is accessed regardless of the modality of the input may use either written or spoken words. Whereas Glaser and Düngelhoff (1984) used written category‐congruent and incongruent primes, other picture naming studies used spoken prime words of the same types and replicated the semantic interference effect observed in the original study (Roelofs, 2005; Schriefers, Meyer, & Levelt, 1990).

Properties of Primes and Targets and Prime‐Target Combinations

The properties of primes and targets and their combinations define the experimental conditions of priming experiments (often along with other variables, such as the timing of the stimuli). Obviously, the choice of stimuli depends on the aims of the study. Priming studies have been used in many different research contexts, and consequently many types of primes and targets have been used. To give just a few examples, primes and targets can vary in language (English, Turkish, American Sign Language); they can be part of the participants' first or second language; they can be words or nonwords; they can be high or low in frequency, long or short, concrete or abstract, emotionally neutral or positive, regular or "taboo" words.

Similarly, priming studies have implemented many different types of prime‐target relationships. In addition to a substantial body of studies using various types of meaning‐related prime‐target pairs, there are numerous studies that used morphologically related pairs (e.g., related verb forms as in "fall – fell," Crepaldi, Rastle, Coltheart, & Nickels, 2010; or stems and compounds as in "butter – butter dish," Lüttmann et al., 2011), orthographically related pairs (e.g., "castfe – castle," Adelman et al., 2014), phonologically related pairs (e.g., "ma – mama," Becker, Schild, & Friedrich, 2014), and identical pairs (Kane et al., 2015). In most studies prime and target appear in the same language, but studies of word processing in bilingual speakers often present primes and targets in different languages (Wang, 2013). This allows one to draw conclusions about the relationships between the participants' first and second language lexicon. Primes can also be "novel words," that is, strings that have been associated with novel or existing concepts in a preceding training phase (Gaskell & Dumay, 2003). Comparing the priming effects from novel words and existing words allows researchers to estimate how well the novel lexical items have been learned, and whether they are functionally similar to existing words in the participants' mental lexicon.

Many studies have used several types of related primes with appropriate controls and/or several types of targets and compared the effects obtained for the different prime‐target combinations. Such designs can be used to test specific hypotheses about the representations of words. For instance, Lüttmann et al. (2011) presented target pictures (e.g., "butter") with primes that were transparent compounds ("butter dish") or opaque compounds ("butterfly"). One of the goals of the study was to determine whether the individual constituents of the compounds became activated only in transparent compounds or in both types of compounds. The results supported the latter hypothesis: The average picture naming latency was 855 ms (SD = 145) in the unrelated condition and significantly lower in the transparent prime condition (831 ms, SD = 122) and in the opaque prime condition (831 ms, SD = 134). Thus, both types of related primes facilitated target naming equally, and the difference between the two related conditions was not significant. Designs with multiple prime types have also been used in many studies of visual word recognition. For instance, numerous studies have compared the effects of primes that were both orthographically and phonologically related to the targets to the effects of primes that were related to the targets only in orthographic form or only in sound. Many of these studies aimed to assess the role of the activation of the sound forms of words during reading (for a review see Leineger, 2014).
The large priming literature demonstrates that many types of related primes affect target processing. This indicates that speakers and listeners are sensitive to many different types of relationships between stimuli they perceive together or shortly after each other, which is perhaps not too surprising. However, related primes differ in the strength of their effects. A common finding is that priming effects are stronger for highly similar than for less similar prime‐target pairs. For instance, Meyer (1991) showed that phonological priming effects increased with the amount of form overlap between words priming each other: Form overlap in the word‐onset consonant alone, as in "kever – kilo," yielded a facilitatory effect of about 30 ms, compared to an unrelated condition ("hamer – kilo"), whereas overlap in the entire first syllable ("kilo – kiwi") yielded a facilitatory effect of 50 ms. To give another example, several studies have reported mediated priming effects (e.g., "lion" priming "stripes" via the lexical representation of "tiger," Chwilla & Kolk, 2002; Sass et al., 2009), but such effects are generally weaker than direct priming effects ("tiger" priming "stripes"). For instance, in the study by Chwilla and Kolk (2002), the direct priming effect amounted to 82 ms and the mediated effect to 41 ms. Thus, priming paradigms allow researchers to study not only whether or not the representations of words in the mental lexicon are related, but also how tight their links are.

Similarity between prime and target is not necessarily beneficial to target processing. As noted in the above description of the study by Glaser and Düngelhoff (1984), category‐congruent primes slow down responses in a picture naming task, compared to unrelated primes. By contrast, associatively related primes tend to facilitate target naming or have no effect. An account of this pattern is that both types of primes facilitate the conceptual processing of the targets, but that category‐congruent primes in addition hamper later processes, either the selection of target names from the mental lexicon or the retrieval of the sound form of the target from a response buffer (Mahon et al., 2007; Roelofs, 1992). Thus, comparisons of the effects of different prime types provide insights into the ways different components of the cognitive system cooperate during word processing.

A word priming experiment must feature related and unrelated primes. In most studies each target is combined with each type of prime (e.g., with a semantically related prime, an unrelated prime, and a neutral prime). Thus, each target word appears in each condition. Primes are often also repeated in different conditions. For instance, "dog" might be the related prime for the target "cat" and the unrelated prime for the target "shoe"; and "hat" might be the unrelated prime for "dog" and the related prime for "shoe." Alternatively, one can use different primes and/or different targets in different conditions. However, the words appearing in different conditions then need to be tightly controlled for any properties that may affect their processing, such as their length, frequency, age of acquisition, and so forth. Since perfect matching is often difficult to achieve, and since not all variables that may affect lexical access are known, designs using the same primes and/or targets across conditions are generally preferred. In some priming studies each participant is presented with all prime–target combinations.
This is, for instance, the case for many picture naming studies (e.g., Schriefers et al., 1990). In the picture naming task, items can be repeated because robust priming effects can be obtained even when participants name the same pictures several times. By contrast, in word recognition experiments using word naming or lexical decision, each participant typically sees or hears each target only once, combined with one of the primes for the target; and different groups of participants are presented with different prime‐target combinations. Such designs are complex and require many stimuli and participants, but they are often preferred because the priming effects for word recognition are often subtle and can easily be concealed when participants see or hear a target several times.

Priming experiments often include the same number of related and unrelated trials, typically presented in random or pseudo‐random order. However, many studies include additional unrelated filler trials. Fillers are used in order to discourage participants from using the primes strategically to predict the targets and to separate trials featuring the same stimuli or conditions, thereby reducing unwanted trial‐to‐trial priming effects (Kinoshita, Mozer, & Forster, 2011).
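The between‐participant assignment just described is commonly implemented as a Latin square. The following minimal sketch (in Python) rotates prime conditions over targets across participant groups, so that each participant sees each target only once while, across groups, every target occurs in every condition. The item and condition labels are illustrative assumptions.

```python
# Latin-square list construction: condition assignment is rotated by one
# position for each participant group.
CONDITIONS = ["related", "unrelated", "neutral"]
targets = ["cat", "shoe", "bread", "nurse", "dog", "hat"]

def build_list(group):
    """Assign each target to one prime condition for a participant group."""
    return [(target, CONDITIONS[(i + group) % len(CONDITIONS)])
            for i, target in enumerate(targets)]

for group in range(len(CONDITIONS)):
    print("group", group, build_list(group))
```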

Stimulus Timing

In designing priming experiments, researchers need to decide for how long to present the primes and targets and when they should appear relative to each other. When auditory stimuli are used, the duration of the stimuli is determined by the duration of the speech signal, but visual stimuli can be presented for longer or shorter periods. Visual targets can either be presented until the participant responds, or for a fixed duration, typically between 1 and 3 seconds. The presentation time of the primes is often more critical than that of the targets. When primes are presented for a long time, participants may develop processing strategies that may be quite different from everyday word processing, or they may try to anticipate the targets. Researchers often try to discourage such strategic behavior by using the shortest possible prime presentation times. Numerous studies have used masked primes. Here, primes are presented for very brief periods of time (e.g., for 40 ms in Van den Bussche, Van den Noortgate, & Reynvoet, 2009, and for 56 ms in Gomez, Perea, & Ratcliff, 2013) and are followed and/or preceded by pattern masks suppressing their afterimage. Under these conditions, participants are on most trials unable to consciously identify the primes and to use them strategically. Nevertheless, robust priming effects can be obtained. For instance, Crepaldi, Rastle, Coltheart, and Nickels (2010) found that lexical decision latencies were shorter after masked primes that were morphologically and orthographically related to the targets (Mean = 582 ms, SD = 51 ms) than after primes that were only orthographically related to the targets (Mean = 606 ms, SD = 61 ms) or unrelated (Mean = 603 ms, SD = 60 ms). Many studies have compared the effects of unmasked and masked primes, for instance to uncover the contributions of early “bottom up” and later “top‐down” processes in word recognition (e.g., de Wit & Kinoshita, 2015, see Figure 6.4). However, it should be noted that unconscious prime processing may be modulated

However, it should be noted that unconscious prime processing may be modulated by attentional resources and task requirements (see Kiefer, Adams, & Zovko, 2012, for a review). Moreover, the impact of attentional control on priming may differ across groups of participants (e.g., persons with or without attention deficits). Thus, in interpreting the results of priming studies researchers need to consider possible top‐down influences on both prime and target processing.

Figure 6.4 Illustration of trial structures in the masked and unmasked conditions in de Wit and Kinoshita (2015). Targets were presented until response, maximally for 2000 ms.

Finally, the time interval between prime and target onset needs to be determined. In many priming studies, the prime begins at the same time as the target or shortly before or after target onset. The choice of stimulus onset asynchrony (SOA) depends on the theoretical goals of the study and the researchers’ assumptions about the time course of the processes they are investigating. It is also possible to link the presentation of the stimuli to the participants’ behavior. For instance, a prime word or picture may be replaced by a target as soon as the participant fixates the location of the prime (Morgan & Meyer, 2005).

Many studies have included several SOAs, often in conjunction with several types of primes, to trace the time course of the activation of different types of information. This was the case for the study by Glaser and Düngelhoff described above. To give another example, in a picture naming study, Schriefers, Meyer, and Levelt (1990) presented target pictures with semantically or phonologically related or unrelated prime words. They observed a semantic interference effect of 20 ms and a phonological facilitation effect of 36 ms; the mean naming latency was 651 ms in the semantically related condition, 595 ms in the phonologically related condition, and 631 ms in the unrelated prime condition. Importantly, the semantic effect peaked at the earliest SOA, namely when the prime was presented 150 ms before target onset, whereas the phonological effect peaked only when the prime was presented 150 ms after target onset. This indicates that the semantic representations of the targets began to be activated before the phonological representations.

In many priming studies, primes and targets appear on separate trials. For instance, in a repetition priming experiment, participants may be asked to name a stream of pictures, and the same picture may come up several times, with the first instance priming the second. Similarly, in a semantic priming experiment, participants may name a picture of an animal and after several intervening trials they may name another animal (Howard et al., 2006). Thus, in this kind of design the distinction between primes and targets is present in the design of the experiment but is not obvious to the participants. Many types of priming effects are robust and can be observed even when several trials intervene between prime and target. For instance, in a picture naming experiment, Zwitserlood, Bölte, and Dohmes (2000) obtained a morphological priming effect of 143 ms (with means of 653 ms in the morphologically related prime condition and 796 ms in the unrelated prime condition) when primes preceded targets by several minutes.

In some priming studies, the stimuli are blocked by condition.
In these blocking paradigms, there are homogeneous test blocks, where participants repeatedly name small sets of related pictures, for instance members of the same semantic category (as in “duck, mouse, fish, snake, mouse…”) or pictures with similar names (“bed, bell, bench, bed…”), and heterogeneous blocks, where the same stimuli are combined into unrelated sets (Belke & Stielow, 2013; O’Seaghdha, Chen, & Chen, 2010). These paradigms allow researchers to study how participants can strategically exploit the similarity between the stimuli; speakers can, for instance, prepare well when all words in a block have the same onset, but not when the words rhyme (Meyer, 1990). More importantly, blocking paradigms can also be used to study the interplay of repetition and competition effects arising when speakers repeatedly access members of the same semantic category.
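The composition of such blocks is easy to illustrate; the small R sketch below (with hypothetical picture names) recombines the same items into homogeneous and heterogeneous sets, so that every picture is named in both kinds of context.

    # Sketch: assembling homogeneous and heterogeneous naming blocks from
    # the same picture names (all items hypothetical).
    categories <- list(
      animals   = c("duck", "mouse", "fish", "snake"),
      furniture = c("bed", "chair", "table", "desk"),
      clothing  = c("shirt", "sock", "coat", "glove"),
      tools     = c("saw", "drill", "hammer", "file")
    )

    # Homogeneous blocks: each block contains the members of one category.
    homogeneous <- categories

    # Heterogeneous blocks: the i-th block takes the i-th item from each
    # category, so the same pictures recur in semantically unrelated sets.
    heterogeneous <- lapply(seq_along(categories), function(i) {
      unname(sapply(categories, `[`, i))
    })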

Task

The choice of task depends, again, on the goals of the study. Researchers using priming paradigms to study word production often ask participants to name target pictures, typically in bare nouns or verbs, occasionally in short phrases. Picture categorization (e.g., with respect to the real‐life size of the objects, or as animate or inanimate) has also been used, often in control conditions for naming conditions (Schmitz & Wentura, 2012).

In word recognition studies, a number of different tasks have been used: Participants are sometimes asked to read aloud written targets or repeat or write down spoken ones (Adelman et al., 2014; De Bree, Janse, & Van de Zande, 2007). They may also be asked to categorize targets with respect to semantic or phonological properties. A common phonological categorization task is phoneme monitoring, where participants are asked to decide whether or not the target includes a specific phoneme (e.g., /p/). This task is performed faster for words than for nonwords, which indicates that it is suitable to assess lexical knowledge (Dijkstra, Roelofs, & Fieuws, 1995).

The most common task used in word recognition studies is probably the lexical decision task, which was already described above. Here, trials featuring target words are mixed with trials featuring nonwords. Both types of targets are preceded by primes. Participants are asked to categorize each target as a word or a nonword by pressing one of two buttons. Lexical decision latencies have been shown to be sensitive to a large number of lexical variables, for instance the length and frequency of the words and characteristics of their phonological neighborhoods (i.e., the words they resemble in their sound forms). These lexical effects demonstrate that the task is suitable for studying how readers and listeners access their mental lexicon. However, lexical decision is a metalinguistic task, as participants are asked to make judgments about the stimuli they see or hear, and it is sensitive to various response strategies. This can complicate the interpretation of the results (Ratcliff, Gomez, & McKoon, 2004).

Participants

Most word priming experiments have been conducted with college students. However, priming paradigms can readily be adapted for use with any sample of interest. There are, for instance, recent word priming studies with children as young as 2.5 years (Singh, 2014), and word priming paradigms have been widely used in research on healthy aging (De Bree et al., 2007), on bilingual speakers (Kroll & Stewart, 1994; van Hell & de Groot, 1998), and in research involving various groups of patients (e.g., patients with Broca’s aphasia, Utman, Blumstein, & Sullivan, 2001; with temporal lobe epilepsy, Miozzo & Hamberger, 2015; or with semantic dementia, Merck, Jonin, Laisney, Vichard, & Belliard, 2014).

Data Analysis

In this chapter we have focused on the use of priming paradigms in behavioral studies where participants produce individual words or respond to spoken or written stimuli by categorizing them, most commonly as words or nonwords. A priming experiment with a simple design, for instance featuring twenty target pictures that have to be named, each combined with two primes, and thirty participants, who see all prime‐target combinations, yields a raw data set of 1,200 naming latencies. Designs with more stimuli, participants, or conditions evidently yield larger data sets. A comprehensive discussion of the statistical analyses of the results of priming experiments including, for instance, exclusion of outliers, appropriate transformations of data, and tests of significance, is beyond the scope of the present chapter; we refer the reader to textbooks (e.g., Baayen, 2008; Field, Miles, & Field, 2012). Here we can only provide a brief sketch of the main steps involved in analyzing the data.

The first step in the analyses serves to decide whether all participants and stimuli should be maintained in the data set, or whether some participants and/or stimuli need to be excluded. Researchers may decide to exclude participants whose overall performance deviates substantially from the remaining sample; these may, for instance, be participants whose average response latencies are exceptionally slow (e.g., more than three standard deviations above the sample mean) or whose error rates are exceptionally high. Similarly, researchers may decide to exclude stimuli that were responded to with exceptionally long latencies or that yielded very high error rates. For instance, in a lexical decision experiment, one might exclude words that the majority of participants categorized as nonwords.

The next step in the analyses concerns the error rates in the remaining data set. In a typical lexical decision experiment, these are the rates of missing responses and the rates of nonword responses for words and of word responses for nonwords. In a picture naming experiment, errors include missing responses, incorrect picture names (e.g., “cat” instead of “dog”), self‐repairs (“cat… dog”), and responses that begin with a hesitation or filled pause (e.g., “eh… cat”). Since error rates are rarely normally distributed, many researchers use log‐transformed, rather than raw, error rates when comparing average error rates. However, in the recent literature analyses of error rates using logit mixed models have often been preferred (Jaeger, 2008).

Even when the hypotheses do not concern the error rates but the response latencies, the error rates in the different conditions are reported and often analyzed. This is to ascertain that the results obtained for the error rates are consistent with those obtained for the latencies. For instance, if related primes are hypothesized to facilitate target processing, the responses should be faster after related than after unrelated primes, and the error rates should be the same or lower, but not higher, in the related than in the unrelated prime condition. When related primes are associated with faster responses and higher error rates than unrelated primes, or when related primes are associated with slower responses and lower error rates (i.e., when there is a speed‐accuracy trade‐off), the interpretation of the results can be challenging.
This is because the results obtained for one dependent variable suggest that the related primes facilitate target processing, whereas the results obtained for the other variable suggest that they interfered with target processing.

The following steps in the analyses concern the latencies for correct responses, which are usually the most important dependent variable in priming experiments. In lexical decision experiments, word and nonword responses are often analyzed separately. In addition to incorrect responses, many researchers exclude abnormally fast and/or abnormally slow responses. Such outliers can be defined in different ways (e.g., Ratcliff, 1993). One option is to use fixed deadlines. For instance, picture naming or lexical decision latencies below 200 ms are likely to be due to artifacts or measurement errors, since participants cannot process the target and initiate their response so quickly; therefore these latencies are often excluded from the analyses. Another option is to refer to the distribution of latencies in the sample and exclude latencies that deviate from a mean (e.g., the grand mean of the sample, the condition mean, or the participant mean) by a certain amount, for instance by 2.5 or 3 standard deviations. Researchers sometimes use several criteria to exclude outliers, for instance a fixed lower deadline to exclude short latencies and a distribution‐based criterion (e.g., three standard deviations above the grand mean) to exclude long latencies.

Since parametric comparisons of means (t‐tests, analyses of variance) require the input data to be normally distributed, but raw response latencies typically do not fulfill this criterion and feature a long tail of slow responses, latencies are often log‐transformed before analyses (e.g., Baayen, 2008). Contemporary statistical packages (R, R Core Team, 2015, and SPSS, IBM Corp, 2013) offer advanced graphical tools to facilitate the optimal choice of criteria for the exclusion of outliers and the transformation of raw data.

Finally, inferential statistics are used to determine whether or not the primes significantly affected the response latencies to the targets. Analyses typically focus on the condition means, though sometimes it is useful to consider the entire distribution of the latencies (e.g., Roelofs, 2008). Following a proposal by Clark (1973), many researchers carry out separate analyses based on the participant means per condition (i.e., averaging across items) and on item means (averaging across participants), respectively (for an example see Crepaldi et al., 2010). Clark advocated combining the two test statistics into one F‐value (min F′), but this is rarely done as min F′ is considered to be overly conservative. An alternative, favored in much of the contemporary literature, is mixed‐effects modeling (e.g., Barr, Levy, Scheepers, & Tily, 2013; Baayen, Davidson, & Bates, 2008), which allows researchers to include participants and items as random effects in the same model and, more generally, offers much flexibility in the statistical analyses of the data (for an example, see Shao, Roelofs, Martin, and Meyer, 2015).

Priming paradigms have been used in numerous neurobiological studies using EEG (Jouravlev, Lupker, & Jared, 2014; Llorens et al., 2014; Riès et al., 2015), MEG (Brennan et al., 2014; Whiting, Shtyrov, & Marslen‐Wilson, 2014), and fMRI (Almeida & Poeppel, 2013; Massol et al., 2010; Savill & Thierry, 2011).
EEG and MEG studies can offer precise information about the time course of prime and target processing. fMRI studies can be used to investigate which brain circuits are implicated when grammatical features, sound forms, or meanings of words are accessed (Koester & Schiller, 2011). How such studies are designed, and how the data are analyzed, is described in Chapters 13 and 14 of the current volume.
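Returning to the behavioral analyses, the main preprocessing and modeling steps described above might be sketched in R as follows; the data frame and its column names are hypothetical, and the mixed model uses the lme4 package discussed by Baayen, Davidson, and Bates (2008).

    # Sketch of latency preprocessing and analysis. Assumes a hypothetical
    # data frame 'd' with columns: participant, item, condition, rt, correct.
    library(lme4)

    d <- subset(d, correct == 1)           # latencies for correct responses only
    d <- subset(d, rt >= 200)              # fixed lower deadline for artifacts
    upper <- mean(d$rt) + 3 * sd(d$rt)     # distribution-based upper criterion
    d <- subset(d, rt <= upper)            # drop latencies > 3 SD above grand mean

    d$log_rt <- log(d$rt)                  # reduce the long right tail

    # Mixed-effects model with crossed random effects for participants and items.
    m <- lmer(log_rt ~ condition + (1 | participant) + (1 | item), data = d)
    summary(m)

    # Error rates can be analyzed with a logit mixed model (Jaeger, 2008):
    # glmer(correct ~ condition + (1 | participant) + (1 | item),
    #       family = binomial, data = d_all)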

Evaluation of Word Priming Paradigms

Since their inception in the 1970s, word priming paradigms have been widely used in psycholinguistics. There are many reasons for the popularity of priming paradigms: The underlying theoretical assumptions are straightforward, priming experiments are easy to set up and highly portable, and no specific expertise is required to analyze the data. Most importantly, priming paradigms are extremely versatile and can be used to address a wide range of issues concerning the representation of words in the mental lexicon and the way they are accessed during language production and comprehension.

Word priming paradigms are a research tool and, as is true for any tool, their usefulness depends on the goals of the user. Word priming is an experimental paradigm and is tailored to study how words are represented and accessed. Many issues in psycholinguistics can be studied experimentally and do concern individual words, but evidently there are questions that are not easily studied in experiments and/or do not concern individual words and therefore require other approaches.

When a word‐priming paradigm is deemed to be suitable to address a research question, the details of the experimental method, stimuli, and design have to be determined. Many properties of priming experiments are, of course, dictated by the research question. A researcher specifically interested in the processing of morphologically complex forms or in lexical access during speaking will choose the stimuli and task accordingly. Other design properties are not determined in this way. For instance, to study the representation of morphologically complex forms, one might use either a production or a comprehension task, and either masked or unmasked primes. Here choices may to some extent depend on practical considerations (e.g., the ease of finding appropriate stimuli, of setting up the experiment, and of analyzing the responses). In designing experiments, it is often useful to consider published experiments on similar issues and aim to replicate design features (especially those used in many studies) as much as possible. For instance, researchers designing a masked priming experiment might present the stimuli in the same way (same size, luminance, etc.) and with the same timing as reported in a similar recent study in a peer‐reviewed journal. This strategy increases the chance that an experiment will actually “work,” and it facilitates the comparison of the results to earlier findings.

We are, of course, not advocating blind imitation of existing studies. The most important considerations in designing a word priming experiment (or any other type of study) must stem from the theoretical goals of the research. Researchers need to consider how each design choice may affect how participants approach the task, how the stimuli are processed, and how these influences may affect the conclusions that can be drawn from the results.

Key Terms

Blocking paradigm  Experimental paradigm where stimuli are blocked per condition. For instance, four semantically homogeneous blocks may feature pictures of objects from the categories of animals, vehicles, fruits, and items of furniture, respectively; the corresponding four heterogeneous blocks feature pictures of objects from each of the four categories.

Lexical decision task  A task that is often used in studies of visual and auditory word recognition. Participants hear or see sound or letter sequences (e.g., BLISS or BLIFF). For each sequence they have to decide as quickly as possible whether or not it is a word. Decision latency and accuracy are measured.

Masked priming paradigm  Priming paradigm where primes are presented very briefly (usually 40–50 ms) and followed and/or preceded by visual masks (e.g., %$%$$% or #######). Participants usually cannot identify the primes or even reliably report their presence or absence, but the primes may still affect subsequent target processing.

Phoneme monitoring  A task that is often used in studies of auditory word recognition. Participants hear strings of words and have to press a button as soon as they detect a specific sound (e.g., /p/).

Picture‐word interference paradigm  A paradigm often used to study lexical access in speaking. Participants see a stream of pictures, each accompanied by a written or spoken distractor word. They are asked to name the pictures and ignore the distractor words. In spite of these instructions, the distractors may still affect the speed and/or accuracy of the naming responses.

Prime  A stimulus that affects the response to a following target; for instance, presentation of the prime word “nurse” may facilitate processing of the following target word “doctor” relative to an unrelated prime word such as “cat.”

Stimulus‐onset asynchrony  Time interval between the onsets of the prime and the target in a priming experiment.

Target  A stimulus a participant is asked to react to.

References

Adelman, J. S., Johnson, R. L., McCormick, S. F., McKague, M., Kinoshita, S., Bowers, J. S., Perry, J. R., Lupker, S. J., Forster, K. I., Cortese, M. J., Scaltritti, M., Aschenbrenner, A. J., Coane, J. H., White, L., Yap, M. J., Davis, C., Kim, J., & Davis, C. J. (2014). A behavioral database for masked form priming. Behavior Research Methods, 46, 1052–1067. DOI: 10.3758/s13428‐013‐0442‐y.
Almeida, D., & Poeppel, D. (2013). Word‐specific repetition effects revealed by MEG and the implications for lexical access. Brain and Language, 127, 497–509.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge, UK: Cambridge University Press.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed‐effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Becker, A. B. C., Schild, U., & Friedrich, C. K. (2014). ERP correlates of word onset priming in infants and young children. Developmental Cognitive Neuroscience, 9, 44–55. DOI: 10.1016/j.dcn.2013.12.004.
Belke, E., & Stielow, A. (2013). Cumulative and non‐cumulative semantic interference in object naming: Evidence from blocked and continuous manipulations of semantic context. Quarterly Journal of Experimental Psychology, 66, 2135–2160. DOI: 10.1080/17470218.2013.775318.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5, 341–345.
Brennan, J., Lignos, C., Embick, D., & Roberts, T. P. L. (2014). Spectro‐temporal correlates of lexical access during auditory lexical decision. Brain & Language, 133, 39–46. DOI: 10.1016/j.bandl.2014.03.006.
Chwilla, D. J., & Kolk, H. H. J. (2002). Three‐step priming in lexical decision. Memory & Cognition, 30, 217–225. DOI: 10.3758/BF03195282.

Clark, H. H. (1973). The language‐as‐fixed‐effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335–359.
Crepaldi, D., Rastle, K., Coltheart, M., & Nickels, L. (2010). ‘Fell’ primes ‘fall’, but does ‘bell’ prime ‘ball’? Masked priming with irregularly‐inflected primes. Journal of Memory and Language, 63, 83–99. DOI: 10.1016/j.jml.2010.03.002.
de Bree, E., Janse, E., & Van de Zande, A. M. (2007). Stress assignment in aphasia: Word and non‐word reading and non‐word repetition. Brain & Language, 103, 264–275. DOI: 10.1016/j.bandl.2007.07.003.
de Wit, B., & Kinoshita, S. (2015). The masked semantic priming effect is task dependent: Reconsidering the automatic spreading activation process. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41, 1062–1075. DOI: 10.1037/xlm0000074.
Dijkstra, T., Roelofs, A., & Fieuws, S. (1995). Orthographic effects on phoneme monitoring. Canadian Journal of Experimental Psychology, 49, 264–271. DOI: 10.1037/1196‐1961.49.2.264.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Los Angeles, California: SAGE Publications.
Gaskell, M. G. (2007). The Oxford handbook of psycholinguistics. Oxford, UK: Oxford University Press.
Gaskell, M. G., & Dumay, N. (2003). Lexical competition and the acquisition of novel words. Cognition, 89, 105–132. DOI: 10.1016/S0010‐0277(03)00070‐2.
Glaser, W. R., & Düngelhoff, F. J. (1984). The time course of picture‐word interference. Journal of Experimental Psychology: Human Perception and Performance, 10, 640–654. DOI: 10.1037/0096‐1523.10.5.640.
Gomez, P., Perea, M., & Ratcliff, R. (2013). A diffusion model account of masked versus unmasked priming: Are they qualitatively different? Journal of Experimental Psychology: Human Perception and Performance, 39, 1731–1740. DOI: 10.1037/a0032333.
Howard, D., Nickels, L., Coltheart, M., & Cole‐Virtue, J. (2006). Cumulative semantic inhibition in picture naming: Experimental and computational studies. Cognition, 100, 464–482. DOI: 10.1016/j.cognition.2005.02.006.
IBM Corp. (2013). IBM SPSS Statistics for Windows, Version 22.0. Armonk, NY: IBM Corp.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434–446.
Jouravlev, O., Lupker, S. J., & Jared, J. (2014). Cross‐language phonological activation: Evidence from masked onset priming and ERPs. Brain & Language, 134, 11–22. DOI: 10.1016/j.bandl.2014.04.003.
Kane, A. E., Festa, E. K., Salmon, D. P., & Heindel, W. C. (2015). Repetition priming and cortical arousal in healthy aging and Alzheimer’s disease. Neuropsychologia, 70, 145–155. DOI: 10.1016/j.neuropsychologia.2015.02.024.
Kiefer, M., Adams, S. C., & Zovko, M. (2012). Attentional sensitization of unconscious visual processing: Top‐down influences on masked priming. Advances in Cognitive Psychology, 8, 50–61. DOI: 10.2478/v10053‐008‐0102‐4.
Kinoshita, S., Mozer, M. C., & Forster, K. I. (2011). Dynamic adaptation to history of trial difficulty explains the effect of congruency proportion on masked priming. Journal of Experimental Psychology: General, 140, 622–636. DOI: 10.1037/a0024230.
Koester, D., & Schiller, N. O. (2011). The functional neuroanatomy of morphology in language production. NeuroImage, 55, 732–741. DOI: 10.1016/j.neuroimage.2010.11.044.
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and picture naming: Evidence for asymmetric connections between bilingual memory representations. Journal of Memory and Language, 33, 149–174. DOI: 10.1006/jmla.1994.1008.
Leinenger, M. (2014). Phonological coding during reading. Psychological Bulletin, 140, 1534–1555. DOI: 10.1037/a0037830.

Llorens, A., Trébuchon, A., Riès, S., Liégeois‐Chauvel, C., & Alario, F.‐X. (2014). How familiarization and repetition modulate the picture naming network. Brain and Language, 133, 47–58. DOI: 10.1016/j.bandl.2014.03.010.
Lüttmann, H., Zwitserlood, P., Böhl, A., & Bölte, J. (2011). Evidence for morphological composition at the form level in speech production. Journal of Cognitive Psychology, 23, 818–836. DOI: 10.1080/20445911.2011.575774.
Mahon, B. Z., Costa, A., Peterson, R., Vargas, K. A., & Caramazza, A. (2007). Lexical selection is not by competition: A reinterpretation of semantic interference and facilitation effects in the picture–word interference paradigm. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 503–535. DOI: 10.1037/0278‐7393.33.3.503.
Marslen‐Wilson, W., & Zwitserlood, P. (1989). Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance, 15, 576–585. DOI: 10.1037/0096‐1523.15.3.576.
Massol, S., Grainger, J., Dufau, S., & Holcomb, P. (2010). Masked priming from orthographic neighbors: An ERP investigation. Journal of Experimental Psychology: Human Perception and Performance, 36, 162–174. DOI: 10.1037/a0017614.
Merck, C., Jonin, P.‐Y., Laisney, M., Vichard, H., & Belliard, S. (2014). When the zebra loses its stripes but is still in the savannah: Results from a semantic priming paradigm in semantic dementia. Neuropsychologia, 53, 221–232. DOI: 10.1016/j.neuropsychologia.2013.11.024.
Meyer, A. S. (1990). The time course of phonological encoding in language production: The encoding of successive syllables of a word. Journal of Memory and Language, 29, 524–545. DOI: 10.1016/0749‐596X(90)90050‐A.
Meyer, A. S. (1991). The time course of phonological encoding in language production: Phonological encoding inside a syllable. Journal of Memory and Language, 30, 69–89. DOI: 10.1016/0749‐596X(91)90011‐8.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90, 227–234. DOI: 10.1037/h0031564.
Miozzo, M., & Hamberger, M. J. (2015). Preserved meaning in the context of impaired naming in temporal lobe epilepsy. Neuropsychology, 29, 274–281. DOI: 10.1037/neu0000097.
Morgan, J., & Meyer, A. S. (2005). Processing of extrafoveal objects during multiple‐object naming. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 428–442. DOI: 10.1037/0278‐7393.31.3.428.
O’Seaghdha, P. G., Chen, J.‐Y., & Chen, T.‐M. (2010). Proximate units in word production: Phonological encoding begins with syllables in Mandarin Chinese but with segments in English. Cognition, 115, 282–302. DOI: 10.1016/j.cognition.2010.01.001.
R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R‐project.org/.
Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114, 510–532.
Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of the lexical decision task. Psychological Review, 111, 159–182. DOI: 10.1037/0033‐295X.111.1.159.
Riès, S. K., Fraser, D., McMahon, K. L., & de Zubicaray, G. I. (2015). Early and late electrophysiological effects of distractor frequency in picture naming: Reconciling input and output accounts. Journal of Cognitive Neuroscience, 27, 1936–1947. DOI: 10.1162/jocn_a_00831.
Roelofs, A. (1992). A spreading‐activation theory of lemma retrieval in speaking. Cognition, 42, 107–142. DOI: 10.1016/0010‐0277(92)90041‐F.
Roelofs, A. (2005). The visual‐auditory color‐word Stroop asymmetry and its time course. Memory & Cognition, 33, 1325–1336. DOI: 10.3758/BF03193365.

Roelofs, A. (2008). Dynamics of the attentional control of word retrieval: Analyses of response time distributions. Journal of Experimental Psychology: General, 137, 303–323. DOI: 10.1037/0096‐3445.137.2.303.
Sass, K., Krach, S., Sachs, O., & Kircher, T. (2009). Lion – tiger – stripes: Neural correlates of indirect semantic priming across processing modalities. NeuroImage, 45, 224–236. DOI: 10.1016/j.neuroimage.2008.10.014.
Savill, N. J., & Thierry, G. (2011). Reading for sound with dyslexia: Evidence for early orthographic and late phonological integration deficits. Brain Research, 1385, 192–205. DOI: 10.1016/j.brainres.2011.02.012.
Schmitz, M., & Wentura, D. (2012). Evaluative priming of naming and semantic categorization responses revisited: A mutual facilitation explanation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 984–1000. DOI: 10.1037/a0026779.
Schneider, W., Eschman, A., & Zuccolotto, A. (2012). E‐Prime reference guide. Pittsburgh: Psychology Software Tools, Inc.
Schriefers, H., Meyer, A. S., & Levelt, W. J. M. (1990). Exploring the time course of lexical access in language production: Picture‐word interference studies. Journal of Memory and Language, 29, 86–102. DOI: 10.1016/0749‐596X(90)90011‐N.
Shao, Z., Roelofs, A., Martin, R. C., & Meyer, A. S. (2015). Selective inhibition and naming performance in semantic blocking, picture‐word interference, and color‐word Stroop tasks. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41, 1806–1820. DOI: 10.1037/a0039363.
Singh, L. (2014). One world, two languages: Cross‐language semantic priming in bilingual toddlers. Child Development, 85, 755–766. DOI: 10.1111/cdev.12133.
Utman, J. A., Blumstein, S. E., & Sullivan, K. (2001). Mapping from sound to meaning: Reduced lexical activation in Broca’s aphasics. Brain and Language, 79, 444–472. DOI: 10.1006/brln.2001.2500.
Van den Bussche, E., Van den Noortgate, W., & Reynvoet, B. (2009). Mechanisms of masked priming: A meta‐analysis. Psychological Bulletin, 135, 452–477. DOI: 10.1037/a0015329.
van Hell, J. G., & de Groot, A. M. B. (1998). Conceptual representation in bilingual memory: Effects of concreteness and cognate status in word association. Bilingualism: Language and Cognition, 1, 193–211.
Wang, X. (2013). Language dominance in translation priming: Evidence from balanced and unbalanced Chinese–English bilinguals. Quarterly Journal of Experimental Psychology, 66, 727–743. DOI: 10.1080/17470218.2012.716072.
Whiting, C., Shtyrov, Y., & Marslen‐Wilson, W. (2014). Real‐time functional architecture of visual word recognition. Journal of Cognitive Neuroscience, 27, 246–265. DOI: 10.1162/jocn_a_00699.
Zwitserlood, P., Bölte, J., & Dohmes, P. (2000). Morphological effects on speech production: Evidence from picture naming. Language and Cognitive Processes, 15, 563–591. DOI: 10.1080/01690960050119706.

Further Reading and Resources

Bates, E., D’Amico, S., Jacobsen, T., Szèkely, A., Andonova, E., Devescovi, A., Herron, D., Lu, C. C., Pechmann, T., Pléh, C., Wicha, N., Federmeier, K., Gerdjikova, I., Gutierrez, G., Hung, D., Hsu, J., Iyer, G., Kohnert, K., Mehotcheva, T., Orozco‐Figueroa, A., Tzeng, A., & Tzeng, O. (2003). Timed picture naming in seven languages. Psychonomic Bulletin & Review, 10, 344–380. DOI: 10.3758/BF03196494.

Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (in press). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance.
Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen‐Shikora, E. R., Tse, C.‐S., Yap, M. J., Bengson, J. J., Niemeyer, D., & Buchanan, E. (2013). The semantic priming project. Behavior Research Methods, 45, 1099–1114. DOI: 10.3758/s13428‐012‐0304‐z.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42, 627–633. DOI: 10.3758/BRM.42.3.627.
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44, 287–304. DOI: 10.3758/s13428‐011‐0118‐4.

7 Structural Priming

Holly P. Branigan and Catriona L. Gibb

Abstract

People tend to show facilitation when language structure is repeated across consecutive utterances (e.g., sentences, phrases, syllables). This phenomenon may arise from structural priming, that is, from automatic and implicit facilitation of abstract structures and processes that underlie language use. A wide range of paradigms has been developed to investigate the conditions under which the structure that participants use to process a prime expression affects the structure that they use for a subsequent target expression, in order to address fundamental questions about structural representation and processing that are implicated in language production, comprehension, and acquisition. These paradigms use both behavioural and non-behavioural measures that tap offline and online processing, and vary from simple picture-description and picture-choice methodologies suitable for young children to sophisticated eye-tracking and imaging techniques. Structural priming paradigms are a flexible and potentially powerful tool that can be applied to a wide range of populations, contexts, and theoretical questions.

Introduction

During language use, people tend to show facilitation when language structure is repeated across consecutive utterances. For example, they are more likely to use a passive sentence to describe an event (e.g., the girl was chased by the boy) after saying or hearing another passive sentence (e.g., the church was struck by lightning) than after an active sentence expressing the same meaning (lightning struck the church).

Following the pioneering work of Bock (1986), many studies have now demonstrated that this phenomenon, called structural priming, can have a source in automatic and implicit facilitation of abstract structures that underlie language use. This tendency toward structural priming has been extensively used to address fundamental questions about the structural representations and processes that are implicated in language production, comprehension, and first and second language acquisition. Structural priming paradigms are a flexible and potentially powerful tool that can be applied to a wide range of populations, contexts, and theoretical questions.

Assumptions and Rationale

Priming effects are well established in research on human cognition. They occur when processing a prime stimulus with particular characteristics facilitates processing of a subsequent target stimulus that shares these characteristics.1 Priming occurs without conscious awareness or explicit recall of the prime stimulus, and is generally believed to be automatic and resource‐free.

Priming effects are assumed to occur because representations or processes become more available through use in a way that makes their subsequent re‐use easier if they are potentially applicable to a subsequent stimulus.2 As such, they are informative about representation and processing: For stimulus A to prime stimulus B, the processor must treat them as related along some cognitively relevant dimension. By manipulating the characteristics of the prime and target stimuli, and observing the conditions under which facilitation occurs, we can determine how A and B are related, and which dimensions are relevant to processing.

Early psycholinguistic priming paradigms focused on facilitation associated with repetition of content (e.g., semantic and phonological features of lexical representations, as in DOCTOR‐NURSE or CAT‐CAP). However, since Bock’s seminal (1986) demonstration that speakers repeat syntactic structure across consecutive sentences in the absence of repeated meaning, words, or sounds, it has been recognized that abstract linguistic structure can also be primed, in ways that are potentially informative about structural representation and processing. The rationale for structural priming paradigms is therefore the same as for other priming paradigms, with the defining feature that priming arises from repetition of linguistic structure. By manipulating the structural characteristics of the prime and target stimulus, and the modality of processing, we can determine the nature of structural representations and how they are implicated in language use.

In principle, priming might occur for any aspect of linguistic structure, and in practice it is assumed that all aspects of structure should be susceptible to priming. Structural priming paradigms have been most extensively used to investigate syntactic structure (e.g., Bock, 1986; Pickering & Branigan, 1998). The tendency for speakers to repeat syntax across otherwise unrelated sentences is taken as evidence that language use involves a level of processing in which syntactic structure is represented independently of semantic and phonological content. Patterns of priming effects when different aspects of syntactic structure are manipulated

(e.g., word order versus hierarchical relations) are used to draw inferences about the precise nature of these representations.

However, a growing body of research uses structural priming paradigms to investigate other aspects of language structure, including structure associated with meaning (e.g., quantification, information structure), structure associated with sound (e.g., syllable structure, prosodic structure), and mappings between levels of structure. For example, some studies have examined facilitation between expressions that share quantifier scope relations (e.g., Every kid climbed a tree; Raffray & Pickering, 2010), and between words that share syllable structure (Sevald, Dell, & Cole, 1995); others have examined facilitation of mappings between semantic and syntactic representations (Bunger, Papafragou, & Trueswell, 2013).

In all cases, studies examine whether participants who have been exposed to a prime expression involving one linguistic structure (whether based on syntax, meaning, or sound) subsequently show facilitation for a target expression involving the same structure (but different content), compared to when they have been exposed to a prime expression involving a different structure (and content). Such facilitation demonstrates structural priming.

Apparatus and Test Tools

As described below, there are many different paradigms for studying structural priming effects, involving a correspondingly diverse range of apparatus and test tools. Studies investigating structural priming effects in comprehension generally use computer‐based presentation of spoken or text stimuli. Participants’ responses may be recorded using a keyboard or buttonbox (measuring response latencies or choice of response), an eye‐tracker (measuring eye movements), electrodes attached to the scalp (measuring event‐related potentials [ERPs]), a functional Magnetic Resonance Imaging [fMRI] scanner (measuring changes in blood oxygen level‐dependent [BOLD] response), or printed booklets (measuring acceptability judgments). Studies investigating structural priming effects in production often use computer‐based stimulus presentation, but may also use sets of picture cards (especially in studies involving children) or printed booklets. Participants’ responses may be recorded using a digital recorder (measuring spoken choice of response or response latencies), a voice key (measuring response latencies), a keyboard (measuring response latencies or choice of response), an fMRI scanner (measuring changes in BOLD response), or printed booklets (measuring choice of response). Test tools include sets of sentences and sentence fragments, digital images (e.g., digital photographs or cartoons), and sets of picture cards involving line drawings, cartoons, or photographs.

Nature of Stimuli and Data

Minimally, all structural priming studies involve a participant processing a prime expression of some type (e.g., sentence, phrase, syllable) followed by a target expression, in which some level of structure is manipulated, with the dependent variable being some measure of processing for the target expression. However, paradigms vary greatly in the stimuli used to elicit prime and target expressions (as well as the tasks used to elicit processing).

Prime/target expressions

In principle, priming could facilitate use of a structure irrespective of whether an alternative structure were possible. In practice, however, studies usually investigate how prior experience of a structure affects processing when participants have a choice between that structure and an alternative. As such, researchers must select target expressions that allow a choice between structural alternatives at the level of the structure under investigation (e.g., between syntactic structures), one of which can be facilitated through prior use. Prime expressions must use one of the two structures, but there is no parallel requirement that they allow a choice between alternatives, as priming appears to be based on participants’ prior use of a structure, not on a prior act of choosing between alternatives.

The requirement that target expressions allow a structural choice constrains to some extent the types of prime/target structures that can be studied (see Table 7.1 for examples). In production priming studies, target expressions generally involve alternations in which speakers choose between two structures, or mappings between levels of structure, to express the same underlying conceptual representation. For example, the same ditransitive event can be expressed using a Prepositional Object (PO) or Double Object (DO) VP structure (e.g., The cowboy handed the banana to the thief vs. The cowboy handed the thief the banana); the same concept can be expressed using an ’s‐genitive or an of‐genitive NP structure (the king’s castle versus the castle of the king); and the same complex event can be expressed by mapping its semantic representation onto a full VP or a coerced VP structure (e.g., The bricklayer began building the wall versus The bricklayer began the wall).

In comprehension priming studies, target expressions typically involve ambiguities where comprehenders must choose between alternative structures. For example, in Main clause (MC) versus Reduced relative clause (RR) sentences (e.g., The defendant examined the glove but was unreliable vs. The defendant examined by the lawyer was unreliable), they must choose whether to analyze examined as a main clause or a reduced relative clause verb, prior to encountering disambiguating material (e.g., by the lawyer). In High‐attached (HA) versus Low‐attached (LA) sentences (e.g., The policeman prodded the doctor with the gun), they must choose whether to interpret the PP as attached to the verb (prodded) or the second NP (the doctor); in this case, the syntax does not disambiguate. In quantifier scope ambiguities (e.g., Every kid climbed a tree), the choice relates to semantic structure: whether the existential NP a tree takes wide scope (every child climbed the same tree) or narrow scope (every child climbed a possibly different tree).

Prime‐target expressions are generally designed to be related with respect to the structure of interest, but to differ in other respects (e.g., lexical content), in order to exclude other possible loci of priming (e.g., lexical priming). However, in some cases researchers may be interested in the interaction between different aspects of language (e.g., between syntactic structure and lexical content), and in these cases the prime‐target expressions may overlap in specific ways (e.g., use the same verb).
Table 7.1 Example structural alternations studied in structural priming experiments.

Construction | Example | Production or Comprehension | Level of structure tested | Example study
Prepositional Object versus Double Object | PO: The cowboy handed the banana to the thief. DO: The cowboy handed the thief the banana. | Production, Comprehension | Syntax; semantic‐syntactic mappings | Arai, van Gompel, & Scheepers, 2007; Cai, Pickering, & Branigan, 2012; Bock, 1986
Main clause versus Reduced relative | MC: The defendant examined the glove but was unreliable. RR: The defendant examined by the lawyer was unreliable. | Comprehension | Syntax | Ledoux, Traxler, & Swaab, 2007
High‐ versus Low‐attached PP | HA: The policeman prodding [the doctor] [with the gun]. LA: The policeman prodding [the doctor with the gun]. | Comprehension/Production | Syntax | Branigan et al., 2005b
Verb‐particle order | Post‐verb: Pull off a sweatshirt. Post‐object: Pull a sweatshirt off. | Production | Syntax | Konopka & Bock, 2009
’s‐ versus of‐genitive NP | ’s‐: The King’s castle. Of‐: The castle of the King. | Production | Syntax | Bernolet, Hartsuiker, & Pickering, 2013
Agent‐ versus Patient‐emphasis | Agent‐emphasis: The one who is hitting him is the cowboy. Patient‐emphasis: The one who he is hitting is the cowboy. | Production | Semantics | Vernice et al., 2012
Wide versus Narrow quantifier scope | Wide scope: Every kid climbed a tree [the same tree]. Narrow scope: Every kid climbed a tree [a different tree]. | Comprehension | Semantics | Raffray & Pickering, 2010
Coerced VPs | Full VP: The bricklayer began building the wall. Coerced VP: The bricklayer began the wall. | Production | Semantic‐syntactic mappings | Raffray, Pickering, Cai, & Branigan, 2014
CV versus CVC | CV: ki. CVC: kil. | Production | Syllable structure | Sevald et al., 1995

Experimental Stimuli

The prime and target expressions of interest may directly constitute experimental stimuli. For example, participants may be presented with sentences to repeat (production priming) or interpret (comprehension priming). However, many priming experiments elicit production or comprehension of prime and target expressions by using non‐linguistic stimuli such as pictures or animated videos. Such stimuli may be particularly relevant when testing populations such as young children.

In production, stimulus pictures are designed to induce descriptions with relevant structural characteristics; for example, pictures depicting transitive events involving two participants are likely to induce active or passive descriptions. Pictures can also be designed to express complex semantic relationships, such as possession, that participants are trained to interpret in specific ways (e.g., cued entity = possessor; Bernolet, Hartsuiker, & Pickering, 2013).

In comprehension, stimulus pictures depict possible interpretations of ambiguous expressions. They can be used to force participants to use a particular structure to comprehend a prime (e.g., to use an HA structure for The policeman prodding the doctor with the gun by presenting a picture consistent with only an HA interpretation). They can also be used to infer participants’ interpretation of an ambiguous expression (e.g., whether participants interpret The policeman prodding the doctor with the gun as HA or LA after an HA prime). Participants may also be asked to respond to pictures (e.g., by judging the veracity of a picture description) to ensure that they have processed prime and target expressions fully.

Many production priming studies use a combination of sentence fragments and pictures (e.g., a picture of a ditransitive event together with the fragment The pirate is giving…), or provide some linguistic content (e.g., the verb) alongside a stimulus picture, to constrain both the meaning and the possible form of participants’ responses, in order to reduce “exuberant responding,” that is, responses that are not relevant to the study.

Experimental stimuli are typically interleaved with unrelated “filler” stimuli that distract participants’ attention from the experimental manipulation, and minimize carryover effects between experimental trials. Some experiments also include fillers using one of the experimental structures if this structure is normally infrequent, to boost its overall use within the experiment.

Types of Data

Most production studies collect data about the content of participants’ responses (e.g., recordings of spoken responses); less frequently, they collect data indexing online processing (e.g., speed to initiate a response, fluency). Comprehension studies largely focus on data indexing online processing, including eye movements to sentences or pictures and response latencies, but data about frequency of response types or response accuracy may also be recorded. Non‐behavioral data include changes in electrical activity across the scalp (ERPs) and changes in activity in brain regions (fMRI) in response to stimuli.

Collecting and Analyzing Data

As already noted, all structural priming paradigms involve processing of a prime stimulus associated with a prime expression and a target stimulus associated with a target expression. However, there are many ways in which such processing can be induced, and as a result there exist many different structural priming paradigms for both comprehension and production. Our discussion assumes use of single primes, and that the target immediately follows the prime, but some studies use multiple primes or present intervening material between primes and targets.

Structural Priming of Language Comprehension

Overt Responses: Temporal Measures

Priming in comprehension has been most extensively studied using an eye‐tracking paradigm in which participants read sentences presented on a computer screen while the sequence and duration of their eye movements are recorded (Tooley, Traxler, & Swaab, 2009). On experimental trials, a prime sentence is presented in its entirety, immediately followed by a target sentence that involves the same structure or a different structure. For example, participants might read The defendant examined by the lawyer was unreliable (RR) followed by either The engineer examined by the doctor had a large mole (RR) or The engineer examined the license in the doctor’s office (MC). Analyses focus on participants’ fixations within regions of interest, generally the disambiguating region following a choice point (e.g., by the doctor, which confirms or disconfirms an MC or RR analysis). Measures typically used include first pass processing (the sum of fixations in a region until the reader fixates outside it) and total time (the sum of all fixations in a region). Structural priming is manifested as a reduction in reading times for a target sentence after reading a prime sentence involving the same structure compared to after the alternative structure. Such effects are usually limited to the disfavored alternative (e.g., faster first pass reading times for RR targets after RR primes; Tooley et al., 2009). Many studies use analysis of variance to analyze the data (with separate analyses for participants and for items), but some studies use mixed effects models (removing the need for separate by‐participant and by‐item analyses).

A closely related paradigm involves visual presentation of prime and target sentences using whole‐sentence or phrase‐by‐phrase self‐paced reading (Kim, Carbary, & Tanenhaus, 2014). In this case, the measure is participants’ reading time for the sentence or for critical phrases (measured by keyboard responses). This method is less informative about the detailed time course of priming effects.

A different eye‐movement paradigm, the visual world (VW) paradigm, taps spoken comprehension (see Chapter 5). Participants’ eye movements are recorded as they listen to sentences while viewing a visual scene, to determine whether they anticipate the same structure for the target as they just processed in the prime, for example whether processing a DO prime increases anticipation of a DO target (Arai et al., 2007). In a typical trial, participants see and read aloud an unambiguous prime sentence. Immediately afterwards, they hear an auditory target sentence that is temporarily ambiguous between the primed structure and the alternative (e.g., The pirate will send…) while viewing an array that includes potentially relevant objects (e.g., a potential recipient, princess, and theme, necklace). Analyses focus on participants’ anticipatory fixations on objects. Typical measures include first‐gaze duration (all consecutive fixations on an entity until another entity or the background is fixated) and log gaze probability ratio (the strength of the visual bias to one or other object). Structural priming manifests as longer looks to, and a higher likelihood of fixating, the object consistent with the structure used in the prime sentence (e.g., the recipient, following a DO prime), as established using analysis of variance or mixed effects models.

The VW paradigm can straightforwardly be adapted for use with children as young as 3–4 years (Thothathiri & Snedeker, 2008).
Children hear a prime sentence and act it out with a set of toys, then hear a target sentence and act it out. This paradigm requires no literacy skills, and the act‐out task engages children’s interest and ensures depth of processing for the experimental sentences. As with adults, structural priming is measured as an increased likelihood to fixate an object consistent with the structure experienced in the prime. However, because children often have poorer memory for item location, total fixation time usually provides a more accurate measure than first‐gaze duration.
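For illustration, fixation‐based measures of the kind described above can be derived from an ordered fixation record; the R sketch below computes first pass time and total time for a region of interest, using hypothetical column names for a single trial's data.

    # Sketch: reading-time measures for one trial, from a hypothetical data
    # frame 'fx' with one row per fixation (columns: fix_order, region,
    # duration), already restricted to a single trial.
    region_measures <- function(fx, roi) {
      fx     <- fx[order(fx$fix_order), ]
      in_roi <- fx$region == roi

      total_time <- sum(fx$duration[in_roi])    # all fixations in the region

      first_in <- match(TRUE, in_roi)           # first entry into the region
      if (is.na(first_in)) return(c(first_pass = 0, total_time = 0))

      # First pass: fixations from first entry until the region is first left.
      left <- match(FALSE, in_roi[first_in:length(in_roi)])
      last <- if (is.na(left)) length(in_roi) else first_in + left - 2
      c(first_pass = sum(fx$duration[first_in:last]), total_time = total_time)
    }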

Overt Responses: Structure Choice

Other paradigms investigate how previous experience affects participants’ interpretation of expressions that are not disambiguated. A common paradigm uses a computer‐based task in which participants choose pictures to match sentences or expressions (Branigan, Pickering, & McLean, 2005; Figure 7.1). Typically, participants read or hear a prime expression that contains an ambiguity (e.g., The policeman prodding the doctor with the gun), then see two pictures, one of which is compatible with one structure (e.g., HA) and one of which is compatible with neither. The participant must choose the matching picture (and is thus forced to use a particular structure to comprehend the prime—here, HA). Subsequently, participants read/hear a target sentence containing the same ambiguity, then choose between a picture matching one alternative (HA) and a picture matching the other alternative (LA). The dependent measure is which picture (hence, which structure) the participant chooses.

Analyses (typically using logit mixed effects models) compare the likelihood of choosing a structure after a prime with the same structure versus the alternative structure. Structural priming manifests as an increased likelihood to choose the picture corresponding to the structure used in the prime sentence, for example, a higher likelihood of choosing the picture corresponding to the HA structure after an HA prime than after an LA prime.3 (Response latencies can also be measured; priming is reflected in faster response times when choosing the picture corresponding to the primed structure than the alternative, analyzed using analysis of variance or linear mixed effects models.)

Figure 7.1 Example trial in a picture‐matching comprehension priming paradigm.

The Truth‐Value Judgment task involves a similar logic and, like the picture‐matching paradigm, is appropriate for children (Viau et al., 2010). Children hear an experimenter tell and act out stories with props, then decide whether a puppet’s description for an event is true or false. On prime trials, the description and/or action disambiguate the structure (e.g., the scope of negation in a sentence such as Not every horse jumped over the fence).

High Attached Prime sentence The policeman prodding the doctor with the gun.

Which picture matches the description?

The waitress prodding the Target sentence clown with the umbrella.

Which picture matches the description?

Low-Attachment High-Attachment interpretation interpretation

Figure 7.1 Example trial in a picture‐matching comprehension priming paradigm. over the fence). On target trials, the description is structurally ambiguous. Under one interpretation, it is true of the action; under the other, it is not. Children’s truth‐value judgment for the description therefore provides insight into their interpretation. Finally, acceptability judgment tasks can be used to measure comprehension priming (Luka & Barsalou, 2005). Participants read a series of prime sentences before rating the acceptability of target sentences varying in grammaticality. Increases in acceptability ratings after exposure to sentences with the same structure than after sentences with different structures (as established using analysis of variance) are interpreted as reflecting facilitated on‐line processing.
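To make the structure‐choice analysis concrete, the sketch below illustrates one way such data might be modeled. It is an illustration under stated assumptions, not a prescribed pipeline: the data file, its column names (choice, prime, participant, item), and the use of Python's statsmodels Bayesian binomial mixed model (a variational approximation, chosen here because frequentist logit mixed models are not part of the statsmodels core; many researchers instead use glmer in R's lme4 package) are all our own choices for exposition.

    # Minimal sketch of a logit mixed effects analysis of picture choices.
    # All file and column names are hypothetical.
    import pandas as pd
    from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

    trials = pd.read_csv("choices.csv")   # hypothetical: one row per target trial
    # choice: 1 = HA picture chosen, 0 = LA picture chosen
    # prime:  structure of the preceding prime ("HA" or "LA")

    model = BinomialBayesMixedGLM.from_formula(
        "choice ~ prime",                      # fixed effect of prime structure
        {"participant": "0 + C(participant)",  # random intercepts for participants
         "item": "0 + C(item)"},               # crossed random intercepts for items
        trials)
    result = model.fit_vb()                    # variational Bayes estimation
    print(result.summary())

A reliable effect of prime (more HA choices after HA primes than after LA primes) would constitute structural priming in comprehension; a fuller analysis would also consider random slopes for prime within participants and items.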

Non‐Behavioral Responses

Recent research on structural priming effects in comprehension also uses non‐behavioral measures, principally ERPs and fMRI. In ERP studies, participants' electrical activity across the scalp is recorded as they process language. Characteristic waveform deflections occur when different types of linguistic stimuli are processed. For example, processing the disambiguating word in a garden path sentence (e.g., RR) is associated with a positive shift around 500 ms post‐stimulus onset (the P600 component), compared to a non‐garden path (e.g., MC). ERPs can therefore provide a sensitive non‐behavioral measure of structural processing time‐locked to critical words (a sketch of the dependent measure appears at the end of this subsection).

In typical studies, participants silently read sentences presented word‐by‐word using rapid serial visual presentation (RSVP) (Ledoux et al., 2007). On experimental trials, participants read a prime sentence involving one structure (e.g., RR), followed by a target sentence with a local ambiguity that is resolved to the same structure as the prime sentence (here, RR), or to the alternative structure (MC). The dependent measure is the mean ERP amplitude to the critical disambiguating word, focusing on the structure that usually induces processing difficulty (and so is associated with a discriminable ERP component). Structural priming for syntactic structure manifests as a smaller positive deflection at the disambiguating word following processing of a sentence with the same structure than with the alternative structure (as established using analysis of variance). In principle, structural priming for other types of structure would similarly manifest as reduced deflections in relevant components (e.g., components associated with semantics).

As an implicit measure of processing, ERP paradigms can be used to investigate structural priming in populations in which it may be difficult to elicit explicit responses (e.g., children, aphasics, low proficiency second language learners). However, these paradigms require very large numbers of items and are relatively intrusive for participants, limiting their suitability for use beyond standard adult populations.

In fMRI paradigms, participants' activity (their BOLD response) in brain regions associated with comprehension is measured as they read or listen to sentences. In paradigms using text, participants typically silently read a prime sentence (e.g., a passive sentence) presented via mirrors above their head, followed by a target sentence with the same or the alternative structure (Weber & Indefrey, 2009). Sentences are presented word‐by‐word, with each word presented for a fixed presentation time. In paradigms using speech, participants may see a verb, then a photograph of a transitive event, and after a fixed interstimulus interval hear a sentence describing the event (Segaert, Menenti, Weber, Petersson, & Hagoort, 2012). Stimuli may be presented in a running priming manipulation, with each sentence acting as a prime for the following sentence. Analyses compare participants' BOLD response when processing a sentence with the same or a different structure to the preceding sentence (using analysis of variance). Structural priming manifests as fMRI adaptation, that is, a reduction in BOLD response (associated with decreased neuronal activity) when a structure is repeated (e.g., when hearing a passive sentence after a previous passive sentence).
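For illustration, the sketch below computes the kind of ERP dependent measure described above: the mean amplitude at a posterior electrode in a P600 window, separately for targets preceded by same‐structure and different‐structure primes. Everything in it is assumed for exposition (the data files, their layout, the sampling rate, the electrode index, and the 500–800 ms window); it is not a standard preprocessing pipeline.

    # Minimal sketch of the ERP dependent measure (all names, the data layout,
    # and the analysis window are illustrative assumptions).
    import numpy as np

    fs = 500            # assumed sampling rate (Hz)
    baseline_ms = 200   # assumed pre-stimulus baseline in each epoch

    def window(start_ms, end_ms):
        """Sample indices for a post-stimulus time window."""
        return slice((baseline_ms + start_ms) * fs // 1000,
                     (baseline_ms + end_ms) * fs // 1000)

    epochs = np.load("epochs.npy")          # trials x channels x samples,
                                            # time-locked to the disambiguating word
    same_prime = np.load("same_prime.npy")  # boolean: prime had the same structure
    pz = 12                                 # hypothetical index of electrode Pz

    amp = epochs[:, pz, window(500, 800)].mean(axis=1)   # per-trial mean amplitude
    priming = amp[~same_prime].mean() - amp[same_prime].mean()
    # A positive difference = smaller P600 when the structure is repeated,
    # the signature of structural priming described in the text.

In a real study, the per‐trial amplitudes would then enter an analysis of variance or a mixed effects model rather than a simple difference of means.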

Structural Priming of Language Production

Paradigms for studying priming in production are generally more varied than comprehension paradigms, and tend to focus on participants' choice of structure rather than on‐line processing measures. Paradigms differ in the extent to which participants are free to decide on a meaning to express, versus being constrained to convey a specified meaning, and in the extent to which the linguistic content of their responses is constrained. Many production priming paradigms involve not only production processes, but also a comprehension element associated with processing the prime and/or target.

Overt Responses: Structure Choice

Most production paradigms investigate whether participants' structural choices are influenced by exposure to a particular structure in a preceding prime. Picture (or video/animated) stimuli are often used to elicit target responses with specified conceptual content. In some paradigms, participants are presented with target pictures without any linguistic cues, either on‐screen or as printed cards, and asked to describe the depicted event. In others, pictures are shown together with a word or phrase that participants must use in their response, for example a verb (e.g., sell) or a sentence fragment (e.g., The doctor is giving…).

Figure 7.2 Example trial in a picture‐matching and picture‐description production priming paradigm. [Prime: the participant hears "The pirate was lifted by the mouse" and states whether it matches a picture of a LIFT event. Target: the participant describes a picture of a PUSH event, e.g., "The boxer was pushed by the cowboy".]

Picture description paradigms are highly flexible and can be combined with many different modes of prime presentation. In the simplest case, participants read or hear prime sentences. For example, participants may hear and repeat sentences, then describe target pictures as part of a recognition memory task (Bock, 1986), or may silently read sentences and then describe animated videos (Bunger et al., 2013). In other paradigms, picture description is combined with a picture‐matching task in which participants must decide whether a spoken or written prime description matches a prime picture, then produce a description for a target picture (Figure 7.2). For example, they might hear The pirate was lifted by the mouse, and then see a picture of a mouse lifting a pirate (match decision: yes), before describing a picture of a cowboy pushing a boxer. In a variant of this task, participants alternate between choosing a picture matching a prime description from an array and producing a target description (Branigan, Pickering, & Cleland, 2000).

Picture‐matching tasks are often used in a non‐interactive setting, with prime sentences presented as text on‐screen or as audio‐recordings over headphones (Cai et al., 2012). However, they can also be used in interactive settings involving two or more "players" (Branigan et al., 2007). Here, one player is a confederate who appears to spontaneously produce descriptions for the naïve participant to match, but actually follows a script specifying which structure to use on each turn. Interactive settings may promote participants' engagement in the task, and may yield stronger priming effects than non‐interactive settings. In a recently developed "pseudo‐interactive" variant, participants are led to believe they are interacting with another participant, but actually interact with a computer (Ivanova, Pickering, Branigan, Costa, & McLean, 2012). This method offers the advantages of an interactive paradigm, whilst obviating the need for a confederate and allowing closer control over prime presentation (e.g., variations in timing or phrasing).

Picture‐description tasks are particularly suitable for use with children and other populations who may have restricted language abilities (e.g., non‐native speakers; Hartsuiker, Pickering, & Veltkamp, 2004), as well as clinical populations (e.g., aphasics and individuals with an Autistic Spectrum Disorder; Allen et al., 2011; Hartsuiker & Kolk, 1998). The 'Snap' paradigm is a child‐appropriate modification of the picture‐matching task, based on a popular children's game (Branigan, McLean, & Jones, 2005). In it, players alternate turning over and describing cards showing events or objects, and compete to identify matching pairs. The experimenter plays first, so her description acts as a prime for the child's subsequent target description. In a similar "Bingo‐game" paradigm, children listen to and repeat an experimenter's prime description of an animated video, before describing a different video (Rowland, Chang, Ambridge, Pine, & Lieven, 2012).
In both paradigms, the interactive game‐based setting encourages children to maintain attention and process the primes fully, as well as providing sustained motivation to participate. Thus even 3‐4‐year‐old children can complete a large number of trials, making it possible to use within‐participants designs, although such experiments usually use fewer filler items than adult experiments. Both paradigms can also be used with adult participants, allowing direct comparisons of structural priming between populations.

Other production paradigms use purely linguistic stimuli. In the fragment completion paradigm, participants read and complete sentence fragments with the first completion that comes to mind. Prime and target fragments may be presented in booklets for written completion, with prime and target appearing on consecutive lines (Pickering & Branigan, 1998), or in a computerized format, with participants reading and producing spoken or typed completions for consecutively presented prime and target fragments (Branigan et al., 2000). Prime fragments are designed to elicit one of the experimental structures (e.g., The doctor gives the girl… favors a DO completion) whereas target fragments are designed to be compatible with either structure (e.g., The teacher shows… allows a DO or PO completion), and usually evoke stereotypical situations that are likely to elicit predictable responses. However, because the conceptual content is not specified, this paradigm tends to yield a high number of "other" responses (i.e., responses that do not involve either of the intended structures), and hence a lower proportion of usable data than in picture‐based paradigms.

In contrast, sentence recall paradigms are highly constrained. They are most often used when the structures of interest are difficult to elicit spontaneously (e.g., cannot be easily depicted, or cannot reliably be elicited by sentence fragments). These paradigms are based on the assumption that sentence recall involves normal processes of language production that are susceptible to priming effects (Potter & Lombardi, 1998). Hence, when participants recall sentences, they may tend to do so using structures that they have recently processed. Stimulus presentation is typically computerized. Participants silently read sentences presented using RSVP, then perform a distractor task (e.g., reading a digit string and making a recognition decision) before repeating the sentence aloud (Figure 7.3). On experimental trials, participants recall target sentences involving one structure after recalling prime sentences involving the same or the alternative structure.

In all of these paradigms, the dependent measure is the structure of participants' target responses. Responses are recorded, transcribed, and coded for structure (primed, alternative, or other).

Figure 7.3 Example trial in a sentence recall production priming paradigm. [Prime trial: a location‐theme prime sentence (The farmer heaped straw onto the wagon) is presented one word at a time, followed by a mask, a digit string (e.g., 56348), a digit recognition probe (Was this number in the original array? Yes/No), and a prompt to repeat the original sentence. Target trial: a theme‐location target sentence (The maid rubbed the table with polish) follows the same procedure.]

Analysis is usually restricted to participants' initial response (thought to best reflect automatic priming processes); incomplete responses are usually discarded. Given the wide range of responses that participants may produce (especially in less constrained paradigms), it is critical to have detailed criteria for coding responses, and inter‐coder reliability checks may be desirable. Because research investigates which structure participants use when they have a choice of alternatives, responses that are not reversible (i.e., not expressible using the alternative structure) and "other" responses are usually excluded from analysis. Current analytical approaches compare the likelihood of producing a structure following a prime sentence with the same structure versus following a prime with the other structure, using logit mixed effects models. Structural priming manifests as a higher likelihood of using a structure following a prime with the same structure.
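As a toy illustration of such coding criteria, the sketch below classifies transcribed dative descriptions as PO, DO, or Other. The verb list and patterns are invented for exposition; real coding schemes are far more detailed and are typically applied (and double‐checked) by trained human coders rather than by regular expressions.

    # Toy structural coder for transcribed dative responses (patterns are
    # illustrative only; real coding uses detailed, human-applied criteria).
    import re

    VERBS = r"\b(gave|sold|sent|showed|handed)"

    def code_response(utterance):
        """Classify a transcribed description as PO, DO, or OTHER."""
        u = utterance.lower().strip()
        # PO: verb + theme NP + "to" + recipient NP
        if re.search(VERBS + r" (the|a) \w+ to (the|a) \w+", u):
            return "PO"
        # DO: verb + recipient NP + theme NP
        if re.search(VERBS + r" (the|a) \w+ (the|a) \w+", u):
            return "DO"
        return "OTHER"

    assert code_response("The teacher sold a banana to the burglar") == "PO"
    assert code_response("The teacher sold the burglar a banana") == "DO"
    assert code_response("The teacher made a sale") == "OTHER"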

Overt Responses: Temporal Measures

Less frequently, production priming studies investigate online processing. Most such paradigms constrain the structure that participants must use. For example, participants may be instructed to use specific structures to describe specific events, such as animated objects moving in a particular configuration (Smith & Wheeldon, 2001), or in response to visual cues, such as mentioning a highlighted character first (Segaert et al., 2012). Computerized stimulus presentation allows precise recording of participants' latencies to initiate a response to the visual stimulus. Similarly, Sevald, Dell, and Cole (1995) had participants repeat specific syllables, and measured participants' speech rate when producing syllables with the same or a different structure to a previous syllable. Structural priming manifests as speeded processing (reduced onset latencies or faster speech rate, as established using analysis of variance or linear mixed effects models) following a prime with the same structure.
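A minimal sketch of how such latency data might be analyzed follows. The file and column names are assumptions, and restricting the random effects to participant intercepts is a simplification; a full analysis would normally also model item variability and random slopes.

    # Minimal sketch: onset latencies as a function of prime structure.
    # File and column names are hypothetical; random intercepts for
    # participants only, for brevity.
    import pandas as pd
    import statsmodels.formula.api as smf

    latencies = pd.read_csv("latencies.csv")   # one row per target trial
    model = smf.mixedlm("onset_latency ~ same_structure", latencies,
                        groups=latencies["participant"])
    result = model.fit()
    print(result.summary())
    # Priming = a reliably negative effect of same_structure on latency.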

Non‐Behavioral Responses

Paradigms have recently been developed for studying priming in production using fMRI in both non‐interactive and interactive contexts (Schoot, Menenti, Hagoort, & Segaert, 2014; Segaert et al., 2012). Participants typically first see a verb that they must use in their description, then see a photograph of an event to describe. Visual cues may be used to indicate the entity with which participants must begin their description, thus eliciting specific structures (e.g., passive sentences). As in comprehension, analyses compare participants’ BOLD response in language‐relevant brain areas (using analysis of variance) when producing a structure after previously hearing or producing the structure, compared to after the alternative structure (see discussion above).

An Exemplary Study

A hypothetical study might investigate whether small clause sentences (e.g., The cowboy called the burglar a liar) have the same syntactic structure as DO sentences (e.g., The cowboy gave the burglar a banana). If so, participants should be facilitated for the DO structure after processing a small clause sentence, and should therefore show an increased likelihood of using a DO structure when they have a choice between a DO structure and a PO structure to describe a ditransitive event (e.g., a teacher selling a swimmer a cup).

To test this, 24 participants are exposed to four types of prime sentence (see Table 7.2): DO primes (which should facilitate a DO structure), PO primes (which should facilitate a PO structure), unrelated intransitive Baseline primes (which should facilitate neither structure), and the crucial small clause primes (which by hypothesis should facilitate a DO structure). The priming manipulation is embedded in a computer‐based picture‐matching/picture‐description task, which allows precise control of stimulus presentation. Participants alternate between listening to a recorded description played over headphones and deciding whether it matches a picture displayed on‐screen, and describing a picture by reading aloud and completing a sentence fragment displayed on‐screen. On half of the matching trials, the picture and description do not match (e.g., they involve different actions), requiring participants to attend to descriptions carefully. On experimental trials, participants hear prime sentences with a DO, PO, Baseline, or Small Clause structure, then describe a picture of a ditransitive event (i.e., comprehension‐to‐production priming).

Table 7.2 Stimulus materials for a hypothetical small clause study. [Prime match pictures (match decision: Yes) and target pictures are omitted here.]

Prime condition   Prime sentence                              Target sentence fragment
PO                The cowboy gave a banana to the burglar     The teacher sold…
DO                The cowboy gave the burglar a banana        The teacher sold…
Small Clause      The cowboy called the burglar a liar       The teacher sold…
Baseline          The cowboy sneezed                          The teacher sold…

The materials involve a restricted range of easily identifiable entities (e.g., a soldier, a cake) and actions (including 6 dative verbs), with which participants are familiarized before the experiment. Target sentence fragments are compatible with either a DO or a PO description (e.g., The teacher sold…). These features together serve to reduce participants' cognitive load (e.g., identifying entities and actions, retrieving appropriate words) and variability in responses (e.g., use of target structures other than PO and DO). The left‐right order of Agent and Recipient in pictures is counterbalanced, to control for any link between picture‐scanning preferences and word order.

Participants experience 24 experimental items (6 per prime condition), interspersed with 72 fillers. The order of items and conditions is individually randomized (to avoid participants detecting a pattern and to control for possible between‐trial priming effects), with the constraint that two to four fillers occur between consecutive experimental items (to minimize inter‐trial interference).
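Generating such a constrained order is straightforward; the sketch below shows one way to do it (a rejection‐sampling approach; the item labels are placeholders, and the 2–4 filler gap is applied before every experimental item, including the first).

    # Sketch: an individually randomized trial order with 2-4 fillers
    # between consecutive experimental items (rejection sampling).
    import random

    def make_trial_order(n_experimental=24, n_fillers=72):
        experimental = [f"exp{i:02d}" for i in range(n_experimental)]
        fillers = [f"fill{i:02d}" for i in range(n_fillers)]
        while True:
            random.shuffle(experimental)
            random.shuffle(fillers)
            gaps = [random.randint(2, 4) for _ in range(n_experimental)]
            if sum(gaps) <= n_fillers:          # enough fillers for all gaps?
                order, pool = [], iter(fillers)
                for gap, exp in zip(gaps, experimental):
                    order.extend(next(pool) for _ in range(gap))
                    order.append(exp)
                order.extend(pool)              # leftover fillers at the end
                return order

    print(make_trial_order()[:10])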

Table 7.3 Hypothetical results for a small clause study.

Prime condition   PO responses   DO responses   Other responses
PO                111 (78%)      31 (22%)       2
DO                63 (44%)       80 (56%)       1
Small Clause      66 (46%)       77 (54%)       1
Baseline          96 (67%)       48 (33%)       0

Most fillers involve intransitive and transitive events that should prime neither target structure. However, because it is hypothesized that two of the four priming conditions favor DO completions, 12 fillers use PO structures to boost the production of PO sentences in the experiment as a whole.

Participants' target descriptions are audio‐recorded, and coded as PO, DO, or Other according to a set of specific criteria. Participants' likelihood of producing a PO versus DO structure following each prime type is analyzed using logit mixed effects modelling (excluding the <1% Other responses). The dependent variable is the response type produced on each trial (PO versus DO). The full model, which includes prime type as a fixed effect, is compared with the null model, which excludes it, using the maximal random effects structure justified by the design. If there is a priming effect, the model including prime type should provide a significantly better fit to the data than the null model, as assessed using a likelihood ratio test. More importantly, if small clauses and DOs have the same syntactic structure, then participants should be more likely to produce DO target descriptions after small clause primes and DO primes than after PO or Baseline primes, and the small clause and DO conditions should not differ, as assessed using pairwise comparisons. Table 7.3 shows a hypothetical set of results that would be consistent with this pattern.
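The likelihood ratio test itself reduces to simple arithmetic once both models have been fitted. A worked sketch follows; the log‐likelihood values are made up for illustration.

    # Worked sketch of the likelihood ratio test (log-likelihoods are made
    # up; df is the number of extra fixed-effect parameters).
    from scipy.stats import chi2

    ll_full, ll_null = -210.3, -218.9   # hypothetical fitted log-likelihoods
    df = 3                              # 4 prime conditions -> 3 dummy-coded terms
    lr = 2 * (ll_full - ll_null)        # likelihood ratio statistic = 17.2
    p = chi2.sf(lr, df)                 # ~.0006: full model fits reliably better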

Problems and Pitfalls

As we have seen, structural priming paradigms provide a powerful and flexible implicit measure of structural representation and processing in production and comprehension. They can be used to investigate many aspects of language structure in a range of populations, using simple materials that do not require complex equipment and that can be used outside a laboratory setting, as well as more sophisticated materials and specialized technical equipment. However, there are a number of potential pitfalls.

The requirement that target expressions allow a choice between alternative structures imposes a fundamental restriction on the range of expressions that can be investigated. Moreover, these alternatives should not be associated with systematic distinctions at other levels of structure (e.g., a systematic correspondence between syntactic structure A and semantic structure A, and between syntactic structure B and semantic structure B). The latter restriction is necessary to allow localization of the source of priming. For example, if alternative syntactic structures systematically co‐occur with alternative semantic structures, it can be difficult to determine whether any priming effects reflect repetition of syntactic structure, repetition of semantic structure, or both.

Additionally, in production priming paradigms (other than those that strongly constrain speakers' production, e.g., sentence recall), both alternatives must be felicitous in the experimental context and occur above a minimal level of frequency in the language in order to be susceptible to priming (although use of infrequent structures can be boosted by the inclusion of fillers involving that structure): Priming can alter the relative preference for a possible response, but cannot generally induce use of a response that would not normally be considered as a possible response (see note 2). In comprehension priming paradigms, priming effects in on‐line processing are most easily detectable when there is a strong difference in preference between alternatives (e.g., difficult garden path sentences), and it may be difficult to establish priming when a marked asymmetry in preference does not exist. Indeed, whereas production priming effects are generally robust (especially for syntax), comprehension priming effects appear more fragile and are often detected only when there is some content overlap between prime and target (e.g., verb repetition).

Priming also appears to be attenuated when prime expressions are not fully processed. This is most likely to occur if participants are not required to respond to the prime in some way. Depth of processing is therefore critical, and is most easily ensured by using secondary tasks (e.g., comprehension questions, memory tests, or picture‐matching decisions). It may also be necessary to exclude data from trials where accurate processing of the prime cannot be confirmed (in particular if there is doubt about whether participants used the intended structure).

A further potential source of data loss in production priming studies stems from participants' use of prime and/or target structures other than those under study. The less constrained the task, the more likely such responses are to occur.
Fragment completion tasks in which participants are free to generate their own conceptual content often yield high rates of "other" responses, but picture‐description paradigms that do not provide cues may also yield large quantities of data that must be discarded. This "exuberant responding" problem can be addressed to some extent by careful pretesting of stimuli, use of suitable examples during the instruction phase, and use of relevant constraining cues (e.g., providing content for participants to use in their responses). In interactive contexts, the confederate's utterances may also act as an implicit normative influence. Use of highly constraining paradigms (e.g., sentence recall) avoids these problems and allows close experimental control, but is open to charges of artificiality and a lack of ecological validity.

Other experimental control issues relate to the means of presenting stimuli and recording responses. Both production and comprehension priming can be investigated with little technical equipment, for example using pictures to elicit responses, where the dependent measure is participants' structural choices. But such methods allow considerable variability with respect to, for example, the timing of prime/target presentation. In contrast, computerized presentation allows considerable control, but imposes limitations on where studies can be carried out. Equally, there are concerns that overly close control over the temporal dynamics of stimulus presentation can reduce ecological validity, for example the use of unnaturally slow word‐by‐word auditory presentation.

A different issue concerns the modality of prime/target processing. Many production paradigms engage the comprehension system to some extent (e.g., when participants read or hear prime sentences, or prime/target fragments prior to producing completions). Using distinct modalities for primes and targets (i.e., comprehended primes/produced targets) may allow inferences about the extent to which relevant representations and processes are amodal. But where both modalities are implicated in prime and/or target responses, it may be difficult to distinguish the locus of effects. For example, priming effects following repetition of a prime may implicate comprehension processes involved in reading the prime prior to repetition, or production processes involved in repetition. This is a particular concern where structural priming is used to investigate questions relating to processing rather than representation.

Other important considerations concern experimental design. First, if the direction of priming is critical (i.e., whether both alternative structures are primed, or only one, generally the disfavored structure), it may be important to include an unrelated baseline prime condition: Using only primes involving the two alternative structures means that on each trial either one structure is primed or the alternative is primed, and hence there is no neutral baseline against which facilitation can be measured.

Additionally, the overall balance of structures within the experiment is potentially influential. Increasing evidence suggests that structural priming involves learning that endures beyond individual trials. As such, participants' behavior on a trial may be influenced by their experiences elsewhere in the experiment. Indeed, many studies investigate the effects of cumulative exposure to structures (Kaschak, Loney, & Borreggine, 2006). Researchers should therefore be aware that priming for a structure may attenuate with increasing exposure to it within the experiment, and consider inclusion of possible trial order effects during analysis (Jaeger & Snider, 2013). Moreover, "unbalanced" designs in which more than one condition primes one of the alternative structures (as in our exemplary study) can lead to very low usage of the other structure; this is a particular problem when the other structure is normally infrequent anyway. As with the use of a priori low‐frequency target structures, a possible solution is the inclusion of fillers using the other structure.

Finally, specific populations may impose particular constraints on possible methodologies. As we have discussed, structural priming paradigms can be successfully adapted for use with children as well as adult second language learners and clinical populations. When studying such populations, potential considerations include ensuring maintained attention and detailed processing; avoiding fatigue; reducing cognitive load; encouraging use of relevant responses; and avoiding the necessity for literacy. For these groups, it may be important to use smaller numbers of experimental items and fillers to ensure that sessions are not too taxing. In particular, the use of engaging interactive game‐based paradigms with children can encourage attention and depth of processing, as well as sustained motivation. This is preferable to paradigms in which children hear and repeat primes but are not required or motivated to act on them, which may tax working memory as well as reducing attention and engagement.
Familiarizing participants with objects and actions (and the words to use for them) beforehand can also reduce cognitive load (e.g., lexical retrieval processes or conceptual processing) during the experiment, and can be beneficial for second language learners as well as children.

Acknowledgments

Holly Branigan acknowledges a British Academy/Leverhulme Trust Senior Research Fellowship. Catriona Gibb received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 613465.

Key Terms

Comprehension priming  Priming manifested in comprehension of language following production or comprehension of a prime.
Prime  Stimulus whose processing leads to facilitation for a subsequent stimulus.
Production priming  Priming manifested in production of language following production or comprehension of a prime.
Structural priming  Facilitation for repeated structural aspects of language.
Syntactic priming  Facilitation for repeated syntactic structure.
Target  Stimulus whose processing is facilitated by a prior stimulus.

Notes

1 Priming can also involve inhibition, but for ease of exposition we couch our discussion in terms of facilitation.
2 Note that priming modulates how people process appropriate responses for a stimulus, but the stimulus constrains the possible responses that are considered (and which may therefore be affected by priming).
3 Importantly, priming raises the relative likelihood of using a structure; it does not necessarily make that structure the most likely structure to be used in absolute terms.

References

Allen, M. L., Haywood, S., Rajendran, G., & Branigan, H. P. (2011). Evidence for syntactic alignment in children with autism. Developmental Science, 14, 540–548. doi:10.1111/j.1467‐7687.2010.01001.x
Arai, M., van Gompel, R. P. G., & Scheepers, C. (2007). Priming ditransitive structures in comprehension. Cognitive Psychology, 54, 218–250. doi:10.1016/j.cogpsych.2006.07.001
Bernolet, S., Hartsuiker, R. J., & Pickering, M. J. (2013). From language‐specific to shared syntactic representations: The influence of second language proficiency on syntactic sharing in bilinguals. Cognition, 127, 287–306. doi:10.1016/j.cognition.2013.02.005
Bock, J. K. (1986). Syntactic persistence in language production. Cognitive Psychology, 18, 355–387.
Branigan, H. P., McLean, J. F., & Jones, M. W. (2005). A blue cat or a cat that is blue? Evidence for abstract syntax in young children's noun phrases. In A. Brugos, M. Clark‐Cotton, & S. Ha (Eds.), BUCLD 29: The Proceedings of the Twenty‐Ninth Boston University Conference on Language Development (pp. 109–121). Somerville, MA: Cascadilla Press. Retrieved from http://istina.msu.ru/static/pl‐2012_html/documents/Priming_add.pdf

Branigan, H. P., Pickering, M. J., & Cleland, A. A. (2000). Syntactic co‐ordination in dialogue. Cognition, 75, 13–25. doi:10.1016/S0010‐0277(99)00081‐5
Branigan, H. P., Pickering, M. J., & McLean, J. F. (2005). Priming prepositional‐phrase attachment during comprehension. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 468–481. doi:10.1037/0278‐7393.31.3.468
Branigan, H. P., Pickering, M. J., McLean, J. F., & Cleland, A. (2007). Syntactic alignment and participant role in dialogue. Cognition, 104, 163–197. doi:10.1016/j.cognition.2006.05.006
Branigan, H. P., Pickering, M. J., Stewart, A. J., & McLean, J. F. (2000). Syntactic priming in spoken production: Linguistic and temporal interference. Memory & Cognition, 28, 1297–1302. doi:10.3758/BF03211830
Bunger, A., Papafragou, A., & Trueswell, J. C. (2013). Event structure influences language production: Evidence from structural priming in motion event description. Journal of Memory and Language, 69, 299–323. doi:10.1016/j.jml.2013.04.002
Cai, Z. G., Pickering, M. J., & Branigan, H. P. (2012). Mapping concepts to syntax: Evidence from structural priming in Mandarin Chinese. Journal of Memory and Language, 66, 833–849. doi:10.1016/j.jml.2012.03.009
Hartsuiker, R. J., & Kolk, H. H. (1998). Syntactic facilitation in agrammatic sentence production. Brain and Language, 62, 221–254. doi:10.1006/brln.1997.1905
Hartsuiker, R. J., Pickering, M. J., & Veltkamp, E. (2004). Is syntax separate or shared between languages? Cross‐linguistic syntactic priming in Spanish‐English bilinguals. Psychological Science, 15, 409–414. doi:10.1111/j.0956‐7976.2004.00693.x
Ivanova, I., Pickering, M. J., Branigan, H. P., Costa, A., & McLean, J. F. (2012). The comprehension of anomalous sentences: Evidence from structural priming. Cognition, 122, 193–209. doi:10.1016/j.cognition.2011.10.013
Jaeger, T. F., & Snider, N. E. (2013). Alignment as a consequence of expectation adaptation: Syntactic priming is affected by the prime's prediction error given both prior and recent experience. Cognition, 127, 57–83. doi:10.1016/j.cognition.2012.10.013
Kaschak, M. P., Loney, R. A., & Borreggine, K. L. (2006). Recent experience affects the strength of structural priming. Cognition, 99. doi:10.1016/j.cognition.2005.07.002
Kim, C. S., Carbary, K. M., & Tanenhaus, M. K. (2014). Syntactic priming without lexical overlap in reading comprehension. Language and Speech, 57, 181–195. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/25102605
Konopka, A. E., & Bock, K. (2009). Lexical or syntactic control of sentence formulation? Structural generalizations from idiom production. Cognitive Psychology, 58, 68–101. doi:10.1016/j.cogpsych.2008.05.002
Ledoux, K., Traxler, M. J., & Swaab, T. Y. (2007). Syntactic priming in comprehension: Evidence from event‐related potentials. Psychological Science, 18, 135–143. doi:10.1111/j.1467‐9280.2007.01863.x
Luka, B. J., & Barsalou, L. W. (2005). Structural facilitation: Mere exposure effects for grammatical acceptability as evidence for syntactic priming in comprehension. Journal of Memory and Language, 52, 444–467. doi:10.1016/j.jml.2005.01.013
Pickering, M. J., & Branigan, H. P. (1998). The representation of verbs: Evidence from syntactic priming in language production. Journal of Memory and Language, 39, 633–651. Retrieved from http://linkinghub.elsevier.com/retrieve/pii/S0749596X9892592X
Potter, M. C., & Lombardi, L. (1998). Syntactic priming in immediate recall of sentences. Journal of Memory and Language, 38, 265–282. doi:10.1006/jmla.1997.2546
Raffray, C. N., & Pickering, M. J. (2010). How do people construct logical form during language comprehension? Psychological Science, 21, 1090–1097. doi:10.1177/0956797610375446
Raffray, C. N., Pickering, M. J., Cai, Z. G., & Branigan, H. P. (2014). The production of coerced expressions: Evidence from priming. Journal of Memory and Language, 74, 91–106. doi:10.1016/j.jml.2013.09.004

Rowland, C. F., Chang, F., Ambridge, B., Pine, J. M., & Lieven, E. V. M. (2012). The development of abstract syntax: Evidence from structural priming and the lexical boost. Cognition, 125, 49–63. doi:10.1016/j.cognition.2012.06.008
Schoot, L., Menenti, L., Hagoort, P., & Segaert, K. (2014). A little more conversation: The influence of communicative context on syntactic priming in brain and behavior. Frontiers in Psychology, 5. doi:10.3389/fpsyg.2014.00208
Segaert, K., Menenti, L., Weber, K., Petersson, K. M., & Hagoort, P. (2012). Shared syntax in language production and language comprehension: An fMRI study. Cerebral Cortex, 22, 1662–1670.
Sevald, C. A., Dell, G. S., & Cole, J. S. (1995). Syllable structure in speech production: Are syllables chunks or schemas? Journal of Memory and Language, 34, 807–820.
Smith, M., & Wheeldon, L. (2001). Syntactic priming in spoken sentence production: An online study. Cognition, 78, 123–164. doi:10.1016/S0010‐0277(00)00110‐4
Thothathiri, M., & Snedeker, J. (2008). Syntactic priming during language comprehension in three‐ and four‐year‐old children. Journal of Memory and Language, 58, 188–213. doi:10.1016/j.jml.2007.06.012
Tooley, K. M., Konopka, A. E., & Watson, D. G. (2014). Can intonational phrase structure be primed (like syntactic structure)? Journal of Experimental Psychology: Learning, Memory, and Cognition. doi:10.1037/a0034900
Tooley, K., Traxler, M., & Swaab, T. (2009). Electrophysiological and behavioural evidence of syntactic priming in sentence comprehension. Journal of Experimental Psychology: Learning, Memory and Cognition, 35, 19–45. doi:10.1037/a0013984
Vernice, M., Pickering, M. J., & Hartsuiker, R. J. (2012). Thematic emphasis in language production. Language and Cognitive Processes, 27, 631–644. doi:10.1080/01690965.2011.572468
Viau, J., Lidz, J., & Musolino, J. (2010). Priming of abstract logical representations in 4‐year‐olds. Language Acquisition, 17, 26–50. doi:10.1080/10489221003620946
Weber, K., & Indefrey, P. (2009). Syntactic priming in German‐English bilinguals during sentence comprehension. NeuroImage, 46, 1164–1172. doi:10.1016/j.neuroimage.2009.03.040

Further Reading

Use of cue‐based recall paradigm: Ferreira, V. S. (2003). The persistence of optional complementizer production: Why saying 'that' is not saying 'that' at all. Journal of Memory and Language, 48, 379–398. doi:10.1016/S0749‐596X(02)00523‐5
Use of picture description paradigm adapted for amnesics: Ferreira, V. S., Bock, J. K., Wilson, M. P., & Cohen, N. J. (2008). Memory for syntax despite amnesia. Psychological Science, 19, 940–946.
Example of non‐game‐based picture‐description paradigm for children: Huttenlocher, J., Vasilyeva, M., & Shimpi, P. (2004). Syntactic priming in young children. Journal of Memory and Language, 50, 182–195.

8 Conversation Analysis

Elliott M. Hoey and Kobin H. Kendrick

Abstract

Conversation Analysis (CA) is an inductive, micro-analytic, and predominantly qualitative method for studying human social interactions. This chapter describes and illustrates the basic methods of CA. We first situate the method by describing its sociological foundations, key areas of analysis, and particular approach in using naturally occurring data. The bulk of the chapter is devoted to practical explanations of the typical conversation analytic process for collecting data and producing an analysis. We analyze a candidate interactional practice – the assessment-implicative interrogative – using real data extracts as a demonstration of the method, explicitly laying out the relevant questions and considerations for every stage of an analysis. The chapter concludes with some discussion of quantitative approaches to conversational interaction, and links between CA and psycholinguistic concerns.

Introduction

The language sciences are undergoing an interactive turn, as researchers in social neuroscience (Schilbach et al., 2013), psycholinguistics (Pickering & Garrod, 2004; Levinson, 2016), and cognitive science (De Jaegher et al., 2010, 2016; Fusaroli et al., 2014) increasingly recognize interaction as the arena in which their diverse concerns converge. Underpinning this robust stream of interaction research are decades of cumulative discoveries from Conversation Analysis (CA) on the organization of interactive language use. This chapter explicates conversation analytic principles, findings, and methods.

CA is an inductive, micro‐analytic, and predominantly qualitative method for studying language as it is used in social interaction. It differs most distinctly from other methods in this handbook in its use of field recordings of naturally occurring conversation; its focus on language as a resource for social action; and its procedure of basing analyses on the details of participants' own behavior. As we will see, the method consists in the collection and curation of instances of an interactional phenomenon, the case‐by‐case analysis of that phenomenon, and the production of a formal account of its operation. The CA approach typically resonates with those who are interested in the specifics of human social conduct and committed to naturalistic observation. It offers researchers a well‐developed descriptive apparatus for investigating conversational interaction and a rigorously empirical procedure for supporting analyses.

Historical and Conceptual Background

CA was developed in the 1960s and 1970s by Harvey Sacks with his colleagues Emanuel Schegloff and Gail Jefferson. It emerged as a distinctive approach in sociology principally via the influence of Erving Goffman and Harold Garfinkel. Goffman's (1967) major innovation was uncovering an entirely new domain of sociological inquiry, face‐to‐face interaction. As Goffman's students, Sacks and Schegloff developed an appreciation of interaction as a locus of social organization that could be investigated in its own right. Around the same time, Harold Garfinkel was establishing ethnomethodology, a unique perspective on everyday activities that critiqued prevailing theories of social order. For Garfinkel (1967), social order was not to be located in aggregate descriptions of social life, but in the very methodical procedures that people deployed in situ to render their local circumstances intelligible. As such, the intelligibility of any social activity was an achieved intelligibility, one that participants themselves designed, ratified, and sustained using commonsense knowledge and practical reasoning (Heritage, 1984). CA synthesized these two themes: the methods with which participants themselves go about recognizing and producing actions, together in actual episodes of social interaction.

CA's guiding principle is that interaction exhibits "order at all points" (Sacks, 1992(I), p. 484). This orderliness is normative—it is produced and maintained by the participants themselves in their orientations to social rules or expectations. One conversational norm is "one party speaks at a time" (Sacks, Schegloff, & Jefferson, 1974). This is evidenced not only by the fact that conversations everywhere tend to proceed in this way, but also by cases where participants depart from the norm. Imagine the following: while someone is speaking, another participant whispers to a third party. This is not evidence against the one‐at‐a‐time norm. Rather, overlapping talk produced in a whisper and directed to a third party reveals an orientation to the norm itself. The whispering participant shows herself as "not the current speaker," thereby acknowledging the norm while demonstrably departing from it. Such participant orientations let us recover the normative order of social settings from the very details of interaction itself.

In CA, talk is seen as a vehicle for action. Participants attend to talk not for its propositional content, nor as a simple medium of information transfer, but because they care about the actions getting done through talk (e.g., asking, requesting, complaining, noticing, and so on), and the real life consequences of those actions (Schegloff, 1995). Further, talk is examined not as isolated utterances, but as talk‐in‐interaction, an activity that transpires in real settings between real people. In this respect, actions in interaction are always contextually situated; they are produced by someone, for someone else, at a certain time, in a certain way.

This approach to language and social interaction has over the last half‐century resulted in a well‐developed descriptive apparatus for analyzing interactional structures. There are several intersecting "machineries" of practice required for conducting conversation. We briefly describe four: turn‐taking, sequence organization, turn design, and repair.

Turn‐taking procedures address the recurrent problems of "who speaks next?" and "when do they start?" by coordinating the ending of one turn with the start of the next (Sacks et al., 1974). Turns are composed of one or more turn‐constructional units (TCUs), which consist of linguistic units (words, phrases, clauses, etc.) that form a recognizably complete utterance in a given context. As a turn approaches a place where it could be treated as adequately complete, the possibility of turn‐transfer arises—a transition‐relevance place (TRP). At a TRP, participants use turn‐allocation techniques (other‐/self‐selection) in a hierarchically organized way (other‐selection by current speaker > self‐selection by others > self‐selection by current speaker). The turn‐taking organization thus provides for the orderly distribution of turns‐at‐talk in conversation.

Sequence organization refers to how successive turns link up to form coherent courses of action (Schegloff, 2007). The adjacency pair is the basis of this organization: two turns/actions, produced by different participants, where the first pair part (FPP) is followed in next position by a type‐matched second pair part (SPP), which, were it not produced, would be "noticeably absent." Examples of adjacency pairs include greeting‐greeting, question‐answer, invitation‐acceptance/declination, complaint‐account, and so on. The property that unites FPPs and SPPs is called conditional relevance, because the relevance of the second action is contingent upon the production of the first. Multiple adjacency pairs can be strung together to form complex courses of action by processes of sequence expansion.

Turn design refers to how speakers format their turns to implement some action, in some position, for some recipient(s) (Drew, 2013). A basic assumption in CA is that participants use talk and other conduct to produce recognizable actions, often employing particular grammatical formats as resources to do so (see Levinson, 2013). To make an offer, for example, speakers can design their turn as a conditional (if your husband would like their address, my husband would gladly give it to him), a declarative (I'll take her in Sunday), or an interrogative (do you want me to bring the chairs?), each of which systematically occurs in particular sequential positions (Curl, 2006).

Repair practices address troubles in speaking, hearing, and understanding (Schegloff, Jefferson, & Sacks, 1977). A repair procedure includes three basic components: a trouble source (e.g., an unfamiliar word), a repair initiation (i.e., a signal that begins a repair procedure), and a repair solution (e.g., a rephrasing of the unfamiliar word). Either the speaker of the trouble source (self) or its recipient (other) can initiate a repair procedure and/or produce a repair solution. Thus a distinction is made between, for example, self‐initiated self‐repair (e.g., so he didn't take Sat‐ uh Friday off), in which the speaker of the trouble source initiates and executes the repair procedure independently, and other‐initiated self‐repair (e.g., A: so he didn't take Saturday off. B: Saturday? A: Friday.), in which a recipient of the trouble source initiates the procedure and the speaker produces the solution.

Nature of the Data

Recording and Apparatus

Conversation analysts understand direct interaction between participants as the primordial site of sociality. Therefore, they almost exclusively use recordings of naturally occurring interactions, rather than constructed, imagined, or experimentally induced ones. Naturalistic data are preferred because field notes and memories of interactions are necessarily incomplete, and people's intuitions about how they behave in interaction often conflict with their actual behavior. Additionally, recordings may be played repeatedly and slowly, permitting the transcription and analysis of interactional details.

Any social occasion for which ethics permit recording is a potential site of interest, as any instance of people doing things together exhibits systematicity. The idea is to capture social life as it is lived—activities that would have taken place regardless of being recorded. This includes both "ordinary" interactions between friends and intimates, and "institutional" interactions occurring in hospitals, classrooms, and offices. Scripted interactions should be avoided (e.g., movies, television, plays), though call‐in radio programs, broadcast debates, and interviews have been profitably used (e.g., Heritage & Clayman, 2010). CA traditionally relied on telephone calls and short, fixed‐perspective video recordings of domestic life, meaning that much remains to be documented. Less well represented in the current literature are multiple recordings of the same participants, activity, or environment; multi‐day recordings; usage of multiple cameras; and recordings of mobile activities.

While any activity is theoretically available for analysis, some may present challenges. Anything that impairs transcription of audible/visible conduct (e.g., poor lighting, a cacophonous setting, substantial overlapping speech) makes an analysis less reliable. The researchers themselves may also impede analysis if they lack basic knowledge of the occasion being recorded. Analysis requires adequate familiarity with the language(s) and culture(s) represented, some understanding of who the participants are to one another, and a practical grasp of the situation being documented.

With respect to the recording apparatus, video is required if participants are face‐to‐face, and multiple cameras capturing different perspectives are preferable to single cameras. Richer data are of course obtained using the best technology currently available, such as high or ultra high definition video cameras. You may also consider using newly available technologies such as eye‐tracking glasses, body‐mounted or even drone‐mounted cameras, and wide angle or panoramic lenses. The resulting forms of data could yield novel findings when combined with a CA approach.

How much you need to record depends on the frequency of your phenomenon of interest and the practicalities of recording. CA dissertations, for instance, have been based on 10‐50 hours of recordings. While most conversation analysts collect their own primary data, especially as PhD students when this is typically required, some corpora are publicly available (see Further reading and resources), and others are readily shared among CA researchers. For discussion of camera positioning, consent forms, file formats, and other practical considerations, see Mondada (2013).

Transcription

Transcription is an important part of doing CA. Conversation analysts produce detailed transcripts of the talk—and in some cases of behaviors like gaze or gesture—before analyzing an episode of interaction. The conventions used in CA to transcribe talk (see Transcription conventions) were developed by Gail Jefferson and represent aspects of the phonetics, prosody, and timing of talk (Hepburn & Bolden, 2013). In CA transcripts, no detail should be ignored, because one cannot know a priori which perceptible features of the talk participants may use when making sense of their circumstances. The precise length of silences, and the places where they occur, have been shown to be deeply consequential for how participants understand interaction (Hoey, 2015; Kendrick & Torreira, 2015; Sacks et al., 1974). Transcripts should therefore show not only speech but also vocalizations like laughter, the boundaries of overlapping talk, the length of silences, inhalations and exhalations, sound stretches, prosodic contours, faster or slower speech, and so on. For the transcription of body behavior, we recommend Mondada's (2014) conventions for multimodal transcription.

Collecting and Analyzing Data

Identify a Candidate Phenomenon

Most analyses begin with an observation of something in the recorded data. Anything that participants treat as relevant for their interaction may be considered a candidate phenomenon for investigation. Observations might concern the structure of entire episodes of interaction, like "doctor's consultation" or "playing a board game." At a lower grain of organization, observations may concern the transaction of courses of action like "announcing bad news" or "arranging to meet." Observations may be directed at the actions that constitute such sequences, like requesting, complaining, or assessing. And perhaps at the smallest level of structural organization, potential phenomena may lie in the composition of such actions, like their prosodic contours, their grammatical construction, or the gestures that accompany their production.

Developing the skill to notice potential phenomena emerges from the study of naturalistic data. The CA policy here is ideally one of "unmotivated looking," or approaching data with nothing particular in mind. While this technique will naturally involve a researcher's particular interests, those intuitions and hunches are organically sculpted over time through experience with interactional data. Working knowledge of the basic structural organization of interaction (e.g., turn‐taking, sequence organization, turn design, and repair) is part of this, as is hands‐on practice in analyzing interactional data. Most students of CA develop their analytical skills in data sessions, where students and experts in the CA community gather to examine data together. Data sessions are an important pedagogical site for learners and practitioners to build experience in "unmotivated looking." And so, as in other disciplines, the ability to "see" phenomena of potential interest is at least partially gained through coursework, practice, and training with expert analysts. Furthermore, the time spent analyzing the same recordings over and over again allows you to familiarize yourself with the interactions. Commonly, something of interest in one place will remind you of a similar thing in another recording that you know well. In this way, familiarity with your materials also supports the ability to notice candidate phenomena.

In order to exemplify basic CA methods, we will introduce a candidate phenomenon that we noticed in a data session, and we will examine it throughout the chapter. Ultimately, for reasons that will soon become clear, we will come to refer to the phenomenon as an assessment‐implicative interrogative. But at this early stage in the research process, before the nature of the phenomenon is apparent, you should actively resist the urge to apply labels to it, because labels will guide what you see and choose to analyze and can obscure as much as they elucidate. Extract 8.1 presents our initial specimen of the phenomenon. In it, three friends are discussing a popular British television show, and Clara asks the other two a question.

Extract 8.1 [01_EMIC_n03t]
1 CLA: Have you seen the American version of The
2      Inbe[tweeners
3 AME:     [Oh it is aw[ful. = it’s so terrible
4 BOB:                 [Um:: no:::
5 CLA:                 [It’s so bad

We can start our analysis of this extract with some basic observations. First, Clara's question (lines 1‐2) is formatted grammatically as a yes/no interrogative, which makes relevant a yes/no response (Raymond, 2003). Second, Amelia's response to the question does not contain yes or no (or some equivalent form), but rather a negative assessment of the television show (line 3). Third, in overlap with Amelia's response, Bobby responds to the question negatively and produces no assessment (line 4). Fourth, Clara, who asked the question, subsequently produces a negative assessment of her own (line 5).

With these observations, we can draw some tentative conclusions about the sequence. The observation that Amelia responds to the question with an assessment, rather than an answer, suggests that she has understood the question as something other than a straightforward request for information. This exemplifies the next‐turn proof procedure: each turn in conversation displays, and thereby makes available for analysis, the speaker's understanding of the prior turn (Sacks et al., 1974). Furthermore, the observation that Clara then produces a negative assessment herself, thereby agreeing with Amelia, suggests that Amelia's understanding of the question was appropriate. Thus the participants' conduct provides evidence that the question at lines 1‐2 does not request information per se, but rather implicates an assessment of the object under discussion.

These observations and inferences alerted us to the possibility of a regular practice. Is it the case, we wondered, that asking someone if they have seen some object (e.g., a television show) implicates an assessment of it? To a conversation analyst, Extract 8.1 raises such questions. The methods of CA, which we describe in this chapter, offer the possibility of answers. An initial step in the research process is to produce a formal description of the phenomenon under investigation, which might be called the assessment‐implicative interrogative. We provide a first description below, and we will revise it repeatedly throughout the chapter.

Formal description of phenomenon I

–– Questioner produces yes/no interrogative
   • in have you seen X format,
   • making a yes/no response conditionally relevant.
–– Question‐recipient produces either
   • assessment, or
   • no.
–– Questioner produces a subsequent assessment
   • which agrees with the question‐recipient’s assessment.

Build a Collection of Cases

With a preliminary description of the phenomenon in hand, the next step is to examine additional audio and video recordings of social interaction to build a collection of cases that will form the empirical foundation of the analysis. The idea is to gather widely and generously so you catch a substantial range of variation in the target phenomenon and related phenomena. Include everything that satisfies the criteria you developed for your preliminary description, as well as everything that approximates but does not strictly conform to them. By gathering this way, you will start to detect the contours of the phenomenon and discern how it operates. As you examine additional data, you will revise your preliminary description as the nature of the phenomenon becomes clearer.

There are at least two approaches to collection building. The first involves examining recordings for all candidate cases of the phenomenon. While slow, this process has the benefit of being rigorous and systematic. You can claim, for instance, that 1 hour of data contained 100 cases of the phenomenon. The second approach is more serendipitous in nature. It involves stumbling upon cases of the phenomenon while working on something else (for example, in a data session), then adding them to the appropriate collection. While this approach is opportunistic rather than systematic, it allows for building multiple collections in parallel. And while gathering enough cases may take years, you can contemplate the phenomenon in a way that shorter time windows do not allow. Most conversation analysts use both approaches depending on the particularities of the project. The first approach is suitable for high‐frequency phenomena (e.g., assessments, overlap, nodding), and the second for phenomena that do not occur often, or do not occur in all settings/activities. Another relevant aspect of this process, as noted above, is familiarity with your own materials, since intimate knowledge of specific interactions will allow you to more quickly find instances of your phenomenon of interest. In a standard CA study, all recordings available to the researcher are drawn on in an opportunistic manner, while quantitative CA studies generally employ systematic sampling procedures (see Quantitative Methods in CA).

Because the composition of our example phenomenon includes specific lexical items (i.e., have you seen), we first searched the transcripts of our data for additional cases. Although a textual search can be a useful tool, CA collections invariably go beyond simple searches. One reason for this is that CA transcription conventions do not always use standard orthography. For example, the question did you have coffee? could be represented as d’yih’av co:ffee?, meaning that most searches for you or have would fail to locate it. A second reason is that not all phenomena of interest are discoverable by searching texts (e.g., prosody or body behavior). A third is that negative evidence is important in CA (Schegloff, 1996). Text searches only return things that occur; they cannot locate the non‐occurrence of something in a position where it relevantly could or should occur. With that said, our simple search nonetheless yielded additional candidate cases of the phenomenon, such as that in Extract 8.2.
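To make concrete what such a transcript search involves, and why it falls short, here is a minimal sketch in Python. The stripped symbol set, the in‐memory transcript, and the example lines are our own illustrative assumptions, not part of the chapter’s data or method:

    import re

    # Illustrative (not exhaustive) set of Jefferson-style symbols to strip
    # before matching; real transcripts would be read from files instead.
    SYMBOLS = re.compile(r"[:\[\]()<>^°↑↓£=-]")

    def normalize(line):
        """Lowercase a transcript line and strip prosodic symbols,
        e.g., "d'yih'av co:ffee?" -> "d'yih'av coffee?"."""
        return SYMBOLS.sub("", line).lower()

    def search(lines, pattern):
        """Return (line number, line) pairs whose normalized text matches."""
        rx = re.compile(pattern)
        return [(n, l) for n, l in enumerate(lines, 1) if rx.search(normalize(l))]

    transcript = ["CLA: Have you seen the American version of The",
                  "BET: d'yih'av co:ffee?"]
    print(search(transcript, r"have you (seen|heard|had)"))  # finds line 1 only
    # "d'yih'av" still fails to match "did you have": respelling, not just
    # symbols, is one reason a text search alone cannot build a collection.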

Extract 8.2 [Poker]
1 BEN: Have you seen the ↓chips that we play with
2      at yer house wi Roberto?=
3 SHA: =Yeah, I was thinkin that those were tight
4 BEN: Those are fun↓

This sequence satisfies many of the formal criteria we developed for Extract 8.1. The first speaker produces a yes/no interrogative in have you seen X format; the question‐recipient responds with an assessment; then the first speaker produces a second assessment which agrees with the first. There is one important difference, however: in addition to an assessment, the question‐recipient’s response also includes an answer to the question itself (i.e., Yeah; cf. Extract 8.2, line 3). The sequences in Extracts 8.1 and 8.2 thus appear to be variants of the same phenomenon.

While we found cases like Extract 8.2 that conformed to our preliminary description, we also encountered cases that challenged it, like Extract 8.3.

Extract 8.3 [02_EMIC_n09t]
1 ALI: Oo::h have you had (.) fried green tomato:es:?
2 CHA: No[::,
3 BRI:   [Those are [goo:d.
4 ALI:              [°So goo:d.°

Note that this sequence is formally analogous to that in Extract 8.1. The question receives two responses—one which answers the question in the negative (line 2; cf. Extract 8.1, line 4) and one which assesses the object in question (line 3; cf. Extract 8.1, line 3)—and the questioner produces a second assessment in agreement with the first (line 4; cf. Extract 8.1, line 5). In contrast to Extracts 8.1‐8.2, however, the yes/no interrogative here is not in have you seen X format.

At this point, our choices are either to specify some criteria to exclude cases like Extract 8.3 from the collection, or to revise our description of the phenomenon to include it. The first option would fail to recognize the obvious commonality between have you seen X and have you had X interrogatives: both inquire into the recipient’s perceptions or experiences. It thus seems more plausible that our initial description was too specific. Indeed, additional cases we identified support this conclusion and reveal further variation in turn design (e.g., did you ever go to the Cheesecake Factory?). Because participants treat different turn formats as the same kind of thing (e.g., by responding with assessments), we changed our description of the phenomenon accordingly.

An important methodological question at this stage is how big a collection needs to be. Schegloff (1996) suggests that 60 cases suffice, though other studies report on smaller and larger collections. Our collection contains 27 cases that satisfy the criteria below (changes marked with an asterisk).

Formal description of phenomenon II

–– Questioner produces yes/no interrogative
   • in {did, have} you + perception/experience verb + object format,*
   • making a yes/no response conditionally relevant.
–– Question‐recipient produces either
   • assessment,
   • yes + assessment,* or
   • no.
–– Questioner produces a subsequent assessment
   • which agrees with the question‐recipient’s assessment.
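A formal description of this kind can also be treated as an explicit filter over coded cases. The following sketch is purely illustrative: the coding scheme, field names, and example cases are our own assumptions, not part of the chapter’s collection.

    from dataclasses import dataclass

    @dataclass
    class Case:                 # hypothetical coding of one candidate case
        recording: str          # e.g., "01_EMIC_n03t"
        question_format: str    # e.g., "have you seen X", "did you have X"
        response: str           # "assessment", "yes+assessment", or "no"
        questioner_assesses: bool

    def fits_description_ii(c):
        """Check a coded case against formal description II."""
        perception_format = c.question_format.startswith(("have you", "did you"))
        licensed_response = c.response in {"assessment", "yes+assessment", "no"}
        return perception_format and licensed_response

    collection = [Case("01_EMIC_n03t", "have you seen X", "assessment", True),
                  Case("Poker", "have you seen X", "yes+assessment", True)]
    print(sum(fits_description_ii(c) for c in collection))  # -> 2

Cases that approximate but fail such a filter are exactly the ones worth keeping to one side, since they may force the description to be revised, as happens repeatedly below.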

Recommendation: Start with the Clearest Cases First

After building a collection, the next step is to analyze each case individually. As a general rule, it is a good idea to start with the clearest and most straightforward cases, even if they appear “boring” in comparison to others. Only after developing an analytic grasp of the clear cases should you tackle the more complex ones. Ultimately, of course, your analysis must account for the whole collection, but you should work from the inside out, as it were, starting with the dead center of the phenomenon. Here are a few general suggestions for how to begin.

1 Start at a beginning. A case that occurs close to the beginning of a new course of action (e.g., a new topic, activity, etc.) will be easier to analyze than one that is deeply embedded within a complex sequence. Such cases are often clearer because you can track the trajectory of action leading up to the focal phenomenon.
2 Capitalize on prior research. Cases that occur in interactional contexts that have already been well‐described in the CA literature can shed light on the phenomenon. For example, if a case occurs within a recognizable action sequence (request‐acceptance, question‐answer, etc.), it may be easier to analyze than others.
3 Watch for self‐repair. A powerful form of evidence in CA comes in the form of cases where participants’ conduct directly confirms the analyst’s account of some phenomenon. This can be seen in some instances of self‐repair. For example, a speaker may start a turn as why don’t we and then change it to why don’t I. This self‐repair displays the speaker’s understanding of both formats, the action each implements, and how such an action would fit in the specific interactional context (Drew, Walker, & Ogden, 2013).

Analyze Each Case in the Collection

The next step is developing an analysis for each case in the collection. Start by considering the basic nuts and bolts of any interaction: activity, participation, position, composition, and action. An adequate analysis of any phenomenon rests on an understanding of how these facets of interaction operate line‐by‐line and moment‐to‐moment.

Activity is what participants are doing together through interaction. Relevant considerations include: What circumstances bring the participants into interaction? What resources or constraints does the activity furnish? Do participants orient to a shared‐in‐common activity structure, the environmental setting, or the communicative medium? Is it goal‐directed, or more loosely organized? Are certain things done at certain times, in a certain order, by certain participants?

Participation refers to the roles that participants occupy over the course of a given activity. Consider questions like: What interactional roles do the participants occupy right now (e.g., someone who just started speaking, someone who just stopped speaking), in this specific turn at talk (e.g., speaker, recipient), in this sequence of action (e.g., speaker of trouble source, repair initiator), on this specific occasion (e.g., caller, called)? How do the participants orient to and flexibly exploit these participatory roles?

Position refers to where something occurs in the course of interaction. Consider how a turn‐at‐talk fits into the larger sequence of action. Does it initiate a sequence, mandating a response? Or is it responsive to a previous turn, potentially completing the sequence (Schegloff, 2007)? Take Extract 8.4 for example. Here, Rick initiates a sequence with a question in did you see X format. As we have seen, such questions can implicate a yes/no response, an assessment, or some combination thereof. None of these immediately follow the question, however. Instead, Luke produces a question of his own, an other‐initiation of repair (OIR; see Kendrick, 2015a, for a review).

Extract 8.4 [05_Monopoly_Boys]
1 RIC: Didya see the Yankees didn‐ (.) resign Bernie,
2      (0.7)
3 LUK: Williams?
4 RIC: Mmhm
5      (1.0)
6 RIC: .TSK No.[(w‐ sh‐)
7 LUK:         [Ba:d idea.

After Rick confirms that Luke has understood the reference to Bernie correctly (i.e., Bernie Williams), Luke responds to Rick’s initial question with an assessment: Ba:d idea (line 7). (Note that Rick’s turn at line 6 is the beginning of a tease and does not bear on the basic structure of the sequence described here.) This example shows that the relevant response to a sequence‐initiating action need not occur in the next turn and can be “displaced” by other activities, in this case an insert sequence (Schegloff, 2007). It also shows that sequences can have complex structures, with one adjacency pair (lines 3‐4) embedded within another (lines 1 and 7).

Composition refers to the verbal, vocal, bodily, or material resources that form an action. Consider every turn component as possibly relevant: turn‐initial inbreaths, clicks, or sighs (Hoey, 2014); the grammatical format of the turn (e.g., a did you see X interrogative); the selection of one word over another (e.g., have you had vs. have you eaten fried green tomatoes); the prosodic accents and intonational contours of the turn; and so on. How do these contribute to what’s getting done? How would things change if alternative forms were used, or if something were left out? How does the composition reflect position? How does it deal with what came before? How is it designed for its recipients? Consider, for example, the composition of the questions and assessments in Table 8.1.

Table 8.1 Questions and assessments from Extracts 8.1 to 8.3.

Extract  Questions                                        Assessments
1        Have you seen the American version of The        oh it is awful
         Inbetweeners                                     it’s so bad
2        Have you seen the↓ chips that we play with       I was thinkin that those were tight
         at yer house wi Roberto?                         Those are fun↓
3        Ooh have you had fried green tomatoes            Those are good
                                                          So good

The questions feature interrogative syntax, second person subjects, verbs of perception or experience in past tense, detailed descriptions of the perceived or experienced object, and affective prosody. The assessments feature pronominal references and clearly valenced predicate adjectives, and are relatively short. These are all potentially relevant for an analysis. Take, for instance, the turn‐initial particle ooh in Extract 8.3. Turn‐initial particles can project the type of action that the incipient turn will implement (Levinson, 2013). As an affective particle, ooh imparts an emotional valence to the question and displays a positive stance toward fried green tomatoes. This implicit assessment may provide a place for other participants to display some stance toward fried green tomatoes as well.

Action refers to what some talk or other conduct accomplishes in interaction. A methodological mantra in CA is that “position plus composition equals action,” meaning that an analysis of what someone is doing is largely a question of where their conduct occurs and how it gets formatted (Schegloff, 1995). Thus a characterization of action should come after an adequate analysis of sequence structure and turn construction.

The goal of this stage in the process is to produce a line‐by‐line analysis of each case in the collection. Start at the beginning of the data extract and work through the transcript word‐by‐word, turn‐by‐turn, sequence‐by‐sequence. Write down your observations and inferences (e.g., as bullet points) and revise your formal description of the phenomenon as necessary to account for the data.

Analyze Variation in the Collection

The next step is to come to grips with the variation exhibited by the phenomenon. The analysis of variant cases should focus on those dimensions of variation that participants orient to as relevant and meaningful. The task is to track forms of variation across the collection and sort cases into ad hoc categories such that you can easily compare variants. Which dimensions of variation are relevant will depend on the nature of the phenomenon. We’ve already observed variation in Extracts 8.1‐8.4. For instance, question‐recipients have the option to produce yes, no, or neither of these. The choice appears to be consequential for where an assessment occurs and which participant produces it. When the question‐recipient responds with a no token, the assessment appears after it, produced by the questioner, as in Extract 8.3 and below in Extract 8.5.

Extract 8.5 [LUSI:Santa Barbara 2]
1 CIN: Yea:h have you tried there?
2 DAD: N:o.
3 CIN: They’re a lot s:maller than the ones we got in L A:
4      but they’re, >↑they’re kinda< decent¿

Conversely, when the question‐recipient responds with a yes token, then the assessment appears directly after it, produced in the same turn by the question‐recipient, as in Extract 8.2 and below in Extract 8.6.

Extract 8.6 [SBC045]
1 COR: Did you hear about that cop (.) in Milwaukee?
2 PAT: Oh: yeah, I loved that.

When the question‐recipient responds with something other than a yes/no response, then the assessment is produced by the question‐recipient, who places it either in the next turn (Extract 8.1) or after an insert sequence (Extract 8.4). Furthermore, when the question‐recipient does give an assessment, the questioner may also give an assessment afterwards, as in Extracts 8.1 and 8.3.

From just six examples, we’ve identified several types of variation: what follows the interrogative (yes, no, something else); who provides the assessment (questioner, question‐recipient, both); and where the assessment occurs (by itself, after yes in the same turn). More generally, these are intersecting matters of what gets done, in what order, by which participants, in what way, and so on. These are the sorts of considerations involved in analyzing variation.

One type of variation in the collection led us to modify our formal description. The question Have you tried there (Extract 8.5, line 1) does not contain an explicit object, but implicitly references a restaurant under discussion. Because the sequence nonetheless transpires as expected, we modified the formal description below (change marked with an asterisk).

Formal description of phenomenon III

–– Questioner produces yes/no interrogative
   • in {did, have} you + perception/experience verb + object format,
     ▪▪ where object can be implicit,*
   • making a yes/no response conditionally relevant.
–– Question‐recipient produces either
   • assessment,
   • yes + assessment, or
   • no.
–– Questioner optionally produces a subsequent assessment
   • which agrees with the question‐recipient’s assessment.
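Although CA itself proceeds qualitatively, a simple tabulation can help keep track of such dimensions of variation as a collection grows. In this sketch the codings of Extracts 8.1 to 8.6 are our own shorthand for the observations above, added purely for illustration:

    from collections import Counter

    # Shorthand codings: (what follows the interrogative, who assesses first).
    codings = {"8.1": ("assessment", "recipient"),
               "8.2": ("yes+assessment", "recipient"),
               "8.3": ("no", "questioner"),
               "8.4": ("insert sequence", "recipient"),
               "8.5": ("no", "questioner"),
               "8.6": ("yes+assessment", "recipient")}

    # Cross-tabulate the two dimensions of variation across the collection.
    table = Counter(codings.values())
    for (response, assessor), n in sorted(table.items()):
        print(f"{response:16s} {assessor:10s} {n}")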

Define the Boundaries of the Phenomenon

Standing in opposition to clear cases are boundary cases. These resemble the phenomenon under investigation but can be shown analytically not to be genuine instances of it. The process whereby one identifies such cases and develops criteria to exclude them from the core collection defines the boundaries of the phenomenon (see Schegloff, 1997). For our phenomenon, we identified two types of boundary cases that forced us to amend our formal description. Take Extract 8.7 for example. Cindy had been to Mom’s house to retrieve something from a closet. Mom has just finished explaining why her closet was so messy.

Extract 8.7 [LUSI:Santa Barbara 2]
1 MOM: So did‐ uh Matthew didn’t tell you I’m a clothes horse
2      did’ee hahuhuh
3 CIN: That you’re a WHAt?=
4 MOM: =Have you ever heard the expression <clothes horse>
5 CIN: ↑No:: °what is it?°
6 MOM: Oh: my mother had a clothes fetish that means like you
7      have um: obsessive amounts of clothes haha

The question at line 4 clearly matches our formal criteria and was therefore included in our collection. However, it turned out to be problematic for our analysis because no assessment of the expression clothes horse ever occurs, nor is there any orientation to the non‐occurrence of an assessment. On its face, this case contradicts the tentative generalization we’ve developed so far, namely that questions like have you ever heard X implicate an assessment of the object in question.

So what’s going on here? A careful line‐by‐line analysis of the cases in the collection in terms of position provided a straightforward answer. Simply put, all other questions in our collection occur in a sequence‐initial position; they are constructed with specific linguistic practices that practically mandate a responsive action. By contrast, the question at line 4 is not in sequence‐initial position, but in a sequence‐subsequent position. It occurs as part of a complex insert sequence that the participants produce to deal with Cindy’s trouble in understanding the expression clothes horse. Our phenomenon appears to be restricted to the beginning of a course of action. Therefore, we excluded Extract 8.7 based on sequence‐organizational grounds and amended our description to specify that the question must be in a sequence‐initial position.

Now consider Extract 8.8. Like Extract 8.7, it presents a challenge because the question (line 4) fits our description, but no assessment ever occurs. However, we could not exclude this case, nor others like it, on sequence‐organizational grounds because the question occurs sequence‐initially.

Extract 8.8 [13_RCE28]
1 KEL: So i’wzlike‐ (.) >we could js< buy a private island,
2      (cuzs) cheaper than a house¿
3      (0.8)
4 HEA: Did you see BBC Breakfast this £morn(h)ing?
5      hhh‐[hhh
6 KEL:     [((snort/laugh)) £N(h)o:?
7 HEA: It’s a link, it’s an island = h
8      ((continues telling))

We were therefore left with two options: (i) concede that such cases contradict our analysis, or (ii) reanalyze the collection to determine whether such cases differ systematically from others. A careful analysis of the cases in the collection in terms of composition revealed a systematic difference in turn design: The question in Extract 8.8 contains the temporal adverbial phrase this morning, which localizes the experiential/perceptual event in time. In contrast, the clear cases of the phenomenon (e.g., Extracts 8.1‐8.6) do not have temporal adverbials like this. They all exhibit what linguists call the experiential perfect aspect (Comrie, 1976), which portrays a situation as having held at least once during some time in the past—in other words, not specifically localized in time.

This tense‐aspect distinction can be grasped by comparing Did you see BBC Breakfast this morning? to Have you seen the American version of the Inbetweeners (Extract 8.1). Whereas the first question can be paraphrased as, “Did you see the particular episode of BBC Breakfast that aired this morning?”, the second communicates something like, “Have you ever, at any time in the past, watched the American version of the Inbetweeners?” The first asks about a specific point in the past; the second asks about one’s general past experience.

We therefore concluded that our collection in fact contained two types of questions, only one of which was part of our phenomenon. Across the collection, none of the questions with temporal adverbials like this morning (or recently, see below) elicit assessments. Thus we excluded Extract 8.8 and others like it from the collection on turn‐constructional grounds and modified the description of our phenomenon accordingly, shown below (changes marked with an asterisk).

Formal description of phenomenon IV

–– Questioner produces yes/no interrogative
   • in {did, have} you + perception/experience verb + object format,
     ▪▪ where object can be implicit,
     ▪▪ where verb is in experiential perfect aspect,*
   • in sequence‐initial position,*
   • making a yes/no response conditionally relevant.
–– Question‐recipient produces either
   • assessment,
   • yes + assessment, or
   • no.
–– Questioner optionally produces a subsequent assessment
   • which agrees with the question‐recipient’s assessment.

Analyze Deviant Cases and Look for Normative Evidence

In one respect, the goal of CA is to describe the normative practices that participants use when organizing social interactions—that is, what people expect to happen in social situations. Deviant cases are an especially powerful kind of evidence for demonstrating the normative organization of some phenomenon. Deviant cases feature (i) a departure from an expected pattern, and (ii) an observable orientation to it as a departure from the norm (see Maynard & Clayman, 2003, pp. 177‐182).

Say you build a collection of hundreds of question‐answer pairs. While this would provide evidence that statistically these actions co‐occur, it would not show that participants normatively expect answers to follow questions. To demonstrate normativity, you must present something like the following: (i) a question‐recipient does not provide an answer, and (ii) this gets treated as problematic (the questioner pursues a response, the question‐recipient accounts for the non‐response, etc.). This would show that the co‐occurrence of questions and answers is not merely a statistical correlation but a socially normative organization (see Heritage, 1984, pp. 245‐253).

Extract 8.9 presents possible normative evidence for our analysis. It begins with Molly asking Hannah about a mutual friend.

Extract 8.9 [11_RCE25]
1 MOL: Uhm what‐ (.) Have you seen: (.) other Jack.
2      (0.4)
3 MOL: recent[ly.
4 HAN:       [.hh No:, I think he: (1.0) uhm (0.7) he’s
5      just‐ (0.9) staying at home en:

At first blush, Molly’s question looks like our phenomenon. However, neither participant goes on to produce an assessment. The question cannot be excluded on sequence‐organizational grounds because it occurs in a sequence‐initial position, nor can it be excluded on turn‐constructional grounds—at least not initially—because the question in line 1 lacks a temporal adverbial. Note, however, that Hannah fails to respond to the question promptly, resulting in a 0.4 second gap (line 2). After this, Molly continues her turn, adding recently—a temporal adverbial that localizes the event in time. CA research has shown that a delay before a response can signal interactional trouble and that turn continuations like this can be used to address such troubles tacitly (see, e.g., Kendrick, 2015b, pp. 8‐10). Therefore a plausible analysis here is that the non‐response by Hannah was understood by Molly as an indication of trouble, and that Molly produced the turn continuation recently as a possible solution.

But what sort of trouble could Hannah have had with the question? The turn continuation itself suggests an answer: recently transforms the question from one in the experiential perfect aspect to one that asks about an event in the recent past. Thus the tense‐aspect distinction that we identified in the previous section is, in this particular case, oriented to by the participants in the course of interaction. This suggests that the distinction is a socially normative one, which participants use to produce and interpret recognizable social actions.

Produce a Formal Account of the Phenomenon

The final step in the research process is to produce a formal account of the phenomenon. The criteria that you have developed to identify the phenomenon and its boundaries are an essential part of this account, as is your analysis of the variation in the collection. The account should not only describe the nuts and bolts of the phenomenon, the linguistic forms and social actions that comprise it, but should also explain how it operates, the conditions under which variation occurs, and the sort of social‐interactional problem for which the phenomenon constitutes a solution (see Schegloff, 1996).

For the phenomenon we have explored in this chapter, we walked through the initial steps of the research process carefully as an illustration of the basic practices and principles of the method. We (i) identified a candidate phenomenon; (ii) built a collection of cases; (iii) analyzed each case individually; (iv) examined variation across the collection; (v) defined the boundaries of the phenomenon; and (vi) looked for normative evidence for our analysis. But to complete the final step—that is, to develop a full account of our phenomenon—one would need first to answer two important questions that our tentative analysis has brought to light.

The first is whether the phenomenon is or is not a specific type of pre‐sequence (Schegloff, 2007). A pre‐sequence is an adjacency pair in which the first pair‐part projects the contingent relevance of a subsequent first pair‐part. For example, a question like what’re you doing tonight? not only makes a response conditionally relevant; it also projects the production of a subsequent first pair‐part and specifies its action (e.g., as an invitation). The recipient of such a pre‐invitation can anticipate the projected action and either block its production (e.g., I’m staying in tonight) or allow it to go ahead (e.g., nothing). In the case of our phenomenon, the question is whether a first pair‐part like Oo::h have you had (.) fried green tomato:es:? in Extract 8.3 should be analyzed as a pre‐assessment. According to Schegloff’s (2007) definition of a pre‐sequence, a pre‐assessment would be a first pair‐part that projects the contingent relevance of a subsequent assessment and allows the recipient to either block its production or allow it to go forward. A no response such as that in Extract 8.3 might be analyzed as a go‐ahead, which allows for the production of the projected assessment. The difficulty that such an analysis faces is that across the collection of cases either the speaker or the recipient goes on to produce an assessment, whether the response was yeah (e.g., Extract 8.2) or no (e.g., Extract 8.3). That is, the set of response alternatives that we observe does not appear to include one that can block the progression of the sequence. Thus if our phenomenon is a pre‐assessment, the organization of the sequence that it engenders differs from that of other pre‐sequences described in the literature (see also Levinson, 1983, pp. 360‐364; Rossi, 2015). A full account of the phenomenon would therefore examine such differences in detail and describe the sequential organization of pre‐assessment sequences as observed in the collection.

The second and related question that a full account would need to address concerns the management of social epistemics in assessment sequences (Heritage & Raymond, 2005). Epistemics refers to the social management and distribution of knowledge in conversation: who knows what, who has the right to know what, and so on.
Speakers select different linguistic forms depending on their recipient’s epistemic access to an assessable object. For instance, something like That sounds interesting is hedged with the word sounds, and is found to occur when a recipient has no access, or only derived access, to an assessable. Conversely, if a recipient is known to have access, then a speaker who gives an assessment might use a tag question, as in That’s interesting, isn’t it? This displays an orientation to the type and scope of knowledge that a recipient has relative to a speaker. For our phenomenon, speakers first inquire into the recipient’s experience with some object, thus orienting to their recipient’s epistemic access to that object as a practical precondition for assessment. This connects to the matter of pre‐sequences discussed above, in that the sequence may be designed to establish the recipient’s epistemic access in advance of the assessment. We would want to integrate an analysis of sequence and epistemics for a fuller formal account of the phenomenon, looking at the ways in which the recipient’s access to an assessable affects the trajectory of the assessment‐implicative interrogative sequence across the collection.

Although important questions remain unanswered, we have nonetheless learned a great deal about our example phenomenon, the assessment‐implicative interrogative. Our results suggest that asking about another’s perceptual experience of some object or event (e.g., have you seen X), in a sequence‐initial position (e.g., as a new topic), formulating it as a question about general past experience (i.e., in the experiential perfect aspect), makes relevant or otherwise implicates an assessment of the object or event in question. Should the account developed here bear out, then we will have recovered from the fine details of talk a recurrent practice by which interactants engage in a commonplace activity: assessing things together.

Quantitative Methods in CA

CA is an inductive, data‐driven method for the discovery and description of interactional practices and organizations of practice observed in naturally occurring social interaction. As over 40 years of empirical research in CA demonstrates, quantitative and experimental methods are not necessary to produce valid accounts of the organization of social interaction. Indeed, conversation analysts have been deeply skeptical of the use of quantification, let alone experimentation, as an analytic tool (Schegloff, 1993). A principal concern is that researchers should first identify, describe, and understand a phenomenon of interest before they count and code instances of it, lest the statistical results not reflect the true nature of the phenomenon.

However, quantitative methods such as coding and counting together with standard inferential statistics have been used by conversation analysts to investigate interactional phenomena for which qualitative analyses already exist (see Stivers, 2015). Such studies can not only replicate previous results, but also refine previous empirical observations and, in some cases, challenge conventional wisdom. To cite but one example, in their seminal study of the organization of repair in conversation, Schegloff et al. (1977) observed that other‐initiations of repair (OIR; e.g., asking what? if you didn’t hear the prior turn) are systematically delayed (see, e.g., Extract 8.4, lines 2‐3). Yet nowhere in the article did the authors report the statistical distribution of cases or the precise timing of the delay. A quantitative CA study later showed that the modal gap duration before OIRs is approximately 700 ms, which was longer than the modal 300 ms observed in responses to yes/no questions in the same corpus (Kendrick, 2015b). The results thus replicated and further specified Schegloff et al.’s general observations. But the study also made an unexpected discovery: one type of OIR, other‐correction, was in fact produced without delay, contra Schegloff et al.’s claims. This shows that CA can use quantitative methods not only to reproduce and refine previous observations, but also to make new discoveries about well‐described phenomena.
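As a sketch of what such coding and counting involves, the snippet below computes a modal gap duration in the spirit of Kendrick (2015b). The durations are invented for illustration; in practice they would be measured from time‐aligned annotations (e.g., in ELAN):

    from collections import Counter

    # Hypothetical gaps (ms) between trouble-source turns and OIRs.
    oir_gaps = [640, 700, 720, 710, 1200, 300, 690, 700, 850, 700]

    def modal_bin(gaps_ms, width=100):
        """Bin the gaps and return (bin start, count) for the modal bin."""
        bins = Counter((g // width) * width for g in gaps_ms)
        return bins.most_common(1)[0]

    print(modal_bin(oir_gaps))  # -> (700, 5): the modal gap lies near 700 ms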

Advantages and Disadvantages

The dominant model for scientific research is the hypothetico‐deductive method: A researcher formulates a hypothesis that could be proved false by empirical observation and then tests it, often through experimentation. Although research can be hypothesis driven, the CA method, by contrast, begins not with a hypothesis about what participants in social interaction might do, but rather with an actual specimen of what some participants have in fact done. This initial specimen acts as the seed from which the analysis grows, inductively.

An advantage of this approach over others is its ecological validity (i.e., the extent to which the results generalize to everyday life). Given that the primary data in CA are recordings of everyday life, the only concern regarding ecological validity is the possibility that the researcher’s recording equipment could influence the interaction (see Hazel, 2015). In experimental research, however, participants may be asked to perform unfamiliar or unusual tasks that have few parallels to their everyday experience (e.g., naming a series of pictures that appear on a computer screen, maintaining prolonged eye‐contact with a stranger). For this reason, conversation analysts are generally skeptical of results from social and psychological experiments.

One disadvantage of CA, especially from the perspective of psychology, is its lack of experimental “control.” Psychological experiments aim to isolate and manipulate some independent variable to determine its effect on some dependent variable, and thereby infer causality. In controlled experiments, only the value of the independent variable should differ between conditions. In a CA study, however, the thick particulars of each case differ—the participants, their relationships, the setting, the topic, and so on. With so many “extraneous variables,” how can conversation analysts be certain of their results?

The answer is that CA methods exploit the inherent variability of naturally occurring data. Consider the collection we built for this chapter. It includes cases from face‐to‐face interactions and telephone calls, recorded in quiet rooms and outside in public, with participants engaged in other activities (e.g., playing a game) and not. With a diverse collection of cases, whatever extraneous variable one might posit as explanatory in one particular case is unlikely to hold for another, let alone for all cases in the collection. Rather than minimize variability through experimental control, the CA method exploits the variability of naturally occurring social interaction.

CA and Psycholinguistics

In many ways, CA and psycholinguistics are an odd couple. The two fields differ markedly in their approaches to data collection, the basic units of their analyses, and the emphasis they place on social versus cognitive processes. Whereas much psycholinguistic research takes the production and comprehension of single words or sentences elicited under controlled conditions as its basic unit of analysis, CA research treats the interactional exchange of utterances by two or more participants recorded in naturally occurring social situations as its basic unit. And whereas psycholinguists generally seek to reveal the cognitive processes of individuals that underlie observable behavior, conversation analysts set aside inquiries into cognition and instead aim to describe and model interactional processes that involve the coordination of multiple participants and that produce the orderliness observed in conversation and other forms of talk‐in‐interaction.

With that said, the CA literature nonetheless offers a wealth of rich descriptions of interactional phenomena whose relevance to psycholinguistic theory is hard to question. Consider turn‐taking in conversation (Sacks et al., 1974). The gaps between turns are on average 200‐300 ms (Stivers et al., 2009), yet according to psycholinguistic experiments, speakers require at least 600 ms to plan even a single word (e.g., in picture naming tasks; Indefrey & Levelt, 2004). This suggests that a next speaker begins planning his or her turn well before the current turn is complete, and that processes of language comprehension and production overlap in conversation (Levinson, 2016). This raises problems for psycholinguistic theory and speaks directly to proposals such as the claim that comprehension uses the production system for prediction (e.g., Garrod & Pickering, 2015). As the example of turn‐taking shows, CA research on the organization of conversation, using the methods described in this chapter, can inform models of production and comprehension and suggest avenues for future research.

Take Extract 8.10. After Jamie confirms that he plans to play football, Will asks when the game starts.

Extract 8.10 [RCE15a]
1 WIL: You gonna come to football tonight,
2 JAM: Yeah.
3      (0.9)
4 JAM: hhh[h
5 WIL:    [W’time is it?
6      (0.2)
7 JAM: Four o’clock.

The question‐answer sequence at lines 5‐7 presents a psycholinguistic puzzle: the duration of the question is only 295 ms and the following gap 185 ms, yet Jamie answers the question with apparent ease. Two explanations are possible: either he planned the noun phrase four o’clock in less time than the minimum of 600 ms observed in picture naming tasks (Indefrey & Levelt, 2004), or he anticipated Will’s question and prepared his response in advance. Although we find the first explanation more plausible, both raise questions for psycholinguistic research. One might question the ecological validity of the experiments that established such temporal minimums, and use CA research to create new paradigms for language production experiments. Alternatively, one could investigate the circumstances under which speakers plan answers to questions not yet asked. The temporal adverb tonight could in theory activate a representation for four o’clock, and the long gap during which Will doesn’t speak (line 3) could serve as a signal of trouble and prompt Jamie to search for its source. The rapidity of Jamie’s response could therefore be a byproduct of the sequential organization of talk. Whichever explanation one prefers, sequences such as this, in which speakers go faster than the psycholinguistic limits, are easy to find using CA methods and would surely repay psycholinguistic investigation.
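Finding such sequences at scale is mostly a matter of arithmetic over time‐aligned transcripts. A minimal sketch follows, with timestamps invented to mirror lines 5 to 7 of Extract 8.10 (real values would come from annotation software such as ELAN):

    # Each turn: (speaker, start in ms, end in ms, text). Times are invented.
    turns = [("WIL", 10_000, 10_295, "W'time is it?"),
             ("JAM", 10_480, 11_050, "Four o'clock.")]

    for (s1, _, end1, _), (s2, start2, _, _) in zip(turns, turns[1:]):
        offset = start2 - end1  # positive = gap, negative = overlap
        print(f"{s1} -> {s2}: {offset} ms")
    # -> WIL -> JAM: 185 ms, well under the 600 ms planning minimum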

Acknowledgments

We thank Gene Lerner for granting us access to, and permission to use, some of the data in this chapter. We also thank Will Schuerman and Chase Raymond for comments on an earlier draft. This research was made possible by the financial support of the Language and Cognition Department at the Max Planck Institute for Psycholinguistics and the International Max Planck Research School for the Language Sciences.

Key Terms

Action  The social action (or “speech act”) that a participant performs through the production of an utterance (e.g., greeting, asking, telling, offering, requesting).
Collection  A set of data extracts (e.g., video clips) from recorded interactions that forms the empirical foundation for an analysis.
Composition  The structure of the practices that a participant uses to perform an action.
Deviant case  A case in which a departure from a pattern is oriented to as a departure, which provides evidence that the pattern is socially normative.
Ethnomethodology  The study of the methods that people use for understanding and producing social order.
Interactional phenomenon  An observable locus of order in social interaction that serves as an object of study.
Naturally occurring interaction  A social interaction, ordinary or institutional, that was not arranged expressly for the purpose of scientific research.
Next‐turn proof procedure  A method whereby a turn is analyzed as evidence of its speaker’s understanding of the prior turn.
Position  The location of an action within a sequence of actions or the overall structure of a social occasion.

Transcription Conventions

(.)          Short, untimed pause
(1.4)        Timed pause
hh           Exhalation
.hh          Inhalation
(word)       Unclear hearing
((comment))  Transcriber’s comment
w[ord        Overlapping onset
wor]d        Overlapping offset
wor‐         Cut‐off word
>word<       Faster speech rate
<word>       Slower speech rate
↓word        Markedly lower pitch
↑word        Markedly higher pitch
word=        Latching, rush into next turn or segment
word         Prominent stress
WORd         Higher volume than surrounding talk
w(h)ord      Laughter in word
£word        Smile voice
°word°       Lower volume than surrounding talk
wo:rd        Lengthening of segment
.            Falling intonation
,            Level or slightly rising intonation
?            High rising intonation
¿            Mid rising intonation

References

Comrie, B. (1976). Aspect. Cambridge: Cambridge University Press.
Curl, T. S. (2006). Offers of assistance: Constraints on syntactic design. Journal of Pragmatics, 38, 1257–1280.
De Jaegher, H., Di Paolo, E. A., & Gallagher, S. (2010). Can social interaction constitute social cognition? Trends in Cognitive Sciences, 14, 441–447.
De Jaegher, H., Peräkylä, A., & Stevanovic, M. (2016). The co‐creation of meaningful action: Bridging enaction and interactional sociology. Philosophical Transactions of the Royal Society B, 371(1693), 20150378.
Drew, P. (2013). Turn design. In J. Sidnell & T. Stivers (Eds.), The handbook of conversation analysis (pp. 131–149). Malden: Wiley‐Blackwell.
Drew, P., Walker, T., & Ogden, R. (2013). Self‐repair and action construction. In M. Hayashi, G. Raymond, & J. Sidnell (Eds.), Conversational repair and human understanding (pp. 71–94). Cambridge: Cambridge University Press.
Fusaroli, R., Rączaszek‐Leonardi, J., & Tylén, K. (2014). Dialog as interpersonal synergy. New Ideas in Psychology, 32, 147–157.
Garfinkel, H. (1967). Studies in ethnomethodology. Englewood Cliffs, NJ: Prentice‐Hall.
Garrod, S., & Pickering, M. J. (2015). The use of content and timing to predict turn transitions. Frontiers in Psychology, 6. http://doi.org/10.3389/fpsyg.2015.00751
Goffman, E. (1967). Interaction ritual: Essays on face‐to‐face behavior. Chicago: Aldine Publishing Company.
Hazel, S. (2015). The paradox from within: Research participants doing‐being‐observed. Qualitative Research. Advance online publication.
Hepburn, A., & Bolden, G. (2013). The conversation analytic approach to transcription. In J. Sidnell & T. Stivers (Eds.), The handbook of conversation analysis (pp. 56–76). Malden: Wiley‐Blackwell.
Heritage, J. (1984). Garfinkel and ethnomethodology. Cambridge: Polity Press.
Heritage, J., & Clayman, S. E. (2010). Talk in action: Interactions, identities, and institutions. Malden: Wiley‐Blackwell.
Heritage, J., & Raymond, G. (2005). The terms of agreement: Indexing epistemic authority and subordination in talk‐in‐interaction. Social Psychology Quarterly, 68, 15–38.
Hoey, E. M. (2014). Sighing in interaction: Somatic, semiotic, and social. Research on Language and Social Interaction, 47, 175–200.

Hoey, E. M. (2015). Lapses: How people arrive at, and deal with, discontinuities in talk. Research on Language and Social Interaction, 48, 430–453.
Indefrey, P., & Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. Cognition, 92, 101–144.
Kendrick, K. H. (2015a). Other‐initiated repair in English. Open Linguistics, 1, 164–190.
Kendrick, K. H. (2015b). The intersection of turn‐taking and repair: The timing of other‐initiations of repair in conversation. Frontiers in Psychology, 6.
Kendrick, K. H., & Torreira, F. (2015). The timing and construction of preference: A quantitative study. Discourse Processes, 52, 255–289.
Levinson, S. C. (1983). Pragmatics. Cambridge: Cambridge University Press.
Levinson, S. C. (2013). Action formation and action ascription. In J. Sidnell & T. Stivers (Eds.), The handbook of conversation analysis (pp. 103–130). Malden: Wiley‐Blackwell.
Levinson, S. C. (2016). Turn‐taking in human communication: Origins and implications for language processing. Trends in Cognitive Sciences, 20, 6–14.
Maynard, D. W., & Clayman, S. E. (2003). Ethnomethodology and conversation analysis. In L. T. Reynolds & N. J. Herman‐Kinney (Eds.), Handbook of symbolic interactionism (pp. 173–202). Walnut Creek, CA: Altamira Press.
Mondada, L. (2013). The conversation analytic approach to data collection. In J. Sidnell & T. Stivers (Eds.), The handbook of conversation analysis (pp. 32–56). Malden: Wiley‐Blackwell.
Mondada, L. (2014). Conventions for multimodal transcription. Accessed March 1, 2016. https://franz.unibas.ch/fileadmin/franz/user_upload/redaktion/Mondada_conv_multimodality.pdf
Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27, 169–190.
Raymond, G. (2003). Grammar and social organization: Yes/no interrogatives and the structure of responding. American Sociological Review, 68, 939–967.
Rossi, G. (2015). Responding to pre‐requests: The organisation of hai x “do you have x” sequences in Italian. Journal of Pragmatics, 82, 5–22.
Sacks, H. (1992). Lectures on conversation, Vols. 1 & 2, edited by G. Jefferson. Oxford: Blackwell.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn‐taking for conversation. Language, 50, 696–735.
Schegloff, E. A. (1993). Reflections on quantification in the study of conversation. Research on Language and Social Interaction, 26, 99–128.
Schegloff, E. A. (1995). Discourse as an interactional achievement III: The omnirelevance of action. Research on Language and Social Interaction, 28, 185–211.
Schegloff, E. A. (1996). Confirming allusions: Toward an empirical account of action. American Journal of Sociology, 102, 161–216.
Schegloff, E. A. (1997). Practices and actions: Boundary cases of other‐initiated repair. Discourse Processes, 23, 499–545.
Schegloff, E. A. (2007). Sequence organization in interaction. Cambridge: Cambridge University Press.
Schegloff, E. A., Jefferson, G., & Sacks, H. (1977). The preference for self‐correction in the organization of repair in conversation. Language, 53, 361–382.
Schilbach, L., Timmermans, B., Reddy, V., et al. (2013). Towards a second‐person neuroscience. Behavioral and Brain Sciences, 36, 393–462.
Stivers, T. (2015). Coding social interaction: A heretical approach in conversation analysis? Research on Language and Social Interaction, 48, 1–19.
Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., et al. (2009). Universals and cultural variation in turn‐taking in conversation. Proceedings of the National Academy of Sciences, 106, 10587.

Further Reading and Resources

Readings:
Sidnell, J. (2010). Conversation analysis: An introduction. Malden: Wiley‐Blackwell.
Sidnell, J., & Stivers, T. (Eds.). (2013). The handbook of conversation analysis. Malden: Wiley‐Blackwell.

Software:
ELAN: https://tla.mpi.nl/tools/tla‐tools/elan/
CLAN: http://childes.psy.cmu.edu/clan/
Transcriber: http://transcriber.en.softonic.com/

Corpora:
CABank (English, Spanish, Mandarin, others): http://talkbank.org/cabank/
Language and Social Interaction Archive (English): http://www.sfsu.edu/~lsi/

9 Virtual Reality

Daniel Casasanto and Kyle M. Jasmin

Abstract

Immersive virtual reality (iVR) is a rapidly developing technology through which experimenters can transport participants into virtual worlds. These worlds are rendered via stereoscopic video projections, which are typically enhanced with audio systems that simulate a three‐dimensional soundscape, haptic stimulators that make virtual objects seem tangible, and sometimes even olfactory stimulators. Traditional verbal or pictorial stimuli can induce experimental participants to imagine alternate realities; iVR can allow participants to experience them sensorially. Thus, iVR provides a degree of richness and realism that is not possible in traditional laboratory experiments, while enabling researchers to maintain rigorous control over the stimuli and the experimental environment. In this chapter we outline the basic components of iVR systems, discuss some ways in which they have been used to study social cognition, and describe ways in which this technology has begun to help researchers understand social aspects of language use.

Assumptions and Rationale

Language is the original virtual reality (VR) device. In the real world, what we can experience is limited by the richness of our surroundings, the reach of our arms, and the resolution of our senses. Through language, we can transcend these limitations and create an infinite number of alternate realities. Narratives can blast us into outer space (Asimov, 1951), plunge us 20,000 leagues under the sea (Verne, 1962), or lead us along a yellow‐brick road toward an emerald‐green city, past magic poppies and flying monkeys (Baum, 1958). The worlds we create via language exist only in our imagination, and not in our senses.

Information presented in other media, via newer kinds of “VR devices,” can incrementally shift the burden of creating a virtual world from imagination to perception. Pictures in books and sound effects on the radio add unimodal (visual or auditory) details, both enhancing and constraining the imagined world. Audiovisuals on the stage, television, or in the movies supply even more perceptual details, yet the real world still exists alongside the fictitious world. One need only glance away from the screen to return to reality, and remaining inside of these virtual worlds often requires a willing suspension of disbelief.

By contrast, in fully immersive virtual reality (iVR), which we describe below, the shift from imagination to perception is nearly complete. When people enter an iVR system the real world disappears, and an alternate reality commandeers the senses. What you see is determined by stereoscopic goggles that wrap around your field of view, and what you hear is determined by a montage of speakers that model a three‐dimensional soundscape. What you feel may be shaped by floor shakers beneath your feet, or vibratory feedback devices cued by your body movements. Some iVR systems even include olfactory stimulation.

How “immersive” are iVR systems? The answer depends in part on the system, and on an individual’s propensity to feel “presence,” which is the term VR researchers use to describe one’s subjective immersion in the virtual world (Heeter, 1992). But a standard program that can run on even rudimentary iVR systems illustrates the grip iVR can have on most people’s minds. The “pit” illusion is simple. Participants stand at the mouth of a deep chasm, and are invited to walk across it on a plank of virtual wood. (Although it’s not necessary, some labs enhance the illusion by placing a real plank of wood on the ground at the participant’s feet—which lifts them about 1 inch above the floor.) The animation may not look realistic; the rocks and trees may look cartoony, and the 3D perspective may not be perfect. But still, the illusion may be inescapable. Many participants refuse to walk across the plank even though they know that there is absolutely no danger—that they are safely inside a university laboratory—and yet the mind cannot overrule the senses. There may be no need to suspend disbelief in iVR; disbelief may be impossible. (One of the authors of this chapter experienced severe vertigo the first time he crossed the plank, or rather failed to cross it.)

Aside from piquing people’s fear of heights, what is iVR good for? iVR offers a level of richness and realism that is difficult to achieve in the laboratory, while also letting researchers maintain rigorous experimental control over the stimuli and the experimental environment.
Experimenters can stimulate multiple senses simultaneously, and collect multiple streams of data in parallel (e.g., vocal responses and body movement, and also eye movement and electrophysiological data for iVR labs equipped with an eyetracker and electroencephalograph (EEG)). By immersing participants in a virtual world, iVR may elicit more naturalistic responses to emotional or social stimuli than traditional methods do.

Apparatus

The hardware supporting iVR can be divided into two types. Input hardware “captures” data from the real world, such as the position and motion of a subject’s body. Output hardware “renders” the world by presenting some combination of visual, auditory, and haptic information to the subject. In the middle, connecting the devices is a computer that processes the input and uses it to produce the output. We will take each type of device in turn.
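The division of labor just described can be pictured as a loop that runs once per display frame: read the input devices, update the virtual world, render the output. The classes below are hypothetical stubs for illustration, not any particular VR library:

    import time

    class Tracker:                    # stands in for input hardware
        def read_head_pose(self):
            return (0.0, 1.7, 0.0)    # head position from mo-cap, in meters

    class Display:                    # stands in for output hardware
        def render(self, pose):
            print(f"rendering left/right views for head at {pose}")

    def run(tracker, display, frames=3, hz=90):
        for _ in range(frames):               # a real loop runs until shutdown
            pose = tracker.read_head_pose()   # input: capture the real world
            display.render(pose)              # output: render the virtual one
            time.sleep(1.0 / hz)              # pace to the display's refresh

    run(Tracker(), Display())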

Input Devices: Motion‐Capture

Imagine you are seated in a virtual environment—a virtual classroom. You look at the person seated on your right, or perhaps, look down at your desk, where a virtual coffee mug is sitting. In doing so you of course move your head. Next, you pick up the coffee mug, and your virtual hand moves forward into your field of view, as it would in the real world. This is accomplished through the use of input technology called “motion capture” or “mo‐cap.” Mo‐cap allows the tracking of people and objects in the real world, for updating the positions of virtual people and objects in the virtual world.

This is often done through the use of markers, small devices that attach to whatever body part or object one might wish to track. Two common types of markers—active and passive optical markers—rely on light and cameras to work. Passive markers are plastic balls with a reflective coating. They are called “passive” because they do not themselves emit light; instead, they reflect light emitted from another source, such as an infrared lamp attached to the camera. Infrared is ideal for this purpose because it is invisible to the naked eye. Multiple cameras are used to pinpoint a marker’s precise location and orientation in space. Whereas passive markers reflect light, active markers emit it. Active marker systems typically consist of LEDs worn on the body. As with passive markers, a camera detects the light and feeds this information to a computer in order to calculate the marker’s location in space.

With both types of systems, the more cameras you have, the better the results will be. This is true both because the triangulation of position can be more precise with more cameras, and also because markers only work when the camera can “see” them, that is, when they are not occluded or hidden. For example, suppose you are tracking the position of a subject’s hand, and they reach behind their head. You would need a camera positioned to the rear of the subject in order for tracking to continue accurately.

A dataglove is capable of tracking movements of individual fingers. A classic but crude example is the Power Glove created by Nintendo in the 1980s. Professional datagloves used in virtual environments are more sophisticated, and are used for both input and output. Precise sensors in each finger of the glove allow a subject’s hand shape and finger movements to be recorded. This data can be used to precisely measure hand gestures or linguistic signs and render the hand of an avatar (i.e., the character that embodies the participant in the virtual world) in real time. The glove can also serve as an output device by producing haptic feedback to simulate the sensation of holding or touching a virtual object. The dataglove does not transmit arm position information on its own, but by attaching a mo‐cap marker to the glove, it is possible to locate the arm in the virtual environment.
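For readers who want the geometry behind the marker triangulation mentioned above: each camera defines a ray from its position toward the detected blob of light, and the marker is estimated where the rays (nearly) meet. A minimal two‐camera sketch with invented coordinates follows; production systems solve a joint least‐squares problem over many cameras:

    import numpy as np

    def triangulate(p1, d1, p2, d2):
        """Midpoint of closest approach between two rays p + t*d."""
        d1 = d1 / np.linalg.norm(d1)
        d2 = d2 / np.linalg.norm(d2)
        w0 = p1 - p2
        b = d1 @ d2
        d, e = d1 @ w0, d2 @ w0
        denom = 1.0 - b * b              # 0 if the rays are parallel
        t1 = (b * e - d) / denom
        t2 = (e - b * d) / denom
        return ((p1 + t1 * d1) + (p2 + t2 * d2)) / 2.0

    # Two cameras, each reporting a direction toward the same marker.
    marker = triangulate(np.array([0.0, 0.0, 2.0]), np.array([1.0, 1.0, -1.0]),
                         np.array([4.0, 0.0, 2.0]), np.array([-1.0, 1.0, -1.0]))
    print(marker)  # -> approximately [2., 2., 0.]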

A low‐cost alternative to a full motion capture system is the Microsoft Kinect, which provides basic motion sensing. The system works without any markers at all; instead, a single camera positioned in front of the user detects motion against the background of the room, and infers both the user’s position within the room and the position of their body. For some purposes, the Kinect has been shown to work as well as more expensive optical systems (e.g., Chang et al., 2012). You can also measure other kinds of behavior or physiology using equipment that is not specific to VR research. Microphones can be attached to the subjects to record their voice for later analysis (we will give an example of this below in Section 5). Measures like eye tracking and galvanic skin response could also be incorporated.

Output Devices

Subjects are immersed in a virtual environment through output devices, which provide sensory information (visual, auditory, haptic) to the subject. Head‐mounted displays (HMDs) are a common method of presenting visual information. As the name implies, the device is worn on the head and consists of two video screens (one for each eye) attached to a helmet or visor. These screens project a first‐person stereoscopic view that helps to create a three‐dimensional effect. The field of view varies; generally, a device with a wider field of view allows more immersion and is more expensive. Some HMDs also provide head tracking through the use of accelerometers. Although HMDs have in the past been expensive, low‐cost options are emerging. Google introduced a product called “Google Cardboard” in 2014 at the astonishing retail price of USD 15. It is a sheet of cardboard containing two lenses, and can be cleverly folded into a device that mounts a smartphone in front of the user’s face (the smartphone is not included in the price). Together, the Cardboard and the smartphone make an effective HMD. The smartphone’s screen is divided in two down the middle so that two images can be presented stereoscopically, one to each eye, to create a 3D effect. The phone’s accelerometer provides head‐tracking information so that the view of the virtual environment can be updated in real time. A second low‐cost device, the Oculus Rift, was released in 2016 at a price of USD 599. Rather than something you attach to your phone, the Rift is a full‐fledged HMD. It provides a 110‐degree field of view and built‐in 3D headphones. CAVE systems (Cave Automatic Virtual Environments) render virtual worlds without the need for an HMD. The environment is instead projected onto the walls, ceiling, and floor of a room—similar to the “holodeck” from the Star Trek television series. The user wears 3D glasses that are synchronized with the projections on the sides of the CAVE and that separate the images into left and right for stereoscopy. Presenting audio (e.g., voices) to subjects can be done with headphones built in to the HMD. Alternatively, external speakers can be placed on the walls, in the corners, on the floor, or in the ceiling, immersing the subject in a 3D sound experience. With this technique, the source location of sounds can be controlled exactly, if this is required.

Moving Through the Virtual World

How does a user move through a virtual world? The answer depends on the kind of physical constraints in your real‐world laboratory, and the input and output hardware you use. If your laboratory is large enough, a subject can simply walk around the room (e.g., wearing an HMD and a backpack full of other hardware). Of course, any input and output devices the user may be wearing will need to stay connected to the computer, through either a wireless transceiver worn by the subject or through direct wired connections. Alternatively, wires can be fed straight up to a gantry system installed in the ceiling, which moves around the room with the subject, keeping the right amount of slack in the wires. The position of the user in the real‐world laboratory is tracked with motion capture (e.g., markers worn on the body), and this information is used to move the corresponding avatar in the virtual world. Depending on the size of the VR lab, and whether the subject’s movement is, itself, of interest to the researchers, it might be better to let subjects sit still and move the environment around them. This option allows the virtual world to be infinitely large, even though the physical lab space is limited. In Staum Casasanto, Jasmin, and Casasanto (2010) and Gijssels et al. (2016), our subjects moved through a virtual supermarket. However, our lab was much smaller than a supermarket—in fact, participants could only take a few steps before reaching a wall. So instead of walking through the store, the avatar sat in a virtual motorized cart and was driven through the store by a virtual agent (i.e., an autonomous character in the virtual world—a digital robot). Floor shakers rumbled when the cart’s virtual motor was operating, which provided haptic input and perturbed the subject’s vestibular system to allow for an illusion of motion. Thus, the subject did not have to move through the lab—the virtual environment moved around them.

Integrating Input and Output

Building your lab is the first step. The next is building your virtual world. Do you want your subjects indoors or outside? Do they need to walk around? Do they need to touch or manipulate objects? Will they talk to other people? The answers to these questions will affect your choices, but every virtual world needs one thing—a software system to integrate data from the input and output devices. Although multiple software packages are available, one package popular among research psychologists is Vizard VR software, from WorldViz. It is an Integrated Development Environment (IDE) that controls multiple functions related to your experiment from within the same system or framework. With this tool, you can program what happens during your experiment and visually inspect the virtual world you are developing. During an experiment, the software handles program and data flow, processing input from motion capture cameras, microphones, and other streams, and updates the subjects’ HMDs and audio headsets while they move their heads, hands, and bodies in the virtual world. Vizard is based on the Python programming language, which may be advantageous to researchers who already use Python for other aspects of their research. In Vizard, virtual objects, avatars, and agents in the virtual world are all represented by Python “objects” that are easily controlled by changing their attributes (e.g., location = x,y,z; or color = blue) or activating their actions (making an agent “walk” or “speak,” or a ball “drop”), as sketched in the example below. When all of the various objects have been created for the world, controlling them with Python is only slightly more complex than programming other experiments, such as video‐game‐based tasks. Another benefit of Python is that it is open source, with many add‐ons freely available. The objects and avatars that populate your virtual world can be purchased or sometimes obtained free from a public repository. Software packages like Vizard sometimes come with a set of stock “models” (the specifications for the 3D object’s physical shape) and “textures” (the bitmap graphics that map onto the model to give it its color and other visual attributes). Common situations, objects, and people—for example, a man and a woman dressed in suits sitting at a conference in an office—will be easy to obtain. More niche needs (e.g., a pterodactyl flying past Machu Picchu) will prove more difficult, and may require the aid of a graphic designer with experience working with 3D models.
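To give a flavor of this object‐based style, here is a minimal sketch in the spirit of Vizard's Python interface. The model file name is a placeholder we have invented for illustration (the avatar file is one of Vizard's documented stock characters), and exact call signatures vary across Vizard versions, so treat it as a sketch of the idiom rather than as the package's definitive API:

    import viz
    import vizact

    viz.go()                                   # start rendering to the display or HMD

    # Load a 3D environment model and a stock avatar.
    classroom = viz.add('classroom.osgb')      # hypothetical model file
    agent = viz.addAvatar('vcc_male.cfg')      # stock avatar shipped with Vizard

    # Control objects by changing their attributes...
    agent.setPosition([0, 0, 2])               # x, y, z location in meters
    agent.color(viz.BLUE)

    # ...or by activating their actions.
    agent.addAction(vizact.walkTo([0, 0, 5]))  # make the agent "walk" across the room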

Nature of Stimuli and Data

In VR experiments, the virtual world itself is the stimulus, and it has nearly countless parameters to vary. You will need to choose which parameters to manipulate based on the exact experimental question or questions you are testing. Below, we will highlight some ways that aspects of virtual environments have been altered experimentally in the past and show how these paradigms could be adapted for language research.

Manipulating Parameters of Virtual People

VR is effective when a person feels a strong “presence” in the virtual world, and responds to it as though it were real (Heeter, 1992). Establishing presence is what allows researchers to manipulate not just participants’ sensory experience, but also their thoughts, beliefs, and behavior. VR allows us to change people’s appearance in ways that are impossible in the real world. This can have consequences for a person’s beliefs about themselves. A classic example is the “Proteus Effect.” Yee and Bailenson (2007) altered the height of subjects’ avatars. Some subjects were given a tall avatar, others a short one. They then played a competitive bargaining game. Subjects with taller avatars played aggressively, whereas those with shorter avatars were more likely to accept unfair deals. In another study, Fox et al. (2013) gave female participants either a conservatively dressed avatar or one dressed in a revealing outfit. Participants who were assigned a sexualized avatar reported more body‐related thoughts and more “victim‐blaming” attitudes toward rape. The Proteus Effect studies show that VR can be effective in altering people’s beliefs about themselves. Could this effect be exploited for language research? If the height of a person’s avatar activates stereotypes and affects their feelings of dominance and power, perhaps it could also affect their linguistic behavior as it relates to dominance. We might predict that people with taller avatars would behave more dominantly in conversation—talking louder, interrupting more, and accommodating less to the linguistic choices of the person they’re speaking with. Conversely, a person with a shorter avatar might speak less loudly, interrupt less, and accommodate more to the language styles of their speaking partner. Changing an avatar’s height is trivially easy in VR. Using Vizard software, you can simply specify in centimeters exactly how tall you would like a person to be (see the sketch below). There are other ways that changing how a subject appears might affect their linguistic output. Groom et al. (2009) showed that changing the race of an avatar can activate stereotypes and affect racial biases. Might changing the race of a participant also activate linguistic knowledge—words or phonological patterns associated with that race? Race could be varied simply by substituting one avatar for another. Manipulating the cultural subgroup of a subject through a change of virtual clothing could produce similar effects. (An aristocrat speaks differently from a hobo.) VR could prove to be a useful tool for exploring the extent of latent knowledge of other groups’ linguistic patterns, and whether this knowledge can be activated and put into production by transiently changing a person’s identity.
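As a hedged illustration of the height manipulation, the sketch below uniformly rescales a stock avatar to a target height. The helper function, the assumed 175 cm baseline height of the stock model, and the uniform-scaling approach are our own assumptions for illustration, not a documented recipe from the Proteus Effect studies:

    import viz

    viz.go()
    avatar = viz.addAvatar('vcc_male.cfg')  # stock Vizard avatar

    def set_avatar_height(avatar, target_cm, baseline_cm=175.0):
        # Hypothetical helper: uniformly rescale an avatar so its height
        # matches a target value in centimeters (Vizard units are meters).
        factor = target_cm / baseline_cm
        avatar.setScale([factor, factor, factor])

    set_avatar_height(avatar, 190)    # tall-avatar condition
    # set_avatar_height(avatar, 160)  # short-avatar condition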

Manipulating Parameters of the Environment

Perhaps you want a drastic change in the experimental environment: You can simply substitute one background environment for another. Previous studies have used this technique for effective mood manipulations. For example, Riva et al. (2007) created two park environments that were designed to elicit specific emotions. One featured inviting sounds, lighting, and textures designed to induce calm relaxation, while the other was darkly lit and used sounds and textures designed to evoke feelings of anxiety. These environments were effective at inducing the target moods. Indeed, the more presence the subject felt, the more this mood induction worked. Conversely, being immersed in one of these emotionally charged parks also heightened feelings of presence (compared to being placed in a neutral park). Why might it be useful to study language in different emotional contexts? There is some evidence that emotions affect language processing. Van Berkum et al. (2013) showed that moods induced with film clips (Happy Feet for a positive mood or Sophie’s Choice for a negative one) affected the neural basis of pronoun reference assignment. VR could be used for more sophisticated mood inductions in the study of language processing, language production, and behavior in language interaction. VR allows greater experimental control than film clips, as the mood‐inducing virtual scenes could be modified minimally to change the moods (in contrast to the use of film clips, which could differ along many different dimensions besides emotional valence). VR mood inductions could also be useful for the creation of emotional vocal stimuli. Emotional vocal stimuli are often recorded by actors who merely pose the desired emotion, pretending to be fearful or relaxed, angry or excited. The actor is not actually experiencing the emotion they are trying to convey with their voice. This could be problematic if the portrayal is not convincing or if posed emotional vocalizations differ from real emotional vocalizations along some unknown dimensions. VR could be used to elicit genuinely emotional speech for an experiment. For the creation of fearful speech, experimenters could take advantage of the powerful “pit illusion” discussed in the introduction. People who experience a strong sense of presence in this illusion feel genuinely afraid. If they were asked to produce speech while they are experiencing the illusion, that speech should have all the characteristics of genuinely fearful speech. Manipulating the spatial environment of a subject could also be useful for exploring relationships between language and space. Take, for example, reference frames for locating things in space. Languages like the Australian Guugu Yimithirr and Mexican Tzeltal use cardinal directions (north, south, east, west) to locate things in space, for example, “the ant is south of your leg” (Majid et al., 2004; Haviland, 1993). VR could be used to manipulate the physical environment to test how people keep track of their orientation with respect to the sun, geographic features like mountains, and so on, for the purposes of encoding spatial information in language. Changing the visual background in an iVR experiment requires having more than one background and choosing which one to load for your experiment; the backgrounds can be designed in graphic‐editing and 3D‐modeling software.
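A minimal sketch of such a substitution, assuming two pre‐built park models with invented file names, might look as follows; random assignment to the calm or anxious environment then amounts to choosing which model to load:

    import random
    import viz

    viz.go()
    # Randomly assign the participant to one of two mood-induction parks
    # (model file names are hypothetical placeholders).
    condition = random.choice(['calm_park.osgb', 'anxious_park.osgb'])
    environment = viz.add(condition)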

Nature of the Data

What you decide to collect in terms of data is up to you and will depend on your experimental question. Just as you have myriad options for presenting and manipulating stimuli, the various input devices we discussed above allow much flexibility in data collection. If your experiment requires verbal responses, these will be picked up by the microphone and can be saved as WAV audio files (https://en.wikipedia.org/wiki/WAV) for linguistic or acoustic analysis. Any motion capture devices you employ will give you precise coordinates of where each marker was in space at each time point in your experiment. You can then time‐lock these movements to events in your experiment or other behavior (like vocalizations) and plot and analyze the movements.

Collecting and Analyzing Data

As discussed above, using VR lets you have multiple data streams. You will have to decide what to collect and what to analyze. If your experiment uses motion capture, send position information for each of the markers to your log file, for the entire duration of your experiment. If you are recording audio from a microphone, record and save everything in a high‐quality uncompressed format. You may also want to record a video of everything your subject saw during the experiment. This is possible, but it will require a lot of disk space, so you will need to make sure you have a large hard drive with fast disk access. Much of the data you collect can be analyzed using software you might already be familiar with. For example, if you are collecting audio recordings of subjects’ voices, these can be analyzed with Praat (Boersma & Weenink, 2011), a well‐established tool for measuring and manipulating aspects of voices. You could use Praat to, for example, measure pitch, inflection, and durational characteristics of subjects’ voices. Movement‐related information is recorded as millisecond‐level timeseries of x, y, and z coordinates for markers. You can compute quantities like velocity and acceleration in Matlab (Mathworks, Natick, MA). Alternatively, if only a simple analysis of movement is required for your experiment, such as where a subject gestured in left‐right space, you could simply export movement data for the y‐axis. This simple one‐dimensional timeseries can be loaded into, for example, ELAN software (Brugman & Russel, 2004) and plotted with respect to other data streams such as audio and video recorded during the experiment and the timing of specific events.
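For researchers who prefer Python over Matlab, the same kinematic quantities can be computed with NumPy. The sketch below assumes a log file (an invented format) whose comma‐separated columns are time in milliseconds and the x, y, z coordinates of one marker in meters; it estimates velocity by finite differences and then exports the one‐dimensional y‐axis timeseries for import into a tool such as ELAN:

    import numpy as np

    # Assumed log format: one row per sample, columns = t_ms, x, y, z.
    data = np.loadtxt('marker01.csv', delimiter=',')
    t = data[:, 0] / 1000.0                    # convert milliseconds to seconds
    xyz = data[:, 1:4]

    # Velocity (m/s) and speed estimated by finite differences.
    vel = np.diff(xyz, axis=0) / np.diff(t)[:, None]
    speed = np.linalg.norm(vel, axis=1)

    # Export the left-right (y-axis) coordinate as a simple 1-D timeseries.
    np.savetxt('marker01_y.csv', np.column_stack([t, xyz[:, 1]]),
               delimiter=',', header='t_s,y', comments='')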

Exemplary Studies

There is enormous potential for VR in language research, although there are relatively few published studies. We will highlight two examples and explain why using iVR was advantageous. If we consider language to be a low‐tech tool for creating virtual worlds, then non‐immersive VR has been used to study language since the earliest experiments in psycholinguistics. Immersive VR, however, has been used in only a handful of psycholinguistic studies to date. A study by Gijssels, Staum Casasanto, Jasmin, Hagoort, and Casasanto (2016) tested the psychological mechanisms underlying linguistic accommodation (i.e., the tendency of speakers to adjust their linguistic production to be more (or less) like their interlocutor’s; Giles, Taylor, & Bourhis, 1973). According to a leading psycholinguistic theory (Pickering & Garrod, 2004), all speech accommodation is the result of an automatic priming mechanism. According to this theory, called the Interactive Alignment Model (IAM), perceiving an utterance raises the activation level of the linguistic representations in the percept. Consequently, when it is the perceiver’s turn to speak, the heightened activation of these representations increases the likelihood that these forms will be produced. Producing forms that have been primed by an interlocutor lightens the speaker’s computational load; this is the functional motivation for accommodation, according to the IAM (Pickering & Garrod, 2004; see Chapter 6 for details about the priming methodology). Gijssels and colleagues (2016) reasoned that, if priming is the mechanism of accommodation, then accommodation should show two “signatures” of priming: dose dependence and persistence (Wiggs & Martin, 1998). For alignment to be “dose dependent” means that the more often a listener perceives a given linguistic feature in a conversation, the higher the likelihood of producing that feature becomes (Garrod & Pickering, 2004). Thus, increasing exposure to a given aspect of linguistic production should cause accommodation to increase incrementally throughout a conversation (Hartsuiker, Kolk, & Huiskamp, 1999). For alignment to be “persistent” means that alignment effects should persist beyond the local exposure context. That is, once a feature of language has been primed, its heightened activation should not immediately return to its baseline level; rather, activation should remain heightened for some measurable period of time after exposure to the priming stimulus ends. Both of these signatures of priming have been found in studies of syntactic accommodation: The more speakers were exposed to a construction (e.g., active versus passive verb phrases), the more likely they were to produce the construction themselves (e.g., Branigan, Pickering, & Cleland, 2000; Jaeger & Snider, 2008). Such syntactic alignment effects have been observed to last up to 7 days after the initial priming manipulation (e.g., Kaschak, Kutta, & Coyle, 2014), and to persist across changes in location or experimental context (Kutta & Kaschak, 2012). The IAM predicts that priming is responsible for accommodation effects “at all linguistic levels,” including continuous dimensions of language like speech rate and pitch (i.e., f0; Finlayson et al., 2012; Garrod & Pickering, 2004; Giles, Coupland, & Coupland, 1991; Staum Casasanto, Jasmin, & Casasanto, 2010).
Because these features are continuous, aligning one’s pitch or speech rate with an interlocutor’s presumably does not involve activating representations of discrete linguistic units (e.g., words, syntactic structures) that match the units used previously by an interlocutor.

It seems unlikely, therefore, that priming is the mechanism of accommodation along continuous dimensions of linguistic production like speech rate and pitch, in which case accommodation effects should not show dose dependence or persistence. To test this prediction, Gijssels and colleagues (2016) measured the pitch of participants’ speech before, during, and after their conversation with a virtual agent, in iVR. Male and female participants discussed items in a virtual supermarket with a lifelike virtual agent of their same gender (named VIRTUO or VIRTUA) at the iVR lab at the Max Planck Institute for Psycholinguistics, in Nijmegen, The Netherlands. The supermarket environment was created specifically for this experiment using pre‐made 3D models and textures that were integrated with 3ds Max software (Autodesk, Inc., San Rafael, CA). We started with an empty supermarket model, then added shelves and products to put on the shelves. The VIRTUO and VIRTUA characters were ‘stock’ models that came with Vizard software. The various items you typically find in a supermarket served as the topics of conversation. To make sure there were always new things to talk about, there needed to be new items in the immediate visible environment of the subject and the virtual conversation partner. This was accomplished by “moving” the participant through the supermarket in a virtual vehicle. Subjects sat in a chair in the real world, which became a motorized golf cart in the virtual environment. VIRTUO/A sat behind the steering wheel and “drove” the subject down the supermarket aisle. Floor shakers rumbled as the virtual engine ran, simulating the sound and feel of an engine. Although this might seem quite complicated to set up, Vizard allows experimenters to control programming flow at a very high level. Moving a virtual golf cart can be as simple as specifying the golf cart’s object ID and the coordinates it should move to (e.g., “golfcart.move([x, y, z], speed = s)”) and starting the engine (“floorshakers.start()”). The difficult part is setting up all of the hardware and software that makes this possible. In the experiment, the agent asked the participant a series of questions about each item (e.g., What is ketchup made of?). VIRTUO’s and VIRTUA’s voices were recordings of native Dutch speakers of the same gender. Crucially, the F0 of these recordings was adjusted to be 5% higher or lower than the original, and participants were randomly assigned to interact with either the high or low version of VIRTUO/A. Pitch was manipulated with Audacity software, which is freely downloadable (http://audacity.sourceforge.net). An experimenter listened to the conversation between the participant and the agent, and triggered VIRTUO/A to make an appropriate response, at the appropriate time. Results showed that, compared to a pre‐experimental sample of speech (recorded while the participant was in the virtual world, but before they met VIRTUO/A), the pitch of participants’ speech was adjusted in the predicted directions. Participants assigned to interact with the high VIRTUO/A spoke significantly higher, on average, than participants assigned to interact with the low VIRTUO/A. Moreover, the participants’ F0s tracked the agents’ F0s on a turn‐by‐turn basis.
However, the magnitude of accommodation did not increase over the course of the conversation (i.e., with more exposure to the interlocutor’s pitch), nor did it persist in the post‐experiment sample of speech that was collected immediately after the conversation with VIRTUO/A ended. Thus, although participants showed a strong speech accommodation effect, accommodation showed neither dose dependence nor persistence, suggesting that priming was not the mechanism underlying this effect (see Staum Casasanto et al., 2010, for a compatible finding in which participants accommodated their speech rate to match VIRTUO/A’s). According to the IAM, speech alignment in all of its forms (e.g., lexical, syntactic, phonological) “is automatic and only depends on simple priming mechanisms” (Pickering & Garrod, 2004, p. 188, italics added). Yet, contra the IAM, Gijssels et al.’s (2016) results suggest that priming is not the only mechanism of speech accommodation, and that it is necessary to posit different mechanisms underlying different types of accommodation (i.e., accommodation along discrete versus continuous dimensions of speech production). Why did Gijssels and colleagues use iVR to address this question? First, it would be impossible to achieve the same level of experimental control with a human confederate, who could never modulate his or her F0 to be precisely 5% higher for half of the participants and 5% lower for the other half. Beyond pitch, it would be impossible to control the myriad other physical and social aspects of the way confederates use their voices and their bodies, which could all potentially influence accommodation. All of these were held 100% constant across conditions with VIRTUO/A. Accommodation has been observed using a much simpler, non‐immersive VR device, an audio recording (e.g., Babel, 2009), which allows for control of the voice but eliminates all other physical and social aspects of the conversation (e.g., gaze). Why not simplify this experiment and use an audio recording? Although an audio recording may be useful for answering some questions about conversation, language in its “natural habitat” is multimodal (not just auditory) and situated (interlocutors share a physical environment which constitutes an important component of their common ground; Clark, 1996). Stripping away the information that is typically available to language users as they see each other and their shared environment may blind researchers to important features of linguistic behavior. Accommodation exemplifies an aspect of language that is manifestly social (e.g., Babel, 2009; Giles et al., 1973), and may therefore be affected by extralinguistic aspects of an interaction. Accordingly, in an iVR study of speech‐rate accommodation, Staum Casasanto et al. (2010) found that participants who rated themselves to be more similar to VIRTUO/A showed stronger accommodation effects. As these experiments with VIRTUO/A illustrate, immersive VR can provide a rare combination of experimental control and richness or realism that is hard to achieve with human interlocutors or with simpler VR devices. But an important question remains open: Do the conclusions of experiments on conversation in iVR generalize to conversations between two humans?
A study by Heyselaar, Hagoort, and Segaert (2015) addressed this question by testing whether using iVR to study syntactic accommodation yields similar results to studies using human speakers and listeners. They compared syntactic priming when humans were interacting with (i) other humans, (ii) human‐like virtual interlocutors, and (iii) computer‐like virtual interlocutors. Results showed that the rate at which participants produced passive vs. active syntactic constructions was affected equally by interacting with another human and by interacting with a humanlike agent. By contrast, this effect was reduced when the humans interacted with computer‐like virtual interlocutors. These findings suggest that iVR with humanlike interlocutors presents the opportunity to study linguistic behavior with extraordinary experimental control over linguistic and extralinguistic aspects of the stimuli and the testing environment, without sacrificing the ability to generalize the results to real conversation between humans.

Advantages and Disadvantages

Throughout this chapter we have emphasized that iVR allows for unprecedented levels of environmental richness and sensorimotor realism, while also enabling the experimenter to maintain strict control over myriad variables that would vary uncontrollably if human confederates were used rather than virtual agents or avatars. Here we mention some other potential advantages of iVR, as well as some disadvantages.

Expanding the Participant Pool

Networked VR systems may allow greater diversity in the subject pool (Blascovich et al., 2002; Fox et al., 2009). As HMDs like the Oculus Rift become more affordable and commonplace, and with a fast internet connection, it should be possible to test participants remotely, without the typical geographic constraints imposed by the laboratory. Participants in different locations, perhaps with vastly different cultural or linguistic backgrounds, could interact within the same virtual environment. Atypical populations would be one area of applicability. For example, people in residential care, who are unable to travel, would be able to put on an HMD and be transported anywhere, to talk to anyone, thus opening up possibilities for studying language processing and use in older people or people with mental disorders. A mobile VR lab is possible in principle, so long as motion capture needs are minimal, relying on, for example, an accelerometer in the HMD rather than external cameras to track head motion.

Emotional Realism

One of the challenges researchers face in studying emotion in the laboratory is that genuine emotions are difficult to elicit. Even strongly emotional words or pictures may fail to affect participants emotionally in the way real‐life scenarios do. By commandeering the senses and immersing participants in virtual worlds, iVR may be useful for overcoming the emotional impotence of traditional stimuli. The pit illusion described before elicits real fear and anxiety. iVR may be capable of eliciting many other emotions as well. For example, even in non‐immersive VR such as the Second Life online social environment (www.secondlife.com), interacting with other people’s avatars can cause people to fall in love for real (Meadows, 2007).

Reproducibility of Complex Environments

Much can vary between any two naturally occurring conversations: the surroundings, background noise, weather, the experimenter’s clothes and behavior, and so on. iVR allows tight control over all sensory input delivered to the subject, such that the experience is replicated exactly for each subject (Blascovich et al., 2002; Fox et al., 2009). Verbal interactions between a person and a computer‐driven agent can be structured and scripted such that the agent says exactly the same thing in each interaction, in exactly the same way, with all of the accompanying nonverbal behaviors held constant as well. In an interaction between two person‐controlled avatars, the physical layout of the environment can be set up exactly the same for each experiment. Controlling the layout of objects in the environment could be especially useful for the study of reference (Keysar et al., 2000).

Pitfalls of iVR

The realism of iVR can have its downsides. The illusion of height or of motion can be so powerful that it causes nausea in a minority of subjects. Heyselaar et al.’s (2015) study (see above) raises another important consideration in iVR research: beware of creepy agents. People are somewhat comfortable interacting with robots that look nothing like humans (picture R2‐D2, the garbage‐can–shaped robot in the Star Wars movies), and may be more comfortable interacting with anthropomorphic robots (like R2‐D2’s tall golden sidekick, C‐3PO). But when robots or digital agents become too humanlike, people typically have an aversive reaction: An anthropomorphic figure that succeeds in looking about 90% humanlike falls into the uncanny valley between the obviously artificial and the convincingly realistic (Mori, 2012). For example, human‐like prosthetic hands, which fall short of looking fully life‐like, are typically judged to be creepier than metal prostheses that are obviously not human. To ensure that their humanlike agent did not fall into the uncanny valley, Heyselaar et al. (2015) asked a group of raters to evaluate the candidate agents’ faces, and they chose one that was rated high on humanness but low on creepiness. Stumbling into the uncanny valley could produce unexpected effects for any experiment with a social component. Perhaps the greatest potential pitfall, if you are new to VR, is the investment of both time and money that can be required to create even a “simple” iVR study. Although a portable HMD can be purchased cheaply (e.g., Google Cardboard), as can a simple motion tracking system (e.g., Microsoft Kinect), the virtual interactions you have in mind may or may not be feasible with a low‐cost system. Detailed tracking of multiple body parts may require more sophisticated, multi‐component mo‐cap technologies. Even if you use stock characters as agents and avatars, creating the virtual world may require a substantial amount of programming, and populating it with 3D models a substantial amount of artistry. Researchers new to iVR should be aware of the extent of equipment and expertise that may be needed to turn the study they are imagining into a (virtual) reality. On the other hand, the catalog of tasks that can be accomplished with low‐cost hardware and pre‐packaged software is growing quickly.

Conclusions

Language researchers typically face a trade‐off between experimental control and richness or realism of the experimental stimulus. Immersive VR can provide high levels of control and realism, compared to lower‐tech methods of creating virtual worlds (e.g., words, pictures, video, and audio recordings). To date, iVR has been used in only a few psycholinguistic studies, to address questions about speech accommodation (as illustrated above) and gesture‐speech interaction (Chu & Hagoort, 2014). Yet, in other areas of psychology iVR is already being used in imaginative ways, to address a variety of questions. Since language use is inherently interactive, iVR is a natural tool for language researchers to explore—one that allows experimental participants to interact with one or more interlocutors (other avatars or virtual agents) in a panoply of physical and social environments, while assuming diverse physical and social identities. Even if iVR environments or characters look somewhat artificial (thus avoiding the uncanny valley), they can elicit real emotions and social attitudes, allowing researchers to observe language in the kinds of socio‐affective contexts in which it is typically used but rarely studied. With the advent of affordable motion capture and iVR technologies like the Microsoft Kinect, Google Cardboard, and Oculus Rift, mo‐cap and iVR are no longer the province of those few researchers with access to a full‐fledged VR laboratory. Like ERPs in the early 1980s and eye tracking in the late 1990s, iVR is now poised to become one of the psycholinguist’s go‐to methods.

Acknowledgments

We thank Laura Staum Casasanto for helpful discussions. This work was funded by a James S. McDonnell Foundation Scholar Award (#220020236) and NSF award (#1257101) to D.C.

Key Terms

Agent A virtual agent is an autonomous character in the virtual world; a digital robot, who is not an avatar (see below). Rather, an agent’s actions are controlled by a computer, and not by a human actor.
Avatar The character that embodies a human immersed in the virtual world; the digital persona of a human actor.
HMD Abbreviation for Head‐Mounted Display. A helmet containing the video screens on which an iVR participant views the virtual world.
iVR Abbreviation for Immersive Virtual Reality. The kind of virtual reality system in which percepts in the visual modality (and sometimes other sensory modalities as well) are entirely determined by the virtual environment; participants have no access to the real (visual) world, and are therefore immersed in the virtual world.
Presence A participant’s subjective sense of immersion in the virtual world.
Uncanny Valley A region of the continuum between artificial‐looking and real‐looking stimuli. People’s level of comfort interacting with robots (physical or virtual) generally increases as the robots’ appearance becomes more realistic; an exception to this trend, however, is that people often feel uncomfortable with robots or other devices that look about 90% (but not entirely) lifelike. These devices are said to fall into the uncanny valley.

References

Asimov, I. (1951). Foundation. New York: Doubleday.
Babel, M. E. (2009). Phonetic and social selectivity in speech accommodation (PhD dissertation). University of California, Berkeley.
Baum, L. F. (1958). The wizard of Oz. New York: Scholastic.
Blascovich, J., Loomis, J., Beall, A. C., Swinth, K. R., Hoyt, C. L., & Bailenson, J. N. (2002). Immersive virtual environment technology as a methodological tool for social psychology. Psychological Inquiry, 13, 103–124.
Boersma, P., & Weenink, D. (2011). Praat: Doing phonetics by computer [Computer program]. Version 5.2.46. Retrieved 10 September 2011 from http://www.praat.org
Branigan, H. P., Pickering, M. J., & Cleland, A. A. (2000). Syntactic co‐ordination in dialogue. Cognition, 75, B13–B25.
Brugman, H., & Russel, A. (2004). Annotating multimedia/multi‐modal resources with ELAN. In Proceedings of LREC 2004, Fourth International Conference on Language Resources and Evaluation.
Chang, C.‐Y., Lange, B., Zhang, M., Koenig, S., Requejo, P., Somboon, N., Sawchuk, A. A., & Rizzo, A. A. (2012). Towards pervasive physical rehabilitation using Microsoft Kinect. In 6th International Conference on Pervasive Computing Technologies for Healthcare (pp. 159–162). IEEE.
Chu, M., & Hagoort, P. (2014). Synchronization of speech and gesture: Evidence for interaction in action. Journal of Experimental Psychology: General, 143, 1726.
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Finlayson, I., Lickley, R. J., & Corley, M. (2012). Convergence of speech rate: Interactive alignment beyond representation. In Twenty‐Fifth Annual CUNY Conference on Human Sentence Processing, CUNY Graduate School and University Center, 24, New York, USA.
Fox, J., Arena, D., & Bailenson, J. N. (2009). Virtual reality: A survival guide for the social scientist. Journal of Media Psychology, 21, 95–113.
Fox, J., Bailenson, J. N., & Tricase, L. (2013). The embodiment of sexualized virtual selves: The Proteus effect and experiences of self‐objectification via avatars. Computers in Human Behavior, 29, 930–938.
Garrod, S., & Pickering, M. J. (2004). Why is conversation so easy? Trends in Cognitive Sciences, 8, 8–11.
Gijssels, T., Staum Casasanto, L., Jasmin, K., Hagoort, P., & Casasanto, D. (2016). Speech accommodation without priming: The case of pitch. Discourse Processes, 53, 233–251.
Giles, H., Coupland, N., & Coupland, J. (1991). Accommodation theory: Communication, context and consequences. In H. Giles, J. Coupland, & N. Coupland (Eds.), Contexts of accommodation (pp. 1–68). Cambridge & Paris: Cambridge University Press & Editions de la Maison des Sciences de l’Homme.
Giles, H., Taylor, D. M., & Bourhis, R. (1973). Towards a theory of interpersonal accommodation through language: Some Canadian data. Language in Society, 2, 177–192.
Groom, V., Bailenson, J. N., & Nass, C. (2009). The influence of racial embodiment on racial bias in immersive virtual environments. Social Influence, 4, 231–248.
Hartsuiker, R. J., Kolk, H. H. J., & Huiskamp, P. (1999). Priming word order in sentence production. The Quarterly Journal of Experimental Psychology: Section A, 52, 129–147.
Haviland, J. B. (1993). Anchoring, iconicity, and orientation in Guugu Yimithirr pointing gestures. Journal of Linguistic Anthropology, 3, 3–45.
Heeter, C. (1992). Being there: The subjective experience of presence. Presence: Teleoperators and Virtual Environments, 1, 262–271.
Heyselaar, E., Hagoort, P., & Segaert, K. (2015). In dialogue with an avatar, language behavior is identical to dialogue with a human partner. Behavior Research Methods, 1–15.
Jaeger, T. F., & Snider, N. (2008). Implicit learning and syntactic persistence: Surprisal and cumulativity. In D. S. McNamara & J. G. Trafton (Eds.), Proceedings of the 29th Annual Cognitive Science Society Conference (pp. 1061–1066). Austin, TX: Cognitive Science Society.
Kaschak, M. P., Kutta, T. J., & Coyle, J. M. (2014). Long and short term cumulative structural priming effects. Language and Cognitive Processes, 29, 728–743.
Keysar, B., Barr, D. J., Balin, J. A., & Brauner, J. S. (2000). Taking perspective in conversation: The role of mutual knowledge in comprehension. Psychological Science, 11, 32–38.
Kutta, T. J., & Kaschak, M. P. (2012). Changes in task‐extrinsic context do not affect the persistence of long‐term cumulative structural priming. Acta Psychologica, 141, 408–414.
Majid, A., Bowerman, M., Kita, S., Haun, D. B. M., & Levinson, S. C. (2004). Can language restructure cognition? The case for space. Trends in Cognitive Sciences, 8, 108–114.
Meadows, M. S. (2007). I, avatar: The culture and consequences of having a second life. New Riders.
Mori, M., MacDorman, K. F., & Kageki, N. (2012). The uncanny valley [From the field]. IEEE Robotics & Automation Magazine, 19, 98–100.
Pickering, M. J., & Garrod, S. (2004). Towards a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27, 169–226.
Riva, G., Mantovani, F., Capideville, C. S., Preziosa, A., Morganti, F., Villani, D., Gaggioli, A., Botella, C., & Alcañiz, M. (2007). Affective interactions using virtual reality: The link between presence and emotions. CyberPsychology & Behavior, 10, 45–56.
Staum Casasanto, L., Jasmin, K., & Casasanto, D. (2010). Virtually accommodating: Speech rate accommodation to a virtual interlocutor. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 127–132). Austin, TX: Cognitive Science Society.
Van Berkum, J. J. A., De Goede, D., Van Alphen, P. M., Mulder, E. R., & Kerstholt, J. H. (2013). How robust is the language architecture? The case of mood. Frontiers in Psychology, 4, 505.
Verne, J. (1870). 20,000 leagues under the sea (A. Bonner, Trans., 1962). New York: Bantam.
Wiggs, C. L., & Martin, A. (1998). Properties and mechanisms of perceptual priming. Current Opinion in Neurobiology, 8, 227–233.
Yee, N., & Bailenson, J. (2007). The Proteus effect: The effect of transformed self‐representation on behavior. Human Communication Research, 33, 271–290.

Further Reading and Resources

Loomis, J. M., Blascovich, J. J., & Beall, A. C. (1999). Immersive virtual environment technology as a basic research tool in psychology. Behavior Research Methods, Instruments, & Computers, 31, 557–564.
McCall, C., & Blascovich, J. (2009). How, when, and why to use digital experimental virtual environments to study social behavior. Social and Personality Psychology Compass, 3, 744–758.
Tarr, M. J., & Warren, W. H. (2002). Virtual reality in behavioral neuroscience and beyond. Nature Neuroscience, 5, 1089–1092.

10 Studying Psycholinguistics out of the Lab

Laura J. Speed, Ewelina Wnuk, and Asifa Majid

Abstract

Traditional psycholinguistic studies take place in controlled experimental labs and typically involve testing undergraduate psychology or linguistics students. Investigating psycholinguistics in this manner calls into question the external validity of findings, that is, the extent to which research findings generalize across languages and cultures, as well as to ecologically valid settings. Here we consider three ways in which psycholinguistics can be taken out of the lab. First, researchers can conduct cross-cultural fieldwork in diverse languages and cultures. Second, they can conduct online experiments or experiments in institutionalized public spaces (e.g., museums) to obtain large, diverse participant samples. And, third, researchers can perform studies in more ecologically valid settings, to increase the real-world generalizability of findings. By moving away from the traditional lab setting, psycholinguists can enrich their understanding of language use in all its rich and diverse contexts.


Introduction

Taking part in a psycholinguistic study typically involves going to a university, meeting a researcher, and completing a computer task in a quiet laboratory cubicle under the researcher’s instruction. This chapter takes psycholinguistic research out of this traditional laboratory setting, and moves it to the outside world—both real and online. Such forms of research continue to use standard psycholinguistic methods for the most part, but change the research location in order to bring a more diverse sample of observations to bear on psycholinguistic theories. The use of diverse samples is a pillar of modern science. In order for generalizations to accurately portray a population, research samples must be representative, that is, selected so as to best reflect the population’s diversity. Hence, psycholinguistics—a discipline whose major goal is to understand the mental representations and processes underlying human language use—must strive to be representative of the whole of humanity. It must not leave aside neglected populations (e.g., sign language users, bilinguals, aphasia patients, etc.), or culturally diverse groups. Given the challenges of reaching some of these populations, researchers need to venture outside the lab setting and take a more active role approaching people in their homes, schools, community centers, clinics, and so forth. For studies conducted outside of the lab, we perceive there to be two general issues, both of which address concerns about the external validity of research findings, that is, do current theories of psycholinguistics hold for the everyday language use of ordinary people conversing in the 7,000 or so diverse languages spoken today? In order to answer this question we need to know, first, whether established psycholinguistic phenomena generalize to other populations across the globe and out of the university setting. Second, we must establish whether observations made inside the lab can be replicated in ecologically valid settings outside of the lab. In service to this broader goal, there are three motivations for being out of the lab: (1) reaching neglected populations, including speakers of diverse languages, through cross‐cultural studies in the field; (2) collecting large, demographically diverse samples within specific languages, which can be achieved through online experiments (i.e., crowdsourcing) or experiments in institutionalized public spaces (e.g., museums); and (3) increasing the ecological validity of research findings by conducting pseudo‐experiments in real‐world settings. Both (1) and (2) typically employ traditional psycholinguistic experimental paradigms but reach a wider pool of participants, whereas (3) often requires further refinement of traditional methods to afford higher ecological validity. We separate (1) and (2) because they utilize different methodologies and experimental concerns, but many of the studies we discuss could fit into more than one category. Our categorization is by no means exhaustive. Research outside of the lab can be conducted for other reasons—for instance, theoretical reasons specific to a particular question. For example, researchers may seek alternative research settings in order to undertake manipulations not possible in the traditional psycholinguistic lab, such as manipulations of gravity in a space flight, which allowed Friederici and Levelt (1990) to investigate the perceptual cues used to determine spatial frames of reference in language.
But the global issues—and potential benefits—of moving out of the lab should concern all psycholinguists.

Cross‐Cultural Field Studies

Rationale

Why should we study diverse cultures and languages? It has long been recognized that the diversity of language is a window into diversity of thought. In the words of Wundt (1920)—the father of psycholinguistics—every language “represents its own characteristic organization of human thought” and as such may hide a “treasure” uniquely contributing to our understanding of how thought and language work (cf. Levelt, 2013). Research in psychology has been heavily tilted toward a largely homogeneous sample of Western undergraduate students (96% of study populations are from Western industrialized societies, which themselves constitute only 12% of the human population; Arnett, 2008). This has been described as a “narrow” database (Sears, 1986). The picture for psycholinguistics is largely similar. With the notable exception of language acquisition research (e.g., Bowerman & Brown, 2008; Slobin, 1985), most psycholinguistic research has been done with speakers of English or other European languages. Jaeger and Norcliffe (2009), for instance, found that sentence production research relies on data from only 0.6% of the world’s languages (cf. Norcliffe, Harris, & Jaeger, 2015). This is problematic because English, and other “Standard Average European” languages, do not adequately portray the world’s linguistic diversity (Dahl, 2015), and this leads researchers to disproportionately focus on patterns imposed by Eurocentric linguistic traditions (Whorf, 1944; Gil, 2001). Similarly, the sociodemographic characteristics of speakers typically participating in psycholinguistic experiments—that is, “WEIRD”: Western, Educated, Industrialized, Rich, and Democratic—make them unusual when compared to the rest of the world (Henrich, Heine, & Norenzayan, 2010; Majid & Levinson, 2010). For instance, there is a strong focus on monolinguals in psycholinguistic studies, which ignores the fact that worldwide multilingualism is rampant. In sum, an approach restricted to a largely homogeneous sample fails to recognize the world’s vast cultural and linguistic diversity (Evans & Levinson, 2009; Malt & Majid, 2013), tacitly assuming psycholinguistic universalism. In reality, differences in grammatical and semantic structure have differential consequences for the encoding and decoding of utterances (e.g., Norcliffe, Harris, & Jaeger, 2015; Levinson, 2012), and can affect general cognitive processes (e.g., Majid et al., 2004; Wolff & Holmes, 2011). We focus here on the lesser‐known languages spoken outside urban areas, but since cross‐linguistic psycholinguistics is in its infancy, even relatively well‐described languages (e.g., Tagalog) can offer novel insights (e.g., Sauppe et al., 2013).

What Does It Entail? Best Practice

Each language presents a unique set of challenges to a researcher. The requirements and procedure followed in a field study will thus vary considerably from place to place depending on a number of practical and theoretical issues related to the field site logistics, sociocultural and linguistic background of the study population, state of language documentation, research questions, and so on. There are a number of excellent guides (e.g., Bowern, 2008; Crowley, 2007; Sakel & Everett, 2012) and handbooks (e.g., Gippert, Himmelmann, & Mosel, 2006; Newman & Ratliff, 2001; Thieberger, 2011) on linguistic fieldwork, so we will only flag some key general issues, focusing specifically on psycholinguistic methods in the field. The first prerequisite for successful psycholinguistic research in the field is familiarity with the language and culture under study. What this means in practice is long‐term involvement with the community. If a language has not been previously studied, fieldwork will also require doing basic description to provide the groundwork for pursuing more advanced questions. If, on the other hand, a sufficiently good grammatical description already exists, getting to know the language will be easier. Knowing the language and culture is crucial not only because it enables you to interact with speakers and carry out experiments, but also to ensure you do not overlook important links. Since it is impossible to determine a priori how an under‐described language works, fieldworkers cannot allow themselves the luxury of being interested only in syntax or only in morphology, but need a general mastery of the “whole language” (Hyman, 2001), and an understanding of its fit within the culture. For instance, sentence formulation is affected by word order, but at the same time it might also be driven by verb morphology (Norcliffe et al., 2015), while perceptual vocabulary might be intimately tied to cultural practices (e.g., Burenhult & Majid, 2011; Wnuk & Majid, 2014). Stimuli and data collection in the field need not differ very much from lab studies, insofar as the employed method is itself suitable for the study population. Classical psycholinguistic paradigms (e.g., self‐paced reading, lexical decision) were developed with literate populations in mind, so many standard methods need adaptation for cross‐cultural usability (e.g., Wagers, Borja, & Chung, 2015). In principle, any task administered on a simple computer can be run in the field on a laptop. Needless to say, other (non‐electronic), easily transportable stimuli such as pictures, booklets, and small 3D objects can also be used in a field experiment. Transport and storage often require careful planning—as does ensuring regular access to electricity—but there are a number of tips for dealing with such practical considerations, for example, use of protective bags/boxes, lightweight solar chargers, and carrying backup equipment (e.g., Bowern, 2008). Thanks to the rapid development of technology, some specialized techniques—for example, ultrasound (Gick, 2002), eye‐trackers (Norcliffe et al., 2015), EEG systems—have become portable and can also be used in psycholinguistic field studies. In some situations, it might also be possible to create field labs—enclosed quiet spaces—to approximate lab‐testing conditions. So rather than moving the researcher out of the lab, we can now move the lab to the outside world.

Disadvantages and Pitfalls

As already mentioned, no two field sites are identical, so there is no single set of pitfalls for psycholinguistic field research. There are, however, some general issues to keep in mind. Of these, we would like to single out three we consider most important in the context of the present discussion: the practicalities of working with naive participants, small participant pools, and limited experimental control. For an extensive discussion of the general challenges of carrying out linguistic fieldwork, see Crowley (2007).

One important concern to keep in mind is the practicality of working with people who are not used to being tested. Many non‐urban communities do not have formal education, and are not socialized into being compliant responders. Things that seem unproblematic from the point of view of university students, who spend hours listening to lectures and writing exams on a daily basis (e.g., performing repetitive tasks), can be highly demanding for other people (see also Whalen & McDonough, 2015). Care also has to be taken that modern equipment and testing are not intimidating to participants. So avoid straining participants with endless questionnaires or tedious procedures. A second issue to consider is the limited common ground between the experimenter and participants, for example resulting from distinct cultural backgrounds. Sometimes, conveying the point of an experiment might be difficult, especially if it includes concepts with no direct translation equivalents in the target language. For these reasons, it is important to keep the design as clear and simple as possible: pilot the task and include a training phase. With growing knowledge of the language and community, researchers learn to anticipate participants’ reactions and potential misunderstandings, so challenges of this kind usually become easier to navigate. Another issue concerns the difficulty of recruiting large numbers of participants in the field. Understudied languages are often spoken by small communities, so the participant pool can be relatively small. A possible solution is to increase the number of stimuli, so there are more critical data‐points to feed into the analysis. Note, though, there is a trade‐off between the duration of the experiment and data quality, as people might become tired more easily or even be reluctant to participate. To maximize the chances of recruiting people, it is important to plan the field trip at the right time. It may not be a good idea to visit a farming community during harvest, for instance. Another related constraint has to do with potential societal stratification along gender or class lines. It might be socially inappropriate for fieldworkers to talk to community members of the opposite gender or of certain social classes. In these cases, it can help to recruit a local third person to accompany you, or perhaps even administer the task. Finally, it can be difficult to have full experimental control in the field. Many fieldwork locations have little or no infrastructure. There is often no available separate, enclosed space for testing. So disruptions can include background noise and inquisitive observers. You can take various precautions to avoid these—for example, find a quiet spot out of the way, politely ask not to be disturbed, and so on. Again, further familiarity with the people and local environment can help optimize testing conditions.

Exemplary Studies

An example of a psycholinguistic study employing a diverse sample is the "Cut & Break" project (Majid et al., 2007; Majid, Boster, & Bowerman, 2008). The project investigated event categorization across 28 diverse languages using a set of video clips depicting physical separation events (cutting and breaking). Speakers—interviewed in their native languages by a team of expert linguists—were asked to view the clips and provide free descriptions of each event. From the full descriptions, the verbs describing the target physical separation events were used to create a clip‐by‐clip similarity matrix for each language. Pairs of events were deemed similar (i.e., assigned a similarity score of 1) if they were ever described with the same verb; otherwise they were deemed dissimilar (i.e., assigned a score of 0).

Figure 10.1 Comparison of cut and break verbs in Chontal, Hindi, and Jalonke (adapted from Majid et al., 2007).

The stacked similarity data were then fed into a correspondence analysis to extract the main dimensions of variance. The analysis revealed that although languages vary considerably in how they categorize events (see Figure 10.1), there is a common core underlying the structure of the domain across languages. To verify the results, the authors correlated the dimensions extracted by the general solution across languages with those for each individual language. Overall, the individual languages correlated highly with the general solution, as reflected in high mean correlations and small standard deviations. Additional analyses with factor analysis and cluster analysis further confirmed a common space of event categorization across languages. Thanks to the approach involving an "etic" grid—a standardized, language‐independent stimulus set—it was possible to carry out a large‐scale comparison at a general level, while the specialized expertise of the team of fieldworkers also enabled researchers to include the "emic" perspective—that is, a language‐ and culture‐specific internal perspective (cf. the contributions in Majid et al., 2007).
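To make the scoring procedure concrete, here is a minimal sketch of how such a clip‐by‐clip similarity coding could be computed for one language. The data structure and verb labels are invented for illustration; this is not the project's actual analysis code.

```python
from itertools import combinations

# Hypothetical data for one language: the set of verbs that speakers
# ever used to describe each video clip
verbs_by_clip = {
    "slice_carrot": {"cut"},
    "chop_carrot": {"cut"},
    "snap_twig": {"break"},
}

clips = sorted(verbs_by_clip)
# A pair of clips scores 1 if any verb was ever used for both, else 0
similarity = {(a, b): int(bool(verbs_by_clip[a] & verbs_by_clip[b]))
              for a, b in combinations(clips, 2)}
print(similarity)
# {('chop_carrot', 'slice_carrot'): 1, ('chop_carrot', 'snap_twig'): 0,
#  ('slice_carrot', 'snap_twig'): 0}
```

Stacking such matrices across languages then yields the input for the correspondence analysis described above.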

Studies Conducted Online and in Museums

Rationale

If as a psycholinguist you are not ready to pack your bags and jet off to remote destinations to test the generalizability of your studies, you can still make efforts to broaden your participant sample so that it is more inclusive and representative. Online platforms and museums have both been the locus of a flurry of studies recently. Although on the surface they seem quite different, they are motivated by the same considerations, so we discuss them here together. Placing an experiment or survey online allows access to an impressively large number of participants, at all times of the day, every day of the week. Amazon

Mechanical Turk (MTurk), an online crowdsourcing site, permits the researcher to test over 100,000 participants in over 100 different countries (although the majority are based in the USA). Buhrmester, Kwang, and Gosling (2011) report that participants on MTurk are significantly more diverse than typical samples from American universities. Similarly, museums have a continuous flow of visitors almost every day, providing access to an impressively large number of people during opening hours; London's Science Museum has around 2.7 million visitors each year. Participants recruited online and in museums will represent a more diverse sample than those in typical psycholinguistic studies, and may even provide access to specialist populations, such as individuals with rare cases of synesthesia who are otherwise difficult to reach.

There may also be qualitative differences between participants recruited in universities and those recruited online and at museums. Participants from universities are likely to represent a volunteer bias: results from people putting themselves forward for experiments might not be representative of the general population. Ganguli et al. (2014) found that study volunteers tend to be younger, better educated, healthier, and have fewer cognitive impairments than participants randomly selected from the population. In addition, participants in universities typically get paid for participation, but museum visitors do not. Although studies online and in museums do not completely solve such a volunteer bias (visitors to a science museum are obviously interested in science, for example), they at least go a step toward diversifying the pool of participants. Participants in the lab may also be particularly prone to experimenter demand characteristics. Recruiting participants online, therefore, has the additional advantage of anonymity, as pointed out by Bargh and McKenna (2004): participants may be less inclined to try and figure out the "correct answer," or otherwise behave in a way they think will please the experimenter. Overall, recruiting participants online may improve the diversity and quality of the sample in a number of ways.

Along with access to larger and more diverse samples of participants, data collection can be much expedited if experimenters use these alternative locations. Recruiting and running individuals in a university setting is difficult and hampered by a number of factors, including the local population size (typically undergraduate psychology or linguistics students), university holidays, exam times, and so on. By moving out of the university setting both the researcher and the participant will be less disrupted. For example, Dufau et al. (2011) collected data from 4,157 participants in only four months using an experiment conducted on a smartphone; a comparably sized study conducted in a lab took almost three years (Balota et al., 2007).

A further benefit is that data collection is cheap. The cost per participant on Amazon Mechanical Turk begins at one cent, with an additional fee to Amazon of 20% (https://requester.mturk.com/pricing). Costs are also reduced in terms of lab space, labor, and data entry (Birnbaum, 2004). Costs for experiments in museums are also lower: participants typically volunteer for free. For them, participation is a fun and educational experience—another aspect of their museum visit.

Finally, research of this nature, particularly research conducted in museums, has additional benefits, for example, public engagement.
By conducting research in a public setting one can promote a research program, institute, or university, and simultaneously educate the public about the research process and research findings.

What Does It Entail? Best Practice

In the last 5 years or so, research conducted online has expanded dramatically. With the development of crowdsourcing services such as MTurk and Crowdflower, and of online experimental software such as WebExp, online research has become easy. Many standard psycholinguistic studies involving visual and auditory stimuli (for example, pictures, words, and sentences) are possible, and data can include ratings, written and spoken responses, and even reaction times. For example, Dufau et al. (2011) presented English words and nonwords and collected accuracy and response times for lexical decisions (i.e., "Is this a real English word?"). There are a number of standard templates available on MTurk, such as surveys and Likert scales, which can be easily adapted to suit the researcher's needs.

When building an online experiment there are a number of things to keep in mind. It is important to ensure that all variables of interest are identified and coded to allow efficient data processing and analysis; a mistake in variable labeling could lead to weeks of additional work once large volumes of data have been collected. Since the participant will be completing the task away from the experimental lab, ways of reducing fatigue and sustaining motivation also need to be considered, such as a progress bar indicating the length of the study (Keuleers et al., 2015). Similarly, removing a "time‐out" feature that ends the experiment after a period of inactivity means participants can take a break whenever they want, and hence reduces the number of dropouts (Keuleers et al., 2015). However, as with all forms of experiments, the participant must be informed about their right to withdraw from participation at any point without consequence. After the experiment is completed, response times can be examined to assess concentration on the task: participants with extremely long or short response times, or with large gaps during the experiment, were probably distracted or unmotivated, and so should be removed from the analysis.

Online studies are now branching out into mobile devices, with a number of experiment applications ("apps") emerging. Smartphones are a fundamental feature of many people's daily lives and offer a great opportunity for research, with high spatial and temporal resolution making them appropriate for experiment presentation (Dufau et al., 2011). One example is the app "SynQuiz" designed by the research consortium Language in Interaction (2015). It is quick and easy to download and use, and presents participants with a number of fun tasks to test whether an individual has grapheme‐color synesthesia (where individuals automatically and involuntarily experience color sensations in response to letters or numbers). The Language in Interaction consortium has also developed "WoordWolk," an app designed to aid aphasia patients with word finding, and "LingQuest," a game to educate players about the world's languages, and is thereby also applying and disseminating research.

Researchers have also been availing themselves of opportunities to run studies in museums (e.g., Simner et al., 2006) and at other public events such as science festivals (e.g., Verhoef, Roberts, & Dingemanse, 2015). A research study in a museum will typically involve a museum residence for a period of time (i.e., days or weeks), but it is also possible to have short data collection sessions, such as at a special event or a museum "Late night" opening.
Visitors to museums include individuals of all ages and backgrounds, so it is imperative that this wide population is kept in mind and that instructions are written in a clear and comprehensible manner. The experiment itself should be fun and educational. It is important that participants leave the museum feeling happy and that other visitors feel encouraged to participate. For the same reason, experimental tasks should not be too long or difficult. Naturally, museums can be noisy and unexpected things occur, so keep a record of any such extraneous factors to take into account during analysis.
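As a concrete illustration of the response‐time screening mentioned above, here is a minimal sketch; the cutoffs, the 10% fast‐trial criterion, and the data layout are illustrative assumptions rather than established standards.

```python
import statistics

def keep_participant(rts_ms, min_rt=200, max_rt=5000, max_gap=60000):
    """Heuristically flag inattentive participants: implausibly fast
    or slow typical responses, many too-fast trials, or long gaps."""
    median_rt = statistics.median(rts_ms)
    share_too_fast = sum(rt < min_rt for rt in rts_ms) / len(rts_ms)
    has_long_gap = any(rt > max_gap for rt in rts_ms)
    return (min_rt <= median_rt <= max_rt
            and share_too_fast < 0.10
            and not has_long_gap)

participants = {"p1": [450, 512, 630, 380], "p2": [90, 110, 95, 72000]}
kept = [p for p, rts in participants.items() if keep_participant(rts)]
print(kept)  # ['p1']
```

Whatever criteria are used, they should be decided on (and ideally preregistered) before the data are inspected.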

Disadvantages, Problems, and Pitfalls

Despite the excitement surrounding online studies and the potential for rapid data collection of large and diverse samples, there are, of course, a number of disadvantages to take into consideration. There are three main classes of problems, centering around the participants, the amount of control the experimenter has over the situation, and the types of studies that can be conducted.

First, although moving experiments online has the potential to increase the diversity of the participant pool, experimenters must be careful to understand the limitations of this type of sampling too. People with access to internet technology are part of an increasingly homogenized global culture, dominated by Western consumer values, and they are likely influenced by English too. So although participants may come from diverse nations, they may not reflect the cultural or linguistic diversity the researcher hopes to tap. Knowing the relevant demographic facts about the participants is important for interpreting any results.

Second, although researchers may carefully compose instructions, there will no doubt be room for misinterpretation and confusion. An online participant cannot ask clarification questions if something is unclear, so there is no guarantee the instructions will be followed as carefully as they would be in a lab where the researcher is on hand to ensure comprehension. At the same time the experimenter has little control over who is participating in the study. The same people can take part in a study multiple times under different usernames (although this could be avoided by allowing participation from an IP address only once). Participants who do not meet a study's requirements (e.g., being a native speaker of a particular language) can also sign up, or they can "cheat" by working on an experiment collaboratively, for example. At the same time, the dropout rate may be higher than for studies conducted in person, because there is no immediate social consequence, or simply because other events intervene for the participant.

This leads to a related issue—that is, the extent to which controlled experimental conditions are observed. In a lab, experimental cubicles are soundproofed and bare, with minimal distraction, so full attention is given to the task. Completing an experiment at home, on the other hand, lends itself to distraction: there may be music or a television playing in the background, telephone calls, children demanding attention, and so on. The researcher has no control over this. Similarly, in a museum or other public space, participants are there to enjoy themselves, so they might not adhere to experimental conditions as would a paid participant in a university. On the other hand, "real‐world variability" could be seen as an advantage because it simulates conditions closer to the way we naturally process language every day (Moroney, 2003). Interestingly, Enochson and Culbertson (2015) compared response time data collected online to an identical task in the lab, and found greater variability in the data from the lab (larger standard error). So perhaps people online are not as prone to succumbing to distractions as one might fear.

A corollary to the lack of control over the environment is a lack of control over the equipment used in online experiments. Different computers, different operating systems, and different internet servers can add variance to the timing of both experimental stimulus presentation and participant reaction times.
In psycholinguistics many robust phenomena, such as semantic priming, manifest as small but significant differences in reaction times, so any additional variance in the data could wash out effects. Enochson and Culbertson (2015), however, have replicated three classic psycholinguistic effects with small reaction time differences using MTurk: faster processing of pronouns compared to determiner phrases, processing costs for filler‐gap dependencies, and agreement attraction, in which a verb spuriously agrees with a nearby noun instead of with its grammatical subject. Moreover, Germine et al. (2012) compared the quality of data (i.e., mean performance, variance, and internal reliability) collected from online studies with typical lab experiments, and found negligible differences.

Finally, in addition to the issues above, there are limits to the types of studies that can be conducted online or in public places. Experiments requiring behaviors more complex than pushing buttons on a keyboard, or requiring stimuli that are neither visual nor auditory (e.g., odors), are not possible online. Studies taking place in museums are constrained in terms of time and difficulty, as museum visitors are primarily there to have fun and learn.

Exemplary Studies

One of the largest online studies to date was conducted by Keuleers et al. (2015). Nearly 300,000 participants took part in an adapted lexical decision test online, in which participants had to judge whether letter strings were real words or not, producing accuracy and response time data for tens of thousands of words. Data from such a large number of participants allowed the researchers to estimate more reliably the variability in language processing in the general (Dutch‐speaking) population. Additionally, it provided the opportunity to investigate effects of age, education, multilingualism, and location on vocabulary size. This study also serves as a good example of public engagement. After completing the test, participants could share their scores on social media, which, the researchers believed, led to increased participation rates and participant satisfaction. Furthermore, participants could go back to their responses given in the lexical decision task and look up word meanings in an online dictionary. The educational aspect was not one‐way either: participants had the opportunity to comment on items used in the task, and so experimenters were informed about a number of nonwords being too similar to real words.

To build such an online experiment, one can use a program like WebExp (Keller, Gunasekharan, Mayo, & Corley, 2009). WebExp utilizes a server that hosts experimental stimuli and results, and a connected client applet that runs in the browser of the participant. An experiment is written in XML, a markup language familiar to users of HTML, and requires a timeline describing the stages of the experiment (e.g., introduction, practice). Each stage further specifies individual slides and components such as text, images, and buttons, each with defined properties. Data such as button presses and timing information can be recorded and stored on the server using numbered files in a data directory. (A schematic of this structure is sketched at the end of this section.)

An excellent example highlighting the advantages of conducting studies in museums is provided by Simner and colleagues (2006). In three months, 1,190 English‐speaking visitors to London's Science Museum took part in a computerized letter/number‐to‐color matching task in order to estimate the prevalence of grapheme‐color synesthesia. The most significant finding from this research was a female‐to‐male ratio of synesthesia of 0.9:1, whereas previous studies had estimated a much higher ratio of 6:1. Collecting data from a wider pool of participants (museum visitors of many ages instead of just university students) provided evidence against the strongly held belief of a greater prevalence of synesthesia in females. The research suggested that previous estimates reflected a sampling bias in which males are much less likely than females to come forward and report their synesthetic experience.
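Returning to the WebExp structure described above: purely as a schematic of the timeline/stage/slide/component hierarchy, here is a sketch using Python dictionaries. WebExp itself defines experiments in XML, and all field names below are invented for exposition.

```python
# Schematic of the experiment structure described above: a timeline of
# stages, each with slides composed of typed components. This mirrors
# the concepts only; it is not WebExp's actual XML format.
experiment = {
    "timeline": [
        {"stage": "introduction",
         "slides": [
             {"components": [
                 {"type": "text", "content": "Welcome to the study."},
                 {"type": "button", "label": "Start"},
             ]}]},
        {"stage": "practice",
         "slides": [
             {"components": [
                 {"type": "image", "src": "practice_item.png"},
                 {"type": "button", "label": "Word"},
                 {"type": "button", "label": "Nonword"},
             ]}]},
    ],
}

# Button presses and their timing would be appended to a results log
# and written to numbered files in a data directory on the server.
results = [{"stage": "practice", "button": "Word", "rt_ms": 734}]
```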

Conducting Studies in Real‐World Settings

Rationale

Traditional studies within psycholinguistics tend to take a "narrow" view of language (Port, 2010), focusing on speech or written text while leaving out rich contextual features—such as the physical context, the discourse context, and the social context—as well as other features of communication, such as hand and body gestures and facial expressions. Since much of the psychology of language has focused on only a constrained portion of communication, this raises the question of to what extent psycholinguistic findings reflect the way language is actually used by people. Studies conducted in more "real‐world" settings—that is, situations more closely reflecting how language is used in daily life—can be a step toward addressing the problem of ecological validity. This has also been described as the "scaling problem" (Zwaan, 2014): do results from psycholinguistic studies "scale up" to the real world?

The study of natural language use has typically been side‐stepped in traditional psycholinguistics, most likely because of the difficulty involved in studying language in its fully embedded and multimodal context. Traditional psycholinguistic experiments are conducted in controlled settings with real‐world factors removed or radically simplified so that variables of interest can be carefully manipulated. They take place in soundproofed laboratory cubicles; the participant is encouraged to focus solely on the language task at hand; and the linguistic stimuli are often presented context free. Responding to decontextualized single words presented in the center of a computer screen, or reading a single sentence about an unknown agent in an unknown situation, is arguably a different matter than speaking and understanding in everyday life.

Language use in daily life is accompanied by a wealth of context. Consider chatting to your family over dinner, talking to friends as you take a stroll, or catching up with a cousin after a long separation. Speakers have common ground with their interlocutors. There are people‐centered—rather than experimenter‐driven—motivations and intentions for comprehending and producing language. There are contextual factors at play from multiple modalities.

In addition to external context—such as objects in the environment or ongoing activity—additional aspects of the communicative signal are often neglected in psycholinguistic studies and theories too. When talking, speakers use hand and body gestures, for example via iconic gestures or by using beat gestures as a prosodic cue (e.g., McNeill, 1992). Research has shown speech and gesture to be an "integrated system" (Kelly, Özyürek, & Maris, 2010): gestures congruent with speech (e.g., a cutting gesture with "chop") facilitate speech comprehension compared to gestures incongruent with speech (e.g., a twisting gesture with "chop"). With the advent of the embodied cognition paradigm (e.g., Barsalou, 1999), researchers are now also investigating how external factors in the communicative situation, such as the body and ongoing actions, affect the comprehension and production of language (for a review see Fischer & Zwaan, 2008). This highlights the potential impact of real‐world body movement on language comprehension.

What Does It Entail? Best Practice

To reduce the artificiality of experimental manipulations and increase the ecological validity of results, researchers can use real‐world situations to assess how various factors affect language processing. The concern for ecological validity is by no means new. One of the first examples of a psycholinguistic experiment conducted in a natural setting is by Clark (1979). In order to investigate responses to indirect requests, across five experiments a researcher telephoned 950 local businesses and asked simple direct and indirect questions such as "Could you tell me the time you close tonight?", and recorded the responses given. Based on the results, Clark outlined six sources of information addressees use to determine whether indirect questions should be interpreted in the literal form or not.

Today, researchers are beginning to record lengthy periods of real‐world interaction. There are now recording devices children can wear all day, so that recordings of the child's utterances, and of those around her, can be collected and automatically analyzed when connected to specialized computer software (e.g., Kimbrough Oller, 2010). Similarly, children can wear lightweight head cameras that enable researchers to see the world through a child's eyes and assess the role of real‐world features in language acquisition (Smith, Yu, Yoshida, & Fausey, 2015).

Experiments conducted in real‐world situations can be difficult and potentially problematic, so another way forward is to bring richer contextual cues into the lab. Experiments could investigate speech processing with simultaneous gestures or facial expressions, language comprehension while completing manual tasks or other forms of ongoing action (such as by using a virtual reality environment; see Chapter 9, this volume), or conversations among friends with topics relevant to the individuals.

Disadvantages, Problems, and Pitfalls

Many of the disadvantages of conducting studies outside the lab reflect the trade‐off between ecological validity and experimental control. In addition, there are specific ethical issues raised.

First, let's consider the lack of experimental control. Having a fairly context‐free setting for an experiment enables the researcher to identify the effect of an experimental manipulation with more certainty. In the real world, it is difficult to ensure that the experimental manipulation occurred under the same conditions at all times. In an external context precise measurements are more difficult, which can be problematic for certain psycholinguistic phenomena that occur on the order of milliseconds. Real‐world environments are noisy, and so the range of psycholinguistic phenomena amenable to rigorous testing in this context may be limited.

A more practical consideration concerns problems recording data with specialist equipment. Many experimental methods now popular in psycholinguistics, such as EEG, eye tracking, and fMRI, are difficult, if not impossible, to use outside of the typical laboratory purely due to the requirements of the equipment. However, recent developments have overcome some of these problems—such as mobile eye‐trackers in wireless glasses (www.smivision.com). In addition, including records of the non‐linguistic situational context can be expensive in terms of the time required to analyze and code such features (particularly if in video format), and also disruptive if video equipment needs to be installed in environments such as people's homes (Roy, 2009). However, methods to reduce such costs are being developed, for example fast and accurate speech and video transcription and annotation (Roy, 2009) and virtual reality systems (see Chapter 9, this volume).

Second, we turn to the ethics of conducting experiments in more naturalistic contexts. When conducting studies in a university, research proposals have to be carefully reviewed by an ethics committee to monitor for likely risks and to make sure that sufficient information is given to participants. By conducting an experiment outside of the lab, the researcher cannot anticipate all potential problems and risks. In addition, some studies may rely on the participant not knowing they are part of an experimental manipulation, since knowing you are in an experiment may make you behave differently. This means participants lose the opportunity to give informed consent. However, ethical guidelines set out by the American Psychological Association indicate it is acceptable to dispense with informed consent provided certain conditions are met, such as there being no risk of harm or distress to the participant, and participant confidentiality being protected (http://www.apa.org/ethics/code/). In sum, researchers must respect participants' freedom and privacy, and take care not to disrupt people's daily lives.

Since studies completed in real‐world environments can contain a large amount of variance and potential confounding factors, researchers must take careful and thorough records of events. Overall, it is probably still the case that any phenomenon will have to be investigated using multiple methodologies (i.e., in typical experimental settings and in ecologically valid settings); such data can be used to provide converging evidence for specific psycholinguistic phenomena.

Exemplary Studies

Boroditsky and Ramscar (2002) present a good example of a study conducted in an everyday situation with rich context. The researchers wished to address the effect of spatial position on the conceptualization of time, so they took advantage of real‐world situations that could serve as experimental manipulations. For example, individuals in an airport who were waiting to depart or who had just arrived were asked the question "Next Wednesday's meeting has been moved forward two days. What day is the meeting now that it has been rescheduled?" The extent to which people took an ego‐moving perspective (thinking of themselves moving through time and thus answering "Friday") or a time‐moving perspective (thinking of time moving toward them and thus answering "Monday") was affected by their real‐world spatial experience: people who had just arrived on a flight were more likely to take the ego‐moving perspective (and answer Friday) than those just about to depart.

Although more an observational study than an experiment, an impressive example of rich, ecologically valid data comes from Roy (2009). In the "Human Speechome Project," cameras were fitted in Roy's own home so that a comprehensive record of Roy's son's language acquisition in its natural context could be collected from birth to age three. This resulted in over 230,000 hours of recordings. From the recordings numerous features could be extracted using human‐machine transcription and annotation systems: words, prosodic features, and speaker identification from the audio; and person/object information, actions, and manner of actions from the video. After processing, this perceptual information can be fed into a machine learner that computationally models and predicts the language acquisition process. Initial findings from these rich data suggest the importance of the caregiver in language acquisition: for example, the first reliable utterance of a new word by the child occurred once the caregiver had reduced the complexity of utterances containing that word. There are many further possibilities for the Speechome project, for example taking into account semantic and pragmatic contexts and assessing the role of eye gaze and body movements in production. Overall the project reveals how children learn to understand the meaning of words within meaningful contexts.

Conclusions

The lab experiment remains a crucial home for psycholinguistics. But there are a number of factors which together call for greater participation of a wider selection of people, and a more contextualized notion of language. An informed choice of methods, weighing up the advantages and pitfalls specific to each of them, offers a remedy for some of the problems haunting psycholinguistic research. After all, our theories should apply to all of humanity, and to all of language use in its rich and varied guises. It's time for psycholinguists to venture out of the lab.

Acknowledgments

All authors are supported by The Netherlands Organization for Scientific Research: NWO VICI grant “Human olfaction at the intersection of language, culture and biology.” We thank Ilja Croijmans, Josje de Valk, Elisabeth Norcliffe and Sebastian Sauppe for comments on an earlier draft.

Key Terms

Crowdsourcing The process of collecting responses from large groups of people in an online community.
Ecological validity The extent to which research findings can be generalized to real‐world settings.

External validity The extent to which research findings can be generalized to other populations and situations.
Linguistic fieldwork Collection of primary language data outside of a workplace setting, typically associated with long‐term investigation of lesser‐known and under‐described languages.
Linguistic relativity The hypothesis, associated most strongly with Benjamin Lee Whorf and Edward Sapir, which proposes that language can affect the way reality is viewed by its speakers.
Standard Average European (SAE) A term used with reference to modern Indo‐European languages of Europe to highlight similarities in their linguistic features.

References

Arnett, J. J. (2008). The neglected 95%: Why American psychology needs to become less American. American Psychologist, 63, 602–614. doi: 10.1037/0003‐066X.63.7.602
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39, 445–459.
Bargh, J. A., & McKenna, K. Y. A. (2004). The internet and social life. Annual Review of Psychology, 55, 573–590. doi: 10.1146/annurev.psych.55.090902.141922
Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660. doi: 10.1017/S0140525X99532147
Birnbaum, M. H. (2004). Human research and data collection via the internet. Annual Review of Psychology, 55, 803–832. doi: 10.1146/annurev.psych.55.090902.141601
Boroditsky, L., & Ramscar, M. (2002). The roles of body and mind in abstract thought. Psychological Science, 13, 185–189. doi: 10.1111/1467‐9280.00434
Bowerman, M., & Brown, P. (Eds.). (2008). Crosslinguistic perspectives on argument structure. New York: Lawrence Erlbaum Associates.
Bowern, C. (2008). Linguistic fieldwork: A practical guide. Basingstoke: Palgrave Macmillan.
Burenhult, N., & Majid, A. (2011). Olfaction in Aslian ideology and language. The Senses & Society, 6, 19–29. doi: 10.2752/174589311X12893982233597
Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high‐quality, data? Perspectives on Psychological Science, 6, 3–5. doi: 10.1177/1745691610393980
Clark, H. (1979). Responding to indirect speech acts. Cognitive Psychology, 11(4), 430–477.
Crowley, T. (2007). Field linguistics: A beginner's guide. New York: Oxford University Press.
Dahl, Ö. (2015). How WEIRD are WALS languages? Presented at Diversity Linguistics: Retrospect and Prospect, Leipzig.
Dufau, S., Duñabeitia, J. A., Moret‐Tatay, C., McGonigal, A., Peeters, D., Alario, F.‐X., … Grainger, J. (2011). Smart phone, smart science: How the use of smartphones can revolutionize research in cognitive science. PLoS ONE, 6, e24974. doi: 10.1371/journal.pone.0024974
Enochson, K., & Culbertson, J. (2015). Collecting psycholinguistic response time data using Amazon Mechanical Turk. PLoS ONE, 10, e0116946. doi: 10.1371/journal.pone.0116946
Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–448. doi: 10.1017/S0140525X0999094X
Fischer, M. H., & Zwaan, R. A. (2008). Embodied language: A review of the role of the motor system in language comprehension. The Quarterly Journal of Experimental Psychology, 61, 825–850. doi: 10.1080/17470210701623605

Friederici, A. D., & Levelt, W. J. M. (1990). Spatial reference in weightlessness: Perceptual factors and mental representations. Perception & Psychophysics, 47, 253–266. doi: 10.3758/BF03205000
Ganguli, M., Lee, C.‐W., Hughes, T., Snitz, B. E., Jakubcak, J., Duara, R., & Chang, C.‐C. H. (2014). Who wants a free brain scan? Assessing and correcting for recruitment biases in a population‐based fMRI pilot study. Brain Imaging and Behavior, 9, 204–212. doi: 10.1007/s11682‐014‐9297‐9
Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19, 847–857. doi: 10.3758/s13423‐012‐0296‐9
Gick, B. (2002). The use of ultrasound for linguistic phonetic fieldwork. Journal of the International Phonetic Association, 32, 113–121. doi: 10.1017/S0025100302001007
Gil, D. (2001). Escaping Eurocentrism: Fieldwork as a process of unlearning. In P. Newman & M. S. Ratliff (Eds.), Linguistic fieldwork (pp. 102–132). Cambridge, UK; New York, NY: Cambridge University Press.
Gippert, J., Himmelmann, N., & Mosel, U. (Eds.). (2006). Essentials of language documentation. Berlin; New York: Mouton de Gruyter.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world. Behavioral and Brain Sciences, 33, 1–75. doi: 10.1017/S0140525X0999152X
Hyman, L. M. (2001). Fieldwork as a state of mind. In P. Newman & M. S. Ratliff (Eds.), Linguistic fieldwork (pp. 15–33). Cambridge, UK; New York, NY: Cambridge University Press.
Jaeger, T. F., & Norcliffe, E. J. (2009). The cross‐linguistic study of sentence production. Language and Linguistics Compass, 3(4), 866–887.
Keller, F., Gunasekharan, S., Mayo, N., & Corley, M. (2009). Timing accuracy of web experiments: A case study using the WebExp software package. Behavior Research Methods, 41, 1–12.
Kelly, S. D., Özyürek, A., & Maris, E. (2010). Two sides of the same coin: Speech and gesture mutually interact to enhance comprehension. Psychological Science, 21, 260–267. doi: 10.1177/0956797609357327
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. The Quarterly Journal of Experimental Psychology, 68, 1665–1692. doi: 10.1080/17470218.2015.1022560
Kimbrough Oller, D. (2010). All‐day recordings to investigate vocabulary development: A case study of a trilingual toddler. Communication Disorders Quarterly, 31, 213–222. doi: 10.1177/1525740109358628
Language in Interaction Consortium (2015). LingQuest (1.1). [Mobile application software]. Retrieved from http://itunes.apple.com
Language in Interaction Consortium (2015). SynQuiz (1.1.161). [Mobile application software]. Retrieved from http://itunes.apple.com
Language in Interaction Consortium (2015). WoordWolk (1.3). [Mobile application software]. Retrieved from http://itunes.apple.com
Levelt, W. J. M. (2013). A history of psycholinguistics: The pre‐Chomskyan era. Oxford: Oxford University Press.
Levinson, S. C. (2012). The original sin of cognitive science. Topics in Cognitive Science, 4, 396–403. doi: 10.1111/j.1756‐8765.2012.01195.x
Majid, A., Boster, J. S., & Bowerman, M. (2008). The cross‐linguistic categorization of everyday events: A study of cutting and breaking. Cognition, 109, 235–250. doi: 10.1016/j.cognition.2008.08.009
Majid, A., Bowerman, M., Kita, S., Haun, D. B. M., & Levinson, S. C. (2004). Can language restructure cognition? The case for space. Trends in Cognitive Sciences, 8, 108–114. doi: 10.1016/j.tics.2004.01.003

Majid, A., Bowerman, M., Staden, M. van, & Boster, J. S. (2007). The semantic categories of cutting and breaking events: A crosslinguistic perspective. Cognitive Linguistics, 18, 133–152. doi: 10.1515/COG.2007.005
Majid, A., & Levinson, S. C. (2010). WEIRD languages have misled us, too. Behavioral and Brain Sciences, 33, 103. doi: 10.1017/S0140525X1000018X
Malt, B. C., & Majid, A. (2013). How thought is mapped into words. Wiley Interdisciplinary Reviews: Cognitive Science, 4, 583–597. doi: 10.1002/wcs.1251
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.
Moroney, N. (2003). Unconstrained web‐based color naming experiment. In Electronic Imaging 2003 (pp. 36–46). International Society for Optics and Photonics. doi: 10.1117/12.472013
Newman, P., & Ratliff, M. S. (Eds.). (2001). Linguistic fieldwork. Cambridge: Cambridge University Press.
Norcliffe, E., Harris, A. C., & Jaeger, T. F. (2015). Cross‐linguistic psycholinguistics and its critical role in theory development: Early beginnings and recent advances. Language, Cognition and Neuroscience, 30, 1009–1032. doi: 10.1080/23273798.2015.1080373
Norcliffe, E., Konopka, A. E., Brown, P., & Levinson, S. C. (2015). Word order affects the time course of sentence formulation in Tzeltal. Language, Cognition and Neuroscience, 30, 1187–1208. doi: 10.1080/23273798.2015.1006238
Port, R. F. (2010). Language as a social institution: Why phonemes and words do not live in the brain. Ecological Psychology, 22, 304–326. doi: 10.1080/10407413.2010.517122
Roy, D. (2009). New horizons in the study of child language acquisition. Proceedings of Interspeech 2009, 13–20.
Sakel, J., & Everett, D. L. (2012). Linguistic fieldwork: A student guide. Cambridge; New York: Cambridge University Press.
Sauppe, S., Norcliffe, E., Konopka, A. E., Van Valin, R. D., & Levinson, S. C. (2013). Dependencies first: Eye tracking evidence from sentence production in Tagalog. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th Annual Meeting of the Cognitive Science Society (CogSci 2013) (pp. 1265–1270). Austin, TX: Cognitive Science Society.
Sears, D. O. (1986). College sophomores in the laboratory: Influences of a narrow data base on social psychology's view of human nature. Journal of Personality and Social Psychology, 51, 515. doi: 10.1037/0022‐3514.51.3.515
Simner, J., Mulvenna, C., Sagiv, N., Tsakanikos, E., Witherby, S. A., Fraser, C., Scott, K., & Ward, J. (2006). Synaesthesia: The prevalence of atypical cross‐modal experiences. Perception, 35, 1024. doi: 10.1068/p5469
Slobin, D. I. (Ed.). (1985). The crosslinguistic study of language acquisition. Hillsdale, NJ: L. Erlbaum Associates.
Smith, L. B., Yu, C., Yoshida, H., & Fausey, C. M. (2015). Contributions of head‐mounted cameras to studying the visual environments of infants and young children. Journal of Cognition and Development, 16, 407–419. doi: 10.1080/15248372.2014.933430
Thieberger, N. (Ed.). (2011). The Oxford handbook of linguistic fieldwork. Oxford: Oxford University Press.
Verhoef, T., Roberts, S. G., & Dingemanse, M. (2015). Emergence of systematic iconicity: Transmission, interaction and analogy. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings, & P. P. Maglio (Eds.), The 37th Annual Meeting of the Cognitive Science Society (CogSci 2015). Cognitive Science Society.
Wagers, M., Borja, M. F., & Chung, S. (2015). The real‐time comprehension of WH‐dependencies in a WH‐agreement language. Language, 91, 109–144.
Whalen, D. H., & McDonough, J. (2015). Taking the laboratory into the field. Annual Review of Linguistics, 1, 395–415. doi: 10.1146/annurev‐linguist‐030514‐124915

Whorf, B. L. (1944). The relation of habitual thought and behavior to language. ETC: A Review of General Semantics, 197–215.
Wnuk, E., & Majid, A. (2014). Revisiting the limits of language: The odor lexicon of Maniq. Cognition, 131, 125–138. doi: 10.1016/j.cognition.2013.12.008
Wolff, P., & Holmes, K. J. (2011). Linguistic relativity. Wiley Interdisciplinary Reviews: Cognitive Science, 2, 253–265. doi: 10.1002/wcs.104
Wundt, W. M. (1920). Erlebtes und Erkanntes. Stuttgart: A. Kröner.
Zwaan, R. A. (2014). Embodiment and language comprehension: Reframing the discussion. Trends in Cognitive Sciences, 18, 229–234. doi: 10.1016/j.tics.2014.02.008

Further Reading and Resources

Comprehensive reference information for the world's languages, especially the lesser known languages: http://glottolog.org/
Database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials: http://wals.info/
Stimulus material for the elicitation of semantic categories by the Language and Cognition department at the Max Planck Institute for Psycholinguistics: http://fieldmanuals.mpi.nl/
A comprehensive and practical guide to designing and conducting semantic elicitation studies: Majid, A. (2012). A guide to stimulus‐based elicitation for semantic categories. In N. Thieberger (Ed.), The Oxford handbook of linguistic fieldwork (pp. 54–71). New York: Oxford University Press.
A collection of useful databases of various linguistic measures from Ghent University, including software such as nonword generators, and data from online vocabulary tests: crr.ugent.be/programs‐data
Amazon's Mechanical Turk, an online crowdsourcing site that allows collection of data from a large number of participants, such as using questionnaires and experiments: www.mturk.com
Home of WebExp, a system for conducting experiments on the internet and storing results: http://groups.inf.ed.ac.uk/webexp/
Information on how to apply to conduct research in London's Science Museum: http://www.sciencemuseum.org.uk/about_us/new_research_folder/livescience.aspx

11 Computational Modeling

Ping Li and Xiaowei Zhao

Abstract

Computational modeling has played significant roles both for psycholinguistic theorizing and as a research tool. Computational models offer particular advantages in dealing with complex interactions between variables that are often confounded in natural language situations. This chapter provides an overview of two approaches to computational modeling in psycholinguistics: the probabilistic approach and the connectionist approach. We discuss the assumptions and rationales behind each approach, along with methodological challenges related to both. In particular, we illustrate how modeling is conducted with examples of each approach and of their applications in psycholinguistic studies, focusing on co‐occurrence‐based semantic representation and on lexical development in children and adults.

Assumptions and Rationale

Since the early days of the cognitive revolution in the 1950s, progress in computer science has been instrumental to cognitive science for understanding human linguistic behaviors (see Gardner, 1987, for a historical review). For example, advances in the development of digital computers in the von Neumann architecture (i.e., the separation of a central processing unit, or CPU, and memory) had inspired cognitive
scientists to conceive of the human mind as a digital symbol processor, and to liken human information processing to symbol processing in the input‐output computation. Following this tradition, many computational models of language processing have therefore aimed at deriving declarative and symbolic rules/algorithms to computationally analyze the syntactic structure of sentences (parsing) or to construct computationally and psychologically plausible applications in Natural Language Processing (e.g., the parsing model from Vosse & Kempen, 2000; the WEAVER model of speech production from Roelofs, 1997).

Approaches different from this more "classical" view of cognition and language have led to the development of computational modeling of language along the following two directions. First, it has become clear to researchers that statistical features of language play a vital role in many aspects of language processing and language learning. Both children and adults can detect and utilize statistical information in the ambient linguistic input, either between items in the auditory or visual language stream (Saffran, Aslin, & Newport, 1996) or between their language input and the surrounding environment (Smith & Yu, 2008). Because of such empirical discoveries, computational researchers have begun to explore computational frameworks of language based on probabilistic principles, such as Bayesian statistics and co‐occurrence statistics (see Chater & Manning, 2006; Jones, Willits, & Dennis, 2015; Perfors, Tenenbaum, Griffiths, & Xu, 2011, for reviews). A significant number of models have been developed in the last two decades along this line of research.

Second, since the 1980s the classical view of the mind as a serial symbolic computational system has been challenged by the resurgence of connectionism or Parallel Distributed Processing (PDP), also known as artificial neural networks. The study of language from the connectionist perspective has been a major focus of the early PDP models. Connectionism argues for the emergence of human cognition as the outcome of large networks of interactive processing units operating simultaneously, resembling the workings of the massive network of neurons in the human brain. Connectionism advocates that language learning and processing are parallel, distributed, and interactive in nature, just as other cognitive systems are. Hence the strict separation between specific operational principles of a modular language system and other cognitive modules, as advocated by Chomsky (1965) and Fodor (1983), is discarded in connectionist processing systems. Specifically, connectionist language models embrace the philosophy that static linguistic representations (e.g., words, concepts, syntactic structures) are emergent properties that can be dynamically acquired from the input environment (e.g., the speech data received by the learner).

In this chapter, we will focus on the probabilistic and connectionist approaches. Although these two approaches are not always clearly separable, for the sake of clarity we will introduce them separately.

Probabilistic Approach

Use of Bayesian statistics in understanding cognitive processes has become very popular in recent years (see Lake, Salakhutdinov, & Tenenbaum, 2015), and it is also becoming an important method in studying language. Bayesian methods use the simple but powerful Bayes's theorem to make inferences about a hypothesis, given prior knowledge, in probabilistic terms. Specifically, Bayes's theorem states that the posterior probability of a particular hypothesis being true, p(H|E), is the result of both the prior probability of the hypothesis, p(H), and the conditional probability (termed the likelihood) of the evidence, p(E|H); the theorem is written out in symbols at the end of this subsection. This type of hypothesis testing in light of the probabilistic relationship between the truth value of a hypothesis and the evidence turns out to work very well for a classical theme of cognitive research: the integration of bottom‐up processing (observed data, i.e., evidence) and top‐down processing (previous knowledge/background information, i.e., the prior probability of a hypothesis). This feature also makes it an excellent tool for investigating mechanisms of language acquisition and processing, particularly because language learners and users are constantly engaged in drawing inferences about the underlying linguistic structure, given certain language input data and previous linguistic knowledge. Such scenarios of language processing have led many researchers to consider human learners as "optimal Bayesian decision‐makers" (e.g., Norris, 2006).

Another important assumption of probabilistic models is that human learners have the ability to track statistical relationships both between items within the language system and between the language input and its ambient physical environment. Based on this assumption, computational models are built to simulate patterns in statistical learning for human languages. One popular method in implementing such models is to incorporate the co‐occurrence statistics in a large corpus, which can be done by calculating the frequency of a language component (e.g., a word or a phrase) co‐occurring with other similar components (e.g., words or phrases) or with different components (e.g., objects or properties). For example, one can calculate the co‐occurrence frequencies of a word with all the other words in a text to form a vector representation for that word (see the Hyperspace Analogue to Language or HAL model; Burgess & Lund, 1997), or calculate the co‐occurrence matrix of word by paragraph/document in a large corpus of text (the Latent Semantic Analysis or LSA model; Landauer & Dumais, 1997). In both the HAL and LSA models, the resulting representation of the target (i.e., the word) is a high‐dimensional vector, with each dimension denoting a linguistic entity (a word or a document). Co‐occurrence statistics form the basis of many so‐called Distributional Semantic Models (see Jones, Willits, & Dennis, 2015, for a recent review; also see Chapter 12, this volume).
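In symbols, the hypothesis‐testing logic just described is simply Bayes's theorem (a standard formulation, with p(E) acting as a normalizing constant):

```latex
p(H \mid E) \;=\; \frac{p(E \mid H)\, p(H)}{p(E)} \;\propto\; p(E \mid H)\, p(H)
```

The learner's task, on this view, is to weigh the likelihood of the observed evidence under each candidate hypothesis against that hypothesis's prior probability.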

Connectionist Approach

Modern‐day theories of connectionism highlight "brain‐style computation," suggesting that we should build connectionist networks that can process information in ways similar to that of the real brain, albeit in a simplified form. The human brain consists of a huge network of approximately 100 billion neurons and trillions of connections among the neurons. A neuron has dendrites to receive signals from other neurons and axons to send signals to other neurons. Neuronal information transmission occurs through synapses, tiny gaps with different levels of strength/effectiveness of signal transmission depending on the amount and nature of neurotransmitters. The synapse is the basic connection medium for massive numbers of neurons to "talk" (to connect) to each other, and synaptic strengths are not fixed but can dynamically change depending on the complexity of the input‐output mapping relations and the learning algorithms used in the neural network. The human brain has the ability to derive the "optimal" combination of synaptic strengths for a neural network in solving problems. This ability is the foundation of neural information processing that has inspired connectionism.

With these considerations of brain features in mind, connectionist modelers can build artificial neural networks with two fundamental components: simple processing elements (units, nodes, or artificial neurons), and connections among these processing elements. Like real neurons, a node receives input from other nodes and sends output to other nodes. The input signals are accumulated and further transformed via a mathematical function (either a linear threshold or, more often, a nonlinear function), so as to determine the activation value of the node (a minimal numerical sketch is given below). A given connectionist network can have varying numbers of nodes, many of which are connected so that activations can spread from node to node via the corresponding connections. Like real synapses, the connections can have different levels of strength (weights), which can be adjusted according to certain learning algorithms, thereby modulating the amount of activation a source node can transmit to a target node. In this way, the network can develop unique combinations of weights and activation patterns of nodes in representing different input patterns from the learning environment. Unlike traditional computer programs, which are dedicated to specific tasks and fixed a priori, the weights and activation patterns in most connectionist networks are allowed to continuously adapt during learning, resembling the dynamic changes in real synaptic connections. It is these adaptive dynamic changes that make connectionist networks interesting models of human behavior, including language.
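As a minimal sketch of the node computation just described, the following uses the logistic function as the nonlinearity; the input and weight values are arbitrary illustrations:

```python
import math

def unit_activation(inputs, weights, bias=0.0):
    """A single connectionist unit: accumulate the weighted input
    signals, then squash the net input through a nonlinear
    (logistic) function to obtain the unit's activation in (0, 1)."""
    net = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# Example: three input signals arriving over weighted connections
print(unit_activation([1.0, 0.0, 1.0], [0.5, -0.3, 0.8]))  # ~0.79
```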

Apparatus and Tools

Depending on the simulation goals and tasks, the apparatus used for computational linguistic modeling could be as simple as one personal computer equipped with any type of programming language. Although today's high‐performance computers allow researchers to significantly increase the computational speed, the most important aspect of modeling is to use the relevant algorithms to implement the basic concepts and principles appropriate for the research goals. In this section, we will survey some basic algorithms and discuss practical considerations related to their implementation.

Probabilistic Algorithms

Bayesian inference has been used in many studies based on computational probabilistic models. Many of them have focused on how infants solve the word‐to‐object/referent mapping problem (see Yu & Smith, 2012, for a review). The following example can help readers understand how such a Bayesian inference framework works for word learning. According to Xu and Tenenbaum (2007), the learner makes decisions in a search space consisting of many hypotheses about potential word‐object/referent pairs. The hypothesis that has the highest probability of being true is selected as the most likely word‐referent pair generated by the model. According to Bayes's theorem, this posterior probability, p(H|E), is proportional to the product of both the prior probability of the hypothesis, p(H), and the likelihood of the evidence, p(E|H), and it reflects the joint influence of the learner's pre‐existing knowledge (prior probability) and their evaluation of the observed evidence given the hypothesis space (likelihood). The model matched empirical data well, and the key to the model's success was a well‐defined hypothesis space (i.e., a hierarchical tree of the categories of the potential referents).

Probabilistic algorithms have also been widely used in the study of semantic representations, as mentioned earlier for the HAL and LSA models. In HAL, a variable moving window (e.g., with a size from 1 to 10 words) scans through a large corpus of text and records the word‐word co‐occurrences. The resulting Ni‐by‐Nj matrix includes the frequency counts of how often each target word (Ni) co‐occurs with other words (Nj) in the immediate sentence context (depending on the window size). In HAL, a word's meaning is thus represented by reference to all the other words in the co‐occurrence matrix, in which the total contextual history of the target word is supported by a high‐dimensional space of language use. It is such global lexical co‐occurrence information that contributes to the richness of lexical meaning. Like HAL, the LSA model is based on co‐occurrence statistics (Landauer & Dumais, 1997), but its Ni‐by‐Nj matrix includes the frequency counts of how often each target word (Ni) co‐occurs with a global context such as a paragraph or a document (Nj).

The raw vectors derived from these models consist of thousands or tens of thousands of dimensions, depending on the co‐occurrence context (represented by the large number of Nj). These vectors are usually very sparse; that is, many dimensions are zero in value. To extract the most useful information within the vectors, methods such as normalization and dimension reduction are used so that the smallest number of dimensions can maximally represent the linguistic contents of the target words, which can then be used as input to models for simulating psycholinguistic data such as word association, lexical categorization, and conceptual mapping coherence. For example, LSA uses Singular Value Decomposition (SVD), a popular mathematical algorithm, to convert the high‐dimensional word‐document matrix into a new matrix with a much lower number of dimensions (typically around 100–300).
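As a minimal sketch of these ideas, the following computes HAL‐style window counts and then an LSA‐style SVD reduction. It simplifies freely (for example, HAL additionally weights co‐occurrences by distance within the window, and LSA applies frequency weighting before the SVD); the toy corpus and parameter values are illustrative only.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=5):
    """Count, for each target word, how often each other word
    appears within the preceding window (HAL-style, unweighted)."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[index[target], index[tokens[j]]] += 1
    return counts, vocab

def reduce_dimensions(matrix, k=2):
    """LSA-style dimension reduction: keep the k largest singular
    value components, so each row becomes a k-dimensional vector."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

tokens = "the dog chased the cat the cat chased the mouse".split()
counts, vocab = cooccurrence_matrix(tokens, window=2)
vectors = reduce_dimensions(counts, k=2)  # one row vector per word
```

With a real corpus, the reduced row vectors can then be compared (e.g., by cosine similarity) to simulate word association or categorization data.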

Connectionist Algorithms

To build connectionist models, the researcher needs to select the architecture of the network and determine what learning algorithms to use to adjust the connection weights. In psycholinguistic research, a popular connectionist architecture is a network with information feeding forward through multiple layers of nodes, usually three: input, hidden, and output. The input layer receives information from input patterns (e.g., representations of the acoustic features of phonemes), the output layer provides the desired target output patterns (e.g., classifications of phonemes according to their features), and the hidden layer forms the network's internal representations as a result of the network's learning to map input to output (e.g., the phonological similarities between phonemes such as /b/ versus /p/).

Once the architecture of the network model is determined, the researcher needs to train the model using a specific learning or training algorithm. A popular algorithm called "backpropagation" (Rumelhart, Hinton, & Williams, 1986) has been widely used in psycholinguistic computational models. According to this algorithm, each time the network is presented with an input-to-output mapping, the discrepancy (or error) between the target output (determined by the researcher) and the actual output (produced by the network based on the combination of connection weights, or "weight vector," at a given trial) is calculated. This error is then propagated back through the network so that the relevant connection weights can be changed (or updated for the next trial) in proportion to the amount of error. Continuous weight updating in this way allows the network to derive a set of weight values such that, over time, it can take any pattern in the input and produce the desired pattern in the output.

Elman (1990) developed the Simple Recurrent Network (SRN) to capture semantic categories like nouns, verbs, and adjectives as language input unfolds in time. The SRN combines the classic three-layer backpropagation network with a recurrent layer of context units, which keeps a copy of the hidden-unit activations at a prior point in time (Figure 11.1). This copy is then provided along with the new input to the current stage of learning (hence "recurrent" connections). This mechanism enables connectionist networks to capture the temporal order of information effectively, since the context units serve as a dynamic memory buffer for the system. Given that language unfolds in time, the SRN provides a simple but powerful mechanism for identifying structural constraints in continuous streams of linguistic input.

Figure 11.1 The basic architecture of a Simple Recurrent Network (SRN) with a context layer, which keeps a copy of the hidden unit activations at a prior point in time. (The figure shows input units and context units feeding into hidden units, which in turn feed into output units.)

The backpropagation algorithm trains a class of neural networks that belong to the so-called "supervised learning" models. In contrast, unsupervised learning models use no explicit error signal at the output level to adjust the weights (i.e., no desired target output is provided by the researcher). A popular unsupervised learning algorithm is the self-organizing map (or SOM; Kohonen, 2001), which consists of a two-dimensional topographic map for the organization of input representations, where each node is a unit on the map that receives input via the input-to-map connections. At each training step of the SOM, an input pattern (e.g., the phonological or semantic information of a word) is randomly picked and presented to the network. The SOM algorithm then compares, for each and every unit on the map, the unit's weight vector (i.e., the combination of its incoming connection weights) with the input vector (i.e., the combination of values in the input pattern). The unit whose weight vector is most similar to the input vector receives the highest activation and is chosen as the "winner." Once a unit becomes the winner for a given input, its weight vector and those of its neighboring units are adjusted, such that they become more similar to the input and hence will respond more strongly to the same or similar inputs the next time. This process continues until all the input patterns elicit specific response units (the winners) in the map.
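The following is a minimal sketch of the SOM training loop just described, written in NumPy with toy parameters of our own choosing (map size, learning rate, neighborhood radius, and random input patterns). Full implementations, such as the SOMToolbox mentioned later in this chapter, additionally shrink the learning rate and neighborhood radius as training proceeds.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 8, 8, 10             # an 8 x 8 map of 10-dimensional units
weights = rng.random((rows, cols, dim))
inputs = rng.random((50, dim))         # 50 toy input patterns (e.g., word codes)
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

lrate, radius = 0.3, 2.0
for epoch in range(20):
    for x in rng.permutation(inputs):  # present the patterns in random order
        # Winner: the unit whose weight vector is closest to the input.
        dists = np.linalg.norm(weights - x, axis=-1)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood around the winner on the 2D map.
        d2 = np.sum((grid - np.array(winner)) ** 2, axis=-1)
        h = np.exp(-d2 / (2 * radius ** 2))
        # Move the winner and its neighbors toward the input pattern.
        weights += lrate * h[..., None] * (x - weights)
```

After training, similar inputs activate nearby units, so the map can be read as a topographic organization of the input space.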
As a result of this self-organizing process, the statistical structure implicit in the input is captured by the topographic structure of the SOM (i.e., by how the winners are organized) and can be visualized on a 2D map as meaningful clusters. Finally, although this is not an inherent property of the SOM, different maps can be linked via adaptive connections trained by the Hebbian learning rule (Hebb, 1949), a neurally inspired and biologically plausible mechanism of associative learning and memory which allows highly co-activated neurons to strengthen their mutual connections (following the Hebbian principle that "cells that fire together, wire together").

It is worth noting that recent exciting developments in the field of artificial intelligence, including Google's AlphaGo (Silver et al., 2016), have used new connectionist algorithms such as the so-called "deep learning" neural networks. These algorithms often involve many stages of computation with a large number of layers (including recurrent layers like the context units in the SRN), along with a combination of different learning rules (see Schmidhuber, 2015). Implementations of such deep learning algorithms have yet to appear in psycholinguistics.

Practical Considerations

There are a number of practical considerations that modeling researchers must take into account when they start their modeling work. The first major one is to determine the appropriate algorithm or framework for simulating the specific linguistic phenomenon the researchers are interested in. This is a difficult decision, given that the same algorithm with small variations might be used to simulate different linguistic phenomena, and the same linguistic phenomenon might be simulated by models based on quite different algorithms. For example, Bayesian inference can be used to simulate how adult readers' reaction times (RTs) during word recognition are influenced by different lexical variables of the target words (Norris, 2006), and it can also be used to model word learning (Xu & Tenenbaum, 2007) and semantic representation (Griffiths, Steyvers, & Tenenbaum, 2007). Similarly, the connectionist Interactive Activation (IA) principle has been used to simulate visual word perception (McClelland & Rumelhart, 1981), speech perception (McClelland & Elman, 1986), and lexical access and speech production (Dell, 1986). Conversely, different connectionist models can be used to explain the same "U-shaped" trajectory in the learning of the English past tense (e.g., Plunkett & Marchman, 1991; Rumelhart & McClelland, 1986). These models were based on different algorithms, but they all demonstrated that single mechanisms embodied in connectionist learning can account for the acquisition of complex grammatical structures. As another example, a number of connectionist models with different learning algorithms have been used to account for the early childhood vocabulary spurt, the sudden acceleration of word learning when a child reaches about 18-22 months (e.g., Plunkett et al., 1992; Regier, 2005; and the DevLex models of Li, Farkas, & MacWhinney, 2004, and Li, Zhao, & MacWhinney, 2007).

Although there is no simple rule for selecting algorithms for modeling, it is important that researchers clearly understand the nature of their specific psycholinguistic phenomenon and the related research questions, the goal of their simulation, and the pros and cons of the different algorithms. When these factors are properly considered, the researchers can determine the most suitable algorithm for the study. Taking connectionist models as an example, if the researcher is interested in semantic representation and organization, a SOM-based architecture might be highly appropriate given its topography-preserving feature. But if the researcher is interested in simulating the processing of temporally ordered components (e.g., syntax), a network with a recurrent algorithm such as the SRN might be a better candidate, since the temporal order information can be recorded in the context layer that combines new input with previous hidden-layer representations. Depending on the researchers' simulation goals, hybrid connectionist architectures that combine supervised and unsupervised learning methods, and models that have adjustable structures or involve dynamic unit growth, may also be appropriate.

A second major practical consideration is whether to use existing simulation tools or to build a model from scratch. Given that some psycholinguistic researchers may not be familiar with computer programming, we recommend that they start with existing tools that have already been tested and made available to the research community. Several such tools are easy to use and accessible on the Internet. For example, for probabilistic statistical modeling, Shaoul and Westbury (2010) have developed HiDEx (http://www.psych.ualberta.ca/~westburylab/projects/HDMoLS.html), a software package that allows researchers to build many variations of HAL. Much useful information about LSA can be found on its official website (http://lsa.colorado.edu/; see Dennis, 2007, for methods and steps for using the website). Zhao, Li, and Kohonen (2011) developed the Contextual Self-organizing Map Package, which can generate semantic representations based on word-word co-occurrence statistics and has the potential to integrate real-world perceptual features into its representations (http://blclab.org/contextual-self-organizing-map-package/). Mikolov, Chen, Corrado, and Dean (2013) developed word2vec, a tool that allows researchers to derive distributed semantic representations of words from large-scale text corpora (https://code.google.com/archive/p/word2vec/). An online tool for visualizing the basic working processes of word2vec can be found at https://ronxin.github.io/wevi/ (Rong, 2014). Plunkett and Elman (1997) developed the Tlearn software that many developmental psycholinguists have used; its newer version has been recoded in the MATLAB-based software OXlearn (Ruh & Westermann, 2009; http://psych.brookes.ac.uk/oxlearn/). McClelland has developed PDP modeling software accompanied by an online handbook that provides a general introduction to connectionist networks, a step-by-step user's guide, and a bibliography (http://www.stanford.edu/group/pdplab/pdphandbook/; McClelland, 2015). Emergent is another very powerful neural network simulator that covers many basic connectionist algorithms (https://grey.colorado.edu/emergent/); its website includes a comparison of several neural network simulators and links to them.
TensorFlow is a Google-supported open source software library for machine learning, including deep learning neural networks (https://www.tensorflow.org/). The SOMToolbox for MATLAB (http://research.ics.aalto.fi/software/somtoolbox/) would be a good starting point for readers interested in using the self-organizing map algorithm. On the down side, although existing tools are convenient to start with, they may not be flexible enough to fit the specific research needs and goals of a given project. At a later stage, researchers may have to develop their own software or implement new algorithms in their own programs.
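As a quick illustration of working with such off-the-shelf tools, the sketch below trains a small word2vec model with the open-source gensim library. Gensim is not one of the packages discussed in this chapter, and the parameter names follow recent gensim releases, so treat the exact call signature as an assumption to check against the installed version.

```python
from gensim.models import Word2Vec

# Toy training data: a list of tokenized sentences. Real applications train
# on corpora with millions of sentences.
sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "barked"],
]

# vector_size: dimensionality of the word vectors; window: context size;
# sg=1 selects the skip-gram training algorithm.
model = Word2Vec(sentences, vector_size=25, window=2,
                 min_count=1, sg=1, epochs=200, seed=1)

print(model.wv["dog"][:5])                # first dimensions of one word vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity of two words
```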

From Task to Implementation: Representation and Analysis

Nature of Stimulus

Modern digital computer programs work on numerical codes, which are obviously different from the natural language we use every day. It is therefore important for investigators to determine how to represent the linguistic input stimuli in their models faithfully; it has been suggested that "input representativeness" is crucial for computational modeling of language (Christiansen & Chater, 2001).

To begin with, a crude way to represent lexical entries is the so-called "localist" representation, according to which a single, unitary processing unit in the system, randomly picked by the modeler, is assigned a numerical value to represent a linguistic item (e.g., the meaning, sound, or other linguistic property of a word). This way, the activation of a processing unit can be unambiguously associated with the specific linguistic item that the unit is supposed to represent, and the strength of the activation can be taken as an indicator of how well the linguistic entity is represented. With this type of one-to-one mapping, the localist representation clearly has simplicity and efficiency, and it has brought great success in simulating language processing in computational models.

However, many early computational models based on localist representations have been criticized as "toy models" that lack linguistic and psychological reality. Critics may doubt whether results from such models can make direct contact with the statistical properties of the natural language to which the learner or language user is exposed. A different method, embraced by connectionist models in general, is therefore to represent lexical entries as distributed representations. In this view, a given lexical item is represented by multiple nodes and their weighted connections, as a distributed pattern of activation over relevant micro-features. Taking distributed semantic representations of words as an example, we can roughly classify them into two groups. One uses feature-based representation, in which empirical data are used to help generate the features describing the meanings of words (e.g., McRae et al., 2005). The other uses corpus-based representation, which derives the meanings of words from co-occurrence statistics in large-scale linguistic corpora; examples of the latter are the HAL and LSA methods we discussed before (see pp. 210–212).

The following is an example of how to generate the phonological representations of words as the stimuli for a model. Recent developments in the field of phonological representation favor approaches that code a word's pronunciation in a slot-based representation while taking the articulatory features of phonemes into consideration. In particular, the phonology of a word can be encoded in terms of a template with a fixed set of slots. Each phoneme of the word is assigned to a different slot, depending on which syllable it belongs to and the position it occupies in the syllable, such as onset, nucleus, or coda. Based on this idea of syllabic templates, researchers have introduced phonological pattern generators (PatPho; http://blclab.org/patpho-for-chinese/) for both English words (Li & MacWhinney, 2002) and Chinese words (Zhao & Li, 2009). For example, we can represent the articulatory features of 38 Mandarin Chinese phonemes with real numerical values, scaled between zero and one and chosen to adequately represent the similarities and differences among the articulatory features.
We can then use a syllabic template with five phonemic slots and a tonal slot (CVVVCT) to represent a monosyllable in Mandarin Chinese as a numerical vector. Specifically, the numerical values of the phonemes are sequentially arranged in the phonemic slots according to their order of occurrence in the syllable and according to their status as consonants (C) or vowels (V). In this way, /lan/ would be encoded as laVVn, /tai/ as taiVC, and /pai/ as paiVC. The real-value vector representations of these syllables are shown below; empty phonemic slots (C or V in the symbol codes) are replaced by zeros in the numerical vectors. The vector representations capture the overall similarity of the phonetic structure of words, as seen in the examples /tai/ and /pai/.

/laVVn/: 0.75 0.67 1.0  0.1 0.175 0.444  0 0 0  0 0 0  0.75 0.67 0.644
/taiVC/: 1.0 0.67 0.733  0.1 0.175 0.444  0.1 0.1 0.1  0 0 0  0 0 0
/paiVC/: 1.0 0.45 0.733  0.1 0.175 0.444  0.1 0.1 0.1  0 0 0  0 0 0
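A minimal sketch of this slot-filling scheme is given below. The feature triplets are copied from the example vectors above; the helper function and its consonant/vowel handling are our own simplification of PatPho, and the tonal slot is omitted, as it is in the 15-value example vectors.

```python
# Articulatory feature triplets for a few phonemes, copied from the example
# vectors above (PatPho defines such values for the full phoneme inventory).
FEATURES = {
    "l": (0.75, 0.67, 1.0),
    "t": (1.0, 0.67, 0.733),
    "p": (1.0, 0.45, 0.733),
    "n": (0.75, 0.67, 0.644),
    "a": (0.1, 0.175, 0.444),
    "i": (0.1, 0.1, 0.1),
}
VOWELS = {"a", "i"}

def encode_syllable(phonemes):
    """Map a phoneme string onto a CVVVC template (5 slots x 3 features).

    The first consonant fills the onset (C) slot, vowels fill the three
    nucleus (V) slots in order, and a later consonant fills the coda (C)
    slot. Empty slots stay at zero.
    """
    slots = [(0.0, 0.0, 0.0)] * 5       # C V V V C
    v_next = 1                          # next free vowel slot (indices 1-3)
    for pos, ph in enumerate(phonemes):
        if ph in VOWELS:
            slots[v_next] = FEATURES[ph]
            v_next += 1
        elif pos == 0:
            slots[0] = FEATURES[ph]     # onset consonant
        else:
            slots[4] = FEATURES[ph]     # coda consonant
    return [value for triplet in slots for value in triplet]

print(encode_syllable("lan"))  # reproduces the /laVVn/ vector above
print(encode_syllable("tai"))  # reproduces the /taiVC/ vector above
```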

It is important that researchers choose an appropriate method to represent linguistic features, whether phonological, lexical‐semantic, or morphosyntactic, based on careful evaluations of their simulation goals. Localist representations are simple and efficient but may not accurately represent the input, as discussed earlier. Distributed representations may be more difficult to implement, but could be a better choice if the goal is to capture the similarities among sounds or concepts (e.g., for simulating effects like similarity‐based phonological or semantic priming).

Data Analysis

Given different research purposes, the outcomes of computational psycholinguistic models can be analyzed at different levels. First, the output patterns of the model can be evaluated against different psycholinguistic variables, and the investigators can relate these output patterns to the real linguistic phenomena and analyze them using the methods common to empirical studies. For example, in a connectionist network model of language production, the input to the network could represent the concepts of words and the output could represent the phonological representations of the words. Investigators can then measure the trained network's performance by checking whether the correct phonological representations are generated when the network receives the semantic representations of the corresponding words as input. Deviations of output patterns from target patterns can be defined and analyzed as lexical production errors and compared with those of human participants (see the examples in the next section). In addition, with slightly different initial conditions (e.g., the initial weights of a neural network, or a different value of a free parameter of a probabilistic model), even the same computational model can behave differently and show individual differences. These data can be analyzed with inferential statistics, in the same way as empirical data from real participants would be evaluated.

A significant advantage of computational modeling is that not only the output, but also the internal representations of the model can easily be analyzed. Analysis of the internal representations provides researchers with insights into the underlying mechanisms of human language acquisition and processing. For example, Elman (1990) applied a hierarchical clustering analysis to the activation patterns of units in the hidden layer of the SRN, and showed the emergence of semantic categories in the internal representation of the network as a result of its learning from the input stream of sentences. Similarly, in the DevLex models (see the examples in the next section), the emergence of both semantic and phonological categories can be observed on the different layers of the self-organizing maps. Analyses that focus on the internal representations of computational models can be compared with data patterns observed in human participants using both behavioral experimental paradigms and non-invasive neuroimaging methods.
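As an illustration, an Elman-style hierarchical clustering of internal representations takes only a few lines with SciPy. The sketch below uses random stand-in data; in a real analysis, each row of the matrix would hold a word's average hidden-unit activation vector harvested from the trained network.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Stand-in data (random here): one average hidden-layer activation vector
# per word, e.g., collected from a trained SRN with 10 hidden units.
rng = np.random.default_rng(0)
words = ["dog", "cat", "mouse", "eat", "chase", "break"]
activations = rng.random((len(words), 10))

# Agglomerative clustering of the activation vectors, in the spirit of
# Elman's (1990) analysis of hidden-unit representations.
tree = linkage(activations, method="average", metric="euclidean")

# dendrogram() draws the cluster tree; with no_plot=True it only returns
# the plotting information (useful when running without a display).
info = dendrogram(tree, labels=words, no_plot=True)
print(info["ivl"])   # leaf order of the words in the cluster tree
```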

Modeling Examples

In this section we present two word learning models in some detail to show how computational modeling has been applied to important psycholinguistic issues, and specifically to demonstrate how the probabilistic approach and the connectionist approach, respectively, are implemented.

The Yu and Ballard Model: An Example of the Probabilistic Approach

The Yu and Ballard (2007) study is an example of computational probabilistic models applied to developmental psycholinguistics. The model focuses on semantic learning, and it begins by calculating co-occurrence statistics between linguistic labels (words) in spoken utterances and real-world objects (referents) in their direct extra-linguistic contexts. These "cross-situational" co-occurrence statistics differ from those used in HAL and LSA: the latter are all based on the co-occurrence of language/text components within the linguistic domain itself. The input data of the model were extracted from two video clips of caregiver-infant interactions from the CHILDES database (MacWhinney, 2000; see also Chapter 3). Specifically, Yu and Ballard focused on two components of the input: the language stream, which included the transcripts of the caregivers' speech, and the meaning stream, which included the set of objects shown in the video as the potential referents. The task of the model was to find the correct word-referent pairs based on the statistical regularities in these two input streams.

With this goal in mind, the authors argued that simple frequency counting of single word-object pairs is not the best way to find the correct referent of a word, because the spoken utterances contained too many high-frequency function words (such as you and the) that could outweigh the content words (such as cat) in the input speech stream, leading to incorrect mappings to the referents (such as the image of a cat) in the context. To solve this problem, the authors first estimated the association probabilities of all the possible word-referent pairs with an expectation-maximization (EM) algorithm. They then identified the best word-referent pairs as those with association probabilities that jointly "maximize the likelihood of the audio-visual observations in natural interaction" (Yu & Ballard, 2007, p. 2156). To illustrate, we summarize the pseudocode of the EM algorithm below (mathematical details are omitted).

Step 1. Assign initial values to all word-referent association probabilities based on their simple co-occurrence frequency counts.
Repeat:
    Step 2. For all word-referent pairs, compute the expected number of times that the word in a sentence generates the particular meaning/referent in an extra-linguistic context.
    Step 3. Re-estimate the association probabilities based on the results of Step 2 (using Eq. (3) of Yu and Ballard, 2007).
Until the association probabilities converge.
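A minimal sketch of this estimation loop is shown below, with invented toy data and without the social-cue weighting discussed next. A simple count normalization stands in for Yu and Ballard's Eq. (3), so the sketch illustrates the logic of the algorithm rather than reproducing the published model.

```python
from collections import defaultdict

# Toy cross-situational input (invented): each item pairs the words of an
# utterance with the candidate referents visible in the scene.
situations = [
    (["the", "dog"], ["DOG", "BALL"]),
    (["dog", "runs"], ["DOG"]),
    (["the", "ball"], ["BALL", "DOG"]),
    (["red", "ball"], ["BALL"]),
]

def normalize(counts):
    """Renormalize so that, for each referent r, the word probabilities sum to 1."""
    totals = defaultdict(float)
    for (w, r), c in counts.items():
        totals[r] += c
    return {(w, r): c / totals[r] for (w, r), c in counts.items()}

# Step 1: initial association probabilities from raw co-occurrence counts.
counts = defaultdict(float)
for words, refs in situations:
    for w in words:
        for r in refs:
            counts[(w, r)] += 1.0
p = normalize(counts)

# Steps 2 and 3, repeated (a fixed number of iterations stands in for the
# convergence test): expected counts (E-step), then re-estimation (M-step).
for _ in range(30):
    expected = defaultdict(float)
    for words, refs in situations:
        for w in words:
            # The referents present in this scene compete to "generate" w.
            total = sum(p[(w, r)] for r in refs)
            for r in refs:
                expected[(w, r)] += p[(w, r)] / total
    p = normalize(expected)

# After convergence, the best word for each referent is its correct label,
# even though the function word "the" co-occurs with every referent.
for r in ["DOG", "BALL"]:
    best = max((w for (w, r2) in p if r2 == r), key=lambda w: p[(w, r)])
    print(r, "->", best)   # DOG -> dog, BALL -> ball
```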

The authors demonstrated that, as the EM algorithm converged, the association probabilities of relevant word-referent pairs increased and those of irrelevant pairs decreased. Eventually, the correct referents of several words could be identified on the basis of the higher association probabilities between those words and their referents.

An important feature of the Yu and Ballard model is the incorporation of certain non-linguistic (social) contextual cues into its statistical learning (see Figure 11.2). For the language stream, Yu and Ballard analyzed the prosodic features of the speech and then used a clustering method (SVC: support vector clustering) to identify the prosodically salient words highlighted by the caregivers in each spoken utterance (the "prosodic highlights"). Compared with non-highlighted words, these prosodically salient words were assigned higher weights in the calculation of the association probabilities based on the EM algorithm mentioned above. Similarly, for the meaning stream, the objects that shared the joint attention of caregiver and child were identified in each visual scene and assigned higher weights (the "attentional highlights") in the calculation of the association probabilities of word-referent pairs. The model's statistical learning performance improved greatly when these social cues/highlights were incorporated (Yu & Ballard, 2007).

Figure 11.2 A sketch of the probabilistic model that incorporates distributional statistics from cross-situational observation and prosodic and attentional highlights from social gating. (The model estimates word-referent associations based on cross-situational co-occurrences between a language stream, with prosodic highlights on auditory words, and a meaning stream, with attentional highlights on referent items.) Figure adapted from Figure 1 of Yu & Ballard, 2007. Source: Courtesy of Chen Yu.

Yu and Ballard's study demonstrates a salient feature of computational modeling, namely that researchers can systematically manipulate the variables in their simulations. Adding or removing a factor (here, adding the social cues to the model) allows researchers to identify its causal role and to investigate its effect and impact on learning or processing systematically. In short, this model clearly shows the significance of cross-situational statistics in the learning of word meanings. However, the model only learned a small number (about 40-60) of relevant word-referent pairs. In the next section, we discuss a model based on the connectionist approach that was applied to a much larger lexicon (500-1000 words), so as to approximate the vocabulary size of toddlers.

The DevLex‐II Model: An Example of the Connectionist Approach

The DevLex‐II model, as formulated in Li, Zhao, and MacWhinney (2007), is a scalable SOM‐based connectionist language model designed to simulate a wide range of processes in both first and second language learning. We say that the model is “scalable” because it can be used to simulate a large realistic lexicon, in single or multiple languages, and for various bilingual language pairs (see Li, 2009; Zhao & Li, 2010, 2013).

Model Architecture

The architecture of the model is illustrated in Figure 11.3. Since the model was designed to simulate language development at the vocabulary level, we chose to include three basic levels for the representation and organization of words: phonological content, semantic content, and the articulatory output sequence. The core of the model is a SOM that handles lexical-semantic representation. This SOM is connected to two other maps, one for input (auditory) phonology and another for the articulatory sequences of output phonology. During training of the network, the semantic representation, input phonology, and output phonemic sequence of a word are simultaneously presented to the network. This process is analogous to a child hearing a word and performing analyses of its semantic, phonological, and phonemic information.

Figure 11.3 A sketch of the DevLex-II model. (The figure shows a central semantic map (SOM), which receives word meaning representations through self-organization and is linked via Hebbian learning to an input phonology map (SOM; comprehension) and an output sequence map (SARDNET; production). The latter two maps receive a word's phonological form, e.g., /dɔg/, and its phonemic sequence, e.g., /d/.../ɔ/.../g/, respectively.) Figure adapted from Li et al., 2007. Reproduced with permission of John Wiley & Sons.

On the semantic and phonological levels, DevLex-II constructs the representations from the corresponding linguistic input according to the standard SOM algorithm. On the phonemic output level, the model uses a temporal sequence learning network (based on the SARDNET of James and Miikkulainen, 1995). Given the challenge that the language learner faces in the articulatory control of the phonemic sequences of words, the use of a temporal sequence network allows us to model word production more realistically. In DevLex-II, the associative connections between the maps are trained via the Hebbian learning rule. As training progresses, the weights of the associative connections between concurrently activated nodes on two maps become increasingly stronger.
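A minimal sketch of such cross-map Hebbian association is given below, under simplifying assumptions of our own: one-hot winner activations on each map, a fixed learning rate, and invented winner positions. DevLex-II's actual update and retrieval rules are more elaborate, but the sketch shows how form-to-meaning links can come to support comprehension.

```python
import numpy as np

n_phon, n_sem = 20, 20               # units on the phonological and semantic maps
assoc = np.zeros((n_phon, n_sem))    # form-to-meaning associative weights
lrate = 0.1

# Hypothetical winners: for each of 5 words, the index of its winning unit
# on the phonological map and on the semantic map after SOM training.
phon_winner = [3, 7, 11, 14, 19]
sem_winner = [5, 2, 17, 9, 12]

for epoch in range(25):
    for word in range(5):
        p, s = phon_winner[word], sem_winner[word]
        # Hebbian rule: co-activated units strengthen their mutual link
        # (here both activations are 1.0, so the increment is constant).
        assoc[p, s] += lrate * 1.0 * 1.0

# "Comprehension": activating a word form evokes the semantic unit with the
# strongest associative link; we check it against the correct meaning.
for word in range(5):
    retrieved = int(np.argmax(assoc[phon_winner[word]]))
    print(word, retrieved == sem_winner[word])   # True for all five words
```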

Stimulus Representation

Many empirical phenomena in both monolingual and bilingual contexts have been examined with DevLex-II, including early lexical development, early phonological production, the acquisition of grammatical and lexical aspect, age-of-acquisition effects in second language learning, and cross-language priming effects (see Li & Zhao, 2013, for a review). Here we focus on the model's simulation of the "vocabulary spurt," the rapid vocabulary growth during an early period of lexical development, typically when the child is around 18 to 24 months of age. As mentioned before (see p. 214), this phenomenon has been extensively examined both in empirical studies and in connectionist models. The DevLex-II model, in which 591 English words constituted the target vocabulary, was designed to provide a computational account of this phenomenon.

Compared to other connectionist models, DevLex-II attempted to be linguistically realistic in that the words were not randomly chosen but based on data from the MacArthur-Bates Communicative Development Inventories (the CDI; Dale & Fenson, 1996; see Chapter 3 for details). In addition, the vector representations of the words were not randomly generated but were based on the phonemic, phonological, or semantic information of the words, as follows: (1) PatPho, a generic phonological representation system, was used to generate the sound patterns of words based on the articulatory features of different languages (see p. xx); (2) statistics-based methods were used to generate semantic representations of the training stimuli based on large-scale corpus data (e.g., the CHILDES database; MacWhinney, 2000; see also Chapter 3) or computational thesauruses (e.g., the WordNet database; Miller, 1990). Thus, the DevLex-II model was trained on realistic linguistic information coded in the input, making the model's simulation results relevant to realistic vocabulary learning in children.

Model Simulation and Data Analysis

The procedure for running a simulation is demonstrated below through the study of Li et al. (2007). In total, 10 simulation trials were run, with each trial corresponding to the learning of one participant in an empirical study. At the beginning of a simulation trial, the connection weights of the network were randomly initialized with real numbers. There were 100 epochs of training in each simulation, and at each epoch the 591 words of the training lexicon were presented to the network one by one in random order. Specifically, the semantic, phonological, and phonemic information of each word was simultaneously presented to the network, and the weights of the connections within and across the maps were adjusted according to the algorithms described on pp. 220–222.

After each epoch of training, the connection weights and the outputs of the network can be saved and analyzed. This can be compared to taking snapshots of children's lexical development at different ages. Specifically, word comprehension and word production are defined as follows. After the cross-map connections have been established through training, the activation of a word form can evoke the activation of a word meaning via the form-to-meaning links, which models word comprehension. If the activated unit on the semantic map matches the correct word meaning, we determine that the network correctly comprehends this word; if not, it is assumed that the network makes a comprehension error. Similarly, the activation of a word meaning can trigger the activation of an output sequence via the meaning-to-sequence links, which models word production. If the activated units on the phonemic map match the phonemes making up the word in the correct order, we determine that the network correctly produces this word; if not, it is assumed that the network makes a production error.

Figure 11.4 presents DevLex-II's simulation results in terms of the average receptive and productive vocabulary sizes across the course of training. The Y-axis represents the average number of words that the model could successfully comprehend and produce (as defined above). These data demonstrate that, for both comprehension and production, the model showed a clear vocabulary spurt, preceded by a stage of slow learning and followed by a performance plateau. Once the basic organization of the lexicon had been acquired in terms of lexical and semantic categories and their associations, vocabulary learning accelerated (at around 40 epochs, one third of the total training time; see Figure 11.4). When the basic structures were established on the corresponding maps, the associative connections between the maps could be consistently strengthened to reach a critical threshold through Hebbian learning, which facilitated subsequent learning of new vocabulary.

Figure 11.4 Vocabulary spurt simulated by DevLex-II (591 target words). (The figure plots vocabulary size, from 0 to 600 words, against training epochs, from 0 to 100, with separate curves for comprehension and production.) Results were averaged across 10 simulations. Error bars indicate standard error of the mean. Figure adapted from Li et al., 2007. Reproduced with permission of John Wiley & Sons.

As suggested by the error bars in Figure 11.4, there were significant individual differences between the simulation trials, even though all simulations used the same modeling parameters. Most interestingly, the largest variations tended to coincide with the rapid-growth or spurt period. Examining the individual trials in detail, we found that the networks could differ dramatically in the onset time of their vocabulary spurt. Aside from random effects (due to the different random initial weights of the networks), we observed systematic differences as a function of the complexity of the lexical input: for example, the higher the word frequency, or the shorter the word length, the earlier the vocabulary spurt occurred (see the discussion in Li et al., 2007). Such data agree with the outcomes of empirical studies, but add systematic information about how stimulus properties can independently or interactively shape the learning outcome and its trajectory.

Challenges and Future Directions

This chapter illustrates that computational modeling provides a particularly useful tool for psycholinguistic research, above and beyond traditional behavioral and recent neuroimaging methods. Specifically, modeling offers researchers flexibility in dealing with complex interactions between variables that are often confounded in natural language learning and processing situations, because modelers can systematically bring target variables under tight experimental control to test theoretically relevant hypotheses (McClelland, 2009). In other words, computational modeling can "simplify" the research question and allow the researcher to systematically manipulate the different levels of a variable and observe the effects while holding other variables constant. For example, a researcher interested in second language learning can first train a model on L1 (first language) data and then introduce the L2 (second language) data at different stages of training, so as to simulate effects of the age of L2 acquisition (early versus late L2; see Zhao & Li, 2010, for simulated models). In this way, we can examine the outcome directly and causally link it to the different levels of a specific variable, which may be difficult to do in the natural environment. In realistic language learning situations, one cannot observe the same individual at both an early and a late stage of L2 learning, but this can be done in the same model. One also cannot reverse a clinical condition (e.g., aphasia) and compare the pre- and post-lesion conditions in the same patient, whereas this can be done conveniently by simulating an intact model and then damaging it with the same parameters, or by damaging the model and then repairing its connections (see Kiran et al., 2013, for an example).

Although the advantages of using computational modeling as a psycholinguistic research tool are clear, the method does pose some challenges to the researcher. Because computational modeling requires the model to be implemented, it forces the researcher to be very explicit about the hypotheses, predictions, materials, variables and parameters, and testing procedures. This can be advantageous for the methodology, but at the same time it is a challenge, as the explicit nature of modeling requires that all input and output representations be specified algorithmically in the model. Basic concepts that psycholinguists take for granted may not be obvious to the model and need to be clearly specified. For example, for a model to represent conceptual "similarity" or word "association," the relevant concepts (e.g., horse and zebra) must be defined in quantitative, numerical terms, so that their "similarity" can be explicitly calculated.

Because of this need for algorithmic specification and the challenge that comes with it, psycholinguistic computational models often simplify things to make the modeling task tractable (e.g., representing lexical items as vectors with random values and smaller dimensionality). But such simplifications, while often necessary, can leave the model out of touch with the statistical properties of the natural language input to which the speaker or learner is exposed. One of the challenges will thus be to develop linguistically realistic models that can scale up to real language data. For example, in probabilistic language models based on Bayesian inference, to make a valid inference, prediction, or decision about a hypothesis, the modeler must set up a reasonable prior probability for the hypothesis based on linguistically valid background information. Such background information should ideally come from real language use. As discussed earlier (see pp. 211–213), many corpus-based analyses (e.g., HAL, LSA, or the Contextual SOM) derive their semantic representations from co-occurrence statistics. This provides a solid basis for linguistically realistic input in computational models. To tackle the issue of linguistic realism in the era of Big Data, computational modeling in psycholinguistics can clearly take advantage of the many databases and corpora available online or in other digital forms.

Another challenge for computational modeling in psycholinguistics is the handling of "free parameters" in models and how they should be adjusted (e.g., manually or not). For example, in the HAL model, researchers need to determine the window size of the target word's neighborhood.
In connectionist models, the magnitude of the learning rate and the size of the network (e.g., the number of units) often need to be determined by the modeler, based on intuition, before a simulation is run. In each case, these are difficult choices, as each model involves a different degree of complexity and task difficulty, and the researcher needs to draw on experience with previous models and conventional wisdom in setting appropriate values for the free parameters. Inevitably, criticism of a particular model may arise from the use of a free parameter in one way or another. In general, the researcher should avoid introducing too many free parameters into a simulation. Although having more free parameters usually means a better fit of the model to the target data, their use may compromise the external validity of the network in relation to the phenomena being simulated (see Pitt & Myung, 2002, for a discussion). As in empirical studies, findings from overly tightly controlled experiments with too many variables may not generalize to other situations. How many free parameters are needed, and how their values should be adjusted, remain major challenges for future psycholinguistic computational models.

A final, more general, challenge that lies ahead is how modelers can build a bridge between computational modeling results and the variety of behavioral, neuropsychological, and neuroimaging findings. There is a clear need for models that can make predictions about a wide range of data from different modalities and contexts (see Schloss & Li, 2016, for a recent example of using computational models of distributed semantic representations to predict brain activation patterns in fMRI data). In some cases, the empirical data have either not yet been obtained or cannot be obtained (e.g., in the case of brain injury, where one cannot go back to pre-lesion conditions). This is where modeling results can be extremely helpful. In other cases, computational modeling should not only verify existing patterns of behavior when they are available, but also inform psycholinguistic theories by making distinct predictions under different hypotheses or conditions. In so doing, computational modeling will provide a new forum for generating novel ideas, inspiring new experiments, and helping formulate new theories.

Key Terms

Bayesian models A group of probabilistic models based on Bayesian statistics. Bayes' theorem focuses on the impact of prior probabilities, along with the likelihood of the evidence, in determining the probability of the hypothesis being tested.

Connectionism Also known as Neural Networks or Parallel Distributed Processing (PDP), connectionism is a theoretical framework as well as a computational approach to human cognition and language. It argues that human cognition emerges from large networks of interactive processing units operating simultaneously, and holds that learning, representation, and processing are parallel, distributed, and interactive in nature.

Cross-situational word learning models A group of models that focus on how young children solve the word-to-referent mapping problem. Yu and Ballard (2007), as discussed in this chapter, is such a model.

DevLex (Developmental Lexicon) models A series of multi-layer unsupervised connectionist models of lexical development, which have been applied to both first and second language acquisition. The models focus on training phonological and semantic representations and the connections between these representations via Hebbian learning.

Distributional semantic models A group of computational probabilistic models (sometimes called "semantic space" models) based on distributional statistics from large-scale language/text corpora. Popular models include Latent Semantic Analysis (LSA) and the Hyperspace Analogue to Language (HAL).

Hebbian learning A biologically plausible mechanism of associative learning which allows highly co-activated neurons to strengthen their mutual connections. This is often referred to as the "neurons that fire together wire together" principle.

Hyperspace Analogue to Language (HAL) A distributional semantic model based on word-word co-occurrences in sentence contexts from large-scale language corpora.

Latent Semantic Analysis (LSA) A distributional semantic model based on word-to-document co-occurrences in large-scale language corpora.

Self-Organizing Map (SOM) A type of unsupervised connectionist model with a topology-preserving feature, typically condensing multi-dimensional features onto two-dimensional feature maps for visualization.

Simple Recurrent Network (SRN) A type of connectionist model that combines the three-layer backpropagation architecture with recurrent context units, making it ideally suited for modeling sequence learning.

Supervised learning A class of connectionist learning models that adjust the weights in the network based on explicit error signals at the output level.

Unsupervised learning A class of connectionist learning models that use no explicit error signal at the output level when the network weights are adjusted.

References

Burgess, C., & Lund, K. (1997). Modelling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12, 177–210.
Chater, N., & Manning, C. D. (2006). Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 335–344.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Christiansen, M. H., & Chater, N. (2001). Connectionist psycholinguistics: Capturing the empirical data. Trends in Cognitive Sciences, 5, 82–88.
Dale, P. S., & Fenson, L. (1996). Lexical development norms for young children. Behavior Research Methods, Instruments, and Computers, 28, 125–127.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93, 283–321.
Dennis, S. (2007). How to use the LSA Web Site. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 57–70). Mahwah, NJ: Lawrence Erlbaum Associates.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Fodor, J. A. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA: MIT Press.
Gardner, H. (1987). The mind's new science: A history of the cognitive revolution. New York, NY: Basic Books.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114, 211–244.

Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York, NY: Wiley.
James, D. L., & Miikkulainen, R. (1995). SARDNET: A self-organizing feature map for sequences. Advances in Neural Information Processing Systems, 7, 577–584.
Jones, M. N., Willits, J., & Dennis, S. (2015). Models of semantic memory. In J. R. Busemeyer & J. T. Townsend (Eds.), Oxford handbook of mathematical and computational psychology (pp. 232–254). New York, NY: Oxford University Press.
Kiran, S., Grasemann, U., Sandberg, C., & Miikkulainen, R. (2013). A computational account of bilingual aphasia rehabilitation. Bilingualism: Language and Cognition, 16, 325–342. doi:10.1017/S1366728912000533
Kohonen, T. (2001). The self-organizing maps (3rd ed.). Berlin, Germany: Springer.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338. doi:10.1126/science.aab3050
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Li, P. (2009). Lexical organization and competition in first and second languages: Computational and neural mechanisms. Cognitive Science, 33, 629–664.
Li, P., Farkas, I., & MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17, 1345–1362.
Li, P., & MacWhinney, B. (2002). PatPho: A phonological pattern generator for neural networks. Behavior Research Methods, Instruments, and Computers, 34, 408–415.
Li, P., & Zhao, X. (2013). Self-organizing map models of language acquisition. Frontiers in Psychology, 4(828). doi:10.3389/fpsyg.2013.00828
Li, P., Zhao, X., & MacWhinney, B. (2007). Dynamic self-organization and early lexical development in children. Cognitive Science, 31, 581–612.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk, transcription, format and programs (Vol. 1). Mahwah, NJ: Lawrence Erlbaum.
McClelland, J. L. (2009). The place of modeling in cognitive science. Topics in Cognitive Science, 1, 11–38.
McClelland, J. L. (2015). Explorations in Parallel Distributed Processing: A handbook of models, programs, and exercises. http://www.stanford.edu/group/pdplab/pdphandbook/
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86. doi:10.1016/0010-0285(86)90015-0
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1, an account of basic findings. Psychological Review, 88, 375–407.
McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37, 547–559.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of the Workshop at ICLR. http://arxiv.org/pdf/1301.3781.pdf
Miller, G. A. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3, 235–312.
Norris, D. (2006). The Bayesian Reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113, 327–357.
Perfors, A., Tenenbaum, J. B., Griffiths, T. L., & Xu, F. (2011). A tutorial introduction to Bayesian models of cognitive development. Cognition, 120, 302–321. doi:10.1016/j.cognition.2010.11.015

Pitt, M. A., & Myung, I. J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421–425.
Plunkett, K., & Elman, J. L. (1997). Exercises in rethinking innateness: A handbook for connectionist simulations. Cambridge, MA: MIT Press.
Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 43–102.
Plunkett, K., Sinha, C., Møller, M. F., & Strandsby, O. (1992). Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science, 4, 293–312.
Regier, T. (2005). The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29, 819–865.
Roelofs, A. (1997). The WEAVER model of word-form encoding in speech production. Cognition, 64, 249–284.
Rong, X. (2014). Word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
Ruh, N., & Westermann, G. (2009). OXlearn: A new MATLAB-based simulation tool for connectionist models. Behavior Research Methods, 41, 1138–1143.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Rumelhart, D., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the microstructure of cognition. Vol. 2, Psychological and biological models (pp. 216–271). Cambridge, MA: MIT Press.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Schloss, B., & Li, P. (in press). Disentangling narrow and coarse semantic networks in the brain: The role of computational models of word meaning. Behavior Research Methods.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Shaoul, C., & Westbury, C. (2010). Exploring lexical co-occurrence space using HiDEx. Behavior Research Methods, 42, 393–413.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Smith, L., & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition, 106, 1558–1568. doi:10.1016/j.cognition.2007.06.010
Vosse, T., & Kempen, G. (2000). Syntactic structure assembly in human parsing: A computational model based on competitive inhibition and a lexicalist grammar. Cognition, 75, 105–143.
Xu, F., & Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological Review, 114, 245–272. doi:10.1037/0033-295X.114.2.245
Yu, C., & Ballard, D. H. (2007). A unified model of early word learning: Integrating statistical and social cues. Neurocomputing, 70, 2149–2165. doi:10.1016/j.neucom.2006.01.034
Yu, C., & Smith, L. B. (2012). Modeling cross-situational word-referent learning: Prior questions. Psychological Review, 119, 21–39.
Zhao, X., & Li, P. (2009). An online database of phonological representations for Mandarin Chinese. Behavior Research Methods, 41, 575–583.
Zhao, X., & Li, P. (2010). Bilingual lexical interactions in an unsupervised neural network model. International Journal of Bilingual Education and Bilingualism, 13, 505–524.
Zhao, X., & Li, P. (2013). Simulating cross-language priming with a dynamic computational model of the lexicon. Bilingualism: Language and Cognition, 16, 288–303.
Zhao, X., Li, P., & Kohonen, T. (2011). Contextual self-organizing map: Software for constructing semantic representations. Behavior Research Methods, 43, 77–88.

Further Reading and Resources

CHILDES database: http://childes.talkbank.org/
Annotation: Child Language Data Exchange System. A rich online database of child-child and child-adult speech interactions. Information extracted from it has been used as input to many computational models.

Elman, J. L., Bates, E. A., Johnson, M. H., & Karmiloff-Smith, A. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Annotation: This book takes a connectionist perspective on cognitive and language development. It argues for the need to clearly define innateness at different levels, and to separate innateness from modularity, domain-specificity, and localization.

Griffiths, T. L., Kemp, C., & Tenenbaum, J. B. (2008). Bayesian models of cognition. In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp. 59–100). New York, NY: Cambridge University Press.
Annotation: An introduction to Bayesian statistics and its application in cognitive modeling. It also discusses how Bayesian inference can be used to infer topics from large texts.

Jones, M. N., Willits, J., & Dennis, S. (2015). Models of semantic memory. In J. R. Busemeyer & J. T. Townsend (Eds.), Oxford handbook of mathematical and computational psychology (pp. 232–254). New York, NY: Oxford University Press.
Annotation: A comprehensive review of both probabilistic and connectionist models of semantic representation.

Li, P., & Zhao, X. (2012). Connectionism. In M. Aronoff (Ed.), Oxford bibliographies online: Linguistics. New York, NY: Oxford University Press. http://www.oxfordbibliographies.com/view/document/obo-9780199772810/obo-9780199772810-0010.xml
Annotation: Online annotated bibliographies for important concepts and references of connectionist models.

12 Corpus Linguistics

Marc Brysbaert, Paweł Mandera, and Emmanuel Keuleers

Abstract

Corpus linguistics refers to the study of language through the empirical analysis of large databases of naturally occurring language, called corpora. Psycholinguists are mostly familiar with corpus linguistics because the word frequency norms they use come from corpus linguistics. The frequency norms are more informative if they include information about the part-of-speech roles of the words (e.g., the word "dance" used as a verb or a noun). This requires the syntactic parsing of the corpus, which is currently done automatically. An exciting new development is the calculation of semantic vectors on the basis of word co-occurrences. In this analysis, the meaning of a target word is derived by taking into account the words surrounding the target word. This makes it possible to calculate the semantic similarity between two target words. The measures provided by corpus linguistics are the most powerful when they can be combined with processing times for large numbers of words (obtained in megastudies) and subjective ratings for many words (obtained via crowdsourcing studies). Examples are given.

Introduction

Corpus linguistics refers to the study of language through the empirical analysis of large databases of naturally occurring language, called corpora (singular form: corpus). In linguistics, corpus linguistics was for a long time the rival of approaches that predominantly valued the theoretical insights and acceptability intuitions of individual linguists. In recent years, signs of collaboration and cross-fertilization have been observed (Gries, 2010), partly because the tools used in corpus linguistics have become more user-friendly. Everyone looking up the use of a particular phrase in an internet search engine is essentially doing corpus linguistics, searching a large collection of webpages for the presence of a particular word or word co-occurrence. At the same time, ideas from theorists are important for corpus linguists, as corpus searches are particularly informative when they address specific, theory-driven predictions.

Psycholinguists are mostly familiar with corpus linguistics because of the word frequency measures they use. It is well known that high-frequency words are processed more efficiently than low-frequency words. The frequency norms on which the selection of stimulus materials is based come from corpus linguistics. In particular, the compilation of a balanced, 1-million-word corpus by Kucera and Francis (1967), and the word counts based on that corpus, have had a tremendous influence on word recognition research in English up to the present day. Corpus analysis has also influenced sentence parsing research, first to find out which constructions are attested and which are not, then to establish the relative frequencies of various constructions, and now increasingly to train computational models of sentence parsing. Another exciting use of corpus analysis is the calculation of semantic similarity measures on the basis of word co-occurrences.

Assumptions and Rationale

The underlying assumptions of corpus linguistics differ slightly between studies depending on whether a researcher is interested in language production or language perception. For language production researchers, the corpus is the output to be analyzed and the ideal is to have the largest possible sample of spontaneously generated content. This can consist of written texts, but most of the time it will consist of spoken discourse, as there are more researchers interested in speech production than in writing, and because written texts are often edited and polished before publication (although there are exceptions, such as television programs that are subtitled online or chat interactions). The rationale behind the approach is that the corpus forms a representative sample of language produced and, therefore, can be analyzed to reveal the processes underlying language production. Typical examples of such studies are the analysis of speech errors (e.g., saying “dye a beggar” instead of “buy a dagger”; Fromkin, 1973) or the investigation of acoustic reductions in speech (Ernestus, Baayen, & Schreuder, 2002).

The main assumption made by word perception researchers is that the language corpus is representative of the type of language people have been exposed to in their lives. The corpus can then be used to count the frequencies of various words, phrases, and syntactic constructions encountered by people. This has been the basis of all research on word frequency (Brysbaert & New, 2009; Monsell, Doyle, & Haggard, 1989). It has also been the basis of research investigating whether people are more likely to use the most frequent analysis when confronted with a syntactic ambiguity (Reali & Christiansen, 2007).

A criticism raised against the rationale behind using frequency measures in perception research is that a correlation between frequency of production and ease/preference of use need not be interpreted as evidence for the claim that exposure drives perception. It can be argued that exposure frequency does not affect interpretation directly, but that production and perception are both the outcome of a third variable. For instance, it has been argued that differences in structural complexity and working memory demands drive both syntactic production and perception: One is likely to produce the structure with the least demands and one tends to prefer the disambiguation with the simplest structure. Similarly, with respect to the word frequency effect in speech production, Hayes (1988) wondered whether the observation that spoken discourse contains fewer low‐frequency words than written texts could be due to people avoiding the use of low‐frequency words in spoken discourse in order to preserve the fluency of their speech. According to Hayes, the difficulty of producing a word determines its frequency of occurrence (and not the other way around). It is good to keep these objections in mind: A correlation between production and perception need not mean that perception is directly affected by frequency differences in the language one is exposed to, as assumed by experience‐based models of language processing. On a more positive note, the correlation between perception and corpus data can be used to predict one from the other, independently of the underlying causal structure.

Apparatus and Tools

The apparatus for corpus linguistics is becoming simpler as a result of the growing power of computers. Most desktop and laptop computers nowadays can do analyses that required supercomputers only a few decades ago. The most likely impediment to applying corpus linguistics is the computer programming skills required. Given that corpora currently contain billions of words/sentences, one needs automated algorithms to process the data. Indeed, there is substantial overlap between corpus linguistics and natural language processing (NLP) research in computer science departments, where one tries to improve the verbal intelligence of computers by making them digest large corpora of information (usually texts, although the first uses of pictorial materials have been reported). Increasingly, libraries of algorithms and software packages are becoming available, making it possible to run programs without in‐depth knowledge of the underlying operations, just like statistical packages make it possible to run complicated analyses without being familiar with matrix algebra (what Schütz, 1962, called the use of recipe knowledge). A few of the packages are mentioned at the end of the chapter. However, because the packages change rapidly and are language‐dependent, our list is likely to be outdated soon and it is better to do an internet search. Two programming languages that are popular at the moment are R and Python.

Depending on the information one needs, it is possible to do direct searches in a corpus. This will be the case when one is interested in the occurrence of certain words or word sequences. In many cases, however, one will want more information than can be derived from a surface analysis, for instance when one is interested in syntactic structures or in part‐of‐speech information related to the words. For such questions, it is important to have access to a corpus that has been parsed and tagged. Parsing refers to the decomposition of sentences into their grammatical constituents, which are put into a tree diagram indicating the syntactic relationships between the constituents. Tagging involves the assignment of part‐of‐speech (PoS) information to the words, which includes the assignment of the right lemma (base form) to inflected words. A number of small corpora have been parsed and tagged manually (arguably the most famous is the Penn Treebank). Most of the time, however, this is now done automatically, even though the output is not yet 100% error‐free. Software packages often used for English include CLAWS (http://ucrel.lancs.ac.uk/) and the Stanford Parser (http://nlp.stanford.edu/software/lex‐parser.shtml).

Occasionally (and far too infrequently) psycholinguists can profit from derived data made available by computational linguists or NLP scientists. As indicated above, the best‐known example is the availability of word frequency lists. These lists consist of word types, the number of times they have been observed in the corpus, the syntactic roles (parts‐of‐speech) of the word, and the lemmas associated with these parts‐of‐speech (see below). This information can often be reduced to a single file for a spreadsheet or a statistical program, or made available through a website. An interesting addition in recent years is the collection of frequencies of word sequences (called word Ngrams). These consist of word bigrams (frequencies of word pairs), word trigrams (sequences of three words), and so on.
They were first made available by Google (https://books.google.com/ngrams). Another interesting website for English word Ngrams is the Corpus of Contemporary American English (http://corpus.byu.edu/coca/).
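To make the counting step concrete, the sketch below derives unigram and bigram (word‐pair) counts from a plain‐text corpus in Python. The file name and the crude regular‐expression tokenizer are illustrative assumptions; a real pipeline would use a proper tokenizer and the cleaning steps discussed later in this chapter.

```python
# Minimal sketch: unigram and bigram counts from a raw text file.
# "corpus.txt" is a hypothetical file; the tokenizer is deliberately crude.
from collections import Counter
import re

unigrams = Counter()
bigrams = Counter()

with open("corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        # Keep lowercase word-like strings; punctuation and digits are dropped.
        tokens = re.findall(r"[a-z']+", line.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent word pairs

corpus_size = sum(unigrams.values())
for word, count in unigrams.most_common(10):
    print(word, count, round(1e6 * count / corpus_size, 2))  # frequency per million
```

The same counters extend to trigrams and longer Ngrams by zipping additional offsets (e.g., `zip(tokens, tokens[1:], tokens[2:])`).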

Nature of Stimuli and Data

Raw Data Versus Derived Data

The nature of the stimuli depends on whether you make use of the corpus itself or of derived data. If you want to work with a corpus yourself, you obviously must have access to it. The corpus itself will consist of text, sometimes enriched with additional information such as the part of speech associated with the words or the parse structure of the sentences included in the corpus. (Spoken materials are usually transcribed, because it is not yet possible to do corpus‐wide analyses on speech signals.)

A major limitation of corpora is that most of them are subject to copyright restrictions, because the materials were produced by other people, who did not transfer copyright to the corpus builders (this is often impossible given the sheer number of people and organizations involved). Because of possible copyright infringements, researchers are very hesitant to share their corpora with colleagues, meaning that many corpora must be built anew by research groups, hindering the accumulation of information and the replication of findings.

The situation is much better for derived data, as these data usually are free for research purposes and are easier to handle. Because the derived data do not harm the authors’ commercial rights, they do not violate intellectual property and fall under the rules of “fair use of a copyrighted work”. In their simplest form, the derived data are available as a spreadsheet (e.g., Excel) and can be used by anyone with basic computer skills. Occasionally, the list is too long for a spreadsheet and then you need access to (slightly) more advanced software.

Language corpora need not be limited to spoken and written words. They can also consist of gestures, either to replace speech (in mute or deaf participants) or to accompany speech.
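As an illustration of how little is needed to work with such derived data, the sketch below loads a frequency spreadsheet into Python with pandas. The file name and column labels follow the SUBTLEX‐US layout shown in Table 12.1 below, but treat them as assumptions and check the headers of the file you actually download.

```python
# Minimal sketch: loading a derived-data spreadsheet (hypothetical local copy).
import pandas as pd

subtlex = pd.read_excel("SUBTLEX-US.xlsx")   # requires an Excel reader such as openpyxl
subtlex = subtlex.set_index("Word")          # column name assumed from Table 12.1
print(subtlex.loc["appalled", ["FREQcount", "CDcount"]])
```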

Word Frequency Data

The most frequently used measure derived from corpus linguistics is word frequency. Table 12.1 shows an excerpt from the SUBTLEX‐US database (Brysbaert, New, & Keuleers, 2012), which contains word frequencies based on an analysis of a corpus of film subtitles including 51 million words from 9,388 films. It describes the information for the word “appalled.” The first line shows that this word was observed 59 times in the corpus. The second line indicates that it was observed in 53 films (a variable called “contextual diversity”). The third and the fourth line provide standardized frequency measures: frequency per million words (59/51 = 1.16) and the Zipf‐value, which is a standardized logarithmic value (log10((59 + 1)/51) + 3 = 3.07). The Zipf‐value is a better measure than frequency per million, because it takes into account the facts that the word frequency effect is a logarithmic function and that more than half of the words have a frequency of less than one per million words. The value ranges from 1 to 7, with low‐frequency words covering the range of 1–3 and high‐frequency words covering the range of 4–7. For more information, see van Heuven, Mandera, Keuleers, and Brysbaert (2014). The next lines of Table 12.1 indicate that “appalled” is used as an adjective (49 times) and as a verb form (10 times). So, the dominant lemma of “appalled” is “appalled” (used as an adjective); the other lemma is the verb “appall.”

Because word frequencies are so easy to calculate nowadays, it is important to make sure you use a good frequency measure (see the next section as well). Important variables to consider are (1) the size of the corpus, (2) the language register captured by the corpus, and (3) the quality of the analyses done. As for the size of the corpus, good frequency measures require corpora of some 20–50 million words. This is because a large part of the word frequency effect is situated at frequencies lower than 1 per million words (Keuleers, Diependaele, & Brysbaert, 2010).

Table 12.1 Excerpt from the SUBTLEX‐US database for the word “appalled.”

Word                   appalled
FREQcount              59
CDcount                53
SUBTLEX pm             1.16
Zipf‐value             3.07
Dom_PoS_SUBTLEX        Adjective
Freq_dom_PoS_SUBTLEX   49
Percentage_dom_PoS     0.83
All_PoS_SUBTLEX        Adjective;Verb
All_freqs_SUBTLEX      49;10
Dom_Lemma_SUBTLEX      appalled
All_Lemma_SUBTLEX      appalled;appall

These are the Zipf‐values between 1 and 3. If the corpus is too small, it is impossible to measure these frequencies properly. Larger corpora are required when, in addition, one wants information about part‐of‐speech or word Ngrams.

At the same time, it is not true that large corpora are always better than small corpora, the reason being that large corpora often tap into language registers few participants in psychology experiments (typically undergraduate students) are familiar with. Such corpora are, for instance, encyclopedias. Wikipedia is a very popular source in NLP research, because it contains nearly 2 billion words, is freely available, and exists for many languages. However, it is not the type of language undergraduates read a lot. The same is true for Google Books, another multibillion‐word corpus covering millions of fiction and non‐fiction books, but again unlikely to be read by undergraduates. When the quality of word frequency measures is tested, substantially better results are obtained when the corpus consists of film subtitles (Brysbaert, Keuleers, & New, 2011), tweets and blogs (Gimenes & New, 2016), or social media messages (Herdağdelen & Marelli, in press), as discussed in the next section.

Finally, the quality of the word frequency measure also depends on the quality of the analysis done. Several factors are involved. One of them is the duplication of sources. Because electronic materials are easy to copy, most corpora contain multiple instances of the same information (e.g., subtitles for the same film in a corpus of subtitles). It is important to detect and delete such duplications. The same is true for interchanges in which previous messages are copied into the replies. Often some checks of the text quality must be done as well, to make sure that the language is the one intended and of an acceptable level. Another issue is that files often contain meta‐information related to the source, which must be discarded as well. For instance, files with film subtitles usually include information about the film, the people who made the subtitles, and so on. This information must be excluded. Lastly, if one is interested in part‐of‐speech information, it is important to use a parser of good quality.

The following are interesting sources across a number of languages. The first are the so‐called SUBTLEX frequencies, based on film subtitles and available for Chinese, Dutch, English, French, German, Greek, Polish, Portuguese, and Spanish (for more information, see http://crr.ugent.be/programs‐data/subtitle‐frequencies). Another interesting source comes from tweets and blogs; Gimenes and New (2016) provide such frequencies for 66 languages. Some databases are geared towards children. The best known is the CHILDES database, available for several languages (http://childes.psy.cmu.edu/) and discussed extensively in Chapter 3.
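The Zipf transformation described above is easy to apply yourself. The sketch below simply restates the formula used for Table 12.1 (a Laplace‐smoothed count divided by the corpus size in millions of words, log10‐transformed and shifted by 3); see van Heuven et al. (2014) for the full rationale, and treat this as an illustration rather than the canonical implementation.

```python
# Minimal sketch of the Zipf transformation: Laplace-smoothed count,
# divided by the corpus size in millions of words, log10, plus 3.
import math

def zipf_value(freq_count, corpus_size_millions):
    return math.log10((freq_count + 1) / corpus_size_millions) + 3

# "appalled" in SUBTLEX-US: 59 occurrences in a 51-million-word corpus.
print(round(zipf_value(59, 51), 2))   # -> 3.07, matching Table 12.1
```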

Semantic Vectors

Whereas corpus studies traditionally were geared towards word frequency data and syntactic analyses, an exciting development in the past two decades is the calculation of semantic information on the basis of word co‐occurrences. This approach, which is based on the idea that words with similar meanings tend to occur in similar contexts (Harris, 1954), was introduced to psychology in two classic papers by Lund and Burgess (1996) and Landauer and Dumais (1997). The authors operationalized the semantic similarity between words by observing the joint occurrence of the words in contexts. For Lund and Burgess, the context was a small moving window (up to 10 words) sliding through the corpus. For Landauer and Dumais, the context was a short article.
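To make the window‐based counting concrete before turning to the details of the two models, the sketch below builds a sparse co‐occurrence matrix with a symmetric sliding window on a toy corpus. The window size and the two‐sentence corpus are illustrative assumptions; HAL itself used a larger window and weighted co‐occurrences by distance to the target, which is omitted here.

```python
# Minimal sketch of window-based co-occurrence counting: each row of the
# resulting (sparse) matrix serves as a word's semantic vector.
from collections import defaultdict, Counter

corpus = [["the", "cat", "chased", "the", "mouse"],
          ["the", "dog", "chased", "the", "cat"]]   # toy corpus
window = 2                                          # symmetric window size

cooc = defaultdict(Counter)
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                cooc[target][sentence[j]] += 1

print(cooc["cat"])   # the sparse co-occurrence vector for "cat"
```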

Lund and Burgess compiled a corpus of 160 million words from internet news groups. In their analysis, they included all words appearing at least 50 times within the corpus. This resulted in a total of 70 thousand words and a co‐occurrence matrix of 70,000 × 70,000 entries. Each cell of the matrix included the number of times the words were present together in the sliding window. On the basis of this matrix, each word had a semantic vector consisting of 70,000 numbers. By comparing the semantic vectors, the semantic similarity between words could be calculated: Words that co‐occurred in the same contexts had very similar semantic vectors; words that rarely co‐occurred in the same contexts had different semantic vectors. Lund and Burgess observed that the similarity vectors made clear distinctions between words from the categories animals, body parts, and geographical locations. The vectors of the words within these categories were much more similar than those between the categories. The authors also showed that the semantic similarities were larger between targets and related primes from a previously published semantic priming experiment than between targets and unrelated control primes. Lund and Burgess called their approach the hyperspace analogue to language (HAL; see also Chapter 11).

Landauer and Dumais (1997) started from the same basic principles but applied a slightly different procedure. First, they used a corpus consisting of a 4.6‐million‐word encyclopedia for young students, which included 30,473 entries (in later implementations the authors worked with a larger corpus of schoolbooks to better approach the learning process in children). From each entry the authors took a text sample with a maximum of 2,000 characters (about 151 words). The encyclopedia entries formed one dimension of the matrix; the other dimension consisted of the 60,768 words they were interested in. The cells in the matrix contained the frequency with which a particular word appeared in a particular text sample. Next, the authors applied a dimensionality reduction to the matrix (called singular value decomposition), which reduced the 30,473 entries to 300 dimensions. Again the values of the words on each of these 300 dimensions were used as a vector to calculate the similarity to the other words. To test the usefulness of the semantic vectors, Landauer and Dumais used them to solve a vocabulary test with multiple‐choice answer alternatives (taken from the synonym portion of the Test of English as a Foreign Language, TOEFL). The test consisted of 80 items with four alternatives to choose from. An item was correctly solved when the semantic distance calculated between the target and the correct alternative was smaller than the distances with the other three alternatives. This was the case for 64% of the items, which agreed with the score obtained by a large sample of applicants to U.S. colleges from non‐English speaking countries. Landauer and Dumais called their approach latent semantic analysis (LSA; see also Chapter 11).

From a practical point of view, an important difference between Lund and Burgess (1996) and Landauer and Dumais (1997) was that the latter not only published their paper, but also developed a website (http://lsa.colorado.edu/) on which visitors could calculate the LSA similarities between words. This website informs you, for instance, that the semantic similarity between apple and pear is .29, whereas the similarity between apple and tear is .18.
The site also informs you that other words are closer neighbors to apple. Some of these are, in descending order: cherry (.43), peel (.42), and tree (.40). Surprisingly, the list also includes chicle (.41), nonalphabetic (.40), uppercase (.39), and chapman (.38), showing that further improvements to the measure are warranted. Because of the availability of the user‐friendly interface with derived measures, LSA has had much more impact on psycholinguistic research than HAL. Indeed, one regularly comes across semantic‐priming experiments in which LSA values were compared or matched across conditions.

In the years since the publications of Lund and Burgess (1996) and Landauer and Dumais (1997), researchers have attempted to improve the performance of the procedures. Several approaches were taken. First, researchers made use of larger corpora. Second, they tried to optimize the transformation steps applied to the raw context count matrices and searched for the best possible parameter sets. One of the testing standards was the TOEFL test used by Landauer and Dumais. Gradually, the number of correctly solved items rose until Bullinaria and Levy (2012) reached 100% correct test performance. This was achieved by using a corpus of over 2 billion words crawled from the web (including Wikipedia pages), a HAL‐based approach with a window size of one word to the left and one word to the right of the target word, a cosine semantic similarity index, and by weighting the vector components. Lemmatizing a text before running the analysis (i.e., replacing all inflected forms by lemmas) did not improve the performance of the models if the corpus was big enough.

In addition to improving well‐established models, completely new approaches have been proposed. One was that researchers started to use a connectionist network rather than a count matrix (Mikolov, Chen, Corrado, & Dean, 2013). In these models, word co‐occurrences are no longer explicitly counted and reduced to principal components. Instead, all target words are represented as input and output nodes in a three‐layer connectionist network. The context words are used as predictors in the input layer and the target word is the one that must be activated in the output layer. The input and output layers are connected via a hidden layer of a few hundred units. The weights between the nodes are adapted to optimize the performance of the network and the final weights are used to form the semantic vectors (see Chapter 11 for details about connectionist models). Several studies have confirmed that this approach usually yields better and more robust performance than the traditional distributional models, such as HAL or LSA (Baroni, Dinu, & Kruszewski, 2014; Mandera, Keuleers, & Brysbaert, 2017; but see Levy, Goldberg, & Dagan, 2015, for an alternative view). In addition, it has been shown that the proposed connectionist models can be mathematically equivalent to a certain type of the traditional models (Levy & Goldberg, 2014). At the same time, it has been suggested that performance on the TOEFL may not be the best indicator of human performance, because optimal performance on the TOEFL test requires encyclopedic input, whereas human semantic priming data are better predicted by semantic vectors based on everyday language such as that found in film subtitles (Mandera et al., 2017).
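For readers who want to experiment with the prediction‐based approach, the sketch below trains skip‐gram vectors with the Gensim package listed at the end of this chapter. The toy corpus and parameter values are illustrative, and the call assumes Gensim version 4 or later (older versions name the dimensionality parameter `size` rather than `vector_size`); meaningful vectors of course require corpora of millions of words.

```python
# Minimal sketch: training prediction-based word vectors of the kind
# introduced by Mikolov et al. (2013), using Gensim's skip-gram model.
from gensim.models import Word2Vec

sentences = [["the", "cat", "chased", "the", "mouse"],
             ["the", "dog", "chased", "the", "cat"]]   # toy corpus

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.similarity("cat", "dog"))   # cosine similarity of the two vectors
```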
Unfortunately, the information and computational skills needed to independently train and use state‐of‐the‐art semantic vectors put them out of reach for many psycholinguistic researchers. The corpora on which the new measures were calculated cannot be made freely available due to copyright restrictions, and running the algorithms requires expert knowledge (not to mention computer time). As a result, psycholinguists had little option but to continue working with the easily available but outdated LSA measures. To solve this problem, we have written a shell that can be downloaded and that calculates the semantic distances between words based on the latest developments (http://crr.ugent.be/snaut/). At the moment, the shell calculates semantic distance values for English and Dutch. Other languages are likely to follow.

Collecting the Data

Most of the time, a corpus will be downloaded from the internet. Indeed, the massive availability of language in digital form has been the driving force behind corpus linguistics. Researchers have a tendency to go for the materials that are easiest to reach. As indicated above, a lot of corpora contain the Wikipedia webpages (https://www.wikipedia.org/), as they can be downloaded easily. This is a good corpus for encyclopedic knowledge, but it is less suited as a proxy for the typical speech or text people are exposed to. Some other popular text corpora are based on web crawlers that browse the World Wide Web and download the contents of various sites. These corpora contain a wide variety of sources (which is good), but usually require considerable cleaning (duplicates, pages in other languages, pages with repetitions of the same information, etc.). Finally, some corpora can be obtained from previous research (but see the copyright issues above). The advantage here is that much of the cleaning work has been done already.

The size required for a good corpus depends on its use. If the goal is to have frequencies of single words, then a corpus of some 20–50 million words is enough (Brysbaert & New, 2009). If one in addition wants reliable part‐of‐speech information about low‐frequency words, a larger corpus is indicated. Larger sizes are also needed if the researcher wants information about word co‐occurrences, as these are by definition lower in frequency. At the same time, it is good to keep in mind that an undergraduate student (the typical participant in psycholinguistic experiments) is unlikely to have come across more than 2 billion words in their life (Brysbaert, Stevens, Mandera, & Keuleers, 2016a). So, corpora larger than this size are less representative as well.

Next to size, the language register of the corpus is of critical importance, certainly if one wants to predict performance in psycholinguistic experiments. In general, measures based on the type of language participants have been exposed to are more valid than measures based on scientific or non‐fiction sources. As indicated above, particularly useful sources are film subtitles and social media messages. School books are also a good source, arguably because undergraduates have spent a good part of their lives reading and studying them. Books from primary school have an extra advantage because they tap into the language first acquired, which seems to have a stronger influence on language processing than words acquired later (Brysbaert & Ellis, 2016). A special case concerns research with older participants, who have been less exposed to internet language and to the language of recent years. Several studies report that for these participants corpora from some time ago may be more representative (for references, see Brysbaert & Ellis, 2016).

The register of the corpus is particularly relevant when one wants to compare the processing of various types of words. One such question is whether emotional words (associated with positive and negative feelings) are recognized faster than neutral words. To answer this question, one must be sure that the frequencies of the various words are estimated correctly (Kuperman, Estes, Brysbaert, & Warriner, 2014).
For instance, if the word frequency estimates are based on a non‐fiction corpus, the frequency of the emotional words will be underestimated (as non‐fiction texts rarely deal with emotion‐laden situations) and it will look as if emotional words are processed faster than expected on the basis of their “frequency.” Alternatively, if the corpus is based on song lyrics, it might seem like emotional words are processed more slowly than expected on the basis of their “frequency.”
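The cleaning steps mentioned earlier (duplicate removal in particular) are straightforward to prototype. The sketch below drops exact duplicates by hashing a normalized version of each document; treat it as an illustration only, since real pipelines also need near‐duplicate detection (e.g., shingling), language identification, and removal of meta‐information.

```python
# Minimal sketch: exact-duplicate removal by hashing normalized documents.
import hashlib

def deduplicate(documents):
    seen = set()
    for doc in documents:
        digest = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:      # keep only the first copy of each document
            seen.add(digest)
            yield doc

docs = ["Some subtitle text.", "some subtitle text. ", "A different film."]
print(list(deduplicate(docs)))      # the near-identical second copy is dropped
```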

An Exemplary Application

There are two ways to show the utility of the various measures provided by computational linguistics: either by setting up a new study that addresses a specific theoretical question or by reanalyzing an old study. We take the latter approach and consider the stimuli used in a randomly chosen semantic priming experiment (de Mornay Davies, 1998, Experiment 1). The experiment was based on 20 target words preceded by semantically related and unrelated primes. These are shown in the first three columns of Table 12.2.

The first thing we want to know about these stimuli is their word frequency. As the experiment was run in the United Kingdom, we want frequencies for British English. A good source for these is the SUBTLEX‐UK database (Van Heuven et al., 2014), which can be found at http://crr.ugent.be/archives/1423. The fourth column of Table 12.2 shows the outcome for the target words. The mean Zipf value is 4.54 (SD = .67), which is rather high (similar to a frequency of 28 per million words). It is further noteworthy that the targets consist of a combination of nouns, verbs, and adverbs, with two words that are primarily used as proper nouns (cup, lance). These are stimuli we may want to avoid in a good experiment. A similar analysis of the related primes shows that their average Zipf frequency is 4.84 (SD = .50), that they include one word mostly used as a proper noun (cable) and four words mostly used as adjectives (clean, dark, key, slow) in addition to nouns.

Table 12.2 Stimuli used in a semantic priming experiment by de Mornay Davies (1998). The first three columns show the stimuli (target, related prime, unrelated prime). The fourth column gives the SUBTLEX‐UK frequency of the target word (expressed in Zipf‐values) and the fifth column gives the dominant part‐of‐speech of the word.

TARGET   RELATED   UNRELATED   Zipf target   DomPoS target
bird     wing      shirt       4.85          noun
bottle   glass     claim       4.65          noun
boy      girl      land        5.28          noun
chase    run       town        4.31          verb
cup      plate     pitch       5.09          name
drop     fall      club        4.90          verb
fast     slow      goal        5.09          adverb
gammon   bacon     spade       2.85          noun
glove    hand      think       3.81          noun
house    home      small       5.83          noun
lance    sword     canoe       3.74          name
light    dark      view        5.28          noun
lock     key       add         4.42          noun
mail     letter    effort      4.63          noun
moon     sun       shot        4.74          noun
string   rope      clue        4.25          noun
tail     feather   parent      4.45          noun
wash     clean     sweet       4.54          verb
wig      hair      food        3.82          noun
wire     cable     tiger       4.29          noun

Source: de Mornay Davies 1998. Reproduced with permission of Taylor & Francis.

The mean Zipf frequency of the unrelated primes is 4.85 (SD = .67), well matched to the related primes. They include two verbs (claim, think) and two adjectives (small, sweet), in addition to 16 nouns.

It is furthermore interesting to see how much the related and the unrelated primes differ in semantic distance. We use the semantic vectors of Mandera et al. (2017). The semantic distance between the targets and the related primes is .50 (SD = .12), on a scale from 0 (fully related) to 1 (fully unrelated). The distance between the targets and the unrelated primes is .84 (SD = .09), which is substantially higher.

In addition to the above measures, we could also check whether the stimuli are well matched on other variables known to influence visual word recognition, such as word length, age‐of‐acquisition, and orthographic/phonological similarity to other words. For English, information about the similarity to other words can be looked up in Balota et al. (2007; http://elexicon.wustl.edu/) or calculated with the vwr package (Keuleers, 2015). Information about age‐of‐acquisition can be found in Kuperman, Stadthagen‐Gonzalez, and Brysbaert (2012; http://crr.ugent.be/archives/806). Applied to the data of Table 12.2, the orthographic similarity to other words, as measured with OLD20 in Balota et al. (2007), is 1.40 (SD = .26; the word “gammon” is not in the database) for the target words, 1.49 (SD = .29) for the related primes, and 1.71 (SD = .29) for the unrelated primes. The deviation of the last value indicates that better primes could have been chosen in the unrelated condition. The age‐of‐acquisition values are 4.96 (SD = 3.09) for the targets, 4.27 (SD = 1.66) for the related primes, and 5.58 (SD = 1.64) for the unrelated primes, again suggesting that a better matching of the prime stimuli is possible.

In summary, the stimuli used by de Mornay Davies (1998, Experiment 1) were not bad, but they can be further improved, so that they all consist of nouns and are fully matched on variables such as orthographic similarity (OLD20) and age‐of‐acquisition. Having access to databases such as those just mentioned allows us to run better controlled experiments in psycholinguistics. Such information can also be used in regression analyses based on processing times for thousands of words, to find out the relative impact of the various variables (Keuleers & Balota, 2015; Brysbaert, Stevens, Mandera, & Keuleers, 2016b).
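For readers who want to compute such measures themselves, the sketch below implements OLD20 (the mean Levenshtein distance from a word to its 20 closest neighbors in a lexicon) in plain Python. The five‐word lexicon is a placeholder assumption; a real list of word types is needed, and the vwr package mentioned above provides an optimized implementation in R.

```python
# Minimal sketch of OLD20: mean Levenshtein distance to the 20 closest
# neighbors of a word in a lexicon.
import heapq

def levenshtein(a, b):
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def old20(word, lexicon):
    distances = (levenshtein(word, w) for w in lexicon if w != word)
    closest = heapq.nsmallest(20, distances)
    return sum(closest) / len(closest)

print(old20("bird", ["bind", "bard", "birds", "word", "burden"]))  # toy lexicon
```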

Limitations and Opportunities for Validation

Corpus linguistics provides psycholinguists with valuable tools to investigate language processing. Research on word processing would be impossible without access to word frequency information, morphological information, and word similarity indices, all based on corpus analyses. A new extension that is currently being tried out is to see how well specific word features can be calculated on the basis of semantic vectors. For instance, it seems reasonable to derive the emotional value of a word from the emotional values of its semantically close words. If one knows that the word “beautiful” has a positive affect, one can be pretty sure that the same will be true for all its synonyms, such as “lovely,” “attractive,” “good‐looking,” “gorgeous,” “stunning,” “striking,” and “handsome.” So, by using a limited number of seed words and semantic similarity vectors, it may be possible to estimate the emotional value of all words in a language, and indeed of whole texts. Studies indicate that this approach is likely to work, although more work is needed to validate and optimize it (e.g., compare Mandera, Keuleers, & Brysbaert, 2015, to Hollis, Westbury, & Lefsrud, in press). If the approach indeed turns out to work, it will be possible to obtain values for all existing words on the basis of a small‐scale rating study. This will be particularly valuable for languages that do not yet have large‐scale databases with human ratings.

Indeed, a first important limitation of the current contribution of corpus linguistics is that the measures we discussed are only available for a minority of the around 7,000 extant languages, which does injustice to language diversity and biases research. Another limitation is that the available information is restricted to language registers that can easily be analyzed (in particular, written texts). There is an increasing realization that language is inherently multimodal, whereas the corpora are not (yet). This creates a validity problem in relation to the real input for the language user. A solution here might be the creation of multimodal corpora such as the Language Archive at the Nijmegen MPI (https://tla.mpi.nl/).

Even for languages that have been included in computational linguistics, another big limitation is that not all measures made available are good or even useful. As it happens, a lot of useless information is to be found on the internet. Using computer algorithms to calculate and compare word features guarantees that one will have a list of numbers as outcome, but does not guarantee that the numbers will be valid. Many things can go wrong. For a start, analyzing big datasets is quite error‐prone and requires regular calculation checks. Second, not all algorithms have the same quality (as shown by the research on semantic vectors). Third, much depends on the quality of the corpus one is working with (in this respect it may be good to keep in mind the saying “garbage in, garbage out”). Finally, there may be theoretical reasons why the currently used algorithms are suboptimal. For instance, one of the limits of semantic vectors as presently calculated is that antonyms tend to be semantically close on the basis of word co‐occurrences. This implies that black is assumed to be a “synonym” of white, and ugly a “synonym” of beautiful.

The best way to avoid bad measures derived from corpus analysis is to validate them against human data. Ideally, this validation is based on numbers of observations that match those derived from the corpus.
In principle, one could check the usefulness of a new word frequency measure by correlating it with the processing times for some 100 words and seeing whether it correlates more strongly with the processing times than the prevailing measure does, but this is a rather risky strategy, as 100 observations is a small number when the frequency list includes some 100 thousand words. It is much better if one has a database of word processing times for some 20 thousand words. Indeed, research on the quality of word frequency measures and ways to improve them only took off after Balota and colleagues (2007) published a megastudy of lexical decision times (is this letter string a word or not?) and naming latencies for 40 thousand English words. Similarly, it is risky to compare the quality of two semantic similarity measures on the basis of an experiment in which only 20 target words were preceded by related and unrelated primes (as we have done above). The ground is much firmer when one can make use of a megastudy, such as the one by Hutchison et al. (2013), which contains data for 1,661 words preceded by four types of primes.

Megastudies consisting of word processing times in popular psycholinguistic tasks (lexical decision, naming, semantic classification, eye movement data) are one source of data for validation studies. Another interesting source of data consists of human ratings. The best way to test how valid affective estimates based on algorithms are is to compare them to human ratings. Here, again, the size of the database is crucial, so ratings should be collected for thousands of words. Unfortunately, thus far such sizable databases of human ratings are only available for a few languages (English, Dutch, Spanish). A further use of large databases of human ratings is that they can serve as input for other algorithms, such as those estimating the affective tones of texts (e.g., Hills, Proto, & Sgroi, 2015).

A third interesting validation source is Wordnet (https://wordnet.princeton.edu/). This is a handmade dictionary, available for several languages, in which sets of synonyms (synsets) have been grouped, each expressing a distinct concept, and related to other synsets by a small number of conceptual relations. In the English database, information is available on 117,000 synsets. The database also contains information about the different meanings and senses of words. For instance, it informs us that “second” can be used as a noun (with 10 different senses), as a verb (2 senses), an adjective (2 senses), and an adverb (1 sense).

A final human information database that is a useful validation criterion consists of word association data. In word‐association studies, participants write down one or more words that come to mind upon seeing or hearing a target word. The standard database up to recently was the Florida Free Association Norms collected in the 1970s and 1980s (http://w3.usf.edu/FreeAssociation/), which contains three‐quarters of a million responses to 5,019 stimulus words. An ongoing crowdsourcing study is likely to replace the Florida norms, as it already contains over 4 million responses to 12,000 target words (De Deyne, Navarro, & Storms, 2012; see http://www.smallworldofwords.org/).
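In code, this validation logic is a one‐liner once megastudy data are available. The sketch below correlates a Zipf‐transformed frequency measure with lexical decision times; the file and column names are hypothetical stand‐ins for whatever megastudy export (e.g., from the English Lexicon Project) one is working with.

```python
# Minimal sketch: validating a frequency measure against megastudy RTs.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("megastudy_rt_and_frequency.csv")   # hypothetical file
r, p = pearsonr(df["zipf_frequency"], df["mean_lexical_decision_rt"])
print(f"r = {r:.2f} over {len(df)} words")   # more negative = better predictor
```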
There is some irony in the fact that the need for psycholinguistic data is greatest precisely now that corpus linguistics and NLP research produce increasingly better measures of word features (and may soon replace the need for large‐scale human ratings). This illustrates the ongoing interaction between offline corpus analysis and online human performance research, which benefits both sides.

Key Terms

Corpus (corpora) Collection of language produced by humans (speech, written materials, gestures) used to calculate word characteristics, such as word frequency, similarity to other words, and dominant part‐of‐speech; two important characteristics are the size of the corpus and its representativeness for naturally occurring language.
Corpus linguistics The study of language through the empirical analysis of large databases of naturally occurring language.
Language register Variety of language used in a particular setting (e.g., scientific books versus blogs); important for psycholinguistics because it has been shown that word characteristics are better at predicting results from experiments if they are based on language participants are likely to have experienced in their life.

Megastudy Large‐scale word processing study in which responses to thousands of words are collected or in which responses from a very large sample of participants are collected; used to examine the variables affecting word processing efficiency and to validate word characteristics calculated in computational linguistics.
Natural language processing (NLP) Discipline that is focused on language processing in computers to increase the interactions with humans, largely based on the analysis of corpora.
Parsing Syntactic analysis of sentences.
Semantic vector String of 200–300 numbers describing the meaning of words based on word co‐occurrences.
Tagging Determining the part‐of‐speech words have in sentences.
Word frequency norms Estimates of how often words are encountered, based on counting their occurrences in representative corpora.
Wordnet A large lexical database in several languages, in which words have been grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

References

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39, 445–459.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context‐counting vs. context‐predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1). Retrieved from http://clic.cimec.unitn.it/marco/publications/acl2014/baroni‐etal‐countpredict‐acl2014.pdf
Brysbaert, M., & Ellis, A. W. (2016). Aphasia and age‐of‐acquisition: Are early‐learned words more resilient? Aphasiology, 30, 1240–1263.
Brysbaert, M., Keuleers, E., & New, B. (2011). Assessing the usefulness of Google Books’ word frequencies for psycholinguistic research on word processing. Frontiers in Psychology, 2, 27.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990.
Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part‐of‐speech information to the SUBTLEX‐US word frequencies. Behavior Research Methods, 44, 991–997.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016a). The impact of word prevalence on lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental Psychology: Human Perception and Performance, 42, 441–458.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016b). How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Frontiers in Psychology, 7, 1116. doi: 10.3389/fpsyg.2016.01116
Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word co‐occurrence statistics: Stop‐lists, stemming, and SVD. Behavior Research Methods, 44, 890–907.
De Deyne, S., Navarro, D., & Storms, G. (2012). Better explanations of lexical and semantic cognition using networks derived from continued rather than single word associations. Behavior Research Methods, 45, 480–498.
de Mornay Davies, P. (1998). Automatic semantic priming: The contribution of lexical‐ and semantic‐level processes. European Journal of Cognitive Psychology, 10, 389–412.
Ernestus, M., Baayen, H., & Schreuder, R. (2002). The recognition of reduced word forms. Brain and Language, 81, 162–173.
Fromkin, V. A. (Ed.) (1973). Speech errors as linguistic evidence. The Hague: Mouton.
Gimenes, M., & New, B. (2016). Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48, 963–972.
Gries, S. T. (2010). Corpus linguistics and theoretical linguistics: A love‐hate relationship? Not necessarily… International Journal of Corpus Linguistics, 15, 327–343.
Harris, Z. (1954). Distributional structure. Word, 10, 146–162.
Hayes, D. P. (1988). Speaking and writing: Distinct patterns of word choice. Journal of Memory and Language, 27, 572–585.
Herdağdelen, A., & Marelli, M. (in press). Social media and language processing: How Facebook and Twitter provide the best frequency estimates for studying word recognition. Cognitive Science.
Hills, T. T., Proto, E., & Sgroi, D. (2015). Historical analysis of national subjective wellbeing using millions of digitized books. IZA Discussion Paper No. 9195. Retrieved from http://ftp.iza.org/dp9195.pdf
Hollis, G., Westbury, C., & Lefsrud, L. (in press). Extrapolating human judgments from skip‐gram vector representations of word meaning. The Quarterly Journal of Experimental Psychology.
Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese, M. J., Cohen‐Shikora, E. R., Tse, C.‐S., … Buchanan, E. (2013). The semantic priming project. Behavior Research Methods, 45, 1099–1114.
Keuleers, E. (2015). Package ‘vwr’. Retrieved from https://cran.r‐project.org/web/packages/vwr/vwr.pdf
Keuleers, E., & Balota, D. A. (2015). Megastudies, crowdsourcing, and large datasets in psycholinguistics: An overview of recent developments. The Quarterly Journal of Experimental Psychology, 68, 1457–1468.
Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice effects in large‐scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono‐ and disyllabic words and nonwords. Frontiers in Psychology, 1, 174. doi: 10.3389/fpsyg.2010.00174
Kucera, H., & Francis, W. N. (1967). Computational analysis of present‐day American English. Providence, RI: Brown University Press.
Kuperman, V., Estes, Z., Brysbaert, M., & Warriner, A. B. (2014). Emotion and language: Valence and arousal affect word recognition. Journal of Experimental Psychology: General, 143, 1065–1081.
Kuperman, V., Stadthagen‐Gonzalez, H., & Brysbaert, M. (2012). Age‐of‐acquisition ratings for 30 thousand English words. Behavior Research Methods, 44, 978–990.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems (pp. 2177–2185). Retrieved from http://papers.nips.cc/paper/5477‐neural‐word‐embedding‐as‐implicit‐matrix‐factorization
Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3. Retrieved from http://u.cs.biu.ac.il/~nlp/wp‐content/uploads/Improving‐Distributional‐Similarity‐TACL‐2015.pdf
Lund, K., & Burgess, C. (1996). Producing high‐dimensional semantic spaces from lexical co‐occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203–208.
Mandera, P., Keuleers, E., & Brysbaert, M. (2015). How useful are corpus‐based methods for extrapolating psycholinguistic variables? The Quarterly Journal of Experimental Psychology, 68, 1623–1642.

Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92, 57–78.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781 [cs]. Retrieved from http://arxiv.org/abs/1301.3781
Monsell, S., Doyle, M. C., & Haggard, P. N. (1989). Effects of frequency on visual word recognition tasks: Where are they? Journal of Experimental Psychology: General, 118, 43–71.
Reali, F., & Christiansen, M. H. (2007). Processing of relative clauses is made easier by frequency of occurrence. Journal of Memory and Language, 57, 1–23.
Schütz, A. (1962). Common‐sense and scientific interpretation of human action. In Collected Papers I (pp. 3–47). Springer Netherlands.
Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX‐UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67, 1176–1190.

Further Reading and Resources

The best textbook on corpus linguistics is Jurafsky, D., & Martin, J. H. (2008), Speech and language processing (2nd ed.), Pearson Prentice Hall. The third edition is foreseen for 2017 (preliminary versions of the chapters can be found at http://web.stanford.edu/~jurafsky/slp3/).
The Language Goldmine website (http://languagegoldmine.com/) includes over 230 links to interesting resources for language research in various languages. It includes most of the links presented here.
The Center for Reading Research website (http://crr.ugent.be/programs‐data) includes links to all the variables collected at Ghent University (e.g., word frequency, age of acquisition, concreteness, word prevalence, word valence, arousal), which can be downloaded in various formats. Mostly limited to English and Dutch, however.
The Open Parallel Corpus OPUS (http://opus.lingfil.uu.se/) is a growing collection of translated texts from the web, which provides the community with cleaned, annotated, and aligned corpora in several languages.
Behavior Research Methods (http://www.springer.com/psychology/cognitive+psychology/journal/13428) is the journal in which most word features are published for various languages.
Some of the software packages for corpus research are:

• Natural Language Toolkit (http://www.nltk.org/) – a Python module that provides interfaces to over 50 text corpora and a set of libraries for text processing
• Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) – a set of natural language analysis tools (see also other software released by The Stanford Natural Language Processing Group, http://nlp.stanford.edu/software/index.shtml)
• Gensim (https://radimrehurek.com/gensim/) – a Python module implementing various models used in distributional semantics, including the skip‐gram and CBOW models (see also the original word2vec tool released by Google, https://code.google.com/archive/p/word2vec/)

If you want to make use of derived materials, you can use the R package vwr (Keuleers, 2015), download Excel sheets (see above), or make use of websites that allow you to obtain values online. Some of these are:

American English
• http://www.ugent.be/pp/experimentele‐psychologie/en/research/documents/subtlexus/overview.htm (the SUBTLEX‐US database)
• http://elexicon.wustl.edu/ (David Balota’s English Lexicon Project)
• http://www.wordfrequency.info/ (Mark Davies’s site with word frequencies from various sources)
• http://crr.ugent.be/snaut/ (semantic vectors for English)

British English
• http://crr.ugent.be/archives/1423 (SUBTLEX‐UK)
• http://websites.psychology.uwa.edu.au/school/MRCDatabase/uwa_mrc.htm (slightly dated site with all types of word information)
• http://celex.mpi.nl/ (database with a lot of morphological information)
• http://www.pc.rhul.ac.uk/staff/c.davis/Utilities/ (N‐Watch, a program by Colin Davis to obtain various features of English)
• http://crr.ugent.be/programs‐data/lexicon‐projects (British Lexicon Project, with lexical decisions to 28,000 words)
• http://crr.ugent.be/snaut/ (semantic vectors for English)

Dutch
• http://crr.ugent.be/isubtlex/ (the SUBTLEX‐NL database)
• http://celex.mpi.nl/ (database with a lot of morphological information)
• http://crr.ugent.be/snaut/ (semantic vectors for Dutch)
• http://crr.ugent.be/programs‐data/lexicon‐projects (Dutch Lexicon Project 1 and 2, with lexical decisions to 30,000 words)

French
• http://www.lexique.org/ (Boris New’s site with next to all information about French words)
• https://sites.google.com/site/frenchlexicon/ (the French Lexicon Project with lexical decision times to over 30,000 words)

German
• http://www.dlexdb.de/query/kern/typposlem/ (site with word frequencies in German)
• http://celex.mpi.nl/ (database with a lot of morphological information)

Chinese
• http://crr.ugent.be/programs‐data/subtitle‐frequencies/subtlex‐ch (SUBTLEX‐CH word frequencies and PoS information for Chinese words)

Spanish
• http://www.bcbl.eu/databases/espal/ (various word characteristics)
• http://crr.ugent.be/archives/679 (the SUBTLEX‐ESP word frequencies)
• http://www.pc.rhul.ac.uk/staff/c.davis/Utilities/ (the N‐Watch program for Spanish)

13 Electrophysiological Methods

Joost Rommers and Kara D. Federmeier

Abstract

Recordings of electrical brain activity allow researchers to track multiple cognitive subprocesses with high temporal resolution. This chapter discusses how the electroencephalogram (EEG) is generated and recorded, and how it is analyzed, including filtering, artifact rejection, and statistical testing. It shows how electrophysiological methods have been used to study language, including discussion of aspects of experimental design, stimuli, and tasks, illustrated with a concrete example study. The chapter ends with some advantages and disadvantages of electrophysiological methods and current developments in their use. It is concluded that the noninvasive measurement of electrical brain activity generates some of the most direct evidence regarding the processes underlying language comprehension, production, and acquisition in the brain. The methods are likely to continue to provide important new insights that challenge our views of cognition and brain functioning.

Language processing is multifaceted and unfolds rapidly, necessitating methods that can reveal the operation of multiple cognitive subprocesses with high temporal resolution. One such method, which has played a critical role in developing our understanding of language over the last several decades, is the recording of electrical brain activity through the electroencephalogram (EEG). This chapter discusses how electrophysiological methods have been used to study language, their advantages and disadvantages, and current developments in their use.


Assumptions and Rationale

The human EEG was discovered in the 1920s, when Hans Berger recorded and amplified electrical activity from the scalp surface of a patient with a head injury (Millett, 2001). Among other phenomena, Berger observed the alpha wave, an oscillation of around 10 cycles per second that was most prominent when the eyes were closed. Berger’s findings were initially met with skepticism but were eventually replicated. Now, countless studies have used EEG to investigate virtually all aspects of cognition, including perception, action, attention, memory, and language.

The EEG signal is a direct, continuous measure of brain activity. One of its primary strengths lies in its temporal resolution, which is on the order of milliseconds. This temporal resolution makes it one of the methods of choice for answering “when” questions in psycholinguistics. How long after a word is encountered is it integrated with its context? Is syntactic information retrieved before phonological information during word production? Do language‐specific phoneme categories impact early or late speech perception processes?

Most EEG‐based studies on language processing have relied on the derivation of event‐related potentials (ERPs) from the ongoing data. ERPs are created by extracting from the continuous EEG data the brain responses time‐locked to an event of interest, such as the onset of a stimulus or response. Typically, epochs from multiple instances of the same or similar events are aligned and averaged together point‐by‐point, so that random fluctuations in the EEG signal will tend to cancel one another, revealing the stable event‐related voltage fluctuations. As shown in Figure 13.1, plotting the averaged voltage changes over time (the ERP) reveals a pattern of positive and negative deflections that can be linked to specific neural systems and functional processes; waveform features with well‐established links are often referred to as “ERP components.” The timing and amplitude of ERP components have been shown to be sensitive indices of changes in specific cognitive processes related to stimulus perception and evaluation, attentional allocation, memory encoding and retrieval, response selection, motor preparation, and error‐ and reward‐related processing, among others (see Luck & Kappenman, 2011).
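The averaging logic just described can be expressed in a few lines of array code. The sketch below is an illustration with random stand‐in data, not a complete pipeline: real analyses first filter the signal and reject artifact‐contaminated epochs, steps that dedicated packages such as MNE‐Python implement.

```python
# Minimal sketch of ERP derivation: cut epochs around event onsets from the
# continuous recording and average them point by point across trials.
import numpy as np

srate = 500                                  # samples per second (assumed)
eeg = np.random.randn(32, 600 * srate)       # channels x samples (stand-in data)
event_samples = [1000, 5000, 9000]           # stimulus onsets, in samples

pre, post = int(0.1 * srate), int(0.9 * srate)   # -100 to +900 ms window
epochs = np.stack([eeg[:, s - pre : s + post] for s in event_samples])
erp = epochs.mean(axis=0)                    # channels x time: the ERP
```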

[Figure 13.1 appears here: potential (μV; negative plotted up) against time after stimulus (0–500 ms), with labeled peaks P1, N1, P2, N2, and P3.]
Figure 13.1 Idealized example of an event‐related potential waveform in response to a visual stimulus, with labeled positive and negative peaks. A single channel is shown; negative is plotted up. Source: https://en.wikipedia.org/wiki/File:Constudevent.gif. Used under CC‐BY‐SA 3.0 http://creativecommons.org/licenses/by‐sa/3.0/.

Most ERP components are labeled according to their polarity and the (typical) latency or ordinal position of the peak. For example, the P200 (or P2) is a positive peak that occurs around 200 ms after onset of a visual stimulus. Many components also have a characteristic scalp distribution that helps identify them. The ERP component that has been used most in language research is probably the N400, a centro‐parietally distributed negative‐going waveform feature that peaks around 400 ms after the onset of potentially meaningful stimuli. The N400 was initially discovered as a response to unexpected words in sentences, being larger in amplitude to “dog” than “sugar” at the end of a sentence such as “I drink my coffee with cream and…” (Kutas & Hillyard, 1980). However, the N400 is not an anomaly detector. Subsequent studies have established that the N400 is part of the normal response to content words in all modalities, as well as to pictures and other meaningful stimuli, with its amplitude being attenuated as a function of the contextual support for the stimulus (for review, see Kutas & Federmeier, 2000, 2011). For instance, the N400 amplitude to a word in a sentence is inversely related to the word’s cloze probability, operationalized as the proportion of participants who would provide that word as a continuation when given the sentence fragment in an off‐line task. The N400 also decreases with repetition, word position in a congruent sentence, semantic relatedness to a preceding word in a list, and even semantic relatedness to expected but not actually presented words. Across all of these manipulations, the latency of the N400 is remarkably stable, unlike some other components (such as the P300) whose timing depends on various aspects of the experimental manipulation.

A second component that has often been used in language research is the P600, a longer‐lasting positivity with a less consistent timing that does not always exhibit a clear peak. It was initially reported as a response to grammatical violations such as “throw” in “The spoilt child throw the toys on the floor” (Hagoort, Brown, & Groothusen, 1993; Osterhout & Holcomb, 1992), opening up the possibility of tracking grammatical processing with ERPs and suggesting the possibility of a neural dissociation between semantics (associated with N400 effects, as just discussed) and syntax. However, later studies reported similar effects to spelling errors (Münte et al., 1998) and semantic reversal anomalies such as “For breakfast the eggs would only eat” (Kuperberg, Sitnikova, Caplan, & Holcomb, 2003).
These findings shifted the view of the P600 toward revision or repair processes, although several different interpretations currently exist (e.g., Brouwer, Fitz, & Hoeks, 2012; Coulson, King, & Kutas, 1998; Kolk & Chwilla, 2007; Kuperberg, 2007), including those that link the P600 to domain‐general responses like the P300. Another component that has been linked to syntactic processing is the Left Anterior Negativity (LAN), occurring at around 300‐500 ms with a left frontal distribution (Osterhout & Holcomb, 1992). It has been reported in response to agreement errors and has been linked to morphosyntactic processing (Friederici, 1995) but also to working memory (Kluender & Kutas, 1993). Recently, however, it has been suggested that at least some apparent LAN effects could arise from component overlap between an N400 and the onset of a right‐lateralized P600 (Tanner, 2015).

Although some components tend to be reliably associated with peaks in the ERP, an individual waveform typically does not allow the researcher to draw conclusions about cognitive processes. Instead, as with most methods, the focus is on differences between conditions, or "ERP effects." An ERP effect is a modulation of an ERP component, or simply the difference between two conditions, which, in well‐designed studies, isolates particular subprocesses of interest.

The focus of the sentence processing literature on the N400, P600, and LAN certainly does not mean that these are the only components important for studies of language. In fact, it should be stressed that language manipulations routinely elicit ERP effects that are not specific to language, because so many cognitive functions come together when reading, listening, or speaking. Moreover, some of the most elegant studies using ERPs to answer questions about language processing are those that have made use of components originally characterized in very different contexts. For example, the Lateralized Readiness Potential (LRP), a component associated with response selection, has been used to study timing questions in language production (van Turennout, Hagoort, & Brown, 1997). Furthermore, the Mismatch Negativity (MMN), a component associated with auditory sensory memory, has been used to study phonological processing (Dehaene‐Lambertz, 1997; Näätänen et al., 1997). This emphasizes the utility of being aware of the full toolbox of electrophysiological responses that could potentially be harnessed to do psycholinguistics (for a thorough overview, see Luck & Kappenman, 2011).

Apparatus

The voltage changes in the EEG are a direct, instantaneous measure of neural activity. The signal is thought to arise primarily from post‐synaptic potentials produced by large populations of cortical pyramidal neurons that fire in synchrony. Pyramidal cells are the likely main contributor to the EEG signal because they occur in layers close to the scalp and are oriented in a common direction, which allows the activity from multiple neurons to summate rather than cancel out. The relatively slower post‐synaptic potentials are a more probable source of the signal than action potentials, because action potentials are of short duration and therefore less likely to occur in synchrony (see Nunez & Srinivasan, 2006).

Non‐invasive recordings of these potentials are possible through the use of silver‐silver chloride or tin electrodes affixed to or held near the face and scalp (via, for example, a close‐fitting elastic cap). Some electrode types, called active electrodes, have amplifiers built into the electrodes, making them more resistant to noise under certain conditions (but for a direct comparison between passive and active electrodes in different recording environments, see Laszlo, Ruiz‐Blondet, Khalifian, Chu, & Jin, 2014). Conductive gel is used to establish the connection between the electrode and the skin. Especially with passive electrodes, light abrasion of the skin is typically used to establish a low‐impedance connection between the scalp and each electrode and to diminish skin potentials that can add noise to the recordings. The effectiveness of the electrode‐to‐scalp connection can be measured with an impedance meter; for passive electrodes, the impedance generally needs to be kept lower than for active electrodes (for the effects of impedance on data quality, see Kappenman & Luck, 2010).

The choice of how many electrodes to record from depends on the research question. Twenty to 32 electrodes are often sufficient for language‐processing studies that target broadly distributed components like the N400 and P600. Higher‐density configurations, up to 256 channels, are also available, and the correspondingly improved resolution of the scalp topography can be advantageous for observing more focal effects and/or for modeling the underlying sources. There are trade‐offs, however: recording from more channels increases the set‐up time, the probability of bridging between electrodes (resulting in a loss of separable signal), and the likelihood that some channel shows an artifact at any given moment, leading to more data loss when entire trials are rejected from the analysis.

Another important choice is that of the reference electrode. Voltages express a difference in electrical potential between two points, meaning that at least two electrodes are necessary to measure a potential. EEG systems typically use a double subtraction ("differential amplification") involving a ground electrode and a reference electrode to reduce noise that the recordings from all electrodes have in common. Ideally, the reference electrode would be placed on an electrically neutral location on the body, but, in practice, no location is fully neutral. Most language studies make recordings relative to a reference electrode on the left mastoid, where there is a thick bone structure between the electrode and the brain, and re‐reference to the average of the left and right mastoid electrodes during analysis.
However, some studies place the reference electrode on the nose or the earlobes. Still other studies convert the data so as to use the average of all electrode sites as the reference ("average reference"). Because the choice of reference critically affects the amplitude and scalp distribution of the measured electrical signals, it is necessary to pay careful attention to the reference location when comparing datasets, and it is best to follow the convention of other experiments within a subfield when designing a new study.

Electrodes are also placed on the face to help distinguish between artifacts and brain activity. Several types of artifacts stem from the eyes, because they act as an electrical dipole, causing large voltage changes during saccadic eye movements and blinks. Typically, electrodes are placed on the outer canthus of each eye to measure horizontal eye movements, which appear as square‐wave patterns of opposite polarity on each side of the eye. Another electrode is often placed on the infraorbital ridge below at least one of the eyes to measure blinks, which appear as large peaks of opposite polarity above and below the eye. For studies in which participants speak, it can be useful to place electrodes on the orbicularis oris muscle near the mouth to monitor muscle activity, which is visible as bursts of higher‐frequency activity.

In the typical EEG set‐up, one computer presents the stimuli to the participant and another samples (digitizes) and stores the EEG data. The stimulus computer also sends brief event codes (also called triggers or markers) to the digitization computer when stimuli are presented or responses are made, to enable the later extraction of event‐related data from the continuous EEG. The recorded signals are small and need to be amplified considerably. They also need to be filtered, using an analog filter prior to digital sampling, in order to avoid aliasing: the phenomenon wherein activity at frequencies higher than half the sampling frequency (the Nyquist frequency) becomes misrepresented as lower‐frequency activity because it is sampled at a rate too low to reconstruct the original information. In practice, the EEG is low‐pass filtered at a frequency well below the Nyquist frequency.

EEG data always contain noise. There are external sources of noise, such as line noise from electrical devices near the participant. Most interference from electrical noise can be prevented by shielding the noise sources themselves (e.g., the monitor, cables) and/or by shielding the participant or recording devices (e.g., by seating the participant in a Faraday chamber). Nonetheless, many EEG recordings do contain some 60 Hz or 50 Hz line noise, depending on the country where the recordings are made. There are also physiological sources of noise, such as skin potentials, blinks, eye movements, and muscle activity. These are minimized by asking the participant to sit still and relax while fixating on the center of the screen. Many experimenters also ask participants to restrict blinking to certain points in the experiment, such as after every trial. Recordings are monitored by the experimenters in real time, so that they can detect excessive artifacts and other possible problems with the data, which is preferable to having to deal with them at the analysis stage.
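The aliasing phenomenon described above can be demonstrated in a few lines. In this minimal NumPy sketch, the 60 Hz "line noise" and the 100 Hz sampling rate are chosen purely for illustration:

```python
import numpy as np

# "Analog" 60 Hz line noise, represented at high temporal resolution.
fs_true = 10_000
t = np.arange(0, 1, 1 / fs_true)
line_noise = np.sin(2 * np.pi * 60 * t)

# Digitize at 100 Hz without an analog anti-alias filter. The Nyquist
# frequency is 100 / 2 = 50 Hz, so the 60 Hz activity folds back and
# masquerades as 100 - 60 = 40 Hz activity.
fs = 100
sampled = line_noise[:: fs_true // fs]

spectrum = np.abs(np.fft.rfft(sampled))
freqs = np.fft.rfftfreq(sampled.size, d=1 / fs)
print(freqs[spectrum.argmax()])             # prints 40.0, not 60.0
```

This is also why the anti‐alias filter must be an analog filter applied before digitization: once the 40 Hz alias is in the sampled data, it is indistinguishable from genuine 40 Hz activity.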

Nature of Stimuli and Data

Many types of stimuli have been presented to participants while their EEG was recorded: written and spoken words, sentences, pictures, scenes, environmental sounds, and even video clips (e.g., Sitnikova, Kuperberg, & Holcomb, 2003). This range of stimuli permits language researchers to address all kinds of questions in comprehension, production, and acquisition. The EEG technique does put some constraints on the stimuli, however. In order to avoid eye movement artifacts, most studies present a single visual stimulus at a time, scaled to occupy a restricted part of the visual field. For instance, written sentences are usually presented word‐by‐word (although some groups have developed methods to record "fixation‐related potentials" during natural reading; e.g., Baccino & Manunta, 2005), and the presentation of auditory stimuli is usually combined with a constant visual stimulus, such as a fixation cross, to help keep participants' eyes on the center of the screen. Furthermore, each condition needs a relatively large number of stimuli: 30‐60 for studies that target large components like the N400 and P600, and more for studies that target smaller components (for discussion, see Luck, 2005). If it is difficult to design enough stimuli, or if the focus is on single‐item ERPs, it is possible to compensate by testing more participants (Laszlo & Federmeier, 2011). The stimuli also need to be well controlled at many levels of analysis, because ERPs reveal the entire processing stream from perceiving, retrieving, evaluating, and (sometimes) responding to aspects of the stimulus. Full counterbalancing is optimal, but if this is not possible, stimuli can be matched on the relevant dimensions.

The task affects how the stimuli are processed. In some comprehension studies, participants make lexical decision responses, detect words, or answer comprehension questions. However, a strength of EEG as a continuous measure is that a task is not necessary in order to generate data. Thus, rather than requiring responses based on metalinguistic criteria, participants may simply be asked to read or listen for comprehension. This makes EEG experiments likely to capture the processes that listeners and readers also use outside the lab. Furthermore, it means that EEG can be used in populations for which behavioral testing is difficult, such as infants and certain patient groups. A particularly good example is provided by Mismatch Negativity (MMN) studies asking when during development infants' speech perception system becomes more attuned to their native language than to other languages (Cheour et al., 1998). In production studies, classical picture‐naming tasks lend themselves to EEG investigations too (for review, see Ganushchak, Christoffels, & Schiller, 2011). However, the muscle artifacts generated by speaking are large and span a wide range of frequencies (Goncharova et al., 2003). This makes careful interpretation important, especially for later components close to articulation.

In all designs, a core aspect of the acquired EEG data is that they are multidimensional. They can be conceived of as a time sample × channel × trial matrix, with positive‐ and negative‐going voltages. It is important to note that whether a signal is positive‐going or negative‐going in absolute terms does not allow for clear inferences about the underlying neurophysiological processes.
The signal's polarity depends on the scalp electrode location: the same underlying brain activity, which can be summarized as a current dipole, will be measured as a positivity from one side and as a negativity from the opposite side. Moreover, ERPs are relative measures, recorded relative to a ground and reference channel and computed relative to a pre‐stimulus baseline. Although the complexity of the data creates challenges for analysis, it is a key part of the utility of the technique, as it allows inferences not only about whether an experimental manipulation has an impact, but also, more specifically, about when and how. Such inferences can be especially strong when they involve well‐characterized components linked to specific cognitive and neural functions. We have already seen an example of this way of exploiting the multidimensionality of the data: whereas semantic and syntactic anomalies might both elicit longer response times relative to a congruent condition in a behavioral task, the distinct ERP effects (N400 and P600) these conditions elicit show that qualitatively different processes are recruited.

Collecting and Analyzing Data

A typical analysis pipeline for ERPs involves filtering, segmenting epochs from the continuous data, baseline correction, artifact rejection, averaging, and statistical evaluation.

Filtering, or reducing the presence of certain frequencies in the signal, is a large, complex topic that is beyond the scope of this chapter to address adequately. However, it is crucial that ERP researchers familiarize themselves with at least the basics (see Handy, 2004, and Luck, 2005, for useful discussion). There are high‐pass filters (which let high frequencies pass through while attenuating lower frequencies), low‐pass filters (which let low frequencies pass through while attenuating higher frequencies), and band‐pass filters (which combine low‐pass and high‐pass filters to let a frequency band pass through). Further properties of filters are the filter type (e.g., infinite impulse response or finite impulse response, each with various subclasses, such as Butterworth or Gaussian), the slope of the roll‐off (which describes the steepness of the filter), and the cutoff frequency (defined as the half‐amplitude or half‐power cutoff). For ERPs, filtering is beneficial because it can reduce the amplitude of certain artifacts, facilitating the identification of ERP components and effects. High‐pass filters can be used to reduce the influence of slow drifts and skin potentials in studies that do not target very slow components (which partly occupy the same frequency range). Low‐pass filters can reduce the influence of high‐frequency muscle activity. However, any filtering also leads to a loss of information and can distort the signal, which compromises the temporal resolution. High‐pass filters, especially, can produce edge artifacts at the beginning and end of the piece of signal they are applied to. For this reason, high‐pass filters are best applied to the continuous EEG, prior to segmentation.

To create ERPs, epochs around the onset of stimuli (or responses) of interest are extracted from the continuous EEG. A baseline correction is applied to each trial by subtracting the average voltage in the period preceding stimulus onset from all data points in the epoch, effectively setting the signal to zero at stimulus onset. This makes it easier to see the event‐related modulations in the signal. Baseline periods in sentence comprehension paradigms are usually 100 to 200 ms long; short baselines minimize overlap with preceding events, whereas long baselines increase the reliability of the estimate of baseline activity. A few studies have filtered the signal instead of applying a baseline correction, as high‐pass filtering can have similar effects as baseline correction when the cutoff frequency is fairly high and/or the filter is steep. However, as already mentioned, steep filters distort the signal (for discussion, see Luck, 2005). Under certain circumstances and settings, such filters can even make a P600 effect look like an N400 effect (Tanner, Morgan‐Short, & Luck, 2015).
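For concreteness, the steps just described can be strung together in analysis software. The sketch below assumes the freely available MNE‐Python toolbox; the file name, channel names, event codes, filter settings, and rejection threshold are all hypothetical placeholders rather than recommended values, and the amplitude‐based rejection step anticipates the artifact discussion that follows.

```python
import mne

# Load continuous EEG (hypothetical file) and high-pass filter the
# continuous data -- prior to segmentation -- to avoid edge artifacts.
raw = mne.io.read_raw_fif("sub-01_raw.fif", preload=True)
raw.filter(l_freq=0.1, h_freq=30.0)

# Re-reference to the average of the left and right mastoids (assuming
# this recording contains channels named "M1" and "M2").
raw.set_eeg_reference(ref_channels=["M1", "M2"])

# Segment epochs around the event codes sent by the stimulus computer,
# subtract a 200 ms pre-stimulus baseline, and drop trials whose
# peak-to-peak amplitude exceeds a simple artifact threshold.
events = mne.find_events(raw)                   # assumes a trigger channel
event_id = {"congruous": 1, "incongruous": 2}   # hypothetical codes
epochs = mne.Epochs(raw, events, event_id, tmin=-0.2, tmax=1.0,
                    baseline=(-0.2, 0.0), reject=dict(eeg=100e-6),
                    preload=True)

# Average the remaining trials per condition to obtain the ERPs.
evokeds = {cond: epochs[cond].average() for cond in event_id}
```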
An important part of preprocessing is the removal of artifacts, including blinks, eye movements, muscle activity, drifts, and amplifier blocking (flat‐lining due to clipping, which occurs when the signal reaches the end of the dynamic range of the amplifier). The identification of blinks is facilitated by subtracting the signal from electrodes above and below the eye (a vertical derivation), and the identification of saccades is similarly facilitated by computing a horizontal derivation of the signals from electrodes to the left and right of the eyes. Most studies reject trials that contain artifacts. Rejection decisions can be made using visual inspection, preferably while blind to condition (although bias is unlikely, because the components of interest are usually not visible on individual trials). More common is a semi‐automatic procedure in which one chooses participant‐calibrated thresholds for automatic artifact detection methods (such as the maximal amplitude, the peak‐to‐peak amplitude, or the correlation with a step function). Instead of artifact rejection, which reduces the number of trials, artifact correction methods are also available. These methods measure or model the artifacts and remove them, for instance using independent component analysis (ICA; Makeig, Bell, Jung, & Sejnowski, 1996). The non‐artifactual independent components that ICA detects can also be studied as brain dynamics associated with cognitive processing, although researchers doing so will need a thorough understanding of the technique's limitations, and it will be more difficult to compare the results of such statistically derived components with prior studies.

In the next step, the artifact‐free trials are averaged together point‐by‐point for each condition and each participant (or, in some studies, for each item). Finally, a grand average across participants is created to allow for visualization. The participant averages are submitted to statistical analysis.

Much ERP work relies on relatively straightforward statistical methods, validated by replication. Often, the research question is of the type "Does component X differ in amplitude between conditions?", where the timing and scalp distribution of the component are known. This makes it possible to average across the time points during which the component typically occurs and across the electrodes at which the effect tends to be maximal. The resulting values can be subjected to traditional analyses such as ANOVAs. To characterize ERP effects in terms of their scalp distribution, the locations of the electrodes or groups of electrodes on the scalp can be included as factors. Although the spatial resolution of ERPs is relatively poor compared with other neuroimaging techniques, a reliable difference between scalp distributions indicates that an experimental manipulation affected brain functioning, either by recruiting partially non‐overlapping neuronal generators or by changing the amplitude of a shared generator. If the question is instead of the type "Does the timing of component X differ between conditions?", one can compute a fractional peak latency or fractional area latency measure. The fractional area measure computes the area under the curve within a time window and finds the point in time that divides the area into a specific fraction, such as 50% (Hansen & Hillyard, 1980). The fractional peak latency is calculated from the peak, back in time, as the point at which the signal reaches a particular fraction of the peak amplitude.
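As an illustration of the latency measures just defined, the fractional area latency can be computed in a few lines. This is a minimal NumPy sketch; real toolboxes offer variants that are sensitive to polarity and baseline choices.

```python
import numpy as np

def fractional_area_latency(erp, times, tmin, tmax, fraction=0.5):
    """Return the time point that divides the area under the rectified
    waveform, within [tmin, tmax], into the requested fraction."""
    mask = (times >= tmin) & (times <= tmax)
    area = np.cumsum(np.abs(erp[mask]))      # rectify, then accumulate
    idx = np.searchsorted(area, fraction * area[-1])
    return times[mask][idx]

# Hypothetical usage on a participant-average waveform (one channel):
# n400_latency = fractional_area_latency(erp, times, tmin=0.3, tmax=0.6)
```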
Because noise makes the identification of peaks in individual participants difficult, both fractional measures benefit from being applied to "leave‐one‐out" grand averages, using the jackknife procedure (Miller, Patterson, & Ulrich, 1998; for recommended settings, see Kiesel et al., 2008) or, at minimum, from measuring peaks in low‐pass filtered data.

In other types of experimental designs, the nature, timing, and distribution of the effects of interest are not known beforehand, and instead the research question is of the type "Does the brain appreciate the difference between these conditions (and if so, how quickly)?" To handle such cases, data‐driven "mass univariate" analyses have been developed and are implemented in, or compatible with, freely available software (e.g., Delorme & Makeig, 2004; Groppe, Urbach, & Kutas, 2011; Lopez‐Calderon & Luck, 2014; Maris & Oostenveld, 2007; Oostenveld, Fries, Maris, & Schoffelen, 2011). Various approaches exist, but all share the advantage that the researcher need not specify a time window and set of electrode sites a priori. The first step of mass univariate approaches is to quantify the difference of interest in the form of some statistic (such as a t value) at each time point and each electrode. In a second step, a correction for multiple comparisons is applied, often based on permutation methods (or on the false discovery rate; Benjamini & Hochberg, 1995). Permutation procedures involve randomly swapping the condition labels and re‐running the statistical tests, a process that is repeated many times. Each permutation result contributes to a null distribution of test statistics, which acts as a benchmark for quantifying the size of effects that can occur simply by chance. Finally, the statistics from the actual (non‐permuted) results are compared with the null distribution. If they are relatively "special" among the random permutations (i.e., if they fall in a tail of the distribution), the difference between conditions is considered statistically significant.

The main downside of these approaches is that they are less powerful than an ANOVA or t test run directly on a predefined time window. Thus, to avoid missing true effects, any a priori information that is available should be used to restrict the analysis and increase power. For instance, if one knows the distribution of an expected effect but not its timing, one can pick the electrode sites of interest but still test the entire epoch point‐by‐point—or vice versa. The test results allow the researcher to inspect at which time points and electrodes any differences between the conditions occurred, although the extent to which this time course can be interpreted as onsets and offsets of effects depends on the multiple‐comparisons correction procedure. For instance, the cluster‐based permutation approach only tests the general null hypothesis that there is no difference between the conditions (i.e., that the conditions are exchangeable); the false alarm rate is not controlled at the level of the onsets and offsets of clusters (Maris, 2012). Taken together, there are suitable statistical methods for most designs and extents of a priori knowledge.
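The permutation logic can be illustrated with a deliberately simplified sketch: a single electrode and time window, with per‐participant mean amplitudes as input. Real mass univariate analyses apply this at every time point and electrode and then correct across them; the variable names and number of permutations below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def paired_permutation_test(amp_a, amp_b, n_perm=5000):
    """Permutation test on per-participant mean amplitudes: randomly
    swap the condition labels within participants to build a null
    distribution for the observed mean difference."""
    diffs = amp_a - amp_b                    # one value per participant
    observed = diffs.mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=diffs.size)
        null[i] = (diffs * flips).mean()
    # Two-tailed p value: how often a random relabeling is as extreme.
    p = (np.abs(null) >= np.abs(observed)).mean()
    return observed, p

# Hypothetical usage with arrays of shape [n_participants]:
# effect, p = paired_permutation_test(n400_incongruous, n400_congruous)
```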

An Exemplary Study

To help make the above more concrete, we discuss an example study by Van Petten, Coulson, Rubin, Plante, and Parks (1999), who used ERPs to investigate spoken‐word comprehension in sentence context. The study used speech, which is less often used in ERP studies than written words (as visual stimuli are easier to time‐lock to), and capitalized on several advantageous features of the ERP method in its experimental design.

Spoken language input unfolds over time and lacks clear cues to word boundaries, unlike alphabetic written text, where spaces mark the boundaries. Listeners activate multiple candidate words (like "cat," "a," and "log" while hearing "catalog") in an incremental fashion based on incomplete input (Marslen‐Wilson & Welsh, 1978). Van Petten et al. investigated the extent to which the meanings of these candidate words are activated and when and how they make contact with sentence context. Out of context, a word can be identified as soon as the acoustic input becomes uniquely consistent with that word. This point in time is known as the isolation point, and it can be empirically established using the gating task (Grosjean, 1980), in which listeners are presented with successively longer onset fragments of the word and asked to guess what the word is or is going to be. As the fragments get longer, listeners converge on the same response. In supportive sentence contexts, the responses converge earlier, with less acoustic input (Grosjean, 1980). Some studies have used cross‐modal priming paradigms, in which participants make lexical decisions to visually presented words while listening to words in context, to investigate the semantic activation of word candidates (Chwilla, 1996; Moss & Marslen‐Wilson, 1993; Zwitserlood, 1989). For example, while hearing successively longer fragments of the word "generous" in a supportive sentence context that is inconsistent with "general," participants would be probed with "gift" (associated with the contextually supported word) and "army" (assessing activation of the contextually unsupported, but initially overlapping, "general"; Zwitserlood, 1989). However, the results were mixed, and the nature and time course of the processes between hearing the fragment, seeing the target, and making a response were not known.

Van Petten et al. (1999) used the N400 effect to examine the initiation of semantic processing relative to the isolation point. The study focused directly on the processing of the spoken word itself. If context‐dependent semantic processing of words only begins after they have been fully recognized, the N400 to words that fit and words that do not fit in the sentence should only begin to differ after the isolation point. However, if semantic processing begins to operate on incomplete input, then the N400 effect could begin prior to the isolation point, as soon as the acoustic input diverges from any contextual expectations that listeners might have. Participants listened to sentence contexts like "It was a pleasant surprise to find that the car repair bill was only seventeen…", which ended in a word that fit the context ("dollars"; cohort congruous condition), in an incongruous word that rhymed with the congruous word ("scholars"; rhyme condition), or in an incongruous word that shared initial phonemes with the congruous word ("dolphins"; cohort incongruous condition).
Figure 13.2 shows the results (of Experiment 3, which used continuous speech; not shown are Experiment 1, a gating study, and Experiment 2, which inserted a pause before the final word). In the ERPs time‐locked to word onset, both incongruous conditions elicited much larger N400 amplitudes than the congruous condition.

[Figure 13.2 appears here: grand average ERPs at left, right, and midline parietal channels for the cohort congruous, cohort incongruous, and rhyme conditions, plotted from −600 to 1200 ms and time‐locked to word onset (left column) or to the isolation point (right column); calibration 3.0 μV.]

Figure 13.2 Grand average ERPs from three parietal channels, elicited by the final words in the three conditions. In the left column, time zero is the onset of the word. In the right column, time zero is the isolation point. Source: Van Petten et al. (1999). Reproduced with permission of the American Psychological Association.

This replicated previous studies showing how contextual support reduces N400 amplitude. Comparing the incongruous conditions, however, there was a large difference in the onset timing of the N400. The semantically incongruous words that shared initial phonemes with the congruous completion (cohort incongruous) elicited an N400 that was delayed by about 200 ms compared with those that did not share initial phonemes (rhyme). These results already suggest that the isolation point may not be a crucial determinant of N400 onset, but to correct for variability in the isolation point across individual words, the ERPs were also time‐locked to the isolation point. When the incongruous words shared initial phonemes with the congruous word, the N400 onset occurred at the isolation point. But when the incongruous word had different initial phonemes, the N400 onset occurred about 200 ms prior to the isolation point. This strongly demonstrates that context‐driven semantic processes do not wait until the acoustic signal has fully disambiguated the word. Instead, the results argue for a continuous mapping from acoustic input onto semantic representations. Note that the semantic interpretation of these results is afforded by the ability to identify the pre‐isolation‐point ERP effect as being on the N400 rather than on some other component (for discussion of a phonological mismatch component, see Connolly & Phillips, 1994; van den Brink, Brown, & Hagoort, 2001). Van Petten et al. made this argument based on the waveform characteristics and functional sensitivity of the effect, pointing out as well that there was no evidence for additional components—no additional peaks in individual‐subject ERPs and no shift in scalp distribution over time.

The advantages of ERPs for addressing the questions of interest in this study are clear. The experimental design made use of the fact that the EEG signal is an instantaneous and continuous reflection of how the speech signal is processed, obviating the need to make inferences based on downstream consequences and metalinguistic judgments. The study also exemplifies the utility of time‐locking to different parts of the speech signal, in this case allowing for the investigation of context effects separately at points in time before and after any purely context‐independent word recognition process could have disambiguated the input.

Advantages and Disadvantages

This section discusses challenges with ERP methods, as well as how some of these issues are being overcome.

One fundamental challenge, already discussed, is that the EEG contains high levels of noise, necessitating techniques for extracting a stable signal of interest—most commonly, averaging. However, as with any average, an average ERP may not accurately reflect the processing pattern in individual participants or on individual trials. For instance, a decrease in amplitude in one condition relative to another could be due to a component being attenuated on every trial, to its being attenuated on only a subset of trials, or even to latency variation, such that the timing of the component is more variable in one condition than in the other, leading to a reduced amplitude in the average (Spencer, 2004). Furthermore, the ERP from a given study may contain a biphasic N400‐P600 pattern of effects when averaged, but this could in principle stem from a combination of some trials (and/or, in the grand average, participants) with only an N400 modulation and some with only a P600 modulation. Moreover, ERP datasets are often somewhat unbalanced in terms of the number of trials and the identity of the items going into each condition average, because of artifact rejection and, in some designs, binning based on participants' behavioral response patterns. Although this is unlikely to affect outcomes in experiments wherein the same perceptual stimuli are rotated across conditions and only a random 5‐10% of the trials are rejected, sometimes the question addressed necessarily contrasts different items, as in word recognition experiments that try to discern the effects of various psycholinguistic variables.

To address such concerns, ERP researchers have begun to use alternative statistical methods that have also gained popularity in the behavioral and eye‐tracking literatures, such as mixed‐effects regression models (e.g., Baayen, Davidson, & Bates, 2008). Instead of averaging, mixed‐effects (or hierarchical) models directly model the trial‐level data. This allows for the simultaneous inclusion of participants and items as random factors, which makes it possible to include any measured participant characteristics (such as working memory capacity) and item characteristics (such as word frequency) and to examine effects of practice or fatigue across trials. In principle, estimating ERPs by running a regression model at the level of individual trials is not dissimilar to averaging. However, whereas averages can be distorted in unpredictable ways by unbalanced missing data, mixed‐effects models can deal with missing data in a principled way because, at the individual trial level, it is known by which participant and item a (brain) response was elicited. Although the field has not yet settled on conventions regarding the various possible ways of modeling multiple electrodes and time points/windows, there are promising applications of trial‐level analyses to ERPs, including investigations of how continuous predictors such as word position in a sentence affect ERPs (Payne, Lee, & Federmeier, 2015), of non‐linear relationships between predictors and ERPs (Tremblay & Newman, 2015), and of ways to handle overlapping responses to distinct events (Smith & Kutas, 2015).
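As a minimal illustration of the trial‐level approach, the sketch below assumes the Python statsmodels package and a hypothetical long‐format file with one row per trial. It fits only a by‐subject random intercept; fully crossed random effects for participants and items are more conveniently fit with dedicated packages such as lme4 in R.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per artifact-free trial, with
# the mean single-trial amplitude in the N400 window at one channel.
trials = pd.read_csv("n400_single_trials.csv")
# assumed columns: subject, item, cloze, trial_number, amplitude

# Model the trial-level amplitudes directly: fixed effects of cloze
# probability and trial number, plus a by-subject random intercept.
model = smf.mixedlm("amplitude ~ cloze + trial_number",
                    data=trials, groups=trials["subject"])
print(model.fit().summary())
```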
Another disadvantage of averaging is that it does not capture certain aspects of the EEG signal. For activity to show up in an average ERP, it needs to be not only time‐locked to an event, but also phase‐locked to it; that is, the peaks and troughs in the waveform need to be aligned in time across different trials. Such "evoked" activity can be contrasted with "induced" activity, which is time‐locked but not phase‐locked. Even though its amplitude can be large, non‐phase‐locked activity is unlikely to become visible in the ERP because its peaks have a variable latency relative to the stimulus and largely cancel each other. The success of ERPs in delineating core cognitive processes suggests that phase‐locked activity captures something fundamental about cognition and brain functioning. However, current views of brain functioning also emphasize the role of oscillatory activity (which is often not phase‐locked) in critical aspects of cognitive processing, including language (for discussion of oscillatory activity as the EEG signature of the coupling and uncoupling of neuronal networks, see Bastiaansen, Mazaheri, & Jensen, 2008; see also Buzsáki, 2006). Therefore, a growing number of language processing studies employ time‐frequency analysis to make visible not only phase‐locked but also non‐phase‐locked activity, as is routinely done in other fields.

Time‐frequency analysis involves decomposing the EEG signal into multiple frequencies and quantifying power (amplitude squared) at each frequency over time. The analysis is applied to individual trials, and an average across trials is then taken. Different frequency bands that respond differently to cognitive manipulations have been identified and labeled: delta (1‐3 Hz), theta (4‐7 Hz), alpha (8‐12 Hz), beta (13‐30 Hz), and gamma (>30 Hz). The frequency bands are not fixed but merely serve as a guideline to facilitate communication. Peak alpha frequencies, for instance, actually differ between participants, as well as between tasks within the same participants (Haegens, Cousijn, Wallis, Harrison, & Nobre, 2014; Klimesch, 1999).

Various time‐frequency analysis methods are commonly used, including the short‐time Fast Fourier Transform (FFT), Morlet wavelet analysis, and filtering combined with the Hilbert transform (for discussion, see Cohen, 2014). Each of these methods has its own parameters, but when the parameter settings are matched, the results look similar; in fact, the three approaches are mathematically equivalent to one another (Bruns, 2004). As shown in Figure 13.3, the result of such an analysis can be visualized as a spectrogram, with time on the x‐axis, frequency on the y‐axis, and color coding for increases and decreases in power at the different frequencies over time. It is important to note that these spectrograms do not have the temporal resolution that ERPs have; there is considerable temporal and frequency "smearing." In signal processing, there is an inverse relationship between frequency precision and temporal precision, and this trade‐off is determined by the analysis parameters (such as the number of wavelet cycles, the filter settings when using the filter‐Hilbert method, or the FFT window length and taper properties). For instance, when using a 400 ms moving‐window FFT approach, each "pixel" in the spectrogram is calculated using the data from 200 ms before and 200 ms after the pixel (although data points closer to −200 and +200 ms have progressively less influence, depending on the shape of the taper used).
Using a larger time window would improve the frequency precision at the expense of temporal precision, whereas using a smaller window would improve the temporal precision at the expense of frequency precision.
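A single‐trial Morlet wavelet decomposition of the kind described above can be sketched in pure NumPy; dedicated toolboxes such as FieldTrip, EEGLAB, and MNE‐Python provide tested implementations. The seven‐cycle setting below is an illustrative choice that directly instantiates the time‐frequency trade‐off: more cycles give better frequency precision but worse temporal precision.

```python
import numpy as np

def morlet_power(trials, fs, freqs, n_cycles=7):
    """Average power over trials, computed from single-trial Morlet
    wavelet transforms so that induced activity is preserved."""
    power = np.zeros((len(freqs), trials.shape[1]))
    for fi, f in enumerate(freqs):
        # Complex Morlet wavelet: a Gaussian-windowed complex sinusoid
        # whose temporal width (sigma_t) scales with the cycle count.
        sigma_t = n_cycles / (2 * np.pi * f)
        t = np.arange(-3 * sigma_t, 3 * sigma_t, 1 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.abs(wavelet).sum()
        for trial in trials:
            analytic = np.convolve(trial, wavelet, mode="same")
            power[fi] += np.abs(analytic) ** 2   # amplitude squared
    return power / trials.shape[0]

# Hypothetical usage on an [n_trials x n_times] array from one channel:
# tfr = morlet_power(trials, fs=500, freqs=np.arange(2, 41))
```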

[Figure 13.3 appears here: simulated single trials analyzed two ways (left panel: ERP analysis, yielding the time‐locked average, i.e., the ERP; right panel: time‐frequency analysis, yielding the averaged spectrogram with frequencies up to 40 Hz and activity around 10 Hz); time axis 0–1500 ms, calibration 5–10 µV.]

Figure 13.3 Simulated EEG data illustrating the difference between ERPs and time‐frequency analyses in their sensitivity to phase‐locked (evoked) and non‐phase‐locked (induced) activity. The first response is time‐locked and phase‐locked to time zero, whereas the second response is time‐locked but not phase‐locked. The first response shows up both after ERP averaging (as an oscillation) and after time‐frequency analysis of power (as a power increase at around 10 Hz). The second response is canceled by ERP averaging, but is preserved in time‐frequency analysis of power. Source: Bastiaansen, M. C. M., Mazaheri, A., & Jensen, O. (2008). By permission of Oxford University Press, USA. (See insert for color representation of the figure.)

In general, by decomposing the signal into its constituent frequencies across time windows, some temporal precision is sacrificed.

Compared with the rich and well‐established literature on ERPs, less is currently known about the role of non‐phase‐locked activity in language processing. This will likely change in the coming years, but it has important implications for statistical analysis. With ERPs, one can target a particular component with a known latency and scalp distribution and reduce the data accordingly for analysis. With time‐frequency approaches, it is more often the case that the latency, scalp distribution, and frequency bands in which effects will occur are not known before inspecting the data. Thus, in this case it becomes especially important to consider using data‐driven statistical methods that deal with the problem of multiple comparisons, such as the ones discussed in the section Collecting and Analyzing Data, which can straightforwardly incorporate frequency as an additional dimension besides time and space (Maris & Oostenveld, 2007).

Another challenge with ERPs, and EEG in general, is that it is difficult to infer which brain areas were active based solely on the scalp topography of a component or effect. For most psycholinguistic questions, the timing of the brain activity is probably more germane than its source location. But when localizing activity is important, and one does not want to sacrifice temporal resolution (as occurs with functional magnetic resonance imaging, fMRI; see Chapter 14), one can turn to magnetoencephalography (MEG). MEG is similar to EEG in many ways (for a detailed introduction to the method, see Hämäläinen et al., 1993). The same types of neural processes that produce the electrical activity reflected in the EEG also produce the magnetic activity visible in the MEG. Both power changes and event‐related fields (ERFs, the magnetic equivalent of ERPs) can be analyzed. Many ERP components have a magnetic counterpart. In those cases, the corresponding MEG components are generally named like the ERP ones, with an "m" appended to the label. For example, the N400m is the MEG response taken to reflect activity shared with the N400 (e.g., Halgren et al., 2002; Simos, Basile, & Papanicolaou, 1997). As with EEG, the temporal resolution of MEG is a major strength.

Despite these similarities, there are important differences between EEG and MEG. The MEG signal is recorded using superconducting quantum interference devices (SQUIDs), which are highly sensitive magnetometers that need to be cooled in liquid helium at a very low temperature (4 Kelvin).
Gradiometers, which measure the difference between two or more neighboring coils, make the signal especially sensitive to nearby brain sources and decrease the influence of more distant noise sources, including the heart. Most current MEG systems contain several hundred gradiometers, arranged in a helmet‐like shape. Because the brain signals are much weaker than magnetic noise coming from, for example, radios, moving cars, and elevators, MEG systems are usually placed in a magnetically shielded room. Both the initial purchase of the MEG system and the necessary supplies of liquid helium make the method considerably more costly than EEG.

One of the main virtues of MEG stems from the fact that, compared with electrical signals, magnetic signals are less spatially smeared by the skull between the brain and the sensors (e.g., Hämäläinen et al., 1993). Skin potentials, which complicate EEG recordings at low frequencies, are also not seen by MEG. Furthermore, certain widespread muscle artifacts in the EEG may be reduced in the MEG, which can facilitate the study of speech production (Hari & Salmelin, 2012; for examples, see Levelt, Praamstra, Meyer, Helenius, & Salmelin, 1998; Salmelin, Hari, Lounasmaa, & Sams, 1994). Thus, certain types of distortion and noise are more problematic for EEG than for MEG. At the same time, MEG is sensitive to a different subset of brain signals: whereas currents that are oriented tangentially to the skull (in the walls of cortical sulci) are seen by both MEG and EEG, currents that are oriented radially to the skull (as on gyri, encompassing an estimated one third of the brain's cortical surface) are seen by EEG only. Magnetic signals, compared with electrical ones, also show a steeper decline with distance, making MEG relatively more selective for superficial brain sources.

Source localization with MEG is thus easier because a more restricted subset of brain activity is being modeled. Sources can be modeled using various methods, including an equivalent current dipole, multiple dipoles, or beamforming techniques (for discussion, see Hari & Salmelin, 2012). In each case, certain assumptions are necessary, because the "inverse problem" has no unique solution (multiple source configurations can generate the same scalp distribution). Incorporating an anatomical MRI scan into the analysis can further help reduce the source‐modeling solution space (this is true for source modeling with EEG as well). Some current MEG studies go beyond localization and use sophisticated connectivity methods at the source level to investigate communication between different brain areas (for review, see Bressler & Seth, 2011; David et al., 2006; Schoffelen & Gross, 2009). Overall, a broad characterization of the two methods is that MEG usually sees less than EEG sees, but sees it more clearly (Cohen & Halgren, 2009). However, it is perhaps most useful to view these methods as complementary, and, indeed, some have argued that the best source localization will come from combined EEG and MEG (Cohen & Halgren, 2009; Sharon, Hämäläinen, Tootell, Halgren, & Belliveau, 2007).

In summary, this chapter discussed how the noninvasive measurement of electrical brain activity generates some of the most direct evidence regarding the processes underlying language comprehension, production, and acquisition in the brain.
The established approaches, supplemented by current developments, are likely to continue to provide important new insights that keep challenging our views of cognition and brain functioning.

Acknowledgment

This work was supported by a James S. McDonnell Foundation Scholar Award and NIH grant AG026308 to K. D. F.

Key Terms

EEG  Electroencephalogram, the record of electrical brain potentials.
ERP component  One of the component waves of the ERP waveform.
ERP effect  An experimentally isolated difference between conditions, often a modulation of an ERP component.
ERPs  Event‐related potentials, waveforms averaged across multiple trials time‐locked to an event.
MEG  Magnetoencephalogram, the record of magnetic brain potentials.

References

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed‐effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Baccino, T., & Manunta, Y. (2005). Eye‐fixation‐related potentials: Insight into parafoveal processing. Journal of Psychophysiology, 19, 204–215.
Bastiaansen, M. C. M., Mazaheri, A., & Jensen, O. (2008). Beyond ERPs: Oscillatory neuronal dynamics. In S. Luck & E. Kappenman (Eds.), Oxford handbook of event‐related potential components. New York: Oxford University Press.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 289–300.
Bressler, S. L., & Seth, A. K. (2011). Wiener–Granger causality: A well established methodology. Neuroimage, 58, 323–329.
Brouwer, H., Fitz, H., & Hoeks, J. (2012). Getting real about semantic illusions: Rethinking the functional role of the P600 in language comprehension. Brain Research, 1446, 127–143.

Bruns, A. (2004). Fourier‐, Hilbert‐ and wavelet‐based signal analysis: Are they really different approaches? Journal of Neuroscience Methods, 137, 321–332.
Buzsáki, G. (2006). Rhythms of the brain. New York: Oxford University Press.
Cheour, M., Ceponiene, R., Lehtokoski, A., Luuk, A., Allik, J., Alho, K., & Näätänen, R. (1998). Development of language‐specific phoneme representations in the infant brain. Nature Neuroscience, 1, 351–353.
Chwilla, D. J. (1996). Electrophysiology of word processing: The lexical processing nature of the N400 priming effect. Doctoral dissertation, Nijmegen University.
Cohen, M. X. (2014). Analyzing neural time series data: Theory and practice. Cambridge, MA/London, UK: MIT Press.
Cohen, D., & Halgren, E. (2009). Magnetoencephalography. In L. R. Squire (Ed.), Encyclopedia of neuroscience (Vol. 5, pp. 615–622).
Connolly, J. F., & Phillips, N. A. (1994). Event‐related potential components reflect phonological and semantic processing of the terminal word of spoken sentences. Journal of Cognitive Neuroscience, 6, 256–266.
Coulson, S., King, J. W., & Kutas, M. (1998). Expect the unexpected: Event‐related brain response to morphosyntactic violations. Language and Cognitive Processes, 13, 21–58.
David, O., Kiebel, S. J., Harrison, L. M., Mattout, J., Kilner, J. M., & Friston, K. J. (2006). Dynamic causal modeling of evoked responses in EEG and MEG. NeuroImage, 30, 1255–1272.
Dehaene‐Lambertz, G. (1997). Electrophysiological correlates of categorical phoneme perception in adults. Neuroreport, 8, 919–924.
Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single‐trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134, 9–21.
Friederici, A. D. (1995). The time course of syntactic activation during language processing: A model based on neuropsychological and neurophysiological data. Brain and Language, 50, 259–281.
Ganushchak, L. Y., Christoffels, I. K., & Schiller, N. O. (2011). The use of electroencephalography in language production research: A review. Frontiers in Psychology, 2, 208.
Goncharova, I. I., McFarland, D. J., Vaughan, T. M., & Wolpaw, J. R. (2003). EMG contamination of EEG: Spectral and topographical characteristics. Clinical Neurophysiology, 114, 1580–1593.
Groppe, D. M., Urbach, T. P., & Kutas, M. (2011). Mass univariate analysis of event‐related brain potentials/fields I: A critical tutorial review. Psychophysiology, 48, 1711–1725.
Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception & Psychophysics, 28, 267–283.
Haegens, S., Cousijn, H., Wallis, G., Harrison, P. J., & Nobre, A. C. (2014). Inter‐ and intra‐individual variability in alpha peak frequency. Neuroimage, 92, 46–55.
Hagoort, P., Brown, C. M., & Groothusen, J. (1993). The syntactic positive shift (SPS) as an ERP measure of syntactic processing. Language and Cognitive Processes, 8, 439–483.
Halgren, E., Dhond, R. P., Christensen, N., Van Petten, C., Marinkovic, K., Lewine, J. D., & Dale, A. M. (2002). N400‐like magnetoencephalography responses modulated by semantic context, word frequency, and lexical class in sentences. Neuroimage, 17, 1101–1116.
Hämäläinen, M., Hari, R., Ilmoniemi, R., Knuutila, J., & Lounasmaa, O. (1993). Magnetoencephalography: Theory, instrumentation, and applications to noninvasive studies of the working human brain. Reviews of Modern Physics, 65, 1–93.
Handy, T. C. (Ed.). (2004). Event‐related potentials: A methods handbook. Cambridge, MA: MIT Press.
Hansen, J. C., & Hillyard, S. A.
(1980). Endogenous brain potentials associated with selective auditory attention. Electroencephalography and Clinical Neurophysiology, 49, 277–290.
Hari, R., & Salmelin, R. (2012). Magnetoencephalography: From SQUIDs to neuroscience: Neuroimage 20th anniversary special edition. Neuroimage, 61, 386–396.

Kappenman, E. S., & Luck, S. J. (2010). The effects of electrode impedance on data quality and statistical significance in ERP recordings. Psychophysiology, 47, 888–904.
Kiesel, A., Miller, J. O., Jolicoeur, P., & Brisson, B. (2008). Measurement of ERP latency differences: A comparison of single‐participant and jackknife‐based scoring methods. Psychophysiology, 45, 250–274.
Klimesch, W. (1999). EEG alpha and theta oscillations reflect cognitive and memory performance: A review and analysis. Brain Research Reviews, 29, 169–195.
Kluender, R., & Kutas, M. (1993). Bridging the gap: Evidence from ERPs on the processing of unbounded dependencies. Journal of Cognitive Neuroscience, 5, 196–214.
Kolk, H., & Chwilla, D. (2007). Late positivities in unusual situations. Brain and Language, 100, 257–261.
Kuperberg, G. R. (2007). Neural mechanisms of language comprehension: Challenges to syntax. Brain Research, 1146, 23–49.
Kuperberg, G. R., Sitnikova, T., Caplan, D., & Holcomb, P. J. (2003). Electrophysiological distinctions in processing conceptual relationships within simple sentences. Cognitive Brain Research, 17, 117–129.
Kutas, M., & Federmeier, K. D. (2000). Electrophysiology reveals semantic memory use in language comprehension. Trends in Cognitive Science, 4, 463–470.
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: Finding meaning in the N400 component of the event‐related brain potential (ERP). Annual Review of Psychology, 62, 621–647.
Kutas, M., & Hillyard, S. A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207, 203–205.
Laszlo, S., & Federmeier, K. D. (2011). The N400 as a snapshot of interactive processing: Evidence from regression analyses of orthographic neighbor and lexical associate effects. Psychophysiology, 48, 176–186.
Laszlo, S., Ruiz‐Blondet, M., Khalifian, N., Chu, F., & Jin, Z. (2014). A direct comparison of active and passive amplification electrodes in the same amplifier system. Journal of Neuroscience Methods, 235, 298–307.
Levelt, W. J. M., Praamstra, P., Meyer, A. S., Helenius, P., & Salmelin, R. (1998). An MEG study of picture naming. Journal of Cognitive Neuroscience, 10, 553–567.
Lopez‐Calderon, J., & Luck, S. J. (2014). ERPLAB: An open‐source toolbox for the analysis of event‐related potentials. Frontiers in Human Neuroscience, 8, 1–14.
Luck, S. J. (2005). An introduction to the event‐related potential technique. Cambridge, MA: MIT Press.
Luck, S. J., & Kappenman, E. S. (Eds.). (2011). The Oxford handbook of event‐related potential components. New York: Oxford University Press.
Makeig, S., Bell, A. J., Jung, T.‐P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. Advances in Neural Information Processing Systems, 8, 145–151.
Maris, E. (2012). Statistical testing in electrophysiological studies. Psychophysiology, 49, 549–565.
Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG‐ and MEG‐data. Journal of Neuroscience Methods, 164, 177–190.
Marslen‐Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29–63.
Miller, J., Patterson, T., & Ulrich, R. (1998). Jackknife‐based method for measuring LRP onset latency differences. Psychophysiology, 35, 99–115.
Millett, D. (2001). Hans Berger: From psychic energy to the EEG. Perspectives in Biology and Medicine, 44, 522–542.
Moss, H. E., & Marslen‐Wilson, W. D. (1993).
Access to word meanings during spoken language comprehension: Effects of sentential semantic context. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 1254–1276.

Münte, T. F., Heinze, H.‐J., Matzke, M., Wieringa, B. M., & Johannes, S. (1998). Brain potentials and syntactic violations revisited: No evidence for specificity of the syntactic positive shift. Neuropsychologia, 36, 217–226.
Näätänen, R., Lehtokoski, A., Lennes, M., Cheour, M., Huotilainen, M., Iivonen, A., Vainio, M., Alku, P., Ilmoniemi, R., Luuk, A., Allik, J., Sinkkonen, J., & Alho, K. (1997). Language‐specific phoneme representations revealed by electric and magnetic brain responses. Nature, 385, 432–434.
Nunez, P. L., & Srinivasan, R. (2006). The electric fields of the brain: The neurophysics of EEG. Oxford: Oxford University Press.
Oostenveld, R., Fries, P., Maris, E., & Schoffelen, J. M. (2011). FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience, 2011, 156869.
Osterhout, L., & Holcomb, P. J. (1992). Event‐related brain potentials elicited by syntactic anomaly. Journal of Memory and Language, 31, 785–806.
Payne, B. R., Lee, C. L., & Federmeier, K. D. (2015). Revisiting the incremental effects of context on word processing: Evidence from single‐word event‐related brain potentials. Psychophysiology, 52, 1456–1469.
Salmelin, R., Hari, R., Lounasmaa, O. V., & Sams, M. (1994). Dynamics of brain activation during picture naming. Nature, 368, 463–465.
Schoffelen, J. M., & Gross, J. (2009). Source connectivity analysis with MEG and EEG. Human Brain Mapping, 30, 1857–1865.
Sharon, D., Hämäläinen, M. S., Tootell, R. B., Halgren, E., & Belliveau, J. W. (2007). The advantage of combining MEG and EEG: Comparison to fMRI in focally stimulated visual cortex. NeuroImage, 36, 1225–1235.
Simos, P. G., Basile, L. F., & Papanicolaou, A. C. (1997). Source localization of the N400 response in a sentence‐reading paradigm using evoked magnetic fields and magnetic resonance imaging. Brain Research, 762, 29–39.
Sitnikova, T., Kuperberg, G. R., & Holcomb, P. J. (2003). Semantic integration in videos of real‐world events: An electrophysiological investigation. Psychophysiology, 40, 160–164.
Smith, N. J., & Kutas, M. (2015). Regression‐based estimation of ERP waveforms: I. The rERP framework. Psychophysiology, 52, 157–168.
Spencer, K. M. (2004). Averaging, detection and classification of single‐trial ERPs. In T. C. Handy (Ed.), Event‐related potentials: A methods handbook. Cambridge, MA: MIT Press.
Tanner, D. (2015). On the left anterior negativity (LAN) in electrophysiological studies of morphosyntactic agreement. Cortex, 66, 149–155.
Tanner, D., Morgan‐Short, K., & Luck, S. J. (2015). How inappropriate high‐pass filters can produce artifactual effects and incorrect conclusions in ERP studies of language and cognition. Psychophysiology, 52, 997–1009.
Tremblay, A., & Newman, A. J. (2015). Modeling nonlinear relationships in ERP data using mixed‐effects regression with R examples. Psychophysiology, 52, 124–139.
Van den Brink, D., Brown, C., & Hagoort, P. (2001). Electrophysiological evidence for early contextual influences during spoken‐word recognition: N200 versus N400 effects. Journal of Cognitive Neuroscience, 13, 967–985.
Van Petten, C., Coulson, S., Rubin, S., Plante, E., & Parks, M. (1999). Time course of word identification and semantic integration in spoken language. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 394–417.
Van Turennout, M., Hagoort, P., & Brown, C. M. (1997).
14 Hemodynamic Methods: fMRI and fNIRS

Roel M. Willems and Alejandrina Cristia

Abstract

Neural activity leads to local changes in the amount of oxygen in nearby blood. Two methods in cognitive neuroscience exploit this indirect measure of neural activation. Functional Magnetic Resonance Imaging (fMRI) measures the oxygenation of local parts of the brain at relatively high spatial resolution (in the order of millimeters). Functional Near Infrared Spectroscopy (fNIRS) uses the absorption of near‐infrared light in cortical tissue as an indicator of blood oxygenation and hence neural activity. Both methods allow sampling of brain activation on-line, non-invasively, and at relatively fine-grained spatial locations.

Assumptions and Rationale

FMRI and fNIRS are called hemodynamic methods because they rely on signals related to blood flow (hemo‐ or haemo‐ derives from the Greek word for blood). Although the precise mechanisms are still not completely understood, it is clear that neuronal firing is typically correlated with changes in the local concentrations of oxygenated and deoxygenated hemoglobin. One intuitive way of thinking about it is to imagine that, when a population of neurons is activated,

they consume energy from the local blood, ultimately calling for a “refill” that results in an increased flow of oxygenated blood to that population of neurons. Both fMRI and fNIRS measure these local changes in blood oxygenation. They therefore provide indirect and delayed measures of brain activation. The measures are indirect because they do not measure activation of the neurons themselves, but a correlate of that activation. They are delayed because the response of the vascular system is much slower than neuronal firing, so our measurements reflect events that continue to unfold relatively long after the neural activation has taken place. This may seem to make hemodynamic methods far from ideal, and one may wonder why they have become so popular despite these disadvantages. The main reason is that fMRI and fNIRS measure correlates of neural activation non‐invasively and with relatively high spatial precision; that is, they measure brain activation relatively precisely in terms of its localization in the brain. Given the differences in methodology between fMRI and fNIRS, we discuss each separately.

FMRI

Apparatus, Signal, and Scan Sequence

Magnet and Head Coil

Functional Magnetic Resonance Imaging (fMRI) is performed using an MRI scanner (sometimes called an MR scanner). This is essentially a large and strong magnet. The magnet sits inside the machine and is surrounded by liquid helium, which cools it so that it remains “on field,” meaning that it keeps its strength. The magnet is therefore always “on”: even when the MR machine is not being operated and the computer hardware interfacing with it is turned off, the magnet is still in operation. The magnetic field can be removed by releasing the liquid helium, which warms up the magnet and makes it lose its force. This is not regularly done, and it is therefore best to assume that the magnet is always in function. We will see later that because the magnet is always “on,” we need to take particular safety measures when doing fMRI studies, the most obvious one being not to bring metal into the scanner room. Another important piece of equipment for fMRI is the head coil. This is placed over the head of the participant and serves to emit (send) radio frequency pulses as well as to “read” (receive) the information from the brain. The “f” in fMRI stands for “functional,” distinguishing it from “regular” MRI, which measures a more static property of tissue. For instance, when clinicians want a high resolution image of a knee, they will collect an MR scan of the knee. This will be a sharp image of the anatomy of the knee. The same can be done with a scan of the head, showing the skull and the brain. These are what we typically call anatomical images. In contrast, functional MRI measures a signal that is related to brain function, that is, to the ongoing activity in the brain. The signal that is measured with fMRI is called BOLD.

Blood Oxygenation Level Dependent Signal

The underlying principle of fMRI is that a magnetic field puts the protons in the brain in a steady state: The magnetic field “pulls” them into one direction or the other. We then apply a radio frequency pulse to push the protons off their preferred direction. The trick lies in the fact that different types of tissue take different amounts of time to return to the preferred direction induced by the magnet. In the case of fMRI one makes use of the difference between oxygenated and de‐oxygenated blood: The two differ in their magnetic properties, and this difference is measurable. With fMRI we measure a correlate of brain activation, the Blood Oxygenation Level Dependent (BOLD) signal. This measure lags behind actual neural activation (the firing of neurons) by several seconds, making it a slow and indirect measure of neural activation. The ratio between oxygenated and de‐oxygenated blood will differ between a brain region that is activated by a task and a region that is not. It is important to note that the brain is always active, so it always consumes energy (and oxygen). An increase in BOLD is thus an indicator of a relative increase in brain activation. In summary, BOLD is the measure that is picked up with fMRI, and it relies on the difference between oxygenated and de‐oxygenated blood.

The Scan Sequence

A typical fMRI experiment uses scanning settings (called a scanning sequence) in which the brain is measured slice after slice. The protons are excited not all at once, but per virtual slice. Dividing the brain up into slices means that we have to measure around 30 of them, each 2–3 mm thick, to cover most brains fully. We draw a box around the brain of a participant, so that we can measure the whole brain (Figure 14.1A). The time it takes to measure the whole brain once is called the Time to Repetition (TR) and is typically in the order of 2–2.5 seconds. So every 2–2.5 seconds we can sample brain activation in all areas of the brain at a spatial resolution expressed in voxel size (Figure 14.1B). Voxels are the small cubes from which the BOLD signal is measured. Compare them to pixels on a screen: The screen is not a continuous image, but is divided into small pixels. The same is done with the brain: It is divided into small cubes (“three‐dimensional pixels”). Typical voxel sizes range from 2×2×2 mm to 3×3×3 mm, which means that we are measuring the BOLD signal in cubes of 8–27 mm³. Compared to other non‐invasive techniques for measuring brain activity (e.g., EEG) this is a good spatial resolution. At the same time, it should be noted that when measuring BOLD with fMRI we pool neural activity over a lot of neurons (and other brain cells and veins). The numbers concerning time to repetition and typical voxel size described in the previous paragraph are typical for fMRI studies that use a cognitive task and cover the whole brain. It is possible to increase the spatial precision of fMRI. For instance, suppose that a researcher is only interested in measuring a signal from the parietal cortex. The slices can be positioned in such a way that only the parietal cortex is measured, leaving the rest of the brain out of the field of view. Now that less tissue has to be covered, the researcher can choose to increase spatial precision by making the distance between slices smaller, or to scan faster by decreasing the time to repetition.

Figure 14.1 An anatomical scan of the head and the brain (A), and functional MRI images (B). The yellow box with lines (in A) shows the positioning of the slices. In functional MRI, brain activation (BOLD) is measured every TR (for instance every 2 seconds), slice by slice. In this example the slices are positioned to cover activation across the whole brain. They are overlain on an anatomical scan of the brain. Displayed in B are the collected slices going from the lower part of the brain (Slice 1) to the top part of the brain (Slice 32). The grey values indicate signal intensity, with colors closer to white having higher signal intensity. The image shows the results of one TR of scanning: 32 slices are collected to cover the whole brain. In off-line preprocessing the slices are combined into one image, creating a 3D image of the brain activation. (See insert for color representation of the figure.)

All the scan settings (forming the scan sequence) can be varied independently, and it is up to the researcher to decide what the optimal settings are for a given experiment. There are many parameters that can be varied, and varying one often influences another. It is therefore recommended to consult a person with profound knowledge of fMRI sequences (e.g., an MR physicist) before making changes to a sequence. In many labs “standard” sequences are available, which are optimized for “standard” fMRI experiments. It should be noted that new developments in MR techniques allow for measuring the brain at higher spatial or temporal precision with fMRI. So‐called multiband scan sequences, for instance, allow for measuring BOLD from the whole brain with TRs shorter than 1 second. Note that increasing the temporal precision of the measurement is not the same as increasing the temporal precision of the signal being measured. Put differently, sampling the brain every second does not change the relatively slow nature of the BOLD response. The BOLD response is a property of the neural tissue and is not influenced by how fast we measure it.
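To make these numbers concrete, here is a small back‐of‐the‐envelope sketch in Python; the parameter values (30 slices of 3 mm, a TR of 2 seconds, a 10‐minute run) are illustrative assumptions in line with the text above, not recommendations.

```python
# Back-of-the-envelope numbers for a hypothetical whole-brain fMRI protocol.
# All parameter values are illustrative assumptions.

slice_thickness_mm = 3.0      # thickness of one virtual slice
n_slices = 30                 # slices acquired per TR
tr_s = 2.0                    # time to repetition (whole-brain sampling interval)
voxel_mm = (3.0, 3.0, 3.0)    # in-plane resolution x slice thickness
run_minutes = 10              # length of one functional run

coverage_mm = slice_thickness_mm * n_slices
voxel_volume_mm3 = voxel_mm[0] * voxel_mm[1] * voxel_mm[2]
n_volumes = int(run_minutes * 60 / tr_s)

print(f"Axial coverage: {coverage_mm:.0f} mm")          # 90 mm covered
print(f"Voxel volume:   {voxel_volume_mm3:.0f} mm^3")   # 27 mm^3 per voxel
print(f"Volumes/run:    {n_volumes}")                   # 300 brain volumes
```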

Nature of the Stimuli

Many fMRI experiments on language take existing paradigms from psycholinguistics and look at which neural regions are activated during a task. This means that there is no particular type of stimulus tied to fMRI. In essence all kinds of stimuli that can be used in behavioral experiments can be used with fMRI (e.g., phonemes, words, sentences, stories; auditory, tactile, visual stimuli; see below for exceptions). This also means that issues of experimental design are the same as in any (behavioral) psychological/psycholinguistic experiment. Improper matching of stimuli or tasks, or the lack of appropriate control conditions, renders interpretation of the results difficult or impossible. This is not unique to fMRI, but we still stress the point: A badly designed study will not yield interesting results, no matter what the dependent measure is. Binder and colleagues (2009) illustrate this point in their meta‐analysis of fMRI studies investigating semantic processing. Most fMRI studies rely on the principle of trial averaging: The experiment consists of several conditions, a suitable number of trials per condition is collected, and the dependent variable (the BOLD response in voxels) is averaged across the trials for each condition (a minimal sketch of this idea follows below).
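As a minimal illustration of trial averaging, the following sketch epochs a single voxel’s time course around stimulus onsets and averages the epochs per condition. The data, onsets, and epoch length are fabricated for the example.

```python
import numpy as np

# Hypothetical single-voxel BOLD time course sampled once per TR (TR = 2 s),
# plus made-up stimulus onsets (in scans) for two conditions.
bold = np.random.randn(300)                # 300 scans of demo data
onsets = {"A": [10, 50, 90, 130], "B": [30, 70, 110, 150]}
epoch_len = 10                             # scans per epoch (20 s at TR = 2 s)

averages = {}
for condition, starts in onsets.items():
    # Cut one epoch per trial and stack them: shape (n_trials, epoch_len).
    epochs = np.stack([bold[s:s + epoch_len] for s in starts])
    # Trial averaging: the mean time course for this condition.
    averages[condition] = epochs.mean(axis=0)

print({c: a.shape for c, a in averages.items()})  # {'A': (10,), 'B': (10,)}
```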

Constraints

The participant in an fMRI study lies on his or her back and can see visual stimuli via a mirror attached to the head coil above the eyes. Because of the strong magnetic field, visual stimuli are presented onto this mirror via a projector placed outside of the scanner room. While collecting images the MR machine makes a lot of noise, and participants need to wear ear protection to avoid hearing damage. There are, however, dedicated inner ear phones that allow for presenting auditory stimuli while minimizing disturbance from the scanner noise. For anything from single words up to single sentences and extended pieces of discourse this works well. For the presentation of phonemes the interference of the scanner noise is sometimes considered too disturbing, and so‐called sparse scanning sequences can be used. These are scanning sequences

in which the machine is not collecting images during presentation of the stimuli (and hence no loud noise is emitted), but only after presentation of the stimuli. This approach takes advantage of the slowness of the BOLD response, at the expense of not sampling brain activation continuously. In our experience, recent advances in scanner hardware as well as presentation equipment (e.g., headphones) render this option unnecessary for the bulk of auditory language experiments. Another constraint when designing experiments for fMRI is that (head) motion is detrimental to the data. Participants are asked to lie as still as possible, and in many labs the head is additionally stabilized to further reduce head motion. One way of doing this is to place small cushions between the sides of the head and the head coil, which arguably sounds a lot nicer than “fixating the head.” Because of the importance of avoiding motion, speech production studies were traditionally avoided in fMRI (instead, Positron Emission Tomography, PET, was the preferred method, see below). While the concern about large movements is justified (see the analysis section), recent studies show that it is possible to obtain a reliable signal while participants speak in the scanner (e.g., Segaert, Menenti, Weber, Petersson, & Hagoort, 2012). One way of avoiding head motion in the scanner is to ask participants to plan their verbal response and to speak it out only slightly later. The analysis can then be focused on the planning phase, which is typically not contaminated by motion (e.g., Willems et al., 2010). In our opinion, the most serious constraint in designing an fMRI experiment has to do with the intertrial interval (ITI). Because the BOLD response is so slow (Figure 14.2), one cannot present stimuli with the ITIs typical of behavioral experiments. If one were to present a new word one second after the previous word has ended (ITI = 1 s), the BOLD curves of the successive words would start to overlap and the resulting response would plateau, meaning that there would be no variance left in the response.


Figure 14.2 Example of an idealized BOLD curve, sometimes called the hemodynamic response function (HRF). The curve peaks around 6–8 seconds after stimulus onset (stimulus onset is point 0), and has a post‐stimulus undershoot. Note that the time axis (x‐axis) is in TRs, with one TR being 2 seconds. The y‐axis expresses signal intensity in arbitrary units.

One way of solving this is to wait until the BOLD response has returned to baseline and only then present the next stimulus. This calls for very long ITIs, for instance 16 seconds between two stimuli. This approach is called a slow event‐related approach, and while it is feasible, there are two clear disadvantages. First, it increases the duration of the experiment enormously. Second, the experiment becomes very boring to the participants, increasing the chance that they fall asleep. One solution (not the preferred one) is to use a blocked design. In a blocked design, stimuli from one condition are presented together in a block, and the ITI can be short since the analysis will focus on brain activation during the whole block, not on single trials. Blocked designs were very popular in the early days of fMRI research (and they still have their merits in the sense of being simple and effective), but the main concern with them is that randomization of conditions is not possible. The better solution to the long ITIs of slow event‐related designs is to use a fast event‐related design. Fast event‐related designs use relatively short ITIs (on average around 3–4 seconds), but make sure that the ITI duration varies over trials. That is, the ITI should not always be the same, but should vary in duration (see Miezin, Maccotta, Ollinger, Petersen, & Buckner, 2000). The reason for this is that the variable ITI will induce variation in the BOLD signal. Although the BOLD curves will start to overlap, if there is enough variation in this overlap, the response to trials from a given condition can still be estimated. Conditions can be (pseudo‐)randomized in this scheme, and they should be. There are several toolboxes available that can help researchers select an optimal sequence of trials. A matter of debate is the range of ITIs to select. In our experience a range with a mean around 4 seconds works well, but the reader should consult the literature for other opinions. In addition to being variable, the ITIs should not be multiples of the TR. That is, if the TR is 2 seconds, the ITIs that are used should not be 2, 4, and 6 seconds (a minimal sketch of generating such jittered ITIs is given at the end of this section). As a final note we want to draw attention to recent developments in which participants are presented with continuous language. Some of these studies rely on trial averaging, but take advantage of the natural separation of events of interest to estimate the BOLD response associated with a certain phenomenon (e.g., event segmentation, see Zacks et al., 2001; see also Nijhof & Willems, 2015). A variant of this is to present the stimuli very rapidly (e.g., with ITIs under 1 second), while ensuring that there is enough variation in a particular characteristic of the stimuli. Yarkoni and colleagues (2008) pioneered this approach for language studies by presenting words in rapid succession and investigating the neural response to a large number of psycholinguistic variables related to the words (e.g., lexical frequency and age of acquisition).
Since all words vary in, for instance, lexical frequency, it is possible to estimate which areas are sensitive to this characteristic, despite the fact that the BOLD curve for single words has plateaued (see Willems, Frank, Nijhof, Hagoort, & Bosch, 2015, for a comparable approach). Other approaches do not rely on averaging at all; the interested reader is referred to Andric and Small (2015) for an overview.
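To illustrate the jittering logic described above, here is a minimal sketch that draws variable ITIs while avoiding multiples of the TR; the bounds, tolerance, and random seed are illustrative choices rather than recommendations.

```python
import random

def jittered_itis(n_trials, tr=2.0, lo=2.5, hi=5.5, seed=1):
    """Draw variable intertrial intervals (in seconds) that avoid TR multiples.

    Values are drawn uniformly from [lo, hi] and redrawn whenever they fall
    within 0.25 s of a multiple of the TR, so that trial onsets do not lock
    to the scan grid. Bounds and tolerance are illustrative assumptions.
    """
    rng = random.Random(seed)
    itis = []
    while len(itis) < n_trials:
        iti = rng.uniform(lo, hi)
        # Distance from the nearest multiple of the TR must exceed 0.25 s.
        if abs(iti / tr - round(iti / tr)) * tr > 0.25:
            itis.append(round(iti, 2))
    return itis

itis = jittered_itis(n_trials=40)
print(sum(itis) / len(itis))  # mean ITI, close to the 4 s recommended above
```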

Collecting and Analyzing Data

When the decisions about the scan sequence (see above) have been made, collecting data is relatively easy. The participant is positioned in the scanner, and after some information has been entered (e.g., the age and weight of the participant), the scanner can be put into scanning mode and will start collecting data. The experimenter sits outside of the magnet room and communicates with the participant via a speaker system. While the machine is collecting data, making a lot of noise, communication is not possible; the participant can give a signal to the outside via an alarm button. Data collection as such is hence a more or less automatic process. However, before data collection can start, it is very important that the participant is well instructed and that safety measures are taken.

Instructing a participant about the procedure and his or her rights as a human subject is obviously important in any experiment. An extra step in fMRI is the explanation of safety risks and procedures, and informing the participant about the noise the machine makes and about the importance of lying as still as possible. Every lab will have its own safety procedures in place. Here we can only give a brief summary of the safety risks and how to minimize them. The risks of fMRI are potentially severe, but can be well controlled. The main risk concerns metal. No ferromagnetic metal should be brought into the magnet room. Because the magnet is so strong, it exerts very strong forces on any metal that gets close to it. Small objects (coins, keys, pens, scissors) can become very dangerous when someone is inside the MRI scanner: Since the object is pulled toward the magnet with great force, there is a considerable chance that it will hurt the person inside the scanner. Anybody entering the magnet room should therefore get rid of metal objects, and it should be clearly indicated that the magnet is always on, even when the scanner is not in operation (not collecting images). Another potential threat from metal is that it can slowly heat up because of the radio frequency emission from the head coil. Examples are necklaces, earrings, metal in bras, and also some types of tattoos, which can contain small pieces of metal. These should be taken off if possible. Sometimes metal cannot be removed, such as surgical steel that remains in the body after surgery, or orthodontic wires placed to correct the positioning of teeth. Surgical steel is typically not problematic, because it is not ferromagnetic, but it is important that laboratories seek specialist advice to decide which policy to adopt in cases like these. As pointed out earlier, care should be taken to provide ear protection against the loud noise the MR machine makes when in operation.

Side effects of fMRI generally do not occur. The most frequent reason for participants to refrain from taking part or to withdraw from the experiment is claustrophobia. The participant lies still in a rather small space, and is constrained to avoid head movements. People with a tendency toward claustrophobia tend to dislike this situation. The setting can be explained to participants beforehand, or they can be familiarized with it by first going into a mock (“fake”) scanner. Another side effect that is sometimes reported is mild nausea and/or a metallic taste. This can occur when the participant moves too quickly into or out of the magnetic field, but it goes away quickly.

Data Analysis

FMRI data analysis is sometimes considered complicated compared to data analysis in other neuroimaging techniques. One reason is that the data sets are larger, so that data handling becomes a real issue. Several toolboxes and analysis packages are available, both open source and commercial; examples include FSL, SPM, AFNI, and BrainVoyager. FMRI data are initially stored in a manufacturer‐specific native file format. In order to analyze them, the data are converted from this native format to a format that can be read by all analysis packages. The most widely used format at present is the NIfTI format, with file extension “.nii.” Before statistical analysis is done, several analysis steps are performed, together called “preprocessing.” Here we describe a more or less standard preprocessing sequence as we have used it in numerous studies.

The first preprocessing step is often a correction for small head movements (“motion correction”). A transformation is applied to the data, which aims to align all scans to the first scan. The rationale is that small head movements can be corrected by translating and rotating the subsequent scans slightly, matching them as well as possible to the first scan. Motion is problematic for fMRI because even a slight movement of the head displaces voxels in space: At the beginning of the scan session voxel x could be in a different location than at the end of the experiment, which is undesirable. Moreover, motion can lead to edge artefacts. These show up as intense “activations” at the edges of the brain or near the ventricles. They arise when there was motion during one or more trials of a condition: Brain tissue near the edges or ventricles moves into a part of the image that has very low signal strength (the area outside of the brain, or the cerebro‐spinal fluid of the ventricles). Because this change from brain tissue to outside the brain or the ventricles is very large, it shows up as a large increase in signal.

A next step is slice‐timing correction. Slice‐timing correction is a temporal interpolation which renders the separate slices of each scan (each TR) as if they were acquired at once, making the data better fit the assumptions of the statistical model. Slice‐timing correction is a debated preprocessing step, and there is considerable disagreement about whether the cure is worse than the problem it tries to correct.

Next, a transformation is applied to the data in order to normalize them to a standard space. All brains have different shapes. In order to do group analyses, researchers make all brains look more like a “standard brain.” One such standard brain is the MNI template, and the resulting space is called “MNI space.” The transformation is often computed on the anatomical scan first: The anatomical scan is matched as well as possible to the MNI template brain, and the resulting transformation parameters are then applied to the functional MRI data. Normalization to a standard brain space is quite a crude analysis step. As an analogy, suppose that one tried to make all hands look as much as possible like one “standard hand.” Quite some stretching and pulling would have to be done to the image of each hand, essentially robbing it of its original shape.
The advantage of normalization is that all brains in the sample look more alike and that results can be reported in coordinates (“MNI coordinates” or “Talairach coordinates”), increasing comparability across studies. A final step is spatial smoothing, in which the data are spatially “blurred” using a filter with a Gaussian kernel. The idea here is that since brains and brain locations differ across people (even after normalization), we want to smooth the data to account for this spread, typically with a kernel of around 8 mm full width at half maximum (FWHM; for a Gaussian kernel, FWHM ≈ 2.355 σ). This is arguably the strangest step in preprocessing. Remember that fMRI has an advantage over other neuroimaging methods because of its high spatial resolution. Ironically, by spatial smoothing we partially give up this advantage: We blur the spatial resolution and effectively make it lower. The advantage for group analysis drives this final step.

It is possible to avoid normalization‐based group analysis by localizing a certain area of interest in each participant before spatial normalization, and subsequently doing group statistics on the activation levels extracted from this region across subjects. This works well when there is a clear prediction about which brain area plays a role and when this area can be localized. Localizers can be anatomical or functional. Anatomical localization takes place by reference to a brain atlas; for instance, it is possible to determine where a certain Brodmann area is in a particular subject’s brain (Eickhoff et al., 2005). Functional localizers define areas of interest by means of their function. One use of this approach is to localize parts of the language network in each participant individually, which overcomes the problems associated with spatial normalization (Fedorenko, Hsieh, Nieto‐Castañón, Whitfield‐Gabrieli, & Kanwisher, 2010). The areas from the localizer are subsequently used to test the main experimental question of interest.

Statistical analysis of the data involves the creation of a statistical model, which describes the expected hemodynamic signal over time per condition. In an experiment with four experimental conditions (see the exemplary study below), this means that we define four regressors, based on when each stimulus was presented and its duration. The time vector with onsets and durations is convolved with the hemodynamic response function (Figure 14.2) to account for the delay in the BOLD response. This model is then fitted to each voxel’s time course separately in a multiple regression framework. This is similar to multiple regression on behavioral data, except that it is done many times, that is, once for each voxel’s time course. The outcome is a map with beta values (expressing the weight for each regressor) per condition, per voxel. A next step involves the testing of statistical contrasts. Now that we know how well each voxel’s time course fits each condition (the beta weights), we can ask which voxels are more implicated in processing stimuli of Condition A as compared to Condition B. One way of doing this is to compute a T‐statistic for this contrast for each voxel, and make a contrast T‐map per subject (with one t‐value per voxel, see Figure 14.3). Group statistics can then be done by performing a one‐sample t‐test (testing against zero) across participants. Again, this one‐sample t‐test is done for each voxel (a minimal sketch of this mass‐univariate approach is given below). Since the test is done so many times, there is a considerable multiple comparisons problem (MCP).
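Before turning to the MCP, here is a minimal sketch of the mass‐univariate GLM just described, on synthetic data: a double‐gamma HRF (one common parameterization, not the only one), regressors obtained by convolving stimulus onsets with the HRF, a per‐voxel least‐squares fit, and a contrast t‐map for Condition A versus Condition B. All data and parameter values are fabricated for the example; in practice packages such as FSL, SPM, AFNI, or BrainVoyager handle these computations.

```python
import numpy as np
from scipy.stats import gamma

def hrf(tr=2.0, length=32.0):
    """Double-gamma hemodynamic response function (one common parameterization)."""
    t = np.arange(0, length, tr)
    peak = gamma.pdf(t, 6)            # positive response
    undershoot = gamma.pdf(t, 16)     # post-stimulus undershoot
    h = peak - undershoot / 6.0
    return h / h.sum()

n_scans, tr = 200, 2.0
onsets = {"A": [10, 60, 110, 160], "B": [35, 85, 135, 185]}  # in scans

# One regressor per condition: a boxcar of stimulus events convolved with the
# HRF to account for the delayed BOLD response; last column is the intercept.
X = np.ones((n_scans, 3))
for j, cond in enumerate(["A", "B"]):
    boxcar = np.zeros(n_scans)
    boxcar[onsets[cond]] = 1.0
    X[:, j] = np.convolve(boxcar, hrf(tr))[:n_scans]

# Fit the model to every voxel's time course at once (mass-univariate GLM).
Y = np.random.randn(n_scans, 1000)              # fake data: 1000 voxels
beta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Contrast A > B: t = c'beta / sqrt(sigma^2 * c'(X'X)^-1 c), per voxel.
c = np.array([1.0, -1.0, 0.0])
resid = Y - X @ beta
dof = n_scans - np.linalg.matrix_rank(X)
sigma2 = (resid ** 2).sum(axis=0) / dof
c_var = c @ np.linalg.inv(X.T @ X) @ c
t_map = (c @ beta) / np.sqrt(sigma2 * c_var)    # one t-value per voxel
print(t_map.shape)                              # (1000,)
```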
The MCP is a statistical problem: Without correction, the probability of Type 1 errors (false positives) becomes unacceptably large. Traditional corrections such as the Bonferroni method are too conservative here: They fall victim to an unacceptably large probability of Type 2 errors (false negatives). One often‐used solution involves combining a voxel‐level threshold (an uncorrected p‐value, typically set at p < 0.001 or p < 0.005) with a minimal region extent, such that only sufficiently large activation regions “survive” the thresholding (see the sketch below). How to achieve the optimal balance between sensitivity (avoiding false negatives) and replicable findings (avoiding false positives) is a matter of considerable debate in fMRI data analysis (see Bennett et al., 2009, for an excellent review).

Other ways of analyzing fMRI data that have recently gained in popularity are multi‐voxel pattern analysis (MVPA) and model‐based fMRI. In multi‐voxel pattern analysis, the multivariate response in sets of voxels is the unit of analysis, instead of many single voxels one by one. The latter is called a mass‐univariate approach and is the one that we described above. MVPA, for instance, allows one to detect differences between conditions which depend not on overall changes in activation strength, but on differences in activation patterns (e.g., Kok, Jehee, & de Lange, 2012). Model‐based fMRI is a method in which a computational model’s accuracy is tested by pitting its predicted brain response against the actual brain responses (the fMRI data). An interesting feature of this approach is that it allows for creating brain maps of particular features of, for instance, a story to which participants were listening (Wehbe et al., 2014).
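The voxel‐threshold‐plus‐extent heuristic just mentioned can be sketched as follows, using connected‐component labelling to find contiguous suprathreshold regions. The threshold and minimal extent k are illustrative placeholders; as noted above, choosing them so that false positives are properly controlled is exactly the debated part, so this should not be read as a calibrated correction.

```python
import numpy as np
from scipy import ndimage
from scipy.stats import t as t_dist

# Hypothetical group t-map on a small 3D grid (one t-value per voxel).
t_map = np.random.randn(40, 48, 40)        # fake data for the example
dof = 19                                   # e.g., 20 participants - 1

# Step 1: voxel-level threshold, here uncorrected p < 0.001 (one-sided).
t_crit = t_dist.ppf(1 - 0.001, dof)
above = t_map > t_crit

# Step 2: keep only contiguous regions ("blobs") of at least k voxels.
k = 20                                     # minimal cluster extent, arbitrary
labels, n_clusters = ndimage.label(above)  # connected-component labelling
sizes = ndimage.sum(above, labels, index=range(1, n_clusters + 1))
surviving = np.isin(labels, [i + 1 for i, s in enumerate(sizes) if s >= k])

print(f"{int(surviving.sum())} voxels survive the voxel+extent threshold")
```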

Figure 14.3 A statistical map overlaid on an anatomical brain scan. The anatomical brain scan is normalized into MNI space (see text). The yellow statistical maps show voxels (or sets of voxels, sometimes called blobs) in which one condition had higher activation than another condition. The yellow coloring indicates the height of the T‐value of the comparison, which was cut off at an arbitrary value (only T‐values higher than 2.6 are shown). These are the results for one participant. Note that the color coding is arbitrary (we could have chosen any color) and does not reflect neural activation, but the outcome of a statistical test performed at each voxel. The underlying assumption is that the statistical test reflects differences in neural activation between the two conditions. (See insert for color representation of the figure.)

An Exemplary Study

A range of experiments shows that reading literal sensori‐motor language can lead to sensori‐motor simulation, such as when participants respond faster when reading hand‐action words (“to throw”) than when reading action verbs that are not performed with the hand (“to giggle”). A debated topic has been whether and how such simulation occurs for metaphorical language use. Samur and colleagues (2015) used fMRI to investigate the impact of emotional context on the embodied understanding of metaphor. Participants read short “stories” (around three sentences long), which were followed by a target sentence. The scenarios rendered the reading of the target sentence literal or metaphorical (factor 1 in the design) and more or less emotional (factor 2 in the design). An example of a literal or metaphorical reading of a target sentence is “He pushed it away,” which could refer to a book or a thought being pushed away (literal versus metaphorical). There were four conditions in this 2×2 design. The stimuli were presented in an event‐related fashion, and there were variable time intervals between trials, as well as between the presentation of the short story and the target sentence. Group statistics were done with a 2×2 ANOVA design, and the multiple comparisons problem was addressed by combining a voxel‐level threshold with a minimal extent, in number of voxels, that an activated region should have. The researchers also defined several a priori regions of interest. Some of these were based on separate functional data from the same participants. For instance, visual motion areas were localized by having participants engage in a motion localizer task: Participants watched displays of moving dots or non‐moving dots, or did nothing. Comparing the blocks during which participants saw moving versus non‐moving dots revealed activation in visual areas known to be involved in motion perception. Subsequently, the effects of the manipulation in the main task (about emotion and metaphor) were tested in these areas. That is, the beta weights for each of the four conditions were extracted from these motion areas and an ANOVA was performed on them. Motor cortex regions of interest were determined in two ways: one functional (participants moving their hands or feet) and one anatomical (regions taken from a cytoarchitectonic probability map). The main finding was that an area involved in the detection of visual motion (human area MT) was sensitive to metaphorical motion, but only when the metaphor was embedded in an emotional context. This was taken as evidence for a special link between metaphor and emotions. One advantage of using fMRI in this example is that participants did not have to do a task related to the manipulation of interest. A second advantage is the specificity of fMRI: By carefully relating the observed activations to the functional characterization of these regions, it was possible to state that the effect of emotion on embodied metaphor comprehension involves sensori‐motor simulation.

Advantages and Disadvantages

FMRI has an advantage over other methods in terms of spatial resolution, showing which brain areas are activated in a given cognitive process. The temporal resolution is rather low because of the delayed BOLD response: The BOLD response lags behind neural activation by several seconds. Another method with good spatial resolution is Positron Emission Tomography (PET), which has the additional advantage that motion is not as detrimental to the data. The reason why fMRI is much more popular for cognitive neuroimaging than PET is that PET involves injecting a radioactive tracer into the bloodstream. This is an invasive step, whereas fMRI is non‐invasive (nothing is injected into a participant in an fMRI study). There are several disadvantages of fMRI: The costs are relatively high, both for data acquisition and for maintenance of the hardware. Participants have to lie still, making the method less suited for studies involving movement (although we saw above that speech production fMRI studies have been remarkably successful). The machine makes a lot of noise and is unpleasant for people susceptible to claustrophobia. We also saw that some safety issues play a role in fMRI, mainly related to metal. With proper safety regulations in place, though, fMRI is in practice a rather safe method.

FNIRS

Apparatus and Signal

Instrumentation

Turning now to fNIRS, the instrumentation is somewhat different, even if the underlying phenomenon being measured is the same. Just like in fMRI, we measure the hemodynamic consequences of local brain activation. However, a key difference is that we rely on blood’s light absorption properties. We provide here a brief, non‐technical introduction to the most commonly used implementation of fNIRS (see Ferrari et al., 2004, for a more detailed overview). Like in EEG and unlike in fMRI, most fNIRS data are collected via a cap or pad that is placed on the participant’s head. This cap is equipped with a number of optodes that are hooked onto a fNIRS system, which controls the delivery of light and measures the light that is detected (Figure 14.4). Most fNIRS systems are almost silent, and often relatively small and portable. For instance, the Hitachi ETG‐4000 (which is built into a wheeled cart) is about 1 m × 0.4 m × 0.4 m. The UCL‐NTS system is even smaller, and can be likened to a cube with 40 cm sides. The length of the cables leading up to the optodes can be customized, but is often about 2 m. With the right apparatus (for instance, one assuring tight contact between the optode and the skin), a normal room with no special lighting or other conditions can be used.

Figure 14.4 Image of a 5‐month‐old infant wearing a fNIRS cap, including a schematic illustration of the path of light between a source (star) and a detector (circle), through the scalp (dashed line) and cortical tissue (in gray).

BOLD Measured Via Light Absorption

As noted above, participants wear a cap onto which optodes have been placed. There are two types of optodes: sources, which emit light (typically at two wavelengths in the infra‐red range), and detectors, which pick up light (often across a broad range of infra‐red wavelengths). The light emitted from a source diffuses throughout all underlying tissue (hair, skin, bone, cerebrospinal fluid, and so on), as well as the hemoglobin in the blood traveling through the brain. The light that is picked up at a detector will typically have traveled in a banana‐shaped path from a nearby source. We call such a source‐detector combination a channel. Naturally, some light will be lost, so it is challenging to use light absorption levels to estimate absolute hemoglobin concentrations. Instead, we more often analyze changes in absorption as a function of time (Figure 14.5). Insofar as the other tissues do not change as a function of our stimulation (an issue to which we return below), the relative changes in light absorption over time can be taken to reflect changes in hemoglobin concentration in the brain. To take a specific example, let us imagine that we are interested in the cognitive processing of speech versus silence and thus have decided to present blocks of 20 seconds of speech followed by 15–25 seconds of silence. Presumably the former will evoke greater brain activity than the latter in primary and secondary auditory regions. This greater activity would result in an influx of oxygenated hemoglobin, as explained above. The change in hemoglobin concentration in those regions from one moment to the next would alter the amount of light being absorbed, which constitutes the signal fNIRS relies on. We thus expect changes in light absorption over time in the speech condition to be larger than those found in the silence condition. Notice that, unlike in fMRI, data are acquired continuously rather than once per TR. The spatial resolution of fNIRS is typically 1–3 cm, depending on the way the optodes are arranged and the distance between sources and detectors.
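The conversion from detected light to hemoglobin concentration changes is commonly done with the modified Beer–Lambert law, solving a small linear system across the two wavelengths. The sketch below illustrates that logic; the extinction coefficients, source–detector distance, and differential path‐length factor are placeholder values, not calibrated constants.

```python
import numpy as np

# Convert raw light-intensity time series at two wavelengths into changes in
# oxygenated (HbO) and deoxygenated (HbR) hemoglobin via the modified
# Beer-Lambert law. All constants below are illustrative placeholders.

eps = np.array([[1.5, 3.8],                  # [HbO, HbR] extinction at ~760 nm
                [2.5, 1.8]])                 # [HbO, HbR] extinction at ~850 nm
d_cm, dpf = 3.0, 6.0                         # source-detector distance, path factor

def mbll(intensity_760, intensity_850):
    """Return (dHbO, dHbR) time series from raw intensities at two wavelengths."""
    # Change in optical density relative to each channel's mean intensity.
    dod = np.array([-np.log(i / i.mean()) for i in (intensity_760, intensity_850)])
    # Solve the 2x2 system eps @ [dHbO, dHbR] = dOD / (d * DPF) per sample.
    conc = np.linalg.solve(eps, dod / (d_cm * dpf))
    return conc[0], conc[1]

raw_760 = 1.0 + 0.01 * np.random.randn(1000)   # fake detector readings
raw_850 = 1.0 + 0.01 * np.random.randn(1000)
d_hbo, d_hbr = mbll(raw_760, raw_850)
print(d_hbo.shape, d_hbr.shape)                # (1000,) (1000,)
```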

Nature of Stimuli and Data

Most common psycholinguistic tasks can be used in combination with fNIRS (lexical decision, self‐paced reading, grammaticality judgments, and so on; see Rossi, Telkemeyer, Wartenburger, & Obrig, 2012, for an overview). Nonetheless, the same considerations relating to the sluggish BOLD response discussed above apply here: Very fast‐paced tasks will not give the vascular response enough time to build up to a level our method can actually detect. If the fNIRS cap has been tightly secured on the participant’s head, the participant is usually allowed some degree of movement. For this reason, it is increasingly common to see fNIRS studies on the production of speech, gesture, and sign (e.g., Kovelman, Shalinsky, Berens, & Petitto, 2014) or on tasks involving face‐to‐face interactions (as in hyperscanning, where multiple people’s brain activity is recorded at the same time; e.g., Cheng, Li, & Hu, 2015). In fNIRS, data are gathered from 10 to 40 channels, and since each source emits light at two wavelengths, about 20 to 80 time series are generated. The data are expressed as (changes in) light intensity in their raw form, and then converted into (changes in) oxygenated and deoxygenated hemoglobin (Rossi et al., 2012). Most often a procedure is followed that produces one additional time series containing information on the stimulation, timed tasks that the participant carried out, and/or post‐hoc coding of events of interest (e.g., errors in production that have been coded offline).


Figure 14.5 Sample of signal in fNIRS studies. The top panel shows estimated oxygenation levels from 9 channels, each emerging from the combination of a detector and a source. Highlighted regions have been automatically labelled as “artifacted” based on the speed of the signal change. The bottom panels show averages that have been time‐locked to the onset of stimulation (10 seconds of sound) for oxygenated (red) and deoxygenated (blue) hemoglobin among 40 newborns (left) and 24 adults (right). The dotted red and blue lines represent the best fit using a variable phase; the lighter dashed lines indicate observed 95% confidence intervals over participants. (See insert for color representation of the figure.)

Collecting and Analyzing Data

When conducting a fNIRS study, it is crucial to reflect on whether this technique is suitable to answer one’s research question. This entails performing calculations to estimate the ideal placement of the optodes, to ensure that the brain regions of interest are indeed in the path of travel of a source‐detector combination (see Tsuzuki et al., 2007). When using a commercial system, such as those by Hitachi, one can buy ready‐made pads with optode holders in a specified geometry. Alternatively, one can attempt to create a pad with a more convenient geometry. In this case, interoptode distances should be carefully selected: If the source and the detector are placed very close together, the light does not travel very deep into the tissue and might not reach the brain at all. (Indeed, some fMRI studies use such surface fNIRS signals to remove, from the fMRI signal, variation due to global blood flow patterns.) Conversely, if the source and detector are too far apart, precision in localization is sacrificed, as the changes observed could reflect activity in many different gyri and sulci. As this explanation suggests, fNIRS is a relatively young technique, with perhaps less standardization and more “tinkering” being done in individual labs than, for instance, in EEG studies. It might even be the case that the most inventive of tinkerers will nonetheless be unable to design a cap allowing a channel to reach certain brain areas, or will not be reasonably certain that the channel reaches them in a majority of the participants, in which case fNIRS would not be appropriate. For example, if one is interested in assessing the role of the basal ganglia in language production, it would be better to turn to fMRI, as these deep structures can hardly be measured precisely using a scalp‐based method like fNIRS.

It is good practice to pilot one’s study with 5 participants and run all analyses to (a) make sure that the equipment is working correctly (e.g., the standard hemodynamic response following stimulation is observed, and there is little data loss); and (b) carry out power analyses to estimate how many trials and how many participants will be needed. Piloting is important because certain aspects of a fNIRS implementation are almost impossible to predict perfectly. For instance, as explained above, most fNIRS users must rely on a mapping between surface structures (where we place our pad, which depends on landmarks such as the ears) and underlying brain structures, because they will not concomitantly gather MRI data. If the experimenter carefully places the cap and if the study involves a population with little individual variation, this mapping can be quite precise (Tsuzuki et al., 2004). However, calculations regarding the placement of the optodes on the scalp to pick up a specific region of interest can be wrong, or it may be difficult even for a well‐trained experimenter to place the cap reliably. For instance, if one is specifically interested in the involvement of the supramarginal gyrus, separately from the posterior superior temporal gyrus, it might be challenging to design a pad, and place it such, that different channels tap these two regions. It is therefore advisable to start with a power analysis that uses the size of the effect and the variance found in the actual implementation.
If this reveals that many more participants are needed than the experimenters are ready to run, they can go back to tinkering with the cap, or perhaps alter the study design to reduce the involvement of the brain regions they are not interested in (a minimal sketch of such a power calculation follows at the end of this section). Finally, experimenters must be trained in positioning the optodes and in troubleshooting. For instance, if there is poor contact in one channel, the experimenter might try to move hair out of the way, verify the correct attachment of the optode to the scalp, and/or alter the intensity of the light at the sources to compensate for different hair colors.
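A power calculation of the kind recommended above might look as follows, assuming a paired (within‐participant) comparison; the pilot values are fabricated, and `TTestPower` from statsmodels is one of several tools that could be used.

```python
import numpy as np
from statsmodels.stats.power import TTestPower

# Suppose piloting yielded, per participant, a mean oxygenated-hemoglobin
# change in the channel of interest (condition minus baseline). These pilot
# values are fabricated for illustration.
pilot = np.array([0.08, 0.02, 0.11, -0.01, 0.06])   # 5 pilot participants
effect_size = pilot.mean() / pilot.std(ddof=1)      # Cohen's d, paired design

# Solve for the number of participants needed at alpha = .05, power = .80.
n_needed = TTestPower().solve_power(effect_size=effect_size,
                                    alpha=0.05, power=0.80)
print(f"d = {effect_size:.2f}; ~{int(np.ceil(n_needed))} participants needed")
```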

Data Analysis

As mentioned above, fNIRS data consist of time series that can be analyzed much as EEG and fMRI data are. Indeed, within the fNIRS community we observe EEG‐like analyses, using time‐locked averages of signal level that are thereafter submitted to analyses of variance, and fMRI‐like analyses, where general linear models (GLM) are employed both to reduce the temporal dimension and to assess systematic changes in signal level associated with a type of time‐locked event.

The first stage in both types of analyses is usually pre‐processing, where artifacts (regions of large signal changes, or an overly low or high signal) are detected and, for researchers not using general linear models, signals are detrended (removing slow linear drift in the time course). Most investigators today use fixed thresholds for this first denoising, a procedure that is not time‐consuming and does not require human annotation. Regions of artifact detected in this way are most commonly discarded in averaging analyses and given a weight of zero in GLM analyses. Sometimes, if a channel has little data left, the whole of its data is removed from consideration. Similarly, participants who have data for few channels may be altogether excluded. Data loss is extremely variable across populations and implementations.

The next stage typically involves the reduction of the time series into concrete events. Common analysis methods involve averaging over a block of stimulation, or fitting a general linear model and extracting an event‐related response, and then plotting the hemoglobin level as a function of time from the start of some key event (a minimal sketch of these two stages follows below). Other analyses, such as calculating correlations across channels and incorporating these data into network analyses, are also possible. Often, these stages are followed by a third, in which targeted analyses are carried out. For example, the researcher might extract the average hemoglobin change during reading blocks and listening blocks, for each participant and hemisphere separately, and then use inferential statistics (e.g., ANOVA or t‐tests) to test for group differences across tasks and hemispheres.

One unique aspect of fNIRS is that data are typically collected for both oxygenated and deoxygenated hemoglobin. There are therefore at least three options for dependent measures: oxygenated, deoxygenated, and total hemoglobin. There are ongoing discussions as to which of these is most appropriate and/or provides the largest effects (see Rossi et al., 2012; and Lloyd‐Fox, Blasi, & Elwell, 2010, for diverging views on the topic). There are many freeware Matlab scripts that can be used to perform the first two stages of analysis, and some more organized packages are also available. One widely used package is HomER (Huppert, Diamond, Franceschini, & Boas, 2009). Some fNIRS systems come with their own proprietary software, and many researchers have additionally developed in‐house analysis schemes.
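Here is a minimal sketch of the first two stages for a single channel: fixed‐threshold artifact flagging based on the speed of the signal change, detrending, and block averaging. The sampling rate, threshold, and block timing are illustrative assumptions.

```python
import numpy as np
from scipy.signal import detrend

fs = 10.0                                   # sampling rate (Hz), illustrative
hbo = np.random.randn(6000) * 0.01          # fake oxy-Hb series for one channel
block_onsets = [600, 1800, 3000, 4200]      # sample indices of stimulation blocks
block_len = 300                             # 30 s of stimulation at 10 Hz

# Stage 1: flag artifacts with a fixed threshold on the speed of signal change,
# then remove slow linear drift (for an averaging-style analysis).
speed = np.abs(np.diff(hbo, prepend=hbo[0]))
artifact = speed > 0.05                     # fixed threshold, arbitrary here
clean = detrend(hbo)

# Stage 2: time-lock epochs to block onset and average, skipping epochs that
# contain flagged samples (in a GLM they would instead get a weight of zero).
epochs = []
for s in block_onsets:
    if not artifact[s:s + block_len].any():
        epochs.append(clean[s:s + block_len])
block_average = np.mean(epochs, axis=0)
print(len(epochs), block_average.shape)     # usable epochs, (300,)
```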

An Exemplary Study

One of the key questions in psycholinguistics concerns the organization of language networks in the brain as a function of when a language is learned. Is the same brain network employed when processing one’s native language (L1) and a foreign language learned later (L2)? Sugiura et al. (2011) conducted a large‐scale fNIRS study on this topic. They studied one cohort of about 400 children enrolled in 7 different schools, all between 8 and 10 years of age. For all of them, Japanese was the native language (L1), and English (their L2) was being learned at school, in a private institution, and/or at home. In a nutshell, they found that the topography of the network engaged by the L1 and the L2 was overall very similar. Moreover, the same trend toward increasing left‐dominance as a function of increased lexical knowledge was observed in both languages. However, they also observed some salient differences, such as larger responses for L1 than L2, which were surprisingly located in temporo‐parietal regions traditionally associated with phonological rather than lexical processing (but see below for details on the task).

Children were tested in a van that was parked at their school or nearby, and which contained all of the equipment. The experiment as a whole took less than 10 minutes. After the cap was placed, children heard one word at a time (in English or Japanese) and had to try to repeat it as closely as possible (a phonologically heavy task, which may explain why language differences were observed mainly in phonology‐associated areas). To probe different levels of lexical knowledge, half of the words in each language were of high frequency and half of low frequency. For the fNIRS section of the experiment, they were presented in 6 blocks of 5 words, each block drawn from the same language and frequency level.

Readers may wonder how the researchers could be certain that they were tapping the specific brain regions later interpreted as the loci of phonological or lexical processing. In fact, as soon as the fNIRS task was completed, a 3D electromagnetic digitizer was used to record the positions of the optodes and the scalp landmarks, which thereafter allowed virtual registration to estimate the brain regions that each source‐detector combination tapped. To confirm that this estimation was accurate, the researchers also gathered magnetic resonance data from 30 of the children. The localizations calculated on these MR scans matched those calculated via virtual registration, thus confirming the accuracy of the estimation based on virtual registration.

As for the data analyses, first, artifacts were identified using a combination of automatic thresholding and visual inspection. It was noted that two channels were systematically artifacted; further inspection revealed that they were actually placed over the temporalis muscle. Second, preliminary tests were used to determine regions of interest (ROIs), which were defined on the basis of channels that were significant in at least one hemisphere and task. Third, the average oxygenated and deoxygenated hemoglobin concentration changes for each hemisphere, task, and child were then submitted to analyses of variance, separately for each ROI, and Bonferroni correction was applied to control for the number of ROIs only.

Advantages and Disadvantages

Our exemplary study illustrates some salient advantages and disadvantages of using fNIRS in psycholinguistics. To begin with, the study makes use of the portability of fNIRS, bringing the lab to the school for ease of recruitment and potentially larger ecological validity, while still ensuring comparability across schools by keeping the immediate environment (the van) constant.

Moreover, fNIRS might be one of the least time‐consuming neuroimaging techniques available. Once experimenters are trained, collecting fNIRS data is extremely easy and fast, as cap placement, if done correctly, most often requires only a couple of minutes. Undoubtedly, the impressive sample size in Sugiura et al. (2011) is due in part to their choice of a portable setup, and in part to this ease of testing. One salient advantage of fNIRS is its relative insensitivity to participant motion (see Lloyd‐Fox et al., 2010, for discussion), another reason that led Sugiura and colleagues to opt for fNIRS. fNIRS is thus very appropriate for use with individuals who find it challenging or impossible to keep perfectly still, such as awake infants and young children, and it can easily be combined with tasks requiring a small degree of movement from the participant, such as speaking.

Other advantages of fNIRS are that it is relatively inexpensive, at least when compared with MEG and fMRI (with some systems going for as little as US$40,000), and that it is compatible with therapeutic devices (such as cochlear implants) and most other methods, including fMRI and EEG. But because both fNIRS and eye‐tracking use infra‐red light, those desiring to combine these two methods should make sure they buy or build equipment that uses non‐overlapping wavelengths.

As alluded to previously, there are two salient risks related to the fact that fNIRS measurements are drawn from a cap placed on the head. First, we have to be careful in relating scalp positions to the underlying brain structures. Our exemplary study addressed this challenge by (1) using a 3D digitizer to keep track of each individual’s cap position and using a well‐validated method of localization estimation, and (2) further validating these estimations with MRI data from a subset of participants. The latter step may not be possible for all researchers, but it may not be necessary if the studied population is not very dissimilar to that used in the benchmark studies. Regardless, we strongly recommend that all fNIRS users follow Sugiura et al.’s example to boost the reliability and precision of the observed neuroimaging effects. The second risk associated with working on the surface of the skull relates to the channels that are lost due to muscular artifacts. fNIRS signals in general are not sensitive to slight movements of the arms or head, but they are extremely sensitive to event‐related changes in local tissue composition. Recall that light travels from a scalp source to a detector, and thus picks up any local changes, for example, changes in blood concentration in the skin (e.g., if the person blushes) or changes in local optical properties when a muscle contracts (as in the temporalis example above, or if the person frowns, for probes placed on the forehead).
Our exemplary study also illustrates one salient challenge facing fNIRS and other methods where multiple dependent measures and multiple analyses are possible: that of using “researcher’s degrees of freedom” to better describe the data, which unfortunately inflates the risk of false positives (Simmons, Nelson, & Simonsohn, 2011).

In conclusion, in this chapter we described two research methods in cognitive neuroscience that rely on hemodynamic signals: fMRI and fNIRS. Both measure changes in the ratio between oxygenated and de‐oxygenated blood in the brain. They rely on the change in oxygen use in brain tissue that becomes activated. fMRI and fNIRS thus measure brain activation indirectly, non‐invasively, and with relatively good spatial resolution, with fMRI outperforming fNIRS in this respect.

Acknowledgments

This work was supported by grants from the Dutch Organisation for Scientific Research (NWO‐Vidi 276‐89‐007), and by grants ANR‐14‐CE30‐0003 MechELex, ANR‐10‐IDEX‐0001‐02 PSL*, and ANR‐10‐LABX‐0087 IEC. Luca Filippin and Emmanuel Dupoux are acknowledged for the software used to generate one of the fNIRS figures.

Key Terms

BOLD – Blood Oxygenation Level Dependent signal The signal that is measured in fMRI and fNIRS studies. It relies on the relative difference between oxygenated and de‐oxygenated blood in the brain.
fMRI – functional Magnetic Resonance Imaging A technique in which changes in the oxygenation of blood are measured with magnetic resonance and related to a cognitive or psychological process.
fNIRS – functional Near InfraRed Spectroscopy A technique using light absorption to measure changes in the concentration of oxygenated and deoxygenated hemoglobin in tissue, including cortical brain regions.

References

Andric, M., & Small, S. L. (2015). fMRI methods for studying the neurobiology of language under naturalistic conditions. In R. M. Willems (Ed.), Cognitive neuroscience of natural language use. Cambridge, UK: Cambridge University Press.
Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4, 417–422. https://doi.org/10.1093/scan/nsp053
Binder, J. R., Desai, R. H., Graves, W. W., & Conant, L. L. (2009). Where is the semantic system? A critical review and meta‐analysis of 120 functional neuroimaging studies. Cerebral Cortex, 19, 2767–2796. https://doi.org/10.1093/cercor/bhp055
Cheng, X., Li, X., & Hu, Y. (2015). Synchronous brain activity during cooperative exchange depends on gender of partner: A fNIRS‐based hyperscanning study. Human Brain Mapping, 36, 2039–2048. http://doi.org/10.1002/hbm.22754
Eickhoff, S. B., Stephan, K. E., Mohlberg, H., Grefkes, C., Fink, G. R., Amunts, K., & Zilles, K. (2005). A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. NeuroImage, 25, 1325–1335. http://doi.org/10.1016/j.neuroimage.2004.12.034
Fedorenko, E., Hsieh, P.‐J., Nieto‐Castañón, A., Whitfield‐Gabrieli, S., & Kanwisher, N. (2010). New method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology, 104, 1177–1194. https://doi.org/10.1152/jn.00032.2010
Ferrari, M., Mottola, L., & Quaresima, V. (2004). Principles, techniques, and limitations of near infrared spectroscopy. Canadian Journal of Applied Physiology, 29, 463–487. http://doi.org/10.1139/h04-031

Huppert, T. J., Diamond, S. G., Franceschini, M. A., & Boas, D. A. (2009). HomER: A review of time‐series analysis methods for near‐infrared spectroscopy of the brain. Applied Optics, 48, D280–D298. http://doi.org/10.1364/AO.48.00D280
Kok, P., Jehee, J. F. M., & de Lange, F. P. (2012). Less is more: Expectation sharpens representations in the primary visual cortex. Neuron, 75, 265–270. http://doi.org/10.1016/j.neuron.2012.04.034
Kovelman, I., Shalinsky, M. H., Berens, M. S., & Petitto, L.‐A. (2014). Words in the bilingual brain: An fNIRS brain imaging investigation of lexical processing in sign‐speech bimodal bilinguals. Frontiers in Human Neuroscience, 8. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4139656/
Lloyd‐Fox, S., Blasi, A., & Elwell, C. E. (2010). Illuminating the developing brain: The past, present and future of functional near infrared spectroscopy. Neuroscience & Biobehavioral Reviews, 34, 269–284. http://doi.org/10.1016/j.neubiorev.2009.07.008
Miezin, F. M., Maccotta, L., Ollinger, J. M., Petersen, S. E., & Buckner, R. L. (2000). Characterizing the hemodynamic response: Effects of presentation rate, sampling procedure, and the possibility of ordering brain activity based on relative timing. NeuroImage, 11(6 Pt 1), 735–759. http://doi.org/10.1006/nimg.2000.0568
Nijhof, A. D., & Willems, R. M. (2015). Simulating fiction: Individual differences in literature comprehension revealed with fMRI. PLoS ONE, 10, e0116492. http://doi.org/10.1371/journal.pone.0116492
Rossi, S., Telkemeyer, S., Wartenburger, I., & Obrig, H. (2012). Shedding light on words and sentences: Near‐infrared spectroscopy in language research. Brain and Language, 121, 152–163. http://doi.org/10.1016/j.bandl.2011.03.008
Samur, D., Lai, V. T., Hagoort, P., & Willems, R. M. (2015). Emotional context modulates embodied metaphor comprehension. Neuropsychologia, 78, 108–114. http://doi.org/10.1016/j.neuropsychologia.2015.10.003
Segaert, K., Menenti, L., Weber, K., Petersson, K. M., & Hagoort, P. (2012). Shared syntax in language production and language comprehension—an fMRI study. Cerebral Cortex, 22, 1662–1670. http://doi.org/10.1093/cercor/bhr249
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False‐positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://doi.org/10.1177/0956797611417632
Sugiura, L., Ojima, S., Matsuba‐Kurita, H., Dan, I., Tsuzuki, D., Katura, T., & Hagiwara, H. (2011). Sound to language: Different cortical processing for first and second languages in elementary school children as revealed by a large‐scale study using fNIRS. Cerebral Cortex, 21, 2374–2393. http://doi.org/10.1093/cercor/bhr023
Tsuzuki, D., Jurcak, V., Singh, A. K., Okamoto, M., Watanabe, E., & Dan, I. (2007). Virtual spatial registration of stand‐alone fNIRS data to MNI space. NeuroImage, 34, 1506–1518. http://doi.org/10.1016/j.neuroimage.2006.10.043
Wehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A., & Mitchell, T. (2014). Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE, 9, e112575. http://doi.org/10.1371/journal.pone.0112575
Willems, R. M., de Boer, M., de Ruiter, J. P., Noordzij, M. L., Hagoort, P., & Toni, I. (2010). A cerebral dissociation between linguistic and communicative abilities in humans. Psychological Science, 21, 8–14. http://doi.org/10.1177/0956797609355563
Willems, R. M., Frank, S. L., Nijhof, A. D., Hagoort, P., & van den Bosch, A. (2015). Prediction during natural language comprehension. Cerebral Cortex, bhv075. http://doi.org/10.1093/cercor/bhv075
Yarkoni, T., Speer, N. K., Balota, D. A., McAvoy, M. P., & Zacks, J. M. (2008). Pictures of a thousand words: Investigating the neural mechanisms of reading with extremely rapid event‐related fMRI. NeuroImage, 42, 973–987. http://doi.org/10.1016/j.neuroimage.2008.04.258

Zacks, J. M., Braver, T. S., Sheridan, M. A., Donaldson, D. I., Snyder, A. Z., Ollinger, J. M., … Raichle, M. E. (2001). Human brain activity time‐locked to perceptual event boundaries. Nature Neuroscience, 4, 651–655. http://doi.org/10.1038/88486

Further Reading

Boas, D. A., Elwell, C. E., Ferrari, M., & Taga, G. (2014). Twenty years of functional near‐infrared spectroscopy: Introduction for the special issue. NeuroImage, 85(Part 1), 1–5. http://doi.org/10.1016/j.neuroimage.2013.11.033
Huettel, S. A., Song, A. W., & McCarthy, G. (2004). Functional magnetic resonance imaging. Sunderland, MA: Sinauer Associates.
Rossi, S., Telkemeyer, S., Wartenburger, I., & Obrig, H. (2012). Shedding light on words and sentences: Near‐infrared spectroscopy in language research. Brain and Language, 121, 152–163. http://doi.org/10.1016/j.bandl.2011.03.008

15 Structural Neuroimaging

Stephanie J. Forkel and Marco Catani

Abstract

The field of the neuroanatomy of language is moving forward at a fast pace. This advancement is partially due to developments in magnetic resonance imaging (MRI) and in particular MRI‐based diffusion tractography, the latter allowing scientists to study connections in the living human brain non‐invasively. For the field of language studies this advancement is timely and important for two reasons. First, it liberates scientists from neuroanatomical models of language derived from animal studies. Second, it permits testing network correlates of linguistic models directly in the human brain. This chapter introduces general principles of MRI, diffusion MRI, and tractography (many technical terms will be explained in the Key Terms section; these are printed in italics on their first occurrence in the main text). An exemplary study will be used to illustrate the versatility of these methods in the realm of language studies, whilst discussing advantages and limitations of diffusion methods. Their non‐invasiveness and wide availability will continue to provide new insights that will challenge our current understanding of the brain's language network.


Introduction

Structural imaging based on computerized tomography (CT) and magnetic resonance imaging (MRI) has progressively replaced traditional post‐mortem studies in the process of identifying the neuroanatomical basis of language. In the clinical setting, the information provided by structural imaging is used to confirm the exact diagnosis and formulate an individualized treatment plan. In the research arena, neuroimaging has made it possible to study neuroanatomy at the individual and group level. The possibility of obtaining quantitative measures of lesions has improved correlation analyses between severity of symptoms, lesion load, and lesion location. More recently, the development of structural imaging based on diffusion MRI has provided valid solutions to some of the major limitations of more conventional imaging. In stroke patients, diffusion imaging can visualize early changes that are otherwise not detectable with more conventional structural imaging, with important implications for the clinical management of acute stroke patients. Beyond this sensitivity to early changes, diffusion imaging tractography offers the possibility of visualizing the trajectories of individual white matter pathways connecting distant regions. A pathway analysis based on tractography offers a new perspective in neurolinguistics. First, it permits the formulation of new anatomical models of language function in the healthy brain and allows these models to be tested directly in the human population, without any reliance on animal models. Second, by defining the exact location of damage to specific white matter connections, we can understand the contribution of different mechanisms to the emergence of language deficits (e.g., cortical versus disconnection mechanisms). Finally, a better understanding of the anatomical variability of different language networks is helping to identify new anatomical predictors of language recovery. In this chapter we focus on the principles of structural MRI and, in particular, diffusion imaging and tractography, and we present examples of how these methods have informed our understanding of variance in language performance in the healthy brain and of language deficits in patient populations.

Assumptions and Rationale

In the last 30 years, advances in the field of structural imaging have primarily originated from progressive improvements in the spatial resolution of CT and MRI sequences, automatic methods for group‐level analysis, and the development of diffusion imaging. Increased spatial resolution of structural images has enabled scientists to obtain more precise quantitative measurements of cortical anatomy, in the form of thickness, surface, and volume, and a better delineation of cortical and subcortical lesions. Diffusion imaging, on the one hand, is highly sensitive to tissue damage and, on the other hand, makes it possible to visualize and quantify white matter connections between cortical brain regions in the living human brain. When combined with automatic methods for tissue classification and group‐level statistics, this has led to significant new insights into the anatomy of language. In addition, diffusion imaging has revealed tracts that are unique to the human brain and identified correlations between lesions to specific tracts and the severity of behavioral symptoms. In this section we briefly discuss how these approaches are applied to study language in healthy volunteers and in patients with language deficits.

Structural Imaging Methods Based on Conventional MRI

Current algorithms for structural imaging analysis are able to differentiate neuronal tissue into gray matter, white matter, and cerebrospinal fluid (CSF) and to extract quantitative measurements in single subjects and across large populations. These brain morphometry methods require an excellent contrast between different tissues (gray and white matter, CSF) to define gray matter density, gray matter volume, and the inner and outer surface of the cortex. Tissue classification improves with increasing spatial resolution of the imaging sequences. Different automatic processing approaches to brain morphometry analysis have been developed, including voxel‐based morphometry (VBM), deformation‐based morphometry (DBM), and surface‐based morphometry (SBM).
VBM is a fully automated technique that aims at estimating local differences in tissue composition after minimizing gross anatomical differences between individuals (Ashburner & Friston, 2000). This is achieved by, first, estimating tissue classification based on T1‐weighted images. Second, the segmentation mask (gray matter or white matter) is spatially linearly normalized to a standard space to ensure that a specific voxel is at the same anatomical location across subjects. Third, to reduce the influence of inter‐individual anatomical variability, spatial smoothing is applied. After correction for intensity non‐uniformities, voxel intensities are measured and compared between groups or correlated with behavioral measurements (Ashburner & Friston, 2000). Finally, the results are corrected for multiple comparisons to avoid type I errors (false positive results). With VBM it is possible to either analyze the entire brain or focus on specific regions of interest (Geva, Baron, Jones, Price, & Warburton, 2012; Leff et al., 2009; Rowan et al., 2007). In the healthy brain, VBM has been used on large datasets to understand structural characteristics of language‐related areas. For example, Good et al. (2001) studied 465 healthy volunteers to show significant leftward asymmetry in Heschl's gyrus, the frontal operculum, the superior and inferior frontal sulci, and limbic structures. When combined with other measures, VBM aids the exploration of structural–functional relationships. Dorsaint‐Pierre et al. (2006), for example, showed no correlation between language dominance (assessed with the Wada test) and asymmetry of gray matter concentration in posterior language areas (assessed with VBM) in epileptic patients. However, when more anterior language regions in the frontal lobe were analyzed, a significant correlation emerged.
Deformation‐based morphometry (DBM) has been developed as a complementary method to VBM to partially overcome the limitations due to potential misregistration. In DBM, non‐linear registration algorithms are used to register the native image to a reference template, and deformation field matrices are computed. The statistical analysis is then performed on the deformation matrices rather than on the registered voxels. In other words, DBM analyzes how much the voxel volumes change during image registration to the reference template, in contrast to VBM, which focuses on the residual image variability after its transformation. DBM is a preferred method for investigating longitudinal changes, for example, in patients with progressive neurodegenerative disease (Brambati et al., 2015; Heim et al., 2014).
Finally, surface‐based morphometry (SBM) offers the possibility of analyzing separate features of gray matter anatomy, such as surface area, cortical thickness, curvature, and volume. While thickness measures may provide some indication of underlying neuronal loss, reduced size of neuronal cell bodies, or degradation, surface area measures may reflect underlying white matter fibers (Van Essen, 1997). Similar to VBM, SBM requires a tissue segmentation of high‐resolution T1‐weighted images. However, in SBM the surface boundary between white and gray matter (the inner boundary of the cortex) and the boundary between gray matter and CSF (the outer or pial surface) are calculated separately. The output file is a scalar value measured in millimeters, which indicates the distance between the inner and the outer surface for each vertex (Fischl & Dale, 2000). These techniques construct and analyze surfaces that represent structural boundaries between different tissues within the brain. As such, SBM differs from the VBM and DBM approaches, which ultimately analyze image properties within individual voxels. SBM is widely used in neurodevelopmental and neurodegenerative language disorders, where the boundaries between cortex and white matter are preserved and reliable cortical measures of thickness, surface area, and volume can be obtained (Ecker et al., 2016; Rogalski et al., 2011). In primary progressive aphasia (PPA) patients, for example, Rogalski et al. (2011) used SBM to investigate a specific correspondence between the pattern of cortical thinning and the language deficit profile.
When applied to stroke patients, all automatic methods listed above have some shortfalls due to problems related to tissue classification and image normalization, especially when lesions are large. Some authors have tried to overcome these limitations by developing dedicated lesion‐based methods, which rely on the delineation of a lesion to estimate statistical associations between damaged tissue and behavioral deficits. Multiple algorithms are currently available to perform lesion‐deficit analysis, including voxel‐based lesion symptom mapping (VLSM) (Bates et al., 2003; see Chapter 16 for details), non‐parametric mapping (NPM) (Rorden, Karnath, & Bonilha, 2007), and Anatomo‐Clinical Overlapping Maps (AnaCOM) (Kinkingnéhun et al., 2007; see also Foulon et al., 2017). These software packages differ with regard to their required input data (e.g., binary versus continuous scores), statistical analysis (parametric versus non‐parametric), underlying assumptions on voxel independence (e.g., single‐voxel versus voxel‐cluster analysis), and required study designs (e.g., number and demographics of groups for comparison). Despite these differences, all lesion‐deficit approaches need to fulfill the same prerequisites: accurate and precise anatomical delineation of the lesions, neuropsychological assessments with high diagnostic sensitivity to the cognitive processes of interest, and reliable statistical methods to associate lesion characteristics with behavioral deficits (Medina, Kimberg, Chatterjee, & Coslett, 2010).
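To make the shared logic of these lesion‐deficit packages concrete, the sketch below implements a minimal VLSM‐style voxelwise comparison in Python. It is an illustration only, not the algorithm of any of the packages cited above: the inputs (`lesions`, `scores`) are hypothetical, and real implementations add permutation statistics and principled multiple‐comparison handling.

```python
import numpy as np
from scipy.stats import ttest_ind

def vlsm_t_map(lesions, scores, min_patients=4):
    """Voxelwise t-map: patients lesioned at a voxel vs. patients spared there.

    lesions: (n_patients, X, Y, Z) binary lesion masks in a common space.
    scores:  (n_patients,) behavioral scores (e.g., an aphasia severity index).
    """
    t_map = np.zeros(lesions.shape[1:])
    for vox in np.ndindex(*lesions.shape[1:]):
        hit = lesions[(slice(None),) + vox] > 0   # which patients are lesioned here
        # Test only voxels lesioned (and spared) in enough patients for a
        # stable two-sample comparison.
        if hit.sum() >= min_patients and (~hit).sum() >= min_patients:
            t, _ = ttest_ind(scores[hit], scores[~hit])
            t_map[vox] = t
    return t_map
```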

Diffusion‐Weighted Imaging

Diffusion‐weighted imaging (DWI) based on MRI was initially applied to the brain in the mid‐1980s (Le Bihan et al., 1986), and its potential for studying stroke‐related changes was promptly recognized (Moseley et al., 1990). The much later development of tractography algorithms (Mori et al., 1999; Conturo et al., 1999; Basser et al., 2000) made it possible to visualize white matter connections in the human brain and to describe, for example, how language networks mature from childhood to adulthood and how neurological and psychiatric disorders affect the anatomy and function of language pathways. Enthusiasm for the first tractography visualizations of white matter pathways was partially due to the resemblance of the in vivo virtual reconstructions to classical post‐mortem dissections (Catani, Howard, Pajevic, & Jones, 2002; Lawes et al., 2008). In addition, it became evident that tractography offered clear advantages compared to invasive methods and could reveal new features of white matter anatomy that are unique to the human brain. For example, it became apparent that the arcuate fasciculus is a rather complex pathway formed by a direct long segment between the classical Broca's and Wernicke's regions and an indirect pathway passing via the inferior parietal lobule (i.e., Geschwind's region). The indirect pathway includes an anterior segment between Broca's and Geschwind's regions and a posterior segment between Wernicke's and Geschwind's regions (Catani, Jones, & ffytche, 2005). The availability of diffusion imaging in large groups of healthy volunteers made it possible to replicate these findings and, at the same time, to identify inter‐individual differences. The three segments of the arcuate fasciculus are present in the left hemisphere of all healthy individuals, but in the right hemisphere the long segment shows great variability. Indeed, it is reported as being bilateral in 40% of the healthy population and extremely left lateralized in the remaining 60%, in whom this segment is either absent or very small in the right hemisphere (Catani et al., 2007). These percentages change when females and males are analyzed separately, with a greater number of males showing an extreme left asymmetry. In recent years, tractography has been used to identify previously undescribed language pathways, such as the frontal aslant tract (FAT), which connects Broca's area to the pre‐supplementary motor cortex and medial prefrontal cortex (Catani et al., 2013). When applied to language disorders, tractography provides diffusion indices that can be used to map white matter degeneration along specific tracts and to reveal a direct association between the severity of tract damage and language deficits.

Apparatus and Nature of the Data

Current MRI scanners make it possible to acquire structural T1‐ and T2‐weighted images, FLAIR, perfusion, and diffusion data in less than one hour. 1.5 or 3 Tesla MRI systems are typically used to acquire MR images by applying a pulse sequence, which contains radiofrequency (RF) pulses and gradient pulses with carefully controlled timings. There are various types of sequences, but they all have timing values, namely the echo time (TE) and repetition time (TR), both of which can be modified by the operator and influence the weighting, or sensitivity, of the image to specific tissues. MRI utilizes the natural properties of hydrogen atoms as part of water or lipids; the most important properties are the proton density (the number of hydrogen atoms in a particular volume) and two characteristic relaxation times, called the longitudinal and transverse relaxation times and denoted T1 and T2, respectively. Relaxation times describe how long a tissue takes to return to equilibrium after an RF pulse. Structural T1‐weighted images are acquired using a short TE/TR, whereas T2‐weighted images are acquired using a long TE/TR (Figure 15.1).
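The dependence of tissue contrast on TE and TR can be illustrated with the standard spin‐echo signal equation, S = PD · (1 − e^(−TR/T1)) · e^(−TE/T2). The sketch below uses approximate relaxation values from the literature (an assumption on our part, not values from this chapter) to show that a short TE/TR darkens CSF (T1 weighting), whereas a long TE/TR makes CSF the brightest tissue (T2 weighting).

```python
import numpy as np

def spin_echo_signal(pd, t1, t2, te, tr):
    """Relative spin-echo signal for proton density pd and relaxation times t1/t2 (ms)."""
    return pd * (1 - np.exp(-tr / t1)) * np.exp(-te / t2)

tissues = {                 # (proton density, T1 ms, T2 ms), approximate 1.5 T values
    "white matter": (0.7, 600, 80),
    "gray matter":  (0.8, 950, 100),
    "CSF":          (1.0, 4000, 2000),
}

for name, (pd, t1, t2) in tissues.items():
    s_t1w = spin_echo_signal(pd, t1, t2, te=15, tr=500)    # short TE/TR: T1-weighted
    s_t2w = spin_echo_signal(pd, t1, t2, te=100, tr=4000)  # long TE/TR: T2-weighted
    print(f"{name:12s}  T1w={s_t1w:.2f}  T2w={s_t2w:.2f}")
# CSF comes out darkest on the T1-weighted settings and brightest on the
# T2-weighted settings, as described in the text.
```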

Figure 15.1 Imaging of an acute patient presenting with anomia following left inferior parietal and frontal lobe stroke. A) Axial non‐contrast computerized tomography (CT) scan demonstrating diffuse hypo‐density in the parietal (thick red arrow) and frontal regions (thin red arrow), predominantly in white matter. The low signal‐to‐noise ratio and low white/gray matter boundary contrast of CT do not allow the exact extent of the damage to be determined. B) T1‐ and T2‐weighted and fluid‐attenuated inversion recovery (FLAIR) images showing structural changes as hypo‐ and hyper‐intense areas in the white matter, respectively. In structural T1‐weighted images there is a clear contrast between white and gray matter, which is less evident in pathological T2‐weighted images. In T2‐weighted images the CSF signal is hyperintense (i.e., brighter) and gray matter appears brighter than white matter; lesions appear hyperintense and may therefore be difficult to distinguish from CSF. In the FLAIR images there is a better contrast between the CSF (hypointense) and the lesion (hyperintense). C) Pulsed continuous arterial spin labelling (pCASL) perfusion‐weighted MRI image of the lesion showing reduced cerebral blood flow (CBF) in a large area in the inferior parietal region and a smaller area in the left frontal lobe. The degree of hypo‐perfusion within the white matter is also noticeable but more difficult to distinguish from the CSF within the lateral ventricles. D) Series of diffusion images showing differences in the exact extension of the lesion depending on the b‐value used to acquire them (non‐diffusion‐weighted image: b=0; diffusion‐weighted: b=500 and b=1500). These images lack the spatial resolution of conventional MRI sequences but are sensitive to acute lesions within minutes. (See insert for color representation of the figure.)

On T2‐weighted images the signal from the cerebrospinal fluid (CSF) in the ventricles and around the cortex is hyperintense, and gray matter appears brighter than white matter. This poses a problem in stroke, as lesions appear hyperintense and may therefore be difficult to distinguish from CSF. To overcome this limitation, a T2‐weighted image with fluid‐attenuated inversion recovery (FLAIR) is often acquired in clinical populations, in which an additional inversion pulse is applied with the purpose of nulling the signal from CSF. This renders CSF nearly fully suppressed so that it appears dark, whilst lesions appear bright. In clinical settings, T1‐ and T2‐weighted images are widely used to characterize lesions due to tumors, traumatic brain injury, infection, neurodegeneration, and chronic stroke, but their sensitivity to acute ischemic changes is low.

Early changes in acute stroke are best detected using perfusion‐ and diffusion‐weighted imaging (Figure 15.1). Perfusion imaging is a method to measure cerebral blood flow (CBF) through the brain. The measurement of tissue perfusion depends on the ability to serially measure the concentration of a tracer agent in the brain. These tracers are often exogenous contrast agents that are injected into the vascular system before acquiring the images. More recently, less invasive sequences have been developed that use magnetic labeling of blood as an endogenous tracer (e.g., arterial spin labelling, ASL) (Alsop & Detre, 1998).
Perfusion imaging is highly sensitive to early ischemic changes, as it measures CBF, which, if reduced for a critical period of time, will cause irreversible damage (Figure 15.1). A mismatch between the lesion extent depicted on T1‐weighted and perfusion images is often used to guide therapeutic decisions, as this mismatch is considered to quantify salvageable tissue at risk.

Diffusion MRI quantifies water diffusion in biological tissues. In neuronal tissue, the displacement of water molecules is not random, due to the presence of biological structures such as cell membranes, filaments, and nuclei. These structures reduce diffusion distances in three‐dimensional space. In the white matter, the overall displacement is reduced unevenly (i.e., it is anisotropic) due to the presence of axonal membranes and myelin sheaths, which restrict water diffusion in the direction perpendicular to the main orientation of the axonal fibers. Diffusion MRI can therefore detect the drop in diffusion in infarcted tissue within minutes of an arterial occlusion. Thereafter the signal stabilizes (pseudonormalization) before it progressively increases to become elevated in the chronic stage.

For diffusion imaging, scanning times depend on various settings, including the b‐value, a factor that reflects the strength and timing of the gradients used for the sequence: the higher the b‐value, the stronger the diffusion effects in the data (Figure 15.1). At a given b‐value, tissue with fast diffusion (e.g., CSF) experiences more signal loss, resulting in low intensity in the image, whilst tissue with slow diffusion (e.g., gray matter) produces high intensity in the image (Figure 15.1). Other important parameters are the number of gradient directions (ideally ≥30 for diffusion tensor studies and ≥60 for High Angular Resolution Diffusion Imaging, HARDI) and the number of non‐diffusion‐weighted images (Jones et al., 2002; Jones, 2008; Dell'Acqua et al., 2013). Non‐diffusion‐weighted scans are important for better fitting the diffusion metrics and for improving the correction of diffusion‐weighted volumes for eddy current and motion artefacts, which is achieved by iterative alignment to the non‐diffusion‐weighted volumes, and for minimizing T1 and T2 shine‐through effects (Le Bihan & Johansen‐Berg, 2012). The rule of thumb is to acquire one non‐diffusion‐weighted scan interleaved between the diffusion‐weighted volumes, usually in a 1:10 ratio.
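The effect of the b‐value on tissue contrast follows from the standard mono‐exponential attenuation model, S(b) = S0 · e^(−b·ADC). The short sketch below uses typical literature ADC values (an assumption on our part) to show why fast‐diffusing CSF is suppressed at b = 1500 s/mm² while slow‐diffusing gray matter retains signal, matching the pattern described above.

```python
import numpy as np

# Apparent diffusion coefficients in mm^2/s; approximate literature values.
adc = {"CSF": 3.0e-3, "gray matter": 0.8e-3}

for b in (0, 500, 1500):                 # b-values in s/mm^2, as in Figure 15.1
    for tissue, d in adc.items():
        s = np.exp(-b * d)               # signal relative to the b=0 image
        print(f"b={b:5d}  {tissue:12s}  S/S0 = {s:.2f}")
# At b=1500, CSF retains about 1% of its signal while gray matter retains
# about 30%: high b-values suppress fast-diffusing tissue.
```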

Collecting and Analyzing Data

Raw data are collected as Digital Imaging and Communications in Medicine (DICOM) files from the scanner and converted to the 4D Neuroimaging Informatics Technology Initiative (NIfTI) format, which can be readily imported into any standard neuroimaging program for visualization and further processing. For diffusion imaging, in addition to the 4D image, a B‐matrix (which contains the gradient table encoding the orientation of the gradients during the acquisition) is extracted to correctly preserve the orientational information by realigning the diffusion‐weighted images to the reoriented B‐matrix (Leemans & Jones, 2009). The B‐matrix is usually provided by the analysis software during the initial processing steps. Prior to modeling, it is essential to perform manual quality control of the raw data (e.g., detecting missing volumes and misorientation of gradient tables) and automatic correction for artefacts (e.g., ghosting, wrapping, and ringing), head motion, and image distortions due to the scanner equipment and environment (e.g., eddy currents, field inhomogeneity, echo planar imaging geometric distortion) (Jones, Knösche, & Turner, 2013). Once these steps have been implemented, tracking algorithms can be chosen to propagate the streamline reconstruction, using tensor or multi‐fiber models and deterministic or probabilistic tracking. Virtual dissections of tractography datasets are used to obtain 3D reconstructions of pathways and tract‐specific measurements along the tracts, such as volume and other diffusion indices calculated from the tensor or the fiber orientation distribution (FOD) (see below). The resulting average values per pathway from each single subject can be submitted to statistical analysis. This makes it possible to create percentage overlay maps for pathways of interest (Forkel, Thiebaut de Schotten, Kawadler et al., 2014b), establish group differences between controls and patients and between patients with different clinical presentations (Catani et al., 2013), detect volumetric left–right differences (Catani et al., 2007; Catani, Forkel, & Thiebaut de Schotten, 2010; Thiebaut de Schotten et al., 2011), and associate structural white matter anatomy with recovery from aphasia after stroke (Forkel, Thiebaut de Schotten, Dell'Acqua et al., 2014a).
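As a minimal sketch of these first processing steps, the Python snippet below loads a converted 4D NIfTI dataset together with its gradient table and performs a basic sanity check. The file names are hypothetical, and real pipelines (e.g., ExploreDTI, FSL) wrap far more elaborate quality control around these steps.

```python
import nibabel as nib
import numpy as np

dwi = nib.load("sub01_dwi.nii.gz")       # 4D volume: x, y, z, diffusion volumes
bvals = np.loadtxt("sub01.bval")         # one b-value per volume
bvecs = np.loadtxt("sub01.bvec")         # gradient table (3 rows x n_volumes)

# Basic quality control: the gradient table must match the number of volumes,
# otherwise the orientational information would be misassigned.
n_vols = dwi.shape[3]
assert bvals.size == n_vols and bvecs.shape == (3, n_vols), "gradient table mismatch"

# Identify the interleaved non-diffusion-weighted (b=0) volumes used for
# motion and eddy-current correction (roughly a 1:10 ratio, as noted above).
b0_idx = np.where(bvals < 50)[0]
print(f"{n_vols} volumes, {b0_idx.size} b=0 volumes at indices {b0_idx.tolist()}")
```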

Diffusion Tensor Imaging

The displacement of water molecules measured in a voxel can be described geometrically as an ellipsoid (the tensor) calculated from the diffusion coefficient values (eigenvalues, λ1–λ3) and orientations (eigenvectors, ν1–ν3) of its three principal axes. A detailed analysis of the tensor can provide precise information not only about the average water molecular displacement within a voxel (e.g., mean diffusivity, MD), but also about the degree of tissue anisotropy (e.g., fractional anisotropy, FA) and the main orientation of the underlying white matter pathways (e.g., the principal eigenvector or color‐coded maps). These indices provide complementary information about the microstructural composition and architecture of brain tissue.

Mean diffusivity (MD) is a rotationally invariant quantitative index that describes the average mobility of water molecules and is calculated from the three eigenvalues (λ1, λ2, λ3) of the tensor (MD = [(λ1 + λ2 + λ3)/3]). Voxels containing gray and white matter tissue show similar MD values (Pierpaoli, Jezzard, Basser, Barnett, & Di Chiro, 1996). MD decreases with age within the first years of life and increases in those disorders characterized by demyelination, axonal injury, and edema (Beaulieu, 2009).

The fractional anisotropy (FA) index ranges from 0 to 1 and represents a quantitative measure of the degree of anisotropy in biological tissue. High FA values indicate a more anisotropic, that is, unequal, diffusion. In the healthy adult brain, FA varies from 0.2 (e.g., in gray matter) to ≥0.8 in the white matter. FA provides information about the organization of the tissue within a voxel (e.g., strongly or weakly anisotropic) and the microarchitecture of the fibers (e.g., parallel, crossing, or kissing fibers). FA is reduced in pathological tissue (e.g., demyelination, edema) and is therefore commonly used as an indirect index of microstructural organization. Perpendicular [(λ2 + λ3)/2] and parallel (λ1) diffusivity describe the diffusivity along the principal directions of diffusion. The perpendicular diffusivity, also referred to as radial diffusivity (RD), is generally considered a more sensitive index of axonal or myelin damage, although the interpretation of changes in these indices in regions with crossing fibers is not always straightforward (Dell'Acqua & Catani, 2012). The principal eigenvector and color‐coded maps are particularly useful for visualizing the principal orientation of the tensor within each voxel (Pajevic & Pierpaoli, 1999).

Diffusion tractography, a family of algorithms able to propagate continuous streamlines from voxel to voxel, can be used to generate indirect measures of tract volume and of microstructural properties along pathways. Tractography‐derived inter‐hemispheric differences in tract volume are widely reported in the literature, especially for language pathways (Catani et al., 2007). In addition to tract volume, for each voxel intersected by streamlines, other diffusion indices can be extracted and a total average can be extrapolated from these. Examples of this application include tract‐specific measurements of fractional anisotropy, mean diffusivity, and parallel and radial diffusivity (Catani, 2006). These can provide important information on the microstructural properties of streamlines and their organization. Asymmetry in FA, for example, could indicate differences in axonal anatomy (intra‐axonal composition, axon diameter, and membrane permeability), fiber myelination (myelin density, internodal distance, and myelin distribution), or fiber arrangement and morphology (axonal dispersion, axonal crossing, and axonal branching) (Beaulieu, 2002). Other diffusion measurements may reveal more specific streamline properties. Changes in axial diffusivity, for example, could be related to intra‐axonal composition, while RD may be more sensitive to changes in membrane permeability and myelin density (Song et al., 2002). These in vivo diffusion‐based measurements allow connectional anatomy to be defined at different scales during development and in the adult brain.
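The tensor‐derived indices defined above can be computed directly from the three eigenvalues of a voxel's tensor. The sketch below uses hypothetical eigenvalues for an anisotropic white matter voxel; the FA expression is the standard normalized variance of the eigenvalues.

```python
import numpy as np

# Hypothetical eigenvalues for an anisotropic white matter voxel (mm^2/s),
# sorted so that lambda1 >= lambda2 >= lambda3.
lambdas = np.array([1.6e-3, 0.4e-3, 0.3e-3])

md = lambdas.mean()        # mean diffusivity: (l1 + l2 + l3) / 3
ad = lambdas[0]            # axial (parallel) diffusivity: l1
rd = lambdas[1:].mean()    # radial (perpendicular) diffusivity: (l2 + l3) / 2

# Fractional anisotropy: normalized variance of the eigenvalues, ranging from
# 0 (isotropic) to 1 (maximally anisotropic).
fa = np.sqrt(1.5 * np.sum((lambdas - md) ** 2) / np.sum(lambdas ** 2))

print(f"MD={md:.2e}  AD={ad:.2e}  RD={rd:.2e}  FA={fa:.2f}")  # FA ~ 0.75 here
```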

Advanced Diffusion Models

One of the major limitations of the tensor model is its inability to estimate multiple fiber orientations. Several non‐tensorial models have been proposed to overcome the limitations of the tensor model; the most commonly employed are briefly described below.

Multiparametric methods, for example, multitensor (Alexander, Barker, & Arridge, 2002; Tuch et al., 2002) or "ball and stick" models (Behrens et al., 2003), are model‐dependent approaches in which the diffusion data are fitted with a chosen model that assumes a discrete number of fiber orientations (e.g., two or more). Nonparametric, model‐independent methods such as diffusion spectrum imaging (DSI) (Wedeen, Hagmann, Tseng, Reese, & Weisskoff, 2005), q‐ball imaging (Tuch, Reese, Wiegell, & Van Wedeen, 2003), and the diffusion orientation transform (Özarslan, Shepherd, Vemuri, Blackband, & Mareci, 2006) have been developed to better characterize the water displacement by using a spherical function, the diffusion orientation distribution function (dODF). Whilst tensor‐based models only visualize one diffusion orientation per voxel, the multilobe shape of the dODF provides information on the number of fiber orientations, their orientation, and the weight of each fiber component within a voxel. A third group of methods takes advantage of both approaches by directly extracting the underlying fiber orientation (i.e., the fiber ODF) using a specific diffusion model for white matter fibers. The latter approaches are usually described as spherical deconvolution methods (Dell'Acqua, Simmons, Williams, & Catani, 2013), and they generally show higher angular resolution (i.e., the ability to resolve crossing fibers at smaller angles) compared with methods based on dODFs (Seunarine et al., 2009; Catani et al., 2012). Spherical deconvolution methods are becoming the methods of choice in an increasing number of studies, as they require acquisition protocols that are close to clinical tractography protocols (e.g., a low number of diffusion gradient directions and b‐values that are accessible on most clinical scanners).

Tractography Reconstructions

Deterministic and probabilistic tractography represent the most widely used approaches to perform 3D reconstructions of white matter trajectories using diffusion data. Compared to deterministic approaches, in which the estimated fiber orientation (e.g., the direction of maximum diffusivity for the tensor model) is assumed to represent the best estimate for propagating streamlines, probabilistic methods generate multiple solutions that also reflect the variability, or "uncertainty," of the estimated fiber orientation (Jbabdi & Johansen‐Berg, 2011). These methods therefore provide additional information on the reproducibility of each tractography reconstruction by mapping the intrinsic uncertainty of individual diffusion datasets. The uncertainty quantified by probabilistic tractography is mainly driven by magnetic resonance noise, partial volume effects, and inaccuracy of the chosen diffusion model. Therefore, the probability of individual maps should not be considered a direct measure of the anatomical probability of the tract. Indeed, in some cases artefactual trajectories can have a high probability, similar to true anatomical pathways. Ultimately, in datasets without noise, both deterministic and probabilistic approaches based on the same diffusion model would generate identical tractography maps. Understanding these basic assumptions underlying probabilistic tractography is important for correctly interpreting the obtained results (Dell'Acqua & Catani, 2012).

Advanced diffusion models that resolve multiple white matter trajectories within a single voxel offer the possibility of describing tracts that are not visible using current diffusion tensor methods. This opens up the possibility of visualizing and describing tracts that until now have been impossible to identify due to methodological limitations (Thiebaut de Schotten et al., 2011; Catani et al., 2012; Parlatini et al., 2017). Although an exact knowledge of these fibers represents a significant step forward in our understanding of human anatomy, it is important to be aware that tractography based on advanced diffusion methods is prone to producing a higher number of false positives than the tensor model. Hence, validation of these tracts with complementary methods, such as intraoperative stimulation studies and post‐mortem staining (Elias, Zheng, Domer, Quigg, & Pouratian, 2012), is necessary before widely applying these anatomical models to clinical populations.
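A minimal sketch of deterministic streamline propagation is given below, assuming a precomputed FA volume and a principal‐eigenvector field (hypothetical numpy arrays in voxel coordinates with isotropic voxels). Production trackers add interpolation, seeding strategies, and careful voxel‐to‐world transforms; a probabilistic variant would instead sample the fiber orientation from a distribution at each step.

```python
import numpy as np

def track(seed, v1, fa, step=1.0, fa_min=0.2, max_angle=45.0, max_steps=2000):
    """Propagate one streamline from `seed` along the principal-eigenvector field."""
    cos_thresh = np.cos(np.radians(max_angle))
    pos = np.asarray(seed, dtype=float)
    points, prev_dir = [pos.copy()], None
    for _ in range(max_steps):
        i, j, k = np.round(pos).astype(int)           # nearest-voxel lookup
        if not (0 <= i < fa.shape[0] and 0 <= j < fa.shape[1] and 0 <= k < fa.shape[2]):
            break                                     # left the image volume
        if fa[i, j, k] < fa_min:                      # stop in low-anisotropy tissue
            break
        d = np.asarray(v1[i, j, k], dtype=float)
        norm = np.linalg.norm(d)
        if norm == 0:
            break
        d /= norm
        if prev_dir is not None:
            if abs(d @ prev_dir) < cos_thresh:        # turn sharper than max_angle
                break
            if d @ prev_dir < 0:                      # eigenvectors are sign-ambiguous
                d = -d
        pos = pos + step * d                          # Euler integration step
        points.append(pos.copy())
        prev_dir = d
    return np.array(points)
```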

Atlasing

Until the advent of tractography, our knowledge of white matter anatomy was based on a small number of influential 19th‐ and early 20th‐century post‐mortem dissection atlases (Burdach, 1819; Déjerine, 1895; Sachs, 1892; Forkel et al., 2015). In common with their contemporary counterparts (Talairach & Tournoux, 1988), these atlases emphasize the average anatomy of representative participants at the expense of variability between participants. In recent years, several research groups have used tractography to produce group atlases of the major white matter tracts (Catani & Thiebaut de Schotten, 2012; Hua et al., 2008; Mori et al., 2005; Rojkova et al., 2016; Wakana et al., 2007). By extracting the anatomical location of each tract from several participants, these atlases provide probability maps of each pathway and quantify their anatomical variability. Such atlases help clinicians to relate focal lesions to nearby tracts and improve clinical–anatomical correlation (Figure 15.2) (Thiebaut de Schotten et al., 2014). It remains to be established, however, how much of this variability is due to true underlying anatomical differences and how much is the result of methodological limitations.
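The percentage maps underlying such atlases reduce to a simple voxelwise average of binary tract masks across subjects, as in the sketch below (the masks are assumed to be already normalized to a common template space; the inputs are hypothetical).

```python
import numpy as np

def probability_map(masks):
    """Voxelwise percentage of subjects whose tract occupies each voxel.

    masks: iterable of 3D binary tract masks of identical shape, one per subject.
    """
    stack = np.stack([np.asarray(m) > 0 for m in masks])
    return 100.0 * stack.mean(axis=0)

# Voxels with high percentages are consistently traversed by the tract across
# subjects; low, scattered percentages reflect inter-individual anatomical
# variability (or registration and tracking error, as cautioned above).
```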

Tract Specific Measurements

Beyond visualizing white matter pathways, tractography facilitates quantitative analyses by extracting diffusion indices along the dissected tract. It is thus possible to characterize the microstructural properties of tissue in the normal and pathological brain and to provide quantitative measurements for group comparisons or individual case studies (Figure 15.2) (Catani, 2006). The interpretation of these indices, however, is not always straightforward, especially in regions containing multiple fiber populations. An example of the complexity of this problem is the increase in fractional anisotropy commonly seen in normal‐appearing white matter regions distant from the lesioned area. Before interpreting these changes as indicative of "plasticity or remodeling," other explanations should be taken into account. In voxels containing both degenerating and healthy fibers, increases in fractional anisotropy values are, in fact, more likely due to the axonal degeneration of the perpendicular fibers (Wheeler‐Kingshott & Cercignani, 2009; Dell'Acqua et al., 2013). The lack of specificity of current diffusion indices (i.e., diffusion changes depend on a number of biological, biochemical, and microstructural factors) and the intrinsically voxel‐specific, rather than fiber‐specific, information derived from current indices have stimulated scientists to work on new methods and novel diffusion indices.

Figure 15.2 Lesion mapping based on T1‐weighted data (A), a diffusion tractography atlas (B), and an example of extracting tract‐based measurements from tractography (C). A) Group‐level lesion overlay percentage maps for an aphasic stroke patient cohort (n=16), reconstructed on an axial template brain and projected onto the left lateral cortical surface. This method identifies the areas most commonly affected by lesions within a group of patients. B) Lesion mask (purple) from a single stroke patient overlaid onto a tractography‐based white matter atlas to extract measures of lesion load on the pathways affected by the lesion. C) Differences in tract‐specific measurements of the frontal aslant tract and uncinate fasciculus between control subjects and patients with the non‐fluent/agrammatic and semantic variants of primary progressive aphasia (PPA). Tractography reconstructions show the fractional anisotropy values mapped onto the streamlines of the frontal aslant tract and uncinate fasciculus of a control subject and two representative patients with the non‐fluent/agrammatic and semantic variants of PPA. Exemplary measurements of fractional anisotropy (FA) are reported for the frontal aslant tract (solid bars) and the uncinate fasciculus (patterned bars). **statistically significantly different versus the semantic group (P < 0.05); ††statistically significantly different versus controls (P < 0.001). IFG: inferior frontal gyrus, MFG: middle frontal gyrus, SFG: superior frontal gyrus, MTG: middle temporal gyrus, STG: superior temporal gyrus. Source: Modified from Forkel et al., 2014 and Catani et al., 2013. (See insert for color representation of the figure.)

More recently, true tract‐specific indices based on spherical deconvolution, which better describe the microstructural diffusion changes of individual crossing fibers within the same voxel, have been proposed. Changes in the hindrance modulated orientation anisotropy (HMOA) (Dell'Acqua et al., 2013), for example, have a greater sensitivity than conventional fractional anisotropy values for detecting degeneration that occurs in only one population of fibers while the other crossing fibers remain intact. In the future, combining tractography with multimodal imaging methods will make it possible to extract even more specific tissue microstructure indices.
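As an illustration of a tract‐specific measurement, the sketch below averages FA along the streamlines of a dissected tract, sampling the FA volume by trilinear interpolation. Here `fa` and `streamlines` are hypothetical inputs in voxel coordinates; dedicated tractography packages compute such along‐tract profiles more robustly.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def mean_tract_fa(streamlines, fa):
    """Average FA over all points of all streamlines.

    streamlines: list of (n_points, 3) arrays in voxel coordinates.
    fa:          3D FA volume as a numpy array.
    """
    samples = [
        map_coordinates(fa, s.T, order=1)   # trilinear FA sample at each point
        for s in streamlines
    ]
    return float(np.concatenate(samples).mean())
```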

An Exemplary Study

In this section, we discuss how Forkel et al. (2014a) used conventional MRI in conjunction with diffusion tractography to identify anatomical predictors of language recovery after stroke. In this study, 18 patients with unilateral first‐ever left hemisphere stroke and language impairment confirmed by the revised Western Aphasia Battery (WAB‐R) (Kertesz, 2007) were prospectively recruited. Language and neuroimaging assessments were performed within two weeks after symptom onset and again after six months. The 45‐minute MRI scan included a high‐resolution structural T1‐weighted volume for lesion analyses and diffusion imaging data with 60 diffusion‐weighted directions (b‐value 1500 s/mm2) and seven interleaved non‐diffusion‐weighted volumes. The matrix size was 128 × 128 × 60 and the voxel size 2.4 × 2.4 × 2.4 mm. Peripheral gating was applied to avoid brain pulsation artefacts.

Diffusion tensor imaging data were preprocessed and corrected for eddy current and motion artefacts through iterative correction to the seven non‐diffusion‐weighted volumes using ExploreDTI (www.exploreDTI.com). Whole‐brain tractography was performed from all brain voxels with a fractional anisotropy >0.2. Streamlines were propagated with a step size of 1 mm, using Euler integration and b‐spline interpolation of the diffusion tensor field (Basser et al., 2000). Where fractional anisotropy was <0.2, or when the angle between two consecutive tractography steps was >45°, streamline propagation was stopped. Tractography dissections of the three segments of the arcuate fasciculus were obtained using a three‐regions‐of‐interest approach, as previously described (Catani et al., 2005). Regions of interest were defined on fractional anisotropy images in the patients' native space and included an inferior frontal region, an inferior parietal region, and a posterior temporal region. All streamlines passing through both the frontal and temporal regions of interest were considered as belonging to the long segment of the arcuate fasciculus. All streamlines between the temporal and parietal regions of interest were classified as the posterior segment of the arcuate fasciculus, and those between the parietal and frontal regions of interest were labelled as the anterior segment. The volume of each segment was calculated as the number of voxels intersected by the streamlines of that segment. To control for the possibility that hemisphere size might drive the volume of the arcuate segments (i.e., a larger hemisphere means a larger arcuate fasciculus), the tract volume was normalized by the hemisphere volume (segment volume/hemisphere volume). The hemispheric volume was obtained using the FMRIB Software Library package (FSL, http://www.fmrib.ox.ac.uk/fsl/). The normalized segment volume was then used for further analysis.

Stroke lesions were manually delineated on T1‐weighted images and saved as lesion masks. Their volume (number of voxels) was extracted using FSL, and the lesion masks were subsequently binarized (i.e., assigning a value of 0 or 1 to each voxel) and normalized to a standard space. Lesion masks were overlaid to create percentage maps of commonly damaged voxels. The average lesion size for this group was 21.62 cubic centimeters (standard deviation = 32.43 cubic centimeters). This number can be obtained by extracting the number of voxels within the lesion mask and multiplying it by the volume of a voxel in the underlying imaging scan.
Standard neuroimaging software will provide this value automatically, without the need for manual calculation. An overlay of the patients' normalized lesions is shown in Figure 15.2A. The aphasia quotient (AQ) was used as a measure of the patients' overall performance on the WAB‐R at the acute stage and at follow‐up. This measure was then entered into a hierarchical regression analysis alongside demographic data (age, sex, education), lesion volume, and the volume of the three arcuate segments. The analysis was run separately for the left and the right hemisphere. For the left hemisphere, adding tractography to the analysis did not significantly improve the prediction of longitudinal aphasia severity. By contrast, in the right hemisphere the addition of the normalized size of the long segment of the arcuate to a model based on age, sex, and lesion size increased the proportion of variance explained at six months from nearly 30% to 57% (Figure 15.3). Of the four predictors, only age and the right long segment were independent predictors; gender and lesion size were marginally significant predictors. These results indicate that the use of structural imaging based on lesion mapping and tractography can help clinicians identify trajectories of language recovery after stroke.
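The volume bookkeeping used in the study reduces to voxel counting, as the sketch below illustrates with nibabel; the file name and voxel counts are hypothetical. Lesion volume is the number of mask voxels times the voxel volume, and tract volumes are normalized by hemisphere volume as described above.

```python
import nibabel as nib
import numpy as np

# Load a binary lesion mask (hypothetical file name) saved as NIfTI.
mask_img = nib.load("sub01_lesion_mask.nii.gz")
mask = mask_img.get_fdata() > 0

# Voxel volume from the header: e.g., 2.4 x 2.4 x 2.4 mm = 13.8 mm^3.
voxel_volume_mm3 = float(np.prod(mask_img.header.get_zooms()[:3]))

# Lesion volume: voxel count x voxel volume, converted to cubic centimeters.
lesion_volume_cc = mask.sum() * voxel_volume_mm3 / 1000.0
print(f"lesion volume: {lesion_volume_cc:.2f} cc")

# Normalized tract volume as in the study (segment volume / hemisphere
# volume), with hypothetical voxel counts measured in the same image space.
segment_voxels, hemisphere_voxels = 1200, 550_000
normalized_segment_volume = segment_voxels / hemisphere_voxels
```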

Advantages and Disadvantages of Diffusion Tractography

The ability to track connections in the living human brain makes it possible to move beyond network models based on non‐human primate tracing and a small number of human post‐mortem studies. This is leading to the description of new tracts, some of which are important for language. In addition, fast acquisition sequences are now available to obtain high‐quality data from patients who are prone to movement artefacts. When tractography is combined with detailed linguistic assessment, neurobiological language models can be directly validated or falsified. However, despite progressive improvements in the spatial resolution of diffusion datasets, tractography is still unable, compared to classical axonal tracing studies, to identify the smallest bundles or to differentiate anterograde and retrograde connections. The level of noise in the diffusion data and the intrinsic MRI artefacts also constitute important factors that affect the precision and accuracy of the measurements and, as a consequence, the quality of the tractography reconstruction (Basser, Pajevic, Pierpaoli, Duda, & Aldroubi, 2000; Le Bihan, Poupon, Amadon, & Lethimonnier, 2006).

Figure 15.3 Anatomical variability in perisylvian white matter anatomy and its relation to post‐stroke language recovery. A) shows the three segments of the arcuate fasciculus in the left and the right hemisphere obtained from a group average. B) shows a regression plot of the volume of the right long segment plotted against the six‐month longitudinal aphasia quotient (AQ, corrected for age, sex, and lesion size). C) Indicates the right long segment for three exemplified patients (indicated in B) presenting with different degrees of language recovery at six months. Source: Modified from Forkel et al., 2014. (See insert for color representation of the figure.) Structural Neuroimaging 303 that affect the precision and accuracy of the measurements and, as a consequence, the quality of the tractography reconstruction (Basser, Pajevic, Pierpaoli, Duda, & Aldroubi, 2000; Le Bihan, Poupon, Amadon, & Lethimonnier, 2006). Finally, diffu- sion tensor tractography assumes that fibers in each voxel are well described by a single orientation estimate, which is a valid assumption for voxels containing only one population of fibers with a similar orientation. The majority of white matter voxels, however, contain populations of fibers with multiple orientations. In these regions fibers cross, kiss, merge, or diverge, and the tensor model is inadequate to capture this anatomical complexity. More recent tractography developments based on HARDI methods and appropriate processing techniques are able to partially resolve fiber crossings. All these limitations may lead to tracking pathways that do not exist (false positive) or fail to track existing ones (false negative). It is evident from all the considerations above that interpretation of tractography results requires experience and a priori anatomical knowledge. This is particularly true for the diseased brain, where alteration and anatomic distortion due to the presence of pathology generate tissue changes likely to lead to a greater number of artefactual reconstructions. Despite these limitations, tractography is the only tech- nique that permits a quantitative assessment of white matter tracts in the living human brain. The recent development of MRI scanners with stronger gradients and multi‐band acquisition sequences represents one of many steps towards a significant improvement of the diffusion tractography approach. The possibility of combining tractography with other imaging modalities will provide a complete picture of the functional anatomy of human language pathways.

Key Terms

Brain morphometry Measures brain structures based on structural MRI data. Techniques include voxel‐based, surface‐based, and deformation‐based morphometry.
Cerebral blood flow (CBF) The blood supply to the brain in a given period of time. In an adult, CBF is typically 750 milliliters per minute, or 15% of the cardiac output. This equates to an average perfusion of 50 to 54 milliliters of blood per 100 grams of brain tissue per minute.
Cerebrospinal fluid (CSF) The fluid surrounding the brain and spinal cord and filling the cavities inside the brain. It is produced within the ventricles of the brain and provides basic mechanical and immunological protection to the nervous system.
Computerized Tomography (CT) An imaging procedure that uses special x‐ray equipment to create anatomical scans.
Contrast Various tissues have different signal intensities, or brightness, on MR images. The differences are described as the image, tissue, or signal contrast and make it possible to define boundaries between tissues, for example, between gray and white matter.
Diffusion‐weighted imaging (DWI) An advanced MRI pulse sequence based upon measuring the random Brownian motion of water molecules within the biological tissue contained in a voxel (3D volume).

Echo time (TE) The time between the radiofrequency pulse and MR signal sampling, corresponding to the maximum of the echo.
Fractional Anisotropy (FA) A measure based on diffusion‐weighted imaging describing the deviation from isotropy (equal diffusion in all directions), measured between 0 (isotropic) and 1 (anisotropic). High FA is found in brain voxels with a minimal amount of crossing fibers.
High Angular Resolution Diffusion Imaging (HARDI) A family of advanced diffusion modeling methods that tries to overcome limitations of diffusion tensor imaging by resolving multiple fiber orientations. The main feature of HARDI approaches is to collect diffusion data along a large number of diffusion directions (≥60) to better characterize certain features of microstructure, such as angular complexity.
Hindrance modulated orientation anisotropy (HMOA) A fiber‐specific diffusion index derived from spherical deconvolution analysis that provides information about white matter anisotropy and microstructural organization. Differently from more common voxel‐based metrics (e.g., FA), which provide only a single average value per voxel, HMOA can have multiple values, one for each distinct fiber orientation resolved by spherical deconvolution.
Magnetic Resonance Imaging (MRI) A non‐invasive imaging technique for obtaining anatomical images based on the magnetic properties of hydrogen atoms.
Mean Diffusivity (MD) A measure based on diffusion‐weighted imaging describing the mean molecular motion, independent of tissue directionality.
Myelin The myelin sheath is a lipid membrane wrapped around nerve axons in a spiral fashion, which provides an electrically insulating layer. The myelin sheath originates from oligodendroglia cells in the central nervous system.
Pulse sequences A group of MRI sequences in which multiple radiofrequency pulses are applied to produce a wide range of contrasts. The most frequent pulse sequences are spin echo, gradient echo, inversion recovery, susceptibility‐weighted imaging, and diffusion.
Radial Diffusivity (RD) A DWI‐based measure describing the diffusivity perpendicular to the axonal fibers, calculated from the mean magnitude of diffusion along the two perpendicular directions that are orthogonal to the overall maximum diffusion direction.
Registration/normalization A neuroimaging method to spatially align a series of images, either from intra‐subject or inter‐subject image volumes, which is utilized in several steps of preprocessing.
Repetition time (TR) The time between two excitation pulses during an MRI acquisition.
Segmentation mask A partition of an image into a set of tissues that compose the image, including masks for gray and white matter, CSF, and lesioned tissue.
Spatial smoothing A process that involves convolving the data with a smoothing kernel in order to increase signal relative to noise, conform the data to a Gaussian field model, and improve intersubject averaging.
Spatial resolution The spatial resolution of an image is determined by the size of its voxels. The smaller the voxel, the higher the resolution; higher resolution makes it possible to better segment tissues and identify lesions.
Standard space In order to compare brain scans, they have to be aligned in a patient‐orientation‐independent space. Often this is achieved by using a reference template brain, a representative image with anatomical features in a coordinate space, which provides a target to align individual images to.

Streamlines Tractography visualizes 3D reconstructions of the preferred orientation of water molecules, which is indicative of the underlying axonal structures. Because the reconstruction is an inference, the term "streamlines" should be used in preference to "axons" or "fibers" when referring to tractography results.
T1-weighted image A basic pulse sequence (short TE/TR), which relies on the longitudinal relaxation after spins have been flipped into a transverse plane by a radiofrequency pulse.
T2-weighted image A basic pulse sequence (long TE/TR), which relies upon the transverse relaxation of the net magnetization vector.
Tractography A method used to reconstruct 3D trajectories of white matter pathways from diffusion data.
Voxel A 3D volume (a volume pixel), associated with a particular x-y-z coordinate in the brain, used in the analysis of 3D brain imaging data.

References

Alexander, D. C., Barker, G. J., & Arridge, S. R. (2002). Detection and modeling of non-Gaussian apparent diffusion coefficient profiles in human brain data. Magnetic Resonance in Medicine, 48, 331–340. http://doi.org/10.1002/mrm.10209
Alsop, D. C., & Detre, J. A. (1998). Multisection cerebral blood flow MR imaging with continuous arterial spin labeling. Radiology, 208, 410–416.
Ashburner, J., & Friston, K. J. (2000). Voxel-based morphometry—the methods. NeuroImage, 11, 805–821.
Basser, P. J., Pajevic, S., Pierpaoli, C., Duda, J., & Aldroubi, A. (2000). In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine, 44, 625–632.
Bates, E., Wilson, S. M., Saygin, A. P., Dick, F., Sereno, M. I., Knight, R. T., & Dronkers, N. F. (2003). Voxel-based lesion–symptom mapping. Nature Neuroscience, 6, 448–450.
Beaulieu, C. (2002). The basis of anisotropic water diffusion in the nervous system – a technical review. NMR in Biomedicine, 15, 435–455. http://doi.org/10.1002/nbm.782
Behrens, T. E. J., Woolrich, M. W., Jenkinson, M., Johansen-Berg, H., Nunes, R. G., Clare, S., et al. (2003). Characterization and propagation of uncertainty in diffusion-weighted MR imaging. Magnetic Resonance in Medicine, 50, 1077–1088. http://doi.org/10.1002/mrm.10609
Brambati, S. M., Amici, S., Racine, C. A., Neuhaus, J., Miller, Z., Ogar, J., et al. (2015). Longitudinal gray matter contraction in three variants of primary progressive aphasia: A tensor-based morphometry study. NeuroImage Clinical, 8, 345–355. http://doi.org/10.1016/j.nicl.2015.01.011
Burdach, C. F. (1819). Vom Baue und Leben des Gehirns. Leipzig: Dyk.
Catani, M. (2006). Diffusion tensor magnetic resonance imaging tractography in cognitive disorders. Current Opinion in Neurology, 19, 599–606.
Catani, M., Allin, M. P. G., Husain, M., Pugliese, L., Mesulam, M. M., Murray, R. M., & Jones, D. (2007). Symmetries in human brain language pathways correlate with verbal recall. Proceedings of the National Academy of Sciences of the United States of America, 104, 17163–17168. http://doi.org/10.1073/pnas.0702116104
Catani, M., Dell'Acqua, F., Vergani, F., Malik, F., Hodge, H., Roy, P., et al. (2012). Short frontal lobe connections of the human brain. Cortex, 48, 273–291. http://doi.org/10.1016/j.cortex.2011.12.001
Catani, M., Forkel, S. J., & Thiebaut de Schotten, M. (2010). Asymmetry of the white matter pathways in the brain. In K. Hugdahl & R. Westerhausen (Eds.), The two halves of the brain (pp. 1–34). Cambridge, MA: MIT Press.

Catani, M., Howard, R. J., Pajevic, S., & Jones, D. (2002). Virtual in vivo interactive dissection of white matter fasciculi in the human brain. NeuroImage, 17, 77–94. http://doi.org/10.1006/nimg.2002.1136
Catani, M., Jones, D., & ffytche, D. H. (2005). Perisylvian language networks of the human brain. Annals of Neurology, 57, 8–16. http://doi.org/10.1002/ana.20319
Catani, M., Mesulam, M. M., Jakobsen, E., Malik, F., Martersteck, A., Wieneke, C., et al. (2013). A novel frontal pathway underlies verbal fluency in primary progressive aphasia. Brain, 136, 2619–2628. http://doi.org/10.1093/brain/awt163
Catani, M., & Thiebaut de Schotten, M. (2008). A diffusion tensor imaging tractography atlas for virtual in vivo dissections. Cortex, 44, 1105–1132. http://doi.org/10.1016/j.cortex.2008.05.004
Catani, M., & Thiebaut de Schotten, M. (2012). Atlas of human brain connections. Oxford: Oxford University Press.
Conturo, T. E., Lori, N. F., Cull, T. S., Akbudak, E., Snyder, A. Z., Shimony, J. S., et al. (1999). Tracking neuronal fiber pathways in the living human brain. Proceedings of the National Academy of Sciences, 96, 10422–10427.
Déjerine, J. J. (1895). Anatomie des centres nerveux. Paris: Rueff et Cie.
Dell'Acqua, F., & Catani, M. (2012). Structural human brain networks: Hot topics in diffusion tractography. Current Opinion in Neurology, 25, 375–383. http://doi.org/10.1097/WCO.0b013e328355d544
Dell'Acqua, F., Simmons, A., Williams, S. C. R., & Catani, M. (2013). Can spherical deconvolution provide more information than fiber orientations? Hindrance modulated orientational anisotropy, a true-tract specific index to characterize white matter diffusion. Human Brain Mapping, 34, 2464–2483. http://doi.org/10.1002/hbm.22080
Dorsaint-Pierre, R., Penhune, V. B., Watkins, K. E., Neelin, P., Lerch, J. P., Bouffard, M., & Zatorre, R. J. (2006). Asymmetries of the planum temporale and Heschl's gyrus: Relationship to language lateralization. Brain, 129, 1164–1176. http://doi.org/10.1093/brain/awl055
Ecker, C., Andrews, D., Dell'Acqua, F., Daly, E., Murphy, C., Catani, M., et al., MRC AIMS Consortium, & Murphy, D. G. (2016). Relationship between cortical gyrification, white matter connectivity, and autism spectrum disorder. Cerebral Cortex, 26, 3297–3309. http://doi.org/10.1093/cercor/bhw098
Elias, W. J., Zheng, Z. A., Domer, P., Quigg, M., & Pouratian, N. (2012). Validation of connectivity-based thalamic segmentation with direct electrophysiologic recordings from human sensory thalamus. NeuroImage, 59, 2025–2034. http://doi.org/10.1016/j.neuroimage.2011.10.049
Fischl, B., & Dale, A. M. (2000). Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proceedings of the National Academy of Sciences of the United States of America, 97, 11050–11055. http://doi.org/10.1073/pnas.200033797
Forkel, S. J., Mahmood, S., Vergani, F., & Catani, M. (2015). The white matter of the human cerebrum: Part I. The occipital lobe by Heinrich Sachs. Cortex, 62, 182–202.
Forkel, S. J., Thiebaut de Schotten, M., Dell'Acqua, F., Kalra, L., Murphy, D. G. M., Williams, S. C. R., & Catani, M. (2014a). Anatomical predictors of aphasia recovery: A tractography study of bilateral perisylvian language networks. Brain, 137, 2027–2039. http://doi.org/10.1093/brain/awu113
Forkel, S. J., Thiebaut de Schotten, M., Kawadler, J. M., Dell'Acqua, F., Danek, A., & Catani, M. (2014b). The anatomy of fronto-occipital connections from early blunt dissections to contemporary tractography. Cortex, 56, 73–84. http://doi.org/10.1016/j.cortex.2012.09.005
Foulon, C., Cerliani, L., Kinkingnehun, S., Levy, R., Rosso, C., Urbanski, M., et al. (2017). Advanced lesion symptom mapping analyses and implementation as BCBtoolkit. http://dx.doi.org/10.1101/133314

Geva, S., Baron, J.-C., Jones, P. S., Price, C. J., & Warburton, E. A. (2012). A comparison of VLSM and VBM in a cohort of patients with post-stroke aphasia. NeuroImage Clinical, 1, 37–47. http://doi.org/10.1016/j.nicl.2012.08.003
Good, C. D., Johnsrude, I. S., Ashburner, J., Henson, R. N. A., Friston, K. J., & Frackowiak, R. S. J. (2001). A voxel-based morphometric study of ageing in 465 normal adult human brains. NeuroImage, 14, 21–36. http://doi.org/10.1006/nimg.2001.0786
Heim, S., Pieperhoff, P., Grande, M., Kuijsten, W., Wellner, B., Sáez, L. E., et al. (2014). Longitudinal changes in brains of patients with fluent primary progressive aphasia. Brain and Language, 131, 11–19. http://doi.org/10.1016/j.bandl.2013.05.012
Hua, K., Zhang, J., Wakana, S., Jiang, H., Li, X., Reich, D. S., et al. (2008). Tract probability maps in stereotaxic spaces: Analysis of white matter anatomy and tract-specific quantification. NeuroImage, 39, 336–347.
Jbabdi, S., & Johansen-Berg, H. (2011). Tractography: Where do we go from here? Brain Connectivity, 1, 169–183. http://doi.org/10.1089/brain.2011.0033
Jones, D. (2008). Studying connections in the living human brain with diffusion MRI. Cortex, 44, 936–952.
Jones, D., Knösche, T. R., & Turner, R. (2013). White matter integrity, fiber count, and other fallacies: The do's and don'ts of diffusion MRI. NeuroImage, 73, 239–254. http://doi.org/10.1016/j.neuroimage.2012.06.081
Jones, D. K., Williams, S. C. R., Gasston, D., Horsfield, M. A., Simmons, A., & Howard, R. (2002). Isotropic resolution diffusion tensor imaging with whole brain acquisition in a clinically acceptable time. Human Brain Mapping, 15, 216–230.
Kertesz, A. (2007). Western Aphasia Battery – Revised. San Antonio: PsychCorp.
Kinkingnéhun, S., Volle, E., Pélégrini-Issac, M., Golmard, J.-L., Lehéricy, S., du Boisgueheneuc, F., et al. (2007). A novel approach to clinical–radiological correlations: Anatomo-Clinical Overlapping Maps (AnaCOM): Method and validation. NeuroImage, 37, 1237–1249. http://doi.org/10.1016/j.neuroimage.2007.06.027
Lawes, N., Barrick, T. R., Murugam, V., Spierings, N., Evans, D. R., Song, M., & Clark, C. A. (2008). Atlas-based segmentation of white matter tracts of the human brain using diffusion tensor tractography and comparison with classical dissection. NeuroImage, 39, 62–79. http://doi.org/10.1016/j.neuroimage.2007.06.041
Le Bihan, D., & Johansen-Berg, H. (2012). Diffusion MRI at 25: Exploring brain tissue structure and function. NeuroImage, 61, 324–341. http://doi.org/10.1016/j.neuroimage.2011.11.006
Le Bihan, D., Breton, E., Lallemand, D., Grenier, P., Cabanis, E., & Laval-Jeantet, M. (1986). MR imaging of intravoxel incoherent motions: Application to diffusion and perfusion in neurologic disorders. Radiology, 161, 401–407. http://doi.org/10.1148/radiology.161.2.3763909
Le Bihan, D., Poupon, C., Amadon, A., & Lethimonnier, F. (2006). Artifacts and pitfalls in diffusion MRI. Journal of Magnetic Resonance Imaging, 24, 478–488. http://doi.org/10.1002/jmri.20683
Leemans, A., & Jones, D. (2009). The B-matrix must be rotated when correcting for subject motion in DTI data. Magnetic Resonance in Medicine, 61, 1336–1349. http://doi.org/10.1002/mrm.21890
Leff, A. P., Schofield, T. M., Crinion, J. T., Seghier, M. L., Grogan, A., Green, D. W., & Price, C. J. (2009). The left superior temporal gyrus is a shared substrate for auditory short-term memory and speech comprehension: Evidence from 210 patients with stroke. Brain, 132, 3401–3410. http://doi.org/10.1093/brain/awp273
Medina, J., Kimberg, D. Y., Chatterjee, A., & Coslett, H. B. (2010). Inappropriate usage of the Brunner-Munzel test in recent voxel-based lesion-symptom mapping studies. Neuropsychologia, 48, 341–343. http://doi.org/10.1016/j.neuropsychologia.2009.09.016
Mori, S., Crain, B. J., Chacko, V. P., & Van Zijl, P. C. (1999). Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging. Annals of Neurology, 45, 265–269.

Mori, S., Wakana, S., van Zijl, P. C. M., & Nagae-Poetscher, L. M. (2005). MRI atlas of human white matter. Amsterdam, The Netherlands: Elsevier.
Moseley, M. E., Kucharczyk, J., Mintorovitch, J., Cohen, Y., Kurhanewicz, J., Derugin, N., et al. (1990). Diffusion-weighted MR imaging of acute stroke: Correlation with T2-weighted and magnetic susceptibility-enhanced MR imaging in cats. American Journal of Neuroradiology, 11, 423–429.
Özarslan, E., Shepherd, T. M., Vemuri, B. C., Blackband, S. J., & Mareci, T. H. (2006). Resolution of complex tissue microarchitecture using the diffusion orientation transform (DOT). NeuroImage, 31, 1086–1103. http://doi.org/10.1016/j.neuroimage.2006.01.024
Pajevic, S., & Pierpaoli, C. (1999). Color schemes to represent the orientation of anisotropic tissues from diffusion tensor data: Application to white matter fiber tract mapping in the human brain. Magnetic Resonance in Medicine, 42, 526–540.
Pierpaoli, C., Jezzard, P., Basser, P. J., Barnett, A., & Di Chiro, G. (1996). Diffusion tensor MR imaging of the human brain. Radiology, 201, 637–648. http://doi.org/10.1148/radiology.201.3.8939209
Rogalski, E., Cobia, D., Harrison, T. M., Wieneke, C., Thompson, C. K., Weintraub, S., & Mesulam, M.-M. (2011). Anatomy of language impairments in primary progressive aphasia. The Journal of Neuroscience, 31, 3344–3350. http://doi.org/10.1523/JNEUROSCI.5544-10.2011
Rojkova, K., Volle, E., Urbanski, M., Humbert, F., Dell'Acqua, F., & Thiebaut de Schotten, M. (2016). Atlasing the frontal lobe connections and their variability due to age and education: A spherical deconvolution tractography study. Brain Structure & Function, 221, 1751–1766. http://doi.org/10.1007/s00429-015-1001-3
Rorden, C., Karnath, H.-O., & Bonilha, L. (2007). Improving lesion-symptom mapping. Journal of Cognitive Neuroscience, 19, 1081–1088.
Rowan, A., Vargha-Khadem, F., Calamante, F., Tournier, J.-D., Kirkham, F. J., Chong, W. K., et al. (2007). Cortical abnormalities and language function in young patients with basal ganglia stroke. NeuroImage, 36, 431–440. http://doi.org/10.1016/j.neuroimage.2007.02.051
Sachs, H. (1892). Das Hemisphaerenmark des menschlichen Grosshirns. I. Der Hinterhauptlappen. Leipzig: Georg Thieme Verlag.
Song, S. K., Sun, S. W., Ramsbottom, M. J., Chang, C., Russell, J., & Cross, A. H. (2002). Dysmyelination revealed through MRI as increased radial (but unchanged axial) diffusion of water. NeuroImage, 17, 1429–1436.
Seunarine, K. K., & Alexander, D. C. (2009). Multiple fibers: Beyond the diffusion tensor. In H. Johansen-Berg & T. E. J. Behrens (Eds.), Diffusion MRI: From quantitative measurement to in-vivo neuroanatomy (pp. 55–72). Oxford: Academic Press.
Talairach, J., & Tournoux, P. (1988). Co-planar stereotaxic atlas of the human brain. Stuttgart: Georg Thieme Verlag.
Thiebaut de Schotten, M., ffytche, D. H., Bizzi, A., Dell'Acqua, F., Allin, M., Walshe, M., et al. (2011). Atlasing location, asymmetry and inter-subject variability of white matter tracts in the human brain with MR diffusion tractography. NeuroImage, 54, 49–59. http://doi.org/10.1016/j.neuroimage.2010.07.055
Thiebaut de Schotten, M., Tomaiuolo, F., Aiello, M., Merola, S., Silvetti, M., Lecce, F., et al. (2014). Damage to white matter pathways in subacute and chronic spatial neglect: A group study and 2 single-case studies with complete virtual "in vivo" tractography dissection. Cerebral Cortex, 24, 691–706. http://doi.org/10.1093/cercor/bhs351
Tuch, D. S., Reese, T. G., Wiegell, M. R., & Wedeen, V. J. (2003). Diffusion MRI of complex neural architecture. Neuron, 40, 885–895. http://doi.org/10.1016/S0896-6273(03)00758-X
Tuch, D. S., Reese, T. G., Wiegell, M. R., Makris, N., Belliveau, J. W., & Wedeen, V. J. (2002). High angular resolution diffusion imaging reveals intravoxel white matter fiber heterogeneity. Magnetic Resonance in Medicine, 48, 577–582. http://doi.org/10.1002/mrm.10268

Van Essen, D. C. (1997). A tension-based theory of morphogenesis and compact wiring in the central nervous system. Nature, 385, 313–318. http://doi.org/10.1038/385313a0
Wakana, S., Caprihan, A., Panzenboeck, M. M., Fallon, J. H., Perry, M., Gollub, R. L., et al. (2007). Reproducibility of quantitative tractography methods applied to cerebral white matter. NeuroImage, 36, 630–644.
Wedeen, V. J., Hagmann, P., Tseng, W.-Y. I., Reese, T. G., & Weisskoff, R. M. (2005). Mapping complex tissue architecture with diffusion spectrum magnetic resonance imaging. Magnetic Resonance in Medicine, 54, 1377–1386. http://doi.org/10.1002/mrm.20642
Wheeler-Kingshott, C. A. M., & Cercignani, M. (2009). About "axial" and "radial" diffusivities. Magnetic Resonance in Medicine, 61, 1255–1260. http://doi.org/10.1002/mrm.21965

Further Reading and Resources

Catani, M., & Thiebaut de Schotten, M. (2012). Atlas of human brain connections. Oxford: Oxford University Press.
Damasio, H., & Damasio, A. (1989). Lesion analysis in neuropsychology. New York: Oxford University Press.
Johansen-Berg, H., & Behrens, T. E. J. (Eds.) (2013). Diffusion MRI: From quantitative measurement to in vivo neuroanatomy (2nd ed.). Academic Press/Elsevier.
Jones, D. (Ed.) (2010). Diffusion MRI. Oxford: Oxford University Press.
Stemmer, B., & Whitaker, H. (Eds.) (2008). Handbook of the neuroscience of language. London: Elsevier Academic Press.
Toga, A. (Ed.) (2015). Brain mapping: An encyclopedic reference (1st ed.). Academic Press/Elsevier.

16 Lesion Studies

Juliana V. Baldo and Nina F. Dronkers

Abstract

Lesion studies are used to infer the neural basis of psycholinguistic processes in the healthy human brain. Traditionally, such studies tested patients with lesions in a particular site in order to ascertain the role of that brain region in a specific language process. Newer methods using voxel-based techniques capitalize on advances in brain imaging to provide voxel-by-voxel analyses of the role that specific grey and white matter regions play in psycholinguistic processes. Common to all lesion study methodologies are a number of assumptions about the brain, as well as a number of challenges with respect to patient recruitment, stimulus selection, and data analysis and interpretation. In this chapter, we discuss these issues and provide an exemplary study that analyzes the neural correlates of auditory word recognition in a large patient sample. Last, we describe a number of advantages and disadvantages of lesion studies, as well as alternate methods.

Introduction

Lesion analysis studies have a long and storied tradition in the study of speech and language. Starting in the late 1700s to early 1800s, a number of neurologic case studies led scientists to conclude that language could be localized to discrete regions of the left hemisphere. Such thinking ran counter to the previously dominant notion that the brain was indivisible and rather uniform in its support of language and

other cognitive functions. With carefully described anatomical reports by Bouillaud, Lallemand, Broca, Wernicke, and others, it became clear by the end of the nineteenth century that there were indeed specific regions in the left hemisphere that subserved speech and language and that, moreover, these regions could be divided into sub-regions that supported very specific functions such as speech production and auditory comprehension (see Whitaker, 1998 for an historical overview). While early lesion studies had to rely on post-mortem data to localize lesioned tissue, the advent of modern neuroimaging in the late twentieth century allowed for the identification of regions critical for speech and language processes in a patient's living brain. More recent developments in computing and modeling now facilitate increasingly sophisticated analysis techniques that can elucidate brain-behavior mapping on a voxel-by-voxel basis. In this chapter, we first discuss the rationale and assumptions underlying lesion studies in psycholinguistics. Next, we discuss issues related to data collection and analysis, including an exemplary lesion study using modern analysis techniques. Last, we discuss the advantages/disadvantages of the lesion analysis technique as well as alternate methods.

Rationale and Assumptions

The basic rationale behind lesion studies is that if a brain region is damaged and a particular behavioral deficit occurs, then that brain region plays a critical role in that behavior in the normal, healthy brain. Thus, the question typically asked is, "If a patient, or group of patients, demonstrates a particular linguistic deficit, have they suffered an injury in a specific area of the brain?" It is sometimes assumed that there is a simple one-to-one correspondence between lesion site and deficit. We now know, however, that discrete lesions in grey or white matter can disrupt pathways/networks that link multiple brain regions (so-called "disconnection syndromes"; Catani, 2005; Geschwind, 1974). In this way, a focal lesion in a particular brain region may result in a specific behavioral deficit, but the core processes underlying that behavior may actually be subserved by a remote brain region or larger network. Newer imaging methods that measure whole-brain and network activity, such as resting-state functional Magnetic Resonance Imaging (fMRI), suggest that multiple regions act in concert to subserve particular behaviors, including speech and language processes (Turken & Dronkers, 2011).

Another common assumption in lesion studies is that brain regions subserve the same processes across all individuals. While it is clear that pre-morbid individual differences exist both in anatomy and behavior, it is often difficult to quantify and control for such differences. For example, pre-morbid behavioral differences, such as IQ and verbal ability, are difficult to establish post-stroke due to the confounding brain injury and lack of pre-morbid data. Interesting new studies are beginning to focus on such inter-individual differences and relate them to distinct structural patterns in grey and white matter, using new MRI and post-processing techniques (for a review, see Kanai & Rees, 2011).

A final common assumption made in lesion studies of speech and language is that maximal recovery is achieved a few months post-stroke, so that we can study fixed brain-behavior relationships. However, there are likely continued changes at both the anatomical level (e.g., axonal branching) and physiological level (e.g., neuronal efficiency) that continue for many years. Newer techniques such as diffusion tensor imaging (DTI) and resting-state fMRI are beginning to be used to track such changes and thus improve our understanding of how such changes relate to continued improvement in speech and language (Thompson & den Ouden, 2008). Also, other novel techniques such as perfusion-weighted imaging are being used in the very early acute stages of stroke to map speech and language functions to discrete brain regions prior to any potential brain reorganization (e.g., Davis et al., 2008; Hillis et al., 2001).

Patient Recruitment, Study Design, Stimuli, and Instruments

Patient Recruitment and Selection

Lesion studies of speech and language typically involve patients who have experienced a sudden injury, such as a stroke (either embolic or hemorrhagic) or traumatic brain injury, but may also include patients with progressive disorders such as Alzheimer's disease or primary progressive aphasia. Perhaps the most critical and yet challenging aspect of lesion studies is identifying and assembling a well-characterized patient sample. Potential participants are typically screened for a number of inclusion and exclusion criteria that could confound data interpretation, so that reliable brain-behavior inferences can be made. These criteria often include restricting the sample to patients with a history of a single brain event, no prior neurologic or psychiatric history, no history of developmental speech-language disorders (e.g., dyslexia, stuttering), sufficient educational background, and native proficiency in the language of testing. Inclusion may also be restricted to right-handed patients, as handedness is purported to affect laterality and functional organization (Borod et al., 1984; Goodglass & Quadfasel, 1954; Pujol et al., 1999).

Study Design

Lesion studies vary widely with respect to study design, depending on the type of patient sample and the questions being asked. Some studies focus on assessing a single deficit in order to associate it with a particular lesion site. Patient performance is often compared to that of age- and education-matched healthy controls. For example, a study from our group (Baldo et al., 2004) showed that patients with frontal lobe lesions were impaired at formulating effective yes/no questions in order to identify an unknown object (i.e., the 20-questions game), relative to age-matched controls. Studies may also involve a comparison between two sets of patients with distinct lesions who exhibit divergent performance on a particular psycholinguistic task. An example of such a study is one that showed that patients with right hemisphere lesions produce a smaller proportion of formulaic expressions in their spontaneous speech, relative to patients with left hemisphere lesions (Van Lancker Sidtis & Postman, 2006). Studies may also involve contrasting patient performance across two different conditions or stimulus types, in order to further delineate the specific role of a particular brain region. For example, a study by Warrington (1982) showed that a patient with a left parietal lesion was specifically impaired in accessing arithmetical facts but

showed preserved understanding of quantity and arithmetical operations. This type of contrast in performance across two different conditions is sometimes referred to as a “single dissociation.” In more complex designs, performance is compared between two distinct patient groups on two different conditions/tasks. An example of such a study is one that reported on two patients with left anterior/middle temporal cortex lesions who exhibited normal retrieval for verbs but impaired retrieval for nouns, while another patient with a left premotor cortex lesion showed the reverse behavioral pattern (Damasio & Tranel, 1993). Such “double dissociations” are very powerful and enhance confidence that the observed behavioral deficits are specifically linked to the particular brain regions affected (see Baddeley, 2003). More recent approaches to lesion analysis involve voxel‐based studies (methods described in detail below), which typically involve large, heterogeneous patient samples that exhibit a wide range of performance on a particular psycholinguistic measure (Baldo et al., 2012; Bates et al., 2003). Patients’ behavioral scores are entered into a voxel‐based lesion symptom mapping (VLSM) program along with their lesion reconstructions. Unlike the study designs described above, VLSM analyses do not require that patients be divided into groups by lesion site or clinical diagnosis (e.g., right versus left hemisphere patients or Broca’s aphasics versus controls). Rather, a wide range of patient types and patient performance can be simultaneously analyzed in relation to patients’ respective lesion sites. An example of such a study is a recent analysis of picture naming in 96 left hemisphere stroke patients, in which we found that lexical‐semantic retrieval was critically dependent on left posterior middle and superior temporal cortex and underlying white matter (Baldo et al., 2013).

Behavioral Stimuli and Response Measures

A wide range of speech and language measures are used in lesion studies of linguistic functioning. These include measures that tap the full range of linguistic domains (e.g., phonology, morphology, semantics, pragmatics), as well as input (e.g., auditory or written material) and output modalities (e.g., oral or written responses). Dependent variables typically include response accuracy and reaction time (e.g., patients' speed and accuracy at naming pictures aloud). In some studies, variables may involve more qualitative analysis of a speech-language sample, such as measuring patients' mean length of utterance, type-token ratio, correct information units, morphological production, or intelligibility.

Depending on the question(s) being asked, lesion studies may also involve administering standardized speech and language measures. Such measures have the benefit of published normative data, so that an individual patient's performance can be evaluated against that of a healthy population. Also, data collected with standardized measures are more directly applicable to clinical settings where the same standardized tests are commonly used. There are a number of standardized language batteries (e.g., the Western Aphasia Battery, Aachen Aphasia Test, Boston Diagnostic Aphasia Examination) that measure a range of different speech and language processes such as speech fluency, auditory-verbal comprehension, repetition, naming, and reading/writing. Other standardized measures with greater sensitivity are used to quantify specific deficits, for example, in naming (e.g., the Boston Naming Test), semantics (e.g., Pyramids and Palm Trees), and auditory comprehension (e.g., the Peabody Picture Vocabulary Test), to name a few.

On the other hand, standardized measures do not offer the flexibility to address specific, novel research questions. For such questions, lesion studies typically employ experimental measures, such as novel paper-and-pencil tests or computerized, reaction-time based tests. The latter can provide millisecond resolution of patient performance and thus are extremely sensitive to subtle differences between individuals or task conditions. Computerized testing also allows for subtle manipulations of timing and stimuli, for example, in priming tasks, adaptive learning paradigms, and tasks involving synthesized speech.
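To make the logic of such a reaction-time trial concrete, here is a minimal, illustrative sketch in Python; the stimuli and task are invented for this example, and a real study would use dedicated stimulus-presentation software for millisecond-accurate display timing and response registration:

import time
import random

# Hypothetical stimuli; a real study would load normed, counterbalanced items.
stimuli = ["cup", "dog", "table", "river"]
random.shuffle(stimuli)

results = []
for word in stimuli:
    print(f"Is '{word}' a living thing? Press ENTER for yes.")
    t0 = time.perf_counter()   # stimulus onset
    input()                    # single-key ("yes"-only) response, as
                               # recommended above for patient testing
    rt_ms = (time.perf_counter() - t0) * 1000.0
    results.append((word, rt_ms))

for word, rt in results:
    print(f"{word}: {rt:.0f} ms")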

Collecting and Analyzing Data

In lesion studies of psycholinguistic functioning, data collection is two-pronged, involving acquisition of both psycholinguistic data and neuroimaging data (e.g., MRI or CT imaging). Both sets of data are ideally collected coincident in time, so that behavioral performance can be directly attributed to the observed lesion data (since subsequent brain events could otherwise influence behavioral performance). We first discuss issues related to behavioral data collection, followed by a discussion of MRI/CT data acquisition.

Depending on the goals of the study, psycholinguistic measures are administered to patients at different stages of recovery: the acute stage (0-4 weeks), post-acute stage (4-12 weeks), or chronic stage (12 weeks on). Data collection in the acute and post-acute stages can be challenging due to a number of factors, such as less-than-ideal testing situations (e.g., bedside testing in a noisy hospital), the degree of patient coherence/cooperation, and potential confounds due to factors unrelated to the brain injury itself (e.g., new medications, lack of sleep, reactive depression). At any stage of recovery, however, care must be taken to ensure that patients can comprehend task instructions and carry out task demands. To this end, it is critical that patients receive ample practice and that the tasks be as simple as possible (e.g., requiring a single key-button response for a "yes" response rather than separate key-button responses for "yes" and "no"). Such simplicity maximizes the likelihood that brain-injured patients can execute the task and that their performance reflects the particular variable of interest rather than a nuisance variable related to task demands (e.g., working memory load).

Behavioral Data Analysis

Standard inferential statistics are often applied to the analysis of psycholinguistic data in lesion studies, but a number of caveats are warranted. First, data from group lesion studies are generally analyzed in an aggregated fashion but are often not normally distributed (i.e., do not fall under a symmetrical bell-shaped curve). To address this problem, data may be modified in numerous ways prior to analysis. For example, log transformations of the data or analyzing median rather than mean performance can reduce skewness in the dataset. Such modifications can help to make data patterns more apparent and may also be necessary in order to analyze the data with inferential statistics. Other options for non-normally distributed data include the use of non-parametric statistical tests, which do not make the same assumptions as parametric statistics.
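These options can be illustrated with a short Python sketch (numpy/scipy); the reaction-time data here are simulated stand-ins for real patient scores:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated, right-skewed reaction times (ms) for two patient groups.
group_a = rng.lognormal(mean=6.8, sigma=0.4, size=30)
group_b = rng.lognormal(mean=7.0, sigma=0.4, size=30)

# Normality check (Shapiro-Wilk); skewed data will usually fail this.
print("Shapiro p (group A):", stats.shapiro(group_a).pvalue)

# Option 1: log-transform to reduce skewness, then a parametric t-test.
t, p = stats.ttest_ind(np.log(group_a), np.log(group_b))
print(f"t-test on log-RTs: t = {t:.2f}, p = {p:.3f}")

# Option 2: a non-parametric Mann-Whitney U test on the raw data,
# which does not assume normality.
u, p_np = stats.mannwhitneyu(group_a, group_b)
print(f"Mann-Whitney U: U = {u:.0f}, p = {p_np:.3f}")

# Option 3: summarize with medians rather than means.
print("medians:", np.median(group_a), np.median(group_b))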

In contrast to group studies, data from single case studies can be difficult to analyze with traditional statistical approaches (Crawford & Garthwaite, 2004, 2005). In some cases of striking dissociations, visual inspection of the data alone may be presented as evidence. When normative data are available, single case data may be simply presented as a z‐score that reflects the number of standard deviations that a patient’s performance differs from the mean of a healthy control group. However, a number of issues arise with such methods, and more rigorous approaches have been introduced to single case data analysis, including multiple regression, Bayesian methods, and simulation models.
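As an illustration of the single-case approaches mentioned above, the sketch below (with invented numbers) computes both the simple z-score and Crawford and Howell's modified t-test, one of the more rigorous alternatives in the tradition of the work cited (Crawford & Garthwaite, 2004, 2005), which treats the control sample statistics as estimates rather than population parameters:

import numpy as np
from scipy import stats

# Hypothetical data: one patient's score and a small healthy control sample.
patient = 12.0
controls = np.array([21.0, 24.0, 19.0, 23.0, 22.0, 20.0, 25.0, 18.0])

n = len(controls)
mean, sd = controls.mean(), controls.std(ddof=1)

# Simple z-score: treats the control mean/SD as if they were population
# values, which overstates certainty when the control sample is small.
z = (patient - mean) / sd
print(f"z = {z:.2f}")

# Crawford & Howell's modified t-test: accounts for control sample size.
t = (patient - mean) / (sd * np.sqrt((n + 1) / n))
p = stats.t.sf(abs(t), df=n - 1) * 2   # two-tailed p, df = n - 1
print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}")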

Neuroimaging Data Acquisition

Brain imaging data are typically acquired with 3D computed tomography (CT) or MRI at least 3 months post-onset, after the acute effects of the injury (e.g., cerebral edema) have subsided. If an acute protocol such as perfusion-weighted MRI is being used, imaging data are collected as close to the time of brain injury as possible. MRI provides superior spatial resolution and a range of options for visualizing different lesion parameters, but it cannot be used in patients who have pacemakers, certain metal implants, or other MRI contraindications (see Kanal, 1992). CT imaging is much less expensive and has fewer contraindications, but it has relatively low spatial resolution and exposes patients to ionizing radiation. While CT is often the modality of choice in acute hospital settings, MRI is generally utilized for research studies to provide a more detailed image of the lesion.

With MRI, multiple sequences are typically acquired from brain-injured patients, including T1, T2, and FLAIR images, which each provide a distinct pattern of enhancement of the lesion signal. Specifically, T1 and T2 refer to tissue-specific relaxation times (the times protons take to return to equilibrium after being perturbed by a radio frequency pulse); sequences weighted by these parameters generate brain images that vary with respect to signal intensity (e.g., fluid appears dark on T1 images and bright on T2). With FLAIR (fluid-attenuated inversion recovery), the signal from cerebrospinal fluid is suppressed to allow for more sensitivity to small lesions that might otherwise be missed. When a patient's lesion is being reviewed and/or reconstructed, these distinct sequences can be linked and viewed simultaneously using specialized imaging software (e.g., MRICron; Rorden & Brett, 2000) so that the complete extent of the lesion can be most precisely estimated. In published studies with single cases or small groups, raw CT/MRI images may be used to identify and display the site of a patient's lesion for the purposes of relating that brain region to an identified behavioral finding.

Lesion Reconstructions

In larger group studies, a variety of techniques have been developed to "reconstruct" patients' lesions so that they may be aggregated for both visualization and statistical purposes (Bates et al., 2003; Damasio & Damasio, 1989; Rudrauf et al., 2008). As neuroimaging became more readily available in the 1970s-1980s, it was common to reconstruct patients' lesions onto standardized templates to clarify the location of the lesion and to allow for easy comparison to other patients' lesions (e.g., Damasio & Damasio, 1989). Early versions of these techniques required an experienced radiologist, neurologist, or other trained expert to interpret a patient's hard-copy CT or MRI and then draw the observed lesion onto a hard copy of slices taken from a brain atlas. Using a common brain template allowed lesions to be compared across a group of patients in a more systematic way.

Subsequent approaches to lesion reconstruction involved digitizing the brain templates so that the reconstructions could be digitally combined across multiple patients and visualized on a single template, showing the degree of lesion overlap across patients. Such computerized lesion reconstructions can be used in a variety of ways, including overlapping the lesions of patients who display a common language or clinical deficit to determine if they also share a common lesion site. For example, this technique was used to determine the neural underpinnings of apraxia of speech (Dronkers, 1996): Lesion overlay maps identified a common area of overlap in the left superior pre-central gyrus of the insula that was lesioned in all 25 patients who had a chronic apraxia of speech. Importantly, a second lesion overlay map in 19 matched speech-language patients who did not have an apraxia of speech showed this same insular region completely spared.

More recent approaches to lesion reconstruction allow a trained expert to draw the boundaries of a patient's lesion directly on the digital MRI scan (usually the T1 or T2 image) using specialized software (see Rorden & Brett, 2000). This reconstruction is saved as a lesion file, which can then be normalized (i.e., warped, rotated) along with the whole brain so that it is aligned with a standard brain template (Brett et al., 2001). This spatial normalization process allows an individual patient's lesion data to be digitally combined with other patients' lesions and also enables direct comparison with anatomic atlases and functional imaging studies that use the same brain template (Friston et al., 1995).

The procedures involved in normalizing patients' brain scans are constantly being updated, and there is as yet no consensus on the optimal parameters (Andersen et al., 2010; Ashburner & Friston, 2005; Brett et al., 2001; Crinion et al., 2007). For example, lesions are sometimes traced before normalization (on the "native" MRI), while in other studies, lesions are traced after the whole brain has been normalized. Another significant issue is that enlarged ventricles are a common consequence of tissue loss due to stroke, and strategies for accounting for this loss of tissue in the lesion reconstruction process vary across research centers (Andersen et al., 2010). Another new approach to lesion reconstruction is the use of automated software to detect lesion boundaries, which obviates the need for manual lesion tracing (Griffis et al., 2016; Guo et al., 2015; Pustina et al., 2016).
These automated methods could greatly reduce the workload required to trace large samples of individual patients' lesions and could provide more objective, consistent lesion reconstructions. At this point in time, however, such automated reconstruction techniques are still being evaluated with respect to their ability to match the accuracy of manual tracing by a trained expert. Regardless of whether a trained expert or automated software is used, it is often difficult to determine the true boundaries of a lesion and whether an area of tissue is truly lesioned or not. Moreover, even if an area adjacent to a lesion appears "normal," it is not currently possible to know whether or not that region is truly functioning normally based on the structural MR images (i.e., T1/T2). In the future, lesion reconstruction techniques may be combined with functional MRI techniques to better identify the true boundaries of functional versus lesioned tissue.
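To make the overlay step described above concrete, the following sketch sums binary lesion masks across patients using the nibabel and numpy libraries; the file names are hypothetical, and all masks are assumed to have already been normalized to the same standard template:

import glob
import nibabel as nib
import numpy as np

# Hypothetical normalized, binary lesion masks (1 = lesioned voxel),
# all already aligned to the same standard template.
paths = sorted(glob.glob("lesions/patient_*_lesion.nii.gz"))

overlap = None
for path in paths:
    img = nib.load(path)
    mask = img.get_fdata() > 0   # binarize, guarding against interpolated
                                 # values introduced by spatial warping
    overlap = mask.astype(np.int32) if overlap is None else overlap + mask

# Each voxel now holds the number of patients lesioned there; save the
# overlay map for visualization on the template (e.g., in MRICron).
nib.save(nib.Nifti1Image(overlap.astype(np.int16), img.affine), "overlap.nii.gz")
print("max overlap:", int(overlap.max()), "of", len(paths), "patients")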

Voxel‐Based Lesion Analyses

The newer digital lesion reconstruction techniques described above have paved the way for voxel-based lesion symptom mapping (VLSM) tools that provide a statistical analysis of the relationship between discrete brain regions and language behaviors on a voxel-by-voxel basis (for a review, see Baldo et al., 2012; Bates et al., 2003; Tyler et al., 2005). In VLSM, a statistical test (e.g., generalized linear model, t-test) is run at every voxel in order to compare performance in patients whose lesions involve a particular voxel versus those whose lesions spare that voxel (see Figure 16.1 for a schematic of the VLSM process). Color-coded maps are then generated that reflect the resultant statistics (e.g., a t-map) and can be visualized on a standard brain template using freely available imaging software (e.g., MRICron, http://www.mccauslandcenter.sc.edu/mricro/mricron/; Rorden & Brett, 2000). A brain atlas template (e.g., the AAL or Brodmann's area map) can be yoked to the VLSM results to identify anatomical labels for the location of voxels identified by VLSM.

Figure 16.1 A schematic illustration showing the steps involved in a VLSM analysis. In the first stage, patients' lesions, which have been reconstructed onto a standardized template, are read into the analysis. Second, at every voxel, a statistical test is run to compare the behavioral scores (e.g., comprehension, fluency) of patients with and without a lesion in that voxel. The resulting test statistics (e.g., t-values) at every voxel are then color-coded and visualized as shown. In the next step (not shown), a statistical correction is applied (e.g., permutation testing) to adjust for the large number of comparisons being done, so that only voxels meeting a pre-specified significance level are displayed. Source: Baldo et al. (2012). Reproduced with permission of John Wiley & Sons. (See insert for color representation of the figure.)

Since a large number of statistical tests are run in a VLSM analysis, a statistical correction or alternative statistical approach is needed to avoid generating spurious results. One recommended method, known as permutation testing, is a relatively conservative method that involves randomly reassigning patients' behavioral scores across all lesioned voxels a large number of times (e.g., 1,000 iterations). This procedure produces a minimum t-value threshold at a given alpha level (e.g., .05), above which values would be expected to occur by chance 5% of the time or less (Kimberg et al., 2007; Rorden et al., 2007). Alternatively, cluster-size thresholds may be used. In this way, permutation testing identifies a critical statistical threshold, and only those voxels with values surpassing this threshold are then considered for interpretation in the analysis. Other statistical correction methods such as Bonferroni and false discovery rate (FDR) have been used in the literature to correct for multiple tests with VLSM, but these approaches make assumptions about the data that are not appropriate for lesion data (see Baldo et al., 2012; Rorden et al., 2007).

Another important adjustment in VLSM analyses is limiting the voxels included in the analysis to those that are lesioned in a minimum number of patients. Otherwise, statistical tests may be run in voxels that compare the performance of a single patient to the rest of the sample. The recommended minimum cut-off is 5-10% of the patient sample. For example, in a VLSM analysis of 100 patients, statistical tests should be limited to those voxels in which at least 5-10 individuals have lesion involvement. Also, studies should ideally include a lesion coverage map that shows an overlay of patients' lesions using this same minimum cut-off in order to show which voxels are actually included in the analysis. Similarly, in order to establish the robustness and replicability of a given VLSM finding, it is highly recommended that a thresholded lesion-overlap difference map or power analysis map be generated (Kimberg et al., 2007; Rudrauf et al., 2008).
It is critical that brain regions without adequate lesion coverage and/or power not be part of the predictions or interpretation of a VLSM study. For example, most lesion studies of language include primarily patients with middle cerebral artery strokes, and thus there is little lesion coverage in the frontal poles. A null finding in this brain region cannot be assumed to reflect a lack of involvement of this region in the particular behavioral measure being studied. In order to obtain robust results with adequate power, VLSM analyses are typically done with relatively large patient samples. With smaller patient groups, it is preferable to use another technique, such as a lesion overlay map, rather than VLSM.
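The core VLSM logic and the permutation correction can be summarized in a short Python sketch (numpy/scipy). This is a toy illustration of the procedure described above, not the VLSM software cited in this chapter; the lesion matrix and behavioral scores are simulated, and a real analysis would read normalized lesion images and observed scores:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated stand-in data: 100 patients x 1,000 voxels of binary lesion
# status, plus one behavioral score (e.g., percent correct) per patient.
# Small sizes keep this toy example fast; real analyses use far more voxels.
n_patients, n_voxels = 100, 1000
lesions = rng.random((n_patients, n_voxels)) < 0.15   # True = voxel lesioned
scores = rng.normal(60.0, 10.0, size=n_patients)

MIN_PATIENTS = 5   # the 5-10% minimum cut-off discussed above

def voxelwise_t(scores, lesions):
    """t-statistic at each testable voxel: spared vs. lesioned patients."""
    tmap = np.full(lesions.shape[1], np.nan)   # NaN = voxel not tested
    for v in range(lesions.shape[1]):
        hit = lesions[:, v]
        if MIN_PATIENTS <= hit.sum() <= len(hit) - MIN_PATIENTS:
            # Lesioned patients are expected to score lower, so a deficit
            # appears as a positive t-value here.
            tmap[v] = stats.ttest_ind(scores[~hit], scores[hit]).statistic
    return tmap

tmap = voxelwise_t(scores, lesions)

# Permutation correction: shuffle the behavioral scores, recompute the map,
# and record the maximum t over all voxels; the 95th percentile of those
# maxima is the corrected threshold (use ~1,000 iterations in practice).
max_t = [np.nanmax(voxelwise_t(rng.permutation(scores), lesions))
         for _ in range(100)]
threshold = np.percentile(max_t, 95)
print(f"corrected t threshold = {threshold:.2f}")
# NaN (untested) voxels compare as False and are thus excluded here.
print("voxels surviving correction:", int((tmap > threshold).sum()))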

An Exemplary Study

Here, we describe a typical voxel‐based lesion symptom mapping (VLSM) analysis study, using retrospective data from our patient database of over 500 stroke patients. In this example, we analyzed single‐word auditory recognition data from a subset of 109 left Lesion Studies 319

hemisphere stroke patients who met strict inclusion/exclusion criteria. The focus of this VLSM analysis was to identify grey and white matter regions associated with auditory word recognition. The classical model of speech comprehension (based primarily on single‐case and small group studies) has typically linked this ability with Wernicke’s area, which includes the left posterior superior temporal gyrus (STG) and sometimes inferior parietal cortex. Newer data from our group, however, suggest an important role for the posterior middle temporal gyrus (MTG) and underlying white matter as well (Baldo et al., 2013; Dronkers et al., 2004). Thus, it was of interest to use VLSM to test these two competing predictions in a large group of left hemisphere patients.

Participants

For this analysis, 109 patients (10 female) were selected from our patient database who met the following criteria: 1) a single, left hemisphere stroke; 2) at least 3 months post‐stroke (to ensure that acute brain effects as well as behaviors were relatively stabilized); 3) native English‐speaking (acquired before 5 years of age); 4) right‐ handed, according to the Edinburgh Handedness Inventory; 5) no prior neurologic or severe psychiatric history (e.g., Parkinson’s, traumatic brain injury, schizophrenia, bipolar disorder); 6) an eighth grade education or higher (to ensure a basic level of scholastic exposure); and 7) available neuroimaging. The mean age of the patient sample was 60.3 years (SD = 11.3, range 31‐86); their mean level of education was 14.8 years (SD = 2.69, range 8‐20); and the mean number of months post‐stroke was 49.8 (SD = 50.4, range 11‐271). Based on the Western Aphasia Battery (WAB) scoring system, the sample included 25 patients with Broca’s aphasia, 2 with global aphasia, 1 with transcortical sensory aphasia, 12 with Wernicke’s aphasia, 5 with conduction aphasia, 35 with anomic aphasia, and 29 who scored within normal limits (WNL; > 93.7 out of 100 points). The WNL subgroup included patients with very mild or no speech‐language symptoms.

Stimuli and Procedures

Patients were individually administered the WAB by a speech-language pathologist or trained researcher. The WAB includes a series of different subtests that tap distinct speech-language processes. For the current analysis, we examined data from the auditory word recognition subtest, which is one of three subtests that make up the Auditory Verbal Comprehension section of the battery. On the auditory word recognition subtest, patients are asked to point to objects and pictures named aloud by the examiner (e.g., "Point to the cup."). The overall auditory word recognition score was used as the main dependent variable in the VLSM analyses presented below.

Lesion Reconstructions

Most patients’ lesions were reconstructed from high‐resolution, 3T MRI scans obtained at least 3 months post‐stroke so that lesion sites were stabilized. In patients who had MRI contraindications, 3D CT scans were used to reconstruct lesions. The average size of patients’ lesions was 102.1 cubic centimeters (SD = 85.6). 320 Research Methods in Psycholinguistics and the Neurobiology of Language
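As an aside, a figure such as this average lesion volume follows directly from a binary lesion mask and the voxel dimensions stored in the image header; a minimal sketch with nibabel (the file name is hypothetical):

import nibabel as nib

img = nib.load("patient_lesion.nii.gz")       # hypothetical lesion mask
voxels = int((img.get_fdata() > 0).sum())     # number of lesioned voxels

# Voxel dimensions in mm come from the header; volume in cubic centimeters.
dx, dy, dz = img.header.get_zooms()[:3]
volume_cc = voxels * dx * dy * dz / 1000.0
print(f"lesion volume: {volume_cc:.1f} cc")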


Figure 16.2 Overlay of patients’ lesions. Only voxels with a minimum of five patients with lesions per voxel are included here, consistent with the data eligible for inclusion in the VLSM analysis. The color bar indicates the degree of lesion overlap, from purple (~5‐10 patients with lesion overlap) to aqua (~50 patients with lesion overlap) to red (~100 patients with lesion overlap). (See insert for color representation of the figure.)

An overlay of patients' normalized lesions is shown in Figure 16.2 on a standard brain template to demonstrate the extent of patients' lesions across the left hemisphere. Only those voxels in which at least five patients had lesions are shown, so that the lesion map reflects all the potential voxels analyzed in the VLSM analyses below. As can be seen, there was substantial lesion coverage across most of the middle cerebral artery distribution in the left hemisphere, with the greatest degree of overlap (in approximately one-third of patients) in left fronto-insular cortex. It should be noted that this region of common lesion overlap is distinct from the regions associated with auditory word recognition identified in the VLSM analyses described below.

VLSM Analysis

Patients' performance on the WAB auditory word recognition subtest (percent correct) was analyzed, along with their reconstructed lesions, using a freely available VLSM program (https://langneurosci.mc.vanderbilt.edu/resources.html; Baldo et al., 2012; Bates et al., 2003) in order to identify the neural correlates of single-word auditory recognition. For purposes of illustration, separate VLSM analyses were run with and without permutation testing, and also with and without covariates (age, education, and lesion volume), in order to illustrate the range of results obtained with various levels of correction (see Results below). Figure 16.3 shows a power analysis map that provides an indication of the statistical power in the VLSM analyses, based on a medium effect size (Cohen's d = 0.5) and alpha set at p = .05. As can be seen, power is relatively high in much of the left hemisphere (shown in red) but is low in the most superior and inferior portions of the left hemisphere, as well as in the very anterior, posterior, and medial portions (areas with no color). Importantly, there was a high degree of power in the brain region that was the focus of this analysis, namely, left posterior middle and superior temporal cortex.


Figure 16.3 Power analysis map showing the degree of power in our sample, given a medium effect size and alpha set at p < .05. The color bar indicates power ranging from 0.4 (shown in black) up to 1.0 (shown in black to red). (See insert for color representation of the figure.)
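A per-voxel power estimate of the kind shown in Figure 16.3 can be derived from the sizes of the lesioned and spared groups at each voxel. Below is a sketch using the statsmodels power routines, assuming (as in the text) a medium effect size of d = 0.5 and alpha of .05; the example group sizes are hypothetical:

from statsmodels.stats.power import TTestIndPower

def voxel_power(n_lesioned, n_spared, effect_size=0.5, alpha=0.05):
    """Power of a two-sample t-test at one voxel, given the group sizes."""
    return TTestIndPower().power(effect_size=effect_size,
                                 nobs1=n_lesioned,
                                 ratio=n_spared / n_lesioned,
                                 alpha=alpha)

# E.g., a voxel lesioned in 30 of 109 patients vs. one lesioned in only 6.
print(f"30 vs. 79:  power = {voxel_power(30, 79):.2f}")
print(f" 6 vs. 103: power = {voxel_power(6, 103):.2f}")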

Results: VLSM Correlates of Auditory Word Recognition

VLSM-derived correlates of auditory word recognition at different levels of statistical correction are shown in Figure 16.4. The top row (A) shows VLSM results based on an uncorrected raw t-map with no covariates included. A large portion of the lateral surface of the left hemisphere is implicated, with the highest t-values centered in the middle temporal gyrus (shown in red). Row B displays VLSM results with a voxelwise correction of p < .001 but no correction for the large number of t-tests carried out and no covariates. Fewer regions are implicated in this map, but they still make up a large portion of the left temporal lobe, with extension into inferior frontal and inferior parietal cortices. The minimum significant t-value with this correction was 3.17, and the maximum t-value was 10.94, centered in the white matter medial to the mid-middle temporal gyrus (MNI x,y,z coordinates -40,-16,-8). Row C shows VLSM results using the same voxelwise correction of p < .001 but with covariates included in the analysis (age, education, and lesion volume). At this level of correction, the significant regions implicated in auditory word recognition were restricted to the temporal lobe, with a minimum significant t-value of 3.17. The maximum t-value of 7.51 was located in the left posterior middle temporal gyrus (-40,-70,8). Finally, Row D displays the most rigorous VLSM analysis of auditory word recognition, using permutation testing to identify a critical t-value threshold (5.08) as well as critical covariates (age, education, and lesion volume). The maximum t-value was again 7.51, located in the left posterior middle temporal gyrus (-40,-70,8), but the minimum t-value was much higher than in the previous analysis (5.08 versus 3.17). This more stringent cut-off value constrained the significant regions associated with auditory word recognition to the mid-posterior middle temporal gyrus.


VLSM Analysis Summary

We used VLSM to identify the neural correlates of auditory single-word recognition in 109 patients and found that the critical region associated with this process is located in left mid-posterior middle temporal cortex. Although traditional models of language suggest that posterior superior temporal cortex (i.e., Wernicke's area) is critical for auditory comprehension, newer studies have implicated the same region we identified here (Bates et al., 2003; Binder et al., 1997; Hickok & Poeppel, 2007; Rodd et al., 2005). Our previous research has shown that this posterior middle temporal region is a highly interconnected hub for language processing in the left hemisphere (Turken & Dronkers, 2011) that is critical for associating words with concepts in sentence comprehension (Dronkers et al., 2004) and naming (Baldo et al., 2013). The current finding is also consistent with our earlier lesion-overlay studies showing that the posterior middle temporal gyrus is the area of common infarct in patients with chronic, persisting Wernicke's aphasia, who exhibit severely impaired lexical-semantic processing, particularly in the auditory domain (Dronkers & Baldo, 2009).

Advantages, Disadvantages, and Alternative Methods

The most important advantage of lesion analysis over any other technique is the ability to evaluate which brain areas are most critical for certain functions. Unlike results generated from functional imaging studies with healthy individuals, which highlight voxels "active" during performance of a particular behavior, brain regions identified by lesion studies are those areas most critical to a particular behavior. Thus, lesion studies provide a unique perspective on the study of brain-behavior relationships not offered by other techniques.

Lesion analysis is also versatile in that it can be done with a single patient (e.g., relating a lesioned brain area to a specific deficit), with a small group of homogeneous patients (e.g., overlapping lesions in patients with a common disorder), and with large groups of heterogeneous patients (e.g., using robust voxel-based analyses that do not rely on pre-determined group membership). All of these approaches have been successful in determining brain-behavior relationships, and each has helped to answer a different kind of research question. The results of lesion studies have also contributed to the clinical realm, leading to more accurate patient diagnosis, prognosis, and treatment planning.

Figure 16.4 VLSM results showing neural correlates of auditory word recognition with varying levels of correction. A) raw t-map with no voxelwise correction, no permutation testing, and no covariates; B) voxelwise-corrected t-map (p < .001) with no permutation testing and no covariates; C) voxelwise-corrected t-map (p < .001) with lesion volume, education, and age as covariates but no permutation testing; and D) permutation testing-derived t-map with lesion volume, education, and age as covariates. The colored bars represent the range of significant t-values for each analysis, from lower (though still significant) t-values shown in purple to higher t-values shown in red. (See insert for color representation of the figure.)

An advantage of voxel-based lesion analyses in particular is the ability to directly compare results with those acquired in healthy individuals with fMRI or PET, as all of these techniques use a common brain template for analysis and visualization of results. With both types of results depicted in the same stereotactic space, comparison between maps is facilitated. Another advantage is that voxel-based lesion analysis techniques like VLSM allow for the inclusion of a wide range of patients with varying degrees of behavioral impairment, in contrast to earlier lesion studies that required a binary distinction (e.g., comparing patients with and without a certain impairment, or with and without lesions in a particular area). A disadvantage of VLSM is that it typically requires a large number of patients to achieve enough statistical power across a broad range of brain areas; the broader the coverage, the wider the map of regions that can be tested.

Another factor to be aware of is that some lesion analysis studies of speech and language involve measuring behaviors that are multi-determined and nonetheless attempt to localize these complex behaviors to a discrete brain region, when they are likely subserved by a network of regions that each plays a distinct role in different aspects of the behavior. For example, in a sentence comprehension task, a number of processes are likely recruited, including auditory processing, speech perception, grammatical processing, verbal working memory (depending on the length/complexity of the sentence), visual perception (if matching sentences to picture choices), and response output/monitoring. For such tasks that involve numerous sub-processes, lesion studies can both determine those brain regions important for successful performance on the overall task and tease apart the brain regions that underlie each specific sub-function, showing how these different regions work together to achieve the larger task (see Dronkers et al., 2004).

In cases of progressive neurologic disease (e.g., Alzheimer's disease) where no frank lesion is present, an alternate technique, called voxel-based morphometry (VBM), can be used (see Ashburner & Friston, 2000). VBM measures cortical thinning by comparing voxel-based intensity changes between groups of individuals (e.g., patients and age-matched neurologically-normal controls). Loss of tissue can then be correlated with changes in behavioral performance. This technique was used to measure changes in language performance in primary progressive aphasia, for example, with distinct patterns emerging for different types of the disorder (Gorno-Tempini et al., 2004). VBM has also been used in cases of developmental or inherited speech or language disorders (e.g., Watkins et al., 2002), and even in neurologically-normal brains to detect changes associated with skill learning, such as acquiring a second language (Mechelli et al., 2004).

A technique that simulates cognitive dysfunction with a temporary "lesion" is known as transcranial magnetic stimulation (TMS). Here, a stimulation coil is placed near the surface of the head and, when supplied with an electric current, creates a magnetic field that penetrates the skull and generates an electric field in the stimulated brain area. When transient disruptions in cognitive functioning consequently occur, these have been referred to as "virtual lesions" (Pascual-Leone et al., 1999), with longer reaction times or increases in the number of errors after stimulation (see Hartwigsen, 2015 for a review).
While these can certainly be considered disruptions in functioning, they are quite unlike the long‐term, disabling deficits seen in neurological patients. The purported advantage of studying virtual lesions over true brain lesions is that functional reorganization presumably cannot take place in the short amount of time during which stimulation occurs. However, improvements in functioning have also been reported under the same stimulation conditions, suggesting that other variables are influencing these effects and require further study.

TMS, tDCS (transcranial direct current stimulation), and other non‐invasive brain stimulation (NIBS) techniques have been actively used in attempts to enhance language and cognitive functions in healthy individuals and as potential adjunctive treatments for aphasia and other neurologic conditions. Outcomes are mixed and vary widely depending on the stimulation parameters used (e.g., duration, intensity, orientation, timing). Potential benefits of such interventions must also be weighed against the rare but possible risks, including fainting or the inducement of seizures. The efficacy of such treatments for individual disorders is currently being examined.

With constant advances in computer technology, new neuroimaging techniques are developing at an equally rapid rate. Diffusion imaging, in particular, has become an exciting avenue for detecting white matter changes in brain‐injured patients (Breier et al., 2008; Schlaug et al., 2009). It takes advantage of the constrained diffusion of water that occurs along axons. Subsequent post‐processing techniques reveal fiber‐like structures that coincide with known fiber pathways and can thus reveal the effects of injury on these fiber tracts and whether they correspond to observed behavioral deficits. Diffusion imaging can also identify which cortical regions have become disconnected. Another technique, resting state functional magnetic resonance imaging (rsfMRI), is also becoming more common in measuring network‐wide changes after brain injury (Van Den Heuvel & Pol, 2010; Van Hees et al., 2014). Measurements of tissue perfusion using MRI can also be informative for determining brain areas affected in early stroke before spontaneous recovery has begun to take place (e.g., Hillis et al., 2001). Combinations of such techniques will continue to help us uncover the intricacies of language networks and the different neural structures that underlie speech and language.
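The voxelwise logic behind VLSM, including the permutation‐based correction shown in Figure 16.4, can be sketched in a few lines. The following Python fragment is purely illustrative—toy data, invented dimensions, no covariates—and is not the implementation behind any published analysis:

```python
# Toy VLSM-style analysis: at each voxel, compare the behavioral scores
# of patients with vs. without a lesion there, then control family-wise
# error with a maximum-statistic permutation test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients, n_voxels = 60, 200            # toy sizes; real maps are larger
lesions = rng.integers(0, 2, size=(n_patients, n_voxels))  # 1 = lesioned
scores = rng.normal(size=n_patients)      # one behavioral score per patient

def voxelwise_t(scores, lesions):
    """t-value per voxel contrasting spared vs. lesioned patients."""
    t = np.zeros(lesions.shape[1])
    for v in range(lesions.shape[1]):
        les = lesions[:, v] == 1
        if 1 < les.sum() < len(les) - 1:  # both groups must be present
            t[v] = stats.ttest_ind(scores[~les], scores[les]).statistic
    return t

observed = voxelwise_t(scores, lesions)

# Null distribution of the maximum t across voxels, from shuffled scores.
max_null = [voxelwise_t(rng.permutation(scores), lesions).max()
            for _ in range(500)]
threshold = np.quantile(max_null, 0.95)   # corrected p < .05 cutoff
print("significant voxels:", np.where(observed > threshold)[0])
```

In a real study, covariates such as lesion volume, age, and education (panels C and D of Figure 16.4) would additionally be modeled at each voxel.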

Acknowledgments

This material is based on work supported in part by the U.S. Department of Veterans Affairs, Office of Research & Development Rehabilitation R&D and CSR&D Programs, and NIH/NINDS 5 P01 NS040813, NIH/NIDCD 5 R01 DC00216. The contents reported within do not represent the views of the Department of Veterans Affairs or the United States Government.

Key Terms

Brodmann’s area map Cytoarchitecture maps developed by Brodmann in the early 1900s that are still used in brain imaging to divide the brain into distinct sub‐regions.

CT Imaging technology that uses x‐rays to generate images reflecting underlying anatomy of, for example, the brain.

Diffusion Tensor Imaging (DTI) A diagnostic technique that incorporates the measurement of molecular diffusion (such as water or metabolites) for tissue assessment by MRI.

Functional magnetic resonance imaging (fMRI) A functional neuroimaging procedure using MRI technology that measures brain activity by detecting changes associated with blood flow.

Magnetic Resonance Imaging (MRI) A non‐invasive method for demonstrating internal anatomy based on the principle that atomic nuclei in a strong magnetic field absorb pulses of radiofrequency energy and emit them as radio waves which can be reconstructed into computerized images. Structural scans using T1‐ and T2‐weighted MR provide detailed anatomic images, while other sequences including functional MRI and perfusion‐weighted images reflect more physiological processes such as blood flow.

MNI Template A brain template developed by the Montreal Neurologic Institute that was approximately matched to the Talairach brain atlas.

Perfusion‐weighted Imaging An MRI technique that generates images reflecting the degree of blood perfusion throughout the tissues, for example, in the brain.

Stroke A group of pathological conditions characterized by sudden, non‐convulsive loss of neurological function due to brain ischemia or intracranial hemorrhage.

Transcranial magnetic stimulation (TMS)/transcranial direct current stimulation (tDCS) A non‐invasive procedure that uses magnetic fields (TMS) or electrical current (tDCS) to stimulate specific parts of the brain. The magnetic fields are generated by a coil that is placed on the scalp, and the technique may be used as a clinical intervention in brain‐injured patients or to alter functioning in the healthy brain for the purpose of studying brain‐behavior relationships.

Voxel A 3D volume (a volume pixel), associated with a particular x‐y‐z coordinate in the brain, used in the analysis of 3D brain imaging findings.

Voxel‐based Lesion Symptom Mapping (VLSM) A type of neuroimaging analysis developed to analyze stroke lesion data on a voxel‐by‐voxel basis, similar to that used in functional neuroimaging, so that the relative contributions of different brain regions to a given cognitive or linguistic function can be analyzed.

Voxel‐based Morphometry A type of neuroimaging analysis developed to analyze structural changes at the voxel level in individuals with progressive brain changes such as Alzheimer’s disease.

References

Andersen, S. M., Rapcsak, S. Z., & Beeson, P. M. (2010). Cost function masking during normalization of brains with focal lesions: Still a necessity? Neuroimage, 53, 78–84.
Ashburner, J., & Friston, K. J. (2000). Voxel‐based morphometry—the methods. Neuroimage, 11, 805–821.
Ashburner, J., & Friston, K. J. (2005). Unified segmentation. Neuroimage, 26, 839–851.
Baddeley, A. (2003). Double dissociations: Not magic, but still useful. Cortex, 39, 129–131.

Baldo, J. V., Delis, D. C., Wilkins, D. P., & Shimamura, A. P. (2004). Is it bigger than a breadbox? Performance of patients with prefrontal lesions on a new executive function test. Archives of Clinical Neuropsychology, 19, 407–419.
Baldo, J. V., Wilson, S. M., & Dronkers, N. F. (2012). Uncovering the neural substrates of language: A voxel‐based lesion symptom mapping approach. In M. Faust (Ed.), Advances in the neural substrates of language: Toward a synthesis of basic science and clinical research (pp. 582–594). Oxford: Wiley‐Blackwell.
Baldo, J. V., Arévalo, A., Patterson, J. P., & Dronkers, N. F. (2013). Grey and white matter correlates of picture naming: Evidence from a voxel‐based lesion analysis of the Boston Naming Test. Cortex, 49, 658–667.
Bates, E., Wilson, S. M., Saygin, A. P., Dick, F., Sereno, M. I., Knight, R. T., & Dronkers, N. F. (2003). Voxel‐based lesion–symptom mapping. Nature Neuroscience, 6, 448–450.
Binder, J. R., Frost, J. A., Hammeke, T. A., Cox, R. W., Rao, S. M., & Prieto, T. (1997). Human brain language areas identified by functional magnetic resonance imaging. The Journal of Neuroscience, 17, 353–362.
Borod, J. C., Caron, H. S., & Koff, E. (1984). Left‐handers and right‐handers compared on performance and preference measures of lateral dominance. British Journal of Psychology, 75, 177–186.
Breier, J. I., Hasan, K. M., Zhang, W., Men, D., & Papanicolaou, A. C. (2008). Language dysfunction after stroke and damage to white matter tracts evaluated using diffusion tensor imaging. American Journal of Neuroradiology, 29, 483–487.
Brett, M., Leff, A. P., Rorden, C., & Ashburner, J. (2001). Spatial normalization of brain images with focal lesions using cost function masking. Neuroimage, 14, 486–500.
Catani, M. (2005). The rises and falls of disconnection syndromes. Brain, 128, 2224–2239.
Crawford, J. R., & Garthwaite, P. H. (2004). Statistical methods for single‐case studies in neuropsychology: Comparing the slope of a patient’s regression line with those of a control sample. Cortex, 40, 533–548.
Crawford, J. R., & Garthwaite, P. H. (2005). Testing for suspected impairments and dissociations in single‐case studies in neuropsychology: Evaluation of alternatives using Monte Carlo simulations and revised tests for dissociations. Neuropsychology, 19, 318.
Crinion, J., Ashburner, J., Leff, A., Brett, M., Price, C., & Friston, K. (2007). Spatial normalization of lesioned brains: Performance evaluation and impact on fMRI analyses. Neuroimage, 37, 866–875.
Damasio, H., & Damasio, A. R. (1989). Lesion analysis in neuropsychology. New York, NY: Oxford University Press.
Damasio, A. R., & Tranel, D. (1993). Nouns and verbs are retrieved with differently distributed neural systems. Proceedings of the National Academy of Sciences, 90, 4957–4960.
Davis, C., Kleinman, J. T., Newhart, M., Gingis, L., Pawlak, M., & Hillis, A. E. (2008). Speech and language functions that require a functioning Broca’s area. Brain and Language, 105, 50–58.
Dronkers, N. F. (1996). A new brain region for coordinating speech articulation. Nature, 384, 159–161.
Dronkers, N. F., & Baldo, J. V. (2009). Language: Aphasia. In L. R. Squire (Ed.), The new encyclopedia of neuroscience (pp. 343–348). Oxford: Elsevier.
Dronkers, N. F., Wilkins, D. P., Van Valin, R. D., Redfern, B. B., & Jaeger, J. J. (2004). Lesion analysis of the brain areas involved in language comprehension. Cognition, 92, 145–177.
Friston, K., Ashburner, J., Frith, C. D., Poline, J. B., Heather, J. D., & Frackowiak, R. S. (1995). Spatial registration and normalization of images. Human Brain Mapping, 3, 165–189.
Geschwind, N. (1974). Disconnexion syndromes in animals and man (pp. 105–236). Springer Netherlands.

Goodglass, H., & Quadfasel, F. A. (1954). Language laterality in left‐handed aphasics. Brain, 77, 521–548.
Gorno‐Tempini, M. L., Dronkers, N. F., Rankin, K. P., Ogar, J. M., Phengrasamy, L., Rosen, H. J., … & Miller, B. L. (2004). Cognition and anatomy in three variants of primary progressive aphasia. Annals of Neurology, 55, 335–346.
Griffis, J. C., Allendorfer, J. B., & Szaflarski, J. P. (2016). Voxel‐based Gaussian naïve Bayes classification of ischemic stroke lesions in individual T1‐weighted MRI scans. Journal of Neuroscience Methods, 257, 97–108.
Guo, D., Fridriksson, J., Fillmore, P., Rorden, C., Yu, H., Zheng, K., & Wang, S. (2015). Automated lesion detection on MRI scans using combined unsupervised and supervised methods. BMC Medical Imaging, 15, 1.
Hartwigsen, G. (2015). The neurophysiology of language: Insights from non‐invasive brain stimulation in the healthy human brain. Brain and Language, 148, 81–94.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Hillis, A. E., Kane, A., Tuffiash, E., Ulatowski, J. A., Barker, P. B., Beauchamp, N. J., & Wityk, R. J. (2001). Reperfusion of specific brain regions by raising blood pressure restores selective language functions in subacute stroke. Brain and Language, 79, 495–510.
Kanai, R., & Rees, G. (2011). The structural basis of inter‐individual differences in human behaviour and cognition. Nature Reviews Neuroscience, 12, 231–242.
Kanal, E. (1992). An overview of electromagnetic safety considerations associated with magnetic resonance imaging. Annals of the New York Academy of Sciences, 649, 204–224.
Kimberg, D. Y., Coslett, H., & Schwartz, M. F. (2007). Power in voxel‐based lesion‐symptom mapping. Journal of Cognitive Neuroscience, 19, 1067–1080.
Mechelli, A., Crinion, J. T., Noppeney, U., O’Doherty, J., Ashburner, J., Frackowiak, R. S., & Price, C. J. (2004). Neurolinguistics: Structural plasticity in the bilingual brain. Nature, 431, 757.
Pascual‐Leone, A. (1999). Transcranial magnetic stimulation: Studying the brain‐behaviour relationship by induction of ‘virtual lesions’. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 354, 1229–1238.
Pujol, J., Deus, J., Losilla, J. M., & Capdevila, A. (1999). Cerebral lateralization of language in normal left‐handed people studied by functional MRI. Neurology, 52, 1038.
Pustina, D., Coslett, H., Turkeltaub, P. E., Tustison, N., Schwartz, M. F., & Avants, B. (2016). Automated segmentation of chronic stroke lesions using LINDA: Lesion identification with neighborhood data analysis. Human Brain Mapping.
Rodd, J. M., Davis, M. H., & Johnsrude, I. S. (2005). The neural mechanisms of speech comprehension: fMRI studies of semantic ambiguity. Cerebral Cortex, 15, 1261–1269.
Rorden, C., & Brett, M. (2000). Stereotaxic display of brain lesions. Behavioural Neurology, 12, 191–200.
Rorden, C., Karnath, H. O., & Bonilha, L. (2007). Improving lesion‐symptom mapping. Journal of Cognitive Neuroscience, 19, 1081–1088.
Rudrauf, D., Mehta, S., & Grabowski, T. J. (2008). Disconnection’s renaissance takes shape: Formal incorporation in group‐level lesion studies. Cortex, 44, 1084–1096.
Schlaug, G., Marchina, S., & Norton, A. (2009). Evidence for plasticity in white‐matter tracts of patients with chronic Broca’s aphasia undergoing intense intonation‐based speech therapy. Annals of the New York Academy of Sciences, 1169, 385–394.
Thompson, C. K., & den Ouden, D. B. (2008). Neuroimaging and recovery of language in aphasia. Current Neurology and Neuroscience Reports, 8, 475–483.
Turken, A., & Dronkers, N. F. (2011). The neural architecture of the language comprehension network: Converging evidence from lesion and connectivity analyses. Frontiers in Systems Neuroscience, 5, 1.

Tyler, L. K., Marslen‐Wilson, W., & Stamatakis, E. A. (2005). Dissociating neuro‐cognitive component processes: Voxel‐based correlational methodology. Neuropsychologia, 43, 771–778.
Van Den Heuvel, M. P., & Pol, H. E. H. (2010). Exploring the brain network: A review on resting‐state fMRI functional connectivity. European Neuropsychopharmacology, 20, 519–534.
Van Hees, S., McMahon, K., Angwin, A., de Zubicaray, G., Read, S., & Copland, D. A. (2014). A functional MRI study of the relationship between naming treatment outcomes and resting state functional connectivity in post‐stroke aphasia. Human Brain Mapping, 35, 3919–3931.
Van Lancker Sidtis, D., & Postman, W. A. (2006). Formulaic expressions in spontaneous speech of left‐ and right‐hemisphere‐damaged subjects. Aphasiology, 20, 411–426.
Warrington, E. K. (1982). The fractionation of arithmetical skills: A single case study. The Quarterly Journal of Experimental Psychology, 34, 31–51.
Watkins, K. E., Vargha‐Khadem, F., Ashburner, J., Passingham, R. E., Connelly, A., Friston, K. J., … & Gadian, D. G. (2002). MRI analysis of an inherited speech and language disorder: Structural brain abnormalities. Brain, 125, 465–478.
Whitaker, H. A. (1998). Neurolinguistics from the middle ages to the pre‐modern era: Historical vignettes. In H. Whitaker & B. Stemmer (Eds.), Handbook of neurolinguistics (pp. 27–54). San Diego, CA: Academic Press.

Further Reading and Resources

Kemmerer, D. (2014). Cognitive neuroscience of language. New York, NY: Psychology Press.
McConnell Brain Imaging Centre brain atlas templates: http://www.bic.mni.mcgill.ca/ServicesAtlases/ICBM152NLin2009
Menn, L., & Dronkers, N. (2015). Psycholinguistics: Introduction and applications (2nd ed.). San Diego, CA: Plural Publishing, Inc.
Rorden, C. MRIcron and other brain imaging links: http://www.mccauslandcenter.sc.edu/mricro/
Rorden, C., & Karnath, H. (2004). Using human brain lesions to infer function: A relic from a past era in the fMRI age? Nature Reviews Neuroscience, 5, 812–819.
Sliwinska, M. W., Vitello, S., & Devlin, J. T. (2014). Transcranial magnetic stimulation for investigating causal brain‐behavioral relationships and their time course. Journal of Visualized Experiments, 89, e51735.
Wilson, S. VLSM and other brain imaging software: http://www.neuroling.arizona.edu/resources.html

17 Molecular Genetic Methods

Carolien G. F. de Kovel and Simon E. Fisher

Abstract

Finding the genetic variation that underlies inter-individual variability in language skills is an important approach for deciphering the biological bases of this fascinating human phenomenon. Recent years have seen dramatic advances in the techniques available for identifying DNA variants that influence human traits, not only for disorders but also for variability in the normal range. The method of choice depends on the genetic architecture of the trait being studied. If the difference between people is due to a single alteration in the DNA with a large effect, an effective strategy is to investigate linkage in multigenerational families. Alternatively, if the variability in the trait depends on the accumulation of small effects of many DNA variants, it is optimal to carry out a genome-wide association study with thousands of participants. This chapter describes the principles behind these complementary methods, and how they can be used to study language-related traits, discussing both the pitfalls and the opportunities.

Introduction

Molecular genetics is a subdivision of genetic research concerned with the structure and functions of genes at the molecular (i.e., DNA/RNA) level. An important part of this type of research involves identifying variations in DNA that are associated with variations in the development of a particular trait. In this chapter, we will explain the practical side of searching for genetic variations that influence a person’s language skills. In another subdivision of genetic research, which we do not cover here, researchers aim to decipher the biological pathways by which genetic variations have their effects, tracing out intermediate steps between molecules, cells, tissues, and organisms.

Background

Our abilities to understand and use language are undoubtedly influenced by environment and experience. Yet when such effects are accounted for, people still differ in their language skills. At least some of these inter‐individual differences are due to variability in genetic make‐up. Decades of behavioral research in families and twin cohorts have provided solid evidence that genetic factors can significantly impact on speech, language, and reading proficiency (see Bishop, 2001; Kovas et al., 2005). This chapter will discuss the background and general approaches for identifying the genes and genetic alterations involved. Once critical genes have been identified they provide molecular windows for understanding biological processes involved in the trait (Fisher & Scharff, 2009). For example, we could determine which parts of the brain are affected by the relevant genetic alterations, and at what stages of development. We focus here on the molecular genetic techniques for first finding connections between genes and language traits.

Before explaining the key methods, it is worth briefly recapitulating the basics of genetics. Every human cell contains strings of DNA. DNA is a huge molecule built by putting together smaller units, usually referred to as nucleotide bases. These bases come in four types: A (adenine), C (cytosine), G (guanine), and T (thymine). DNA is therefore usually represented as a sentence composed of sequences of these four letters. The long string of DNA that makes up our genome is organized in 23 different pieces, the chromosomes. Our cells contain two copies of each chromosome: one inherited from the mother (maternal) and one inherited from the father (paternal). Sequences of DNA letters (As, Cs, Gs, and Ts) provide the instructions for assembling strings of amino‐acids into proteins, which in turn form the molecular machinery that makes our bodies function: enzymes that catalyze reactions, molecules that define the structure of a cell, signaling factors and receptors, to highlight just a few examples. A stretch of DNA that encodes a particular protein is called a gene. However, only a small proportion of our genome (<1.5%) codes for proteins. The remainder includes features that regulate when and where proteins are constructed from the DNA code, and how much of each protein should be made. Nevertheless, the potential functional significance of much of the genome’s non‐coding DNA (i.e., the DNA that does not code for protein) remains to be determined.

Gametes (eggs or sperm) carry only one copy from each pair of chromosomes, selected at random during the production of the eggs and sperm. When egg and sperm from two parents fuse, the resulting embryo again has a double set of each chromosome. Each pair of chromosomes is known by a number (1–22), except for the sex chromosomes X and Y. A crucial point for understanding genetic mapping is that during the formation of the gametes, the two chromosomes of a pair line up with each other and may exchange material in a process called crossing‐over

Figure 17.1 Transmission of DNA between generations. Top: Males are represented by squares, females by circles. In this pedigree one pair of chromosomes (out of 23 pairs) is shown below each individual. Grandmother 1 carries a yellow DNA‐variant on her red chromosome, which influences trait X. She transmits her chromosomes to her offspring (individuals 3, 4, 5, 6), and because of crossing‐over between the red and the blue chromosome during egg production, each child gets a different combination. Half of her children inherit the variant that influences trait X. Top (right): Zoomed in, each chromosome can be represented as a string of letters. At most positions (the dots) the chromosomes are identical to the Human Reference Genome. At some positions (the letters) at least one of the chromosomes differs from the Reference. Such differences are on average a few hundred letters apart. The A with the yellow dot influences trait X. Bottom: Many generations later some of the descendants of 1 still carry the yellow DNA‐variant. The stretch of red chromosome surrounding it has shrunk, but in a different way in each descendant. However, individuals i, ii, iv and v still carry the C to the right of the yellow A, while iii and v still carry the G to the left of it. In a GWAS these two variants may show association with trait X. (See insert for color representation of the figure.)

(Figure 17.1). As a consequence, each maternal chromosome in the resulting egg cell is effectively a patchwork of stretches of DNA originating from both maternal grandparents. Similarly, every sperm cell carries a combination of DNA stretches from the different paternal grandparents. Thus, there is a shuffling of genetic information at each generation.
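To make the crossing‐over mechanism concrete, the toy Python simulation below builds a gamete chromosome as a patchwork of the two grandparental copies. The function name, the fixed number of crossovers, and the single‐letter chromosomes are illustrative assumptions, not part of the chapter:

```python
# Toy simulation of crossing-over during gamete formation: the resulting
# chromosome switches between the two grandparental copies at each
# randomly placed crossover point.
import random

def make_gamete(grandma, grandpa, n_crossovers=2):
    """Recombine two equal-length chromosomes (strings of letters)."""
    assert len(grandma) == len(grandpa)
    cuts = sorted(random.sample(range(1, len(grandma)), n_crossovers))
    sources = [grandma, grandpa]
    which = random.randint(0, 1)          # random starting chromosome
    pieces, prev = [], 0
    for cut in cuts + [len(grandma)]:
        pieces.append(sources[which][prev:cut])
        prev, which = cut, 1 - which      # switch source after each cut
    return "".join(pieces)

# Marking each grandparent with a distinct letter makes the patchwork
# visible, e.g., AAAAAGGGGGGGGAAAAAAA:
print(make_gamete("A" * 20, "G" * 20))
```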

Understanding DNA Variation

The genomes of two unrelated people are typically identical for more than 99% of their length. However, since a human genome is 3.1 × 10⁹ DNA letters long (times two copies), even a fraction of a percent difference means that, on average, one person differs from the next in ~3.6 × 10⁶ DNA letters (The 1000 Genomes Project Consortium, 2015). The vast majority of the DNA variations that a person carries were inherited from her/his parents. In addition, during the production of gametes, a few errors are made in copying the DNA. As a consequence, each individual also carries about 50 new (de novo) variants that were not present in the genomes of either parent.

Since most of our genome does not code for proteins, a lot of the variants (whether inherited or new) have little consequence. Even when a variant is located within the coding sequence of a gene, it does not always lead to a change in the encoded protein. This is because the coding system whereby sequence information in DNA is read off for building proteins contains some redundancy: Different three‐letter DNA codes are translated into the same amino acid (e.g., GCA, GCG, GCT, and GCC all correspond to the amino acid alanine). On average, when compared to a standardized reference genome (see https://genome.ucsc.edu/ or http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/), each person carries ~10,000 changes that yield differences in protein sequences. As mentioned above, almost all of these are inherited from one or other parent; typically just one or two of those protein‐coding differences are new (Veltman & Brunner, 2012). Most protein‐coding variations are relatively harmless. They may contribute to differences in appearance and behavior between us and our neighbors. And because we share inherited variants to some extent with our relatives, they cause similarities within families. However, some DNA variations cause or contribute to susceptibility for disease.

In addition to simple changes of a DNA letter, one person’s genome differs from that of the next person in missing pieces (deletions), having extra pieces (duplications), having stretches inverted (inversions), having multiple copies in tandem of a given stretch, and more. For an overview of all sorts of variation and their consequences, see The 1000 Genomes Project Consortium (2015). For simplicity, we will mainly focus here on single‐letter alterations.

Genomic diversity has been intensively studied in recent years, and we now know a great deal about different types of changes. Many DNA variants are fairly common in general populations. To take an arbitrary example, at a given chromosomal position, 80% of the genomes in a human population might carry an A, while the remaining 20% carry a C. The alternative letters at the same position are referred to as alleles; in this case allele A and allele C. Variations that are common—that is, more than 1% of chromosomes in a given population have the rarer allele—are known as polymorphisms. Because each individual has two copies of every piece of DNA (one paternal, one maternal), for our arbitrary example a person may either have two A alleles (homozygous A), A plus C (heterozygous), or two C alleles (homozygous C). The combination of the two alleles at a position is called the genotype. Most polymorphisms are close to neutral with respect to health. If they are harmful, fewer children with the damaging allele survive in every generation, and eventually the variant disappears.
If, on the other hand, one allele confers an advantage as compared to the alternative allele at that position, it becomes more frequent every generation until it is fixed, meaning that the alternative allele is lost. These processes are called selection. Without selection, the allele frequencies of a polymorphism remain roughly constant over time in a population.
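The redundancy of the coding system can be made concrete with a small lookup table. The snippet below covers only a handful of codons, and the function is a hypothetical helper written for this illustration:

```python
# Redundancy of the genetic code, using the alanine example from the
# text: a third-letter change is often synonymous (same amino acid),
# whereas a first-letter change usually alters the protein.
CODON_TABLE = {"GCA": "Ala", "GCG": "Ala", "GCT": "Ala", "GCC": "Ala",
               "TCA": "Ser", "CCA": "Pro", "ACA": "Thr"}  # tiny excerpt

def effect_of_variant(codon, position, new_letter):
    variant = codon[:position] + new_letter + codon[position + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[variant]
    return variant, "synonymous" if same else "non-synonymous"

print(effect_of_variant("GCA", 2, "G"))  # ('GCG', 'synonymous')
print(effect_of_variant("GCA", 0, "T"))  # ('TCA', 'non-synonymous')
```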

Genetic Architecture

Different traits or disorders can differ in the nature of their genetic architecture. Here, we will consider two well‐studied extremes. Certain differences in traits between people can be caused by a single genetic change of large effect, a type of genetic architecture that we call monogenic. Many severe diseases are monogenic: An important gene gets disrupted and results in, for example, deafness, blindness, intellectual disability, or some other major disorder affecting one or more tissues of the body. Deleterious DNA variants with large effects are typically rare in the general population, because they are purged by selection. Monogenic traits often show strong clustering within families, and can be identified by their inheritance patterns. Beyond disorders, a frequently cited example of apparent monogenic inheritance in the general population is the ability to taste the bitter compound phenylthiocarbamide (PTC), which is largely determined by variation in the TAS2R38 receptor, involving two alternative common alleles (a “G” instead of an “A” at one point in this gene). However, in recent years it has become clear that PTC tasting varies along a continuum, and is not purely monogenic (Bufe et al., 2005); it is suspected that other genes, as well as environmental factors, modulate a person’s abilities in this regard.

This leads us toward the other extreme of genetic architecture. Some traits are far from monogenic, being influenced by the joint action of a large number of DNA variations, occurring in many different genes, that each have a small effect on the trait. Height is a good example of such a multifactorial trait. Many of the relevant DNA variations have such small effects by themselves on survival or fecundity that they are not filtered out by selection even if the trait they contribute to is detrimental, thus remaining polymorphic in the general population. While height is a quantitative trait with a continuous distribution, the multifactorial model can also apply to dichotomous traits and diseases. The seesaw provides a useful analogy. A small weight placed on the higher seat will not cause it to topple, but if you keep piling up additional weight, at some point the higher seat will suddenly come down. In a similar way, once a person has a dangerously large number of deleterious DNA variations he or she may develop a particular disease, whereas people with a lower number of those variations are fine. Dichotomous traits with a multifactorial basis cluster less strongly in families than monogenic traits, since there is a low probability that someone will transmit the total package of deleterious DNA variations to a child, given that they may be located at many different sites of the genome. Common experience with multifactorial quantitative traits teaches us that being tall or short clusters in families to a certain degree, but that extreme parents often have less extreme children, while average parents occasionally have an exceptionally tall or short child. This is how it is with most multifactorial traits.

In both types of genetic architecture—monogenic and multifactorial—environmental factors may also contribute. Moreover, it is possible for a trait to lie between these extremes, for example by involving interactions of variants of medium effect size in a relatively small number of genes. Such intermediate models are poorly understood at this time.
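The seesaw analogy corresponds to what geneticists call a liability‐threshold model, and it is easy to simulate. All numbers in the sketch below (allele frequency, per‐allele effect, threshold) are invented purely to show how many small genetic effects can produce a dichotomous trait:

```python
# Liability-threshold sketch of the seesaw analogy: many small-effect
# risk alleles plus environmental noise add up to a liability score, and
# the trait appears only when that score crosses a threshold.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_variants = 10_000, 500
genotypes = rng.binomial(2, 0.3, size=(n_people, n_variants))  # 0/1/2 risk alleles
liability = genotypes.sum(axis=1) * 0.05 + rng.normal(0, 1, n_people)

threshold = np.quantile(liability, 0.95)   # top 5% of liability -> affected
affected = liability > threshold
print(f"prevalence: {affected.mean():.2%}")
```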

Introducing the General Approach

If we want to identify genes involved in variation in language skills, the strategy used depends on assumptions about the genetic architecture. Nonetheless, most approaches posit that a DNA variant influencing the trait originated at some point in time, and was transmitted to the next generation, together with surrounding stretches of DNA (Figure 17.1). Because of successive events of crossing‐over, the surrounding section of co‐transmitted DNA (linked to the variant of interest, and hence to the trait) gets smaller and smaller, the more generations pass (Figure 17.1). People showing the same trait are therefore likely to share the putative DNA variant that contributes to it, along with a surrounding stretch of DNA. The more distantly related these people are, the shorter this section of shared DNA will be. Within a family, the regions of shared DNA around a causal variant can be as large as a quarter of a chromosome. If we collect from the general population seemingly unrelated people who share a particular trait, the stretch of shared DNA around a causal variant can be as small as one or two genes. These people may seem unrelated, but they have all inherited that particular stretch of DNA from the same distant ancestor. People who share the same stretch of DNA not only share the variant of interest, but also variants at a number of presumably neutral neighboring polymorphisms (see Figure 17.1, bottom). By determining the genotypes of these common polymorphisms in people, we can map out where shared sections of DNA lie. This is an important step toward pinpointing the locations of the causative variants themselves.

Monogenic traits are usually studied in families with multiple relatives affected by the trait. The aim is to locate a stretch of DNA that the affected relatives share with each other, but not with the unaffected family members; that is, a chromosomal region where all the variants show linkage with the trait.

Multifactorial studies, in contrast, involve analyzing a set of unrelated people affected by a trait or disorder (cases) and comparing their genotypes to those of people who are unaffected (controls). In such a case‐control design, we expect variants that contribute to the trait, and additional variants in the surrounding DNA, to be more common among cases than controls. However, not all cases will carry the same set of trait‐related DNA variants. Also, within a multifactorial framework, a trait‐related variant found in cases can also be carried by controls, since it is ultimately the overall load of risk variants in multiple genes that contributes to whether or not the trait develops (as in the seesaw analogy described earlier). Thus, in multifactorial studies, rather than testing for presence/absence of a particular allele at a polymorphic marker, we compare whether the frequency of the allele differs between cases and controls. Sometimes the trait of interest can be indexed by a quantitative measure that shows continuous variation in a population (standard examples from biomedical fields include height and blood pressure) rather than presence/absence. For these traits we can either compare people who lie at the trait extremes, or collect a random sample of people from throughout the normal distribution showing a range of different values. The choice of traits or trait combinations (phenotypes) to study, along with optimal ways to approach quantitative traits, is discussed later, with particular reference to speech, language, and reading skills.
Because of technical limitations in laboratory techniques or in computing possibilities, we may first try to identify the broad location of the suspected causal DNA variant in the genome and only later search for the particular DNA variant itself. In the past, this was the normal approach for family studies, but recent technological advances have made it possible to start looking for the causal DNA variant directly by means of next‐generation sequencing (NGS) (Metzker, 2010). The traditional method is still in use, though, for practical reasons or because of the costs. We will go into the details later.

In studies of multifactorial traits, unless there is a clear‐cut prior hypothesis concerning a specific gene, it is necessary to start with a systematic search of hundreds of thousands of polymorphisms across the genome. This is called a genome‐wide association scan (GWAS) (McCarthy et al., 2008). As discussed later, a GWAS requires thousands of individuals to yield adequate statistical power. Obtaining and analyzing sequence data of the whole genome rather than just a set of polymorphisms in cohorts of this size is not yet feasible for most laboratories, so DNA‐chip technology is used to read each individual’s genotype (DNA letters) at a great many common polymorphisms.

Whichever approach we choose, statistics are crucial. When we observe a genetic difference between affected and unaffected individuals, rigorous statistical analyses are required to determine whether or not this can be explained by random sampling error. Indeed, the proper statistical methodology for genetic analyses is an intensely studied field in its own right. In addition to robust statistical support for a finding, along with replication in independent cohorts, we often want to collect evidence that the functions of a particular gene are relevant to the trait of interest and that they might be altered by the genetic changes we observe. This can be done by a variety of experiments, for example using cells grown in the laboratory or animal models, which we will not discuss in this chapter. However, a large amount of knowledge about genes, what they do, where and when they are switched on in the body, and other aspects has already been collected. As such, geneticists spend a lot of time mining public (online) databases for information on the various candidate genes highlighted by their genetic mapping studies.
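To make the case‐control comparison concrete, a single polymorphism in a GWAS can be tested by comparing allele counts between the two groups. The counts below are invented; the p < 5 × 10⁻⁸ cutoff is the conventional genome‐wide significance threshold used to compensate for the hundreds of thousands of variants tested:

```python
# Chi-square test on a 2x2 table of allele counts for one polymorphism.
# Each person contributes two alleles, so 1000 cases -> 2000 alleles.
from scipy.stats import chi2_contingency

#            allele A  allele C
cases    = [   1100,      900]
controls = [   1250,      750]
chi2, p, dof, expected = chi2_contingency([cases, controls])
print(f"p = {p:.2e}, genome-wide significant: {p < 5e-8}")
```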

Techniques for Characterizing Genetic Variation

To give an idea of the lab work involved in performing a genetic study, we will here describe a few common techniques. In order to analyze the genes of a study participant we must first isolate the person’s DNA. Because the genome sequence is virtually the same in every cell of the body, we preferably use a tissue that is easily sampled and processed. Traditionally, blood has been the tissue of choice, especially since it gives particularly large yields of high‐quality DNA. In situations where drawing blood is difficult (e.g., participants have a fear of needles), we can collect DNA non‐invasively from other tissue such as the inside of the cheek, sampled by buccal swabs or saliva sampling. Saliva sampling can even be done by mailing participants a prepared container to spit in, and having it returned to the lab. Once blood or saliva samples are in the lab, extracting and purifying the DNA can be done with commercial kits.

Several alternative techniques are used to read the nucleotide letters from the DNA sample of a participant. We may read out the individual letters in consecutive sequence from an entire stretch of DNA (the size of which might vary depending on

the technique). This type of approach is called DNA sequencing. For reasons of speed, costs, and computational ease, we may in some study designs choose to only assess which DNA letter(s) a person carries at a predefined set of known polymorphisms. In that case, we do not read complete “sentences,” but only a single letter here and there. With currently available methods, the number of polymorphisms that are investigated could range from just a single variant to hundreds of thousands of known polymorphisms. This is generically known as genotyping. We will describe both sequencing and genotyping in more detail.

Sequencing

The aim of sequencing is to read the code of a given stretch of DNA letter by letter. There are currently two prominent types of technology for doing so: traditional “Sanger” techniques, and more novel high‐throughput massive parallel sequencing techniques, also known as next‐generation sequencing (NGS), which have emerged during the past decade. The output from these techniques is typically not the sequence of just a single DNA molecule from a single cell, but the average of the sequence of many molecules from many cells. If at a given position in the DNA the nucleotide letter you inherited from your father is different from the one that you inherited from your mother (i.e., you are heterozygous at this position), half of the sequenced molecules will have the paternal letter at that position, while the other half will have the maternal letter at that same position. For example, for a particular stretch of DNA sequence a person’s code might be read as “GTGCAAGA(C/T)GAGACAGGTAAA,” indicating that half the molecules are “GTGCAAGACGAGACAGGTAAA” while the other half are “GTGCAAGATGAGACAGGTAAA” (Figure 17.2). Unless the corresponding sequences of the mother and father have also been determined, the result does not tell you which letter (in this case, C or T) was inherited from which parent.

Traditional Sanger techniques are still considered to be of better quality than the available NGS techniques, with higher sensitivity and specificity, but NGS is rapidly catching up. To perform Sanger sequencing, one must first isolate a specific stretch of interest from the long molecules of DNA. This is done with a technique called polymerase chain reaction (PCR), in which a particular region of the genome is selectively and exponentially amplified from the original DNA sample, generating large numbers of copies of this target region (https://youtu.be/iQsu3Kz9NYo). The amplified material is then used as a template in a sequencing reaction. A single such reaction typically reads stretches of up to ~800 letters. Most protein‐coding genes are substantially larger than this, so it is almost always necessary to carry out multiple reactions to cover the full length of a gene. This technique is relatively low‐throughput and preferred if sequencing only a few stretches of DNA per individual, in which case it is faster and cheaper than NGS. The material costs including PCR are around $2 per reaction per individual (as estimated in 2016).

During Sanger sequencing, DNA molecules resulting from the PCR procedure are read by a sequencing machine. Each “letter” that is encountered generates a fluorescent signal, with a different color for each letter (usually A = green, C = blue, G = black, T = red). A series of differently colored fluorescent signals reveals the DNA sentence that was offered to the machine. More detailed explanations can be found on YouTube, for example, https://youtu.be/e2G5zx-OJIw. Figure 17.2 shows a visualization of Sanger sequencing results.
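Two quantitative details in this description are easy to check in code. PCR doubles the target with each cycle, and the sequencer’s colored peaks map one‐to‐one onto DNA letters; the peak list below is invented, but the color coding is the one given above:

```python
# PCR arithmetic: exponential amplification of the target region.
cycles = 30
print(f"copies after {cycles} cycles: {2 ** cycles:,}")  # 1,073,741,824

# Decoding a (simplified) Sanger trace: one colored peak per base.
COLOR_TO_BASE = {"green": "A", "blue": "C", "black": "G", "red": "T"}
peaks = ["black", "red", "black", "blue", "green", "green"]
print("".join(COLOR_TO_BASE[c] for c in peaks))  # GTGCAA
```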

Figure 17.2 Visualization of Sanger sequencing results. Sanger sequencing results for two individuals for the same stretch of DNA. On the X‐axis, the position along the sequenced fragment of DNA; on the Y‐axis, the fluorescence intensity for four different colors. A different color lights up for each of the bases that is read: A = green, C = blue, G = black, T = red. In the middle of the lower image, two different colors light up at the same position (arrow), because the individual has inherited different letters from his father and mother: the reference letter C and the variant letter T. Some background coloring can be seen near the bottom of each image. This is an artefact. (See insert for color representation of the figure.)

Next Generation Sequencing (NGS) is the preferred technology when you need to read a large number of DNA letters per individual. With NGS it is possible to read all 3.1 × 10⁹ letters of a person’s genome in a single experiment (whole genome sequencing, or WGS). Alternatively, a method called enrichment can be used to initially isolate all known protein‐coding parts (~5.5 × 10⁷ letters, known as the exome), before sequencing only these sections (whole exome sequencing, or WES). NGS reads 50 to 300 letters at a time, depending on the platform and equipment being used (Goodwin, McPherson, & McCombie, 2016). At the end of the experiment, the database in the sequencing machine contains millions of short DNA “sentences,” along with information on their reliability. Intensive computer analyses are required to make sense of all these data. Usually, this involves aligning each DNA sentence to the matching part of the full “text” of the Human Reference Genome (http://www.ensembl.org/Homo_sapiens/Info/Index), a little like assembling the pieces of a huge jigsaw (albeit one that is linear). Multiple sentences will overlap at every position in the text, meaning that each letter has been read several times, increasing the confidence in the accuracy of the sequence information (Figure 17.3). Then, positions in the data that deviate from the reference genome can be identified and listed. Processing of NGS data is highly demanding in terms of computer time, power, and storage capacity. Depending on the quality required, WGS costs around $1200 per sample (as estimated in 2016). WES is currently cheaper (~$500) including the cost of the enrichment. The investment costs for equipment and for computer infrastructure are considerable, which makes outsourcing a common solution for most laboratories.

Figure 17.3 Next generation sequencing. Next Generation Sequencing data for one individual for a short stretch of the DNA of gene GABRB3. Each horizontal bar (blue or pink) represents a single sequenced molecule. The sequences are aligned to the Human Reference Genome (bottom). If the sequence differs from the Reference, this is indicated—see blue C in the middle. About half of the molecules carry the C, the other molecules carry the T (as indicated in the Reference). This individual is heterozygous at this point. Either a C was present in the DNA from one of his parents and a T in the other parent, or the C originated de novo during egg or sperm production. (See insert for color representation of the figure.)

While any student can carry out Sanger sequencing, NGS typically needs dedicated technicians and experts in bioinformatics.
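The heart of the alignment step can be sketched as a “pileup”: count which letters the overlapping reads report at each reference position and flag positions where two alleles each account for a substantial share of the reads (cf. Figure 17.3). The mini‐reference, the reads, and the 30% allele threshold below are all invented for illustration:

```python
# Pileup-style variant calling on already-aligned reads. A position where
# two different letters each appear in a good fraction of the reads is
# called heterozygous.
from collections import Counter

reference = "GTGCAAGACGAGACAGGT"
aligned_reads = [(0, "GTGCAAGAT"), (2, "GCAAGACGA"), (4, "AAGATGAGA"),
                 (6, "GACGAGACA"), (8, "TGAGACAGG")]  # (start, sequence)

pileup = [Counter() for _ in reference]
for start, read in aligned_reads:
    for offset, base in enumerate(read):
        pileup[start + offset][base] += 1

for pos, counts in enumerate(pileup):
    depth = sum(counts.values())
    alleles = [b for b, n in counts.items() if depth and n / depth >= 0.3]
    if len(alleles) > 1:
        print(f"position {pos}: heterozygous {sorted(alleles)}, "
              f"reference {reference[pos]}, depth {depth}")
# -> position 8: heterozygous ['C', 'T'], reference C, depth 5
```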

Genotyping

For a large number of positions in the human genome, previous experiments have shown that people carry different letters, called polymorphisms, as explained earlier in the chapter. For most known polymorphisms only two of the possible four letters are common in the general population (i.e., there are two alternative alleles). Publicly available online databases have collated information about these polymorphisms, including their allele frequencies in various ethnic populations across the world. One of the most well‐used databases, dbSNP (http://www.ncbi.nlm.nih.gov/SNP/), catalogues over 150 million different single nucleotide variants (July 2016).

In the early days of molecular genetics, it was necessary to genotype variants one by one in the DNA samples of interest. Genotyping practices were revolutionized at the end of the 1990s by the development of glass slides to which assays for hundreds of thousands of known single nucleotide polymorphisms (SNPs) can be attached: a SNP‐chip. Once DNA from the studied person is added, each individual assay detects the presence of one of the known alleles for the polymorphism of interest, flagging this with a fluorescent label, such that two assays are needed per polymorphism. After computer processing of the signals, the genotypes of the person whose DNA was added, at each of those hundred thousand or more polymorphisms, are known (Figure 17.4, Table 17.1).

Figure 17.4 Visualization of SNP‐chip results. SNP‐chip assay result for a single polymorphism. Each dot represents an individual. X‐axis: intensity of the fluorescent label attached to one allele at the polymorphism (e.g., A). Y‐axis: intensity of the fluorescent label attached to the other allele (e.g., C). The software recognizes three clusters and assigns a genotype to each individual (e.g., A/A (red), A/C (green), or C/C (blue)). Black dots show samples without signal: controls that contained water instead of DNA. (See insert for color representation of the figure.)

Table 17.1 Example of genotyping chip results for four individuals and five polymorphisms.

Polymorphism   Chromosome   Position   Ind 1   Ind 2   Ind 3   Ind 4
rs6051856      20           41499      A/A     A/A     A/G     A/G
rs6038013      20           56187      A/A     A/A     A/G     A/G
rs5038037      20           57272      G/G     G/G     C/G     C/G
rs2298108      20           82476      C/C     C/C     C/T     C/T
rs2298109      20           86125      G/T     T/T     G/T     G/T

Nowadays there are several companies (such as Affymetrix and Illumina) that produce standardized commercial SNP‐chips allowing genome‐wide genotyping at low cost (e.g., $100–$200 per sample). As can be seen in Table 17.1, each individual has two letters per polymorphism: one inherited from each parent. At this stage of genotyping, it is not known which letter came from which parent, so they are often presented in alphabetical order (C/T, A/G, etc.). Many labs carry out such experiments themselves, but they are also offered as a commercial service.

SNP‐chips are a quick and easy way to genotype very large numbers of polymorphisms in one experiment for large cohorts of people. However, for other experiments, assays similar to those attached to the SNP‐chips are available individually to genotype just a single polymorphism (or perhaps a handful) in a few dozen to a few thousand people. A number of companies sell such assays as kits, to be carried out in your own laboratory, and there are also some that will provide genotyping as a service.
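A natural first‐pass summary of output like Table 17.1 is the allele frequency of each polymorphism across the cohort. The sketch below hard‐codes two rows of the table and tallies alleles from the unordered two‐letter genotype strings:

```python
# Allele frequencies per polymorphism from SNP-chip genotype calls.
from collections import Counter

genotypes = {  # polymorphism id -> genotypes of the four individuals
    "rs6051856": ["A/A", "A/A", "A/G", "A/G"],
    "rs2298109": ["G/T", "T/T", "G/T", "G/T"],
}

for snp, people in genotypes.items():
    alleles = Counter(a for g in people for a in g.split("/"))
    total = sum(alleles.values())       # two alleles per person
    freqs = {a: n / total for a, n in alleles.items()}
    print(snp, freqs)  # e.g., rs6051856 {'A': 0.75, 'G': 0.25}
```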

Collecting Phenotypes and Defining Cohorts

The methods described in this chapter essentially consist of uncovering correlations between variability at the level of genotype and that at the level of the trait or trait combination (phenotype) in a cohort of interest. As outlined above, there are well‐standardized techniques available for obtaining reliable information about the DNA letters from any study cohort. For a successful outcome it is just as critical to obtain a robust characterization of the traits and characteristics of the study participants. Language‐related skills offer considerable challenges when it comes to this side of things.

One strategy that has proved valuable has been to target developmental disorders in which there are unexplained problems with speech, language, or reading occurring against a background of normal intelligence and sensory acuity, along with adequate exposure to spoken/written language in the environment (Bishop, 2001; Fisher & DeFries, 2002; Fisher, Lai, & Monaco, 2003). Research in this area may use performance on a number of different tests, along with clinical reports and case history where available (e.g., from speech/language therapists), to make a formal diagnosis in the individuals taking part in the study. Participants are designated as either affected or unaffected with the disorder of interest, and the geneticist then searches for correlations between genotypic data and this dichotomous affection status in the study cohort. Examples where a “qualitative” approach to trait definition has been particularly effective include studies of childhood apraxia of speech (CAS, also known as developmental verbal dyspraxia), a rare disorder in which individuals have problems in mastering the rapid coordinated sequences of orofacial movements that underlie fluent speech, leading to inconsistent errors that worsen as the complexity and length of the utterance increases (see the later section on Exemplary Studies). Similarly, several investigations of developmental dyslexia (specific reading/spelling disability) have employed qualitative definitions of the disorder to pinpoint suspected candidate genes, such as DYX1C1 and ROBO1 (reviewed by Carrion‐Castillo et al., 2013).

Qualitative all‐or‐none approaches to defining language‐related disorders have certain limitations (discussed in more detail by Fisher & DeFries, 2002; Fisher et al., 2003). A positive diagnosis might be made based on a child scoring significantly below the level expected for his/her age on one or more measures of language or reading performance. The tasks used to assess performance usually show continuous variation in the general population, and the exact choice of threshold can be somewhat arbitrary. Sometimes, formal definitions of these disorders also require a discrepancy between language/reading and general cognition, as assessed by tests of non‐verbal IQ, and again the most appropriate degree of discrepancy to apply remains a matter of debate (Fisher & DeFries, 2002). Another difficulty concerns the fact that different types of language‐related disorders can co‐occur in the same individual, which could reflect biological pathways that impact on multiple skills simultaneously. Traditional diagnostic schemes depend on exclusionary criteria and do not deal well with instances of comorbidity.
For instance, the term specific language impairment is used to describe a child who has problems with receptive and/or expressive language, without any deficits in speech motor functions, leading to the misleading conclusion that no child could have both specific language impairment and childhood apraxia of speech together. Moreover, speech, language, and reading skills are developmental traits, such that a child’s diagnosis might change at different ages (even though their genetic material remains the same). For instance, a child diagnosed with specific language impairment before reading instruction has started may eventually acquire adequate language skills, but be considered dyslexic later when she or he has problems with learning to read. Overall, a single qualitative diagnosis of a language‐related disorder may potentially encompass a heterogeneous mixture of different causes, which could impede the discovery of reliable correlations between traits or disorders and genotypes.

Therefore, an alternative way of studying the genetics of language‐related disorders is to move away from categorical diagnoses and directly employ the quantitative scores from relevant measures for the genetic analyses. Moreover, such an approach allows researchers to investigate different aspects of our speech, language, and reading faculties, using tasks that are hypothesized to tap into distinct components. Commonly studied traits include the ability to identify and manipulate the sounds in spoken words (phoneme awareness), the ability to retain new phonological information without rehearsal (phonological short‐term memory), the understanding of rules marking tense, number, gender (grammatical morphology), the ability to recognize written word forms (orthographic processing), and the rapid naming of highly familiar visual symbols (rapid automatized naming) (reviewed by Carrion‐Castillo et al., 2013; Fisher & DeFries, 2002; Fisher et al., 2003). By focusing on quantitative traits it becomes possible not only to study the molecular basis of disorders (the extremes), but also to investigate the genetic underpinnings of normal variation in language skills in the general population (e.g., Gialluisi et al., 2014; Luciano et al., 2013), which is likely to be highly multifactorial. We later discuss an illustration of this approach under Exemplary Studies.

Another key factor in study design concerns the types of cohorts that are collected. For a monogenic trait, such as a rare language‐related disorder, the usual approach is to try to identify multigenerational families in which there are multiple affected individuals, showing an apparently simple inheritance pattern. As discussed in the section “Analyzing the Data,” the structure and size of the family, particularly the numbers of affected people in the different generations, is important. Not only does this indicate whether there is likely to be a monogenic explanation, but it also gives an idea of how much statistical power there is to track down the responsible genetic alteration. In addition, a robust assessment of affection status is crucial, because misdiagnosis of an individual could derail attempts to uncover the relevant gene. For studies that focus on multifactorial traits, and hence assume the involvement of common DNA variants with small effect sizes, it is typically necessary to collect large cohorts of thousands of people.
As mentioned above, if the trait of interest shows continuous variation in the general population and can be indexed by a reliable quantitative measure, it can be studied in a cohort of people collected randomly from the general population. Examples include birth cohorts like the Avon Longitudinal Study of Parents and Children (ALSPAC) in the UK and Generation Rotterdam (GenR) in the Netherlands.
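For such quantitative measures, the per‐variant analysis is typically a regression of the trait score on the number of copies of one allele (0, 1, or 2). The data and effect size below are simulated; a real analysis would also include covariates such as age and sex:

```python
# Per-variant test for a quantitative trait: regress score on allele count.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 2000
allele_count = rng.binomial(2, 0.4, size=n)        # genotype per person
score = 0.1 * allele_count + rng.normal(0, 1, n)   # small simulated effect

result = stats.linregress(allele_count, score)
print(f"effect per allele: {result.slope:.3f}, p = {result.pvalue:.2e}")
```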

Analyzing the Data

Once the genetic data (genotypes or sequences) have been collected, the information needs to be analyzed to find out what DNA alterations may be involved in the traits under study. It is necessary to establish whether there is a statistically significant relationship between a variant and a trait. Here we briefly describe the approaches typically used for studying monogenic and multifactorial traits.

Monogenic Traits—Linkage in Large Families

Suppose we have a suspected monogenic disorder clustering in a family, such that each affected person has apparently inherited the disorder from one of his/her parents. We would like to find out if a single genetic variant can explain the disorder in this family. The traditional method, which is still quite often used, involves two steps. In the first step, we search for any sections of the genome in which DNA variants are shared (i.e., the same letters at a polymorphism) between the affected (but not the unaffected) relatives in the family; that is, we want to identify genomic regions that are linked to the disorder. At present, this step can be run in a cost‐effective way by genotyping DNA variants across the genome in all available family members using SNP‐chips (DNA arrays for genotyping, as described above). Since the shared stretches of DNA that we are looking for are expected to be rather large, genotype data from ~10,000 well‐distributed common polymorphisms will suffice (Figure 17.1). By using software that systematically considers the inheritance pattern of each polymorphism, we can detect any parts of the genome that show significant linkage to (i.e., are inherited together with) the disorder. Rigorous statistical methods are used to determine whether a linkage that we observe is a significant finding. If there is significant linkage, this means the linked polymorphism and the DNA variant that causes the trait are very likely to be located relatively close to each other on the DNA molecule. The section of DNA around this polymorphism is now the place to look for the causal DNA variation. Investigating a number of adjacent polymorphisms and checking their linkage to the trait will tell us something about the size of the section we need to investigate.
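In spirit, this first step looks for alleles that co‐segregate with the disorder. Real linkage software models inheritance probabilistically across the whole pedigree, but the naive filter below conveys the idea; the family members and genotypes are invented:

```python
# Naive co-segregation filter: flag polymorphisms at which every affected
# relative shares an allele that no unaffected relative carries.
affected = {            # person -> genotypes at three polymorphisms
    "aunt":   ["A/C", "G/G", "C/T"],
    "nephew": ["A/A", "G/T", "C/C"],
}
unaffected = {
    "uncle":  ["C/C", "G/T", "T/T"],
}

for i in range(3):
    shared = set.intersection(*(set(g[i].split("/")) for g in affected.values()))
    carried = set.union(*(set(g[i].split("/")) for g in unaffected.values()))
    candidates = shared - carried
    if candidates:
        print(f"polymorphism {i}: allele(s) {sorted(candidates)} track the trait")
# -> polymorphisms 0 and 2 survive; polymorphism 1 does not, because the
#    unaffected uncle also carries the shared G allele.
```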

On finding a section of the genome that shows linkage, the assumption is that somewhere within it lies a rare variant (perhaps even unique to that family) that is responsible for causing the disorder. The aim of the second step is then to identify that causative variant, usually by reading (sequencing) the DNA of all the genes in the linked section in the family. Since the regions implicated by the first stage often contain tens to hundreds of different genes, this second step can be very time-consuming, unless an obvious candidate gene (such as one previously implicated in a related disorder) resides in the region. Most monogenic disorders are caused by DNA variants that change proteins, so the search would usually focus primarily on sequencing the protein-coding parts of the linked interval. Sometimes, tracking down the responsible gene can be aided by finding other families who show linkage of a similar disorder to the same region, or independent cases of people in whom the region is disrupted by a large-scale chromosomal rearrangement.

These days, as an alternative to the above two-step search, we might instead take advantage of the new possibilities offered by whole genome/exome sequencing. Ideally, we would like to have sequence data for all members of a family of interest, but because next-generation sequencing (NGS) is still quite expensive, we may only be able to afford this for two or three relatives. In that case, a popular approach is to select two affected people from the family who are not too closely related, such as two cousins. If resources allow, we might add an unaffected brother or sister of one of them as a control. In these three people, we can sequence all the protein-coding parts of the DNA or even the entire genome (WES or WGS, as described earlier). We can then look for any protein-changing variants that are shared by the two affected people but absent in the unaffected sibling control. If the disorder is rare and easily recognizable, the causative DNA variant is unlikely to be present in healthy individuals. Public databases are available on the internet with sequence data from thousands of apparently healthy individuals, and we can discard all variants that are seen in these databases. Usually, these steps leave only a handful of candidate variants. Using Sanger sequencing (see above), we can then inspect these variants in the whole family. The remaining suspects are those variants that are seen in all affected relatives but in none of the unaffected family members, that alter the protein sequence in a way that is predicted to substantively alter protein function, and that have never been found in healthy individuals in any other studies. If using a WGS/WES strategy, we can statistically test for linkage at the end of the search, rather than using linkage mapping as a starting point, by performing statistical analyses for just the candidate causal DNA variants. Even when there is convincing statistical evidence that the causal variant has been found, additional investigations are needed to increase confidence in this result, such as the identification of other causative variants in the same gene in unrelated families/cases, or proof of functional effects from, for example, studying the result of manipulating the gene in cultured cells or animal models.
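The filtering logic just described boils down to set operations over variant calls. The Python sketch below makes this concrete; the variants, and the simple tuple representation, are entirely hypothetical, and a real pipeline would operate on VCF files and query resources such as the 1000 Genomes data rather than an in-memory set.

```python
# Minimal sketch of the exome-filtering logic described above. Each variant
# call is represented as a (chromosome, position, alternative_allele) tuple;
# all positions and alleles here are invented for illustration.
affected_cousin_1 = {("7", 114_000_123, "A"), ("2", 55_100_200, "T"), ("12", 9_020_331, "G")}
affected_cousin_2 = {("7", 114_000_123, "A"), ("2", 55_100_200, "T"), ("5", 1_220_410, "C")}
unaffected_sibling = {("2", 55_100_200, "T")}

# Variants reported in public databases of apparently healthy individuals
# (in practice, large reference resources would be queried).
known_in_healthy = {("5", 1_220_410, "C")}

candidates = (
    (affected_cousin_1 & affected_cousin_2)  # shared by both affected relatives
    - unaffected_sibling                     # absent in the unaffected control
    - known_in_healthy                       # never seen in healthy cohorts
)
print(candidates)  # -> {('7', 114000123, 'A')}: a handful of suspects remain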
Described above is the ideal situation. In real life, some data from sequencing or SNP-chips may be of low quality, one or more family members may have died or be unwilling to cooperate, the disorder may show variability that makes it hard to be certain about its presence or absence in some people, and so on. For the traditional two-step method to be viable, we need DNA and matching trait data for at least three generations of a family, and the third generation should include at least two affected people. Depending on a number of factors, we need about ten to twelve affected relatives to be able to find a significant result for a dominant disorder. A single large family would be perfect for such a study, but a combination of several families could be used, assuming that the same gene is disrupted in each (difficult to establish a priori unless the phenotype is particularly distinctive). For the modern WGS/WES-based method, different statistics are possible, and some of these criteria can be relaxed. You would not necessarily need all family members to be able to trace the inheritance patterns, because a truly rare DNA variant is unlikely to occur more than once in a family unless it has been passed down from one family member to the next.

So far we have discussed so-called dominant monogenic inheritance, in which a disorder may result from a DNA change in just one of the two copies a person has of each gene. Some disorders occur only when both the maternal and the paternal copy of the relevant gene are disrupted; these disorders are called recessive. For disorders that show recessive inheritance, similar methods can be used with certain adaptations, which we will not go into here.

Multifactorial Traits—Identifying Common Effects with GWAS

When a trait is suspected to have a multifactorial genetic architecture, involving the combined effects of a number of different common polymorphisms, each with a small effect size, a typical study would collect a large cohort of individuals and test for association between gene variants and the trait (e.g., a score on a test, or diagnosis of a disorder). If we know of a gene that we already think is very likely to be involved in the trait (a candidate gene), we can select polymorphisms in and around that gene to focus on, as described below. However, for traits where we know little about the biology, as in the case of language-related phenotypes, it is difficult to pick out appropriate candidate genes and come up with reasonable hypotheses to test. Technological developments have allowed geneticists to overcome this issue by carrying out a hypothesis-free (with respect to gene choice) search in which polymorphisms across the whole genome are systematically interrogated for association.

While for a monogenic disorder a single causal variant is often sufficient to fully account for the risk of developing the disorder in a family, for multifactorial traits a risk variant might increase a person's chance of having a disorder by less than 1%. To have enough statistical power to detect the subtle roles of such variants, it is necessary to collect DNA and matching phenotypic trait information from large cohorts. For some traits a cohort of thousands of unrelated people could be enough to support a GWAS, but it is becoming common for studies to analyze cohorts comprising tens of thousands of participants (sometimes only possible through meta-analyses).

Using SNP-chips, hundreds of thousands of polymorphisms are genotyped in each individual of the cohort. The statistical analyses involved in testing for association are conceptually simple. If we are studying a dichotomous trait, in which we can divide participants into two groups (e.g., cases versus controls), we can use, for example, a chi-square test to assess for each polymorphism whether one of the alleles has a significantly higher frequency in one group than in the other. If we are studying a quantitative trait, we can use an approach such as linear regression to test for each polymorphism whether there is a relationship between the alleles that participants carry and the trait. For example, for a C/T polymorphism we can ask whether the number of C alleles a person carries at that polymorphism (0, 1, or 2) is correlated with the quantitative score. Because DNA variants that lie close together on a chromosome tend to be transmitted together for many generations, they tend to co-occur on the same stretch of DNA even in people who are seemingly unrelated. Therefore we often see evidence of association for a number of neighboring polymorphisms on a chromosome.

A single GWAS requires hundreds of thousands of separate statistical tests. Under the null hypothesis of no association between a polymorphism and a trait, 5% of those tests will yield a p-value below 0.05. So, clearly, the standard threshold is not suitable, since it would deliver an unacceptably high number of false-positive findings. The consensus of the field is that in a GWAS an association between a polymorphism and a trait is only considered significant if the p-value is less than 5 × 10⁻⁸, and even then independent replication in another cohort is usually necessary to be convincing.
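Both tests described above take only a few lines to express. The Python sketch below runs a chi-square test on hypothetical case/control allele counts, and a linear regression of a simulated quantitative score on allele dosage (the number of C alleles: 0, 1, or 2); all numbers are invented for illustration, and a real GWAS would repeat such a test for every polymorphism, with adjustment for covariates such as ancestry.

```python
import numpy as np
from scipy import stats

GENOME_WIDE_ALPHA = 5e-8  # the field's consensus GWAS significance threshold

# Dichotomous trait: 2x2 table of allele counts (all counts hypothetical).
# Rows: cases, controls; columns: counts of the C and T alleles.
table = np.array([[620, 380],   # cases
                  [540, 460]])  # controls
chi2, p_cc, dof, expected = stats.chi2_contingency(table)
print(f"case/control chi-square p = {p_cc:.3g}")

# Quantitative trait: regress the score on allele dosage (0, 1, or 2).
rng = np.random.default_rng(1)
n = 5000
dosage = rng.binomial(2, 0.3, size=n)        # C-allele count per person
score = 0.05 * dosage + rng.normal(size=n)   # simulated score, tiny true effect
result = stats.linregress(dosage, score)
print(f"quantitative-trait p = {result.pvalue:.3g}; "
      f"genome-wide significant: {result.pvalue < GENOME_WIDE_ALPHA}")
```

Running this, the simulated quantitative effect typically yields a nominally small p-value that nevertheless falls far short of the genome-wide threshold, illustrating why the 5 × 10⁻⁸ criterion is so demanding.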
Since the people being studied are only very distantly related to each other, the DNA sections implicated by this kind of association testing are much smaller than those identified by family linkage analyses (Figure 17.1). This is because many more generations have passed since the participants last shared a common ancestor than is the case within a family. A polymorphism that is significantly associated with the trait or disorder in a GWAS implicates a region of on average ~300,000 letters around it, which might contain a single gene or, at most, five or six. Since genes take up only a small fraction of our genome, it may also happen that a significant polymorphism sits in a region with no gene at all in the neighborhood.

Despite the relatively small size of the associated sections, it has turned out to be remarkably difficult to identify which variant is truly responsible for the effect on the trait. In most monogenic disorders, the causal variant obviously disturbs the working of an encoded protein. In multifactorial traits, it is more likely that the relevant variant alters the levels or timing of production of an encoded protein in a subtle manner. We are not yet skilled at recognizing and characterizing such variants, although many genomic initiatives are underway that seek to improve the situation. Therefore, GWAS studies usually end with identifying the genes that are likely involved in the disease or trait of interest, without necessarily being able to zero in on the exact variants or mechanisms responsible. Because each sufficiently powerful study will identify multiple genes, subsequent analysis can assess whether these genes act together in a shared biological process, are translated into protein in similar tissues, and so on. Even when a GWAS does not identify individual polymorphisms that meet criteria for genome-wide significance, it can provide useful information on whether association signals are enriched for certain types of genes or biological processes. It has also become popular to study whether variants that were implicated for one trait are also seen for another, related trait, so that we gain more understanding of the similarities and differences between them (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013).

Exemplary Studies

To give concrete illustrations of the methods we have introduced, we discuss two exemplary language-related studies from the literature, one concerning a monogenic disorder, the other investigating a multifactorial quantitative trait.

Linkage Analysis Implicates FOXP2 in Speech and Language Deficits

In 1998, a linkage analysis was reported for a three-generation family (known in the literature as the KE family) in which a rare severe speech and language disorder was transmitted in a pattern that could be easily recognized as dominant inheritance (Fisher, Vargha-Khadem, Watkins, Monaco, & Pembrey, 1998). The disorder, which affected fifteen people (about half of the family members), involved childhood apraxia of speech, accompanied by wide-ranging impairments in spoken and written language skills, while other aspects of cognition were less affected (http://www.omim.org/entry/602081). The researchers used the traditional design of first genotyping polymorphisms to detect stretches of DNA shared by the affected people. A region on chromosome 7 was found to show highly significant linkage to the disorder. With our current knowledge of the genome, we can see that the linked region covers ~13 × 10⁶ DNA letters and contains ~40 protein-coding genes. At that time, however, the first entire human genome sequence had not yet been determined, and there was very limited knowledge about which genes lay in the interval of interest and what their sequences were.

The researchers pieced together as much information as they could from the fragments of data available at the time, and began sifting through the known genes using Sanger sequencing to search for causative variants in the affected KE members (Lai et al., 2000). Fortunately, they came across a child who was not related to the KE family and who had a very similar type of speech and language disorder. He had a chromosomal rearrangement disturbing the same genomic region that had been implicated by the KE family linkage analysis. In this rearrangement (a translocation) part of chromosome 7 had been swapped with part of chromosome 5, without any loss of genetic material. (Note: the rearrangement affected one of the two copies of each chromosome; the other copies of chromosomes 7 and 5 were normal.) The child's parents did not have any speech/language problems and did not carry this genomic rearrangement: It had arisen during the formation of the egg or sperm. When pieces of chromosomes are swapped in this way, each must have broken somewhere, and if such a breakpoint passes directly through a gene, that gene's function is disturbed. Because a similar section of chromosome 7 was implicated in both the KE family and the unrelated child, it was possible that their language problems were caused by malfunctioning of the same gene.

Knowledge and technology were much less advanced than they are now, so a complicated set of experiments was necessary to find the gene that was broken in the unrelated child and to study it subsequently in the KE family. In the end, a clearly causative DNA variant was found in the affected members of the family, disrupting a novel gene that is now called FOXP2 (Lai, Fisher, Hurst, Vargha-Khadem, & Monaco, 2001). Unaffected KE family members all had G/G at this position, whereas all those with the language problems had G/A; that is, they were heterozygous for an unusual A allele. The A allele alters the protein encoded by FOXP2: at one critical point of the protein, the amino acid arginine is replaced by a different one, histidine. Experiments in cultured cells and animal models have shown that this change prevents the protein from working properly (see Fisher & Scharff, 2009).
Screening studies have since identified different rare disruptive variants of the FOXP2 gene in other unrelated families and cases (reviewed by Graham & Fisher, 2015). Though disruptions of FOXP2 are rare, explaining only a small proportion of cases of speech and language disorder, the discovery of the gene led to a highly informative series of investigations concerning its roles in cells, neurons, brains, and behavior, as well as providing novel insights into important evolutionary questions. Such work is outside the scope of the current chapter; the interested reader is referred to, for example, Fisher and Vernes (2015) for extensive descriptions.

GWAS Uncovers Effects of ROBO2 on Early Expressive Vocabulary

The expression and understanding of language in children shows considerable inter-individual variation. Twin studies suggest that this is the result of both environmental and genetic factors (e.g., see Kovas et al., 2005). At young ages, the (as yet undetermined) environmental differences seem to explain the larger part of the variation, but clearly not all of it. St Pourcain and colleagues performed a genome-wide association scan (GWAS) of expressive vocabulary scores in unrelated children of European descent from the general population, analyzing early (15–18 months; “one-word stage”) and later (24–30 months; “two-word stage”) phases of language acquisition (St Pourcain et al., 2014). The study took advantage of large cohorts from the general population that had been followed longitudinally since birth, and had also been genotyped using genome-wide SNP-chips. The phenotypes (traits/trait combinations) of interest for this study were derived from communicative development inventories: parent report instruments that capture information about children's developing abilities in multiple domains of early language (see Chapter 3 of the present volume).

The GWAS was carried out first in a discovery cohort, analyzing trait-polymorphism association for >2 × 10⁶ polymorphisms. This involved 6,851 infants of mean age 15 months for the “one-word stage” and 6,299 toddlers of mean age 24 months for the “two-word stage.” For the trait measured at the younger age, the top association was seen for a polymorphism near the gene ROBO2, with a p-value of 9.5 × 10⁻⁷, while analysis of the trait measured at the older age pointed to a polymorphism within the gene CAMK4 that gave a p-value of 3.5 × 10⁻⁷. As noted above, the accepted threshold for significance in a GWAS is p < 5 × 10⁻⁸, meaning that these polymorphisms were only suggestively associated with the traits being studied.

The researchers went on to evaluate their findings further in three independent cohorts from the UK, the Netherlands, and Australia, again making use of available measures and genome-wide genotyping data already collected for these cohorts. In this follow-up, which included an additional 2,038 children for the early stage and 4,520 children for the late stage, the researchers did not perform a full GWAS, but focused on only the most interesting polymorphisms from their discovery cohort. In the combined data from the discovery and replication cohorts, the polymorphism near ROBO2 was found to be significantly associated with the trait measured at the younger age (p = 1.3 × 10⁻⁸). Around 35% of all the children in the cohort had at least one G allele; the others had only A alleles at this position. Having a G allele decreased expressive vocabulary scores at the “one-word stage” by 0.098 standard deviations, illustrating the very small effects typically found for multifactorial traits. Recall that this DNA variant is not necessarily itself responsible for the change in vocabulary scores; rather, it lies on a section of DNA that carries the putative contributing variant (see Figure 17.1, bottom). Curiously, the ROBO2 findings in this study seemed to be specific to the infant sample: no association was found in toddlers at the later stage, and there was no impact on later outcomes for speech, language, or reading skills.
ROBO2 is a convincing candidate for involvement in language, since the gene is known to be important for brain development (particularly in relation to guidance of axons) and prior studies found association of language/reading-related phenotypes with a very similar gene, known as ROBO1 (Mascheretti et al., 2014). Overall, this study shows that even with access to cohorts totaling more than 10,000 participants, identifying common genetic factors that influence a multifactorial trait can be challenging.
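To get a feel for how small such an effect is, the reported numbers can be converted into an approximate fraction of trait variance explained. The back-of-envelope Python sketch below assumes an additive model and Hardy-Weinberg proportions, neither of which is guaranteed in the actual data, so the result should be read as an order-of-magnitude illustration only.

```python
import math

carrier_fraction = 0.35  # children with at least one G allele (from the study)
beta = 0.098             # score shift, in SD units, per G allele (from the study)

# Under Hardy-Weinberg proportions: 1 - (1 - p)^2 = carrier_fraction,
# so the G-allele frequency is p = 1 - sqrt(1 - carrier_fraction).
p = 1 - math.sqrt(1 - carrier_fraction)

# For an additive biallelic variant and a trait standardized to unit
# variance, variance explained is approximately 2p(1 - p) * beta^2.
variance_explained = 2 * p * (1 - p) * beta ** 2
print(f"G-allele frequency ~ {p:.2f}")
print(f"variance explained ~ {variance_explained:.2%}")  # roughly 0.3%
```

Under these assumptions, the variant accounts for only about 0.3% of the variance in early expressive vocabulary, which makes vivid why such large cohorts were needed to detect it.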

Problems and Pitfalls

Research in genetics, as in any field, can encounter difficulties in collecting, analyzing, or interpreting the data. Here we discuss some commonly encountered problems; no one method is optimal for all research questions.

Laboratory-based genetic data can be prone to artefacts. For studies such as GWAS, but also linkage, which involve datasets of thousands to millions of data-points, manual inspection of each and every data-point is not an option. Thus, rigorous quality control steps and sanity checks are necessary at every stage of the analyses (a minimal example of one such check is sketched at the end of this discussion). Results that yield significant evidence of linkage or association should at least be checked by visualization of the overall data patterns (Figures 17.2–17.4), and/or by testing them again with a second independent technique. However, this cannot guard against false-negative results. Note also that the statistical analyses used in gene mapping can only tell us about probabilities in relation to the null hypothesis of a chance finding, rather than giving absolute proof of the involvement of a variant in the trait under study. The steps we describe in this chapter should be seen as starting points for generating new hypotheses and novel questions, leading to experiments that can further evaluate the contributions of specific genes and genetic variants to language phenotypes.

Studies of monogenic disorders depend on tracking down appropriate families in which developmental language deficits affect large numbers of relatives and are inherited in a simple manner. Suitable families tend to be rare and difficult to find, so there is some serendipity involved. Even the most carefully conducted linkage screens may fail to find any significant results, either because the family is too small to yield adequate power, because the underlying genetic architecture is more complex (i.e., not actually monogenic after all), or because of misdiagnosis of some key family member(s). Even when significant linkage has been found in a family, it might not lead to successful identification of a causal variant, despite extensive searching. Often, the identification of independent families/cases implicating the same gene is crucial for pinpointing causal variants, as already shown by our description of the discovery of FOXP2. Moreover, the value of experimental evidence from cell cultures, animal models, and other approaches supporting a functional effect cannot be overestimated.

We discussed how next-generation sequencing (NGS) offers an alternative to traditional methods in family studies. The advantage of going immediately to whole genome or exome sequencing is that it is more direct and may lead to faster answers, quickly highlighting potential candidate causal variants. Also, missing a few individuals in the family tree is probably less problematic than for traditional linkage. On the other hand, WGS/WES approaches may miss a true causal variant that does not alter a protein but instead lies in some regulatory region of the DNA, and large deletions (i.e., missing stretches of DNA) are hard to detect in NGS results. Notably, the traditional method can pinpoint a linked region of the genome regardless of the type of variant that is causing the disorder.
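To make the quality control point above concrete: one standard genotype-level sanity check is a test for departure from Hardy-Weinberg equilibrium, since a marker whose genotype counts are wildly inconsistent with its own allele frequency is more often a genotyping artefact than a biological signal. Below is a minimal Python version of this check; the function name and counts are our own hypothetical constructions, and in practice this is one of many filters (call rate, minor allele frequency, and so on) applied by standard toolkits such as PLINK.

```python
from scipy import stats

def hwe_check(n_AA: int, n_AB: int, n_BB: int, alpha: float = 1e-6):
    """Chi-square test for Hardy-Weinberg equilibrium at one polymorphism.

    Gross departures from HWE (in controls) are a common red flag for
    genotyping artefacts, so such markers are usually excluded during QC.
    """
    n = n_AA + n_AB + n_BB
    p = (2 * n_AA + n_AB) / (2 * n)          # frequency of the A allele
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) * (1 - p)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip([n_AA, n_AB, n_BB], expected))
    p_value = stats.chi2.sf(chi2, df=1)      # 1 df: 3 classes, 1 estimated parameter
    return p_value, p_value < alpha          # (p-value, fails QC?)

# Hypothetical marker: far too few heterozygotes for its allele frequency,
# a pattern more often produced by assay failure than by biology.
print(hwe_check(n_AA=1500, n_AB=200, n_BB=300))
```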
When it comes to genome-wide association studies and multifactorial traits, one of the main limitations is that the effects of individual variants are so small that very large numbers of participants are needed to ensure adequate statistical power (typically 10,000 to >100,000 people; a rough power calculation is sketched at the end of this section). This in turn might necessitate the establishment of multi-center consortia, involving collaborations between multiple groups or even countries. Clearly, consensus must be reached on how the trait is measured, and care must be taken that all centers use the same definitions and inclusion criteria. For the language sciences there is the added complication that data may have to be pooled across diverse languages with distinct properties. Because of the large cohort sizes involved, traits that can be measured reliably without spending too much extra time and money per participant are the most suitable for GWAS. In the coming years, the field could be transformed by the development of suitable web- or app-based batteries of tests for reliably capturing inter-individual variation in language skills. Existing study cohorts from the general population that have already been genotyped with genome-wide chips could then be targeted for “phenotyping from a distance,” making very large-scale GWAS of language traits feasible.

If the GWAS design involves comparing cases of language disorder with healthy controls, the collection of controls may require extra care. As in all such studies, to adequately compare groups of individuals it is necessary to match age, gender, and so on. In this case it is also important that the case and control groups are genetically matched, since different ethnic groups sometimes have different allele frequencies for a subset of polymorphisms. Likewise, when we are studying a quantitative trait, the whole group must be as genetically homogeneous as possible; mixing people from different ethnicities will invalidate the standard study design.

An alternative to GWAS that can be run with smaller cohorts is to test fewer polymorphisms. We could choose only a subset of genes that we are particularly interested in, and test only the polymorphisms that are in and around these genes. This would keep the statistical problem of multiple testing within bounds.

Finally, we note that since systematic genome-wide screens avoid choosing candidate genes, they may seem less elegant than formulating a prior hypothesis based on available biological knowledge. However, for many genes we still know little about their precise functions, and the underlying biology of language-related skills remains very poorly understood. Time and again in human biology, it has turned out that our original ideas about the mechanistic underpinnings of a trait or disorder were off the mark, and genetic findings have radically changed our view of these underpinnings, providing important new entry points into the key processes. The rapid major advances in molecular technologies of recent years make it possible to apply systematic screening approaches to more and more questions, even for unraveling the ultimate mysteries of our unique capacities for speech and language.
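As promised above, the sample sizes quoted for GWAS can be motivated with a rough power calculation. The Python sketch below uses the standard approximation in which the 1-df association test statistic for a variant explaining a fraction q of trait variance in n unrelated people is non-central chi-square with non-centrality parameter roughly n × q; the 0.1% effect size is hypothetical but of the order seen for multifactorial traits like these.

```python
from scipy import stats

def gwas_power(n: int, variance_explained: float, alpha: float = 5e-8) -> float:
    """Approximate power of a 1-df association test at significance alpha."""
    ncp = n * variance_explained                 # non-centrality parameter
    threshold = stats.chi2.isf(alpha, df=1)      # critical value under the null
    return stats.ncx2.sf(threshold, df=1, nc=ncp)

# Power to detect a variant explaining 0.1% of trait variance at the
# genome-wide significance threshold, for a range of cohort sizes:
for n in (5_000, 20_000, 50_000, 100_000):
    print(f"n = {n:>7,}: power = {gwas_power(n, 0.001):.2f}")
```

Under these assumptions, power is essentially zero with 5,000 participants, modest with 20,000, and approaches certainty only towards 100,000, which is why consortium-scale collaboration has become the norm.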

Key Terms

Allele  Alternative DNA form at a particular position in the genome. For example, a polymorphism may consist of the alleles C and G.

Amino acid  One of the organic compounds that make up proteins. In human biology, 20 different amino acids are used to build proteins.

Chromosome  A structure found in living cells, consisting of a single molecule of DNA that encodes genes and much more. Humans have 23 pairs of them.

Crossing-over  The exchange of stretches of DNA between two chromosomes from the same pair during egg/sperm formation.

Dominant  Inheritance pattern in which alteration of one copy of a gene is enough to change the trait (see recessive).

Exome  Subset of the genome, encompassing all DNA that codes for protein; in total, ~1% of the human genome.

Gene  Segment of DNA that codes for a particular protein.

Genome  All the genetic material contained in your 23 pairs of chromosomes, including a total of more than 20,000 protein-coding genes.

Genotype  Combination of two DNA letters at a particular position in the genome (with exceptions on the sex chromosomes). Occasionally also used to denote all DNA variants in the whole genome.

GWAS  Genome-wide association scan (for example, a case-control study).

Linkage analysis  Family-based analysis to identify a DNA variant whose inheritance pattern matches that of the trait of interest.

Molecular genetics  A subdivision of genetics research concerned with the structure and function of genes at the molecular (i.e., DNA/RNA) level.

Monogenic  A disorder or a major trait difference that is caused by disruption of a single gene.

Multifactorial  Describes a trait affected by variations in many genes (plus environmental factors). Both quantitative traits (e.g., blood pressure) and dichotomous traits (e.g., rheumatoid arthritis; presence/absence of a disease) can have such a genetic background.

Next-generation sequencing (NGS)  Also known as massively parallel sequencing. A group of relatively new methods (developed in the late 1990s) that enable sequencing of large amounts of DNA per person.

Phenotype  An individual's trait or combination of traits, for example, eye color or height. Sometimes used to refer to only the trait(s) under study, sometimes to all of an individual's traits.

Polymorphism  Position in the genome where a significant proportion of people in a population carry different DNA letters.

Protein  Large molecule composed of one or more chains of amino acids in a specific order, determined by the base sequence of nucleotides in the DNA coding for the protein.

Recessive  Inheritance pattern in which DNA alterations in both the paternal and the maternal copies of the same gene are needed to change the trait (see dominant).

Translocation  A chromosome abnormality caused by rearrangement of parts of chromosomes; for example, a piece of one chromosome breaks off and gets attached to a different chromosome.

Whole-exome sequencing (WES)  Approach in which NGS is performed only on the protein-coding parts of the genome.

Whole-genome sequencing (WGS)  Approach in which NGS is performed on the whole genome (~3 billion DNA letters per person).

References

Bishop, D. V. M. (2001). Genetic and environmental risks for specific language impairment in children. Philosophical Transactions of the Royal Society B: Biological Sciences, 356, 369–380.

Bufe, B., Breslin, P. A., Kuhn, C., Reed, D. R., Tharp, C. D., Slack, J. P., … Meyerhof, W. (2005). The molecular basis of individual differences in phenylthiocarbamide and propylthiouracil bitterness perception. Current Biology, 15, 322–327.

Carrion-Castillo, A., Franke, B., & Fisher, S. E. (2013). Molecular genetics of dyslexia: An overview. Dyslexia, 19, 214–240. doi: 10.1002/dys.1464.

Cross-Disorder Group of the Psychiatric Genomics Consortium. (2013). Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nature Genetics, 45, 984–994.

Fisher, S. E., & DeFries, J. C. (2002). Developmental dyslexia: Genetic dissection of a complex cognitive trait. Nature Reviews Neuroscience, 3, 767–780. doi: 10.1038/nrn936.

Fisher, S. E., Lai, C. S., & Monaco, A. P. (2003). Deciphering the genetic basis of speech and language disorders. Annual Review of Neuroscience, 26, 57–80. doi: 10.1146/annurev.neuro.26.041002.131144.

Fisher, S. E., & Scharff, C. (2009). FOXP2 as a molecular window into speech and language. Trends in Genetics, 25, 166–177. doi: 10.1016/j.tig.2009.03.002.

Fisher, S. E., Vargha-Khadem, F., Watkins, K. E., Monaco, A. P., & Pembrey, M. E. (1998). Localisation of a gene implicated in a severe speech and language disorder. Nature Genetics, 18, 168–170. doi: 10.1038/ng0298-168.

Fisher, S. E., & Vernes, S. C. (2015). Genetics and the language sciences. Annual Review of Linguistics, 1, 289–310. doi: 10.1146/annurev-linguist-030514-125024.

Gialluisi, A., Newbury, D. F., Wilcutt, E. G., Olson, R. K., DeFries, J. C., Brandler, W. M., … Fisher, S. E. (2014). Genome-wide screening for DNA variants associated with reading and language traits. Genes, Brain and Behavior, 13, 686–701. doi: 10.1111/gbb.12158.

Goodwin, S., McPherson, J. D., & McCombie, W. R. (2016). Coming of age: Ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17, 333–351.

Graham, S. A., & Fisher, S. E. (2015). Understanding language from a genomic perspective. Annual Review of Genetics, 49, 131–160. doi: 10.1146/annurev-genet-120213-092236.

Kovas, Y., Hayiou-Thomas, M. E., Oliver, B., Dale, P. S., Bishop, D. V., & Plomin, R. (2005). Genetic influences in different aspects of language development: The etiology of language skills in 4.5-year-old twins. Child Development, 76, 632–651.

Lai, C. S. L., Fisher, S. E., Hurst, J. A., Levy, E. R., Hodgson, S., Fox, M., … Monaco, A. P. (2000). The SPCH1 region on human 7q31: Genomic characterization of the critical interval and localization of translocations associated with speech and language disorder. American Journal of Human Genetics, 67, 357–368. doi: 10.1086/303011.

Lai, C. S. L., Fisher, S. E., Hurst, J. A., Vargha-Khadem, F., & Monaco, A. P. (2001). A forkhead-domain gene is mutated in a severe speech and language disorder. Nature, 413, 519–523. doi: 10.1038/35097076.

Luciano, M., Evans, D. M., Hansell, N. K., Medland, S. E., Montgomery, G. W., Martin, N. G., … Bates, T. C. (2013). A genome-wide association study for reading and language abilities in two population cohorts. Genes, Brain and Behavior, 12, 645–652. doi: 10.1111/gbb.12053.

Mascheretti, S., Riva, V., Giorda, R., Beri, S., Lanzoni, L. F., Cellino, M. R., & Marino, C. (2014). KIAA0319 and ROBO1: Evidence on association with reading and pleiotropic effects on language and mathematics abilities in developmental dyslexia. Journal of Human Genetics, 59, 189–197.

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P., & Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nature Reviews Genetics, 9, 356–369. doi: 10.1038/nrg2344.

Metzker, M. L. (2010). Sequencing technologies – the next generation. Nature Reviews Genetics, 11, 31–46. doi: 10.1038/nrg2626.

St Pourcain, B., Cents, R. A. M., Whitehouse, A. J. O., Haworth, C. M. A., Davis, O. S. P., O'Reilly, P. F., … Davey Smith, G. (2014). Common variation near ROBO2 is associated with expressive vocabulary in infancy. Nature Communications, 5, 4831. doi: 10.1038/ncomms5831.

The 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526, 68–74. doi: 10.1038/nature15393.

Veltman, J. A., & Brunner, H. G. (2012). De novo mutations in human genetic disease. Nature Reviews Genetics, 13(8), 565–575. doi: 10.1038/nrg3241.

Further Reading and Resources

Fisher, S. E. (2016). A molecular genetic perspective on speech and language. In G. Hickok & S. Small (Eds.), Neurobiology of language (pp. 13–24). Amsterdam: Elsevier. doi: 10.1016/B978-0-12-407794-2.00002-X.

Fisher, S. E., & Vernes, S. C. (2015). Genetics and the language sciences. Annual Review of Linguistics, 1, 289–310. doi: 10.1146/annurev-linguist-030514-125024.

Neale, B. M., Ferreira, M. A. R., & Medland, S. E. (Eds.). (2008). Statistical genetics: Gene mapping through linkage and association. Abingdon: Taylor & Francis.

Strachan, T., & Read, A. (2010). Human molecular genetics (4th ed.). New York, NY: Garland Science.

339, 340f, 341, Inventory (MCDI), 23, 51, 52, 343, 369f 55, 221 genotyping, 339, 340f, 341 McConkie (moving window) paradigm, GWAS, identifying common effects with, eye‐movement tracking, 68, 345–346 72–73, 85 linkage analysis implicating FOXP2 in magnetic resonance imaging (MRI), 289, speech and language deficits, 304, 326 347–348, 349 conventional, 290–291 linkage in large families, 343–345 functional see Functional Magnetic monogenic traits, xviii, 343–345 Resonance Imaging (fMRI) multifactorial traits, xviii, 334, lesion studies, 313, 315 345–346 MRI‐based diffusion tractography, phenotype collection, 341–342 288, 289 problems and pitfalls, 349–350 pulse sequence, 292 saliva sampling, 336 T1‐weighted images, 290, 292, 293f, sequencing, 337–339 294, 305 monogenic diseases, 334 T2‐weighted images, 292, 293f, 294, 305 monogenic traits, 343–345 magnetoencephalography (MEG), xviii, monomorphemic words, 41, 44 261, 262 Monte Carlo modeling, 7 main clause (MC), 133, 136, 139 mood, and language processing, 180 masked priming paradigm, xviii, 115, 125 morphemes, 41 Matlab, 181, 215–216, 282 motion‐capture (mo‐cap), input technology), Max Plank Institute for Psycholinguistics, 176–177 Nijmegen (The Netherlands), 183 moving window (McConkie) paradigm, mean diffusivity (MD), 295–296, 304 eye‐moving tracking, 68, 72–73, 85 mean length of utterances (MLU), 49 MRI see magnetic resonance imaging (MRI) Mechanical Turk (MTurk) see Amazon multicollinearity, 51 Mechanical Turk (MTurk) multifactorial traits, xviii, 334, 345–346 medial prefrontal cortex, 292 multiparametric methods, 297 MEG see magnetoencephalography (MEG) multiple comparisons correction megastudies, 241–242, 243 procedure, 255 Microsoft Kinect, 177 multiple comparisons problem (MCP), 275 middle temporal gyrus (MTG), 319 multivariate statistics, 68, 83 Mismatch Negativity (MMN), 250, 252 multi‐voxel pattern analysis (MVPA), 275 MNI space, 274 museum settings, researching MNI template, 326 best practice, 197–198 mobile trackers, 71 disadvantages, problems and pitfalls, model‐based fMRI, 275 198–199 Index 365

exemplary studies, 199–200 real world settings, conducting studies in, research rationale, 195–196 200–203 myelin, 294, 296, 304 research samples, requirements for, 191 non‐parametric mapping (NPM), 291 naming task, eye‐movement tracking, 83–84 novelty preference, 14, 15, 35 natural language processing (NLP), 209, 233, 242, 243 oculomotor programming, 69, 72, 84 naturalistic observation, 50 1000 Genomes Project Consortium, 333 naturally occurring interaction, 170 online studies neural network simulators, 215 best practice, 197–198 neural networks, 225 disadvantages, problems and pitfalls, neuroimaging, structural, xviii, 288–309 198–199 advanced diffusion models, 296–297 exemplary studies, 199–200 apparatus and data, 292, 293f, 294 rationale, 195–196 assumptions/rationale, 289–292 optical eye trackers, 93 atlasing, 298, 299f optical sensors, 93 based on conventional MRI, 290–291 other‐initiation of repair (OIR), 160, 167, 168 computerized tomography (CT), 289 other‐initiated self‐repair, 154 data collection and analysis, 295–300 output devices, 177 diffusion tensor imaging, 295–296 OxLearn, 215 diffusion‐weighted imaging (DWI), 291–292 Parallel Distributed Processing (PDP), 209, evaluation, 301, 303 215, 225 exemplary study, 300–301, 302f parent report, vocabulary assessment by, magnetic resonance imaging (MRI), 289 50–57 tract specific measurements, 298, 300 apparatus/instruments, 52–53 tractography reconstructions, xviii, assumptions/rationale, 50–51 297–298 challenges/related issues, 56–57 neuroimaging data acquisition, lesion concepts/description, 62 studies, 315 data, nature of, 54–55 Neuroimaging Informatics Technology data collection, 53–54 Initiative (NIFTI), 4D, 295 exemplary study, 55–56 neurotransmitters, 210 motivation to use, in the US, 50 newborns, 3 parsing, 233, 243 next‐generation sequencing (NGS), 336, syntactic parsing, xvii, 230 337, 338, 339f, 344, 349 part‐of‐speech (PoS), 233 next‐turn proof procedure, 156, 170 Peabody Picture Vocabulary Test (PPVT), 57 Ngrams, 233 perceptual cues, 26 nominals, 62 perceptual span, eye‐movement tracking, 68, non‐coding DNA, 331 69, 72, 73, 83, 85 non‐invasive brain stimulation (NIBS), 325 perfusion‐weighted imaging, 326 non‐laboratory settings, 190–207 permutation procedures, 255 best practice, 192–193, 197–198, 201 p‐hacking, 102 cross‐cultural field studies, 192–195 phenotypes, 335, 345 everyday language use, of ordinary phenotype collection, 341–342 people, 191 phenylthiocarbamide (PTC), 334 exemplary studies, 194–195, 199–200, phonemes 202–203 awareness, 342 motivations for being out of the lab, 191 connectionist algorithms, 212 online studies and museums, 195–200 habituation techniques, 2, 13 rationale, research, 192, 195–196, monitoring, 125 200–201 speech and spoken language, 96 366 Index phonetic development, infants, 4 masked, xviii, 115, 125 phonological development, infants, 4, 12 near‐threshold, 82 phonological short‐term memory, 342 parafoveal fast‐priming, 74 phonology, input and output, 220 production, 142–143, 146, 148 phrase‐by‐phrase self‐paced reading, 136 semantic, 199 picture naming, 113–114, 120 syntactic, 148 latency, 117, 122 prior probability of the hypothesis (p(H)), picture description paradigms, 140, 141 210, 212 picture‐matching tasks, 140–141 probabilistic approach, computational structural priming, 140 modeling, 209–210, 215, 218–220 word priming and interference paradigms, algorithms, 211–212 113–114, 117, 120, 122 production priming, 142–143, 146, 148 Picture Vocabulary Test, 59 production studies 
picture‐word interference paradigm, 125 electrophysiological methods, 252–253 PLP see Preferential Looking Paradigm hemodynamic methods, 271 Without Language (PLP) structural priming, 135 point‐of‐disambiguation (POD), visual visual world paradigm (VWP), 92, 98, world paradigm, 92, 107 99, 102 polymerase chain reaction (PCR), 337 profile analysis, 58 polymorphisms, 333, 339 proportion‐of‐fixations plots, visual world position, conversation analysis, 170 paradigm, 100, 101, 102, 104f, 107 Positron Emission Tomography (PET), Proteus Effect, 179 271, 277 proton density, 292 posterior probability (p(H/E)), 210, prototypes, 43 211–212 pseudonormalization, 294 post‐mortem dissections, 292 psychophysiological responses, 2 Power Glove (Nintendo), 176 pulse sequences, 292, 304 Praat software, 79–80, 116, 181 pupil‐corneal reflection vector, 74 predicates, 41, 62 Purkinje images, 71 preferential looking, 13 Python programming language, 178, 179 Preferential Looking Paradigm Without Language (PLP), 21 quantitative research methods, conversation exemplary study, 31–32 analysis (CA), 167–168 and IPLP studies, 30 Quick Interactive Language Screener method/data analysis, 30–31 (QUILS), 35, 59 salience trial, 30 test trials, 30 radial diffusivity (RD), 296, 304 Prepositional Object (PO), 133, 143, 145 radiofrequency (RF), 292 presence, virtual reality, 175, 179, 187 rapid automatized naming, 342 Presentation®software, 115 rapid serial visual presentation (RSVP), 84, pre‐supplementary motor cortex, 292 85, 138 preview benefit, Rayner boundary, 74 Rayner (boundary) paradigm, eye‐ primary progressive aphasia (PPA), 291, movement tracking, 68, 73–74, 85 299f, 312 reaction times (RTs), adult readers, 214 priming reading see also structural priming; word priming see also eye movements; eye‐movement and interference paradigms tracking, during reading comprehension, 133, 146, 148 eye‐movement tracking, 68–88 cross‐modal, xviii, 256 apparatus, 70–71 data loss, 146 assumptions/rationale, 68–70 definition of “prime,” 125, 148 data collection, 74 inhibition, involving, 148n1 exemplary study, 79–82 Index 367

experimental paradigms see under saccade detection, 75, 85 eye‐movement tracking, during velocity‐based, 355f reading saccades stimuli, 72 see also eye movements; eye‐movement eye‐voice span (EVS), 79–82 tracking, during reading co‐registration of eye and voice, 79 eye‐movement tracking, during reading, word boundaries, 79–80 69, 70, 71, 75, 79, 85 natural, 69, 70, 84 post‐saccadic wobble (glissade), 75 perceptual span, 68, 69, 72, 73, 83 visual world paradigm (VWP), 102, 103, phrase‐by‐phrase self‐paced reading, 136 106, 107 reaction times (RTs), adult readers, 214 salience reading aloud (oral reading), 68, 79–82, feature‐based, 97 113–114 perceptual, 28 second‐pass, 76 trials, 23, 27, 30 self‐paced, 84, 85, 136 visual preference methods, 23, 27, 30 silent, 79 visual world paradigm (VWP), 96, 97 whole‐sentence reading, 136 Same and Switch trials, habituation, 11 real world settings, conducting studies in, Sanger sequencing techniques, 337, 338f, 200–203 344, 347 best practice, 201 scaling problem, 200 disadvantages, problems and pitfalls, scan‐path analysis, 79 201–202 scripted interactions, 154 exemplary studies, 202–203 search coils, eye‐movement tracking, 70 rationale, 200–201 second pair part (SPP), conversation Receptive and Expressive One Word analysis, 153 Vocabulary Test (ROWPVT/ segmentation mask, 290, 304 EOWPVT), 57 selection, in genetics, 333–334 recessive monogenic inheritance, 345 self‐initiated self‐repair, 153, 159 recognition memory tasks, 140 self‐organizing map (SOM), 213, reduced relative clause (RR), 133, 214, 226 136, 139 DevLex‐II model, 220 reference electrodes, 251 SOMToolbox, MatLab, 215 regions of interest (ROIs), 136 self‐paced reading, 84, 85, 136 hemodynamic methods, 277, 281, 283 semantic analysis, xvii structural neuroimaging, 290, 300 semantic priming, 199 visual world paradigm (VWP), 92, 93, semantic structure, 192 99, 101 semantic vectors, 235–237, 243 repair practices, conversation analysis, sentences 153–154, 159 CDI:Words & Sentences (CDI:WS), 52, other‐initiation of repair (OIR), 160, 55, 56 167, 168 processing difficulty, 106 self‐initiated self‐repair, 153, 159 sentence recall paradigms, 141, 142f repetition time (TR), 292, 304 sequencing, 338 response‐contingent analyses, 94 small clause sentences, 143, resting state functional magnetic resonance 144t, 145t imaging (rsfMRI), 325 as stimuli, xvii ROBO1 gene, 341 whole‐sentence reading, 136 ROBO2 gene, uncovering effects in early sequence organization, conversation expressive vocabulary, 348–349 analysis, 153 ROIs see regions of interest (ROIs) sequencing, genetic methods, 337–339 RSVP see rapid serial visual presentation Simple Recurrent Network (SRN), 213, (RSVP) 214, 226 368 Index single nucleotide polymorphisms (SNPs), inferential statistics, 78–79 339, 340f linear mixed models (LMMs), 78–82 SNP chips, 341, 343, 344, 345, 348 multivariate statistics, 68, 83 single‐fixation duration (SFD), 80–81 nonlinear mixed models, 78 Singular Value Decomposition (SVD), probabilistic approach see probabilistic 212, 236 approach, computational modeling slice‐timing correction, 274 quantile regression analysis, 78 small clause sentences, 143, 144t, 145t scan‐path analysis, 79 Snap paradigm, 141 survival analyses, 78–79 SNPs see single nucleotide polymorphisms visual preference methods, 22, 34 (SNPs) visual world paradigm (VWP), 101–103 SOM see self‐organizing map (SOM) stimulus timing, word priming, 119–120 spatial resolution, xviii, 93, 254–255, 279, stimulus‐onset asynchrony (SOA), 114, 284, 304 120, 125 good 
quality, 268, 277, 284, 315 streamlines, 296, 297, 299f, 300, 305 high, 266, 274 stroke, 294, 326 low, 92, 315 structural neuroimaging see neuroimaging, neuroimaging, 289, 290, 293f, 301 structural spatial smoothing, 274, 290, 304 structural priming, xviii, 130–150 specific language impairment, 342 apparatus and test tools, 132 spectrogram, 259 assumptions/rationale, 131–132 speech and spoken language Baselines, 143 see also conversation analysis (CA) Blood Level Oxygenation Dependent accommodation effect, 183 Signal (BOLD), 132, 139, 143 acoustic cues, 95, 96 concepts/description, 130–131 childhood apraxia of speech (CAS), data collection and analysis, 136–143 341, 347 data types, 135 co‐articulation, 96 Double Object (DO), 133, 136, 137, errors, 231 143, 145 FOXP2 gene, in speech and language event‐related potentials (ERP), 132, 135, deficits, 347–348, 349 138, 139 lesion studies, 311–312 exemplary study, 143–145 onset latencies, 115, 116 experimental design, 147 rate and pitch, 182 experimental stimuli, 135 segmentation of speech stimuli, 98 Functional Magnetic Resonance Imaging source of speech, 93 (fMRI), 132, 135, 138, 139, 143 visual world paradigm (VWP), 93, 95–96 of language comprehension, 136–139 speech‐recognition software, 46–47 of language production, 139–143 Standard Average European (SAE), 192, 204 nature of stimuli and data, 132–135 standard space, 274, 290, 301, 304 non‐behavioral responses, 138–139, 143 standardized assessment manuals, 60 overt responses standardized template, 315–316, 317f, 363f structure choice, 137–138, 139–142 standardized tests, 234, 341 temporal measures, 136–137, lesion studies, 313, 314 142–143 vocabulary assessment, 49, 57–61, 62 picture description paradigms, 140 statistical analysis Prepositional Object (PO), 133, 143, 145 analysis of variance see analysis of prime/target expressions, 133, 134t, 147 variance (ANOVA) priming effects, 131 co‐occurrence statistics, 210, 215, problems/pitfalls, 145–147 216, 218 response latencies, 143 Functional Magnetic Resonance Imaging small clause sentences, 143, 144t, 145t (fMRI), 275 structure choice, 137–138, 139–142 Index 369

syntax, 131, 132, 133, 134t, 146 time series analysis, 282 targets, 133, 134t, 141–142, 148 time‐frequency analysis, 259 temporal measures, 136–137, 142–143 time‐locking eye movements, 95–96 SUBTLEX frequencies, 235, 239 Total Conceptual Vocabulary, 53, 62 SuperCoder (freeware), 8 touch screen technology, 59 superconducting quantum interference tractography reconstructions, xviii, devices (SQUIDs), 261 297–298, 305 superior temporal gyrus (STG), 319 transcranial direct current stimulation superordinates, 43 (tDCS), 325, 326 supervised learning, 226 transcranial magnetic stimulation (TMS), support vector clustering (SVC), 219 324, 325, 326 surface electrodes, eye‐movement transcription tracking, 70 conventions, 155, 157–158 surface‐based morphometry (SBM), conversation analysis (CA), 155 290, 291 language sampling, 48, 49–50 surveys, 197 transition‐relevance place (TRP), 153 Switch procedure, 5, 10, 15 translocation, 347, 351 synapses, 210, 211 Truth‐Value Judgment task, 137, 138 syntactic parsing, xvii, 230 turn design, conversation analysis, syntactic priming, 148 153, 164 syntactic structure, 131–132 turn‐constructional units (TCUs), 153 Syntactic Structures (Chomsky), 19 turn‐initial particles, 161 syntax, xvii, 53, 96, 161, 193, 215, 249 turn‐taking procedures, 153 interrogative, 161 type‐token ratio of words (TTR), 49 structural priming, 131, 132, 133, Tzeltal language, Mexico, 181 134t, 146 Systematic Analysis of Language Transcripts uncanny valley, virtual reality, 186, 187 (sALT), 48 unsupervised learning, 226

T1‐weighted images, 290, 292, 293f, validity 294, 305 ecological, 168, 203 T2‐weighted images, 292, 293f, 294, 305 external, 190, 204 tagging, 233, 243 video‐based eye tracking, 70, 71, 85 talk video‐recording technology/video cameras, see also conversation analysis (CA); 4, 93, 154 speech and spoken language vocabulary assessment, 45, 46 overlapping, 152 virtual reality (VR), 174–189 as vehicle for action, 153 advantages and disadvantages, TalkBank website, 46 185–186 targeted language games, 105, 106 agents, 178, 183–187 targets apparatus, 176–179 prime/target expressions, in structural assumptions/rationale, 174–175 priming, 133, 134t, 147 avatars, 179–180, 187 structural priming, 133, 134t, data, nature of, 181 141–142, 148 data collection and analysis, 181 visual world paradigm (VWP), 92, 107 emotional realism, 185 word priming and interference paradigms, environment, manipulating parameters of, 118, 120, 125 180–181 TAS2R38 receptor, 334 evaluation, 186–187 temporal sequence learning networks, 220 exemplary studies, 182–184 Tests of English as a Foreign Language Head Mounted Displays (HMDs), 177, (TOEFL), 236, 237 185, 186, 187 370 Index virtual reality (cont’d ) design and interpretation, general immersive virtual reality (iVR), 174–176, considerations affecting, 95–97 182–187 disadvantages and limitations, 106 input devices (motion capture), 176–177 distractor, 98, 107 integrating input and output, 178–179 example study, 103–105 markers, active and passive, 176 experimental paradigms, 92 moving through the virtual world, 178 eye movements in natural tasks, 96–97 output devices, 177 language, 93 participant pool, expanding, 185 linguistic stimuli, 98 presence, 175, 179, 187 linking hypothesis, 91–92 reproducibility of complex environments, logic, 91 185–186 look‐and‐listen studies, 90, 94, 95, 107 stimuli and data, 179–181 point‐of‐disambiguation (POD), 92, 107 uncanny valley, 186, 187 production studies, 92, 98, 99, 102 virtual people, manipulating parameters speech and spoken language, 95–96 of, 179–180 statistical analysis, 101–103 Vizard VR software, 178–179, 180, 183 stimuli, nature, 97–99 VIRTUO/A characters, virtual reality, and structural priming, 136, 137 183, 184 targets, 92, 107 visual fixation time, 22–23, 27 task‐based, 94–95, 107 visual habitation, 2 terminology, 92 visual preference methods, xvii, 18–39 timing, 98–99 assumptions/rationale, 20–21 variations across experiments, 93–95 concepts/description, 18–19 visual world, 94, 97–98 development, 19–20 visualization, 100–101 Headturn Preference Procedure (HPP), workplace characteristics, 94 32–34 visualization, 100–101 Interactive Intermodal Preferential Vizard VR software, 178–179, 180, 183 Looking Paradigm (IIPLP), 25–28 VLSM see voxel‐based lesion symptom Intermodal Preferential Looking Paradigm mapping (VLSM) (IPLP), 21–25 vocabulary Looking‐While‐Listening Paradigm as antecedent, 41–42 (LWL), 28–32 assessment, 40–66 Preferential Looking Paradigm Without apparatus, 46–47, 52–53, 59 Language (PLP), 30–32 assumptions/rationale, 44–46, 50–51, purpose, 18–19 58–59 visual fixation time, 22–23, 27 challenges/problems, 49–50, 56–57, 61 visual stimuli, use, of habituation core issues, 43–44 techniques, 5, 13 data, nature of, 47, 54–55 visual world paradigm (VWP), xviii, 89–110 data collection, 47–49, 53–54, 59–60 see also eye movements; eye‐movement direct, 58–61 tracking, during reading exemplary studies, 49, 55–56, 60 acoustic cues, 95, 96 language sampling, 44–50 advantages and common applications, methods, 40 105–106 
by parent report, 50–57 apparatus, 92–93 standardized tests, 49, 57–61, 62 assumptions, 92 as consequent, 42 coding, 99 meaning, 42–43 competitors, 92, 107 as object of study in its own right, 41 “hidden competitor” designs, 106 purposes of studying/assessing, 41–42 concepts/description, 89, 89–91, 107 Total Conceptual Vocabulary, 53, 62 data collection and analysis, 99–103 word knowledge, managing, 42–43 Index 371 vocal cord vibrations, 3 phoneme monitoring, 125 voice‐onset time (VOT) competitors, 92 picture naming, 113–114, 117, 120 voxel‐based lesion analyses, 317–318 picture‐word interference paradigm, 125 voxel‐based lesion symptom mapping presentation time of primes, 119 (VLSM), 291, 313, 317–318, 326 priming effects, 117 analyses, 323 properties of primes/targets and prime‐ correlates of auditory word recognition, target combinations, 116–119 321, 322f prototypical priming, 112 summary, 323 related and unrelated primes, 118 voxel‐based morphometry (VBM), 290, 291, response latencies, 115, 116, 117, 324, 326 121, 122 VWP see visual world paradigm (VWP) stimulus timing, 119–120 stimulus‐onset asynchrony (SOA), 114, WAV audio files, 181 120, 125 Wernicke’s area, 292, 319 targets, 118, 120, 125 Western Aphasia Battery (WAB‐R), 300, 319 task, 121 white matter, 290, 291, 292, 302f, 311 word‐association studies, 242 whole exome sequencing (WES), Wordbank, 55 338–339, 350 Wordnet, 242, 243 whole genome sequencing (WGS), 338, 350 word‐object associative skills, 5, 10, 11 whole‐sentence reading, 136 word‐referent links, 1 Wikipedia, 235, 238 words word boundaries, eye‐movement tracking, categories of, 43, 44 73, 79–80 closed‐class, 41, 62 word frequency, 51, 234–235, 243 color, 42 word knowledge, managing, 42–43 common versus rare, 44 word priming and interference paradigms, compounds, 44 xvii, 111–129 connotations, 43 see also priming cross‐situational word learning apparatus, 115–116 models, 225 associatively related primes, 118 denotation, 43 assumptions/rationale, 112 derived, 44 blocking paradigms, 120, 124 dictionary entry, 43, 44 category‐congruent and incongruent monomorphemic, 41, 44 primes, 116, 118 multiple meanings, 43 concepts/description, 111–112 recognition see word recognition data analysis, 121–123 target words, 230, 237 designing of priming experiments, type‐token ratio of words (TTR), 49 116–121 World Wide Web, 238 evaluation, 123–124 exemplary studies, 112–115 XML programming language, 199 goal of word priming studies, 112, 115 lexical decision task, 121, 124 Yu and Ballard model, computational masked priming paradigm, xviii, 115, 125 modeling, 218–220 modality, 116 participants, 121 Zipf‐value, 234, 235, 239 Type of task PretestHabituation phaseTest phase Posttest Familiar Novel

[Figure 1.1 schematic. Columns: Pretest, Habituation phase, Test phase (Familiar / Novel), Posttest. Rows by type of task: Speech discrimination: pretest "Neem."; habituation "Gek."; test "Gek." (familiar) and "Gik." (novel); posttest "Neem." Word learning (single object): the same auditory sequence, with the word paired with one object. Word learning (two objects): pretest "Neem."; habituation "Gek." and "Gik."; test "Gek." and "Gik." (familiar vs. novel pairings); posttest "Neem." Note: The two‐object version of the task is known as the Switch task.]

Figure 1.1 Examples of various infant language habituation tasks.

Figure 2.1 The Intermodal Preferential Looking Paradigm (see text for details).

Figure 2.6 The Headturn Preference Procedure (see the text for details). Source: Courtesy of Melanie Soderstrom.


Figure 4.3 Velocity‐based saccade detection. The upper panel illustrates the x position of the eye during reading of a single sentence sampled at 500 Hz, and the lower panel the smoothed eye velocity. Red dots indicate points classified as belonging to a saccade as output from an event detection algorithm (Engbert & Kliegl, 2003), and vertical lines indicate the beginning and end of the corresponding saccade and fixation intervals.
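For readers who want to experiment with event detection of this kind, the following Python sketch flags saccade samples with a velocity criterion, loosely following the median‐based threshold of Engbert and Kliegl (2003). The function name, the fixed threshold multiplier of 6, and the minimum run length of 3 samples are illustrative choices, not the published implementation.

import numpy as np

def detect_saccades(x, fs=500.0, lam=6.0, min_samples=3):
    # Smoothed velocity: v[n] = (x[n+2] + x[n+1] - x[n-1] - x[n-2]) / (6 * dt),
    # the moving-average estimator used by Engbert & Kliegl (2003).
    v = np.convolve(x, np.array([1.0, 1.0, 0.0, -1.0, -1.0]) / 6.0, mode="same") * fs
    # Robust (median-based) estimate of the spread of the velocity distribution.
    sigma = np.sqrt(np.median(v ** 2) - np.median(v) ** 2)
    candidate = np.abs(v) > lam * sigma          # velocity threshold criterion
    # Keep only runs of at least min_samples consecutive suprathreshold samples.
    saccade = np.zeros(x.size, dtype=bool)
    start = None
    for i, flag in enumerate(candidate):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_samples:
                saccade[start:i] = True
            start = None
    if start is not None and candidate.size - start >= min_samples:
        saccade[start:] = True
    return saccade

Applied to a 500 Hz trace like that of Figure 4.3, detect_saccades(x) returns a Boolean mask of candidate saccade samples, analogous to the red dots in the figure.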

[Figure 10.1 shows three events, "slice carrot across with knife," "cut carrot in half with a karate chop of hand," and "snap twig with two hands," and the verbs each language applies to them: Chontal te‐k'e, tyof'n~i‐; Hindi kaaT, toD; Jalonke i‐xaba, i‐sεgε, gira.]

Figure 10.1 Comparison of cut and break verbs in Chontal, Hindi, and Jalonke (adapted from Majid et al., 2007).

[Figure 13.3 panel labels: left, "ERP analysis" (single trials and their time‐locked average, the ERP; scale 5–10 µV); right, "Time‐frequency analysis" (averaged spectrogram, 10–40 Hz); time axis 0–1500 ms.]

Figure 13.3 Simulated EEG data illustrating the difference between ERPs and time‐frequency analyses in their sensitivity to phase‐locked (evoked) and non‐phase‐locked (induced) activity. The first response is time‐locked and phase‐locked to time zero, whereas the second response is time‐locked but not phase‐locked. The first response shows up both after ERP averaging (as an oscillation) and after time‐frequency analysis of power (as a power increase at around 10 Hz). The second response is canceled by ERP averaging, but is preserved in time‐frequency analysis of power. Source: Bastiaansen, M. C. M., Mazaheri, A., & Jensen, O. (2008). By permission of Oxford University Press, USA.
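The phase‐locking logic behind this figure can be demonstrated in a few lines of simulation. In the hypothetical sketch below, two 10‐Hz bursts are injected into 100 trials, the second with a random phase per trial; trial averaging preserves only the first burst, while averaging single‐trial (Hilbert envelope) power preserves both. All amplitudes and latencies are arbitrary illustrative values.

import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
fs = 500
t = np.arange(0, 1.5, 1 / fs)                        # 0-1500 ms, as in the figure
env1 = np.exp(-((t - 0.4) ** 2) / 0.005)             # burst envelope around 400 ms
env2 = np.exp(-((t - 1.0) ** 2) / 0.005)             # burst envelope around 1000 ms
trials = []
for _ in range(100):
    phi = rng.uniform(0, 2 * np.pi)                  # random phase: non-phase-locked
    trials.append(np.sin(2 * np.pi * 10 * t) * env1
                  + np.sin(2 * np.pi * 10 * t + phi) * env2
                  + 0.3 * rng.standard_normal(t.size))
trials = np.array(trials)

erp = trials.mean(axis=0)                            # averaging cancels the second burst
power = (np.abs(hilbert(trials, axis=1)) ** 2).mean(axis=0)  # per-trial power keeps it

for label, s in (("ERP |amplitude|", np.abs(erp)), ("mean power", power)):
    a = s[int(0.35 * fs):int(0.45 * fs)].max()       # window around 400 ms
    b = s[int(0.95 * fs):int(1.05 * fs)].max()       # window around 1000 ms
    print(f"{label}: first burst {a:.2f}, second burst {b:.2f}")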


Figure 14.1 An anatomical scan of the head and the brain (A), and functional MRI images (B). The yellow box with lines (in A) shows the positioning of the slices. In functional MRI, brain activation (BOLD) is measured every TR (for instance every 2 seconds), slice by slice. In this example the slices are positioned to cover activation across the whole brain. They are overlain on an anatomical scan of the brain. Displayed in B are the collected slices going from the lower part of the brain (Slice 1) to the top part of the brain (Slice 32). The grey values indicate signal intensity, with colors more towards white having higher signal intensity. The image shows the results of one TR of scanning: 32 slices are collected to cover the whole brain. In off‐line preprocessing the slices are combined into one image, creating a 3D image of the brain activation.

Figure 14.3 A statistical map overlaid on an anatomical brain scan. The anatomical brain scan is normalized into MNI space (see text). The yellow statistical maps show voxels (or sets of voxels, sometimes called blobs) in which one condition had higher activation than another condition. The yellow coloring indicates the height of the T‐value of the comparison, which was cut off at an arbitrary value (only T values higher than 2.6 are shown). These are the results for one participant. Note that the color coding is arbitrary (we could have chosen any color), and does not reflect neural activation but the outcome of a statistical test performed at each voxel. The underlying assumption is that the statistical test reflects differences in neural activation between the two conditions.
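As a toy illustration of the two steps just described, combining the slices of one TR into a single 3‐D image and then displaying only voxels whose test statistic exceeds a cutoff, consider the sketch below. The 64 × 64 slice size and the random arrays are stand‐ins for real data; only the 32 slices and the t > 2.6 cutoff come from the figures.

import numpy as np

slices = [np.random.rand(64, 64) for _ in range(32)]   # the 32 slices of one TR
volume = np.stack(slices, axis=-1)                     # offline step: one 3-D image

t_map = np.random.standard_normal(volume.shape)        # stand-in voxelwise t-values
overlay = np.where(t_map > 2.6, t_map, np.nan)         # NaN voxels stay transparent
print("voxels displayed:", int(np.isfinite(overlay).sum()))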

[Figure 14.5 top panel channels (detector–source pairs, oxygenated signal): D5S9, D3S12, D3S9, D3S5, D2S9, D2S5, D2S4, D0S5, D0S4; time axis roughly 235–438 s. Bottom panels: grand averages for newborns and adults; y‐axis concentration (mM·mm), x‐axis 0–20 s.]

Figure 14.5 Sample of signal in fNIRS studies. The top panel shows estimated oxygenation levels from 9 channels, each emerging from the combination of a detector and a source. Highlighted regions have been automatically labelled as "artifacted" based on the speed of the signal change. The bottom panels show averages that have been time‐locked to the onset of stimulation (10 seconds of sound) for oxygenated (red) and deoxygenated (blue) hemoglobin among 40 newborns (left) and 24 adults (right). The dotted red and blue lines represent the best fit using a variable phase; the lighter dashed lines indicate observed 95% confidence intervals over participants.
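A bare‐bones version of this pipeline, rejecting epochs by the speed of signal change and then averaging time‐locked to stimulation onset, might look as follows in Python. The function interface and the rejection threshold of 0.5 per sample are assumptions for illustration, not the tool used for the figure.

import numpy as np

def average_epochs(signal, onsets, fs, tmin=0.0, tmax=20.0, max_step=0.5):
    # Time-locked average of one fNIRS channel. Epochs whose sample-to-sample
    # change exceeds max_step are skipped (a crude "speed of change" criterion).
    n = int((tmax - tmin) * fs)
    epochs = []
    for onset in onsets:
        start = int((onset + tmin) * fs)
        seg = signal[start:start + n]
        if len(seg) == n and np.max(np.abs(np.diff(seg))) < max_step:
            epochs.append(seg)
    if not epochs:
        raise ValueError("no clean epochs to average")
    return np.mean(epochs, axis=0), len(epochs)

# Usage sketch: mean_response, n_kept = average_epochs(channel, onsets, fs=10.0)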


Figure 15.1 Imaging of an acute patient presenting with anomia following left inferior parietal and frontal lobe stroke. A) Axial non‐contrast computerized tomography (CT) scan demonstrates diffuse hypodensity in the parietal (indicated by thick red arrow) and frontal regions (indicated by thin red arrow), predominantly in white matter. The low signal‐to‐noise resolution and low white/gray matter boundary contrast of CT do not allow one to determine the exact extent of the damage. B) T1‐ and T2‐weighted and fluid‐attenuated inverse recovery (FLAIR) images showing structural changes as hypo‐ and hyper‐intense areas in the white matter, respectively. In structural T1‐weighted images there is a clear contrast between white and gray matter, which is less evident in pathological T2‐weighted images. In T2‐weighted images the CSF signal is hyperintense (i.e., brighter) and gray matter appears brighter than white matter. Lesions appear hyperintense and may therefore be difficult to distinguish from CSF. In the FLAIR images there is a better contrast between the CSF (hypointense) and the lesion (hyperintense). C) Pulsed continuous arterial spin labelling (pCASL) perfusion‐weighted MRI image of the lesion shows reduced cerebral blood flow (CBF) to a large area in the inferior parietal region and to a smaller area in the left frontal lobe. The degree of hypo‐perfusion within the white matter is also noticeable but more difficult to distinguish from the CSF within the lateral ventricles. D) Series of diffusion images showing differences in the exact extension of the lesion depending on the b‐value used to acquire them (non‐diffusion‐weighted image: b = 0; diffusion weighting: b = 500 and b = 1500). These images lack the spatial resolution of conventional MRI sequences but are sensitive to acute lesions within minutes.

[Figure 15.2 panel content: A) lesion overlay (axial slices –20 to +40; 6–11 patients) over MTG, STG, angular, supramarginal, and occipital regions; B) white matter atlas tracts (corpus callosum, arcuate fasciculus, uncinate fasciculus, ILF, iFOF) with the lesion mask, alongside MFG, cingulate, rectus, IFG, thalamus, and SFG labels; C) fractional anisotropy (FA scale 0–1; bar values 0.30–0.42) of the frontal aslant tract and uncinate fasciculus for control, agrammatic, and semantic groups.]

Figure 15.2 Lesion mapping based on T1‐weighted data (A), on a diffusion tractography atlas (B), and an example of extracting tract‐based measurements from tractography (C). A) Group‐level lesion overlay percentage maps for an aphasic stroke patient cohort (n = 16) reconstructed on an axial template brain and projected onto the left lateral cortical surface. This method identifies areas most commonly affected by lesions within a group of patients. B) Lesion mask (purple) from a single stroke patient overlaid onto a tractography‐based white matter atlas to extract measures of lesion load on pathways affected by the lesion. C) Differences in tract‐specific measurements of the frontal aslant tract and uncinate fasciculus between control subjects and patients with non‐fluent/agrammatic and semantic variants of primary progressive aphasia (PPA). Tractography reconstructions show the fractional anisotropy values mapped onto the streamlines of the frontal aslant tract and uncinate fasciculus of a control subject and two representative patients with PPA with the non‐fluent/agrammatic and semantic variant. Exemplary measurements of fractional anisotropy (FA) are reported for the frontal aslant tract (solid bars) and the uncinate fasciculus (patterned bars). **Statistically significantly different versus the semantic group (P < 0.05); ††statistically significantly different versus controls (P < 0.001). IFG: inferior frontal gyrus, MFG: middle frontal gyrus, SFG: superior frontal gyrus, MTG: middle temporal gyrus, STG: superior temporal gyrus. Source: Modified from Forkel et al., 2014 and Catani et al., 2013.
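Conceptually, the tract‐specific FA values in panel C are obtained by reading the FA map at every point of every streamline and averaging. The sketch below assumes streamline coordinates already expressed in voxel space; real pipelines apply an affine transform and usually interpolate rather than round to the nearest voxel.

import numpy as np

def mean_tract_fa(streamlines, fa_volume):
    # streamlines: list of (N_i, 3) arrays of point coordinates (voxel space).
    # fa_volume: 3-D array of fractional anisotropy values.
    values = []
    for sl in streamlines:
        idx = np.round(sl).astype(int)            # nearest-voxel lookup
        values.append(fa_volume[idx[:, 0], idx[:, 1], idx[:, 2]])
    return float(np.mean(np.concatenate(values)))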

[Figure 15.3 panel content: A) left and right arcuate fasciculus with anterior, posterior, and long segments; B) scatterplot of aphasia severity (AQ, 70–100) six months after stroke against right long segment index size (–10 to +10), R² = 0.43, with a reference line at AQ > 93.8 and Patients 1–3 marked; C) Patient 1 (59‐year‐old male), Patient 2 (81‐year‐old female), Patient 3 (87‐year‐old female).]

Figure 15.3 Anatomical variability in perisylvian white matter anatomy and its relation to post‐stroke language recovery. A) The three segments of the arcuate fasciculus in the left and the right hemisphere, obtained from a group average. B) Regression plot of the volume of the right long segment plotted against the six‐month longitudinal aphasia quotient (AQ, corrected for age, sex, and lesion size). C) The right long segment for three example patients (indicated in B) presenting with different degrees of language recovery at six months. Source: Modified from Forkel et al., 2014.
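The regression in panel B is, at heart, an ordinary least‐squares fit of the six‐month AQ on the right long segment measure plus nuisance covariates. A minimal sketch, with hypothetical variable names and shapes:

import numpy as np

def fit_aq_on_tract(aq, tract_measure, covariates):
    # aq: (n,) outcome; tract_measure: (n,) predictor of interest;
    # covariates: (n, k) nuisance regressors (e.g., age, sex, lesion size).
    X = np.column_stack([np.ones_like(aq), tract_measure, covariates])
    beta, *_ = np.linalg.lstsq(X, aq, rcond=None)   # OLS coefficients
    resid = aq - X @ beta
    r2 = 1 - resid.var() / aq.var()                 # proportion of variance explained
    return beta, r2

With real data, such a fit would yield an R² comparable in spirit to the 0.43 reported in the figure, though this sketch omits the inferential statistics a published analysis requires.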

[Figure 16.1 schematic: Patients 1 ... n, each with the voxel lesioned (behavioral scores 8 and 12) or intact (behavioral scores 20 and 24); the group behavioral scores of patients with the voxel lesioned are compared against those with the voxel intact by a statistical test, producing a map of the t‐statistic at each voxel (t = 0–2.5) displayed at z = 8, 16, 24, 32.]

Figure 16.1 A schematic illustration showing the steps involved in a VLSM analysis. In the first stage, patients' lesions, which have been reconstructed onto a standardized template, are read into the analysis. Second, at every voxel, a statistical test is run to compare the behavioral scores (e.g., comprehension, fluency) of patients with and without a lesion in that voxel. The resulting test statistics (e.g., t‐values) at every voxel are then color‐coded and visualized as shown. In the next step (not shown), a statistical correction is applied (e.g., permutation testing) to adjust for the large number of comparisons being done, so that only voxels meeting a pre‐specified significance level are displayed. Source: Baldo et al. (2012). Reproduced with permission of John Wiley & Sons.
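The per‐voxel comparison at the core of VLSM is compact to express in code. The sketch below runs an independent‐samples t‐test at every voxel and skips voxels lesioned (or intact) in fewer than five patients, the eligibility rule shown in Figure 16.2. The data layout and the use of scipy's t‐test are illustrative assumptions, not the authors' pipeline.

import numpy as np
from scipy import stats

def vlsm_t_map(lesion_masks, scores, min_patients=5):
    # lesion_masks: (n_patients, n_voxels) Boolean array; scores: (n_patients,).
    n_patients, n_voxels = lesion_masks.shape
    t_map = np.full(n_voxels, np.nan)
    for v in range(n_voxels):
        les = lesion_masks[:, v]
        if les.sum() >= min_patients and (~les).sum() >= min_patients:
            # Lower scores in the lesioned group yield positive t here.
            t_map[v] = stats.ttest_ind(scores[~les], scores[les]).statistic
    return t_map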

Figure 16.2 Overlay of patients’ lesions. Only voxels with a minimum of five patients with lesions per voxel are included here, consistent with the data eligible for inclusion in the VLSM analysis. The color bar indicates the degree of lesion overlap, from purple (~5‐10 patients with lesion overlap) to aqua (~50 patients with lesion overlap) to red (~100 patients with lesion overlap).


Figure 16.3 Power analysis map showing the degree of power in our sample, given a medium effect size and alpha set at p < .05. The color bar indicates power ranging from 0.4 (shown in black) up to 1.0 (shown in red).


Figure 16.4 VLSM results showing neural correlates of auditory word recognition with varying levels of correction. A) Raw t‐map with no voxelwise correction, no permutation testing, and no covariates; B) voxelwise‐corrected t‐map (p < .001) with no permutation testing and no covariates; C) voxelwise‐corrected t‐map (p < .001) with lesion volume, education, and age as covariates but no permutation testing; and D) permutation‐testing‐derived t‐map with lesion volume, education, and age as covariates. The colored bars represent the range of significant t‐values for each analysis, from lower (though still significant) t‐values shown in purple to higher t‐values shown in red.
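The correction behind panel D can be sketched as a maximum‐statistic permutation test: behavioral scores are shuffled across patients many times, the whole t‐map is recomputed on each shuffle, and a high quantile of the resulting maximum |t| values serves as the corrected threshold. A hypothetical implementation, building on the vlsm_t_map sketch above:

import numpy as np

def max_t_threshold(lesion_masks, scores, t_fun, n_perm=1000, alpha=0.05, seed=0):
    # t_fun maps (lesion_masks, scores) to a t-map, e.g., vlsm_t_map.
    rng = np.random.default_rng(seed)
    max_ts = np.empty(n_perm)
    for p in range(n_perm):
        shuffled = rng.permutation(scores)          # break lesion-behavior link
        max_ts[p] = np.nanmax(np.abs(t_fun(lesion_masks, shuffled)))
    return np.quantile(max_ts, 1 - alpha)           # corrected critical |t|

Only voxels whose observed |t| exceeds this permutation‐derived threshold would survive, which is why panel D is sparser than panels A through C.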



Figure 17.1 Transmission of DNA between generations. Top: Males are represented by squares, females by circles. In this pedigree one pair of chromosomes (out of 23 pairs) is shown below each individual. Grandmother 1 carries a yellow DNA‐variant on her red chromosome, which influences trait X. She transmits her chromosomes to her offspring (individuals 3, 4, 5, 6), and because of crossing‐over between the red and the blue chromosome during egg production, each child gets a different combination. Half of her children inherit the variant that influences trait X. Top (right): Zoomed in, each chromosome can be represented as a string of letters. At most positions (the dots) the chromosomes are identical to the Human Reference Genome. At some positions (the letters) at least one of the chromosomes differs from the Reference. Such differences are on average a few hundred letters apart. The A with the yellow dot influences trait X. Bottom: Many generations later some of the descendants of 1 still carry the yellow DNA‐variant. The stretch of red chromosome surrounding it has shrunk, but in a different way in each descendant. However, individuals i, ii, iv and v still carry the C to the right of the yellow A, while iii and v still carry the G to the left of it. In a GWAS these two variants may show association with trait X.


Figure 17.2 Visualization of Sanger sequencing results. Sanger sequencing results for two individuals for the same stretch of DNA. On the X‐axis, the position along the sequenced fragment of DNA; on the Y‐axis, the fluorescence intensity for four different colors. A different color lights up for each of the bases that is read: A = green, C = blue, G = black, T = red. In the middle of the lower image, two different colors light up at the same position (arrow), because the individual has inherited different letters from his father and mother: the reference letter C and the variant letter T. Some background coloring can be seen near the bottom of each image. This is an artefact.
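Detecting the double peak at the arrowed position amounts to asking whether a second fluorescence channel is comparably strong to the first. A toy base caller along these lines (the 50% relative threshold is an arbitrary illustrative choice, not a property of real basecalling software):

import numpy as np

def call_bases(peak_intensities, rel_threshold=0.5):
    # peak_intensities: iterable of rows with the four channel peaks per
    # position, in the order A, C, G, T.
    bases = np.array(list("ACGT"))
    calls = []
    for row in peak_intensities:
        order = np.argsort(row)[::-1]                    # channels by strength
        if row[order[1]] >= rel_threshold * row[order[0]]:
            calls.append(bases[order[0]] + "/" + bases[order[1]])  # heterozygous
        else:
            calls.append(bases[order[0]])                # homozygous call
    return calls

At the arrowed position of the lower trace, both the C and the T channel light up, so such a caller would report "C/T".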

[Figure 17.3 tracks: aligned sequence reads, known DNA variants from the dbSNP database, and the human genome reference sequence (bottom).]

Figure 17.3 Next generation sequencing. Next Generation Sequencing data for one individual for a short stretch of the DNA of gene GABRB3. Each horizontal bar (blue or pink) represents a single sequenced molecule. The sequences are aligned to the Human Reference Genome (bottom). If the sequence differs from the Reference, this is indicated; see the blue C in the middle. About half of the molecules carry the C, the other molecules carry the T (as indicated in the Reference). This individual is heterozygous at this point. Either a C was present in the DNA from one of his parents and a T in the other parent, or the C originated de novo during egg or sperm production.
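Calling the heterozygous position from such a pileup reduces to counting the bases that the aligned reads show there. A minimal sketch, with an assumed 25% allele‐fraction cutoff for distinguishing real alleles from sequencing errors:

from collections import Counter

def genotype_at(read_bases, min_frac=0.25):
    # read_bases: the base each overlapping read carries at one position,
    # e.g., "CCTCTTCCTT" for the heterozygous site in the figure.
    counts = Counter(read_bases)
    total = sum(counts.values())
    alleles = sorted(b for b, c in counts.items() if c / total >= min_frac)
    if len(alleles) == 1:
        alleles = alleles * 2          # homozygous, e.g., "C/C"
    return "/".join(alleles[:2])

print(genotype_at("CCTCTTCCTT"))       # -> "C/T": heterozygous, as in the figure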


Figure 17.4 Visualization of SNP‐chip results. SNP‐chip assay result for a single polymorphism. Each dot represents an individual. X‐axis: intensity of the fluorescent label attached to one allele at the polymorphism (e.g., A). Y‐axis: intensity of the fluorescent label attached to the other allele (e.g., C). The software recognizes three clusters and assigns a genotype to each individual (e.g., A/A (red), A/C (green), or C/C (blue)). Black dots show samples without signal: controls that contained water instead of DNA.
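Genotype calling of this kind can be approximated by separating the two‐channel intensities by angle. The toy caller below (the angle cutoffs, the signal floor, and the A/C allele names are illustrative assumptions, not the vendor's algorithm) assigns the three genotype clouds and treats low total signal as a no‐call, like the water controls in the figure.

import numpy as np

def call_genotypes(x, y, signal_floor=0.1):
    # x, y: fluorescence intensities for the two alleles, one pair per sample.
    x, y = np.asarray(x, float), np.asarray(y, float)
    calls = np.full(x.shape, "no call", dtype=object)
    strong = x + y > signal_floor                  # water controls fail this test
    theta = np.arctan2(y[strong], x[strong])       # 0 = pure A signal, pi/2 = pure C
    calls[strong] = np.where(theta < np.pi / 6, "A/A",
                             np.where(theta > np.pi / 3, "C/C", "A/C"))
    return calls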