Action to Language Via the Mirror Neuron System
ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM
Mirror neurons may hold the brain’s key to social interaction: each coding not only a particular action or emotion, but also the recognition of that action or emotion in others. The Mirror System Hypothesis adds an evolutionary arrow to the story: from the mirror system for hand actions, shared with monkeys and chimpanzees, to the uniquely human mirror system for language. In this volume, written to be accessible to a wide audience, experts from child development, computer science, linguistics, neuroscience, primat- ology, and robotics present and analyze the mirror system and show how studies of action and language can illuminate each other. Topics discussed in the 15 chapters include the following. What do chimpanzees and humans have in common? Does the human capabil- ity for language rest on brain mechanisms shared with other animals? How do human infants acquire language? What can be learned from imaging the human brain? How are sign- and spoken-language related? Will robots learn to act and speak like humans?
MICHAEL A. ARBIB is the Fletcher Jones Professor of Computer Science, as well as a Professor of Biological Sciences, Biomedical Engineering, Electrical Engineering, Neuroscience and Psychology at the University of Southern California (USC), which he joined in 1986. He has been named as one of a small group of University Professors at USC in recognition of his contributions across many disciplines.
ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM
edited by MICHAEL A. ARBIB
University of Southern California CAMBRIDGE UNIVERSITY PRESS Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press The Edinburgh Building, Cambridge CB2 8RU, UK Published in the United States of America by Cambridge University Press, New York www.cambridge.org Information on this title: www.cambridge.org/9780521847551
© Cambridge University Press 2006
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published in print format 2006
ISBN-13 978-0-511-24663-0 eBook (NetLibrary) ISBN-10 0-511-24663-3 eBook (NetLibrary)
ISBN-13 978-0-521-84755-1 hardback ISBN-10 0-521-84755-9 hardback
Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. Contents
List of contributors page vii Preface ix Part I Two perspectives 1 1 The Mirror System Hypothesis on the linkage of action and languages Michael A. Arbib 3 2 The origin and evolution of language: a plausible, strong-AI account Jerry R. Hobbs 48 Part II Brain, evolution, and comparative analysis 89 3 Cognition, imitation, and culture in the great apes Craig B. Stanford 91 4 The signer as an embodied mirror neuron system: neural mechanisms underlying sign language and action Karen Emmorey 110 5 Neural homologies and the grounding of neurolinguistics Michael A. Arbib and Mihail Bota 136 Part III Dynamic systems in action and language 175 6 Dynamic systems: brain, body, and imitation Stefan Schaal 177 7 The role of vocal tract gestural action units in understanding the evolution of phonology Louis Goldstein, Dani Byrd, and Elliot Saltzman 215 8 Lending a helping hand to hearing: another motor theory of speech perception Jeremy I. Skipper, Howard C. Nusbaum, and Steven L. Small 250 Part IV From mirror system to syntax and Theory of Mind 287 9 Attention and the minimal subscene Laurent Itti and Michael A. Arbib 289
v vi Contents
10 Action verbs, argument structure constructions, and the mirror neuron system David Kemmerer 347 11 Language evidence for changes in a Theory of Mind Andrew S. Gordon 374 Part V Development of action and language 395 12 The development of grasping and the mirror system Erhan Oztop, Michael A. Arbib, and Nina Bradley 397 13 Development of goal-directed imitation, object manipulation, and language in humans and robots Ioana D. Goga and Aude Billard 424 14 Assisted imitation: affordances, effectivities, and the mirror system in early language development Patricia Zukow-Goldring 469 15 Implications of mirror neurons for the ontogeny and phylogeny of cultural processes: the examples of tools and language Patricia Greenfield 501 Index 534 Contributors
Michael A. Arbib Karen Emmorey Computer Science Department School of Speech, Language, and Neuroscience Program and USC Hearing Sciences Brain Project San Diego State University University of Southern California San Diego, CA 92120, USA Los Angeles, CA 90089, USA Ioana D. Goga Aude Billard Center for Cognitive and Autonomous Systems Laboratory Neural Studies Swiss Institute of Technology 400504 Clvj-Napoca, Romania Lausanne (EPFL) 1015 Lausanne, Switzerland Louis Goldstein Department of Linguistics Mihail Bota Yale University Neuroscience Program and New Haven, CT 06511, USA USC Brain Project University of Southern California Andrew S. Gordon Los Angeles, CA 90089, USA Institute for Creative Technologies University of Southern California Nina Bradley Marina del Rey, CA 90292, USA Department of Biokinesiology and Physical Therapy Patricia Greenfield University of Southern California Department of Psychology Los Angeles, CA 90033, USA FPR-UCLA Center for Culture, Brain, and Dani Byrd Development Department of Linguistics University of Southern California University of Southern California Los Angeles Los Angeles, CA 90089, USA CA 90095, USA
vii viii List of contributors
Jerry R. Hobbs Elliot Saltzman USC Information Sciences Institute Department of Physical Therapy University of Southern California Boston University Marina del Rey, CA 90292, USA Boston, MA 02215, USA
Laurent Itti Stefan Schaal Department of Computer Science, Department of Computer Science University of Southern California University of Southern California Los Angeles Los Angeles, CA 90089, USA CA 90089, USA Jeremy I. Skipper David Kemmerer Department of Neurology Department of Speech, Language, The University of Chicago and Hearing Sciences Chicago, IL 60637, USA Purdue University West Lafayette Steven L. Small IN 47907, USA Departments of Psychology and Neurology and the Brain Howard C. Nusbaum Research Imaging Center Department of Psychology and The University of Chicago the Brain Research Imaging Center Chicago, IL 60637, USA The University of Chicago Chicago Craig B. Stanford IL 60637, USA Department of Anthropology University of Southern California Erhan Oztop Los Angeles, CA 90089, USA JST-ICORP Computational Brain Project ATR Patricia Zukow-Goldring Computational Neuroscience Linguistics Department Laboratories University of Southern California Kyoto 619-0288, Japan Los Angeles, CA 90089, USA Preface
There are many ways to approach human language – as a rich human social activity, as a formal system structured by rules of grammar, and as a pattern of perception and production of utterances, to name just a few. The present volume uses this last concern – with the perception and production of utterances – as its core. The aim is not to ignore the other dimensions of language but rather to enrich them by seeking to understand how the use of language may be situated with respect to other systems for action and perception. The work is centered on, but in no way restricted to, the Mirror System Hypothesis (introduced by Arbib and Rizzolatti in 1997). This is the hypothesis that the mirror neuron system for the recognition of movements of the hands in praxic action – which is present both in monkey and in human in a number of areas including Broca’s area (generally considered to be the frontal speech area) – provides the evolutionary basis for the brain mechanisms which support language. The Mirror System Hypothesis sees the ancestral action recognition system being elaborated through the evolution of ever more capable neural mechanisms supporting imitation of hand movements, then pantomime emerging on the basis of displacement of hand movements to imitate other degrees of freedom. A system of “protosign” emerges as conventionalized codes extend the range of manual communication, and serves as scaffolding for “protospeech.” An expanding spiral of protosign and protospeech yields a brain able to support both action and language. The arguments pro and con the Mirror System Hypothesis and the more general issue of how the studies of action and language can illuminate each other are developed in 15 chapters by experts in child development, computer science, linguistics, neuroscience, primatology, and robotics. Part I of the book provides Two perspectives on the evolution of language. I discuss “The Mirror System Hypothesis on the linkage of action and languages,” presenting the essential data and modeling of the mirror system for grasping in the macaque brain as well as related human brain imaging data to set the stage for the Mirror System Hypothesis on the evolution of the language-ready brain. Two controversial hypotheses are discussed: the view that the path to protospeech was indirect, depending on the scaffolding of protosign rather than evolving directly from primate vocalizations within the vocal domain; and the view that protolanguage was “holophrastic” and that the emergence of
ix x Preface language rested on the simultaneous fractionation of “unitary utterances” into words and the development of varied syntactic strategies to put the pieces back together again. Jerry Hobbs then offers a view of “The origin and evolution of language: a plausible, strong-AI account” which may be set against the Mirror System Hypothesis. He presents a computa- tional approach in which abductive logic is realized in the structured connectionist networks of the SHRUTI model of language processing, and on this he bases an account of the evolution of language mechanisms. He uses the development of folk psychology to explain the evolution of Gricean non-natural meaning, i.e., the notion that what is conveyed is not merely the content of the utterance, but also the intention of the speaker to convey that meaning by means of that specific utterance. He also presents an account of syntax arising out of discourse, and uses that to argue against the holophrasis hypothesis. Part II of the book presents three chapters addressing the theme of Brain, evolution, and comparative analysis. Craig Stanford provides, in “Cognition, imitation, and culture in the great apes,” a comparative view of humans and the great apes. The “culture of apes” is discussed, especially variations in behavior in chimpanzees in different areas of Africa. The capacity (or lack of it) for imitation in apes and the relation between “communication in the wild” and simple forms of “language” taught to apes in captivity leads to a discussion of which of the cognitive abilities that make human language possible are possessed by non-human primates; while an emphasis on social behavior is the basis for an evaluation of claims for Theory of Mind (Machiavellian intelligence) in non-human primates. Karen Emmorey provides a comparative analysis of language modalities within humans, focusing on “The signer as an embodied mirror neuron system: neural mechan- isms underlying sign language and action.” Like mirror neurons, signers must associate the visually perceived manual actions of another signer with self-generated actions of the same form. However, unlike grasping and reaching movements, sign articulations are structured within a phonological system of contrasts. She relates this to language evolu- tion, showing how languages escape their pantomimic and iconic roots and develop duality of patterning, and probes the similarities and differences between the neural systems for production and perception that support speech, sign, and action. Finally for Part II, Mihail Bota joins me in the chapter “Neural homologies and the grounding of neurolinguistics” to use comparative neurobiology of the monkey and human to establish homologies between brain regions of the two species related to the mirror system as well as communication and (precursors of) language to ground claims as to the brain of the common ancestor of monkeys and humans of perhaps 20 million years ago, and thus evaluate the Mirror System Hypothesis on how such brains changed to become language- ready. Of particular importance to charting the similarities and differences between human speech and the vocalizations of non-human primates is a set of data on the macaque auditory system and on the role of anterior cingulate cortex in both motivation and vocalization to better distinguish mechanisms that support language from those that support quite different forms of communication. Part III analyzes Dynamic systems in action and language. In “Dynamic systems: brain, body, and imitation,” Stefan Schaal focuses primarily on technological approaches to Preface xi learning “motor primitives” by building up dynamical systems by using combinations of basis functions in the system description. Learning methods can set the weights of these combinations both to yield imitation of behaviors expressed in a form that uses all the relevant dynamic variables and to yield recognition of actions as well. The discussion is extended to superposition of movements and, briefly, to sequential and hierarchical behavior. A short discussion relates these concepts to “what the brain really does.” With this we turn to speech as a dynamical system. Louis Goldstein, Dani Byrd, and Elliot Saltzman discuss “The role of vocal tract gestural action units in understanding the evolution of phonology,” thus helping us think through the issue of to what extent speech production shares mechanisms with motor control more generally. Like Emmorey, they stress the central role of duality of patterning – the use of a set of non-meaningful arbitrary discrete units that allows word creation to be productive – in phonology. They analyze vocal tract action gestures in terms of dynamical systems which are discrete and combinable. They see the iconic aspects of manual gestures as critical to evolution of a system of symbolic units whereas phonological evolution crucially requires the emer- gence of effectively non-meaningful combinatorial units, with the two systems ultimately converging in a symbiotic relationship. Continuing in this vein, Jeremy Skipper, Howard Nusbaum, and Steven Small discuss “Lending a helping hand to hearing: another motor theory of speech perception.” They describe an active model of speech perception that involves mirror neurons as the basis for inverse and forward models used in the recogni- tion of speech, with special emphasis on audiovisual speech in which facial movements (e.g., of the lips) provide extra cues for speech recognition. According to this model, the mirror neuron system maps observed speech production and manual gestures to abstract representations of speaking actions that would have been activated had the observer been the one producing the action (inverse models). These representations are then mapped in a somatotopically appropriate manner to pre- and primary motor cortices. The resulting motor commands have sensory consequences (forward models) that are compared to processing in the various sensory modalities. This aids in the recognition of a particular acoustic segment of an utterance by constraining alternative linguistic interpretations. Brain imaging experiments are adduced in support of these ideas. Part IV takes us From mirror system to syntax and Theory of Mind. Laurent Itti and I discuss “Attention and the minimal subscene,” showing how perception of a “minimal subscene” linking an agent and an action to one or more objects may underlie processes of scene description and question-answering, linking the schematic structure of visual scenes to language structure. The approach integrates studies of the role of salience in visual attention with “top–down” attention, cooperative computation models of vision and speech processing, and theories and data linking sentence processing to eye movements. David Kemmerer then offers an integrative view of “Action verbs, argument structure constructions, and the mirror neuron system.” Here the route from mirror neurons to syntax is via the constructionist framework which regards syntax as an extension of the lexicon. Construction Grammar provides the setting for discussing the major semantic properties of action verbs and argument structure constructions. This in turn sets the stage xii Preface for analysis of the neuroanatomical substrates of action verbs and argument structure constructions in support of the proposal that the linguistic representation of action is grounded in the mirror neuron system. The discussion is then broadened to consider the emergence of language during ontogeny, history, and phylogeny. Andrew Gordon looks at “Language evidence for changes in a Theory of Mind”, with Theory of Mind viewed as the set of abilities that enable people to reflect on their own reasoning, to empathize with other people by imagining what it would be like to be in their position, and to generate reasonable expectations and inferences about mental states and processes. He analyzes a corpus of English and American novels in search of a “Freudian shift” and finds an increasing use of language about the unconscious in the decades preceding Freud. He thus concludes that a major shift has occurred in Theory of Mind, strengthening the case that its roots are cultural rather than biological. The final part of the book, Part V, examines Development of action and language. Erhan Oztop, Nina Bradley, and I study “The development of grasping and the mirror system” from a modeling perspective comprising the Infant Learning to Grasp Model (ILGM), the Grasp Affordance Emergence Model (GAEM), and the mirror neuron system (MNS) model of the development of the mirror neuron system. The account has strong links to the literature on infant motor development and makes clear that the range of mirror neuron responses is highly adaptive rather than being “hard-wired.” This paves the way for an understanding of how imitation builds on the mirror system. Iona D. Goga and Aude Billard provide a broad conceptual framework for the study of “Development of goal-directed imitation, object manipulation, and language in humans and robots.” Social abilities, such as imitation, turn-taking, joint attention, and intended body communication, are fundamental for the development of language and human cognition. Inspired by this perspective, they offer a composite model of the mechanisms underlying the development of action, imitation, and language in human infants. A recent trend of robotics research seeks to equip artifacts with social capabilities. Thus the model also sets the stage for reproducing imitation and language in robots and simulated agents. They validate the model through a dynamic simulation of a child–caregiver pair of humanoid robots. Patricia Zukow-Goldring discusses “Assisted Imitation: affordances, effectivities, and the mirror system in child development.” She takes a mirror-system oriented view of cognitive development in the child, showing how caregivers help the child learn the effectivities of her own body and the affordances of the world around her. The basic idea is that a shared understanding of action grounds what individuals know in common. In particular, this perspective roots the ontogeny of language in the progression from action and gesture to speech and supports the view that the evolutionary path to language also arises from perceiving and acting, leading to gesture, and eventually to speech. Finally, Patricia Greenfield explores “Implications of mirror neurons for the ontogeny and phylogeny of cultural processes: the examples of tools and language.” Mirror neurons underlie the ability of the monkey and human to respond both to their own acts and to the same act performed by another, selectively responding to intentional or goal-directed action rather than to movement per se. Greenfield argues that this neural substrate Preface xiii underwrites phylogenetic and ontogenetic development of two key aspects of human culture, tool use and language. She stresses that ontogeny does not recapitulate phyl- ogeny, but that infant behavior is more likely to be conserved than adult behavior across phylogeny. The analysis of both ontogeny and phylogeny draws on comparison of chimpanzees, bonobos, and humans to derive clues as to what foundations of human language may have been present in the common ancestor 5 to 7 million years ago. This multi-author volume gains unusual coherence through its emergence from a year- long seminar led by the Editor in which eleven of the authors met every 2 to 3 weeks to debate and build upon the Mirror System Hypothesis. We thank the Center for Interdis- ciplinary Research of the University of Southern California which made this seminar, and thus this book, possible.
Part I Two perspectives
1 The Mirror System Hypothesis on the linkage of action and languages
Michael A. Arbib
1.1 Introduction Our progress towards an understanding of how the human brain evolved to be ready for language starts with the mirror neurons for grasping in the brain of the macaque monkey. Area F5 of the macaque brain is part of premotor cortex, i.e., F5 is part of the area of cerebral cortex just in front of the primary motor cortex shown as F1 in Fig. 1.1 (left). Different parts of F5 contain neurons active during manual and orofacial actions. Cru- cially for us, an anatomically segregated subset of these neurons are mirror neurons. Each such mirror neuron is active not only when the monkey performs actions of a certain kind (e.g., a precision pinch or a power grasp) but also when the monkey observes a human or another monkey perform a more or less similar action. In humans, we cannot measure the activity of single neurons (save when needed for testing during neurosurgery) but we can gather comparatively crude data on the relative blood flow through (and thus, presumably, the neural activity of ) a brain region when the human performs one task or another. We may then ask whether the human brain also contains a “mirror system for grasping” in the sense of a region active for both execution and observation of manual actions as compared to some baseline task like simply observing an object. Remarkably, such sites were found in frontal, parietal, and temporal cortex of the human brain. Most significantly for this book, the frontal activation was found in or near Broca’s area (Fig. 1.1 right), a region which in most humans lies in the left hemisphere and is traditionally associated with speech production. Moreover, macaque F5 and human Broca’s area are considered to be (at least in part) homologous brain regions, in the sense that they are considered to have evolved from the same brain region of the common ancestor of monkeys and humans (see Section 1.2 as well as Arbib and Bota, this volume). But why should a neural system for language be intimately related with a mirror system for grasping? Pondering this question led Giacomo Rizzolatti and myself to formulate the Mirror System Hypothesis (MSH) (Arbib and Rizzolatti, 1997; Rizzolatti and Arbib, 1998) which will be presented more fully in Section 1.3.1. We view the mirror system for
Action to Language via the Mirror Neuron System, ed. Michael A. Arbib. Published by Cambridge University Press. © Cambridge University Press 2006.
3 4 Michael A. Arbib
Figure 1.1 A comparative side view of the monkey brain (left) and human brain (right), not to scale. The view of the monkey brain emphasizes area F5 of the frontal lobe of the monkey; the view of the human brain emphasizes two of the regions of cerebral cortex, Broca’s area and Wernicke’s area, considered crucial for language processing. The homology between monkey F5 and human Broca’s area is a key component of the present account. grasping as a key neural “missing link” between the abilities of our non-human ancestors of 20 million years ago and the modern human capability for language, with manual gestures rather than a system for vocal communication providing the initial seed for this evolutionary process. The present chapter, however, goes “beyond the mirror” to offer hypotheses on evolutionary changes within and outside the mirror system which may have occurred to equip Homo sapiens with a language-ready brain, a brain which supports language as a behavior in which people communicate “symbolically” through integrated patterns of vocal, manual and facial movements. Language is essentially multi-modal, not just a set of sequences of words which can be completely captured by marks on the printed page.
1.1.1 Protolanguage defined In historical linguistics (e.g., Dixon, 1997), a protolanguage for a family of extant human languages is the posited ancestral language – such as proto-Indo-European for the family of Indo-European languages – from which all languages of the family are hypothesized to descend historically with various modifications. In this chapter, however, we reserve the term protolanguage for a system of utterances used by a particular hominid species (possibly including Homo sapiens) which we would recognize as a precursor to human language (if only we had the data!), but which is not itself a human language in the modern sense. Bickerton (1995) hypothesized that the protolanguage of Homo erectus was a system whose utterances took the form of a string of a few words much like those of today’s language, but which conveyed meaning without the aid of syntax. On this view, language just “added syntax” through the evolution of Universal Grammar. Contrary to Bickerton’s hypothesis, I argue (Section 1.5) that the protolanguage of Homo erectus and early Homo The Mirror System Hypothesis 5 sapiens was “holophrastic”, i.e., composed mainly of “holophrases” or “unitary utteran- aces” (phrases in the form of semantic wholes with no division into meaningful subparts) which symbolized frequently occurring situations and that words as we know them then co-evolved culturally with syntax through fractionation. My view is shared, e.g., by Wray (2002) and opposed by Hobbs (this volume). Bickerton (2005) and Wray (2005) advance the debate within the context of the Mirror System Hypothesis.
1.1.2 Relating language to the vocalizations of non-human primates Humans, chimpanzees, and monkeys share a general physical form and a degree of manual dexterity, but their brains, bodies, and behaviors differ. Humans have abilities for bipedal locomotion and learnable, flexible vocalization that are not shared by other primates. Monkeys exhibit a primate call system (a limited set of species-specific calls) and an orofacial (mouth and face) gesture system (a limited set of gestures expressive of emotion and related social indicators). Note the linkage between the two systems: communication is inherently multi-modal. This communication system is closed in the sense that it is restricted to a specific repertoire. This is to be contrasted with the open nature of human languages which can form endlessly many novel sentences from the current word stock and add new words to that stock. Admittedly, chimpanzees and bonobos (great apes, not monkeys) can be trained to acquire a form of communication – based either on the use of hand signs or objects that each have a symbolic meaning – that approximates the complexity of the utterances of a 2-year-old human child, in which a “message” generally comprises one or two “lexemes.” However, there is no evidence that an ape can reach the linguistic ability of a 3-year-old human. Moreover, such “ape languages”1 are based on hand–eye coordination rather than vocalization whereas a crucial aspect of human biological evolution has been the emergence of a vocal apparatus and control system that can support speech. It is tempting to hypothesize that certain species-specific vocalizations of monkeys (such as the “snake call” and “leopard call” of vervet monkeys) provided the basis for the evolution of human speech, since both are in the vocal domain (see Seyfarth (2005) for a summary of arguments supporting this view2). However, combinatorial properties for the openness of communication are virtually absent in basic primate calls and orofacial communication, though individual calls may be graded. Moreover, Ju¨rgens (1997, 2002) found that voluntary control over the initiation and suppression of such vocalizations relies on the mediofrontal cortex including anterior cingulate gyrus (see Arbib and Bota, this volume). MSH explains why it is F5, rather than the cingulate area involved in macaque vocalization, that is homologous to the human’s frontal substrate for language by asserting that a specific mirror system – the primate mirror system for grasping – evolved into a
1 I use the quotes because these are not languages in the sense in which I distinguish languages from protolanguages. 2 For some sense of the debate between those who argue that protosign was essential for the evolution of the language-ready brain and those who take a “speech only” approach, see Fogassi and Ferrari (2004), Arbib (2005b), and MacNeilage and Davis (2005). 6 Michael A. Arbib key component of the mechanisms that render the human brain language-ready. It is this specificity that will allow us to explain below why language is multi-modal, its evolution being rooted in the execution and observation of hand movements and extended into speech. Note that the claim is not that Broca’s area is genetically preprogrammed for language, but rather that the development of a human child in a language community normally adapts this brain region to play a crucial role in language performance. Zukow-Goldring (this volume) takes a mirror-system-oriented view of cognitive development in the child, showing how caregivers help the child learn the effectivities of her own body and the affordances of the world around her. As we shall see later in this chapter, the notion of affordance offered by Gibson (1979) – that of information in the sensory stream concern- ing opportunities for action in the environment – has played a critical role in the development of MSH, with special reference to neural mechanisms which extract affor- dances for grasping actions. Not so much has been made of effectivities, the range of possible deployments of the organism’s degrees of freedom (Turvey et al., 1981), but Oztop et al. (this volume) offer an explicit model (the Infant Learning to Grasp Model, ILGM) of how effectivities for grasping may develop as the child engages in new activities. Clearly, the development of novel effectivities creates opportunities for the recognition of new affordances, and vice versa.
1.1.3 Language in an action-oriented framework Neurolinguistics should emphasize performance, explicitly analyzing both perception and production. The framework sketched in Fig. 1.2 leads us to ask: to what extent do language mechanisms exploit existing brain mechanisms and to what extent do they involve biological specializations specific to humans? And, in the latter case, did the emergence of language drive the evolution of such mechanisms or exploit them? Arbib (2002, 2005a) has extended the original formulation of MSH to hypothesize a number of (possibly overlapping) stages in the evolution of the language-ready brain. As we shall see in more detail in Section 1.3.2, crucial early stages extend the mirror system for grasping to support imitation, first so-called “simple” imitation such as that found in chimpanzees (which allows imitation of “object-oriented” sequences only as the result of extensive practice) and then so-called “complex” imitation found in humans which combines the perceptual ability to recognize that a novel action was in fact composed of (approximations to) known actions with the ability to use this analysis to guide the more or less successful reproduction of an observed action of moderate complexity.3 Complex imitation was a crucial evolutionary innovation in its own right, increasing the ability to learn and transmit novel skills, but also provides a basis for the later emergence of the language-ready brain – it is crucial not only to the child’s ability to acquire
3 See the discussion of research by Byrne (2003) on “program-level imitation” in gorillas (cf. Stanford, this volume). A key issue concerns the divergent time course of acquisition of new skills in human versus ape. The Mirror System Hypothesis 7
Figure 1.2 The figure places production and perception of language within a framework of action and perception considered more generically. Language production and perception are viewed as the linkage of Cognitive Form, Semantic Form, and Phonological Form, where the “phonology” may involve vocal or manual gestures, or just one of these, with or without the accompaniment of facial gestures. language and social skills, but the perceptual ability within it is essential to the adult use of language. But it requires a further evolutionary breakthrough for the complex imitation of action to yield pantomime, as the purpose comes to be to communicate rather than to manipulate objects. This is hypothesized to provide the substrate for the development of protosign, a combinatorially open repertoire of manual gestures, which then provides the scaffolding for the emergence of protospeech (which thus owes little to non-human vocalizations), with protosign and protospeech then developing in an expanding spiral. Here, I am using “protosign” and “protospeech” for the manual and vocal components of a protolanguage, in the sense defined earlier. I argue that these stages involve biological evolution but that the progression from protosign and protospeech to languages with full-blown syntax and compositional semantics was a historical phenomenon in the development of Homo sapiens, involving few if any further biological changes. For production, the notion is that at any time we have much that we could possibly talk about which might be represented as cognitive structures (Cognitive Form; schema assemblages) from which some aspects are selected for possible expression. Further selection and transformation yields semantic structures (hierarchical constituents express- ing objects, actions and relationships) which constitute a Semantic Form enriched by linkage to schemas for perceiving and acting upon the world. Finally, the ideas in the Semantic Form must be expressed in words whose markings and ordering reflect the relationships within Semantic Form. These words must be conveyed as “phonological” 8 Michael A. Arbib structures – where here I extend phonological form to embrace a wide range of ordered expressive gestures which may include speech, sign, and orofacial expressions (and even writing and typing). For perception, the received utterance must be interpreted semantically with the result updating the “hearer’s” cognitive structures. For example, perception of a visual scene may reveal “Who is doing what and to whom/which” as part of a non-linguistic action– object frame in cognitive form. By contrast, the verb–argument structure is an overt linguistic representation in semantic form – in most human languages, the action is named by a verb and the objects are named by nouns or noun phrases. A production grammar for a language is then a specific mechanism (whether explicit or implicit) for converting verb–argument structures into strings of words (and hierarchical compounds of verb–argument structures into complex sentences) and vice versa for perception. These notions of “forward” and “inverse” grammars may be compared with the forward and inverse models discussed by Oztop et al. (this volume) and by Skipper et al. (this volume). Emmorey (this volume) discusses the neural basis for parallels between praxis (includ- ing spatial behavior), pantomime, sign production, and speech production, establishing what is modality-general for brain mechanisms of language and what is modality-specific. Here I would suggest a crucial distinction between signed language and speech. In both speech and signing, I claim, we recognize a novel utterance as in fact composed of (approximations to) known actions, namely uttering words, and – just as crucially – the stock of words is open-ended. However, signed language achieves this by a very different approach to speech. Signing exploits the fact that the signer has a very rich repertoire of arm, hand, and face movements, and thus builds up vocabulary by variations on this multidimensional theme (move a hand shape (or two) along a trajectory to a particular position while making appropriate facial gestures). By contrast, speech employs a system of articulators that are specialized for speech – there is no rich behavioral repertoire of non-speech movements to build upon.4 Instead evolution “went particulate” (Studdert- Kennedy, 2000; Goldstein et al., this volume), so that the spoken word is built (to a first approximation) from a language-specific stock of phonemes (actions defined by the coordinated movement of several articulators, but with only the goal of “sounding right” rather than conveying meaning in themselves). But if single gestures are the equivalent of phonemes in speech or words in sign, what levels of motor organization correspond to derived words, compound words, phrases, sentences, and discourse? Getting to derived words seems simple enough. In speech, we play variations on a word by various morphological changes which may modify internal phonemes or add new ones. In sign, “words” can be modified by changing the source and
4 However, it is interesting to speculate on the possible relevance of the oral dexterity of an omnivore – the rapid adaptation of the coordination of chewing with lip, tongue, and swallowing movements to the contents of the mouth – to the evolution of these articulators. Such oral dexterity goes well beyond the “mandibular cyclicities” posited by MacNeilage (1998) to form the basis for the evolution of the ability to produce syllables as consonant–vowel pairs. The Mirror System Hypothesis 9
origin, and by various modifications to the path between. For everything else, it seems enough – for both action and language – that we can create hierarchical structures subject to a set of transformations from those already in the repertoire. For this, the brain must provide a computational medium in which already available elements can be composed to form new ones, irrespective of the “level” at which these elements were themselves defined. When we start with words as the elements, we may end up with compound words or phrases, other operations build from both words and phrases to yield new phrases or sentences, etc., and so on recursively. Similarly, we may learn arbitrarily many new motor skills based on those with which we are already familiar. With this, let me give a couple of examples (Arbib, 2006) which suggest how to view language in a way which better defines its relation to goal-directed action. Consider a conditional, hierarchical motor plan for opening a child-proof aspirin bottle:
While holding the bottle with the non-dominant hand, grasp the cap, push down and turn the cap with the dominant hand; then repeat (release cap, pull up cap, and turn) until the cap comes loose; then remove the cap. This hierarchical structure unpacks to different sequences of action on different occa- sions, with subsequences conditioned on the achievement of goals and subgoals. To ground the search for similarities between action and language, I suggest we view an action such as this one as a “sentence” made up of “words” which are basic actions. A “paragraph” or a “discourse” might then correspond to, e.g., an assembly task which involves a number of such “sentences.” Now consider a sentence like “Serve the handsome old man on the left.”, spoken by a restaurant manager to a waiter. From a “conventional” linguistic viewpoint, we would apply syntactic rules to parse this specific string of words. But let us look at the sentence not as a structure to be parsed but rather as the result of the manager’s attempt to achieve a communicative goal: to get the waiter to serve the intended customer. His sentence planning strategy repeats the “loop”
NP ! Adj NP are abstracted from procedures which serve to reduce ambiguity in reaching a communicative goal. This example concentrates on a noun phrase – and thus exemplifies ways in which reaching a communicative goal (identifying the right person, or more generally, object) may yield an unfolding of word structures in a way that may clarify the history of syntactic structures. While syntactic constructions can be usefully analyzed and categorized from an abstract viewpoint, the pragmatics of what one is trying to say and to whom one is trying to say it will drive the goal-directed process of producing a sentence. Conversely, the hearer has the inferential task of unfolding multiple meanings from the word stream (with selective attention) and deciding (perhaps unconsciously) which ones to meld into his or her cognitive state and narrative memory.
1.2 Grasping and the mirror system 1.2.1 Brain mechanisms for grasping Figure 1.3 shows the brain of the macaque (rhesus monkey) with its four lobes: frontal, parietal, occipital, and temporal. The parietal area AIP is near the front (anterior) of a groove in the parietal lobe, shown opened up in the figure, called the intraparietal sulcus – thus AIP (for anterior region of the intraparietal sulcus).5 Area F5 (the fifth region in a numbering for areas of the macaque frontal lobe) is in what is called ventral premotor cortex. The neuroanatomical coordinates refer to the orientation of the brain and spinal cord in a four-legged vertebrate: “dorsal” is on the upper side (think of the dorsal fin of a shark) while “ventral” refers to structures closer to the belly side of the animal. Together, AIP and F5 anchor the cortical circuit in macaque which transforms visual information on intrinsic properties of an object into hand movements for grasping it. AIP processes visual information to implement perceptual schemas for extracting grasp parameters (affor- dances) relevant to the control of hand movements and is reciprocally connected with the so-called canonical neurons of F5. Discharge in most grasp-related F5 neurons correlates with an action rather than with the individual movements that form it so that one may relate F5 neurons to various motor schemas corresponding to the action associ- ated with their discharge. By contrast, primary motor cortex (F1) formulates the neural instructions for lower motor areas and motor neurons. The FARS (Fagg–Arbib–Rizzolatti–Sakata) model (Fagg and Arbib, 1998) provides a computational account of the system (it has been implemented, and various examples of grasping simulated) centered on this pathway: the dorsal stream via AIP does not know “what” the object is, it can only see the object as a set of possible affordances, whereas the ventral stream from primary visual cortex to inferotemporal cortex (IT), by contrast, is able to recognize what the object is. This information is passed to prefrontal cortex (PFC)
5 The reader uncomfortable with neuroanatomical terminology can simply use AIP and similar abbreviations as labels in what follows, without worrying about what they mean. On the other hand, the reader who wants to know more about the neuroanatomy should turn to Arbib and Bota (this volume) in due course. The Mirror System Hypothesis 11
Figure 1.3 A side view of the left hemisphere of the macaque brain. Area 7b is also known as area PF. (Adapted from Jeannerod et al., 1995.) which can then, on the basis of the current goals of the organism and the recognition of the nature of the object, bias the affordance appropriate to the task at hand. The original FARS model suggested that the bias was applied by PFC to F5; subsequent neuroanatom- ical data (as analyzed by Rizzolatti and Luppino, 2003) suggest that PFC and IT may modulate action selection at the level of parietal cortex rather than premotor cortex. Figure 1.4 gives a partial view of “FARS Modificato,” the FARS model updated to show this modified pathway. AIP may represent several affordances initially, but only one of these is selected to influence F5. This affordance then activates the F5 neurons to command the appropriate grip once it receives a “go signal” from another region, F6, of PFC. We now turn to those parts of the FARS model which provide mechanisms for sequencing actions: the circuitry encoding a sequence is postulated to lie within the supplementary motor area called pre-SMA (Rizzolatti et al., 1998), with administration of the sequence (inhibiting extraneous actions, while priming imminent actions) carried out by the basal ganglia. In Section 1.3, I will argue that the transition to early Homo coincided with the transition from a mirror system used only for action recognition and “simple” imitation to more elaborate forms of “complex” imitation: I argue that what sets hominids apart from their common ancestors with the great apes is the ability to rapidly exploit novel sequences as the basis for immediate imitation or for the immediate construction of an appropriate response. Here, I hypothesize that the macaque brain can generate sequential behavior on the basis of overlearned coordinated control programs (cf. Itti and Arbib, this volume), but that only the human brain can perform the inverse operation of observing a sequential behavior and inferring the structure of a coordinated control program that might have generated it. In the FARS model, Fagg and Arbib (1998) modeled the interaction of AIP and F5 in the sequential task employed by Sakata in his studies of AIP. In this task, the monkey sits with his hand on a key, while in front of him is a manipulandum. A change in an LED signals him to prepare to grasp; a further change then tells him to execute the 12 Michael A. Arbib
Figure 1.4 (a) “FARS Modificato”: the original FARS diagram (Fagg and Arbib, 1998) is here modified to show PFC acting on AIP rather than F5. The idea is that AIP does not “know” the identity of the object, but can only extract affordances (opportunities for grasping for the object consider as an unidentified solid); prefrontal cortex uses the IT identification of the object, in concert with task analysis and working memory, to help AIP select the appropriate action from its The Mirror System Hypothesis 13 grasp. He must then hold the manipulandum until the onset of another LED signals him to release the manipulandum. Successful completion of the task yields a juice reward. Figure 1.4b simplifies how the FARS model presents the interaction of AIP and F5 in the Sakata Task. Three AIP cells are shown: a visual-related and a motor-related cell that recognize the affordance for a precision pinch, and a visual-related cell for power grasp affordances. The five F5 units participate in a common program (in this case, a precision grasp), but each cell fires during a different phase of the program. However, the problem with the diagram of Fig. 1.4b is that it suggests that each action coded by F5 neurons is hard-wired to determine a unique successor action – hardly appropriate for adaptive behavior. Indeed, the Sakata Task is invariant save that the nature of the object, and thus the appropriate grasp, may change between trials. One would thus like to understand how “grasp switching” can occur within the one overall pattern of behavior. Karl Lashley (1951) raised the problem of serial order in behavior in a critique of the behaviorist view that each action must be triggered by some external stimulus. If we tried to learn a sequence like A ! B ! A ! C by reflex chaining, what is to stop A triggering B every time, to yield the performance A ! B ! A ! B ! A ! ... (or we might get A ! B þ C ! A ! B þ C ! A ! ...)? One solution is to store the “action codes” (motor schemas) A, B, C, . . . in one part of the brain and have another area hold “abstract sequences” and learn to pair the right action with each element. Fagg and Arbib posited that F5 holds the “action codes” (motor schemas) for grasping, while the pre-SMA holds the code for the overlearned sequence. We further posited that the basal ganglia (BG) manage priming and inhibition (cf. Arbib et al.(1998), for modeling, and Lieberman (2000) for the role of human BG in language). This analysis emphasizes that F5 alone is not the “full” mirror system. Behavior, and its imitation, involves not only the “unit actions” encoded in the F5 mirror system but also sequences and more general patterns. The FARS model sketches how pre-SMA and BG may cooperate in generating a sequence. But this is a model of the performance of overlearned sequences. Thus, if we are to extend the FARS model to include complex imitation, we must not only model a monkey-like mirror system which relates observed grasping actions to ones which the observer can himself perform, but must also go “beyond the mirror” to show how the units of a sequence and their order/interweaving can be recognized and how novel performances (imitation) can be based on this
Caption for Figure 1.4 (cont.) “menu.” Area cIPS is a part of the intraparietal sulcus that appears to be involved in recognizing the orientation of surfaces, information useful for both the dorsal stream (extracting affordances for grasping) and the ventral stream (using shape cues to recognize objects). (b) Interaction of AIP and F5 in the Sakata Task. (In the Sakata Task, the animal places its hand on a key once an LED signals it to do so; a second cue instructs the animal to reach for, grasp, and hold a manipulandum; while a third cue instructs the animal to release its hold and return to the resting position.) Three AIP cells are shown: a visual-related and a motor-related cell that recognize the affordance for a precision pinch, and a visual-related cell for power grasp affordances. The five F5 units participate in a common program (in this case, a precision grasp), but each cell fires during a different phase of the program. 14 Michael A. Arbib recognition. This future model will require recognition of a complex behavior on multiple occasions with increasing success in recognizing component actions and in linking them together. Arbib and Bota (this volume; see also Itti and Arbib, this volume) offer related discussion, viewing sequences as the expression of hierarchical structures.
1.2.2 A mirror system for grasping in macaques Gallese et al.(1996) discovered that, among the F5 neurons related to grasping there are neurons, which they called mirror neurons, that are active not only when the monkey executes a specific hand action but also when it observes a human or other monkey carrying out a similar action. Mirror neurons do not discharge in response to simple presentation of objects even when held by hand by the experimenter. They require a specific action – whether observed or self-executed – to be triggered. Moreover, mirror neurons do not fire when the monkey sees the hand movement unless it can also see the object or, more subtly, if the object is not visible but is appropriately “located” in working memory because it has recently been placed on a surface and has then been obscured behind a screen behind which the experimenter is seen to be reaching (Umilta` et al., 2001). In either case, the trajectory and handshape must match the affordance of the object. All mirror neurons show visual generalization. They fire when the instrument of the observed action (usually a hand) is large or small, far from or close to the monkey. They fire whether the action is made by a human or monkey hand. A few neurons respond even when the object is grasped by the mouth. The majority of mirror neurons respond selectively to one type of action. For some mirror neurons, the congruence between the observed and executed actions which are accompanied by strong firing of the neuron is extremely strict, with the effective motor action (e.g., precision grip) coinciding with the action that, when seen, triggers the neuron (again precision grip, in this example). For other neurons the congruence is broader. For them the motor requirement (e.g., precision grip) is usually stricter than the visual (e.g., any type of hand grasping, but not other actions). These neurons constitute the “mirror system for grasping” in the monkey and we say that these neurons provide the neural code for matching execution and observation of hand movements. By contrast, the canonical neurons are those grasp-related neurons that are not mirror neurons; i.e., they are active for execution but not observation. They constitute the F5 neurons simulated in the FARS model. The populations of canonical and mirror neurons appear to be spatially segregated in F5 (see Rizzolatti and Luppino (2001) for an overview and further references). Perrett et al.(1990) and Carey et al.(1997) found that STSa, in the rostral part of the superior temporal sulcus (STS, itself part of the temporal lobe), has neurons which discharge when the monkey observes such biological actions as walking, turning the head, bending the torso, and moving the arms. Of most relevance to us is that a few of these neurons discharged when the monkey observed goal-directed hand movements, such as grasping objects (Perrett et al., 1990) – though STSa neurons do not seem to discharge during movement execution as distinct from observation. The Mirror System Hypothesis 15
STSa and F5 may be indirectly connected via inferior parietal area PF (Brodmann area (BA) 7b) (Matelli et al., 1986; Cavada and Goldman-Rakic, 1989; Seltzer and Pandya, 1994). About 40% of the visually responsive neurons in PF are active for observation of actions such as holding, placing, reaching, grasping, and bimanual interaction. Moreover, most of these action observation neurons were also active during the execution of actions similar to those for which they were “observers,” and were thus called PF mirror neurons (Fogassi et al., 1998). In summary, area F5 and area PF include an observation/execution matching system: when the monkey observes an action that resembles one in its movement repertoire, a subset of the F5 and PF mirror neurons is activated which also discharges when a similar action is executed by the monkey itself. We next develop the conceptual framework for thinking about the relation between F5, AIP, and PF. For now, we focus on the role of mirror neurons in the visual recognition of hand movements. However, Section 1.4 expands the mirror neuron database, reviewing the findings by Kohler et al.(2002) that 15% of mirror neurons in the hand area of F5 can respond to the distinctive sound of an action (breaking peanuts, ripping paper, etc.) as well as viewing the action, and by Ferrari et al.(2003) that the orofacial area of F5 (adjacent to the hand area) contains a small number of neurons tuned to communicative gestures (lip- smacking, etc.). Clearly, these data must be assessed carefully in weighing the relative roles of protosign and protospeech in the evolution of the language-ready brain. The Mirror Neuron System (MNS) model of Oztop and Arbib (2002; described further by Oztop et al., this volume) provides some insight into the anatomy while focusing on the learning capacities of mirror neurons. Here, the task is to determine whether the shape of the hand and its trajectory are “on track” to grasp an observed affordance of an object using a known action. The model is organized around the idea that the AIP ! F5canonical pathway emphasized in the FARS model (Fig. 1.4) is complemented by PF ! F5mirror.As shown in Fig. 1.5, the MNS model can be divided into three parts.