
Artificial Neural Nets: a critical analysis of their effectiveness as empirical technique for cognitive modelling

Peter R. Krebs

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy.

University of New South Wales

August 2007

COPYRIGHT, AUTHENTICITY, and ORIGINALITY STATEMENT

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International. I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation,

and

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format,

and

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed ...... Date ......

Abstract

This thesis is concerned with the computational modelling and simulation of physiological structures and cognitive functions of brains through the use of artificial neural nets. While the structures of these models are loosely related to neurons and physiological structures observed in brains, the extent to which we can accept claims about how neurons and brains really function based on such models depends largely on judgments about the fitness of (virtual) computer experiments as empirical evidence. The thesis examines the computational foundations of neural models, neural nets, and some computational models of higher cognitive functions in terms of their ability to provide empirical support for theories within the framework of Parallel Distributed Processing (PDP). Models of higher cognitive functions in this framework are often presented in forms that hybridise top-down (e.g. employing terminology from Psychology or Linguistics) and bottom-up (neurons and neural circuits) approaches to modelling. In this thesis I argue that the use of terminology from either approach can blind us to the highly theory-laden nature of the models, and that this tends to produce overly optimistic evaluations of the empirical value of computer experiments on these models. I argue, further, that some classes of computational models and simulations based on methodologies that hybridise top-down and bottom-up approaches are ill-designed. Consequently, many of the theoretical claims based on these models cannot be supported by experiments with such models. As a result, I question the effectiveness of computer experiments with artificial neural nets as an empirical technique for cognitive modelling.

Contents

1 Introduction
  1.1 Research and Thesis Outline
    1.1.1 Part One
    1.1.2 Part Two
    1.1.3 Part Three
  1.2 A Short History of Ideas
  1.3 Setting the Stage
  1.4 Minds and Models
  1.5 Summary

I Representation and Computing

2 Models and Simulations
  2.1 The Concept of Model
  2.2 Models as Representations
    2.2.1 Representations
    2.2.2 Simulations
  2.3 Methodologies
  2.4 Summary

3 Computational Foundations
  3.1 Symbol Systems
    3.1.1 Formal Computation
    3.1.2 Limitations of Turing Machines
    3.1.3 Digital Computing
    3.1.4 Parallel Computing
  3.2 Other Computing Systems


    3.2.1 Analog Computing
    3.2.2 Neural Computing
    3.2.3 Other Forms of Computing
    3.2.4 Implementation Independence
  3.3 Computation as Interpretation
  3.4 Machines and Semantics
  3.5 Summary

II Models and Reality

4 Virtual Models
  4.1 Virtual Scientific Experimentation
    4.1.1 Mathematical Models
    4.1.2 Methodologies
  4.2 Implementation
  4.3 Computer Experiments
    4.3.1 Computer Models as Scientific Experiments
    4.3.2 Levels of Explanation
    4.3.3 Virtual Models
  4.4 Summary

5 Models of Neurons
  5.1 Biological Models
    5.1.1 The Hodgkin and Huxley Model
    5.1.2 Neural Coding
  5.2 Mathematical Neurons
  5.3 Relation to Reality
  5.4 Summary

6 Artificial Neural Nets
  6.1 Background
    6.1.1 The Limitations
    6.1.2 Plausibility
    6.1.3 Unsupervised Learning
    6.1.4 Architectures
    6.1.5 Implementation

  6.2 Universal Frameworks
    6.2.1 Labels
    6.2.2 Simplicity
  6.3 Summary

III Models and Explanation

7 Models and Evidence
  7.1 Verification and Validation
  7.2 Validation of Models
    7.2.1 Visualization
    7.2.2 Measuring and Imaging
  7.3 Limits of Technology
  7.4 Other Evidence
  7.5 Summary

8 Models of Cognition
  8.1 Higher Level Models
  8.2 Interpreting Results
    8.2.1 Pronouncing English
    8.2.2 Structure in Time
    8.2.3 Moral Virtues
    8.2.4 Cluster Analysis
  8.3 Summary

9 Conclusion
  9.1 Neural Nets and PDP
    9.1.1 Neurological Inspiration
    9.1.2 Holding out Hope
    9.1.3 Computational Sufficiency
    9.1.4 Psychological Accuracy
    9.1.5 Parallel Processing
    9.1.6 Distributed Representation
    9.1.7 Learning
  9.2 Bridging the Explanatory Gap
  9.3 Neurological Plausibility

  9.4 Closing Remarks

Bibliography

Author Index

List of Figures

Chapter 1

Introduction

Because mind has shown itself to behave as a nearly decomposable system, we can model thinking at the symbolic level, with events in the range of hundreds of milliseconds or longer, without concern for details of implementation at the ‘hardware’ level, whether the hardware be brain or computer (Simon, 1995, 83).

Many of the current philosophical issues concerned with inquiries into human cognition and intelligence involve computation in some form. These inquiries may include the search for an artificial intelligence, but are mainly concerned with understanding and explaining real human intelligence. Computation is not only at the core of the philosophical foundations of Cognitive Science, but is also the primary tool for models and simulations in Cognitive Science. This thesis concerns computational models and simulations of cognitive functions and processes of the human mind. There are many approaches to modelling and I will focus on models based on artificial neural nets. The questions I will investigate are (1) how much can we infer about the human mind from models, and (2) are some of the approaches and architectures of models really suitable as empirical devices?

1.1 Research and Thesis Outline

In this introduction, I will outline the fundamental philosophical ideas, theories and assumptions that form the framework for Cognitive Science and the field of artificial intelligence (AI). First I will re-trace the various ideas of the computational theory of mind and the application of this theory as the philosophical foundation of an AI. A short review of the major ideas that led to the development of the computing machinery is followed by a sketch of how models fit into the scientific framework. These introductory remarks may help to establish a feel for the attitudes, expectations and problems that concern workers in the field of Cognitive Science, particularly in the context of building and using mathematical models. Following this introduction, there are three parts concerning Representation and Computing, Models and Reality and Models and Explanation. In the Conclusion I will offer an analysis with respect to the questions raised above.

1.1.1 Part One

Part one of this thesis (Representation and Computing) describes and discusses the role of models and simulations in science. The terms model and simulation have many meanings, and the different perceptions of what constitutes a model carry many different properties and connotations. Models can be “real” in the sense that they exist as a physical entity, or they may exist only as a theory, a piece of mathematics or as a computer program. These distinctly different concepts of model have been described as models to hold in one’s hand versus models to keep in one’s head (Hacking, 1983), and their epistemological status in the context of science is different. Since the development of the modern digital computing machine in the 1950s, there has also been a shift from mechanical models to more complex mathematical models and simulations. This is largely due to the fact that the modern electronic computer can deal effectively with mathematical models that would otherwise be either computationally too ‘expensive’, in the sense that the calculations would just need too many resources, or too ‘expansive’, in the sense of being complex. In chapter two, the relationships between mathematically constructed entities and their ‘real’ world counterparts are examined. In all of the computational theories and methods, we make assumptions about their explanatory powers. The problem of grounding representations in the real world is largely unresolved (see for example Harnad (1990)), and there are perceived limitations to what can be computed in principle.

For a discussion of computation in the context of computational models and simulations, and particularly in the area of AI, it is generally deemed sufficient to know something about the Church-Turing thesis (CTT) and the Turing Machine (TM). However, there is evidence that this particular concept of computation runs into problems for the field of AI, and many of the in principle objections against the possibility of an AI are based on arguments involving computational theory and principles. There are many such objections for a variety of religious, philosophical, and cultural reasons. Let me sketch three arguments from the list of philosophical objections. The Gödelian argument by Lucas (1961, 1970) and also by Penrose (1990) is based on the insight that formal systems, like computing with TMs, are either inconsistent or contain statements that are not provable within that particular formal system (Gödel, 1931). Human beings, Lucas argues, can detect these unprovable statements in machines and are therefore above the machine, i.e. the computer. In the phenomenological argument, Dreyfus (1979) holds the view that humans are embedded in the environment and that their lives cannot possibly be described in finite terms. Such a description would be necessary in order to emulate a human being, and consequently an AI is not possible. Lastly, there is the argument from simulation by Searle (1980). The Chinese Room thought experiment is designed to show that a computer with a set of transformation rules can never achieve intentionality. I will argue that some of these in principle objections are not applicable given the appropriate hardware and software architecture.

Chapter three contains an outline and an analysis of computation as it pertains to mathematical models in the Cognitive Sciences. The aim here is to ascertain whether and to what extent models of cognitive functions are constrained and ultimately limited within the current computational paradigms (symbol-based and connectionist). These constraints and limitations are likely to be the same as those that made the construction of an AI so far largely unsuccessful.

1.1.2 Part Two

Part two of this work concerns the methods of construction and use of mathematical and computational models in Cognitive Science. The kinds of mathematical constructs, and the assumptions and constraints, are examined in the context of their potential explanatory powers in chapter four. The role of models and simulations as experiments, and their capacity as empirical evidence, are examined. This capacity depends on the kind of thing a model can represent.

In chapter five I examine the principles, assumptions and methodologies of computational models of neurons. The processes of simplification and generalization are discussed, and the relationships between computational models and biological neurons are examined in terms of their architectural and functional properties. There are two main classes of neural models, namely (1) the relatively simple kind that is used in AI and also in neural net models, and (2) a more complex variety that is more concerned with single neurons and their physiology. The AI models follow from the groundbreaking work by McCulloch and Pitts (1943), Turing (1948) and Hebb (1949), while the physiological neural models are largely based on the initial mathematical model of the electro-physiology of cell membranes by Hodgkin and Huxley (1952).

Chapter six concerns artificial neural nets. Artificial neural nets are based on biological systems, albeit very loosely. In many of the models that are based on artificial neural nets, these properties and functionalities are formalized, simplified and taken to much more abstract levels. Some of the major architectures of models in the Parallel Distributed Processing (PDP) (Rumelhart and McClelland, 1986a; McClelland and Rumelhart, 1986) paradigm and their properties are discussed. I will argue that simple neural nets are universal in the sense that they can offer an apparent solution for nearly any conceivable problem.
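To make the contrast between these two classes concrete, the following minimal sketch (my own illustration, not code from the thesis) implements a model neuron of the first, simple kind in the spirit of McCulloch and Pitts (1943): a unit that sums weighted binary inputs and fires when the sum reaches a fixed threshold. The weights and thresholds are arbitrary illustrative values.

```python
# A minimal McCulloch-Pitts style threshold unit (illustrative sketch only).
# Inputs and output are binary; the unit "fires" (returns 1) when the weighted
# sum of its inputs reaches the threshold. Weights and thresholds are arbitrary.

def threshold_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0

# With unit weights, a threshold of 2 yields logical AND, a threshold of 1 logical OR.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "AND:", threshold_unit([x1, x2], [1, 1], threshold=2),
              "OR:", threshold_unit([x1, x2], [1, 1], threshold=1))
```

The second, physiological class of models (Hodgkin and Huxley, 1952) sits at the other end of the spectrum: it describes membrane currents with coupled differential equations rather than a single threshold operation, and is discussed in chapter five.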

1.1.3 Part Three

The issues of verification and validation of models are discussed in chapter seven. In some cases, predictions and results gained through computational methods can be verified against empirical evidence from various sources. Evidence from highly sophisticated imaging systems has been cited to substantiate claims in Cognitive Science. The images are constructed or re-constructed by means of computational processes, and visualizations are in some sense models and simulations themselves. I will introduce the principles of these imaging technologies and discuss the role of these technologies as evidence in support of claims about cognition. The use of evidence based on such technologies for claims about models can be controversial due to the inherent problems with the evidence itself. The localisation of mental activity in the brain, for example, is fraught with difficulty (see e.g. Uttal (2001)). In some cases not only is the evidence for the verification of models problematic, but the basis of a model can also already rest on false assumptions. In this chapter I will illustrate some of the technical difficulties in relating psychology to neurology, and importantly, the difficulties in grounding psychological models in neural frameworks.

The material in chapter eight concerns models of higher cognitive functions and the insights and results from computational models and simulations. The kinds of inferences that have been made about human vision, learning and production will be examined. The results and their interpretations will be compared and analyzed in view of the methodologies and assumptions associated with the models in question. Some ‘classic’ models and simulations, which aim at higher level cognitive functions, are re-visited. Work by Elman (1990), Elman et al. (1998), McLeod et al. (1998) and Seidenberg and McClelland (1989) with feed-forward and recurrent neural nets during the 1990s and more recent work by Churchland (1998) and Rogers and McClelland (2004) are used to illustrate the relationships between computational models, empirical evidence and claims about cognition.

A final summary and analysis of computational models and simulations is presented in the conclusion, with particular attention to the class of relatively simple models that involve only a small number of neurons. I argue that the use of terminology from either approach can blind us to the highly theory-laden nature of the models, and that this tends to produce overly optimistic evaluations of the empirical value of computer experiments on these models. Some classes of computational models and simulations based on methodologies that hybridise top-down and bottom-up approaches are ill-designed. Consequently, many of the theoretical claims based on these models cannot be supported by experiments with such models.

1.2 A Short History of Ideas

Thomas Hobbes (1588-1679), a contemporary of Descartes, did not offer a detailed theory of mind in computational terms, but we can see in the LEVIATHAN of 1651 what must be recognized as the precursor of the symbol processing hypothesis.

When a man Reasoneth, hee does nothing but conceive a summe totall, from Addition of parcels; or conceive a Remainder, from Subtraction of one summe from another; [. . . ] For as Arithmeticians teach to adde and substract in numbers; so the Geometricians teach the same in lines, figures, [. . . ] The Logicians teach the same in Consequences of words; adding together two Names, to make an Affirmation [. . . ] In summe, in what matter soever there is a place for addition and subtraction, there is also a place for Reason; and where these have no place, there Reason has nothing at all to do. [. . . ] For REASON, in this sense, is nothing but Reckoning (that is, Adding and Subtracting) of the Consequences of generall names agreed upon, for the marking and signifying our thoughts; [. . . ] (Hobbes, 1651, 18).

Hobbes describes human cognition as the manipulation of words or “generall names” with the aid of logical operators, which themselves are based on primitive functions such as “addition and subtraction”. These concepts are still present in the current debate in Philosophy of Mind, Cognitive Science and the field of AI, albeit in a more formalized form. The possibility of mechanizing and automating the execution of such primitive functions with the aid of computing machinery led to the physical symbol hypothesis and the beginnings of AI in the 1940s. However, a possible connection between automated computing devices and creative behavior was entertained much earlier. During the first half of the nineteenth century Lady Ada Lovelace, who may be regarded as the first computer programmer, worked with Charles Babbage on the Analytical Engine. Unfortunately, this device was never completed, but some of the plans, descriptions, and even some components have survived. Alan Turing remarked that

although Babbage had all the ideas, his machine was not at the time a very attractive prospect. The speed which would have been available would be definitely faster than a human computer [. . . ] (Turing, 1950, 439).

The design principles and the potential computing power of Babbage’s machines, the Difference Engine and the Analytical Engine, are well understood. The machines were designed to be a mathematician’s tool, and the idea of intelligent behavior being exhibited by the machines had not been anticipated. Nevertheless, Lovelace speculated on the computational powers of the machine (the Analytical Engine) and she envisaged that the machine might be able to play chess and compose music (Kurzweil, 1990). More systematic research into the possibilities for the creation of intelligent systems began with the conception of the theoretical and the technological foundations of modern computing machines themselves. Turing’s work in number theory and on the Entscheidungsproblem (Turing, 1936), which had been posed by Hilbert (1902), led to the development of the theory of computation that forms the basis of Computer Science. Incidentally, Gödel’s Incompleteness Theorem is also a response to one of the open problems in mathematics tabled by Hilbert. The combination of an emerging computational theory, breakthroughs in the engineering of digital computers by von Neumann and others, and a mechanistic theory of mind led to the Computational Theory of Mind. The Computational Theory of Mind is attractive as a hypothesis, because, as Sterelny puts it,

[i]t is good research strategy to try to model our information processing on something we already know a bit about. And we do know a good deal about computation, both from the theory of formalised systems of reasoning and from the actual implementation of some of those systems on real machines (Sterelny, 1989, 74).

The computational theory of mind now has a strong philosophical foundation. Amongst others, Zenon Pylyshyn (1984) has presented a widely accepted form of the theory that forms an important part of the philosophical and ideological foundation of the discipline of Cognitive Science. Computing theory, which forms the theoretical framework for Computer Science, is sometimes directly applied to the Computational Theory of Mind. The various forms and aspects of computationalism have been the subject of criticism (e.g. Lucas (1961, 1970); Searle (1980, 1990); Dreyfus (1979) and others). However, the Computational Theory of Mind is neither a clearly defined nor a homogeneous field of inquiry. Harnish (2002) points out that the computational theory of mind is essentially a special case of a representational Theory of Mind that goes back to the empiricists John Locke (1632-1704) and David Hume (1711-1776). Locke argued against innate ideas and knowledge and considered that

a child at his first coming into the world, will have little reason to think him stored with plenty of ideas, that are to be the matter of his future knowledge. It is by degrees he comes to be furnished with them (Locke, 1690, 47).

Instead, he held the view that all our knowledge comes from experiences, namely sensations and perceptions. Hume believed in a mind where ideas are impressed as faint images, and he was able to explain some phenomena of cognitive processes, such as vagueness, in his theory (Russell, 1946). Hume noted that

[t]hough experience be our only guide in reasoning concerning matters of fact; it must be acknowledged, that this guide is not altogether infallible, but in some cases is apt to lead us to errors (Hume, 1777, 110).

A Computational Theory of Mind also requires some architecture in which the computations can be performed. The Symbol Processing Hypothesis (Newell and Simon, 1963, 1976) (Newell, 1990) and the framework of PDP models (Rumelhart and McClelland, 1986a; McClelland and Rumelhart, 1986) are mirrored in the two paradigms in the field of AI, namely (1) the knowledge and rule based symbolic approach and (2) the connectionist approach, which is based on parallel computing with neuron-like elements.

It was not until the late 1890s that the nature of the nerve cell was discovered. In particular, the work by Camillo Golgi (1843-1926) and Santiago Ramon y Cajal (1852-1934), for which they both received the Nobel prize in 1906, led to the current neural doctrine and neural philosophy. The principle of the neuron being the “anatomical, physiological, metabolic, and genetic unit of the nervous system” is attributed to Waldeyer (1836-1921). Shepherd added more recently that

a major contribution of research in the future will have to add a fifth tenet: the neuron as a basic information processing unit in the nervous system (Shepherd, 1991, 292).

The concept of a neuron as an information processing unit forms the foundation of artificial neural nets, and much of the work done in the field of AI is invaluable for many of the techniques used in cognitive modelling. In particular, the discovery of learning algorithms and the development of various network topologies added renewed support for what has been developed into the parallel distributed processing approach to human cognition, although it must be stressed that some of these developments are very much engineering efforts toward an AI, and these efforts are not necessarily aimed at providing models for human cognition. The initial work on neurally inspired networks by McCulloch and Pitts (1943) and Turing (1948), and Hebb’s work THE ORGANIZATION OF BEHAVIOUR (1949), led to the development of the perceptron¹ as a neural model (Rosenblatt, 1958, 1959, 1962). After a short stasis in the development of neural nets following the publication of the very critical work PERCEPTRONS by Minsky and Papert (1969), new methods and developments brought neural nets back into favour. Considerations and debates in Philosophy of Mind have shifted toward Neurophilosophy, a strain of philosophy that concerns the implementation of a Mind in a brain that is composed of neurons. However, some types of computation do not have a strong connection with neural structures. The more traditional forms of computation based on the stepwise execution of a series of commands are equally important. For the last 60 years or so, computational models of cognitive functions, and also efforts to build intelligent systems in the field of AI, have been based on both principal forms of computation. Hypotheses about possible architectures of the mind have followed these computational paradigms analogously. The debate between the followers of the symbolic approach (Fodor, Pylyshyn et al.) and the connectionists (McClelland, the Churchlands, Sejnowski, Smolensky, et al.) is still alive and well, as the recent presentations by McClelland (2005) and Fodor (2005) demonstrate.

¹ The term perceptron is used to describe both a single element (model neuron) and a two layer network composed of these elements.

1.3 Setting the Stage

Cognitive Science is a relatively new discipline. Over the last 30 years or so, different insights, methods, and ideas from the sciences, mathematics, and philosophy have been brought together in an attempt to explain human cognition. There are many points of view about almost every topic in the field, but with the developments in Computer Science, Neuroscience, Philosophy, Psychology, etc. in mind, this series² of postulates captures the core of what workers in Cognitive Science generally (ought to) accept.

² The headings of the items in the list have been adapted from Uttal (2005). A similar list can be found in McLeod et al. (1998).

1. Philosophical Monism. The doctrine that the mind is a product of the brain and that there is no room for, and no need for, any supernatural substances. This is in essence the first commandment of Cognitive Science. Dualism, particularly Cartesian dualism, which proposes a separation of body and mind, is certainly not in vogue; however, Churchland and Sejnowski point out that

[s]uffice it to say that Cartesian hypothesis fails to cohere with current physics, chemistry, [. . . ], and neuroscience. To be sure, materialism is not a fact, in the way that the four-base helical structure of DNA, for example, is an established fact. It is possible therefore, that current evidence notwithstanding, dualism might actually be right (Churchland and Sejnowski, 1992, 2).

I certainly agree with this, but as a philosopher of science I believe it necessary to also point out that the notion of fact in science is in itself a contentious issue. This makes it more difficult to argue that coherence with current scientific fact is a “safe bet” - it might turn out in the future that some of the science may be wrong after all, despite the established facts. Nevertheless, it is a reasonable position to hold that, until there is some good evidence to support claims that dualism actually is right, the doctrine of monism is very plausible.

2. Complexity. The mind can be explained as processes within the brain, although it is possible that some of these processes are too complex to be understood for now, given that there are about 10¹² neurons, approximately ten times as many glia, and some 10¹⁵ synapses. That translates to around 10⁵ neurons and 10⁹ synapses in a single cubic millimeter of human cortical brain tissue.

3. The Brain Is the Seat of the Mind. Again, there are objections, caveats and clarifications offered. The nervous system is essentially distributed through the entire body, with large concentrations of nerve cells in the brain and the spinal cord. However, we have sensors, receptors and neural connections to muscles everywhere in the body. Our minds may just need a bit more than a brain deprived of all environmental cues and clues. Bennett and Hacker suggest that a new kind of dualism between brain and body is now ever present in Cognitive Science. They argue that “it is people who reason and infer from known or assumed premises to conclusions, not their brains” (Bennett and Hacker, 2003, 112). A Computer Science analogy would be to say that it is the computer that can execute a program, not just the central processing unit without the peripherals, from input- and output units to the more mundane, but essential, power supply. While the brain is the seat of mind, it is probably wise to consider some of the ancillary organs as well. After all, it is the BOLD³ signal that gives us the images of mental activity in fMRI.

4. The Solid Matter of the Brain Is Where the Action Is. This is, I believe, a less problematic proposition. Unlike Galen’s and later Descartes’ suggestion, there is no evidence for the existence of animal spirits in the ventricles. Accordingly, the role of the pineal gland has also been re-defined since.

5. The Solid Portions of the Brain Are Made Up of a Host of Anatomically Discrete Neurons. The brain comprises many kinds of nerve cells, or neurons, and glial cells that are massively interconnected. This proposition causes little or no concern. After all, the different kinds of cells are observable with a light microscope. The exact makeup and the functional roles of the various types of cells are largely unknown at this stage.

³ Blood Oxygenation Level Dependent.

6. Synapses Mediate the Interaction of the Discrete Neurons in the Brain. Synapses are the point of connection between neurons, and they are responsible for the transmission of signals. The information content of single signals at this level is probably minimal. The information is contained in the activation patterns of a (very) large number of synapses.

7. The Brain is Highly Adaptive. The brain is not a rigid entity, but it is possible for new connections (synapses) between neurons to be established, and it is possible for existing connections to be modified. This principle was first described by Hebb (1949) and it is the fundamental learning mechanism in artificial neural nets (a minimal sketch of the corresponding weight update rule is given directly after this list).

8. Neural Activity is Based on Electrochemistry. Neural activity can be directly measured with micro-electrodes, and the physiological and electrochemical processes are largely understood and well documented. Any good Neuroscience text book (see for example Bear et al. (1996)) will provide a solid scientific account of the cellular neurophysiology concerning phospholipid membranes, action potentials, sodium-potassium and calcium pumps, and the like. The neuron, because of its definable and observable behavior, is commonly taken as the base unit, or primary element (PE), for developments in AI and also for most of the connectionist computational models⁴. Not everyone agrees - Penrose (1990), for example, suggests that explaining neural activity at this level is insufficient. He holds the view that quantum events within neurons need to be considered, and that quantum processes may be responsible for a neuron’s information processing capabilities. However, this line of thought has not received much support over the years. The main argument against the possibility of quantum processes being relevant at the cellular level is based on the claim that such processes are simply swamped by the molecular processes and that it is unclear how quantum processes could influence events at a molecular level.

⁴ Not all models take the neuron as the smallest unit. In some models a single model “neuron” within a network may be considered as a functional group of neurons.

9. The Fundamental Correlate of Mind Is Neural Interaction. The activities between neurons are the key to mental activity, not the activities inside a single neuron. Similarly, any mental representations are to be found in some form in the connections, not within individual neurons (note Penrose’s objection above).

10. Localization Is a Contentious Issue in Cognitive Neuroscience. I left this item on the list, although it is only claimed that localization “remains a contentious issue”. However, this particular point is the subject of Uttal’s work (Uttal, 2001, 2005). He challenges principal assumptions and methodologies in brain-mapping, and essentially argues that the reductionist approach may not be helpful in the field of Psychology (compare Bechtel (2002)).

11. Cognition is Computation, or, Cognition is Computable. A computational Philosophy of Mind is fundamental to Cognitive Science. Much of the work that comes from the field of AI, and particularly much of cognitive modelling, would be meaningless without the mind - computation - brain connection. There is, however, no consensus among the workers in the discipline on what exactly computation is and how or what the brain actually computes. For the field of AI, the question of whether it is in principle possible to create intelligent behavior on a computer remains.
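As flagged in item 7, Hebb’s principle translates directly into the weight update rules used in artificial neural nets. The sketch below is my own illustration (the learning rate and activity values are arbitrary), showing the simplest form of such a rule: a connection is strengthened in proportion to the correlated activity of the two units it links.

```python
# Minimal Hebbian weight update (illustrative sketch; learning rate is arbitrary).
# The connection weight grows whenever pre- and post-synaptic units are active
# together: delta_w = eta * pre * post.

def hebbian_update(weight, pre_activity, post_activity, eta=0.1):
    """Return the updated connection weight after one co-activation step."""
    return weight + eta * pre_activity * post_activity

w = 0.0
for pre, post in [(1, 1), (1, 1), (0, 1), (1, 0), (1, 1)]:
    w = hebbian_update(w, pre, post)
print(round(w, 2))  # 0.3 -- the weight grew only on the three co-active steps
```

More elaborate variants (with decay terms, or error-driven corrections as in the delta rule and backpropagation) build on the same idea of adjusting connection strengths rather than rewiring the network.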

All items on this list are naturally open to some discussion over the details, and I have already hinted at some examples of objections and different opinions. However, I believe that these core beliefs and assumptions do provide a suitable framework for Cognitive Science to operate in. While it is likely that changes will be applied to some of the theories, we should not expect major shifts in the fundamental assumptions to occur soon. Lytton addresses this point and he comments that “even if we don’t suffer a paradigm shift, we may still get a major paradigm face-lift” (Lytton, 2002, 25).

1.4 Minds and Models

In this section, I will introduce some of the principal questions concerning the computation, the simulation and the modelling of cognition.

Such questions have been raised in the field of AI. Is thinking computable with a personal computer, a slide rule or even with pen and paper, or perhaps with the aid of tables of logarithms? Does the brain compute using discrete logical components like those found in most modern electronic computing machines? The answers to all of these questions must surely be no. The questions become much more interesting if we consider them ‘in principle’. Can computers think? Is thinking, or whatever the brain does, computable with a computer as we know it? Do brains compute? All of these questions are linked to the underlying assumption that brains and computation have something in common. Not everyone is willing to accept the premise that it is all about information processing. Lytton (2002) suggests that

Airplanes fly like albatrosses but computers don’t think like brains. Both brains and computers process information, but information processing might not be central to the process of thinking (Lytton, 2002, 13).

The problem can be put in a form that avoids the comparison of machine versus brain. Rather than asking whether computers can do things as they are done by brains, we can ask whether machines can compute what brains do. Computability as a concept for Cognitive Science and AI is grounded in the fundamental principles of Computer Science, i.e. computing with Turing Machines (TM). If we accept the materialistic and monistic paradigm of Cognitive Science, then we may have reason to assume that brain functions are computable. However, this does not necessarily imply that brain functions are actually computed by the brain - nor does it imply that everyone accepts the notion that what the brain does can be computed. There are many theoretical and practical factors and limitations, such as the complexity of the calculations. Computable here means that the functions can be described in mathematical terms and they can be dealt with computationally. These functions ought to be computable by machines based on TMs, at least ‘in principle’. A physical instantiation of a TM is possible, if we can supply enough memory for the task at hand⁵. It might need some time to complete the task, but it is a finite problem, given that brains live only a finite time. If it turns out that the speed of execution is also of importance, then we might have to restrict the computational problem to a smaller task. The task may just entail the simulation of a particular brain function, which should reduce the computational effort considerably. However, even for a small task, there is probably a large number of neurons and an even larger number of connections to consider, so that the total number of states to be computed is very large. The computational effort is hard to predict, because for now we have little or no means of determining how many neurons or connections are actually required to implement a particular function. Still, the problem is “merely” one of complexity⁶.

⁵ Turing asked that a machine has sufficient tape (memory) for the particular computation. The memory needs only to be unbounded for the particular instance and not infinite in general, see section 3.1.3 and also Krebs (2002).

⁶ The complexity of the human brain and the computational complexities (Kolmogoroff Complexity) in the context of an AI are discussed in Hoffmann (1993, 1998).

This way of thinking can be a trap. Although cognitive tasks may be computable in principle, the computations will be intractable for all but trivial cases due to constraints in computation time and memory requirements. I must concede that in the future much more powerful computers, e.g. possibly quantum computers, may allow us to tackle more “realistic” problems. For now the question remains whether any brain processes should actually be considered computable if the processes turn out to be intractable. In the context of models and simulation we should ask whether it is possible to build meaningful models on purely theoretical, i.e. ‘in principle’, foundations. Most high level models in Cognitive Science (particularly PDP models) fall into this category.

The kinds of computations that are necessary to emulate, simulate, or replicate brain activity seem too complex to be performed without much more sophisticated tools. Also, the way in which the computations are implemented seems to be significant. The debate over whether the classic, or symbolic, approach to an AI offers a more appropriate method than the connectionist approach is still ongoing. The possibility of an AI is dependent on a string of assumptions⁷. Not only is it assumed that intelligence can be replicated in some other substance than brain tissue, but it is also assumed that this replication can be achieved fully, or at least in part, through computational processes. The assumptions about what goes on in the brain go much further than that. For the followers of the connectionist paradigm of an AI, the neural structure of the brain itself is also considered an important part of the necessary ingredients that could make an AI possible. Many projects in the field of AI are related to some specific function of the brain, be it speech production, speech recognition, or vision for robots and the like. From this we can assume that the modularity of brain and mind is also accepted tacitly or even explicitly. This acceptance is indicated by the assumption that cognitive sub-systems can be modelled effectively in isolation.

⁷ An AI could possibly be based on a variety of principles. Here, and in the context of this thesis, the kind of AI that is referred to uses methods based on digital computers.

It is believed that computational models and simulations are the ideal approach to provide the clues and explanations for a computational theory of mind. This thesis concerns computational models and simulations of cognitive architectures and of cognitive functions. Computational models have their own set of assumptions and constraints associated with them. There seems little disagreement in the field of AI about what computation is - that is what TMs do and what computers do. Moreover, it is widely held that computation can be implemented on various kinds of hardware, like electronic digital machines or mechanical calculators. A closer look reveals that things are not as straightforward as they seem, because computation can mean quite different things (see section 3.2 on page 67).

If a computational model is going to be effective in telling us something about the thing that is under consideration, e.g. some brain function, then success is unlikely, unless the model reflects the workings of the brain at least at some level. While models are usually simplifications in scale and complexity, they do relate to the real thing in shape, material, structure or functionality. A model of a banana can be shaped like a banana, can be yellow and it might even smell like a banana. If the model has all of these properties, it could be considered a good model of a banana. A yellow cube that smells like a banana is a better model of a banana than a red sphere that smells like a strawberry. Yet, if we are told that the stuff of which the red sphere is made is in fact the same stuff bananas are made of, then we have to rethink. The point here is that a model may be related to the real world at some deeper level, despite the fact that on the surface there is no suggestion of any relationship at all. There is a strong possibility that there is something to be learned about brains from brain models, even if some of the models are of the wrong shape and the wrong colour, as long as there is something ‘brainy’ in the model. Analogously we may learn something about the mind, if our models contain something ‘mind-like’.

It has been argued that whatever the brain does is nothing but the by-product of computations, or that mental processes are emergent properties of computations. The concepts of computations, and I use a double plural here deliberately, are fuzzy constructs. The term computation is used to denote quite different things in different disciplines.
On the surface, computer science deals with types of problems that are quite different from those dealt with in Philosophy of Mind. A view within a particular community of philosophers or scientists can be based on a set of famous quotes of one or two important players in the field. This usually creates a consensus on what is meant by the term in the vernacular of the people involved in that particular activity. Yet this approach, which may be based on an intuitive understanding of computation, may not be sufficiently rigorous to serve as a foundation for a general theory of computation. Intuitions about computations are biased, because they will incorporate what the individuals have experienced and what the individuals try to convey. ‘Theory ladenness’ is a problem in all (scientific) endeavours.

Inquiries into the nature of computational theories, particularly those dealing with the inadequacies of one or more aspects of such theories, consider the existence of distinct types or even paradigms of computation. As a consequence, the questions raised are about classification or differentiation of various forms of computation, rather than attempting to integrate the various ideas into a single structured concept of computation. Computational models serve as vehicles to explain structures and mechanisms not only of real physical objects, brains, but also to hypothesize about human behavior and the human mind. Claims in Cognitive Science, which are based on empirical evidence gained from computational models, are influenced and constrained by the computational theories and methods on which these models rest. The final results and conclusions about the real world drawn from models and simulations, and any claims about human cognition, reflect elements of the computational paradigm of the models. The model for learning and simulating moral virtues (Churchland, 1998) (see section 8.2.3 on page 185) is a case in point. Churchland suggests that a simple recurrent network may have an appropriate architecture for learning and simulating moral virtues, but, as I will argue, such a network does little more than data analysis. Hence it seems a rather bold claim that humans may base their moral judgments on what are essentially statistical inferences.

The final analysis reveals that the inherent vagueness of data obtained from experiments in Psychology (O Nualláin, 2002) makes it difficult to provide a clear and realistic explanation of a model’s effectiveness. The descriptions and interpretations for what are essentially ‘computer science’ models are often borrowed from a different scientific paradigm like Psychology or Neuroscience. The resulting hybrid models are extremely difficult to verify, and their effectiveness as models for cognitive functions should be questioned and evaluated case by case. Claims about these models are often shrouded in wishful and theory-laden terminology, like the ‘meaningful’ descriptive labels for inputs and outputs. The descriptions of processes and relations that are offered sometimes ‘surprise’ even the modeller. However, it turns out that overly positive evaluations are often based on little more than statistical analysis and inference. For many models it would be possible to obtain better (numerically more accurate) results by using traditional mathematical and statistical methods instead of artificial neural nets. The results from many experiments with neural nets can therefore not be generalized, particularly if the (training) data relates only to quite simple ontologies. Very simple neural net models of cognitive functions rarely provide sufficient support for any neurological plausibility - just because a particular cognitive function can be modelled using a neural net does not imply that this function is (or can be) implemented in the human brain in similar fashion (Krebs, 2005).
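The claim that traditional statistical methods can match such models is easy to illustrate. The sketch below is my own toy example (the data set is invented and the learning rate arbitrary): a single linear ‘neuron’ trained by gradient descent on a small data set ends up with essentially the same weights as an ordinary least-squares fit, i.e. the ‘network’ is doing familiar statistical curve fitting.

```python
# Illustrative sketch (not from the thesis): a single linear "neuron" trained by
# gradient descent on a tiny, invented data set converges towards the ordinary
# least-squares solution, i.e. it performs familiar statistical curve fitting.
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(-1, 1, size=(20, 2)), np.ones((20, 1))])  # two inputs + bias
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=20)           # noisy linear targets

w = np.zeros(3)                                         # network weights (incl. bias)
lr = 0.05
for _ in range(5000):                                   # batch gradient descent on MSE
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    w -= lr * grad

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)           # classical least-squares fit

print("gradient-descent weights:", np.round(w, 4))
print("least-squares weights:   ", np.round(w_ols, 4))  # near-identical values
```

The point is not that gradient descent is wrong, but that for models of this simplicity the neural net adds nothing beyond what standard regression already provides.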

1.5 Summary

Cognitive science is based on a set of core beliefs. Although many of the assumptions and methodologies are not universally accepted for various cultural, religious, philosophical and scientific reasons, there is at least some consensus among the workers within the discipline. I have outlined a set of tenets that form the framework for Cognitive Science in section 1.3. There is a great variety of models and simulations used in Cognitive Science. Some models are purely conceptual, and they may be regarded as theories themselves. The Parallel Distributed Processing (PDP) architecture presented in Rumelhart and McClelland (1986a); McClelland and Rumelhart (1986) and many of the models concerning higher cognitive (psychological) functions fall into this category. Other models are based on the physiology of neurons and concern their electro-chemistry, action potential, spike propagation and the like. However, all these various kinds of activities in Cognitive Science involve computer programs that implement mathematical models, and some may also include empirical methods, so that the models and simulations and their predictions can be validated. There is little doubt in my mind that models and simulations are helpful tools for much of the work in Cognitive Science. Moreover, given the particular nature of some of the objects under investigation, namely functions of the mind, models and simulations are often the only tool available.

Part I

Representation and Computing


Chapter 2

Models and Simulations

A brain model may actually be constructed, in physical form, as an aid to determining its logical potentialities and performance; this, however, is not an essential feature of the model-approach. The essence of a theoretical model is that it is a system with known properties, readily amenable to analysis, which is hypothesized to embody the essential features of a system with unknown or ambiguous properties - in the present case, the biological brain (Rosenblatt, 1962, 3).

In this chapter I first present a short introduction to models in general and their role in Cognitive Science in particular. Some definitions and descriptions of the kinds of models that I am concerned with in this work will be given. The second section deals with aspects of the Dretskian concept of representational systems (RS) (Dretske, 1981, 1988, 1994) and with the extent to which models can be thought of as representations of real world entities or phenomena.

2.1 The Concept of Model

Models and simulations have been described as mock-ups, analogies, simplifications, or metaphors, and sometimes the term simulation carries connotations of pretense, or even deceit (Fox Keller, 2003). However, in the world of science and technology, these connotations have largely faded, and the use of computational models and simulations has become common practice. Fox Keller (2003) remarks that during the 1940s


the valence of the term [simulation] changes decisively: now productive rather than merely deceptive, and, in particular, designating a technique for the promotion of scientific understanding (Fox Keller, 2003, 198).

Models have always played an important role in the sciences. Indeed, models have served as the basis for major shifts in theories in the physical sciences. Bohr’s model of atoms, for example, changed the way in which chemists could predict the properties of substances. Models and simulations can be in the form of some apparatus, like the model of an aeroplane in a wind tunnel to determine its aerodynamics, or the ball and stick models of molecules. The advantage of using models is that they help in the understanding of problems that would otherwise be too complex. Models, when they are utilized in this way, are simplified versions of something else. This simplification can be achieved by the omission of some elements that are not considered to be of interest, or alternatively, by emphasizing the effects of particular factors. An equally important function of models is in their use as aids to illustrate or support some aspect of a hypothesis or theory. These two distinct conceptions of models have been described as models to hold in one’s hand versus models to keep in one’s head (Hacking, 1983). These introductory remarks already illustrate that it is very difficult to define what exactly models are. Instead, we will have to be content with a rough classification as is suggested by Hacking.

The evaluation of a model in terms of its success might be easier, if a more detailed description could be given. A detailed description of the necessary properties of a class of models would then provide a kind of formalism against which a particular model could be tested. I have already mentioned that certain classes of models have emerged in the Cognitive Sciences; however, the range of modelling techniques and subjects that have been modelled is vast. The knowledge and rule based approach has been used to model language production (e.g. Winograd and Flores (1986)), or human reasoning (e.g. Newell and Simon (1963); Newell (1990)). Artificial neural nets have been employed to provide models for speech production (e.g. Sejnowski and Rosenberg (1987)), aspects of vision (e.g. Poggio (1990)), or elements of cognitive development (e.g. Shultz (2003)). There are mathematical models for probabilistic reasoning, models for the electro-chemistry of neurons and the mental rotation of images, and anatomical models of eyeballs or brains.

One of the possible approaches toward a definition of the term model is that a model should contain, at least at some level, representations of a real physical object, or a collection of physical objects. Alternatively, if the model serves as an hypothesis, it should refer to some real object or observable phenomenon, or at least make predictions about some real entity or event. In either case the model must have some testable components, either by representations or by predictions. The kind of model that is of the “to hold in one’s hand” kind (Hacking, 1983), like a model car, a plaster cast of a brain, or a working model of a steam engine, seems straightforward. However, models of physical objects can, of course, even be created of objects that do not exist at the time the model is made, or may not possibly exist ever: a 10:1 model of a pink unicorn with wheels is a case in point.
An architect’s model of a house, for example, is made long before the foundations of the real house are laid. Similarly, it is possible to create a model of something that cannot be scaled up or scaled down to the dimensions of the real world object, like a scale model of the solar system made up of oranges and golf balls. The common feature of models that are constructed to represent real objects is that they capture some essential features of the real world object, without the need to be copies of the real world object. The difficult part is to determine what the essential features are, particularly if very little is known about the real world object.

Models of theories describe ideas or conjectures about relationships or processes in the real world. Often such models are mathematical entities in the form of a set of formulas which describe some (observable) phenomena in a generalized and idealized form. But theories need not be expressed in mathematical terms, nor do models always concern observable phenomena. Bohr’s model of the atom is a good example of a model that suggested a descriptive architecture of a non-observable entity, but the model helped to explain and predict properties of real atoms¹.

¹ The most sophisticated imaging techniques can now resolve individual atoms. But even if it was possible to resolve subatomic particles, Bohr’s model cannot be verified. The model does not describe the physics, but it is/was a reasonable model to explain some of the observed behaviors.

Conceptual models (models of theories) do not rely on knowledge about a real physical system. Instead, the models and simulations are built around sets of conjectures, hypotheses, or theories, which need not necessarily have any relation to the physical world. The role of models for a theory is to illustrate, elucidate or visualize certain aspects or details in support of this theory.

[. . . ] models are intended in the present context to mean replicas or analogs that can imitate natural systems without necessarily invoking implicit formal (i.e., mathematical or logical) representations. In this sense they represent something less than a full theory, but something that can nevertheless produce an acceptable solution to the problem posed by an otherwise formally unanalyzable (Uttal, 2005, 30).

This is a quite different concept from representational models, i.e. models of something. Unlike the representational model, where some knowledge is required about what is to be modelled, conceptual models rely on a much softer set of assumptions. Conceptual models are doubly at risk of being ineffective or inappropriate, because (1) the models can be based on theories that may turn out to be flawed at some fundamental level, or even outright wrong, and (2) there is little or no hard data against which the model could be validated. The validation and verification of models concerns the ways in which a model's success or even a model's plausibility can be measured.

Models and simulations that represent a theory can, of course, be described as a theory themselves. Such models may reflect the belief system of the people involved and they are constructed, or construed, to support theories based on this particular belief system. In Cognitive Science there have been a number of models that belong to this category, like the language interface for a simulated robot in a micro block world described by Winograd (1973) or the General Problem Solver by Newell and Simon (1963). The models are essentially computer programs that superficially mimic human cognitive behavior. A common difficulty with micro world simulations (i.e. small ontologies) is that they rely largely on the principle that all possible relations within the world can be captured. There is a (small) finite number of blocks and possible positions in Winograd's model, so that the data contains everything there is to know about the world. It is possible to produce what seem, at first, startling results. Weizenbaum (1976) showed with the program ELIZA, in which very simple string manipulations were used to simulate a psychotherapist, that it is relatively easy to create programs that can be convincing and yet be totally unrealistic at the same time. Weizenbaum himself remarked about the apparent success of ELIZA that

[he] had not realized that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people. [. . . ] This reaction to ELIZA showed me more vividly than anything I had seen hitherto the enormously exaggerated attributions an even well-educated audience is capable of making, even strives to make, to a technology it does not understand (Weizenbaum, 1976, 7, italics added).
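To make concrete just how little machinery can produce this effect, a minimal sketch in the spirit of ELIZA's keyword matching follows; the keywords, canned replies, and function name are invented for illustration and are not Weizenbaum's actual script.

    import random

    # A few invented keyword rules, loosely in the spirit of Weizenbaum's DOCTOR script.
    RULES = {
        "mother": "Tell me more about your family.",
        "always": "Can you think of a specific example?",
        "sad": "I am sorry to hear that you are feeling sad.",
    }

    def respond(utterance):
        """Return the first canned reply whose keyword occurs in the input string."""
        lowered = utterance.lower()
        for keyword, reply in RULES.items():
            if keyword in lowered:
                return reply
        return random.choice(["Please go on.", "Why do you say that?"])

    print(respond("My mother is always criticising me."))

Nothing in this loop represents anything about families or feelings; the apparent understanding is contributed entirely by the human reader.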

But, the real world is much more complex (Dreyfus, 1979) and none of the micro world models have been successfully scaled up. Besides the game-like programs, like Weizenbaum’s ELIZA, there are serious attempts to model or to simulate aspects of human cognitive behavior. The attempts to build such systems rely on a set of assump- tions and the cognitive models in turn provide a further level of theory. Glymour, for example, says about efforts in AI that

‘[c]onventional’ artificial intelligence programs [. . . ] are little theo- ries. They are theories about the confirmation and testing of propo- sitions [. . . ] The more the theories look like theories of reasoning, the more the description of the program looks like a piece of phi- losophy (Glymour, 1988, 200).

This view of models as “little theories” is also applicable to many models in Cognitive Science. Some models are clearly designed to support a particular philosophical stance. For example, the work with artificial networks carried out by Elman (1990) and by Rogers and McClelland (2004)2 certainly does more than just imply that the network models are in support of the connectionist approach to a computational Philosophy of Mind. The conception of the Parallel Distributed Processing (PDP) model, as a framework for cognitive models, is an extension of the more general computational philosophy of Mind. It is claimed that, in support of this philosophy, PDP-models, which are constructed with artificial neural nets, offer the necessary architectures and techniques

2Most of this material is based on earlier work by Rumelhart and McClelland (1986a) and others in the PDP research group.

capable of explaining theories about cognitive functions. Rogers and McClelland (2004) write, for example, that

[t]he simulations discussed [. . . ] demonstrate that PDP [parallel distributed processing] models can provide a mechanistic explana- tion of both the context sensitivity by children and adults in se- mantic induction tasks, and the apparent reorganization of con- ceptual knowledge throughout childhood revealed by Carey and others (Rogers and McClelland, 2004, 292).

McClelland et al. suggest in PARALLEL DISTRIBUTED PROCESSING: EXPLORATIONS IN THE MICROSTRUCTURE OF COGNITION (1986a) that they

[. . . ] will show that the mechanisms these models employ can give rise to powerful emergent properties that begin to suggest attrac- tive alternatives to traditional accounts of various aspects of cog- nition (McClelland et al., 1986, 4).

Cognitive Science deals with different phenomena, like emotions and consciousness on one hand, and on the other observations of biological processes in brains “in terms of ad hoc mixtures of biological and compu- tational mechanism” (Pylyshyn, 1990, 31). The question of what is the real system that is to be modelled is therefore not always an easy one to answer, because often there is only limited ‘hard’ knowledge of the subject matter. The real system might have properties that cannot be modelled successfully, as some of the opponents of AI argue, for exam- ple Searle (1980), Lucas (1961) and Dreyfus (1979). Berkeley requires that a model should be “ more or less” functionally equivalent.

As a minimal condition, any proposed model of some aspect of bi- ological cognition must have the appropriate input and output be- haviors to model the relevant aspect of cognition. That is to say, the model had better be able to do more or less the same things as the biological system which it putatively models (Berkeley, 1998, 2).

I do not think this is a very useful proposition, because so far it generally is not possible to attribute particular cognitive functions to a particular biological system (Uttal, 2001, 2005). A model for some cognitive function cannot be validated against the matching biological subsystem, which might be a particular chunk of brain. The alternative is to measure the model's input and output behaviors against a cognitive system, like memory, visual perception, or language production, but it is notoriously difficult to apply metrics to these systems, as a particular subsystem may not work in the biological system in isolation. Such concerns have already been raised for cognitive models as well. Bennett and Hacker (2003) argue that scientific approaches to problems must fail, if the underlying conceptual questions have not been adequately explained and understood. They write that

[d]istinguishing conceptual questions from empirical ones is of the first importance. When a conceptual question is confused with a scientific one, it is bound to appear singularly refractory. It seems in such cases as if science should be able to discover the truth of the matter by theory and experiment - yet it permanently fails to do so. That is not surprising, since conceptual questions are no more amenable to empirical methods of investigation than problems of pure mathematics are solvable by the methods of physics. Furthermore, when empirical problems are addressed without adequate conceptual clarity, misconceived questions are bound to be raised, and misdirected research is likely to ensue (Bennett and Hacker, 2003, 2).

The two important questions that need to be answered are (1) whether the “conceptual questions” about cognitive processes are adequately described so that they are able to be captured by computational methods (i.e. models), and (2) whether such processes can actually be simulated as formal systems, which would be a necessary condition for a simulation with some computational means. The distinction between conventional (i.e. Turing machines) and distributed computing (i.e. simulated artificial neural nets) is of little concern here, because the models or simulations are implemented on conventional computers in either case3, so that the models and simulations of cognitive processes themselves are formal systems. Note that this concerns only the computational models and simulations, not the theories or phenomena that are under investigation.

3Only very few neural net simulations use dedicated hardware. Today, artificial neural nets are almost universally simulated on personal computers.

2.2 Models as Representations

In the context of models and simulations, symbols are used to represent different things at many levels. For example, a computational model of a neuron may have a large number of inputs and outputs. Such a simulated neuron, or primary element, is calculated using some sophisticated algorithms, where a number of symbols are used to represent its inputs, sums, products, outputs and the like. A typical mathematical model of a neuron, which is used as a primary element in neural networks, can be expressed as a piece of mathematics in the following form4:

\varphi(x_{1 \ldots m}) = \frac{1}{1 + e^{-c \left( \sum_{j=1}^{m} w_j x_j + b_j \right)}}

Once we discuss networks and network architectures, the neurons are sometimes represented as circles in a graph-like structure.
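The formula above, read as a computational recipe, is just a weighted sum passed through a logistic squashing function; the following minimal sketch illustrates that reading before we turn to whole networks. The function name, the treatment of the bias as a single additive term, and the example values are assumptions for illustration, not the implementation discussed in chapter 5.

    import math

    def primary_element(inputs, weights, bias, c=1.0):
        """Logistic neuron model: 1 / (1 + exp(-c * (sum_j w_j * x_j + b)))."""
        net = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-c * net))

    # Two inputs with arbitrary weights and bias; the output always lies between 0 and 1.
    print(primary_element([0.5, 1.0], weights=[0.8, -0.3], bias=0.1))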

Figure 2.1: Simple network diagram

Figure 2.1 shows the architecture of a small feed-forward network, where little detail is offered, except that inputs and outputs for each neuron are indicated by the connecting lines. We can also make some inferences about the function of a neuron within the network based on its position, and the particular role (input-, hidden-, or output-unit) may have some bearing on the implementation. However, most of the detail of the neurons'

4This particular model, its functionality and its properties are discussed in some detail in chapter 5.

workings are hidden and their implementation details are placed inside a 'black box'. In doing so, we reduce the complexity of the network, but at the same time we introduce a much more abstract representation of the initial model of the neuron. The practice of reducing complex structures and processes to simpler symbols is commonplace.

In a model for a neural network, the neurons, or primary elements, are often referred to as functional units and labeled with semantically meaningful terms. Furthermore, in models of higher cognitive functions any number of individual neurons may be represented as a single functional unit. Doing so changes the level of description from a base level of primary elements (i.e. neurons) to an even more abstract level. One of the questions that is of interest here is whether a symbol at a higher level inherits the semantics from its underlying levels. This thesis concerns the epistemic value or validity of models, so that the question should be whether models are to be evaluated in terms of their lower levels. Are models that employ artificial neural nets neurologically plausible, if the individual artificial neurons are not? Can certain characteristics of representations at lower levels render entire models and simulations invalid? I hold that particularly symbols (or representations) in models that are claimed to be somehow 'like the real world' must have some relation, or grounding, in the real world themselves in order to be explanatorily useful. Psillos (1999), for example, requires that models adequately represent the real world system. He says

[f]or, although scientists do not start off with the assumption that a particular model gives a literal description of a physical system X, there may be circumstances in which a model M of X offers an adequate representation of X. These circumstances relate to the accuracy with which a model M represents the target system (Psillos, 1999, 144).

The definition of what is “adequate” depends on the level of “accuracy” between the model and the target system. But how do we determine the degree of accuracy, and when do we know that the degree of accuracy is sufficient to be an adequate representation of a real world object? Dawson notes that a model “can imitate a phenomenon, [but] it need not resemble it” (Dawson, 2004, 6). This certainly holds in terms of a model's visual appearance. It is a characteristic of representational models that they are replications of a known physical system or parts of a physical system, albeit with a degree of simplification, change in scale, and change in material. I consider models, the representative kind in particular, as representational systems (RS) in the sense in which they have been described by Dretske, who defined such systems as

[. . . ] any system whose function it is to indicate how things stand with respect to some other object, condition, or magnitude (Dretske, 1988, 304).

Dretske’s work aims to establish a theory that grounds human knowl- edge based on information in the physical world. This work concerns models as RSs only and therefore I do not enter into the debate whether artificial intentional systems do exist, or whether they are even possi- ble. However, I will explore some of Dretske’s ideas, which I believe to be relevant and useful for models and simulations.

2.2.1 Representations

Representations could be described as placeholders, things that stand for other things. Wartofsky (1979) thinks that

[a]nything (in the strongest and most unqualified sense of ‘any- thing’) can be a representation of anything else. Therefore, there are no intrinsic or relational properties which mark one thing off as a representation of something else [. . . ] It is we who constitute something as a representation of something else. It is essential to something’s being a representation, therefore, that it be taken to be one (Wartofsky, 1979, xx).

This view is far too general to be a useful definition of what a representation is. Some things cannot be a representation of everything, if we put restrictions and conditions on representations, particularly if representations must have a causal relation to the world. Dretske differentiates in his REPRESENTATIONAL SYSTEMS three types of representational system. These three types of RS, namely symbol, sign and natural representation, or Type I, II and III respectively, have different properties and also different representational powers. This classification of RSs into types concerns not only their representational power, but also their level of intentionality and their level of semantics. I will introduce some of the properties of these types and explain how Type I and Type II representational systems (RS) relate to models and simulations.

Symbols

Commonly the term symbol is used to describe something that stands for something else. Symbols can be understood as place holders, like the figures and letters in the language of mathematics or in musical nota- tion. The association of a particular symbol with what it will stand for, is usually established arbitrarily, and many of the symbols are maintained by convention. Consequently it is quite possible that the same symbol may be used to denote quite different things depending on the context or the vernacular of a group of people. In mathematics, e is universally used to denote a particular constant5, while e denotes a unit of charge in physics6. These arbitrarily chosen ‘stand-ins’, which do not have any meaning or function other than those that are assigned by the persons who are using them, are, in Dretskian terms, RSs of Type I. Dretske explains that

[t]he elements of Type I systems have no intrinsic powers of rep- resentations - no power that is not derived from us, their creator and users. Both their function (what they, when suitably deployed, are supposed to indicate) and their power to perform that function (their success in indicating what it is their function to indicate) are derived from another source: human agents with communica- tive purpose (Dretske, 1988, 305). and that

[s]uch representational systems are, . . . , doubly conventional: we give them a job to do, and then we do it for them (Dretske, 1988, 305).

It is this definition that is also applicable to symbols on which certain kinds of computation rely7. Entire representational systems that are

5e^x is the function equal to its own derivative and e (approximately 2.71828) can be defined as
e = \lim_{n \to \infty} \left( 1 + \frac{1}{n} \right)^{n}

6The smallest measurable unit of charge e carried by the electron is about 1.60217733 × 10^{-19} C (Coulomb).

7I am hinting here that computation with Turing machines is often described as mark manipulation, which can be quite different from symbol manipulation in terms of semantic content. The rather narrow concept of symbol, i.e. an element of a Type I representational system, eliminates all the difficulties of assigning semantic content to them: they never have any semantic content. I believe that TM computation in its theoretical and intended sense is entirely about the manipulation of such semantically empty symbols. Some of the theoretical properties of TMs are discussed in the following chapter.

composed of meaningless place holders, or mere “representational instruments” (Dretske, 1988), may potentially be semantically vacuous. In many models with simple artificial neural nets individual nodes are used to represent entire functional groups of neurons and their associated processes. The problem is that an explanation of these functional groups in terms of what psychological primitive functions are implemented at some lower levels is not provided. There is also little explanation of how such psychological primitive functions might exactly be implemented. This deficiency was noted by Crick two decades ago.

Another explanation offered by modelers in defense of their units is that a single unit really stands for a group of neurons. This might be acceptable to neuroscientists if it were carefully stated how this group might be built out of more or less real neurons, but this is seldom if ever done. Consequently, it is difficult to know whether a given “unit” is plausible or not (Crick and Asanuma, 1986, 370).

In some cases (e.g. ‘Discovering Lexical Classes from Word Order’ in Elman (1990), or ‘Moral Knowledge’ in Churchland (1998)) the models are essentially semantically empty, because the relationships and processes at the implementation level are not described at all. Instead, the artificial neural network is discussed only in broad conceptual terms using localized representations (labels). I will refer to these examples in some detail in the discussion of models with artificial neural nets later in this work (chapter 8).

Natural signs

In the Dretskian theory, RSs of Type II are grounded in the real world in that their power to indicate is linked to causal events in the real world. Linking meaning to causal events has also been suggested by Russell who defines “causal lines” as

[. . . ] a temporal series of events so related that, given some of them, something can be inferred about others whatever may be happening elsewhere. A causal line may always be regarded as the persistence of something - a person, a table, a photon, or what not.

Throughout a given causal line, there may be consistency of qual- ity, consistency of structure, or gradual change in either, but no sudden change of any considerable magnitude. I should consider the process from speaker to listener in broadcasting one causal line: here the beginning and the end are similar in quality as well as structure, but the intermediate links - sound waves, electromag- netic waves, and physiological processes - have only a resemblance of structure to each other and to the initial and final terms of the series (Russell, 1948, 477).

The meaning (i.e. our interpretation of the semantic content) is not bound to the real world in the same way. The power to indicate some- thing about the real world has to be recognized by the observer of the sign. Scientific instruments, thermometers, or voltmeters, indicate temperature, electric potential and similar properties and phenomena. They function by exploiting (detecting) a known physical phenomenon. Instruments provide the observer with a representation of the state or relation of that phenomenon through a series of often complex transfor- mations. For example, it is a property of the real world that the volume of a quantity of metal varies with temperature. A suitable arrangement of levers and a pointer on a dial can be used to exploit the relationship between temperature and volume to create an instrument that will in- dicate the temperature with some accuracy. There is a distinction be- tween what the instrument indicates and what the observer believes that indication means. The pointer on the dial will only be meaningful to someone who knows that this instrument is indeed a thermometer. The instrument will indicate the temperature quite independently from the observer. To be an indicator of some particular property of the real world, the causal relationships must be maintained and the observer must attach the right kind of interpretation in terms of the indicator’s meaning. Dretske explains that

[i]f a fuel gauge is broken (stuck, say, at “half full”), it never indicates anything about the gasoline in the tank. Even if the tank is half full, and even if the driver, unaware of the broken gauge, comes to believe (correctly, as it turns out) that the tank is half full, the reading is not a sign - does not mean or indicate - that the tank is half full. Broken clocks are never right, not even twice a day, if being right requires them to indicate the correct time of day (Dretske, 1988, 308).

The relationship between the real world and representations of the world is of great importance in the context of models and simulations. The value of a model as a representation of the real world and any insights into the working of the world by investigating properties of the model depends on the kind of representations the model employs. A meaningless representation, in the sense that it can represent arbitrar- ily anything, can and will render the entire model meaningless, unless there is a syntactically correct procedure (probably a causal chain) to tie these representations down. A model, or representational system, that is to function as a representation of the real world ought not to contain any Type I elements. In addition, representations of Type II, by definition, must not have gaps or uncertainties in the causal chain linking them to the real world. A thermometer is only a thermometer if it has the power to indicate the temperature. Some apparatus may well indicate the temperature provided certain other conditions are given. An example will illustrate this point. Imagine a partially inflated bal- loon that is connected to a pressure gauge. The volume of air and the air pressure inside the balloon will change with the ambient temperature and the ambient air pressure. This setup will function as a thermome- ter, if the ambient air pressure is kept constant. However, if the tem- perature is kept constant and the ambient pressure is allowed to vary, then the instrument will indicate pressure. This simple instrument has the power to indicate either temperature or pressure, that is, the setup can function as a thermometer or a barometer. A scientific, or a merely usable, instrument would have to be engineered so that the relation- ships between pressure, temperature and volume are exploited. But the power to indicate one or the other must be constrained through appro- priate means to guarantee an indication of either only pressure or only temperature. Type II representational systems contain natural signs that are objec- tively connected to the real world and their power to indicate something about that world is exploited by using their natural meaning (Dretske, 1988), because

[i]n systems of Type II, natural signs take the place of symbols as the representational elements. A sign is given the job of doing what it (suitably deployed) can already do (Dretske, 1988, 310).

It is important to note that there is no intentionality associated with this type of representation. However, the potential intentionality (the meaningful interpretation) is constrained by the causal links to the real world. The variation in volume of metal, for example, may be due to the change of temperature, but this variation in volume cannot be reason- ably attributed to the colour of the paper it is wrapped in. While these signs carry information about some state or phenomenon in the real world, signs do not know anything about the world. Dretske says that

[. . . ] simple mechanical instruments (voltmeters, television re- ceivers, thermostats) do not qualify for semantic structure (third level of intentionality) with respect to the information they trans- mit about the source. [. . . ] This information (about the voltages across the leads) is nested in other, more proximal structures (e.g. the amount of current flowing through the instrument, the amount of flux generated around the internal windings, the amount of torque exerted on the mobile armature) which the pointer “reads”. The pointer carries information about the voltage in virtue of car- rying accurate and reliable information about these more proxi- mal events [. . . ] This, basically, is why such instruments are in- capable of holding beliefs concerning the events about which they carry information, incapable of occupying the higher-level inten- tional states8 that we (the users of the instrument) can come to occupy by picking up the information that they convey (Dretske, 1981, 186).

The quality of any model will be reflected in the degree of understand- ing of the real world physical object in question. There are of course several potential pitfalls, because any misunderstandings, misconcep- tions or misinterpretations of the physical world will be introduced into the model. A failure to recognize the true importance of what seems to be a minor detail can render a model ineffective or inappropriate. The many models for machines purporting to demonstrate the possibility of perpetual motion are examples of that. For the representative model, there is always the danger that some- thing is misrepresented. A model aeroplane may not fly well, because it is too heavy, underpowered, or the center of gravity is not where it ought to be. All of these difficulties can be realistically eliminated, or can be

8In KNOWLEDGE AND THE FLOW OF INFORMATION, Dretske uses a model describing the flow of information from a core (source) surrounded by several shells (levels of beliefs or intentionality).

dealt with in principle at least, because they can be resolved by investigating the properties of the model and comparing these properties with the “real thing”.

Natural Systems

Type III systems are not applicable to models and simulations, because it is a key attribute of models to be representations of something else. Moreover, the reason for having models is that they are representations of something else for us to interpret. Dretske defines Type III as the

ones which have their own intrinsic indicator functions, functions that derive from the way the indicators are developed and used by the system of which they are part. In contrast with systems of Type I and Type II, these functions are not assigned. They do not depend on the way others may use or regard the indicator elements. (Dretske, 1988, 313).

We can glean from this definition that Type III RSs require a level of in- tentionality. However, intentionality cannot be attributed to any model without negating its ‘model’-status.

2.2.2 Simulations

A definition of the term simulation for the purpose of this work seems, on the surface, straightforward.

Simulation is the process of designing a model of a real system and conducting experiments with this model for the purpose either of understanding the behavior of the system or of evaluating various strategies (within the limits imposed by a criterion or set of crite- ria) for the operation of the system (Shannon and Weaver, 1975).

Shannon's definition is put forward from an engineer's position in that a simulation is a form of representation of the real thing. More specifically, it is the behavior or the dynamics of some physical thing that is captured by a simulation. Yet this is probably not the kind of simulation that will always be of value to Cognitive Science. Models of brains made of jelly are not likely to tell us much about cognitive processes, because such models tend to remain in the realm of gross anatomy. Any simulations with such a model are equally likely to be of little interest to Psychologists. If the aim of Cognitive Science is to explain brain function beyond anatomy, then models and simulations must contain aspects of cognitive functions. However, simulations involving anatomical brain models made of jelly are certainly of great interest to people researching head injuries caused by road accidents. If the scale and detail of such brain models are changed, they might become of interest to surgeons, physiologists or material scientists. Ultimately, the model will be exhausted once we enter the world of quantum physics. It seems that simulations (and models) that can be of interest to Cognitive Science are those that contain some aspects of a theory about ‘how things might work’. Part of a simulation's role is to explain or illustrate not only the physical aspects of the real world, but also part of the dynamics of a theory.

2.3 Methodologies

Models and simulations serve in many ways as experiments. Traditionally, experiments were seen as a means to gain insights into the workings of nature. This Baconian view is reflected in Hacking's more modern “don't just peer, interfere” (Hacking, 1983, 189) stance to experimentation and scientific practice. Models, as substitutes for the real world, play a different role when they are used as the object of experimentation. Models have been used for such purposes. Galileo's measurements of spheres running down an inclined plane were conducted on a model of the real world. In this particular model, the dynamics of the rolling spheres were slowed down sufficiently so that the dynamics could be measured and analyzed. This translation of the real world dynamics into “user friendly” dynamics is a prime example of a model serving to alter the scale of some parameters. The physics of spheres running down an inclined plane is well understood.

Models can be used to attack a particular problem, either bottom-up or top-down. The first approach, bottom-up, starts with functionally low level entities, i.e. artificial neuron-like structures, and tries to simulate more complex higher-level functions by combining many of these primitive elements. The top-down methodology starts with higher-level cognitive functions and the aim is to break these into smaller and smaller tasks, so that the lowest level is again the domain of primitive or primary elements.

Either approach to modelling or simulation has its own particular sets of difficulties in terms of complexity. In the bottom-up approach, the neuron models are greatly simplified. This simplification may be an ad- vantage in that more complex models would become over-complex at higher levels. O’Reilly and Munakata (2000) claim that “simple models often just work better”. Moreover, simplicity is often connected with at- tributes like beauty and elegance, especially in mathematical formulae from physics and other sciences. Mathematics is the language to de- scribe physical processes, and systems of differential equations describe dynamic systems well. The behavior of a dynamic system, whether it is the actual physical system or a mathematical model, is not only gov- erned by its dynamics, but also by a set of initial conditions. A neuron’s behavior may depend on the current temperature, the concentration of certain chemicals in the cell, the amount of oxygen in the blood and many other initial conditions. These conditions could be determined by actual measurement, by assumption or may even be ignored altogether. Noble claims that

[a] complete description of the sequence of events of an adequate set of differential equations is not necessarily a sufficient explana- tion of the behaviour of the system, even though it may be a very accurate description (Noble, 1990, 107). and

[a]s the number of constants and conditions increases, and particularly if the sensitivity of the system in response to changes in these parameters increases, so there will be a shift in explanatory power from the theory embodied in the differential equations to whatever theories are required to explain the initial and boundary conditions (Noble, 1990, 107).

Even a modest number of neurons in a simulated neural network increases the complexity of a system significantly. The processes and workings of artificial neural networks become difficult to describe, if such a description is needed to provide explanations at a higher level.

Models and their relation to the real world form RSs that can become difficult to evaluate when the number of symbols, signs and natural representations becomes large. In functionalism, albeit in a naïve form, the outward behavior of some object is deemed to be the only point of interest. The inner workings of the object can be ignored and the thing itself is replaced by a black box. Moreover, we can replace one black box with another, if these are functionally equivalent.

[t]he problem is that a single mathematical expression can sym- bolically represent a true infinity of real systems that could ex- hibit the common behavior or processes described by mathematics (Uttal, 2005, 31).

Consider the simple piece of mathematics

y = ax + b,

which expresses a linear relationship between x and y. But, as neither x nor y is explained in terms of what they represent in this particular instance, they remain semantically empty. Being “semantically empty” does not preclude that symbols can refer to real or abstract objects. I do think that it is necessary to make a further distinction between type and token in the context of computation as symbol manipulation, as suggested, e.g., by Mellor (1989). Propositions about the world have no causal powers, unlike the tokens, which are causal in effecting a computer's behavior. A proposition P ‘the world is flat’ has no causal powers. A boolean variable P may have a value assigned to it so that P is false if the world is flat. This variable, i.e. its assigned value, is testable in a program and the execution of the program may at one point depend on this value: if P = true then print “Buy a new physics text book”.

The change in the behavior of a computer is effected by the pattern of electric charges in a series of tiny capacitors, i.e. dynamic random access memory. The states of the electronic computer change, under the control of a clocking mechanism, according to causal chains of physical interactions. In a memory read operation the tiny charges of the capacitors at a certain specified address are amplified and transferred onto wires and thereby transferred to other parts of the computer. The electric potential present on these wires can be used to switch electronic gates, which in turn can raise or lower the potential of other wires. In this way, patterns in the memory can be transformed into different patterns. These physical patterns are tokens and can be interpreted as symbols, in that they can represent something.
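Returning to the boolean example above, the distinction can be made concrete in a few lines of code; the variable name and the printed string simply transcribe the illustration in the text, and the surrounding program is otherwise invented.

    # The proposition 'the world is flat' has no causal powers.
    # The token below - a stored value in memory - does: it steers the execution.
    P = True  # following the text's convention, P is false if the world is flat

    if P:
        print("Buy a new physics text book")

It is the stored bit pattern, not the proposition it is taken to express, that causally determines which branch the machine executes.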

2.4 Summary

In order to evaluate models and simulations in terms of their epistemic value, it is possible and helpful to view models and simulations as representational systems (RS). In the same way that generic RSs can be classified in terms of their elements, models and simulations may or may not be causally linked, or grounded, in the real world. Causal links in some models are easier to establish than in others. Physical, i.e. mechanical, models of some physical entity may in fact only differ in scale. Their workings may be identical to the modelled object, so that they are an RS of Type II, i.e. the causal links are clearly understood and the indicators are interpreted correctly. Conceptual models have the added problem that they may be fundamentally flawed because the core assumptions on which the model is based might be mere conjecture. A further difficulty for the field of Cognitive Science is that the decomposition of the brain into physical and functional units is not possible in relevant ways (Bechtel, 2005), and that many models are mere mathematical tools to reproduce or predict data of some theory. It is nevertheless possible to have a reasonable model of such a theory, provided the meaning of the symbols and the mathematical constructs is explained and made plausible. I have shown that a definition for the term model cannot be formalized and that there is no set of metrics available by which to judge the success of a model or a simulation. Consequently each and every model has to be evaluated, verified and validated on its merits against any claims made in relation to the model's utility, the underlying theories, and possible explanations.

Chapter 3

Computational Foundations

If we take the properties of the universal machine in combination with the fact that machine processes and rule of thumb processes are synonymous we may say that the universal machine is one which, when supplied with the appropriate instructions, can be made to do any rule of thumb process (Turing, 1947, 383).

Computation in some form is at the very core of Cognitive Science. The computational theory of mind forms the conceptual backbone of the discipline. This theory encompasses various aspects, and several flavours of this theory are debated. Whether one takes the stance that cognition is computation, or subscribes to the idea that human cognition is computable, does not detract from the fact that cognition and computation are closely related. The idea of a computational mind is not new and theories can be traced well back into history1. The first “contemporary” reference to a computational theory came from Thomas Hobbes (1651). Computation is of course the primary tool for the disciplines of AI and computational Neuroscience, and computational processes are the building blocks for the models and simulations under review. AI and computational Neuroscience have somewhat different goals, but both disciplines are concerned with models and simulations of neurons and neural structures. AI mainly deals with models in the ‘model for’ sense, because many of the activities involve concepts and

1Early theories on bodies, minds and souls date back to Greek philosophy.

entities that only superficially resemble real world phenomena. In the context of AI, learning or knowledge representation, for example, have little in common with how humans learn or represent acquired knowledge. Many of the activities in computational Neuroscience, in contrast, are concerned with computational models and simulations of the representative type, i.e. with models of some neurological aspect. In Cognitive Science, some of the neural models and their simulated behaviors are more closely related to the actual biology and physiology than others. The Hodgkin and Huxley model of the electro-physiological processes in a neuron (Hodgkin and Huxley, 1952), which I describe on page 107, is concerned with the biology of a cell. In contrast, the ‘psychological’ model of learning and simulating ‘moral virtues’ by Churchland (1998), see page 185, does not address the biology or physiology of neurons. Models in Cognitive Science can concern a wide range of subjects from Neuroscience and Psychology to Linguistics or Philosophy.

Many claims and refutations in the context of models of brain or mind functions concern the computability of brain or mind functions. In order to ascertain whether the underlying computational theories can influence the models and the conclusions that can be drawn from experiments with these models, it is necessary to examine some of the computational principles that underpin computational models. This investigation is for now not concerned with brains or minds, but with the mathematical and computational principles and techniques used in the context of models and simulations.

In this chapter it is argued that modern computing machines are supersets of Turing Machines (TM). While a modern electronic computer can emulate a TM, the capabilities are not necessarily constrained by the perceived limitations of a TM. Moreover, I argue that modern computers are without doubt mark manipulation devices. Digital computing machines can possibly be considered to be symbol manipulation devices, however, this does not imply that they are necessarily formal systems2. These questions about the very nature of computing machines are important in the context of modelling and simulation, as some of the

2Formal systems are usually symbols, sets of axioms and sets of rules that govern how symbols can be manipulated. Mathematics is a prime example of a collection of formal systems.

arguments against the possibility of an AI and the validity of models are based on theoretical principles of computing. Searle's infamous Chinese room thought experiment3, as an objection to the possibility of an AI, is essentially based on the claim that simulation cannot lead to intentionality (Searle, 1980). Some of the past and contemporary objections to an AI on philosophical grounds may in fact be based on misguided assumptions about the nature of computing machines4.

In the following section, I will first examine the role of symbols and signs and their relations to computing principles. Symbol manipulation5 is the essence of a formal description of computation in terms of TMs. I will discuss TMs briefly in order to establish their computational powers and limitations. From there, I argue that modern electronic computing machines are more powerful than a TM, because real machines can interact with the physical world, while TMs are restricted to algorithmic computation. Consequently, algorithmic computation, which is essentially a representational system comprising only Type I symbols, cannot have any form of grounded semantics. This is important in order to establish possible causal links between models and the real world. The type of representational system that is realized by a particular model will determine much of the model's epistemological status. The second section deals with parallel computing and simulated parallel computing. Nearly all artificial neural nets are simulated on Personal Computers or machines with fundamentally identical architecture, i.e. serial machines, so that a model with a simulated neural net is itself also simulated at the level of implementation. In the last section, I will show that other types of systems can qualify as computers, although they lack some, or all, criteria of formal computation in terms of the TM.

3While it still seems unclear what exactly the problem with Searle's argument is, many have tried to discredit his claim passionately.
4I have argued elsewhere (Krebs, 2002) that Lucas's objection against Mechanism (Lucas, 1961, 1970) is in fact not applicable to many real computing machines. Lucas's argument applies only to formal systems, which, so he claims, are subject to Gödel's incompleteness theorem (Gödel, 1931). Machines with the appropriate architecture and input/output systems are not closed systems. They can be interactive systems, which are as a consequence not formal systems either.
5The term symbol is used here to denote Dretskian symbols of Type I, which have no semantics in themselves - they are merely marks. Dretske (1988) classified symbols that have any semantic content assigned to them arbitrarily as Type I.

3.1 Symbol Systems

Current computing machinery is nearly universally identical in its architecture and principles of operation. This architecture, which is based on the fetch-execute cycle, is often regarded as a physical realisation of a TM6. However, there are certain assumptions about TMs that need to be considered, because they do impact on how models and simulations are implemented. Computation with TMs, and also with their physical instantiations, is a form of symbol manipulation, or at least mark manipulation. In the following discussion of TMs and the Church-Turing Thesis (CTT) the term symbol is used to denote representational systems of the Dretskian Type I, which are void of any intrinsic meaning, or, in other words, they are semantically empty.
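Since the fetch-execute cycle mentioned above carries some of the argument in what follows, a minimal sketch may help; the three-instruction machine, its opcodes, and its registers are invented for illustration and do not correspond to any real architecture.

    def run(memory):
        """Toy fetch-decode-execute loop; each instruction is an (opcode, operand) pair."""
        acc, pc = 0, 0                    # accumulator and program counter
        while True:
            opcode, operand = memory[pc]  # fetch the next instruction
            pc += 1
            if opcode == "LOAD":          # decode and execute
                acc = operand
            elif opcode == "ADD":
                acc += operand
            elif opcode == "HALT":
                return acc

    # Load 2, add 3, halt: the machine mechanically yields 5.
    print(run([("LOAD", 2), ("ADD", 3), ("HALT", None)]))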

3.1.1 Formal Computation

In the search for a solution to Hilbert's question about the decidability of mathematics, Turing devised a theoretical computing device with strongly mechanistic, or machine-like, principles of operation. With the help of this device, now commonly referred to as a Turing Machine, Turing was able to show that there are statements, or, to be more specific, numbers that cannot be computed by means of an algorithm or effective procedure. Using Cantor's diagonalization argument and Gödel's (1931) result on self-referential statements (Gödel's Incompleteness Theorem), Turing demonstrated in his 1936 paper ‘ON COMPUTABLE NUMBERS, WITH AN APPLICATION TO THE ENTSCHEIDUNGSPROBLEM’ that algorithmically unsolvable problems do in fact exist7. Using this insight it was possible to show that there cannot be a definite method to check whether an algorithmic solution for a particular problem exists. It could also be shown that there is no logical calculating machine that can determine whether another will ever come to a halt - i.e. complete a

6This assumption is generally true. A modern computing machine with sufficient memory to compute the task at hand should be considered an actual realisation of a TM. It is not necessary for a TM to have an infinite memory, i.e. tape, it is only necessary to have a sufficient amount (Feynman, 1996). Sufficient memory is a finite amount of memory that allows the machine to execute the program.
7An example of a non-computable problem is Post's Correspondence Problem (PCP), which is frequently described in the literature, see for example Sipser (1997).

calculation - or will in fact continue forever (the Halting Problem8). Copeland points out that the term ‘Halting Problem’ was introduced by Davis in 1952, and that it is therefore strictly not true to attribute the Halting Problem to Turing (Copeland, 2004, 40). The overall result for mathematics and computational theory is that there is no effective procedure to determine whether there is an algorithmic procedure for some problem. There is a list of postulates and theorems dealing with computability, which are closely related to this insight, but it has been shown since that the Church thesis, Turing thesis and the CTT state the same core result.
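The diagonal construction behind the Halting Problem can be sketched in a few lines of code; the function names are hypothetical by necessity, since the point of the argument is precisely that no such halts() decider can be written.

    def halts(program, input_data):
        """Hypothetical decider: True iff program(input_data) eventually halts.
        Turing's result is that no total, always-correct version of this can exist."""
        raise NotImplementedError

    def contrary(program):
        # Do the opposite of whatever the supposed decider predicts about
        # running 'program' on its own source.
        if halts(program, program):
            while True:   # loop forever
                pass
        else:
            return        # halt immediately

    # Applying contrary to itself is contradictory either way:
    # if halts(contrary, contrary) were True, contrary(contrary) would loop forever;
    # if it were False, contrary(contrary) would halt. Hence halts() cannot exist.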

The Church-Turing Thesis

The CTT concerns effective methods m or procedures, which must have the following properties:

1. m must consist of a finite number of exact and finite instructions.

2. m must produce a result in a finite number of steps.

3. m can be carried out, in practice or in principle, by a human being with pen and paper.

4. m demands no insight or ingenuity on the part of the human being.

The most important properties for effective procedures, or algorithms, that emerge from these criteria are those of being deterministic and fi- nite. Finite means that the algorithm has to come to an end at some time, which implies that the algorithm itself is specified in finite terms. This implies also, that the input dataset that is associated with the problem at hand is also finite. The term finite is therefore applicable to the algorithm’s specifications and to its execution. The procedure must be completely defined at the point when the execution begins and the execution must end at some time and, because of the deterministic na- ture of algorithms, must yield the same result with every execution. It is reasonable to assume in terms of any algorithm’s utility and efficiency that the final part of the execution is to produce some result. It would be

8see Davis (1958)

rather inefficient if the procedure were to continue for some time without contributing to the solution. Consequently, the output produced can only be interpreted as the result when the execution of the procedure is complete and has terminated. Algorithms are about ‘marks on paper’, i.e. syntactic manipulation of symbols without any semantic content. These symbols, or

. . . elements of Type I systems have no intrinsic powers of repre- sentations - no power that is not derived from us, their creator and users (Dretske, 1988, 305).

If we accept that marks on paper are elements of a Type I representational system, then it follows that TMs are also representational systems of Type I, because they comprise only freely interpretable marks. Kearns points out that the meaning we give to such symbols is of no consequence as far as the TM is concerned. In fact, if the marks or strings of marks, for example words, have some meaning in the real world, then the

meaning [of these marks or words] must be irrelevant to carrying out the procedure (Kearns, 1997, 274).

These descriptions of an effective procedure so far have only been intuitive. However, the TM gives us a formal treatment of these concepts, namely that of an algorithm. Turing provided a machine-like description of the underlying principles of mapping the execution of an algorithm onto a set of machine states and state transitions9. The TM provides a clear description of how the concept of an algorithm can be

9A formal definition of the Turing machine is offered by Sipser (1997, 128) as a 7-tuple (Q, Σ, Γ, δ, q_0, q_accept, q_reject), where Q, Σ, Γ are all finite sets and
1. Q is the set of states,
2. Σ is the input alphabet not containing the blank symbol ⊔,
3. Γ is the tape alphabet, where ⊔ ∈ Γ and Σ ⊆ Γ,
4. δ : Q × Γ → Q × Γ × {L, R} is the transition function,
5. q_0 ∈ Q is the start state,
6. q_accept ∈ Q is the accept state,
7. q_reject ∈ Q is the reject state, where q_accept ≠ q_reject.
See also Wells (1996), who offers a definition of a TM as a collection of 4 sets.

translated into an automated formal system. It is important to remember that the TM is still a purely theoretical entity, or, a conceptual machine, although it is usually described in terms of a read-write head, tape and sets of rules and so on.
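A minimal simulation written directly from this 7-tuple may make the abstraction concrete; the particular machine below (which accepts strings consisting only of 1s), its state names, and its alphabet are invented for illustration.

    def run_tm(delta, tape, start="q0", accept="qa", reject="qr", blank="_"):
        """Simulate a single-tape Turing machine whose transition function is a dict."""
        cells, state, head = list(tape), start, 0
        while state not in (accept, reject):
            if head == len(cells):
                cells.append(blank)           # provide blank tape on demand
            symbol = cells[head]
            state, write, move = delta[(state, symbol)]
            cells[head] = write
            head += 1 if move == "R" else -1  # move the read-write head
        return state == accept

    # delta implements a machine that accepts strings consisting only of 1s.
    delta = {
        ("q0", "1"): ("q0", "1", "R"),
        ("q0", "0"): ("qr", "0", "R"),
        ("q0", "_"): ("qa", "_", "R"),
    }
    print(run_tm(delta, "111"))  # True
    print(run_tm(delta, "101"))  # False

Even this toy machine illustrates why the TM is usually described in terms of a head, a tape and a rule table, while remaining a purely formal object.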

Universal Computation

The TM is closely linked to the CTT, which asserts that any problems that can be solved by ‘rule of thumb’ can also be solved mechanically (Turing, 1936). Both the CTT and the properties and capabilities of TMs are directly connected to computational theories, but also to the Philos- ophy of Mind. I will concentrate only on a few aspects which seem to influence the arguments about computation in relation to models. Com- puter scientists, as well as philosophers, are able to maintain quite “un- conventional” views about the computational power of TMs. Their views are then extended into conjectures and theories about artificial and real intelligence and also to theories about cognition or consciousness. For example, Dennett says that

Turing had proven - and this is probably his greatest contribution - that his Universal Turing Machine can compute any function that any computer, with any architecture, can compute (Dennett, 1991, 215). and Putnam believes that

[i]t should be remarked that Turing machines are able in principle to do anything that any computing machine (of whichever kind) can do (Putnam, 1975, 366).

A more explicit account of this kind of misinterpretation can be found in Kurzweil (1990), where he writes that

[a]s for [the TMs], Turing was able to show that this extremely simple machine can compute anything that any machine can compute, no matter how complex. If a problem cannot be solved by a Turing machine, then it cannot be solved by any machine (and according to the Church-Turing thesis, not by a human being either). [. . . ] The fact that there are problems that cannot be solved by this particular machine may not seem particularly startling until one considers the other conclusion of Turing's paper, namely, that the Turing machine can model any machine (Kurzweil, 1990, 112).

There is a considerable amount of confusion about TMs and their limits. Yet Turing is quite clear in regard to the relationship between logical computing machines (LCM)10 and algorithms or effective procedures.

It is found in practice that LCMs can do anything that could be de- scribed as ‘rule of thumb’ or ‘purely mechanical’. This is sufficiently well established that it is agreed amongst logicians that ‘calcula- ble by means of an LCM’ is the correct accurate rendering of such phrases (Turing, 1948, 111). and

It is possible to produce the effect of a computing machine by writ- ing down a set of rules of procedure and asking a man to carry them out. Such a combination of a man with written instructions will be called a ‘Paper Machine’. A man provided with paper, pencil, and rubber, and subject to strict discipline, is in effect a universal machine (Turing, 1948, 112).

Copeland and Sylvan offer this as a definition of the Church-Turing Thesis proper:

Any procedure that can be carried out by an idealized human clerk working mechanically with paper and pencil can also be carried out by a Turing machine (Copeland and Sylvan, 1999).

This definition of the CTT captures the idea very well: An idealized machine can also follow any procedure that can be followed mechanically by a human. “Any procedure” can only mean the application of an algorithm for computation with numbers. After all, the 1936 paper was specifically written in response to the problems from Hilbert's program. Also, Turing referred in his paper to a disciplined human, which clearly indicates to me that the CTT concerns the possibility of automating ‘mindless’ human computations. Turing used the term disciplined to indicate that the human must adhere strictly to the procedure; there is no room for deviation. It is this mechanical, or automatic, principle of algorithms that matters.

The interpretation and application of these properties have profound consequences for many of the ‘in principle’ objections against AI. For example, actual machines that interact with the world cannot be considered

10The LCM is now commonly referred to as the ‘Turing Machine’.

TM equivalent. It has been, in my mind implausibly, argued that inter- active computing systems could somehow be TM equivalent, or, could be modelled by TMs. Wells (1998, 1996), for example, argues that Turing

was not, it is crucial to note, modelling just the processes going on inside the mind of the [human] computer, but the interactive pro- cesses involving both the mind and the external medium, the pa- per, which was used for intermediate workings and the final result. Turing suggested that the mind of the (human) computer could be modelled by ‘a machine which is capable of a finite number of con- ditions’ (Wells, 1998, 272).

If we examine part of the text in Turing (1948) it is easy to see that Turing made no such suggestion.

The types of machines [a-machines or the TM] that we have con- sidered so far are mainly ones that are allowed to continue in their own way for indefinite periods without interference from the out- side. The universal machines were an exception to this, in that from time to time one might change the description of the machine which is being imitated (Turing, 1948, 115, my italics).

In COMPUTING MACHINERY AND INTELLIGENCE we find a statement that expresses Turing’s ideas on this issue even more clearly. He says that

[. . . these digital] machines are intended to carry out any operations which could be done by a human computer. The human computer is supposed to be following fixed rules; he has no authority to deviate from them in any detail. We may suppose that these rules are supplied in a book, which is altered whenever he is put on to a new job (Turing, 1950, 155).

Here we can clearly see that for Turing an algorithm is strictly closed, i.e. “fixed rules”, without any deviation allowed once the execution has commenced. This principle applies to the Turing machine11, and it applies to digital machines. We can change the procedure, or algorithm, in the Universal TM and in the digital machine before they begin the execution. There is, however, no freedom to change the algorithm while it is executing12. This property also prevents a TM from passing the Turing

11This is the type of machine Turing referred to as a logical computing machine (LCM), and also as an a-machine, in Turing (1936, 1948). 12It is possible in a digital machine for a running program to modify itself during execution. It is considered bad software engineering practice to write programs that way. It is certainly against the rules for TMs as formal definitions of an algorithm.

Test (Turing, 1950). A Turing machine cannot react to the environment or to changes of the environment, which I would consider necessary to exhibit ‘intelligent behaviour’ in the broadest sense. The model that Turing suggests is that of a human executing an algorithm according to fixed rules, and not, as Wells suggests, a model of the human mind. In fact, Turing excluded the ‘mind’ from the mechanical procedure, when he says that “a man provided with paper, pencil, and rubber and subject to strict discipline, is in effect a universal [Turing] machine” (Turing, 1948, 117, my italics). The essence of the algorithm is its mechanical procedure and the strict adherence to it, not the processes used for the execution. Unlike the processes, the algorithms are essentially independent of their implementation. It is at the core of the CTT that some algorithm can be executed by a human, a digital computer, or some other kind of machine. The concept of algorithm can be extended beyond the realm of arithmetic, and the notion of algorithmic, or machine-like, execution of some procedure is universally applicable. This extension does not apply to the properties of the procedure itself - the conceptual properties of an algorithm must not be violated. Haugeland describes the principle of algorithmic properties in terms of playing games and says that

[t]he crucial issue concerns what it takes “to play the game” - that is, to make and recognize moves allowed by the rules. Roughly, we want to insist that formal games be playable by finite players: no infinite or magical powers are required (Haugeland, 1985, 63).
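The notion of a strictly fixed rule table can be made concrete with a small sketch in Python (my own illustration, not drawn from any of the authors discussed here). The rule table is supplied in full before execution begins and is never altered while the machine runs; this particular table merely appends a ‘1’ to a unary numeral.

    # A minimal sketch of an a-machine with a fixed rule table (illustrative only).
    # The table is fixed before execution starts and cannot change while running.
    def run_turing_machine(tape, rules, state='start'):
        head = 0
        while state != 'halt':
            symbol = tape.get(head, '_')                 # '_' stands for a blank cell
            write, move, state = rules[(state, symbol)]  # strict, mechanical rule lookup
            tape[head] = write
            head += 1 if move == 'R' else -1
        return tape

    # Rule table: (state, read symbol) -> (symbol to write, head move, next state)
    rules = {('start', '1'): ('1', 'R', 'start'),   # skip over the existing '1's
             ('start', '_'): ('1', 'R', 'halt')}    # append one more '1', then halt

    tape = {0: '1', 1: '1', 2: '1'}                  # unary 'three'
    print(run_turing_machine(tape, rules))           # unary 'four'

No infinite or magical powers are required of the player of this game: only the ability to look up a rule and follow it.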

Copeland and Sylvan (1999), Sloman (1996), Eberbach et al. (2004), and Krebs (2002), among others, have rejected the idea of TMs as agents on the grounds that Turing’s assumptions and specification of the TM are misinterpreted. The properties of the TM are succinctly put by Wegner, who states that a “Turing machine is a closed, non-interactive system, shutting out the external world while it computes” (Wegner, 1997, 82). This closed world includes all constituent parts of the TM including the memory, i.e. the tape. When Turing equates a TM to “a man provided with paper, pencil and rubber”, the “paper tape” of the TM is merely replaced with sheets of paper to be used as the memory for input, output and intermediate results. Supporting procedures and hardware should not be considered part of the execution of the algorithm. For example, any

‘controlling sub-routines’ that are invoked by the human to control his hands, while they are writing on the paper, are not part of the algorithm, but they are a part of its implementation on a physical system. Analogously, the operating system routines, which are called during the execution of some high-level program, are not part of the program at that level of description. Although the formal properties of algorithms do not allow for any ‘agent-like’ interaction between the executing procedure and the environment, there are some further implications, if we were to accept such a view. The first obvious difficulty is that an agent’s entire life would have to be pre-coded as an input string for some TM. It might be hypothetically conceivable to have all inputs of an agent’s life pre-coded for execution, but this would demand the adoption of a Laplacian determinism. Alternatively, it has been argued that the algorithm itself may be able to adapt, i.e. change or learn, to new situations during the execution. The concept of an ‘ever-changing algorithm’ is defended, for example, by Penrose. He claims that human beings may follow some algorithm, which also includes all the subroutines that will allow modifications, and that

[t]his algorithm includes all the ‘learning procedures’ that have been laid down at the beginning. If the only ways in which the algorithm can change is by such pre-assigned algorithmic means, then we do not, strictly speaking, have a changing algorithm at all but just a single algorithm (Penrose, 1993, 21).

This position is difficult to maintain. The ‘coming into existence’ of such an incredibly complex algorithm would be impossible to explain. Furthermore, any modifications or improvements cannot be passed on to the next generation in a Lamarckian fashion, so that all modifications to the algorithms would be through mutations. I suggest that the chances of an extremely complex algorithm surviving a mutation, let alone actually improving through mutations, are slim indeed. There are other approaches than the single algorithm theory. Some propose the ‘TM as agent’ hypothesis, which essentially claims that the power of the algorithm is found in the interaction with the real world. The output of the algorithm is then much more a reaction to its inputs than the pre-coded pattern of the algorithm itself. Hoffmann, for example, suggests that the output behavior for some given input might change over the agent’s lifetime, and says that

[l]earning fits into that framework as follows: To each part of the input string there is a corresponding part in the output string which can be viewed as the response to the input supplied. Thus, the response to the same pattern in the input may change or adapt over time (Hoffmann, 1998, 171, my italics).

While Hoffmann thinks that any intelligent behavior exhibited by some agent is the output that corresponds to input, or “a mapping from some finite input to some finite output”, which could well be within the boundaries of an algorithmic computational framework, he essentially suggests that an algorithm could be changed during its execution. That is, the agent that operates under the control of some procedure is able to change this procedure according to external factors. It must be stressed that this kind of operation, i.e. changing the procedure while it is executed, does violate the rule of Turing machines being closed systems. In order not to violate this rule, it is necessary for the inputs of a TM to be pre-determined and for any potential intelligent ‘behaviour’ to be encoded in the machine as well. For more realistic attempts at an AI this is not possible due to the complexity of the problem (see Hoffmann (1998) on the complexity of AI). The argument from the application of phenomenology (Dreyfus, 1979) is also relevant in this context. For an AI or a model that is TM-based to behave like a rational agent in the real world, it would be necessary to encode all of the world’s properties and rules13. This is an impossible task, as Dreyfus points out:

When we are home in the world, the meaningful objects embedded in their context of references among which we live are not a model stored in our mind or brain; they are the world itself (Dreyfus, 1979, 266).

Due to the non-deterministic nature of the world, we have to accept that the ‘input string’ for a TM that represents the (changing) world is not computable. Actual computing with electronic machines does allow for such interaction with a changing world, but this type of computation is not to be considered algorithmic.

13Most examples in AI deal with so-called micro-worlds, which offer at least some possibility to encode their properties and behaviours (e.g. SHRDLU by Winograd (1973)).

Wegner (1997) describes what he terms interactive identity machines (IIM) to illustrate the power of interactive machines. All of the IIM’s input and output is directed to or taken from the environment. From a programmer’s point of view the implementation of such a machine is trivial. A high-level program would look something like

while (true) do { echo input } end_while.
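A runnable rendering of this echo loop, assuming line-based text input and output (my sketch, not Wegner’s own formulation), could be written in Python as:

    import sys

    # Interactive identity machine: every line received from the environment
    # is immediately written back to the environment, indefinitely.
    for line in sys.stdin:
        sys.stdout.write(line)
        sys.stdout.flush()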

This program simply copies all its input to the output, but therein lies its power - this simple IIM would pass the Turing Test. In fact, we use the kind of computing machine that passes the Turing test every day. Current mobile phone sets are computational devices. The analog input (i.e. speech) is converted into digital signals, modified and transmitted by a quite powerful little electronic computer. Analogously, the digital information received by the set is modified and transformed into a signal we can hear. All these processes, together with the management of phone numbers, the built-in games, and all the other features and functions, are under the control of a program. The architecture of mobile phone sets is not unlike that of a personal computer. When we have a ‘conversation’ with such a computer, we are not even aware that it is able to answer all our questions in a very ‘human-like’ manner. Moreover, these computing machines can even express human feelings like happiness, anger, or sadness very well. Here, the interaction theme is, of course, taken to the extreme in that all of the program’s ‘intelligence’ is external to the machine. Digital mobile phones are, of course, not to be classed as intelligent machines. The question, then, is how much interaction with the real world we are willing to accept. For example, a medical diagnostic system could be enhanced by some quite feasible addition to resolve any cases where its existing knowledge base might be insufficient. In cases where the diagnosis is doubtful, the computer automatically sends some electronic mail to a few human medical experts and asks for advice. The returned mail items are scanned and analyzed, the information is used to produce the diagnosis, and the new information is added to the knowledge base. This kind of system would have to be classed as an intelligent system14.

14In the same way earlier expert systems like MYCIN have been considered intelligent, or at least they had been considered as prime examples of expert systems.

The mobile phone diagnostic system is merely a further development of this system, doing away with the inefficient knowledge base altogether and talking to the doctor directly. An IIM could possibly be construed to be a Type II representational system, if the program is connected to the real world via sensors and transducers15. The computer operations are in principle traceable and as a consequence the mapping of the inputs to its outputs, which in this particular case consists of a verbatim copy, is a continuation of the chain of causal links. In this ‘pathological’ example, the computational effort is trivial, so trivial that it can be removed altogether - the observer of the program’s output might as well be observing the program’s input.

3.1.2 Limitations of Turing Machines

The concepts of TMs are clearly defined and any misinterpretations of their ‘principles of operations’ and behaviour can lead to invalid or insufficient arguments about computation, AI or minds. In this section I will address a point of detail, namely that TMs are formal systems. This property of TMs has been used, albeit unsuccessfully, to refute the possibility of an AI and, more relevantly for this thesis, to refute mechanism (e.g. Lucas (1961)). I will outline one argument against this position16. Lucas argues against Mechanism and against the possibility of an artificial intelligence in his paper MINDS, MACHINES AND GÖDEL (Lucas, 1961). He believes that Gödel’s incompleteness theorem applies to all formal systems and therefore to all levels of computing machinery as well, whereas minds themselves are not constrained by Gödel’s theorem. He writes that

Gödel’s theorem must apply to cybernetical machines, because it is of the essence of being a machine, that it should be a concrete instantiation of a formal system. It follows that given any machine which is consistent and capable of doing simple arithmetic, there is a formula which it is incapable of producing as being true - i.e., the formula is unprovable-in-the-system - but which we can see to be true. It follows that no machine can be a complete or

15I refer here to a running program, or a program in the process of being executed; the program would include the necessary hardware, part of which will be used to implement the interface to the world. 16Some of the material in this section has been adapted from Krebs (2002).

adequate model of the mind, that minds are essentially different from machines (Lucas, 1961, 44).

The application of Gödel’s theorem to all formal systems is generally accepted, although it is essentially a mathematical issue. Lucas provides a one-line précis of Gödel’s theorem, which nevertheless states Gödel’s core result:

Gödel’s theorem states that in a consistent system which is strong enough to produce simple arithmetic there are formulae which cannot be proved-in-the-system, but which we can see to be true (Lucas, 1961, 43).

It is necessary to present Lucas’s objection in the context of his understanding of what a machine is. He emphasizes that a machine’s behaviour is completely determined by the way it is made and the incoming ‘stimuli’: there is no possibility of its acting on its own (Lucas, 1961). Lucas makes essentially two claims in his paper. The strong claim is that Mechanism is false, which entails that no form of computing machinery or any machinery can ever be used to successfully implement something equivalent to a human mind. The weaker claim is that it is impossible to implement a human mind or to successfully model a human mind using a Turing machine. I do not believe that Lucas’s argument against Mechanism holds, because a number of mechanical devices like analog computers and non-synchronous parallel computers are not equivalent to Turing machines (Sloman, 1996). Other processes, like non-synchronous parallel processes, continuous (analog) processes and chemical processes, are all involved in the human brain (Sloman, 1996, 181). Benacerraf agrees that Lucas’s claim against mechanism is insufficient, if mechanisms entail “non-Turing machine” computing:

It is an open question whether certain things which do not satisfy Turing’s specifications might also count as machines (for the purpose of Mechanism). If so, then to prove that it is impossible to “explain the Mind” as a Turing machine (whatever that might involve) would not suffice to establish Lucas’s thesis - which is that it is impossible “to explain the mind as a machine” (Benacerraf, 1967, 13).

However, Lucas’s argument rests on the assumption that the mind can always apply some method to a Turing machine to show that this particular Turing machine contains an unprovable statement. He claims that a mind can ‘see’ that such statements are true, while the machine, because of Gödel’s theorem, cannot prove them to be true. The important point here is to clearly determine what kind of statements can be made within formal systems and what type of statements can be made about formal systems. Lucas’s mind is outside the formal system and observes that it is possible to introduce a statement into this system, which cannot be proved with the rules and axioms of that system. Lucas claims that, in essence, Gödel sentences are contradictions akin to “a partial analogue of the Liar paradox” (Lucas, 1970, 129), which a machine could not resolve. Whiteley sums up the challenge posed by Lucas:

the trap in which the machine is caught is set by devising a formula which says of itself that it cannot be proved in the system governing the intellectual operations of the machine. This has the result that the machine cannot prove the formula without self-contradiction (Whiteley, 1962, 61).

However, Whiteley recognizes that Lucas has reduced Gödel’s theorem to an inability of the machine to resolve self-contradicting statements. I think that this is an over-simplification and misinterpretation of Gödel’s theorem, but it is Lucas’s interpretation after all and Whiteley uses this interpretation for his own counter-argument. Whiteley claims that Lucas cannot himself resolve the statement ‘This formula cannot be consistently asserted by Lucas’ without contradicting himself. Lucas has been placed into the same position into which Lucas wants to put the machine: Lucas is inside his own formal system and the rest of us are on the outside. Anyone reading this sentence can see the truth of this statement, while Lucas is unable to do so. Lucas can be trapped in the same way that Lucas wants to trap the machines. Whiteley points out that human minds can deal with contradictions of this kind easily in that we can make a statement about the contradiction. He suggests that Lucas can

escape from the trap by stating what the formula states without using the formula: e.g. by saying ‘I cannot consistently assert Whiteley’s formula’ (Whiteley, 1962, 61).

It is this type of action that Lucas wants to deny to the machine. Whiteley claims that a machine can be programmed, in principle, to deal with contradictions of this sort. The machine can use all the axioms and rules in an attempt to prove a formula and indicate a positive result, or the machine may indicate that it cannot prove the formula because of some non-resolvable condition, a circularity, which might cause the machine to deadlock. The machine can make some statement about the formula such as “I cannot prove the formula: ‘The machine cannot consistently assert this formula’”. Lucas’s arguments are not convincing and have been refuted for a variety of reasons. While Turing machines remain in principle still open to a Gödelian attack, Lucas fails to recognise that many forms of computation with physical systems are not necessarily formal systems, i.e. they are not ‘machines within the act’17.

3.1.3 Digital Computing

There seems to be a close relationship between the TM and real computing machines if we look at and compare their general architectures and their modes of operation. However, I have already shown that there are some real and important differences in their theoretical capabilities. Turing’s later work is very much concerned with real machines and their logical structure. It has been argued in the literature that TMs cannot be physically realized, because TMs need to have infinite memory, or need to have a tape of infinite length to stay with Turing’s description of his machine. I believe that there is no such requirement. The amount of memory must ‘merely’ be unbounded, that is, there must always be enough memory available for the task at hand to be completed. A physical realization of a TM is straightforward for some specific reasonable task18. Provided we accept for the moment that there may be constraints in execution time and physical memory for some tasks, modern machinery is quite powerful for real computing tasks, such as models and simulations. Turing’s principal architecture of his LCM is echoed in the kinds of machines that were developed in the late 1940s. I have mentioned previously

17Lucas uses this term in THE FREEDOM OF THE WILL where he excludes all machines that do not fit his theory by saying that “any system which was not floored by the Gödel question was eo ipso not a Turing machine, i.e. not a computer within the meaning of the act” (Lucas, 1970, 138). 18Reasonable task means that the task is in fact solvable in finite time and within finite memory. Such a task may nevertheless run for days on some supercomputer, but it would still be reasonable. There is nothing unreasonable about calculating π or e to a few million decimal places.

that the principle of algorithmic computation had been applied by Turing himself to digital machines in INTELLIGENT MACHINERY (Turing, 1948). It has been argued that computing with modern electronic digital machines allows us to go beyond the limits of the TM (see for example Sloman (1996), Copeland and Sylvan (1999) and Wegner (1997)). The development of the modern digital computer is essentially a series of technological revolutions in terms of increases in speed and the large-scale integration of vast numbers of logic gates and transistors on a single silicon chip. A modern central processing unit does not do anything different from the early designs by Turing (1948) or the devices from the designers and engineers working with von Neumann. Modern machines get their perceived power merely by ever increasing processing speed and faster access to large (and cheap) mass storage. Nothing new has been introduced since the late 1940s as far as the principle of their operation is concerned. The stored program architecture reflects the structure of the Universal Turing machine (UTM) only superficially. The principal idea of the stored program architecture, which works on the principle of the ‘fetch-execute cycle’, is that of a central processor with direct access to a memory in which data and program are stored. Program execution follows the fetch-execute-store cycle. This cycle forms the core of sequential processing, whereby each instruction is loaded from memory into the processor, executed - this may involve a further access to memory to read some data - and finally the result of the operation - if any - is stored into memory. Then the next instruction of the program is loaded, executed, and so on. The step-by-step execution of programs is also the modern computer’s greatest disadvantage. The fetch-execute-store cycle is the bottleneck that determines and limits the computing capacity of a computer. Modern machines make use of multiple processors and highly complex pipelining technologies, which allow the next instruction to be pre-fetched from memory while the previous instruction is still being executed. Such measures are some of the many technologies which help to overcome this basic structural constraint of the stored program machine. For large computational tasks it may be necessary to have a considerable amount of hardware resources, e.g. memory, available. The memory of a computer is, of course, finite. Random access memory (RAM) is relatively expensive and particularly this type of fast storage is generally limited. This is, however, not a limitation of what these machines can do, because RAM can be extended by storing blocks of memory, which are currently not accessed, onto cheap and large mass-storage, i.e. magnetic disk.
These portions of memory are then restored, when needed, somewhere else in memory, in locations that can themselves be freed by storing their contents on a hard disk. This use of virtual RAM is commonplace in modern machines. Therefore, machines never run out of memory for practical applications, although there is a greater time penalty if the storing and restoring of RAM onto hard disk, i.e. swapping, becomes excessive when the amount of RAM in the particular machine is small. Remember that the memory is sufficiently large if the amount of memory, which may include virtual RAM, is large enough for the computing task at hand. The requirement that memory is large enough, rather than infinite, is also true for the theoretical TM. Its structure can implement any other digital computing machine. The universality of digital machines had been recognized by Turing, who explains that the

special property of digital computers, that can mimic any discrete state machine, is described by saying that they are universal machines. [. . . ] [I]t is unnecessary to design various new machines to do various computing processes. They all can be done with one digital computer, suitably programmed for each case (Turing, 1950, 157).

It is significant that Turing treats different computational problems as separate cases. Early computers operated differently from modern machines in that each computation, or job in contemporary jargon, was prepared, loaded onto the machine, and executed one by one. Each job contained the program (algorithm) and the input data and there was no interaction between machine and operator while the program ran. Putnam remarks that

[t]he modern digital computer is a realization of the idea of a universal Turing machine in a particular effective form - effective in terms of size, cost, speed, and so on (Putnam, 1988, 270).

Almost any digital computer is able to instantiate a TM, but these machines are not limited to instantiating TMs. A theory of computation using TMs to define what is computable is too narrow to even include the capabilities of modern electronic computers based on the stored program machine. Turing said in COMPUTING MACHINERY AND INTELLIGENCE that

[t]he idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer (Turing, 1950, 436).

This statement was certainly true for the kind of ‘number-crunching tasks’ that were envisaged to be solved by electronic computers in the 1950s. The programs were executed very much in the TM style: program and data had to be loaded into the memory of the machine, the program was run, and the results of the calculations remained in memory. On earlier machines, the programs were loaded by changing the hardware of the machine, and only the data was memory resident. Nowadays nearly all electronic computing machines are based on the stored program architecture and with the aid of fast processors other architectures can be successfully emulated.
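As a toy illustration of the stored program architecture and its fetch-execute-store cycle described earlier in this section, the following sketch simulates a machine whose program and data share one memory. The instruction set and its encoding are invented for the example and do not correspond to any actual processor.

    # Toy stored program machine: data and program share one memory, and the
    # processor repeats fetch -> execute -> store until told to halt.
    memory = {
        0: ('LOAD', 100),    # fetch operand from address 100 into the accumulator
        1: ('ADD', 101),     # add the contents of address 101
        2: ('STORE', 102),   # store the accumulator at address 102
        3: ('HALT', None),
        100: 2, 101: 3, 102: 0,
    }

    acc, pc = 0, 0
    while True:
        op, addr = memory[pc]          # fetch the next instruction
        pc += 1
        if op == 'LOAD':
            acc = memory[addr]         # execute: read data from memory
        elif op == 'ADD':
            acc += memory[addr]
        elif op == 'STORE':
            memory[addr] = acc         # store the result back into memory
        elif op == 'HALT':
            break

    print(memory[102])                 # prints 5

Every step goes through the same narrow channel between processor and memory, which is precisely the structural bottleneck discussed above.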

3.1.4 Parallel Computing

If we consider the lack of success in building intelligent agents and the difficulties in reconciling serial (TM-like) computing with the distributed architecture of human brains, then it seems reasonable to ask whether there might be an entirely different form of computation. Haugeland said in 1985 that

[t]oo often people unwittingly assume that all computers are essentially like BASIC, FORTRAN, or even Turing machines. Once it is appreciated that equally powerful (i.e., universal) architectures can be deeply and dramatically different, then suddenly the sky’s [sic] the limit. [. . . ] The important point, however is this: the mind could have a computational architecture all its own (Haugeland, 1985, 163).

Moreover, the mind could not only have a different computational architecture, but could have a different way of computing altogether. Parallel computing is the foundation and the philosophical paradigm of connectionism. The acceptance of this paradigm seems to be ultimately grounded in the belief that overall computing power can be greatly increased by inter-connecting a large number of individual processors. A second important factor for parallel computing in Cognitive Science is the similarity with observable physical structures in the brain. It has been shown on a theoretical level that parallel computing is TM equivalent: Siegelmann and Sontag (1991) and Neto et al. (1997) have established that artificial neural nets can be constructed that are mathematically equivalent to TMs. Consequently, this form of parallel computing is computationally as powerful as TMs. Some forms of parallel computation can be performed on Turing machines on a practical level (see for example Penrose (1990) and McDermott (2001), amongst others). However, some functions that can be computed by a network of processors (parallel computation) cannot be computed by a single processor, because the network cannot, under certain conditions, be replaced by an equivalent single TM (Sloman (1996) and Copeland and Sylvan (1999)). There is little doubt in my mind that parallel computing is fundamentally different from symbol processing in the TM sense. This becomes evident once symbols are considered in the context of neural nets. In neural nets representations (i.e. symbols with some attached semantic content) are distributed in the structure of the network. Not everyone agrees with the idea that connectionist and conventional computing systems are fundamentally different. McDermott believes that

[a]t first glance [a neural net] seems like a very different model to computation from the computers we are used to, with their CPU’s, instruction streams, and formal languages. These differences have caused many philosophers to view neural nets as a fundamentally different paradigm for computation from the standard digital one. [. . . ] The whole thing is based on exaggeration of superficial differences. Anything that can be computed by a neural net can also be computed by a digital computer (McDermott, 2001, 40, italics added).

I agree that whatever can be computed by neural nets can be computed by a conventional serial machine, provided that the function, which a neural net has learned from the training dataset, is actually known, so that the function can be implemented on the serial machine. But the function may not be known. The ability to implement some function from the training dataset alone is one of the great advantages of neural net learning algorithms. Entire research programs concern the development of methodologies to reveal these functions, and knowledge extraction (KE) has emerged as a discipline within AI.
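When the function realized by a net is known, its evaluation does indeed reduce to ordinary serial arithmetic. The following sketch (my illustration, not McDermott’s) evaluates a small two-layer net with hand-picked weights implementing XOR, one multiply-and-add at a time on a conventional machine:

    # A two-layer net with hand-picked weights computing XOR, evaluated by an
    # ordinary serial loop: one multiply-accumulate at a time, layer by layer.
    def step(x):                       # threshold activation
        return 1 if x >= 0 else 0

    def forward(inputs, layers):
        activations = inputs
        for weights, biases in layers:                       # layer by layer
            activations = [step(sum(w * a for w, a in zip(row, activations)) + b)
                           for row, b in zip(weights, biases)]
        return activations

    xor_net = [([[1, 1], [1, 1]], [-0.5, -1.5]),             # hidden units: OR and AND
               ([[1, -1]], [-0.5])]                          # output: OR and not AND

    for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(pattern, forward(list(pattern), xor_net))      # prints [0], [1], [1], [0]

The difficulty, as noted above, is not in performing such serial arithmetic but in knowing what function a trained net has actually learned.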

McDermott’s comment shows that the levels of description are important, and that the level of description should be stated explicitly as part of the argument. I believe that neural nets have little or no relationship to “CPUs, instruction streams, and formal languages” at the conceptual level of neurons (or primary elements) and networks. McDermott compares different levels with each other: the neural nets belong at the functional and conceptual level, whereas the CPUs, instruction streams and formal languages belong to a lower (implementation) level. In fact, formal languages have little or no relationship to CPUs either, because they belong to different levels themselves19. McDermott’s arguments and claims are immersed in the confusion about the levels of description. Penrose claims that “there is no difference in principle between a parallel and a serial computer” and that “[b]oth are in effect Turing machines” (Penrose, 1990, 514). He argues that the implementation of parallel processors and running more than one TM together “does not in principle gain anything”. Penrose believes that at best one might find a possible increase in performance of such a system, given that the conditions are right. This seems to indicate that in principle it makes no difference in which way human cognition is simulated, because inference engines and artificial neural nets are merely different programs for physical realizations approximating true TMs. Following from my arguments on page 51 that interactive computer programs are not TMs, the question whether artificial neural nets and TMs are identical is, at least in some cases, essentially irrelevant. In the context of Cognitive Science, artificial neural nets are not used to gain increases in performance. In fact, the massive amount of computing that is necessary for the simulation of the neural nets at the implementation level makes simulated neural nets rather inefficient computing devices. However, the kind of parallel computing associated with artificial neural nets has little in common with the kind of parallel computing Penrose refers to.

19McDermott refers here to formal programming languages. Imperative languages like Java or C++, functional languages like Haskell or Miranda, and languages for logic programming (e.g. Prolog) are not executable directly by CPUs. Some languages (C++) are compiled into machine-executable code, while others are transformed into intermediate code for an interpreter or a virtual machine (Java, Haskell). The interpreters and virtual machines themselves are large programs that are written in a high-level language and compiled into machine code.

Connectionism, or parallel distributed processing (PDP), as it is described in various contributions to Rumelhart and McClelland (1986a) and McClelland and Rumelhart (1986), has to be considered a cognitive architecture rather than a computational paradigm in the sense of a multi-processor parallel computing device. Penrose’s comments are valid for these multi-processor machines, because much of what is done with these kinds of super-computers could be done in principle on a single CPU as well20. PDP needs to be viewed at a conceptual level and it is concerned with cognitive functions involving representations with semantic content, rather than computing with freely interpretable symbols. Artificial neural nets in Cognitive Science have a dual purpose, because as an explanatory tool they belong conceptually in the realm of PDP, while their actual computing is still closely aligned with traditional computing. This is not only true at the lower level, where conventional (imperative) programming languages are used to write programs implementing the artificial neural nets, but also at the conceptual level. For example, the apparent need for symbols in the context of Turing-type computing was accommodated in neural nets (Smolensky’s theory of sub-symbols (Smolensky, 1990, 1997)). In many discussions about artificial neural nets in the context of cognitive modelling, the inputs are labeled with terms other than ‘0’s or ‘1’s. Because we can assign these labels at will, there is always the temptation of introducing ‘wishful’ terminology not only for labels, but also for methodological terms. Feldman and Ballard note that

[i]t is obvious that the choice of technical language that is used for expressing hypotheses has a profound influence on the form in which theories are formulated and experiments [and models are] undertaken (Feldman and Ballard, 1982, 206).

20This is not universally true. When multiple processors or machines are coupled so that they are controlled by a common clocking mechanism, the entire system may possibly be replaceable by a single processor. If the individual processors run independently and interrupts are used to exchange data among the processors, then the system of processors is no longer deterministic if the processors have to share (i.e. compete for) resources like I/O and memory. Such a system cannot be considered as a TM, because determinism is a requirement for TMs. Moreover, it should be noted that real computers are not deterministic in any case, because the initial conditions at start-up are never identical. There are many factors that influence the execution of programs: the amount of data stored on a hard disk and the degree of fragmentation changes the time needed to retrieve data, and the system clock (changing date and time every few microseconds) ensures that the system is never in the same state.

Only the activation levels of the input nodes and the connection strengths in the network matter for an artificial neural net to produce the appropriate output for the function it is trained to approximate. Nevertheless, labels for the nodes and terminology for other parts of the networks are often introduced whenever models are constructed. This is due to the requirement that nodes with specific meaning or localized representations cannot be avoided, at least at some level. Aside from the problem of tying meaning to neurons or any entity in a parallel distributed paradigm, there is still the question of which computational category artificial neural networks belong to. At some level of description the neuron is the functional primitive or primary unit of computation. Even if we view each simulated neuron as a distinct black box with defined inputs and some output, they are not independent from each other within artificial neural nets. Apart from the obvious reason that they are interconnected and that the inputs of neurons in one layer are the outputs of neurons in the previous layer21, there are other dependencies as well. Parallel processing with artificial neural nets is generally simulated on a single computer22 that is based on the stored program architecture. Under these circumstances, parallel processing is simulated as a series of chunks of operations. Within each chunk of operations all calculations for the nodes are performed layer by layer. The set of operations that has been completed for each node, or all nodes in the layer, or even all nodes in simple feed-forward networks and recurrent networks, is then assumed to have happened within a single time click. This high-level clocking mechanism for these networks therefore delivers parallelism only at discrete steps. In between the steps, while all nodes are brought up to date, parallelism is suspended (see also section 6.1.5 on page 136 on the implementation of artificial neural nets).
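A minimal sketch of this chunked, layer-by-layer style of simulation is given below (my own illustration; the weights are arbitrary values chosen for the example). Within one ‘time click’ the new activations of a small recurrent net are computed serially, one node at a time, from the previous step’s frozen values, and only then are they committed together:

    import math

    # Synchronous ("single time click") update of a small recurrent net simulated
    # on a serial machine.
    weights = [[0.0, 0.8, -0.5],      # weights[i][j]: connection from node j to node i
               [0.6, 0.0, 0.4],
               [-0.3, 0.7, 0.0]]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    state = [0.1, 0.9, 0.5]
    for t in range(3):                                     # three discrete time clicks
        new_state = [sigmoid(sum(w * s for w, s in zip(row, state)))
                     for row in weights]                   # serial, node by node
        state = new_state                                  # all nodes committed at once
        print(t, [round(s, 3) for s in state])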

21In recurrent networks, the inputs of neurons might be tied to outputs of neurons in the same level. In some special cases a neuron might have a feedback mechanism to itself (see chapters 5 and 6). 22A personal computer (PC) is more than adequate to deal with artificial neural nets of considerable size. I have simulated networks with several hundred PEs and large training datasets. Even training such a large network with a few thousand vectors and hundreds of epochs takes only a few minutes.

3.2 Other Computing Systems

There are computing systems and methods that do not fit into the relatively narrow algorithmic or into the distributed parallel processing paradigm. The concept of an algorithm has so far served as one of the determining factors in the definition of computation. Shagrir (1999) describes a computational system based on an example by Pitowsky (1990) that belongs in both the algorithmic and the analog category of computation. The argument is based on a simple system that consists of a water tank with two internal walls, creating three equal-sized compartments (k, m, n). Each compartment contains water at a different temperature. When we remove the dividing walls and mix the water, the system calculates the function q = (k + m + n)/3 in a single and analog fashion. If we take the same initial setup, but remove only one barrier to calculate r = (k + m)/2 and then remove the second barrier to calculate s = (2r + n)/3, the end result is the same, but the calculation is now algorithmic, because there is a clear sequence of steps to first calculate r and then s. Shagrir asks whether the choice between removing the barriers all at once and removing them one by one should thus decide whether we call the process “computation” or not. The example certainly confirms that the notion of an algorithm is insufficient as a defining property of computation. The idea of an algorithm, albeit in a weaker, less formal form, can be applied to some other computing activities as well. Performing calculations using a slide rule is usually a sequence of instructions like: set 3.1 on the C-scale, move the B-scale so that π is aligned with 5.6 on the A-scale, . . . , read the result on the D-scale. However, this set of rules is somewhat different from Turing’s rules of thumb for solving problems in number theory. Algorithms for working with a slide rule are much more like operating instructions, in that they describe the task of calculating at a meta level - at the level of operating a machine. For example, the task of multiplying two numbers on a slide rule is described as a process of placing two lengths, where each length represents a number23, end to end. The operating instructions are describing an addition, whereas the calculation that is actually performed is that of multiplication.

23The main scales of a slide rule represent the logarithms of real numbers.

The slide rule is a typical example that shows that in many analog systems some physical property, here length, is used as a representation of something else - numbers in this case. This type of computation often employs physical systems whose dynamics can be described with some known function. The interpretation of the system dynamics is used to calculate a value for that function. Stufflebeam (1998) suggests that his cat Sophie’s behavior, when dropped from two feet, “satisfies the distance function D(t) = gt²/2”. Some principal issues concerning the interpretation of systems as computation are discussed in section 3.3.
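Numerically, the two routes in Shagrir’s water-tank example above agree, as a trivial check shows (the temperatures k, m and n are arbitrary values chosen for the illustration):

    # Shagrir's water-tank example: mixing all three compartments at once versus
    # removing the barriers one at a time gives the same equilibrium temperature.
    k, m, n = 10.0, 20.0, 60.0          # arbitrary compartment temperatures

    q = (k + m + n) / 3                 # all barriers removed at once (one analog step)

    r = (k + m) / 2                     # first barrier removed
    s = (2 * r + n) / 3                 # second barrier removed (two parts at r, one at n)

    print(q, s)                         # both print 30.0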

3.2.1 Analog Computing

Calculations have been performed with the aid of various machines for many centuries. The slide rule by Oughtred, Lord Kelvin’s tide predictors, integrators and planimeters are examples of such machines. Tide predictors were one of the most successful analog computing devices ever built. The first, designed by Lord Kelvin, was completed in the 1860s. The principle of such machines is simple and some excellent engineering allowed for very accurate predictions of tidal water-levels. The height of a tide is determined by a series of constituents. The masses and positions of the moon, sun and other heavenly bodies and some earthly phenomena contribute to the total gravitational pull for a particular location. All of these phenomena are subject to cyclic variations of different period and amplitude. The final tidal height can be expressed as y = A cos(u) + B cos(v) + C cos(w) + . . .

These terms have to be calculated for every location and in intervals of a few minutes. These equations can be solved numerically, but evaluation without modern computing machines would be too time-consuming. Until the emergence of fast (and cheap) electronic computers, this task was entirely done by mechanical tide predictors that were based on Kelvin’s design. The last of the machines had been decommissioned in the mid-1960s, after having been in service since 1911. Kelvin’s machine used at most 12 cosine terms, but a later model in the US used 37 and a German machine built in 1938 used 64 terms. Williams (1997) notes that the machine, constructed and operated by the U.S. Coast and Geodetic

Survey, could calculate the height of the tides to the nearest 0.1 ft for each minute of the year for a location in a few minutes. The magnitude of the calculations involved in establishing a tidal forecast can be gauged by the fact that in a modern computer the cosine sub-routine is called about 20 million times to predict the tides for a single year for a single location (Williams, 1997). These machines were certainly calculators in that they produced a series of outputs according to a computationally complex (and computationally expensive) function by analog and mechanical means. This kind of computation is in effect no different to computation with the aid of a slide rule. However, these mechanical analog machines can certainly not be termed computers, if we consider computation in the context of Computer Science (i.e. TMs). For most practical applications we can approximate analog systems to any desired degree on digital systems using numerical methods. It is probably an advantage for engineering models and simulations that biological entities themselves can only represent and process analog data down to certain levels of accuracy, before details are lost due to biological noise. Computational models and simulations are not affected by biological noise, and analog processes can be approximated on a digital machine more accurately than the processes in the human brain itself. Note that this may not be possible in real time for a very large number of processes due to the limitations in computing performance.
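The harmonic superposition that these machines evaluated mechanically can be summed directly on a digital machine. In the following sketch the amplitudes, angular speeds and phases of the constituents are invented placeholders rather than real tidal constants:

    import math

    # Tidal height as a finite sum of cosine constituents, in the spirit of
    # y = A cos(u) + B cos(v) + C cos(w) + ... ; here each argument is taken
    # to be (speed * t + phase), which is an assumption made for this sketch.
    constituents = [                      # (amplitude in ft, speed in rad/h, phase in rad)
        (1.20, 0.5059, 0.3),
        (0.45, 0.4964, 1.1),
        (0.30, 1.0027, 2.0),
    ]

    def tide_height(t_hours):
        return sum(a * math.cos(w * t_hours + p) for a, w, p in constituents)

    for t in range(0, 24, 6):             # a few sample times during one day
        print(t, round(tide_height(t), 2))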

3.2.2 Neural Computing

Can computing with neurons be realized on digital machines, or is what neurons do something utterly different from what I have described so far? Churchland is convinced that neuro-computing is utterly different from ‘classical’ computing, or Turing computing, and he believes that

[i]t has become increasingly clear that the brain is organized along computational lines radically different from those employed in conventional digital computers. The brain [. . . ] is a massively parallel processor, and it performs computational tasks of the classical kind at issue only very slowly and comparatively badly. To speak loosely, it does not appear to be designed to perform the tasks the classical view assigns to it (Churchland, 1993, 24).

If we adopt a functional view of neurons and take only their behavior, i.e. output, for a given stimulus, i.e. input, into consideration, then the answer is likely to be yes. The computational models of neurons will be discussed in some detail in chapter 5, but it should already be quite obvious that the kinds of computations that are performed by mathematical or computational models of neurons in AI and Computational Neuroscience are both different from what neurons do. The behaviour of real neurons is best described in terms of analog quantities, and the parameters in simulated neurons (implemented on computers) can be approximated to any desired level. In section 5.3 on page 118 the relationship of real, i.e. biological, neurons to simulated neurons will be explored further.

3.2.3 Other Forms of Computing

For completeness, it must be mentioned that a computational system can be built on chemical principles, quantum mechanics (Penrose, 1990) and gene technology. The discussion of whether quantum mechanics can or cannot be excluded is ongoing (see e.g. Chalmers (1995) or Tegmark (2000)), but it seems unlikely that quantum processes are part of any brain’s ‘computational arsenal’. So far, no tangible experimental results have been obtained. The mechanism by which quantum events could trigger events at the molecular level (or at the cellular level, for that matter), without being swamped and lost in biological noise, is largely speculative.

3.2.4 Implementation Independence

There are many ways in which computations can be implemented using physical systems. Some particular forms of computation can only be implemented on a specific system. Analog computing, for example, cannot be implemented on digital machines24. Simple arithmetic can be done using an abacus, a slide rule, mechanical and electronic calculators, and with hi-tech supercomputers. If the user adheres to the appropriate algorithms and methods that are suitable for the device in question, the results of these computations will be consistently accurate.

24Analog systems can be approximated to any degree desired, but some degree of granularity cannot be avoided.

The implementation of continuous functions on discrete state machines is not straightforward, nor is the implementation of a problem from discrete mathematics on an analog computer.

3.3 Computation as Interpretation

It is now apparent that none of the computational paradigms can offer a universal description that could serve as a definition of what computation is. The TM offers a definition of an algorithm25, but computation goes beyond merely following some effective procedure using a Turing machine. The examples of other computing systems (section 3.2 and the discussion of the Tide Predictors on page 68) illustrate that the term ‘computing’ is indeed not specific. If all the other modes of computation are considered, then a definition of computation would have to be informal and rather vague - a problem that has been identified by Putnam and Searle. Computation can be considered as the application of some method in order to achieve the required result or goal. This method must be effective (suitable to achieve the goal) and the goal must be interpreted in terms of the initial conditions. The application of various methods facilitates the mapping of some input to some output. The important point is that the methods aiding the transformation of the input(s) into the output(s) can take many forms. When computation concerns numbers and when computation is restricted to numbers, the application of arithmetical methods alone may be sufficient to achieve the mapping from input to output. However, if computation gets more complex (or difficult) then we may be able to employ many and quite diverse methods. If, for example, it is known that a real world phenomenon can be described in terms of a mathematical equation, then we can use the real world, or a suitable model of the real world, as an effective method to calculate. Tide predictors are probably a good example to illustrate this point. The mapping of inputs (the position and masses of the moon and the planets, amongst others) and output of the real world

25Hoffmann notes that “it is not expected that a more powerful formalisation of the notion of algorithm will ever be developed” (Hoffmann, 1998, 163), as the intuitive notion of an algorithm is also explicated by the Church-Turing Thesis.

phenomenon (water-levels at some future time at some specific location) can be described in terms of mathematical equations with sufficient accuracy. The difficulty is in the evaluation of the formulas, which in this particular case involves the evaluation of the cosine function over and over again. Even with the use of tables, or slide rules, the prediction of tides remained impractical26. Lord Kelvin’s design was essentially a mechanical model of the physical world, and his design produced the results by simulating the dynamics in the real world. Modern computers have sufficient speed so that the same calculations can be done through numerical methods, in even less time than it used to take with tide predictors. The use of models as computational devices makes a definition of computation very difficult. In order for a physical system to be considered as a computational device, it must first be recognized as a system that can be interpreted in a useful way. If such an interpretation is possible, then the system can be utilized, and it can be incorporated in a suitable procedure to perform some kind of calculation. The entire system is then an effective method. That, of course, demands that such a system is understood in terms of its dynamics, or behavior, and that it can be described in an appropriate and abstract way. The claim that the structures of a wall are an implementation of word-star at some level (Searle, 1990) is interesting, but I believe that a formal system without the appropriate interpretation, and also some utility, does not implement any form of computation. The wall might well implement computation at some level, but it is necessary to have a clear understanding of the wall at that particular level in order to explain the mapping of inputs to some outputs. The inputs and outputs might possibly be interpretations of quantum states, and the computed function f could be an explainable and deterministic sequence of quantum leaps within the wall. Until all of the states and processes can be explained at that level, the wall cannot be used as a computing device and therefore does not implement word-star. While there is probably an endless number of suitable physical systems that could be used as computational tools, their mere existence does not

26The calculations are mathematically straightforward and rather trivial. The real problem is that the number of relatively simple calculations is so large that even with the use of cosine tables the task remains impractical.

qualify them as implementations. Stufflebeam remarks that

[w]e do not find computational systems in the wild, because whether something is a computational system is not just an empirical matter. Rather, something is a computational system always relative to a computational interpretation - a description of a system’s behaviour in terms of some function f. In other words, something warrants the name “computational system” only after its behaviour gets interpreted as satisfying, or computing some specific function. Thus, individuating computational systems is always interpretation-dependent. Not only is there the subjective matter of whether we care about individuating an object as a computational system, there are certain pragmatic considerations that figure in determining which function among equivalent functions is being computed (Stufflebeam, 1998, 641).

This leads to a possible definition of computation, which is perhaps precise enough and still allows for the inclusion of the many possible implementations through a limitless number of physical systems. Harnad suggests that semantic interpretability is a necessary criterion of computational systems (Harnad, 1995). He holds that a formal definition of computation in terms of Turing machines does not rely on the semantics of symbols that are processed. Real and useful computation, however, has to make systematic sense.

It is easy to pick a bunch of arbitrary symbols and to formulate arbitrary yet systematic syntactic rules for manipulating them, but this does not guarantee that there will be any way to interpret it all so as to make sense (Harnad, 1995, 381).

I think that the definition of computation must include even more than the coherent and ‘sense-making’ semantic interpretability suggested by Harnad. Symbol manipulation can only be regarded as computation if these symbols are semantically interpreted. Any computation with the aid of a machine, an electronic calculator or a personal computer, is a very complex process at the binary level. Because of that, semantic interpretation at this level of the machine is neither possible nor necessary for most users. The engineer, however, is able to determine that the bit pattern at some memory location corresponds to a digit, if the output of a certain flip-flop is zero and some other conditions are met, and so on. Essentially there is a causal chain between the low-level functioning of the machine and the more abstract level of digits, so that the calculator is doing what it is supposed to be doing. Unlike the wall, it is interpretable at all levels. The semantic interpretation of the inputs and outputs at the digit level is required by the human using the machine. There must be some intentionality and intent in relation to computation. Pressing buttons on an electronic calculator is no different to pressing buttons on a remote control unit for a television unless the user wants to calculate. Even if for some reason a random sequence of button presses on the calculator was syntactically correct and produced a result, a computation is only performed if the action and the result are interpreted as such. Analogously, sliding beads on an abacus does not constitute calculation, unless the beads are place-holders for numbers, i.e. the beads must have semantic interpretations attached to them. Values on the abacus are encoded in the patterns of beads. The manipulations of the beads, or symbols in general, have to follow the formal rules to maintain the semantic content. Computation is an intentional, i.e. purposeful, process in which semantically interpretable symbols are manipulated. Semantic interpretation is a requirement for computation in order to be able to eliminate any instances of trivial and pathological forms of computation. McDermott (2001) includes the thermostat and the solar system in a list of examples that he considers to be computers. Moreover, he suggests that

[t]he planets compute their positions and lots of other functions [. . . ] The Earth, the Sun, and Jupiter compute only approximately, because the other planets add errors to the result, and, more fundamentally, because there are inherent limits to measurements of position and alignment (McDermott, 2001, 174).

I disagree with McDermott’s notion of a computing universe. The solar system is not computing anything - nor is the solar system measuring any positions as inputs. McDermott’s solar system supposedly computes a function using “distances r1 and r2, and velocities v1 and v2” (McDermott, 2001, 175). We do not have any evidence that the solar system is not computing epicycles rather than using Kepler’s formula. Dreyfus is also critical of a computing solar system and he comments specifically on the ‘computation by planets’. He asks us to

[c]onsider the planets. They are not solving differential equations as they swing around the sun. They are not following any rules at all; but their behavior is nonetheless lawful, and to understand their behavior we find a formalism - in this case differential equations - which expresses their behavior according to a rule (Dreyfus, 1979, 189).

The questions about trivial or incidental computation are much more serious than that. By incidental, I mean the type of computation that happens without any purpose. Searle’s example of a wall implementing word-star and Putnam’s rock implementing every finite state automaton deal with the problem of computational functionalism. Searle accepts that physical systems are not intrinsically computational systems. He notes that

[c]omputational states are not discovered within the physics, they are assigned to the physics. [ . . . ] There is no way you could discover that something is intrinsically a digital computer because the characterization of it as a digital computer is always relative to an observer who assigns a syntactical interpretation to the purely physical features of the system (Searle, 1990).

Searle argues, against the view held for example by Haugeland (1985), that physical systems are neither semantic engines nor syntactic engines. His argument that a wall is at some level of description an implementation of word-star is based on the claim that anything which is sufficiently complex can be interpreted as a program. Shagrir (1999, 2001) provides an interesting argument for the notion that a particular physical system can be interpreted in more than one way. As a consequence, computation, in the context of cognition, is then the interpretation of some syntactic structure27 according to (mental) content. He says that

[t]he computational identity of the system is determined by the syntactic structure the system implements while performing a given task, but the task of the system in any given context is defined, at least partly, in semantic terms: i.e., in terms of the contents the system carries (Shagrir, 1999, 143).

27Syntactic Structure refers to the formalism of the description of the computation. The syntax of an expression expresses the rules that govern the notation for a piece of mathematics or a computer program. For example, the expression ‘sum=a+b’ in a BASIC computer program tells the machine to evaluate the sum of a and b and to assign the result to the variable sum. Any other notation, e.g. ‘a+b=sum’, will violate the syntax of this particular statement.

McDermott rejects Searle’s arguments against a functional view of computing and offers a more concrete definition. For McDermott, a computer is purely a syntactic engine. The concept of computing includes a much wider range of systems, and it is therefore much broader than a concept based on digital computers. He suggests as a definition that a computer is a physical system having certain states. The states of such systems need not necessarily be discrete and can also be partial, that is, such a system can be in several states at once (McDermott, 2001). Another important feature of such a computing system is that its outputs are a function of its inputs. Interestingly, the notion of function, as McDermott proposes it, could be taken from a textbook in mathematics. He writes that

[a] computer computes a function from a domain to a range. Consider a simple amplifier that takes an input voltage v in the domain [−1,1] and puts out an output voltage of value 2v in the range [−2,2]. Intuitively, it computes the function 2v [. . . ] (McDermott, 2001, 173).

The admission of states that may be non-discrete, i.e. they can be continuous or may be partial, has some important consequences for his concept of computation. Firstly, there is a possible contradiction in that systems cannot have states that are not discrete. The term state, I would think, presupposes that there is an amount of stability at a certain level in a system when it is considered to be in a state. Stability does not mean that the entire system is without motion. Stability will enable us to describe a state at various levels of description. These levels can be arbitrarily selected and need not be confined to events in relation to time, but could also include other measures as long as such entities are not in flux (having a certain colour or temperature, for example). Not all such states are necessarily useful in describing computational processes. A washing machine can be in several states: switched off, washing, spin-drying, and so on. At the level of these descriptive states the machine can only be in one state at any time, with transitions in between - the machine cannot wash and spin-dry concurrently. However, at a lower level, the machine can be dynamic and may exhibit no stability at all - e.g. the machine’s motor is turning and the timer is winding down and so on. The motor of the washing machine itself can be in many states: not turning, turning slowly forwards, turning fast, and so on. There is a large number of ways of defining states for objects, or attributing states to objects. Looking at the turning motor, it becomes meaningless to assign states to the various degrees of turn, say, without discretizing the process of turning, because the turning of the motor is a continuous process. An engineer may well say that something or other should happen once the motor turns through 180 degrees, but then the process is no longer a continuous one. The introduction of a fixed marker or a fixed event - 180 degrees - allows for the introduction of higher level states: not yet at 180 degrees, at 180 degrees and triggering some event, past 180 degrees, and so on.

Secondly, if a system is allowed to be in several states at once, there has to be some guarantee that such states are not mutually exclusive. McDermott says that a system “might be in the state of being at 30 degrees Celsius and also in the state of being colored blue” (McDermott, 2001, 169). The problem here is that McDermott identifies states at a semantic level and not at a syntactic level. I would argue that his concept of state is much more a concept of properties. Consider again the simple amplifier that implements the function 2v, which is continuous. The function definition, however, narrows the possible inputs for this function to numbers. Blueness cannot and must not be allowed in the domain of a function 2v. Accepting that a computer computes a function in terms of mapping from a domain to a range implies that such a computation does occur at a syntactic level and must remain semantically interpretable during the process.

McDermott also believes that computation is in essence the interpretation of a formal physical system, which implements a function, or mapping. He says that

[a]ll we have to do is to define the word “computation” carefully, abstracting away from the fact that historically the concept has been applied primarily to artifacts. A computer is simply a physical system whose outputs are a function of its inputs [. . . ] We specify what the inputs and outputs are with input and output decodings, which are just mappings from certain partial input states to range and domain sets (McDermott, 2001, 169).

This definition is suitable, without being vacuous, if it also includes the requirement that the processes within such a physical system are understood and explained at the level of implementing a computational system. There remains, of course, the problem of TMs, which are entirely abstract and do not qualify as implementing computation.
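McDermott’s definition of a computer as a physical system whose outputs are a function of its inputs, fixed by input and output decodings, can be made concrete with a small sketch. The following Python fragment is my own illustration, not McDermott’s: the function names and the particular voltage-to-number mappings are assumptions introduced for the example; only the idea of the amplifier computing 2v relative to a pair of decodings comes from the text above.

    # A minimal sketch of computation as a mapping from a domain to a range,
    # using the amplifier example quoted above. The decoding functions and
    # their names are illustrative assumptions, not McDermott's own code.

    def encode_input(voltage):
        # Input decoding: map a physical input state (a voltage in [-1, 1])
        # to an element of the mathematical domain.
        assert -1.0 <= voltage <= 1.0, "outside the amplifier's input range"
        return voltage

    def amplifier(v):
        # The idealized physical behaviour of the device: output voltage 2v.
        return 2.0 * v

    def decode_output(voltage):
        # Output decoding: map the physical output state (a voltage in [-2, 2])
        # back to an element of the range.
        return voltage

    def computed_function(v):
        # Relative to this pair of decodings, the device "computes" f(v) = 2v.
        return decode_output(amplifier(encode_input(v)))

    if __name__ == "__main__":
        for v in (-1.0, -0.25, 0.0, 0.5, 1.0):
            print(v, "->", computed_function(v))

The point of the sketch is only that which function is being computed is fixed by the chosen decodings and not by the physics alone, which is precisely where the interpretation-dependence discussed in this section enters.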

3.4 Machines and Semantics

I have argued that TMs are mark manipulating devices, due to their inability to access or link to the real world. TM computing is semantically empty, because the marks are arbitrarily chosen, manipulated and in the end interpreted as being semantically charged. It can, of course, be argued that a symbol can be related to the real world by assigning meaning to it. But it is important to remember that this relation is only in the head of the observer. Many symbols in mathematics are used to mean specific things. By convention, unless specified differently, the symbol e stands for a particular number, and this symbol can appear in formulas like $e^{\pi i} + 1 = 0$. This expression can be manipulated using a set of axioms and transformation rules, yet the symbol e will retain the assigned meaning. Mathematics is a formal system and as such it can be automated. The concept of an interpreted automatic formal system is used by Haugeland (1981, 1985) as the definition of algorithmic computation. A formal system, by its inception, is truth preserving. The axioms and rules of mathematics ensure that theorems hold28. As long as we remain within the formal system, i.e. we observe the rules, then the assigned meaning of symbols is preserved. The principle has been summed up:

In effect, given an interpreted formal system with true axioms and truth-preserving rules, if you take care of the syntax, the semantics will take care of itself (Haugeland, 1981, 44, original italics).

The semantics or truths are not related to the real world in the way a representational system (RS) of Type II is related to the world. Unless the symbols are Type II RSs, i.e. the symbols are causally connected to the world, the semantics are mere conventions.

28The search for proofs in mathematics is about showing that theorems can indeed be reduced to a chain of transformations from a set of axioms. Once a theorem has been proved, then this theorem becomes a mathematical ‘piece of truth’, which can be used in other proofs.

3.5 Summary

I have argued that computation is distinct from the behavior, or the actions, of an automated formal system. Computation must be viewed as a goal oriented and, at the same time, a mathematical activity. This activity is not restricted to discrete state machines, because computations can be implemented through analog models of the world. The example of Lord Kelvin’s tide predictors shows that obtaining measurements from real-world models is indeed an effective method. After all, the result of this method, a table giving the expected heights of tides for a given location, is sufficiently accurate for people to navigate safely.

The various concepts of computation may well be sufficiently defined for a particular area of work, like the TM based definition in the field of computer science. But a definition based on the concept of an algorithm is not suitable, because there are many kinds of computation that are not of the TM kind. Analog processes can also be considered computation, provided we can interpret the analog processes appropriately. Although digital computers dominate as the prime tool on which models and simulations are implemented, it should be clear that the brain does not involve this type of computation. Other kinds of processes are likely to be involved and these processes will have computational properties, which may turn out to be essential for cognition. Today, most models and simulations rely, at least at some level, on TM type computation.

An important argument in this chapter showed that electronic computers are supersets of TMs. Because the TM is a formal description of an algorithm, the computing power of a TM is restricted to closed, finite computing operations. The fact that ‘real’ computers can interact with the world is sufficient for these devices to compute, in the sense of symbol/mark manipulation, beyond the TM. Unlike real computers, TMs are representational systems of Type I and as such are mark manipulating machines. Real machines are, given certain conditions, RSs of Type II. For this reason we cannot view an electronic computer in certain configurations as a formal system, and therefore, the Gödelian argument from Lucas (1961, 1970) can be dismissed as inapplicable (see section 3.1.2 on page 56). The kind of computation that is employed to manipulate the natural signs (RSs of Type II) is of little consequence. What is important is the ability to maintain the chain of causal mechanism between the real world, its representations (signs) and the transformation during processing. The next chapter concerns computational models and simulations and how certain aspects of computational theory influence their design for experiments in Cognitive Science.

Part II

Models and Reality


Chapter 4

Virtual Models

As is often the case with an abstract model, not only [is] the medium of expression different, but the object [has] also been simplified and idealized in certain respects in the model, and its empirical (material) features [are] no longer dependent on any particular case, rather it incorporate[s] certain average or typical features [. . . ] (Morgan, 2003, 230).

Many of the requirements, attributes and characteristics of models do change when we move to computer models and simulations, or virtual models. Now there are no real world objects at all, and representations of the world relate primarily to mathematical abstractions and simplifications of real world phenomena. Experimentation with such models, in the interactive, ‘interfering’ way that Hacking (1983) and Harré (1970) asked for, is not possible. Instead, the experiments are conducted in the domain of virtuality and computation (Peschl and Scheutz, 2001).

Not long ago, the concepts of simulation and, to a lesser degree, the term model, “invariably implied deceit” (Fox Keller, 2003). Simulations and models were thought of as merely mimicking or faking the real world. While modelling has become a widely used technique in almost any imaginable discipline, the term is still associated with a certain amount of doubt, or disbelief. For every model that shows A, there always seems to be an alternative model showing B, and it is significant that we hear the expression ‘it is only a model’ quite regularly. Terms like ‘virtual experiments’ might be preferable to ‘simulations’ or ‘models’, because modern terms like virtual, and in particular virtual reality, are positive and are associated with cutting edge computing and AI. While ‘virtual’ also has connotations like ‘not real’ and ‘imaginary’, I believe that these are less strong.


Computer models and simulations make use of many advanced techniques that introduce new, sometimes exciting, ways to present aspects of the model to users. Computer generated images (CGI), such as the image1 of neurons in figure 4.1 for example, have not only changed the way we think about pictures and movies, but also how theories are formed and data is presented in science. There is a variety of exciting computational methods for visualization and presentation available, and many of the results of these techniques have made their way into textbooks and journals in the form of pictorial illustrations and graphical representations.

Figure 4.1: CG Image of neurons

With advanced image processing techniques, it is not only possible to alter and to enhance pictures, but it is also possible to render images of phenomena that are not visible, or do not exist at all. While many of the computer aided experiments and visualizations may be helpful to understand complex phenomena, some reservations remain about the usefulness and validity of computer simulations as experiments and methods to gain scientific knowledge. I would even go as far as claiming that new issues concerning the utility and rigor of computational models have emerged. Accepting virtual models and virtual simulations as experimental or empirical tools in science will force us to also adopt some form of virtual scientific method. In this chapter, the relationships between real and virtual science will be explored. The first section concerns mathematical models and their implementation as computer programs on modern computing machines.

1Image by MAXON Computer GmbH (URL: //www.maxon.net).

4.1 Virtual Scientific Experimentation

For Hacking and Harré, experimentation is not merely about the observation of phenomena and subsequent inferences to the explanatory theories, but experimentation is also about observing and interfering with the objects in question. The ability to manipulate objects is an essential and integral part of the process of experimentation, which is “to create, produce, refine, and stabilize phenomena” (Hacking, 1983, 230). The close connection between experiment and some real world entities is a key requirement in the definition of the term experiment offered by Harré, who says that

[a]n experiment is the manipulation of [an] apparatus, which is an arrangement of material stuff integrated into the material world in a number of different ways (Harré, 2003, 19).

The kinds of experiments that fit the criteria discussed by Harré and Hacking are the activities we often associate with what happens in the laboratory. These are the kinds of experiments we know about from our high school days. However, it has become obvious that the vast majority of experiments are different from this stereotypic view (Morgan, 2003). When we conduct experiments with computational models and simulations, there are no materials that could possibly be manipulated. The material, the apparatus and the process of interference are all replaced by data structures and computational processes2. The nature of the entities and the phenomena that are the points of interest in the field of Cognitive Science dictates that models and simulations are often the only way to do any experimentation at all. In Cognitive Science, the experiment is moved into the realm of the virtual, not just for convenience, but more often than not, out of necessity.

Computer models and simulations are doubly abstract. In the first instance there is an abstraction in the transformation from the observable phenomenon to its corresponding mathematical model. The mathematical models have been described as “intellectual constructs”, or “mathematical objects” (Jorion, 1999). Secondly, there is a transformation of the mathematical structure into a computational entity, which is designed to deal with the complexity of the calculations in an efficient manner. One of the indicators of a model’s efficiency is the performance of the model or simulation in terms of speed, particularly with models where a large amount of data is associated with them. Additionally, the performance metrics should (must) include the accuracy and integrity of the data. The second level of abstraction takes place when the data that has been generated by models is transformed into a format that is interpretable by the experimenter. In models and simulations where large amounts of data are involved, additional steps are usually taken to present the data in some sort of visual form.

2The only ‘material’ part of the experiment is the computer hardware, which in some way is irrelevant for the abstract model. On the one hand, the model does not work without the hardware, on the other, the hardware is not contributing to the experiment or model as such. With the proliferation of the personal computer it turns out that the vast majority of computational models and simulations are implemented on the same platform, i.e., more or less identical hardware.

Figure 4.2: MRI scan

These visualizations can be in the form of a pictorial representation like the MRI scan in figure 4.2 (Frese and Engels, 2003)3, a graphical representation like the dendrogram in figure 8.6 on page 190, a table of statistics, or a graph.

In the following sections, some of the issues concerning abstraction, formalization, generalization, and simplification are discussed, because they are crucial operations when building mathematical models. These operations influence the accuracy and the fitness of the model.

3The colours of this image have been inverted to achieve a better reproduction.

4.1.1 Mathematical Models

A mathematical object, unless sensibly interpreted, “does not tell anything about the world” according to Jorion (1999). He holds the view that mathematics is all about syntax, that any of its meaning derives entirely from its structure, and he observes that

[t]his is actually an apt manner for expressing the type of meaning held by a mathematical object: some of the symbols which constitute it impose constraints on others, some have no more meaning than the set of constraints they are submitted to (Jorion, 1999, 2).

This view is consistent with a Dretskian representational system (RS) of Type I as described on page 33. However, a mathematical model is more than a collection of ‘meaningless’ symbols, if a sensible interpretation is possible.

The benefits of a mathematical model for world comprehension are the following: if an interpreted mathematical model makes sense, then it is reasonable to assume that the type of relations which hold between the symbols in the model hold also between the bits of the real world which are represented in the interpretation of the model (Jorion, 1999, 3).

Jorion goes as far as to say that the mathematical model and the part of the world being modelled are isomorphic:

The set of these relations make up the shape (Greek morphe), if the shape is the same (Greek ison), one talks of an isomorphism between the model and the part of the world that is modeled (Jorion, 1999, 3).

I believe that the analysis of neural models (in chapter 5) clearly reveals that Jorion’s opinion - that it is reasonable to assume that the relations in a model and the relations in the real world correspond - is somewhat optimistic. I will show in the next chapter that artificial neurons have very little in common with real neurons: they differ in their external functionality, their behavior, and their architecture. Other than a gross similarity in that they transform several inputs to one output, they really share only the name. The isomorphism of biological neurons and mathematical neurons can barely be described as an ‘approxi-morphism’. In contrast to Jorion, Psillos (1999) refers to modelling assumptions that reflect the relationship between model and the target physical system. He thinks that

[f]ar from being arbitrary, the choice of modelling assumptions for [the target system] X is guided by substantive similarities between the target system X and some other physical system Y. It is in the light of these similarities that Y is chosen to give rise to a model M of X (Psillos, 1999, 140, original italics).

I believe that “substantive similarities” also capture the nature of the relationships between the mathematical description (model) and the real world system or theory.

How do we derive a mathematical description of some relationship among physical (or mental) entities? There are several conceptual transformations and processes involved. Building a model of what is observed, or what is assumed, involves abstraction. Abstraction is closely related to the processes of formalization, generalization and simplification. I will discuss these principles in turn.

Abstraction

Mathematical models refine the real world by introducing an element of abstraction. It is clear that a model should be simpler than what is to be modelled. Earlier, I quoted Stufflebeam (1998), who suggested that his cat Sophie’s behavior, when dropped from two feet, “satisfies the distance function $D(t) = \frac{1}{2}gt^2$”. The abstraction here includes the reduction of the cat Sophie to a point mass in Newtonian physics. The process of abstraction involves several practices, all of which widen the gap between the sometimes observable phenomena and an idealized description in mathematical terminology. The observable behavior of the cat Sophie in free fall differs from the idealized point mass. In fact, Stufflebeam’s description of Sophie’s behavior as $D(t) = \frac{1}{2}gt^2 + v_0 t + D_0$ does not involve Sophie at all4. Abstraction is the process of defining a general and idealized case for relationships between entities and processes. In the distance formula, g stands for the acceleration, and if we substitute for g the value 9.8 m/s² or 32 ft/s², then we have a reasonable approximation of the conditions on Earth. However, we can also find the appropriate values for this model to work on the moon or on Mars. The distance formula holds anywhere for any object, provided we have the correct value for g. The most important aspect here is the introduction of placeholders like D(t), which is the abstract notion of the distance of something at a particular moment in time. This placeholder, or symbol, can now be manipulated within a formal system like mathematics.

4It is, of course, possible to put Sophie explicitly into the equation by saying that $D_{Sophie}(t) = \frac{1}{2}gt^2 + v(0)t + D_{Sophie}(0)$. This could indicate that this mathematical model is a particular instance of the general case.
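The role of g as the only element that changes between Earth, the Moon and Mars can be made explicit in a short sketch. This is my own illustration; the values used for lunar and Martian gravity (about 1.6 m/s² and 3.7 m/s²) are common textbook approximations and are not taken from the text.

    # Sketch: the abstract point-mass model D(t) = 1/2*g*t^2 + v0*t + D0.
    # The same placeholder structure serves Earth, the Moon and Mars; only
    # the value substituted for g changes. The lunar and Martian values are
    # textbook approximations added here for illustration.

    def distance(t, g, v0=0.0, d0=0.0):
        """Distance fallen after time t for an idealized point mass."""
        return 0.5 * g * t**2 + v0 * t + d0

    if __name__ == "__main__":
        for body, g in [("Earth", 9.8), ("Moon", 1.6), ("Mars", 3.7)]:
            print(body, distance(1.0, g), "m after 1 s")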

Formalization

The second and usually difficult process in building a mathematical model is that of formalization. A mathematical description of entities and the relationships between them can only work as a useful model if there is a sufficient precision of terms. In areas of elementary Physics, like Newtonian dynamics, models work well because terms like mass, velocity and force and the interaction between these concepts are defined in an axiomatic fashion. The difficulty in Cognitive Science is that many terms describe mental things, like beliefs and behaviors, rather than physical things with properties that can be described and defined easily. Moreover, for the mental concepts, we do not have clearly defined relations or processes to manipulate such concepts. Green (2001) suggests that some of the apparent success of connectionist models is due to the lack of precision of terms and explanations of what is actually modelled. The question is whether beliefs or behaviors can be modelled successfully, if it is not possible to provide a formal description of what we want to model. However, formal representations of a belief, for example, are needed in a computer program, because we need some way of encoding this concept. In this situation we have to face the additional problem of also having to encode the degree of belief and quite likely fuzzy representations about what is believed. It is clear that we cannot choose suitable sets of symbols, rules of inference and transformation rules for mental concepts in the same way we can choose D(t).

I suspect that formal descriptions of mental events, if it is at all possible to produce such descriptions, will not be simple placeholders. They will have to be either simple and relatively vague, or they will be very complex in order to provide some exactness and precision. But there is a catch: On one hand there has to be sufficient precision to build a good model, on the other hand, precision in terminology and in detail makes it harder to build models that remain simple. Formalization ought to eliminate many of the ‘soft’ assumptions and descriptions about mental concepts, but as experience with the representation of knowledge in many AI applications has shown, it is very difficult to even encode facts and the associated rules5.
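To make the encoding problem concrete, here is one possible - and deliberately naive - way in which a graded belief might be written down as a data structure. Everything in this fragment (the field names, the use of a single real number for the degree of belief, the threshold) is an assumption introduced purely for illustration; the point of the sketch is how many arbitrary decisions even this crude encoding forces on us.

    # A deliberately naive sketch of encoding a graded belief. The chosen
    # representation (an uninterpreted proposition string plus one number)
    # is an illustrative assumption, not a proposal from the text.

    from dataclasses import dataclass

    @dataclass
    class Belief:
        proposition: str   # what is believed, left as an uninterpreted string
        degree: float      # degree of belief in [0, 1] - an arbitrary choice

        def is_held(self, threshold=0.5):
            # Another arbitrary decision: at what degree does a graded
            # belief count as being held at all?
            return self.degree >= threshold

    if __name__ == "__main__":
        b = Belief("it will rain tomorrow", 0.7)
        print(b, b.is_held())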

Generalization

For some models it is desired that they work well for a theory about something “in principle”, rather than replicating a particular instance of this theory. In other words, if the model is designed to predict some cognitive behavior in humans, it has to be able to produce data for humans in general, rather than for the behavior of Lucy or Bob. There is, of course, the added problem of validating the model. In order to measure the success of the models, we need to compare data from the model with real data. The real data in this case have to be statistical data, because only averaging data from many individuals can provide us with generic human data. Generalization is usually achieved by omitting detail and allowing for a very broad interpretation of results. The danger here is to make models so general that they no longer capture the complexity of the theory or issue to be modelled. For example, the general formula for falling objects based on Newtonian physics is not a sufficiently precise model for what happens to a parachute jumper in free fall. In this much more specific case it is important that drag and terminal velocity are considered in the model.

5A good example is the apparent failure of the CYC project (Lenat and Guha, 1990), a ten-year $60-million attempt to build a very large ontology, which was designed to simulate aspects of human intelligence. The project was started by Doug Lenat in the mid eighties, but it was all but abandoned a decade later.

Simplification

One of the many criteria defining what makes a ‘good’ model is that the model is easier to work with. One way of making models easier is to simplify things, which can be achieved by disregarding details or external (environmental) issues that influence the model. For Sophie, the distance traveled by a free falling object on earth can be modelled using

\[ D(t) = \frac{1}{2} g t^2 + v_0 t + D_0 \]

where g is the acceleration of about 10 m/s², and $v_0$ is the vertical velocity at the beginning of the time interval t. D(t) gives us the distance after the time t from the position $D_0$, the position of the object at the beginning of the time interval. This is an ‘easy to work with’ model, because we do not take into account, among many other things, that (1) the acceleration is only approximately 10 m/s², and (2) the atmosphere causes drag $R = -\frac{1}{2}\rho A v^2 C_R$6. Even when taking drag into consideration, the mathematical model of Sophie’s behavior is still crude, because we have not considered the Reynolds numbers, the variation of the gravitational force over geographical regions, and many other perturbations. If we take drag into consideration we need to know that drag itself depends on, among other things, (1) the shape of the object and (2) air density. But the density of the air is dependent on the temperature and the humidity, and the Reynolds numbers depend on the velocity of the object (cat), its shape and size, its surface, and so on. In the case of Sophie, the problem can not be fully described, because the cat could and would change its shape and therefore many parameters during the free fall. At some point the model will become so complicated that it is no longer easy to work with, because the model is more difficult to understand than the original problem7. $D(t) = \frac{1}{2}gt^2 + v_0 t + D_0$ is, as a model for most ‘dropping cat problems’, likely to be sufficient.

6In the formula ρ stands for the air density, A is the flat plate equivalent area facing the air, v is the velocity of the object, and $C_R$ is a drag coefficient representing a combination of a number of other factors, like the shape of the object, parasitic and interference drag.

7Dutton and Starbuck (1975) described this phenomenon as Bonini’s paradox (Dawson, 2004).

Models involving artificial neural nets are no different. A single element, or artificial neuron, is mathematically and computationally relatively simple, and a couple of formulas will describe the neural model precisely and succinctly (see chapter 5), which is by no means true for a real neuron. The fully connected simple recurrent network described by Elman (1990), comprising 31 input, 31 output, 150 hidden, and 150 context nodes with almost 32,000 connections, is computationally no longer simple, if we consider the network in its entirety. In many of the particular networks, the transfer functions of neurons are different in each layer, so that the mechanism of changing the connection weights during learning becomes extremely complex. The complexity increases as we lower the level at which the network is specified. If we attempt to describe the behavior of the artificial network in its entirety at the neuronal level, the task becomes nearly intractable. Hence, we commonly describe networks and their general behavior in terms of their architectures, such as feed-forward, simple recurrent, or Kohonen. Additional information, like the number of layers and the number of nodes, in the way I have just described Elman’s network, enables us to predict quite a bit about the model and what it can reveal. These descriptions are really only suitable for the simple typical networks and some of their elementary dynamics that are associated with their particular design. Simple recurrent networks, for example, have a particular architecture designed to implement a particular behavior (a feedback mechanism, see section 6.1.4).
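The figure of ‘almost 32,000 connections’ can be checked with a few lines of arithmetic. The sketch below is mine; it simply counts the weights of a fully connected 31-150-31 network with 150 context units, on the standard assumption that the context units feed into the hidden layer alongside the inputs.

    # Sketch: counting the connections of the simple recurrent network
    # described in the text (31 input, 150 hidden, 150 context, 31 output
    # units), assuming the usual SRN layout in which the context units
    # feed the hidden layer together with the inputs.

    n_input, n_hidden, n_context, n_output = 31, 150, 150, 31

    connections = (
        n_input * n_hidden      # input   -> hidden
        + n_context * n_hidden  # context -> hidden (the recurrent copy)
        + n_hidden * n_output   # hidden  -> output
    )

    print(connections)  # 31800, i.e. 'almost 32,000'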

4.1.2 Methodologies

The translation or transformation of a phenomenon into a mathematical entity is not merely a process of formalization. Fulford et al. describe the creation of a mathematical model as a set of steps.

Identifying the quantities most relevant to the problem and then making assumptions about the way in which the quantities are related. This usually involves simplifying the original problem so as to emphasize the features which are likely to be most important.

Introducing symbols to denote the various quantities, and then writing the assumptions as mathematical equations.

Solving the equations and interpreting their solutions as statements about the original problem.

Checking the results obtained to see whether they are in agreement with experimental data (Fulford et al., 1997, 2).

There are many interesting points in this proposed methodology for building mathematical models. We will find on closer examination that nearly every phrase is fraught with problems. The first directive concerns “identifying the quantities most relevant”. Missing what seems to be a minor detail may turn out to be vital for a model (or a theory) to be successful. I have already touched on the extremely difficult problem of deciding between what is relevant and what is not.

The introduction of symbols may put constraints on the type of operations and the methods for the model. For example, the squashing function in neuron models (perceptrons) $\varphi(v) = \frac{1}{1 + e^{-cv}}$ is selected because the function’s behavior is close to the simpler step function, and the function is differentiable. While the qualities of the step function are desirable for the functionality of the neuron, some mathematical procedures (back propagation of error during learning) require that the function is differentiable (a short sketch of this function is given at the end of this section). The point is that the mathematical methods are likely to dictate the kind of mathematical structures of the model at some level.

The interpretation of results in the language of the original problem is also not without problems. The evaluation of experimentally predicted results should ideally rest on the comparison against empirical data. The mathematical model is itself an experiment, or even a theory, which yields data. Given the amount of theory-laden judgment that is necessary in the identification of the most relevant quantities and other assumptions, the model should be considered a theory. A (scientific) theory should produce data in the form of predictions that can be checked against observed real world phenomena (Popper, 1959) in the same way a model produces data. The building of a model is a cyclic process. Fulford et al. say that in the final stage

there may be insufficient agreement between the actual experimental result and the results predicted from the model. If this happens, one again returns [. . . ] to see whether the assumptions can be made more realistic. The process of returning [. . . ] may be repeated many times until a satisfactory model is obtained (Fulford et al., 1997, 2).

I consider the task of simplification - to determine what must be included in the model, and what kind of detail can be omitted - the most difficult. As computers become more powerful, the computational complexity of models can be increased, which in turn, one would expect, will increase the quality and power of the models. But this is not necessarily the case. A very complex model is no longer easy to use. An increased complexity can also be an indication that the model needed many additions to explain or produce acceptable predictions, in the same way that the Ptolemaic model of the universe needed more and more epicycles to ‘keep up’ with the data that was actually observed. It might turn out that the model is not good enough to explain things adequately.
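The earlier remark about the squashing function can also be made concrete. The sketch below is my own; it implements the logistic function ϕ(v) = 1/(1 + e^(-cv)) mentioned above and its derivative, which can be written in terms of ϕ itself - the property that back propagation of error exploits, and which a step function lacks at its threshold.

    # Sketch: the logistic squashing function phi(v) = 1/(1 + exp(-c*v))
    # and its derivative c * phi(v) * (1 - phi(v)). The derivative being
    # expressible in terms of phi itself is what back propagation relies
    # on; a step function has no usable derivative at the threshold.

    import math

    def phi(v, c=1.0):
        return 1.0 / (1.0 + math.exp(-c * v))

    def phi_prime(v, c=1.0):
        y = phi(v, c)
        return c * y * (1.0 - y)

    if __name__ == "__main__":
        for v in (-2.0, 0.0, 2.0):
            print(v, round(phi(v), 3), round(phi_prime(v), 3))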

4.2 Implementation

When a model has been designed, and the mathematics have been specified, it must be transformed into a computer program using one of the many computer languages that are available. The kind of problem that needs to be solved will usually influence the choice of computer language. In some models the speed of execution may be critical, in other models it might be more important to be able to deal with large amounts of data efficiently. Often the selection of a programming language is just based on how competent a programmer is with a particular language. Before a suitable selection of language is made, it is also important to switch to an appropriate computational paradigm. There are many computational environments in which the implementation of a particular model inherits some of the concepts and architecture of the environment. I would like to provide an overview of logic programming, which is a prime example of how a kind of virtual system is created. Note that these environments are different from virtual machines8. A virtual system creates a kind of encapsulated micro world with its own rules. I will compare this environment with the more ‘conventional’ imperative programming. Later, in chapter 6, it will become apparent that similar assumptions about the irrelevance of the implementation are made in the context of artificial neural nets on modern computing machines.

8Virtual machine is a term used to describe an interface between a high level computer language and a combination of hardware and operating system. Examples of such languages are Java and Python, where the tokenized code of a program can be executed without change on a different computer running different operating systems. This is achieved by adding an extra layer of machine dependent software that essentially makes all machines behave and look the same, as far as a Java or Python program is concerned. This additional layer of software implements a virtual machine.

Logic programming

In expert systems, arguably the most advanced form of classical AI, knowledge is introduced to the system by declaring certain values to variables as facts and rules of inference about those facts. I must emphasize that the term “fact” is used here in a technical sense, where it means that a proposition is true within the program, only because it is declared as such. There is no necessary relation to truths in the real world9. In Prolog, which is a typical example for an entire class of languages and systems often used in classical AI, facts are expressed as instances of relations. A statement like female(pam) or male(bob) establishes the fact that something named pam is an instance of something called female or the fact that something named bob is an instance of something called male. Similarly, a relation in the form parent_of(tom, pam) tells the system that something named tom is something called parent_of something called pam. Prolog can recall facts in response to a question. If asked a question ?- female(pam), Prolog would respond yes, because something named pam is in fact an instance of something called female. The system would respond to the question ?- parent_of(X, pam) with X=tom, because it can instantiate the variable X with tom from the fact that tom is a parent_of something pam. If the fact mother_of(bob, pam) is given to the system, then it will become a fact that bob is the mother of pam. The names of facts and rules such as ‘mother’ or ‘parent_of’ are selected for human understanding only, and the meanings of facts and the associated rules are completely arbitrary. The unary relation male(X) might also be called xtxtvb(X), as long as we keep the system consistent. Therefore, if Bob is represented as mnb2, then xtxtvb(mnb2) becomes a fact when given to the system. Rules can contain relations of other rules and the Prolog system can resolve complex relationships. The definition of mother_of should be improved to also stipulate that mother_of implies the female attribute: mother_of(X, Y) :- parent_of(X, Y), female(X), which translates to the notion that X is the mother of Y is defined as X being a parent of Y and that X is also female.

9It is possible to interface Prolog programs with the real world, which could make them RSs of Type II, but purists would argue that doing so would destroy some of the key assumptions about Prolog.

Someone’s grandmother is a mother of one of the person’s parents. In Prolog, this would look something like grandmother_of(X, Y) :- mother_of(X, Z), parent_of(Z, Y). A Prolog system may contain thousands of facts and rather complex relations or rules, and such a system will deduce from its facts and rules the answer to questions in the form ‘is it true that Pam is grandmother of Bob?’, or questions in the form ‘who is a parent of Pam?’. Knowledge within such systems is represented as symbols and their meaning is defined by definition: the human concept ‘mother’ can be represented in such systems as mother(...), female_parent_of(...), or q24uie(...). In Prolog facts are expressed as membership of sets, i.e. mother(pam) can be interpreted as pam ∈ mothers. It is important to note that the relationship between pam and mother is always maintained and that inferences can be drawn from the fact that pam is a mother and that one of the mothers is an entity named pam. A schema for a representation of knowledge must also include a system to interpret and extract that knowledge. Simply finding stored facts is of course trivial, at least if the number of facts is not too large. However, the system must be able to deduce new facts from the already stored knowledge and a set of rules of inference.

These are rules of inference from logic that allow a new formula to be derived from old ones, with the guarantee that if the old formulas are true, so is the new one (Spivey, 1996, 3, my italics).

A system of relationships between members of sets can be described by using combinations of the logical operators and, or, and not. Earlier we saw that a mother could be defined as someone’s female parent: mother_of(X, Y) :- parent_of(X, Y), female(X). The relationships parent_of() and female() must both be satisfied. From here we can also deduce that if someone is a parent and if that someone is female then this someone is also a mother. This statement can be written formally as a formula in the first order predicate calculus:

∀x ∀y parent_of(x, y) ∧ female(x) → mother_of(x, y)

The number of elementary rules needed to establish even the most complex inferences is relatively small. Copi (1979) lists nine rules of inference and ten rules of replacement. A set of six rules is needed to resolve propositions containing quantifiers such as ∀x (for all x) or ∃x (there exists at least one x).

I do not believe that there is any problem about the effectiveness of systems like Prolog on a theoretical level. However, as soon as we implement an actual Prolog system on a computer, the “guarantee” that everything is logically correct is really dependent on (1) the error-free implementation of the Prolog system, (2) an error-free operating system, and (3) flawless hardware. Computational models, and this includes artificial neural nets, also rely on a string of assumptions about the correctness of the underlying levels of software and hardware.
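The kind of deduction described for Prolog can be imitated, very crudely, outside of Prolog. The following Python sketch is my own; it hard-codes the family facts used above and the mother and grandmother rules, and applies them by brute force. The extra individual ‘ann’ is added only so that the grandmother rule has something to derive. The sketch is meant to show no more than that deriving new facts from old facts plus a rule is a mechanical, syntactic operation; a real Prolog system does this by unification and resolution, which the sketch does not attempt.

    # Sketch: brute-force forward chaining over the family facts from the
    # Prolog discussion. The facts mirror the examples in the text; 'ann'
    # is an extra illustrative individual. The inference mechanism is a
    # naive set comprehension, not Prolog's unification and resolution.

    parent_of = {("tom", "pam"), ("pam", "bob"), ("bob", "ann")}
    female = {"pam"}

    # mother_of(X, Y) :- parent_of(X, Y), female(X).
    mother_of = {(x, y) for (x, y) in parent_of if x in female}

    # grandmother_of(X, Y) :- mother_of(X, Z), parent_of(Z, Y).
    grandmother_of = {
        (x, y)
        for (x, z1) in mother_of
        for (z2, y) in parent_of
        if z1 == z2
    }

    print(sorted(mother_of))       # [('pam', 'bob')]
    print(sorted(grandmother_of))  # [('pam', 'ann')]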

4.3 Computer experiments

The role of computational models and ‘virtual experiments’, i.e. simulations, as contributors in the framework of empirical science is of particular importance. This holds particularly for Cognitive Science, because many of the objects of inquiry in Cognitive Science cannot be observed directly or mediated by scientific instruments. Consequently, models and simulations are often the only method available to the scientist. It has been argued that computer simulations are essentially extensions of numerical methods, which have been part of scientific reasoning for a long time10. Human beings do not reliably maintain accuracy when they have to deal with a large quantity of numbers, and digital machines are much more efficient at doing logical and numerical calculations. The recognition of patterns and structures is much more the domain of people. The work of analysis and interpretation of patterns, whether these are observed directly or whether these are produced by a machine, remains largely the task of the scientist. Ziman (2000) suggests that what can be known to science is restricted to what is known to scientists. He says that

[a]n empirical scientific fact originates in an observation - an act of human perception (Ziman, 2000, 102).

Elements of computation can be part of a causal chain. Imagine an experimental setup where microelectrodes are used to measure some voltage in a cell.

10For a discussion see for example Fox Keller (2003) and Gooding (2003).

Instead of using a voltmeter that is built around a mechanism involving a coil, a magnet and a pointer with a dial, the voltage is displayed on a computer screen. The voltage differential at the electrodes is converted into a digital signal so that a particular voltage is represented by a binary bit pattern11. This pattern is fed through one of the computer’s input/output channels, and a program performs the task of converting and displaying the pattern as a series of figures, i.e. a number, on the screen. There are in principle no difficulties in explaining the causal chain between the number on the screen and the electrical potential at the microelectrodes. The numbers on the screen are elements of a Type II representational system (see page 34).

The computer program is now modified to read the pattern and to display the corresponding number every second, and, as an additional feature, the program records the time and the values in the machine’s memory. The information in the memory can also be replayed so that the sequence of numbers is displayed on the computer screen in one-second intervals. Is there now a problem in causally linking the patterns in memory to the microelectrodes? The program and the data, the previously recorded sequence of voltages, have been shipped to some other laboratory, where they have been installed onto a suitable computer. Can the postal system be part of the causal chain? A leap of faith is required when we want to equate computer models with phenomena in the real world. While the causal connections of some symbol could be traced, at least in principle, the actual connection to the real world is merely assumed. This assumption, because it is, at least in principle, traceable, maintains or promotes the symbol to an element of a Type II RS.

Computer models and computer simulations have become tools for science in many ways. AI and computational Neuroscience are special cases amongst the hard sciences in that computation is the very nature of their activities. Other sciences might employ computational models and simulations as tools; chemistry, for example, is essentially about elements, molecules, compounds, plastics or pharmaceuticals, even if computational models and simulations play a role in chemical research.

11There are several ways of doing this. Suitable electronic components are readily available. Analog to Digital converters have replaced galvanometers nearly everywhere.

AI, in contrast, takes computational models to simulate, even replicate, cognitive functions that are computational themselves. This would certainly be the case if the assumption that cognition is computation is true. If it turns out that cognition is merely computable, then AI would still be entirely about computation, but the contributions to cognitive science would need additional justification.

In many of the sciences, the experimenter has access to the object of inquiry via the senses. This access may also be indirect, i.e. mediated with the aid of some apparatus, like a microscope. These scenarios may be described as a ‘classical’ scientific environment, where hypotheses are formed, predictions are made and empirical evidence is used to verify or falsify the hypotheses12. Models are often presented in an idealized way, so that, by ignoring unnecessary details, more important properties of a system or some particular behaviors of a system can be observed, measured or recorded. In some models, natural processes are slowed down, sped up, or scaled in some dimension, so that a specific effect becomes more accessible. This type of modelling has been part of empirical science for a long time. Consider, for example, Galileo’s experiments with an inclined track to measure acceleration due to gravity. In these experiments, the physics of a sphere rolling down an inclined plane or track becomes easily observable when compared with a free-falling object. Modern computing machinery has started a revolution in terms of what kinds of models and simulations can be implemented.

4.3.1 Computer models as scientific experiments

Models representing theories (conceptual models) and models representing real entities (representational models) must be accommodated within the framework of scientific practice. The conceptual model is the kind of model that has been associated with the terms metaphor and analogy by Bailer-Jones (2002). The general claim is that all models are metaphors. In this view, models are

12It should be clear that this classical scientific environment would only be a caricature of what really goes on in science. The literature on the topic of scientific method is large and cannot be ignored.

an interpretative description of a phenomenon that facilitates access to that phenomenon. [. . . ] This access can be perceptual as well as intellectual. [. . . ] Models can range from being objects, such as a toy aeroplane, to being theoretical, abstract entities, such as the Standard Model of the structure of matter and its particles. (Bailer-Jones, 2002, 108)

Some models can be adequate representations of real entities provided that there is sufficient accuracy with which a model represents the real world and

taking a realist attitude towards a particular model is a matter of having evidence warranting the belief that this model gives an accurate representation of an otherwise unknown physical system in all, or in most, causally relevant respects (Psillos, 1999, 144).

Jorion (1999) goes much further and suggests that in mathematical models the relations that hold between the symbols in the model and the relations of the parts of the real world that are modelled are isomorphic. We can consider, and may even be able to defend, the view that models in science go beyond being ‘interpretative descriptions’, and that they are scientific truths instead. Psillos hints that the adequacy of a model as a representation can only be determined on a case by case analysis, when he refers to the realist attitude toward a particular model. We will have to accept that the judgment whether a model or a simulation, or any experiment with such a model, is grounded in some scientific method will also have to be made on a case by case basis. I have already shown that there are no rules for building models, and that the process of building models is largely based on assumptions about what the relevant factors are, how things can be simplified, how we write a program, and so on. The question of whether virtual simulations and virtual models are valid tools for a scientific endeavour is even more problematic, because the debates in the history and philosophy of science have shown that a general scientific method does not exist. Ziman, who largely follows Merton’s normative view of science (Merton, 1942), comments that

Most people who have thought about this all are aware that the notion of an all-conquering intellectual ‘method’ is just a legend. This legend has been shot full of holes, but they do not know how it can be repaired or replaced. They are full of doubt about past certainties, but full of uncertainty about what they ought now to believe (Ziman, 2000, 2).

However, I believe that thoughts by Popper (1959) on how science should operate are still normatively useful. Theories should be formulated such that they are testable, and no magic ingredients and methods are allowed as part of the supporting evidence. This, of course, must also apply to any counter examples and counter arguments. The application of models is a part of the empirical process. Helping to flesh out the details of some theory or to formulate a new hypothesis using models and simulations is also part of a scientific framework (Popper, 1959, 106). The epistemological role of simulations and models in cognitive science in terms of the development of theories is closely linked to questions about scientific theories in general (Peschl and Scheutz, 2001). Nevertheless, I believe that models and simulations are scientific tools, provided ‘good scientific practice’, whatever that may entail, is applied. The claims as to why a particular model or simulation should be judged an adequate representation, and an adequate (suitable) model, need to be examined in each case. We need to check that each part and process of a model can be mapped onto the corresponding part of the real world object or process that is modelled. In the case of a computer model, the elements and links in the data structures and their relations to the object that is modelled also need to be explained. A computer model has to be testable in two ways. Firstly, we can test that the model is adequate in terms of what it models, and secondly, we can test how the model is implemented and whether the implementation itself is adequate.

4.3.2 Levels of Explanation

Models and simulations are targeted at different levels of explanation. A model can be used to explain certain aspects of a neuron, a particular phenomenon within a neuron, or the behavior of a collection of neurons. Another way to specify levels of explanation of models concerns the model itself. Models and simulations have a high level task to explain something. This level is likely to be the most abstract, and much of the model’s implementation and internal workings may be of little interest. If, for example, we are presented with a simulation of the behavior of a few neurons on a computer screen, the actual implementation is of no concern to the observer or experimenter. The neuron simulation works (hopefully) as it should - it should work according to a set of specifications, which the experimenter is aware of. However, there are many layers of programs, library functions, operating system, device drivers, integrated circuits, gates, resistors and wires. The laptop computer which I am using now has several quite different programs for neural simulations stored on it. Most of these models are trying to explain the same thing at the highest or abstract level. They are all about relatively simple artificial neural nets, Hebbian learning, learning algorithms, e.g. back propagation, and so on. The fact that the “neurons” in these programs are mathematical structures involving mostly linear algebra is not essential to know or understand in detail for many users of the computer programs. The implementation of the mathematical engine, the subsystem that evaluates and transforms the matrices, is accessible only to the mathematically oriented computer programmer. Then, of course, there are all the components and systems that are part of the implementation on an actual machine.
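The remark that these “neurons” are mostly linear algebra can be unpacked in a few lines. The sketch below is my own minimal version of a single unit: a weighted sum of its inputs passed through a logistic squashing function. Simulation packages do the same thing with matrices for whole layers at once; the particular weights and inputs here are arbitrary.

    # Sketch: a single artificial 'neuron' is little more than a dot
    # product followed by a squashing function. The weights, bias and
    # inputs below are arbitrary illustrative numbers.

    import math

    def unit_output(inputs, weights, bias=0.0):
        activation = sum(x * w for x, w in zip(inputs, weights)) + bias
        return 1.0 / (1.0 + math.exp(-activation))  # logistic squashing

    if __name__ == "__main__":
        print(unit_output([0.2, 0.9, 0.1], [1.5, -0.7, 0.3]))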

4.3.3 Virtual Models

Experiments that are conducted in AI and computational Neuroscience rarely allow access to the object of inquiry. The question whether evidence from virtual experiments qualifies as empirical is still being debated. One of the issues within this debate concerns the relationship between behavioral models and simulated or virtual objects on one hand, and real world behavior and the real world objects on the other. Are these virtual entities representations of, or are they representative for, the real world object?

Churchland and Sejnowski (1992) note that real worldliness has two principal aspects, namely that the world is more complex, so that scaling up models does not always succeed, and that real world events do not occur in isolation. Consequently, virtual models and virtual experiments lack realism in several ways. Like many other more 'conventional' models, they do not scale well, both structurally and functionally, and the virtual implementation by means of computation reduces the number of similarities to real world objects even further. In a way, computer simulations introduce a second layer of abstraction. The first layer is the abstraction or conceptualizing of real world phenomena into a model. The second is the simulation of the model and its dynamics into the realm of the virtual.

The relationships between computer models and the world can take different forms. Three distinct types of models have been identified: (1) computers are used to deal with theories and mathematical abstractions which would otherwise be computationally intractable, (2) computers provide responses (data) in 'what-if' simulations, i.e. the behavior of a real world physical system is simulated according to some theory, and (3) computers simulate the behavior of non-existing entities, for example the simulation of artificial life (Fox Keller, 2003).

4.4 Summary

Models and simulations are part of what is considered scientific method in the empirical sciences, although it is not clear what the term scientific method actually denotes. In cognitive science there are many phenomena which do not belong to the observable world; but as Peschl and Scheutz (2001) point out

[i]t is exactly this "hidden character" of many cognitive processes which makes this domain so interesting as an object of scientific research (Peschl and Scheutz, 2001).

The fact that many cognitive processes are not accessible for direct or indirect observation is also interesting in terms of what can be modelled and simulated. It is not so much the mode of experimentation: whether real world objects or 'virtual objects' are the targets of the experiment does not seem to be that much of a point of controversy. It is, I suspect, the human contribution during analysis and interpretation that makes the experiment and the results appear to be 'reasonable' in terms of their value as scientific evidence. We should not forget that with the ever increasing complexity of computer hardware and the operating system software, it is impossible for most application programmers to understand much of these system 'operations' in any detail. Some of the users of software that offers a friendly interface for experimentation with artificial neural nets may not understand how the neural nets work on a theoretical level, nor how they are implemented mathematically or as programs. I will explore the properties of artificial neurons and artificial neural nets and the principles of their implementation in the following sections.

Chapter 5

Models of Neurons

Artificial neural nets are typically composed of interconnected "units", which serve as model neurons. The function of the synapse is modeled by a modifiable weight, which is associated with each connection. Most artificial networks do not reflect the detailed geometry of the dendrites and axons, and they express the electrical output of a neuron as a single number that represents the rate of firing - its activity (Hinton, 1992, 181).

Artificial neural nets are composed of many individual units, and the behavior of these units individually affects the behavior of the overall network. The connections between the units in artificial neural nets with various architectures, and the way the neurons interact, are of great interest within many models. Many features of the neural models have been specifically designed to achieve the desired behavior at the network level. There are a number of different neural models, and it would be very difficult to establish a kind of hierarchy or taxonomy by functionality, utility or computational cost. A clear distinction into two classes can only be made at the fundamental level. This distinction is based on the models' prime functional properties and their intended use, and it also coincides, to a large degree, with their computational structure and the complexity of the models. The field of AI typically employs what I would describe as functional models, because the external behavior of a neuron is the only similarity between these particular models and real biological neurons. The computational effort, or computational complexity in terms of difficulty and demand on computational resources, is relatively small for functional models.

I will show that the mathematics at the neural level is quite straightforward. These relatively simple mathematical constructs are of real interest only when they are used in large numbers as primary elements (units) in artificial neural nets. The origins of functional neural models, which focus on the parallel computing and networking aspects, are found in the work by McCulloch and Pitts (1943), Turing (1948), Hebb (1949) and Rosenblatt (1958, 1962). With the exception of Hebb's contribution, much of the early work is strongly influenced by Mathematics, the field of Cybernetics, speculations about artificial intelligence, and activities that belong in Computer Science today. Much more complex models in Neuroscience and in theoretical and computational Neuroscience aim to model and to simulate the biological processes within an individual neuron or the interactions between a very small number of neurons. I will refer to such neural models as biological models. This kind of modelling is largely based on the work by Hodgkin and Huxley (1952) and Hodgkin et al. (1952). The distinction between the functional models and the biological models is however not rigid, because there are some models, like 'spiking rate' models, that borrow features from both paradigms. In this chapter I introduce and describe models of neurons, or "mathematically purified neurones" (Papert, 1988), and I discuss some of their features, functions, and computational properties. The emphasis will be placed on the simpler functional models, because these are the primary elements (units) for artificial neural networks, and they are also the most commonly found models in the Parallel Distributed Processing (PDP) framework.

5.1 Biological Models

Neurons are highly specialized cells, and they are able to perform a range of specific functions. Like all cells they comprise various parts, and much of the physiology is common to all cells. Other constituents, like axons, dendrites, and synapses, are however specific to neurons. Neurons have particular ways of interacting with other neurons and other cells such as muscle fibers. The intra-neural and inter-neural communication and the propagation of information work on chemical and electro-chemical principles. Many of these principles are not fully understood, and much of the actual biology might turn out to be irrelevant for constructing models of cognitive functions.

One of the observed properties of a neuron is its capacity to integrate a number of inputs over time. A neuron receives inputs through its receptors (sensory receptors, or synapses on the dendrites), and it produces a signal to other neurons by sending a spike along the axon, which is in contact with other neurons' dendrites or muscle fibers. A neuron maintains an electrical potential that increases when its synapses are excited by other neurons. The strength of the connection between the neurons (the synaptic strength) and the frequency of other neurons spiking influence the electric potential. If a sufficient number of excitations occur over a period of time, then the neuron will produce an action potential - it will fire and a spike will travel down the axon. If there are very few or no excitations, then the neuron will settle back to its resting potential. This caricature of a functional description of a neuron does not say anything about how any of this might work. There are many textbooks available that deal with the question of how the observable behavior of a neuron can be explained, and the explanations are usually provided in the language of biology and chemistry. The more interesting questions for modelling cognitive functions are whether the living cell is a necessary precondition for cognition, and what kind of information processing neurons perform. We can glean from the quote by Simon (1995) on page 1 that the field of AI rests on the assumption that intelligent behavior is possible without the biological processes in the cells, and that the actual processing of information is the essential part.
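To make this functional caricature concrete, the sketch below implements it directly: a unit that integrates weighted excitation, leaks back toward a resting potential, and emits a 'spike' once a threshold is crossed. It is a minimal illustration of the description above, not a biological model; the function name, parameter values and input sequence are invented for the example.

# A minimal integrate-and-fire caricature of the behaviour described above:
# the unit integrates incoming excitation, leaks back toward a resting
# potential, and emits a 'spike' when a threshold is crossed. All names and
# parameter values here are illustrative assumptions, not measured quantities.

def integrate_and_fire(inputs, threshold=1.0, rest=0.0, leak=0.1):
    """Return the time steps at which the model unit 'fires'."""
    potential = rest
    spikes = []
    for t, excitation in enumerate(inputs):
        potential += excitation                 # integrate incoming excitation
        potential -= leak * (potential - rest)  # decay toward the resting potential
        if potential >= threshold:
            spikes.append(t)                    # the unit fires ...
            potential = rest                    # ... and resets
    return spikes

# Frequent excitation produces spikes; sparse excitation lets the unit
# settle back toward rest, as in the description above.
print(integrate_and_fire([0.3, 0.4, 0.5, 0.0, 0.1, 0.6, 0.6]))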

5.1.1 The Hodgkin and Huxley Model

The Hodgkin and Huxley (1952) model (HHm) relates to the electro-physiology of neural membranes1. In their paper A QUANTITATIVE DESCRIPTION OF MEMBRANE CURRENT AND ITS APPLICATION TO CONDUCTION AND EXCITATION IN NERVE, Hodgkin and Huxley describe

1Hodgkin, Huxley et al. published a series of papers in the early 1950s, four in the year 1952 alone. The 'Hodgkin and Huxley model' is based on all of these papers, although the title A QUANTITATIVE DESCRIPTION OF MEMBRANE CURRENT ... is arguably the most important of the series.

the flow of ions with the aid of equivalent electrical circuits. The circuit for a section of squid axon in figure 5.1 (from Hodgkin and Huxley (1952)) is an example of the model, showing a capacitor, variable resistors and sources of electrons. They offered a detailed mathematical model of the electro-chemical processes of the cell membrane and concluded in their analysis that

[t]he equations [. . . ] were used to predict the quantitative behavior of a model nerve under a variety of conditions which corresponded to those in actual experiments (Hodgkin and Huxley, 1952, 543).

They also say that "good agreement was obtained" for a long list of conditions and situations. In many cases the predictions of the model were validated by measuring the potentials at rest and action potentials in the giant axon in squid and cuttlefish using microelectrodes (Hodgkin et al., 1952).
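For readers who want to see the shape of the model rather than its physiology, the following is a compact sketch of the form of the Hodgkin and Huxley membrane equation with forward-Euler integration. The constants and rate functions are the values commonly quoted in modern textbooks, which use a voltage convention shifted relative to the 1952 paper, so they should be read as illustrative assumptions rather than as the original published figures.

import math

# Sketch of the form of the HH membrane equation:
#   C dV/dt = I - g_Na*m^3*h*(V-E_Na) - g_K*n^4*(V-E_K) - g_L*(V-E_L)
# with each gating variable x obeying dx/dt = alpha_x(V)*(1-x) - beta_x(V)*x.
# Units: mV, ms, uF/cm^2, mS/cm^2; values are the usual textbook set (assumed).

C, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3
E_Na, E_K, E_L = 50.0, -77.0, -54.4

def a_n(V): return 0.01 * (V + 55) / (1 - math.exp(-(V + 55) / 10))
def b_n(V): return 0.125 * math.exp(-(V + 65) / 80)
def a_m(V): return 0.1 * (V + 40) / (1 - math.exp(-(V + 40) / 10))
def b_m(V): return 4.0 * math.exp(-(V + 65) / 18)
def a_h(V): return 0.07 * math.exp(-(V + 65) / 20)
def b_h(V): return 1.0 / (1 + math.exp(-(V + 35) / 10))

def simulate(I_ext=10.0, dt=0.01, steps=5000):
    """Forward-Euler integration for a constant injected current."""
    V, n, m, h = -65.0, 0.32, 0.05, 0.6   # approximate resting values
    trace = []
    for _ in range(steps):
        I_ion = (g_Na * m**3 * h * (V - E_Na)
                 + g_K * n**4 * (V - E_K)
                 + g_L * (V - E_L))
        V += dt * (I_ext - I_ion) / C
        n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
        m += dt * (a_m(V) * (1 - m) - b_m(V) * m)
        h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
        trace.append(V)
    return trace

# A sustained current produces the repetitive spiking the model was built to reproduce.
trace = simulate()
print(round(max(trace), 1), round(min(trace), 1))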

Figure 5.1: Circuit diagram of squid axon

The HHm has been refined, and much more elaborate versions of it, as well as many new models covering different aspects of neurophysiology, are on offer (see for example Gerstner and Kistler (2002), and Dayan and Abbott (2001)). This thesis concerns neural networks, and the detail and scope of models that reflect the internal workings of biological neurons are only of limited interest. However, the generation of the output (spike) and its potential to convey some kind of information is of great importance.

5.1.2 Neural coding

Neurons certainly produce output that is dependent on the inputs of other pre-synaptic neurons. So far, the question of what individual neurons contribute to the flow of information is largely unanswered. What kind of information, if any, is contained in the neurons' activities, and how this information is encoded, is still the subject of speculation. Lytton quips that his book

dances through metaphors, models, speculations, relevant details, possibly relevant details, likely irrelevant details, etc. The 43rd edition of this book is tentatively scheduled for publication in 2212. It will contain a more thorough treatment (Lytton, 2002, 105).

Some electrical signals that are emanating from the brain can be detected and measured with electrodes attached to the head. These 'brain-waves' have a frequency range from about 1 Hz for delta waves to about 25 Hz for gamma waves2. The signals with low frequencies have been thought of as clocking signals, and various frequencies have been observed over different brain regions. It is not known whether these signals carry any information that might directly contribute to cognition. It is certainly unclear what kind of information is carried in the signals (spikes) of individual neurons. The shape of the spikes (see figure 5.2)3 does not change significantly when the signal travels along the axon of the neuron. The shape also does not change when the rate of activation changes significantly. Gerstner and Kistler (2002) conclude that, for the conveying of information, the amplitudes and the shapes of the spikes seem of lesser importance than the number and the temporal spacing of the spikes. The information that is carried in the neuron's output could be coded in several ways. Firstly, all or some of the information may be coded in the rate of the spikes, or in the timing of the spikes. Again there are several possible ways in which information could be contained, or coded, in spike timing. Such principles might include the time delay between spikes, the shift in phase relative to a reference signal, or the synchronization of outputs from pairs or groups of neurons.

2Higher frequencies cannot be detected or shown on an electroencephalograph. This is true for some gamma waves, which may go beyond 30 Hz (Lytton, 2002; Bear et al., 1996).
3This figure contains output from the program NEURON (http://neuron.duke.edu/).

Figure 5.2: Spike of a neuron

Although the term firing rate is not clearly defined, the term rate is usually used to describe the number of events occurring over a fixed period of time. A neuron is theoretically limited to short bursts of about 1000 spikes per second4, and the number of spikes of a neuron is not constant even for very short fixed periods of time. Instead, neurons seem to be spiking in bursts, so that averaging the firing rate over long periods (100 ms, or 500 ms) distorts the picture. Without a better understanding of the coding of information inside neurons, and of the flow of information between neurons and in entire networks, it is very difficult to provide explanatory models. Gerstner and Kistler (2002) think that

[i]t should be clear that modelling cannot give definite answers to the problem of neural coding. The final answers have to come from experiments. One task of modelling may be to discuss possible cod- ing schemes, study their computational potential, exemplify their utility, and point out their limitations (Gerstner and Kistler, 2002, 27).
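The dependence of the apparent 'rate' on the averaging window, mentioned above, is easy to make concrete. The following small illustration is my own and is not taken from the cited authors; the spike times are invented for the example.

# How the choice of averaging window changes the apparent firing 'rate'
# when spikes arrive in bursts. Spike times are in milliseconds (invented).

def rates(spike_times_ms, window_ms, total_ms):
    """Spike counts per window, converted to spikes per second."""
    counts = [0] * (total_ms // window_ms)
    for t in spike_times_ms:
        counts[int(t) // window_ms] += 1
    return [c * 1000.0 / window_ms for c in counts]

# Two tight bursts within half a second of otherwise silent activity.
burst = [10, 12, 14, 16, 18, 310, 312, 314, 316, 318]

print(rates(burst, 20, 500))    # short windows: peaks of 250 spikes/s, zero elsewhere
print(rates(burst, 500, 500))   # one 500 ms window: a misleading average of 20 spikes/s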

Models for high-level cognitive functions, which I describe in chapter 8 on page 171, largely ignore the coding issue altogether5, and much more abstract views of information, the flow of information and semantic content are assumed. The role of the model neuron also shifts toward a unit that can carry semantic information either individually (localized representations) or in groups (distributed representations). The neural models concerning the physiology and biology of cells are far too complex and computationally too expensive for this task (see 5.3 on page 118).

4The absolute refractory period is the period of time after a firing in which the cell is unable to produce another action potential. There is also a relative refractory period, which follows the absolute refractory period, where depolarization (firing) is only possible with a higher voltage (Bear et al., 1996; O'Reilly and Munakata, 2000).
5Some notable exceptions can be found in the literature. Shastri and Ajjanagadde (1993), for example, suggested a network architecture in which the synchronized arrivals of pulses from various units are interpreted as representations. While this may not be biologically plausible in this particular form, the model uses pulses (spikes) and their timing very effectively for coding of information.

5.2 Mathematical Neurons

Earlier in section 4.1.1, I described the processes of abstraction, formalization, generalization and simplification. Real neurons are very complicated things when we try to describe them at the level of their biology and chemistry. Even the mathematical and computational models at that level are complex and computationally expensive. Artificial networks comprising only a modest number of units that are based on the HHm would become computationally intractable. Abstraction and simplification have to go further so that the unit's functionality becomes easier to describe and becomes more 'user-friendly'. At the same time, the computational effort has to be reduced, so that we can deal with units in large numbers. Unfortunately, high levels of abstraction and simplification are not possible without producing a very distorted picture of the functionality of the neuron. There is a chance that in the process of making things simple enough for people to understand and to work with, one or more of the fundamentally important properties have been neglected. The actual functionalities and workings of biological neurons have been put into the form of a mathematical neural model, and it is certainly simple enough to combine many of these units into neural nets. Whether this model is accurate enough to count as an adequate representation of a real neuron is the subject of this section.

Current models are very similar to the perceptron by Rosenblatt (1958, 1962). Rosenblatt incorporated two very important ideas into the perceptron. The perceptrons were two-layer, or multi-layer, networks that were well suited for pattern recognition. The most important part of this work was arguably the formulation of the perceptron convergence theorem, which showed that the learning algorithm of a perceptron with linear units will find a solution, if a solution for the particular problem exists. Current neural models are implemented mathematically as a sequence of four distinct steps: (1) the neuron's inputs are evaluated according to their individual weights, which are analogous to the synaptic strength in real neurons; (2) the sum of the weighted inputs is passed through (3) an activation function, which essentially determines whether this neuron produces some (4) output, i.e. whether it fires, and what strength this output should be. The type of activation function varies between different models, but they typically take one of four general forms: the threshold function, the linear function, and the squashing functions, namely the piecewise linear function and the sigmoid function.

Implementation

AI model neurons can be described mathematically and computationally as a set of two functions. The model neuron is described by

u = \sum_{j=1}^{n} w_j i_j \qquad \text{and} \qquad N_{out} = \varphi(u + b)

where i_1, i_2, \ldots, i_n are the input signals and w_1, w_2, \ldots, w_n are the associated synaptic weights. The output of the neuron N_out is the result of applying the activation function ϕ to the input u, which is the sum of the weighted inputs, and a bias b. The bias is a fixed value input that is used to adjust the average output. The activation function ϕ is sometimes also called the transfer function or the output function. Although there is a large number of suitable functions, the vast majority of models over the years utilized functions that are derived from one of the following four. The threshold function

\varphi(v) = \begin{cases} 1 & \text{if } v \ge \tfrac{1}{2} \\ 0 & \text{if } v < \tfrac{1}{2} \end{cases}

which has switch-like characteristics, was used in the earliest models of McCulloch and Pitts (1943). The linear function

\varphi(v) = cv

was found in the perceptrons (Rosenblatt, 1958), while the piecewise linear function

\varphi(v) = \begin{cases} 1 & \text{if } v \ge 1 \\ v & \text{if } 1 > v > 0 \\ 0 & \text{if } v \le 0 \end{cases}

is a more recent development. The sigmoid functions, such as

\varphi(v) = \frac{1}{1 + e^{-cv}}

are part of the development of the back-propagation algorithm in the 1980s6. Let me provide an example with real numbers to show that the calculations, which need to be performed for this kind of neural model, are simple and straightforward. The arithmetic can be done with pen and paper, although a calculator is handy for finding the value of e^x. The computer is, of course, an essential item when we need to perform these calculations many times for a large number of neurons. A network may have hundreds of units, and during the learning phase, each unit may have to be re-calculated thousands of times in order to adjust the weights. The neuron in figure 5.3 has six inputs (i) with the current values 1.2, .8, -1.1, .25, -.6, .55; the corresponding weights (w) for the connections are .6, 1.1, -.1, 2.1, .2, -.3; and the bias (B) is set to 1.0. The first step is to calculate the sum of the weighted inputs, which is the dot product7 of the input and the weight vector (i \cdot w).

6The name sigmoid function relates to the shape of their graphs. The logistic distribution function \varphi(v) = \frac{1}{1 + e^{-cv}} is used to describe the 'law of logistic growth'. Although the Gaussian, for example, offers a better fit for data concerning logistic growth, the logistic function and its derivative are much simpler (and simpler to work with) than the Gaussian and the corresponding error function \varphi(v) = \frac{2}{\sqrt{\pi}} \int_{v}^{\infty} e^{-t^2} \, dt. A particular function may be better suited for a specific implementation of a learning algorithm, but they all exhibit the features of being differentiable and having asymptotes.
7The use of matrices and vectors is common in the context of artificial neural nets (see section 6.1.5). The expressions u = i \cdot w and u = \sum_{j=1}^{n} w_j i_j are merely different notation for the same thing.


Figure 5.3: AI neuron example

[1.2, \; .8, \; -1.1, \; .25, \; -.6, \; .55] \cdot [.6, \; 1.1, \; -.1, \; 2.1, \; .2, \; -.3]^{T}

= 1.2 \times .6 + .8 \times 1.1 + (-1.1) \times (-.1) + .25 \times 2.1 + (-.6) \times .2 + .55 \times (-.3) = 1.95

The output of the neuron is now calculated by passing the sum of the weighted inputs (1.95) and the bias (1.0) into the activation function ϕ. If the sigmoid function is selected as the transfer function, as suggested in figure 5.3, the output of the neuron will be \approx .950. It is obvious that the same inputs and weights produce different output values, depending on the choice of activation function (see figure 5.4 for sketches). If we use the threshold function as the activation function, then the neuron's output would be 1. Remember that the threshold function returns either 0 or 1, depending on whether the sum of the weighted inputs and the bias (u + b) is greater than or equal to \tfrac{1}{2}. The linear function would return a value that is proportional to the sum of the weighted inputs (u). The value N_out, which represents the neuron's output, is also controlled by the constant c and the bias b.
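The arithmetic of the worked example can be checked in a few lines of Python. The inputs, weights and bias are the ones given for figure 5.3; c = 1 is assumed for the sigmoid, since the example does not state c explicitly but the quoted output of approximately .950 is consistent with that choice.

import math

# Checking the worked example above. Inputs, weights and bias are from
# figure 5.3; c = 1 for the sigmoid is an assumption consistent with 0.950.

inputs  = [1.2, 0.8, -1.1, 0.25, -0.6, 0.55]
weights = [0.6, 1.1, -0.1, 2.1, 0.2, -0.3]
bias = 1.0

u = sum(w * i for w, i in zip(weights, inputs))   # weighted sum (dot product)
print(round(u, 2))                                # 1.95

def sigmoid(v, c=1.0):
    return 1.0 / (1.0 + math.exp(-c * v))

def threshold(v):
    return 1 if v >= 0.5 else 0

print(round(sigmoid(u + bias), 3))   # approximately 0.950
print(threshold(u + bias))           # 1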

The functionality of the models and their behavior in simulated networks are directly related to the properties of the activation functions. The linear function ϕ(v) = cv, for example, causes the output of a neuron to take on very high values for a combination of a high value input and a high value for the corresponding weight. The output will be propagated to the neurons in the next layers, where all other inputs are essentially swamped by this very large value. The same swamping occurs at the other extreme, with very large negative sums, which happens if one of the factors (input or weight) is negative and both factors have large absolute values.

Figure 5.4: Activation functions (linear, step, piecewise linear, sigmoid)

In a network comprising units with the step function as activation function, the input vectors of each unit (i) will consist of elements that are either 0 or 1. The inputs are transformed into a single value (i \cdot w) in the first part of the calculation, which is common to all neurons of this kind. Consequently, a combination of a large input value and a large weight will also produce a very large value for the sum of the weighted inputs. The behavior of the unit within a network is nevertheless different from that of the linear unit, because the output is limited to the values 0 and 1. Other neurons cannot be swamped, which is a valuable feature of the step function. However, the step function is not differentiable everywhere. Many learning algorithms, error back-propagation8 in particular, require the

derivative of the activation function to determine the direction of the small change δ. During the process of updating the connection weights w, each particular weight w_n will be changed by the amount δ_n, which is a combination of w_n and the learning rate. The direction of change determines whether w_n will be increased or decreased by the amount δ_n. The piecewise linear function is a workable compromise in that it limits the extreme values to 0 or 1 respectively, but it is linear and therefore differentiable in the range 1 > v > 0. The sigmoid function is a very elegant solution and an improvement over the piecewise linear function, because, unlike the piecewise linear function, it is continuous and differentiable at every point over its domain from −∞ to +∞, while its range is limited to (0,1). It is therefore a suitable function for applications using the back-propagation algorithm9. The particular need for an activation function to be differentiable is not so much a design feature of the neural model, but more a constraint that is dictated by the learning algorithm. Artificial neurons have to conform to the constraint, even if it turns out that the model does not need this feature to represent an actual neuron. This particular requirement is an example where the mathematical and computational principles used in the learning algorithm of the network become a constraint for this kind of neural model. The activation function that was used in the early perceptrons, i.e. the step function, is unsuitable for this application. Since the introduction of the back propagation algorithm, members of the family of sigmoid functions (e.g. tansig or logsig) are now universally employed in networks using various forms of back propagation learning algorithms. When we combine the appropriate formulas, the most common neuron model can then be fully described in a single formula. The output of the neuron N_out for a given set of inputs i_{1 \ldots n} is

N(i_{1 \ldots n}) = \frac{1}{1 + e^{-c \left( \sum_{j=1}^{n} w_j i_j + b \right)}}.
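One property of this function is worth making explicit, since it underlies the remarks above about differentiability. It is a standard observation about the logistic function rather than a claim taken from any particular model discussed here: the derivative can be written in terms of the function's own output,

\frac{d\varphi}{dv} = \frac{c \, e^{-cv}}{\left(1 + e^{-cv}\right)^{2}} = c \, \varphi(v) \bigl(1 - \varphi(v)\bigr),

so the gradient needed for a back-propagation weight update can be computed directly from a unit's output, without evaluating the exponential a second time.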

8The error back-propagation algorithm is described in some detail in Rumelhart et al. (1986), Russell and Norvig (1995), Shultz (2003), McLeod et al. (1998), and Haykin (1999).
9Of course, there are many other suitable functions that exhibit the property of being continuous and differentiable.

Early attempts to build intelligent systems and models of cognitive functions were very limited because of the digital, switch-like behavior of the units. Especially neural models of the all-or-nothing type with unsophisticated transition functions, e.g. the step function, offer little beyond what resembles digital circuitry. Individual perceptrons10 have the additional property of being adaptive. These units can change their behavior when a network is trained, unlike the static McCulloch and Pitts (1943) model neurons. Nevertheless, studying individual model neurons from the perceptron family does not tell us much about intelligence, cognition or behavior. Rosenblatt explains that

[i]t is significant that the individual elements, or cells, of a nerve network have never been demonstrated to possess any specifically psychological functions, such as "memory", "awareness", or "intelligence". Such properties, therefore, presumably reside in the organization and functioning of the network as a whole, rather than in its elementary parts (Rosenblatt, 1962, 9).

Rosenblatt clearly expresses the notion that there has been no evidence for the existence of 'grandmother' cells, although more recent findings seem to suggest that individual neurons may be related to particular concepts much more closely than previously thought (Martindale, 2005). I suggest that Rosenblatt's sentiment also applies to artificial neurons of the kind that are commonly used today. Using non-linear transfer functions in the models has several benefits, but as Elman points out, the benefits are in their behavior at the network level.

For some ranges of inputs [. . . ] these units exhibit an all or nothing response (i.e., they output 0.0 or 1.0). This sort of response lets the units act in a categorical, rule-like manner. For other ranges of inputs, however, [. . . ] the nodes are very sensitive and have a more graded response. In such cases, the nodes are able to make subtle distinctions and even categorize along dimensions which may be continuous in nature. The nonlinear response of such units lies at the heart of much of the behavior which makes networks interesting (Elman et al., 1998, 53, original italics).

Artificial neurons with nonlinear and continuous transition functions can be "very sensitive" and may have "a more graded response", but are they any more 'nature-like' than previous neural models? In the following section, I will discuss whether AI type neural models, i.e. the mathematical neurons, are adequate representations of real neurons.

10The term perceptron is interchangeably used to refer to single units and to entire networks comprising such units.

5.3 Relation to Reality

Approaching the modelling and simulation of neurons with digital machines, i.e. machines that run mostly in an algorithmic mode, limits the models to mathematical constructs. The question of how much 'nature' is captured by the commonly used mathematical neural models is important, because it is this type of neuron that is used to build artificial neural networks. The biological models are, without doubt, much more descriptive at the level of the biology of real neural cells. Like the HHm, biological models target a specific phenomenon or aspect of a neuron, and they are not models of an entire neuron. There are some network models that explore the stochasticity of spiking and the dynamics of spike trains in networks. Dayan and Abbott (2001) point out that

[t]he most direct way to simulate spiking neural networks is to [. . . ] synaptically connect model spiking neurons. This is a worthwhile and instructive enterprise, but it presents significant computational, calculational and interpretational challenges (Dayan and Abbott, 2001, 230).

Even with a small number of neurons it is not possible to simulate the dynamics of spiking neurons in real time. However, this is not just a problem of insufficient computer power. An artificial network comprising even a modest number of spiking neurons exhibits dynamics of such complexity that it is impossible to provide a sensible interpretation. Moreover, spiking neural models have a large number of free parameters that not only influence the behavior of the individual neuron, but can (and will) also change the dynamics of the entire network. The changes in the network dynamics cannot be sensibly explained in terms of changes to a couple of parameters of a single neuron. In 'firing rate' network simulations, a number of spiking neural models are put together in a simplified form. The models use the firing rate as an indication of the neuron's activity, and "the greater apparent precision of spiking models may not actually be realized in practice" (Dayan and Abbott, 2001, 230). The biological neural model has to be simplified and an additional level of abstraction has to be introduced if we want to simulate networks. Computational constraints, and the fact that it is impossible to offer meaningful interpretations of the complex network dynamics, force us to utilize biologically inferior mathematical, or functional, neurons for models at cognitive levels.

How much similarity to a real biological neuron is there in a functional neuron? This similarity is very difficult to describe, because the comparison has to be made at a similar level of description. Looking at the materials which the real neurons and the artificial neurons are made of, there is no similarity at all. We cannot compare biological structures with mathematical structures in terms of materials or composition. Lytton puts it succinctly when he says that "computers are made from sand and metal, while brains are made of water, salt, protein and fat" (Lytton, 2002, 13). Any comparison must therefore be restricted to an analysis and description of their behavior and their functionality. Essentially, we have to take a black box approach and only consider the observable and abstract equivalent behavior of the box. In section 4.1.1, I described the processes that are involved when building mathematical models. Depending on the kinds of assumptions that are made about the importance of particular biological properties, abstract models take on quite different forms. Real neurons have many different properties that are not reflected in the constructs used for simulation or modelling with artificial neural nets. Especially for the functional models that I have described in some detail, little of the actual biology has been taken into account. Perceptrons and the more recent developments offer a rather crude implementation of a neuron's functionality. The functionality is restricted to an idealized view of a neuron's input-output behavior, and the loss of detail is not only restricted to what is inside the 'black box'. The input-output behavior is transformed to a simple sum and squash function.
Unlike the functional models, (1) single real neurons do not respond simultaneously to all of the inputs, (2) real neurons do not have both excitatory and inhibitory synapses (Dale's principle), and (3) real neurons do respond to the firing-rate (Churchland and Sejnowski, 1992; Churchland, 1993). Other major differences between biological neurons in brains and artificial neurons in neural nets relate to the number of connections between neurons. Neurons have several thousand synapses (Stillings et al., 1995, 277)11, depending on the type of neuron and their location in the brain, while artificial neurons have comparatively very few (typically less than a hundred, occasionally a few hundred in large networks).

11An upper value of 10,000 connections is provided in Carter (1998, 38), and a value of 100,000 is suggested by Dayan and Abbott (2001, 4) for Purkinje cells.

Particularly the shift of the model neuron to carry semantic content either as (1) a localized representation of a phenomenon, physical entity or concept, or (2) as a contributor of content to a distributed representation requires a giant 'leap of faith'. These models of neurons, and also the networks that are constructed with them, have very little in common with cells and brains. But, as Rosenblatt has pointed out, that is not the aim, as

[p]erceptrons are not intended to serve as detailed copies of any actual nervous system. They are simplified networks, designed to permit the study of lawful relationships between the organization of a nerve net, the organization of its environment, and the "psychological" performance of which the network is capable (Rosenblatt, 1962, 28).

I believe that this is true for the perceptron (network) as well as for the single perceptron (mathematical and functional model of a single neuron), because the simple functional neural models have different roles to play. In the context of modelling cognitive functions, any detailed description of neurons in terms of their biology, physiology and electro-chemistry is not likely to be of any help at all.

5.4 Summary

We can identify two fundamentally different approaches to neural modelling. The modelling techniques concern broadly either the physiology of neurons or the dynamics and representational properties of neurons in artificial networks. The models that focus on the physiology and the dynamics of neurons are mathematically complex, because the neurons themselves are very complex, and these models describe the processes at very low levels. Biological models are couched in the language of biology, biochemistry and cell physiology. There are many different models for specific aspects of neurons, but a large model that would incorporate all of the various models is beyond our capabilities. While it is important to understand neurons in detail, the techniques and the models are too complex for models referring to cognition. The complexity of the models calls for considerable computing power if we want to use even a modest number of model neurons in a network. The complex nature and the unpredictable behavior of networks with spiking model neurons make it very difficult, if not impossible, to provide a meaningful interpretation of the network dynamics.

We must accept (for now) that the study of the behavior of large numbers of interconnected neurons is only possible with relatively simple functional models. Their structure and functionality are influenced by computational constraints. The simplicity of the functional models allows for the use of many units in a single network, because the simplicity also translates into a low computational 'cost' in terms of storage requirements (memory) and processing speed. Furthermore, there are design constraints that depend on the planned use of the model. The choice of activation function, for example, is limited by the learning algorithm for the particular network. If the network is going to be trained with one of the back-propagation algorithms, then the neural model must be implemented using a differentiable activation function. The most important result of this chapter is that the simple neurons have very little in common with real neurons, and that some of their properties and behaviors are influenced by network design issues. The next chapter deals with models of collections of neurons (artificial neural nets), some of the principal network architectures, and their properties.

Chapter 6

Artificial Neural Nets

The first thing to be said about connectionist networks is that they are really quite simple, but their behaviors are not. That is part of what makes them so fascinating (Elman et al., 1998, 50).

The Parallel Distributed Processing (PDP) paradigm, which was introduced by Rumelhart and McClelland (1986a) and McClelland and Rumelhart (1986), and connectionism in general, rely on artificial networks as 'technical vehicles' (Hoffmann, 1998). Neural nets are not only tools that are used in the field of AI for engineering intelligent connectionist systems, but they are also tools with which connectionist cognitive architectures can be modelled. The term Connectionism presently has two different meanings, namely (1) a philosophical viewpoint that representations in the brain are distributed and that all processing of information occurs in parallel fashion (see section 1.3 on page 9), and (2) a different way of approaching the possibility of an AI. In this chapter I concentrate on the kind of neural network principles and architectures that are dominant in the context of cognitive modelling. I also first present a short summary of the ideas that led to the development of the kind of neural nets commonly used today. The principles of operation, the implementation, and some of the key properties of feed-forward networks (FFN) and simple recurrent networks (SRN) are introduced in the second section. A comparison with real neural nets and alternative implementations of neural networks follows. I argue in section three that FFNs and SRNs can be used as practical solutions to many data driven models, and that they can be considered as universal solutions.

6.1 Background

The early successes in using computational, or mathematical, models of neurons and networks are largely due to the work by McCulloch and Pitts (1943). They described a calculus of all or nothing type logical neuron models. These early neural models were based on the simple threshold function, which I described on page 112. The simple 'neuron like' computational units by McCulloch and Pitts had been used to successfully implement small networks composed from elementary functions like AND, OR, and NOT (a minimal sketch of such a unit follows the list below). Turing had a very similar approach and discussed neural networks in INTELLIGENT MACHINERY (Turing, 1948), and the networks were made up of "two-state boolean 'neurons'" (Copeland, 2004, 408). These networks were also the first computational models aimed toward simulating cognitive tasks. All of these networks were constructed as task-specific entities, similar to designing electronic circuits using elementary logic gates. The core assumptions for the McCulloch and Pitts (1943) models were

1. The activity of the neuron is an 'all-or-none' process.
2. A certain fixed number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position of the neuron.
3. The only significant delay within the nervous system is synaptic delay.
4. The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time.
5. The structure of the net does not change with time (McCulloch and Pitts, 1943, 353).
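A minimal sketch of a unit that respects these assumptions is given below (this is my illustration, not code from the 1943 paper): output is all-or-none, a fixed number of active excitatory inputs is needed to reach the threshold, and an active inhibitory input absolutely prevents firing. The thresholds chosen for AND, OR and NOT are illustrative.

# A McCulloch-Pitts style unit under the assumptions listed above:
# all-or-none output, a fixed excitation threshold, absolute inhibition.
# The thresholds chosen for AND, OR and NOT are illustrative.

def mp_unit(excitatory, threshold, inhibitory=()):
    """Fire (return 1) iff no inhibitory input is active and the number of
    active excitatory inputs reaches the threshold."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0

AND = lambda x, y: mp_unit([x, y], threshold=2)
OR  = lambda x, y: mp_unit([x, y], threshold=1)
NOT = lambda x:    mp_unit([1], threshold=1, inhibitory=[x])  # constant excitation, inhibited by x

for x in (0, 1):
    for y in (0, 1):
        print(x, y, AND(x, y), OR(x, y))
print(NOT(0), NOT(1))   # 1 0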

A major contribution to the development of networks was THE ORGA- NIZATION OF BEHAVIOR by Hebb (1949), in which he postulated ideas about how physiological processes relate to the interactions between connected neurons. In order to explain the necessary physical changes to make memory persistent, he hypothesized that the synaptic strength between neurons changes with the frequency of their firing. Hebb ex- plained that

[w]hen an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process

or metabolic change takes place in one or more cells such that A's efficiency, as one of the cells firing B, is increased (Hebb, 1949, 324).

The principal idea of Hebb’s hypothesis is the basis for much of the suc- cess of model neurons and model neural nets. While physiology can explain the detail, at least in part, many contentious issues remain about whether in human brains ‘Hebbian learning’ actually occurs. The important message emerging from Hebb’s work for the modellers is that neurons are not static and some parameters like their connec- tion strength can be modified. Real neurons are able to change the connection strength to other neurons (synaptic strength), and some of the changes are due to and dependent on the behavior of other neu- rons. Neurons change according to what happens around them - they learn. In the building of neural models this Hebbian learning through changing the synaptic strengths is translated into adjusting the connec- tion weights, which are updated by the learning algorithm during the learning process. In real neurons the process of changing the connection strength is a slow process, because it involves changes to the physiology of the cells. In fact, things must actually grow. Hebb points out that

to account for the permanence [of memory], some structural change seems necessary, but a structural growth presumably would require an appreciable time (Hebb, 1949, 324).
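In the neural modelling literature the usual translation of Hebb's postulate, as noted above, is a weight update proportional to the product of pre- and post-synaptic activity. The sketch below shows only that simple rule; the learning rate and the activity traces are invented for illustration, and nothing in it reflects the slow structural growth Hebb is describing.

# The simplest modelling translation of Hebb's postulate: the connection
# weight grows whenever the pre- and post-synaptic units are active
# together (delta_w = eta * pre * post). Learning rate and activity traces
# are invented for illustration.

def hebbian_update(w, pre, post, eta=0.1):
    return w + eta * pre * post

w = 0.2
activity = [(1, 1), (1, 1), (0, 1), (1, 0), (1, 1)]   # (pre, post) pairs
for pre, post in activity:
    w = hebbian_update(w, pre, post)
print(round(w, 2))   # strengthened only by the co-active steps: 0.5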

Because it takes a relatively long time for physiological changes to occur, there is a need for some mechanism to bridge the delay between the time when a stimulus (an idea, a representation, some bit of information) is present and the time it has been made permanent in the connections. Some processes are needed to provide a mechanism of temporary storage before memory is laid down. Hebb suggested a process that keeps the information dynamically 'alive' with reverberatory processes1. Memorization is achieved once the synapses are altered to encode the information permanently. We must assume that this process is scalable, and that it works also in a similar fashion for groups of neurons.

1Early electronic computers used delay lines as storage that worked on this principle. These were long tubes filled with mercury, with an electromagnetic device on one end which would send a series of pulses (a bit pattern) along the tube. On the other end another electromagnetic device was used to pick up these signals and feed them back into the device at the other end. The propagation of the signals through mercury is relatively slow, so that several bits of information were traveling down the tube at any one time. The entire assembly acts as a dynamic data store, if the timing (spacing) of the pulses is correct.

If a single neuron causes another neuron to fire repeatedly, and the connection between the two is strengthened, then this also holds for a collection of neurons, that is, a neural net. The synaptic strengths of the many connections within the neural nets effectively form a memory of previous neural activity. Although Hebb's work had been influential in the field of psychology, the engineering community largely ignored it at the time (Haykin, 1999). Later, the work by McCulloch and Pitts together with Hebb's theory led to the inception of the perceptron2. Rosenblatt recognized the inability to learn as the main limitation of a McCulloch and Pitts neuron. He noted that

[t]he postulates [concerning the McCulloch and Pitts model] rule out memory except in the form of modifications of perceptual ac- tivity or circulating loops of impulses in the network (Rosenblatt, 1962, 15).

The "circulating loops of impulses" that are referred to by Rosenblatt are very much like Hebb's 'reverberatory' processes. Rosenblatt assumed for the perceptron that neurons are of the 'all or nothing' type, and that the architecture of the simple networks comprised a set of functional layers. Sensory inputs are fed into a single layer of S-units (sensory units), and their outputs, or responses, are passed through a set of random connections to a layer of A-units (association units). The final layer of R-units (response units) feeds the result of the operation to the outside of the network. Teaching these networks is accomplished using the delta-rule, which was developed by Rosenblatt (1959).
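A sketch of what delta-rule training amounts to for a single threshold unit is given below: each weight is nudged in proportion to the error between the target and the actual output. The learning rate, the initial weights and the OR training set are illustrative choices of mine, and the sketch deliberately omits Rosenblatt's separate S-, A- and R-unit layers.

# Delta-rule learning for a single threshold unit (threshold 1/2, matching
# the threshold function introduced in chapter 5). Learning rate, initial
# weights and the OR training set are illustrative choices.

def train_delta(samples, eta=0.2, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b >= 0.5 else 0
            err = target - out                  # error drives the weight change
            w[0] += eta * err * x1
            w[1] += eta * err * x2
            b += eta * err
    return w, b

OR_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_delta(OR_set)
for (x1, x2), target in OR_set:
    print(x1, x2, target, 1 if w[0] * x1 + w[1] * x2 + b >= 0.5 else 0)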

6.1.1 The limitations

The connections in the network are adjusted such that the error between the input values and the target values for the function to be implemented by the perceptron is minimized.

2The term perceptron is used to describe either a single neural model that follows Rosenblatt's model, or an entire single layer feed-forward neural network of neural models of this type. I will not differentiate either, as the use of the term in context will resolve this ambiguity. Rosenblatt intended to use the term 'perceptron' to describe 'theoretical nerve nets' and pointed out that the term had an "unfortunate tendency to suggest a specific piece of hardware" (Rosenblatt, 1962, v).

Figure 6.1: AND/OR-functions

The addition of learning, or the "modification of selected synapses" (Rosenblatt, 1962), was the most important property of this new neural model and the new family of neural networks. The static McCulloch and Pitts logical model and the Turing network have developed into a dynamic neural model3. A network that is composed of this kind of neuron is, in modern terms, classed as a single layer feed-forward network. This simple architecture is very limited in what kind of functions it can deal with. Unlike the AND, OR, and NOT function, the elementary XOR function cannot be implemented using Rosenblatt's two-layer perceptron. The XOR function for the two given inputs X and Y yields (X ∨ Y) ∧ ¬(X ∧ Y). When we draw a simple diagram of the inputs and outputs of these functions in the Cartesian plane (figures 6.1 and 6.2), we can see that it is obviously possible to draw a line to separate the true (•) conditions from the false (◦) conditions for the AND function and the OR function. For the XOR function (figure 6.2) this is not possible. The XOR function cannot be implemented, because a two-layer perceptron can only learn to classify linearly separable data. The parameters of a linear perceptron allow only for the position and the gradient of the line to vary. The combination of linear units and the delta rule as the learning algorithm provides little more than linear regression. The performance of the network is largely dependent on the data, and Stone (1986) states that

[i]n particular, wherever a linear regression is insufficient to pro- vide a good account of the relationship between input and output patterns, the system will perform poorly (Stone, 1986, 458).

3Turing (and to a lesser degree McCulloch and Pitts) proposed learning in neural nets in Turing (1948), but these considerations were purely theoretical. The first computer simulation of a neural net did not occur until 1954 (Copeland, 2004, 405).

Figure 6.2: XOR-function
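Running the same delta-rule trainer sketched earlier on the XOR truth table illustrates the point of figure 6.2: no setting of the weights and bias classifies all four patterns correctly, so at least one pattern always remains wrong, no matter how long training continues. The epoch count below is arbitrary.

# The delta-rule trainer from the earlier sketch, applied to XOR: however
# long it runs, at least one of the four patterns stays misclassified,
# because no single line separates the true from the false cases.

def train_delta(samples, eta=0.2, epochs=1000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b >= 0.5 else 0
            err = target - out
            w[0] += eta * err * x1
            w[1] += eta * err * x2
            b += eta * err
    return w, b

XOR_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w, b = train_delta(XOR_set)
wrong = sum(target != (1 if w[0] * x1 + w[1] * x2 + b >= 0.5 else 0)
            for (x1, x2), target in XOR_set)
print(wrong, "of 4 XOR patterns remain misclassified")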

After the promising start, research into artificial neural nets came nearly to a halt at the end of the 1960s. Haykin (1999) cites the lack of personal computers and the publication PERCEPTRONS by Minsky and Papert (1969) as reasons for the “lag of more than 10 years”. Minsky and Papert showed in their book that the computational power of single layer feed-forward nets is restricted to certain classes of functions. Inter- estingly, Minsky and Papert clearly stated in 1969 that the restrictions and limitations do not apply to all configurations of perceptrons and not to all classes of problems. Nevertheless, the release of their work placed doubts into the minds of many, including those who were able to seri- ously slow down the research. Haykin comments that Minsky and Pa- pert’s book “certainly did not encourage anyone to work on perceptrons, or agencies to support the work with them”. It seems that the inter- pretation and the subsequent comments by others gave PERCEPTRONS more impetus than originally intended. In their second edition Minsky and Papert expressed concerns that the publication had “the reputation of being mainly concerned with what perceptrons cannot do, and having concluded with a qualitative evaluation that the subject was not impor- tant” (Minsky and Papert, 1969, 248). Their intention was to expose the fledgling fields of Computer Science and Cybernetics to scientific rigor. Under the heading Cybernetics and Romanticism, Minsky and Papert explain that

[o]ur discussion will include some rather sharp criticisms of earlier work in this area [. . . ] Most of [the writing about perceptrons in books, journal articles, and voluminous reports] is without scientific value [. . . ] The science of computation and cybernetics began, and it seems quite rightly so, with a certain flourish of

romanticism. They were laden with attractive and exciting new ideas which have already borne rich fruit. Heavy demands of rigor and caution could have held this development to a much slower pace [. . . ] But now the time has come for maturity, and this requires us to match our speculative enterprise with equally imaginative standards of criticism (Minsky and Papert, 1969, 4).

Research into neural nets and connectionism had a renaissance after the "remorseless analysis of what perceptrons cannot do" (McLeod et al., 1998, 323), beginning with the development of the back propagation algorithm. The back propagation algorithm was first introduced by Werbos in 1974, although it only became widely known through the publication by Rumelhart et al. (1986)4. This algorithm5 is a form of reverse-mode gradient descent computation, which works on the principle of distributing the blame for the error between the target output and the calculated output of a neuron according to its contribution to the error. The development of artificial neural nets has been a sequence of mathematical and computational enhancements. The simple networks of switch-like model neurons have developed into multi-layer networks with non-linear units, and the teaching algorithms have changed from elementary logic to gradient descent back propagation. As these developments took place, the complexity of the algorithms and the computational cost of artificial neural nets has steadily increased. The increase in complexity was necessary to overcome some of the obvious limitations, and the development of cheap and powerful computing machines (and later more powerful personal computers) made it possible to deal with networks of considerable size.

4The back propagation algorithm is introduced in LEARNING INTERNAL REPRESENTATION BY ERROR PROPAGATION by Rumelhart et al. (1986). The back propagation algorithm is based on earlier work by Rosenblatt (1962) on Perceptrons, and Widrow and Hoff on Adaptive Linear Elements or ADALINEs, i.e. the Delta rule, in the 1960s.
5It is probably better to speak here of a family of algorithms. While the general principle of back-propagation is common, many details of a particular implementation of the algorithm are governed by different gradient descent methods.
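A minimal sketch of what gradient-descent back propagation adds is given below: one hidden layer of sigmoid units and a sigmoid output unit trained on the XOR task that defeats a single-layer perceptron. The layer size, learning rate, epoch count and random seed are arbitrary choices, and this is the generic textbook form of the algorithm rather than any particular published implementation; with most initializations the net settles on the XOR mapping, although gradient descent can occasionally stall in a local minimum.

import math, random

# Generic back propagation (sum-squared error, sigmoid units): blame for the
# output error is passed back through the output weights to the hidden units.
# Layer size, learning rate, epochs and seed are arbitrary choices.

random.seed(1)
sig = lambda v: 1.0 / (1.0 + math.exp(-v))
N_HID, ETA = 3, 0.5

w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(N_HID)]  # [x1, x2, bias]
w_out = [random.uniform(-1, 1) for _ in range(N_HID + 1)]                  # [h..., bias]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x1, x2):
    h = [sig(w[0] * x1 + w[1] * x2 + w[2]) for w in w_hid]
    o = sig(sum(w_out[i] * h[i] for i in range(N_HID)) + w_out[-1])
    return h, o

for _ in range(10000):
    for (x1, x2), t in data:
        h, o = forward(x1, x2)
        d_out = (t - o) * o * (1 - o)                  # output unit's share of the blame
        d_hid = [d_out * w_out[i] * h[i] * (1 - h[i])  # blame passed back through w_out
                 for i in range(N_HID)]
        for i in range(N_HID):                         # gradient-descent weight updates
            w_out[i] += ETA * d_out * h[i]
            w_hid[i][0] += ETA * d_hid[i] * x1
            w_hid[i][1] += ETA * d_hid[i] * x2
            w_hid[i][2] += ETA * d_hid[i]
        w_out[-1] += ETA * d_out

for (x1, x2), t in data:
    print(x1, x2, t, round(forward(x1, x2)[1], 2))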

6.1.2 Plausibility

While back propagation is a computationally elegant procedure, and while it allows for efficient training of artificial neural nets, a process even similar to back propagation is not something we have found in the

human brain. Changes to the synaptic strength, as suggested by Hebb, do not come about by iteratively applying these kinds of algorithms. Hinton remarks that

[. . . ] back propagation seems biologically implausible. The most obvious difficulty is that information must travel through the same connections in the reverse direction, from one layer to the previous layer. Clearly, this does not happen in real neurons (Hinton, 1992, 186, my italics).

While there is no evidence that signals travel in real neurons in reverse direction across synapses, a backward propagation of the action poten- tial (spike) into the dendritic tree does in fact occur. This effect is taken into consideration in some spiking models (see Gerstner and Kistler (2002)). However, this phenomenon does not resemble back-propagation and it does not seem to have any bearing on Hebbian learning. Other difficulties, which make back-propagation biologically implausible, are the necessary speed of calculation (Sejnowski, 1986) and the obvious improbability for such an algorithm to develop through evolution and to survive evolutionary change. Haykin believes that

neurological misgivings do not belittle the engineering importance of back-propagation learning as a tool for information processing, as evidenced by its successful application in numerous highly di- verse fields [. . . ] (Haykin, 1999, 227).

A further problem for the biological plausibility of learning with back propagation is the requirement for a teacher. Hinton is quite explicit when he says that

The most serious objection to back propagation as a model of real learning is that it requires a teacher to supply the desired output for each training example. In contrast people learn most things without the help of a teacher. [. . . ] We learn to understand sen- tences or visual scenes without any direct instructions (Hinton, 1992, 186).

I would like to explore Hinton's second point further. The necessity for a 'teacher' to train the networks is one of the serious limitations for models that rely on artificial networks6. Networks can only be trained using back propagation if we have a training dataset that contains input-

output pairs of some function7. The need to supply the 'stimulus' (input data) and the 'response' (desired output) for the training process places serious restrictions on what a network can learn. It can be argued that artificial neural nets are not much more than statistical tools, and that the kind of analysis that can be derived from models using networks can be derived by other mathematical (statistical) means. Green (2001), for example, argues that the results from experimentation with neural networks

are just as analytic as are the results of a mathematical derivation; indeed they are just mathematical derivation. It is logically not possible that [the results] could have turned out other than they did (Green, 2001, 109).

The learning algorithm determines the kind of model that can possibly be implemented using neural networks. Because there is a requirement to provide a training dataset, such models are restricted to data driven models8. The architecture of the network also has some bearing on the capabilities and performance of the model.

6.1.3 Unsupervised Learning

Most of the models in Cognitive Science, and all the models that are discussed here, need a teacher. As I have pointed out earlier, to train a network we must know the inputs and the desired outputs (i.e. a training set). Alternatively, it is possible to provide some external feedback to the network, which may simply indicate to the algorithm that the current output is closer to, or further away from, some heuristic goal. Rather than telling the network what the actual output should be and propagating the error (the difference between the target value and the actual output) back, the error is determined by comparing the output from previous states with the feedback from the environment (goal). The network, simply put, gets better and better. This approach to neural network learning is known as reinforcement learning.

7Some network algorithms learn without a teacher. Unsupervised learning is a method to detect and classify patterns automatically. The network can detect clusters of similarity by comparing input patterns with previously presented patterns. This behavior is also referred to as self-organization (Hoffmann, 1998).

8Data driven models take inputs to produce outputs, whereas other models, like a self-organising map, do not need teaching data sets - they are self-learning.

In unsupervised or self-organized learning, the network determines the goal from the patterns in the inputs themselves. A network may form several classes and may classify the inputs into these classes from the inputs alone. In auto-association tasks, the number of classes can be determined by the network alone. The ability to perform this task leads to the definition of unsupervised learning itself: “Unsupervised learning means to find ‘natural’ regularities among unclassified objects or patterns” (Hoffmann, 1998, 139). In the next section I introduce the two principal architectures of neural nets that are used in many models.
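Before turning to architectures, here is a minimal sketch of the kind of self-organization just described: winner-take-all competitive learning, in which the weight vector closest to the current input is moved a little towards it, so that the weight vectors drift towards the ‘natural’ clusters in the data. The Matlab fragment below is only an illustration of the principle; the data, the number of prototype units and the learning rate are assumed placeholders, not taken from any model discussed in this thesis.

% Sketch of winner-take-all competitive learning - placeholder data and parameters.
X  = rand(2, 200);                       % 200 two-dimensional input patterns (assumed)
k  = 3;                                  % number of prototype units
W  = rand(2, k);                         % one weight vector per unit (columns)
lr = 0.05;                               % learning rate
for epoch = 1:50
    for n = 1:size(X, 2)
        x = X(:, n);
        d = sum((W - repmat(x, 1, k)).^2, 1);         % distance of x to every weight vector
        [dmin, win] = min(d);                         % the winning (closest) unit
        W(:, win) = W(:, win) + lr * (x - W(:, win)); % move the winner towards x
    end
end
% No targets were supplied: each column of W ends up near the centre of one
% cluster of input patterns, i.e. a 'natural' regularity found by the network.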

6.1.4 Architectures

The design of network architectures and the exploration of their particular dynamics has developed into an entire branch of AI. The field of AI, if we see it as an engineering enterprise, is not limited in the selection of tools and methods in the pursuit of its goals and aims. Whether a network is suitable or not, or whether a particular research program is considered worthwhile or not, is ultimately only measured by results: if it works, then it is successful. However, when we leave the discipline of engineering and enter the realm of AI as a contributor to the field of Cognitive Science, then we should place a number of constraints on the tools and methods. Not all artificial neural network architectures can be considered as suitable models merely because they are constructed from model neurons. Some network architectures are neither biologically nor logically plausible.

At this point it is necessary to shift the focus away from the neural nets as ‘technical vehicles’, because the biological and logical plausibility may not be relevant at the level of description of the model. The distinction between neural nets as tools and models built on those tools becomes clear if we compare birds and aeroplanes, for example. Aeroplanes can hardly be described as biologically plausible models of birds, but both aeroplanes and birds fly. I would argue that for the act of flying as such, the biology of birds is as irrelevant as most of the technical details of the aeroplane. Flying can be considered independent of a particular implementation: the birds’ biology is important for birds to be able to fly, but not for flight as such. One can throw a stone to make it fly, but there is nothing in the stone’s makeup to facilitate flight (not considering its weight, which might prevent me from being able to throw it).

In their role as a tool for models, artificial neural nets become a part of models for theories. These theories concern many topics such as learning, knowledge representations, language acquisition, language production, vision and every other conceivable topic in Cognitive Science. However, the neural nets that we want to use as the vehicle for a model must be able to perform the task at hand.

Recall that the analysis by Minsky and Papert (1969) showed that perceptrons can only solve tasks that are linearly separable, and that the relatively simple XOR-function cannot be handled by a single-layer network of perceptrons9. Since then, artificial multi-layered networks comprising non-linear neurons have been shown to be much more powerful than Minsky and Papert were willing to admit. Churchland writes that

[...] a nonlinear response profile brings the entire range of possible nonlinear transformations within reach of three-layer networks [...] Now there are no transformations beyond the computational power of a large enough and suitably weighted network (Churchland, 1990, 206).

and Elman et al. say that

[f]or some ranges of inputs [. . . ] these units exhibit an all or nothing response (i.e., they output 0.0 or 1.0). This sort of response lets the units act in a categorical, rule-like manner. For other ranges of inputs, however, [. . . ] the nodes are very sensitive and have a more graded response. In such cases, the nodes are able to make subtle distinctions and even categorize along dimensions which may be continuous in nature. The nonlinear response of such units lies at the heart of much of the behavior which makes networks interesting (Elman et al., 1998, 53).

The introduction of multi-layered networks and non-linear output functions brought with it an interesting phenomenon that is directly related to the back propagation of errors during learning. The delta rule, as it is used in perceptrons, will always find the solution, provided that such a solution exists; the convergence theorem by Rosenblatt (1959) proves that this is true. This theorem does not hold for the back propagation algorithm.

9It can be shown that for any multi-layered network with linear units there is an equivalent single-layer network (see for example appendix 1 in McLeod et al. (1998)). It follows that any neural network comprising purely linear units can solve only linearly separable functions.

Back propagation is a reverse hill climbing, or gradient descent, algorithm, which follows the error function to find the point with the smallest error. Copeland (1993) describes the global minimum as the ‘most relaxed’ state and says that there can be

a number of ‘uneasy truces’ - states that are just sufficiently stable to prevent the network from looking for a more harmonious way of accommodating the input. If a network settles down into one of these compromise states it will stick there and never produce the desired output (Copeland, 1993, 212).

Multi-layered networks may have local minima, which can prevent the algorithm from finding the overall (global) minimum (see figure 6.3). Consequently, as Copeland points out, it is possible that a network may not find a solution under a range of conditions. These conditions depend on the choice of the learning rate, the initial random connection weights, and the particular choice of training algorithm. However, there are ways to lower the risk that the network settles in a local minimum.

Figure 6.3: Local and global minimum (an error surface on which the gradient of descent can lead into either a local minimum or the global minimum)

Although there has been a plethora of architectures and network topologies, most of the cognitive models are based on two principal architectures. Multi-layer networks are usually feed-forward networks (FFN) (figure 6.4) or simple recurrent networks (SRN) (figure 6.5). SRNs employ a feedback mechanism, whereby the current state of the hidden nodes is used as part of the input for the next step in the training of the network10.

10For a detailed description of this kind of network see FINDING STRUCTURE IN TIME by Elman (1990), and also section 8.2.2.

Figure 6.4: Feed-forward Network (input nodes, hidden nodes, and output nodes)

Figure 6.5: Simple Recurrent Network (SRN) (input nodes, context nodes, and output nodes)

Some of the network architectures with particular properties are of value in AI, but I do not believe that these are useful tools for building cognitive models if neurological plausibility is considered. Hopfield networks have been successfully used to model auto-associative memory; however, their architecture is built around a set of fully connected nodes. Each unit is connected to every other unit in the network, which also means that two adjacent units have two connections, one in each direction. Direct feedback between neurons is avoided by passing each connection to a circuit which induces a short delay. Structures of this kind have not been observed in brains, and I will not deal with Hopfield, Kohonen, Helmholtz, or Boltzmann network architectures in this thesis, although they are computationally interesting and valuable for AI.
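As a purely illustrative sketch of the feedback mechanism in an SRN (figure 6.5), the following Matlab fragment copies the hidden activations of each time step into context units that are fed back as additional inputs at the next step. All sizes, weight values and the input sequence are assumed placeholders, not parameters of any model discussed here.

% Sketch of a simple-recurrent-network (SRN) forward pass - placeholder values.
nIn = 3; nHid = 4; nOut = 2; T = 10;     % assumed sizes and sequence length
Wih = randn(nHid, nIn);                  % input   -> hidden weights
Wch = randn(nHid, nHid);                 % context -> hidden weights
Who = randn(nOut, nHid);                 % hidden  -> output weights
inputs  = rand(nIn, T);                  % an assumed input sequence (one column per step)
context = zeros(nHid, 1);                % context units start out empty
outputs = zeros(nOut, T);
for t = 1:T
    hidden = tanh(Wih * inputs(:, t) + Wch * context);   % non-linear hidden units
    outputs(:, t) = 1 ./ (1 + exp(-(Who * hidden)));     % logistic output units
    context = hidden;                    % the current hidden state becomes the next context
end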

6.1.5 Implementation

Perceptrons and multi-layer nets (feed-forward, recurrent, or any other architecture) are mathematical constructs and as such need to be expressed in the language of mathematics. Artificial neural nets can be considered to be much more a problem of linear algebra than a problem of programming skills when we consider the user interface to programs that implement neural networks. Much of the work with artificial neural nets has, as far as notation is concerned, always been described in terms resembling linear algebra. Remember that the use of vectors and matrices in the notation for the specification and mathematical treatment of neural nets had been suggested by Rosenblatt (1958, 1959, 1962). The connections in the networks and the weights of these connections, which represent the synaptic strengths, are usually expressed in the form of matrices. This notation is commonplace in linear algebra, which provides many of the algorithms that pertain to calculations involving artificial neural nets. The notation using vectors for model neurons (see section 5.2) is extended to neural networks. The example of a small FFN in figure 6.6 illustrates how networks can be represented as matrices.

Figure 6.6: Weights and Connections (a small feed-forward network with input nodes 1-5, hidden nodes 6-8, and output nodes 9-10; each connection is labelled with its weight)

All the nodes of the network are numbered, and the numbers are placed along the rows and columns of a matrix (figure 6.7). The connections are represented as values at the intersections of the rows and columns, so that the connections from a neuron are in the columns and the connections to the neuron are in the rows of the adjacency matrix. The value of the number at the intersection represents the synaptic strength (weight) of the connection. For example, the connection from the output of neuron 2 (column 2) to the inputs of neuron 7 (row 7) for the network in figure 6.6 has a weight of .60, and the connection from neuron 8 to neuron 9 has a weight of .85. If the neurons are numbered layer by layer, then feed-forward networks have only entries on one side of the main diagonal11, while recurrent networks will have entries on both sides.

Figure 6.7: Weight matrix for figure 6.6 (a 10 x 10 matrix over the node numbers 1-10; only rows 6-10 contain entries: row 6 holds .60 and .30, row 7 holds .60, .25, .30 and .30, row 8 holds .75 and .15, row 9 holds .30, .15 and .85, and row 10 holds .10, .60 and .48)

Calculations on the matrices can be done row by row, or column by column, depending on the operation that is required. Calculating the outputs for all units of the hidden layer in our example is accomplished by forming the dot products of rows 6, 7, and 8, which represent the weight vectors (w) for the hidden nodes, and the calculated outputs from the nodes 1 to 5, which represent the input vector (i) for this layer. The dot products are then passed into the appropriate output function, and the results for each node are stored so that they can in turn be used to calculate the outputs of the next layer of neurons. The entire process is iterative: the same operations are repeated node by node and layer by layer.

11Whether these entries are all above or all below the diagonal depends on whether the inputs are assigned to rows and outputs are assigned to columns, or vice versa.

The implementation of artificial neural nets is relatively simple when we consider the basic components, i.e. units, and the basic operations performed on them, like updating weights or calculating outputs. The complexity of neural nets is in their overall behaviour, and the computational effort grows quickly with the number of units in the network. Digital computers are the right tool for solving these kinds of problems, where simple things have to be calculated repetitively.
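As an illustration of this row-by-row calculation, the following Matlab fragment performs a forward pass over an adjacency-style weight matrix W for a network with the same layer sizes as figure 6.6 (nodes 1-5 input, 6-8 hidden, 9-10 output). The weight values and the input pattern are placeholders, not the values shown in the figures, and the logistic output function is merely one possible choice.

% Forward pass over the weight (adjacency) matrix of a small feed-forward net.
% W(j,i) holds the weight of the connection from node i to node j
% (connections 'to' a node are in its row, as described above).
W = zeros(10, 10);
W(6:8, 1:5)  = rand(3, 5);            % placeholder input -> hidden weights
W(9:10, 6:8) = rand(2, 3);            % placeholder hidden -> output weights
a = zeros(10, 1);                     % activation of every node
a(1:5) = [1; 0; 1; 1; 0];             % an assumed input pattern for nodes 1-5
logistic = @(x) 1 ./ (1 + exp(-x));   % output function
a(6:8)  = logistic(W(6:8, :)  * a);   % rows 6-8 dotted with the activations (hidden layer)
a(9:10) = logistic(W(9:10, :) * a);   % rows 9-10, now using the hidden activations
disp(a(9:10))                         % outputs of nodes 9 and 10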

Digital Machine Implementations

Although neural nets are conceptually examples of parallel computing, their implementation almost universally requires simulation on a computer with a serial (fetch-execute cycle) architecture12. Consequently, artificial neural nets should more accurately be called simulated neural nets. Parallel computing is only achieved at a conceptual level at certain intervals during the simulation process. Whenever calculations are required that involve a number of neurons, these calculations are performed in sequence. Once all required calculations are completed, the entire operation on the network is deemed to have happened concurrently. Simulated networks are under the control of a clocking mechanism that is largely influenced by the amount of computation that is necessary between stable states. The amount of computation can be considerable, for example during learning with the back-prop algorithm with even a moderate number of units.

Artificial neural nets can be engineered using any programming language such as C or Java. Neural nets are nowadays conveniently implemented in mathematical packages such as Matlab or Mathematica. Many of these packages offer a great number of support tools for building and analyzing networks, and often these programs provide graphical user interfaces (GUI) that help to visualize the logical structures and architectures of the networks. Additionally, there are tools for evaluating the performance during and after learning, and provisions for statistical analysis of the weight matrices (cluster analysis). When we examine the implementation, we recognize again a set of distinct levels of description in which the artificial neural nets are described.

12Work on other ways of implementing artificial neural networks is in progress. One such research program involves using discrete electronic components to build ‘neuron-like’ modules and combining them into functional groups (Diorio and Rao (2000); Hahnloser et al. (2000)). The investigation concerns inter-neuron communication, and modelling cognitive functions is not (yet) considered.

These are (1) a visual representation of the network in terms of its architecture, (2) the mathematical structure, i.e. matrices, and (3) a level at which individual units (neurons) are considered. In many implementations the data structures and the engines that operate on these structures are themselves black boxes from the experimenter’s viewpoint. Software packages like Matlab or Mathematica are the product of years of work and are highly complex and efficient in solving problems in many branches of mathematics. The user does not need to know, and usually does not know, much about how the package is actually working. For the user, the package is little more than a reliable big electronic calculator. The Matlab code in figure 6.8 is an implementation of a small feed-forward network, included to show how relatively easy such an implementation is with advanced ‘off the shelf’ tools: a two-layer network demonstrating that the XOR problem can be solved with the appropriate architecture.

% two layer feedforward (2 input - 3 hidden - 1 output)
% with non-linear neurons, gradient descent algorithm
% inp = inputs, tar = targets
inp = [1 1 0 0; 1 0 1 0]            % the four XOR input patterns (one per column)
tar = [0 1 1 0]                     % the corresponding XOR targets
net = newff([0 1; 0 1], [3 1], {'tansig','tansig'}, 'traingd')
net.trainParam.lr = 0.04            % learning rate
net.trainParam.goal = 1e-2          % stop when the error falls below 10^-2
net.trainParam.epochs = 5000        % or after 5000 epochs at most
net = train(net, inp, tar)          % gradient descent back propagation
round(sim(net, inp))                % test: ideally returns 0 1 1 0
% end

Figure 6.8: Matlab Code Example

The matrices inp and tar (in Matlab notation) represent the inputs and corresponding targets for the learning process, i.e. the training dataset. The instruction newff creates a new feed-forward network object net with input nodes accepting a range of inputs from 0 to 1, 3 hidden nodes and 1 output node. The parameters lr, goal and epochs tune the learning algorithm: the learning process is continued until the number of epochs exceeds 5000 or the error between the network’s output and the target is less than 10^-2, and the learning rate is specified by the parameter lr. The call train(net,inp,tar) trains the network using the gradient descent back propagation algorithm. Finally, the network is tested using the sim command, which returns the network’s output given the training matrix inp. In order to convert the real-valued output to discrete values (0s and 1s), the outputs of the network are rounded. The aim of this example is to show the simplicity of implementing neural nets on a computer. The modeller can use neural nets as a tool without extensive mathematical or programming skills. Artificial neural nets are ‘technical vehicles’ and, like other vehicles, they can be driven without the need to know how or why they operate in the way they do.

6.2 Universal Frameworks

In this section I will argue that simple neural networks under the right interpretation offer a model solution for any kind of cognitive function. Siegelmann and Sontag (1991) have shown that artificial neural nets using standard linear connections can be constructed in a way that they are “computationally as powerful as any Turing Machine” (Siegelmann and Sontag, 1991, 77). In a later paper, Neto et al. (1997) have proved that a “‘universal’ analog recurrent neural net” does indeed exist. The existence of a ‘universal’ neural net that is equivalent to a Universal Turing Machine is now a mathematical certainty, and the existence of such a net settles the question of whether neural nets are able to compute any Turing computable function. In this context the question is of lesser importance, because the models of cognitive functions which are discussed in this thesis deal with often very simple (mathematically elementary) functions.

Artificial neural networks are trained using algorithms that adjust the weights between units, i.e. model neurons, so that the error between the network’s computed output and the expected output is minimized for the given input. This process is repeated for all possible input-output pairs many times over. Networks that act as classifiers and predictors perform a process during the learning phase that is very similar to multivariate regression analysis on the dataset. For example13, to implement the XOR function f = (I1 ∨ I2) ∧ ¬(I1 ∧ I2), the network will be presented with values for I1 and I2, i.e. ‘0,0’, ‘0,1’, ‘1,0’ and ‘1,1’.

13The XOR function is merely used here as an example because it has been discussed earlier on page 126. AND, or any similar function, could have served equally well as an example in this context.

The weights are adjusted using an appropriate algorithm to minimize the error between the network’s output and the output of the training set, i.e. ‘0’, ‘1’, ‘1’, and ‘0’ respectively. Once the network is trained, it will compute the output from the inputs I1 and I2 according to the XOR-function. The network learns a function that is clearly contained in the training data. The fact that the relationship between the inputs (0,0, 0,1, 1,0, 1,1) and the outputs (0, 1, 1, 0) is the XOR function may or may not be known to us. The ability to implement unknown functions is one of the real practical advantages of neural nets. The training set of pairs of input and output vectors already contains all there is, and the ANN does not add anything that could not be extracted from the training sets through other mathematical or computational methods. A trained neural net implements a mapping from the input nodes (I1, ..., In) to the output nodes (O1, ..., Oi). One of the powerful characteristics of the network is its ability to implement some function (O1, ..., Oi) = f(I1, ..., In) from the training dataset very efficiently in terms of performance (speed). Hoffmann (1998) emphasizes this point and says that

[t]he greatest interest in neural nets, from a practical point of view, can be found in engineering, where high-dimensional continuous functions need to be computed and approximated on the basis of a number of data points (Hoffmann, 1998, 157).

The engineering project ALVINN (Autonomous Land Vehicle In a Neural Network) by Pomerleau (1995) is probably a good example of using a neural network for a highly complex technical task14. There is no need to know the function f in non-tabular form. As long as there is a set of input values and output values for some function, a suitable network will implement a function to approximate the tabulated function f. Knowledge extraction (KE) from neural nets concerns providing a (hopefully) simple description of the function f that is approximated by the trained network. The motivation for extracting the function “lies in the desire to have explanatory capabilities besides the pure performance” (Hoffmann, 1998, 155). The ability to determine f may or may not add to the explanatory value of the model.

14In this project a series of sensors, such as a laser range finder, radar, and video cameras, were connected to the inputs of a neural network. The outputs of the network were operating the steering mechanism of a car. The network was able to steer the car unassisted over relatively long distances (Pomerleau (1995); Baluja (1996)).
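To make the weight-adjustment loop described above concrete, the following is a minimal from-scratch sketch (in plain Matlab, without any toolbox) of gradient descent training on the XOR training set, using a 2-3-1 network of logistic units. The network size, learning rate and number of epochs are illustrative assumptions only, and this is not the algorithm used in any of the models discussed here.

% Minimal gradient-descent (back propagation) sketch for XOR - illustrative only.
inp = [0 0 1 1; 0 1 0 1];                 % the four input patterns, one per column
tar = [0 1 1 0];                          % the corresponding XOR targets
W1 = 0.5 * randn(3, 2); b1 = zeros(3, 1); % input  -> hidden weights and biases
W2 = 0.5 * randn(1, 3); b2 = 0;           % hidden -> output weights and bias
lr = 0.5;                                 % learning rate (assumed value)
f  = @(x) 1 ./ (1 + exp(-x));             % logistic output function
for epoch = 1:20000
    for n = 1:4
        x = inp(:, n); t = tar(n);
        h = f(W1 * x + b1);               % hidden activations
        y = f(W2 * h + b2);               % network output
        dy = (y - t) * y * (1 - y);       % output error term ('delta')
        dh = (W2' * dy) .* h .* (1 - h);  % error propagated back to the hidden layer
        W2 = W2 - lr * dy * h';  b2 = b2 - lr * dy;   % weight adjustments
        W1 = W1 - lr * dh * x';  b1 = b1 - lr * dh;
    end
end
round(f(W2 * f(W1 * inp + repmat(b1, 1, 4)) + b2))    % ideally returns 0 1 1 0

A run that fails to produce the XOR outputs may simply have settled in one of the local minima discussed in section 6.1.4, which illustrates how sensitive the procedure is to the initial random weights and the learning rate.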

6.2.1 Labels

The assumption that symbols, or some other forms of representation (see Smolensky (1990, 1997)), and their semantic contents are distributed throughout the network is part of the connectionist doctrine (McClelland et al., 1986). However, the interpretation of experimental results in the context of neural nets is not possible without the use of non-distributed symbols at some level of description. Hoffmann points out that

. . . in more complex systems, the use of symbols for describing abstractions of the functionality at the lowest level is inevitable . . . [and] any description of a sufficiently complex system needs layers of abstraction. Thus, even if a non-symbolic approach uses tokens at its base level which cannot be reasonably interpreted, there still needs to be a more abstract level of description (Hoffmann, 1998, 257).

For a meaningful interpretation of the network and its dynamics, it is necessary to convey content and meaning in terms of non-distributed, or localized, symbols. Elman et al. suggest that

localist representations [. . . ] provide a straightforward mechanism for capturing the possibility that a system may be able to simultaneously entertain multiple propositions, each with different strength, and that the process of resolving uncertainty may be thought of as a constraint satisfaction problem in which different pieces of information interact (Elman et al., 1998, 90).

While localized representations and distributed representations can be used together in a single representational system, there is also the need to use local labels to describe the otherwise distributed representations. A distributed feature x exists only as a set of numbers (the weights of neural connections), but there is no way to talk about x without using the local description “x”. Representations, according to Fodor and Pylyshyn (1988) and Fodor (1998), are necessary to allow for systematicity and language (and thought).

Simple models involving neural nets are described as having distinct and discrete inputs and outputs, each labeled with a distinct and discrete meaning. These localized representations are no longer available once the focus shifts onto the hidden nodes within the network, and the ‘representations’ are now described in terms of weights, or synaptic strengths, between individual units. Strictly speaking, localized representations should not even be given to units at the boundary of the network (the input and output units), because these units are representing neurons. Rosenblatt explained that

[i]t is significant that the individual elements, or cells, of a nerve network have never been demonstrated to possess any specifically psychological functions, such as “memory”, “awareness”, or “intelligence” (Rosenblatt, 1962, 9).

However, in PDP models, the input nodes and output nodes are treated as localized representations (symbols), and individual model neurons do have semantics bestowed upon them by Elman (1990) and Churchland (1998), who map meaningful words to the inputs and outputs of their networks. Churchland goes as far as mapping “moral concepts” onto individual units. Because these labels are used freely, there is always the danger of introducing ‘wishful’ terminology not only for labels, but also for methodological terms. Shultz (2003), for example, offers a mapping of neural net terminology onto terms from developmental psychology (i.e. Piagetian theory).

Accommodation, in turn, can be mapped to connection-weight adjustment, as it occurs, in the output phase of cascade-correlation learning. [. . . ] More substantial qualitative changes, corresponding to reflective abstraction, occur as new hidden units are recruited into the network. [. . . ] Then the network reverts back to an output phase in which it tries to incorporate the newly achieved representations of the recruited unit into a better overall solution. This, of course, could correspond to Piaget’s notion of reflection (Shultz, 2003, 128).

The terminology from Piagetian theory clearly belongs to a higher level of description than the descriptions of the network’s dynamics. The units in an artificial neural network ought to process distributed representations, but, as Hoffmann (1998) noted, non-distributed representations are necessary at some level to provide for a meaningful explanation. Many modellers seem to use distributed and localized representations ‘as required’ to make the model work. A more interesting problem lies in the interpretation of representations that are within the network. First there is the question of locating suitable representations that could carry any semantics, given that the representations are distributed in the network15. Rosenblatt explained that

[i]t is significant that the individual elements, or cells, of a nerve network have never been demonstrated to possess any specifically psychological functions, such as “memory”, “awareness”, or “intel- ligence”. Such properties, therefore, presumably reside in the or- ganization and functioning of the network as a whole, rather than in its elementary parts (Rosenblatt, 1962, 9).

In many computational models, the input nodes and output nodes are treated as localized representations (symbols). Individual model neurons do have semantics bestowed upon them by Elman (1990) and Churchland (1998), who map meaningful words and moral concepts to the inputs and outputs of their networks. Treating the activation patterns of the hidden units as a resource for categorical “hot spots”, to use Churchland’s term, is an even more contentious exercise. The relationships and patterns in the input datasets and training datasets become embedded in the structure of the network during training16. The internal representations, which are “snapshots of the internal states during the course of processing sequential inputs” (Elman, 1990), are extracted by means of cluster analysis of the hidden layer in the ANN. Who really does the analysis and interpretation of the distributed representations? The neurons in the networks represent whatever meaning the modeller assigns to them17.
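As an indication of how such a cluster analysis is typically carried out, the sketch below computes the hidden-unit activations of the trained network from figure 6.8 for each input pattern and builds a hierarchical cluster tree over them. It assumes the network object net and input matrix inp from that example, the older Neural Network Toolbox fields net.IW{1,1} and net.b{1}, and the Statistics Toolbox functions pdist, linkage and dendrogram; with other toolbox versions the details will differ.

% Sketch: cluster analysis of hidden-unit activations (assumes 'net' and 'inp'
% from figure 6.8 and the Statistics Toolbox).
hid = tansig(net.IW{1,1} * inp + repmat(net.b{1}, 1, size(inp, 2)));
% one column of 'hid' per input pattern; patterns are grouped by the
% similarity of the internal representations they evoke
dist = pdist(hid');                   % pairwise distances between activation vectors
tree = linkage(dist);                 % hierarchical cluster tree
dendrogram(tree)                      % the familiar dendrogram of the 'clusters'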

6.2.2 Simplicity

A feed-forward network with a suitable number of inputs and outputs and a reasonable number of hidden units18 can be trained to implement any Turing computable function19. A network comprising connections with real weights can be trained to ‘learn’ a real function with arbitrary precision.

15The material in this section is also in Krebs (2007).

16The patterns and relationships in these datasets can either be carefully designed or might be an unwanted by-product.

17Much like Humpty Dumpty in THROUGH THE LOOKING-GLASS, who says “When I use a word it means what I choose it to mean - neither more nor less” (Carroll, 1996, 196).

18Note that the number of hidden nodes and the number of connections within the network are largely determined by experience and experiment. There are no definite methods or algorithms for this.

19Siegelmann and Sontag (1991) have shown that neural nets that are computationally equivalent to Turing machines can be constructed.

Many of the classic neural network problems, like most of those described in Elman (1990), Rumelhart and McClelland (1986b), McLeod et al. (1998) and others, are classification tasks. In these models the inputs are integers and the outputs are (rounded) integers, whereas the connections are expressed as real numbers20. The point here is that a network can implement anything we want to model, as long as we have the appropriate training set for the particular selection of labels for the inputs and outputs of the network. All of the models in question are based on computable functions of real numbers. As a result we can consider even simple artificial neural nets as a universal framework for cognitive models (Krebs, 2005, 2007). This universality is, of course, convenient for simple models, but there remain doubts whether these kinds of networks are suitable for more advanced problems. Feldman believes that only networks with task-specific structures are candidates for advanced learning and he proclaims that

[t]he belief that weight adaptation in layered or other unstructured networks will suffice for complex learning is so wrongheaded that it is hard to understand why anyone professes it (Feldman, 1993).

This remark does not contradict the fact that a suitable network can implement any conceivable computable function of real numbers. There is, however, a clear distinction between the ability to learn a function and ‘learning’ in the context of human cognitive behavior. Uttal suggests that neural networks with relatively few neurons are too simple to be able to simulate anything worthwhile. He says that

[. . . ] perhaps most important, is the fact that all neural network models are oversimplifications that are incapable of dealing with the number of neurons and interactions that must be involved in instantiating the simplest cognitive process. Some, indeed, are so oversimplified that they cannot simulate cognition. Much is lost in the shift from reality to theory (Uttal, 2005, 199).

I would like to re-phrase this remark for the modeller’s benefit so that it reads: much is gained from the shift from reality to theory. The Parallel Distributed Processing (PDP) paradigm can be considered one of the theories that is largely made plausible by the ‘oversimplifications’ Uttal refers to.

20All real numbers can be represented by a decimal expansion (Salas and Hille, 1995, 5). Some rational numbers have a non-terminating expansion; however, such numbers can be approximated in a computer to any desired precision.

Much of the work in cognitive modelling since the mid 1980s is conducted within the framework of PDP, and yet very little progress has been made as far as neural networks are concerned. Rogers and McClelland (2004) have recently suggested new theories, and it seems indicative of my claim that little progress has been made that many of the models in SEMANTIC COGNITION: A PARALLEL DISTRIBUTED APPROACH have remained unchanged since the mid 1980s. Rogers and McClelland (2004) readily admit that

the form that a complete theory [of semantic cognition] will ultimately take cannot be fully envisaged at this time. We do believe, however, that the small step represented in this work, together with those taken by Hinton (1981) and Rumelhart (1990), are steps in the right direction, and that, whatever the eventual form of the complete theory, the principles [of PDP] will be instantiated in it (Rogers and McClelland, 2004, 380).

What seems to be happening here is that the theories of PDP get developed further and further, while the models in support of these theories remain essentially the same. This works well because of the universal nature of the artificial neural nets. They are very adaptable and can be used as models for just about any theory without too many changes (if any) to the experimental setup of earlier models. New data can be pushed through an existing neural network, and the results can be interpreted in terms of the new theory in question. The interpretation of the meager results of actual neural net models (reality) can enhance the models, because their predictions and claims can be made plausible by choosing the right language and placing the models into a conceptual environment that is sufficiently vague (theory)21. Green (2001) attributes much of the success of models to this.

This perceived vagueness of PDP models can be deduced from McClelland et al. (1986) when they describe the various forms that models can take, and what units (i.e. neurons) can represent in the various models.

In some cases, the units stand for possible hypotheses [. . . ] In these cases, the activations stand roughly for the strengths associated with the different possible hypotheses, [. . . ] In other cases, the units stand for possible goals and actions, [. . . ] In still other cases, units stand not for particular hypotheses or goals, but aspects of these things (McClelland et al., 1986, 10).

21See Krebs (2005, 2007) for a more detailed discussion.

Units in a model network can represent anything, because they can be labeled to mean anything. PDP itself could be described as a theory for everything, and models can be readily produced to provide a kind of ‘empirical’ support. As a case in point, I provide a case study of the model presented in FINDING STRUCTURE IN TIME (Elman, 1990) in section 8.2.2.

6.3 Summary

The functional neurons that are employed in AI only loosely resemble actual biological neurons, and artificial neural nets only superficially mirror actual brain structures. Learning in the human brain, for example, does not involve back-propagation. The functionality of a neural net is determined by its architecture and the kind of units it is composed of. A number of different architectures and learning algorithms are on offer; however, FFNs and SRNs are the most commonly used network architectures for building models of higher cognitive functions. Artificial neural nets, particularly networks with hidden layers and non-linear units, can be used to implement almost any function from a set of data points. The ability to detect relationships or functions in the training dataset makes neural nets very powerful analytical tools.

Artificial neural nets are the tools for modelling in the PDP paradigm, and the experiments and models with these tools are vital as support for this cognitive architecture. I have argued that artificial neural nets are universal solutions for data models, provided that a training dataset is available. These training datasets can be obtained from many sources, like experiments with human subjects, animals, and so on. The data can also be carefully constructed in order to achieve the required result in the network (an example is provided in chapter 8 on page 183). The next chapter concerns the validation and verification of models and the kinds of criteria that are available to evaluate models that are based on neural networks.

Part III

Models and Explanation


Chapter 7

Models and Evidence

In a computer simulation of the networks at issue (which is currently the standard technique for exploring their properties), both the computation and the subsequent weight adjustments are easily done: the computation is done outside the network by the host computer, which has direct access to and control over every element of the network being simulated. But in the self-contained biological brain, we have to find some real source of adjustment signals, and some real pathways to convey them back to the relevant units. Unfortunately, the empirical brain displays little that answers to exactly these requirements (Churchland, 1993, 53).

So far I have examined some of the principles and methodologies that are applied in the construction of models and simulations. I have argued in the previous chapter that artificial neural nets are valuable analytical tools that can deal efficiently with large amounts of data. Before it is possible to ascertain whether artificial neural nets can be used as plausible models for cognitive functions, it is necessary to examine what kind of evidence can be made available to support claims made by the modeller. In this chapter I discuss some of the difficulties that are encountered when trying to justify claims concerning cognitive models using artificial neural nets. Models in the broadest sense serve as tools to obtain knowledge, to explain, and to support theories. Models with artificial neural networks rely on the body of knowledge (and theories) from cognitive Neuroscience, computational Neuroscience, and AI, among the many disciplines in Cognitive Science. These disciplines, combined with Computer Science, contribute to the theoretical and conceptual foundations of artificial neural nets and to the principles and methods of implementing the network on a computer. Additionally, most models will draw on at least one other body of knowledge, because models can be about psychological or linguistic phenomena. There are then two levels of justification to be considered: (1) the level of implementation of the network, and (2) the model at the level of claims about Psychology or Linguistics. The first level deals with questions of whether artificial neural nets are neurologically plausible. The discussion so far has shown that artificial neurons and artificial neural nets have very little in common with real neurons and real neural structures (brains).

This chapter begins with a short discussion regarding the correctness or validity of models. There are two fundamental reasons why a computer model can be flawed. Computational models and simulations are complex computer programs, and, as with any other piece of software, a computer program “may have a bug in it, or the model upon which the program is based may be flawed” (Quinn, 2005, 306). A more ‘serious’ reason why a model may not work is that it may be based on the wrong set of assumptions in terms of its design, or the model may be inadequate in terms of its performance. A computer model that produces data or makes predictions needs to be verified and validated to ensure that the program is a correct instantiation of the model. Only then can the model be compared against the real world and the model’s ‘fitness’ be judged. Verification relates to the correctness of the computer program: the program must do what it ought to do correctly, i.e. it should be free of logical or syntactical errors. Validation of a computer program concerns the correctness of the implementation of the underlying model. Much of the validation process is related to the design of the data model and its transformation into a computer program and the related data structures.

The final section of the chapter concerns the evidence supporting the neurological plausibility of models employing artificial neural nets. The question here is whether any claims can be made about a model’s neurological plausibility merely because an artificial neural net can be used successfully to perform statistical analysis of data.

7.1 Verification and Validation

Some models provide us with data that can be validated against real observations, and the accuracy of their predictions may be used as an indicator for whether or not a particular model is successful. Unfortunately, not all models can be verified this easily and conveniently, and the validation of their predictions often relies on a string of assumptions. Some established scientific procedures use models and simulations to provide the ‘hard’ data on which the validation of some model would depend. Computer generated images, for example, are indispensable in the diagnosis and treatment of medical conditions. Many images do not show what would be directly observable, or they show things that are not observable at all without serious invasive treatment (see figure 7.1)1. Instead, they show visual representations of physical entities and processes. Computer generated images are representative models of some other real world object or process.

Figure 7.1: X-ray image of a broken bone

1Source: http://www.nu-riskservices.co.uk/news/news_images (17/12/2005).

It is important to note that modern imaging techniques rely on computer generated visualizations of data from a range of phenomena and different physical properties. The devices and methods that are employed to produce evidence for certain anatomical and functional properties of the brain may also be subjected to the constraints imposed by the mathematical and computational principles involved.

After a short section on the problem of program verification, I investigate the kind of empirical evidence that can be obtained from a particular family of visualization techniques. The technological principles and methods that ‘create’ images of the brain are one aspect that I will address. The second part deals with the interpretation of these images as evidence of mental processes.

Correctness of Programs

If we leave very small theoretical models aside, many models and simulations are complex mathematical constructs, and their implementation as computer programs is a difficult process. Once the program is complete, a considerable amount of computation is performed when the simulation of an artificial network is executed on the computer. This calls for well engineered software that is reliable and performs well in terms of speed and utilization of resources. The computing environments themselves, which include the hardware and the appropriate operating systems, are the least probable causes of failures. However, computer models do occasionally fail, like most other software occasionally does. We can never be certain that a computer program is free of errors. However, good software engineering practices can provide high levels of confidence that programs do what they are supposed to do. These practices may include formal methods for the specification of programs, which describe not only the functional requirements and the interface to the user, but also ensure data integrity. Some approaches to formal specification, like the B-method (Schneider, 2001), can detect flaws in the design of programs by simulating and analyzing the relationships between the various processes and the data during the design phase. The method allows for setting constraints on inputs, outputs, data types, and other program parameters, so that the final computer program (model) is protected against failure due to erroneous data.

Trying to detect and correct any potential errors early in the development of a software product, and that includes computer models, is an important aspect of software engineering. The cost of fixing a particular problem, in terms of effort (and money), increases very quickly, because the cost is related to the number of things that need to be done. A design fault is therefore less costly to fix if it is found early in the development cycle. Schach points out that

[i]t is important that we improve our specification and design techniques, not only so that faults can be found as early as possible but also because specifications and design faults constitute a large proportion of all faults (Schach, 2002, 14).

Many of the models, particularly those involving artificial neural nets, are based to some extent on commercial software libraries and tools2. Because of that, the effort of implementing a particular model is greatly reduced. Given that most of these commercially available packages are very well engineered, tested and maintained, the probability of faults is also greatly reduced, and there is a flow-on effect for the models that are based on these products. While the engineering of a new program that implements the model or simulation is the most likely source of faults, it is not the only area where problems can emerge. Modern computing machines rely on many layers of software for their operation. Some of the software components are dependent on specific hardware, and most peripheral devices will only function with the appropriate device drivers3. The user is shielded from much of the complexity in computing machines. Instead, the user is presented with a relatively uniform interface to the machine by the operating system, which ought to provide “an environment in which a user can execute programs in a convenient and efficient manner” (Silberschatz and Galvin, 1999, 1). Operating systems do have faults, and in some cases a user program may not execute correctly. However, in situations where the operating system malfunctions, it is more likely that the user program will fail altogether, rather than produce incorrect results.

2Mathematical software packages like Matlab or Mathematica offer special subsystems and ‘add-ons’ for neural network simulations (like the Neural Network Toolbox for Matlab). Also, there are several tools available from universities and interest groups.

3Device drivers are bits of software (programs) which provide an interface between the operating system and a particular piece of equipment. These drivers ensure that the piece of equipment conforms to the standards of the operating system.

7.2 Validation of Models

The validation of models is concerned with the correctness of the model in terms of the predictions and explanations it can provide. In order to make a judgment about ‘how good a model is’, it is necessary to establish a set of criteria by which we can measure the success of a model. This is not easy, because models are theories as much as they are devices that produce empirical evidence (i.e. data). In some cases we are able to verify some or all of the data obtained from a model by comparison with data from direct observation or from experimentation involving people. The accuracy of a model’s predictions is of course only one measure of a model’s success. Other indicators might be the internal architecture, the human interface, the performance in terms of speed, and the model’s ontology4. For models concerning mental processes, data may be available from experiments with humans or animals. Pavlik and Anderson (2005), for example, describe a model for the spacing effect in learning and forgetting and compare the predictions of their model to data obtained from experiments with 40 humans. Additionally, they compared the results with data from other models, and they concluded that

[t]his formal model of the relation between current memory accessibility and the stability of new encodings, using an integrated retention function, provided good fits to a wide variety of results with only a minimal number of parameters. Further, the absolute and relative parameter stability of the model shows that the model’s behavior was consistent with different data sets (Pavlik and Anderson, 2005, 583, my italics).

The comparison of the predictions with empirically obtained data allows the modellers to evaluate their model. There are various ways to obtain empirical data, and there are many ways to transform and represent data.
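As a simple indication of what such a comparison can look like in practice, the Matlab fragment below computes a root-mean-square error and a correlation coefficient between a model’s predictions and observed data. The two vectors are invented placeholders, not data from Pavlik and Anderson’s study, and these two statistics are only one of many possible measures of fit.

% Comparing model predictions with observed data (placeholder numbers only).
observed  = [0.62 0.55 0.47 0.41 0.36 0.33];   % assumed empirical values
predicted = [0.60 0.54 0.49 0.43 0.38 0.34];   % assumed model output for the same conditions
err  = predicted - observed;
rmse = sqrt(mean(err.^2));                     % root-mean-square error of the fit
r    = corrcoef(predicted, observed);          % correlation between model and data
fprintf('RMSE = %.3f, r = %.3f\n', rmse, r(1, 2))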

7.2.1 Visualization

The practice of drawing pictures or sketches of imaginary things is not a recent development in the sciences. Mathematics, particularly Geometry, has relied on graphical representations of entities and their relationships to one another for millennia.

4I am alluding here to the many micro-world models that have been created, e.g. SHRDLU (Winograd, 1973).

Since the development of digital computers with appropriate output devices, such as high resolution CRTs5, printers and plotters, visualization of data has taken on a different role. Visualization techniques not only help us in the understanding of things that are only in our imagination, as they do in Mathematics; with the aid of computers and the appropriate programs, it is also possible to draw and render images of things that are either too small or too big for us to see. Many scientific instruments, such as microscopes, endoscopes and x-ray machines, use digital imaging devices (digital cameras) and image enhancement techniques. Often the boundaries between direct observation and “computer enhanced” images are no longer clearly defined. It has been suggested that “‘Visualization’ . . . goes beyond conventional uses of computers, decidedly stepping into the ‘cognitive’ domain” (Araya, 2003, 27). Advances in the field of computer aided visualization are driven not only by the need to make things visible, but also because these things may not be directly observable. Another reason for the proliferation of computer aided imagery may be based on the “information-without-interpretation dilemma” (Araya, 2003, 29). Many experiments and procedures in science involve digital machines that produce vast amounts of numerical data. However, much of this data cannot be analyzed by human endeavor without using computers. The data can be subjected to various statistical processes that might in turn be distilled into tables, graphs or images. This process of interpreting (raw) data in some new context turns data into information.

Imaging

The need for visualization can also be the driving factor for obtaining data. The production of images that aim to reproduce what could be observed directly is a more contentious endeavour. Brain imaging techniques, such as Positron Emission Tomography (PET) or Magnetic Resonance Imaging (MRI)6, become more and more valuable as the resolution increases, and the increase in resolution demands more and more data points.

5Cathode Ray Tube, i.e. “computer screen”. There are now several other technologies, like plasma screens or LCD panels, which have replaced the heavy CRTs. However, all of these displays consist of a large two-dimensional array of pixels (picture cells), which are individually addressable. Pictures are created by setting each pixel’s colour and intensity.

One of the first fMRI images (Figure 7.2) was produced by Belliveau et al. (1991). While the increases in resolution are largely the product of more advanced technology, the underlying assumptions of these technologies have not changed. I believe that these images are data models for these underlying theories as well as pictures representing entities, in the case of PET and MRI. Images can also be used as a medium for explaining processes, or phenomena that are linked to processes, as in fMRI.

Figure 7.2: fMRI image from 1991 (Belliveau et al., 1991)

Whether the data for the visualization process is the product of some experiment, or whether the data has been collected specifically for the purpose of generating images, visualization is a form of data manipulation. The transition from raw data to meaningful information is not possible without other inputs. Data transformation also involves putting data into context, and making subjective decisions on what and how we represent. The setting of thresholds in PET images, for example, may be determined statistically, or it may rely on the experience of the machine operator. The size of the ‘area of activation’ and the colours used to represent the ‘level of activation’ in the images are not only the product of data, but are also affected by the level of human skill.

6See page 161 for short descriptions of these technologies.

These kinds of interventions occur in even the simplest cases: statisticians often have to ‘take care’ of outliers in sets of data so that the statistical analysis produces a realistic result. Additional methods of curve smoothing and trend analysis can be employed to produce clean “pictures” of the data.

7.2.2 Measuring and Imaging

In order to assess what some ‘images of the brain’ actually show, it is necessary to investigate some of the assumptions, principles and computational efforts in connection with modern imaging technology. Current brain imaging techniques can be divided into three distinct types, based on the kind of insights and detail that are revealed, or can possibly be revealed.

Technologies of the first type, which include X-ray imaging, computer-aided tomography (CAT) and magnetic resonance imaging (MRI), are employed to visualize anatomical structures of entire bodies or organs. The targets of investigation include the head and, of course, the brain. There is no doubt that these technologies have been very successful in the diagnosis of conditions like bone fractures, the early detection of some cancers, and so on. The common feature of these kinds of scans is that they are not dependent on something happening in the target of the scan. The image is a representation of a structure which is essentially stable. The broken bone, for example, is stable in the sense that it is broken before and for some time after the image is taken. Any short time delay that might occur during the scanning process and the subsequent generation of an image of a broken bone is largely inconsequential. This is not true for images that show processes in the brain, like images using fMRI techniques (see section 7.3 on page 163). Although the various technologies rely on different properties of the target and on properties of its constituents, the images are essentially reconstructions based on differences in densities of matter, water content, or areas of increased blood supply.

The second type of imaging techniques, which include Positron Emission Tomography (PET) and functional MRI (fMRI), are able to show brain activities. The third type is the kind that requires invasive techniques. Brains can be cut up, sliced, stained, and prepared in many ways so that microscopes of various kinds can be used to look at brain tissue. However, these procedures do not reveal much information in terms of functional analysis of brain structures. The main imaging techniques and their fundamental principles of operation are as follows.

Structural Tomography

Structural tomography offers non-invasive techniques to obtain images of the internal structure of the head and the brain. Standard x-ray images do not provide enough detail to be of much interest in this context. The density of brain tissue is relatively uniform throughout, so that only very little detail is revealed, and therefore only limited spatial information is made available. The introduction of CAT has allowed for the reconstruction of a 3-dimensional representation from many 2-dimensional images through a variety of back-projection algorithms. While CAT offers somewhat greater detail than plain x-ray images, the technology is now essentially superseded by PET/CT.

Functional Tomography

Functional tomography is concerned with imaging actual brain processes. PET and fMRI allow for the differentiation between areas based on differences in blood flow. It is assumed that the increase of blood flow is due to the higher glucose metabolic rate in neurally active regions. This assumption is problematic, and I will discuss this issue further in section 7.3. Blood-Oxygenation-Level-Dependent (BOLD) imaging technology currently provides spatial resolution of less than one cubic millimeter for MRI (Uttal, 2001), while the spatial resolution for PET is at best 6 to 8 mm (Townsend, 2004, 135). Neurons are between 0.01 and 0.05 mm in diameter, so that the resolution of one cubic millimeter could include up to 10⁶ neurons, if we consider that the spaces between neurons are only 0.02 μm (Bear et al., 1996). The resolution for functional MRI seems to be restricted to about 3 to 4 millimeters (Casey et al., 2002), so that a single voxel7 represents up to about 10⁶ to 10⁷ neurons.
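A rough back-of-the-envelope check of these figures can be made by treating each neuron as a small cube and asking how many would fit into a 1 mm³ voxel. This deliberately ignores glia, blood vessels and the spacing between cells, so the numbers are ballpark upper bounds only.

# Rough check of the neurons-per-voxel figures quoted above. Each neuron is
# treated as a cube of the quoted diameter; glia, blood vessels and the gaps
# between cells are ignored, so these are upper-bound ballpark numbers only.
voxel_volume_mm3 = 1.0
for diameter_mm in (0.01, 0.05):                   # neuron diameters from the text
    neurons = voxel_volume_mm3 / diameter_mm ** 3
    print(f"diameter {diameter_mm} mm: about {neurons:,.0f} neurons per 1 mm^3 voxel")
# diameter 0.01 mm -> about 1,000,000 neurons (the 10^6 figure quoted above)
# diameter 0.05 mm -> about 8,000 neurons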

PET

PET works on the principle that the location of the radioactive decay of a proton into a neutron can be determined by detecting the emission and localization of two gamma rays. These 'annihilation' gamma rays occur when the positron that was emitted during the decay of the proton annihilates with an electron nearby (see, for example, Townsend (2004) or Uttal (2001)). The patient is injected with a tracer substance like radioactive water (H₂¹⁵O) or a variety of other substances like fluorodeoxyglucose (FDG)8. Increased blood flow to particular areas of the body (and brain) delivers more of the radioactive tracer to the region. The increase in blood flow is in response to the increase in metabolism. The frequency and locations of the positron emissions (the locations of their annihilation, to be precise) are detected, and these data are used to generate the 'brain' images.

7A voxel is the 3-dimensional equivalent of a pixel, or picture element. A voxel is the smallest resolvable unit in space.

MRI and fMRI

MRI, unlike PET, does not need a radioactive substance. Instead, the imaging technique is based on the physical property of atoms to align themselves along (very) strong magnetic fields. Fortunately, the most common substance in humans, namely water, contains hydrogen, and the alignment of hydrogen atoms can be exploited. The hydrogen atoms align themselves in a strong magnetic field. When the magnetic field is momentarily disturbed by a powerful radio signal with an appropriate frequency, the hydrogen atoms will become 'excited' and will start to resonate with the exciting radio signal. When the radio signal is switched off, the atoms will return to their original energy state (realigned with the strong magnetic field). The time to return to the original state depends on the number of atoms and on characteristic physical properties of the different tissues. This time delay can be captured and used to construct an image. In MRI and fMRI the differences in the relaxation time constants of oxygenated and de-oxygenated hemoglobin are detected. Analogously to PET, MRI and fMRI visualize regions of increased metabolism, and there is no significant difference between the technologies as far as the detection of activation levels is concerned (Feng et al., 2004).
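A minimal sketch of the underlying idea, under the simplifying assumption of mono-exponential decay and with purely illustrative (not measured) time constants, shows how a difference in relaxation time constants translates into a measurable signal difference at a given echo time:

import numpy as np

# Purely illustrative: the relaxation constants and echo time are hypothetical
# round numbers chosen for this sketch, not measured tissue values.
def signal(t_ms, relaxation_ms, s0=1.0):
    """Mono-exponential decay of the measurable signal after excitation."""
    return s0 * np.exp(-t_ms / relaxation_ms)

TE = 30.0                 # echo time in milliseconds (assumed)
t_oxy = 50.0              # hypothetical constant for more oxygenated blood
t_deoxy = 40.0            # hypothetical constant for less oxygenated blood

s_oxy, s_deoxy = signal(TE, t_oxy), signal(TE, t_deoxy)
print(f"signal (oxygenated):    {s_oxy:.3f}")
print(f"signal (de-oxygenated): {s_deoxy:.3f}")
print(f"BOLD-like contrast:     {s_oxy - s_deoxy:.3f}")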

8 2-deoxy-2-[¹⁸F]fluoro-D-glucose is a fluorinated sugar-like substance, and the active ingredient, fluorine-18 (¹⁸F), has a half-life of 109.8 minutes.

Microscopy

Optical and other microscopes can also provide some evidence and insights into the anatomical structures of brains. The optical microscope is limited to a resolution of about 2,500 Å (or 0.25 μm) due to the wavelength of visible light. This resolution does not allow for a detailed study of cells, because little detail can be observed at this level. With early electron microscopes the level of resolution could be pushed to about 5-10 Å. However, the introduction of electron microscopy has created a host of new problems that are related to the technology itself. The electron microscope scans only the surface of the specimen, and the specimen has to be in a vacuum. Additionally, as Bechtel (2000) points out, the parts of the biological specimen (cells) are composed of the same kind of material, and therefore scatter electrons in the same way. In order 'to see' things, it is necessary to enhance the contrast by various means. These enhancements are usually achieved by altering the specimens before they are scanned, "which raise[s] the possibility of an artifact" (Bechtel, 2000, 141). Neurons can be observed in vivo using microscopes, and it is possible to observe the behavior of neurons by inserting micro-electrodes, as Hodgkin and Huxley (1952) have shown.

Evidence in support of neurologically inspired models can be obtained through any of the methods mentioned above. The most interesting methods appear to be fMRI and PET, because they allow for the visualization of processes. Modellers using artificial neural nets often claim to mimic or even replicate mental processes (for example McLeod et al. (1998), Churchland (1998), or Rumelhart and McClelland (1986b)9). Neural net models provide the environment for modelling the underlying theories, which "must cohere appropriately with the kinematical and dynamical features of the brain" (Churchland, 1999, 9). In contrast, microscopes and x-ray imaging can only provide images of anatomical structures. The dynamics of the brain cannot be captured with these, if we ignore long-term changes in anatomy due to the growth of cells or (possibly) connections. Such slow dynamics can nevertheless be modelled using artificial neural nets. Shultz (2003), for example, models cognitive and psychological development with artificial neural nets.

9I will discuss some models and I will evaluate the evidence against some of the claims made about the models in section 8.2 on page 174.

7.3 Limits of Technology

It is very likely that future innovations and improvements to existing imaging methods will allow for higher resolutions in fMRI or other functional imaging techniques. It may well be possible to detect anomalies, like single cancerous cells in human tissue, and it may be possible to resolve small 'neuron circuits' in the future. There are, however, fundamental problems with these kinds of technologies that cannot be resolved by engineering efforts alone.

Increasing the resolution of brain images does not necessarily help in providing better results in the quest to localize the areas of mental activity. The mapping of mental functions onto areas of brains can only be approximated, even if it were possible to resolve individual neurons. Obviously, it would be impossible to generalize any detailed mapping due to the variation in the population. Any detailed mapping would only reflect the situation in a particular brain. Because brains are changing, i.e. neurons grow and die and new connections are made, a detailed image will only show the activities in a particular brain at a particular time. Uttal argues that the functional regions in brains are not "sharply demarcated" (Uttal, 2001, 153) and that "cognitive functions activate broadly distributed regions in the brain" (Uttal, 2001, 155). Any increase in resolution and sensitivity due to technological advancement will actually enlarge the area in which neural activity can be detected for a particular mental process. The technology may have reached its useful limit already. This 'limit' does not of course apply to the detection of details in anatomy. It is only to be considered problematic in the context of functional MRI, PET, or similar technologies, for the purpose of localizing mental activity. There is a shift in opinion about the value of functional imaging for that purpose. Uttal comments that

the basic assumption of sharply localized brain representations of mental activity, challenged over a period of a century, is now under a new kind of empirical attack using the very tools that were originally used to support it (Uttal, 2001, 161).

A paradox seems to be emerging. As the resolution and sensitivity of the imaging technologies increase due to technical enhancements to the scanners and improvements in image processing, there is less and less clarity about where particular mental processes actually occur. It might well turn out that reductionism and localization may not be helpful when trying to understand the interactions between the mental and the physical. Midgley, for example, points out that

no amount of information about Einstein's brain would enable a neurologist who was ignorant of physics to learn anything from that brain about relativity theory (Midgley, 2004, 30).

The idea of finding out how the brain thinks by looking at the neural structures 'at work' using better imaging techniques is unlikely to be successful. An important question in the context of modelling concerns the relationship between brain activity and mental activity: how much f is there in fMRI to see?

Interpretation of Images

The images obtained through PET technology are generated by exploiting the relationship between metabolism and blood flow. The underlying assumption here is, of course, that the change in metabolism is a consequence of changes in neural activity, even mental activity. Belliveau et al. (1991) assert in their seminal paper that

[d]uring cognitive task performance, local alterations in neural activity induce local changes in metabolism and cerebral perfusion (blood flow and ... blood volume). These changes can be used to map the functional loci of component mental operations (Belliveau et al., 1991, 254).

Uttal (2001) and Bechtel (2000) suggest that this assumption may not be supported by relevant evidence. Bechtel comments

[w]hile a connection between metabolism and blood flow is very plausible, the mechanism linking neuronal firing and metabolism has not been established. There are several possible mechanisms for generating increased metabolism (neurotransmitter metabolism, action potential generation, etc.), some of which are not tightly linked to firing rates (Bechtel, 2000, 145).

Before I discuss the problem of linking metabolism with neural firing, I will address some issues that relate to the metabolism-blood flow interaction. While there is little doubt that this relationship is indeed "very plausible", as Bechtel suggests, the exploitation of this relationship for brain imaging and the subsequent interpretation of such images is a different matter. One of the concerns is the time delay between psychological stimulation and the change in the level of blood oxygenation that can be detected using fMRI or PET. The delay for the BOLD10 signal to be activated after the stimulus has been applied is in the region of 5 to 6 seconds, and the signal takes about 10 to 12 seconds11 to decay after the stimulus is removed (Casey et al., 2002). Consequently, only neural activity that can be sustained for a period of several seconds can be localized using BOLD imaging techniques. Hence there is always a chance that the image shows not only areas of neural activity that are in response to the stimulus in question, but also some activation patterns from other mental processes stemming from thoughts like 'I don't want to be in this MRI scanner any more'. Logothetis et al. note

that the extent of activation in human f MRI experiments is very often underestimated to a significant extent owing to the variation in the vascular response (Logothetis et al., 2001, 154).

In their work Logothetis et al. (2001) established that there is a close connection between neural activity and changes in metabolism. Their results "show unequivocally that a spatially localized increase in the BOLD contrast directly and monotonically reflects an increase in neural activity" (Logothetis et al., 2001, 154). The BOLD signal exhibits a linear relationship between the duration of the stimulus and the neural responses (spiking). This is a very important result, and Raichle (2001) points out that

[a]fter a century of research, we still do not know how or why blood flow increases during neuronal activation. It does seem to reflect an increased need for either oxygen or glucose. But thanks to the work of Logothetis et al., cognitive neuroscience can move forward with greater confidence in the knowledge that changes in blood flow and oxygen levels do represent a definable alteration in neural activity (Raichle, 2001, 130).

10Blood Oxygenation Level Detection.
11Delays of 2 to 10 seconds and decay times of 8 to 11 seconds are given by Bandettini and Ungerleider (2001).

Not everyone shares this optimism. There are still many open questions in relation to the underlying mechanisms, and from the discussion above we must recognize that images based on BOLD signals are distorted by the time lag and the variations of the vascular system. Casey et al. (2002) remind us that

[e]ven with the enormous interest and widespread use of this methodology, the relation between the MR signal and physiological mechanisms underlying this signal are not well understood. BOLD imaging relies on sensitivity to changes in oxygen levels within the circulating blood. [...] What is not clear is how blood oxygenation levels relate to neural activity (Casey et al., 2002, 301).

The current situation can be summed up as follows. Neural activity causes changes in the metabolism of the neuron, and that activity is reflected in changes of the blood oxygenation level. The validity of this relationship has been established by Logothetis et al. (2001), although there still are many open questions about the physiological processes that are involved (Bechtel, 2000). The imaging techniques relying on the BOLD signal suffer from the time lag between changes to the metabolism of neurons and the increase in blood flow (Casey et al., 2002). Also, variations in the vascular systems affect the apparent area of activation "to a significant extent" (Logothetis et al., 2001).

The methods for gathering information about the brain that I have discussed so far do not provide much evidence that could be used to address the question of whether artificial neural nets are suitable tools for modelling cognitive functions. This is largely due to the fact that detailed observation of cognitive processes seems impossible. While it is possible to observe cognitive functions on a large scale (areas of the brain) using fMRI, there is no indication of how these high level functions, or sub-functions, are implemented at a level that could be modelled using artificial neural nets. Currently it is not possible to simulate a neural network that is equivalent to the number of neurons captured in a single voxel (10⁶ neurons and about 10⁹ connections).

Of course, the concept of the neuron and the neural net is directly inspired by the observable constituents of the brain. Many aspects of biological models, i.e. the kind of models that follow the Hodgkin and Huxley (1952) approach, can be verified against measurements and observations mediated by microscopes, micro-electrodes and other more 'hi-tech' equipment. While the physiological behaviour of individual neurons can be tested, information and data about the psychological behavior of individual neurons remains largely hidden. Koch and Segev (2000), for example, describe the role of single neurons in information processing and suggest in their paper alternative ways of modelling the integration of pre-synaptic and post-synaptic activations within a single neuron. They argue that the back propagation of the spike into the dendritic tree and its interaction with the "incoming streams of binary pulses" could enable the neuron to be a "sophisticated processor with mixed analog-digital logic and highly adaptive synaptic elements" (Koch and Segev, 2000, 1176)12. The work, however, relates more to establishing a link between neural behavior and mathematical/computational primitive functions than to psychological primitives.

7.4 Other Evidence

Other evidence, from Linguistics, Philosophy and various other disciplines, may also contribute to the support, or refutation for that matter, of claims and explanations based on models and simulations. Some information that might help to judge the correctness or effectiveness of neural models and neural simulations can be gathered from the field of Psychology. One of the questions that arises here is about the kind of empirical evidence that Psychology can furnish. Much of the material in this discipline is drawn from statistical inference, and it can be argued that such material cannot be described as hard evidence. O Nualláin (2002) thinks that

[t]here are several formal difficulties with psychology as an empirical discipline. In the first place, the subject's matter has the unpleasant habit of reading what you've just written about it and may even be perverse enough to change its behaviour the next time round. Similarly, psychology's supposed descriptions of what one's inner experience and/or behaviour should be often has the effect of self-fulfilling prophecy (O Nualláin, 2002, 50).

In any case, evidence from Psychology does not provide direct information about the structures of the brain, if we ignore coarse structural effects due to lateralization of the brain and the like. It can nevertheless provide many clues as to what the brain does. However, I do not believe that comparing what the brain does with what an artificial neural net does can be taken as evidence for how things are done by the brain.

12See also Gerstner and Kistler (2002) and Würtz (1998).

7.5 Summary

Any judgments about whether a model is successful should not only be made on the quality of the model's predictions, but also on the quality and suitability of the tools and methodologies that have been employed to build the model. An important part of the engineering process of computer models is the design phase. The actual implementation of the model as a computer program is made relatively easy by the availability of many reliable and powerful tools. These tools, combined with modern computing hardware and effective operating systems, can provide very good models and simulations, at least as far as the verification of the model is concerned.

As computing power increases, more sophisticated models may be introduced. This sophistication is directed toward two different goals. For biological models it is possible to build models that take more and more physiological features into account, and it is possible to simulate highly complex single firing neuron models in real time. Many of the predictions and behaviors of these models can be plausibly verified against empirical data from real neurons.

The kinds of evidence that might be helpful to validate artificial neural nets that are built up from simple 'perceptron-like' units fall into two categories. Neural structures in the brain can be observed directly only in a very limited way, but examination of brain tissue mediated by various forms of microscopy is practical. Imaging technologies, MRI in particular, allow for non-invasive exploration of the brain. Structural (anatomical) evidence is of little value for the evaluation of models using artificial neural nets, because these models are generally about cognitive functions and behaviors. Imaging of brain activity through fMRI or PET is unsuitable largely due to the difference in scale. Simulations of artificial neural nets have far fewer units than the number of neurons contained in the smallest volume that can currently be resolved. Moreover, any increases in resolution and sensitivity will not help to pinpoint mental activity to smaller areas. In fact, there is now evidence that neural activity for specific tasks is much more distributed than believed earlier (Logothetis et al., 2001). In this chapter I have hinted at some of the technological and conceptual difficulties with the evidence in support of models and simulations.

I have argued that there is little that can be gleaned by looking at the brain 'in action', i.e. with fMRI, that might be useful for the validation of neural network models. The methodologies and theoretical assumptions of modelling and simulation, which were introduced in the previous chapters, will be further explored in the next chapter. I will also discuss some classic experiments in terms of the methods used, the claims made and the evidence that is available to validate the claims.

Chapter 8

Models of Cognition

The point is that some low-level stuff makes large differences in the way that the entire system operates, which in turn makes huge differences in the types of high-level descriptions that are most important for the cognitive scientist. Who would have thought that the person interested in problem solving or thinking should be interested in energy states or measures of "harmony" among units at a "neural-like" level? But guess what? They turn out to be important (Norman, 1986, 539).

In this chapter I look at network models and simulations of human thought, cognition and behaviour. There are many computer programs that model the behaviour of individuals or try to predict the behaviour of groups or entire populations. Some models are concerned with the biology of nerve cells, while other models deal with structural elements of brains. Some computer simulations are used to explain aspects of cognition and cognitive behaviour. Here, I will deal with some of the 'classic' experiments in Cognitive Science that are persistently cited in the current literature. One of the issues which cause(d) a considerable amount of confusion concerns the linking of artificial neural nets and the Parallel Distributed Processing (PDP) framework, or the PDP cognitive architecture. The last section deals with the interpretation of models that employ artificial neural nets and the use of cluster analysis as a means of validating claims in the context of PDP.

Many different kinds of computational model and simulation are used in Cognitive Science, but several classes share common approaches and methodologies. One of these classes involves artificial neural nets with a small number of nodes. Particularly feed-forward networks and simple recurrent networks (SRNs)1 (see figures 6.4 and 6.5 on page 135) are offered as empirical evidence for the suitability of the PDP paradigm as a modelling technique. Accordingly, these network architectures have been employed to model high level cognitive functions like the detection of syntactic and semantic features of words (Elman, 1990, 1993), learning the past tense of English verbs (Rumelhart and McClelland, 1986b), or cognitive development (McLeod et al., 1998; Shultz, 2003). SRNs have even been suggested as a suitable platform "toward a cognitive neurobiology of the moral virtues" (Churchland, 1998, 77). While some of the models go back a decade or two, there is still great interest in many of the 'classics', and similar models are still being developed; see for example Rogers and McClelland (2004). I will argue that many models in this class explain little at the neurological level, and that they explain even less at the level of any cognitive architectures which the models are designed to support. The fact that some mathematical functions can be extracted from a given set of data, and that these functions can be successfully implemented in an artificial neural net (neurological possibility), does not necessarily support any claims that these functions are realized, or could possibly be realized, in a similar fashion inside human brains (neurological plausibility).

1SRNs have a set of nodes that feed some or all of the previous states of the hidden nodes back. This 'short term memory' becomes part of the current input.

8.1 Higher Level Models

A classic example of a high level model is the GENERAL PROBLEM SOLVER (GPS), a computer program which was claimed to be a simulation of human thought (Newell and Simon, 1963). The idea behind GPS was to replicate the same processes that are executed by a human being who is asked to solve a problem that can be solved by processes of deduction using elementary and first order logic. Simon and Newell used a relatively simple example from symbolic logic: the statement (R ⊃ ∼P) · (∼R ⊃ Q) is to be transformed into ∼(∼Q · P), using a set of transformation rules2. The authors claimed that the process of applying rules to transform the initial problem into a sequence of different problems leads progressively to the solution, and that this process of decomposition and transformation can be replicated by a computer program. The logic within the program uses a set of goals, i.e. the next steps toward the solution, and applies the appropriate inferential rules to reach the next step. Complex goals could be broken down into sub-goals, so that the original task was to be transformed into a series of rather trivial sub-tasks by systematic reduction. The little sub-tasks could then be solved easily by applying one of the nineteen logically valid rules of inference. GPS has been discredited because this sub-goaling and means-ends analysis lacks domain-specific knowledge (Stillings et al., 1995). Problem solving seems to require not only deductive skills, but also knowledge, intuition and experience (and even some good 'informed' guesswork).

2The rules of inference are those of propositional symbolic logic and these include Modus Ponens, Modus Tollens, De Morgan's Theorems, and quantification rules (see, for example, Copi (1979) and Copi and Cohen (1994)).

The type of generalizations about human thought and the 'obvious' methods that can be realized using a computer program are the interesting part of the GPS episode. GPS depended, of course, on the assumption that the Physical Symbol Processing Hypothesis is valid. This hypothesis is essentially the belief, or the principle, that

[a] physical system has the necessary and sufficient means for general intelligent action. [...] By "necessary" we mean that any system that exhibits general intelligence will prove upon analysis to be a physical symbol system. By "sufficient" we mean that any physical symbol system of sufficient size can be organized further to exhibit general intelligence. By "general intelligent action" we wish to indicate the same scope of intelligence as we see in human action: that in any real situation, behavior appropriate to the ends of the system and adaptive to the demands of the environment can occur, within some limits of speed and complexity (Newell and Simon, 1976).
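As an aside, the purely formal character of this kind of symbol manipulation is easy to demonstrate: the transformation in Newell and Simon's example can be checked mechanically, with no reference to what the symbols mean. The following sketch (plain Python written for this discussion, not part of GPS) verifies by brute force over all truth assignments that (R ⊃ ∼P) · (∼R ⊃ Q) entails ∼(∼Q · P).

from itertools import product

# Brute-force check that the premise (R -> ~P) & (~R -> Q) entails the target
# expression ~(~Q & P) under every assignment of truth values to P, Q and R.
def implies(a, b):
    return (not a) or b

for P, Q, R in product([False, True], repeat=3):
    premise = implies(R, not P) and implies(not R, Q)
    target = not ((not Q) and P)
    assert implies(premise, target)    # holds in every one of the 8 cases

print("the premise entails the target in all 8 truth assignments")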

In contrast to the Physical Symbol Processing Hypothesis, Rumelhart and McClelland (1986a) and McClelland and Rumelhart (1986) offered Parallel Distributed Processing (PDP) as an alternative cognitive framework. The aim of PDP was to provide a mechanism that caters for simultaneous action between pieces of information, because such a mechanism was 'intuitively' required, and

[t]o articulate these intuitions, we and others have turned to a class of models we call Distributed Parallel Processing (PDP) models. These models assume that the information processing takes place through the interactions of a large number of simple processing elements called units, each sending excitatory and inhibitory signals to other units. In some cases, the units stand for possible hypotheses about such things as the letters in a particular display or the syntactic roles of the words in a particular sentence. In these cases, the activations stand roughly for the strengths associated with the different possible hypotheses, and the interconnections among the units stand for the constraints the system knows to exist between the hypotheses. In other cases, the units stand for possible goals and actions, such as the goal of typing a particular letter, or the action of moving the left index finger, and the connections relate goals to subgoals, subgoals to actions, and actions to muscle movements. In still other cases, units stand not for particular hypotheses or goals, but aspects of these things. Thus, a hypothesis about the identity of a word, is itself distributed in the activations of a large number of units (McClelland et al., 1986, 10).
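A minimal sketch of the kind of processing element the quotation describes is given below; the weights and activation values are invented for the illustration and do not come from any published PDP model.

import numpy as np

# Illustrative only: a single PDP-style unit receiving excitatory (positive
# weight) and inhibitory (negative weight) signals from three sending units.
def unit_activation(incoming_activations, weights, bias=0.0):
    """Weighted sum of incoming signals passed through a logistic squashing function."""
    net_input = np.dot(incoming_activations, weights) + bias
    return 1.0 / (1.0 + np.exp(-net_input))

incoming = np.array([0.9, 0.2, 0.7])     # activations of three sending units (invented)
weights = np.array([0.8, -1.5, 0.4])     # positive = excitatory, negative = inhibitory
print(f"activation of the receiving unit: {unit_activation(incoming, weights):.3f}")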

The two fundamentally different approaches toward cognitive modelling have been developed over the last few decades into the symbolic and the connectionist paradigms in AI and in Cognitive Science. The PDP architecture is, of course, based on the general principles of artificial neural nets, which include, as the name implies, the notion of distributed representations. Although the use of localized symbols cannot be avoided at some level of description (Hoffmann, 1998, 208, 230, 235), artificial neural nets are assumed to be the tool for the connectionist approach to cognitive modelling. The difficulties emerge when meaningful labels need to be attached to at least some of the nodes (or connections) of the model in order to furnish a sensible and meaningful interpretation. Distributed connectionist models must use localized symbols, but the selection of symbols and the interpretation of their semantics is very much a subjective process. This subjectivity contributes to the 'theory-ladenness' that we detect in some models. The following section deals with the process of interpreting models and simulations.

8.2 Interpreting Results

Once a model has been built, it is generally used to either generate (predict) data or to simulate some aspects of human or animal cognition.

Such simulations also yield data that needs to be analyzed and interpreted. In the interpretive phase we encounter only one aspect of the problem. The other difficulty lies in the determination of what makes a 'successful' model. Agre comments that

[t]o understand what is implied in a claim that a given model "works", one must distinguish two senses of "working". The first, narrow sense, again, is "conforms to spec" - that is, it works if its behavior conforms to a pregiven formal-mathematical specification. Since everything is defined mathematically, it does not matter what words we use to describe the system; we could use words like "plan", "learn" and "understand", or we could use words like "foo", "bar", and "baz". [...] an AI system is only truly regarded as "working" when its operation can be narrated in intentional vocabulary, using words whose meanings go beyond the mathematical structures (Agre, 1997, 14).

It is the use of "plan" or "learn", instead of "foo" or "bar", that adds an element of relevance to cognition to the language of modelling and simulations. However, these kinds of descriptors may also lead the experimenter to make claims about "learning" and "understanding", whereas the claims ought to be about "foo" and "bar". I argue that using localized symbols which are assigned to mathematical entities, and their subsequent interpretation at a cognitive or psychological level, does not always produce 'plausible' models. I will examine a number of models to illustrate my point.

8.2.1 Pronouncing English

McLeod et al. (1998) describe the 'power' of connectionism and provide an example of a neural net learning and pronouncing irregular words in the English language3. They describe how a neural net can model the relationship between the mean naming latency for irregular words, i.e. the time it takes before a human pronounces an irregular verb correctly, and the relative frequency of use of the word, i.e. how often it occurs in the common use of English. A graphical representation of this function is compared to a graphical representation of the result from a neural network. The graphs (see figure 8.1)4 look very similar indeed, yet what is actually shown in the graphs is quite dissimilar.

3This experiment was first described by Seidenberg and McClelland (1989).
4Reproduced following McLeod et al. (1998, 3).

An important clue is found in the (not very detailed) explanation where it says that

[t]he output of the model is error rather than reaction time but it can be seen that the interaction between frequency and regularity is reproduced (McLeod et al., 1998, 3).

Figure 8.1: Error versus naming latency (mean naming latency in msec for human readers and mean squared error for the network, each shown for high- and low-frequency regular and exception words)

The result that is obtained in reproducing the "precise match to data in experiments with human subjects" (McLeod et al., 1998, 2) can be reduced to the following: while it takes a human a bit longer to pronounce irregular verbs correctly, depending on the frequency of use, the neural net gets them wrong more often. Is there anything of value for Cognitive Science in this analysis? The training dataset contained similar frequencies of words as the dataset used for producing the 'human' data. We should not be surprised that the neural net is in error in proportion to that frequency. During learning, the weights are adjusted and therefore optimized according to the relative frequency of the words, among other factors that make up the function that is ultimately learned by the network. The network is essentially 'underexposed' to the less frequent words when we compare it to its exposure to the more frequent words (a point illustrated numerically in the sketch at the end of this subsection). In this particular experiment, the ratio between the most common word ('THE') and the least common word in the training dataset was 230:7. McLeod et al. (1998) conclude in their analysis of Seidenberg and McClelland's (1989) model that

[g]iven that the network's response to any letter string is a function of its entire experience, perhaps it is not surprising that the model can account for effects of word frequency, spelling-to-sound regularity and the number of other words with a similar spelling pattern on word naming. But what is impressive about the model is that it does not just make vague qualitative predictions about these effects. It gives precise, quantitative matches to data obtained from normal adult readers. Moreover it accounts for the relative effects for frequency, consistency and neighbourhood size without any additions to the model. This suggests that the connectionist approach to learning and the representation of knowledge may indeed mimic some aspects of the way the people learn (McLeod et al., 1998, 166-7).

I believe that the only conclusions that should be drawn from the experiment are that (1) the neural network is a suitable tool for this kind of statistical analysis, and (2) the data possibly contained a lurking variable5. McLeod et al. (1998) essentially agree that the network has learned the (statistical) relationships that are contained in the training data set, and as such "it is not surprising that the model can account for" various effects. The point is that we should not make any inferences or even suggestions that the artificial neural network might "mimic some aspects of the way people learn" from the performance of this neural network model. The final analysis by McLeod et al. (1998) is essentially based on the oversight that the training data set was created by humans in the first place. Matching (predicting) data using some function f that has been laid down in the connections of the network during training with some data set D does not tell us anything about how D came to be6. I will examine a more complex and more influential research program in terms of the methodologies, assumptions, interpretations and claims, to further illustrate that claims about artificial neural nets rest on assumptions and even 'wishful thinking'.

5The term lurking variable is used in statistics to describe a variable "that has an important effect on the response but is not included among the explanatory variables studied" (Moore and McCabe, 1993, 129).
6Let me provide a somewhat exaggerated example to illustrate this point. Assume we were to produce a copy of the Mona Lisa using a paint-by-number system. Even if we were to produce a good copy, unlikely as it may be, we cannot infer that Leonardo also used a paint-by-number system on the grounds that the finished products are similar.
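The 'underexposure' point can be illustrated with the simplest possible numerical sketch: a single parameter fitted by frequency-weighted least squares to two groups of items that pull it toward different values. The 230:7 frequency ratio is taken from the text above; everything else is invented for the illustration and has nothing to do with the actual Seidenberg and McClelland model.

# A single parameter w is fitted by frequency-weighted least squares to two
# groups of items that 'prefer' different values of w. The frequencies follow
# the 230:7 ratio quoted above; the target values are invented.
freq_high, freq_low = 230, 7
target_high, target_low = 1.0, 0.0     # the value of w each group 'prefers'

# w minimises freq_high*(w - target_high)^2 + freq_low*(w - target_low)^2
w = (freq_high * target_high + freq_low * target_low) / (freq_high + freq_low)

print("fitted w:", round(w, 3))                                        # about 0.970
print("squared error, high-frequency items:", round((w - target_high) ** 2, 3))
print("squared error, low-frequency items: ", round((w - target_low) ** 2, 3))

The optimization is dominated by the frequent items, so the rare items end up with a far larger error, exactly the pattern that the model 'reproduces'.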

8.2.2 Structure in time

The experiments with recurrent neural nets by Elman (1990) are perhaps the best example to illustrate the processes involved. By 'processes' I mean the steps that were taken during the simulation and also, even more importantly, the steps taken during the interpretive phase. Elman's work and results have been cited in the literature as a prime example of a cognitive model by, for example, Lloyd (1995), McLeod et al. (1998), Elman et al. (1998), and Elman (1998). In one of these experiments, Elman trained a relatively simple recurrent artificial neural net with a number of two and three word sentences. It is claimed that the resulting analysis and interpretation showed that

[t]he networks are able to learn interesting internal representations which incorporate task demands with memory demands; indeed, in this approach the notion of memory is inextricably bound up with task processing. The representations reveal a rich structure, which allows them to be highly context-dependent while also expressing generalizations across classes of items (Elman, 1990, 179).

These claims are based on the statistical analysis and interpretation of the weights in the hidden layer of the network. Because I am interested in the analytical and interpretive part of the entire experiment, I have reconstructed the network simulation and followed Elman's methodology of analysis. The data have been reproduced and the experiment has been repeated following the descriptions as closely as possible. The details of the particular network, the data, and the statistical analysis have been compiled from several publications (Elman, 1990; Elman et al., 1998; Elman, 1993, 1998; Bates and Elman, 1993; Hare et al., 1995; McLeod et al., 1998). Although the details of the actual network are of lesser importance, I will provide an outline of its principal architecture and operation first. This will be followed by a more detailed description and discussion of the production of test data, the statistical analysis and the interpretation of the results.

Network Architecture

The Elman (1990) model is based on a simple recurrent network (SRN) (see figure 6.5 on page 135). One of a series of experiments in this very influential paper (FINDING STRUCTURE IN TIME) involved a network with a total of 362 neurons. There are 31 input and 31 output neurons, which are connected to a hidden layer of 150 elements. The architecture of the network utilizes a further group of 150 hidden units as a temporary storage of part of the network's internal state. In Elman's neural networks, the output of the hidden units is fed back into the hidden units during the next cycle7. The use of hidden nodes to provide a mechanism for temporary data storage is what makes these networks distinct from Jordan's network (Jordan, 1986), in which the output layers are fed back. The feedback mechanism effectively introduces a memory component into the network, which enables the network to take information from the previous state into the calculations of the current state. This technique models time, or allows for representing the effect that time has on processing (Elman, 1990).
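The following sketch (Python with NumPy) shows the flow of activation in a network of these dimensions: the hidden state of one time step is copied into the context units and fed back at the next step. The weights are random, so this illustrates only the architecture, not Elman's trained model.

import numpy as np

# A minimal SRN with 31 input units, 150 hidden units, 150 context units and
# 31 output units. Weights are random; only the flow of activation and the
# copying of the hidden state into the context units are illustrated.
rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 31, 150, 31
W_ih = rng.normal(scale=0.1, size=(N_HID, N_IN))    # input   -> hidden
W_ch = rng.normal(scale=0.1, size=(N_HID, N_HID))   # context -> hidden
W_ho = rng.normal(scale=0.1, size=(N_OUT, N_HID))   # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_step(x, context):
    """One time step: combine the current input with the previous hidden state."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden            # the hidden state becomes the next context

context = np.zeros(N_HID)
for t in range(3):                   # three arbitrary one-hot input vectors
    x = np.zeros(N_IN)
    x[t] = 1.0
    y, context = srn_step(x, context)
    print(f"step {t}: produced an output vector of length {len(y)}")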

The generation of data

A relatively large corpus of data is necessary for this particular experiment. The data was produced with a simple computer program that uses Elman's original specification and a random number generator. The program generates a training data set and a matching target data set in only a few seconds. The data to train and test the network is a sequence of simple two and three word sentences. The process of generating the sentences is important for the discussion and I will describe it here in some detail. The sentences, and the words within them, are constructed within a clearly defined set of constraints. These constraints allow for little variation in terms of random selection of constructs. The sentences were constructed by first randomly selecting a pattern from a pool of sixteen, as suggested by Elman (1990). The table in figure 8.2 shows these patterns, which were also used to build the input and teaching dataset for Elman's original experiments. While the notation in the tables differs somewhat from the original descriptions, the resulting datasets for teaching and testing the network are essentially identical. Each of the sixteen patterns is made up of two or three words from a particular class.

7The class of networks with these particular features is known as Elman nets. A variant architecture with a chain of feedback units is referred to as an Elman tower (Elman, 1990).

NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD

Figure 8.2: Sentence Patterns

The patterns in figure 8.2 describe a set of trivial sentences. The first pattern in the table (NOUN-HUM VERB-EAT NOUN-FOOD) is a template for a sentence in the form noun (human) followed by a verb (eat) and a noun (food). Other classes for nouns are AGRESS (aggressor), ANIM (animal), FRAG (fragile), and INANIM (inanimate). For verbs, the classes INTRAN (intransitive), TRAN (transitive), PERCEPT (perceive) and AGRESS (aggression) are available for selection.

NOUN-HUM: man woman boy girl
NOUN-ANIM: cat mouse dog
NOUN-INANIM: book rock car
NOUN-AGRESS: dragon monster lion
NOUN-FRAG: glass plate
NOUN-FOOD: cookie bread sandwich
VERB-INTRAN: think sleep exist
VERB-TRAN: see chase like
VERB-AGRESS: move break
VERB-PERCEPT: smell see
VERB-DESTROY: break smash
VERB-EAT: eat

Figure 8.3: Word Classes

We can see that the classes of words are quite specific, so that each class is not determined by its use, but the relationship to the word's lexical category (i.e. verb, noun) is prescribed. Each pattern is populated with words by randomly selecting from a list of twenty-nine words (Elman, 1990)8, from the thirteen classes of nouns and verbs shown in figure 8.3. A sentence that follows the first pattern can have twelve different forms, with either man, woman, boy, or girl eating either cookie, bread, or sandwich. The sentences are presented to the network as a continuous stream of words. A sample of this input stream reads, in a human-friendly form, as '. . . girl smash glass monster eat man boy think girl move book boy see man . . . '. This stream of words is, of course, not suitable for use with the network. The actual data for the network itself has to be transformed. Each word is encoded as a vector containing only '0's and '1's when the sentences are generated. The entire corpus of sentences is presented as a stream of vectors of 31 bits to match the number of input units of the network. Each of the bits is assigned to a word so that only one bit is ever set to 1. The words, i.e. binary vectors, are presented one after the other and without breaks between sentences, as a continuous stream of words for the input and the output. The data is very similar to the example shown in figure 8.4. Note that the network is trained to predict the particular word following the input. The word for the output (target) then becomes the input for the next word, and so on. The entire training set consists of the same sentences shifted by one word: the inputs woman, smash, plate, cat, move, man, etc. have the corresponding outputs smash, plate, cat, move, man, break, etc. The training process itself consisted of presenting the entire corpus of 27,354 words to the network six times. The trained network performed the prediction task with a final root mean square error (RMS) of 0.88 (Elman, 1990).
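The data-generation procedure can be re-created in a few lines. The templates and word lists below follow figures 8.2 and 8.3 only loosely, and the exact composition and frequencies of Elman's corpus are not reproduced, so this is a sketch of the procedure rather than of the corpus itself.

import random

# Re-creation of the data-generation step: pick a template, fill each slot
# with a random member of its class, and encode every word as a one-hot
# vector whose target is simply the next word in the stream.
random.seed(0)
TEMPLATES = [
    ("NOUN-HUM", "VERB-EAT", "NOUN-FOOD"),
    ("NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"),
    ("NOUN-AGRESS", "VERB-DESTROY", "NOUN-FRAG"),
    ("NOUN-HUM", "VERB-INTRAN"),
]
CLASSES = {
    "NOUN-HUM": ["man", "woman", "boy", "girl"],
    "NOUN-FOOD": ["cookie", "bread", "sandwich"],
    "NOUN-INANIM": ["book", "rock", "car"],
    "NOUN-AGRESS": ["dragon", "monster", "lion"],
    "NOUN-FRAG": ["glass", "plate"],
    "VERB-EAT": ["eat"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-INTRAN": ["think", "sleep", "exist"],
}

def generate_stream(n_sentences):
    words = []
    for _ in range(n_sentences):
        template = random.choice(TEMPLATES)
        words.extend(random.choice(CLASSES[cls]) for cls in template)
    return words

stream = generate_stream(1000)
vocab = sorted(set(w for members in CLASSES.values() for w in members))

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

inputs = [one_hot(w) for w in stream[:-1]]     # each word ...
targets = [one_hot(w) for w in stream[1:]]     # ... predicts the next word
print(len(vocab), "words in the vocabulary;", len(inputs), "input/target pairs")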

Analysis

In the hierarchical cluster analysis (see 8.2.4) of the hidden nodes, Elman found that the network grouped the verbs and nouns into sets of various classes. It was claimed that membership of these groups was essentially due to the recognition of contextual relationships and of the inherent semantics. Is it possible that this 'semantic' content was only the by-product of the setup of the particular experiment, because Elman's cluster analysis revealed only the initial distributions in the training data set?

8There are some discrepancies between the number of words given in the text (29), the number of words shown in the tables (23) and the number of words shown in the analysis (31). There are also hints that this particular experiment or very similar experiments were conducted using different numbers of words, e.g. Elman et al. (1998) vs. Elman (1990).

Input data:

000000000000000000000000000010 (woman) 000000000000000000000000010000 (smash) 000000000000000000001000000000 (plate) 000001000000000000000000000000 (cat) 010000000000000000000000000000 (move) 000000000000000000010000000000 (man) ...

Output data:

000000000000000000000000010000 (smash) 000000000000000000001000000000 (plate) 000001000000000000000000000000 (cat) 010000000000000000000000000000 (move) 000000000000000000010000000000 (man) 000000000000100000000000000000 (break) ...

Figure 8.4: Training Data

Given the kind of training data that had been designed for the experiment, the initial claim that the hidden nodes contain information about the 'words' and also about the context in which they were used seems reasonable. I would prefer to see this 'context' as merely being the probability of one vector (as in figure 8.4) following a particular other vector, rather than using context at the level of semantically 'loaded' terms like words and sentences. The interesting question here is where, when and how the vectors that have been extracted from the hidden nodes get the semantic content that is analyzed in the cluster analysis. Elman asserts that

[t]he hidden unit activation patterns pick out points in a high (but fixed) dimensional space. This is the space available to the network for its internal representations. The network structures that space in such a way that important relations between entities are translated into spatial relations. Entities which are nouns are located in one region of space and verbs in another (Elman, 1990, 201).

Half of the data passed to the hidden nodes in each step is provided by the 150 context units in the network, while the other portion comes from the input units. Two identical input patterns will cause different output patterns, because the output is not only affected by the input, but also by the values of the hidden units. That the activation patterns (outputs of the hidden nodes) for certain input patterns are more similar than patterns for other inputs should not be a surprise. After all, that is what the network was trained to do. For the analysis of the network after training, the entire set of input vectors is presented to the network one more time. The values of the hidden units are saved for each input as a separate set of vectors. This sequence of vectors is then subjected to a hierarchical cluster analysis. The final analysis, which is presented as a dendrogram (see page 190), reveals the relationships (distances) between patterns, or the "similarity structure of the internal representations" (Elman, 1990). The emerging clusters are then named according to their lexical meaning in the context of the experiment, i.e. nouns, verbs, or humans.

There should be little doubt that the classification using cluster analysis into nouns, verbs, humans, and the like is an artifact. The construction of the training data contains three food items that follow the word eat. At this point there are already "important relations between entities", and the context for the words cookie, bread and sandwich is that they are always preceded by the word eat. The selection of words and sentences has to be done carefully for the experiment to work at the higher interpretive layer. The classification of edibles and breakables, for example, fails if we enrich the language and include sentences like man break bread or dragon eat boy. Both sentences are grammatically valid and make sense in terms of their semantic content. Note that we have not introduced any new words here; instead, only the list of "allowable" words of some sentence patterns has changed (see figures 8.2 and 8.3). If these sentences were to become part of the training dataset, and if the relative frequency of these new sentences in the dataset were at comparable levels, the classification into "edibles" and "breakables" would no longer be possible. Elman claims that

[t]his use of context is appealing, because it provides the basis for establishing generalizations about classes of items and also allows for the tagging of individual items by their context. The result is that types can be identified at the same time as the tokens. In symbolic systems, type/token distinctions are often made by indexing or binding operations; the networks here provide an alternative account of how such distinctions can be made without indexing or binding (Elman, 1990, 202).

I argue that the binding or indexing occurred when the training dataset was very carefully designed; that is, before the network came into the picture. It is interesting here that the new sentences need not necessarily introduce contextual ambiguities. Breaking bread is a new phrase and is not ambiguous. The phrase has a completely different meaning, namely dining. Nevertheless, bread is broken and boys are dragon food. Whether a network is able to resolve (semantic) ambiguities depends on the statistical distribution of sentences in the training data set. One would expect that the use of a particular pattern, e.g. boy as 'dragon food', will cause the network to classify boy as a food item.

Classification even within a micro-ontology is a difficult task, and I maintain that a neural network that does not go beyond statistical analysis does not contribute much to this task. Indeed, the entire analysis can be performed altogether without the network. An appropriate (multivariate regression) analysis of the relative probability of a particular word following another would reveal the same clusters (see the sketch at the end of this section). In fact, it is safe to assume that results using direct statistical methods provide a more accurate analysis. If the goal of Elman's experiment was to show that a network is able to deal with time and sequential processing in a distributed framework, then this experiment has had some success. However, if we are to judge the success of the experiment as a model of context-sensitive lexical analysis, then the model is unsuccessful. The success of the experiment is then a question of evaluating the results (and claims) as a cognitive model (i.e. in the context of Cognitive Science) or as an experiment in Artificial Intelligence (i.e. in the context of an engineering discipline).
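The direct analysis suggested above can be sketched as follows: the words are clustered by the relative frequencies of the words that follow them, with no network involved. The toy corpus is generated from two templates in the spirit of figure 8.2 and is purely illustrative; the point is only that the grouping falls out of the bigram statistics that were designed into the data.

import random
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Generate a toy stream of sentences, then cluster the words by the relative
# frequencies of the words that follow them ('follower profiles').
random.seed(0)
CLASSES = {
    "human": ["man", "woman", "boy", "girl"],
    "food": ["cookie", "bread", "sandwich"],
    "fragile": ["glass", "plate"],
    "eat": ["eat"],
    "destroy": ["break", "smash"],
}
TEMPLATES = [("human", "eat", "food"), ("human", "destroy", "fragile")]

stream = []
for _ in range(2000):
    template = random.choice(TEMPLATES)
    stream.extend(random.choice(CLASSES[cls]) for cls in template)

vocab = sorted(set(stream))
index = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for current_word, next_word in zip(stream, stream[1:]):
    counts[index[current_word], index[next_word]] += 1
profiles = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

# Hierarchical clustering of the follower profiles groups words that occur in
# similar contexts (the humans, the destruction verbs, the sentence-final
# nouns) purely from the bigram statistics, mirroring the kind of dendrogram
# Elman reports for the hidden-unit vectors.
tree = linkage(profiles, method="average")
dendrogram(tree, labels=vocab, no_plot=True)
print(vocab)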

8.2.3 Moral virtues

Churchland (1998) describes a recurrent network that could model more challenging cognitive functions. He suggests that a recurrent network may have an appropriate architecture for learning and simulating moral virtues. He considers, given the "examples of perceptual or motor categories at issue" (Churchland, 1998, 83), that a network would be able to map concepts like cheating, tormenting, lying, or self-sacrifice within an n-space of classes containing dimensions of morally significant, morally bad, or morally praiseworthy actions. Churchland says that

[t]his high-dimensional similarity space [...] displays a structured family of categorical "hot spots" or "prototype positions", to which actual sensory inputs are assimilated with varying degree of closeness (Churchland, 1998, 83).

and that

[i]n living creatures, learning also consists in the repeated adjustment of one's myriad connections, a process that is also driven by one's ongoing experience with failure. Our artificial "learning technologies" are currently a poor and pale reflection of what goes on in brains - those gradual synaptic readjustments lead to an appropriately structured high-dimensional similarity space, a space partitioned into a hierarchical family of categorical subspaces, which subspaces contain a central hot spot that represents a prototypical instance of its proprietary category (Churchland, 1998, 84).

The language of prototypes in hierarchical subspaces is also found in Churchland (1999) and in Churchland (1993). The reduction of subspaces toward categorical "hot spots" has several consequences that I regard as contentious. There is (1) a need for some explanation of what kind of mechanism could attribute 'meaning' to entities that are distributed in a multi-dimensional (vector) space, and (2) a recursive reduction of the vector space leads to a paradoxical situation where it is necessary to find (by means of analysis) representational "prototypical instance[s]" in order to show that representations are distributed. I discuss these concerns in turn.

For all "prototypical instances" of representations that are context sensitive, there will have to be multiple copies (or "hot spots"), as any representation with a different meaning must exist in a different location in the vector space in order to distinguish it from other possible meanings, and the relationships between these "hot spots" are then relationships between meaningful representations. These representations must have semantic content, otherwise it would not be possible to either detect or resolve the ambiguities. Thus, the relations clearly belong at the cognitive level and not somewhere in the distribution of weights in an artificial neural net (Fodor and Pylyshyn, 1988). The question arises whether it is possible to provide some link or mechanism between the cognitive level and the weight (vector) space in the artificial neural net. Thornton argues that there are such mechanisms, e.g. the delta learning rule or back propagation, that produce "computational components which can be classified as concepts" (Thornton, 1999, 184), and he believes that these concepts are "produced" by the similarity of data. He continues to explain that learning using the delta rule defines "groups of similar data elements in terms of their centroid point", and that

[i]n Competitive learning [...] the potential centroids (weight-vectors) are moved around in data-space until they lie at the centre of groups of similar elements (Thornton, 1999, 184).

The exact meaning of this remains unclear to me, but it seems that the proposal is another re-interpretation in 'Churchlandish' of the mathematical (statistical) operations that are performed by neural nets (surface fitting) and of the statistical analysis of the hidden nodes (cluster analysis9). Thornton is merely saying that during training of the neural network the weights are adjusted in such a way that a subsequent cluster analysis will reveal the similarities of activation patterns in the trained network according to the distributions in the training dataset. That, however, is what is designed to happen in an artificial neural network with this kind of architecture. Modelling of "categorical subspaces" and "hot spots" (i.e. classification) has been unsuccessful for any cognitive problem other than for carefully engineered examples within micro-ontologies. There are many examples of models with small ontologies (e.g. Elman (1990), 31 words; Rogers and McClelland (2004), 36 words; and Seidenberg and McClelland (1989), 26 inputs). It seems there are no working models based on a large corpus of knowledge.

The feed-forward network architecture that Churchland proposes is not likely to produce an actual working model of a moral calculus, other than an ‘in principle’ demonstration that may take a few “actual sensory inputs” and a set of appropriately designed relations into account.

Any recursive reduction of the subspace into smaller and smaller subspaces, and ultimately into hot spots, will lead toward meaningful atomic entities in the vector space, a kind of ‘logical’ grandmother neuron. Churchland’s notion of a “hierarchical family of categorical subspaces” (Churchland, 1998, 84) therefore appears to be paradoxical. Individual vectors are very much like addresses, and a vector space encompasses everything that can be addressed by the set of vectors[10]. For example, the vector i = (2, 5, 8) is used to denote a point p in 3-dimensional space along the x, y, and z axes, and a vector k = (2.1, 4.9, 8.3) would address a point q that lies in the vicinity of p in the same vector space (see figure 8.5). If a single component of k has a vastly different value, then q is placed far away from p, even when the dimensionality of k is significantly large.

[10] A vector space is a set of vectors and a set of rules that determine the allowable operations between vectors and scalars (see, for example, Lay (1997) or Fraleigh and Beauregard (1995) for a more formal treatment of vectors and vector spaces).

Figure 8.5: Points in 3-space

A single instance of such a vector among a number of vectors would qualify as an outlier, and an artificial neural net may not classify this vector correctly, depending on whether the outlier was contained in the training dataset[11].
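The point about p, q, and outliers can be checked numerically; the following sketch uses the vectors given above together with an invented third vector, and is only an illustration of Euclidean distance, not of any particular model.

import numpy as np

p = np.array([2.0, 5.0, 8.0])         # the point addressed by vector i
q = np.array([2.1, 4.9, 8.3])         # the nearby point addressed by vector k
o = np.array([2.1, 4.9, 80.0])        # an invented vector with one vastly different component

print(np.linalg.norm(p - q))          # small distance: q lies in the vicinity of p
print(np.linalg.norm(p - o))          # large distance: one aberrant component moves the point far away

# The same effect persists in a space with many dimensions.
hp = np.zeros(100)
hq = hp + 0.1                         # every component differs slightly: distance 0.1 * sqrt(100) = 1.0
ho = hp.copy()
ho[0] = 72.0                          # a single component differs vastly: distance 72.0
print(np.linalg.norm(hp - hq), np.linalg.norm(hp - ho))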

This will also occur in a vector space with many dimensions, so that classification by addressing the appropriate “hot spots” is not workable. The concept of ‘Tree’ might serve as an example: trees have certain properties that make them trees as opposed to ‘books’. Trees have leaves, trees are green, trees are tall, trees are living things, and so on. A neural net, given that it has learned an association like ‘cactus has needles’, will misplace pine trees in the same way it will put a Copper Beech into the wrong categorical subspace. If we consider that a book has leaves and can be green, the task becomes impossible for a statistical procedure. The model designed to deal with the classification of entities that are connected with is a or has a relations (Rogers and McClelland, 2004) has a small domain of very few concepts and relations. The reason that “the model makes comparatively few cross-domain errors” (Rogers and McClelland, 2004, 113) is that the dataset does not contain enough ‘unpleasant’ relations. A trained network of a reasonable size can perform the classification fast; this is why neural nets are of great value as computational tools (Hoffmann, 1998). However, it cannot apply common sense or any heuristics in the process, and the network is therefore restricted in its competence to classify by (1) the architecture, including the network structure, the number of layers, the number of nodes in each layer, the number of connections, possibly internal feedback, and so on, (2) the training dataset, and (3) the amount of training[12]. Much of what is claimed about how networks learn and how networks store information is the result of a post mortem analysis that reveals the internal state of the network after learning. I will present the general principles of cluster analysis of hidden nodes in the next section.

[11] An ‘outlier’ that is contained in the training dataset can be classified correctly by the network, whereas a distant data point (outlier) will not be classified correctly if it is presented after the network has been trained. Essentially, the network may not be able to extrapolate correctly.
[12] A network can be over-fitted by training it extensively, which limits its capacity to generalize.

8.2.4 Cluster Analysis

Cluster analysis is employed as a tool to uncover some of the hidden aspects of the network after it has been trained.

The various forms of cluster analysis are used as statistical tools for grouping data items into respective categories (clusters) according to some measure of similarity between the data. In the context of artificial neural nets, cluster analysis is commonly applied to reveal the levels of activation that are reflected in the outputs of the hidden nodes. The analysis of all activations in response to the set of input vectors reveals some information on how the training data is encoded within the network structure, or, in the words of Clark, “[c]luster analysis is an example of an analytic technique addressing the crucial question ‘what kinds of representations has the network acquired’?” (Clark, 2001, 69). This process involves generating a set of new data, processing (clustering), and interpreting the results. While the mathematics for this kind of analysis is relatively simple, a considerable amount of computing power is needed as the number of data items (vectors) increases. It is important to note that this analysis does not reveal much about the dynamics in the network, unless a series of cluster analyses is performed between training epochs, e.g. Rogers and McClelland (2004). During the early phase of the training process, the adjustment of connection weights is largely influenced by the (randomly) assigned initial weights, whereas the final training epochs may only produce very small changes, particularly if the network has already reached, or is very close to, the global minimum of error.

The process of producing the data for a cluster analysis is as follows. The complete set of vectors that make up the corpus of the input data of the training dataset is presented to the trained network one by one, and the values of all the outputs of the hidden nodes are captured and saved as a new vector for every input vector. Each new vector represents a snapshot of the particular activation pattern of the network’s hidden layer for the corresponding input vector. This new set of vectors is then analyzed using cluster analysis. In practice this involves little more than submitting the data together with a few parameters to a readily available statistics package. If the number of clusters is known, it is possible to use k-means clustering, which separates the vectors according to which of the k centroids they are closest to. More common is hierarchical cluster analysis, where the data, i.e. the list of vectors, is clustered according to their Euclidean distance $d(x,y)=\sqrt{\sum_i (x_i - y_i)^2}$, which is the geometric distance in the n-dimensional vector space[13]. In hierarchical cluster analysis the relative similarities of the clusters, and of the items within them, are often represented graphically as a dendrogram similar to the example for a network in figure 8.6 (from Elman (1990)).

[13] There are other measures, like the squared Euclidean distance $d(x,y)=\sum_i (x_i - y_i)^2$ or the Manhattan distance $d(x,y)=\sum_i |x_i - y_i|$, that may be applicable for special cases.
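As a minimal sketch of this procedure: the ‘trained’ weights below are random stand-ins, and the input vectors are arbitrary; only the mechanics of capturing hidden activations and clustering them by Euclidean distance are illustrated.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for a trained feed-forward net: 8 input nodes feeding 5 hidden nodes.
# A real model would use the weights obtained after training.
w_hidden = rng.normal(size=(8, 5))

# Present every input vector of the corpus to the network one by one and save
# the hidden activations as a new vector (one 'snapshot' per input vector).
inputs = rng.integers(0, 2, size=(12, 8)).astype(float)
hidden_snapshots = sigmoid(inputs @ w_hidden)

# Hierarchical cluster analysis of the snapshots by Euclidean distance.
distances = pdist(hidden_snapshots, metric='euclidean')
tree = linkage(distances, method='average')

# scipy.cluster.hierarchy.dendrogram(tree) would draw the result as a dendrogram;
# here the merge history is simply printed.
print(tree)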

Figure 8.6: Example of a Dendrogram

The horizontal distances between the items and the first node on the dendrogram indicate the similarity, or compactness, within each cluster. The horizontal distances between nodes on the graph represent the differences between clusters, or their distinctness.

The interpretation of the results of the analysis, and of the graphic representations of such results, becomes the basis for claims about the network and its dynamics. It should be clear that cluster analysis cannot reveal anything that is not already contained in the raw data of the model.

Nevertheless, the analysis may reveal details and relationships in the dataset that were not known to exist beforehand.

Thornton considers that the dendrograms obtained through cluster analysis could be thought of “in the manner of decision trees produced by symbolist mechanism” (Thornton, 1999, 181). This view does not capture the essence of either of the entities in question, because dendrograms are essentially static presentations of elements of the network structure, while a decision tree is about the flow of information and the possible transformation of data according to some logic. The confusions and misconceptions about the nature and principles of cluster analyses indicate that there is potential for errors when evaluating the significance of data obtained by models. Misinterpretations will also affect the validity and strength of subsequent claims about models with artificial neural nets.

8.3 Summary

Higher cognitive functions have been modelled using artificial neural nets. In the examples, which are representative of most models of this class in the literature, I have shown that these models can be plausible at a theoretical or conceptual level. However, it seems that the translation of the theoretical models into computational models using artificial neural nets brings out some fundamental conceptual problems. Neural nets are very limited in what they can do, so the models must generally be restricted to small ontologies. Then, and only then, is it possible to build a neural net that can successfully support a high-level cognitive PDP model. I have argued that some neural net models, e.g. Elman (1990), fail when the model is scaled up or when the data contains ambiguities.

The success of many neural net models is dependent on carefully constructed sets of data for the training of the networks. While it is possible to obtain what seem to be convincing results, the interpretations and claims hold only for the dataset in question. The claims that are based on statistical analysis (curve fitting, surface fitting, and clustering) are then extrapolated to the theoretical (hypothetical) PDP model at the cognitive level (e.g. Churchland (1998) and Rogers and McClelland (2004)). In the next chapter I discuss whether the apparent conceptual differences that exist between the levels of modelling, namely (1) the conceptual high-level cognitive model and (2) the computational data model based on artificial neural nets, can be bridged.

Chapter 9

Conclusion

The danger [. . . ] comes when we switch from the indefinite to the definite article: when we stop talking of ‘a’ model, metaphor, or analogy, and sneak in the term ‘the’. Then a metaphor or model - something which presupposes, which can catalyse, a theory - is about to grow into a theory itself [. . . ] ‘mind (in some respects) like a computer’ becomes ‘the mind as a computer’; problems for which the model is useless are denied, ignored, or shelved; the methodology suggested by the model becomes ‘the’ methodology of the discipline (Wilkes, 1990, 63).

So far I have looked at a range of different topics, and I have highlighted some of the issues that shape the design, the implementation, and the interpretation of computational models and simulations. In this conclusion I will bring these various ideas together and provide an analysis of what artificial neural nets may offer, and what they cannot offer, as vehicles for models in Cognitive Science. The analysis begins with a short statement by Rumelhart et al. (1986) that captures what many workers in the field believe neural nets provide, or at least promise. I have shown that the computational principles not only govern, but also limit, artificial networks in their role as experimental tools in Cognitive Science, and I argue that the role and the ‘suggested’ capabilities of neural nets are often explained in terms of a cognitive theory. This discussion of what part neural nets can play in the framework of Parallel Distributed Processing (PDP) is followed by an investigation of a possible source of the confusion about the power of neural nets. Of particular interest is the question of whether artificial neural nets are suitable for providing neurologically plausible implementations of PDP models.


9.1 Neural Nets and PDP

Parallel Distributed Processing (PDP), which is almost synonymous with the term connectionism, has been offered as an alternative way of thinking about human cognition. PDP also provides new ideas on how cognitive functions might be implemented in the brain. Because of the obvious difficulties in accessing anything mental in brains, cognitive scientists are largely forced to ‘make do’ with models. It is claimed by many followers of connectionism that PDP models are particularly appealing because of their “neurological plausibility and neural inspiration” (McClelland et al., 1986, 11) and that

they hold out the hope of offering computationally sufficient and psychologically accurate mechanistic accounts of the phenomena of human cognition which have eluded successful explication in conventional computational formalisms; and they have radically altered the way we think about the time-course of processing, the nature of representation, and the mechanisms of learning (McClelland et al., 1986, 11).

This quote captures the generally positive mood among the protagonists of connectionist and PDP modelling. It should be noted that these words were written some 20 years ago. There are several issues to be considered in this quotation regarding what models based on artificial neural nets can contribute to the PDP hypothesis. As a way of tying the various chapters of this thesis together, I will comment in turn on some of the phrases, claims, and assertions made. I will leave the question of whether models with neural nets are neurologically plausible to last, because all of the other points raised contribute to the answer.

9.1.1 Neurological Inspiration

Unlike the biological neural models, which are mainly based on the work by Hodgkin and Huxley (1952)[1], functional neural models after Rosenblatt (1958) have only superficial similarities with real neuronal cells. Although very little is known about neural coding, it should be clear that the sum and squash processing in the functional neural model is an over-simplification.

[1] There are of course many others who have contributed in this area of work, but I will restrict credits in this conclusion to the most influential ideas and cite only core publications.

The neurological inspiration of PDP as a cognitive architecture is not in question here, but a neurologically plausible connection between living cells and

$u = \sigma\left(\sum_{j=1}^{m} w_j\, i_j\right),$

where σ is a squashing function, is difficult to establish and equally difficult to defend.
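A minimal sketch of the functional ‘sum and squash’ unit referred to here; the weights, the inputs, and the choice of the logistic sigmoid as squashing function are illustrative assumptions.

import math

def squash(x):
    # One common choice of squashing function: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs):
    # u = squash of the weighted sum of the m inputs.
    return squash(sum(w * i for w, i in zip(weights, inputs)))

print(unit_output([0.4, -0.7, 1.2], [1.0, 0.5, 0.25]))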

9.1.2 Holding out Hope

PDP offers new opportunities for exploring old questions from a different viewpoint. The debate whether PDP is a workable (defensible) alternative as a cognitive architecture is far from over. PDP models have been in use for some 20 years now, and recently Elman claimed that “connectionist models stand out as offering some of the most exciting (and also controversial) insights into cognitive development” (Elman, 2005, 111). It is interesting to note that the modelling techniques have not changed significantly; in fact many of the recent models have remained unchanged over this 20-year period, but new high-level interpretations are offered for very similar architectures. For example, Rogers and McClelland (2004) have presented a re-implementation of the networks designed earlier (see Rumelhart and McClelland (1986a), McClelland and Rumelhart (1986)). Similarly, Kello et al. have used network-based models in 2005 which, by their own admission, “were similar in many ways to those used in other distributed models of word reading” (Kello et al., 2005, 634). They have built their models largely on those presented by Rumelhart and McClelland (1986a). When we consider that the raw computational capacity of computers has increased by three orders of magnitude[2], we have to ask why the number of neurons in models has not increased at the same rate - it has not increased at all. The stasis in the size of networks supports my argument: there is no need for large networks because small networks already produce the pre-ordained requirements for the task at hand.

[2] The clock speed of personal computers in 1982 was 4.7 MHz compared to 2.5 GHz in 2005. There are other significant improvements relating to the computational power, e.g. pipelining, which allows for a partial overlap of command execution during a single CPU cycle, and dedicated co-processors, which perform floating-point arithmetic.

I have argued that artificial neural nets are universal solutions, because a very simple network under the ‘right’ interpretation can implement almost any function (see section 6.2 on page 140). Intensive training with an unsuitable ratio between data points and connections may lead to over-fitting, so that a ‘successful’ model is often the result of trial and error. Moving the model from a micro-world example (micro ontology) to a real-world example, with a more realistic ontology, may break the model, because the complexities of the problem can no longer be captured in the (deterministic) statistical behavior of the network (see section 6.2.2 on page 144). “Models that might work in single domains might not easily scale up to more complex situations [. . . ] So building more complex models also means solving the scaling problem” (Elman, 2005, 115). This, however, seems to be a very difficult problem to solve; in fact it takes us back to the kind of problems that have been raised in the context of the possibility of an AI from its very beginnings (e.g. Dreyfus (1979) and Dennett (1990)). Moreover, it seems that an increase in the complexity of the model offers no additional insights regarding the underlying theory about how and why this particular network model functions in the way it does. McCloskey remarked that even a network simulation with more than 1,000 units and more than 250,000 weighted connections cannot “remedy the vagueness in the verbal formulation of [Seidenberg and McClelland’s[3]] theory” (McCloskey, 1991, 1137).

[3] Seidenberg and McClelland (1989).
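A sketch of the bookkeeping behind such numbers: counting the weighted connections of a fully connected layered network and comparing that count with the number of training patterns. The layer sizes and corpus size below are invented for illustration and are not taken from the cited models.

def connection_count(layer_sizes, biases=True):
    # Weighted connections of a fully connected feed-forward network,
    # optionally counting one bias weight for every non-input node.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    if biases:
        weights += sum(layer_sizes[1:])
    return weights

layers = [31, 150, 31]       # hypothetical input-hidden-output sizes
patterns = 500               # hypothetical number of training patterns

params = connection_count(layers)
print(params)                # 31*150 + 150*31 + 150 + 31 = 9481 adjustable weights
print(params / patterns)     # many adjustable weights per training pattern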

9.1.3 Computational Sufficiency

Artificial neural nets[4] are very powerful mathematical tools. Their ability to learn functions that are contained in datasets is a very convenient way to create a ‘calculator’ or predictor for almost any function f. Artificial neural nets can implement (or approximate) quite complex functions that may not be possible to calculate otherwise. After the learning process, artificial neural nets are also very effective in predicting values for f in terms of performance (speed) (Hoffmann, 1998, 157). The fact that neural nets can even be used to implement functions that are unknown to the experimenter makes them very attractive tools for modellers in various disciplines.

[4] I will restrict my comments to feed-forward and simple recurrent networks. These two architectures seem to be the most commonly used for implementing high-level cognitive functions.

The ability to provide good predictions for f after a network is trained is not just dependent on the neural network structure. While the appropriateness of a network’s architecture and the selection of a suitable number of nodes and connections are very important for the successful implementation of a model, the datasets that are available for training and the parameters that control the training seem to be of even greater importance.

I have argued that neural nets in the configurations used by Elman (1990), and also in many of the models in Rogers and McClelland (2004), Shultz (2003), and McClelland and Rumelhart (1986), perform little more than curve (and surface) fitting, and that even a subsequent analysis of the network’s internal states reveals very little beyond what other statistical methods could offer. While it takes a considerable amount of computational effort to train a network, a trained neural network can almost be considered a static entity. Unlike the hundreds or thousands of epochs of adjusting weights using forward and backward passes during training, a trained network needs only to compute a single forward propagation of the input vector to produce an output. It still takes some computational resources at the implementation level, but at the network level the computational effort is almost trivial, at least for small feed-forward and simple recurrent networks.

Artificial networks for the models in question perform statistical analysis on carefully constructed sets of data from small ontologies. Statistical analysis is limited in what it can do, so it should not be surprising that Elman’s (1990) network “was not particularly successful predicting the next word in the input sequence, even after considerable training” (McLeod et al., 1998, 197).
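To make concrete what ‘curve fitting’ refers to here, a minimal least-squares example follows; the data are invented, and the sketch is not a reconstruction of any of the cited models, only an illustration that fitting and then predicting can be done without any network at all.

import numpy as np

# A handful of noisy input-output pairs (invented values).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0.1, 0.9, 1.8, 2.2, 3.1, 3.3, 4.2])

# Ordinary least-squares fit of a straight line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)

# The fitted line now 'predicts' outputs for new inputs, much as a trained
# network predicts output vectors for new input vectors.
print(a, b)
print(a * 1.25 + b)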

Neural Nets as Mathematical Tools

Considering the mathematical nature of artificial neural nets, there is no reason to think of trained neural nets in any other way than as pieces of mathematics with respect to their status as an empirical tool. A function Netout = f(Netin) is no different from other functions such as f(x) = sin(x). However, there are some points that need to be considered as to how the function is acquired by the network. A function that is implemented using an artificial neural net is the result of a gradual adaptation toward a global minimum of error. The internal weights are modified during training so that the overall error between the set of output vectors (explanatory variable) and the set of target vectors (response variable) is minimized. Several factors, such as the learning rate, the (random) values of the weights before training, and the particulars of the learning algorithm, together with the experience of the modeller, determine whether the network will indeed find the optimal solution.

5Provided that such a solution exists. 6‘Domain × range’ is the cross product of the function’s domain and range. 7The limitation of perceptrons is due to the fact that the solution space can only be divided by that regression line. The solution space of some functions, such as XOR,can not be divided by a line such that all solutions are on one side of the line and non- solutions are on the other. These functions are not linearly separable. 9.1. NEURAL NETS AND PDP 199 output nodes can only assume values of ‘0’ and ‘1’ (effectively encoding conditions such as true or false, yes or no). The training set of pairs of input and output vectors already contains all there is to the model in terms of implementing f, and the artificial neural net does not add anything that could not be extracted from the training sets through other mathematical or computational methods, albeit in less elegant or neurologically inspired fashion.


Figure 9.1: Insights emanating from the network

Analysis of Network Structures

Cluster analysis is a statistical method for obtaining information about the structure of data, and clustering provides a measure of the similarity of data items within a dataset in particular. Cluster analysis is usually used in models with neural nets to find out how the training data is encoded in the weights of the network connections. The activations of the hidden nodes in response to the input vectors are generally captured for this analysis (see page 188). It is important to note that cluster analysis does not reveal much about the implemented function f; rather, it is a statistical analysis of the network’s internal activation patterns. Extracting the function f itself is a different task altogether. Cluster analysis is a ‘secondary analysis’, because it introduces a new set of data extracted from the model itself. However, this data is not available within the neural net during the simulation of the model.

The “hot spots” in the network proposed by Churchland (1998), and the classification of words into edibles, nouns, humans, etc., claimed by Elman (1990), are only interpretations by an observer who is external to the neural net. None of this information is available in any form at the output nodes of the simulated neural network. Instead, the new insights seem to condense out of the network’s structure (figure 9.1). The analysis is clearly performed by the modeller’s neurons with the aid of a statistical procedure, and not by the model’s neurons.

If the network is meant to be a model of what might happen at the neural level, then the question arises what mechanism could be responsible for the equivalent analysis of activation patterns in the brain. In order to make this information accessible to the rest of the brain, we would have to assume another neural circuit that performs such an analysis. This new addition to the network could categorize words into verbs and nouns, but then we would need another circuit if we wish to categorize words according to their etymology, and yet another to categorize them into mono-syllabic and multi-syllabic words. Essentially, an infinite number of circuits would be required, or, alternatively, circuits would have to be of a generic nature - this would allow for circuits to be called recursively. While such mechanisms cannot be entirely dismissed, there is little evidence for their existence. Some recurrent circuits and feedback mechanisms in the brain have been suggested recently (see Levy and Krebs (2006)).

I do not find this strategy of having individual specialized circuits just for classification very convincing, considering that we would already need a very large number of neural circuits just for the analysis of word categories in the 31-word ontology of Elman’s (1990) model. McLeod et al. (1998) comment on Elman’s model that

The [cluster] analyses are very impressive. They demonstrate that a recurrent network is able to discover something which corresponds to a word’s grammatical category simply by attempting to predict the next word in sequences in which the words appears. It does not need to be told ahead of time anything about nouns or verbs. This suggests that the apparently intractable problem of how a child can get grammar off the ground without knowing about grammatical categories in advance may not be quite so intractable as it seems (McLeod et al., 1998, 200).

I described this model in some detail (see section 8.2.2 on page 178), and I have argued that the data, the training process, and the analysis are ‘constructed’ carefully to produce a model that appears to support the theory. The results of the cluster analysis did not reveal anything new at all. The network is unable to “discover” anything beyond what is contained in the dataset and the subsequent coding. For example, the network could not make any discoveries about word lengths, because the length of the words is not coded. Recall that each ‘word’ is a single bit in the input vectors of 31 elements. The “grammatical categories” are nothing but the relative frequency of some input vectors following others, because these sequences of ‘words’ that make up the ‘sentences’ are all the network can discover. Networks are mathematical entities, and any results must turn out the way they do, because “it is logically not possible that [the results] could have turned out other than they did” (Green, 2001, 109).
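The coding point can be illustrated with a sketch of the localist (‘one bit per word’) scheme; the vector length of 31 follows the description above, but the particular words and the assignment of bits to words are hypothetical.

import numpy as np

VOCAB = 31                     # one bit per word, as in the 31-word ontology

def one_hot(index):
    v = np.zeros(VOCAB)
    v[index] = 1.0
    return v

# Hypothetical assignment of bits to words.
cat = one_hot(0)
dog = one_hot(1)
crocodile = one_hot(2)

# Every pair of distinct words is exactly the same distance apart, so nothing
# about word length, spelling, or sound is available to the network.
print(np.linalg.norm(cat - dog), np.linalg.norm(cat - crocodile))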

Norman (1986) points out that it would be necessary to have a great number of individual PDP models to potentially model an entire brain, given that each one “can only settle into a single state at the time” (Norman, 1986, 543). But I do not think that we have reached the point where we should even contemplate the number of PDP models that would be necessary to model a brain.

if Inp1 = 0 and Inp2 = 1 then Out = 1
else if Inp1 = 1 and Inp2 = 0 then Out = 1
else if Inp1 = 0 and Inp2 = 0 then Out = 0
else if Inp1 = 1 and Inp2 = 1 then Out = 0

Figure 9.2: Definition of the XOR function

Alternatives

Data that describe the solution space completely (exhaustively) lend themselves to be modelled using either logic programming or conventional programming techniques. The XOR function can be implemented, albeit awkwardly[8], in the form of a series of statements as in figure 9.2.

[8] In some programming environments, the function can simply be expressed as XOR(x, y) = (x ≠ y).

Functions that contain hierarchical and membership relations (is a and has a) are particularly well suited to be modelled in a logic programming environment such as Prolog (see section 4.2 on page 95). ‘Symbol and rule’ based systems can conveniently and intuitively model systematic relations of the kind that have been suggested as indispensable for human thought by Fodor and Pylyshyn (1988). Neural nets do not perform well in models that need to capture temporal events. Although some recurrent networks can take previous experience into account, the temporal distance of prior events that can be taken into account is relatively short. The network of Elman (1990) performed poorly in the word prediction task, because there is no long-term memory available in the model.

In this experiment the current input is merged with the output of the hidden nodes of the previous input. The current output is the combination of the current input $I_t$ and the previous states $I_{t-1}/2$, $I_{t-2}/4$, $I_{t-3}/8$, and so on[9]. The result is that the contribution of past word ‘memory’ decays exponentially with the temporal distance. In order to increase the memory capacity we can either alter the ratio of feedback[10] or we can introduce further sets of neurons to build a shift register[11]. The idea is to use a series of groups of hidden nodes into which we copy and store some of the internal states of the network. In every cycle the states of one group are shifted to the next and the ‘oldest’ memory is aged out. These methods are of importance to network theorists and some work in AI, but the simple neural networks for the PDP models introduced by Elman, Churchland and many others do not have the necessary elements, i.e. memory, to make these networks truly dynamic. The particular network architectures that can capture temporal (sequential) events provide some ways to implement functions that are oriented toward symbol-based processing. However, artificial neural nets do not provide easy solutions for some of the trivial requirements to give PDP models some systematicity of the kind Fodor and Pylyshyn (1988) are asking for.

[9] I assume that the feedback from the hidden nodes and the inputs is not modified and contributes equally. Any non-linear effects that may be introduced by the squashing function, or other effects due to the learning algorithm, are also ignored.
[10] A network with a suitable architecture where the amount of feedback is variable has recently been suggested by Levy and Krebs (2006). They describe a network where an additional ‘control’ input varies the ratio of input and feedback from the hidden nodes. Levy and Krebs (2006) suggest further that comparable circuits and processes may be present in human brains and may be responsible for certain behaviours.
[11] Haykin (1999) describes this methodology and the mathematical basis for tapped delay lines, which is another term for the same kind of function.
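Under the simplifying assumptions of footnote [9] (input and fed-back state contribute equally, and the squashing function is ignored), the decay can be sketched as follows; the one-dimensional inputs are arbitrary stand-ins for input vectors.

# Under the assumptions of footnote [9], the state fed back is the average of
# the current input and the previous state, so the combined signal at time t is
#   I_t + I_{t-1}/2 + I_{t-2}/4 + I_{t-3}/8 + ...
inputs = [1.0, 0.0, 0.0, 0.0, 0.0]     # a single 'word' followed by silence

context = 0.0                           # state fed back from the previous step
for t, i_t in enumerate(inputs):
    combined = i_t + context            # what the hidden layer 'sees' at time t
    context = combined / 2.0            # stored for the next time step
    print(t, combined)                  # the trace of the first input halves at every step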

9.1.4 Psychological Accuracy

The question of whether the mind is modular is different from the question of whether the brain is modular. The varied anatomical structures of brains are immediately evident once we remove the brain from the skull and take it apart. Functional structures in the brain are far less obvious. Trying to connect (map) psychological functions with particular brain areas using PET and fMRI is a contentious exercise (Uttal, 2005, 2001). Whether the decomposition of mental processes is possible is still debated (see for example Bechtel (2002, 2005)). In the context of artificial neural nets the issue is often simply ‘resolved’ by switching to a different level of description. It may be surprising for some newcomers to the field of Cognitive Science to learn that neural nets in most cognitive models are not necessarily composed of neurons. Elman et al. (1998) offer as “a note of caution” that

. . . [m]ost modelers who study higher-level cognitive processes tend to view the nodes in their models as equivalent not to single neurons but to larger populations of cells. The nodes in these models are functional units rather than anatomical units (Elman et al., 1998, 91).

The introduction of functional groups adds further to the confusion about the levels of description. I believe accepting this shift in scale to be wrong, because I do not think this scaling up is justifiable unless an explanation is provided for how the functional units might work. These explanations are not provided in any of the models that I have examined. To assign concepts and “possible hypotheses” (Rumelhart and McClelland, 1986a, 10) to units is one thing, but to assume that relatively simple statistical methods using a neural net can offer a coherent and plausible model is another.

Conventional mathematics [. . . ] is usually inadequate to describe anything more detailed than either the overall behavior of an organism or the action of a highly simplified neural network composed of a relatively small number of simulated neurons. Such highly reduced models are often described in the same context as more complex neural networks. On closer analysis, however, they turn out to be conceptually identical to the block or modular diagrams favored by earlier generations of cognitive psychologists. Worst of all, although they appear in the guise of networks of neurons, they may actually represent something quite different (Uttal, 2005, 197).

Churchland’s proposed network model for moral virtues (Churchland, 1998) (see also section 8.2.3 on page 185) is a case in point for Uttal’s suggestion that at least some PDP models are akin to ‘boxology’. Dennett (2001) holds a very similar view about how computer programs are interpreted and says that the model

[. . . ] is the interpretation of data structures in computer programs, and the effect of such user-friendly interpretations [. . . ] is that they direct the user/interpreter’s attention away from the grubby details of performance by providing a somewhat distorted (and hyped up) sense of what the computer “understands”. Computer programmers know enough not to devote labor to rendering the internal representations of their products “precise” because they appreciate that these are mnemonic labels, not specifications of content that can be used the way a chemist uses formulae to describe molecules. By missing this trick, philosophers have created fantasy worlds of propositional activities marshaled to accomplish reference, recognition, expectation-generation, and so forth (Dennett, 2001, 134).

This sentiment about the interpretation of computer programs also applies to models and simulations based on artificial neural nets. Many of the semantic attributes which are suggested to be inherent in neural nets by some cognitive modellers, e.g. Churchland (1998), are not discussed in the textbooks that are used in Computer Science and the AI community.

9.1.5 Parallel Processing

Neural nets are not massively parallel computers if we understand computing as ‘step by step’ information processing. Turing machines (TM) are symbol processors, and physical realizations of a TM are causally controlled by physically encoded equivalents of symbols. Semantically interpretable representations exist in the electronic computing machine in a physical guise as charges in a memory location, as magnetized iron-oxide particles on hard drives, or as a voltage spike in a wire. Computation in this classic sense is a manipulation of symbols in a formal manner over time, in a step by step fashion. Haugeland (1981) described computing machines as automated formal systems. The functionality of the brain cannot be implemented as a TM because of the ‘100 step rule’, which stipulates that calculating some cognitive function cannot take more than 100 steps to ‘complete’. When the average speed of a neuron is considered, serial computing in the brain would just be too slow. Fodor and Pylyshyn (1988) point out that the idea that the brain implements cognitive functions in the way we would implement them on a serial computing machine is “absurd”.

The individual model neuron’s ‘sum and squash’ function does not implement a cognitive function. Unless there is a (very) large number of interconnected units, parallel distributed processing does not occur. On an abstract, conceptual or theoretical level PDP models may well work, because of the ‘functional groups’ that can be called upon and any functionality can be attributed to them. But functional groups cannot be simulated convincingly on artificial neural nets without being able to define psychological (cognitive) functions that lend themselves to being implemented by statistical means using a few hundred simple model neurons.
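A back-of-the-envelope version of the ‘100 step rule’ argument, assuming a few milliseconds per neural ‘step’ and a few hundred milliseconds for a simple cognitive act; the exact figures are illustrative assumptions rather than values taken from the text.

# Rough serial-step budget behind the '100 step rule'.
neuron_time = 0.005     # assumed time per neural 'step' in seconds (a few milliseconds)
task_time = 0.5         # assumed time for a simple recognition task in seconds

print(task_time / neuron_time)   # roughly 100 sequential steps are all a serial process could take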

9.1.6 Distributed Representation

Neural networks compute in a different way than a Turing Machine (TM) does. The physical equivalent of a symbol in a Turing Machine can causally control the behaviour of a computer program on a real computer.

But of course the programmer’s representational system is the kind of cognitive capacity connectionists are trying to model, so it won’t do to presuppose such capacities in the modeling (Harnish, 2002, 389).

Artificial neural nets are strictly Dretskian Type I representational systems when they are used in cognitive models. Networks, parts of networks such as context units[12], individual nodes, or functional groups of neurons can represent whatever we like them to represent. McClelland et al. (1986) suggest that units can represent theories, goals, letters, words, and the “syntactic roles of the words” (McClelland et al., 1986, 10). While the distributed character of representations is proposed in PDP (e.g. by Smolensky (1986)), the models are interpreted in terms of localized representations.

[12] A term used by Elman (1990) describing the group of nodes within a simple recurrent neural network that forms the temporary storage for feed-back.

In doing so, artificial neural nets and their internal representations support the traditional symbol-based theories as much as they support distributed parallel processing. Using localized representations also enables the modeller to address the requirements of systematicity that Fodor and Pylyshyn (1988) asked for, albeit in a limited way. Elman’s (1990) experiments showed that neural nets can at least in principle capture some element of serial processing (i.e. the effects of time).

9.1.7 Learning

Learning is a cognitive function that has been the subject of PDP models. The very idea of learning by adjusting the strengths of connections is what makes artificial neural nets so interesting. Learning algorithms for artificial neural nets are designed to implement functions in artificial neural nets in a way that is inspired by the work of Hebb (1949). There is, however, no evidence that our brains use the delta rule or the back propagation algorithm for learning. Nevertheless, the mechanisms of neural network learning have also been applied to cognitive models. The particular methods that are used to train neural nets, and particularly the rate at which a neural net learns, may help in the understanding of human learning. That, at least, is the claim made by some modellers (e.g. Elman et al. (1998); Shultz (2003); McLeod et al. (1998)). Elman (2005), for example, comments that the

greater plasticity in ‘younger’ networks allowed them to recover from artificial lesions to a much greater extent than when the lesions occurred later in learning. The implication here is that even when the learning mechanism remains constant, the cumulative effect of learning can result in entrenchment, and this reduces the plasticity of the system (Elman, 2005, 112)[13].

[13] This and similar statements are made as a commentary on the work by Marchman (1993), Rumelhart and McClelland (1986b), Munakata and McClelland (2003), and others.

There might just be other ways to explain these behaviours in the neural net. Is it not possible to argue that “entrenchment” is the desired result of learning, or that the sole purpose of teaching a network is to ‘entrench’ some function f?


We should not be surprised that a partially trained network recovers from “artificial lesions” better than a network that is at a later state of learning. Artificial lesioning is essentially a change in the network’s topology (or architecture), and it has the effect of shifting the global minimum of error between outputs and target values (i.e. the optimal solution) to a different location in (the weight vector) space. The more training epochs that are allowed to occur after this shift, the better the approximation toward the new minimum will be. This is exactly what ‘gradient descent back propagation algorithms’ are designed to do. In fact, the network’s learning behaviour in response to artificial lesioning could have been predicted by reasoning alone. We can see that not only the behaviour of a trained neural net can be interpreted as a model for some cognitive function or theory, but also the learning process itself can be misrepresented as a model for a very large number of psychological phenomena.

9.2 Bridging the Explanatory Gap

Theories in cognitive science fall generally into two distinct categories. Some theories are offered as explanations of aspects of human cognition in terms of what brains do, and the implementation at the biological neural level is usually of little concern. This top-down approach deals with high-level cognitive functions, and the brain is often viewed as a single black box, or a collection of black boxes with certain functional properties. One could argue that these theories are all about psychological phenomena and that the physical brain is incidental in the context of this approach. The bottom-up approach, in contrast, deals with base elements and processes at the level of implementation, i.e. neurons. Cognitive Science is also concerned with how some of the ‘functional boxes’ might be implemented in brains. Because very little is understood about what the low-level psychological functions might be, and how they could be implemented on biological neurons, network architectures and methodologies that are borrowed from AI contribute largely in this endeavor.

There is currently much effort aimed at establishing connections between the top-down and bottom-up paradigms. One such program concerns the mapping or localizing of cognitive functions in the brain. Modern technologies such as PET and fMRI are commonly used for that purpose, although many technical and conceptual issues remain unresolved. Other attempts to bridge the divide between the two paradigms involve computational models with artificial neural nets, which aim to explain how higher cognitive functions could possibly be supported by the PDP architecture. In order to achieve this, descriptive elements from different levels are brought together in an attempt to present unified and coherent computational models and simulations of cognitive processes. The following sentence contains terms from Ethics, Psychology, Linear Algebra, and Neuroscience.

Following this simple model, the suggestion here advanced is that our capacity for moral discrimination also resides in an intricately configured matrix of synaptic connections, which connections also partition an abstract conceptual space, at some proprietary neuronal layer of the human brain, into a hierarchical set of categories, categories such as “morally significant” vs. “morally nonsignificant” actions [. . . ] (Churchland, 1998, 81).

This example shows how the use of terminology disguises much of the detail in the model. All of the interpretations and the use of terms like “proprietary neuronal layers” (hidden units), “hierarchical set of categories” (clustering), “intricately configured matrix” (back-propagation), and “synaptic connections” (connection) are boldly extrapolated from a simple recurrent network.

Levels

It is not always possible from the descriptions offered by the modellers to determine what the model actually models (Green, 2001). Green argues that this vagueness contributes to a model’s ‘apparent success’. The problem here is that the modellers often fail to explain how the artificial neural network in question could actually implement a cognitive function. Many of the ground-breaking models, like FINDING STRUCTURE IN TIME (Elman, 1990) and ON LEARNING THE PAST TENSE OF ENGLISH VERBS (Rumelhart and McClelland, 1986b), have been criticized for this omission (e.g. McCloskey (1991)). The table in figure 9.3 by Smolensky (1986) clearly shows how ‘loosely’ defined the mappings of the mathematical equivalents in neural nets (in the middle column) to the conceptual and to the neural are.

Neural                                   | Mathematical                   | Conceptual
neurons                                  | units                          | hypotheses
spiking frequency                        | activation                     | degree of confidence
speed of spread of depolarization        | propagation of activation      | inference
synaptic contact                         | connection                     | conceptual-inferential inter-relations
excitation/inhibition                    | positive/negative weight       | positive/negative inferential relations
approximate additivity of depolarization | summation of inputs            | approximate additivity of evidence
spiking thresholds                       | activation spread threshold G  | independence from irrelevant information
limited dynamic range                    | sigmoidal function F           | limited range of processing strength

Figure 9.3: Levels of Description (Smolensky, 1986)

Functional neural models are certainly a simplification of what real biological neurons do, but the relationships between the mathematical entities in artificial neural nets and the ‘conceptual’, as suggested by Smolensky, are very difficult to reconcile with the ‘neural’. A unit can represent anything if we follow Smolensky’s suggestion[14] that a unit in a neural network can stand for a hypothesis (Smolensky, 1986), and if we also allow that a unit may represent a ‘functional group’ of neurons. The difficulty for the modeller is that the actual units in the artificial neural net and the learning algorithms cannot implement any vague concepts, because neural nets are limited by the mathematics that govern and limit their design and operation. The consequence is that many models only ‘work’ at a higher interpretative level of description.

I suggest that a more structured approach (figure 9.4) might help to disentangle the levels of description; it also suggests that there is a need to introduce meaningful labels in order to describe a model that is based on an artificial neural net.

[14] Also McClelland et al. (1986, 10).

The column on the left shows the decomposition of the brain into smaller constituents, while the right-hand column shows the decomposition of the mind into functions and operations. The diagram shows that there is no ‘conceptual’ equivalent at the level of neurons and units. Generally, workers in the field accept that single neurons do not perform any cognitive functions; instead, the implementation of such functions is distributed over many thousands of neurons (see section 1.3 on page 10)[15]. Recently it has been shown that mental tasks cause much more widespread brain activity than previously thought (see for example Uttal (2001, 2005)). This has been the result of experiments using the techniques that I have illustrated in chapter 7. Primitive mental operations, whatever they may turn out to be, could thus be ‘implemented’ on very large groups of neurons.

[15] This assumption had already been suggested in Rosenblatt (1962).

Biology  | Models        | Cognition
Brain    |               | Mind / Cognitive Architecture
Area     | Symbol System | Cognitive Function
Cluster  | Neural Net    | Operation
Neuron   | Neuron (Unit) |

Figure 9.4: Levels of Description (alt.)

Clusters or groups of neurons are shown on the same level as neural nets. Although I have indicated that we should consider basic mental operations to belong at this level, we cannot specify these operations. Consequently, neural nets cannot be used as models for such operations unless it becomes possible to provide a framework that could enable us not only to define mental operations, but also to offer a plausible account for their implementation.

This implementation needs to be plausible both in groups of real neurons and under simulation in an artificial neural network. There is, however, no reasonable definition of primitive mental functions, although a much better understanding of such primitive operations would be necessary to provide a mechanistic explanation of mental processes (Bechtel, 2005). The central box ‘Symbol System’ / ‘Neural Net’ indicates that ‘cognitive functions’, ‘Mind’, and ‘Cognitive Architecture’ rely largely on symbolic descriptions, as I have discussed in section 6.2.1 on page 142.

Models with neural nets are used to model cognitive functions, which are shown by evidence from PET and fMRI imaging to involve relatively large and dispersed areas of the brain. The activation patterns in brains certainly go beyond mere clusters of neurons. There are no choices for the PDP modeller other than to assign cognitive functions to ‘units’ that can be defined, unlike the low-level operations.

Neural nets are used as tools to bridge the divide between neurons or clusters of neurons and cognitive functions. However, the models perform this task not in a layered fashion as Smolensky suggests (figure 9.3). Instead, the models elevate the functionality of biological neurons to what they ‘ought to’ be able to do. Neurons, clusters of neurons, and areas of brains are now connected with cognitive functions and minds. As models connect biological neurons with Psychology, a lot of the detail and ‘exactness’ of how this might be achieved is lost.

The mathematical nature of neural nets and their predictable behaviour need an interpretative shell, which is labeled ‘Symbol System’ in figure 9.4. This shell is necessary because it enables us to use meaningful labels and descriptions for processes in a language or context that we can understand. The descriptions that are offered do not necessarily make sense merely because they are couched in the language of some (cognitive) theory. Models that are used to make bold claims about cognitive functions may not bridge the explanatory gap if the model is built on a small neural net with a generic feed-forward or simple recurrent architecture. Models of high-level cognitive functions and minds can be successfully accommodated in the PDP framework, but attempts to validate the model by simulation of an actual neural net with a small set of data may not add to our understanding or even to our suspicions. In many cases, the details of the experiment may just highlight the inadequacies in the data, the simplicity of the analysis, and the overly optimistic interpretations.

9.3 Neurological Plausibility

I have argued that neurological possibility can be demonstrated for almost any conceivable psychological theory, due to the universality of simple artificial neural nets (see section 6.2 on page 140 and also Krebs (2005)). Not only does the vagueness of the training data obtained from experiments in Psychology (O Nualláin, 2002) make it very difficult to keep a clear and realistic view of what a model can do, but the descriptions and interpretations that are borrowed from a “conceptual environment that is sufficiently vague” (Green, 2001) shift the attention away from verifying what the data actually support. Instead, claims are often shrouded in wishful (theory-laden) labels for units, and the descriptions of processes and relations that are offered sometimes ‘surprise’ even the modeller. A closer look reveals that the overly positive evaluations are based on little more than statistical analysis. In particular, the very simple models of cognitive functions do not provide any support for them to be considered neurologically plausible. If the use of artificial neural nets as models for cognitive functions does not extend past statistical analysis, then other means of data analysis might be better suited. The suitability of generic (non-structured) networks as tools for building models of complex learning has been questioned before (Feldman, 1993), but it seems that in the search for ever more intriguing cognitive theories only very little attention is paid to developing the tool-kit.

9.4 Closing Remarks

I have investigated various aspects of cognitive modelling that have had a direct bearing on the success of models and simulations in Cognitive Science. The mathematical foundations of neuron models and neural networks, the training algorithms, and their architectures limit what an artificial neural net can do. The validity of claims and interpretations in the context of these cognitive models can often only be judged against evidence that is itself controversial at times. The material in chapter seven shows that the relationships between physical neurons and psychological functions are very difficult to establish. When we consider the strength of the evidence from Psychology, of evidence gained by interpreting fMRI images, and the ‘plausible’ frameworks that have been proposed for particular theories of mind, it becomes apparent that much of what is deduced from models does not stand on solid ground. Their apparent success can largely be attributed to the (sometimes unwarranted) extension of findings that really only hold for the limited domains the networks and the training data are actually designed for. The mistake, I think, is to bring the top-down psychological model and the bottom-up neural environment together and to treat the resultant model as a coherent and meaningful demonstration of the validity of some theory. I believe that artificial neural nets can be used as cognitive models successfully, provided clear descriptions of the aims, assumptions and claims are presented. When simple artificial neural nets with small numbers of nodes are employed to model complex high-level cognitive functions, the experimenter should evaluate whether the simplicity of the network can really offer a plausible solution. The results from experiments with neural nets should not be generalized, particularly if the (training) data relates only to quite simple ontologies. Unlike the model itself, it is only the interpretations and inferences that seem to scale up successfully. I have shown that it is relatively easy to design and implement models that fit into the PDP framework, particularly when we consider modern computer programs that allow designing and operating neural networks using graphical user interfaces. These interfaces remove many of the difficulties and issues relating to Computer Science and Artificial Intelligence. The modern software allows for sophisticated and elaborate experiments to be conducted in Cognitive Science - yet it remains very difficult to show that such models are neurologically plausible.

Bibliography

Bibliography

Agre, P. E. (1997). Computation and Human Experience. Cambridge UP, Cambridge.

Anderson, A. R. (1964). Minds and Machines. Prentice-Hall.

Araya, A. A. (2003). The hidden side of visualization. Technè, 7(2):27–93.

Bailer-Jones, D. M. (2002). Models, Metaphors and Analogies, pages 108–127. Blackwell, Malden, Massachusetts.

Baluja, S. (1996). Evolution of an artificial neural network based autonomous land vehicle controller. IEEE Transactions on Systems, Man, and Cybernetics, 26(3):450–463.

Bandettini, P. A. and Ungerleider, L. G. (2001). From neuron to BOLD: new connections. Nature neuroscience, 4(9):864–866.

Bates, E. and Elman, J. (1993). Connectionism and the Study of Change, pages 623–642. In Johnson (1993).

Bear, M. F., Connors, B. W., and Paradiso, M. A. (1996). Neuroscience: Exploring the Brain. Williams & Wilkins, Baltimore, Maryland.

Bechtel, W. (2000). From Imaging to Believing, pages 138–163. In Creath and Maienschein (2000).

Bechtel, W. (2002). Decomposing the mind-brain: A long-term pursuit. Brain and Mind, 3:229–242.

Bechtel, W. (2005). Mental mechanisms: What are the operations? In Bara, B., Barsalou, L., and Bucciarelli, M., editors, Proceedings of the 27th annual meeting of the Cognitive Science Society, pages 208–213, Mahwah, New Jersey. Cognitive Science Society, Lawrence Erlbaum Associates.

Bechtel, W. and Graham, G. (1998). A Companion to Cognitive Science. Blackwell, Malden.

Belliveau, J. W., Kennedy, D. N., McKinstry, R. C., Buchbinder, B. R., Weisskopf, R. M., Cohen, M. S., Vevea, J. M., Brady, T. J., and Rosen, B. R. (1991). Functional mapping of the human visual cortex by magnetic resonance imaging. Science, 254(5032):716–719.

Benacerraf, P. (1967). God, The Devil and Gödel. The Monist, 51:9–32.

Bennett, M. R. and Hacker, P. M. S. (2003). Philosophical Foundations of Neuroscience. Blackwell, Malden.

Berkeley, I. S. N. (1998). Connectionism reconsidered: Minds, machines and models. Technical report, Phil. and Cog. Sci., University of Louisiana at Lafayette, http://cogprints.org/1975/ (11/10/05).

Boden, M. A. (1990). The Philosophy of Artificial Intelligence. Oxford University Press, Oxford.

Branquinho, J. (2001). The foundations of cognitive science. Oxford UP, Oxford.

Broadbent, D. (1993). The Simulation of Human Intelligence. Blackwell, Oxford.

Burton, R. G. (1993). Natural and Artificial Minds. State University of New York Press, Albany, NY.

Carroll, L. (1996). The complete illustrated Lewis Carroll. Wordsworth Editions, Ware, England.

Carter, R. (1998). Mapping the Mind. Phoenix.

Casey, B. J., Davidson, M., and Rosen, B. (2002). Functional magnetic resonance imaging: basic principles of and application to developmental science. Developmental Science, 5(3):301–309.

Chalmers, D. J. (1995). Minds, machines, and mathematics. Psyche, 2(9).

Churchland, P. M. (1990). Cognitive Activity in Artificial Neural Networks, chapter 12, pages 198–216. In Cummins and Delarosa Cummins (2000).

Churchland, P. M. (1993). On the Nature of Theories: A Neurocomputational Perspective, pages 21–67. In Burton (1993).

Churchland, P. M. (1998). Toward a Cognitive Neurobiology of the Moral Virtues, pages 77–108. In Branquinho (2001).

Churchland, P. M. (1999). Learning and Conceptual Change: The View from the Neurons, pages 7–43. In Clark and Millican (1999).

Churchland, P. S. and Sejnowski, T. J. (1992). The Computational Brain. MIT Press, Cambridge Massachusetts.

Clark, A. (2001). Mindware: An Introduction to the Philosophy of Cognitive Science. Oxford UP, Oxford.

Clark, A. and Millican, P. (1999). Connectionism, Concepts, and Folk Psychology. The Legacy of Alan Turing. Oxford UP, Oxford.

Cooney, B. (2000). The Place of Mind. Wadsworth.

Copeland, J. (1993). Artificial Intelligence: A Philosophical Introduction. Blackwell, Malden.

Copeland, J. (2004). The Essential Turing. Clarendon Press, Oxford.

Copeland, J. and Sylvan, R. (1999). Beyond the universal Turing machine. Australasian Journal of Philosophy, 77:46–66.

Copi, I. M. (1979). Symbolic Logic. Macmillan, New York.

Copi, I. M. and Cohen, C. (1994). Introduction to Logic. Macmillan, Englewood Cliffs, N.J.

Creath, R. and Maienschein, J. (2000). Biology and Epistemology. Cambridge UP.

Crick, F. and Asanuma, C. (1986). Certain Aspects of the Anatomy and Physiology of the Cerebral Cortex, chapter 20, pages 333–371. In McClelland and Rumelhart (1986).

Cummins, R. and Delarosa Cummins, D. (2000). Minds, Brains, and Computers: The Foundations of Cognitive Science. Blackwell, Malden.

Davis, M. (1958). Computability and Unsolvability. McGraw-Hill, New York.

Davis, M. (1965). The Undecidable: Basic Papers On Undecidable Propositions, Unsolvable Problems And Computable Functions. Raven Press, Hewlett, New York.

Dawson, M. R. W. (2004). Minds and Machines: Connectionism and Psychological Modeling. Blackwell.

Dayan, P. and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, Massachusetts.

Dennett, D. (1990). Cognitive Wheels: The Framing Problem in AI, pages 147–170. In Boden (1990).

Dennett, D. (1991). Consciousness Explained. Little Brown, Boston.

Dennett, D. C. (2001). Things about Things, pages 133–143. In Branquinho (2001).

Diorio, C. and Rao, R. P. N. (2000). Neural circuits in silicon. Nature, 405:891–992.

Dretske, F. (1981). Knowledge & the Flow of Information. MIT Press, Cambridge, Massachusetts.

Dretske, F. (1988). Representational Systems, chapter 15, pages 304–331. In O’Connor and Robb (2003).

Dretske, F. (1994). If you can’t make one, then you don’t know how it works, chapter 21, pages 306–317. In Cooney (2000).

Dreyfus, H. L. (1979). What Computers Still Can’t Do. MIT Press, Cambridge, Massachusetts.

Eberbach, E., Goldin, D., and Wegner, P. (2004). Turing’s Ideas and Models of Computation. In Teuscher (2004).

Elman, J. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48:71–99.

Elman, J. (1998). Generalization, simple recurrent networks, and the emergence of structure. In Gernsbacher and Derry (1998).

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14:179–211.

Elman, J. L. (2005). Connectionist models of cognitive development: where next? Trends in Cognitive Sciences, 9(3):111–117.

Elman, J. L., Bates, E. A., Karmiloff-Smith, A., Parisi, D., and Plunkett, K. (1998). Rethinking Innateness: A Connectionist Perspective on Development. MIT Press, Cambridge, Massachusetts.

Feldman, J. A. (1993). Structured connectionist models and language learning. Artificial Intelligence Review, 7:301–312.

Feldman, J. A. and Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6:205–254.

Feng, C.-M., Narayana, S., Lancaster, J. L., Jerabeck, P. A., Arnow, T. L., Zhu, F., Tan, L. H., Fox, P. T., and Gao, J.-H. (2004). CBF changes during brain activation: fMRI vs. PET. NeuroImage, 22:443–446.

Fetzer, J. H. (1988). Aspects of Artificial Intelligence. Kluwer, Dordrecht.

Feynman, R. (1996). Feynman Lectures on Computation. Penguin.

Fodor, J. (1998). Connectionism and the Problem of Systematicity (Continued): Why Smolensky’s Solution Still Doesn’t Work, chapter 34, pages 1117–1128. In Polk and Seifert (2002).

Fodor, J. (2005). What we still don’t know about cognition. In Bara, B., Barsalou, L., and Bucciarelli, M., editors, Proceedings of the 27th annual meeting of the Cognitive Science Society, page 2, Mahwah, New Jersey. Cognitive Science Society, Lawrence Erlbaum Associates.

Fodor, J. and Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71.

Fox Keller, E. (2003). Models, Simulation, and "Computer Experiments", chapter 10, pages 198–215. In Radder (2003).

Fraleigh, J. B. and Beauregard, R. A. (1995). Linear Algebra. Addison-Wesley.

Frese, G. and Engels, H. (2003). Magnetic resonance imaging (MRI) electromagnetic fields (EMF). Technical report, University of Nottingham.

Fulford, G., Forrester, P., and Jones, A. (1997). Modelling with Differential and Difference Equations. Cambridge UP, Cambridge.

Garfield, J. L. (1990). Foundations of Cognitive Science: The Essential Readings. Paragon, New York.

Gernsbacher, M. A. and Derry, S. (1998). Proceedings of the 20th Conference of the Cognitive Science Society. Lawrence Erlbaum, Mahwah, New Jersey.

Gerstner, W. and Kistler, W. (2002). Spiking Neuron Models. Cambridge UP, Cambridge.

Glymour, C. (1988). Artificial Intelligence is Philosophy, pages 195–207. In Fetzer (1988).

Gödel, K. (1931). On Formally Undecidable Propositions Of Principia Mathematica and Related Systems, pages 4–38. In Davis (1965).

Gooding, D. (2003). Varying the Cognitive Span: Experimentation, Visualization, and Computation, chapter 13, pages 255–283. In Radder (2003).

Graubard, S. R. (1988). The Artificial Intelligence Debate: False Starts, Real Foundations. MIT Press, Cambridge, Massachusetts.

Green, C. D. (2001). Scientific models, connectionist networks, and cognitive science. Theory & Psychology, 11(1):97–117.

Hacking, I. (1983). Representing and Intervening. Cambridge UP, Cambridge.

Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., and Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951.

Hare, M., Elman, J., and Daugherty, K. G. (1995). Default generalization in connectionist networks. Language and Cognitive Processes, 10(6):601–630.

Harnad, S. (1990). The symbol grounding problem. Physica D, 42:335–346.

Harnad, S. (1995). What is computation (and is cognition that)? Minds and Machines, 4:377–378.

Harnish, R. M. (2002). Minds, Brains, Computers: A Historical Introduction to the Foundations of Cognitive Science. Blackwell, Malden, Massachusetts.

Harré, R. (1970). The Principles of Scientific Thinking. University of Chicago Press, Chicago.

Harré, R. (2003). The Materiality of Instruments in a Metaphysics for Experiments, pages 19–59. In Radder (2003).

Haugeland, J. (1981). Semantic Engines: An Introduction to Mind Design, pages 34–50. In Cummins and Delarosa Cummins (2000).

Haugeland, J. (1985). Artificial Intelligence: the Very Idea. MIT Press, Cambridge, Massachusetts.

Haugeland, J. (1997). Mind Design II. MIT Press, Cambridge, Massachusetts.

Haykin, S. (1999). Neural Nets: A Comprehensive Foundation. Prentice-Hall, Upper Saddle River, New Jersey.

Hebb, D. O. (1949). The Organization of Behavior, chapter 19, pages 323–332. In Cummins and Delarosa Cummins (2000).

Hilbert, D. (1902). Mathematical problems: Lecture delivered before the International Congress of Mathematicians at Paris in 1900. Bulletin of the American Mathematical Society, 8:437–479.

Hinton, G. E. (1992). How Neural Nets Learn from Experience, pages 181–195. In Polk and Seifert (2002).

Hobbes, T. (1651). Leviathan. Everyman’s Library (1914).

Hodgkin, A., Huxley, A., and Katz, B. (1952). Measurement of the voltage-current relation in the membrane of the giant axon of Loligo. Journal of Physiology, 116:424–448.

Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117:500–544.

Hoffmann, A. (1993). Komplexität einer künstlichen Intelligenz. PhD thesis, Kommunikations- und Geschichtswissenschaft, Technische Universität Berlin.

Hoffmann, A. (1998). Paradigms of Artificial Intelligence. Springer.

Hume, D. (1990, 1777). Enquiries concerning Human understanding and concerning the principles of morals. Clarendon Press, Oxford.

Ince, D. C. (1992). Collected Works of A. M. Turing: Mechanical Intelligence. Elsevier (North-Holland), Amsterdam.

Johnson, M. (1993). Brain Development and Cognition: A Reader. Blackwell, Oxford.

Jordan, M. I. (1986). Serial order: A parallel distributed approach. Tech. Rep. 8604, University of California, San Diego.

Jorion, P. (1999). What do mathematicians teach us about the world? an anthropological perspective. Dialectical Anthropology, 24(1):45–98 (repr. 1–25).

Kearns, J. T. (1997). Thinking machines: Some fundamental confusions. Minds and Machines, 7:269–287.

Kello, C. T., Sibley, D. E., and Plaut, D. C. (2005). Dissociations in performance on novel versus irregular items: Single-route demonstrations with input gain in localist and distributed models. Cognitive Science, 25:627–654.

Koch, C. and Segev, I. (2000). The role of single neurons in information processing. Nature (neuroscience supplement), 3:1171–1177.

Krebs, P. R. (2002). Turing machines, computers, and artificial intelli- gence. Master’s thesis, History and Philosophy of Science, UNSW.

Krebs, P. R. (2005). Models of cognition: Neural possibility does not indicate neural plausibility. In Bara, B., Barsalou, L., and Bucciarelli, M., editors, Proceedings of the 27th annual meeting of the Cognitive Science Society, pages 1184–1189. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Krebs, P. R. (2007). Smoke Without Fire: What do virtual experiments in cognitive science really tell us? In Srinivasan et al. (2007).

Kurzweil, R. (1990). The Age of Intelligent Machines. MIT Press, Cambridge, Massachusetts.

Lay, D. C. (1997). Linear Algebra and its Applications. Addison-Wesley.

Lenat, D. B. and Guha, R. V. (1990). Building Large Knowledge-based Systems. Addison-Wesley, Reading, Massachusetts.

Levy, F. and Krebs, P. R. (2006). Cortical-subcortical re-entrant circuits and recurrent behaviour. Australian and New Zealand Journal of Psychiatry, 40:752–758.

Lloyd, D. (1995). Consciousness: A connectionist manifesto. Minds and Machines, 5:161–185.

Locke, J. (1993, 1690). An essay concerning Human understanding. J. M. Dent, London.

Logothetis, N. K., Pauls, J., Augath, M., Trinath, T., and Oeltermann, A. (2001). Neurophysiological investigation of the basis of the fMRI signal. Nature, 412:150–157.

Lucas, J. R. (1961). Minds, Machines and Gödel, pages 43–59. In Anderson (1964).

Lucas, J. R. (1970). The Freedom of the Will. Clarendon Press, Oxford.

Lytton, W. W. (2002). From Computer to Brain: Foundations of Computational Neuroscience. Springer.

Marchman, V. (1993). Constraints on plasticity in a connectionist model of the English past tense. Cognitive Neuroscience, 5:215–234.

Martindale, D. (2005). One face, one neuron: Storing Halle Berry in a single brain cell. Scientific American, 293(4):10–11.

McClelland, J. and Rumelhart, D. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. MIT Press, Cambridge, Massachusetts.

McClelland, J., Rumelhart, D., and Hinton, G. E. (1986). The Appeal of Parallel Distributed Processing, chapter 1, pages 3–44. In Rumelhart and McClelland (1986a).

McClelland, J. L. (2005). Organization and emergence of semantic knowledge: A parallel-distributed processing approach. In Bara, B., Barsalou, L., and Bucciarelli, M., editors, Proceedings of the 27th annual meeting of the Cognitive Science Society, page 3, Mahwah, New Jersey. Cognitive Science Society, Lawrence Erlbaum Associates.

McCloskey, M. (1991). Networks and Theories: The Place of Connectionism in Cognitive Science, pages 1121–1146. In Polk and Seifert (2002).

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity, pages 351–366. In Cummins and Delarosa Cummins (2000).

McDermott, D. (2001). Mind and Mechanism. MIT Press, Cambridge, Massachusetts.

McLeod, P., Plunkett, K., and Rolls, E. T. (1998). Introduction to Connectionist Modelling of Cognitive Processes. Oxford UP, Oxford.

Mellor, D. (1989). How much of the mind is a computer?, pages 47–69. In Slezak and Albury (1989).

Merton, R. K. (1942). The Normative Structure of Science, pages 267–278. In Storer (1973).

Midgley, M. (2004). Do we ever really act?, chapter 1, pages 18–33. In Rees and Rose (2004).

Millican, P. and Clark, A. (1999). Machines and Thought. The Legacy of Alan Turing. Oxford UP, Oxford.

Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press, Cambridge, Massachusetts.

Mohyeldin Said, K. A., Newton-Smith, W. H., Viale, R., and Wilkes, K. V. (1990). Modelling the Mind. Clarendon Press, Oxford.

Moore, D. S. and McCabe, G. P. (1993). Introduction to the Practice of Statistics. W. H. Freeman, New York.

Morgan, M. S. (2003). Experiments without Material Intervention, chapter 11, pages 216–235. In Radder (2003).

Munakata, Y. and McClelland, J. L. (2003). Connectionist models of development. Developmental Science, 6:413–429.

Neto, J. P., Siegelmann, H. T., Costa, J. F., and Suárez Araujo, C. P. (1997). Turing universality of neural nets (revisited). Lecture Notes in Computer Science, 1333:361–366.

Newell, A. (1990). Unified Theories of Cognition. Harvard UP, Cambridge.

Newell, A. and Simon, H. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19:113–126.

Newell, A. and Simon, H. A. (1963). GPS, a Program that Simulates Human Thought, chapter 6, pages 84–94. In Cummins and Delarosa Cummins (2000).

Noble, D. (1990). Biological Explanation and Intentional Behaviour, chapter 7, pages 97–112. In Mohyeldin Said et al. (1990).

Norman, D. A. (1986). Reflections on Cognition and Parallel Distributed Processing, chapter 26, pages 531–552.

O Nualláin, S. (2002). The Search For Mind. Intellect Books, Bristol, UK.

O’Connor, T. and Robb, D. (2003). Philosophy of Mind: Contemporary Readings. Routledge.

O’Reilly, R. C. and Munakata, Y. (2000). Computational Explorations in Cognitive Neuroscience: Understanding the Mind by Simulating the Brain. MIT Press, Cambridge, Massachusetts.

Papert, S. (1988). One AI or Many?, pages 1–14. In Graubard (1988).

Partridge, D. and Wilks, Y. (1990). The Foundations of Artificial Intelligence: A Source Book. Cambridge UP, Cambridge.

Pavlik, P. I. and Anderson, J. R. (2005). Practice and forgetting effects on vocabulary memory: An activation-based model of the spacing effect. Cognitive Science, 29:559–586.

Penrose, R. (1990). The Emperor’s New Mind. Vintage, London.

Penrose, R. (1993). Setting the Scene: the Claim and the Issues, pages 1–33. In Broadbent (1993).

Peschl, M. E. and Scheutz, M. (2001). Explicating the epistemological role of simulation in the development of theories of cognition. Proceedings of the seventh colloquium on Cognitive Science ICCS-01, pages 274–280.

Pitowsky, I. (1990). The physical Church Turing Thesis and physical computational complexity. Iyuun, 39:81–99.

Poggio, T. (1990). Vision: The ’Other’ Face of AI, chapter 9, pages 139– 154. In Mohyeldin Said et al. (1990).

Polk, T. A. and Seifert, C. M. (2002). Cognitive Modeling. MIT Press, Cambridge, Massachusetts.

Pomerleau, D. (1995). Neural Network Vision for Robot Driving. MIT Press, Cambridge, Massachusetts.

Popper, K. R. (1959). The Logic of Scientific Discovery. Routledge, London.

Psillos, S. (1999). Scientific Realism: How science tracks truth. Routledge, London.

Putnam, H. (1975). Mind, Language and Reality. Cambridge UP, Cambridge.

Putnam, H. (1988). Much Ado About Not Very Much, pages 269–282. In Graubard (1988).

Pylyshyn, Z. W. (1984). Computation and Cognition: Toward a Foundation of Cognitive Science. MIT Press, Cambridge, Massachusetts.

Pylyshyn, Z. W. (1990). Computation and Cognition: Issues in the Foundations of Cognitive Science, pages 18–74. In Garfield (1990).

Quinn, M. J. (2005). Ethics for the Information Age. Pearson, Boston, MA.

Radder, H. (2003). The Philosophy of Scientific Experimentation. University of Pittsburgh Press, Pittsburgh.

Raichle, M. E. (2001). Bold insights. Nature, 412:128–130.

Rees, D. and Rose, S. (2004). The new Brain Sciences: Perils and Prospects. Cambridge UP, Cambridge.

Rogers, T. T. and McClelland, J. L. (2004). Semantic Cognition: A Parallel Distributed Processing Approach. MIT Press, Cambridge, Massachusetts.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.

Rosenblatt, F. (1959). Two theorems of statistical separability in the perceptron. Proceedings of the Symposium on the Mechanization of Thought, pages 421–456.

Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, Washing- ton.

Rumelhart, D. and McClelland, J. (1986a). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, Cambridge, Massachusetts.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Internal Representations by Error Propagation, chapter 8, pages 318–362. In Rumelhart and McClelland (1986a).

Rumelhart, D. E. and McClelland, J. L. (1986b). On Learning the Past Tense of English Verbs, chapter 18, pages 216–268. In McClelland and Rumelhart (1986).

Russell, B. (1946). History of Western Philosophy. Allen & Unwin, London.

Russell, B. (1948). Human Knowledge: Its Scope and Limits. Allen & Unwin, London.

Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, New Jersey.

Salas, S. and Hille, G. J. (1995). Salas and Hille’s Calculus: One and several variables. John Wiley, Toronto, Canada, 7th edition.

Schach, S. R. (2002). Object-Oriented and Classical Software Engineering. McGraw Hill, New York.

Schneider, S. (2001). The B-method. Palgrave, New York.

Searle, J. (1980). Minds, Brains, and Programs, chapter 3, pages 140–152. In Cummins and Delarosa Cummins (2000).

Searle, J. (1990). Is the brain a digital computer? Proc. & Addr. of the American Philosophical Association, 64:21–37.

Seidenberg, M. and McClelland, J. (1989). A distributed model of word recognition and naming. Psychological Review, 96:523–568.

Sejnowski, T. J. (1986). Open Questions About Computation in Cerebral Cortex, chapter 21, pages 372–389. In Rumelhart and McClelland (1986a). BIBLIOGRAPHY 227

Sejnowski, T. J. and Rosenberg, C. R. (1987). Parallel networks that learn to produce English text. Complex Systems, 1:145–168.

Shagrir, O. (1999). What is computer science about? The Monist, 82(1):131–149.

Shagrir, O. (2001). Content, computation and externalism. Mind, 110(438):367–400.

Shannon, C. E. and Weaver, W. (1975). The Mathematical Theory of Communication. University of Illinois Press, Urbana.

Shastri, L. and Ajjanagadde, V. (1993). From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16(3):417–494.

Shepherd, G. M. (1991). Foundations of the neuron doctrine. Oxford UP, Oxford.

Shultz, T. R. (2003). Computational Developmental Psychology. MIT Press.

Siegelmann, H. T. and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80.

Silberschatz, A. and Galvin, P. B. (1999). Operating System Concepts. John Wiley, New York.

Simon, H. (1995). Machine as Mind, chapter 5, pages 81–102.

Sipser, M. (1997). Introduction to the Theory of Computation. PWS Publishing, Boston.

Slezak, P. and Albury, W. R. (1989). Computers, Brains and Minds: Essays in Cognitive Science. Kluwer, Dordrecht.

Sloman, A. (1996). Beyond Turing Equivalence, pages 179–219. In Millican and Clark (1999).

Smolensky, P. (1986). Neural and Conceptual Interpretations of PDP Models, chapter 22, pages 390–431. In McClelland and Rumelhart (1986).

Smolensky, P. (1990). Connectionism and the Foundations of AI. In Partridge and Wilks (1990).

Smolensky, P. (1997). Connectionist Modeling: Neural Computation / Mental Connections. In Haugeland (1997).

Spivey, M. (1996). An Introduction to Logic Programming through Prolog. Prentice Hall.

Srinivasan, N., Gupta, A. K., and Pandey, J. (2007). Advances in Cognitive Science. Sage, Delhi, India (in press).

Sterelny, K. (1989). Computational Functional Psychology: Problems and Prospects, pages 71–93. In Slezak and Albury (1989).

Stillings, N. A., Weisler, S. E., Chase, C. H., Feinstein, M. H., Garfield, J. L., and Rissland, E. L. (1995). Cognitive Science: An Introduction. MIT Press, Cambridge, Massachusetts.

Stone, G. O. (1986). An Analysis of the Delta Rule and the Learning of Statistical Associations, chapter 11, pages 444–459. In Rumelhart and McClelland (1986a).

Storer, N. W. (1973). The Sociology of Science. University of Chicago Press, Chicago.

Stufflebeam, R. S. (1998). Representation and Computation, pages 636–648. In Bechtel and Graham (1998).

Tegmark, M. (2000). Importance of quantum decoherence in brain processes. Physical Review, 61:4194–4206.

Teuscher, C. (2004). Alan Turing: Life and Legacy of a Great Thinker. Springer.

Thornton, C. J. (1999). Why Concept Learning is a Good Idea. In Clark and Millican (1999).

Townsend, D. W. (2004). Physical principles and technology of clinical PET imaging. Annals Academy of Medicine, Singapore, 33(2):133–145.

Turing, A. M. (1936). On Computable Numbers, with an Application to the Entscheidungsproblem, pages 115–153. In Davis (1965).

Turing, A. M. (1947). Lecture on the Automatic Computing Engine, pages 378–394. In Copeland (2004).

Turing, A. M. (1948). Intelligent Machinery, pages 107–127. In Ince (1992).

Turing, A. M. (1950). Computing Machinery and Intelligence, pages 153–167. In Cummins and Delarosa Cummins (2000).

Uttal, W. R. (2001). The New Phrenology: The Limits of Localizing Processes in the Brain. MIT Press, Cambridge, Massachusetts.

Uttal, W. R. (2005). Neural Theories of Mind. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Wartofsky, M. W. (1979). Models: Representation and the Scientific Understanding. D. Reidel, Dordrecht.

Wegner, P. (1997). Why interaction is more powerful than algorithms. Communications of the ACM, 40:80–91.

Weizenbaum, J. (1976). Computer Power and Human Reason. Freeman & Co., San Francisco.

Wells, A. J. (1996). Situated action, symbol systems and universal computation. Minds and Machines, 6:33–46.

Wells, A. J. (1998). Turing’s analysis of computation and theories of cognitive architecture. Cognitive Science, 22(3):269–294.

Whiteley, C. H. (1962). Minds, Machines and Gödel: A Reply to Mr Lu- cas. J. Royal Inst. Phil, 37:61–72.

Wilkes, K. V. (1990). Modelling the Mind, chapter 5, pages 63–82. In Mohyeldin Said et al. (1990).

Williams, M. R. (1997). A History of Computing Technology. IEEE Press, Los Alamitos.

Winograd, T. (1973). A Procedural Model of Language Understanding, pages 95–113. In Cummins and Delarosa Cummins (2000).

Winograd, T. and Flores, F. (1986). Understanding Computers and Cognition: A New Foundation for Design. Ablex, Norwood.

Würtz, R. P. (1998). Neural networks as a model for visual perception: What is lacking? Cognitive Systems, 7(2).

Ziman, J. (2000). Real Science: What it is, and what it means. Cambridge UP, Cambridge.

List of Figures

2.1 Simple network diagram ...... 30

4.1 CG Image of neurons ...... 84
4.2 MRI scan ...... 86

5.1 Circuit diagram of squid axon ...... 108
5.2 Spike of a neuron ...... 110
5.3 AI neuron example ...... 114
5.4 Activation functions ...... 115

6.1 AND/OR-functions ...... 127
6.2 XOR-function ...... 128
6.3 Local and global minimum ...... 134
6.4 Feed-forward Network ...... 135
6.5 Simple Recurrent Network (SRN) ...... 135
6.6 Weights and Connections ...... 136
6.7 Weight Matrix for figure 6.6 ...... 137
6.8 Matlab Code Example ...... 139

7.1 X-ray image of a broken bone ...... 153
7.2 fMRI image from 1991 (Belliveau et al., 1991) ...... 158

8.1 Error versus naming latency ...... 176
8.2 Sentence Patterns ...... 180
8.3 Word Classes ...... 180
8.4 Training Data ...... 182
8.5 Points in 3-space ...... 187
8.6 Example of a Dendrogram ...... 190

9.1 Insights emanating from the network ...... 199
9.2 Definition of the XOR function ...... 201
9.3 Levels of Description (Smolensky, 1986) ...... 209
9.4 Levels of Description (alt.) ...... 210
