

Part II

COGNITIVE MODELING PARADIGMS 

The chapters in Part II introduce the reader to broadly influential and foundational approaches to computational cognitive modeling. Each of these chapters describes in detail one particular approach and provides examples of its use in computational cognitive modeling.



CHAPTER 2

Connectionist Models of Cognition

Michael S. C. Thomas and James L. McClelland

1. Introduction

In this chapter, computer models of cognition that have focused on the use of neural networks are reviewed. These architectures were inspired by research into how computation works in the brain, and subsequent work has produced models of cognition with a distinctive flavor. Processing is characterized by patterns of activation across simple processing units connected together into complex networks. Knowledge is stored in the strength of the connections between units. It is for this reason that this approach to understanding cognition has gained the name of connectionism.

2. Background

Over the last twenty years, connectionist modeling has formed an influential approach to the computational study of cognition. It is distinguished by its appeal to principles of neural computation to inspire the primitives that are included in its cognitive level models. Also known as artificial neural network (ANN) or parallel distributed processing (PDP) models, connectionism has been applied to a diverse range of cognitive abilities, including models of memory, attention, perception, action, language, concept formation, and reasoning (see, e.g., Houghton, 2005). Although many of these models seek to capture adult function, connectionism places an emphasis on learning internal representations. This has led to an increasing focus on developmental phenomena and the origins of knowledge. Although, at its heart, connectionism comprises a set of computational formalisms, it has spurred vigorous theoretical debate regarding the nature of cognition. Some theorists have reacted by dismissing connectionism as mere implementation of preexisting verbal theories of cognition, whereas others have viewed it as a candidate to replace the Classical Computational Theory of Mind and as carrying profound implications for the way human knowledge is acquired and represented; still others have viewed connectionism as a subclass of statistical models involved in universal function approximation and data clustering.



This chapter begins by placing connectionism in its historical context, leading up to its formalization in Rumelhart and McClelland's two-volume Parallel Distributed Processing (1986), written in combination with members of the Parallel Distributed Processing Research Group. Then, three important early models that illustrate some of the key properties of connectionist systems are discussed, as well as how the novel theoretical contributions of these models arose from their key computational properties. These three models are the Interactive Activation model of letter recognition (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982), Rumelhart and McClelland's (1986) model of the acquisition of the English past tense, and Elman's (1991) simple recurrent network for finding structure in time. Finally, the chapter considers how twenty-five years of connectionist modeling has influenced wider theories of cognition.

2.1. Historical Context

Connectionist models draw inspiration from the notion that the information-processing properties of neural systems should influence our theories of cognition. The possible role of neurons in generating the mind was first considered not long after the existence of the nerve cell was accepted in the latter half of the nineteenth century (Aizawa, 2004). Early neural network theorizing can therefore be found in some of the associationist theories of mental processes prevalent at the time (e.g., Freud, 1895; James, 1890; Meynert, 1884; Spencer, 1872). However, this line of theorizing was quelled when Lashley presented data appearing to show that the performance of the brain degraded gracefully depending only on the quantity of damage. This argued against the specific involvement of neurons in particular cognitive processes (see, e.g., Lashley, 1929).

In the 1930s and 1940s, there was a resurgence of interest in using mathematical techniques to characterize the behavior of networks of nerve cells (e.g., Rashevsky, 1935). This culminated in the work of McCulloch and Pitts (1943), who characterized the function of simple networks of binary threshold neurons in terms of logical operations. In his 1949 book The Organization of Behavior, Donald Hebb proposed a cell assembly theory of cognition, including the idea that specific synaptic changes might underlie psychological principles of learning. A decade later, Rosenblatt (1958, 1962) formulated a learning rule for two-layered neural networks, demonstrating mathematically that the perceptron convergence rule could adjust the weights connecting an input layer and an output layer of simple neurons to allow the network to associate arbitrary binary patterns. With this rule, learning converged on the set of connection values necessary to acquire any two-layer-computable function relating a set of input-output patterns. Unfortunately, Minsky and Papert (1969, 1988) demonstrated that the set of two-layer computable functions was somewhat limited – that is, these simple artificial neural networks were not particularly powerful devices. While more computationally powerful networks could be described, there was no algorithm to learn the connection weights of these systems. Such networks required the postulation of additional internal, or "hidden," processing units, which could adopt intermediate representational states in the mapping between input and output patterns. An algorithm (backpropagation) able to learn these states was discovered independently several times. A key paper by Rumelhart, Hinton, and Williams (1986) demonstrated the usefulness of networks trained using backpropagation for addressing key computational and cognitive challenges facing neural networks.

In the 1970s, serial processing and the Von Neumann computer metaphor dominated cognitive psychology. Nevertheless, a number of researchers continued to work on the computational properties of neural systems. Some of the key themes identified by these researchers included the role of competition in processing and learning (e.g., Grossberg, 1976; Kohonen, 1984), the properties of distributed representations (e.g., Anderson, 1977; Hinton & Anderson, 1981), and the possibility of content-addressable memory in networks with attractor states, formalized using the mathematics of statistical physics (Hopfield, 1982).



Figure 2.1. A simplified schematic showing the historical evolution of neural network architectures. Simple binary networks (McCulloch & Pitts, 1943) are followed by two-layer feedforward networks (perceptrons; Rosenblatt, 1958). Three subtypes then emerge: three-layer feedforward networks (Rumelhart & McClelland, 1986), competitive or self-organizing networks (e.g., Grossberg, 1976; Kohonen, 1984), and interactive networks (Hopfield, 1982; Hinton & Sejnowski, 1986). Adaptive interactive networks have precursors in detector theories of perception (Logogen: Morton, 1969; Pandemonium: Selfridge, 1959) and in handwired interactive models (interactive activation: McClelland & Rumelhart, 1981; interactive activation and competition: McClelland, 1981; stereopsis: Marr & Poggio, 1976; Necker cube: Feldman, 1981; Rumelhart et al., 1986). Feedforward pattern associators have produced multiple subtypes: for capturing temporally extended activation states, cascade networks in which states monotonically asymptote (e.g., Cohen, Dunbar, & McClelland, 1990), and attractor networks in which states cycle into stable configurations (e.g., Plaut & McClelland, 1993); for processing sequential information, recurrent networks (Jordan, 1986; Elman, 1991); and for systems that alter their structure as part of learning, constructivist networks (e.g., cascade correlation: Fahlman & Lebiere, 1990; Shultz, 2003). SRN = simple recurrent network.

A fuller characterization of the many historical influences in the development of connectionism can be found in Rumelhart and McClelland (1986, Chapter 1), Bechtel and Abrahamsen (1991), McLeod, Plunkett, and Rolls (1998), and O'Reilly and Munakata (2000). Figure 2.1 depicts a selective schematic of this history and demonstrates the multiple types of neural network systems that have latterly come to be used in building models of cognition. Although diverse, they are unified on the one hand by the proposal that cognition comprises processes of constraint satisfaction, energy minimization, and pattern recognition, and on the other hand that adaptive processes construct the microstructure of these systems, primarily by adjusting the strengths of connections among the neuron-like processing units involved in a computation.


2.2. Key Properties of Connectionist Models

Connectionism starts with the following inspiration from neural systems: Computations will be carried out by a set of simple processing units operating in parallel and affecting each other's activation states via a network of weighted connections. Rumelhart, Hinton, and McClelland (1986) identified seven key features that would define a general framework for connectionist processing.

The first feature is the set of processing units u_i. In a cognitive model, these may be intended to represent individual concepts (such as letters or words), or they may simply be abstract elements over which meaningful patterns can be defined. Processing units are often distinguished into input, output, and hidden units. In associative networks, input and output units have states that are defined by the task being modeled (at least during training), whereas hidden units are free parameters whose states may be determined as necessary by the learning algorithm.

The second feature is a state of activation (a) at a given time (t). The state of a set of units is usually represented by a vector of real numbers a(t). These may be binary or continuous numbers, bounded or unbounded. A frequent assumption is that the activation level of simple processing units will vary continuously between the values 0 and 1.

The third feature is a pattern of connectivity. The strength of the connection between any two units will determine the extent to which the activation state of one unit can affect the activation state of another unit at a subsequent time point. The strength of the connections between unit i and unit j can be represented by a matrix W of weight values w_ij. Multiple matrices may be specified for a given network if there are connections of different types. For example, one matrix may specify excitatory connections between units and a second may specify inhibitory connections. Potentially, the weight matrix allows every unit to be connected to every other unit in the network. Typically, units are arranged into layers (e.g., input, hidden, output), and layers of units are fully connected to each other. For example, in a three-layer feedforward architecture where activation passes in a single direction from input to output, the input layer would be fully connected to the hidden layer and the hidden layer would be fully connected to the output layer.

The fourth feature is a rule for propagating activation states throughout the network. This rule takes the vector a(t) of output values for the processing units sending activation and combines it with the connectivity matrix W to produce a summed or net input into each receiving unit. The net input to a receiving unit is produced by multiplying the vector and matrix together, so that

$$\mathrm{net}_i = [\mathbf{W}\,\mathbf{a}(t)]_i = \sum_j w_{ij}\, a_j \qquad (2.1)$$

The fifth feature is an activation rule to specify how the net inputs to a given unit are combined to produce its new activation state. The function F derives the new activation state

$$a_i(t+1) = F(\mathrm{net}_i(t)) \qquad (2.2)$$

For example, F might be a threshold so that the unit becomes active only if the net input exceeds a given value. Other possibilities include linear, Gaussian, and sigmoid functions, depending on the network type.
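To make Equations 2.1 and 2.2 concrete, here is a minimal sketch in Python (with NumPy) of one step of activation propagation through a layer of sigmoid units. The code is illustrative rather than drawn from the chapter; the layer sizes and weight values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(net):
    """A common choice for the activation function F (see Equation 2.2)."""
    return 1.0 / (1.0 + np.exp(-net))

def propagate(W, a):
    """One step of activation propagation.

    W : weight matrix, with W[i, j] = w_ij, the connection from sending
        unit j to receiving unit i.
    a : vector of sending-unit activations a(t).
    """
    net = W @ a           # net_i = sum_j w_ij * a_j      (Equation 2.1)
    return sigmoid(net)   # a_i(t + 1) = F(net_i(t))      (Equation 2.2)

# Example: three sending units fully connected to two receiving units.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(2, 3))   # arbitrary illustrative weights
print(propagate(W, np.array([1.0, 0.0, 1.0])))
```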


Sigmoid is perhaps the most common, operating as a smoothed threshold function that is also differentiable. It is often important that the activation function be differentiable because learning seeks to improve a performance metric that is assessed via the activation state, whereas learning itself can only operate on the connection weights. The effect of weight changes on the performance metric therefore depends to some extent on the activation function, and the learning algorithm encodes this fact by including the derivative of that function (see the following discussion).

The sixth key feature of connectionist models is the algorithm for modifying the patterns of connectivity as a function of experience. Virtually all learning rules for PDP models can be considered a variant of the Hebbian learning rule (Hebb, 1949). The essential idea is that a weight between two units should be altered in proportion to the units' correlated activity. For example, if a unit u_i receives input from another unit u_j, then if both are highly active, the weight w_ij from u_j to u_i should be strengthened. In its simplest version, the rule is

$$\Delta w_{ij} = \eta\, a_i\, a_j \qquad (2.3)$$

where η is the constant of proportionality known as the learning rate. Where an external target activation t_i(t) is available for a unit i at time t, this algorithm is modified by replacing a_i with a term depicting the disparity of unit u_i's current activation state a_i(t) from its desired activation state t_i(t) at time t, so forming the delta rule:

$$\Delta w_{ij} = \eta\, \big(t_i(t) - a_i(t)\big)\, a_j \qquad (2.4)$$
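The two learning rules just defined can be sketched in a few lines of illustrative Python (the learning rate of 0.1 is an arbitrary assumption, not a value from the chapter):

```python
import numpy as np

def hebbian_update(W, a_out, a_in, lr=0.1):
    """Hebbian rule (Equation 2.3): delta w_ij = eta * a_i * a_j."""
    return W + lr * np.outer(a_out, a_in)

def delta_rule_update(W, a_out, a_in, target, lr=0.1):
    """Delta rule (Equation 2.4): delta w_ij = eta * (t_i - a_i) * a_j."""
    return W + lr * np.outer(target - a_out, a_in)
```

Note that the delta rule is the Hebbian rule with the receiving unit's activation replaced by its error, which is why the chapter describes it as a modification of Equation 2.3.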
However, when hidden units are included in networks, no target activation is available for these internal parameters. The weights to such units may be modified by variants of the Hebbian learning algorithm (e.g., Contrastive Hebbian; Hinton, 1989; see Xie & Seung, 2003) or by the backpropagation of error signals from the output layer.

Backpropagation makes it possible to determine, for each connection weight in the network, what effect a change in its value would have on the overall network error. The policy for changing the strengths of connections is simply to adjust each weight in the direction (up or down) that would tend to reduce the error, and to change it by an amount proportional to the size of the effect the adjustment will have. If there are multiple layers of hidden units remote from the output layer, this process can be followed iteratively: First, error derivatives are computed for the hidden layer nearest the output layer; from these, derivatives are computed for the next deepest layer into the network, and so forth. On this basis, the backpropagation algorithm serves to modify the pattern of weights in powerful multilayer networks. It alters the weights to each deeper layer of units in such a way as to reduce the error on the output units (see Rumelhart, Hinton, et al., 1986, for the derivation). We can formulate the weight change algorithm by analogy to the delta rule shown in Equation 2.4. For each deeper layer in the network, we modify the central term that represents the disparity between the actual and target activation of the units. Assuming u_i, u_h, and u_o are input, hidden, and output units in a three-layer feedforward network, the algorithm for changing the weight from hidden to output unit is:

$$\Delta w_{oh} = \eta\, (t_o - a_o)\, F'(\mathrm{net}_o)\, a_h \qquad (2.5)$$

where F′(net) is the derivative of the activation function of the units (e.g., for the sigmoid activation function, F′(net_o) = a_o(1 − a_o)). The term (t_o − a_o) is proportional to the negative of the partial derivative of the network's overall error with respect to the activation of the output unit, where the error E is given by $E = \sum_o (t_o - a_o)^2$.

The derived error term for a unit at the hidden layer is a product of three components: the derivative of the hidden unit's activation function, times the sum across all the connections from that hidden unit to the output layer of the error term on each output unit weighted by the derivative of the output unit's activation function (t_o − a_o) F′(net_o), times the weight connecting the hidden unit to the output unit:

$$F'(\mathrm{net}_h) \sum_o (t_o - a_o)\, F'(\mathrm{net}_o)\, w_{oh} \qquad (2.6)$$

The algorithm for changing the weights from the input to the hidden layer is therefore:

$$\Delta w_{hi} = \eta\, F'(\mathrm{net}_h) \Big[ \sum_o (t_o - a_o)\, F'(\mathrm{net}_o)\, w_{oh} \Big]\, a_i \qquad (2.7)$$
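Equations 2.5 through 2.7 can be collected into a single weight-update step. The following Python sketch assumes sigmoid units, so that F′(net) = a(1 − a); it is an illustrative rendering of the algorithm, not code from the chapter, and the learning rate is an arbitrary choice.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(W_hi, W_oh, a_i, target, lr=0.5):
    """One backpropagation update for a three-layer feedforward network.

    W_hi : input-to-hidden weight matrix; W_oh : hidden-to-output matrix.
    a_i  : input activation vector; target : desired output vector.
    """
    # Forward pass (Equations 2.1 and 2.2).
    a_h = sigmoid(W_hi @ a_i)
    a_o = sigmoid(W_oh @ a_h)

    # Output error term: (t_o - a_o) * F'(net_o), with F'(net) = a(1 - a).
    delta_o = (target - a_o) * a_o * (1.0 - a_o)

    # Hidden error term (Equation 2.6): F'(net_h) * sum_o delta_o * w_oh.
    delta_h = a_h * (1.0 - a_h) * (W_oh.T @ delta_o)

    # Weight changes (Equations 2.5 and 2.7).
    W_oh += lr * np.outer(delta_o, a_h)
    W_hi += lr * np.outer(delta_h, a_i)
    return W_hi, W_oh
```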


It is interesting that the previous computation can be construed as a backward pass through the network, similar in spirit to the forward pass that computes activations in that it involves propagation of signals across weighted connections, this time from the output layer back toward the input. The backward pass, however, involves the propagation of error derivatives rather than activations.

It should be emphasized that a very wide range of variants and extensions of Hebbian and error-correcting algorithms have been introduced in the connectionist learning literature. Most importantly, several variants of backpropagation have been developed for training recurrent networks (Williams & Zipser, 1995); and several algorithms (including the Contrastive Hebbian Learning algorithm and O'Reilly's 1998 LEABRA algorithm) have addressed some of the concerns that have been raised regarding the biological plausibility of backpropagation construed in its most literal form (O'Reilly & Munakata, 2000).

The last general feature of connectionist networks is a representation of the environment with respect to the system. This is assumed to consist of a set of externally provided events or a function for generating such events. An event may be a single pattern, such as a visual input; an ensemble of related patterns, such as the spelling of a word and its corresponding sound or meaning; or a sequence of inputs, such as the words in a sentence. A range of policies have been used for specifying the order of presentation of the patterns, from sweeping through the full set to random sampling with replacement. The selection of patterns to present may vary over the course of training but is often fixed. Where a target output is linked to each input, this is usually assumed to be simultaneously available. Two points are of note in the translation between PDP network and cognitive model. First, a representational scheme must be defined to map between the cognitive domain of interest and a set of vectors depicting the relevant informational states or mappings for that domain. Second, in many cases, connectionist models are addressed to aspects of higher-level cognition, where it is assumed that the information of relevance is more abstract than sensory or motor codes. This has meant that the models often leave out details of the transduction of sensory and motor signals, using input and output representations that are already somewhat abstract. We hold the view that the same principles at work in higher-level cognition are also at work in perceptual and motor systems, and indeed there is considerable connectionist work addressing issues of perception and action, although these will not be the focus of the present chapter.

2.3. Neural Plausibility

It is a historical fact that most connectionist modelers have drawn their inspiration from the computational properties of neural systems. However, it has become a point of controversy whether these "brain-like" systems are indeed neurally plausible. If they are not, should they instead be viewed as a class of statistical function approximators? And if so, shouldn't the ability of these models to simulate patterns of human behavior be assessed in the context of the large number of free parameters they contain (e.g., in the weight matrix; Green, 1998)?


Neural plausibility should not be the primary focus for a consideration of connectionism. The advantage of connectionism, according to its proponents, is that it provides better theories of cognition. Nevertheless, we will deal briefly with this issue because it pertains to the origins of connectionist cognitive theory. In this area, two sorts of criticism have been leveled at connectionist models. The first is to maintain that many connectionist models either include properties that are not neurally plausible or omit other properties that neural systems appear to have. Some connectionist researchers have responded to this first criticism by endeavoring to show how features of connectionist systems might in fact be realized in the neural machinery of the brain. For example, the backward propagation of error across the same connections that carry activation signals is generally viewed as biologically implausible. However, a number of authors have shown that the difference between activations computed using standard feedforward connections and those computed using standard return connections can be used to derive the crucial error derivatives required by backpropagation (Hinton & McClelland, 1988; O'Reilly, 1996). It is widely held that connections run bidirectionally in the brain, as required for this scheme to work. Under this view, backpropagation may be shorthand for a Hebbian-based algorithm that uses bidirectional connections to spread error signals throughout a network (Xie & Seung, 2003).

Other connectionist researchers have responded to the first criticism by stressing the cognitive nature of current connectionist models. Most of the work in developmental neuroscience addresses behavior at levels no higher than cellular and local networks, whereas cognitive models must make contact with the human behavior studied in psychology. Some simplification is therefore warranted, with neural plausibility compromised under the working assumption that the simplified models share the same flavor of computation as actual neural systems. Connectionist models have succeeded in stimulating a great deal of progress in cognitive theory – and have sometimes generated radically different proposals to the previously prevailing symbolic theory – just given the set of basic computational features outlined in the preceding section.

The second type of criticism leveled at connectionism questions why, as Davies (2005) puts it, connectionist models should be reckoned any more plausible as putative descriptions of cognitive processes just because they are "brain-like." Under this view, there is independence between levels of description because a given cognitive level theory might be implemented in multiple ways in different hardware. Therefore, the details of the hardware (in this case, the brain) need not concern the cognitive theory. This functionalist approach, most clearly stated in Marr's three levels of description (computational, algorithmic, and implementational; see Marr, 1982), has been repeatedly challenged (see, e.g., Rumelhart & McClelland, 1985; Mareschal et al., 2007). The challenge to Marr goes as follows. Although, according to computational theory, there may be a principled independence between a computer program and the particular substrate on which it is implemented, in practical terms, different sorts of computation are easier or harder to implement on a given substrate. Because computations have to be delivered in real time as the individual interacts with his or her environment, in the first instance, cognitive level theories should be constrained by the computational primitives that are most easily implemented on the available hardware; human cognition should be shaped by the processes that work best in the brain.

The relation of connectionist models to symbolic models has also proved controversial. A full consideration of this issue is beyond the scope of the current chapter. Suffice to say that because the connectionist approach now includes a diverse family of models, there is no single answer to this question. Smolensky (1988) argued that connectionist models exist at a lower (but still cognitive) level of description than symbolic cognitive theories, a level that he called the subsymbolic.


Connectionist models have sometimes been put forward as a way to implement symbolic production systems on neural architectures (e.g., Touretzky & Hinton, 1988). At other times, connectionist researchers have argued that their models represent a qualitatively different form of computation: Whereas under certain circumstances, connectionist models might produce behavior approximating symbolic processes, it is held that human behavior, too, only approximates the characteristics of symbolic systems rather than directly implementing them. Furthermore, connectionist systems incorporate additional properties characteristic of human cognition, such as content-addressable memory, context-sensitive processing, and graceful degradation under damage or noise. Under this view, symbolic theories are approximate descriptions rather than actual characterizations of human cognition. Connectionist theories should replace them both because they capture subtle differences between human behavior and symbolic characterizations, and because they provide a specification of the underlying causal mechanisms (van Gelder, 1991).

This strong position has prompted criticisms that in their current form, connectionist models are insufficiently powerful to account for certain aspects of human cognition – in particular, those areas best characterized by symbolic, syntactically driven computations (Fodor & Pylyshyn, 1988; Marcus, 2001). Again, however, the characterization of human cognition in such terms is highly controversial; close scrutiny of relevant aspects of language – the ground on which the dispute has largely been focused – lends support to the view that the systematicity assumed by proponents of symbolic approaches is overstated and that the actual characteristics of language are well matched to the characteristics of connectionist systems (Bybee & McClelland, 2005; McClelland et al., 2003).

In the end, it may be difficult to make principled distinctions between symbolic and connectionist models. At a fine scale, one might argue that two units in a network represent variables, and the connection between them specifies a symbolic rule linking these variables. One might also argue that a production system in which rules are allowed to fire probabilistically and in parallel begins to approximate a connectionist system.

2.4. The Relationship between Connectionist Models and Bayesian Inference

Since the early 1980s, it has been apparent that there are strong links between the calculations carried out in connectionist models and key elements of Bayesian calculations (see chapter 3 in this volume on Bayesian models of cognition). The state of the early literature on this point was reviewed in McClelland (1998). There it was noted, first of all, that units can be viewed as playing the role of probabilistic hypotheses; that weights and biases play the role of conditional probability relations between hypotheses and prior probabilities, respectively; and that if connection weights and biases have the correct values, the logistic activation function sets the activation of a unit to its posterior probability given the evidence represented on its inputs.

A second and more important observation is that, in stochastic neural networks (Boltzmann machines and Continuous Diffusion Networks; Hinton & Sejnowski, 1986; Movellan & McClelland, 1993), a network's state over all of its units can represent a constellation of hypotheses about an input, and (if the weights and the biases are set correctly) that the probability of finding the network in a particular state is monotonically related to the probability that the state is the correct interpretation of the input. The exact nature of the relation depends on a parameter called temperature; if set to one, the probability that the network will be found in a particular state exactly matches its posterior probability. When temperature is gradually reduced to zero, the network will end up in the most probable state, thus performing optimal perceptual inference (Hinton & Sejnowski, 1983).
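The first of these correspondences can be made explicit with a short derivation (a standard reconstruction consistent with the observations reviewed in McClelland, 1998, not a passage from the chapter). For a binary hypothesis h with conditionally independent evidence on the inputs, Bayes' rule in log-odds form gives

$$
\log \frac{P(h \mid e)}{P(\neg h \mid e)}
  \;=\; \underbrace{\log \frac{P(h)}{P(\neg h)}}_{\text{bias}}
  \;+\; \sum_j \underbrace{\log \frac{P(e_j \mid h)}{P(e_j \mid \neg h)}}_{\text{weighted input}}
$$

so that the posterior is the logistic function of a net input:

$$
P(h \mid e) \;=\; \frac{1}{1 + e^{-\mathrm{net}}},
\qquad \mathrm{net} = \mathrm{bias} + \sum_j w_j a_j .
$$

If the unit's bias encodes the log prior odds and each weight encodes a log likelihood ratio, the logistic activation function returns exactly the posterior probability of the hypothesis given the evidence on the unit's inputs.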


It is also known that backpropagation can learn weights that allow Bayes-optimal estimation of outputs given inputs (MacKay, 1992) and that the Boltzmann machine learning algorithm (Ackley, Hinton, & Sejnowski, 1985; Movellan & McClelland, 1993) can learn to produce correct conditional distributions of outputs given inputs. The algorithm is slow, but there has been recent progress producing substantial speedups that achieve outstanding performance on benchmark data sets (Hinton & Salakhutdinov, 2006).

3. Three Illustrative Models

In this section, we outline three of the landmark models in the emergence of connectionist theories of cognition. The models serve to illustrate the key principles of connectionism and demonstrate how these principles are relevant to explaining behavior in ways that were different from other prior approaches. The contribution of these models was twofold: they were better suited than alternative approaches to capturing the actual characteristics of human cognition, usually on the basis of their context-sensitive processing properties; and compared to existing accounts, they offered a sharper set of tools to drive theoretical progress and to stimulate empirical data collection. Each of these models significantly advanced its field.

3.1. An Interactive Activation Model of Context Effects in Letter Perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982)

The interactive activation model of letter perception illustrates two interrelated ideas. The first is that connectionist models naturally capture a graded constraint satisfaction process in which the influences of many different types of information are simultaneously integrated in determining, for example, the identity of a letter in a word. The second idea is that the computation of a perceptual representation of the current input (in this case, a word) involves the simultaneous and mutual influence of representations at multiple levels of abstraction – this is a core idea of parallel distributed processing.

The interactive activation model addressed a puzzle in word recognition. By the late 1970s, it had long been known that people were better at recognizing letters presented in words than letters presented in random letter sequences. Reicher (1969) demonstrated that this was not the result of tending to guess letters that would make letter strings into words. He presented target letters in words, in unpronounceable nonwords, or on their own. The stimuli were then followed by a pattern mask, after which participants were presented with a forced choice between two letters in a given position. Importantly, both alternatives were equally plausible. Thus, the participant might be presented with WOOD and asked whether the third letter was O or R. As expected, forced-choice performance was more accurate for letters in words than for letters in nonwords or presented on their own. Moreover, the benefit of surrounding context was also conferred by pronounceable pseudowords (e.g., recognizing the P in SPET) compared with random letter strings, suggesting that subjects were able to bring to bear rules regarding the orthographic legality of letter strings during recognition.

Rumelhart and McClelland (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) took the contextual advantage of words and pseudowords on letter recognition to indicate the operation of top-down processing. Previous theories had put forward the idea that letter and word recognition might be construed in terms of detectors that collect evidence consistent with the presence of their assigned letter or word in the input (Morton, 1969; Selfridge, 1959). Influenced by these theories, Rumelhart and McClelland built a computational simulation in which the perception of letters resulted from excitatory and inhibitory interactions of detectors for visual features. Importantly, the detectors were organized into different layers for letter features, letters, and words, and detectors could influence each other both in a bottom-up and a top-down manner.


Figure 2.2. Interactive Activation model of context effects in letter recognition (McClelland & Rumelhart, 1981, 1982). Pointed arrows are excitatory connections; circular-headed arrows are inhibitory connections. Left: macro view (connections in gray were set to zero in the implemented model). Right: micro view for the connections from the feature level to the first letter position for the letters S, W, and F (only excitatory connections shown) and from the first letter position to the word units SEED, WEED, and FEED (all connections shown).

Figure 2.2 illustrates the structure of the Interactive Activation (IA) model, both at the macro level (left) and for a small section of the model at a finer level (right). The explicit motivation for the structure of the IA model was neural: "[We] have adopted the approach of formulating the model in terms similar to the way in which such a process might actually be carried out in a neural or neural-like system" (McClelland & Rumelhart, 1981, p. 387). There were three main assumptions of the IA model: (1) Perceptual processing takes place in a system in which there are several levels of processing, each of which forms a representation of the input at a different level of abstraction; (2) visual perception involves parallel processing, both of the four letters in each word and of all levels of abstraction simultaneously; and (3) perception is an interactive process in which conceptually driven and data driven processing provide multiple, simultaneously acting constraints that combine to determine what is perceived.

The activation states of the system were simulated by a sequence of discrete time steps. Each unit combined its activation on the previous time step, its excitatory influences, its inhibitory influences, and a decay factor to determine its activation on the next time step. Connectivity was set at unitary values and along the following principles. In each layer, mutually exclusive alternatives should inhibit each other. Each unit in a layer excited all units with which it was consistent and inhibited all those with which it was inconsistent in the layer immediately above. Thus, in Figure 2.2, the first-position W letter unit has an excitatory connection to the WEED word unit but an inhibitory connection to the SEED and FEED word units. Similarly, a unit excited all units with which it was consistent and inhibited all those with which it was inconsistent in the layer immediately below. However, in the final implementation, top-down word-to-letter inhibition and within-layer letter-to-letter inhibition were set to zero (gray arrows, Figure 2.2).

The model was constructed to recognize letters in four-letter strings. The full set of possible letters was duplicated for each letter position, and a set of 1,179 word units was created to represent the corpus of four-letter words. Word units were given base rate activation states at the beginning of processing to reflect their different frequencies.
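The unit dynamics just described can be sketched as follows. This is an illustrative Python rendering of the general form of the interactive activation update rule; the parameter values (resting level, decay rate, activation bounds) are arbitrary stand-ins, not those of the published model.

```python
def ia_update(a, net, rest=-0.1, decay=0.1, a_min=-0.2, a_max=1.0):
    """One discrete-time update of an interactive activation unit.

    a   : the unit's current activation.
    net : summed excitatory minus inhibitory input from connected units.
    A positive net input drives the unit toward its ceiling, a negative
    one toward its floor, while decay pulls it back to its resting level.
    """
    if net > 0:
        effect = net * (a_max - a)   # scaled by the distance to ceiling
    else:
        effect = net * (a - a_min)   # scaled by the distance to floor
    return a + effect - decay * (a - rest)
```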


A trial began by clamping the feature units to the appropriate states to represent a letter string and then observing the dynamic change in activation through the network. Conditions were included to allow the simulation of stimulus masking and degraded stimulus quality. Finally, a probabilistic response mechanism was added to generate responses from the letter level, based on the relative activation states of the letter pool in each position.
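The response mechanism can be illustrated with a Luce choice ratio over the letter units in one string position. The exponential form below follows the general approach used in the IA model, but the scaling constant and the use of instantaneous rather than time-averaged activations are simplifying assumptions made here.

```python
import numpy as np

def response_probabilities(letter_activations, mu=10.0):
    """Probability of reporting each candidate letter in one position.

    letter_activations : activations of the letter units in the pool.
    mu : scaling constant; larger values make the choice favor the most
         active unit more strongly (the value here is illustrative).
    """
    strengths = np.exp(mu * np.asarray(letter_activations))
    return strengths / strengths.sum()
```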
The model successfully captured the greater accuracy of letter detection for letters appearing in words and pseudowords compared with random strings or in isolation. Moreover, it simulated a variety of empirical findings on the effect of masking and stimulus quality, and of changing the timing of the availability of context. The results on the contextual effects of pseudowords are particularly interesting, because the model only contains word units and letter units and has no explicit representation of orthographic rules. Let us say on a given trial, the subject is required to recognize the second letter in the string SPET. In this case, the string will produce bottom-up excitation of the word units for SPAT, SPIT, and SPOT, which each share three letters. In turn, the word units will propagate top-down activation, reinforcing activation of the letter P and so facilitating its recognition. Were this letter to be presented in the string XPQJ, no word units could offer similar top-down activation, hence the relative facilitation of the pseudoword. Interestingly, although these top-down "gang" effects produced facilitation of letters contained in orthographically legal nonword strings, the model demonstrated that they also produced facilitation in orthographically illegal, unpronounceable letter strings, such as SPCT. Here, the same gang of SPAT, SPIT, and SPOT produces top-down support. Rumelhart and McClelland (1982) reported empirical support for this novel prediction. Therefore, although the model behaved as if it contained orthographic rules influencing recognition, it did not in fact do so, because continued contextual facilitation could be demonstrated for strings that had gang support but violated the orthographic rules.

There are two specific points to note regarding the IA model. First, this early connectionist model was not adaptive – connectivity was set by hand. Although the model's behavior was shaped by the statistical properties of the language it processed, these properties were built into the structure of the system in terms of the frequency of occurrence of letters and letter combinations in the words. Second, the idea of bottom-up excitation followed by competition among mutually exclusive possibilities is a strategy familiar in Bayesian approaches to cognition. In that sense, the IA model bears similarity to more recent probability-theory-based approaches to perception (see chapter 3 in this volume).

3.1.1. What Happened Next?

Subsequent work saw the principles of the IA model extended to the recognition of spoken words (the TRACE model: McClelland & Elman, 1986) and more recently to bilingual speakers, where two languages must be incorporated in a single representational system (see Thomas & van Heuven, 2005, for review). The architecture was applied to other domains where multiple constraints were thought to operate during perception, for example, in face recognition (Burton, Bruce, & Johnston, 1990). Within language, more complex architectures have tried to recast the principles of the IA model in developmental settings, such as Plaut and Kello's (1999) model of the emergence of phonology from the interplay of speech comprehension and production.

The more general lesson to draw from the interactive activation model is the demonstration of multiple influences (feature, letter, and word-level knowledge) working simultaneously and in parallel to shape the response of the system, as well as the somewhat surprising finding that a massively parallel constraint satisfaction process of this form can appear to behave as if it contains rules (in this case, orthographic) when no such rules are included in the processing structure. At the time, the model brought into question whether it was necessary to postulate rules as processing structures to explain regularities in human behavior. This skepticism was brought into sharper focus by our next example.


3.2. On Learning the Past Tense of English Verbs (Rumelhart & McClelland, 1986)

Rumelhart and McClelland's (1986) model of English past tense formation marked the real emergence of the PDP framework. Where the IA model used localist coding, the past tense model employed distributed coding. Where the IA model had handwired connection weights, the past tense model learned its weights via repeated exposure to a problem domain. However, the models share two common themes. Once more, the behavior of the past tense model will be driven by the statistics of the problem domain, albeit these will be carved into the model by training rather than sculpted by the modelers. Perhaps more importantly, we see a return to the idea that a connectionist system can exhibit rule-following behavior without containing rules as causal processing structures; but in this case, the rule-following behavior will be the product of learning and will accommodate a proportion of exception patterns that do not follow the general rule. The key point that the past tense model illustrates is how (approximate) conformity to the regularities of language – and even a tendency to produce new regular forms (e.g., regularizations like "thinked" or past tenses for novel verbs like "wugged") – can arise in a connectionist network without an explicit representation of a linguistic rule.

The English past tense is characterized by a predominant regularity in which the majority of verbs form their past tenses by the addition of one of three allomorphs of the "-ed" suffix to the base stem (walk/walked, end/ended, chase/chased). However, there is a small but significant group of verbs that form their past tense in different ways, including changing internal vowels (swim/swam), changing word-final consonants (build/built), changing both internal vowels and final consonants (think/thought), and an arbitrary relation of stem to past tense (go/went), as well as verbs that have a past tense form identical to the stem (hit/hit). These so-called irregular verbs often come in small groups sharing a family resemblance (sleep/slept, creep/crept, leap/leapt) and usually have high token frequencies (see Pinker, 1999, for further details).

During the acquisition of the English past tense, children show a characteristic U-shaped developmental profile at different times for individual irregular verbs. Initially, they use the correct past tense of a small number of high-frequency regular and irregular verbs. Later, they sometimes produce "overregularized" past tense forms for a small fraction of their irregular verbs (e.g., thinked; Marcus et al., 1992), along with other, less frequent errors (Xu & Pinker, 1995). They are also able to extend the past tense "rule" to novel verbs (e.g., wug/wugged). Finally, in older children, performance approaches ceiling on both regular and irregular verbs (Berko, 1958; Ervin, 1964; Kuczaj, 1977).

In the early 1980s, it was held that this pattern of behavior represented the operation of two developmental mechanisms (Pinker, 1984). One of these was symbolic and served to learn the regular past tense rule, whereas the other was associative and served to learn the exceptions to the rule. The extended phase of overregularization errors corresponded to difficulties in integrating the two mechanisms, specifically, a failure of the associative mechanism to block the function of the symbolic mechanism. That the child comes to the language acquisition situation armed with these two mechanisms (one of them full of blank rules) was an a priori commitment of the developmental theory.

By contrast, Rumelhart and McClelland (1986) proposed that a single network that does not distinguish between regular and irregular past tenses is sufficient to learn past tense formation. The architecture of their model is shown in Figure 2.3.


Figure 2.3. Two-layer network for learning the mapping between the verb roots and past tense forms of English verbs (Rumelhart & McClelland, 1986). Phonological representations of verbs are initially encoded into a coarse, distributed "Wickelfeature" representation. Past tenses are decoded from the Wickelfeature representation back to the phonological form. Later connectionist models replaced the dotted area with a three-layer feedforward backpropagation network (e.g., Plunkett & Marchman, 1991, 1993).

A phoneme-based representation of the verb root was recoded into a more distributed, coarser (more blurred) format, which they called "Wickelfeatures." The stated aim of this recoding was to produce a representation that (a) permitted differentiation of all of the root forms of English and their past tenses, and (b) provided a natural basis for generalizations to emerge about what aspects of a present tense correspond to what aspects of a past tense. This format involved representing verbs over 460 processing units. A two-layer network was used to associate the Wickelfeature representations of the verb root and past tense form. A final decoding network was then used to derive the closest phoneme-based rendition of the past tense form and reveal the model's response (the decoding part of the model was somewhat restricted by the computer processing limitations of the machines available at the time).

The connection weights in the two-layer network were initially randomized. The model was then trained in three phases, in each case using the delta rule to update the connection weights after each verb root/past tense pair was presented (see Section 2.2, Equation 2.4). In Phase 1, the network was trained on ten high-frequency verbs, two regular and eight irregular, in line with the greater proportion of irregular verbs among the most frequent verbs in English. Phase 1 lasted for ten presentations of the full training set (or "epochs"). In Phase 2, the network was trained on 410 medium-frequency verbs, 334 regular and 76 irregular, for a further 190 epochs. In Phase 3, no further training took place, but 86 lower-frequency verbs were presented to the network to test its ability to generalize its knowledge of the past tense domain to novel verbs.
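The three-phase regime can be written as a short training loop. In the sketch below, random binary vectors stand in for the 460-unit Wickelfeature encodings, and the update is the delta rule of Equation 2.4 applied to binary threshold units; the learning rate, the simple threshold, and the retention of the ten original verbs in the Phase 2 training set are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 460  # number of Wickelfeature units per verb form

def make_pairs(n):
    """Random binary stand-ins for (verb root, past tense) encodings."""
    return [(rng.integers(0, 2, N).astype(float),
             rng.integers(0, 2, N).astype(float)) for _ in range(n)]

def train_epoch(W, pairs, lr=0.05):
    """One sweep through the pairs, updating W with the delta rule."""
    for root, past in pairs:
        output = (W @ root > 0).astype(float)     # binary threshold units
        W += lr * np.outer(past - output, root)   # Equation 2.4
    return W

W = np.zeros((N, N))
high_freq, medium_freq = make_pairs(10), make_pairs(410)

for _ in range(10):    # Phase 1: ten epochs on ten high-frequency verbs
    W = train_epoch(W, high_freq)
for _ in range(190):   # Phase 2: 190 further epochs on the expanded set
    W = train_epoch(W, high_freq + medium_freq)
# Phase 3: no training; present 86 novel verbs to test generalization.
```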


There were four key results for this model. First, it succeeded in learning both regular and irregular past tense mappings in a single network that made no reference to the distinction between regular and irregular verbs. Second, it captured the overall pattern of faster acquisition for regular verbs than irregular verbs, a predominant feature of children's past tense acquisition. Third, the model captured the U-shaped profile of development: an early phase of accurate performance on a small set of regular and irregular verbs, followed by a phase of overregularization of the irregular forms, and finally recovery for the irregular verbs and performance approaching ceiling on both verb types. Fourth, when the model was presented with the low-frequency verbs on which it had not been trained, it was able to generalize the past tense rule to a substantial proportion of them, as if it had indeed learned a rule. Additionally, the model captured more fine-grained developmental patterns for subsets of regular and irregular verbs, and generated several novel predictions.

Rumelhart and McClelland explained the generalization abilities of the network in terms of the superpositional memory of the two-layer network. All the associations between the distributed encodings of verb root and past tense forms must be stored across the single matrix of connection weights. As a result, similar patterns blend into one another and reinforce each other. Generalization is contingent on the similarity of verbs at input. Were the verbs to be presented using an orthogonal, localist scheme (e.g., 420 units, 1 per verb), then there would be no similarity between the verbs, no blending of mappings, no generalization, and therefore no regularization of novel verbs. As the authors state, "It is the statistical relationships among the base forms themselves that determine the pattern of responding. The network merely reflects the statistics of the featural representations of the verb forms" (Rumelhart & McClelland, 1986, p. 267). Based on the model's successful simulation of the profile of language development in this domain and, compared with the dual-mechanism model, its more parsimonious a priori commitments, Rumelhart and McClelland viewed their work on past tense morphology as a step toward a revised understanding of language knowledge, language acquisition, and linguistic information processing in general.

The past tense model stimulated a great deal of subsequent debate, not least because of its profound implications for theories of language development (no rules!). The model was initially subjected to concentrated criticism. Some of this was overstated – for instance, the use of domain-general learning principles (such as distributed representation, parallel processing, and the delta rule) to acquire the past tense in a single network was interpreted as a claim that all of language acquisition could be captured by the operation of a single domain-general learning mechanism. Such an absurd claim could be summarily dismissed. As it stood, the model made no such claim: Its generality was in the processing principles. The model itself represented a domain-specific system dedicated to learning a small part of language. Nevertheless, a number of the criticisms were more telling: The Wickelfeature representational format was not psycholinguistically realistic; the generalization performance of the model was relatively poor; the U-shaped developmental profile appeared to be a result of abrupt changes in the composition of the training set; and the actual response of the model was hard to discern because of problems in decoding the Wickelfeature output into a phoneme string (Pinker & Prince, 1988).

The criticisms and following rejoinders were interesting in a number of ways. First, there was a stark contrast between the precise, computationally implemented connectionist model of past tense formation and the verbally specified dual-mechanism theory (e.g., Marcus et al., 1992). The implementation made simplifications but was readily evaluated against quantitative behavioral evidence; it made predictions and it could be falsified. The verbal theory by contrast was vague – it was hard to know how or whether it would work or exactly what behaviors it predicted (see Thomas, Forrester, & Richardson, 2006, for discussion). Therefore, it could only be evaluated on loose qualitative grounds. Second, the model stimulated a great deal of new multidisciplinary research in the area. Today, inflectional morphology (of which past tense is a part) is one of the most studied aspects of language processing in children, in adults, in second language learners, in adults with acquired brain damage, in children and adults with neurogenetic disorders, and in children with language impairments, using psycholinguistic methods, event-related potential measures of brain activity, functional magnetic resonance imaging, and behavioral genetics.


with acquired brain damage, in children and also successfully pursued via connectionist adults with neurogenetic disorders, and in modeling in the domain of reading (e.g., children with language impairments, using Plaut, et al., 1996). This led to work that psycholinguistic methods, event-related po- also considered various forms of acquired tential measures of brain activity, functional and developmental dyslexia. magnetic resonance imaging, and behavioral For the past tense itself, there re- genetics. This rush of science illustrates the mains much interest in the topic as a essential role of computational modeling in crucible to test theories of language de- driving forward theories of human cogni- velopment. However, in some senses the tion. Third, further modifications and im- debate between connectionist and dual- provements to the past tense model have mechanism accounts has ground to a halt. highlighted how researchers go about the There is much evidence from child devel- difficult task of understanding which parts opment, adult cognitive neuropsychology, of their model represent the key theoreti- developmental neuropsychology, and func- cal claims and which are implementational tional brain imaging to suggest partial dis- details. Simplification is inherent in model- sociations between performance on regu- ing, but successful modeling relies on mak- lar and irregular inflection under various ing the right simplifications to focus on the conditions. Both connectionist and dual- process of interest. For example, in sub- mechanism models have been modified: the sequent models, the Wickelfeature repre- connectionist model to include the influ- sentation was replaced by more plausible ence of lexical-semantics as well as verb phonemic representations based on articu- root phonology in driving the production latory features; the recoding/two-layer net- of the past tense form (Joanisse & Sei- work/decoding component of the network denberg, 1999; Thomas & Karmiloff-Smith, (the dotted rectangle in Figure 2.3) that 2003b); the dual-mechanism model to sup- was trained with the delta rule was re- pose that regular verbs might also be stored placed by a three-layer feedforward net- in the associative mechanism, thereby in- work trained with the backpropagation al- troducing partial redundancy of function gorithm; and the U-shaped developmental (Pinker, 1999). Both approaches now ac- profile was demonstrated in connectionist cept that performance on regular and ir- networks trained with a smoothly growing regular past tenses partly indexes different training set of verbs or even with a fixed set things – in the connectionist account, dif- of verbs (see, e.g., Plunkett & Marchman, ferent underlying knowledge, in the dual- 1991, 1993, 1996). mechanism account, different underlying processes. In the connectionist theory, per- 3.2.1. what happened next? formance on regular verbs indexes reliance The English past tense model prompted on knowledge about phonological regu- further work within inflectional morphol- larities, whereas performance on irregular ogy in other languages (e.g., pluralization in verbs indexes reliance on lexical-semantic German: Goebel & Indefrey, 2000; plural- knowledge. 
3.2.1. what happened next?

The English past tense model prompted further work within inflectional morphology in other languages (e.g., pluralization in German: Goebel & Indefrey, 2000; pluralization in Arabic: Plunkett & Nakisa, 1997), as well as models that explored the possible causes of deficits in acquired and developmental disorders, such as aphasia, Specific Language Impairment and Williams syndrome (e.g., Hoeffner & McClelland, 1993; Joanisse & Seidenberg, 1999; Thomas & Karmiloff-Smith, 2003b; Thomas, 2005). The idea that rule-following behavior could emerge in a developing system that also had to accommodate exceptions to the rules was also successfully pursued via connectionist modeling in the domain of reading (e.g., Plaut et al., 1996). This led to work that also considered various forms of acquired and developmental dyslexia.

For the past tense itself, there remains much interest in the topic as a crucible to test theories of language development. However, in some senses the debate between connectionist and dual-mechanism accounts has ground to a halt. There is much evidence from child development, adult cognitive neuropsychology, developmental neuropsychology, and functional brain imaging to suggest partial dissociations between performance on regular and irregular inflection under various conditions. Both connectionist and dual-mechanism models have been modified: the connectionist model to include the influence of lexical-semantics as well as verb root phonology in driving the production of the past tense form (Joanisse & Seidenberg, 1999; Thomas & Karmiloff-Smith, 2003b); the dual-mechanism model to suppose that regular verbs might also be stored in the associative mechanism, thereby introducing partial redundancy of function (Pinker, 1999). Both approaches now accept that performance on regular and irregular past tenses partly indexes different things – in the connectionist account, different underlying knowledge, in the dual-mechanism account, different underlying processes. In the connectionist theory, performance on regular verbs indexes reliance on knowledge about phonological regularities, whereas performance on irregular verbs indexes reliance on lexical-semantic knowledge. In the dual-mechanism theory, performance on regular verbs indexes a dedicated symbolic processing mechanism implementing the regular rule, whereas performance on irregular verbs indexes an associative memory device storing information about the past tense forms of specific verbs. Both approaches claim to account for the available empirical evidence. However, to date, the dual-mechanism account remains unimplemented, so its claim is weaker.


How does one distinguish between two theories that (a) both claim to explain the data but (b) contain different representational assumptions? Putting aside the different level of detail of the two theories, the answer is that it depends on one's preference for consistency with other disciplines. The dual-mechanism theory declares consistency with linguistics – if rules are required to characterize other aspects of language performance (such as syntax), then one might as well include them in a model of past tense formation. The connectionist theory declares consistency with neuroscience – if the language system is going to be implemented in the brain, then one might as well employ a computational formalism based on how neural networks function.

Finally, we return to the more general connectionist principle illustrated by the past tense model. So long as there are regularities in the statistical structure of a problem domain, a massively parallel constraint satisfaction system can learn these regularities and extend them to novel situations. Moreover, as with humans, the behavior of the system is flexible and context sensitive – it can accommodate regularities and exceptions within a single processing structure.

3.3. Finding Structure in Time (Elman, 1990)

In this section, the notion of the simple recurrent network and its application to language are introduced. As with past tense, the key point of the model will be to show how conformity to regularities of language can arise without an explicit representation of a linguistic rule. Moreover, the following simulations will demonstrate how learning can lead to the discovery of useful internal representations that capture conceptual and linguistic structure on the basis of the co-occurrences of words in sentences.

The IA model exemplified connectionism's commitment to parallelism: All of the letters of the word presented to the network were recognized in parallel, and processing occurred simultaneously at different levels of abstraction. But not all processing can be carried out in this way. Some human behaviors intrinsically revolve around temporal sequences. Language, action planning, goal-directed behavior, and reasoning about causality are examples of domains that rely on events occurring in sequences. How has connectionism addressed the processing of temporally unfolding events? One solution was offered in the TRACE model of spoken word recognition (McClelland & Elman, 1986), where a word was specified as a sequence of phonemes. In that case, the architecture of the system was duplicated for each time-slice and the duplicates wired together. This allowed constraints to operate over items in the sequence to influence recognition. In other models, a related approach was used to convert a temporally extended representation into a spatially extended one. For example, in the past tense model, all the phonemes of a verb were presented across the input layer. This could be viewed as a sequence if one assumed that the representation of the first phoneme represents time slice t, the representation of the second phoneme represents time slice t + 1, and so on. As part of a comprehension system, this approach assumes a buffer that can take sequences and convert them to a spatial vector. However, this solution is fairly limited, as it necessarily pre-commits to the size of the sequences that can be processed at once (i.e., the size of the input layer).

Elman (1990, 1991) offered an alternative and more flexible approach to processing sequences, proposing an architecture that has been extremely influential and much used since. Elman drew on the work of Jordan (1986), who had proposed a model that could learn to associate a "plan" (i.e., a single input vector) with a series of "actions" (i.e., a sequence of output vectors). Jordan's model contained recurrent connections permitting the hidden units to "see" the network's previous output (via a set of "state" input units that are given a copy of the previous output). The facility for the network to shape its next output according to its previous response constitutes a kind of memory.
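A minimal sketch of this arrangement follows. The layer sizes and the way the plan is encoded are hypothetical illustrations, not Jordan's (1986) original specification; the point is simply that the previous output is copied back in as extra input.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(1)

n_plan, n_state, n_hidden, n_out = 8, 5, 12, 5
W_in = rng.normal(0, 0.5, (n_plan + n_state, n_hidden))
W_out = rng.normal(0, 0.5, (n_hidden, n_out))

def jordan_sequence(plan, n_steps):
    """Unroll a Jordan-style network: the 'state' input units receive a
    copy of the previous output, so each action depends on the plan and
    on what the network just did."""
    state = np.zeros(n_state)
    actions = []
    for _ in range(n_steps):
        x = np.concatenate([plan, state])
        hidden = sigmoid(x @ W_in)
        output = sigmoid(hidden @ W_out)
        actions.append(output)
        state = output        # feedback: previous output becomes next input
    return actions

actions = jordan_sequence(rng.integers(0, 2, n_plan).astype(float), n_steps=4)
```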


Figure 2.4. Elman's simple recurrent network architecture for finding structure in time (Elman, 1991, 1993). Connections between input and hidden, context and hidden, and hidden and output layers are trainable. Sequences are applied to the network element by element in discrete time steps; the context layer contains a copy of the hidden unit (HU) activations on the previous time step transmitted by fixed, 1-to-1 connections.
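A minimal sketch of the architecture in Figure 2.4 is given below. Layer sizes and weight scales are illustrative assumptions; the essential points are that the context layer receives a verbatim copy of the previous hidden state over fixed 1-to-1 connections, and that only the input-to-hidden, context-to-hidden, and hidden-to-output weights would be trained (by backpropagation, treating the context activations as ordinary inputs).

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(2)

n_in, n_hidden, n_out = 5, 20, 5
W_ih = rng.normal(0, 0.5, (n_in, n_hidden))      # trainable
W_ch = rng.normal(0, 0.5, (n_hidden, n_hidden))  # trainable (context -> hidden)
W_ho = rng.normal(0, 0.5, (n_hidden, n_out))     # trainable

def srn_forward(sequence):
    """Process a sequence element by element; the context units hold a
    copy of the hidden activations from the previous time step."""
    context = np.zeros(n_hidden)
    outputs = []
    for x in sequence:
        hidden = sigmoid(x @ W_ih + context @ W_ch)
        outputs.append(sigmoid(hidden @ W_ho))
        context = hidden.copy()  # fixed 1-to-1 copy, not a learned connection
    return outputs

outputs = srn_forward(rng.integers(0, 2, (10, n_in)).astype(float))
```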

Elman's innovation was to build a recurrent facility into the internal units of the network, allowing it to compute statistical relationships across sequences of inputs and outputs. To achieve this, first time is discretized into a number of slices. On time step t, an input is presented to the network and causes a pattern of activation on hidden and output layers. On time step t + 1, the next input in the sequence of events is presented to the network. However, crucially, a copy of the activation of the hidden units on time step t is transmitted to a set of internal "context" units. This activation vector is also fed to the hidden units on time step t + 1. Figure 2.4 shows the architecture, known as the simple recurrent network (SRN). It is usually trained with the backpropagation algorithm (see Section 2.3) as a multilayer feedforward network, ignoring the origin of the information on the context layer.

Each input to the SRN is therefore processed in the context of what came before, but in a way subtly more powerful than the Jordan network. The input at t + 1 is processed in the context of the activity produced on the hidden units by the input at time t. Now consider the next time step. The input at time t + 2 will be processed along with activity from the context layer that is shaped by two influences:

(the input at t + 1 (shaped by the input at t)).

The input at time t + 3 will be processed along with activity from the context layer that is shaped by three influences:

(the input at t + 2 (shaped by the input at t + 1 (shaped by the input at t))).

The recursive flavor of the information contained in the context layer means that each new input is processed in the context of the full history of previous inputs. This permits the network to learn statistical relationships across sequences of inputs or, in other words, to find structure in time.

In his original 1990 article, Elman demonstrated the powerful properties of the SRN with two examples. In the first, the network was presented with a sequence of letters made up of concatenated words, for example:

MANYYEARSAGOABOYANDGIRLLIVEDBYTHESEATHEYPLAYEDHAPPIL

Each letter was represented by a distributed binary code over five input units. The network was trained to predict the next letter in the sentence for 200 sentences constructed from a lexicon of fifteen words. There were 1,270 words and 4,963 letters. Because each word appeared in many sentences, the network was not particularly successful at predicting the next letter when it got to the end of each word, but within a word it was able to predict the sequences of letters. Using the accuracy of prediction as a measure, one could therefore identify which sequences in the letter string were words: They were the sequences of good prediction bounded by high prediction errors.
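The logic of this segmentation strategy can be illustrated without a trained network. The sketch below substitutes simple letter-bigram statistics for the SRN's learned predictions, an assumption made purely for brevity, and marks a candidate word boundary wherever the surprisal of the next letter peaks.

```python
from collections import Counter
from math import log

corpus = "manyyearsagoaboyandgirllivedbythesea" * 20  # toy stand-in corpus
pairs = Counter(zip(corpus, corpus[1:]))
firsts = Counter(corpus[:-1])

def surprisal(a, b):
    """-log P(next letter | current letter), estimated from bigram counts."""
    return -log((pairs[(a, b)] + 1) / (firsts[a] + 26))

text = "aboyandgirllivedbythesea"
errors = [surprisal(a, b) for a, b in zip(text, text[1:])]

# Candidate boundaries: positions where prediction error peaks locally,
# i.e., where the next letter was hardest to predict.
boundaries = [i + 1 for i in range(1, len(errors) - 1)
              if errors[i] > errors[i - 1] and errors[i] > errors[i + 1]]
print(boundaries)
```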


The ability to extract words was of course subject to the ambiguities inherent in the training set (e.g., for the and they, there is ambiguity after the third letter). Elman suggested that if the letter strings are taken to be analogous to the speech sounds available to the infant, the SRN demonstrates a possible mechanism to extract words from the continuous stream of sound that is present in infant-directed speech. Elman's work has contributed to the increasing interest in the statistical learning abilities of young children in language and cognitive development (see, e.g., Saffran, Newport, & Aslin, 1996).

In the second example, Elman (1990) created a set of 10,000 sentences by combining a lexicon of 29 words and a set of short sentence frames (noun + [transitive] verb + noun; noun + [intransitive] verb). There was a separate input and output unit for each word, and the SRN was trained to predict the next word in the sentence. During training, the network's output came to approximate the transitional probabilities between the words in the sentences, that is, it could predict the next word in the sentences as much as this was possible. Following the first noun, the verb units would be more active as the possible next word, and verbs that tended to be associated with this particular noun would be more active than those that did not.

At this point, Elman examined the similarity structure of the internal representations to discover how the network was achieving its prediction ability. He found that the internal representations were sensitive to the difference between nouns and verbs, and within verbs, to the difference between transitive and intransitive verbs. Moreover, the network was also sensitive to a range of semantic distinctions: Not only were the internal states induced by nouns split into animate and inanimate, but the pattern for "woman" was most similar to "girl," and that for "man" was most similar to "boy." The network had learned to structure its internal representations according to a mix of syntactic and semantic information because these information states were the best way to predict how sentences would unfold. Elman concluded that the representations induced by connectionist networks need not be flat but could include hierarchical encodings of category structure.
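The kind of analysis Elman performed can be sketched as follows. Here the per-word hidden-state vectors are random stand-ins (in the real simulation they would be averages of the SRN's hidden activations recorded as each word is processed), and the word list is abbreviated; the code simply reports each word's nearest neighbor in the hidden-state space.

```python
import numpy as np

rng = np.random.default_rng(3)
words = ["boy", "girl", "man", "woman", "chase", "see", "sleep"]

# Stand-in for the average hidden-unit vector evoked by each word;
# a trained SRN would supply these.
H = {w: rng.normal(size=150) for w in words}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for w in words:
    # Nearest neighbor in representation space reveals the induced
    # category structure (e.g., "woman" closest to "girl").
    neighbor = max((v for v in words if v != w),
                   key=lambda v: cosine(H[w], H[v]))
    print(w, "->", neighbor)
```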


Figure 2.5. Trajectory of internal activation states as the simple recurrent network (SRN) processes sentences (Elman, 1993). The data show positions according to the dimensions of a principal components analysis (PCA) carried out on hidden unit activations for the whole training set. Words are indexed by their position in the sequence but represent activation of the same input unit for each word. (a) PCA values for the second principal component as the SRN processes two sentences, "Boy who boys chase chases boy" or "Boys who boys chase chase boy"; (b) PCA values for the first and eleventh principal components as the SRN processes "Boy chases boy who chases boy who chases boy."

Based on his finding, Elman also argued that the SRN was able to induce representations of entities that varied according to their context of use. This contrasts with classical symbolic representations that retain their identity regardless of the combinations into which they are put, a property called "compositionality." This claim is perhaps better illustrated by a second article Elman (1993) published two years later, called "Learning and Development in Neural Networks: The Importance of Starting Small." In this later article, Elman explored whether rule-based mechanisms are required to explain certain aspects of language performance, such as syntax. He focused on "long-range dependencies," which are links between words that depend only on their syntactic relationship in the sentence and, importantly, not on their separation in a sequence of words. For example, in English, the subject and main verb of a sentence must agree in number. If the noun is singular, so must be the verb; if the noun is plural, so must be the verb. Thus, in the sentence "The boy chases the cat," boy and chases must both be singular. But this is also true in the sentence "The boy whom the boys chase chases the cat." In the second sentence, the subject and verb are further apart in the sequence of words but their relationship is the same; moreover, the words are now separated by plural tokens of the same lexical items. Rule-based representations of syntax were thought to be necessary to encode these long-distance relationships because, through the recursive nature of syntax, the words that have to agree in a sentence can be arbitrarily far apart.

Using an SRN trained on the same prediction task as that previously outlined but now with more complex sentences, Elman (1993) demonstrated that the network was able to learn these long-range dependencies even across the separation of multiple phrases. If boy was the subject of the sentence, when the network came to predict the main verb chase as the next word, it predicted that it should be in the singular. The method by which the network achieved this ability is of particular interest. Once more, Elman explored the similarity structure in the hidden unit representations, using principal component analyses to identify the salient dimensions of similarity across which activation states were varying. This enabled him to reduce the high dimensionality of the internal states (150 hidden units were used) to a manageable number to visualize processing. Elman was then able to plot the trajectories of activation as the network altered its internal state in response to each subsequent input. Figure 2.5 depicts these trajectories as the network processes different multi-phrase sentences, plotted with reference to particular dimensions of principal component space. This figure demonstrates that the network adopted similar states in response to particular lexical items (e.g., tokens of boy, who, chases), but that it modified the pattern slightly according to the grammatical status of the word. In Figure 2.5a, the second principal component appears to encode singularity/plurality. Figure 2.5b traces the network's state as it processes two embedded relative clauses containing iterations of the same words. Each clause exhibits a related but slightly shifted triangular trajectory to encode its role in the syntactic structure.
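The visualization technique itself can be sketched in a few lines. Below, the matrix of hidden states recorded over the training set is random stand-in data (a trained SRN would supply it); principal components are obtained from the singular value decomposition, and one sentence's hidden states are projected onto a chosen component to trace its trajectory.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in: hidden-unit activations collected over the whole training set
# (rows = time steps, columns = 150 hidden units, as in Elman, 1993).
H_all = rng.normal(size=(5000, 150))

centered = H_all - H_all.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc2 = Vt[1]                              # second principal component (Fig. 2.5a)

# Project the hidden states evoked by one sentence onto that component.
H_sentence = rng.normal(size=(7, 150))   # e.g., "boy who boys chase chases boy"
trajectory = (H_sentence - H_all.mean(axis=0)) @ pc2
print(trajectory)                        # one coordinate per word, over time
```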


The importance of this model is that it prompts a different way to understand the processing of sentences. Previously, one would view symbols as possessing fixed identities and as being bound into particular grammatical roles via a syntactic construction. In the connectionist system, sentences are represented by trajectories through activation space in which the activation pattern for each word is subtly shifted according to the context of its usage. The implication is that the property of compositionality at the heart of the classical symbolic computational approach may not be necessary to process language.

Elman (1993) also used this model to investigate a possible advantage to learning that could be gained by initially restricting the complexity of the training set. At the start of training, the network had its memory reset (its context layer wiped) after every third or fourth word. This window was then increased in stages up to six to seven words across training. The manipulation was intended to capture maturational changes in working memory in children. Elman (1993) reported that starting small enhanced learning by allowing the network to build simpler internal representations that were later useful for unpacking the structure of more complex sentences (see Rohde & Plaut, 1999, for discussion and further simulations). This idea resonated with developmental psychologists in its demonstration of the way in which learning and maturation might interact in constructing cognition. It is an idea that could turn out to be a key principle in the organization of cognitive development (Elman et al., 1996).
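The "starting small" manipulation amounts to a training schedule on the memory window. The sketch below shows only that schedule; `train_step` is a hypothetical stand-in for one prediction-task update of an SRN, and the staged window sizes follow the description above.

```python
import numpy as np

def train_step(network, window):
    """Hypothetical placeholder: one backpropagation update on `window`,
    a short run of words after which the context layer is wiped."""
    pass

def starting_small(network, corpus, schedule=((3, 4), (4, 5), (5, 6), (6, 7))):
    # Each stage lets the network hold context over longer stretches,
    # mimicking maturational growth in working memory.
    for low, high in schedule:
        for _ in range(1000):                     # passes per stage
            size = np.random.randint(low, high + 1)
            start = np.random.randint(0, len(corpus) - size)
            train_step(network, corpus[start:start + size])
            # The context layer is reset between windows (memory wiped).
```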
3.3.1. what happened next?

Elman's simulations with the SRN and the prediction task produced striking results. The ability of the network to induce structured representations containing grammatical and semantic information from word sequences prompted the view that associative statistical learning mechanisms might play a much more central role in language acquisition. This innovation was especially welcome given that symbolic theories of sentence processing do not offer a ready account of language development. Indeed, they are largely identified with the nativist view that little in syntax develops. However, one limitation of the prior simulations is that the prediction task does not learn any categorizations over the input set. Although the simulations demonstrate that information important for language comprehension and production can be induced from word sequences, neither task is actually performed. The learned distinction between nouns and verbs apparent in the hidden unit representations is tied up with carrying out the prediction task. But to perform comprehension, for example, the SRN would need to learn categorizations from the word sequences, such as deciding which noun was the agent and which noun was the patient in a sentence, regardless of whether the sentence was presented in the active ("the dog chases the cat") or passive voice ("the cat is chased by the dog"). These types of computations are more complex and the network's solutions typically more impenetrable. Although SRNs have borne the promise of an inherently developmental connectionist theory of parsing, progress on a full model has been slow (see Christiansen & Chater, 2001; and chapter 17 of this volume). Parsing is a complex problem – it is not even clear what the output should be for a model of sentence comprehension. Should it be some intermediate depiction of agent–patient role assignments, some compound representation of roles and semantics, or a constantly updating mental model that processes each sentence in the context of the emerging discourse? Connectionist models of parsing await greater constraints from psycholinguistic evidence.

Nevertheless, some interesting preliminary findings have emerged. For example, some of the grammatical sentences that the SRN finds the hardest to predict are also the sentences that humans find the hardest to understand (e.g., center embedded structures like "the mouse the cat the dog bit chased ate the cheese") (Weckerly & Elman, 1992). These are sequences that place maximal load on encoding information in the network's internal recurrent loop, suggesting that recurrence may be a key computational primitive in language processing. Moreover, when the prediction task is replaced by a comprehension task (such as predicting the agent/patient status of the nouns in the sentence), the results are again suggestive.
Rather than building a syntactic structure for the whole sentence as a symbolic parser might, the network focuses on the predictability of lexical cues for identifying various syntactic structures (consistent with Bates and MacWhinney's [1989] Competition model of language development). The salience of lexical cues that each syntactic structure exploits and the processing load that each structure places on the recurrent loop makes them differentially vulnerable under damage. Here, neuropsychological findings from language breakdown and developmental language disorders have tended to follow the predictions of the connectionist account in the relative impairments that each syntactic construction should show (Dick et al., 2001, 2004; Thomas & Redington, 2004).

For more recent work and discussion of the use of SRNs in syntax processing, see Mayberry, Crocker, and Knoeferle (2005), Miikkulainen and Mayberry (1999), Morris, Cottrell, and Elman (2000), Rohde (2002), and Sharkey, Sharkey, and Jackson (2000).

Lastly, the impact of SRNs has not been restricted to language. These models have been usefully applied to other areas of cognition where sequential information is important. For example, Botvinick and Plaut (2004) have shown how this architecture can capture the control of routine sequences of actions without the need for schema hierarchies, and Cleeremans and Dienes (chapter 14 in this volume) show how SRNs have been applied to implicit learning.

In sum, then, Elman's work demonstrates how simple connectionist architectures can learn statistical regularities over temporal sequences. These systems may indeed be sufficient to produce many of the behaviors that linguists have described with grammatical rules. However, in the connectionist system, the underlying primitives are context-sensitive representations of words and trajectories of activation through recurrent circuits.

4. Related Models

Before considering the wider impact of connectionism on theories of cognition, we should note a number of other related approaches.

4.1. Cascade-Correlation and Incremental Neural Network Algorithms

Backpropagation networks specify input and output representations, whereas in self-organizing networks, only the inputs are specified. These networks therefore include some number of internal processing units whose activation states are determined by the learning algorithm. The number of internal units and their organization (e.g., into layers) plays an important role in determining the complexity of the problems or categories that the network can learn. In pattern associator networks, too few units and the network will fail to learn; in self-organizing networks, too few output units and the network will fail to provide good discrimination between the categories in the training set. How does the modeler select in advance the appropriate number of internal units? Indeed, for a cognitive model, should this be a decision that the modeler gets to make?

For pattern associator networks, the cascade correlation algorithm (Fahlman & Lebiere, 1990) addresses this problem by starting with a network that has no hidden units and then adding in these resources during learning as it becomes necessary to carry on improving on the task. New hidden units are added with weights from the input layer tailored so that the unit's activation correlates with network error, that is, the new unit responds to parts of the problem on which the network is currently doing poorly. New hidden units can also take input from existing hidden units, thereby creating detectors for higher order features in the problem space.
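A greatly simplified sketch of the recruitment step follows. Real cascade-correlation trains a pool of candidate units to maximize the correlation between their activations and the residual error before installing the best one; here a single candidate's input weights are nudged up the gradient of that covariance, which is an assumption-laden abbreviation of Fahlman and Lebiere's (1990) procedure.

```python
import numpy as np

rng = np.random.default_rng(5)

X = rng.normal(size=(100, 10))    # input patterns
residual = rng.normal(size=100)   # stand-in for per-pattern network error

# Train one candidate hidden unit to track the residual error.
w = rng.normal(0, 0.1, 10)
for _ in range(500):
    v = np.tanh(X @ w)
    cov = np.mean((v - v.mean()) * (residual - residual.mean()))
    # Gradient ascent on |covariance(unit activation, error)|.
    dv = np.sign(cov) * (residual - residual.mean()) * (1 - v ** 2)
    w += 0.01 * X.T @ dv / len(X)

# Once trained, the unit would be installed with frozen input weights and
# the output weights retrained; later candidates may also receive input
# from existing hidden units, building higher-order feature detectors.
```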


The cascade correlation algorithm has been widely used for studying cognitive development (Mareschal & Shultz, 1996; Shultz, 2003; Westermann, 1998), for example, in simulating children's performance in Piagetian reasoning tasks (see Section 5.2). The algorithm makes links with the constructivist approach to development (Quartz, 1993; Quartz & Sejnowski, 1997), which argues that increases in the complexity of children's cognitive abilities are best explained by the recruitment of additional neurocomputational resources with age and experience. Related models that also use this "incremental" approach to building network architectures can be found in the work of Carpenter and Grossberg (Adaptive Resonance Theory; e.g., Carpenter & Grossberg, 1987a, 1987b) and in the work of Love and colleagues (e.g., Love, Medin, & Gureckis, 2004).

4.2. Mixture-of-Experts Models

The preceding sections assume that only a single architecture is available to learn each problem. However, it may be that multiple architectures are available to learn a given problem, each with different computational properties. Which architecture will end up learning the problem? Moreover, what if a cognitive domain can be broken down into different parts, for example, in the way that the English past tense problem comprises regular and irregular verbs – could different computational components end up learning the different parts of the problem? The mixture-of-experts approach considers ways in which learning could take place in just such a system with multiple components available (Jacobs et al., 1991). In these models, functionally specialized structures can emerge as a result of learning, in the circumstance where the computational properties of the different components happen to line up with the demands presented by different parts of the problem domain (so-called structure-function correspondences).

During learning, mixture-of-experts algorithms typically permit the multiple components to compete with each other to deliver the correct output for each input pattern. The best performer is then assigned the pattern and allowed to learn it. The involvement of each component during functioning is controlled by a gating mechanism.
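The competitive scheme can be sketched as follows. Two linear experts and a softmax gate are illustrative simplifications of Jacobs et al.'s (1991) architecture: for each pattern, the expert with the smaller error receives the larger gating weight and therefore the larger share of the learning.

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_out = 10, 5
experts = [rng.normal(0, 0.1, (n_in, n_out)) for _ in range(2)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_step(x, target, lr=0.05):
    outputs = [x @ W for W in experts]
    errors = np.array([np.sum((target - o) ** 2) for o in outputs])
    gate = softmax(-errors)          # better experts get higher gating weight
    for W, o, g in zip(experts, outputs, gate):
        # Each expert's update is scaled by its gating weight, so the
        # current best performer learns the pattern most strongly.
        W += lr * g * np.outer(x, target - o)
    return gate

for _ in range(1000):
    x = rng.normal(size=n_in)
    moe_step(x, rng.normal(size=n_out))
```

Over training, this rich-get-richer dynamic is what lets experts carve up a domain, such as regular versus irregular verbs, when their computational properties suit different parts of the problem.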


Mixture-of-experts models are one of several approaches that seek to explain the origin of functionally specialized processing components in the cognitive system (see Elman et al., 1996; Jacobs, 1999; Thomas & Richardson, 2006, for discussion). An example of the application of mixture of experts can be found in a developmental model of face and object recognition, where different "expert" mechanisms come to specialize in processing visual inputs that correspond to faces and those that correspond to objects (Dailey & Cottrell, 1999). The emergence of this functional specialization can be demonstrated by damaging each expert in turn and showing a double dissociation between face and object recognition in the two components of the model (see Section 5.3). Similarly, Thomas and Karmiloff-Smith (2002a) showed how a mixture-of-experts model of English past tense could produce emergent specialization of separate mechanisms to regular and irregular verbs, respectively (see also Westermann, 1998, for related work with a constructivist network).

4.3. Hybrid Models

The success of mixture-of-experts models suggests that when two or more components are combined within a model, it can be advantageous for the computational properties of the components to differ. Where the properties of the components are radically different, for example, involving the combination of symbolic (rule-based) and connectionist (associative, similarity-based) architectures, the models are sometimes referred to as "hybrid." The use of hybrid models is inspired by the observation that some aspects of human cognition seem better described by rules (e.g., syntax, reasoning), whereas some seem better described by similarity (e.g., perception, memory). We have previously encountered the debate between symbolic and connectionist approaches (see Section 2.3) and the proposal that connectionist architectures may serve to implement symbolic processes (e.g., Touretzky & Hinton, 1988). The hybrid systems approach takes the alternative view that connectionist and symbolic processing principles should be combined within the same model, taking advantage of the strengths of each computational formalism. A discussion of this approach can be found in Sun (2002a, 2002b). Example models include CONSYDERR (Sun, 1995), CLARION (Sun & Peterson, 1998), and ACT-R (Anderson & Lebiere, 1998).

An alternative to a truly hybrid approach is to develop a multi-part connectionist architecture that has components that employ different representational formats. Such a system may be described as having "heterogeneous" computational components. For example, in a purely connectionist system, one component might employ distributed representations that permit different degrees of similarity between activation patterns, whereas a second component employs localist representations in which there is no similarity between different representations. Behavior is then driven by the interplay between two associative components that employ different similarity structures. One example of a heterogeneous, multiple-component architecture is the complementary learning systems model of McClelland, McNaughton, and O'Reilly (1995), which employs localist representations to encode individual episodic memories but distributed representations to encode general semantic memories. Heterogeneous developmental models may offer new ways to conceive of the acquisition of concepts. For example, the cognitive domain of number may be viewed as heterogeneous in the sense that it combines three systems: the similarity-based representations of quantity, the localist representations of number facts (such as the order of number labels in counting), and a system for object individuation. Carey and Sarnecka (2006) argue that a heterogeneous multiple component system of this nature could acquire the concept of positive integers even though such a concept could not be acquired by any single component of the system on its own.

4.4. Bayesian Graphical Models

The use of Bayesian methods of inference in graphical models, including causal graphical models, has recently been embraced by a number of cognitive scientists (Chater, Tenenbaum, & Yuille, 2006; Gopnik et al., 2004; see chapter 3 in this volume). This approach stresses how it may be possible to combine prior knowledge in the form of a set of explicit alternative graph structures and constraints on the complexity of such structures with Bayesian methods of inference to select the best type of representation of a particular data set (e.g., lists of facts about many different animals); and within that, to select the best specific instantiation of a representation of that type (Tenenbaum, Griffiths, & Kemp, 2006). These models are useful contributions to our understanding, particularly because they allow explicit exploration of the role of prior knowledge in the selection of a representation of the structure present in each data set. It should be recognized, however, that such models are offered as characterizations of learning at Marr's (1982) "Computational Level" and as such they do not specify the representations and processes that are actually employed when people learn. These models do raise questions for connectionist research that does address such questions, however. Specifically, the work provides a benchmark against which connectionist approaches might be tested for their success in learning to represent the structure from a data set and in using such a structure to make inferences consistent with optimal performance according to a Bayesian approach within a graphical model framework. More substantively, the work raises questions about whether or not optimization depends on the explicit representation of alternative structured representations or whether an approximation to such structured representations can arise without their prespecification. For an initial examination of these issues as they arise in the context of causal inference, see McClelland and Thompson (2007).
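The flavor of this Bayesian selection can be conveyed with a deliberately tiny example, not drawn from the papers cited: two candidate "structures" for a set of binary observations, a single category versus two categories split by a known feature, are compared by their marginal likelihoods under uniform Beta priors.

```python
from math import lgamma

def log_marginal(k, n):
    """Log marginal likelihood of k successes in n trials under a
    uniform Beta(1, 1) prior on the success probability."""
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

# Toy data: counts of a property in two groups of animals (assumed values).
k1, n1 = 9, 10   # e.g., "has fur" among one group
k2, n2 = 1, 10   # e.g., "has fur" among another group

one_category = log_marginal(k1 + k2, n1 + n2)
two_categories = log_marginal(k1, n1) + log_marginal(k2, n2)

# The Bayes factor scores which structure better explains the data,
# automatically penalizing the extra flexibility of the split model.
print("log Bayes factor (two vs. one):", two_categories - one_category)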

5. Connectionist Influences on Cognitive Theory

Connectionism offers an explanation of human cognition because instances of behavior in particular cognitive domains can be explained with respect to a set of general principles (parallel distributed processing) and
the conditions of the specific domains. However, from the accumulation of successful models, it is also possible to discern a wider influence of connectionism on the nature of theorizing about cognition, and this is perhaps a truer reflection of its impact. How has connectionism made us think differently about cognition?

5.1. Knowledge Versus Processing

One area where connectionism has changed the basic nature of theorizing is memory. According to the old model of memory based on the classical computational metaphor, the information in long-term memory (e.g., on the hard disk) has to be moved into working memory (the central processing unit, or CPU) for it to be operated on, and the long-term memories are laid down via a domain-general buffer of short-term memory (random access memory, or RAM). In this type of system, it is relatively easy to shift informational content between different systems, back and forth between central processing and short and long-term stores. Computation is predicated on variables: the same binary string can readily be instantiated in different memory registers or encoded onto a permanent medium.

By contrast, knowledge is hard to move about in connectionist networks because it is encoded in the weights. For example, in the past-tense model, knowledge of the past tense rule "add -ed" is distributed across the weight matrix of the connections between input and output layers. The difficulty in portability of knowledge is inherent in the principles of connectionism – Hebbian learning alters connection strengths to reinforce desirable activation states in connected units, tying knowledge to structure. If we start from the premise that knowledge will be very difficult to move about in our information processing system, what kind of cognitive architecture do we end up with? There are four main themes.

First, we need to distinguish between two different ways in which knowledge can be encoded: active and latent representations (Munakata & McClelland, 2003). Latent knowledge corresponds to the information stored in the connection weights from accumulated experience. By contrast, active knowledge is information contained in the current activation states of the system. Clearly, the two are related because the activation states are constrained by the connection weights. But, particularly in recurrent networks, there can be subtle differences. Active states contain a trace of recent events (how things are at the moment), whereas latent knowledge represents a history of experience (how things tend to be). Differences in the ability to maintain the active states (e.g., in the strength of recurrent circuits) can produce errors in behavior where the system lapses into more typical ways of behaving (Munakata, 1998; Morton & Munakata, 2002).

Second, if information does need to be moved around the system, for example, from a more instance-based (episodic) system to a more general (semantic) system, this will require special structures and special (potentially time-consuming) processes. Thus McClelland, McNaughton, and O'Reilly (1995) proposed a dialogue between separate stores in the hippocampus and neocortex to gradually transfer knowledge from episodic to semantic memory. French, Ans, and Rousset (2001) proposed a special method to transfer knowledge between two memory systems: internally generated noise produces "pseudopatterns" from one system that contain the central tendencies of its knowledge; the second memory system is then trained with this extracted knowledge to effect the transfer. Chapters 7 and 8 in the current volume offer a wider consideration of models of episodic and semantic memory, respectively.
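The pseudopattern idea can be sketched directly. The network shapes and sizes here are arbitrary assumptions: random probe inputs are passed through a trained "source" network, and the resulting input-output pairs, the pseudopatterns, are used to train a second network, transferring knowledge without access to the original training items.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

W_source = rng.normal(size=(10, 5))   # stands in for a trained source network

# Generate pseudopatterns: random probes plus the source's responses to them.
probes = rng.integers(0, 2, size=(200, 10)).astype(float)
targets = sigmoid(probes @ W_source)

# Train the second memory system on the pseudopatterns (delta rule).
W_target = np.zeros((10, 5))
for _ in range(500):
    out = sigmoid(probes @ W_target)
    W_target += 0.1 * probes.T @ ((targets - out) * out * (1 - out)) / len(probes)
```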


Third, information will be processed in the same substrate where it is stored. Therefore, long-term memories will be active structures and will perform computations on content. An external strategic control system plays the role of differentially activating the knowledge in this long-term system that is relevant to the current context. In anatomical terms, this distinction broadly corresponds to frontal/anterior (strategic control) and posterior (long-term) cortex. The design means, somewhat counterintuitively, that the control system has no content. Rather, the control system contains placeholders that serve to activate different regions of the long-term system. The control system may contain plans (sequences of placeholders) and it may be involved in learning abstract concepts (using a placeholder to temporarily coactivate previously unrelated portions of long-term knowledge while Hebbian learning builds an association between them) but it does not contain content in the sense of a domain-general working memory. The study of frontal systems then becomes an exploration of the activation dynamics of these placeholders and their involvement in learning (see, e.g., work by Davelaar & Usher, 2003; Haarmann & Usher, 2001; O'Reilly, Braver, & Cohen, 1999; Usher & McClelland, 2001).

Similarly, connectionist research has explored how activity in the control system can be used to modulate the efficiency of processing elsewhere in the system, for instance, to implement selective attention. For example, Cohen et al. (1990) demonstrated how task units could be used to differentially modulate word-naming and color-naming processing channels in a model of the color-word Stroop task. In this model, latent knowledge interacted with the operation of task control, so that it was harder to selectively attend to color-naming and ignore information from the more practiced word-naming channel than vice versa. This work was later extended to demonstrate how deficits in the strategic control system (prefrontal cortex) could lead to problems in selective attention in disorders such as schizophrenia (Cohen & Servan-Schreiber, 1992). Chapter 15 in this volume contains a wider consideration of computational models of attention and cognitive control.
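A toy version of the task-unit mechanism is sketched below. The pathway strengths and the response measure are invented for illustration (the published model used cascaded activation and response thresholds); the property retained is that the word pathway is stronger through practice, so task control must work harder to favor color naming.

```python
# Pathway strengths: word reading is more practiced than color naming.
w_word, w_color = 2.0, 1.0    # assumed values, not the published ones
task_bias = 1.5               # top-down input from task demand units

def color_response_strength(attend_color: bool) -> float:
    """Net evidence for the color response on an incongruent trial."""
    color_in = w_color + (task_bias if attend_color else 0.0)
    word_in = w_word + (0.0 if attend_color else task_bias)
    return color_in - word_in  # positive -> color response wins

# With task bias, color naming wins only by a narrow margin, whereas word
# reading wins easily, reproducing the asymmetry of Stroop interference.
print(color_response_strength(attend_color=True))    # 0.5
print(color_response_strength(attend_color=False))   # -2.5
```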


Lastly, the connectionist perspective on memory alters how we conceive of domain generality in processing systems. It is unlikely that there are any domain-general processing systems that serve as a "Jack-of-all-trades," that is, that can move between representing the content of multiple domains. However, there may be domain-general systems that are involved in modulating many disparate processes without taking on the content of those systems, what we might call a system with "a finger in every pie." Meanwhile, short-term or working memory (as exemplified by the active representations contained in the recurrent loop of a network) is likely to exist as a devolved panoply of discrete systems, each with its own content-specific loop. For example, research in the neuropsychology of language now tends to support the existence of separate working memories for phonological, semantic, and syntactic information (see MacDonald & Christiansen, 2002, for a discussion of these arguments).

5.2. Cognitive Development

A key feature of PDP models is the use of a learning algorithm for modifying the patterns of connectivity as a function of experience. Compared with symbolic, rule-based computational models, this has made them a more sympathetic formalism for studying cognitive development (Elman et al., 1996). The combination of domain-general processing principles, domain-specific architectural constraints, and structured training environments has enabled connectionist models to give accounts of a range of developmental phenomena. These include infant category development, language acquisition, and reasoning in children (see Mareschal & Thomas, 2007, for a recent review, and chapter 16 in this volume).

Connectionism has become aligned with a resurgence of interest in statistical learning and a more careful consideration of the information available in the child's environment that may feed his or her cognitive development. One central debate revolves around how children can become "cleverer" as they get older, appearing to progress through qualitatively different stages of reasoning. Connectionist modeling of the development of children's reasoning was able to demonstrate that continuous incremental changes in the weight matrix driven by algorithms such as backpropagation can result in nonlinear changes in surface behavior, suggesting that the stages apparent in behavior may not necessarily be reflected in changes in the underlying mechanism (e.g., McClelland, 1989). Other connectionists have argued that algorithms able to supplement the computational resources of the network as part of learning may also provide an explanation for the emergence of more complex forms of behavior with age (e.g., cascade correlation; see Shultz, 2003).

The key contribution of connectionist models in the area of developmental psychology has been to specify detailed, implemented models of transition mechanisms that demonstrate how the child can move between producing different patterns of behavior. This was a crucial addition to a field that has accumulated vast amounts of empirical data cataloging what children are able to do at different ages. The specification of mechanism is also important to counter some strongly empiricist views that simply identifying statistical information in the environment suffices as an explanation of development; instead, it is necessary to show how a mechanism could use this statistical information to acquire some cognitive capacity. Moreover, when connectionist models are applied to development, it often becomes apparent that passive statistical structure is not the key factor; rather, the relevant statistics are in the transformation of the statistical structure of the environment to the output or the behavior that is relevant to the child, thereby appealing to notions like the regularity, consistency, and frequency of input-output mappings.

Recent connectionist approaches to development have begun to explore how the computational formalisms may change our understanding of the nature of the knowledge that children acquire. For example, Mareschal et al. (2007) argue that many mental representations of knowledge are partial (i.e., capture only some task-relevant dimensions); the existence of explicit language may blind us to the fact that there could be a limited role for truly abstract knowledge in the normal operation of the cognitive system (see Westermann et al., 2007). Current work also explores the computational basis of critical or sensitive periods in development, uncovering the mechanisms by which the ability to learn appears to reduce with age (e.g., McClelland et al., 1999; Thomas & Johnson, 2006).

5.3. The Study of Acquired Disorders in Cognitive Neuropsychology

Traditional cognitive neuropsychology of the 1980s was predicated on the assumption of underlying modular structure, that is, that the cognitive system comprises a set of independently functioning components. Patterns of selective cognitive impairment after acquired brain damage could then be used to construct models of normal cognitive function. The traditional models comprised box-and-arrow diagrams that sketched out rough versions of cognitive architecture, informed both by the patterns of possible selective deficit (which bits can fail independently) and by a task analysis of what the cognitive system probably has to do.

In the initial formulation of cognitive neuropsychology, caution was advised in attempting to infer cognitive architecture from behavioral deficits, since a given pattern of deficits might be consistent with a number of underlying architectures (Shallice, 1988). It is in this capacity that connectionist models have been extremely useful. They have both forced more detailed specification of proposed cognitive models via implementation and also permitted assessment of the range of deficits that can be generated by damaging these models in various ways. For example, models of reading have demonstrated that the ability to decode written words into spoken words and recover their meanings can be learned in a connectionist network; and when this network is damaged by, say, lesioning connection weights or removing hidden units, various patterns of acquired dyslexia can be simulated (e.g., Plaut et al., 1996; Plaut & Shallice, 1993).
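Damage manipulations of this sort are easy to express computationally. In the sketch below, the trained network is a random stand-in, and "lesioning" is implemented both as deleting a proportion of connection weights and as removing whole hidden units; performance on different item types would then be compared across damage types and severities.

```python
import numpy as np

rng = np.random.default_rng(8)
W1 = rng.normal(size=(30, 40))   # stand-ins for trained weights
W2 = rng.normal(size=(40, 20))   # (input -> hidden, hidden -> output)

def lesion_weights(W, proportion):
    """Zero a random proportion of connection weights."""
    mask = rng.random(W.shape) >= proportion
    return W * mask

def remove_hidden_units(W1, W2, n_removed):
    """Silence whole hidden units by zeroing their incoming and outgoing weights."""
    units = rng.choice(W1.shape[1], size=n_removed, replace=False)
    W1_les, W2_les = W1.copy(), W2.copy()
    W1_les[:, units] = 0.0
    W2_les[units, :] = 0.0
    return W1_les, W2_les

# Severity is graded: e.g., compare mild versus severe weight lesions.
mild = lesion_weights(W1, 0.1)
severe = lesion_weights(W1, 0.5)
```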


Connectionist models of acquired deficits have grown to be an influential aspect of cognitive neuropsychology and have been applied to domains such as language, memory, semantics, and vision (see Cohen, Johnstone, & Plunkett, 2000, for examples).

Several ideas have gained their first or clearest grounding via connectionist modeling. One of these ideas is that patterns of breakdown can arise from the statistics of the problem space (i.e., the mapping between input and output) rather than from structural distinctions in the processing system. In particular, connectionist models have shed light on a principal inferential tool of cognitive neuropsychology, the double dissociation. The line of reasoning argues that if in one patient, ability A can be lost while ability B is intact, and in a second patient, ability B can be lost while ability A is intact, then the two abilities might be generated by independent underlying mechanisms. In a connectionist model of category-specific impairments of semantic memory, Devlin et al. (1997) demonstrated that a single undifferentiated network trained to produce two behaviors could show a double dissociation between them simply as a consequence of different levels of damage. This can arise because the mappings associated with the two behaviors lead them to have different sensitivity to damage. For a small level of damage, performance on A may fall off quickly, whereas performance on B declines more slowly; for a high level of damage, A may be more robust than B. The reverse pattern of relative deficits implies nothing about structure.

Connectionist researchers have often set out to demonstrate that, more generally, double dissociation methodology is a flawed form of inference, on the grounds that such dissociations arise relatively easily from parallel distributed architectures where function is spread across the whole mechanism (e.g., Plunkett & Bandelow, 2006; Juola & Plunkett, 2000). However, on the whole, when connectionist models show robust double dissociations between two behaviors (for equivalent levels of damage applied to various parts of the network and over many replications), it does tend to be because different internal processing structures (units or layers or weights) or different parts of the input layer or different parts of the output layer are differentially important for driving the two behaviors – that is, there is specialization of function. Connectionist models of breakdown have, therefore, tended to support the traditional inferences. Crucially, however, connectionist models have greatly improved our understanding of what modularity might look like in a neurocomputational system: a partial rather than an absolute property; a property that is the consequence of a developmental process where emergent specialization is driven by structure-function correspondences (the ability of certain parts of a computational structure to learn certain kinds of computation better than other kinds; see Section 4.2); and a property that must now be complemented by concepts such as division of labor, degeneracy, interactivity, and redundancy (see Thomas & Karmiloff-Smith, 2002a; Thomas et al., 2006, for discussion).

5.4. The Origins of Individual Variability and Developmental Disorders

In addition to their role in studying acquired disorders, the fact that many connectionist models learn their cognitive abilities makes them an ideal framework within which to study developmental disorders, such as autism, dyslexia, and specific language impairment (Joanisse & Seidenberg, 2003; Mareschal et al., 2007; Thomas & Karmiloff-Smith, 2002b, 2003a, 2005). Where models of normal cognitive development seek to study the "average" child, models of atypical development explore how developmental profiles may be disrupted. Connectionist models contain a number of constraints (architecture, activation dynamics, input and output representations, learning algorithm, training regimen) that determine the efficiency and outcome of learning. Manipulations to these constraints produce candidate explanations for impairments found in developmental disorders or for the impairments caused by exposure to atypical environments, such as in cases of deprivation.
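This style of investigation reduces to sweeping the model's constraint set. The sketch below is schematic: `build_and_train` is a hypothetical function standing in for any of the developmental models above (here it returns a dummy trajectory), and the parameter grid illustrates the kinds of manipulations, such as resources, plasticity, and processing noise, used to simulate atypical trajectories.

```python
from itertools import product

def build_and_train(n_hidden, learning_rate, noise):
    """Hypothetical stand-in for training a developmental model under one
    constraint setting; returns accuracy at successive points in training."""
    return [min(1.0, learning_rate * n_hidden * (1 - noise) * t / 100)
            for t in range(1, 11)]

# Candidate computational causes of atypical development.
grid = product([10, 25, 50],       # processing resources (hidden units)
               [0.01, 0.1, 0.5],   # plasticity (learning rate)
               [0.0, 0.2, 0.5])    # internal processing noise

trajectories = {}
for n_hidden, lr, noise in grid:
    trajectories[(n_hidden, lr, noise)] = build_and_train(n_hidden, lr, noise)
# Each trajectory is then compared against typical and disordered profiles.
```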


In the 1980s and 1990s, many theories of developmental deficits employed the same explanatory framework as adult cognitive neuropsychology. There was a search for specific developmental deficits or dissociations, which were then explained in terms of the failure of individual modules to develop. However, as Karmiloff-Smith (1998) and Bishop (1997) pointed out, most of the developmental deficits were actually being explained with reference to nondevelopmental, static, and sometimes adult models of normal cognitive structure. Karmiloff-Smith (1998) argued that the causes of developmental deficits of a genetic origin will lie in changes to low-level neurocomputational properties that only exert their influence on cognition via an extended atypical developmental process (see also Elman et al., 1996). Connectionist models provide the ideal forum to explore the thesis that an understanding of the constraints on the developmental process is essential for generating accounts of developmental deficits.

The study of atypical variability also prompts a consideration of what causes variability within the normal range, otherwise known as individual differences or intelligence. Are differences in intelligence caused by variation in the same computational parameters that can cause disorders? Are some developmental disorders just the extreme lower end of the normal distribution or are they qualitatively different conditions? What computational parameter settings are able to produce above-average performance or giftedness? Connectionism has begun to take advantage of the accumulated body of models of normal development to consider the wider question of cognitive variation in parameterized computational models (Thomas & Karmiloff-Smith, 2003a).

5.5. Future Directions

The preceding sections indicate the range and depth of influence of connectionism on contemporary theories of cognition. Where will connectionism go next? Necessarily, connectionism began with simple models of individual cognitive processes, focusing on those domains of particular theoretical interest. This piecemeal approach generated explanations of individual cognitive abilities using bespoke networks, each containing its own predetermined representations and architecture. In the future, one avenue to pursue is how these models fit together in the larger cognitive system – for example, to explain how the past tense network described in Section 3.2 might link up with the sentence processing model described in Section 3.3 to process past tenses as they arise in sentences. A further issue is to address the developmental origin of the architectures that are postulated. What processes specify the parts of the cognitive system to perform the various functions and how do these subsystems talk to each other, both across development and in the adult state? Improvements in computational power will aid more complex modeling endeavors. Nevertheless, it is worth bearing in mind that increasing complexity creates a tension with the primary goals of modeling – simplification and understanding. It is essential that we understand why more complicated models function as they do or they will merely become interesting artifacts (see Elman, 2005; Thomas, 2004, for further discussion).

In terms of its relation with other disciplines, a number of future influences on connectionism are discernible. Connectionism will be affected by the increasing appeal to Bayesian probability theory in human reasoning. In Bayesian theory, new data are used to update existing estimates of the most likely model of the world. Work has already begun to relate connectionist and Bayesian accounts, for example, in the domain of causal reasoning in children (McClelland & Thompson, 2007). In some cases, connectionism may offer alternative explanations of the same behavior; in others it may be viewed as an implementation of a Bayesian account (see Section 3.1).
Connectionism will continue to have a close relation to neuroscience, perhaps seeking to build more neural constraints into its computational assumptions (O'Reilly & Munakata, 2000). Many of the new findings in cognitive neuroscience are influenced by functional brain imaging techniques. It will be important, therefore, for connectionism to make contact with these data, either via systems-level modeling of the interaction between subnetworks in task performance or in exploring the implications of the subtraction methodology as a tool for assessing the behavior of distributed interactive systems. The increasing influence of brain imaging foregrounds the relation of cognition to the neural substrate; it depends on how seriously one takes the neural plausibility of connectionist models as to whether an increased focus on the substrate will have particular implications for connectionism over and above any other theory of cognition.

Connectionist approaches to individual differences and developmental disorders suggest that this modeling approach has more to offer in considering the computational causes of variability. Research in behavioral genetics argues that a significant proportion of behavioral variability is genetic in origin (Bishop, 2006; Plomin, Owen, & McGuffin, 1994). However, the neurodevelopmental mechanisms by which genes produce such variation are largely unknown. Although connectionist cognitive models are not neural, the fact that they incorporate neurally inspired properties may allow them to build links between behavior (where variability is measured) and the substrate on which genetic effects act. In the future, connectionism may therefore help to rectify a major shortcoming in our attempts to understand the relation of the human genome to human behavior – the omission of cognition from current explanations.

6. Conclusions

In this chapter, we have considered the contribution of connectionist modeling to our understanding of cognition. Connectionism was placed in the historical context of nineteenth-century associative theories of mental processes and twentieth-century attempts to understand the computations carried out by networks of neurons. The key properties of connectionist networks were then reviewed, and particular emphasis was placed on the use of learning to build the microstructure of these models. The core connectionist themes include the following: (1) that processing is simultaneously influenced by multiple sources of information at different levels of abstraction, operating via soft constraint satisfaction; (2) that representations are spread across multiple simple processing units operating in parallel; (3) that representations are graded, context-sensitive, and the emergent product of adaptive processes; and (4) that computation is similarity-based and driven by the statistical structure of problem domains, but it can nevertheless produce rule-following behavior. We illustrated the connectionist approach via three landmark models, the Interactive Activation model of letter perception (McClelland & Rumelhart, 1981), the past tense model (Rumelhart & McClelland, 1986), and simple recurrent networks for finding structure in time (Elman, 1990). Apart from its body of successful individual models, connectionist theory has had a widespread influence on cognitive theorizing, and this influence was illustrated by considering connectionist contributions to our understanding of memory, cognitive development, acquired cognitive impairments, and developmental deficits. Finally, we peeked into the future of connectionism, arguing that its relationships with other fields in the cognitive sciences are likely to guide its future contribution to understanding the mechanistic basis of thought.

We illustrated the connectionist approach via three landmark models: the Interactive Activation model of letter perception (McClelland & Rumelhart, 1981), the past tense model (Rumelhart & McClelland, 1986), and simple recurrent networks for finding structure in time (Elman, 1990). Apart from its body of successful individual models, connectionist theory has had a widespread influence on cognitive theorizing, and this influence was illustrated by considering connectionist contributions to our understanding of memory, cognitive development, acquired cognitive impairments, and developmental deficits. Finally, we peeked into the future of connectionism, arguing that its relationships with other fields in the cognitive sciences are likely to guide its future contribution to understanding the mechanistic basis of thought.

7. Acknowledgements

This work was supported by British Academy Grant SG–40400 and UK Medical Research Council Grant G0300188 to Michael Thomas, and National Institute of Mental Health Centre Grant P50 MH64445, James L. McClelland, Director.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Aizawa, K. (2004). History of connectionism. In C. Eliasmith (Ed.), Dictionary of philosophy of mind. Retrieved October 4, 2006, from http://philosophy.uwaterloo.ca/MindDict/connectionismhistory.html.
Anderson, J. A. (1977). Neural models with cognitive implications. In D. LaBerge & S. J. Samuels (Eds.), Basic processes in reading perception and comprehension (pp. 27–90). Hillsdale, NJ: Lawrence Erlbaum.
Anderson, J., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Lawrence Erlbaum Associates.
Bates, E., & MacWhinney, B. (1989). Functionalism and the competition model. In B. MacWhinney & E. Bates (Eds.), The crosslinguistic study of language processing (pp. 3–37). New York: Cambridge University Press.
Bechtel, W., & Abrahamsen, A. (1991). Connectionism and the mind. Oxford, UK: Blackwell.
Berko, J. (1958). The child's learning of English morphology. Word, 14, 150–177.
Bishop, D. V. M. (1997). Cognitive neuropsychology and developmental disorders: Uncomfortable bedfellows. Quarterly Journal of Experimental Psychology, 50A, 899–923.
Bishop, D. V. M. (2006). Developmental cognitive genetics: How psychology can inform genetics and vice versa. Quarterly Journal of Experimental Psychology, 59(7), 1153–1168.
Botvinick, M., & Plaut, D. C. (2004). Doing without schema hierarchies: A recurrent connectionist approach to normal and impaired routine sequential action. Psychological Review, 111, 395–429.
Burton, A. M., Bruce, V., & Johnston, R. A. (1990). Understanding face recognition with an interactive activation model. British Journal of Psychology, 81, 361–380.
Bybee, J., & McClelland, J. L. (2005). Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review, 22(2–4), 381–410.
Carey, S., & Sarnecka, B. W. (2006). The development of human conceptual representations: A case study. In Y. Munakata & M. H. Johnson (Eds.), Processes of change in brain and cognitive development: Attention and performance XXI (pp. 473–496). Oxford, UK: Oxford University Press.
Carpenter, G. A., & Grossberg, S. (1987a). ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.
Carpenter, G. A., & Grossberg, S. (1987b). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 37, 54–115.
Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 10(7), 287–291.
Christiansen, M. H., & Chater, N. (2001). Connectionist psycholinguistics. Westport, CT: Ablex.
Cohen, G., Johnstone, R. A., & Plunkett, K. (2000). Exploring cognition: Damaged brains and neural networks. Hove, Sussex, UK: Psychology Press.
Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332–361.
Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychological Review, 99, 45–77.
Dailey, M. N., & Cottrell, G. W. (1999). Organization of face and object recognition in modular neural networks. Neural Networks, 12, 1053–1074.
Davelaar, E. J., & Usher, M. (2003). An activation-based theory of immediate item memory. In J. A. Bullinaria & W. Lowe (Eds.), Proceedings of the Seventh Neural Computation and Psychology Workshop: Connectionist models of cognition and perception (pp. 118–130). Singapore: World Scientific.
Davies, M. (2005). Cognitive science. In F. Jackson & M. Smith (Eds.), The Oxford handbook of contemporary philosophy (pp. 358–394). Oxford: Oxford University Press.
Devlin, J., Gonnerman, L., Andersen, E., & Seidenberg, M. S. (1997). Category specific semantic deficits in focal and widespread brain damage: A computational account. Journal of Cognitive Neuroscience, 10, 77–94.
Dick, F., Bates, E., Wulfeck, B., Aydelott, J., Dronkers, N., & Gernsbacher, M. A. (2001). Language deficits, localization, and grammar: Evidence for a distributive model of language breakdown in aphasic patients and neurologically intact individuals. Psychological Review, 108(3), 759–788.
Dick, F., Wulfeck, B., Krupa-Kwiatkowski, M., & Bates, E. (2004). The development of complex sentence interpretation in typically developing children compared with children with specific language impairments or early unilateral focal lesions. Developmental Science, 7(3), 360–377.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195–224.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99.
Elman, J. L. (2005). Connectionist models of cognitive development: Where next? Trends in Cognitive Sciences, 9, 111–117.
Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Ervin, S. M. (1964). Imitation and structural change in children's language. In E. H. Lenneberg (Ed.), New directions in the study of language (pp. 163–189). Cambridge, MA: MIT Press.
Fahlman, S., & Lebiere, C. (1990). The cascade correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing 2 (pp. 524–532). Los Altos, CA: Morgan Kaufmann.
Feldman, J. A. (1981). A connectionist model of visual memory. In G. E. Hinton & J. A. Anderson (Eds.), Parallel models of associative memory (pp. 49–81). Hillsdale, NJ: Lawrence Erlbaum.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
French, R. M., Ans, B., & Rousset, S. (2001). Pseudopatterns and dual-network memory models: Advantages and shortcomings. In R. French & J. Sougné (Eds.), Connectionist models of learning, development and evolution (pp. 13–22). London: Springer.
Freud, S. (1895). Project for a scientific psychology. In J. Strachey (Ed.), The standard edition of the complete psychological works of Sigmund Freud (pp. 283–360). London: The Hogarth Press and the Institute of Psycho-Analysis.
Goebel, R., & Indefrey, P. (2000). A recurrent network with short-term memory capacity learning the German –s plural. In P. Broeder & J. Murre (Eds.), Models of language acquisition: Inductive and deductive approaches (pp. 177–200). Oxford, UK: Oxford University Press.
Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1), 3–32.
Green, D. C. (1998). Are connectionist models theories of cognition? Psycoloquy, 9(4).
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Haarmann, H., & Usher, M. (2001). Maintenance of semantic information in capacity limited item short-term memory. Psychonomic Bulletin & Review, 8, 568–578.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological approach. New York: John Wiley & Sons.
Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
Hinton, G. E., & Anderson, J. A. (1981). Parallel models of associative memory. Hillsdale, NJ: Lawrence Erlbaum.
Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural Information Processing Systems, 1987 (pp. 358–366). New York: American Institute of Physics.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Hinton, G. E., & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). Washington, DC.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing, Vol. 1 (pp. 282–317). Cambridge, MA: MIT Press.
Hoeffner, J. H., & McClelland, J. L. (1993). Can a perceptual processing deficit explain the impairment of inflectional morphology in developmental dysphasia? A computational investigation. In E. V. Clark (Ed.), Proceedings of the 25th Child language research forum (pp. 38–49). Stanford, CA: Center for the Study of Language and Information.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Science USA, 79, 2554–2558.
Houghton, G. (2005). Connectionist models in cognitive psychology. Hove: Psychology Press.
Jacobs, R. A. (1999). Computational studies of the development of functionally specialized neural modules. Trends in Cognitive Sciences, 3, 31–38.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
James, W. (1890). Principles of psychology. New York, NY: Holt.
Joanisse, M. F., & Seidenberg, M. S. (1999). Impairments in verb morphology following brain injury: A connectionist model. Proceedings of the National Academy of Science USA, 96, 7592–7597.
Joanisse, M. F., & Seidenberg, M. S. (2003). Phonology and syntax in specific language impairment: Evidence from a connectionist model. Brain and Language, 86, 40–56.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531–546). Hillsdale, NJ: Lawrence Erlbaum.
Juola, P., & Plunkett, K. (2000). Why double dissociations don't mean much. In G. Cohen, R. A. Johnston, & K. Plunkett (Eds.), Exploring cognition: Damaged brains and neural networks: Readings in cognitive neuropsychology and connectionist modelling (pp. 319–327). Hove, UK: Psychology Press.
Karmiloff-Smith, A. (1998). Development itself is the key to understanding developmental disorders. Trends in Cognitive Sciences, 2, 389–398.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kuczaj, S. A. (1977). The acquisition of regular and irregular past tense forms. Journal of Verbal Learning and Verbal Behavior, 16, 589–600.
Lashley, K. S. (1929). Brain mechanisms and intelligence: A quantitative study of injuries to the brain. New York: Dover.
Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111, 309–332.
MacDonald, M. C., & Christiansen, M. H. (2002). Reassessing working memory: A comment on Just & Carpenter (1992) and Waters & Caplan (1996). Psychological Review, 109, 35–54.
MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press.
Marcus, G., Pinker, S., Ullman, M., Hollander, J., Rosen, T., & Xu, F. (1992). Overregularisation in language acquisition. Monographs of the Society for Research in Child Development, 57 (Serial No. 228).
Mareschal, D., Johnson, M., Sirois, S., Spratling, M., Thomas, M. S. C., & Westermann, G. (2007). Neuroconstructivism: How the brain constructs cognition. Oxford, UK: Oxford University Press.
Mareschal, D., & Shultz, T. R. (1996). Generative connectionist architectures and constructivist cognitive development. Cognitive Development, 11, 571–605.
Mareschal, D., & Thomas, M. S. C. (2007). Computational modeling in developmental psychology. IEEE Transactions on Evolutionary Computation, 11(2), 137–150.
Marr, D. (1982). Vision. San Francisco: W. H. Freeman.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Mayberry, M. R., Crocker, M., & Knoeferle, P. (2005). A connectionist model of sentence comprehension in visual worlds. In Proceedings of the 27th Annual Conference of the Cognitive Science Society (COGSCI-05, Stresa, Italy). Mahwah, NJ: Erlbaum.
McClelland, J. L. (1981). Retrieving general and specific information from stored knowledge of specifics. In Proceedings of the Third Annual Meeting of the Cognitive Science Society (pp. 170–172). Hillsdale, NJ: Lawrence Erlbaum Associates.
McClelland, J. L. (1989). Parallel distributed processing: Implications for cognition and development. In R. G. M. Morris (Ed.), Parallel distributed processing, implications for psychology and neurobiology (pp. 8–45). Oxford, UK: Clarendon Press.
McClelland, J. L. (1998). Connectionist models and Bayesian inference. In M. Oaksford & N. Chater (Eds.), Rational models of cognition (pp. 21–53). Oxford, UK: Oxford University Press.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McClelland, J. L., Plaut, D. C., Gotts, S. J., & Maia, T. V. (2003). Developing a domain-general framework for cognition: What is the best approach? Commentary on a target article by Anderson and Lebiere. Behavioral and Brain Sciences, 22, 611–614.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88(5), 375–405.
McClelland, J. L., Rumelhart, D. E., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2: Psychological and biological models. Cambridge, MA: MIT Press.
McClelland, J. L., Thomas, A. G., McCandliss, B. D., & Fiez, J. A. (1999). Understanding failures of learning: Hebbian learning, competition for representation space, and some preliminary data. In J. A. Reggia, E. Ruppin, & D. Glanzman (Eds.), Disorders of brain, behavior, and cognition: The neurocomputational perspective (pp. 75–80). Oxford, UK: Elsevier.
McClelland, J. L., & Thompson, R. M. (2007). Using domain-general principles to explain children's causal reasoning abilities. Developmental Science, 10, 333–356.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. (Reprinted in Anderson, J., & Rosenfield, E. (1988). Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.)
McLeod, P., Plunkett, K., & Rolls, E. T. (1998). Introduction to connectionist modelling of cognitive processes. Oxford, UK: Oxford University Press.
Meynert, T. (1884). Psychiatry: A clinical treatise on diseases of the forebrain. Part I. The anatomy, physiology and chemistry of the brain (B. Sachs, Trans.). New York: G. P. Putnam's Sons.
Miikkulainen, R., & Mayberry, M. R. (1999). Disambiguation and grammar as emergent soft constraints. In B. MacWhinney (Ed.), Emergence of language (pp. 153–176). Hillsdale, NJ: Lawrence Erlbaum.
Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Minsky, M. L., & Papert, S. (1988). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Morris, W., Cottrell, G., & Elman, J. (2000). A connectionist simulation of the empirical acquisition of grammatical relations. In S. Wermter & R. Sun (Eds.), Hybrid neural systems (pp. 175–193). Heidelberg: Springer Verlag.
Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76, 165–178.
Morton, J. B., & Munakata, Y. (2002). Active versus latent representations: A neural network model of perseveration, dissociation, and decalage in childhood. Developmental Psychobiology, 40, 255–265.
Movellan, J. R., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17, 463–496.
Munakata, Y. (1998). Infant perseveration and implications for object permanence theories: A PDP model of the AB task. Developmental Science, 1, 161–184.
Munakata, Y., & McClelland, J. L. (2003). Connectionist models of development. Developmental Science, 6, 413–429.
O'Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5), 895–938.
O'Reilly, R. C. (1998). Six principles for biologically-based computational models of cortical cognition. Trends in Cognitive Sciences, 2, 455–462.
O'Reilly, R. C., Braver, T. S., & Cohen, J. D. (1999). A biologically based computational model of working memory. In A. Miyake & P. Shah (Eds.), Models of working memory: Mechanisms of active maintenance and executive control (pp. 375–411). New York: Cambridge University Press.
O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
Pinker, S. (1984). Language learnability and language development. Cambridge, MA: Harvard University Press.
Pinker, S. (1999). Words and rules. London: Weidenfeld & Nicolson.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.
Plaut, D. C., & Kello, C. T. (1999). The emergence of phonology from the interplay of speech comprehension and production: A distributed connectionist approach. In B. MacWhinney (Ed.), The emergence of language (pp. 381–415). Mahwah, NJ: Lawrence Erlbaum.
Plaut, D., & McClelland, J. L. (1993). Generalization with componential attractors: Word and nonword reading in an attractor network. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 824–829). Hillsdale, NJ: Lawrence Erlbaum.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. E. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
Plaut, D. C., & Shallice, T. (1993). Deep dyslexia: A case study of connectionist neuropsychology. Cognitive Neuropsychology, 10, 377–500.
Plomin, R., Owen, M. J., & McGuffin, P. (1994). The genetic basis of complex human behaviors. Science, 264, 1733–1739.
Plunkett, K., & Bandelow, S. (2006). Stochastic approaches to understanding dissociations in inflectional morphology. Brain and Language, 98, 194–209.
Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 1–60.
Plunkett, K., & Marchman, V. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48, 21–69.
Plunkett, K., & Marchman, V. (1996). Learning from a connectionist model of the English past tense. Cognition, 61, 299–308.
Plunkett, K., & Nakisa, R. (1997). A connectionist model of the Arabic plural system. Language and Cognitive Processes, 12, 807–836.
Quartz, S. R. (1993). Neural networks, nativism, and the plausibility of constructivism. Cognition, 48, 223–242.
Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences, 20, 537–596.
Rashevsky, N. (1935). Outline of a physico-mathematical theory of the brain. Journal of General Psychology, 13, 82–112.
Reicher, G. M. (1969). Perceptual recognition as a function of meaningfulness of stimulus material. Journal of Experimental Psychology, 81, 274–280.
Rohde, D. L. T. (2002). A connectionist model of sentence comprehension and production. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.
Rohde, D. L. T., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan Books.
Rumelhart, D. E., & McClelland, J. L. (1982). An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89, 60–94.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 45–76). Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (1985). Levels indeed! Journal of Experimental Psychology: General, 114(2), 193–197.
Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tense of English verbs. In J. L. McClelland, D. E. Rumelhart, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2: Psychological and biological models (pp. 216–271). Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press.
Rumelhart, D. E., Smolensky, P., McClelland, J. L., & Hinton, G. E. (1986). Schemata and sequential thought processes in PDP models. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel distributed processing, Vol. 2 (pp. 7–57). Cambridge, MA: MIT Press.
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.
Selfridge, O. G. (1959). Pandemonium: A paradigm for learning. In D. V. Blane & A. M. Uttley (Eds.), Proceedings of the Symposium on Mechanisation of Thought Processes (pp. 511–529). London: HMSO.
Shallice, T. (1988). From neuropsychology to mental structure. Cambridge, UK: Cambridge University Press.
Sharkey, N., Sharkey, A., & Jackson, S. (2000). Are SRNs sufficient for modelling language acquisition? In P. Broeder & J. Murre (Eds.), Models of language acquisition: Inductive and deductive approaches (pp. 33–54). Oxford, UK: Oxford University Press.
Shultz, T. R. (2003). Computational developmental psychology. Cambridge, MA: MIT Press.
Smolensky, P. (1988). On the proper treatment of connectionism. Behavioral and Brain Sciences, 11, 1–74.
Spencer, H. (1872). Principles of psychology (3rd ed.). London: Longman, Brown, Green, & Longmans.
Sun, R. (1995). Robust reasoning: Integrating rule-based and similarity-based reasoning. Artificial Intelligence, 75, 241–295.
Sun, R. (2002a). Hybrid connectionist symbolic systems. In M. Arbib (Ed.), Handbook of brain theories and neural networks (2nd ed., pp. 543–547). Cambridge, MA: MIT Press.
Sun, R. (2002b). Hybrid systems and connectionist implementationalism. In L. Nadel, D. Chalmers, P. Culicover, R. Goldstone, & B. French (Eds.), Encyclopedia of Cognitive Science (pp. 697–703). London: Macmillan.
Sun, R., & Peterson, T. (1998). Autonomous learning of sequential tasks: Experiments and analyses. IEEE Transactions on Neural Networks, 9(6), 1217–1234.
Tenenbaum, J. B., Griffiths, T. L., & Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7), 309–318.
Thomas, M. S. C. (2004). The state of connectionism in 2004. Parallaxis, 8, 43–61.
Thomas, M. S. C. (2005). Characterising compensation. Cortex, 41(3), 434–442.
Thomas, M. S. C., Forrester, N. A., & Richardson, F. M. (2006). What is modularity good for? In R. Sun & N. Miyake (Eds.), Proceedings of the 28th Annual Conference of the Cognitive Science Society (pp. 2240–2245). Vancouver, Canada: Cognitive Science Society.
Thomas, M. S. C., & Johnson, M. H. (2006). The computational modelling of sensitive periods. Developmental Psychobiology, 48(4), 337–344.
Thomas, M. S. C., & Karmiloff-Smith, A. (2002a). Are developmental disorders like cases of adult brain damage? Implications from connectionist modelling. Behavioral and Brain Sciences, 25(6), 727–788.
Thomas, M. S. C., & Karmiloff-Smith, A. (2002b). Modelling typical and atypical cognitive development. In U. Goswami (Ed.), Handbook of childhood development (pp. 575–599). Oxford: Blackwells.
Thomas, M. S. C., & Karmiloff-Smith, A. (2003a). Connectionist models of development, developmental disorders and individual differences. In R. J. Sternberg, J. Lautrey, & T. Lubart (Eds.), Models of intelligence: International perspectives (pp. 133–150). Washington, DC: American Psychological Association.
Thomas, M. S. C., & Karmiloff-Smith, A. (2003b). Modeling language acquisition in atypical phenotypes. Psychological Review, 110(4), 647–682.
Thomas, M. S. C., & Karmiloff-Smith, A. (2005). Can developmental disorders reveal the component parts of the human language faculty? Language Learning and Development, 1(1), 65–92.
Thomas, M. S. C., & Redington, M. (2004). Modelling atypical syntax processing. In W. Sakas (Ed.), Proceedings of the first workshop on psycho-computational models of human language acquisition at the 20th International Conference on Computational Linguistics (pp. 85–92).
Thomas, M. S. C., & Richardson, F. (2006). Atypical representational change: Conditions for the emergence of atypical modularity. In Y. Munakata & M. H. Johnson (Eds.), Processes of change in brain and cognitive development: Attention and Performance XXI (pp. 315–347). Oxford, UK: Oxford University Press.
Thomas, M. S. C., & van Heuven, W. (2005). Computational models of bilingual comprehension. In J. F. Kroll & A. M. B. De Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 202–225). Oxford, UK: Oxford University Press.
Touretzky, D. S., & Hinton, G. E. (1988). A distributed connectionist production system. Cognitive Science, 12, 423–466.
Usher, M., & McClelland, J. L. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592.
van Gelder, T. (1991). Classical questions, radical answers: Connectionism and the structure of mental representations. In T. Horgan & J. Tienson (Eds.), Connectionism and the philosophy of mind (pp. 355–381). Dordrecht: Kluwer Academic.
Weckerly, J., & Elman, J. L. (1992). A PDP approach to processing center-embedded sentences. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum.
Westermann, G. (1998). Emergent modularity and U-shaped learning in a constructivist neural network learning the English past tense. In Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 1130–1135). Hillsdale, NJ: Lawrence Erlbaum.
Westermann, G., Mareschal, D., Johnson, M. H., Sirois, S., Spratling, M. W., & Thomas, M. S. C. (2007). Neuroconstructivism. Developmental Science, 10, 75–83.
Williams, R. J., & Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications. Hillsdale, NJ: Lawrence Erlbaum.
Xie, X., & Seung, H. S. (2003). Equivalence of backpropagation and contrastive Hebbian learning in a layered network. Neural Computation, 15, 441–454.
Xu, F., & Pinker, S. (1995). Weird past tense forms. Journal of Child Language, 22, 531–556.