Statistical inference: learning in artificial neural networks

Howard Hua Yang, Noboru Murata and Shun-ichi Amari

H.H. Yang is at the Computer Science Department, Oregon Graduate Institute, PO Box 9100, Portland, OR 97291, USA (tel: +503 690 1331; fax: +503 690 1548; e-mail: [email protected]). N. Murata and S. Amari are at the Laboratory for Information Synthesis, Riken BSI, Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan (tel: +81 48467 9625; fax: +81 48467 9693; e-mail: [email protected]; [email protected]).

Artificial neural networks (ANNs) are widely used to model low-level neural activities and high-level cognitive functions. In this article, we review the applications of statistical inference for learning in ANNs. Statistical inference provides an objective way to derive learning algorithms, both for training and for evaluating the performance of trained ANNs. Solutions to the over-fitting problem by model-selection methods, based either on conventional statistical approaches or on a Bayesian approach, are discussed. The use of supervised and unsupervised learning algorithms for ANNs is reviewed. Training a multilayer ANN by back-propagation is equivalent to nonlinear regression. The ensemble methods, bagging and arching, described here, can be applied to combine ANNs to form a new predictor with improved performance. Unsupervised learning algorithms that are derived either by the Hebbian law for bottom-up self-organization, or by global objective functions for top-down self-organization, are also discussed.

Although the brain is an extremely large and complex system, from the point of view of its organization the hierarchy of the brain can be divided into eight levels: behavior, cortex, neural circuit, neuron, microcircuit, synapse, membrane and molecule. With advanced invasive and non-invasive measurement techniques, the brain can be observed at all these levels and a huge amount of data have been collected. Neural computational theories have been developed to account for complex brain functions based on the accumulated data.

Neural computational theories comprise neural models, neural dynamics and learning theories. Mathematical modeling has been applied to each level in the hierarchy of the brain. The brain can be considered at three functional levels: (1) a cognitive-function level related to behavior and cortex; (2) a neural-activity level related to neural circuit, neuron, microcircuit and synapse; and (3) a subneural level related to the membrane and molecule. In this article, we only consider the first and second functional levels.

To focus on the information processing principles of the brain, we must simplify the neurons and synapses in real neural systems. ANNs are simplified mathematical models for neural systems formed by massively interconnected computational units running in parallel. We discuss the applications of ANNs at the neural-activity level and the cognitive-function level.

Applications of ANNs
ANNs can model brain functions at the level of either neural activity or cognitive function. The meanings of the units and connections in an ANN are different at these two levels.

At the neural-activity level, the units and connections model the nerve cells and the synapses between the neurons, respectively. This gives a correspondence between an ANN and a neural system1. One successful application of ANNs at the activity level is Zipser and Andersen's model2, which is a three-layer feed-forward network, trained by back-propagation to perform the vector addition of the retinal and eye positions. After training, the simulated retinal receptive fields and the eye position responses of the hidden units in the trained network closely resembled those found in the posterior parietal cortex of the primate brain, where the absolute spatial location (the position of the object in space, which does not depend on the head direction) is computed.

At the cognitive-function level, ANNs are connectionist models for cognitive processing. The units and connections in the connectionist models are used to represent certain cognitive states or hypotheses, and constraints among these states or hypotheses, respectively. It has been widely believed, and demonstrated by connectionists, that some cognitive functions emerge from the interactions among a large number of computational units3. Different cognitive tasks, such as memory retrieval, category formation, speech perception, language acquisition and object recognition, have been modeled by ANNs (Refs 3–5). Some examples are the word pronunciation model6, the mental arithmetic model7, the English text-to-speech system8, and the TD-Gammon model9, which is one of the best backgammon players in the world.


Statistics and ANNs
McClelland10 summarized five principles to characterize the information processing in connectionist models: principles of graded activation; gradual propagation; interactive processing; mutual competition; and intrinsic variability. In ANNs, the first principle is realized by a linear combination of inputs and a sigmoid activation function, the second by finite impulse response (FIR) filters with exponentially decaying impulse response functions as models for synapses11, the third by a bi-directional structure, the fourth by lateral inhibition, and the fifth by noise or probabilistic units. The factor of intrinsic variability plays an important role in human information processing, and it is intrinsic variability that is the main difference between the brain and digital computers. von Neumann once hinted at the idea of building a brain-like computer based on statistical principles12. It is a reasonable hypothesis that the brain incorporates intrinsic variability naturally in its structure so that it can operate in a stochastic environment, receiving noisy and ambiguous inputs. McClelland's work provided some new directions for neural-network research. As a basic hypothesis for connectionist models, the intrinsic variability principle is appealing, especially to statisticians, because it allows them to build stochastic neural-network models and to apply statistical inference to these models.

Two essential parts of modern neural-network theory are stochastic models for ANNs and learning algorithms based on statistical inference. White13,14 and Ripley15 give some statisticians' perspectives on ANNs and treat them rigorously using a statistical framework. Amari reviews some important issues about learning and statistical inference16,17 and the applications of information geometry18 in ANNs. In this article, we further review these issues as well as some other issues not covered previously by Amari16–18.

Stochastic models
Many stochastic neural-network models have been proposed in the literature. A good neural-network model should be concise in structure with powerful approximation capability, and be tractable by statistical inference methods. A simple, but useful, model for a stochastic perceptron is:

  y = g(x, θ) + noise

where x is the input, y the output, and g(x, θ) a nonlinear activation function parameterized by θ. For example, g(x, θ) = f(w^T x + b) for θ = (w, b), or g(x, θ) = f(x^T Ax + w^T x + b) for θ = (A, w, b), where f is a single-variable function, b a bias, w a vector linearly combining the inputs, A a matrix linearly combining the second-order inputs, and ^T denotes the vector transpose. The function f should be simple for calculation, but not too simple, so that the network can approximate arbitrary nonlinear functions. Some common choices for f are sigmoid, bell-shaped and Mexican-hat functions. Other stochastic models are random nets19, Boltzmann machines20, stochastic Helmholtz machines21, and the hierarchical mixture of experts model22.
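As a concrete illustration of the stochastic perceptron above, the following NumPy sketch draws noisy outputs from y = f(w^T x + b) + noise. The sigmoid f, the parameter values and the Gaussian noise level are our illustrative assumptions, not prescriptions from the text; the repeated evaluation shows the intrinsic variability that the stochastic model builds in.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(u):
        # sigmoid activation, one common choice for f
        return 1.0 / (1.0 + np.exp(-u))

    w = np.array([1.5, -0.7, 0.3])   # vector linearly combining the inputs
    b = 0.1                          # bias

    def stochastic_perceptron(x, noise_std=0.05):
        # y = g(x, theta) + noise, with g(x, theta) = f(w^T x + b)
        return f(w @ x + b) + noise_std * rng.normal()

    x = rng.normal(size=3)
    print([stochastic_perceptron(x) for _ in range(5)])
    # the same input yields different outputs: intrinsic variability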
Learning algorithms based on statistical inference
Why should we apply statistical inference in the learning of ANNs? A brief answer to this question is that statistical inference will guide us to derive learning algorithms and to analyze their performance in a more systematic way.

An ANN has input nodes for taking data, hidden nodes for the internal representation and output nodes for displaying patterns. The goal of learning is to find an input–output relation with a prediction error as small as possible. A learning algorithm is called supervised if the desired outputs are known. Otherwise, it is unsupervised.

Supervised learning
Multilayer neural networks are chosen as useful connectionist models because of their universal approximation capability. A network for knowledge representation can be trained from examples without using verbal rules and hard-wiring. Three typical approaches to train a multilayer neural network are (1) the optimization approach, (2) the conventional statistical approach and (3) the Bayesian approach.

Optimization approach
Training a multilayer neural network is often formulated as an optimization problem. The learning algorithms based on gradient descent are enriched by some optimization techniques, such as the momentum and the Newton–Raphson methods. However, because the cost functions are subjectively chosen, it is very difficult to address problems such as the efficiency and the accuracy of the learning algorithms within the optimization framework. A further problem is that the trained network might over-fit the data. Thus, when the network architecture is more complex than required, the algorithm might decrease the training error on the training examples, but increase the generalization error on the novel examples that are not shown to the network during training. As a result, the learning process can be driven by the training examples to a wrong solution. Therefore, it is crucial to select a correct model in the learning process based on the performance of the trained network. To examine the performance of the trained network, some examples should be left aside for testing, not training.

In the optimization approach, the model selection is done by trial and error. This is time consuming and the optimal architecture might not be found. We now discuss some model-selection methods based on statistical inference.

Conventional statistical approach
Many papers about statistical inference learning have been published in the past three decades. A key link between neural networks and statistics is nonlinear regression. Through this link, many statistical inference methods can be applied to neural networks. Perhaps the earliest work on statistical inference learning was that carried out in 1967 by Amari23, in which the error-correction-adjustment method based on the stochastic-gradient-descent method was proposed to train linear or nonlinear classifiers, including one-layer and multilayer neural networks. In the 1970s and 1980s, this method was rediscovered and refined, to become the well-known error-back-propagation method.

Here, we focus on the maximum likelihood method and model selection by minimizing the generalization error. Although other learning algorithms, such as reinforcement learning24 and the EM-algorithm25, are also closely related to statistical inference learning, they are not reviewed here because of space limitations.
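For concreteness, here is a minimal sketch of error-correction adjustment by stochastic gradient descent for a one-layer classifier, in the spirit of the 1967 method mentioned above. The logistic output, the learning rate and the toy task are our assumptions, not the original formulation.

    import numpy as np

    rng = np.random.default_rng(1)

    w = np.zeros(2)   # weights of a one-layer classifier
    b = 0.0           # threshold

    def predict(x):
        # logistic output, read as P(class = 1 | x)
        return 1.0 / (1.0 + np.exp(-(w @ x + b)))

    mu = 0.1          # learning rate
    for t in range(2000):
        x = rng.normal(size=2)
        label = 1.0 if x[0] + x[1] > 0 else 0.0   # a linearly separable task
        err = predict(x) - label                  # error signal for this example
        w -= mu * err * x                         # correct in proportion to the error
        b -= mu * err

    print(w, b)   # the learned boundary approximates x1 + x2 = 0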


From the statistical point of view, training a network is equivalent to estimating its weight parameters by the maximum likelihood method (see Box 1).

Box 1. Maximum likelihood method for training ANNs

Consider an n-m-1 multilayer perceptron:

  z = a^T φ(Wx + b) + ξ

where ^T denotes the transpose, the entries in the vector a and the matrix W are weights, the entries in the vector b are thresholds, φ(x) is a differentiable activation function for each hidden neuron, and ξ is an additive noise with a probability density function (pdf) q(y). The conditional pdf of z given x is

  p(z|x; θ) = q(z − a^T φ(Wx + b))

The joint pdf of the input and output is

  p(z, x; θ) = p(z|x; θ) p(x)

where θ is a parameter vector consisting of the weights and thresholds in the multilayer perceptron, p(x) is the pdf of the input and p(z|x; θ) is the conditional pdf of z given the input x. The problem is to estimate θ based on a set of training examples

  D = {(x_t, z_t), t = 1, …, T}

Assume the training examples are independent. The maximum likelihood method is to maximize a likelihood function

  p(D; θ) = ∏_{t=1}^T p(x_t, z_t; θ)

or, equivalently, to minimize a loss function defined by

  L(D; θ) = −log p(D; θ) = −∑_{t=1}^T log p(x_t, z_t; θ) = −∑_{t=1}^T log p(z_t|x_t; θ) − ∑_{t=1}^T log p(x_t)

Since p(x) does not depend on θ, minimizing L(D; θ) is equivalent to minimizing the following loss function:

  e(θ) = −∑_{t=1}^T log p(z_t|x_t; θ)

With different assumptions for q(y), we have different ways to measure training errors. For example, when ξ is Gaussian, q(y) = (1/(√(2π)σ)) exp(−y²/(2σ²)), and e(θ) is a mean square error:

  e(θ) = (1/(2σ²)) ∑_{t=1}^T (z_t − a^T φ(Wx_t + b))² + constant;

when ξ is symmetrically exponential, i.e. q(y) = (λ/2) exp(−λ|y|), e(θ) is an absolute error:

  e(θ) = λ ∑_{t=1}^T |z_t − a^T φ(Wx_t + b)| + constant
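A minimal NumPy sketch of the maximum likelihood training described in Box 1, for an n-m-1 perceptron with Gaussian noise, so that the loss is the squared error and the on-line update is ordinary back-propagation. The tanh activation, the learning rate and the synthetic teacher network are our illustrative assumptions; the method of scoring discussed next would additionally precondition these gradients with the inverse Fisher matrix.

    import numpy as np

    rng = np.random.default_rng(2)

    n, m = 5, 3                                # an n-m-1 perceptron
    W = rng.normal(scale=0.5, size=(m, n))     # hidden weights
    b = np.zeros(m)                            # hidden thresholds
    a = rng.normal(scale=0.5, size=m)          # output weights

    def forward(x):
        h = np.tanh(W @ x + b)                 # phi(Wx + b), with phi = tanh here
        return h, a @ h                        # network output a^T phi(Wx + b)

    # Synthetic teacher: a fixed random network of the same shape plus
    # Gaussian noise with sigma = 0.1, so the model is correct.
    Wt = rng.normal(size=(m, n)); bt = rng.normal(size=m); at = rng.normal(size=m)

    mu = 0.05                                  # learning rate mu_t, kept constant
    for t in range(5000):
        x = rng.normal(size=n)
        z = at @ np.tanh(Wt @ x + bt) + 0.1 * rng.normal()
        h, zhat = forward(x)
        err = zhat - z                         # derivative of the squared error
        delta = err * a * (1.0 - h**2)         # back-propagated error; tanh' = 1 - tanh^2
        a -= mu * err * h                      # on-line gradient step on e(theta)
        W -= mu * np.outer(delta, x)
        b -= mu * delta

    errs = []
    for _ in range(500):                       # crude estimate of the prediction error
        x = rng.normal(size=n)
        z = at @ np.tanh(Wt @ x + bt) + 0.1 * rng.normal()
        errs.append((forward(x)[1] - z)**2)
    print("mean squared test error:", np.mean(errs))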

The following on-line algorithm is usually used to compute the maximum likelihood estimate:

  θ_{t+1} = θ_t − μ_t ∂l(z_t|x_t; θ_t)/∂θ

where l(z|x; θ) = −log p(z|x; θ) is the loss for a single example. A statistically more efficient algorithm is the method of scoring:

  θ_{t+1} = θ_t − μ_t G^{-1}(θ_t) ∂l(z_t|x_t; θ_t)/∂θ

where G(θ) is the Fisher information matrix defined by

  G(θ) = E[(∂l/∂θ)(∂l/∂θ)^T]

This is one of the natural-gradient ascent–descent-type algorithms based on Amari's information geometry theory18. It has been shown by Amari17 that natural-gradient learning is Fisher efficient. If θ_t is updated by the method of scoring, the asymptotic variance of θ_t is

  E[(θ_t − θ*)(θ_t − θ*)^T] ≈ (1/t) G^{-1}(θ*)

Although the natural-gradient-descent algorithm is statistically efficient, it is computationally expensive. The bottleneck is the computation of the natural gradient. For an n-m-1 multilayer network, assuming the input dimensionality, n, is much larger than the number of hidden neurons, m, Yang and Amari26 described a new scheme to represent the Fisher information matrix and a fast algorithm to compute the natural gradient. The complexity of this fast algorithm for computing the natural gradient is O(n) flops (a flop is a floating point operation; add or multiply), while a direct method, namely matrix inversion and matrix–vector multiplication, needs O(n³) flops.

Neural networks trained by the maximum likelihood method are usually statistically efficient and asymptotically unbiased when the model is correct. However, when the model is wrong the trained networks often have the over-fitting problem. Moody27 formulated a stochastic model for a multilayer perceptron and introduced a regularization term in the cost function to penalize a complex network architecture. He found a second-order approximation of the generalization error as a function of the network complexity and the regularizer. The estimated generalization error is useful for selecting an optimal network structure. Also addressing the over-fitting problem, Murata et al.28 generalized Akaike's information criterion and proposed the NIC (network information criterion) under a general loss function with a regularization term. A common theme in these two approaches is to minimize the generalization error by regularization, and this is the theoretical basis for techniques used in practice, such as early stopping, and growing and pruning networks. The bias-variance trade-off29 and the bias-variance-covariance trade-off30 characterize the over-fitting problem from a different perspective. In the light of the bias-variance trade-off, the regularization method balances bias and variance in order to decrease the generalization error.
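Since early stopping is mentioned above as one of the practical techniques, here is a small self-contained sketch of it. The over-parameterized polynomial model stands in for an over-sized ANN, and all sizes and rates are illustrative assumptions; the point is the model-selection logic, not the model.

    import numpy as np

    rng = np.random.default_rng(3)

    x = rng.uniform(-1, 1, size=40)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=40)   # noisy data

    # Leave some examples aside for validation, as the text recommends.
    x_tr, y_tr = x[:30], y[:30]
    x_va, y_va = x[30:], y[30:]

    def design(u, degree=12):
        # degree-12 polynomial: deliberately more complex than required
        return np.vander(u, degree + 1, increasing=True)

    A_tr, A_va = design(x_tr), design(x_va)
    theta = np.zeros(A_tr.shape[1])

    mu = 0.05
    best_val, best_theta, best_epoch = np.inf, theta.copy(), 0
    for epoch in range(2000):
        grad = A_tr.T @ (A_tr @ theta - y_tr) / len(y_tr)   # gradient of the training MSE
        theta -= mu * grad
        val = np.mean((A_va @ theta - y_va)**2)             # proxy for the generalization error
        if val < best_val:
            best_val, best_theta, best_epoch = val, theta.copy(), epoch

    # Early stopping: keep the parameters from the epoch with the smallest
    # validation error rather than the final, possibly over-fitted, ones.
    print("stopped at epoch", best_epoch, "validation MSE", round(best_val, 4))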


A well-known theoretical framework for neural-network learning is Valiant's probably approximately correct (PAC) learning model31. The Vapnik–Chervonenkis (VC) dimension theory was developed to analyze the sample complexity (the required number of training examples) and the computation complexity of PAC learning. The VC-dimension theory also analyzes the difference between the training error and the generalization error in PAC learning. Most of the work on the PAC learning and VC-dimension models is for binary functions only. Haussler32 extended the PAC model and the VC-dimension theory to more general function classes. The pseudo-dimension he introduced plays a similar role to the VC-dimension in the case of binary function classes. Further results on the extended PAC model and the pseudo-dimension are given by Maass33.
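The gap between training and generalization error that the VC theory bounds can also be observed empirically. The following Monte Carlo sketch is our illustration, not drawn from the PAC literature itself: it fits a one-dimensional threshold classifier by minimizing the training error and shows the gap shrinking as the sample size grows.

    import numpy as np

    rng = np.random.default_rng(4)

    def threshold_fit(x, y):
        # choose the threshold minimizing the training error of 1{x > t}
        cands = np.quantile(x, np.linspace(0, 1, 51))
        errs = [np.mean((x > t).astype(int) != y) for t in cands]
        return cands[int(np.argmin(errs))]

    def noisy_labels(x, flip=0.1):
        y = (x > 0.5).astype(int)              # the true concept
        f = rng.random(x.size) < flip          # 10% label noise
        return np.where(f, 1 - y, y)

    for N in (10, 100, 1000):
        gaps = []
        for _ in range(100):
            x = rng.uniform(size=N); y = noisy_labels(x)
            t = threshold_fit(x, y)
            train_err = np.mean((x > t).astype(int) != y)
            xs = rng.uniform(size=2000); ys = noisy_labels(xs)
            test_err = np.mean((xs > t).astype(int) != ys)
            gaps.append(test_err - train_err)
        print(N, round(float(np.mean(gaps)), 4))   # the gap shrinks as N grows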

Bayesian approach
From a Bayesian point of view, it is better to assign some preference to unknown parameters to include some prior knowledge about the model. Assume θ is a random variable with a prior pdf p(θ). By the Bayesian formula, the posterior probability is given by:

  p(θ|D) = p(D|θ) p(θ) / p(D)

The Bayesian approach is to maximize the posterior instead of the likelihood function. When the prior is Gaussian, maximizing the posterior is equivalent to minimizing a cost function defined by:

  e(θ) = −∑_{t=1}^T log p(x_t, z_t; θ) + (α/2)‖θ‖²

In particular, when the noise, ξ, is Gaussian,

  e(θ) = (1/(2σ²)) ∑_{t=1}^T (z_t − a^T φ(Wx_t + b))² + (α/2)‖θ‖²

Here, the regularization term is naturally included in the cost function to deal with the over-fitting problem. σ² and α are hyperparameters that, in practice, are usually chosen by trial and error. There are several rational ways to choose the hyperparameters. MacKay34 proposed a Bayesian framework for the back-propagation method and gave a Bayesian approach to choosing the hyperparameters. The idea is to optimize alternately the maximum posterior estimate of θ and the hyperparameters, by maximizing an approximated posterior probability of the hyperparameters. Amari and Murata35 gave both Bayesian and non-Bayesian treatments to find the optimal hyperparameters, by maximizing the posterior probability and minimizing the generalization error, respectively. Other papers on Bayesian learning include those by Neal36, and Bishop and Qazaz37.

An important concept in the Bayesian approach is the predictive distribution defined by

  p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ

where D = {(x_1, y_1), …, (x_N, y_N)} is a training set. If the posterior p(θ|D) has several maxima θ_i, i = 1, …, M, and it is sharply peaked around these maxima, then the predictive distribution is approximated by

  p(y|x, D) ≈ ∑_{i=1}^M c_i p(y|x, θ_i)

where c_i = p(θ_i|D) Δθ. In implementation, when the conditional p(y|x, θ_i) is modeled by a neural network with the parameter θ_i, the predictive distribution is modeled by a committee machine weighted by the c_i.

The conventional statistical and the Bayesian approaches both have advantages and drawbacks. A general superiority of one approach over the other does not exist. However, in the context of model selection for training neural networks, Amari and Murata35 have shown that the conventional approach is asymptotically better than the Bayesian approach in terms of reducing the generalization error.
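A minimal sketch of the Gaussian-prior (weight-decay) cost above. We assume a linear-in-parameters model, which keeps the maximum posterior estimate in closed form; for a multilayer perceptron the same penalized cost would be minimized by gradient descent. The hyperparameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)

    d = 10
    X = rng.normal(size=(50, d))                   # inputs
    theta_true = rng.normal(size=d)
    sigma = 0.3                                    # noise level (hyperparameter)
    z = X @ theta_true + sigma * rng.normal(size=50)

    alpha = 1.0                                    # Gaussian-prior precision (hyperparameter)

    # Minimizing (1/(2 sigma^2)) ||z - X theta||^2 + (alpha/2) ||theta||^2
    # gives the penalized normal equations solved below.
    theta_map = np.linalg.solve(X.T @ X / sigma**2 + alpha * np.eye(d),
                                X.T @ z / sigma**2)
    theta_ml = np.linalg.lstsq(X, z, rcond=None)[0]   # alpha -> 0 recovers ML

    print(np.linalg.norm(theta_map) < np.linalg.norm(theta_ml))   # MAP shrinks the weights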

Unsupervised learning
Unsupervised algorithms are designed for the self-organization of ANNs. They can be derived either by the Hebbian law or by optimizing some global objective functions, such as entropy or mutual information.

Bottom-up self-organization
The Hebbian law is a local rule that requires only local signals to update the connections. It has several mathematical forms. Two typical bottom-up self-organization systems are Amari's self-organizing neural fields38 and Kohonen's self-organizing map39. In Amari's model, the connections between the presynaptic field and the postsynaptic field are updated by the product of the presynaptic neuron activity and the postsynaptic neuron activity. In Kohonen's model, the neighborhood and the winner-take-all are two important concepts. The central idea of the Kohonen self-organizing algorithm is to reinforce the weight vectors of the winner node and the nodes in its neighborhood during learning.
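A compact sketch of the Kohonen update just described: winner-take-all selection followed by reinforcement of the winner and its neighborhood. The one-dimensional map, the rates and the decay schedules are our illustrative choices.

    import numpy as np

    rng = np.random.default_rng(6)

    nodes = rng.uniform(size=(20, 2))      # one weight vector per map node
    positions = np.arange(20)              # node coordinates on a 1-D map

    eta, radius = 0.5, 5.0
    for t in range(2000):
        x = rng.uniform(size=2)                                   # input pattern
        winner = np.argmin(np.sum((nodes - x)**2, axis=1))        # winner-take-all
        h = np.exp(-(positions - winner)**2 / (2 * radius**2))    # neighborhood function
        nodes += eta * h[:, None] * (x - nodes)                   # reinforce winner and neighbors
        eta *= 0.999
        radius *= 0.999

    print(nodes[:5])   # neighboring nodes end up with similar weight vectors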

Top-down self-organization
The Hebbian law is local and biologically plausible. It is also very flexible, with various mathematical forms. However, the global behavior of a learning rule based on the Hebbian law is often difficult to predict from its local mathematical form. Optimizing some global objective functions, which characterize the internal representation of a learning task in an ANN, is an alternative way to derive an unsupervised learning rule to achieve self-organization. Unlike the cost functions for supervised learning, the global objective functions for unsupervised learning are often subjectively chosen to measure the performance of the trained network for a certain task. Generally, an unsupervised algorithm derived from a global objective function cannot be transformed into a local Hebbian learning rule.

Factorial coding and the infomax are two important concepts for top-down self-organization. Their relation is discussed in Box 2.

Another phrase for factorial coding is independent component analysis (ICA). A rigorous mathematical theory for ICA has been provided by Comon40. One application of ICA is the blind separation of sources in a linear mixture x = As, where the mixing matrix, A, and the sources, s = (s_1, …, s_n), are both unknown. Bell and Sejnowski41 applied the infomax to blind separation and derived the following on-line algorithm to maximize H(z):

  dW/dt = η (W^{−T} − ψ(y) x^T)

where

  ψ(y) = (−f_1''(y_1)/f_1'(y_1), …, −f_n''(y_n)/f_n'(y_n))^T


This approach is the same as Pham and Garat's quasi-maximum likelihood approach42. The above learning equation was optimized by using the natural-gradient-ascent method43. The optimized learning equation is

  dW/dt = η (I − ψ(y) y^T) W
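A discrete-time NumPy sketch of this optimized (natural gradient) update applied to a 2 x 2 blind separation problem. The Laplacian sources, the mixing matrix and the choice ψ(y) = tanh(y), a common surrogate for super-Gaussian sources, are our assumptions.

    import numpy as np

    rng = np.random.default_rng(7)

    T = 5000
    s = rng.laplace(size=(2, T))                  # independent super-Gaussian sources
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                    # unknown mixing matrix (illustrative)
    x = A @ s                                     # observed mixtures

    W = np.eye(2)                                 # separating matrix to be learned
    eta = 0.005
    for t in range(T):
        y = W @ x[:, t]
        psi = np.tanh(y)                          # psi(y), matched to super-Gaussian sources
        # discrete-time version of dW/dt = eta (I - psi(y) y^T) W
        W += eta * (np.eye(2) - np.outer(psi, y)) @ W

    print(W @ A)   # close to a scaled permutation matrix when separation succeeds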

Another approach for ICA is to minimize the mutual information

  I(y) = −H(x) − log|det(W)| + ∑_{i=1}^n H(y_i)

The mutual information I(y) is an ICA contrast function, and is invariant to non-zero scaling, permutation and translation. The details of this approach can be found in Yang and Amari26, where an adaptive algorithm was proposed to estimate the marginal entropy H(y_i), and the relation between the infomax and minimum mutual information was analyzed.

Although the ICA algorithms are usually not local learning rules, they are very useful in practice. Applications of these algorithms can be found in Makeig et al.44 for EEG data analysis, and Bell and Sejnowski41 for natural image analysis.

Box 2. The relation between factorial coding and infomax

The factorial coding^a and the infomax^b are two strategies to eliminate redundancy in neural coding. To find a factorial coding of a random vector x is to find a mapping y = F(x) such that the joint pdf p(y) = ∏_i q_i(y_i) is factorial. Nadal and Parga^c used the following channel model to discuss the relation between the factorial coding and the infomax:

  u = z + ξ = f(y) + ξ = f(Wx) + ξ

where x is the input, u the output, and f(y) = (f_1(y_1), …, f_n(y_n))^T. The mutual information between the input and the output is

  I(x, u) = H(u) − H(u|x) = H(u) − H(u − f(y)|x) = H(u) − H(ξ)

Since ξ is an additive noise, H(ξ) does not depend on the channel parameter W and the function f. Therefore, maximizing I(x, u) is equivalent to maximizing H(u). In the low noise limit, H(u) = H(z) and

  H(z) = −KL(p(y) ‖ ∏_i f_i'(y_i)) ≤ 0

where KL(p‖q) = ∫ p(y) log(p(y)/q(y)) dy denotes the Kullback–Leibler divergence between two probability density functions p and q. Assume the f_i(y_i) are distribution functions; then ∏_i f_i'(y_i) is a pdf. When H(z) = 0, p(y) = ∏_i f_i'(y_i), i.e. y is the factorial coding of x under the linear transform y = Wx, when H(z) is maximized.

References
a Barlow, H.B. and Foldiak, P. (1989) in The Computing Neuron (Miall, C., Durbin, R.M. and Mitchison, G.J., eds), pp. 54–72, Addison-Wesley
b Linsker, R. (1988) Self-organization in a perceptual network Computer 21, 105–117
c Nadal, J.P. and Parga, N. (1994) Nonlinear neurons in the low noise limit: a factorial code maximises information transfer Network 5, 561–581

Box 3. Arching algorithm (adaptive boost)

A classifier on a space X is a function y = g(x), where x is a pattern in X and y is a class label taking values in {1, …, J}. From M classifiers g_m(x), a better classifier can be constructed by voting in the following way:

  f(x) = argmax_j ∑_{m=1}^M v_m 1{g_m(x) = j}

where the v_m are voting weights to be determined.

Given a training set X_N^1 = {(x_n, j_n), n = 1, …, N}, a classifier g_1(x) is trained on X_N^1. A sequence of training sets X_N^m, m = 2, …, M, are generated by resampling X_N^1 using probabilities p_n^m. A classifier g_m(x) is trained on each training set X_N^m. The sampling probabilities p_n^m and the voting weights v_m are updated by Freund and Schapire's arching algorithm^a, consisting of the following steps:

(1) Initialize p_n^1 = 1/N and m = 1.
(2) Let d_n = 1 if g_m(x_n) ≠ j_n, else d_n = 0, for (x_n, j_n) ∈ X_N^m. Compute

  e_m = ∑_{n=1}^N p_n^m d_n,  β_m = (1 − e_m)/e_m,  v_m = log β_m

and update the probabilities

  p_n^{m+1} = p_n^m β_m^{d_n} / ∑_{n=1}^N p_n^m β_m^{d_n}

(3) Generate the training set X_N^{m+1} by resampling X_N^1 using probabilities p_n^{m+1}.
(4) Set m := m + 1.
(5) Repeat steps 2–4 until m > M.

Note that e_m is the error probability of the classifier g_m(x). When e_m ≥ 0.5, the classifier g_m(x) should not be included in the vote because it will weaken the performance of the combined classifier. So it is reasonable to assume e_m < 0.5 for all m. Under this assumption, we have β_m > 1, which means that the sampling probabilities of those patterns misclassified by g_m(x) are increased in step 2 of the arching algorithm.

Reference
a Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting J. Comput. Syst. Sci. 55, 119–139
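The arching algorithm of Box 3 is short in code. In the sketch below the weak classifiers are decision stumps and the toy data are ours; for simplicity the first classifier is also trained on a resample rather than on X_N^1 directly.

    import numpy as np

    rng = np.random.default_rng(8)

    N = 200
    X = rng.normal(size=(N, 2))
    j = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 2, 1)    # labels in {1, 2}

    def train_stump(Xs, js):
        # weak classifier: the best single-feature threshold rule on (Xs, js)
        best = None
        for k in range(Xs.shape[1]):
            for thr in np.quantile(Xs[:, k], np.linspace(0.1, 0.9, 9)):
                for lo, hi in ((1, 2), (2, 1)):
                    err = np.mean(np.where(Xs[:, k] > thr, hi, lo) != js)
                    if best is None or err < best[0]:
                        best = (err, k, thr, lo, hi)
        _, k, thr, lo, hi = best
        return lambda Z: np.where(Z[:, k] > thr, hi, lo)

    M = 10
    p = np.full(N, 1.0 / N)                  # step 1: uniform probabilities
    classifiers, v = [], []
    for m in range(M):
        idx = rng.choice(N, size=N, p=p)     # step 3: resample using p^m
        g = train_stump(X[idx], j[idx])
        d = (g(X) != j).astype(float)        # step 2: d_n on the original set
        e = float(np.sum(p * d))
        if e >= 0.5 or e == 0.0:             # degenerate weak classifier: stop
            break
        beta = (1.0 - e) / e
        classifiers.append(g)
        v.append(np.log(beta))               # voting weight v_m = log beta_m
        p = p * beta**d                      # misclassified patterns upweighted
        p = p / p.sum()

    def vote(Z):
        scores = np.zeros((len(Z), 2))       # columns for classes 1 and 2
        for g, vm in zip(classifiers, v):
            pred = g(Z)
            for c in (1, 2):
                scores[:, c - 1] += vm * (pred == c)
        return np.argmax(scores, axis=1) + 1

    print("combined training error:", np.mean(vote(X) != j))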

Combined learning
Breiman's bagging method45, and Freund and Schapire's arching (boosting) method46, are very useful ensemble methods to improve the performance of the existing learning machines, including ANNs.

Both methods combine the learning machines by voting. The arching method (see Box 3) uses an adaptive weighting procedure to increase the sampling probabilities of the examples that are difficult to learn for the network. Leisch and Hornik47 proposed a variant of the arching method to combine ANNs.

It is worth pointing out that animals might also use ensemble methods in learning. The brain might take advantage of its intrinsic variability to implement bagging and arching in neural systems. Human beings often learn from examples repeatedly and selectively. If some examples are difficult to learn, they are thought to be important and worth being learned once more. Without exaggerating, we can say that the brain is a natural bagging and arching machine.
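Bagging itself, as described by Breiman45, takes only a few lines: train B predictors on bootstrap resamples of the training set and combine them by averaging. The regression-stump base learner and all sizes below are our illustrative choices; any unstable learner benefits similarly.

    import numpy as np

    rng = np.random.default_rng(9)

    N = 100
    x = rng.uniform(-1, 1, N)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=N)     # noisy regression data

    def fit_stump(xs, ys):
        # unstable base predictor: a one-split regression stump
        thr = np.median(xs)
        left, right = ys[xs <= thr].mean(), ys[xs > thr].mean()
        return lambda q: np.where(q <= thr, left, right)

    B = 50
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)             # bootstrap resample, with replacement
        predictors.append(fit_stump(x[idx], y[idx]))

    def bagged(q):
        # combine by averaging the B predictors' outputs
        return np.mean([g(q) for g in predictors], axis=0)

    print(bagged(np.linspace(-1, 1, 5)))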
Conclusions
It is a promising hypothesis that a future brain-like computer will be a statistical inference machine with a probabilistic model to deal with a stochastic environment. Statistical inference has offered systematic ways to derive not only the learning algorithms, but also objective measures, such as generalization error and the NIC, to evaluate the performance of the algorithms.

The theoretical basis for techniques, such as early stopping, growing and pruning, is model selection. The schemes for model selection can be derived either by the conventional approach or by the Bayesian approach.

Unsupervised learning algorithms can be constructed according to the Hebbian law or can be derived from global objective functions. The Hebbian learning rules are local and biologically plausible, but the global behavior of the networks using Hebbian rules is difficult to predict. On the other hand, an unsupervised learning algorithm derived from some global objective function might not have a local form, but its global behavior is usually clear. Some non-local unsupervised learning algorithms, such as the ICA algorithms, are very useful for data analysis.

Acknowledgement
We would like to thank Xiaoyan Su for proofreading the manuscript.

References
1 Durbin, R. (1989) in The Computing Neuron (Durbin, R., Miall, C. and Mitchison, G., eds), pp. 1–10, Addison-Wesley
2 Zipser, D. and Andersen, R.A. (1988) A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons Nature 331, 679–684
3 Rumelhart, D.E. and McClelland, J.L., eds (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vols 1 and 2), MIT Press/Bradford Books
4 Amari, S. (1977) Neural theory of association and concept-formation Biol. Cybern. 26, 175–185
5 Nadel, L. et al., eds (1989) Neural Connections, Mental Computation, MIT Press/Bradford Books
6 Seidenberg, M.S. and McClelland, J.L. (1989) A distributed developmental model of word recognition and naming Psychol. Rev. 96, 523–568
7 Anderson, J.A. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 570–575, MIT Press/Bradford Books
8 Sejnowski, T. and Rosenberg, C. (1987) Parallel networks that learn to pronounce English text Complex Syst. 1, 145–168
9 Tesauro, G. (1995) Temporal difference learning and TD-Gammon Commun. ACM 38, 58–68
10 McClelland, J.L. (1993) in Attention and Performance: Synergies in Experimental Psychology, Artificial Intelligence and Cognitive Neuroscience (Vol. 14) (Meyer, D.E. and Kornblum, S., eds), pp. 655–688, MIT Press
11 Usher, M. and McClelland, J.L. (1995) On the time course of perceptual choice: a model based on principles of neural computation Technical Report PDP.CNS.95.5, Carnegie Mellon University, Dept of Psychology
12 Movellan, J.R. and McClelland, J.L. (1995) Stochastic interactive processing, channel separability and optimal perceptual inference: an examination of Morton's law Technical Report PDP.CNS.95.4, Carnegie Mellon University, Dept of Psychology
13 White, H. (1989) Learning in artificial neural networks: a statistical perspective Neural Comput. 1, 425–464
14 White, H. (1994) in Mathematical Perspectives on Neural Networks (Smolensky, P., Mozer, M.C. and Rumelhart, D.E., eds), Erlbaum
15 Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge University Press
16 Amari, S. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 522–526, MIT Press/Bradford Books
17 Amari, S. in Handbook of Neural Computation, IOP Publishing Ltd/Oxford University Press (in press)
18 Amari, S. (1985) Differential–Geometrical Methods in Statistics: Lecture Notes in Statistics (Vol. 28), Springer-Verlag
19 Amari, S. (1972) Characteristics of random nets of analog neuron-like elements IEEE Trans. Syst., Man Cybern. 2, 643–657
20 Ackley, D.H., Hinton, G.E. and Sejnowski, T.J. (1985) A learning algorithm for Boltzmann machines Cognit. Sci. 9, 147–169
21 Dayan, P., Hinton, G.E., Neal, R. and Zemel, R.S. (1995) Helmholtz machines Neural Comput. 7, 1022–1037
22 Jordan, M.I. and Jacobs, R.A. (1994) Hierarchical mixtures of experts and the EM algorithm Neural Comput. 6, 181–214
23 Amari, S. (1967) A theory of adaptive pattern classifiers IEEE Trans. Electron. Comput. 16, 299–307
24 Barto, A.G. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 804–813, MIT Press/Bradford Books
25 McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions, John Wiley & Sons
26 Yang, H.H. and Amari, S. (1997) Adaptive on-line learning algorithms for blind separation: maximum entropy and minimum mutual information Neural Comput. 9, 1457–1482
27 Moody, J. (1992) in Advances in Neural Information Processing Systems (Vol. 4) (Moody, J.E., Hanson, S.J. and Lippmann, R.P., eds), pp. 847–854, Morgan Kaufmann
28 Murata, N., Yoshizawa, S. and Amari, S. (1994) Network information criterion: determining the number of hidden units for an artificial neural network model IEEE Trans. Neural Netw. 5, 865–872
29 Geman, S. et al. (1992) Neural networks and the bias/variance dilemma Neural Comput. 4, 1–58
30 Wolpert, D.H. (1995) On bias plus variance Technical Report SFI TR 95-007, The Santa Fe Institute
31 Valiant, L.G. (1984) A theory of the learnable Commun. ACM 27, 1134–1142
32 Haussler, D. (1992) Decision theoretic generalizations of the PAC model for neural nets and other learning applications Inform. Comput. 100, 78–150
33 Maass, W. (1995) Agnostic PAC learning of functions on analog neural nets Neural Comput. 7, 1054–1078
34 MacKay, D.J.C. (1992) A practical Bayesian framework for back-propagation networks Neural Comput. 4, 449–472
35 Amari, S. and Murata, N. (1997) Statistical analysis of regularization constant from Bayes, MDL and NIC points of view, in Proc. IWANN (Mira, J., Moreno-Diaz, R. and Cabestany, J., eds), pp. 284–293, Springer-Verlag
36 Neal, R.M. (1996) Bayesian Learning for Neural Networks: Lecture Notes in Statistics (Vol. 118), Springer-Verlag
37 Bishop, C.M. and Qazaz, C.S. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 347–353, MIT Press
38 Amari, S. (1980) Topographic organization of nerve fields Bull. Math. Biol. 42, 339–364
39 Kohonen, T. (1982) Self-organized formation of topologically correct feature maps Biol. Cybern. 43, 59–69


40 Comon, P. (1994) Independent component analysis, a new concept? Signal Process. 36, 287–314
41 Bell, A.J. and Sejnowski, T.J. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 831–837, MIT Press
42 Pham, D.T. and Garat, P. (1997) Blind separation of a mixture of independent sources through a quasi-maximum likelihood approach IEEE Trans. Signal Process. 45, 1712–1725
43 Amari, S., Cichocki, A. and Yang, H.H. (1996) in Advances in Neural Information Processing Systems (Vol. 8) (Touretzky, D.S., Mozer, M.C. and Hasselmo, M.E., eds), pp. 757–763, MIT Press
44 Makeig, S. et al. (1997) Blind separation of auditory event-related brain responses into independent components Proc. Natl. Acad. Sci. U. S. A. 94, 10979–10984
45 Breiman, L. (1996) Bagging predictors Mach. Learn. 24, 123–140
46 Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting J. Comput. Syst. Sci. 55, 119–139
47 Leisch, F. and Hornik, K. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 522–528, MIT Press

Indexing and the object concept: developing 'what' and 'where' systems

Alan M. Leslie, Fei Xu, Patrice D. Tremoulet and Brian J. Scholl

A.M. Leslie, P.D. Tremoulet and B.J. Scholl are at the Department of Psychology and Center for Cognitive Science, Rutgers University, Piscataway, NJ 08855, USA (tel: +1 732 445 6152; fax: +1 732 445 6280; e-mail: [email protected]). F. Xu is at the Department of Psychology, Northeastern University, Boston, MA 02115, USA.

The study of object cognition over the past 25 years has proceeded in two largely non-interacting camps. One camp has studied object-based visual attention in adults, while the other has studied the object concept in infants. We briefly review both sets of literature and distill from the adult research a theoretical model that we apply to findings from the infant studies. The key notion in our model of object representation is the 'sticky' index, a mechanism of selective attention that points at a physical object in a location. An object index does not represent any of the properties of the entity at which it points. However, once an index is pointing to an object, the properties of that object can be examined and featural information can be associated with, or 'bound' to, its index. The distinction between indexing and feature binding underwrites the distinction between object individuation and object identification, a distinction that turns out to be crucial in both the adult attention and the infant object-concept literature. By developing the indexing model, we draw together two disparate sets of literature and suggest new ways to study object-based attention in infancy.

'We readily suppose an object may continue individually the same, though several times absent from and present to the senses; and ascribe to it an identity, notwithstanding the interruption of the perception, whenever we conclude, that if we had kept our eye or hand constantly upon it, it would have conveyed an invariable and uninterrupted perception.'
David Hume, A Treatise of Human Nature, 1740.

There is a long-standing view that the notion of objecthood is one of the fundamental structures of human thought1–6. Physical objects are a major focus of human attention in the first year of life, and structure visual attention in adults7,8. We present a new theory of the 'object-concept' in infancy, drawing inspiration from ideas developed in the study of adult visual attention. According to our framework, a key component of object cognition is an internal representation which functions as an 'index' to a physical object in the world. Just as a finger that points at something conveys no information about the nature of what it points at, so too an 'object index', in our account, is an entirely abstract representation that conveys no information about the properties of the object involved.

Objects and indexes
Although making best use of our limited processing resources demands selective attention9,10, it is likely that attention can span more than a single object at a time.
