Statistical inference: learning in artificial neural networks

Howard Hua Yang, Noboru Murata and Shun-ichi Amari

H.H. Yang is at the Computer Science Department, Oregon Graduate Institute, PO Box 9100, Portland, OR 97291, USA (tel: +503 690 1331; fax: +503 690 1548; e-mail: [email protected]). N. Murata and S. Amari are at the Laboratory for Information Synthesis, Riken BSI, Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan (tel: +81 48467 9625; fax: +81 48467 9693; e-mail: [email protected]; [email protected]).

Artificial neural networks (ANNs) are widely used to model low-level neural activities and high-level cognitive functions. In this article, we review the applications of statistical inference for learning in ANNs. Statistical inference provides an objective way to derive learning algorithms, both for training and for evaluating the performance of trained ANNs. Solutions to the over-fitting problem by model-selection methods, based either on conventional statistical approaches or on a Bayesian approach, are discussed. The use of supervised and unsupervised learning algorithms for ANNs is reviewed. Training a multilayer ANN by back-propagation is equivalent to nonlinear regression. The ensemble methods, bagging and arching, described here, can be applied to combine ANNs to form a new predictor with improved performance. Unsupervised learning algorithms that are derived either by the Hebbian law for bottom-up self-organization, or by global objective functions for top-down self-organization, are also discussed.

Although the brain is an extremely large and complex system, from the point of view of its organization the hierarchy of the brain can be divided into eight levels: behavior, cortex, neural circuit, neuron, microcircuit, synapse, membrane and molecule. With advanced invasive and non-invasive measurement techniques, the brain can be observed at all these levels and a huge amount of data have been collected. Neural computational theories have been developed to account for complex brain functions based on the accumulated data.

Neural computational theories comprise neural models, neural dynamics and learning theories. Mathematical modeling has been applied to each level in the hierarchy of the brain. The brain can be considered at three functional levels: (1) a cognitive-function level related to behavior and cortex; (2) a neural-activity level related to neural circuit, neuron, microcircuit and synapse; and (3) a subneural level related to the membrane and molecule. In this article, we only consider the first and second functional levels.

To focus on the information processing principles of the brain, we must simplify the neurons and synapses in real neural systems. ANNs are simplified mathematical models for neural systems formed by massively interconnected computational units running in parallel. We discuss the applications of ANNs at the neural-activity level and the cognitive-function level.

Applications of ANNs
ANNs can model brain functions at the level of either neural activity or cognitive function. The meanings of the units and connections in an ANN are different at these two levels.

At the neural-activity level, the units and connections model the nerve cells and the synapses between the neurons, respectively. This gives a correspondence between an ANN and a neural system1. One successful application of ANNs at the activity level is Zipser and Andersen's model2, which is a three-layer feed-forward network, trained by back-propagation to perform the vector addition of the retinal and eye positions. After training, the simulated retinal receptive fields and the eye position responses of the hidden units in the trained network closely resembled those found in the posterior parietal cortex of the primate brain, where the absolute spatial location (the position of the object in space, which does not depend on the head direction) is computed.

At the cognitive-function level, ANNs are connectionist models for cognitive processing. The units and connections in the connectionist models are used to represent certain cognitive states or hypotheses, and constraints among these states or hypotheses, respectively. It has been widely believed, and demonstrated by connectionists, that some cognitive functions emerge from the interactions among a large number of computational units3. Different cognitive tasks, such as memory retrieval, category formation, speech perception, language acquisition and object recognition, have been modeled by ANNs (Refs 3–5). Some examples are the word pronunciation model6, the mental arithmetic model7, the English text-to-speech system8, and the TD-Gammon model9, which is one of the best backgammon players in the world.


Statistics and ANNs
McClelland10 summarized five principles to characterize the information processing in connectionist models: principles of graded activation; gradual propagation; interactive processing; mutual competition; and intrinsic variability. In ANNs, the first principle is realized by a linear combination of inputs and a sigmoid activation function, the second by finite impulse response (FIR) filters with exponentially decaying impulse response functions as models for synapses11, the third by a bi-directional structure, the fourth by lateral inhibition, and the fifth by noise or probabilistic units. The factor of intrinsic variability plays an important role in human information processing, and it is intrinsic variability that is the main difference between the brain and digital computers. von Neumann once hinted at the idea of building a brain-like computer based on statistical principles12. It is a reasonable hypothesis that the brain incorporates intrinsic variability naturally in its structure so that it can operate in a stochastic environment, receiving noisy and ambiguous inputs. McClelland's work provided some new directions for neural-network research. As a basic hypothesis for connectionist models, the intrinsic variability principle is appealing, especially to statisticians, because it allows them to build stochastic neural-network models and to apply statistical inference to these models.

Two essential parts of modern neural-network theory are stochastic models for ANNs and learning algorithms based on statistical inference. White13,14 and Ripley15 give some statisticians' perspectives on ANNs and treat them rigorously using a statistical framework. Amari reviews some important issues about learning and statistical inference16,17 and the applications of information geometry18 in ANNs. In this article, we further review these issues as well as some other issues not covered previously by Amari16–18.

Stochastic models
Many stochastic neural-network models have been proposed in the literature. A good neural-network model should be concise in structure with powerful approximation capability, and be tractable by statistical inference methods. A simple, but useful, model for a stochastic perceptron is:

  y = g(x, θ) + noise

where x is the input, y the output, and g(x, θ) a nonlinear activation function parameterized by θ. For example, g(x, θ) = f(w^T x + b) for θ = (w, b), or g(x, θ) = f(x^T Ax + w^T x + b) for θ = (A, w, b), where f is a single-variable function, b a bias, w a vector linearly combining the inputs, A a matrix linearly combining the second-order inputs, and ^T denotes the vector transpose. The function f should be simple for calculation, but not too simple, so that the network can approximate arbitrary nonlinear functions. Some common choices for f are sigmoid, bell-shaped and Mexican-hat functions. Other stochastic models are random nets19, Boltzmann machines20, stochastic Helmholtz machines21, and the hierarchical mixture of experts model22.
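As a concrete illustration of the stochastic perceptron above, the following NumPy sketch draws noisy outputs from y = f(w^T x + b) + noise. The sigmoid f, the parameter values and the Gaussian noise level are our illustrative assumptions, not prescriptions from the text; the repeated evaluation shows the intrinsic variability that the stochastic model builds in.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(u):
        # sigmoid activation, one common choice for f
        return 1.0 / (1.0 + np.exp(-u))

    w = np.array([1.5, -0.7, 0.3])   # vector linearly combining the inputs
    b = 0.1                          # bias

    def stochastic_perceptron(x, noise_std=0.05):
        # y = g(x, theta) + noise, with g(x, theta) = f(w^T x + b)
        return f(w @ x + b) + noise_std * rng.normal()

    x = rng.normal(size=3)
    print([stochastic_perceptron(x) for _ in range(5)])
    # the same input yields different outputs: intrinsic variability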
Learning algorithms based on statistical inference
Why should we apply statistical inference in the learning of ANNs? A brief answer to this question is that statistical inference will guide us to derive learning algorithms and to analyze their performance in a more systematic way.

An ANN has input nodes for taking data, hidden nodes for the internal representation and output nodes for displaying patterns. The goal of learning is to find an input–output relation with a prediction error as small as possible. A learning algorithm is called supervised if the desired outputs are known. Otherwise, it is unsupervised.

Supervised learning
Multilayer neural networks are chosen as useful connectionist models because of their universal approximation capability. A network for knowledge representation can be trained from examples without using verbal rules and hard-wiring. Three typical approaches to train a multilayer neural network are (1) the optimization approach, (2) the conventional statistical approach and (3) the Bayesian approach.

Optimization approach
Training a multilayer neural network is often formulated as an optimization problem. The learning algorithms based on gradient descent are enriched by some optimization techniques, such as the momentum and the Newton–Raphson methods. However, because the cost functions are subjectively chosen, it is very difficult to address problems such as the efficiency and the accuracy of the learning algorithms within the optimization framework. A further problem is that the trained network might over-fit the data. Thus, when the network architecture is more complex than required, the algorithm might decrease the training error on the training examples, but increase the generalization error on the novel examples that are not shown to the network during training. As a result, the learning process can be driven by the training examples to a wrong solution. Therefore, it is crucial to select a correct model in the learning process based on the performance of the trained network. To examine the performance of the trained network, some examples should be left aside for testing, not training.

In the optimization approach, the model selection is done by trial and error. This is time consuming and the optimal architecture might not be found. We now discuss some model-selection methods based on statistical inference.

Conventional statistical approach
Many papers about statistical inference learning have been published in the past three decades. A key link between neural networks and statistics is nonlinear regression. Through this link, many statistical inference methods can be applied to neural networks. Perhaps the earliest work on statistical inference learning was that carried out in 1967 by Amari23, in which the error-correction-adjustment method based on the stochastic-gradient-descent method was proposed to train linear or nonlinear classifiers, including one-layer and multilayer neural networks. In the 1970s and 1980s, this method was rediscovered and refined, to become the well-known error-back-propagation method.

Here, we focus on the maximum likelihood method and model selection by minimizing the generalization error. Although other learning algorithms, such as reinforcement learning24 and the EM-algorithm25, are also closely related to statistical inference learning, they are not reviewed here because of space limitations.
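For concreteness, here is a minimal sketch of error-correction adjustment by stochastic gradient descent for a one-layer classifier, in the spirit of the 1967 method mentioned above. The logistic output, the learning rate and the toy task are our assumptions, not the original formulation.

    import numpy as np

    rng = np.random.default_rng(1)

    w = np.zeros(2)   # weights of a one-layer classifier
    b = 0.0           # threshold

    def predict(x):
        # logistic output, read as P(class = 1 | x)
        return 1.0 / (1.0 + np.exp(-(w @ x + b)))

    mu = 0.1          # learning rate
    for t in range(2000):
        x = rng.normal(size=2)
        label = 1.0 if x[0] + x[1] > 0 else 0.0   # a linearly separable task
        err = predict(x) - label                  # error signal for this example
        w -= mu * err * x                         # correct in proportion to the error
        b -= mu * err

    print(w, b)   # the learned boundary approximates x1 + x2 = 0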


From the statistical point of view, training a network is equivalent to estimating its weight parameters by the maximum likelihood method (see Box 1).

Box 1. Maximum likelihood method for training ANNs

Consider an n-m-1 multilayer perceptron:

  z = a^T φ(Wx + b) + ξ

where ^T denotes the transpose, the entries in the vector a and the matrix W are weights, the entries in the vector b are thresholds, φ(x) is a differentiable activation function for each hidden neuron, and ξ is an additive noise with a probability density function (pdf) q(y). The conditional pdf of z given x is

  p(z|x; θ) = q(z − a^T φ(Wx + b))

The joint pdf of the input and output is

  p(z, x; θ) = p(z|x; θ) p(x)

where θ is a parameter vector consisting of the weights and thresholds in the multilayer perceptron, p(x) is the pdf of the input and p(z|x; θ) is the conditional pdf of z given the input x. The problem is to estimate θ based on a set of training examples

  D = {(x_t, z_t), t = 1, …, T}

Assume the training examples are independent. The maximum likelihood method is to maximize a likelihood function

  p(D; θ) = ∏_{t=1}^T p(x_t, z_t; θ)

or, equivalently, to minimize a loss function defined by

  L(D; θ) = −log p(D; θ) = −∑_{t=1}^T log p(x_t, z_t; θ) = −∑_{t=1}^T log p(z_t|x_t; θ) − ∑_{t=1}^T log p(x_t)

Since p(x) does not depend on θ, minimizing L(D; θ) is equivalent to minimizing the following loss function:

  e(θ) = −∑_{t=1}^T log p(z_t|x_t; θ)

With different assumptions for q(y), we have different ways to measure training errors. For example, when ξ is Gaussian, q(y) = (1/(√(2π)σ)) exp(−y²/(2σ²)), and e(θ) is a mean square error:

  e(θ) = (1/(2σ²)) ∑_{t=1}^T (z_t − a^T φ(Wx_t + b))² + constant;

when ξ is symmetrically exponential, i.e. q(y) = (λ/2) exp(−λ|y|), e(θ) is an absolute error:

  e(θ) = λ ∑_{t=1}^T |z_t − a^T φ(Wx_t + b)| + constant
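A minimal NumPy sketch of the maximum likelihood training described in Box 1, for an n-m-1 perceptron with Gaussian noise, so that the loss is the squared error and the on-line update is ordinary back-propagation. The tanh activation, the learning rate and the synthetic teacher network are our illustrative assumptions; the method of scoring discussed next would additionally precondition these gradients with the inverse Fisher matrix.

    import numpy as np

    rng = np.random.default_rng(2)

    n, m = 5, 3                                # an n-m-1 perceptron
    W = rng.normal(scale=0.5, size=(m, n))     # hidden weights
    b = np.zeros(m)                            # hidden thresholds
    a = rng.normal(scale=0.5, size=m)          # output weights

    def forward(x):
        h = np.tanh(W @ x + b)                 # phi(Wx + b), with phi = tanh here
        return h, a @ h                        # network output a^T phi(Wx + b)

    # Synthetic teacher: a fixed random network of the same shape plus
    # Gaussian noise with sigma = 0.1, so the model is correct.
    Wt = rng.normal(size=(m, n)); bt = rng.normal(size=m); at = rng.normal(size=m)

    mu = 0.05                                  # learning rate mu_t, kept constant
    for t in range(5000):
        x = rng.normal(size=n)
        z = at @ np.tanh(Wt @ x + bt) + 0.1 * rng.normal()
        h, zhat = forward(x)
        err = zhat - z                         # derivative of the squared error
        delta = err * a * (1.0 - h**2)         # back-propagated error; tanh' = 1 - tanh^2
        a -= mu * err * h                      # on-line gradient step on e(theta)
        W -= mu * np.outer(delta, x)
        b -= mu * delta

    errs = []
    for _ in range(500):                       # crude estimate of the prediction error
        x = rng.normal(size=n)
        z = at @ np.tanh(Wt @ x + bt) + 0.1 * rng.normal()
        errs.append((forward(x)[1] - z)**2)
    print("mean squared test error:", np.mean(errs))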

The following on-line algorithm is usually used to compute the maximum likelihood estimate:

  θ_{t+1} = θ_t − μ_t ∂l(z_t|x_t; θ_t)/∂θ

where l(z|x; θ) = −log p(z|x; θ) is the loss for a single example. A statistically more efficient algorithm is the method of scoring:

  θ_{t+1} = θ_t − μ_t G^{-1}(θ_t) ∂l(z_t|x_t; θ_t)/∂θ

where G(θ) is the Fisher information matrix defined by

  G(θ) = E[(∂l/∂θ)(∂l/∂θ)^T]

This is one of the natural-gradient ascent–descent-type algorithms based on Amari's information geometry theory18. It has been shown by Amari17 that natural-gradient learning is Fisher efficient. If θ_t is updated by the method of scoring, the asymptotic variance of θ_t is

  E[(θ_t − θ*)(θ_t − θ*)^T] ≈ (1/t) G^{-1}(θ*)

Although the natural-gradient-descent algorithm is statistically efficient, it is computationally expensive. The bottleneck is the computation of the natural gradient. For an n-m-1 multilayer network, assuming the input dimensionality, n, is much larger than the number of hidden neurons, m, Yang and Amari26 described a new scheme to represent the Fisher information matrix and a fast algorithm to compute the natural gradient. The complexity of this fast algorithm for computing the natural gradient is O(n) flops (a flop is a floating point operation; add or multiply), while a direct method, namely matrix inversion and matrix–vector multiplication, needs O(n³) flops.

Neural networks trained by the maximum likelihood method are usually statistically efficient and asymptotically unbiased when the model is correct. However, when the model is wrong the trained networks often have the over-fitting problem. Moody27 formulated a stochastic model for a multilayer perceptron and introduced a regularization term in the cost function to penalize a complex network architecture. He found a second-order approximation of the generalization error as a function of the network complexity and the regularizer. The estimated generalization error is useful for selecting an optimal network structure. Also addressing the over-fitting problem, Murata et al.28 generalized Akaike's information criterion and proposed the NIC (network information criterion) under a general loss function with a regularization term. A common theme in these two approaches is to minimize the generalization error by regularization, and this is the theoretical basis for techniques used in practice, such as early stopping, and growing and pruning networks. The bias-variance trade-off29 and the bias-variance-covariance trade-off30 characterize the over-fitting problem from a different perspective. In the light of the bias-variance trade-off, the regularization method balances bias and variance in order to decrease the generalization error.
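Since early stopping is mentioned above as one of the practical techniques, here is a small self-contained sketch of it. The over-parameterized polynomial model stands in for an over-sized ANN, and all sizes and rates are illustrative assumptions; the point is the model-selection logic, not the model.

    import numpy as np

    rng = np.random.default_rng(3)

    x = rng.uniform(-1, 1, size=40)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=40)   # noisy data

    # Leave some examples aside for validation, as the text recommends.
    x_tr, y_tr = x[:30], y[:30]
    x_va, y_va = x[30:], y[30:]

    def design(u, degree=12):
        # degree-12 polynomial: deliberately more complex than required
        return np.vander(u, degree + 1, increasing=True)

    A_tr, A_va = design(x_tr), design(x_va)
    theta = np.zeros(A_tr.shape[1])

    mu = 0.05
    best_val, best_theta, best_epoch = np.inf, theta.copy(), 0
    for epoch in range(2000):
        grad = A_tr.T @ (A_tr @ theta - y_tr) / len(y_tr)   # gradient of the training MSE
        theta -= mu * grad
        val = np.mean((A_va @ theta - y_va)**2)             # proxy for the generalization error
        if val < best_val:
            best_val, best_theta, best_epoch = val, theta.copy(), epoch

    # Early stopping: keep the parameters from the epoch with the smallest
    # validation error rather than the final, possibly over-fitted, ones.
    print("stopped at epoch", best_epoch, "validation MSE", round(best_val, 4))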


A well-known theoretical framework for neural-network learning is Valiant's probably approximately correct (PAC) learning model31. The Vapnik–Chervonenkis (VC) dimension theory was developed to analyze the sample complexity (the required number of training examples) and the computation complexity of PAC learning. The VC-dimension theory also analyzes the difference between the training error and the generalization error in PAC learning. Most of the work on the PAC learning and VC-dimension models is for binary functions only. Haussler32 extended the PAC model and the VC-dimension theory to more general function classes. The pseudo-dimension he introduced plays a similar role to the VC-dimension in the case of binary function classes. Further results on the extended PAC model and the pseudo-dimension are given by Maass33.
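The gap between training and generalization error that the VC theory bounds can also be observed empirically. The following Monte Carlo sketch is our illustration, not drawn from the PAC literature itself: it fits a one-dimensional threshold classifier by minimizing the training error and shows the gap shrinking as the sample size grows.

    import numpy as np

    rng = np.random.default_rng(4)

    def threshold_fit(x, y):
        # choose the threshold minimizing the training error of 1{x > t}
        cands = np.quantile(x, np.linspace(0, 1, 51))
        errs = [np.mean((x > t).astype(int) != y) for t in cands]
        return cands[int(np.argmin(errs))]

    def noisy_labels(x, flip=0.1):
        y = (x > 0.5).astype(int)              # the true concept
        f = rng.random(x.size) < flip          # 10% label noise
        return np.where(f, 1 - y, y)

    for N in (10, 100, 1000):
        gaps = []
        for _ in range(100):
            x = rng.uniform(size=N); y = noisy_labels(x)
            t = threshold_fit(x, y)
            train_err = np.mean((x > t).astype(int) != y)
            xs = rng.uniform(size=2000); ys = noisy_labels(xs)
            test_err = np.mean((xs > t).astype(int) != ys)
            gaps.append(test_err - train_err)
        print(N, round(float(np.mean(gaps)), 4))   # the gap shrinks as N grows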

Bayesian approach
From a Bayesian point of view, it is better to assign some preference to unknown parameters to include some prior knowledge about the model. Assume θ is a random variable with a prior pdf p(θ). By the Bayesian formula, the posterior probability is given by:

  p(θ|D) = p(D|θ) p(θ) / p(D)

The Bayesian approach is to maximize the posterior instead of the likelihood function. When the prior is Gaussian, maximizing the posterior is equivalent to minimizing a cost function defined by:

  e(θ) = −∑_{t=1}^T log p(x_t, z_t; θ) + (α/2)‖θ‖²

In particular, when the noise, ξ, is Gaussian,

  e(θ) = (1/(2σ²)) ∑_{t=1}^T (z_t − a^T φ(Wx_t + b))² + (α/2)‖θ‖²

Here, the regularization term is naturally included in the cost function to deal with the over-fitting problem. σ² and α are hyperparameters that, in practice, are usually chosen by trial and error. There are several rational ways to choose the hyperparameters. MacKay34 proposed a Bayesian framework for the back-propagation method and gave a Bayesian approach to choosing the hyperparameters. The idea is to optimize alternately the maximum posterior estimate of θ and the hyperparameters, by maximizing an approximated posterior probability of the hyperparameters. Amari and Murata35 gave both Bayesian and non-Bayesian treatments to find the optimal hyperparameters, by maximizing the posterior probability and minimizing the generalization error, respectively. Other papers on Bayesian learning include those by Neal36, and Bishop and Qazaz37.

An important concept in the Bayesian approach is the predictive distribution defined by

  p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ

where D = {(x_1, y_1), …, (x_N, y_N)} is a training set. If the posterior p(θ|D) has several maxima θ_i, i = 1, …, M, and it is sharply peaked around these maxima, then the predictive distribution is approximated by

  p(y|x, D) ≈ ∑_{i=1}^M c_i p(y|x, θ_i)

where c_i = p(θ_i|D) Δθ. In implementation, when the conditional p(y|x, θ_i) is modeled by a neural network with the parameter θ_i, the predictive distribution is modeled by a committee machine weighted by the c_i.

The conventional statistical and the Bayesian approaches both have advantages and drawbacks. A general superiority of one approach over the other does not exist. However, in the context of model selection for training neural networks, Amari and Murata35 have shown that the conventional approach is asymptotically better than the Bayesian approach in terms of reducing the generalization error.
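A minimal sketch of the Gaussian-prior (weight-decay) cost above. We assume a linear-in-parameters model, which keeps the maximum posterior estimate in closed form; for a multilayer perceptron the same penalized cost would be minimized by gradient descent. The hyperparameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)

    d = 10
    X = rng.normal(size=(50, d))                   # inputs
    theta_true = rng.normal(size=d)
    sigma = 0.3                                    # noise level (hyperparameter)
    z = X @ theta_true + sigma * rng.normal(size=50)

    alpha = 1.0                                    # Gaussian-prior precision (hyperparameter)

    # Minimizing (1/(2 sigma^2)) ||z - X theta||^2 + (alpha/2) ||theta||^2
    # gives the penalized normal equations solved below.
    theta_map = np.linalg.solve(X.T @ X / sigma**2 + alpha * np.eye(d),
                                X.T @ z / sigma**2)
    theta_ml = np.linalg.lstsq(X, z, rcond=None)[0]   # alpha -> 0 recovers ML

    print(np.linalg.norm(theta_map) < np.linalg.norm(theta_ml))   # MAP shrinks the weights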

Unsupervised learning
Unsupervised algorithms are designed for the self-organization of ANNs. They can be derived either by the Hebbian law or by optimizing some global objective functions, such as entropy or mutual information.

Bottom-up self-organization
The Hebbian law is a local rule that requires only local signals to update the connections. It has several mathematical forms. Two typical bottom-up self-organization systems are Amari's self-organizing neural fields38 and Kohonen's self-organizing map39. In Amari's model, the connections between the presynaptic field and the postsynaptic field are updated by the product of the presynaptic neuron activity and the postsynaptic neuron activity. In Kohonen's model, the neighborhood and the winner-take-all are two important concepts. The central idea of the Kohonen self-organizing algorithm is to reinforce the weight vectors of the winner node and the nodes in its neighborhood during learning.
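A compact sketch of the Kohonen update just described: winner-take-all selection followed by reinforcement of the winner and its neighborhood. The one-dimensional map, the rates and the decay schedules are our illustrative choices.

    import numpy as np

    rng = np.random.default_rng(6)

    nodes = rng.uniform(size=(20, 2))      # one weight vector per map node
    positions = np.arange(20)              # node coordinates on a 1-D map

    eta, radius = 0.5, 5.0
    for t in range(2000):
        x = rng.uniform(size=2)                                   # input pattern
        winner = np.argmin(np.sum((nodes - x)**2, axis=1))        # winner-take-all
        h = np.exp(-(positions - winner)**2 / (2 * radius**2))    # neighborhood function
        nodes += eta * h[:, None] * (x - nodes)                   # reinforce winner and neighbors
        eta *= 0.999
        radius *= 0.999

    print(nodes[:5])   # neighboring nodes end up with similar weight vectors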

Top-down self-organization
The Hebbian law is local and biologically plausible. It is also very flexible, with various mathematical forms. However, the global behavior of a learning rule based on the Hebbian law is often difficult to predict from its local mathematical form. Optimizing some global objective functions, which characterize the internal representation of a learning task in an ANN, is an alternative way to derive an unsupervised learning rule to achieve self-organization. Unlike the cost functions for supervised learning, the global objective functions for unsupervised learning are often subjectively chosen to measure the performance of the trained network for a certain task. Generally, an unsupervised algorithm derived from a global objective function cannot be transformed into a local Hebbian learning rule.

Factorial coding and the infomax are two important concepts for top-down self-organization. Their relation is discussed in Box 2.

Another phrase for factorial coding is independent component analysis (ICA). A rigorous mathematical theory for ICA has been provided by Comon40. One application of ICA is the blind separation of sources in a linear mixture x = As, where the mixing matrix, A, and the sources, s = (s_1, …, s_n), are both unknown. Bell and Sejnowski41 applied the infomax to blind separation and derived the following on-line algorithm to maximize H(z):

  dW/dt = η (W^{−T} − ψ(y) x^T)

where

  ψ(y) = (−f_1''(y_1)/f_1'(y_1), …, −f_n''(y_n)/f_n'(y_n))^T


This approach is the same as Pham and Garat's quasi-maximum likelihood approach42. The above learning equation was optimized by using the natural-gradient-ascent method43. The optimized learning equation is

  dW/dt = η (I − ψ(y) y^T) W
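A discrete-time NumPy sketch of this optimized (natural gradient) update applied to a 2 x 2 blind separation problem. The Laplacian sources, the mixing matrix and the choice ψ(y) = tanh(y), a common surrogate for super-Gaussian sources, are our assumptions.

    import numpy as np

    rng = np.random.default_rng(7)

    T = 5000
    s = rng.laplace(size=(2, T))                  # independent super-Gaussian sources
    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])                    # unknown mixing matrix (illustrative)
    x = A @ s                                     # observed mixtures

    W = np.eye(2)                                 # separating matrix to be learned
    eta = 0.005
    for t in range(T):
        y = W @ x[:, t]
        psi = np.tanh(y)                          # psi(y), matched to super-Gaussian sources
        # discrete-time version of dW/dt = eta (I - psi(y) y^T) W
        W += eta * (np.eye(2) - np.outer(psi, y)) @ W

    print(W @ A)   # close to a scaled permutation matrix when separation succeeds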

Another approach for ICA is to minimize the mutual information

  I(y) = −H(x) − log|det(W)| + ∑_{i=1}^n H(y_i)

The mutual information I(y) is an ICA contrast function, and is invariant to non-zero scaling, permutation and translation. The details of this approach can be found in Yang and Amari26, where an adaptive algorithm was proposed to estimate the marginal entropy H(y_i), and the relation between the infomax and minimum mutual information was analyzed.

Although the ICA algorithms are usually not local learning rules, they are very useful in practice. Applications of these algorithms can be found in Makeig et al.44 for EEG data analysis, and Bell and Sejnowski41 for natural image analysis.

Box 2. The relation between factorial coding and infomax

The factorial coding^a and the infomax^b are two strategies to eliminate redundancy in neural coding. To find a factorial coding of a random vector x is to find a mapping y = F(x) such that the joint pdf p(y) = ∏_i q_i(y_i) is factorial. Nadal and Parga^c used the following channel model to discuss the relation between the factorial coding and the infomax:

  u = z + ξ = f(y) + ξ = f(Wx) + ξ

where x is the input, u the output, and f(y) = (f_1(y_1), …, f_n(y_n))^T. The mutual information between the input and the output is

  I(x, u) = H(u) − H(u|x) = H(u) − H(u − f(y)|x) = H(u) − H(ξ)

Since ξ is an additive noise, H(ξ) does not depend on the channel parameter W and the function f. Therefore, maximizing I(x, u) is equivalent to maximizing H(u). In the low noise limit, H(u) = H(z) and

  H(z) = −KL(p(y) ‖ ∏_i f_i'(y_i)) ≤ 0

where KL(p‖q) = ∫ p(y) log(p(y)/q(y)) dy denotes the Kullback–Leibler divergence between two probability density functions p and q. Assume the f_i(y_i) are distribution functions; then ∏_i f_i'(y_i) is a pdf. When H(z) = 0, p(y) = ∏_i f_i'(y_i), i.e. y is the factorial coding of x under the linear transform y = Wx, when H(z) is maximized.

References
a Barlow, H.B. and Foldiak, P. (1989) in The Computing Neuron (Miall, C., Durbin, R.M. and Mitchison, G.J., eds), pp. 54–72, Addison-Wesley
b Linsker, R. (1988) Self-organization in a perceptual network Computer 21, 105–117
c Nadal, J.P. and Parga, N. (1994) Nonlinear neurons in the low noise limit: a factorial code maximises information transfer Network 5, 561–581

Box 3. Arching algorithm (adaptive boost)

A classifier on a space X is a function y = g(x), where x is a pattern in X and y is a class label taking values in {1, …, J}. From M classifiers g_m(x), a better classifier can be constructed by voting in the following way:

  f(x) = argmax_j ∑_{m=1}^M v_m 1{g_m(x) = j}

where the v_m are voting weights to be determined.

Given a training set X_N^1 = {(x_n, j_n), n = 1, …, N}, a classifier g_1(x) is trained on X_N^1. A sequence of training sets X_N^m, m = 2, …, M, are generated by resampling X_N^1 using probabilities p_n^m. A classifier g_m(x) is trained on each training set X_N^m. The sampling probabilities p_n^m and the voting weights v_m are updated by Freund and Schapire's arching algorithm^a, consisting of the following steps:

(1) Initialize p_n^1 = 1/N and m = 1.
(2) Let d_n = 1 if g_m(x_n) ≠ j_n, else d_n = 0, for (x_n, j_n) ∈ X_N^m. Compute

  e_m = ∑_{n=1}^N p_n^m d_n,  β_m = (1 − e_m)/e_m,  v_m = log β_m

and update the probabilities

  p_n^{m+1} = p_n^m β_m^{d_n} / ∑_{n=1}^N p_n^m β_m^{d_n}

(3) Generate the training set X_N^{m+1} by resampling X_N^1 using probabilities p_n^{m+1}.
(4) Set m := m + 1.
(5) Repeat steps 2–4 until m > M.

Note that e_m is the error probability of the classifier g_m(x). When e_m ≥ 0.5, the classifier g_m(x) should not be included in the vote because it will weaken the performance of the combined classifier. So it is reasonable to assume e_m < 0.5 for all m. Under this assumption, we have β_m > 1, which means that the sampling probabilities of those patterns misclassified by g_m(x) are increased in step 2 of the arching algorithm.

Reference
a Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting J. Comput. Syst. Sci. 55, 119–139
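The arching algorithm of Box 3 is short in code. In the sketch below the weak classifiers are decision stumps and the toy data are ours; for simplicity the first classifier is also trained on a resample rather than on X_N^1 directly.

    import numpy as np

    rng = np.random.default_rng(8)

    N = 200
    X = rng.normal(size=(N, 2))
    j = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 2, 1)    # labels in {1, 2}

    def train_stump(Xs, js):
        # weak classifier: the best single-feature threshold rule on (Xs, js)
        best = None
        for k in range(Xs.shape[1]):
            for thr in np.quantile(Xs[:, k], np.linspace(0.1, 0.9, 9)):
                for lo, hi in ((1, 2), (2, 1)):
                    err = np.mean(np.where(Xs[:, k] > thr, hi, lo) != js)
                    if best is None or err < best[0]:
                        best = (err, k, thr, lo, hi)
        _, k, thr, lo, hi = best
        return lambda Z: np.where(Z[:, k] > thr, hi, lo)

    M = 10
    p = np.full(N, 1.0 / N)                  # step 1: uniform probabilities
    classifiers, v = [], []
    for m in range(M):
        idx = rng.choice(N, size=N, p=p)     # step 3: resample using p^m
        g = train_stump(X[idx], j[idx])
        d = (g(X) != j).astype(float)        # step 2: d_n on the original set
        e = float(np.sum(p * d))
        if e >= 0.5 or e == 0.0:             # degenerate weak classifier: stop
            break
        beta = (1.0 - e) / e
        classifiers.append(g)
        v.append(np.log(beta))               # voting weight v_m = log beta_m
        p = p * beta**d                      # misclassified patterns upweighted
        p = p / p.sum()

    def vote(Z):
        scores = np.zeros((len(Z), 2))       # columns for classes 1 and 2
        for g, vm in zip(classifiers, v):
            pred = g(Z)
            for c in (1, 2):
                scores[:, c - 1] += vm * (pred == c)
        return np.argmax(scores, axis=1) + 1

    print("combined training error:", np.mean(vote(X) != j))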

Combined learning
Breiman's bagging method45, and Freund and Schapire's arching (boosting) method46, are very useful ensemble methods to improve the performance of the existing learning machines, including ANNs.

Both methods combine the learning machines by voting. The arching method (see Box 3) uses an adaptive weighting procedure to increase the sampling probabilities of the examples that are difficult to learn for the network. Leisch and Hornik47 proposed a variant of the arching method to combine ANNs.

It is worth pointing out that animals might also use ensemble methods in learning. The brain might take advantage of its intrinsic variability to implement bagging and arching in neural systems. Human beings often learn from examples repeatedly and selectively. If some examples are difficult to learn, they are thought to be important and worth being learned once more. Without exaggerating, we can say that the brain is a natural bagging and arching machine.
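Bagging itself, as described by Breiman45, takes only a few lines: train B predictors on bootstrap resamples of the training set and combine them by averaging. The regression-stump base learner and all sizes below are our illustrative choices; any unstable learner benefits similarly.

    import numpy as np

    rng = np.random.default_rng(9)

    N = 100
    x = rng.uniform(-1, 1, N)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=N)     # noisy regression data

    def fit_stump(xs, ys):
        # unstable base predictor: a one-split regression stump
        thr = np.median(xs)
        left, right = ys[xs <= thr].mean(), ys[xs > thr].mean()
        return lambda q: np.where(q <= thr, left, right)

    B = 50
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)             # bootstrap resample, with replacement
        predictors.append(fit_stump(x[idx], y[idx]))

    def bagged(q):
        # combine by averaging the B predictors' outputs
        return np.mean([g(q) for g in predictors], axis=0)

    print(bagged(np.linspace(-1, 1, 5)))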
Conclusions
It is a promising hypothesis that a future brain-like computer will be a statistical inference machine with a probabilistic model to deal with a stochastic environment. Statistical inference has offered systematic ways to derive not only the learning algorithms, but also objective measures, such as generalization error and the NIC, to evaluate the performance of the algorithms.

The theoretical basis for techniques, such as early stopping, growing and pruning, is model selection. The schemes for model selection can be derived either by the conventional approach or by the Bayesian approach.

Unsupervised learning algorithms can be constructed according to the Hebbian law or can be derived from global objective functions. The Hebbian learning rules are local and biologically plausible, but the global behavior of the networks using Hebbian rules is difficult to predict. On the other hand, an unsupervised learning algorithm derived from some global objective function might not have a local form, but its global behavior is usually clear. Some non-local unsupervised learning algorithms, such as the ICA algorithms, are very useful for data analysis.

Acknowledgement
We would like to thank Xiaoyan Su for proofreading the manuscript.

References
1 Durbin, R. (1989) in The Computing Neuron (Durbin, R., Miall, C. and Mitchison, G., eds), pp. 1–10, Addison-Wesley
2 Zipser, D. and Andersen, R.A. (1988) A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons Nature 331, 679–684
3 Rumelhart, D.E. and McClelland, J.L., eds (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vols 1 and 2), MIT Press/Bradford Books
4 Amari, S. (1977) Neural theory of association and concept-formation Biol. Cybern. 26, 175–185
5 Nadel, L. et al., eds (1989) Neural Connections, Mental Computation, MIT Press/Bradford Books
6 Seidenberg, M.S. and McClelland, J.L. (1989) A distributed developmental model of word recognition and naming Psychol. Rev. 96, 523–568
7 Anderson, J.A. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 570–575, MIT Press/Bradford Books
8 Sejnowski, T. and Rosenberg, C. (1987) Parallel networks that learn to pronounce English text Complex Syst. 1, 145–168
9 Tesauro, G. (1995) Temporal difference learning and TD-Gammon Commun. ACM 38, 58–68
10 McClelland, J.L. (1993) in Attention and Performance: Synergies in Experimental Psychology, Artificial Intelligence and Cognitive Neuroscience (Vol. 14) (Meyer, D.E. and Kornblum, S., eds), pp. 655–688, MIT Press
11 Usher, M. and McClelland, J.L. (1995) On the time course of perceptual choice: a model based on principles of neural computation Technical Report PDP.CNS.95.5, Carnegie Mellon University, Dept of Psychology
12 Movellan, J.R. and McClelland, J.L. (1995) Stochastic interactive processing, channel separability and optimal perceptual inference: an examination of Morton's law Technical Report PDP.CNS.95.4, Carnegie Mellon University, Dept of Psychology
13 White, H. (1989) Learning in artificial neural networks: a statistical perspective Neural Comput. 1, 425–464
14 White, H. (1994) in Mathematical Perspectives on Neural Networks (Smolensky, P., Mozer, M.C. and Rumelhart, D.E., eds), Erlbaum
15 Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge University Press
16 Amari, S. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 522–526, MIT Press/Bradford Books
17 Amari, S. in Handbook of Neural Computation, IOP Publishing Ltd/Oxford University Press (in press)
18 Amari, S. (1985) Differential–Geometrical Methods in Statistics: Lecture Notes in Statistics (Vol. 28), Springer-Verlag
19 Amari, S. (1972) Characteristics of random nets of analog neuron-like elements IEEE Trans. Syst., Man Cybern. 2, 643–657
20 Ackley, D.H., Hinton, G.E. and Sejnowski, T.J. (1985) A learning algorithm for Boltzmann machines Cognit. Sci. 9, 147–169
21 Dayan, P., Hinton, G.E., Neal, R. and Zemel, R.S. (1995) Helmholtz machines Neural Comput. 7, 1022–1037
22 Jordan, M.I. and Jacobs, R.A. (1994) Hierarchical mixtures of experts and the EM algorithm Neural Comput. 6, 181–214
23 Amari, S. (1967) A theory of adaptive pattern classifiers IEEE Trans. Electron. Comput. 16, 299–307
24 Barto, A.G. (1995) in The Handbook of Brain Theory and Neural Networks (Arbib, M.A., ed.), pp. 804–813, MIT Press/Bradford Books
25 McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions, John Wiley & Sons
26 Yang, H.H. and Amari, S. (1997) Adaptive on-line learning algorithms for blind separation: maximum entropy and minimum mutual information Neural Comput. 9, 1457–1482
27 Moody, J. (1992) in Advances in Neural Information Processing Systems (Vol. 4) (Moody, J.E., Hanson, S.J. and Lippmann, R.P., eds), pp. 847–854, Morgan Kaufmann
28 Murata, N., Yoshizawa, S. and Amari, S. (1994) Network information criterion: determining the number of hidden units for an artificial neural network model IEEE Trans. Neural Netw. 5, 865–872
29 Geman, S. et al. (1992) Neural networks and the bias/variance dilemma Neural Comput. 4, 1–58
30 Wolpert, D.H. (1995) On bias plus variance Technical Report SFI TR 95-007, The Santa Fe Institute
31 Valiant, L.G. (1984) A theory of the learnable Commun. ACM 27, 1134–1142
32 Haussler, D. (1992) Decision theoretic generalizations of the PAC model for neural nets and other learning applications Inform. Comput. 100, 78–150
33 Maass, W. (1995) Agnostic PAC learning of functions on analog neural nets Neural Comput. 7, 1054–1078
34 MacKay, D.J.C. (1992) A practical Bayesian framework for back-propagation networks Neural Comput. 4, 449–472
35 Amari, S. and Murata, N. (1997) Statistical analysis of regularization constant from Bayes, MDL and NIC points of view, in Proc. IWANN (Mira, J., Moreno-Diaz, R. and Cabestany, J., eds), pp. 284–293, Springer-Verlag
36 Neal, R.M. (1996) Bayesian Learning for Neural Networks: Lecture Notes in Statistics (Vol. 118), Springer-Verlag
37 Bishop, C.M. and Qazaz, C.S. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 347–353, MIT Press
38 Amari, S. (1980) Topographic organization of nerve fields Bull. Math. Biol. 42, 339–364
39 Kohonen, T. (1982) Self-organized formation of topologically correct feature maps Biol. Cybern. 43, 59–69


40 Comon, P. (1994) Independent component analysis, a new concept? Signal Process. 36, 287–314
41 Bell, A.J. and Sejnowski, T.J. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 831–837, MIT Press
42 Pham, D.T. and Garat, P. (1997) Blind separation of a mixture of independent sources through a quasi-maximum likelihood approach IEEE Trans. Signal Process. 45, 1712–1725
43 Amari, S., Cichocki, A. and Yang, H.H. (1996) in Advances in Neural Information Processing Systems (Vol. 8) (Touretzky, D.S., Mozer, M.C. and Hasselmo, M.E., eds), pp. 757–763, MIT Press
44 Makeig, S. et al. (1997) Blind separation of auditory event-related brain responses into independent components Proc. Natl. Acad. Sci. U. S. A. 94, 10979–10984
45 Breiman, L. (1996) Bagging predictors Mach. Learn. 24, 123–140
46 Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting J. Comput. Syst. Sci. 55, 119–139
47 Leisch, F. and Hornik, K. (1997) in Advances in Neural Information Processing Systems (Vol. 9) (Mozer, M.C., Jordan, M.I. and Petsche, T., eds), pp. 522–528, MIT Press

Indexing and the object concept: developing 'what' and 'where' systems

Alan M. Leslie, Fei Xu, Patrice D. Tremoulet and Brian J. Scholl

A.M. Leslie, P.D. Tremoulet and B.J. Scholl are at the Department of Psychology and Center for Cognitive Science, Rutgers University, Piscataway, NJ 08855, USA (tel: +1 732 445 6152; fax: +1 732 445 6280; e-mail: [email protected]). F. Xu is at the Department of Psychology, Northeastern University, Boston, MA 02115, USA.

The study of object cognition over the past 25 years has proceeded in two largely non-interacting camps. One camp has studied object-based visual attention in adults, while the other has studied the object concept in infants. We briefly review both sets of literature and distill from the adult research a theoretical model that we apply to findings from the infant studies. The key notion in our model of object representation is the 'sticky' index, a mechanism of selective attention that points at a physical object in a location. An object index does not represent any of the properties of the entity at which it points. However, once an index is pointing to an object, the properties of that object can be examined and featural information can be associated with, or 'bound' to, its index. The distinction between indexing and feature binding underwrites the distinction between object individuation and object identification, a distinction that turns out to be crucial in both the adult attention and the infant object-concept literature. By developing the indexing model, we draw together two disparate sets of literature and suggest new ways to study object-based attention in infancy.

'We readily suppose an object may continue individually the same, though several times absent from and present to the senses; and ascribe to it an identity, notwithstanding the interruption of the perception, whenever we conclude, that if we had kept our eye or hand constantly upon it, it would have conveyed an invariable and uninterrupted perception.'
David Hume, A Treatise of Human Nature, 1740.

There is a long-standing view that the notion of objecthood is one of the fundamental structures of human thought1–6. Physical objects are a major focus of human attention in the first year of life, and structure visual attention in adults7,8. We present a new theory of the 'object-concept' in infancy, drawing inspiration from ideas developed in the study of adult visual attention. According to our framework, a key component of object cognition is an internal representation which functions as an 'index' to a physical object in the world. Just as a finger that points at something conveys no information about the nature of what it points at, so too an 'object index', in our account, is an entirely abstract representation that conveys no information about the properties of the object involved.

Objects and indexes
Although making best use of our limited processing resources demands selective attention9,10, it is likely that attention can span more than a single object at a time.
