Statistical Inference: Learning in Artificial Neural Networks
Howard Hua Yang, Noboru Murata and Shun-Ichi Amari


Review. Trends in Cognitive Sciences – Vol. 2, No. 1, January 1998. Copyright © 1998, Elsevier Science Ltd. All rights reserved. PII: S1364-6613(97)01114-5.

H.H. Yang is at the Computer Science Department, Oregon Graduate Institute, PO Box 9100, Portland, OR 97291, USA. tel: +503 690 1331, fax: +503 690 1548, e-mail: [email protected]. N. Murata and S. Amari are at the Laboratory for Information Synthesis, Riken BSI, Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan. tel: +81 48467 9625, fax: +81 48467 9693, e-mail: mura@brain.riken.go.jp, amari@brain.riken.go.jp.

Artificial neural networks (ANNs) are widely used to model low-level neural activities and high-level cognitive functions. In this article, we review the applications of statistical inference for learning in ANNs. Statistical inference provides an objective way to derive learning algorithms both for training and for evaluation of the performance of trained ANNs. Solutions to the over-fitting problem by model-selection methods, based on either conventional statistical approaches or on a Bayesian approach, are discussed. The use of supervised and unsupervised learning algorithms for ANNs is reviewed. Training a multilayer ANN by supervised learning is equivalent to nonlinear regression. The ensemble methods, bagging and arching, described here, can be applied to combine ANNs to form a new predictor with improved performance. Unsupervised learning algorithms, derived either from the Hebbian law for bottom-up self-organization or from global objective functions for top-down self-organization, are also discussed.

Although the brain is an extremely large and complex system, from the point of view of its organization the hierarchy of the brain can be divided into eight levels: behavior, cortex, neural circuit, neuron, microcircuit, synapse, membrane and molecule. With advanced invasive and non-invasive measurement techniques, the brain can be observed at all these levels and a huge amount of data have been collected. Neural computational theories have been developed to account for complex brain functions based on the accumulated data.

Neural computational theories comprise neural models, neural dynamics and learning theories. Mathematical modeling has been applied to each level in the hierarchy of the brain. The brain can be considered at three functional levels: (1) a cognitive-function level related to behavior and cortex; (2) a neural-activity level related to neural circuit, neuron, microcircuit and synapse; and (3) a subneural level related to the membrane and molecule. In this article, we only consider the first and second functional levels.

To focus on the information-processing principles of the brain, we must simplify the neurons and synapses in real neural systems. ANNs are simplified mathematical models for neural systems formed by massively interconnected computational units running in parallel. We discuss the applications of ANNs at the neural-activity level and the cognitive-function level.

Applications of ANNs
ANNs can model brain functions at the level of either neural activity or cognitive function. The meanings of the units and connections in an ANN are different at these two levels.

At the neural-activity level, the units and connections model the nerve cells and the synapses between the neurons, respectively. This gives a correspondence between an ANN and a neural system (Ref. 1). One successful application of ANNs at the activity level is Zipser and Andersen's model (Ref. 2), which is a three-layer feed-forward network, trained by back-propagation to perform the vector addition of the retinal and eye positions. After training, the simulated retinal receptive fields and the eye-position responses of the hidden units in the trained network closely resembled those found in the posterior parietal cortex of the primate brain, where the absolute spatial location (the position of the object in space, which does not depend on the head direction) is computed.

At the cognitive-function level, ANNs are connectionist models for cognitive processing. The units and connections in the connectionist models are used to represent certain cognitive states or hypotheses, and constraints among these states or hypotheses, respectively. It has been widely believed, and demonstrated by connectionists, that some cognitive functions emerge from the interactions among a large number of computational units (Ref. 3). Different cognitive tasks, such as memory retrieval, category formation, speech perception, language acquisition and object recognition, have been modeled by ANNs (Refs 3–5). Some examples are the word pronunciation model (Ref. 6), the mental arithmetic model (Ref. 7), the English text-to-speech system (Ref. 8), and the TD–Gammon model (Ref. 9), which is one of the best backgammon players in the world.
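As a purely illustrative sketch of the kind of task just described (not Zipser and Andersen's actual model, architecture or data), the following NumPy code trains a small three-layer feed-forward network by back-propagation to add two 2-D input vectors; the layer sizes, learning rate and synthetic data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: target = "retinal position" + "eye position" (2-D each).
# Input is the concatenated 4-D vector; target is the 2-D vector sum.
X = rng.uniform(-1.0, 1.0, size=(2000, 4))
Y = X[:, :2] + X[:, 2:]

# Three-layer network: 4 inputs -> 20 hidden tanh units -> 2 linear outputs.
n_hidden = 20
W1 = rng.normal(0.0, 0.5, size=(4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=(n_hidden, 2)); b2 = np.zeros(2)

lr = 0.05
for epoch in range(500):
    # Forward pass.
    H = np.tanh(X @ W1 + b1)           # hidden activations
    P = H @ W2 + b2                    # linear output
    err = P - Y                        # prediction error

    # Back-propagation of the mean-squared-error gradient.
    dP = 2.0 * err / len(X)
    dW2 = H.T @ dP
    db2 = dP.sum(axis=0)
    dH = (dP @ W2.T) * (1.0 - H**2)    # tanh derivative
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)

    # Plain gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final training MSE:",
      float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2)))
```

After training, the hidden units of such a network form a distributed code for the sum of the two inputs, which is the property Zipser and Andersen compared against posterior parietal responses.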
Statistics and ANNs
McClelland (Ref. 10) summarized five principles to characterize the information processing in connectionist models: graded activation; gradual propagation; interactive processing; mutual competition; and intrinsic variability. In ANNs, the first principle is realized by a linear combination of inputs and a sigmoid activation function, the second by finite impulse response (FIR) filters with exponentially decaying impulse-response functions as models for synapses (Ref. 11), the third by a bi-directional structure, the fourth by lateral inhibition, and the fifth by noise or probabilistic units. The factor of intrinsic variability plays an important role in human information processing, and it is intrinsic variability that is the main difference between the brain and digital computers. von Neumann once hinted at the idea of building a brain-like computer based on statistical principles (Ref. 12). It is a reasonable hypothesis that the brain incorporates intrinsic variability naturally in its structure so that it can operate in a stochastic environment, receiving noisy and ambiguous inputs. McClelland's work provided some new directions for neural-network research. As a basic hypothesis for connectionist models, the intrinsic variability principle is appealing, especially to statisticians, because it allows them to build stochastic neural-network models and to apply statistical inference to these models.

Two essential parts of modern neural-network theory are stochastic models for ANNs and learning algorithms based on statistical inference. White (Refs 13,14) and Ripley (Ref. 15) give some statisticians' perspectives on ANNs and treat them rigorously using a statistical framework. Amari reviews some important issues about learning and statistical inference (Refs 16,17) and the applications of information geometry (Ref. 18) in ANNs. In this article, we further review these issues as well as some other issues not covered previously by Amari (Refs 16–18).

Stochastic models
Many stochastic neural-network models have been proposed in the literature. A good neural-network model should be concise in structure with powerful approximation capability, and be tractable by statistical inference methods. A simple, but useful, model for a stochastic perceptron is:

y = g(x, θ) + noise

where x is the input, y the output, and g(x, θ) a nonlinear activation function parameterized by θ. For example, g(x, θ) = f(w^T x + b) for θ = (w, b), or g(x, θ) = f(x^T A x + w^T x + b) for θ = (A, w, b), where f is a single-variable function, b a bias, w a vector linearly combining the inputs, A a matrix linearly combining the second-order inputs, and ^T denotes the vector transpose.

Statistical inference will guide us to derive learning algorithms and to analyze their performance in a more systematic way. An ANN has input nodes for taking data, hidden nodes for the internal representation and output nodes for displaying patterns. The goal of learning is to find an input–output relation with a prediction error as small as possible. A learning algorithm is called supervised if the desired outputs are known. Otherwise, it is unsupervised.
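A minimal sketch of the stochastic perceptron defined above, assuming a sigmoid for the single-variable function f, Gaussian output noise, and made-up parameter values (none of these specifics come from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(u):
    """Single-variable sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-u))

def stochastic_perceptron(x, w, b, noise_std=0.1):
    """Sample y = g(x, theta) + noise with g(x, theta) = f(w^T x + b).
    The quadratic variant would use f(x^T A x + w^T x + b) instead."""
    g = f(w @ x + b)
    return g + rng.normal(0.0, noise_std)

# Illustrative parameters theta = (w, b) and one input x.
w = np.array([0.8, -0.4, 0.3])
b = 0.1
x = np.array([1.0, 2.0, -1.0])

samples = [stochastic_perceptron(x, w, b) for _ in range(5)]
print(samples)  # noisy outputs scattered around f(w^T x + b)
```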
Supervised learning
Multilayer neural networks are chosen as useful connectionist models because of their universal approximation capability. A network for knowledge representation can be trained from examples without using verbal rules and hard-wiring. Three typical approaches to train a multilayer neural network are (1) the optimization approach, (2) the conventional statistical approach and (3) the Bayesian approach.

Optimization approach
Training a multilayer neural network is often formulated as an optimization problem. The learning algorithms based on gradient descent are enriched by some optimization techniques such as the momentum and the Newton–Raphson methods. However, because the cost functions are subjectively chosen, it is very difficult to address problems such as the efficiency and the accuracy of the learning algorithms within the optimization framework. A further problem is that the trained network might over-fit the data. Thus, when the network architecture is more complex than required, the algorithm might decrease the training error on the training examples, but increase the generalization error on the novel examples that are not shown to the network during training. As a result, the learning process can be driven by the training examples to a wrong solution. Therefore, it is crucial to select a correct model in the learning process based on the performance of the trained network. To examine the performance of the trained network, some examples should be left aside for testing, not training.

In the optimization approach, the model selection is done by trial and error. This is time consuming and the optimal architecture might not be found. We now discuss some model selection methods based on statistical inference.

Conventional statistical approach
Many papers about statistical inference learning have been published in the past three decades. A key link between neural networks and statistics is nonlinear regression. Through this link, many statistical inference methods can be applied to neural networks. Perhaps the earliest work on
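To make the optimization approach described above concrete, the following sketch trains a small network by gradient descent with momentum while monitoring error on examples left aside for testing, the rising test error being the over-fitting signal discussed in that section. The architecture, learning rate, momentum coefficient and synthetic data are all assumptions made for this example, not material from the article.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data; the last 50 examples are held aside for testing.
X = rng.uniform(-2.0, 2.0, size=(250, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
X_train, Y_train, X_test, Y_test = X[:200], Y[:200], X[200:], Y[200:]

# One-hidden-layer network: 1 input -> 15 tanh units -> 1 linear output.
W1 = rng.normal(0.0, 0.5, size=(1, 15)); b1 = np.zeros(15)
W2 = rng.normal(0.0, 0.5, size=(15, 1)); b2 = np.zeros(1)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]      # momentum terms

def forward(X, W1, b1, W2, b2):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

lr, momentum = 0.05, 0.9
for epoch in range(2001):
    H, P = forward(X_train, *params)
    err = P - Y_train
    dP = 2.0 * err / len(X_train)
    dH = (dP @ params[2].T) * (1.0 - H**2)
    grads = [X_train.T @ dH, dH.sum(axis=0), H.T @ dP, dP.sum(axis=0)]

    # Gradient descent with momentum: v <- momentum*v - lr*grad; p <- p + v.
    for p, v, g in zip(params, velocity, grads):
        v *= momentum
        v -= lr * g
        p += v

    if epoch % 500 == 0:
        train_mse = float(np.mean(err**2))
        test_mse = float(np.mean((forward(X_test, *params)[1] - Y_test)**2))
        # Training error falling while test error rises signals over-fitting.
        print(f"epoch {epoch}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The held-out test error plays the role of the "examples left aside for testing" in the text: model selection in the optimization approach amounts to comparing such test errors across candidate architectures by trial and error.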