Universal Approximation Bounds for Superpositions of a Sigmoidal Function

Total Pages: 16

File Type: PDF, Size: 1020 KB

IEEE Transactions on Information Theory, vol. 39, no. 3, May 1993.

Universal Approximation Bounds for Superpositions of a Sigmoidal Function

Andrew R. Barron, Member, IEEE

Abstract: Approximation properties of a class of artificial neural networks are established. It is shown that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes. The function approximated is assumed to have a bound on the first moment of the magnitude distribution of the Fourier transform. The nonlinear parameters associated with the sigmoidal nodes, as well as the parameters of linear combination, are adjusted in the approximation. In contrast, it is shown that for series expansions with n terms, in which only the parameters of linear combination are adjusted, the integrated squared approximation error cannot be made smaller than order 1/n^{2/d} uniformly for functions satisfying the same smoothness assumption, where d is the dimension of the input to the function. For the class of functions examined here, the approximation rate and the parsimony of the parameterization of the networks are surprisingly advantageous in high-dimensional settings.

Index Terms: Artificial neural networks, approximation of functions, Fourier analysis, Kolmogorov n-widths.

(Manuscript received February 19, 1991. This work was supported by ONR under Contract N00014-89-J-1811. Material in this paper was presented at the IEEE International Symposium on Information Theory, Budapest, Hungary, June 1991. The author was with the Department of Statistics, the Department of Electrical and Computer Engineering, the Coordinated Science Laboratory, and the Beckman Institute, University of Illinois at Urbana-Champaign. He is now with the Department of Statistics, Yale University, Box 2179 Yale Station, New Haven, CT 06520. IEEE Log Number 9206966.)

I. INTRODUCTION

Approximation bounds for a class of artificial neural networks are derived. Continuous functions on compact subsets of R^d can be uniformly well approximated by linear combinations of sigmoidal functions, as independently shown by Cybenko [1] and Hornik, Stinchcombe, and White [2]. The purpose of this paper is to examine how the approximation error is related to the number of nodes in the network.

As in [1], we adopt the definition of a sigmoidal function φ(z) as a bounded measurable function on the real line for which φ(z) → 1 as z → ∞ and φ(z) → 0 as z → −∞. Feedforward neural network models with one layer of sigmoidal units implement functions on R^d of the form

    f_n(x) = \sum_{k=1}^{n} c_k \, \phi(a_k \cdot x + b_k) + c_0,    (1)

parameterized by a_k ∈ R^d and b_k, c_k ∈ R, where a · x denotes the inner product of vectors in R^d. The total number of parameters of the network is (d + 2)n + 1.
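As a concrete reading of the network class in (1), here is a small NumPy sketch (my illustration, not code from the paper) that evaluates f_n(x) with the logistic sigmoid standing in for the arbitrary sigmoidal function φ; the parameter shapes make the (d + 2)n + 1 count explicit.

```python
import numpy as np

def sigmoid(z):
    # Bounded, tends to 0 as z -> -inf and to 1 as z -> +inf,
    # i.e., a sigmoidal function in the sense used above.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidal_network(x, a, b, c, c0):
    """Evaluate f_n(x) = sum_k c_k * phi(a_k . x + b_k) + c_0.

    x  : (d,) input vector
    a  : (n, d) inner weight vectors a_k
    b  : (n,) biases b_k
    c  : (n,) outer coefficients c_k
    c0 : scalar offset
    """
    return c @ sigmoid(a @ x + b) + c0

d, n = 5, 10                       # input dimension and number of nodes
rng = np.random.default_rng(0)
a = rng.normal(size=(n, d))
b = rng.normal(size=n)
c = rng.normal(size=n)
c0 = 0.0

x = rng.normal(size=d)
print(sigmoidal_network(x, a, b, c, c0))
print(a.size + b.size + c.size + 1, (d + 2) * n + 1)   # both equal (d + 2)n + 1
```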
A smoothness property of the function to be approximated is expressed in terms of its Fourier representation. In particular, an average of the norm of the frequency vector weighted by the Fourier magnitude distribution is used to measure the extent to which the function oscillates. In this Introduction, the result is presented in the case that the Fourier distribution has a density that is integrable as well as having a finite first moment. Somewhat greater generality is permitted in the theorem stated and proven in Sections III and IV.

Consider the class of functions f on R^d for which there is a Fourier representation of the form

    f(x) = \int_{R^d} e^{i \omega \cdot x} \tilde{f}(\omega) \, d\omega,    (2)

for some complex-valued function \tilde{f}(\omega) for which \omega \tilde{f}(\omega) is integrable, and define

    C_f = \int_{R^d} |\omega| \, |\tilde{f}(\omega)| \, d\omega,    (3)

where |\omega| = (\omega \cdot \omega)^{1/2}. For each C > 0, let Γ_C be the set of functions f such that C_f ≤ C.

Functions with C_f finite are continuously differentiable on R^d, and the gradient of f has the Fourier representation

    \nabla f(x) = \int e^{i \omega \cdot x} \, \widetilde{\nabla f}(\omega) \, d\omega,    (4)

where \widetilde{\nabla f}(\omega) = i \omega \tilde{f}(\omega). Thus, condition (3) may be interpreted as the integrability of the Fourier transform of the gradient of the function f. In Section III, functions are permitted to be defined on domains (such as Boolean functions on {0,1}^d) for which it does not make sense to refer to differentiability on that domain. Nevertheless, the conditions imposed imply that the function has an extension to R^d with a gradient that possesses an integrable Fourier representation.

The following approximation bound is representative of the results obtained in this paper for approximation by linear combinations of a sigmoidal function. The approximation error is measured by the integrated squared error with respect to an arbitrary probability measure μ on the ball B_r = {x : |x| ≤ r} of radius r > 0. The function φ(z) is an arbitrary fixed sigmoidal function.

Proposition 1: For every function f with C_f finite, and every n ≥ 1, there exists a linear combination of sigmoidal functions f_n(x) of the form (1), such that

    \int_{B_r} (f(x) - f_n(x))^2 \, \mu(dx) \le \frac{c'_f}{n},    (5)

where c'_f = (2 r C_f)^2. For functions in Γ_C, the coefficients of the linear combination in (1) may be restricted to satisfy \sum_{k=1}^{n} |c_k| \le 2 r C and c_0 = f(0).
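To get a feel for the size of the bound in (5), the sketch below (an illustration under stated assumptions, not a computation from the paper) evaluates (2rC_f)^2 / n for the Gaussian bump f(x) = exp(-|x|^2 / 2). Under the Fourier convention of (2), the transform of this f is the standard normal density, so C_f equals E|Z| for Z ~ N(0, I_d), which has the closed form sqrt(2) Γ((d+1)/2) / Γ(d/2); SciPy is assumed to be available for the gamma function.

```python
import numpy as np
from scipy.special import gammaln

def C_f_gaussian(d):
    # First moment of |omega| under the standard normal density, i.e. C_f for
    # f(x) = exp(-|x|^2 / 2) with the convention f(x) = int e^{i w.x} f~(w) dw.
    return np.sqrt(2.0) * np.exp(gammaln((d + 1) / 2) - gammaln(d / 2))

d, r = 10, 1.0
Cf = C_f_gaussian(d)
for n in [10, 100, 1000]:
    bound = (2 * r * Cf) ** 2 / n          # right-hand side of (5)
    print(f"n = {n:5d}   (2 r C_f)^2 / n = {bound:.4f}")

# Monte Carlo sanity check of C_f = E|Z|, Z ~ N(0, I_d)
Z = np.random.default_rng(1).normal(size=(200000, d))
print("gamma formula:", Cf, " Monte Carlo:", np.linalg.norm(Z, axis=1).mean())
```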
Extensions of this result are also given to handle Fourier distributions that are not absolutely continuous, to bound the approximation error on arbitrary bounded sets, to restrict the parameters a_k and b_k to be bounded, to handle certain infinite-dimensional cases, and to treat iterative optimization of the network approximation. Examples of functions for which bounds can be obtained for C_f are given in Section IX.

A lower bound on the integrated squared error is given in Section X for approximations by linear combinations of fixed basis functions. For dimensions d ≥ 3, the bounds demonstrate a striking advantage of adjustable basis functions (such as used in sigmoidal networks) when compared to fixed basis functions for the approximation of functions in Γ_C.

II. DISCUSSION

The approximation bound shows that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes, uniformly for functions in the given smoothness class. A surprising aspect of this result is that the approximation bound of order O(1/n) is achieved using networks with a relatively small number of parameters compared to the exponential number of parameters required by traditional polynomial, spline, and trigonometric expansions. These traditional expansions take a linear combination of a set of fixed basis functions. It is shown in Section X that there is no choice of n fixed basis functions such that linear combinations of them achieve integrated squared approximation error of smaller order than (1/n)^{2/d} uniformly for functions in Γ_C, in agreement with the theory of Kolmogorov n-widths for other similar classes of functions (see, e.g., [3, pp. 232-233]). This vanishingly small approximation rate (2/d instead of 1 in the exponent of 1/n) is a "curse of dimensionality" that does not apply to the methods of approximation advocated here for functions in the given class. Roughly, the idea behind the proof of the lower bound result is that there are exponentially many orthonormal functions with the same magnitude of the frequency ω.

The integrability of |\tilde{f}(\omega)| by itself imposes a weaker restriction on high-frequency components (but a stronger restriction on low-frequency components) than does the integrability of |\omega| |\tilde{f}(\omega)|. In the course of our proof, it is seen that the integrability of |\omega| |\tilde{f}(\omega)| is also sufficient for a linear combination of sinusoidal functions to achieve the 1/n approximation rate. Siu and Bruck [5] have obtained similar approximation results for neural networks in the case of Boolean functions on {0,1}^d. Independently, they developed similar probabilistic arguments for the existence of accurate approximations in their setting.

It is not surprising that sinusoidal functions are at least as well suited for approximation as are sigmoidal functions, given that the smoothness properties of the function are formulated in terms of the Fourier transform. The sigmoidal functions are studied here not because of any unique qualifications in achieving the desired approximation properties, but rather to answer the question as to what bounds can be obtained for this commonly used class of neural network models.

There are moderately good approximation rate properties in high dimensions for other classes of functions that involve a high degree of smoothness. In particular, for functions with \int |\tilde{f}(\omega)|^2 |\omega|^{2s} \, d\omega bounded, the best approximation rate for the integrated squared error achievable by traditional basis function expansions using order m^d parameters is of order O(1/m)^{2s} for m = 1, 2, ...; for instance, see [3] (for polynomial methods m is the degree, and for spline methods m is the number of knots per coordinate). If s = d/2 and n is of order m^d, then the approximation rates in the two settings match. However, the exponential number of parameters required for the series methods still prevents their direct use when d is large.

Unlike the condition \int |\tilde{f}(\omega)|^2 |\omega|^{2s} \, d\omega < ∞, which by Parseval's identity is equivalent to the square integrability of all partial derivatives of order s, the condition \int |\tilde{f}(\omega)| |\omega| \, d\omega < ∞ is not directly related to a condition on derivatives of the function. It is necessary (but not sufficient) that all first-order partial derivatives be bounded. It is sufficient (but not necessary) that all partial derivatives of order less than or equal to s be square-integrable on R^d, where s is the least integer greater than 1 + d/2, as shown in Example 15 of Section IX. In the case of approximation on a ball of radius r, if the partial derivatives of order s are bounded on B_{r'} for some r' > r, …
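The gap between the O(1/n) network rate and the (1/n)^{2/d} fixed-basis rate is easy to quantify with a line of arithmetic: to reach the error level 1/n attained by n adjustable sigmoidal nodes, a fixed-basis expansion with rate N^{-2/d} needs N = n^{d/2} terms. The snippet below (my illustration) tabulates this for a few dimensions.

```python
# Fixed-basis error scales as (1/N)^(2/d); solve N^(-2/d) = 1/n  =>  N = n^(d/2).
for d in [2, 4, 10, 20]:
    for n in [100, 1000]:
        N = n ** (d / 2)
        print(f"d = {d:2d}, n = {n:5d}  ->  N ≈ {N:.3e} fixed basis functions")
```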
Recommended publications
  • Training Autoencoders by Alternating Minimization
Under review as a conference paper at ICLR 2018. TRAINING AUTOENCODERS BY ALTERNATING MINIMIZATION. Anonymous authors, paper under double-blind review. ABSTRACT: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
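For orientation, here is a minimal sketch of the alternating-minimization idea applied to a linear autoencoder, where each half-step is an exact least-squares solve. This is only a toy stand-in: DANTE itself alternates quasi-convex subproblems for nonlinear (sigmoid or ReLU) layers, which is not what this code does.

```python
import numpy as np

# Alternating minimization for a *linear* autoencoder X ≈ X W_enc W_dec:
# fix one factor and solve exactly for the other, then swap.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
k = 5
W_enc = rng.normal(size=(20, k))
W_dec = rng.normal(size=(k, 20))

for it in range(50):
    H = X @ W_enc                                    # codes under current encoder
    W_dec = np.linalg.lstsq(H, X, rcond=None)[0]     # decoder step: H W_dec ≈ X
    W_enc = np.linalg.pinv(W_dec)                    # encoder step: for a fixed decoder
    # and full-column-rank X, the least-squares encoder is pinv(W_dec)

err = np.linalg.norm(X - X @ W_enc @ W_dec) / np.linalg.norm(X)
print("relative reconstruction error:", err)
```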
  • CSE 152: Computer Vision Manmohan Chandraker
CSE 152: Computer Vision, Manmohan Chandraker. Lecture 15: Optimization in CNNs. Slide-deck outline (figures omitted): recap of engineered versus learned features; convolutional filters trained in a supervised manner by back-propagating classification error; two-layer perceptron networks; non-linearities and activation functions; multi-layer neural networks; from fully connected to convolutional networks; spatial filtering as convolution; 2D spatial filters applied over the whole image with weight sharing (insight: images have similar features at various spatial locations). Key operations in a CNN: convolution (learned), non-linearity such as the Rectified Linear Unit (ReLU), and spatial pooling (e.g., max pooling). Pooling operations aggregate multiple values into a single value, give invariance to small transformations, keep only the most important information for the next layer, reduce the size of the next layer (fewer parameters, faster computations), enlarge the receptive field observed in the next layer, and hierarchically extract more abstract features. Slide credits: Jia-Bin Huang and Derek Hoiem (UIUC), Pieter Abbeel and Dan Klein, Svetlana Lazebnik, Efstratios Gavves; figure sources: R. Fergus, Y. LeCun.
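The three key operations listed above are short enough to write from scratch; the NumPy sketch below (an illustration, not course code) chains a 'valid' 2D convolution, a ReLU non-linearity, and non-overlapping max pooling on a random image, with an arbitrary 3x3 kernel in place of a learned filter.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over the image and take
    a weighted sum at each location (the core of a convolutional layer)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)          # pointwise non-linearity

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))       # in a CNN this kernel would be learned
feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map.shape)               # (3, 3)
```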
  • 1 Convolution
CS1114 Section 6: Convolution. February 27th, 2013.

1 Convolution

Convolution is an important operation in signal and image processing. Convolution operates on two signals (in 1D) or two images (in 2D): you can think of one as the "input" signal (or image), and the other (called the kernel) as a "filter" on the input image, producing an output image (so convolution takes two images as input and produces a third as output). Convolution is an incredibly important concept in many areas of math and engineering (including computer vision, as we'll see later).

Definition. Let's start with 1D convolution (a 1D "image," also known as a signal, can be represented by a regular 1D vector in Matlab). Let's call our input vector f and our kernel g, and say that f has length n, and g has length m. The convolution f ∗ g of f and g is defined as:

    (f * g)(i) = \sum_{j=1}^{m} g(j) \cdot f(i - j + m/2)

One way to think of this operation is that we're sliding the kernel over the input image. For each position of the kernel, we multiply the overlapping values of the kernel and image together, and add up the results. This sum of products will be the value of the output image at the point in the input image where the kernel is centered.

Let's look at a simple example. Suppose our input 1D image is:

    f = [10 50 60 10 20 40 30]

and our kernel is:

    g = [1/3 1/3 1/3]

Let's call the output image h.
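The worked example can be checked mechanically; the snippet below (my addition) computes the smoothed signal with NumPy's convolve. Its boundary handling (zero padding in 'same' mode) may differ slightly from the indexing convention in the formula above, but the interior entries are the expected 3-point averages.

```python
import numpy as np

f = np.array([10, 50, 60, 10, 20, 40, 30], dtype=float)   # input signal
g = np.array([1/3, 1/3, 1/3])                              # averaging kernel

h = np.convolve(f, g, mode="same")    # output has the same length as f
print(h)
# interior entries are 3-point averages, e.g. h[2] = (50 + 60 + 10) / 3 = 40
```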
  • Deep Clustering with Convolutional Autoencoders
Deep Clustering with Convolutional Autoencoders. Xifeng Guo1, Xinwang Liu1, En Zhu1, and Jianping Yin2. 1 College of Computer, National University of Defense Technology, Changsha, 410073, China, [email protected]; 2 State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, 410073, China. Abstract. Deep clustering utilizes deep neural networks to learn feature representation that is suitable for clustering tasks. Though demonstrating promising performance in various applications, we observe that existing deep clustering algorithms either do not take good advantage of convolutional neural networks or do not considerably preserve the local structure of the data-generating distribution in the learned feature space. To address this issue, we propose a deep convolutional embedded clustering algorithm in this paper. Specifically, we develop a convolutional autoencoder structure to learn embedded features in an end-to-end way. Then, a clustering-oriented loss is directly built on the embedded features to jointly perform feature refinement and cluster assignment. To avoid the feature space being distorted by the clustering loss, we keep the decoder, which can preserve the local structure of data in feature space. In sum, we simultaneously minimize the reconstruction loss of the convolutional autoencoder and the clustering loss. The resultant optimization problem can be effectively solved by mini-batch stochastic gradient descent and back-propagation. Experiments on benchmark datasets empirically validate …
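The heart of the method is a single objective that adds a clustering term to the reconstruction term. The sketch below is a deliberately simplified stand-in: a random linear encoder/decoder instead of the convolutional autoencoder, and a k-means-style nearest-centroid loss instead of the paper's clustering-oriented loss; it only shows how the two terms are combined and weighted.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # stand-in for flattened images
W_enc = rng.normal(size=(64, 10)) * 0.1   # toy linear encoder (paper: conv encoder)
W_dec = rng.normal(size=(10, 64)) * 0.1   # toy linear decoder
centroids = rng.normal(size=(3, 10))      # cluster centers in embedding space

Z = X @ W_enc                             # embedded features
X_rec = Z @ W_dec                         # reconstruction through the decoder

recon_loss = np.mean((X - X_rec) ** 2)
# k-means-style clustering loss: squared distance to the nearest centroid
d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
cluster_loss = d2.min(axis=1).mean()

gamma = 0.1                               # weight of the clustering term
total_loss = recon_loss + gamma * cluster_loss
print(recon_loss, cluster_loss, total_loss)
```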
  • Tensorizing Neural Networks
Tensorizing Neural Networks. Alexander Novikov (Skolkovo Institute of Science and Technology, Moscow; Institute of Numerical Mathematics of the Russian Academy of Sciences, Moscow), Dmitry Podoprikhin (Skolkovo Institute of Science and Technology, Moscow), Anton Osokin (INRIA, SIERRA project-team, Paris), Dmitry Vetrov (Skolkovo Institute of Science and Technology; National Research University Higher School of Economics, Moscow). Abstract: Deep neural networks currently demonstrate state-of-the-art performance in several domains. At the same time, models of this class are very demanding in terms of computational resources. In particular, a large amount of memory is required by commonly used fully-connected layers, making it hard to use the models on low-end devices and stopping the further increase of the model size. In this paper we convert the dense weight matrices of the fully-connected layers to the Tensor Train [17] format such that the number of parameters is reduced by a huge factor and at the same time the expressive power of the layer is preserved. In particular, for the Very Deep VGG networks [21] we report the compression factor of the dense weight matrix of a fully-connected layer up to 200000 times, leading to the compression factor of the whole network up to 7 times. 1 Introduction: Deep neural networks currently demonstrate state-of-the-art performance in many domains of large-scale machine learning, such as computer vision, speech recognition, text processing, etc. These advances have become possible because of algorithmic advances, large amounts of available data, and modern hardware.
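The parameter saving is easy to reproduce as arithmetic. In the TT-matrix format, a dense weight matrix of size (prod n_k) x (prod m_k) is stored as cores of shape (r_{k-1}, n_k, m_k, r_k), so the parameter count is sum_k r_{k-1} n_k m_k r_k. The sketch below uses made-up mode sizes and ranks (not the configuration reported in the paper) to compare against a dense 4096 x 4096 layer.

```python
def tt_matrix_params(row_modes, col_modes, ranks):
    """Parameter count of a TT-matrix: one core of shape
    (r_{k-1}, n_k, m_k, r_k) per factor, so sum_k r_{k-1}*n_k*m_k*r_k."""
    assert len(row_modes) == len(col_modes) == len(ranks) - 1
    return sum(ranks[k] * n * m * ranks[k + 1]
               for k, (n, m) in enumerate(zip(row_modes, col_modes)))

row_modes = [8, 8, 8, 8]        # 8*8*8*8 = 4096 output units
col_modes = [8, 8, 8, 8]        # 8*8*8*8 = 4096 input units
ranks = [1, 4, 4, 4, 1]         # boundary ranks are always 1

dense = 4096 * 4096
tt = tt_matrix_params(row_modes, col_modes, ranks)
print("dense:", dense, " TT:", tt, " compression factor:", dense / tt)
```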
  • Fully Convolutional Mesh Autoencoder Using Efficient Spatially Varying Kernels
Fully Convolutional Mesh Autoencoder using Efficient Spatially Varying Kernels. Yi Zhou (Adobe Research), Chenglei Wu (Facebook Reality Labs), Zimo Li (University of Southern California), Chen Cao (Facebook Reality Labs), Yuting Ye (Facebook Reality Labs), Jason Saragih (Facebook Reality Labs), Hao Li (Pinscreen), Yaser Sheikh (Facebook Reality Labs). arXiv:2006.04325v2 [cs.CV], 21 Oct 2020. Abstract: Learning latent representations of registered meshes is useful for many 3D tasks. Techniques have recently shifted to neural mesh autoencoders. Although they demonstrate higher precision than traditional methods, they remain unable to capture fine-grained deformations. Furthermore, these methods can only be applied to a template-specific surface mesh, and are not applicable to more general meshes, like tetrahedrons and non-manifold meshes. While more general graph convolution methods can be employed, they lack performance in reconstruction precision and require higher memory usage. In this paper, we propose a non-template-specific fully convolutional mesh autoencoder for arbitrary registered mesh data. It is enabled by our novel convolution and (un)pooling operators learned with globally shared weights and locally varying coefficients which can efficiently capture the spatially varying contents presented by irregular mesh connections. Our model outperforms state-of-the-art methods on reconstruction accuracy. In addition, the latent codes of our network are fully localized thanks to the fully convolutional structure, and thus have much higher interpolation capability than many traditional 3D mesh generation models. 1 Introduction: Learning latent representations for registered meshes, either from performance capture or physical simulation, is a core component for many 3D tasks, ranging from compression and reconstruction to animation and simulation.
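As a toy illustration of the general idea of 'globally shared weights and locally varying coefficients' (and explicitly not the paper's convolution operator), the sketch below builds a per-vertex kernel as a mixture of shared weight bases and averages each vertex's neighbourhood on a tiny irregular graph.

```python
import numpy as np

rng = np.random.default_rng(0)
V, C_in, C_out, B = 5, 4, 6, 3        # vertices, channels in/out, number of bases

# tiny mesh given as an adjacency list (irregular connectivity)
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}

x = rng.normal(size=(V, C_in))        # per-vertex input features
W = rng.normal(size=(B, C_in, C_out)) # globally shared weight bases
alpha = rng.normal(size=(V, B))       # locally varying mixing coefficients

y = np.zeros((V, C_out))
for i in range(V):
    K_i = np.tensordot(alpha[i], W, axes=1)      # per-vertex kernel (C_in, C_out)
    nbh = neighbors[i] + [i]                     # include the vertex itself
    y[i] = np.mean([x[j] @ K_i for j in nbh], axis=0)
print(y.shape)                                   # (5, 6)
```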
  • Deep Learning Based Computer Generated Face Identification Using
Applied Sciences article: Deep Learning Based Computer Generated Face Identification Using Convolutional Neural Network. L. Minh Dang 1, Syed Ibrahim Hassan 1, Suhyeon Im 1, Jaecheol Lee 2, Sujin Lee 1 and Hyeonjoon Moon 1,*. 1 Department of Computer Science and Engineering, Sejong University, Seoul 143-747, Korea; [email protected] (L.M.D.); [email protected] (S.I.H.); [email protected] (S.I.); [email protected] (S.L.). 2 Department of Information Communication Engineering, Sungkyul University, Seoul 143-747, Korea; [email protected]. * Correspondence: [email protected]. Received: 30 October 2018; Accepted: 10 December 2018; Published: 13 December 2018. Abstract: Generative adversarial networks (GANs) describe an emerging generative model which has made impressive progress in the last few years in generating photorealistic facial images. As a result, it has become more and more difficult to differentiate between computer-generated and real face images, even to the human eye. If the generated images are used with the intent to mislead and deceive readers, it would probably cause severe ethical, moral, and legal issues. Moreover, it is challenging to collect a dataset for computer-generated face identification that is large enough for research purposes because the number of realistic computer-generated images is still limited and scattered on the internet. Thus, the development of a novel decision support system for analyzing and detecting computer-generated face images generated by the GAN network is crucial. In this paper, we propose a customized convolutional neural network, namely CGFace, which is specifically designed for the computer-generated face detection task by customizing the number of convolutional layers, so it performs well in detecting computer-generated face images.
  • Geometrical Aspects of Statistical Learning Theory
Geometrical Aspects of Statistical Learning Theory. Dissertation approved by the Department of Computer Science of Technische Universität Darmstadt for the academic degree of Doctor rerum naturalium (Dr. rer. nat.), submitted by Dipl.-Phys. Matthias Hein from Esslingen am Neckar. Examination committee: chair: Prof. Dr. B. Schiele; first referee: Prof. Dr. T. Hofmann; second referee: Prof. Dr. B. Schölkopf. Date of submission: 30 September 2005; date of defense: 9 November 2005. Darmstadt, 2005. Hochschulkennziffer: D17. Abstract: Geometry plays an important role in modern statistical learning theory, and many different aspects of geometry can be found in this fast developing field. This thesis addresses some of these aspects. A large part of this work will be concerned with so-called manifold methods, which have recently attracted a lot of interest. The key point is that for a lot of real-world data sets it is natural to assume that the data lies on a low-dimensional submanifold of a potentially high-dimensional Euclidean space. We develop a rigorous and quite general framework for the estimation and approximation of some geometric structures and other quantities of this submanifold, using certain corresponding structures on neighborhood graphs built from random samples of that submanifold. Another part of this thesis deals with the generalization of the maximal margin principle to arbitrary metric spaces. This generalization follows quite naturally by changing the viewpoint on the well-known support vector machines (SVM). It can be shown that the SVM can be seen as an algorithm which applies the maximum margin principle to a subclass of metric spaces. The motivation to consider the generalization to arbitrary metric spaces arose from the observation that in practice the condition for the applicability of the SVM is rather difficult to check for a given metric.
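As a small illustration of the 'neighborhood graphs built from random samples of that submanifold' mentioned above (my sketch, not the thesis's construction), the code samples noisy points from a circle embedded in R^3 and connects each point to its k nearest neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
t = rng.uniform(0, 2 * np.pi, size=n)
# samples from a 1-D submanifold (a circle) living in R^3, plus small noise
X = np.stack([np.cos(t), np.sin(t), 0.1 * np.ones_like(t)], axis=1)
X += 0.01 * rng.normal(size=X.shape)

# pairwise Euclidean distances and a k-nearest-neighbour graph
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)
knn = np.argsort(D, axis=1)[:, :k]          # indices of the k nearest neighbours

edges = {(i, int(j)) for i in range(n) for j in knn[i]}
print(len(edges), "directed edges; graph neighbourhoods follow the circle's geometry")
```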
  • Matrix Calculus
Appendix D: Matrix Calculus. "From too much study, and from extreme passion, cometh madnesse." Isaac Newton [205, §5]

D.1 Gradient, Directional derivative, Taylor series

D.1.1 Gradients

The gradient of a differentiable real function f(x) : R^K → R with respect to its vector argument is defined uniquely in terms of partial derivatives, as the column vector

    \nabla f(x) \triangleq \left[ \frac{\partial f(x)}{\partial x_1},\ \frac{\partial f(x)}{\partial x_2},\ \dots,\ \frac{\partial f(x)}{\partial x_K} \right]^{T} \in \mathbb{R}^{K}    (2053)

while the second-order gradient of the twice differentiable real function with respect to its vector argument is traditionally called the Hessian, the symmetric K x K matrix of second partial derivatives

    \nabla^2 f(x) \triangleq \left[ \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j} \right]_{i,j=1}^{K} \in \mathbb{S}^{K}    (2054)

interpreted

    \frac{\partial^2 f(x)}{\partial x_1 \, \partial x_2} = \frac{\partial}{\partial x_1}\left( \frac{\partial f(x)}{\partial x_2} \right) = \frac{\partial}{\partial x_2}\left( \frac{\partial f(x)}{\partial x_1} \right) = \frac{\partial^2 f(x)}{\partial x_2 \, \partial x_1}    (2055)

[Dattorro, Convex Optimization & Euclidean Distance Geometry, Mεβoo, 2005, v2020.02.29]

The gradient of a vector-valued function v(x) : R → R^N on real domain is a row vector

    \nabla v(x) \triangleq \left[ \frac{\partial v_1(x)}{\partial x}\ \ \frac{\partial v_2(x)}{\partial x}\ \ \cdots\ \ \frac{\partial v_N(x)}{\partial x} \right] \in \mathbb{R}^{N}    (2056)

while the second-order gradient is

    \nabla^2 v(x) \triangleq \left[ \frac{\partial^2 v_1(x)}{\partial x^2}\ \ \frac{\partial^2 v_2(x)}{\partial x^2}\ \ \cdots\ \ \frac{\partial^2 v_N(x)}{\partial x^2} \right] \in \mathbb{R}^{N}    (2057)

The gradient of a vector-valued function h(x) : R^K → R^N on vector domain is the K x N matrix whose (i, j) entry is \partial h_j(x) / \partial x_i, i.e. the matrix with columns \nabla h_1(x), \nabla h_2(x), \dots, \nabla h_N(x) …
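The gradient and Hessian definitions above are easy to sanity-check numerically; the sketch below (mine, not from the appendix) compares the analytic gradient and Hessian of the quadratic f(x) = x^T A x + b^T x against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
A = rng.normal(size=(K, K))
A = (A + A.T) / 2                                # symmetric, so grad = 2 A x + b
b = rng.normal(size=K)

f = lambda x: x @ A @ x + b @ x
grad = lambda x: 2 * A @ x + b                   # analytic gradient
hess = 2 * A                                     # analytic (constant) Hessian

x0 = rng.normal(size=K)
eps = 1e-5
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(K)])
num_hess = np.array([(grad(x0 + eps * e) - grad(x0 - eps * e)) / (2 * eps)
                     for e in np.eye(K)])

print(np.max(np.abs(num_grad - grad(x0))))       # agreement up to float error
print(np.max(np.abs(num_hess - hess)))           # agreement up to float error
```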
  • Pre-Training Cnns Using Convolutional Autoencoders
Pre-Training CNNs Using Convolutional Autoencoders. Maximilian Kohlbrenner, Russell Hofmann, Sabbir Ahmmed, Youssef Kashef (TU Berlin). Abstract: Despite convolutional neural networks being the state of the art in almost all computer vision tasks, their training remains a difficult task. Unsupervised representation learning using a convolutional autoencoder can be used to initialize network weights and has been shown to improve test accuracy after training. We reproduce previous results using this approach and successfully apply it to the difficult Extended Cohn-Kanade dataset for which labels are extremely sparse but additional unlabeled data is available for unsupervised use. 1 Introduction: A lot of progress has been made in the field of artificial neural networks in recent years and as a result most computer vision tasks today are best solved using this approach. However, the training of deep neural networks still remains a difficult problem and results are highly dependent on the model initialization (local minima). During a classification task, a Convolutional Neural Network (CNN) first learns a new data representation using its convolution layers as feature extractors and then uses several fully-connected layers for decision-making. While the representation after the convolutional layers is usually optimized for classification, some learned features might be more general and also useful outside of this specific task. Instead of directly optimizing for a good class prediction, one can therefore start by focusing on the intermediate goal of learning a good data representation before beginning to work on the classification problem.
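The mechanism being reproduced is simple to state: train the autoencoder without labels, then copy its encoder weights into the classifier before supervised training. The sketch below (schematic only; the unsupervised training itself is elided and all shapes are arbitrary) shows just the weight hand-off.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights of a (pretend) convolutional encoder after unsupervised
# autoencoder training: two conv layers with 3x3 kernels.
pretrained_encoder = {
    "conv1": rng.normal(size=(16, 1, 3, 3)) * 0.1,   # (out_ch, in_ch, kh, kw)
    "conv2": rng.normal(size=(32, 16, 3, 3)) * 0.1,
}

# Classifier: the same convolutional trunk plus a fresh fully-connected head.
classifier = {
    "conv1": rng.normal(size=(16, 1, 3, 3)) * 0.1,   # would normally stay random
    "conv2": rng.normal(size=(32, 16, 3, 3)) * 0.1,
    "fc":    rng.normal(size=(32 * 7 * 7, 10)) * 0.1,
}

# Pre-training step: initialize the trunk from the autoencoder's encoder,
# keep the classification head randomly initialized.
for name in ("conv1", "conv2"):
    classifier[name] = pretrained_encoder[name].copy()

print(np.allclose(classifier["conv1"], pretrained_encoder["conv1"]))  # True
```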
  • Universal Invariant and Equivariant Graph Neural Networks
Universal Invariant and Equivariant Graph Neural Networks. Nicolas Keriven (École Normale Supérieure, Paris, France), Gabriel Peyré (CNRS and École Normale Supérieure, Paris, France). Abstract: Graph Neural Networks (GNN) come in many flavors, but should always be either invariant (permutation of the nodes of the input graph does not affect the output) or equivariant (permutation of the input permutes the output). In this paper, we consider a specific class of invariant and equivariant networks, for which we prove new universality theorems. More precisely, we consider networks with a single hidden layer, obtained by summing channels formed by applying an equivariant linear operator, a pointwise non-linearity, and either an invariant or equivariant linear output layer. Recently, Maron et al. (2019b) showed that by allowing higher-order tensorization inside the network, universal invariant GNNs can be obtained. As a first contribution, we propose an alternative proof of this result, which relies on the Stone-Weierstrass theorem for algebra of real-valued functions. Our main contribution is then an extension of this result to the equivariant case, which appears in many practical applications but has been less studied from a theoretical point of view. The proof relies on a new generalized Stone-Weierstrass theorem for algebra of equivariant functions, which is of independent interest. Additionally, unlike many previous works that consider a fixed number of nodes, our results show that a GNN defined by a single set of parameters can approximate uniformly well a function defined on graphs of varying size. 1 Introduction: Designing Neural Networks (NN) to exhibit some invariance or equivariance to group operations is a central problem in machine learning (Shawe-Taylor, 1993).
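A minimal concrete instance of the 'equivariant linear operator, pointwise non-linearity, invariant output layer' recipe, restricted to node features rather than the higher-order tensors considered in the paper, looks like the sketch below (my illustration); permuting the nodes leaves the output unchanged up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_in, c_hid = 6, 3, 8
W1 = rng.normal(size=(c_in, c_hid))
W2 = rng.normal(size=(c_in, c_hid))
w_out = rng.normal(size=c_hid)

def invariant_net(X):
    """One hidden layer: a permutation-equivariant linear map (identity term
    plus mean-aggregation term), a pointwise ReLU, and an invariant
    sum-pooled linear readout. A node-feature simplification of the
    architectures discussed in the paper."""
    H = X @ W1 + X.mean(axis=0, keepdims=True) @ W2   # equivariant linear operator
    H = np.maximum(H, 0.0)                            # pointwise non-linearity
    return H.sum(axis=0) @ w_out                      # invariant output layer

X = rng.normal(size=(n, c_in))
perm = rng.permutation(n)
print(invariant_net(X), invariant_net(X[perm]))       # equal up to float error
```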
  • Differentiability in Several Variables: Summary of Basic Concepts
DIFFERENTIABILITY IN SEVERAL VARIABLES: SUMMARY OF BASIC CONCEPTS

1. Partial derivatives. If f : R^3 → R is an arbitrary function and a = (x_0, y_0, z_0) ∈ R^3, then

    (1)   \frac{\partial f}{\partial y}(a) := \lim_{t \to 0} \frac{f(a + t\,\mathbf{j}) - f(a)}{t} = \lim_{t \to 0} \frac{f(x_0, y_0 + t, z_0) - f(x_0, y_0, z_0)}{t},

etc., provided the limit exists.

Example 1. For f : R^2 → R, \frac{\partial f}{\partial x}(0, 0) = \lim_{t \to 0} \frac{f(t, 0) - f(0, 0)}{t}.

Example 2. Let f : R^2 → R be given by

    f(x, y) = \begin{cases} \dfrac{x^2 y}{x^2 + y^2}, & (x, y) \ne (0, 0) \\ 0, & (x, y) = (0, 0) \end{cases}

Then:
• \frac{\partial f}{\partial x}(0, 0) = \lim_{t \to 0} \frac{f(t, 0) - f(0, 0)}{t} = \lim_{t \to 0} \frac{0}{t} = 0
• \frac{\partial f}{\partial y}(0, 0) = \lim_{t \to 0} \frac{f(0, t) - f(0, 0)}{t} = 0

Note: away from (0, 0), where f is the quotient of differentiable functions (with non-zero denominator), one can apply the usual rules of derivation:

    \frac{\partial f}{\partial x}(x, y) = \frac{2xy(x^2 + y^2) - 2x^3 y}{(x^2 + y^2)^2}, \quad \text{for } (x, y) \ne (0, 0).

2. Directional derivatives. If f : R^2 → R is a map, a = (x_0, y_0) ∈ R^2, and v = α i + β j is a vector in R^2, then by definition

    (2)   \partial_v f(a) := \lim_{t \to 0} \frac{f(a + t v) - f(a)}{t} = \lim_{t \to 0} \frac{f(x_0 + t\alpha, y_0 + t\beta) - f(x_0, y_0)}{t}.

Example 3. Let f be the function from Example 2 above. Then for v = α i + β j a unit vector (i.e. …
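The limits in these examples can be checked numerically from the definitions; the snippet below (an added illustration, not part of the notes) evaluates the difference quotients for f(x, y) = x^2 y / (x^2 + y^2) at the origin, where f(t a, t b) = t a^2 b / (a^2 + b^2) makes the directional derivative along a unit vector v = (a, b) equal to a^2 b / (a^2 + b^2).

```python
def f(x, y):
    return 0.0 if (x, y) == (0.0, 0.0) else x**2 * y / (x**2 + y**2)

# partial derivatives at the origin via the defining difference quotients
t = 1e-6
print((f(t, 0.0) - f(0.0, 0.0)) / t)      # -> 0, matches df/dx(0, 0) = 0
print((f(0.0, t) - f(0.0, 0.0)) / t)      # -> 0, matches df/dy(0, 0) = 0

# directional derivative at the origin along the unit vector v = (a, b):
# f(t a, t b) = t * a^2 b / (a^2 + b^2), so the quotient is a^2 b / (a^2 + b^2)
a, b = 0.6, 0.8
print((f(t * a, t * b) - f(0.0, 0.0)) / t, a**2 * b / (a**2 + b**2))
```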