Invariance of Weight Distributions in Rectified MLPs


Russell Tsuchida 1, Farbod Roosta-Khorasani 2 3, Marcus Gallagher 1

1 School of ITEE, University of Queensland, Brisbane, Queensland, Australia. 2 School of Mathematics and Physics, University of Queensland, Brisbane, Queensland, Australia. 3 International Computer Science Institute, Berkeley, California, USA. Correspondence to: Russell Tsuchida <[email protected]>, Farbod Roosta-Khorasani <[email protected]>, Marcus Gallagher <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

An interesting approach to analyzing neural networks that has received renewed attention is to examine the equivalent kernel of the neural network. This is based on the fact that a fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space. We derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions, generalizing a previous result that required Gaussian weight distributions. Additionally, the Central Limit Theorem is used to show that for certain activation functions, kernels corresponding to layers with weight distributions having 0 mean and finite absolute third moment are asymptotically universal, and are well approximated by the kernel corresponding to layers with spherical Gaussian weights. In deep networks, as depth increases the equivalent kernel approaches a pathological fixed point, which can be used to argue why training randomly initialized networks can be difficult. Our results also have implications for weight initialization.

1. Introduction

Neural networks have recently been applied to a number of diverse problems with impressive results (van den Oord et al., 2016; Silver et al., 2017; Berthelot et al., 2017). These breakthroughs largely appear to be driven by application rather than an understanding of the capabilities and training of neural networks. Recently, significant work has been done to increase understanding of neural networks (Choromanska et al., 2015; Haeffele & Vidal, 2015; Poole et al., 2016; Schoenholz et al., 2017; Zhang et al., 2016; Martin & Mahoney, 2017; Shwartz-Ziv & Tishby, 2017; Balduzzi et al., 2017; Raghu et al., 2017). However, there is still work to be done to bring theoretical understanding in line with the results seen in practice.

The connection between neural networks and kernel machines has long been studied (Neal, 1994). Much past work has been done to investigate the equivalent kernel of certain neural networks, either experimentally (Burgess, 1997), through sampling (Sinha & Duchi, 2016; Livni et al., 2017; Lee et al., 2017), or analytically by assuming some random distribution over the weight parameters in the network (Williams, 1997; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b; Daniely et al., 2016; Bach, 2017a). Surprisingly, in the latter approach, rarely have distributions other than the Gaussian been analyzed. This is perhaps due to early influential work on Bayesian networks (MacKay, 1992), which laid a strong mathematical foundation for a Bayesian approach to training networks. Another reason may be that some researchers hold the intuitive (but not necessarily principled) view that the Central Limit Theorem (CLT) should somehow apply.

In this work, we investigate the equivalent kernels for networks with Rectified Linear Unit (ReLU), Leaky ReLU (LReLU) or other activation functions, one hidden layer, and more general weight distributions. Our analysis carries over to deep networks. We investigate the consequences that weight initialization has on the equivalent kernel at the beginning of training. While initialization schemes that mitigate exploding/vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001) for other activation function and weight distribution combinations have been explored in earlier works (Glorot & Bengio, 2010; He et al., 2015), we discuss an initialization scheme for Multi-Layer Perceptrons (MLPs) with LReLUs and weights coming from distributions with 0 mean and finite absolute third moment. The derived kernels also allow us to analyze the loss of information as an input is propagated through the network, offering a complementary view to the shattered gradient problem (Balduzzi et al., 2017).

2. Preliminaries

Consider a fully connected (FC) feedforward neural network with m inputs and a hidden layer with n neurons. Let σ : ℝ → ℝ be the activation function of all the neurons in the hidden layer. Further assume that the biases are 0, as is common when initializing neural network parameters. For any two inputs x, y ∈ ℝ^m propagated through the network, the dot product in the hidden layer is

    (1/n) h(x)·h(y) = (1/n) Σ_{i=1}^n σ(w_i·x) σ(w_i·y),    (1)

where h(·) denotes the n-dimensional vector in the hidden layer and w_i ∈ ℝ^m is the weight vector into the i-th neuron. Assuming an infinite number of hidden neurons, the sum in (1) has an interpretation as an inner product in feature space, which corresponds to the kernel of a Hilbert space. We have

    k(x, y) = ∫_{ℝ^m} σ(w·x) σ(w·y) f(w) dw,    (2)

where f(w) is the probability density function (PDF) for the identically distributed weight vector W = (W_1, ..., W_m)^T in the network. The connection of (2) to the kernels in kernel machines is well-known (Neal, 1994; Williams, 1997; Cho & Saul, 2009).

Probabilistic bounds for the error between (1) and (2) have been derived in special cases (Rahimi & Recht, 2008) when the kernel is shift-invariant. Two specific random feature mappings are considered there: (1) Random Fourier features are taken for the σ in (1). Calculating the approximation error in this way requires being able to sample from the PDF defined by the Fourier transform of the target kernel. More explicitly, the weight distribution f is the Fourier transform of the target kernel, and the n samples σ(w_i·x) are replaced by some appropriate scale of cos(w_i·x). (2) A random bit string σ(x_i) is associated to each input according to a grid with random pitch δ sampled from f imposed on the input space. This method requires having access to the second derivative of the target kernel in order to sample from the distribution f.

Other work (Bach, 2017b) has focused on the smallest error between a target function g in the reproducing kernel Hilbert space (RKHS) defined by (2) and an approximate function ĝ expressible by the RKHS with the kernel (1). More explicitly, let g(x) = ∫_{ℝ^m} G(w) σ(w, x) f(w) dw be the representation of g in the RKHS. The quantity

    ‖ĝ − g‖ = ‖Σ_{i=1}^n α_i σ(w_i, ·) − ∫_{ℝ^m} G(w) σ(w, ·) f(w) dw‖

(with some suitable norm) is studied for the best set of α_i and random w_i with an optimized distribution.

Yet another measure of kernel approximation error is in terms of the optimal solutions to the ridge regression problem of minimizing a regularized cost function C using the kernel (1) and the kernel (2) respectively. The number of datapoints n required to probabilistically bound C(ĝ) − C(g) is found to be O(√n log n) under a suitable set of assumptions. This work notes the connection between kernel machines and one-layer neural networks with ReLU activations and Gaussian weights by citing Cho & Saul (2009). We extend this connection by considering other weight distributions and activation functions.

In this work our focus is on deriving expressions for the target kernel, not the approximation error. Additionally, we consider random mappings that have not been considered elsewhere. Our work is related to that of Poole et al. (2016) and Schoenholz et al. (2017); however, our results apply to the unbounded (L)ReLU activation function and more general weight distributions, while their work considers random biases as well as weights.

3. Equivalent Kernels for Infinite Width Hidden Layers

The kernel (2) has previously been evaluated for a number of choices of f and σ (Williams, 1997; Roux & Bengio, 2007; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b). In particular, the equivalent kernel for a one-hidden-layer network with spherical Gaussian weights of variance E[W_i²] and mean 0 is the Arc-Cosine Kernel (Cho & Saul, 2009)

    k(x, y) = (E[W_i²] ‖x‖‖y‖ / 2π) (sin θ_0 + (π − θ_0) cos θ_0),    (3)

where θ_0 = cos⁻¹(x·y / (‖x‖‖y‖)) is the angle between the inputs x and y, and ‖·‖ denotes the ℓ2 norm. Noticing that the Arc-Cosine Kernel k(x, y) depends on x and y only through their norms and θ_0, with an abuse of notation we will henceforth set k(x, y) ≡ k(θ_0). Define the normalized kernel to be the cosine similarity between the signals in the hidden layer. The normalized Arc-Cosine Kernel is given by

    cos θ_1 = k(x, y) / (√(k(x, x)) √(k(y, y))) = (1/π) (sin θ_0 + (π − θ_0) cos θ_0),

where θ_1 is the angle between the signals in the first layer. Figure 1 shows a plot of the normalized Arc-Cosine Kernel.

One might ask how the equivalent kernel changes for a different choice of weight distribution. We investigate the equivalent kernel for networks with (L)ReLU activations and general weight distributions in Sections 3.1 and 3.2. The equivalent kernel can be composed and applied to deep networks. The kernel can also be used to choose good weights for initialization.
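To make the correspondence concrete, here is a short numerical sketch (ours, not from the paper; the function names `empirical_kernel` and `arc_cosine_kernel` are our own). It estimates the finite-width dot product (1) by Monte Carlo sampling of spherical Gaussian weights, compares the result against the closed-form Arc-Cosine Kernel (3), and then iterates the normalized kernel map to illustrate the slow approach to the fixed point mentioned in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def empirical_kernel(x, y, n=200_000, var=1.0):
    """Monte Carlo estimate of the width-n dot product (1) for a
    one-hidden-layer ReLU network with i.i.d. N(0, var) weights."""
    W = rng.normal(0.0, np.sqrt(var), size=(n, x.shape[0]))  # one row per hidden neuron
    return np.mean(relu(W @ x) * relu(W @ y))

def arc_cosine_kernel(x, y, var=1.0):
    """Closed-form equivalent kernel (3) for spherical Gaussian weights."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta0 = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return var * nx * ny / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, -1.0, 1.0])
print(empirical_kernel(x, y), arc_cosine_kernel(x, y))  # the two values should agree closely

# Composing the normalized kernel, cos θ_{l+1} = (1/π)(sin θ_l + (π − θ_l) cos θ_l),
# shrinks any initial angle slowly towards the fixed point θ = 0.
theta = np.pi / 2
for _ in range(50):
    theta = np.arccos((np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi)
print(theta)
```

With n in the hundreds of thousands, the Monte Carlo estimate typically agrees with (3) to within a few percent, while the angle iteration shows orthogonal inputs being driven towards alignment as depth grows, which is one way to see why very deep randomly initialized networks discard input information.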