Neural Networks

Total Page:16

File Type:pdf, Size:1020Kb

Neural Networks School of Computer Science 10-601B Introduction to Machine Learning Neural Networks Readings: Matt Gormley Bishop Ch. 5 Lecture 15 Murphy Ch. 16.5, Ch. 28 October 19, 2016 Mitchell Ch. 4 1 Reminders 2 Outline • Logistic Regression (Recap) • Neural Networks • Backpropagation 3 RECALL: LOGISTIC REGRESSION 4 Using gradient ascent for linear Recall… classifiers Key idea behind today’s lecture: 1. Define a linear classifier (logistic regression) 2. Define an objective function (likelihood) 3. Optimize it with gradient descent to learn parameters 4. Predict the class with highest probability under the model 5 Using gradient ascent for linear Recall… classifiers This decision function isn’t Use a differentiable function differentiable: instead: 1 T pθ(y =1t)= h(t)=sign(θ t) | 1+2tT( θT t) − sign(x) 1 logistic(u) ≡ −u 1+ e 6 Using gradient ascent for linear Recall… classifiers This decision function isn’t Use a differentiable function differentiable: instead: 1 T pθ(y =1t)= h(t)=sign(θ t) | 1+2tT( θT t) − sign(x) 1 logistic(u) ≡ −u 1+ e 7 Recall… Logistic Regression Data: Inputs are continuous vectors of length K. Outputs are discrete. (i) (i) N K = t ,y where t R and y 0, 1 D { }i=1 ∈ ∈ { } Model: Logistic function applied to dot product of parameters with input vector. 1 pθ(y =1t)= | 1+2tT( θT t) − Learning: finds the parameters that minimize some objective function. θ∗ = argmin J(θ) θ Prediction: Output is the most probable class. yˆ = `;Kt pθ(y t) y 0,1 | ∈{ } 8 NEURAL NETWORKS 9 Learning highly non-linear functions f: X à Y l f might be non-linear function l X (vector of) continuous and/or discrete vars l Y (vector of) continuous and/or discrete vars The XOR gate Speech recognition © Eric Xing @ CMU, 2006-2011 10 Perceptron and Neural Nets l From biological neuron to artificial neuron (perceptron) Synapse Inputs Synapse Dendrites x Axon 1 Linear Hard Axon Combiner Limiter w1 Output ∑ Y Soma Soma w2 Dendrites θ x Synapse 2 Threshold l Activation function n ⎧+1, if X ≥ ω0 X = ∑ xiwi Y = ⎨ i=1 ⎩−1, if X < ω0 l Artificial neuron networks l supervised learning l gradient descent i g n a l s g n a l i i g n a l s g n a l i I n p u tI S O u tO p u t S Middle Layer Input Layer Output Layer © Eric Xing @ CMU, 2006-2011 11 Connectionist Models l Consider humans: l Neuron switching time ~ 0.001 second l Number of neurons ~ 1010 l Connections per neuron ~ 104-5 l Scene recognition time ~ 0.1 second l 100 inference steps doesn't seem like enough à much parallel computation l Properties of artificial neural nets (ANN) l Many neuron-like threshold switching units l Many weighted interconnections among units l Highly parallel, distributed processes © Eric Xing @ CMU, 2006-2011 12 Why is everyone talking Motivation about Deep Learning? • Because a lot of money is invested in it… – DeepMind: Acquired by Google for $400 million – DNNResearch: Three person startup (including Geoff Hinton) acquired by Google for unknown price tag – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars • Because it made the front page of the New York Times 13 Why is everyone talking Motivation about Deep Learning? 1960s Deep learning: – Has won numerous pattern recognition 1980s competitions – Does so with minimal feature 1990s engineering This wasn’t always the case! 2006 Since 1980s: Form of models hasn’t changed much, but lots of new tricks… – More hidden units 2016 – Better (online) optimization – New nonlinear functions (ReLUs) – Faster computers (CPUs and GPUs) 14 A Recipe for Background Machine Learning 1. Given training data: Face Face Not a face 2. Choose each of these: – Decision function Examples: Linear regression, Logistic regression, Neural Network – Loss function Examples: Mean-squared error, Cross Entropy 15 A Recipe for Background Machine Learning 1. Given training data: 3. Define goal: 2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function 16 A Recipe for Background Gradients Machine Learning 1. Given training data: Backpropagation3. Define goal: can compute this gradient! And it’s a special case of a more general algorithm called reverse- 2. Choose each of these: mode automatic differentiation that – Decision function can compute the gradient of any 4. Train with SGD: differentiable function efficiently! (take small steps opposite the gradient) – Loss function 17 A Recipe for Background Goals for Today’s Lecture Machine Learning 1. Given training data: 1. Explore a new class of decision functions 3. Define goal: (Neural Networks) 2. Consider variants of this recipe for training 2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function 18 Decision Functions Linear Regression T y = hθ(x)=σ(θ x) Output where σ(a)=a θ θ θ1 θ2 3 M Input … 19 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − θ θ θ1 θ2 3 M Input … 20 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − Face Face Not a face θ θ θ1 θ2 3 M Input … 21 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − 1 1 0 y θ θ x2 θ1 θ2 3 M x1 Input … 22 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − θ θ θ1 θ2 3 M Input … 23 Neural Network Model Inputs Output Age 34 .6 .4 .2 Σ 0.6 .1 .5 Gender 2 .2 Σ .3 Σ .8 .7 “Probability of beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 24 “Combined logistic models” Inputs Output Age 34 .6 .5 0.6 .1 Gender 2 Σ .7 .8 “Probability of beingAlive” Stage 4 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 25 Inputs Output Age 34 .5 .2 0.6 Gender 2 .3 Σ “Probability of .8 beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 26 Inputs Output Age 34 .6 .5 .2 0.6 .1 Gender 1 .3 Σ “Probability of .7 .8 beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 27 Not really, no target for hidden units... Age 34 .6 .4 .2 Σ 0.6 .1 .5 Gender 2 .2 Σ .3 Σ .8 .7 “Probability of beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 28 Jargon Pseudo-Correspondence l Independent variable = input variable l Dependent variable = output variable l Coefficients = “weights” l Estimates = “targets” Logistic Regression Model (the sigmoid unit) Inputs Output Age 34 5 0.6 Gende 1 4 Σ “Probability of r beingAlive” Stage 4 8 Independent variables Coefficients Dependent variable x1, x2, x3 a, b, c p Prediction © Eric Xing @ CMU, 2006-2011 29 Decision Functions Neural Network Output Hidden Layer … Input … 30 Decision Functions Neural Network (F) Loss J = 1 (y y(d))2 2 − (E) Output (sigmoid) 1 y = 1+2tT( b) Output − (D) Output (linear) D … b = βj zj Hidden Layer j=0 (C) Hidden (sigmoid) 1 zj = 1+2tT( a ) , j − j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 31 i ∀ Building a Neural Net Output Features … 32 Building a Neural Net Output Hidden Layer … D = M 1 1 1 Input … 33 Building a Neural Net Output Hidden Layer … D = M Input … 34 Building a Neural Net Output Hidden Layer … D = M Input … 35 Building a Neural Net Output Hidden Layer … D < M Input … 36 Decision Boundary • 0 hidden layers: linear classifier – Hyperplanes y x1 x2 Example from to Eric Postma via Jason Eisner 37 Decision Boundary • 1 hidden layer – Boundary of convex region (open or closed) y x1 x2 Example from to Eric Postma via Jason Eisner 38 Decision Boundary y • 2 hidden layers – Combinations of convex regions x1 x2 Example from to Eric Postma via Jason Eisner 39 Decision Functions Multi-Class Output Output … Hidden Layer … Input … 40 Decision Functions Deeper Networks Next lecture: Output … Hidden Layer 1 … Input 41 Decision Functions Deeper Networks Next lecture: Output … Hidden Layer 2 … Hidden Layer 1 … Input 42 Decision Functions Deeper Networks Next lecture: Output Making the … neural Hidden Layer 3 networks … deeper Hidden Layer 2 … Hidden Layer 1 … Input 43 Decision Different Levels of Functions Abstraction • We don’t know the “right” levels of abstraction • So let the model figure it out! 44 Example from Honglak Lee (NIPS 2010) Decision Different Levels of Functions Abstraction Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions 45 Example from Honglak Lee (NIPS 2010) Decision Different Levels of Functions Abstraction Output … Hidden Layer 3 … Hidden Layer 2 … Hidden Layer 1 … Input 46 Example from Honglak Lee (NIPS 2010) ARCHITECTURES 47 Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make: 1. # of hidden layers (depth) 2. # of units per hidden layer (width) 3. Type of activation function (nonlinearity) 4. Form of objective function 48 Activation Functions (F) Loss Neural Network with sigmoid 1 2 J = (y y∗) activation functions 2 − (E) Output (sigmoid) 1 y = 1+2tT( b) Output − (D) Output (linear) D … b = j=0 βj zj Hidden Layer (C) Hidden (sigmoid) 1 zj = 1+2tT( a ) , j − j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 49 i ∀ Activation Functions (F) Loss Neural Network with arbitrary 1 2 J = (y y∗) nonlinear activation functions 2 − (E) Output (nonlinear) y = σ(b) Output (D) Output (linear) D … b = j=0 βj zj Hidden Layer (C) Hidden (nonlinear) z = σ(a ), j j j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 50 i ∀ Activation Functions Sigmoid / Logistic Function So far, we’ve 1 assumed that the logistic(u) ≡ 1+ e−u activation function (nonlinearity) is always the sigmoid function… 51 Activation Functions • A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs Alternate 1: tanh Like logistic function but shifted to range [-1, +1] Slide from William Cohen AI Stats 2010 depth 4? sigmoid vs.
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
  • 6.5 Applications of Exponential and Logarithmic Functions 469
    6.5 Applications of Exponential and Logarithmic Functions 469 6.5 Applications of Exponential and Logarithmic Functions As we mentioned in Section 6.1, exponential and logarithmic functions are used to model a wide variety of behaviors in the real world. In the examples that follow, note that while the applications are drawn from many different disciplines, the mathematics remains essentially the same. Due to the applied nature of the problems we will examine in this section, the calculator is often used to express our answers as decimal approximations. 6.5.1 Applications of Exponential Functions Perhaps the most well-known application of exponential functions comes from the financial world. Suppose you have $100 to invest at your local bank and they are offering a whopping 5 % annual percentage interest rate. This means that after one year, the bank will pay you 5% of that $100, or $100(0:05) = $5 in interest, so you now have $105.1 This is in accordance with the formula for simple interest which you have undoubtedly run across at some point before. Equation 6.1. Simple Interest The amount of interest I accrued at an annual rate r on an investmenta P after t years is I = P rt The amount A in the account after t years is given by A = P + I = P + P rt = P (1 + rt) aCalled the principal Suppose, however, that six months into the year, you hear of a better deal at a rival bank.2 Naturally, you withdraw your money and try to invest it at the higher rate there.
    [Show full text]
  • Deep Learning Based Computer Generated Face Identification Using
    applied sciences Article Deep Learning Based Computer Generated Face Identification Using Convolutional Neural Network L. Minh Dang 1, Syed Ibrahim Hassan 1, Suhyeon Im 1, Jaecheol Lee 2, Sujin Lee 1 and Hyeonjoon Moon 1,* 1 Department of Computer Science and Engineering, Sejong University, Seoul 143-747, Korea; [email protected] (L.M.D.); [email protected] (S.I.H.); [email protected] (S.I.); [email protected] (S.L.) 2 Department of Information Communication Engineering, Sungkyul University, Seoul 143-747, Korea; [email protected] * Correspondence: [email protected] Received: 30 October 2018; Accepted: 10 December 2018; Published: 13 December 2018 Abstract: Generative adversarial networks (GANs) describe an emerging generative model which has made impressive progress in the last few years in generating photorealistic facial images. As the result, it has become more and more difficult to differentiate between computer-generated and real face images, even with the human’s eyes. If the generated images are used with the intent to mislead and deceive readers, it would probably cause severe ethical, moral, and legal issues. Moreover, it is challenging to collect a dataset for computer-generated face identification that is large enough for research purposes because the number of realistic computer-generated images is still limited and scattered on the internet. Thus, a development of a novel decision support system for analyzing and detecting computer-generated face images generated by the GAN network is crucial. In this paper, we propose a customized convolutional neural network, namely CGFace, which is specifically designed for the computer-generated face detection task by customizing the number of convolutional layers, so it performs well in detecting computer-generated face images.
    [Show full text]
  • Matrix Calculus
    Appendix D Matrix Calculus From too much study, and from extreme passion, cometh madnesse. Isaac Newton [205, §5] − D.1 Gradient, Directional derivative, Taylor series D.1.1 Gradients Gradient of a differentiable real function f(x) : RK R with respect to its vector argument is defined uniquely in terms of partial derivatives→ ∂f(x) ∂x1 ∂f(x) , ∂x2 RK f(x) . (2053) ∇ . ∈ . ∂f(x) ∂xK while the second-order gradient of the twice differentiable real function with respect to its vector argument is traditionally called the Hessian; 2 2 2 ∂ f(x) ∂ f(x) ∂ f(x) 2 ∂x1 ∂x1∂x2 ··· ∂x1∂xK 2 2 2 ∂ f(x) ∂ f(x) ∂ f(x) 2 2 K f(x) , ∂x2∂x1 ∂x2 ··· ∂x2∂xK S (2054) ∇ . ∈ . .. 2 2 2 ∂ f(x) ∂ f(x) ∂ f(x) 2 ∂xK ∂x1 ∂xK ∂x2 ∂x ··· K interpreted ∂f(x) ∂f(x) 2 ∂ ∂ 2 ∂ f(x) ∂x1 ∂x2 ∂ f(x) = = = (2055) ∂x1∂x2 ³∂x2 ´ ³∂x1 ´ ∂x2∂x1 Dattorro, Convex Optimization Euclidean Distance Geometry, Mεβoo, 2005, v2020.02.29. 599 600 APPENDIX D. MATRIX CALCULUS The gradient of vector-valued function v(x) : R RN on real domain is a row vector → v(x) , ∂v1(x) ∂v2(x) ∂vN (x) RN (2056) ∇ ∂x ∂x ··· ∂x ∈ h i while the second-order gradient is 2 2 2 2 , ∂ v1(x) ∂ v2(x) ∂ vN (x) RN v(x) 2 2 2 (2057) ∇ ∂x ∂x ··· ∂x ∈ h i Gradient of vector-valued function h(x) : RK RN on vector domain is → ∂h1(x) ∂h2(x) ∂hN (x) ∂x1 ∂x1 ··· ∂x1 ∂h1(x) ∂h2(x) ∂hN (x) h(x) , ∂x2 ∂x2 ··· ∂x2 ∇ .
    [Show full text]
  • Chapter 10. Logistic Models
    www.ck12.org Chapter 10. Logistic Models CHAPTER 10 Logistic Models Chapter Outline 10.1 LOGISTIC GROWTH MODEL 10.2 CONDITIONS FAVORING LOGISTIC GROWTH 10.3 REFERENCES 207 10.1. Logistic Growth Model www.ck12.org 10.1 Logistic Growth Model Here you will explore the graph and equation of the logistic function. Learning Objectives • Recognize logistic functions. • Identify the carrying capacity and inflection point of a logistic function. Logistic Models Exponential growth increases without bound. This is reasonable for some situations; however, for populations there is usually some type of upper bound. This can be caused by limitations on food, space or other scarce resources. The effect of this limiting upper bound is a curve that grows exponentially at first and then slows down and hardly grows at all. This is characteristic of a logistic growth model. The logistic equation is of the form: C C f (t) = 1+ab−t = 1+ae−kt The above equations represent the logistic function, and it contains three important pieces: C, a, and b. C determines that maximum value of the function, also known as the carrying capacity. C is represented by the dashed line in the graph below. 208 www.ck12.org Chapter 10. Logistic Models The constant of a in the logistic function is used much like a in the exponential function: it helps determine the value of the function at t=0. Specifically: C f (0) = 1+a The constant of b also follows a similar concept to the exponential function: it helps dictate the rate of change at the beginning and the end of the function.
    [Show full text]
  • Neural Network Architectures and Activation Functions: a Gaussian Process Approach
    Fakultät für Informatik Neural Network Architectures and Activation Functions: A Gaussian Process Approach Sebastian Urban Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation. Vorsitzender: Prof. Dr. rer. nat. Stephan Günnemann Prüfende der Dissertation: 1. Prof. Dr. Patrick van der Smagt 2. Prof. Dr. rer. nat. Daniel Cremers 3. Prof. Dr. Bernd Bischl, Ludwig-Maximilians-Universität München Die Dissertation wurde am 22.11.2017 bei der Technischen Universtät München eingereicht und durch die Fakultät für Informatik am 14.05.2018 angenommen. Neural Network Architectures and Activation Functions: A Gaussian Process Approach Sebastian Urban Technical University Munich 2017 ii Abstract The success of applying neural networks crucially depends on the network architecture being appropriate for the task. Determining the right architecture is a computationally intensive process, requiring many trials with different candidate architectures. We show that the neural activation function, if allowed to individually change for each neuron, can implicitly control many aspects of the network architecture, such as effective number of layers, effective number of neurons in a layer, skip connections and whether a neuron is additive or multiplicative. Motivated by this observation we propose stochastic, non-parametric activation functions that are fully learnable and individual to each neuron. Complexity and the risk of overfitting are controlled by placing a Gaussian process prior over these functions. The result is the Gaussian process neuron, a probabilistic unit that can be used as the basic building block for probabilistic graphical models that resemble the structure of neural networks.
    [Show full text]
  • Differentiability in Several Variables: Summary of Basic Concepts
    DIFFERENTIABILITY IN SEVERAL VARIABLES: SUMMARY OF BASIC CONCEPTS 3 3 1. Partial derivatives If f : R ! R is an arbitrary function and a = (x0; y0; z0) 2 R , then @f f(a + tj) ¡ f(a) f(x ; y + t; z ) ¡ f(x ; y ; z ) (1) (a) := lim = lim 0 0 0 0 0 0 @y t!0 t t!0 t etc.. provided the limit exists. 2 @f f(t;0)¡f(0;0) Example 1. for f : R ! R, @x (0; 0) = limt!0 t . Example 2. Let f : R2 ! R given by ( 2 x y ; (x; y) 6= (0; 0) f(x; y) = x2+y2 0; (x; y) = (0; 0) Then: @f f(t;0)¡f(0;0) 0 ² @x (0; 0) = limt!0 t = limt!0 t = 0 @f f(0;t)¡f(0;0) ² @y (0; 0) = limt!0 t = 0 Note: away from (0; 0), where f is the quotient of differentiable functions (with non-zero denominator) one can apply the usual rules of derivation: @f 2xy(x2 + y2) ¡ 2x3y (x; y) = ; for (x; y) 6= (0; 0) @x (x2 + y2)2 2 2 2. Directional derivatives. If f : R ! R is a map, a = (x0; y0) 2 R and v = ®i + ¯j is a vector in R2, then by definition f(a + tv) ¡ f(a) f(x0 + t®; y0 + t¯) ¡ f(x0; y0) (2) @vf(a) := lim = lim t!0 t t!0 t Example 3. Let f the function from the Example 2 above. Then for v = ®i + ¯j a unit vector (i.e.
    [Show full text]
  • Lipschitz Recurrent Neural Networks
    Published as a conference paper at ICLR 2021 LIPSCHITZ RECURRENT NEURAL NETWORKS N. Benjamin Erichson Omri Azencot Alejandro Queiruga ICSI and UC Berkeley Ben-Gurion University Google Research [email protected] [email protected] [email protected] Liam Hodgkinson Michael W. Mahoney ICSI and UC Berkeley ICSI and UC Berkeley [email protected] [email protected] ABSTRACT Viewing recurrent neural networks (RNNs) as continuous-time dynamical sys- tems, we propose a recurrent unit that describes the hidden state’s evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity. This particular functional form facilitates stability analysis of the long-term behavior of the recurrent unit using tools from nonlinear systems theory. In turn, this en- ables architectural design decisions before experimentation. Sufficient conditions for global stability of the recurrent unit are obtained, motivating a novel scheme for constructing hidden-to-hidden matrices. Our experiments demonstrate that the Lipschitz RNN can outperform existing recurrent units on a range of bench- mark tasks, including computer vision, language modeling and speech prediction tasks. Finally, through Hessian-based analysis we demonstrate that our Lipschitz recurrent unit is more robust with respect to input and parameter perturbations as compared to other continuous-time RNNs. 1 INTRODUCTION Many interesting problems exhibit temporal structures that can be modeled with recurrent neural networks (RNNs), including problems in robotics, system identification, natural language process- ing, and machine learning control. In contrast to feed-forward neural networks, RNNs consist of one or more recurrent units that are designed to have dynamical (recurrent) properties, thereby enabling them to acquire some form of internal memory.
    [Show full text]
  • Jointly Clustering with K-Means and Learning Representations
    Deep k-Means: Jointly clustering with k-Means and learning representations Maziar Moradi Fard Univ. Grenoble Alpes, CNRS, Grenoble INP – LIG [email protected] Thibaut Thonet Univ. Grenoble Alpes, CNRS, Grenoble INP – LIG [email protected] Eric Gaussier Univ. Grenoble Alpes, CNRS, Grenoble INP – LIG, Skopai [email protected] Abstract We study in this paper the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so that the two tasks are performed jointly. We propose here such an approach for k-Means clustering based on a continuous reparametrization of the objective function that leads to a truly joint solution. The behavior of our approach is illustrated on various datasets showing its efficacy in learning representations for objects while clustering them. 1 Introduction Clustering is a long-standing problem in the machine learning and data mining fields, and thus accordingly fostered abundant research. Traditional clustering methods, e.g., k-Means [23] and Gaussian Mixture Models (GMMs) [6], fully rely on the original data representations and may then be ineffective when the data points (e.g., images and text documents) live in a high-dimensional space – a problem commonly known as the curse of dimensionality. Significant progress has been arXiv:1806.10069v2 [cs.LG] 12 Dec 2018 made in the last decade or so to learn better, low-dimensional data representations [13].
    [Show full text]
  • Lecture 5: Logistic Regression 1 MLE Derivation
    CSCI 5525 Machine Learning Fall 2019 Lecture 5: Logistic Regression Feb 10 2020 Lecturer: Steven Wu Scribe: Steven Wu Last lecture, we give several convex surrogate loss functions to replace the zero-one loss func- tion, which is NP-hard to optimize. Now let us look into one of the examples, logistic loss: given d parameter w and example (xi; yi) 2 R × {±1g, the logistic loss of w on example (xi; yi) is defined as | ln (1 + exp(−yiw xi)) This loss function is used in logistic regression. We will introduce the statistical model behind logistic regression, and show that the ERM problem for logistic regression is the same as the relevant maximum likelihood estimation (MLE) problem. 1 MLE Derivation For this derivation it is more convenient to have Y = f0; 1g. Note that for any label yi 2 f0; 1g, we also have the “signed” version of the label 2yi − 1 2 {−1; 1g. Recall that in general su- pervised learning setting, the learner receive examples (x1; y1);:::; (xn; yn) drawn iid from some distribution P over labeled examples. We will make the following parametric assumption on P : | yi j xi ∼ Bern(σ(w xi)) where Bern denotes the Bernoulli distribution, and σ is the logistic function defined as follows 1 exp(z) σ(z) = = 1 + exp(−z) 1 + exp(z) See Figure 1 for a visualization of the logistic function. In general, the logistic function is a useful function to convert real values into probabilities (in the range of (0; 1)). If w|x increases, then σ(w|x) also increases, and so does the probability of Y = 1.
    [Show full text]
  • An Introduction to Logistic Regression: from Basic Concepts to Interpretation with Particular Attention to Nursing Domain
    J Korean Acad Nurs Vol.43 No.2, 154 -164 J Korean Acad Nurs Vol.43 No.2 April 2013 http://dx.doi.org/10.4040/jkan.2013.43.2.154 An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain Park, Hyeoun-Ae College of Nursing and System Biomedical Informatics National Core Research Center, Seoul National University, Seoul, Korea Purpose: The purpose of this article is twofold: 1) introducing logistic regression (LR), a multivariable method for modeling the relationship between multiple independent variables and a categorical dependent variable, and 2) examining use and reporting of LR in the nursing literature. Methods: Text books on LR and research articles employing LR as main statistical analysis were reviewed. Twenty-three articles published between 2010 and 2011 in the Journal of Korean Academy of Nursing were analyzed for proper use and reporting of LR models. Results: Logistic regression from basic concepts such as odds, odds ratio, logit transformation and logistic curve, assumption, fitting, reporting and interpreting to cautions were presented. Substantial short- comings were found in both use of LR and reporting of results. For many studies, sample size was not sufficiently large to call into question the accuracy of the regression model. Additionally, only one study reported validation analysis. Conclusion: Nurs- ing researchers need to pay greater attention to guidelines concerning the use and reporting of LR models. Key words: Logit function, Maximum likelihood estimation, Odds, Odds ratio, Wald test INTRODUCTION The model serves two purposes: (1) it can predict the value of the depen- dent variable for new values of the independent variables, and (2) it can Multivariable methods of statistical analysis commonly appear in help describe the relative contribution of each independent variable to general health science literature (Bagley, White, & Golomb, 2001).
    [Show full text]
  • Differentiability of the Convolution
    Differentiability of the convolution. Pablo Jimenez´ Rodr´ıguez Universidad de Valladolid March, 8th, 2019 Pablo Jim´enezRodr´ıguez Differentiability of the convolution Definition Let L1[−1; 1] be the set of the 2-periodic functions f so that R 1 −1 jf (t)j dt < 1: Given f ; g 2 L1[−1; 1], the convolution of f and g is the function Z 1 f ∗ g(x) = f (t)g(x − t) dt −1 Pablo Jim´enezRodr´ıguez Differentiability of the convolution Definition Let L1[−1; 1] be the set of the 2-periodic functions f so that R 1 −1 jf (t)j dt < 1: Given f ; g 2 L1[−1; 1], the convolution of f and g is the function Z 1 f ∗ g(x) = f (t)g(x − t) dt −1 Pablo Jim´enezRodr´ıguez Differentiability of the convolution Definition Let L1[−1; 1] be the set of the 2-periodic functions f so that R 1 −1 jf (t)j dt < 1: Given f ; g 2 L1[−1; 1], the convolution of f and g is the function Z 1 f ∗ g(x) = f (t)g(x − t) dt −1 Pablo Jim´enezRodr´ıguez Differentiability of the convolution Definition Let L1[−1; 1] be the set of the 2-periodic functions f so that R 1 −1 jf (t)j dt < 1: Given f ; g 2 L1[−1; 1], the convolution of f and g is the function Z 1 f ∗ g(x) = f (t)g(x − t) dt −1 Pablo Jim´enezRodr´ıguez Differentiability of the convolution If f is Lipschitz, then f ∗ g is Lipschitz.
    [Show full text]