The Perceptron

Nuno Vasconcelos, ECE Department, UCSD

Classification

A classification problem has two types of variables:
• X – vector of observations (features) in the world
• Y – state (class) of the world

For example:
• x ∈ X ⊂ R^2, x = (fever, blood pressure)
• y ∈ Y = {disease, no disease}

X and Y are related by an (unknown) function

    y = f(x)

Goal: design a classifier h: X → Y such that h(x) = f(x), ∀x.

Linear discriminant

The classifier implements the linear decision rule

    h^*(x) = \begin{cases} 1 & \text{if } g(x) > 0 \\ -1 & \text{if } g(x) < 0 \end{cases} = \mathrm{sgn}[g(x)], \qquad \text{with } g(x) = w^T x + b

It has the following properties:
• it divides X into two "half-spaces"
• the boundary is the hyperplane with normal w, at distance b/||w|| from the origin
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to (the "positive side")
• g(x) < 0 on the "negative side"

Given a linearly separable training set D = {(x_1, y_1), ..., (x_n, y_n)}, the classifier makes no errors if and only if, for every i,
• y_i = 1 and g(x_i) > 0, or
• y_i = −1 and g(x_i) < 0,
i.e. y_i · g(x_i) > 0. This allows a very concise expression for the situation of "no training error" or "zero empirical risk".

Learning as optimization

The necessary and sufficient condition for zero empirical risk is

    y_i (w^T x_i + b) > 0, \quad \forall i

This is interesting because it allows the formulation of the learning problem as one of function optimization:
• starting from a random guess for the parameters w and b,
• we maximize the reward function \sum_{i=1}^n y_i (w^T x_i + b),
• or, equivalently, minimize the cost function

    J(w, b) = -\sum_{i=1}^n y_i (w^T x_i + b)

The gradient

We have seen that the gradient of a function f(w) at z is

    \nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T

Theorem: the gradient points in the direction of maximum growth. That is, the gradient is
• the direction of greatest increase of f at z,
• normal to the iso-contours of f.

Critical point conditions

Let f(x) be twice continuously differentiable. For x* to be a local minimum of f it is necessary that
• f has zero gradient at x*: \nabla f(x^*) = 0, and
• the Hessian of f at x* is positive semidefinite:

    d^T \nabla^2 f(x^*)\, d \geq 0, \quad \forall d \in \mathbb{R}^n

(with a strictly positive definite Hessian, the two conditions are also sufficient), where

    \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}

Gradient descent

This suggests a simple minimization technique:
• pick an initial estimate x^{(0)}
• follow the negative gradient:

    x^{(n+1)} = x^{(n)} - \eta \nabla f(x^{(n)})

This is gradient descent. η is the learning rate and needs to be carefully chosen: if η is too large, the descent may diverge. Many extensions are possible. The main point: once the problem is framed as optimization, we can (in general) solve it (a short code sketch follows below).

The perceptron

This was the main insight of Rosenblatt, which led to the perceptron. The basic idea is to do gradient descent on our cost

    J(w, b) = -\sum_{i=1}^n y_i (w^T x_i + b)

We know that:
• if the training set is linearly separable, there is at least one pair (w, b) such that J(w, b) < 0;
• any minimum that is equal to or better than this will do.

Q: can we find one such minimum?
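As a concrete illustration of the gradient descent update above, here is a minimal NumPy sketch (the quadratic test function and all names are our illustrative choices, not from the lecture):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Minimize f by following the negative gradient:
    x^(n+1) = x^(n) - eta * grad(x^(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Example: f(x) = ||x||^2 has gradient 2x and its minimum at the origin.
x_star = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
# With eta = 0.1 each iterate shrinks by a factor 0.8 and converges to
# ~[0, 0]; a too-large eta (e.g. 1.1) makes the iterates grow and diverge.
```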
Perceptron learning

The gradient of J is straightforward to compute:

    \frac{\partial J}{\partial w} = -\sum_i y_i x_i, \qquad \frac{\partial J}{\partial b} = -\sum_i y_i

and gradient descent is trivial. There is, however, one problem:
• J(w, b) is not bounded below;
• if J(w, b) < 0, we can make J → −∞ by multiplying w and b by λ > 0;
• the minimum is always at −∞, which is quite bad numerically.

This is really just the normalization problem that we already talked about.

Rosenblatt's idea

Restrict attention to the points incorrectly classified at each iteration: define the set of errors

    E = \{ x_i \mid y_i (w^T x_i + b) < 0 \}

and make the cost

    J_p(w, b) = -\sum_{i \mid x_i \in E} y_i (w^T x_i + b)

Note that:
• J_p cannot be negative since, in E, all y_i (w^T x_i + b) are negative;
• if we get to zero, we know we have the best possible solution (E is empty).

Perceptron learning

Learning is trivial: just do gradient descent on J_p(w, b):

    w^{(n+1)} = w^{(n)} + \eta \sum_{x_i \in E} y_i x_i
    b^{(n+1)} = b^{(n)} + \eta \sum_{x_i \in E} y_i

This turns out not to be very effective if D is large: we loop over the entire training set only to take a small step at the end. One alternative that is frequently better is "stochastic gradient descent":
• take the step immediately after each point;
• there is no guarantee that each step is a descent step but, on average, you follow the same direction as after processing the entire D;
• it is very popular in learning, where D is usually large.

The algorithm is as follows:

    set k = 0, w_k = 0, b_k = 0
    set R = max_i ||x_i||
    do {
        for i = 1:n {
            if y_i (w_k^T x_i + b_k) < 0 then {
                w_{k+1} = w_k + η y_i x_i
                b_{k+1} = b_k + η y_i R^2      (we will talk about R shortly!)
                k = k + 1
            }
        }
    } until y_i (w_k^T x_i + b_k) ≥ 0, ∀i (no errors)

Does this make sense? The lecture steps through a 2-D example, repeated over several slides: a linearly separable training set (class y = 1 plotted as "x", class y = −1 as "o"), a point x_i misclassified by the current boundary, the weight update w_k + η y_i x_i rotating the normal w_k toward x_i, and the bias update b_k + η y_i R^2 shifting the boundary, until x_i lands on the correct side. A code sketch follows.
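The algorithm translates almost line for line into code. A minimal NumPy sketch (an illustrative implementation, not the course's; note we test y_i g(x_i) ≤ 0 rather than < 0 so that the zero-initialized classifier, for which g(x) = 0 everywhere, still triggers updates):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Stochastic perceptron learning with the eta*y_i*R^2 bias update.

    X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    Assumes a linearly separable training set; otherwise the loop stops
    after max_epochs without converging.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))   # R = max_i ||x_i||
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified: take a step now
                w = w + eta * yi * xi        # w_{k+1} = w_k + eta y_i x_i
                b = b + eta * yi * R**2      # b_{k+1} = b_k + eta y_i R^2
                errors += 1
        if errors == 0:                      # no training errors: done
            break
    return w, b

# Toy usage: two linearly separable clusters in R^2.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
```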
Perceptron learning

OK, this makes intuitive sense. But how do we know the algorithm will not get stuck in a local minimum? This was Rosenblatt's seminal contribution.

Theorem: Let D = {(x_1, y_1), ..., (x_n, y_n)} and

    R = \max_i ||x_i||

If there is (w^*, b^*) such that ||w^*|| = 1 and

    y_i (w^{*T} x_i + b^*) > \gamma, \quad \forall i

then the perceptron will find an error-free hyperplane in at most

    \left( \frac{2R}{\gamma} \right)^2

iterations.

Proof

The proof is not that hard. Denote the iteration by t, and assume the point processed at iteration t − 1 is (x_i, y_i). For simplicity, use homogeneous coordinates.
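To make that last step concrete: in the standard Novikoff-style argument, the sample and the parameters are folded into single vectors. A sketch of the construction in our notation (not necessarily the slides' exact continuation):

```latex
% Homogeneous coordinates: augment each sample with the constant R and
% fold the bias into the weight vector,
\tilde{x}_i = \begin{pmatrix} x_i \\ R \end{pmatrix}, \qquad
\tilde{w} = \begin{pmatrix} w \\ b/R \end{pmatrix},
% so that a single inner product recovers the discriminant:
\tilde{w}^{T} \tilde{x}_i = w^{T} x_i + b = g(x_i).

% The two updates  w_{k+1} = w_k + \eta y_i x_i  and  b_{k+1} = b_k + \eta y_i R^2
% then collapse into the single homogeneous update
\tilde{w}_{k+1} = \tilde{w}_k + \eta\, y_i\, \tilde{x}_i,
% since the last component b/R changes by \eta y_i R, i.e. b changes by
% \eta y_i R^2 -- which is where the factor R^2 in the algorithm comes from.
```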