Binary Classification / Perceptron

Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Supervised Learning
• Input: $(x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)$
  – $x^{(i)}$ is the $i$th data item and $y_i$ is the $i$th label
• Goal: find a function $f$ such that $f(x)$ is a "good approximation" to $y$
  – We can use it to predict $y$ values for previously unseen $x$ values

Examples of Supervised Learning
• Spam email detection
• Handwritten digit recognition
• Stock market prediction
• More?

Supervised Learning
• Hypothesis space: the set of allowable functions $f : X \to Y$
• Goal: find the "best" element of the hypothesis space
  – How do we measure the quality of $f$?

Regression
• [Figure: a scatter plot of $y$ against $x$ with a fitted line.]
• Hypothesis class: linear functions $f(x) = ax + b$
• A squared loss function is used to measure the error of the approximation

Linear Regression
• In typical regression applications, we measure the fit using the squared loss function
  $L(f(x), y) = (f(x) - y)^2$
• We want to minimize the average loss on the training data
• For linear regression, the learning problem is then
  $\min_{a,b} \; \frac{1}{n} \sum_{i=1}^{n} \left( a x^{(i)} + b - y_i \right)^2$
• For an unseen data point $x$, the learning algorithm predicts $f(x)$
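As a concrete companion (not part of the original slides), here is a minimal sketch of the linear regression problem above, solved with NumPy's least-squares routine; the toy data and the variable names `a` and `b` are illustrative and mirror the slides' notation:

```python
import numpy as np

# Toy 1-D training data: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

# Minimize (1/n) * sum_i (a*x_i + b - y_i)^2 via least squares:
# stack [x, 1] so that the model is f(x) = a*x + b
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

x_new = 0.5              # a previously unseen point
print(a * x_new + b)     # predicted f(x_new), close to 2*0.5 + 1 = 2.0
```

With noise-free data this recovers $a$ and $b$ exactly; with noise it returns the minimizer of the average squared loss on the training set.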
Binary Classification
• Input: $(x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$
• We can think of the observations as points in $\mathbb{R}^m$ with an associated sign (+ or -, one for each of the two classes)
• [Figure: an example with $m = 2$, showing + and - points in the plane. What is a good hypothesis space for this problem?]
• When some hyperplane puts all of the + points on one side and all of the - points on the other, we say that the observations are linearly separable

Linear Separators
• In $n$ dimensions, a hyperplane is a solution to the equation
  $w^T x + b = 0$
  with $w \in \mathbb{R}^n$, $b \in \mathbb{R}$
• Hyperplanes divide $\mathbb{R}^n$ into two distinct sets of points (called halfspaces):
  $w^T x + b > 0$ and $w^T x + b < 0$

The Linearly Separable Case
• Hypothesis space: separating hyperplanes,
  $f_{w,b}(x) = w^T x + b$
• How should we choose the loss function?
  – Count the number of misclassifications:
    $loss = \sum_i |y_i - \mathrm{sign}(f_{w,b}(x^{(i)}))|$
    This is tough to optimize: its gradient is zero wherever it is defined, so it contains no information
  – Instead, penalize each misclassification by the size of the violation:
    $perceptron\ loss = \sum_i \max\left(0, \, -y_i f_{w,b}(x^{(i)})\right)$
    A modified hinge loss: this loss is convex, but not differentiable

The Perceptron Algorithm
• Try to minimize the perceptron loss using (sub)gradient descent:
  $\nabla_w (perceptron\ loss) = \sum_{i \,:\, -y_i f_{w,b}(x^{(i)}) \ge 0} -y_i x^{(i)}$
  $\nabla_b (perceptron\ loss) = \sum_{i \,:\, -y_i f_{w,b}(x^{(i)}) \ge 0} -y_i$

Subgradients
• For a convex function $g(x)$, a subgradient at a point $x_0$ is the slope of any line/plane through the point $(x_0, g(x_0))$ that underestimates the function everywhere
• [Figure: a convex curve $g(x)$ with several such underestimating lines drawn through $x_0$.]
• If $0$ is a subgradient at $x_0$, then $x_0$ is a global minimum

The Perceptron Algorithm
• The (sub)gradient descent updates are
  $w^{(t+1)} = w^{(t)} + \gamma_t \sum_{i \,:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \ge 0} y_i x^{(i)}$
  $b^{(t+1)} = b^{(t)} + \gamma_t \sum_{i \,:\, -y_i f_{w^{(t)},b^{(t)}}(x^{(i)}) \ge 0} y_i$
  with step size $\gamma_t$ (sometimes called the learning rate)

Stochastic Gradient Descent
• To make the training more practical, stochastic gradient descent is used instead of standard gradient descent
• Approximate the gradient of an average by sampling $K$ indices (as few as one) uniformly at random and averaging:
  $\nabla_x \frac{1}{n} \sum_{i=1}^{n} g_i(x) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_x g_{i_k}(x)$
  where each $i_k$ is sampled uniformly at random from $\{1, \ldots, n\}$
• Stochastic gradient descent converges under certain assumptions on the step size
• Setting $K = 1$, we can simply pick a random observation $i$ and, if the $i$th data point is misclassified, perform the update
  $w^{(t+1)} = w^{(t)} + \gamma_t y_i x^{(i)}$
  $b^{(t+1)} = b^{(t)} + \gamma_t y_i$
  and $w^{(t+1)} = w^{(t)}$, $b^{(t+1)} = b^{(t)}$ otherwise
• Sometimes you will see the perceptron algorithm specified with $\gamma_t = 1$ for all $t$
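The following is a minimal sketch (not from the original slides) of the $K = 1$ stochastic update above, with a constant step size playing the role of $\gamma_t$; the function names and the toy dataset are illustrative:

```python
import numpy as np

def perceptron_train(X, y, epochs=100, gamma=1.0, seed=0):
    """Perceptron via K=1 stochastic (sub)gradient descent.

    X: (n, m) array of observations; y: (n,) array of labels in {-1, +1}.
    Uses a constant step size gamma (the slides' gamma_t = 1 case by default).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs * n):
        i = rng.integers(n)              # sample an index uniformly at random
        if y[i] * (X[i] @ w + b) <= 0:   # misclassified, i.e. -y_i * f(x_i) >= 0
            w += gamma * y[i] * X[i]
            b += gamma * y[i]
    return w, b

def perceptron_predict(X, w, b):
    return np.sign(X @ w + b)

# Tiny linearly separable example
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(perceptron_predict(X, w, b))  # should match y
```

Because the data here are linearly separable, the loop should stop making updates once a separating hyperplane is found; on non-separable data it would keep oscillating, which is exactly the drawback discussed next.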
Application of the Perceptron
• Spam email classification
  – Represent emails as vectors of counts of certain words (e.g., sir, madam, Nigerian, prince, money, etc.)
  – Apply the perceptron algorithm to the resulting vectors
  – To predict the label of an unseen email:
    • Construct its vector representation $x'$
    • Check whether $w^T x' + b$ is positive or negative

Perceptron Learning
• Drawbacks:
  – No convergence guarantees if the observations are not linearly separable
  – Can overfit: there are typically a number of perfect classifiers, but the perceptron algorithm doesn't have any mechanism for choosing between them
• [Figure: linearly separable data with one of many possible separating lines drawn through it.]

What If the Data Isn't Separable?
• Input: $(x^{(1)}, y_1), \ldots, (x^{(n)}, y_n)$ with $x^{(i)} \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$
• [Figure: an example with $m = 2$ in which the + and - points are intermingled, so no line separates them. What is a good hypothesis space for this problem?]
• The perceptron algorithm only works for linearly separable data, but we can add features to make the data linearly separable over a larger space!

Adding Features
• The idea (a code sketch follows at the end of this section):
  – Given the observations $x^{(1)}, \ldots, x^{(n)}$, construct a feature vector $\phi(x)$
  – Use $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$ instead of $x^{(1)}, \ldots, x^{(n)}$ in the learning algorithm
  – The goal is to choose $\phi$ so that $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$ are linearly separable
  – Learn linear separators of the form $w^T \phi(x)$ (instead of $w^T x$)
• Warning: more expressive features can lead to overfitting
• Examples:
  – $\phi(x_1, x_2) = (x_1, x_2)$: this is just the input data, without modification
  – $\phi(x_1, x_2) = (1, x_1, x_2, x_1^2, x_2^2)$: this corresponds to second-degree polynomial separators or, equivalently, elliptical separators in the original space
• [Figure: the elliptical decision boundary $(x_1 - 1)^2 + (x_2 - 1)^2 - 1 \le 0$ drawn in the $(x_1, x_2)$ plane.]

Support Vector Machines
• How can we decide between two perfect classifiers?
• [Figure: the same linearly separable data with two different separating lines, both perfect on the training set.]
• What is the practical difference between these two solutions?
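Returning to the feature-map idea above, here is the promised sketch (again not from the slides): it builds the quadratic map $\phi$ and reuses the hypothetical `perceptron_train` and `perceptron_predict` functions from the earlier sketch, on a circular toy dataset matching the elliptical-separator example:

```python
import numpy as np

def phi(X):
    """Quadratic feature map: (x1, x2) -> (1, x1, x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2])

# Label points inside the circle (x1-1)^2 + (x2-1)^2 = 1 as +1, outside as -1:
# not linearly separable in R^2, but linearly separable after applying phi.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 3, size=(200, 2))
y = np.where((X[:, 0] - 1)**2 + (X[:, 1] - 1)**2 - 1 <= 0, 1, -1)

w, b = perceptron_train(phi(X), y)   # run the perceptron in feature space
acc = np.mean(perceptron_predict(phi(X), w, b) == y)
print(f"training accuracy: {acc:.2f}")
```

No line separates these classes in the original $(x_1, x_2)$ space, but $w = (1, -2, -2, 1, 1)$ classifies them exactly in feature space (expand the circle equation to see this), so the perceptron should reach near-perfect training accuracy here.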