Master's Thesis: Memristor-Based Neuromorphic Computing

Total Pages: 16

File Type: pdf, Size: 1020 KB

Memristor-based Neuromorphic Computing: Design and Simulation of a Neural Network Based on Memristor Technology
Master of Science Thesis for the degree of Master of Science in Embedded Systems at Delft University of Technology
María García Fernández, August 31, 2020
Delft Center for Systems and Control, Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) · Delft University of Technology
Copyright © Delft Center for Systems and Control (DCSC). All rights reserved.

Table of Contents

Acknowledgements
1 Background
  1-1 Machine Learning
    1-1-1 Neural Networks
  1-2 Research Question
  1-3 History and Evolution of Neuromorphic Computing
    1-3-1 Hardware Implementations of Neural Networks
    1-3-2 Brief Overview
2 Introduction to Memristors
  2-1 Memristors in Neural Networks
3 Memristor Characterization
  3-1 Experiments and Goal
    3-1-1 Experiment Description
    3-1-2 Goal
  3-2 Brief Explanation of the Set-up
    3-2-1 Memristor's Resistance Calculation
    3-2-2 The Memristor Endurance Problem
  3-3 First Experiment
    3-3-1 Second Experiment: From High Resistance State (HRS) to Low Resistance State (LRS) with Trains of Identical Pulses
    3-3-2 The Mean Metastable Switch Memristor Model
    3-3-3 Model Exploration via Optimization Runs
    3-3-4 Analysis of the Data and the Results
  3-4 Simulation Work
  3-5 Conclusion
4 Memristors in Neural Networks: Introduction
  4-1 Unsupervised Learning
    4-1-1 Unsupervised Implementation of the Simplified Spike-Timing-Dependent Plasticity Learning Rule
    4-1-2 Accuracy Challenge
  4-2 Supervised Learning
    4-2-1 Supervised Implementation of the Simplified Spike-Timing-Dependent Plasticity Learning Rule
    4-2-2 Online Back-propagation Rule with Stochastic Gradient Descent
    4-2-3 Manhattan Rule-based Back-propagation
  4-3 Summary and Design Choices
5 Memristor Circuit Design
  5-1 Main Considerations when Designing a Memristor-based Chip for Neural Networks
  5-2 Development
6 Neural Network Simulation
  6-1 Spike-Timing-Dependent Plasticity (STDP): Hebbian and Anti-Hebbian Learning
    6-1-1 Supervised STDP
    6-1-2 Unsupervised STDP
  6-2 Classic Artificial Neural Network (ANN) Implementations
    6-2-1 Supervised ANN without Hidden Layers (1 input layer, 1 output layer)
    6-2-2 Supervised ANN with One Hidden Layer
  6-3 Conclusions
7 Conclusions
  7-1 Summary
  7-2 Main Contributions
  7-3 Future Work
    7-3-1 The dR/dE Ratio
    7-3-2 Time Series of the Experiments
    7-3-3 Parameter Tuning (Matlab Code)
    7-3-4 Memristor's Stochasticity Simulation
    7-3-5 Matrix Circuit Design
Bibliography
Glossary (List of Acronyms, List of Symbols)

List of Figures

1 A typical task that the human brain performs constantly
1-1 Representation of a neuron [1]
1-2 Some examples of activation functions [2]
1-3 Representation of a neural network [3]
1-4 Spiking neuron network implementations [4]
1-5 Simplified version of STDP proposed in [5]
1-6 Simplified overview of the von Neumann architecture
1-7 Percentage of papers in each field within neuromorphic computing [6]
2-1 The four fundamental two-terminal circuit elements [7]
2-2 Ideal behavior of a memristor's current and conductance with respect to voltage
2-3 A series resistor can be used to measure the memristor's resistance
2-4 I(V) curve
2-5 Example of a crossbar architecture
3-1 The experiment was added to the open-source GUI provided by Knowm with the set-up
3-2 The experiments performed aim to help the design of a controller that works with the given Self-Directed Channel (SDC) memristors
3-3 Initial experiment set-up (images provided with permission by Knowm Inc.)
3-4 Circuit within the Knowm set-up
3-5 How the voltage VREAD is selected
3-6 Circuit simulated in LTSpice, as defined by Knowm
3-7 Voltage and current across the memristor for a 10 µs pulse width
3-8 Algorithm proposed in [8]
3-9 Circuit simulated in LTSpice, as defined by Knowm
3-10 Histogram of the voltage applied directly on the memristor (measured via a scope)
3-11 Experiment 2 on one of the memristors
3-12 (untitled)
3-13 Graphical explanation of the metastable switch memristor model
3-14 Mean Metastable Switch Memristor Model as explained in [9] and [10]
3-15 The predicted change is negligible for most points
3-16 dX vs VMEM for all pulses in which 0.5 V was applied to the circuit
3-17 Real data vs optimization-runs model data
3-18 Model built in Simulink
3-19 Simulated X vs Voltage (V) and dX vs Voltage (V)
3-20 VMEM over time; resistance setpoints (blue) and actual read resistance (yellow) over time
4-1 STDP as explained in vs the implemented version in [5]
4-2 Visual explanation of the Leaky Integrate-and-Fire (LIF) neuron
4-3 Behavior of the system in [5]: an output spike will only increase a synaptic weight if its input is also active; otherwise it will decrease. The input spikes are longer than the output spikes. Leakage is not explained by this plot; it is added to the simulation via external equations
4-4 Scintillating grid illusion [11]
4-5 Architecture proposed in [12]
5-1 Inputs of the circuit designed
5-2 Multiple inputs stacked together and connected to an output neuron (the amplifying stage)
5-3 Transfer function of the amplifying stage clipped at 0.5 and -0.5 V, compared to scaled tanh and sigmoid with offset
5-4 (untitled)
5-5 VOUT during reading operation with 8 active inputs
5-6 Voltage drop on several memristors while writing one of them
6-1 Some examples of handwritten characters from the MNIST database [13]
6-2 Supervised implementation of STDP
6-3 Accuracy achieved by using STDP on 600 samples
6-4 Unsupervised implementation of STDP
6-5 Supervised implementation of an ANN without hidden layers
6-6 Success of the experiment
6-7 Success of the experiment
6-8 Effects of letting the network learn for too long (no early stopping)
6-9 Supervised implementation of an ANN with one hidden layer with 300 neurons
6-10 Success of the experiment
6-11 Success of the experiment
6-12 Weight distribution
7-1 to 7-23 (untitled)
List of Tables

1-1 Comparison of the usage of some neuromorphic devices
3-1 Numerical values produced by both simulations
3-2 Memristor estimation error for Experiment 1
3-3 Memristor estimation error for Experiment 2
3-4 Lower and upper bounds given to the parameters to be tuned
3-5 Comparison between three different optimization algorithms; they seem to give similar results
3-6 Comparison between three different optimization algorithms; they give similar results
Recommended publications
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018: Training Autoencoders by Alternating Minimization. Anonymous authors, paper under double-blind review.
    Abstract: We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimization problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive with traditional backpropagation techniques, both in terms of quality of solution and training speed.
    1 Introduction: For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of the hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
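    A minimal NumPy sketch of the alternating-minimization idea on a linear autoencoder: each alternation exactly solves a ridge-regularized least-squares problem for one block (encoder or decoder) with the other block fixed. This only illustrates the principle the abstract describes; it is not the DANTE algorithm itself, which uses quasi-convex single-layer solvers and handles nonlinear activations.

    ```python
    import numpy as np

    def train_linear_autoencoder(X, k, iters=30, ridge=1e-3):
        """Alternating minimization for a linear autoencoder X ~ X @ W_enc @ W_dec.

        Each alternation exactly minimizes the ridge-regularized squared
        reconstruction error over one block with the other held fixed,
        so the objective never increases.
        """
        n, d = X.shape
        rng = np.random.default_rng(0)
        W_enc = rng.normal(scale=0.1, size=(d, k))
        W_dec = rng.normal(scale=0.1, size=(k, d))
        for _ in range(iters):
            # Decoder step: min_W ||X - (X W_enc) W||^2 + ridge ||W||^2
            H = X @ W_enc
            W_dec = np.linalg.solve(H.T @ H + ridge * np.eye(k), H.T @ X)
            # Encoder step: min_W ||X - X W W_dec||^2 + ridge ||W||^2, solved in
            # vectorized form: (kron(B B^T, A) + ridge I) vec(W) = vec(A B^T),
            # with A = X^T X and B = W_dec (column-major vec).
            A, B = X.T @ X, W_dec
            lhs = np.kron(B @ B.T, A) + ridge * np.eye(d * k)
            rhs = (A @ B.T).reshape(-1, order="F")
            W_enc = np.linalg.solve(lhs, rhs).reshape(d, k, order="F")
        return W_enc, W_dec

    if __name__ == "__main__":
        X = np.random.default_rng(1).normal(size=(200, 10))
        W_enc, W_dec = train_linear_autoencoder(X, k=3)
        rel_err = np.linalg.norm(X - X @ W_enc @ W_dec) / np.linalg.norm(X)
        print(f"relative reconstruction error: {rel_err:.3f}")
    ```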
  • Memristor - The Future of Artificial Intelligence, L. Kavinmathi, C. Gayathri, K. Kumutha Priya
    International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April 2014, ISSN 2229-5518. Memristor - The Future of Artificial Intelligence. L. Kavinmathi, C. Gayathri, K. Kumutha Priya.
    Abstract: Due to the increasing demand for miniaturization and low power consumption, the memristor came into existence. Our design exploration is reconfigurable threshold-logic-gate-based programmable analog circuits using memristors. A variety of linearly separable and non-linearly separable logic functions such as AND, OR, NAND, NOR, XOR and XNOR have been realized using threshold logic gates built with memristors; the functionality can be changed between these operations just by varying the resistance of the memristor. Based on this reconfigurable TLG, various programmable analog circuits can be built using memristors. As an example of our approach, we have built a programmable analog gain amplifier demonstrating memristor-based programming of threshold, gain and frequency. Our idea consists of a programmable circuit design in which low voltages are applied to the memristor during its operation as an analog circuit element, and high voltages are used to program the memristor's state. In these circuits the role of the memristor is played by a memristor emulator that we developed using an FPGA. Reconfigurability is the option we provide with the present system, so that the resistance ranges can also be varied by pre-programming.
    Index Terms: Memristor; TLG (threshold logic gates); programmable analog circuits; FPGA (field-programmable gate array); MTL (memristor threshold logic); CTL (capacitor threshold logic); LUT (look-up table).
    1 Introduction: According to the definition of Chua (who first defined the memristor), the internal state of an ideal memristor depends on the integral of the voltage or current over time.
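    A toy illustration of that definition (internal state driven by the time integral of the current), as a minimal charge-controlled memristor model in Python. The linear dopant-drift form and all parameter values are illustrative assumptions, not taken from the paper or from its FPGA emulator.

    ```python
    import numpy as np

    def simulate_memristor(v, dt, r_on=100.0, r_off=16e3, mu=1e-14, d=10e-9):
        """Charge-controlled memristor (linear dopant-drift model, illustrative).

        The normalized state x = w/d in [0, 1] moves with the integral of the
        current, so the resistance "remembers" the charge that has flowed.
        """
        x = 0.1                      # initial normalized state
        k = mu * r_on / d**2         # state-change rate per unit current
        resistance, current = [], []
        for vk in v:
            r = r_on * x + r_off * (1.0 - x)
            i = vk / r
            x = np.clip(x + k * i * dt, 0.0, 1.0)   # integrate the current
            resistance.append(r)
            current.append(i)
        return np.array(resistance), np.array(current)

    if __name__ == "__main__":
        t = np.linspace(0, 2, 2001)
        v = 1.0 * np.sin(2 * np.pi * t)             # 1 Hz sine drive
        r, i = simulate_memristor(v, dt=t[1] - t[0])
        print(f"R range: {r.min():.0f} ohm to {r.max():.0f} ohm")
    ```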
  • Learning to Learn by Gradient Descent by Gradient Descent
    Learning to learn by gradient descent by gradient descent. Marcin Andrychowicz 1, Misha Denil 1, Sergio Gómez Colmenarejo 1, Matthew W. Hoffman 1, David Pfau 1, Tom Schaul 1, Brendan Shillingford 1,2, Nando de Freitas 1,2,3. 1 Google DeepMind, 2 University of Oxford, 3 Canadian Institute for Advanced Research. [email protected], {mdenil,sergomez,mwhoffman,pfau,schaul}@google.com, [email protected], [email protected].
    Abstract: The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
    1 Introduction: Frequently, tasks in machine learning can be expressed as the problem of optimizing an objective function f(θ) defined over some domain θ ∈ Θ. The goal in this case is to find the minimizer θ* = arg min_{θ ∈ Θ} f(θ). While any method capable of minimizing this objective function can be applied, the standard approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates θ_{t+1} = θ_t − α_t ∇f(θ_t). The performance of vanilla gradient descent, however, is hampered by the fact that it only makes use of gradients and ignores second-order information.
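    A minimal NumPy sketch of the hand-designed update rule quoted above, θ_{t+1} = θ_t − α_t ∇f(θ_t), applied to a toy convex quadratic. The learned LSTM optimizer that the paper proposes would replace the fixed −α ∇f step with the output of a trained network; that part is not shown, and the quadratic here is just an assumed example problem.

    ```python
    import numpy as np

    def grad_descent(grad_f, theta0, alpha=0.1, steps=100):
        """Vanilla gradient descent: theta_{t+1} = theta_t - alpha * grad_f(theta_t)."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(steps):
            theta = theta - alpha * grad_f(theta)
        return theta

    if __name__ == "__main__":
        # f(theta) = 0.5 * theta^T A theta - b^T theta, a simple convex quadratic
        A = np.array([[3.0, 0.2], [0.2, 1.0]])
        b = np.array([1.0, -2.0])
        grad_f = lambda th: A @ th - b
        theta_star = grad_descent(grad_f, theta0=[0.0, 0.0])
        print("gradient descent:", theta_star)
        print("closed form:     ", np.linalg.solve(A, b))
    ```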
  • Hybrid Memristor–CMOS Implementation of Combinational Logic Based on X-MRL †
    Electronics article: Hybrid Memristor–CMOS Implementation of Combinational Logic Based on X-MRL †. Khaled Alhaj Ali 1,*, Mostafa Rizk 1,2,3, Amer Baghdadi 1, Jean-Philippe Diguet 4 and Jalal Jomaah 3. 1 IMT Atlantique, Lab-STICC CNRS, UMR, 29238 Brest, France; [email protected] (M.R.); [email protected] (A.B.). 2 Lebanese International University, School of Engineering, Block F 146404 Mazraa, Beirut 146404, Lebanon. 3 Faculty of Sciences, Lebanese University, Beirut 6573, Lebanon; [email protected]. 4 IRL CROSSING CNRS, Adelaide 5005, Australia; [email protected]. * Correspondence: [email protected]. † This paper is an extended version of our paper published in the IEEE International Conference on Electronics, Circuits and Systems (ICECS), 27–29 November 2019, as Ali, K.A.; Rizk, M.; Baghdadi, A.; Diguet, J.P.; Jomaah, J., "MRL Crossbar-Based Full Adder Design".
    Abstract: A great deal of effort has recently been devoted to extending the usage of memristor technology from memory to computing. Memristor-based logic design is an emerging concept that targets efficient computing systems. Several logic families have evolved, each with different attributes. Memristor Ratioed Logic (MRL) has been recently introduced as a hybrid memristor–CMOS logic family. MRL requires an efficient design strategy that takes into consideration the implementation phase. This paper presents a novel MRL-based crossbar design: X-MRL. The proposed structure combines the density and scalability attributes of memristive crossbar arrays and the opportunity of their implementation on top of the CMOS layer. The evaluation of the proposed approach is performed through the design of an X-MRL-based full adder.
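    A minimal behavioral sketch of the MRL idea in Python, under the usual abstraction that the memristive voltage divider realizes OR as (roughly) the maximum and AND as the minimum of its input voltages, with a CMOS inverter restoring and inverting levels. Device physics, the crossbar layout, and the paper's X-MRL structure are not modeled, and the helper names are hypothetical.

    ```python
    VDD = 1.0   # assumed logic-high voltage

    # Behavioral MRL primitives on voltages in [0, VDD]
    def mrl_or(a, b):   # memristive ratioed divider approximates max()
        return max(a, b)

    def mrl_and(a, b):  # complementary divider approximates min()
        return min(a, b)

    def cmos_inv(a):    # CMOS inverter restores and inverts the level
        return VDD - a

    def mrl_xor(a, b):  # XOR composed from the primitives above
        return mrl_or(mrl_and(a, cmos_inv(b)), mrl_and(cmos_inv(a), b))

    def full_adder(a, b, cin):
        s = mrl_xor(mrl_xor(a, b), cin)
        cout = mrl_or(mrl_and(a, b), mrl_and(cin, mrl_xor(a, b)))
        return s, cout

    if __name__ == "__main__":
        for a in (0.0, VDD):
            for b in (0.0, VDD):
                for cin in (0.0, VDD):
                    s, cout = full_adder(a, b, cin)
                    print(f"a={a:.0f} b={b:.0f} cin={cin:.0f} -> sum={s:.0f} carry={cout:.0f}")
    ```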
  • Training Neural Networks Without Gradients: a Scalable ADMM Approach
    Training Neural Networks Without Gradients: A Scalable ADMM Approach. Gavin Taylor 1, Ryan Burmeister 1, Zheng Xu 2, Bharat Singh 2, Ankit Patel 3, Tom Goldstein 2. 1 United States Naval Academy, Annapolis, MD, USA; 2 University of Maryland, College Park, MD, USA; 3 Rice University, Houston, TX, USA.
    Abstract: With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don't scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps.
    [...] many parameters. Because big datasets provide results that (often dramatically) outperform the prior state-of-the-art in many machine learning tasks, researchers are willing to purchase specialized hardware such as GPUs, and commit large amounts of time to training models and tuning hyperparameters. Gradient-based training methods have several properties that contribute to this need for specialized hardware. First, while large amounts of data can be shared amongst many cores, existing optimization methods suffer when parallelized. Second, training neural nets requires optimizing highly non-convex objectives that exhibit saddle points, poor conditioning, and vanishing gradients, all of which slow down gradient-based methods such as stochastic gradient descent.
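    For readers unfamiliar with the alternating direction method of multipliers (ADMM) itself, here is a minimal NumPy sketch of scaled-form ADMM on a lasso problem (least squares plus an L1 penalty). It only illustrates the generic split-variable / dual-update pattern; the paper's layer-wise scheme and its Bregman iteration for neural networks are not reproduced here.

    ```python
    import numpy as np

    def soft_threshold(v, k):
        return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

    def lasso_admm(A, b, lam=0.1, rho=1.0, iters=200):
        """Scaled-form ADMM for min_x 0.5||Ax - b||^2 + lam*||z||_1  s.t. x = z."""
        n = A.shape[1]
        x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
        Atb = A.T @ b
        L = A.T @ A + rho * np.eye(n)     # left-hand-side matrix, fixed across iterations
        for _ in range(iters):
            x = np.linalg.solve(L, Atb + rho * (z - u))   # x-update (ridge-like solve)
            z = soft_threshold(x + u, lam / rho)          # z-update (proximal step for L1)
            u = u + x - z                                 # scaled dual update
        return z

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A = rng.normal(size=(50, 20))
        x_true = np.zeros(20); x_true[:3] = [1.5, -2.0, 0.7]
        b = A @ x_true + 0.01 * rng.normal(size=50)
        x_hat = lasso_admm(A, b, lam=0.5)
        print("nonzeros recovered:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
    ```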
  • Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments
    Training Deep Networks with Stochastic Gradient Normalized by Layerwise Adaptive Second Moments. Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Ryan Leary, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen.
    Abstract: We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large batch setting, and (3) has half the memory footprint of Adam.
    1. Introduction: The most popular algorithms for training Neural Networks (NNs) are Stochastic Gradient Descent (SGD) with momentum (Polyak, 1964; Sutskever et al., 2013) and Adam (Kingma & Ba, 2015). We started with Adam, and then (1) replaced the element-wise second moment with the layer-wise moment, (2) computed the first moment using gradients normalized by the layer-wise second moment, and (3) decoupled weight decay (WD) from the normalized gradients (similar to AdamW). The resulting algorithm, NovoGrad, combines SGD's and Adam's strengths. We applied NovoGrad to a variety of large scale problems (image classification, neural machine translation, language modeling, and speech recognition) and found that in all cases it performs as well as or better than Adam/AdamW and SGD with momentum.
    2. Related Work: NovoGrad belongs to the family of Stochastic Normalized Gradient Descent (SNGD) optimizers (Nesterov, 1984; Hazan et al., 2015). SNGD uses only the direction of the stochastic gradient g_t to update the weights w_t: w_{t+1} = w_t − λ_t · g_t / ||g_t||.
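    A minimal NumPy sketch of a NovoGrad-style update following the three modifications listed above (per-layer scalar second moment, first moment built from layer-normalized gradients, decoupled weight decay). Hyperparameter values, the initialization, and the toy problem are assumptions, so treat this as an illustration rather than the authors' reference implementation.

    ```python
    import numpy as np

    class NovoGradLike:
        """NovoGrad-style optimizer sketch: one scalar second moment per layer."""

        def __init__(self, lr=0.02, beta1=0.95, beta2=0.98, wd=0.001, eps=1e-8):
            self.lr, self.beta1, self.beta2 = lr, beta1, beta2
            self.wd, self.eps = wd, eps
            self.m = {}   # per-layer first moment (same shape as the weights)
            self.v = {}   # per-layer scalar second moment

        def step(self, params, grads):
            for name, g in grads.items():
                g_norm_sq = float(np.sum(g * g))
                if name not in self.v:
                    self.v[name] = g_norm_sq
                    self.m[name] = np.zeros_like(params[name])
                else:
                    self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * g_norm_sq
                # gradient normalized by the layer-wise second moment,
                # with weight decay decoupled from that normalization
                g_hat = g / (np.sqrt(self.v[name]) + self.eps) + self.wd * params[name]
                self.m[name] = self.beta1 * self.m[name] + g_hat
                params[name] -= self.lr * self.m[name]

    if __name__ == "__main__":
        # toy problem: fit y = X w with squared loss; one "layer" named "w"
        rng = np.random.default_rng(0)
        X = rng.normal(size=(128, 5)); w_true = rng.normal(size=5)
        y = X @ w_true
        params = {"w": np.zeros(5)}
        loss = lambda: 0.5 * np.mean((X @ params["w"] - y) ** 2)
        opt = NovoGradLike(lr=0.02)
        print("initial loss:", round(loss(), 4))
        for _ in range(500):
            grad = X.T @ (X @ params["w"] - y) / len(y)
            opt.step(params, {"w": grad})
        print("final loss:  ", round(loss(), 4))
    ```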
  • The Memristor: A New Bond Graph Element
    The Memristor: A New Bond Graph Element. G. F. Oster and D. M. Auslander, Mechanical Engineering Department, University of California, Berkeley, Calif.
    The "memristor," first defined by L. Chua for electrical circuits, is proposed as a new bond graph element, on an equal footing with R, L, and C, and having some unique modelling capabilities for nonlinear systems.
    The Missing Constitutive Relation: In his original lecture notes introducing the bond graph technique, Paynter drew a "tetrahedron of state" (Fig. 1) which summarized the relationship between the state variables (e, f, p, q) [1, 2]. There are 6 binary relationships possible between these 4 state variables. Two of these are definitions: the displacement, q(t) = q(0) + ∫₀ᵗ f dt, and the quantity p(t) = p(0) + ∫₀ᵗ e dt, which is interpreted as momentum, magnetic flux, or pressure-momentum [2]. [...] circuit theory; however, if we look beyond the electrical domain it is not hard to find systems whose characteristics are conveniently represented by a memristor model.
    Properties: The constitutive relation for a 1-port memristor is a curve in the q-p plane, Fig. 2. In this context, we do not necessarily interpret the quantity p(t) = p(0) + ∫₀ᵗ e dt as momentum, but merely as the integrated effort ("impulse").
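    In bond graph notation, the definitions quoted above and the defining relation of the memristor element can be written compactly as follows; this is a standard summary of the element, not an equation block copied from the paper.

    ```latex
    \begin{align*}
      q(t) &= q(0) + \int_0^t f(\tau)\,\mathrm{d}\tau   &&\text{(generalized displacement)}\\
      p(t) &= p(0) + \int_0^t e(\tau)\,\mathrm{d}\tau   &&\text{(generalized momentum / impulse)}\\
      g(p, q) &= 0                                      &&\text{(memristor constitutive relation)}\\
      e &= M(q)\, f, \quad M(q) = \frac{\mathrm{d}p}{\mathrm{d}q} &&\text{(charge-controlled form)}
    \end{align*}
    ```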
  • CSE 152: Computer Vision Manmohan Chandraker
    CSE 152: Computer Vision, Manmohan Chandraker. Lecture 15: Optimization in CNNs.
    Recap: engineered versus learned features. Convolutional filters are trained in a supervised manner by back-propagating the classification error through a stack of convolution + pooling layers followed by dense layers and a classifier (slides: Jia-Bin Huang and Derek Hoiem, UIUC).
    Topics: two-layer perceptron networks (slide credit: Pieter Abbeel and Dan Klein); neural networks, non-linearity and activation functions; multi-layer neural networks; from fully connected to convolutional networks; the convolutional layer, and spatial filtering as convolution (slides: Lazebnik; Efstratios Gavves); 2D spatial filters applied over the whole image with weight sharing, the insight being that images have similar features at various spatial locations.
    Key operations in a CNN: convolution (learned), a non-linearity such as the Rectified Linear Unit (ReLU), and spatial pooling (e.g. max pooling) applied to the input image and feature maps (sources: R. Fergus, Y. LeCun; slides: Lazebnik). Convolution acts as a feature extractor. Pooling operations aggregate multiple values into a single value: they give invariance to small transformations, keep only the most important information for the next layer, reduce the size of the next layer (fewer parameters, faster computations), yield a larger receptive field in the next layer, and hierarchically extract more abstract features.
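    A minimal NumPy sketch of the three key operations named in the lecture (a 2D convolution, a ReLU non-linearity, and 2x2 max pooling). It is written for clarity on a single-channel image rather than speed, and the example filter is just an assumed edge detector, not a learned one.

    ```python
    import numpy as np

    def conv2d(image, kernel):
        """Valid 2D cross-correlation of a single-channel image with one kernel."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def relu(x):
        return np.maximum(x, 0.0)

    def max_pool2x2(x):
        H, W = x.shape
        H2, W2 = H // 2, W // 2
        # group pixels into 2x2 blocks and take the max of each block
        return x[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        image = rng.random((8, 8))
        sobel_x = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]], dtype=float)   # assumed example filter
        feature_map = max_pool2x2(relu(conv2d(image, sobel_x)))
        print("feature map shape:", feature_map.shape)   # (3, 3)
    ```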
  • Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
    Article: Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches. Juan Cruz-Benito 1,*, Sanjay Vishwakarma 2,†, Francisco Martin-Fernandez 1 and Ismael Faro 1. 1 IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; [email protected] (F.M.-F.); [email protected] (I.F.). 2 Electrical and Computer Engineering, Carnegie Mellon University, Mountain View, CA 94035, USA; [email protected]. * Correspondence: [email protected]. † Intern at IBM Quantum at the time of writing this paper.
    Abstract: In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer, while using transfer learning and different forms of tokenization to see how they behave.
  • GEE: a Gradient-Based Explainable Variational Autoencoder for Network Anomaly Detection
    GEE: A Gradient-based Explainable Variational Autoencoder for Network Anomaly Detection. Quoc Phong Nguyen, Kar Wai Lim, Kian Hsiang Low, Mun Choon Chan (National University of Singapore) and Dinil Mon Divakaran (Trustwave).
    Abstract: This paper looks into the problem of detecting network anomalies by analyzing NetFlow records. While many previous works have used statistical models and machine learning techniques in a supervised way, such solutions have the limitations that they require a large amount of labeled data for training and are unlikely to detect zero-day attacks. Existing anomaly detection solutions also do not provide an easy way to explain or identify attacks in the anomalous traffic. To address these limitations, we develop and present GEE, a framework for detecting and explaining anomalies in network traffic. GEE comprises two components: (i) a Variational Autoencoder (VAE), an unsupervised deep-learning technique for detecting anomalies, and (ii) a gradient-based fingerprinting technique for explaining anomalies.
    [...] a number of factors, such as end-user behavior, customer businesses (e.g., banking, retail), applications, location, time of the day, and are expected to evolve with time. Such diversity and dynamism limits the utility of rule-based detection systems. Next, as capturing, storing and processing raw traffic from such high-capacity networks is not practical, Internet routers today have the capability to extract and export meta data such as NetFlow records [3]. With NetFlow, the amount of information captured is brought down by orders of magnitude (in comparison to raw packet capture), not only because a NetFlow [...]
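    A minimal sketch of the two ideas in GEE's pipeline, using a plain linear autoencoder (PCA) in NumPy as a stand-in for the paper's VAE: an anomaly score from reconstruction error, and a gradient-based "fingerprint" obtained by differentiating that score with respect to the input features to see which features drive the anomaly. All names and the synthetic data are assumptions for illustration, not the paper's model or dataset.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # "Train" a linear autoencoder on normal traffic feature vectors via PCA.
    X_normal = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8)) * 0.1
    mean = X_normal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_normal - mean, full_matrices=False)
    D = Vt[:3].T            # decoder: top-3 principal directions, shape (8, 3)
    E = Vt[:3]              # encoder, shape (3, 8)

    def anomaly_score(x):
        """Squared reconstruction error of a single feature vector."""
        r = (x - mean) - D @ (E @ (x - mean))
        return float(r @ r)

    def gradient_fingerprint(x):
        """d(score)/dx: which input features push the score up the most."""
        P = np.eye(len(mean)) - D @ E          # residual projector
        return 2.0 * P @ (P @ (x - mean))      # analytic gradient of ||P(x - mean)||^2

    if __name__ == "__main__":
        x = X_normal[0].copy()
        x[5] += 5.0                            # inject an anomaly into feature 5
        print("anomaly score:", round(anomaly_score(x), 3))
        fp = gradient_fingerprint(x)
        print("most anomalous feature:", int(np.argmax(np.abs(fp))))   # typically 5
    ```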
  • Optimization and Gradient Descent (INFO-4604, Applied Machine Learning, University of Colorado Boulder)
    Optimization and Gradient Descent. INFO-4604, Applied Machine Learning, University of Colorado Boulder, September 11, 2018. Prof. Michael Paul.
    Prediction functions: remember that a prediction function is the function that predicts what the output should be, given the input. Linear regression: f(x) = wᵀx + b. Linear classification (perceptron): f(x) = 1 if wᵀx + b ≥ 0, and f(x) = −1 if wᵀx + b < 0. We need to learn what w should be.
    Learning parameters: the goal is to learn to minimize error (ideally the true error; in practice, the training error). The loss function gives the training error when using parameters w, denoted L(w); it is also called the cost function, or more generally the objective function (in general the objective could be to minimize or maximize; with loss/cost functions, we want to minimize). So the goal is to minimize the loss function. How do we minimize a function? Let's review some math.
    Rate of change: the slope of a line is also called the rate of change of the line ("rise over run"), e.g. y = ½x + 1. For nonlinear functions, the rise-over-run formula gives the average rate of change between two points; for f(x) = x², the average slope from x = −1 to x = 0 is −1. There is also a concept of the rate of change at an individual point (rather than between two points): for f(x) = x², the slope at x = −1 is −2. The slope at a point is called the derivative at that point. Intuition: measure the slope between two points that are really close together.
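    A small Python illustration of the two ideas on these slides: the derivative as the limit of rise over run (checking that the slope of f(x) = x² at x = −1 is about −2), and gradient descent using that derivative to minimize a function. The step size and iteration count are arbitrary choices for the demo.

    ```python
    def f(x):
        return x ** 2

    def numerical_slope(func, x, h=1e-6):
        """Rise over run between two points that are really close together."""
        return (func(x + h) - func(x)) / h

    def gradient_descent(grad, x0, step=0.1, iters=100):
        """Repeatedly move against the slope: x <- x - step * grad(x)."""
        x = x0
        for _ in range(iters):
            x = x - step * grad(x)
        return x

    if __name__ == "__main__":
        print("slope of x^2 at x=-1:", round(numerical_slope(f, -1.0), 3))   # about -2
        # Minimize f using its exact derivative f'(x) = 2x; the minimum is at x = 0.
        x_min = gradient_descent(lambda x: 2 * x, x0=-1.0)
        print("minimizer found by gradient descent:", round(x_min, 4))
    ```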
  • Gradient Descent (PDF)
    18.657: Mathematics of Machine Learning. Lecturer: Philippe Rigollet. Lecture 11, Scribe: Kevin Li, Oct. 14, 2015.
    2. CONVEX OPTIMIZATION FOR MACHINE LEARNING. In this lecture, we will cover the basics of convex optimization as it applies to machine learning. There is much more to this topic than will be covered in this class, so you may be interested in the following books: Convex Optimization by Boyd and Vandenberghe; Lecture Notes on Convex Optimization by Nesterov; Convex Optimization: Algorithms and Complexity by Bubeck; Online Convex Optimization by Hazan. The last two are drafts and can be obtained online.
    2.1 Convex Problems. A convex problem is an optimization problem of the form min_{x ∈ C} f(x) where f and C are convex. First, we will debunk the idea that convex problems are easy by showing that virtually all optimization problems can be written as a convex problem. We can rewrite an optimization problem as follows: min_{x ∈ X} f(x) = min_{t ≥ f(x), x ∈ X} t = min_{(x,t) ∈ epi(f)} t, where the epigraph of a function is defined by epi(f) = {(x, t) ∈ X × R : t ≥ f(x)}. (Figure 1: an example of an epigraph. Source: https://en.wikipedia.org/wiki/Epigraph_(mathematics).) Now we observe that for linear functions, min_{x ∈ D} cᵀx = min_{x ∈ conv(D)} cᵀx, where the convex hull is defined as conv(D) = {y : ∃ N ∈ Z₊, x_1, …, x_N ∈ D, α_i ≥ 0, Σ_i α_i = 1, y = Σ_{i=1}^N α_i x_i}. To prove this, we know that the left side is at least as big as the right side since D ⊂ conv(D).