Week 2: Linear Classification: Perceptron

Total Page:16

File Type:pdf, Size:1020Kb

Week 2: Linear Classification: Perceptron Applied Machine Learning CIML Chap 4 (A Geometric Approach) “Equations are just the boring part of mathematics. I attempt to see things in terms of geometry.” ―Stephen Hawking Week 2: Linear Classification: Perceptron Professor Liang Huang some slides from Alex Smola (CMU/Amazon) Roadmap for Weeks 2-3 • Week 2: Linear Classifier and Perceptron • Part I: Brief History of the Perceptron • Part II: Linear Classifier and Geometry (testing time) • Part III: Perceptron Learning Algorithm (training time) • Part IV: Convergence Theorem and Geometric Proof • Part V: Limitations of Linear Classifiers, Non-Linearity, and Feature Maps • Week 3: Extensions of Perceptron and Practical Issues • Part I: My Perceptron Demo in Python • Part II: Voted and Averaged Perceptrons • Part III: MIRA and Aggressive MIRA • Part IV: Practical Issues and HW1 Part V: Perceptron vs. Logistic Regression (hard vs. soft); Gradient Descent • 2 Part I • Brief History of the Perceptron 3 Perceptron (1959-now) Frank Rosenblatt deep learning ~1986; 2006-now multilayer perceptron logistic regression perceptron SVM 1958 1959 1964;1995 kernels 1964 cond. random fields structured perceptron structured SVM 2001 2002 2003 5 Neurons • Soma (CPU) Cell body - combines signals • Dendrite (input bus) Combines the inputs from several other nerve cells • Synapse (interface) Interface and parameter store between neurons • Axon (output cable) May be up to 1m long and will transport the activation signal to neurons at different locations 6 Frank Rosenblatt’s Perceptron 7 Multilayer Perceptron (Neural Net) 8 Brief History of Perceptron 1997 batch Cortes/Vapnik +soft-margin SVM online approx. +kernels max margin subgradient descent 2007--2010 Singer group +max margin minibatch Pegasos minibatch online 2003 2006 conservative updates Crammer/Singer Singer group MIRA aggressive 1959 1962 1969* 1999 Rosenblatt Novikoff Minsky/Papert DEAD Freund/Schapire invention proof book killed it voted/avg: revived inseparable case 2002 2005* Collins McDonald/Crammer/Pereira structured structured MIRA *mentioned in lectures but optional (others papers all covered in detail) AT&T Research ex-AT&T and students 9 Part II • Linear Classifier and Geometry (testing time) • decision boundary and normal vector w • not separable through the origin: add bias b • geometric review of linear algebra • augmented space (no explicit bias; implicit as w0=b) Test Time Input x Linear Classifier Prediction σ(w ·x) Model w Training Time Input x Perceptron Learner Model w Output y 10 Linear Classifier and Geometry linear classifiers: perceptron, logistic regression, (linear) SVMs, etc. x1 x2 x3 ... xn x2 w x w w x cos ✓ = · 1 n k k w k k weights x output f(x)=σ(w x) σ · ✓ negative O w x1 w x < 0 · positive w x > 0 weight vector w: “prototype” of positive examples · it’s also the normal vector of the decision boundary meaning of w · x: agreement with positive direction test: input: x, w; output:1 if w · x >0 else -1 separating hyperplane training: input: (x, y) pairs; output: w (decision boundary) w x =0 11 · What if not separable through origin? solution: add bias b x1 x2 x3 ... xn x2 w x w w x cos ✓ = · 1 n k k w k k weights x output f(x)=σ(w x + b) · ✓ negative O w x1 w x + b<0 · positive w x + b>0 · w x + b =0 · b | | w k k 12 Geometric Review of Linear Algebra line in 2D (n-1)-dim hyperplane in n-dim x2 x2 w1x1 + w2x2 + b =0 w x + b =0 · (x1⇤,x2⇤) x⇤ b | | (w1,w2) k k (w1,w2) O x1 O x1 required: algebraic and geometric meanings of dot product x3 w1x⇤ + w2x⇤ + b (w1,w2) (x1,x2)+b w x + b | 1 2 | = | · | | · | w2 + w2 (w1,w2) w 1 2 k k k k point-to-linep distance point-to-hyperplane distance http://classes.engr.oregonstate.edu/eecs/fall2017/cs534/extra/LA-geometry.pdf 13 Augmented Space: dimensionality+1 x1 x2 x3 ... xn explicit bias f(x)=σ(w x + b) w1 wn · can’t separate in 1D weights from the origin O output x ... x x0 = 1 x1 x2 3 n augmented space w 0 = w1 wn f(x)=σ((b; w) (1; x)) b · weights 1 can separate in 2D from the origin O output 14 Augmented Space: dimensionality+1 x1 x2 x3 ... xn explicit bias f(x)=σ(w x + b) w1 wn · can’t separate in 2D weights from the origin output x ... x x0 = 1 x1 x2 3 n augmented space w 0 = w1 wn f(x)=σ((b; w) (1; x)) b · weights can separate in 3D from the origin output 15 Part III • The Perceptron Learning Algorithm (training time) • the version without bias (augmented space) • side note on mathematical notations • mini-demo Test Time Input x Linear Classifier Prediction σ(w ·x) Model w Training Time Input x Perceptron Learner Model w Output y 16 Perceptron Ham Spam 17 The Perceptron Algorithm input: training data D output: weights w x initialize w 0 while not converged aafor (x,y) D w0 aaaaif y(w 2x) 0 aaaaaaw · w +yx w • the simplest machine learning algorithm • keep cycling through the training data • update w if there is a mistake on example (x, y) until all examples are classified correctly • 18 Side Note on Mathematical Notations • I’ll try my best to be consistent in notations • e.g., bold-face for vectors, italic for scalars, etc. • avoid unnecessary superscripts and subscripts by using a “Pythonic” rather than a “C” notational style input:• most trainingtextbooks data haveD consistent but bad notations output: weights w initialize w 0 initialize w = 0 and b =0 while not converged repeat aafor (x,y) D if yi [ w, xi + b] 0 then 2 w h w +iy x and b b + y aaaaif y(w x) 0 i i i aaaaaaw · w +yx end if until all classified correctly good notations: bad notations: consistent, Pythonic style inconsistent, unnecessary i and b 19 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w (bias=0) 20 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w (bias=0) 20 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w0 w (bias=0) 20 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w 21 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w 22 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx x w0 w 22 input: training data D output: weights w Demo initialize w 0 while not converged aafor (x,y) D aaaaif y(w 2x) 0 aaaaaaw · w +yx w 23 24 Part IV • Linear Separation, Convergence Theorem and Proof • formal definition of linear separation • perceptron convergence theorem • geometric proof • what variables affect convergence bound? 25 Linear Separation; Convergence Theorem • dataset D is said to be “linearly separable” if there exists some unit oracle vector u: ∣∣u|| = 1 which correctly classifies every example (x, y) with a margin at least ẟ: y(u x) δ for all (x,y) D · ≥ 2 • then the perceptron must converge to a linear separator after at most R2/ẟ2 mistakes (updates) where R = max x (x,y) Dk k 2 • convergence rate R2/ẟ2 δ δ x ⊕ ⊕ dimensionality independent u x δ • · ≥ R u : u =1 k k • dataset size independent • order independent (but order matters in output) • scales with ‘difficulty’ of problem Geometric Proof, part 1 • part 1: progress (alignment) on oracle projection assume w(0) = 0, and w(i) is the weight before the ith update (on (x,y)) w(i+1) = w(i) + yx u w(i+1) = u w(i) + y(u x) · · · u w(i+1) u w(i) + δ y(u x) δ for all (x,y) D · ≥ · · ≥ 2 u w(i+1) iδ δ δ · ≥ x projection on u increases! ⊕ ⊕ u x δ · ≥ (more agreement w/ oracle direction) u : u =1 k k (i+1) (i+1) (i+1) w w(i+1) ( w = u w u w iδ i k k ≥ · ≥ ) u w(i) · u w(i+1) 27 · Geometric Proof, part 2 • part 2: upperbound of the norm of the weight vector w(i+1) = w(i) + yx 2 2 w(i+1) = w(i) + yx 2 = w(i) + x 2 +2y(w(i) x) k k · 2 mistake on x w(i) + R2 x 2 R = max x ⊕ ⊕ iR (x,y) Dk k 2 R w w(i+1) ( i ) ✓ 90◦ ≥ cos ✓ 0 w(i) x 0 · 28 Geometric Proof, part 2 • part 2: upperbound of the norm of the weight vector w(i+1) = w(i) + yx p iR 2 2 iδ w(i+1) = w(i) + yx 2 = w(i) + x 2 +2y(w(i) x) k k · 2 mistake on x w(i) + R2 x 2 R = max x ⊕ ⊕ iR (x,y) Dk k 2 R Combine with part 1: w w(i+1) ( i (i+1) (i+1) (i+1) ) w = u w u w iδ ✓ 90◦ k k ≥ · ≥ ≥ cos ✓ 0 i R2/δ2 w(i) x 0 · 28 2 2 Convergence Bound R /δ where R = max xi i k k • is independent of: • dimensionality • number of examples • order of examples constant learning rate narrow margin: wide margin: • hard to separate easy to separate • and is dependent of: • separation difficulty (margin ẟ) • feature scale (radius R) • initial weight w(0) changes how fast it converges, but not whether it’ll converge • 29 Part V • Limitations of Linear Classifiers and Feature Maps • XOR: not linearly separable • perceptron cycling theorem • solving XOR: non-linear feature map • “preview demo”: SVM with non-linear kernel • redefining “linear” separation under feature map 30 XOR • XOR - not linearly separable • Nonlinear separation is trivial • Caveat from “Perceptrons” (Minsky & Papert, 1969) Finding the minimum error linear separator is NP hard (this killed Neural Networks in the 70s).
Recommended publications
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain Simons Institute -- Computational Theories of the Brain 2018 Timothy Lillicrap DeepMind, UCL With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton The “credit assignment” problem The solution in artificial networks: backprop Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: How to convince a neuroscientist that the cortex is learning via [something like] backprop - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? How to convince a neuroscientist that the cortex is learning via [something like] backprop - What do I mean by “something like backprop”?: - That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. How to convince a neuroscientist that the cortex is learning via [something like] backprop 1. Feedback connections in cortex are ubiquitous and modify the
    [Show full text]
  • Incorporating Automated Feature Engineering Routines Into Automated Machine Learning Pipelines by Wesley Runnels
    Incorporating Automated Feature Engineering Routines into Automated Machine Learning Pipelines by Wesley Runnels Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY May 2020 c Massachusetts Institute of Technology 2020. All rights reserved. Author . Wesley Runnels Department of Electrical Engineering and Computer Science May 12th, 2020 Certified by . Tim Kraska Department of Electrical Engineering and Computer Science Thesis Supervisor Accepted by . Katrina LaCurts Chair, Master of Engineering Thesis Committee 2 Incorporating Automated Feature Engineering Routines into Automated Machine Learning Pipelines by Wesley Runnels Submitted to the Department of Electrical Engineering and Computer Science on May 12, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science Abstract Automating the construction of consistently high-performing machine learning pipelines has remained difficult for researchers, especially given the domain knowl- edge and expertise often necessary for achieving optimal performance on a given dataset. In particular, the task of feature engineering, a key step in achieving high performance for machine learning tasks, is still mostly performed manually by ex- perienced data scientists. In this thesis, building upon the results of prior work in this domain, we present a tool, rl feature eng, which automatically generates promising features for an arbitrary dataset. In particular, this tool is specifically adapted to the requirements of augmenting a more general auto-ML framework. We discuss the performance of this tool in a series of experiments highlighting the various options available for use, and finally discuss its performance when used in conjunction with Alpine Meadow, a general auto-ML package.
    [Show full text]
  • Warm Start for Parameter Selection of Linear Classifiers
    Warm Start for Parameter Selection of Linear Classifiers Bo-Yu Chu Chia-Hua Ho Cheng-Hao Tsai Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science National Taiwan Univ., Taiwan National Taiwan Univ., Taiwan National Taiwan Univ., Taiwan [email protected] [email protected] [email protected] Chieh-Yen Lin Chih-Jen Lin Dept. of Computer Science Dept. of Computer Science National Taiwan Univ., Taiwan National Taiwan Univ., Taiwan [email protected] [email protected] ABSTRACT we may need to solve many optimization problems. Sec- In linear classification, a regularization term effectively reme- ondly, if we do not know the reasonable range of the pa- dies the overfitting problem, but selecting a good regulariza- rameters, we may need a long time to solve optimization tion parameter is usually time consuming. We consider cross problems under extreme parameter values. validation for the selection process, so several optimization In this paper, we consider using warm start to efficiently problems under different parameters must be solved. Our solve a sequence of optimization problems with different reg- aim is to devise effective warm-start strategies to efficiently ularization parameters. Warm start is a technique to reduce solve this sequence of optimization problems. We detailedly the running time of iterative methods by using the solution investigate the relationship between optimal solutions of lo- of a slightly different optimization problem as an initial point gistic regression/linear SVM and regularization parameters. for the current problem. If the initial point is close to the op- Based on the analysis, we develop an efficient tool to auto- timum, warm start is very useful.
    [Show full text]
  • Neural Networks and Backpropagation
    CS 179: LECTURE 14 NEURAL NETWORKS AND BACKPROPAGATION LAST TIME Intro to machine learning Linear regression https://en.wikipedia.org/wiki/Linear_regression Gradient descent https://en.wikipedia.org/wiki/Gradient_descent (Linear classification = minimize cross-entropy) https://en.wikipedia.org/wiki/Cross_entropy TODAY Derivation of gradient descent for linear classifier https://en.wikipedia.org/wiki/Linear_classifier Using linear classifiers to build up neural networks Gradient descent for neural networks (Back Propagation) https://en.wikipedia.org/wiki/Backpropagation REFRESHER ON THE TASK Note “Grandmother Cell” representation for {x,y} pairs. See https://en.wikipedia.org/wiki/Grandmother_cell REFRESHER ON THE TASK Find i for zi: “Best-index” -- estimated “Grandmother Cell” Neuron Can use parallel GPU reduction to find “i” for largest value. LINEAR CLASSIFIER GRADIENT We will be going through some extra steps to derive the gradient of the linear classifier -- We’ll be using the “Softmax function” https://en.wikipedia.org/wiki/Softmax_function Similarities will be seen when we start talking about neural networks LINEAR CLASSIFIER J & GRADIENT LINEAR CLASSIFIER GRADIENT LINEAR CLASSIFIER GRADIENT LINEAR CLASSIFIER GRADIENT GRADIENT DESCENT GRADIENT DESCENT, REVIEW GRADIENT DESCENT IN ND GRADIENT DESCENT STOCHASTIC GRADIENT DESCENT STOCHASTIC GRADIENT DESCENT STOCHASTIC GRADIENT DESCENT, FOR W LIMITATIONS OF LINEAR MODELS Most real-world data is not separable by a linear decision boundary Simplest example: XOR gate What if we could combine the results of multiple linear classifiers? Combine two OR gates with an AND gate to get a XOR gate ANOTHER VIEW OF LINEAR MODELS NEURAL NETWORKS NEURAL NETWORKS EXAMPLES OF ACTIVATION FNS Note that most derivatives of tanh function will be zero! Makes for much needless computation in gradient descent! MORE ACTIVATION FUNCTIONS https://medium.com/@shrutijadon10104776/survey-on- activation-functions-for-deep-learning-9689331ba092 Tanh and sigmoid used historically.
    [Show full text]
  • Deep Learning Architectures for Sequence Processing
    Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. CHAPTER Deep Learning Architectures 9 for Sequence Processing Time will explain. Jane Austen, Persuasion Language is an inherently temporal phenomenon. Spoken language is a sequence of acoustic events over time, and we comprehend and produce both spoken and written language as a continuous input stream. The temporal nature of language is reflected in the metaphors we use; we talk of the flow of conversations, news feeds, and twitter streams, all of which emphasize that language is a sequence that unfolds in time. This temporal nature is reflected in some of the algorithms we use to process lan- guage. For example, the Viterbi algorithm applied to HMM part-of-speech tagging, proceeds through the input a word at a time, carrying forward information gleaned along the way. Yet other machine learning approaches, like those we’ve studied for sentiment analysis or other text classification tasks don’t have this temporal nature – they assume simultaneous access to all aspects of their input. The feedforward networks of Chapter 7 also assumed simultaneous access, al- though they also had a simple model for time. Recall that we applied feedforward networks to language modeling by having them look only at a fixed-size window of words, and then sliding this window over the input, making independent predictions along the way. Fig. 9.1, reproduced from Chapter 7, shows a neural language model with window size 3 predicting what word follows the input for all the. Subsequent words are predicted by sliding the window forward a word at a time.
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
    1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].
    [Show full text]
  • Lecture 2: Linear Classifiers
    Lecture 2: Linear Classifiers Andr´eMartins Deep Structured Learning Course, Fall 2018 Andr´eMartins (IST) Lecture 2 IST, Fall 2018 1 / 117 Course Information • Instructor: Andr´eMartins ([email protected]) • TAs/Guest Lecturers: Erick Fonseca & Vlad Niculae • Location: LT2 (North Tower, 4th floor) • Schedule: Wednesdays 14:30{18:00 • Communication: piazza.com/tecnico.ulisboa.pt/fall2018/pdeecdsl Andr´eMartins (IST) Lecture 2 IST, Fall 2018 2 / 117 Announcements Homework 1 is out! • Deadline: October 10 (two weeks from now) • Start early!!! List of potential projects will be sent out soon! • Deadline for project proposal: October 17 (three weeks from now) • Teams of 3 people Andr´eMartins (IST) Lecture 2 IST, Fall 2018 3 / 117 Today's Roadmap Before talking about deep learning, let us talk about shallow learning: • Supervised learning: binary and multi-class classification • Feature-based linear classifiers • Rosenblatt's perceptron algorithm • Linear separability and separation margin: perceptron's mistake bound • Other linear classifiers: naive Bayes, logistic regression, SVMs • Regularization and optimization • Limitations of linear classifiers: the XOR problem • Kernel trick. Gaussian and polynomial kernels. Andr´eMartins (IST) Lecture 2 IST, Fall 2018 4 / 117 Fake News Detection Task: tell if a news article / quote is fake or real. This is a binary classification problem. Andr´eMartins (IST) Lecture 2 IST, Fall 2018 5 / 117 Fake Or Real? Andr´eMartins (IST) Lecture 2 IST, Fall 2018 6 / 117 Fake Or Real? Andr´eMartins (IST) Lecture 2 IST, Fall 2018 7 / 117 Fake Or Real? Andr´eMartins (IST) Lecture 2 IST, Fall 2018 8 / 117 Fake Or Real? Andr´eMartins (IST) Lecture 2 IST, Fall 2018 9 / 117 Fake Or Real? Can a machine determine this automatically? Can be a very hard problem, since fact-checking is hard and requires combining several knowledge sources ..
    [Show full text]
  • Learning from Few Subjects with Large Amounts of Voice Monitoring Data
    Learning from Few Subjects with Large Amounts of Voice Monitoring Data Jose Javier Gonzalez Ortiz John V. Guttag Robert E. Hillman Daryush D. Mehta Jarrad H. Van Stan Marzyeh Ghassemi Challenges of Many Medical Time Series • Few subjects and large amounts of data → Overfitting to subjects • No obvious mapping from signal to features → Feature engineering is labor intensive • Subject-level labels → In many cases, no good way of getting sample specific annotations 1 Jose Javier Gonzalez Ortiz Challenges of Many Medical Time Series • Few subjects and large amounts of data → Overfitting to subjects Unsupervised feature • No obvious mapping from signal to features extraction → Feature engineering is labor intensive • Subject-level labels Multiple → In many cases, no good way of getting sample Instance specific annotations Learning 2 Jose Javier Gonzalez Ortiz Learning Features • Segment signal into windows • Compute time-frequency representation • Unsupervised feature extraction Conv + BatchNorm + ReLU 128 x 64 Pooling Dense 128 x 64 Upsampling Sigmoid 3 Jose Javier Gonzalez Ortiz Classification Using Multiple Instance Learning • Logistic regression on learned features with subject labels RawRaw LogisticLogistic WaveformWaveform SpectrogramSpectrogram EncoderEncoder RegressionRegression PredictionPrediction •• %% Positive Positive Ours PerPer Window Window PerPer Subject Subject • Aggregate prediction using % positive windows per subject 4 Jose Javier Gonzalez Ortiz Application: Voice Monitoring Data • Voice disorders affect 7% of the US population • Data collected through neck placed accelerometer 1 week = ~4 billion samples ~100 5 Jose Javier Gonzalez Ortiz Results Previous work relied on expert designed features[1] AUC Accuracy Comparable performance Train 0.70 ± 0.05 0.71 ± 0.04 Expert LR without Test 0.68 ± 0.05 0.69 ± 0.04 task-specific feature Train 0.73 ± 0.06 0.72 ± 0.04 engineering! Ours Test 0.69 ± 0.07 0.70 ± 0.05 [1] Marzyeh Ghassemi et al.
    [Show full text]
  • Machine Learning V/S Deep Learning
    International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 02 | Feb 2019 www.irjet.net p-ISSN: 2395-0072 Machine Learning v/s Deep Learning Sachin Krishan Khanna Department of Computer Science Engineering, Students of Computer Science Engineering, Chandigarh Group of Colleges, PinCode: 140307, Mohali, India ---------------------------------------------------------------------------***--------------------------------------------------------------------------- ABSTRACT - This is research paper on a brief comparison and summary about the machine learning and deep learning. This comparison on these two learning techniques was done as there was lot of confusion about these two learning techniques. Nowadays these techniques are used widely in IT industry to make some projects or to solve problems or to maintain large amount of data. This paper includes the comparison of two techniques and will also tell about the future aspects of the learning techniques. Keywords: ML, DL, AI, Neural Networks, Supervised & Unsupervised learning, Algorithms. INTRODUCTION As the technology is getting advanced day by day, now we are trying to make a machine to work like a human so that we don’t have to make any effort to solve any problem or to do any heavy stuff. To make a machine work like a human, the machine need to learn how to do work, for this machine learning technique is used and deep learning is used to help a machine to solve a real-time problem. They both have algorithms which work on these issues. With the rapid growth of this IT sector, this industry needs speed, accuracy to meet their targets. With these learning algorithms industry can meet their requirements and these new techniques will provide industry a different way to solve problems.
    [Show full text]
  • Hyperplane Based Classification: Perceptron and (Intro
    Hyperplane based Classification: Perceptron and (Intro to) Support Vector Machines Piyush Rai CS5350/6350: Machine Learning September 8, 2011 (CS5350/6350) Hyperplane based Classification September8,2011 1/20 Hyperplane Separates a D-dimensional space into two half-spaces Defined by an outward pointing normal vector w RD ∈ (CS5350/6350) Hyperplane based Classification September8,2011 2/20 Hyperplane Separates a D-dimensional space into two half-spaces Defined by an outward pointing normal vector w RD ∈ w is orthogonal to any vector lying on the hyperplane (CS5350/6350) Hyperplane based Classification September8,2011 2/20 Hyperplane Separates a D-dimensional space into two half-spaces Defined by an outward pointing normal vector w RD ∈ w is orthogonal to any vector lying on the hyperplane Assumption: The hyperplane passes through origin. (CS5350/6350) Hyperplane based Classification September8,2011 2/20 Hyperplane Separates a D-dimensional space into two half-spaces Defined by an outward pointing normal vector w RD ∈ w is orthogonal to any vector lying on the hyperplane Assumption: The hyperplane passes through origin. If not, have a bias term b; we will then need both w and b to define it b > 0 means moving it parallely along w (b < 0 means in opposite direction) (CS5350/6350) Hyperplane based Classification September8,2011 2/20 Linear Classification via Hyperplanes Linear Classifiers: Represent the decision boundary by a hyperplane w For binary classification, w is assumed to point towards the positive class (CS5350/6350) Hyperplane based Classification
    [Show full text]
  • Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting
    water Article Comparative Analysis of Recurrent Neural Network Architectures for Reservoir Inflow Forecasting Halit Apaydin 1 , Hajar Feizi 2 , Mohammad Taghi Sattari 1,2,* , Muslume Sevba Colak 1 , Shahaboddin Shamshirband 3,4,* and Kwok-Wing Chau 5 1 Department of Agricultural Engineering, Faculty of Agriculture, Ankara University, Ankara 06110, Turkey; [email protected] (H.A.); [email protected] (M.S.C.) 2 Department of Water Engineering, Agriculture Faculty, University of Tabriz, Tabriz 51666, Iran; [email protected] 3 Department for Management of Science and Technology Development, Ton Duc Thang University, Ho Chi Minh City, Vietnam 4 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam 5 Department of Civil and Environmental Engineering, Hong Kong Polytechnic University, Hong Kong, China; [email protected] * Correspondence: [email protected] or [email protected] (M.T.S.); [email protected] (S.S.) Received: 1 April 2020; Accepted: 21 May 2020; Published: 24 May 2020 Abstract: Due to the stochastic nature and complexity of flow, as well as the existence of hydrological uncertainties, predicting streamflow in dam reservoirs, especially in semi-arid and arid areas, is essential for the optimal and timely use of surface water resources. In this research, daily streamflow to the Ermenek hydroelectric dam reservoir located in Turkey is simulated using deep recurrent neural network (RNN) architectures, including bidirectional long short-term memory (Bi-LSTM), gated recurrent unit (GRU), long short-term memory (LSTM), and simple recurrent neural networks (simple RNN). For this purpose, daily observational flow data are used during the period 2012–2018, and all models are coded in Python software programming language.
    [Show full text]
  • Expert Feature-Engineering Vs. Deep Neural Networks: Which Is Better for Sensor-Free Affect Detection?
    Expert Feature-Engineering vs. Deep Neural Networks: Which is Better for Sensor-Free Affect Detection? Yang Jiang1, Nigel Bosch2, Ryan S. Baker3, Luc Paquette2, Jaclyn Ocumpaugh3, Juliana Ma. Alexandra L. Andres3, Allison L. Moore4, Gautam Biswas4 1 Teachers College, Columbia University, New York, NY, United States [email protected] 2 University of Illinois at Urbana-Champaign, Champaign, IL, United States {pnb, lpaq}@illinois.edu 3 University of Pennsylvania, Philadelphia, PA, United States {rybaker, ojaclyn}@upenn.edu [email protected] 4 Vanderbilt University, Nashville, TN, United States {allison.l.moore, gautam.biswas}@vanderbilt.edu Abstract. The past few years have seen a surge of interest in deep neural net- works. The wide application of deep learning in other domains such as image classification has driven considerable recent interest and efforts in applying these methods in educational domains. However, there is still limited research compar- ing the predictive power of the deep learning approach with the traditional feature engineering approach for common student modeling problems such as sensor- free affect detection. This paper aims to address this gap by presenting a thorough comparison of several deep neural network approaches with a traditional feature engineering approach in the context of affect and behavior modeling. We built detectors of student affective states and behaviors as middle school students learned science in an open-ended learning environment called Betty’s Brain, us- ing both approaches. Overall, we observed a tradeoff where the feature engineer- ing models were better when considering a single optimized threshold (for inter- vention), whereas the deep learning models were better when taking model con- fidence fully into account (for discovery with models analyses).
    [Show full text]