Lecture 6. Notes on Linear Algebra. Perceptron


COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Andrey Kan
Copyright: University of Melbourne

This lecture
• Notes on linear algebra
  ∗ Vectors and dot products
  ∗ Hyperplanes and vector normals
• Perceptron
  ∗ Introduction to artificial neural networks
  ∗ The perceptron model
  ∗ Stochastic gradient descent

Notes on Linear Algebra
Link between the geometric and algebraic interpretations of ML methods

What are vectors?
Suppose $u = [u_1, u_2]'$. What does $u$ really represent?
• An ordered set of numbers $\{u_1, u_2\}$
• Cartesian coordinates of a point
• A direction

Dot product: algebraic definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \sum_{i=1}^{m} u_i v_i$
  ∗ E.g., a weighted sum of terms is a dot product
• Verify that if $k$ is a constant and $a$, $b$ and $c$ are vectors of the same size, then
  $(ka)'b = k(a'b) = a'(kb)$ and $a'(b + c) = a'b + a'c$

Dot product: geometric definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv \|u\| \|v\| \cos\theta$
  ∗ Here $\|u\|$ and $\|v\|$ are the L2 norms (i.e., Euclidean lengths) of vectors $u$ and $v$
  ∗ and $\theta$ is the angle between the vectors
• The scalar projection of $u$ onto $v$ is given by $u_v = \|u\| \cos\theta$. Thus the dot product is $u'v = u_v \|v\| = v_u \|u\|$

Equivalence of definitions
• Lemma: the algebraic and geometric definitions are identical
• Proof sketch:
  ∗ Express the vectors using the standard vector basis $e_1, \dots, e_m$ in $\mathbb{R}^m$: $u = \sum_{i=1}^{m} u_i e_i$ and $v = \sum_{i=1}^{m} v_i e_i$
  ∗ The $e_i$ form an orthonormal basis: they have unit length and are orthogonal to each other, so $e_i'e_i = 1$ and $e_i'e_j = 0$ for $i \neq j$
  ∗ Hence $u'v = \sum_{i=1}^{m} \sum_{j=1}^{m} u_i v_j \, e_i'e_j = \sum_{i=1}^{m} u_i v_i$; evaluating each $e_i'e_j$ with the geometric definition ($\|e_i\| \|e_j\| \cos\theta_{ij}$) gives the same result, so the two definitions coincide

Geometric properties of the dot product
• If the two vectors are orthogonal then $u'v = 0$
• If the two vectors are parallel then $u'v = \|u\| \|v\|$; if they are anti-parallel then $u'v = -\|u\| \|v\|$
• $u'u = \|u\|^2$, so $\|u\| = \sqrt{u_1^2 + \dots + u_m^2}$ defines the Euclidean vector length

Hyperplanes and normal vectors
• A hyperplane defined by parameters $w$ and $b$ is the set of points $x$ that satisfy $x'w + b = 0$
• In 2D, a hyperplane is a line: a set of points that satisfy $w_1 x_1 + w_2 x_2 + b = 0$
• A normal vector for a hyperplane is a vector perpendicular to that hyperplane

Hyperplanes and normal vectors (continued)
• Consider a hyperplane defined by parameters $w$ and $b$. Note that $w$ is itself a vector
• Lemma: vector $w$ is a normal vector to the hyperplane
• Proof sketch:
  ∗ Choose any two points $u$ and $v$ on the hyperplane. The vector $(u - v)$ lies on the hyperplane
  ∗ Consider the dot product $(u - v)'w = u'w - v'w = (u'w + b) - (v'w + b) = 0$
  ∗ Thus $(u - v)$ lies on the hyperplane but is perpendicular to $w$, and so $w$ is a normal vector

Example in 2D
• Consider a line defined by $w_1$, $w_2$ and $b$
• Vector $w = [w_1, w_2]'$ is a normal vector to the line
• Rearranging the line equation gives $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$
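The two dot-product definitions and the normal-vector lemma are easy to check numerically. The following is a minimal NumPy sketch (not part of the original slides; the vectors, weights and bias are made-up values chosen for illustration):

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

# Algebraic definition: u'v = sum_i u_i * v_i
dot_alg = np.sum(u * v)

# Geometric definition: ||u|| ||v|| cos(theta), with theta measured independently
theta = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
dot_geo = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)
print(dot_alg, dot_geo)    # both are 5.0 (up to floating-point error)

# Normal-vector lemma: for any two points p, q on the hyperplane x'w + b = 0,
# the difference (p - q) lies on the hyperplane and is perpendicular to w.
w, b = np.array([2.0, -1.0]), 3.0
p = np.array([0.0, 3.0])   # satisfies p'w + b = 0
q = np.array([1.0, 5.0])   # satisfies q'w + b = 0
print(np.dot(p - q, w))    # 0.0
```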
Is logistic regression a linear method?
[Figure: the logistic function, mapping reals to probabilities between 0 and 1]

Logistic regression is a linear classifier
• Logistic regression model: $P(Y = 1 | x) = \frac{1}{1 + \exp(-x'w)}$
• Classification rule: if $P(Y = 1 | x) > \frac{1}{2}$ then class "1", else class "0"
• Decision boundary: $\frac{1}{1 + \exp(-x'w)} = \frac{1}{2}$, i.e., $\exp(-x'w) = 1$, i.e., $x'w = 0$

The Perceptron Model
A building block for artificial neural networks, yet another linear classifier

Biological inspiration
• Humans perform well at many tasks that matter
• Originally, neural networks were an attempt to mimic the human brain

Artificial neural network
• As a crude approximation, the human brain can be thought of as a mesh of interconnected processing nodes (neurons) that relay electrical signals
• An artificial neural network is a network of processing elements
• Each element converts inputs $x_1, \dots, x_m$ to an output
• The output is a function (called the activation function) of a weighted sum of the inputs

Outline
• In order to use an ANN we need (a) to design the network topology and (b) to adjust the weights to the given data
  ∗ In this course, we will exclusively focus on task (b) for a particular class of networks called feed-forward networks
• Training an ANN means adjusting weights for training data given a pre-defined network topology
• We will come back to ANNs and discuss ANN training in the next lecture
• Right now we will turn our attention to an individual network element, because it is an interesting model in itself

Perceptron model
[Figure: each input $x_i$ is multiplied by a weight $w_i$, a constant input 1 is multiplied by the bias weight $w_0$, the products are summed ($\Sigma$) to give $s$, and the activation function gives the output $f(s)$]
• $x_1, x_2$ – inputs
• $w_1, w_2$ – synaptic weights
• $w_0$ – bias weight
• $f$ – activation function
• Compare this model to logistic regression

Perceptron is a linear binary classifier
• Perceptron is a binary classifier:
  ∗ Predict class A if $s \geq 0$
  ∗ Predict class B if $s < 0$
  where $s = \sum_{i=0}^{m} x_i w_i$ (with constant input $x_0 = 1$ for the bias)
• Perceptron is a linear classifier: $s$ is a linear function of the inputs, and the decision boundary is linear
[Figure: example for 2D data, showing a linear decision boundary separating the data points in the $(x_1, x_2)$ plane]

Exercise: find the weights of a perceptron capable of perfect classification of the following dataset (one possible set of weights is sketched in the code example below)
  $x_1$  $x_2$  class
    0      0      B
    0      1      B
    1      0      B
    1      1      A

Loss function for perceptron
• Recall that "training" means finding weights that minimise some loss. Therefore, we proceed by considering the loss function for the perceptron
• Our task is binary classification. Let's arbitrarily encode one class as $+1$ and the other as $-1$. So each training example is now $\{x, y\}$, where $y$ is either $+1$ or $-1$
• Recall that, in a perceptron, $s = \sum_{i=0}^{m} x_i w_i$, and the sign of $s$ determines the predicted class: $+1$ if $s > 0$, and $-1$ if $s < 0$
• Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified
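As a concrete illustration of the decision rule, and one possible answer to the exercise above, here is a small Python sketch (not from the slides; the weights shown are just one of infinitely many valid choices, and the function name is illustrative):

```python
def perceptron_predict(x, w, w0):
    """Perceptron decision rule: class A (+1) if s >= 0, class B (-1) if s < 0."""
    s = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if s >= 0 else -1

# One possible weight choice (weights are not unique): predict class A only
# when both inputs are 1, by thresholding x_1 + x_2 against 1.5.
w, w0 = [1.0, 1.0], -1.5
dataset = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]  # B as -1, A as +1
for x, y in dataset:
    assert perceptron_predict(x, w, w0) == y   # all four examples classified correctly
```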
Loss function for perceptron (continued)
• The perceptron uses a loss function in which there is no penalty for correctly classified examples, while the penalty (loss) is equal to $|s|$ for misclassified examples*
• Formally:
  ∗ $L(s, y) = 0$ if $s$ and $y$ have the same sign
  ∗ $L(s, y) = |s|$ if $s$ and $y$ have different signs
• This can be re-written as $L(s, y) = \max(0, -sy)$
* This is similar, but not identical, to another widely used loss function called the hinge loss

Stochastic gradient descent
• Split all training examples into $B$ batches
• Choose initial $\theta^{(1)}$
• For $i$ from 1 to $T$ (iterations over the entire dataset are called epochs)
  • For $j$ from 1 to $B$
    • Do a gradient descent update using the data from batch $j$
• Advantage of such an approach: computational feasibility for large datasets

Perceptron training algorithm
• Choose initial guess $w^{(0)}$, set $k = 0$
• For $i$ from 1 to $T$ (epochs)
  • For $j$ from 1 to $N$ (training examples)
    • Consider example $\{x_j, y_j\}$
    • Update*: $w^{(k+1)} = w^{(k)} - \eta \nabla L(w^{(k)})$, where $L(w) = \max(0, -sy)$, $s = \sum_{i=0}^{m} x_i w_i$, and $\eta$ is the learning rate
* There is no derivative when $s = 0$, but this case is handled explicitly in the algorithm; see the next slides

Perceptron training rule
• We have $\frac{\partial L}{\partial w_i} = 0$ when $sy > 0$
  ∗ We don't need to do an update when an example is correctly classified
• We have $\frac{\partial L}{\partial w_i} = -x_i$ when $y = 1$ and $s < 0$
• We have $\frac{\partial L}{\partial w_i} = x_i$ when $y = -1$ and $s > 0$

Perceptron training algorithm (update rules)
• When classified correctly, the weights are unchanged
• When misclassified: $w^{(k+1)} = w^{(k)} \pm \eta x$ ($\eta > 0$ is called the learning rate)
  ∗ If $y = 1$ but $s < 0$: $w \leftarrow w + \eta x$ and $w_0 \leftarrow w_0 + \eta$
  ∗ If $y = -1$ but $s \geq 0$: $w \leftarrow w - \eta x$ and $w_0 \leftarrow w_0 - \eta$
• Convergence Theorem: if the training data is linearly separable, the algorithm is guaranteed to converge to a solution. That is, there exists a finite $k$ such that $L = 0$

Perceptron convergence theorem
• Assumptions
  ∗ Linear separability: there exists $w^*$ such that $y_i (w^*)' x_i \geq \gamma$ for all training data $i = 1, \dots, N$ and some positive $\gamma$
  ∗ Bounded data: $\|x_i\| \leq R$ for $i = 1, \dots, N$ and some finite $R$
• Proof sketch
  ∗ Establish that $(w^*)' w^{(k)} \geq (w^*)' w^{(k-1)} + \gamma$
  ∗ It then follows that $(w^*)' w^{(k)} \geq k\gamma$
  ∗ Establish that $\|w^{(k)}\|^2 \leq k R^2$
  ∗ Note that $\cos(w^*, w^{(k)}) = \frac{(w^*)' w^{(k)}}{\|w^*\| \|w^{(k)}\|}$
  ∗ Use the fact that $\cos(\cdot) \leq 1$
  ∗ Arrive at $k \leq \frac{R^2 \|w^*\|^2}{\gamma^2}$

Pros and cons of perceptron learning
• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
  ∗ There is a formal proof – good!
  ∗ It will converge to some solution (a separating boundary), one of infinitely many possible – bad!
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
  ∗ Ugly!

Perceptron Learning Example
• Basic setup: $s = w_1 x_1 + w_2 x_2 + w_0$; activation $f(s) = 1$ if $s \geq 0$ and $f(s) = -1$ if $s < 0$; a fixed learning rate $\eta$
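The training algorithm and update rules above can be summarised in a short sketch. The code below is an illustrative implementation, not the original lecture code; it uses the ±1 class encoding, per-example (stochastic) updates, and handles the $s = 0$ case explicitly as the slides describe. Function and variable names are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Train a perceptron with per-example updates on the loss max(0, -s*y).

    X: (N, m) array of inputs; y: (N,) array of labels in {-1, +1}.
    Returns a weight vector of length m + 1 whose first entry is the bias weight w_0.
    """
    N, m = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend constant input x_0 = 1 for the bias
    w = np.zeros(m + 1)                    # initial guess w^(0)
    for epoch in range(max_epochs):        # passes over the whole dataset (epochs)
        updates = 0
        for xi, yi in zip(Xb, y):          # consider one training example at a time
            s = xi @ w
            # Update only on mistakes, handling s = 0 explicitly:
            #   if y = +1 but s < 0:  w <- w + eta * x
            #   if y = -1 but s >= 0: w <- w - eta * x
            if (yi == 1 and s < 0) or (yi == -1 and s >= 0):
                w += eta * yi * xi
                updates += 1
        if updates == 0:                   # no mistakes in a full epoch: training loss is 0
            break
    return w

# Example: the linearly separable dataset from the earlier exercise
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
preds = [1 if w[0] + xi @ w[1:] >= 0 else -1 for xi in X]
print(w, preds)   # the learned weights classify all four points: [-1, -1, -1, 1]
```

On linearly separable data such as this, the loop stops once an entire epoch passes without an update, which is exactly the convergence guarantee stated above; on non-separable data it would simply run until max_epochs without ever reaching zero training loss.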