Support Vector Machines SVM Discussion Overview

1. Importance of SVMs
2. Overview of Mathematical Techniques Employed
3. Margin Geometry
4. SVM Training Methodology
5. Overlapping Distributions
6. Dealing with Multiple Classes
7. SVM and Computational Learning Theory
8. Relevance Vector Machines

CSE555: Srihari

1. Importance of SVM
• SVM is a discriminative method that brings together:
  1. computational learning theory
  2. previously known methods in linear discriminant functions
  3. optimization theory
• Also called sparse kernel machines
  • Kernel methods predict based on linear combinations of a kernel function evaluated at the training points, e.g., Parzen window
  • Sparse because the kernel need only be evaluated at a subset of the training points (the support vectors)
• Also called maximum margin classifiers
• Widely used for solving problems in classification, regression and novelty detection

2. Mathematical Techniques Used
1. The linearly separable case is considered, since with an appropriate nonlinear mapping φ to a sufficiently high dimension, two categories can always be separated by a hyperplane
2. To handle non-linear separability:
  • Preprocess the data to represent it in a much higher-dimensional space than the original feature space
  • Kernel trick reduces computational overhead

3. Support Vectors and Margin
• Support vectors are those nearest patterns, at distance b from the hyperplane
• SVM finds the hyperplane with maximum distance (margin distance b) from the nearest training patterns
[Figure: three support vectors are shown as solid dots.]

Margin Maximization
• Why maximize margin?
• Motivation found in computational learning theory or statistical learning theory (PAC learning, VC dimension)
• Insight given as follows (Tong, Koller 2000):
  • Model distributions for each class using Parzen density estimators with Gaussian kernels and common parameter $\sigma^2$
  • Instead of the optimum boundary, determine the best hyperplane relative to the learned density model
  • As $\sigma^2 \to 0$ the optimum hyperplane has maximum margin
  • Hyperplane becomes independent of data points that are not support vectors

Distance from an arbitrary point to the plane
• Hyperplane: $g(x) = w^t x + w_0 = 0$, where $w$ is the weight vector and $w_0$ is the bias
• Lemma: Distance from $x$ to the plane is $r = \dfrac{g(x)}{\|w\|}$

Proof: Let $x = x_p + r\,\dfrac{w}{\|w\|}$, where $x_p$ is the projection of $x$ onto the plane (so $g(x_p) = 0$) and $r$ is the distance from $x$ to the plane. Then

$$g(x) = w^t\left(x_p + r\,\frac{w}{\|w\|}\right) + w_0 = w^t x_p + w_0 + r\,\frac{w^t w}{\|w\|} = g(x_p) + r\,\frac{\|w\|^2}{\|w\|} = r\,\|w\| \qquad \text{QED}$$

Corollary: Distance of the origin to the plane is $r = g(0)/\|w\| = w_0/\|w\|$, since $g(0) = w^t 0 + w_0 = w_0$. Thus $w_0 = 0$ implies that the plane passes through the origin.

[Figure: the hyperplane $g(x) = w^t x + w_0 = 0$, a point $x = x_p + r\,w/\|w\|$ with its projection $x_p$, and the regions $g > 0$ and $g < 0$ on either side.]

Choosing a margin
• Augmented space: $g(y) = a^t y$ by choosing $a_0 = w_0$ and $y_0 = 1$, i.e., the plane passes through the origin
• For each of the patterns, let $z_k = \pm 1$ depending on whether pattern $k$ is in class $\omega_1$ or $\omega_2$
• Thus if $g(y) = 0$ is a separating hyperplane then $z_k\, g(y_k) > 0,\ k = 1, \ldots, n$
• Since the distance of a point $y$ to the hyperplane $g(y) = 0$ is $\dfrac{g(y)}{\|a\|}$, we could require that the hyperplane be such that all points are at least distance $b$ from it, i.e.,

$$\frac{z_k\, g(y_k)}{\|a\|} \ge b$$

SVM Margin geometry
[Figure: two panels with axes $y_1$, $y_2$; the first shows the hyperplane $g(y) = a^t y = 0$ and the margin boundaries $g(y) = 1$ and $g(y) = -1$.]
The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls, which
has length $2/\|a\|$.

Statement of Optimization Problem
• The goal is to find the weight vector $a$ that satisfies

$$\frac{z_k\, g(y_k)}{\|a\|} \ge b, \quad k = 1, \ldots, n$$

• while maximizing $b$
• To ensure uniqueness we impose the constraint $b\,\|a\| = 1$, or $b = 1/\|a\|$
• Which implies that we also require that $\|a\|^2$ be minimized
• Support vectors are (transformed) training patterns for which equality holds in the above equation
• Called a quadratic optimization problem since we are trying to minimize a quadratic function subject to a set of linear inequality constraints

4. SVM Training Methodology
1. Training is formulated as an optimization problem
  • Dual problem is stated to reduce computational complexity
  • Kernel trick is used to reduce computation
2. Determination of the model parameters corresponds to a convex optimization problem
  • Solution is straightforward (local solution is a global optimum)
3. Makes use of Lagrange multipliers

Joseph-Louis Lagrange 1736-1813
• French mathematician
• Born in Turin, Italy
• Succeeded Euler at the Berlin academy
• Narrowly escaped execution in the French Revolution thanks to Lavoisier, who was himself guillotined
• Made key contributions to calculus and dynamics

SVM Training: Optimization Problem
• Optimize

$$\arg\min_{a,\,b}\ \frac{1}{2}\|a\|^2 \quad \text{subject to constraints} \quad z_k\, a^t y_k \ge 1, \quad k = 1, \ldots, n$$

• Can be cast as an unconstrained problem by introducing Lagrange undetermined multipliers $\alpha_k$, with one multiplier for each constraint
• The Lagrange function is

$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k\, a^t y_k - 1 \right]$$

Optimization of Lagrange function
• The Lagrange function is

$$L(a, \alpha) = \frac{1}{2}\|a\|^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k\, a^t y_k - 1 \right]$$

• We seek to minimize $L(\cdot)$
  • with respect to the weight vector $a$ and
  • maximize it w.r.t.
the undetermined multipliers $\alpha_k \ge 0$
• Last term represents the goal of classifying the points correctly
• Karush-Kuhn-Tucker construction shows that this can be recast as a maximization problem which is computationally better

Dual Optimization Problem
• Problem is reformulated as one of maximizing

$$L(\alpha) = \sum_{k=1}^{n} \alpha_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j\, k(y_j, y_k)$$

• Subject to the constraints

$$\sum_{k=1}^{n} \alpha_k z_k = 0, \qquad \alpha_k \ge 0, \quad k = 1, \ldots, n$$

given the training data
• where the kernel function is defined by

$$k(y_j, y_k) = y_j^t\, y_k = \phi(x_j)^t\, \phi(x_k)$$

Solution of Dual Problem
• Implementation:
  • Solved using quadratic programming
  • Alternatively, since it only needs inner products of training data, it can be implemented using kernel functions, which is a crucial property for generalizing to the non-linear case
• The solution is given by

$$a = \sum_{k} \alpha_k z_k y_k$$

Summary of SVM Optimization Problems
[Figure: summary of the SVM optimization problems, with the annotations "Different notation here!" and "Quadratic term".]

Kernel Function: key property
• If the kernel function is chosen with the property $K(x, y) = \phi(x) \cdot \phi(y)$, then the computational expense of increased dimensionality is avoided.
• Polynomial kernel $K(x, y) = (x \cdot y)^d$ can be shown (next slide) to correspond to a map $\phi$ into the space spanned by all products of exactly $d$ dimensions.
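
The key property can be checked numerically for the degree-2 case. The sketch below is only an illustration: it assumes 2-D inputs and the explicit map φ(x) = (x1², √2·x1·x2, x2²), a standard choice for K(x, y) = (x·y)². The names `phi`, `dot` and `poly_kernel` and the sample vectors are hypothetical, not taken from the slides.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (assumed form)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x . y)^d, computed in input space."""
    return dot(x, y) ** d

# Arbitrary sample vectors, chosen only for illustration.
x, y = (1.5, -2.0), (0.5, 3.0)

explicit = dot(phi(x), phi(y))   # inner product in the 3-D feature space
implicit = poly_kernel(x, y)     # same quantity without building phi

print(explicit, implicit, math.isclose(explicit, implicit))
```

The two values agree up to floating-point rounding, so the dual problem can work with `poly_kernel` alone and never form the mapped vectors.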
A Polynomial Kernel Function
Suppose $x = (x_1, x_2)$ is the input vector. The feature space mapping is:

$$\phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)^t$$

Then the inner product is

$$\phi(x)^t \phi(y) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right) \left(y_1^2,\ \sqrt{2}\,y_1 y_2,\ y_2^2\right)^t = (x_1 y_1 + x_2 y_2)^2$$

The polynomial kernel function that computes the same value is

$$K(x, y) = (x \cdot y)^2 = \left((x_1, x_2)\,(y_1, y_2)^t\right)^2 = (x_1 y_1 + x_2 y_2)^2$$

or $K(x, y) = \phi(x)^t \phi(y)$
• Inner product $\phi(x)^t \phi(y)$ needs computing six feature values and 3 x 3 = 9 multiplications
• Kernel function $K(x, y)$ has 2 multiplications and a squaring

Another Polynomial (quadratic) kernel function
• $K(x, y) = (x \cdot y + 1)^2$
• This one maps d = 2, p = 2 into a six-dimensional space
• Contains all the powers of x

$$K(x, y) = \phi(x)^t \phi(y) \quad \text{where} \quad \phi(x) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ 1\right)$$

• Inner product needs 36 multiplications
• Kernel function needs 4 multiplications

Non-Linear Case
• Mapping function $\phi(\cdot)$ to a sufficiently high dimension
• So that data from two categories can always be separated by a hyperplane
• Assume each pattern $x_k$ has been transformed to $y_k = \phi(x_k)$, for $k = 1, \ldots, n$
• First choose the non-linear $\phi$ functions
  • To map the input vector to a higher-dimensional feature space
• Dimensionality of the space can be arbitrarily high, limited only by computational resources

Mapping into Higher Dimensional Feature Space
• Mapping each input point $x$ by the map

$$y = \Phi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \end{pmatrix}$$

Points on a 1-d line are mapped onto a curve in 3-d.
• Linear separation in 3-d space is possible.
Linear discriminant function in 3-d is in the form $g(x) = a_1 y_1 + a_2 y_2 + a_3 y_3$

Pattern Transformation using Kernels
• Problem with high-dimensional mapping
  • Very many parameters
  • Polynomial of degree p over d variables leads to $O(d^p)$ variables in feature space
  • Example: if d = 50 and p = 2 we need a feature space of size 2500
• Solution:
  • Dual optimization problem needs only inner products
• Each pattern $x_k$ is transformed into pattern $y_k$, where $y_k = \Phi(x_k)$
• Dimensionality of mapped space can be arbitrarily high

Example of SVM results
• Two classes in two dimensions
• Synthetic data
• Shows contours of constant $g(x)$
• Obtained from SVM with Gaussian kernel function
• Decision boundary is shown
• Margin boundaries are shown
• Support vectors are shown
• Shows sparsity of SVM

Demo
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Svmtoy.exe

SVM for the XOR problem
• XOR: binary valued features $x_1, x_2$
• Not solved by a linear discriminant
• Function $\phi$ maps input $x = [x_1, x_2]$ into the six-dimensional feature space

$$y = \left[1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \sqrt{2}\,x_1 x_2,\ x_1^2,\ x_2^2\right]$$

[Figure: input space, with hyperbolas corresponding to $x_1 x_2 = \pm 1$, and feature sub-space, with hyperplanes corresponding to $\sqrt{2}\,x_1 x_2 = \pm 1$.]

SVM for XOR: maximization problem
• We seek to maximize

$$\sum_{k=1}^{4} \alpha_k - \frac{1}{2} \sum_{k=1}^{4} \sum_{j=1}^{4} \alpha_k \alpha_j z_j z_k\, y_j^t y_k$$

• Subject to the constraints

$$\alpha_1 - \alpha_2 + \alpha_3 - \alpha_4 = 0, \qquad 0 \le \alpha_k, \quad k = 1, 2, 3, 4$$

• From problem symmetry, at the solution $\alpha_1 = \alpha_3$ and $\alpha_2 = \alpha_4$

SVM for XOR: maximization problem
• Can use iterative gradient descent
• Or use analytical techniques for a small problem
• The solution is $\alpha^* = (1/8, 1/8, 1/8, 1/8)$
• Last term of the optimization problem implies that all four points are support vectors (unusual and due to the symmetric nature of XOR)
• The final discriminant function is g(x1,x2) = x1.
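
The XOR solution can be checked by direct substitution. The sketch below is an illustration under stated assumptions, not the method used in the slides: it assumes the features are encoded as ±1, that the four patterns are ordered with alternating class labels (which matches the constraint α1 − α2 + α3 − α4 = 0), and that φ is the six-dimensional map with √2 coefficients. All names (`X`, `Z`, `phi`, `dual_objective`, `g`) are hypothetical.

```python
import math

# Four XOR patterns with +/-1 encoding; the ordering and labels are assumed
# so that the equality constraint reads alpha1 - alpha2 + alpha3 - alpha4 = 0.
X = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
Z = [1, -1, 1, -1]

def phi(x):
    """Six-dimensional feature map y = [1, r2*x1, r2*x2, r2*x1*x2, x1^2, x2^2]."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 * x1, x2 * x2]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

Y = [phi(x) for x in X]
n = len(X)

def dual_objective(alpha):
    """sum_k alpha_k - 1/2 sum_k sum_j alpha_k alpha_j z_k z_j y_j^t y_k."""
    quad = sum(alpha[k] * alpha[j] * Z[k] * Z[j] * dot(Y[j], Y[k])
               for k in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

alpha = [1.0 / 8.0] * n

# Weight vector a = sum_k alpha_k z_k y_k in the feature space.
a = [sum(alpha[k] * Z[k] * Y[k][i] for k in range(n)) for i in range(6)]

def g(x):
    """Discriminant g(x) = a^t phi(x)."""
    return dot(a, phi(x))

for x, z in zip(X, Z):
    # z * g(x) equal to 1 means the pattern lies exactly on the margin.
    print(x, z, round(z * g(x), 6))

print("dual objective:", dual_objective(alpha))
print("margin 1/||a||:", 1.0 / math.sqrt(dot(a, a)))
```

Under these assumptions every pattern satisfies z_k g(y_k) = 1, which is consistent with all four points being support vectors, and only the √2·x1·x2 component of `a` is non-zero, so `g` evaluates to the product x1·x2 at any input.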