Kernel Methods

Sargur Srihari, [email protected]

Topics in Kernel Methods
1. Linear models vs memory-based models
2. Stored sample methods
3. Kernel functions
   • Dual representations
   • Constructing kernels
4. Extension to symbolic inputs
5. Fisher kernel

Linear Models vs Memory-Based Models
• Linear parametric models for regression and classification have the form y(x,w)
• During learning we obtain either a maximum likelihood estimate of w or a posterior distribution over w
• The training data are then discarded; prediction is based only on the vector w
• This is true of neural networks as well
• Memory-based methods instead keep the training samples (or a subset of them) and use them during the prediction phase

Memory-Based Methods
• Training data points are used directly in prediction
• Examples of such methods:
   • the Parzen probability density model, a linear combination of kernel functions centered on each training data point
   • nearest-neighbor classification
• These memory-based methods require a metric to be defined
• They are fast to train but slow to predict

Kernel Functions
• Linear models can be re-cast into an equivalent dual representation in which predictions are based on kernel functions evaluated at the training points
• The kernel function is given by k(x,x′) = φ(x)^T φ(x′), where φ(x) is a fixed nonlinear mapping (basis function)
• The kernel is a symmetric function of its arguments: k(x,x′) = k(x′,x)
• The kernel can be interpreted as the similarity of x and x′
• The simplest choice is the identity mapping in feature space, φ(x) = x, in which case k(x,x′) = x^T x′; this is called the linear kernel

Kernel Trick
• Formulating algorithms in terms of inner products allows well-known algorithms to be extended by using the kernel trick
• Basic idea: if an input vector x appears in an algorithm only in the form of scalar products, then those scalar products can be replaced with some other choice of kernel
• Widely used:
   • in support vector machines
   • in developing nonlinear variants of PCA
   • in the kernel Fisher discriminant

Other Forms of Kernel Functions
• A kernel that is a function of the difference between its arguments, k(x,x′) = k(x − x′), is called a stationary kernel, since it is invariant to translations of the input space
• A radial basis function kernel, k(x,x′) = k(‖x − x′‖), depends only on the magnitude of the distance between its arguments
• Note that the kernel function is a scalar while x is M-dimensional
• For these to be valid kernel functions they must be shown to have the property k(x,x′) = φ(x)^T φ(x′)

Dual Representation
• Linear models for regression and classification can be reformulated in terms of a dual representation, in which the kernel function arises naturally
• The dual representation plays an important role in SVMs
• Consider a linear regression model whose parameters are obtained by minimizing the regularized sum-of-squares error

  J(w) = \frac{1}{2}\sum_{n=1}^{N}\{w^T\phi(x_n) - t_n\}^2 + \frac{\lambda}{2}w^T w

  where w = (w_0,…,w_{M−1})^T, φ = (φ_0,…,φ_{M−1})^T is the set of M basis functions (feature vector), {x_1,…,x_N} are the N training samples, and λ is the regularization coefficient
• The minimum is obtained by setting the gradient of J(w) with respect to w equal to zero

Solution for w: a Linear Combination of φ(x_n)
• Equating the derivative of J(w) with respect to w to zero gives

  w = -\frac{1}{\lambda}\sum_{n=1}^{N}\{w^T\phi(x_n) - t_n\}\,\phi(x_n) = \sum_{n=1}^{N}a_n\phi(x_n) = \Phi^T a

• The solution for w is a linear combination of the vectors φ(x_n), whose coefficients are functions of w
• Φ is the design matrix, whose nth row is given by φ(x_n)^T:

  \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \quad \text{an } N \times M \text{ matrix}
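To make the step from the primal solution to the dual representation concrete, here is a minimal NumPy sketch that solves the regularized least-squares problem for w and checks numerically that w = Φ^T a with a_n = −(1/λ){w^T φ(x_n) − t_n}. The toy 1-D dataset and the Gaussian basis functions (their centers and width) are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative assumption, not from the slides)
N = 20
x = np.sort(rng.uniform(-1.0, 1.0, size=N))
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# M Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
M = 5
centers = np.linspace(-1.0, 1.0, M)
s = 0.3

def phi(x_scalar):
    """Feature vector phi(x) of length M."""
    return np.exp(-(x_scalar - centers) ** 2 / (2 * s ** 2))

Phi = np.stack([phi(xn) for xn in x])   # design matrix, N x M
lam = 0.1                               # regularization coefficient lambda

# Primal solution: minimize (1/2)||Phi w - t||^2 + (lambda/2) w^T w
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Coefficients a_n = -(1/lambda) * (w^T phi(x_n) - t_n)
a = -(Phi @ w - t) / lam

# The primal solution lies in the span of the feature vectors: w = Phi^T a
print(np.allclose(w, Phi.T @ a))        # True
```

The check confirms that w always lies in the span of the feature vectors φ(x_n), which is exactly what the dual representation exploits.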
• The vector a = (a_1,…,a_N)^T has the definition

  a_n = -\frac{1}{\lambda}\{w^T\phi(x_n) - t_n\}

Transformation from w to a
• Thus we have w = Φ^T a
• Instead of working with the parameter vector w we can reformulate the least-squares algorithm in terms of the parameter vector a, giving rise to the dual representation
• Although the definition a_n = −(1/λ){w^T φ(x_n) − t_n} still includes w, it can be eliminated by the use of the kernel function

Gram Matrix and Kernel Function
• The Gram matrix K = ΦΦ^T is the N × N matrix with elements

  K_{nm} = \phi(x_n)^T\phi(x_m) = k(x_n, x_m)

  where the kernel function is k(x,x′) = φ(x)^T φ(x′)
• Gram matrix definition: given N vectors, it is the matrix of all pairwise inner products

  K = \begin{pmatrix} k(x_1,x_1) & \cdots & k(x_1,x_N) \\ \vdots & & \vdots \\ k(x_N,x_1) & \cdots & k(x_N,x_N) \end{pmatrix}

• Notes:
   • Φ is N × M and K is N × N (an N × M matrix times an M × N matrix)
   • K is a matrix of similarities of pairs of samples (thus it is symmetric)

Error Function in Terms of the Gram Matrix
• The error function is

  J(w) = \frac{1}{2}\sum_{n=1}^{N}\{w^T\phi(x_n) - t_n\}^2 + \frac{\lambda}{2}w^T w

• Substituting w = Φ^T a into J(w) gives

  J(a) = \frac{1}{2}a^T\Phi\Phi^T\Phi\Phi^T a - a^T\Phi\Phi^T t + \frac{1}{2}t^T t + \frac{\lambda}{2}a^T\Phi\Phi^T a

  where t = (t_1,…,t_N)^T
• Written in terms of the Gram matrix, the error function is

  J(a) = \frac{1}{2}a^T K K a - a^T K t + \frac{1}{2}t^T t + \frac{\lambda}{2}a^T K a

• Solving for a by combining w = Φ^T a and a_n = −(1/λ){w^T φ(x_n) − t_n} gives

  a = (K + \lambda I_N)^{-1}t

• The solution for a is expressed entirely in terms of the kernel function k(x,x′), from which we can recover the original formulation in terms of the parameters w

Prediction Function
• Prediction for a new input x
• We can write a = (K + λI_N)^{−1} t by combining w = Φ^T a and a_n = −(1/λ){w^T φ(x_n) − t_n}
• Substituting back into the linear regression model,

  y(x) = w^T\phi(x) = a^T\Phi\,\phi(x) = k(x)^T(K + \lambda I_N)^{-1}t

  where k(x) has elements k_n(x) = k(x_n, x)
• The prediction is a linear combination of the target values from the training set.
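Continuing the same hypothetical toy setup (Gaussian basis functions on 1-D data, an illustrative assumption rather than anything from the slides), the sketch below forms the Gram matrix K = ΦΦ^T, solves a = (K + λI_N)^{-1} t, and confirms that the dual prediction k(x)^T a matches the primal prediction w^T φ(x) at a new input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data and Gaussian basis functions (illustrative assumptions)
N, M, lam, s = 20, 5, 0.1, 0.3
x = np.sort(rng.uniform(-1.0, 1.0, size=N))
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)
centers = np.linspace(-1.0, 1.0, M)

def phi(x_scalar):
    """Feature vector phi(x) of length M."""
    return np.exp(-(x_scalar - centers) ** 2 / (2 * s ** 2))

Phi = np.stack([phi(xn) for xn in x])            # N x M design matrix

# Dual solution: Gram matrix K = Phi Phi^T, then a = (K + lam I_N)^{-1} t
K = Phi @ Phi.T                                  # K_nm = phi(x_n)^T phi(x_m)
a = np.linalg.solve(K + lam * np.eye(N), t)

# Primal solution for comparison: w = (Phi^T Phi + lam I_M)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# Prediction at a new point: y(x) = k(x)^T (K + lam I_N)^{-1} t = k(x)^T a
x_new = 0.25
k_vec = Phi @ phi(x_new)                         # k_n(x) = k(x_n, x)
print(np.isclose(k_vec @ a, w @ phi(x_new)))     # True: dual and primal agree
```

The agreement reflects the matrix identity (Φ^TΦ + λI_M)^{-1}Φ^T = Φ^T(ΦΦ^T + λI_N)^{-1}, which is what makes the M × M (primal) and N × N (dual) solutions interchangeable.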
Advantage of Dual Representation
• The solution for a is expressed entirely in terms of the kernel k(x,x′)
• From a we can recover w using w = Φ^T a
• In the parametric formulation the (maximum likelihood) solution is

  w = (\Phi^T\Phi)^{-1}\Phi^T t

  which requires inverting an M × M matrix, since Φ is N × M
• In the dual formulation the solution is a = (K + λI_N)^{−1} t, which requires inverting an N × N matrix
• This appears to be a disadvantage, since typically N >> M
• The advantage of the dual is that we work with the kernel k(x,x′), so we:
   • avoid working with the feature vector φ(x), and
   • avoid the problems associated with very high or even infinite dimensionality of the feature space

Constructing Kernels
• To exploit kernel substitution we need valid kernel functions
• Two methods:
   • (1) from φ(x) to k(x,x′)
   • (2) from k(x,x′) to φ(x)
• First method: choose φ(x) and use it to find the corresponding kernel

  k(x,x') = \phi(x)^T\phi(x') = \sum_{i=1}^{M}\phi_i(x)\phi_i(x')

  where the φ_i(x) are basis functions
• A one-dimensional input space is shown next

Kernel Functions from Basis Functions
[Figure: for a one-dimensional input space, basis functions φ_i(x) (polynomials, Gaussians, logistic sigmoids) and the corresponding kernel functions k(x,x′) = φ(x)^T φ(x′); the red cross marks the location of x′.]

Method 2: Direct Construction of Kernels
• The kernel must correspond to a scalar product in some (perhaps infinite-dimensional) feature space
• Consider the kernel function k(x,z) = (x^T z)^2; in a two-dimensional input space

  k(x,z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
         = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2
         = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\,(z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)^T
         = \phi(x)^T\phi(z)

• The feature mapping therefore takes the form φ(x) = (x_1^2, √2 x_1x_2, x_2^2)^T and comprises all second-order terms with a specific weighting
• Computing the inner product in feature space requires 6 feature values and 9 multiplications; the kernel function k(x,z) requires only 2 multiplications and a squaring
• For (x^T z + c)^2 we get constant, linear, and second-order terms
• For (x^T z + c)^M we get all monomials (powers of x) up to order M

Testing Whether a Function Is a Valid Kernel
• Without constructing φ(x), the necessary and sufficient condition for k(x,x′) to be a valid kernel is that the Gram matrix K, with elements k(x_n,x_m), is positive semi-definite for all possible choices of the set {x_n}
• Positive semi-definite is not the same thing as a matrix whose elements are non-negative; it means z^T K z ≥ 0 for all real vectors z, i.e.

  \sum_n\sum_m K_{nm}\,z_n z_m \ge 0 \quad \text{for any real numbers } z_n, z_m

• Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x,y) can be expressed as a dot product in a high-dimensional space
• New kernels can be constructed from simpler kernels as building blocks

Techniques for Constructing Kernels
• Given valid kernels k_1(x,x′) and k_2(x,x′), the following new kernels will also be valid (a numerical check of two of these constructions follows this list):
1. k(x,x′) = c·k_1(x,x′), where c > 0 is a constant
2. k(x,x′) = f(x)·k_1(x,x′)·f(x′), where f(·) is any function
3. k(x,x′) = q(k_1(x,x′)), where q(·) is a polynomial with non-negative coefficients
4. k(x,x′) = exp(k_1(x,x′))
5. k(x,x′) = k_1(x,x′) + k_2(x,x′)
6. k(x,x′) = k_1(x,x′)·k_2(x,x′)
7. k(x,x′) = k_3(f(x), f(x′)), where f(x) is a function from x to R^M and k_3 is a valid kernel in R^M
8. …
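As a numerical illustration of both construction routes above, the sketch below checks that the kernel k(x,z) = (x^T z)^2 equals the inner product of the explicit feature maps φ(x) = (x_1^2, √2 x_1x_2, x_2^2), and that a kernel assembled from rules 4 to 6, exp(k_1) + k_1^2 with k_1 the linear kernel, yields a positive semi-definite Gram matrix on a random point set. The random points and the particular kernel combination are illustrative assumptions, not examples from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    """Explicit feature map for k(x, z) = (x^T z)^2 in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose((x @ z) ** 2, phi(x) @ phi(z)))   # True: kernel = inner product of features

# A kernel assembled from the construction rules: exp(k1) + k1^2, with k1 the linear kernel
def k(u, v):
    k1 = u @ v
    return np.exp(k1) + k1 ** 2

# Its Gram matrix on any point set should be (numerically) positive semi-definite
X = rng.standard_normal((30, 2))
K = np.array([[k(u, v) for v in X] for u in X])
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-9)                      # True up to round-off
```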