On Kernel Methods for Relational Learning

Total Pages: 16

File Type: pdf, Size: 1020 KB

On Kernel Methods for Relational Learning

Chad Cumby [email protected]
Dan Roth [email protected]
Department of Computer Science, University of Illinois, Urbana, IL 61801 USA

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

Kernel methods have gained a great deal of popularity in the machine learning community as a method to learn indirectly in high-dimensional feature spaces. Those interested in relational learning have recently begun to cast learning from structured and relational data in terms of kernel operations.

We describe a general family of kernel functions built up from a description language of limited expressivity and use it to study the benefits and drawbacks of kernel learning in relational domains. Learning with kernels in this family directly models learning over an expanded feature space constructed using the same description language. This allows us to examine issues of time complexity in terms of learning with these and other relational kernels, and how these relate to generalization ability. The tradeoff between using kernels in a very high dimensional implicit space versus a restricted feature space is highlighted through two experiments, in bioinformatics and in natural language processing.

1. Introduction

Recently, much interest has been generated in the machine learning community on the subject of learning from relational and structured data via propositional learners (Kramer et al., 2001; Cumby & Roth, 2003a). Examples of relational learning problems include learning to identify functional phrases and named entities from structured parse trees in natural language processing (NLP), learning to classify molecules for mutagenicity from atom-bond data in drug design, and learning a policy to map goals to actions in planning domains. At the same time, work on SVMs and Perceptron-type algorithms has generated interest in kernel methods to simulate learning in high-dimensional feature spaces while working with the original low-dimensional input data.

Haussler's work on convolution kernels (Haussler, 1999) introduced the idea that kernels could be built to work with discrete data structures iteratively from kernels for smaller composite parts. These kernels followed the form of a generalized sum over products - a generalized convolution. Kernels were shown for several discrete datatypes including strings and rooted trees, and more recently (Collins & Duffy, 2002) developed kernels for datatypes useful in many NLP tasks, demonstrating their usefulness with the Voted Perceptron algorithm (Freund & Schapire, 1998).

While these past examples of relational kernels were formulated separately to meet each problem at hand, we seek to develop a flexible mechanism for building kernel functions for many structured learning problems, based on a unified knowledge representation. At the heart of our approach is a definition of a relational kernel that is specified in a "syntax-driven" manner through the use of a description language. (Cumby & Roth, 2002) introduced a feature description language and showed how to use propositional classifiers to successfully learn over structured data and produce a relational representation, in the sense that different data instantiations yield the same features and have the same weights in the linear classifier learned. There, as in (Roth & Yih, 2001), this was done by significantly blowing up the relational feature space.

Building on the abovementioned description-language-based approach, this paper develops a corresponding family of parameterized kernel functions for structured data. In conjunction with an SVM or a Perceptron-like learning algorithm, our parameterized kernels can simulate the exact features generated in the blown-up space to learn a classifier, directly from the original structured data.
From among several ways to define the distance between structured domain elements, we follow (Khardon et al., 2001) in choosing a definition that provides exactly the same classifiers Perceptron would produce had we run it on the blown-up discrete feature space, rather than directly on the structured data. The parameterized kernel allows us to flexibly define features over structures (or, equivalently, a metric over structures). At the same time, it allows us to choose the degree to which we want to restrict the size of the expanded space, which affects the degree of efficiency gained or lost, as well as the generalization performance of the classifier, as a result of using it.

Along these lines, we then study time complexity and generalization tradeoffs between working in the expanded feature space and using the corresponding kernels, and between different kernels, in terms of the expansions they correspond to. We show that, while kernel methods provide an interesting and often more comprehensible way to view the feature space, computationally it is often not recommended to use kernels over structured data. This is especially clear when the number of examples is fairly large relative to the data dimensionality.

The remainder of the paper is organized as follows. We first introduce the use of kernel functions in linear classification and the kernel Perceptron algorithm (Khardon et al., 2001). Sec. 3 presents our approach to relational features, which form the higher-dimensional space we seek to simulate. Sec. 4 introduces our kernel function for relational data. Sec. 5 discusses complexity tradeoffs surrounding the kernel Perceptron and the standard feature-based Perceptron, and the issue of generalization. In Sec. 6 we validate our claims by applying the kernel Perceptron algorithm with our enhanced kernel to two problems where a structured feature space is essential.

2. Kernels & Kernel Perceptron

Most machine learning algorithms make use of feature-based representations. A domain element is transformed into a collection of Boolean features, and a labeled example is given by <x, l> ∈ {0,1}^n × {-1,1}. In recent years, it has become very common, especially in large-scale applications in NLP and computer vision, to use learning algorithms such as variations of Perceptron and Winnow (Novikoff, 1963; Littlestone, 1988) that use representations that are linear over their feature space (Roth, 1998). In these cases, working with features that directly represent domain elements may not be expressive enough, and there are standard ways of enhancing the capabilities of such algorithms. A typical way which has been used in the NLP domain (Roth, 1998; Roth, 1999) is to expand the set of basic features x_1, ..., x_n using feature functions χ_i : {x_1, ..., x_n} → {0,1}, i ∈ I. These feature functions could be, for example, conjunctions such as x_1 x_3 x_4 (that is, the feature is evaluated to 1 when the conjunction is satisfied on the example, and to 0 otherwise). Once this expansion is done, one can use these expanded higher-dimensional examples, in which each feature function plays the role of a basic feature, for learning. This approach clearly leads to an increase in expressiveness and thus may improve performance. However, it also dramatically increases the number of features (from n to |I|; e.g., to 3^n if all conjunctions are used, or O(n^k) if conjunctions of size k are used) and thus may adversely affect both the computation time and convergence rate of learning.

Perceptron is a well-known on-line learning algorithm that makes use of the aforementioned feature-based representation of examples. Throughout its execution Perceptron maintains a weight vector w ∈ R^n which is initially (0, ..., 0). Upon receiving an example x ∈ {0,1}^n, the algorithm predicts according to the linear threshold function w · x ≥ 0. If the prediction is 1 and the label is -1 then the vector w is set to w - x, while if the prediction is -1 and the label is 1 then w is set to w + x. No change is made if the prediction is correct.
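To make the update rule above concrete, here is a minimal sketch of this Perceptron variant (not taken from the paper; the function name, the epochs parameter, and the toy data are illustrative assumptions), for examples in {0,1}^n with labels in {-1,+1}:

```python
import numpy as np

def perceptron(examples, labels, epochs=1):
    """Perceptron as described above: predict 1 iff w . x >= 0; on a mistake,
    move w toward the missed label (w + x for a missed positive, w - x for a
    false positive), i.e. w <- w + label * x."""
    w = np.zeros(examples.shape[1])   # w is initially (0, ..., 0)
    mistakes = []                     # the set M of examples on which mistakes were made
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            prediction = 1 if w @ x >= 0 else -1
            if prediction != y:
                w = w + y * x
                mistakes.append((x, y))
    return w, mistakes

# Toy run: four points in {0,1}^3 whose label is +1 exactly when the first bit is set.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, -1, -1])
w, M = perceptron(X, y, epochs=5)
```

The returned mistake set M is exactly what the next paragraph turns into the kernel form of the algorithm.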
It is easily observed, and well known, that the hypothesis w of the Perceptron is a sum of the previous examples on which prediction mistakes were made. Let L(x) ∈ {-1, 1} denote the label of example x; then w = Σ_{v∈M} L(v) v, where M is the set of examples on which the algorithm made a mistake. Thus the prediction of Perceptron on x is 1 iff w · x = (Σ_{v∈M} L(v) v) · x = Σ_{v∈M} L(v) (v · x) ≥ 0.

For an example x ∈ {0,1}^n let φ(x) denote its transformation into an enhanced feature space, φ(x) = {χ_i(x)}_{i∈I}. To run the Perceptron algorithm over the enhanced feature space we must predict 1 iff w_φ · φ(x) ≥ 0, where w_φ is the weight vector in the enhanced space; from the above discussion this holds iff Σ_{v∈M} L(v) (φ(v) · φ(x)) ≥ 0. Denoting

K(v, x) = φ(v) · φ(x)    (1)

this holds iff Σ_{v∈M} L(v) K(v, x) ≥ 0. We call this version of Perceptron, which uses a kernel function to expand the feature space and thus enhance the learning abilities of Perceptron, a kernel Perceptron (Khardon et al., 2001).

We have shown a bottom-up construction of the "kernel trick". This way, we never need to construct the enhanced feature space explicitly; we need only be able to compute the kernel function K(v, x) efficiently. This trick can be applied to any algorithm whose prediction is a function of inner products of examples; see e.g. (Cristianini & Shawe-Taylor, 2000) for a discussion. In principle, one can define the kernel K(x, v) in different ways, as long as it satisfies some conditions. In this paper we will concentrate on a definition that follows ...

[Figure: a concept graph over nodes N1-N7, with phrase(NP) and phrase(VP) nodes linked by phr-before and phr-after edges, and further nodes connected by contains, first, and before edges.]

... information about two individuals. The description logic, as defined below, provides a framework for representing the same semantic individuals as are represented by nodes in a concept graph.
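As an aside (not from the paper), the kernel Perceptron prediction rule of Eq. (1) is easy to state in code: keep the mistake set M and evaluate Σ_{v∈M} L(v) K(v, x) directly. The sketch below assumes Boolean examples and uses the kernel K(v, x) = 2^(v·x), which equals φ(v) · φ(x) when φ(x) lists all monotone conjunctions satisfied by x (a Boolean kernel of the kind studied in Khardon et al., 2001); it is only an illustration, not the relational kernel this paper goes on to define.

```python
import numpy as np

def k_monotone_conjunctions(v, x):
    """Illustrative Boolean kernel: 2^(v . x) counts the monotone conjunctions
    (including the empty one) satisfied by both v and x, i.e. it equals
    phi(v) . phi(x) for the all-monotone-conjunctions expansion, computed
    without ever materializing the 2^n-dimensional phi."""
    return 2.0 ** float(v @ x)

def kernel_perceptron_predict(x, mistakes, kernel):
    """Predict 1 iff the sum over stored mistakes (v, L(v)) of L(v) * K(v, x) >= 0."""
    score = sum(label * kernel(v, x) for v, label in mistakes)
    return 1 if score >= 0 else -1

def kernel_perceptron(examples, labels, epochs=1, kernel=k_monotone_conjunctions):
    """Kernel Perceptron: instead of a weight vector in the enhanced space,
    store the mistake examples together with their labels (the set M above)."""
    mistakes = []
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            if kernel_perceptron_predict(x, mistakes, kernel) != y:
                mistakes.append((x, y))
    return mistakes
```

Plugging in the linear kernel K(v, x) = v · x recovers the ordinary Perceptron sketched earlier, while swapping in a different kernel changes only the implicit feature space; this is the flexibility the paper exploits for structured, relational data.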
Recommended publications
  • Designing Algorithms for Machine Learning and Data Mining
    Designing Algorithms for Machine Learning and Data Mining. Antoine Cornuéjols and Christel Vrain. Abstract: Designing Machine Learning algorithms requires answering three main questions: First, what is the space H of hypotheses or models of the data that the algorithm considers? Second, what is the inductive criterion used to assess the merit of a hypothesis given the data? Third, given the space H and the inductive criterion, how is the exploration of H carried out in order to find as good a hypothesis as possible? Any learning algorithm can be analyzed along these three questions. This chapter focuses primarily on unsupervised learning, on one hand, and supervised learning, on the other hand. For each, the foremost problems are described as well as the main existing approaches. In particular, the interplay between the structure that can be endowed over the hypothesis space and the optimisation techniques that can in consequence be used is underlined. We cover especially the major existing methods for clustering: prototype-based, generative-based, density-based, spectral-based, hierarchical, and conceptual, and visit the validation techniques available. For supervised learning, the generative and discriminative approaches are contrasted and a wide variety of linear methods, in which we include the Support Vector Machines and Boosting, are presented. Multi-layer neural networks and deep learning methods are discussed. Some additional methods are illustrated, and we describe other learning problems including semi-supervised learning, active learning, online learning, transfer learning, learning to rank, learning recommendations, and identifying causal relationships. We conclude this survey by suggesting new directions for research. 1 Introduction: Machine Learning is the science of, on one hand, discovering the fundamental laws that govern the act of learning and, on the other hand, designing machines that learn from experiences, in the same way as physics is both the science of uncovering the …
  • Large-Scale Kernel RankSVM
    Large-scale Kernel RankSVM. Tzu-Ming Kuo, Ching-Pei Lee, Chih-Jen Lin. Abstract: Learning to rank is an important task for recommendation systems, online advertisement and web search. Among those learning-to-rank methods, rankSVM is a widely used model. Both linear and nonlinear (kernel) rankSVM have been extensively studied, but the lengthy training time of kernel rankSVM remains a challenging issue. In this paper, after discussing difficulties of training kernel rankSVM, we propose an efficient method to handle these problems. The idea is to reduce the number of variables from quadratic to linear with respect to the number of training instances, and to efficiently evaluate the pairwise losses. Our setting is applicable to a variety of loss functions. Further, general optimization methods can be easily applied to solve the reformulated problem. Implementation issues are also carefully considered. … proposed [11, 23, 5, 1, 14]. However, for some tasks where the feature set is not rich enough, nonlinear methods may be needed. Therefore, it is important to develop efficient training methods for large kernel rankSVM. Assume we are given a set of training label-query-instance tuples (y_i, q_i, x_i), y_i ∈ R, q_i ∈ S ⊂ Z, x_i ∈ R^n, i = 1, ..., l, where S is the set of queries. By defining the set of preference pairs as P ≡ {(i, j) | q_i = q_j, y_i > y_j} with p ≡ |P| (1.1), rankSVM [10] solves min_{w,ξ} (1/2) w^T w + C Σ_{(i,j)∈P} ξ_{i,j} subject to w^T (φ(x_i) − φ(x_j)) ≥ 1 − ξ_{i,j} (1.2), …
  • A Kernel Method for Multi-Labelled Classification
    A kernel method for multi-labelled classification. André Elisseeff and Jason Weston, BIOwulf Technologies, 305 Broadway, New York, NY 10007, {andre,jason}@barhilltechnologies.com. Abstract: This article presents a Support Vector Machine (SVM) like learning system to handle multi-label problems. Such problems are usually decomposed into many two-class problems but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach. It is based on a large margin ranking system that shares a lot of common properties with SVMs. We tested it on a Yeast gene functional classification problem with positive results. 1 Introduction: Many problems in Text Mining or Bioinformatics are multi-labelled. That is, each point in a learning set is associated to a set of labels. Consider for instance the classification task of determining the subjects of a document, or of relating one protein to its many effects on a cell. In either case, the learning task would be to output a set of labels whose size is not known in advance: one document can for instance be about food, meat and finance, whereas another one would concern only food and fat. Two-class and multi-class classification or ordinal regression problems can all be cast into multi-label ones. This makes the latter quite attractive, but at the same time it gives a warning: their generality hides the difficulty of solving them. The number of publications is not going to contradict this statement: we are aware of only a few works about the subject [4, 5, 7] and they all concern text mining applications.
  • Mistake Bounds for Binary Matrix Completion
    Mistake Bounds for Binary Matrix Completion. Mark Herbster and Stephen Pasteris, University College London, Department of Computer Science, London WC1E 6BT, UK, [email protected], [email protected]; Massimiliano Pontil, Istituto Italiano di Tecnologia, 16163 Genoa, Italy, and University College London, Department of Computer Science, London WC1E 6BT, UK, [email protected]. Abstract: We study the problem of completing a binary matrix in an online learning setting. On each trial we predict a matrix entry and then receive the true entry. We propose a Matrix Exponentiated Gradient algorithm [1] to solve this problem. We provide a mistake bound for the algorithm, which scales with the margin complexity [2, 3] of the underlying matrix. The bound suggests an interpretation where each row of the matrix is a prediction task over a finite set of objects, the columns. Using this we show that the algorithm makes a number of mistakes which is comparable up to a logarithmic factor to the number of mistakes made by the Kernel Perceptron with an optimal kernel in hindsight. We discuss applications of the algorithm to predicting as well as the best biclustering and to the problem of predicting the labeling of a graph without knowing the graph in advance. 1 Introduction: We consider the problem of predicting online the entries in an m × n binary matrix U. We formulate this as the following game: nature queries an entry (i_1, j_1); the learner predicts ŷ_1 ∈ {−1, 1} as the matrix entry; nature presents a label y_1 = U_{i_1, j_1}; nature queries the entry (i_2, j_2); the learner predicts ŷ_2; and so forth.
  • Density Based Data Clustering
    California State University, San Bernardino, CSUSB ScholarWorks, Electronic Theses, Projects, and Dissertations, Office of Graduate Studies, 3-2015. Density Based Data Clustering. Rayan Albarakati, California State University - San Bernardino. Follow this and additional works at: https://scholarworks.lib.csusb.edu/etd. Part of the Other Computer Engineering Commons. Recommended Citation: Albarakati, Rayan, "Density Based Data Clustering" (2015). Electronic Theses, Projects, and Dissertations. 134. https://scholarworks.lib.csusb.edu/etd/134. This Project is brought to you for free and open access by the Office of Graduate Studies at CSUSB ScholarWorks. It has been accepted for inclusion in Electronic Theses, Projects, and Dissertations by an authorized administrator of CSUSB ScholarWorks. For more information, please contact [email protected]. DENSITY BASED DATA CLUSTERING: A Project Presented to the Faculty of California State University, San Bernardino, in Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science, by Rayan Albarakati, March 2015. Approved by: Haiyan Qiao, Advisor, School of Computer Science and Engineering; Owen J. Murphy; Krestin Voigt. © 2015 Rayan Albarakati. ABSTRACT: Data clustering is a data analysis technique that groups data based on a measure of similarity.
  • A User's Guide to Support Vector Machines
    A User's Guide to Support Vector Machines. Asa Ben-Hur (Department of Computer Science, Colorado State University), Jason Weston (NEC Labs America, Princeton, NJ 08540 USA). Abstract: The Support Vector Machine (SVM) is a widely used classifier. And yet, obtaining the best results with SVMs requires an understanding of their workings and the various ways a user can influence their accuracy. We provide the user with a basic understanding of the theory behind SVMs and focus on their use in practice. We describe the effect of the SVM parameters on the resulting classifier, how to select good values for those parameters, data normalization, factors that affect training time, and software for training SVMs. 1 Introduction: The Support Vector Machine (SVM) is a state-of-the-art classification method introduced in 1992 by Boser, Guyon, and Vapnik [1]. The SVM classifier is widely used in bioinformatics (and other disciplines) due to its high accuracy, ability to deal with high-dimensional data such as gene expression, and flexibility in modeling diverse sources of data [2]. SVMs belong to the general category of kernel methods [4, 5]. A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: First, the ability to generate non-linear decision boundaries using methods designed for linear classifiers. Second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation.
  • The Forgetron: a Kernel-Based Perceptron on a Fixed Budget
    The Forgetron: A Kernel-Based Perceptron on a Fixed Budget. Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer. School of Computer Science & Engineering, The Hebrew University, Jerusalem 91904, Israel, {oferd,shais,singer}@cs.huji.ac.il. Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classification problems. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difficulty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we describe and analyze a new infrastructure for kernel-based learning with the Perceptron while adhering to a strict limit on the number of examples that can be stored. We first describe a template algorithm, called the Forgetron, for online learning on a fixed budget. We then provide specific algorithms and derive a unified mistake bound for all of them. To our knowledge, this is the first online learning paradigm which, on one hand, maintains a strict limit on the number of examples it can store and, on the other hand, entertains a relative mistake bound. We also present experiments with real datasets which underscore the merits of our approach. 1 Introduction: The introduction of the Support Vector Machine (SVM) [7] sparked a widespread interest in kernel methods as a means of solving (binary) classification problems. Although SVM was initially stated as a batch-learning technique, it significantly influenced the development of kernel methods in the online-learning setting. Online classification algorithms that can incorporate kernels include the Perceptron [6], ROMMA [5], ALMA [3], NORMA [4] and the Passive-Aggressive family of algorithms [1].
  • Kernel Methods Through the Roof: Handling Billions of Points Efficiently
    Kernel methods through the roof: handling billions of points efficiently. Giacomo Meanti, Luigi Carratino (MaLGa, DIBRIS, Università degli Studi di Genova, [email protected], [email protected]); Lorenzo Rosasco (MaLGa, DIBRIS, IIT & MIT, Università degli Studi di Genova, [email protected]); Alessandro Rudi (INRIA - École Normale Supérieure, PSL Research University, [email protected]). Abstract: Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware. Towards this end, we designed a preconditioned gradient solver for kernel methods exploiting both GPU acceleration and parallelization with multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization. Further, we optimize the numerical precision of different operations and maximize efficiency of matrix-vector multiplications. As a result we can experimentally show dramatic speedups on datasets with billions of points, while still guaranteeing state of the art performance. Additionally, we make our software available as an easy to use library. 1 Introduction: Kernel methods provide non-linear/non-parametric extensions of many classical linear models in machine learning and statistics [45, 49]. The data are embedded via a non-linear map into a high dimensional feature space, so that linear models in such a space effectively define non-linear models in the original space.
  • Kernel Methods and Factorization for Image and Video Analysis
    KERNEL METHODS AND FACTORIZATION FOR IMAGE AND VIDEO ANALYSIS. Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science (by Research) in Computer Science by Ranjeeth Dasineni, 200507017, ranjith [email protected]. International Institute of Information Technology, Hyderabad, India, November 2007. CERTIFICATE: It is certified that the work contained in this thesis, titled "Kernel Methods and Factorization for Image and Video Analysis" by Ranjeeth Dasineni, has been carried out under my supervision and is not submitted elsewhere for a degree. Date. Advisor: Dr. C. V. Jawahar. Copyright © Ranjeeth Dasineni, 2007. All Rights Reserved. To IIIT Hyderabad, where I learnt all that I know of Computers and much of what I know of Life. Acknowledgments: I would like to thank Dr. C. V. Jawahar for introducing me to the fields of computer vision and machine learning. I gratefully acknowledge the motivation, technical and philosophical guidance that he has given me throughout my undergraduate and graduate education. His knowledge and valuable suggestions provided the foundation for the work presented in this thesis. I thank Dr. P. J. Narayanan for providing an excellent research environment at CVIT, IIIT Hyderabad. His advice on research methodology helped me in facing the challenges of research. I am grateful to the GE Foundation and CVIT for funding my graduate education at IIIT Hyderabad. I am also grateful to fellow lab mates at CVIT and IIIT Hyderabad for their stimulating company during the past years. While working on this thesis, a few researchers across the world lent their valuable time to validate my work, check the correctness of my implementations, and provide critical suggestions.
  • An Algorithmic Introduction to Clustering
    AN ALGORITHMIC INTRODUCTION TO CLUSTERING. A Preprint. Bernardo Gonzalez, Department of Computer Science and Engineering, UCSC, [email protected]. June 11, 2020. The purpose of this document is to provide an easy introductory guide to clustering algorithms. Basic knowledge and exposure to probability (random variables, conditional probability, Bayes' theorem, independence, Gaussian distribution), matrix calculus (matrix and vector derivatives), linear algebra and analysis of algorithms (specifically, time complexity analysis) is assumed. Starting with Gaussian Mixture Models (GMM), this guide will visit different algorithms like the well-known k-means, DBSCAN and Spectral Clustering (SC) algorithms, in a connected and hopefully understandable way for the reader. The first three sections (Introduction, GMM and k-means) are based on [1]. The fourth section (SC) is based on [2] and [5]. The fifth section (DBSCAN) is based on [6]. The sixth section (Mean Shift) is, to the best of the author's knowledge, original work. The seventh section is dedicated to conclusions and future work. Traditionally, these five algorithms are considered completely unrelated and they are considered members of different families of clustering algorithms: • Model-based algorithms: the data are viewed as coming from a mixture of probability distributions, each of which represents a different cluster; a Gaussian Mixture Model is considered a member of this family. • Centroid-based algorithms: any data point in a cluster is represented by the central vector of that cluster, which need not be a part of the dataset taken; k-means is considered a member of this family. • Graph-based algorithms: the data is represented using a graph, and the clustering procedure leverages graph theory tools to create a partition of this graph.
  • COMS 4771 Perceptron and Kernelization
    COMS 4771: Perceptron and Kernelization. Nakul Verma. Last time: generative vs. discriminative classifiers; nearest neighbor (NN) classification; optimality of k-NN; coping with drawbacks of k-NN; decision trees; the notion of overfitting in machine learning. This lecture: knowing the decision boundary is enough for classification. For binary classification with y ∈ {-1, +1}, a linear classifier f predicts +1 if g(x) ≥ 0 and -1 if g(x) < 0, where g is a linear decision boundary; the bias term w_0 can be absorbed by a homogeneous "lifting" of the input. The linear threshold unit Σ_i w_i x_i + w_0, combined with a non-linear threshold or sigmoid, is the basic computational unit of a neuron, and such units can be combined into an artificial neural network, which can approximate any smooth function. Given labeled training data, learning the weights that minimize the training error cannot be done with the standard technique of taking derivatives and examining the stationary points, and is in fact NP-hard to solve or even to approximate. Under relaxed assumptions, however, if the training data S is linearly separable with margin γ (the distance of the closest point to the boundary), then finding a weight vector satisfying the constraints for all i is feasible: with d+1 variables and |S| constraints, it can be solved efficiently via a (constraint) optimization program.
  • Layer-Constrained Variational Autoencoding Kernel Density Estimation Model for Anomaly Detection (Knowledge-Based Systems)
    Layer-constrained variational autoencoding kernel density estimation model for anomaly detection. Knowledge-Based Systems 196 (2020) 105753. Peng Lv, Yanwei Yu, Yangyang Fan, Xianfeng Tang, Xiangrong Tong. Affiliations: Department of Computer Science and Technology, Ocean University of China, Qingdao, Shandong 266100, China; School of Computer and Control Engineering, Yantai University, Yantai, Shandong 264005, China; Shanghai Key Lab of Advanced High-Temperature Materials and Precision Forming, Shanghai Jiao Tong University, Shanghai 200240, China; College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA 16802, USA. Article history: received 16 October 2019; received in revised form 5 March 2020; accepted 7 March 2020; available online 10 March 2020. Keywords: anomaly detection, variational autoencoder, kernel density estimation. Abstract: Unsupervised techniques typically rely on the probability density distribution of the data to detect anomalies, where objects with low probability density are considered to be abnormal. However, modeling the density distribution of high dimensional data is known to be hard, making the problem of detecting anomalies from high-dimensional data challenging. The state-of-the-art methods solve this problem by first applying dimension reduction techniques to the data and then detecting anomalies in the low dimensional space. Unfortunately, the low dimensional space does not necessarily preserve the density distribution of the original high dimensional data. This jeopardizes the effectiveness of anomaly detection. In this work, we propose a novel high dimensional anomaly detection method called LAKE.