Efficient Global String Kernel with Random Features: Beyond Counting Substructures


Lingfei Wu∗ (IBM Research), Ian En-Hsu Yen (Carnegie Mellon University), Siyu Huo (IBM Research), Liang Zhao (George Mason University), Kun Xu (IBM Research), Liang Ma (IBM Research), Shouling Ji† (Zhejiang University), Charu Aggarwal (IBM Research)

∗ Corresponding author.
† Shouling Ji is also with Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies.

ABSTRACT

Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the string, which hardly capture long discriminative patterns, (ii) sum over too many substructures, such as all possible subsequences, which leads to diagonal dominance of the kernel matrix, or (iii) rely on non-positive-definite similarity measures derived from the edit distance. Furthermore, while there have been works addressing the computational challenge with respect to the length of the string, most of them still experience quadratic complexity in terms of the number of training samples when used in a kernel-based classifier. In this paper, we present a new class of global string kernels that aims to (i) discover global properties hidden in the strings through global alignments, (ii) maintain positive-definiteness of the kernel without introducing a diagonally dominant kernel matrix, and (iii) have a training cost linear with respect to not only the length of the string but also the number of training string samples. To this end, the proposed kernels are explicitly defined through a series of different random feature maps, each corresponding to a distribution of random strings. We show that kernels defined this way are always positive-definite, and exhibit computational benefits as they always produce Random String Embeddings (RSE) that can be directly used in any linear classification model. Our extensive experiments on nine benchmark datasets corroborate that RSE achieves better or comparable accuracy in comparison to state-of-the-art baselines, especially with strings of longer lengths. In addition, we empirically show that RSE scales linearly with the number and the length of the strings.

CCS CONCEPTS
• Computing methodologies → Kernel methods.

KEYWORDS
String Kernel, String Embedding, Random Features

ACM Reference Format:
Lingfei Wu, Ian En-Hsu Yen, Siyu Huo, Liang Zhao, Kun Xu, Liang Ma, Shouling Ji, and Charu Aggarwal. 2019. Efficient Global String Kernel with Random Features: Beyond Counting Substructures. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330923

KDD ’19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6201-6/19/08.

1 INTRODUCTION

String classification is a core learning task and has drawn considerable interest in many applications such as computational biology [20, 21], text categorization [26, 44], and music classification [9]. One of the key challenges in string data lies in the fact that there are no explicit features in sequences. A kernel function corresponding to a high-dimensional feature space has been proven to be an effective method for sequence classification [24, 47].

Over the last two decades, a number of string kernel methods [7, 19, 21, 22, 24, 36] have been proposed, among which the k-spectrum kernel [21], the (k, m)-mismatch kernel, and their fruitful variants [22–24] have gained much popularity due to their strong empirical performance. These kernels decompose the original strings into substructures, i.e., a short k-length subsequence called a k-mer, and then count the occurrences of k-mers (with up to m mismatches) in the original sequence to define a feature map and its associated string kernels. However, these methods only consider the local properties of the short substructures in the strings, failing to capture the global properties highly related to some discriminative features of strings, i.e., relatively long subsequences.

When considering larger k and m, the size of the feature map grows exponentially, leading to a serious diagonal dominance problem due to the high-dimensional sparse feature vectors [12, 40]. More importantly, the high computational cost of computing the kernel matrix renders these methods applicable only to small values of k and m and to small data sizes. Recently, a thread of research has made valid attempts to improve the computation of each entry of the kernel matrix [9, 20]. However, these new techniques only solve the scalability issue in terms of the length of the strings and the size of the alphabet, not the kernel matrix construction, which still has quadratic complexity in the number of strings. In addition, these approximation methods still inherit the issues of these "local" kernels, ignoring the global structures of the strings, especially those of long lengths.

Another family of research [6, 11, 14, 29, 32, 38, 39] utilizes a distance function to compute the similarity between a pair of strings through a global or local alignment measure [28, 35]. These string alignment kernels are defined resorting to the learning methodology of R-convolution [15], a framework for computing kernels between discrete objects. The key idea is to recursively decompose structured objects into substructures and compute their global/local alignments to derive a feature map. However, the common issue these string alignment kernels have to address is how to preserve the property of being a valid positive-definite (p.d.) kernel [33]. Interestingly, both approaches [11, 32] proposed to sum up all possible alignments to yield a p.d. kernel, which unfortunately suffers from the diagonal dominance problem, leading to bad generalization capability. Therefore, some treatments have to be applied to repair these issues, e.g., taking the logarithm of the diagonal, which in turn breaks the positive definiteness. Another important limitation of these approaches is their high computation cost, with quadratic complexity in terms of both the number and the length of the strings.

In this paper, we present a new family of string kernels that aims to: (i) discover global properties hidden in the strings through global alignments, (ii) maintain positive-definiteness of the kernel without introducing a diagonally dominant kernel matrix, and (iii) have a training cost linear with respect to not only the length of the string but also the number of training string samples. To this end, our proposed global string kernels take into account the global properties of strings through a global-alignment-based edit distance such as the Levenshtein distance [48]. In addition, the proposed kernels are explicitly defined through a feature embedding given by a distribution of random strings.

2 EXISTING STRING KERNELS AND CONVENTIONAL RANDOM FEATURES

In this section, we first introduce existing string kernels and several important issues that impair their effectiveness and efficiency. We next discuss the conventional Random Features for scaling up large-scale kernel machines and further illustrate why the conventional Random Features cannot be directly applied to existing string kernels.

2.1 Existing String Kernels

We discuss existing approaches to defining string kernels, as well as three issues that have long haunted them: (i) diagonal dominance; (ii) non-positive definiteness; (iii) scalability for large-scale string kernels.

2.1.1 String Kernel by Counting Substructures.

We consider a family of string kernels most commonly used in the literature, where the kernel k(x, y) between two strings x, y ∈ X is computed by counting the number of shared substructures between x and y. Let S denote the set of indices of a particular substructure in x (e.g., a subsequence, substring, or single character), and let S(x) be the set of all possible such sets of indices. Furthermore, let U be all possible values of such a substructure. Then a family of string kernels can be defined as

    k(x, y) := Σ_{u ∈ U} φ_u(x) φ_u(y),  where  φ_u(x) = Σ_{S ∈ S(x)} 1_u(x[S]) γ(S),    (1)

where 1_u(x[S]) indicates whether the substructure x[S] takes the value u, so that φ_u(x) counts the occurrences of u in x, each weighted by γ(S), which reduces the count according to properties of S such as its length. For example, in a vanilla text kernel, S denotes word positions in a document x and U denotes the vocabulary set (with γ(S) = 1). To take string structure into consideration, the gappy n-gram [26] considers S(x) as the set of all possible subsequences of length k in a string x, with γ(S) = exp(−ℓ(S)) a weight that decays exponentially in the length of S, penalizing subsequences with a large number of insertions and deletions. While the number of possible subsequences in a string is exponential in the string length, there exist dynamic-programming-based algorithms …
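The k-mer counting construction described above (the k-spectrum kernel) is easy to make concrete. The sketch below is a minimal, unoptimized illustration of the idea, not the implementation from any of the cited papers:

```python
from collections import Counter

def spectrum_features(s, k):
    """Count occurrences of each k-mer (contiguous length-k substring) in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(x, y, k):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Counter returns 0 for missing k-mers, so only shared k-mers contribute.
    return sum(count * fy[kmer] for kmer, count in fx.items())

print(spectrum_kernel("abab", "ab", 2))   # "abab" contains "ab" twice
```

Note how this construction behaves as k grows: most pairs of distinct strings share no k-mers at all while k(x, x) stays large, which is precisely the diagonal-dominance effect the paper criticizes.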
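Equation (1) can also be instantiated directly. The brute-force sketch below enumerates every index set S of size k for the gappy n-gram feature map, with γ(S) = exp(−ℓ(S)) and ℓ(S) taken as the span of the index set. The cost is exponential in the string length, so this is for toy strings only; the dynamic-programming algorithms mentioned in the text exist to avoid exactly this blow-up. The function names and the exact choice of ℓ(S) are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations
from math import exp

def gappy_features(x, k):
    """Brute-force phi_u(x): sum over index sets S with x[S] == u of
    gamma(S) = exp(-span(S)). Exponential in len(x); toy strings only."""
    phi = defaultdict(float)
    for S in combinations(range(len(x)), k):
        u = "".join(x[i] for i in S)
        span = S[-1] - S[0] + 1  # l(S): how spread out the subsequence is
        phi[u] += exp(-span)
    return phi

def gappy_kernel(x, y, k):
    """Inner product of the gappy n-gram feature maps of x and y."""
    fx, fy = gappy_features(x, k), gappy_features(y, k)
    return sum(v * fy[u] for u, v in fx.items())
```

For example, `gappy_kernel("ab", "ab", 2)` is exp(−2)·exp(−2) = exp(−4), while the gapped occurrence of "ab" inside "aXb" has span 3 and is therefore down-weighted to exp(−3).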
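The global-alignment side of the story rests on the Levenshtein distance, computable by a standard O(|a||b|) dynamic program. Below is that dynamic program together with a deliberately simplified sketch of an embedding against random strings: one feature per random string, each a transformed edit distance. The alphabet, the sampling scheme, and the exp(−γ·d) transform are assumptions for illustration only, not the RSE construction proposed in the paper:

```python
import math
import random

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via the
    standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        prev = cur
    return prev[-1]

def random_string_embedding(x, random_strings, gamma=0.1):
    """One feature per random string; exp(-gamma * d) keeps features in (0, 1]."""
    return [math.exp(-gamma * levenshtein(x, w)) for w in random_strings]

# Illustrative usage: embed a string against R random strings over a
# hypothetical DNA-like alphabet.
random.seed(0)
omegas = ["".join(random.choice("ACGT") for _ in range(6)) for _ in range(8)]
z = random_string_embedding("ACGTACGT", omegas)
```

Because the features are an explicit, fixed-length vector, they can be fed directly to any linear classifier, which is what gives this family of methods cost linear in the number of training strings.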
Recommended publications
  • Prof. Srikanta Patnaik (mentor, ed.). Proceedings of International Conference on Electrical, Electronics & Computer Science (ICEECS-2012). Interscience Research Network Conference Proceedings – Full Volumes, no. 61. Bhopal, India, 9 December 2012.
  • Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. "Text Classification using String Kernels." Journal of Machine Learning Research 2 (2002): 419–444.
  • Christina Leslie and Rui Kuang. "Fast String Kernels using Inexact Matching for Protein Sequences." Journal of Machine Learning Research 5 (2004): 1435–1455.
  • Karsten Borgwardt. "String & Text Mining in Bioinformatics." Data Mining in Bioinformatics, Day 9, MPIs Tübingen, February–March 2011.
  • Ola Amayri. "On Email Spam Filtering Using Support Vector Machine." Master's thesis, Concordia University, January 2009.