Efficient Global String Kernel with Random Features: Beyond Counting Substructures

Lingfei Wu* (IBM Research), Ian En-Hsu Yen (Carnegie Mellon University), Siyu Huo (IBM Research), Liang Zhao (George Mason University), Kun Xu (IBM Research), Liang Ma (IBM Research), Shouling Ji† (Zhejiang University), Charu Aggarwal (IBM Research)

*Corresponding author.
†Shouling Ji is also with the Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies.

ABSTRACT

Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the string, which hardly capture long discriminative patterns, (ii) sum over too many substructures, such as all possible subsequences, which leads to diagonal dominance of the kernel matrix, or (iii) rely on non-positive-definite similarity measures derived from the edit distance. Furthermore, while there have been works addressing the computational challenge with respect to the length of strings, most of them still experience quadratic complexity in terms of the number of training samples when used in a kernel-based classifier. In this paper, we present a new class of global string kernels that aims to (i) discover global properties hidden in the strings through global alignments, (ii) maintain positive-definiteness of the kernel without introducing a diagonally dominant kernel matrix, and (iii) have a training cost linear with respect to not only the length of the string but also the number of training string samples. To this end, the proposed kernels are explicitly defined through a series of different random feature maps, each corresponding to a distribution of random strings. We show that kernels defined this way are always positive-definite and exhibit computational benefits, as they always produce Random String Embeddings (RSE) that can be directly used in any linear classification model. Our extensive experiments on nine benchmark datasets corroborate that RSE achieves better or comparable accuracy in comparison to state-of-the-art baselines, especially for strings of longer lengths. In addition, we empirically show that RSE scales linearly with the increase of the number and the length of strings.

CCS CONCEPTS
• Computing methodologies → Kernel methods.

KEYWORDS
String Kernel, String Embedding, Random Features

ACM Reference Format:
Lingfei Wu, Ian En-Hsu Yen, Siyu Huo, Liang Zhao, Kun Xu, Liang Ma, Shouling Ji, and Charu Aggarwal. 2019. Efficient Global String Kernel with Random Features: Beyond Counting Substructures. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330923

1 INTRODUCTION

String classification is a core learning task and has drawn considerable interest in many applications such as computational biology [20, 21], text categorization [26, 44], and music classification [9]. One of the key challenges in string data lies in the fact that there is no explicit feature in sequences. A kernel function corresponding to a high-dimensional feature space has been proven to be an effective method for sequence classification [24, 47].

Over the last two decades, a number of string kernel methods [7, 19, 21, 22, 24, 36] have been proposed, among which the k-spectrum kernel [21], the (k,m)-mismatch kernel, and their fruitful variants [22–24] have gained much popularity due to their strong empirical performance. These kernels decompose the original strings into sub-structures, i.e., short k-length subsequences called k-mers, and then count the occurrences of k-mers (with up to m mismatches) in the original sequence to define a feature map and its associated string kernels. However, these methods only consider the local properties of short substructures in the strings, failing to capture the global properties highly related to some discriminative features of strings, i.e., relatively long subsequences.

When considering larger k and m, the size of the feature map grows exponentially, leading to a serious diagonal dominance problem due to the high-dimensional sparse feature vector [12, 40]. More importantly, the high computational cost of computing the kernel matrix renders these methods applicable only to small values of k and m and small data sizes. Recently, a thread of research has made valid attempts to improve the computation of each entry of the kernel matrix [9, 20]. However, these new techniques only solve the scalability issue in terms of the length of strings and the size of the alphabet, not the kernel matrix construction, which still has quadratic complexity in the number of strings. In addition, these approximation methods still inherit the issues of these "local" kernels, ignoring the global structures of the strings, especially those of long lengths.

Another family of research [6, 11, 14, 29, 32, 38, 39] utilizes a distance function to compute the similarity between a pair of strings through a global or local alignment measure [28, 35]. These string alignment kernels are defined resorting to the learning methodology of R-convolution [15], which is a framework for computing kernels between discrete objects. The key idea is to recursively decompose structured objects into sub-structures and compute their global or local alignments to derive a feature map. However, the common issue that these string alignment kernels have to address is how to preserve the property of being a valid positive-definite (p.d.) kernel [33]. Interestingly, both approaches [11, 32] proposed to sum up all possible alignments to yield a p.d. kernel, which unfortunately suffers from the diagonal dominance problem, leading to poor generalization capability. Therefore, some treatments have to be made to repair the issues, e.g., taking the logarithm of the diagonal, which in turn breaks the positive definiteness. Another important limitation of these approaches is their high computational cost, with quadratic complexity in terms of both the number and the length of strings.

In this paper, we present a new family of string kernels that aims to: (i) discover global properties hidden in the strings through global alignments, (ii) maintain positive-definiteness of the kernel without introducing a diagonally dominant kernel matrix, and (iii) have a training cost linear with respect to not only the length of the string but also the number of training string samples. To this end, our proposed global string kernels take into account the global properties of strings through a global-alignment-based edit distance such as the Levenshtein distance [48]. In addition, the proposed kernels are explicitly defined through a feature embedding given by a distribution of random strings. The resulting kernel is not only a truly p.d. string kernel that does not suffer from diagonal dominance, but it also naturally produces Random String Embeddings (RSE) by utilizing Random Feature (RF) approximations. We further design four different sampling strategies to generate an expressive RSE, which is the key to its state-of-the-art performance in string classification. Owing to the short length of the random strings, we reduce the computational complexity of RSE from quadratic to linear both in the number of strings and in the length of a string. We also show the uniform convergence of RSE to a p.d. kernel that is not shift-invariant for strings of bounded length by non-trivially extending the conventional RF analysis [30].

Our extensive experiments on nine benchmark datasets corroborate that RSE achieves better or comparable accuracy in comparison to state-of-the-art baselines, especially for strings of longer lengths. In addition, RSE scales linearly with the increase of the number and the length of strings.

2 EXISTING STRING KERNELS AND CONVENTIONAL RANDOM FEATURES

In this section, we first introduce existing string kernels and several important issues that impair their effectiveness and efficiency. We then discuss the conventional Random Features for scaling up large-scale kernel machines and illustrate several challenges that prevent conventional Random Features from being directly applied to existing string kernels.

2.1 Existing String Kernels

We discuss existing approaches for defining string kernels, along with three issues that have long haunted them: (i) diagonal dominance; (ii) non-positive-definiteness; (iii) poor scalability for large-scale string kernels.

2.1.1 String Kernel by Counting Substructures. We consider a family of string kernels most commonly used in the literature, where the kernel k(x,y) between two strings x, y ∈ X is computed by counting the number of shared substructures between x and y. Let S denote the set of indices of a particular substructure in x (e.g., a subsequence, substring, or single character), and let S(x) be the set of all possible such index sets. Furthermore, let U be the set of all possible values of such substructures. Then a family of string kernels can be defined as

k(x,y) := \sum_{u \in U} \phi_u(x)\,\phi_u(y), \quad \text{where} \quad \phi_u(x) = \sum_{S \in \mathcal{S}(x)} 1_u(x[S])\,\gamma(S)    (1)

and 1_u(x[S]) counts the substructures in x of value u, weighted by γ(S), which reduces the count according to properties of S, such as its length. For example, in a vanilla text kernel, S denotes word positions in a document x and U denotes the vocabulary set (with γ(S) = 1). To take string structure into consideration, the gappy n-gram kernel [26] considers S(x) as the set of all possible subsequences of length k in a string x, with γ(S) = exp(−ℓ(S)) being a weight that decays exponentially in the length of S so as to penalize subsequences with a large number of insertions and deletions. While the number of possible subsequences in a string is exponential in the string length, there exist dynamic-programming-based algorithms that compute the kernel in Equation (1) in time O(k|x||y|) [26]. Similar, or more complex, substructures were employed in the convolution kernels of [15] and [39]. Both of them have quadratic complexity with respect to the string length, which is too expensive for problems with long strings.

To circumvent this issue, Leslie et al. proposed the k-spectrum kernel (or gap-free k-gram kernel) [21], which only requires a computation time of O(k(|x| + |y|)), linear in the string length, by taking S as all substrings (without gaps) of length k; this can be further improved to O(|x| + |y|) [36]. While this significantly increases computational efficiency, the no-gap assumption is too strong in practice. Therefore, the (k,m)-mismatch kernel [22, 24] is more widely used; it sets 1_u(x[S]) = 1 not only when the k-mer x[S] exactly matches u but also when they mismatch by no more than m characters. The algorithm has a computational burden of O(k^{m+1} |Σ|^m (|x| + |y|)), and a number of more recent works improved it to O(m^3 + 2^k (|x| + |y|)) in the exact case and even faster in the approximate case [9, 20].

One significant issue with substructure-counting kernels is the diagonal dominance problem, where the diagonal elements of the kernel Gram matrix are significantly (often orders of magnitude) larger than the off-diagonal elements, yielding an almost identity kernel matrix on which the Support Vector Machine (SVM) does not perform well [12, 40]. This is because a string always shares a large number of common substructures with itself, and the issue is more serious for kernels summing over more substructures in S.
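To make the substructure-counting view concrete, the sketch below builds the explicit k-spectrum feature map of Equation (1) with γ(S) = 1 and no mismatches. It is an illustrative implementation only, not the optimized routines of [21, 36], and the function names are ours.

```python
from collections import Counter

def k_spectrum_features(x: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of x: the feature map
    phi_u(x) of Equation (1) with gamma(S) = 1 and no mismatches."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum_kernel(x: str, y: str, k: int) -> int:
    """k(x, y) = sum_u phi_u(x) * phi_u(y), summing over shared k-mers."""
    fx, fy = k_spectrum_features(x, k), k_spectrum_features(y, k)
    return sum(cx * fy[u] for u, cx in fx.items())

if __name__ == "__main__":
    print(k_spectrum_kernel("ACGTACGT", "ACGTTGCA", k=3))
    # A string shares all of its own k-mers with itself, which is why the
    # diagonal entries k(x, x) dominate the Gram matrix.
```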
2.1.2 Edit-Distance Substitution Kernel. Another commonly used approach is to define string kernels by exploiting the edit distance (e.g., the Levenshtein distance). With a slight abuse of notation, let d(i, j) denote the Levenshtein distance (LD) between two prefixes, d(x[1:i], y[1:j]). The distance can be recursively defined as

d(i,j) = \begin{cases} \max\{i,j\}, & i = 0 \text{ or } j = 0 \\ \min\{\, d(i-1,j) + 1,\; d(i,j-1) + 1,\; d(i-1,j-1) + 1_{[x[i] \neq y[j]]} \,\}, & \text{otherwise.} \end{cases}    (2)

Essentially, the recursion (2) finds the minimum number of edits (i.e., insertions, deletions, and substitutions) required to transform x into y. The distance is known to be a metric, that is, it satisfies (i) d(x,y) ≥ 0, (ii) d(x,y) = d(y,x), (iii) d(x,y) = 0 ⟺ x = y, and (iv) d(x,y) + d(y,z) ≥ d(x,z).

The distance-substitution kernel [14] then replaces the Euclidean distance in a typical kernel function by the new distance d(x,y). For example, for Gaussian and Laplacian RBF kernels, the distance substitution leads to

k_{Gauss}(x,y) := \exp(-\gamma\, d(x,y)^2)    (3)
k_{Lap}(x,y) := \exp(-\gamma\, d(x,y)).    (4)

These kernels, however, are not positive-definite in the case of the edit distance [29]. This implies that the use of the string kernels (3) and (4) in a learning algorithm such as SVM does not correspond to a loss minimization problem, and the numerical procedure cannot guarantee convergence to an optimal solution, since a non-p.d. kernel matrix yields a non-convex optimization problem. Despite being invalid, this type of kernel is still used in practice [27, 29].
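For reference, the following Python sketch implements the dynamic program of Equation (2) together with the two distance-substitution kernels (3) and (4). It is a straightforward O(|x||y|) implementation for illustration, not an optimized routine.

```python
import math

def levenshtein(x: str, y: str) -> int:
    """Dynamic program for Equation (2): d(i, j) over prefixes of x and y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # base case: max{i, j} when j = 0
    for j in range(n + 1):
        d[0][j] = j                      # base case: max{i, j} when i = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                          # deletion
                d[i][j - 1] + 1,                          # insertion
                d[i - 1][j - 1] + (x[i - 1] != y[j - 1])  # substitution
            )
    return d[m][n]

def k_gauss(x: str, y: str, gamma: float) -> float:
    """Distance-substitution Gaussian kernel, Equation (3); not p.d. for LD."""
    return math.exp(-gamma * levenshtein(x, y) ** 2)

def k_lap(x: str, y: str, gamma: float) -> float:
    """Distance-substitution Laplacian kernel, Equation (4); not p.d. for LD."""
    return math.exp(-gamma * levenshtein(x, y))
```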
2.2 Conventional Random Features for Scaling Up Kernel Machines

As discussed in the previous sections, while there have been works addressing the computational challenge with respect to the length of the string or the size of the alphabet, all existing string kernels still have quadratic complexity in terms of the number of strings when computing the kernel matrix for string classification.

Independently, over the last decade, there has been growing interest in the development of various low-rank kernel approximation techniques for scaling up large-scale kernel machines, such as the Nystrom method [41], the Random Features (RF) method [30], and other hybrid kernel approximation methods [34]. Among them, the RF method has attracted considerable interest due to its easy implementation and fast execution time [30, 42, 43], and it has been widely applied in applications such as speech recognition and computer vision [2, 16]. In particular, unlike approaches that approximate the kernel matrix, the RF method approximates the kernel function directly via sampling from an explicit feature map. Therefore, these random features, combined with very simple linear learning techniques, can effectively reduce the computational complexity of the exact kernel machine from quadratic to linear in terms of the number of training samples.

Despite the great success of the conventional RF method, there are three key challenges in applying this technique to the string kernels introduced in the previous section. First, conventional RF methods are designed for kernel machines that only take fixed-length vectors, so it is not clear how to extend the technique to string kernels that take variable-length strings. Second, all conventional RF methods require a user-defined kernel as input and then derive the corresponding random feature map. For given kernel functions like the Gaussian or Laplacian RBF kernels, it is easy to derive random feature maps, i.e., a Gaussian distribution and a Gamma distribution. However, it is highly non-trivial to derive a random feature map for a string kernel defined as in Equations (1), (3), and (4). Finally, the theoretical guarantee that the inner product of two transformed points approximates the exact kernel requires the kernel to be shift-invariant and positive-definite. This assumption is hard to satisfy for most string kernels, since existing string kernels are not shift-invariant [45].

In this work, instead of using Random Features to approximate a pre-defined kernel function, we overcome all of the aforementioned issues by generalizing Random Features to develop a new family of efficient and effective string kernels that are not only positive-definite but also reduce the computational complexity from quadratic to linear in both the number and the length of strings. Note that our approach is different from a recent work [45] on distance kernel learning that mainly focuses on the theoretical analysis of such kernels on structured data like time-series [46] and text [44]. Instead, we focus on developing empirical methods that often outperform, or are highly competitive with, other state-of-the-art approaches, including kernel-based and Recurrent Neural Network based methods, as we show in our experiments.
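As a point of contrast with the string setting, the sketch below shows the conventional random Fourier feature construction of [30] for a Gaussian RBF kernel on fixed-length vectors. It assumes exp(−γ‖x−y‖²) as the target kernel and is only meant to highlight why the recipe does not transfer directly to variable-length strings.

```python
import numpy as np

def rff_map(X: np.ndarray, R: int, gamma: float, seed: int = 0) -> np.ndarray:
    """Random Fourier features z(x) with E[<z(x), z(y)>] = exp(-gamma * ||x - y||^2).

    X is an (N, d) matrix of fixed-length inputs -- exactly the assumption
    that breaks down for strings, whose lengths vary.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, R))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=R)                # random phases
    return np.sqrt(2.0 / R) * np.cos(X @ W + b)

# Usage: the linear inner product of the features approximates the kernel.
X = np.random.default_rng(1).normal(size=(5, 3))
Z = rff_map(X, R=2000, gamma=0.5)
approx_gram = Z @ Z.T   # ~ exp(-0.5 * ||x_i - x_j||^2)
```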
3 FROM EDIT DISTANCE TO STRING KERNEL

In this section, we first introduce a family of string kernels that utilize a global alignment measure, i.e., the edit distance (or Levenshtein distance), to construct a kernel while establishing its positive definiteness. We then discuss how to perform efficient computation of the proposed string kernels by generating a kernel approximation through Random Features, which we refer to as Random String Embeddings. Finally, we show the uniform convergence of RSE to a p.d. kernel that is not shift-invariant.

3.1 Global String Kernel

Suppose we are interested in strings of bounded length L, that is, X ⊆ Σ^L. Let Ω ⊆ Σ^L also be a domain of strings and p(ω): Ω → R be a probability distribution over a collection of random strings ω ∈ Ω. The proposed kernel is defined as

k(x,y) := \int_{\omega \in \Omega} p(\omega)\,\phi_\omega(x)\,\phi_\omega(y)\, d\omega,    (5)

where φ_ω(x) could be set directly to the distance

\phi_\omega(x) := d(x, \omega)    (6)

or be converted into a similarity measure via the transformation

\phi_\omega(x) := \exp(-\gamma\, d(x, \omega)).    (7)

In the former case, the kernel can be viewed as a form of the distance-substitution kernel, but using a distribution of random strings instead of the original strings. In the latter case, it can be interpreted as a soft distance-substitution kernel: instead of substituting the distance into a function as in (4), it substitutes a soft version of the form

k(x,y) = \exp\big(-\gamma\, \mathrm{softmin}_{p(\omega)}\{ d(x,\omega) + d(\omega,y) \}\big),    (8)

where

\mathrm{softmin}_{p(\omega)}(f(\omega)) := -\frac{1}{\gamma} \log \int p(\omega)\, e^{-\gamma f(\omega)}\, d\omega.

Suppose Ω only contains strings of non-zero probability (i.e., p(ω) > 0). Comparing (8) to the distance-substitution kernel (4), we notice that

\mathrm{softmin}_{p(\omega)}(f(\omega)) \to \min_{\omega \in \Omega} f(\omega)

as γ → ∞. As long as X ⊆ Ω and the global alignment measure (i.e., the Levenshtein distance) satisfies the triangle inequality [25], we have

\min_{\omega \in \Omega}\, d(x,\omega) + d(y,\omega) = d(x,y),

and therefore

k(x,y) \to \exp(-\gamma\, d(x,y))

as γ → ∞, which relates our kernel (8) to the distance-substitution kernel (4) in the limiting case. However, note that our kernel (8) is always positive-definite by its definition (5), since

\int_x \int_y \int_{\omega \in \Omega} p(\omega)\,\phi_\omega(x)\,\phi_\omega(y)\, d\omega\, dx\, dy = \int_{\omega \in \Omega} p(\omega) \left(\int_x \phi_\omega(x)\, dx\right) \left(\int_y \phi_\omega(y)\, dy\right) d\omega \geq 0.    (9)
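To spell out the limiting step used above, the following is a short derivation sketch (under the stated assumption that p has full support on Ω) of why the soft minimum approaches the hard minimum as γ → ∞.

```latex
\operatorname{softmin}_{p(\omega)}\big(f(\omega)\big)
  = -\frac{1}{\gamma}\log \int_{\Omega} p(\omega)\, e^{-\gamma f(\omega)}\, d\omega .
% Upper-bounding the integrand by its largest value gives a lower bound:
e^{-\gamma f(\omega)} \le e^{-\gamma \min_{\omega'} f(\omega')}
  \;\Rightarrow\;
  \operatorname{softmin}_{p(\omega)}(f(\omega)) \;\ge\; \min_{\omega\in\Omega} f(\omega).
% Restricting the integral to a neighborhood B_\epsilon of a minimizer with
% p(B_\epsilon) > 0 gives a matching upper bound:
\operatorname{softmin}_{p(\omega)}(f(\omega))
  \;\le\; \min_{\omega\in\Omega} f(\omega) + \epsilon - \tfrac{1}{\gamma}\log p(B_\epsilon)
  \;\xrightarrow{\ \gamma\to\infty\ }\; \min_{\omega\in\Omega} f(\omega) + \epsilon .
```

Since ε is arbitrary, the soft minimum converges to the minimum, which is exactly the step that recovers the distance-substitution kernel (4) in the limit.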
3.2 Random String Embedding

Although the kernels (6) and (7) are clearly defined and easy to understand, it is hard to derive a simple analytic form of the kernel (5). Fortunately, we can easily utilize RF approximations of the exact kernel,

\hat{k}_R(x,y) \approx \langle Z(x), Z(y)\rangle = \frac{1}{R}\sum_{i=1}^{R} \phi_{\omega_i}(x)\,\phi_{\omega_i}(y).    (10)

The feature vector Z(x) is computed using the dissimilarity measure φ({ω_i}_{i=1}^R, x), where {ω_i}_{i=1}^R is a set of random strings of variable length D drawn from a distribution p(ω). In particular, the function φ could be any edit distance measure, or converted similarity measure, that considers global properties through alignments. Without loss of generality, we consider the Levenshtein distance (LD) as our distance measure, which has been shown to be a true distance metric [48]. We call our random approximation Random String Embedding (RSE), and we will show its uniform convergence to the exact kernel over all pairs of strings by non-trivially extending the conventional RF analysis in [30] to a kernel that is not shift-invariant and to inputs that are not fixed-length vectors. It is worth noting that only the feature matrix Z is actually computed for string classification tasks; there is no need to compute k̂_R(x,y).

Algorithm 1 Random String Embedding: An Unsupervised Feature Representation Learning for Strings
Input: Strings {x_i}_{i=1}^N with 1 ≤ |x_i| ≤ L, maximum length of random strings D_max, string embedding size R.
Output: Feature matrix Z_{N×R} for the input strings.
1: for j = 1, ..., R do
2:   Draw D_j uniformly from [1, D_max].
3:   Generate a random string ω_j of length D_j from Algorithm 2.
4:   Compute the feature vector Z(:, j) = φ_{ω_j}({x_i}_{i=1}^N) using the LD in (6) or the soft-version LD in (7).
5: end for
6: Return the feature matrix Z({x_i}_{i=1}^N) = (1/√R) Z(:, 1:R)

As shown in Algorithm 1, our Random String Embedding is very simple and can be easily implemented. There are several remarks worth noting here. First, RSE is an unsupervised feature generation method for embedding strings, making it highly flexible to combine with various learning tasks besides classification. The hyperparameter D_max applies to both the kernel (6) and the kernel (7), while the hyperparameter γ applies only to the kernel (7), which uses the soft-version LD distance as features. One interesting way to illustrate the role of D in lines 2 and 3 of Algorithm 1 is that it captures the longest segments of the original strings that correspond to highly discriminative features hidden in the data. We have observed in our experiments that these long segments are particularly important for capturing the global properties of strings of long length (L > 1000). In practice, we have no prior knowledge about the value of D, and thus we sample the length D of each random string in the range [1, D_max] to yield an unbiased estimate. In practice, D is often a constant, typically smaller than 30. Finally, in order to learn an expressive representation, generating a set of random strings of high quality is a necessity, which we discuss in detail later.

Efficient Computation of RSE. One important aspect of our RSE embedding method stems from the fact that it scales linearly both in the number of strings and in the length of strings. Notice that a typical evaluation of LD between two data strings is O(L^2), given that the two strings have roughly equal length L. With our RSE, we reduce the computational cost of LD to O(LD), where D is treated as a constant in Algorithm 1. This is particularly important when the original strings are very long. In addition, most popular existing string kernels have quadratic complexity O(N^2) in the number of strings when computing the kernel matrix, rendering it seriously difficult to scale to large data. In contrast, our RSE reduces this computational complexity from quadratic to linear, owing to generating an embedding matrix with O(NR) entries instead of constructing a full kernel matrix directly. Recall that the state-of-the-art string kernels have complexity O(N^2(m^3 + 2^k L)) [9, 20]. Therefore, with our RSE method we have significantly improved the total complexity to O(NRL) (treating D as a constant), which is independent of the k-mer length k and the number of mismatched characters m. We demonstrate the linear scalability of RSE with respect to the number of strings and the length of strings, making it a strong candidate for the method of choice for string kernels on large data.
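A minimal NumPy sketch of Algorithm 1 follows, assuming the plain Levenshtein feature map of Equation (6) or the soft version (7), and using a uniform character sampler as a stand-in for Algorithm 2; the function names and the sampler are ours, not part of the paper's released code.

```python
import numpy as np

def levenshtein(x: str, y: str) -> int:
    """O(|x||y|) dynamic program for the Levenshtein distance, Equation (2)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]

def rse_embedding(strings, alphabet, R=128, D_max=30, gamma=None, seed=0):
    """Algorithm 1 sketch: one column of Z per random string omega_j.

    gamma=None uses the raw distance feature (6); otherwise the soft
    feature exp(-gamma * d) of Equation (7).
    """
    rng = np.random.default_rng(seed)
    N = len(strings)
    Z = np.zeros((N, R))
    for j in range(R):
        D_j = rng.integers(1, D_max + 1)                        # line 2
        omega = "".join(rng.choice(list(alphabet), size=D_j))   # stand-in for Algorithm 2 (RSE(RF) scheme)
        d = np.array([levenshtein(x, omega) for x in strings])  # line 4
        Z[:, j] = d if gamma is None else np.exp(-gamma * d)
    return Z / np.sqrt(R)                                       # line 6

# Usage: Z @ Z.T approximates the kernel matrix of Equation (10).
Z = rse_embedding(["ACGT", "ACGGT", "TTTT"], alphabet="ACGT", R=64, gamma=0.5)
```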
Effective Random String Generation. The key to the effectiveness of RSE is how to generate a set of random strings of high quality. We present four different sampling strategies to produce a rich feature space derived from both data-independent and data-dependent distributions, summarized in Algorithm 2.

The first sampling strategy follows the traditional RF method, where we find the distribution associated with a predefined kernel function. However, since we define the kernel function by an explicit distribution, we have the flexibility to choose any existing distribution that may suit the data well. To this end, we use a uniform distribution to represent the true distribution of the characters in the given alphabet. We call this sampling scheme RSE(RF). The second sampling strategy is a similar scheme, but instead of using an existing distribution we compute histograms of each character of the alphabet appearing in the data strings. The learned histogram is a biased estimate of the true probability distribution. We call this sampling scheme RSE(RFD).

The previous two sampling strategies consider how to generate a random string from low-level characters. Recent studies [17, 31] on random features have shown that a data-dependent distribution may yield better generalization error. Inspired by these findings, we also design two data-dependent sampling schemes to generate random strings. We do not use the well-known representative-set method to pick whole strings, since it has been shown in [3] that this method generally leads to larger generalization errors. A simple yet intuitive way to obtain random strings is to sample a segment (sub-string) of variable length from the original strings. Sub-strings that are too long or too short could either carry noise or provide insufficient information about the true data distribution, so we uniformly sample the length of the random strings as before. We call this sampling scheme RSE(SS). In order to sample more random strings in one sampling period, we also divide the original string into several blocks of sub-strings and uniformly sample a number of these blocks as our random strings. Note that in this case we sample multiple random strings and do not concatenate them into one long string. This scheme learns more discriminative features at the cost of more computation per run of Algorithm 2. We call this scheme RSE(BSS).

Algorithm 2 Sampling Strategies for Generating Random Strings
Input: Strings {x_i}_{i=1}^N, length of random string D_j, size of alphabet |Σ|.
Output: Random strings ω_i
1: if Choose RSE(RF) then
2:   Uniformly draw D_j indices {I_1, I_2, ..., I_{D_j}} = randi(|Σ|, 1, D_j)
3:   Obtain random characters from Σ({I_1, I_2, ..., I_{D_j}})
4:   Generate random string ω_i by concatenating the random characters
5: else if Choose RSE(RFD) then
6:   Compute the discrete distribution h(ω) of each character in the alphabet Σ
7:   Draw D_j indices {I_1, I_2, ..., I_{D_j}} = randi(|Σ|, 1, D_j) from the data letter distribution h(ω)
8:   Obtain random characters from Σ({I_1, I_2, ..., I_{D_j}})
9:   Generate random string ω_i by concatenating the random characters
10: else if Choose RSE(SS) then
11:   Uniformly draw a string index k = randi(1, N) and select the k-th raw string
12:   Obtain the length L_k of the k-th raw string and uniformly draw a letter index l = randi(1, L_k − D_j + 1)
13:   Generate random string ω_i from the continuous segment of the k-th raw string starting at the l-th letter
14: else if Choose RSE(BSS) then
15:   Uniformly draw a string index k = randi(1, N), select the k-th raw string, and obtain its length L_k
16:   Divide the k-th raw string into b = L_k / D_j blocks of sub-strings
17:   Uniformly draw the number of blocks to be sampled, l = randi(1, b)
18:   Uniformly draw block indices {B_1, B_2, ..., B_l} = randi(b, 1, l)
19:   Generate l random strings ω_i by gathering all drawn blocks of sub-strings (removing duplicates already in {ω_i})
20: end if
21: Return ω_i for all generated random strings
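The sketch below mirrors the four branches of Algorithm 2 in Python. It is a simplified illustration (uniform lengths, duplicate filtering for BSS via a set only), and the helper names are ours.

```python
import numpy as np

def sample_random_strings(strings, alphabet, D_j, scheme="RF", seed=None):
    """One draw of Algorithm 2: returns a list of random strings omega."""
    rng = np.random.default_rng(seed)
    alphabet = list(alphabet)

    if scheme == "RF":        # data-independent, uniform over the alphabet
        return ["".join(rng.choice(alphabet, size=D_j))]

    if scheme == "RFD":       # data-dependent character histogram h(omega)
        counts = {c: 0 for c in alphabet}
        for x in strings:
            for c in x:
                counts[c] += 1
        h = np.array([counts[c] for c in alphabet], dtype=float)
        h /= h.sum()
        return ["".join(rng.choice(alphabet, size=D_j, p=h))]

    if scheme == "SS":        # one contiguous sub-string of a random raw string
        x = strings[rng.integers(len(strings))]
        start = rng.integers(0, max(len(x) - D_j, 0) + 1)
        return [x[start:start + D_j]]

    if scheme == "BSS":       # several blocks of a random raw string
        x = strings[rng.integers(len(strings))]
        b = max(len(x) // D_j, 1)
        blocks = [x[i * D_j:(i + 1) * D_j] for i in range(b)]
        l = rng.integers(1, b + 1)
        picked = rng.choice(b, size=l, replace=False)
        return list({blocks[i] for i in picked})

    raise ValueError(f"unknown scheme: {scheme}")
```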
3.3 Convergence Analysis

As our kernel (5) does not have an analytic form but only the sampling approximation (10), it is crucial to ask: how many random features are required in (10) to obtain an accurate approximation? Does such accuracy generalize to strings beyond the training data? To answer these questions, we non-trivially extend the conventional RF analysis in [30] to the proposed string kernels in Equation (5), which are not shift-invariant and take variable-length strings. The following theorem shows the uniform convergence of RSE to a p.d. string kernel over all pairs of strings.

Theorem 1. Let ∆_R(x,y) := k̂_R(x,y) − k(x,y) be the difference between the exact kernel (5) and its random-feature approximation (10) with R samples. Then we have the following uniform convergence:

P\left\{ \max_{x,y \in X} |\Delta_R(x,y)| > t \right\} \leq 8\, e^{\,2L \log|\Sigma| - Rt^2/2},

where L is a bound on the length of strings in X and |Σ| is the size of the alphabet. In other words, to guarantee |∆_R(x,y)| ≤ ε with probability at least 1 − δ, it suffices to have

R = \Omega\left( \frac{L \log|\Sigma|}{\epsilon^2} \log\frac{\gamma}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\delta} \right).

Proof Sketch. Since E[∆_R(x,y)] = 0 and |∆_R(x,y)| ≤ 1, Hoeffding's inequality gives

P\{ |\Delta_R(x,y)| \geq t \} \leq 2 \exp(-Rt^2/2),

and the number of strings in X is bounded by 2|Σ|^L. Through a union bound over all pairs, we have

P\left\{ \max_{x,y \in X} |\Delta_R(x,y)| \geq t \right\} \leq 2|X|^2 \exp(-Rt^2/2) \leq 8 \exp(2L \log|\Sigma| - Rt^2/2),

which leads to the result. □

Theorem 1 tells us that for any pair of strings x, y ∈ X, one can guarantee a kernel approximation error of less than ε as long as R ⪆ L log(|Σ|)/ε², up to logarithmic factors.
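For completeness, here is the counting step behind the union bound, under the assumption |Σ| ≥ 2 (the constant 2 absorbs the geometric series over lengths up to L):

```latex
|\mathcal{X}| \;\le\; \sum_{\ell=1}^{L} |\Sigma|^{\ell}
  \;=\; \frac{|\Sigma|^{L+1}-|\Sigma|}{|\Sigma|-1}
  \;\le\; 2\,|\Sigma|^{L} \quad (|\Sigma| \ge 2),
\qquad\text{so}\qquad
2\,|\mathcal{X}|^{2}\, e^{-Rt^{2}/2}
  \;\le\; 8\,|\Sigma|^{2L}\, e^{-Rt^{2}/2}
  \;=\; 8\, e^{\,2L\log|\Sigma| - Rt^{2}/2}.
```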
4 EXPERIMENTS

We carry out experiments to demonstrate the effectiveness and efficiency of the proposed method, and compare it against a total of five state-of-the-art baselines on nine string datasets that are widely used for testing the performance of string kernels. We implement our method in Matlab and make full use of C-MEX functions for the computationally intensive LD component.

Datasets. We apply our method to nine benchmark string datasets across three main application areas: protein, DNA/RNA, and image. Table 1 summarizes the properties of the datasets, which are collected from the UCI machine learning repository [10] and the LibSVM Data Collection [1] and partially overlap with various string kernel references [9, 20]. For all datasets, the size of the alphabet is between 4 and 20. The number of classes ranges between 2 and 74; a larger number of classes typically makes the classification task more challenging. One particular property of string data is the potentially high variation in string length, which appears mostly in the protein datasets, with lengths ranging between 20 and 1264. This large variation presents significant challenges to most methods. We divided each dataset into 70/30 train and test subsets (if there was no predefined train/test split).

Table 1: Statistical properties of the datasets.

Application | Name         | Alphabet | Class | Train | Test  | Length
Protein     | ding-protein | 20       | 27    | 311   | 369   | 26/967
Protein     | fold         | 20       | 26    | 2700  | 1159  | 20/936
Protein     | superfamily  | 20       | 74    | 3262  | 1398  | 23/1264
DNA/RNA     | splice       | 4        | 3     | 2233  | 957   | 60
DNA/RNA     | dna3-class1  | 4        | 2     | 3200  | 1373  | 147
DNA/RNA     | dna3-class2  | 4        | 2     | 3620  | 1555  | 147
DNA/RNA     | dna3-class3  | 4        | 2     | 4025  | 1725  | 147
Image       | mnist-str4   | 4        | 10    | 60000 | 10000 | 34/198
Image       | mnist-str8   | 8        | 10    | 60000 | 10000 | 17/99

Variants of RSE. We have two different global string kernels and four different random string generation methods proposed in Section 3, resulting in a total of 8 combinations of RSE. We investigate the properties and performance of each variant in the subsequent section. The variants are as follows: i) RSE(RF-DF): RSE(RF) with the direct LD distance as features in (6); ii) RSE(RF-SF): RSE(RF) with the soft version of the LD distance as features in (7); iii) RSE(RFD-DF): RSE(RFD) with the direct LD distance; iv) RSE(RFD-SF): RSE(RFD) with the soft version of the LD distance; v) RSE(SS-DF): RSE(SS) with the direct LD distance; vi) RSE(SS-SF): combines the data-dependent sub-strings generated from the dataset with the soft LD distance; vii) RSE(BSS-DF): generates blocks of sub-strings from the data-dependent distribution and uses the direct LD distance; viii) RSE(BSS-SF): generates blocks of sub-strings from the data-dependent distribution and uses the soft-version LD distance.

Baselines. We compare our method RSE against five state-of-the-art kernel and deep learning based methods:
SSK [20]: state-of-the-art scalable algorithms for computing exact string kernels with inexact matching, i.e., the (k,m)-mismatch kernel.
ASK [9]: the latest advancement for approximating the (k,m)-mismatch string kernel for larger k and m.
KSVM [27]: state-of-the-art alignment-based kernels using the original (indefinite) similarity measure in the original Krein space.
LSTM [13]: the long short-term memory (LSTM) architecture, a state-of-the-art model for sequence learning.
GRU [4]: a gated recurrent unit (GRU) achieving comparable performance to LSTM [5].

For the deep learning methods, we use the Python deep learning library Keras. Both the LSTM and GRU models are trained using the Adam optimizer [18] with mini-batch size 64. The learning rate is set to 0.001. We apply the dropout strategy [37] with a ratio of 0.5 to avoid overfitting. Gradients are clipped when their norm exceeds 20. We set the maximum number of epochs to 200. Most of the protein and DNA/RNA datasets are relatively small, except for the two image datasets. Therefore, to avoid potential over-fitting, we tune the number of hidden layers (using only 1 or 2) and the size of the hidden state between 60 and 150. We use a one-hot encoding scheme with the alphabet size of the corresponding string data.

4.1 Comparison Among All Variants of RSE

Setup. We investigate the behavior of the eight variants of our proposed method RSE in terms of string classification accuracy. The best values for γ and for D_max (the maximum length of a random string) were searched in the ranges [1e-5, 1] and [5, 100], respectively. Since we can generate random samples from the distribution, we can use as many as needed to achieve performance close to the exact kernel; we report the best number in the range R = [4, 8192] (typically, the larger R is, the better the accuracy). We employ a linear SVM implemented using LIBLINEAR [8] on the RSE embeddings.

Results. Table 2 shows the comparison results among the eight variants of RSE for various string classification tasks. We make several empirical observations. First, the data-dependent sampling strategies (sub-strings and block sub-strings) generally outperform their data-independent counterparts. This may be because the data-dependent strategies have a smaller hypothesis space associated with the given data, which captures the global properties better with limited samples and thus yields more favorable generalization errors; this is consistent with recent studies on random features [17, 31]. Second, there is no clear winner that is significantly better than the others (DF wins on 5 datasets in total while SF wins on 4). However, when combined with the SS or BSS sampling strategies, using the soft-version LD distance as features (SF) often achieves performance close to that of using the LD distance as features (DF), while the opposite is not true. For instance, on the datasets ding-protein and dna3-class2, RSE(SS-SF) has significantly better accuracy than RSE(SS-DF). This suggests that the best candidate variant of RSE in practice should combine SF with a data-dependent sampling strategy (SS or BSS). Therefore, we choose RSE(BSS-SF) to compare with the other baselines in the subsequent experiments.

Table 2: Comparisons among eight variants of RSE in terms of classification accuracy. Each sampling strategy is combined with either DF (direct LD distance as features, Equation (6)) or SF (soft version of the LD distance as features, Equation (7)).

Datasets     | RSE(RF-DF) | RSE(RF-SF) | RSE(RFD-DF) | RSE(RFD-SF) | RSE(SS-DF) | RSE(SS-SF) | RSE(BSS-DF) | RSE(BSS-SF)
ding-protein | 52.57      | 51.76      | 51.22       | 49.32       | 48.50      | 53.65      | 51.49       | 52.30
fold         | 73.94      | 72.90      | 74.37       | 72.47       | 75.41      | 75.21      | 74.72       | 75.13
superfamily  | 74.03      | 73.46      | 74.67       | 70.88       | 77.46      | 77.13      | 74.82       | 75.52
splice       | 86.72      | 86.31      | 86.20       | 82.86       | 88.71      | 88.08      | 89.76       | 90.17
dna3-class1  | 78.29      | 77.20      | 77.85       | 79.46       | 81.64      | 80.84      | 83.39       | 82.66
dna3-class2  | 88.48      | 89.51      | 87.97       | 90.41       | 87.20      | 90.61      | 89.51       | 90.48
dna3-class3  | 75.94      | 78.55      | 72.0        | 70.72       | 70.89      | 72.87      | 78.20       | 78.78
mnist-str4   | 98.52      | 98.43      | 98.43       | 98.31       | 98.76      | 98.61      | 98.75       | 98.71
mnist-str8   | 98.45      | 98.48      | 98.39       | 98.31       | 98.54      | 98.51      | 98.50       | 98.53
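To make the evaluation protocol of Section 4.1 concrete, here is a hedged end-to-end sketch that feeds RSE features into a linear SVM. It uses scikit-learn's LinearSVC as a stand-in for the LIBLINEAR setup described above, and it assumes the rse_embedding helper from the Section 3.2 sketch (imported from a hypothetical module), not the paper's Matlab implementation.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Hypothetical module holding the rse_embedding sketch from Section 3.2.
from rse_sketch import rse_embedding

def run_rse_classification(train_strings, y_train, test_strings, y_test,
                           alphabet, R=1024, D_max=30, gamma=1e-2, C=1.0):
    """Embed train and test strings with the SAME random strings, then fit a
    linear SVM on the embeddings (the role LIBLINEAR plays in Section 4.1)."""
    # Embedding train and test jointly guarantees identical random strings omega_j.
    Z = rse_embedding(list(train_strings) + list(test_strings),
                      alphabet=alphabet, R=R, D_max=D_max, gamma=gamma, seed=0)
    Z_train, Z_test = Z[:len(train_strings)], Z[len(train_strings):]

    clf = LinearSVC(C=C)   # the regularization C would be cross-validated in practice
    clf.fit(Z_train, y_train)
    return accuracy_score(y_test, clf.predict(Z_test))
```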

4.2 Comparison of RSE Against All Baselines

Setup. We assess the performance of RSE against the five other state-of-the-art kernel and deep learning approaches in terms of both string classification accuracy and computational time. For RSE, we choose the variant RSE(BSS-SF) owing to its consistently robust performance, and report its results on each dataset from Table 2. For SSK and ASK, we use the publicly available implementations of these two methods, written in C and in Java, respectively. To achieve the best performance of SSK and ASK, following [9, 20] we generate different combinations of the (k,m)-mismatch kernel, where k is between 8 and 12 and m is between 2 and 5. We use LIBSVM [1] for these precomputed kernel matrices and search for the best hyperparameter (regularization) of the SVM in the range [1e-5, 1e5]. For LSTM and GRU, all experiments were conducted on a server with 8 CPU cores and an NVIDIA Tesla K80 accelerator with two GK210 GPUs. However, to facilitate a relatively fair runtime comparison, we run the two deep learning models on CPU only and report their runtime.

Table 3: Comparing RSE against other state-of-the-art methods in terms of classification accuracy and computational time (seconds). The symbol "–" stands for either "ran out of memory" (with a total of 256GB) or a runtime greater than 36 hours. Entries are Accuracy / Time.

Datasets     | RSE(BSS-SF)    | SSK            | ASK            | KSVM             | LSTM             | GRU
ding-protein | 52.30 / 54.8   | 28.72 / 3.0    | 11.92 / 20.0   | 39.83 / 25.8     | 31.33 / 576.0    | 31.90 / 350.0
fold         | 75.13 / 289.51 | 46.5 / 85.0    | 48.83 / 1070.0 | 74.37 / 643.9    | 68.08 / 13778.0  | 66.83 / 6452.0
superfamily  | 75.52 / 469.9  | 44.63 / 140.0  | 44.70 / 257.0  | 69.59 / 1389.9   | 63.38 / 16778.0  | 62.81 / 7974.0
splice       | 90.17 / 78.4   | 71.26 / 68.0   | 71.57 / 184.0  | 67.29 / 148.8    | 86.94 / 166.0    | 88.39 / 93.2
dna3-class1  | 82.66 / 585.6  | 86.38 / 313.0  | 86.23 / 667.0  | 48.43 / 760.4    | 80.1 / 866.0     | 81.78 / 436.0
dna3-class2  | 90.48 / 432.2  | 82.76 / 475.0  | 82.63 / 916.0  | 46.10 / 991.8    | 83.08 / 1000.0   | 85.13 / 536.4
dna3-class3  | 78.78 / 1436.8 | 77.91 / 553.0  | 78.14 / 926.0  | 44.28 / 1297.2   | 83.36 / 2400.0   | 81.75 / 1389.0
mnist-str4   | 98.71 / 4287.2 | – / –          | – / –          | – / > 36 hours   | 98.63 / 13090.0  | 98.50 / 7542.0
mnist-str8   | 98.53 / 2010.2 | – / –          | – / –          | 96.80 / 859670.0 | 98.61 / 14618.0  | 98.45 / 7386.0

Results. As shown in Table 3, RSE consistently outperforms or matches all other baselines in terms of classification accuracy while requiring less computation time to achieve the same accuracy. The first interesting observation is that our method performs substantially better than SSK and ASK, often by a large margin; for example, RSE achieves 25% - 33% higher accuracy than SSK and ASK on the three protein datasets. This is because the (k,m)-mismatch string kernel is sensitive to strings of long length, which often causes the feature space of the short sub-strings (k-mers) to grow exponentially and leads to the diagonal dominance problem. More importantly, using only small sub-strings extracted from the original strings results in an inherently local perspective and fails to capture the global properties of strings of long length. Secondly, to achieve the same accuracy, the required runtime of RSE can be significantly less than that of SSK and ASK. For instance, on the dataset superfamily, RSE achieves an accuracy of 46.56% using 3.7 seconds, while SSK and ASK achieve the similar accuracies of 44.63% and 44.79% using 140.0 and 257.0 seconds, respectively. Thirdly, RSE achieves much better performance than KSVM on all datasets, highlighting the importance of a truly p.d. kernel compared to an indefinite kernel, even in the Krein space. Finally, compared to the two state-of-the-art deep learning models, RSE still shows clear advantages over LSTM and GRU, especially for strings of long length. RSE achieves better accuracy than LSTM and GRU on 7 of the 9 datasets, the exceptions being dna3-class3 and mnist-str8. It is well known that deep learning based approaches typically require a large amount of training data, which could be one of the important reasons why they perform worse on the relatively small datasets but slightly better or comparably on large datasets such as dna3-class3 and mnist-str8.

4.3 Accuracy and Runtime of RSE When Varying R

Setup. We now conduct experiments to investigate the behavior of the four best variants of RSE by varying the number R of random strings. The hyperparameter D_max is obtained from the previous cross-validations on the training set. We set R in the range [4, 8192]. We report both testing accuracy and runtime when increasing the random string embedding size R.

Figure 1: Test accuracy and computational runtime of RSE(SS-DF), RSE(SS-SF), RSE(BSS-DF), and RSE(BSS-SF) when varying R. Panels (a)-(d) show testing accuracy vs. R and panels (e)-(h) show total runtime vs. R on ding-protein, splice, dna3-class2, and mnist-str4. [Plots not reproduced.]

Results. Fig. 1 shows how the testing accuracy and runtime change when increasing R. We can see that all selected variants of RSE converge very quickly when increasing R from a small number (R = 4) to a relatively large number. Interestingly, the block sub-strings (BSS) sampling strategy typically leads to better convergence at the beginning, since BSS can produce multiple random strings in every sampling step, which sometimes helps considerably in boosting performance. However, when increasing R further, all variants converge similarly to the optimal performance of the exact kernel. This confirms our analysis in Theorem 1 that the RSE approximation guarantees fast convergence to the exact kernel. Another important observation is that all variants of RSE scale linearly with the increase of the random string embedding size R, a particularly important property for scaling up large-scale string kernels. On the other hand, one can easily achieve a good trade-off between the desired testing accuracy and the available computational time, depending on the actual demands of the underlying application.

4.4 Scalability of RSE When Varying the Number of Strings N and the Length of Strings L

Setup. Next, we evaluate the scalability of RSE when varying the number of strings N and the length of a string L on a randomly generated string dataset. We change the number of strings in the range N = [128, 131072] and the length of a string in the range L = [128, 8192], respectively. When generating the random string dataset, we choose its alphabet to be the same as that of the protein strings. We also set D_max = 10 and R = 256 for the hyperparameters related to RSE. We report the runtime for computing the string embeddings using four variants of our method RSE.

Figure 2: Runtime for computing RSE string embeddings when varying (a) the number of strings N and (b) the length of strings L (default values: N = 10000, L = 512). Linear and quadratic complexity are also plotted for easy comparison. [Plots not reproduced.]
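For illustration, a hedged sketch of the timing sweep described in the setup above follows; it again assumes the rse_embedding helper from the Section 3.2 sketch (imported from a hypothetical module) and uses synthetic strings over a 20-letter alphabet as a stand-in for the protein alphabet.

```python
import time
import numpy as np

# Hypothetical module holding the rse_embedding sketch from Section 3.2.
from rse_sketch import rse_embedding

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 letters, mimicking protein strings

def synthetic_strings(N: int, L: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    return ["".join(rng.choice(list(ALPHABET), size=L)) for _ in range(N)]

def time_embedding(N: int, L: int, R: int = 256, D_max: int = 10) -> float:
    """Wall-clock time of one RSE embedding; expected to grow roughly as O(N R L D_max)."""
    strings = synthetic_strings(N, L)
    start = time.perf_counter()
    rse_embedding(strings, alphabet=ALPHABET, R=R, D_max=D_max)
    return time.perf_counter() - start

# Sweep N with L fixed, then L with N fixed, mirroring Fig. 2(a) and 2(b)
# (smaller ranges than the paper's, purely for illustration).
for N in [128, 256, 512, 1024]:
    print("N =", N, "time =", time_embedding(N, L=512))
for L in [128, 256, 512, 1024]:
    print("L =", L, "time =", time_embedding(1000, L))
```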
Results. As shown in Fig. 2, we have two important observations about the scalability of RSE. First, Fig. 2a clearly shows that RSE scales linearly when increasing the number of strings N. Second, Fig. 2b empirically corroborates that RSE also achieves linear scalability in terms of the length of the strings L. These empirical results provide strong evidence that RSE, derived from our newly proposed global string kernel, indeed scales linearly in both the number of string samples and the length of the strings. Our method opens the door for developing a new family of string kernels that enjoy both higher accuracy and linear scalability on real-world string data.

5 CONCLUSIONS

In this paper, we present a new family of positive-definite string kernels that take into account the global properties hidden in the data strings through global alignments measured by the edit distance. Our Random String Embedding, derived from the proposed kernel through Random Feature approximation, enjoys the double benefits of producing higher classification accuracy and scaling linearly in terms of both the number of strings and the length of a string. Our newly defined global string kernels pave a simple yet effective way to handle real-world large-scale string data. Several interesting future directions are listed below: i) our method can be further exploited with other distance measures that consider global or local alignments; ii) other non-linear solvers can be applied to potentially improve the classification performance of our embedding compared to the currently used linear SVM solver; iii) our method can be applied in application domains like computational biology for domain-specific problems.

REFERENCES
[1] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27.
[2] Jie Chen, Lingfei Wu, Kartik Audhkhasi, Brian Kingsbury, and Bhuvana Ramabhadran. 2016. Efficient one-vs-one kernel ridge regression for speech recognition. In ICASSP. IEEE, 2454–2458.
[3] Yihua Chen, Eric K Garcia, Maya R Gupta, Ali Rahimi, and Luca Cazzanti. 2009. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research 10, Mar (2009), 747–776.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 (2014).
[5] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555 (2014).
[6] Corinna Cortes, Patrick Haffner, and Mehryar Mohri. 2004. Rational kernels: Theory and algorithms. Journal of Machine Learning Research 5, Aug (2004), 1035–1062.
[7] Nello Cristianini, John Shawe-Taylor, et al. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
[8] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, Aug (2008), 1871–1874.
[9] Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, and Imdad Ullah Khan. 2017. Efficient approximation algorithms for strings kernel based sequence classification. In NIPS. 6938–6948.
[10] Andrew Frank and Arthur Asuncion. 2010. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science 213 (2010).
[11] Leo Gordon, Alexey Ya Chervonenkis, Alex J Gammerman, Ilham A Shahmuradov, and Victor V Solovyev. 2003. Sequence alignment kernel for recognition of promoter regions. Bioinformatics 19, 15 (2003), 1964–1971.
[12] Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In ICML. ACM, 377–384.
[13] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 10 (2017), 2222–2232.
[14] Bernard Haasdonk and Claus Bahlmann. 2004. Learning with distance substitution kernels. In Joint Pattern Recognition Symposium. Springer, 220–227.
[15] David Haussler. 1999. Convolution kernels on discrete structures. Technical Report. Department of Computer Science, University of California at Santa Cruz.
[16] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. 2014. Kernel methods match Deep Neural Networks on TIMIT. In ICASSP. 205–209.
[17] Catalin Ionescu, Alin Popa, and Cristian Sminchisescu. 2017. Large-scale data-dependent kernel approximation. In AIStats. 19–27.
[18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014).
[19] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina Leslie. 2005. Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology 3, 03 (2005), 527–550.
[20] Pavel P Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. 2009. Scalable algorithms for string kernels with inexact matching. In NIPS. 881–888.
[21] Christina Leslie, Eleazar Eskin, and William Stafford Noble. 2001. The spectrum kernel: A string kernel for SVM protein classification. In Biocomputing 2002. World Scientific, 564–575.
[22] Christina Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. 2003. Mismatch string kernels for SVM protein classification. In NIPS. Neural Information Processing Systems Foundation.
[23] Christina Leslie and Rui Kuang. 2004. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research 5, Nov (2004), 1435–1455.
[24] Christina S Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. 2004. Mismatch string kernels for discriminative protein classification. Bioinformatics 20, 4 (2004), 467–476.
[25] Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707–710.
[26] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research 2, Feb (2002), 419–444.
[27] Gaëlle Loosli, Stéphane Canu, and Cheng Soon Ong. 2016. Learning SVM in Krein spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 6 (2016), 1204–1216.
[28] Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 3 (1970), 443–453.
[29] Michel Neuhaus and Horst Bunke. 2006. Edit distance-based kernel functions for structural pattern classification. Pattern Recognition 39, 10 (2006), 1852–1863.
[30] Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In NIPS. 1177–1184.
[31] Alessandro Rudi and Lorenzo Rosasco. 2017. Generalization properties of learning with random features. In NIPS. 3218–3228.
[32] Hiroto Saigo, Jean-Philippe Vert, Nobuhisa Ueda, and Tatsuya Akutsu. 2004. Protein homology detection using string alignment kernels. Bioinformatics 20, 11 (2004), 1682–1689.
[33] Bernhard Schölkopf, Koji Tsuda, and Jean-Philippe Vert (Eds.). 2004. Kernel methods in computational biology. MIT Press.
[34] Si Si, Cho-Jui Hsieh, and Inderjit S Dhillon. 2017. Memory efficient kernel approximation. The Journal of Machine Learning Research 18, 1 (2017), 682–713.
[35] Temple F Smith and Michael S Waterman. 1981. Comparison of biosequences. Advances in Applied Mathematics 2, 4 (1981), 482–489.
[36] Alex J Smola and SVN Vishwanathan. 2003. Fast kernels for string and tree matching. In NIPS. 585–592.
[37] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[38] Michael S Waterman, Jana Joyce, and Mark Eggert. 1991. Computer alignment of sequences. Phylogenetic Analysis of DNA Sequences (1991), 59–72.
[39] Chris Watkins. 1999. Dynamic alignment kernels. NIPS (1999), 39–50.
[40] Jason Weston, Bernhard Schölkopf, Eleazar Eskin, Christina Leslie, and William Stafford Noble. 2003. Dealing with large diagonals in kernel matrices. Annals of the Institute of Statistical Mathematics 55, 2 (2003), 391–408.
[41] Christopher KI Williams and Matthias Seeger. 2001. Using the Nyström method to speed up kernel machines. In NIPS. 682–688.
[42] Lingfei Wu, Pin-Yu Chen, Ian En-Hsu Yen, Fangli Xu, Yinglong Xia, and Charu Aggarwal. 2018. Scalable spectral clustering using random binning features. In KDD. ACM, 2506–2515.
[43] Lingfei Wu, Ian EH Yen, Jie Chen, and Rui Yan. 2016. Revisiting random binning features: Fast convergence and strong parallelizability. In KDD. ACM, 1265–1274.
[44] Lingfei Wu, Ian EH Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J Witbrock. 2018. Word Mover's Embedding: From Word2Vec to document embedding. EMNLP (2018), 4524–4534.
[45] Lingfei Wu, Ian En-Hsu Yen, Fangli Xu, Pradeep Ravikumar, and Michael Witbrock. 2018. D2KE: From distance to kernel and embedding. arXiv:1802.04956 (2018).
[46] Lingfei Wu, Ian En-Hsu Yen, Jinfeng Yi, Fangli Xu, Qi Lei, and Michael Witbrock. 2018. Random Warping Series: A random features method for time-series embedding. In AIStats. 793–802.
[47] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. 2010. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12, 1 (2010), 40–48.
[48] Li Yujian and Liu Bo. 2007. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 6 (2007), 1091–1095.