Efficient Global String Kernel with Random Features: Beyond Counting Substructures

Lingfei Wu∗ (IBM Research), Ian En-Hsu Yen (Carnegie Mellon University), Siyu Huo (IBM Research), Liang Zhao (George Mason University), Kun Xu (IBM Research), Liang Ma (IBM Research), Shouling Ji† (Zhejiang University), Charu Aggarwal (IBM Research)
ABSTRACT
Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the string, which hardly capture long discriminative patterns, (ii) sum over too many substructures, such as all possible subsequences, which leads to diagonal dominance of the kernel matrix, or (iii) rely on non-positive-definite similarity measures derived from the edit distance. Furthermore, while there have been works addressing the computational challenge with respect to the length of string, most of them still experience quadratic complexity in terms of the number of training samples when

of longer lengths. In addition, we empirically show that RSE scales linearly with the increase of the number and the length of string.

CCS CONCEPTS
• Computing methodologies → Kernel methods.

KEYWORDS
String Kernel, String Embedding, Random Features

ACM Reference Format:
Lingfei Wu, Ian En-Hsu Yen, Siyu Huo, Liang Zhao, Kun Xu, Liang Ma, Shouling Ji, and Charu Aggarwal. 2019. Efficient Global String Kernel with Random Features: Beyond Counting Substructures.