Efficient Online Learning for Kernels Methods on Structured Data

Efficient Online Learning For Mapping Kernels On Linguistic Structures Giovanni Da San Martino Alessandro Sperduti, Fabio Aiolli Alessandro Moschitti Qatar Computing Research Institute, Department of Mathematics, Amazon Hamad Bin Khalifa University University of Padova Manhattan Beach, CA, USA Doha, Qatar via Trieste, 63, Padova, Italy [email protected] [email protected] fsperduti, [email protected] Abstract Moschitti 2006), paraphrase detection (Filice, Da San Mar- tino, and Moschitti 2015), computational argumentation Kernel methods are popular and effective techniques for learn- (Wachsmuth et al. 2017). ing on structured data, such as trees and graphs. One of their major drawbacks is the computational cost related to making One drawback of using tree kernels (and kernels for a prediction on an example, which manifests in the classifica- structures in general) is the time complexity required both tion phase for batch kernel methods, and especially in online in learning and classification. Such complexity can some- learning algorithms. In this paper, we analyze how to speed times prevent the kernel application in scenarios involv- up the prediction when the kernel function is an instance of ing large amount of data. Typical approaches devoted to the Mapping Kernels, a general framework for specifying ker- the solution of this problem relate to: (1) the use of fast nels for structured data which extends the popular convolution learning/classification algorithms, e.g. the (voted) perceptron kernel framework. We theoretically study the general model, (Collins and Duffy 2002); (2) the reduction of the computa- derive various optimization strategies and show how to apply tional time required for computing a kernel (see e.g. (Shawe- them to popular kernels for structured data. Additionally, we derive a reliable empirical evidence on semantic role labeling Taylor and Cristianini 2004) and references therein). task, which is a natural language classification task, highly The first approach is viable since the use of a large num- dependent on syntactic trees. The results show that our faster ber of training examples can fill the accuracy gap between approach can clearly improve on standard kernel-based SVMs, fast on-line algorithms, which do not explicitly maximize which cannot run on very large datasets. margin, and more expensive algorithms like Support Vector Machines (SVMs), which have to solve a quadratic optimiza- 1 Introduction tion problem. Therefore, although the latter is supported by a solid theory and state-of-the-art accuracy on small datasets, Many data mining applications involve the processing of learning approaches able to work with much larger data can structured or semi-structured objects, e.g. proteins and phylo- outperform them. genetic trees in Bioinformatics, molecular graphs in Chem- The second approach relates to either the design of more istry, hypertextual and XML documents in Information Re- efficient algorithms based on more appropriate data structures trieval and parse trees in NLP. In all these areas, the huge or the computation of kernels over set of trees, e.g., (Kazama amount of available data jointly with a poor understanding and Torisawa 2006), (Moschitti and Zanzotto 2007). The lat- of the processes generating them, typically enforces the use ter techniques allow for the factorization/reuse of previously of machine learning and/or data mining techniques. evaluated subparts leading to a faster overall processing. The main complexity on applying machine learning algo- Our proposal is in line with the second approach. How- rithms to structured data resides in the design of effective ever, it will be applicable to both online and batch learning features for their representation. Kernel methods seem a valid algorithms. One of the most popular approaches to the defini- approach to alleviate such complexity since they allow to in- tion of kernel functions for structured data is the convolution ject background knowledge into a learning algorithm and pro- kernel framework (Haussler 1999). A convolution kernel can vide an implicit object representation with the possibility to be defined by decomposing a structured object into substruc- work implicitly in very large feature spaces. These interesting tures and then summing the contributions of kernel evalua- properties have triggered a lot of research on kernel methods tions on the substructures. We exploit the idea that identical for structured data (Jaakkola, Diekhans, and Haussler 2000; substructures appearing in different kernel evaluations can Haussler 1999) and kernels for Bioinformatics (Kuang et al. be factorized and thus computed only once, by postulating 2004). In particular, tree kernels have shown to be very effec- a necessary condition of optimality for a data structure and tive for NLP tasks, e.g., parse re-ranking (Collins and Duffy deriving results from it on an extension of the convolution 2002), Semantic Role Labeling (Kazama and Torisawa 2005; kernel framework, the Mapping Kernels (Shin and Kuboyama Zhang et al. 2007), Entailment Recognition (Zanzotto and 2010). Besides the general theoretical result, we show how Copyright © 2019, Association for the Advancement of Artificial to apply our results to a number of popular kernels for trees. Intelligence (www.aaai.org). All rights reserved. Moreover, we give empirical evidence that such optimal structures yield in practice a significant speed up and reduction Shawe-Taylor 2000). It is trivial to show that, for online learn- of memory consumption for the learning algorithms for the ing algorithms, the cardinality of M, and consequently the task of Semantic Role Labelling: we show that the learning memory required for its storage, grows up linearly with the time for a perceptron on a 4-million-instances dataset reduces number of examples in the input stream; thus the efficiency from more than a week to 14 hours only. Since an SVM could in the evaluation of the function S(x) linearly decreases as only be trained on a significantly smaller number of exam- well. It is therefore of great importance to be able to speed ples, a simple voted perceptron significantly outperforms an up the computation of eq. (1), which is our goal in this paper. SVM also with respect to classification performances. We study the case in which the input examples are structured data, specifically trees. 2 Notation 3.2 Mapping Kernels (MKs) A tree T is a directed and connected graph without cycles for which every node has one incoming edge, except a single The main idea of mapping kernels (Shin and Kuboyama node, i.e. the root, with no incoming edges. A leaf is a node 2010) is to decompose a structured object into substructures with no outgoing edges. A descendant of a node v is any and then sum the contributions of kernel evaluations on a node connected to v by a path. A production at a node v is subset of the substructures. More formally, let us assume 0 0 the tree composed by v and its direct children. A partial tree a “local“ kernel on the subparts, k : χ × χ ! R, and a (PT) t is a subset of nodes in the tree T , with correspond- binary relation R ⊆ χ^ × χ are available, with (^x; x) 2 R 0 ing edges, which forms a tree (Moschitti 2006). A proper if x^ “is a substructure“ of x. The definition of χ^, χ and R subtree (ST) rooted at a node t comprises t and all its descen- varies according to the kernel and the domain, an example 0 dants (Viswanathan and Smola 2003). A subset tree (SST) of of a common setting is the following: χ, χ^ and χ are sets a tree T is a structure which: i) is rooted in a node of T ; ii) of trees and (^x; x) 2 R if x^ 2 χ^ is a subtree of x 2 χ. χ^ = fx^ 2 χ^j(^x; x) 2 Rg satisfies the constraint that each of its nodes contains either Let x be the set of substructures x γ :χ ^ ! χ0 all children of the original tree or none of them (Collins and associated with and let x x a function mapping Duffy 2002). Finally, an elastic tree (ET) only requires the a substructure of x into its representation in the input space nodes of the subtree to preserve the relative positioning in of the local kernel (see (Shin and Kuboyama 2010) for some the original tree (Kashima and Koyanagi 2002). This allows, examples). Then the mapping kernel is defined as follows: for example, to connect a node with any of its descendants X K(x ; x ) = k(γ (^x ); γ (^x )); (2) even if there is no direct edge in the original tree. i j xi i xj j (^xi;x^j )2Mxi;xj 3 Background where M is part of a mapping system M defined as 3.1 Kernelized Perceptron = χ, χ^ jx 2 χ ; M x Kernel methods rely on kernel functions, which are sym- n o (3) metric positive semidefinite similarity functions defined on M ⊆ χ^ × χ^ j(x ; x ) 2 χ × χ : xi;xj x x i j pairs of input examples. The kernelized version (Kivinen, i j Smola, and Williamson 2004) of the popular online learn- M is a triplet composed by the domain of the examples, ing perceptron algorithm (Rosemblatt 1958), can be de- the space of the substructures, and a binary relation M scribed as follows. Let X be a stream of example pairs specifying for which pairs of substructures the local ker- (xi; yi), yi 2 {−1; +1g, then the prediction of the percep- nel has to be computed. M is assumed to be finite and tron for a new example x is the sign of the score function: symmetric, i.e. 8xi; xj 2 χ.jMxi;xj j < 1 and (x ^j; xî) 2 Pn−1 M if (x ^ ; x^ ) 2 M .

Load more