On Kernel Methods for Relational Learning
Chad Cumby ([email protected])
Dan Roth ([email protected])
Department of Computer Science, University of Illinois, Urbana, IL 61801 USA

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

Kernel methods have gained a great deal of popularity in the machine learning community as a method to learn indirectly in high-dimensional feature spaces. Those interested in relational learning have recently begun to cast learning from structured and relational data in terms of kernel operations.

We describe a general family of kernel functions built up from a description language of limited expressivity and use it to study the benefits and drawbacks of kernel learning in relational domains. Learning with kernels in this family directly models learning over an expanded feature space constructed using the same description language. This allows us to examine issues of time complexity in terms of learning with these and other relational kernels, and how these relate to generalization ability. The tradeoff between using kernels in a very high-dimensional implicit space and using a restricted feature space is highlighted through two experiments, in bioinformatics and in natural language processing.

1. Introduction

Recently, much interest has been generated in the machine learning community on the subject of learning from relational and structured data via propositional learners (Kramer et al., 2001; Cumby & Roth, 2003a). Examples of relational learning problems include learning to identify functional phrases and named entities from structured parse trees in natural language processing (NLP), learning to classify molecules for mutagenicity from atom-bond data in drug design, and learning a policy to map goals to actions in planning domains. At the same time, work on SVMs and Perceptron-type algorithms has generated interest in kernel methods to simulate learning in high-dimensional feature spaces while working with the original low-dimensional input data.

Haussler's work on convolution kernels (Haussler, 1999) introduced the idea that kernels could be built to work with discrete data structures iteratively from kernels for smaller composite parts. These kernels followed the form of a generalized sum over products, a generalized convolution. Kernels were shown for several discrete datatypes including strings and rooted trees, and more recently Collins and Duffy (2002) developed kernels for datatypes useful in many NLP tasks, demonstrating their usefulness with the Voted Perceptron algorithm (Freund & Schapire, 1998).

While these past examples of relational kernels are formulated separately to meet each problem at hand, we seek to develop a flexible mechanism for building kernel functions for many structured learning problems, based on a unified knowledge representation. At the heart of our approach is a definition of a relational kernel that is specified in a "syntax-driven" manner through the use of a description language. Cumby and Roth (2002) introduced a feature description language and showed how to use propositional classifiers to successfully learn over structured data and produce relational representations, in the sense that different data instantiations yield the same features and have the same weights in the linear classifier learned. There, as in (Roth & Yih, 2001), this was done by significantly blowing up the relational feature space.

Building on the above-mentioned description-language-based approach, this paper develops a corresponding family of parameterized kernel functions for structured data. In conjunction with an SVM or a Perceptron-like learning algorithm, our parameterized kernels can simulate the exact features generated in the blown-up space to learn a classifier, directly from the original structured data.
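As a toy illustration of the "generalized sum over products" form mentioned above (our own sketch, with hypothetical names; it is neither Haussler's exact construction nor the description-language kernels developed in this paper): a convolution-style kernel compares two discrete objects by decomposing each into parts and summing a base kernel over all pairs of parts.

```python
def convolution_kernel(x, y, parts, base_kernel):
    """Generic convolution-style kernel: decompose each object into parts
    and sum a base kernel over all pairs of parts."""
    return sum(base_kernel(a, b) for a in parts(x) for b in parts(y))

# Toy instantiation for strings: parts are contiguous substrings and the base
# kernel is exact match, so the result counts matching substring pairs.
def substrings(s):
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

score = convolution_kernel("spark", "park", substrings, lambda a, b: 1 if a == b else 0)
```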
From among several ways to define the distance between structured domain elements, we follow (Khardon et al., 2001) in choosing a definition that yields exactly the same classifiers Perceptron would have produced had we run it on the blown-up discrete feature space rather than directly on the structured data. The parameterized kernel allows us to flexibly define features over structures (or, equivalently, a metric over structures). At the same time, it allows us to choose the degree to which we want to restrict the size of the expanded space, which affects the degree of efficiency gained or lost - as well as the generalization performance of the classifier - as a result of using it.

Along these lines, we then study time complexity and generalization tradeoffs between working in the expanded feature space and using the corresponding kernels, and between different kernels, in terms of the expansions they correspond to. We show that, while kernel methods provide an interesting and often more comprehensible way to view the feature space, computationally it is often not recommended to use kernels over structured data. This is especially clear when the number of examples is fairly large relative to the data dimensionality.

The remainder of the paper is organized as follows: we first introduce the use of kernel functions in linear classification and the kernel Perceptron algorithm (Khardon et al., 2001). Sec. 3 presents our approach to relational features, which form the higher-dimensional space we seek to simulate. Sec. 4 introduces our kernel function for relational data. Sec. 5 discusses complexity tradeoffs surrounding the kernel Perceptron and the standard feature-based Perceptron, and the issue of generalization. In Sec. 6 we validate our claims by applying the kernel Perceptron algorithm with our enhanced kernel to two problems where a structured feature space is essential.

2. Kernels & Kernel Perceptron

Most machine learning algorithms make use of feature-based representations. A domain element is transformed into a collection of Boolean features, and a labeled example is given by ⟨x, l⟩ ∈ {0,1}^n × {-1,1}. In recent years it has become very common, especially in large-scale applications in NLP and computer vision, to use learning algorithms such as variations of Perceptron and Winnow (Novikoff, 1963; Littlestone, 1988) that use representations that are linear over their feature space (Roth, 1998). In these cases, working with features that directly represent domain elements may not be expressive enough, and there are standard ways of enhancing the capabilities of such algorithms. A typical way, which has been used in the NLP domain (Roth, 1998; Roth, 1999), is to expand the set of basic features x_1, ..., x_n using feature functions χ_i : {x_1, ..., x_n} → {0,1}, i ∈ I. These feature functions could be, for example, conjunctions such as x_1 x_3 x_4 (that is, the feature is evaluated to 1 when the conjunction is satisfied on the example, and to 0 otherwise). Once this expansion is done, one can use these expanded higher-dimensional examples, in which each feature function plays the role of a basic feature, for learning. This approach clearly leads to an increase in expressiveness and thus may improve performance. However, it also dramatically increases the number of features (from n to |I|; e.g., to 3^n if all conjunctions are used, or O(n^k) if conjunctions of size k are used) and thus may adversely affect both the computation time and convergence rate of learning.

Perceptron is a well-known on-line learning algorithm that makes use of the aforementioned feature-based representation of examples. Throughout its execution Perceptron maintains a weight vector w ∈ ℝ^n which is initially (0, ..., 0). Upon receiving an example x ∈ {0,1}^n, the algorithm predicts according to the linear threshold function w · x ≥ 0. If the prediction is 1 and the label is -1 then the vector w is set to w - x, while if the prediction is -1 and the label is 1 then w is set to w + x. No change is made if the prediction is correct.

It is easily observed, and well known, that the hypothesis w of the Perceptron is a sum of the previous examples on which prediction mistakes were made. Let L(x) ∈ {-1,1} denote the label of example x; then w = Σ_{v∈M} L(v) v, where M is the set of examples on which the algorithm made a mistake.
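To make the preceding paragraphs concrete, here is a minimal sketch (in Python; the function names and the choice of conjunctions of size at most k = 2 are ours, not the paper's) of Perceptron run over an explicitly expanded conjunctive feature space. It follows the update rule exactly as stated: the weight vector starts at zero and is adjusted by the expanded example only on mistakes.

```python
from itertools import combinations

def expand(x, k=2):
    """Explicit feature map phi(x): all monotone conjunctions of up to k
    basic Boolean features. A conjunction evaluates to 1 iff every feature
    it mentions is 1 on x; the representation grows from n to O(n^k)."""
    n = len(x)
    feats = []
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):
            feats.append(1 if all(x[i] for i in idx) else 0)
    return feats

def perceptron(examples, k=2):
    """Perceptron over the expanded space, using the update rule described
    above. `examples` is a list of (x, label) pairs with x a 0/1 list and
    label in {-1, +1}. The returned weight vector is a signed sum of the
    expanded examples on which mistakes were made."""
    w = [0] * len(expand(examples[0][0], k))   # initially (0, ..., 0)
    for x, label in examples:
        phi_x = expand(x, k)
        pred = 1 if sum(wi * fi for wi, fi in zip(w, phi_x)) >= 0 else -1
        if pred == 1 and label == -1:          # w <- w - phi(x)
            w = [wi - fi for wi, fi in zip(w, phi_x)]
        elif pred == -1 and label == 1:        # w <- w + phi(x)
            w = [wi + fi for wi, fi in zip(w, phi_x)]
    return w

# Example: a few labeled Boolean vectors whose positive class requires the
# conjunction x_1 x_3, which no single basic feature captures.
data = [([1, 0, 1, 0], 1), ([1, 1, 0, 0], -1), ([0, 0, 1, 1], -1), ([1, 0, 1, 1], 1)]
w_phi = perceptron(data)
```

Storing w_phi explicitly is what becomes prohibitive as k grows; the kernel formulation derived next avoids it.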
Thus the prediction of Perceptron on x is 1 iff w · x = (Σ_{v∈M} L(v) v) · x = Σ_{v∈M} L(v)(v · x) ≥ 0. For an example x ∈ {0,1}^n, let φ(x) denote its transformation into an enhanced feature space, φ(x) = {χ_i(x)}_{i∈I}.

To run the Perceptron algorithm over the enhanced feature space we must predict 1 iff w_φ · φ(x) ≥ 0, where w_φ is the weight vector in the enhanced space; from the above discussion this holds iff Σ_{v∈M} L(v)(φ(v) · φ(x)) ≥ 0. Denoting

    K(v, x) = φ(v) · φ(x)    (1)

this holds iff Σ_{v∈M} L(v) K(v, x) ≥ 0. We call this version of Perceptron, which uses a kernel function to expand the feature space and thus enhances the learning abilities of Perceptron, a kernel Perceptron (Khardon et al., 2001).

We have shown a bottom-up construction of the "kernel trick". This way, we never need to construct the enhanced feature space explicitly; we need only be able to compute the kernel function K(v, x) efficiently. This trick can be applied to any algorithm whose prediction is a function of inner products of examples; see, e.g., (Cristianini & Shawe-Taylor, 2000) for a discussion. In principle, one can define the kernel K(x, v) in different ways, as long as it satisfies some conditions. In this paper we will concentrate on a definition that follows

[Figure: a concept graph with nodes N1 (phrase(NP)) and N2 (phrase(VP)) linked by phr-before/phr-after edges; word nodes N3-N7 are attached by contains (and first) edges and ordered by before edges.]

...mation about two individuals. The description logic, as defined below, provides a framework for representing the same semantic individuals as are represented by nodes in a concept graph.
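Returning to the kernel Perceptron of Sec. 2, the dual form can be sketched in a few lines (Python, illustrative only). We assume the enhanced space is the monotone-conjunction expansion used in the earlier sketch; under that assumption φ(v) · φ(x) depends only on the number c of coordinates where both vectors are 1 and equals Σ_{j=1..k} C(c, j), so the kernel is computed without ever materializing the expanded space. This is the flavor of Boolean kernel studied by Khardon et al. (2001); it is not the relational kernel this paper goes on to define.

```python
from math import comb

def conj_kernel(v, x, k=2):
    """K(v, x) = phi(v) . phi(x) for the 'all monotone conjunctions of size
    <= k' expansion: a conjunction is active in both examples iff all of its
    features lie where v and x are both 1, so the inner product is a count."""
    c = sum(1 for vi, xi in zip(v, x) if vi == 1 and xi == 1)
    return sum(comb(c, j) for j in range(1, k + 1))

def kernel_perceptron(examples, k=2):
    """Kernel Perceptron: the hypothesis is stored as the mistake set M with
    labels; prediction is the sign of sum_{v in M} L(v) K(v, x)."""
    mistakes = []                                  # pairs (v, L(v))
    for x, label in examples:
        score = sum(lv * conj_kernel(v, x, k) for v, lv in mistakes)
        pred = 1 if score >= 0 else -1
        if pred != label:                          # mistake: w_phi gains label * phi(x), implicitly
            mistakes.append((x, label))
    return mistakes

def predict(mistakes, x, k=2):
    """Classify x using only kernel evaluations against the mistake set."""
    return 1 if sum(lv * conj_kernel(v, x, k) for v, lv in mistakes) >= 0 else -1
```

With this formulation each prediction costs one kernel evaluation per stored mistake rather than a dot product in the O(n^k)-dimensional expanded space, which is the tradeoff Sec. 5 weighs: the kernel is attractive when the expansion is huge, but can lose when the number of examples is large relative to the data dimensionality.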