
On Kernel Methods for Relational Learning

Chad Cumby [email protected]
Dan Roth [email protected]
Department of Computer Science, University of Illinois, Urbana, IL 61801 USA

Abstract

Kernel methods have gained a great deal of popularity in the machine learning community as a method to learn indirectly in high-dimensional feature spaces. Those interested in relational learning have recently begun to cast learning from structured and relational data in terms of kernel operations.

We describe a general family of kernel functions built up from a description language of limited expressivity and use it to study the benefits and drawbacks of kernel learning in relational domains. Learning with kernels in this family directly models learning over an expanded feature space constructed using the same description language. This allows us to examine issues of time complexity in terms of learning with these and other relational kernels, and how these relate to generalization ability. The tradeoff between using kernels in a very high dimensional implicit space versus a restricted feature space is highlighted through two experiments, in bioinformatics and in natural language processing.

1. Introduction

Recently, much interest has been generated in the machine learning community on the subject of learning from relational and structured data via propositional learners (Kramer et al., 2001; Cumby & Roth, 2003a). Examples of relational learning problems include learning to identify functional phrases and named entities from structured parse trees in natural language processing (NLP), learning to classify molecules for mutagenicity from atom-bond data in drug design, and learning a policy to map goals to actions in planning domains. At the same time, work on SVMs and Perceptron type algorithms has generated interest in kernel methods as a way to simulate learning in high-dimensional feature spaces while working with the original low-dimensional input data.

Haussler's work on convolution kernels (Haussler, 1999) introduced the idea that kernels for discrete data structures could be built up iteratively from kernels for smaller composite parts. These kernels followed the form of a generalized sum over products - a generalized convolution. Kernels were shown for several discrete datatypes including strings and rooted trees, and more recently (Collins & Duffy, 2002) developed kernels for datatypes useful in many NLP tasks, demonstrating their usefulness with the Voted Perceptron algorithm (Freund & Schapire, 1998).

While these past examples of relational kernels were formulated separately to meet each problem at hand, we seek to develop a flexible mechanism for building kernel functions for many structured learning problems, based on a unified knowledge representation. At the heart of our approach is a definition of a relational kernel that is specified in a "syntax-driven" manner through the use of a description language. (Cumby & Roth, 2002) introduced a feature description language and showed how to use propositional classifiers to successfully learn over structured data and produce a relational representation, in the sense that different data instantiations yield the same features and have the same weights in the linear classifier learned. There, as in (Roth & Yih, 2001), this was done by significantly blowing up the relational feature space.

Building on the abovementioned description language based approach, this paper develops a corresponding family of parameterized kernel functions for structured data. In conjunction with an SVM or a Perceptron-like learning algorithm, our parameterized kernels can simulate the exact features generated in the blown up space to learn a classifier directly from the original structured data. From among several ways to define the distance between structured domain elements, we follow (Khardon et al., 2001) in choosing a definition that provides exactly the same classifiers Perceptron would have produced had we run it on the blown up discrete feature space rather than directly on the structured data. The parameterized kernel allows us to flexibly define features over structures (or, equivalently, a metric over structures). At the same time, it allows us to choose the degree to which we want to restrict the size of the expanded space, which affects the degree of efficiency gained or lost - as well as the generalization performance of the classifier - as a result of using it.

Along these lines, we then study time complexity and generalization tradeoffs between working in the expanded feature space and using the corresponding kernels, and between different kernels, in terms of the expansions they correspond to. We show that, while kernel methods provide an interesting and often more comprehensible way to view the feature space, computationally and in terms of generalization it can be preferable to work with a suitably restricted explicit feature space.

2. Kernel Perceptron

A common approach is to expand the original set of basic features x1, ..., xn with a collection I of feature functions (e.g., conjunctions over the basic features), producing expanded higher-dimensional examples in which each feature function plays the role of a basic feature for learning. This approach clearly leads to an increase in expressiveness and thus may improve performance. However, it also dramatically increases the number of features (from n to |I|; e.g., to 3^n if all conjunctions are used, or O(n^k) if conjunctions of size k are used) and thus may adversely affect both the computation time and convergence rate of learning.

Perceptron is a well known on-line learning algorithm that makes use of the aforementioned feature based representation of examples. Throughout its execution Perceptron maintains a weight vector w, initially the all-zeros vector; it predicts positive on an example x iff w · x > 0, and whenever it makes a mistake it adds x to (or subtracts x from) w. Since w is always a signed sum of previously misclassified examples, the prediction can be rewritten purely in terms of dot products between examples. Replacing each such dot product with a kernel function

    K(x, y) = Σ_{i∈I} χ_i(x) χ_i(y)    (1)

where {χ_i}_{i∈I} is a collection of feature functions, yields the kernel Perceptron algorithm, which simulates Perceptron over the enhanced feature space defined by the χ_i while manipulating only the original examples.
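To make the dual form concrete, the following is a minimal Python sketch of kernel Perceptron (illustrative only, not the implementation used in the experiments below); the argument k stands for any kernel satisfying Eq. 1, such as the FDL kernels defined in Sec. (4), and all identifiers are ours.

# Minimal kernel Perceptron sketch (illustrative; not the authors' code).
# `k` is any kernel k(x, y) equal to a dot product in some enhanced feature
# space, e.g. the FDL kernels k_D defined in Sec. (4).

def kernel_perceptron(examples, labels, k, epochs=1):
    """Return the mistake list [(x_j, y_j), ...] that defines the classifier."""
    mistakes = []
    for _ in range(epochs):
        for x, y in zip(examples, labels):        # labels y in {-1, +1}
            # Dual prediction: w . phi(x) = sum_j y_j * k(x_j, x) over mistakes
            score = sum(yj * k(xj, x) for xj, yj in mistakes)
            if y * score <= 0:                    # mistake: implicitly w += y * phi(x)
                mistakes.append((x, y))
    return mistakes

def predict(mistakes, k, x):
    return 1 if sum(yj * k(xj, x) for xj, yj in mistakes) > 0 else -1

Running this with the explicit dot product over expanded feature vectors, or with the corresponding kernel over the original structured examples, produces the same sequence of mistakes and hence the same classifier.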

Our construction follows the discussion above. We will first define a feature space, then define a collection of feature functions and a mapping φ to an enhanced feature space, and then show that our kernel definition satisfies Eq. 1. The discussion above implies that running kernel Perceptron then yields identical results to running Perceptron on the enhanced feature space.

3. Relational Features

In order to learn in structured domains such as natural language, object recognition, and computational biology, one must decide what representation is to be used both for the concept to be learned and for the instances that will be learned from. Given the intractability of traditional Inductive Logic Programming (ILP) approaches under many conditions, recent approaches attempt to develop ways to use efficient propositional algorithms over relational data. One tactic that has shown itself to be particularly useful is the method of "propositionalization" (Kramer et al., 2001; Cumby & Roth, 2003a). This method suggests representing relational data in the form of quantified propositions (Khardon et al., 1999), which can be used with standard propositional and probabilistic learners. This technique allows the expansion of the space of possible propositional features to include many structured features defined over basic features abstracted from the input data instances. One method of performing this expansion of relational features is through the notion of a Feature Description Logic (FDL).

By the method detailed in (Cumby & Roth, 2002), this Description Logic allows one to define "types" of features which, through an efficient generation function, are instantiated by comparing the feature definition against data instances. These data instances are in general any type of structured data, represented by a graph-based data structure known as a concept graph. FDL descriptions describe the same semantic individuals as are represented by nodes in a concept graph.

For an example of a concept graph, consider the sentence represented by the dependency graph in Fig. 1. This digraph contains nodes to represent each word, along with two nodes to represent the noun and verb phrases present in the sentence. In general, each node is labeled with attribute information about some individual, and each edge is labeled with relational information.

Figure 1. A concept graph for a partial parse of a sentence. (Word nodes for "The boy ran away quickly", each labeled with word and tag sensors such as word(The)/tag(DET), ..., word(quickly)/tag(ADV), are linked by before/after edges; NP and VP phrase nodes are linked to their words by contains edges.)

Definition 1 An FDL description over the attribute alphabet Attr = {a1, ..., an}, the value alphabet Val = {v1, ..., vn}, and the role alphabet Role = {r1, ..., rn} is defined inductively as follows:
1. For an attribute symbol ai and a value symbol vj, ai(vj) is a description called a sensor. ai alone is also a description, an existential sensor. (A sensor represents the set of x ∈ X for which ai(x, vj) is true. An existential sensor represents the set of x ∈ X s.t. ∃vj s.t. ai(x, vj) is true.)
2. If D is a description and ri is a role symbol, then (ri D) is a role description. (Represents the set of x ∈ X such that ri(x, y) is true for some y ∈ X that is described by D.)
3. If D1, ..., Dn are descriptions, then (AND D1 ... Dn) is a description. (The conjunction of several descriptions.)

In order to generate useful features from the structured instance above we could, for example, define a description such as (AND phrase (contains word)). Each feature to be generated from it is a Boolean valued function indexed by a syntactic description. Below are the features generated through the Feature Generating Function (FGF) mechanism described in (Cumby & Roth, 2002). To do so, the generating description is matched against each node of the given instance graph. Feature (1) below means essentially, "In the given data instance, output 1 if ∃x, y ∈ X such that phrase(x, NP) ∧ contains(x, y) ∧ word(y, The)".

1. (AND phrase(NP) (contains word(The)))
2. (AND phrase(NP) (contains word(boy)))
3. (AND phrase(VP) (contains word(ran)))
4. (AND phrase(VP) (contains word(away)))
5. (AND phrase(VP) (contains word(quickly)))

We note that it is an easy matter to move from a pure Boolean representation of features to a positive integer valued representation, e.g., by associating with each feature the number of times its substructure occurs in a given instance. This change of feature representation does not change the operation of the Perceptron algorithm and is related to our kernel construction in Sec. (4).

The generation of features in such a manner produces a feature space polynomial in the size of the original input space, with the degree being the generating description size. This is attractive in terms of complexity of generation and also potentially in terms of the ability to learn efficiently in such spaces. We return to this topic in Sec. (5).
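To illustrate how such features are instantiated, here is a small sketch of description-driven feature generation for descriptions of the form (AND a (r b)); it is our own illustration rather than the FGF implementation of (Cumby & Roth, 2002), and the graph encoding and helper names are assumptions made for the example.

# Illustrative sketch of feature generation from a concept graph (not the
# authors' FGF implementation). A concept graph is encoded here as:
#   nodes: dict node_id -> set of sensor strings, e.g. {"word(The)", "tag(DET)"}
#   edges: set of (src, dst, role) triples, e.g. ("NP", "w1", "contains")

def generate_features(nodes, edges, attr, role, sub_attr):
    """Instantiate descriptions of the form (AND attr (role sub_attr)):
    for every node carrying an attr(...) sensor and every role-neighbor
    carrying a sub_attr(...) sensor, emit one grounded feature string."""
    feats = set()
    for n, sensors in nodes.items():
        heads = [s for s in sensors if s.startswith(attr + "(")]
        if not heads:
            continue
        for (src, dst, r) in edges:
            if src != n or r != role:
                continue
            for t in nodes[dst]:
                if t.startswith(sub_attr + "("):
                    for h in heads:
                        feats.add(f"(AND {h} ({role} {t}))")
    return feats

# A two-node fragment of Fig. 1:
nodes = {"NP": {"phrase(NP)"}, "w1": {"word(The)", "tag(DET)"}}
edges = {("NP", "w1", "contains")}
print(generate_features(nodes, edges, "phrase", "contains", "word"))
# -> {'(AND phrase(NP) (contains word(The)))'}

Matching the full graph of Fig. 1 in this way, with its NP and VP contains edges, would yield exactly the five features listed above.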
4. Relational Kernels

We now define a family of kernel functions based on the Feature Description Language introduced above. We utilize the description language introduced in Sec. (3) with a different purpose - to define the operation of a kernel function - but still with the aim of exploiting the syntax of the language to determine the shape of the feature space. The definition of our kernel functions uses the formal notion of the FDL concept graphs introduced earlier, whose definition we summarize here. An FDL concept graph is a labeled directed acyclic graph G = G(N, E, lN(*)), where N is a set of nodes, E ⊆ (N × N × Role) is a set of labeled edges (with role symbols as labels), and lN(*) is a function that maps each node in N to the set of sensor descriptions associated with it. A rooted concept graph is specified by designating a node n0 ∈ N as the root of the graph.

With these two definitions we define the operation of a family of kernels parameterized by an FDL description. We call this parameter the generating description. Each kernel takes as input two directed acyclic concept graphs, which may represent many types of structured data, and outputs a real valued number representing the similarity of the two concept graphs with respect to the generating description. The family of kernels is "syntax-driven" in the sense that the input description specifies the kernel's operation.

Definition 2 Let D be the space of FDL descriptions and G the space of all concept graphs. The family of functions K = {kD | D ∈ D} is defined inductively as follows. For two concept graphs G1, G2 with node sets N1, N2:

    kD(G1, G2) = Σ_{n1∈N1} Σ_{n2∈N2} kD(n1, n2)

where kD(n1, n2) is determined by the form of D:

1. If D is a sensor description of the form s(v) and s(v) ∈ lN1(n1) ∩ lN2(n2), then kD(n1, n2) = 1.
2. If D is an existential sensor description of the form s, and V ⊆ Val is the set of all values v such that s(v) ∈ lN1(n1) ∩ lN2(n2), then kD(n1, n2) = |V|.
3. If D is a role description of the form (r D'): let N1' be the set of nodes {n1' ∈ N | (n1, n1', r) ∈ E}, and let N2' be the set of nodes {n2' ∈ N | (n2, n2', r) ∈ E}. Then kD(n1, n2) = Σ_{n1'∈N1'} Σ_{n2'∈N2'} kD'(n1', n2').
4. If D is a description of the form (AND D1 D2 ... Dn), with li repetitions of any Di, then kD(n1, n2) = Π_i C(kDi(n1, n2), li), where C(a, b) denotes the binomial coefficient "a choose b".

Theorem 3 For any FDL description D and for any two concept graphs G1, G2, the quantity output by kD(G1, G2) is equivalent to the dot product of the two feature vectors φD(G1), φD(G2) output by the feature generating function described in (Cumby & Roth, 2002). Thus for all D, kD necessarily defines a kernel function over the space of concept graphs G.

For a proof of Theorem 3 see (Cumby & Roth, 2003b). We exemplify the claim that kD defines a kernel function by demonstrating that we can simulate two expanded feature spaces directly with our kernel construction. First of all, the technique can generalize the simple propositional case as considered in (Khardon et al., 2001), and it can also generalize learning with word or POS-tag n-grams of the type often considered in NLP applications, given that the words and tags are encoded as sensor descriptions appropriately.

In order to mimic the case of learning from simple Boolean bit vectors such as [x1 x2 ... xn] ∈ {0,1}^n, we perform the following translation. For each such vector x, map each component xi to a relation s(x, vi), where vi corresponds to the truth value of xi on x. We can then use an FDL concept graph consisting of a single node labeled with sensor descriptions corresponding to each s(x, vi) to represent x. At this point it becomes a simple matter to mimic the parameterized kernel of (Khardon et al., 2001) using our feature description kernel. Define the generating descriptions:

    D1 = s;  D2 = (AND s s);  D3 = (AND s s s).

If we evaluate each kDi(n1, n2) for two single-node instances n1, n2 as described above, we expect to capture each conjunction of literals up to size three. Let same(n1, n2) denote the number of sensor descriptions present as labels of both n1 and n2. Then we have

    Σ_{i=1}^{3} kDi(n1, n2) = same(n1, n2) + C(same(n1, n2), 2) + C(same(n1, n2), 3) = Σ_{i=1}^{3} C(same(n1, n2), i).

This directly mirrors the parameterized kernel detailed in (Khardon et al., 2001). The expanded space feature vectors φ(n1), φ(n2) for nodes n1, n2 consist of all conjunctions of sensors up to size 3, and the dot product φ(n1) · φ(n2) is equal to the number of conjunctions active in both vectors. This quantity is equal to Σ_i kDi(n1, n2).

The example of extracting k-conjunctions from bit vectors is markedly different from the example of generating n-gram type features as seen in the earlier example. In the case of n-grams, relational information is implicit in the manner in which combinations of words or other objects are chosen to be output. For example, in the sentence The boy ran quickly, the combination The-boy is a valid bigram whereas boy-The is not. Via our kernel definition, we can simulate the operation of a feature-space based algorithm on bigrams as follows. Consider the generating description

    D = (AND word (before word))

along with two data instances represented as concept graphs G1, G2, where G1 corresponds to the instance in Fig. 1 and G2 to a similar instance for the sentence The boy ran quickly, and observe the output of the function kD(G1, G2) = Σ_{n1∈G1} Σ_{n2∈G2} kD(n1, n2).

For four pairings of n1 and n2, kword(n1, n2) will be non-zero, corresponding to the pairs of nodes labeled with word(The), word(boy), word(ran), and word(quickly). Writing Ni_w for the node of Gi labeled word(w), the output for the first pairing is kD(N1_The, N2_The) = kword(N1_The, N2_The) · k(before word)(N1_The, N2_The) = 1 · kword(N1_boy, N2_boy) = 1 · 1 = 1. The output for the second pairing, kD(N1_boy, N2_boy), is also 1 by a similar evaluation. The output for the third pairing is kD(N1_ran, N2_ran) = kword(N1_ran, N2_ran) · k(before word)(N1_ran, N2_ran) = 1 · kword(N1_away, N2_quickly) = 1 · 0 = 0. And the output for the fourth pairing is kD(N1_quickly, N2_quickly) = kword(N1_quickly, N2_quickly) · 0 = 1 · 0 = 0. Thus kD(G1, G2) = 1 + 1 + 0 + 0 = 2.

If we were to expand the feature space for G1 using the description D and the method referred to in Sec. (3), we would have a feature vector φ(G1) with the following features outputting 1 and all else 0:

• (AND word(The) (before word(boy)))
• (AND word(boy) (before word(ran)))
• (AND word(ran) (before word(away)))
• (AND word(away) (before word(quickly)))

Computing φ(G2) in a similar manner produces a vector with these active features, and all else 0:

• (AND word(The) (before word(boy)))
• (AND word(boy) (before word(ran)))
• (AND word(ran) (before word(quickly)))

Computing the dot product gives φ(G1) · φ(G2) = 2. Thus kD(G1, G2) = φ(G1) · φ(G2), exemplifying Theorem 3.
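The following is a compact sketch of the kernel of Definition 2 (our own encoding of descriptions and concept graphs, not the implementation used in the experiments); run on the two sentence graphs above with the bigram description, it returns 2, matching the explicit dot product.

# Illustrative sketch of the syntax-driven kernel k_D of Definition 2.
# Descriptions: ("sensor", a, v) | ("esensor", a) | ("role", r, D) | ("and", D1, ..., Dn)
# Concept graphs: (nodes, edges), nodes = {id: set of (attr, value) pairs},
# edges = [(src, dst, role), ...].
from math import comb
from collections import Counter

def k_nodes(D, g1, n1, g2, n2):
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    if D[0] == "sensor":                      # rule 1: shared sensor a(v)
        _, a, v = D
        return 1 if (a, v) in nodes1[n1] and (a, v) in nodes2[n2] else 0
    if D[0] == "esensor":                     # rule 2: count shared values of a
        a = D[1]
        shared = {v for (b, v) in nodes1[n1] if b == a} & \
                 {v for (b, v) in nodes2[n2] if b == a}
        return len(shared)
    if D[0] == "role":                        # rule 3: sum over pairs of r-children
        _, r, sub = D
        c1 = [d for (s, d, rr) in edges1 if s == n1 and rr == r]
        c2 = [d for (s, d, rr) in edges2 if s == n2 and rr == r]
        return sum(k_nodes(sub, g1, a, g2, b) for a in c1 for b in c2)
    if D[0] == "and":                         # rule 4: product of binomials
        prod = 1
        for sub, reps in Counter(D[1:]).items():
            prod *= comb(k_nodes(sub, g1, n1, g2, n2), reps)
        return prod
    raise ValueError("unknown description form")

def k_graph(D, g1, g2):
    return sum(k_nodes(D, g1, n1, g2, n2) for n1 in g1[0] for n2 in g2[0])

# The bigram description D = (AND word (before word)) on the two sentences:
def sentence(words):
    nodes = {i: {("word", w)} for i, w in enumerate(words)}
    edges = [(i, i + 1, "before") for i in range(len(words) - 1)]
    return nodes, edges

D = ("and", ("esensor", "word"), ("role", "before", ("esensor", "word")))
G1 = sentence(["The", "boy", "ran", "away", "quickly"])
G2 = sentence(["The", "boy", "ran", "quickly"])
print(k_graph(D, G1, G2))  # -> 2, equal to phi(G1) . phi(G2) above

For two single-node graphs whose node shares m values of a sensor s, the same code with D = ("and", ("esensor", "s"), ("esensor", "s")) returns C(m, 2), reproducing the conjunction-kernel example.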

Before continuing, we present an extension to Def. 2, with its corresponding effect on the types of kernels included in the family, intended to incorporate the notion of locality along with a focus of interest. The focus of interest for a given input concept graph is defined as a single distinguished node f.

Definition 4 We extend the language of Def. 2 with the following rule and define it as FDLloc: If D is a description, then (IN[n,r] D) is also a description, where n ∈ Z+ and r ∈ Role. This description denotes the set of individuals x ∈ X described by D and represented by a node in a given input concept graph within n r-labeled edges of the given focus f. When r may be any role in Role, we write (IN[n] D).

The family of kernels K is extended accordingly to include kernels kDloc(G1, G2), with Dloc of the form (IN[n,r] D) and D ∈ FDL. In this case G1, G2 are concept graphs, each with a given focus of interest f1, f2. As before, kDloc(G1, G2) = Σ_{n1∈N1} Σ_{n2∈N2} kDloc(n1, n2). However, here kDloc(n1, n2) = kD(n1, n2) if both n1 and n2 are within n r-labeled edges of f1 and f2 respectively, and kDloc(n1, n2) = 0 otherwise.
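As an illustration of Definition 4, the earlier k_graph sketch can be restricted to a focus of interest by first collecting the nodes within n r-labeled edges of each focus. The breadth-first helper below treats edges as undirected for the distance bound, which is an assumption on our part, as is the encoding.

# Illustrative sketch of the locality-restricted kernel of Definition 4,
# reusing k_nodes from the sketch above (encoding and helpers are ours).
from collections import deque

def within(graph, focus, n, role=None):
    """Nodes reachable from `focus` by at most n edges, optionally restricted
    to a single role label; edges are treated as undirected here."""
    nodes, edges = graph
    frontier, seen = deque([(focus, 0)]), {focus}
    while frontier:
        u, d = frontier.popleft()
        if d == n:
            continue
        for (s, t, r) in edges:
            if role is not None and r != role:
                continue
            for v in ((t,) if s == u else (s,) if t == u else ()):
                if v not in seen:
                    seen.add(v)
                    frontier.append((v, d + 1))
    return seen

def k_graph_loc(D, g1, f1, g2, f2, n, role=None):
    near1, near2 = within(g1, f1, n, role), within(g2, f2, n, role)
    return sum(k_nodes(D, g1, n1, g2, n2) for n1 in near1 for n2 in near2)

With n = 0 only the two focus nodes themselves are compared, which is how the IN[0] descriptions in the named entity experiment of Sec. (6) are used.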
5. Complexity & Generalization

Our treatment of kernels for relational data aims at developing a clear understanding of when it becomes advantageous to use kernel methods over standard learning algorithms operating over some feature space. The common wisdom in propositional learning is that kernel methods will always yield better performance because they cast the data in an implicit high-dimensional space (but see also (Khardon et al., 2001; Ben-David & Simon, 2002)).

Propositionalization methods in relational learning have demonstrated separately that it is possible to learn relational concepts using a transformation of the input data. There, however, the transformation into a high-dimensional feature space is done explicitly through the construction of propositional features, as done in Sec. (3) using the Feature Generating Function. The discussion of the relational kernel showed that we can formulate a kernel function that simulates the updates performed by running Perceptron over the features generated this way. Conceivably then, we should be able to perform a learning experiment comparing the kernel method and the explicitly generated feature space, and expect to achieve the same results in testing. Given such a situation, the main difference between the two methods becomes the expected running time involved. Thus a general analysis of the complexity of running the kernel method, and of generating features and running over them, seems warranted.

Given some description D in our FDL and two concept graphs G1, G2, let t1 be the time to evaluate kD(n1, n2) if n1, n2 are two nodes of G1, G2 respectively. Assuming all input concept graphs are equally likely, let g be the average number of nodes in a given graph. Since kD(G1, G2) = Σ_{n1∈G1} Σ_{n2∈G2} kD(n1, n2), the complexity of evaluating kD(G1, G2) is proportional to g^2 t1. Since for an arbitrary sequence x1, ..., xm of input graphs the kernel Perceptron algorithm could, in the worst case, make i - 1 mistakes on xi, the overall running time for the algorithm given this kernel is O(m^2 g^2 t1).

To analyze the normal version of Perceptron, run over an equivalent feature space generated by the FGF algorithm given in (Cumby & Roth, 2002) with the same description D, we first assume that the time to evaluate χD for a node of a given concept graph is proportional to t2. The total time to generate features from m arbitrary concept graphs is then O(m g t2), with g as defined above. We assume that the feature vectors we work with are variable length vectors of only positive features (this is equivalent to learning in the infinite attribute space (Blum, 1992) and is common in applications (Roth, 1998)). To run Perceptron over the resulting feature vectors, each time a mistake is made we update each weight corresponding to an active feature in the current example. We could abstract the number of active features per example with an average, but in reality these updates take time proportional to the time spent generating the features for each input graph. Thus the total running time is O(m g t2).

It is interesting to note that this result also applies to the specific situation of kernel learning from Boolean bit vectors. Consider the example given earlier of extracting conjunctions up to size k from bit vectors of the form (x1 x2 ... xn), (y1 y2 ... yn) ∈ {0,1}^n. In this case, as the representation of each vector is mapped to a single-node concept graph, g = 1, and t1 is proportional to computing Σ_{i=1}^{k} C(same(x, y), i), which is O(n). For m examples the total time of learning with kernel Perceptron is then O(m^2 n). To compute the feature space of conjunctions up to size k explicitly takes O(n^k), thus the total time for running standard Perceptron over the blown up feature space is O(m n^k). If m > n^{k-1}, the complexity disadvantage of the kernel approach becomes apparent.

A similar tradeoff can be seen in terms of generalization ability. Our discussion shows that learning with kernel methods simulates learning with a highly expanded feature space. It is not surprising then, that recent work (Weston et al., 2000; Ben-David & Simon, 2002; Khardon et al., 2001) has shown that even margin-maximizing kernel learners may suffer from the curse of dimensionality; this happens in cases where the use of kernels is equivalent to introducing a large number of irrelevant features. Assuming that a subset of the features generated is expressive enough for a given problem, it is not hard to show that embedding the problem in an even higher dimensional space can only be harmful to generalization.

Previous methods that use structured kernels (e.g. (Collins & Duffy, 2002)) make use of an implicit feature space that is exponential in the size of the input representation, by defining a metric that depends on all "substructures" (as do standard polynomial kernels). Our kernel approach allows a transformation of the input representation to an implicitly larger propositional space; it does so, however, using a parameterized kernel, thus allowing control of this transformation so that it can be equivalent to a smaller propositional feature space. By specifying a kernel through a syntax-driven mechanism based on the relational structure of the input, we can actively attempt to decrease the number of irrelevant features introduced. The benefit to generalization will be shown experimentally in the next section.

Note that the focus of this analysis is on kernel Perceptron; the exact complexity implications for other kernel learners such as SVM, beyond the quadratic dependence on the number of examples, are not so clear. The implications for generalization, however, apply across all kernel learners.
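The bit-vector case above can be checked directly: a single kernel evaluation touches each of the n coordinates once, while the explicit feature map enumerates all O(n^k) conjunctions of up to k literals. The snippet below (ours, purely illustrative) computes both and confirms they agree.

# Illustrative check (ours) of the two costs discussed above for Boolean
# bit vectors: O(n) kernel evaluation per pair vs. O(n^k) explicit expansion.
from math import comb
from itertools import combinations

def conj_kernel(x, y, k):
    """k(x, y) = sum_{i=1..k} C(same(x, y), i), same = number of agreeing bits."""
    same = sum(1 for a, b in zip(x, y) if a == b)
    return sum(comb(same, i) for i in range(1, k + 1))

def expand(x, k):
    """Explicit map: one feature per conjunction of at most k literals,
    where a literal fixes position j to the value x[j]."""
    lits = [(j, x[j]) for j in range(len(x))]
    feats = set()
    for i in range(1, k + 1):
        feats.update(combinations(lits, i))
    return feats

x, y, k = (1, 0, 1, 1), (1, 1, 1, 0), 2
assert conj_kernel(x, y, k) == len(expand(x, k) & expand(y, k))  # both equal 3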
6. Experimental Validation

We present two experiments in the domains of bioinformatics and NLP. Our goal is to demonstrate that by restricting the expanded feature space that learning takes place in - using either the standard feature based learner or kernel Perceptron - we benefit in terms of generalization, and thus accuracy. Our main comparison is performed against a kernel learner that implicitly constructs an exponentially larger feature space using a kernel based on (Collins & Duffy, 2002). As we can explicitly generate the smaller feature space used in the FDL approach, it is possible to verify our equivalence claim and compare with the performance of other propositional learners that do not admit kernels. The latter comparison is done using a variation of Winnow (Littlestone, 1988) implemented in the SNoW system (Carlson et al., 1999).

The first experiment addresses the problem of predicting mutagenicity in organic compounds, which has become a standard benchmark in ILP for 'propositionalization' methods (Srinivasan et al., 1996). In this problem, a set of 188 compounds is given in the form of atom and bond tuples corresponding to the atoms in each molecule, with their properties and links to other atoms. Additionally, a set of label tuples is given to indicate whether each molecule is mutagenic.

We map each compound to a concept graph, constructing nodes for each atom tuple labeled with sensor descriptions of the form atom-elt(v) for the element type, atom-type(v) for the atom type, and atom-chrg(v) for the partial charge. We construct edges for each bond tuple and construct a single node for the compound overall, connected to each atom and labeled with the compound's mutagenicity, lumo, and logP values.

Figure 2. Concept graph for a fragment of a mutagenesis domain element. (Atom nodes carrying sensors such as atom-elt(c) and atom-chrg(.013) are linked by bond edges; a compound node labeled muta(+), lumo(-1.19), logp(1.77) is connected to each atom by contains edges.)

We first perform classification using an expanded feature space generated by taking the sum of the kernels kD1, kD2, kD3, with the generating descriptions D1, D2, D3 as described below:
• D1 = lumo
• D2 = logP
• D3 = (AND atom-elt atom-chrg (bond (AND atom-elt atom-chrg (bond ...)))), up to nine conjuncts

Since kernels are composable under addition, the overall kernel is valid. We introduce the second experimental setup before continuing with results.

The next experiment deals with the natural language processing task of tagging Named Entities. Given a natural language sentence, we wish to determine which phrases in it correspond to entities such as locations, organizations or persons. In this experiment, the MUC-7 dataset of sentences with labeled entities was used. Rather than focusing on the tagging problem of determining whether a word is in some entity or not, we instead attempt to classify which of these three entity types a given phrase corresponds to. We first convert the raw MUC data into a chunked form (with subsets of words marked as noun phrases, verb phrases, etc.) and map this form to concept graphs as in Fig. 1.

We extracted a set of 4715 training phrases and 1517 test phrases, and once again trained a classifier based on a restrictively expanded feature space as detailed in Sec. (3) and Sec. (4). The generating descriptions used to determine the types of features include:
• (IN[0] (first (AND word tag (after word))))
• (IN[0] (phr-before (first word)))
• (IN[0] (phr-before (first tag)))
• (IN[0] (phr-before phrase))
• (IN[0] (phr-after phrase))
• (IN[0] (contains (AND word tag)))

For each input graph, several noun phrase nodes may be present. Thus each instance is presented to the feature generating process/kernel learner several times, each time with a different phrase node designated as the focus of interest. The above descriptions reference this focus as discussed in Sec. (4). As we are trying to predict from a set of k category labels for each phrase, we are faced with a multi-class classification problem. We used a standard 1-vs-all method, with a winner-take-all gate for testing.

In each experiment we also trained a classifier using kernel Perceptron with a modified kernel based on the parse tree kernel given in (Collins & Duffy, 2002). This kernel tracks the number of common subtrees descending from the root node, which is designated to be the 'molecule' node in each example compound in the mutagenesis experiment, and the current phrase focus-of-interest node in the named entity experiment. Down-weighting of larger subtrees is also performed as in (Collins & Duffy, 2002). As our concept graphs may be cyclic, a termination rule is added to stop traversal of each graph if a node previously encountered on a traversal path is touched again. We show below the results of training and evaluating each type of classifier. For mutagenesis we train with 10-fold cross validation on the set of 188 compounds, with 12 rounds of training for each fold. For the Named Entity task we train using 5 rounds of training over the entire training set. Results in terms of accuracy are shown.[1]

[1] In the mutagenesis case the data set is fairly balanced and results are usually reported in terms of accuracy (Srinivasan et al., 1996). For the NE case, each example corresponds to a valid NE, and the set is balanced (3:5:7), so micro and macro averaging are fairly close. The difference between col. 3 and cols. 1 and 2 is statistically significant.

                  SNoW (Winnow)   FDL kernel   All subtrees
    mutagenesis   88.6%           85.4%        71.3%
    NE class      75.41%          76.6%        52.9%

The main comparison in the above table is between the feature space represented in both col. 1 and col. 2, and the one represented in col. 3. In both experiments the classifiers learned over the FDL induced feature space perform markedly better than the one learned over the 'all-subtrees' feature space. Col. 1 is the outcome of running a different classifier on an explicit feature space equivalent to the FDL kernel of col. 2. It shows that, using the explicitly propositionalized feature space, we can use other learning algorithms, not amenable to kernels, which might perform slightly better.
For the mutagenesis task the FDL approach encodes the typical features used in other ILP evaluations, indicating that many ILP tasks, where the relational data can be translated into our graph-based structure (Cumby & Roth, 2002), can be addressed this way. The named entity example abstracts away the important task of determining which phrases correspond to valid entities, and it performs sub-optimally in terms of classifying the entities correctly because it does not exploit the richer features that can be used in this task. However, our goal in these classification tasks is to exhibit the benefit of the FDL approach over the "all-subtrees" kernel: by developing a parameterized kernel which restricts the range of structural features produced, we avoid over-fitting due to a large number of irrelevant features.

7. Conclusion

We have presented a new approach to constructing kernels for learning from structured data, through a description language of limited expressivity. This family of kernels is generated in a syntax-driven way, parameterized by descriptions in the language. Thus, it highlights the relationship between an explicitly expanded feature space constructed using this language and the implicitly simulated feature space in which learning takes place when using kernel based algorithms such as kernel Perceptron or SVM. In some cases it is more efficient to learn in the implicitly expanded space, when this space may be exponential in the size of the input example, or infinite. However, we have shown that an expanded space of much higher dimensionality can degrade generalization relative to one restricted by the use of our syntactically determined kernel.

By studying the relationship between these explicit and implicitly simulated feature spaces, and the computational demands of kernel based algorithms such as kernel Perceptron, we have highlighted the cases in which, contrary to popular belief, working in the explicit feature space with the standard learning algorithms may be beneficial over kernels.

Acknowledgments: This research is supported by NSF grants ITR-IIS-0085836, ITR-IIS-0085980 and IIS-9984168 and an ONR MURI Award.

References

Ben-David, S., Eiron, N., & Simon, H. U. (2002). Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research.

Blum, A. (1992). Learning boolean functions in an infinite attribute space. Machine Learning, 9, 373–386.

Carlson, A., Cumby, C., Rosen, J., & Roth, D. (1999). The SNoW learning architecture (Technical Report UIUCDCS-R-99-2101). UIUC Computer Science Department.

Collins, M., & Duffy, N. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. Proceedings of ACL 2002.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.

Cumby, C., & Roth, D. (2002). Learning with feature description logics. Proceedings of the 12th International Conference on Inductive Logic Programming.

Cumby, C., & Roth, D. (2003a). Feature extraction languages for propositionalized relational learning. IJCAI'03 Workshop on Learning Statistical Models from Relational Data.

Cumby, C., & Roth, D. (2003b). Kernel methods for relational learning (Technical Report UIUCDCS-R-2003-2345). UIUC Computer Science Department.

Freund, Y., & Schapire, R. E. (1998). Large margin classification using the perceptron algorithm. Computational Learning Theory (pp. 209–217).

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report UCSC-CRL-99-10). University of California - Santa Cruz.

Khardon, R., Roth, D., & Servedio, R. (2001). Efficiency versus convergence of boolean kernels for on-line learning algorithms. NIPS-14.

Khardon, R., Roth, D., & Valiant, L. G. (1999). Relational learning for NLP using linear threshold elements. Proc. of the International Joint Conference on Artificial Intelligence (pp. 911–917).

Kramer, S., Lavrac, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Dzeroski and N. Lavrac (Eds.), Relational data mining. Springer Verlag.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.

Novikoff, A. (1963). On convergence proofs for perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata (pp. 615–622).

Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. National Conference on Artificial Intelligence (pp. 806–813).

Roth, D. (1999). Learning in natural language. Proc. of the International Joint Conference on Artificial Intelligence (pp. 898–904).

Roth, D., & Yih, W. (2001). Relational learning via propositional algorithms: An information extraction case study. Proc. of the International Joint Conference on Artificial Intelligence (pp. 1257–1263).

Srinivasan, A., Muggleton, S., King, R. D., & Sternberg, M. (1996). Theories for mutagenicity: A study of first order and feature based induction. Artificial Intelligence, 85(1-2), 277–299.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. Neural Information Processing Systems (pp. 668–674).