A Study on Convolution Kernels for Shallow Semantic Parsing

Alessandro Moschitti
University of Texas at Dallas
Human Language Technology Research Institute
Richardson, TX 75083-0688, USA
[email protected]

Abstract

In this paper we have designed and experimented novel convolution kernels for the automatic classification of predicate arguments. Their main property is the ability to process structured representations. Support Vector Machines (SVMs), using a combination of such kernels and the flat feature kernel, classify PropBank predicate arguments with an accuracy higher than the current argument classification state-of-the-art. Additionally, experiments on FrameNet data have shown that SVMs are appealing for the classification of semantic roles even if the proposed kernels do not produce any improvement.

1 Introduction

Several linguistic theories, e.g. (Jackendoff, 1990), claim that semantic information in natural language texts is connected to syntactic structures. Hence, to deal with natural language semantics, the learning algorithm should be able to represent and process structured data. The classical solution adopted for such tasks is to convert syntactic structures into flat feature representations which are suitable for a given learning model. The main drawback is that structures may not be properly represented by flat features.

In particular, these problems affect the processing of predicate argument structures annotated in PropBank (Kingsbury and Palmer, 2002) or FrameNet (Fillmore, 1982). Figure 1 shows an example of a predicate annotation in PropBank for the sentence: "Paul gives a lecture in Rome". A predicate may be a verb, a noun or an adjective; most of the time Arg 0 is the logical subject, Arg 1 is the logical object and ArgM may indicate locations, as in our example.

[Figure 1: A predicate argument structure in a parse-tree representation.]

FrameNet also describes predicate/argument structures, but for this purpose it uses richer semantic structures called frames. These latter are schematic representations of situations involving various participants, properties and roles, in which a word may be typically used. Frame elements or semantic roles are arguments of predicates called target words. In FrameNet, the argument names are local to a particular frame.

Several approaches for argument identification and classification have been developed (Gildea and Jurasfky, 2002; Gildea and Palmer, 2002; Surdeanu et al., 2003; Hacioglu et al., 2003). Their common characteristic is the adoption of feature spaces that model predicate-argument structures in a flat representation. On the contrary, convolution kernels aim to capture structural information in terms of sub-structures, providing a viable alternative to flat features.

In this paper, we select portions of syntactic trees, which include predicate/argument salient sub-structures, to define convolution kernels for the task of predicate argument classification. In particular, our kernels aim (a) to represent the relation between a predicate and one of its arguments and (b) to capture the overall argument structure of the target predicate. Additionally, we define novel kernels as combinations of the above two with the polynomial kernel of standard flat features.

Experiments on Support Vector Machines using the above kernels show an improvement of the state-of-the-art for PropBank argument classification. On the contrary, FrameNet semantic parsing seems not to take advantage of the structural information provided by our kernels.

The remainder of this paper is organized as follows: Section 2 defines the Predicate Argument Extraction problem and the standard solution to solve it. In Section 3 we present our kernels, whereas in Section 4 we show comparative results among SVMs using standard features and the proposed kernels. Finally, Section 5 summarizes the conclusions.

2 Predicate Argument Extraction: a standard approach

Given a sentence in natural language and the target predicates, all arguments have to be recognized. This problem can be divided into two subtasks: (a) the detection of the argument boundaries, i.e. all its compounding words, and (b) the classification of the argument type, e.g. Arg0 or ArgM in PropBank or Agent and Goal in FrameNet.

The standard approach to learn both detection and classification of predicate arguments is summarized by the following steps:

1. Given a sentence from the training-set, generate a full syntactic parse-tree;
2. let P and A be the set of predicates and the set of parse-tree nodes (i.e. the potential arguments), respectively;
3. for each pair <p, a> in P × A:
   - extract the feature representation set, Fp,a;
   - if the subtree rooted in a covers exactly the words of one argument of p, put Fp,a in T+ (positive examples), otherwise put it in T- (negative examples).

For example, in Figure 1, for each combination of the predicate give with the nodes N, S, VP, V, NP, PP, D or IN, the instances F"give",a are generated. In case the node a exactly covers Paul, a lecture or in Rome, it will be a positive instance, otherwise it will be a negative one, e.g. F"give","IN".

To learn the argument classifiers, the T+ set can be re-organized as positive T+_argi and negative T-_argi examples for each argument i. In this way, an individual ONE-vs-ALL classifier for each argument i can be trained. We adopted this solution as it is simple and effective (Hacioglu et al., 2003). In the classification phase, given a sentence of the test-set, all its Fp,a are generated and classified by each individual classifier. As a final decision, we select the argument associated with the maximum value among the scores provided by the SVMs, i.e. argmax_{i in S} Ci, where S is the target set of arguments.

2.1 Standard feature space

The discovery of relevant features is, as usual, a complex task; nevertheless, there is a common consensus on the basic features that should be adopted. These standard features, firstly proposed in (Gildea and Jurasfky, 2002), refer to flat information derived from parse trees, i.e. Phrase Type, Predicate Word, Head Word, Governing Category, Position and Voice. Table 1 presents the standard features and exemplifies how they are extracted from the parse tree in Figure 1.

- Phrase Type: the syntactic type of the phrase labeled as a predicate argument, e.g. NP for Arg1.
- Parse Tree Path: the path in the parse tree between the predicate and the argument phrase, expressed as a sequence of nonterminal labels linked by direction (up or down) symbols, e.g. V ↑ VP ↓ NP for Arg1.
- Position: indicates if the constituent, i.e. the potential argument, appears before or after the predicate in the sentence, e.g. after for Arg1 and before for Arg0.
- Voice: distinguishes between active and passive voice for the predicate phrase, e.g. active for every argument.
- Head Word: the headword of the evaluated phrase; case and morphological information are preserved, e.g. lecture for Arg1.
- Governing Category: indicates if an NP is dominated by a sentence phrase or by a verb phrase, e.g. the NP associated with Arg1 is dominated by a VP.
- Predicate Word: consists of two components: (1) the word itself, e.g. gives for all arguments; and (2) the lemma, which represents the verb normalized to lower case and infinitive form, e.g. give for all arguments.

Table 1: Standard features extracted from the parse-tree in Figure 1.

For example, the Parse Tree Path feature represents the path in the parse-tree between a predicate node and one of its argument nodes. It is expressed as a sequence of nonterminal labels linked by direction symbols (up or down), e.g. in Figure 1, V↑VP↓NP is the path between the predicate to give and the argument 1, a lecture. Two pairs <p1, a1> and <p2, a2> have two different Path features even if their paths differ only for a single node in the parse-tree. This prevents the learning algorithm from generalizing well on unseen data. In order to address this problem, the next section describes a novel kernel space for predicate argument classification.
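The ONE-vs-ALL decision step described above reduces to an argmax over the per-role SVM scores. A minimal sketch (the role names and score values below are illustrative stand-ins, not real SVM outputs):

```python
# Sketch of the ONE-vs-ALL decision of Section 2: each candidate node is
# scored by one binary classifier per argument type, and the role with the
# maximum score C_i is selected, i.e. argmax_{i in S} C_i.

def classify_argument(scores):
    """scores: dict mapping role name -> SVM decision value C_i."""
    return max(scores, key=scores.get)

# Hypothetical decision values for one candidate constituent.
example_scores = {"Arg0": -0.7, "Arg1": 1.3, "ArgM": 0.2}
print(classify_argument(example_scores))  # -> Arg1
```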

[Figure 2: Structured features for Arg0, Arg1 and ArgM (parse-trees of "Paul delivers a talk in formal style" with the circled predicate/argument sub-structures).]

2.2 Support Vector Machine approach

Given a vector space R^n and a set of positive and negative points, SVMs classify vectors according to a separating hyperplane, H(x) = w · x + b = 0, where w in R^n and b in R are learned by applying the Structural Risk Minimization principle (Vapnik, 1995).

To apply the SVM algorithm to Predicate Argument Classification, we need a function φ : F → R^n to map our feature space F = {f1, .., f|F|} and our predicate/argument pair representation, Fp,a = Fz, into R^n, such that:

  Fz → φ(Fz) = (φ1(Fz), .., φn(Fz)).

From kernel theory we have that:

  H(x) = ( Σ_{i=1..l} αi xi ) · x + b = Σ_{i=1..l} αi xi · x + b = Σ_{i=1..l} αi φ(Fi) · φ(Fz) + b,

where Fi, for all i in {1, .., l}, are the training instances and the product K(Fi, Fz) = <φ(Fi) · φ(Fz)> is the kernel function associated with the mapping φ.

The simplest mapping that we can apply is φ(Fz) = z = (z1, ..., zn), where zi = 1 if fi is in Fz and zi = 0 otherwise, i.e. the characteristic vector of the set Fz with respect to F. If we choose the scalar product as kernel function, we obtain the linear kernel KL(Fx, Fz) = x · z.

Another function, which is the current state-of-the-art in predicate argument classification, is the polynomial kernel Kp(Fx, Fz) = (c + x · z)^d, where c is a constant and d is the degree of the polynomial.

3 Convolution Kernels for Semantic Parsing

We propose two different convolution kernels associated with two different predicate argument sub-structures: the first includes the target predicate with one of its arguments. We will show that it contains almost all the standard feature information. The second relates to the sub-categorization frame of verbs. In this case, the kernel function aims to cluster together verbal predicates which have the same syntactic realizations. This provides the classification algorithm with important clues about the possible set of arguments suited for the target syntactic structure.

3.1 Predicate/Argument Feature (PAF)

We consider the predicate argument structures annotated in PropBank or FrameNet as our semantic space. The smallest sub-structure which includes one predicate with only one of its arguments defines our structural feature. For example, Figure 2 illustrates the parse-tree of the sentence "Paul delivers a talk in formal style". The circled substructures in (a), (b) and (c) are our semantic objects associated with the three arguments of the verb to deliver, i.e. Fdeliver,Arg0, Fdeliver,Arg1 and Fdeliver,ArgM. Note that each predicate/argument pair is associated with only one structure, i.e. Fp,a contains only one of the circled sub-trees.[1] Other important properties are the following:

(1) The overall semantic feature space F contains sub-structures composed of syntactic information embodied by parse-tree dependencies and semantic information under the form of predicate/argument annotation.

(2) This solution is efficient, as we have to classify as many nodes as the number of predicate arguments.

(3) A constituent cannot be part of two different arguments of the target predicate, i.e. there is no overlapping between the words of two arguments. Thus, two semantic structures Fp1,a1 and Fp2,a2, associated with two different arguments, cannot be included one in the other. This property is important because a convolution kernel would not be effective in distinguishing between an object and its sub-parts.

[1] Fp,a was defined as the set of features of the object <p, a>. Since in our representations we have only one element in Fp,a, with an abuse of notation we use it to indicate the objects themselves.
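The characteristic-vector mapping of Section 2.2 never has to be built explicitly: when φ(Fz) is the characteristic vector of a feature set, the scalar product reduces to the size of the intersection of the two sets. A small sketch of the linear and polynomial kernels under this view (the feature names are illustrative):

```python
# When phi maps a feature set to its characteristic vector, the dot product
# x . z counts the features the two sets share; the polynomial kernel is
# then (c + x . z)^d over the same implicit product.

def linear_kernel(fx, fz):
    return len(set(fx) & set(fz))

def poly_kernel(fx, fz, c=1.0, d=3):
    return (c + linear_kernel(fx, fz)) ** d

# Hypothetical flat feature sets for two candidate arguments.
Fx = {"PhraseType=NP", "Voice=active", "Position=after"}
Fz = {"PhraseType=NP", "Voice=passive", "Position=after"}
print(linear_kernel(Fx, Fz))           # 2 shared features
print(poly_kernel(Fx, Fz, c=1, d=3))   # (1 + 2)**3 = 27.0
```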

[Figure 3: Sub-Categorization Features for two predicate argument structures (parse-tree of "He flushed the pan and buckled his belt" with the SCFs of flush and buckle).]

[Figure 4: All 17 valid fragments of the semantic structure associated with Arg 1 of Figure 2.]

3.2 Sub-Categorization Feature (SCF)

The above object space aims to capture all the information between a predicate and one of its arguments. Its main drawback is that important structural information related to inter-argument dependencies is neglected. In order to solve this problem, we define the Sub-Categorization Feature (SCF). This is the sub-parse tree which includes the sub-categorization frame of the target verbal predicate. For example, Figure 3 shows the parse tree of the sentence "He flushed the pan and buckled his belt". The solid line describes the SCF of the predicate flush, i.e. Fflush, whereas the dashed line tailors the SCF of the predicate buckle, i.e. Fbuckle. Note that SCFs are features for predicates (i.e. they describe predicates), whereas PAF characterizes predicate/argument pairs.

Once the semantic representations are defined, we need to design a kernel function to estimate the similarity between our objects. As suggested in Section 2, we can map them into vectors in R^n and evaluate implicitly the scalar product among them.

3.3 Predicate/Argument structure Kernel (PAK)

Given the semantic objects defined in the previous section, we design a convolution kernel in a way similar to the parse-tree kernel proposed in (Collins and Duffy, 2002). We divide our mapping φ into two steps: (1) from the semantic structure space F (i.e. PAF or SCF objects) to the set of all their possible sub-structures F' = {f'1, .., f'|F'|}, and (2) from F' to R^|F'|.

An example of features in F' is given in Figure 4, where the whole set of fragments, F'deliver,Arg1, of the argument structure Fdeliver,Arg1 is shown (see also Figure 2).

It is worth noting that the allowed sub-trees contain the entire (not partial) production rules. For instance, the sub-tree [NP [D a]] is excluded from the set of Figure 4 since only a part of the production NP → D N is used in its generation. However, this constraint does not apply to the production VP → V NP PP along with the fragment [VP [V NP]], as the subtree [VP [PP [...]]] is not considered part of the semantic structure.

Thus, in step 1, an argument structure Fp,a is mapped into a fragment set F'p,a. In step 2, this latter is mapped into x = (x1, .., x|F'|) in R^|F'|, where xi is equal to the number of times that f'i occurs in F'p,a.[2]

In order to evaluate K(φ(Fx), φ(Fz)) without evaluating the feature vectors x and z, we define the indicator function Ii(n) = 1 if the sub-structure i is rooted at node n, and 0 otherwise. It follows that φi(Fx) = Σ_{n in Nx} Ii(n), where Nx is the set of Fx's nodes. Therefore, the kernel can be written as:

  K(φ(Fx), φ(Fz)) = Σ_{i=1}^{|F'|} ( Σ_{nx in Nx} Ii(nx) ) ( Σ_{nz in Nz} Ii(nz) ) = Σ_{nx in Nx} Σ_{nz in Nz} Σ_i Ii(nx) Ii(nz),

where Nx and Nz are the nodes in Fx and Fz, respectively. In (Collins and Duffy, 2002) it has been shown that Σ_i Ii(nx) Ii(nz) = ∆(nx, nz) can be computed in O(|Nx| × |Nz|) by the following recursive relation:

(1) if the productions at nx and nz are different, then ∆(nx, nz) = 0;
(2) if the productions at nx and nz are the same, and nx and nz are pre-terminals, then ∆(nx, nz) = 1;
(3) if the productions at nx and nz are the same, and nx and nz are not pre-terminals, then

  ∆(nx, nz) = Π_{j=1}^{nc(nx)} (1 + ∆(ch(nx, j), ch(nz, j))),

where nc(nx) is the number of children of nx and ch(n, j) is the j-th child of node n. Note that, as the productions are the same, ch(nx, j) = ch(nz, j).

This kind of kernel has the drawback of assigning more weight to larger structures, while the argument type does not strictly depend on the size of the argument (Moschitti and Bejan, 2004). To overcome this problem we can scale the relative importance of the tree fragments using a parameter λ for the cases (2) and (3), i.e. ∆(nx, nz) = λ and ∆(nx, nz) = λ Π_{j=1}^{nc(nx)} (1 + ∆(ch(nx, j), ch(nz, j))), respectively.

It is worth noting that, even if the above equations define a kernel function similar to the one proposed in (Collins and Duffy, 2002), the sub-structures on which it operates are different from those of the parse-tree kernel. For example, Figure 4 shows that structures such as [VP [V] [NP]], [VP [V delivers] [NP]] and [VP [V] [NP [DT] [N]]] are valid features, but these fragments (and many others) are not generated by a complete production, i.e. VP → V NP PP. As a consequence, they would not be included in the parse-tree kernel of the sentence.

[2] A fragment can appear several times in a parse-tree, thus each fragment occurrence is considered as a different element in F'p,a.
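The ∆ recursion above can be sketched on tuple-encoded trees as follows; the (label, children) encoding and the default λ value are illustrative choices, not the authors' implementation:

```python
# Sketch of the Delta recursion of Section 3.3 over tuple-encoded trees:
# a node is (label, children) and a word is (word, []).  lam scales down
# larger fragments, as described for cases (2) and (3).

def production(node):
    label, children = node
    return (label, tuple(child[0] for child in children))

def internal_nodes(tree):
    if tree[1]:                       # words have no children
        yield tree
        for child in tree[1]:
            yield from internal_nodes(child)

def delta(nx, nz, lam=1.0):
    if production(nx) != production(nz):
        return 0.0                    # case (1): different productions
    if all(not child[1] for child in nx[1]):
        return lam                    # case (2): pre-terminal nodes
    score = lam                       # case (3): recurse over the children
    for cx, cz in zip(nx[1], nz[1]):
        score *= 1.0 + delta(cx, cz, lam)
    return score

def tree_kernel(tx, tz, lam=1.0):
    return sum(delta(nx, nz, lam)
               for nx in internal_nodes(tx)
               for nz in internal_nodes(tz))

# [NP [D a] [N talk]] matched against itself: 6 common fragments with lam=1.
np_tree = ("NP", [("D", [("a", [])]), ("N", [("talk", [])])])
print(tree_kernel(np_tree, np_tree))  # -> 6.0
```

With λ = 1 the kernel simply counts the shared fragments; smaller λ values damp the contribution of the deeper ones.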
3.4 Comparison with Standard Features

In this section we compare standard features with the kernel-based representation in order to derive useful indications for their use.

First, PAK estimates the similarity between two argument structures (i.e., PAF or SCF) by counting the number of sub-structures that they have in common. As an example, the similarity between the two structures in Figure 2, F"delivers",Arg0 and F"delivers",Arg1, is equal to 1, since they have in common only the [V delivers] substructure. Such a low value depends on the fact that different arguments tend to appear in different structures.

On the contrary, if two structures differ only for a few nodes (especially terminal or near-terminal nodes), the similarity remains quite high. For example, if we change the tense of the verb to deliver (Figure 2) into delivered, the [VP [V delivers] [NP]] subtree will be transformed into [VP [VBD delivered] [NP]], where the NP is unchanged. Thus, the similarity with the previous structure will still be quite high as: (1) the NP with all its sub-parts will be matched and (2) the small difference will not highly affect the kernel norm and, consequently, the final score. The above property also holds for the SCF structures. For example, in Figure 3, KPAK(φ(Fflush), φ(Fbuckle)) is quite high, as the two verbs have the same syntactic realization of their arguments. In general, flat features do not possess this conservative property. For example, the Parse Tree Path is very sensitive to small changes of parse-trees, e.g. two predicates expressed in different tenses generate two different Path features.

Second, some information contained in the standard features is embedded in PAF: Phrase Type, Predicate Word and Head Word explicitly appear as structure fragments. For example, Figure 4 shows fragments like [NP [DT] [N]] or [NP [DT a] [N talk]] which explicitly encode the Phrase Type feature NP for the Arg 1 in Figure 2.b. The Predicate Word is represented by the fragment [V delivers] and the Head Word is encoded in [N talk]. The same is not true for SCF, since it does not contain information about a specific argument. SCF, in fact, aims to characterize the predicate with respect to the overall argument structure rather than a specific <p, a> pair.

Third, the Governing Category, Position and Voice features are not explicitly contained in either PAF or SCF. Nevertheless, SCF may allow the learning algorithm to detect the active/passive form of verbs.

Finally, from the above observations it follows that the PAF representation may be used with PAK to classify arguments. On the contrary, SCF lacks important information; thus, alone, it may be used only to classify verbs into syntactic categories. This suggests that SCF should be used in conjunction with standard features to boost their classification performance.

4 The Experiments

The aims of our experiments are twofold: on the one hand, we study if the PAF representation produces an accuracy higher than standard features. On the other hand, we study if SCF can be used to classify verbs according to their syntactic realization. Both the above aims can be carried out by combining PAF and SCF with the standard features. For this purpose we adopted two ways to combine kernels:[3] (1) K = K1 · K2 and (2) K = γK1 + K2. The resulting set of kernels used in the experiments is the following:

- Kpd is the polynomial kernel of degree d over the standard features.
- KPAF is obtained by using the PAK function over the PAF structures.
- KPAF+P = γ KPAF/|KPAF| + Kpd/|Kpd|, i.e. the sum between the normalized[4] PAF-based kernel and the normalized polynomial kernel.
- KPAF·P = (KPAF · Kpd)/(|KPAF| · |Kpd|), i.e. the normalized product between the PAF-based kernel and the polynomial kernel.
- KSCF+P = γ KSCF/|KSCF| + Kpd/|Kpd|, i.e. the sum between the normalized SCF-based kernel and the normalized polynomial kernel.
- KSCF·P = (KSCF · Kpd)/(|KSCF| · |Kpd|), i.e. the normalized product between the SCF-based kernel and the polynomial kernel.

4.1 Corpora set-up

The above kernels were experimented over two corpora: PropBank (www.cis.upenn.edu/∼ace) along with Penn TreeBank[5] 2 (Marcus et al., 1993), and FrameNet.

PropBank contains about 53,700 sentences and a fixed split between training and testing which has been used in other researches, e.g. (Gildea and Palmer, 2002; Surdeanu et al., 2003; Hacioglu et al., 2003). In this split, Sections from 02 to 21 are used for training, Section 23 for testing and Sections 1 and 22 as developing set. We considered all the PropBank arguments[6] from Arg0 to Arg9, ArgA and ArgM, for a total of 122,774 and 7,359 arguments in training and testing, respectively. It is worth noting that in these experiments we used the gold standard parsing from the Penn TreeBank, thus our kernel structures are derived with high precision.

For the FrameNet corpus (www.icsi.berkeley.edu/∼framenet) we extracted all the 24,558 sentences from the 40 frames of the Senseval 3 task (www.senseval.org) for the Automatic Labeling of Semantic Roles. We considered 18 of the most frequent roles and we mapped together those having the same name. Only verbs were selected to be predicates in our evaluations. Moreover, as a fixed split between training and testing does not exist, we selected randomly 30% of the sentences for testing and 70% for training. Additionally, 30% of the training set was used as a validation-set. The sentences were processed using Collins' parser (Collins, 1997) to generate parse-trees automatically.

4.2 Classification set-up

The classifier evaluations were carried out using the SVM-light software (Joachims, 1999), available at svmlight.joachims.org, with the default polynomial kernel for the standard feature evaluations. To process PAF and SCF, we implemented our own kernels and we used them inside SVM-light.

The classification performances were evaluated using the f1 measure[7] for the single arguments and the accuracy for the final multi-class classifier. This latter choice allows us to compare our results with previous literature work, e.g. (Gildea and Jurasfky, 2002; Surdeanu et al., 2003; Hacioglu et al., 2003).

For the evaluation of SVMs, we used the default regularization parameter (e.g., C = 1 for normalized kernels) and we tried a few cost-factor values (i.e., j in {0.1, 1, 2, 3, 4, 5}) to adjust the rate between Precision and Recall. We chose the parameters by evaluating SVM with the Kp3 kernel over the validation-set. Both the λ (see Section 3.3) and γ parameters were evaluated in a similar way, by maximizing the performance of SVM using KPAF and γ KSCF/|KSCF| + Kpd/|Kpd|, respectively. These parameters were adopted also for all the other kernels.

4.3 Kernel evaluations

To study the impact of our structural kernels, we firstly derived the maximal accuracy reachable with the standard features along with polynomial kernels. The multi-class accuracies, for PropBank and FrameNet, using Kpd with d = 1, .., 5, are shown in Figure 5. We note that (a) the highest performance is reached for d = 3, (b) for PropBank our maximal accuracy (90.5%) is substantially equal to the SVM performance (88%) obtained in (Hacioglu et al., 2003) with degree 2, and (c) the accuracy on FrameNet (85.2%) is higher than the best result obtained in the literature, i.e. 82.0% in (Gildea and Palmer, 2002). This different outcome is due to a different task (we classify different roles) and a different classification algorithm. Moreover, we did not use the Frame information, which is very important.[8]

[3] It can be proven that the resulting kernels still satisfy Mercer's conditions (Cristianini and Shawe-Taylor, 2000).
[4] To normalize a kernel K(x, z) we can divide it by sqrt(K(x, x) · K(z, z)).
[5] We point out that we removed from the Penn TreeBank the function tags like SBJ and TMP, as parsers usually are not able to provide this information.
[6] We noted that only Arg0 to Arg4 and ArgM contain enough training/testing data to affect the overall performance.
[7] f1 assigns equal importance to Precision P and Recall R, i.e. f1 = 2PR/(P + R).
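The normalization used in Section 4 (dividing K(x, z) by sqrt(K(x, x) · K(z, z))) and the two combination schemes, γ-weighted sum and product, can be sketched as higher-order functions; k1 and k2 stand for any base kernels, e.g. PAK over PAF and Kp3:

```python
import math

# Sketch of the kernel combinations of Section 4: each base kernel is
# normalized by sqrt(K(x, x) * K(z, z)) and the normalized kernels are
# combined by a gamma-weighted sum or by a product.

def normalize(k):
    def normalized(x, z):
        return k(x, z) / math.sqrt(k(x, x) * k(z, z))
    return normalized

def kernel_sum(k1, k2, gamma=1.0):
    n1, n2 = normalize(k1), normalize(k2)
    return lambda x, z: gamma * n1(x, z) + n2(x, z)

def kernel_product(k1, k2):
    n1, n2 = normalize(k1), normalize(k2)
    return lambda x, z: n1(x, z) * n2(x, z)

# With a plain dot product, normalization yields the cosine similarity.
dot = lambda x, z: sum(a * b for a, b in zip(x, z))
print(normalize(dot)((3.0, 4.0), (3.0, 4.0)))  # -> 1.0
```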

[Figure 5: Multi-classifier accuracy according to different degrees of the polynomial kernel (accuracy on the y-axis, degree d = 1, .., 5 on the x-axis, one curve for PropBank and one for FrameNet).]

It is worth noting that the difference between the linear and the polynomial kernel is about 3-4 percent points for both PropBank and FrameNet. This remarkable difference can be easily explained by considering the meaning of the standard features. For example, let us restrict the classification function CArg0 to the two features Voice and Position. Without loss of generality we can assume: (a) Voice = 1 if active and 0 if passive, and (b) Position = 1 when the argument is after the predicate and 0 otherwise. To simplify the example, we also assume that if an argument precedes the target predicate it is a subject, otherwise it is an object.[9] It follows that a constituent is an Arg0, i.e. CArg0 = 1, if only one feature at a time is 1, otherwise it is not an Arg0, i.e. CArg0 = 0. In other words, CArg0 = Position XOR Voice, which is the classical example of a non-linearly separable function that becomes separable in a superlinear space (Cristianini and Shawe-Taylor, 2000).

After it was established that the best kernel for standard features is Kp3, we carried out all the other experiments using it in the kernel combinations. Tables 2 and 3 show the single-class (f1 measure) as well as the multi-class classifier (accuracy) performance for PropBank and FrameNet, respectively. Each column of the two tables refers to a different kernel defined in the previous section.

Args   P      PAF    PAF+P  PAF·P  SCF+P  SCF·P
Arg0   90.8   88.3   90.6   90.5   94.6   94.7
Arg1   91.1   87.4   89.9   91.2   92.9   94.1
Arg2   80.0   68.5   77.5   74.7   77.4   82.0
Arg3   57.9   56.5   55.6   49.7   56.2   56.4
Arg4   70.5   68.7   71.2   62.7   69.6   71.1
ArgM   95.4   94.1   96.2   96.2   96.1   96.3
Acc.   90.5   88.7   90.2   90.4   92.4   93.2

Table 2: Evaluation of the kernels on PropBank (P is the polynomial kernel Kp3).

Roles    P      PAF    PAF+P  PAF·P  SCF+P  SCF·P
agent    92.0   88.5   91.7   91.3   93.1   93.9
cause    59.7   16.1   41.6   27.7   42.6   57.3
degree   74.9   68.6   71.4   57.8   68.5   60.9
depict.  52.6   29.7   51.0   28.6   46.8   37.6
durat.   45.8   52.1   40.9   29.0   31.8   41.8
goal     85.9   78.6   85.3   82.8   84.0   85.3
instr.   67.9   46.8   62.8   55.8   59.6   64.1
mann.    81.0   81.9   81.2   78.6   77.8   77.8
Acc.     85.2   79.5   84.6   81.6   83.8   84.2

Table 3: Evaluation of the kernels on FrameNet semantic roles (the accuracy is computed over all 18 roles).

The overall meaning is discussed in the following points:

First, PAF alone has good performance, since in the PropBank evaluation it outperforms the linear kernel (Kp1), 88.7% vs. 86.7%, whereas in FrameNet it shows a similar performance, 79.5% vs. 82.1% (compare the tables with Figure 5). This suggests that PAF generates the same information as the standard features in a linear space. However, when a degree greater than 1 is used for the standard features, PAF is outperformed.[10]

Second, SCF improves the polynomial kernel (d = 3), i.e. the current state-of-the-art, by about 3 percent points on PropBank (column SCF·P). This suggests that (a) PAK can measure the similarity between two SCF structures and (b) the sub-categorization information provides effective clues about the expected argument type. The interesting consequence is that SCF together with PAK seems suitable to automatically cluster different verbs that have the same syntactic realization. We note also that, to fully exploit the SCF information, it is necessary to use a kernel product (K1 · K2) combination rather than the sum (K1 + K2), e.g. column SCF+P.
The interesting consequence is that becomes separable in a superlinear space (Cris- SCF together with PAK seems suitable to au- tianini and Shawe-Taylor, 2000). tomatically cluster different verbs that have the After it was established that the best ker- same syntactic realization. We note also that to nel for standard features is Kp3 , we carried out fully exploit the SCF information it is necessary all the other experiments using it in the kernel to use a kernel product (K1 · K2) combination combinations. Table 2 and 3 show the single rather than the sum (K1 + K2), e.g. column class (f1 measure) as well as multi-class classi- SCF+P. fier (accuracy) performance for PropBank and Finally, the FrameNet results are completely FrameNet respectively. Each column of the two different. No kernel combinations with both tables refers to a different kernel defined in the PAF and SCF produce an improvement. On

On the contrary, the performance decreases, suggesting that the classifier is confused by this syntactic information. The main reason for the different outcome is that PropBank arguments are different from semantic roles, as they are an intermediate level between syntax and semantics, i.e. they are nearer to grammatical functions. In fact, in PropBank, arguments are annotated consistently with syntactic alternations (see the Annotation guidelines for PropBank at www.cis.upenn.edu/∼ace). On the contrary, FrameNet roles represent the final semantic product and they are assigned according to semantic considerations rather than syntactic aspects. For example, the Cause and Agent semantic roles have identical syntactic realizations. This prevents SCF from distinguishing between them. Another minor reason may be the use of automatic parse-trees to extract PAF and SCF, even if preliminary experiments on automatic semantic shallow parsing of PropBank have shown no important differences versus semantic parsing which adopts Gold Standard parse-trees.

[8] Preliminary experiments indicate that SVMs can reach 90% by using the frame feature.
[9] Indeed, this is true in most of the cases.
[10] Unfortunately, the use of a polynomial kernel on top of the tree fragments to generate the XOR functions seems not successful.

5 Conclusions

In this paper, we have experimented with SVMs using the two novel convolution kernels PAF and SCF, which are designed for the semantic structures derived from the PropBank and FrameNet corpora. Moreover, we have combined them with the polynomial kernel of standard features. The results have shown that:

First, SVMs using the above kernels are appealing for semantically parsing both corpora.

Second, PAF and SCF can be used to improve the automatic classification of PropBank arguments, as they provide clues about the predicate argument structure of the target verb. For example, SCF improves (a) the classification state-of-the-art (i.e. the polynomial kernel) by about 3 percent points and (b) the best literature result by about 5 percent points.

Third, additional work is needed to design kernels suitable to learn the deep semantics contained in FrameNet, as it seems not sensitive to both PAF and SCF information.

Finally, an analysis of SVMs using polynomial kernels over standard features has explained why they largely outperform linear classifiers based on standard features.

In the future we plan to design other structures and combine them with SCF, PAF and standard features. In this vision, the learning will be carried out on a set of structural features instead of a set of flat features. Other studies may relate to the use of SCF to generate verb clusters.

Acknowledgments

This research has been sponsored by the ARDA AQUAINT program. In addition, I would like to thank Professor Sanda Harabagiu for her advice, Adrian Cosmin Bejan for implementing the feature extractor and Paul Morărescu for processing the FrameNet data. Many thanks to the anonymous reviewers for their invaluable suggestions.

References

Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL-02.

Michael Collins. 1997. Three generative, lexicalized models for statistical parsing. In Proceedings of ACL-97, pages 16-23, Somerset, New Jersey.

Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines. Cambridge University Press.

Charles J. Fillmore. 1982. Frame semantics. In Linguistics in the Morning Calm, pages 111-137.

Daniel Gildea and Daniel Jurasfky. 2002. Automatic labeling of semantic roles. Computational Linguistics.

Daniel Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of ACL-02, Philadelphia, PA.

Kadri Hacioglu, Sameer Pradhan, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2003. Shallow semantic parsing using Support Vector Machines. TR-CSLR-2003-03, University of Colorado.

R. Jackendoff. 1990. Semantic Structures, Current Studies in Linguistics series. Cambridge, Massachusetts: The MIT Press.

T. Joachims. 1999. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of LREC-02, Las Palmas, Spain.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics.

Alessandro Moschitti and Cosmin Adrian Bejan. 2004. A semantic kernel for predicate argument classification. In Proceedings of CoNLL-04, Boston, USA.

Mihai Surdeanu, Sanda M. Harabagiu, John Williams, and John Aarseth. 2003. Using predicate-argument structures for information extraction. In Proceedings of ACL-03, Sapporo, Japan.

V. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.