<<

Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

Danilo Croce, University of Tor Vergata, 00133 Roma, Italy, [email protected]
Alessandro Moschitti, University of Trento, 38123 Povo (TN), Italy, [email protected]

Roberto Basili, University of Tor Vergata, 00133 Roma, Italy, [email protected]
Martha Palmer, University of Colorado at Boulder, Boulder, CO 80302, USA, [email protected]

Abstract

In this paper, we propose innovative representations for automatic classification of verbs according to mainstream linguistic theories, namely VerbNet and FrameNet. First, syntactic and semantic structures capturing essential lexical and syntactic properties of verbs are defined. Then, we design advanced similarity functions between such structures, i.e., semantic tree kernel functions, for exploiting distributional and grammatical information in Support Vector Machines. The extensive empirical analysis on VerbNet class and frame detection shows that our models capture meaningful syntactic/semantic structures, which allows for improving the state-of-the-art.

1 Introduction

Verb classification is a fundamental topic of computational linguistics research, given its importance for understanding the role of verbs in conveying semantics of natural language (NL). Additionally, generalization based on verb classification is central to many NL applications, ranging from shallow semantic parsing to semantic search or information extraction. Currently, a lot of interest has been paid to two verb categorization schemes: VerbNet (Schuler, 2005) and FrameNet (Baker et al., 1998), which has also fostered the production of many automatic approaches to extraction. Such work has shown that syntax is necessary for helping to predict the roles of verb arguments and consequently their verb sense (Gildea and Jurafsky, 2002; Pradhan et al., 2005; Gildea and Palmer, 2002). However, the definition of models for optimally combining lexical and syntactic constraints is still far from being accomplished. In particular, the exhaustive design and experimentation of lexical and syntactic features for learning verb classification appears to be computationally problematic. For example, the verb order can belong to two VerbNet classes:

– The class 60.1, i.e., order someone to do something, as shown in: The Illinois Supreme Court ordered the commission to audit Commonwealth Edison's construction expenses and refund any unreasonable expenses.

– The class 13.5.1, i.e., order or request something, as in: ... Michelle blabs about it to a sandwich man while ordering lunch over the phone.

Clearly, the syntactic realization can be used to discern the cases above, but it would not be enough to correctly classify the following verb occurrence: .. ordered the lunch to be delivered .., which is in Verb class 13.5.1. For such a case, selectional restrictions are needed. These have also been shown to be useful for semantic role classification (Zapirain et al., 2010).

Note that their coding in learning algorithms is rather complex: we need to take into account syntactic structures, which may require an exponential number of syntactic features (i.e., all their possible substructures). Moreover, these have to be enriched with lexical information to trigger lexical preferences.

In this paper, we tackle the problem above by studying innovative representations for automatic verb classification according to VerbNet and FrameNet. We define syntactic and semantic structures capturing essential lexical and syntactic properties of verbs. Then, we apply similarity between


Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 263–272, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

such structures, i.e., kernel functions, which can also exploit distributional lexical semantics, to train automatic classifiers. The basic idea of such functions is to compute the similarity between two verbs in terms of all the possible substructures of their syntactic frames. We define and automatically extract a lexicalized approximation of the latter. Then, we apply kernel functions that jointly model structural and lexical similarity, so that syntactic properties are combined with generalized lexical information. The nice property of kernel functions is that they can be used in place of the scalar product of feature vectors to train algorithms such as Support Vector Machines (SVMs). This way, SVMs can learn the association between syntactic (sub-)structures, whose lexical arguments are generalized, and target verb classes, i.e., they can also learn selectional restrictions.

We carried out extensive experiments on verb class and frame detection, which showed that our models greatly improve on the state-of-the-art (up to about 13% of relative error reduction). Such results are nicely assessed by manually inspecting the most important substructures used by the classifiers, as they largely correlate with syntactic frames defined in VerbNet.

In the rest of the paper, Sec. 2 reports on related work, Sec. 3 and Sec. 4 describe previous and our models for syntactic and semantic similarity, respectively, Sec. 5 illustrates our experiments, Sec. 6 discusses the output of the models in terms of error analysis and important structures, and finally Sec. 7 derives the conclusions.

2 Related work

Our target task is verb classification, but at the same time our models exploit distributional models as well as structural kernels. The next three subsections report related work in such areas.

Verb Classification. The introductory verb classification example has intuitively shown the complexity of defining a comprehensive feature representation. Hereafter, we report on analyses carried out in previous work.

It has often been observed that verb senses tend to show different selectional constraints in a specific argument position, and the above verb order is a clear example. In the direct object position of the example for the first sense 60.1 of order, we found commission in the role of PATIENT of the predicate. It clearly satisfies the +ANIMATE/+ORGANIZATION restriction on the PATIENT role. This is not true for the direct object dependency of the alternative sense 13.5.1, which usually expresses the THEME role, with unrestricted type selection. When properly generalized, the direct object information has thus been shown to be highly predictive of verb sense distinctions.

In (Brown et al., 2011), the so-called dynamic dependency neighborhoods (DDN), i.e., the set of verbs that are typically collocated with a direct object, are shown to be more helpful than lexical information (e.g., WordNet). The set of typical verbs taking a noun n as a direct object is in fact a strong characterization for semantic similarity, as all the nouns m similar to n tend to collocate with the same verbs. This is true also for other syntactic dependencies, among which the direct object dependency is possibly the strongest cue (as shown for example in (Dligach and Palmer, 2008)).

In order to generalize the above DDN feature, distributional models are ideal, as they are designed to model all the collocations of a given noun, according to large scale corpus analysis. Their ability to capture lexical similarity is well established in WSD tasks (e.g. (Schütze, 1998)), thesauri harvesting (Lin, 1998), semantic role labeling (Croce et al., 2010), as well as information retrieval (e.g. (Furnas et al., 1988)).

Distributional Models (DMs). These models follow the distributional hypothesis (Firth, 1957) and characterize lexical meanings in terms of context of use (Wittgenstein, 1953). By inducing geometrical notions of vectors and norms through corpus analysis, they provide a topological definition of semantic similarity, i.e., distance in a space. DMs can capture the similarity between words such as delegation, deputation or company and commission. In the case of sense 60.1 of the verb order, DMs can be used to suggest that the role PATIENT can be inherited by all these words, as suitable Organisations.

In supervised language learning, when few examples are available, DMs support cost-effective lexical generalizations, often outperforming knowledge-based resources (such as WordNet, as in (Pantel et al., 2007)). Obviously, the choice of the context

type determines the type of targeted semantic properties. Wider contexts (e.g., entire documents) are shown to suggest topical relations. Smaller contexts tend to capture more specific semantic aspects, e.g. the syntactic behavior, and better capture paradigmatic relations, such as synonymy. In particular, word space models, as described in (Sahlgren, 2006), define contexts as the words appearing in an n-sized window, centered around a target word. Co-occurrence counts are thus collected in a words-by-words matrix, where each element records the number of times two words co-occur within a single window of word tokens. Moreover, robust weighting schemas are used to smooth counts against too frequent co-occurrence pairs: Pointwise Mutual Information (PMI) scores (Turney and Pantel, 2010) are commonly adopted.

Structural Kernels. Tree and sequence kernels have been successfully used in many NLP applications, e.g., parse reranking and adaptation (Collins and Duffy, 2002; Shen et al., 2003; Toutanova et al., 2004; Kudo et al., 2005; Titov and Henderson, 2006), chunking and dependency parsing, e.g., (Kudo and Matsumoto, 2003; Daumé III and Marcu, 2004), named entity recognition (Cumby and Roth, 2003), text categorization, e.g., (Cancedda et al., 2003; Gliozzo et al., 2005), and relation extraction, e.g., (Zelenko et al., 2002; Bunescu and Mooney, 2005; Zhang et al., 2006).

Recently, DMs have also been proposed in integrated syntactic-semantic structures that feed advanced learning functions, such as the semantic tree kernels discussed in (Bloehdorn and Moschitti, 2007a; Bloehdorn and Moschitti, 2007b; Mehdad et al., 2010; Croce et al., 2011).

3 Structural Similarity Functions

In this paper we model verb classifiers by exploiting previous technology for kernel methods. In particular, we design new models for verb classification by adopting algorithms for structural similarity, known as Smoothed Partial Tree Kernels (SPTKs) (Croce et al., 2011). We define new innovative structures and similarity functions based on LSA.

The main idea of SPTK is rather simple: (i) we measure the similarity between two trees in terms of the number of shared subtrees; and (ii) such a number also includes similar fragments whose lexical nodes are just related (so they can be different). The contribution of (ii) is proportional to the lexical similarity of the tree lexical nodes, where the latter can be evaluated according to distributional models or also lexical resources, e.g., WordNet.

In the following, we define our models based on previous work on LSA and SPTKs.

3.1 LSA as lexical similarity model

Robust representations can be obtained through intelligent dimensionality reduction methods. In LSA, the original word-by-context matrix M is decomposed through Singular Value Decomposition (SVD) (Landauer and Dumais, 1997; Golub and Kahan, 1965) into the product of three new matrices U, S and V, so that S is diagonal and M = U S V^T. M is then approximated by M_k = U_k S_k V_k^T, where only the first k columns of U and V are used, corresponding to the first k greatest singular values. This approximation supplies a way to project a generic term w_i into the k-dimensional space using W = U_k S_k^{1/2}, where each row corresponds to the representation vector \vec{w}_i. The original statistical information about M is captured by the new k-dimensional space, which preserves the global structure while removing low-variant dimensions, i.e., distribution noise. Given two words w_1 and w_2, the term similarity function \sigma is estimated as the cosine similarity between the corresponding projections \vec{w}_1, \vec{w}_2 in the LSA space, i.e., \sigma(w_1, w_2) = \frac{\vec{w}_1 \cdot \vec{w}_2}{\|\vec{w}_1\| \, \|\vec{w}_2\|}. This is known as the Latent Semantic Kernel (LSK), proposed in (Cristianini et al., 2001), as it defines a positive semi-definite Gram matrix G = [\sigma(w_1, w_2)] for all w_1, w_2 (Shawe-Taylor and Cristianini, 2004). \sigma is thus a valid kernel and can be combined with other kernels, as discussed in the next section.

3.2 Tree Kernels driven by Semantic Similarity

To our knowledge, two main types of tree kernels exploit lexical similarity: the syntactic semantic tree kernel defined in (Bloehdorn and Moschitti, 2007a), applied to constituency trees, and the smoothed partial tree kernels (SPTKs) defined in (Croce et al., 2011), which generalize the former. We report the definition of the latter, as we modified it for our purposes. SPTK computes the number of common substructures between two trees T1 and T2 without explicitly considering the whole fragment space. Its

Figure 1: Constituency Tree (CT) representation of verbs.

Figure 2: Representation of verbs according to the GR Centered Tree (GRCT).

general equations are reported hereafter:

  TK(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2),   (1)

where N_{T_1} and N_{T_2} are the sets of the T_1's and T_2's nodes, respectively, and \Delta(n_1, n_2) is equal to the number of common fragments rooted in the n_1 and n_2 nodes¹. The \Delta function determines the richness of the kernel space and thus induces different tree kernels, for example, the syntactic tree kernel (STK) (Collins and Duffy, 2002) or the partial tree kernel (PTK) (Moschitti, 2006).

The algorithm for SPTK's \Delta is the following: if n_1 and n_2 are leaves, then \Delta_\sigma(n_1, n_2) = \mu \lambda \sigma(n_1, n_2); else

  \Delta_\sigma(n_1, n_2) = \mu \sigma(n_1, n_2) \times \Big( \lambda^2 + \sum_{\vec{I}_1, \vec{I}_2 : l(\vec{I}_1) = l(\vec{I}_2)} \lambda^{d(\vec{I}_1) + d(\vec{I}_2)} \prod_{j=1}^{l(\vec{I}_1)} \Delta_\sigma\big(c_{n_1}(\vec{I}_{1j}), c_{n_2}(\vec{I}_{2j})\big) \Big),   (2)

where (1) \sigma is any similarity between nodes, e.g., between their lexical labels; (2) \lambda, \mu \in [0, 1] are decay factors; (3) c_{n_1}(h) is the h-th child of the node n_1; (4) \vec{I}_1 and \vec{I}_2 are two sequences of indexes, i.e., \vec{I} = (i_1, i_2, .., i_{l(\vec{I})}), with 1 \le i_1 < i_2 < .. < i_{l(\vec{I})}; and (5) d(\vec{I}_1) = \vec{I}_{1 l(\vec{I}_1)} - \vec{I}_{11} + 1 and d(\vec{I}_2) = \vec{I}_{2 l(\vec{I}_2)} - \vec{I}_{21} + 1.

¹ To have a similarity score between 0 and 1, a normalization in the kernel space, i.e. TK(T_1, T_2) / \sqrt{TK(T_1, T_1) \times TK(T_2, T_2)}, is applied.

Note that, as shown in (Croce et al., 2011), the average running time of SPTK is sub-quadratic in the number of the tree nodes. In the next section we show how we exploit the class of SPTKs for verb classification.

4 Verb Classification Models

The design of SPTK-based algorithms for our verb classification requires the modeling of two different aspects: (i) a tree representation for the verbs; and (ii) the lexical similarity suitable for the task. We also modified SPTK to apply different similarity functions to different nodes, to introduce flexibility.

4.1 Verb Structural Representation

The implicit feature space generated by structural kernels, and the corresponding notion of similarity between verbs, obviously depends on the input structures. In the cases of STK, PTK and SPTK, different tree representations lead to engineering more or less expressive linguistic feature spaces.

With the aim of capturing syntactic features, we started from two different parsing paradigms: phrase and dependency structures. For example, for representing the first example of the introduction, we can use the constituency tree (CT) in Figure 1, where the target verb node is enriched with the TARGET label. Here, we apply tree pruning to reduce the computational complexity of tree kernels, as it is proportional to the number of nodes in the input trees. Accordingly, we only keep the subtree dominated by the target VP, pruning from it all the S-nodes along with their subtrees (i.e., all nested sentences are removed). To further improve generalization, we lemmatize lexical nodes and add generalized POS-Tags, i.e., noun (::n), verb (::v), adjective (::a), determiner (::d) and so on, to them. This is useful for constraining similarity to be only contributed by lexical pairs of the same grammatical category.
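To make the recursion of Eq. 1-2 concrete, the following is a naive Python sketch (an illustration only, not the authors' implementation: it enumerates child subsequences explicitly and is therefore exponential, unlike the efficient dynamic-programming algorithm of (Croce et al., 2011); the 0/1 sigma below stands in for the LSA-based similarity applied to lexical nodes):

```python
from itertools import combinations

class Node:
    """A tree node, e.g., a GR/POS label or a lexeme such as 'order::v'."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)
    def is_leaf(self):
        return not self.children

def sigma(n1, n2):
    # Placeholder node similarity: exact label match. A distributional
    # similarity between lexical nodes would be plugged in here instead.
    return 1.0 if n1.label == n2.label else 0.0

def delta(n1, n2, lam=0.4, mu=0.4):
    """Delta of Eq. 2, via explicit enumeration of child index sequences."""
    s = sigma(n1, n2)
    if s == 0.0:
        return 0.0
    if n1.is_leaf() and n2.is_leaf():
        return mu * lam * s                      # leaf case: mu * lambda * sigma
    if n1.is_leaf() or n2.is_leaf():
        return 0.0
    total = lam ** 2
    c1, c2 = n1.children, n2.children
    for length in range(1, min(len(c1), len(c2)) + 1):
        for i1 in combinations(range(len(c1)), length):
            for i2 in combinations(range(len(c2)), length):
                # lambda^{d(I1)+d(I2)}, with d(I) = i_last - i_first + 1
                prod = lam ** ((i1[-1] - i1[0] + 1) + (i2[-1] - i2[0] + 1))
                for j in range(length):
                    prod *= delta(c1[i1[j]], c2[i2[j]], lam, mu)
                total += prod
    return mu * s * total

def tree_kernel(t1, t2, **kw):
    """TK of Eq. 1: sum Delta over all node pairs of the two trees."""
    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)
    return sum(delta(a, b, **kw) for a in nodes(t1) for b in nodes(t2))
```

With sigma fixed to exact matching, this behaves like a (partial) tree kernel; replacing sigma on lexical leaves with an LSA cosine similarity as in Sec. 3.1 yields the smoothed variant.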

Figure 3: Representation of verbs according to the Lexical Centered Tree (LCT).
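Trees like those of Figures 2 and 3 can be serialized as bracketed structures. The following is a minimal, hypothetical sketch (not the authors' code) of such a rendering for the GRCT variant, assuming each dependency node carries a grammatical relation, a POS-Tag and a lemma::POS lexeme:

```python
def grct(node):
    """Render a dependency node as a GRCT bracketed tree: the grammatical
    relation is the inner node; its POS-Tag/lexeme pair is the first child,
    followed by the GRCTs of its dependents."""
    inner = f"({node['pos']} {node['lex']})"
    kids = " ".join(grct(d) for d in node.get("deps", []))
    return f"({node['gr']} {inner} {kids})" if kids else f"({node['gr']} {inner})"

# Pruned fragment of the first introductory example (cf. Figure 2)
sent = {"gr": "ROOT", "pos": "VBD", "lex": "TARGET-order::v", "deps": [
    {"gr": "SBJ", "pos": "NNP", "lex": "court::n",
     "deps": [{"gr": "NMOD", "pos": "DT", "lex": "the::d"}]},
    {"gr": "OBJ", "pos": "NN", "lex": "commission::n",
     "deps": [{"gr": "NMOD", "pos": "DT", "lex": "the::d"}]}]}

print(grct(sent))
# (ROOT (VBD TARGET-order::v) (SBJ (NNP court::n) (NMOD (DT the::d))) (OBJ (NN commission::n) (NMOD (DT the::d))))
```

The LCT variant would simply swap the roles of the lexeme and the relation, making the lexeme the central node with GR and POS-Tag as its rightmost children.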

To encode dependency structure information in a tree (so that we can use it in tree kernels), we use (i) lexemes as nodes of our tree, (ii) their dependencies as edges between the nodes, and (iii) the dependency labels, e.g., grammatical functions (GR), and POS-Tags, again as tree nodes. We designed two different tree types: (i) in the first type, GR are central nodes from which dependencies are drawn, and all the other features of the central node, i.e., the lexical surface form and its POS-Tag, are added as additional children. An example of the GR Centered Tree (GRCT) is shown in Figure 2, where the POS-Tags and lexemes are children of GR nodes. (ii) The second type of tree uses lexicals as central nodes, on which both GR and POS-Tag are added as the rightmost children. For example, Figure 3 shows an example of a Lexical Centered Tree (LCT). For both trees, the pruning strategy only preserves the verb node, its direct ancestors (father and siblings) and its descendants up to two levels (i.e., direct children and grandchildren of the verb node). Note that our dependency tree can capture the semantic head of the verbal argument along with the main syntactic construct, e.g., to audit.

4.2 Generalized node similarity for SPTK

We have defined the new similarity στ to be used in Eq. 2, which makes SPTK more effective, as shown by Alg. 1. στ takes two nodes n1 and n2 and applies a different similarity for each node type. The latter is derived by τ and can be: GR (i.e., SYNT), POS-Tag (i.e., POS) or a lexical (i.e., LEX) type. In our experiments, we assign 0/1 similarity to SYNT and POS nodes according to string matching. For the LEX type, we apply a lexical similarity learned with LSA, restricted to pairs of lexicals associated with the same POS-Tag. It should be noted that the type-based similarity allows for potentially applying a different similarity to each node. Indeed, we also tested an amplification factor, namely the leaf weight (lw), which amplifies the matching values of the leaf nodes.

Algorithm 1 στ(n1, n2, lw)
  στ ← 0
  if τ(n1) = τ(n2) = SYNT ∧ label(n1) = label(n2) then στ ← 1 end if
  if τ(n1) = τ(n2) = POS ∧ label(n1) = label(n2) then στ ← 1 end if
  if τ(n1) = τ(n2) = LEX ∧ pos(n1) = pos(n2) then στ ← σLEX(n1, n2) end if
  if leaf(n1) ∧ leaf(n2) then στ ← στ × lw end if
  return στ

5 Experiments

In these experiments, we tested the impact of our different verb representations using different kernels, similarities and parameters. We also compared with simple bag-of-words (BOW) models and the state-of-the-art.

5.1 General experimental setup

We consider two different corpora: one for VerbNet and the other for FrameNet. For the former, we used the same verb classification setting of (Brown et al., 2011). Sentences are drawn from the SemLink corpus (Loper et al., 2007), which consists of the PropBanked Penn Treebank portions of the Wall Street Journal. It contains 113K verb instances, 97K of which are verbs represented in at least one VerbNet class. SemLink includes 495 verbs whose instances are labeled with more than one class (including one single VerbNet class or none). We used all instances of the corpus, for a total of 45,584 instances for 180 verb classes. When instances labeled with the none class are not included, the number of examples becomes 23,719.

The second corpus refers to FrameNet frame classification. The training and test data are drawn from the FrameNet 1.5 corpus², which consists of 135K sentences annotated according to the frame semantics

² http://framenet.icsi.berkeley.edu

(Baker et al., 1998). We selected the subset of frames containing more than 100 sentences annotated with a verbal predicate, for a total of 62,813 sentences in 187 frames (i.e., very close to the VerbNet datasets). For both datasets, we used 70% of the instances for training and 30% for testing.

Our verb (multi) classifier is designed with the one-vs-all (Rifkin and Klautau, 2004) multi-classification schema. This uses a set of binary SVM classifiers, one for each verb class (frame) i. The sentences whose verb is labeled with the class i are positive examples for the classifier i. The sentences whose verbs are compatible with the class i but evoke a different class, or are labeled with none (no current verb class applies), are added as negative examples. In the classification phase, the binary classifiers are applied by (i) only considering classes that are compatible with the target verbs; and (ii) selecting the class associated with the maximum positive SVM margin. If all classifiers provide a negative score, the example is labeled with none.

To learn the binary classifiers of the schema above, we coded our modified SPTK in SVM-Light-TK³ (Moschitti, 2006). The parameterization of each classifier is carried out on a held-out set (30% of the training data) and concerns the setting of the trade-off parameter (option -c) and the leaf weight (lw) (see Alg. 1), which is used to linearly scale the contribution of the leaf nodes. In contrast, the cost-factor parameter of SVM-Light-TK is set as the ratio between the number of negative and positive examples, for attempting to have a balanced Precision/Recall.

Regarding the SPTK setting, we used the lexical similarity σ defined in Sec. 3.1. In more detail, LSA was applied to ukWaC (Baroni et al., 2009), which is a large scale document collection made up of 2 billion tokens. M is constructed by applying POS tagging to build rows with pairs ⟨lemma, POS⟩ (lemma::POS in brief). The contexts of such items are the columns of M and are short windows of size [−3, +3], centered on the items. This allows for better capturing syntactic properties of words. The most frequent 20,000 items are selected along with their 20k contexts. The entries of M are the point-wise mutual information between them. SVD reduction is then applied to M, with a dimensionality cut of l = 250.

For generating the CT, GRCT and LCT structures, we used the constituency trees generated by the Charniak parser (Charniak, 2000) and the dependency structures generated by the LTH syntactic parser (described in (Johansson and Nugues, 2008)).

The classification performance is measured with accuracy (i.e., the percentage of correct classifications). We also derive the statistical significance of the results by using the model described in (Yeh, 2000) and implemented in (Padó, 2006).

Table 1: VerbNet accuracy with the none class

                 STK          PTK           SPTK
                 lw   Acc.    lw    Acc.    lw    Acc.
  CT             -    83.83%  8     84.57%  8     84.46%
  GRCT           -    84.83%  8     85.15%  8     85.28%
  LCT            -    77.73%  0.1   86.03%  0.2   86.72%
  Brown et al.        84.64%
  BOW                 79.08%
  SK                  82.08%

Table 2: FrameNet accuracy without the none class

                 STK          PTK           SPTK
                 lw   Acc.    lw    Acc.    lw    Acc.
  GRCT           -    92.67%  6     92.97%  0.4   93.54%
  LCT            -    90.28%  6     92.99%  0.3   93.78%
  BOW                 91.13%
  SK                  91.84%

5.2 VerbNet and FrameNet Classification Results

To assess the performance of our settings, we also derive a simple baseline based on the bag-of-words (BOW) model. For it, we represent an instance of a verb in a sentence using all words of the sentence (by creating a special feature for the predicate word). We also used sequence kernels (SK), i.e., PTK applied to a tree composed of a fake root and only one level of sentence words. For efficiency reasons⁴, we only consider the 10 words before and after the predicate, with subsequence features of length up to 5.

³ Structural kernels in SVM-Light (Joachims, 2000), available at http://disi.unitn.it/moschitti/Tree-Kernel.htm
⁴ The average running time of the SK is much higher than the one of PTK. When a tree is composed of only one level, PTK collapses to SK.

Table 1 reports the accuracy of different models for VerbNet classification. It should be noted that: first, SK produces a much higher accuracy than BOW, i.e., 82.08 vs. 79.08. On one hand, this is

generally in contrast with standard text categorization tasks, for which n-gram models show accuracy comparable to the simpler BOW. On the other hand, it simply confirms that verb classification requires the dependency information between words (i.e., at least the sequential structure information provided by SK).

Second, SK is 2.56 percent points below the state-of-the-art achieved in (Brown et al., 2011) (BR), i.e., 82.08 vs. 84.64. In contrast, STK applied to our representations (CT, GRCT and LCT) produces comparable accuracy, e.g., 84.83, confirming that syntactic representation is needed to reach the state-of-the-art.

Third, PTK, which produces more general structures, improves over BR by almost 1.5 points (a statistically significant result) when using our dependency structures GRCT and LCT. CT does not produce the same improvement, since it does not allow PTK to directly compare the lexical structure (lexemes are all leaf nodes in CT, and very large trees are needed to connect some pairs of them).

Finally, the best model of SPTK (i.e., using LCT) improves over the best PTK (i.e., using LCT) by almost 1 point (a statistically significant result): this difference is only given by lexical similarity. SPTK improves on the state-of-the-art by about 2.08 absolute percent points, which, given the high accuracy of the baseline, corresponds to 13.5% of relative error reduction.

We carried out similar experiments for frame classification. One interesting difference is that SK improves BOW by only 0.70, i.e., 4 times less than in the VerbNet setting. This suggests that word order around the predicate is more important for deriving the VerbNet class than the FrameNet frame. Additionally, LCT or GRCT seems to be invariant for both PTK and SPTK, whereas the lexical similarity still produces a relevant improvement on PTK, i.e., 13% of relative error reduction, for an absolute accuracy of 93.78%. The latter improves over the state-of-the-art, i.e., 92.63%, derived in (Giuglea and Moschitti, 2006) by using STK on CT on 133 frames.

Table 3: VerbNet accuracy without the none class

                 STK          PTK           SPTK
                 lw   Acc.    lw    Acc.    lw    Acc.
  CT             -    91.14%  8     91.66%  6     91.66%
  GRCT           -    91.71%  8     92.38%  4     92.33%
  LCT            -    89.20%  0.2   92.54%  0.1   92.55%
  BOW                 88.16%
  SK                  89.86%

Figure 4: Learning curves: VerbNet accuracy with the none class (accuracy of SPTK and BOW as a function of the percentage of training examples, against the constant accuracy of Brown et al.).

We also carried out experiments to understand the role of the none class. Table 3 reports on the VerbNet classification without its instances. This is of course an unrealistic setting, as it would assume that the current VerbNet release already includes all senses for all verbs. In the table, we note that the overall accuracy highly increases and the difference between models reduces. The similarities play no role anymore. This may suggest that SPTK can help in complex settings, where verb class characterization is more difficult. Another important property of SPTK models is their ability to generalize. To test this aspect, Figure 4 illustrates the learning curves of SPTK with respect to BOW and the accuracy achieved by BR (as a constant line). It is impressive to note that with only 40% of the data SPTK can reach the state-of-the-art.

6 Model Analysis and Discussion

We carried out an analysis of the system errors and its induced features. These can be examined by applying the reverse engineering tool⁵ proposed in (Pighin and Moschitti, 2010; Pighin and Moschitti, 2009a; Pighin and Moschitti, 2009b), which extracts the most important features for the classification model. Many mistakes are related to false positives and negatives of the none class (about 72% of the errors). This class also causes data imbalance. Most errors are also due to a lack of lexical information available to the SPTK kernel: (i) in 30% of the errors, the argument heads were proper nouns for which the lexical generalization provided by the DMs was not
The latter improves over the state- 5http://danielepighin.net/cms/software/flink

available; and (ii) in 76% of the errors only two or fewer argument heads are included in the extracted tree, so tree kernels cannot exploit enough lexical information to disambiguate verb senses. Additionally, ambiguity characterizes errors where the system is linguistically consistent but the learned selectional preferences are not sufficient to separate verb senses. These errors are mainly due to the lack of contextual information. While error analysis suggests that further improvement is possible (e.g., by exploiting proper nouns), the generalizations currently achieved by SPTK are rather effective. Tables 4 and 5 report the tree structures characterizing the most informative training examples of the two senses of the verb order, i.e., the VerbNet classes 13.5.1 (make a request for something) and 60 (give instructions to or direct somebody to do something with authority).

VerbNet class 13.5.1:
(IM(VB(target))(OBJ))
(VC(VB(target))(OBJ))
(VC(VBG(target))(OBJ))
(OPRD(TO)(IM(VB(target))(OBJ)))
(PMOD(VBG(target))(OBJ))
(VB(target))
(VC(VBN(target)))
(PRP(TO)(IM(VB(target))(OBJ)))
(IM(VB(target))(OBJ)(ADV(IN)(PMOD)))
(OPRD(TO)(IM(VB(target))(OBJ)(ADV(IN)(PMOD))))

VerbNet class 60:
(VC(VB(target))(OBJ))
(NMOD(VBG(target))(OPRD))
(VC(VBN(target))(OPRD))
(NMOD(VBN(target))(OPRD))
(PMOD(VBG(target))(OBJ))
(ROOT(SBJ)(VBD(target))(OBJ)(P(,)))
(VC(VB(target))(OPRD))
(ROOT(SBJ)(VBZ(target))(OBJ)(P(,)))
(NMOD(SBJ(WDT))(VBZ(target))(OPRD))
(NMOD(SBJ)(VBZ(target))(OPRD(SBJ)(TO)(IM)))

Table 4: GRCT fragments

VerbNet class 13.5.1:
(VP(VB(target))(NP))
(VP(VBG(target))(NP))
(VP(VBD(target))(NP))
(VP(TO)(VP(VB(target))(NP)))
(S(NP-SBJ)(VP(VBP(target))(NP)))

VerbNet class 60:
(VBN(target))
(VP(VBD(target))(S))
(VP(VBZ(target))(S))
(VBP(target))
(VP(VBD(target))(NP-1)(S(NP-SBJ)(VP)))

Table 5: CT fragments

In line with the method discussed in (Pighin and Moschitti, 2009b), these fragments are extracted as they appear in most of the support vectors selected during SVM training. As can easily be seen, the two classes are captured by rather different patterns. The typical accusative form with an explicit direct object emerges as characterizing sense 13.5.1, denoting the THEME role. All fragments of sense 60 instead emphasize the sentential complement of the verb, which in fact expresses the standard PROPOSITION role in VerbNet. Notice that tree fragments correspond to syntactic patterns. The a posteriori analysis of the learned models (i.e., the underlying support vectors) confirms very interesting grammatical generalizations, i.e., the capability of tree kernels to implicitly trigger useful linguistic inductions for complex semantic tasks. When SPTKs are adopted, verb arguments can be lexically generalized into word classes, i.e., clusters of argument heads (e.g., commission vs. delegation, or gift vs. present). Automatic generation of such classes is an interesting direction for future research.

7 Conclusion

We have proposed new approaches to characterizing verb classes in learning algorithms. The key idea is the use of structural representations of verbs based on syntactic dependencies and the use of structural kernels to measure similarity between such representations. The advantage of kernel methods is that they can be directly used in learning algorithms, e.g., SVMs, to train verb classifiers. Very interestingly, we can encode distributional lexical similarity in the similarity function acting over syntactic structures, and this allows for generalizing selectional restrictions through a sort of (supervised) syntactic and semantic co-clustering.

The verb classification results show a large improvement over the state-of-the-art for both VerbNet and FrameNet, with a relative error reduction of about 13.5% and 16.0%, respectively. In the future, we plan to exploit the models learned from FrameNet and VerbNet to carry out automatic mapping of verbs from one theory to the other.

Acknowledgements This research is partially supported by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant numbers 247758 (ETERNALS), 288024 (LIMOSINE) and 231126 (LIVINGKNOWLEDGE). Many thanks to the reviewers for their valuable suggestions.
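As a concrete illustration of how distributional lexical similarity can be plugged into a similarity function over syntactic structures, the following Python sketch compares dependency fragments in the style of Table 4: syntactic labels must match exactly, while lexical leaves are scored by the cosine of their distributional vectors. The toy vectors, the position-wise child matching, and the decay factor are illustrative assumptions only; the actual SPTK sums over all common child subsequences and uses corpus-derived vectors.

```python
from math import sqrt

# Toy distributional vectors for a few argument heads. These are
# illustrative stand-ins for the corpus-derived co-occurrence
# vectors a real system would use.
VECTORS = {
    "commission": {"audit": 0.9, "board": 0.8, "authority": 0.7},
    "delegation": {"board": 0.7, "authority": 0.9, "visit": 0.4},
    "lunch": {"eat": 0.9, "deliver": 0.6, "menu": 0.8},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def node_sim(a, b):
    """Smoothed node match: identical labels score 1.0; two lexical
    nodes score their distributional cosine; anything else scores 0."""
    if a == b:
        return 1.0
    if a in VECTORS and b in VECTORS:
        return cosine(VECTORS[a], VECTORS[b])
    return 0.0

def tree_sim(t1, t2, decay=0.4):
    """Recursive similarity between trees written as nested tuples,
    e.g. ("OBJ", ("NMOD",), ("NN", "commission")). Children are
    compared position-wise, a simplification of the full SPTK,
    which sums over all common child subsequences."""
    if isinstance(t1, str) or isinstance(t2, str):
        if isinstance(t1, str) and isinstance(t2, str):
            return node_sim(t1, t2)
        return 0.0
    root = node_sim(t1[0], t2[0])
    if root == 0.0:
        return 0.0
    return root + decay * sum(tree_sim(a, b, decay)
                              for a, b in zip(t1[1:], t2[1:]))
```

With these toy vectors, an (OBJ ...) fragment headed by commission scores higher against one headed by delegation than against one headed by lunch, mimicking the lexical generalization (commission vs. delegation) observed in the support vectors.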

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. LRE, 43(3):209–226.

Stephan Bloehdorn and Alessandro Moschitti. 2007a. Combined syntactic and semantic kernels for text classification. In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Proceedings of ECIR, volume 4425 of Lecture Notes in Computer Science, pages 307–318. Springer, April.

Stephan Bloehdorn and Alessandro Moschitti. 2007b. Structure and semantics for expressive text kernels. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 861–864, New York, NY, USA. ACM.

Susan Windisch Brown, Dmitriy Dligach, and Martha Palmer. 2011. VerbNet class assignment as a WSD task. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS '11, pages 85–94, Stroudsburg, PA, USA. Association for Computational Linguistics.

Razvan Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of HLT and EMNLP, pages 724–731, Vancouver, British Columbia, Canada, October.

Nicola Cancedda, Eric Gaussier, Cyril Goutte, and Jean-Michel Renders. 2003. Word sequence kernels. Journal of Machine Learning Research, 3:1059–1082.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of NAACL '00.

Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL '02.

Nello Cristianini, John Shawe-Taylor, and Huma Lodhi. 2001. Latent semantic kernels. In Carla Brodley and Andrea Danyluk, editors, Proceedings of ICML-01, 18th International Conference on Machine Learning, pages 66–73, Williams College, US. Morgan Kaufmann Publishers, San Francisco, US.

Danilo Croce, Cristina Giannone, Paolo Annesi, and Roberto Basili. 2010. Towards open-domain semantic role labeling. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 237–246, Uppsala, Sweden, July. Association for Computational Linguistics.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of EMNLP 2011.

Chad Cumby and Dan Roth. 2003. Kernel methods for relational learning. In Proceedings of ICML 2003.

Hal Daumé III and Daniel Marcu. 2004. NP bracketing by maximum entropy tagging and SVM reranking. In Proceedings of EMNLP '04.

Dmitriy Dligach and Martha Palmer. 2008. Novel semantic features for verb sense disambiguation. In ACL (Short Papers), pages 29–32. The Association for Computer Linguistics.

J. Firth. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Philological Society, Oxford. Reprinted in Palmer, F. (ed. 1968), Selected Papers of J. R. Firth, Longman, Harlow.

G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. 1988. Information retrieval using a singular value decomposition model of latent semantic structure. In Proc. of SIGIR '88, New York, USA.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):496–530.

Daniel Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL-02), Philadelphia, PA.

Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic role labeling via FrameNet, VerbNet and PropBank. In Proceedings of ACL, pages 929–936, Sydney, Australia, July.

Alfio Gliozzo, Claudio Giuliano, and Carlo Strapparava. 2005. Domain kernels for word sense disambiguation. In Proceedings of ACL '05, pages 403–410.

G. Golub and W. Kahan. 1965. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis.

T. Joachims. 2000. Estimating the generalization performance of a SVM efficiently. In Proceedings of ICML '00.

Richard Johansson and Pierre Nugues. 2008. Dependency-based syntactic-semantic analysis with PropBank and NomBank. In Proceedings of CoNLL 2008, pages 183–187.

Taku Kudo and Yuji Matsumoto. 2003. Fast methods for kernel-based text analysis. In Proceedings of ACL '03.

Taku Kudo, Jun Suzuki, and Hideki Isozaki. 2005. Boosting-based parse reranking with subtree features. In Proceedings of ACL '05.

Tom Landauer and Sue Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL, Montreal, Canada.

Edward Loper, Szu-ting Yi, and Martha Palmer. 2007. Combining lexical resources: Mapping between PropBank and VerbNet. In Proceedings of the 7th International Workshop on Computational Linguistics.

Yashar Mehdad, Alessandro Moschitti, and Fabio Massimo Zanzotto. 2010. Syntactic/semantic structures for textual entailment recognition. In HLT-NAACL, pages 1020–1028.

Alessandro Moschitti. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In Proceedings of ECML '06, pages 318–329.

Sebastian Padó. 2006. User's guide to sigf: Significance testing by approximate randomisation.

Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard Hovy. 2007. ISP: Learning inferential selectional preferences. In Proceedings of HLT/NAACL 2007.

Daniele Pighin and Alessandro Moschitti. 2009a. Efficient linearization of tree kernel functions. In Proceedings of CoNLL '09.

Daniele Pighin and Alessandro Moschitti. 2009b. Reverse engineering of tree kernel feature spaces. In Proceedings of EMNLP, pages 111–120, Singapore, August. Association for Computational Linguistics.

Daniele Pighin and Alessandro Moschitti. 2010. On reverse feature engineering of syntactic tree kernels. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 223–233, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sameer Pradhan, Kadri Hacioglu, Valeri Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Support vector learning for semantic argument classification. Machine Learning Journal.

Ryan Rifkin and Aldebaro Klautau. 2004. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141.

Magnus Sahlgren. 2006. The Word-Space Model. Ph.D. thesis, Stockholm University.

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24:97–123.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Libin Shen, Anoop Sarkar, and Aravind K. Joshi. 2003. Using LTAG based features in parse reranking. In Empirical Methods for Natural Language Processing (EMNLP), pages 89–96, Sapporo, Japan.

Ivan Titov and James Henderson. 2006. Porting statistical parsers with data-defined kernels. In Proceedings of CoNLL-X.

Kristina Toutanova, Penka Markova, and Christopher Manning. 2004. The leaf path projection view of parse trees: Exploring string kernels for HPSG parse selection. In Proceedings of EMNLP 2004.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Blackwells, Oxford.

Alexander S. Yeh. 2000. More accurate tests for the statistical significance of result differences. In COLING, pages 947–953.

Beñat Zapirain, Eneko Agirre, Lluís Màrquez, and Mihai Surdeanu. 2010. Improving semantic role classification with selectional preferences. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 373–376, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2002. Kernel methods for relation extraction. In Proceedings of EMNLP-ACL, pages 181–201.

Min Zhang, Jie Zhang, and Jian Su. 2006. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of NAACL.