Manifold Learning for the Semi-Supervised Induction of FrameNet Predicates: An Empirical Investigation

Danilo Croce and Daniele Previtali {croce,previtali}@info.uniroma2.it Department of Computer Science, Systems and Production University of Roma, Tor Vergata

Abstract annotated resources have been used by Seman- tic Role Labeling methods: they are commonly This work focuses on the empirical inves- developed using a supervised learning paradigm tigation of distributional models for the where a classifier learns to predict role labels automatic acquisition of frame inspired based on features extracted from annotated train- predicate . While several seman- ing data. One prominent resource has been de- tic spaces, both -based and syntax- veloped under the Berkeley FrameNet project as based, are employed, the impact of ge- a semantic lexicon for the core vocabulary of En- ometric representation based on dimen- glish, according to the so-called frame seman- sionality reduction techniques is inves- tic model (Fillmore, 1985). Here, a frame is a tigated. Data statistics are accordingly conceptual structure modeling a prototypical sit- studied along two orthogonal perspectives: uation, evoked in texts through the occurrence of exploits global its lexical units (LU) that linguistically expresses properties while Locality Preserving Pro- the situation of the frame. Lexical units of the jection emphasizes the role of local reg- same frame share semantic arguments. For ex- ularities. This latter is employed by em- ample, the frame KILLING has lexical units such bedding prior FrameNet-derived knowl- as assassin, assassinate, blood-bath, fatal, mur- edge in the corresponding non-euclidean derer, kill or suicide that share semantic arguments transformation. The empirical investiga- such as KILLER,INSTRUMENT,CAUSE,VICTIM. tion here reported sheds some light on the The current FrameNet release contains about 700 role played by these spaces as complex frames and 10,000 LUs. A corpus of 150,000 an- kernels for supervised (i.e. Support Vector notated examples sentences, from the British Na- Machine) algorithms: their use configures, tional Corpus (BNC), is also part of FrameNet. as a novel way to semi-supervised lexical Despite the size of this resource, it is un- learning, a highly appealing research di- der development and hence incomplete: several rection for knowledge rich scenarios like frames are not represented by evoking words and FrameNet-based semantic . the number of annotated sentences is unbalanced 1 Introduction across frames. It is one of the main reason for the performance drop of supervised SRL systems in Automatic (SRL) is a out-of-domain scenarios (Baker et al., 2007) (Jo- natural language processing (NLP) technique that hansson and Nugues, 2008). The limited cover- maps sentences to semantic representations and age of FrameNet corpus is even more noticeable identifies the semantic roles conveyed by senten- for the LUs dictionary: it only contains 10,000 tial constituents (Gildea and Jurafsky, 2002). Sev- lexical units, far less than the 210,000 entries in eral NLP applications have exploited this kind of WordNet 3.0. For example, the lexical unit crown, semantic representation ranging from Information according to the annotations, evokes the ACCOU- Extraction (Surdeanu et al., 2003; Moschitti et al., TREMENT frame. It refers to a particular sense: 2003)) to (Shen and Lapata, according to WordNet, it is “an ornamental jew- 2007), Paraphrase Identification (Pado and Erk, eled headdress signifying sovereignty”. Accord- 2005), and the modeling of re- ing to the same , this LU has 12 lations (Tatu and Moldovan, 2005). Large scale lexical senses and the first one (i.e. “The Crown

7 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language , ACL 2010, pages 7–16, Uppsala, Sweden, 16 July 2010. c 2010 Association for Computational Linguistics (or the reigning monarch) as the symbol of the are reported. Finally, in Section 5 we draw final power and authority of a monarchy”) could evoke conclusions and outline future work. other frames, like LEADERSHIP. In (Pennacchiotti et al., 2008) and (De Cao et al., 2008), the prob- 2 Related Work lem of LU automatic induction has been treated in a semi-supervised fashion. First, LUs are mod- As defined in (Pennacchiotti et al., 2008), LU in- eled by exploiting the distributional analysis of an duction is the task of assigning a generic lexical unannotated corpus and the lexical information of unit not yet present in the FrameNet database (the WordNet. These representations were used in or- so-called unknown LU) to the correct frame(s). der to find out frames potentially evoked by novel The number of possible classes (i.e. frames) and words in order to extend the FrameNet dictionary the multiple assignment problem make it a chal- limiting the effort of manual annotations. lenging task. LU induction has been integrated at SemEval-2007 as part of the Frame Seman- In this work the distributional model of LUs tic Structure Extraction shared task (Baker et al., is further developed. As in (Pennacchiotti et al., 2007), where systems are requested to assign the 2008), several word spaces (Pado and Lapata, correct frame to a given LU, even when the LU is 2007) are investigated in order to find the most not yet present in FrameNet. Several approaches suitable representation of the properties which show low coverage (Johansson and Nugues, 2007) characterize a frame. Two dimensionality reduc- or low accuracy, like (Burchardt et al., 2005). This tion techniques are applied here in this context. task is presented in (Pennacchiotti et al., 2008) and Latent Semantic Analysis (Landauer and Dumais, (De Cao et al., 2008), where two different mod- 1997) uses the Singular Value Decomposition to els which combine distributional and paradigmatic find the best subspace approximation of the orig- (i.e. lexical) information have been discussed. The inal word space, in the sense of minimizing the distributional model is used to select a list of frame global reconstruction error projecting data along suggested by the corpus’ evidences and then the the directions of maximal variance. Locality Pre- plausible lexical senses of the unknown LU are serving Projection (He and Niyogi, 2003) is a used to re-rank proposed frames. linear approximation of the nonlinear Laplacian In order to exploit prior information provided Eigenmap algorithm: its locality preserving prop- by the frame theory, the idea underlying is that se- erties allows to add a set of constraints forcing mantic knowledge can be embedded from exter- LUs that belong to the same frame to be near in nal sources (i.e the FrameNet database) into the the resulting space after the transformation. LSA distributional model of unannotated corpora. In performs a global analysis of a corpus capturing (Basu et al., 2006) a limited prior knowledge is ex- relations between LUs and removing the noise in- ploited in several clustering tasks, in term of pair- troduced by spurious directions. However it risks wise constraints (i.e., pairs of instances labeled to ignore lexical senses poorly represented into the as belonging to same or different clusters). Sev- corpus. In (De Cao et al., 2008) external knowl- eral existing algorithms enhance clustering qual- edge about LUs is provided by their lexical senses ity by applying supervision in the form of con- from a lexical resource (e.g WordNet). In this straints. These algorithms typically utilize the work, prior knowledge about the target problem is pairwise constraints to either modify the clustering directly embedded into the space through the LPP objective function or to learn the clustering distor- transformation, by exploiting locality constraints. tion measure. The approach discussed in (Basu et Then a Support Vector Machine is employed to al., 2006) employs Hidden Markov Random Fields provide a robust acquisition of lexical units com- (HMRFs) as a probabilistic generative model for bining global information provided by LSA and semi-supervised clustering, providing a principled the local information provided by LPP into a com- framework for incorporating constraint-based su- plex kernel function. pervision into prototype-based clustering. In Section 2 related work is presented. In Sec- Another possible approach is to directly embed tions 3 the investigated distributional model of the prior-knowledge into data representations. The LUs is presented as well as the dimensionality re- main idea is to employ effective and efficient algo- duction techniques. Then, in Section 4 the exper- rithms for constructing nonlinear low-dimensional imental investigation and comparative evaluations manifolds from sample data points embedded

8 in high-dimensional spaces. Several algorithms tors are semantically related. Contexts are words are defined, including Isometric feature mapping appearing together with a LU: such a space mod- (ISOMAP) (Tenenbaum et al., 2000), Locally Lin- els a generic notion of semantic relatedness, i.e. ear Embedding (LLE) (Roweis and Saul, 2000), two LUs spatially close in the space are likely to Local Tangent Space alignment (LTSA) (Zhang be either in paradigmatic or syntagmatic relation and Zha, 2004) and Locality Preserving Projec- as in (Sahlgren, 2006). Here, LUs delimit sub- tion (LPP) (He and Niyogi, 2003) and they have spaces modeling the prototypical semantic of the been successfully applied in several computer vi- corresponding evoked frames and novel LUs can sion and pattern recognition problems. In (Yang be induced by exploiting their projections. et al., 2006) it is demonstrated that basic nonlinear Since a semantic space supports the language dimensionality reduction algorithms, such as LLE, in use from the corpus statistics in an unsuper- ISOMAP, and LTSA, can be modified by taking vised fashion, vectors representing LUs can be into account prior information on exact mapping characterized by different distributions. For exam- of certain data points. The sensitivity analysis ple, LUs of the frame KILLING, such as blood- of these algorithms shows that prior information bath, crucify or fratricide, are statistically infe- improves stability of the solution. In (Goldberg rior in a corpus if compared to a wide-spanning and Elhadad, 2009), a strategy to incorporate lexi- term as kill. Moreover other ambiguous LUs, as cal features into classification models is proposed. liquidate or terminate, could appear in sentences Another possible approach is the strategy pursued evoking different frames. These problems of data- in recent works on deep learning techniques to sparseness and distribution noise can be over- NLP tasks. In (Collobert and Weston, 2008) a come by applying space transformation techniques unified architecture for NLP that learns features augmenting the space expressiveness in model- relevant to the tasks at hand given very limited ing frame semantics. Semantic space models very prior knowledge is presented. It embodies the elegantly map words in vector spaces (there are idea that a multitask learning architecture coupled as many dimensions as words in the dictionary) with semi-supervised learning can be effectively and LUs collections into distributions of data- applied even to complex linguistic tasks such as points. Every distribution implicitly expresses two Semantic Role Labeling. In particular, (Collobert orthogonal facets: global properties, as the occur- and Weston, 2008) proposes an embedding of lex- rence scores computed for terms across the entire ical information using Wikipedia as source, and collection (irrespectively from their word senses exploits the resulting language model for the mul- or evoking situation) and local regularities, for ex- titask learning process. The extensive use of unla- ample the existence of subsets of terms that tend to beled texts allows to achieve a significant level of be used every time a frame manifests. These also lexical generalization in order to better capitalize tend to be closer in the space and should be closer on the smaller annotated data sets. in the transformed space too. Another important aspect that a transformation could account is exter- 3 Geometrical Embeddings as models of nal semantic information. In the new space, prior Frame Semantics knowledge can be exploited to gather a more regu- lar LUs representation and a clearer separation be- The aim of this distributional approach is to model tween subspaces representing different frame se- frames in semantic spaces where words are repre- mantics. sented from the distributional analysis of their co- occurrences over a corpus. Semantic spaces are In the following sections the investigated dis- widely used in NLP for representing the meaning tributional model of LUs will be discussed. As of words or other lexical entities. They have been many criteria can be adopted to define a LU con- successfully applied in several tasks, such as in- text, one of the goals of this investigation is to find formation retrieval (Salton et al., 1975) and har- a co-occurrence model that better captures the no- vesting thesauri (Lin, 1998). The fundamental in- tion of frames, as described in Section 3.1. Then, tuition is that the meaning of a word can be de- two dimensionality reduction techniques, exploit- scribed by the set of textual contexts in which it ing semantic space distributions to improve frames appears (Distributional Hypothesis as described in representation, are discussed. In Section 3.2 the (Harris, 1964)), and that words with similar vec- role of global properties of data statistics will be

9 investigated through the Latent Semantic Analy- (SVD). The original term-by-term matrix M is sis while in Section 3.3 the Locality Preserving transformed into the product of three new matri- Projection algorithm will be discussed in order to ces: U, S, and V so that M = USV T . Matrix T combine prior knowledge about frames with local M is approximated by Ml = UlSlVl in which regularities of LUs obtained from text. only the first l columns of U and V are used, and only the first l greatest singular values are consid- 3.1 Choosing the space ered. This approximation supplies a way to project Different types of context define spaces with dif- term vectors into the l-dimensional space using ferent semantic properties. Such spaces model a 1/2 Yterms = UlSl . Notice that the SVD process generic notion of semantic relatedness. Two LUs accounts for the eigenvectors of the entire original close in the space are likely to be related by some distribution (matrix M). LSA is thus an example type of generic semantic relation, either paradig- of a decomposition process strongly dependent on matic (e.g. synonymy, hyperonymy, antonymy) a global property. The original statistical informa- or syntagmatic (e.g. meronymy, conceptual and tion about M is captured by the new l-dimensional phrasal association), as observed in (Sahlgren, space which preserves the global structure while 2006). The target of this work is the construc- removing low-variant dimensions, i.e. distribu- tion of a space able to capture the properties which tion noise. These newly derived features may be characterize a frame, assuming those LUs in the thought of as artificial concepts, each one repre- same frame tend to be either co-occurring or sub- senting an emerging meaning component as a lin- stitutional words (e.g. murder/kill). Two tradi- ear combination of many different words (i.e. con- tional word-based co-occurrence models capture texts). Such contextual usages can be used instead the above property: of the words to represent texts. This technique has Word-based space: Contexts are words, as two main advantages. First, the overall computa- lemmas, appearing in a n-window of the LU. tional cost of the model is reduced, as similarities The window width n is a parameter that allows are computed on a space with much fewer dimen- the space to capture different aspects of a frame: sions. Secondly, it allows to capture second-order higher values risk to introduce noise, since a frame relations among LUs, thus improving the quality could not cover an entire sentence, while lower of the similarity measure. values lead to sparse representations. Syntax-based space: Contexts words are en- 3.3 The Locality Preserving Projection riched through information about syntactic rela- Method tions (e.g. X-VSubj-killer where X is the LU), as An alternative to LSA, much tighter to local prop- described in (Pado and Lapata, 2007). Two LUs erties of data, is the Locality Preserving Projection close in the space are likely to be in a paradig- (LP P ), a linear approximation of the non-linear matic relation, i.e. to be close in an IS-A hierarchy Laplacian Eigenmap algorithm introduced in (He (Budanitsky and Hirst, 2006; Lin, 1998). Indeed, and Niyogi, 2003). LPP is a linear dimensional- as contexts are syntactic relations, targets with the ity reduction method whose goal is, given a set of n same part of speech are much closer than targets LUs x1, x2, .., xm in R , to find a transformation of different types. matrix A that maps these m points into a set of k points y1, y2, .., ym in R (k  n). LPP achieves 3.2 Latent Semantic Analysis this result through a cascade of processing steps Latent Semantic Analysis (LSA) is an algorithm described hereafter. presented in (Furnas et al., 1988) afterwards dif- Construction of an Adjacency graph. Let G fused by Landauer (Landauer and Dumais, 1997): denote a graph with m nodes. Nodes i and j have it can be seen as a variant of the Principal Compo- got a weighted connection if vectors xi and xj are nent Analysis idea. LSA aims to find the best sub- close, according to an arbitrary measure of simi- space approximation to the original word space, larity. There are many ways to build an adjacency in the sense of minimizing the global reconstruc- graph. The cosine graph with cosine weighting tion error projecting data along the directions of scheme is explored: given two vectors xi and xj, maximal variance. It captures term (semantic) the weight wij between them is set by dependencies by applying a matrix decomposi- cos(xi, xj ) − τ tion process called Singular Value Decomposition wij = max{0, · cos(xi, xj )} (1) |cos(xi, xj ) − τ|

10 where a cosine threshold τ is necessary. The ad- where jacency graph can be represented by using a sym-  1 iff ∃F s.t. LU ∈ F ∧ LU ∈ F δ(i, j) = i j metric m×m adjacency matrix, named W , whose 0 otherwise element Wij contains the weight between nodes i and j. The method of constructing an adjacency so the resulting manifold keeps close all LUs graph outlined above is correct if the data actually evoking the same frame. Since the number of con- lie on a low dimensional manifold. Once such an nections could introduce too many constraints to adjacency graph is obtained, LPP will try to opti- the Eigenmap problem, a threshold is introduced mally preserve it in choosing projections. to avoid the space collapse: for each LU, only the most-similar c connections are selected. The Solve an Eigenmap problem. Compute the adoption of the proper a priori knowledge about eigenvectors and eigenvalues for the generalized the target task can be thus seen as a promising re- eigenvector problem: search direction. XLXT a = λXDXT a 4 Empirical Analysis In this section the empirical evaluation of distribu- where X is a n × m matrix whose columns are the tional models applied to the task of inducing LUs original m vectors in Rn, D is a diagonal m × m is presented. Different spaces obtained through matrix whose entries are column (or row) sums of the dimensionality reduction techniques imply dif- W , D = P W and L = D − W is the Lapla- ii j ij ferent kernel functions used to independently train cian matrix. The solution of this problem is the different SVMs. Our aim is to investigate the im- set of eigenvectors a , a , .., a , ordered accord- 0 1 n−1 pact of these kernels in capturing both the frames ing to their eigenvalues λ < λ < .. < λ . 0 1 n−1 and LUs’ properties, as well as the effectiveness LPP projection matrix A is obtained by selecting of their possible combination. the k eigenvectors corresponding to the k smallest The problem of LUs’ induction is here treated eigenvalues: therefore it is a n × k matrix whose as a multi-classification problem, where each LU columns are the selected n-dimensional k eigen- is considered as a positive or negative instance of a vectors. Final projection of original vectors into frame. We use Support Vector Machines (SVMs), Rk can be linearly performed by Y = AT X. This (Joachims, 1999) a maximum-margin classifier transformation provides a valid kernel that can be that realizes a linear discriminative model. In case efficiently embedded into a classifier. of not linearly separable examples, convolution Embedding predicate knowledge through functions φ(·) can be used in order to transform LPPs. While LSA finds a projection, according to the initial feature space into another one, where a the global properties of the space, LPP tries to pre- hyperplane that separates the data with the widest serve the local structures of the data. LPP exploits margin can be found. Here new similarity mea- the adjacency graph in order to represent neigh- sures, the kernel functions, can be defined through borhood information. It computes a transforma- the dot-product K(oi, oj) = hφ(oi) · φ(oj)i over tion matrix which maps data points into a lower di- the new representation. In this way, kernel func- mensional subspace. As the construction of an ad- tions KLSA and KLP P can be induced through jacency graph G can be based on any principle, its the dimensionality reduction techniques φLSA and definition could account on some external infor- φLP P respectively, as described in sections 3.2 mation reflecting prior knowledge available about and 3.3. Kernel methods are advantageous be- the task. cause the combination of of kernel functions can In this work, prior knowledge about LUs is em- be integrated into the SVM as they are still kernels. bedded by exploiting their membership to frame Consequently, the kernel combination αKLSA + dictionaries, thus removing from the graph all con- βKLP P linearly combines the global properties nections between LUs xi and xj that do not evoke captured by LSA and the locality constraints im- the same prototypical situation. More formally posed by the LPP transformation. Here, parame- Equation 1 can be rewritten more formally as: ters α and β weight the combination of the two kernels. The evoking frame for a novel LU is the one whose corresponding SVM has the high- cos(xi, xj ) − τ wij = max{0, · cos(xi, xj ) · δ(i, j)} est (possibly negative) margin, according to a one- |cos(xi, xj ) − τ|

11 train tune test overall construction are used. The first two correspond to max 107 35 34 176 avg 28 8 8 44 a Word-Based space, the last to a Syntax-Based, total 2466 722 723 3911 as described in section 3.1: Window-n (Wn): contextual features correspond Table 1: Number of LU examples for each data set to the set of the 20,000 most frequent lemmatized from the 100 frames words in the BNC. The association measure be- tween LUs and contexts is the Point-wise Mu- vs-all scheme. In order to evaluate the quality of tual Information (PMI). Valid contexts for LUs are the presented models, accuracy is measured as the fixed to a n-window. Hereafter two window width percentage of LUs that are correctly re-assigned to values will be investigated: Window5 (W5) and their original (gold-standard) frame. As the sys- Window10 (W10). tem can suggest more than one frame, different Sentence (Sent): contextual features are the same accuracy levels can be obtained. A LU is cor- above, but the valid contexts are extended to the rectly assigned if its correct frame (according to entire sentence length. FrameNet) belongs to the set of the best b pro- SyntaxBased (SyntB): contextual features have posals by the system (i.e. the first b scores from been computed according to the “dependency- 3 the underlying SVMs). Assigning different val- based” vector space discussed in (Pado and La- ues to b, we obtained different levels of accuracy pata, 2007). Observable contexts here are made of as the percentage of LUs that is correctly assigned syntactically-typed co-occurrences within depen- among the first b proposals, as shown in Table 3. dency graphs built from the entire set of BNC sen- tences. The most frequent 20,000 basic features, 4.1 Experimental Setup i.e. (syntactic relation,lemma) pairs, have been employed as contextual features corresponding to The adopted gold standard is a subset of the PMI scores. Syntactic relations are extracted using FrameNet database and it consists of the most 100 the Minipar parser. represented frames in term of annotated examples Word space models thus focus on the LUs of the and LUs. As the number of example is extremely selected 100 frames and the dimensionality have unbalanced across frames1, the LUs dictionary of been reduced by applying LSA and LPP at a new each selected frame contains at least 10 LUs. It is size of l = 100. Any prior knowledge informa- a reasonable amount of information for the SVMs tion is provided to the tuning and test sets during training and it is still a representative data set, be- the LPP transformation: the construction of the ing composed of 3,911 LUs, i.e. the 55% of the reduced feature space takes in account only LUs entire dictionary2 of 7,230 evoking words. All from the train set while remaining predicates are word spaces are derived from the British National represented through the LPP linear projection. In Corpus (BNC), which is underlying FrameNet and these experiments the cosine threshold τ and the consisting of about 100 million words for English. maximum number of constraints c are estimated Each selected frame is represented into the BNC over the tuning set and the best parametrizations by at least 362 annotated sentences, as the lack are shown in Table 2. The adopted implementa- of a reasonable number of examples hardly pro- tion of SVM is SVM-Light-TK 4. duces a good distributional model of LUs. Each frame’s list of LUs is split into train (60%), tuning 4.2 Results (20%) and test set (20%) and LUs having Part-of- speech different from verb, noun or adjective are In these experiments the impact of the lexical removed. In Table 1 the number of LUs for each knowledge gathered by different word-spaces is set, as well as the maximum and the average num- evaluated over the LU induction task. Moreover, ber per frame, are summarized. the improvements achieved through LSA and LPP Four different approaches for the Word Space is measured. SVM classifiers are trained over the semantic spaces produced through the dimension- 1For example the SELF MOTION frame counts 6,248 ex- amples while 119 frames are represented by less than 10 ex- 3The Minimal context provided by the De- amples pendency Vectors tool is used. It is available at 2The entire database contains 10,228 LUs and the number http://www.nlpado.de/∼sebastian/dv.html of evoking word is 7,230, without taking in account multiple 4SVM-Light-TK is available at the url frame assignments. http://disi.unitn.it/∼moschitt/Tree-Kernel.htm

12 α/β τ c 1.0/0.0 .9/.1 .8/.2 .7/.3 .6/.4 .5/.5 .4/.6 .3/.7 .2/.8 .1/.9 0.0/1.0 W 5 0.668 0.669 0.672 0.673 0.669 0.662 0.649 0.632 0.612 0.570 0.033 0.55 5 W 10 0.615 0.619 0.618 0.612 0.604 0.597 0.580 0.575 0.565 0.528 0.048 0.65 3 Sent 0.557 0.567 0.580 0.584 0.574 0.564 0.561 0.545 0.523 0.496 0.048 0.80 5 SyntB 0.654 0.664 0.662 0.652 0.651 0.647 0.649 0.634 0.627 0.592 0.056 0.40 3

Table 2: Accuracy at different combination weights of kernel αKLSA + βKLP P (specific baseline is 0.043)

b-1 b-2 b-3 b-4 b-5 b-6 b-7 b-8 b-9 b-10 α/β W 5orig 0,563 0,685 0,733 0,770 0,801 0,835 0,841 0,854 0,868 0,879 - W 10orig 0,510 0,634 0,707 0,776 0,810 0,830 0,841 0,857 0,865 0,875 - Sentorig 0,479 0,618 0,680 0,734 0,764 0,793 0,813 0,837 0,845 0,852 - SyntBorig 0,585 0,741 0,803 0,840 0,866 0,874 0,886 0,903 0,907 0,913 - W 5LSA+LP P 0.673 0.781 0.831 0.865 0.881 0.891 0.906 0.912 0.926 0.938 0.7/0.3 W 10LSA+LP P 0.619 0.739 0.786 0.818 0.849 0.865 0.878 0.888 0.901 0.909 0.9/0.1 SentLSA+LP P 0.584 0.705 0.766 0.798 0.825 0.835 0.848 0.864 0.876 0.889 0.7/0.3 SyntBLSA+LP P 0.664 0.791 0.840 0.864 0.878 0.893 0.901 0.903 0.907 0.911 0.9/0.1

Table 3: Accuracy of original word-space models (orig) and semantic space models (LSA+LPP) on best-k proposed frames

ality reduction transformations. Representations 0,70 of both semantic spaces are linearly combined as 0,65 αKLSA + βKLP P , where kernel weights α and β are estimated over the tuning set. Both ker- 0,60 nels are used even without a combination: a ra- tio α = 1.0/β = 0.0 denotes the LSA kernel 0,55 alone, while α = 0.0/β = 1.0 the LPP kernel. Ta- Window5 0,50 ble 2 shows best results, obtained through a RBF Window10 kernel. The W indow5 model achieves the high- 0,45 Sentence est accuracy, i.e. 67% of correct classification, SyntaxBased 0,40 where a baseline of 4.3% is estimated assigning

LUs to the most likely frame in the training set (i.e. 1.0/0.0 0.9/0.1 0.8/0.2 0.7/0.3 0.6/0.4 0.5/0.5 0.4/0.6 0.3/0.7 0.2/0.8 0.1/0.9 the one containing the highest number of LUs). αLSA / βLPP weights Wider windows achieve lower classification accu- Figure 1: Accuracy at different combination racy confirming that most of lexical information weights of kernel αKLSA + βKLP P tied to a frame is near the LU. The Syntactic-based word space does not outperform the accuracy of a word-based space. The combination of both ker- SVM) based on local kernels (e.g. RBF) is ar- nels has always provided the best outcome and the gued. As shown in Table 3, the performance drop LSA space seems to be more accurate and expres- of original (orig) models against the best kernel sive respect to the LPP one, as shown in Figure combination of LSA and LP P are significant, 1. In particular LPP alone is extremely unstable, i.e. ∼ 10%, showing how the latent semantic suggesting that constraints imposed by the prior spaces better capture properties of frames, avoid- knowledge are orthogonal with respect to the cor- ing data-sparseness, dimensionality problem and pus statistics. low-regularities of data-distribution. Further experiments are carried out using the Moreover, Table 3 shows how the accuracy level original co-occurrence space models, to assess im- largely increases when more than one frame is provements due to LSA and LPP kernel. In the considered: at a level b = 3, i.e. the novel latter investigation linear kernel achieved best re- LU is correctly classified if one of the original sults as confirmed in (Bengio et al., 2005), where frames is comprised in the list (of three frames) the sensitivity to the curse of dimensionality of proposed by the system, accuracy is 0.84 (i.e the a large class of modern learning algorithms (e.g. SyntaxBased model), while at b = 10 accuracy is

13 LU (# WNsyns) frame 1 frame 2 frame 3 Correct frames

boil.v (5) FOOD FLUIDIC MOTION CONTAINERS CAUSE HARM clap.v (7) SOUNDS MAKE NOISE COMMUNICATION NOISE BODY MOVEMENT ACCOUTREMENTS crown.n (12) LEADERSHIP ACCOUTREMENTS PLACING OBSERVABLE BODYPARTS EDUCATION TEACHING school.n (7) EDUCATION TEACHING BUILDINGS LOCALE BY USE LOCALE BY USE AGGREGATE threat.n (4) HOSTILE ENCOUNTER IMPACT COMMITMENT COMMITMENT tragedy.n (2) TEXT KILLING EMOTION DIRECTED TEXT

Table 4: Proposed 3 frames for each LU (ordered by SVM scores) and correct frames provided by the FrameNet dictionary. In parenthesis the number of different WordNet lexical senses for each LU. nearly 0.94 (i.e Window5). It is high enough to each lexical predicate, is also employed in a semi- support tasks such as the semi-automatic creation supervised manner: local constraints expressing of new FrameNets. An error analysis indicates that prior knowledge on frames are defined in the ad- many misclassifications are induced by a lack in jacency graph. The resulting embedding is there- the frame annotations, especially those concern- fore expected to determine a new space where re- ing polysemic LUs5. Table 4 reports the analysis gions for LU of a given frame can be more eas- of a LU subset where the first 3 frames proposed ily discovered. Experiments have been run using for each evoking word are shown, ranked by the the resulting spaces for task dependent kernels in margin of the SMVs. The last column contains the a SVM learning setting. The application of the frames evoked by LUs, according to the FrameNet FrameNet KB on the 100 best represented frames dictionary, and the frame names in bold suggest showed that a combined use of the global and lo- their correct classification. Some LUs, like threat cal models made available by LSA and LPP, re- (characterized by 4 lexical senses) seem to be mis- spectively, achieves the best results, as the 67.3% classified: in this case the FrameNet annotation of LUs recovers the same frames of the annotated regards a specific sense that evokes the COMMIT- dictionary. This is a significant improvement with MENT frame (e.g. “There was a real threat that respect to previous results achieved by the pure she might have to resign”) without taking in ac- distributional model reported in (Pennacchiotti et count other senses like WordNet’s “menace, threat al., 2008). (something that is a source of danger)” that could Future work is required to increase the level evoke the HOSTILE ENCOUNTER frame. In other of constraints made available from the semi- cases proposed frames seem to enrich the LUs dic- supervised setting of LPP: syntactic informa- tionary, like BUILDINGS, here evoked by school. tion, as well as role-related evidence, can be both accommodated by the adjacency constraints 5 Conclusions imposed for LPP. This constitutes a significant area of research towards a comprehensive semi- The core purpose of this was to present an em- supervised model of frame semantics, entirely pirical investigation of the impact of different dis- based on manifold learning methods, of which this tributional models on the lexical unit induction study on LSA and LPP is just a starting point. task. The employed word-spaces, based on dif- ferent co-occurrence models (either context and syntax-driven), are used as vector models of the Acknowledgement We want to acknowledge LU semantics. On these spaces, two dimensional- Prof. Roberto Basili because this work would not ity reduction techniques have been applied. Latent exist without his ideas, inspiration and invaluable Semantic Analysis (LSA) exploits global proper- support. ties of data distributions and results in a global model for lexical semantics. On the other hand, the Locality Preserving Projection (LPP) method, References that exploits regularities in the neighborhood of Collin Baker, Michael Ellsworth, and Katrin Erk. 5According to WordNet, in our dataset an average of 3.6 2007. Semeval-2007 task 19: Frame semantic struc- lexical senses for each LU is estimated. ture extraction. In Proceedings of SemEval-2007,

14 pages 99–104, Prague, Czech Republic, June. Asso- Richard Johansson and Pierre Nugues. 2007. Using ciation for Computational Linguistics. WordNet to extend FrameNet coverage. In Proceed- ings of the Workshop on Building Frame-semantic Sugato Basu, Mikhail Bilenko, Arindam Banerjee, Resources for Scandinavian and Baltic Languages, and Raymond Mooney. 2006. Probabilistic semi- at NODALIDA, Tartu, Estonia, May 24. supervised clustering with constraints. In Semi- Supervised Learning, pages 73–102. MIT Press. Richard Johansson and Pierre Nugues. 2008. The effect of syntactic representation on semantic role Yoshua Bengio, Olivier Delalleau, and Nicolas Le labeling. In Proceedings of COLING, Manchester, Roux. 2005. The curse of dimensionality for lo- UK, August 18-22. cal kernel machines. Technical report, Departement d’Informatique et Recherche Operationnelle. Tom Landauer and Sue Dumais. 1997. A solution to plato’s problem: The latent semantic analysis the- Alexander Budanitsky and Graeme Hirst. 2006. Eval- ory of acquisition, induction and representation of uating WordNet-based measures of semantic dis- knowledge. Psychological Review, 104. tance. Computational Linguistics, 32(1):13–47. Dekang Lin. 1998. Automatic retrieval and clustering Aljoscha Burchardt, Katrin Erk, and Anette Frank. of similar word. In Proceedings of COLING-ACL, 2005. A WordNet Detour to FrameNet. In Montreal, Canada. Sprachtechnologie, mobile Kommunikation und lin- guistische Resourcen, volume 8 of Computer Stud- Alessandro Moschitti, Paul Morarescu, and Sanda M. ies in Language and Speech. Peter Lang, Frank- Harabagiu. 2003. Open domain information ex- furt/Main. traction via automatic semantic labeling. In FLAIRS Conference, pages 397–401. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep Sebastian Pado and Katrin Erk. 2005. To cause or neural networks with multitask learning. In In Pro- not to cause: Cross-lingual semantic matching for ceedings of ICML ’08, pages 160–167, New York, paraphrase modelling. In Proceedings of the Cross- NY, USA. ACM. Language Knowledge Induction Workshop, Cluj- Napoca, Romania. Diego De Cao, Danilo Croce, Marco Pennacchiotti, and Roberto Basili. 2008. Combining word sense Sebastian Pado and Mirella Lapata. 2007. and usage for modeling frame semantics. In In Pro- Dependency-based construction of semantic space ceedings of STEP 2008, Venice, Italy. models. Computational Linguistics, 33(2):161–199. Charles J. Fillmore. 1985. Frames and the semantics of Marco Pennacchiotti, Diego De Cao, Roberto Basili, understanding. Quaderni di Semantica, 4(2):222– Danilo Croce, and Michael Roth. 2008. Automatic 254. induction of framenet lexical units. In Proceedings G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Lan- of The Empirical Methods in Natural Language Pro- dauer, R. A. Harshman, L. A. Streeter, and K. E. cessing (EMNLP 2008) Waikiki, Honolulu, Hawaii. Lochbaum. 1988. Information retrieval using a sin- gular value decomposition model of latent semantic S.T. Roweis and L.K. Saul. 2000. Nonlinear dimen- structure. In Proc. of SIGIR ’88, New York, USA. sionality reduction by locally linear embedding. Sci- ence, 290(5500):2323–2326. Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguis- Magnus Sahlgren. 2006. The Word-Space Model. tics, 28(3):245–288. Ph.D. thesis, Stockholm University.

Yoav Goldberg and Michael Elhadad. 2009. On the G. Salton, A. Wong, and C. Yang. 1975. A vector role of lexical features in sequence labeling. In In space model for automatic indexing. Communica- Proceedings of EMNLP ’09, pages 1142–1151, Sin- tions of the ACM, 18:613A¨ `ı620. gapore. Dan Shen and Mirella Lapata. 2007. Using semantic Zellig Harris. 1964. Distributional structure. In Jer- roles to improve question answering. In Proceed- rold J. Katz and Jerry A. Fodor, editors, The Philos- ings of EMNLP-CoNLL, pages 12–21, Prague. ophy of Linguistics, New York. Oxford University Press. Mihai Surdeanu, , Mihai Surdeanu, A Harabagiu, John Williams, and Paul Aarseth. 2003. Using predicate- Xiaofei He and Partha Niyogi. 2003. Locality preserv- argument structures for . In ing projections. In Proceedings of NIPS03, Vancou- In Proceedings of ACL 2003. ver, Canada. Marta Tatu and Dan I. Moldovan. 2005. A seman- T. Joachims. 1999. Making large-Scale SVM Learning tic approach to recognizing textual entailment. In Practical. MIT Press, Cambridge, MA. HLT/EMNLP.

15 J. B. Tenenbaum, V. Silva, and J. C. Langford. 2000. A Global Geometric Framework for Nonlinear Di- mensionality Reduction. Science, 290(5500):2319– 2323. Xin Yang, Haoying Fu, Hongyuan Zha, and Jesse Bar- low. 2006. Semi-supervised nonlinear dimension- ality reduction. In 23rd International Conference on Machine learning, pages 1065–1072, New York, NY, USA. ACM Press. Zhenyue Zhang and Hongyuan Zha. 2004. Princi- pal manifolds and nonlinear dimensionality reduc- tion via tangent space alignment. SIAM J. Scientific Computing, 26(1):313–338.

16