Cross-Lingual Bootstrapping of Semantic Lexicons: the Case Of

Edinburgh Research Explorer Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet Citation for published version: Padó, S & Lapata, M 2005, Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet. in Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference. AAAI Press, pp. 1087-1092. Link: Link to publication record in Edinburgh Research Explorer Document Version: Publisher's PDF, also known as Version of record Published In: Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference General rights Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim. Download date: 05. Apr. 2019 Cross-lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet Sebastian Padó Mirella Lapata Computational Linguistics, Saarland University School of Informatics, University of Edinburgh P.O. Box 15 11 50, 66041 Saarbrücken, Germany 2 Buccleuch Place, Edinburgh EH8 9LW, UK [email protected] [email protected] Abstract Frame: COMMITMENT This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic ap- SPEAKER Kim promised to be on time. proach which exploits parallel corpora, relies on shallow text ADDRESSEE Kim promised Pat to be on time. properties, and is relatively inexpensive. Given the English MESSAGE Kim promised Pat to be on time. FrameNet lexicon, our method exploits word alignments to TOPIC The government broke its promise generate frame candidate lists for new languages, which are about taxes. subsequently pruned automatically using a small set of lin- Frame Elements MEDIUM Kim promised in writing to sell Pat guistically motivated filters. Evaluation shows that our ap- the house. proach can produce high-precision multilingual FrameNet lexicons without recourse to bilingual dictionaries or deep consent.v, covenant.n, covenant.v, oath.n, vow.n, syntactic and semantic analysis. pledge.v, promise.n, promise.v, swear.v, threat.n, FEEs threaten.v, undertake.v, undertaking.n, volunteer.v Introduction Table 1: Example frame from the FrameNet database Shallow semantic parsing, the task of automatically identi- fying the semantic roles conveyed by sentential constituents, is an important step towards text understanding and can ul- instance, the SPEAKER is typically an NP, whereas the MES- timately benefit many natural language processing applica- SAGE is often expressed as a clausal complement (see the ex- tions ranging from information extraction (Surdeanu et al. pressions in boldface in Table 1). The COMMITMENT frame 2003) to question answering (Narayanan & Harabagiu 2004) can be evoked by promise, consent, pledge, and several other and machine translation (Boas 2002). verbs as well as nouns (see the list of FEEs in Table 1). Recent advances in semantic parsing have greatly bene- An important first step towards semantic parsing is the fited from the availability of resources that document the creation of a semantic lexicon listing appropriate frames for surface realisation of semantic roles, thus providing richly lemmas together with their corresponding semantic roles. annotated data for training statistical semantic role labellers The English FrameNet currently contains 513 frames cover- (Gildea & Jurafsky 2002). The FrameNet lexical resource ing 7125 lexical items and has been under development for (Baker, Fillmore, & Lowe 1998) has played a central role approximately eight years; German, Spanish, and Japanese in this enterprise. In FrameNet, meaning is represented by FrameNets, though in their infancy, are also under construc- frames, i.e., schematic representations of situations. Seman- tion. Since manual lexicon construction is costly, time con- tic roles are frame-specific, and are called frame elements. suming, and rarely ever complete, automatic methods for The database associates frames with lemmas (verbs, nouns, acquiring FrameNets for new languages could significantly adjectives) that can evoke them (called frame-evoking ele- speed up the lexicon development process. In this paper, we ments or FEEs), lists the possible syntactic realisations of propose an automatic method which employs parallel cor- their semantic roles, and provides annotated examples from pora for the acquisition task and holds promise for reducing the British National Corpus (Burnard 1995). the manual effort involved in creating semantic lexicons. Table 1 exemplifies the COMMITMENT frame. The frame Our method leverages the existing English FrameNet has four roles, a SPEAKER who commits him/herself to do to overcome the resource shortage in other languages by something, an ADDRESSEE to whom the commitment is exploiting the translational equivalences present in word- made, a MESSAGE or a TOPIC expressing the commitment, aligned data. We focus on inducing a frame lexicon, i.e., a and a MEDIUM used to transmit the MESSAGE. The frame list of lemmas for each frame, under the assumption that elements are realised by different syntactic expressions. For FrameNet descriptions for English port reliably to other languages (Erk et al. 2003). The lexicon induction task is a pre- Copyright c 2005, American Association for Artificial Intelli- requisite to learning how a frame’s semantic roles are syn- gence (www.aaai.org). All rights reserved. tactically realised. Our method can yield a FrameNet lexi- AAAI-05 / 1087 con for any language for which English parallel corpora are s, t word types in source and target language available. To show its potential, we investigate the induction si, tj word tokens of s and t of FrameNet lexicons for German and French. f frame in English FrameNet In the following section, we provide an overview of re- pc(f|t) candidate probability of frame f for type t lated work on automatic construction of multilingual re- si 1 tj true iff si and tj are one-to-one aligned sources. Then, we introduce our FrameNet-based lexicon tp(si, tj, f) true iff si and tj form a translation pair for f induction algorithms. Next, we present our experimental fee(f, s) true iff s is a listed FEE for f framework and data. Evaluation results and their discussion wsd(f, si) true iff f is the disambiguated frame for si conclude the paper. cont(si) true iff si is a content word (Adj, N, or V) cand(t, f) true iff t is a FEE candidate for f Related Work Table 2: Notation overview The automatic construction of bilingual lexicons has a long history in machine translation (see Melamed (1996) and the references therein). Word-level translational equivalences the target language. This initial list is somewhat noisy and can be extracted from parallel texts without recourse to contains many spurious translations. In a second step, we corpus external resources, simply by using word-alignment prune spurious translations using incremental filters, thus ac- methods (Och & Ney 2003) and filtering techniques aimed quiring FrameNet-compliant entries. Filters can apply both at discarding erroneous translations (Melamed 1996). Mann at the token and type level, placing constraints either on ad- & Yarowsky (2001) induce bilingual lexicons using solely missible translation pairs or on FEE candidate probability. on-line dictionaries and cognate word pairs. Translational correspondences are also exploited for in- FEE Candidate List Generation ducing multilingual text analysis tools (Yarowsky, Ngai, For the reasons outlined in the previous paragraph, we use & Wicentowski 2001; Hwa et al. 2002). English analyses automatic methods for finding translation equivalences that (e.g., syntax trees, POS tags) are projected onto another lan- allow for rapid lexicon acquisition across languages and do- guage through word-aligned parallel text. The projected an- mains. We obtain an initial FEE candidate list from a par- notations can be then used to derive monolingual tools with- allel corpus using the statistical IBM word alignment mod- out additional annotation cost. els (Och & Ney 2003). Other alignment techniques could Fung & Chen (2004) automatically construct an English- also serve as a basis for generating FEE candidates; our Chinese bilingual FrameNet by mapping automatically method is not specific to the IBM models. FrameNet entries to concepts listed in HowNet1, an on-line To produce an initial FEE candidate list, we employ a ontology for Chinese, without access to parallel texts. rather liberal definition of translation pair, covering all cor- The present work extends previous approaches on auto- respondences we gather from a parallel corpus; a source and matic lexicon construction and annotation projection by in- target language token form a translation pair for frame f if ducing a FrameNet compliant lexicon rather than a generic they are aligned and the source token is a FEE of f (see word-to-word translation mapping. Our work exploits word Table

Load more