Frame-Based Continuous Lexical Semantics Through Exponential Family Tensor Factorization and Semantic Proto-Roles
Francis Ferraro and Adam Poliak and Ryan Cotterell and Benjamin Van Durme
Center for Language and Speech Processing, Johns Hopkins University
fferraro,azpoliak,ryan.cotterell,[email protected]

Abstract

We study how different frame annotations complement one another when learning continuous lexical semantics. We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better, with multiple 10% gains over baselines.

[Figure 1: A simple frame analysis. In "She said Bill would try the same tactic again.", the span "would try" evokes the ATTEMPT frame, with AGENT "Bill" and GOAL "the same tactic."]

1 Introduction

Consider "Bill" in Fig. 1: what is his involvement with the words "would try," and what does this involvement mean? Word embeddings represent such meaning as points in a real-valued vector space (Deerwester et al., 1990; Mikolov et al., 2013). These representations are often learned by exploiting how frequently a word cooccurs with its contexts, often within a user-defined window (Harris, 1954; Turney and Pantel, 2010). When built from large-scale sources, such as Wikipedia or web crawls, embeddings capture general characteristics of words and allow for robust downstream applications (Kim, 2014; Das et al., 2015).

Frame semantics generalizes word meaning to the analysis of structured and interconnected labeled "concepts" and abstractions (Minsky, 1974; Fillmore, 1976, 1982). These concepts, or roles, implicitly encode expected properties of a word. In a frame semantic analysis of Fig. 1, the segment "would try" triggers the ATTEMPT frame, filling the expected roles AGENT and GOAL with "Bill" and "the same tactic," respectively. While frame semantics provides a structured form for analyzing words with crisp, categorically labeled concepts, the encoded properties and expectations are implicit. What does it mean to fill a frame's role?

Semantic proto-role (SPR) theory, motivated by Dowty (1991)'s thematic proto-role theory, offers an answer. SPR replaces categorical roles with judgments about multiple underlying properties that are likely true of the entity filling the role. For example, SPR asks how likely it is that Bill is a willing participant in the ATTEMPT. The answers to this and other simple judgments characterize Bill and his involvement. Since SPR both captures the likelihood of certain properties and characterizes roles as groupings of properties, we can view SPR as representing a type of continuous frame semantics.

We are interested in capturing these SPR-based properties and expectations within word embeddings. We present a method that learns frame-enriched embeddings from millions of documents that have been semantically parsed with multiple different frame analyzers (Ferraro et al., 2014). Our method leverages Cotterell et al. (2017)'s formulation of Mikolov et al. (2013)'s popular skip-gram model as exponential family principal component analysis (EPCA) and tensor factorization. This paper's primary contributions are: (i) enriching learned word embeddings with multiple, automatically obtained frames from large, disparate corpora; and (ii) demonstrating that these enriched embeddings better capture SPR-based properties. In so doing, we also generalize Cotterell et al.'s method to arbitrary tensor dimensions, which allows us to include an arbitrary amount of semantic information when learning embeddings. Our variable-size tensor factorization code is available at https://github.com/fmof/tensor-factorization.
2 Frame Semantics and Proto-Roles

The frame semantics currently used in NLP has a rich history in the linguistics literature. Fillmore (1976)'s frames are based on a word's context and the prototypical concepts that an individual word evokes; they are intended to represent the meaning of lexical items by mapping words to real-world concepts and shared experiences. Frame-based semantics have inspired many semantic annotation schemata and datasets, such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005), and VerbNet (Schuler, 2005), as well as composite resources (Hovy et al., 2006; Palmer, 2009; Banarescu et al., 2012).[1]

Thematic Roles and Proto-Roles  These resources map words to their meanings through discrete, categorically labeled frames and roles; sometimes, as in FrameNet, the roles can be very descriptive (e.g., the DEGREE role of the AFFIRM_OR_DENY frame), while in other cases, as in PropBank, the roles can be quite general (e.g., ARG0). Regardless of the actual schema, the roles are based on thematic roles, which map a predicate's arguments to a semantic representation that makes various semantic distinctions among those arguments (Dowty, 1989).[2] Dowty (1991) claims that thematic role distinctions are not atomic, i.e., they can be deconstructed and analyzed at a lower level. Instead of many discrete thematic roles, Dowty (1991) argues for proto-thematic roles, e.g., PROTO-AGENT rather than AGENT, where distinctions among proto-roles are based on clusterings of logical entailments. That is, PROTO-AGENTs often have certain properties in common, e.g., manipulating other objects or willingly participating in an action, while PROTO-PATIENTs are often changed or affected by some action. By decomposing the meaning of roles into properties or expectations that can be reasoned about, proto-roles can be seen as including a form of vector representation within structured frame semantics.

[1] See Petruck and de Melo (2014) for detailed descriptions of frame semantics' contributions to applied NLP tasks.
[2] Thematic role theory is rich, and beyond this paper's scope (Whitehead, 1920; Davidson, 1967; Cresswell, 1973; Kamp, 1979; Carlson, 1984).

3 Continuous Lexical Semantics

Word embeddings represent word meanings as elements of a (real-valued) vector space (Deerwester et al., 1990). Mikolov et al. (2013)'s word2vec methods—skip-gram (SG) and continuous bag of words (CBOW)—repopularized this approach. We focus on SG, which predicts the context $i$ around a word $j$, with learned representations $c_i$ and $w_j$ respectively, as

$$p(\text{context } i \mid \text{word } j) \propto \exp\big(c_i^\top w_j\big) = \exp\big(\mathbf{1}^\top (c_i \odot w_j)\big),$$

where $\odot$ is the Hadamard (pointwise) product. Traditionally, the context words $i$ are those words within a small window of $j$, and they are trained with negative sampling (Goldberg and Levy, 2014).

3.1 Skip-Gram as Matrix Factorization

Levy and Goldberg (2014b), and subsequently Keerthi et al. (2015), showed how vectors learned under SG with negative sampling are, under certain conditions, the factorization of (shifted) positive pointwise mutual information. Cotterell et al. (2017) showed that SG is a form of exponential family PCA that factorizes the matrix of word/context cooccurrence counts (rather than shifted positive PMI values). With this interpretation, they generalize SG from matrix to tensor factorization, providing a theoretical basis for modeling higher-order SG (i.e., additional context, such as morphological features of words) within a word embeddings framework.

Specifically, Cotterell et al. recast higher-order SG as maximizing the log-likelihood

$$\sum_{ijk} X_{ijk} \log p(\text{context } i \mid \text{word } j, \text{feature } k) \qquad (1)$$
$$= \sum_{ijk} X_{ijk} \log \frac{\exp\big(\mathbf{1}^\top (c_i \odot w_j \odot a_k)\big)}{\sum_{i'} \exp\big(\mathbf{1}^\top (c_{i'} \odot w_j \odot a_k)\big)}, \qquad (2)$$

where $X_{ijk}$ is a 3-tensor of cooccurrence counts over words $j$, surrounding contexts $i$, and features $k$.

3.2 Skip-Gram as n-Tensor Factorization

To factorize an $n$-dimensional tensor that includes an arbitrary number $L$ of annotations, we replace feature $k$ in Equation (1) and $a_k$ in Equation (2) with each included annotation type $l$ and vector $\alpha_l$. $X_{i,j,k}$ becomes $X_{i,j,l_1,\dots,l_L}$, representing the number of times word $j$ appeared in context $i$ with features $l_1$ through $l_L$. We maximize

$$\sum_{i,j,l_1,\dots,l_L} X_{i,j,l_1,\dots,l_L} \log \beta_{i,j,l_1,\dots,l_L},$$
$$\beta_{i,j,l_1,\dots,l_L} \propto \exp\big(\mathbf{1}^\top (c_i \odot w_j \odot \alpha_{l_1} \odot \cdots \odot \alpha_{l_L})\big).$$
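To make the factorization concrete, the snippet below is a minimal NumPy sketch of this generalized objective: it scores every context $i$ for a word $j$ and a tuple of annotations via the Hadamard product $c_i \odot w_j \odot \alpha_{l_1} \odot \cdots \odot \alpha_{l_L}$, normalizes over contexts, and sums count-weighted log-probabilities. All names, shapes, and toy counts here are illustrative assumptions, not the authors' released implementation; with $L = 1$ the sketch reduces to Eqs. (1) and (2).

```python
# Minimal sketch of the objective in Eqs. (1)-(2), generalized to L
# annotation types as in Section 3.2. Illustrative only; not the
# authors' released code.
import numpy as np

rng = np.random.default_rng(0)
V_ctx, V_word, d = 50, 40, 8              # vocab sizes and embedding dim
ann_sizes = [6, 4]                        # L = 2 annotation vocabularies

C = rng.normal(size=(V_ctx, d))           # context embeddings c_i
W = rng.normal(size=(V_word, d))          # word embeddings w_j
A = [rng.normal(size=(n, d)) for n in ann_sizes]  # annotation embeddings

def log_beta(j, anns):
    """Log-probability over all contexts i for word j with annotations
    anns = (l_1, ..., l_L): a softmax of 1^T(c_i . w_j . a_l1 ... a_lL)."""
    h = W[j].copy()
    for l, A_l in zip(anns, A):
        h *= A_l[l]                       # Hadamard product with each alpha_l
    scores = C @ h                        # 1^T(c_i . h) = c_i . h, for every i
    return scores - np.logaddexp.reduce(scores)  # log-softmax over contexts

def log_likelihood(X):
    """Sum of X_{i,j,l_1..l_L} * log beta_{i,j,l_1..l_L}, where X is a
    sparse count dict mapping (i, j, l_1, ..., l_L) -> count."""
    return sum(count * log_beta(j, anns)[i]
               for (i, j, *anns), count in X.items())

# Toy counts: word 3 seen in contexts 7 and 0 with annotations (2, 1).
X = {(7, 3, 2, 1): 5.0, (0, 3, 2, 1): 2.0}
print(log_likelihood(X))
```

Maximizing this quantity with respect to $C$, $W$, and the $\alpha$ matrices (e.g., by gradient ascent) yields the annotation-enriched embeddings.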
4 Experiments

Our end goal is to use multiple kinds of automatically obtained, "in-the-wild" frame semantic parses to improve the semantic content—specifically SPR-type information—within learned lexical embeddings. We utilize majority portions of the Concretely Annotated New York Times and Wikipedia corpora from Ferraro et al. (2014). These have been annotated with three frame semantic parses: FrameNet from Das et al. (2010), and both FrameNet and PropBank from Wolfe et al. (2016). In total, we use nearly five million frame-annotated documents.

Extracting Counts  The baseline extraction we consider is a standard sliding window: for each word $w_j$ seen $\geq T$ times, we extract all words $w_i$ within two tokens to the left and right of $w_j$. These counts, forming a matrix, are then used within standard word2vec. We also follow Cotterell et al. (2017) and augment the above with the signed number of tokens separating $w_i$ and $w_j$, e.g., recording that $w_i$ appeared two to the left of $w_j$; these counts form a 3-tensor. To turn semantic parses into tensor counts, we first identify relevant information from the parses.

Table 1: Vocabulary sizes, in thousands, extracted from Ferraro et al. (2014)'s data with both the standard sliding-context-window approach (§3) and the frame-based approach (§4). In each cell, the upper number is for newswire and the lower for Wikipedia. For both corpora, 800 total FrameNet frame types and 5,100 PropBank frame types are extracted.

                          windowed    frame
    # target words        232         35.9  (triggers)
                          404         45.7  (triggers)
    # surrounding words   232         531   (role fillers)
                          404         2,305 (role fillers)

We generalize Cotterell et al. (2017)'s implementation to enable arbitrary-dimensional tensor factorization, as described in §3.2. We learn 100-dimensional embeddings for words that appear at least 100 times, using 15 negative samples. The implementation is available at https://github.com/fmof/tensor-factorization.

Metric  We evaluate our learned (trigger) embeddings $w$ via QVEC (Tsvetkov et al., 2015). QVEC uses canonical correlation analysis to measure the correlation between the dimensions of the learned embeddings and linguistic properties drawn from an external annotated resource.
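To illustrate the baseline extraction in the Extracting Counts paragraph above, the sketch below builds both the sliding-window word/context matrix and the signed-offset 3-tensor of Cotterell et al. (2017)'s augmentation. The function name, the tiny example sentence, and the threshold handling are our own assumptions, not the paper's pipeline.

```python
# Sketch of the two baseline count extractions: (a) a sliding-window
# word/context matrix, and (b) the same counts augmented with the
# signed token offset, giving a 3-tensor. Illustrative only.
from collections import Counter

def extract_counts(corpus, window=2, min_count=1):
    """corpus: iterable of tokenized sentences (lists of strings)."""
    word_freq = Counter(tok for sent in corpus for tok in sent)
    matrix = Counter()   # (target word, context word) -> count
    tensor = Counter()   # (target word, context word, signed offset) -> count
    for sent in corpus:
        for j, w_j in enumerate(sent):
            if word_freq[w_j] < min_count:     # keep words seen >= T times
                continue
            lo, hi = max(0, j - window), min(len(sent), j + window + 1)
            for i in range(lo, hi):
                if i == j:
                    continue
                w_i = sent[i]
                matrix[w_j, w_i] += 1
                tensor[w_j, w_i, i - j] += 1   # e.g. -2: two to the left
    return matrix, tensor

sents = [["she", "said", "bill", "would", "try", "the", "same", "tactic"]]
m, t = extract_counts(sents)
print(m["try", "would"])      # 1
print(t["try", "would", -1])  # "would" appears one token left of "try"
```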
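For intuition about the metric, here is a toy sketch of a CCA-based alignment score in the spirit of QVEC's use of canonical correlation analysis, built on scikit-learn. It is emphatically not the official QVEC implementation; the random matrices stand in for learned trigger embeddings and an SPR-style linguistic-property matrix over a shared vocabulary.

```python
# Toy CCA-based alignment score in the spirit of QVEC's CCA variant
# (Tsvetkov et al., 2015). NOT the official QVEC code; random matrices
# are placeholders for real embeddings and property annotations.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_words = 200
embeddings = rng.normal(size=(n_words, 100))  # learned 100-dim embeddings
properties = rng.normal(size=(n_words, 10))   # linguistic property columns

cca = CCA(n_components=1)
U, V = cca.fit_transform(embeddings, properties)
# Score: correlation of the first pair of canonical variates; higher
# means the embedding dimensions align better with the properties.
score = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
print(f"CCA alignment score: {score:.3f}")
```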