
Fine-Grained Class Extraction via Modifier Composition (Unpublished Manuscript)

Ellie Pavlick and Marius Pasca

Abstract

We present a method for populating fine-grained classes (e.g. "American jazz composers") with instances (e.g. Charles Mingus). While state-of-the-art methods tend to treat class labels as single lexical units, the method we propose works by individually considering each of the modifiers in the class label ("American", "jazz") relative to the head ("composers"). On the task of reconstructing Wikipedia category pages, we demonstrate a 4x increase in coverage over a strong baseline which relies on widely-used lexical patterns for IsA extraction.

[Figure 1 shows the class label "1950s American jazz composers" alongside example sentences: "... seminal composers such as Charles Mingus and George Russell ..."; "A virtuoso bassist and composer, Mingus irrevocably changed the face of jazz ..."; "Mingus truly was a product of America in all its historic complexities ..."; "Mingus dominated the scene back in the 1950s and 1960s ..."]

Figure 1: Our method extracts instances of fine-grained classes by considering each of the modifiers in the class label individually. This makes it possible to find instances even when the class label never appears in text as a single unit.

1 Introduction

Reasoning about natural language often requires taxonomic knowledge. Knowing, for example, that "Charles Mingus" is a "jazz composer" enables systems to perform complex language understanding tasks like question answering (Harabagiu et al., 2001) and natural language inference (Clark et al., 2007). Substantial attention has been paid to automatically acquiring such "IsA" relations from text (Snow et al., 2006; Shwartz et al., 2016). The vast majority of current approaches rely on lexical patterns as the primary signal of whether an instance belongs to a class: for example, observing a pattern like "X such as Y" is a strong indication that Y is an instance of class X (Hearst, 1992).

While methods based on these "Hearst" patterns yield consistently good results, they rely on the assumption that class labels can be treated as lexicalized units. That is, in order to recognize an instance of a class, these pattern-based methods require that the entire class label be observed verbatim in text, within one of the pre-defined lexical patterns. This assumption is reasonable for relatively short class labels, in particular those containing a single word, but in reality, there is an infinite number of possible classes that one could feasibly refer to using natural language: not only "composers", but also "jazz composers" or "1950s American jazz composers". The probability that any one of these class labels will appear in its entirety within one of the expected patterns is very low, even in large amounts of text. This is especially true for class labels containing more than one or two modifiers. The treatment of a class label as a single unit therefore severely limits the ability of current techniques to recognize instances of arbitrarily fine-grained classes.

In this work, we introduce an approach for reasoning about fine-grained class labels compositionally. We base our model on the notion from formal linguistics in which modifiers (e.g. "1950s") correspond to properties which differentiate instances of a subclass ("1950s composers") from instances of the superclass ("composers") (Heim and Kratzer, 1998). Our proposed method consists of two stages. First, we interpret each modifier in order to make explicit the otherwise-implicit semantic relation between the modifier and the head (e.g. "composers active during 1950s"). Second, we use these interpretations in order to find instances of the head ("composers") for which the properties implied by the modifier ("active during 1950s") hold, as illustrated in Figure 1. We demonstrate a 4x increase in the number of fine-grained Wikipedia categories for which we are able to return instances using the proposed model when compared against a strong baseline that uses Hearst patterns.
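The style of pattern-based extraction described above can be illustrated with a minimal sketch. This is an illustrative toy, not the authors' system: real pattern-based extractors use many more patterns plus syntactic analysis, and the two regular expressions below are invented for the example.

```python
import re

# A minimal sketch of Hearst-pattern IsA extraction (illustrative only).
# Each pattern captures a class label and an instance of that class.
PATTERNS = [
    re.compile(r"(?P<cls>[\w ]+?) such as (?P<inst>[A-Z][\w ]+)"),
    re.compile(r"(?P<inst>[A-Z][\w ]+) and other (?P<cls>[\w ]+)"),
]

def extract_isa(sentence):
    """Return (instance, class) pairs matched by any pattern."""
    pairs = []
    for pat in PATTERNS:
        for m in pat.finditer(sentence):
            pairs.append((m.group("inst").strip(), m.group("cls").strip()))
    return pairs
```

Note that a sketch like this only fires when the entire class label ("seminal composers") appears verbatim next to the instance, which is exactly the limitation the paper targets.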
2 Related Work

Our work builds primarily on two lines of prior research: noun phrase interpretation and semantic taxonomy induction.

Noun Phrase Interpretation. Compound noun phrases (e.g. "jazz composers") communicate implicit semantic relations between the modifier and the head. There have been numerous efforts to provide semantic interpretations of such phrases. Many of these methods rely on matching the compound to pre-defined syntactic patterns (Fares et al., 2015) or ontologies of semantic relations (Ó Séaghdha and Copestake, 2007; Tratz and Hovy, 2010; Surtani and Paul, 2015; Choi et al., 2015). Recent approaches have shifted to allow interpretations to take the form of arbitrary natural language predicates (Hendrickx et al., 2013). Most of the models for solving this problem take a supervised approach, comparing unseen noun compounds to the most similar noun compounds seen in training data (Wijaya and Gianfortoni, 2011; Nulty and Costello, 2013; Van de Cruys et al., 2013). Others have taken unsupervised approaches, which apply information extraction techniques to documents (Kim and Nakov, 2011; Xavier and de Lima, 2014) or to query logs (Pasca, 2015) in order to paraphrase noun compounds. All of these methods focus exclusively on the task of providing good paraphrases for an input noun compound. To our knowledge, the work we present here is the first attempt to use these interpretations for the downstream task of inferring IsA relations for fine-grained classes.

Semantic Taxonomy Induction. There has been substantial effort made to automatically learn taxonomic relations from text. Most approaches build on the seminal work of Hearst (1992), which observed that certain textual patterns, e.g. "X and other Y", are high-precision indicators of whether X is a member of class Y. Snow et al. (2006) took the idea further by learning such patterns automatically from dependency-parsed corpora. Recent extensions have incorporated taxonomic relations such as semantic exclusion (Pavlick et al., 2015) and improved the pattern representation using deep learning (Shwartz et al., 2016). These patterns, and the IsA relations that they extract, provide a key input for the more general task of knowledge base population. The "universal schema" approach (Riedel et al., 2013; Kirschnick et al., 2016; Verga et al., 2016), which infers relations using matrix factorization, often includes Hearst patterns as features alongside information from common sense knowledge resources like Freebase (Bollacker et al., 2008). Similarly, graphical models (Bansal et al., 2014) and joint inference models (Movshovitz-Attias and Cohen, 2015) typically require Hearst patterns to define the inventory of possible classes in the taxonomy. A separate line of work avoids Hearst patterns altogether, and instead exploits semi-structured data found in HTML markup (Wang and Cohen, 2009; Dalvi et al., 2012; Pasupat and Liang, 2014) or in coordinate lists (Bing et al., 2015).

All of these existing approaches share the fundamental limitation that, in order for a class to be populated with instances, some (or all) instances of the class need to have been observed. In practice, this often requires that the entire class label has been observed in text in a context amenable to IsA extraction, such as a Hearst pattern or HTML table. As a result, these approaches are not suited to handling the unbounded number of class labels that can be expressed in natural language. The work proposed here aims to address this limitation by leveraging the compositionality of fine-grained class labels. By interpreting modifiers individually, our proposed method can combine evidence from multiple sentences (Figure 1), and can perform IsA extraction without requiring example instances of the classes it is asked to populate.[1]

3 Modifiers as Functions

3.1 Formal Semantics

In formal semantics, modification is modeled as function application. Specifically, let MH be a class label consisting of a head H, which we assume to be a common noun, preceded by a modifier M. By convention, we use ⟦·⟧ to represent the "interpretation function" which maps a linguistic expression to its denotation in the world. The interpretation of a common noun is the set of entities[2] in the universe U which are denoted by the noun (Heim and Kratzer, 1998):

    ⟦H⟧ = {e ∈ U | e is a H}    (1)

In contrast, the interpretation of a modifier M is a function that maps between sets of entities. That is, modifiers select a specific subset[3] of the input:

    ⟦M⟧(H) = {e ∈ ⟦H⟧ | e is M}    (2)

This formalization leaves open how one decides whether or not "e is M." Determining whether this condition of "being M" holds is a non-trivial aspect of language understanding. The meaning of a modifier can vary depending on the class it is modifying. For example, if e is a "good student", e is not necessarily a "good person", making it difficult to model whether "e is good" in an absolute sense. We therefore re-frame the above equation, so that the decision of whether "e is M" is made by calling a binary function φ_M, parameterized by the class H within which e is being considered:

    ⟦M⟧(H) = {e ∈ ⟦H⟧ | φ_M(H, e)}    (3)

Conceptually, φ_M captures the core "meaning" of the modifier M, which is the set of properties that differentiate members of the output class MH from members of the more general input class H.

This formal semantics framework has two important consequences. First, the modifier has some intrinsic "meaning." There are properties entailed by the modifier that are independent of the particular state of the world. This makes it possible to make inferences about "1950s composers" even if no 1950s composers have been observed. Second, the modifier is a function that can be applied in a truth-theoretic setting. For example, applying "1950s" to the set of "composers" will return exactly the set of "1950s composers".

3.2 Computational Approaches

While the notion of modifiers as functions has been incorporated into computational models previously, prior work has focused on either assigning an intrinsic meaning to M or on operationalizing M in a truth-theoretic sense, but not on doing both simultaneously. For example, Young et al. (2014) focused exclusively on the subset selection aspect of modification. That is, given a set of instances H and a modifier M, their method could return the subset MH. However, their method did not model the meaning of the modifier itself, so that, e.g. if there were no red cars in their model of the world, the phrase "red cars" would have no meaning. In contrast, Baroni and Zamparelli (2010) modeled the meaning of modifiers explicitly as functions which map between vector-space representations of nouns. However, their model focuses on similarity between class labels, e.g. to say that "important routes" is similar to "major roads", and it is not obvious how the method could be operationalized in order to identify instances of those classes.

One goal of our work is to model the semantics of M intrinsically, but in a way that permits application in the model-theoretic setting. We first learn an explicit model of the "meaning" of a modifier M relative to a head H. We represent this meaning as a distribution over properties, stated in natural language, which differentiate the members of the class MH from those of the class H. We then use this representation to identify the subset of instances of H which constitute the subclass MH.

4 Learning Modifier Interpretations

4.1 Setup

For each modifier M, we would like to learn the function φ_M from Equation 3. Doing so makes it possible, given a class H and an instance e ∈ ⟦H⟧, to decide whether e has the properties required to be an instance of the class MH. In general, there is no systematic way to determine the implied relation between M and H. It has been argued that, given the right context, modifiers can express any possible semantic relation (Weiskopf, 2007).

We therefore model the semantic relation between M and H as a distribution over properties which could potentially define the subclass MH ⊆ H. We will refer to this distribution as a "property profile" for M relative to H.[4] We make the assumption that relations between M and H which are discussed more often are more likely to capture the important properties of the subclass MH. This assumption is not perfect (Section 4.5) but has given good results for paraphrasing noun phrases (Nakov and Hearst, 2013; Pasca, 2015).

Our method for learning property profiles is based on the unsupervised method proposed by Pasca (2015). This approach used query logs as a source of common sense knowledge, and rewrote noun compounds ("American composers") by matching the compound MH to queries of the form "H (.*) M" ("composers from America"). Anonymous (2016) extended this approach by using subject-predicate-object (SPO) tuples derived from documents in place of query logs. They demonstrated that doing so produces finer-grained predicates ("born in" instead of "from") and higher coverage. We build on Anonymous (2016)'s approach, described further in Section 4.3.

4.2 Inputs

We assume two inputs: 1) an IsA repository, O, containing ⟨e, C⟩ tuples, where C is a category and e is an instance of C, and 2) a fact repository, D, containing ⟨s, p, o, w⟩ tuples, where s and o are noun phrases, p is a predicate, and w is a confidence that p expresses a true relation between s and o.

We instantiate O with an IsA repository constructed by applying Hearst patterns to a large web corpus. Instances are represented as automatically-disambiguated entity mentions[5] which, when possible, are resolved to Wikipedia pages (i.e. "America" and "USA" will be mapped to the same article).

[1] Pasupat and Liang (2014) also focuses on zero-shot IsA extraction, but exploits HTML document structure, rather than reasoning compositionally.
[2] We use "entities" as it is standard terminology in model-theoretic semantics. For the purposes of this paper, we consider "entities" and "instances" to be interchangeable.
[3] We make the assumption that all modifiers are subsective, meaning they will return a subset of the input. We acknowledge that this is not always the case (Kamp and Partee, 1995).
[4] The more commonly used terminology in existing work (Section 2) is "interpretation", i.e. we aim to "interpret" M relative to H. However, we use "property profiles" to avoid confusion with the formal semantics notion of "interpretation," as used in Section 3.
[5] "Entity mentions" may be individuals, like "Barack Obama", but may also be concepts like "jazz".

Good property profiles:
  rice dish            * serve with rice        * include rice       * consist of rice
  French violinist     * live in France         * born in France     * speak French
  Led Zeppelin song    Led Zeppelin write *     Led Zeppelin play *  Led Zeppelin have *
  still life painter   * known for still life   * paint still life   still life be by *

Bad property profiles:
  child actor          * have child             * expect child       * play child
  risk manager         * take risk              * be at risk         * be aware of risk

Table 1: Examples of property profiles learned by observing predicates that relate instances of the class H to the modifier M (I2). Results are similar when using the class label H directly (I1). Our method assumes that the most frequently expressed relations between M and H are the best interpretations. This assumption gives good results in general (left four columns), but has limitations (right two columns). We spell out inverted predicates (Section 4.2) so wildcards (*) may appear in the subject or object position.

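The construction of property profiles like those in Table 1 (defined formally as I1 and I2 in Section 4.3, with the similarity discounting of Section 4.4) can be sketched with toy data. This is an illustrative re-implementation under invented inputs, not the paper's pipeline: the repositories `isa` and `facts` and the similarity table `sim` are all assumed examples.

```python
# Toy sketch of an I2-style property profile: collect weighted
# <predicate, modifier> properties observed between instances of the
# head and (near-matches of) the modifier. All data below is invented.
isa = {("Charles Mingus", "composer"), ("Maurice Ravel", "composer")}   # O
facts = [("Charles Mingus", "born in", "America", 0.9),                 # D
         ("Charles Mingus", "write", "jazz", 0.8),
         ("Maurice Ravel", "born in", "France", 0.9)]
# Assumed similarity table standing in for the distributional vector space.
sim = {("American", "America"): 1.0}

def property_profile(modifier, head):
    """Aggregate weights for each predicate relating an instance of
    `head` to the modifier, indexed by the canonical modifier form."""
    instances = {e for e, c in isa if c == head}
    profile = {}
    for s, p, o, w in facts:
        similarity = 1.0 if o == modifier else sim.get((modifier, o), 0.0)
        if s in instances and similarity > 0:
            key = (p, modifier)
            profile[key] = profile.get(key, 0.0) + w * similarity
    return profile
```

Under these toy inputs, the profile for "American" relative to "composer" contains the single property "* born in America" (via the similarity between "American" and "America"), while the "write jazz" fact contributes nothing because it has zero similarity to the modifier.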
Classes are represented as (non-disambiguated) strings in natural language. We retain every ⟨e, C⟩ tuple which is supported by 5 or more sentences and has a confidence of at least 0.9, where "confidence" is a score that reflects a weighted combination of the number of supporting sentences and the frequency of the category C. The resulting repository contains 1.1M IsA relations, covering 412K instances and 9K categories.

We instantiate D with a large repository of facts extracted using open information extraction techniques based on those used by ReVerb (Fader et al., 2011) and OLLIE (Mausam et al., 2012). That is, we use shallow syntactic analysis in order to extract SPO tuples from raw text. We leave predicates as natural language strings, but remove stop words and apply basic lemmatization (e.g. "is an important part of" becomes "be important part of"). Subjects and objects may be either disambiguated entity references, as above, or natural language strings. Every tuple is included in both the forward and the reverse direction. For example, ⟨jazz, perform at, venue⟩ also appears as ⟨venue, ←perform at, jazz⟩, where ← is a special character signifying inverted predicates. These inverted predicates simplify the following definitions. In total, our fact repository contains 30M tuples.

4.3 Associating Modifiers with Properties

Let I be a function which takes as input a noun phrase MH and returns a property profile for M relative to H. We define a "property" to be an SPO tuple in which the subject position[6] is a wildcard, e.g. ⟨*, born in, America⟩. Any instance which fills the wildcard slot then "has" the property.

We expand adjectival modifiers to encompass nominalized forms using a nominalization dictionary extracted from WordNet (Miller, 1995). In the following definitions, M refers to any modifier in this expanded set. E.g. if MH is "American composer" and we require a tuple to have the form ⟨H, p, M, w⟩, we will include tuples in which the third element is either "American" or "America".

Relating M to H Directly. We first build a property profile by taking the predicate and object from any tuple in D in which the subject is the class label H and the object is the modifier M:

    I1(MH) = {⟨⟨p, M⟩, w⟩ | ⟨H, p, M, w⟩ ∈ D}    (4)

This is a direct reimplementation of the method explored in Anonymous (2016).

Relating M to an Instance of H. Second, we consider an extension of the above method in which, rather than requiring the subject of the tuple to be the class label H, we require the subject to be an instance of H:

    I2(MH) = {⟨⟨p, M⟩, w⟩ | ⟨e, H⟩ ∈ O ∧ ⟨e, p, M, w⟩ ∈ D}    (5)

4.4 Modifier Expansion

In practice, when building property profiles, we do not require that the object of the fact tuple match the modifier exactly, as suggested in Equations 4 and 5 above. Instead, we follow Pasca (2015) and take advantage of facts involving distributionally similar modifiers. Specifically, rather than looking only at tuples in D in which the object matches M, we consider all tuples, but discount the weight proportionally to the similarity between M and the object of the tuple. Thus, I1 is computed as below:

    I1(MH) = {⟨⟨p, M⟩, w × sim(M, N)⟩ | ⟨H, p, N, w⟩ ∈ D}    (6)

where sim(M, N) is the cosine similarity between M and N. I2 is computed analogously. We compute cosine similarity using a vector space built from Web documents, following the distributional semantic models of Lin and Wu (2009) and Pantel et al. (2009). We retain the 100 most similar phrases for each of approximately 10 million phrases, and consider all other similarities to be 0.

4.5 Analysis of Property Profiles

Table 1 provides examples of good and bad property profiles returned for several MHs. In general, frequent relations between M and H capture relevant properties of MH, but this is not always the case. For example, the most frequently discussed relation between "child" and "actor" is that actors have children, but this property is not indicative of the meaning of "child actors".

Qualitatively, the top-ranked interpretations learned by using the head noun directly (I1, Equation 4) are very similar to those learned using instances of the head (I2, Equation 5). However, I2 returns many more properties (10 on average per MH) than I1 (just over 1 on average). Anecdotally, we see that I2 captures more specific relations than does I1. For example, for "jazz composers", both methods return the properties "* write jazz" and "* compose jazz", but I2 additionally returns specific properties such as "* be major creative influence in jazz" and "jazz capture imagination of *". We compare I1 and I2 quantitatively in Section 6. Importantly, we do see that both I1 and I2 are capable of learning head-specific property profiles for a modifier. Table 2 provides some examples of modifiers for which the top-ranked property varies depending on the head being modified.

    American company     * based in America
    American composer    * born in America
    American novel       * written in America
    jazz album           * features jazz
    jazz composer        * writes jazz
    jazz venue           jazz performed at *

Table 2: Head-specific property profiles learned by observing predicates that relate instances of the class H to the modifier M (I2). Results are similar when using the class label H directly (I1).

5 Class-Instance Identification

5.1 Basic Model

Given a means of finding properties that relate a modifier to a head, we turn to the task of identifying instances of fine-grained classes. That is, for a given modifier M, we want to instantiate the function φ_M from Equation 3. In practice, rather than being a binary function which decides whether or not e is in class MH, our instantiation, φ̂_M, will return a real-valued score expressing the confidence that e is a member of MH. For notational convenience, let D(⟨s, p, o⟩) = w if ⟨s, p, o, w⟩ ∈ D, and 0 otherwise. We define φ̂_M as follows:

    φ̂_M(H, e) = Σ_{⟨⟨p,o⟩,w⟩ ∈ I(MH)} w × D(⟨e, p, o⟩)    (7)

The interpretation of M applied to H, then, is as in Equation 3 except that instead of a discrete set, it returns a scored list of candidate instances:

    ⟦M⟧(H) = {⟨e, φ̂_M(H, e)⟩ | ⟨e, H⟩ ∈ O}    (8)

Ultimately, we need to identify instances of arbitrary class labels, which may contain multiple modifiers. Given a class label C = M1 ... Mk H, which contains a head H preceded by modifiers M1 ... Mk, we generate a list of candidate instances by finding all instances of H which have some property to support every modifier:

    ∩_{i=1}^{k} {⟨e, score(e)⟩ | ⟨e, w⟩ ∈ ⟦Mi⟧(H) ∧ w > 0}    (9)

where score(e) is simply the average[7] of the modifier scores assigned by each separate φ̂_{Mi}. From here on, we use Mods to refer to our method which generates lists of instances for a class using Equations 8 and 9. When φ̂_M (Equation 7) is implemented using I1 we use the name ModsH (for "heads"), and when it is implemented using I2 we use the name ModsI (for "instances").

5.2 Weakly-Supervised Reranking

Equation 8 uses a naive ranking in which the weight for e ∈ ⟦MH⟧ is the product of how often e has been observed with some property and the weight of that property for the class MH. Thus, instances of H with overall higher counts in D receive high weights for every MH. We therefore train a simple logistic regression model to predict the likelihood that e belongs to MH. We use a small set of features, such as the raw weight as computed in Equation 7 and the total number of times e appears in D.[8] For training, we sample ⟨e, C⟩ pairs from our IsA repository O as positive examples and random pairs that do not appear in any Hearst pattern as negative examples. We frame the task as a binary prediction of whether e ∈ C. We use the model's confidence as the value of φ̂_M, in place of the function defined in Equation 7.

6 Evaluation

6.1 Evaluation Data from Wikipedia

We evaluate our models on their ability to return correct instances for arbitrary class labels. As a source of evaluation data, we use Wikipedia category pages.[9] These are pages in which the title is the name of the category (e.g. "pakistani film actresses") and the body is a list of links to other pages which fall under the category. We measure the precision and recall of each method for discovering the instances listed on these category pages given the page title (henceforth "class label").

Our focus is on class labels which are compositional, so we limit our evaluation to class labels which contain at least one modifier. In addition, we restrict to class labels in which the head noun is a single common noun. To build our evaluation sets, we collect the titles of all Wikipedia category pages, removing titles which contain fewer than three words and titles in which the last word is capitalized.[10] We also remove any titles which contain links to sub-categories. This is to favor finer-grained class labels ("pakistani film actresses") over coarser-grained ones ("film actresses"). From the resulting list of class labels, we draw two samples of 100 class labels each. The first sample is chosen uniformly at random (denoted UniformSet). The second (denoted WeightedSet) is weighted so that the probability of drawing a class label M1 ... Mk H is proportional to the total number of class labels in which H appears as the head. We enforce that no H appear as the head of more than three class labels per sample. On average, there are 17 instances per category in UniformSet and 19 in WeightedSet. Table 3 gives examples of some of the class labels in UniformSet.

    2008 california wildfires · australian army chaplains · australian boy bands · canadian business journalists · canadian military nurses · canberra urban places · cellular automaton rules · chinese rice dishes · coldplay concert tours · daniel libeskind designs · economic stimulus programs · german film critics · invasive amphibian species · log flume rides · malayalam short stories · pakistani film actresses · puerto rican sculptors · string theory books · tampa bay devil rays scouts

Table 3: Examples of class labels from our UniformSet. These labels come from a random sample of titles of Wikipedia category pages.

Modifier Chunking. While all of the class labels contain at least three words, they represent a mix of single-modifier ("puerto rican sculptors") and multiple-modifier ("canadian business journalists") phrases. Therefore, rather than naively treating every label as a sequence of unigrams, we perform noun-phrase chunking as a preprocessing step. We use a parser trained to parse queries (Petrov et al., 2010), which gives good performance on short phrases. Given the parse tree, we group together any tokens which share a common parent other than the root node, with the exception of the rightmost token (the head), which we force to appear as a chunk by itself. This heuristic was chosen since, on manual inspection, it produced good chunks. We use these pre-chunked class labels as input to all of the systems, including baselines, in our evaluation. The experiments in this section assume some method for grouping together multi-word modifiers ("puerto rican"), but are not dependent on this particular method for chunking.

6.2 Baselines

We implement two baselines using our IsA repository (O as defined in Section 4.2). Our simplest baseline ignores modifiers altogether, and simply assumes that any instance of H is an instance of MH, regardless of M. In this case the confidence value for ⟨e, MH⟩ is equivalent to that for ⟨e, H⟩. We refer to this baseline simply as Baseline.

Our second, stronger baseline uses the IsA repository directly to identify instances of the fine-grained class C = M1 ... Mk H. That is, we consider e to be an instance of the class if ⟨e, C⟩ ∈ O, meaning the entire class label appeared in some Hearst pattern and could be extracted directly. We refer to this baseline as Hearst. The confidence value used to rank the candidate instances is simply the confidence value assigned by the Hearst pattern extraction, as in Section 4.2.

6.3 Compositional Models

As a baseline compositional model, we augment the Hearst baseline via intersection of instance sets. Specifically, for a class C = M1 ... Mk H, if each of the MiH appears in the IsA repository independently, we can take the instances of C to be the intersection of the instances of each of the independent MiH. We assign the weight of an instance e to be the sum of the weights associated with each modifier independently. We refer to this method as Hearst∩. Note that while Hearst∩ does handle modifiers compositionally, it does not explicitly model any intrinsic meaning of modifiers, and thus exhibits the shortcoming described in Section 3.2.

We contrast this with our proposed model, which attempts to recognize instances of a fine-grained class by 1) assigning a meaning to each modifier in the form of a property profile and 2) checking the extent to which the instance in question exhibits these properties. We refer to the two versions of our method as ModsH and ModsI, as described in Section 5.1. When relevant, we use "raw" to refer to the version which ranks candidate instances using the raw weights (Section 5.1) and "RR" to refer to the version in which the instances are ranked using a logistic regression model (Section 5.2).

We also evaluate a combined system, in which the proposed models are used to extend the Hearst-based method rather than to replace it. Since the scores returned by each method are not comparable, we combine the predictions of Hearst with the predictions of our proposed model by merging the ranked lists. Specifically, the score of an instance is the inverse of the sum of its ranks in each of the input lists. If an instance does not appear at all in an input list, its rank in that list is set to a large constant value. We refer to these combination systems as Hearst+ModsH and Hearst+ModsI.

[6] Inverse predicates enable us to capture properties in which the wildcard is conceptually the object of the relation, even if it occupies the subject slot of the SPO tuple. E.g. ⟨venue, ←perform at, jazz⟩ can capture that a "jazz venue" is a "venue" e such that "jazz performed at e".
[7] We also tried the minimum, but the average worked better.
[8] Feature templates are given in the supplementary material.
[9] en.wikipedia.org/wiki/Help:Category
[10] For example, "South Korea" is the title of a category page which links to pages such as "South Korean culture" and "Images of South Korea".

    Flemish still life painters: Clara Peeters · Willem Kalf · Jan Davidsz de Heem · Pieter Claesz · Peter Paul Rubens · Frans Snyders · Jan Brueghel the Elder · Hans Memling · Pieter Bruegel the Elder · Caravaggio · Abraham Brueghel
    Pakistani captains: · Shahid · · · · Younis · · · · · Abdul Hafeez Kardar · · Sarfraz Ahmed
    Thai buddhist temples: Wat Buddhapadipa · Wat Chayamangkalaram · Wat Mongkolratanaram · Angkor Wat · Preah Vihear Temple · Wat Phra Kaew · Wat Rong Khun · Wat Mahathat Yuwaratrangsarit · Vat Phou · Tiger Temple · Sanctuary of Truth · Wat Chalong · Swayambhunath · Mahabodhi Temple · Tiger Cave Temple · Harmandir Sahib

Table 4: Instances extracted for several fine-grained classes from Wikipedia. Lists shown are from ModsI . Instances in italics were also returned by Hearst∩. Strikethrough denotes incorrect instances. class labels for which the method is able to find some instance, and their precision, to what extent the method is able to correctly rank true instances of the class above non-instances. For coverage, we report both total coverage, the number of labels for which the method returns any instance, and cor- rect coverage, the number of labels for which the method returns a correct instance. For precision, we compute the average precision (AP) for each class label. AP ranges from 0 to 1, where 1 indi- cates that all of the positive instances were ranked Figure 2: Distribution of AP over 100 class labels above all of the negative instances. We report in WeightedSet. The proposed method (ModsI , mean average precision (MAP), which is the mean red) and the baseline method (Hearst∩, blue) of the APs across all the class labels in the sample. achieve high precision for the same number of class Note that MAP is only computed over class labels labels, but ModsI additionally finds instances for for which the method returns something (methods class labels for which the baseline returns nothing. are not punished for returning empty lists). Table 5 also reveals that the reranking model UniformSet WeightedSet Coverage MAP Coverage MAP (RR) gives a consistent increase of 3 to 10 points in Baseline 95 / 70 0.01 98 / 74 0.01 MAP for the proposed methods. Therefore, going Hearst 9 / 9 0.63 8 / 8 0.80 Hearst∩ 13 / 12 0.62 9 / 9 0.80 forward, we only report results using the reranking ModsH raw 56 / 32 0.23 50 / 30 0.16 model. Specifically, ModsH and ModsI will refer ModsH RR 56 / 32 0.29 50 / 30 0.25 to ModsH RR and ModsI RR, respectively. ModsI raw 62 / 36 0.18 59 / 38 0.20 Mods RR 62 / 36 0.24 59 / 38 0.23 I Manual Re-Annotation. 
Since Wikipedia is Table 5: Coverage and precision for populating not a perfect resource, it possible that true in- classes whose labels appear as titles of Wikipedia stances of a class may be missing from our refer- category pages. “Coverage” is the number of class ence set, and thus that our precision scores un- labels (out of 100) for which at least one candidate derestimate the actual precision of the systems. instance was returned, followed by the number for We therefore manually verify the top 10 predic- which at least one correct instance was returned. tions of each of the systems for a random sam- “MAP” is the mean average precision (see text). ple of 25 class labels. We choose class labels for which Hearst was able to return at least one in- stance, in order to ensure reliable precision esti- Table 4 gives examples of instances returned for mates. For each of these labels, we manually check several class labels and Table 5 shows the preci- the top 10 instances proposed by each method to sion and coverage for each of the methods. The determine whether each belongs to the class. Ta- proposed methods are able to return correct in- ble 6 shows the precision scores for each method stances for up to four times as many class labels as computed against the original Wikipedia list of in- the baseline method (e.g. Mods provides correct I stances and against our manually-augmented list of instances for 38 class labels in WeightedSet, com- gold instances. The overall ordering of the systems pared to only 8 covered by Hearst). However, the does not change, but the precision scores computed proposed methods exhibit a sizable drop in MAP against Wikipedia are notably lower than the preci- compared to Hearst and Hearst∩: from over 0.6 sion scores computed after re-annotation. We con- to under 0.3. 
Figure 2 illustrates the cause of this tinue to use the Wikipedia lists for our evaluation, large difference in MAP: both Hearst and Mods I but acknowledge that the reported precisions are achieve high precision for the same number of class likely an underestimate of the true precisions. labels, but ModsI additionally returns lists of can- didate instances (albeit with lower precision) for Precision-Recall Tradeoff. We next look at many labels for which Hearst returned nothing. the precision-recall tradeoff more closely, in terms (a) Uniform random sample (UniformSet). (b) Weighted random sample (WeightedSet).
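The AP, MAP, and Precision@10 computations used above can be sketched as follows. This is a minimal reimplementation, not the authors' code; the function names and the convention of skipping empty prediction lists for MAP are ours, following the text.

```python
# Evaluation metrics sketch. `ranked` is a best-first list of predicted
# instances for one class label; `gold` is the set of true instances.

def average_precision(ranked, gold):
    """AP: mean of precision@k over the ranks k that hit a gold instance,
    normalized by the total number of gold instances."""
    hits, precisions = 0, []
    for k, inst in enumerate(ranked, start=1):
        if inst in gold:
            hits += 1
            precisions.append(hits / k)
    if not gold:
        return 0.0
    return sum(precisions) / len(gold)

def mean_average_precision(predictions, gold_sets):
    """MAP over class labels; labels with empty prediction lists are
    skipped, mirroring 'not punished for returning empty lists'."""
    aps = [average_precision(r, g)
           for r, g in zip(predictions, gold_sets) if r]
    return sum(aps) / len(aps) if aps else 0.0

def precision_at_k(ranked, gold, k=10):
    """Precision@10 as used in the manual re-annotation comparison."""
    top = ranked[:k]
    return sum(1 for inst in top if inst in gold) / max(len(top), 1)
```

Under these definitions, a method that ranks every true instance above every non-instance receives an AP of 1 for that label, matching the description in the text.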

Figure 3: ROC curves for selected methods (Hearst baselines in blue, proposed in red). Given a list of instances associated with confidence scores, ROC curves plot the number of true positives vs. false positives that are retained at various confidence thresholds. The curve becomes linear once all remaining instances in the list have the same score (e.g. 0), as this makes it impossible to choose a threshold which adds true positives to the list without also including all remaining false positives.
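The ROC construction described in the caption, including the straight-line tail produced by tied (e.g. zero) scores, can be sketched as follows. This is our own reconstruction, not the authors' code: `scored` maps every candidate in the union to a method's score, with 0 for candidates the method did not return.

```python
# ROC/AUC sketch for one class label.

def roc_points(scored, gold):
    """Return (fpr, tpr) points, cutting only between distinct scores,
    so tied candidates (e.g. the score-0 tail) enter together -- this
    is what makes the tail of the curve a straight line."""
    items = sorted(scored.items(), key=lambda kv: -kv[1])
    pos = sum(1 for c in scored if c in gold)
    neg = len(scored) - pos
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(items):
        score = items[i][1]
        # consume the whole group of candidates tied at this score
        while i < len(items) and items[i][1] == score:
            if items[i][0] in gold:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
    return points

def auc(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A method that scores every unreturned candidate 0 ends with one large tied group, so its curve closes with a single straight segment to (1, 1), as in Figure 3.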

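The combination systems Hearst+ModsH and Hearst+ModsI introduced in Section 6.2 score an instance by the inverse of the sum of its ranks across the input lists, with a large constant rank for instances missing from a list. A minimal sketch, in which the constant's value and the function name are our own assumptions:

```python
# Ranked-list merging sketch for the Hearst+Mods combination systems.

MISSING_RANK = 1000  # assumed placeholder for "not returned at all"

def merge_ranked_lists(*ranked_lists):
    """Score each instance by the inverse of the sum of its 1-based
    ranks across the input lists; return instances best-first."""
    all_instances = set()
    for lst in ranked_lists:
        all_instances.update(lst)
    scores = {}
    for inst in all_instances:
        total_rank = 0
        for lst in ranked_lists:
            # unseen instances get the large constant rank
            total_rank += lst.index(inst) + 1 if inst in lst else MISSING_RANK
        scores[inst] = 1.0 / total_rank
    return sorted(all_instances, key=lambda i: (-scores[i], i))

hearst = ["charles mingus", "george russell"]
mods_i = ["charles mingus", "duke ellington", "george russell"]
print(merge_ranked_lists(hearst, mods_i))
# → ['charles mingus', 'george russell', 'duke ellington']
```

Instances returned by both input lists are pushed toward the top, while instances missing from one list are heavily penalized by the constant rank.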
Specifically, each method attempts to rank the complete list of candidate instances: we take the union of all of the instances proposed by all of the methods (including the Baseline method which, given a class label M0 . . . Mk H, proposes every instance of the head H as a candidate). Then, for each method, we rank this full set of instances such that any instance returned by the method is assigned whatever score the method assigns, and every other instance is assigned a score of 0.

              Wikipedia   Gold
Hearst           0.56     0.79
Hearst∩          0.53     0.78
ModsH            0.23     0.39
ModsI            0.24     0.42
Hearst+ModsH     0.43     0.63
Hearst+ModsI     0.43     0.63

Table 6: Precision@10 before/after re-annotation. Wikipedia underestimates true precision.

               UniformSet       WeightedSet
               AUC    Rec.      AUC    Rec.
Baseline       0.55   0.23      0.53   0.28
Hearst         0.56   0.03      0.52   0.02
Hearst∩        0.57   0.04      0.53   0.02
ModsH          0.68   0.08      0.60   0.06
ModsI          0.71   0.09      0.65   0.09
Hearst∩+ModsH  0.70   0.09      0.61   0.08
Hearst∩+ModsI  0.73   0.10      0.66   0.10

Table 7: Recall of instances on Wikipedia category pages. "Rec." is recall against the full set of instances from all pages in the sample. AUC captures the tradeoff between true and false positives.

Table 7 reports the AUC and the total recall for each of the methods. Figure 3 plots the full ROC curves. The requirement by Hearst and Hearst∩ that class labels appear in full in a single sentence results in very low recall, which translates into very low AUC when considering the full set of candidate instances. By comparison, the proposed compositional methods are able to draw evidence about class membership from a larger set of sentences. Thus, they can provide non-zero scores for many more candidate instances. This enables the proposed methods to achieve a greater than 10 point increase in AUC on both UniformSet and WeightedSet compared to Hearst (Table 7).

7 Future Work

There are several natural extensions to the proposed method, which we hope to explore in future work. More robust representations of property profiles, such as predicate embeddings, would likely improve modifier interpretation. In addition, relaxing the constraint that interpretations contain the modifier itself could allow the method to take advantage of properties that entail class membership, even if they are not good direct interpretations of the class label: e.g. that a "composer who records with Blue Note" is likely a "jazz composer". Finally, property-based reasoning could be applied to single-word class labels and their hypernyms, e.g. a "composer" is a "person who writes music".

8 Conclusion

We have presented an approach to IsA extraction which takes advantage of the compositionality of natural language. Existing approaches often treat class labels as atomic units which must be observed in full in order to be populated with instances. As a result, current methods are not able to handle the infinite number of classes describable in natural language, most of which never appear in text. Our method works by reasoning about each modifier in the label individually, in terms of the properties that it implies about the instances. This approach allows us to harness information that is spread across multiple sentences, and results in a significant increase in the number of fine-grained classes which we are able to populate.

References

Anonymous. 2016. Unsupervised interpretation of multiple-modifier noun phrases. In submission.

Mohit Bansal, David Burkett, Gerard de Melo, and Dan Klein. 2014. Structured learning for taxonomy induction with belief propagation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1041–1051, Baltimore, Maryland.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193.

Lidong Bing, Sneha Chaudhari, Richard Wang, and William Cohen. 2015. Improving distant supervision for information extraction using label propagation through lists. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 524–529, Lisbon, Portugal.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

E. Choi, T. Kwiatkowski, and L. Zettlemoyer. 2015. Scalable semantic parsing with partial ontologies. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-15), pages 1311–1320, Beijing, China.

Peter Clark, William R. Murray, John Thompson, Phil Harrison, Jerry Hobbs, and Christiane Fellbaum. 2007. On the role of lexical and world knowledge in RTE3. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE '07, pages 54–59.

B. Dalvi, W. Cohen, and J. Callan. 2012. Websets: Extracting sets of entities from the Web using unsupervised information extraction. In Proceedings of the 5th ACM Conference on Web Search and Data Mining (WSDM-12), pages 243–252, Seattle, Washington.

A. Fader, S. Soderland, and O. Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11), pages 1535–1545, Edinburgh, Scotland.

Murhaf Fares, Stephan Oepen, and Erik Velldal. 2015. Identifying compounds: On the role of syntax. In International Workshop on Treebanks and Linguistic Theories (TLT14), page 273.

Sanda Harabagiu, Dan Moldovan, Marius Paşca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2001. The role of lexico-semantic feedback in open-domain textual question-answering. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 282–289.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING '92, pages 539–545.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar, volume 13. Blackwell, Oxford.

I. Hendrickx, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Szpakowicz, and T. Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Proceedings of SemEval-13, pages 138–143.

Hans Kamp and Barbara Partee. 1995. Prototype theory and compositionality. Cognition, 57(2):129–191.

N. Kim and P. Nakov. 2011. Large-scale noun compound interpretation using bootstrapping and the Web as a corpus. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11), pages 648–658, Edinburgh, Scotland.

Johannes Kirschnick, Holmer Hemsen, and Volker Markl. 2016. JEDI: Joint entity and relation detection using type inference. In Proceedings of ACL-2016 System Demonstrations, pages 61–66, Berlin, Germany.

D. Lin and X. Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP-09), pages 1030–1038, Singapore.

Mausam, M. Schmitz, S. Soderland, R. Bart, and O. Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-12), pages 523–534, Jeju Island, Korea.

G. Miller. 1995. WordNet: a lexical database. Communications of the ACM, 38(11):39–41.

Dana Movshovitz-Attias and William W. Cohen. 2015. KB-LDA: Jointly learning a knowledge base of hierarchy, relations, and facts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1449–1459, Beijing, China.

Preslav I. Nakov and Marti A. Hearst. 2013. Semantic interpretation of noun compounds using verbal and other paraphrases. ACM Transactions on Speech and Language Processing (TSLP), 10(3):13.

P. Nulty and F. Costello. 2013. General and specific paraphrases of semantic relations between nouns. Natural Language Engineering, 19(03):357–384.

Diarmuid Ó Séaghdha and Ann Copestake. 2007. Co-occurrence contexts for noun compound interpretation. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, pages 57–64.

P. Pantel, E. Crestan, A. Borkovsky, A. Popescu, and V. Vyas. 2009. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-09), pages 938–947, Singapore.

Marius Pasca. 2015. Interpreting compound noun phrases using web search queries. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 335–344, Denver, Colorado.

Panupong Pasupat and Percy Liang. 2014. Zero-shot entity extraction from web pages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 391–401, Baltimore, Maryland.

Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. 2015. Adding semantics to data-driven paraphrasing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1512–1522, Beijing, China.

S. Petrov, P. Chang, M. Ringgaard, and H. Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP-10), pages 705–713, Cambridge, Massachusetts.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84, Atlanta, Georgia.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2389–2398, Berlin, Germany.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808, Sydney, Australia.

Nitesh Surtani and Soma Paul. 2015. A VSM-based statistical model for the semantic relation interpretation of noun-modifier pairs. In Recent Advances in Natural Language Processing, page 636.

Stephen Tratz and Eduard Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 678–687.

T. Van de Cruys, S. Afantenos, and P. Muller. 2013. MELODI: A supervised distributional approach for free paraphrasing of noun compounds. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13), pages 144–147, Atlanta, Georgia.

Patrick Verga, Arvind Neelakantan, and Andrew McCallum. 2016. Generalizing to unseen entities and entity pairs with row-less universal schema. arXiv preprint arXiv:1606.05804.

R. Wang and W. Cohen. 2009. Automatic set instance extraction using the Web. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP-09), pages 441–449, Singapore.

Daniel A. Weiskopf. 2007. Compound nominals, context, and compositionality. Synthese, 156(1):161–204.

Derry Tanti Wijaya and Philip Gianfortoni. 2011. Nut case: What does it mean? Understanding semantic relationship between nouns in noun compounds through paraphrasing and ranking the paraphrases. In Proceedings of the 1st International Workshop on Search and Mining Entity-Relationship Data, pages 9–14. ACM.

Clarissa Castellã Xavier and Vera Lúcia Strube de Lima. 2014. Boosting open information extraction with noun-based relations. In LREC, pages 96–100.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2:67–78.