
Identifying 1950s American Jazz Musicians: Fine-Grained IsA Extraction via Modifier Composition

Ellie Pavlick∗
University of Pennsylvania
3330 Walnut Street
Philadelphia, Pennsylvania 19104
[email protected]

Marius Paşca
Google Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
[email protected]

Abstract

We present a method for populating fine-grained classes (e.g., “1950s American jazz musicians”) with instances (e.g., Charles Mingus). While state-of-the-art methods tend to treat class labels as single lexical units, the proposed method considers each of the individual modifiers in the class label relative to the head. An evaluation on the task of reconstructing Wikipedia category pages demonstrates a >10 point increase in AUC over a strong baseline relying on widely-used Hearst patterns.

1950s American jazz musicians
...seminal musicians such as George Russell...
...A virtuoso bassist and composer, Mingus irrevocably changed the face of jazz...
...Mingus truly was a product of America in all its historic complexities...
...Mingus dominated the scene back in the 1950s and 1960s...

Figure 1: We extract instances of fine-grained classes by considering each of the modifiers in the class label individually. This allows us to extract instances even when the full class label never appears in text.

1 Introduction

The majority of approaches (Snow et al., 2006; Shwartz et al., 2016) for extracting IsA relations from text rely on lexical patterns as the primary signal of whether an instance belongs to a class. For example, observing a pattern like “X such as Y” is a strong indication that Y (e.g., “Charles Mingus”) is an instance of class X (e.g., “musician”) (Hearst, 1992).

Methods based on these “Hearst patterns” assume that class labels can be treated as atomic lexicalized units. This assumption has several significant weaknesses. First, in order to recognize an instance of a class, these pattern-based methods require that the entire class label be observed verbatim in text. The requirement is reasonable for class labels containing a single word, but in practice, there are many possible fine-grained classes: not only “musicians” but also “1950s American jazz musicians”. The probability that a given label will appear in its entirety within one of the expected patterns is very low, even in large amounts of text. Second, when class labels are treated as though they cannot be decomposed, every class label must be modeled independently, even those containing overlapping words (“American jazz musician”, “French jazz musician”). As a result, the number of meaning representations to be learned is exponential in the length of the class label, and quickly becomes intractable. Thus, compositional models of taxonomic relations are necessary for better language understanding.

We introduce a compositional approach for reasoning about fine-grained class labels. Our approach is based on the notion from formal semantics in which modifiers (“1950s”) correspond to properties that differentiate instances of a subclass (“1950s musicians”) from instances of the superclass (“musicians”) (Heim and Kratzer, 1998). Our method consists of two stages: interpreting each modifier relative to the head (“musicians active during 1950s”), and using the interpretations to identify instances of the class from text (Figure 1). Our main contributions are: 1) a compositional method for IsA extraction, which involves a novel application of noun-phrase paraphrasing methods to the task of semantic taxonomy induction, and 2) the operationalization of a formal semantics framework to address two aspects of semantics that are often kept separate in NLP: assigning intrinsic “meaning” to a phrase, and reasoning about that phrase in a truth-theoretic context.

2 Related Work

Noun Phrase Interpretation. Compound noun phrases (“jazz musician”) communicate implicit semantic relations between modifiers and the head. Many efforts to provide semantic interpretations of such phrases rely on matching the compound to pre-defined patterns or semantic ontologies (Fares et al., 2015; Ó Séaghdha and Copestake, 2007; Tratz and Hovy, 2010; Surtani and Paul, 2015; Choi et al., 2015). Recently, interpretations may take the form of arbitrary natural language predicates (Hendrickx et al., 2013). Most approaches are supervised, comparing unseen noun compounds to the most similar phrase seen in training (Wijaya and Gianfortoni, 2011; Nulty and Costello, 2013; Van de Cruys et al., 2013). Other unsupervised approaches apply information extraction techniques to paraphrase noun compounds (Kim and Nakov, 2011; Xavier and Strube de Lima, 2014; Paşca, 2015). They focus exclusively on providing good paraphrases for an input noun compound. To our knowledge, ours is the first attempt to use these interpretations for the downstream task of IsA relation extraction.

IsA Relation Extraction. Most efforts to acquire taxonomic relations from text build on the seminal work of Hearst (1992), which observes that certain textual patterns, e.g., “X and other Y”, are high-precision indicators of whether X is a member of class Y. Recent work focuses on learning such patterns automatically from corpora (Snow et al., 2006; Shwartz et al., 2016). These IsA extraction techniques provide a key step for the more general task of knowledge base population. The “universal schema” approach (Riedel et al., 2013; Kirschnick et al., 2016; Verga et al., 2017), which infers relations using matrix factorization, often includes Hearst patterns as input features. Graphical (Bansal et al., 2014) and joint inference models (Movshovitz-Attias and Cohen, 2015) typically require Hearst patterns to define an inventory of possible classes. A separate line of work avoids Hearst patterns by instead exploiting semi-structured data from HTML markup (Wang and Cohen, 2009; Dalvi et al., 2012; Pasupat and Liang, 2014). These approaches all share the limitation that, in practice, in order for a class to be populated with instances, the entire class label has to have been observed verbatim in text. This requirement limits the ability to handle arbitrarily fine-grained classes. Our work addresses this limitation by modeling fine-grained class labels compositionally. Thus the proposed method can combine evidence from multiple sentences, and can perform IsA extraction without requiring any example instances of a given class.¹

Taxonomy Construction. Previous work on the construction of a taxonomy of IsA relations (Flati et al., 2014; de Melo and Weikum, 2010; Kozareva and Hovy, 2010; Ponzetto and Strube, 2007; Ponzetto and Navigli, 2009) considers that task to be different from extracting a flat set of IsA relations from text. Challenges specific to taxonomy construction include overall concept positioning and how to discover whether concepts are unrelated, subordinated or parallel to each other (Kozareva and Hovy, 2010); the need to refine and enrich the taxonomy (Flati et al., 2014); the difficulty in adding relevant IsA relations towards the top of the taxonomy (Ponzetto and Navigli, 2009); and eliminating cycles and inconsistencies (Ponzetto and Navigli, 2009; Kozareva and Hovy, 2010). For practical purposes, these challenges are irrelevant when extracting flat IsA relations. Whereas Flati et al. (2014); Bizer et al. (2009); de Melo and Weikum (2010); Nastase and Strube (2013); Ponzetto and Strube (2007); Ponzetto and Navigli (2009); Hoffart et al. (2013) rely on data within human-curated resources, our work operates over unstructured text. Resources constructed in Bizer et al. (2009); Nastase and Strube (2013); Hoffart et al. (2013) contain not just a taxonomy of IsA relations, but also relation types other than IsA.

∗ Contributed during an internship at Google.
¹ Pasupat and Liang (2014) also focuses on zero-shot IsA extraction, but exploits HTML document structure, rather than reasoning compositionally.
3 Modifiers as Functions

Formalization. In formal semantics, modification is modeled as function application. Specifically, let MH be a class label consisting of a head H, which we assume to be a common noun, preceded by a modifier M. We use ⟦·⟧ to represent the “interpretation function” that maps a linguistic expression to its denotation in the world. The interpretation of a common noun is the set of entities² in the universe U that are denoted by the noun (Heim and Kratzer, 1998):

  ⟦H⟧ = {e ∈ U | e is a H}    (1)

The interpretation of a modifier M is a function that maps between sets of entities. That is, modifiers select a subset of the input set:³

  ⟦M⟧(H) = {e ∈ H | e satisfies M}    (2)

This formalization leaves open how one decides whether or not “e satisfies M”. This is nontrivial, as the meaning of a modifier can vary depending on the class it is modifying: if e is a “good student”, e is not necessarily a “good person”, making it difficult to model whether “e satisfies good” in general. We therefore reframe the above equation, so that the decision of whether “e satisfies M” is made by a binary function φM, parameterized by the class H within which e is being considered:

  ⟦M⟧(H) = {e ∈ H | φM(H, e)}    (3)

Conceptually, φM captures the core “meaning” of the modifier M, which is the set of properties that differentiate members of the output class MH from members of the more general input class H. This formal semantics framework has two important consequences. First, the modifier has an intrinsic “meaning”. The properties entailed by the modifier are independent of the particular state of the world. This makes it possible to make inferences about “1950s musician” even if no 1950s musicians have been observed. Second, the modifier is a function that can be applied in a truth-theoretic setting. That is, applying “1950s” to the set of “musicians” returns exactly the set of “1950s musicians”.

Computational Approaches. While the notion of modifiers as functions has been incorporated into computational models previously, prior work focuses on either assigning an intrinsic meaning to M or on operationalizing M in a truth-theoretic sense, but not on doing both simultaneously. For example, Young et al. (2014) focuses exclusively on the subset selection aspect of modification. That is, given a set of instances H and a modifier M, their method could return the subset MH. However, their method does not model the meaning of the modifier itself, so that, e.g., if there were no red cars in their model of the world, the phrase “red cars” would have no meaning. In contrast, Baroni and Zamparelli (2010) models the meaning of modifiers explicitly as functions that map between vector-space representations of nouns. However, their model focuses on similarity between class labels, e.g., to say that “important routes” is similar to “major roads”, and it is not obvious how the method could be operationalized in order to identify instances of those classes. A contribution of our work is to model the semantics of M intrinsically, but in a way that permits application in the model-theoretic setting. We learn an explicit model of the “meaning” of a modifier M relative to a head H, represented as a distribution over properties that differentiate the members of the class MH from those of the class H. We then use this representation to identify the subset of instances of H which constitute the subclass MH.

² We use “entities” and “instances” interchangeably; “entities” is standard terminology in linguistics.
³ As does virtually all previous work in information extraction, we assume that modifiers are subsective, acknowledging the limitations (Kamp and Partee, 1995).

4 Learning Modifier Interpretations

4.1 Setup

For each modifier M, we would like to learn the function φM from Eq. 3. Doing so makes it possible, given H and an instance e ∈ H, to decide whether e has the properties required to be an instance of MH. In general, there is no systematic way to determine the implied relation between M and H, as modifiers can arguably express any semantic relation, given the right context (Weiskopf, 2007). We therefore model the semantic relation between M and H as a distribution over properties that could potentially define the subclass MH ⊆ H. We will refer to this distribution as a “property profile” for M relative to H. We make the assumption that relations between M and H that are discussed more often are more likely to capture the important properties of the subclass MH. This assumption is not perfect (Section 4.4) but has given good results for paraphrasing noun phrases (Nakov and Hearst, 2013; Paşca, 2015). Our method for learning property profiles is based on the unsupervised method proposed by Paşca (2015), which uses query logs as a source of common sense knowledge, and rewrites noun compounds by matching MH (“American musicians”) to queries of the form “H (.*) M” (“musicians from America”).

4.2 Inputs

We assume two inputs: 1) an IsA repository, O, containing ⟨e, C⟩ tuples where C is a category and e is an instance of C, and 2) a fact repository, D, containing ⟨s, p, o, w⟩ tuples where s and o are noun phrases, p is a predicate, and w is a confidence that p expresses a true relation between s and o. Both O and D are extracted from a sample of around 1 billion Web documents in English. The supplementary material gives additional details.

We instantiate O with an IsA repository constructed by applying Hearst patterns to the Web documents. Instances are represented as automatically-disambiguated entity mentions⁴ which, when possible, are resolved to Wikipedia pages. Classes are represented as (non-disambiguated) natural language strings. We instantiate D with a large repository of facts extracted using in-house implementations of ReVerb (Fader et al., 2011) and OLLIE (Mausam et al., 2012). The predicates are extracted as natural language strings. Subjects and objects may be either disambiguated entity references or natural language strings. Every tuple is included in both the forward and the reverse direction. E.g. ⟨jazz, perform at, venue⟩ also appears as ⟨venue, ←perform at, jazz⟩, where ← is a special character signifying inverted predicates. These inverted predicates simplify the following definitions. In total, O contains 1.1M tuples and D contains 30M tuples.

4.3 Building Property Profiles

Properties. Let I be a function that takes as input a noun phrase MH and returns a property profile for M relative to H. We define a “property” to be an SPO tuple in which the subject position⁵ is a wildcard, e.g. ⟨∗, born in, America⟩. Any instance that fills the wildcard slot then “has” the property. We expand adjectival modifiers to encompass nominalized forms using a nominalization dictionary extracted from WordNet (Miller, 1995). If MH is “American musician” and we require a tuple to have the form ⟨H, p, M, w⟩, we will include tuples in which the third element is either “American” or “America”.

Relating M to H Directly. We first build property profiles by taking the predicate and object from any tuple in D in which the subject is the head and the object is the modifier:

  I1(MH) = {⟨⟨p, M⟩, w⟩ | ⟨H, p, M, w⟩ ∈ D}    (4)

Relating M to an Instance of H. We also consider an extension in which, rather than requiring the subject to be the class label H, we require the subject to be an instance of H:

  I2(MH) = {⟨⟨p, M⟩, w⟩ | ⟨e, H⟩ ∈ O ∧ ⟨e, p, M, w⟩ ∈ D}    (5)

Modifier Expansion. In practice, when building property profiles, we do not require that the object of the fact tuple match the modifier exactly, as suggested in Eq. 4 and 5. Instead, we follow Paşca (2015) and take advantage of facts involving distributionally similar modifiers. Specifically, rather than looking only at tuples in D in which the object matches M, we consider all tuples, but discount the weight proportionally to the similarity between M and the object of the tuple.

⁴ “Entity mentions” may be individuals, like “Barack Obama”, but may also be concepts like “jazz”.
⁵ Inverse predicates capture properties in which the wildcard is conceptually the object of the relation, but occupies the subject slot in the tuple. For example, ⟨venue, ←perform at, jazz⟩ captures that a “jazz venue” is a “venue” e such that “jazz performed at e”.
Good property profiles:
  rice dish:           * serve with rice · * include rice · * consist of rice
  French violinist:    * live in France · * born in France · * speak French
  Led Zeppelin song:   Led Zeppelin write * · Led Zeppelin play * · Led Zeppelin have *
  still life painter:  * known for still life · * paint still life · still life be by *

Bad property profiles:
  child actor:   * have child · * expect child · * play child
  risk manager:  * take risk · * be at risk · * be aware of risk

Table 1: Example property profiles learned by observing predicates that relate instances of class H to modifier M (I2). Results are similar when using the class label H directly (I1). We spell out inverted predicates (Section 4.2) so wildcards (*) may appear as subjects or objects.

Thus, I1 is computed as below:

  I1(MH) = {⟨⟨p, M⟩, w × sim(M, N)⟩ | ⟨H, p, N, w⟩ ∈ D}    (6)

where sim(M, N) is the cosine similarity between M and N. I2 is computed analogously. We compute sim using a vector space built from Web documents following Lin and Wu (2009); Pantel et al. (2009). We retain the 100 most similar phrases for each of ∼10M phrases, and consider all other similarities to be 0.

  Class Label        Property Profile
  American company   * based in America
  American composer  * born in America
  American novel     * written in America
  jazz album         * features jazz
  jazz composer      * writes jazz
  jazz venue         jazz performed at *

Table 2: Head-specific property profiles learned by relating instances of H to the modifier M (I2). Results are similar using I1.
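The similarity-discounted variant of I1 in Eq. 6 can be sketched as follows. This is a toy illustration: the `SIM` table stands in for the truncated cosine-similarity space described above, and both it and the mini fact repository are invented for the example.

```python
# Eq. 6: every fact about the head contributes to the profile, weighted
# by the similarity between its object and the modifier. Pairs absent
# from SIM are treated as similarity 0, mirroring the top-100 truncation.
SIM = {("America", "America"): 1.0, ("America", "USA"): 0.8}

D = {
    ("musician", "born in", "America"): 2.0,
    ("musician", "raised in", "USA"): 1.0,
    ("musician", "play", "jazz"): 3.0,
}

def I1_expanded(modifier, head):
    profile = {}
    for (s, p, o), w in D.items():
        if s != head:
            continue
        sim = SIM.get((modifier, o), 0.0)
        if sim > 0:
            key = (p, modifier)
            profile[key] = profile.get(key, 0.0) + w * sim
    return profile

print(I1_expanded("America", "musician"))
# {('born in', 'America'): 2.0, ('raised in', 'America'): 0.8}
```

Note that the fact about "USA" still contributes a (discounted) property for the modifier "America", while the dissimilar object "jazz" contributes nothing; this is how evidence from near-synonymous modifiers is pooled.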

4.4 Analysis of Property Profiles

Table 1 provides examples of good and bad property profiles for several MHs. In general, frequent relations between M and H capture relevant properties of MH, but this is not always the case. To illustrate, the most frequently discussed relation between “child” and “actor” is that actors have children, but this property is not indicative of the meaning of “child actor”. Qualitatively, the top-ranked interpretations learned by using the head noun directly (I1, Eq. 4) are very similar to those learned using instances of the head (I2, Eq. 5). However, I2 returns many more properties (10 on average per MH) than I1 (just over 1 on average). Anecdotally, we see that I2 captures more specific relations than does I1. For example, for “jazz musicians”, both methods return “* write jazz” and “* compose jazz”, but I2 additionally returns properties like “* be major creative influence in jazz”. We compare I1 and I2 quantitatively in Section 6. Importantly, we do see that both I1 and I2 are capable of learning head-specific property profiles for a modifier. Table 2 provides examples.

5 Class-Instance Identification

Instance finding. After finding properties that relate a modifier to a head, we turn to the task of identifying instances of fine-grained classes. That is, for a given modifier M, we want to instantiate the function φM from Eq. 3. In practice, rather than being a binary function that decides whether or not e is in class MH, our instantiation, φ̂M, will return a real-valued score expressing the confidence that e is a member of MH. For notational convenience, let D(⟨s, p, o⟩) = w if ⟨s, p, o, w⟩ ∈ D, and 0 otherwise. We define φ̂M as follows:

  φ̂M(H, e) = Σ_{⟨⟨p,o⟩,ω⟩ ∈ I(MH)} ω × D(⟨e, p, o⟩)    (7)

Applying M to H, then, is as in Eq. 3 except that instead of a discrete set, it returns a scored list of candidate instances:

  ⟦M⟧(H) = {⟨e, φ̂M(H, e)⟩ | ⟨e, H⟩ ∈ O}    (8)

Ultimately, we need to identify instances of arbitrary class labels, which may contain multiple modifiers. Given a class label C = M1...MkH that contains a head H preceded by modifiers M1...Mk, we generate a list of candidate instances by finding all instances of H that have some property to support every modifier:

  ∩_{i=1}^{k} {⟨e, s(e)⟩ | ⟨e, w⟩ ∈ ⟦Mi⟧(H) ∧ w > 0}    (9)

where s(e) is the mean⁶ of the scores assigned by each separate φ̂Mi. From here on, we use Mods to refer to our method that generates lists of instances for a class using Eqs. 8 and 9. When φ̂M (Eq. 7) is implemented using I1, we use the name ModsH (for “heads”). When it is implemented using I2, we use the name ModsI (for “instances”).

Weakly Supervised Reranking. Eq. 8 uses a naive ranking in which the weight for e ∈ MH is the product of how often e has been observed with some property and the weight of that property for the class MH. Thus, instances of H with overall higher counts in D receive high weights for every MH. We therefore train a simple logistic regression model to predict the likelihood that e belongs to MH. We use a small set of features⁷, including the raw weight as computed in Eq. 7. For training, we sample ⟨e, C⟩ pairs from our IsA repository O as positive examples and random pairs that were not extracted by any Hearst pattern as negative examples. We frame the task as a binary prediction of whether e ∈ C, and use the model's confidence as the value of φ̂M in place of the function in Eq. 7.

6 Evaluation

6.1 Experimental Setup

Evaluation Sets. We evaluate our models on their ability to return correct instances for arbitrary class labels. As a source of evaluation data, we use Wikipedia category pages⁸ (e.g., http://en.wikipedia.org/wiki/Category:Pakistani_film_actresses). These are pages in which the title is the name of the category (“pakistani film actresses”) and the body is a manually curated list of links to other pages that fall under the category. We measure the precision and recall of each method for discovering the instances listed on these pages given the page title (henceforth “class label”).

We collect the titles of all Wikipedia category pages, removing those in which the last word is capitalized or which contain fewer than three words. These heuristics are intended to retain compositional titles in which the head is a single common noun. We also remove any titles that contain links to sub-categories. This is to favor fine-grained classes (“pakistani film actresses”) over coarse-grained ones (“film actresses”). We perform heuristic modifier chunking in order to group together multiword modifiers (e.g., “puerto rican”); for details, see supplementary material. From the resulting list of class labels, we draw two samples of 100 labels each, enforcing that no H appear as the head of more than three class labels per sample. The first sample is chosen uniformly at random (denoted UniformSet). The second (WeightedSet) is weighted so that the probability of drawing M1...MkH is proportional to the total number of class labels in which H appears as the head. These different evaluation sets are intended to evaluate performance on the head versus the tail of the class label distribution, since information retrieval methods often perform differently on different parts of the distribution. On average, there are 17 instances per category in UniformSet and 19 in WeightedSet. Table 3 gives examples of class labels.

Evaluation Set: Examples of Class Labels

UniformSet: 2008 california wildfires · australian army chaplains · australian boy bands · canadian military nurses · canberra urban places · cellular automaton rules · chinese rice dishes · coldplay concert tours · daniel libeskind designs · economic stimulus programs · german film critics · invasive amphibian species · latin political phrases · log flume rides · malayalam short stories · pakistani film actresses · puerto rican sculptors · string theory books

WeightedSet: ancient greek physicists · art deco sculptors · audio engineering schools · ballet training methods · bally pinball machines · british rhythmic gymnasts · calgary flames owners · canadian rock climbers · canon l-series lenses · emi classics artists · free password managers · georgetown university publications · grapefruit league venues · liz claiborne subsidiaries · miss usa 2000 delegates · … illustrators · russian art critics

Table 3: Examples of class labels from evaluation sets.

Baselines. We implement two baselines using our IsA repository (O as defined in Section 4.1). Our simplest baseline ignores modifiers altogether, and simply assumes that any instance of H is an instance of MH, regardless of M. In this case the confidence value for ⟨e, MH⟩ is equivalent to that for ⟨e, H⟩. We refer to this baseline simply as Baseline. Our second, stronger baseline uses the IsA repository directly to identify instances of the fine-grained class C = M1...MkH. That is, we consider e to be an instance of the class if ⟨e, C⟩ ∈ O, meaning the entire class label appeared in a source sentence matching some Hearst pattern. We refer to this baseline as Hearst. The weight used to rank the candidate instances is the confidence value assigned by the Hearst pattern extraction (Section 4.2).

Compositional Models. As a baseline compositional model, we augment the Hearst baseline via set intersection. Specifically, for a class C = M1...MkH, if each of the MiH appears in O independently, we take the instances of C to be the intersection of the instances of each of the MiH. We assign the weight of an instance e to be the sum of the weights associated with each independent modifier. We refer to this method as Hearst∩. It is roughly equivalent to Paşca (2014). We contrast it with our proposed model, which recognizes instances of a fine-grained class by 1) assigning a meaning to each modifier in the form of a property profile and 2) checking whether a candidate instance exhibits these properties. We refer to the versions of our method as ModsH and ModsI, as described in Section 5. When relevant, we use “raw” to refer to the version in which instances are ranked using raw weights and “RR” to refer to the version in which instances are ranked using logistic regression (Section 5). We also try using the proposed methods to extend rather than replace the Hearst baseline. We combine predictions by merging the ranked lists produced by each system: i.e. the score of an instance is the inverse of the sum of its ranks in each of the input lists. If an instance does not appear at all in an input list, its rank in that list is set to a large constant value. We refer to these combination systems as Hearst+ModsH and Hearst+ModsI.

⁶ We also tried minimum, but mean gave better results.
⁷ Feature templates in supplementary material.
⁸ Available at http://www.seas.upenn.edu/~nlp/resources/finegrained-class-eval.gz

6.2 Results

Precision and Coverage. We first compare the methods in terms of their coverage, the number of class labels for which the method is able to find some instance, and their precision, to what extent the method is able to correctly rank true instances of the class above non-instances. We report total coverage, the number of labels for which the method returns any instance, and correct coverage, the number of labels for which the method returns a correct instance. For precision, we compute the average precision (AP) for each class label. AP ranges from 0 to 1, where 1 indicates that all positive instances were ranked above all negative instances. We report mean average precision (MAP), which is the mean of the APs across all the class labels. MAP is only computed over class labels for which the method returns something, meaning methods are not punished for returning empty lists.

               UniformSet          WeightedSet
               Coverage   MAP      Coverage   MAP
  Baseline     95 / 70    0.01     98 / 74    0.01
  Hearst        9 /  9    0.63      8 /  8    0.80
  Hearst∩      13 / 12    0.62      9 /  9    0.80
  ModsH raw    56 / 32    0.23     50 / 30    0.16
  ModsH RR     56 / 32    0.29     50 / 30    0.25
  ModsI raw    62 / 36    0.18     59 / 38    0.20
  ModsI RR     62 / 36    0.24     59 / 38    0.23

Table 5: Coverage and precision for populating Wikipedia category pages with instances. “Coverage” is the number of class labels (out of 100) for which at least one instance was returned, followed by the number for which at least one correct instance was returned. “MAP” is mean average precision. MAP does not punish methods for returning empty lists, thus favoring the baseline (see Figure 2).

Flemish still life painters: Clara Peeters · Willem Kalf · Jan Davidsz de Heem · Pieter Claesz · Peter Paul Rubens · Frans Snyders · Jan Brueghel the Elder · Hans Memling · Pieter Bruegel the Elder · Caravaggio · Abraham Brueghel

Pakistani captains: … · Shahid … · Younis … · Abdul Hafeez Kardar · … · Sarfraz Ahmed

Thai buddhist temples: Wat Buddhapadipa · Wat Chayamangkalaram · Wat Mongkolratanaram · Angkor Wat · Preah Vihear Temple · Wat Phra Kaew · Wat Rong Khun · Wat Mahathat Yuwaratrangsarit · Vat Phou · Tiger Temple · Sanctuary of Truth · Wat Chalong · Swayambhunath · Mahabodhi Temple · Tiger Cave Temple · Harmandir Sahib

Table 4: Instances extracted for several fine-grained classes from Wikipedia. Lists shown are from ModsI. Instances in italics were also returned by Hearst∩. Strikethrough denotes incorrect.

Table 4 gives examples of instances returned for several class labels and Table 5 shows the precision and coverage for each of the methods. Figure 2 illustrates how the single mean AP score (as reported in Table 5) can misrepresent the relative precision of different methods. In combination, Table 5 and Figure 2 demonstrate that the proposed methods extract instances about as well as the baseline, whenever the baseline can extract anything at all; i.e. the proposed method does not cause a precision drop on classes covered by the baseline. In addition, there are many classes for which the baseline is not able to extract any instances, but the proposed method is.
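The average-precision metric behind Table 5 can be made concrete with a short sketch. This uses one standard definition of AP (precision averaged at the ranks of the gold instances, normalized by the number of gold instances); the ranked list and gold set below are toy examples.

```python
def average_precision(ranked, gold):
    """AP of a ranked list against a gold set of true instances."""
    hits, total = 0, 0.0
    for i, e in enumerate(ranked, start=1):
        if e in gold:
            hits += 1
            total += hits / i  # precision at this rank
    return total / len(gold) if gold else 0.0

ranked = ["Clara Peeters", "Caravaggio", "Willem Kalf"]
gold = {"Clara Peeters", "Willem Kalf"}
print(average_precision(ranked, gold))  # (1/1 + 2/3) / 2 ≈ 0.833
```

MAP is then simply the mean of these AP values over the class labels for which a method returned a non-empty list, which is why a method that answers only its easiest classes (such as Hearst) can post a high MAP despite very low coverage.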

None of the methods can extract some of the gold instances, such as “Dictator perpetuo” and “Furor Teutonicus” of the gold class “latin political phrases”.

Table 5 also reveals that the reranking model (RR) consistently increases MAP for the proposed methods. Therefore, going forward, we only report results using the reranking model (i.e. ModsH and ModsI will refer to ModsH RR and ModsI RR, respectively).

Figure 2: Distribution of AP over 100 class labels in WeightedSet. The proposed method (red) and the baseline method (blue) achieve high AP for the same number of classes, but ModsI additionally finds instances for classes for which the baseline returns nothing.

Manual Re-Annotation. It is possible that true instances of a class are missing from our Wikipedia reference set, and thus that our precision scores underestimate the actual precision of the systems. We therefore manually verify the top 10 predictions of each of the systems for a random sample of 25 class labels. We choose class labels for which Hearst was able to return at least one instance, in order to ensure reliable precision estimates. For each of these labels, we manually check the top 10 instances proposed by each method to determine whether each belongs to the class. Table 6 shows the precision scores for each method computed against the original Wikipedia list of instances and against our manually-augmented list of gold instances. The overall ordering of the systems does not change, but the precision scores increase notably after re-annotation. We continue to evaluate against the Wikipedia lists, but acknowledge that reported precision is likely an underestimate of true precision.

                 Wikipedia   Gold
  Hearst         0.56        0.79
  Hearst∩        0.53        0.78
  ModsH          0.23        0.39
  ModsI          0.24        0.42
  Hearst+ModsH   0.43        0.63
  Hearst+ModsI   0.43        0.63

Table 6: P@10 before vs. after re-annotation; Wikipedia underestimates true precision.
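The area under the ROC curve used in the precision-recall analysis that follows can be computed directly from a method's scores via the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen true instance outscores a randomly chosen non-instance, with ties counted as half. The candidate scores below are invented; candidates a method did not return are scored 0, mirroring the evaluation setup.

```python
def auc(scores, gold):
    """ROC AUC from {candidate: score} and a gold set, via pairwise wins."""
    pos = [s for c, s in scores.items() if c in gold]
    neg = [s for c, s in scores.items() if c not in gold]
    if not pos or not neg:
        return 0.5  # degenerate: no ranking information
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = {"Wat Chalong": 2.0, "Angkor Wat": 0.7,
          "Tiger Temple": 0.0, "Harmandir Sahib": 0.0}
gold = {"Wat Chalong", "Angkor Wat"}
print(auc(scores, gold))  # 1.0: every true instance outscores every non-instance
```

Scoring unreturned candidates as 0 is exactly what penalizes the low-recall Hearst baseline here: its true instances tie with masses of non-instances at score 0, dragging AUC toward 0.5.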

Figure 3: ROC curves for selected methods (Hearst in blue, proposed in red). Given a ranked list of instances, ROC curves plot true positives vs. false positives retained by setting various cutoffs. The curve becomes linear once all remaining instances have the same score (e.g., 0), as this makes it impossible to add true positives without also including all remaining false positives.
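The P@10 comparison underlying Table 6 can be sketched as follows. This is a minimal illustration, not the paper's code: the class instances, the ranked predictions, and both reference sets below are invented for the example.

```python
def precision_at_k(ranked, reference, k=10):
    """Fraction of the top-k ranked instances found in the reference set."""
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(1 for inst in top_k if inst in reference) / len(top_k)

# Hypothetical ranked predictions for one class label.
predicted = ["charles mingus", "george russell", "miles davis", "bob smith"]

# Wikipedia reference list vs. the manually-augmented gold list: re-annotation
# only ever adds true instances missing from the category page, so measured
# precision can only increase.
wikipedia = {"charles mingus", "miles davis"}
gold = wikipedia | {"george russell"}

p_wiki = precision_at_k(predicted, wikipedia, k=10)
p_gold = precision_at_k(predicted, gold, k=10)
assert p_gold >= p_wiki  # re-annotation never lowers precision
print(p_wiki, p_gold)  # → 0.5 0.75
```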

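The ranking protocol just described can be sketched as follows, with hypothetical candidates and scores; AUC is computed here via the Mann-Whitney formulation (probability that a random positive outranks a random negative, ties counting one half) rather than by integrating the ROC curve.

```python
def auc(scores, positives):
    """AUC as the Mann-Whitney statistic over positive/negative score pairs."""
    pos = [s for cand, s in scores.items() if cand in positives]
    neg = [s for cand, s in scores.items() if cand not in positives]
    if not pos or not neg:
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Union of candidates proposed by any method (incl. the head-only Baseline),
# and the gold instances; all names are invented for the example.
candidates = ["mingus", "russell", "davis", "smith", "jones"]
gold = {"mingus", "russell"}

# A sparse method proposes only one candidate; every unproposed candidate is
# scored as 0, which flattens the tail of its ROC curve (cf. Figure 3).
proposed = {"mingus": 0.9}
scores = {c: proposed.get(c, 0.0) for c in candidates}
print(auc(scores, gold))  # → 0.75
```

Zero-filling is what makes low recall costly: a method that misses a gold instance leaves it tied with every unscored false positive, capping its AUC.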
Table 7 reports the AUC and recall. Figure 3 plots the full ROC curves. The requirement by Hearst that class labels appear in full in a single sentence results in very low recall, which translates into very low AUC when considering the full set of candidate instances. In comparison, the proposed compositional methods make use of a larger set of sentences, and provide non-zero scores for many more candidates, resulting in a >10 point increase in AUC on both UniformSet and WeightedSet (Table 7).

                 UniformSet       WeightedSet
                 AUC    Recall    AUC    Recall
Baseline         0.55   0.23      0.53   0.28
Hearst           0.56   0.03      0.52   0.02
Hearst∩          0.57   0.04      0.53   0.02
ModsH            0.68   0.08      0.60   0.06
ModsI            0.71   0.09      0.65   0.09
Hearst∩+ModsH    0.70   0.09      0.61   0.08
Hearst∩+ModsI    0.73   0.10      0.66   0.10

Table 7: Recall of instances on Wikipedia category pages, measured against the full set of instances from all pages in sample. AUC captures tradeoff between true and false positives.

7 Conclusion

We have presented an approach to IsA extraction that takes advantage of the compositionality of natural language. Existing approaches often treat class labels as atomic units that must be observed in full in order to be populated with instances. As a result, current methods are not able to handle the infinite number of classes describable in natural language, most of which never appear in text. Our method reasons about each modifier in the label individually, in terms of the properties that it implies about the instances. This approach allows us to harness information that is spread across multiple sentences, significantly increasing the number of fine-grained classes that we are able to populate.

Acknowledgments

The paper incorporates suggestions on an earlier version from Susanne Riehemann. Ryan Doherty offered support in refining and accessing the fact repository used in the evaluation.

References

M. Bansal, D. Burkett, G. de Melo, and D. Klein. 2014. Structured learning for taxonomy induction with belief propagation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL-14). Baltimore, Maryland, pages 1041–1051.

M. Baroni and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP-10). Cambridge, Massachusetts, pages 1183–1193.

C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. 2009. DBpedia - a crystallization point for the Web of data. Journal of Web Semantics 7(3):154–165.

H. Kamp and B. Partee. 1995. Prototype theory and compositionality. Cognition 57(2):129–191.

E. Choi, T. Kwiatkowski, and L. Zettlemoyer. 2015. Scalable semantic parsing with partial ontologies. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-15). Beijing, China, pages 1311–1320.

B. Dalvi, W. Cohen, and J. Callan. 2012. Websets: Extracting sets of entities from the Web using unsupervised information extraction. In Proceedings of the 5th ACM Conference on Web Search and Data Mining (WSDM-12). Seattle, Washington, pages 243–252.

G. de Melo and G. Weikum. 2010. MENTA: Inducing multilingual taxonomies from Wikipedia. In Proceedings of the 19th International Conference on Information and Knowledge Management (CIKM-10). Toronto, Canada, pages 1099–1108.

A. Fader, S. Soderland, and O. Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11). Edinburgh, Scotland, pages 1535–1545.

M. Fares, S. Oepen, and E. Velldal. 2015. Identifying compounds: On the role of syntax. In International Workshop on Treebanks and Linguistic Theories (TLT-14). Warsaw, Poland, pages 273–283.

T. Flati, D. Vannella, T. Pasini, and R. Navigli. 2014. Two is bigger (and better) than one: the Wikipedia Bitaxonomy project. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL-14). Baltimore, Maryland, pages 945–955.

M. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics (COLING-92). pages 539–545.

I. Heim and A. Kratzer. 1998. Semantics in Generative Grammar, volume 13. Blackwell Oxford.

I. Hendrickx, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Szpakowicz, and T. Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13). pages 138–143.

J. Hoffart, F. Suchanek, K. Berberich, and G. Weikum. 2013. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence Journal. Special Issue on Artificial Intelligence, Wikipedia and Semi-Structured Resources 194:28–61.

N. Kim and P. Nakov. 2011. Large-scale noun compound interpretation using bootstrapping and the Web as a corpus. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-11). Edinburgh, Scotland, pages 648–658.

J. Kirschnick, H. Hemsen, and V. Markl. 2016. Jedi: Joint entity and relation detection using type inference. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16) - System Demonstrations. Berlin, Germany, pages 61–66.

Z. Kozareva and E. Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP-10). Cambridge, Massachusetts, pages 1110–1118.

D. Lin and X. Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP-09). Singapore, pages 1030–1038.

Mausam, M. Schmitz, S. Soderland, R. Bart, and O. Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-12). Jeju Island, Korea, pages 523–534.

G. Miller. 1995. WordNet: a lexical database. Communications of the ACM 38(11):39–41.

D. Movshovitz-Attias and W. Cohen. 2015. KB-LDA: Jointly learning a knowledge base of hierarchy, relations, and facts. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP-15). Beijing, China, pages 1449–1459.

P. Nakov and M. Hearst. 2013. Semantic interpretation of noun compounds using verbal and other paraphrases. ACM Transactions on Speech and Language Processing 10(3):1–51.

V. Nastase and M. Strube. 2013. Transforming Wikipedia into a large scale multilingual concept network. Artificial Intelligence 194:62–85.

P. Nulty and F. Costello. 2013. General and specific paraphrases of semantic relations between nouns. Natural Language Engineering 19(03):357–384.

D. Ó Séaghdha and A. Copestake. 2007. Co-occurrence contexts for noun compound interpretation. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions. Prague, Czech Republic, pages 57–64.

M. Paşca. 2014. Acquisition of open-domain classes via intersective semantics. In Proceedings of the 23rd World Wide Web Conference (WWW-14). Seoul, Korea, pages 551–562.

M. Paşca. 2015. Interpreting compound noun phrases using web search queries. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-15). Denver, Colorado, pages 335–344.

P. Pantel, E. Crestan, A. Borkovsky, A. Popescu, and V. Vyas. 2009. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-09). Singapore, pages 938–947.

P. Pasupat and P. Liang. 2014. Zero-shot entity extraction from Web pages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL-14). Baltimore, Maryland, pages 391–401.

S. Ponzetto and R. Navigli. 2009. Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09). Pasadena, California, pages 2083–2088.

S. Ponzetto and M. Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI-07). Vancouver, British Columbia, pages 1440–1447.

S. Riedel, L. Yao, A. McCallum, and B. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Association for Computational Linguistics (NAACL-HLT-13). Atlanta, Georgia, pages 74–84.

V. Shwartz, Y. Goldberg, and I. Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16). Berlin, Germany, pages 2389–2398.

R. Snow, D. Jurafsky, and A. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06). Sydney, Australia, pages 801–808.

N. Surtani and S. Paul. 2015. A VSM-based statistical model for the semantic relation interpretation of noun-modifier pairs. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-15). pages 636–645.

S. Tratz and E. Hovy. 2010. A taxonomy, dataset, and classifier for automatic noun compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10). Uppsala, Sweden, pages 678–687.

T. Van de Cruys, S. Afantenos, and P. Muller. 2013. MELODI: A supervised distributional approach for free paraphrasing of noun compounds. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval-13). Atlanta, Georgia, pages 144–147.

P. Verga, A. Neelakantan, and A. McCallum. 2017. Generalizing to unseen entities and entity pairs with row-less universal schema. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-17). Valencia, Spain, pages 613–622.

R. Wang and W. Cohen. 2009. Automatic set instance extraction using the Web. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP-09). Singapore, pages 441–449.

D. Weiskopf. 2007. Compound nominals, context, and compositionality. Synthese 156(1):161–204.

D. Wijaya and P. Gianfortoni. 2011. "Nut case: What does it mean?": Understanding semantic relationship between nouns in noun compounds through paraphrasing and ranking the paraphrases. In Proceedings of the 1st International Workshop on Search and Mining Entity-Relationship Data (SMER-11). Glasgow, United Kingdom, pages 9–14.

C. Xavier and V. Strube de Lima. 2014. Boosting open information extraction with noun-based relations. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-14). Reykjavik, Iceland, pages 96–100.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL) 2:67–78.