Learning from Negative Examples in Set-Expansion
Total Page:16
File Type:pdf, Size:1020Kb
Learning from Negative Examples in Set-Expansion Prateek Jindal Dan Roth Dept. of Computer Science Dept. of Computer Science UIUC UIUC Urbana, IL, USA Urbana, IL, USA [email protected] [email protected] FEMALE TENNIS PLAYERS Abstract—This paper addresses the task of set-expansion State-of-the-art This Paper on free text. Set-expansion has been viewed as a problem of generating an extensive list of instances of a concept of Monica Seles Mary Pierce Steffi Graf Monica Seles interest, given a few examples of the concept as input. Our key Martina Hingis Martina Hingis contribution is that we show that the concept definition can be Mary Pierce Lindsay Davenport significantly improved by specifying some negative examples Lindsay Davenport Steffi Graf in the input, along with the positive examples. The state-of-the Jennifer Capriati Jennifer Capriati art centroid-based approach to set-expansion doesn’t readily Kim Clijsters Kim Clijsters admit the negative examples. We develop an inference-based Mary Joe Fernandez Karina Habsudova approach to set-expansion which naturally allows for negative Nathalie Tauziat Sandrine Testud examples and show that it performs significantly better than a Kimiko Date Kimiko Date strong baseline. Conchita Martinez Chanda Rubin Anke Huber Anke Huber Judith Wiesner Nathalie Tauziat Andre Agassi Jana Novotna I. INTRODUCTION Pete Sampras Conchita Martinez This paper addresses the task of set-expansion on free text. Jana Novotna Nathalie Dechy Karina Habsudova Amanda Coetzer Set-expansion has been viewed as a problem of generating Jim Courier Barbara Paulus an extensive list of instances of a concept of interest, given Justine Henin Arantxa Sanchez-Vicario a few examples of the concept as input. For example, if the Julie Halard Amy Frazier { } Meredith McGrath Iva Majoli seed-set is Steffi Graf, Martina Hingis, Serena Williams , Goran Ivanisevic Magdalena Maleeva the system should output an extensive list of female tennis Jelena Dokic Jelena Dokic players. Michael Chang Julie Halard We focus on set-expansion from free text, as opposed to web-based approaches that build on existing lists. The key Table I: This table compares the state-of-the-art approach for set- expansion on free text with the approach presented in this paper. The bold method used for set-expansion from free text is distributional and italicized entries correspond to male tennis players and are erroneous. similarity. For example, the state-of-the-art systems [1], [2] The addition of only 1 negative example to the seed-set improves the list- use a centroid-based approach wherein they first find the quality significantly. The second column contains no errors. centroid of the entities in the seed-set and then find the entities that are similar to the centroid. Most of the work on set-expansion has focussed on taking only positive examples. and italicized. We see that the output in the 1 column is For example, as discussed above, to produce a list of female corrupted by male tennis players. Adding only 1 negative tennis players, a few names of female tennis players are example to the seed-set improves the list-quality signifi- given as input to the system. However, just specifying a few cantly. The second column contains no errors. In this paper, female tennis players doesn’t define the concept precisely we propose ways to learn from negative examples in set- enough. The set-expansion systems tend to output some male expansion and show significant improvement. tennis players along with female tennis players. Specifying We present an inference-based approach to set-expansion a few names of male tennis players as negative examples which doesn’t rely on computing the centroid. The new defines the concept more precisely. approach naturally allows for both positive and negative Table I compares the state-of-the-art approach for set- examples in the seed-set. We also extended the centroid- expansion on free text with the approach presented in this based approach so that it can accept negative examples. paper. The table shows only a small portion of the lists We used this improved version of centroid-based approach generated by the system. We used 7 positive examples for as a baseline system and show in the experiments that both approaches, and only 1 negative example was used the inference-based approach we developed significantly for the proposed approach. The errors have been underlined outperforms this baseline. II. RELATED WORK similarity between and all other entities in the entity set The task of set-expansion has been addressed in several . We then sort all the entities in based on this similarity works. We briefly discuss some of the most significant ef- score in decreasing order. The resulting ranked list has the forts towards this task. Google Sets and Boowa [3] are web- property that entities with lower rank are more similar to based set-expansion methods. For set-expansion on free- than entities with higher rank. We call this list the set of text ([4], [5], [1], [2]), pattern recognition and distributional neighbors of , denoted as NBRLIST( ). similarity have primarily been used. Some works on set- In the centroid-based approach, first of all, centroid ( )is expansion ([6], [7], [8]) have focussed on integrating infor- computed by averaging the frequency vectors of entities in mation across several types of sources such as structured, the seed-set ( ) and then computing the discounted PMI semi-structured, unstructured text, query logs etc. of the resulting frequency vector. Next, NBRLIST of the There has also been some work on the use of negative centroid is computed as described above and the system examples in set-expansion. Thelen and Riloff [9] and Lin outputs the first members of NBRLIST. et al. [10] present a framework to simultaneously learn IV. LEARNING FROM NEGATIVE EXAMPLES IN several semantic classes. In this framework, instances which CENTROID-BASED APPROACH have been accepted by one semantic class serve as negative The centroid-based approach to set-expansion doesn’t examples for all other semantic classes. This approach is easily allow learning from negative examples. In this section, limited because it necessitates the simultaneous learning of we present a novel framework which allows the incorpora- several semantic classes. Moreover, negative examples are tion of negative examples in a centroid-based approach. not useful if the semantic classes addressed are not related The active features of any entity are those features which to one another. Lin et al. note that it is not easy to acquire have non-zero frequency. The active features of the centroid good negative examples. The approach presented here, on are the union of the active features of the entities in the the other hand, allows the use of negative examples even seed-set. The active features of the centroid are not equally when there is only one semantic class. important. To incorporate this knowledge into set-expansion, The focus of this work is on set-expansion from free text. we associate a weight term with each entry in the vocabulary. Thus, we do not compare our system with systems which use Higher weight would mean that a particular word is more textual sources other than free text (e.g. semi-structured web relevant to the underlying concept. By incorporating these pages or query logs). The works of Sarmento et al. [1] and weights into the cosine similarity metric, the new formula to Pantel et al. [2] are the state-of-the-art works that are most compute the similarity between entities and becomes: related to our approach and, therefore, in our experiments, ∑ we compare their centroid-based approach with the approach = √∑ √∑ (1) developed here. 2 2 III. CENTROID-BASED APPROACH TO SET-EXPANSION ℎ A. Feature Vector Generation where is the weight associated with the word and dpmi refers to the discounted PMI values. We wish to learn a The input to our set-expansion system consists of free text. weight vector such that the similarity between the positive To extract the relevant entities from the text, we preprocess examples and the centroid becomes more than a prespecified the corpus with a state-of-the-art Named Entity Recognition threshold . Moreover, we want that the similarity between 1 tool developed by Ratinov and Roth [11]. Our experiments negative examples and the centroid should become less than were done only for entities of type PER, and we denote a prespecified threshold . We accomplish this objective the set of all distinct entities recognized in the corpus by ℎ using the following linear program: . represents the entity. The features of an entity are composed of the words appearing in a window of ∑ size centered on each mention of the entity .Weuse ∈ discounted PMI [12] to measure the association of a feature ∑ (2) with the entity. Table II gives some of the features for two − different entities as generated from the corpus. The numbers ∈ ∑ along with the features indicate the absolute frequency of the s.t. ≤ num of non-zero entries in centroid feature. (3) B. List Generation ≥ ∀ ∈ (4) We compute the similarity between any two entities using ≤ ∀ ∈ (5) the cosine coefficient. Given an entity , we compute the ≥ 0 ∀ (6) ≤ ∀ 1http://cogcomp.cs.illinois.edu/page/software (7) Entity Examples of Feature Vectors [President, 24912], [administration, 790], [House, 766], [visit, 761], [talks, 742], [announced, 737], [summit, 703], Bill Clinton [White, 684], [Republican, 541], [WASHINGTON, 508], [Congress, 490], [Democratic, 318], [budget, 243], [veto, 230], [government, 219], [election, 192], [political, 182], [Hillary, 149] [USA, 323], [World, 254], [champion, 226], [number-one, 191], [defending, 124], [final, 115], [American, 112], Pete Sampras [pts, 99], [beat, 86], [round, 81], [tennis, 73], [singles, 65], [Wimbledon, 62], [seeded, 40], [lost, 39], [semi-final, 38], [Grand, 36], [Slam, 36], [tournament, 34], [top-seed, 32], [Tennis, 5] Table II: This table shows some of the features for two different entities.