Set Expansion Using Sibling Relations Between Semantic Categories
Total Page:16
File Type:pdf, Size:1020Kb
Set Expansion using Sibling Relations between Semantic Categories Sho Takase Naoaki Okazaki Kentaro Inui Graduate School of Information Sciences Tohoku University, Japan takase, okazaki, inui @ecei.tohoku.ac.jp { } Abstract and Pennacchiotti, 2006). A bootstrapping algo- rithm iteratively acquires new instances of the tar- Most set expansion algorithms assume to get category using seed instances. First, a bootstrap- acquire new instances of different seman- ping algorithm mines phrasal patterns that co-occur tic categories independently even when we have seed instances of multiple semantic cat- frequently with seed instances in a corpus. Given egories. However, in the setting of set ex- the words “Prius” and “Lexus” as seed instances of pansion with multiple semantic categories, we the car category, the algorithm finds patterns such might leverage other types of prior knowl- as “Toyota produce X” and “X is a hybrid car” (X edge about semantic categories. In this paper, is a variable filled with a noun phrase). Next, the we present a method of set expansion when bootstrapping algorithm acquires instances that co- ontological information related to target se- occur with patterns, i.e., noun phrases that appear mantic categories is available. More specifi- cally, the proposed method makes use of sib- frequently in the variable slots in the patterns. For ling relations between semantic categories as example, the pattern “Toyota produce X” might re- an additional type of prior knowledge. We trieve vehicles manufactured by Toyota. Bootstrap- demonstrate the effectiveness of sibling rela- ping algorithms repeat these steps, expanding pat- tions in set expansion on the dataset in which terns using newly acquired instances. instances and sibling relations are extracted However, bootstrapping algorithms often suffer from Wikipedia in a semi-automatic manner. from patterns that retrieve instances not only of the target category but also of other categories. 1 Introduction For example, given the seed instances “Prius” and Set expansion is the task of expanding a list of “Lexus”, a bootstrapping algorithm might choose named entities from a few named entities (seed in- the pattern “new type of X”, which might extract un- stances). For example, given a few instances of related instances such as “iPhone” and “ThinkPad”. car vehicles “Prius”, “Lexus” and “Insight”, the The semantic drift problem (Curran et al., 2007), the task outputs new car instances such as “Corolla”, phenomenon by which a bootstrapping algorithm “Civic”, and “Fit”. Set expansion has many ap- deviates from the target category, has persisted as plications in NLP including named entity recogni- the major impediment of bootstrapping algorithms. tion (Collins and Singer, 1999), word sense disam- Bootstrapping algorithms assume prior knowl- biguation (Pantel and Lin, 2002), document cate- edge about a semantic category in the form of seed gorization (Pantel et al., 2009), and query sugges- instances. Recently, researchers have been more tion (Cao et al., 2008). interested in self-supervised learning of every se- Set expansion is often implemented as bootstrap- mantic category in the world from massive text ping algorithms (Hearst, 1992; Yarowsky, 1995; Ab- corpora, as exemplified by the Machine Reading ney, 2004; Pantel and Ravichandran, 2004; Pantel project (Oren et al., 2006). In the setting of set ex- 525 Copyright 2012 by Sho Takase, Naoaki Okazaki, and Kentaro Inui 26th Pacific Asia Conference on Language,Information and Computation pages 525–534 pansion with multiple semantic categories, we might 1 pmi(i, p) rι(i)= rπ(p), (2) leverage prior knowledge of other types that were P max pmi | | p P unexplored in previous studies. For example, a per- ∈ son cannot belong to both actor and actress cate- i, p pmi(i, p)=log | | . (3) gories simultaneously. Additionally, we know that 2 i, ,p two distinct categories of car and motorcycle prod- | ∗||∗ | ucts share similar properties (e.g., vehicle, gasoline- In these equations, P and I are sets of patterns and instances of each category. P and I are the num- powered, overland), but have some crucial differ- | | | | bers of patterns and instances in the sets. i, and ences (e.g., with two or four wheels, with or without | ∗| ,p are the frequencies of instance i and pattern windows). |∗ | p in a given corpus. i, p presents the frequency by In this paper, we present a method of set expan- | | which instance i co-occurs with pattern p. Also, max sion when ontological information related to target pmi is the maximum of pmi values in all instances semantic categories is available. More specifically, and patterns. the proposed method makes use of sibling relations between semantic categories as an additional type of First, the Espresso algorithm extracts patterns that prior knowledge. We demonstrate the effectiveness co-occur with seed instances. Next, the algorithm of sibling relations on the dataset (seed and test in- ranks the patterns based on their score calculated us- stances) extracted from Wikipedia. ing equation (1) and acquires the top N patterns. The This paper is organized as follows. Section 2 re- more a pattern co-occurs with reliable instances, the views the Espresso algorithm as the baseline algo- higher the score the pattern obtains. In this way, the rithm of this study. The section also describes the algorithm acquires patterns that correspond to the problem of semantic drift and previous approaches target semantic category. to the problem. Section 3 presents the proposed 2.2 Semantic Drift method, which uses sibling relations of semantic categories as an additional source of prior knowl- The major obstacle of bootstrapping algorithms is edge. Section 4 demonstrates the effectiveness of semantic drift (Curran et al., 2007). Semantic drift the proposed method and discusses the experimen- is the phenomenon by which a bootstrapping algo- tal results. In section 5, we conclude this paper. rithm deviates from the target categories. For ex- ample, one can consider the car category, which in- 2 Related Work cludes “Prius” and “Lexus” as seed instances. The Espresso algorithm might extract patterns that co- 2.1 Espresso algorithm occur with many categories such as “new type of Pantel and Pennacchiotti (2006) proposed the X” and “performance of X” after some iterations. Espresso algorithm, which fundamentally iterates These generic patterns might gather unrelated in- two steps: candidate extraction and ranking. In can- stances such as “iPhone” and “ThinkPad”. These didate extraction, the algorithm collects patterns that instances obscure the characteristics of the target se- are co-occurring with seed instances and instances mantic category. Therefore, the patterns extracted in acquired in the previous iteration. The algorithm the next iteration might not represent at the semantic also finds candidates of new instances using patterns category of the seed instances. extracted in the previous iteration. Semantic drift is also caused by polysemous In the ranking step, the algorithm finds the top N words. For example, to expand the set of motor vehi- candidates of patterns and instances based on their cle manufacturers using seed instances “Saturn” and scores. The espresso algorithm defines score rπ(p) “Subaru”, a bootstrapping algorithm might find in- for candidate pattern p and score rι(i) for the candi- stances representing the star category (e.g., “Jupiter” date instance i as and “Uranus”). This is because “Saturn” and “Sub- aru” are polysemous words, belonging not only to 1 pmi(i, p) r (p)= r (i), (1) motor vehicle manufacture but also to astronomical π I max pmi ι i I objects: planets and stars. | | ∈ 526 2.3 Approaches to semantic drift or patterns appearing in multiple categories are am- biguous. Therefore, they are likely to cause seman- Many researchers have presented various ap- tic drift. By removing ambiguous instances and pat- proaches to reduce the effects of semantic drift. terns, Curran et al. (2007) achieved high precision. The approaches range from refinement of the seed Carlson et al. (2010) proposed the Coupled Pat- set (Vyas et al., 2009), applying classifier (Bellare et tern Learner (CPL) algorithm which also uses mu- al., 2007; Sadamitsu et al., 2011; Pennacchiotti and tual exclusiveness. The CPL algorithm acquires en- Pantel, 2011), using human judges (Vyas and Pantel, tity instances (e.g., instances of the car category) and 2009), to using relationships between semantic cat- relation instances (e.g., CEO-of-Company (Larry egories (Curran et al., 2007; Carlson et al., 2010). Page, Google) and Company-acquired-Company Vyas et al. (2009) investigated the influence of (Google, Youtube)) simultaneously. To acquire seed instances on bootstrapping algorithms. They those instances, the algorithm requires knowledge reported that seed instances selected by human who about exclusiveness constraint between categories are not specialists sometimes yield worse results and links between categories (e.g., an instance of than those selected randomly. They proposed a the CEO category must be CEO of some instance of method that refines seed sets generated by humans the company category). However the algorithm uses to improve the set expansion performance. only the exclusiveness constraint as prior knowledge Bellare et al. (2007) proposed a method using a related to multiple semantic categories. classifier instead of scoring functions in the ranking Curran et al. (2007) and Carlson et al. (2010) use step of bootstrapping algorithms. The classifier ap- not only seed instances but also exclusiveness con- proach can use multiple features to select instances. straint between semantic categories as a prior knowl- Sadamitsu et al. (2011) extended the method of Bel- edge. However, we have more prior knowledge lare et al. (2007) to use topic information estimated about semantic categories at hand. For example, we using Latent Dirichlet Allocation (LDA). They use can obtain ontological information between seman- not only contexts but also topic information as fea- tic categories easily from existing resources such as tures of the classifier. Pennacchiotti and Pantel Wikipedia.