Unsupervised Template Mining for Semantic Category Understanding

Lei Shi 1,2 ∗, Shuming Shi 3, Chin-Yew Lin 3, Yi-Dong Shen 1, Yong Rui 3
1 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Microsoft Research
{shilei, ydshen}@ios.ac.cn   {shumings, cyl, yongrui}@microsoft.com
∗ This work was performed when the first author was visiting Microsoft Research Asia.

Abstract

We propose an unsupervised approach to constructing templates from a large collection of semantic category names, and use the templates as the semantic representation of categories. The main challenge is that many terms have multiple meanings, resulting in a lot of wrong templates. Statistical data and semantic knowledge are extracted from a web corpus to improve template generation. A nonlinear scoring function is proposed and demonstrated to be effective. Experiments show that our approach achieves significantly better results than baseline methods. As an immediate application, we apply the extracted templates to the cleaning of a category collection and see promising results (precision improved from 81% to 89%).

1 Introduction

A semantic category is a collection of items sharing common semantic properties. For example, all cities in Germany form a semantic category named "city in Germany" or "German city". In Wikipedia, the category names of an entity are manually edited and displayed at the end of the page for the entity. There have been quite a lot of approaches (Hearst, 1992; Pantel and Ravichandran, 2004; Van Durme and Pasca, 2008; Zhang et al., 2011) in the literature to automatically extract category names and instances (also called is-a or hypernymy relations) from the web.

Most existing work simply treats a category name as a text string containing one or multiple words, without caring about its internal structure. In this paper, we explore the semantic structure of category names (or simply called "categories"). For example, both "CEO of General Motors" and "CEO of Yahoo" have the structure "CEO of [company]". We call such a structure a category template. Taking a large collection of open-domain categories as input, we construct a list of category templates and build a mapping from categories to templates. Figure 1 shows some example semantic categories and their corresponding templates.

Templates can be treated as additional features of semantic categories. The new features can be exploited to improve some upper-layer applications like web search and question answering. In addition, by linking categories to templates, it is possible (for a computer program) to infer the semantic meaning of the categories. For example in Figure 1, from the two templates linking to category "symptom of insulin deficiency", it is reasonable to interpret the category as: "a symptom of a medical condition called insulin deficiency which is about the deficiency of one type of hormone called insulin." In this way, our knowledge about a category can go beyond a simple string and its member entities. An immediate application of templates is removing invalid category names from a noisy category collection. Promising results are observed for this application in our experiments.

[Figure 1: Examples of semantic categories and their corresponding templates. Semantic categories (left) map to category templates (right): "national holiday of Brazil" (instances: Carnival, Christmas, ...) and "national holiday of South Africa" (instances: Heritage Day, Christmas, ...) → "national holiday of [country]"; "school in Denver" and "school in Houston" → "school in [place]", "school in [city]"; "football player" and "basketball player" → "[sport] player"; "symptom of insulin deficiency" (instances: nocturia, weight loss, ...) and "symptom of cortisol deficiency" (instances: low blood sugar, ...) → "symptom of [medical condition]", "symptom of [hormone] deficiency".]

An intuitive approach to this task (i.e., extracting templates from a collection of category names) contains two stages: category labeling and template scoring.

Category labeling: Divide a category name into multiple segments and replace some key segments with their hypernyms. As an example, assume "CEO of Delphinus" is divided into three segments "CEO + of + Delphinus", and the last segment (Delphinus) has hypernyms "constellation", "company", etc. By replacing this segment with its hypernyms, we get candidate templates "CEO of [constellation]" (a wrong template), "CEO of [company]", and the like.

Template scoring: Compute the score of each candidate template by aggregating the information obtained in the first phase.

A major challenge here is that many segments (like "Delphinus" in the above example) have multiple meanings. As a result, wrong hypernyms may be adopted to generate incorrect candidate templates (like "CEO of [constellation]"). In this paper, we focus on improving the template scoring stage, with the goal of assigning lower scores to bad templates and higher scores to high-quality ones.

There have been some research efforts (Third, 2012; Fernandez-Breis et al., 2010; Quesada-Martínez et al., 2012) on exploring the structure of category names by building patterns. However, we automatically assign semantic types to the pattern variables (also called arguments) while they do not. For example, our templates have the form "city in [country]" while their patterns are like "city in [X]". More details are given in the related work section.

A similar task is query understanding, including query tagging and query template mining. Query tagging (Li et al., 2009; Reisinger and Pasca, 2011) corresponds to the category labeling stage described above. It is different from template generation because the results are for one query only, without merging the information of all queries to generate the final templates. Category template construction is also slightly different from query template construction. First, some useful features such as query click-through are not available in category template construction. Second, categories should be valid natural language phrases, while queries need not be. For example, "city Germany" is a query but not a valid category name. We discuss this in more detail in the related work section.

Our major contributions are as follows.
1) To the best of our knowledge, this is the first work of template generation specifically for categories in an unsupervised manner.
2) We extract semantic knowledge and statistical information from a web corpus for improving template generation. Significant performance improvement is obtained in our experiments.
3) We study the characteristics of the scoring function from the viewpoint of probabilistic evidence combination and demonstrate that nonlinear functions are more effective in this task.
4) We employ the output templates to clean our category collection mined from the web, and get apparent quality improvement (precision improved from 81% to 89%).

After discussing related work in Section 2, we define the problem and describe one baseline approach in Section 3. Then we introduce our approach in Section 4. Experimental results are reported and analyzed in Section 5. We conclude the paper in Section 6.

2 Related work

Several kinds of work are related to ours.

Hypernymy relation extraction: Hypernymy relation extraction is an important task in text mining. There have been a lot of efforts (Hearst, 1992; Pantel and Ravichandran, 2004; Van Durme and Pasca, 2008; Zhang et al., 2011) in the literature to extract hypernymy (or is-a) relations from the web. Our target here is not hypernymy extraction, but discovering the semantic structure of hypernyms (or category names).

Category name exploration: Category name patterns are explored and built in some existing research work. Third (2012) proposed to find axiom patterns among category names in an existing ontology, for example, inferring the axiom pattern "SubClassOf(AB, B)" from "SubClassOf(junior school, school)" and "SubClassOf(domestic mammal, mammal)". Fernandez-Breis et al. (2010) and Quesada-Martínez et al. (2012) proposed to find lexical patterns in category names to define axioms (in the medical domain). One example pattern mentioned in their papers is "[X] binding". They need manual intervention to determine what X means. The main difference between the above work and ours is that we automatically assign semantic types to the pattern variables (also called arguments) while they do not.

Template mining for IE: Some research work in information extraction (IE) involves patterns. Yangarber (2003) and Stevenson and Greenwood (2005) proposed to learn patterns in the form of [subject, verb, object]. The category names and learned templates in our work are not in this form. Another difference between our work and theirs is that their methods need a supervised name classifier to generate the candidate patterns, while our approach is unsupervised. Chambers and Jurafsky (2011) leverage templates to describe an event, while the templates in our work are for understanding category names (a kind of short text).

Query tagging/labeling: Some research work in recent years focuses on segmenting web search queries and assigning semantic tags to key segments. Li et al. (2009) and Li (2010) employed CRF (Conditional Random Field) or semi-CRF models for query tagging. A crowdsourcing-assisted method was proposed by Han et al. (2013) for query structure interpretation. These supervised or semi-supervised approaches require much manual annotation effort. Unsupervised methods were proposed by Sarkas et al. (2010) and Reisinger and Pasca (2011). As discussed in the introduction section, query tagging is only one of the two stages of template generation. The tagging results are for one query only, without aggregating the global information of all queries to generate the final templates.

Query template construction: Some existing work leveraged query templates or patterns for query understanding. A semi-supervised random-walk-based method was proposed by Agarwal et al. (2010) to generate a ranked template list relevant to a domain of interest. A predefined domain schema and seed information are needed for this method. Pandey and Punera (2012) proposed an unsupervised method based on graphical models to mine query templates. The above methods are either domain-specific (i.e., generating templates for a specific domain) or have some degree of supervision (supervised or semi-supervised). Cheung and Li (2012) proposed an unsupervised method to generate query templates with the aid of knowledge bases. An approach was proposed in (Szpektor et al., 2011) to improve query recommendation via query templates; query session information (which is not available in our task) is needed in this approach for template generation. Li et al. (2013) proposed a clustering algorithm to group existing query templates by the search intents of users.

Compared to the open-domain unsupervised methods for query template construction, our approach improves on two aspects. First, we propose to incorporate multiple types of semantic knowledge (e.g., term peer similarity and term clusters) to improve template generation. Second, we propose a nonlinear template scoring function which is demonstrated to be more effective.

3 Problem Definition and Analysis

3.1 Problem definition

The goal of this paper is to construct a list of category templates from a collection of open-domain category names.

Input: The input is a collection of category names, which can either be manually compiled (like Wikipedia categories) or automatically extracted. The categories used in our experiments were automatically mined from the web, following existing work (Hearst, 1992; Pantel and Ravichandran, 2004; Snow et al., 2005; Talukdar et al., 2008; Zhang et al., 2011). Specifically, we applied Hearst patterns (e.g., "NP [,] (such as | including) {NP,}∗ {and | or} NP") and is-a patterns ("NP (is | are | was | were | being) (a | an | the) NP") to a large corpus containing 3 billion English web pages. As a result, we obtained a term→hypernym bipartite graph containing 40 million terms, 74 million hypernyms (i.e., category names), and 321 million edges (e.g., one example edge is "Berlin" → "city in Germany", where "Berlin" is a term and "city in Germany" is the corresponding hypernym). Then all the multi-word hypernyms are used as the input category collection.

Output: The output is a list of templates, each having a score indicating how likely it is to be valid. A template is a multi-word string with one headword and at least one argument. For example, in template "national holiday of [country]", "holiday" is the headword, and "[country]" is the argument. We only consider one-argument templates in this paper, and the case of multiple arguments is left as future work. A template is valid if it is syntactically and semantically correct. "CEO of [constellation]" (wrongly generated from "CEO of Delphinus", "CEO of Aquila", etc.) is not valid because it is semantically unreasonable.

3.2 Baseline approach

An intuitive approach to this task contains two stages: category labeling and template scoring. Figure 2 shows its workflow with simple examples.

3.2.1 Phase-1: Category labeling

At this stage, each category name is automatically segmented and labeled, and some candidate template tuples (CTTs) are derived based on the labeling results. This can be done in the following steps.

Category segmentation: Divide each category name into multiple segments (e.g., "holiday of South Africa" to "holiday + of + South Africa"). Each segment is one word or a phrase appearing in an entity dictionary. The dictionary used in this paper consists of all Freebase (www.freebase.com) entities.

Segment to hypernym: Find hypernyms for every segment (except for the headword and some trivial segments like prepositions and articles), by referring to a term→hypernym mapping graph. Following most existing query labeling work, we derive the term→hypernym graph from a dump of Freebase. Below are some examples of Freebase types (hypernyms):

German city (id: /location/de_city)
Italian province (id: /location/it_province)
Poem character (id: /book/poem_character)
Book (id: /book/book)

To avoid generating too fine-grained templates like "mayor of [German city]" and "mayor of [Italian city]" (semantically "mayor of [city]" is more desirable), we discard type modifiers and map terms to the headwords of Freebase types. For example, "Berlin" is mapped to "city". In this way, we build our basic version of the term→hypernym mapping, which contains 16.13 million terms and 696 hypernyms. Since "South Africa" is both a country and a book name in Freebase, hypernyms "country", "book", and others are assigned to the segment "South Africa" in this step.

CTT generation: Construct CTTs by choosing one segment (called the target segment) each time and replacing the segment with its hypernyms. A CTT is formed by the candidate template (with one argument), the target segment (as an argument value), and the tuple score (indicating tuple quality). Below are example CTTs obtained after the last segment of "holiday + of + South Africa" is processed:

U1: (holiday of [country], South Africa, w1)
U2: (holiday of [book], South Africa, w2)

3.2.2 Phase-2: Template scoring

The main objective of this stage is to merge all the CTTs obtained from the previous stage and to compute a final score for each template. In this stage, the CTTs are first grouped by their first element (i.e., the template string). For example, tuples for "holiday of [country]" may include:

U1: (holiday of [country], South Africa, w1)
U2: (holiday of [country], Brazil, w2)
U3: (holiday of [country], Germany, w3)
...

Then a scoring function is employed to calculate the template score from the tuple scores. Formally, given n tuples U = (U_1, U_2, ..., U_n) for a template, the goal is to find a score fusion function F(\vec{U}) which yields large values for high-quality templates and small (or zero) values for invalid ones.

Borrowing the idea of TF-IDF from information retrieval, a reasonable scoring function is

F(\vec{U}) = \sum_{i=1}^{n} w_i \cdot IDF(h)    (1)

where h is the argument type (i.e., the hypernym of the argument value) of each tuple; TF means "term frequency" and IDF means "inverse document frequency". An IDF function assigns lower scores to common hypernyms (like person and music track, which contain a lot of entities). Let DF(h) be the number of entities having hypernym h; we test two IDF functions in our experiments:

IDF_1(h) = \log \frac{1 + N}{1 + DF(h)}, \qquad IDF_2(h) = 1 / \sqrt{DF(h)}    (2)

where N is the total number of entities in the entity dictionary.

The next problem is estimating the tuple score w_i. Please note that there is no weight or score information in the term→hypernym mapping of Freebase. So we have to set w_i to be constant in the baseline:

w_i = 1    (3)
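To make the baseline concrete, here is a minimal sketch of phase-2 in Python (an illustration only; the CTT format follows the examples above extended with the argument type, and the DF counts are made-up toy numbers):

import math
from collections import defaultdict

def idf1(df_h, n_entities):
    # IDF_1(h) = log((1 + N) / (1 + DF(h)))   -- Formula (2)
    return math.log((1 + n_entities) / (1 + df_h))

def idf2(df_h, n_entities):
    # IDF_2(h) = 1 / sqrt(DF(h))              -- Formula (2)
    return 1.0 / math.sqrt(df_h)

def score_templates(ctts, df, n_entities, idf=idf2):
    # Baseline phase-2: group CTTs by template string and apply Formula (1),
    # F(U) = sum_i w_i * IDF(h). In the baseline every tuple weight w is the
    # constant 1 (Formula 3), so a template's score reduces to its tuple
    # count times the IDF of its argument type.
    scores = defaultdict(float)
    for template, h, _value, w in ctts:
        scores[template] += w * idf(df[h], n_entities)
    return dict(scores)

ctts = [
    ("holiday of [country]", "country", "South Africa", 1.0),
    ("holiday of [book]", "book", "South Africa", 1.0),
    ("holiday of [country]", "country", "Brazil", 1.0),
]
df = {"country": 240, "book": 2_000_000}  # toy DF(h) counts
print(score_templates(ctts, df, n_entities=16_130_000))

With w_i fixed at 1, the only thing separating templates is count times IDF, which is exactly what makes the baseline vulnerable to ambiguous argument values such as "South Africa".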

[Figure 2 workflow: input category names (e.g., from Wikipedia: "holiday of Brazil", "holiday of South Africa") pass through Phase-1 (category labeling), which consults a term-hypernym mapping (Brazil → country, Brazil → book, South Africa → country, South Africa → book, ...) to produce candidate template tuples (CTTs) such as (holiday of [country], Brazil, w1), (holiday of [book], Brazil, w2), (holiday of [country], South Africa, w3), and (holiday of [book], South Africa, w4); Phase-2 (template scoring) then merges the CTTs into output category templates such as "holiday of [country]".]

Figure 2: Problem definition and baseline approach.

4 Approach: Enhancing Template Scoring

In our approach, we follow the same framework as in the above baseline approach, and focus on improving the template scoring phase (i.e., phase-2). We try three techniques. First, a better tuple score w_i is calculated in Section 4.1 by performing statistics on a large corpus. The corpus is a collection of 3 billion web pages crawled in early 2013 by ourselves; throughout this paper, we use "our web corpus" or "our corpus" to refer to this corpus. Second, a nonlinear function is adopted in Section 4.2 to replace the baseline tuple fusion function (Formula 1). Third, we extract term peer similarity and term clusters from our corpus and use them as additional semantic knowledge to refine template scores.

4.1 Enhancing tuple scoring

Let's examine the following two template tuples:

U1: (holiday of [country], South Africa, w1)
U2: (holiday of [book], South Africa, w2)

Intuitively, "South Africa" is more likely to be a country than a book when it appears in text. So for a reasonable tuple scoring formula, we should have w1 > w2.

The main idea is to automatically calculate the popularity of a hypernym given a term, by referring to a large corpus. Then, by adding the popularity information to (the edges of) the term→hypernym graph of Freebase, we obtain a weighted term→hypernym graph. The weighted graph is then employed to enhance the estimation of w_i.

For popularity calculation, we apply Hearst patterns (Hearst, 1992) and is-a patterns ("NP (is | are | was | were | being) (a | an | the) NP") to every sentence of our web corpus. For a (term, hypernym) pair, its popularity F is calculated as the number of sentences in which the term and the hypernym co-occur and also follow at least one of the patterns.

For a template tuple U_i with argument type h and argument value v, we test two ways of estimating the tuple score w_i:

w_i = \log(1 + F(v, h))    (4)

w_i = \frac{F(v, h)}{\lambda + \sum_{h_j \in H} F(v, h_j)}    (5)

where F(v, h) is the popularity of the (v, h) pair in our corpus, and H is the set of all hypernyms for v in the weighted term→hypernym graph. Parameter λ (= 1.0 in our experiments) is introduced for smoothing purposes. Note that the second formula is the (smoothed) conditional probability of hypernym h given term v.

Since it is intuitive to estimate tuple scores with their frequencies in a corpus, we treat the approach with the improved w_i as another baseline (our strong baseline).
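As a rough illustration of Formulas 4 and 5, the two estimators might look as follows (a sketch with hypothetical data structures: popularity stands for the corpus-derived counts F(v, h) on the weighted term→hypernym graph, and the counts shown are invented):

import math

def tuple_score_log(popularity, v, h):
    # Formula (4): w_i = log(1 + F(v, h))
    return math.log(1 + popularity.get((v, h), 0))

def tuple_score_prob(popularity, hypernyms_of, v, h, lam=1.0):
    # Formula (5): smoothed conditional probability of hypernym h given term v,
    # w_i = F(v, h) / (lambda + sum_j F(v, h_j))
    total = sum(popularity.get((v, hj), 0) for hj in hypernyms_of[v])
    return popularity.get((v, h), 0) / (lam + total)

# Toy counts: "South Africa" is seen far more often as a country than as a
# book in pattern matches, so w1 > w2 as desired.
popularity = {("South Africa", "country"): 9000, ("South Africa", "book"): 40}
hypernyms_of = {"South Africa": ["country", "book"]}
print(tuple_score_prob(popularity, hypernyms_of, "South Africa", "country"))  # ~0.995
print(tuple_score_prob(popularity, hypernyms_of, "South Africa", "book"))     # ~0.004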

4.2 Enhancing tuple combination function

Now we study the possibility of improving the tuple combination function (Formula 1), by examining the tuple fusion problem from the viewpoint of probabilistic evidence combination. We first demonstrate that the linear function in Formula 1 corresponds to a conditional independence assumption on the tuples. Then we propose to adopt a series of nonlinear functions for combining tuple scores.

We define the following events:
T: template T is a valid template;
\bar{T}: T is an invalid template;
E_i: the observation of tuple U_i.

Let's compute the posterior odds of event T, given two tuples U_1 and U_2. Assuming E_1 and E_2 are conditionally independent given T or \bar{T}, according to the Bayes rule we have

\frac{P(T|E_1,E_2)}{P(\bar{T}|E_1,E_2)} = \frac{P(E_1,E_2|T)}{P(E_1,E_2|\bar{T})} \cdot \frac{P(T)}{P(\bar{T})} = \frac{P(E_1|T)}{P(E_1|\bar{T})} \cdot \frac{P(E_2|T)}{P(E_2|\bar{T})} \cdot \frac{P(T)}{P(\bar{T})} = \frac{P(T|E_1) P(\bar{T})}{P(\bar{T}|E_1) P(T)} \cdot \frac{P(T|E_2) P(\bar{T})}{P(\bar{T}|E_2) P(T)} \cdot \frac{P(T)}{P(\bar{T})}    (6)

Define the log-odds-gain of T given E as

G(T|E) = \log \frac{P(T|E)}{P(\bar{T}|E)} - \log \frac{P(T)}{P(\bar{T})}    (7)

Here G means the gain of the log-odds of T after E occurs. By combining Formulas 6 and 7, we get

G(T|E_1, E_2) = G(T|E_1) + G(T|E_2)    (8)

It is easy to prove that the above conclusion holds true when n > 2, i.e.,

G(T|E_1, ..., E_n) = \sum_{i=1}^{n} G(T|E_i)    (9)

If we treat G(T|E_i) as the score of template T when only U_i is observed, and G(T|E_1, ..., E_n) as the template score after the n tuples are observed, then the above equation means that the combined template score should be the sum of w_i \cdot IDF(h), which is exactly Formula 1. Please keep in mind that Equation 9 is based on the assumption that the tuples are conditionally independent. This assumption, however, may not hold in reality. The case of conditional dependence was studied in (Zhang et al., 2011), where a group of nonlinear combination functions were proposed and achieved good performance in their task of hypernymy extraction. We choose the p-norm as our nonlinear fusion function, as below:

F(\vec{U}) = \left( \sum_{i=1}^{n} w_i^p \cdot IDF(h) \right)^{1/p} \quad (p > 1)    (10)

where p (= 2 in experiments) is a parameter.

Experiments show that the above nonlinear function performs better than the linear function of Formula 1. Let's use an example to show the intuition. Consider a good template "city of [country]" corresponding to CTTs \vec{U}_A and a wrong template "city of [book]" having tuples \vec{U}_B. Suppose |\vec{U}_A| = 200 (including most countries in the world) and |\vec{U}_B| = 1000 (considering that many place names have already been used as book names). We observe that each tuple score corresponding to "city of [country]" is larger than the tuple score corresponding to "city of [book]". For simplicity, we assume each tuple in \vec{U}_A has score 1.0 and each tuple in \vec{U}_B has score 0.2. With the linear and nonlinear (p = 2) fusion functions, we get

Linear: F(\vec{U}_A) = 200 \times 1.0 = 200, \quad F(\vec{U}_B) = 1000 \times 0.2 = 200    (11)

Nonlinear: F(\vec{U}_A) = \sqrt{200 \times 1.0^2} \approx 14.1, \quad F(\vec{U}_B) = \sqrt{1000 \times 0.2^2} \approx 6.32    (12)

In this setting the nonlinear function yields a much higher score for the good template than for the invalid one, while the linear function does not distinguish them.
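The numbers in Formulas 11 and 12 are easy to verify; the following small sketch (with IDF(h) taken as 1.0 for simplicity, an assumption made only to keep the example minimal) contrasts the linear and p-norm fusions:

def fuse_linear(weights, idf_h=1.0):
    # Formula (1): sum of w_i * IDF(h)
    return sum(w * idf_h for w in weights)

def fuse_pnorm(weights, p=2, idf_h=1.0):
    # Formula (10): (sum of w_i^p * IDF(h))^(1/p), with p > 1
    return sum((w ** p) * idf_h for w in weights) ** (1.0 / p)

u_a = [1.0] * 200   # "city of [country]": 200 tuples, each scored 1.0
u_b = [0.2] * 1000  # "city of [book]":   1000 tuples, each scored 0.2

print(fuse_linear(u_a), fuse_linear(u_b))  # 200.0 vs 200.0 -- Formula (11): a tie
print(fuse_pnorm(u_a), fuse_pnorm(u_b))    # ~14.1 vs ~6.32 -- Formula (12)

Raising each tuple score to the power p > 1 before summing rewards templates supported by many high-confidence tuples over templates supported by a long tail of weak ones.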

4.3 Refinement with term similarity and term clusters

The above techniques neglect the similarity among terms, which has a high potential to improve the template scoring process. Intuitively, for a toy set {"city in Brazil", "city in South Africa", "city in China", "city in Japan"}, since "Brazil", "South Africa", "China" and "Japan" are very similar to each other and all have a large probability of being a "country", we have more confidence that "city in [country]" is a good template. In this section, we propose to leverage term similarity information to improve the template scoring process. We start by building a large group of small and overlapping clusters from our web corpus.

4.3.1 Building term clusters

Term clusters are built in three steps.

Mining term peer similarity: Two terms are peers if they share a common hypernym and they are semantically correlated. For example, "dog" and "cat" should have a high peer similarity score. Following existing work (Hearst, 1992; Kozareva et al., 2008; Shi et al., 2010; Agirre et al., 2009; Pantel et al., 2009), we built a peer similarity graph containing about 40.5 million nodes and 1.33 billion edges.

Clustering: For each term, choose its top-30 neighbors from the peer similarity graph and run a hierarchical clustering algorithm, resulting in one or multiple clusters. Then we merge highly duplicated clusters. The algorithm is similar to the first part of CBC (Pantel and Lin, 2002), with the difference that a very high merging threshold is adopted here in order to generate small and overlapping clusters. Please note that one term may be included in many clusters.

Assigning top hypernyms: Up to two hypernyms are assigned to each term cluster by majority voting of its member terms, with the aid of the weighted term→hypernym graph of Section 4.1. To be an eligible hypernym for the cluster, it has to be the hypernym of at least 70% of the terms in the cluster. The score of each hypernym is the average of the term→hypernym weights over all the member terms.

4.3.2 Template score refinement

With term clusters at hand, we now describe the score refinement procedure for a template T having argument type h and supporting tuples \vec{U} = (U_1, U_2, ..., U_n). Denote V = {V_1, V_2, ..., V_n} to be the set of argument values for the tuples (where V_i is the argument value of U_i).

By computing the intersection of V and every term cluster, we can get a distribution of the argument values in the clusters. We find that for a good template like "holiday in [country]", we can often find at least one cluster (one of the country clusters in this example) which has hypernym h and also contains many elements of V. However, for invalid templates like "holiday of [book]", every cluster having hypernym h (= "book" here) contains only a few elements of V. Inspired by this observation, our score refinement algorithm for template T is as follows.

Step-1. Calculating supporting scores: For each term cluster C having hypernym h, compute its supporting score to T as

S(C, T) = k(C, V) \cdot w(C, h)    (13)

where k(C, V) is the number of elements shared by C and V, and w(C, h) is the hypernym score of h to C (computed in the last step of building clusters).

Step-2. Calculating the final template score: Let term cluster C^* have the maximal supporting score to T; the final template score is computed as

S(T) = F(\vec{U}) \cdot S(C^*, T)    (14)

where F(\vec{U}) is the template score before refinement.
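A compact sketch of this two-step refinement (with hypothetical structures: clusters maps a cluster id to its member-term set, and cluster_hypernym_score holds the w(C, h) values from Section 4.3.1):

def refine_template_score(f_u, h, values, clusters, cluster_hypernym_score):
    # Formulas (13)-(14): rescale the fused score F(U) of a template by the
    # best supporting cluster for its argument type h.
    #   f_u:    template score before refinement, F(U) from Formula (10)
    #   h:      argument type of the template, e.g. "country"
    #   values: set V of argument values observed in the template's tuples
    best_support = 0.0
    for cluster_id, members in clusters.items():
        w_ch = cluster_hypernym_score.get((cluster_id, h), 0.0)
        if w_ch == 0.0:
            continue  # this cluster was not assigned hypernym h
        k = len(values & members)                    # k(C, V): shared elements
        best_support = max(best_support, k * w_ch)   # Formula (13)
    return f_u * best_support                        # Formula (14)

clusters = {"c-countries": {"Brazil", "China", "Japan", "South Africa"}}
cluster_hypernym_score = {("c-countries", "country"): 0.9}
print(refine_template_score(5.0, "country", {"Brazil", "Japan", "Egypt"},
                            clusters, cluster_hypernym_score))  # 5.0 * 2 * 0.9 = 9.0

A good template like "holiday in [country]" finds a country cluster covering many of its argument values, so k(C, V) is large; for "holiday of [book]", no book cluster covers more than a few values, and the refined score collapses accordingly.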

5 Experiments

5.1 Experimental setup

5.1.1 Methods for comparison

We make a comparison among 10 methods.

SC: The method proposed in (Cheung and Li, 2012) to construct templates from queries. The method first represents a query as a matrix based on Freebase data. Then a hierarchical clustering algorithm is employed to group queries having the same structure and meaning, and an intent summarization algorithm is employed to create templates for each query group.

Base: The linear function in Formula 1 is adopted to combine the tuple scores. We use IDF_2 here because it achieves higher precision than IDF_1 in this setting.

LW: The linear function in Formula 1 is adopted to combine the tuple scores generated by Formula 4. IDF_1 is used rather than IDF_2 for better performance.

LP: The linear function in Formula 1 is adopted to combine the tuple scores generated by Formula 5. IDF_2 is used rather than IDF_1 for better performance.

NLW: The nonlinear fusion function in Formula 10 is used. Other settings are the same as LW.

NLP: The nonlinear fusion function in Formula 10 is used. Other settings are the same as LP.

LW+C, LP+C, NLW+C, NLP+C: All the settings of LW, LP, NLW, NLP respectively, with the refinement technology of Section 4.3 applied.

5.1.2 Data sets, annotation and evaluation metrics

The input category names for experiments are automatically extracted from a web corpus (Section 3.1). Two test-sets are built for evaluation from the output templates of the various methods.

Subsets: In order to conveniently compare the performance of different methods, we create 20 sub-collections (called subsets) from the whole input category collection. Each subset contains all the categories having the same headword (e.g., "symptom of insulin deficiency" and "depression symptom" are in the same subset because they share the same headword "symptom"). To choose the 20 headwords, we first sample 100 at random from the set of all headwords, then manually choose 20 for diversity. The headwords include symptom, school, food, gem, hero, weapon, model, etc. We run the 10 methods on these subsets and sort the output templates by their scores. The top-30 templates from each method on each subset are selected and mixed together for annotation.

Fullset: We run method NLP+C (which has the best performance according to our subset experiments) on the input categories and sort the output templates by their scores. Then we split the templates into 9 sections according to their ranking position. The sections are: [1, 100], (100, 1K], (1K, 10K], (10K, 100K], (100K, 120K], (120K, 140K], (140K, 160K], (160K, 180K], (180K, 200K]. Then 40 templates are randomly chosen from each section and mixed together for annotation.

The selected templates (from the subsets and the fullset) are annotated by six annotators, with each template assigned to two annotators. A template is assigned a label of "good", "fair", or "bad" by an annotator. The percentage agreement between the annotators is 80.2%, with kappa 0.624.

For the subset experiments, we adopt Precision@k (k = 10, 20, 30) to evaluate the top templates generated by each method. The scores for "good", "fair", and "bad" are 1, 0.5, and 0. The score of each template is the average annotation score over two annotators (e.g., if a template is annotated "good" by one annotator and "fair" by another, its score is (1.0 + 0.5) / 2 = 0.75). The evaluation score of a method is the average over the 20 subsets. For the fullset experiments, we report the precision for each section.

5.2 Experimental results

5.2.1 Results for subsets

The results of each method on the 20 subsets are presented in Table 1.

Table 1: Performance comparison among the methods on subsets.

Method                            P@10   P@20   P@30
Base (baseline-1)                 0.359  0.361  0.358
SC (Cheung and Li, 2012)          0.382  0.366  0.371
Weighted      LW                  0.633  0.582  0.559
(baseline-2)  LP                  0.771  0.734  0.707
Nonlinear     NLW                 0.711  0.671  0.638
              NLP                 0.818  0.791  0.765
Term cluster  LW+C                0.813  0.786  0.754
              NLW+C               0.854  0.833  0.808
              LP+C                0.818  0.788  0.778
              NLP+C               0.868  0.839  0.788

A few observations can be made. First, by comparing the performance of baseline-1 (Base) and the methods adopting the term→hypernym weight (LW and LP), we can see a big performance improvement. The bad performance of baseline-1 is mainly due to the lack of weight (or frequency) information on term→hypernym edges. The results demonstrate that edge scores are critical for generating high-quality templates. Manually built semantic resources typically lack such scores; therefore, it is very important to enhance them by deriving statistical data from a large corpus. Since it is relatively easy to have the idea of adopting a weighted term→hypernym graph, we treat LW and LP as another (stronger) baseline named baseline-2.

As the second observation, the results show that the nonlinear methods (NLP and NLW) achieve performance improvements over their linear versions (LW and LP).

Third, let's examine the methods with template scores refined by term similarity and term clusters (LW+C, NLW+C, LP+C, NLP+C). It is shown that the refine-by-cluster technology brings additional performance gains in all four settings (linear and nonlinear, two different ways of calculating tuple scores). So we can conclude that peer similarity and term clusters are quite effective in improving template generation.

Fourth, the best performance is achieved when the three techniques (i.e., term→hypernym weight, nonlinear fusion function, and refine-by-cluster) are combined together. For instance, by comparing the P@20 scores of baseline-2 and NLP+C, we see a performance improvement of 14.3% (from 0.734 to 0.839). Therefore every technique studied in this paper has its own merit in template generation.

Finally, by comparing the method SC (Cheung and Li, 2012) with the other methods, we can see that SC is slightly better than baseline-1, but has much lower performance than the others. The major reason may be that this method did not employ a weighted term→hypernym graph or term peer similarity information in template construction.

Are the performance differences between methods significant enough for us to say that one method is better than another? To answer this question, we run a paired two-tailed t-test on every pair of methods. We report the t-test results among the methods in Tables 2, 3 and 4. The meanings of the symbols in the tables are:

∼: The method on the row and the one on the column have similar performance.

>: The method on the row outperforms the method on the column, but the performance difference is not statistically significant (0.05 ≤ P < 0.1 in the two-tailed t-test).

>∗: The performance difference is statistically significant (P < 0.05 in the two-tailed t-test).

>∗∗: The performance difference is statistically highly significant (P < 0.01 in the two-tailed t-test).

Table 2: Paired t-test results on subsets.

            Base   SC     LP     NLP    LP+C
P@10 SC     ∼
     LP     >∗∗    >∗∗
     NLP    >∗∗    >∗∗    >
     LP+C   >∗∗    >∗∗    >∗∗    ∼
     NLP+C  >∗∗    >∗∗    >∗∗    >∗∗    >
P@20 SC     ∼
     LP     >∗∗    >∗∗
     NLP    >∗∗    >∗∗    >∗∗
     LP+C   >∗∗    >∗∗    >∗∗    ∼
     NLP+C  >∗∗    >∗∗    >∗∗    >∗∗    >∗∗
P@30 SC     ∼
     LP     >∗∗    >∗∗
     NLP    >∗∗    >∗∗    >∗∗
     LP+C   >∗∗    >∗∗    >∗∗    ∼
     NLP+C  >∗∗    >∗∗    >∗∗    >      ∼

Table 3: Paired t-test results on subsets.

            Base   SC     LW     NLW    LW+C
P@10 SC     ∼
     LW     >∗∗    >∗∗
     NLW    >∗∗    >∗∗    >∗
     LW+C   >∗∗    >∗∗    >∗∗    >∗∗
     NLW+C  >∗∗    >∗∗    >∗∗    >∗∗    >∗
P@20 SC     ∼
     LW     >∗∗    >∗∗
     NLW    >∗∗    >∗∗    >∗∗
     LW+C   >∗∗    >∗∗    >∗∗    >∗∗
     NLW+C  >∗∗    >∗∗    >∗∗    >∗∗    >∗∗
P@30 SC     ∼
     LW     >∗∗    >∗∗
     NLW    >∗∗    >∗∗    >∗∗
     LW+C   >∗∗    >∗∗    >∗∗    >∗∗
     NLW+C  >∗∗    >∗∗    >∗∗    >∗∗    >∗∗

Table 4: Paired t-test results on subsets.

                    P@10   P@20   P@30
LP vs. LW           >∗∗    >∗∗    >∗∗
NLP vs. NLW         >∗∗    >∗∗    >∗∗
LP+C vs. LW+C       ∼      ∼      ∼
NLP+C vs. NLW+C     ∼      ∼      ∼

5.2.2 Fullset results

As described in Section 5.1.2, for the fullset experiments we conduct a section-wise evaluation, selecting 40 templates from each of the 9 sections of the NLP+C results. The results are shown in Figure 3. It can be observed that the precision of each section decreases as the section ID increases. The results indicate the effectiveness of our approach, since it ranks good templates in the top sections and bad templates in the bottom sections. According to the section-wise precision data, we are able to determine the template score threshold for choosing different numbers of top templates in different applications.

[Figure 3: Precision by section in the fullset. The y-axis is precision (0 to 1) and the x-axis is the section ID (1 to 9).]

5.2.3 Templates for category collection cleaning

Since our input category collection is automatically constructed from the web, some wrong or invalid category names are inevitably contained in it. In this subsection, we apply our category templates to clean the category collection. The basic idea is that if a category can match a template, it is more likely to be correct. We compute a new score for every category name H as follows:

S_{new}(H) = \log(1 + S(H)) \cdot S(T^*)    (15)

where S(H) is the existing category score, determined by its frequency in the corpus, and S(T^*) is the score of template T^*, the best template (i.e., the template with the highest score) for the category.
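A minimal sketch of this re-scoring step (hypothetical inputs: category_freq_score holds S(H), and best_template_score maps each category to S(T∗), taken as 0 when no template matches):

import math

def clean_rank(category_freq_score, best_template_score):
    # Formula (15): S_new(H) = log(1 + S(H)) * S(T*).
    # Categories matching no high-quality template get S(T*) = 0
    # and sink to the bottom of the re-ranked list.
    rescored = {
        cat: math.log(1 + s) * best_template_score.get(cat, 0.0)
        for cat, s in category_freq_score.items()
    }
    return sorted(rescored, key=rescored.get, reverse=True)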

Then we re-rank the categories according to their new scores to get a re-ranked category list. We randomly sampled 150 category names from the top 2 million categories of each list (the old list and the new list) and asked annotators to judge the quality of the categories. The annotation results show that, after re-ranking, the precision increases from 0.81 to 0.89 (i.e., the percentage of invalid category names decreases from 19% to 11%).

6 Conclusion

In this paper, we studied the problem of building templates for a large collection of category names. We tested three techniques (tuple scoring by weighted term→hypernym mapping, nonlinear score fusion, and refinement by term clusters) and found that all of them are very effective, and their combination achieves the best performance. By employing the output templates to clean our category collection mined from the web, we get apparent quality improvement. Future work includes supporting multi-argument templates, disambiguating the headwords of category names, and applying our approach to general short text template mining.

Acknowledgments

We would like to thank the annotators for their efforts in annotating the templates. Thanks to the anonymous reviewers for their helpful comments and suggestions. This work is supported in part by China National 973 program 2014CB340301 and NSFC grant 61379043.

References

Ganesh Agarwal, Govind Kabra, and Kevin Chen-Chuan Chang. 2010. Towards rich query interpretation: walking back and forth for mining query templates. In Proceedings of the 19th International Conference on World Wide Web, pages 1–10. ACM.

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 976–986. Association for Computational Linguistics.

Jackie Chi Kit Cheung and Xiao Li. 2012. Sequence clustering and labeling for unsupervised query intent discovery. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 383–392. ACM.

Jesualdo Tomás Fernández-Breis, Luigi Iannone, Ignazio Palmisano, Alan L. Rector, and Robert Stevens. 2010. Enriching the gene ontology via the dissection of labels using the ontology pre-processor language. In Knowledge Engineering and Management by the Masses, pages 59–73. Springer.

Jun Han, Ju Fan, and Lizhu Zhou. 2013. Crowdsourcing-assisted query structure interpretation. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 2092–2098. AAAI Press.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING '92, pages 539–545, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zornitsa Kozareva, Ellen Riloff, and Eduard H. Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In ACL, volume 8, pages 1048–1056.

Xiao Li, Ye-Yi Wang, and Alex Acero. 2009. Extracting structured information from user queries with semi-supervised conditional random fields. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 572–579. ACM.

Yanen Li, Bo-June Paul Hsu, and ChengXiang Zhai. 2013. Unsupervised identification of synonymous query intent templates for attribute intents. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2029–2038. ACM.

Xiao Li. 2010. Understanding the semantic structure of noun phrase queries. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1337–1345. Association for Computational Linguistics.

Sandeep Pandey and Kunal Punera. 2012. Unsupervised extraction of template structure in web search queries. In Proceedings of the 21st International Conference on World Wide Web, pages 409–418. ACM.

Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 613–619. ACM.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In HLT-NAACL, volume 4, pages 321–328.

Patrick Pantel, Eric Crestan, Arkady Borkovsky, Ana-Maria Popescu, and Vishnu Vyas. 2009. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 938–947. Association for Computational Linguistics.

Manuel Quesada-Martínez, Jesualdo Tomás Fernández-Breis, and Robert Stevens. 2012. Enrichment of OWL ontologies: a method for defining axioms from labels. In Proceedings of the First International Workshop on Capturing and Refining Knowledge in the Medical Domain (K-MED 2012), Galway, Ireland, pages 1–10.

Joseph Reisinger and Marius Pasca. 2011. Fine-grained class label markup of search queries. In ACL, pages 1200–1209.

Nikos Sarkas, Stelios Paparizos, and Panayiotis Tsaparas. 2010. Structured annotations of web queries. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 771–782. ACM.

Shuming Shi, Huibin Zhang, Xiaojie Yuan, and Ji-Rong Wen. 2010. Corpus-based semantic class mining: distributional vs. pattern-based approaches. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 993–1001. Association for Computational Linguistics.

Mark Stevenson and Mark A. Greenwood. 2005. A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 379–386. Association for Computational Linguistics.

Idan Szpektor, Aristides Gionis, and Yoelle Maarek. 2011. Improving recommendation for long-tail queries via templates. In Proceedings of the 20th International Conference on World Wide Web, pages 47–56. ACM.

Allan Third. 2012. Hidden semantics: what can we learn from the names in an ontology? In Proceedings of the Seventh International Natural Language Generation Conference, pages 67–75. Association for Computational Linguistics.

Benjamin Van Durme and Marius Pasca. 2008. Finding cars, goddesses and enzymes: Parametrizable acquisition of labeled instances for open-domain information extraction. In AAAI, volume 8, pages 1243–1248.

Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1, pages 343–350. Association for Computational Linguistics.

Fan Zhang, Shuming Shi, Jing Liu, Shuqi Sun, and Chin-Yew Lin. 2011. Nonlinear evidence fusion and propagation for hyponymy relation mining. In ACL, volume 11, pages 1159–1168.
