TaxoPhrase: Exploring Knowledge Base via Joint Learning of Taxonomy and Topical Phrases
Weijing Huang
Key Laboratory of High Confidence Software Technologies (Ministry of Education), EECS, Peking University
[email protected]

Wei Chen∗
Key Laboratory of High Confidence Software Technologies (Ministry of Education), EECS, Peking University
[email protected]

Tengjiao Wang
Key Laboratory of High Confidence Software Technologies (Ministry of Education), EECS, Peking University
[email protected]

Shibo Tao
SPCCTA, School of Electronics and Computer Engineering, Peking University
[email protected]

ABSTRACT
Knowledge bases store many facts about the world, but their size makes it hard to get a quick overview of the stored knowledge. Exploiting the taxonomy structure and the phrases in the content of entities, this paper proposes TaxoPhrase, an exploratory tool for knowledge bases. TaxoPhrase (1) is a novel Markov Random Field based topic model that learns the taxonomy structure and topical phrases jointly; (2) extracts topics over subcategories, entities, and phrases, and presents the extracted topics as overview information for a given category in the knowledge base. Experiments on the example categories Mathematics, Chemistry, and Argentina in the English Wikipedia demonstrate that TaxoPhrase is an effective tool for exploring the knowledge base.

KEYWORDS
Knowledge Base, Exploratory Tool, Topical Phrases, Taxonomy Structure, Topic Model, Markov Random Field

1 INTRODUCTION
Knowledge bases[5][3][10][13] are carefully constructed to store information representing facts about the world. Because knowledge bases are large, an exploratory tool is needed to give a quick overview of them. For example, Wikipedia contains more than 5.3 million articles (March 2017)¹, far more than a human can read. A suitable exploratory tool helps users of a knowledge base gain an overall perspective on the stored knowledge.

We attempt to achieve this purpose by answering the following three questions: (1) Q1, what are the main subtopics related to a given topic in the knowledge base; (2) Q2, what are the related entities for each subtopic; (3) Q3, what are the key summaries corresponding to these subtopics. As many knowledge bases, such as YAGO[10] and DBpedia[13], are constructed from Wikipedia, we answer the above three questions on Wikipedia without loss of generality. For this reason, the terms page and entity are used interchangeably in this paper.

Since categories in the knowledge base group entities into similar subjects and are further organized hierarchically into the taxonomy, it seems straightforward to use the taxonomy to answer Q1 and Q2. But the taxonomy is too large to provide a quick overview of the whole knowledge base. Taking the English Wikipedia's taxonomy as an example, there are about 1.5 million category nodes, and the distribution of these categories' degrees is unbalanced, as shown in Figure 1. Some categories contain very many subcategories, such as Albums_by_artist, whose out-degree is 17,749, while most categories have very few, such as Politics_of_Morelos, which contains only one subcategory, Morelos_elections. These characteristics indicate that the taxonomy structure is a large scale-free network[1]. So it is not easy to answer Q1 and Q2 directly from the taxonomy structure alone.

[Figure 1 plot: Freq against Outdegree on logarithmic axes; labeled categories: Politics_of_Morelos, Mathematics, Musicians_by_band, Albums_by_artist.]
Figure 1: Distribution of categories' out-degrees to subcategories, according to the study on the English Wikipedia.

Besides, the topic model[2], especially its extensions to phrases such as [6][9], was developed for exploratory analysis of text corpora and is suitable for answering Q3. Usually the most frequent words or phrases in topics are used to summarize the corpus[8]. However, the meanings of the learned topics need to be interpreted manually[4], which may limit the usability of existing methods on Q3.

In this paper, we propose a novel exploratory tool, TaxoPhrase, which jointly learns the taxonomy structure and topical phrases in the knowledge base and makes Q1, Q2, and Q3 tractable in a unified framework. The joint learning algorithm of TaxoPhrase is inspired by the complementary relation among the three parts of the knowledge base: the categories in the taxonomy, the entities, and the phrases in the entities' contents.

∗ Corresponding author.
¹ https://stats.wikimedia.org/EN/TablesWikipediaEN.htm

OKBQA 2017, Tokyo, Japan, Weijing Huang, Wei Chen, Tengjiao Wang, and Shibo Tao

2 RELATED WORKS
To the best of our knowledge, there is little work on providing an exploratory tool for the knowledge base. The work closest to our motivation is Holloway's analysis and visualization of the semantic coverage of Wikipedia[16], which visualizes the category network in two dimensions using the layout algorithm DrL (formerly VxOrd)[14].
However, the layout does not directly provide overview information about the knowledge base. The other works related to our approach fall into two groups: taxonomy related and topical phrase related.

[Figure: illustration of the model TaxoPhrase. Category labels shown: Mathematics, Fields_of_mathematics, Applied_mathematics, Theoretical_computer_science, Theory_of_computation, Mathematics_of_computing, Formal_languages, Formal_methods, Computer_arithmetic, Formal_theories, Logic_in_computer_science, Mathematical_logic, Binary_arithmetic, Formal_systems, Logical_calculi, Classical_logic, Boolean_algebra, Systems_of_formal_logic, Propositional_calculus. Example pages shown: Propositional calculus, which belongs to the categories Logical_calculi, Classical_logic, Propositional_calculus, Systems_of_formal_logic, and Boolean_algebra and contains phrases such as "propositional calculus", "propositional logic", and "logical connectives"; and Zeroth-order logic, which belongs to the categories Propositional_calculus and Systems_of_formal_logic.]

The taxonomy related works mainly focus on how to use the taxonomy of the knowledge base to enhance the quality of text mining tasks, such as Twixonomy[7], LGSA[11], and TransDetector[12]. The phrase related works extend the topic model to phrase level,