Automatically Refining the Wikipedia Infobox Ontology
Fei Wu and Daniel S. Weld
Computer Science & Engineering Department, University of Washington, Seattle, WA, USA
[email protected], [email protected]

ABSTRACT

The combined efforts of human volunteers have recently extracted numerous facts from Wikipedia, storing them as machine-harvestable object-attribute-value triples in Wikipedia infoboxes. Machine learning systems, such as Kylin, use these infoboxes as training data, accurately extracting even more semantic knowledge from natural language text. But in order to realize the full power of this information, it must be situated in a cleanly-structured ontology. This paper introduces KOG, an autonomous system for refining Wikipedia's infobox-class ontology towards this end. We cast the problem of ontology refinement as a machine learning problem and solve it using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. We present experiments demonstrating the superiority of the joint-inference approach and evaluating other aspects of our system. Using these techniques, we build a rich ontology, integrating Wikipedia's infobox-class schemata with WordNet. We demonstrate how the resulting ontology may be used to enhance Wikipedia with improved query processing and other features.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous

Keywords
Semantic Web, Ontology, Wikipedia, Markov Logic Networks

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION

The vision of a Semantic Web will only be realized when there is a much greater volume of structured data available to power advanced applications. Given the recent progress in information extraction, it may be feasible to automatically gather this information from the Web, using extractors trained via machine learning. Wikipedia, one of the world's most popular Websites¹, is a logical source for extraction, since it is both comprehensive and high-quality. Indeed, collaborative editing by myriad users has already resulted in the creation of infoboxes (sets of subject-attribute-value triples summarizing the key aspects of an article's subject) for numerous articles. DBpedia [5] has aggregated this infobox data, yielding over 15 million pieces of information.

Furthermore, one may use this infobox data to bootstrap a process for generating additional structured data from Wikipedia. For example, our autonomous Kylin system [35] trained machine-learning algorithms on the infobox data, yielding extractors which can accurately generate infoboxes for articles which don't yet have them². We estimate that this approach can add over 10 million additional facts to those already incorporated into DBpedia. By running the learned extractors on a wider range of Web text and validating with statistical tests (as pioneered in the KnowItAll system [16]), one could gather even more structured data.

In order to effectively exploit extracted data, however, the triples must be organized using a clean and consistent ontology. Unfortunately, while Wikipedia has a category system for articles, the facility is noisy, redundant, incomplete, inconsistent, and of very limited value for our purposes. Better taxonomies exist, of course, such as WordNet [1], but these don't have the rich attribute structure found in Wikipedia.

¹ Ranked 8th in January 2008 according to comScore World Metrix.
² Kylin's precision ranges from the mid-70s to the high-90s percent, depending on the attribute type and infobox class.

1.1 KOG: Refining the Wikipedia Ontology

This paper presents the Kylin Ontology Generator (KOG), an autonomous system that builds a rich ontology by combining Wikipedia infoboxes with WordNet using statistical-relational learning. Each infobox template is treated as a class, and the slots of the template are considered as attributes/slots. Applying a Markov Logic Networks (MLNs) model [28], KOG uses joint inference to predict subsumption relationships between infobox classes while simultaneously mapping the classes to WordNet nodes. KOG also maps attributes between related classes, allowing property inheritance.

1.2 Why a Refined Ontology is Important

Situating extracted facts in an ontology has several benefits.

Advanced Query Capability: One of the main advantages of extracting structured data from Wikipedia's raw text is the ability to go beyond keyword queries and ask SQL-like questions such as "What scientists born before 1920 won the Nobel prize?" An ontology can greatly increase the recall of such queries by supporting transitivity and other types of inference. For example, without recognizing that particle physicist is a subclass of physicist, which is itself a subclass of scientist, a Wikipedia question-answering system would fail to return "Arthur Compton" in response to the question above. In many cases the attributes of different Wikipedia infobox classes are mismatched; for example, one infobox class might have a "birth place" attribute while another has "cityofbirth". Matching corresponding attributes across subclasses is clearly essential for high recall.

Faceted Browsing: When referring to Wikipedia, readers use a mixture of search and browsing. A clear taxonomy and aligned attributes enable faceted browsing, a powerful and popular way to investigate sets of articles [36].

Improving Extractors with Shrinkage: As long as an infobox class has many instances (articles), Kylin has sufficient training data to learn an accurate extractor. Unfortunately, long-tail distributions mean that most infobox classes don't have many instances. When learning an extractor for such a sparsely-populated class, C, one may use instances of the parent and children of C, appropriately weighted, as additional training examples [17, 34].

Semiautomatic Generation of New Templates: Today, Wikipedia infobox templates are designed manually with an error-prone "copy and edit" process. By displaying infobox classes in the context of a clean taxonomy, duplication and schema drift could be minimized. Base templates could be automatically suggested by inheriting attributes from the class's parent. Furthermore, by applying the extractors which Kylin learned for the parent class's attributes, one could automatically populate instances of the new infobox with candidate attribute values for human validation.

Infobox Migration: As Wikipedia evolves, authors are constantly reclassifying articles, which entails an error-prone conversion of articles from one infobox class to another. For example, our analysis of five Wikipedia dump "snapshots" between 9/25/06 and 7/16/07 shows an average of 3200 conversions per month; this [...]

• Using these techniques, we build a rich ontology which integrates and extends the information provided by both Wikipedia and WordNet; it incorporates subsumption information, an integrated set of attributes, and type information for attribute values.

• We demonstrate how the resulting ontology may be used to enhance Wikipedia in many ways, such as advanced query processing for Wikipedia facts, faceted browsing, automated infobox edits and template generation. Furthermore, we believe that the ontology can benefit many other applications, such as information extraction, schema mapping, and information integration.

[Figure 1: Architecture of Kylin Ontology Generator. The figure depicts Wikipedia infobox schemata passing through three stages (Schema Cleaning, Subsumption Detection via Markov Logic joint inference, and Schema Mapping), with candidate is-a links resolved against WordNet.]

2. DESIDERATA & ARCHITECTURE

In order to support the applications described in the previous section, an ontology (and the process used to create it) must satisfy several criteria. First, we seek automatic ontology construction. While researchers have manually created ontologies, such as [12], this is laborious and requires continual maintenance. Automatic techniques, likely augmented with human review, have the potential to better scale as Wikipedia and other document stores evolve over time.

Second, the ontology should contain a well-defined ISA hierarchy, where individual classes are semantically distinct and natural classes are well represented.

Third, each class should be defined with a rich schema, listing a comprehensive set of informative attributes, and should be populated with numerous instances. We note that, while Wikipedia infobox classes have rich schemata, many duplicate classes and attributes exist. Furthermore, many natural classes have no corresponding Wikipedia infobox.

Fourth, classes (and attributes) should have meaningful names. Randomly-generated names, e.g. G0037, are unacceptable, and overly terse names, e.g. "ABL" (the name of a Wikipedia infobox class), are less favored than alternatives such as "Australian Baseball League."

Finally, the ontology should have broad coverage: in our case, across the complete spectrum of Wikipedia articles. While these desiderata are subjective, they drove the design of KOG.

2.1 Architecture

As shown in Figure 1, KOG comprises three modules: the schema cleaner, subsumption detector, and schema mapper. The schema cleaner performs several functions. First, it merges duplicate classes and attributes. Second, it renames uncommon class and attribute names, such as "ABL,"
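The advanced-query benefit described in Section 1.2 is easy to make concrete. The sketch below uses hypothetical toy records and class names (not KOG's actual output) to show how a transitive is-a closure lets a query phrased over "scientist" reach an article typed only with the deeper class "particle physicist":

```python
# Illustrative sketch: subsumption transitivity increases query recall.
# The subclass edges and records below are toy data, not KOG output.

subclass_of = {                       # hypothetical child -> parent edges
    "particle physicist": "physicist",
    "physicist": "scientist",
}

def ancestors(cls):
    """Return cls together with all of its transitive superclasses."""
    seen = {cls}
    while cls in subclass_of:
        cls = subclass_of[cls]
        seen.add(cls)
    return seen

# Toy infobox-style records.
people = [
    {"name": "Arthur Compton", "cls": "particle physicist",
     "birth_year": 1892, "award": "Nobel Prize in Physics"},
    {"name": "Jane Doe", "cls": "novelist",
     "birth_year": 1900, "award": "Nobel Prize in Literature"},
]

def scientists_born_before(year):
    """SQL-like query: scientists born before `year` who won a Nobel prize."""
    return [p["name"] for p in people
            if "scientist" in ancestors(p["cls"])
            and p["birth_year"] < year
            and "Nobel" in p["award"]]

print(scientists_born_before(1920))   # ['Arthur Compton']
```

Without the `ancestors()` closure, the type test would match no records, which is exactly the recall loss the paper describes.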
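The shrinkage idea from Section 1.2 can likewise be sketched. The class names and the down-weighting factor below are hypothetical; the actual weighting schemes are those of [17, 34], not this code:

```python
# Minimal illustration of shrinkage: when the target class is sparse,
# borrow instances from its parent and children at reduced weight.
# Class names and the 0.3 weight are arbitrary assumptions.

def shrinkage_training_set(instances_by_class, target, parent=None,
                           children=(), related_weight=0.3):
    """Return (instance, weight) pairs for training the target's extractor."""
    examples = [(x, 1.0) for x in instances_by_class.get(target, [])]
    for cls in ([parent] if parent else []) + list(children):
        examples += [(x, related_weight)
                     for x in instances_by_class.get(cls, [])]
    return examples

data = {
    "performer": ["article_a", "article_b"],              # sparse target class
    "person":    ["article_c", "article_d", "article_e"], # parent class
    "comedian":  ["article_f"],                           # child class
}

train = shrinkage_training_set(data, "performer",
                               parent="person", children=["comedian"])
# Two full-weight examples plus four down-weighted relatives.
```

The sparse class still dominates through its unit weights, while the related classes supply enough extra signal to train a usable extractor.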
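Finally, the attribute mismatch noted in Section 1.2 ("birth place" vs. "cityofbirth") hints at what the schema cleaner must detect. The heuristic below is only a sketch of the matching problem; KOG's cleaner uses learned models, not this rule:

```python
# Illustrative sketch (not KOG's algorithm) of duplicate-attribute detection
# via token normalization and embedded-token matching.
import re

def normalize(attr):
    """Lowercase, split camelCase and punctuation, drop filler words."""
    attr = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", attr)   # birthPlace -> birth Place
    tokens = re.split(r"[\s_\-]+", attr.lower())
    return {t for t in tokens if t and t not in {"of", "the"}}

def similar(a, b):
    """True if the names share a token, or one name's token is embedded
    in the other (catches run-together names like "cityofbirth")."""
    ta, tb = normalize(a), normalize(b)
    if ta & tb:
        return True
    ja, jb = "".join(sorted(ta)), "".join(sorted(tb))
    return any(t in jb for t in ta) or any(t in ja for t in tb)

print(similar("birth place", "cityofbirth"))    # True: "birth" is embedded
print(similar("birth place", "occupation"))     # False
```

A real cleaner must also consult attribute values and class context, since surface similarity alone mislabels pairs like "state" (location) vs. "state" (condition).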