Wikipedia and Artificial Intelligence: an Evolving Synergy
Total Page:16
File Type:pdf, Size:1020Kb
AAAI-08 Chicago Wikipedia and Artificial Intelligence: An Evolving Synergy Papers from the 2008 AAAI Workshop Technical Report WS-08-15 AAAI Press Association for the Advancement of Artificial Intelligence AAAI Press 445 Burgess Drive Menlo Park, California 94025 WS-08-15 Wikipedia and Artificial Intelligence: An Evolving Synergy Papers from the 2008 AAAI Workshop Razvan Bunescu, Evgeniy Gabrilovich, and Rada Mihalcea, Cochairs Technical Report WS-08-15 AAAI Press Menlo Park, California Copyright © 2008, AAAI Press The Association for the Advancement of Artificial Intelligence 445 Burgess Drive Menlo Park, California 94025 USA AAAI maintains compilation copyright for this technical report and retains the right of first refusal to any publication (including electronic distribution) arising from this AAAI event. Please do not make any inquiries or arrangements for hardcopy or electronic publication of all or part of the papers contained in these working notes without first explor- ing the options available through AAAI Press and AI Magazine (concur- rent submission to AAAI and an another publisher is not acceptable). A signed release of this right by AAAI is required before publication by a third party. Distribution of this technical report by any means including electronic (including, but not limited to the posting of the papers on any website) without permission is prohibited. ISBN 978-1-57735-383-6 WS-08-15 Manufactured in the United States of America Organizing Committee Razvan Bunescu (Ohio University, USA) Evgeniy Gabrilovich (Yahoo! Research, USA) Rada Mihalcea (University of North Texas, USA) Program Committee Eugene Agichtein (Emory University, USA) Einat Amitay (IBM Research, Israel) Mikhail Bilenko (Microsoft Research, USA) Chris Brew (Ohio State University, USA) Timothy Chklovski (Structured Commons, USA) Massimiliano Ciaramita (Yahoo! Research Barcelona, Spain) Andras Csomai (University of North Texas, USA) Silviu Cucerzan (Microsoft Research, USA) Ido Dagan (Bar-Ilan University, Israel) Ravi Kumar (Yahoo! Research, USA) Oren Kurland (Technion, Israel) Lillian Lee (Cornell University, USA) Elizabeth Liddy (Syracuse University, USA) Daniel Marcu (Information Sciences Institute / University of Southern California, USA) Shaul Markovitch (Technion, Israel) Raymond Mooney (University of Texas at Austin, USA) Vivi Nastase (EML Research, Germany) Bo Pang (Yahoo! Research, USA) Marius Pasca (Google, USA) Ted Pedersen (University of Minnesota Duluth, USA) Simone Paolo Ponzetto (EML Research, Germany) Dragomir Radev (University of Michigan, USA) Dan Roth (University of Illinois Urbana-Champaign, USA) Peter Turney (National Research Council, Canada) This AAAI–08 Workshop was held Sunday, July 13, 2008, in Chicago, Illinois, USA iii Preface Since its inception less than seven years ago, Wikipedia has become one of the largest and fastest growing online sources of encyclopedic knowledge. One of the reasons why Wikipedia is appealing to contributors and users alike is the richness of its embedded structural information: articles are hyperlinked to each other and connected to categories from an ever expanding taxonomy; pervasive language phenomena such as synonymy and polysemy are addressed through redirection and disam- biguation pages; entities of the same type are described in a consistent format using infoboxes; related articles are grouped together in series templates. As a large-scale repository of structured knowledge, Wikipedia has become a valuable resource for a diverse set of artificial intelligence (AI) applications. Major conferences in natural language process- ing and machine learning have recently witnessed a significant number of approaches that use Wikipedia for tasks ranging from text categorization and clustering to word sense disambiguation, information retrieval, information extraction and question answering. On the other hand, Wikipedia can greatly benefit from numerous algorithms and representation models developed during decades of AI research, as illustrated recently in tasks such as estimating the reliability of authors’ contribu- tions, automatic linking of articles, or intelligent matching of Wikipedia tasks with potential contrib- utors. The goal of this workshop was to foster the research and dissemination of ideas on the mutually beneficial interaction between Wikipedia and AI. The workshop took place on July 13, 2008, in Chicago, immediately preceding the Annual Meeting of the Association for the Advancement of Artificial Intelligence. This report contains papers accepted for presentation at the workshop. We issued calls for regular papers, short late–breaking papers, and demos. We received an impressive number of submissions, demonstrating the community interest and the timeliness of the “Wikipedia and AI” research theme: 22 regular papers, 10 short papers and 3 demos were submitted to the workshop. After careful review by our program committee, 10 regular papers, 2 short papers and 1 demo were accepted for presentation. Consistent with the original aim of the workshop, the accepted papers address a highly diverse set of problems, in which Wikipedia is seen either as a useful resource, or as a target for algo- rithms seeking to make it even better. As a rich knowledge source, Wikipedia is shown to benefit applications in information extraction, machine translation, summarization, ontology mining and mapping, and information retrieval. We also learned of interesting applications, most of them using machine learning, that could further enhance the breadth and the quality of Wikipedia, such as pre- dicting the quality of edits, vandalism detection, infobox extraction, crosslingual links creation, and semantic annotation. We were truly impressed by the high quality of the reviews provided by all the members of the program committee, particularly since deadlines were very tight. All of the committee members pro- vided timely and thoughtful reviews, and the papers that appear have certainly benefited from that expert feedback. Finally, when we first started planning this workshop, we agreed that having a high quality invit- ed speaker was crucial. We thank Michael Witbrock not only for his talk, but also for the boost of confidence provided by his quick and enthusiastic acceptance. – Razvan Bunescu, Evgeniy Gabrilovich, and Rada Mihalcea July 2008 vii Contents Organizing Committee / iii Preface / vii Razvan Bunescu, Evgeniy Gabrilovich, and Rada Mihalcea Full Papers The Fast and the Numerous – Combining Machine and Community Intelligence for Semantic Annotation / 1 Sebastian Blohm, Markus Krötzsch, Philipp Cimiano Learning to Predict the Quality of Contributions to Wikipedia / 7 Gregory Druck, Gerome Miklau, Andrew McCallum Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense / 13 Olena Medelyan, Catherine Legg Topic Indexing with Wikipedia / 19 Olena Medelyan, Ian H. Witten, David Milne An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links / 25 David Milne, Ian H. Witten Mining Wikipedia’s Article Revision History for Training Computational Linguistics Algorithms / 31 Rani Nelken, Elif Yamangil Okinet: Automatic Extraction of a Medical Ontology from Wikipedia / 37 Vasco Calais Pedro, Radu Stefan Niculescu, Lucian Vlad Lita Automatic Vandalism Detection in Wikipedia: Towards a Machine Learning Approach / 43 Koen Smets, Bart Goethals, Brigitte Verdonk Enriching the Crosslingual Link Structure of Wikipedia — A Classification-Based Approach / 49 Philipp Sorg, Philipp Cimiano Augmenting Wikipedia-Extraction with Results from the Web / 55 Fei Wu, Raphael Hoffmann, Daniel S. Weld v Short Papers Using Wikipedia Links to Construct Word Segmentation Corpora / 61 David Gabay, Ziv Ben-Eliahu, Michael Elhadad Method for Building Sentence-Aligned Corpus from Wikipedia / 64 Keiji Yasud, Eiichiro Sumita Demo Powerset’s Natural Language Wikipedia Search Engine / 67 Tim Converse, Ronald M. Kaplan, Barney Pell, Scott Prevost, Lorenzo Thione, Chad Walters vi Invited Speaker Michael Witbrock (Vice President of Research at Cycorp, USA) iv The Fast and the Numerous – Combining Machine and Community Intelligence for Semantic Annotation Sebastian Blohm, Markus Krötzsch and Philipp Cimiano Institute AIFB, Knowledge Management Research Group University of Karlsruhe D-76128 Karlsruhe, Germany {blohm, kroetzsch, cimiano}@aifb.uni-karlsruhe.de Abstract incentives for creating semantic annotations in a Semantic MediaWiki are for example semantic browsing and query- Starting from the observation that certain communities have incentive mechanisms in place to create large amounts of un- ing functionality, but most importantly the fact that queries structured content, we propose in this paper an original model over structured knowledge can be used to automatically cre- which we expect to lead to the large number of annotations ate views on data, e.g. in the form of tables. required to semantically enrich Web content at a large scale. However, creating incentives and making annotation easy The novelty of our model lies in the combination of two key and intuitive will clearly not be enough to really leverage se- ingredients: the effort that online communities are making to mantic annotation at a large scale. On the one hand, human create content and the capability of machines to detect reg- resources are limited. In particular, it is well known from ular patterns in user annotation to suggest new annotations. Wikipedia and from tagging systems that the number of con- Provided that the creation