Natural Language Processing for MediaWiki: The Semantic Assistants Approach
Bahar Sateli and René Witte (corresponding author)
Semantic Software Lab, Department of Computer Science and Software Engineering, Concordia University, Montréal, QC, Canada
[sateli,witte]@semanticsoftware.info

ABSTRACT
We present a novel architecture for the integration of Natural Language Processing (NLP) capabilities into wiki systems. The vision is that of a new generation of wikis that can help develop their own primary content and organize their structure by using state-of-the-art technologies from the NLP and Semantic Computing domains. The motivation for this integration is to enable wiki users, novice or expert, to benefit from modern text mining techniques directly within their wiki environment. We implemented these ideas based on MediaWiki and present a number of real-world application case studies that illustrate the practicability and effectiveness of this approach.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods, Indexing methods, Linguistic processing; H.5.2 [User Interfaces]: Natural language, User-centered design; H.5.4 [Hypertext/Hypermedia]: Architectures, Navigation, User issues; I.2.1 [Applications and Expert Systems]: Natural language interfaces; I.2.7 [Natural Language Processing]: Text analysis

1. INTRODUCTION
Back in 2007, we originally coined the idea of a 'self-aware wiki system' that can support users in knowledge management by applying natural language processing (NLP) techniques to analyse, modify, and create its own content [7]. As an example, consider a wiki containing cultural heritage documents, like a historical encyclopedia of architecture: the large size and partially outdated terminology can make it difficult for its users to locate knowledge, e.g., for a restoration task on an old building. Couldn't the wiki organize the content by automatically creating a back-of-the-book index as another wiki page? Or even create a new wiki page from its content that answers a specific question of a user, asked in natural language?

Or imagine a wiki used to curate knowledge for biofuel research, where expert biologists go through research publications stored in the wiki in order to extract relevant knowledge. This involves the time-consuming and error-prone task of locating biomedical entities, such as enzymes, organisms, or genes: Couldn't the wiki identify these entities in its pages automatically, link them with other external data sources, and provide semantic markup for embedded queries?

Consider also a wiki used for collaborative requirements engineering, where the specification for a software product is developed by software engineers, users, and other stakeholders. Such natural language specifications are known to be prone to errors, including ambiguities and inconsistencies. What if the wiki system included intelligent assistants that work collaboratively with the human users to find defects like these, thereby improving the specification (and the success of the project)?

In this paper, we present our work on a comprehensive architecture that makes these visions a reality: Based entirely on open source software and open standards, we developed an integration of wiki engines, in particular MediaWiki, with the Semantic Assistants NLP web services framework [8]. The resulting system was designed from the ground up for scalability and robustness, allowing anyone to integrate sophisticated text mining services into an existing MediaWiki installation. These services embody 'Semantic Assistants': intelligent agents that work collaboratively with human users on developing, structuring, and analysing the content of a wiki. We developed a wide range of wiki-based applications that highlight the power of this integration for diverse domains, such as biomedical literature curation, cultural heritage data management, and software requirements specification.

The impact of integrating wikis with NLP techniques presented in this work is significant: It opens the opportunity of bringing state-of-the-art techniques from the NLP and Semantic Computing domains to wiki users, without requiring them to have concrete knowledge in these areas. Rather, the integration seamlessly provides them with NLP capabilities, so that no context switch is needed, i.e., the NLP service invocation is carried out from within the wiki, and results are brought back to the user in their place of choice. This way, a broad range of wiki users, from laypersons to organization employees, can benefit from NLP techniques that would normally require them to use specialized tools or have expert knowledge in that area.

2. BACKGROUND
Before we delve into the details of our Wiki-NLP integration, we briefly introduce NLP and illustrate how various techniques from this domain can help to improve a wiki user's experience. Afterwards, we describe the Semantic Assistants framework, which we aim to integrate with MediaWiki in order to provide these NLP services.

2.1 Natural Language Processing
Natural Language Processing (NLP) is a branch of computer science that employs various Artificial Intelligence techniques to process content written in natural language. One of the applications of NLP is text mining, the process of deriving patterns and structured information from text, which is usually facilitated through the use of frameworks such as the General Architecture for Text Engineering (GATE) [1]. Using these frameworks, sophisticated text mining applications can be developed that significantly improve a user's experience in content development, retrieval, and analysis. Here, we introduce a number of standard NLP techniques to illustrate how they can support wiki users.

Information Extraction (IE) is one of the most popular applications of NLP. IE identifies instances of a particular class of events, entities, or relationships in a natural language text and creates a structured representation of the discovered information. A number of systems have been developed for this task, such as OpenCalais, which can extract named entities like Persons and Organizations within a text and present them in a formal language, e.g., XML or RDF. These techniques are highly relevant for wikis, where users often deal with large amounts of textual content. For example, using an IE service in a wiki can help users to automatically find all the occurrences of a specific type of entity, such as 'protein', and gather complementary information in the form of metadata around them.
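To make this IE scenario concrete, the following is a minimal sketch of how entity mentions found in a wiki page could be turned into structured, RDF-style metadata. The gazetteer, entity names, and output vocabulary are illustrative assumptions; a real deployment would use a full pipeline such as OpenCalais or a GATE application, not this toy matcher.

```python
import re

# Hypothetical gazetteer: a real IE system would use rich linguistic
# analysis instead of a fixed word list.
PROTEIN_NAMES = ["laccase", "cellulase", "xylanase"]

def extract_proteins(wiki_text: str):
    """Return structured mentions of known proteins in a wiki page."""
    mentions = []
    for name in PROTEIN_NAMES:
        for match in re.finditer(rf"\b{re.escape(name)}\b", wiki_text, re.IGNORECASE):
            mentions.append({
                "type": "Protein",       # entity class
                "text": match.group(0),  # surface form as it appears
                "start": match.start(),  # character offsets into the page
                "end": match.end(),
            })
    return mentions

page = "The laccase enzyme degrades lignin; Cellulase breaks down cellulose."
for m in extract_proteins(page):
    # Emit a simple RDF-style statement for each mention (invented vocabulary).
    print(f'<#offset_{m["start"]}> a :Protein ; :text "{m["text"]}" .')
```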
Content Development. In addition to generating metadata from existing content in the wiki, an NLP system can also be used in some instances to create the primary content itself. For example, in Wiktionary, where information is highly structured, various techniques from computational linguistics can help to automatically populate the wiki by adding new stubs and their morphological variations. Furthermore, the NLP system can analyse the wiki content to find the entries across different languages and automatically annotate them or cross-link their pages together.

Automatic Summarization. Consider a wiki user who wants to create a report on a specific topic from the available related information in a wiki the size of Wikipedia. For this task, the user has to start from one page, read its content to determine its relevance, and continue browsing through the wiki by clicking on page links that might be related. Instead, the user could simply provide a topic keyword and ask the summarization service to generate a summary with a desired length from the available information on that topic within the wiki.
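As a rough illustration of this summarization scenario, the sketch below scores sentences from topic-relevant pages by simple word frequency and returns the top-ranked ones. Real summarization assistants use far more sophisticated linguistic analysis; this is only a toy stand-in under that assumption.

```python
import re
from collections import Counter

def summarize(pages, topic, max_sentences=3):
    """Crude extractive summary of wiki pages mentioning a topic keyword."""
    # Keep only pages that mention the topic at all.
    relevant = [p for p in pages if topic.lower() in p.lower()]
    sentences = [s.strip()
                 for p in relevant
                 for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
    words = [w for s in sentences for w in re.findall(r"\w+", s.lower())]
    freq = Counter(words)

    def score(sentence):
        # Average corpus frequency of the sentence's words.
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:max_sentences])

pages = ["Biofuel is fuel derived from biomass. Algae can produce biofuel.",
         "Wikis store knowledge collaboratively."]
print(summarize(pages, "biofuel", max_sentences=2))
```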
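The content-development use case described just before the summarization example could be sketched in a similar spirit: generate a dictionary stub together with a morphological variant that cross-links back to the lemma. The English-only pluralization rules and the wikitext layout below are simplified assumptions, not Wiktionary's actual conventions.

```python
def noun_stub_variants(lemma: str):
    """Generate a noun stub plus a naive plural cross-reference stub."""
    # Toy English morphology; a real system would use a proper morphological
    # analyser and handle irregular forms.
    if lemma.endswith(("s", "x", "z", "ch", "sh")):
        plural = lemma + "es"
    elif len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
        plural = lemma[:-1] + "ies"
    else:
        plural = lemma + "s"
    return {
        lemma: f"== English ==\n=== Noun ===\n'''{lemma}''' (plural '''{plural}''')",
        plural: f"== English ==\n=== Noun ===\n'''{plural}''': plural of [[{lemma}]]",
    }

# Example: two stub pages, one for the lemma and one for its plural.
for title, body in noun_stub_variants("entry").items():
    print(f"--- {title} ---\n{body}\n")
```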
Question Answering (QA). The knowledge stored in wikis, for example, when a wiki is used inside an organization to gather technical knowledge from employees, is a valuable source of information that can otherwise only be queried via a keyword search or indices. However, using a combination of NLP techniques, a wiki can be enhanced to allow its users to pose questions against its knowledge base in natural language. Then, after "understanding" the question, NLP services can browse through the wiki content and bring back the extracted information, or a summary of a desired length, to the user. QA systems are especially useful when the information need is dispersed over multiple pages of the wiki, or when the user is not aware of the existence of such knowledge and the terminology used to refer to it.

2.2 Semantic Assistants Framework
The open source Semantic Assistants framework [8] is an existing service-oriented architecture that brokers context-sensitive NLP pipelines as W3C standard web services. The goal of this framework is to bring NLP techniques directly to end users, by integrating them within common desktop applications, such as word processors, email clients, or web browsers. The core idea of the Semantic Assistants approach is to take existing NLP frameworks and wrap concrete analysis pipelines so that they can be brokered through a service-oriented architecture, allowing desktop clients connected to the architecture to consume these NLP services via a plug-in interface.

[Figure 1: The Semantic Assistants service execution workflow [8]. A client (e.g., a word processor) calls an NLP service, such as focused summarization, with runtime parameters; the server runs one of the brokered NLP services and returns the result to the client.]
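From a client's perspective, the workflow of Figure 1 could be approximated as in the sketch below. The REST-style endpoint, payload fields, and service name are hypothetical stand-ins; the actual Semantic Assistants interface is the web service API described in [8].

```python
import requests  # third-party HTTP library, assumed available

def invoke_assistant(service_name, content, **runtime_params):
    """Send content to a brokered NLP service and return its results.

    Mirrors the client/server workflow of Figure 1: the client names a
    service, passes runtime parameters, and receives the analysis results.
    """
    response = requests.post(
        "http://localhost:8879/services/invoke",  # hypothetical broker URL
        json={
            "service": service_name,    # e.g. "FocusedSummarization"
            "input": content,           # the text to analyse
            "parameters": runtime_params,  # e.g. desired summary length
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# Example call (hypothetical service name and parameter):
# result = invoke_assistant("FocusedSummarization", page_text, length=5)
```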
NLP pipelines are stored in a repository in the Semantic Assistants architecture. They are formally described using the Web Ontology Language (OWL), which allows the Semantic Assistants web server to dynamically discover them and reason about their capabilities before recommending them to the clients.
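The following sketch illustrates the general idea of such a formal service description as RDF/OWL metadata. The namespace, class, and property names are invented for illustration and do not reproduce the actual Semantic Assistants ontology from [8].

```python
from rdflib import Graph, Literal, Namespace, RDF  # requires rdflib (v6+)

# Hypothetical vocabulary standing in for the real Semantic Assistants
# ontology; only the general idea (services described in OWL so the server
# can discover them and reason about their capabilities) comes from the paper.
SA = Namespace("http://example.org/semantic-assistants#")

g = Graph()
g.bind("sa", SA)

service = SA.FocusedSummarizer
g.add((service, RDF.type, SA.NLPService))
g.add((service, SA.hasInputFormat, Literal("text/plain")))
g.add((service, SA.hasCapability, Literal("summarization")))
g.add((service, SA.hasParameter, Literal("summaryLength")))

# Serialize the description so a server could store and query it.
print(g.serialize(format="turtle"))
```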