Using a Natural Language Understanding System to Generate Semantic Web Content
Total Page:16
File Type:pdf, Size:1020Kb
50 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 Using a Natural Language Understanding System to Generate Semantic Web Content Akshay Java, University of Maryland, USA Sergei Nirneburg, University of Maryland, USA Marjorie McShane, University of Maryland, USA Timothy Finin, University of Maryland, USA Jesse English, University of Maryland, USA Anupam Joshi, University of Maryland, USA ABSTRACT We describe our research on automatically generating rich semantic annotations of text and making it available on the Semantic Web. In particular, we discuss the challenges involved in adapting the On- toSem natural language processing system for this purpose. OntoSem, an implementation of the theory of ontological semantics under continuous development for over 15 years, uses a specially constructed NLP-oriented ontology and an ontological-semantic lexicon to translate English text into a custom ontol- ogy-motivated knowledge representation language, the language of text meaning representations (TMRs). OntoSem concentrates on a variety of ambiguity resolution tasks as well as processing unexpected input and reference. To adapt OntoSem’s representation to the Semantic Web, we developed a translation system, OntoSem2OWL, between the TMR language into the Semantic Web language OWL. We next used OntoSem and OntoSem2OWL to support SemNews, an experimental Web service that monitors RSS news sources, processes the summaries of the news stories, and publishes a structured representation of the meaning of the text in the news story. Keywords: information extraction; natural language processing; OWL; RDF; Semantic Web INTRODUCTION annotation is time-consuming and error-prone. A core goal of the development of the Semantic Moreover, annotations must be made in a formal Web is to bring progressively more meaning to language whose use may require considerable the information published on the Web. An ac- training and expertise. Developing interactive cepted method of doing this is by annotating the tools for annotation is a problematic undertaking text with a variety of kinds of metadata. Manual because it is not known whether they will be Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 51 in actual demand. A number of Semantic Web pragmatic markers (and certainly not templates practitioners maintain that the desire to have filled with uninterpreted snippets of the input their content available on the Semantic Web text that are generated by the current informa- will compel people to spend the time and effort tion extraction methods). on manual annotation. However, even if such a To attain such goals, Semantic Web agents desire materializes, people simply will not have must be equipped with sophisticated semantic enough time either to annotate each sentence in analysis systems that process text found on the their texts or annotate a subset at a semantic level Web and publish their analyses on the Web as that is sufficiently deep to be used by advanced annotations in a form accessible to other agents, intelligent agents that are projected as users of using standard Semantic Web languages such as the Semantic Web alongside people. RDF and OWL. The Semantic Web will, thus, The alternative on the supply side is, then, be useful for both human readers and robotic automatic annotation. Within the current state intelligent agents. The agents will benefit from of the art, automatically produced annotations the existence of deep semantic annotations in are roughly at the level attainable by the latest their application-oriented information process- information extraction techniques—a reason- ing tasks and also will be able to derive such ably good level of capturing named entities with annotations from text. People will not directly a somewhat less successful categorization of access the annotation (metadata) level but will such entities (e.g., deciding whether Jordan is benefit from higher-quality and better formu- used as the first name of an individual or a ref- lated responses to their queries. erence to the Hashemite kingdom). Extracting This article describes initial work on more advanced types of semantic information, responding to the needs and leveraging the for example, types of events (to say nothing offerings of the Web by merging knowledge- about determining semantic arguments, ‘’case oriented natural language processing with web roles” in AI terminology), is not quite within technologies to produce both an automatic an- the current information extraction capabili- notation-generating capability and an enhanced ties, though work in this direction is ongoing. Web service oriented at human users. The on- Indeed, semantic annotation is at the moment tological-semantic natural language processing an active subfield of computational linguistics, system OntoSem (Nirenburg & Raskin, 2001) where annotated corpora are intended for use provided the basis for the automatic annotation by machine-learning approaches to building effort. In order to test and evaluate the utility natural language processing capabilities. of OntoSem on the Semantic Web, we have On the demand side of the Semantic Web, a developed SemNews (Java, Finin & Nirenburg, core capability is improving the precision of the 2006; Java, Finin & Nirenburg, 2005), a pro- Web search, which will be facilitated by detailed totype application that monitors RSS feeds of semantic annotations that are unambiguous news stories, applies OntoSem to understand and sufficiently detailed to support the search the text, and exports the computed facts back to engine in making fine-grained distinctions in the Web in OWL. A prerequisite for this system calculating scores of documents. Another core integration is a utility for translating knowledge capability is to transcend the level of document formats between OntoSem’s knowledge repre- retrieval and, instead, return as answers to user sentation language and ontologies and those of queries specially generated pragmatically and the Semantic Web. stylistically appropriate responses. To attain Since our goal is to continuously improve this capability, intelligent agents must rely on the service, the quality of OntoSem results very detailed semantic annotations of texts. We and system coverage must be continuously believe that such annotations will be, for all enhanced. The Web, in fact, contains a wealth intents and purposes, complete text meaning of textual information that, once processed, representations, not just sets of semantic or can enhance OntoSem’s knowledge base (its Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 52 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 ontology, lexicon and fact repository, see be- Gildea and Jurafsky (2002) created a sto- low for a more detailed description). This is chastic system that labels case roles of predicates why the knowledge format conversion utility, with either abstract (e.g., AGENT, THEME) OntoSem2OWL, has been developed to trans- or domain-specific (e.g., MESSAGE, TOPIC) late both ways between OWL and OntoSem’s roles. The system trained on 50,000 words of knowledge representation language. Our initial hand-annotated text that was produced by the experiments on automatic learning of ontologi- FrameNet project (Baker, Fillmore & Lowe, cal concepts and lexicon entries are reported in 1998). When tasked to segment constituents English and Nirenburg (2007). and identify their semantic roles (with fillers The remainder of this article is organized being un-disambiguated textual strings, not as follows. We start with a brief review of some machine-tractable instances of ontological con- related work on annotation in computational cepts, as in OntoSem), the system scored in the linguistics, and on mapping knowledge between 60s in precision and recall. Limitations of the a text understanding system and the Semantic system include its reliance on hand-annotated Web representation. Next, we introduce the data and its reliance on prior knowledge of the knowledge resources of OntoSem and illus- predicate frame type (i.e., it lacks the capacity trate its knowledge representation language. to disambiguate productively). Semantics in In the section, Mapping OntoSem to OWL, we this project is limited to case-roles. provide an overview of the architecture of our The Interlingual Annotation of Multi- implemented system and describe the approach lingual Text Corpora project (Farwell et al., used and major issues discovered in using it to 2004) had as its goal the creation of a syntactic map knowledge between OntoSem’s knowledge and semantic annotation representation meth- representation system and the Semantic Web odology, and tested it out on seven languages language OWL. We go on in the next section (English, Spanish, French, Arabic, Japanese, tooutline some of the larger issues and chal- Korean, and Hindi). The semantic representa- lenges we expect to encounter. We describe tion, however, is restricted to those aspects of an approach to evaluating the use of OntoSem, syntax and semantics that developers believe can the translation of its ontology to OWL, and the be consistently handled well by hand annota- effectiveness of the SemNews system in the tors for many languages.