50 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

Using a Natural Language Understanding System to Generate Semantic Web Content

Akshay Java, University of Maryland, USA Sergei Nirneburg, University of Maryland, USA Marjorie McShane, University of Maryland, USA Timothy Finin, University of Maryland, USA Jesse English, University of Maryland, USA Anupam Joshi, University of Maryland, USA

Abstract

We describe our research on automatically generating rich semantic annotations of text and making it available on the Semantic Web. In particular, we discuss the challenges involved in adapting the On- toSem natural language processing system for this purpose. OntoSem, an implementation of the theory of ontological semantics under continuous development for over 15 years, uses a specially constructed NLP-oriented ontology and an ontological-semantic lexicon to translate English text into a custom ontol- ogy-motivated knowledge representation language, the language of text meaning representations (TMRs). OntoSem concentrates on a variety of ambiguity resolution tasks as well as processing unexpected input and reference. To adapt OntoSem’s representation to the Semantic Web, we developed a translation system, OntoSem2OWL, between the TMR language into the Semantic Web language OWL. We next used OntoSem and OntoSem2OWL to support SemNews, an experimental Web service that monitors RSS news sources, processes the summaries of the news stories, and publishes a structured representation of the meaning of the text in the news story.

Keywords: information extraction; natural language processing; OWL; RDF; Semantic Web

INTRODUCTION annotation is time-consuming and error-prone. A core goal of the development of the Semantic Moreover, annotations must be made in a formal Web is to bring progressively more meaning to language whose use may require considerable the information published on the Web. An ac- training and expertise. Developing interactive cepted method of doing this is by annotating the tools for annotation is a problematic undertaking text with a variety of kinds of metadata. Manual because it is not known whether they will be

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 51

in actual demand. A number of Semantic Web pragmatic markers (and certainly not templates practitioners maintain that the desire to have filled with uninterpreted snippets of the input their content available on the Semantic Web text that are generated by the current informa- will compel people to spend the time and effort tion extraction methods). on manual annotation. However, even if such a To attain such goals, Semantic Web agents desire materializes, people simply will not have must be equipped with sophisticated semantic enough time either to annotate each sentence in analysis systems that process text found on the their texts or annotate a subset at a semantic level Web and publish their analyses on the Web as that is sufficiently deep to be used by advanced annotations in a form accessible to other agents, intelligent agents that are projected as users of using standard Semantic Web languages such as the Semantic Web alongside people. RDF and OWL. The Semantic Web will, thus, The alternative on the supply side is, then, be useful for both human readers and robotic automatic annotation. Within the current state intelligent agents. The agents will benefit from of the art, automatically produced annotations the existence of deep semantic annotations in are roughly at the level attainable by the latest their application-oriented information process- information extraction techniques—a reason- ing tasks and also will be able to derive such ably good level of capturing named entities with annotations from text. People will not directly a somewhat less successful categorization of access the annotation (metadata) level but will such entities (e.g., deciding whether Jordan is benefit from higher-quality and better formu- used as the first name of an individual or a ref- lated responses to their queries. erence to the Hashemite kingdom). Extracting This article describes initial work on more advanced types of semantic information, responding to the needs and leveraging the for example, types of events (to say nothing offerings of the Web by merging knowledge- about determining semantic arguments, ‘’case oriented natural language processing with web roles” in AI terminology), is not quite within technologies to produce both an automatic an- the current information extraction capabili- notation-generating capability and an enhanced ties, though work in this direction is ongoing. Web service oriented at human users. The on- Indeed, semantic annotation is at the moment tological-semantic natural language processing an active subfield of computational linguistics, system OntoSem (Nirenburg & Raskin, 2001) where annotated corpora are intended for use provided the basis for the automatic annotation by machine-learning approaches to building effort. In order to test and evaluate the utility natural language processing capabilities. of OntoSem on the Semantic Web, we have On the demand side of the Semantic Web, a developed SemNews (Java, Finin & Nirenburg, core capability is improving the precision of the 2006; Java, Finin & Nirenburg, 2005), a pro- Web search, which will be facilitated by detailed totype application that monitors RSS feeds of semantic annotations that are unambiguous news stories, applies OntoSem to understand and sufficiently detailed to support the search the text, and exports the computed facts back to engine in making fine-grained distinctions in the Web in OWL. A prerequisite for this system calculating scores of documents. Another core integration is a utility for translating knowledge capability is to transcend the level of document formats between OntoSem’s knowledge repre- retrieval and, instead, return as answers to user sentation language and ontologies and those of queries specially generated pragmatically and the Semantic Web. stylistically appropriate responses. To attain Since our goal is to continuously improve this capability, intelligent agents must rely on the service, the quality of OntoSem results very detailed semantic annotations of texts. We and system coverage must be continuously believe that such annotations will be, for all enhanced. The Web, in fact, contains a wealth intents and purposes, complete text meaning of textual information that, once processed, representations, not just sets of semantic or can enhance OntoSem’s knowledge base (its

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 52 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 ontology, lexicon and fact repository, see be- Gildea and Jurafsky (2002) created a sto- low for a more detailed description). This is chastic system that labels case roles of predicates why the knowledge format conversion utility, with either abstract (e.g., AGENT, THEME) OntoSem2OWL, has been developed to trans- or domain-specific (e.g., MESSAGE, TOPIC) late both ways between OWL and OntoSem’s roles. The system trained on 50,000 words of knowledge representation language. Our initial hand-annotated text that was produced by the experiments on automatic learning of ontologi- FrameNet project (Baker, Fillmore & Lowe, cal concepts and lexicon entries are reported in 1998). When tasked to segment constituents English and Nirenburg (2007). and identify their semantic roles (with fillers The remainder of this article is organized being un-disambiguated textual strings, not as follows. We start with a brief review of some machine-tractable instances of ontological con- related work on annotation in computational cepts, as in OntoSem), the system scored in the linguistics, and on mapping knowledge between 60s in precision and recall. Limitations of the a text understanding system and the Semantic system include its reliance on hand-annotated Web representation. Next, we introduce the data and its reliance on prior knowledge of the knowledge resources of OntoSem and illus- predicate frame type (i.e., it lacks the capacity trate its knowledge representation language. to disambiguate productively). Semantics in In the section, Mapping OntoSem to OWL, we this project is limited to case-roles. provide an overview of the architecture of our The Interlingual Annotation of Multi- implemented system and describe the approach lingual Text Corpora project (Farwell et al., used and major issues discovered in using it to 2004) had as its goal the creation of a syntactic map knowledge between OntoSem’s knowledge and semantic annotation representation meth- representation system and the Semantic Web odology, and tested it out on seven languages language OWL. We go on in the next section (English, Spanish, French, Arabic, Japanese, tooutline some of the larger issues and chal- Korean, and Hindi). The semantic representa- lenges we expect to encounter. We describe tion, however, is restricted to those aspects of an approach to evaluating the use of OntoSem, syntax and semantics that developers believe can the translation of its ontology to OWL, and the be consistently handled well by hand annota- effectiveness of the SemNews system in the tors for many languages. The current stage of section that follows.. In the section, titled Appli- development includes only syntax and a limited cations, we describe the SemNews application semantics—essentially, thematic roles. testbed and some general application scenarios In the ACE project1, annotators carry out we have explored to motivate and guide our manual semantic annotation of texts in English, research. Finally, we offer some comments on Chinese and Arabic to create training and test the problem of extracting knowledge from the data for research task evaluations. The downside Web in the section titled, Using the Web for of this effort is that the inventory of semantic Knowledge Acquisition,and we conclude this entities, relations, and events is very small, and, article with closing remarks. therefore, the resulting semantic representations are coarse-grained, for example, there are only RELATED WORK five event types. The project description prom- The general problem of automatically gener- ises more fine-grained descriptors and relations ating and adding semantic annotations to text among events in the future. Another response to has been the focus of research for many years. the clear insufficiency of syntax-only tagging Most of the work has not used the Semantic is offered by the developers of PropBank, the Web languages for encoding these annotations. Penn Treebank semantic extension. Kingsbury We briefly describe some of the work here and et al. (2002), who report: point out some similarities and differences with our own.

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 53

It was agreed that the highest priority, and FMA is a large ontology of the human anatomy the most feasible type of semantic annota- and is represented in a frame-based knowledge tion, is coreference and predicate argument representation language. Some of the challenges structure for verbs, participial modifiers and faced were the lack of equivalent OWL repre- nominalizations, and this is what is included sentations for some frame-based constructs and in PropBank. scalability, and computational issues with the current reasoners. Recently, there has been interest in ex- Schlangen, Stede and Bontas (2004) ploiting information extraction techniques to describe a system that combines a natural lan- text to produce annotations for the Semantic guage processing system with Semantic Web Web. However, few systems capable of deeper technologies to support the content-based stor- semantic analysis have been applied in Seman- age and retrieval of medical pathology reports. tic Web-related tasks. Information extraction The language component was augmented with tools work best when the types of objects that background knowledge consisting of a domain need to be identified are clearly defined, for ontology represented in OWL. The result example, the objective in MUC (Grishman supported the extraction of domain-specific & Sundheim, 1996) was to find the various information from natural language reports, named entities in text. Using OntoSem, we aim which was then mapped back into a Semantic to not only provide such information, but also Web representation. to convert the text meaning representation of TAP (R.V. Guha & McCool, 2003) is an natural language sentences into Semantic Web open source project led by Stanford University representations. and IBM Research aimed at populating the Se- A project closely related to our work was mantic Web with information by providing tools an effort to map the Mikrokosmos knowledge that make the Web a giant distributed Database. base to OWL (Beltran-Ferruz, Gonzalez-Caler TAP provides a set of protocols and conventions & Gervas, 2004a; 2004b). Mikrokosmos (Beale, that create a coherent whole of independently Nirenburg & Mahesh, 1995) is a precursor to produced bits of information, and a simple API OntoSem and was developed with the intent of to navigate the graph. Local, independently using it as an interlingua in machine transla- managed knowledge bases can be aggregated tion-related work. This project developed some to form selected centers of knowledge useful basic mapping functions that can create the class for particular applications. and specify the properties, and their Krueger, Nilsson, Oates and Finin (2004) respective domains and ranges. In our system developed an application that learned to extract we describe how facets, numeric attribute information from talk announcements from ranges can be handled, and more importantly, training data using an algorithm based on Stalker we describe a technique for translating the (Muslea, Minton & Knoblock, 2001). The ex- sentences from their Text Meaning Representa- tracted information was then encoded as markup tion to the corresponding OWL representation, in the Semantic Web language DAML+OIL, a thereby, providing semantically marked-up precursor to OWL. The results were used as part natural language text for use by other agents. of the ITTALKS system (Cost et al., 2002). Another translation effort involving Mikrokos- The Haystack Project has developed system mos produced the Omega Ontology (Philpot, (Hogue & Karger, 2005) enabling users to train Hovy & Pantel, 2005) by merging the content browsers to extract Semantic Web content from of Mikrokosmos with Wordnet with additional HTML documents on the Web. Users provide information sources. examples of semantic content by highlight- Dameron, Rubin and Musen (2005) de- ing them in their browser and then describing scribe an approach to representing the Foun- their meaning. Generalized wrappers are then dational Model of Anatomy (FMA) in OWL. constructed to extract information and encode

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 54 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 the results in RDF. The goal is to let individual question answering, information extraction, users generate Semantic Web content of interest and language generation. It is supported by a to them from text on Web pages. More recently, constructed world model, that is, a structured the project has developed a Firefox plug-in model of the classes of objects, properties, rela- Solvent that can be used to write screen scrapers tions and constraints that might be described to produce RDF data from Web pages. in text, encoded as a rich ontology. The ontol- The On-to-Knowledge project (Fensel, ogy is represented as a directed acyclic graph Harmelen & Akkermans, 2000) provides an using IS-A relations. It contains about 8,000 ontology-based system for knowledge manage- concepts that have on an average 16 properties ment. It uses Ontology-based Inference Layer per concept. At the topmost level the concepts (OIL) support for description logics (DL) and are: OBJECT, EVENT and PROPERTY. frame-based systems over the Web. OWL itself The OntoSem ontology is expressed in is an extension derived from OIL and DAML. a frame-based representation and each of the The OntoExtract and OntoWrapper sub-system frames corresponds to a concept. The concepts in On-to-knowledge were responsible for pro- are defined using a collection of slots that could cessing unstructured and structured text. These be linked using IS-A relations. A slot consists of systems were used to automatically extract a PROPERTY, FACET and a FILLER. ontologies and express them in Semantic Web representations. At the heat of OntoExtract is ONTOLOGY ::= CONCEPT+ an natural language processing system that CONCEPT ::= ROOT | OBJECT-OR- processes text to perform lexical and semantic EVENT | PROPERTY analysis. Finally, concepts found in free text SLOT ::= PROPERTY + FACET + are represented as an ontology. FILLER The Cyc project has developed a very large knowledge base of common sense facts and A property can be either an attribute, rela- reasoning capabilities. Recent efforts (Witbrock tion or ontology slot. An ontology slot is a special et al., 2004) include the development of tools type of property that is used to describe and for automatically annotating documents and organize the ontology. The ontology is closely exporting the knowledge in OWL. The authors tied to the lexicon with language independence also highlight the difficulties in exporting an achieved through the use of multiple lexicons, expressive representation like CycL into OWL one for each language with stored meaning due to lack of equivalent constructs. procedures’ that are used to disambiguate Finally, we mention the KIM platform word senses and references. Thus, keeping the (Kiryakov, Popov, Terziev, Manov & Ognya- concepts defined relatively few and making the noff, 2004) for automatic semantic annotation, ontology small. Text analysis relies on extensive indexing, and retrieval of documents. This static knowledge resources, some of which are system uses the GATE (Cunningham, 2002) lan- described in the following paragraphs. guage engineering system backed by structured The key knowledge base is the OntoSem ontologies in OWL to produce annotations. language-independent ontology, which cur- rently contains around 8,500 concepts, each of OntoSem which is described by an average of 16 proper- Ontological Semantics (OntoSem) is a theory ties. The ontology is populated by concepts that of meaning in natural language text (Nirenburg we expect to be relevant cross-linguistically. The & Raskin, 2001). The OntoSem environment current experiment was run on a subset of the is a rich and extensive tool for extracting and ontology containing about 6,000 concepts. representing meaning in a language independent The OntoSem lexicon is a knowledge way. The OntoSem system is used for a number resource whose entries contain syntactic and se- of applications such as machine translation, mantic information (linked through variables),

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 55

Figure 1. OntoSem is a large scale, sophisticated natural language understanding system that uses a custom frame-based knowledge representation system with an extensive ontology and lexicon.

as well as calls for procedural semantic routines dividual people (Colin Powell), governments when necessary. The semantic zone of an entry (Belgium), organizations (United Nations), most frequently refers to ontological concepts, groups (Tamil Tigers) and places (Upper either directly or with property-based modifi- Manhattan). cations, but also can describe words meaning OntoSem maintains a fact repository that extra-ontologically, for example, in terms of contains remembered instances of ontological modality, aspect or time (see McShane, Niren- concepts, for example, SPEECH-ACT-3366 burg, Beale & O’Hara, 2005 for an in-depth is the 3,366th instantiation of the concept discussion of the lexicon/ontology connection). SPEECH-ACT in the memory of a text-process- The current English lexicon contains approxi- ing agent. Figure 2 shows the representation mately 30,000 senses selected through corpus generated for the text “Collin Powell addressed analysis over its many years of development in the UN General Assembly yesterday. He said application to a number of domains. It includes that President Bush will visit the UN on Thurs- most closed-class items2 and frequent and poly- day.”’ The fact repository is not used in the semous3 verbs. The base lexicon is expanded current experiment but will provide valuable at runtime using an inventory of lexical (e.g., semantically annotated context information for derivational-morphological) rules. future experiments. OnsoSem’s onomasticon is its lexicon OntoSem’s knowledge representation and of proper names and contains approximately reasoning system is the Text Meaning Represen- 350,000 entries. These include names of in- tation (TMR) language. This is a frame-based

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 56 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

Figure 2. OntoSem constructs this text meaning representation (TMR) for the sentence “He (Colin Powell) asked the UN to authorize the war.”

system with specific features to support the pro- (TMRs). The basic steps in processing the duction of semantic descriptions from natural sentence to extract the meaning representation language text. Such descriptions are referred to is shown in Figure 1. The preprocessor deals as TMRs— text meaning representations. with identifying sentence and word boundaries, The OntoSem syntactic-semantic ana- part of speech tagging, recognition of named lyzer is a knowledge resource that performs entities, dates, and so forth. The syntactic preprocessing (tokenization, named-entity analysis phase identifies the various clause level and acronym recognition, etc.), morphologi- dependencies and grammatical constructs of cal, syntactic and semantic analysis, and the the sentence. The TMR is a representation of creation of TMRs. the meaning of the text and is expressed using OntoSem knowledge resources have been the various concepts defined in the ontology. constructed, enhanced, and maintained by The TMRs are produced as a result of semantic trained acquirers using a broad variety of ef- analysis, which uses knowledge sources such ficiency-enhancing tools, including graphical as lexicon, onomasticon, and fact repository to editors, enhanced search facilities, capabili- resolve ambiguities and time references. TMRs ties of automatically acquiring knowledge for have been used as the substrate for question- classes of entities on the basis of manually answering (Beale, Lavoie, McShane, Nirenburg acquired knowledge for a single representative & Korelsky, 2004), machine translation (Beale, of the class, for example,OntoSem’s DEKADE Nirenburg & Mahesh, 1995) and knowledge environment (McShane et al., 2005) facilitates extraction. Once the TMRs are generated, both knowledge acquisition and semi-automatic OntoSem2OWL converts them to an equivalent creation of gold standard TMRs, which also can OWL representation. be viewed as deep semantic text annotation. The learned instances from the text are The OntoSem environment takes as input stored in a fact repository, which essentially unrestricted text and performs different syn- forms the instance-level knowledge base of tactic and semantic processing steps to convert OntoSem. As an example the sentence: “He it into a set of Text Meaning Representations (Colin Powell) asked the UN to authorize the

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 57

war” is converted to the TMR shown in Figure cally converted. 3. A more detailed description of OntoSem and • In addition, OntoSem also is supported by its features is available in Nirenburg and Raskin a fact repository, which also is mapped to (2005) and from the Institute for language and OWL. information technologies (n.d.). OntoSem2OWL is a rule-based translation MAPPING OntoSem TO OWL engine that takes the OntoSem Ontology in its We have developed OntoSem2OWL (Java, LISP representation and converts it into its cor- Finin & Nirenburg, 2005) as a tool to convert responding OWL format. The following is an OntoSem’s ontology and TMRs encoded in example of how a concept ONTOLOGY-SLOT it to the OWL. This is described in OntoSem: enables an agent to use OntoSem’s environment to extract semantic information from natural (make-frame definition (is-a (value (common ontology-slot))) language text. Ontology mapping deals with (definition (value (common "Human defining functions that describe how concepts readable explanation for a concept"))) in one ontology are related to the concepts in (domain (sem (common all)))) some other ontology (Dou, McDermott, & Qi, 2005). Ontology translation process converts the Its corresponding OWL representation sentences that use the source ontology into their is: corresponding representations in the target on- tology. In converting the OntoSem Ontology to OWL, we are performing the following tasks: with mapping the semantics of OntoSem into a corresponding OWL version. "Human readable explanation for a concept" • Once the ontology is translated the sen- tences that use the ontology are syntacti-

Figure 3. OntoSem constructs this text meaning representation (TMR) for the sentence “He (Colin Powell) asked the UN to authorize the war.”

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 58 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

that the property slot of an instance can have as fillers. OntoSem domains are converted to rdfs:domain and ranges are converted to rdfs: range. For some of the properties OntoSem We will briefly describe how each of the also defines inverses using the INVERSE-OF OntoSem features are mapped into their OWL relationship. It can be directly mapped to the versions: classes, properties, facets, attribute owl:inverseOf relation. ranges and TMRs. In case there are multiple concepts de- fined for a particular domain or range,- On Handling Classes toSem2OWL handles it using owl:unionOf New concepts are defined in OntoSem using feature. For example: make-frame and related to other concepts using the is-a relation. Each concept also may have a (make-frame controls corresponding definition. Whenever the system (domain encounters a make-frame it recognizes that this (sem (common physical-event is a new concept being defined. OBJECT or physical-object EVENT are mapped to owl:Class while, PROP- social-event ERTIES are mapped to owl:ObjectProperty. social-role))) (range (sem (common actualize ONTOLOGY-SLOTS are special properties artifact that are used to structure the ontology. These natural-object also are mapped to owl:ObjectProperty. Object social-role))) definitions are created usingowl:Class and the (is-a (value (common relation))) IS-A relation is mapped using owl:subClassOf. (inverse (value (common controlled-by))) Definition property in OntoSem has the same (definition (value (common function as rdfs:label and is mapped directly. “A relation which relates concepts to Table 1 shows the usage of each of these fea- what they can control”)))) tures in OntoSem. is mapped to Handling Properties Whenever the level-one parent of a concept is of the type PROPERTY, it is translated to owl: ObjectProperty. Properties also can be linked to other properties using the IS-A relation. In case of properties, the IS-A relation maps to the owl:subPropertyOf. Most of the properties also contain the domain and the range slots. Domain defines the concepts to which the property can be applied and the ranges are the concepts

Table 1. This table shows how often each of the Class-related constructs were used in OntoSem’s ontology

case times used mapped using 1 total Class/Property make-frame 8199 owl:class or owl:ObjectProperty 2 Definition 8192 rdfs:label 3 is-a relationship 8189 owl:subClassOf

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 59

restricted (OWL Web ontology language for services, n.d.). • RELAXABLE-TO: This facet indicates that the value for the filler can take a cer- tain type. It is a way of specifying typical violations. One way of handling RELAX- ABLE-TO is to add this information in an annotation and also add this to the classes present in the owl:Restriction. • DEFAULT: OWL provides no clear way of representing defaults, since it only supports monotonic reasoning and this is one of the issues that have been expressed for future extensions of OWL language (Horrocks, “A relation which relates concepts to Patel-Schneider & van Harmelen, 2003). what they can control” These issues need to be further investigated in order to come up with an appropriate equivalent representation in OWL. One ap- Handling Facets proach is to use rule languages like SWRL OntoSem uses facets as a way of restricting the (Horrocks, Patel-Schneider, Boley, Tabet, fillers that can be used for a particular slot. In Grosof & Dean, 2004) to express such OntoSem there are six facets that are created defaults and exceptions. Another approach and one, inv that is automatically generated. would be to elevate facets to properties. Table 3 shows the different facets and how This can be done by combining the prop- often they are used in OntoSem. erty-facet to make a new property. Thus a concept of an apple that has a property • SEM and VALUE: These are the most color with the default facet value red could commonly used facets. OntoSem2OWL be translated to a new property in the owl handles these identically and maps them version of the frame where the property using an owl:Restriction on a particular name is color-default, and it can have a property. With an owl:Restriction we can value of red. locally restrict the type of values a property • DEFAULT-MEASURE: This facet indi- can take unlike rdfs:domain or rdfs:range cates what the typical units of measure- which specifies how the property is globally ments are for a particular property. This

Table 2. shows some of the key properties used in OntoSem along with the frequency of use and the RDFS or OWL construct used to encode them

case frequency mapped using 1 domain 617 rdfs:domain 2 domain with not facet 16 owl:disjointWith 3 range 406 rdfs:range 4 range with not facet 5 owl:disjointWith 5 inverse 260 owl:inverseOf

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 60 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

can be handled by creating a new property named MEASURING-UNITS or adding this information as a rule. • NOT: This facet specifies that certain values are not permitted in the filler of the slot in which this is defined. NOT facets Converting Text Meaning can be handled using the owl:disjointWith Representations feature. Once the OntoSem ontology is converted into • INV: This facet need not be handled since its corresponding OWL representation, we can this information already is covered using now translate the text meaning representations the inverse property, which is mapped to into statements in OWL. In order to do this we owl:inverseOf. can use the namespace defined as the OntoSem ontology and use the corresponding concepts Although DEFAULT and DEFAULT- to create the representation. The TMRs also MEASURE provides useful information, it contain additional information such as ROOT- can be noticed from Table 3 that relatively WORDS and MODALITY. These are used they are used less frequently. Hence, in our use to provide additional details about the TMRs cases, ignoring these facets does not lose a lot and are added to the annotations. In addition of information. TMRs also contain certain triggers for meaning procedures such as TRIGGER-REFERENCE Handling Attribute Ranges and SEEK-SPECIFICATION. These actually Certain fillers also can take numerical ranges are procedural attachments and, hence, cannot as values. For instance the property age can be directly mapped into the corresponding take a numerical value between 0 and 120 for OWL versions. instance. Additionally comparison functions such as <, >, and <> also could be used in Sentence: Ohio Congressman Arrives in TMRs. Attribute ranges can be handled using Jordan XML Schema (Fallside & Walmsley, 2004) in OWL. The following is an example of how TMR the property age could be represented in OWL (COME-1740 using xsd:restriction: (TIME (VALUE (COMMON (FIND-ANCHOR- TIME))))

Table 3 shows how often each of the facets was used in OntoSem’s ontology.

case frequency mapped using 1 value 18217 owl:Restriction 2 sem 5686 owl:Restriction 3 relaxable-to 95 annotation 4 default 350 not handled 5 default-measure 612 not handled 6 not 134 owl:disjointWith 7 inv 1941 not required

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 61

(DESTINATION (VALUE (COMMON CITY- (INSTANCE-OF (VALUE (COMMON CITY)))) 1740))) (AGENT (VALUE (COMMON POLITICIAN- TMR in OWL 1740))) (ROOT-WORDS (VALUE (COMMON (AR- RIVE)))) JORDAN (INSTANCE-OF (VALUE (COMMON COME)))) TMR in OWL

CHALLENGES to map a frame-based system like OntoSem to OWL. This section discusses some of the important issues that pertain to mapping of TMR any frame-based system to Web representation (POLITICIAN-1740 such as OWL. (AGENT-OF (VALUE (COMMON COME- One of the challenges in building such 1740))) a system is to bridge the gap between the ;; Politician with some relation to Ohio. A knowledge representation features that are ;; later meaning procedure should try to find used by natural language processing systems ;; that the relation is that he lives there. (RELATION (VALUE (COMMON PROVINCE- and Semantic Web technologies. Some NLP 1740))) systems such as OntoSem are supported by (MEMBER-OF (VALUE (COMMON CON- frame-based representations to construct a GRESS))) model or ontology of the world. Such an ontol- (ROOT-WORDS (VALUE (COMMON (CON- ogy is then used to extract and represent meaning GRESSMAN)))) (WORD-NUM (VALUE (COMMON 1))) from natural language text. Since OntoSem is (INSTANCE-OF (VALUE (COMMON POLITI- used for natural language processing applica- CIAN)))) tions, it has a way of expressing defaults and exceptions. However there is no clear way of TMR in OWL mapping defaults to OWL since OWL does not significant as it may appear, since these features relations between them. For instance, a default if no specific nationality of a person is given, they are US Nationals. Systems such as SemNews TMR will mostly deal with instance data where these (CITY-1740 features are not used, so the fact that information (HAS-NAME (VALUE (COMMON “JOR- about the instance is represented in a system DAN”))) that does not handle defaults does not affect the (ROOT-WORDS (VALUE (COMMON (JOR- DAN)))) ability of OntoSem to handle that data. (WORD-NUM (VALUE (COMMON 4))) Knowledge sharing is a critical factor (DESTINATION-OF (VALUE (COMMON COME- to enable agents on the Semantic Web to use 1740))) this information extracted from NL text or be

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 62 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 able to provide information that can be used tion of the translation process is that currently it by NLP tools. This requires mapping across does not support any of these procedural attach- different ontologies and translating sentences ments. It would be interesting to look into ways from one representation to another. KQML in which this information could be additionally (Finin, Fritzson, McKay & McEntire, 1994) incorporated either into the reasoner or the and KIF (Genesereth, 1991) were two such knowledge base of the agent itself. attempts that developed protocols to enable sharing of large scale knowledge bases. Our EVALUATION system maps the OntoSem ontology to OWL There are several dimensions along which this and thus makes the framework sharable with research can be evaluated. One possible evalu- other agents on the Web. ation is that of the underlying NLP system On- Ambiguity also is an issue when dealing toSem. This has been reported on in (Nirenburg, with NL text. Human language can have ambi- Beale & McShane, 2004) and (Nirenburg & guity at both syntactic and semantic level. An Kavi, 1996). We briefly summarize these below. example often discussed is anaphora resolu- However, we note that the primary purpose of tion, which is the problem of identifying and this paper is not to report on OntoSem. The resolving different references to the same named focus here is to build a system that can bridge entity. OntoSem provides ways for handling between the Semantic Web and semantically such references and resolves these references, rich NLP systems, such as OntoSem, and so not just within a single document but across the evaluation is a measure of how well we all the facts in its repository. This could have can translate. interesting applications in the Semantic Web The primary output of OntoSem is in the domain, especially in resolving ambiguities form of TMRs as described above. To evaluate inherent in FOAF (The friend of a friend project, these semantic representations of natural lan- n.d.) descriptions and data. guage, gold standards were created throughout While some of the basic mapping rules the evaluation process, and a collection of cor- have been developed, more needs to be done to rectness metrics were then applied to these gold identify and represent cardinalities, transitive, standards along with the automatically created symmetric and inverse functional properties. TMRs. The intermediate outputs of the various These issues are being investigated. stages of OntoSem’s processing (preprocess- There also were interesting challenges ing, syntactic analysis, and semantic analysis) while mapping a large ontology such as On- were inspected and corrected to produce gold toSem. Although we needed the capabilities standards; the amount of correction required of OWL Full to represent a more complete was tabulated to produce an evaluation. subset of OntoSem’s features, the result was The emphasis on evaluation was placed too large for OWL Full reasoners to process. on the primary function of OntoSem, semantic One suggestion is to build mappings at dif- analysis. Four calculation metrics were com- ferent levels of expressivity, for example, we bined to produce an overall evaluative figure could have different versions of the OntoSem for any given textual analysis (Nirenburg et ontology for OWL Lite, DL and Full. Another al., 2004): approach would be to investigate the possibil- ity of partitioning the ontology into different • Match/Mismatch of TMR elements. How smaller ontologies. many TMR elements did OntoSem properly OntoSem uses procedural attachments with identify, and how many were in error? concepts in the ontology and also in the TMRs. • Weighted Score for WSD Complexity. These are useful in performing tasks such as Of the mismatched TMR elements, how reference resolution, finding the relative time ambiguous was the mismatched lexical reference, and so forth. An important implica- entry?

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 63

• Weighted Score for WSD Distance. Of the services make some semantic checks in mismatched TMR elements, how similar addition to syntactic ones. A full semantic are they to the target element, using the validity check is quite difficult and, to our OntoSearch distance metric (Onyshkevych, knowledge, no system attempts one, even 1997)? for decidable subsets of OWL. • Semantic Dependency Determination. • Meaning preservation. Is the meaning of How many semantic dependencies in the the generated OWL representation identi- TMR were properly matched, and how cal to that of the OntoSem representation? many were in error? This is a very difficult question to answer, or even to formulate, given the vast dif- Combining the results of these four met- ferences between the two knowledge rics, an automatically generated TMR can be representation systems. However, we can compared to a gold standard TMR, producing easily identify some constructs, such as an evaluation of OntoSem’s performance. defaults, that clearly cannot be captured Our translation model involves translat- in OWL, leading to a loss of information ing ontologies and instances (facts) in both and meaning when going from OntoSem directions: from OntoSem to an OWL version to OWL. of the OntoSem Ontology and from the OWL • Feature minimization. OWL is a complex version of OntoSem into OntoSem. For the representation language, some of whose translation to be truly useful, it also should features make reasoning difficult. A number involve the translation between the OWL ver- of levels of complexity can be identified sion of OntoSem’s ontologies and facts, and the (e.g., the OWL species: Lite, DL and Full). ontologies in common use on the Semantic Web In general, we would like the translation (e.g., FOAF); (The friend of a friend project, service to not use a complex feature un- n.d.), Dublin Core (Miller & Brickley, 2001), less it is absolutely required. Doing so will OWL-S (OWL Web ontology language, 2004), reduce the complexity of reasoning with OWL-time (Hobbs & Pan, 2004), etc.). the generated ontology. Since our current work has concentrated • Translation complexity. What are the on the initial step of translating from OntoSem speed and memory requirements of the to OWL, we will enumerate some of the issues translation. Since, in general, a translation from that perspective. Translating in the oppo- might require reasoning, this could be an site direction raises similar, though not identical, issue. issues. The chief translation measures we have considered are as follows: We report on some preliminary evaluation metrics covering the basic OntoSem to OWL • Syntactic correctness. Does the transla- translation. tion produce syntactically correct RDF and OntoSem2OWL uses the Jena Semantic OWL? The resulting documents can be Web Framework (McBride, 2001) internally checked with appropriate RDF and OWL to build the OWL version of the Ontology. validation systems. The ontologies generated were successfully • Semantic validity. Does the translation validated using two automated RDF validators: produce RDF and OWL that is semanti- the W3C’s RDF Validation Service (RDF vali- cally well formed? An RDF or OWL file dation service, n.d.) and the WonderWeb OWL can be syntactically valid yet contain Ontology Validator (Wonderweb owl ontology errors that violate semantic constrains in validator, n.d.). the language. For example, an OWL class There were a total of about 8,000 concepts should not be disjoint with itself if it has in the original OntoSem ontology. The total any instances. Several OWL validation number of triples generated in the translated

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 64 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 version was just over 100,000. These triples 2005) OWL editor for browsing, editing, and included a number of blank nodes—-RDF further validation. nodes representing objects without identifiers Based on our preliminary results, we found that are required due to RDF’s low-level triple that OntoSem2OWL is able to translate most of representation. the OntoSem ontology into a form that is syntac- Since the generated ontologies required the tically valid and, in so far as current validators use of OWL’s union and inverseOf features, the can tell, free of semantic problems. There are results fall in the OWL Full class in terms of some problems in representing defaults and the level of expressivity. correctly mapping some of the facets, however, Using the Jena API it typically takes be- these are used relatively less frequently. tween 10 and 40 seconds to build the model, depending upon the reasoner employed. The APPLICATIONS computation of transitive closure and basic One of the motivations for integrating language RDF schema differencing takes approximately understanding agents into the Semantic Web is 10 seconds on a typical workstation. The OWL to enable applications to use the information Micro reasoner takes about 40 seconds, while published in free text along with other Semantic OWL Full reasoner fails, possibly due to the Web data. SemNews4 (Java, Finin & Nirenburg, large search space. The OntoSem ontology in its 2006) is a semantic news service that monitors OWL representation can be successfully loaded different RSS news feeds and provides struc- into the Swoop (Kalyanpur, Parsia & Hendler, tured representations of the meaning of news

Figure 4. The SemNews application, which serves as a testbed for our work, has a simple ar- chitecture. RSS (1) from multiple sources is aggregated and then processed by the OntoSem (2) text processing environment. This results in the generation of TMRs (3) and updates to the fact repository (4). The Dekade environment (5) can be used to edit the ontology and TMRs. Onto- Sem2OWL (6) converts the ontology and TMRs to their corresponding OWL versions (7,8). The TMRs are stored in the Redland triple store (9) and additional triples inferred by Jena (10).

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 65

articles found in them. As new articles appear, other applications and users to perform semantic SemNews extracts the summary from the RSS queries over the documents. This enables them description and processes it with OntoSem. The to search for information that would otherwise resulting TMR is then converted into OWL. not be easy to find using simple keyword based This enables us to semantacize the RSS content search. The TMRs are also indexed by the and provide live and up-to-date content on the Swoogle Semantic Web Search system (Ding Semantic Web. The prototype application also et al., 2004). provides a number of interfaces that allow users The following are some examples of queries and agents to query over the meaning representa- that go beyond simple keyword searches. tion of the text as expressed in OWL. Figure 4 shows the basic architecture of • Conceptually searching for content. SemNews. The RSS feeds from different news Consider the query—Find all stories that sources are aggregated and parsed. These RSS have something to do with a place and a feeds also are rich in useful meta-data such as terrorist activity.. Here the goal is to find information on the author, the date when the the content or the story, but essentially by article was published, the news category, and means of using ontological concepts rather tag information. These form the explicit meta- than string literals. So for example, since data that is provided by the publisher. However we are using the ontological concepts here, there is a large portion of the RSS field that is we actually could benefit from resolving essentially plain text and does not contain any different kinds of terror events such as semantics in them. It would be of great value if bombing or hijacking to a terrorist-activity this text available in description and comment concept. fields, for example, could besemantacized . By • Context-based querying. Answering using Natural Language Processing tools such as the query—Find all the events in which OntoSem we can convert natural language text ‘George Bush’ was a speaker—involves into a structured representation thereby adding finding the context and relation in which additional metadata in the RSS fields. Once a particular concept occurs. Using named processed, it is converted to its Text Meaning entity recognition alone, one can only find Representation. OntoSem also updates its fact that there is a story about a named entity repositories to store the information found in the of the type person/human, however it is sentences processed. These facts extracted help not directly perceivable as to what role the the system in its future text analysis tasks. entity participated in. Since OntoSem uses An optional step of correction of the TMRs deeper semantics, it not only identifies the could be performed by means of the Dekade various entities but also extracts the rela- environment (English, 2006). This is helpful in tions in which these entities or instances correcting cases where the analyzers are not able participate, thereby providing additional to correctly annotate parts of the sentence. Cor- contextual information. rections can be performed at both the syntactic • Reporting facts. To answer a query processor and the semantic analyzer phase. The like—Find all politicians who traveled to Dekade environment could also be used to edit Asia— requires reasoning about people’s the OntoSem ontology and lexicons or static roles and geography. Since we are using knowledge sources. ontological concepts rather than plain As discussed in the previous sections, the text and we have certain relations like meaning in these structured representations, also meronomy/part-of we could recognize that known as Text Meaning Representations, can Colin Powel’s trip to China will yield an be preserved by mapping them to OWL/RDF. answer. The OWL version of a document’s TMRs is • Knowledge sharing on the Semantic stored in a Redland-based triple store, allowing Web. Knowledge sharing is critical for

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 66 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

agents to reason on the Semantic Web. user may also query the triple store directly, us- Knowledge can be shared by means of ing RDQL query language as shown in Figure using a common ontology or by defining 6. Additionally the system also can publish the mappings between existing ontologies. RSS feed of the query results allowing users or One of the benefits of using a system like agents to easily monitor new answers. This is SemNews is that it provides a mechanism a useful way of handling standing queries and for agents to populate various ontolo- finding news articles that satisfy a structured gies with live and updated information. query. While FOAF has become a very popular Developing SemNews provided a per- mechanism to describe a person’s social spective on some of the general problems network, not everyone on the Web has a of integrating a mature language processing FOAF description. By linking the FOAF system like OntoSem into a Semantic Web ontology to OntoSem’s ontology, we could oriented application. While doing a complete populate additional information and learn and faithful translation of knowledge from new instances of foaf:person, even though OntoSem’s native meaning representation these were not published explicitly in foaf language into OWL is not feasible, we found files but as plain text descriptions in news the problems to be manageable in practice for articles. several reasons. First, OntoSem’s knowledge representa- The SemNews environment also provides tion features that were most problematic for a convenient way for the users to query and translation are not used with great frequency. browse the fact repository and triple store. Fig- For example, the default values, relaxable range ure 5 shows a view that lists the named entities constraints and procedural attachments were found in the processed news summaries. Using used relatively rarely in OntoSem’s ontology. an ontology viewer the user can navigate through Thus shortcomings in the OWL version of the news stories conceptually while viewing the OntoSem’s ontology are limited and can be instances that were found. The fact repository circumscribed. explorer provides a way to view the relations Second, the goal is not just to support trans- between different instances and see the news lation between OntoSem, and a complete and stories in which they were found. An advanced faithful OWL version of OntoSem. It is unlikely

Figure 5 Various types of named entities can be identified and explored in SemNews.

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 67

Figure 6 This SemNews interface shows the results for the query—Find all humans and what are they the beneficiaryof

that most Semantic Web content producers or Finally, with a focus on importing and ex- consumers will use OntoSem’s ontology. Rather, porting instances and assertions of fact, we can we expect common consensus ontologies like require these to be generated using the native FOAF, Dublin Core, and SOUPA to emerge representation and reasoning system. Rather and be widely used on the Semantic Web. The than exporting OntoSem’s concept definitions real goal is, thus, to mediate between OntoSem and a handful of facts to OWL and then using and a host of such consensus ontologies. We an OWL reasoner to derive the additional facts believe that these translations between OWL which follow, we can require OntoSem to pre- ontologies will, of necessity, be inexact and, compute all of the relevant facts. Similarly, thus, introduce some meaning loss or drift. when importing information from an OWL So, the translation between OntoSem’s native representation, the complete model can be representation and the OWL form will not be generated, and just the instances and assertions the only lossy one in the chain. translated and imported. Third, the SemNews application generates Language understanding agents could not and exports facts, rather than concepts. The only empower Semantic Web applications but prospective applications coupling a language also could create a space where humans and understanding agent, and the Semantic Web that NLP tools would be able to make use of existing we have examined share this focus on import- structured or semi-structured information avail- ing and exporting instance level information. able. The following are a few of the example To some degree, this obviates many translation application scenarios. issues, since these mostly occur at the concept level. While we may not be able to exactly ex- Semantic Annotation and Metadata press OntoSem’s complete concept of a book’s Generation author in the OWL version, we can translate the The growing popularity of folksonomies and simple instance level assertion that a known social bookmarking tools such as del.icio.us individual is the author of a particular book have demonstrated that light-weight tagging and further translate this into the appropriate systems are useful and practical. Metadata also triple using the FOAF and Dublin Core RDF is available in RSS and ATOM feeds, while ontologies. some use the Dublin Core ontology. Some NLP and statistical tools such as SemTag (Dill,

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 68 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

Eiron, Gibson, Gruhl & Guha, 2003) and the in one document is the same as the instance TAP (Guha et al., 2003) project aim to generate described in another document. semantically annotated pages from already ex- isting documents on the Web. Using OntoSem in Provenance and Trust the SemNews framework, we have been able to Provenance involves identifying source of demonstrate the potential of large scale semantic information and tracking the history of where annotation and automatic metadata generation. the information came. Trust is a measure of Figure 3 shows the graphical representation of the degree of confidence one has for a source the TMRs, which also are exported in OWL of information. While these are somewhat hard and stored in a triple store. to quantify and are a function of a number of different parameters, there can be significant Gathering Instances indicators of trust and provenance already Ontologies for the Semantic Web define the present in the text and could be extracted by concepts and properties that the agents could the agent. News reports typically describe some use. By making use of these ontologies along of the provenance information, as well as other with instance data agents can perform useful metadata that can effect trust, such as temporal reasoning tasks. For example, an ontology information. This type of information would be could describe that a country is a subclass of a important in applications where agents need to geopolitical entity and that a geopolitical entity make decisions based on the validity of certain is a subclass of a physical entity. Automatically information. generating instance data from natural language text and populating the ontologies could be an Reasoning important application of such technologies. For While currently reasoning on the Semantic Web example, in SemNews you can not only view is enabled by using the ontologies and Semantic the different named entities as shown in Figure Web documents, there could be potentially vast 5, but you also can explore the facts found in knowledge present in natural language. It would different documents about that named entity. be useful to build knowledge bases that could As shown in Figure 7, we could start browsing not only reason based on explicit information from an instance of the entity type ‘NATION’ available in them, but also could use information and explore the various facts that were found extracted from natural language text to augment in the text about that entity. Since OntoSem their reasoning. One of the implications of using also handles referential ambiguities, it would the information extracted from natural language be able to identify that an instance described text in reasoning applications is that agents on

Figure 7 Fact repository explorer for the named entity ‘Mexico’. Shows that the entity has a relation ‘nationality-of’ with CITIZEN-235. Fact repository explorer for the instance CITIZEN- 235 shows that the citizen is an agent-of an ESCAPE-EVENT.

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 69

the Semantic Web would need to reason in pres- is still to assist the human users in their tasks. ence of inconsistent or incomplete annotations Using technologies from question answering as well. Reasoning could be supported from not and language generation, it would be helpful to just Semantic Web data and natural language provide capabilities through which users can in- text but also based on provenance. Developing teract with their agent through natural language, measures for provenance and trust also would thus reducing the cognitive load in formulating help in deciding the degree of confidence that the task in a machine-readable format. the reasoning engine may have in the using certain assertions for reasoning. USING THE WEB FOR KNOWLEDGE ACQUISTION Ontology Enrichment In this paper we have reported on SemNews, Knowledge acquisition is one of the most a system, which uses OntoSem and the vari- expensive steps in developing large scale ous processors and knowledge repositories Semantic Web applications. Even within the available, along with the Web, to enhance the framework of OntoSem, the OntoSem ontology Semantic Web with knowledge learned through has been developed and perfected over years text analysis of RSS news feeds. However, we of research in linguistics, NLP and knowledge also can use the Web as a source for knowl- representation. In order to make the task of a edge acquisition. This automated knowledge knowledge engineer easier, we could possibly acquisition can be done in a few ways. First, use the existing ontologies on the Semantic when OntoSem encounters an unexpected input Web to suggest new concepts, relations, or we can query the Web for documents related to even properties. As an example, consider the such unknown lexical or ontological concepts. concept of fish, in OntoSem there are about By processing the documents containing this four different varieties of fish that have been concept, we can learn its meaning. Using the defined. We could now use a semantic search Web as a corpus, we have been able to auto- engine such as Swoogle (Ding et al., 2004) to matically generate ontological concepts to some find new types of fish and suggest some of the degree of accuracy, when given a target word properties that could be used in order to describe (English et al., 2007). As an example, Table 4 fish in the ontology. shows some of the properties for the concept ‘Hobbit’ learned by querying the Web. Natural Language Interface to From this data we learn that both humans Semantic Web and hobbits (whatever they might be) live, While the Semantic Web is primarily for use by create things, and participate in elections. They machines, and the information available on it is both also can be rescued and killed. This kind in machine understandable format, the end goal of data can be used to generate a hypothesis

Table 4. This table compares selected properties of the concept ‘HUMAN’ to properties for the concept ‘HOBBIT’ automatically from the Web

Ontological Property Values in HUMAN Values in HOBBIT LIVE, CREATE-ARTIFACT, LIVE, CREATE-ARTIFACT, AGENT-OF ELECT, READ ELECT THEME-OF RESCUE, MARRY, KILL RESCUE, KILL HAS-OBJ-AS-PART HEAD na

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 70 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 that hobbits are a class of sentient beings not material on nearly everything imaginable, we unlike humans. will soon be able to close the loop. The second method is to import concepts and instance data available on the Semantic Web. CONCLUSION The long-term goal of our ongoing research Natural language processing agents can provide is indeed to automate knowledge acquisition a service by analyzing text documents on the and learning by reading. Specifically, we are Web and publishing Semantic Web annotations working toward creating a system (an intel- and documents that capture aspects of the text’s ligent agent) that will be able to extract from meaning. Their output will enable many more text formal representations ready for use in agents to benefit from the knowledge and facts automatic reasoning systems. These structures expressed in the text. Similarly, language pro- will reflect both instances and types of events, cessing agents need a wide variety of knowledge objects, relations, and agents’ attitudes in the and facts to correctly understand the text they real world. The reasoning that such agents will process. Much of the needed knowledge may be able to perform will support both general be found on the Web already encoded in RDF problem solving and, specifically, knowledge- and OWL and thus easy to import. based natural language processing, that is, the One of the key problems to be solved in very process through which the agent learns order to integrate language understanding agents from text. into the Semantic Web is translating knowledge Importing and integrating new conceptual and information from their native representation knowledge from the Semantic Web remains a systems to Semantic Web languages. We have challenging problem. It involves addressing not described initial work aimed at preparing the only the ontology mapping problem, but also the the OntoSem language understanding system difficulties in translating knowledge from OWL to be integrated into applications on the Web. to a different knowledge representation system OntoSem is a large scale, sophisticated natural like OntoSem. Importing instance-level data, language understanding system that uses a however, can be done with a very pragmatic custom frame-based knowledge representation approach and will result in information that can system with an extensive ontology and lexicon. be extremely useful to a language processing These have been developed over many years system. For example, the Semantic Wikipedia and are adapted to the special needs of text project (Volkel, Krotzsch, Vrandecic, Haller & analysis and understanding. Studer, 2006) exposes some of the information We have described a translation system, found in Wikipedia in RDF using a small set OntoSem2OWL that is being used to translate of ontologies. This can be easily mined, for OntoSem’s ontology into the Semantic Web example, to enrich OntoSem’s onomasticon language OWL. While the translator is not able by mapping key classes and properties into the to handle all of OntoSem’s representational OWL version of OntoSem and ultimately to features, it is able to translate a large and useful OntoSem’s native representation system. subset. The translator has been used to develop In any case, the benefit is clear. Using SemNews as a prototype of a system that reads the Web, and the Semantic Web as a corpus, summaries of Web news stories and publishes OntoSem can learn new concept instances, OntoSem’s understanding of their meaning on ontological concepts, and lexical entries by the Web encoded in OWL. reading. As the effect of increased static knowl- edge resources on OntoSem is that of producing References better TMRs, the benefit is circular. The more OntoSem learns, the better it becomes at learn- Baker, C., Fillmore, C., & Lowe, J. (1998). The ing. By using a fully open corpus, containing Berkeley FrameNet Project. Proceedings of the

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 71

36th Conference on Computational Linguistics- engine for the Semantic Wweb. Proceedings of Volume 1. Morristown, NJ, USA (pp. 86-90). the Thirteenth ACM Conference on Information Association for Computational Linguistics.. and Knowledge Management. Beale, S., Lavoie, B., McShane, M., Nirenburg, S., Dou, D., McDermott, D., & Qi, P. (2005). Ontolo- & Korelsky, T. (2004). Question answering gies for agents: Theory and experiences, Chap. using ontological semantics. Proceedings of Ontology translation by ontology merging and ACL-2004 Workshop on Text Meaning and automated reasoning (pp. 73-94). Birkhäuser Interpretation (pp. 41-48). Association for Basel. Computational Linguistics. English, J. (2006). DEKADE II: An environment for Beale, S., Nirenburg, S., & Mahesh, K. (1995). Se- development and demonstration in natural lan- mantic analysis in the mikrokosmos machine guage processing. (Master’s thesis, University translation project. Proceedings of the 2nd of Maryland, Baltimore County). Symposium on Natural Language Processing (pp. 297-307). English, J., & Nirenburg, S. (2007). Ontology learning from text using automatic ontological-semantic Beltran-Ferruz, P., Gonzalez-Caler, P., & P. Gervas text annotation and the Web as the corpus. (2004a). Converting frames into OWL: Pre- AAAI Spring Symposium on Machine Reading. paring Mikrokosmos for linguistic creativity. AAAI Press. LREC Workshop on Language Resources for Linguistic Creativity. Fallside, D. C., & Walmsley, P. (2004). XML schema part 0: Primer second edition (W3C recommen- Beltran-Ferruz, P., Gonzalez-Caler, P., & Gervas, P. dation). W3C. http://www.w3.org/TR/2004/ (2004b, July). Converting Mikrokosmos frames REC-xmlschema-0-20041028/. into description logics. RDF/RDFS and OWL in Language Technology: 4th ACL Workshop on Farwell, D., Helmreich, S., Dorr, B., Habash, N., NLP and XML. Association for Computational Reeder, F., Miller, K., Levin, L., Mitamura, T., Linguistics. Hovy, E., Rambow, O., et al. (2004). Interlin- gual annotation of mMultilingual text corpora. Cost, R. S., Finin, T., Joshi, A., Peng, Y., Nicholas, Proceedings of the North American Chapter of C., Soboroff, I., Chen, H., Kagal, L., Perich, the Association for Computational Linguistics F., Zou, Y., & Tolia, S. (2002). ITtalks: A case Workshop on Frontiers in Corpus Annotation study in the Semantic Web and DAML+OIL. (pp. 55-62). Association for Computational IEEE Intelligent Systems, 17 (1), 40-47. Linguistics. Cunningham, H. (2002). GATE, a general architec- Fensel, D., van Harmelen, F., & Akkermans, H. ture for text engineering. Computers and the (2000). Ontoknowledge: Ontology-based tools Humanities, 36 (2), 223-254. for knowledge management. Proceedings of the eBusiness and eWork Conference. Springer Dameron, O., Rubin, D. L., & Musen, M. A. (2005). Verlag. Challenges in converting frame-based ontology into OWL: the Foundational Model of Anatomy Finin, T., Fritzson, R., McKay, D., & McEntire, R. case-study. Proceedings of the American Medi- (1994). KQML as an Agent Communication cal Informatics Association Conference. Language. In N. Adam, B. Bhargava, & Y. Yesha (Eds.), Proceedings of the Third International Dill, S., Eiron, N., Gibson, D., Gruhl, D., & Guha, R. Conference on Information and Knowledge (2003). Semtag and seeker: Bootstrapping the Management (CIKM’94). Gaithersburg, MD, Semantic Web via automated semantic annota- USA. (pp. 456-463). ACM Press. tion. Proceedings of the Twelfth International Conference on World Wide Web. New York, NY, Genesereth, M. (1991). Knowledge interchange USA. (pp. 178-186). ACM Press. format. Proceedings of the Second International Conference on the Principles of Knowledge Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Representation and Reasoning (pp. 599-600). Peng, Y., Reddivari, P., Doshi, V. C., & Sachs, Morgan Kaufmann. J. (2004). Swoogle: A search and metadata

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 72 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

Gildea, D., & Jurafsky, D. (2002). Automatic labeling bank. Proceedings of the Human Language of semantic roles. Computational Linguistics, Technology Conference. 28 (3), 245-288. Kiryakov, A., Popov, B., Terziev, I., Manov, D., & Grishman, R., & Sundheim, B. (1996). Message Ognyanoff, D. (2004). Semantic annotation, understanding conference-6: a brief history. indexing, and retrieval. Journal of Web Seman- Proceedings of the 16th Conference on Com- tics, 2 (1), 49-79. putational Linguistics (pp. 466-471). Krueger, W., Nilsson, J., Oates, T., & Finin, T. Hobbs, J. R., & Pan, F. (2004). An ontology of time (2004). Agent mediated knowledge manage- for the semantic web. ACM Transactions on ment, Chap. Automatically Generated DAML Asian Language Processing (TALIP), 3 (1), Markup for Semistructured Documents. (LNCS 66-85. Special issue on Temporal Information 4150).Springer. Processing. McBride, B. (2001). Jena: Implementing the RDF Hogue, A., & Karger, D. R. (2005, May). Thresher: model and syntax specification.Proceedings of Automating the unwrapping of semantic content the WWW2001 Semantic Web Workshop. from the Wworld Wide Web. Proceedings of the Fourteenth International World Wide Web McShane, M., Nirenburg, S., Beale, S., & O’Hara, Conference. T. (2005, June). Semantically rich human-aided machine annotation. Proceedings of the Work- Horrocks, I., Patel-Schneider, P. F., Boley, H., Ta- shop on Frontiers in Corpus Annotations II: Pie bet, S., Grosof, B., & Dean, M. (2004). Swrl: in the Sky. Ann Arbor, Michigan. (pp. 68-75). A Ssemantic Web rule language combining Association for Computational Linguistics. owl and ruleml. World Wide Web Consortium Member Submission. Miller, D. B. E., & Brickley, D. (2001). Expressing simple dublin core in RDF / XML. Dublin Core Horrocks, I., Patel-Schneider, P. F., & van Harmelen, Metadata Initiative Recommendation. F. (2003). From shiq and rdf to owl: the making of a Web ontology language. Journal of Web Muslea, I. A., Minton, S., & Knoblock, C. (2001). Semantics, 1 (1), 7-26. Hierarchical wrapper induction for semis- tructured information services. Journal of Institute for language and information technologies. Autonomous Agents and Multi-Agent Systems, http://ilit.umbc.edu/ 4(1/2), 93-114. Java, A., Finin, T., & Nirenburg, S. (2005, No- Nirenburg, S., Beale, S., & McShane, M. (2004). vember). Integrating language understanding Evaluating the performance of the ontosem agents into the semantic web. In T. Payne, & semantic analyzer. Proceedings of the ACL V. Tamma (Eds.), Proceedings of the AAAI Fall Workshop on Text Meaning Representation. Symposium on Agents and the Semantic Web. Association for Computational Linguistics. AAAI Press. Nirenburg, S., & Raskin, V. (2001). Ontological Java, A., Finin, T., & Nirenburg, S. (2006, Janu- semantics, formal ontology, and ambiguity. ary). Text understanding agents and the Se- FOIS ‘01: Proceedings of the International mantic Web. Proceedings of the 39th Hawaii Conference on Formal Ontology in Information International Conference on System Sciences. Systems. New York, NY, USA. (pp. 151-161). Kauai, HI. ACM Press. Kalyanpur, A., Parsia, B., & Hendler, J. (2005). A tool Nirenburg, S., & Raskin, V. (2005). Ontological for working with Web ontologies. International semantics. MIT Press. Journal on Semantic Web and Information Systems, 1 (1), 36-49. Onyshkevych, B. (1997). Ontosearch: Using an ontology as a search space for knowledge based Kingsbury, P., Palmer, M., & Marcus, M. (2002). text processing. (PhD thesis, Carnegie Mellon Adding semantic annotation to the penn tree- University).

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007 73

OWL web ontology language for services (OWL-S). recommendation). W3C. http://www.w3.org/ A W3C submission. http://ww&w.w3.org/Sub- TR/2004/REC-owl-features-20040210/. mission/2004/07/ Volkel, M., Krotzsch, M., Vrandecic, D., Haller, H., Philpot, A., Hovy, E., & Pantel, P. (2005). The & Studer, R. (2006). Semantic Wikipedia. Pro- Omega Ontology. Proceedings of the ONTO- ceedings of the 15th International Conference LEX Workshop at the International Conference on World Wide Web (pp. 585-594). on Natural Language Processing. Jeju Island, South Korea.(pp. 59-66). Witbrock, M., Panton, K., Reed, S., Schneider, D., Aldag, B., Reimers, M., & Bertolo, S. (2004, No- R.V. Guha, & McCool, R. (2003). TAP: A Semantic vember). Automated OWL Annotation Assisted Web toolkit. Journal of Web Semantics, 1 by a Large Knowledge Base. Workshop Notes of (1), 81-87. the 2004 Workshop on Knowledge Markup and Semantic Annotation at the 3rd International RDF validation service. http://www.w3.org/RDF/ Semantic Web Conference (ISWC2004). Validator/. Wonderweb owl ontology validator. http://phoebus. Schlangen, D., Stede, M., & Bontas, E. P. (2004, cs.man.ac.uk:9999/OWL/Validator. July). Feeding owl: Extracting and representing the content of pathology reports. RDF/RDFS and OWL in Language Technology: 4th ACL Endnotes Workshop on NLP and XML. Association for 1 http://www.ldc.upenn.edu/Projects/ACE/. Computational Linguistics. 2 Closed classes are lexical classes with a rela- Sergei Nirenburg, K. M. (1996). Measuring semantic tively fixed set of words to which new ones coverage. 17th International Conference on normally are not added, such as (for English) Computational Linguistics, COLING-96. As- prepositions, conjunctions, and pronouns. sociation for Computational Linguistics. 3 A polysemous word is one with multiple senses The friend of a friend (foaf) project. http://www. whose meanings are related. An Example in foaf-project.org/ English is mole, which can be a mammal that van Harmelen, F., & McGuinness, D. L. (2004). lives in underground burrows or a spy. OWL web ontology language overview (W3C 4 http://semnews.umbc.edu.

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. 74 Int’l Journal on Semantic Web & Information Systems, 3(4), 50-74, October-December 2007

Copyright © 2007, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.