Ontology-Based Knowledge Discovery on the World-Wide

Home , James Hendler

From: AAAI Technical Report WS-96-06. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved. Ontology-Based Knowledge Discovery on the World-Wide Web Sean Luke* Lee Spector* t David Rager* [email protected] [email protected] ragerQcs.umd.edu http://www.cs.umd.edu/~seanl/ http://hampshire.edu/lasCCS/ http://www.cs.umd.edu/~rager/

*Department of Computer Science tSchool of Cognitive Science and Cultural Studies University of Maryland HampshireCollege College Park, MD20742 Amherst, MA01002

Abstract mechanisms,or the assistance of an armyof data-entry workers assembling hand-madeWeb catalogs. In short, This paper describes SHOE,a set of Simple there is no effective way use the World-WideWeb to HTMLOntology Extensions. SHOEallows World-WideWeb authors to annotate their pages answera query like: with ontology-basedknowledge about page contents. Wepresent examples showinghow the Findweb pagesfor all x, y, and z suchthat use of SHOEcan support a newgeneration of x is a person, knowledge-basedsearch and knowledgediscovery y is a person, tools that operate on the WorM-WideWeb. z is an organizationwhere lastName(r, "Cook" ) and lastName(y, "Cook" ) and Introduction employee(z, x) and Imagine that you are searching the World-WideWeb employee(z,y) and for the home pages of a Mr. and Mrs. Cook, whom marriedTo(x,y) and you met at a conference last year. Youdon’t remember involvedIn(z,"ARPA123-4667") their first names,but you do recall that both workfor an employer associated with the massive ARPAfund- Searching the Web ing initiative 123-4567.This wouldcertainly be suffi- cient information to find these people given a reason- The chief intent of HTMLand HTTPis to assist user- ably structured knowledgebase containing all of the level presentation and navigation of the Internet; au- relevant facts. At first this also seemslike enoughin- tomated search or sophisticated knowledge-gathering formation to find their homepages by searching the has been a muchlower priority. Given this empha- World-WideWeb, but you soon discover otherwise. sis, relatively few mechanismshave been established to allow documentsto be indexed with useful seman- Using an existing man-madeweb catalog, you can tic information beyonddocument-oriented information find ARPA’shome page but learn that hundreds of subcontractors and research groups are workingon ini- like "abstract" or "table of contents". Facedwith this situation, most commonindexing mechanismsfor the tiative 123-4567. Searching existing webindices for World-WideWeb have generally fallen into one of three "Cook"yields thousands of pages about cooking, and categories: searching for "ARPA"and "123-4567" provides you with hundreds and hundreds of hits about the popular ¯ Keywordsubject indices. initiative. Unfortunately,searching for all of themto- ¯ Catalogs painstakingly built by hand. gether yields nothing: apparently neither person lists the initiative on his or her webpage. Just wandering ¯ Private robots using ad-hoc methodsto gather lim- through the Webon your own seems fruitless. What ited semantic information about pages (like "Every- can you do? one with links to me" or "All broken page links"). This scenario is commonto many people on the Each approach has disadvantages. Keywordindices World-WideWeb. A major problem with searching on suffer because they associate the semantic meaningof the Webtoday is that data available on the Webhas webpages with actual lexical or syntactic content. Us- little semantic organization beyondsimple structural ing our previous example, if we were looking for a arrangementof text, declared keywords,titles, and ab- womanwhose last name was Cook, searching a key- stracts. As the Webexpands exponentially in size, this word index under "Cook" yields tremendous numbers lack of organizationmakes it very difficult to efficiently of webpages, almost none of which are about living glean knowledgefrom the Web, even with state-of- people named Cook. "Cook" has many uses besides the-art natural language processing techniques, index being a last name.

96 On the other hand, a major disadvantage of hand- task is to gather web pages about cooking. If this built catalogs is the man-hours required to construct agent were using a thesaurus-lookup or keyword-search them. Given the size of the World-Wide Web, and the mechanism, it might accidentally decide that Helena rate at which it is growing, cataloging even a modest Cook’s web page, and pages linked from it, are good percentage of web pages is a Herculean task. Addition- search candidates for this topic. This could be a bad ally, the criteria used in building any catalog mayturn mistake of course, not only for the obvious reasons, but out to be orthogonal to those of interest to a user. also because Helena Cook’s links are to the rest of the Lastly, ad-hoc robots that attempt to gather seman- University of Maryland (where she works). The Uni- tic information from the web typically gather only the versity of Maryland’s web server network is very very limited semantic information inferable from existing large, and the robot might waste a great deal of time HTMLtags. The current state of natural language in fruitless searching. However, if the agent gathered processing technology makes it difficult to infer much semantic tags from Helena Cook’s web page which in- semantic meaning from the body text itself at a rea- dicated that Cook was her last name, then the agent sonable rate (if at all). We have examined and de- would know better than to search this web page and veloped web-wandering robots equipped with ad-hoc its links. machinery for specialized searching tasks: recognizing and cataloging computer science web pages, for exam- Related Work ple. Unfortunately, even a small topic like this proves HTML2.0 (Berners-Lee and Connolly 1995) already surprisingly difficult to implement, and like manyad- includes several weak mechanisms for semantic markup hoc methods, these robots’ algorithms are extremely (the REL, REV, and CLASSsubtags, and the META brittle. tag). HTML3.0 (Ragget 1995) advances these mech- Further, none of these approaches (except perhaps amsmssomewhat, though it is not yet an official stan- the last, for specific domains) allows for inferences dard. Unfortunately, the semantic markup elements about relationships between web pages, aside from sim- of HTMLhave so far been used primarily for doc- ple facts about linkage. Sophisticated queries such as ument meta-information (such as declared keywords) our initial example ("Find a man and a womanmar- or for hypertext-oriented relationships (like "abstract" ried to each other, whose last nameis ’Cook’, and who or "table of contents"). Furthermore, relationships both work an organization involved with ARPAInitia- can only be established along hypertext links (using tive 123-4567") are therefore clearly out of reach. or ). It appears that the intent of HTML’s existing set of semantic markuptags is only to provide Solution: Adding Semantics to HTML semantics that assist hypertext applications or other Instead of trying to glean knowledge from existing document-oriented functions. HTML,another approach is to give HTMLauthors the To address some of these problems, Dobson and Bur- ability to embed knowledge directly into HTMLpages, rill (1995) have attempted to reconcile HTMLwith makingit simple for user-agents and robots to retrieve the Entity-Relationship (ER) database model. This and store this knowledge. The straightforward way to done by adding to HTMLa simple set of tags that de- do this is to provide authors with a clean superset of fine "entities" within documents, labelling sections of HTMLthat adds a knowledge markup syntax; that is, body text as "attributes" of these entities, and defining to enable them to use HTMLto directly classify their relationships from an entity to outside entities. Docu- web pages and detail their web pages’ relationships and ments maycontain as manyentities as necessary. Dob- semantic attributes in machine-readable form. son and Burrill associate with each entity a unique key, Using such a language, a document could claim that and establish relationships not between URLlinks but it is the homepage of a graduate student. A link from between keys. this page to a research group might declare that the Although Dobson and Burrill’s ER scheme is a sig- graduate student works for this group as a research nificant improvement over HTML’sexisting mecha- assistant. And the page could assert that "Cook" is nism, it does not provide for any ontological declara- the graduate student’s last name. These claims are not tions. For example, their scheme does not give any simple keywords; rather they are semantic tags defined clear mechanismfor classification through an "is a" hi- in an "official" set of attributes and relationships (an erarchy of classes. Yet one of the most significant uses ontology). In this example the ontology would include for semantics in documents is to categorize them ac- attributes like "lastName", classifications like "Per- cording to some classification scheme or taxonomy. For son", and relationships like "employee". Systems that example, paper documents are often classified using gather claims about these attributes and relationships hierarchical schemes like the Library of Congress sub- could use the resulting gathered knowledge to provide ject headings, the DeweyDecimal system, or Univer- answers to sophisticated knowledge-based queries. sal Decimal Classification. Similarly, a good semantics Moreover, user-agents or robots could use gathered mechanism for World-Wide Web documents needs the semantic information to refine their web-crawling pro- ability to do flexible, hierarchical classification. The cess. For example, consider an intelligent agent whose ability to establish relationships between WWWenti-

97 ties is important, but secondary to the ability to clas- Specifying Ontologies sify those entities. ¯ ... Moreover, the ER scheme does not allow one to spec- Declares a new ontology. ify inferences that can be drawn from relationships given in web pages. Even simple specifications such as ¯ <0NTOLOGY-EXTENDS... > transitive closure inferences can be helpful: if Helena Indicates that our ontology extends another ontol- Cook’s home page claims that she works for the PLUS ogy. research group, and this research group is part of the ¯ Computer Science Department, part of the College of Defines a relation, an "is a" classification, or a re- Computer, Mathematical, and Physical Sciences, part naming rule. of the University of Maryland at College Park, part of the University of Marylandat College Park, part of the Annotating an HTML Document Using State of Maryland, she should not have to declare that One or More Ontologies she works for all of these entities; such a fact should ¯ be inferable. Invertable relationships are also useful: if ... Indicatesthat the documentuses one or more on- George Cook is known to be married to Helena Cook, tologies. the inverse should be automatically inferable, without George or Helena having to say it. Through the addi- ¯ tion of morepowerful inferential rule capabilities, full Usedto declarethe document as an entity. knowledge base semantics could be provided. ¯ ... Several advances will be required to provide full Declaresa subsectionof a documentto be an "in- knowledge-base semantics on the World-Wide Web. stance"(an entity). Although the knowledge representation literature describes many systems that could be adapted to this ¯ purpose, unique features of the Webwill mandate sig- Classifiesan instance under one or moreclasses (cat- nificant changes. For example, assertions on the Web egories). will be made by many different people with differing ¯ authority to make such assertions. These assertions Declaresa relationbetween entity instances or be- must therefore be interpreted as claims, of which the tweenan instance and data. authorship is a significant part. In addition, the dis- tributed nature and unknowncorrectness of knowledge ¯ ... An alternativemechanism for declaringa relation on the Webposes new challenges. The work described betweenan entityinstance and data:the data in in this paper is a first step in the process of solving questionis the bodytext between the two attribute these problems to provide full knowledge-base semantics for World-Wide Webcontents. tags.

A SHOE Overview A Detailed Example We present here an introduction to a small superset To illustrate SHOE,we’ll annotate the home page of of HTMLthat provides many of these mechanisms. George Cook (Helena Cook’s husband). This example This scheme is called SHOE: Simple HTMLOntology does not describe all the capabilities of our specifica- Extensions. Amongother things, SHOEprovides the tion, but gives a taste of much of it. Before we can ability to: annotate George’s home page, we need an ontology that: ¯ Define ontologies using HTML. ¯ Provides a "Person" classification ¯ Declare entities for both whole documents and for document subsections ¯ Provides an "Organization" classification ¯ Declare relationships between entities. ¯ Provides the "marriedTo" relationship between people ¯ Declare entity attributes. ¯ Provides the "firstName" and "lastName" attributes ¯ Classify entities under an "is a" classification for people scheme. The full specification of this language is located at ¯ Provides the "employee" relationship between orga- hup: //www.cs. d. ed/p,’oiec*s/pl s/S H O E/sp c. h*ml. nizations and people The specification does not as yet provide inferential For the sake of this example we’ll build a new on- rules other than transitive "is a" classification, but is tology that provides som. e of the necessary classifica- designed to be consistent with such rules when they tions and relationships. Ordinarily we wouldn’t have are added later. The specification adds the following to do this; instead, we’d rely on existing ontologies tags to HTML: from commonlibraries on the web. Such ontologies

98 Figure 1: A Knowledge-based World Wide Web Query in PARKA will offer a unified structure for sharing knowledgeon tags are embedded in an HTMLdocument, which in the World-Wide Web. turn might be promulgated as an "official" person- Let’s assume there already exists an ontology relationships ontology. called organization-ontologyversion 2.1 which de- The "official" location of our ontology is the HTML fines the classifications Organizationand Thing, document at http://ont, org/our-ont.html. George and that this particular ontology is available at Cook can now use this ontology to describe his home http://www.ont.org/orgont.html. We’ll extend the page. Assume that, using this ontology, Helena organizat ion-ontology ontology to include our other Cook’s page has already been classified as a Person, needed classifications and relationships. Namely,we’ll and that its unique key is the same as its official borrow Organization directly, and when we define URL: http://www.cs.umd.edu/~helena. Furthermore, Person we’ll claim that Person "is a" Thing. Let’s the place Helena and George work for, the Univer- call our extension the our-ontology ontology, version sity of Maryland’s Computer Science Department, has 1.0. We write our new ontology as a piece of HTML: its home page classified as an Organization, and that its unique key is the same as its official URL: http : //www.cs. umd.edu. official URL: http://www.cs.umd.edu/~george. In the HEADsection of George’s web page, we add: ARGS="PersonSTRING"> URL="http://ont. org/our-ont, html" > This declares George’s web page to be a data entity with a unique key, and indicates that it will use the ontology our-ontology to describe itself. Furthermore, This indicates that Personis a subcategory of Thing every time elements from our-ontology are used, they as definedin the organization-ontologyontology, will be labelled with the prefix our. that people have first and last nameswhich are strings, In the BODYsection we now declare facts about that people can be married to other people, and that George’s home page, namely George’s name, that people can be employees of organizations. These George is a person, that he is married to Helena, and

99 Figure 2: Query Results that he works for the University of Maryland’s Com- And in the BODYsection we declare George to be an entity instance by adding (near the section on He- "http://www.cs.umd.eduFhelena#GEORGE"> The category declaration indicates that George is a Peraon. The first two relations declare that George’s name is "George Cook". The next relation declares declares the relationship employee from George’s em- Alternatively, if George’s name were mentioned in the text of his homepage, we could replace the first two relation declarations with something like: Applications Ny name is At the University of Maryland at College Park, we are George developing a web-crawling robot, Exposd, which parses Cook SHOE-enabled HTMLdocuments and adds claims to and I live at... its internal knowledge-base. Exposd runs on Macin- tosh CommonLisp or C, using PARKA(Evett, An- If George didn’t have his own web page but instead derson, and Hendler 1993), University of Maryland’s resided on a small part of his wife’s web page, it would massively-parallel semantic network system, for its still be possible to provide George with his ownunique knowledge representation. Wecan then use this knowl- identity and describe these relationships. In this case, edge to answer sophisticated queries about these doc- we’ll use http://www, cs.umd, edu/-helena#GEORGE uments and their relationships as George’s unique key. We add to the HEADsection For example, after Exposd has gathered claims from of his wife’s webpage (if it’s not already there): Helena Cook’s web page, we can query PARKAto find her and her husband. Figure 1 shows the query we in-

I00 Helena’s web page. In conjunction with this effort, we are investigating the use of knowledge-representation standards like KQML(Finin et al. 1994) and KIF (Genesereth and Fikes 1992) to facilitate communica- tion betweenclients and servers in retrieving results, or between servers and slave servers in building up results from a number of sources.

Future Work :::::::::::::::: ...... Although we feel our current specification provides much of the expressiveness needed for more advanced World-Wide Webagents, it still lacks many features found in sophisticated knowledge-representation systems. Weare adding such features conservatively, seek- ing a compromise that provides some of the power of sophisticated knowledge representation tools while keeping the system simple, efficient, and understand- able to the lay HTMLcommunity. For example, the current specification does not yet provide for annotations that allow inference of transitive closure, negation, or inverted (reversed) relations. Weare currently working to refine a small set of tags that will be easy for HTMLauthors to understand while allowing agents to use these inferences to derive useful new facts from the basic claims made in HTML pages. The knowledge representation literature provides manyinsights into the design of such tags, but the unique demands of the World-Wide Web (such as the distribution of knowledge and the varying authority of authors) require that this literature be examined in a new light. Figure 3: Graphically Annotating Helena’s Web Page Conclusion The Webis a disorganized place, and it is growing more troduced in the beginning of this paper, as laid out disorganized every day. Even with state-of-the-art in- using PARKA’sGraphical Query mechanism. This is dexing systems, web catalogs, and intelligent agents, the equivalent of querying PARKAwith: World-WideWeb users are finding it increasingly difficult to gather information relevant to their inter- (query! ’ ( : and ests without considerable and often fruitless searching. (#! instanceOf?X # !Person) Muchof this is directly attributable to the lack of a (# ! instanceOf?Y # ! Person) coherent way to provide useful semantic knowledge on (# ! instanceOf?Z # ! Organization) the Webin a machine-readable form. (#!lastName?X "Cook") SHOEgives HTMLauthors an easy but powerful (#!lastName?¥ "Cook") way to encode useful knowledge in web documents, and (#!employee?Z ?X) it offers intelligent agents a muchmore sophisticated (#!employee?Z ?Y) mechanism for knowledge discovery than is currently (#!marriedTo?X ?Y) available on the World-Wide Web. If used widely, (#!involvedIn?Z "ARPA 123-4867"))) SHOEcould greatly expand the speed and usefulness of intelligent agents on the web by removing the single In Figure 2, PARKAhas filled in the variables with most significant barrier to their effectiveness: a need to actual results--selecting Helena fetches her web page comprehend text and graphical presentation as people directly. do. Given the web’s explosive growth and its predom- We are also developing Java applications to make inance amongInternet information services, the abil- it easier for users to annotate web pages with seman- ity to directly read semantic information from HTML tic knowledge and to query robot servers using SHOE. pages maysoon be not only useful but necessary in or- For example, our graphical annotator is shown in Fig- der to gather information of interest in any reasonable ure 3, assisting in embeddingsemantic knowledge into amount of time.

i01 Acknowledgements Weare grateful to Dr. James Hendler for his assistance in the developmentof this paper. This research was supported in part by grants from NSF(IRI-9306580), ONR(N00014-J-91-1451), AFOSR(F49620-93-1-0065), the ARPA/RomeLab- oratory Planning Initiative (F30602-93-C-0039), the ARPAI3 Initiative (N00014-94-10907) and ARPAcon- tract DAST-95-C0037. References Dobson, S.A. and V.A. Burrill. 1995. Lightweight Databases. In Proceedings of the Third International Worldwide Web Conference (special issue of Com- puter and 1SDN Systems). v. 27-6. Amster- dam: Elsevier Science. URL: http://www.igd.fhg.de/ www/www95/papers/5.t/darm.html See also: http://www, cis. rl. ac. uk/proj/www/docs/iightweight/ index.html

Evett, M.P., W.A. Andersen, and J.A. Hendler. 1993. Providing Computational Effective Knowl- edge Representation via Massive Parallelism. In Parallel Processing for Artificial Intelligence. L. Kanal, V. Kumar, H. Kitano, and C. Suttner, Eds. Amsterdam: Elsevier Science Publishers. URL: http : //www.cs. umd. edu /projects/plus/ P arka/parka- kanal.ps See also: http://www, cs. umd. edu/projects/ plus/Parka~

Finin, T., D. McKay,R. Fritzson, and R. McEntire. 1994. KQML:An Information and Knowledge Ex- change Protocol. In Knowledge Building and Knowl- edge Sharing. K. Fuchi and T. Yokoi, Eds. Ohmsha and IOS Press. URL: http://www, cs. umbc. edu/kqml/ papers/kbks.ps See also: http : //www. cs. umbc.edu/ kqmi/

Genesereth, M. R., and R. E. Fikes, Eds. 1992. Knowl- edge Interchange Format, Version 3.0 Reference Man- ual. Technical Report Logic-92-1. Computer Science Department, Stanford University. URL: http://www- ksl.stanford.edu/knowledge-sharing/papers/kif.ps See also: http://www-ksl.stanford.edu/knowledge-sharing/ kif/

Berners-Lee, T. and D. Connolly. 1995. Hypertext Markup Language - 2.0. IETF HTMLWorking Group. URL:http://www, cs.tu-berlin, de/"jutta/ht/drafl-ietf- html- spec-Ol.html

Ragget, D. 1995. HyperText Markup Language Speci- fication Version 3.0. W3C(World-Wide Web Consor- tium). URL: http://www.w3.org/pub/WWW/ MarkUp /h tml3/ CoverPage. h tml

102