Information Retrieval and the Semantic Web
Total Page:16
File Type:pdf, Size:1020Kb
Information Retrieval and the Semantic Web Tim Finin1, James Mayfield2, Anupam Joshi1, R. Scott Cost2 and Clay Fink2 University of Maryland,1 The Johns Hopkins University2 Baltimore County Applied Physics Laboratory Baltimore MD 21250 USA Laurel MD 20723 USA Abstract ments as opposed to conventional ones. We describe Swoogle, a prototype crawler-based search engines for Information retrieval technology has been central to the RDF documents. This system allows users to retrieve success of the Web. For semantic web documents or indexed RDF documents based on the RDF classes and annotations to have an impact, they will have to be com- properties they use and also uses the Haircut information patible with Web based indexing and retrieval technol- retrieval engine to retrieve documents using character- ogy. We discuss some of the underlying problems and based n-grams. issues central to extending information retrieval systems The next section will motivate the ability to index and to handle annotations in semantic web languages. We search for documents consisting of or annotated with also describe three prototype systems that we have im- semantic web content. Section Three will lay out the plemented to explore these ideas. landscape of possible ways to adapt information retrieval 1. Introduction systems to the Semantic Web and Section Four will de- scribe three different prototype systems we have built to Information retrieval technology has been central to the explore the problem. The fifth section summarizes this success of the Web. Web based indexing and search work and speculates on what the future may bring. systems such as Google and Yahoo have profoundly changed the way we access information. For the seman- 2. Motivation tic web technologies [4][5] to have an impact, they will have to be compatible with Web search engines and in- The Semantic Web has lived its infancy as a clearly de- formation retrieval technology in general. We discuss lineated body of Web documents. That is, by and large several approaches to using information retrieval systems researchers working on aspects of the Semantic Web with both semantic web documents and with text docu- knew where the appropriate ontologies resided and ments that have semantic web annotations. tracked them using explicit URLs. When the desired Se- One vision of the Semantic Web is that it will be mantic Web document was not at hand, one was more much like the Web we know today, except that docu- likely to use a telephone to find it than a search engine. ments will be enriched by annotations in machine under- This closed world assumption was natural when a hand- standable markup. These annotations will provide meta- ful of researchers were developing DAML 0.5 ontolo- data about the documents as well as machine interpret- gies, but is untenable if the Semantic Web is to live up to able statements capturing some of the meaning of the its name. Yet simple support for search over Semantic documents’ content. We describe initial experiments that Web documents, while valuable, represents only a small demonstrate how existing IR systems can be coaxed into piece of the benefits that will accrue if search and infer- supporting this scenario using a technique we call swan- ence are considered together. We believe that Semantic gling to encode RDF triples as word-like terms. Web inference can improve traditional text search, and In an alternate vision, semantic web content will exist that text search can be used to facilitate or augment Se- in separate documents that reference and describe the mantic Web inference. Several difficulties, listed below, content of conventional web documents. Here too it may stand in the way of this vision. be desirable to use conventional systems such as Google Current Web search techniques are not directly suited to index and retrieve these documents. We discuss how to indexing and retrieval of semantic markup. Most the swangling technique can also be used to add asser- search engines use words or word variants as indexing tions to RDF documents in a way that is compatible with terms. When a document written using some flavor of many standard search engines. SGML is indexed, the markup is simply ignored by many A final approach to using IR engines for SWD docu- search engines. Because the Semantic Web is expressed ments is to build custom indexing and retrieval engines entirely as markup, it is thus invisible to them. Even specifically designed to work with semantic web docu- when search engines detect and index embedded markup, they do not process the markup in a way that allows the other containing the corresponding semantic markup. markup to be used during the search, or even in a way The two files are bound by placing in each a pointer to that can distinguish between markup and other text. the URI of the other, either by URI naming convention, Current Web search techniques cannot use semantic or by concurrent retrieval (i.e., as part of a single transac- markup to improve text retrieval. Web search engines tion). While this method makes it difficult to associate typically rely on simple term statistics to identify docu- semantic markup with specific components of the HTML ments that are most relevant to a query. One might con- page, it is possible to implement using today’s standards. sider techniques such as thesaurus expansion or blind Whichever approach is taken to binding semantic markup relevance feedback to be integration of inference into the to HTML, the current lack of a standard has made it dif- retrieval process, but such inference is simple compared ficult to exploit the relationship between the two. with what is possible using semantic markup. One would One of the stated objectives of the semantic web is to like the presence of semantic markup in either the query enhance the ability of both people and software agents to or the documents retrieved to be exploitable during find documents, information and answers to queries on search to improve that search. the Web. While there has been some research on infor- Likewise, text is not useful during inference. To the mation retrieval techniques applied to documents with extent that it is possible to automatically convert text to a markup [1][2][3][7][13], combining retrieval with ontol- semantic representation, such resulting representations ogy browsing [9], the role of explicit ontologies in in- can be used during inference. However, semantic inter- formation retrieval tasks [19], and on question answering pretation is difficult at best, and unsolved in the general as a retrieval task [18], much of it can be seen as incre- case. We would like a way to exploit relevant text during mental extensions to familiar paradigms. Our goal is inference, without needing to analyze the semantics of more ambitious and offers, we think, a new paradigm for that text. information retrieval that mixes and interleaves search, There is no current standard for creating or manipulat- retrieval and understanding. ing documents that contain both HTML text and semantic To explore the tight integration of search and infer- markup. There are two prime candidates for such hybrid ence, we propose a framework designed to meet the fol- documents. First, semantic markup might be embedded lowing desiderata: directly in an HTML page. Unfortunately, while we call • The framework must support both retrieval-driven approaches like RDF and OWL semantic markup, they and inference-driven processing. are typically used not as markup but rather as stand-alone • Retrieval must be able to use words, semantic knowledge representation languages that are not directly markup, or both as indexing terms. tied to text. Furthermore, embedding RDF-based markup • Web search must rely on today’s broad coverage, in HTML is non-compliant with HTML standards up to text-based retrieval engines. and including HTML 4.0. This issue is currently under • Inference and retrieval should be tightly coupled; study by a W3C task force [23]. improvements in retrieval should lead to improve- The second way to bind HTML to semantic markup is ments in inference, while improvements in inference to create a pair of documents, one containing HTML, the Local KB Semantic Inference Semantic Encoded Encoder Web Query Engine Markup Markup Web Search Engine Semantic Semantic Ranked Filters Extractor Markup Markup Pages Figure 1. Integration of inference and retrieval over semantic markup. Arrows represent data flow. should lead to improvements in retrieval. Only some of the semantic markup retrieved through In the following subsections, we first describe the por- this process will be useful for the task at hand. Some will tions of the framework that use semantic markup, then not come from an appropriate trusted authority. Some show how text processing can be mixed in to increase will be redundant. Some will be irrelevant. Thus, before system capabilities and improve performance. it is asserted into the inference engine’s knowledge store, the semantic markup gleaned from each page must be 2.1 Processing of Semantic Markup filtered. The result will be a collection of facts and rules, which are likely to further the inferences being pursued, Imagine we are concerned only with retrieval and infer- or serve as valuable relevance feedback terms. These ence over semantic markup. We would like the ability to facts and rules are passed to the inference engine, which operate some sort of inference engine, to identify facts may then iterate the entire process. and rules needed by the inference engine to reach its de- sired conclusions, to search the Semantic Web for such 2.2 Using Text facts and rules, and to incorporate the results of the search into the inference process. Figure 1 shows the The process described in the previous subsection basic architecture of such a system.