The Semantic Web
Ovidiu Chira
IDIMS Report
21st of February 2003

1. Introduction
2. Background
3. The structure of the Semantic Web
4. Final Remarks

1. Introduction

One of the most successful stories of the information age is the story of the World Wide Web (WWW). The WWW was developed in 1989 by Tim Berners-Lee to enable the sharing of information among geographically dispersed teams of researchers within the European Laboratory for Particle Physics (CERN). The simplicity of publishing on the WWW and the envisioned benefits attracted an increasing number of users from beyond the boundaries of the research community. The WWW grew rapidly to support not only information sharing between scientists (as it was intended), but information sharing among many different kinds of communities, from simple homepages to large business applications. The Web became a “universal medium for exchanging data and knowledge: for the first time in history we have a widely exploited many-to-many medium for data interchange” [Decker, Harmelen et al. 2000]. The WWW is estimated to consist of around one billion documents, accessed by more than 300 million users, and these numbers are growing fast [Fensel 2001; Benjamins, Contreras et al. 2002]. As a “medium for human communication, the Web has reached critical mass […] but as a mechanism to exploit the power of computing in our every-day life, the Web is in its infancy” [Connolly 1998; Cherry 2002]. On the one hand, it has become clear that while the Web enables human communication and human access to information, it lacks the tools and technologies to ease the management of the Web’s resources [Fensel 2000; Fensel 2001; Palmer 2001]; it is impossible, for instance, to build semantic tools that can tell the difference between a book by an author and a book about that author. On the other hand, from the human user’s point of view, the WWW has become an immense haystack of data [Ewalt 2002]. The time has come “to make the Web a whole lot smarter” [Connolly 1998], so that dynamically generated data (e.g. data from pages generated from databases) can join the Web [Benjamins, Contreras et al. 2002; Hendler, Berners-Lee et al. 2002]. In other words, it is time to upgrade the Web from “giving value to human eyeballs” to a Web where “the interesting eyeballs will belong to computers” (Prabhakar Raghavan, chief technology officer at Verity Inc., as cited by [Ewalt 2002]).

2. Background

The Semantic Web (SW) is an emerging concept built on the idea of having data on the Web defined and linked in a way that lets it be used by people and processed by machines [Berners-Lee 1998; Decker, Harmelen et al. 2000; Fensel 2000; Ramsdell 2000; Berners-Lee, Hendler et al. 2001; Dumbill 2001; Swartz and Hendler 2001; Hendler, Berners-Lee et al. 2002] in a “wide variety of new and exciting applications” [Swartz and Hendler 2001]. It develops “languages for expressing information in a machine processable form” [Berners-Lee 1998], so as to enable machines to participate in and help inside the information space [Benjamins, Contreras et al. 2002]: “The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.” [Berners-Lee 1998] The SW will not be a separate Web, but an extension of the current one [Berners-Lee, Hendler et al. 2001; Swartz and Hendler 2001; Ewalt 2002; Hendler, Berners-Lee et al. 2002].
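To give a flavour of what “information in a machine processable form” can look like in practice, the short sketch below uses RDF (the W3C’s data model for the SW) through the Python rdflib library; the vocabulary terms (ex:Book, ex:author, ex:subject) and all URIs are invented for the illustration and are not taken from the report. It shows how explicit metadata lets a machine distinguish a book by an author from a book about that author, the distinction mentioned in the Introduction that plain hyperlinks cannot express.

    # A minimal sketch (hypothetical vocabulary, not from the report): making
    # the difference between "a book by X" and "a book about X" explicit, so
    # that a machine can tell the two apart.
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/terms/")             # assumed vocabulary
    g = Graph()
    g.bind("ex", EX)

    author = URIRef("http://example.org/people/author1")
    book_by = URIRef("http://example.org/books/book1")      # written by author1
    book_about = URIRef("http://example.org/books/book2")   # written about author1

    g.add((book_by, RDF.type, EX.Book))
    g.add((book_by, EX.author, author))        # "a book by ..."

    g.add((book_about, RDF.type, EX.Book))
    g.add((book_about, EX.subject, author))    # "a book about ..."

    # A program can now ask precisely for the books *by* that person:
    for book in g.subjects(EX.author, author):
        print(book)   # -> http://example.org/books/book1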
The WWW is primarily a medium of documents for people rather than a medium of data and information that can be processed automatically. The SW will upgrade it to a new medium adequate for both people and machines [Berners-Lee, Hendler et al. 2001; Benjamins, Contreras et al. 2002; Westoby 2003]. The new environment will be more effective for its users by automating or enabling processes that are currently difficult to perform: “locating content, collating and cross-relating content, drawing conclusions from information found in two or more separate sources” [Dumbill 2001]. These objectives become reachable by giving structure to the rich information contained in documents all over the Web [Berners-Lee, Hendler et al. 2001]. The goals of SW-related research are summarized by Koivunen as follows [Koivunen 2001]:

1. “Design the technologies that support machine facilitated global knowledge exchange.”
2. “Making [it] cost-effective for people to record their knowledge.”
3. “Focus on machine consumption.”

After the WWW, any new web-based medium (whether it upgrades or replaces the WWW) has to fulfil a new set of requirements for exchanging data on the web [Decker, Harmelen et al. 2000]:

1. Universal expressive power: meta-data languages should be able to express any kind of data.
2. Support for syntactic interoperability: the structures that represent data should be easily readable, so that applications (such as parsers) can exploit them.
3. Support for semantic interoperability: unknown data should come with semantics, or it should be possible to map it to known data.

Once the goals and the requirements have been identified, a set of principles – emerging from the advantages, the mistakes and the shortcomings of the WWW – to guide SW research and development has been proposed by the World Wide Web Consortium (W3C) as follows [Koivunen 2001; Westoby 2003]:

Principle 1: Everything can be identified by Uniform Resource Identifiers (URIs).

This principle gives a measure of the things that can be part of the SW: anything conceivable. From physical objects to human beings, and from simple words to complex conceptual structures, everything can be on the SW as long as a URI has been associated with it. There is no restriction concerning the part of the Web URI namespace that the public is permitted to use.

Principle 2: Resources and links can have types.

The first impression a user may have of the WWW is its vastness and, derived from this, its complexity. In fact the WWW structure is quite simple, consisting of a collection of resources (e.g. web documents) and links that bind different resources to each other (see Figure 1a).

Figure 1. WWW structure vs. SW structure [Koivunen 2001]

On the one hand, this kind of structure (because of its simplicity) is easy to implement, so almost everyone can create, publish and link resources. On the other hand (because of its poor granularity), it is difficult if not impossible to know beforehand what a resource refers to and what a link means. This results in a lack of automated tools to help humans locate, use or share a specific resource. The SW structure, in turn, makes it possible to type the resources and links (see Figure 1b), so that more information about resources and links is available beforehand. In this way it enables the development of tools that would automate some if not all of the web-related processes (i.e. finding, sharing and reusing information).
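The following sketch is a minimal reading of Figure 1b in rdflib; the resource types and the link name (ex:Person, ex:TechnicalReport, ex:wroteReport) are assumptions made for the example, not terms from the report or from Koivunen’s figure.

    # Sketch of Figure 1b (hypothetical types and link names): resources carry
    # a type, and the links between them carry a type too, so a tool knows
    # beforehand what it is looking at and how two resources are related.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/sw/")             # assumed namespace
    g = Graph()
    g.bind("ex", EX)

    person = URIRef("http://example.org/people/researcher1")
    report = URIRef("http://example.org/docs/report1")

    # Typed resources (contrast with the untyped "pages" of Figure 1a)
    g.add((person, RDF.type, EX.Person))
    g.add((report, RDF.type, EX.TechnicalReport))

    # Typed link: not merely "person links to report", but how they are related
    g.add((person, EX.wroteReport, report))
    g.add((EX.wroteReport, RDFS.label, Literal("wrote report")))

    # An automated tool can follow only the links of a given type:
    for doc in g.objects(person, EX.wroteReport):
        print(doc)   # -> http://example.org/docs/report1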
Principle 3: Partial information is tolerated.

One of the most important characteristics of the Web is scalability, which is closely related to the WWW’s ability to evolve dynamically. Because of it the WWW grew beyond any expectation, becoming one of the most important communication and business environments of today’s world. The price paid for this is link integrity [Koivunen 2001]: resources on the WWW may appear, evolve (e.g. change their content while keeping the same URI) and disappear dynamically, without any possibility for the links that point to such resources to be “announced”. That is why there are links that point to non-existent locations (also called 404 links) and links that point to the wrong resources. But this is a price worth paying. For these reasons the SW will also have to deal with the resource life cycle (e.g. creation, change, decay), and the tools built for the SW should tolerate it and function with this kind of dynamic resource.

Principle 4: There is no need for absolute truth.

As in today’s WWW, there is nothing to guarantee a priori the truth, or the truth value, of any information within the SW. No matter whether a particular resource is data, information or knowledge, nobody and nothing should enforce rules guaranteeing that the resource is true in some particular system. The assessment of truth should remain at the application level: a particular application – based on label-like information about a resource and about the place it came from – should decide how trustworthy an input is. In this way Principles 3 and 4 carry over into the SW one of the most cherished principles of the WWW: the freedom of information.

Principle 5: Evolution is supported.

Generally speaking, information evolves as human understanding evolves; that is, information undergoes progressive change and development. The SW has to be able to deal with this phenomenon by enabling processes such as a form of version control (i.e. adding information without having to change or delete the old), translation among different communities (i.e. translation from one language to another, synonymy – “postal code” is equivalent to “ZIP code”, and so on) and combination of information that may arise from different or distributed resources [Koivunen 2001]; a small sketch of such a mapping is given after Principle 6.

Principle 6: Minimalist design.

“The Semantic Web makes the simple things simple, and the complex things possible” [Koivunen 2001] by standardizing no more than is necessary, with the desideratum that the “result should offer much more possibilities than the sum of the parts” [Koivunen 2001].
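As a concrete reading of Principle 5’s synonymy example, the sketch below (with vocabularies and URIs invented for the illustration) combines statements from two communities and records, without deleting anything, that their “postal code” and “ZIP code” properties mean the same thing. Note that rdflib merely stores the owl:equivalentProperty statement; acting on it is left to the application or to an OWL reasoner, in line with Principle 4.

    # Sketch of Principle 5 (hypothetical vocabularies): two communities
    # describe the same place with different property names; an equivalence
    # statement lets an application combine them without changing either source.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL

    A = Namespace("http://example.org/communityA/")   # uses "postal code"
    B = Namespace("http://example.org/communityB/")   # uses "ZIP code"
    place = URIRef("http://example.org/places/office1")

    g_a, g_b = Graph(), Graph()
    g_a.add((place, A.postalCode, Literal("94305")))
    g_b.add((place, B.zipCode, Literal("94305")))

    # Combine information from distributed sources (graph union keeps both) ...
    merged = g_a + g_b
    # ... and add, without removing the old statements, that the two properties
    # are synonyms.  rdflib just records this triple; interpreting it is up to
    # the application or an OWL reasoner.
    merged.add((A.postalCode, OWL.equivalentProperty, B.zipCode))

    # An application aware of the equivalence can answer a query phrased in
    # either vocabulary:
    for prop in (A.postalCode, B.zipCode):
        for value in merged.objects(place, prop):
            print(prop, value)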