Die approbierte Originalversion dieser Diplom-/Masterarbeit ist in der Hauptbibliothek der Technischen Universität Wien aufgestellt und zugänglich. http://www.ub.tuwien.ac.at

The approved original version of this diploma or master thesis is available at the main library of the Vienna University of Technology. http://www.ub.tuwien.ac.at/eng

Consuming Linked Open Data via Standard Web Widgets

DIPLOMARBEIT

zur Erlangung des akademischen Grades

Diplom-Ingenieurin

im Rahmen des Studiums

Business Informatics

eingereicht von

Irina Pershina Matrikelnummer 1127738

an der Fakultät für Informatik der Technischen Universität Wien

Betreuung: o.Univ.-Prof. Dipl.-Ing. Dr.techn. A Min Tjoa Mitwirkung: Univ.Ass. Dipl.-Ing. Dr.rer.soc.oec. Amin Anjomshoaa

Wien, 23.04.2014 (Unterschrift Verfasserin) (Unterschrift Betreuung)

Technische Universität Wien A-1040 Wien  Karlsplatz 13  Tel. +43-1-58801-0  www.tuwien.ac.at

Consuming Linked Open Data via Standard Web Widgets

MASTER’S THESIS

submitted in partial fulfillment of the requirements for the degree of

Diplom-Ingenieurin

in

Business Informatics

by

Irina Pershina Registration Number 1127738

to the Faculty of Informatics at the Vienna University of Technology

Advisor: o.Univ.-Prof. Dipl.-Ing. Dr.techn. A Min Tjoa Assistance: Univ.Ass. Dipl.-Ing. Dr.rer.soc.oec. Amin Anjomshoaa

Vienna, 23.04.2014 (Signature of Author) (Signature of Advisor)


Erklärung zur Verfassung der Arbeit

Irina Pershina Kohlgasse 49/15, 1050

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit - einschließlich Tabellen, Karten und Abbildungen -, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

(Ort, Datum) (Unterschrift Verfasserin)


Acknowledgements

I would like to express my very great appreciation to my supervisors, Univ.-Prof. A Min Tjoa and Univ.Ass. Dr. Amin Anjomshoaa, for their valuable and constructive suggestions and useful critiques during the process of writing this master thesis. Their willingness to give their time so generously has been very much appreciated. To my family, who have always been my support in every stage of my life. I am especially grateful to my parents, who supported me emotionally and financially. I would also like to thank my colleagues Peter Wetz, Dat Trinh Tuan, and Lam Ba Do from the Linked Data Lab, and Lucas Gerrand and Raffael Prätterhoffer from the Business Informatics master program, for their listening, patience, and support during the last ten months. I greatly enjoyed the collaboration and knowledge exchange with them.


Abstract

The Semantic Web describes a concept for storing, sharing, and retrieving information on the Web by adding machine-readable meta information that conveys the meaning of the data. Linked Open Data is publicly available structured data which is stored and modelled according to Semantic Web standards and interlinked with other Open Data. The Linked Open Data cloud comprises Linked Data sources and has been growing significantly in recent years. Complementary to this, mashups allow non-professional users to access, consume, and analyze data from various sources. The basic component of a mashup is a widget that can access certain datasets, process data, and provide additional functionality. Mashups can partly handle Linked Data consumption for knowledge workers. The main challenges are finding the appropriate widget as the number of available widgets increases, categorizing and finding widgets with similar functionality, and adding provenance information to widgets. The primary purpose of this thesis is to design a semantic model for a mashup platform that enables (i) publishing of widget information on the Linked Open Data Cloud, (ii) widget discovery, (iii) widget composition, and (iv) smart data consumption based on the semantic model. Additionally, the semantic model should provide provenance information in order to supply additional information about the origin and authenticity of data and to increase trust in data resources. During this research work, existing approaches applicable to Semantic Web Service description have been compared, and Semantic Web Service description techniques have been evaluated concerning their application in the area of Web Widgets. Requirements for the semantic model are derived from a literature review and complemented with requirements for mashup systems. Finally, the semantic widget model is implemented in a mashup prototype to demonstrate its usability.


Kurzfassung

Das Semantische Web beschreibt ein Konzept zur Informationsspeicherung, zum Informationsaustausch und -abruf im Web durch Hinzufügen maschinenlesbarer Metainformation. Ziel ist es, Daten eine Bedeutung zu geben. Zu diesem Konzept zählt auch Linked Open Data. Dabei handelt es sich um Daten, die der Resource Description Framework Spezifikation entsprechend modelliert und gespeichert sind. Zudem sind diese Daten öffentlich verfügbar und miteinander verknüpft. Die Linked Open Data Cloud beinhaltet alle bedeutenden Linked Data Quellen und befindet sich seit den letzten Jahren in einem ständigen Wachstum. Ergänzend dazu ermöglichen Mashups nicht fachkundigen Anwendern Zugang zu Konsum und Analyse von Linked Data. Die Grundkomponente eines Mashups ist ein Widget. Dieses kann auf bestimmte Datensätze zugreifen, Daten verarbeiten und zusätzliche Funktionen zur Verfügung stellen. Bis dato können Mashups die vorhandenen Probleme für Wissensarbeiter, die mit der Verwendung von Linked Data zusammenhängen, nur teilweise lösen. Die größten Herausforderungen sind es, passende Widgets zu finden, während die Anzahl verfügbarer Widgets steigt, Widgets mit ähnlichen Funktionalitäten zu kategorisieren und zu finden, und Informationen über die Herkunft und Vertrauenswürdigkeit von Daten hinzuzufügen. Der Hauptzweck dieser Masterarbeit ist die Entwicklung eines semantischen Modells für eine Mashup-Plattform. Das Modell ermöglicht (i) die automatische Veröffentlichung der Widgetinformation in die Linked Open Data Cloud, (ii) Widget-Auffindung, (iii) Widget-Zusammensetzung und (iv) smarte Anwendung von Daten, basierend auf semantischen Modellen. Zusätzlich soll das Modell Informationen über die Herkunft von Daten beinhalten.
Während meiner Forschung evaluierte ich Ähnlichkeiten und Unterschiede zwischen Web Widgets und Semantic Web Services, verglich existierende Ansätze zu Semantic Web Service Beschreibungen und evaluierte Semantic Web Service Beschreibungstechniken, die für eine Anwendung im Bereich Web Widgets relevant sind. Anforderungen an das Modell werden von vorhandener Literatur abgeleitet und mit den Anforderungen an Mashupsysteme ergänzt. Abschließend wird das semantische Widget Modell mittels eines Prototyps implementiert, um dessen praktische Nutzbarkeit zu demonstrieren.


Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Statement
   1.3 Structure of the Thesis

2 Web of Data
   2.1 Web 2.0
   2.2 Web 3.0
   2.3 Resource Description Framework (RDF)
   2.4 Web Ontology Language (OWL)
   2.5 SPARQL: Query Language for RDF
   2.6 Linked Open Data
   2.7 Overview of Linked Data Endpoints
      2.7.1 DBpedia
      2.7.2 Linked Movie Database
   2.8 Widgets & Mashups
   2.9 Schema.org
   2.10 Semantic Web Services

3 State of the Art
   3.1 Applications
      3.1.1 Overview of existing applications
      3.1.2 Yahoo!Pipes
      3.1.3 DERI Pipes
      3.1.4 BIO2RDF
      3.1.5 LOD2
   3.2 Semantic Description Approaches
      3.2.1 Web Services Description Language (WSDL)
      3.2.2 Semantic Annotation for Web Services Description Language (SAWSDL)
      3.2.3 Semantic Markup for Web Services (OWL-S)
      3.2.4 Web Service Modeling Ontology (WSMO)
      3.2.5 WSMO-Lite
      3.2.6 RESTdesc semantic description

      3.2.7 SA-REST
      3.2.8 EXPRESS
      3.2.9 Linked Open Services (LOS)
      3.2.10 Linked Data Services (LIDS)
      3.2.11 Data-Fu
      3.2.12 Karma
      3.2.13 RDB to RDF Mapping Language (R2RML)
   3.3 Summary

4 Solution
   4.1 Definition of requirements
   4.2 Use and Extension of Karma Approach
   4.3 Widget Model
   4.4 DCAT
   4.5 Provenance

5 Results and Evaluation
   5.1 Resulting Semantic Model
   5.2 Semantic Model Use cases
      5.2.1 Publishing examples
      5.2.2 Discovery examples
      5.2.3 Composition examples
      5.2.4 Smart Data Consumption
   5.3 Result evaluation

6 Conclusion and Future Work
   6.1 Research Summary
   6.2 Research Limitation
   6.3 Future Work

7 Appendix
   7.1 Acronyms
   7.2 Widget Semantic Model
   7.3 Semantic Models in TopBraid Composer

Bibliography

CHAPTER 1 Introduction

1.1 Motivation

The Web is a phenomenon which has changed the modern era of communication and enterprise networks. The idea was originally conceived 25 years ago by Tim Berners-Lee and Robert Cailliau [15]. The main goals of the project were: to provide a protocol for requesting and exchanging information over networks; to provide a method of reading information; to provide search mechanisms; and to provide a collection of documents [15]. The documents were connected by lists of references, so-called hyperlinks, to other text sources on the Internet. In general, the Web is based on the following technologies:

• documents written in Hypertext Markup Language (HTML)1, the language that “was primarily designed as a language for semantically describing scientific documents“ [67],

• Uniform Resource Locator (URL) references to a resource, consisting of “a naming scheme specifier followed by a string whose format is a function of the naming scheme“, and Uniform Resource Identifier (URI), “a compact sequence of characters that identify an abstract or physical resource“ [1], i.e. the names of Web resources,

• Hypertext Transfer Protocol (HTTP), a protocol for “distributed, collaborative, hypermedia information systems“ [64].

With widespread use of the Web we saw the next stage in this evolution, the so-called “Read-Write“ Web, or Web 2.0, where information can be distributed. The term includes social communities, services, and a corresponding set of technologies. Examples of Web 2.0 are blogs, Web applications, wikis, social networking sites, and mashups. The value of the information that organizations and people put onto the Internet started to increase. This progress had an increasingly significant influence on decision-making processes [27]. Furthermore, information became an essential factor for management. The main focus of Web 1.0 and Web 2.0 was

1http://www.w3.org/TR/html/

content generation and content representation. The flood of content created problems for data storage, due to the inability to provide a solid structure, comprehensive meaning, and interchangeability by using machine-understandable formats. Aiming to solve these challenges, a new generation called the Semantic Web emerged. It “provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries“ [2]. The main focus of the Semantic Web is the interlinking of data from various sources, whereas the original Web mainly focused on the interlinking of Web documents. Additionally, the meaning of data becomes machine-understandable. Fundamentally, the Semantic Web is based on the concepts of metadata, which provides data descriptions; ontologies, which describe hierarchies and relationships between data; and reasoning, which allows for the logical derivation of new facts. Furthermore, Semantic Web technologies provide effective querying on large data sets. Data on the Web should have a standard format and be interlinkable in order to generate relationships among data. The “collection of interrelated datasets on the Web“ [2] is called Linked Data. To publish and connect data on the Web it is important to follow a set of best practices and principles, the so-called Linked Data principles [17]. These include the use of URIs for entity identification and the use of Semantic Web standards for data description (RDF2), data querying (SPARQL3), and data interlinking.

1.2 Problem Statement

Although the quantity of Linked Data is continually increasing, there are still many research challenges. For instance, creating and publishing Linked Data, trust and provenance of Linked Data, user interaction and usability, and natural language interfaces are still relevant research issues. In addition to the aforementioned challenges, there is a lack of successful applications that offer people who are not necessarily from a professional background access to Linked Data. Furthermore, making complex queries, analyzing, enriching, and visualizing data, and aggregating data from various Linked Data sources in a feasible manner is still cumbersome. Semantic technologies like SPARQL4, RDF5 and OWL6, plus good programming skills, are usually needed to process Linked Data. One solution for the challenge mentioned above is the use of mashups, which are “user-driven micro-integration of Web-accessible data“ [21]. Using mashups allows users to avoid redundancy, while at the same time enabling easy and cost-effective implementation and integration of software components in applications [60]. Furthermore, they provide features such as reusability, easy implementation, the possibility to combine widgets, and the consumption of data from different Linked Open Data (LOD, publicly available Linked Data) sources.

2http://www.w3.org/RDF/ 3http://www.w3.org/TR/rdf-sparql-query/ 4http://www.w3.org/TR/rdf-sparql-query/ 5http://www.w3.org/RDF/ 6http://www.w3.org/TR/owl-features/

The basic component of a mashup is a widget, which is an application with limited functionality. Each widget fulfills a simple task, and widgets can be linked to each other to enable new, more complex tasks. Widgets either have access to datasets from different sources or can process data. Widgets can have input/output terminals that define the type of data the widget can process and return. Additionally, widgets can include options to control the way widgets process data. The user can also wire widgets together to create a mashup. Figure 1.1 presents a typical graphical interface of a mashup platform, called Yahoo!Pipes7. High usability obviously has an impact on an application’s success. The mashup platform can support widget development; therefore a certain growth in the number of available widgets can be expected. Widgets can process data from different fields like finance, population, transport, etc. First, it is not possible for users to rapidly learn to work with the system: to find a required widget, the user has to check all categories of widgets. Second, the user does not know the source of the information that widgets provide. Third, the user does not know how to find widgets that can be combined in order to create new knowledge. Therefore, it is important to provide a means to solve the following problems:

• Publishing: Make Linked Widget information available on the LOD Cloud.

• Discovery: Search for widgets that contain a specific kind of semantic relation.

• Composition: Search for widgets that can consume a specific dataset or produce the required output data.

• Smart data consumption based on the semantic model: Selection of the required input from the provided context data.

7http://pipes.yahoo.com/pipes/

Figure 1.1: Yahoo Pipes User Interface

For example, a set of locations is used as input information that is then processed by a widget. After processing, the widget returns a set of movies. Still, some facts remain unclear: Does the provided location describe where the film is showing, or does it describe where the author or producer was born? Even though it is possible to add some human-readable information in order to clarify a widget’s meaning, this remains a problem for the machine, as it does not understand the human-readable information. Another problem addressed by this master thesis concerns data quality and trustworthiness. Because mashups process data from various sources, it is difficult to define the data origin. Additionally, information on the Web is often inconsistent or questionable, and therefore people often make trust judgments based on the authorship of information. With the fast growth of Linked Data, provenance information becomes a factor that influences the success of new Semantic Web applications8, especially of a mashup platform. Provenance includes information about the origin and ownership of datasets, change tracking, and access. The proposed solution of this thesis aims at solving this challenge by creating a semantic model, since the widgets have access to various types and formats of data. The semantic model should provide a description of the data that a widget accesses via its input and output, the relationships between the underlying data, and provenance information. The goals of this thesis can be summarized as follows:

• Evaluation of similarities and differences between Web Widgets and Semantic Web Ser- vices.

• Comparison of existing approaches applicable for Semantic Web Service description.

• Evaluation of Semantic Web Service description techniques regarding application in the area of Web Widgets.

• Defining the widget semantically in order to enable service composition and search, and to publish the widget description as a part of the LOD cloud.

• Defining a semantic model that can be used to select the required input widget or data from the provided context data.

• Semantic model extension that provides provenance information.

• Semantic widget model implementation.

The following research questions are derived from the master thesis goals:

1) Is it possible to apply semantic service description languages for widget description?

2) How can the semantic model be extended to support data flow and data streams?

3) How can this semantic model be integrated with a mashup environment?

8http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance

4 1.3 Structure of the Thesis

Chapter 2 provides an overview of the basic principles and concepts of the Semantic Web (like RDF, OWL, and SPARQL), Linked Open Data, and Semantic Web Services. Part of this chapter also describes two Linked Data endpoints that are used for the examples mentioned in the chapters of this master thesis. Chapter 3 explores the state of the art in mashups and semantic description approaches. It consists of two parts. The first part describes existing mashup platforms (semantic and non-semantic) that are comparable with the approach of Linked Widgets. The second part follows with detailed descriptions of semantic service description methodologies, including the advantages and disadvantages of each approach. Chapter 4 presents the Linked Widgets approach and the requirements for the developed Linked Widget Model, which includes descriptions of input/output graphs as well as provenance information by using DCAT and PROV-O. Chapter 5 follows with evaluations and use cases. The ontology for widget description and examples of the ontology implementation in TopBraid Composer9 are provided in the Appendix.

9http://www.topquadrant.com/tools/IDE-topbraid-composer-maestro-edition/


CHAPTER 2 Web of Data

To understand the Semantic Web it is important to begin with defining the World Wide Web (WWW, 3W, or the Web), the purposes of the Web, and its principles. “The Web is a system of interlinked documents that run over the Internet. With a Web browser, a user views Web pages that may contain text, images, and other multimedia and navigates between them using hyperlinks“1. The first step in the Web evolution was Web 1.0, a collection of static web pages or the “Read-Only“ Web. The user had the possibility to search for information on the Internet and publish information on his/her web pages, but it was not possible to interact with other users and distribute the information. For example, in the e-commerce sector web pages were presented like catalogs, with the goal of showing information about products to customers. In the era of Web 1.0 the interaction between users was insufficient. Therefore the appearance of Web 2.0 was predictable. It started in 1999 with the emergence of systems like LiveJournal2 and Blogger3.

2.1 Web 2.0

With the emergence of Web 2.0, or the “Read-Write“ Web, users of the Internet have instruments for communication, information sharing, and advertising, like social networks, wikis, data feeds, online markets, blog platforms, e-conferences, etc. Web 2.0 is also characterized by dynamic user-generated content in the form of information postings (e.g. photos, videos, text). Now even a non-technical user can use Web 2.0 to share information, communicate with other users, etc. The most popular platforms are Twitter4, Flickr5, Facebook6, Youtube7, etc.

1http://en.wikipedia.org/wiki/World_Wide_Web 2http://www.livejournal.com/ 3http://www.blogger.com/ 4http://www.twitter.com/ 5http://www.flickr.com/ 6http://www.facebook.com/ 7http://www.youtube.com/

With the widespread adoption and penetration of the Web into consumer and commercial interests, a new phenomenon of the social and interactive Web has sprung up. This stage is commonly called Web 2.0 and is characterized by a wide variety of ways in which websites and software can be developed and designed. These web pages are often also characterized by personal profiles (such as birthday, contact, and location information), connections between users, groups of users, RSS feeds in the form of links, and public APIs. Web technologies allow users to include content from other web pages in their own pages. For example, Youtube.com usually shares the code of a video so that it can be embedded into another web page. “Richer applications make more extensive use of more recently opened APIs“ 8. Web 2.0 is also characterized by the use of AJAX (Asynchronous JavaScript and XML). AJAX is a key technology in Web 2.0 for creating asynchronous web applications “without interfering with the display and behavior of the existing page“ [4]. AJAX is a set of technologies that realize data exchange between a client and a server and integrate it into the Web page presentation. AJAX is often used in combination with HTML and CSS. Data retrieval goes through the XMLHttpRequest (XHR) object; although the name suggests XML, other formats (like JSON, HTML, plain text) can be used. The goal of AJAX is to let scripts send and receive data via the HTTP methods PUT, DELETE, GET, HEAD, POST, and OPTIONS (like an HTTP client). AJAX can interact with the server by sending a request in order to get data without a full reload of the web page. There are some alternatives to AJAX available, e.g. Flash9, Microsoft WPF/E10, XBAP11. The most important achievement of Web 2.0 is the open API. Programmers can use open APIs to access sets of modules without programming directly in the source code, create mashups (cf. Chapter 2.8) from different data sources, and integrate data.
Regardless of the information overflow, efficient search remains complex. For example [5], somebody is planning to attend some conferences in a city. The conferences take place in different locations, and he/she wants to book hotels near the conferences. To do this, the user has to look at more than one web page in order to find the needed information, and the result will not be perfect because the data about hotels and conference locations are not really connected. One solution for this problem is to program an application using programming languages like Java12 and Python13, or query languages like XPath14, for transforming HTML documents (in this example, information from the booking system and the conference web page). A second solution can be the use of open APIs. But the use of open APIs costs extra time and adds complexity to integrating the data because of the lack of links between the data. Web 2.0 moved static Web pages to a more dynamic and interactive level. Through Web 2.0 a large amount of information has been collected and made widely available. The problem of Web 1.0 and Web 2.0 is that they provide human-readable content, linked via URLs15 [37], without machine-readable and machine-understandable logic.
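The brittleness of such ad-hoc integration without data links can be sketched in a few lines; all names and records below are invented purely for illustration:

```python
# Two hypothetical, unlinked data sources: conference listings and hotels.
# Without shared identifiers, the only join key is the literal city string.
conferences = [
    {"name": "SemWeb Conf", "city": "Vienna"},
    {"name": "Data Days", "city": "Wien"},  # same city, different spelling
]
hotels = [
    {"name": "Hotel Karlsplatz", "city": "Vienna"},
]

# Exact string matching silently misses "Data Days", whose city is spelled
# "Wien" rather than "Vienna" - exactly the missing-links problem above.
matches = [(c["name"], h["name"])
           for c in conferences
           for h in hotels
           if c["city"] == h["city"]]
print(matches)
```

Linked Data addresses this by identifying the city with a single URI in both datasets, so the join no longer depends on string spelling.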

8http://firstmonday.org/ojs/index.php/fm/article/view/2125/1972 9http://www.adobe.com/at/products/flashruntimes.html 10http://msdn.microsoft.com/de-de/library/aa970060(v=vs.110).aspx 11http://www.xbap.org/ 12http://www.java.com 13http://www.python.org 14http://www.w3.org/standards/techs/xpath 15http://en.wikipedia.org/wiki/Uniform_resource_locator


Figure 2.1: The Web & the Semantic Web

2.2 Web 3.0

The next generation of the Web is the Semantic Web, or Web of Data (so-called Web 3.0) [9]. Tim Berners-Lee defined the Semantic Web as “an extension of the original Web, in which information is given well-defined meaning, better enabling computers and people to work in cooperation“ [14]. The differences to the traditional Web are depicted in Table 2.1:

• While the traditional Web consists of content that is attractive to the user, with nicely structured content and interfaces, the Semantic Web provides machine-readable content.

• Not the web pages, but the data behind the web pages are connected. The links indicate the location and the meaning of the data.

• It is possible to create logical statements.

Feature               | Web                     | Semantic Web
Fundamental component | Unstructured content    | Formal statements
Primary audience      | Humans                  | Applications
Links                 | Indicate location       | Indicate location and meaning
Primary vocabulary    | Formatting instructions | Semantics and logic
Logic                 | Informal/nonstandard    | Description logic

Table 2.1: Comparison of Web and Semantic Web. Source: [37]

To clarify the difference between data representation in the Web and the Semantic Web, an example is illustrated in Figure 2.1. In the case of the traditional Web, web pages are connected via hyperlinks. A search engine finds movies or actors according to keywords like movie, actor, city, etc. The model does not represent the data behind the web pages. In the case of the Semantic Web the machine is able to read and interpret the data behind the web page. For example, the movie has a title and it has a relation to actors. There are different techniques to add and recognize structured content. Web content can be automatically generated from a relational database, and this helps search engines to interpret the data [83]. To achieve the goal of adding structure to the data, Microformats16 or RDFa can be applied. Microformats is a vocabulary which describes the data within the web page and extracts semi-structured information [44]. The meta information can be added into (X)HTML code. The code below is an example of the use of an open microformat standard named hCard17. The hCard is used to add information about persons and organisations into web contents. The root class name is vcard18. In this case the properties fn (first and last name), org (organization name), email (email) and url (link to a web page) are used to add semantics to the existing (X)HTML code (cf. Listing 2.1).

<div class="vcard">
  <span class="fn">Irina Pershina</span>
  <span class="org">TU Vienna</span>
  <a class="url" href="http://tuwien.ac.at">http://tuwien.ac.at</a>
</div>

Listing 2.1: hCard
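An application can recover the hCard fields from such markup with nothing more than the standard library; the following is a sketch only (the HTML string is a reconstruction of Listing 2.1, not the thesis's exact code), using the fn/org/email class names of the hCard vocabulary:

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Collect the text of elements carrying hCard property class names."""
    PROPS = ("fn", "org", "email")

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # property whose text node comes next

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for cls in self.PROPS:
            if cls in classes:
                self._current = cls

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = ('<div class="vcard">'
        '<span class="fn">Irina Pershina</span> '
        '<span class="org">TU Vienna</span> '
        '<a class="url" href="http://tuwien.ac.at">http://tuwien.ac.at</a>'
        '</div>')
parser = HCardParser()
parser.feed(html)
print(parser.fields)
```

A dedicated microformats parser would also handle nesting and the url property; the point here is only that the class names make the data machine-extractable.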

The information about the person is machine-readable; an application can retrieve the information directly from the web page where the format is used. An alternative to Microformats for data interchange on the Web is a more generic language named Resource Description Framework in Attributes (RDFa) [65], a serialization format [42] for semantic inclusion into (X)HTML code. It supports embedding any type of data [83]. Microformats and RDFa have a common feature: both focus on adding meta information into (X)HTML. There is also another technique to add machine-readable (structured) data: Linked Data. Compared to Microformats and RDFa, it gives the possibility to publish data as Linked Data on the Semantic Web. It represents a graph, or set of entities, that are connected using RDF and URIs [52]. In 2000 Tim Berners-Lee introduced 7 layers of the Semantic Web (cf. Figure 2.2):

• IRI/Unicode. An IRI is a unique Internationalized Resource Identifier for the Semantic Web. Unicode is a global encoding standard that includes characters for various languages and mathematical formulas.

• XML - the language for structured content creation.

16http://microformats.org/ 17http://microformats.org/wiki/hcard 18http://microformats.org/wiki/hcard

Figure 2.2: The Semantic Web layers

• RDF/RDF Schema. RDF is a data format for creating statements in triple form (subject, predicate, and object). RDF Schema is used for creating hierarchies.

• Ontology. The Web Ontology Language (OWL) is an extension of RDFS that includes constructs for semantic description (e.g. cardinality, transitivity).

• Logic. Reasoning within the logic layer.

• Proof. Result verification.

• Trust. Derived statements should be verified and resources identified; they should come from trusted sources.

“The main idea of Semantic Web is to support distributed Web at the level of the data rather than at the level of the presentation“ [5]. The better the data are structured, the easier it is for search engines or applications to read and interpret the content of web pages [83]. The goal is extending the existing Internet and computer tools in order to get machine-readable information and add semantics (meaning) to data.

2.3 Resource Description Framework (RDF)

RDF is a metamodel for knowledge representation and the expression of facts. The structure of an RDF expression is a collection of triples. A triple expresses a fact that is represented as a relation between two nodes of a graph (things). The triple consists of (cf. Figure 2.3) [65]:

11 Figure 2.3: The triple

• The subject (an RDF URI reference or a blank node).

• The predicate (an RDF URI reference).

• The object (an RDF URI reference, a literal or a blank node).

Generally, each subject, object, and predicate is an RDF URI reference. But there are some exceptions:

• If a subject isn’t a URI, it is an anonymous resource (with a local identifier instead of a URI). These are also called blank nodes and can appear in one or more RDF statements. For data with a complicated structure it is recommended to use blank nodes, as they tie together several elements of an entity and add different relations. For example, an address consists of a street, a city, a house number, etc.

• The object can be either an RDF URI reference, an anonymous resource, or a literal. There are two kinds of literals: plain literals, which are strings optionally carrying a language tag, and typed literals, which carry a datatype URI, taken from XML Schema.

The predicate is a URI; it depicts a relation between two things or an attribute of the subject, having the object as a value. Returning to the example depicted in Figure 2.1, the class :movie is a subject, the class a:actor is an object, and the relationship :starring is a property. The class :movie has a title :hasTitle, stored as a string. The value of the object is a plain literal.
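To make the triple structure concrete, here is a small illustrative sketch (not code from the thesis) that models a DBpedia statement as a plain 3-tuple and serializes it in the line-oriented N-Triples style; the "starts with http" check for telling URIs from literals is a simplification for this example only:

```python
def nt_term(t):
    """Render one triple term: URI references in angle brackets, literals quoted."""
    # Simplification for this sketch: strings starting with "http" are URIs.
    if isinstance(t, str) and t.startswith("http"):
        return f"<{t}>"
    return f'"{t}"'

# Subject, predicate, object: each a URI reference here.
triple = (
    "http://dbpedia.org/page/Angelina_Jolie",
    "http://dbpedia.org/property/starringof",
    "http://dbpedia.org/resource/Life,_or_Something_Like_It",
)
line = " ".join(nt_term(t) for t in triple) + " ."
print(line)

# A triple whose object is a literal (the page count) gets quoted instead.
pages = ("http://dbpedia.org/page/Notes_from_My_Travels",
         "http://dbpedia.org/property/pages", 213)
line2 = " ".join(nt_term(t) for t in pages) + " ."
print(line2)
```

Real RDF libraries additionally distinguish blank nodes, language tags, and datatype URIs, which this sketch omits.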

RDF NOTATION

Like XML, RDF has two types of notation: graphical notations and serialized notations. The graphical notations look like directed labelled graphs and show how RDF triples are connected. The nodes of the graph are things (objects and subjects) and the arcs are predicates. Figure 2.4 depicts a simple example of an RDF graph and represents the following sentence: Angelina Jolie is starring in the movie “Life, or Something Like It“ and she is the author of “Notes from My Travels“, which has 213 pages. The parts of the sentence are presented in Table 2.2. The nodes (ovals) in the graph are resources that can be either objects or subjects. The objects and the subjects are connected via directed arrows. The direction of the arrow goes from the subject to the object. The diagram should be read as: subject HAS predicate object [65]. For example, Angelina Jolie is the author of Notes from My Travels. The integer literal 213 (the number of pages) is depicted as a rectangle.



Figure 2.4: The RDF graph

Elements              Value
Subject (Resource)    http://dbpedia.org/page/Angelina_Jolie
Predicate (Property)  http://dbpedia.org/property/starringof
Object (Resource)     http://dbpedia.org/resource/Life,_or_Something_Like_It
Predicate (Property)  http://dbpedia.org/property/authorof
Object (Resource)     http://dbpedia.org/page/Notes_from_My_Travels
Subject (Resource)    http://dbpedia.org/page/Notes_from_My_Travels
Predicate (Property)  http://dbpedia.org/property/pages
Object (literal)      213

Table 2.2: The parts of the sentence
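The triples of Table 2.2 can be written down directly in N-Triples syntax. The property URIs are taken verbatim from the table; whether DBpedia actually serves them under exactly these names is not verified here:

```turtle
<http://dbpedia.org/page/Angelina_Jolie> <http://dbpedia.org/property/starringof> <http://dbpedia.org/resource/Life,_or_Something_Like_It> .
<http://dbpedia.org/page/Angelina_Jolie> <http://dbpedia.org/property/authorof> <http://dbpedia.org/page/Notes_from_My_Travels> .
<http://dbpedia.org/page/Notes_from_My_Travels> <http://dbpedia.org/property/pages> "213" .
```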

The serialized notations are textual syntaxes for RDF: N-Triples19, Turtle20, N321, RDF/XML22, N-Quads23, TriG24, TriX25. Table 2.3 shows a comparison of the syntaxes.

19http://www.w3.org/2001/sw/RDFCore/ntriples 20http://www.w3.org/TeamSubmission/turtle 21http://www.w3.org/TeamSubmission/n3 22http://www.w3.org/TR/REC-rdf-syntax 23http://www.w3.org/TR/2013/WD-n-quads-20130905 24http://www.w3.org/TR/2013/WD-trig-20130409 25http://www.w3.org/2004/03/trix

• N-Triples is a line-based text format (row syntax). N-Triples uses absolute URIs; every statement must end with a dot; literals can carry an XML Schema datatype or a language code. Samples: for a relation: <http://ex.com#Irina> <http://ex.com#livesIn> <http://ex.com#Vienna> . ; for an attribute: <http://ex.com#Irina> <http://ex.com#hasWebPage> “http://ex.com#webpage.com“ . ; anonymous nodes are written as _:a. Disadvantages: no possibility to use prefixes; statements with the same subject cannot be grouped.

• Turtle is a concrete syntax for RDF [24]. Turtle gives the possibility to use prefixed, relative and absolute URIs. The keyword “@prefix“ declares the prefix of a URI; for the definition of relative URIs the keyword “@base“ is used; class membership is expressed with the letter “a“. Samples: :Irina :livesIn c:Vienna . and :Irina :hasWebPage “http://ex.com#webpage.com“ . Advantages: qualified URIs, relative URIs, statements with the same subject can be grouped, ordered resource sets.

• N3 is an extension of Turtle, a shorthand non-XML serialization of RDF [79]. N3 allows the expression of rules and the writing of certain relations in a standard way (for example the shorthand “=“), using @prefix. It has a simple and consistent grammar. Example of a rule: ?a :isChildOf ?b. ?c :isMotherOf ?b => ?a :isGrandChildOf ?c. Advantages: compact and readable, rules.

• RDF/XML is the XML syntax for RDF. The XML terms are element names that have a namespace name and a local name [19]; the content is written as nested sequences of elements. Advantage: standard XML format. Disadvantages: user-unfriendly, hard to read.

Table 2.3: Syntaxes for RDF
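To make the difference between the first two syntaxes concrete, here is the same statement once in N-Triples and once in Turtle; the ex.com namespace follows the examples used in Table 2.3:

```turtle
# N-Triples: absolute URIs, one statement per line, terminated by a dot
<http://ex.com#Irina> <http://ex.com#livesIn> <http://ex.com#Vienna> .

# Turtle: the same triple written with a prefixed name
@prefix : <http://ex.com#> .
:Irina :livesIn :Vienna .
```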

RDF AND RDF SCHEMA VOCABULARY

RDF Schema provides a data-modelling vocabulary for RDF data [23].

Classes

• rdfs:Resource. All things described by RDF are instances of this class, and all other classes are subclasses of it; rdfs:Resource is itself an instance of rdfs:Class [23].

• rdfs:Class defines a resource as a class.

• rdfs:Literal presents literal values [23] like string and integer.

• rdfs:Datatype is a subclass of rdfs:Literal that presents the datatypes.

• rdf:XMLLiteral is “an instance of rdfs:Datatype and a subclass of rdfs:Literal“ [23].

• rdf:Property depicts the relationships between resources and is an instance of the class rdfs:Class.

Properties

• rdfs:range “is used to state that the values of a property are instances of one or more classes“.

• rdfs:domain is used for subject definition of a triple.

• rdf:type states the assignment of a resource to a class.

• rdfs:subClassOf depicts that a class is a subclass of another class.

• rdfs:subPropertyOf states that a property is a subproperty of another property.

• rdfs:label “is an instance of rdf:Property that may be used to provide a human-readable version of a resource’s name“ [23].

• rdfs:comment provides a description of a resource.
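The classes and properties listed above can be combined into a small schema. The movie vocabulary below is an illustrative assumption in the spirit of the earlier examples, not an existing ontology:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/movies#> .

ex:Movie       rdf:type rdfs:Class ;
               rdfs:label "Movie" ;
               rdfs:comment "A motion picture." .

ex:ActionMovie rdf:type rdfs:Class ;
               rdfs:subClassOf ex:Movie .

# the subject of ex:hasTitle must be a Movie, the value a literal
ex:hasTitle    rdf:type rdf:Property ;
               rdfs:domain ex:Movie ;
               rdfs:range  rdfs:Literal .
```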

Benefits of using RDF data model

The main benefits are that [36]:

• By using HTTP URIs as globally unique identifiers for data items as well as for vocabulary terms, the RDF data model is inherently designed for being used at global scale and enables anybody to refer to anything.

• Clients can look up any URI in an RDF graph explored locally to retrieve additional information.

• The data model enables you to set RDF links between data from different sources.

• Information from different sources can easily be combined by merging the two sets of triples into a single graph.

• RDF allows representing information that is expressed using different schemata in a single graph, meaning that you can mix terms from different vocabularies to represent data.

• Combined with schema languages such as RDF-Schema and OWL, the data model allows the use of as much or as little structure as desired, meaning that tightly structured data as well as semi-structured data can be represented. A short introduction to RDF Schema and OWL is also given in this Chapter.
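The merging benefit can be illustrated with two hypothetical sources that describe the same resource. Since both use the same URI for it, concatenating their triples already yields one combined graph; the ex: vocabulary is an assumption for illustration:

```turtle
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix ex:  <http://example.org/vocab#> .

# triples published by source A
dbr:Angelina_Jolie ex:name "Angelina Jolie" .

# triples published by source B, merged into the same graph
dbr:Angelina_Jolie ex:birthYear "1975" .
```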

Disadvantages of using RDF data model:

• Introduction of redundancy.

• Limited processing speed. Frequently the data from relational databases can be retrieved faster.

2.4 Web Ontology Language (OWL)

In a knowledge-based system an ontology represents a vocabulary for knowledge representation [71]. The vocabulary defines a set of objects or entities and the relationships between them which represent a knowledge domain. An ontology language is a declarative language for knowledge encoding or knowledge representation; such languages also support knowledge reasoning and the declaration of rules. The elements of an ontology are classes and properties: a class denotes a group of entities [71], a property describes relationships. An important point is the focus on relationships. The Web Ontology Language (OWL) is an extension of RDF and a semantic markup language. It is derived from the DAML+OIL Web Ontology Language26 [73] and based on Description Logic with some additional features [69]; its purpose is to share, author and publish ontologies on the WWW. In comparison to traditional Description Logic, OWL uses partially different terms (cf. Table 2.4). For the overview of the ontology elements the following namespaces, i.e. groups of identifiers, are defined (cf. Listing 2.2).

xmlns:owl ="http://www.w3.org/2002/07/owl#"
xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd ="http://www.w3.org/2001/XMLSchema#"

Listing 2.2: Namespaces

1. owl: - OWL namespace (vocabulary for OWL).

26http://www.w3.org/Submission/2001/12

OWL                   DL
class name            concept name
class                 concept
object property name  role name
object property       role
ontology              knowledge base
axiom                 axiom
vocabulary            vocabulary/signature

Table 2.4: OWL and DL. Source: [69]

2. rdf: - RDF syntax.

3. rdfs: - RDF Schema syntax.

An OWL class represents a set of entities that share common features or characteristics; properties are used to describe individuals [71] [37]. A member of a class is an individual. The main kinds of properties are object properties and datatype properties. According to the W3C recommendation, the value of an object property is an individual and the value of a datatype property is a literal. A class is defined via rdf:type owl:Class, and the instances of a class are likewise specified via rdf:type. For example, Movie is a class and Harry Potter is an instance of the class Movie (cf. Listing 2.3).

@prefix ex: <http://example.org/movies#> .

ex:Movie rdf:type owl:Class .
ex:HarryPotter rdf:type ex:Movie .

Listing 2.3: Ontology

It is possible to add taxonomic relationships to OWL via the rdfs:subClassOf property. The distinction between an instance relation and a subclass relation is that an instance represents an individual of a class, while a subclass represents a subset of its members [37]. For example (cf. Listing 2.4), there are some types of movies in an ontology: action, adventure and biography. The class ex:Movie represents the set of all movies, which is further subdivided into the subclasses ex:Action, ex:Adventure and ex:Biography. The individuals ex:EndOfDays and ex:Divergent are members of the subclasses ex:Action and ex:Adventure.

@prefix ex: <http://example.org/movies#> .

ex:Movie rdf:type owl:Class .

ex:Action rdf:type owl:Class ;
    rdfs:subClassOf ex:Movie .

ex:Adventure rdf:type owl:Class ;
    rdfs:subClassOf ex:Movie .

ex:Biography rdf:type owl:Class ;
    rdfs:subClassOf ex:Movie .

ex:EndOfDays rdf:type ex:Action .
ex:Divergent rdf:type ex:Adventure .

Listing 2.4: Example of a description in OWL

OWL has two special classes: owl:Thing and owl:Nothing. The resource owl:Thing represents the set of all individuals and owl:Nothing represents the empty class without members [37]. OWL properties are used to express the relationships between resources [71]. The two main kinds are owl:ObjectProperty for relationships between two individuals and owl:DatatypeProperty for relationships between an individual and a literal [37]. For example, the title of a movie is a string value, therefore ex:hasTitle is a datatype property. The property ex:starring is an object property because it expresses the relationship between an actor and a movie.

@prefix ex: <http://example.org/movies#> .

...

ex:hasTitle rdf:type owl:DatatypeProperty .
ex:starring rdf:type owl:ObjectProperty .

Listing 2.5: Example of a description in OWL

The next step is to define which domain and which range the properties have. The rdfs:domain property constrains the subject of a triple and the rdfs:range property constrains the object of a triple. Returning to the movie example: a movie has a title, which in OWL terms means that the domain of the property ex:hasTitle is the class ex:Movie and its range is string (xsd:string).

...

ex:hasTitle rdfs:domain ex:Movie .
ex:hasTitle rdfs:range xsd:string .
ex:starring rdfs:domain ex:Movie .
ex:starring rdfs:range ex:Star .

Listing 2.6: Example of a description in OWL

Properties, like classes, can have subproperties, defined via rdfs:subPropertyOf. For example, the properties ex:hasShortDescription and ex:hasLongDescription are specializations of the property ex:hasDescription.

...

ex:hasShortDescription rdf:type owl:DatatypeProperty ;
    rdfs:subPropertyOf ex:hasDescription .

Listing 2.7: Example of a description in OWL

As already explained, a property always has a direction, from subject to object. Sometimes it is important to indicate that an inverse relationship also exists. OWL uses the property owl:inverseOf to express inverse relationships; in our example, ex:starringOf is the inverse property of ex:starring. There are several variants of OWL that differ in complexity, expressiveness and the needs of concrete users:

1. OWL Lite. The simplest version of OWL, designed for users who primarily need a classification hierarchy and simple constraints (W3C).

2. OWL DL. DL stands for Description Logic, a formal knowledge representation language. This variant includes some restrictions, like type separation, that keep reasoning tractable.

3. OWL Full. OWL Full offers the user full expressiveness and is a pure extension of RDF [37]. The main disadvantage is that reasoning in OWL Full is undecidable.

Each variant of OWL is an extension of the previous one: OWL DL is an extension of OWL Lite, and OWL Full is an extension of OWL DL.

2.5 SPARQL. Query Language for RDF

SPARQL is a W3C recommendation that was introduced for retrieving data stored in RDF format [73]. SPARQL itself supports only reading; for writing data, SPARQL Update27 is used. According to the W3C recommendation, the following terms are used in SPARQL [73]:

• IRI. Resource ID, includes URIs and URLs.

• RDF graph. A set of RDF triples; its nodes are the subjects and objects of the triples.

• Lexical form, “being a Unicode string, which should be in Normal Form C“ [32].

• Plain literals “have a lexical form and optionally a language tag as defined by RFC-3066 28 and normalized to lowercase“.

• Language tags - tags for language identification, defined by RFC-3066.

• Typed literals “have a lexical form and a datatype URI being an RDF URI reference“.

• Datatype IRI - an “RDF URI reference“.

• Blank node - an anonymous resource.

There are four different query forms that SPARQL uses:

27http://www.w3.org/TR/sparql11-update 28http://www.isi.edu/in-notes/rfc3066.txt

• SELECT query. The basic command for reading facts according to some graph pattern; it returns the subset of variables bound in the SPARQL query.

• CONSTRUCT query. “The CONSTRUCT query form returns a single RDF graph specified by a graph template. The result is an RDF graph formed by taking each query solution in the solution sequence, substituting for the variables in the graph template, and combining the triples into a single RDF graph by set union“ [73].

• ASK query. A simple question to a SPARQL endpoint; returns a true/false result.

• DESCRIBE query. Used to read an RDF graph (returns all facts about the matched resources).

Listing 2.8 presents an example SPARQL SELECT query against DBPedia29. The query returns films (dbpedia-owl:Film) that are based on some work (specified by the dbpedia-owl:basedOn property) and have a gross income greater than 390000000 dollars.

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?film ?gross WHERE {
  ?film rdf:type dbpedia-owl:Film .
  ?a rdf:type dbpedia-owl:Work .
  ?film dbpedia-owl:basedOn ?a .
  ?film dbpedia-owl:gross ?gross
  FILTER(xsd:double(?gross) > 390000000)
} ORDER BY DESC(?gross) LIMIT 1000

Listing 2.8: SPARQL query

Step-by-step explanation of the example:

1. The first step is the definition of the namespace: PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>. It is possible to skip this step, but that reduces readability: without the prefix, the row ?film dbpedia-owl:basedOn ?a. would have to be written with the full predicate URI.

2. Second, the variables ?film and ?gross are defined; the value of the film (the film URI) and the gross are returned.

3. ?a rdf:type dbpedia-owl:Work. defines that the books are instances of the class Work from DBPedia.

4. The fact that the films should be based on some books is expressed by ?film dbpedia-owl:basedOn ?a.

5. For filtering the data the FILTER construct is used.

6. It is also possible to order the data by gross, descending.

7. LIMIT is the maximal number of returned results.

29http://dbpedia.org/About
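The other query forms use the same pattern-matching syntax. As a hypothetical sketch (the dbpedia-owl prefix assumed as in the SELECT example above), an ASK query that merely checks whether any matching film exists could look as follows:

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

# returns only true or false, no variable bindings
ASK {
  ?film rdf:type dbpedia-owl:Film ;
        dbpedia-owl:basedOn ?work .
}
```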

20 Figure 2.5: Evolution of the web. Source: [10]

2.6 Linked Open Data

The Web can be seen as a huge database. The problem of today's Web is that the data are not really connected; therefore an effective search over the data is often hard. The idea to connect the data over the Web is not new: the approach was introduced by Tim Berners-Lee, director of the World Wide Web Consortium, more than 20 years ago, and it is now becoming more popular, although some complexities remain. Figure 2.5 depicts the history of data on the web. The evolution has four steps: documents on the web, Web of Documents (linked pages), Data on the Web (Open Data, not linked) and Web of Data (Linked Data). The last step is Linked Data. “Linked Data refers to a set of best practices for publishing and connecting structured data on the Web“ [17]. Berners-Lee outlined four basic principles [17]:

1. Use URIs to denote things.

2. Use HTTP URIs so that people can look up those names over the Web.

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

4. Include links to other URIs, so that they can discover more things.
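Principle 4 in particular is what turns isolated datasets into a web: an RDF link simply reuses a URI minted by another dataset. As a hypothetical sketch (the LinkedMDB resource number is made up for illustration):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# a LinkedMDB film resource linked to the same thing in DBpedia
<http://data.linkedmdb.org/resource/film/1000>
    owl:sameAs <http://dbpedia.org/resource/Life,_or_Something_Like_It> .
```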

According to these principles we can start to see the advantages that Linked Data brings. The first principle recommends using URIs to identify things like real-world objects or concepts, for example animals, people or places. These things can have properties such as name, color or description, and relations to other objects. “In the classic Web, HTTP URIs are used to combine globally unique identification with a simple, well-understood retrieval mechanism“ [36]. According to the second principle, a URI is a globally unique resource identifier which is used for the identification of things over the Web.

Figure 2.6: Linked Open Data Cloud

In order to facilitate data processing for different data on the Web, it is necessary to find a common format. HTML is the dominant document format for pages on the web [36]; for Linked Data, the W3C recommends the use of RDF as a common format. According to the Linked Data principles, the things on the Web are linked with each other via RDF links (in contrast to the classic Web, where web pages are linked via hyperlinks), for example a link between an actor and a film, or a hotel and its location. The linking of the data empowers the retrieval of distributed information from different resources. An additional advantage of Linked Data is the facility to build the links between data over the existing Web architecture. The Web of Data has properties similar to the classic Web [36]:

• The Web of Data is generic and can contain any type of data.

• Anyone can publish data to the Web of Data.

• The Web of Data is able to represent disagreement and contradictory information about an entity.

• Entities are connected by RDF links, creating a global data graph that spans data sources and enables the discovery of new data sources. This means that applications do not have to be implemented against a fixed set of data sources, but they can discover new data sources at run-time by following RDF links.

• Data publishers are not constrained in their choice of vocabularies which represent data.

22 Figure 2.7: The 5 star scheme. Source: [54]

• Data is self-describing. If an application consuming Linked Data encounters data described with an unfamiliar vocabulary, the application can dereference the URIs that identify the vocabulary terms in order to find their definition.

• The use of HTTP as a standardized data access mechanism and RDF as a standardized data model simplifies data access compared to Web APIs, which rely on heterogeneous data models and access interfaces.

Data on the Web can be characterized using the “five-star rating scheme“ (cf. Figure 2.7). The criteria are as follows:

• 1 Star. Data on the web, in any format (e.g. PDF, an image scan), with an open licence.

• 2 Star. Structured data, machine-readable formats (e.g. Excel).

• 3 Star. Non-proprietary format (e.g., CSV).

• 4 Star. Use of URIs to identify things, open standards from the W3C, possibility to link the things.

• 5 Star. Linked content from different resources, using Linked Data principles.

A large amount of structured data has been posted on the Web. The result is the Linked Open Data Cloud (cf. Figure 2.6), which is highly interlinked and forms a very extensive graph. The graph consists of billions of triples stored in RDF format from different sources. The datasets cover many topics like media, geography, publications, government and various others. A technical overview of Linked Open Data is given by the Linked Open Data Puzzle (cf. Figure 2.8). The stack shows which technologies should be used for working with LOD.

Figure 2.8: Linked Open Data Puzzle. Source: [10]

The LOD documents are stored on WWW (HTTP) servers. The diagram shows the URLs which are required for identifying resources. Additionally, vocabularies are used for describing nouns while ontologies add relationships between them. As per the diagram, SPARQL gives the ability to query the data. Finally, the applications that can consume and produce Linked Data are defined as mashups and search engines. Linked Data presents a new way to organize information on the web and in organizations because of its flexible and expressive standards. Linked Data connects data from different sectors. The crucial point is the adoption of Linked Data at the enterprise level. This includes better techniques for publishing and consuming the data as well as better usability and easier learning for working with Linked Data. Due to this problem, the main focus of the Linked Widget approach is to increase the level of usability, to make the work with Linked Data more intuitive and understandable, and to give organizations the ability to combine their internal data with the Linked Open Data Cloud.

2.7 Overview of Linked Data Endpoints

In the following chapters of this thesis I will present some examples of semantic service description with the use of various semantic description approaches. The semantic web services will process Linked Data taken from Linked Data endpoints like DBPedia30 and Linked Data Movie Base31.

Figure 2.9: Overview of DBPedia components. Source: [18]

2.7.1 DBPedia

DBPedia is a semantic version of Wikipedia32. It “allows to ask queries against Wikipedia and to link the different data sets on the Web to Wikipedia data“ [18]. The main components of the framework are (cf. Figure 2.9):

• Page Collections - local or remote sources of Wikipedia contents.

• Destinations - storing or serializing extracted RDF triples.

• Parsers - supporting the extractors, converting values between different units and splitting markup into lists [18].

The Extraction Manager is used for managing the processes of mapping Wikipedia articles to the domain ontology. The framework includes the following extractors [18]:

• Labels (rdfs:label) - a title of the articles.

30http://dbpedia.org/About 31http://linkedmdb.org/ 32http://www.wikipedia.org/

• Abstracts. There are two versions of an abstract: a short one using rdfs:comment and a long one using dbpedia:abstract.

• Interlanguage links are links that connect articles about the same topics in different languages.

• Images. The images are connected to resources via the foaf:depiction property.

• Redirects - identification of synonymous terms, references between DBpedia resources.

• Disambiguation - explanation of the different meanings of homonyms via the predicate dbpedia:disambiguates.

• External links link data from DBPedia to external Web resources with use of the property dbpedia:reference.

• Pagelinks - links between Wikipedia articles (dbpedia:wikilink property).

• Homepages - links entities to their homepages (foaf:homepage).

• Categories - categories of articles that are represented with use of the SKOS vocabulary33 (the property skos:concept, skos:broader).

• Geo-coordinates use the Basic Geo Vocabulary34 and the GeoRSS Simple encoding of the W3C Geospatial Vocabulary35 to define a location.

The framework uses four types of extraction [18]:

• Dump-based extraction. The DBpedia database is updated monthly with dumps of all Wikipedia editions. The dump-based workflow uses the page collection from Wikipedia as the source of article texts and the N-Triples serializer as the output destination.

• Live extraction. The extractor uses Wikipedia’s Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) live feed36, which announces all changes in Wikipedia, as a stream for new RDF extraction. SPARQL Update deletes existing triples and inserts new ones into a separate triple store.

• Generic Infobox Extraction processes all infoboxes within a Wikipedia article. The triples are produced in the following way: the algorithm uses the corresponding DBpedia URI as the subject; the predicate URI is constructed from the namespace fragment http://dbpedia.org/property/ and the name of the infobox attribute; the attribute values become the objects.

33http://www.w3.org/2004/02/skos/ 34http://www.w3.org/2003/01/geo/ 35http://www.w3.org/2005/Incubator/geo/XGR-geo/ 36http://wiki.dbpedia.org/DBpediaLive

Figure 2.10: DBPedia page

• Mapping-based Infobox Extraction maps Wikipedia templates to an ontology, arranging the 350 most commonly used infobox templates within the English version into 170 classes and 2350 attributes from within these templates. “The property mappings define fine-grained rules on how to parse infobox values and define target datatypes, which help the parsers to process attribute values“.
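A sketch of what the generic infobox extraction could emit for a single infobox attribute; the concrete attribute name and value are hypothetical, only the URI construction scheme follows the description above:

```turtle
# subject:   the article's DBpedia URI
# predicate: http://dbpedia.org/property/ + infobox attribute name
# object:    the parsed attribute value
<http://dbpedia.org/resource/Life,_or_Something_Like_It>
    <http://dbpedia.org/property/starring>
    <http://dbpedia.org/resource/Angelina_Jolie> .
```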

DBpedia currently includes about 4.0 million “things“ with 470 million “facts“, and about 45 million links to external data sets37 like Freebase38 or flickr wrappr39. DBPedia also provides versions in 199 languages. The advantages of DBPedia are the coverage of many domains, real community agreement, automatic actualization of contents, and multilingual support. “The DBpedia knowledge base is served as Linked Data on the Web“40 and presents one of the central interlinking hubs. Each thing is identified via a dereferenceable IRI or URI-based reference. For example, http://dbpedia.org/ontology/Agent is the URI of the class “Agent“, and http://dbpedia.org/page/Angelina_Jolie is the URI of the instance “Angelina Jolie“ of the class “Agent“. Figure 2.10 depicts the web page which presents the information about the actress “Angelina Jolie“ stored in DBPedia. The information about “Angelina Jolie“ is divided into two parts, property and value. The instance has the following properties: dbpedia-owl:birthName, which presents the name of the actress; dbpedia-owl:birthPlace - her birth place; dbpedia-owl:parent - her parents; and also external links to other data sets.

37http://wiki.dbpedia.org/Datasets 38http://www.freebase.com/ 39http://wifo5-03.informatik.uni-mannheim.de/flickrwrappr/ 40http://dbpedia.openlinksw.com:8890/About

The data can also be accessed via the SPARQL endpoint at http://dbpedia.org/sparql. Listing 2.9 presents an example query to DBPedia. The query returns the list of movie instances in which the actor “Paul Reubens“ performed a role, together with their directors. The endpoint also supports full text search over properties.

PREFIX : <http://dbpedia.org/resource/>

SELECT ?movie ?director
WHERE {
  ?movie a dbpedia-owl:Film ;
         dbpedia-owl:starring :Paul_Reubens ;
         dbpedia-owl:director ?director .
}

Listing 2.9: SPARQL query

2.7.2 Linked Movie Data Base

Linked Movie Data Base is another open semantic web database which contains information about movies. Its data sets also include links to other Linked Open Data and references to webpages. There are about 1645000 interlinks to other Linked Data endpoints such as DBPedia (via owl:sameAs), RDF Book Mashup (via movie:relatedBook), flickr wrappr (via dbpedia:hasPhotoCollection), etc. Figure 2.11 depicts the interlinks to other data endpoints. The resources are represented by the following sample entities: film, actor, director, writer, producer, music contributor, cinematographer, etc. The SPARQL endpoint is available at http://data.linkedmdb.org/sparql. Listing 2.10 shows a simple example of a SPARQL query to the endpoint, which selects instances of the class movie that are available in the English language, have a relation to the actor “Paul Reubens“ (http://data.linkedmdb.org/resource/actor/1395), and include links to other sources (owl:sameAs).

SELECT ?film ?title ?instance
WHERE {
  ?film movie:actor <http://data.linkedmdb.org/resource/actor/1395> .
  ?film movie:language .
  ?film dc:title ?title .
  ?film owl:sameAs ?instance
}

Listing 2.10: SPARQL query

Figure 2.12 depicts the result of the query.

2.8 Widgets & Mashups

W3C’s Widget Specification defines a widget as “an interactive single purpose application for displaying and/or updating local data or data on the Web, packaged in a way to allow a single download and installation on a user’s machine or mobile device“ [56]. In other words, a widget is a small and simple application or piece of dynamic content developed for different types of software platforms.

Figure 2.11: LinkedMDB in the Linking Open Data cloud. Source: http://richard.cyganiak.de/blog/

Figure 2.12: SPARQL results

There are different types of widgets:

• GUI (graphical user interface) widget is a part of an application designed for human-computer interaction (such as a check box) in order to control displayed elements.

• Disclosure widget specifies which information should be hidden or shown for the user.

• Desktop widget is a small application for the desktop that controls simple functions like clocks or calendars, or accesses some web services and shows current information (e.g. news, exchange rates).

• Metawidget is used for control of other widgets.

A special kind of widget is the web widget, which can be included in the code of a web page in order to show information from another source. It is often used for advertising or for displaying video. Widgets are also frequently used in the Social Web, in the form of a “widget application“, a third-party application “for an online social network platform, with the user interface or the entire application hosted by the network service“41. It is possible to combine a widget with other components and data for complex problem solving. The benefits are as follows:

• Versatility and seamless integration possibility within diverse Web environments.

• Reusability.

• Easy implementation.

• Possibility to combine some widgets together.

• Possibility to use internal resources from a web page (site data) with online data from different LOD sources.

• Easy and cost-efficient use of widgets for adding semantic functionality.

As already mentioned in Chapter 2.1.2, Web 2.0 opened the door to new technologies that enable easier data integration, like open APIs or mashups. A possible solution to process data from various sources is the use of mashups. Mashups are applications developed for retrieving content from disparate Web sources. The data and functions can be received through various mechanisms and formats like REST APIs, feed formats, JSON, XML and HTML. The typical characteristics of mashups are:

• a mashup often consists of widgets and feeds that are mixed together and have access to different sources,

• use of service-oriented architecture,

• focus on specific domains or problems,

• it is possible to publish the result on the web and provide access to its functionality,

• ability to access to published mashups and include their functionality in a new mashup.

41http://en.wikipedia.org/wiki/Software_widget

Mashup development differs from traditional component-based application development. It is typically more collaborative, organic, and dependent on the reuse of existing components. The development can be realized manually or with the use of a development environment. The development includes:

• Widget creation and organising flows of data, transforming the data into an appropriate format or reusable feeds;

• Mashups sharing, tagging, and trustworthiness indication;

• Reuse of existing mashups or extension of mashup logic, and sharing of new combinations;

• Data analysis and personalization.

There is a set of tools and technologies that can be categorized as mashup builders, like Yahoo!Pipes42, QedWiki (an IBM product) and Intel Mash Maker43, or as mashup enablers [53]. In Chapter 3, Yahoo Pipes will be reviewed because of its popularity. A mashup builder gives non-experts the ability to create new composite applications by combining simple operators and operations like filtering, selecting, etc. An operator is a widget that provides access to data sources. The mashups can be published and combined with each other. The disadvantage of such tools is that developers cannot extend their features. A mashup enabler provides data source adapters that give structure to the data. Examples of mashup enablers are Feed4344, Openkapow and Kapow Mashup Server45. The typical data source adapters are application-specific APIs, RSS46, and RMDB47. The disadvantage of this kind of tools is that they do not have a graphical mashup builder. The most challenging problems [46] are:

• Combining data and functions. The data are stored in various sources in different formats, and it is important to recognize which data and functions can be combined together.

• Data integrity. “Mashups are a quick way to create new applications but they can raise data integrity problems when changes of end-users are not valid against the underlying commitment“.

• Mashup search/cataloging. It is necessary to provide an efficient mechanism for search. If many mashups exist, users do not know which mashup can be used for which task and which mashups can be combined together.

• Making data Web-enabled. Not all data and functionalities are published on the Web. Some data are available but not accessible from mashup systems because of their format, or because they include extra data (e.g. HTML formatting structure) and a conversion to structured data is needed. Therefore a well-defined process is required to prepare the raw data for web publishing.

42 http://pipes.yahoo.com/pipes/
43 http://software.intel.com/en-us/articles/
44 http://feed43.com/
45 http://kapowsoftware.com
46 http://cyber.law.harvard.edu/rss/rss.html
47 http://rmdb.stanford.edu/repository/

• Security and identity. Some data are confidential, and the system should protect them via an appropriate authorization mechanism.

• Sharing and reusing. It should be possible for users to reuse already created mashups and to share new mashups with other users.

• Trust certificates. The owner of the mashup system should provide a license that guarantees the end-user rights and permissions of the mashup.

• Version control mechanisms. The data from various sources may get updated, and the end user of the system should know about changes in the data sets; therefore a version control mechanism is essential.
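The operator-and-pipe composition model used by mashup builders such as Yahoo! Pipes can be sketched in a few lines of code. The following Python sketch is purely illustrative; the operator names (`fetch`, `filter_op`, `select`, `pipe`) and the sample feed are invented for this example and are not part of any real builder's API.

```python
# Illustrative sketch of operator-and-pipe mashup composition.
# All names and data here are hypothetical, not a real builder's API.

def fetch(items):
    """Source operator: in a real builder this would wrap an RSS feed or API."""
    return list(items)

def filter_op(predicate):
    """Filter operator: keeps only items matching the predicate."""
    return lambda items: [i for i in items if predicate(i)]

def select(*fields):
    """Selection operator: projects each item onto the given fields."""
    return lambda items: [{f: i[f] for f in fields} for i in items]

def pipe(source, *operators):
    """Feeds the source through each operator in turn, like wiring widgets."""
    data = source
    for op in operators:
        data = op(data)
    return data

feed = fetch([
    {"title": "Budget 2014", "topic": "government", "views": 120},
    {"title": "Match report", "topic": "sports", "views": 300},
    {"title": "Open data portal", "topic": "government", "views": 80},
])

result = pipe(feed,
              filter_op(lambda i: i["topic"] == "government"),
              select("title"))
print(result)  # [{'title': 'Budget 2014'}, {'title': 'Open data portal'}]
```

Each operator is a self-contained, reusable unit, which is exactly what makes such compositions accessible to non-experts: new mashups arise from rewiring existing operators rather than from writing new code.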

Mashup development requires a methodical construction process that can include the following steps: definition of the problem and domain (objectives, factors), definition of the IT environment, identification of technical requirements, technology selection, and definition of special mashup features such as version control or data integrity.

Mashups are a novel approach to building Web applications that can access various data sets and combine them. Mashups can be created by non-professional users and can cover different topics like finance, government, news, libraries, etc. Mashups can also be used to consume Linked Data. Furthermore, mashups offer an easy way to integrate non-semantic data in different formats with Linked Data sets.
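Integrating non-semantic data with a Linked Data set often amounts to joining on a shared key. The following minimal sketch is hypothetical: the CSV snippet, the DBpedia-style URIs, and the property names are all made up for illustration.

```python
# Hypothetical sketch: joining a non-semantic CSV snippet with
# Linked Data-style triples on a shared label. Data and URIs are made up.
import csv
import io

csv_text = "city,population\nVienna,1741000\nGraz,265000\n"
rows = {r["city"]: r for r in csv.DictReader(io.StringIO(csv_text))}

# Triples as (subject, predicate, object), e.g. fetched from a SPARQL endpoint.
triples = [
    ("http://dbpedia.org/resource/Vienna", "label", "Vienna"),
    ("http://dbpedia.org/resource/Vienna", "country", "Austria"),
]

# Group triples by subject, then attach CSV values where the labels match.
merged = {}
for s, p, o in triples:
    merged.setdefault(s, {})[p] = o
for s, props in merged.items():
    if props.get("label") in rows:
        props["population"] = rows[props["label"]]["population"]

print(merged["http://dbpedia.org/resource/Vienna"]["population"])  # 1741000
```

The same join pattern scales to real mashups: the CSV side would come from a spreadsheet or feed, and the triple side from a SPARQL endpoint or an RDF dump.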

2.9 Schema.org

Schema.org was introduced in 2011 by Yahoo!, Google, and Bing [39]. It represents a collection of schemas that can be used to mark up web pages in order to improve the recognition of data by search engines. Schema.org focuses especially on Linked Data and supports the generation of the following formats: RDF/Turtle, RDF/XML, RDF/NTriples, JSON, and CSV. As already mentioned in Chapter 2.1.3, data can be automatically generated from databases and put into HTML. The data stored in databases are already structured, but a search engine cannot recognize this structure once the data is presented in HTML format. “Many applications, especially search engines, can benefit greatly from direct access to this structured data” [39]. On-page markup enables more effective search and ordering of the data, making search results more relevant for users. The following example shows how content can be marked up using microdata. The original HTML code looks as follows:

<div>
  <h1>Harry Potter</h1>
  <span>Author: J.K. Rowling (born 31.07.1965)</span>
  <span>Country: United Kingdom</span>
  <span>Movie</span>
</div>

Listing 2.11: HTML code
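To see why such markup helps machine consumers, consider how easily a program can read schema.org microdata once itemprop attributes are present. The following sketch uses Python's standard html.parser; the annotated HTML snippet is illustrative and deliberately simplified (it handles only flat itemprop values).

```python
# Minimal sketch of a microdata consumer using Python's standard library.
# The annotated HTML snippet is illustrative, mirroring the movie example.
from html.parser import HTMLParser

html = '''<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Harry Potter</h1>
  <span itemprop="author">J.K. Rowling</span>
</div>'''

class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None   # itemprop whose text we are currently inside
        self.items = {}       # collected property -> value pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self.current = attrs["itemprop"]

    def handle_data(self, data):
        if self.current:
            self.items[self.current] = data.strip()
            self.current = None

parser = MicrodataParser()
parser.feed(html)
print(parser.items)  # {'name': 'Harry Potter', 'author': 'J.K. Rowling'}
```

Without the itemprop attributes, the same program would see only undifferentiated text; with them, the structured key-value pairs are recovered directly, which is precisely the benefit search engines draw from on-page markup.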

The schema.org vocabulary can be used together with the microdata format to add structure to the content of the web page. To identify the section that is about a movie, the itemscope attribute is used. The concrete items are defined by adding itemprop and itemtype attributes within the <div> block:

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Harry Potter</h1>
  <span>Author: <span itemprop="author">J.K. Rowling</span> (born 31.07.1965)</span>