Delft’s history revisited

Semantic Web applications in the cultural heritage domain

Martijn van Egdom

Delft’s history revisited

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Martijn van Egdom born in Rhenen

Web Information Systems
Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
www.wis.ewi.tudelft.nl

Erfgoed Delft en Omstreken
Schoolstraat 7
Delft, the Netherlands
www.erfgoed-delft.nl

© 2012 Martijn van Egdom.

Cover picture: View on Delft, painted by Daniel Vosmaer, Erfgoed Delft.

Delft’s history revisited

Author: Martijn van Egdom Student id: 1174444 Email: [email protected]

Abstract

While on one side there is an ever-increasing movement within cultural heritage organizations to offer public access to their collection data via the Web, on the other side the Semantic Web, fueled by ongoing research, is growing into a mature and successful addition to the Web. Nowadays, these two sides are joining forces, combining the large collections of mostly public data held by cultural heritage institutions with the revolutionary methods and techniques developed by Semantic Web researchers.

This thesis is the result of this rather symbiotic collaboration, providing multiple contributions for both cultural heritage institutions and Semantic Web researchers. Of special note are: the description of search techniques currently applied by cultural heritage organizations on their published data; the discussion of a generic method to transform legacy data to linked data, including a detailed analysis of each step of the process; and the development of a prototype of a faceted browser which utilizes the transformed data. The products of this research, a fully functional transformation method and a working prototype of a faceted browser, show exactly how cultural heritage organizations can benefit from new technologies provided within the Semantic Web.

Thesis Committee:

Chair: Prof.dr.ir. G.J.P.M. Houben, Faculty EEMCS, TU Delft
University supervisor: Dr. L. Hollink, Faculty EEMCS, TU Delft
Company supervisor: Drs. M. Beumer, Erfgoed Delft en Omstreken
Committee Member: Dr. M. Pinzger, Faculty EEMCS, TU Delft

Preface

Writing a thesis is like hiking a mountain. While some parts of the trail are steep and slippery, others provide a beautiful view and a place to relax and sit down. The goal of such a trip is often to reach that phenomenal scenic vista point, and to overcome one’s personal limits. For me, this thesis provided a journey from tricky definitions towards understanding the beauty of the Semantic Web. Furthermore, this hike marks the personal achievement of completing my master’s and moving on to a new chapter in life.

When going off to hike, some company is always pleasant, as companions can warn of sharp rocks, dangerous curves and other hazardous situations, and can provide the support needed to press on. I would therefore like to thank several people for their help, support and great ideas:

Joseba, my beloved wife, thanks for all your support.

Marjolein Beumer, I will remember the discussions we had about data.

Laura Hollink, always sharp, trying to figure out what I really meant.

Peter de Klerk, the brainstorming sessions on how to create RDF were helpful.

Kim Schouten, I hope I can enlist you again to help me write proper English.

Of course there are many more people who have been an asset in conducting the research and writing this thesis. These people are (in alphabetical order): Bennie Blom, Frans Bridié, Arthur Hanselman, Geert-Jan Houben, Anita Jansen, Karin Kievit, Frank Meijer, Wim van Rotterdam, Michel van Tol, Wout van Wezel, and Ivo Zandhuis.

Martijn van Egdom Voorburg, the Netherlands February 2, 2012


Contents

List of Figures

I Context

1 Introduction
1.1 Research Questions
1.2 Scope
1.3 Relationship with Erfgoed Delft en Omstreken
1.4 Related Projects
1.5 A small introduction into the Semantic Web
1.6 Why Linked Open Data?
1.7 Thesis structure

II Cultural Heritage Search

2 A survey of search techniques for cultural heritage
2.1 A brief history
2.2 Methodology
2.3 Search techniques
2.4 Considerations & conclusion

3 Currently applied search techniques at Erfgoed Delft
3.1 Methodology
3.2 Systems
3.3 Observations & conclusion

III Semantic Web for Cultural Heritage

4 Transforming legacy data
4.1 Semantic value
4.2 Characteristics of high quality RDF
4.3 The transformation recipe: a generic method
4.4 General issues & guidelines
4.5 An extended case study
4.6 Limitations
4.7 Evaluation
4.8 Conclusion

5 A Faceted Browser
5.1 Faceted browsing requirements
5.2 Architecture
5.3 Optimizing facets
5.4 Search Performance
5.5 Feedback on Facet
5.6 Results & conclusion

IV Conclusions

6 Conclusions and future work
6.1 Contributions
6.2 Research conclusions
6.3 Summary per research question
6.4 Future work

Bibliography

Glossary

List of Abbreviations

A List of Archives

B Full list of top 50 museums

C Diagrams of the transformation of the Mierenvelt Dataset

D in Delft data-sample

E Diagram Linked Open Data Cloud

List of Figures

1.1 The original architecture of the web
1.2 Dynamic pages using databases
1.3 Web 2.0
1.4 Basic architecture of the Semantic Web

2.1 Museum using Google (www.habitot.org)
2.2 History of Chicago (encyclopedia.chicagohistory.org)
2.3 Reina Sofia museum - VR Tour (http://www.googleartproject.com)
2.4 Basic search in the collection (http://www.britishmuseum.org/)
2.5 Looking for... (http://www.tante.org.uk)
2.6 Thesaurus term of Stichting Volkenkundige Collectie Nederland (SVCN) (www.svcn.org)
2.7 Searching for Medals (http://collections.vam.ac.uk/)

3.1 A charter (copyright Erfgoed Delft)
3.2 Detail of the last will and testament of A.J. van Brouwershaven - 1508 (copyright Erfgoed Delft)

5.1 Overview Architecture of Facet
5.2 Search versus Facet Filters

C.1 Step 2: Convert to plain RDF
C.2 Step 3: Complete the RDF
C.3 Step 4: Link to other resources within the data itself
C.4 Step 6: Link with more common ontologies
C.5 Step 7: Enrich by linking to other datasets on the web

E.1 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


Part I

Context


Chapter 1

Introduction

The introduction of the World Wide Web set off a revolution of astronomical proportions. Within just two decades it completely changed the businesses and lifestyles of people [9]. This revolution is based on the observation (and a bit of frustration) of just one man: Tim Berners-Lee. In the late 1980s he realized: "It’s not the computers which are interesting, it’s the documents!".

"The Web is more a social creation than a technical one. I designed it for a social effect to help people work together and not as a technical toy. The ultimate goal of the Web is to support and improve our weblike existence in the world. We clump into families, associations, and companies. We develop trust across the miles and distrust around the corner. [2]" Tim Berners-Lee

Nowadays, mankind is making the next revolutionary move: "It’s not the documents, it is the things they are about, which are important". This is the basic philosophy behind the Semantic Web.

As quoted, Tim Berners-Lee states that the Web is a social rather than a technical concept. While working on Semantic Web technologies at Erfgoed Delft, a cultural heritage institution in Delft, may seem to be primarily a technical effort, the intended results are nevertheless of a social kind. People are influenced by their culture and history, both as individuals and as a society. With the technology described in this thesis and applied in a prototype, Erfgoed Delft is able to serve people by offering them tools to explore their history.

1.1 Research Questions

This research is conducted in cooperation with Erfgoed Delft. The main research question is:

How can cultural heritage organizations, like Erfgoed Delft, benefit from new Semantic Web technologies with respect to their collections on the Web?


In order to be able to answer the main research question, the following research questions have been formulated:

1. What methods are available on the Web to offer the public insight into the collections of cultural heritage organizations?

2. What methods are currently used by Erfgoed Delft to enable the public to search through their collections?

3. What is a suitable method to transform ‘legacy collection data’ into Semantic Web formats?

4. What methods can cultural heritage organizations use to allow people to explore their semantic collection data?

In this thesis, collections on the Web are seen as the combination of both the data and the applications presenting that data in some form. This definition is reflected in all four research questions. The first two questions are primarily background questions: they are required to gain insight into the current state of collection websites. The third question forms the core of this research, approached from the academic Semantic Web point of view. Of all four questions, this one focuses most on the data part of the collections, while the other three are more concerned with the presentation of the data. The last question focuses on actual applications using the Semantic Web. The research on the fourth question is not only theoretical: the goal is to implement an actual application using Semantic Web technologies which could be used (as a prototype) within Erfgoed Delft, and to demonstrate the added value of transforming data to Semantic Web formats.

1.2 Scope

The research focuses on the transformation of ‘legacy’ data into a Semantic Web format, for which RDF will be used throughout this research. A few constraints regarding the scope of the research are set to safeguard this focus.

• During this transformation it is likely that some errors or inconsistencies in the data will emerge. These problems will be reported but not fixed, unless there are very strong arguments to do so, because they may require specific domain knowledge.

• Although there may be many possibilities to enrich the RDF after transformation with techniques like Natural Language Processing, not much effort will be spent on actually performing these enrichments.

• The same constraint holds for linking to other datasets in the Linked Open Data cloud: some linking will be performed, but this will be of a more exploratory nature, to gain insight into the possibilities.


1.3 Relationship with Erfgoed Delft en Omstreken

A few years ago the municipality of Delft and Delft University of Technology agreed on intensifying the collaboration in research projects. The municipality became a provider of use cases, problems, and sometimes budget, while the university offered its academic knowledge. This agreement spans many domains of research, including Industrial Design Engineering, Architecture, and Computer Science.

This research is one of the offspring of that collaboration. Erfgoed Delft en Omstreken (Erfgoed Delft) provided data, domain expertise, a use case, and much dedication. Erfgoed Delft is a cultural heritage organization with both a local mission and a national responsibility.

In the past few years, several departments of the municipality of Delft, including the three museums owned by the municipality, the municipal archive, and the department of archeology, joined forces and became Erfgoed Delft. Their local mission is to create awareness among the citizens of Delft concerning their local history. With the municipal archive on one side and projects like WikiDelft (a Wikipedia on the history of Delft built with, and for, the citizens of Delft) on the other, people in Delft have access to information, both offline and online, concerning their history. Erfgoed Delft also incorporates several museums: Museum Lambert van Meerten, Museum Nusantara and Museum Het Prinsenhof. Especially the last museum bears a national responsibility, since this museum is centered around the founding of the Kingdom of the Netherlands, in which the city of Delft played a vital role.

This combination of both locally and nationally valuable knowledge, as well as the combination of both archive and museum collections, made the research fascinating, spanning multiple cultural heritage domains.

1.4 Related Projects

This section covers a few related research projects and explains the differences between the related research and this project.

1.4.1 MultimediaN E-Culture

The main objective of the MultimediaN E-Culture project is to demonstrate how novel Semantic Web and presentation technologies can be deployed to provide better indexing and search support within large virtual collections of cultural-heritage resources [15]. The data used in the E-Culture project are all derived from museum collection sources, including the Rijksmuseum in Amsterdam and the Rijksmuseum voor Volkenkunde (State Museum for Ethnology) in Leiden.

Compared with this research, the main difference is found in the data: this research tries to use not only museum collection data but also archive data. Furthermore, this research focuses more on the creation of RDF than on creating applications.


1.4.2 Europeana

The Europeana project, operational since November 2008, aims to make Europe’s cultural and scientific heritage accessible to all on the internet [16]. Currently, over 1500 institutes have provided information, creating a collection consisting of over 10 million objects.

The contributing institutes transform their semantically rich data to a small metadata standard. In this transformation a lot of semantics is lost, but the standard makes it possible to link data from different institutes to each other. In comparison, this thesis is much smaller and focuses on the semantics of the data rather than on exchangeability. In addition, this research aims at the creation of applications, which is not a main goal of the Europeana project.

1.5 A small introduction into the Semantic Web

Since the Semantic Web is both a relatively unknown concept and required background knowledge for this thesis, this section briefly explains the Semantic Web and its related concepts, like the Linked Open Data Cloud.

The Semantic Web does not intend to replace the World Wide Web of today but rather acts as an extension. To introduce the Semantic Web, several evolutionary stages of the Web are discussed, resulting in an explanation of the Semantic Web itself.

1.5.1 The first webpages

The World Wide Web introduced by Tim Berners-Lee had a simple but elegant and efficient architecture (see figure 1.1). Users could use a web browser to retrieve documents from a web server using the HTTP protocol. These documents are formatted using the HTML standard. HTML provides the possibility to mark up the document and offers hyperlinks: links to other pages (documents) on the web. These links became the key to the Web’s success: you could refer to any document on the Web without knowing the physical location of the computer where the document was stored.

Figure 1.1: The original architecture of the web.


1.5.2 Dynamic webpages

Soon after its introduction, many people realized the enormous potential of the Web, resulting in many new ideas being invented. Some ideas were abandoned quickly, while others became de facto Web standards. Examples of the latter include Javascript, which was introduced to create interaction with the user on a page at the client side, and programming languages like PHP, which made it possible for a server to generate pages on the fly using a database (see figure 1.2).

Figure 1.2: Dynamic pages using databases.

1.5.3 Web 2.0

From there the Web started to grow: all kinds of techniques and ideas were introduced and disappeared again. The decisive breakthrough came with the introduction of AJAX (Asynchronous Javascript And XML). This technique enables web developers to communicate with the server while the user is visiting the page, instead of needing to perform a post-back. It opened new ways of user interaction, providing users with a (sometimes not so) smooth browsing experience. Around the same time APIs (Application Programming Interfaces) were introduced: some websites offered controlled access to their databases over an HTTP connection. This led to websites enriching their own data with data from other sources. An example of this can be found in the many websites now using Google Maps.

Figure 1.3 shows the current architecture. Please note that the original design is not really changed, only extended.

1.5.4 Semantic Web

Although Web 2.0 offers users a smooth experience, it contains a major flaw: since the information sent inside the HTML is free text, it is impossible for computers to reason on this information. This lack of reasoning prevents the answering of even simple questions like “I am looking for product X; which stores that sell this product are open today at 20:00?”.


Figure 1.3: Web 2.0

In [3] Tim Berners-Lee introduced the Semantic Web with the example of an agent crawling the Web to set up a doctor’s appointment by reasoning on the quality of the doctor, the traffic jams prior to the appointment, and whether that doctor is covered by his insurance or not.

But to enable such reasoning, the architecture of the Web had to be adapted. A new layer was added below the current Web, in which the data, instead of a marked-up page, is available in a similar way as on the Web. A certain piece of data can link to any other piece of data anywhere on the web, just like hyperlinks connect pages. Therefore a data format, called RDF, also had to be designed: all data can be reduced to the abstract form of a triple: Subject Predicate Object. Note that exactly the same construction can be observed within natural languages.

• The subject is the thing, concept or person who or which carries out the ‘action’ of the predicate.

• The predicate tells how the object applies to the subject: the ’action’.

• The object is the person or thing upon whom or upon which the ‘action’ of the predicate is carried out.

These definitions may be hard to grasp at first sight, but a simple example can clarify things considerably: in the sentence “Mr. Clinton is teaching Algebra to the students.”, the subject is Mr. Clinton, since he is carrying out the action of ‘teaching Algebra’, which is the predicate, leaving ‘the students’ as the object, since upon them the teaching is carried out.

This simple example can also be expressed in RDF, the data format which is used in the ‘data layer’ of the Semantic Web (figure 1.4):
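A minimal sketch of how this triple could be written down, using the Turtle notation for RDF and three made-up example URIs (in a real dataset these would point to actual definitions on the Web), is:

    # Hypothetical example URIs; subject, predicate and object are each
    # identified by a URI, and the statement ends with a full stop.
    <http://example.org/people/MrClinton>
        <http://example.org/concepts/teachingAlgebra>
            <http://example.org/groups/TheStudents> .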


If the first URI is retrieved, one will find information on Mr. Clinton, for instance where he lives, his age, etc. The second URI will likely return a definition of teaching Algebra, while the third will likely return information on Students (probably stating they are lazy).

With these simple tools, millions of facts can be recorded. To enable reasoning, another step had to be taken: the definition of relations. RDF Schema and the later introduced OWL offer all kinds of mathematical (after all, the Semantic Web is for computers) relations which can be used to define relations between instances, classes, and predicates.

One of these relations is the SubClassOf predicate (meaning that every A is a B, but not every B has to be an A). For example, if one of the recorded facts states ‘LinearAlgebra SubClassOf Algebra’, meaning that linear algebra is a kind of algebra, and another fact states ‘Algebra SubClassOf Mathematics’, computers are able to reason that linear algebra is also a form of mathematics. Because of the mathematical definition of SubClassOf, computers will not perform the reasoning that mathematics is the same as linear algebra.
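As an illustration, these two facts could be recorded in RDF Schema as follows; the ex: namespace is a made-up placeholder, while the rdfs: namespace is the standard RDF Schema vocabulary:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/topics/> .

    ex:LinearAlgebra  rdfs:subClassOf  ex:Algebra .
    ex:Algebra        rdfs:subClassOf  ex:Mathematics .

    # Since rdfs:subClassOf is transitive, a reasoner can infer:
    #   ex:LinearAlgebra  rdfs:subClassOf  ex:Mathematics .
    # The reverse (Mathematics being a kind of LinearAlgebra) is never inferred.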

Figure 1.4: Basic architecture of the Semantic Web

The only thing left to make use of this reasoning was a query language, which came with the introduction of SPARQL (SPARQL Protocol and RDF Query Language). This language is used to actually ask a computer to perform a certain reasoning on a dataset to get the desired answer. With these relatively simple tools, computers can now reason on the information they retrieve.
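To illustrate, a SPARQL query over the hypothetical topics sketched above could look like this:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/topics/>

    # "Which topics are a kind of Mathematics?"
    SELECT ?topic
    WHERE {
        ?topic  rdfs:subClassOf  ex:Mathematics .
    }

    # Without reasoning this returns only ex:Algebra; an endpoint that applies
    # RDFS reasoning will also return ex:LinearAlgebra.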

If some data is published on the Semantic Web, it is often referred to as Linked Open Data (LOD): ‘Linked’, since it uses URIs to link to definitions and other datasets on the Web, and ‘Open’, since the data is publicly available to everyone. The collection of all datasets referring to each other on the Semantic Web forms what is called the Linked Open Data Cloud.

With the LOD Cloud, applications can now be built which are able to perform tasks on the Web that are far beyond the imagination of what people thought was possible with Web 2.0.

1.6 Why Linked Open Data?

Nowadays, the Semantic Web is slowly moving away from being a purely academic experiment and is being picked up by governments, companies, and other parties. When defining a policy for the Web, these organizations often ask: “why should we choose Linked Open Data?”.

This section elaborates on that question; choosing whether or not to publish Linked Open Data is, of course, a trade-off. First, however, it must be recognized that the question is twofold: “Why open data?” and “Why linked data?”. The former is a question from a policy point of view, while the latter also incorporates technical aspects.

Why open data?

Today the world is characterized by information exchange and collaboration, especially when observing the Internet. Many people utilize shared information and share information themselves, using platforms like Wikipedia or other social media. Companies are no different, exchanging information with each other on their products and collaborating in online platforms and projects. This is also true within the cultural heritage sector, where for example the RCE (the Dutch government organization for cultural heritage) reports [6] that network-based thinking, collaboration, and information exchange are very important aspects in the years to come.

However, the information itself is often hidden or not available to computers at all. It is either published on websites using HTML, or not available at all because it is stored on local computers. The problem is that computers can process HTML files, PDF files, images, etc., but do not have any semantic awareness concerning their contents. Much of this information is, of course, very valuable: museum collection data, government data, information on the opening hours of restaurants, and product catalogues, to name just a few.

Imagine what happens when data is published in such a way that computers can understand it. For example, Tim Berners-Lee had the vision of a calendar application that was able to plan a doctor’s appointment, taking into account the quality of the doctor, avoiding certain traffic jams, etc.

Often people have many objections when the value of the Semantic Web is explained to them. While some data is indeed privacy-sensitive and should be very well protected, it is often possible to publish data without any privacy issues, for example by rendering the data anonymous.

“But if I open my data then I give away my knowledge and my reason to exist.”

While this argument is regularly used within the cultural heritage domain, it ultimately proves to be incorrect. First of all, it is the ability to perform research and administer data in a certain field that is the main expertise of cultural heritage institutes, not the resulting data. Compare this with software engineers: although they have knowledge of software and software structures, it is their ability to apply that knowledge to a certain problem that is more valuable than the raw knowledge itself.

Furthermore, most data represent a physical item in the real world. By publishing that data, the physical item is not given away. On the contrary, publishing data about a particular object and its context supports the position of institutes as creators of knowledge rather than mere collectors of cultural heritage.

Finally, because the main source of funding in the cultural heritage domain is often public money, it could be argued that the result of public funding should be public property as well.

“But when I open my data, the quality will deteriorate.”

This argument ultimately proves to be incorrect as well. An obvious example of this argument being wrong is Wikipedia: with millions of articles and thousands of writers, this encyclopaedia does not contain more errors than the expensive ones created by domain experts, like the Britannica [8].

Another example contradicting this can be found in the open source software domain. Some projects have such high quality control mechanisms that they are among the best software in the world. Apple, just to take one example, considered the quality of open source software high enough to use it as the basis for its Mac OS X operating system.

When data is published, it is likely that the public will start looking into it. While it can be confronting for an institute when the public finds errors in its data (see section 4.4.4), this can ultimately lead to improvement of the data.

“But my business model is based mainly on some unique information of which I am the only provider.”

Organizations that base their business model on owning unique information should consider that such a business model is no longer solid. There are many examples showing that these business models have lost their competitive edge. Just to give a few: phone numbers, postal codes, and street map information are now freely available on the Web, while just a few years ago this data was only available on CD-ROM, for which an update had to be bought every year, or through an expensive phone service that had to be called.


For many governmental organizations, datasets containing non-personal data are forced to be public by law. Yet a myriad of official reasons is employed by these organizations to prevent the publishing of their data, in order to willfully keep the public ignorant regarding their conduct.

“But people will start using my data for applications I do not really like or that do not fit my profile.”

When people start using the data, it can be argued that it is valuable to them. The development of applications using the data shows its value even more. Organizations that dislike the applications being built upon their data should wonder why these applications are being developed. Apparently the customers have needs that are either unforeseen or even ignored. In both cases it would be wise to acquire more insight into the organization-customer relationship.

Organizations might decide to help the developers by providing knowledge about the data. In this way organizations may steer development in a direction they like. This situation is mutually beneficial, not only for the organization and the developer but also for the public.

In any case, opening data will have unpredictable effects, making it an enterprise most organizations are unwilling to undertake. However, such reluctance is misplaced, because when the opened data is valuable for the public, they will definitely start building applications, which will generate much publicity, offsetting any drawbacks the opening of the data may have. That this is not just a theoretical illusion was proved by what happened when Massachusetts (USA) opened up its public transport data: in a matter of weeks several applications were launched, built by the community, in which routes could be planned.

This concludes the discussion of the open aspect of Linked Open Data, a subject not discussed further in this thesis.

Why Linked Data?

This question is more technical in nature. Still, the discussion starts with the non-technical issues, gradually moving towards the more technical ones.

The main reason for using RDF, which is the best known format to represent linked data, is its ability to express semantics and the fact that it enables computers to reason with the data. The concept of ‘semantics’, which is derived from the Greek ‘semantikos’, is generally defined as “the study of meaning”. So with RDF it is possible to express the meaning of data.

The first thing to recognize is that the meaning of a word, which is a basic piece of data, is not really defined by just one person; it is rather the product of a common process. Of course people do invent new words all the time, but in the end it is a group of people who will use that word to communicate about a certain concept. That group can be very small, like a couple who have special words to tell each other that they love each other, or it can be very large, like the 1.3 billion Chinese-speaking people who share many words to communicate with each other.

Thus, for semantics, two requirements have to be met: a certain word needs to be used by a group of people, and that group of people must use that word in similar ways.

When this concept is applied to any given dataset or format, it can be concluded that it is very hard to define semantics. This has resulted in many organizations working together to create various standards. A good example of this is the EAD standard, which is used for archival descriptions and defines a large number of tags.

By using this standard, data can be shared. However, being able to share data does not imply that computers can perform reasoning on the data. The main reason is that a common data format (XML in the EAD example) does not automatically provide the possibility to reason about the content: to be able to reason, rules must be provided that describe how to reason.

Within RDF, one of the standards for Linked Open Data, this is different! First of all, there is a classification mechanism in place to assign a class to a certain item and, furthermore, there are definitions that can be used to describe how these classes relate to each other. To give a simple example: in RDF one can define two classes, say ‘cat’ and ‘animal’. It can also be defined that ‘cat’ is a subclass of ‘animal’, meaning that if something is a ‘cat’, it is also an ‘animal’. If this definition of ‘cat’ is used to state that a ‘cheetah’ is a ‘cat’, computers are able to reason that a ‘cheetah’ must also be an ‘animal’. To extend the example, if another definition states that a ‘bird’ is also an ‘animal’ and defines that a ‘dove’ is a ‘bird’, then computers can reason that a ‘dove’ is also an ‘animal’, but not a ‘cat’.

This example shows how computers can understand semantics and reason about it. To recall, the other aspect of semantics was the fact that it is defined by the communication within a group. In the Semantic Web domain, many groups have defined generic concepts for reuse. For example, the ‘foaf:name’ property provides a widely shared definition of the name of a person. RDF uses URIs to refer to any place on the (Semantic) Web. When people are described in a certain dataset, it is very likely that for each person a name is included, so in that dataset the ‘foaf:name’ property could be used. Since that property is used in many places, computers are able to share data with any computer that knows what ‘foaf:name’ means. When XML is used with only the tag ‘name’, a computer cannot be sure that the ‘name’ tag refers to the name of a person at all.
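A small sketch of such a description in Turtle may clarify this; the person and the ex: namespace are invented for this example, while the foaf: namespace refers to the commonly used FOAF vocabulary:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/people/> .

    ex:person42
        a          foaf:Person ;
        foaf:name  "Jane Doe" .

    # Any application that knows the FOAF vocabulary can interpret foaf:name,
    # regardless of which dataset this description comes from; a bare <name>
    # tag in XML would carry no such shared meaning.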

So RDF, including all its syntactic variants like N3, Turtle and N-Triples, is designed specifically to deal with semantics. Other formats do not meet the requirements for semantics and, when used for publishing on the web, can be considered just plain text or binary data.


Part III of this thesis, consisting of chapters 4 and 5, is about semantics: transforming to Semantic Web formats, capturing semantics, and dealing with semantic data.

1.7 Thesis structure

This thesis report is structured as follows:

Chapter 1 provided an introduction to the research questions and the relationship with Erfgoed Delft, and elaborated on related research. It also contained a small introduction to the Semantic Web, as well as a discussion regarding Linked Open Data.

Chapter 2 discusses several methods that cultural heritage institutions are using to present their collection on the Web. This chapter also explains which method might be suitable for which type of data.

Chapter 3 presents an overview of methods currently used within Erfgoed Delft to provide the public online access to their collection. The chapter compares features of the several methods and identifies possibilities to create (demo) applications using Semantic Web technologies.

Chapter 4 introduces a generic method which has been developed to transform several datasets with ‘legacy’ data of Erfgoed Delft to a Semantic Web format. After an introduction, it presents an extended case study with four datasets to which this method is applied. The chapter also elaborates on many issues encountered during this application.

Chapter 5 provides information on the creation of a faceted browser based on the transformed datasets. Besides this, the chapter also discusses some properties of faceted searching and what kinds of datasets are suitable for faceted browsing.

Chapter 6 concludes this thesis report and provides possible future work in this context.

Part II

Cultural Heritage Search


Chapter 2

A survey of search techniques for cultural heritage

With the general acceptance of the World Wide Web, many museums, as well as other institutions with cultural heritage collections, opened their virtual doors, enjoying practically unlimited space to showcase their art objects. This is in great contrast to how the collection is showcased physically: due to severe limitations on physical space, often only the top pieces of the collection are shown in museums and galleries.

However, since museums regularly have hundreds of thousands of art objects, some search technique must be employed to be able to find a particular object within these huge collections.

This chapter focuses on that question: which search techniques are used and available on the World Wide Web to browse, search and explore collections? The first section sketches a very brief historical background of searching within the cultural heritage domain. The second section introduces the methodology used to research this question. The third section presents the results: it lists, and comments on, several methods found and tries to evaluate their advantages and disadvantages. The final section then provides a conclusion for this chapter.

2.1 A brief history

Already in the late 1970s, many institutions started to use electronic methods to register their collection. Often the goal was to simplify the internal management and administration of collections. Initially just standard software packages, like general purpose databases, were used, as is probably still the case for some institutions.

While the (r)evolution of the digital era continued, software vendors started to introduce special packages to simplify the management of collections. These packages offered support for registration information: location of objects, loaning, restoration, etc. Scans or pictures could also be attached so the registrar had a visual reference.

Libraries were among the first that felt the need to make their collections searchable, including the ability to search the collections of other libraries. These libraries, and later also archives, started to offer their visitors search access to their registration systems so that they could find the books or documents they required.

The introduction of the World Wide Web revolutionized the way in which people access information. Libraries began to offer people the ability to search (and sometimes even reserve books) within their collection using a web browser.

Museums first used the web primarily as an additional method for generating publicity, like promoting (upcoming) exhibitions and providing general information such as location and opening hours. But slowly they began to connect their registration systems to the web, opening up all their treasures to the public.

Nowadays almost all museums have their collection, or at least part of it, published online. The forms range from municipal archives providing PDFs of their archival descriptions, to fully featured faceted browsers at some large museums, and to initiatives looking forward to connecting all the cultural heritage institutions in Europe.

The future is bright, according to a new generation of revolutionaries who fervently support the Semantic Web: with connected collections all over the world, the amount of information usable and searchable by the public will be beyond imagination, offering mankind an unprecedented insight into its own history.

2.2 Methodology

To determine which search techniques are actually being used to search through collections, the websites of the 50 largest museums of the world have been inspected¹. Furthermore, since this thesis aims at cultural heritage in general and not only specifically at museums, the websites of several national and municipal archives were inspected as well. Note that the focus is on techniques actually available on the web, and not on what may be proposed in literature.

For each museum, the homepage and sometimes Wikipedia was used to determine its major art objects. This information was used to search for these objects on the collection website of the museum. For example, on the collection website of the Louvre, a search was performed for the Mona Lisa (and also in French: la Joconde).

For archives this approach was not suitable: often archives do not feature a few particular pieces of art, but instead they have a huge collection of documents. Therefore, for archives, a search with keywords related to the geographical area covered by the archive was performed.

For any given search, two elements were specifically inspected: the set of options the search method provides, combined with its inner workings, and the set of options that is provided to process the results.

¹ Largest as measured by the number of visits in 2010 [12].

Examples of the former include: an insight into how the search process actually works, the possibility to use more advanced methods (like limiting the search period), whether the search engine provides a list of predefined values (like the names of artists), and the option to limit results using categories, topics, etc. Examples of the latter are the possibility to filter results afterwards, or whether a feature such as presenting similar objects is present.

Since the budget available to small, local museums is of a totally different class compared to that of the top 50 museums of the world, a bias can easily occur. To counter this bias, a number of smaller museum websites were visited as well. Often they were picked because the website of that particular museum was mentioned during this research. Examples of these museums are Het Tropenmuseum and Versailles.

This methodology aims to provide a concise survey of some of the search techniques used on the web rather than to offer an exhaustive list of all currently available search techniques. Each technique will therefore only be briefly discussed.

After inspecting a certain website, its features were listed. Afterward, these features were compared with each other and a classification of search techniques was made. In the next section these search techniques are presented.

2.3 Search techniques

In this section several search techniques are presented which are currently in use on the World Wide Web to search through collections. For each technique, an advantage and a disadvantage are given. A rough estimation is given for each technique regarding the control an organization has over the search process. In some search processes, rated with a high control factor, the visitor is directed stepwise towards an object he is interested in. In other processes the user just uses a custom Google search, for which the search results cannot be controlled at all. Likewise, an estimation of the publishable amount of data is given. Both factors are purely indicative, as quantitative data is lacking. For this very reason, an estimation of costs could not be presented. To provide more insight into costs versus control versus publishable amount, further research would be required.

Information within the cultural heritage domain can be divided into three basic categories: intangible cultural heritage (e.g., stories, music, and dance), objects (e.g., paintings and sculptures), and documents (e.g., archival records on the population of a city, and legal possession documents). For each technique, the compatibility with each of these three categories is listed.

Finally, the number of top 50 museums using a certain search technique is stated. Appendix B contains a full overview of search techniques for each museum. Please note that the actual number of museum websites inspected was lower than originally intended because some websites were not available in English, and that several organizations combine or offer multiple search techniques. A list of the inspected archives can be found in Appendix A.


2.3.1 The ‘let’s use Google’ approach

The ‘let’s use Google’ approach is by far the most elementary one: use Google (or any other search engine) to index a collection site. Please note that this search technique incorporates both the usage of a custom Google search field on the website, and the ‘don’t do anything’ approach, in which it is trusted that visitors will use Google anyway.

Traditional web search engines crawl the web, including museum sites, to offer a text search. So, by merely putting pages with information online, the website will some day be found. This makes the technique very cheap, which is its biggest advantage. Another huge advantage is that visitors are likely to be already familiar with Google.

Figure 2.1: Museum using Google (www.habitot.org)

This technique will work well with relatively large texts and not too many objects. Publishing a list of objects with only a small text or a set of tags describing each object will not be very useful unless the objects themselves are very famous: the page will most likely not show up in any search results, or only with a very low ranking. For collections with extensive descriptions, or for collections of only famous objects or places, this technique will likely be productive.

The main disadvantage is the lack of control: it will be hard to guide the visitor to a particular page. On the other hand, because it is so general, most visitors will be familiar with the search engine already.

The ‘let’s use Google’ approach
Control: Very low
Publishable amount: Low to Medium
Information types: Intangibles; documents when fully digitalized
Top 50 museums using it: 0 out of 50
Archives using it: 0 out of 6

Please note that, although all top 50 museums can be found using Google, they do not incorporate the ‘let’s use Google’ approach. However, this approach is still listed since it is used by a number of smaller museums.

2.3.2 The encyclopedia approach

A new revolution on the web was Wikipedia: a free, fully available, publicly created encyclopedia. This website inspired many museums around the globe to create an encyclopedia themselves.


Particularly museums with a historical context, like the Chicago History Museum, seem to choose this approach. This approach is very suitable for stories and large amounts of texts and articles. The biggest advantage is that this technique helps with publishing and structuring the articles. It makes it possible to tell an extended story which users can explore in depth.

Figure 2.2: History of Chicago (encyclopedia.chicagohistory.org)

This technique can be considered an extension of the ‘let’s use Google’ approach, still offering users their familiar search engine, but it is different in the sense that an encyclopedia does offer an index, topic organization, and often even an internal search engine.

Since there are many open source and free packages available to create an encyclopedia system, the costs for the software itself will not be high, although it may require some technical assistance to set up. The costs are likely primarily in creating the stories and adding enough content to be interesting for the public. These costs can, however, be reduced when the public is asked to help create the encyclopedia, something that has been experimented with already.

Its biggest disadvantage is the lack of firm support for tens of thousands of objects. With a page for each individual object, the number of pages will become enormous, which hinders the user in effectively finding an object; whereas putting many objects on one page results in the problem that an individual object will be hard to find.

The encyclopedia approach
Control: Low
Publishable amount: Medium
Information types: Intangibles; documents when fully digitalized
Top 50 museums using it: 6 out of 50
Archives using it: 0 out of 6

2.3.3 Online Exhibitions

Some museums use their expertise in creating exhibitions to present the user with its digital equivalent. Many forms are possible when creating an online exhibition, ranging from pages one can browse, to sometimes even a Virtual Reality (VR) tour of parts of the physical museum.


Figure 2.3: Reina Sofia museum - VR Tour (http://www.googleartproject.com)

An advantage of this technique is that the visitors can be guided in a specific way that enables the organization to tell its story. Art objects can now be set into their context, grouped, or lined up, just as is done in the physical museum. In this way, the organization is able to create a certain atmosphere on the website, similar to the physical museum. Furthermore, it can present a smooth tour, providing the visitor with a pleasant experience. Particularly museums that reside in a famous or otherwise special building like to spend budget on Virtual Tours, in order to show not only the collection but the building as well.

However, for visitors it can be quite hard to find information on a particular object. The amount of time spent browsing (or walking) through a digital exhibition can be quite large. Visitors therefore have to choose between starting such an activity and searching specifically for one item.

Costs may be an issue as well. Again, creating web pages with rooms may not be very expensive, but taking photos of the entire museum to put into a Virtual Reality can be challenging when the budget is limited. A possible solution is found in the Google Art Project, which uses special vehicles to drive through the museums, reducing the costs considerably.

Online Exhibitions
Control: Medium
Publishable amount: Medium
Information types: Intangibles; objects
Top 50 museums using it: 10 out of 50
Archives using it: 0 out of 6


2.3.4 Text Field Search

Text Field Search (also referred to as focalized search [17]) is a commonly used technique and comes in a variety of implementations, forms and appearances. There are essentially two different objectives that can be achieved with text field searching: retrieving pages with information or finding objects.

The basic characteristic of this technique is that the visitor enters a keyword and the system retrieves a list of results matching that keyword. While this may not seem that much different from the ‘let’s use Google’ approach, there is one fundamental difference: the search is conducted using a connection with a database that contains the information, instead of searching the pages that are used to present the information to the user.

In addition to the basic form, in which the search terms are simply used to query all the appropriate data fields in the database, more advanced features are sometimes provided that allow searching in specific fields with certain keywords (e.g., searching for a specific material in the field ‘material’), or that give the user a set of predefined values that can be used to limit the search results to some subset of the data.

Figure 2.4: Basic search in the collection (http://www.britishmuseum.org/)

This technique is very suitable for searching in enormous collections of objects, archival documents, and pages: almost anything that fits into a database or information retrieval system can be used with this technique. Unsurprisingly, this technique is used by all the inspected archive websites. Almost all the museum websites offer some kind of searching in their databases as well, which is done using this technique. The visitor is able to find a particular item in a huge set and can retrieve just the information he is interested in. The control for both the museum and the visitor is high, especially when more advanced features are provided. There is, nonetheless, a risk in applying more advanced varieties of this search technique: it can become complex to use for the visitor.

The disadvantage of this technique can be the costs, as a lot of development is most likely required to get systems like these up and running. For organizations with a large financial capacity this may not be a problem, but for smaller organizations it will be a major obstacle.

Furthermore, it is not possible to tell a story with only a collection of objects, since the user can arbitrarily pick any object. All objects in the database should also have public information in order to have anything to show the user.

Figure 2.5: Looking for... (http://www.tante.org.uk)

For archives, which all use this technique, the challenges are even greater, as their vast amount of physical documents should first be digitalized, or at least annotated, which is a complete field of research on its own. On the premise that the documents are digitalized and that it is possible to search their contents, Text Field Search can be a very powerful tool for archives as well.

Text Field Search
Costs: High to very high
Control: High
Publishable amount: Almost unlimited
Information types: Objects; documents; intangibles
Top 50 museums using it: 31 out of 50
Archives using it: 6 out of 6

2.3.5 Thesaurus Search

A thesaurus is a dictionary with information about a particular set of concepts, including strict definitions of these concepts. It is often presented tree-wise, with nodes being more general towards the root of the tree and more specific towards its leaves. Thesauri are extensively used among professionals in cultural heritage, as they enable them to classify objects and provide a reference for common understanding.

Some websites offer to use these thesauri to search for objects. The visitor can browse in the tree and see the connected objects or he can search in definitions of concepts to find related objects.

Browsing through a thesaurus can be a very powerful tool to find certain objects and see how they are related. However, the results depend heavily on how well the objects are annotated with terms from the thesaurus. Please note that, when used in this way, the thesaurus acts as a controlled dictionary to annotate objects. Consequently, if only a small percentage of the entire collection is annotated with the thesaurus, only that percentage of the collection will be found. This is, however, not its greatest disadvantage: visitors should be familiar with the thesaurus, or at least with (some part of) its concepts. With insufficient domain knowledge, a thesaurus is hard to use and will likely not lead to the desired concrete results.

Figure 2.6: Thesaurus term of Stichting Volkenkundige Collectie Nederland (SVCN) (www.svcn.org)

For domain professionals, it does offer unmatched control over the search process: with the right actions, specific objects can be found very quickly.

Creating websites with a thesaurus can be quite expensive, although costs can be reduced significantly when the data is already available (e.g., when it already resides in a back-office system). Especially then, Thesaurus Search can be a very valuable tool, since it supports professionals in quickly searching through thousands or even tens of thousands of objects.

Thesaurus Search
Control: Very high
Publishable amount: High to almost unlimited
Information types: Objects
Top 50 museums using it: 2 out of 50
Archives using it: 0 out of 6

2.3.6 Recommendations

In e-commerce it is pretty standard to give some kind of recommendations to the potential buyer. The main reasons [19] are to increase sales and to build loyalty with the customer. While in cultural heritage selling objects is not the business, it may be worth trying to create a relationship with visitors.

By offering the visitor a small list of recommendations, he can more easily explore more art he likes, resulting in the visitor staying longer on the website and valuing it more. While this is beyond any doubt a great benefit, creating good recommendations is not a trivial task, because it requires a lot of information about how objects relate to each other and how they relate to the art a person likes.


Because of its complexity, there are not many websites that use recommendations, although there are some institutions performing experiments with them. A small example of such an experiment can be found below (the KIT example).

The biggest issue with recommendations is that they should make sense, so the visitor can understand why the recommendation is given and why it is relevant for him. For example, when the visitor is viewing a painting of Maria van Reigersberch, a recommendation of information about the castle of Loevestein may be relevant, but a recommendation of a painting by Picasso is likely to be irrelevant. And while good recommendations improve the visitor’s experience with the website, bad ones can quickly stir up annoyance, resulting in a decreased experience and diminishing the website’s value.

Recommendations
Control: Medium to High
Publishable amount: High to almost unlimited
Information types: Objects
Top 50 museums using it: 2 out of 50
Archives using it: 0 out of 6

KIT example

The Royal Tropical Institute (KIT) in Amsterdam is an independent center of knowledge and expertise in the areas of international and intercultural cooperation. KIT has a large museum collection (about 380,000 objects) and a huge library with information regarding former Dutch colonies.

A project was started with the aim of providing an integral search solution for both the museum collection and the collection of the library. The returned objects should then refer to the library for either more information in general, or for a description of their origin in particular.

Their approach was to use a text comparison algorithm to find similarities between objects and documents, using these similarities as recommendations for visitors. Both the objects and the documents are annotated with a few properties like date, origin, etc.

A search starts with entering a certain keyword. The system first uses a full text search to retrieve any object or document matching that keyword. The resulting documents and objects are then all compared with each other to find similar words or phrases. These similar words are ordered by number of occurrences and a top selection is returned. This similarity matching is also applied to a few properties like date, origin and author.

The final result is then returned to the user, who can now browse through the results, select an individual item, or use the found similarities as keyword to further narrow down the search to fewer items.


The results depend heavily on the quality of the algorithm. Because the system was originally designed for the English language, while most of the texts are in Dutch, only moderate results are produced in its current beta stage.
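To make the kind of overlap-based matching described above more concrete, the following is a minimal sketch in Python of how shared terms between an object description and a library record could be found and ranked; the tokenization, the minimum term length and the example texts are illustrative assumptions, not the actual KIT implementation.

    import re
    from collections import Counter

    def tokens(text, min_length=4):
        """Count the words of a text, ignoring very short words."""
        return Counter(w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_length)

    def shared_terms(text_a, text_b):
        """Terms occurring in both texts, ordered by total number of occurrences."""
        a, b = tokens(text_a), tokens(text_b)
        return sorted(set(a) & set(b), key=lambda w: a[w] + b[w], reverse=True)

    object_description = "Portret van Hugo de Groot, gravure naar een schilderij van Van Mierevelt"
    library_record = "Boek over het leven van Hugo de Groot en zijn ontsnapping uit slot Loevestein"
    print(shared_terms(object_description, library_record))   # the shared name terms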

2.3.7 Faceted browsing
Faceted browsing exploits the data model of the data. For instance, an object has a creator, is made of a certain material and contains certain colors. These predicates (creator, material, color) can be used as filters on a given data set.

A visitor could for example start with all paintings by Golden Age painters, filtering them with more criteria until he has found all the paintings by Van Mierevelt on which Hugo de Groot is portrayed. Visitors can start with a simple keyword and filter all the way through the objects until they have the objects they are really interested in. This process creates a certain sense of exploration, where visitors can see how the choices they make affect the search results.

Figure 2.7: Searching for Medals (http://collections.vam.ac.uk/)

This technique is very powerful for two reasons: firstly, it gives the visitor much control over the objects he wants to find, effectively limiting the number of search results; secondly, it helps to find unique objects within the collection. This technique is especially suitable when there is much differentiation among the objects in the collection. For documents in an archive this technique may work less efficiently.

A major issue for this method is the description of the data. There should be enough 'facets' to filter on, and of course for each facet there should be a number of options to choose from (more on this topic in section 5.3). Facets should be created by using a controlled dictionary (e.g. annotating objects with the use of a thesaurus), since there should be a common concept which binds objects together (e.g. consider figure 2.7, where there are 161 objects of which the material is bronze). So the data should be checked and prepared before a faceted browser can be used effectively. Also, developing a faceted browser is not trivial, so costs may be an issue as well. A small sketch of the filtering idea is given below.
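The following is a minimal sketch, assuming the objects are available as simple records with controlled facet values; the field names and values are illustrative, not taken from an actual collection.

    def facet_filter(objects, **criteria):
        """Keep only the objects whose facet values match all given criteria."""
        return [o for o in objects if all(o.get(facet) == value
                                          for facet, value in criteria.items())]

    def facet_options(objects, facet):
        """List the remaining values for a facet, with the number of matching objects."""
        counts = {}
        for o in objects:
            value = o.get(facet)
            if value is not None:
                counts[value] = counts.get(value, 0) + 1
        return counts

    collection = [
        {"title": "Portrait of Hugo de Groot", "creator": "Van Mierevelt", "material": "oil on panel"},
        {"title": "Portrait of Maurits",       "creator": "Van Mierevelt", "material": "oil on canvas"},
        {"title": "Engraving after Mierevelt", "creator": "W.J. Delff",    "material": "engraving"},
    ]

    selection = facet_filter(collection, creator="Van Mierevelt")
    print(facet_options(selection, "material"))   # remaining material options for the current selection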

Faceted browsing
Control                  High
Publishable amount       High to almost unlimited
Information types        Objects
Top 50 museums           uses 1 out of 50
Archives                 uses 0 out of 6

2.4 Considerations & conclusion

The previous section showed several techniques that can be implemented by cultural heritage institutions to offer visitors the possibility to search, browse and explore their collections. The differences on the web between museums and archives are prominent. This can be explained by the difference in point of view between the institutions: for archivists the structure and the content of the documents are important, while for museums the objects and all their properties are important. To give an example of this difference, consider that for museum objects it is often very important who created the object and how that object fits within the work and the personal life of the creator, whereas archives do not take personal life details of the creators of their documents into consideration.

2.4.1 Museum collections

The research indicates that for telling stories, building an encyclopedia is a very suitable technique. This technique combines free text search with a kind of index that helps to find structure in the website. When implemented in a 'crowd-sourcing' style, costs can be reduced and the commitment of the visitors can be further increased.

It is likely that for many visitors a faceted browser is a suitable technique to explore a huge collection, primarily because of the ease of iteratively narrowing down the results, and secondarily because a facet browser offers a choice rather than a keyword that has to be typed. Finding the right keyword to get your result can be a bit tricky (e.g. entering the term 'Mona Lisa' does not bring up the famous painting on the first result page of the website of the Louvre).

If an organization wants to publish parts of its collection on the web, it must first establish its primary objective for such a project. Possible objectives could be to provide textual information about parts of the collection, or to offer the world a complete view of the collection.

Depending on the budget and the objective, a choice must be made. Experimenting with some of the described techniques can be beneficial, but taking the following recommendations into consideration may prevent disappointments and failed projects:

• Use only web search engines directly when there is a very limited budget, since control is lost;

• Huge amounts of art objects in an encyclopedia do not make sense, since an encyclopedia is built on the basis of linking articles with each other;

• Online exhibitions can be very compelling but may require much time for a visitor to complete;

• Text Field Search is very common and powerful, but the search form can be complex for visitors;

• Recommendations to visitors are useful only when they are relevant;

• Thesauri are a valuable tool but only for professional users;

• Faceted browsers are an exceptional tool, but the data should conform to high quality standards.

Based on meetings with people within the cultural heritage domain and the various web experiments they are conducting, it is clear that museums are well aware of the presence of their public on the web and are also willing to serve them there, beyond the physical buildings of the museum.

2.4.2 Archives
For archives, the research shows that the only applied technique is text field search, often limited to searching on free text and restricting the results to a certain period of time.

The use of only text field search can be explained by taking the following into consideration: firstly, archives are often much more heterogeneous regarding their content than museum collections, which makes the use of a technique like faceted browsing all the more difficult; secondly, for archival material the content is important, rather than its physical appearance; thirdly, often only parts of documents are digitally available: the focus is not on the document in all its richness but on actually getting the document in a readable digital form.

The latter points to a more general observation: archives are not that active on the web compared to museums. This can be explained by the often still ongoing process of digitizing their documents, although it can sometimes also be attributed to a lack of interest in serving the public on the web. This lack of interest is disappointing: many archives do have very valuable documents which are not shared at all, or only within a very small group.


Chapter 3

Currently applied search techniques at Erfgoed Delft

The previous chapter explored what search techniques are used on the web by cultural heritage institutions. Erfgoed Delft has several websites to allow the public access to their collection. Many of these websites implement some search technique to assist the user in finding particular documents or (descriptions of) objects. This chapter presents an analysis of these techniques. The results of this analysis will be used to explore how Semantic Web techniques can complement the currently used techniques.

The analysis consists of a general description of the technique and what kind of information is stored, and it examines the information retrieval aspects of the search techniques in more detail. One important aspect, the ranking of search results, could not be examined as the required information about the internal workings of the implemented search techniques was not available.

This chapter starts with an explanation of the used methodology and criteria, after which each examined system is listed. The chapter ends with a comparison of all the systems.

3.1 Methodology

For the analysis of each technique, the following methodology was applied:

1. Generally explore the systems (the purpose of the system, the data in the system, etc.).

2. Examine the help information of the several systems. This can provide information on the characteristics of the used information retrieval systems (e.g. the use of wild-cards).

3. Use the system to actually find some records. The datasets used in section 4.5 are here acting as a reference for finding data in the systems.

One could argue that because of this, the applied methodology is biased, since it is not random. However, since there is more explicit knowledge about the information searched for, it is easier to judge the retrieved results. Illustrative for this is the case of detecting spelling correction methods: with knowledge about the data in the systems, it is easy to check for spelling correction by using known misspelled names.

4. Start a search with misspelled information using the found records. For example, “Van Miereveldt”, which is a misspelled variant of “Van Mierevelt”, was fed into the system. With such tests, it can be determined whether spelling correction is handled or not.

5. When applicable, feed both plural and singular words into the system. This can be used to reveal any stemming algorithm that has been implemented.

6. Finally, an informal interview can be held with the creator(s) of a system to verify the information and gain a deeper understanding of the system.

3.1.1 Criteria
To be able to compare the methods in more detail, several information retrieval criteria were set [14]:

• Wild card search

• Only whole word matching

• Stemming

• Case insensitive

• Max number of results

The last criterion was added after the observation that systems sometimes limit the number of results they return. Note that each search technique will also be classified in terms of the various techniques discussed in chapter 2.

3.2 Systems

This section discusses many of the systems used on the web by Erfgoed Delft. However, because several systems were in the process of being updated during this research, some information may be outdated.

3.2.1 Collection Connection
Collection Connection (CC) - a product of CIT - is a system with two tasks: extracting information from several sources into search indices, and acting as an information retrieval system to search those indices.

Collection Connection is a traditional information retrieval system used on several sites and in several projects. Examples of its usage are the Beeldbank of Erfgoed Delft and the WikiDelft project. It is traditional in that it uses inverted indices for full text index search.

In its characteristics, it is traditional as well. Stemming is available in several languages (in fact, the Snowball libraries are used), although it will only be used when the index creation software (also part of Collection Connection) explicitly selects the option for stemming. When selected, separate indices are created for searching both with and without stemming. Tests have shown that in many cases, however, stemming is not selected - an observation confirmed by the creator of Collection Connection. Misspellings can be handled, for example with the Soundex [11] and N-gram algorithms.
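As an illustration of the two ingredients mentioned here, the sketch below applies Dutch Snowball stemming (assuming the nltk package is installed) and a simple character n-gram similarity for catching misspellings; it illustrates the general idea, not how Collection Connection implements it.

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("dutch")
    print(stemmer.stem("schilderijen"))   # reduces plural forms to a common stem

    def ngrams(word, n=3):
        """Character n-grams of a word, padded so word boundaries count as well."""
        padded = f" {word.lower()} "
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def ngram_similarity(a, b, n=3):
        """Dice coefficient over character n-grams; tolerant of small misspellings."""
        ga, gb = ngrams(a, n), ngrams(b, n)
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    print(ngram_similarity("Mierevelt", "Miereveldt"))   # relatively high score for a near-miss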

Collection Connection is less traditional in its storage and output. The basic idea is that every data source, including XML, relational databases, and data from several APIs used in cultural heritage, can be mapped to XML. This part can be seen as a data extractor to XML. For example:

Listing 3.1: A sample of XML for CC - a person record with id 12, name Wout Van Wezel, profession 'Programmer, Teacher at Rug', and created software 'Collection Connection'

Collection Connection uses this xml to act as a facet browsing search engine. It can map its searches on any field. Thus, it can search for a person with surname=Wout and profession=Programmer. So queries like ’People who were born in Berlin before 1900’ are no problem.

One could use the IDs in the data to make relations between data sources. Suppose for example that we have a list of software packages with the following resource:

Listing 3.2: A sample of XML for CC - a software package record with id 34, name 'Collection Connection', vendor 'CIT', and year 2009

If a relation is made between the created software id in the person example and the id in the software package example, Collection Connection can reason that Wout van Wezel created Collection Connection and that it is sold by vendor CIT. However, Collection Connection is not (yet) capable of searching for software packages created by “Wout”, vendored by “CIT”, and sold in “2009”.


Collection Connection
Wild card search            yes
Only whole word matching    no
Stemming                    yes (when used during creation of the indices)
Case insensitive            yes
Max number of results       unbound
Chapter 2 method            Text Field Searching

3.2.2 WikiDelft
WikiDelft is a historical wiki for Delft and its surroundings. WikiDelft collects stories about the city and its history in both writings and images. It is a true wiki, so everybody can interact on the site and post his story (in text, photos, or movies) about Delft. With WikiDelft, Erfgoed Delft created a platform in which they are building a large encyclopedia concerning Delft, jointly with its citizens.

WikiDelft is based on the MediaWiki software, released as an open source software package by the Wikimedia Foundation. The software was adapted to offer relevant items from the collection of Erfgoed Delft. In this way, Erfgoed Delft connects the stories with items from the collection. The process to find these related items is straightforward: the article title is used as a keyword to search in the collection. It is not clear if other tags and/or metadata are taken into account. This search is handled by the previously discussed Collection Connection (see 3.2.1).

When a user searches for a certain item, two types of results are returned: articles and items. The user can choose to view the related articles, or choose to view the related items from the collections of Erfgoed Delft.

The characteristics of the search process are close to those of CC, except that only whole words seem to be matched, something that was concluded after some experiments. The search options are limited to simple keyword search only. There are, for example, no options available to limit the results to articles or items created in a certain period.

WikiDelft
Wild card search            yes
Only whole word matching    yes
Stemming                    no
Case insensitive            yes
Max number of results       unbound
Chapter 2 method            Text Field Searching & The encyclopedia approach

3.2.3 Digitale Arena Delft
Several series of documents within the municipality archive have been transcribed to a digital format. Among these documents are the charters and acts from both the legal and notary public archives. Not all documents are completely transcribed, but each transcription at least includes all people with their roles in the process recorded by the document, the type and date of the document, the locations in the document, and often (predefined) tags to which the document is related.

Figure 3.1: A charter (copyright Erfgoed Delft)

The user has the option to search in many fields, divided into four categories. The first category consists of fields that can limit the period (between the years 1246 and 1842) and the source (either charters or deeds). The second category of fields is concerned with the people in the documents. The third one deals with locations, and the last one is a group of predefined tags and types of acts.

The combination of all the field criteria forms a search request which is executed, returning the results to the user, who can select one of the retrieved documents to view its content in more detail. Since the details do not include a description of the document but rather a scan of the document in JPEG format, viewing them may be a bit disappointing for people not accustomed to reading old handwriting.

Figure 3.2: Detail of the last will and testament of A.J. van Brouwershaven - 1508 (copyright Erfgoed Delft)

For all fields, except for the predefined tags, types, and the period, wildcard searching can be used. There are two types of wildcards available: a ‘*’, which can be used for zero or more unknown characters, and a ‘?’, which can be used to specify exactly one unknown character. Apart from the wildcards, matching is exact: searching for example for ‘Brasse’ does not yield any results; ‘Brasser’ should be used.
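A minimal sketch of this wildcard behaviour, translating '*' and '?' into a regular expression for exact whole-field matching; the field values are illustrative.

    import re

    def wildcard_match(pattern, value):
        """Translate '*' (zero or more characters) and '?' (exactly one character)
        into a regular expression and match the whole value exactly."""
        regex = "".join(".*" if c == "*" else "." if c == "?" else re.escape(c)
                        for c in pattern)
        return re.fullmatch(regex, value) is not None

    print(wildcard_match("Brasse",  "Brasser"))   # False: matching is exact
    print(wildcard_match("Brasse*", "Brasser"))   # True
    print(wildcard_match("Brass?r", "Brasser"))   # True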


Digitale Arena Delft is a very effective tool for searching in charters and acts, but is unfortunately restricted to people, places and the predefined tags.

Digitale Arena Delft
Wild card search            yes
Only whole word matching    yes
Stemming                    no
Case insensitive            yes
Max number of results       unbound
Chapter 2 method            Text Field Searching

3.2.4 Beeldbank The beeldbank is the image database of Erfgoed Delft. It contains images from both the archive and museum collections of Erfgoed Delft, featuring over 100.000 items. Users can search in several (sub)collections:

1. Topography, images of locations in Delft (like buildings, squares etc.)

2. Portraits

3. History, all kinds of images related to historical events, people, etc. (like images of family arms)

4. Literature, information (although not necessarily images) on available books in the library of the archive.

Users search through the collection by entering a keyword and selecting the fields to search in. Thus, the meta-data of the information is used for information retrieval. The user has the option to search in all the fields or in all collections.

The search results are returned as a list of images with some additional information, but the user can also choose to browse through the images and the information individually.

Beeldbank
Wild card search            yes
Only whole word matching    no
Stemming                    no
Case insensitive            no
Max number of results       unbound
Chapter 2 method            Text Field Searching

3.2.5 Information Websites
In addition to all the previous systems, Erfgoed Delft has several other websites. They include not only practical information, like ticket prices and visiting hours, and general information about the organization of Erfgoed Delft, but also background information on the collections, information on the masterpieces in the Erfgoed Delft collections, and some panorama photos of the museums. These websites feature a search function which offers a simple full text search through the website.

Information Websites
Wild card search            no
Only whole word matching    no
Stemming                    no
Case insensitive            no
Max number of results       unbound
Chapter 2 method            ‘Modified Let’s Use Google Approach’

3.3 Observations & conclusion

Currently, Erfgoed Delft has no website dedicated to exploring its museum collections. An old website can be found, but it is deprecated and no longer functional. Although Erfgoed Delft is testing a beta version of a new museum collection website, time restrictions prevented an evaluation of that system. Most of the available systems are custom built to make certain information sources within the archive available on the web.

WikiDelft is a very nice experiment: although the encyclopedia approach is used by more museums, WikiDelft is unique in its open character, as the public can create and edit articles. Given this open character, it might be a very good platform to connect the citizens of Delft with their cultural legacy.

Another observation concerns the use of stemming. The Collection Connection system, which is likely to be used for more projects in the future, does offer stemming, but for the examined systems stemming is not activated. Stemming helps to improve recall, so it may be worth investigating whether enabling stemming functionality can improve the search process for users.

The absence of a museum collection website, and the fact that none of the systems currently in use really combines the archive and museum collections in one coherent search, offer the possibility of building a prototype to demonstrate the innovative potential of the Semantic Web.

The rest of this report deals with transforming several datasets of Erfgoed Delft into a Semantic Web data format and prototyping a facet browser [22] using the transformed data.


Part III

Semantic Web for Cultural Heritage


Chapter 4

Transforming legacy data

The previous chapters concluded with these statements: facet browsers offer the user a very powerful tool to browse through collections, and Erfgoed Delft currently does not employ such a facet browser. The characteristics of Linked Open Data make it very suitable for use in combination with a facet browser: in particular, the predicates in a dataset can act as facets within a facet browser.

This chapter focuses on the creation of Linked Open Data from several datasets of Erfgoed Delft. Although there are several methods available to convert legacy data into RDF (e.g. [4], [5] and [7]), these methods primarily focus on the technical aspects. The transformation recipe presented in this chapter, on the other hand, focuses on producing RDF with as much semantic value as possible. Therefore this report covers issues like the occurrence of errors in the legacy data, the interpretation of literals, defining classes within the data, etc. The aim of the recipe is to provide a general method to create high quality RDF from ‘legacy’ data, which can be applied to a large number of datasets, within (but not restricted to) the cultural heritage domain.

In this thesis, quality of data is defined in terms of semantic value and proper syntax. Semantic value is discussed in the first section; the syntax requirements for high quality RDF, defined in terms of good URI design and proper labels and definitions, can be found in the second section. These sections are required to gain insight into the nature of the transformation recipe, which is presented in the third section. The fourth section covers a wide range of issues that arise when ‘legacy’ data is converted into RDF, including some general guidelines to deal with these issues. The fifth section presents four datasets to which the transformation is applied and discusses the results, the problems, and some commonalities of the results. The sixth section recalls and summarizes the limitations of the transformation recipe based on the results in the fifth section. The seventh section evaluates the work in the previous sections, whereas the final section presents a conclusion.

4.1 Semantic value

Because the transformation recipe is designed to increase semantic value, before discussing the recipe (section 4.3) itself, the concept of semantic value first needs to be defined. Basically, semantic value is the amount of information that computers can derive from, and use to reason upon, a given dataset. Semantic value is hard to capture in a quantitative manner; however, some indicators can be used to determine the increase in semantic value of a dataset:

More triples The number of triples (explicit facts) in a dataset is an indicator for semantic value. The more information in the dataset is represented in triples, the higher the semantic value of the dataset. For example, interpreting information in plain texts results in more facts, thus increasing the semantic value.

More reasoning If the amount of possible reasoning (implicit, derivable facts) increases, the semantic value also increases. The more relations defined in RDF Schema and the OWL ontologies are used, the more reasoning is possible.

For example, suppose a dataset contains a thousand resources and each resource is defined as a Painting. If the fact that Painting is a subclass of ArtObject is added, it can now be reasoned that each resource is also an ArtObject. So adding one explicit fact results in at least a thousand new implicit facts.
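A minimal sketch of this kind of inference, assuming the rdflib and owlrl Python packages and an illustrative example.org namespace:

    from rdflib import Graph, Namespace, RDF, RDFS
    import owlrl

    EX = Namespace("http://example.org/")
    g = Graph()

    # A thousand resources, each explicitly typed as a Painting.
    for i in range(1000):
        g.add((EX[f"object{i}"], RDF.type, EX.Painting))

    # One extra explicit fact: Painting is a subclass of ArtObject.
    g.add((EX.Painting, RDFS.subClassOf, EX.ArtObject))

    # Apply RDFS reasoning; every resource is now also an ArtObject.
    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
    print(len(list(g.subjects(RDF.type, EX.ArtObject))))   # 1000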

Linked to more commonly used ontologies As explained earlier, semantics exist when a group of people agree on a certain definition. When a dataset is linked with more commonly used ontologies, more people and in all likelihood more software will be able to understand or process the data, and thus its semantic value increases. Not only does the linking process itself increase the semantic value (the process adds more facts), but it also results in more possibilities for other people to reuse the data. For example, if a dataset can be linked to two equally suitable ontologies but one of them is much more widely used on the Semantic Web, then that one is preferable to link to.

4.2 Characteristics of high quality RDF

Semantic value as covered in the previous section is the most important characteristic of high quality RDF. This section elaborates on other characteristics that are important for high quality RDF. These characteristics are both derived from literature and the evaluation of the iterative RDF creating process.

4.2.1 Good URI design
A very important characteristic of high quality RDF is the design of the URI schema [18]. A good schema can help users to both discover and use the RDF easily. A good URI provides a description of the resource rather than just a more or less arbitrary identifier.

Consider these two examples:


1. http://www.domain.tpl/resource/people/Hugo_De_Groot

2. http://www.domain.tpl/resource/532e24d6-54dc-462e-b4d6-f76af3c71a1e

The first URI is very clean: it is very clear that the resource describes a person with the name Hugo de Groot. The second URI on the other hand can represent anything. Having descriptive URIs will help to understand data more quickly.

Apart from being descriptive, Tim Berners-Lee introduced the idea that URIs should be ‘cool’. Cool URIs are designed with simplicity, stability, and manageability in mind:

Simplicity “Short, mnemonic URIs will not break as easily when sent in emails and are in general easier to remember.”

Stability “Once you set up a URI to identify a certain resource, it should remain this way as long as possible. Think about ten years, maybe more. Therefore implementation details like .php, .aspx should be left out of the URI since technology changes all the time.”

Manageability “Keep in mind that data and websites may change in time. An update of the website should not affect the URIs of the resources. A good practice is to use a separate subdomain for the RDF data only, and to host applications using the data elsewhere.”

4.2.2 Proper labels and definitions
A good practice is to create RDF that shows the used meta-model in the data itself, enabling the users of the data to understand the data without consulting many additional documentation sources. This can be accomplished by using two simple guidelines:

Use clear URIs and labels for predicates and classes instead of using abbreviations.

Using abbreviations does not change the semantics of the data; however, it can make the data difficult to understand at first. This potentially results in less usage, or even in the data being ignored altogether.

For example, consider the EAD standard. With many (different types of) abbreviations like ‘dao (Digital Archival Object)’, ‘desc (Descriptions)’, and ‘grp (Group)’, data becomes hard to read.

If there is a need to use abbreviations, for example to support users of the legacy data, there is always the possibility of declaring two equivalent predicates, for example using owl:equivalentProperty.


Add definitions for the classes and predicates used in the data.

Since a dataset always reflects information in a certain domain, concepts from that domain are often used within the dataset without further explanation.

For example, in the Van Mierevelt dataset (see 4.5.1), the concept ‘Engraving’ is used as a class. When the public uses the dataset, it is likely that some people will not fully understand the meaning of that concept. By including a definition (using the skos:definition predicate) or a reference to a definition somewhere on the web, people get a better grip on the meaning of the data.

These two simple guidelines make the published data a lot more understandable for the public and thus more valuable.

4.3 The transformation recipe: a generic method

The transformation recipe is the product of an iterative process. The guidelines discussed in [4] were used to initialize the process. In each round of the process, RDF was created, reviewed, and evaluated on the increase in semantic value. The results of each evaluation step were used to improve the recipe. The process was stopped when the quality of the RDF was considered to be good (see section 4.2). This section presents the final recipe resulting from this process. At the end of the section, two other, similar methods are introduced, which indicates that the presented recipe might be suitable for other domains besides the cultural heritage domain as well.

The recipe consists of eight steps. The first two steps result in an RDF representation of the ‘legacy’ data. The remaining steps are designed to improve the quality of the RDF: more semantic value. Steps 3 through 5 aim to increase the semantic value with information from the data itself, steps 6 and 7 aim to increase the semantic value by creating a context using outside links. Finally, step 8 transforms plain text information in the data to semantic information to increase the semantic value.

Step 1: Prepare Analyze the legacy data and determine its (main) concepts and properties. Create a small vocabulary from these concepts (classes) and properties (predicates). Based on the classes and predicates the URI schema should be designed.

Determining the concepts focuses attention on the core semantics of the legacy data: the information to be expressed in RDF.

Step 2: Convert to plain RDF
Convert the legacy data to RDF. Many techniques are available [13], [21] to accomplish this. The main aspect in this step is to stay as close to the legacy data as possible.


Do not interpret semantics, but use the predicates as defined in the vocabulary in the previous step.

XSD datatype definitions should be included when the legacy data has datatype information (dates, integers, floats, etc.), for example when the data originated from a database.

Staying as close as possible to the legacy data reduces the complexity of validating the completeness and correctness of the created RDF. Introducing semantic interpretations already in this step leads to the problem that, when an error is found, it must first be determined whether it is an interpretation (semantic) problem or an extraction problem.

The legacy data is now converted into RDF.
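A minimal sketch of such a one-to-one conversion, assuming a simple flat XML export and the rdflib package; the element names, namespaces and identifier field are illustrative placeholders, not the actual Erfgoed Delft export format.

    import xml.etree.ElementTree as ET
    from rdflib import Graph, Literal, Namespace

    VOCAB = Namespace("http://example.org/vocab/")       # vocabulary from step 1
    RES = Namespace("http://example.org/resource/")      # resource URIs

    def xml_to_plain_rdf(xml_text, id_field="ObjectNumber"):
        """Map every child element of a record to a triple, without interpreting it."""
        g = Graph()
        for record in ET.fromstring(xml_text):
            subject = RES[record.findtext(id_field)]
            for field in record:
                if field.text and field.text.strip():
                    g.add((subject, VOCAB[field.tag], Literal(field.text.strip())))
        return g

    sample = """<objects>
      <object><ObjectNumber>PDS-101</ObjectNumber><Title>Portret van Hugo de Groot</Title></object>
    </objects>"""
    print(xml_to_plain_rdf(sample).serialize(format="turtle"))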

Step 3: Complete the RDF
Add an rdf:type predicate and object to state the type of each resource. This increases the semantic value of the created RDF. Use a type defined within your own namespace (from step 1).

Assigning classes to resources provides the first real semantics. It enables the reasoning that some resources have the same classification and as such should be treated in the same way. Thus semantic value increases in the sense of enabling more reasoning.

The RDF is now ready for publishing. However, performing steps 4 through 8 will further increase the semantic value.

Step 4: Link to other resources within the data itself Often resources in a dataset are already implicitly linked. Implementing these links in RDF increases the semantic reasoning that can be applied on the data.

For example, linking all the paintings of Van Mierevelt with the resource describing Van Mierevelt, using the primary and foreign keys of the source database, enables the reasoning that two paintings may or may not have the same painter. Furthermore, the reasoner is also able to create a list of paintings created by Van Mierevelt. So this step adds more triples and enables more reasoning, increasing the semantic value of the dataset.

Step 5: Convert literal values into information
Legacy data often includes fields that hold a set of literal values whose meaning is implicitly defined. The creators of the legacy data use these implicitly defined values to add more meaning to the data. The semantics of these literal values are only readable by humans who are familiar with the implicit definitions. This information should be extracted and added to the RDF, which increases the semantic value of the data. Performing this step can lead to partially redoing steps 2, 3 and 4.

For example, in a museum dataset a field is used to express a certain classification of the object. This information is used to differentiate the type of a certain resource. After converting this information to RDF, the dataset now includes 39 paintings, 20 photos, and 84 drawings instead of 144 generic objects. Since the rather abstract concept of object is replaced by three more concrete concepts, the semantic value of the data for both people and computers increases.

So far, only locally available information has been considered; connecting with the outside world can increase the semantic value tremendously.

Step 6: Link with more common ontologies
So far the focus was on the ‘local’ RDF. This step aims to increase the semantic value for the outside world by linking to more common ontologies. The best scenario would be to link against an ontology that is both widely established and that covers the vocabulary created in step 1, but any ontology that is used in several places on the web could be worth considering.

Linking against widely established ontologies helps the public to better understand the data. For example linking against Dublin Core (in the case of documents) or against Foaf (for persons), enables the data to be used by so many applications that it will increase the semantic value tremendously.

In this step, also ‘standard’ predicates like rdfs:label should be used [4].
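A minimal sketch of what such links could look like, assuming the illustrative vocabulary namespace from the earlier sketch and the Dublin Core and FOAF namespaces shipped with rdflib; the chosen mappings are examples, not the complete set used in this thesis.

    from rdflib import Graph, Literal, Namespace, RDFS
    from rdflib.namespace import DCTERMS, FOAF

    VOCAB = Namespace("http://example.org/vocab/")
    g = Graph()

    # Local predicates and classes are declared as specialisations of widely used
    # ones, so generic Dublin Core / FOAF aware applications can still use the data.
    g.add((VOCAB.PaintedBy, RDFS.subPropertyOf, DCTERMS.creator))
    g.add((VOCAB.DisplayName, RDFS.subPropertyOf, FOAF.name))
    g.add((VOCAB.Painting, RDFS.subClassOf, VOCAB.ArtObject))

    # Human readable labels, as recommended above.
    g.add((VOCAB.PaintedBy, RDFS.label, Literal("painted by", lang="en")))
    print(g.serialize(format="turtle"))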

Step 7: Enrich by linking to other datasets on the web The web is highly redundant: many data sources may have information on the same resources as in the local dataset. The data in these sources may be more complete, may have different information or may link to other datasets. By linking to these resources the semantic value of the local data increases.

Consider for example a set of paintings of former princes of Orange. By linking the resources of these princes to DBPedia, computers can reason on their genealogy and order the paintings according to the succession of these princes. It can also be determined that these princes are buried at the same location: the Nieuwe Kerk in Delft.
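A minimal sketch of how such an external link could be found and then stored, assuming the SPARQLWrapper package and the public DBpedia endpoint; the lookup-by-label strategy is an illustration, not the linking method used in this thesis.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?person ?birthPlace WHERE {
          ?person rdfs:label "Hugo Grotius"@en ;
                  dbo:birthPlace ?birthPlace .
        } LIMIT 5
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        # Each candidate URI could be stored alongside the local resource,
        # e.g. with an owl:sameAs or rdfs:seeAlso triple.
        print(row["person"]["value"], row["birthPlace"]["value"])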

Step 8: Enrich by extracting information from the data itself
The final step again focuses on the local dataset itself. Many cultural heritage datasets contain large plain text fields, for example biographies or descriptions of objects. These texts include a lot of semantics, but since they are not in RDF, reasoning on that information is not possible. In this step, pieces of that information are extracted. This is however not a trivial task, since Natural Language Processing is often required to extract the information. In the case of semi-structured text, however, more straightforward tools can be used.

Even though this is the final step in this method, it would be a good idea to feed the information gained in this step back to step 6 and 7.

4.3.1 Similar transformation methods
Two other methods, conceptually similar to the approach described in this section, can be found in the literature. In [1] and [20] the authors also start very close to the legacy data and slowly move away from it, creating richer semantics. This shows that the concept works not only in the more specific domain of cultural heritage, but in more general domains as well. On the other hand, while the research in [1] is limited to thesauri, this research clearly shows that the concept is also suitable for other types of data.

4.4 General issues & guidelines

Before going into detail concerning the extended case study with several datasets, this section discusses all sorts of issues which may be encountered during the transformation process. Since these issues often occur in multiple steps, they are ordered according to topic instead of according to the step they occur in. Another reason for ordering according to topic is that the issues are not bound to this particular transformation recipe but are more general in nature, and may also occur using other strategies to convert legacy data to RDF.

4.4.1 Complex data models
Some data models are very complex, resulting in a long analysis phase (step 1). If the data model is based on a specific standard, it should first be checked whether a method for converting it to RDF is already available. If no method is available, there might already be an ontology available for that standard which can be used directly, instead of creating a small vocabulary as in step 1. In other situations, converting one instance to RDF by hand can greatly help to gain a better understanding of the data.

If the data model of the legacy data is complex or not clear at first sight, try converting one instance to RDF by hand. This manual conversion gives great insight into the data, since it forces the study of individual properties while giving an example of the semantics used in the legacy data.

After this manual conversion, try to automate the process and see which predicates are still unknown. This practice splits the complex data model into pieces, making it easier to comprehend.

4.4.2 Impossible to uniquely identify a resource
If a legacy dataset is, for example, a comma-separated file or an XML file, it might be the case that there are no identifiers in the legacy data to uniquely identify an individual instance, something which is less likely to occur with databases, since tables often contain a primary key. If no fields or tags yield a unique value, designing URIs becomes a complicated task. There are however a few options available in this case:

The first option is to try to combine the values of several fields to create a unique seed. This method leads to stable URIs which only change if the data itself changes, which is a huge advantage. However, it can lead to confusing URIs that are not easy to remember.
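A minimal sketch of this first option, assuming a record with a few reasonably stable fields; the chosen fields and the example.tpl domain are illustrative.

    import hashlib
    import re

    def uri_from_fields(record, fields, base="http://example.tpl/resource/"):
        """Build a stable URI from the combined values of the given fields."""
        seed = "-".join(str(record[f]) for f in fields)
        slug = re.sub(r"[^a-z0-9]+", "_", seed.lower()).strip("_")
        # A short hash keeps the URI unique even when the readable part collides.
        digest = hashlib.sha1(seed.encode("utf-8")).hexdigest()[:8]
        return f"{base}{slug}_{digest}"

    baptism = {"child": "Hugo", "surname": "de Groot", "year": "1583"}
    print(uri_from_fields(baptism, ["surname", "child", "year"]))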

The second option is to use the position in the legacy dataset as a unique identifier. For example, if an instance is at position 126 in the file, its URI becomes http://example.tpl/resource/126. Although such URIs are not descriptive anymore, it is easy to guess the URI of the next instance, making the scheme not completely arbitrary. A major downside is that whenever the file is updated somewhere, the URIs will change, causing trouble with external links to the data. However, when new data is only appended to the file, as is often the case with systems measuring some phenomenon, this option is a good solution.

The third option is a worst case scenario: a random identifier, generated when the RDF is created. First of all, the URIs are not predictable; secondly, there is no semantic relation with the data at all; and thirdly, depending on the specific implementation, the identifier changes when the legacy data is updated and the RDF is recreated with new random identifiers instead of the original ones.

The final option is to use blank nodes, which actually avoids the use of identifiers altogether. The downside, and it is a huge one, is that the resources can no longer be linked to from the outside. When linking to the resources is not required, this can be an option.

Do not use random identifiers in URIs, but if this cannot be avoided, ensure that the URIs do not change if the data is updated.

Even though URIs may become ‘ugly’ when the right ingredients to create URIs are not provided in the data, this should not prevent the data from being published on the Semantic Web.

4.4.3 Losing semantics
Linking to commonly used ontologies is very valuable, but when resources directly use external classes that are less specific, semantics will be lost. For example, consider the charters discussed in 3.2.3. These charters are a very specific type of document. If a general external class ‘Document’ is used directly, instead of a local class ‘Charter’, semantics will be lost. A better practice is to define a ‘Charter’ class and define it to be a subclass of the external ‘Document’ class. For users of the data, this results in the reasoning that a charter is some kind of document, although the specifics of a charter are not directly understood. The latter can be solved by adding a definition (using skos:definition) of ‘Charter’ to the data.

Only use ‘resource rdf:type external class’ if the external class covers exactly the same semantics as the local class. In all other situations, use a subclass relation with the external class and supply a skos:definition for the local class. Likewise, apply this guideline to all externally linked predicates.

4.4.4 Errors and inconsistencies
Detecting and processing errors is an important issue. Errors can be classified according to their origin: errors caused in the transformation process and errors originating in the legacy data. Both classes of errors require a different approach.

Transformation errors
A good approach to detect these kinds of errors is to take a random sample after each performed step, and check whether the transformation is correct for each sampled instance. Obviously, any errors encountered here should be fixed.

Steps 5 and 8 are particularly sensitive to transformation errors, since interpretations are performed: semantics are both added and changed. These two steps often require knowledge not only of the domain in general but also of individual resources in the data. Therefore, the support of domain experts is very valuable for detecting errors in these steps.

Legacy data errors
Errors in legacy data are different: while these manifest themselves in steps 5 through 8 as well, a remarkable portion of them can already be found in the analysis of step 1 of the recipe. This is quite positive, as generally less time is needed to fix errors in earlier steps than later on. If an error is found, it should at least be documented and fixed where possible, but when the legacy data will still be used after the transformation, errors should be fixed in the legacy data and not in the resulting RDF.

Errors basically manifest themselves in two ways. Some errors are simple mistakes: a value is entered in the wrong field, a wrong connection is made in the legacy data, wrong properties are applied, etc. It is likely that large datasets (i.e., thousands of art objects) contain many such errors. It might be difficult to find them all with limited resources; however, with an open attitude towards the crowd, the public might help in finding these errors. In this way, publishing data may help to increase the quality of the data.

There are also errors that stem from fundamental problems, caused by a faulty data entry process in the past. For an example, see the extended case study in 4.5.1. Often these errors have a severe impact on the correctness of the semantics. For these errors, it should first be evaluated what caused the error, then the fundamental flaw must be found, and after that the errors should be fixed. It is very important to find the fundamental flaw causing the error, to prevent the reintroduction of similar errors.

When such errors are found, a decision should be made regarding which action is appropriate. Although it might be tempting to stop the publishing process until the error is fixed, this is not a good idea. The transformation process inspects the data in detail in the interpretation steps 5 and 8. If an error is found and the process is stopped, other errors might not be found until the process is restarted. This causes the legacy data to be fixed over and over again. It is better to continue, reporting all encountered errors, so that all known errors can be fixed concurrently. Also be aware that it is often not possible to find all errors with limited human resources. Using the crowd to find errors provides more resources, so errors are likely to be found earlier.

Do not quit publishing data on the web when errors are found. With an open attitude, help can be enlisted from the crowd, improving the quality of the data.

4.4.5 Caching linked data
The moment external information from the web is used within applications, several issues may arise. For example, there may be no endpoint available to query the external data, the endpoint may be too slow, or the connection may be too slow.

To overcome these issues, a local cache of the linked data could be maintained. This may solve the problem of availability but the question then arises how to construct the local cache. Research shows there are basically two options to create local caches.

Carbon copy
The first option is to store the required external data unchanged alongside the local data (either in the same graph as the local data or in a separate one). This solution is easy since no data is changed. However, applications using the data must be adapted to be able to use the cached data. Also, when the external source is modified, update anomalies can occur. Since the use of a local copy is not transparent to the user (i.e., the local copy is not dereferenceable), behaviour may be unpredictable and update anomalies can propagate to external applications.

New predicates
The second, more complex, option is to redefine the external data using new URIs and predicates. Thus, for each resource in the external data, a new resource in the local data is created and all its properties are copied. To keep a reference to the original resource, rdfs:seeAlso can be used (owl:sameAs is a bit tricky since the external source can change).


A great advantage is that the data is only loosely coupled with the original source. Even if the external source is changed or its ontologies completely redefined, the local cache remains correct. Also, since new predicates are used, provenance information (like origin, extraction date, etc.) can easily be added, making the use of a local cache transparent for users. This all comes at the cost of more processing.

Which option to choose depends on the context: within cultural heritage it is important to be able to determine the origin of the information so the second option is preferable.

4.4.6 Domain range issues
Defining domains and ranges is very tempting, since it allows more reasoning. However, as the Web is an open world, people may start reusing the predicates and classes created for a certain dataset; when the domains or ranges of predicates are very strict, this can result in inconsistencies.

Consider the following example: a dataset with famous carvers and their sculptures is published. It is likely that many of the carvers can also be found in DBPedia. Since DBPedia is a widely used dataset (see attachment E), the dataset is linked against DBPedia. To define the relation between the sculptures and their carvers, the dbpedia:createdBy predicate seems to be a logical choice. However, the dbpedia:createdBy predicate defines Painting and Artist as its domain and range respectively. While a carver could be considered an artist, a sculpture is definitely not a painting.

If the predicate were used anyway, the reasoner would conclude that a sculpture is a painting. This could have been prevented by not defining domain and range constraints. They are nevertheless sometimes very useful: ranges are great for stating datatypes like dates and numbers, while domains reduce ambiguity.

Only define domain and range constraints on predicates for datatypes and for very specific classes. In any other situation, defining domain and range constraints will restrict the reuse of predicates.

4.5 An extended case study

The generic method presented in the previous section was applied to four different datasets. This section shows how the transformation recipe was applied and provides a use case reference for readers who want to apply the transformation recipe themselves. Two of the datasets originate from the Municipality Archive governed by Erfgoed Delft, while the other two are from the museum collection of Erfgoed Delft, including museum ‘Het Prinsenhof’:

1. Van Mierevelt Collection: an XML collection of descriptions of 149 museum objects, 256 images of these objects, 44 persons, and 7 organizations related to these objects;


2. Thesauri: the thesauri used by Erfgoed Delft - in total 180.000 terms (XML format);

3. Baptisms in Delft: a comma-separated file with 118475 records on baptisms between 1616 and 1828 in Delft;

4. Archive dataset: A collection of 871 archive descriptions in the EAD (XML) format.

Since semantics is very important in the cultural heritage domain, the main criterion for selecting datasets was the availability of domain experts. For each chosen set, a domain expert was available. Secondly, the selected datasets reflect the aim of this research to cover both archival and museum collections. A third criterion was the presence of links within and among the datasets. Finally, to validate the transformation recipe for different legacy data formats, both CSV and XML datasets were selected.

The rest of this section will cover the transformation process for each of these datasets. The Van Mierevelt Collection will be covered in depth to show a concrete instance of the transformation process, the others will be covered more briefly, showing only the differences compared with the first dataset.

4.5.1 Dataset 1: Van Mierevelt Collection
This dataset consists of the descriptions of 149 museum objects, references to 256 images of these objects, 44 persons, and 7 organizations related to these objects. The XML of this dataset was extracted from a database. All the objects, persons, and organizations are related in some way to the Dutch Golden Age painter Van Mierevelt.

The choice for this set was evident: first of all, much research on the work of Van Mierevelt was being conducted by domain experts of Erfgoed Delft during this research, which provided access to much knowledge about the contents and semantics of the data. Secondly, the data itself is very interesting from a linked data point of view: Van Mierevelt was a portrait painter who painted many famous Dutch Golden Age people, while Willem Jacobsz Delff, his son-in-law, created engravings of these paintings, so there are many links in the dataset between people and objects.

Before going into more details, please note that attachment C contains an overview with figures for each step of the process for this dataset.

Step 1: Prepare
In the first step, three main concepts were identified: Objects, Images, and Constituents. The last concept is used in the data as an abstraction covering both people and organizations. Further analysis resulted in table 4.1. The images are actually just references to files on disk. The images could be retrieved through a remote webserver, so they were not extracted until step 7.


Table 4.1: Analysis of the Van Mierevelt dataset

Concept       Predicates                                              Identifier
Art objects   ObjectNumber, Bibliography, DateBegin, DateEnd,         ObjectNumber
              Description, DimensionRemarks, Dimensions, Markings,
              Notes, ObjectName, Medium, Department, Title
Constituent   ConstituentID, Biography, BeginDate, EndDate,           ConstituentID
              DisplayName, FirstName, LastName, MiddleName,
              Remarks, ConstituentTypeID, Names

Please note that although the ObjectNumber is not very descriptive as a resource identifier (except to employees of Erfgoed Delft), other predicates like the title could not be used as resource identifiers because of their ambiguity.

Step 2: Convert to plain RDF The second step converted the XML to RDF using a custom tool. This resulted in the RDF in listing 4.1.

Listing 4.1: RDF describing Hugo de Groot, containing the literal values 1583 (BeginDate), 1645 (EndDate), Hugo (FirstName), Hugo de Groot (DisplayName), 1 (ConstituentTypeID) and de Groot (LastName)
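As an impression of what this plain, uninterpreted RDF looks like, the following sketch rebuilds the Hugo de Groot resource with rdflib; the namespace URIs and the resource identifier are illustrative placeholders, and the predicate names follow table 4.1.

    from rdflib import Graph, Literal, Namespace

    VOCAB = Namespace("http://example.org/vocab/")
    RES = Namespace("http://example.org/resource/constituent/")

    g = Graph()
    hugo = RES["hugo-de-groot"]          # illustrative; the real data uses the ConstituentID
    g.add((hugo, VOCAB.BeginDate, Literal("1583")))
    g.add((hugo, VOCAB.EndDate, Literal("1645")))
    g.add((hugo, VOCAB.FirstName, Literal("Hugo")))
    g.add((hugo, VOCAB.LastName, Literal("de Groot")))
    g.add((hugo, VOCAB.DisplayName, Literal("Hugo de Groot")))
    g.add((hugo, VOCAB.ConstituentTypeID, Literal("1")))
    print(g.serialize(format="turtle"))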

Step 3: Complete the RDF The third step added for all constituents the classification “rdf:type Constituents”, and for all art objects “rdf:type Object”.

Step 4: Link to other resources within the data itself In the fourth step the persons and objects were connected. The XML already provided the links by stating a relation using the primary key Id from the database. These links were added using the XML (listing 4.2).

Listing 4.2: All art objects depicting Hugo de Groot


The thesaurus terms were added in this step as well. As the XML already contained relative URLs to these terms, the only step required was to change the URLs into real URIs by adding the appropriate prefix.

Step 5: Convert literal values into information Up to this point, the RDF was still very close to the legacy data: the RDF in step 4 is just a simple mapping from the source XML. Step 5 moves away from the legacy data by interpreting the type for both constituents and objects.

Because in the legacy data the predicate ConstituentTypeID can hold only two values ('1' to indicate a person and '2' to indicate an organization), interpreting the type of constituents was a relatively easy task.

For the objects, this proved to be more complicated. The analysis of the predicates ObjectName and Materials showed that these fields contain structured but plain-text information about the type. A set of rules on the various fields determined the class assigned to each resource (see table 4.2). Note that this step is different from step 3: here the literal values are used, whereas in step 3 information from the meta-model of the XML was used (i.e., the XML for the persons had to be acquired separately from that for the objects).

As discussed earlier, the dataset also contains various photographs depicting art objects. For several photos, however, the description and dating do not reflect the picture itself but the object depicted. This is indeed a fundamental flaw in the data, as it resulted in the reasoner finding photos from the 1600s, many years before photography was invented! As such errors are not easily fixed, they were reported and should be manually and carefully corrected.

Step 6: Link with more common ontologies
There are many ontologies available for cultural heritage, but they often lack completeness, or only partly provide an OWL or RDF Schema version. This greatly impeded the linking of the Van Mierevelt collection to other ontologies.

Therefore, a slightly different approach was chosen. Because the ontology created in step 1 was partly in Dutch and not proper-labeled, first an ontology (called edeo) was created which reuse and extends the entire vocabulary created in step 1, includ- ing proper definitions, labels, etc., before everything was linked to external ontologies. For every predicate, a correct term was found, a label assigned, and a definition added. When a predicate indicates a range (primarily the predicates concerning the age of the objects and the people), this range was added. Finally, for the types found in step 5,


Table 4.2: Rules to determine the object class

Objectname            Material         Class
= schilderij          -                Painting
= reproductie         -                Reproduction
= prentenbriefkaart   -                PicturePostCard
= reproductie         -                Reproduction
& gravure             -                Engraving
= e                   % kleurenfoto    ColorPhoto
= e                   % foto           Photo
-                     = staalgravure   SteelEngraving
= e                   & gravure        Engraving
= e                   % litho          Lithograph

(= exact match, = e empty field, - field ignored, & starts with, % contains)

The newly created ontology was then linked to other ontologies (e.g., VRA Core, Dublin Core, FOAF). Often some parts could only be linked to more abstract concepts; for example, both edeo:DrawnBy and edeo:PaintedBy are sub-properties of vracore:creator, which is a sub-property of dc:creator. Sometimes only a few predicates could be matched.

Of course, it is possible to link the step 1 vocabulary directly to VRA Core. However, since not all predicates could be covered by external ontologies, the step 1 vocabulary had to be improved anyway, and adding definitions to the step 1 vocabulary also helps with evaluating the suitability of external ontologies.

This approach resulted in 42 links between external ontologies and the local edeo data. In the end the ontology was connected to FOAF, SKOS, Dublin Core, and VRA Core.

The objects are now covered for around 60% by VRA Core (and thus also by Dublin Core, since VRA Core extends Dublin Core), and the persons for about 80% by FOAF. Unfortunately, FOAF does not offer a predicate for the date of birth, or else all person-related predicates would have been covered. It only provides an age predicate, which is not very useful since this property would have to be updated every year.

Step 7: Enrich by linking to other datasets on the web

Efforts to link persons were successful, but linking objects failed. Only for world-famous objects, like the Nachtwacht by Rembrandt, could links be found, but such objects are not present in the Van Mierevelt collection.

Using the tool RDF Gears, 18 persons (which is 43.6% of all persons) were linked to DBPedia. This resulted in the ability to reason on additional data, like the place


Table 4.3: Extracted information in the Van Mierevelt dataset

Garments      56 garments found in 25 Objects (17 unique terms)
Colors        32 colors found in 11 Objects (6 unique terms)
Professions   35 professions found in 19 Persons (13 unique terms)

these people were born, where they died, and sometimes even on their relations (like father and son). Furthermore, these 18 links provided much more information about the persons with respect to their biography. For 6 of these 18 people, links to yet other datasets on the Internet were found as well, including Geonames and Freebase.

Since linking datasets is not the main topic of this research, no further research was conducted, but even with only 18 links, the enormous power of the Semantic Web is demonstrated, both in computer-enabled reasoning and in information sharing.

Step 8: Enrich by extracting information in the data itself

Because of the need for Natural Language Processing, extracting information from plain-text fields has the potential to be even more problematic than linking to other datasets. Fortunately, with some simple techniques, most problems regarding Natural Language Processing can be avoided, while still yielding good results (see table 4.3):

A case-sensitive, whole-word, exact-string matching algorithm was applied to the data to look for colors and depicted garments on objects, and for professions of people. The lists of terms to match against were defined manually. The number of false positives was in all cases zero. The algorithm itself was trivial: every term in the manually defined list was converted to lower case, and a space was added before and after the term. The chosen predicates were then scanned for this extended term. If a literal value contained exactly that extended term, a semantic match was found and a new predicate was added to the resource.
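A minimal sketch of this matching procedure is shown below. The term list and the example literal are made up for illustration; the actual lists used in the transformation are not reproduced here.

<?php
// Sketch of the step 8 extraction: case-sensitive whole-word matching of a
// manually defined term list against a literal value. Terms are lower-cased
// and padded with spaces, so capitalised words like 'Oranje' do not match.
function extractTerms(string $literal, array $terms): array
{
    $found = [];
    // Pad the literal as well, so terms at the start or end of the text can match.
    $haystack = ' ' . $literal . ' ';
    foreach ($terms as $term) {
        $needle = ' ' . strtolower($term) . ' ';
        if (strpos($haystack, $needle) !== false) {
            $found[] = $term;
        }
    }
    return $found;
}

$colors = ['rood', 'oranje', 'zwart'];
print_r(extractTerms('Portret van de Prins van Oranje in zwart kostuum', $colors));
// Only 'zwart' is found; the capitalised 'Oranje' is deliberately not matched.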

Keeping the matching case-sensitive, with the term list in lower case, was a decision made to decrease the number of false positives. Especially with the colors this was important: several paintings depict a Prince of Orange, who is often mentioned in the description, but 'Orange' in this case does not refer to a color but to the city in France.

This extraction again enabled new reasoning: by querying for people in relation to the objects and the depicted garments, two objects depicting Prince Maurits of Orange were found: on the first one he is depicted in full armor, while on the second one he is dressed in a much more informal fashion.

As stated in the description of step 8, it is possible to return to step 6 with the newly extracted information. This step was not taken during this research; however, it is very likely that more links would be created. Especially the extracted colors have much potential, as there is much information regarding color on the web, and because color is a property that can be compared with that of almost any other physical object.


4.5.2 Dataset 2: The Thesauri

Many objects in the Van Mierevelt collection are connected to several thesauri used within Erfgoed Delft. The conversion of these thesauri, in combination with the creation of links between the objects and these thesauri, adds a lot of additional reasoning capabilities. For example, it may be concluded that two paintings have the same style, the same depiction, etc.
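A sketch of the kind of query enabling such reasoning is given below. The predicate edeo:hasSubject and the endpoint URL are hypothetical names used only for illustration; the actual predicate linking objects to thesaurus terms is not reproduced here.

<?php
// Illustrative SPARQL query: find pairs of objects that are linked to the same
// thesaurus term, the kind of reasoning described above. The predicate
// edeo:hasSubject is a hypothetical name, not necessarily the one in the data.
$query = <<<SPARQL
PREFIX edeo: <http://example.org/edeo/>
SELECT ?object1 ?object2 ?term
WHERE {
    ?object1 edeo:hasSubject ?term .
    ?object2 edeo:hasSubject ?term .
    FILTER (?object1 != ?object2)
}
LIMIT 100
SPARQL;

// The query string could then be sent to a SPARQL endpoint, for example with a
// simple HTTP GET request (the endpoint URL below is a placeholder):
$url = 'http://localhost:8890/sparql?format=json&query=' . urlencode($query);
echo $url;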

The value of thesauri is only recognized by a small audience like domain experts and professionals as thesauri are often very specific and contain strict definitions which are hard to understand for the regular user. However, the links to and the information from the thesauri can be used to cluster similar objects and thus form an excellent base for a content-based recommender system [23].

For the thesauri, the transformation process was quite successful. The XML and the structure of the data itself were straightforward and simple to extract. In the sixth step the thesauri were linked to SKOS, which completely covers the vocabulary of the data identified in step 1.

Interestingly, this dataset shows that meta-data (since thesauri are often considered to be meta-data) can be handled in exactly the same way as regular data when transforming to RDF. This observation leads to the point of view that classifying data as meta-data always depends on the context in which the data is used rather than originating from an inherent attribute of the data.

4.5.3 Dataset 3: Baptisms in Delft

This dataset yields two concepts: People and Events. As the file is in a comma-separated format, each line describes one baptism event. Each event defines the person being baptized, usually a child, and the people gathered around the event, usually the parents of the child.

However, the data is quite poor (see attachment D): only the relations father <> child and mother <> child are defined. The exact relation between the father and the mother cannot be determined from the data, according to the domain experts, and other sources of information are required to determine that relation. Even the relation between the child and its parents is not 100% certain, since it is, for instance, not sure whether the stated father is indeed the biological father of the child.

Within the dataset itself, no links between the records are defined. It is possible a child, being baptized in record 1234, is baptizing his child in record 54321. However, since only names are given, there is no way to be sure that child 1234 is the same as father 54321 using this dataset only.

Up to step 3, the method can be followed without problems. Step 4, however, became a huge problem, as linking to other people within the dataset using software is extremely difficult. Even with rules in place to reduce false positives, problems arise because the


fact that names are exactly the same does not guarantee that they represent the same person, while on the other hand names that are not equal do not necessarily lead to the conclusion that they represent different persons as the spelling of names can change over time.

The only option left is to link manually, which is still very difficult and very time consuming. Some help might be found in the genealogies people create themselves. Should these be combined with this dataset, things can hopefully be improved.

Because the dataset consists only of names, there is nothing else to extract, resulting in step 5 being skipped. In step 6, the dataset could be linked to the Bio vocabulary, an extension of FOAF, which covers the complete model.

With 118,475 records, resulting in around half a million person instances in the data, there are a lot of people to link. This could not be done, mainly for the same two reasons discussed in step 4: identical names do not guarantee identical persons, and different spellings do not guarantee different persons. Besides these, linking the dataset to other sources on the web failed for another reason: almost all of these people simply cannot be found on the web, and if they are, they are most likely only mentioned in some free text rather than described as a linkable resource.

Finally, step 8 was skipped as well, for the obvious reason that the dataset does not contain any plain-text fields.

With quite a few steps failing to live up to their intent, it might be concluded that this data is too ambiguous for the recipe.

4.5.4 Dataset 4: Archive descriptions

The archive dataset, which consists of 817 EAD XML descriptions of the archive of the municipality of Delft, presented new problems and again showed some limitations of the approach.

To completely understand the problem, one must observe that an archivist will always respect the source documents and only describe the stack of documents he encounters. EAD is a valuable standard to capture the structure of a stack of documents, but the content of the documents themselves is not present in the description. Even though there is room for summaries within the EAD standard, these summaries are seldom found within the EADs of Erfgoed Delft. The finding aids in EAD are actually used to retrieve physical documents from the archive rather than digital ones. So, an EAD XML description relates to a stack of documents in the same way the table of contents of this thesis relates to the thesis itself.

With these observations in mind, the first step of the process was applied to the dataset. The files themselves are rather complex, with abbreviated tag names like odd, did, or dsc. The nesting of XML tags is quite deep as well, sometimes going up to 24 levels. The combination of abbreviations and deep nesting makes it impossible to understand the data by just looking at it. Thus, the specification of EAD, a document of over 200 pages, was inspected instead to gain knowledge about the model of the data.


Three main concepts could be identified: Finding Aid (i.e., the XML file itself), Subordinate Component (i.e., a part of the stack of documents), and Archival Material, which is often a part of a document or a bundle of documents that can be retrieved from the archive. From here, again using the specification of EAD instead of looking at the data, over 100 predicates can be identified.

This method, however, results in a top-down approach instead of the bottom-up approach used within the first step of the transformation process. The product of this first step was a primitive EAD ontology.

This shows that the suggested transformation approach is not really suitable for very specific datasets that use complex data standards. Such datasets could better be mapped to an ontology defined by the specification itself.

For EAD, however, there is no OWL representation, and it is not expected that an OWL representation of the EAD standard will be created in the near future. So this attempt failed, since creating an OWL ontology for EAD is out of the scope of this thesis.

4.6 Limitations

The previous section already implicitly showed the limitations. This section recalls, summarizes, and extends these limitations and tries to identify their source. The limitations fall into four dimensions: the domain of the data, the complexity of the data-model, required provenance, and the heterogeneity of the data.

Domain

The transformation recipe is designed for and tested in the cultural heritage domain. Although the recipe might be suitable for other domains, this is subject to future research. Data in the cultural heritage domain is often more descriptive than data in other domains, resulting in more free text. The recipe focuses on this free text in several steps (5 and 8). In the absence of free text, these steps are obsolete, as seen with the third dataset. For domains with less free text the recipe might be in need of some adaptation.

Model

The transformation recipe features a bottom-up approach. Therefore the recipe is not suitable for datasets with a highly complex data-model, as seen with the archive description dataset. The problem is that the recipe is iterative in nature: it starts simple and extends the RDF with each step. For datasets with a complex model, step 1, in which a vocabulary is extracted from the data itself, is very hard, since abstracting the entire model, including the concepts used during the creation of the data model, is not a trivial task. For complex models a better strategy might be to first establish a firm and complete ontology or vocabulary in OWL, and then convert the data.


Provenance

The transformation recipe does not really take provenance into account: how to incorporate provenance in the transformation recipe, and more generally in RDF, may be a subject for future research. Provenance is not only a transformation issue but an organizational issue as well. Including provenance means focusing on the long term and on the preservation of data, a subject not covered in this research. Provenance might, however, be an important issue for cultural heritage organizations, as the capability to track the origin of data, and to present the knowledge value an organization has added, will become more and more important in a future where organizations collaborate more and more.

Heterogeneity

The archive description dataset also showed, besides the issues of too complex a data-model, that when a dataset is very heterogeneous, the transformation recipe does not necessarily fail, but does result in a great deal of work. In step 1, all the predicates within the legacy data are defined. With a very heterogeneous dataset there are many predicates, increasing the complexity of the mapping in step 2. Further on, steps like 5, 7 and 8 are also heavily affected, since generalizing and interpreting large numbers of resources becomes very time-consuming.

4.7 Evaluation

To evaluate whether the transformation recipe indeed creates high-quality RDF, an evaluation was conducted to test the quality of the generated RDF. A representative sample was drawn from each dataset, and although sampling is a random procedure, it was ensured that each type of predicate was present in the sample, so that each sample fully represented the data-model of the dataset. Each sample was then reviewed by both computers and humans, the latter group consisting of computer scientists and cultural heritage domain experts. Together these three groups can be considered representative of the user base of the various datasets. All evaluations were conducted in the form of an informal discussion.

4.7.1 With computer scientists

Discussions with computer scientists showed that good URIs are very important: they always prefer a short, descriptive URI. This is, however, not always possible: for the Mierevelt dataset, for example, the labels are too long and too ambiguous to be used to create a short URI, whereas using the identifiers used by Erfgoed Delft gives a short albeit rather non-descriptive URI. This tradeoff between being short and being descriptive is typical for the design of any URI scheme.

Discussions with computer scientists also resulted in a large debate on the archival dataset, the main problem being that it was not clear how the archival data was structured and how to understand the semantics of the data. These results thus confirm the findings in section 4.5.4 that datasets with a complex data-model are not very suitable for the developed transformation recipe.


Finally, the absence of provenance information, especially copyright information, was identified as a serious issue that should be taken care of when publishing the data.

4.7.2 With a generic software package

The evaluation with a generic software package is aimed at determining whether the data is browsable and correctly labeled. For this evaluation, the OntoWiki browser was used. With OntoWiki, all the links between the objects and persons in the dataset were validated. It showed that no link was missing.

Since OntoWiki uses rdfs:label (and some SKOS predicates like skos:prefLabel) to display a label for an instance, it is a good tool for detecting missing labels. For the Baptism and the Thesauri dataset, random samples showed that no labels were missing. For the Van Mierevelt dataset, however, some labels were missing. Further analysis showed that this is caused by the lack of labels in the legacy data and not by an error in the transformation recipe. The facet browser created in chapter 5 also showed this missing-label problem.

By displaying a Google map with location information, the validation with OntoWiki also showed that the Geo ontology of the W3C is the ontology that should be used when dealing with geographical data, particularly when defining GPS coordinates. Further analysis showed that both Geonames and DBPedia use this same ontology, making linking relatively easy. This result confirms that using general ontologies is important, which is in line with the stipulation in step 6 of the transformation recipe, as well as with the concept of semantic value, of which the use of common ontologies is one of the aspects.

4.7.3 With domain experts

The discussion with domain experts is aimed at validating the semantics in the datasets. Since the Van Mierevelt dataset contains the most extracted information, this dataset has been evaluated extensively with domain experts. Because the other datasets have no extracted information, such a thorough evaluation was not required, so only very small pieces of data were evaluated. Furthermore, instances in these datasets tend to have the same number of predicates, rendering them very similar to each other, a fact that also reduces the need for a thorough evaluation.

For the thesauri dataset, the main issue was the application of the right language: the legacy data contains definitions in multiple languages for some terms. However, due to errors in both the legacy data and the software converting the data to RDF, some language attributes contained the wrong ISO code, although currently the impact of these errors is low: the facet browser only uses the thesauri for similarity matching, which is independent of the language. Please note that the impact will increase when other datasets are more intensively linked with the thesauri.


Since the baptism dataset is very large, a random test sample (size 20) was taken. From this test, one semantic problem became evident: the ambiguity in the meaning of the FatherOf predicate. The problem is that in this dataset it is not defined whether the stated father is the biological father or only the legally responsible parent. To fix this problem, a new predicate should be introduced defining this aspect.

The evaluation of the Van Mierevelt dataset revealed several problems, as both types of errors described in section 4.4.4 were present in the data. The evaluation showed that a "painting" was "drawn by" a "person" instead of "painted by". Tracing this error revealed that during the transformation the predicate "painted by" was accidentally replaced with the predicate "drawn by", resulting in 59 incorrect facts. Since this was the result of a simple mistake in the transformation, specifically in the part where the predicates are interpreted, and not a fundamental error, it was easily fixed. Other straightforward errors were caused by missing labels, as described above.

Unfortunately, more serious problems surfaced as well, one of which was the already described problem with photographs dating back to the 1600s. The difference between "Naar een schilderij van", meaning the creator of the object was inspired by a painting by the referenced painter, and "Van een schilderij van", meaning the object was copied from a painting by the referenced painter, also led to problems. For several objects it remained a matter of discussion which of the two is the correct relation, not a trivial task as the two relations are semantically quite close. However, both problems are semantic problems in the legacy data, so they should be fixed in the management software used to govern the museum data.

Finally, another phenomenon was discovered: when the creator of a certain object is unknown, a link is established in the management software between the object and an artificial person named 'Vervaardiger onbekend', literally "Creator Unknown". When the data is queried for this person, it leads to the conclusion that all these objects were created by the same person, which is of course not the case. This issue can be fixed during the transformation by ignoring all links to the 'unknown creator'.

4.8 Conclusion

This chapter presented a recipe for transforming legacy data into RDF, suitable for a variety of datasets in the cultural heritage domain, as shown by an extended case study. The recipe might be suitable for other domains as well, but that remains a topic for further research.

Although the recipe has proven to be suitable for a wide range of data, there are certain limitations: the recipe itself focuses on creating as much semantic value as possible, and might therefore be less suitable for very heterogeneous datasets. Furthermore, this recipe is less suitable for datasets with a very complex data-model since it features a bottom-up approach whereas a top-down approach would be more suitable in these cases.


The purpose of the transformation recipe is to create high quality RDF. Evaluation shows that the created data does meet all the stated criteria, although in some cases URIs were created that would be considered ‘not cool’. The matter of provenance was not incorporated in the definition of high quality RDF or in the transformation recipe itself: it remains a topic for further research as well.

Overall, this chapter presented a number of practical guidelines to overcome issues during the process of transforming legacy data to RDF. These guidelines will be useful for anyone interested in converting datasets to RDF.


Chapter 5

A Faceted Browser

In the previous chapter, three datasets were converted to RDF. Since both the Van Mierevelt and the Thesauri dataset could be linked and since they both contain lots of predicates (which results in many facets), these two datasets were used for the development of the prototype facet browser described in this chapter.

The facet browser, Facet, offers four different types of access to the collections: firstly, the user can pick a class to explore instances of that class; secondly, the user can browse all images in the dataset; thirdly, the user can use a timeline in which all the resources having some time-range predicate are lined up; and finally, the user can type a keyword or select a filter to narrow down the collection.

Facet was developed to meet several requirements:

• To show some of the new possibilities of the Semantic Web;

• To show the added value created by the transformation process in chapter 4;

• To provide Erfgoed Delft with a prototype which can be used to improve their own websites;

• To validate the transformation of datasets in the previous chapter.

The first section introduces Facet and discusses the intent of creating a facet browser. It explains the goals of the project and elaborates on the application for Erfgoed Delft. The datasets used for Facet and the requirements for data to be used within a faceted browser are discussed as well.

In the second section, the architecture of the faceted browser will be discussed. It will focus on the basic architecture and will go into more detail regarding some design decisions that were taken.

The third section will elaborate on the requirements for predicates within a dataset to be suitable as facets, and how a dataset can be improved for a faceted browser. An example is included to demonstrate how this theory is used in the previous chapter.


The fourth section comments on the performance of the faceted browser. It examines the mathematical characteristics of searching from a set-theory point of view and concludes that resolving very complex search queries may be too slow.

Before drawing the conclusions in the final section, the results of a feedback session are presented where multiple people comment on the tool. The main question to be answered: is it a valuable tool for Erfgoed Delft?

5.1 Faceted browsing requirements

Faceted browsing, also referred to as faceted navigation or faceted search, is a technique for exploring datasets by applying multiple facet filters. A facet is a common property of several items in the data. For example, in a collection of coins, some of them can be made of bronze. The user can thus select the filter 'material is bronze' to filter the coins that are made of bronze.

Facet, the browser developed for this thesis, tries to detect the filters to be applied from the dataset itself rather than using a set of predefined filters. Therefore Facet is not restricted to the transformed data of chapter 4 but can be used with any set of data in a Semantic Web format. Section 5.3 provides more information on the requirements imposed on the data in order to detect filters.

Facet makes a distinction between ‘facet filters’ and ‘search filters’. With a ‘facet filter’ the user can just select the filter to be applied on the data (e.g. for an object the material is bronze). With a ‘search filter’ the user can type a keyword which can be applied on the data (e.g. for people the user can search on ‘Willem van Oranje’).

5.1.1 Dataset requirements

As explained, faceted browsing is based on the occurrence of common properties within a dataset. To formalize this definition: two resources, with URIs O1 and O2 respectively, have a common property when two triples O1 P1 S1 and O2 P2 S2 can be found where O1 is not equal to O2 but P1 owl:sameAs P2 and S1 owl:sameAs S2. So the first requirement imposed on datasets to be used within a faceted browser is that each resource should have at least one common property with another resource.
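A sketch of this requirement expressed as a SPARQL ASK query is given below; for brevity it checks for identical predicates and values rather than owl:sameAs equivalence, and the example resource URI is a placeholder.

<?php
// Simplified sketch of the common-property requirement: ask whether a given
// resource shares at least one predicate/value pair with some other resource.
// Identity of predicate and value is used here instead of owl:sameAs equivalence.
function commonPropertyAsk(string $resourceUri): string
{
    return sprintf(
        'ASK { <%s> ?p ?v . ?other ?p ?v . FILTER (?other != <%s>) }',
        $resourceUri,
        $resourceUri
    );
}

// Placeholder URI, only used to show the shape of the generated query.
echo commonPropertyAsk('http://example.org/edeo/object/123');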

Because Facet enables users to use a classification tree to find resources, there should be at least one additional common property other than rdf:type. Without this second requirement the user may end up with a list of tens of thousands of resources. Because of this requirement, the archival sets (chapter 4; the Baptism and EAD data) were not among the datasets used for the development of Facet.

Initial development of Facet was done with both the Van Mierevelt collection and the thesaurus, since both requirements hold for both datasets. However, at an early stage in the development process serious performance issues (see section 5.4) were encountered. As the available hardware was limited, it was decided to only use the Van Mierevelt dataset, the linked data from DBPedia (section 4.5.1), and the directly linked terms (and their definitions) within the thesaurus as data for the development and testing of Facet.


The development showed that these initial requirements do not ensure the suitability of datasets for usage within a faceted browser, since they only define that there should be some commonality between resources, neglecting the quality of the commonalities. This research does not provide a final answer on the required quality of the commonalities but presents, in section 5.3, an experimental measurement method that allows for at least some statements about the quality to be made.

5.2 Architecture

Since this browser (working title Facet) was not the first faceted browser to be built, a small survey was conducted to find out how faceted browsers were developed in other projects. This exploratory research showed that faceted browsers for the Semantic Web are still heavily researched, but are already very usable in some cases. It also showed that such projects are often written in languages that are uncommon on the web, e.g. Prolog. This makes them hard to use in business environments, since knowledge about these languages is rather limited. Since the aim of Facet is to be applied in a more business-like environment (a cultural heritage organization), Facet is written in PHP, using Virtuoso as triple store. Facet will be released as an open-source project.

Figure 5.1 provides a sketch of the basic architecture of the browser. The rest of this section explains every box in the figure. Please note that some modules, e.g. the one that shows help information, are left out to reduce complexity.

Figure 5.1: Overview Architecture of Facet


Request parser

The request parser receives the incoming HTTP request. It extracts the given parameters and calls the required module to process them (e.g. any data request will be sent to the Search Query Engine, a request for the help pages will be sent to the help module, etc.). When the requested URL is a resource URI, the Config Store is consulted to determine the action: either show the resource within the layout of the browser, return a simple HTML page when the agent is a web browser, or return an RDF file when the agent is an RDF reader. When the URL requests a search, the Search Query Engine is executed.

Search Query Engine

The objective of the Search Query Engine (SQE) is to create a SPARQL query that will be executed on the triple store. The first task of the SQE is to create a subquery for each facet in the parameters of the URL. These parameters are based on the filters selected by the user.

Here a separation is made between 'facet filters' and 'search filters'. Figure 5.2 shows how the user interface is influenced by this separation. Some predicates in the dataset have a unique value for each resource, for example the ObjectNumber of an object; for these predicates a search field is generated in the interface where the user can type a value, as it would not make sense to create a drop-down box with all the possible ObjectNumbers in the dataset. On the other hand, many predicates do offer a predefined list of options the user can choose from, the so-called 'facet filters'.

Figure 5.2: Search versus facet filters: for 'facet filters', a drop-down box is generated (e.g. the "kleur" (color) filter) with the options and their occurrences, while for 'search filters', only a textbox is generated (e.g. the "naam" (name) filter).

For each 'search filter' the generated subquery contains a case-insensitive regular expression matching every object containing the value entered by the user. For each 'facet filter' the generated subquery is simply the given predicate with the given (selected) object. From there the subqueries are combined. Section 5.4 gives more information regarding the combination of the subqueries into one query to be executed by the Sparql Library on the triple store. The results are converted from the resulting XML into a data structure and sent to the Template Engine. Section 5.3 elaborates on how to determine whether a predicate should be treated as a 'search filter' or a 'facet filter'.
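The sketch below illustrates how such subqueries might be generated. The function names, the example predicates and the exact SPARQL shapes are simplified assumptions for illustration, not the actual SQE code; in particular, the real SQE also handles labels of resources and inference.

<?php
// Simplified sketch of the Search Query Engine: build one SPARQL graph pattern
// per selected filter and combine them into a single query.
function facetFilterPattern(string $predicate, string $value): string
{
    // A facet filter is simply the given predicate with the selected object.
    return sprintf('?s <%s> <%s> .', $predicate, $value);
}

function searchFilterPattern(string $predicate, string $keyword, int $i): string
{
    // A search filter becomes a case-insensitive regular expression on the object.
    return sprintf('?s <%s> ?v%d . FILTER regex(str(?v%d), "%s", "i")',
        $predicate, $i, $i, addslashes($keyword));
}

$patterns = [
    facetFilterPattern('http://example.org/edeo/color', 'http://example.org/edeo/term/black'),
    searchFilterPattern('http://xmlns.com/foaf/0.1/name', 'Willem van Oranje', 0),
];

$query = "SELECT DISTINCT ?s WHERE {\n  " . implode("\n  ", $patterns) . "\n}";
echo $query;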


The SQE also checks in the Config Store whether it is responsible for handling (basic) inference of classes itself, or whether this is performed by the endpoint.

Model Retriever

The Model Retriever retrieves all kinds of information about the (meta)model of the dataset. For example, it queries for all the classes used in the dataset, or it helps with retrieving labels for resources. The SQE uses this module to retrieve which classes have which predicates, to add labels to the retrieved results, or to get information about the inference of classes.

Sparql Library

The Sparql Library is a generic PHP library, created alongside the development of Facet, that receives a SPARQL query, executes it on a SPARQL endpoint, and returns the result as a PHP data structure by parsing the XML in the SPARQL result. The library uses plain REST and the SPARQL result XML format to communicate.

HTML Template Engine

The HTML Template Engine, which uses the PHP Smarty Template Engine1, receives data and maps it to the required template, so that the HTML (it also generates JavaScript and CSS) can be returned to the browser. The templates each contain a part of the user interface of Facet, e.g. a resource template, a search-result template, a template for a 'search' filter, etc. These templates would have to be modified manually by publishers who want Facet to follow a graphical design according to their needs.

This simple architecture (although the implementation of the SQE is not a trivial task) is constructed to be as independent of the Van Mierevelt dataset as possible. It uses common vocabularies like RDF, RDFS, SKOS and Geo to present certain features, like labels of objects and the places where people died. All other information is retrieved from the ontology of the dataset. For example, the predicates in the dataset are used to determine the facets (see section 5.3).

This, combined with the possibility to write plugins that override the behavior of certain types of resources or predicates, results in a faceted browser that can be used on a wide range of datasets.

5.3 Optimizing facets

Faceted Browsers are a very powerful technique to browse through a given dataset as shown before. As the name already indicates, a faceted browser assumes there are facets within the data. A facet is a group of items within a dataset with the same value on the same property. For example a dataset can contain 140 items, having 70 items with the class ‘painting’ and 70 items that are a ‘drawing’.

1For more information on Smarty see http://www.smarty.net/


A faceted browser helps a user to methodically reduce the number of items he may be interested in by applying filters on the data. For example, when a user is interested in art from the dark ages, he can apply a data filter like ‘after 500 A.D.’ and ‘before 1500 A.D.’. This will remove all modern art - in which the user is not interested - from the results, thus making the results more relevant for the user.

Although faceted browsing is a very powerful technique, the analysis of several datasets showed that a faceted browser is not always a suitable technique to explore the data. The problem is that in some datasets, items have too much in common (e.g., in the genealogy dataset there are many people with the same family name), while in others, items are too unique (this applies on the archive dataset to some extent).

When items have too much in common, the user can apply a filter on the data, but the number of results will not be significantly reduced. When all the items have a property ‘color’ with value ‘black’, then applying the filter ‘color is black’ will not reduce the number of items that may be interesting for the user, rendering the filter useless.

When items are too unique, then the user will have many options to choose from which will not really help the user. Asking the user to select one of 10.000 options will most probably not please the user.

In an attempt to make this notion of 'not too unique but also not too common' more tangible, the concept of facet ratio was developed. The full mathematical approach (see equation 5.1) will be introduced later, but for now a rule of thumb is presented, which can be derived from the formula:

    Number of groups = Number of items per group = √(total number of items)

where a group is defined as the collection of objects having the same value for a given predicate.

The rule of thumb (and also the formula to be introduced later) is built on the assumption that, from a user point of view, there should not be too many options to choose from, but there should also not be too many results left when choosing an option. This assumption led to the idea that there should be as many members per group as there are groups, which is the perfect balance between the number of groups and the number of items per group.

Facet uses this rule of thumb to determine whether a predicate can act as a facet filter. First it creates a list of all the predicates present within the dataset being displayed. For each predicate, the set of distinct values and their occurrences is queried. If, for example, each value occurs only once, the predicate is too unique to act as a 'facet filter' and will be used as a 'search filter'. If there is only one value, occurring as many times as there are resources having the predicate, the predicate is considered too common and will be ignored. However, when the number of distinct values lies somewhere between 1/4 and 3/4 of the number of resources having the predicate, it is used as a 'facet filter'.
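A sketch of this decision rule is shown below, assuming the distinct values of a predicate and their occurrence counts have already been retrieved. The thresholds follow the rule of thumb described above; the fallback branch for the remaining cases is an assumption made for the example.

<?php
// Sketch of the filter-type decision: given the occurrence count of each
// distinct value of a predicate, decide whether to use it as a facet filter,
// a search filter, or to ignore it. $occurrences maps value => count.
function filterType(array $occurrences): string
{
    $numValues = count($occurrences);
    $numResources = array_sum($occurrences);

    if ($numValues === $numResources) {
        return 'search filter';   // every value unique: too unique for a facet
    }
    if ($numValues === 1) {
        return 'ignore';          // a single value for everything: too common
    }
    if ($numValues >= 0.25 * $numResources && $numValues <= 0.75 * $numResources) {
        return 'facet filter';    // between 1/4 and 3/4 of the resources: usable facet
    }
    return 'search filter';       // fall back to a search filter otherwise (assumption)
}

echo filterType(['rood' => 5, 'zwart' => 4, 'oranje' => 3, 'blauw' => 2, 'groen' => 1, 'geel' => 1]);
// facet filter (6 distinct values for 16 occurrences)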


5.3.1 The mathematical approach

The purpose of the formula is not to present hard ranges in which faceted browsing can be applied, but to provide an idea of how suitable faceted browsing will be. Nevertheless, this formula can be applied in a more quantitative fashion when searching for ways to improve the dataset with respect to faceted browsing.

Let

    d          = the given dataset
    p          = a predicate
    g(d_p^i)   = the number of elements in group i, i.e. the elements of d sharing the same value for p
    n          = the total number of occurrences of p

Then

    fi_p(d) = the facet ratio for predicate p in dataset d        (5.1)

with

    fi_p(d) = (√n − ∑_i g^0(d_p^i)) / √n        (5.2)

where the sum ∑_i g^0(d_p^i) simply counts the number of groups for predicate p. Note that n can also be calculated from the g(d_p^i) function:

    n = ∑_i g(d_p^i)        (5.3)

The optimal situation is fi_p(d) = 0. When fi_p(d) > 0, there are more items per group than there are groups, suggesting that the dataset might be too common. When fi_p(d) < 0, the number of groups is larger, suggesting the dataset might be too unique.

An example of how to use this formula can be found in the transformation (step 5) performed on the objects of the Van Mierevelt collection. Before the transformation, all the items were of class 'object', which gave fi_p(d) = 0.94. After the introduction of the distinct classes, fi_p(d) = 0.28.
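The facet ratio can be computed directly from the group sizes, as the following sketch shows. The group sizes in the example are made up and chosen only to reproduce the behaviour described above; they are not the actual Van Mierevelt numbers.

<?php
// Sketch of the facet ratio of equation 5.2: the ratio is 0 when the number of
// groups equals the square root of the total number of occurrences, positive
// when there are fewer groups (too common) and negative when there are more
// groups (too unique).
function facetRatio(array $groupSizes): float
{
    $n = array_sum($groupSizes);     // total number of occurrences of the predicate
    $groups = count($groupSizes);    // number of distinct values (groups)
    return (sqrt($n) - $groups) / sqrt($n);
}

// One single class for 100 objects: far too common.
echo facetRatio([100]) . "\n";                   // 0.9

// Ten classes of ten objects each: the balanced case.
echo facetRatio(array_fill(0, 10, 10)) . "\n";   // 0

// Every object in its own class: far too unique.
echo facetRatio(array_fill(0, 100, 1)) . "\n";   // -9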

The g(d_p^i) function is based on one predicate only. The facet ratio can be adapted to cover all the predicates in the dataset (equation 5.4):

    fi(d) = ∑_p |fi_p(d)| / (number of predicates in d)        (5.4)

Unfortunately, since this formula uses the absolute value of each fi_p(d), the difference between 'too common' and 'too unique' is lost. Nevertheless, it can be a valuable tool to determine the relative increase or decrease between different steps in the process.

Please note that the concept of facet ratio is purely experimental. Future research should be conducted to determine whether the stated ‘number of groups = number of items per group’ really delivers the best user experience. Other possibilities (see next


Table 5.1: Several predicates in the EDEO data and their facet ratio

predicate                   occurrences   groups   facet ratio   suitable
rdf:type                    347           13        0.302        √
foaf:givenName              42            36       -4.555
vracore3:material.medium    167           39       -2.018        *
edeo:paintedBy              79            7         0.212        √
edeo:color                  16            6         0.5          √
edeo:engravedBy             49            15       -1.143        *
foaf:name                   51            49       -5.861
edeo:marking                6             1         0.592        √
edeo:yearOfBirth            32            25       -3.419
edeo:inspiredBy             32            2         0.647        *

(√ = suitable as a facet filter; * = subject for further improvement, see text)

paragraph) should be taken into account as well. In spite of this, it still gives some grip on optimizing datasets for faceted browsing.

Table 5.1 shows several predicates and their facet ratio, to indicate which can be considered good facets and which are used within Facet as 'search filters'. Some of the predicates are marked with a *. These predicates could be subject to further improvement to ensure that they are suitable to act as a 'facet filter'.

5.3.2 Hierarchical optimizing

In the example, the data was altered to be more suitable for usage in a faceted browser (i.e., plain text was interpreted to become semantic data). However, altering data is often not possible, as it can lead to illegal or lost semantics. Consider, for example, the geographical location where an object was found: if the location is altered from ‘Amsterdam’ to ‘The Netherlands’ some semantics are lost while altering the location from ‘The Netherlands’ to ‘Amsterdam’ is semantically illegal.

However, often the values for a certain predicate do have a hierarchy. Note that the hierarchy does not have to reside in the dataset itself, but can be linked from other sources. This information can be used to group the values. The user first selects a group and then values within the group.

When there still are too many different values, the process can be repeated. An indication for using this method is that the values for a predicate originate from a thesaurus. More on hierarchical faceted searching can be found in [10]. Facet already incorporates this concept in its classification tree, where a user can first choose to look for an object, a person or an image and then, if object is selected, whether he is searching for a photo, a painting, a drawing, etc.


5.4 Search Performance

During the development of Facet, it became clear that the performance of searching through a large set of data can be an issue. This section elaborates on these performance issues from a set-theory point of view. As described earlier, there is a separation between facet filters and search filters, even though the user can simply apply both on the list of resources he has searched for.

Creating a query for a facet filter is trivial, as can be seen in equation 5.5. The facet itself contains the predicate and the selected value represents the object. This combination can be applied on the list of resources.

x ∈ y ⇒ (x p o) (5.5)

So in set theory it can be stated that the user wants all resources x from the set of available resources y such that the triple (x p o) exists. The selection of a second facet filter by the user is straightforward as well, yielding equation 5.6.

x ∈ y ⇒ (x p_0 o_0) ∩ r ∈ y ⇒ (r p_1 o_1)        (5.6)

So applying n facet filters leads to an intersection of n such equations.

Creating a query for the search filters on the other hand is not trivial. The problem is that in general it is ex ante not certain whether a predicate will return literal values only or also resources.

Covering both cases results in two equations, one checking for the literals and one diverting the value to (for example) the label of the resource. Thus for every predicate, equation 5.7 should be applied.

x ∈ y ⇒ (x p s(o,i)) ∪ x ∈ y ⇒ (x p r . r l s(o,i)) (5.7)

Where s(o,i) denotes the ‘case insensitive search on the object of the value entered by the user’ and r and l are, respectively, the resource and its label.

If one selects another search filter, this again results in two equations for that predicate. However, these should be combined using the pairwise product, as shown in equation 5.8.

[x ∈ y ⇒ (x p_0 s(o_0,i)) ∩ z ∈ y ⇒ (z p_1 r_1 . r_1 l_1 s(o_1,i))] ∪
[z ∈ y ⇒ (z p_1 s(o_1,i)) ∩ x ∈ y ⇒ (x p_0 r_0 . r_0 l_0 s(o_0,i))]        (5.8)

So applying n search constraints on a dataset will result in 2^n unions of n intersections.


If search and facet filters are combined, the result is even larger: for each of the 2^n unions, n + m intersections are now required.

From these simple calculations it can be concluded that searching in an RDF dataset can be quite complex and will lead to enormous queries and possibly relatively low performance. Of course there are all kinds of options to deal with these performance issues, as used within traditional information retrieval systems, like the ordering of the subqueries, full-text indexing, inverted indexing, etc. In general, however, one should be aware of the potential loss in performance when moving from a traditional database to a triple store.

One may also conclude that facet browsing is more efficient than searching with search filters. However, there is a caveat: although the search query itself is more efficient, all labels and resources of a facet filter were already queried in an earlier step, just to provide the user with the possibility to select the value.

This bit of mathematics gives the impression that searching an RDF dataset is a complex task. However, with n and m being small (less than ten) in almost all use cases, there are no real problems in retrieving the required data when the servers are fast enough. (For reference, in the EDEO dataset used by Facet, max(n + m) = 57.)

5.5 Feedback on Facet

Feedback on Facet was gathered in two ways. First of all, five people, primarily from the computer science domain, used the browser and explored the dataset. Their comments and feedback were collected. The goal was to find big mistakes and to gain feedback specifically on the interface.

Facet was also evaluated with two persons: a domain expert on Van Mierevelt and an expert in (digital) exhibitions within Erfgoed Delft. The goal of this evaluation was to validate whether Facet could be used during the Van Mierevelt exhibition in fall 2011. During the evaluation, all functions of Facet were discussed and requirements were formulated for using Facet during the exhibition. During the evaluation with the domain expert, Facet was also used to evaluate the transformed data itself. Facet is an excellent tool for both purposes, since it explicitly shows the classes, the labels, and the related resources of each resource.

The rest of this section is split in two parts. First, it discusses the requirements to use Facet within the exhibition, and after that it discusses the received feedback and presents possible improvements for Facet.

5.5.1 Possible use of Facet at Erfgoed Delft

The session with experts of Erfgoed Delft showed some restrictions on the range of applications of Facet. First, faceted browsers require some skills in internet browsing. While these skills are present among the younger generations, for whom the Web is so ordinary that they can hardly imagine the non-web era, this may not be the case for all the (potential) users of Facet visiting the exhibition. It is therefore likely that primarily the more professional visitors of the exhibitions will be interested in using Facet.


This, however, presents a different problem. In chapter 4, a number of inconsistencies in the data were shown, and these inconsistencies are likely to be picked up by these users of Facet, since Facet makes many aspects of the data very explicit. For example, when the icon of a photo camera is shown in the interface, it is more explicit that the object is a photograph than when only a technique field with value 'photo' is present. Even though this is actually a good characteristic of Facet, given the current data on Van Mierevelt both experts argued that the data should be improved before it can be put on display. Whether the data will be improved before the start of the exhibition will be a topic of discussion within Erfgoed Delft.

Despite the inconsistencies in the data, Facet is a very valuable tool to explore cultural heritage collections according to both experts. Before Facet can be used on the web, however, the entire museum collection of Erfgoed Delft should first be converted to RDF. Converting the entire collection is a huge task and will consume some time, so both experts believed that a tool like Facet containing the full collection of Erfgoed Delft will not be available for the public in the very near future.

According to [24], the younger generations are more active on the web with respect to cultural heritage, rendering the likelihood of insufficient user skills much smaller than for physical exhibitions, so using Facet as a tool on the web rather than as a tool during exhibitions is an option to be considered.

5.5.2 Improvements

With the combined feedback of both the computer scientists and domain experts, five major improvements in the Facet prototype were identified:

Speed

Currently, speed is a problem. While there are no problems when running the application for this dataset on a fast server, on average laptop hardware (the test was performed on an Intel Core i5 at 2.4 GHz with 4 GB of memory) running a search with four facets (filtering on period, artist, color and material) took 2.12 seconds to complete. A debug session showed that the query itself might be too complex. If the dataset becomes larger, improvements in the SQE are required to be able to run on slower hardware. Since Facet can connect with any Virtuoso endpoint, a large-scale performance test was performed by connecting it with the DBPedia SPARQL endpoint (http://dbpedia.org/sparql). Running four facet queries on this endpoint resulted in a time-out (i.e., it took more than 30 seconds to complete).


Resource data in the search results

Currently, Facet only displays the label and class of a resource in the list with search results. However, there are sometimes distinct resources with the same label, which is obviously rather confusing for users. Therefore more information about the resources should be added: for objects this might be the creator and date information, and for people the dates of birth and death. Currently, since this is a dataset-dependent issue, Facet does not support this kind of extension.

Object collection origin

Erfgoed Delft is the result of the merger of three museums, the municipal archive and the municipal archaeological service. Although it is now a single organization, it is still important to trace an object to its 'source' collection. However, this aspect is not shown in the dataset. So if a user wants to take a look at the physical object, when applicable, there is no way to find out which department to contact. To support this, such information should be made available in Facet.

Timeline

The timeline currently features only the years of objects (their dating) and persons (their dates of birth and death) and creates a time perspective of objects and persons related to each other. Although this could be a nice feature, it is not very useful at the moment, according to an expert of Erfgoed Delft. The main problem is the lack of context: the timeline should be linked against a dataset describing the main events in Dutch and global history. In that way a context for the persons and objects is established and the timeline is able to serve its intention: providing a time perspective.

User interface

The interface of Facet is currently a bit complicated. If a user wants to apply a facet, he should first click a button 'apply facets'. A popup window then appears with all the possible facets. From there, the user selects the facet to add, sets the selection (the object for the predicate) and submits his selection to the server. This process could better be replaced by displaying the facets alongside the search results, so a user can directly filter the collection.

Furthermore, not all predicates are suitable as facets (see table 5.1). For example, there are facets to search in the comments on objects, but these are likely to be used rarely while cluttering the interface with their presence. Therefore all the used predicates should be evaluated to determine whether they should be displayed in the interface at all.

5.6 Results & conclusion

Facet is still in the prototyping phase, clearly evidenced by the still somewhat slow searching. Other optimizations should be included as well in order for Facet to reach its full potential. Improvements in the interface may be required to increase user-friendliness; these improvements include the process of applying facets and the information shown in list overviews of resources.


Nevertheless, Facet can be a valuable tool for Erfgoed Delft, although there are some steps to take before further deployment. According to the experts at Erfgoed Delft, Facet will be especially valuable for the more professional user. But this user will notice the errors and inconsistencies found in chapter 4. Therefore there is much doubt about deploying this tool while the data is in such dire need of improvement.

The presented facet ratio provides a tool to validate whether a predicate is optimized for faceted browsing. Although this concept is highly experimental, it did show how interpretation improves the data not only from a semantic-value point of view, but also from an exploration point of view. To recall, Facet was developed to meet four requirements:

• To show some of the new possibilities of the Semantic Web;

• To show the added value created by transformation process in chapter 4;

• To provide Erfgoed Delft with a prototype which can be used to improve their own websites;

• To validate if the transformation of datasets in the previous chapter is correct.

For the first goal, Facet shows that data in a Semantic Web format contains a description of its data-model. This is one of the key advantages of the Semantic Web when compared with traditional systems (like databases), where data-models are separate and often embedded in programming logic. The same holds for reasoning on datasets: most triple stores offer inference, which otherwise would have to be built into the application. This reduces development time for applications. As shown in the performance test with DBPedia, Facet is able to use any data in a Semantic Web format, as long as it is located in a triple store, rather than only data in some specific database.

Because Facet users can use a classification tree (using rdf:type) to find resources, it has made the added classes in the transformation process very explicit, which showed the added value of the transformation process, thus fulfilling the second requirement.

The third goal is only partially reached, as there is still much room for improvement, even for a prototype, and as the prototype is not released yet.

Finally, Facet has already been used on several occasions as a validation tool, showing for example the 'photo taken before the invention of the camera' problem (see 4.5.1). It also turned out to be of great assistance in finding missing labels, etc.

As in previous chapters, some questions could not be answered within the scope of this research. Candidates for future research are the performance of large scale facet browsers and the concept of facet ratio.


Part IV

Conclusions


Chapter 6

Conclusions and future work

In this chapter, the research questions will be answered, the main contributions of this thesis will be described, and future research opportunities will be discussed, bringing this thesis to its conclusion.

6.1 Contributions

The main contribution of this thesis, from a scientific perspective, is the proposed transformation recipe to transform 'legacy data' within the cultural heritage domain to a Semantic Web data-format. In addition, the experimental work on the facet ratio, which aims to quantify the suitability of this kind of data for use within faceted browsers, is a significant scientific contribution, although much more research would be required on this topic. Finally, the survey on search techniques being used throughout the cultural heritage domain can be a valuable contribution for cultural heritage institutions when assessing their information systems.

Besides the scientific perspective, a contribution was made towards the availability of software for cultural heritage institutions, as the developed facet browser, Facet, will be released as an open-source software project in the near future.

6.2 Research conclusions

The main research question of this thesis was:

How can cultural heritage institutions, like Erfgoed Delft, benefit from new Se- mantic Web technologies with respect to their collections on the Web?

To answer this question, the research was split into two parts: the first part identified already established technologies to offer public access to cultural heritage collections; the second part focused on the creation of RDF from ‘legacy cultural heritage data’ and the application of that data in a faceted browser.

The first part showed that there are many techniques to present collection data to users. Many cultural heritage institutions choose to implement ‘text field searching’, but this research indicates that a faceted browser, or another system in which the input is not based on keywords but on filters applied to the data, may be much more convenient for users. Faceted browsers, however, require high-quality data and the presence of facets in the data.

The second part introduced a transformation recipe which is based on the concept of starting very close to the legacy data and stepwise adding more semantics by interpreting plain-text information in the data. This transformation recipe increases the quality of the data in two ways: it first increases the semantic value of the data, a major indicator of quality, and, secondly, it strives to formalize (informally) used definitions within the data.
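
To illustrate the idea of stepwise interpretation, the sketch below (a minimal example assuming Python with rdflib; the namespace, property names, and values are hypothetical and do not reflect the actual vocabulary used in the transformation) keeps the original plain literal and adds an interpreted, typed version of the same information next to it:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS, XSD

    EX = Namespace("http://example.org/collection/")   # hypothetical namespace
    g = Graph()
    painting = EX["object-123"]

    # Step close to the legacy data: the original field is kept as a plain literal.
    g.add((painting, EX.datering, Literal("25-11-1742")))

    # Later interpretation step: the plain text is parsed and re-added as a
    # typed literal, which raises the semantic value of the same information.
    day, month, year = "25-11-1742".split("-")
    g.add((painting, EX.creationDate,
           Literal(f"{year}-{month}-{day}", datatype=XSD.date)))

    # Interpretation can also add classifications that were only implicit before.
    g.add((painting, RDF.type, EX.Painting))
    g.add((painting, RDFS.label, Literal("Portret van Hugo de Groot", lang="nl")))

    print(g.serialize(format="turtle"))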

In this research, a collection on the Web is defined as the combination of data residing in some information retrieval system and an application on top of that system to present the information to the user. The transformation recipe shows that cultural heritage institutions can benefit from new Semantic Web technologies with respect to the quality of the data on the Web because the Semantic Web embeds reasoning into the data. This reasoning can be used, as implicitly utilized in the transformation recipe, to formalize both definitions in the data and classifications, resulting in an increased quality of data.

The second part also introduced experimental work on a faceted browser. This work showed that, using Semantic Web technologies, cultural institutions can shorten the development time of applications that present their collection on the Web. First, since the data stores used for the Semantic Web often already offer reasoning, e.g. inference, less time is required to build the application.

Secondly, since Semantic Web formats incorporate the data, the meta-data, and the data-model in one coherent data-set with the same syntax and semantics, application development becomes less complex. In traditional systems an application has to contain knowledge of the underlying database (such as its tables), whereas with Semantic Web technologies the data-model can be retrieved in the same way as the data itself, which makes it easy to later extend applications with different kinds of data.
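
A small sketch of this point (assuming Python with rdflib and an illustrative file name): the application discovers the classes and predicates present in the data with the same query mechanism it uses for the data itself, instead of relying on a hard-coded table layout:

    from rdflib import Graph

    g = Graph()
    g.parse("collection.ttl", format="turtle")   # illustrative file name

    # The data-model can be queried with the same SPARQL syntax as the data:
    # here the application discovers which classes and predicates are present.
    schema_query = """
        SELECT DISTINCT ?class ?predicate WHERE {
            ?s a ?class ;
               ?predicate ?o .
        }
    """
    for cls, pred in g.query(schema_query):
        print(cls, pred)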

6.3 Summary per research question

What methods are available on the Web that offer the public insight into the collections of cultural heritage organizations? A survey showed seven main techniques to present a collection to visitors:

The ‘let’s use Google’ approach is an elementary technique in which the cultural heritage organization chooses to either do nothing at all or incorporate a custom Google search field.

The encyclopedia approach is a technique which displays information like an encyclopedia, with separate pages per topic, links between topics, a full-text search engine to search for topics, and sometimes a browsable index.


Online Exhibitions are the digital equivalent of the museum exhibition. Many forms are possible, ranging from pages one can browse to sometimes even a Virtual Reality (VR) tour of parts of the physical museum.

Text Field Search is the most common technique, in which the user types a keyword to retrieve pages or objects; it is very suitable for searching in enormous collections of objects, archival documents, and pages.

Thesaurus Search is a search technique that lets users access objects by browsing through the terms in a thesaurus.

Recommendations is a technique which recommends objects to visitors based on the nature of the object or the personal taste of the visitor.

Faceted browsing tries to exploit the data model of the data and offers visitors the possibility to select a filter rather than typing a keyword.

What methods are currently used by Erfgoed Delft that enable the public to search through their collections? Currently, Erfgoed Delft has a number of websites that offer public online access to parts of its collections. The online data primarily resides in the municipality archive. All websites use a text search technique to let visitors search through the data. There is one exception, the WikiDelft project: this is a public wiki concerning the history of Delft which features the encyclopedia approach.

What is a suitable method to transform ‘legacy collection data’ into Semantic Web formats? This thesis introduced a very suitable method to convert collection data to Semantic Web formats. The transformation recipe is based on the concept of starting very close to the legacy data and increasing the semantic value stepwise by interpreting plain-text information in the data. An increase in semantic value can be gained either by making information that is already included in the data explicit or by interpreting plain-text information. The transformation recipe also strives to standardize definitions and to promote the usage of predefined vocabularies and standards.

What methods can cultural heritage organizations use to allow people to explore their semantic collection data? This research showed that faceted browsers are a good tool to explore collections on the Web, because of the ease of iteratively narrowing down the results and because a faceted browser offers a choice rather than a blank box where a keyword has to be entered. Furthermore, as the research on the faceted browser showed, collection data in a Semantic Web format enables easy detection of facets within the data. Combining these two observations leads to the conclusion that faceted browsers are an excellent tool to allow people to browse through semantic collection data, especially for museum collections, since in that type of data facets are likely to be present.
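
As a minimal sketch of the core query behind such a faceted view (assuming Python with rdflib; the file name and the choice of dc:creator as example facet are illustrative), the browser can present the distinct values of a predicate together with the number of matching objects:

    from rdflib import Graph

    g = Graph()
    g.parse("collection.ttl", format="turtle")   # illustrative file name

    # A faceted browser typically shows, for a chosen predicate (the facet),
    # the distinct values together with the number of matching objects.
    facet_query = """
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?creator (COUNT(?object) AS ?hits) WHERE {
            ?object dc:creator ?creator .
        }
        GROUP BY ?creator
        ORDER BY DESC(?hits)
    """
    for creator, hits in g.query(facet_query):
        print(f"{creator}: {hits} objects")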

6.4 Future work

In this research, a generic transformation recipe for creating RDF from legacy cultural heritage datasets was developed. This research has the following possibilities for future work or research:

Research on the search strategy of users of cultural heritage web applications. This thesis contains a survey of available techniques for users to search collections. However, it only identified these techniques, leaving the question of how well users are coping with these techniques a topic for further research. Therefore, the intermediate conclusion that faceted browsers are very suitable may be premature. Further research on the topic of user interaction in cultural heritage search would be required to give definitive answers.

Testing the transformation recipe in other domains. While literature indicates that the recipe is suitable to be applied in other domains as well, or may easily be adapted to serve in other domains, this has not been tested. An interesting domain to test the recipe on could be climate data, since such a dataset would not only contain all kinds of definitions but also measurement data. This kind of data was absent in the domain covered in this thesis.

Further analyses of the suitability of data for use within faceted browsers. This thesis contains experimental work on the concept of facet ratio, which aims to quantify the suitability of data for use within faceted browsers. However, it was assumed that the ideal ratio would be attained when the number of groups is equal to the number of results per group. Further research is required to validate that assumption. Furthermore, since the facet ratio is not normalized, more research on the ratio itself is also required, as it is currently not possible to compare datasets with each other. Finally, research to find the optimal values indicating whether a dataset is suitable for faceted browsing or not would also be highly useful.
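
To make the above assumption concrete, the sketch below (plain Python; the definition is an illustrative reading of that assumption and may differ from the exact facet ratio defined in this thesis) compares the number of groups a facet produces with the average number of results per group:

    from collections import Counter

    def facet_ratio(values):
        """Rough sketch of a facet-ratio style measure: compare the number of
        groups with the average number of results per group. A value near 1
        would mean the facet splits the collection into about as many groups
        as there are results per group."""
        groups = Counter(values)                      # value -> number of results
        number_of_groups = len(groups)
        avg_results_per_group = len(values) / number_of_groups
        return number_of_groups / avg_results_per_group

    # Example: 9 objects spread over 3 materials -> 3 groups of 3 on average.
    materials = ["wood", "wood", "wood", "glass", "glass", "glass",
                 "silver", "silver", "silver"]
    print(facet_ratio(materials))   # 1.0 under this illustrative definition

Such a measure is not normalized for collection size, which is exactly why comparing datasets with it remains an open issue.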

Bibliography

[1] Mark Van Assem, Maarten R. Menken, Guus Schreiber, Jan Wielemaker, and Bob Wielinga. A method for converting thesauri to rdf/owl. In Proc. of the 3rd Intl Semantic Web Conf. (ISWC 04), number 3298 in Lecture Notes in Computer Science, pages 17–31. Springer-Verlag, 2004.

[2] T. Berners-Lee. Weaving the Web: The original design and ultimate destiny of the World Wide Web by its inventor. Harper San Francisco, 1999.

[3] T. Berners-Lee, J. Hendler, O. Lassila, et al. The semantic web. Scientific American, 284(5):28–37, 2001.

[4] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web. Publish, 20(October), 2007.

[5] H. Bohring and S. Auer. Mapping xml to owl ontologies. Leipziger Informatik-Tage, 72:147–156, 2005.

[6] Boskaljon. Visie op het netwerkdenken. Technical report, Rijksdienst voor het Cultureel Erfgoed - Ministerie van Onderwijs, Cultuur en Wetenschap, 2011.

[7] M. Ferdinand, C. Zirpins, and D. Trastour. Lifting xml schema to owl. Web Engineering, pages 776–777, 2004.

[8] J. Giles. Special report: Internet encyclopedias go head to head. Nature, 438(15):900–901, 2005.

[9] E. Giovannetti and M. Kagami. The internet revolution: a global perspective, volume 66. Cambridge Univ Pr, 2003.

[10] M. Hearst. Design recommendations for hierarchical faceted search interfaces. In ACM SIGIR Workshop on Faceted Search, pages 1–5. Citeseer, 2006.

[11] D. Holmes and M.C. McCabe. Improving precision and recall for soundex retrieval. In Information Technology: Coding and Computing, 2002. Proceedings. International Conference on, pages 22–26. IEEE, 2002.

[12] Javier Pes and Emily Sharpe. Exhibition & museum attendance figures 2010. The Art Newspaper, April 2011.


[13] C. Lange. Krextor–an extensible xml to rdf extraction framework. Scripting and Development for the Semantic Web (SFSW2009), 2009.

[14] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.

[15] J. Ossenbruggen, A. Amin, L. Hardman, et al. Searching and annotating virtual heritage collections with semantic-web techniques. In Museums and the Web, 2007.

[16] Robertas Pogorelis. Europeana on-line library should be enlarged, but still respect copyright, say MEPs. Technical report, European Parliament, 2010.

[17] G.M. Sacco and Y. Tzitzikas. Dynamic taxonomies and faceted search: theory, practice, and experience, volume 25. Springer-Verlag New York Inc, 2009.

[18] L. Sauermann, R. Cyganiak, and M. Völkel. Cool URIs for the semantic web. Working draft, W3C, 2008.

[19] J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5:115–153, 2001.

[20] M. Van Assem, V. Malaisé, A. Miles, and G. Schreiber. A method to convert thesauri to skos. The Semantic Web: Research and Applications, pages 95–109, 2006.

[21] D. Van Deursen, C. Poppe, G. Martens, E. Mannens, and R. Walle. Xml to rdf conversion: a generic approach. In Automated solutions for Cross Media Content and Multi-channel Distribution, 2008. AXMEDIS’08. International Conference on, pages 138–144. IEEE, 2008.

[22] R. Villa, N. Gildea, and J.M. Jose. Facetbrowser: a user interface for complex search tasks. In Proceedings of the 16th ACM international conference on Multimedia, pages 489–498. ACM, 2008.

[23] Y. Wang, N. Stash, L. Aroyo, L. Hollink, and G. Schreiber. Semantic relations in content-based recommender systems. In Proceedings of the Fifth International Conference on Knowledge Capture. Redondo Beach, CA, pages 209–210, 2009.

[24] H. Wubs and F. Huysmans. Klik naar het verleden: een onderzoek naar gebruikers van digitaal erfgoed: hun profielen en zoekstrategieën. Den Haag: SCP, 2006.

Glossary

Charter is the grant of authority or rights, stating that the granter formally recognizes the prerogative of the recipient to exercise the rights specified. It is implicit that the granter retains superiority (or sovereignty), and that the recipient admits a limited (or inferior) status within the relationship, and it is within that sense that charters were historically granted, and that sense is retained in modern usage of the term.

Cultural Heritage may be defined as the entire corpus of material signs - either artistic or symbolic - handed on by the past to each culture and, therefore, to the whole of humankind. 1.

Dereferenceable is a property of a URI when a resource retrieval mechanism uses any of the internet protocols (e.g. HTTP) to obtain a copy or representation of the resource the URI identifies.

Facet is a dimensional filtering constraint to be placed on a dataset. For example, a facet can be ‘made of wood’; the faceted browser should then limit the search results to all objects made of wood. In RDF, each of the predicates can act as a facet.

Inference is the process in which, during the query execution of RDF-endpoints, not only the requested property is retrieved, but also all of its child properties. For example, when querying for resources of rdf:type Teacher, resources of rdf:type TeacherOfAlgebra are also included.

Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web.

Linked data originates from raw data and is published on the Internet using a standard format (like RDF/Turtle) suitable for use within the Semantic Web. The difference between linked data and other forms of relational data is that, through this standardization and because of the nature of the Semantic Web, it is possible to link from one data source to another data source across the globe.

1Please note, the definition of cultural heritage is heavily discussed, e.g. see http://cif.icomos.org/pdf_docs/Documentsonline/Heritagedefinitions.pdf

Museum is a non-profit, permanent institution in the service of society and its de- velopment, open to the public, which acquires, conserves, researches, commu- nicates and exhibits the tangible and intangible heritage of humanity and its environment for the purposes of education, study and enjoyment. (from: Inter- national Council of Museums).

Open data is data that is free to use across the whole Web.

Semantic Web The W3C defined the Semantic Web as a group of methods and technologies that allow machines to understand the meaning - or "semantics" - of information on the World Wide Web. Tim Berners-Lee, however, stated that what he meant by "Semantic Web" was merely the linked information itself, not the technologies to create it. Later he referred to this linked information as the "Giant Global Graph". In this thesis the latter definition is used.

Semi-structured text is text with some, computer-readable, structure. However, the structure is not captured in a formal model. An example of semi-structured text is the bibliography of this thesis.

Stemming is a heuristic process that chops off the ends of words to reduce inflectional forms and/or derivationally related forms of a word to a common base, and often includes the removal of derivational affixes [14].

Triple Store is a purpose-built database designed for the storage and retrieval of triples, such as RDF triples.

List of Abbreviations

AJAX Asynchronous Javascript And XML

API Application Programming Interface

CC Collection Connection

DC Dublin Core

EAD Encoded Archival Description

EDEO Erfgoed Delft en Omstreken

FOAF Friend of a Friend

HTML HyperText Markup Language

HTTP HyperText Transfer Protocol

KIT Koninklijk Instituut voor de Tropen

LOD Linked Open Data

NLP Natural Language Processing

OWL Web Ontology Language

PDF Portable Document Format

PHP PHP Hypertext Preprocessor

RCE Rijksdienst voor het Cultureel Erfgoed

RDFS RDF Schema

RDF Resource Description Framework

SKOS Simple Knowledge Organization System

SPARQL SPARQL Protocol and RDF Query Language

SQE Search Query Engine


SVCN Stichting Volkenkundige Collectie Nederland

TMS The Museum System

URI Uniform Resource Identifier

URL Uniform Resource Locator

VRA Visual Resources Association

VR Virtual Reality

XML Extensible Markup Language

XSD XML Schema Definition

Appendix A

List of Archives

This list sums up the inspected archives:

• Nationaal Archief Nederland: http://www.gahetna.nl/collectie/catalogus
• Gemeentearchief Rotterdam: http://www.gemeentearchief.rotterdam.nl
• Archief Delft: http://www.archief-delft.nl
• The National Archives: http://www.nationalarchives.gov.uk
• National Archives: http://www.archives.gov
• Archives de France: http://www.archivesdefrance.culture.gouv.fr


Appendix B

Full list of top 50 museums

This list shows, for each museum in the top 50 of most visited museums, which methods are used on its website.

The following abbreviations are used in the table:

E: Encyclopedia
OE: Online Exhibitions
KS: Keyword search
KSC: Keyword search with categories
AS: Advanced Keyword search
ASC: Advanced Keyword search with categories
T: Thesaurus
R: Recommendations
F: Faceted browsing
Rem.: Remarks or comments

For some museums it was impossible to inspect the website due to various issues. The remarks column uses the following codes to indicate such an issue:

NE: The website was not available and therefore was not (fully) inspected
TI: Repeating technical issues hindered inspection
NO: Museum did not seem to have a website

[Table: the top 50 most visited museums, listing for each museum (rank 1–50, from the Musée du Louvre to the Museu Colecção Berardo) its website URL and marks indicating which of the techniques E, OE, KS, KSC, AS, ASC, T, R, and F the website offers, with remark codes where a site could not be inspected.]


Appendix C

Diagrams of the transformation of the Mierenvelt Dataset

This appendix contains the full transformation for the data on Hugo de Groot. Please note that steps 5 and 8 are excluded, since no facts were added for this resource in these steps.

Figure C.1: Step 2: Convert to plain RDF


Figure C.2: Step 3: Complete the RDF


Figure C.3: Step 4: Link to other resources within the data itself


Figure C.4: Step 6: Link with more common ontologies


Figure C.5: Step 7: Enrich by linking to other datasets on the web


Appendix D

Baptism in Delft data-sample

Sample of the Baptisms in Delft dataset. The table contains the field names and line 1 of the CSV file containing the dataset. The CSV file contains 118475 lines.

Fieldname: Value for line 1
Jaar: 1742
Kind-Voornaam: Pieter
Kind-Patroniem: (empty)
Kind-Tussenvoegsel: van der
Kind-Achternaam: Wal
Vader-Voornaam: Willem
Vader-Patroniem: (empty)
Vader-Tussenvoegsel: van der
Vader-Achternaam: Wal
Moeder-Voornaam: Elisabet
Moeder-Patroniem: (empty)
Moeder-Tussenvoegsel: (empty)
Moeder-Achternaam: Smits
Getuige1-Voornaam: Pieter
Getuige1-Patroniem: (empty)
Getuige1-Tussenvoegsel: van der
Getuige1-Achternaam: Wal
Getuige2-Voornaam: Cornelia
Getuige2-Patroniem: (empty)
Getuige2-Tussenvoegsel: (empty)
Getuige2-Achternaam: Vermou
DatumDoop: 25-11-1742
Plaats: Delft
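
To illustrate how a record of this dataset could be lifted to RDF, the sketch below (assuming Python with rdflib and the csv module; the file name, namespace, and class/property names are hypothetical and not the vocabulary used in the thesis) converts one baptism line into a small set of triples:

    import csv
    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/baptisms/")   # hypothetical namespace
    g = Graph()

    with open("baptisms_delft.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            baptism = BNode()
            g.add((baptism, RDF.type, EX.Baptism))
            g.add((baptism, EX.place, Literal(row["Plaats"])))
            # Interpret the plain-text date (dd-mm-yyyy) as a typed literal.
            day, month, year = row["DatumDoop"].split("-")
            g.add((baptism, EX.date,
                   Literal(f"{year}-{month}-{day}", datatype=XSD.date)))
            child = BNode()
            g.add((baptism, EX.child, child))
            g.add((child, EX.givenName, Literal(row["Kind-Voornaam"])))
            g.add((child, EX.familyName,
                   Literal(f'{row["Kind-Tussenvoegsel"]} {row["Kind-Achternaam"]}'.strip())))

    print(g.serialize(format="turtle"))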


Appendix E

Diagram Linked Open Data Cloud

Figure E.1: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
