Using the Semantic Web in Digital Humanities
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Web 0 (0) 1 1 IOS Press 1 1 2 2 3 3 4 Using the Semantic Web in Digital 4 5 5 6 Humanities: Shift from Data Publishing to 6 7 7 8 8 9 Data-analysis and Serendipitous Knowledge 9 10 10 11 Discovery 11 12 12 13 Eero Hyvönen 13 14 University of Helsinki, Helsinki Centre for Digital Humanities (HELDIG), Finland and 14 15 Aalto University, Department of Computer Science, Finland 15 16 Semantic Computing Research Group (SeCo) (http://seco.cs.aalto.fi) 16 17 E-mail: eero.hyvonen@aalto.fi 17 18 18 19 19 Editors: Pascal Hitzler, Kansas State University, Manhattan, KS, USA; Krzysztof Janowicz, University of California, Santa Barbara, USA 20 Solicited reviews: Rafael Goncalves, Stanford University, CA, USA; Peter Haase, metaphacts GmbH, Walldorf, Germany; One anonymous 20 21 reviewer 21 22 22 23 23 24 24 25 25 26 Abstract. This paper discusses a shift of focus in research on Cultural Heritage semantic portals, based on Linked Data, and 26 27 envisions and proposes new directions of research. Three generations of portals are identified: Ten years ago the research focus 27 28 in semantic portal development was on data harmonization, aggregation, search, and browsing (“first generation systems”). 28 29 At the moment, the rise of Digital Humanities research has started to shift the focus to providing the user with integrated 29 30 tools for solving research problems in interactive ways (“second generation systems”). This paper envisions and argues that the 30 next step ahead to “third generation systems” is based on Artificial Intelligence: future portals not only provide tools for the 31 31 human to solve problems but are used for finding research problems in the first place, for addressing them, and even for solving 32 32 them automatically under the constraints set by the human researcher. Such systems should preferably be able to explain their 33 reasoning, which is an important aspect in the source critical humanities research tradition. The second and third generation 33 34 systems set new challenges for both computer scientists and humanities researchers. 34 35 35 36 Keywords: Digital Humanities, Linked Data, Semantic portals, Data analysis, Knowledge discovery 36 37 37 38 38 39 39 1 2 40 1. Introduction as Europeana and Digital Public Library of America , 40 3 41 and forms a substantial part of DBpedia and Wiki- 41 4 42 data . 42 43 Cultural Heritage (CH) has become a most active The availability of Big Data has boosted the rapidly 43 44 area of application of Linked Data and Semantic Web emerging new research area of Digital Humanities 44 45 (SW) technologies [1]. Large amounts of CH content (DH) [2, 3] where computational methods are devel- 45 46 and metadata about it are available openly for research oped and applied to solving problems in humanities 46 47 47 and public use based on collections in museums, li- 48 48 braries, archives, and media organizations. For exam- 1http://europeana.eu 49 2https://dp.la/ 49 50 ple, data has been aggregated in large national and in- 3http://dbpedia.org 50 51 ternational repositories, web services, and portals such 4http://wikidata.org 51 1570-0844/0-1900/$35.00 c 0 – IOS Press and the authors. All rights reserved 2 Using the Semantic Web in Digital Humanities 1 and social sciences. In this context Big Data means idea is to agree upon a shared way of describing the 1 2 data that is too big or complex to be analyzed manually properties of documents, and how different models 2 3 by close reading [4]. can be mapped on each other for interoperability. The 3 4 From a SW research point of view, CH data provide event-centric approach focuses on developing more 4 5 interesting challenges for DH research: First, the data fundamental ontological models of the real world onto 5 6 is syntactically heterogeneous (text, images, sound, which different data and metadata can be transformed 6 7 videos, and structured data in different formats, such as for interoperability. Once the data is harmonized in one 7 8 XML, JSON, CSV, and RDF) and written in different way or another, it can be published in a SPARQL end- 8 9 languages. Second, the data is semantically rich cov- point, and semantic portals can be created on top of it 9 10 ering all aspects of life in different times and places. via APIs. 10 11 Third, the data are often incomplete, imprecise, uncer- Both document-centric and event-centric approac- 11 12 tain, or fuzzy due to the nature of history. Fourth, the hes have been successful. Dublin Core and its exten- 12 13 data is interlinked across different data sources, dis- sions have become the metadata norm for representing 13 14 tributed in different countries and databases. Helping documents on the Web, and a lot of use cases and ap- 14 15 the humanities researcher to deal with such data in se- plications of CIDOC CRM10 and other event-centric 15 16 mantically complex problems addressed in humanities systems have been published. 16 17 sets for computer scientists interesting methodological 17 18 problems. 18 19 This paper analyses and discusses this line of re- 19 20 search and development at the crossroads of Semantic 20 21 Web research, humanities, and social sciences, from 21 22 the early days of the Semantic Web to next steps in the 22 23 future. Three conceptual generations of semantic por- 23 24 tals on the Semantic Web are first identified. After this 24 25 the ideas are made more concrete by an example case 25 26 study system exhibiting features of the three genera- 26 27 tions. 27 28 28 29 29 30 2. First Generation: Portals for Search and 30 31 Browsing Fig. 1. A model for distributed Linked Data publishing. The data 31 32 publishers around the circle, i.e., a joint publishing system, provide 32 33 data using the vocabularies of a shared ontology infrastructure in the 33 Due to the challenges in CH data, SW research in middle. The data are automatically interlinked and enrich each other. 34 CH has been initially focused on issues related to syn- 34 35 tactic and semantic interoperability and data aggrega- 35 36 The ideas of the Semantic Web and Linked Data can 36 tion. A great deal of work has been devoted in de- be applied to address the problems of semantic data 37 veloping metadata standards and data models for har- 37 38 interoperability and distributed content creation at the 38 monizing data, including application agnostic W3C same time, as depicted in Fig. 1. Here the publica- 39 standards5 (RDF, OWL, SKOS, etc.), document cen- 39 40 tion system is illustrated by a circle. A shared seman- 40 tric models, such as Dublin Core and its dumb down 41 tic ontology infrastructure is situated in the middle. 41 principle6, and event-centric models for data harmo- 42 It includes shared domain ontologies, modeled using 42 nization on a more fundamental level, such as CIDOC 43 SW standards. If content providers outside of the circle 43 CRM7 [5] for museums and its extensions8, and FR- 44 provide the system with metadata about CH based on 44 BRoo [6] and IFLA Library Reference Model (LRM)9 45 the same ontologies, the data are automatically linked 45 in libraries. In document-centric metadata models the 46 through shared URIs, enrich each other, and form a 46 47 joint knowledge graph. 47 5 48 http://www.w3.org/standards/semanticweb For example, if metadata about a painting created by 48 6 https://www.dublincore.org/ Picasso comes from an art museum, it can be enriched 49 7http://cidoc-crm.org 49 50 8http://www.cidoc-crm.org/collaborations 50 51 9https://www.ifla.org/publications/node/11412 10http://www.cidoc-crm.org/useCasesPage 51 Using the Semantic Web in Digital Humanities 3 1 by data links to, e.g., biographies from Wikipedia and technologies have been developed and applied for such 1 2 other sources, photos taken of Picasso, information tasks, such as sentiment analysis [12], topic modeling 2 3 about his wives, books in a library describing his works [13], network analysis [14, 15], and visualizations [16] 3 4 of art, related exhibitions open in museums, and so on. in addition to traditional and novel statistical methods, 4 5 At the same time, the contents of any organization in such as word embeddings and neural networks [17– 5 6 the portal having Picasso-related material get enriched 19]. 6 7 by the metadata of the new artwork entered in the sys- Many of the methods and tools above are domain 7 8 tem. This is a win-win business model for everybody independent, and there are a lot of software packages 8 9 to join such a system; collaboration pays off. available for using them, such as Gephi13, R [20], and 9 10 Combining the infrastructure with the idea of decou- various Python libraries14. However, each of them have 10 11 pling the data services for machines from the applica- their own input formats and user interfaces. Further- 11 12 tions for the human user creates a model for building 12 more, visualizations are crafted case by case; tools for 13 collaborative Semantic Web applications. This model 13 formulating, adjusting, and comparing them in gener- 14 has been developed and tested in practice, e.g., in the 14 alizable ways would be helpful for the user. A major 15 “Sampo” series of semantic portals11 [7, 8]. The idea 15 problem here is that using the tools typically requires 16 of collaborative content creation using Linked Data 16 technical expertise and skills not common among the 17 has been developed also in other settings, e.g., in Re- 17 humanities researchers.