Enriching and Analysing Statistics with Linked Open Data

Benjamin Zapilko (1), Andreas Harth (2), Brigitte Mathiak (1)
(1) GESIS – Leibniz Institute for the Social Sciences, Bonn, Germany
(2) Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany

Abstract

Scientists often need to analyse heterogeneous and distributed datasets for secondary purposes, for example to verify prior assumptions or to detect correlations between datasets. In most cases, the time-consuming effort of converting the desired data into formats for statistical tools is not justified by the information need itself. With the emerging paradigm of Linked Open Data, existing information infrastructures for processing research data can be opened up to external data published at sources on the web. Thus, scientists can not only conduct statistical analyses on a broader range of available datasets but also avoid time-consuming data conversion, thanks to widely established standards and interfaces. This paper discusses issues and challenges that arise when using Linked Open Data for analysing and enriching statistics, illustrated by a real-life use case scenario.

Keywords: Data Integration, Linked Open Data, Statistics

1. Introduction

A tenet of research in the social sciences is the study of social phenomena by analysing quantitative evidence. Scientists typically need to perform major and complex analyses on statistical data, and with the ever-increasing amount of available digital data, new approaches to data analysis become possible. As part of their main tasks, researchers often have to carry out tedious secondary examinations of heterogeneous and distributed datasets, for example to quickly verify prior or referenced assumptions or to detect correlations between two or more datasets (Schnell, et al. 2005; King, et al. 1994; Kohler, et al. 2008).
A lot of tools already exist which support researchers in processing and analysing their data, for example SPSS [1] or STATA [2]. Sizeable amounts of the data used by scientists are attainable through the web; however, the data is published in a large variety of data formats. To process and analyse data, scientists often need to convert the raw data to particular data formats and integrate data from multiple sources. In general, data conversion and integration is not a technical barrier, but the effort spent on conversion is a nuisance, especially for necessary but tedious routine tasks or in cases where the expected research gain is minor.

As much research data is attainable through the web, advanced web technologies may facilitate access to and integration of such data. In 2006, Tim Berners-Lee articulated four principles for publishing data on the web as so-called Linked Data. Linked Data provides a technical basis for exposing, sharing and linking data on the web, based on the established web architecture comprising standardised formats and interfaces. Spurred by the Linking Open Data project [3], which aims at making Linked Data available under liberal copyright licenses, a large number of datasets of scientific interest have been published, e.g. life science data, governmental data and official statistical data covering a broad range of domains and countries. The popularity of the Linked Open Data idea raises expectations that more data providers will publish their data in standardised ways. For researchers, Linked Open Data offers access to a lot of data that was not available or easily accessible in the past. To conduct scientific analyses, Linked Data still has to be converted to the formats of the tools scientists are using.

[1] SPSS Statistics, IBM, http://www.spss.com/de/
[2] STATA Data Analysis and Statistical Software, http://www.stata.com
However, with data accessible via a single standardised interface, time-consuming transformation processes could be automated and analysis tasks could be performed more easily and thus at lower cost. The information infrastructures for working with research data are often too closely tailored to traditionally used data sources or to specific domains or purposes. In view of the Linked Open Data movement, a method for performing mostly secondary research tasks is needed that allows for statistical queries and calculations on standardised datasets. If these standards were broader in scope than, for example, the formats of SPSS or STATA for statistical data, additional datasets from outside the researchers’ own facilities could be used automatically for combined analyses. This would also open up published data and easy-to-perform analyses to other interested user groups, for instance journalists.

In this paper, we propose a method for conducting such statistical analyses on Linked Data resources, in order to support common tasks that researchers encounter when working with heterogeneous and distributed datasets. We give a brief overview of Linked Open Data in Section 2 and present a real-world use case from the domain of the social sciences in Section 3, where heterogeneous datasets are linked for further combined analyses. The technical implementation of the use case is presented in Section 4 and covers four issues: (i) the publication of data as Linked Data, (ii) the integration and processing of multiple datasets, (iii) the application of statistical methods and calculations to such datasets and (iv) the combined visualisation of multiple datasets in a line graph. In Section 5 we present related work in the field of statistical online analyses of data. We conclude in Section 6 with a discussion of problems observed during the development and implementation of the use case and give an outlook on future work.
[3] http://linkeddata.org/

2. Linked Open Data

The main intention behind the recent idea of Linked Open Data (Bizer, et al. 2009) is to expose, share and connect freely available data on the web using Semantic Web standards. Linked Data is based on the unique identification of each thing, such as metadata elements or certain entities. The representation of information about these things can be combined with connections to other relevant or associated things on the web. Tim Berners-Lee outlined four principles for Linked Data (Berners-Lee 2006):

1. Use URIs [4] to identify things.
2. Use HTTP [5] URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF [6].
4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the web.

From a technical perspective, the publication of data as Linked Open Data (Bizer, et al. 2007) is based on common standards and techniques which have been developed over years and are established worldwide as fundamental formats and interfaces for publishing data on the web, e.g. URIs, HTTP and RDF. RDF is a graph-structured data format. With the standardisation of SPARQL [7] in 2008, a key technology for querying RDF data was established. The use case described later consists partly of a processed version of the German General Social Survey ALLBUS [8] for North Rhine-Westphalia. An RDF representation of this dataset looks as follows:

    <http://lod.gesis.org/lod.pilot/ALLBUS/ZA4570agg.rdf>
        rdfs:label "ZA4570 Cumulated ALLBUS / GGSS 1980-2008";
        gesis:geo <http://lod.gesis.org/vocab#D-NRW>;
        rdfs:seeAlso <http://www.gesis.org/dienstleistungen/daten/umfragedaten/allbus/>.
    <http://lod.gesis.org/vocab#D-NRW>
        owl:sameAs <http://dbpedia.org/resource/North_Rhine-Westphalia>.

RDF is structured like Subject-Predicate-Object sentences.
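To make the Subject-Predicate-Object structure concrete, the following minimal Python sketch holds the four statements of the RDF example as plain tuples and follows the links from the dataset to its region and on to the external identifier. It is an illustration only, not part of the implementation described in this paper: the names Triple, objects and external_regions are hypothetical helpers, the rdfs and owl namespaces are the standard W3C ones, and the expansion of the gesis: prefix to http://lod.gesis.org/vocab# is an assumption, since the example does not declare it.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
# Assumption: the gesis: prefix is not declared in the example; this
# expansion is guessed from the D-NRW URI and may differ in practice.
GESIS = "http://lod.gesis.org/vocab#"

DATASET = "http://lod.gesis.org/lod.pilot/ALLBUS/ZA4570agg.rdf"
NRW = GESIS + "D-NRW"

# The four statements of the RDF example, written out as triples.
TRIPLES = [
    Triple(DATASET, RDFS + "label", "ZA4570 Cumulated ALLBUS / GGSS 1980-2008"),
    Triple(DATASET, GESIS + "geo", NRW),
    Triple(DATASET, RDFS + "seeAlso",
           "http://www.gesis.org/dienstleistungen/daten/umfragedaten/allbus/"),
    Triple(NRW, OWL + "sameAs",
           "http://dbpedia.org/resource/North_Rhine-Westphalia"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects of statements matching subject and predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def external_regions(triples: List[Triple], dataset: str) -> List[str]:
    """Follow gesis:geo, then owl:sameAs, to reach external identifiers."""
    return [
        same
        for region in objects(triples, dataset, GESIS + "geo")
        for same in objects(triples, region, OWL + "sameAs")
    ]


if __name__ == "__main__":
    print(external_regions(TRIPLES, DATASET))
```

Run as a script, the sketch prints a list containing the single DBpedia URI for North Rhine-Westphalia. The two-step lookup mirrors the way a link in one dataset leads to information held by another provider; a real application would use an RDF library and a SPARQL endpoint rather than a hand-written list.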
The last statement of the above example reads: the entry D-NRW in the gesis vocabulary is the same as the entry North_Rhine-Westphalia in DBpedia [9].

The paradigm of Linked Open Data was well received in the Semantic Web community and has encouraged more organisations to publish data. Research organisations, archives, universities and media corporations publish and link their data and participate in the movement. Work on standardisation and discussions about the openness of data generate new impulses, so that governments all over the world are beginning to publish official government data on the web. This strengthens the transparency of governmental agencies as well as collaboration with citizens and their participation in diverse aspects.

The linkage of distributed and apparently unrelated data sources holds potential for both data providers and users. Data providers are in a position to enrich their own data with external sources from the web that are relevant to their users, or that at least offer interesting information complementary to their data. Developers can in turn include the published and linked data in their own tools and applications, which can then be made available to users. The real advantages of Linked Open Data for end users, e.g. scientists, will therefore become evident once enough tools and applications use linked data sources in a practical and useful way.

[4] Uniform Resource Identifier
[5] Hypertext Transfer Protocol
[6] Resource Description Framework, http://www.w3.org/RDF/
[7] SPARQL Protocol and RDF Query Language (SPARQL), http://www.w3.org/TR/rdf-sparql-protocol/
[8] http://www.gesis.org/en/services/data/survey-data/allbus/
[9] http://dbpedia.org/

3. Use Case

In this section, we present a use case scenario for enriching and analysing statistics based on Linked Data. It covers typical research tasks of social scientists: the detection and analysis of correlations between heterogeneous data sources and further calculations on these data sources. Our use case revolves around the following data sources:

The German General Social Survey ALLBUS, which collects up-to-date data on attitudes, behaviour, and social structure in Germany and is archived at GESIS – Leibniz Institute for the Social Sciences [10]. Due to data privacy restrictions, we use a specially processed version of a subset of ALLBUS.
