Linked Open Data and in Parla-CLARIN Markup

1 Plenary Debates of the Parliament of Finland as 2 Linked Open Data and in Parla-CLARIN Markup 3 Laura Sinikallio £ 4 University of Helsinki, HELDIG Centre for Digital Humanities, SeCo Research Group, Finland 5 Senka Drobac £ 6 Aalto University, Department of Computer Science, SeCo Research Group, Finland 7 Minna Tamper £ 8 Aalto University, Department of Computer Science, SeCo Research Group, Finland 9 Rafael Leal £ 10 University of Helsinki, HELDIG Centre for Digital Humanities, SeCo Research Group, Finland 11 Mikko Koho £ 12 University of Helsinki, HELDIG Centre for Digital Humanities, SeCo Research Group, Finland 13 Jouni Tuominen £ 14 Aalto University and University of Helsinki (HELDIG), SeCo Research Group, Finland 15 Matti La Mela £ 16 University of Helsinki, HELDIG Centre for Digital Humanities, SeCo Research Group, Finland 17 Eero Hyvönen £ 18 Aalto University and University of Helsinki (HELDIG), SeCo Research Group, Finland 19 Abstract 20 This paper presents a knowledge graph created by transforming the plenary debates of the Parliament 21 of Finland (1907–) into Linked Open Data (LOD). The data, totaling over 900 000 speeches, with 22 automatically created semantic annotations and rich ontology-based metadata, are published in 23 a Linked Open Data Service and are used via a SPARQL API and as data dumps. The speech 24 data is part of larger LOD publication FinnParla that also includes prosopographical data about 25 the politicians. The data is being used for studying parliamentary language and culture in Digital 26 Humanities in several universities. To serve a wider variety of users, the entirety of this data was also 27 produced using Parla-CLARIN markup. We present the first publication of all Finnish parliamentary 28 debates as data. Technical novelties in our approach include the use of both Parla-CLARIN and an 29 RDF schema developed for representing the speeches, integration of the data to a new Parliament 30 of Finland Ontology for deeper data analyses, and enriching the data with a variety of external 31 national and international data sources. 32 2012 ACM Subject Classification Information systems → Ontologies; Information systems → 33 Resource Description Framework (RDF); Computing methodologies → Information extraction 34 Keywords and phrases Plenary debates, parliamentary data, Parla-CLARIN, Linked Open Data, 35 Digital Humanities 36 Digital Object Identifier 10.4230/OASIcs.LDK.2021.12 37 1 Introduction 1 38 Semantic Parliament (SEMPARL) is a consortium research project, which produces a 39 linked open data and research infrastructure on Finnish parliamentary data, and develops 40 novel semantic computing technologies to study parliamentary politics and political culture. 1 https://seco.cs.aalto.fi/projects/semparl/en/ © Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela and Eero Hyvönen; licensed under Creative Commons License CC-BY 4.0 3rd Conference on Language, Data and Knowledge (LDK 2021). Editors: Dagmar Gromann, Gilles Sérasset, Thierry Declerck, John P. McCrae, Jorge Gracia, Julia Bosque-Gil, Fernando Bobillo, and Barbara Heinisch; Article No. 12; pp. 12:1–12:17 OpenAccess Series in Informatics Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany 12:2 Plenary Debates as a LOD Knowledge Graph and Parla-CLARIN markup 41 SEMPARL brings together researchers at the University of Helsinki, University of Turku, and 42 Aalto University, with complementary, multi-disciplinary expertise in language technology, 43 political and media research, and semantic computing and web technologies, respectively. 44 The project makes three major contributions. First, it responds to the demand for an 45 easy to use and “intelligent” access to the newly digitized Finnish parliamentary data by 46 providing the data as a national Linked Open Data (LOD) infrastructure and service for 47 researchers, citizens, the government, and the media, and application developers. Second, 48 the project studies long-term changes in the Finnish parliamentary and political culture and 49 language. These use cases in political and language research are pioneering studies using 50 the Finnish digital parliamentary data. Third, the new data service semantically enriches 51 content in other related Finnish LOD services, such as LawSampo for Finnish legislation and 52 case law [7] and BiographySampo for prosopographical data [6]. 53 From a Linked Data production point of, two interlinked knowledge graphs (KG) are 54 produced in SEMPARL: 1) A KG of all over 900 000 parliamentary debate speeches of 55 the Parliament of Finland (PoF) (1907–present) to be called S-KG. 2) A prosopographical 56 knowledge (P-KG) graph of the over 2600 Members of Parliament (MP), other people, and 57 organizations related to the parliamentary speeches during the same period of time [16]. 58 These KGs constitute together a larger data publication of PoF data called FinnParla. This 59 paper presents the first graph S-KG and addresses the following more general research 60 question: How to represent and publish parliamentary speeches so that the data can be used 61 easily for Digital Humanities research? 62 In the following, we first present the problem of representing publishing, and using plenary 63 debates as data for Digital Humanities research, and discuss related works and projects. 64 After this, our original debate data, target data model, and the transformation process are 65 described. The produced linked data has been published as a data service using the 7-star 66 model of the Linked Data Finland platform [8]. As a demonstration of using the data service 67 in Digital Humanities research, exemplary data-analyses are presented using YASGUI and 68 Google Colab on top of the underlying SPARQL endpoint. In conclusion, contributions of 69 the work are summarized, related works are discussed, and further research are outlined. 70 2 Related Work: Publishing Plenary Debates as Data 71 The Unicameral Parliament of Finland convened for the first time in 1907. The parliament 72 has 200 members (MP), who are elected for four years. Since the first parliament of 1907, the 73 elections are based on universal suffrage and both male and female MPs have been elected 74 to all parliaments. In the Finnish parliament, the debates take place in the public plenary 75 sessions. Since 1907, the Parliament has transcribed the speeches and published the printed 76 plenary session minutes, which is a practice established already in the nineteenth-century Diet 77 of Estates [20]. The minutes contain the matters considered, the decisions made, and every 78 speech heard during the sessions. The wordings of the speeches are revised and improved for 79 readability. [29, 20]. 80 In the 1990s, the Parliament of Finland started to gradually publish parliamentary 81 documents in digital form. It was only in 2018 that the Parliament completed the digitisation 82 of the historical parliamentary documents of 1907–1999 and opened a new version of their 83 data service [10]. This open data and the data service of the Parliament, however, has 84 weaknesses concerning the data and its usability due to the heterogeneous data formats and 85 different ways of access. For example, the historical minutes contain only the text recognised 86 from the image files, and have no metadata concerning the structure of the minutes ortheir L. Sinikallio et al. 12:3 87 content, which limits the research to bag-of-words approaches [14]. 88 There are also annotated corpora produced of the Finnish Parliamentary debates, which 89 cover the recent decades. FIN-CLARIN has a curated corpus of the debates in 2008–2016 [3]. 90 These include linguistic annotation, metadata about the speakers and the speeches are linked 91 to the actual video recordings of the plenary sessions. Moreover, there is the multilingual 92 Parlspeech parliamentary corpus [21], which includes also the plenary debates of the Finnish 93 parliament in 1991–2015. This data, however, has quality problems. It has been created 94 from the PDF files of the Parliament website of the time, but not all the speeches canbe 95 found in the data when we compare it with the complete minutes. 96 Several projects have transformed parliamentary debates into structured data or produced 97 annotated parliamentary debate corpora. Regarding the former, the projects have foremost 98 concerned the digitisation of the parliamentary debates and their enrichment with political or 2 99 biographical metadata. These data have been transformed both to XML and RDF format . 100 In the Lipad project, the Canadian Hansard from 1901 to present was transformed into 101 linked XML structured data [1]. As in our case, the process included both the OCR and 102 the parsing of the historical documents and more straight-forward conversion of the recent 103 SQL parliamentary data. The major example of parliamentary data in RDF is the Linked 104 EP project, where the data of the European parliament 1999–2017 was transformed into 105 RDF format and enriched with biographical information [28]. The RDF standard has been 3 106 used also in the Latvian LinkedSAEIMA project [2], in the Italian Parliament and in the 107 PoliMedia project, where RDF parliamentary data was linked with media sources [11]. 108 There are several parliamentary corpora. The best known is perhaps the EuroParl 109 corpus, which includes the plenary session debates of the European Parliament and has been 110 used to study machine translation [12]. A comprehensive list of the national parliamentary 4 111 corpora is presented on the CLARIN webpage . The Talk of Norway (1998–2016) is an 112 example of a national parliament corpus with linguistic annotation published in CSV and 113 TSV formats. [15] Different guidelines have been followed for annotating and encoding the 114 Parliamentary debates. The TEI-based Parla-CLARIN schema, which we also use in our 5 115 transformation, is an attempt to define a common annotation model. For example, the 116 Slovene parliamentary corpus siParl (1990–2018) has been encoded with the Parla-CLARIN 117 schema [19].

Linked Open Data and in Parla-CLARIN Markup

Validating RDF Data Using Shapes

V a Lida T in G R D F Da

The Opencitations Data Model

Using Shape Expressions (Shex) to Share RDF Data Models and to Guide Curation with Rigorous Validation B Katherine Thornton1( ), Harold Solbrig2, Gregory S

Shape Designer for Shex and SHACL Constraints Iovka Boneva, Jérémie Dusart, Daniel Fernández Alvarez, Jose Emilio Labra Gayo

Validating RDF with Shape Expressions

Validating Shacl Constraints Over a Sparql Endpoint

Semi Automatic Construction of Shex and SHACL Schemas Iovka Boneva, Jérémie Dusart, Daniel Fernández Alvarez, Jose Emilio Labra Gayo

DINGO: an Ontology for Projects and Grants Linked Data

Reading an XML Text Like a Human with Semantic Web Technologies 1

Document (1407

Multi-‐Entity Models of Resource Description in the Semantic