A Semantic Web Based Approach to Linking Heterogeneous Data Sets

Connecting people digitally - a semantic web based approach to linking heterogeneous data sets

Katalin Lejtovicz, Amelie Dorn
Austrian Centre for Digital Humanities, Vienna, Austria

{katalin.lejtovicz, amelie.dorn}@oeaw.ac.at

Proceedings of Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP-17, pages 1–8, Varna, Bulgaria, 7 Sep 2017. https://doi.org/10.26615/978-954-452-040-3_001

Abstract

In this paper we present a semantic enrichment approach for linking two distinct data sets: the ÖBL (Austrian Biographical Dictionary) and the dbo@ema (Database of Bavarian Dialects in Austria electronically mapped). Although the data sets differ in their content and in the structuring of data, they contain common “entities” such as names of persons. Here we describe the semantic enrichment process by which these data sets can be inter-linked through URIs (Uniform Resource Identifiers), taking person names as a concrete example. Moreover, we also point to societal benefits of applying such semantic enrichment methods in order to open and connect our resources to various services.

1 Introduction

In the Digital Humanities discourse, the establishment of data networks and the creation of links between different resources has been a key aspect. The linking of resources aims not only at enrichment but, more importantly, also at providing wider access to data resources in local as well as global digital infrastructures. As a consequence, data use and re-use is enabled.

One widely practised way of enabling semantic enrichment and linking is by means of open-source tools relying on semantic web technologies. For example, DBpedia Spotlight (Mendes et al., 2011) provides the possibility to automatically annotate documents with mentions of DBpedia resources. The tool uses the classes of the DBpedia Ontology as resource types, thus enabling the user to annotate documents with 272 different entity types. Furthermore, the user can choose the annotation domain by selecting the classes of the Ontology or by defining them via a SPARQL query. Although DBpedia Spotlight is a powerful tool, it limits entity linking to only one resource, and it was developed for the English language. To apply it to documents written in other languages, the models used by Spotlight have to be adapted. Babelfy (Moro et al., 2014) uses a graph-based approach to perform entity linking and word sense disambiguation, relying on BabelNet 1.1.1 - a semantic network of Wikipedia and WordNet¹ - in order to provide LOD² links to identified text fragments. Babelfy’s main asset is the use of a multilingual resource that also incorporates encyclopedic knowledge; however, it has the drawback that the resources used for word sense disambiguation and entity linking cannot be defined or chosen by the user. For knowledge networks to be created across resources and applied to various data sets, data need to be processed by means of computational linguistic tools and matched, preferably, against domain-specific authority resources.

¹ https://wordnet.princeton.edu/ [last accessed: 23.06.2017]
² http://lod-cloud.net/ [last accessed: 23.06.2017]

In this paper we introduce and exemplify such a linking process, developed and applied in the context of two connected Digital Humanities projects, APIS³ (Lejtovicz et al., 2015) and exploreAT!⁴ (Wandl-Vogt et al., 2015; Benito et al., 2016; Dorn et al., 2016). The diverse digital networks available to date have been created around a variety of topics. Some revolve around networks of places (The Historical GIS Research Network⁵) or of art (e.g. EuropeanaArt⁶), etc. In our case, we apply semantic web tools to interlink person names. In the Digital Humanities project APIS, a main goal is to unveil connections among people in biographical sources, which provides insightful information on the lives of well-known people. Applying entity linking in connection with relation extraction - a task addressed in the project APIS - allows us to identify and visualize connections among entities mentioned in different data sources.

³ https://www.oeaw.ac.at/acdh/projects/apis/ [last accessed: 23.06.2017]
⁴ https://www.oeaw.ac.at/acdh/projects/exploreat/ [last accessed: 23.06.2017]
⁵ http://www.hgis.org.uk/ [last accessed: 23.06.2017]
⁶ http://www.europeana.eu/portal/de/collections/art [last accessed: 23.06.2017]

This study thus aims at linking existing resources that partly contain the same information through the use of semantic web technologies. Through the additional enrichment with LOD, our study aims to show how these data sets can first be connected, and later opened to a wider user audience. This in turn adds to their prolonged re-use and sustainability by ensuring that additions and corrections to the data set only have to be made once, in the reference resource, instead of updating all the distinct data resources. In addition, the results of our study also contribute to making information on people networks more widely available to the knowledge society.

2 Data and resources

The data behind the inter-linking process of the projects APIS and exploreAT! are extracted from the resources ÖBL (Austrian Biographical Dictionary; Gruber and Feigl, 2009) and dbo@ema⁷ (Database of Bavarian Dialects in Austria electronically mapped) (cf. Wandl-Vogt et al., 2008). In the realization of both projects, the Austrian Centre for Digital Humanities (ACDH-ÖAW⁸) plays an important role. The projects rely on data from the respective resources (ÖBL, dbo@ema), which contain similar types of elements such as persons, locations, institutions and titles of written works. In the ÖBL this concerns the names of important historical figures, names of cities and countries relevant to the lives of the people in the biographies, as well as titles of books, journals, or publications mentioned in the biographies. In the dbo@ema, on the other hand, we are dealing with names of locations and regions, names of data collectors or authors, and also titles of dictionaries, dissertations and literature. The benefit of linking the above-mentioned data sets resides in the possibility to enrich the biographies with missing information contained in the entries of the dbo@ema and vice versa. Often, for example, the list of literature works is incomplete in either ÖBL or dbo@ema; by linking the two resources, the missing information can be added to the other resource.

⁷ https://wboe.oeaw.ac.at/projekt/beschreibung/ [last accessed: 23.06.2017]
⁸ https://www.oeaw.ac.at/acdh/acdh-home/ [last accessed: 23.06.2017]

The ÖBL contains around 18,500 biographies and serves as the reference work for APIS, a project which aims to investigate whether a large-scale lexicon can be used as the basis of quantitative data analysis and how biographical research can benefit from the digital transformation process realized in APIS. The lexicon contains biographies of important historical figures from the Austro-Hungarian Monarchy who lived in the period 1815-1950. The data is not only published in print, but is also available in machine-readable XML format for the APIS project. An example of a typical ÖBL data entry in XML format is provided in the Appendix. It is taken from the biography of Johann Willibald Nagl, an Austrian writer and Germanist who lived and worked at the turn of the century. The entry contains some structured information in XML elements such as Geburt (containing place and date of birth); however, the majority of the information (in this specific example referring to the studies and the career path of Nagl) is embedded in the unstructured XML element Haupttext (i.e. main text). The ÖBL data set contains not only the 18,500 persons the biographies were written about but also additional individuals mentioned in the main text. This set of names, together with the persons in dbo@ema, creates the basis for connecting the two projects APIS and exploreAT! via an automatic alignment process.

The dbo@ema, on the other hand, is to date a part of the Database of Bavarian Dialects in Austria (DBÖ), which forms the basis of the project exploreAT!. The project explores this
large heterogeneous collection of 20th century dialect data of the Bavarian dialects in Austria from the perspectives of cultural lexicography, semantic technologies, visual analysis and citizen science. The dbo@ema is a MySQL database that comprises a collection of dialect words from various fields of everyday life. Part of the database consists of the digitised data originally collected by means of paper questionnaires as well as the digitised entries of the plant (~32,000 headwords) and mushroom (~1,000 headwords) collections; it also includes names of places and regions in the former Austro-Hungarian Empire, as well as names of data collectors or authors of dictionaries, dissertations or literature. Data concerning persons involved in the collection are for the most part derived from internal archival material of the institute. Initially, the available questionnaire data was manually entered in TUSTEP (TUebingen System of TExt processing Programs)⁹, a professional toolbox for scholarly processing of textual data. All in all, the DBÖ counts around 3.5 million records and an estimated 200,000 headwords.

⁹ www.tustep.uni-tuebingen.de/ [last accessed: 23.06.2017]

3 Applying semantic web technologies to inter-link heterogeneous DH data sets

In many projects dealing with digital collections, digital content is generated from scanned books, dictionaries, maps, etc. This is, however, just the prerequisite for establishing a knowledge base which is usable and reusable within and across different disciplines. In order to make data more widely available in a network of relevant sources, the enrichment with Linked Open Data (LOD) is key. Enrichment is a process that has to be established in order to open up DH data sets (e.g. lexicons, encyclopedias, dictionaries, etc.) not only to the public, but also to the members of the research community and to industry.

The projects APIS and exploreAT! face the challenge that the valuable information they contain is embedded in different data models and data formats, and that they are therefore not completely transparent and reusable for researchers, domain experts and interested citizens. Many other Digital Humanities (DH) projects likewise partially comprise the same information embedded in different resources. APIS and exploreAT! have common entity types, among them persons, locations and names of written works, which, once identified and aligned, can serve as the basis for inter-linking the two projects. This allows for adding missing information from the complementary data set, uncovering and visualizing networks of common entities, and expanding the search space by introducing new, joined data sets to the previously limited research environment.

Semantically enriching the ÖBL data collection - a historically and culturally rich heritage data set - is a main goal of the APIS project. We designed a workflow that is also applicable to the semantic annotation of other DH collections. The workflow first identifies candidates for the linking process, then links them automatically to LOD resources, and finally has the results approved and curated. In our study, we link entities to GeoNames and GND, and plan to further extend the pool of LOD resources with VIAF¹⁰. We use the linked LOD resources to enrich our data with missing information (e.g. to add name variants, latitude, longitude and, if available, the URI of the corresponding Wikipedia article to our data sets), to detect possible errors in our data sets by comparing the information in ÖBL/dbo@ema with the information contained in GeoNames/GND, and to make the data machine readable and searchable by eventually publishing it in the LOD cloud.

¹⁰ https://viaf.org/ [last accessed: 23.06.2017]

However, linking to significant vocabularies such as GeoNames and GND not only provides valuable information, but also challenges computational linguistic systems. Some of the problems are caused by the incompleteness of authority files: not all person, place and institution names are contained in the LOD vocabularies. This problem can be addressed by adding further resources to the system; for this reason we are planning to index VIAF in addition to GeoNames and GND. If an entity is present in a vocabulary, information in
a biography might still not be enough to identify the connection automatically. Often the only information about spouses, siblings, tutors, etc. mentioned in the biography is their name and their relationship (father of, spouse, tutor of, etc.) to the person the biography was written about. In this case relation extraction can help to correctly identify the matching entity. Relational information collected from the biographies can be compared with information in the dictionaries, and in case of matching values, the link between the entities can be proposed by the system. In APIS we implemented a rule-based approach using the JAPE¹¹ grammar to detect relations. Further difficulties arise from names for which more than one match with vocabulary entries is possible. Choosing the correct match is called disambiguation. The heuristics we apply for automatic disambiguation consist of fine-tuning the Solr indexes of place names and person names and adapting them to the characteristics of the input data, for example by indexing only person names from geographical areas relevant to the data sets ÖBL and dbo@ema. Thus we can decrease false matches caused by name collisions with individuals who were born, lived and died in areas unrelated to ÖBL/dbo@ema.

¹¹ https://gate.ac.uk/sale/tao/splitch8.html [last accessed: 23.06.2017]

For the realization of the entity linking, Apache Stanbol¹² has been chosen as an open-source, customizable and extendible implementation framework. The benefit of using Apache Stanbol is, on the one hand, its ability to create Referenced Sites (i.e. a local Apache Solr¹³ index of a knowledge base) from any (publicly available) RDF/XML resource and to perform Entity Linking against the compiled site. Furthermore, Stanbol allows the user to take advantage of integrated Natural Language Processing (NLP) frameworks such as OpenNLP¹⁴ in a free, open-source environment. In APIS we have set up a procedure to convert unstructured, full-text biographies into structured, semantically enriched and machine-readable documents. This procedure currently consists of two steps. First, we resolve the abbreviations, including the shortened forms of person names, institution names, academic titles, location names, frequent verbs, etc., with a regular-expression-based Java program that substitutes them with their corresponding resolution taken from an ÖBL-internal abbreviations list. Second, we configure and run Stanbol’s Entityhub Indexing Tool to create Solr indexes from the resources GeoNames¹⁵ and GND¹⁶. After initializing the index, an Enhancement Chain is set up. The Enhancement Chain is responsible, on the one hand, for running NLP tasks on the biographies (language detection, sentence splitting, tokenization, part-of-speech tagging and chunking) and, on the other hand, for matching the entities identified by the NLP processor with the Solr index. In our project, the NLP pipeline runs the OpenNLP software with the German model files.

¹² https://stanbol.apache.org/ [last accessed: 23.06.2017]
¹³ http://lucene.apache.org/solr/ [last accessed: 23.06.2017]
¹⁴ https://opennlp.apache.org/ [last accessed: 23.06.2017]
¹⁵ http://www.geonames.org/about.html [last accessed: 23.06.2017]
¹⁶ http://www.dnb.de/EN/Standardisierung/GND/gnd_node.html [last accessed: 23.06.2017]

Although correction methods can reduce the error rate of automatic Entity Linking, some manual correction is still required; hence we foresee a manual data curation process to complement and correct the shortcomings of the automatic process.

4 Data set analysis

Analyzing the person names in the data sets ÖBL and dbo@ema, the following figures emerged: in the ÖBL (counting the biographies written until the beginning of the project), the life stories of 18219 persons make up the data set of the APIS project, whereas the dbo@ema data resource contains 8841 person names. When aligning the two data sets, results showed that 402 person names are identical, given the criterion that the first name and the last name of the corresponding dbo@ema and ÖBL entries have to match exactly. Because the two data sets differ in how they model personal data (e.g. the ÖBL second name contains all the name variants of a person in a comma-separated format, whereas the dbo@ema contains a comma in the second name before noble titles), the number of matches between the two
resources could be higher after reconciliation. Our analysis thus gives a first rough estimate of how many persons potentially overlap in the two collections. Further manual curation is necessary, considering that information for the correct identification of a person is often missing in the database. The dbo@ema often lacks the information about date and place of birth. In this case additional knowledge, such as the publications or names of relatives, can be used to identify and correctly find the person from the dbo@ema in the Austrian Biographical Dictionary. When narrowing down the criteria to an exact match on the first name, last name and year of birth, only 35 entries are found that occur in both resources. The small number of matches can also be attributed to the fact that in many cases basic information is missing for the exact identification of a person. To overcome this problem, a system has been developed within the APIS project in which manual curation of entities such as persons, locations, institutions, works and events is possible. We foresee that a manual review process will be carried out after the automatic linking of the dbo@ema and ÖBL person data sets, in order to approve correctly established links, revise erroneous connections and add missing information to both data sources.

The following example illustrates how the knowledge sources ÖBL and dbo@ema are connected to each other via the GND URI assigned to Johann Willibald Nagl, an Austrian writer and Germanist appearing in both data sets. Nagl's ÖBL biography has been published online, and his personal data (name, date and place of birth, date and place of death) is also recorded in the dbo@ema database (see the two entries of Nagl in the Appendix). The link between the two instances has been established by means of the Stanbol Entity Linking Module, which identifies Johann Willibald Nagl as a candidate for entity matching and looks it up in the Solr index created from GND person names. Below we show an excerpt of the semantic annotation created by Stanbol. The URI http://d-nb.info/gnd/116880414 links the two occurrences of Johann Willibald Nagl and thus the two resources ÖBL and dbo@ema.

{
  "@id": "urn:enhancement-41adec0e-9ebc-8d19-7644-b799288d563b",
  "@type": [ "Enhancement", "EntityAnnotation" ],
  "confidence": 1.0,
  "created": "2017-06-22T16:25:27.384Z",
  "creator": "org.apache.stanbol.enhancer.engines.entitylinking.engine.EntityLinkingEngine",
  "entity-label": "Nagl, Johann Willibald",
  "entity-reference": "http://d-nb.info/gnd/116880414",
  "entity-type": "http://d-nb.info/standards/elementset/gnd#DifferentiatedPerson",
  "extracted-from": "urn:content-item-sha1-3dee9b203b74c12fec298348e74a1a0f16ee7da2",
  "relation": "urn:enhancement-e1a4dcdd-e9fc-d9fc-42d4-b4e7cabb4685",
  "site": "gndPersons"
}

With the help of a web application being developed in APIS we are planning to evaluate the quality of the linking process. The application is designed to support automatic and manual annotation within one system, thus allowing automatic evaluation of annotation tasks.

5 Discussion and Conclusion

In this paper we discussed the linking of person names in two data sets, the ÖBL and dbo@ema. Our method has shown that, through the automatic entity linking process, the same persons occurring in different resources can be detected and connected. Through the established links and by applying the relation extraction method implemented in the APIS project, a link across the data sets ÖBL and dbo@ema can be revealed, giving valuable information on relations among the persons mentioned. Our method is still in its development stage and this paper is a first introduction. By generating person networks including additional information existing in the ÖBL or dbo@ema, our “social network” could provide a valuable
source of information also for non-specialists. As persons mentioned in the two resources are also connected to a variety of personal information (profession, birth place, etc.), opening up and connecting our data sets to other services for societal benefit is another main goal. Services that could potentially benefit from our generated knowledge include Europeana collections or museums. Connecting the information from our ÖBL and dbo@ema resources to current collections would offer a fruitful collaboration for giving citizens access to otherwise hidden information.

References

Benito, A., Losada, A. G., Therón, R., Dorn, A., Seltmann, M., Wandl-Vogt, E. (2016): A spatio-temporal visual analysis tool for historical dictionaries. TEEM 2016. Proceedings of the Fourth International Conference on Technological Ecosystems for Enhancing Multiculturality: pp. 985-990

Gruber, C., Feigl, R. (2009) Von der Karteikarte zum biografischen Informationsmanagementsystem. Neue Wege am Institut Österreichisches Biographisches Lexikon und biographische Dokumentation, in: Martina Schattkowsky / Frank Metasch (eds.), Biografische Lexika im Internet. Internationale [...] (30. und 31. Mai 2008) (= Bausteine aus dem [...]

[...] Networks. 2nd DHA Conference. Vienna, Austria. DOI: 10.15169/sci-gaia:1473321487.86

Mendes, Pablo N., Jakob, Max, García-Silva, Andrés, Bizer, Christian. DBpedia Spotlight: shedding light on the web of documents. Proceedings of the 7th International Conference on Semantic Systems, page 1-8. New York, NY, USA, ACM, (2011)

Moro, A., Raganato, A., Navigli, R. (2014) Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2, pp. 231-244.

Schopper, D., Bowers, J., Wandl-Vogt, E. (2015) dboe@TEI: remodelling a database of dialects into a rich LOD resource. Proceedings of TEI conference 2015.

Wandl-Vogt, E., Bartelme, N., Fliedl, G., Hassler, M., Kop, C., Mayr, H., Nickel, J., Scholz, J., Vöhringer, J. (2008): dbo@ema. A system for archiving, handling and mapping heterogeneous dialect data for dialect dictionaries. In: Bernal, Elisenda / De Cesaris, Janet (Hrsg.): Proceedings of the XIII Euralex International Congress, Barcelona, Universitat Pompeu Fabra, 15.-19. Juli 2008 (= Sèrie activitats 20). Barcelona (Documenta Universitaria). S. 1467-1472 (CD-ROM).

Wandl-Vogt, E., Kieslinger, B., O´Connor, A., Theron, R. (2015). „exploreAT! Perspektiven einer Transformation am Beispiel eines lexikographischen Jahrhundertprojekts“, in: DHd-Tagung 2015. Graz. Austria. Accessed at: http://dhd2015.uni-graz.at [23.06.2017]

Appendix

ÖBL entry of Johann Willibald Nagl:

Nagl
Johann Willibald
Nagl

Johann Willibald
Nagl
Johann Willibald
(1856- Natschbach b. Neunkirchen (?, NÖ) 1918) Diepolz b. Neunkirchen (?, NÖ)
Germanist und Schriftsteller

Nagl Johann Willibald, Germanist und Schriftsteller. * Natschbach b. Neunkirchen (NÖ), 11. 5. 1856; † Diepolz b. Neunkirchen (NÖ), 23. 7. 1918.

Stud. nach einem bald wieder abgebrochenen Theol.-Stud. Phil. und Germanistik an der Univ. Wien, 1886 Dr. phil. Neben seiner Lehrtätigkeit an verschiedenen Schulen war N. ab 1890 als Priv. Doz. für Mundartforschung an der Univ. Wien tätig. Er darf neben Seemüller zu den Initiatoren der Wr. mundartkundlichen Schule (z. B. als Hrsg. der Z. „Deutsche Mundarten“) gezählt werden, wenn auch manche von ihm angeschnittene Probleme später anderen Lösungen zugeführt wurden. Schon als Schottenkleriker hatte N. begonnen, die alte Tierfabel von Reineke Fuchs in seiner niederösterr. Heimatmundart darzustellen. Als Vorlage für das Dialektepos „Der Fuchs Roáner, á Gleichnus aus derselbigen Zeit, wo d’Viecher noh hab’n red’n künná. Aus uralten, vierhundert- bis sechshundertjährigen Büchern neu in [...] österreichischen Landsleute“ [...] gelang es dabei nicht nur, den niederösterr. Bauerndialekt, [...] Anschauungswelt des Neunkirchner [...] von Castle fortgesetzt wurde. Überdies befaßte sich N. mit Stud. über den niederösterr. Bauernstand, veröff.

W.: Da Roanad. . ., 1889, 2. Aufl. 1909; Vokalismus der bayr.-österr. Mundart, 1895; Geograph. Namenkde., in: Die Erdkde. 18, 1903; Dt. Sprachlehre . . ., 1905, 2. Aufl. 1906; etc. Hrsg.: Dt. Mundarten, 1896 ff.; Dt.-österr. Literaturgeschichte, 4 Bde., gem. mit J. Zeidler und E. Castle, 1899–1937.

L.: RP vom 2. und 11. 5. 1916, 27. 7. und 15. 8. 1918; Wr. Ztg. und N. Fr. Pr. vom 26. 7. 1918; Z. für österr. Volkskde., Jg. 3, 1897, S. 319, Jg. 4, 1898, S. 52; Monatsbl. des Ver. für Landeskde. von NÖ, Jg. 17, 1918, S. 190 ff.; Petermanns Mitt., 1918, S. 228; Unsere Heimat, NF, Bd. 11, 1938, S. 200 ff.; I. M. Swift Peacock, Der grammat. Anhang J. W. N.s „Fuchs Roánad“ im Vergleich mit dem heute lebendigen Wortschatz in der Mundart der Gemeinde Hafning, Bez. Neunkirchen, NÖ, phil. Diss. Wien, 1969; Giebisch–Gugitz; Kosel; Rollett, Kosch, Das kath. Deutschland; Wer [...]

(M. Hornung)

ÖBL 1815-1950, Bd. 7 (Lfg. 31, 1976), S. 21

dbo@ema entry of Johann Willibald Nagl:

12102
Johann Willibald
Nagl
11
5
1856
Natschbach b. Neunkirchen, NÖ
7082
23
name="todMonat">7
1918
Diepolz b. Neunkirchen, NÖ
NULL
2-1

Mutter: --- Geburtsdatum: --- Todesdatum: --- Anm.: --- (bereits in Datenbank: ja/nein) Vater: --- Geburtsdatum:-- Todesdatum: --- Anm.: --- (bereits in Datenbank: ja/nein) Weitere Verwandte: --- Anm./Verweise: --- Regierungsrat Dr.phil. Schule: Universität --- Ort: --- von: --- bis: --- Anm: Universität --- Ort: Wien --- von: --- bis: 1886 --- Anm: Phil. und Germanistik; 1886 Dr.phil. --- --- Anm: --- Beruf: Lehrer --- Ort: verschiedenen Schulen --- Beruf: -- Ort: Universität Wien --- von: - Schriftsteller --- Ort: --- von: -- Herausgeber der Zeitschrift „Deutsche Mundarten“ --- Ort: --- -- Ort: --- von: --- bis: --- Anm: Tätigkeiten:
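
The two matching criteria described in Section 4 (exact match on first and last name, then additionally on year of birth) can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the APIS implementation: the `Person` record, its field names, the `normalize` helper and all sample rows except the Nagl record (taken from the Appendix) are hypothetical, and the paper does not say whether any normalization was applied before comparing names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Person:
    """Hypothetical minimal person record shared by both data sets."""
    first_name: str
    last_name: str
    birth_year: Optional[int] = None  # None when the source lacks the date


def normalize(name: str) -> str:
    # Assumption: trim, collapse whitespace and ignore case before the
    # exact comparison; the paper does not specify any normalization.
    return " ".join(name.split()).casefold()


def name_key(person: Person) -> Tuple[str, str]:
    return (normalize(person.first_name), normalize(person.last_name))


def match_by_name(
    oebl: List[Person], dboema: List[Person]
) -> List[Tuple[Person, Person]]:
    # Index the OeBL persons by name key, then look up each dbo@ema person.
    index: Dict[Tuple[str, str], List[Person]] = {}
    for person in oebl:
        index.setdefault(name_key(person), []).append(person)
    return [(o, d) for d in dboema for o in index.get(name_key(d), [])]


def match_by_name_and_year(
    oebl: List[Person], dboema: List[Person]
) -> List[Tuple[Person, Person]]:
    # Stricter criterion: same name and the same, known, year of birth.
    return [
        (o, d)
        for o, d in match_by_name(oebl, dboema)
        if o.birth_year is not None and o.birth_year == d.birth_year
    ]


# Sample data: only the Nagl record reflects the Appendix; the rest is invented.
OEBL = [
    Person("Johann Willibald", "Nagl", 1856),
    Person("Anna", "Beispiel", 1870),
]
DBOEMA = [
    Person("Johann Willibald", "Nagl", 1856),
    Person("Anna", "Beispiel"),       # birth year missing in this source
    Person("Karl", "Muster", 1900),   # no counterpart in OEBL
]

loose = match_by_name(OEBL, DBOEMA)
strict = match_by_name_and_year(OEBL, DBOEMA)
print(len(loose), len(strict))
```

On this sample the name-only criterion yields two candidate pairs and the stricter criterion one, because the second dbo@ema record lacks a year of birth. This mirrors, on a tiny scale, why the count reported in Section 4 drops from 402 to 35 once the year of birth is required, and why pairs found only by the looser criterion are candidates for manual curation.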