Semantic Web 0 (0) 1
IOS Press

KINDEX – Automatic Subject Indexing with Knowledge Graph Services

Lisa Wenige *, Claus Stadler, Simon Bin, Lorenz Bühmann, Kurt Junghanns and Michael Martin
Institute for Applied Informatics, University Leipzig, Germany
E-mail: [email protected]

Abstract. Automatic subject indexing has been a longstanding goal of digital curators to facilitate effective retrieval access to large collections of both online and offline information resources. Controlled vocabularies are often used for this purpose, as they standardise annotation practices and help users to navigate online resources through following interlinked topical concepts. However, to this date, the assignment of suitable text annotations from a controlled vocabulary is still largely done manually, or at most (semi-)automatically, even though effective machine learning tools are already in place. This is because existing procedures require a sufficient amount of training data and they have to be adapted to each vocabulary, language and application domain anew. Against the background of tight budgets and missing IT personnel in cultural heritage as well as research infrastructure institutions, adoption of automatic subject annotation tools is hindered, while manual assignment of index terms is an even greater burden on the available financial resources. In this paper, we argue that there is a third solution to subject indexing which harnesses cross-domain knowledge graphs (i.e., DBpedia and Wikidata) to facilitate cost-effective automatic descriptor assignments that can be done without any algorithm tuning and training corpora. Our KINDEX approach fuses distributed knowledge graph information from different sources. Experimental evaluation shows that the approach achieves good accuracy scores by exploiting correspondence links of publicly available knowledge graphs.
Keywords: Automatic Indexing, Named Entity Recognition, Key-phrase Extraction, Authority File

1. Introduction

Cultural heritage institutions have a need to organise and facilitate access to large collections of printed and online materials. In this regard, topic schemes are important tools that facilitate effective content searches when navigating the ever growing amount of offline and online data. These schemes are often referred to as knowledge organisation systems (KOS) or controlled vocabularies. They mediate between user information needs and the data quantity that can be found in both digital and analogue collections. In KOS, topics are represented as uniquely identified concepts that can be characterised by different synonyms/labels [1]. Thus, it is made sure that users with identical information needs find items that are aligned with their interests, even though they conduct searches using different keywords.

The process of assigning KOS keywords to resources is called subject indexing and has a long tradition in cultural heritage institutions. It is not a trivial task, since sets of descriptors have to be identified that best capture the content of library, museum or archival artifacts. In this context, Hutchins coined the term aboutness to underline that subject indexing is predominantly concerned with identifying an item's topics through condensed keyword descriptions [2].

In information providing institutions, trained professionals manually inspect and assign suitable concepts. Even with the growing amount of data this practice has not been replaced by automatic approaches. This is an interesting fact, considering that there are already algorithms in place that can extract and assign suitable KOS concepts from digital text. Automatic key-phrase extraction approaches can be grouped into two categories, namely ML-enabled associative and lexical indexing. The former learns a model from training data by identifying frequently occurring associations

* Corresponding author. E-mail: [email protected].

1570-0844 © IOS Press and the authors. All rights reserved.

between n-grams from input texts with the corresponding KOS descriptors. The lexical approach matches text snippets from the to-be-described texts with the labels of controlled vocabularies, thus assigning the appropriate descriptor term by taking advantage of the synonymous labels being prevalent for each KOS concept [3].

However, even though these approaches have yielded good accuracy scores and in some cases even achieved results comparable to human indexers [4], their application is not as widespread in both cultural heritage and research infrastructure institutions as one would expect given their high performance.

The reasons for this phenomenon are manifold: First and foremost, a considerable amount of preprocessing and training effort is required to adapt the existing machine learning approaches to the training corpora (e.g., language and domain specificities) as well as the vocabulary. Besides being dependent on the availability of a large enough training corpus of annotated items, this requires both expertise in programming and NLP methods and sufficient time resources. However, many information providing institutions operate on tight budgets and cannot employ additional staff for these tasks [5]. On the other hand, manual indexing, as it is still practised, is time consuming as well. Hence, a (semi-)automatic indexing tool would spare resources that could be used to speed up cataloguing routines and would lead to more items being sufficiently indexed, which would in turn benefit retrieval interfaces in digital libraries and data portals. It would also enable collections that are currently not linked to any controlled vocabulary to be integrated with the distributed knowledge infrastructure of the web of data, thus facilitating integration of related information resources.

For illustration purposes, consider the following example: An open data portal of a governmental agency frequently publishes data sets that might be of interest to its citizens. Data sets are described with a short abstract. An indexing tool being able to assign subject descriptors of a widespread KOS would enable users to inspect publications from a library catalogue applying the same concept annotations. Given the high prevalence of bibliographic information on the web of data, this appears to be a promising research direction both in terms of data integration and semantics-aware retrieval [6].

Against the background of the aforementioned opportunities of the web of data and the difficulties of ML-based annotating, we present a novel method and architecture for automated subject indexing. Rather than fine-tuning existing ML and natural language processing (NLP) approaches for a given thesaurus and text collection, we present the KINDEX approach that leverages existing general purpose annotation tools, such as DBpedia Spotlight [7], in conjunction with KOS correspondence links of the data web (e.g. owl:sameAs, skos:exactMatch) in order to provide relevant keyword suggestions to information professionals. The benefits of our approach are as follows:

– Minimum training and preprocessing effort: The approach requires a minimum of resources from the side of cultural heritage and research infrastructure institutions as compared to state-of-the-art automatic keyword indexing systems (see Sect. 2). In KINDEX, this is realised through a coupling of existing knowledge graph web services in a unified annotation workflow.
– Domain independence: KINDEX provides annotations for many potential application domains (see Sect. 3 and 4). It achieves domain independence through the following characteristics:
  ∗ KOS independence: The system does not need to be trained for a particular controlled vocabulary. The sole precondition is that the KOS system, from which keywords are to be generated, is published on the web of data and either linked to DBpedia or Wikidata.
  ∗ Multilingualism: Annotations can be provided for various languages of the input text.
  ∗ Corpus invariance: The method does not require a training corpus of annotated texts in order to yield keyword recommendations. Hence, concept suggestions can be made for collections that are currently void of any index terms from a controlled vocabulary.
– Decent accuracy scores: Evaluation results from two real-world usage scenarios have shown the viability of the approach in terms of accuracy scores (see Sect. 5).

2. Related Work

Matching free text keywords is the number one retrieval strategy of today's search engines. However, meaningful KOS-based annotation of digital and analogue artefacts is still important for collections that serve highly specialised information needs. Several studies in the (digital) library context have shown that

users were better able to find annotated documents as opposed to content without any descriptors [8, 9]. This is one of the reasons why subject indexing is a typical part of the cataloguing procedure to this date. It is also still, apart from a few exceptions, typically carried out manually by domain experts [10, 11].

Despite the practice of manual annotation, the task of automatically identifying content from text has been extensively studied under the terms named entity recognition and information extraction. Starting from hand-crafted rules, the field has moved toward applying more advanced machine learning approaches, such as support vector machines and decision trees [12]. With the advent of Web 2.0 applications and the popularity of both social bookmarking and collaborative tagging, a plethora of keyword recommendation approaches was introduced in the literature. These approaches are often quite similar to the field of automated indexing, with the sole difference that in regular subject indexing, terms are part of a knowledge organisation system, while social tagging is mostly done with free text keywords [13–15]. Among the KOS-based indexing approaches, one can distinguish between ML-enabled lexical and associative indexing. The former matches n-grams from text input with the lexical entries of a controlled vocabulary. With these approaches, machine learning techniques are applied to identify how features among the characteristics of the input sequence (e.g., term-document frequency, keyphraseness or text position) should be weighted in order to make high quality predictions. Among the most prominent approaches of lexical matching for thesaurus-based indexing are the software applications KEA and its successor MAUI [16]. KEA has been tested on a collection of agricultural documents with the AGROVOC thesaurus and MAUI was run on Wikipedia articles. For Wikipedia articles, the authors showed that F1 scores of almost 50% are possible when automatically assigning concepts to documents [4, 17]. In contrast to lexical indexing, associative indexing learns the likelihood of associating the extracted text snippets with a corresponding index term. For both methods, a model has to be learnt from a training corpus. In the case of associative indexing, the document collection has to be even larger to achieve good results. This is because the method can only identify concept terms once they occur in the training data. For instance, successful application of ML-based associative indexing methods was demonstrated on text collections from agricultural, medical and computer science research, showing that learning association models can provide high quality automated subject assignments [18–20]. In addition to that, Toepfer et al. have shown that it is even more efficient to fuse the two approaches of lexical and associative indexing by unifying the sets of subject descriptors that were identified by the two methods. They evaluated the approach on a collection of manually annotated economics texts by testing the system's performance regarding its ability to assign concepts from the STW Thesaurus for Economics [3]. The evaluations showed that the fusion approach significantly boosted F1 scores as compared to a sole application of either lexical or associative indexing. Further, the authors demonstrated that it is possible to achieve fairly good prediction results (with F1 scores between 30% and 40%), even though the model was exclusively trained on short texts (i.e., the titles and free-text keywords of publications).

However, the above described approaches each learn models that are only applicable to a single document collection. They are therefore dependent on the idiosyncrasies of the input texts and the thesaurus that is used in that particular application domain. Hence, there is a considerable effort involved when adapting the available approaches for other resource collections and vocabularies. This is a problem, given the limited IT personnel that might not be available for algorithm fine-tuning, besides the already time-consuming tasks of handling day-to-day operational IT services in cultural heritage institutions [5]. Given the exponential growth of electronic resources on the one hand and the commitment of libraries and archives to provide suitable entry points for these resources on the other hand, (semi-)automated methods for subject indexing are desperately needed. The German National Library can serve as an example for this development: Its legal mandate was expanded to the collection of electronic resources being released by German publishers or on topics relevant to Germany. Thus, the library had to deal with a quickly growing amount of publications that could not be handled with the traditional procedures of manual indexing. Existing metadata from the publishers can also not be used, since they often do not provide topic-specific metadata for their resources. Because of this situation, an automatic indexing strategy was implemented as a production system in the national library. The automatic strategy has proven to be feasible with mostly high quality assignments of index terms [10, 11].

However, this example of automatic indexing in an operating environment can be considered a rare exception. The majority of machine-based indexing approaches is tested under laboratory conditions, often

without a follow-up productive implementation of the system [21, 22]. The low adoption rates indicate that existing automatic annotation tools are predominantly domain-specific and cannot be easily transferred to other application scenarios. Against this background, it seems appropriate to investigate suitable alternative approaches for automated annotation, such as leveraging openly available data sources for this task.

The cultural heritage sector has long been an active advocate of data publishing, as it has generated high quality interlinked and structured data for decades. Thus, it was a natural inclination for these institutions to join the Linked (Open) Data movement to redistribute the tax-funded collection of catalogue data to a wider public. To this date, the activities toward an open knowledge graph infrastructure of library data are predominantly concerned with the publication of catalogue data and controlled vocabularies [23]. In the context of automatic subject indexing, the available KOS vocabularies are especially relevant. For more than a decade, many of the existing thesauri have been made publicly available in SKOS format [5, 24]. Since 2009, the Simple Knowledge Organisation System (SKOS) has been a W3C recommendation. It offers a schema language to describe knowledge organisation systems in such a way that they can be published, exchanged and interlinked on the web of data. SKOS provides expressions to uniquely identify descriptors (i.e., skos:Concept), to declare hierarchical relationships (skos:broader) as well as cross-concordances/identity links (i.e., skos:exactMatch) [1]. At the forefront of the SKOS-related activities was the publication of library authority files. Authority files standardise bibliographic control as they uniquely identify and describe persons, organisations, and subject descriptors. Among them were the Library of Congress Subject Headings (LCSH), the Integrated Authority File (GND) of the German National Library and the Virtual International Authority File (VIAF) [25, 26]. Started as a joint effort between the Library of Congress and the German National Library, the VIAF registry has been joined by more than 50 institutions from the cultural heritage sector by now.1 On the other hand, there are also many domain-specific vocabularies that have been made publicly available in RDF format. Examples are the AGROVOC thesaurus published by the Food and Agriculture Organisation of the United Nations (FAO) [27] and the STW Thesaurus for Economics by the Leibniz Information Centre for Economics [28]. The Basel Register of Thesauri, Ontologies and Classifications (BARTOC) lists hundreds of knowledge organisation systems from various fields, such as life sciences, law or history [29].2

Given the wide availability of SKOS vocabularies, which are often densely interlinked both with each other as well as with the web of data, it seems promising to investigate whether these links can be leveraged for automatic subject indexing. For instance, Kempf et al. pointed out that a mixed-methods strategy combining manual and (semi-)automatic indexing in conjunction with identity links has great potential to increase cataloguing efficiency [30]. However, in the context of enhanced library services, SKOS vocabularies have rarely been taken advantage of. Only a few papers studied the effect of SKOS relations in order to improve retrieval systems [24, 31, 32]. Empirical investigations into the potential and effects of cross-concordances to leverage automatic subject indexing have not been conducted so far. This might be a hindrance for the introduction of automatic indexing tools in information providing institutions on a broader scale, since ML-based indexing methods require a considerable amount of data preprocessing and algorithm fine-tuning and are crucially dependent on a training corpus with a sufficient amount of labelled data; i.e., subjects that have already been manually assigned to a multitude of publications. Thus, ML-based subject indexing is largely domain dependent and will in many cases not be an option due to the missing required high quality data sources.

However, in the web of data, there are cross-domain indexing tools available that can be applied in conjunction with identity links in order to facilitate (semi-)automatic subject indexing. In the remainder, we will investigate whether it is possible to leverage identity links with named entity recognition facilitated by DBpedia Spotlight in order to offer an alternative route for subject indexing, in cases where either the data or the available personnel resources do not allow for an ML solution.

3. Use Cases

3.1. Use Case Selection

We showcase the KINDEX approach in the context of two different use cases. Use Case 1 is characterised

1 https://viaf.org
2 https://bartoc.org

by the fact that the relevant test collection has no annotations and thus belongs to the potential scenarios of (semi-)automatic subject indexing where an ML-based approach is not feasible due to missing training data. In contrast to that, Use Case 2 is associated with a data collection that comprises a considerable amount of training data, making it a natural test bed for evaluations regarding the performance of the identity-based approach in comparison to ML-enabled automatic subject indexing.

3.2. Use Case 1: mCLOUD - LIMBO

The mCLOUD platform is an open data portal that is maintained by the German Federal Ministry of Transport and Digital Infrastructure. It currently registers more than 1,500 traffic-related data sets (e.g., rails, roads or waterways) as well as climate and weather data. Data set owners are either the ministry or associated public agencies. The goal of the mCLOUD portal is to support data-driven research and development projects on unprecedented navigational services, smart travel and route planning as well as novel approaches towards precise weather forecasting.3 The ministry also supports the research project Linked Data Services for Mobility (LIMBO). LIMBO is concerned with semantically describing and integrating the mCLOUD data sets with the web of data in order to facilitate improved retrieval access to these resources. In the project, a metadata-catalogue in DCAT/DATAID format has already been crawled from the mCLOUD's publicly available web pages [33]. A typical record comprises the data set's title, its license and issue date, the publishing agency as well as a short description in text format (predominantly in German).4 What these entries are missing, however, is a structured semantic description of the data set's content that integrates well with other resources on the web of data. For instance, it would be nice to be able to retrieve matching publications from the catalogue of the German National Library for each data set. Hence, a subject indexing approach that automatically assigns GND descriptors is required for this purpose.

3.3. Use Case 2: Econstor LOD

The second use case concerns the Linked Open Data (LOD) collection of the Econstor repository, which is one of the largest Open Access servers in the field of Economics. The German National Library of Economics published an excerpt of this collection in RDF format, thus making more than 180k papers with predominantly German and English text descriptions available to a wider public [34]. A large fraction of the publications that are contained in the data set has subject annotations from the STW thesaurus. Due to the large amount of labelled data, Econstor LOD is a natural candidate corpus for testing machine learning methods with regard to subject indexing. Hence, the collection has already been used in an empirical investigation into automatic indexing methods by Toepfer et al. [3]. The authors' findings can serve as baseline indicators for the feasibility of subject indexing methods that harness identity links in comparison to ML-based approaches.

4. The KINDEX approach

4.1. Architecture & Workflow

Fig. 1 shows the architecture of the KINDEX approach. It can be implemented as a lightweight command-line script which harnesses existing knowledge graph technologies and web services by combining straightforward HTTP and SPARQL requests as well as JSON processing operations. For this purpose, it relies on a running instance of DBpedia Spotlight [7] as well as mappings from DBpedia [35] and Wikidata [36] to knowledge organisation systems. The indexing process starts with text snippets (e.g. a title and/or description of a publication, image or data set). Prior to the annotation it is often useful to apply a blacklist filter that suppresses annotations for sequences that are generally known to lead to faulty keywords (such as two-character sequences). After blacklist filtering, text snippets are fed into the DBpedia Spotlight web service that is tailored to a particular language. Currently, there exist 11 indices for DBpedia Spotlight,5 which facilitate multilingual annotations.

Imagine you would want to index a publication from Econstor with STW descriptors.

3 https://www.mcloud.de/
4 https://gitlab.com/limbo-project/metadata-catalog/blob/master/catalog.all.ttl
5 https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/faq. This list can possibly be extended to comprise the entirety of the 15 available DBpedia chapters: https://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
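The annotation step of the workflow, sending a text snippet to a language-specific DBpedia Spotlight instance and collecting the recognised entities, can be sketched in Python as follows. This is a minimal sketch: the endpoint URL, confidence value and function names are illustrative assumptions, while the response layout (a "Resources" list whose entries carry "@URI" and "@surfaceForm" keys) follows Spotlight's public JSON format.

```python
import json
import urllib.parse
import urllib.request

# Illustrative endpoint of a language-specific Spotlight instance;
# a locally hosted index would be addressed in the same way.
SPOTLIGHT_ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"


def annotate(text, confidence=0.5):
    """Send a text snippet to DBpedia Spotlight and return the parsed JSON."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(
        SPOTLIGHT_ENDPOINT + "?" + params,
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_entities(spotlight_json):
    """Reduce a Spotlight response to (surface form, DBpedia URI) pairs."""
    return [
        (resource["@surfaceForm"], resource["@URI"])
        for resource in spotlight_json.get("Resources", [])
    ]
```

Calling `extract_entities(annotate("Business taxation and wages"))` would then yield pairs such as `("taxation", "http://dbpedia.org/resource/Tax")`, which are the input to the subsequent lexical and identity matching steps.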

Figure 1. Conceptual overview of the KINDEX approach. (Diagram: an item collection passes through a blacklist filter; named entity recognition produces DBpedia annotations; DBpedia Lookup, lexical matching and identity matching connect the annotations to the thesaurus; the results are fused with SparqlIntegrate.)

An example publication on trade policy, with an English textual description, is shown in Listing 1 [34]. The title is sent to an English DBpedia Spotlight instance, which performs named entity recognition and disambiguation and returns the annotated text as shown in Fig. 2. The respective result file comprises the annotated text, its surface form (e.g., taxation) and the corresponding DBpedia annotation as a URI (e.g., http://dbpedia.org/resource/Tax). For each of the identified entities, a custom workflow is then invoked which combines both lexical as well as identity matching. In this particular example, the most suitable strategy is to first match the surface form, as it was determined by the Spotlight index, with the STW thesaurus' preferred or alternative labels. In case no descriptors can be found, the engine leverages the identity links (i.e., skos:exactMatch or owl:sameAs) that are present in the web of data, starting with the DBpedia URI. For instance, there exist cross-concordance links between the STW thesaurus and the GND, DBpedia and Wikidata. For each of these knowledge graphs or thesauri, the KINDEX tool tests if there exist any identity links to the STW thesaurus that can be utilised for annotation. In this context, correspondences can be determined by different means depending on the quality and quantity of the existing mappings. The following lookup strategies are possible:

– DBpedia Lookup: The same-thing lookup service has been developed as part of the fusion of knowledge graphs from different DBpedia chapters as well as Wikidata with the FlexiFusion approach presented by Frey et al. The web API6 serves as a registry to resolve identity links between resources of the largest publicly available cross-domain knowledge graphs. For instance, for a given language-specific DBpedia URI the web API returns the corresponding Wikidata and DBpedia URIs that are linked to the input URI via the owl:sameAs property. Even though the public DBpedia SPARQL endpoint contains some of these identity links, the same-thing lookup service represents the most comprehensive mapping registry to identify correspondence links between DBpedia and Wikidata [37].
– DBpedia: The public DBpedia SPARQL endpoint also offers other mapping links for identity resolution to thesauri, such as the GND or BBC Things. When hosting a keyword indexing service in their own institution, these links can be inserted into a local triple store or, in cases of small to medium-sized mapping collections, they can even be accessed through querying an in-memory model by using the tool SparqlIntegrate (see Section 4.2).

6 https://global.dbpedia.org/same-thing/lookup/

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix econstor: <...> .
@prefix econ-auth: <...> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

econstor:62535 a swc:Paper, sioc:Item, foaf:Document ;
    dc:title "Business taxation and wages: Evidence from individual panel data"^^xsd:string ;
    dc:creator econ-auth:38331874, econ-auth:38331875, econ-auth:38331876 ;
    dc:issued "2012"^^xsd:gYear ;
    dc:language "eng"^^xsd:string .

Listing 1. Example bibliographic description in RDF from Econstor

Figure 2. DBpedia Spotlight results

– Wikidata: In cases where DBpedia does not provide the required identity link, additional mappings can be determined from Wikidata, which contains mappings to a multitude of thesauri, such as the GND, VIAF, the Library of Congress authority files or MeSH. The advantage of utilising Wikidata for identity resolution is that there exists a separate property for each target thesaurus. For instance, the mappings to GND descriptors are accessible via the property P227, whereas mappings to VIAF can be determined from P214, thus enabling faster access to the target KOS. As is the case with using DBpedia links, the KOS-specific mappings are best loaded into a local triple store or into an in-memory model.
– Additional mappings: There might be cases in which there exist sufficient mappings to the required target KOS neither in DBpedia nor in Wikidata. In these cases, it is often beneficial to utilise a larger mapping collection as an intermediate step. Thus the matching of descriptors is enabled through traversing identity paths. For instance, even though DBpedia contains mappings to the STW thesaurus, there currently exist only 915 skos:exactMatch links between the two knowledge graphs.7 In contrast to that, the numbers of existing identity links between DBpedia and the GND (>80,000 owl:sameAs links) as well as between the GND and the STW (>4,800 skos:exactMatch links) are considerably higher.8 Thus, the chance of finding an identity link by navigating from DBpedia to the GND and then to the STW is higher than when navigating directly from DBpedia to the STW.

Hence, link-enabled keyword indexing is based on trying to find the descriptor from the target KOS, starting from the entities identified by DBpedia Spotlight. In the given example, although the resource is also linked to a Wikidata resource, the corresponding STW descriptor (http://zbw.eu/stw/descriptor/11547-6) can be determined by following the identity links that exist between DBpedia and the GND (e.g., via the public DBpedia SPARQL endpoint, matching the property owl:sameAs). Afterward, the cross-concordances that exist between the GND and the STW can be obtained by querying the STW-to-GND mapping stw_gnd_mapping.ttl9 from an in-memory model that is accessed with the help of SparqlIntegrate.10

7 http://zbw.eu/stw/versions/latest/mapping/dbpedia/about.de.html
8 https://lod-cloud.net/dataset/dnb-gemeinsame-normdatei
9 http://zbw.eu/stw/version/latest/download/about.de.html

The same procedure is carried out for each of the Spotlight annotations. It stops as soon as the respective STW descriptor is found. It is assumed that, through the combination of lexical as well as identity matching, more relevant high-quality keywords can be identified for subject indexing. Once the set of relevant KOS descriptors has been obtained, they can be assigned to a metadata catalogue in RDF format by applying a SPARQL UPDATE command that is customised with the command-line tool SparqlIntegrate (see Sect. 4.2). The execution of the UPDATE request adds the triple statements of Listing 2 to the publication catalogue, thus enabling a seamless integration of keyword annotations with existing bibliographic records. To ensure that only relevant keywords are added to the catalogue, the KINDEX approach can be implemented as an interactive command-line script. Thus, the tool asks the subject indexer whether a keyword is relevant for a particular publication each time before a subject is added to the catalogue. Listing 2 shows the automatically generated and manually verified keywords for the Econstor publication.

The automatic annotation of data set descriptions in the context of the LIMBO project (Use Case 1) only slightly differs from the previously outlined processing steps: Subject descriptors from the GND are determined by matching the text snippets of a data set description from the LIMBO metadata catalogue with DBpedia URIs. DBpedia URIs are then sent to either Wikidata or DBpedia in order to identify identity links to the GND thesaurus. Alternatively, relevant text snippets (surface forms as determined by DBpedia Spotlight) are matched with the preferred or alternative labels of the GND. For this purpose, we queried the LOBID API¹¹ of the University Library Centre of North Rhine-Westphalia (HBZ), as it offers a public interface for determining GND descriptors [38]. An example implementation of the KINDEX approach has been made publicly available and can be downloaded from: https://gitlab.com/limbo-project/keyword-indexing. As SparqlIntegrate plays a major role during the indexing procedures of both use cases as a tool to perform the RDF triple transformations, it will now be explained in more detail.

4.2. SparqlIntegrate

SparqlIntegrate¹² is a tool that leverages SPARQL together with extension functions as the lingua franca for RDFisation and integration of the common heterogeneous data formats XML, CSV, JSON and, of course, RDF itself. Furthermore, it supports interfacing with scripting environments by allowing environment variables to be passed to SPARQL statements, as well as a serialization of result sets in a JSON representation that is suitable for immediate consumption by several existing JSON processors. Thus, it is possible to seamlessly integrate the multiple data transformation steps that occur during automatic indexing into an efficient pipeline by means of lightweight command-line processing. Effectively, triple statements are transformed and/or newly generated based on the variables that were passed to the SPARQL interface (see Sect. 4.1).

During KINDEX processing, SparqlIntegrate transformations are typically handy when triple statements are to be altered or inserted into a small to medium-sized data collection that can be easily loaded into the main memory of the host machine. For instance, in order to determine identity links for different keywords and mapping collections, SparqlIntegrate offers convenient query interface options. Depending on the flag option that is set during execution, different mapping collections can be flexibly queried and the corresponding results (e.g. keyword descriptors from the specified KOS) can be immediately consumed.

For example, when combined with a scripting language, SparqlIntegrate could first query the DBpedia same-thing lookup service for an identity link to Wikidata. When there exists a mapping, a getDescriptor function determines the corresponding STW descriptor from the STW-to-Wikidata mapping collection (i.e. stw_wikidata_maping.ttl). In case there is no such link, the procedure sends an HTTP request to the DBpedia SPARQL endpoint to determine whether there exist any mappings to the GND. Upon obtaining a match, the getDescriptor function is called, which this time queries the STW-to-GND mapping collection for a suitable STW descriptor.

Due to the built-in feature of SparqlIntegrate to provide SPARQL results in JSON format, they can be conveniently accessed by standard command-line processors such as jq¹³ in order to be available for additional data transformations during a KINDEX pipeline.

¹⁰ https://gitlab.com/limbo-project/keyword-indexing/blob/master/queries/mapping.sparql
¹¹ https://lobid.org/
¹² https://github.com/SmartDataAnalytics/SparqlIntegrate
¹³ https://stedolan.github.io/jq/
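The fallback logic of the getDescriptor cascade sketched above (same-thing lookup and STW-to-Wikidata mapping first, DBpedia/GND route second) can be condensed into a few lines. The dictionaries are stand-ins for the same-thing service, the SPARQL endpoint and the mapping files of the actual pipeline, and all sample URIs are invented:

```python
# Sketch of the lookup cascade: prefer the Wikidata identity link resolved
# against the STW-to-Wikidata mapping; fall back to a GND link resolved
# against the STW-to-GND cross-concordance. All dict arguments are
# stand-ins for the services and mapping collections named in the text.
def get_descriptor(dbpedia_uri, wikidata_links, stw_by_wikidata,
                   gnd_links, stw_by_gnd):
    wikidata_uri = wikidata_links.get(dbpedia_uri)   # same-thing lookup
    if wikidata_uri is not None and wikidata_uri in stw_by_wikidata:
        return stw_by_wikidata[wikidata_uri]         # stw_wikidata_maping.ttl
    gnd_uri = gnd_links.get(dbpedia_uri)             # DBpedia SPARQL endpoint
    if gnd_uri is not None:
        return stw_by_gnd.get(gnd_uri)               # stw_gnd_mapping.ttl
    return None                                      # no identity link found
```

The same shape also makes it easy to swap the priority of the two routes, which, as the experiments in Sect. 5 show, is worthwhile per use case.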

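The UPDATE-based insertion of verified descriptors described above can be sketched as a simple string template. The prefix URIs and the helper name are assumptions for illustration; they do not mirror the exact query shipped with the reference implementation (mapping.sparql):

```python
# Sketch: building a SPARQL UPDATE request that attaches STW descriptors
# to a publication record via dc:subject. The publication URI and the
# Dublin Core property URI are illustrative assumptions.
def build_subject_update(publication_uri, descriptor_uris):
    objects = " ,\n      ".join(f"<{d}>" for d in descriptor_uris)
    return (
        "INSERT DATA {\n"
        f"  <{publication_uri}> <http://purl.org/dc/elements/1.1/subject>\n"
        f"      {objects} .\n"
        "}"
    )

update = build_subject_update(
    "http://example.org/econstor/62535",  # hypothetical record URI
    ["http://zbw.eu/stw/descriptor/11547-6"])
```

Such a string can be handed to SparqlIntegrate (or any SPARQL 1.1 Update endpoint) to extend the catalogue with the verified keywords.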
Thus, once a matching STW descriptor has been identified for the input DBpedia URI (and preferably verified by a professional subject indexer), it can be assigned to the respective publication and inserted into the Econstor catalogue (catalog.ttl) with a simple command.

@prefix econstor: <…> .
@prefix dc: <…> .
@prefix stw: <…> .

econstor:62535 dc:subject stw:12072-1 (Enterprise),
               stw:11547-6 (Tax),
               stw:11322-2 (Wages) .

Listing 2. Automatically generated triple statements

5. Experiments

5.1. Experimental Settings

The performance of the KINDEX approach was evaluated in experiments in which the two use cases and their associated data catalogues served as empirical test beds. For LIMBO, a baseline corpus of manually assigned GND descriptors was set up prior to conducting the evaluations. For this purpose, two test annotators assigned keywords to a random selection of 100 example records in German that were extracted from the LIMBO metadata catalogue. The annotators were presented with the metadata descriptions of the data sets and received some initial recommendations for GND descriptors that were determined according to DBpedia Spotlight, Wikidata and the LOBID API. The annotators were asked to judge the relevance of the descriptors (with either 1 = "relevant" or 0 = "not relevant") upon being presented with the data set description. Additionally, the indexers stated further keywords that they considered vital to characterise the data set. Afterwards, the relevance judgements of the two annotators were condensed into a gold standard annotation set, where each data set description was enhanced with the fitting index terms. This was done by considering each GND descriptor as relevant whenever the test persons had consensus, i.e. both either marked the concept with 1 or stated the same keyword as being additionally relevant. The test persons reached agreement in approx. 63% of the assignments.

With the Econstor LOD use case, an annotated corpus was already available. From this corpus, we randomly selected a test collection of 250 English publications that were annotated with STW descriptors.

For both use cases, we tested whether the KINDEX approach correctly predicted the baseline annotations from the gold standard corpora. For this purpose, precision and recall values were calculated by applying the sample-based average method commonly used in text categorisation tasks [39]. Average F1 scores were determined after the summarised values for precision and recall had already been calculated. The collections for both use cases as well as the evaluation script are publicly available.¹⁴

5.2. Results

Prior to conducting the final performance evaluations, we carried out a few preliminary parameter tuning experiments, in which we determined the best configuration with regard to the confidence score, i.e. the degree to which the matched text snippet is deemed to refer to the correct entity by DBpedia Spotlight. Additionally, the best suited type of text information to be used by the respective Spotlight instances was also determined. In case of the LIMBO catalogue, the title (dct:title), the description (abbr. desc, dcat:description) and the union of the two (title+desc) were tested. For Econstor publications, the title (dct:title), free-text keywords (key, dc:keyword) and a combination of the two text snippets were evaluated during the tests. Abstracts were not considered, since the baseline paper by Toepfer et al. only reported on ML-based accuracy scores related to short text descriptions (e.g. title and string keywords) [3]. Since we wanted to evaluate how the identity-based approach performs in comparison to ML-based indexing, we focused on these metadata features.

¹⁴ https://gitlab.com/limbo-project/keyword-indexing/tree/master/evaluation
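The sample-based averaging described above — per-record precision and recall first, F1 derived from the averaged values afterwards — can be reproduced with a short script. The toy records below are invented; only the order of computation follows the text:

```python
# Sketch of the sample-based average method: precision/recall are
# computed per record and averaged; F1 is then derived from the averaged
# values, mirroring the order of computation described in Sect. 5.1.
def sample_averaged_scores(gold_annotations, predicted_annotations):
    precisions, recalls = [], []
    for gold, predicted in zip(gold_annotations, predicted_annotations):
        gold, predicted = set(gold), set(predicted)
        hits = len(gold & predicted)
        precisions.append(hits / len(predicted) if predicted else 0.0)
        recalls.append(hits / len(gold) if gold else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two toy records: gold-standard descriptors vs. predicted descriptors.
p, r, f1 = sample_averaged_scores(
    [["stw:11547-6", "stw:11322-2"], ["stw:12072-1"]],
    [["stw:11547-6"], ["stw:12072-1", "stw:11322-2"]])
# p == r == f1 == 0.75
```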

[Figure 3. KINDEX parameter tuning (LIMBO): F1 over the Spotlight confidence parameter (0–1) for the text inputs title, title+desc and desc.]

[Figure 4. KINDEX parameter tuning (Econstor): F1 over the Spotlight confidence parameter (0–1) for the text inputs title, key and title+key.]

Figures 3 and 4 depict the results of the parameter tuning tests. Surprisingly, the diagrams indicate that lower to medium-level confidence values produce higher accuracy scores. This might be explained by the fact that both of the evaluated catalogues are rather domain-specific and thus require narrowly defined topic descriptions that can have different meanings in other domains. For the LIMBO catalogue, it also seems to be the case that more input data (title+desc) leads to better F1 scores. This is why this specific parameter configuration has been used in subsequent tests. For Econstor, on the other hand, the text-based keywords (key) achieved the best performance results for varying confidence parameters.

Upon having determined the best parameter setting, we conducted a final analysis in which we evaluated how the KINDEX approach performs in comparison to a naive lexical matching of Spotlight surface forms, as measured by precision, recall and F1 scores. Tables 1 and 2 show the final evaluation results of the KINDEX approach in the different use cases. Figures in bold mark the best performing score for each metric and indexing approach. Overall, the evaluations show that a combination of lexical and identity matching always achieves better results than simply identifying subject descriptors based on their labels. We performed subsequent statistical tests to seek further validation for this hypothesis. For the LIMBO data, paired t-tests confirmed a significant improvement through KINDEX in comparison to naive lexical matching for each performance metric (p < 0.001), while for Econstor the same could only be proven for recall scores (p < 0.07).

During the simulation runs, it was also investigated in which sequence the individual lookup steps (i.e., lexical, Wikidata- or DBpedia-based identity lookup) should preferably be processed by the KINDEX engine. The columns with the heading Indexing Approach in Tables 1 and 2 list the different priority rules, where the order of the lookup steps (i.e., identity or lexical and Wikidata or DBpedia) denotes the corresponding processing sequence. Given the evaluation results, it seems to be the case that indexing approaches function best when they are tailored to the specific use case. While for the LIMBO catalogue identity matching should have priority over lexical matching and Wikidata-based identity links should be detected prior to DBpedia links, the opposite is true for the Econstor use case. In the latter scenario, lexical matching and DBpedia look-ups are to be processed first in order to boost accuracy scores. The reasons for these differences might be that two DBpedia Spotlight instances were applied (a German Spotlight instance for LIMBO and an English instance for Econstor) and that the topical domains might be covered differently by the two large-scale cross-domain knowledge graphs DBpedia and Wikidata [40].

Additionally, KINDEX achieved varying levels of accuracy in the two scenarios. While LIMBO results on average reached fairly high accuracy scores, the performance was not as good for the Econstor scenario. For the latter use case, however, it has to be noted that the scores were mostly as good as the best performing ML-based lexical matching approach (i.e., a MAUI adaptation for Econstor) and only slightly weaker than meta-learning strategies that fuse the results of different base learners [3].¹⁵
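The paired t-tests reported above compare per-record scores of two indexing approaches; the statistic itself is simple to state. The score lists below are invented for illustration and do not reproduce the actual evaluation data:

```python
# Sketch of the paired t-test statistic used in the significance analysis:
# t = mean(d) / (sd(d) / sqrt(n)) over per-record score differences d,
# with n - 1 degrees of freedom. The sample scores are made up.
import math

def paired_t_statistic(scores_a, scores_b):
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n)

t = paired_t_statistic([0.5, 0.6, 0.7], [0.4, 0.4, 0.4])  # invented F1 scores
```

The resulting t value is then compared against the t distribution with n − 1 degrees of freedom to obtain the reported p-values.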

Table 1. mCLOUD – Evaluation results

Indexing Approach                      Precision  Recall  F1
Naive Lexical                          0.375      0.669   0.480
Identity (DBpedia)                     0.442      0.505   0.471
Identity (Wikidata)                    0.509      0.740   0.603
Identity+Lexical (Wikidata)            0.523      0.806   0.635
Identity+Lexical (DBpedia)             0.503      0.752   0.603
Identity+Lexical (Wikidata+DBpedia)    0.510      0.792   0.620
Lexical+Identity (DBpedia)             0.430      0.489   0.457
Lexical+Identity (Wikidata)            0.418      0.654   0.510
Lexical+Identity (Wikidata+DBpedia)    0.423      0.660   0.516

Table 2. Econstor – Evaluation results

Indexing Approach                      Precision  Recall  F1
Naive Lexical                          0.357      0.242   0.288
Identity (DBpedia)                     0.145      0.083   0.105
Identity (Wikidata)                    0.194      0.070   0.103
Identity+Lexical (Wikidata)            0.360      0.258   0.300
Identity+Lexical (DBpedia)             0.336      0.275   0.302
Identity+Lexical (Wikidata+DBpedia)    0.338      0.276   0.304
Lexical+Identity (Wikidata)            0.306      0.224   0.259
Lexical+Identity (DBpedia)             0.353      0.273   0.307
Lexical+Identity (DBpedia+Wikidata)    0.363      0.279   0.315

¹⁵ Please note that these findings give only an indication, since the evaluations could not be run on the same sample.

6. Conclusion

Given the growing amount of digital and analogue content in cultural heritage institutions, high-quality metadata descriptions are more important than ever to facilitate personalised retrieval access to valuable resources. However, because of the content overload and the limited personnel in information providing institutions, manual indexing will often not be feasible. Hence, investigations into methods for the automatic generation of KOS descriptor annotations are required. To this date, most of the few existing approaches focus on the application of machine learning techniques. While this is an important route for further investigations, we argue that cultural heritage institutions might also profit from harnessing the already available cross-concordance links for automatic subject indexing.

Our method generates KOS annotations by combining lexical and identity matching, which is facilitated by the web of data. The evaluation results demonstrate that our KINDEX approach reaches accuracy scores that are competitive with some state-of-the-art ML-enabled methods. Hence, it can serve as a base method whose results are fused with the results of other indexing approaches. Additionally, KINDEX can be applied as a stand-alone tool that offers a viable alternative method for automated subject indexing when the application of ML approaches is not feasible due to missing data, hardware infrastructures or human resources. While it is true that there is also some performance tuning involved in using our method, KINDEX is multilingual and applicable to a large number of knowledge organisation systems almost out of the box, while being independent of training data at the same time. Thus, in addition to cultural heritage institutions, it might also be an interesting tool for researchers to help them annotate their publications with KOS descriptors in order to facilitate an open research infrastructure that relies on rich metadata descriptions [41]. To this end, we plan to offer a web service in the near future that annotates text from multiple languages with descriptors from various thesauri, thus leveraging identity links for subject indexing in such a manner that it can be used by a larger audience.

References

[1] A. Miles and S. Bechhofer, SKOS Simple Knowledge Organization System reference, W3C Recommendation 18 (2009), W3C.
[2] W.J. Hutchins, The concept of 'aboutness' in subject indexing, in: Aslib Proceedings, Vol. 30, MCB UP Ltd, 1978, pp. 172–181.
[3] M. Toepfer and C. Seifert, Descriptor-invariant fusion architectures for automatic subject indexing, in: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2017, pp. 1–10.
[4] O. Medelyan, E. Frank and I.H. Witten, Human-competitive tagging using automatic keyphrase extraction, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, Association for Computational Linguistics, 2009, pp. 1318–1327.
[5] L. Wenige, The application of Linked Data resources for Library Recommender Systems, Theorie, Semantik und Organisation von Wissen 13 (2017), 212.
[6] M. Schmachtenberg, C. Bizer and H. Paulheim, Adoption of the linked data best practices in different topical domains, in: International Semantic Web Conference, Springer, 2014, pp. 245–260.

[7] P.N. Mendes, M. Jakob, A. García-Silva and C. Bizer, DBpedia Spotlight: shedding light on the web of documents, in: Proceedings of the 7th International Conference on Semantic Systems, ACM, 2011, pp. 1–8.
[8] T. Gross and A.G. Taylor, What have we got to lose? The effect of controlled vocabulary on keyword searching results, College & Research Libraries (2005).
[9] A.G. Taylor, On the subject of subjects, The Journal of Academic Librarianship 21(6) (1995), 484–491.
[10] U. Junger, Quo vadis Inhaltserschliessung der Deutschen Nationalbibliothek? Herausforderungen und Perspektiven, o-bib. Das offene Bibliotheksjournal / Herausgeber VDB 2(1) (2015), 15–26.
[11] U. Junger, Automation first – the subject cataloguing policy of the Deutsche Nationalbibliothek (2017).
[12] D. Nadeau and S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30(1) (2007), 3–26.
[13] J. Voss, Tagging, folksonomy & co – renaissance of manual indexing?, arXiv preprint cs/0701072 (2007).
[14] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme and G. Stumme, Tag recommendations in folksonomies, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2007, pp. 506–514.
[15] B. Sigurbjörnsson and R. Van Zwol, Flickr tag recommendation based on collective knowledge, in: Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 327–336.
[16] O. Medelyan, V. Perrone and I.H. Witten, Subject metadata support powered by Maui, in: JCDL'10, ACM, 2010, pp. 407–408.
[17] O. Medelyan and I.H. Witten, Thesaurus based automatic keyphrase indexing, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'06), IEEE, 2006, pp. 296–297.
[18] Y.-M. Chung, W.M. Pottenger and B.R. Schatz, Automatic subject indexing using an associative neural network, in: ACM DL, 1998, pp. 59–68.
[19] B. Lauser and A. Hotho, Automatic multi-label subject indexing in a multilingual environment, in: International Conference on Theory and Practice of Digital Libraries, Springer, 2003, pp. 140–151.
[20] W.J. Wilbur and W. Kim, Stochastic gradient descent and the prediction of MeSH for PubMed records, in: AMIA Annual Symposium Proceedings, Vol. 2014, American Medical Informatics Association, 2014, p. 1198.
[21] K. Golub, Automatic subject indexing of text (2019).
[22] K. Golub, D. Soergel, G. Buchanan, D. Tudhope, M. Lykke and D. Hiom, A framework for evaluating automatic indexing or classification in the context of retrieval, Journal of the Association for Information Science and Technology 67(1) (2016), 3–16.
[23] J. Neubert and K. Tochtermann, Linked library data: offering a backbone for the semantic web, in: Knowledge Technology Week, Springer, 2011, pp. 37–45.
[24] L. Wenige and J. Ruhland, Retrieval by recommendation: using LOD technologies to improve digital library search, International Journal on Digital Libraries 19(2–3) (2018), 253–269.
[25] E. Summers, A. Isaac, C. Redding and D. Krech, LCSH, SKOS and linked data (2008).
[26] R. Bennett, C. Hengel-Dittrich, E.T. O'Neill and B.B. Tillett, VIAF (Virtual International Authority File): Linking Die Deutsche Bibliothek and Library of Congress name authority files, in: World Library and Information Congress: 72nd IFLA General Conference and Council, 2006.
[27] C. Caracciolo, A. Stellato, A. Morshed, G. Johannsen, S. Rajbhandari, Y. Jaques and J. Keizer, The AGROVOC linked dataset, Semantic Web 4(3) (2013), 341–348.
[28] J. Neubert, Bringing the "Thesaurus for Economics" on to the Web of Linked Data, LDOW 25964 (2009).
[29] J. Waeber and A. Ledl, A Semantic Web SKOS Vocabulary Service for Open Knowledge Organization Systems, in: Research Conference on Metadata and Semantics Research, Springer, 2018, pp. 3–12.
[30] A.O. Kempf and T. Rebholz, 'Mixed Methods' Indexing: Building-Up a Multi-Level Infrastructure for Subject Indexing (2017).
[31] L. Wenige, G. Berger and J. Ruhland, SKOS-based concept expansion for LOD-enabled recommender systems, in: Research Conference on Metadata and Semantics Research, Springer, 2018, pp. 101–112.
[32] A. Hajra and K. Tochtermann, Linking science: approaches for linking scientific publications across different LOD repositories, International Journal of Metadata, Semantics and Ontologies 12(2–3) (2017), 124–141.
[33] C. Stadler, L. Wenige, S. Tramp, K. Junghanns and M. Martin, RDF-based Deployment Pipelining for Efficient Dataset Release Management (2019).
[34] A. Latif, T. Borst and K. Tochtermann, Exposing data from an open access repository for economics as linked data, D-Lib Magazine 20(9/10) (2014).
[35] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer et al., DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6(2) (2015), 167–195.
[36] D. Vrandečić and M. Krötzsch, Wikidata: a free collaborative knowledge base (2014).
[37] J. Frey, M. Hofer, D. Obraczka, J. Lehmann and S. Hellmann, DBpedia FlexiFusion – The Best of Wikipedia > Wikidata > Your Data.
[38] F. Steeg, A. Pohl and P. Christoph, lobid-gnd – Eine Schnittstelle zur Gemeinsamen Normdatei für Mensch und Maschine, Informationspraxis 5(1) (2019).
[39] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys (CSUR) 34(1) (2002), 1–47.
[40] M. Färber and A. Rettinger, Which Knowledge Graph Is Best for Me?, arXiv preprint arXiv:1809.11099 (2018).
[41] S. Vahdati, N. Arndt, S. Auer and C. Lange, OpenResearch: collaborative management of scholarly communication metadata, in: European Knowledge Acquisition Workshop, Springer, 2016, pp. 778–793.
[42] M. Tanaka, S. Nakazono, H. Matsuno, H. Tsujimoto, Y. Kitamura and S. Miyano, Intelligent system for topic survey in MEDLINE by keyword recommendation and learning text characteristics, Genome Informatics 11 (2000), 73–82.
[43] K.S. Hasan and V. Ng, Automatic keyphrase extraction: A survey of the state of the art, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1262–1273.
[44] P.D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval 2(4) (2000), 303–336.

[45] Q. Zhang, Y. Wang, Y. Gong and X. Huang, Keyphrase extraction using deep recurrent neural networks on Twitter, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 836–845.
[46] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin and C.G. Nevill-Manning, KEA: Practical automated keyphrase extraction, in: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI Global, 2005, pp. 129–152.
[47] J. Hannemann and J. Kett, Linked data for libraries, in: Proc. of the World Library and Information Congress of the Int'l Federation of Library Associations and Institutions (IFLA), 2010.
[48] Y. Ding, M. Korotkiy, B. Omelayenko, V. Kartseva, V. Zykov, M. Klein, E. Schulten and D. Fensel, GoldenBullet: Automated classification of product data in e-commerce, in: Proceedings of the 5th International Conference on Business Information Systems, 2002.
[49] A. Latif, A. Scherp and K. Tochtermann, LOD for library science: benefits of applying linked open data in the digital library setting, KI – Künstliche Intelligenz 30(2) (2016), 149–157.