Semantic Web 1 (2016) 1–5. IOS Press.

Facilitating Data-Flows at a Global Publisher using the Linked Data Stack


Christian Dirschl a, Katja Eck a, Jens Lehmann b,*, Lorenz Bühmann b, Sören Auer c, Bert Van Nuffelen d
a Wolters Kluwer Deutschland GmbH, 85716 Unterschleissheim, Germany. E-mail: [email protected], [email protected]
b University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany. E-mail: {lastname}@informatik.uni-leipzig.de
c University of Bonn, Computer Science, Enterprise Information Systems & Fraunhofer IAIS, Bonn, Germany. E-mail: [email protected]
d TenForce, Havenkant 38, 3000 Leuven, Belgium. E-mail: [email protected]

Abstract. The publishing industry is on the verge of an era in which professional customers of publishing products in particular are no longer primarily interested in comprehensive books and journals, i.e. traditional publishing products, but in small, structured information pieces delivered just-in-time as a certain information need arises. This requires a transformation of publishing workflows towards the production of much richer metadata for fine-grained and highly interlinked pieces of content. Linked Data can play a crucial role in this transition. The Linked Data Stack is an integrated distribution of aligned tools which support the whole lifecycle of Linked Data from extraction and authoring/creation via enrichment, interlinking and fusing to maintenance. In this application paper, we describe a real-world usage scenario of the Linked Data Stack at a global publishing company. We give an overview of the Linked Data Stack and the underlying lifecycle of Linked Data, describe data-flows and usage scenarios at a publisher, and then show how the stack supports those scenarios.

Keywords: Publishing, Linked Open Data, Linked Data Stack

1. Introduction

In times of tablets and a growing number of other electronic devices, publishers are more and more forced to move towards electronic publishing. Possibilities for consuming information are changing and so do the expectations of customers. For instance, digital work environments offer new functionalities (e.g. non-linear storytelling, inclusion of background information, delivery of just-in-time, context-specific and personalized information), and therefore the internal workflow processes, especially for those publishers targeting professional audiences (e.g. legal, tax, accounting professionals), have to be adapted in order to provide this high-value content.

Let us motivate our work in more detail with a real-world user scenario: Gerhard, an accounting professional, works for TWC, a leading tax and accounting consultancy. He is responsible for certifying the value-added-tax (VAT) returns of the Europe-wide operating food retailer Aldo (a customer of the tax and accounting consultancy). For Gerhard it is crucial to track all changes of VAT regulations in all the countries Aldo operates in, which might include countries in the Euro zone, other EU member states and a few neighboring countries. As a result, Gerhard needs to be informed whenever a law related to VAT regulations is changed in one of these countries. In addition, he has to track court decisions of all cases related to VAT regulations. Because Aldo has to update its ERP (enterprise resource planning) systems when VAT regulations change, Gerhard wants to proactively notify Aldo's IT department as early as possible when major regulatory changes (e.g. an increase of the VAT for certain products in a certain country) are planned. Currently, Gerhard and his team have to track a vast number of textual sources provided to TWC by a global publisher and several smaller regional publishers specialized in the legal, accounting and tax domains for relevant legislation and regulatory changes. In the future, the global publisher aims to deliver to Gerhard and his colleagues at TWC much more personalized and context-specific information pieces fulfilling exactly their information needs. Gerhard aims to register the sources (legislation in certain countries, court decisions and parliamentary initiatives) he wants to track and to filter information according to entries in a taxonomy related to VAT. Subsequently, Gerhard wants to be notified whenever a certain piece of information related to his particular information need is published by one of the identified sources. He wants to easily compare the changes applied to a certain law, for example, and be able to explore specifically related court decisions.

*Corresponding author. E-mail: [email protected].
1570-0844/16/$27.50 © 2016 – IOS Press and the authors. All rights reserved.

Such a scenario requires interaction and data exchange between different companies as well as several intelligent systems to process data and possibly enrich and interlink it. Single tools cannot solve those problems in isolation at large scale. Several interoperable components based on standards are required to achieve this. In this application report, we describe how Linked Data and tools from the Linked Data Stack can address those challenges. This application report builds on [1]. While the previous article focused on a detailed characterisation of the Linked Data Stack, this application report focuses on a detailed description of the usage scenarios at a global publisher.

In the past, publishers were a driving force behind document representation formats such as SGML and XML. Using that technology, usually a document-centric content management system is established. Today this technology is still a cornerstone of many publishers; however, the aforementioned rise in electronic publishing and electronic usage of documents challenges (the document-centric usage of) this technology. Electronic publishing requires separation of the core content (represented as XML) and metadata, smaller partitioning of the content, and the ability to include live external information.

Semantic technologies can help to support these processes. Publishers still deal with large amounts of unstructured, textual content. Knowledge extraction approaches can help to annotate and enrich such content. Once formalized knowledge (e.g. adhering to the RDF data model) is extracted, it needs to be stored, managed and made available for querying. Links to other knowledge bases, either from the publisher itself, from other content providers or from the Web of Data, need to be established. We can apply reasoning and machine learning techniques to enrich the knowledge bases with ontological structures. Since the original documents might change (like a new version of a law), we need processes for maintaining and further developing the extracted knowledge. Finally, semantic search, exploration and visualization techniques can help to gain new insights from the semantically represented content. The Linked Data Stack provides specialized tools for each of these lifecycle stages and can consequently be used to facilitate the semantic content processing workflows at a publisher.

The application report is structured as follows: In Section 2, we present the Linked Data Stack architecture at a high level. After that, the vision of the Linked Data lifecycle is explained in Section 3. The phases of this lifecycle are closely related to the data-flows at global publishers, which are described in Section 4. Based on this, in Section 5, we describe how the Linked Data Stack was applied at a particular publisher, Wolters Kluwer. We follow up with a brief summary of related work in Section 6 and plans for future work in Section 7.

2. Overview of the Linked Data Stack

The description of the Linked Data Stack (formerly known as the LOD2 Stack) and the Linked Data lifecycle (see Figure 1) are extensions of earlier work in [1] and [3]. The Linked Data Stack is an integrated distribution of components which support the whole lifecycle of Linked Data from extraction and authoring/creation via enrichment, interlinking and fusing to exploration. The Linked Data Stack is available at http://stack.linkeddata.org. The major components of the Linked Data Stack are open source, facilitating a wide deployment potential.

Through an iterative software development approach, the stack contributors aim at ensuring that the stack fulfills a broad set of user requirements and thus facilitates the transition to a Web of Data. The stack is designed to be versatile: by exploiting the Linked Data (RDF, SPARQL, OWL) paradigm as the main application interface, the plugging in of alternative (third-party) implementations is enabled. In order to fulfill these requirements, the architecture of the Linked Data Stack is based on three pillars:

1. Software integration and deployment using the Debian packaging system: The Debian packaging system is one of the most widely used packaging and deployment infrastructures. It facilitates packaging and integration as well as maintenance of dependencies between the various Linked Data Stack components. Using the Debian system also facilitates the deployment of the Linked Data Stack on individual servers, cloud or virtualization infrastructures.

2. The use of RDF as the data representation format and SPARQL as the data exchange mechanism creates a uniform and universal knowledge exchange bus. In its simplest form the bus is a central local RDF store; however, due to the built-in distributed nature of RDF and SPARQL, more complex setups are easily realized. All components of the Linked Data Stack access the knowledge via the bus and write their findings back to it. In order for other tools to make sense of the output of a certain component, it is important to exploit vocabularies. Vocabularies (or ontologies) provide semantics to the data for the information domain they cover. Using vocabularies allows information exchange beyond raw data.

3. Integration of the Linked Data Stack user interfaces based on REST APIs: The available components form a collection of user interfaces that are technologically and methodologically quite heterogeneous. The Linked Data Stack does not aim to resolve this heterogeneity, since each tool's UI is specifically tailored for a certain purpose. Instead, we develop common entry points for accessing and managing the data via REST APIs that offer dedicated support for one task, e.g. selecting a graph or validating a dataset. This approach fosters both the exploration of new ideas and functionalities and the reuse of already developed functionalities in other applications.

These three pillars comprise the methodological and technological framework for integrating the very heterogeneous Linked Data Stack components into a consistent framework.

Fig. 1. Stages of the Linked Data life-cycle supported by the Linked Data Stack.

3. The Linked Data Lifecycle

The different stages of the Linked Data life-cycle, depicted in Figure 1, include:

1. Storage: Efficient RDF data management techniques fulfilling the requirements of global publishers comprise column-store technology, dynamic query optimization, adaptive caching of joins, optimized graph processing and cluster/cloud scalability.

2. Authoring: The Linked Data Stack facilitates the authoring of rich semantic knowledge bases by leveraging semantic wiki technology, the WYSIWYM paradigm (What You See Is What You Mean) and distributed social, semantic collaboration and networking techniques.

3. Interlinking: Creating and maintaining links in a (semi-)automated fashion is still a major challenge and crucial for establishing coherence and facilitating data integration, as outlined in the publishing usage scenario in the introduction. We seek linking approaches yielding high precision and recall, which configure themselves automatically or with end-user feedback.

4. Enrichment: Linked Data on the Web is mainly raw instance data. For data integration, fusion, search and many other applications, however, we need this raw instance data to be classified into taxonomies. In the Linked Data Stack, semi-automatic components for this purpose are included.

5. Quality: The quality of content on the Data Web varies, just as the quality of content on the document web varies. The Linked Data Stack comprises techniques for assessing quality based on characteristics such as provenance, context, coverage or structure. The goal in our application scenarios is to assess whether data sources for a publisher are complete, consistent, reliable, etc.

6. Evolution/Repair: Data on the Web is dynamic. We need to facilitate the evolution of data while keeping things stable. Changes and modifications to knowledge bases, vocabularies and ontologies should be transparent and observable. The Linked Data Stack comprises methods to spot problems in knowledge bases and to automatically suggest repair strategies.

7. Search/Browsing/Exploration: For many users, the Data Web is still invisible below the surface. Therefore, search, browsing, exploration and visualization techniques for different kinds of Linked Data (i.e. spatial, temporal, statistical) are developed, making the Data Web sensible for real users.

We refer to [11] for detailed explanations and specific examples for each of those stages. These lifecycle stages, however, should not be tackled in isolation, but by investigating methods which facilitate a mutual fertilization of the approaches developed to solve these challenges. Examples of such mutual fertilization (synergies) between approaches include:

1. The detection of mappings on the schema level, for example, will directly affect instance-level matching and vice versa.

2. Ontology schema mismatches between knowledge bases can be compensated for by learning which concepts of one knowledge base are equivalent to which concepts of another knowledge base.

3. Feedback and input from end users (e.g. regarding instance or schema level mappings) can be taken as training input (i.e. as positive or negative examples) for machine learning techniques in order to perform inductive reasoning on larger knowledge bases, whose results can again be assessed by end users for iterative refinement.

4. Semantically enriched knowledge bases improve the detection of inconsistencies and modelling problems, which in turn results in benefits for interlinking, fusion and classification.

5. The querying performance of RDF data management directly affects all other components, and the nature of queries issued by the components affects RDF data management.

As a result of such interdependence, the Linked Data Stack leads to the establishment of an improvement cycle for knowledge bases on the Data Web. The improvement of a knowledge base with regard to one aspect (e.g. a new alignment with another interlinking hub) triggers a number of possible further improvements (e.g. additional instance matches).

Linked Data as a technological foundation is a firm basis on which to incrementally grow the interaction pattern. One can start with the minimal data flow sufficient to capture the core data management problem. Later on, the data flow can be extended with new interaction points, creating a higher-value data chain. The easy extensibility of data flows allows incorporating recently developed tools and techniques, so that at any time a data flow is served by the best component. For instance, a typical first step of publishers into Linked Data starts with the representation of their controlled vocabularies as SKOS vocabularies and updating their content to use the concept URIs instead of internal identifiers. In a second step, the data flow can be extended with automated annotation of the content using those SKOS vocabularies, and so on. This approach allows tackling the data value chain in a non-disruptive way, enhancing it on a by-need basis.

4. Data-Flows at Global Publishers

Wolters Kluwer Germany (WKG) is an information service provider in the legal, business and tax domain. Business units of WKG are divided into "legal and regulatory" as well as "tax and accounting" (see Fig. 2). The business unit "legal and regulatory" serves mainly legal professionals in several different legal domains with content, software and services. WKG is headquartered in Cologne and has about 1,000 employees in 20 offices located across Germany. Wolters Kluwer Germany is part of Wolters Kluwer n.v., a global information services company with customers in the areas of legal, business, tax, accounting, finance, audit, risk, compliance and healthcare. In 2011, the company had annual revenues of 3.4 billion Euro and 19,000 employees worldwide, with customers in over 150 countries across Europe, North America, Asia Pacific and Latin America. Wolters Kluwer is headquartered in Alphen aan den Rijn, the Netherlands. Its shares are quoted on Euronext Amsterdam (WKL) and are included in the AEX and Euronext 100 indices.

Fig. 2. Wolters Kluwer business units and products.

Wolters Kluwer's strategy has three main focuses:

1. To deliver value at the point-of-use by helping customers to manage complex transactions to produce tangible results;
2. To expand solutions across whole processes, customers and networks;
3. To raise innovation and effectiveness through global capabilities.

Assets like authoritative content, domain expertise and integrated workflow tools are the basis of the strategic direction.

The application of semantic technologies at WKG is piloted within the LOD2 EU project, which started in 2010, but has meanwhile gone beyond prototypical applications within a research project. The WKG content supply chain at the beginning of the LOD2 project was quite typical for a publishing business (see Fig. 3). It starts with the process of content acquisition. Legal content is in general obtained from different sources like public institutions (courts, ministries) or authors and is then internally refined and consolidated by domain experts. The resulting authoritative content represents a valuable asset for specialized publishers like Wolters Kluwer but requires, in its traditional form, a lot of resources.

Fig. 3. Content supply chain at Wolters Kluwer.

Afterwards, content is mostly manually classified, enriched and linked in further workflow steps by domain experts and technical writers. These actions generate significant additional value for contents and their usage in different media and platforms. Subsequently, product managers collect content for their products and compose or bundle it individually. But as the previous process of enhancing the contents is quite complex and labour-intensive, the selection and bundling is often limited to the most obvious and necessary actions. Content is therefore mainly provided via one distinct product without additional informative value to the customers. This fact impacts the sales process, where only products are offered that are barely connected with each other or offered with external content. So in order to provide a better customer service, the core content supply chain has to be improved.

5. Usage of the Linked Data Stack at WKG

When WKG, in particular the first authors of this application report, first investigated the paradigm of Linked Data, the respective lifecycle and the Linked Data Stack supporting this lifecycle, we concluded the following: the Linked Data lifecycle is highly comparable to the existing workflows at Wolters Kluwer as an information provider, and the Linked Data Stack offers relevant functionality and technology complementary to the existing content management and production environments.

It was decided to explore the impact of Linked Data by setting up pilots using the Linked Data Stack. Each of the pilots tackled a crucial functionality or new opportunity. Using this approach, experience, usage and development guidelines, best practices and current restrictions of the components available in the Linked Data Stack were collected. The pilots not only challenged the Linked Data Stack, but also the WKG business managers. They provided requirements for challenges that could hardly be solved with the existing technological infrastructure. Some of the major business requirements were:

1. Processing and enriching mass content from partner publishing houses into our products without having to handle this content within the standard content supply chain. This included the necessity to separate the source text from the (meta)data and to store the metadata in a dedicated repository. This also meant relying as much as possible on unified controlled vocabularies, in order to ensure consistency across content sources.

2. Extension and consolidation of our controlled vocabularies. The extended usage of post-search filters and typed auto-suggest functionalities required working extensively on controlled vocabularies. Once they were consolidated, we were able to offer better product features, but were also able to connect these vocabularies to external data sources and enrich our data even more.

3. Inclusion of product-specific variants of metadata associated with our documents: The transition from print to electronic content led to the requirement that elements like "title" or some
structural information needed to be different in independent from the textual sources, in order to the different media, even different in different ensure flexibility. 4. Enabling a vertical view on content, based on a electronic products - like on a desktop customer group specific angle (e.g. "law office" vs. a mobile application. This should be handled vs. "HR department in a company"). The diversi- Dirschl et al. / Facilitating WKD Data Flows using the Linked Data Stack 7

fication of electronic products implied the neces- wanted to represent using controlled vocabularies and sity to add quite different clusters of information metadata that is not explicitly controlled. In case there connected to one piece of text, e.g. a law. This is a controlled vocabulary, the data extraction chain has led even to the situation that the main complexity to be extended with a reconsiliation step. The provided in the content processing was based on metadata information sometimes only covers the label instead of handling and not on text handling anymore. the identifier of the concept. This situation has been dealt with by applying a reconsilation step in which the Already in an early stage of the LOD2 project it label is replaced with a proper reference to the target- was observed that project results should immediately ted identifier. In case the reconsilation service did pro- be evaluated by the WKG internal technology and con- vided none or multiple answers the orginal extracted tent experts. By taking up the findings and leverage the data was kept in the resulting data. By quering the project results in the operational planning, the resource knowledge base afterwarts these cases were retrieved spending in lecacy technology could be reduced. and used to improve the controlled vocabularies or to In general, there were two approaches for using correct the legal documents. This investigation process tools from the Linked Data stack in an industrial en- naturally drives to higher quality data. The manage- vironment: Taking the toolset and integrating it into ment of the controlled vocabularies and the associated our internal processes as open source software vs. ap- reconsiliation service are provided by the PoolParty proaching the vendor of the tool in order to license solution. an enhanced commercial version. 
Since WKG did not The assessment of candidates for controlled vocab- have any internal know-how on this technology, WKG ularies revealed that although they look natural candi- preferred to partner with commercial vendors when dates the creation of a taxonomy; the actual needed ef- available, which had the advantage that we were get- fort to create a correct disambiguated list was beyond ting professional support and the assurance that main- the project budget. For instance the author list showed tenance and further development was guaranteed (e.g. that there are several authors with the same name and Virtuoso and PoolParty). In all other cases pure open- that they have to be disambiguated per document. source was chosen, e.g. for linking we used the SILK The convertion into properly managed taxonomies framework. in PoolParty for those lists that were selected, raised Like each data management project, the work in the the fundamental Linked Data concept of persistent beginning was to extract from the provided raw le- identifiers to be considered by WKG. Without a URI gal content the required information into RDF format. strategy in place sustainable information management Each provided XML document was converted into a is impossible. knowledge base containing the metadata (e.g. title, au- The outcome of the extraction process: the orig- thor, reference number, ...) and structural information inal XML documents and the associated extracted (e.g. chapter, sections, paragraphs, ...). The actual legal metadata as RDF are published using Virtuoso on a clause texts were initially not included, however for a SPARQL endpoint. The PoolParty publishing features later excercise the extraction process has been updated have been used to publish the controlled vocabularies to include them too. as SKOS on another SPARQL endpoint. For quality assurance en focussing reasons, it was Based on the extraction work, WKG decided to inte- decided to iterative refine the extraction rules. 
The pro- grated Virtuoso and PoolParty a metadata management cess of transformation rules writing, processing doc- solution within our operational system. This metadata uments, reviewing and step-wise extensions is labor management solution realizes the main requirement intensive work that must be supported with adaquate for further exploitation of the WKG legal content be- tooling. In our case, the provided amount of legal doc- yond the current practice: namely a central single stor- uments was sizable. Because of the document oriented age of metadata. This central data storage does not in- conversion approach, the scaling has been built around clude only the metadata already available in the legal this dimension. The tool chain consists of the Valiant documents but also supporting information from inter- tool parallelizing the processing of XML documents to nal and external sources. For instance, product varia- RDF, in combination with the bulk loading of Virtu- tions or publication status information can be stored. oso. We knew from our previous experience, that it was A lesson learned from this transformation task was not possible to implement one stable and fixed rela- that we have to differentiate between metadata that we tional data model, since the business requirements and 8 Dirschl et al. / Facilitating WKD Data Flows using the Linked Data Stack the technical developments concerning data and meta- on a prototype level and will gain business importance data were changing so rapidly, that one main feature of over time. One side effect of this development is that the application needed to be "data model flexibility". multilingual applications also gain importance. People Since there was no fixed schema necessary when using search in their native language for content published RDF, this requirement was easily met. in another language. 
Multilingual thesauri, which are already part of the above discussed infrastructure, play a central role in addressing this issue.

Despite the growing amount of available Linked Data, the absence of sufficient (public) machine-readable legal resources in many European countries led to the decision by WKG to publish legal resources. It served as an excellent marketing tool showing WKG's expertise, but more importantly it initiated discussions within the publishing industry as well as within the Linked Data community and public bodies. An outcome of this increased interest is that this work has been used to motivate the adoption of the SKOS standard as a recommended standard (comply-or-explain) for publishing taxonomies by the Dutch government.

We used the Linked Data Stack component OntoWiki as the user interface for the presentation and maintenance of the metadata. It offers search and browse capabilities and is inherently based on RDF. In contrast to other components, introducing this frontend into the WKG operational environment is more complicated because it affects the way of working throughout the entire company, especially for the content editors. Therefore, a different strategy has been chosen for transferring the experience to the existing editorial environment: it has been decided to re-use and adapt the existing interfaces to include the enhanced metadata assignment. This approach limits the impact on the editorial departments. For completely new tasks outside the existing work environment of the editorial staff, like adding relations (e.g. court names with geodata) or analyzing data (like popular legal topics in court cases over time), direct access to the metadata database is our preferred approach.

The legal publishing market is still a national market. Currently, cross-border offerings are only relevant in certain areas like intellectual property law. However, there are indications that this will change over time, especially within the European Union, where more and more legislation is passed at the European level. Having information available as machine-readable Linked Data gives publishers the possibility to more easily align and interconnect the different CMS systems used by the enterprise entities in different countries in order to generate a comprehensive offering. This is already tested.

One major characteristic of the legal domain is the fact that legal matters and processes are highly connected to each other. This is, for example, reflected by a large number of explicit relations between documents. The preservation and usage of these relationships was of key importance, for example, to implement proper search capabilities. RDF gave us the possibility to easily establish these relationships.

The creation and maintenance of domain-specific knowledge models required expertise and resources. In order to minimize this effort, one requirement was to be able to integrate external data sources in a controlled fashion as much as possible (e.g. DBpedia Live [10] information or publicly available domain thesauri). We used SILK and PoolParty for connecting our controlled vocabulary to external sources, and the generic SILK framework for other metadata interlinking. DBpedia data were meaningful and helpful for concrete use cases. However, some of the data (e.g. geolocations) had been transformed incorrectly into DBpedia and caused errors when used. These issues were communicated to the community and got resolved.

New usage scenarios for metadata were introduced, both from an internal process point of view as well as from a functionality point of view in our products. We needed to classify our documents according to legal domain structures. The knowledge represented in the metadata and their relations available in Virtuoso gave us the possibility to achieve the required classification quality. Another scenario was the qualified generation of an auto-suggest functionality, where the keywords shown to the user when typing a query come directly from our knowledge base and are therefore already normalized and prioritized.

Tool support for maintaining knowledge models is a very prominent requirement. Within the Linked Data Stack, the ontology enrichment and repair tool ORE provides support in this area. Our experiment's objective was to harmonize the schema and the instance data. Although some situations could be tackled, the experiment was too limited to determine how this kind of tool can be used in the domain of legal information. Further investigation is required.

To sum up, the objectives of deploying tools from the Linked Data Stack in WKG's operational systems were mainly targeted at making our internal

Dirschl et al. / Facilitating WKD Data Flows using the Linked Data Stack 9

Labour Law Thesaurus
  Description: covers all main areas of labour law, like the roles of employee and employer; legal aspects around labour contracts and dismissal; also co-determination and industrial action
  Concepts: 1,728
  Linked sources: Standard Thesaurus für Wirtschaft, ZBW (zbw.eu/stw/); DBpedia; TheSoz from Leibniz Gesellschaft für Sozialwissenschaften (www.gesis.org); Eurovoc
  URL: http://vocabulary.wolterskluwer.de/arbeitsrecht.html
  SPARQL endpoint: vocabulary.wolterskluwer.de/PoolParty/sparql/arbeitsrecht

Courts Thesaurus
  Description: structures German and European courts in a hierarchical fashion and includes e.g. address information or map visualization
  Concepts: 1,499
  Linked sources: DBpedia
  URL: http://vocabulary.wolterskluwer.de/court.html
  SPARQL endpoint: vocabulary.wolterskluwer.de/PoolParty/sparql/court

Licenses (both thesauri): the data is licensed under the 'Creative Commons Namensnennung 3.0 Deutschland (CC BY 3.0)' license; the data model is licensed under the 'ODbL' license; links to external sources are licensed under the 'CC0 1.0 Universal (CC0 1.0) Public Domain Dedication' license.

Table 1

Description of the Labour Law and Courts thesauri published by WKG.

content processes more flexible and efficient, but also targeted new features for WKG's electronic and software products. Once the technological basis was laid, new opportunities for further enhancements immediately showed up, so that this new part of our technical infrastructure has already gained importance, and there is no doubt that this process will continue. The tools currently used from the Linked Data Stack are well integrated with each other, which enables an efficient workflow and processing of information. URIs in PoolParty based on controlled vocabularies are used by Valiant for the content transformation process and stored in Virtuoso, so that they can easily be queried via SPARQL and displayed in OntoWiki. The usage of the Linked Data Stack as such has the major advantage that the installation is easy and issues around different versions not working smoothly with each other are avoided. All these are major advantages compared to the separate implementation of individual tools.

A major challenge, however, is not the new technology as such, but a smooth integration of this new paradigm into our existing infrastructure and a stepwise replacement of old processes with the new and enhanced ones.

To summarise the application from a software perspective, we list the Linked Data Stack components which were deployed in the content production and management processes at WKG (cf. Figure 4; general statistics in Table 3):

1. Valiant is an extraction/transformation tool that uses XSLT to transform XML documents into RDF. The tool can access data from the file system or a WebDAV repository. It outputs the resulting RDF to disk, to WebDAV or directly to an RDF store. For each input document a new graph is created.

2. Virtuoso [4] is an enterprise-grade multi-model data server. It delivers a platform-agnostic solution for data management, access, and integration. Virtuoso provides a fast quad store with a SPARQL endpoint and WebID support.

3. PoolParty [12] is a tool to create and maintain multilingual SKOS (Simple Knowledge Organisation System) thesauri, aiming to be easy to use for people without a Semantic Web background or special technical skills. PoolParty is written in Java and uses the SAIL API, whereby it can be utilized with various triple stores. Thesaurus management itself (viewing, creating and editing SKOS concepts and their relationships) can be performed in an AJAX front-end based on the Yahoo User Interface (YUI) library.

4. OntoWiki [2] is a PHP5/Zend-based Semantic Web application for collaborative knowledge base editing. It facilitates the visual presentation of a knowledge base as an information map, with different views of instance data. It enables intuitive authoring of semantic content, with an inline editing mode for editing RDF content, similar to WYSIWYG for text documents.
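The XSLT-based mapping performed by Valiant (item 1) can be sketched with a minimal stylesheet that turns a law document's header fields into RDF/XML. The XML element names and the `wkd:` properties below are invented for illustration and are not Valiant's actual mapping rules:

```xml
<!-- Sketch: map a hypothetical law document's header to RDF/XML.
     Element names (law, header/date, header/abbrev) and the
     wkd: properties are assumptions for the example. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:wkd="http://schema.wolterskluwer.de/">
  <xsl:template match="/law">
    <rdf:RDF>
      <rdf:Description rdf:about="{@uri}">
        <wkd:legislationDate>
          <xsl:value-of select="header/date"/>
        </wkd:legislationDate>
        <wkd:lawAbbreviation>
          <xsl:value-of select="header/abbrev"/>
        </wkd:lawAbbreviation>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
```

Each input document would yield one such RDF/XML result, which matches the one-graph-per-document behaviour described above.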

Fig. 4. Content management process at WKG and usage of Linked Data Stack components.
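The controlled vocabularies maintained in PoolParty and flowing through this process are plain SKOS. A concept from the Courts thesaurus could look roughly as follows in Turtle; the concept URI, labels and links are illustrative rather than taken from the published thesaurus:

```turtle
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix court: <http://vocabulary.wolterskluwer.de/court/> .

# Hypothetical concept for the Federal Labour Court.
court:BAG a skos:Concept ;
    skos:prefLabel  "Bundesarbeitsgericht"@de ;
    skos:altLabel   "BAG"@de ;
    skos:broader    court:Bundesgerichte ;
    skos:exactMatch <http://dbpedia.org/resource/Federal_Labour_Court> .
```

The `skos:exactMatch` statement is the kind of Courts-to-DBpedia interlink counted in Table 4.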

5. Silk [5] is a link discovery framework that supports data publishers in setting explicit links between two datasets. Using the declarative Silk Link Specification Language (Silk-LSL), developers can specify which types of RDF links should be discovered between data sources as well as which conditions data items must fulfill in order to be interlinked. These link conditions may combine various similarity metrics and can take the graph around a data item into account using an RDF path language.

6. ORE [8] (Ontology Repair and Enrichment) allows knowledge engineers to improve an OWL ontology or a SPARQL-endpoint-backed knowledge base by fixing logical errors and making suggestions for adding further axioms to it. ORE uses state-of-the-art methods to detect errors and highlight the most likely sources of the problems. To harmonise schema and data in the knowledge base, algorithms of the DL-Learner [6,7,9] framework are integrated.

An important asset of the Linked Data Stack is the fact that components interact and support specific steps of the data transformation lifecycle and management process. Figure 4 shows the already operationally implemented interplay of Linked Data Stack components in the processes of WKG.

A document, e.g. a law, is transformed from XML into RDF with the Valiant tool. The RDF (meta)data is stored in Virtuoso and managed using a customized version of OntoWiki (for document statistics see Table 2). Examples for such data in legislation are the "legislation date", "law abbreviation" or "legislation type". Within this environment, technical editors / domain experts can add further metadata or change existing metadata via editing features.

Controlled vocabularies are managed in PoolParty. Thesaurus concepts can be created, ordered and edited in the system. Both metadata management systems interact in such a way that, e.g., vocabularies from PoolParty can serve filtering functionalities for documents in OntoWiki. This way, we can browse in the navigation pane for a specific "legislation type", an "area of law", "authors" or "courts".

External sources are linked using the Silk framework to document metadata in OntoWiki on the one hand and to controlled vocabularies in PoolParty on the other hand (for external sources see Table 4). In the case of a law, there could be an enrichment of the

document by the area of law or the jurisdiction (i.e. geographical coverage) of a law. Vocabularies can be enriched by abstracts or synonyms. This linking enriches the documents and supports further functionality, especially in the case of controlled vocabularies.

Table 2
Frequently used document classes.

Top level document type URI                    | Document Count
all documents                                  | 674884
http://schema.wolterskluwer.de/aufsaetze       | 218186
http://schema.wolterskluwer.de/beitraege       | 24611
http://schema.wolterskluwer.de/entscheidungen  | 398398
http://schema.wolterskluwer.de/kommentare      | 26653
http://schema.wolterskluwer.de/lexikon         | 47
http://schema.wolterskluwer.de/rezensionen     | 99
http://schema.wolterskluwer.de/vorschrift      | 6890

Table 3
Overall statistics.

Measure  | Count
Triples  | 42969471
Graphs   | 830090
Entities | 6812303

Table 4
Link statistics.

Link Target                                      | Count
Extended Version of Courts Thesaurus to DBpedia  | 997
Labor Law Thesaurus to DBpedia                   | 776
Labor Law Thesaurus to TheSoz                    | 443
Labor Law Thesaurus to STW                       | 289
Labor Law Thesaurus to Eurovoc                   | 247
Legislations to DBpedia                          | 155
Authors to GND                                   | 941
WKG Labor Law Thesaurus to WKG subjects          | 70

For quality control, ORE and the included DL-Learner library are used. In a first step, ORE analyzes the existing instance data and suggests class descriptions for each class contained in the domain ontology. For instance, it suggested that each document of type "aufsatz" (German expression for article) has at least a title, an editor or creator, as well as information about the start and end position on the page. Based on the suggestions, a domain expert creates and refines the schema. Afterwards, the axioms in the schema are used as constraints and converted into SPARQL queries, which allows for the detection of instance data that does not fulfill the requirements, e.g. finding instances of "aufsatz" where the title is missing. Technically, expressive OWL schema axioms are used as constraints by employing a closed world assumption.

6. Related Work

There are other publishers that already use semantic technologies to enhance their own online publishing processes, although we believe that WKG applies a richer set of tools covering many Linked Data lifecycle stages. The New York Times, one of the largest American daily newspapers, has published its index as RDF since 2009. About 10,000 concepts such as persons, organizations, locations or descriptors that are used for tagging articles are published under a CC-BY license with respective metadata and links to DBpedia, to Freebase or into the Times API¹. The BBC, a British public service broadcasting statutory corporation, uses its dynamic semantic publishing architecture to enhance content processing workflows for its sport website. Contents are embedded in an ontological, domain-modelled information architecture that enables automated aggregation processes and publishing as well as re-purposing of interrelated content objects.²

¹ See http://data.nytimes.com/.
² See http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynamic_semantic.html.

7. Future Work

Ingestion of more data in general, and the inclusion of more external information in particular, is one major topic; but preparing this conglomerate of information for real-world usage in an industrial environment is also a major challenge. The latter covers e.g. issues around quality, governance and licensing. In detail, we will focus on the following tasks:

First of all, we will work on the further deployment and adoption of the Linked Data tool stack to further enhance our metadata sets. We will focus on repair and enhancement of the existing RDF schema, automatic classification of contents, entity extraction and improved functionality of the metadata management tools. We are especially aiming for improvements of metadata management workflows through enhanced usability or functionality in order to speed up semi-automatic processes.
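The conversion of schema axioms into SPARQL queries, as used for the closed-world quality checks described above, can be sketched as follows. The class URI is taken from Table 2 (used here as a class for "aufsatz" documents), while `dcterms:title` merely stands in for whatever title property the actual schema uses:

```sparql
# Closed-world check (sketch): find documents of type "aufsatz"
# that violate the axiom "every aufsatz has at least one title".
# dcterms:title is an assumed property, not WKG's actual schema.
PREFIX wkd:     <http://schema.wolterskluwer.de/>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?doc WHERE {
  ?doc a wkd:aufsaetze .
  FILTER NOT EXISTS { ?doc dcterms:title ?title }
}
```

Run against the Virtuoso endpoint, such a query returns exactly the instances that fail the constraint, which can then be routed back to the editorial staff.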


Fig. 5. Workflow illustrating how data was transformed from XML to RDF, enriched via Poolparty and Pebbles and then published in various user interfaces.
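The frontends at the end of this workflow retrieve their data from the triple store via SPARQL; the faceted browsing described earlier (e.g. filtering by "legislation type" or by court) boils down to queries of the following shape. The document class comes from Table 2, while the property names and the concept URI are hypothetical:

```sparql
# Sketch: list court decisions of one (hypothetical) court, newest first.
# wkd:court and wkd:legislationDate are assumed property names.
PREFIX wkd: <http://schema.wolterskluwer.de/>

SELECT ?doc ?date WHERE {
  ?doc a wkd:entscheidungen ;                              # document class from Table 2
       wkd:court <http://vocabulary.wolterskluwer.de/court/BAG> ;
       wkd:legislationDate ?date .
}
ORDER BY DESC(?date)
```

Because the filter values are thesaurus concepts rather than free-text strings, the same query works unchanged for every court in the Courts thesaurus.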

Concerning the interlinking processes, we plan to explore new external sources such as GND (http://datahub.io/dataset/dnb-gemeinsame-normdatei) for the enrichment of our datasets and to investigate new opportunities to improve the generation and quality of new vocabularies.

In order to improve the interface and authoring options, we will invest in enhancements to the publishing, search, browsing and exploration of metadata sets within metadata management.

The topic of licensing strategies, practices and recommendations is another important area, in particular when datasets from various sources licensed under different licensing schemes are fused.

Acknowledgements

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943) and GeoKnow (GA no. 318159).

References

[1] S. Auer, L. Bühmann, C. Dirschl, O. Erling, M. Hausenblas, R. Isele, J. Lehmann, M. Martin, P. N. Mendes, B. van Nuffelen, C. Stadler, S. Tramp, and H. Williams. Managing the life-cycle of linked data with the LOD2 stack. In Proceedings of the International Semantic Web Conference (ISWC 2012), 2012.
[2] S. Auer, S. Dietzold, and T. Riechert. OntoWiki - A Tool for Social, Semantic Collaboration. In 5th Int. Semantic Web Conference, ISWC 2006, volume 4273 of LNCS, pages 736–749. Springer, 2006.

[3] S. Auer and J. Lehmann. Making the web a data washing machine - creating knowledge out of interlinked data. Semantic Web Journal, 2010.
[4] O. Erling. Virtuoso, a hybrid RDBMS/graph column store. IEEE Data Eng. Bull., 35(1):3–8, 2012.
[5] A. Jentzsch, R. Isele, and C. Bizer. Silk - generating RDF links while publishing or consuming linked data. In ISWC 2010 Posters & Demo Track, volume 658. CEUR-WS.org, 2010.
[6] J. Lehmann. DL-Learner: learning concepts in description logics. Journal of Machine Learning Research (JMLR), 10:2639–2642, 2009.
[7] J. Lehmann, S. Auer, L. Bühmann, and S. Tramp. Class expression learning for ontology engineering. Journal of Web Semantics, 9:71–81, 2011.
[8] J. Lehmann and L. Bühmann. ORE - a tool for repairing and enriching knowledge bases. In 9th Int. Semantic Web Conference (ISWC 2010), LNCS. Springer, 2010.
[9] J. Lehmann and P. Hitzler. Concept learning in description logics using refinement operators. Machine Learning journal, 78(1-2):203–250, 2010.
[10] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the Live Extraction of Structured Data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[11] A.-C. N. Ngomo, S. Auer, J. Lehmann, and A. J. Zaveri. Introduction to linked data and its lifecycle on the web. In Reasoning Web. 2014.
[12] T. Schandl and A. Blumauer. PoolParty: SKOS thesaurus management utilizing linked data. In 7th Extended Semantic Web Conf., ESWC 2010, volume 6089 of LNCS, pages 421–425. Springer, 2010.