Information inference in scholarly communication infrastructures: the OpenAIREplus project experience

Procedia Computer Science 38 (2014) 92–99
10th Italian Research Conference on Digital Libraries, IRCDL 2014
doi: 10.1016/j.procs.2014.10.016

Mateusz Kobos a,*, Łukasz Bolikowski a, Marek Horst a, Paolo Manghi b, Natalia Manola c, Jochen Schirrwagen d

a Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
b Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Consiglio Nazionale delle Ricerche, via G. Moruzzi 1, 56124 Pisa, Italy
c Department of Informatics and Telecommunications, University of Athens, Panepistimiopolis, 15784 Ilissia, Athens, Greece
d Department of Library Technology and Knowledge Management, Bielefeld University, Universitätsstr. 25, 33615 Bielefeld, Germany

* Corresponding author. Tel.: +48 22-87-49-419. E-mail address: [email protected]

Abstract

The Information Inference Framework presented in this paper provides a general-purpose suite of tools enabling the definition and execution of flexible and reliable data processing workflows whose nodes offer application-specific processing capabilities. The IIF is designed for the purpose of processing big data, and it is implemented on top of Apache Hadoop-related technologies to cope with scalability and high-performance execution requirements. As a proof of concept we will describe how the framework is used to support linking and contextualization services in the context of the OpenAIRE infrastructure for scholarly communication.

© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the Scientific Committee of IRCDL 2014.

Keywords: OpenAIRE infrastructure; data processing system; data mining; text mining; big data

1. Introduction

OpenAIREplus [1,2] delivers a scholarly communication infrastructure† [3] devised to populate a graph-shaped information space of interlinked metadata objects describing publications, datasets, persons, projects, and organizations by collecting and aggregating such objects from publication repositories, dataset repositories, and Current Research Information Systems. The infrastructure supports scientists by providing tools for dissemination and discovery of research results, as well as funding bodies and organizations, by providing tools for measuring and refining funding investments in terms of their research impact. In particular, the infrastructure offers tools to enrich metadata objects using inference mechanisms. Such mechanisms, by mining the graph of metadata and the original files they describe (e.g. the publications), are capable of identifying new relationships between existing objects (e.g. by enriching them with research funding information), supporting data curation (e.g. by fixing a record's title), or identifying new objects (e.g. by finding missing authors of a publication object).

† http://www.openaire.eu

In this paper we present the Information Inference Framework (IIF), which was originally designed to underpin the inference mechanisms of the OpenAIREplus infrastructure and evolved into a general-purpose solution for processing large amounts of data in a flexible and scalable way. The IIF can be regarded as a platform for defining and executing data processing workflows of possibly distributed modules. To this aim, it provides generic tools for: (1) defining domain-specific workflow nodes (e.g., in the case of the OpenAIRE project, a node for extracting metadata from PDF files or a node for classifying documents); (2) passing data between the nodes in a well-defined way; (3) defining the sequence of execution of the nodes as a concise workflow; and (4) executing the workflow in a distributed, reliable, and scalable computational environment. In order to obtain the benefits described in the last point, we use the Apache Hadoop‡ system, which allows for running applications on a computing cluster. The IIF is an open source project and its source code is available in the project SVN repository§.

‡ http://hadoop.apache.org/
§ The project is divided into Maven subprojects with names containing the "-iis-" infix; they are available at https://svn-public.driver.research-infrastructures.eu/driver/dnet40/modules
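To make points (1) and (2) above more concrete, the sketch below shows one way a domain-specific workflow node could be modelled in Java as a component that declares the data it consumes and produces. This is only an illustrative assumption on our part: the interface name, the port representation, and the PDF-metadata example node are hypothetical and do not reproduce the actual IIF API.

```java
import java.util.Map;

/**
 * A minimal sketch of a workflow-node abstraction (hypothetical, not the IIF code):
 * each node declares its input and output ports and knows how to run itself,
 * typically by submitting a job to the cluster.
 */
interface WorkflowNode {
    /** Names and data types of the stores this node reads. */
    Map<String, Class<?>> inputPorts();

    /** Names and data types of the stores this node writes. */
    Map<String, Class<?>> outputPorts();

    /** Executes the node, given the physical locations of its inputs and outputs. */
    void execute(Map<String, String> inputPaths, Map<String, String> outputPaths) throws Exception;
}

/** Hypothetical example node: extracts metadata records from PDF files. */
class MetadataExtractionNode implements WorkflowNode {
    @Override
    public Map<String, Class<?>> inputPorts() {
        return Map.<String, Class<?>>of("pdfDocuments", byte[].class);
    }

    @Override
    public Map<String, Class<?>> outputPorts() {
        return Map.<String, Class<?>>of("extractedMetadata", String.class);
    }

    @Override
    public void execute(Map<String, String> inputPaths, Map<String, String> outputPaths) {
        // In a real deployment this would launch a distributed job that reads PDFs
        // from inputPaths.get("pdfDocuments") and writes metadata records to
        // outputPaths.get("extractedMetadata").
    }
}
```

A complete workflow would then be an ordered composition of such nodes, with the framework responsible for wiring each node's declared outputs to its successors' inputs and for running the corresponding jobs on the cluster.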
In this paper we outline the design decisions behind this processing framework for big data and describe its components. After comparing our work with other popular systems for processing scholarly data in the next section, we show how the framework is applied to enrich scholarly information in the OpenAIREplus project.

2. Related work

The IIF is a novel solution. Its contribution can be viewed from two perspectives: it introduces a new generic framework for processing big data (see Sect. 3), and it introduces a new system for processing and extracting knowledge from objects related to scientific activity, such as documents and author information (see Sect. 4).

2.1. Data processing frameworks

The concepts underlying the IIF were inspired by Rapid Miner [4], an open source software package for data analytics. The tool allows for creating data workflows consisting of various workflow nodes using a user-friendly graphical interface. Its library contains many predefined workflow nodes related to Extract, Transform and Load (ETL) tasks and machine learning algorithms. An interesting feature of this tool is that it checks whether the data declared to be consumed by a workflow node matches the data declared to be produced by the preceding workflow node, and this check happens as early as design time of the workflow. A disadvantage of Rapid Miner is that it was not designed to process large data sets; our system, on the other hand, was designed from the beginning with scalability and big data in mind. Note that there is a plugin for Rapid Miner called Radoop [5] that allows Rapid Miner to process big data, but it is neither an open source nor a free solution and thus not as flexible and customizable as we would want when applying it to our specific domain.
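As a rough illustration of this kind of design-time validation, the sketch below checks that every connection in a workflow links an output port to an input port of a compatible type before anything is executed. It is a simplified sketch of the idea only, built on the hypothetical WorkflowNode interface shown earlier, and does not reflect Rapid Miner's or the IIF's actual implementation.

```java
import java.util.List;

/**
 * A minimal sketch of design-time workflow validation (hypothetical):
 * verify port compatibility between connected nodes before execution.
 */
class WorkflowValidator {

    /** A connection from one node's output port to another node's input port. */
    record Connection(WorkflowNode producer, String outputPort,
                      WorkflowNode consumer, String inputPort) {}

    static void validate(List<Connection> connections) {
        for (Connection c : connections) {
            Class<?> produced = c.producer().outputPorts().get(c.outputPort());
            Class<?> consumed = c.consumer().inputPorts().get(c.inputPort());
            if (produced == null || consumed == null) {
                throw new IllegalStateException("Unknown port in connection: " + c);
            }
            // The consumer must accept the type the producer emits.
            if (!consumed.isAssignableFrom(produced)) {
                throw new IllegalStateException(
                    "Type mismatch: port " + c.outputPort() + " produces "
                    + produced.getSimpleName() + " but port " + c.inputPort()
                    + " expects " + consumed.getSimpleName());
            }
        }
    }
}
```

Running such a check when the workflow is defined, rather than when it is executed, surfaces wiring mistakes before any expensive cluster job is started.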
With relation to the first perspective, we can also compare the introduced solution with systems that were devised to solve similar problems. Google Scholar**, Microsoft Academic Search††, ArnetMiner‡‡, and CiteSeerX (which is based on SeerSuite§§) are all popular projects dealing with harvesting and providing access to scientific publications and related data. The data processing architecture of ArnetMiner is described in references [6,7], and information related to CiteSeerX can be found in references [8,9]; unfortunately, we were not able to find any reliable publication describing the architecture of either Microsoft Academic Search or Google Scholar, so we cannot compare our solution with them. Generally, these projects follow a Service Oriented Architecture (SOA) approach where each processing module is a separate web service with its own data processing tools. The modules communicate with each other through well-defined networking channels. Our approach is different: each processing module is akin to a cluster application that uses the same Hadoop-based computational framework to execute its tasks and to communicate with other modules. The first approach ensures that the modules are very loosely coupled; its disadvantage, however, is that each module has to define its own communication protocol, its own way of processing and storing data efficiently, and its own handling of hardware failures. In our approach, the modules process the data in the way supported by the Hadoop cluster (note that this does not preclude using a separate web service as a processing module, though it is not as effective as using a native, Hadoop-based approach). This way we can circumvent the mentioned problems of the SOA architecture. See Sect. 3 for a more detailed discussion of the solutions applied in the IIF along with a comparison with other projects.

** http://scholar.google.com
†† http://academic.research.microsoft.com
‡‡ http://arnetminer.org

2.2. Inferring knowledge over scholarly communication content

A number of frameworks with objectives and/or architectures similar to the IIF can be mentioned. Arguably the two most prominent, complementary architectures for content analysis are Apache UIMA [10] and GATE [11], which are mature, open-source suites of tools for text engineering. Several projects and higher-order frameworks use UIMA or GATE, for example Behemoth*** or XAR [12], an open-source framework for information extraction. The IIF addresses the need for a large-scale, flexible, open-source architecture to execute complex workflows for processing documents (e.g. scientific publications) and finding links between extracted entities; none of the existing solutions can fully satisfy these requirements of the OpenAIREplus project. Another set of solutions related to the second perspective is formed by popular projects related to scientific publications mentioned
