ETL Best Practices for Data Quality Checks in RIS Databases

informatics Article ETL Best Practices for Data Quality Checks in RIS Databases Otmane Azeroual 1,2,3,* , Gunter Saake 2 and Mohammad Abuosba 3 1 German Center for Higher Education Research and Science Studies (DZHW), Schützenstraße 6a, 10117 Berlin, Germany 2 Institute for Technical and Business Information Systems—Database Research Group, Otto-von-Guericke-University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany; [email protected] 3 Department of Computer Science and Engineering, University of Applied Sciences—HTW Berlin, Wilhelminenhofstraße 75 A, 12459 Berlin, Germany; [email protected] * Correspondence: [email protected] Received: 25 December 2018; Accepted: 27 February 2019; Published: 5 March 2019 Abstract: The topic of data integration from external data sources or independent IT-systems has received increasing attention recently in IT departments as well as at management level, in particular concerning data integration in federated database systems. An example of the latter are commercial research information systems (RIS), which regularly import, cleanse, transform and prepare the analysis research information of the institutions of a variety of databases. In addition, all these so-called steps must be provided in a secured quality. As several internal and external data sources are loaded for integration into the RIS, ensuring information quality is becoming increasingly challenging for the research institutions. Before the research information is transferred to a RIS, it must be checked and cleaned up. An important factor for successful or competent data integration is therefore always the data quality. The removal of data errors (such as duplicates and harmonization of the data structure, inconsistent data and outdated data, etc.) are essential tasks of data integration using extract, transform, and load (ETL) processes. Data is extracted from the source systems, transformed and loaded into the RIS. At this point conflicts between different data sources are controlled and solved, as well as data quality issues during data integration are eliminated. Against this background, our paper presents the process of data transformation in the context of RIS which gains an overview of the quality of research information in an institution’s internal and external data sources during its integration into RIS. In addition, the question of how to control and improve the quality issues during the integration process in RIS will be addressed. Keywords: research information systems (RIS); heterogeneous information sources; metadata; data integration; data transformation; extraction transformation load (ETL) technology; data quality 1. Introduction In recent years, there has been a new trend in which universities, research institutions and researchers capture, integrate, store and analyze their research information into a research information system (RIS). Other names for the term RIS are current research information system (CRIS), especially in Europe, and RIMS (research information management system), RNS (research networking system) or FAR (faculty activity reporting). The RIS is a central database or federated information system [1,2] that can be used to collect, manage and provide a variety of information about research activities and research results. The information represented here are metadata about an institution’s research activities, e.g., personal data, projects, third-party funds, publications, patents and prizes, etc. and Informatics 2019, 6, 10; doi:10.3390/informatics6010010 www.mdpi.com/journal/informatics Informatics 2019, 6, 10 2 of 13 Informatics 2019, 6, x 2 of 14 illustratesare referred the to type as research of research information information. [3]. Furthe The followingr definitions Figure of 1the illustrates term RIS the in the type literature of research can beinformation. found in the Further related definitions papers [4–6]. of the term RIS in the literature can be found in the related papers [4–6]. FigureFigure 1. 1. TypeType of of research research information processed and integrated into research information system. For the the institutions, institutions, it it is is a auseful useful tool tool to to clearly clearly present present their their own own research research profile profile and and thus thus an effectivean effective tool tool for for information information management. management. The The unification unification and and facilitation facilitation of of strategic strategic research reporting creates added value for the administrators, as well as forfor thethe scientists.scientists. The expense of documenting projects projects is is reduced for the scientists, thus gaining more time for the actual research process. The RIS can also support additional helper functions for scientists: e.g., the creation of CVs and publication lists, or other reuse of the aggregated data from RIS onon aa website.website. The most important decision-making and and reporting tools in many institutions are RIS that collectively integrate integrate the the data data from from many many heterogeneous heterogeneous and and independent independent information information systems. systems. In orderIn order to avoid to avoid structural structural and and content-related content-related data data quality quality issues issues and and thereby thereby reduce reduce the the effective effective use of researchresearch information, information, the the transformation transformation of research of research information information takes place takes during place the during integration the integrationprocess. “Data process. integration “Data improvesintegration the improves usability th ofe datausability by improving of data by consistency, improving completeness, consistency, completeness,accessibility, and accessibility, other dimensions and other of dimensions data quality” of [data7]. Data quality” integration [7]. Data is theintegration correct, completeis the correct, and completeefficient merging and efficient of data merging and content of data from and heterogeneous content from heterogeneous sources into a consistentsources into and a structuredconsistent andset ofstructured information set of for information effective interpretation for effective interp by usersretation and by applications users and [applications8]. Filling of [8]. RIS Filling should of RISalready should be ensuredalready be in goodensured quality in good as partquality of theas part integration of the integration process. However, process. theyHowever, should they be shouldcontinually be continually defined, measured, defined, measured, analyzed, analyzed, and improved and improved to increase to interoperabilityincrease interoperability with external with externalsystems, systems, improve improve performance performance and usability, and usability, and address and data address quality data issues quality [9]. issues Therefore, [9]. Therefore, investing investingin the topic in ofthe data topic integration of data integration is particularly is particularly useful when useful the when achievement the achievement of high data of high quality data is qualityparamount is paramount [10]. This [10]. is also This the is case also for the RIS case users for RIS and users it is expectedand it is expected that decisions that decisions based on based relevant, on relevant,complete, complete, correct and correct current and data current will increasedata will the increase chance the of chance getting of better getting results better [11 results]. [11]. The purpose purpose of of this this paper paper will will be be to topresent present the the process process of data of data transformation transformation that thatcontrols controls and improvesand improves the data the data quality quality issues issues in RIS, in RIS, to toenab enablele institutions institutions to to perform perform in-depth in-depth analysis analysis and evaluation on existing or current data, and thus to meet the needs of their decision makers.makers. The first first chapter discusses the causes of data qu qualityality issues in RIS that may be responsible for reducing datadata quality.quality. Subsequently, Subsequently, the the phases phases of theof integrationthe integration process proce aress described are described in the contextin the contextof RIS. Theof RIS. chapter The chapter concludes concludes with approaches with approaches to eliminate to eliminate dataquality data qualit issuesy issues from from external external data datasources sources in the in transformation the transformation phase phase into RIS.into RIS. 2. Causes of Data Quality Issues in the Context of Integrating Research Information 2. Causes of Data Quality Issues in the Context of Integrating Research Information Data quality defined as “fitness for use” and its assurance is recognized as a valid and important Data quality defined as “fitness for use” and its assurance is recognized as a valid and activity, but in practice only a few people list it the highest priority [12]. The quality of the data important activity, but in practice only a few people list it the highest priority [12]. The quality of the depends to a great extent on the use of data and synergies of customer needs, the usability and the data depends to a great extent on the use of data and synergies of customer needs, the usability and the access of data [13]. Therefore, during evaluating and improving data quality, the involvement of Informatics 2019, 6, 10 3 of 13 access of data [13]. Therefore, during evaluating

ETL Best Practices for Data Quality Checks in RIS Databases

The Savvy Survey #16: Data Analysis and Survey Results1 Milton G

Data Quality: Letting Data Speak for Itself Within the Enterprise Data Strategy

Evolution of the Infographic

Informal Data Transformation Considered Harmful

Scientific Data Mining in Astronomy

EMC Greenplum Data Computing Appliance: High Performance for Data Warehousing and Business Intelligence an Architectural Overview

SUGI 28: the Value of ETL and Data Quality

Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation Jie Song George Alter H

A Machine Learning Approach to Outlier Detection and Imputation of Missing Data1

What Is a Data Warehouse?

Using Survey Data Author: Jen Buckley and Sarah King-Hele Updated: August 2015 Version: 1

Data Quality Through Data Integration: How Integrating Your IDEA Data Will Help Improve Data Quality