Towards Automated Data Integration in Software Analytics


Silverio Martínez-Fernández (Fraunhofer IESE), Petar Jovanovic (Universitat Politècnica de Catalunya, BarcelonaTech), Xavier Franch (Universitat Politècnica de Catalunya, BarcelonaTech), Andreas Jedlitschka (Fraunhofer IESE)

ABSTRACT

Software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. In order to support "real-time enterprise" for software organizations and provide information transparency for diverse stakeholders, we integrate heterogeneous data sources about software analytics, such as static code analysis, testing results, issue tracking systems, network monitoring systems, etc. To deal with the heterogeneity of the underlying data sources, we follow an ontology-based data integration approach in this paper and define an ontology that captures the semantics of relevant data for software analytics. Furthermore, we focus on the integration of such data sources by proposing two approaches: a static and a dynamic one. We first discuss the current static approach with a predefined set of analytic views representing software quality factors and further envision how this process could be automated in order to dynamically build custom user analyses using a semi-automatic platform for managing the lifecycle of analytics infrastructures.

CCS CONCEPTS

• Software and its engineering → Maintaining software

KEYWORDS

Data integration, real-time enterprise, ontology, software analytics

ACM Reference Format:
Silverio Martínez-Fernández, Petar Jovanovic, Xavier Franch, and Andreas Jedlitschka. 2018. Towards Automated Data Integration in Software Analytics. In International Workshop on Real-Time Business Intelligence and Analytics (BIRTE '18), August 27, 2018, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3242153.3242159

1 INTRODUCTION

Nowadays, the huge amount of data available in companies has increased their interest in applying the concept of "real-time enterprise" (https://www.gartner.com/technology/research/data-literacy/) by using up-to-date information and acting on events as they happen. In this paper, we envision automated support for the real-time enterprise concept for software organizations by means of applying recent approaches to facilitate data integration tasks.

Currently, software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. The software development process produces various types of data such as source code, bug reports, check-in histories, and test cases [23]. The data sets not only include data from the development (e.g., GitHub with over 14 million projects), but also millions of data points produced per second about the usage of software (e.g., the Facebook or eBay ecosystems). All this data can be exploited with "software analytics", which is about using data-driven approaches to obtain insightful and actionable information at the right time to help software practitioners with their data-related tasks [9]. This improves information transparency for diverse stakeholders. Bearing this goal in mind, we integrate these different data sources as a necessary first step in making this data actionable for decision-making. The integration becomes necessary because the inherent relationships in the data influencing the overall software quality are not obvious at first sight.

Despite its key role, the integration of different software analytics data still presents challenges due to the heterogeneity of the data sources. Not only do data come from sources carrying different types of information, but the same information is also stored in heterogeneous formats and tools.

Big Data analytics involves the ingestion of real-time operational data into large repositories (e.g., data warehouses or data lakes), followed by the execution of analytics queries to derive insights from the data, with the final goal of performing business actions or raising alerts [6]. In a recent systematic review, data integration and final data aggregation were reported as part of the remaining challenges in Big Data analytics [19]. At the same time, another review in software analytics reported that most of the current approaches are still analyzing only one artifact [2], thus not focusing on integrating data from different sources and getting a holistic view. Thus, further research is needed to facilitate the integration of data sources for software analytics driven by the real information needs of end users.
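To make the ingest-then-query loop in the preceding paragraph concrete, here is a minimal sketch (ours, not from the paper) that treats an in-memory SQLite database as a stand-in warehouse: operational events are ingested, an analytics query derives a metric, and a threshold triggers an alert. The table name, events, and threshold are invented for illustration.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE build_events (project TEXT, status TEXT, ts TEXT)")

# Ingest operational data points (in practice: streamed from CI, monitoring, etc.).
events = [
    ("app-core", "failed", "2018-08-27T10:00:00"),
    ("app-core", "passed", "2018-08-27T10:05:00"),
    ("app-core", "failed", "2018-08-27T10:10:00"),
]
conn.executemany("INSERT INTO build_events VALUES (?, ?, ?)", events)

# Analytics query: derive an insight (failure rate) from the ingested data.
(failure_rate,) = conn.execute(
    "SELECT AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END)"
    " FROM build_events"
).fetchone()

# Business action: raise an alert when the derived metric crosses a threshold.
ALERT_THRESHOLD = 0.5  # assumed value for illustration
if failure_rate > ALERT_THRESHOLD:
    print(f"ALERT: build failure rate {failure_rate:.0%} exceeds threshold")
```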
To overcome the heterogeneity of software analytics data coming from different sources, we follow an ontology-based data integration approach in this paper; in particular, we intend to contribute: (a) the definition of an ontology capturing the semantics of relevant data for software analytics; (b) a current static approach for the integration of heterogeneous data sources given a set of predefined analytic views representing software quality factors; and (c) an envisioned approach for the dynamic integration of heterogeneous software analytics data, guided by the specific analytical needs of end users (i.e., information requirements).

The paper is structured as follows: Section 2 presents the related work. Section 3 presents an ontology for software analytics. Section 4 presents the implementation of a static approach to the integration. Section 5 discusses how the integration could be done dynamically. Finally, Section 6 concludes the paper.
2 BACKGROUND AND RELATED WORK

2.1 Software analytics and software quality

Contrary to the availability of data and its transparency in open source software, tool support for data integration for private companies in commercial systems is just emerging. For instance, we can find some large-scale software companies with their own proprietary development environments, such as Codemine (a proprietary software analytics platform) [7], Codebook (a framework for connecting engineers and their work artifacts) [4] by Microsoft, and Tricorder (a program analysis platform aimed at building a data-driven ecosystem) [18] by Google. Still, these platforms are proprietary and not widely used by others. In addition, companies like Kiuwan, Kovair, and Tasktop have recently started offering software and services for software analytics. Furthermore, very recent research tools are CodeFeedr [21] and Q-Rapids [16]. Despite these efforts, there are still challenges for companies developing commercial systems to understand how to integrate heterogeneous data sources for software analytics.

Regarding software quality and quality models, a multitude of software-related quality models exist that are used in practice, as well as classification schemes [13]. One example is the ISO/IEC

3 INTEGRATING DATA SOURCES FOR SOFTWARE ANALYTICS

In this section, we present our software analytics use case that we will use throughout this paper.

3.1 An ontology for software analytics

We introduce an ontology that captures the semantics of relevant data for software analytics (see Fig. 1). In the ontology, each class represents an entity of the software analytics domain. For instance, the class Issue represents the issues from issue tracking systems. The ontology is abstract in order to enable generalization and applicability in different software projects. Therefore, the technologies used could differ among projects, whereas the concepts are present in most software development projects. For instance, for issue tracking systems, different companies may use different tools (e.g., Redmine, Jira, or GitLab), but all of them use issue tracking systems as a software development practice. Note that several approaches also propose automated generation of a domain ontology from the desired data sources to support data integration tasks (e.g., [20]).

The classes of the ontology in Fig. 1 represent data coming from the system either at development time or at runtime. For the sake of simplicity, we omit further ontology details (e.g., datatype properties) in Fig. 1, but explain the main process of how the ontology is built. During software development, we find data about the project and the development, which can be mapped, respectively, to the topics of improving software development process productivity and software system quality presented by Zhang et al. [23]. At runtime, we
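As a sketch of how the Issue concept of such an ontology could be expressed and populated from heterogeneous trackers, the following uses the rdflib Python library. This is our illustration, not the authors' implementation; the namespace, properties, and source records are assumptions.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace for the software analytics ontology (illustrative).
SA = Namespace("http://example.org/software-analytics#")

g = Graph()
g.bind("sa", SA)

# Abstract concept: Issue, independent of any concrete issue tracking tool.
g.add((SA.Issue, RDF.type, RDFS.Class))
g.add((SA.Issue, RDFS.comment, Literal("An issue from an issue tracking system")))

def add_issue(graph, issue_id, title, status):
    """Map a tool-specific record onto the tool-agnostic sa:Issue class."""
    node = SA[f"issue/{issue_id}"]
    graph.add((node, RDF.type, SA.Issue))
    graph.add((node, SA.title, Literal(title)))
    graph.add((node, SA.status, Literal(status)))

# The same concept populated from two heterogeneous sources (assumed records).
jira_record = {"key": "PRJ-42", "summary": "NPE on login", "state": "Open"}
redmine_record = {"id": 1337, "subject": "Crash on startup", "status_name": "New"}

add_issue(g, jira_record["key"], jira_record["summary"], jira_record["state"])
add_issue(g, redmine_record["id"], redmine_record["subject"],
          redmine_record["status_name"])

print(g.serialize(format="turtle"))
```

Because both trackers are mapped to the same class, any analytic view written against sa:Issue works regardless of which tool a project uses.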
Recommended publications
  • Scientific Data Mining in Astronomy
SCIENTIFIC DATA MINING IN ASTRONOMY. Kirk D. Borne, Department of Computational and Data Sciences, George Mason University, Fairfax, VA 22030, USA.

Abstract: We describe the application of data mining algorithms to research problems in astronomy. We posit that data mining has always been fundamental to astronomical research, since data mining is the basis of evidence-based discovery, including classification, clustering, and novelty discovery. These algorithms represent a major set of computational tools for discovery in large databases, which will be increasingly essential in the era of data-intensive astronomy. Historical examples of data mining in astronomy are reviewed, followed by a discussion of one of the largest data-producing projects anticipated for the coming decade: the Large Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in astronomy, we envision a new data-oriented research paradigm for astronomy and astrophysics: astroinformatics. Astroinformatics is described as both a research approach and an educational imperative for modern data-intensive astronomy. An important application area for large time-domain sky surveys (such as LSST) is the rapid identification, characterization, and classification of real-time sky events (including moving objects, photometrically variable objects, and the appearance of transients). We describe one possible implementation of a classification broker for such events, which incorporates several astroinformatics techniques: user annotation, semantic tagging, metadata markup, heterogeneous data integration, and distributed data mining. Examples of these types of collaborative classification and discovery approaches within other science disciplines are presented.

1 Introduction. It has been said that astronomers have been doing data mining for centuries: "the data are mine, and you cannot have them!".
  • Information Needs for Software Development Analytics
Information Needs for Software Development Analytics. Raymond P.L. Buse (The University of Virginia, USA), Thomas Zimmermann (Microsoft Research).

Abstract: Software development is a data rich activity with many sophisticated metrics. Yet engineers often lack the tools and techniques necessary to leverage these potentially powerful information resources toward decision making. In this paper, we present the data and analysis needs of professional software engineers, which we identified among 110 developers and managers in a survey. We asked about their decision making process, their needs for artifacts and indicators, and scenarios in which they would use analytics. The survey responses lead us to propose several guidelines for analytics tools in software development including: Engineers do not necessarily have much expertise in data analysis; thus tools should be easy to use, fast, and produce concise output. Engineers have diverse analysis needs and consider most indicators to be important; thus tools should at the same time support many different types of artifacts and many indicators.

decision making will likely only become more difficult. Analytics describes application of analysis, data, and systematic reasoning to make decisions. Analytics is especially useful for helping users move from only answering questions of information like "What happened?" to also answering questions of insight like "How did it happen and why?" Instead of just considering data or metrics directly, one can gather more complete insights by layering different kinds of analyses that allow for summarizing, filtering, modeling, and experimenting; typically with the help of automated tools. The key to applying analytics to a new domain is understanding the link between available data and the information needed to make good decisions, as well as the analyses
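As a toy illustration of the layered analysis this excerpt mentions (summarizing and filtering indicators with automated tooling, in concise output), here is a standard-library Python sketch; the metric names, data, and threshold are invented for the example.

```python
from statistics import mean

# Invented indicator data: code review latency (hours) per component.
review_latency = {
    "ui": [2, 5, 40, 3],
    "backend": [1, 2, 2, 4],
    "build": [30, 55, 41, 38],
}

# Summarize: one concise number per component (engineers want concise output).
summary = {component: mean(hours) for component, hours in review_latency.items()}

# Filter: surface only the components that look problematic.
SLOW = 24  # assumed threshold, in hours
hotspots = {c: avg for c, avg in summary.items() if avg > SLOW}

print("Average review latency:", summary)
print("Hotspots needing attention:", hotspots)
```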
  • Software Analytics for Incident Management of Online Services: An Experience Report
Software Analytics for Incident Management of Online Services: An Experience Report. Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang (Microsoft Research Asia, Beijing, P. R. China, {jlou, qlin, juding, qifu, dongmeiz}@microsoft.com), Tao Xie (University of Illinois at Urbana-Champaign, Urbana, IL, USA).

Abstract: As online services become more and more popular, incident management has become a critical task that aims to minimize the service downtime and to ensure high quality of the provided services. In practice, incident management is conducted through analyzing a huge amount of monitoring data collected at runtime of a service. Such data-driven incident management faces several significant challenges such as the large data scale, complex problem space, and incomplete knowledge. To address these challenges, we carried out two-year software-analytics research where we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS) targeting real scenarios in a large-scale online service of Microsoft.

incident management needs to be efficient and effective in order to ensure high availability and reliability of the services. A typical procedure of incident management in practice (e.g., at Microsoft and other service-provider companies) goes as follows. When the service monitoring system detects a service violation, the system automatically sends out an alert and makes a phone call to a set of On-Call Engineers (OCEs) to trigger the investigation on the incident in order to restore the service as soon as possible. Given an incident, OCEs need to understand what the problem is and how to resolve it. In ideal cases, OCEs can identify the root cause of the incident and fix
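The detect-alert-notify loop described in this excerpt could look roughly like the following standard-library sketch. This is our illustration, not the SAS system; the service names, thresholds, engineer identifiers, and notify stub are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    metric: str
    value: float
    threshold: float

ON_CALL_ENGINEERS = ["oce-primary", "oce-secondary"]  # assumed rotation

def notify(engineer: str, incident: Incident) -> None:
    # Stand-in for the real alerting / phone-call infrastructure.
    print(f"Paging {engineer}: {incident.service} {incident.metric}="
          f"{incident.value} breached threshold {incident.threshold}")

def check(service: str, metric: str, value: float, threshold: float) -> None:
    """Detect a service violation and trigger the incident investigation."""
    if value > threshold:
        incident = Incident(service, metric, value, threshold)
        for engineer in ON_CALL_ENGINEERS:
            notify(engineer, incident)

# One monitoring tick with invented numbers: 8% error rate vs. a 5% threshold.
check("web-frontend", "error_rate", value=0.08, threshold=0.05)
```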
  • Data Quality Through Data Integration: How Integrating Your IDEA Data Will Help Improve Data Quality
Data Quality Through Data Integration: How Integrating Your IDEA Data Will Help Improve Data Quality. Center for the Integration of IDEA Data, July 2018. Authors: Sara Sinani & Fred Edora.

High-quality data is essential when looking at student-level data, including data specifically focused on students with disabilities. For state education agencies (SEAs), it is critical to have a solid foundation for how data is collected and stored to achieve high-quality data. The process of integrating the Individuals with Disabilities Education Act (IDEA) data into a statewide longitudinal data system (SLDS) or other general education data system not only provides SEAs with more complete data, but also helps SEAs improve accuracy of federal reporting, increase the quality of and access to data within and across data systems, and make better informed policy decisions related to students with disabilities. Through the data integration process, including mapping data elements, reviewing data governance processes, and documenting business rules, SEAs will have developed documented processes and policies that result in more integral data that can be used with more confidence. In this brief, the Center for the Integration of IDEA Data (CIID) provides scenarios based on the continuum of data integration by focusing on three specific scenarios along the integration continuum to illustrate how a robust integrated data system improves the quality of data.

Scenario A: Siloed Data Systems. Data systems where the data is housed in separate systems and State Education Agency databases are often referred to as "data silos." These data silos are isolated from other data systems and databases.
  • Patterns for Implementing Software Analytics in Development Teams
Patterns for Implementing Software Analytics in Development Teams. Joelma Choma (National Institute for Space Research - INPE), Eduardo Martins Guerra (National Institute for Space Research - INPE), Tiago Silva da Silva (Federal University of São Paulo - UNIFESP).

Software development activities typically produce a large amount of data. Using a data-driven approach to decision making, such as Software Analytics, software practitioners can achieve higher development process productivity and improve many aspects of software quality based on insightful and actionable information. This paper presents a set of patterns describing steps to encourage the integration of analytics activities by development teams in order to support them in making better decisions, promoting the continuous improvement of the software and its development process. Before any procedure to extract data for software analytics, the team first needs to define their questions about what needs to be measured, assessed, and monitored throughout the development process. Once the key issues to be tracked have been defined, the team may select the most appropriate means for extracting data from software artifacts that will be useful in decision-making. The tasks to set up the development environment for software analytics should be added to the project planning along with the regular tasks. The software analytics activities should be distributed throughout the project in order to add information about the system in small portions. By defining reachable goals from the software analytics findings, the team turns insights from software analytics into actions that incrementally improve the software's characteristics and/or its development process.

Categories and Subject Descriptors: D.2.8 [Software and its engineering]: Software creation and management—Metrics. General Terms: Software Analytics. Additional Key Words and Phrases: Software Analytics, Decision Making, Agile Software Development, Patterns, Software Measurement, Development Teams.
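The question-first workflow these patterns describe (define questions, then pick metrics and data sources, then act on findings) can be pictured as a small plan structure. A hedged Python sketch, with questions, metrics, and goals invented for illustration:

```python
# Invented example of a team's analytics plan: questions drive metric selection.
analytics_plan = [
    {
        "question": "Are we getting slower at fixing defects?",
        "metric": "median days from bug-open to bug-close",
        "data_source": "issue tracker",
        "goal": "keep the median below 7 days",
    },
    {
        "question": "Which modules accumulate the most churn?",
        "metric": "lines changed per module per sprint",
        "data_source": "version control history",
        "goal": "flag modules above the 90th percentile for review",
    },
]

for item in analytics_plan:
    print(f"{item['question']} -> measure '{item['metric']}' "
          f"from {item['data_source']} (goal: {item['goal']})")
```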
  • Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. April Reeve. Morgan Kaufmann, an imprint of Elsevier, 225 Wyman Street, Waltham, MA 02451, USA. Copyright 2013 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices: Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
  • Informatica Cloud Data Integration
Informatica® Cloud Data Integration: Microsoft SQL Server Connector Guide. March 2019. © Copyright Informatica LLC 2017, 2019.

This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.

U.S. GOVERNMENT RIGHTS: Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.

Informatica, the Informatica logo, Informatica Cloud, and PowerCenter are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or trademarks of their respective owners. Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product. See patents at https://www.informatica.com/legal/patents.html.

DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose.
  • What to Look for When Selecting a Master Data Management Solution
What to Look for When Selecting a Master Data Management Solution

Table of Contents: Business Drivers of MDM; Next-Generation MDM; Keys to a Successful MDM Deployment; Choosing the Right Solution for MDM; Talend's MDM Solution; Conclusion.

Master data management (MDM) exists to enable the business success of an organization. With the right MDM solution, organizations can successfully manage and leverage their most valuable shared information assets. Because of the strategic importance of MDM, organizations can't afford to make any mistakes when choosing their MDM solution. This article will discuss what to look for in an MDM solution and why the right choice will help to drive strategic business initiatives, from cloud computing to big data to next-generation business agility. It is critical for CIOs and business decision-makers to understand the value
  • Using ETL, EAI, and EII Tools to Create an Integrated Enterprise
Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise. Colin White, Founder, BI Research. TDWI Webcast, October 2005.

Slide titles from the deck: TDWI Data Integration Study; Data Integration: Barrier to Application Development; Top Three Data Integration Inhibitors; Staffing and Budget for Data Integration; Data Integration: A Definition; Enterprise Business Data; Data Integration Architecture; Data Integration Techniques and Technologies.

Data Integration: A Definition. "A framework of applications, products, techniques and technologies for providing a unified and consistent view of enterprise-wide business data."

The architecture slide groups the space into data integration techniques (data consolidation, data federation, data propagation; changed data capture (CDC); data transformation: restructure, cleanse, reconcile, aggregate), data integration technologies (enterprise data replication (EDR), extract transformation load (ETL), enterprise content management (ECM), enterprise application integration (EAI), right-time ETL (RT-ETL), enterprise information integration (EII), and web services / service-oriented architecture (SOA)), and data integration management (data quality management, metadata management, systems management), connecting dispersed internal and external source data to integrated targets such as master data management (MDM) and business domain applications.

Data Integration Techniques and Technologies: Data Consolidation (centralized data); Extract, transformation
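To ground the ETL technique named in this excerpt, a minimal extract-transform-load sketch in Python (our illustration, using only the standard library; the source rows and target schema are invented):

```python
import sqlite3

# Extract: rows from an assumed source system (hard-coded here for brevity).
source_rows = [
    {"name": "  Alice ", "country": "de", "revenue": "1200.50"},
    {"name": "Bob",      "country": "US", "revenue": "980.00"},
]

# Transform: restructure, cleanse, and reconcile into a consistent shape.
def transform(row):
    return (
        row["name"].strip(),        # cleanse stray whitespace
        row["country"].upper(),     # reconcile country codes to one format
        float(row["revenue"]),      # restructure string amounts into numbers
    )

# Load: write the unified, consistent view into the target store.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (name TEXT, country TEXT, revenue REAL)")
target.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   (transform(r) for r in source_rows))

print(target.execute("SELECT * FROM customers").fetchall())
```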
  • Introduction to Data Integration Driven by a Common Data Model
Introduction to Data Integration Driven by a Common Data Model. Michal Džmuráň, Senior Business Consultant, Progress Software Czech Republic.

Index: Data Integration Driven by a Common Data Model; Data Integration Engine; What Is Semantic Integration?; What Is Common Data Model? Common Data Model Examples; What Is Data Integration Driven by a Common Data Model?; The Position of Integration Driven by a Common Data Model in the Overall Application Integration Architecture; Which Organisations Would Benefit from Data Integration Driven by a Common Data Model?; Key Elements of Data Integration Driven by a Common Data Model; Common Data Model, Data Services and Data Sources; Mapping and Computed Attributes; Rules; Glossary of Terms; Information Resources (Web; Articles and Analytical Reports; Literature; Industry Standards for Common Data Models); Contact; References.

Data Integration Driven by a Common Data Model: Data Integration Engine

Today, not even small organisations can make do with a single application. The majority of business processes in an organisation is nowadays already supported by some kind of implemented application, and the organisation must focus on making its operations more efficient. One way of achieving this goal is to optimize the information exchange across applications, assisted by data integration from various applications; this is collectively known as Enterprise Application Integration (EAI). Without effective Enterprise Application Integration, a modern organisation cannot run its business processes to meet ever-increasing customer demands and maintain up-to-date knowledge of its operations. Data integration is an important part of EAI.
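A tiny sketch of the mapping-and-computed-attributes idea from this excerpt (our own illustration, not from the document): two applications' records are mapped onto one common data model, including an attribute computed during mapping. All field names are invented.

```python
# Common data model: one agreed shape for "customer" across applications.
def from_crm(record):
    """Map a CRM-shaped record (invented schema) onto the common model."""
    return {
        "customer_id": record["crm_id"],
        "full_name": f"{record['first']} {record['last']}",  # computed attribute
        "city": record["city"],
    }

def from_billing(record):
    """Map a billing-system record (invented schema) onto the same model."""
    return {
        "customer_id": record["account_no"],
        "full_name": record["holder_name"],
        "city": record["address"]["city"],
    }

crm_row = {"crm_id": 7, "first": "Jana", "last": "Novakova", "city": "Praha"}
billing_row = {"account_no": 7, "holder_name": "Jana Novakova",
               "address": {"city": "Praha"}}

# Both sources now speak the common model, so downstream logic is written once.
assert from_crm(crm_row) == from_billing(billing_row)
```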
  • Evaluating and Improving Software Quality Using Text Analysis Techniques - a Mapping Study
Evaluating and Improving Software Quality Using Text Analysis Techniques - A Mapping Study. Faiz Shah, Dietmar Pfahl. Institute of Computer Science, University of Tartu, J. Liivi 2, Tartu 50490, Estonia. {shah, dietmar.pfahl}@ut.ee

Abstract: Improvement and evaluation of software quality is a recurring and crucial activity in the software development life-cycle. During software development, software artifacts such as requirement documents, comments in source code, design documents, and change requests are created containing natural language text. For analyzing natural text, specialized text analysis techniques are available. However, a consolidated body of knowledge about research using text analysis techniques to improve and evaluate software quality still needs to be established.

To contribute to the establishment of such a body of knowledge, we aimed at extracting relevant information from the scientific literature about data sources, research contributions, and the usage of text analysis techniques for the improvement and evaluation of software quality. We conducted a mapping study by performing the following steps: define research questions, prepare search string and find relevant literature, apply screening process, classify, and extract data.

Bug classification and bug severity assignment activities within defect management are frequently improved using the text analysis techniques of classification and concept extraction. The non-functional requirements elicitation activity of requirements engineering is mostly improved using the technique of concept extraction. The quality characteristic most frequently evaluated for the product quality model is operability. The most commonly used data sources are bug reports, requirement documents, and software reviews. The dominant types of research contributions are solution proposals and validation research.
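As an illustration of the bug classification use of text analysis techniques mentioned in this excerpt, a brief sketch using scikit-learn (our own; requires scikit-learn to be installed, and the bug reports and severity labels are invented toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy bug reports, labeled by severity.
reports = [
    "application crashes on startup with null pointer",
    "typo in the settings dialog label",
    "data loss when saving large files",
    "button color slightly off in dark mode",
]
labels = ["severe", "minor", "severe", "minor"]

# Text analysis: turn natural-language reports into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)

# Classification: fit a simple model for bug severity assignment.
clf = LogisticRegression().fit(X, labels)

new_report = ["crash when opening corrupted file"]
print(clf.predict(vectorizer.transform(new_report)))  # e.g. ['severe']
```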
  • Using the Qlik Data Integration Platform for Better Analytics
DATA SHEET: DATA INTEGRATION. Modern DataOps: Using the Qlik Data Integration Platform for Better Analytics (QLIK.COM)

INTRODUCTION: Enabling DataOps for Analytics-Ready Data

Qlik's Data Integration Platform (formerly Attunity solutions) automates real-time data streaming, cataloging, and publishing, so you can quickly find and free analytics-ready data and take action on it. In today's hyper-competitive business climate, real-time insights are critical. Users need robust data integration and agile analytics solutions to make decisions fast. Unlike traditional batch data movement and ETL scripting, which are slow, inflexible, and labor intensive, Qlik's data integration platform automates the creation of data streams from core transactional systems. It efficiently moves data to applications, warehouses, and lakes, on premise and in the cloud, and makes data immediately available via an Amazon-like catalog marketplace experience. With Qlik, users get the frictionless data agility they need to drive greater business value.

Data consumers need better access to their data in real time. Many existing processes and technologies simply can't keep up with this increasing demand, creating more complexity and tighter bottlenecks within IT. DataOps seeks to bring improvements to data integration. Borrowing methods from DevOps, which combines software development (Dev) and IT operations (Ops) to improve the velocity, quality, predictability, and scale of software development and deployment, DataOps focuses on the practices and technologies for building and enhancing data pipelines to rapidly meet business analytics needs.

Data Warehouse Automation: Business needs are in a continual state of flux. Data sets are increasingly diverse. To meet the growing need for analytics-ready data delivery at the speed of now, Qlik's Data Integration Platform automates the entire data warehouse lifecycle.