Towards Automated Data Integration in Software Analytics


Silverio Martínez-Fernández (Fraunhofer IESE), Petar Jovanovic (Universitat Politècnica de Catalunya, BarcelonaTech), Xavier Franch (Universitat Politècnica de Catalunya, BarcelonaTech), Andreas Jedlitschka (Fraunhofer IESE)

ABSTRACT

Software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. In order to support "real-time enterprise" for software organizations and provide information transparency for diverse stakeholders, we integrate heterogeneous data sources about software analytics, such as static code analysis, testing results, issue tracking systems, network monitoring systems, etc. To deal with the heterogeneity of the underlying data sources, we follow an ontology-based data integration approach in this paper and define an ontology that captures the semantics of relevant data for software analytics. Furthermore, we focus on the integration of such data sources by proposing two approaches: a static and a dynamic one. We first discuss the current static approach with a predefined set of analytic views representing software quality factors and further envision how this process could be automated in order to dynamically build custom user analyses using a semi-automatic platform for managing the lifecycle of analytics infrastructures.

CCS CONCEPTS

• Software and its engineering → Maintaining software

KEYWORDS

Data integration, real-time enterprise, ontology, software analytics

ACM Reference Format:
Silverio Martínez-Fernández, Petar Jovanovic, Xavier Franch, and Andreas Jedlitschka. 2018. Towards Automated Data Integration in Software Analytics. In International Workshop on Real-Time Business Intelligence and Analytics (BIRTE '18), August 27, 2018, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3242153.3242159

1 INTRODUCTION

Nowadays, the huge amount of data available in companies has increased their interest in applying the concept of "real-time enterprise" (https://www.gartner.com/technology/research/data-literacy/) by using up-to-date information and acting on events as they happen. In this paper, we envision automated support for the real-time enterprise concept for software organizations by means of applying recent approaches to facilitate data integration tasks.

Currently, software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. The software development process produces various types of data such as source code, bug reports, check-in histories, and test cases [23]. The data sets not only include data from the development (e.g., GitHub with over 14 million projects), but also millions of data points produced per second about the usage of software (e.g., the Facebook or eBay ecosystems). All this data can be exploited with "software analytics", which is about using data-driven approaches to obtain insightful and actionable information at the right time to help software practitioners with their data-related tasks [9]. This improves information transparency for diverse stakeholders. Bearing this goal in mind, we integrate these different data sources as a necessary first step in making this data actionable for decision-making. The integration becomes necessary because the inherent relationships in the data influencing the overall software quality are not obvious at first sight.

Despite its key role, the integration of different software analytics data still presents challenges due to the heterogeneity of the data sources. Not only do data come from sources carrying different types of information, but the same information is also stored in heterogeneous formats and tools.

Big Data analytics involves the ingestion of real-time operational data into large repositories (e.g., data warehouses or data lakes), followed by the execution of analytics queries to derive insights from the data, with the final goal of performing business actions or raising alerts [6]. In a recent systematic review, data integration and final data aggregation were reported as part of the remaining challenges in Big Data analytics [19]. At the same time, another review in software analytics reported that most of the current approaches are still analyzing only one artifact [2], thus not focusing on integrating data from different sources and getting a holistic view. Thus, further research is needed to facilitate the integration of data sources for software analytics driven by the real information needs of end users.
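To make the ingest-then-query loop in the preceding paragraph concrete, here is a minimal sketch (ours, not from the paper) that treats an in-memory SQLite database as a stand-in warehouse: operational events are ingested, an analytics query derives a metric, and a threshold triggers an alert. The table name, events, and threshold are invented for illustration.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE build_events (project TEXT, status TEXT, ts TEXT)")

# Ingest operational data points (in practice: streamed from CI, monitoring, etc.).
events = [
    ("app-core", "failed", "2018-08-27T10:00:00"),
    ("app-core", "passed", "2018-08-27T10:05:00"),
    ("app-core", "failed", "2018-08-27T10:10:00"),
]
conn.executemany("INSERT INTO build_events VALUES (?, ?, ?)", events)

# Analytics query: derive an insight (failure rate) from the ingested data.
(failure_rate,) = conn.execute(
    "SELECT AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END)"
    " FROM build_events"
).fetchone()

# Business action: raise an alert when the derived metric crosses a threshold.
ALERT_THRESHOLD = 0.5  # assumed value for illustration
if failure_rate > ALERT_THRESHOLD:
    print(f"ALERT: build failure rate {failure_rate:.0%} exceeds threshold")
```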
To overcome the heterogeneity of software analytics data coming from different sources, we follow an ontology-based data integration approach in this paper; in particular, we intend to contribute: (a) the definition of an ontology capturing the semantics of relevant data for software analytics; (b) a current static approach for the integration of heterogeneous data sources given a set of predefined analytic views representing software quality factors; and (c) an envisioned approach for the dynamic integration of heterogeneous software analytics data, guided by the specific analytical needs of end users (i.e., information requirements).

The paper is structured as follows: Section 2 presents the related work. Section 3 presents an ontology for software analytics. Section 4 presents the implementation of a static approach to the integration. Section 5 discusses how the integration could be done dynamically. Finally, Section 6 concludes the paper.
2 BACKGROUND AND RELATED WORK

2.1 Software analytics and software quality

Contrary to the availability of data and its transparency in open source software, tool support for data integration for private companies in commercial systems is just emerging. For instance, we can find some large-scale software companies with their own proprietary development environments, such as Codemine (a proprietary software analytics platform) [7], Codebook (a framework for connecting engineers and their work artifacts) [4] by Microsoft, and Tricorder (a program analysis platform aimed at building a data-driven ecosystem) [18] by Google. Still, these platforms are proprietary and not widely used by others. In addition, companies like Kiuwan, Kovair, and Tasktop have recently started offering software and services for software analytics. Furthermore, very recent research tools are CodeFeedr [21] and Q-Rapids [16]. Despite these efforts, there are still challenges for companies developing commercial systems to understand how to integrate heterogeneous data sources for software analytics.

Regarding software quality and quality models, a multitude of software-related quality models exist that are used in practice, as well as classification schemes [13]. One example is the ISO/IEC

3 INTEGRATING DATA SOURCES FOR SOFTWARE ANALYTICS

In this section, we present our software analytics use case that we will use throughout this paper.

3.1 An ontology for software analytics

We introduce an ontology that captures the semantics of relevant data for software analytics (see Fig. 1). In the ontology, each class represents an entity of the software analytics domain. For instance, the class Issue represents the issues from issue tracking systems. The ontology is abstract in order to enable generalization and applicability in different software projects. Therefore, the technologies used could differ among projects, whereas the concepts are present in most software development projects. For instance, for issue tracking systems, different companies may use different tools (e.g., Redmine, Jira, or GitLab), but all of them use issue tracking systems as a software development practice. Note that several approaches also propose automated generation of a domain ontology from the desired data sources to support data integration tasks (e.g., [20]).

The classes of the ontology in Fig. 1 represent data coming from the system either at development time or at runtime. For the sake of simplicity, we omit further ontology details (e.g., datatype properties) in Fig. 1, but explain the main process of how the ontology is built. During software development, we find data about the project and the development, which can be mapped, respectively, to the topics of improving software development process productivity and software system quality presented by Zhang et al. [23]. At runtime, we
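As a sketch of how the Issue concept of such an ontology could be expressed and populated from heterogeneous trackers, the following uses the rdflib Python library. This is our illustration, not the authors' implementation; the namespace, properties, and source records are assumptions.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace for the software analytics ontology (illustrative).
SA = Namespace("http://example.org/software-analytics#")

g = Graph()
g.bind("sa", SA)

# Abstract concept: Issue, independent of any concrete issue tracking tool.
g.add((SA.Issue, RDF.type, RDFS.Class))
g.add((SA.Issue, RDFS.comment, Literal("An issue from an issue tracking system")))

def add_issue(graph, issue_id, title, status):
    """Map a tool-specific record onto the tool-agnostic sa:Issue class."""
    node = SA[f"issue/{issue_id}"]
    graph.add((node, RDF.type, SA.Issue))
    graph.add((node, SA.title, Literal(title)))
    graph.add((node, SA.status, Literal(status)))

# The same concept populated from two heterogeneous sources (assumed records).
jira_record = {"key": "PRJ-42", "summary": "NPE on login", "state": "Open"}
redmine_record = {"id": 1337, "subject": "Crash on startup", "status_name": "New"}

add_issue(g, jira_record["key"], jira_record["summary"], jira_record["state"])
add_issue(g, redmine_record["id"], redmine_record["subject"],
          redmine_record["status_name"])

print(g.serialize(format="turtle"))
```

Because both trackers are mapped to the same class, any analytic view written against sa:Issue works regardless of which tool a project uses.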
Recommended publications
  • Scientific Data Mining in Astronomy
SCIENTIFIC DATA MINING IN ASTRONOMY. Kirk D. Borne, Department of Computational and Data Sciences, George Mason University, Fairfax, VA 22030, USA.

Abstract: We describe the application of data mining algorithms to research problems in astronomy. We posit that data mining has always been fundamental to astronomical research, since data mining is the basis of evidence-based discovery, including classification, clustering, and novelty discovery. These algorithms represent a major set of computational tools for discovery in large databases, which will be increasingly essential in the era of data-intensive astronomy. Historical examples of data mining in astronomy are reviewed, followed by a discussion of one of the largest data-producing projects anticipated for the coming decade: the Large Synoptic Survey Telescope (LSST). To facilitate data-driven discoveries in astronomy, we envision a new data-oriented research paradigm for astronomy and astrophysics: astroinformatics. Astroinformatics is described as both a research approach and an educational imperative for modern data-intensive astronomy. An important application area for large time-domain sky surveys (such as LSST) is the rapid identification, characterization, and classification of real-time sky events (including moving objects, photometrically variable objects, and the appearance of transients). We describe one possible implementation of a classification broker for such events, which incorporates several astroinformatics techniques: user annotation, semantic tagging, metadata markup, heterogeneous data integration, and distributed data mining. Examples of these types of collaborative classification and discovery approaches within other science disciplines are presented.

1 Introduction. It has been said that astronomers have been doing data mining for centuries: "the data are mine, and you cannot have them!".
  • Information Needs for Software Development Analytics
Information Needs for Software Development Analytics. Raymond P.L. Buse (The University of Virginia, USA), Thomas Zimmermann (Microsoft Research).

Abstract: Software development is a data rich activity with many sophisticated metrics. Yet engineers often lack the tools and techniques necessary to leverage these potentially powerful information resources toward decision making. In this paper, we present the data and analysis needs of professional software engineers, which we identified among 110 developers and managers in a survey. We asked about their decision making process, their needs for artifacts and indicators, and scenarios in which they would use analytics. The survey responses lead us to propose several guidelines for analytics tools in software development including: Engineers do not necessarily have much expertise in data analysis; thus tools should be easy to use, fast, and produce concise output. Engineers have diverse analysis needs and consider most indicators to be important; thus tools should at the same time support many different types of artifacts and many indicators.

decision making will likely only become more difficult. Analytics describes application of analysis, data, and systematic reasoning to make decisions. Analytics is especially useful for helping users move from only answering questions of information like "What happened?" to also answering questions of insight like "How did it happen and why?" Instead of just considering data or metrics directly, one can gather more complete insights by layering different kinds of analyses that allow for summarizing, filtering, modeling, and experimenting; typically with the help of automated tools. The key to applying analytics to a new domain is understanding the link between available data and the information needed to make good decisions, as well as the analyses
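As a toy illustration of the layered analysis this excerpt mentions (summarizing and filtering indicators with automated tooling, in concise output), here is a standard-library Python sketch; the metric names, data, and threshold are invented for the example.

```python
from statistics import mean

# Invented indicator data: code review latency (hours) per component.
review_latency = {
    "ui": [2, 5, 40, 3],
    "backend": [1, 2, 2, 4],
    "build": [30, 55, 41, 38],
}

# Summarize: one concise number per component (engineers want concise output).
summary = {component: mean(hours) for component, hours in review_latency.items()}

# Filter: surface only the components that look problematic.
SLOW = 24  # assumed threshold, in hours
hotspots = {c: avg for c, avg in summary.items() if avg > SLOW}

print("Average review latency:", summary)
print("Hotspots needing attention:", hotspots)
```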
  • Software Analytics for Incident Management of Online Services: An Experience Report
Software Analytics for Incident Management of Online Services: An Experience Report. Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang (Microsoft Research Asia, Beijing, P. R. China, {jlou, qlin, juding, qifu, dongmeiz}@microsoft.com), Tao Xie (University of Illinois at Urbana-Champaign, Urbana, IL, USA).

Abstract: As online services become more and more popular, incident management has become a critical task that aims to minimize the service downtime and to ensure high quality of the provided services. In practice, incident management is conducted through analyzing a huge amount of monitoring data collected at runtime of a service. Such data-driven incident management faces several significant challenges such as the large data scale, complex problem space, and incomplete knowledge. To address these challenges, we carried out two-year software-analytics research where we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS) targeting real scenarios in a large-scale online service of Microsoft.

incident management needs to be efficient and effective in order to ensure high availability and reliability of the services. A typical procedure of incident management in practice (e.g., at Microsoft and other service-provider companies) goes as follows. When the service monitoring system detects a service violation, the system automatically sends out an alert and makes a phone call to a set of On-Call Engineers (OCEs) to trigger the investigation on the incident in order to restore the service as soon as possible. Given an incident, OCEs need to understand what the problem is and how to resolve it. In ideal cases, OCEs can identify the root cause of the incident and fix
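The detect-alert-notify loop described in this excerpt could look roughly like the following standard-library sketch. This is our illustration, not the SAS system; the service names, thresholds, engineer identifiers, and notify stub are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    metric: str
    value: float
    threshold: float

ON_CALL_ENGINEERS = ["oce-primary", "oce-secondary"]  # assumed rotation

def notify(engineer: str, incident: Incident) -> None:
    # Stand-in for the real alerting / phone-call infrastructure.
    print(f"Paging {engineer}: {incident.service} {incident.metric}="
          f"{incident.value} breached threshold {incident.threshold}")

def check(service: str, metric: str, value: float, threshold: float) -> None:
    """Detect a service violation and trigger the incident investigation."""
    if value > threshold:
        incident = Incident(service, metric, value, threshold)
        for engineer in ON_CALL_ENGINEERS:
            notify(engineer, incident)

# One monitoring tick with invented numbers: 8% error rate vs. a 5% threshold.
check("web-frontend", "error_rate", value=0.08, threshold=0.05)
```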
  • Data Quality Through Data Integration: How Integrating Your IDEA Data Will Help Improve Data Quality
Data Quality Through Data Integration: How Integrating Your IDEA Data Will Help Improve Data Quality. Center for the Integration of IDEA Data, July 2018. Authors: Sara Sinani & Fred Edora.

High-quality data is essential when looking at student-level data, including data specifically focused on students with disabilities. For state education agencies (SEAs), it is critical to have a solid foundation for how data is collected and stored to achieve high-quality data. The process of integrating the Individuals with Disabilities Education Act (IDEA) data into a statewide longitudinal data system (SLDS) or other general education data system not only provides SEAs with more complete data, but also helps SEAs improve accuracy of federal reporting, increase the quality of and access to data within and across data systems, and make better informed policy decisions related to students with disabilities. Through the data integration process, including mapping data elements, reviewing data governance processes, and documenting business rules, SEAs will have developed documented processes and policies that result in more integral data that can be used with more confidence. In this brief, the Center for the Integration of IDEA Data (CIID) provides scenarios based on the continuum of data integration by focusing on three specific scenarios along the integration continuum to illustrate how a robust integrated data system improves the quality of data.

Scenario A: Siloed Data Systems. Data systems where the data is housed in separate systems and State Education Agency databases are often referred to as "data silos." These data silos are isolated from other data systems and databases.
  • Patterns for Implementing Software Analytics in Development Teams
Patterns for Implementing Software Analytics in Development Teams. Joelma Choma (National Institute for Space Research - INPE), Eduardo Martins Guerra (National Institute for Space Research - INPE), Tiago Silva da Silva (Federal University of São Paulo - UNIFESP).

Software development activities typically produce a large amount of data. Using a data-driven approach to decision making, such as Software Analytics, software practitioners can achieve higher development process productivity and improve many aspects of software quality based on insightful and actionable information. This paper presents a set of patterns describing steps to encourage the integration of analytics activities by development teams in order to support them in making better decisions, promoting the continuous improvement of the software and its development process. Before any procedure to extract data for software analytics, the team first needs to define their questions about what needs to be measured, assessed, and monitored throughout the development process. Once the key issues to be tracked have been defined, the team may select the most appropriate means for extracting data from software artifacts that will be useful in decision-making. The tasks to set up the development environment for software analytics should be added to the project planning along with the regular tasks. The software analytics activities should be distributed throughout the project in order to add information about the system in small portions. By defining reachable goals from the software analytics findings, the team turns insights from software analytics into actions that incrementally improve the software's characteristics and/or its development process.

Categories and Subject Descriptors: D.2.8 [Software and its engineering]: Software creation and management—Metrics. General Terms: Software Analytics. Additional Key Words and Phrases: Software Analytics, Decision Making, Agile Software Development, Patterns, Software Measurement, Development Teams.
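The question-first workflow these patterns describe (define questions, then pick metrics and data sources, then act on findings) can be pictured as a small plan structure. A hedged Python sketch, with questions, metrics, and goals invented for illustration:

```python
# Invented example of a team's analytics plan: questions drive metric selection.
analytics_plan = [
    {
        "question": "Are we getting slower at fixing defects?",
        "metric": "median days from bug-open to bug-close",
        "data_source": "issue tracker",
        "goal": "keep the median below 7 days",
    },
    {
        "question": "Which modules accumulate the most churn?",
        "metric": "lines changed per module per sprint",
        "data_source": "version control history",
        "goal": "flag modules above the 90th percentile for review",
    },
]

for item in analytics_plan:
    print(f"{item['question']} -> measure '{item['metric']}' "
          f"from {item['data_source']} (goal: {item['goal']})")
```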
  • Managing Data in Motion: Data Integration Best Practice Techniques and Technologies
Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. April Reeve. Morgan Kaufmann, an imprint of Elsevier, 225 Wyman Street, Waltham, MA 02451, USA. Copyright 2013 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices: Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
  • Informatica Cloud Data Integration
Informatica® Cloud Data Integration: Microsoft SQL Server Connector Guide. March 2019. © Copyright Informatica LLC 2017, 2019.

This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.

U.S. GOVERNMENT RIGHTS: Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.

Informatica, the Informatica logo, Informatica Cloud, and PowerCenter are trademarks or registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or trademarks of their respective owners. Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product. See patents at https://www.informatica.com/legal/patents.html.

DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose.
  • What to Look for When Selecting a Master Data Management Solution
What to Look for When Selecting a Master Data Management Solution

Table of Contents: Business Drivers of MDM; Next-Generation MDM; Keys to a Successful MDM Deployment; Choosing the Right Solution for MDM; Talend's MDM Solution; Conclusion.

Master data management (MDM) exists to enable the business success of an organization. With the right MDM solution, organizations can successfully manage and leverage their most valuable shared information assets. Because of the strategic importance of MDM, organizations can't afford to make any mistakes when choosing their MDM solution. This article will discuss what to look for in an MDM solution and why the right choice will help to drive strategic business initiatives, from cloud computing to big data to next-generation business agility. It is critical for CIOs and business decision-makers to understand the value
  • Using ETL, EAI, and EII Tools to Create an Integrated Enterprise
Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise. Colin White, Founder, BI Research. TDWI Webcast, October 2005.

Slide titles from the deck: TDWI Data Integration Study; Data Integration: Barrier to Application Development; Top Three Data Integration Inhibitors; Staffing and Budget for Data Integration; Data Integration: A Definition; Enterprise Business Data; Data Integration Architecture; Data Integration Techniques and Technologies.

Data Integration: A Definition. "A framework of applications, products, techniques and technologies for providing a unified and consistent view of enterprise-wide business data."

The architecture slide groups the space into data integration techniques (data consolidation, data federation, data propagation; changed data capture (CDC); data transformation: restructure, cleanse, reconcile, aggregate), data integration technologies (enterprise data replication (EDR), extract transformation load (ETL), enterprise content management (ECM), enterprise application integration (EAI), right-time ETL (RT-ETL), enterprise information integration (EII), and web services / service-oriented architecture (SOA)), and data integration management (data quality management, metadata management, systems management), connecting dispersed internal and external source data to integrated targets such as master data management (MDM) and business domain applications.

Data Integration Techniques and Technologies: Data Consolidation (centralized data); Extract, transformation
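To ground the ETL technique named in this excerpt, a minimal extract-transform-load sketch in Python (our illustration, using only the standard library; the source rows and target schema are invented):

```python
import sqlite3

# Extract: rows from an assumed source system (hard-coded here for brevity).
source_rows = [
    {"name": "  Alice ", "country": "de", "revenue": "1200.50"},
    {"name": "Bob",      "country": "US", "revenue": "980.00"},
]

# Transform: restructure, cleanse, and reconcile into a consistent shape.
def transform(row):
    return (
        row["name"].strip(),        # cleanse stray whitespace
        row["country"].upper(),     # reconcile country codes to one format
        float(row["revenue"]),      # restructure string amounts into numbers
    )

# Load: write the unified, consistent view into the target store.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (name TEXT, country TEXT, revenue REAL)")
target.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   (transform(r) for r in source_rows))

print(target.execute("SELECT * FROM customers").fetchall())
```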
  • Introduction to Data Integration Driven by a Common Data Model
Introduction to Data Integration Driven by a Common Data Model. Michal Džmuráň, Senior Business Consultant, Progress Software Czech Republic.

Index: Data Integration Driven by a Common Data Model; Data Integration Engine; What Is Semantic Integration?; What Is Common Data Model? Common Data Model Examples; What Is Data Integration Driven by a Common Data Model?; The Position of Integration Driven by a Common Data Model in the Overall Application Integration Architecture; Which Organisations Would Benefit from Data Integration Driven by a Common Data Model?; Key Elements of Data Integration Driven by a Common Data Model; Common Data Model, Data Services and Data Sources; Mapping and Computed Attributes; Rules; Glossary of Terms; Information Resources (Web; Articles and Analytical Reports; Literature; Industry Standards for Common Data Models); Contact; References.

Data Integration Driven by a Common Data Model: Data Integration Engine

Today, not even small organisations can make do with a single application. The majority of business processes in an organisation is nowadays already supported by some kind of implemented application, and the organisation must focus on making its operations more efficient. One way of achieving this goal is to optimize the information exchange across applications, assisted by data integration from various applications; this is collectively known as Enterprise Application Integration (EAI). Without effective Enterprise Application Integration, a modern organisation cannot run its business processes to meet ever-increasing customer demands and maintain up-to-date knowledge of its operations. Data integration is an important part of EAI.
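A tiny sketch of the mapping-and-computed-attributes idea from this excerpt (our own illustration, not from the document): two applications' records are mapped onto one common data model, including an attribute computed during mapping. All field names are invented.

```python
# Common data model: one agreed shape for "customer" across applications.
def from_crm(record):
    """Map a CRM-shaped record (invented schema) onto the common model."""
    return {
        "customer_id": record["crm_id"],
        "full_name": f"{record['first']} {record['last']}",  # computed attribute
        "city": record["city"],
    }

def from_billing(record):
    """Map a billing-system record (invented schema) onto the same model."""
    return {
        "customer_id": record["account_no"],
        "full_name": record["holder_name"],
        "city": record["address"]["city"],
    }

crm_row = {"crm_id": 7, "first": "Jana", "last": "Novakova", "city": "Praha"}
billing_row = {"account_no": 7, "holder_name": "Jana Novakova",
               "address": {"city": "Praha"}}

# Both sources now speak the common model, so downstream logic is written once.
assert from_crm(crm_row) == from_billing(billing_row)
```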
  • Evaluating and Improving Software Quality Using Text Analysis Techniques - a Mapping Study
Evaluating and Improving Software Quality Using Text Analysis Techniques - A Mapping Study. Faiz Shah, Dietmar Pfahl. Institute of Computer Science, University of Tartu, J. Liivi 2, Tartu 50490, Estonia. {shah, dietmar.pfahl}@ut.ee

Abstract: Improvement and evaluation of software quality is a recurring and crucial activity in the software development life-cycle. During software development, software artifacts such as requirement documents, comments in source code, design documents, and change requests are created containing natural language text. For analyzing natural text, specialized text analysis techniques are available. However, a consolidated body of knowledge about research using text analysis techniques to improve and evaluate software quality still needs to be established.

To contribute to the establishment of such a body of knowledge, we aimed at extracting relevant information from the scientific literature about data sources, research contributions, and the usage of text analysis techniques for the improvement and evaluation of software quality. We conducted a mapping study by performing the following steps: define research questions, prepare search string and find relevant literature, apply screening process, classify, and extract data.

Bug classification and bug severity assignment activities within defect management are frequently improved using the text analysis techniques of classification and concept extraction. The non-functional requirements elicitation activity of requirements engineering is mostly improved using the technique of concept extraction. The quality characteristic most frequently evaluated for the product quality model is operability. The most commonly used data sources are bug reports, requirement documents, and software reviews. The dominant types of research contributions are solution proposals and validation research.
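As an illustration of the bug classification use of text analysis techniques mentioned in this excerpt, a brief sketch using scikit-learn (our own; requires scikit-learn to be installed, and the bug reports and severity labels are invented toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy bug reports, labeled by severity.
reports = [
    "application crashes on startup with null pointer",
    "typo in the settings dialog label",
    "data loss when saving large files",
    "button color slightly off in dark mode",
]
labels = ["severe", "minor", "severe", "minor"]

# Text analysis: turn natural-language reports into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)

# Classification: fit a simple model for bug severity assignment.
clf = LogisticRegression().fit(X, labels)

new_report = ["crash when opening corrupted file"]
print(clf.predict(vectorizer.transform(new_report)))  # e.g. ['severe']
```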
  • Using the Qlik Data Integration Platform for Better Analytics
DATA SHEET: DATA INTEGRATION. Modern DataOps: Using the Qlik Data Integration Platform for Better Analytics (QLIK.COM)

INTRODUCTION: Enabling DataOps for Analytics-Ready Data

Qlik's Data Integration Platform (formerly Attunity solutions) automates real-time data streaming, cataloging, and publishing, so you can quickly find and free analytics-ready data and take action on it. In today's hyper-competitive business climate, real-time insights are critical. Users need robust data integration and agile analytics solutions to make decisions fast. Unlike traditional batch data movement and ETL scripting, which are slow, inflexible, and labor intensive, Qlik's data integration platform automates the creation of data streams from core transactional systems. It efficiently moves data to applications, warehouses, and lakes, on premise and in the cloud, and makes data immediately available via an Amazon-like catalog marketplace experience. With Qlik, users get the frictionless data agility they need to drive greater business value.

Data consumers need better access to their data in real time. Many existing processes and technologies simply can't keep up with this increasing demand, creating more complexity and tighter bottlenecks within IT. DataOps seeks to bring improvements to data integration. Borrowing methods from DevOps, which combines software development (Dev) and IT operations (Ops) to improve the velocity, quality, predictability, and scale of software development and deployment, DataOps focuses on the practices and technologies for building and enhancing data pipelines to rapidly meet business analytics needs.

Data Warehouse Automation: Business needs are in a continual state of flux. Data sets are increasingly diverse. To meet the growing need for analytics-ready data delivery at the speed of now, Qlik's Data Integration Platform automates the entire data warehouse lifecycle.