
Technical Report: An Overview of Data Integration and Preparation

ElKindi Rezig Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Mike Cafarella∗ Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Vijay Gadepally Lincoln Laboratory Massachusetts Institute of Technology May 2020

∗Mike Cafarella is also with the Department of Computer Science and Engineering at the University of Michigan.

Abstract

AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of an AI workflow, one often spends the first few weeks (sometimes months) in the phase we refer to as data conditioning. This step typically includes tasks such as figuring out how to prepare data for analytics, dealing with inconsistencies in the dataset, and determining which algorithm (or set of algorithms) will be best suited for the application. Larger, faster, and messier datasets such as those from Internet of Things sensors, medical devices or autonomous vehicles only amplify these issues. These challenges, often referred to as the three Vs (volume, velocity, variety) of Big Data, require low-level tools for data management, preparation and integration. In most applications, data can come from structured and/or unstructured sources and often includes inconsistencies, formatting differences, and a lack of ground-truth labels. In this report, we highlight a number of tools that can be used to simplify data integration and preparation steps. Specifically, we focus on data integration tools and techniques, a deep dive into an exemplar data integration tool, and a deep dive into the evolving field of knowledge graphs. Finally, we provide readers with a list of practical steps and considerations that they can use to simplify the data integration challenge. The goal of this report is to provide readers with a view of the state-of-the-art as well as practical tips that can be used by data creators to make data integration more seamless.

Contents

1 Executive Summary
2 Data Integration Introduction and Challenges
3 Tools and Techniques
   3.1 Data Integration Challenges
   3.2 Data Integration Tasks
4 Bridging the Gap between Data Preparation and Data Analytics
   4.1 System Architecture
      4.1.1 User-defined Modules
      4.1.2 Managing Machine Learning Models
      4.1.3 Visualization
      4.1.4 Debugging Suite
      4.1.5 Manual Breakpoints
      4.1.6 Automatic Breakpoints
5 Data Civilizer Use Case
6 Knowledge Graphs
   6.1 Background and Terminology
   6.2 Graph Databases
   6.3 Application Examples
   6.4 Wikidata
   6.5 Future Research
   6.6 The Knowledge Graph Programming System
   6.7 Example Walkthrough
   6.8 Language Features
      6.8.1 Types and Values
      6.8.2 Automatic Provenance
      6.8.3 Sharing Values and Variables
   6.9 Alternate Interaction Modes
7 Practical Steps
   7.1 Data Organization
   7.2 Data Quality and Discovery
   7.3 Data Privacy
   7.4 Infrastructure, Technology and Policy
8 Acknowledgements

1 Executive Summary

Many AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of the AI pipeline, one often spends the first few weeks (sometimes months) in the phase we refer to as data conditioning. This step typically includes tasks such as figuring out how to store data, dealing with inconsistencies in the dataset, and determining which algorithm (or set of algorithms) will be best suited for the application. Larger, faster, and messier datasets such as those from Internet of Things sensors [1], medical devices [2, 3] or autonomous vehicles [4] only amplify these issues. These challenges, often referred to as the three Vs (volume, velocity, variety) of Big Data, require low-level tools for data management and data cleaning/pre-processing. In most applications, data can come from structured and/or unstructured sources and often includes inconsistencies, formatting differences, and a lack of ground-truth labels. In practice, there is a wide diversity in the quality, readiness and usability of data [5]. It is widely reported that data scientists spend at least 80% of their time doing data integration and preparation [6]. This process involves finding relevant datasets for a particular task (i.e., data discovery [7]). After the data has been collected, it needs to be cleaned so that it can be used by subsequent ML models [8]. Data cleaning or data preparation refer to several data transformations to make the data reliable for future analysis. Example transformations include: (1) finding and fixing errors in the data using rules (e.g., an employee salary should not exceed $1M); (2) normalizing value representations [9]; (3) imputing missing values [10, 11]; and (4) detecting and reconciling duplicates. This document provides a survey of the state-of-the-art in data preparation and integration. To help make concepts clearer, we also discuss a tool developed by our team called Data Civilizer [8]. Additionally, we detail the concept of Knowledge Graphs - a technique that can be used to describe structured data about a topic and a tool for organizing data with applications in data preparation, integration and beyond. Finally, this document highlights a number of practical steps, based largely on experience, that can be used to simplify data preparation and integration. A summary of these steps is given below:

1. Data Organization
   • Leverage tools for schema integration
   • Use standardized file formats and naming conventions

2. Data Quality and Discovery
   • Maintain data quality through enforceable constraints
   • Leverage workflow management systems for testing various hypotheses on the data
   • Include data version control
   • Leverage data discovery tools

3. Data Privacy
   • Leverage encrypted or secure data management systems as appropriate
   • Limit data ownership
   • Avoid inadvertent data releases
   • Be aware that aggregated data may be sensitive

4. Infrastructure, Technology and Policy
   • Use data lakes as appropriate
   • Databases and files can be used together
   • Leverage federated data access
   • Ensure access to the right talent
   • Pay attention to technology selection, software licensing and software/hardware platforms
   • Maintain an acquisition and development environment conducive to incorporating new technology advances

2 Data Integration Introduction and Challenges

Many AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of the AI pipeline, one often spends the first few weeks (sometimes months) in the phase we refer to as data conditioning. This step typically includes tasks such as figuring out how to store data, dealing with inconsistencies in the dataset, and determining which algorithm (or set of algorithms) will be best suited for the application. Larger, faster, and messier datasets such as those from Internet of Things sensors [1], medical devices [2, 3] or autonomous vehicles [12] only amplify these issues. These challenges, often referred to as the three Vs (volume, velocity, variety) of Big Data, require low-level tools for data management, data integration [13] and data preparation (this includes data cleaning/pre-processing). In most applications, data can come from structured and/or unstructured sources and often includes inconsistencies, formatting differences, and a lack of ground-truth labels. One interesting way to think about data preparedness is through the concept of data readiness levels as outlined in [5], which is similar to the idea of technology readiness levels (TRL) [14].
It is widely reported that data scientists spend at least 80% of their time doing data integration and preparation [6, 15]. This process involves finding relevant datasets for a particular data science task (i.e., data discovery [7]). After the data has been collected, it needs to be cleaned so that it can be used by subsequent ML models [8]. Data cleaning or data preparation refer to several data transformations that make the data reliable for future analysis. Example transformations include: (1) finding and fixing errors in the data using rules (e.g., an employee salary should not exceed $1M); (2) normalizing value representations [9]; (3) imputing missing values [10, 11]; and (4) detecting and reconciling duplicates.
At a high level, data integration and preparation is the effort required to go from raw sensor data to information that can be used in further processing steps (see the Artificial Intelligence architecture provided in [16]). Sometimes this phase is also referred to as data wrangling. Typically, each of these data integration and preparation tasks can be cumbersome, require significant domain knowledge, and represent a significant hurdle in developing an end-to-end application. Many of the recent algorithmic advances have, in fact, occurred in areas where "prepared" data can be found. For example, advances in image classification were largely driven by the availability of the ImageNet dataset [17], advances in handwriting recognition by the Modified National Institute of Standards and Technology (MNIST) dataset [18], and advances in video recognition by the Moments in Time dataset [monfort2018moments]. Other popular datasets such as CIFAR-10 [19] and Internet packet capture traces [20, 21, 22] have played important roles in advancing certain classes of algorithms and genres of applications. Data integration and preparation is often a first step in getting one's data into such a form.
There are a number of research efforts and organizations aiming to reduce the data integration and preparation barrier to entry. For example, there are techniques to do spectral fingerprinting of datasets to look for outliers [23], techniques for modelling the background of datasets [24], and techniques for organizing data into federated data management systems [25, 26, 27, 28]. As examples of end-to-end data applications, [29] describes an oceanographic data use-case and [30] describes a medical use-case.
In this report, we highlight a number of tools that can be used to simplify data integration and preparation steps. Specifically, we focus on data integration tools and techniques, a deep dive into an exemplar data integration tool, and a deep dive into the evolving field of knowledge graphs. Finally, we provide readers with a list of practical steps they can use to simplify the data integration challenge. The goal of this report is to provide readers with a view of the state-of-the-art as well as practical tips that can be used by data creators to make data integration more seamless.

Figure 1: Data Civilizer architecture

3 Tools and Techniques

3.1 Data Integration Challenges

It is widely reported that data scientists spend at least 80% of their time doing data integration and preparation [6]. This process involves finding relevant datasets for a particular data science task (i.e., data discovery [7]). After the data has been collected, it needs to be cleaned so that it can be used by subsequent ML models [8]. Data cleaning or data preparation refer to several data transformations to make the data reliable for future analysis. Example transformations include: (1) finding and fixing errors in the data using rules (e.g., an employee salary should not exceed $1M); (2) normalizing value representations [9]; (3) imputing missing values [10, 11]; and (4) detecting and reconciling duplicates.
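As a rough illustration of these transformations (and not a description of any specific tool in this report), the following Python sketch uses pandas on a hypothetical employees table with made-up column names; it applies a salary rule, normalizes a value representation, imputes a missing value, and flags candidate duplicates.

    import pandas as pd

    # Hypothetical employee records exhibiting typical quality problems.
    df = pd.DataFrame({
        "name":   ["Ann Lee", "Ann Lee", "Bo Chen", "Dana Fox"],
        "dept":   ["Computer Sci.", "CS", "CS", "Computer S."],
        "salary": [95000, 95000, None, 2500000],
    })

    # (1) Rule-based error detection: an employee salary should not exceed $1M.
    violations = df[df["salary"] > 1_000_000]

    # (2) Normalize value representations: map variant spellings to "CS".
    df["dept"] = df["dept"].replace({"Computer Sci.": "CS", "Computer S.": "CS"})

    # (3) Impute missing salaries, here with the column median.
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # (4) Detect candidate duplicates on (name, dept) for later reconciliation.
    duplicates = df[df.duplicated(subset=["name", "dept"], keep=False)]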

3.2 Data Integration Tasks

As part of our Data Civilizer effort [6], we have developed modules that deal with the most common data integration tasks. Figure 1 illustrates the architecture of Data Civilizer. Users author their data integration workflows using the Studio. The building blocks of the workflows are pre-packaged data integration modules that were designed to target the most common pain points in data integration. The Engine then runs the workflow. We briefly describe each module and the task it addresses.
Data Discovery: First off, data scientists need to find relevant data tables from disparate data sources that they could use for a particular task (e.g., find customers who have been making timely payments over the past 5 years). This step is referred to as Data Discovery. Data Civilizer includes a module, Aurum [7], that allows users to find relevant data tables given various queries (e.g., is there a PK-FK relationship between the customers table and the payment history?).
Data Preparation: This step, also known as Data Cleaning, encompasses all the data transformations that are needed to bring the data to a "usable" format. Examples of data preparation include data standardization (e.g., use CS instead of Computer Sci. or Computer S.) and filling in missing values. We now present some of the data preparation modules we have developed in Data Civilizer.

• PKDuck [9] performs approximate string joins to map abbreviations to their full forms in a given set of table columns. PKDuck helps standardize value representations for downstream processing by replacing the full string forms with their corresponding abbreviations.
• ImputeDB [10] performs query-time missing value imputation. Given a query, ImputeDB generates a query plan that includes imputation operators to efficiently impute missing numerical values using any pluggable statistical imputation technique.
• Fahes [11] detects disguised missing values (DMVs) in the data using outlier and inlier detection techniques for categorical and numerical data. Fahes replaces DMVs with NULL in each input table.

• DeepER [31] performs entity resolution using deep learning methods. Specifically, DeepER benefits from pre-trained word embeddings and user-provided training data to detect duplicate records. Additionally, DeepER makes use of blocking to reduce the number of record comparisons.

• Aurum [7] helps users navigate a large collection of tables to perform data discovery by building an Enterprise Knowledge Graph (EKG) that encodes relationships between those tables. Users can query the EKG to answer various data discovery questions (e.g., which tables are joinable).
• A Golden Record [32] module was developed to consolidate duplicates into a single record (golden record). Before attempting to fuse the duplicate records, this module first standardizes value representations by learning candidate transformations from the data. Those transformations are then presented to the user for validation.

We now move on to the latest version of Data Civilizer, which, in addition to the aforementioned data integration features, includes major improvements geared towards making the jobs of data scientists easier.

4 Bridging the Gap between Data Preparation and Data Analytics

Data scientists spend the bulk of their time cleaning and refining data workflows to answer various analytical questions. Even the simplest tasks require using a collection of tools to clean, transform and then analyze the data. When a machine learning model does not produce accurate results, it is typically because (1) the raw data was not prepared correctly (e.g., missing values); or (2) the model needs to be tuned (e.g., fine-tuning of the model's hyperparameters). While there are many efforts to address those two problems independently, there is currently no system that addresses both of them holistically. Users need to be able to iterate between data preparation and fine-tuning their machine learning models in one workflow system. We worked with scientists at the Massachusetts General Hospital (MGH), one of the largest hospitals in the US, to accelerate their workflow development process. Scientists at MGH spend most of their time building and refining data pipelines that involve extensive data preparation and model tuning. Through our interaction, we pinpointed the following hurdles that stand in the way of fast development of data science pipelines (in the sequel, we use the words "pipeline" and "workflow" interchangeably).
Decoupling Data Cleaning and Machine Learning: When it comes to building complex end-to-end data science pipelines, data cleaning is often the elephant in the room. It is estimated that data scientists spend most of their time cleaning and pre-processing raw data before being able to analyze it. While there are a few emerging machine learning frameworks [33, 34, 35], they fall short when it comes to data cleaning support. There is currently no interactive end-to-end framework that walks users from the data preparation step to training and running machine learning models.
Coding Overhead: In larger organizations, it is typically the case that several scientists/engineers write scripts that deal with different parts of the data science pipeline. While many data science toolkits and libraries (e.g., scikit-learn) have gained wide adoption among data scientists, they are only meant to build standalone components and hence are not well-suited to building and maintaining pipelines involving a wide variety of tools and datasets. As a result, scientists have to write code to build and maintain data pipelines and update the code whenever they need to refine them. Because building data pipelines is a trial-and-error process, maintaining scripts hardwired for specific pipelines is time-consuming. Moreover, the effort required to try out different pipelines typically limits the exploration space.
Debugging Pipelines: When building a pipeline involving different modules and datasets, it is typical that the final output data does not look right. This is typically due to (1) a problem in the modules (e.g., a bug or bad parameters); or (2) input data that was not good enough for the modules to produce reasonable results (e.g., missing values). The latter case is hard to debug using current debuggers, which focus mainly on code; users have to dump and inspect intermediate data to find where it went wrong. Since it takes hundreds of iterations to converge to a pipeline that works well for the task at hand, a data-driven debugger can significantly decrease the time spent in this process.
Visualization: Different datasets require different types of visualizations (e.g., time series, tables). Typically, scientists visualize the data in its raw format (e.g., tables) or manually visualize the data using commodity software like Microsoft Excel. However, when building pipelines iteratively, it is daunting to seamlessly integrate visualization applications (panning, zooming) into the pipeline-building process. Moreover, users need to spend a lot of time if they elect to write custom visualizations of their datasets.

[Figure 2 shows the DC2 Studio (Kyrix-based visualization with progress tracking, multi-canvas table views, coordinated views, and workflow editing), the workflow manager and debugger (compile, execute, filter, track, breakpoints, pause/resume over data and metadata sources), and user-defined modules exposed through a programming interface with a JSON I/O specification, including cleaning modules (missing values, entity resolution, golden record, similarity join) and analytics modules (ModelDB, scikit-learn).]

Figure 2: DC2 Architecture

There are several efforts to support data cleaning tasks [36, 37, 38], iterative machine learning workflow development [39, 34, 33, 40], and data workflow debugging [41]. Each of those efforts focuses on one aspect of pipeline development at a time, but not all of them. The previous version of Data Civilizer [6, 7, 42] focused on data discovery and cleaning using pre-defined tools. In most scenarios, users clean their data to feed it to machine learning models. We introduce Data Civilizer 2.0 (DC2, for short) to fill the gap between data cleaning and machine learning workflows and to accelerate iterative pipeline building through robust visualization and debugging capabilities. In particular,

DC2 allows integrating general-purpose data cleaning and machine learning models into workflows with minimal coding effort. The key features of DC2 are:

• User-defined modules: In addition to the state-of-the-art cleaning and discovery toolkit that we already provide [42], users can integrate their own data cleaning and machine learning code into DC2 workflows through a simple API implementation. Users simply have to implement a function that triggers the execution of the module they are adding.
• Debugging: DC2 features a full-fledged debugger that assists users in debugging their pipelines at the data level rather than the code level. For instance, users can run workflows on a subset of the data, track particular records through the workflow, pause the pipeline execution to inspect the output produced so far, and so on.
• Visualization: At the core of DC2 is a component that allows users to easily implement their own visualizations to better inspect the output of the pipeline's components. We have pre-packaged a few visualizations such as progress bars for arbitrary services, coordinated table views, etc.

4.1 System Architecture

We provide a high-level description of the DC2 architecture (Figure 2); details are discussed in the subsequent subsections. DC2 includes three core components: (1) User-Defined Modules, which cover the functionality required to plug existing user-defined modules into the workflow system (Section 4.1.1); (2) the Debugger, which includes a set of operations for data-driven debugging of pipelines (Section 4.1.4); and (3) Visualization abstractions to facilitate building scalable visualization applications for inspecting the data produced at different stages of the pipeline (Section 4.1.3). Users interact with DC2 using the DC2 Studio, a front-end Web GUI for authoring and monitoring pipelines.

4.1.1 User-defined Modules

Users can plug any of their existing code into a DC2 workflow. Because cleaning and machine learning tools can vary widely, DC2 features a programming interface that is abstract enough to cover any data cleaning or machine learning module.
Module Specification. In order to specify a new module in DC2, users must (1) implement a module execution function (executeService) using the DC2 Python API; (2) load the module into DC2 by specifying its entry point file, i.e., the file that contains the implementation of the module execution function; and (3) write a JSON file listing the parameters the module requires for execution.
Pipeline Execution. Service execution happens in two phases: (1) the Studio generates a JSON object containing the authored workflow, which includes module names, parameters and the connections between modules. This JSON object is then passed to the backend (the workflow manager in Figure 2) to run the workflow; and (2) every module produces a JSON object containing the paths of output CSV files, which are then passed to the next module in the workflow. All DC2 modules follow a "table-in, table-out" principle, i.e., the input and output of every module is a table. If a module fails to run, an error code is sent back to DC2 and the pipeline execution is stopped.
executeService: The module execution function (executeService) takes as argument the JSON file generated from the DC2 Studio. This JSON file contains the parameter values as specified from the Studio for the individual modules as well as the authored workflow. Every module (1) reads a set of CSV input files; (2) writes a set of CSV output files; and (3) might use additional files if specified as an argument. Every module can produce various output streams. We separate them into output and metadata. Files produced under the output stream are passed on to any successor modules in the pipeline, while files in the metadata stream are just meant to serve as "logs" that users can inspect to debug the module. For instance, a similarity-based deduplication module can produce an output stream containing the deduplicated tuples and a metadata stream that includes the similarity scores between pairs of tuples that were marked as duplicates. Each module has to produce a JSON file (output JSON) that specifies which files are produced as output or metadata.
I/O Specification. Every DC2 module is associated with a JSON file (input JSON) containing the list of parameters the module expects and their types. Additionally, the input JSON contains the module metadata (e.g., module name, module file path). DC2 Studio needs this specification to load the module into the GUI (e.g., if a module expects two parameters, two input fields are created in the GUI for that module).
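To make this contract concrete, the following is a minimal sketch of what a user-defined module's entry point file might look like. The JSON field names (params, inputs, output, metadata) and file names are illustrative assumptions; the actual DC2 API and file layouts may differ.

    import json
    import pandas as pd

    def executeService(config_path):
        # The Studio passes a JSON file with the module's parameter values and input files.
        with open(config_path) as f:
            config = json.load(f)

        threshold = float(config["params"]["threshold"])  # declared in the module's input JSON
        table = pd.read_csv(config["inputs"][0])          # table-in

        kept = table[table["score"] >= threshold]         # example cleaning step
        dropped = table[table["score"] < threshold]

        kept.to_csv("kept.csv", index=False)              # table-out, passed downstream
        dropped.to_csv("dropped.csv", index=False)        # log-style file for inspection

        # Output JSON separating the output stream from the metadata stream.
        with open("module_output.json", "w") as f:
            json.dump({"output": ["kept.csv"], "metadata": ["dropped.csv"]}, f)
        return 0                                          # a failure would instead signal an error code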

4.1.2 Managing Machine Learning Models

DC2 supports adding machine learning models to the workflow. We integrated ModelDB [43] into DC2 to offer first-class support for machine learning model development. ModelDB supports the widely used scikit-learn library in Python. Users who include machine learning modules in the pipeline can (1) track the models on defined metrics (e.g., RMSE, F1 score); (2) implement the ModelDB API to manage models built using any machine learning environment; (3) query models' metadata and metrics through the frontend; and (4) track every run of the model and its associated hyperparameters and metrics.
Moreover, we have implemented a generalization of ModelDB to track metrics in any user-defined module through a light API. The DCMetric class contains the following methods:
• DCMetric(metric name): constructor which takes the name of the metric as a string (e.g., f1 score).
• setValue(value): sets the metric value. The metric can be set multiple times per run, but only the final set value is exposed in DC2 Studio.

• DC.register(metric): the defined metric object is registered through this function. Registration is required so the metric is surfaced in the Studio.
The following is an example code snippet to track a metric "f1". First, the metric is defined (line 1). Then, the metric value is set (line 2). The metric value is finally reported to DC2 (line 3).

1 metric_f1 = DCMetric("f1")
2 metric_f1.setValue(f1score)
3 DC.register_metric(metric_f1)

4.1.3 Visualization

MGH datasets are massive. For instance, the one we use in this use case is 30TB. Because we wanted to enable interactive visualizations at scale, we integrated Kyrix [44], a state-of-the-art visualization system for massive datasets, into DC2. With Kyrix, users can write simple code to build intuitive visualization applications that support panning and zooming. The MGH scientists we worked with confirmed that visualization is a key component in making it easier for them to inspect their datasets. While users can write their own visualization applications using the Kyrix API, DC2 comes with a few generic visualization applications: (1) Progress reporting: services report their progress periodically to the Studio through a progress bar; (2) Multi-Canvas Table Views (MCTV): users can click on arcs interconnecting modules in the pipeline to visually inspect the intermediate records passing between the modules and run queries on them (e.g., filter based on a predicate); and (3) Coordinated Views: in the MCTV, when users select a record in one canvas, other records are selected on other canvases based on a user-defined function (e.g., provenance, records sharing the same key). DC2 comes with an API for easy integration of Kyrix visualization applications in the DC2 Studio (e.g., show a visualization application after clicking on a particular module).

4.1.4 Debugging Suite

We have seen pipelines that run for hours, so the goal of the DC2 debugger is to catch data-related anomalies (e.g., input data that is malformed in one of the modules) early in the workflow execution, so that "bad" data is not passed to downstream processing. DC2 features a set of human-friendly debugging operations to assist users in debugging their pipelines. We implement a GDB-like debugger that is data-driven. Users can add breakpoints by specifying a record or a set of records that satisfy predicates. Pipeline execution is paused upon reaching a breakpoint so that users can visually inspect what is going on so far in the pipeline. The following are the key debugging operations that DC2 provides.

• filter: while building a data pipeline, users typically experiment with smaller datasets before testing their pipelines on the entirety of the data. The filter operation allows users to specify a set of predicates to extract smaller subsets from the input datasets. For instance, if the filter is City = "Chicago", then only records with a City value of "Chicago" will be passed as input to the respective module.
• track: an important operation when refining pipelines is being able to track a set of records to make sure the pipeline is working as expected. Users can specify filters to track records in the pipeline (e.g., track records whose City attribute value is "Chicago"). Whenever a record satisfies the defined filter, it is added to a tracking file which contains (1) the attribute values of the record before and after going through the module; and (2) information about the module that produced the record (e.g., name of the module, list of parameter values). A small sketch of the filter and track semantics follows this list.
• breakpoints: users can specify breakpoints in the pipeline using filters. Whenever a record satisfies the filter, the execution is paused to allow the user to inspect the record at the breakpoint. Users can then manually resume the execution.
• pause/resume: this is a way for users to pause/resume the execution from the Studio. This functionality is implemented using breakpoints (more details in Sections 4.1.5 and 4.1.6). This operation is useful when users only want a certain module to run for a limited period of time (e.g., pause after 5 seconds). Once users have inspected and validated the output, they can resume the execution.
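The following sketch illustrates the intended semantics of filter and track on a stream of records; the predicate format and the tracking-file layout are illustrative assumptions rather than the actual DC2 implementation.

    import csv

    def matches(record, predicate):
        # A predicate maps attribute names to required values, e.g. {"City": "Chicago"}.
        return all(record.get(attr) == value for attr, value in predicate.items())

    def run_module(records, module_fn, filter_pred=None, track_pred=None,
                   tracking_path="tracking.csv", module_name="dedup"):
        outputs = []
        with open(tracking_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["module", "before", "after"])
            for rec in records:
                # filter: only records satisfying the predicate are passed to the module.
                if filter_pred and not matches(rec, filter_pred):
                    continue
                out = module_fn(rec)
                # track: log the before/after images of records satisfying the predicate.
                if track_pred and matches(rec, track_pred):
                    writer.writerow([module_name, rec, out])
                outputs.append(out)
        return outputs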

4.1.5 Manual Breakpoints

Data breakpoints serve as "inspection" points in the pipeline, i.e., they are used to inspect records of interest. For instance, in a deduplication module, if users notice that records whose City value is "Chicago" are always incorrectly deduplicated, they can add a breakpoint on records that meet the filter City = "Chicago"; the pipeline execution is then paused whenever a record that meets the filter is encountered. We provide an API to allow users to programmatically define functions that set data-driven breakpoints. Those functions are used by the DC2 Studio to allow users to interactively set breakpoints on records that satisfy a given user-provided filter. Three key functions need to be implemented in the entry point file (the file containing the DC2 API implementation) to enable manual breakpoints: (1) setBreakpoint, which takes as argument a filter (e.g., City = "Chicago"); (2) pause, to pause the execution when a record satisfying the filter is encountered; and (3) resume, to resume the execution after the user has inspected the records at the breakpoint.
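A minimal sketch of these three functions is shown below. The single attribute/value filter, the threading.Event used to block execution, and the assumption that resume is invoked from the Studio (i.e., a different thread) are simplifications, not the real entry point implementation.

    import threading

    _breakpoint_filter = None          # e.g., ("City", "Chicago")
    _resume_event = threading.Event()
    _resume_event.set()                # execution runs freely until a breakpoint is hit

    def setBreakpoint(attribute, value):
        # Register a data breakpoint: pause whenever a record has attribute == value.
        global _breakpoint_filter
        _breakpoint_filter = (attribute, value)

    def pause():
        # Block the module until the user resumes from the Studio.
        _resume_event.clear()
        _resume_event.wait()

    def resume():
        _resume_event.set()

    def process_record(record):
        if _breakpoint_filter and record.get(_breakpoint_filter[0]) == _breakpoint_filter[1]:
            pause()                    # record of interest reached: hand control to the user
        # ... module-specific processing would continue here ...
        return record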

4.1.6 Automatic Breakpoints

In some cases, implementing the API to enable manual breakpoints can be time-consuming. To address this hurdle, DC2 can create breakpoints in modules automatically (i.e., without requiring users to implement an API). This is done by partitioning the input data (of the module) into different subsets and running the module with each partition. The goal is to be able to detect errors in the output of the module with fewer partitions than with the entirety of the data. For instance, when running a classification module (with an already trained model), users might want to inspect the output for every 10% of the input data, which results in nine breakpoints, i.e., the output is shown after 10%, then after 20%, and so on. Additionally, the classification label of a given record does not change whether we run the model on the entire data or only on a partition. If users detect misclassified records in a run using 20% of the input data, then there is no reason to run the module on the remaining 80% of the records. Moreover, users can specify predicates to create partitions (blocking). For instance, "City = *" would create partitions (or blocks) where records in the same partition share the same value of the City attribute. Users can create automatic breakpoints from the DC2 Studio.
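The partitioning idea can be sketched as follows; the ten-percent partition size and the inspect callback are illustrative, and DC2's actual mechanism is driven from the Studio rather than from user code.

    def run_with_automatic_breakpoints(table, module_fn, num_partitions=10, inspect=print):
        # Run module_fn on successive slices of the input and stop after each slice
        # so errors can be caught before the entire dataset has been processed.
        step = max(1, len(table) // num_partitions)
        outputs = []
        for start in range(0, len(table), step):
            partition = table[start:start + step]
            outputs.extend(module_fn(partition))
            inspect(f"{min(start + step, len(table))} of {len(table)} records processed")
            inspect(outputs[-5:])      # automatic breakpoint: surface the latest output
        return outputs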


Figure 3: (a) EEG pipeline example. (b) Visualization of numbered components.

5 Data Civilizer Use Case

We used DC2 in a real-world medical use case with a group of scientists at MGH studying brain activity data captured using electroencephalography (EEG). Figure 3(a) illustrates an example pipeline to clean the EEG data before running it through a machine learning model. In Figure 3(a), each numbered module in the pipeline has its corresponding visualization in Figure 3(b) (e.g., the module numbered 1 corresponds to raw data input).
Study. Scientists at MGH start with a study goal (e.g., early detection of seizures using EEG data), and then prepare the relevant datasets using cleaning modules. They then apply machine learning models to perform a prediction task. In this use case, they want to predict seizure likelihood given labeled EEG segments. This process is iterative in nature and it takes several iterations to converge to a "good" data pipeline. We helped the MGH scientists clean and then analyze the EEG data using machine learning models. We walk the reader through how DC2 was used to help quickly design and execute data pipelines to carry out the study at hand.
Dataset. The EEG dataset pertains to over 2,500 patients and contains 350 million EEG segments. The total dataset size is around 30TB. Active learning is employed to iteratively acquire more and more labeled EEG segments as described in the scenario below.
Scenario. The workflow scenario goes as follows: (1) Raw EEG data is cleaned. In addition to the cleaning toolkit that comes with DC2, we plugged the cleaning tools MGH scientists use into DC2 as user-defined modules. An example cleaning task is to remove high-frequency signals (e.g., area A in Figure 3(b)). (2) Using the visualization component of DC2, the specialists interactively explore the 30TB of EEG data and then label EEG segments based on their domain knowledge. (3) After acquiring a set of manually labeled segments, a label propagation algorithm, added as a user-defined component of DC2, automatically propagates labels to the segments near the existing labeled segments. (4) A deep learning model is then trained using part of the labeled segments as a training set. During this process, the DC2 debugger is used extensively to tune the hyper-parameters and the network structure. (5) Active learning is then conducted to improve the quality of the automatically acquired labels. First, the labeled segments outside the training set are classified by the learned model. Then, using the ModelDB component of DC2, the 2,000 segments where the neural net had the highest confidence but disagreed with the labels are efficiently extracted. (6) These segments are then fed back into the visualization component for the domain experts to decide whether they need to update their labels (go back to step 3) or review the cleaning step (go back to step 1). This iterative process proceeds until the neural net reaches a satisfactory classification accuracy.

6 Knowledge Graphs

6.1 Background and Terminology

Knowledge Graphs — sometimes called Knowledge Networks or structured Knowledge Bases — are a relatively novel way to describe structured data about a particular topic. Examples include Wikidata ([45]), DBpedia ([46]), the Google Knowledge Graph ([47]), UniProt ([48]), MusicBrainz ([49]), GeoNames ([50]), and many others ([51, 52, 53]). They have had slightly different definitions over the years. For the sake of this document, we think of a Knowledge Graph as a data resource in which:

• There exists a set of unique entities that correspond to real-world objects. For example, entity Q76 represents Barack Obama in the Wikidata knowledge network, and entity Q6279 represents Joe Biden. Different knowledge graphs make different decisions about what entities should be contained. For example, there is a Joe Biden entity in Wikidata, but not in the MusicBrainz knowledge network, which specializes in recorded music1. These can be thought of as the nodes in the knowledge graph.
• There are unique properties that describe a relationship between an entity and a data value. For example, Wikidata property P19 describes the place of birth relationship. For most knowledge graphs, properties are expressed with two values, at least one of which must be an entity. For example, the place of birth property describes a relationship between two entities in the knowledge graph, one of which is usually a person, and the other usually a location. In contrast, Wikidata's property P569 (date of birth) usually describes a relationship between a single entity and a data value. These can be thought of as potential edge labels in the knowledge graph.

• By combining entities, properties, and data values, the knowledge graph asserts a large number of true facts about real-world objects. For example, Wikidata states that (Q76, P26, Q13133) is true. That is, Barack Obama (Q76) has spouse (P26) of Michelle Obama (Q13133). These triples can be thought of as concrete labeled edges in the knowledge graph.
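As a toy illustration of this data model, the entities, properties, and triples above can be represented directly in Python. The identifiers are the Wikidata IDs mentioned in the text; the in-memory representation itself is only a sketch, not how any production knowledge graph is stored.

    # Entities and properties use the Wikidata identifiers from the examples above.
    entities = {"Q76": "Barack Obama", "Q13133": "Michelle Obama", "Q6279": "Joe Biden"}
    properties = {"P26": "spouse", "P19": "place of birth", "P569": "date of birth"}

    # A knowledge graph is, at its core, a set of (subject, property, object) triples.
    triples = {
        ("Q76", "P26", "Q13133"),        # Barack Obama --spouse--> Michelle Obama
        ("Q76", "P569", "1961-08-04"),   # a property whose object is a data value, not an entity
    }

    def objects_of(subject, prop):
        # Return every object asserted for a given subject and property.
        return [o for (s, p, o) in triples if s == subject and p == prop]

    print(objects_of("Q76", "P26"))      # ['Q13133']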

Our definition for a knowledge graph places it outside the ambiguity and reference problems often caused by natural language. For example, Wikidata properties P19 (place of birth) and P569 (date of birth) could both be described in English by the phrase "born in," but this point is irrelevant for the knowledge graph's correctness. There are two distinct real-world relationships, so there are two distinct properties in Wikidata. Some applications, such as voice assistant question answering, might have to manage linguistic ambiguity, but that challenge is the responsibility of the application, not the knowledge graph.

1Because Barack Obama recorded the audiobooks for his self-authored books, he appears in both Wikidata and MusicBrainz. His MB identifier is 0de4d19f-05c8-4562-a3c0-7abdc144f1d5.

[Figure 4 depicts, on the left, nodes such as Barack_Obama, Michele_Obama, Honolulu_HI, and Chicago_IL connected by edges labeled SpouseOf, BornIn, and LivedIn; on the right, nodes such as US_City_Data_2019, US_College_Town_Data_2019, and US_State_Data_2019 connected by edges labeled FilteredSubsetOf, DescribesEntity, DescribesYear, and Aggregation.]

Figure 4: On the left, a fraction of a general-purpose knowledge network. On the right, a fraction of an Economics-specific knowledge network.

There is nothing preventing us from using KGs to contain entire datasets rather than single scalar values. For example, it could make sense to create an entity to model a well-known GDP dataset for practicing economists, which has multiple columns and rows; or a set of vessels when managing a fleet. Indeed, Wikipedia already does something similar when its entities contain datasets, as in the case of the Wikipedia entity Economy of the United States, which contains annual stats for GDP, inflation, unemployment, and so on. Figure 4 shows two small example knowledge graphs. The figure on the left shows a portion of a general-purpose knowledge graph; its edge labels appear on many edges in the graph. The figure on the right shows an Economics-specific knowledge graph.
Our definition is not entirely consistent with all of the academic literature. There are academic knowledge graphs that could potentially create a node for every distinct noun phrase in a text, even if they refer to identical real-world objects, such as VerbKB ([54]). However, our definition is consistent with the major deployed knowledge graphs, such as Wikidata, DBpedia, and many others2.

6.2 Graph Databases

It is useful to distinguish between a knowledge graph and graph database software. Graph database systems, such as Neo4j [59] or Amazon Neptune [60], offer query and update services for graph-oriented datasets, much like relational database systems do for traditional relational datasets. Just as traditional database software like Oracle offers the SQL relational query language, graph databases will offer a query language tailored for graphs. The most popular language is probably SPARQL, although Neo4j's Cypher language is both well-known and especially relevant when processing knowledge graphs.
Like SQL, graph query languages enable the user to ask precise and interesting analytical questions, but most users do not have the training needed to

2Linguistically-driven networks like VerbKB are useful in some cases. For example, Biperpedia is a "best effort" data resource extracted from text that can help search applications ([55]). Also, there is a growing literature in the NLP area that attempts to use text directly for question answering ([56, 57, 58]), without producing a knowledge network at all, but these efforts today are primarily still in the research sphere and it is unclear how to extend them to non-question-answering applications.

write them. For example, here is a simple query (drawn from the World Wide Web Consortium's SPARQL tutorial) that obtains a list of 50 spacecraft from a knowledge graph:

    PREFIX space: <http://purl.org/net/schemas/space/>
    SELECT ?craft
    { ?craft a space:Spacecraft } LIMIT 50

It is possible to understand this code, but writing even this simple query will require user training.
Graph databases are designed to store a wide variety of graph-structured information, not just knowledge graph data. They are also useful for storing social network data, hyperlink data, and other kinds of graph data. Indeed, most applications of graph databases do not involve knowledge graphs. Knowledge graphs are usually of fairly modest size compared to other kinds of graph-structured data. Consider that as of this writing, Facebook has about 2.5 billion users in its social graph, while Wikidata has only 81 million entities in its knowledge graph.
Graph database systems and knowledge graphs are sometimes confused, but they play different roles. A knowledge graph is a certain kind of dataset, while a graph database system is a piece of software. Although they are often useful together, one cannot be substituted for the other.

6.3 Application Examples

There has been a substantial amount of research into methods for constructing knowledge graphs, such as information extraction [61, 51, 62, 40, 63, 52, 64, 65] and crowdsourcing [66, 67, 68, 69]. In contrast, there has been relatively little research into the knowledge applications built on top of KGs, outside a few important but narrow use cases, such as personal assistants [70] and precision medicine [71].
The best-known industrial use cases of knowledge graphs are structured web search (see Figure 5) and voice assistants such as Siri, the Google Assistant, and Amazon's Alexa product. In both types of systems, a knowledge graph serves as a generic multitopic database to produce user answers.
It is perhaps useful to consider the rough pipeline of steps that take place in a voice assistant to understand the role that a knowledge graph plays. Imagine that a user has just asked the Google Assistant, "Where was Barack Obama born?" Conventional engineering wisdom suggests the voice assistant goes through the following steps:
1. The device records user speech saying, "Where was Barack Obama born?"
2. Voice assistant software uses a speech-to-text system to translate the utterance into a textual string

Figure 5: A Google search yields the traditional list of hyperlinks on the left and, often, knowledge graph-derived structured results on the right-hand side. In this case, Born, Full name, and Height are all properties of the Barack Obama entity in the KG.

3. The voice assistant applies natural language processing methods to the text string in order to translate it into a structured graph query in SPARQL or something similar.
4. A graph database system that stores the knowledge graph executes the SPARQL query and returns a result (a small sketch of such a query is given at the end of this subsection).
5. The voice assistant turns the result into audio that can be played to the user.
A large and high-quality knowledge graph is necessary, but not sufficient, for a successful voice assistant. The voice assistant can only answer questions that have answers in the knowledge graph, which potentially explains Google's private efforts to create a very large one [47].
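As a small sketch of Steps 3 and 4, the structured form of "Where was Barack Obama born?" can be issued against the public Wikidata SPARQL endpoint using the Q76 and P19 identifiers discussed earlier; the endpoint URL and label service are Wikidata specifics, and the surrounding Python is only illustrative of how an application might execute such a query.

    import requests

    # "Where was Barack Obama born?" as a graph query: subject Q76, property P19 (place of birth).
    query = """
    SELECT ?placeLabel WHERE {
      wd:Q76 wdt:P19 ?place .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kg-report-example/0.1"},
    )
    for row in response.json()["results"]["bindings"]:
        print(row["placeLabel"]["value"])   # e.g., "Honolulu"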

6.4 Wikidata

Wikidata is a widely-used open-source knowledge graph. Wikidata was founded in 2012 and has since become the largest and most successful of the public general-interest knowledge graphs. Wikidata is unusual in being both very large and very transparent; in contrast, most other knowledge graphs are either small research projects or large private endeavors. It is a volunteer nonprofit effort, with funding donated from various charitable and scientific organizations.
As of this writing, Wikidata contains structured data about roughly 82M entities. An entity is any general-interest real-world object, such as Barack Obama (https://www.wikidata.org/wiki/Q76), the city of Chicago (https://www.wikidata.org/wiki/Q1297), the Federal Reserve (https://www.wikidata.org/wiki/Q53536), or the US Navy (https://www.wikidata.org/wiki/Q11220). Wikidata contains approximately 1 billion factual statements. Roughly half of the entities (40M) contain 10 or more factual statements.
Wikidata is remarkable in its ability to simultaneously achieve large scale and high data quality. Not only is the factual precision high, the set of properties and types — Wikidata's version of a schema — is remarkably consistent across the dataset. For example, there is a single property spouse that is widely used; there are few or no unnecessary duplicates of the concept. This might not sound like a major achievement, but consider how much overlap and duplication exists across all the relational database schemas in an organization. Even a simple concept like employee can be modeled in many different ways in different databases; if the organization wants to run a query that examines all of the employees, it often must perform a long and expensive data integration process. Wikidata is somehow able to avoid most of these modeling and integration problems, even though its dataset and topic diversity are vast. It is not entirely clear which Wikidata practices are most responsible for their success, but here are a few notable ones:

1. Almost any user can contribute a new fact to Wikidata, without any scrutiny before it is added and made publicly visible. However, they are limited to using properties (e.g., spouse or sibling or date of birth) that already exist.

2. Any user who wants to add a novel property must submit the request for approval before the property can be used in a new fact. The request is reviewed by a small panel of senior Wikidata editors.
3. The software interface for adding a novel fact includes aggressive auto-suggest, which encourages users to select properties from the preexisting set.

4. The Wikidata software retains the full version history of all facts, so it is easy to undo any unwanted changes.
5. Editors have tools that allow them to quickly review large numbers of factual additions, letting them examine and potentially undo fact insertions with just 1-2 keystrokes.
6. The system, somewhat surprisingly, does not use much in the way of machine learning-style probabilistic machinery.
Unfortunately, there is no commercial product that allows a user to easily replicate the Wikidata process when building a novel knowledge graph. However, it is clearly an exciting and relevant example that can potentially be emulated.

6.5 Future Research

By massively expanding the availability of well-administered structured data, and by allowing anyone (not just database administrators) to improve that structured data, knowledge graphs promise a revolution in the kinds of data-powered applications that can be constructed. Everyday users could stop issuing simple document ranking queries; instead, with the same trivial amount of work, they could easily ask detailed questions about the world and receive a customized data-driven answer. Professional data analysts could easily focus on novel topics, not just the databases that have been cleaned and prepped for them. Visualizations in a report or a news article could be tied to their backing data stores, allowing any reader to explore a data point that seems suspicious. More concretely, consider:
• A knowledge graph that contains virus-related scientific research, when combined with a KG-powered application, could allow a user to quickly identify the researchers who have authored the most papers that mention both remdesivir and a filovirus.
• A knowledge graph that describes naval vessels, their crews, and their equipment could power an application that finds repair opportunities.
This exciting future has arrived in small glimpses: voice assistants, and search engines that give direct answers instead of forcing users to trawl through documents. Unfortunately, knowledge graphs do not exist for many topics, and KG-powered applications exist for fewer. The exciting knowledge graph future has been frustratingly slow to arrive. We need domain-independent infrastructure to hurry it along.
Knowledge Graph Construction — One core reason is that constructing a high-quality KG is itself difficult and time-consuming. Google has been developing its private knowledge graph for over a decade; and even though the Wikidata project is one of the best KGs available, it is almost eight years old but still has quality and coverage challenges. These are large projects, and so perhaps have necessarily taken a long time. But in past interviews with KG researchers, we found that even a small new knowledge graph project typically takes 12 months, even with highly-trained Ph.D.-level engineers.
Knowledge Applications — The pain associated with building applications for knowledge graphs is even more acute. Only a tiny number of the most resource-rich organizations have been able to pursue them. Consider that building a KG-powered application entails:
• Writing complicated data acquisition code that clumsily exports data from a KG, then reimports it. Worse, this code simply breaks when the KG changes, even when it improves.

• Fixing bugs that can hide in many more places than in traditional software: bugs can lie in the source code of the application itself, in incorrect inputs, or in incorrect shipped data from collaborators.
• Shipping massive datasets to collaborators, and finding data quality errors induced by downstream software.

Moreover, because KGs and their applications have a symbiotic relationship — better applications raise the demand for good KGs, and better KGs enable better applications — the huge cost of developing applications means that KGs develop more slowly than they otherwise might. Unsurprisingly, the recent explosion of tech company investment in improving general-purpose KGs has gone hand-in-hand with increased popularity of applications like the Google Assistant. Research Goals — We propose to research and build infrastructure that will make KGs and applications dramatically cheaper and easier to build. We plan to do so in two ways:

• Building a Knowledge Graph Programming System, or KGPS. This is a programming language and toolset specifically designed for building KG-driven software and for debugging data quality problems more quickly and at lower cost than is common today.

• Building and rapidly shipping KGs for several applications, in order to gain experience and test integration with the KGPS. We will initially target COVID-19 science, plus related applications. Better scientific understanding of the virus is an urgent social need that this KG and its applications can help address, while also serving as a testbed for our infrastructure.

We can now describe our plans for the KGPS in more detail.

6.6 The Knowledge Graph Programming System

The Knowledge Graph Programming System is a programming system that takes the KG as its core data representation. It makes KG applications easier to build and easier to improve. Its features fall into three main areas:

1. Easier Knowledge Acquisition — KGPS application code can use values and types that cover the "real world" just like a KG does. There is a value for Barack Obama, one for the Boston Red Sox, and so on. There is a type for Politician, for Cartoons, etc. Unlike most programming platforms, the KGPS library is intended to have the same comprehensive coverage that a good KG does. As a result, programmers can very succinctly obtain data from a backing KG.

2. Easier Knowledge Debugging — Every output value that KGPS produces has a globally-accessible URL that carries the full history of how the value was computed.

[Figure 6 depicts individual hospitals and the State of Michigan publishing data to Wikidata and a public medical KG, a public KNPS sharing server monitoring those KGs, and two code snippets: makeCovidHospitalData(s: State), which joins per-state hospital and COVID statistics and publishes the result, and makeUnitedStatesHeatmap(), which consumes the published dataset. Numbered arrows correspond to the walkthrough steps in Section 6.7.]

Figure 6: An example of how the Knowledge Graph Programming System enables efficient and high-quality knowledge applications in the case of COVID- 19 analysis.

Many applications today are a mixture of data and code collected from a hodgepodge of colleagues, models, and opaque institutions; as a result, debugging an incorrect output can turn into a tedious goose chase rather than an engineering exercise. In contrast, KGPS application values automatically carry full lineage information that makes them debuggable by every programmer.

3. Easier Knowledge Sharing — Data pipelines are everywhere in modern data science, including model training procedures, crowdsourcing updates, web crawlers, sensor readings, market updates, data cleaning methods, or just bureaucratic processes. Each of these pipeline steps can entail heavy data transfer costs. Worse, errors can be induced by pipelines entirely outside the application's control. KGPS programs will have data sharing operations as a builtin primitive, making it easier to collectively diagnose and repair knowledge applications, even when errors are tied to upstream or downstream pipelines.

Finally, the KGPS system also has one very boring quality: it is mainly Python, so data scientists and programmers can use it with little change to their daily routines.

6.7 Example Walkthrough

It is perhaps easiest to describe KGPS's behavior by showing a simple walkthrough example. The following narrative discusses the steps illustrated in Figure 6:

25 1. In Step 1, various hospitals over time add statistics about themselves (date of founding, number of beds, etc) to the public Wikidata Knowledge Network. The State of Michigan regularly publishes the latest per-zipcode COVID case numbers data to a public health KG. All updates to these KGs are monitored by a KGPS sharing server and turned into variables and values that are available to all KGPS programmers. 2. In Step 2, a programmer writes a short KGPS function that computes the number of COVID cases per hospital bed in each zipcode in a state, applying it to the value Michigan. Because all of the data is available via the KGPS server, the programmer can load it with a single unambiguous function call, rather than slinging CSV files or SQL for private databases.

3. In Step 3, the program runs and computes the dataset of cases-per-bed. The result is both returned to the programmer and synchronized with the KGPS server. The KGPS server returns a URL that can be used to permanently refer to the generated dataset.

4. In Step 4, the programmer sends this URL to a medical data analyst at a different institution, so she can experiment with this new dataset.

5. In Step 5, the medical data analyst writes a short KGPS function that uses this Michigan cases-per-bed dataset to build a national heatmap of hospital usage intensity. This program refers to the dataset using the URL sent by the programmer.
6. In Step 6, the heatmap program runs. It needs the programmer's data in order to run, so it contacts the KGPS server and downloads the data.
7. In Step 7, the heatmap program completes and generates its output image. The medical data analyst examines the image and finds a surprising result in the heatmap near Ann Arbor.
8. In Step 8, the medical data analyst examines the entire lineage of how the heatmap was computed, including the original programmer's code, and any crowd modifications to the source KGs. She realizes that the number of hospital beds in Ann Arbor is incorrect, so she edits Wikidata to fix the problem. Because she has access to the full history of how the hospital data was computed, it is trivial to reexecute the entire chain of programs and obtain a satisfactory heatmap.

This short example illustrates many of the ways KGPS can make KGs more useful and knowledge workers more productive. The programmer in Step 2 can obtain an accurate list of US hospitals in just one line of code. The medical data analyst can easily examine how her programmer friend computed the dataset she received in Step 4. She can quickly find the source of the bug she identified in Step 7's output. Finally, in Step 8 she can directly fix the upstream data bug in Wikidata, so that everyone benefits from the fix.

1 def computeResidentsPerStandardBed(h: Hospital):
2     nearbyZipcodes = Zipcode.getAll().filter(zc: zc.distance(h.zipcode) < 40)
3     nearbyPopulation = 0
4     for z in nearbyZipcodes:
5         nearbyPopulation += z.population
6     return nearbyPopulation / (h.beds - h.icuBeds)
7
8 computeResidentsPerStandardBed(KGPS.get("Univ of Michigan Hospital"))

Figure 7: KGPS code to compute the number of residents per bed in a hospital’s service area.

6.8 Language Features

We now go through each of the key KGPS features.

6.8.1 Types and Values

A core design goal of KGPS is to make it easy for programmers to acquire, use, and update KG data. As a result, KG entities are a primitive data value, and KG-derived types are built into KGPS. In other words, programs are not limited to describing the world in terms of integers, strings, and sorted arrays. Rather, programmers can write code that models the real world much more directly. Programmers can easily identify sets of data to load without handling files, formats, or SQL. They can also exploit a high-quality preexisting set of schema decisions for real-world objects, derived from a source KG. With the KGPS value and type system, much of the data loading and management code that data scientists currently write can effectively be offloaded onto preexisting KG processes.

Figure 7 illustrates KGPS code that computes the number of residents per standard (non-ICU) hospital bed for a particular hospital's service area. On line 1, the Hospital type is part of the KGPS builtin class library, derived from a backing KG. This Hospital type has various useful fields that have been developed and populated by the KG's curation process, so the programmer does not have to define them. On line 2, the programmer uses the builtin Zipcode type to get a list of all zipcodes, then filters them to retain only the zipcodes within 40 miles of the hospital's zipcode. The Zipcode.distance() function and the Hospital.zipcode field are part of the standard KGPS library. Many other useful fields are also part of the KG-derived types: the population field on line 5, and beds and icuBeds on line 6. Finally, on line 8, the programmer grabs a data value that represents the University of Michigan Hospital, which is an instance of the Hospital type.

Broader Impact — Without the KGPS builtin values and types, the code in Figure 7 would be substantially harder to write.

Programmers would have to build a custom Hospital class and define its fields, build or download a Zipcode class and define its distance() function, then find data for the University of Michigan Hospital and load it into the appropriate fields. KGPS relies on KG-derived data to do all of that tedious, but crucial, work.

Intellectual Novelty — The KGPS builtin library is unlike most programming language libraries: it is intended to cover every topic, so the programmer rarely has to worry whether a particular concept is part of the system; she can simply assume it exists. We pursue this by asking the developer for a tiny number of sample instances for each desired class, then exploiting recent advances in knowledge graph embeddings (such as [72], [73]) to automatically build a type-specific prediction model. However, the type library cannot shed previously-defined fields or functions without potentially breaking KGPS user code; this challenge is quite similar to known problems in data integration ([74], [75], [76]).
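To give a flavor of what a KG-backed type might look like under the hood, the sketch below defines a small City class whose population field is resolved lazily from Wikidata's public SPARQL endpoint (property P1082 is Wikidata's population property). This is only an illustration of the general idea; it is not the KGPS implementation, and the class and field names are chosen for the example.

import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

class City:
    """A toy KG-backed type: a field is filled in from Wikidata on first access."""

    def __init__(self, name: str):
        self.name = name
        self._population = None

    @property
    def population(self) -> int:
        if self._population is None:
            query = f'''
                SELECT ?pop WHERE {{
                  ?item rdfs:label "{self.name}"@en ;
                        wdt:P1082 ?pop .
                }} LIMIT 1'''
            resp = requests.get(WIKIDATA_SPARQL,
                                params={"query": query, "format": "json"},
                                headers={"User-Agent": "kg-type-sketch/0.1"})
            bindings = resp.json()["results"]["bindings"]
            self._population = int(float(bindings[0]["pop"]["value"])) if bindings else 0
        return self._population

print(City("Ann Arbor").population)

A production system would cache results, resolve entities by identifier rather than by label, and expose many more fields; that schema work is exactly what the report argues can be inherited from the KG.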

6.8.2 Automatic Provenance

Another core design goal of KGPS is to make all data values easy to debug. Even today's limited knowledge applications entail a vast number of moving parts and the efforts of many different people. For example, answering a single voice assistant query might depend on a fact added to the knowledge network by a Wikidata user, a piece of voice recognition code added by an engineer on an open source recognition project, a large set of training examples contributed by users, and interface code written by the company that made the voice device. If you get a wrong answer to your voice assistant query, which of these elements is to blame?

This crucial data debugging is in some ways similar to traditional software debugging, except that the full program execution is spread out in time, across many machines, and performed by many different participants. Without knowledge of the full history of any particular piece of data, fixing errors is not possible. Developers must perform tedious detective work and track down every possible input, or in many cases simply add more training data to the model process and hope for the best. Although there has been some recent research into systems for data debugging, it has focused on debugging a database in isolation, not data in the context of a multicomponent application ([77]).

KGPS addresses this challenge by giving every single value the full provenance of how it was computed. There has been a substantial amount of research in data provenance — sometimes called data lineage — in a database or reproducibility setting ([78, 79, 80, 81, 82, 83, 84]), but there is no standard programming system in which provenance plays a major role. KGPS aims to fill this role. Existing provenance systems tend either to require a substantial amount of developer adaptation or to be incomplete, leaving some portion of the value history uncaptured.

We will instead follow a policy of automatic provenance capture that degrades provenance accuracy rather than coverage. We have previously used synthesized code for data-intensive tasks to obtain higher performance in opaque user code ([85]) and to perform data manipulation operations ([86, 87]); we will follow the same strategy with provenance capture. When dealing with software components that cannot be instrumented or wrapped, we will use machine learning models to predict the likely provenance values. It might seem irresponsible to employ potentially-inaccurate provenance values, but in a debugging context they will often be quite usable.

Broader Impact — KGPS adds two features not typically seen in provenance systems: it computes provenance without programmer interaction, and it does so across programs and storage systems. We aim to achieve this via synthesized wrappers around user-written KGPS code. If successful, this approach would make provenance a practical reality for KGPS application programmers, bringing the full history of every data value to the programmer's fingertips; as a result, data debugging would be vastly easier.

Intellectual Merit — There is a huge amount of work in data provenance, but also a lack of practical systems in common use outside a few narrow use cases. This suggests deployable provenance is a promising and challenging direction. Synthesizing wrappers is a known approach, although it is often done in a crude manner that simply logs the method call. We are unaware of existing work that pursues this degraded-accuracy approach for maximizing provenance coverage.
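The following is a minimal sketch of what wrapper-based provenance capture can look like in plain Python: a decorator records, for every call, which function produced which value from which inputs. KGPS proposes to synthesize such wrappers automatically and with far more fidelity; the decorator, registry names, and id-based keying here are simplifications for illustration.

import functools
import uuid

PROVENANCE = {}   # provenance id of a value -> {"fn": ..., "inputs": [...]}
VALUE_IDS = {}    # id(object) -> stable provenance id (a simplification)

def _pid(value):
    # Assign (and remember) a provenance id for a Python value.
    return VALUE_IDS.setdefault(id(value), str(uuid.uuid4()))

def traced(fn):
    # Wrap fn so that every result it returns carries a provenance record.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        PROVENANCE[_pid(result)] = {
            "fn": fn.__name__,
            "inputs": [_pid(a) for a in args] + [_pid(v) for v in kwargs.values()],
        }
        return result
    return wrapper

@traced
def cases_per_bed(cases: int, beds: int) -> float:
    return cases / beds

ratio = cases_per_bed(950, 1000)
print(ratio, PROVENANCE[_pid(ratio)])   # the value plus a record of how it was computed

Chaining traced functions yields a graph of provenance records that can be walked backwards from any suspicious output, which is the debugging workflow that Step 8 of the walkthrough relies on.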

6.8.3 Sharing Values and Variables

Finally, KGPS will allow easy sharing of values and variables. Running a KGPS program will essentially generate its own synthetic "execution knowledge network", designed to cover the topic of its own execution history. Values in this network can be shared with others and commented upon, just like the values from Wikidata that are shared with the KGPS server in Figure 6. This will allow KGPS programs to easily add new items back to the shared knowledge network, so they can be reused by other developers.
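As an illustration of the intended workflow (not of any real KGPS API), the stub below shows the shape of the sharing interaction: publishing a value yields a stable URL that a collaborator can later dereference. The class name, base address, and methods are all hypothetical.

import uuid

class SharingServerStub:
    BASE = "https://kgps.example.org/values/"   # placeholder address, not a real service

    def __init__(self):
        self._store = {}

    def publish(self, value) -> str:
        # Hand back a permanent-looking URL for the published value.
        url = self.BASE + str(uuid.uuid4())
        self._store[url] = value
        return url

    def get(self, url):
        return self._store[url]

server = SharingServerStub()
url = server.publish({"48109": 0.95})   # e.g., the Michigan cases-per-bed dataset
print(url, server.get(url))             # a collaborator can now fetch the value by URL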

6.9 Alternate Interaction Modes

The above text describes the standard programmatic interface to KGPS. But once users add code and data to the system, other interaction modes are possible. In Phase 1 we built a search-engine-like interface for analysts, which exposes KGPS functions and data via a website. Users are able to obtain useful and interesting programmatic results with just a single line of text.

7 Practical Steps

This section outlines a number of practical steps that can simplify data integration. We organize these practical steps into:

1. Data Organization

2. Data Quality and Discovery

3. Data Privacy

4. Infrastructure, Technology and Policy

7.1 Data Organization

• Schema Integration: Schema integration is one of the core challenges for any organization looking to query disparate datasets. A simple example: one database of employees might contain startdate as a calendar date, while a second one contains beganservice as a year. Writing a query that uses both of these databases entails figuring out how to map startdate to beganservice (a minimal illustration of this mapping appears at the end of this subsection). In practice, solving this seemingly-simple problem can become quite daunting and expensive.

Most traditional schema integration tools are intended to integrate preexisting schemas with extremely high accuracy. This makes the tools suitable for important integrated queries, such as computing payroll for people in a large organization with many employee databases. These tools often come with bundled professional services; IBM and Palantir are two well-known vendors. These projects tend to be time-consuming and relatively expensive. This is usually not a viable approach for projects that aim to make all data in the organization queryable: the per-schema integration cost is simply massive when the number of schemas is large.

There are research and startup efforts that use probabilistic methods to perform schema-specific integration at more reasonable cost; Tamr is one small but growing vendor. These can be successful, but may have unusual requirements, such as extensive employee-annotated training data.

More recently, knowledge graphs have employed web front-end tools and social processes to produce a single large-scale integrated schema. Essentially, the knowledge graph's schema is the "target"; all raw data in the world must be formatted to fit this target or it cannot be added to the knowledge graph. This is a major achievement, especially considering the relatively low financial cost. However, it has come at the cost of some flexibility. Users who cannot accept the knowledge graph's target schema (perhaps to retain backward compatibility with legacy software) simply cannot add their data to the knowledge graph. Making matters worse, there is currently no standard commercial offering to enable private knowledge graph construction.

Figure 8: High-level view of various tools for schema integration

It may sound like promulgating standardized schemas would enable the best of all worlds, but most organizations find these difficult to implement. One way to view the success of knowledge graph projects is that they are effectively social software systems that encourage organizational agreement on a standardized schema. Although there is no commercial product available today in this area, it is worth watching for in the future. Figure 8 gives a high-level view of these various tools.

• Data Formatting and Naming: One of the easiest ways to simplify data integration is through standardized files in human-readable formats with standardized filenames. We have observed that in many instances, a data management system can be nothing more than a shared filesystem with standardized filenames and folders.

Wherever possible, parse as much data as is practical into tabular files. Try to avoid proprietary formats. If possible, use .csv (comma separated values) or, even better, .tsv (tab separated values) file formats. Include column labels (each column label should be unique within the file). Have row labels in the first column, such as a row number or record number (each row label should be unique within the file).

Avoid lots of tiny files and compress with zip when possible. Use hierarchical directories to keep the number of items in a directory reasonable (less than 1000 files/directory). Use well-defined directory structures to provide easy-to-access information about the dataset. For example:

source/YYYY/MM/DD/hh/mm/source-YYYY-MM-DD-hh-mm-ss.tsv
or
YYYY/MM/DD/hh/mm/source/YYYY-MM-DD-hh-mm-ss-source.tsv
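As a small illustration of the schema integration example from earlier in this subsection, the pandas sketch below maps startdate (a calendar date) down to the coarser beganservice representation (a year) so the two employee tables can be queried together, and then writes the result using the directory/filename convention above. Table contents, column names, and the output path are illustrative.

import os
import pandas as pd

db1 = pd.DataFrame({"employee": ["Ada", "Grace"],
                    "startdate": ["2014-03-17", "2019-07-01"]})
db2 = pd.DataFrame({"employee": ["Alan", "Edsger"],
                    "beganservice": [2016, 2011]})

# Map the calendar date onto the coarser year-valued column.
db1["beganservice"] = pd.to_datetime(db1["startdate"]).dt.year

unified = pd.concat([db1[["employee", "beganservice"]],
                     db2[["employee", "beganservice"]]], ignore_index=True)

# Write a TSV with unique column labels and a row-label first column to a
# standardized, date-based path.
out_dir = "hr/2020/05/01/12/00"
os.makedirs(out_dir, exist_ok=True)
unified.to_csv(os.path.join(out_dir, "hr-2020-05-01-12-00-00.tsv"),
               sep="\t", index=True, index_label="record")
print(unified)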

7.2 Data Quality and Discovery

• Data integrity: In most cases, data becomes dirty because organizations do not invest in setting up policies that control the quality of the data before it is introduced to the database. It is always a good idea to think about constraints that should hold on the data at all times (e.g., employee salary must be greater than 0). This way, if the data is updated, we can make sure those updates do not violate the constraints. Fortunately, there is a plethora of work that addresses ways to design constraints to keep the data consistent at all times; functional dependencies and denial constraints are two examples [88, 89]. Those constraints can be implemented either at the application level, i.e., write code to enforce the rules in the application, or at the DBMS level, i.e., write triggers that call a SQL procedure (a minimal trigger-based sketch appears after this list). The latter has the advantage of enforcing the constraints for any number of applications interacting with the database.

• Testing various hypotheses on the data: It is estimated that designing data processing pipelines takes hundreds of iterations on average [90]. As the data undergoes a myriad of transformations over time, it is crucial to be able to quickly test hypotheses on the data to make sure the pipeline is up-to-date (e.g., new training data is available, new updates have been made to base tables). Workflow management systems can help design and test data pipelines as the data and the organization's specifications (e.g., a new department) evolve. Data Civilizer is one example of such a tool [8].

• Data version control: Typically, data is handled by different actors in an organization (different departments, etc.). It is important to keep track of "who did what" to the data over time. This makes it easy to ask the right people about versions of the data (instead of asking everyone in the organization). This is especially important when dirty data starts to appear as a result of a recent event (e.g., someone has inadvertently made a "bad" data update). An example system design is outlined in [91].

• Data discovery: Most organizations have a data warehouse or a data lake where they dump historical data for future analysis. As data analytics becomes a central part of modern organizations, it is important to make the job of "data scientists" as productive as possible. A well-cited figure states that data scientists spend over 80% of their time preparing the data [8]. This leaves them with very little time to perform their analytical tasks.

One key component of data preparation is Data Discovery. The premise of data discovery is: given a set of tables (e.g., a data lake), we would like to extract the tables/records of interest to our task.
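Below is a minimal sketch of the DBMS-level option from the data integrity bullet: an SQLite trigger, created from Python, that rejects any insert violating the "salary must be greater than 0" rule. The table, column, and trigger names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, salary REAL);
    CREATE TRIGGER check_salary
    BEFORE INSERT ON employee
    WHEN NEW.salary <= 0
    BEGIN
        SELECT RAISE(ABORT, 'salary must be greater than 0');
    END;
""")

conn.execute("INSERT INTO employee VALUES ('Ada', 120000)")       # accepted
try:
    conn.execute("INSERT INTO employee VALUES ('Grace', -10)")    # rejected by the trigger
except sqlite3.DatabaseError as e:
    print("constraint violation:", e)

Because the check lives in the database rather than in any one application, every program that writes to the table is covered, which is the advantage noted above.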

7.3 Data Privacy

• Database Technologies for Privacy: Protecting diverse datasets, their aggregates, and subproducts may require encrypted databases such as those presented in [92, 93, 94, 95]. Essentially, these systems attempt to enforce organizational policies on the confidentiality, integrity and/or availability of data throughout the data lifecycle. An overview of the state-of-the-art in encrypted databases is given in [96].

• Data Ownership: Limit data ownership and access to those with a need to know. While there are algorithmic solutions for such policy checking, these are often very difficult to implement and have major issues when something goes wrong. There are newly proposed techniques that may simplify the implementation of complex access control policies, such as query-based access control [97]; however, these approaches are still relatively new and in the research phase.

• Inadvertent Releases Are Very Possible: Unfortunately, ensuring data privacy is an enduring problem for which there are no easy solutions. In general, organizations can improve by minimizing data collection whenever possible, and by following good storage practices for sensitive data (such as storing data in encrypted form, and not allowing wide copies of sensitive files). These measures will help minimize the frequency of the most egregious "data spills", such as releasing an entire database of private financial information.

However, even cautious organizations can make errors. A famous example is connected to the Netflix Prize contest of the mid-2000s. The Netflix Prize was a machine learning contest, in which Netflix challenged research teams to predict customers' movie ratings more accurately than any other team. As part of the contest, Netflix released a database of customer movie ratings, with all customer information replaced by an opaque identifier. The data had four fields: user-identifier, movie-title, grade, and date-of-grade.

After Netflix released this data, researchers were able to identify the real names of some of the customers in this database. They examined the public rating history of users on IMDB.com, which reveals actual usernames. By looking for pairs of users who had distinctive and overlapping movie histories on both sites, they were able to tie the Netflix user identifiers to IMDB user accounts. Some of one user's Netflix rating history contained sensitive movie titles that were not part of the user's IMDB rating history, and which arguably revealed information about the user's sexual orientation.

In short: Netflix was careful to act responsibly in its data release, but still ended up enabling the revelation of private information. Technical solutions to this problem are currently being explored in the research world (differential privacy is one direction), but are generally not deployable today.

The US Census Bureau is a potentially useful model to emulate. It maintains a large amount of personal information about many Americans, which most would find extremely intrusive if it were released. This data is used by researchers and policymakers. However, the census data is believed to have never been part of an unintentional privacy breach, despite its long history.3 In order for researchers to use much of the Census Bureau data, they must:

– Undergo a background check.

– Physically visit one of 29 Federal Statistical Research Data Center access points around the country, where they use a government-provided terminal.

– Not carry into the access point any electronic means of copying the data.

In other words, the Census Bureau has avoided Netflix's strategy of de-identification plus wide distribution. It has instead simply limited access to the data. This likely has come at some cost to researchers, but does seem to have achieved the data privacy goal.

• Sensitivity of aggregated data: Integration of different non-sensitive datasets may lead to sensitive outputs. For example, consider a medical scenario in which a user is trying to integrate data from Table 1 and Table 2. Suppose Table 1 consists of columns (PatientID, Patient Name, Patient Gender) and Table 2 consists of columns (PatientID, Patient Date of Birth, Patient Age). If a researcher is interested in looking at the distribution of gender and age within the datasets, they may attempt to do a "join" on Table 1 and Table 2 using the PatientID. Very often, combining Patient Name and Date of Birth can lead to sensitive Personally Identifiable Information (PII). However, if the user instead filtered their queries to select only the pieces of data relevant from each table, the aggregated query may not expose PII (see the sketch at the end of this subsection). Similar cases exist within Department of Defense (DoD) and enterprise datasets.

There are a couple of approaches to mitigate this risk. In one direction, you could run all services on a sensitive network. Thus, any data products will already be on a system or network approved for the aggregate (e.g., the Census Bureau example above). While this may seem like the easiest path forward for developers of the service, from a user perspective, this is not convenient (or feasible).

3 An arguable exception: census data was used to support the internment of Japanese-Americans during the Second World War, at the direction of government officials [98].

Another approach may be to leverage encrypted databases [99, 100, 93, 96] such that all data products are kept encrypted and can only be viewed if certain constraints are satisfied (e.g., a filter that ensures any data being decrypted does not contain sensitive aggregates). These tools, largely research efforts, are worth watching and may provide an interim solution. In the long term, we believe that these approaches, in conjunction with automated tools built on knowledge graphs, are a promising direction. Knowledge graphs would allow Information Security Officers (ISOs) of various organizations to relatively easily enforce various security policies. For example, a policy may be that Date of Birth and Patient Name should not be aggregated on a non-PII-compliant system. With a knowledge graph representation, each record in the database is tied to an entity type, and this policy is easy to enforce. In such a case, the data integration tool may be able to alert users that they are likely to cause a policy violation, and may even be able to help them construct a policy-compliant query. A good overview of the challenges is given in [101].

• Experience with medical data access: We collaborate with two groups at the Massachusetts General Hospital (MGH) to help them manage and clean their data. Access to this data has to be managed carefully since the data is sensitive. As collaborators, we (1) cannot copy the data to any machine and must use the data only within the MGH servers; and (2) work only with data that has been de-identified. We also review any code that reads this data before deploying it, because data breaches often happen due to vulnerabilities in code (e.g., a web server that assumes all requests are benign).
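To make the aggregation point above concrete, the pandas sketch below contrasts a full join of the two hypothetical patient tables (which co-locates Patient Name and Date of Birth, i.e., PII) with a projected join that keeps only the columns the gender/age analysis needs. The data values are invented.

import pandas as pd

table1 = pd.DataFrame({"PatientID": [1, 2],
                       "PatientName": ["A. Smith", "B. Jones"],
                       "PatientGender": ["F", "M"]})
table2 = pd.DataFrame({"PatientID": [1, 2],
                       "PatientDateOfBirth": ["1980-01-02", "1975-06-07"],
                       "PatientAge": [40, 44]})

# Risky: the full join carries Name and Date of Birth side by side.
full_join = table1.merge(table2, on="PatientID")

# Safer: project each table to only the fields the analysis needs before joining.
safe_join = table1[["PatientID", "PatientGender"]].merge(
    table2[["PatientID", "PatientAge"]], on="PatientID")

print(safe_join.groupby("PatientGender")["PatientAge"].mean())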

7.4 Infrastructure, Technology and Policy

• Data Lakes: A common software deployment is a so-called "data lake": a single physical location where all members of an organization place datasets, often in raw formats. The Hadoop Distributed Filesystem is a common infrastructure for data lakes, sometimes with some amount of administrative tooling. Importantly, there is usually little or no data integration; schemas might not even exist or be declared for all the data, let alone be integrated.

Data lakes can serve some useful roles, but users will likely be disappointed if they hope the data lake alone will enable them to query the entire organization's data. Physically colocating the data is a potentially useful step, but semantic integration among datasets is often a much more time-consuming and expensive hurdle. Further, in many cases, physical colocation might not even be necessary for semantic integration.

One scenario in which a data lake can be useful is when the data being uploaded is quite homogeneous.

For example, it might be useful to transmit all of the log data from a set of servers to a single location. Another is when an organization has a sophisticated data science team that is ready to build novel analyses, and simply lacks physical access to all of the organization's data.

• Database vs. Files: Databases and files can often be used together. Use databases if you need to quickly find particular data items (i.e., you are looking for a small number of records compared to the entire dataset). SQL systems are good for medium-size datasets that require ACID (atomicity, consistency, isolation, and durability) guarantees. NoSQL systems are good for large datasets where ACID guarantees are not required [102]. NewSQL systems are good when you need both scalability and ACID compliance [103]. Scanning files in the file system can be best when reading in a majority of a dataset.

• Federated Data Access: Leverage a scalable data management architecture that allows the addition of new technologies such as object stores, databases and file systems. It is unlikely that any single technology solution will scale or provide efficient access to all modalities of data, so federation as an architectural principle is important [25]. Polystore systems are one example; BigDAWG is described in detail in [27, 104, 105].

• Talent: Ensure access to the right people. In addition to data users, there is a need for data scientists and subject matter experts.

• Technology Selection: Use top-down technology selection and involve management in the technology selection process. Avoid products/technologies that have an unknown or unreliable development team (e.g., teams from adversary nations; teams of individuals unlikely to continue maintaining the software).

• Software Licensing: Be aware of software licensing: certain software libraries and products have restrictive licenses. This may limit the ability to share technology with other industry/academic partners.

• Software and Hardware: Avoid software/hardware products with an unknown user base or a non-active developer base. Reevaluate software/hardware products when their developers are acquired by other entities.

• Incorporating new tools: Maintain an acquisition and development environment conducive to incorporating new technology advances. Many of the challenges discussed in this document are active areas of research in the commercial and academic sectors. Innovations are coming at a rapid pace, and it is important that decision makers maintain an environment that allows these technologies to be acquired and deployed in an agile fashion.

8 Acknowledgements

The authors wish to thank the following individuals for their support in developing this report: Siddharth Samsi, Jeremy Kepner, Albert Reuther, Mike Stonebraker, Sam Madden.

This material is based upon work supported by the National Science Foundation under Grant No. 1636788 and by the United States Air Force Research Laboratory under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] Sulayman K Sowe, Takashi Kimata, Mianxiong Dong, and Koji Zettsu. Managing heterogeneous sensor data on a big data platform: Iot services for data-intensive science. In 2014 IEEE 38th International Computer Software and Applications Conference Workshops, pages 295–300. IEEE, 2014.

[2] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.

[3] Mohammed Saeed, Mauricio Villarroel, Andrew T Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H Kyaw, Benjamin Moody, and Roger G Mark. Multiparameter intelligent monitoring in intensive care ii (mimic-ii): a public-access intensive care unit database. Critical care medicine, 39(5):952, 2011.

[4] Vijay Gadepally, Ashok Krishnamurthy, and Umit Ozguner. A framework for estimating driver decisions near intersections. IEEE Transactions on Intelligent Transportation Systems, 15(2):637–646, 2013.

[5] Neil D Lawrence. Data readiness levels. arXiv preprint arXiv:1705.02245, 2017.

[6] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. The data civilizer system. In CIDR, 2017.

[7] Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, and Michael Stonebraker. Aurum: A data discovery system. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pages 1001–1012. IEEE Computer Society, 2018.

[8] El Kindi Rezig, Lei Cao, Michael Stonebraker, Giovanni Simonini, Wenbo Tao, Samuel Madden, Mourad Ouzzani, Nan Tang, and Ahmed K. Elmagarmid. Data civilizer 2.0: A holistic framework for data preparation and analytics. PVLDB, 12(12):1954–1957, 2019.

[9] Wenbo Tao, Dong Deng, and Michael Stonebraker. Approximate string joins with abbreviations. PVLDB, 11(1):53–65, 2017.

[10] José Cambronero, John K. Feser, Micah J. Smith, and Samuel Madden. Query optimization for dynamic imputation. PVLDB, 10(11):1310–1321, 2017.

[11] Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Mourad Ouzzani, and Nan Tang. FAHES: detecting disguised missing values. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pages 1609–1612. IEEE Computer Society, 2018.

[12] Vijay Gadepally, Arda Kurt, Ashok Krishnamurthy, and Ümit Özgüner. Driver/vehicle state estimation and detection. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 582–587. IEEE, 2011.

[13] Maurizio Lenzerini. Data integration: A theoretical perspective. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 233–246, 2002.

[14] Brian Sauser, Dinesh Verma, Jose Ramirez-Marquez, and Ryan Gove. From trl to srl: The concept of systems readiness levels. In Conference on Systems Engineering Research, Los Angeles, CA, pages 1–10, 2006.

[15] Michael Stonebraker and Ihab F Ilyas. Data integration: The current status and the way forward. IEEE Data Eng. Bull., 41(2):3–9, 2018.

[16] Vijay Gadepally, Justin Goodwin, Jeremy Kepner, Albert Reuther, Hayley Reynolds, Siddharth Samsi, Jonathan Su, and David Martinez. Ai enabling technologies: A survey. arXiv preprint arXiv:1905.03592, 2019.

[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[18] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[19] Alex Krizhevsky and Geoff Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40(7):1–9, 2010.

[20] Vijay Gadepally, Jeremy Kepner, Lauren Milechin, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Matthew Hubbell, Micheal Houle, Micheal Jones, et al. Hyperscaling internet graph analysis with d4m on the mit supercloud. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–6. IEEE, 2018.

[21] Romain Fontugne, Pierre Borgnat, Patrice Abry, and Kensuke Fukuda. Mawilab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. In Proceedings of the 6th International Conference, pages 1–12, 2010.

[22] Emily H Do and Vijay N Gadepally. Classifying anomalies for network security. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2907–2911. IEEE, 2020.

[23] Vijay Gadepally and Jeremy Kepner. Big data dimensional analysis. In 2014 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2014.

[24] Vijay Gadepally and Jeremy Kepner. Using a power law distribution to describe big data. In 2015 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–5. IEEE, 2015.

[25] Ran Tan, Rada Chirkova, Vijay Gadepally, and Timothy G Mattson. Enabling query processing across heterogeneous data models: A survey. In 2017 IEEE International Conference on Big Data (Big Data), pages 3211–3220. IEEE, 2017.

[26] Michael Hausenblas and Jacques Nadeau. Apache drill: interactive ad-hoc analysis at scale. Big data, 1(2):100–104, 2013.

[27] Vijay Gadepally, Peinan Chen, Jennie Duggan, Aaron Elmore, Brandon Haynes, Jeremy Kepner, Samuel Madden, Tim Mattson, and Michael Stonebraker. The bigdawg polystore system and architecture. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2016.

[28] Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, et al. Demonstration of the myria big data management service. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 881–884, 2014.

[29] Tim Mattson, Vijay Gadepally, Zuohao She, Adam Dziedzic, and Jeff Parkhurst. Demonstrating the bigdawg polystore system for ocean metagenomics analysis. In CIDR, 2017.

[30] Aaron J Elmore, Jennie Duggan, Michael Stonebraker, Magdalena Balazinska, Ugur Cetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremy Kepner, Tim Kraska, et al. A demonstration of the bigdawg polystore system. Proceedings of the VLDB Endowment, 8(12):1908, 2015.

[31] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq R. Joty, Mourad Ouzzani, and Nan Tang. Distributed representations of tuples for entity resolution. PVLDB, 11(11):1454–1467, 2018.

[32] Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. Unsupervised string transformation learning for entity consolidation. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, pages 196–207, 2019.

[33] mlflow. https://mlflow.org. Accessed: March 2019.

[34] Apache Airflow. https://airflow.apache.org. Accessed: March 2019.

[35] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya G. Parameswaran. Helix: Accelerating human-in-the-loop machine learning. PVLDB, 2018.

[36] George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, and Manolis Koubarakis. The return of jedai: End-to-end entity resolution for structured and semi-structured data. PVLDB, 11(12):1950–1953, 2018.

[37] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeffrey F. Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.

[38] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.

[39] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. Keystoneml: Optimizing pipelines for large-scale advanced analytics. In ICDE, 2017.

[40] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. Incremental knowledge base construction using deepdive. PVLDB, 8(11):1310–1321, 2015.

[41] Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd D. Millstein, and Miryung Kim. Bigdebug: debugging primitives for interactive big data processing in spark. In ICSE, pages 784–795. ACM, 2016.

[42] Essam Mansour, Dong Deng, Raul Castro Fernandez, Abdulhakim Ali Qahtan, Wenbo Tao, Ziawasch Abedjan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. Building data civilizer pipelines with an advanced workflow engine. In ICDE, 2018.

[43] Manasi Vartak, Harihar Subramanyam, Wei-En Lee, Srinidhi Viswanathan, Saadiyah Husnoo, Samuel Madden, and Matei Zaharia. Modeldb: a system for machine learning model management. In HILDA, 2016.

[44] Wenbo Tao, Xiaoyu Liu, Çağatay Demiralp, Remco Chang, and Michael Stonebraker. Kyrix: Interactive visual data exploration at scale. In CIDR, 2019.

[45] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85, September 2014.

[46] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 722–735, Berlin, Heidelberg, 2007. Springer-Verlag.

[47] Amit Singhal. Introducing the knowledge graph: things, not strings. https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html, May 2012. [Online; accessed May 30, 2019].

[48] The UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. In Nucleic Acids Research, 2018.

[49] Welcome to musicbrainz! https://musicbrainz.org/, 2019. [Online; accessed May 30, 2019].

[50] Geonames. http://www.geonames.org/, 2019. [Online; accessed May 30, 2019].

[51] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 697–706, New York, NY, USA, 2007. ACM.

[52] Oren Etzioni, Michael J. Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in knowitall: (preliminary results). In Proceedings of the 13th international conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, pages 100–110, 2004.

[53] Christian Bizer. The emerging web of linked data. IEEE Intelligent Systems, 24(5):87–92, September 2009.

[54] Derry Tanti Wijaya and Tom M. Mitchell. Mapping verbs in different languages to knowledge base relations using web text as interlingua. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 818–827, 2016.

[55] Rahul Gupta, Alon Y. Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu. Biperpedia: An ontology for search applications. PVLDB, 7(7):505–516, 2014.

[56] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392, 2016.

[57] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for squad. CoRR, abs/1806.03822, 2018.

[58] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016.

[59] Jim Webber. A programmatic introduction to neo4j. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity, pages 217–218, 2012.

[60] Bradley R Bebee, Daniel Choi, Ankit Gupta, Andi Gutmans, Ankesh Khandelwal, Yigit Kiran, Sainath Mallidi, Bruce McGaughy, Mike Personick, Karthik Rajan, et al. Amazon neptune: Graph data management in the cloud. In International Semantic Web Conference (P&D/Industry/BlueSky), 2018.

[61] Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, and Bernhard Pollak. Towards domain-independent information extraction from web tables. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 71–80, 2007.

[62] Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. Snorkel: A system for lightweight extraction. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.

[63] Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang, and Eugene Wu. Uncovering the relational web. In 11th International Workshop on the Web and Databases, WebDB 2008, Vancouver, BC, Canada, June 13, 2008, 2008.

[64] Alexander Yates and Oren Etzioni. Unsupervised methods for determining object and relation synonyms on the web. CoRR, abs/1401.5696, 2014.

[65] Sreeram Balakrishnan, Alon Y. Halevy, Boulos Harb, Hongrae Lee, Jayant Madhavan, Afshin Rostamizadeh, Warren Shen, Kenneth Wilder, Fei Wu, and Cong Yu. Applying webtables in practice. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.

[66] Ari Kobren, Thomas L. Logan, Siddarth Sampangi, and Andrew J. McCallum. Domain specific knowledge base construction via crowdsourcing. 2014.

[67] Ari Kobren, Nicholas Monath, and Andrew McCallum. Entity-centric attribute feedback for interactive knowledge bases. 2017.

[68] Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. Combining distant and partial supervision for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1556–1567, Doha, Qatar, October 2014. Association for Computational Linguistics.

[69] Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. Effective crowd annotation for relation extraction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 897–906, San Diego, California, June 2016. Association for Computational Linguistics.

[70] Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, New York, NY, USA, August 24-27, 2014, pages 1156–1165, 2014.

[71] National Research Council. Toward precision medicine: Building a knowledge network for biomedical research and a new taxonomy of disease. From the National Research Council Committee on A Framework for Developing a New Taxonomy of Disease, 01 2012.

[72] Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Representation learning of knowledge graphs with hierarchical types. In Subbarao Kambhampati, editor, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2965–2971. IJCAI/AAAI Press, 2016.

[73] Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, Samuel Broscheit, and Christian Meilicke. On evaluating embedding models for knowledge base completion. In Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Johannes Welbl, Alexis Conneau, Xiang Ren, and Marek Rei, editors, Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, pages 104–112. Association for Computational Linguistics, 2019.

[74] Zhongjun Jin, Christopher Baik, Michael J. Cafarella, and H. V. Jagadish. Beaver: Towards a declarative schema mapping. In Carsten Binnig, Juliana Freire, and Eugene Wu, editors, Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD 2018, Houston, TX, USA, June 10, 2018, pages 10:1–10:4. ACM, 2018.

[75] Zhongjun Jin, Christopher Baik, Michael J. Cafarella, H. V. Jagadish, and Yuze Lou. Demonstration of a multiresolution schema mapping system. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org, 2019.

[76] Li Qian, Michael J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In K. Selçuk Candan, Yi Chen, Richard T. Snodgrass, Luis Gravano, and Ariel Fuxman, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 73–84. ACM, 2012.

[77] El Kindi Rezig, Lei Cao, Giovanni Simonini, Maxime Schoemans, Samuel Madden, Nan Tang, Mourad Ouzzani, and Michael Stonebraker. Dagger: A data (not code) debugger. In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org, 2020.

[78] Melanie Herschel, Ralf Diestelkämper, and Houssem Ben Lahmar. A survey on provenance: What for? what form? what from? The VLDB Journal, 26, 10 2017.

[79] Holger Stitz, S. Luger, Marc Streit, and N. Gehlenborg. Avocado: Visualization of workflow-derived data provenance for reproducible biomedical research. Computer Graphics Forum, 35:481–490, 06 2016.

[80] Louis Bavoil, Steven Callahan, Carlos Scheidegger, Patricia Crossno, Claudio Silva, and Juliana Freire. Vistrails: Enabling interactive multiple-view visualizations. page 18, 01 2005.

[81] Chenyun Dai, Dan Lin, Elisa Bertino, and Murat Kantarcioglu. An approach to evaluate data trustworthiness based on data provenance. In Willem Jonker and Milan Petković, editors, Secure Data Management, pages 82–98, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[82] Olaf Hartig and Jun Zhao. Using web data provenance for quality assessment. In Proceedings of the First International Conference on Semantic Web in Provenance Management - Volume 526, SWPM'09, page 29–34, Aachen, DEU, 2009. CEUR-WS.org.

[83] Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Olivier Biton, Zachary G. Ives, and Val Tannen. Orchestra: Facilitating collaborative data sharing. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, page 1131–1133, New York, NY, USA, 2007. Association for Computing Machinery.

[84] Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, and Tim Kraska. Privateclean: Data cleaning and differential privacy. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, page 937–951, New York, NY, USA, 2016. Association for Computing Machinery.

[85] Eaman Jahani, Michael J. Cafarella, and Christopher Ré. Automatic optimization for mapreduce programs. Proc. VLDB Endow., 4(6):385–396, 2011.

[86] Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, and H. V. Jagadish. Foofah: Transforming data by example. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 683–698, 2017.

[87] Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, and H. V. Jagadish. Foofah: A programming-by-example system for synthesizing programs. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1607–1610, 2017.

[88] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Holistic data cleaning: Putting violations into context. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 458–469, 2013.

[89] George Beskales, Ihab F. Ilyas, Lukasz Golab, and Artur Galiullin. Sampling from repairs of conditional functional dependency violations. The VLDB Journal, 23(1):103–128, February 2014.

[90] Manasi Vartak, Joana M. F. da Trindade, Samuel Madden, and Matei Zaharia. MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis. In SIGMOD, pages 1285–1300, 2018.

[91] El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref, and Michael Stonebraker. Towards an end-to-end human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA@SIGMOD 2019, Amsterdam, The Netherlands, July 5, 2019, pages 1:1–1:7. ACM, 2019.

[92] Vijay Gadepally, Braden Hancock, Benjamin Kaiser, Jeremy Kepner, Pete Michaleas, Mayank Varia, and Arkady Yerukhimovich. Computing on masked data to improve the security of big data. In 2015 IEEE International Symposium on Technologies for Homeland Security (HST), pages 1–6. IEEE, 2015.

[93] Raluca Ada Popa, Catherine MS Redfield, Nickolai Zeldovich, and Hari Balakrishnan. Cryptdb: protecting confidentiality with encrypted query processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 85–100, 2011.

[94] Rishabh Poddar, Tobias Boelter, and Raluca Ada Popa. Arx: A strongly encrypted database system.

[95] Vasilis Pappas, Fernando Krell, Binh Vo, Vladimir Kolesnikov, Tal Malkin, Seung Geol Choi, Wesley George, Angelos Keromytis, and Steve Bellovin. Blind seer: A scalable private dbms. In 2014 IEEE Symposium on Security and Privacy, pages 359–374. IEEE, 2014.

[96] Benjamin Fuller, Mayank Varia, Arkady Yerukhimovich, Emily Shen, Ariel Hamlin, Vijay Gadepally, Richard Shay, John Darby Mitchell, and Robert K Cunningham. Sok: Cryptographically protected database search. In 2017 IEEE Symposium on Security and Privacy (SP), pages 172–191. IEEE, 2017.

[97] Richard Shay, Uri Blumenthal, Vijay Gadepally, Ariel Hamlin, John Darby Mitchell, and Robert K Cunningham. Don't even ask: Database access control through query control. ACM SIGMOD Record, 47(3):17–22, 2019.

[98] Lori Aratani. Secret use of census info helped send japanese americans to internment camps in wwii. The Washington Post, Apr 2018.

[99] Jeremy Kepner, Vijay Gadepally, Pete Michaleas, Nabil Schear, Mayank Varia, Arkady Yerukhimovich, and Robert K Cunningham. Computing on masked data: a high performance method for improving big data veracity. In 2014 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2014.

[100] Hakan Hacıgümüş, Bala Iyer, Chen Li, and Sharad Mehrotra. Executing SQL over encrypted data in the database-service-provider model. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 216–227, 2002.

[101] Protecting Aggregated Data. Technical report, US-CERT, 12 2005.

[102] Vijay Gadepally, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Matthew Hubbell, Peter Michaleas, Julie Mullen, et al. D4m: Bringing associative arrays to database engines. In 2015 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2015.

[103] Andrew Pavlo and Matthew Aslett. What's really new with newsql? ACM Sigmod Record, 45(2):45–55, 2016.

[104] Ankush M Gupta, Vijay Gadepally, and Michael Stonebraker. Cross-engine query execution in federated database systems. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2016.

[105] Peinan Chen, Vijay Gadepally, and Michael Stonebraker. The bigdawg monitoring framework. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2016.
