
Data Engineering for Data Analytics: A Classification of the Issues, and Case Studies

Alfredo Nazabal¹, Christopher K.I. Williams¹,², Giovanni Colavizza³,*, Camila Rangel Smith¹, and Angus Williams*

¹ The Alan Turing Institute, London, UK
² School of Informatics, University of Edinburgh, UK
³ Media Studies Department, University of Amsterdam, The Netherlands

* The work described in this paper was carried out when Giovanni Colavizza and Angus Williams were with the Alan Turing Institute.

April 28, 2020

Abstract

Consider the situation where a data analyst wishes to carry out an analysis on a given dataset. It is widely recognized that most of the analyst's time will be taken up with data engineering tasks such as acquiring, understanding, cleaning and preparing the data. In this paper we provide a description and classification of such tasks into high-level groups, namely data organization, data quality and feature engineering. We also make available four datasets and example analyses that exhibit a wide variety of these problems, to help encourage the development of tools and techniques to help reduce this burden and push forward research towards the automation or semi-automation of the data engineering process.

1 Introduction

A large portion of the life of a data scientist is spent acquiring, understanding, interpreting, and preparing data for analysis, which we collectively term data engineering¹. This can be time-consuming and laborious; for example, [1] estimate that these tasks constitute up to 80% of the effort in a data mining project. Every data scientist faces data engineering challenges when carrying out an analysis on a new dataset, but this work is often not fully detailed in the final write-up of their analysis.

Our focus is on a task-driven analysis scenario, where a data scientist wants to perform an analysis on a given dataset, and they want to obtain a representation of the data in a format that can be used as input to their preferred machine learning (ML) method. Interestingly, there is no unique or correct way of wrangling a messy dataset. For example, features and observations with missing values can be deleted or imputed using different approaches; observations can be considered anomalous according to different criteria; or features can be transformed in different ways.

Furthermore, most publicly available datasets have already undergone some form of pre-processing. While providing a clean version of the data is extremely helpful from a modelling perspective, data engineering researchers suffer from a limited availability of public messy datasets. This leads researchers to address some of the problems by synthetically corrupting clean datasets according to the data wrangling problem to be solved. However, such synthetic corruption is often not sufficient to capture the wide variety of corruption processes existing in real-world datasets. All of this variability in terms of data engineering issues and possible solutions, together with the lack of public messy datasets and their final cleaned versions, makes attempting to automate this problem extremely challenging.

In this paper we first provide a classification of the data engineering problems appearing in messy datasets when a data scientist faces an analytical task, and give practical examples of each of them. We have identified three high-level groups of problems: Data Organization issues (DO), related to obtaining the best data representation for the task to be solved; Data Quality issues (DQ), related to cleaning corrupted entries in the data; and Feature Engineering (FE) issues, related to the creation of derived features for the analytical task at hand. Additionally, we have further divided the DO and DQ groups according to the nature of the data wrangling problems they contain. Under Data Organization we include data parsing (DP), data dictionary (DD), data integration (DI) and data transformation (DT). Under Data Quality we include canonicalization (CA), missing data (MD), anomalies (AN) and non-stationarity (NS). Providing a classification of wrangling challenges not only pushes research in these individual fields, but also helps to advance the automation or semi-automation of the whole data engineering process.

¹ Another term often used is data wrangling.

A second contribution of the paper is to make available four example messy datasets, each with an associated analytical task. The analyses were carried out by data scientists at the Alan Turing Institute (in some cases replicating steps that were taken for published analyses). In the appendices we describe the cleaning operations they needed to perform in order to obtain a version of the data that could be used for the analysis task at hand. These provide practical examples of the issues identified in our classification, and give an insight into what constitutes a practical pipeline during a data wrangling problem.

The structure of the paper is as follows: Section 2 introduces the four case studies that will provide examples throughout the paper. Section 3 describes our classification of data engineering challenges, broken down under the headings of Data Organization, Data Quality and Feature Engineering. Section 4 discusses related work. Section 5 provides details of the data engineering and modelling steps carried out for each of the four datasets, and our conclusions appear in Section 6.

2 Overview of Case Studies

We have identified four case studies, drawing data from a variety of domains and having a variety of formats: plant measurements, household electricity consumption, health records, and government survey data. We refer to Section 5 for detailed descriptions of the data engineering challenges and modelling approaches for each dataset. We also provide a GitHub repository with the datasets and the wrangling processes². Table 1 shows an overview of the wrangling challenges present in each of our use cases. Note that the actual challenges that an analyst encounters depend on the particular analysis that is undertaken, which in turn will depend on the particular analytical task being addressed.

² https://github.com/alan-turing-institute/aida-data-engineering-issues

Tundra Traits dataset: The Tundra Traits data consists of measurements of the physical characteristics of shrubs in the Arctic tundra, as well as the data wrangling scripts that were used to produce a "clean" version of the dataset [2]. The analytical challenge entails building a model to gauge the effect of temperature and precipitation on specific shrub traits related to plant growth (see Section 5.1).

Household Electricity Survey (HES) dataset: The Household Electricity Survey (HES) data contains time-series measurements of the electricity use of domestic appliances, as well as a report on data cleaning by Cambridge Architectural Research [3]. Our chosen analytical challenge is to predict the energy consumption of a household given its appliance profile (see Section 5.2).

Critical Care Health Informatics Collaborative (CleanEHR) dataset: The CleanEHR data contains a sample of anonymized medical records from a number of London hospitals, including demographic data, drug dosage data, and physiological time-series measurements [4], together with publicly available data cleaning scripts. Our chosen analytical challenge is to predict which patients die in the first 100 hours from admission to the hospital. See Section 5.3.

Ofcom Consumer Broadband Performance dataset: The Consumer Broadband Performance dataset contains annual surveys of consumer broadband speeds and other data, commissioned and published by Ofcom³, and available as a spreadsheet for each year [5]. The data engineering challenge here is primarily one of matching common features between the data for any two years, as the names, positions, and encodings change every year. There was no given analytical challenge with the data as published, so we have chosen to build a model to predict the characteristics of a region according to the properties of the broadband connections (see Section 5.4).

³ The UK Office of Communications.

Table 1: List of wrangling issues encountered in the datasets.

    Dataset      Data Organization (DP, DD, DI, DT)   Data Quality (CA, MD, AN, NS)   Feature Engineering (FE)
    Tundra       •••                                   •••                             •
    HES          •••                                   ••                              •
    CleanEHR     •••                                   •••                             •
    Broadband    •••                                   ••                              •

3 Classification of Wrangling Challenges

When data analysts want to employ their preferred machine learning models, they generally need the data to be formatted in a single table. Although data engineering issues extend to other forms of data such as images, text, time series or graph data, in this work we take the view that the data analyst wishes to obtain a clean regular table that contains the information needed to perform the desired analytical task. To define notation, a table is assumed to consist of n rows and D columns. Each row is an example or instance, while each column is a feature or attribute, and attributes can take on a set of different values.

We mainly focus on two large groups of data wrangling issues: those related to organizing the data, and those related to improving the quality of the data. We also include a third group on feature engineering, where the separation between wrangling issues and modelling choices starts to become blurred.

In Table 3 at the end of this section we provide a summary of all the data wrangling problems addressed in this work. Notice that this classification should not be considered a sequential process. Depending on the dataset and the analytical task, some challenges arise while others do not, and the order in which these problems are solved can vary.

3.1 Data Organization

When data scientists start to work on a specific problem with real data, the first problem they face is obtaining a representation or view of their data that is best suited for the task to be solved. Data scientists usually perform several operations while organizing their data: the structure of the raw data is identified so that it can be read properly (data parsing); a basic exploration of the data is performed, where basic metadata is inferred for all the elements in the data (data dictionary); data from multiple sources is grouped into a single extended table (data integration); and data is transformed from the raw original format into the desired data structure (data transformations).

3.1.1 Data Parsing

Data parsing refers to the process of identifying the structure of the raw data source so that it can be read properly. Raw data sources come in many different formats, from CSV files to flat files, XML, relational databases, etc. Some file formats can be read using open source programs, while others are only readable using proprietary software.

Even when data comes in a well-structured file format, it can present a wide variety of encodings. For example, the well-known CSV format allows a wide variety of delimiter, quote and escape characters, and these have to be determined before the file can be loaded into an analysis program. Additionally, multiple tables may appear together in a single file, making it necessary to identify them and extract them properly. Even though many CSV files can be easily read by standard libraries, it is not uncommon to find files where this process fails, and the analyst needs to manually select the parameters to properly parse the data; see [6] for state-of-the-art performance on this task.

Practical example: Table 2 shows an example of the ambiguous CSV files appearing in real-world scenarios. A wrong detection of the CSV parameters can lead to a completely different table output (e.g. notice how choosing either semicolon, space or comma as the field separator results in differently structured tables, each with three columns).

Table 2: An example of an ambiguous CSV file from [6].

    Mango; 365,14; 1692,64
    Apple; 2568,62; 1183,78
    Lemon; 51,65; 685,67
    Orange; 1760,75; 128,14
    Maple; 880,86; 323,43
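To make the parsing step concrete, the dialect of a file like the one in Table 2 can be guessed programmatically with Python's standard csv module. This is a minimal sketch, not part of the original analysis scripts: "fruit.csv" is an illustrative file name, and the sniffer is a heuristic that can guess wrongly on exactly this kind of ambiguous file.

    import csv
    import pandas as pd

    # Guess the field separator from a sample of the raw file; restricting
    # the candidate set encodes prior knowledge about plausible dialects.
    with open("fruit.csv", newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(2048), delimiters=";, ")

    # Load the file with the sniffed separator for inspection.
    df = pd.read_csv("fruit.csv", sep=dialect.delimiter, header=None)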
Additionally, multiple tables may appear instances? Is there metadata with information about together in a single file, making it necessary to identify the features in the data? Additionally, understanding them and extract them properly. Even though many CSV whether the data is a relational table, a time series or an- files can be easily read by standard libraries, it is not un- other format is a key process to decide how to work with common to find files where this process fails, and the an- this data. Unfortunately, data scientists might not be alyst needs to manually select the parameters to properly able to fully understand the contents of the data simply parse the data; see [6] for state-of-the-art performance on by exploration (e.g. headers might be missing, file names this task. Practical Example: Table 2 shows an example might be non-informative, etc.). In such cases, table un- derstanding involves an interaction with a domain expert or the data collectors. Table 2: An example of an ambiguous CSV file from [6]. Practical example: In the Broadband dataset all basic information about the tables is summarized in the names Mango; 365,14; 1692,64 of the different directories that contained the dataset. Apple; 2568,62; 1183,78 Each folder name contained the year and month of each Lemon; 51,65; 685,67 annual survey. Most of the time, each folder only con- Orange; 1760,75; 128,14 tained a CSV file with the different measurements for Maple; 880,86; 323,43 that year (panellist data). However, some of these fold- ers also contained an additional CSV file (chart data) of ambiguous CSV files appearing in real-world scenar- with a collection of tables and captions for those tables, ios. A wrong detection of the CSV parameters can lead with some basic summaries about the complete dataset. to a completely different table output (e.g. notice how choosing either semicolon, space or comma as the field Feature understanding: Normally, some basic infor-

Feature understanding: Normally, some basic information on the features contained in the table is provided, either by a header giving the name of each feature, or by an additional file with detailed information on each feature. When this information is not provided, either a domain expert should be involved to help complete the missing information, or this information must be inferred directly from the data. Feature understanding can take place at a semantic level (e.g. a feature with strings representing countries, see e.g. [8]), or at a syntactic level (e.g. a feature is an integer, or a date [9]). Feature understanding is a crucial process that must be addressed when working with a new dataset. Depending on the predictive task, the data scientist might need to remove features that leak information about the target feature, remove unnecessary features for the task at hand, transform some of these features, or create new features based on existing ones. This process relates heavily to Feature Engineering (see Section 3.3).

Practical example 1: From the latitude and longitude features in the Tundra dataset, we can get a sense of the locations around the world where plant traits were measured. From the experimental protocol (metadata), for each observational site there were a number of individual plots (subsites), and all plants within each plot were recorded. Unfortunately, there was no indication in the data whether the latitude and longitude for each plant were recorded at the site or plot level. Further exploration of the data was needed to discover that the coordinates apply at site level.

Practical example 2: In the CleanEHR demographics dataset, the names of the features in the header were not informative for a non domain expert, with names like DOB, WKG, HDIS or DOD. A systematic matching process of each feature name in the dataset against a website containing the descriptions and information of the different features was necessary to understand the data.

Value understanding: Exploring the values contained in the different features gives data scientists a better sense of what is contained in the data, and it is the first step towards fixing possible errors in the data. Some common questions data scientists normally try to answer during this process are: if a feature is categorical, how many unique elements does it have? Do some features contain anomalous values? Do values outside the expected range of a given feature exist? Value understanding is especially important during the feature understanding process when insufficient information about the features was provided.

Practical example: the "HouseholdOccupancy" feature in the HES dataset, which was inferred to mean "number of people living in the house" from its name, was not numeric. A value "6+" was used to encode that a household had six or more tenants. Since the data scientist wanted this feature to be numerical, these "6+" entries were transformed into "6".
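A minimal sketch of this repair in pandas, with toy values following the example above:

    import pandas as pd

    occupancy = pd.Series(["3", "2", "6+", "4"])

    # "6+" encodes "six or more tenants"; mapping it to 6 makes the
    # feature numerical, at the cost of truncating large households.
    occupancy = occupancy.replace({"6+": "6"}).astype(int)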
3.1.3 Data Integration

Data integration involves any process that combines conceptually related information from multiple sources. Data analysts usually try to aggregate all the information they need for their task into a single data structure (e.g. a table, a time series, etc.), so they can use their desired machine learning model for the predictive task. However, data is often received in installments (e.g. monthly or annual updates), or separated into different tables with different information and links between them (e.g. demographic information about patients, and medical tests performed on each patient). Additionally, it is not uncommon that the structure of the data changes between installments (e.g. an attribute is added when new information becomes available). Commonly, data integration consists of either joining several tables together, i.e., adding new features to an existing table (usually referred to as record linkage), or unioning several tables, i.e., adding new rows to an existing table. Further issues arise when the original data sources contain different data structures (e.g. relational tables and time series).

Record linkage and table joining: This process involves identifying records across multiple tables that correspond to the same entity, and integrating the information into a single extended table. (Record linkage is also called entity disambiguation, or record matching, inter alia [10],[11].) If primary keys exist in the data, the process is relatively straightforward. In such cases, joining two or more tables is usually performed by identifying the common keys across tables and creating an extended table with the information of the multiple tables. However, more involved linkage processes are common during data integration: the names of primary keys can change across tables, a subset of features in the data might be needed to perform the record linkage (e.g. one feature identifies a specific table and another the entity (row) in that table), or even more specialized mappings between features might be needed (e.g. in Tundra Traits, as per Practical example 2 below). The record linkage process is even harder when explicit keys are not available in the data. In such cases, a probabilistic record linkage problem arises. A broad subset of potential features are assigned weights that are used to compute the probability that two entities are the same, and a pair of entities whose probability exceeds a given threshold is determined to be a match. Early work on a theoretical framework for the problem was developed by Fellegi and Sunter [12]. See [13] and [14] for more recent work on probabilistic record linkage.

Practical example 1: The HES dataset is contained in 36 CSV files spread across 11 directories. Although several keys appear repeatedly in different tables, unfortunately the names of these keys were not always the same across tables, even when they referred to the same concepts. For example, the household identifier was sometimes given the name "Household", while in other tables it was called "Clt Ref". Additionally, each key conveyed different information about how the different sources combine. For example, "Household" identified the specific household for which monitoring took place (an ID feature), "appliance code" was a number corresponding to a specific type of appliance (e.g. 182 → trouser press), and "interval id" was a number referring to the type of monitoring for a given appliance (e.g. 1 for monitoring conducted over one month at two-minute intervals).
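A minimal sketch of such a join in pandas; the tables are toy data, and only the key names follow this example:

    import pandas as pd

    households = pd.DataFrame(
        {"Household": [101, 102], "HouseholdOccupancy": [3, 6]}
    )
    monitoring = pd.DataFrame(
        {"Clt Ref": [101, 101, 102], "appliance code": [182, 12, 182]}
    )

    # The household identifier has a different name in each table, so the
    # join keys must be matched explicitly before the tables can be linked.
    extended = households.merge(
        monitoring, left_on="Household", right_on="Clt Ref", how="inner"
    )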

Practical example 2: The Tundra dataset presents an interesting table joining problem. It was necessary to combine the "Tundra Traits Team database", which contained different measurements from plants across the world, with the "CRU Time Series" temperature and rainfall datasets, which contained temperature and rainfall data across the world, gridded to 0.5°, with a separate table for each month between 1950 and 2016. This process was needed since temperature and rainfall have an effect on the measured traits of the plants. The joining of both data sources was performed with a two-step process for each plant:

1. Identify the year and month when the specific trait for a plant was recorded, and look for the corresponding table in the CRU catalog.

2. Check the longitude and latitude values for that plant in the Tundra dataset, and map the corresponding temperature from the CRU dataset into the extended table.

Four different keys were thus necessary in the Tundra dataset ("DayOfYear", "Year", "Longitude" and "Latitude") to connect the temperature data in the CRU catalog to the extended table.

Table unioning: This process involves aggregating together, row-wise, different tables that contain different entities with the same information, producing one extended table. While ideally concatenating several tables together should be straightforward, in many real-world scenarios the structure of the tables can differ dramatically. Some common problems that arise during table unioning include: features located in different positions in two or more tables, header names changing across tables, or features being added or deleted. To address all these problems, the authors in [15] obtained a matching between the attributes in two datasets, based on the idea that the statistical distribution of an attribute should remain similar between the two datasets.

Practical example 1: In the Broadband dataset, the different annual installments distributed across different CSV files needed to be concatenated into a single extended table. An exploration of each table reveals several problems: some columns were renamed between tables, the order of the columns was not preserved, not all columns appeared in each table, and some categorical columns were recoded. The union process for all these tables involved selecting a common structure for the extended dataset, and mapping every feature from each table into the corresponding feature of the extended table. Doing this process manually (instead of using the program in [15]) involved accounting for every single particular mapping across tables in the code.
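A minimal sketch of such a union in pandas; the tables and column names here are hypothetical, and the real mappings were far more numerous:

    import pandas as pd

    y2014 = pd.DataFrame({"Household": [1, 2], "Urban rural": ["Urban", "Rural"]})
    y2015 = pd.DataFrame({"URBAN2": ["Semi-Urban"], "Household": [3]})

    # Map each installment onto the common schema of the extended table
    # before stacking the rows.
    y2015 = y2015.rename(columns={"URBAN2": "Urban rural"})
    extended = pd.concat([y2014, y2015], ignore_index=True)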
Heterogeneous integration: This process involves combining data from different structured sources (relational tables, time series, etc.) and different physical locations (e.g. websites, repositories in different directories, etc.) into a single extended structured format. Data repositories can be massive, consisting of multiple sources across different locations. Getting all this information into a proper format is time consuming and sometimes not feasible. Since the data scientists might be interested only in extracting a small subset of the data that refers to their particular problem, query-based systems are needed during the integration process. This is often handled by an ETL (Extract, Transform and Load) procedure [16], but such code may be difficult to write and maintain. The idea of Ontology Based Data Access (OBDA; [17]) is to specify at a higher level what needs to be done, not exactly how to do it, and to make use of powerful reasoning systems that combine information from an ontology and the data to produce the desired result.

Practical example: The CleanEHR dataset is composed of a collection of R objects that contain demographic information about the different patients, time-series measurements of the different medical tests that were performed on the patients, and some CSV files with detailed information about the different encodings employed in the data. Since the modelling task is related to the times of death of the patients, only parts of the time series data are needed to train the predictive models, because in a real predictive scenario future information is not available. With these considerations, only the files containing the first 10 hours of the patients' stay in the hospital would be extracted to train the model, discarding all the patients that lived less than 10 hours, in order to include as much data as possible. From those selected files, only relevant summary measures of the time-series data (mean, standard deviation and last value measured) should be combined with the demographics data during the table joining process.

Deduplication: It is common after a data integration process that several duplicate instances containing the same or similar information exist in the extended table [11]; these duplicates need to be resolved into one instance. Note that, while record linkage involves identifying matching entities across tables to create an extended table with all the available information, deduplication involves identifying matching entities within a single table to remove the duplicate entries in the data. Techniques similar to those employed during the matching process in record linkage can also be employed during deduplication⁴.

⁴ The generic term entity resolution encompasses both record linkage and deduplication.
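A minimal sketch of exact deduplication in pandas, on toy data; near-duplicates require a matching step, as noted above:

    import pandas as pd

    df = pd.DataFrame({
        "Household": [101, 101, 102],
        "HouseholdOccupancy": [3, 3, 4],
    })

    # Exact duplicate rows can be dropped directly; fuzzy duplicates
    # (typos, alternative spellings) need record-linkage style matching.
    deduplicated = df.drop_duplicates()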

3.1.4 Data Transformation

Most of the time, data analysts need the data to be formatted as an n×D table, where every instance has D attributes. However, the original shape of the data does not always conform to this structure, and one may need to transform the "shape" of the data to achieve the desired form [18]. Furthermore, the original data source may not be tabular, but unstructured or semi-structured. In this case we need to carry out information extraction to put the data into tabular form.

Table transformations: This process involves any manipulation of the data that changes its overall "shape". One of the most common table manipulations is the removal of rows or columns. There are many reasons why data scientists remove parts of the data (e.g. some rows or features being completely or almost completely empty, some information being deemed unnecessary for the analysis task, etc.). Another common table manipulation is switching the format of the table from a "wide" to a "tall" format or vice versa [18], which can be needed e.g. when column headers in the data are values and not variable names. Note that the "tall" format is closely related to the one encoded in a semantic (or RDF) triple, which has the form of a (subject, predicate, object) expression; for example (patient 42, has height, 175cm).

Practical example 1: In the extended table of the Tundra dataset scenario, of all the possible plant traits, only a small subset were measured for each plant. In the original wide format (see Section 3.1.3, record linkage practical example 2), the table is extremely sparse, with many missing entries. By changing the table from a "wide" to a "tall" format, grouping all the trait names under a new feature "Traits" and the values under a new feature "Values", a more compact table is created in which the missing trait values are not explicitly represented. Additionally, the data scientist removed all observations where: a) a geographical location or year of observation was not recorded, b) a species had fewer than four observation sites, or c) a site had fewer than 10 observations. Notice that the new structure of the table was decided according to the analytical task at hand (other structures are possible).
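A minimal sketch of this wide-to-tall transformation in pandas; the columns are toy examples, and the real table has many more traits:

    import pandas as pd

    wide = pd.DataFrame({
        "IndividualID": [1, 2],
        "PlantHeight": [0.31, None],
        "LeafArea": [None, 2.4],
    })

    # Melt the trait columns into (Traits, Values) pairs; traits that were
    # not measured for a plant then simply produce no row.
    tall = wide.melt(id_vars="IndividualID", var_name="Traits", value_name="Values")
    tall = tall.dropna(subset=["Values"])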

Information extraction: Sometimes the raw data may be available in unstructured or semi-structured form, such as free text. A process involving the extraction of the relevant pieces of information into multiple features, while discarding any unnecessary information, is often required. Examples include named entity recognition [19] (e.g. names of people or places), or relationship extraction [20] (e.g. a person is in a particular location). Natural language processing (NLP) methods are often used to tackle such tasks.

Practical example: FlashExtract by Le and Gulwani [21] provides a programming-by-example approach to information extraction. A file contains a sequence of sample readings, where each sample reading lists various "analytes" and their characteristics. By highlighting the required target fields for a few samples, a program is synthesized (from an appropriate domain specific language) which can extract the required information into tabular form.

3.2 Data Quality

When the data has been properly structured into the desired output format, data scientists need to carefully check it in order to fix any possible problems appearing in the data. Data quality issues involve any process where data values need to be modified without changing the underlying structure of the data. Common data cleaning operations include standardization of the data (canonicalization), resolving missing entries (missing data), correcting errors or strange values (anomalies) and detecting changes in the properties of the data (non-stationarity).

3.2.1 Canonicalization

Canonicalization (aka standardization or normalization) involves any process that converts entities with more than one possible representation into a "canonical" format. We can identify different canonicalization problems in datasets according to the transformations needed to standardize the data. One form of canonicalization involves obtaining a common representation for each unique entity under a specific feature (e.g. U.K., UK and United Kingdom refer to the same entity under different representations). This process usually involves a previous step of identifying which values of a given feature correspond to the same entities, which we term cell entity resolution. Another form of canonicalization involves obtaining a standard representation for every value under a given feature (e.g. specific formats for dates, addresses, etc.). This standard representation is often chosen by the data scientist according to the task they intend to solve. A particular case of feature canonicalization involves standardizing the units of physical measurements across a given feature. The final goal of a canonicalization process is to obtain a table where every entity in the data is represented uniquely and all the features in the data follow a common format.

Cell entity resolution: This process involves identifying when two or more instances of a given feature refer to the same entity. It has many similarities with deduplication and record linkage (see Section 3.1.3). The main difference is that the goal of deduplication and record linkage is to match complete instances within the same table or across multiple tables, while cell entity resolution is only concerned with identifying when two or more instances with different representations in a given feature refer to the same entity. A lot of variability can arise when entering values into a dataset. Different names for the same entity, abbreviations, or syntactic mismatches during recording (e.g. typos, double spaces, capitalization, etc.) lead to many different entries in a feature referring to the same underlying concept. Identifying which cells correspond to the same entity allows them to be repaired by matching them to a common or standard concept⁵. The software tool OpenRefine [23] provides some useful functionality ("clustering") for this task.

⁵ We note that [22] argue that such cell entity resolution may not be necessary if one uses a similarity-based encoder for strings which gracefully handles morphological variants like typos. This may indeed handle variants like U.K. and UK, but it seems unlikely to group these with "United Kingdom" unless some knowledge is introduced.

Practical example: In the Tundra dataset there are two features referring to the names of the plants in the dataset: "AccSpeciesName" and "OriginalName". While "AccSpeciesName" contains the name of the plant species, "OriginalName" contains the different names that the scientists employed to record those plants, including typos, capitalization changes, etc. (e.g. AccSpeciesName: "Betula nana" - OriginalName: ["Betula nana", "Betula nana exilis", "Betula nana subsp. exilis"]).

Canonicalization of features: This refers to any process that involves representing a specific feature type (e.g. dates, addresses, etc.) with a standard format. For example, phone numbers can be encoded as "(425)-706-7709" or "416 123 4567", dates can be represented as 25 Mar 2016 or 2016-03-25, etc. While different formats exist for a specific feature type, it is a decision of the data scientist to select the desired standard format for a feature (usually related to the task to be solved) and to convert all the values in that feature to that standard format. This can often be done by writing regular expressions, but this is only possible for skilled operators. The authors in [24] developed a method for learning these transformations via programming-by-example. This was later made available to a very wide user base as the Flash Fill feature in Microsoft Excel.

Practical example: Features related to dates and times in the CleanEHR dataset occur in different formats. For example, in one case both the date and time were recorded in the same feature (DAICU), while in another case the date and time were recorded under two different features (DOD and TOD). Each feature needed to be dealt with case-by-case by the data scientist and transformed into a common datetime Python object.

Canonicalization of units: This refers to any process that involves transforming the numerical values and units of a feature into a standard representation. This problem commonly arises from physical measurements being recorded with different units. For example, height can be recorded both in metres or centimetres, temperature in Celsius or Fahrenheit, etc. Different unit formats can lead to different representations appearing in the data, and the data scientist needs to standardize all of them into a single representation.

Practical example 1: A common unit canonicalization problem appears in the HES dataset. A feature in the dataset measured the volume of the refrigerators in each household ("Refrigerator volume"). Unfortunately, the values were recorded with different units (e.g. litres, cubic feet or no unit), which were also written in different formats (e.g. "Litres", "Ltr", "L"). A process of removing the units from the feature and standardizing all the values according to a common unit (litres in this case) was necessary.

Practical example 2: Several of the features in the CleanEHR dataset were recorded with different units. For example, the feature "HCM" containing the height of the patients was supposedly recorded in centimetres, but around 200 patients had their height recorded in metres. Additionally, another feature "apacheProbability", recording the probability output of the APACHE test, was recorded in the range [0, 100] in some cases and in the range [0, 1] in others. Notice that, in this case, the units were not provided in the data, and reasoning about the meaning of the feature and its values was necessary to determine that a unit canonicalization problem was present for those features. According to the protocol description, "HCM" was encoded in centimetres and "apache-probability" in the [0, 1] range.
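A minimal sketch of the unit repair for heights in pandas, on toy values; the cut-off of 3.0 separating metres from centimetres is an assumption made for illustration:

    import pandas as pd

    heights = pd.Series([180.0, 1.75, 165.0, 1.62])  # "HCM", mixed cm and m

    # Values below the cut-off are implausible as centimetres, so they
    # are assumed to be metres and rescaled.
    heights = heights.where(heights > 3.0, heights * 100)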
3.2.2 Missing data

This issue concerns every operation that involves detecting, understanding and imputing missing values in a dataset. Missing data is a challenging problem that arises very frequently in the data wrangling process [25]. Since many machine learning models assume that the data is complete, data scientists need to detect the missing values and repair them properly. Otherwise, they would be restricted to employing prediction methods that can handle incomplete data.

Detection: Datasets often contain missing entries. These may be denoted as "NULL", "NaN" or "NA", but other codes are also widely employed, like "?" or "-99". While codes like "NULL" or "NaN" are easily identified by a simple exploration of the data, other values encoded with the same type as the feature (e.g. "-99" for numerical values) are disguised missing values [26] and can only be identified by understanding the properties of the features.

Practical example: At first glance, all the missing values in the CleanEHR dataset seemed to be encoded as "NaN". However, after further inspection of the features "apache-probability" and "apache-score", the analyst noticed that these two supposedly positive features had a significant number of "-1" values. Since "-1" does not conform with the range of possible values of those features, these codes were inferred to represent that the APACHE test was not provided, and consequently they should be treated as missing values.
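A minimal sketch of repairing such disguised missing values in pandas, with toy values following this example:

    import numpy as np
    import pandas as pd

    scores = pd.DataFrame({"apache_score": [14, -1, 22, -1, 9]})

    # "-1" lies outside the valid range of the feature, so it is treated
    # as a disguised missing value and re-encoded as an explicit NaN.
    scores["apache_score"] = scores["apache_score"].replace(-1, np.nan)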

Understanding: The process underlying the missing data patterns can provide additional information about the contents of the data. The book by [27] is a seminal reference on the understanding and imputation of missing data, classifying the missing patterns into three groups according to the generation process: a) Missing Completely At Random (MCAR), when the pattern of missingness does not depend on the data; b) Missing At Random (MAR), when the missingness only depends on observed values; and c) Missing Not At Random (MNAR), when the missingness can depend on both observed and missing values.

Practical example: Different missing data patterns exist in the CleanEHR dataset, depending on the features. For example, the "HCM" feature seemingly had entries missing with no correlation with other values in the data (MCAR), while some of the medical tests (e.g. "CHEMIOX" or "RADIOX") had missing entries that seemed to indicate that a specific test was not performed (MNAR). Unfortunately, these features also had entries with "0" values, indicating clearly that a test was not performed (see practical example 1 in Section 3.2.4). From a data scientist's perspective, this means that we cannot be sure whether a missing value indicates a test not performed, or whether there were other reasons for the entry being missing (e.g. someone forgot to fill in this entry).

Repair: This process involves any operation performed on the missing data. The most common procedures are either removing any rows with missing entries (known as "complete case" analysis) or substituting those missing entries with other values according to different rules (imputation). Some imputation methods include replacing every missing entry in a feature by the mean or mode of the statistical distribution (mean/mode imputation), filling those missing entries with values sampled randomly from the observed values of the feature, or applying more sophisticated models for missing data imputation (see [25] for some classical methods and [28],[29] or [30] for recent work). Such multiple imputation methods can represent the inherent uncertainty in imputing missing data. We would like to remind the reader of the difficulty of this problem in a real-world scenario: ground truth for the missing entries is not available, so a data scientist needs to be sure that whatever method they decide to use during repair will impute sensible values for those entries.

Practical example 1: In the Tundra extended table, the geographical location, the date of the observation and the temperature features at a site were often missing. During the integration process, whenever either the location or the year was not provided, the temperature could not be joined into the extended table, resulting in a missing entry. Since these entries would become problematic for the modelling task, every plant where any of these features were missing was removed from the dataset.

Practical example 2: For the missing values in the HES dataset, different approaches were followed depending on the feature. For example, the "House.age" feature had a unique missing entry with a code of "-1", and it was imputed using the most common age range in the feature (mode imputation).

Practical example 3: A similar procedure was also employed in the CleanEHR dataset. For example, the missing entries in "HCM", most likely presenting an MCAR pattern, were imputed using the mean of the distribution (mean imputation) after standardizing the feature (converting the metre values into centimetres, see canonicalization in Section 3.2.1). However, for the medical tests ("CHEMIOX", "RADIOX", etc.) the missing values were imputed with a "0", according to the belief that a missing entry indicated that a test was not performed.
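A minimal sketch of the two repairs in practical example 3, in pandas with toy values:

    import pandas as pd

    df = pd.DataFrame({"HCM": [172.0, None, 168.0], "RADIOX": [1.0, None, 0.0]})

    # Mean imputation for a feature believed to be MCAR...
    df["HCM"] = df["HCM"].fillna(df["HCM"].mean())
    # ...and a domain-informed constant for tests assumed not performed.
    df["RADIOX"] = df["RADIOX"].fillna(0)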
3.2.3 Anomalies

An anomaly can be defined as "a pattern that does not conform to expected normal behaviour" [31]. Since anomalies can potentially bias models towards wrong conclusions from the data, identifying and possibly repairing these values becomes critical during the wrangling process. Anomalies arise in the data for a variety of reasons, from systematic errors in measurement devices to malicious activity or fraud. Anomaly detection is such an important topic that in many problems the detection of anomalies in a dataset is the analytical challenge itself (e.g. fraud detection or intruder detection).

Multiple statistical approaches exist for anomaly detection, which can be grouped by the nature of the problem to be solved. Supervised anomaly detection requires a test dataset with labelled instances of anomalous entries. Classification machine learning algorithms are usually employed in this setting, where we consider that we are provided with an unbalanced dataset [32] (we expect a lower fraction of anomalies). Unsupervised anomaly detection assumes that no labels exist for the anomalous data [33]. Clustering approaches are usually employed in this setting, assuming that most of the data is normal, and that instances not fitting the distribution of the rest of the data are anomalous. Finally, semi-supervised anomaly detection approaches assume that the training data is normal, build a model over it, and test the likelihood of a new instance being generated from the learnt model [34].

Detection: Anomaly detection is in principle harder than missing data detection, since anomalies are not normally encoded explicitly in the data. Approaches to anomaly detection can be classified as univariate (based only on an individual feature) or multivariate (based on multiple features). At a univariate level, a cell might be anomalous for syntactic reasons, e.g. the observed value "Error" is not consistent with the known type (e.g. integer) of the variable⁶ [9]. Another cause of anomaly is semantic, e.g. for an integer field representing months only the values 1-12 are valid, and hence an entry 13 would be anomalous. A further class of anomaly is statistical, where a model is built for the clean distribution of a feature, and anomalies are detected as low-probability events under the clean distribution.

⁶ In the database literature this is termed a domain constraint.
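A minimal sketch of univariate statistical detection on synthetic data; the 8σ rule applied to the Tundra dataset, described in the practical example below, is of this form:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # 500 plausible trait measurements plus one gross recording error.
    values = pd.Series(np.append(rng.normal(3.0, 0.2, 500), 250.0))

    # Flag entries more than k standard deviations from the feature mean.
    k = 8
    z = (values - values.mean()) / values.std()
    anomalous = values[z.abs() > k]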

For multivariate statistical anomaly detection, one should distinguish between row outlier detection (where a whole data instance is declared to be an outlier, and discarded) and cell outlier detection, where only one or more cells in the row are identified as anomalous⁷. Some classical algorithms for unsupervised row outlier detection include Isolation Forests [36] and One-Class SVMs [37]. More recently, deep learning models have been used for cell outlier detection, see e.g. [38],[35].

⁷ Our terminology here follows [35].

A multivariate approach to anomaly detection from the databases community makes use of rule-based integrity constraints, such as functional dependencies or conditional functional dependencies, see e.g. [39]. One example of such a constraint is that a postal code should entail the corresponding city or state, and one can check for inconsistencies. Any violation of an integrity constraint in the data would be identified as an error that needs to be repaired.

Practical example: A univariate statistical approach was followed in [2] to remove anomalous entries in the Tundra dataset. The observations were grouped by a certain taxon (e.g. species, genus or site) and then a threshold was chosen based on eight times the standard deviation of a trait under the taxon (8σ), flagging as anomalous every observation that deviated from the mean of the taxon by more than the threshold. This procedure was employed for all the traits except "plant height" and "leaf area", based on the recommendations of the domain experts.

Repair: Anomaly repair techniques are similar to missing data repair. Normally, data scientists either remove the anomalous entries completely or, when locating anomalous values under a feature, they substitute those values with other sensible values.

Practical example: In the Tundra dataset, the anomalous entries detected with the approach mentioned above were removed from the data, in order to avoid introducing biases into the model.

3.2.4 Non-Stationarity

This issue involves detecting any change occurring in the data distribution over some period of time. This problem is usually referred to as change point detection (CPD) in time series analysis, see e.g. [40]. However, non-stationarity problems can also arise in tabular data. In particular, a well-known problem in machine learning is dataset shift, occurring when the distribution of the input data differs between training and test stages [41]. Additionally, a common source of non-stationarity in wrangling problems arises from a change in how the data is collected. In some cases, the protocol of data collection can change over time, e.g. the units of some measurements are changed, labels are recoded, etc. In other cases, when a clear data collection protocol is not established, different people or institutions may collect data using different criteria, and the final combined dataset presents clear format variability that remains constant over some chunks of the data (normally related to the person or institution that collected that piece of information).

Change points: CPD in time series is usually concerned with identifying those points in time where a specific change in the probabilistic distribution of the data occurs [40]. Changes may occur in the mean, variance, correlation or spectral density, among others. Also, a change can be either an abrupt change in the distribution or a drift of the probability distribution over time. We refer to [42] for a review of CPD methods and [43] for an extensive evaluation of CPD algorithms.

Protocol changes: This problem involves any process that changes the collection of the data over a period of time, and is clearly related to the problem of table unioning described in Section 3.1.3. In many cases data scientists are provided directly with a dataset that contains a combination of different tables. It is common that this combination results in the format of some features changing, new labels being added to the data [44], or measurements being recorded in different units (see Section 3.2.1). Detecting when these changes in the data collection occur is vital for solving other wrangling problems arising from these changes.

Practical example 1: In CleanEHR, a patient undergoing a specific test was always encoded with "1". However, a test not being performed on a patient had two different encodings: "0" for some groups of patients and "NaN" for other groups. These two sets of encodings follow a sequential pattern, since the data is an aggregate of datasets from different hospitals and each hospital used different encoding conventions (either "1" and "0", or "1" and "NaN"). Without noticing this problem, a data scientist might remove some of these features from the data, assuming that NaN implies missing data.

Practical example 2: In the Broadband dataset, the feature indicating whether the broadband connection was in a rural or urban environment not only had different names across tables, but also some tables encoded only "Rural" and "Urban" (installments from 2013 and 2014) while others encoded "Rural", "Urban" and "Semi-Urban" (installments from 2015). In principle, these features could be combined easily; however, another problem arises from the modelling perspective. There is no knowledge of whether the "Rural-Urban" installments contained no "Semi-Urban" instances, or whether a new encoding of the different locations was introduced at some point (a non-stationarity problem). The data analyst decided to encode "Semi-Urban" instances as "Rural", to obtain a binary classification problem, since the test data did not have any "Semi-Urban" instances either.
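A minimal sketch of the harmonization chosen in practical example 2, in pandas with toy values:

    import pandas as pd

    regions = pd.Series(["Rural", "Urban", "Semi-Urban", "Urban", "Rural"])

    # Fold the category introduced in the 2015 installments into one of
    # the original two classes, yielding a binary Rural/Urban encoding.
    regions = regions.replace({"Semi-Urban": "Rural"})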

3.3 Feature Engineering

Once the data is relatively clean and organized in a single table, a data scientist will start to be concerned about the representation of the features in the data needed for the analytical task at hand. It is common that the raw features in the data are not suitable for the machine learning algorithm to be employed. For example, the data might contain a combination of string and numerical features, while the method the analyst intends to employ only accepts numerical values as inputs. A feature engineering process, where some transformations are applied to the features, or new features are created based on existing ones, is arguably one of the most important aspects of a data scientist's work. This can be carried out based on the analyst's knowledge or beliefs about the task at hand, and the information required to solve it. Additionally, methods employed to extract meaningful features from the data are usually domain specific. For example, when the data involves sensor readings, images or other low-level information, signal processing and computer vision techniques may be used to determine meaningful features. Also, more recently, deep learning approaches have been used to learn appropriate features in an end-to-end fashion in such domains, given sufficient training data.

We define feature engineering as any manipulation of the data that involves changing the number of features contained in the data or their properties. Some common feature manipulations include: aggregating several features with the same information into a single feature, creating new features based on already existing features in the data, deleting unnecessary features, or creating one-hot encoding representations of categorical variables. Notice that some of these manipulations can also be categorized as table manipulations (see Section 3.1.4).

Practical example 1: In the HES dataset, redundant information in the demographics table was combined into a newly created feature "PensionerOnly", and all the features with information already contained under "PensionerOnly" were removed. Also, "House.age" was originally encoded as a string with a range of years (e.g. 1950-1966). To encode this as a numerical feature, the difference between the midpoint of the range and 2007 was computed (2007 being the latest age of any house in the dataset).

Practical example 2: In the CleanEHR dataset, many features that were directly correlated with the time of death of the patients had to be removed (e.g. "CCL3D" recorded the number of days the patient was in the ICU), as these would "leak" information about the chosen target variable. Additionally, every patient that was admitted into the ICU already dead, and every patient where no time of death was recorded, was removed from the dataset, since the evaluation of the predictive model is not possible without ground truth.

4 Related Work

The literature on data engineering is extensive, with many works addressing general problems in the field, while others focus only on a specific data wrangling issue. We discuss here a number of other review papers on general data engineering.

The authors in [45] divide data cleaning problems into schema level and instance level. At the schema level the focus is on integrity constraints (which fall under our anomaly detection heading), while the instance level contains problems such as missing data, canonicalization, data integration, data description, etc. Note that these instance level issues are not systematically categorized in the paper, but only listed as examples (see e.g. their Table 2). The authors also distinguish single-source and multi-source problems, although the main new issue that they identify for multiple sources is that of record linkage.

The CRISP-DM framework of [46] highlights two phases relating to data engineering, which they call data understanding and data preparation. Their data understanding phase includes data description (similar to our data dictionary heading) and verification of data quality (which relates to our headings of missing data and anomalies), as well as data collection and data exploration. Their data preparation phase includes data selection and cleaning tasks (including the handling of missing data), data integration, and tasks we describe under feature engineering, including the formatting of data for modelling tools and the construction of derived features. Our descriptions in Section 3 are more detailed than those provided in the CRISP-DM paper.

A taxonomy of "dirty data" is provided in [47]. At the highest level this taxonomy is split between missing data (their class 1) and non-missing data (class 2). Class 2 is further broken down into 2.1 "wrong data" (due e.g. to violation of integrity constraints, and to failures of deduplication), and 2.2 "not wrong, but unusable data" (which includes issues of canonicalization and record linkage). In our view the choice of a top-level split on missing/non-missing data creates a rather unbalanced taxonomy. Note also that the authors do not provide much discussion of the topics falling under our data organization heading.

In [48], the authors refined the notion of data readiness levels introduced by [49], to provide levels C (conceive), B (believe) and A (analyze). Level C relates mainly to our data organization issues, especially data parsing and data integration. Their level B mainly covers topics falling under our data quality heading (missing data, canonicalization), along with data dictionary and data integration (deduplication) topics from data organization. Their level A includes mainly feature engineering issues, but also includes outlier detection (which we class under anomaly detection) and a semantic description of features (see our data dictionary heading). Note also that our paper provides a fuller discussion of the issues; in [48] the topics mentioned under levels A, B and C get only one paragraph each.

10 Table 3: Summary of data wrangling problems. Data Organization Data Parsing What is the original data format? How can it be read properly? Data Dictionary Table description What is the meaning of the table? Is there a header? Are there several tables in the data? Feature description What is the type of each feature? What are the possible ranges of values? What is the meaning of the feature? Value description What is the meaning of the values of a feature? Are there unexpected values in a feature? Data Integration Table joining How do we join tables containing the same entities? How do we detect common entities across tables? Table unioning How do we aggregate tables with common information together? Are common features consistent across tables? Heterogeneous integration How do we combine e.g. time series with relational data or images? How do we query multiple datasets stored in different locations? Deduplication How do we detect duplicate entities in the data? Data Transformation Table transformations Should we use a “tall” format or a “wide” format? Should we remove rows or columns? Information extraction Are there properties of the data contained in free text features? If so, how do we extract this information? Data Quality Canonicalization Cell entity resolution Do some values in a feature represent the same concept? If so, how do we group them under a common concept? Features Are there different formats of a feature present? What is the standard format? Units Are measurements recorded with the same units? Missing Data Detection Which cells are missing? Which values are used to represent missing data? Understanding Why is the data missing? Is it MCAR, MAR or MNAR? Repair Should we impute missing values? Should we delete rows or columns containing missing values? Anomalies Detection How do we identify anomalous entries across features? Repair Should we leave these values in the data or repair them? Should we remove rows with anomalies? Non-Stationarity Change points Does the behaviour of a time series change abruptly? Is there a meaning behind these changes? Protocol changes Do the properties of the data change while exploring it sequentially? Are there different feature encodings present across consecutive entities? Feature Engineering Feature Engineering Are the features in the data useful for the task at hand? Do we need to concatenate or expand features? Do we need to create new features based on existing ones? paragraph each. such as mapping operations, fold operations, sorting or In [7], the authors survey the task of data profiling, aggregations among others. It also infers statistical data which is defined (following [50]) as “the activity of creat- types and higher-level semantic types. Another tool is ing small but informative summaries of a database”. This Data Civiilzer [55] which offers functionality including is clearly related to our data dictionary task, but they entity matching across multiple heterogeneous sources, also discuss detecting multivariate dependencies, which handling missing values, and canonicalization. we mainly cover under the heading of anomaly detection. The identification of various problems as described above leads to the question of tools to help addressing 5 Case studies them. We have pointed out various tools to address spe- cific problems in Section 3. However, there is also a need We showcase here the full data wrangling analysis per- for systems that allow the integration of multiple individ- formed in our four case studies. 
For each dataset X, we start by giving a description of the raw dataset, followed by: Section 5.X.1 gives a description of the analytical challenge associated with the dataset. Section 5.X.2 provides an ordered pipeline of the data engineering issues addressed in each dataset to obtain a clean version of the data. Section 5.X.3 describes the issues that arose against the headings of our classification. Finally, Section 5.X.4 describes the model employed for the associated analytical task and the final results of the analysis.

5.1 Tundra Traits

This use-case comprises two separate datasets:
1) The Tundra Traits Team database. It contains nearly 92,000 measurements of 18 plant traits. The most frequently measured traits (more than 1,000 observations each) include plant height, leaf area, specific leaf area, leaf fresh and dry mass, leaf dry matter content, leaf nitrogen content, leaf carbon content, leaf phosphorus content, seed mass, and stem specific density. The dataset also comes with a cleaning script prepared by its originators, which we have followed.8
2) The "CRU Time Series" temperature data and rainfall data9. These are global datasets, gridded to 0.5°. We use only the date range 1950 to 2016.

8 Publicly available at https://github.com/ShrubHub/TraitHub. We were kindly granted early access to this dataset by Isla H. Myers-Smith and Anne Bjorkman of the sTUNDRA group (https://teamshrub.wordpress.com/research/tundra-trait-team/).
9 http://catalogue.ceda.ac.uk/uuid/58a8802721c94c66ae45c3baa4d814d0

5.1.1 Analytical Challenge

The challenge posed in [2] was to infer how climate affects shrub growth across a variety of measured traits. We attempt to fit a model predicting certain traits (e.g. vegetative height) from temperature, similar to the first panel of Figure 2b in [2]. Much of our approach follows [2]; however, we have made some simplifications and assumptions, and we acknowledge that we are not experts in this domain. We also do not have access to the original code.

There is some terminology to know. The measurement subjects are individual plants, and the types of measurements on a plant are called traits. For example, "vegetative height" is a trait. Only the species of a plant is recorded, so no identification of the same plant across different measurements of the same trait is possible; however, multiple traits on the same plant are measured and recorded together. The experimental protocol in this field, as we understand it, is to delineate, for each observational site, a number of individual plots (or subsites) and to record all plants within each plot, typically in a single session on a particular day. Thus, an "individual observational unit" is a single plant, within a particular plot, on a given day; the properties of the unit are the traits. Notice that the creators of the Tundra dataset provided a script ("TTT data cleaning script.R") that attempts to make the raw data more consistent and interpretable, although they do not perform a complete data wrangling.

5.1.2 Pipeline of data engineering steps

The first five steps listed below were carried out using the early-access version of the data made available privately to us. The remaining steps apply to the publicly available version of the data (what they call data raw).

1) Loading the Tundra Traits data (DP):
• The csv of the data in wide format is loaded (DP).
• An individual ID variable is created for each plant (FE).
• Header names are renamed into more informative names (DD).

2) Data is transformed from a wide format into a tall format, where two variables are created: a "Trait" variable with the names of the traits and a "Value" variable with the corresponding values for each trait (DT).
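The wide-to-tall reshaping in step 2 is a standard operation; the originators' script performs it in R. As an illustration only, the sketch below shows a pandas equivalent on a toy table with hypothetical column names.

```python
import pandas as pd

# Minimal sketch of the wide-to-tall transformation (step 2), assuming
# hypothetical column names; the originators' script performs this in R.
wide = pd.DataFrame({
    "IndividualID": [1, 2],
    "PlantHeight": [0.30, 0.45],      # one column per trait in wide format
    "LeafArea": [2.1, None],          # None: this trait was not measured
})

tall = wide.melt(id_vars="IndividualID", var_name="Trait", value_name="Value")
tall = tall.dropna(subset=["Value"])  # unmeasured traits are simply not represented
print(tall)
```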
3) A removal of untrustworthy data is performed at this stage, according to different rules (DT):
• Some observations are removed because the analysts were unsure about the units of the measurements.
• Some traits were removed because the analysts were unsure of what they meant.
• They removed plants according to their "Treatment" feature values.

4) A data dictionary process occurs at this point, where additional information is added into different features (DD):
• Units are added for each trait (they create new features).
• New names for the traits are added (they create additional features).

5) A "Genus" feature is created, using only the first part of the "AccSpeciesName" feature in the data (FE).

6) At this stage, the analysts create different summary statistics from the data according to different groupings, and use them to filter part of the observations in the dataset according to different rules (DT + FE).

The groupings considered in the analysis are: by Trait; by Trait and "AccSpeciesName"; by Trait and Genus; by Trait, "AccSpeciesName" and "SiteName"; and by Trait, Genus and "SiteName".

The statistics computed in a given group, and added for each individual x_i in that group, are:

• Median, standard deviation and number of elements of the group.
• Mean of the rest of the elements of the group, without considering x_i.

• Error risk of each value under the group, according to a) the computed mean and standard deviation and b) the median and standard deviation. Error risk is explained further under Anomaly Detection in Section 5.1.3.
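These grouped statistics can be sketched with pandas groupby transforms. The column names, toy values and the exact error-risk formula below (distance from the group mean in units of the group standard deviation) are our reading of the description above, not the authors' R code.

```python
import pandas as pd

# Sketch of the per-group statistics (step 6), under assumed column names.
df = pd.DataFrame({
    "Trait": ["Height"] * 5,
    "AccSpeciesName": ["Salix arctica"] * 5,
    "Value": [0.30, 0.32, 0.29, 0.31, 3.10],   # last value looks suspicious
})

g = df.groupby(["Trait", "AccSpeciesName"])["Value"]
df["group_median"] = g.transform("median")
df["group_std"] = g.transform("std")
df["group_n"] = g.transform("count")
# Leave-one-out mean: mean of the group without the current observation.
df["loo_mean"] = (g.transform("sum") - df["Value"]) / (df["group_n"] - 1)
# One plausible "error risk": distance from the group mean in std units.
df["error_risk"] = (df["Value"] - g.transform("mean")).abs() / df["group_std"]

threshold = 8.0  # cf. the 8-sigma rule described in Section 5.1.3
df_clean = df[df["error_risk"] <= threshold]
```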

7) After this filtering process, all unnecessary features are removed from the dataset (most of the previously computed statistics), obtaining a cleaned version of their data (DT). This is the state where, in the publicly available version of the data, the authors say that the data is clean. Note that the integration of the temperature data from the CRU dataset is not performed in the public repository.

8) Temperature data from the CRU dataset is integrated into the Tundra dataset according to the location of the plants (longitude and latitude) and the day and year when the measurements were recorded (DI).

5.1.3 Data Engineering Issues

Data Dictionary: some issues were raised during the exploration of the data.
1) Latitude and longitude coordinates are given for each row in the Tundra dataset, but it is not stated whether those coordinates apply to a site or to a plot. Inspection of the data seems to indicate that in fact the coordinates apply at site level, so that all plots within a site should be taken to have the same coordinates.
2) In the original analysis [2], the authors define temperature as "the average temperature of the warmest quarter of the year". For simplicity our analysis fixes a particular month and uses the average temperature in that month.
3) Columns were renamed to a standard terminology.10
4) Units for each measurement were added.11
The further from the mean of the taxon than the threshold CRU data is gridded: the region of interest is di- ”error risk”. (For certain taxon, the threshold error risk ◦ vided into cells whose centres are 0.5 apart, starting at depends on the count of observations.) For example, ob- ◦ ◦ ◦ ◦ −179.75 N, −89.75 E and ending at 179.75 N, 89.75 E. servations more than 8σ from the mean of all observations Each file (apart from its header rows) gives the tempera- of a particular trait are excluded (unless the trait is plant ture (or precipitation) for a specific month, at a matrix of height or leaf area). geographical locations. Thus, to determine the monthly Presumably, these tactics reflect domain knowledge temperature for a given plot, we must load that month’s about how datasets like these become corrupted (for ex- temperature data into a matrix, convert the row and col- ample, perhaps incorrect assumptions about units are umn indices into longitude and latitude, and look up the common), although a principled strategy is not clear. coordinates of a particular plot.

Data Transformation: The original format of the Tundra dataset had one row for each observation of an individual plant, in which each possible measurement was represented by a column. However, since most observations only include a subset of the possible measurements, there were many missing values in this format. So the table has been changed from "wide" to "tall" format: missing measurements are now simply not represented.12

Canonicalization: there was an underlying cell entity resolution problem in the dataset, where the same plants were recorded with different name variations (including typos). It was necessary to group the rows referring to the same plants and to provide a unique identifier for each plant ("AccSpeciesName" column).

Missing Data: For some entries in the Tundra dataset, a temperature measurement is not available. There were two sources of missing data: (1) the geographical location or the year of observation were not recorded for a specific measurement, and as such, a temperature value could not be extracted during the data integration; and (2) a temperature value for a specific geographical location and year of observation was not available in the "CRU Time Series" dataset.

In both cases, the value in the matrix is replaced by a code meaning "missing". The metadata header rows in each file specify the value of the disguised missing entry (−9999.999). However, the value that is actually used in the data is not the one given; it is −10000.0.
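Once the sentinel actually used is known, repairing the disguised missing values is mechanical. A minimal sketch, assuming NumPy arrays and defensively replacing both the declared and the observed code:

```python
import numpy as np

# The metadata declares -9999.999 as the missing-data code, but the value
# actually present in the files is -10000.0: replace both, defensively.
SENTINELS = (-9999.999, -10000.0)

def decode_missing(grid: np.ndarray) -> np.ndarray:
    out = grid.astype(float).copy()
    for s in SENTINELS:
        out[np.isclose(out, s)] = np.nan   # NaN makes missingness explicit
    return out
```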

Anomaly Detection: There is significant identification and removal of "anomalous" data in this task. Some of it was done by the sTUNDRA team before releasing the dataset; additional filtering was done for the journal article. We have used the initial filtering exactly as given13, but our secondary filtering may be slightly different from the authors'.

Both kinds of filtering are described in the Methods section of [2] and both follow a similar strategy: (i) group the observations by a certain taxon (which may be species, genus, site, or particular interactions of these); (ii) choose a threshold "error risk" (multiples of the standard deviation of a trait within the taxon); (iii) remove observations further from the mean of the taxon than the threshold "error risk". (For certain taxa, the threshold error risk depends on the count of observations.) For example, observations more than 8σ from the mean of all observations of a particular trait are excluded (unless the trait is plant height or leaf area).

Presumably, these tactics reflect domain knowledge about how datasets like these become corrupted (for example, perhaps incorrect assumptions about units are common), although a principled strategy is not clear.

Feature Engineering: All covariates were "centred" by subtracting their species mean.

10 Provided by "TTT data cleaning script.R".
11 Also provided by "TTT data cleaning script.R".
12 This was done following "TTT data cleaning script.R".
13 Provided by "TTT data cleaning script.R".

5.1.4 Modelling

After this preliminary clean-up, the analyst has a dataset of 66,308 observations covering 18 traits and 528 species. For the specific challenge addressed in [2], the authors perform additional filtering, mostly to remove observations that would otherwise be difficult to integrate into the modelling.

We note that the line between "wrangling" and "analysis" is becoming rather blurred at this stage. In particular, one could imagine a modelling approach that allowed certain observations to have missing components, whilst using the data in the other components. We followed [2] by:

1. Removing all observations without a geographical location (latitude and longitude), year of observation or temperature data;
2. Removing all observations for species present in strictly fewer than four observation sites;
3. Removing all sites with fewer than 10 observations;
4. Removing species without a sufficient temperature span in their observations, where "sufficient" is defined by "those species for which traits had been measured in at least four unique locations spanning a temperature range of at least 10% of the entire temperature range."

In this analysis, we used a modified version of the hierarchical model described in [2], with some changes in the notation. Let Σ range over species, ∆ range over plots, and let individual observations be identified by i. We use the following hierarchical model of the relationship between the measurement of a plant trait, y, and temperature, T:

    log y_i(T) | Σ, ∆ ∼ Normal(α_Σ + β_Σ T_∆, σ)
    α_Σ ∼ Normal(µ_A, σ_A),   β_Σ ∼ Normal(µ_B, σ_B)

with parameters α_Σ, β_Σ and σ, and hyperparameters µ_A, µ_B, σ_A and σ_B, to be inferred from the data.
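The analyst implemented and fitted this model in Stan. Purely as an illustration, the same hierarchy can be written compactly in PyMC (v4 or later); the synthetic data and the hyperprior scales below are our own assumptions, not the analyst's choices.

```python
import numpy as np
import pymc as pm

# Illustrative restatement of the hierarchy above in PyMC; the analyst used
# Stan, and all data arrays here are synthetic placeholders.
rng = np.random.default_rng(0)
n_obs, n_species = 200, 12
species = rng.integers(0, n_species, size=n_obs)   # Sigma: species index
T = rng.normal(8.0, 3.0, size=n_obs)               # T_Delta: plot temperature
log_y = rng.normal(0.5 + 0.05 * T, 0.3)            # stand-in for log trait values

with pm.Model():
    mu_A = pm.Normal("mu_A", 0.0, 10.0)            # hyperprior scales assumed
    sigma_A = pm.HalfNormal("sigma_A", 5.0)
    mu_B = pm.Normal("mu_B", 0.0, 10.0)
    sigma_B = pm.HalfNormal("sigma_B", 5.0)
    alpha = pm.Normal("alpha", mu_A, sigma_A, shape=n_species)  # per-species intercept
    beta = pm.Normal("beta", mu_B, sigma_B, shape=n_species)    # per-species slope
    sigma = pm.HalfNormal("sigma", 5.0)
    pm.Normal("log_y", mu=alpha[species] + beta[species] * T, sigma=sigma,
              observed=log_y)
    trace = pm.sample(1000, tune=1000)
```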

Different Bayesian models can now be fitted to the data; the analyst used Stan14 to perform this step. A variety of goodness-of-fit and convergence tests were performed: examination of paired plots and trace plots, density plots of sampled estimates vs the original data, and leave-one-out checks. The model described above was found to perform best among the alternatives, as established by its R2 (coefficient of determination) of 0.75, calculated over the test set.

14 https://mc-stan.org/

Results: This model can be evaluated using goodness-of-fit diagnostics and held-out data. We split our dataset in two, train (80% of the total observations) and test (20% of the total), and assess the model by predicting trait values on the test set. For every data point of the test data, which consists of a species and a temperature for a given trait, 1000 parameter values were sampled from the fitted model sampling chains. These values were used to produce 1000 trait estimates as per the model described above. The mean trait value from these estimates was then taken as the prediction and compared to ground truth.

[Figure 1: Predicted vs ground-truth values for the best model, for the trait "Leaf nitrogen (N) content per leaf dry mass (mg/g)": true observations compared with predicted observations from the test set. The ideal actual-predicted line is shown in red.]

Figure 1 shows how observations sampled from the fitted model for "Leaf nitrogen (N) content per leaf dry mass" capture at least part of the effects of temperature. The model is partially accurate in predicting trait values given site temperature and species. We omit the precipitation data, although the wrangling and modelling phases are identical and can be accomplished with no modifications to the code besides a change in the CRU dataset in use.

5.2 Household Electricity Survey

The Household Electricity Survey 2010–2011 (Department of Energy and Climate Change 2016), a study commissioned by the UK government, collected detailed time-series measurements of the electrical energy consumption of individual appliances across around 200 households in England. The data from that study are available from the UK Data Archive (UKDA, but only on application); an overview is publicly available from the government's website [3].

The available data from the archive are not the raw measurements but have instead been "cleaned" by Cambridge Architectural Research Ltd (CAR). Fortunately, CAR have also provided a report describing the process of data cleaning that they undertook. In the following, we report the relevant information from the provided documentation.

All households (all owner-occupied) were monitored from 2010 to 2011. Periodic measurements were taken of the power consumption of all electrical appliances in each household. The majority of households were monitored for a single month. The power drawn by appliances in these households was recorded every two minutes. The remaining households were monitored for a full year, with measurements made every 10 minutes (every two minutes in some cases). Finally, the study also collected qualitative "diaries" from each household, in which the occupants recorded how they used each appliance. We have not made any use of these diaries here.

Within the data are three primary entities:
• Households: one of the 250 residences for which monitoring took place.
• Appliances: an electrical appliance that was monitored.
• Profiles: electricity load profiles, where the power usage from a given appliance or group of appliances is measured at regular intervals.

The dataset consists of 36 CSV files, spread across 11 directories. For our purposes we will group 6 of these directories into one, called appgroupdata. The directories are then as follows:
• small: Metadata about appliances.
• originalhes: Unmodified version of the HES dataset.
• experianhes: Files created by Experian that give a consumer classification for each household.
• appdata: One file containing metadata relating to the monitored appliances.
• anonhes: Anonymized HES files containing demographic information and metadata about the houses.
• appgroupdata: Files containing the profile information for each appliance that was monitored.

There are several encodings (or keys) that appear across tables, and that are necessary for combining information from multiple tables (joining the tables). These are:
• household id: a number that corresponds to a specific household for which monitoring took place.
• appliance code: a number that corresponds to a specific type of appliance (e.g., 182 -> trouser press).
• appliance type code: a number that corresponds to a group of appliances (e.g., 0 -> Cold Appliances).
• interval id: a number that refers to the type of monitoring that occurred for a given appliance (e.g., 1 -> for one month at two-minute intervals).

There are some other encodings, but the four described above appear most frequently across the tables.

The "important files" (defined as such by Cambridge Architectural Research in their report) are as follows: appliance codes.csv, appliance type codes.csv, appliance types.csv, total profiles.csv, appliance data.csv, rdsap public anon.csv, ipsos-anonymizedcorrected 310713.csv, and the folder appgroupdata containing 6 CSV files with the profile information of each appliance.
5.2.1 Analytical Challenge

The dataset can be used to investigate disaggregation models, where the signal with the overall electricity consumption is used to estimate the consumption from component appliances. [56] provides an approach to the full disaggregation problem using a hidden Markov model. Since the focus of the present analysis is data wrangling, this task is overly involved for our purposes. Instead, the analyst attempted to solve a simpler problem. The different appliances in the HES data are categorised into "Appliance Groups". Examples of appliance groups are "heating", "audiovisual" and "cold appliances". There are 13 such groups in total. The analyst attempted to predict the consumption of a given household in each of these appliance groups over a whole day (instead of at 2 or 10 minute intervals), based on the household information, the day and month of the observation and the total electricity consumption for that day.

5.2.2 Pipeline of data engineering steps

1. Loading the necessary recording files information:
• Parsing the csv files (DP)
• Selecting the necessary files among all folders (DD)
• Adding header names to the profile data (DD)
• Removing white spaces from header names (CA)

2. Record linkage, where we add the appliance type code of each appliance into the profile information (DI). This appliance type code is referred to as "GroupCode" in the different files.
• Three files are linked: "appliance type codes.csv", "appliance types.csv" and the profile data.
• "appliance type codes.csv" contains "GroupCode" and "Name", "appliance types.csv" contains "ApplianceCode" and "GroupCode", and the profile data contains the "ApplianceCode" for each measurement.
• First, each "ApplianceCode" is linked to a "Name" using the "GroupCode" as primary key.
• Two names for "GroupCode" are "Other" and "Unknown", so every "ApplianceCode" related to one of these appliance groups is removed (MD).
• Lastly, each measurement in the profile data is linked to a "GroupCode", using "ApplianceCode" as the primary key. Notice that, by removing the "ApplianceCode" linked to "Other" and "Unknown", in the final integrated data those "GroupCode" missing values are now standardized to "NaN".

3. After the DI step, we remove all rows that contain missing values in the "GroupCode" feature (DT + MD). These values correspond to temperature readings instead of power consumption readings.
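The two-stage linkage in steps 2-3 amounts to a pair of left joins followed by a filter. The sketch below uses toy stand-ins for appliance type codes.csv, appliance types.csv and the profile data; only the join keys follow the description above.

```python
import pandas as pd

# Sketch of the linkage in steps 2-3, with toy stand-ins for the files.
type_codes = pd.DataFrame({"GroupCode": [0, 98],
                           "Name": ["Cold Appliances", "Other"]})
appliance_types = pd.DataFrame({"ApplianceCode": [182, 7],
                                "GroupCode": [98, 0]})
profiles = pd.DataFrame({"ApplianceCode": [182, 7, 7],
                         "Reading": [0.4, 1.2, 1.1]})

# Attach a GroupCode (and its Name) to every appliance, then drop the
# uninformative "Other"/"Unknown" groups, as the analyst did.
appliances = appliance_types.merge(type_codes, on="GroupCode", how="left")
appliances = appliances[~appliances["Name"].isin(["Other", "Unknown"])]

# Link each profile measurement to its group; appliances removed above
# now surface as NaN GroupCodes, which are dropped (step 3).
linked = profiles.merge(appliances, on="ApplianceCode", how="left")
linked = linked.dropna(subset=["GroupCode"])
```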

4. Loading the demographic data (DP) and selecting only a subset of ten features from the data (DD + DT).

5. Now, different data wrangling challenges are tackled in the demographic data, to obtain a cleaner version:
• From "HouseholdOccupancy", transform any "6+" code into "6" (CA).
• Create a "PensionerOnly" indicator feature using "SinglePensioner" and "MultiplePensioner" (FE).
• Remove redundant features (DT).
• "Social.Grade" is an ordinal feature (e.g. A, B, C1, etc.) where each category is transformed into an ordered numerical value, and the only missing entry is replaced with the mode of the feature (FE + MD).
• "House.age" is transformed from a year interval (e.g. 1930-1949) into a numerical value, using the difference between the year when the measurements were taken (2010) and the middle value of the interval (FE).
  – "-1" indicates unknown, so these values are replaced with the mode of the feature (MD).
  – "2007 onwards" was encoded as "3" (2010 - 2007).

6. Creation of data X and target y variables:
• The target y is the aggregate power consumption per "GroupCode" for each household and date (FE + DT).
• The data X is created through a two-step process:
  – Aggregate the total consumption for each household and date (DT).
  – Link the demographic data into this aggregate data using the Household ID as primary key (DI).

7. Feature engineering processes done on both X and y:
• In X, the month and day of the week are extracted from the date, creating 2 new features: "month", a number between 1 and 12, and "day of the week", a one-hot encoded variable with 7 categories (FE).
• Three Fourier components are created based on the month of the recording for each household and date (FE). Afterwards, "month" is removed (DT).
• For the target, a standard log(1 + x) transformation for non-negative target variables is applied for each "GroupCode" power consumption feature (FE).
• Numerical features are scaled (both in X and y), dividing them by the difference between their min and max values (FE).
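One plausible reading of these feature engineering steps, on a toy design matrix, is sketched below; in particular, "three Fourier components" is interpreted here as the first harmonics of an annual cycle, which is an assumption on our part.

```python
import numpy as np
import pandas as pd

# Sketch of step 7 on a toy design matrix; "three Fourier components" is
# read here as the first harmonics of an annual cycle (an assumption).
X = pd.DataFrame({"month": [1, 4, 7, 10], "TotalUsage": [12.0, 9.5, 7.2, 11.1]})
y = pd.Series([3.0, 0.0, 1.5, 2.2], name="consumption")

angle = 2 * np.pi * X["month"] / 12.0
X["f_sin1"], X["f_cos1"], X["f_sin2"] = np.sin(angle), np.cos(angle), np.sin(2 * angle)
X = X.drop(columns="month")                # "month" is removed afterwards

y_log = np.log1p(y)                        # log(1 + x) for a non-negative target
X_scaled = X / (X.max() - X.min())         # scale by the max-min range per column
```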

5.2.3 Data Engineering Issues

It is worth noting that many more wrangling challenges are described in the user guide for this dataset. It appears that the wrangling described in the user guide has been carried out already, so that the version available for download from UKDA has been pre-processed.

Data Dictionary: The tables in this dataset form a rich relational structure that must be inferred by inspection of the files and using the documentation. It is sometimes unclear what the meaning of a given field is, or even the purpose of a sometimes redundant table. An example of the former: the Index column in the monitered appl.csv file. An example of the latter: appliance attributes.csv is actually redundant, but this is not immediately obvious. Presently, this wrangling challenge is solved by spending a substantial amount of time inspecting the data, possibly missing relevant information in the process.

Data Integration: In many of the tables, fields that are in fact referring to the same information are given different names. This makes the task of joining tables difficult. For example, the household identifier is sometimes given the field name Household (e.g., in appliance data.csv), but on other occasions is referred to as Clt Ref (e.g., in rdsap glazing details anon.csv). This is an example of a key identification problem during record linkage. This challenge is again solved manually by the analyst, with the obvious costs and risks of introducing errors. An automated agent inspecting whether two columns might actually contain values from the same domain (a primary key - foreign key matcher) would greatly ease this inspection.

Data Transformation: Subsets of observations in the dataset were manually identified and removed due to different problems. For example, some of the rows in the monitoring data did not correspond to electricity readings, but instead to outdoor temperature measurements.

Canonicalization: Many examples of this problem exist in the data. For example, the Refrigerator volume column in appliance data.csv does not have an explicit unit specified in the field name. Different entries specify different units (e.g. litres, cubic feet or no unit specified), which can themselves be written in different ways (e.g. "Litres", "L").

Another canonicalization problem is related to some columns expected to be numeric that are not so. For example, Win Area in rdsap glazing details anon.csv is not numeric because windows with area greater than 10 (the unit is not specified) have the value "10+". To use this column, therefore, one would have to modify the entries that are "10+". The same problem appeared in the HouseholdOccupancy column, which was encoded as a string, and households with six or more tenants were encoded as "6+". To use this as a numerical feature, the column was cast to integer type and "6+" was set to "6". The analyst here acted on a case-by-case basis, often introducing ad-hoc rules to make variable names and units of measure uniform, as well as transforming the data as needed.
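Rules of this kind are short to write but easy to lose track of; a minimal sketch of the "6+"/"10+" repairs on toy data, with hypothetical column spellings:

```python
import pandas as pd

# Sketch of the ad-hoc canonicalization rules described above, on toy data.
df = pd.DataFrame({"HouseholdOccupancy": ["2", "6+", "4"],
                   "Win_Area": ["3", "10+", "7"]})

# "6+" and "10+" block a numeric cast; clip them to their lower bound first.
# Similar replace rules could map unit spellings ("Litres", "L") to one form.
df["HouseholdOccupancy"] = df["HouseholdOccupancy"].replace({"6+": "6"}).astype(int)
df["Win_Area"] = df["Win_Area"].replace({"10+": "10"}).astype(int)
```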

Missing Data: Missing entries appear in multiple tables. Example: ipsos-anonymized-corrected 310713.csv has some missing entries in the survey question fields. The various diary files also have a reasonable fraction of missing values. Also, the encodings of the missing values vary across files and, more importantly, within the same attribute. For example, in the table diary dish washer.csv, the Options column can take values NULL, N/A or N/a, all of which mean that data is not available.

The analyst here acted on a case-by-case basis. For example, the Social.grade column contained a single missing value, which was imputed with the most common value in the remaining rows. In the monitoring data, there was a significant number of day/house combinations for which the aggregated consumption for a given appliance group was zero. It remains unclear whether this should be considered as a genuine reading, or treated instead as missing or bad data. This was the biggest wrangling challenge for the present use case.

Feature Engineering: the following original features were used during the analysis:
• TotalUsage: the total electricity consumption of a household on a given day.
• HouseholdOccupancy: the number of people living in the house.
• HouseholdWithChildren: a binary indicator, set to '1' if any children live in the house.
• Social.grade: the "social grade" of the individuals living in the house.
• Day of the week of the observation, one-hot-encoded.
• Month of the year.

Two features were added after performing some feature engineering to put them into a proper format for our model:
1. There was redundant information in the demographics table. The analyst created the PensionerOnly feature, and removed columns that contained information already present in other columns.
2. House.age was encoded as a string, e.g. "1950-1966". To encode this as a numerical feature, the difference between the midpoint of the range and 2007 was taken. There was a single missing value in this column, which was set to "-1". This was manually identified and imputed with the most common age range in the column.

5.2.4 Modelling

The analyst then turned to the analytical challenge as follows. A baseline linear model performed poorly. Upon further inspection, this was found to be because a significant fraction of the electricity readings in each appliance group were zero: a linear model struggled to model these data. It is unclear whether the readings of zero are problems with the data or correct measurements. Additionally, for some appliance groups, the frequency of zero readings was sufficiently high that the analyst simply dropped the appliance group, resulting in only 7 categories being predicted.

Afterwards, for each appliance group, the analyst modelled the consumption as follows:
1. Classify the house/day combination as "zero" or "not zero", using a random forest classifier.
2. For the combinations classified as "not zero", use a linear regression to predict the actual consumption.

This approach worked more effectively than the original approach (linear regression only). Additionally, we used the following approach for each appliance group as a baseline:
1. For each house/day combination in the holdout set, draw a binary indicator from a binomial distribution with parameter p set to the fraction of non-zero readings for this appliance group in the training set.
2. For house/day combinations where the binary indicator is 0, set the predicted consumption to zero.
3. For house/day combinations where the binary indicator is 1, set the predicted consumption to the mean consumption of non-zero readings in the training set.

The comparison of the baseline to the model is shown in Table 4. The model produces a smaller mean absolute error than the baseline for each appliance group, but is not a particularly good fit to the data.

Table 4: Comparison of baseline with model predicted energy consumption of appliance types. Overall, there were 1,546 data points.

Appliance Group             Baseline MAE   Model MAE
Showers                     0.150          0.187
Heating                     0.124          0.116
Water heating               0.098          0.111
Washing/drying/dishwasher   0.374          0.237
Audiovisual                 0.407          0.228
Cooking                     0.103          0.078
Cold appliances             0.070          0.061
Overall                     0.189          0.145
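The two-stage model and the stochastic baseline described above can be sketched as follows for a single appliance group; the feature matrix, consumption values and model settings below are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Sketch of the two-stage model for one appliance group, with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.where(rng.random(500) < 0.4, 0.0, rng.gamma(2.0, 1.5, 500))

clf = RandomForestClassifier(random_state=0).fit(X, y > 0)  # stage 1: zero vs non-zero
reg = LinearRegression().fit(X[y > 0], y[y > 0])            # stage 2: non-zero amounts

def predict(X_new):
    pred = np.zeros(len(X_new))
    nz = clf.predict(X_new)                 # True where consumption is expected
    pred[nz] = reg.predict(X_new[nz])
    return pred

# Baseline: Bernoulli draw with the training non-zero rate, then the mean
# of the non-zero training readings.
p_nonzero = (y > 0).mean()
indicator = rng.random(len(X)) < p_nonzero
baseline_pred = np.where(indicator, y[y > 0].mean(), 0.0)
```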

5.3 Critical Care Health Informatics Collaborative

The CleanEHR anonymized and public dataset can be requested online. It contains records for 1978 patients (one record each) who died at a hospital (or in some cases arrived dead), with 263 fields, including 154 longitudinal fields (time series). These fields cover patient demographics, physiology, laboratory, and medication information for each patient/record, recorded in intensive care units across several NHS trusts in England. The full dataset, available only to selected researchers, contains 22,628 admissions from 2014 to 2016, with about 119 million data points. The dataset comes as an R data object, which can be most profitably accessed with an accompanying R package.15

15 Information on how to get the dataset and an introduction to it can be found here: https://cran.r-project.org/web/packages/cleanEHR/vignettes/cchic overview.html. The R package is available at https://github.com/ropensci/cleanEHR. There is also a blog post on how to use it: https://ropensci.github.io/cleanEHR/data clean.html. A detailed explanation of each field is found in https://github.com/ropensci/cleanEHR/wiki under Data-set-1.0 and CCHIC-Data-Fields.

We consider the dataset called 'anon public d', containing 1978 patient episodes. It is anonymized and only contains fatal outcomes. Some examples of the non-time-series fields (demographic) found in the dataset are ADNO (Critical care local identifier), HCM (Height) and RAICU1 (Primary reason for admission to your unit). The longitudinal fields are time-series variables that correspond to physiological/nursing measurements, laboratory test results and drugs administered to the patient during their stay. Some examples are mean arterial blood pressure (Physiology), sodium (Laboratory) and adrenaline (Drugs).

5.3.1 Analytical Challenge

The analytical challenge we first set ourselves is to predict the amount of time it takes for a patient to die (in minutes), after they arrived at the hospital alive. Unfortunately, the proposed model fitted the data poorly, so we decided to shift the analysis to predicting which patients die in the first 100 hours from admission to the unit.

5.3.2 Pipeline of data engineering steps

1. First the analyst loads both CSVs, for the demographic data and the time-series statistics data (DD), and creates an extended table with the features of both tables using "ADNO" as primary key (DT).

2. Data transformations and feature engineering steps are done to the data with the analytical task in mind.
• Remove all rows where the date of death, time of death or date of access to the ICU were not recorded (DT + MD).
• Remove features fully missing in the data (MD).
• Standardize dates and times into the same format (CA), creating a numerical variable based on those dates (FE).
• A "survival class" variable is created, where patients are divided into two groups according to whether they survived for more or less than 100 hours.
• All patients who died before arriving at the ICU were removed from the analysis (DT).

3. Creation of a numerical age feature based on other features in the data (FE).

4. Categorical variables are transformed into one-hot variables (FE). In particular, the "RAICU1", "RAICU2" and "OPCHM" features are simplified, and only the first two components are used (e.g. 2.1.2 → 2.1).
5. Further rows and features were removed from the data according to different criteria (DT).
• All patients that did not live for more than 10 hours were removed from the study (DT).
• Previous versions of transformed features were removed.
• Other variables were removed because they were heavily correlated with the target variable. For example, "CCL3D" recorded the number of days the patient was in the ICU. Other variables recorded the number of days the patients were under any type of support system. These variables disclose information about the number of days that the patients were alive, so they were removed from the analysis.

6. Missing data (MD): a) Features with more than 50% of their values missing were deleted. b) For the remaining features, missing values were replaced by "-1".
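Step 6 can be expressed in a couple of lines of pandas; the 50% threshold and the -1 code follow the description above.

```python
import pandas as pd

# Sketch of step 6: drop features that are mostly missing, then flag the
# remaining gaps with -1 (0 was rejected because zero can be a real reading).
def handle_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep].fillna(-1)
```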

5.3.3 Data Engineering Issues

There was a previous integration process, done before we got access to the data we are employing in our analysis. The data coming from eleven adult ICUs at five UK teaching hospitals is passed through a pipeline that extracts, links, normalizes and anonymizes the EHR data. The resulting data is then processed using the cleanEHR toolkit, covering the most common processing and cleaning operations for this type of data (see above). One of the most important functions in the cleanEHR package converts the various asynchronous lists of time-dependent measurements into a table of measurements with a customizable binning. A second function is used to join together separate but sequential critical care admissions into a unified illness spell [57].

Data Dictionary: The analyst started by attempting to understand the available data fields, whose contents are highly domain-specific and for which documentation is limited. Column names like DOB, WKG, HDIS or DOD were not readily understood by the data scientist. A matching process between each feature name in the dataset and a website containing the descriptions and information of the different features was necessary to understand the data.16

16 https://github.com/ropensci/cleanEHR/wiki under Data-set-1.0 and CCHIC-Data-Fields

Data Integration: the analyst loaded the two datasets (the demographics and the time-series statistics) and linked them into a single extended table using the ADNO (admission number) field. Since the primary key (ADNO) was present in both tables, the mapping was done directly by adding the features in the time-series table to the demographics table.

Data Transformation: after the DI process, it was necessary to remove rows in the data according to two criteria: a) all rows without date and time of admission or death were removed, as their absence makes the predictive task impossible; and b) all patients admitted dead (i.e. whose elapsed time to death is zero or negative) were removed.

Canonicalization: Several fields are not standardized. For example, datetimes show a variety of encodings being used, such as '1970-01-01T00:01:00', '1970-01-01' and '00:01:00' (i.e. datetime, date and time). The analyst had to deal with these issues case by case.
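A rule set of this kind can be as simple as trying a fixed list of formats in order; the sketch below covers exactly the three encodings observed, and leaves anything else as missing rather than guessing.

```python
from datetime import datetime

# Sketch of the rule set for the three datetime encodings observed above.
FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d", "%H:%M:%S")

def parse_datetime(s: str):
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None  # unrecognised strings are treated as missing, not guessed

assert parse_datetime("1970-01-01T00:01:00") == datetime(1970, 1, 1, 0, 1)
```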
Missing Data: Several fields contain missing values. This is to be expected, since many measurements are only performed in case of medical necessity. Unfortunately, an added challenge is that the analyst is never sure if a value is missing due to the measurement not being performed or due to other reasons. We expect the former case to be the most common, yet we do not know in practice. Intuitively, NaN values usually entail a lack of observation due to a measurement deemed not important for the patient/condition at hand. Thus, it was decided to drop all columns containing more than 50% NaN values (including some completely empty columns). The remaining NaN values were imputed with -1 to have a clear indication that this particular value was not necessary or unobserved (the default value of 0 was first considered; however, there are fields where zero can be the result of a measurement). A more detailed analysis of missing values and of different imputation strategies might improve results.

Non-Stationarity: while exploring the data, the data scientists noticed that consecutive patients could be grouped under different groups according to the properties of some of the attributes. For example, when looking at HCM (height of the patients), the patients with IDs between 800-1300 had their height measured in metres, while the rest of the patients had their height measured in centimetres. According to the documentation, this attribute should have been recorded in centimetres, so a proper transformation from metres to centimetres for this attribute was necessary (related to canonicalization). We are aware that most of these non-stationarity problems are related to the preliminary integration script provided with the dataset.

Feature Engineering: The analyst performed the following set of operations in sequence:
1) Export date-times from all datetime columns, using a variety of string representations ('1970-01-01T00:01:00', '1970-01-01' and '00:01:00'). This format-variability operation was performed by defining a set of rules that transformed all datetime strings into datetime Python objects (which internally abide by ISO standards).
2) Calculate the elapsed time to death (the prediction target or dependent variable) by subtracting the time of admission from the time of death.
3) Calculate the age from the birth and admission dates.
4) Transform the taxonomies used for diagnoses into dummy one-hot encoded variables ('RAICU1', 'RAICU2' and 'OCPMH' above).
5) Create dummy one-hot encoded variables for all remaining categorical variables.
6) Extract relevant statistics from the time-series data. These were the mean, standard deviation, and last value measured (but more variables could be used in the future).
7) Drop unnecessary columns (either judged as not predictive or duplicated with dummy variables).
8) Drop columns that can leak the time of death (e.g. the number of days in the ICU, 'CCL3D'); this is a challenging step since it is not always clear when the data was recorded.

5.3.4 Modelling

Here it is important not to include variables that can leak asynchronous information from the target (time to death). For example, using the full time-series data from the patient to build the descriptive variables would be misleading, because in a real predictive scenario this corresponds to future information that will not be available.

Taking this into consideration, only the first 10 hours of time-series data was used to build the model, and predictions are made only for patients that have lived at least those 10 hours. These choices were made in order to include as much of the data as possible, without making the model overly complex.

Two datasets were obtained from the previous wrangling steps: a first sample containing only the demographic variables, and a second one with both demographic and time-series information. The sample containing both demographic and time-series data only has patients that lived at least 10 hours. The demographic-only sample has all patients that arrived alive at the unit. After the data wrangling process, the demographic-only sample had 176 variables for 1730 patients. The demographic plus time-series sample had 373 variables corresponding to 1586 patients. Both samples are divided into training and testing samples with a ratio of 80/20, and two models are built based on the datasets. A linear regression model with ElasticNet regularization (joint L1 and L2) [58] was implemented to predict the time (in minutes) elapsed from admission to death. However, no suitable model was found. The regression model fits the data poorly, reporting an R-squared value of 3% for the demographics-only sample and 7% for the full dataset.

The analysis focus was shifted to predicting which patients die in the first 100 hours from admission to the unit. Random Forests [59] were used for classification. Both samples are split into two classes: patients that died in the first 100 hours and patients that survived for 100 hours. For the demographic-only sample, 51% of the patients correspond to the first class (and 49% to the second). The demographic plus time-series sample is divided into 47% and 53% for the first and second class, respectively.
The sample with demographic-only data achieves an average classification accuracy of 53%, which is barely better than the classification baseline of predicting the most frequent class (51%, based on the class balance for this sample). This result hints that this sample might not contain enough information to accurately classify patients that die in the first 100 hours from admission (see Table 5). The classification accuracy obtained by using the demographic plus time-series data (where all patients survived at least 10 hours and their previous time-series data is used in the model) is shown in Table 6. The average accuracy for this sample is 66%, which is a significant improvement over the demographic-only sample and the classification baseline of predicting the most frequent class (53%). This result shows that the time-varying fields bring important information to the system, which can be of great use in the training of a more sophisticated survival model in the future. One of the main challenges of this analysis is to know how and when some of the demographic variables are measured or computed. For example, the apache score or AMUAI variables are good predictive features, but they can be asynchronous to the information available up to 10 hours after admission. In this model those features are not used, but they might be useful to increase the performance or to build a more sophisticated model.

Table 5: Confusion matrix of classification of time of death within the first 100 hours from admission (demographic-only sample).

Time of death (%)    < 100 hours (Pred.)   > 100 hours (Pred.)
< 100 hours (True)   0.38                  0.62
> 100 hours (True)   0.30                  0.70

Table 6: Confusion matrix of classification of time of death within the first 100 hours from admission (demographic plus time-series sample).

Time of death (%)    < 100 hours (Pred.)   > 100 hours (Pred.)
< 100 hours (True)   0.75                  0.25
> 100 hours (True)   0.33                  0.67

5.4 Ofcom Consumer Broadband Performance

The UK's Office of Communications (Ofcom) conducts an annual survey of the performance of residential fixed-line broadband services.17 These data are published as a table in annual instalments: a new spreadsheet each year containing that year's data. Each row of this table corresponds to a recruited broadband user panellist. There are 1,450 rows in the 2013 dataset, 1,971 rows in the 2014 data, and 2,802 rows in the 2015 data.

17 https://data.gov.uk/dataset/uk-fixed-line-broadband-performance

5.4.1 Analytical Challenge

There isn't a natural challenge accompanying this data. We chose the following simplistic challenge: fit a simple model to detect whether a broadband speed corresponds to an urban or rural area, based on several features from the data. We will use the annual installments from 2014 and 2015 to train our model and the installment from 2016 to test it.

5.4.2 Pipeline of data engineering steps

1) There are two installments from 2014 (May and November), one from 2015 and another from 2016 that need to be loaded (DP). Some of these tables have notes at the beginning of the csv, so we need to carefully select the part of the csv that corresponds to the data.

2) Every table has different features and different names for the same features. We rename those features into a standard naming and identify the correct features (DD). Since only 6 features were deemed necessary for the analysis, we only rename those and delete the rest from the tables (DT).

3) Installments from 2014 and 2015 are concatenated into the same table (DI).
• The feature indicating whether a broadband device is in an urban or rural area changes between 2014 and 2015. A protocol change occurred in 2015, where they introduced the term "Semi-Urban" to indicate an area that is in between urban and rural (NS). The analyst decided to encode "Semi-Urban" instances as "Rural", since there were fewer instances of "Rural" in the data.
• There are several duplicates across 2014 and 2015, since some devices were measured several times in different months. The analyst decided to keep the records from the most recent measurements, removing the rest of the duplicates (DI, deduplication).

4) The analyst removed from the test set (2016) any broadband devices included in both the training and test sets, using the "Id" of the device (DI).

5) Since the amount of missing data across features was less than 1%, the analyst decided to remove entries with missing values, to avoid biasing the model later (MD).

6) Training and test datasets were prepared (FE):
• The "Urban/Rural" feature was recoded, assigning 0 to rural and semi-urban areas and 1 to urban areas (FE).
• The "Id" feature was deemed unnecessary and removed from the data (DT).
• Real features were standardized (FE).
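A sketch of steps 1-3 follows. The file names, skiprows offsets and column mappings below are hypothetical placeholders that would have to be read off the actual files; only the overall pattern (skip the notes block, rename to a standard schema, concatenate, deduplicate) follows the description above.

```python
import pandas as pd

# Sketch of steps 1-3; file names, offsets and mappings are hypothetical.
INSTALLMENTS = {
    "panellist-data-2014-may.csv": {"skiprows": 3,
                                    "rename": {"Urban/rural": "Urban/Rural"}},
    "panellist-data-2015.csv": {"skiprows": 1,
                                "rename": {"URBAN2": "Urban/Rural"}},
}
KEEP = ["Id", "Urban/Rural", "Download", "Upload", "Latency", "Web"]

def load_training_table() -> pd.DataFrame:
    frames = []
    for path, spec in INSTALLMENTS.items():
        df = pd.read_csv(path, skiprows=spec["skiprows"])  # skip the notes block
        frames.append(df.rename(columns=spec["rename"])[KEEP])
    table = pd.concat(frames, ignore_index=True)
    # Keep only the most recent measurement per device (deduplication),
    # assuming later installments appear later in the concatenation.
    return table.drop_duplicates(subset="Id", keep="last")
```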

5.4.3 Data Engineering Issues

Data Parsing: Some of the CSV files contained notes in the first lines of the headers, and automatic parsing failed. The analyst needed to explore the files and identify the positions of the tables in the files to load the data properly.

Data Dictionary: The columns in the data describe characteristics of the geographical region (such as whether it is rural or urban); the nature of the broadband service (for example, the provider); and measured characteristics of the broadband service (for example, the average download speed). There are roughly five fields describing properties of the region; three fields describing the service; and around 20 fields describing measured characteristics. As is often the case, this dataset is not in "normal form" but instead can usefully be thought of as a relational join of multiple tables. It is left to the analyst to determine the schema.

Data Integration: Since the data comes in annual installments, it was necessary to aggregate all the tables into an extended table. Unfortunately, a common schema for the features employed during the data collection was not present, so the analysts needed to match corresponding features across tables and do the integration manually. Some challenges they needed to address were:
1. Renaming columns to a standard name.
2. Ordering columns across tables.
3. Not all features appeared in all installments.
4. Recoding some categorical features, so as to use the same encoding in all installments (CA).
Additionally, it was necessary to remove duplicate devices measured across different installments.

Missing Data: There were some features with missing data across installments. The analyst needed to check every feature and treat the missing data accordingly. In this case, since the missing values were less than 1% per feature, instances with missing values were removed from the data.
5.4.4 Modelling

For this task, we are trying to detect whether a broadband measurement comes from an urban or rural area based on the download and upload speed of the broadband, the latency and the web page loading speed. These features were recorded as an average of the speed during a period of 24 hours. We will make a comparison between different classifiers for this specific task, training and validating the models with the data from 2014 and 2015, and testing the results on the 2016 data.

We compare the performance of Logistic Regression (LR), Naive Bayes (NB), SVM and Random Forests (RF). A grid search for the best set of hyperparameters for each model was performed, using cross-validation with 3 folds on the training set. Table 7 shows the precision, recall, F1 score and accuracy on the 2016 dataset for the classifiers trained on the 2014 and 2015 data. All the classifiers obtain similar results, with RF performing slightly better.

Table 7: Precision, recall, F1 score and accuracy for the different classifiers.

Model   Precision   Recall   F1 score   Accuracy
NB      0.8031      0.8975   0.8477     0.7666
LR      0.7933      0.9347   0.8582     0.7764
SVM     0.7718      0.9578   0.8547     0.7644
RF      0.7822      0.9774   0.8686     0.7859
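A sketch of the comparison protocol, on synthetic stand-ins for the 2014-2015 and 2016 tables: a 3-fold grid search per classifier, followed by evaluation on the held-out year. Only LR and RF are shown, and the hyperparameter grids are our assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the training years (2014-2015) and test year (2016).
X_train, y_train = make_classification(n_samples=600, n_features=4, random_state=0)
X_test, y_test = make_classification(n_samples=200, n_features=4, random_state=1)

candidates = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
}
for name, (model, grid) in candidates.items():
    # 3-fold cross-validated grid search on the training years only.
    search = GridSearchCV(model, grid, cv=3).fit(X_train, y_train)
    pred = search.predict(X_test)
    print(name, f1_score(y_test, pred), accuracy_score(y_test, pred))
```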

6 Conclusions

In this paper we have identified three high-level groups of data wrangling problems: those related to obtaining a proper representation of the data (data organization), those related to assessing and improving the quality of the data (data quality), and feature engineering issues, which heavily depend on the task at hand and the model employed to solve it. Furthermore, we have presented the full analysis of four use cases, where we have provided a systematic pipeline for each of the datasets to clean them, while identifying and classifying the main problems the data scientists faced during the wrangling steps. We hope that this work helps to further explore and understand the field of data engineering, and to value a part of every data scientist's work that most of the time goes unnoticed both in research and industry. Additionally, we would like to encourage practitioners to provide their raw data and the scripts necessary to clean it, in order to advance the field. In future work we would like to study data engineering workflows across multiple datasets in order to identify (if possible) common structures concerning the ordering of the various wrangling operations. We note that there can be feedback loops in the process, as described e.g. in [46].

Acknowledgments

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and by funding provided by the UK Government's Defence & Security Programme in support of the Alan Turing Institute. We would like to thank the AIDA team, including Taha Ceritli, Jiaoyan Chen, James Geddes, Zoubin Ghahramani, Ian Horrocks, Ernesto Jimenez-Ruiz, Charles Sutton, Tomas Petrick and Gerrit van den Burg, for helpful discussions and comments on the paper.

References

[1] T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003, vol. 479.
[2] A. D. Bjorkman, I. Myers-Smith et al., "Plant functional trait change across a warming tundra biome," Nature, vol. 562, pp. 57–62, 2018.
[3] "Household electricity survey." [Online]. Available: https://www.gov.uk/government/collections/household-electricity-survey
[4] "Critical care health informatics collaborative." [Online]. Available: https://form.jotformeu.com/80383154562355
[5] "Ofcom broadband speeds." [Online]. Available: https://www.ofcom.org.uk/research-and-data/telecoms-research/broadband-research/broadband-speeds
[6] G. J. J. Van Den Burg, A. Nazabal, and C. Sutton, "Wrangling messy CSV files by detecting row and type patterns," Data Mining and Knowledge Discovery, vol. 33, no. 6, pp. 1799–1820, 2019.
[7] Z. Abedjan, L. Golab, and F. Naumann, "Profiling relational data: a survey," The VLDB Journal, vol. 24, no. 4, pp. 557–581, 2015.
[8] J. Chen, E. Jimenez-Ruiz, I. Horrocks, and C. A. Sutton, "ColNet: Embedding the Semantics of Web Tables for Column Type Prediction," in Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19). AAAI, 2019.
[9] T. Ceritli, C. K. I. Williams, and J. Geddes, "ptype: Probabilistic type inference," arXiv preprint arXiv:1911.10081, 2019.
[10] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2006.
[11] P. Christen, "A survey of indexing techniques for scalable record linkage and deduplication," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537–1555, 2011.
[12] I. P. Fellegi and A. B. Sunter, "A theory for record linkage," Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210, 1969.
[13] A. Sayers, Y. Ben-Shlomo, A. W. Blom, and F. Steele, "Probabilistic record linkage," International Journal of Epidemiology, vol. 45, no. 3, pp. 954–964, 2015.
[14] J. S. Murray, "Probabilistic record linkage and deduplication after indexing, blocking, and filtering," Journal of Privacy and Confidentiality, vol. 7, no. 1, pp. 3–24, 2015.
[15] C. A. Sutton, T. Hobson, J. Geddes, and R. Caruana, "Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling," in 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2018). ACM, 2018, pp. 2279–2288.
[16] P. Vassiliadis, "A survey of extract–transform–load technology," International Journal of Data Warehousing and Mining (IJDWM), vol. 5, no. 3, pp. 1–27, 2009.
[17] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati, "Linking data to ontologies," J. Data Semantics, vol. 10, pp. 133–173, 2008.

[18] H. Wickham, "Tidy data," Journal of Statistical Software, vol. 59, no. 10, pp. 1–23, 2014.
[19] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.
[20] J. Leng and P. Jiang, "A deep learning approach for relationship extraction from interaction context in social manufacturing paradigm," Knowledge-Based Systems, vol. 100, pp. 188–199, 2016.
[21] V. Le and S. Gulwani, "FlashExtract: a framework for data extraction by examples," ACM SIGPLAN Notices, vol. 49, no. 6, pp. 542–553, 2014.
[22] P. Cerda, G. Varoquaux, and B. Kégl, "Similarity encoding for learning with dirty categorical variables," Machine Learning Journal, vol. 107, pp. 1477–1494, 2018.
[23] "OpenRefine," 2019, accessed on 04/01/2020. [Online]. Available: https://openrefine.org/
[24] S. Gulwani, "Automating string processing in spreadsheets using input-output examples," ACM SIGPLAN Notices, vol. 46, no. 1, pp. 317–330, 2011.
[25] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, p. 147, 2002.
[26] R. K. Pearson, "The problem of disguised missing data," ACM SIGKDD Explorations Newsletter, vol. 8, no. 1, pp. 83–92, 2006.
[27] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data. John Wiley & Sons, 1987.
[28] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf, "Multiple imputation by chained equations: what is it and how does it work?" International Journal of Methods in Psychiatric Research, vol. 20, no. 1, pp. 40–49, 2011.
[29] J. Yoon, J. Jordon, and M. Schaar, "GAIN: Missing data imputation using generative adversarial nets," in International Conference on Machine Learning, 2018, pp. 5675–5684.
[30] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera, "Handling incomplete heterogeneous data using VAEs," arXiv preprint arXiv:1807.03653, 2018.
[31] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[32] K. Lee, H. Lee, K. Lee, and J. Shin, "Training confidence-calibrated classifiers for detecting out-of-distribution samples," International Conference on Learning Representations (ICLR), 2018.
[33] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, "A geometric framework for unsupervised anomaly detection," in Applications of Data Mining in Computer Security. Springer, 2002, pp. 77–101.
[34] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld, "Toward supervised anomaly detection," Journal of Artificial Intelligence Research, vol. 46, pp. 235–262, 2013.
[35] S. Eduardo, A. Nazábal, C. K. I. Williams, and C. Sutton, "Robust Variational Autoencoders for Outlier Detection in Mixed-Type Data," arXiv preprint arXiv:1907.06671, 2019.
[36] F. T. Liu, K. M. Ting, and Z. H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[37] Y. Chen, X. S. Zhou, and T. S. Huang, "One-class SVM for learning in image retrieval," in Proc. International Conf. on Image Processing (ICIP), 2001, pp. 34–37.
[38] C. Zhou and R. C. Paffenroth, "Anomaly detection with robust deep autoencoders," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 665–674.
[39] I. F. Ilyas, X. Chu et al., "Trends in cleaning relational data: Consistency and deduplication," Foundations and Trends in Databases, vol. 5, no. 4, pp. 281–393, 2015.
[40] M. Basseville, "Detecting changes in signals and systems: a survey," Automatica, vol. 24, no. 3, pp. 309–326, 1988.
[41] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press, 2009.
[42] S. Aminikhanghahi and D. J. Cook, "A survey of methods for time series change point detection," Knowledge and Information Systems, vol. 51, no. 2, pp. 339–367, 2017.
[43] G. J. van den Burg and C. K. Williams, "An evaluation of change point detection algorithms," arXiv preprint arXiv:2003.06222, 2020.
[44] M. P. J. Camilleri and C. K. I. Williams, "The Extended Dawid-Skene Model: Fusing Information from Multiple Data Schemas," arXiv preprint arXiv:1906.01251, 2019.
[45] E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3–13, 2000.

[46] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, "CRISP-DM 1.0 Step-by-step data mining guide," 2000.

[47] W. Kim, B.-J. Choi, E.-K. Hong, S.-K. Kim, and D. Lee, “A taxonomy of dirty data,” Data Mining and Knowledge Discovery, vol. 7, no. 1, pp. 81–99, 2003.

[48] L. A. Castelijns, Y. Maas, and J. Vanschoren, "The ABC of Data: A Classifying Framework for Data Readiness," ECML-PKDD workshop on automating data science (ADS), 2019.

[49] N. D. Lawrence, “Data Readiness Levels,” arXiv preprint arXiv:1705.02245, 2017.

[50] T. Johnson, "Data Profiling," in Encyclopedia of Database Systems, L. Liu and M. T. Özsu, Eds. Springer, 2009, pp. 604–608.

[51] F. Serban, J. Vanschoren, J.-U. Kietz, and A. Bernstein, "A survey of intelligent assistants for data analysis," ACM Comput. Surv., vol. 45, no. 3, pp. 1–35, Jul. 2013.

[52] “Trifacta,” 2018, accessed on 27/06/2018. [Online]. Available: https://www.trifacta.com/

[53] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, "Wrangler: Interactive visual specification of data transformation scripts," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2011, pp. 3363–3372.

[54] J. Heer, J. M. Hellerstein, and S. Kandel, “Predictive Interaction for Data Transformation,” in CIDR’15, 2015.

[55] E. Mansour, D. Deng et al., “Building Data Civilizer Pipelines with an Advanced Workflow Engine,” in Proc. 34th IEEE International Conference on Data Engineering (ICDE). IEEE, 2018, pp. 1593–1596.

[56] M. Zhong, N. Goddard, and C. Sutton, "Interleaved factorial non-homogeneous hidden Markov models for energy disaggregation," arXiv preprint arXiv:1406.7665, 2014.

[57] S. Harris, S. Shi, D. Brealey, N. S. MacCallum, S. Denaxas, D. Perez-Suarez, A. Ercole, P. Watkinson, A. Jones, S. Ashworth et al., "Critical Care Health Informatics Collaborative (CCHIC): Data, tools and methods for reproducible research: A multi-centre UK intensive care database," International Journal of Medical Informatics, vol. 112, pp. 82–89, 2018.

[58] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[59] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
