The Openclean Open-Source Data Cleaning Library
Total Page:16
File Type:pdf, Size:1020Kb
From Papers to Practice: The openclean Open-Source Data Cleaning Library Heiko Müller, Sonia Castelo, Munaf Qazi, and Juliana Freire New York University {heiko.mueller,s.castelo,munaf,juliana.freire}@nyu.edu ABSTRACT Inspired by the wide adoption of generic machine learning frame- Data preparation is still a major bottleneck for many data science works such as scikit-learn [15], TensorFlow [1], and PyTorch [14], projects. Even though many sophisticated algorithms and tools we are currently developing openclean, an open-source Python have been proposed in the research literature, it is difficult for library for data profiling and data cleaning [11]. Our goals are practitioners to integrate them into their data wrangling efforts. We twofold. First, openclean aims to provide a unified framework for present openclean, a open-source Python library for data cleaning practitioners that brings together open source data profiling and and profiling. openclean integrates data profiling and cleaning data cleaning tools into an easy-to-use environment. By making tools in a single environment that is easy and intuitive to use. We existing tools available to a large user-community, and through the designed openclean to be extensible and make it easy to add new integration with the rich Python ecosystem, openclean has the functionality. By doing so, it will not only become easier for users to potential to simplify data cleaning tasks. Second, by providing a access state-of-the-art algorithms for their data wrangling efforts, structured, extensible framework, openclean can serve as a plat- but also allow researchers to integrate their work and evaluate its form with which researchers and developers can integrate their effectiveness in practice. We envision openclean as a first step to techniques. We hope that by bringing together a community of build a community of practitioners and researchers in the field. In users, developers, and researchers, we will be in a better position our demo, we outline the main components and design decisions to attack the many challenges in dealing with data quality. in the development of openclean and demonstrate the current Overview of openclean. The source code for openclean is avail- functionality of the library on real-world use cases. able on GitHub [11]. We chose Python to implement openclean because of its growing popularity as well as the large number of PVLDB Reference Format: existing open-source libraries. Figure 1 shows the ecosystem of li- Heiko Müller, Sonia Castelo, Munaf Qazi, and Juliana Freire. From Papers braries that openclean currently leverages. openclean is organized to Practice: in a modular way to account for the large number of approaches The openclean Open-Source Data Cleaning Library. PVLDB, 14(12): 2763 - 2766, 2021. and techniques that fall into the areas of data profiling and data doi:10.14778/3476311.3476339 cleaning and that we aim to integrate into openclean in the future. At the center of openclean is the package openclean-core. 1 MOTIVATION Within this package we define the relevant APIs for accessing and processing data as well as some basic functionality. The core pack- The negative impact of poor data quality on the widespread and age depends on well-known Python libraries including pandas [23] profitable use of machine learning17 [ ] makes data cleaning essen- and scikit-learn [15] as well as on a set of libraries that were tial in many data science projects. Improving data quality requires developed in parallel to openclean to address specific needs regard- data profiling and exploration to gain an understanding of quality ing access to reference data [12], provenance and version manage- issues, and data cleaning to transform the data into a state that is fit ment [10], and integration of tools with existing binaries that are for purpose. This process is tedious and costly. A frequently cited not implemented in Python or that require installation of additional survey in 2016 found that data scientists spend 60% of their time software systems [19]. on data cleaning and organizing data [16]. In the same survey 57% On top of the core library are several extensions that make use of of the data scientists also stated that they consider data cleaning the API’s and underlying base packages to provide additional func- and organizing data as the least enjoyable task of their job. tionality. Current extensions include libraries to discover regular ex- Over the years, many tools for profiling, preparing, and cleaning pressions, a Graphical User Interface that is integrated with Jupyter data have been developed, both in academia and industry (see [2, 7, Notebooks, as well as profiling functionality from the Metanome 8] for overviews). These approaches were developed in isolation and project as one example of how existing tools that are not imple- in different programming languages with no standardized interfaces. mented in Python can easily be integrated into openclean. We are Thus, it is difficult for data scientists to combine existing toolsin also in the process of integrating other tools like HoloClean [18] their data processing pipelines. and Ditto [9]. This work is licensed under the Creative Commons BY-NC-ND 4.0 International In this demo paper, we first introduce the openclean library License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of by giving an overview of the main concepts and the architecture this license. For any use beyond those covered by this license, obtain permission by (Section 2). Then, we discuss our demonstration that will show how emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. to use the library based on a few real-world examples and how to Proceedings of the VLDB Endowment, Vol. 14, No. 12 ISSN 2150-8097. add new functionality to the library (Section 3). doi:10.14778/3476311.3476339 2763 (from a given input list) that were identified as outliers. A profil- ing operator that computes statistics over the data will return a dictionary with the different computed statistics. Data Cleaning and Transformation. Data cleaning operators are intended to transform data and they almost exclusively operate on data frames. The following abstract operator types are defined in openclean: transformer, mapper, reducer, and group filter. A transformer maps a given data frame to a new data frame. openclean comes with a set of built-in transformation operations Figure 1: openclean architecture. The core library and exten- for filtering columns and rows, insertion of new columns and rows, sions depend on standard Python libraries and additional moving and renaming columns, updating of cell values, and for packages that we developed for specific tasks. sorting. A mapper returns a group of data frames for a given input data frame. Each resulting group is identified by a unique key. The 2 MAIN FEATURES OF OPENCLEAN default mapper in openclean is the groupby operator that groups 2.1 Data Model and Operators rows based on key values that are computed for each row using a given evaluation function (see below). A reducer converts a group Datasets and Streams. openclean is primarily intended for tabu- of data frames into a single data frame. A common example are lar datasets, which are represented as data frames: a set of columns operators that resolve conflicts when repairing violations of FDs.A with (not necessarily unique) names and a set of rows that contain group filter operates on groups of data frames and transforms one values for each of the columns. set of groups into a new set of groups. openclean supports two modes for processing data frames: (i) Functions. Well-defined APIs are essential for a framework that the data frame can either be loaded completely into main memory aims to be flexible and extensible. In openclean we define APIs for pandas.DataFrame (e.g., as a [23]), or (ii) it can be streamed from common data wrangling tasks in a modular way that makes it easy an external storage source, e.g., a comma-separated file. The latter is for a user to combine them as well as to include new functionality at particularly important for large datasets that cannot fit entirely into different levels of granularity. One means to ensure such flexibility openclean main memory. Data streams in are streams of dataset is the use of functions as arguments. rows. The streaming data processor allows the application of clean- In openclean, many cleaning and transformation operators take ing and profiling operations directly to the rows in the stream, e.g., functions as their arguments. For example, the filter and update to filter data of interest that can then be loaded into main memory. operators shown in Figure 2(b) [#7] take functions IsNotEmpty Figure 2(b) shows an example in cell #7 using the NYC Parking and str.upper as arguments to select and manipulate rows. By Violations dataset. The dataset contains over 9 million rows and having functions as operator arguments it becomes easy to change the data file is approx. 380 mb in size. In the example we first select and customize behavior. In the example, we can easily use any Registration three columns from the dataset and rename column (user defined) function with a single argument instead of thestan- State State to . We then filter rows that contain a vehicle color dard str.upper function to transform the data using the update and convert the color values to all upper case. Finally, we take a operator. random sample of 1000 rows and return them as a data frame. Figure 2(b) [#8] shows another example for the power of com- Operators. openclean implements two different types of opera- posability in openclean.