Dagger: A Data (not code) Debugger

El Kindi Rezig (MIT CSAIL), Lei Cao (MIT CSAIL), Giovanni Simonini (MIT CSAIL), Maxime Schoemans (Université Libre de Bruxelles), Samuel Madden (MIT CSAIL), Mourad Ouzzani (Qatar Computing Research Institute), Nan Tang (Qatar Computing Research Institute), Michael Stonebraker (MIT CSAIL)

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2020. CIDR '20, January 12-15, 2020, Amsterdam, Netherlands.

ABSTRACT

With the democratization of data science libraries and frameworks, most data scientists manage and generate their data analytics pipelines using a collection of scripts (e.g., Python, R). This marks a shift from traditional applications that communicate back and forth with a DBMS that stores and manages the application data. While code debuggers have reached impressive maturity over the past decades, they fall short in assisting users to explore data-driven what-if scenarios (e.g., split the training set into two and build two ML models). Those scenarios, while doable programmatically, are a substantial burden for users to manage themselves. Dagger (Data Debugger) is an end-to-end data debugger that abstracts key data-centric primitives to enable users to quickly identify and mitigate data-related problems in a given pipeline. Dagger was motivated by a series of interviews we conducted with data scientists across several organizations. A preliminary version of Dagger has been incorporated into Data Civilizer 2.0 to help physicians at the Massachusetts General Hospital process complex pipelines.

1. INTRODUCTION

As we move towards data-centric systems, building and debugging data pipelines has become more crucial than ever. Choosing and tuning those pipelines is a difficult problem. As pipelines increase in complexity, a number of things can go wrong during pipeline modeling or execution: (1) code bugs: there may be bugs in one or more of the pipeline's modules, which requires tracking down potential bugs (e.g., using a code debugger) and then fixing the code; (2) wrong parameters: it is often the case that one or more components require substantial parameter tuning, e.g., in ML models [16]. In this case, the user (or an algorithm) evaluates the pipeline using a set of parameters to get better results (e.g., the accuracy of the ML model); and (3) data errors: the input data, or any of the intermediate data produced by the code, has a problem, e.g., the training data is not large enough for a machine learning model to make reasonable inferences, or the data is in the wrong format (e.g., bad schema alignment or the data is wrong).

In Dagger, we focus on addressing problems related to data errors. In doing so, users may also be able to identify errors related to code bugs or non-ideal parameters, but they would discover those issues by investigating the data that is handled in their code. In a given data pipeline, data goes through a myriad of transformations across different components. Even if the input data is valid, its intermediate versions may not be. We refer to volatile data that is handled in the code (e.g., scalar variables, arrays) as code-handled data.

We conducted a series of interviews with data scientists and engineers at the Massachusetts General Hospital (MGH), Tamr, Paradigm4, and Recorded Future to gauge interest in developing a tool that assists users in debugging code-handled data, and found widespread demand for such a data debugger.

To address this need, we introduce Dagger, an end-to-end framework that treats code-handled data as a first-class citizen. By putting code-handled data at the forefront, Dagger allows users to (1) create data pipelines from input Python scripts; (2) interact with the pipeline-handled data to spot data errors; and (3) identify and debug data-related problems at any stage of the pipeline by using a series of primitives we developed based on real-world data debugging use cases (Section 2).

Motivated by use cases we encountered while collaborating with MGH, Dagger is a work-in-progress project, and we have already incorporated its preliminary version into Data Civilizer 2.0 (DC2) [14]. In this paper, we sketch the design of the extended version of Dagger.

1.1 Dagger overview

Figure 1(a) illustrates the architecture of Dagger. In a nutshell, the user interacts with Dagger through a declarative SQL-like language to query, inspect, and debug the data that is input to and produced by their code.

Figure 1: (a) Dagger Architecture; (b) Workflow used in EEG Application

Workflow manager: In order to integrate user pipelines into off-the-shelf workflow management systems [2, 18, 14], users are often required to write additional code. Dagger allows users to leave their code in its native environment (e.g., Python) and only requires them to track the data. Dagger offers two modes of workflow debugging: (1) intra-module debugging, where users tag the different code blocks that will become the pipeline nodes; and (2) inter-module debugging, where users track the data at the boundary of the modules, i.e., the input and output data of the pipeline blocks/modules. For intra-module debugging, we chose Python as the target programming environment because it is the most popular language among data scientists [1]. We refer to pipeline nodes as blocks in the rest of the paper. For instance, users can create the pipeline in Figure 1(b), which includes scripts scattered across different Python files, by running the following Dagger statements (we will explain the pipeline in Section 2):

CREATE BLOCK b1 FOR PIPELINE P1: preprocessing.py:1-600
CREATE BLOCK b2 FOR PIPELINE P1: label_generation.py:10-1440
CREATE BLOCK b3 FOR PIPELINE P1: modeling.py:200-5000
CREATE BLOCK b4 FOR PIPELINE P1: active_learning.py:40-3000

Every statement of the above snippet creates a pipeline block that is attached to a code section of interest. After the blocks are tagged, Dagger performs a static analysis of the code to extract the dataflow. This requires generating a call graph (i.e., how the different code blocks are interlinked). This step creates the edges in a Dagger pipeline.
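The paper does not spell out how this static analysis is implemented. As a rough sketch of the idea only, the snippet below uses Python's ast module to derive block-to-block edges from tagged line ranges, adding an edge whenever one block calls a function that another block defines. The block table mirrors the hypothetical CREATE BLOCK statements above; the helper names and the edge direction are illustrative assumptions, not Dagger's actual implementation.

import ast

# Hypothetical block table mirroring the CREATE BLOCK statements above:
# block name -> (file, first tagged line, last tagged line)
BLOCKS = {
    "b1": ("preprocessing.py", 1, 600),
    "b2": ("label_generation.py", 10, 1440),
    "b3": ("modeling.py", 200, 5000),
    "b4": ("active_learning.py", 40, 3000),
}

def defined_functions(tree, lo, hi):
    # Functions whose definition starts inside the block's tagged line range.
    return {node.name for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef) and lo <= node.lineno <= hi}

def called_functions(tree, lo, hi):
    # Names of functions called from inside the block's tagged line range.
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and lo <= node.lineno <= hi:
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names

def build_call_graph(blocks):
    # Add an edge (caller block, callee block) whenever a block calls a function
    # that another block defines; such edges would become the pipeline's edges.
    trees, defined, called = {}, {}, {}
    for name, (path, lo, hi) in blocks.items():
        with open(path) as source:
            trees[name] = ast.parse(source.read(), filename=path)
        defined[name] = defined_functions(trees[name], lo, hi)
        called[name] = called_functions(trees[name], lo, hi)
    return {(src, dst) for src in blocks for dst in blocks
            if src != dst and called[src] & defined[dst]}

# build_call_graph(BLOCKS) might return, for example, {("b3", "b2"), ("b4", "b3")}.

A real implementation would also need to resolve imports and attribute calls across files; the sketch only matches calls by function name.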
Logging manager: Once the pipeline blocks are defined, the logging manager stores, for a given data value, its initial value and its updated versions across pipeline blocks at runtime. The logger can track the value of any data structure (e.g., Pandas DataFrames, variables, etc.). For well-known data types such as Pandas DataFrames [12] and Python built-in types, the logging manager logs their values automatically. However, for user-defined objects, a logging function has to be provided to extract the value of the objects. The logging manager stores the tracked code-handled data and metadata (e.g., variable names) in a relational database.

Interaction language (DQL): Because Dagger exposes code-handled data to users, it features a simple-to-use SQL-like language, DQL (Dagger Query Language), to facilitate access and interaction with the pipeline data (e.g., look up all the occurrences of a given data value in the pipeline). Users post their DQL debugging queries through a command-line interface.

Debugging primitives: Dagger provides a set of data debugging primitives, shown in Figure 1(a), including data breakpoints, split, data generalization, and compare. The split primitive partitions a table and runs a user-defined function on all of the resulting partitions. For instance, we could partition the training dataset on the value of an attribute and build classifiers using each partition as the training data; in this case, the user-defined function is the classifier. The data generalization primitive finds, given example data values in a table, a partition based on a user-provided similarity function (e.g., Euclidean distance). The compare primitive compares different pipelines using their code-handled data: if two pipelines generate similar intermediate data, then it is likely that their downstream blocks would produce similar results.
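Dagger's own implementations of these primitives are not shown in this excerpt. As a minimal illustration of what the split primitive amounts to, the sketch below partitions a training table on an attribute and applies a user-defined training function to every partition; the column names ("montage", "label"), the input file, and the classifier choice are assumptions made for the example.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def split(table: pd.DataFrame, attribute: str, udf):
    # Partition `table` on the values of `attribute` and run `udf` on each partition.
    return {value: udf(partition) for value, partition in table.groupby(attribute)}

def train_classifier(partition: pd.DataFrame):
    # User-defined function: fit one classifier on a single partition of the training set.
    features = partition.drop(columns=["label", "montage"])
    return LogisticRegression(max_iter=1000).fit(features, partition["label"])

# Hypothetical usage: one model per value of the invented "montage" attribute.
# training = pd.read_csv("eeg_training.csv")
# models = split(training, "montage", train_classifier)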
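To make the logging manager and DQL described earlier in this section concrete, here is a minimal sketch assuming a flat relational schema and SQLite as the backing store. The table layout, API, and query are illustrative stand-ins rather than Dagger's actual design; a plain SQL LIKE query plays the role of a DQL lookup such as "find all occurrences of a given data value in the pipeline."

import json
import sqlite3
import pandas as pd

# Assumed single-table layout for tracked code-handled data and its metadata.
conn = sqlite3.connect("dagger_log.db")
conn.execute("""CREATE TABLE IF NOT EXISTS data_log (
                  pipeline TEXT, block TEXT, variable TEXT,
                  version INTEGER, value TEXT)""")

def log_value(pipeline, block, variable, value, version=0, extractor=None):
    # DataFrames and Python built-ins are serialized automatically; user-defined
    # objects require a user-provided `extractor` (the "logging function").
    if isinstance(value, pd.DataFrame):
        payload = value.to_json(orient="records")
    elif isinstance(value, (int, float, str, bool, list, dict, type(None))):
        payload = json.dumps(value)
    elif extractor is not None:
        payload = json.dumps(extractor(value))
    else:
        raise TypeError(f"No logging function provided for {type(value).__name__}")
    conn.execute("INSERT INTO data_log VALUES (?, ?, ?, ?, ?)",
                 (pipeline, block, variable, version, payload))
    conn.commit()

def occurrences(value):
    # A DQL-style lookup: where does a given data value appear in the pipeline?
    return conn.execute("""SELECT pipeline, block, variable, version
                           FROM data_log WHERE value LIKE ?""",
                        (f"%{value}%",)).fetchall()

# log_value("P1", "b1", "patients", patients_df)   # hypothetical DataFrame
# occurrences("MRN-1234")                          # blocks/variables containing it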
1.2 Use Cases

Dagger is being used in our ongoing collaborations with two separate groups at the Massachusetts General Hospital (MGH). We will provide examples based on those two use cases throughout the paper.

Use case 1: Our first use case is an ML pipeline for classifying seizures in electroencephalogram (EEG) data from medical patients at MGH. By employing data cleaning and regularization techniques in Dagger, we have been able to achieve better accuracy for the ML models used in the EEG application. The better performing data pipelines were devised by iteratively building pipelines and trying various cleaning routines to preprocess the data.

Use case 2: We have recently started a collaboration with another MGH group that focuses on tracking infections inside the hospital. For instance, a patient might spread an infection to a nurse, who then spreads it to other patients. In this use case, structured data about patients has been used to visualize the spread of infections across the hospital floor plans using Kyrix [15]. This data has several errors due to manual editing (e.g., misspelling, wrong date/time recording) or bad database design, e.g., a newborn inherits the same identifier as the mother, which results in two patients having the same identifier.
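The shared mother/newborn identifier is exactly the kind of code-handled data error Dagger aims to surface. As a simple illustration (the patient table and its column names are invented for the example), a check for identifiers assigned to more than one distinct patient could look like this:

import pandas as pd

def shared_identifiers(patients: pd.DataFrame) -> pd.DataFrame:
    # Identifiers whose rows disagree on name or birth date, i.e., identifiers
    # that appear to be shared by more than one distinct patient.
    distinct = (patients.groupby("patient_id")[["name", "birth_date"]]
                        .nunique()
                        .max(axis=1))
    return patients[patients["patient_id"].isin(distinct[distinct > 1].index)]

# patients = pd.read_csv("patients.csv")   # hypothetical input table
# print(shared_identifiers(patients))      # rows where two patients share an identifier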
