BugDoc: Algorithms to Debug Computational Processes
Raoni Lourenço, Juliana Freire, Dennis Shasha
New York University
arXiv:2004.06530v1 [cs.DB] 12 Apr 2020

Abstract
Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software are available for use, reproducibility, and enhancement.

CCS Concepts
• Information systems → Data provenance.

1 Introduction
Computational pipelines are widely used in many domains, from astrophysics and biology to enterprise analytics. They are characterized by interdependent modules, associated parameters, and data inputs. Results derived from these pipelines lead to conclusions and, potentially, actions. If one or more modules in a pipeline produce erroneous or unexpected outputs, these conclusions may be incorrect. Thus, it is critical to identify the causes of such failures.

Discovering the root cause of failures in a pipeline is challenging because problems can come from many different sources, including bugs in the code, input data, software updates, and improper parameter settings. Connecting the erroneous result to its root cause is especially difficult for long pipelines or when multiple pipelines are composed. Consider the following real but sanitized examples.

Example: Enterprise Analytics. In an application deployed by a major software company, plots for sales forecasts showed a sharp decrease compared to historical values. After much investigation, the problem was tracked down to a data feed (coming from an external data provider) whose temporal resolution had changed from monthly to weekly. The change in resolution affected the predictions of a machine learning pipeline, leading to incorrect forecasts.

Example: Exploring Supernovas. In an astronomy experiment, some visualizations of supernovas presented unusual artifacts that could have indicated a discovery. The experimental analysis consisted of multiple pipelines run at different sites, including data collection at the telescope site, data processing at a high-performance computing facility, and data analysis run on the physicist's desktop. After spending substantial time trying to verify the results, the physicists found that a bug introduced in the new version of the data processing software had caused the artifacts.

To debug such problems, users currently expend considerable effort reasoning about the effects of the many possible different settings. This requires them to tune and execute new pipeline instances to test hypotheses manually, which is tedious, time-consuming, and error-prone.

We propose new methods and a system that automatically and iteratively identifies one or more minimal causes of failures in general computational pipelines (or workflows).

Figure 1: Machine learning pipeline and its provenance. A data scientist can explore different input datasets and classifier estimators to identify a suitable solution for a classification problem.

The Need for Systematic Iteration. Consider the example in Figure 1, which shows a generic template for a machine learning pipeline and a log of different instances that were run with their associated results. The pipeline reads a dataset, splits it into training and test subsets, creates and executes an estimator, and computes the F-measure score using 10-fold cross-validation. A data scientist uses this template to understand how different estimators perform for different types of input data and, ultimately, to derive a pipeline instance that leads to high scores.
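The paper does not include code for this template, but a minimal sketch helps make the setup concrete. The snippet below assumes a scikit-learn-style implementation; the datasets, estimators, hyperparameters, and the f1_macro scoring choice are illustrative assumptions, not details taken from BugDoc.

```python
# Illustrative sketch of the Figure 1 template; not BugDoc's actual pipeline code.
from sklearn.datasets import load_digits, load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_pipeline_instance(dataset_loader, estimator):
    """One pipeline instance: read a dataset, split it, and score the
    estimator with 10-fold cross-validated F-measure on the training split."""
    X, y = dataset_loader(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    scores = cross_val_score(estimator, X_train, y_train, cv=10, scoring="f1_macro")
    return scores.mean()

# A provenance log records which (dataset, estimator) combination produced which score.
provenance = []
for dataset_name, loader in [("Iris", load_iris), ("Digits", load_digits)]:
    for estimator in [DecisionTreeClassifier(), GradientBoostingClassifier(),
                      LogisticRegression(max_iter=1000)]:
        score = run_pipeline_instance(loader, estimator)
        provenance.append((dataset_name, type(estimator).__name__, round(score, 3)))

for record in provenance:
    print(record)
```

A log like `provenance` above plays the role of the run history shown in Figure 1, which BugDoc mines and extends with new, deliberately chosen instances.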
Analyzing the provenance of the runs, we can see that gradient boosting leads to low scores for two of the datasets (Iris and Digits), but it has a high score for Images. By contrast, decision trees work well for both the Iris and Digits datasets, and logistic regression leads to a high score for Iris.

This may suggest that there is a problem with the gradient boosting module for some parameters, that decision trees provide a suitable compromise for different data, and that logistic regression is good for the Iris data. Because each run used different parameters for each method depending on the dataset, a definitive conclusion has to await additional testing of these hyperparameters. Doing so manually is time-consuming and error-prone, while BugDoc automates this process.

Identifying Root Causes of Failures: Challenges. As the above examples illustrate, there are many potential causes for a given problem. Prior work used provenance to explain errors in computational processes that derive data [18, 46]. However, to test these hypotheses and obtain complete (and accurate) explanations, new pipeline instances must be executed that vary the different components of the pipeline. Trying all possible combinations of parameter values leads to a combinatorial explosion of instances to execute and can therefore be prohibitively expensive. Thus, a critical challenge lies in the design of a strategy that is provably efficient (often requiring only a number of pipeline executions linear in the number of parameters) for finding root causes. Causes of errors can include multiple parameters, each of which may have large domains. So, it is important to have clear and concise explanations in terms of the parameter values already tried.
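To make the gap between exhaustive and linear search concrete, the following back-of-the-envelope sketch uses made-up numbers (ten parameters with five values each); the counts are purely illustrative and do not come from the paper.

```python
# Illustrative arithmetic only: exhaustive enumeration versus a linear budget.
num_parameters = 10        # e.g., module parameters, input choices, software versions
values_per_parameter = 5   # domain size, assumed uniform for simplicity

exhaustive_instances = values_per_parameter ** num_parameters  # 5**10 = 9,765,625
linear_budget = values_per_parameter * num_parameters          # 5 * 10 = 50

print(f"exhaustive search: {exhaustive_instances:,} pipeline executions")
print(f"linear-style budget: {linear_budget} pipeline executions")
```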
Contributions. In this paper, we introduce BugDoc, a new approach that makes use of iteration and provenance to infer the root causes automatically and derive succinct explanations of failures in pipelines. Our contributions can be summarized as follows:
(1) BugDoc finds root causes autonomously and iteratively, intelligently selecting so-far untested combinations.
(2) We propose debugging algorithms that find root causes using fewer pipeline instances than state-of-the-art methods, avoiding unnecessary costly computations. In fact, BugDoc often finds root causes using only a number of pipeline instances linear in the number of parameters.
(3) The BugDoc system further reduces time by exploiting parallelism.
(4) Finally, BugDoc derives concise explanations to facilitate the tasks of human debuggers.

Section 3 defines the problem we address. In Section 4, we present algorithms to search for simple and complex causes of failures. We compare BugDoc with the state of the art in Section 5 and conclude in Section 6, where we outline directions for future work.

2 Related Work
Debugging Data and Pipelines. Recently, the problem of explaining query results and interesting features in data has received substantial attention in the literature [4, 14, 18, 39, 46]. Some have focused on explaining where and how errors occur in the data generation process [46] and which data items are most likely to be causes of relational query outputs [39, 47]. Others have attempted to use data to explain salient features in data (e.g., outliers) by discovering relationships among attribute values [4, 14, 18]. In contrast, BugDoc aims to diagnose abnormal behavior in computational pipelines that may be due to errors in data, programs, or the sequencing of operations.

Previous work on pipeline debugging has focused on analyzing execution histories to identify problematic parameter settings or inputs, but such work does not iteratively infer and test new workflow instances. Bala and Chana [5] applied several machine learning algorithms to predict whether a particular pipeline instance will fail to execute in a cloud environment. The goal is to reduce the consumption of expensive resources by recommending against executing an instance if it has a high probability of failure. The system does not attempt to find the root causes of such failures. Chen et al. [12] developed a system that identifies problems by finding the differences between the provenance (encoded as trees) of good and bad runs. However, in general, these differences do not necessarily identify root causes, though they often contain them.

Some systems have been developed to debug specific applications. Viska [24] helps users identify the underlying causes of performance differences across a set of configurations. Users infer hypotheses by exploring performance data and then test these hypotheses by asking questions about the causal relationships between a set of selected features and the resulting performance. Thus, Viska can be used to validate hypotheses but not to identify root causes. Molly [1] combines the analysis of lineage with SAT solvers to find bugs in fault-tolerance protocols for distributed systems. It simulates failures, such as permanent crash failures, message loss, and temporary network partitions, in order to test fault-tolerance protocols over a specified period.

Although not designed for computational pipelines, Data X-Ray [46]