Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes
Jie Song∗
University of Michigan
[email protected]

Yeye He
Microsoft Research
[email protected]

∗Work done at Microsoft Research.

ABSTRACT

Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve.

Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data-quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate machine-generated data by inferring suitable data-validation “patterns” that accurately describe the underlying data-domain, which minimizes false positives while maximizing the data-quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.

ACM Reference Format:
Jie Song and Yeye He. 2021. Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD ’21), June 20–25, 2021, Virtual Event, China. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3448016.3457250

arXiv:2104.04659v2 [cs.DB] 13 Apr 2021

1 INTRODUCTION

Complex data pipelines and data flows are increasingly common in today’s BI, ETL and ML systems (e.g., in Tableau [14], Power BI [12], Amazon Glue [15], Informatica [10], Azure ML [3], etc.). These pipelines typically recur regularly (e.g., daily), as BI reports need to be refreshed regularly [30, 60], data warehouses need to be updated frequently [63], and ML models need to be retrained continuously (to prevent model degradation) [21, 53].

However, it is widely recognized (e.g., [21, 39, 53, 58–60]) that in recurring production data pipelines, upstream data feeds can change over time in unexpected ways. For instance, data columns can be added to or removed from upstream data, creating schema-drift [21, 53, 60]. Similarly, data-drift [53] is also common, as the formatting standard of data values can change silently (e.g., from “en-us” to “en-US”, as reported in [53]), and invalid values can also creep in (e.g., “en-99”, for an unknown locale).

Such schema-drift and data-drift can lead to quality issues in downstream applications that are both hard to detect (e.g., modest model degradation due to unseen data [53]) and hard to debug (finding root-causes in complex pipelines can be difficult). These silent failures require substantial human effort to resolve, increasing the cost of running production data pipelines and decreasing the effectiveness of data-driven systems [21, 53, 58, 60].

Data Validation by Declarative Constraints. Large tech companies operating large numbers of complex pipelines were the first to recognize the need to catch data-quality issues early in the pipelines. In response, they pioneered a number of “data validation” tools, including Google’s TensorFlow Data Validation (TFDV) [8, 21] and Amazon’s Deequ [1, 60]. These tools provide easy-to-use domain-specific languages (DSLs) for developers and data engineers to write declarative data constraints that describe how “normal” data should look in the pipelines, so that any unexpected deviation introduced over time can be caught early as it arises.

Figure 1: Example code snippet to describe expected data values, using declarative constraints in Amazon Deequ.

Figure 1 shows an example code snippet¹ from Amazon Deequ [60] that specifies data-validation constraints on a table. In this case, it is declared that values in the column “review_id” need to be unique, column “marketplace” is complete (with no NULLs), values in “marketplace” need to come from a fixed dictionary {“US”, “UK”, “DE”, “JP”, “FR”}, etc. Such constraints are used to validate data arriving in the future, raising alerts if violations are detected. Google’s TFDV uses a similar declarative approach with a different syntax.

¹https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/

These approaches allow users to provide high-level specifications of data constraints, a significant improvement over the low-level assertions used in ad-hoc validation scripts (which are hard to program and maintain) [1, 8, 21, 60, 62].

Automate Data Validation using Inferred Constraints. While declarative data validation clearly improves over ad-hoc scripts, and is beneficial as reported by both Google and Amazon [21, 60], the need for data engineers to manually write data constraints one column at a time (e.g., in Figure 1) still makes it time-consuming and hard to scale. This is especially true for complex production data, where a single data feed can have hundreds of columns requiring substantial manual work. As a result, experts today can only afford to write validation rules for part of their data (e.g., important columns). Automating the generation of data-validation constraints has become an important problem.

The main technical challenge is that validation rules have to be inferred from data values currently observed in a pipeline, but the inferred rules must apply to future data that cannot yet be observed (analogous to the training-data/testing-data split in ML settings, where test data cannot be observed). Ensuring that validation rules generalize to unseen future data is thus critical.

We should note that existing solutions already make efforts in this direction: both Google TFDV and Amazon Deequ can scan an existing data feed and automatically suggest validation constraints for human engineers to inspect and approve. Our analysis suggests that the capabilities of these existing solutions are still rudimentary, especially on string-valued data columns. For instance, for the example C1 in Figure 2(a) with date strings from March 2019, TFDV would infer a validation rule² requiring all future values in this column to come from a fixed dictionary containing the existing values in C1: {“Mar 01 2019”, ..., “Mar 30 2019”}. This is too restrictive for data validation, as it can trigger false alarms on future data (e.g., values like “Apr 01 2019”). Accordingly, the TFDV documentation does not recommend using suggested rules directly, and “strongly advised developers to review the inferred rule and refine it as needed”³.

When evaluated on production data (without human intervention), we find that TFDV produces false alarms on over 90% of string-valued columns. Amazon’s Deequ considers distributions of values but still falls short, triggering false alarms on over 20% of columns. Given that millions of data columns are processed daily in production systems, these false-positive rates can translate to millions of false alarms, which is not adequate for automated applications.

Auto-Validate string-valued data using patterns. In this work, we aim to infer accurate data-validation rules for string-valued data, using regex-like patterns. Based on an in-depth analysis of production data, we focus on string-valued columns that have homogeneous machine-generated data values. Specifically, if we can infer suitable patterns to describe the underlying “domain” of the data (defined as the space of all valid values), we can use such patterns as validation rules against future data. For instance, the pattern “<letter>{3} <digit>{2} <digit>{4}” can be used to validate C1 in Figure 2(a), and “<digit>+/<digit>{2}/<digit>{4} <digit>+:<digit>{2}:<digit>{2} <letter>{2}” can validate C2 in Figure 2(b).

Our analysis of a random sample of 1000 columns from our production data lake suggests that around 67% of string-valued columns have homogeneous machine-generated values (like those shown in Figure 3), whose underlying data-domains are amenable to pattern-based representations (the focus of this work). The remaining 33% of columns have natural-language (NL) content (e.g., company names, department names, etc.), for which pattern-based approaches are less well-suited. We should emphasize that for large data lakes and production data pipelines, machine-generated data will likely dominate (machines are, after all, more productive at churning out large volumes of data than humans producing NL data), making our proposed technique widely applicable.

While on the surface this looks similar to prior work on pattern profiling (e.g., Potter’s Wheel [57] and PADS [31]), which also learns patterns from values, the two have key differences in their objectives that make them entirely different problems.

Data Profiling vs. Data Validation. In pattern-based data profiling, the main objective is to “summarize” a large number of values in a column C, so that users can quickly understand what content is in the column without needing to scroll through or inspect every value. As such, the key criterion is to select patterns that succinctly describe the given data values in C only, without needing to consider values not present in C. For example, existing pattern-profiling methods like Potter’s Wheel [57] would correctly generate a desirable pattern “Mar <digit>{2} 2019” for C1 in Figure 2(a), which is valuable from a pattern-profiling perspective as it succinctly describes the values in C1.
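The declarative style of Figure 1 can be sketched in plain Python as follows. This is a minimal illustration only, not the actual Deequ API (Deequ is a Scala library that runs checks on Spark DataFrames); the `Check` class and its method names here are hypothetical stand-ins for the kinds of constraints described above:

```python
# Minimal plain-Python sketch of declarative, per-column validation
# constraints in the spirit of Figure 1. The Check class and its
# method names are hypothetical stand-ins, NOT the real Deequ API.
class Check:
    def __init__(self):
        self.rules = []  # list of (description, predicate over a table)

    def is_unique(self, col):
        self.rules.append((f"{col} is unique",
                           lambda t, c=col: len(set(t[c])) == len(t[c])))
        return self

    def is_complete(self, col):
        self.rules.append((f"{col} is complete",
                           lambda t, c=col: all(v is not None for v in t[c])))
        return self

    def is_contained_in(self, col, allowed):
        self.rules.append((f"{col} in {sorted(allowed)}",
                           lambda t, c=col, a=set(allowed): set(t[c]) <= a))
        return self

    def run(self, table):
        """Return descriptions of all violated constraints."""
        return [desc for desc, pred in self.rules if not pred(table)]

# A made-up future batch of data, as column-name -> list of values.
batch = {"review_id": ["r1", "r2", "r3"],
         "marketplace": ["US", "UK", "XX"]}  # "XX" is not a valid marketplace

check = (Check()
         .is_unique("review_id")
         .is_complete("marketplace")
         .is_contained_in("marketplace", {"US", "UK", "DE", "JP", "FR"}))
violations = check.run(batch)
# violations lists only the violated dictionary constraint on "marketplace"
```

As in Deequ, the constraints are specified once against expected properties of the data, then re-run against each future batch, with alerts raised on any violation.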
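Patterns in the token syntax above translate mechanically into standard regular expressions. The sketch below is our own illustration, under the assumption that `<letter>` matches one ASCII letter, `<digit>` one digit, and `{n}`/`+` keep their usual regex meanings; it shows how an inferred domain pattern accepts unseen-but-valid future values while still catching malformed ones:

```python
import re

# Translate pattern tokens into a standard regex. Token semantics are
# our assumption: <letter> = one ASCII letter, <digit> = one digit;
# {n} and + keep their regex meanings, and the remaining literals are
# plain characters such as spaces, "/" and ":".
def pattern_to_regex(pattern):
    body = pattern.replace("<letter>", "[A-Za-z]").replace("<digit>", "[0-9]")
    return re.compile("^" + body + "$")

# Domain pattern for column C1 (date strings).
c1 = pattern_to_regex("<letter>{3} <digit>{2} <digit>{4}")
assert c1.match("Mar 01 2019")   # observed value: valid
assert c1.match("Apr 01 2019")   # unseen future value: still valid
assert not c1.match("en-99")     # malformed value: caught

# Domain pattern for column C2 (timestamps).
c2 = pattern_to_regex(
    "<digit>+/<digit>{2}/<digit>{4} <digit>+:<digit>{2}:<digit>{2} <letter>{2}")
assert c2.match("9/12/2019 11:45:00 PM")
```

This is exactly the behavior a validation rule needs: it constrains the shape of values rather than their identity, so legitimate future data passes while drifted or invalid values are flagged.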
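The profiling-vs-validation distinction can be made concrete with a small sketch (token semantics assumed as above): a TFDV-style dictionary rule and a profiling pattern such as “Mar <digit>{2} 2019” both fit the observed March values exactly, but only the broader domain pattern generalizes to a legitimate April value:

```python
import re

observed = ["Mar 01 2019", "Mar 02 2019", "Mar 30 2019"]  # seen at inference time
future_value = "Apr 01 2019"  # legitimate, but unseen when rules were inferred

# (1) Dictionary rule over observed values (TFDV-style suggestion).
dict_rule = set(observed)

# (2) Profiling pattern: succinctly summarizes the observed values only.
profiling = re.compile(r"^Mar [0-9]{2} 2019$")

# (3) Domain pattern: aims to describe the space of all valid values.
domain = re.compile(r"^[A-Za-z]{3} [0-9]{2} [0-9]{4}$")

dict_alarm = future_value not in dict_rule            # True  -> false alarm
profiling_alarm = not profiling.match(future_value)   # True  -> false alarm
domain_alarm = not domain.match(future_value)         # False -> accepted
```

A profiling pattern is judged on how tightly it summarizes the values already in C; a validation pattern is judged on how well it separates the underlying domain from invalid values, including values not yet seen.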