Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes Jie Song∗ Yeye He University of Michigan Microsoft Research
[email protected] [email protected] ABSTRACT Complex data pipelines are increasingly common in diverse ap- plications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed, and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently that are expensive to resolve. Figure 1: Example code snippet to describe expected data- Data validation has thus become an important topic, as evidenced values, using declarative constraints in Amazon Deequ. by notable recent efforts from Google and Amazon, where the ob- jective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, Such schema-drift and data-drift can lead to quality issues in that on string-valued data, these existing approaches yield high downstream applications, which are both hard to detect (e.g., mod- false-positive rates and frequently require human intervention. In est model degradation due to unseen data [53]), and hard to debug this work, we develop a corpus-driven approach to auto-validate (finding root-causes in complex pipelines can be difficult). These machine-generated data by inferring suitable data-validation “pat- silent failures require substantial human efforts to resolve, increas- terns” that accurately describe the underlying data-domain, which ing the cost of running production data pipelines and decreasing minimizes false-positives while maximizing data quality issues the effectiveness of data-driven system [21, 53, 58, 60].