
C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation

Jie Song, University of Michigan, Ann Arbor, Michigan, [email protected]
George Alter, University of Michigan, Ann Arbor, Michigan, [email protected]
H. V. Jagadish, University of Michigan, Ann Arbor, Michigan, [email protected]

ABSTRACT

Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard to understand for users. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages: SPSS® and Stata® (SAS® and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps and trace the provenance and changes of data at different levels for better understanding of the process.

CCS CONCEPTS

• Information systems → Data provenance; Extraction, transformation and loading.

KEYWORDS

data transformation, data documentation, data provenance

ACM Reference Format:
Jie Song, George Alter, and H. V. Jagadish. 2019. C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation. In 2019 International Conference on Management of Data (SIGMOD ’19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3299869.3320241

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMOD ’19, June 30-July 5, 2019, Amsterdam, Netherlands
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-5643-5/19/06...$15.00
https://doi.org/10.1145/3299869.3320241

1 INTRODUCTION

As the research community responds to increasing demands for public access to scientific data, the need for improvement in data documentation has become critical. Accurate and complete metadata is essential for data sharing and for interoperability [3]. However, the process of describing and documenting scientific data has remained a tedious, manual process even when data collection is fully automated.

Figure 1: ICPSR data downloads by format (Sep. 4, 2015 - Mar. 4, 2016)

Researchers in many fields use the main statistics packages for data management as well as analysis. Figure 1 shows data downloads by statistical package format, mainly SPSS®, SAS®, Stata® and R [4, 6–8], from the Inter-university Consortium for Political and Social Research (ICPSR, https://www.icpsr.umich.edu/). These packages, however, lack tools for documenting variable transformations in the manner of a workflow system or even a database. At best, the operations performed by the statistical package are described in a script, which more often than not is not even available to future data users. Statistics packages also differ in data model, transformation representation and scope of transformations covered, which further complicates the understanding of the transformation process.

Figure 2: Data Transformation Example by SPSS and Stata scripts

Motivating Example. Figure 2 is an example of data analysis using statistical languages. The original data is part of a computer-generated survey result of political opinions. It records surveyed attributes: CaseID uniquely identifies survey participants while V520041 to V520043 represent whether the participant cares about who wins party, state or local election. As the user is interested in all those who care about who wins elections, she writes a transformation script that computes the union of opinions by adding V520041 to V520043 as a newly labeled attribute Partycare1. She then cleans the data by filtering out cases whose Age is less than 67 before sorting cases in the descending order of Age, breaking ties by the ascending CaseID. Here we show functionally equivalent transformation scripts in two languages: SPSS and Stata. For more procedural languages like R, transformations can be even harder to interpret without familiarity with the language.

To reduce the cost and increase the completeness of metadata, we aim to work with common statistical packages to automate the capture of metadata at the granularity of individual data transformations in a simple yet expressive representation regardless of the original languages used. In this demo, we show C2Metadata as a system that implements this idea, creating efficiencies and reducing the costs of data collection, preparation, and re-use. The system first reads statistical transformation scripts (in SPSS, Stata, SAS or R) and the original metadata in either of two internationally accepted XML-based standards: the Data Documentation Initiative (DDI) [9] and Ecological Metadata Language (EML) [2]. It then walks over individual data transformations and uses a software-independent data transformation description (Standard Data Transformation Language (SDTL)) to update the original metadata, which permits the tracking of dataset changes at different levels. We define SDTL based on a small set of transformation operators that comprise the Standard Data Transformation Algebra (SDTA) and cover the majority of data transformations used in statistical software. SDTL is a declarative language describing the purpose of transformations in an informative way. To ease the understanding of the process without the necessity of learning SDTL or SDTA, we further extend the interpretation from SDTL to natural language as part of the updated metadata. We currently target two research communities (social and behavioral sciences and earth observation sciences) with strong metadata standards that rely heavily on statistical analysis software. However, we believe our work is generalizable to other domains, such as biomedical research. More details of the project are available at http://c2metadata.org.

Figure 3: C2Metadata workflow

Next, we provide a brief overview of C2Metadata and SDTL and then an overview of the actual demonstration.

2 C2METADATA OVERVIEW

In this section, we discuss the pipelined modular architecture of C2Metadata comprising four modules, as shown in the grey area in Figure 3. We explain the system workflow and then describe SDTA and SDTL.

2.1 Workflow

The workflow in C2Metadata consists of four modules: SDTL Parser, Pseudo-code Generator, XML Updater and Codebook Formatter.

SDTL Parser. In the Script Parser module, C2Metadata expresses data transformations in SDTL, independent of the language used. The components of the module are shown in Figure 4.

Figure 4: SDTL Parser Components

Script Parser is customized for each statistical package since they each use a different scripting language. It takes a command script written in one of these scripting languages as input and uses standard compiler techniques to parse the input, obtain a syntax tree, and then generate SDTL code. The syntax Lexer uses a lexer grammar for each language to transform the raw SPSS/Stata/SAS/R syntax code into a stream of tokens. The ANTLR-based [5] Parser then parses the token stream into an abstract syntax tree (AST), simple or nested. The Tree Walker finally walks over the AST and refers to a mapping between statistical languages and SDTL for SDTL translation. Below is the translation of an example of a

project using technologies including .NET, closure, Java, XSLT, COGS, among others.

2.2 SDTA and SDTL

Inspired by Relational Algebra, we define Standard Data Transformation Algebra (SDTA) as a realization of statistical data transformation using a small set of primitive operators. This permits us to write complex transformations as a sin-

Table 1: Operations supported by SDTL

Level                   Operations
dataset level           load, save, rename, create, merge, match, transpose, format display
column/variable level   recode, rename, aggregate, compute, delete, label, sort, missing value, join
row/case level          select, sort, add, delete, aggregate
cell level              update, label
procedural              if-then, do repeat loop
