Automating the Capture of Data Transformation Metadata

H.V. Jagadish, University of Michigan
http://www.eecs.umich.edu/~jag
George Alter, University of Michigan

Why Metadata?
• Data are useless without metadata (“data about data”)
• Metadata should:
  – Include all information about data creation
  – Describe transformations to variables
  – Be easy to create
• Our goal: automated capture of metadata

A few words about ICPSR
• World’s largest archive of social science data
• Consortium established 1962
• 760+ member institutions around the world
• Founding member and home office for the DDI Alliance

Powered by DDI Metadata
• ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML.
• Codebooks (pdf and online) are rendered from the DDI.

Searchable database of 4.5M variables
[Screenshots: variable search results with the callouts “Click here for online codebook” and “Click here for variable comparison”, followed by the variable comparison display.]

Online codebook shows variable in context of dataset
[Screenshot of the online codebook, with callouts:]
• What question was asked?
• How was the question coded?
• Link to online graph tool
• Link to online crosstab tool

Metadata for the American National Election Study
[Screenshot of the codebook, with callouts:]
• What question was asked?
• Who answered this question?
• How was the question coded?
How do we know who answered the question? It’s in the pdf.

When data arrive at the archive…
• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.

How is research data created?
• Most surveys are conducted with computer assisted interview software (CAI)
  – CATI – Computer-assisted Telephone Interview
  – CAPI – Computer-assisted Personal Interview
  – CAWI – Computer Aided Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire, i.e. the program is the metadata

We already have tools to convert CAI to machine-readable metadata.
[Diagram: Computer Assisted Interviewing (CAI) produces the original data. CAI-to-DDI converters (Colectica, MQDS, others) produce the original metadata as DDI XML.]

What happens when a data project modifies the data.
[Diagram: command scripts in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. The modified data no longer match the original metadata.]

Metadata are re-created after the data are transformed.
[Diagram: metadata are extracted from the SPSS/SAS/Stata/R data file and converted from the statistical package format to DDI XML (extracted metadata). Transformations are documented by hand.]

Statistics packages have limited metadata
• Variable names
• Variable labels
• Value labels
• No provenance

Automating the capture of transformation metadata.
[Diagram: the same workflow with the missing links that we will build: a Script Parser that turns command scripts (SPSS, SAS, Stata, R) into Standard Data Transformation Language (SDTL), and an XML Updater that combines SDTL with the original DDI XML to produce revised metadata.]

What statistics packages should be covered?
ICPSR Downloads by Format

Format            All downloads    Studies with all formats
Delimited text         43%                  29%
SPSS                   22%                  24%
SAS                    10%                  12%
Stata                  19%                  23%
R                       5%                  12%
Excel                   0%                   1%
Other                   0%                   0%
Total                 100%                 100%
Number             378,007              154,663

Why do we need an SDTL?
The same three commands, applied to the same input data (X = 2, 3, 4, -1), produce different output data in each package.

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

    Input X    Output X    Y    Z
       2           2            8
       3           3
       4           4       9
      -1          -1

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

    Input X    Output X    Y    Z
       2           2            8
       3           3
       4           4       9
      -1                   9

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

    Input X    Output X    Y    Z
       2           2       .    8
       3           3       .    .
       4           4       9    .
      -1           .       .    8

What happens when a missing value is in a logical comparison?
• SPSS – Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”
• Stata – Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.
• SAS – Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.

Missing Values in Comparisons
For the input row X = -1, each package compares a different value:

    Package    Missing X compared as    Y    Z
    SPSS       NULL
    Stata      ∞                        9
    SAS        -∞                       .    8

Benefits of automated metadata capture
• Metadata will be better
  – All the information in the CAI can be included.
  – Variable transformations can be described
• Automation will lower costs
  – Metadata will not be discarded and re-created
• All metadata will be standardized and machine readable
  – Codebooks with rich information can be rendered at will
• If we make it easy and beneficial, researchers will use it.

Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)

Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan

Questions? Ask George Alter.
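
The missing-value comparison rules described in the slides can be sketched in a few lines of Python. This is only an illustration of the three behaviours as the slides state them, not SDTL and not code from the project. The names `MISSING`, `comparable`, and `transform` are hypothetical, and Python's `None` stands in for each package's missing value.

```python
import math

MISSING = None  # hypothetical stand-in for a package's missing value


def comparable(value, package):
    """Return the number a package effectively compares, or None if the
    comparison itself is missing (the SPSS behaviour in the slides)."""
    if value is not MISSING:
        return value
    if package == "stata":
        return math.inf    # missing is larger than any number
    if package == "sas":
        return -math.inf   # missing is smaller than any number
    return None            # spss: comparison is missing, treated as false


def transform(xs, package, missing_code=-1):
    """Apply the slide's three commands: declare -1 missing,
    set Y=9 where X>3, set Z=8 where X<3."""
    rows = []
    for x in xs:
        is_missing = x == missing_code
        if package == "spss":
            # MISSING VALUES flags -1 as missing but keeps the stored value
            x_out = x
        else:
            # Stata and SAS overwrite the stored value with missing
            x_out = MISSING if is_missing else x
        c = comparable(MISSING if is_missing else x, package)
        y = 9 if c is not None and c > 3 else MISSING
        z = 8 if c is not None and c < 3 else MISSING
        rows.append((x_out, y, z))
    return rows


for package in ("spss", "stata", "sas"):
    print(package, transform([2, 3, 4, -1], package))
```

Only the last row (X = -1) differs between the three packages, which is the case a package-neutral description of the transformation has to make explicit.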
