Data Wrangling
John Meehan Jeff Rasley Working with raw data sucks.
• Data comes in all shapes and sizes – CSV files, PDFs, stone tablets, .jpg… • Different files have different forma ng – Spaces instead of NULLs, extra rows • “Dirty” data – Unwanted anomalies – Duplicates Current Tools
• Focus on specific problems – Resolving en es – Removing duplicates – Schema matching • Most systems are non-interac ve – Inaccessible to general audience • A lot of people just use Excel or regular expressions… Data Wrangling
• Goal: extract and standardize the raw data – Combine mul ple data sources – Clean data anomalies • Combine automa on with interac ve visualiza ons to aid in cleaning • Improve efficiency and scale of data impor ng • Lower the threshold for broader audiences Missing headers, Three Data Sources: Mul ple date formats, Database, PDF, CSV Merged columns Success! ………but immediately ge ng error: “Empty cells in column 3”
Ugh, screw it….. Back to impor ng.
Data reimported, Analysis possible, but More cleaning data quality s ll sucks SUCCESS!!!!
Repeat data loading less painful, but s ll annoying Types of Data Problems
• Missing data • Incorrect data • Inconsistent representa ons of the same data • About 75% of data problems require human interven on • Cleaning data vs overly-sani zing data Diagnosing data problems
• Visualiza ons can convey “raw” data • Different visual representa ons highlight different types of data issues – Outliers o en stand out in a plot – Missing data will cause gap or zero value • Becomes increasingly difficult as data gets larger – Visual design coupled with interac on – Sampling Node-Link Diagram Matrix View Sorted Matrix View
Visualizing Missing Data
• Set values to zero? • Interpolate based on exis ng data? • Omit missing data? Visualizing Uncertain Data
• Can arise from: – Measurement errors – Missing data – Sampling • Visualiza on must – Consider all components of uncertainty – Depict mul ple kinds of uncertainty – Interact with uncertainty depic ons Transforming Data
• Spli ng columns, conver ng into meaningful records • Typical methods: regular expressions, programming by demonstra on – Prone to errors, tedious • Interac ve tools simplify the process – Guide user through se ng automated constraints – Generates scripts for the user Transforming Data (cont)
• Data forma ng, extrac on, and conversion • Correc ng erroneous values • Integra ng mul ple data sets Edi ng and Audi ng Transforma ons
• Data Provenance – Maintaining the data history – Track the lineage of a specific item’s origins • Used for the modifica on, reuse, and understanding of a transforma on • What transforma on language should be used? – Extend exis ng languages? Wrangling in the Cloud
• Allows the sharing of data transforma ons • Mining records of wrangling – Be er automa c sugges ons • User-defined data types • Feedback from downstream analysts – Crowdsourcing the final result – Allow users to annotate or correct the data Checkpoint-Conclusion
• Data wrangling is o en a second-class ci zen • Common problems, very me-consuming – Manual approach is no longer a viable op on • Future work: extend visual approaches into the data wrangling phase • Plenty of research direc ons Secure Data Analy cs
• Private/sensi ve data – SSN, Medical, Classified, etc. • Cloud-base analysis currently doesn’t work • Local analysis is key • Research area? Po er's Wheel: An Interac ve Data Cleaning System • Vijayshankar Raman and Joseph Hellerstein (VLDB ‘01) • Provide a graphical interac ve tool to support various data transforma ons with sugges ons Transforma ons Download!
• Po er's Wheel A-B-C: An Interac ve Tool for Data Analysis, Cleansing, and Transforma on – h p://control.cs.berkeley.edu/abc/ • (But you probably don’t want to, last release was Oct 10, 2000) Data Wrangler
• Wrangler: Interac ve Visual Specifica on of Data Transforma on Scripts (CHI ‘11) – Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer (Stanford Vis Group + Berkeley)
D3
$4.3M in Series A funding! (1/12) What’s new?
• Similar goals as Po er’s Wheel • Improved UI for the web • Python & JavaScript libraries • Addi onal transforma ons such as fill Demo (~10min) 7 Command-Line Tools for Data Science
• h p://jeroenjanssens.com/2013/09/19/ seven-command-line-tools-for-data- science.html