Data Wrangling

John Meehan, Jeff Rasley

Working with raw data sucks.
• Data comes in all shapes and sizes
  – CSV files, PDFs, stone tablets, .jpg…
• Different files have different formatting
  – Spaces instead of NULLs, extra rows
• “Dirty” data
  – Unwanted anomalies
  – Duplicates

Current Tools
• Focus on specific problems
  – Resolving entities
  – Removing duplicates
  – Schema matching
• Most systems are non-interactive
  – Inaccessible to general audience
• A lot of people just use Excel or regular expressions…

Data Wrangling
• Goal: extract and standardize the raw data
  – Combine multiple data sources
  – Clean data anomalies
• Combine automation with interactive visualizations to aid in cleaning
• Improve efficiency and scale of data importing
• Lower the threshold for broader audiences

[Workflow diagram]
• Three data sources: database, PDF, CSV
• Missing headers, multiple date formats, merged columns
• Success! …but immediately getting error: “Empty cells in column 3”
• Ugh, screw it… Back to importing.
• Data reimported, more cleaning
• Analysis possible, but data quality still sucks
• SUCCESS!!!!
• Repeat data loading less painful, but still annoying

Types of Data Problems
• Missing data
• Incorrect data
• Inconsistent representations of the same data
• About 75% of data problems require human intervention
• Cleaning data vs overly-sanitizing data

Diagnosing data problems
• Visualizations can convey “raw” data
• Different visual representations highlight different types of data issues
  – Outliers often stand out in a plot
  – Missing data will cause a gap or zero value
• Becomes increasingly difficult as data gets larger
  – Visual design coupled with interaction
  – Sampling

[Figure panels: Node-Link Diagram, Matrix View, Sorted Matrix View]

Visualizing Missing Data
• Set values to zero?
• Interpolate based on existing data?
• Omit missing data?
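
The three options on the missing-data slide can be compared side by side on a small series. The following is only a minimal sketch in plain Python: the function names and the sample `readings` are made up for illustration, and linear interpolation is just one assumed way to "interpolate based on existing data".

```python
from typing import List, Optional

Series = List[Optional[float]]  # None marks a missing value


def fill_zero(values: Series) -> List[float]:
    """Option 1: replace every missing value with zero."""
    return [0.0 if v is None else v for v in values]


def interpolate_linear(values: Series) -> Series:
    """Option 2: fill interior gaps on a straight line between known neighbors.

    Leading and trailing gaps have only one neighbor, so they stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


def omit_missing(values: Series) -> List[float]:
    """Option 3: drop missing values (this shortens the series)."""
    return [v for v in values if v is not None]


# Hypothetical sample data with two gaps.
readings: Series = [10.0, None, None, 16.0, None, 20.0]

print("zero-filled: ", fill_zero(readings))
print("interpolated:", interpolate_linear(readings))
print("omitted:     ", omit_missing(readings))
```

Each choice changes what a plot would show: zero-filling creates artificial dips, interpolation hides the fact that values were missing, and omission shifts the positions of the remaining points.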
Visualizing Uncertain Data
• Can arise from:
  – Measurement errors
  – Missing data
  – Sampling
• Visualization must
  – Consider all components of uncertainty
  – Depict multiple kinds of uncertainty
  – Interact with uncertainty depictions

Transforming Data
• Splitting columns, converting into meaningful records
• Typical methods: regular expressions, programming by demonstration
  – Prone to errors, tedious
• Interactive tools simplify the process
  – Guide user through setting automated constraints
  – Generates scripts for the user

Transforming Data (cont)
• Data formatting, extraction, and conversion
• Correcting erroneous values
• Integrating multiple data sets

Editing and Auditing Transformations
• Data Provenance
  – Maintaining the data history
  – Track the lineage of a specific item’s origins
• Used for the modification, reuse, and understanding of a transformation
• What transformation language should be used?
  – Extend existing languages?

Wrangling in the Cloud
• Allows the sharing of data transformations
• Mining records of wrangling
  – Better automatic suggestions
• User-defined data types
• Feedback from downstream analysts
  – Crowdsourcing the final result
  – Allow users to annotate or correct the data

Checkpoint-Conclusion
• Data wrangling is often a second-class citizen
• Common problems, very time-consuming
  – Manual approach is no longer a viable option
• Future work: extend visual approaches into the data wrangling phase
• Plenty of research directions

Secure Data Analytics
• Private/sensitive data
  – SSN, Medical, Classified, etc.
• Cloud-based analysis currently doesn’t work
• Local analysis is key
• Research area?

Potter's Wheel: An Interactive Data Cleaning System
• Vijayshankar Raman and Joseph Hellerstein (VLDB ‘01)
• Provide a graphical interactive tool to support various data transformations with suggestions

Transformations

Download!
• Potter's Wheel A-B-C: An Interactive Tool for Data Analysis, Cleansing, and Transformation
  – http://control.cs.berkeley.edu/abc/
• (But you probably don’t want to, last release was Oct 10, 2000)

Data Wrangler
• Wrangler: Interactive Visual Specification of Data Transformation Scripts (CHI ‘11)
  – Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer (Stanford Vis Group + Berkeley)
D3
$4.3M in Series A funding! (1/12)

What’s new?
• Similar goals as Potter’s Wheel
• Improved UI for the web
• Python & JavaScript libraries
• Additional transformations such as fill

Demo (~10min)

7 Command-Line Tools for Data Science
• http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
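
To make the column-splitting and fill transformations named in the slides concrete, here is a rough sketch in plain Python. It is not the API of Wrangler or Potter's Wheel: `fill_down`, `split_column`, and the sample rows are hypothetical, and "fill" is assumed here to mean copying the nearest non-empty value from above into empty cells.

```python
import re
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def fill_down(rows: List[Row], column: str) -> List[Row]:
    """Copy the most recent non-empty value of `column` into empty cells below it."""
    filled: List[Row] = []
    last_seen: Optional[str] = None
    for row in rows:
        value = row.get(column)
        if value:  # non-empty cell: remember it for the rows that follow
            last_seen = value
        new_row = dict(row)
        new_row[column] = value if value else last_seen
        filled.append(new_row)
    return filled


def split_column(rows: List[Row], column: str, pattern: str,
                 new_names: List[str]) -> List[Row]:
    """Split `column` on a regular expression into the columns in `new_names`."""
    result: List[Row] = []
    for row in rows:
        text = row.get(column) or ""
        parts: List[Optional[str]] = list(
            re.split(pattern, text, maxsplit=len(new_names) - 1)
        )
        parts += [None] * (len(new_names) - len(parts))  # pad rows that split short
        new_row: Row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_names, parts))
        result.append(new_row)
    return result


# Hypothetical raw rows: a merged "place" column and an empty "region" cell.
raw: List[Row] = [
    {"region": "Northeast", "place": "Providence, RI"},
    {"region": "", "place": "Boston, MA"},
    {"region": "West", "place": "Portland, OR"},
]

clean = split_column(fill_down(raw, "region"), "place", r",\s*", ["city", "state"])
for row in clean:
    print(row)
```

Hand-written scripts like this are the "prone to errors, tedious" approach the slides describe; the interactive tools aim to generate equivalent scripts from user demonstrations instead.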
