Your notebook is not crumby enough, REPLace it

Michael Brachmann (B), William Spoth (B), Oliver Kennedy (B), Boris Glavic (I), Heiko Mueller (N), Sonia Castelo (N), Carlos Bautista (N), Juliana Freire (N)

B: University at Buffalo
I: Illinois Institute of Technology
N: New York University

{mrb24,wmspoth,okennedy}@buffalo.edu
[email protected]
{heiko.mueller, s.castelo, carlos.bautista, juliana.freire}@nyu.edu

Figure 1: Overview of our system Vizier. The New York City Leading Causes of Death dataset was used. Vizier has four main views: (A) The Vizier Notebook View, (B) The Vizier Caveat View, (C) The Vizier Spreadsheet View and (D) The Vizier History View.

ABSTRACT

Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts to build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, no matter whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management.