VizCertify: A Framework for Secure Visual Data Exploration

Lorenzo De Stefani, Leonhard F. Spiegelberg, Eli Upfal
Department of Computer Science, Brown University
Providence, United States of America
lorenzo, lspiegel, [email protected]

Tim Kraska
CSAIL, Massachusetts Institute of Technology
Cambridge, United States of America
[email protected]

Abstract—Recently, there have been several proposals to develop visual recommendation systems. The most advanced systems aim to recommend visualizations which help users find new correlations or identify an interesting deviation based on the current context of the user's analysis. However, when recommending a visualization to a user, there is an inherent risk of visualizing random fluctuations rather than solely true patterns: a problem largely ignored by current techniques.
In this paper, we present VizCertify, a novel framework to improve the performance of visual recommendation systems by quantifying the statistical significance of recommended visualizations. The proposed methodology allows controlling the probability of misleading visual recommendations using a novel application of the Vapnik-Chervonenkis (VC) dimension, which results in an effective criterion to decide whether a candidate visualization recommendation corresponds to an actually statistically significant phenomenon.

Index Terms—Visualization, Recommendation Systems, FWER, Data Exploration, Visual Recommendation, Vapnik-Chervonenkis theory, Statistical Learning

I. INTRODUCTION

Recently, many visual recommendation engines [1]–[9] have been proposed to help savvy and unsavvy users with data analysis. While some recommendation engines (e.g., [2], [8], [4], [9]) aim to recommend better visual encodings, others (e.g., SeeDB [1], DeepEye [7], MuVE [5], or VizDeck [6]) aim to automatically recommend entirely new visualizations to help users find interesting insights. While the latter are significantly more powerful, they also significantly increase the risk of finding false insights [10], [11]. This happens whenever a visualization is used as a device to represent statistical properties of the data. Consider a user exploring a dataset containing information about different wines. After browsing the data for a while, she creates a visualization of wines ranked by origin, showing wines from France to be apparently rated higher. If her only takeaway is that, in this particular dataset, wines from France have a higher rating, there is no risk of a false insight. However, it is neither in the nature of users to constrain themselves to such thinking [12], nor would visualizing the data be insightful then. Rather, users would most likely infer that French wines are generally rated higher, generalizing their insight to all wines. Statistically savvy users will now test whether this generalization is actually statistically valid using an appropriate test. Even more technically savvy users will consider other hypotheses they tried and adjust the statistical testing procedure to account for the multiple comparisons problem. This is important since every additional hypothesis, explicitly expressed as a test or implicitly observed through a visualization, increases the risk of finding insights which are just spurious effects.

The issue with visual recommendation engines: What happens when visualization recommendations are generated by one of the above systems? First, the user does not know whether the effect shown by the visualization is actually significant. Even worse, she cannot use a standard statistical method and simply test the effect shown in the visualization for significance. Visual recommendation engines potentially check hundreds of thousands of visualizations for their "interestingness". As a result, by testing large quantities of visualizations it is very likely that a system will find something "interesting" regardless of whether the observed phenomenon is actually statistically valid. Therefore, a test of significance for the recommended visualization should consider the whole testing history.
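To see how quickly this risk accumulates, the sketch below is a small, self-contained illustration (our own, not code from the paper): it applies a chi-squared test of independence to many candidate "views", each pairing the same outcome with an attribute that is unrelated to it by construction. The sample size of 2,644 records (matching the survey introduced later), the 1,000 candidate views, and the level alpha = 0.05 are assumptions made purely for illustration.

```python
# Illustration of the multiple comparisons pitfall: scanning many candidate
# views over pure noise almost surely produces a few "significant" ones.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n, k, alpha = 2644, 1000, 0.05          # records, candidate views, test level

outcome = rng.integers(0, 2, size=n)    # e.g., "believes in Astrology" (random)
false_hits = 0
for _ in range(k):
    attribute = rng.integers(0, 2, size=n)   # attribute independent of outcome
    # 2x2 contingency table of attribute vs. outcome counts.
    table = np.array([[np.sum((attribute == a) & (outcome == o))
                       for o in (0, 1)] for a in (0, 1)])
    p_value = chi2_contingency(table)[1]
    false_hits += p_value < alpha

print(f"{false_hits} of {k} unrelated views look 'significant' at level {alpha}")
# The chance of at least one false positive among k independent tests is
# roughly 1 - (1 - alpha)**k, i.e., essentially 1 for large k.
```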
Why not a holdout? Advocates of visual recommendation engines usually argue that visual recommendation systems are meant to be hypothesis generation engines, whose generated hypotheses should always be validated on a separate hold-out dataset. While this is a valid method to control false discoveries, it has several impractical implications: (1) None of the insights found on the exploration dataset should be regarded as actual insights before they are validated. This is clearly problematic if one observation may steer towards another during the exploration. (2) Splitting a dataset into an exploration set and a hold-out set can significantly reduce the power (i.e., the chance to find actual true phenomena). (3) The hold-out needs to be controlled for the multi-hypothesis problem unless the user only wants to use it exactly once for a single test.

In this paper, we present VizCertify, a first framework to make visual recommendation engines "safe". That is, VizCertify validates candidate recommended visualizations by verifying that they represent actually statistically significant phenomena.

Funding: This research was funded in part by DARPA Award 16-43-D3M-FP-040, NSF Award IIS-1514491, NSF Award IIS-1562657, NSF Award RI-18134446, and supported by Google, Intel, and Microsoft as part of the MIT Data Systems and AI Lab (DSAIL).

Scope: As already pointed out in [10], [13], associating an appropriate hypothesis test for validating a visualization is notoriously hard. In this work, we focus on the visual recommendation style proposed by SeeDB for histograms, as it uses a clear semantic for what "interesting" means and is rumored to be implemented in a widely-used commercial product. SeeDB [1] makes recommendations based on a reference view; it tries to find a visualization whose underlying data distribution is different from the one the user is currently shown (i.e., the implicit test performed by SeeDB evaluates the difference in sub-populations). However, our techniques can be extended to other visualization frameworks and hypothesis generation tools (e.g., Data Polygamy [14]) provided that the chosen "interestingness" criterion can be evaluated by a statistical test.
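To make the deviation-based semantics concrete, the following minimal sketch ranks candidate views by their distance to the reference view; the function and variable names are ours, not SeeDB's actual API. Each view is normalized into a histogram over the same categories, and the Chebyshev (L-infinity) distance serves as the deviation measure, since that is the metric the experiment below configures SeeDB with. Note that such a ranking carries no notion of statistical significance, which is exactly the gap VizCertify targets.

```python
# Simplified sketch of deviation-based view ranking (names are hypothetical).
from collections import Counter

def normalized_histogram(values, categories):
    """Empirical distribution of `values` over a fixed list of categories."""
    counts = Counter(values)
    total = max(len(values), 1)
    return [counts.get(c, 0) / total for c in categories]

def chebyshev(p, q):
    """L-infinity distance between two distributions on the same support."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def rank_views(reference, candidates, categories):
    """Sort candidate views by decreasing deviation from the reference view."""
    ref = normalized_histogram(reference, categories)
    scored = [(chebyshev(ref, normalized_histogram(vals, categories)), name)
              for name, vals in candidates.items()]
    return sorted(scored, reverse=True)

# Toy usage: answers to "Do you believe in Astrology?" overall vs. in two filters.
categories = ["yes", "no"]
reference = ["no"] * 70 + ["yes"] * 30
candidates = {"likes cheese-flavored chips": ["no"] * 40 + ["yes"] * 60,
              "prefers electric cars":       ["no"] * 68 + ["yes"] * 32}
print(rank_views(reference, candidates, categories))
```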
Contributions: The core idea of VizCertify is to estimate the interest of candidate recommendations and to automatically adjust the significance value based on the search space in order to account for the multiple-hypothesis pitfall. With this work we make the following contributions:
• We formalize the process of making visualization recommendations as statistical hypothesis testing.
• We propose a method based on the use of the VC dimension, which allows controlling the probability of observing false discoveries during the visualization recommendation process. VizCertify allows control of the Family-Wise Error Rate (FWER) at a given control level δ ∈ (0, 1). Our method provides finite-sample FWER control, whereas classical statistical approaches (e.g., the chi-squared test) ensure only asymptotic FWER control.
• We evaluate the performance of our system in comparison with SeeDB via extensive experimental analysis.

The remainder of this paper is organized as follows: In Section II we give a definition of the visualization recommendation problem in rigorous probabilistic terms. In Section III we discuss possible approaches for the visualization recommendation problem and highlight why they suffer due to the multiple comparisons problem.

As a motivating example, consider a survey dataset we conducted with 2,644 participants from the U.S. on Amazon Mechanical Turk, covering 32 (mostly unrelated) multiple-choice questions. Questions range from "Would you rather drive an electric car or a gas-powered car?" to "What is your highest achieved education level?" and "Do you believe in Astrology?". A global aggregation reveals that the majority of participants does not believe in Astrology, as shown in Figure 1a. Based on this visualization, SeeDB's top-3 recommended visualizations are depicted in Figures 1b-1d. SeeDB proposes various ways to measure the difference between two visualizations in order to rank them; for this experiment, we configured SeeDB to use the Chebyshev distance. For example, Figure 1b shows that people who prefer cheese-flavored potato chips are more likely to believe in Astrology.

However, the visualizations alone do not answer the question whether the shown difference is actually statistically significant or not. In contrast, VizCertify would automatically mark the reference view in Figure 1a and the view in Figure 1c as statistically significant, but disregard the visualizations in Figures 1b and 1d. It is important to note that the top-ranked visualization, Figure 1b, is not considered statistically significant, while the second-ranked one is; i.e., a higher rank does not imply higher statistical significance.

Obviously, detecting statistically significant differences is challenging, as it depends on the data size and on the number of hypotheses tested by the user and by the recommendation system. Furthermore, while it is easy to define what a test means for a two-bar histogram, it is much harder for visualizations with multiple bars. Finally, it has to be pointed out that, in the case of SeeDB, increasing the data size does not necessarily mitigate the problem, as SeeDB automatically adds filter conditions to create ever-smaller sub-populations.

B. Problem set-up

In this work, we assume that D (e.g., our survey data) consists of n records chosen uniformly at random and independently from a universe Ω. We can imagine Ω in the form of a two-dimensional, relational table with N rows and m columns.
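As a minimal numerical sketch of this sampling model (the attribute, its value distribution, and the sizes below are made-up illustration values), even a single attribute's empirical frequencies in the sample D deviate from the population frequencies over Ω purely by chance, roughly on the order of 1/sqrt(n); these are exactly the fluctuations that a recommendation engine must not mistake for real effects.

```python
# Sketch of the sampling model: D consists of n records drawn uniformly and
# independently from a universe Omega of N rows; empirical frequencies in D
# fluctuate around the population frequencies on the order of 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
N, n = 1_000_000, 2644
categories = ["a", "b", "c"]

# One categorical attribute of Omega (made-up value distribution).
omega_attribute = rng.choice(categories, size=N, p=[0.5, 0.3, 0.2])
sample = rng.choice(omega_attribute, size=n, replace=True)   # the dataset D

population = [float(np.mean(omega_attribute == c)) for c in categories]
empirical = [float(np.mean(sample == c)) for c in categories]
max_dev = max(abs(p - e) for p, e in zip(population, empirical))
print(f"max deviation between D and Omega: {max_dev:.4f} (~1/sqrt(n) = {n ** -0.5:.4f})")
```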