StatWire: Visual Flow-based Statistical Programming
CHI 2018 Late-Breaking Abstract. CHI 2018, April 21–26, 2018, Montréal, QC, Canada

Krishna Subramanian, RWTH Aachen University, 52056 Aachen, Germany, [email protected]
Chat Wacharamanotham, University of Zürich, 8050 Zürich, Switzerland, chat@ifi.uzh.ch
Johannes Maas, RWTH Aachen University, 52056 Aachen, Germany, johannes.maas1@rwth-aachen.de
Simon Voelker, RWTH Aachen University, 52056 Aachen, Germany, [email protected]
Michael Ellers, RWTH Aachen University, 52056 Aachen, Germany, [email protected]
Jan Borchers, RWTH Aachen University, 52056 Aachen, Germany, [email protected]

Abstract
Statistical analysis is a frequent task in several research fields such as HCI, Psychology, and Medicine. Performing statistical analysis using traditional textual programming languages like R is considered to have several advantages over GUI applications like SPSS. However, our examination of 40 analysis scripts written using current IDEs for R shows that such scripts are hard to understand and maintain, limiting their replication. We present StatWire, an IDE for R that closely integrates the traditional text-based editor with a visual data flow editor to better support statistical programming. A preliminary evaluation with four R users indicates that this hybrid approach could result in statistical programming that is more understandable and efficient.

Author Keywords
Statistical Analysis IDE; Hybrid Programming; Visual Programming.

ACM Classification Keywords
H.5.2 [User Interfaces]: Graphical user interfaces (GUI); G.3. [Probability and Statistics]: Statistical software.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the Owner/Author. CHI’18 Extended Abstracts, April 21–26, 2018, Montreal, QC, Canada. © 2018 Copyright is held by the owner/author(s). ACM ISBN 978-1-4503-5621-3/18/04. https://doi.org/10.1145/3170427.3188528

Introduction
Statistical analysis is an important task in several research fields. While analysis can be performed using text-based programming (e.g., R, Python) or GUIs (e.g., SPSS, JMP), the statistical community considers text-based programming to be a “much more productive, accurate, and reproducible way of performing (analysis) tasks” than GUIs [17].

However, our analysis shows that scripts written using current IDEs for R, the most commonly used text-based statistics programming language [13], are hard to understand and maintain. We speculate that current IDEs for R do not support the data-driven, iterative, and non-sequential nature of statistical analysis [10, 16], and we explore integrating the text-based environment in current IDEs with a visual data flow environment as a possible solution. As a realization of this concept, we present StatWire, an IDE for R that tightly integrates the two environments.

Motivation

Current IDEs for R
R [6] is the most commonly used statistical programming language in academia and research. Prominent IDEs for R include RStudio (http://www.rstudio.com), the official R GUI, and Microsoft’s Visual Studio Plugin for R. Such IDEs provide an interactive console for line-by-line execution of code snippets and script files for storing analysis code, which can then be executed in a sequential manner via the interactive console.

However, the workflow of statistical analysis is iterative and non-sequential [16, 10], as shown in Fig. 1. This could therefore be in conflict with the workflow supported by current IDEs for R that use a sequential representation to store and execute the analysis.

Figure 1: An example workflow of statistical analysis. Analysis is iterative and non-sequential with ‘modules’ acting upon modified data with each iteration. Module 1: compute descriptive statistics (mean, sd), and plot histograms. Module 2: fit a model, get residuals, and perform Shapiro-Wilk’s test on the residuals.

Evaluation of R Analysis Scripts
To investigate potential consequences of this mismatch, we examined analysis scripts written in R by researchers. 40 R analysis scripts, used in HCI and Psychology publications at top-tier conferences such as CHI ’13 and ISWC ’16, were collected from two sources: 32 scripts from 23 projects on the Open Science Framework (OSF) platform (http://osf.io), and 8 scripts from 3 researchers at our university in Aachen. We randomly selected no more than 3 scripts from the same project or researcher to avoid biasing our analysis to a particular analyst. We analyzed a total of 20,303 source lines of code (SLOC), with an average of 508 SLOC per file, and found two major issues:

1. Excessive Code Cloning
Code cloning is a common programming practice in which the user copies, pastes, modifies, and executes code [8, 15]. Excessive code cloning can make analysis scripts difficult to maintain, understand and modify [9, 11, 12], limiting reproduction and reuse of the code for other analyses.

We followed a standard procedure [15] to identify instances of code cloning. First, we preprocessed the analysis scripts (remove blank lines, comments, etc.) to identify Lines of Interest (LOI). Then, we used fingerprints as both the intermediate representation of the code and the match detection technique to compute instances of code cloning. We identified that out of 4098 LOI from all scripts, 2861 LOI, or 70%, were parametric clones [15].

2. Prevalence of Non-Modular Code
Using modules is generally considered to lead to more productive programming [5]. Non-modular code is inefficient and does not match the nature of statistical analysis, since a typical analysis involves iterative execution of a module with modified input, as shown in Fig. 1. This issue is further aggravated in long analyses in which a module is reused several times.

In our analysis, we classified code snippets into the three major steps of statistical analysis: preprocessing, exploratory analysis, and confirmatory analysis [16]. We then identified how many of these steps were part of each script. We found that 97.5% of all scripts contained two or more steps, and 60% contained all three steps. This issue was also prevalent in long analysis scripts.

StatWire
Since existing IDEs for R result in issues with code reproduction and comprehension, we wanted to explore the tight integration of a visual data-flow editor with the traditional text editor as a possible solution. The following walkthrough explains the interaction design of our current prototype:

Hazel is an HCI researcher who wants to analyze data from a text entry study using StatWire. She is presented two views: a visual data flow editor (Fig. 2 A) and a textual programming environment (B). Each node in the visual editor can be either a statlet, which is a processing step (F), or a viewlet, which shows plots (E1) or data (E2). The edges (H) represent the flow of data across the nodes. The visual editor is initially empty, and Hazel creates a statlet by right-clicking on the empty canvas. When a statlet is created, the text editor (C) is shown with a default function template, encouraging her to think in a modular fashion and to write functions. During code authoring, Hazel uses the output shown next to the code (D) for debugging. She authors statlets to read in the CSV file and then subsets the data according to different levels of an independent variable.

Figure 2: User interface of StatWire, a novel IDE for performing statistical analysis using the R programming language. StatWire tightly integrates a visual data flow editor (A) with a text-based editor (B) to better support code authoring.

When she adds an input or output argument to the function header, the corresponding node in the visual data flow editor is updated automatically (G). This lets Hazel focus on the flow of data (i.e., overview) in the visual editor, and on the processing code (i.e., details) in the text editor.

When Hazel wants to view the descriptive statistics of the QWERTY distribution using another statlet, she ‘wires’ its output to the ‘View descriptive stats’ statlet. She runs the statlet to view the histogram (in a viewlet, Fig. 2 E1) and descriptive statistics (Fig. 2 F). To repeat this for the other two distributions, Hazel simply ‘rewires’ the input of the statlet, leading to an efficient analysis workflow.

Existing visual programming environments such as ViSta for statistical analysis [19] and Orange for machine learning [4] do not expose the underlying source code of their modules. Blender, a 3D modeling software, allows the user to view and edit the underlying source code of modules, but they have to initially be authored in a different environment and then imported. RapidMiner and KNIME are visual programming environments that allow authoring code using a text editor, but treat the visual editor as a passive element (i.e., the visual environment does not reflect changes in the textual code), resulting in a lack of integration between the two environments. The Gestalt system provides textual and