CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

StatWire: Visual Flow-based Statistical Programming

Krishna Subramanian Chat Wacharamanotham Abstract RWTH Aachen University University of Zürich Statistical analysis is a frequent task in several research 52056 Aachen, Germany 8050 Zürich, Switzerland fields such as HCI, Psychology, and Medicine. Performing [email protected] chat@ifi.uzh.ch statistical analysis using traditional textual programming languages like is considered to have several advantages over GUI applications like SPSS. However, our examina- Johannes Maas Simon Voelker tion of 40 analysis scripts written using current IDEs for R RWTH Aachen University RWTH Aachen University shows that such scripts are hard to and main- 52056 Aachen, Germany 52056 Aachen, Germany tain, limiting their replication. We present StatWire, an IDE johannes.maas1@rwth- [email protected] aachen.de for R that closely integrates the traditional text-based edi- tor with a visual data flow editor to better support statistical programming. A preliminary evaluation with four R users Michael Ellers Jan Borchers indicates that this hybrid approach could result in statistical RWTH Aachen University RWTH Aachen University programming that is more understandable and efficient. 52056 Aachen, Germany 52056 Aachen, Germany [email protected] [email protected] Author Keywords Statistical Analysis IDE; Hybrid Programming; Visual Pro- gramming.

ACM Classification Keywords Permission to make digital or hard copies of part or all of this work for personal or H.5.2 [User Interfaces]: Graphical user interfaces (GUI); classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation G.3. [Probability and Statistics]: Statistical software. on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Introduction CHI’18 Extended Abstracts, April 21–26, 2018, Montreal, QC, Canada. © 2018 Copyright is held by the owner/author(s). Statistical analysis is an important task in several research ACM ISBN 978-1-4503-5621-3/18/04. fields. While analysis can be performed using text-based https://doi.org/10.1145/3170427.3188528 programming (e.g., R, Python) or GUIs (e.g., SPSS, JMP),

LBW104, Page 1 CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

the statistical community considers text-based programming at top-tier conferences such as CHI ’13 and ISWC ’16, were to be a “much more productive, accurate, and reproducible collected from two sources: 32 scripts from 23 projects way of performing (analysis) tasks” than GUIs [17]. on the Open Science Framework (OSF) platform2, and 8 scripts from 3 researchers at our university in Aachen. We However, our analysis shows that scripts written using cur- randomly selected no more than 3 scripts from the same rent IDEs for R, the most commonly used text-based statis- project or researcher to avoid biasing our analysis to a par- Load data Log tics [13], are hard to understand transform ticular analyst. We analyzed a total of 20,303 source lines and maintain. We speculate that current IDEs for R do not of code (SLOC), with an average of 508 SLOC per file, and support the data-driven, iterative, and non-sequential nature found two major issues: of statistical analysis [10, 16], and we explore integrating the text-based environment in current IDEs with a visual 1. Excessive Code Cloning data flow environment as a possible solution. As a realiza- Code cloning is a common programming practice in which log tion of this concept, we present StatWire, an IDE for R that the user copies, pastes, modifies, and executes code [8, Module 1 Module 1 tightly integrates the two environments. 15]. Excessive code cloning can make analysis scripts diffi- cult to maintain, understand and modify [9, 11, 12], limiting Motivation reproduction and reuse of the code for other analyses. Current IDEs for R R [6] is the most commonly used statistical programming We followed a standard procedure [15] to identify instances of code cloning. First, we preprocessed the analysis scripts log language in academia and research. Prominent IDEs for R 1 (remove blank lines, comments, etc.) to identify Lines of Module 2 Module 2 include RStudio , the official R GUI, and Microsoft’s Visual Studio Plugin for R. Such IDEs provide an interactive con- Interest (LOI). Then, we used fingerprints as both the inter- mediate representation of the code and the match detection Module 1: compute descriptive sole for line-by-line execution of code snippets and script statistics (mean, sd), and plot files for storing analysis code, which can then be executed technique to compute instances of code cloning. We identi- histograms. in a sequential manner via the interactive console. fied that out of 4098 LOI from all scripts, 2861 LOI, or 70%, were parametric clones [15]. Module 2: fit a model, get However, the workflow of statistical analysis is iterative residuals, and perform 2. Prevalence of Non-Modular Code Using modules is Shapiro-Wilk’s test on the and non-sequential [16, 10], as shown in Fig. 1. This could residuals. therefore be in conflict with the workflow supported by cur- generally considered to lead to more productive program- rent IDEs for R that use a sequential representation to store ming [5]. Non-modular code is inefficient and does not Figure 1: An example workflow of and execute the analysis. match the nature of statistical analysis, since a typical anal- statistical analysis. Analysis is ysis involves iterative execution of a module with modified iterative and non-sequential with Evaluation of R Analysis Scripts input, as shown in Fig. 1. This issue is further aggravated in ‘modules’ acting upon modified To investigate potential consequences of this mismatch, we long analyses in which a module is reused several times. data with each iteration. examined analysis scripts written in R by researchers. 40 R analysis scripts, used in HCI and Psychology publications In our analysis, we classified code snippets into the three

1http://www.rstudio.com 2http://osf.io

LBW104, Page 2 CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

Histogram of speed

H A Count

I E1 E2 speed F G

B

C D

Figure 2: User interface of StatWire, a novel IDE for performing statistical analysis using the R programming language. StatWire tightly integrates a visual data flow editor (A) with a text-based editor (B) to better support code authoring.

major steps of statistical analysis: preprocessing, exploratory a text entry study using StatWire. She is presented two analysis, and confirmatory analysis [16]. We then iden- views: a visual data flow editor (Fig. 2 A) and a textual pro- tified how many of these steps were part of each script. gramming environment (B). Each node in the visual editor We found that 97.5% of all scripts contained two or more can be either a statlet, which is a processing step (F), or steps, and 60% contained all three steps. This issue was a viewlet, which shows plots (E1) or data (E2). The edges also prevalent in long analysis scripts. (H) represent the flow of data across the nodes. The visual editor is initially empty, and Hazel creates a statlet by right- StatWire clicking on the empty canvas. When a statlet is created, Since existing IDEs for R result in issues with code repro- the text editor () is shown with a default function template, duction and comprehension, we wanted to explore the tight encouraging her to think in a modular fashion and to write integration of a visual data-flow editor with the traditional functions. During code authoring, Hazel uses the output text editor as a possible solution. The following walkthrough shown next to the code (D) for debugging. She authors explains the interaction design of our current prototype: statlets to read in the CSV file and then subsets the data according to different levels of an independent variable. Hazel is an HCI researcher who wants to analyze data from

LBW104, Page 3 CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

When she adds an input or output argument to the function Existing visual programming environments such as ViSta header, the corresponding node in the visual data flow ed- for statistical analysis [19] and Orange for machine learning itor is updated automatically (G). This lets Hazel focus on [4] do not expose the underlying source code of their mod- the flow of data (i.e., overview) in the visual editor, and on ules. Blender, a 3D modeling software, allows the user to the processing code (i.e., details) in the text editor. view and edit the underlying source code of modules, but they have to initially be authored in a different environment When Hazel wants to view the descriptive statistics of the and then imported. RapidMiner and KNIME are visual pro- QWERTY distribution using another statlet, she ‘wires’ its gramming environments that allow authoring code using a output to the ‘View descriptive stats’ statlet. She runs the text editor, but treat the visual editor as a passive element statlet to view the histogram (in a viewlet, Fig. 2 E1) and (i.e., the visual environment does not reflect changes in the descriptive statistics (Fig. 2 F). To repeat this for the other textual code), resulting in a lack of integration between the two distributions, Hazel simply ‘rewires’ the input of the stat- two environments. The Gestalt system provides textual and let, leading to an efficient analysis workflow. visual environments, and allows the user to easily switch between them [14]. However, it uses a linear list as its - StatWire was developed in a user-centered, iterative fash- sual representation, which does not suit the non-linear na- ion, during which we explored several design alternatives ture of statistical analysis. Further, Gestalt does not provide to support a tight integration between the visual and text- a tight integration between the two environments. based environments. E.g., in an earlier prototype of StatWire, users added input or output arguments via the visual pro- Preliminary Evaluation gramming environment. However, evaluations with existing To identify how well StatWire supports statistical program- users of R and an expert in statistical analysis who teaches ming with R, we compared it to RStudio, a widely used IDE introductory courses in R revealed that users preferred to for R, as well as RapidMiner and KNIME, two visual pro- update the arguments in the textual programming environ- gramming environments with R integration. ment. This resulted in live updates to the visual program- ming environment when changes were made to the textual We recruited four R users from our local university in Aachen. programming environment. Three had taken a graduate-level introductory analysis course in R, all had used RStudio before (none had used Related Work KNIME or RapidMiner before), and all had at least three Visual programming environments have been shown to help years of experience performing statistical analysis. with comprehension [2], performance [1], and code navigation [3]. For a comprehensive overview of vi- Procedure sual programming environments and their benefits, see [7]. Each user performed analysis with each tool for a minimum StatWire uses a data flow visual programming environment, of 30 minutes with different datasets and research ques- which better suits the data-driven nature of statistical anal- tions. Additional research questions or datasets were pro- ysis. Such data flow environments have been successfully vided as needed. We balanced the order of conditions and applied to other data-driven domains [18]. randomized both the dataset-tool pairing and the character-

LBW104, Page 4 CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

istics of datasets to minimize learning effects (Fig. 3. The can be extended to allow transformation of existing analy- RStudio RapidMiner StatWire Dataset 1 Dataset 2 Dataset 3 analysis required typical, non-trivial analysis methods such sis scripts into a more structured and reusable format by, as factor encoding, data transformation, and post-hoc tests. e.g., automatically highlighting similar chunks of code and semi-automatically converting them to a reusable module. StatWire RStudio RapidMiner Significant Findings Dataset 2 Dataset 3 Dataset 1 With RStudio, none of our users followed a modular pro- REFERENCES gramming approach. There were several instances of code 1. Ed Baroth and Chris Hartsough. 1995. Visual KNIME StatWire RStudio cloning, and users used comments to structure the anal- Dataset 3 Dataset 1 Dataset 2 Object-Oriented Programming. Manning Publications ysis. However, because users were familiar with RStudio, Co., Greenwich, CT, USA, Chapter Visual they reported feeling at ease using the tool. Programming in the Real World, 21–42. RStudio StatWire KNIME http://dl.acm.org/citation.cfm?id=213388.213393 Dataset 2 Dataset 1 Dataset 3 With RapidMiner and KNIME, users had issues with the lack of integration between visual data flow editor and tex- 2. Andrew Bragdon, Robert Zeleznik, Steven P. Reiss, Figure 3: The experimental design tual environment, which led to frustration. E.g., while au- Suman Karumuri, William Cheung, Joshua Kaplan, of our preliminary study. We Christopher Coleman, Ferdi Adeputra, and Joseph J. balanced the order of conditions thoring textual code, the visual data flow editor was not up- and randomized data-tool pairing. dated and did not provide any interaction. Of course, over LaViola, Jr. 2010. Code Bubbles: A Working Set-based time, users learn to work with these shortcomings. Interface for Code Understanding and Maintenance. In Proceedings of the SIGCHI Conference on Human With StatWire, users benefited from the tighter integration Factors in Computing Systems (CHI ’10). ACM, New between the two programming environments: Two users York, NY, USA, 2503–2512. DOI: reported understanding the analysis better because of the http://dx.doi.org/10.1145/1753326.1753706 visual data flow editor, and during analysis with StatWire, 3. Robert DeLine, Mary Czerwinski, Brian Meyers, Gina 18 out of 35 statlets were reused, whereas no reuse was Venolia, Steven Drucker, and George Robertson. 2006. observed with other tools. Nevertheless, not all modules Code Thumbnails: Using Spatial Memory to Navigate were reused, which indicates further work is necessary. Source Code. In Proceedings of the Visual Languages StatWire is open-source software, and available as a local and Human-Centric Computing (VLHCC ’06). IEEE web application from the StatWire project home page.3 Computer Society, Washington, DC, USA, 11–18. DOI: http://dx.doi.org/10.1109/VLHCC.2006.14 Future Work 4. Janez Demšar, Blaž Zupan, Gregor Leban, and Tomaz While our preliminary study is promising, it is limited by Curk. 2004. Orange: From Experimental Machine sample size and guiding hypotheses, and further longitu- Learning to Interactive Data Mining. In Proceedings of dinal studies are required to understand StatWire’s effects the 8th European Conference on Principles and on code understanding, structuring, and navigation, as well Practice of Knowledge Discovery in Databases (PKDD as on the productivity of the analysis. Further, the artifact ’04). Springer-Verlag New York, Inc., New York, NY, USA, 537–539. http: 3http://hci.rwth-aachen.de/statwire //dl.acm.org/citation.cfm?id=1053072.1053130

LBW104, Page 5 CHI 2018 Late-Breaking Abstract CHI 2018, April 21–26, 2018, Montréal, QC, Canada

5. John Hughes. 1989. Why Functional Programming Software Metrics. 87–94. DOI: Matters. Comput. J. 32, 2 (1989), 98–107. http://dx.doi.org/10.1109/METRIC.2002.1011328 6. and Robert Gentleman. 1996. R: A 13. Robert A. Muenchen. 2017. The Popularity of Data Language for Data Analysis and Graphics. Journal of Analysis Software. (2017). (Accessed: 08-31-2017) Computational and Graphical Statistics 5, 3 (1996), http://r4stats.com/articles/popularity/. 299–314. 14. Kayur Patel, Naomi Bancroft, Steven M. Drucker, 7. Wesley M. Johnston, J. R. Paul Hanna, and Richard J. James Fogarty, Andrew J. Ko, and James A. Landay. Millar. 2004. Advances in Dataflow Programming 2010. Gestalt: Integrated Support for Implementation Languages. Comput. Surveys 36, 1 (2004), 1–34. DOI: and Analysis in Machine Learning.. In Proceedings of http://dx.doi.org/10.1145/1013208.1013209 the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST ’10). ACM, New York, 8. Miryung Kim, L. Bergman, T. Lau, and D. Notkin. 2004. NY, USA, 37–46. DOI: An Ethnographic Study of Copy and Paste http://dx.doi.org/10.1145/1866029.1866038 Programming Practices in OOPL. In Empirical Software Engineering, 2004. (ISESE ’04). 83–92. DOI: 15. Dhavleesh Rattan, Rajesh Bhatia, and Maninder Singh. http://dx.doi.org/10.1109/ISESE.2004.1334896 2013. Software Clone Detection: A Systematic Review. Information and Software Technology 55, 7 (July 2013), 9. Rainer Koschke. 2008. Frontiers of Software Clone 1165–1199. Management. In Frontiers of Software Maintenance ’08. 119–128. DOI: 16. John W. Tukey. 1977. Exploratory Data Analysis. http://dx.doi.org/10.1109/FOSM.2008.4659255 Addison-Wesley Publishing Company. 10. David Lubinsky and Daryl Pregibon. 1988. Data https://books.google.de/books?id=UT9dAAAAIAAJ Analysis as Search. Journal of Econometrics 38, 1-2 17. Pedro M. Valero-Mora and Rubén D. Ledesma. 2012. (1988), 247–268. DOI: Graphical User Interfaces for R. Journal of Statistical http://dx.doi.org/10.1016/0304-4076(88)90035-8 Software 49, 1 (2012), 1–8. 11. Jean Mayrand, Claude Leblanc, and Ettore M. Merlo. 18. Kanit Wongsuphasawat, Daniel Smilkov, James 1996. Experiment on the Automatic Detection of Wexler, Jimbo Wilson, Dandelion Mané, Doug Fritz, Function Clones in a Software System Using Metrics. Dilip Krishnan, Fernanda B Viégas, and Martin In 1996 Proceedings of International Conference on Wattenberg. 2017. Visualizing Dataflow Graphs of Software Maintenance. 244–253. DOI: Deep Learning Models in TensorFlow. IEEE http://dx.doi.org/10.1109/ICSM.1996.565012 Transactions on Visualization and Computer Graphics 12. Akito Monden, Daikai Nakae, Toshihiro Kamiya, Shin (2017). ichi Sato, and Ken ichi Matsumoto. 2002. Software 19. Forrest W. Young and Carla M. Bann. 1996. ViSta: The Quality Analysis by Code Clones in Industrial Legacy Visual Statistics System. (1996). Software. In Proceedings Eighth IEEE Symposium on

LBW104, Page 6