
SOFTWARE METAPAPER

DataExplore: An Application for General Data Analysis in Research and Education

Damien Farrell
UCD School of Veterinary Medicine, University College Dublin, Ireland
[email protected]

DataExplore is an open source desktop application for data analysis and plotting intended for use in both research and education. It is aimed primarily at non-programmers who need to perform relatively advanced table manipulation. Common tasks that might not be familiar to spreadsheet users, such as table pivot, merge and join functionality, are included as core elements. Creation of new columns using arithmetic expressions and pre-defined functions is possible. Table filtering may be done with simple boolean queries. The other primary feature is rapid dynamic plot creation from selected data. Multiple plots from various selections and data sources can also be composed using a grid layout. It is thus possible to create publication quality plots. A plugin system allows the addition of features, with several plugins already available by default. The program is written in Python and is based on the PyData suite of Python libraries.

Keywords: python; scientific plotting; table analysis; pandas

Funding statement: The author is funded by an Irish Research Council Postdoctoral Fellowship (GOIPD/2015/475).

(1) Overview

Introduction
Recent years have seen a rapid growth in the importance of data handling and analysis in the sciences. Such is the complexity and volume of data that it has given rise to data scientist specialisations within many fields. In the biological sciences, the data analysis task is often assigned to a bioinformatician. Such expert skills are essential when the complexity of the task is too much for experimentalists unfamiliar with advanced computational techniques. However, there is a danger of over-reliance upon data analysts, particularly in cases where an analysis can be done with relatively basic computational skills.

Spreadsheets are widely used in scientific research. They have also advanced a great deal in sophistication [1] since their introduction and are by now a standard tool for anyone dealing with numerical data. In the sciences there has, however, been a tendency to rely on spreadsheets for tasks they were not originally designed for [2]. Though advanced features like pivot tables are available, many general users are not aware of them and make the worksheet more complicated than it needs to be. Even if a spreadsheet can perform a task using a macro, it is often much more complex to accomplish than it would be with a few lines of code [3]. Spreadsheets have another, more serious problem in that they make reproducible analysis very difficult. This arises out of multiple factors. The opaque way in which cell-based formulae are often used makes it hard to track calculations. The use of conditional formatting to present results makes it almost impossible to interpret them in another format. Also, because spreadsheets may be used for data entry and analysis at the same time, ad hoc changes to the raw data are encouraged. Finally, statistical analysis using the most common commercial product, Excel, has been criticized [4]. These problems make reproducible science difficult.

For the general scientific user these limitations can be partly overcome by using other tools to complement spreadsheets. Many researchers use separate plotting packages such as GraphPad Prism [5] and statistical tools like SPSS [6] for analysis. However, this gives rise to another problem: these are commercial applications. They are very expensive and the source code is closed. This has serious implications for reproducibility. The tendency to use commercial software is highly prevalent in academia even though there are some viable open source solutions available. This may partly be a problem of general awareness on the part of the user. Veusz [7] and SciDAVis [8] are good examples of free plotting packages that compare favourably with commercial products, though they do not seem to be widely known. Commercial tools are also frequently rather feature-heavy and complex for the general user, with entire courses devoted to teaching them.

Scientists in certain data-intensive subject areas are now beginning to adapt to scripting languages like R and Python [9]. These are a much better foundation on which to build future skills and, since they are open platforms, they allow users to publish full end-to-end instructions that anyone in the world can reproduce for free. They also facilitate workflows with large data [10]. Adoption of these scripting tools is easier said than done because of the intimidating nature of programming to many. This is one reason R might not easily gain traction with experimentalists, since it requires at least some programming skill. RStudio [11] goes a long way to address this issue as it provides a user friendly environment for newer users. Though much progress has been made on new web-based tools, it is still a challenge to build highly interactive applications inside the browser. There is therefore still space for graphical desktop tools to provide a familiar complement to spreadsheets for non-programmers.

Objectives
DataExplore is intended for rapid exploratory analysis of tabulated data. Quick transformation and visualization of data are core features. The use of the Python PyData stack [12] as the back-end means a large number of well tested algorithms are already available. The dichotomy between programming and tools with a graphical interface is usually a sharp one [13], with users preferring one or the other approach. DataExplore is also intended to help bridge this gap by making processing steps normally familiar to data analysts readily accessible. The main objectives of the software are:

• allow quick exploration and visualization of a data set
• allow a familiar graphical interface but implement more advanced table analysis features than currently accessible in spreadsheets
• help to bridge the gap between graphical interface and command driven or programmatic approaches to data analysis
• scale to medium sized datasets, i.e. a table of the order of 1–5 million rows that will fit in the memory of most computers
• allow publication quality plots to be made easily and encourage clear scientific visualization [14]

Implementation and architecture

Methodology
The core R data structure is called data.frame, a versatile matrix structure that stores multiple data types [15]. It has been replicated in Python as a core component of the Pandas library [16]. This has opened the way to much more convenient R-style data analysis in Python. Pandas DataFrame structures, which use the efficient ndarray data container class in numpy [17], are now well integrated into other Python data analysis libraries, creating a very useful ecosystem. These libraries are often grouped together as part of the PyData stack [18]. DataExplore is based on using DataFrames to present tabulated data and on matplotlib [19] for plotting. Matplotlib is a very well established plotting library for Python and produces publication-quality figures in a variety of hard copy formats and interactive environments across platforms.

In some plotting packages like Veusz, SciDAVis and mjograph [20], plots are designed by the addition of multiple plot elements to which data is attached. DataExplore is more data-centric, like RStudio, and plots are generated dynamically from the currently selected data and chosen options. The idea is that rows and columns can quickly be chosen or tables edited and new plots generated instantly with minimal mouse clicks. Plots cannot currently be edited interactively, though this is an option that could be added later.

Architecture
The software is written in Python and makes extensive use of the PyData libraries. These form an ecosystem of libraries that can provide a complete solution to data analysis, from visualization to machine learning. The graphical interface is built with Tkinter/ttk, the standard graphical toolkit for Python. Like other Python packages, the library is broken into modules which contain classes grouped by function. Table, plotting and dialog widgets are in their own modules, as shown in Figure 1. The core class is the pandastable widget, which is a Tkinter canvas object. This is used to display a Pandas DataFrame via a model class that carries out changes to the DataFrame based on user interaction and stores some additional data about the Table. This widget is designed to be re-used in any Tkinter application. The DataExplore application module itself is built around the table widget, a plot viewer module and several plugins.

User interface
The application consists essentially of a table and associated plot viewer, shown in Figure 2. Multiple sets of tables can be loaded and saved as single projects. For certain functions a child table or sub-table is created below the main one. This may be to store the results of a table manipulation such as an aggregation, or to paste in another table so that it can be joined to the main one. Another use would be to paste a portion of the selected data and plot it. The sub-table can be created and discarded as needed.

Unlike a spreadsheet, the focus is not on data entry. Though individual cell entry is possible, users are encouraged to keep their original data separate and unchanged. Results can be exported to csv or other formats if required. This is important for robust analysis. An undo/redo feature is not yet implemented but will likely be useful in the future when more complex series of processing steps need to be experimented with.

Plot options are laid out in a set of tabbed control panels below the plot, allowing the user to switch quickly between basic and other plotting modes. Currently a 3D plot mode and grid layout options are also available. Table functions are accessed either from the right toolbar (see Figure 2), the right-click context menu inside the table or from the main menu. Dialogs such as plugin interfaces are usually placed below the table.
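The Architecture section notes that the Table widget is designed to be re-used in any Tkinter application. As a rough sketch of what embedding it might look like (using Python 3 module names; the constructor arguments and show() call follow the pandastable documentation of the time but should be treated as indicative rather than definitive):

```python
import tkinter as tk
import pandas as pd
from pandastable import Table

# A small example DataFrame to display (contents are arbitrary).
df = pd.DataFrame({'length': [5.1, 4.9, 6.3], 'width': [3.5, 3.0, 3.3]})

root = tk.Tk()
frame = tk.Frame(root)
frame.pack(fill='both', expand=True)

# Create the re-usable Table widget inside an ordinary Tkinter frame.
pt = Table(frame, dataframe=df, showtoolbar=True, showstatusbar=True)
pt.show()
root.mainloop()
```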

Figure 1: Outline of pandastable library modules. The graphical user interface is built on top of these modules.

Figure 2: Application interface.

Features

Import of text files
Import of csv and general plain text formats is a standard feature of Pandas using the read_csv method and supports many options. The most essential of these are available via the import dialog, accessible from the toolbar or by right-clicking anywhere in the table and using the context menu.

Row and column indexes
The index is a fundamental feature of the underlying DataFrame. This performs the central role of data alignment, or getting and setting of subsets of the table. A more novel aspect is the use of "hierarchical" indexing. This is essentially a way of representing data with an arbitrary number of dimensions in a 2D table. In our program the use of multi-indexes is mostly implicit to the way the program works, but it opens the door to adding more useful functionality later on. For now the index can be displayed or hidden in the table and columns can be turned into indexes. This is useful for plotting since the index is often the implied x-axis.

Table filtering
Currently filtering of the table is done using a quite simple string query method. An entry box is used to enter the query and the table is updated accordingly. The main table is restored when the filters are cleared. The syntax is straightforward to learn for beginners and may be useful for teaching logical AND/OR/NOT row-wise operations.

Table manipulation
Common transformations such as transpose, aggregation, pivot and merge are supported. Results are mostly placed in the sub-table so as not to overwrite the main table. The sub-table can also be used to plot from or copied into the main table or another sheet. For operations involving two tables (like concatenate or merge) the second dataset is loaded into the sub-table (by importing or pasting), which can then be joined to the main table.

Plotting behaviour
The design is oriented around quick generation of plots from the current selections. This means that the current plot is constantly overwritten. However, plots can be saved and recalled in the current session if required. To produce multiple plots in one figure a grid layout mode is used. Changing the number of rows/columns makes a finer grid and adjusting the row/column spans allows a variety of sub plot combinations to be created. When the user wants to add a new sub plot they simply select the row and column location at which to add it. An example is shown in Figure 3. It is also possible to use this method to make inset plots by overlaying them on the main plot.
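These table operations correspond closely to standard Pandas calls. The following is a minimal sketch of the Pandas equivalents of the features described above; the file and column names are hypothetical, and the application may not invoke exactly these methods internally:

```python
import pandas as pd

# Import: read_csv handles csv and general delimited text files.
df = pd.read_csv('measurements.csv')            # hypothetical input file

# Filtering: a boolean string query of the kind typed into the filter box.
subset = df.query("species == 'a' and length > 5 or not flagged")

# Indexes: turn one or more columns into a (hierarchical) row index.
indexed = df.set_index(['species', 'sample'])

# Table manipulation: pivot/aggregate, and merge with a second table.
summary = df.pivot_table(index='species', columns='sample',
                         values='length', aggfunc='mean')
labels = pd.read_csv('labels.csv')              # hypothetical second table
joined = df.merge(labels, on='species', how='left')
```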

Figure 3: A figure generated directly from the application using the Titanic data. This uses the grid layout mode to combine multiple sub plots together.

Categorical plots
Data can be grouped and plotted either by grouping by categorical columns in the plot dialog or by performing a groupby-aggregate step and plotting the resulting table. The factor plots plugin provides even more advanced plotting capabilities. Factor plots allow multiple comparisons to be made in a single graph. That is, data can be split by more than one variable along an axis or between plots. In seaborn these dimensions are called row/col (the plot dimensions), x, y (axes) and hue (grouping/colour within plots). These concepts are illustrated in the seaborn documentation on factor plotting [21] and on the blog.
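To make the row/col, x/y and hue terminology concrete, a call of the kind the factor plots plugin builds on might look as follows. This is a sketch using the seaborn 0.6-era factorplot function with hypothetical column names (later seaborn versions renamed factorplot to catplot):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical long-form data: one row per observation.
df = pd.DataFrame({
    'sample':    ['a', 'a', 'b', 'b', 'a', 'b'] * 4,
    'treatment': ['ctrl', 'drug'] * 12,
    'sex':       ['m', 'm', 'f', 'f', 'm', 'f'] * 4,
    'value':     range(24),
})

# x/y set the axes, hue colours groups within each plot,
# and col splits the data into one sub plot per category.
sns.factorplot(x='treatment', y='value', hue='sex',
               col='sample', data=df, kind='bar')
plt.show()
```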

3D plots
Plotting interactive 3D projections that can be zoomed and rotated is provided using the matplotlib mplot3d module. Scatter, bar, contour, wireframe and surface plots are available. Plotting of column selections does not have a single unambiguous representation in 3D space for the latter three kinds of plots, so pre-defined modes must be used that tell the program how to interpret the selected data. Currently the default is to interpret the third column z as a function of the first two, x and y, i.e. of the form (x,y) → z. Other modes, such as support for parameterized functions, are still to be added.

Data fitting
The statsmodels [22] library is used for data fitting in DataExplore since it works well with Pandas and has a simple programming interface. It provides descriptive statistics, statistical tests, plotting functions, and implementations of standard estimators used in model fitting. String formulas are supported using Patsy, and this is used to allow the user to type in their formulas using the special syntax.

Plugins
Plugins are for adding custom functionality that is not present in the main application. These are Python scripts implemented by sub-classing the Plugin class in the plugin module. At minimum a plugin must have a main() method which is called by the application to launch it. Otherwise a script can generally contain any code the author wishes. Usually the idea will be to implement a dialog that the user interacts with, but this could also be a single function that runs on the current table or all sheets at once without further user interaction. Three plugins are currently provided with the library (a minimal plugin skeleton is sketched after this list):

• Batch file renaming utility – a tool for renaming multiple files at once that can be useful for importing files.
• Factor plotting – advanced categorical plotting using the Seaborn library [23]. Seaborn provides a high-level interface to matplotlib for drawing attractive graphics. It also understands DataFrames without any need for converting them.
• IPython console – here you can call any Python commands and even shell commands provided in IPython [24]. The underlying table DataFrame can be manipulated directly with code snippets or external scripts and the table updated to reflect the changes immediately. This should prove a useful way to teach coding skills in a familiar environment.
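As a rough illustration of the plugin contract just described, a skeleton might look like the following. Only the requirement to subclass Plugin (from the plugin module) and provide a main() method is taken from the text; the import path, the menuentry attribute and the way the current table is reached are assumptions that should be checked against the pandastable source:

```python
from pandastable.plugin import Plugin    # assumed import path
import numpy as np

class ExamplePlugin(Plugin):
    """Hypothetical plugin: add a log-transformed copy of the first column."""

    menuentry = 'Example plugin'          # assumed: label shown in the plugins menu

    def main(self, parent=None):
        # main() is the entry point the application calls to launch the plugin.
        table = parent.getCurrentTable()  # assumed accessor for the active table
        df = table.model.df               # the DataFrame held by the table's model
        df['log_col1'] = np.log(df.iloc[:, 0])
        table.redraw()                    # refresh the display after the change
```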
Documentation and usage
Documentation is provided in the form of a wiki on GitHub at https://github.com/dmnfarrell/pandastable/wiki/. Specific case studies/tutorials, along with links to screencasts and details of new features, can be viewed on the blog at http://dmnfarrell.github.io/. The case studies provide a visual guide through real world examples. This blog will be kept up to date as the program is further developed.

Current case studies
1. Looking at the Titanic dataset (basic). Exploratory methods for beginners using the Titanic data from a Kaggle Getting Started Competition [25]. This illustrates the use of the software in initial explorations of data for beginners using plots of distributions, breaking down columns by category and re-binning categorical data.
2. Plotting miRNA abundance data (advanced). This uses miRNA-sequencing expression data [26] to show more complex methods of representing data sets with multiple sample labels. It includes a demonstration of how to create long form data for use with the factor plotting plugin.

Future developments
The software at the time of writing is at version 0.7. Regular releases will be made as bugs are fixed or features added. Many useful developments can be applied to this software to exploit the wealth of functionality in Python scientific libraries. The plugin system allows such new features to be added very easily by third parties with some knowledge of Python. It is important to underline that development should be led by user feedback rather than the author. Some particularly useful potential features are mentioned here:

• Workflow tracking mechanism. Obviously in a point and click environment people will not remember every step they take in an analysis. To aid reproducibility some method of recording processing steps would be useful.
• Integration with Jupyter notebooks. The Jupyter notebook [27] is a web application that allows you to create and share documents that contain live code, visualizations and explanatory text. It has potential to be used for the kind of workflow tracking mentioned above.
• Plugins for more specific kinds of data analysis such as principal component analysis.
• Batch conversion and/or joining of multiple text/csv files, likely using a plugin.
• Improvements to scale with larger data sets.
• Add support for loading remote data sources and sharing results.
• Enhance plot support with the ability to add annotations and allow arbitrary placement of sub plots, amongst other features.

Quality control
Git is used for version control and bug tracking. GitHub's issue tracking supports milestones, labels and assignees for filtering and categorizing bugs. It is therefore the ideal way to handle user feedback.

Testing
Testing is done using the standard Python unittest framework (PyUnit). Tests can be executed from the cloned source directory. Since automated testing of Tkinter widgets is known to be difficult, the tests currently concentrate on table functions that do not require user interaction. For other tests user feedback is essential. The project uses the Travis continuous integration service [28]. Travis CI automatically detects when a commit has been made and pushed to the GitHub repository. Each time this happens, it will try to build the project and run the tests. This model of development scales well for multiple developers.

Data integrity
Since there is minimal emphasis on data entry per se, the user is encouraged to keep raw data unchanged. Projects saved in the native format as multiple sets of worksheets are kept separate from the original data. Projects are saved in MessagePack format [29], which is an efficient binary serialization format and is used to save Python objects. In our case the DataFrame, metadata like the current table selections, and plotting options (as Python dictionaries) for each sheet are saved together in one project so that the workspace can be reloaded conveniently. Individual DataFrames can also be saved alone without any extra metadata so that they can be persisted and reloaded outside the context of the application. Though we have found the format efficient and reliable, MessagePack support is still experimental in Pandas. The project files are not designed to be used as an archive for raw data sets but rather are seen as a workspace that can be updated constantly and interchanged between users wishing to share analysis and workflows. A methodology for workflow tracking is planned that will involve refinement of this file format.
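For the case of saving an individual DataFrame without any extra metadata, the Pandas MessagePack interface of that era looked roughly like the sketch below (pandas 0.17-style API; to_msgpack/read_msgpack were experimental and have since been removed from Pandas, and the project's own format bundling several sheets plus selection and plot metadata is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Persist a single DataFrame (experimental API in pandas 0.17,
# removed in later Pandas releases).
df.to_msgpack('table.mpk')

# Reload it outside the context of the application.
restored = pd.read_msgpack('table.mpk')
```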
(2) Availability
This software is supported on any operating system that supports a standard Python installation, which includes Linux, Windows and OSX. In all systems the DataExplore application can be provided by installing the pandastable Python library via pip or easy_install. This requires a working Python installation. It is also available as a package via the self-contained Anaconda Python distribution. In Windows a binary installer is available, packaged with cx_Freeze [30], that installs an executable and all the required libraries without the need for a separate Python install. This installer is a 32-bit executable which will run on all Windows systems. Detailed install instructions are given in the documentation.

Programming language
Python version >= 3.4 or 2.7.

Additional system requirements
No special requirements.

Dependencies
The following Python libraries are required dependencies:
Numpy >= 1.5
matplotlib >= 1.1
pandas >= 0.17
numexpr >= 2.4
xlrd >= 0.9

Optional dependencies:
seaborn >= 0.6
statsmodels >= 0.6 (requires scipy)
ipython >= 4.0

Software location
Archive (e.g. institutional repository, general repository)
Name: Zenodo
Persistent identifier: 10.5281/zenodo.44891
Licence: GPL v3
Publisher: Damien Farrell
Date published: 17/1/16

Code repository
Name: GitHub
Identifier: https://github.com/dmnfarrell/pandastable
Licence: GPL v3
Date published: 11/2/14

Language
English.

(3) Reuse potential
The software is designed for a general science and technical audience and is not tied to any specific field. Though it was initially developed with the biological sciences in mind, it has very general application. It will be most useful for students and researchers who are not programmers but need to do convenient exploration of their data. Educators at all levels are also a target for the software. With the easy-to-learn user interface it is a good way to introduce basic data manipulation methods. Specific uses include producing plots for reports or publication, quick visualization of small to medium sized datasets, database-style filtering with string queries, or fitting linear models. It is hoped that some of this functionality will help students become familiar with the more advanced analytical methods available via programming languages. Data scientists may also find the tool useful for quick plots of their data before or after detailed analyses and as a way of sharing results with others.

The use of proprietary software without access to the code base is still very common in science. This makes it hard to do reproducible work that can be shared. This software is based on well established open source Python libraries that do not have such issues. These high-quality tools for scientific computing provide the ideal platform to build a user friendly open source application for data analysis. In the long term, it is hoped that a community of users can be built, some of whom will be developers able to provide their own plugins or extend the core application. The project can also be forked without restriction.

Competing Interests
The author declares that they have no competing interests.

Acknowledgements
Thanks to Prof. Stephen Gordon for supporting work on this project.
References
1. Weathington, J 2015 5 things every data scientist should know about Excel. Available at http://www.techrepublic.com/article/5-things-every-data-scientist-should-know-about-excel/ [Accessed: 16-Jan-2016].
2. Burns, P 2014 Spreadsheet Addiction. Available at http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/.
3. Moffitt, C 2014 Common Excel Tasks Demonstrated in Pandas. Available at http://pbpython.com/excel-pandas-comp.html.
4. McCullough, B D and Heiser, D A 2008 On the accuracy of statistical procedures in Microsoft Excel 2007. Comput Stat Data Anal, 52(10): 4570–4578. DOI: http://dx.doi.org/10.1016/j.csda.2008.03.004
5. GraphPad Software 2015 GraphPad Prism version 6.0. Available at http://www.graphpad.com/scientific-software/prism/.
6. IBM 2015 IBM SPSS Statistics. Available at http://www-01.ibm.com/software/analytics/spss/.
7. Sanders, J 2015 Veusz. Available at http://home.gna.org/veusz/.
8. Benkert, T, Franke, K and Standish, R 2007 SciDAVis. Available at http://scidavis.sourceforge.net/.
9. Buffalo, V 2015 Bioinformatics Data Skills. O'Reilly Media.
10. Heller, M 2015 Learn to crunch big data with R. Available at http://www.infoworld.com/article/2880360/big-data/learn-to-crunch-big-data-with-r.html [Accessed: 07-Jan-2016].
11. RStudio Inc. 2015 RStudio: Integrated Development for R. Boston, MA.
12. Oliphant, T E 2007 (May) Python for Scientific Computing. Comput Sci Eng, 9(3): 10–20. DOI: http://dx.doi.org/10.1109/MCSE.2007.58
13. Ward, N 2013 Excel, SPSS, or R? Available at https://learnandteachstatistics.wordpress.com/2013/02/11/excel-spss-minitab-or-r/ [Accessed: 09-Sep-2015].
14. Rougier, N P, Droettboom, M and Bourne, P E 2014 Ten Simple Rules for Better Figures. PLoS Comput Biol, 10(9): e1003833. DOI: http://dx.doi.org/10.1371/journal.pcbi.1003833
15. Data Science Retreat 2013 R: the good parts. Available at http://blog.datascienceretreat.com/post/69789735503/r-the-good-parts [Accessed: 08-Jan-2016].
16. McKinney, W 2015 Pandas, Python Data Analysis Library. Available at http://pandas.pydata.org/.
17. van der Walt, S, Colbert, S and Varoquaux, G 2011 (March) The NumPy Array: A Structure for Efficient Numerical Computation. Comput Sci Eng, 13(2): 22–30. DOI: http://dx.doi.org/10.1109/MCSE.2011.37
18. The PyData Community 2015 PyData. Available at http://pydata.org/downloads/.
19. Hunter, J D 2007 (May) Matplotlib: A 2D Graphics Environment. Comput Sci Eng, 9(3): 90–95. DOI: http://dx.doi.org/10.1109/MCSE.2007.55
20. Tanahashi, M 2014 mjograph. Available at http://www.ochiailab.dnj.ynu.ac.jp/mjograph/.
21. Waskom, M 2015 Seaborn factorplot documentation. Available at http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.factorplot.html [Accessed: 12-Jan-2016].
22. Statsmodels Developers 2015 Statsmodels. Available at http://statsmodels.sourceforge.net/.
23. Waskom, M 2012 Seaborn. Available at http://stanford.edu/~mwaskom/software/seaborn/.
24. Pérez, F and Granger, B E 2007 (May) IPython: a System for Interactive Scientific Computing. Comput Sci Eng, 9(3): 21–29. DOI: http://dx.doi.org/10.1109/MCSE.2007.53
25. Kaggle 2012 Titanic: Machine Learning from Disaster. Available at https://www.kaggle.com/c/titanic [Accessed: 16-Jan-2016].
26. Farrell, D, Shaughnessy, R G, Britton, L, MacHugh, D E, Markey, B and Gordon, S V 2015 The Identification of Circulating MiRNA in Bovine Serum and Their Potential as Novel Biomarkers of Early Mycobacterium avium subsp. paratuberculosis Infection. PLoS One, 10(7): e0134310. DOI: http://dx.doi.org/10.1371/journal.pone.0134310
27. 2015 Jupyter Notebook. Available at http://jupyter.org/ [Accessed: 21-Jan-2016].
28. Travis CI Community 2011 Travis continuous integration. Available at https://travis-ci.org/.
29. Furuhashi, S 2008 MessagePack. Available at http://msgpack.org/index.html.
30. Tuininga, A 2014 cx_Freeze. Available at http://cx-freeze.sourceforge.net/.

How to cite this article: Farrell, D 2016 DataExplore: An Application for General Data Analysis in Research and Education. Journal of Open Research Software, 4: e9, DOI: http://dx.doi.org/10.5334/jors.94

Submitted: 15 September 2015 Accepted: 08 March 2016 Published: 22 March 2016

Copyright: © 2016 The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See http://creativecommons.org/licenses/by/4.0/.

Journal of Open Research Software is a peer-reviewed open access journal published by Ubiquity Press.