Dataexplore: an Application for General Data Analysis in Research and Open Research Software Education
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Farrell, D 2016 DataExplore: An Application for General Data Analysis in Research and open research software Education. Journal of Open Research Software, 4: e9, DOI: http://dx.doi.org/10.5334/jors.94 SOFTWARE METAPAPER DataExplore: An Application for General Data Analysis in Research and Education Damien Farrell1 1 UCD School of Veterinary Medicine, University College Dublin, Ireland [email protected] DataExplore is an open source desktop application for data analysis and plotting intended for use in both research and education. It is intended primarily for non-programmers who need to do relatively advanced table manipulation methods. Common tasks that might not be familiar to spreadsheet users such as table pivot, merge and join functionality are included as core elements. Creation of new columns using arithmetic expressions and pre-defined functions is possible. Table filtering may be done with simple booleanqueries.Theotherprimaryfeatureisrapiddynamicplotcreationfromselecteddata.Multiple plots from various selections and data sources can also be composed using a grid layout. It is thus possible to create publication quality plots. A plugin system allows the addition of features with several plugins alreadyavailablebydefault.TheprogramiswritteninPythonandisbasedonthePyDatasuiteofPython libraries. Keywords:python;scientificplotting;tableanalysis;pandas Funding statement: The author is funded by an Irish Research Council Postdoctoral Fellowship (GOIPD/2015/475). (1) Overview out of multiple factors. The opaque way in which cell Introduction based formulae are often used makes it hard to track cal- Recent years have seen a rapid growth in the importance culations. The use of conditional formatting to present of data handling and analysis in the sciences. Such is results makes it almost impossible to interpret them in the complexity and volume of data it has given rise another format. Also because they may be used for data to data scientist specialisations within many fields. In entry and analysis at the same time, ad hoc changes to the biological sciences, the data analysis task is often the raw data are encouraged. Finally, statistical analysis assigned to a bioinformatician. Such expert skills are using the most common commercial product, Excel, have essential when the complexity of the task is too much been criticized [4]. These problems make reproducible for experimentalists unfamiliar with advanced compu- science difficult. tational techniques. However there is a danger of over For the general scientific user these limitations can reliance upon data analysts particularly in cases where be partly overcome by using other tools to compliment an analysis can be done with relatively basic computa- spreadsheets. Many researchers use separate plotting tional skills. packages such as GraphPad Prism [5] and statistical Spreadsheets are widely used in scientific research. tools like SPSS [6] for analysis. However this gives rise to They have also advanced a great deal in sophistication another problem – these are commercial applications. [1] since their introduction and are by now a standard tool They are very expensive and the source code is closed. for anyone dealing with numerical data. In the sciences This has serious implications for reproducibility. The there has however been a tendency to rely on the spread- tendency to use commercial software is highly prevalent sheet for tasks that they were not originally designed in academia even though there are some viable open for [2]. Though advanced features like pivot tables are source solutions available. This may partly be a problem available many general users are sometimes not aware of general awareness on the part of the user. Veusz [7] of them and make the worksheet more complicated and SciDAVis [8] are good examples of free plotting than it needs to be. Even if a spreadsheet can perform packages that compare favourably with commercial a task using a macro it is often much more complex to products though they do not seem to be widely known. accomplish than it would be with a few lines of code [3]. Commercial tools are also frequently rather feature Spreadsheets have another more serious problem in that heavy and complex for the general user with entire they make reproducible analysis very difficult. This arises courses devoted to teaching them. Art. e9, p. 2 of 8 Farrell: DataExplore Scientists in certain data intensive subject areas, are established plotting library for Python and produces pub- now beginning to adapt to scripting languages like lication-quality figures in a variety of hard copy formats R and Python [9]. These are a much better foundation to and interactive environments across platforms. build future skills on and since they are open platforms In some plotting packages like Veusz, SciDaviz and mjo- they allow users to publish full end-to-end instructions graph [20] plots are designed by the addition of multiple that anyone in the world can reproduce for free. They plot elements to which data is attached. DataExplore is also facilitate workflows with large data [10]. Adoption more data centric like R-studio and plots are generated of these scripting tools is easier said than done because dynamically from the currently selected data and chosen of the intimidating nature of programming to many. options. The idea is that rows and columns can quickly be This is one reason R might not easily gain traction with chosen or tables edited and new plots generated instantly experimentalists since it requires at least some program- with minimal mouse clicks. Plots cannot currently be ming skill. R-studio [11] goes a long way to address this interactively edited though this is an option that could be issue as it provides a user friendly environment for newer added later. users. Though much progress has been on new web based tools it still a challenge to build highly interactive applica- Architecture tions inside the browser. There is still therefore space for The software is written in Python and makes extensive graphical desktop tools to provide a familiar compliment use of the PyData libraries. These form an ecosystem of to spreadsheets for non-programmers. libraries that can provide a complete solution to data analysis from visualization to machine learning. The Objectives graphical interface is built with Tkinter/ttk, the standard DataExplore is intended for rapid exploratory analysis of graphical tool kit for Python. Like other Python packages tabulated data. Quick transformation and visualization of the library is broken into modules which contain classes data are core features. The use of the Python PyData stack grouped by function. Table, plotting and dialog widgets [12] as the back-end means a large number of well tested are in their own modules as shown in Figure 1. The core algorithms are already available. The dichotomy between class is the pandastable widget which is a Tkinter canvas programming and tools with a graphical interface is usu- object. This is used to display a Pandas DataFrame via a ally a sharp one [13] with users preferring one or the other model class that carries out changes to the DataFrame approach. DataExplore is also intended to help bridge this based on user interaction and stores some additional gap by readily making possible processing steps normally data about the Table. This widget is designed to be familiar to data analysts. The main objectives of the soft- re-used in any Tkinter application. The DataExplore ware are: application module itself is built around the table widget, a plot viewer module and several plugins. • allow quick exploration and visualization of a data set • allow a familiar graphical interface but implement User interface more advanced table analysis features than currently The application consists essentially of a table and associ- accessible in spreadsheets ated plot viewer, shown in Figure 2. Multiple sets of tables • help to bridge the gap between graphical interface can be loaded and saved as single projects. For certain func- and command driven or programmatic approaches to tions a child table or sub-table is created below the main data analysis one. This may be to store the results of a table manipula- • scale to medium sized datasets, i.e. a table of the order tion such as an aggregation or to paste in another table so of 1–5 million rows that will fit in the memory of that it can be joined to the main one. Another use would most computers be to paste a portion of the selected data and plot it. The • allow publication quality plots to be made easily and sub-table can be created and discarded as needed. encourage clear scientific visualization [14] Unlike a spreadsheet, the focus is not on data entry. Though individual cell entry is possible, users are encouraged Implementation and architecture to keep their original data separate and unchanged. Results Methodology can be exported to csv or other formats if required. This is The core R data structure is called data.frame, a versatile important to robust analysis. An undo/redo feature is not matrix structure that stores multiple data types [15]. It yet implemented but will likely be useful in the future has been replicated in Python as a core component of when more complex series of processing steps need to be the Pandas library [16]. This has opened the way to much experimented with. more convenient R style data analysis in Python. Pandas Plot options are laid out in a set of tabbed control pan- DataFrame structures, which use the efficient ndarray els below the plot allowing the user to switch quickly data container class in numpy [17], are now well inte- between basic and other plotting modes. Currently a 3D grated into other Python data analysis libraries, creating plot mode and grid layout options are also available.