e-PGPathshala Subject : Computer Science Paper: Data Analytics Module No 40: CS/DA/40 – Open Source tools for Analytics Quadrant 1 – e-text

1.1 Introduction

There are many exceptionally powerful analytical tools that are free and open source, and which we can leverage today to enhance business and development skills. Some of these data analytics frameworks and tools are discussed in this chapter.

1.2 Learning Outcomes

• To know various open source data analytics frameworks, tools and support

• To understand the features of such tools

1.3 Data Visualization open source tools

Data visualization describes the presentation of abstract information in graphical form. Data visualization allows us to spot patterns, trends, and correlations that otherwise might go unnoticed in traditional reports, tables, or spreadsheets. Data analysis is the process of inspecting, cleaning, transforming and modelling data with the goal of discovering useful information, suggestions and conclusions.

1.3.1 R tool

R is a programming language and software environment for statistical analysis, graphics representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems like Linux, Windows and Mac.

This programming language was named R, partly based on the first letter of the first names of the two R authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.

The current R is the result of a collaborative effort with contributions from all over the world. It is highly extensible and flexible. R is an interpreted language; users typically access it through a command-line interpreter.

Figure 1: R graphics

1.3.2 WEKA

The original non-Java version of WEKA was primarily developed for analyzing data from the agricultural domain. With the Java-based version, the tool is very sophisticated and is used in many different applications, including visualization, and provides algorithms for data analysis and predictive modeling. Since it is free under the GNU General Public License, users can customize it however they please.

WEKA supports several standard tasks, including data preprocessing, clustering, classification, regression, visualization and feature selection. WEKA would be more powerful with the addition of sequence modeling, which currently is not included.

Figure 2: Weka tool


Weka uses the Attribute-Relation File Format (ARFF) for data analysis by default, but data can also be imported from the formats listed below:

• CSV
• ARFF
• Database (using ODBC)

Attribute-Relation File Format (ARFF): This has two parts: 1) the header section defines the relation (data set) name, the attribute names and their types; 2) the data section lists the data instances.

Weka supports the following data types for attributes:

• Numeric
• Nominal
• String
• Date
• @data – defined in the data section, followed by the list of all data segments
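To make the two sections concrete, a minimal ARFF file might look like the following sketch (the relation name, attribute names and values here are invented purely for illustration):

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute play {yes, no}
    @data
    sunny, 85, 85, no
    overcast, 83, 86, yes
    rainy, 70, 96, yes

The @relation and @attribute lines form the header section, while everything after @data is the data section, with one instance per line.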

1.3.3 pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
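As a brief sketch of how these two structures are used in practice (the column names and values below are invented for illustration):

    import pandas as pd

    # A Series is a 1-dimensional labeled array
    s = pd.Series([10, 20, 30], index=["a", "b", "c"])

    # A DataFrame is a 2-dimensional labeled table whose columns
    # may have different types, much like an SQL table
    df = pd.DataFrame({
        "city": ["Chennai", "Delhi", "Mumbai"],
        "population_millions": [10.0, 19.0, 20.4],
    })

    # Basic inspection and a simple summary statistic
    print(df.head())
    print(df["population_millions"].mean())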

1.3.4 Tanagra

TANAGRA is a free data mining software for academic and research purposes. It proposes several data mining methods from exploratory data analysis, statistical learning, machine learning and the databases area.

This project is the successor of SIPINA, which implements various supervised learning algorithms, especially an interactive and visual construction of decision trees. TANAGRA is more powerful: it contains some supervised learning algorithms but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms.

TANAGRA is an "open source project" in that every researcher can access the source code and add his or her own algorithms, as long as he or she agrees to and conforms with the software distribution license. The main purpose of the Tanagra project is to give researchers and students easy-to-use data mining software, conforming to the present norms of software development in this domain (especially in the design of its GUI and the way it is used), and allowing the analysis of either real or synthetic data.

The second purpose of TANAGRA is to offer researchers an architecture that allows them to easily add their own data mining methods and compare their performance. TANAGRA acts more as an experimental platform, letting them concentrate on the essentials of their work and sparing them the unpleasant part of programming this kind of tool: the data management.

The third and last purpose, aimed at novice developers, is to spread a possible methodology for building this kind of software. They can take advantage of free access to the source code to see how this sort of software is built, the problems to avoid, the main steps of the project, and which tools and code libraries to use. In this way, Tanagra can be considered a pedagogical tool for learning programming techniques.

Figure 3: Tanagra

1.3.5 Gephi

Gephi is an open-source network analysis and visualization software package written in Java on the NetBeans platform. It is designed for the interactive exploration and visualization of networks, facilitating the user's exploratory process through real-time analysis and visualization. Its visualization module uses a 3D render engine that runs on the computer's graphics card while leaving the CPU free for computing. Gephi is highly scalable (it can handle over 20,000 nodes) and is built on a multi-task model to take advantage of multi-core processors. It runs on Windows, Mac OS X and Linux.

Figure 4: Gephi

1.3.6 MOA

Massive Online Analysis (MOA) is a free open-source software project specific for data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand. MOA is an open-source framework that allows users to build and run machine learning or data mining experiments on evolving data streams. It includes a set of learners and stream generators that can be used from the graphical user interface (GUI), the command line, and the Java API. MOA contains several collections of machine learning algorithms and supports bi-directional interaction with Weka (machine learning). MOA is released under the GNU GPL.

MOA is a framework for data stream mining. It includes tools for evaluation and a collection of machine learning algorithms. Related to the WEKA project, it is also written in Java, while scaling to more demanding problems. The goal of MOA is a benchmark framework for running experiments in the data stream mining context by providing:

• storable settings for data streams (real and synthetic) for repeatable experiments,

• a set of existing algorithms and measures from the literature for comparison, and
• an easily extendable framework for new streams, algorithms and evaluation methods.

MOA currently supports stream classification, stream clustering, outlier detection, concept drift detection and recommender systems.

Figure 5: MOA

1.3.7 Orange

Orange is an open source data mining tool with very strong data visualization capabilities. It allows you to use a GUI (Orange Canvas) to drag and drop modules and connect them to evaluate and test various machine learning algorithms on your data. Getting started involves setting up Orange and becoming familiar with its GUI components, for example by exploring a sample data set with some of the visualization widgets included with Orange.

Orange is a component-based visual programming software package for data visualization, machine learning, data mining and data analysis. Orange components are called widgets, and they range from simple data visualization, subset selection and preprocessing to empirical evaluation of learning algorithms and predictive modeling.

Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library for data manipulation and widget alteration.
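As a minimal sketch of the library route, assuming Orange is installed and using the iris sample data set that ships with it:

    import Orange

    # Load a sample data set bundled with Orange
    data = Orange.data.Table("iris")

    # Train a classification tree on the data
    learner = Orange.classification.TreeLearner()
    model = learner(data)

    # Predict the class of the first instance
    print(model(data[0]))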

Figure 6: Orange tool

1.3.8 RapidMiner

Written in the Java programming language, this tool offers advanced analytics through template-based frameworks. A bonus: users hardly have to write any code. Offered as a service, rather than as a piece of local software, this tool holds a top position on lists of data mining tools.

In addition to data mining, RapidMiner also provides functionality like data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. What makes it even more powerful is that it provides learning schemes, models and algorithms from WEKA and R scripts.

RapidMiner, formerly known as YALE (Yet Another Learning Environment), was developed starting in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the Technical University of Dortmund. Starting in 2006, its development was driven by Rapid-I, a company founded by Ingo Mierswa and Ralf Klinkenberg in the same year. In 2007, the name of the software was changed from YALE to RapidMiner. In 2013, the company rebranded from Rapid-I to RapidMiner.

RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures.

According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code.

Figure 7: RapidMiner

1.3.9 ROOT

ROOT is an object-oriented, modular scientific software framework. It has a C/C++ interpreter (CINT) and a C/C++ compiler (ACLiC). ROOT is used extensively in high energy physics for data analysis: for reading and writing data files and for calculations to produce plots, numbers and fits. It provides all the functionality needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but is integrated with other languages such as Python and R. It can handle large files (gigabytes in size) containing N-tuples and histograms. It is multiplatform software, it is based on the widely known programming language C++, and it is free.
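As a small sketch of the Python integration, assuming ROOT is installed with its Python bindings (PyROOT), a histogram can be created and drawn as follows (the histogram parameters and output file name are arbitrary examples):

    import ROOT

    # Create a 1-D histogram: 100 bins over the range -4 to 4
    h = ROOT.TH1F("h", "Gaussian sample;x;entries", 100, -4, 4)

    # Fill it with 10,000 random values drawn from a Gaussian
    h.FillRandom("gaus", 10000)

    # Draw the histogram on a canvas and save it to an image file
    c = ROOT.TCanvas("c", "canvas")
    h.Draw()
    c.SaveAs("hist.png")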

The ROOT graphical framework provides support for many different functions, including basic graphics, high-level visualization techniques, output to files, 3D viewing, etc. It uses well-known world standards to render graphics on screen, to produce high-quality output files, and to generate images for web publishing. Many techniques allow visualization of all the basic ROOT data types, but the graphical framework is still a bit weak in the visualization of multi-variable data sets.

Figure 8: ROOT

1.3.10 Encog

Encog is a machine learning framework available for Java, .Net, and C++. Encog supports different learning algorithms such as Bayesian Networks, Hidden Markov Models and Support Vector Machines. However, its main strength lies in its neural network algorithms. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using many different techniques. Multithreading is used to allow optimal training performance on multicore machines. The C++ version of Encog can offload some processing to an OpenCL compatible GPU for further performance gains.

Encog is an advanced machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data. Machine learning algorithms such as Support Vector Machines, Neural Networks, Bayesian Networks, Hidden Markov Models, Genetic Programming and Genetic Algorithms are supported. Most Encog training algorithms are multi-threaded and scale well to multicore hardware. A GUI based workbench is also provided to help model and train machine learning algorithms. Encog has been in active development since 2008.

Figure 9: Encog

1.3.11 NodeXL

NodeXL Basic is a free and open-source network analysis and visualization software package for Microsoft Excel 2007/2010/2013/2016. NodeXL Pro is a fee-based, fully featured version of NodeXL that includes access to social media network data importers, advanced network metrics, and automation. It is a popular package similar to other network visualization tools such as Pajek, UCINet, and Gephi. NodeXL is intended for users with little or no programming experience, allowing them to collect, analyze, and visualize a variety of networks. NodeXL integrates into Microsoft Excel 2007, 2010, 2013 and 2016 and opens as a workbook with a variety of worksheets containing the elements of a graph structure, such as edges and nodes. NodeXL can also import a variety of graph formats such as edge lists, adjacency matrices, GraphML, UCINet .dl, and Pajek .net.

NodeXL workbooks contain four worksheets: Edges, Vertices, Groups, and Overall Metrics. The relevant data about entities in the graph and relationships between them are located in the appropriate worksheet in row format.

NodeXL contains a library of commonly used graph metrics: centrality, clustering coefficient, diameter. NodeXL differentiates between directed and undirected networks. NodeXL implements a variety of community detection algorithms to allow the user to automatically discover clusters in their social networks.

Figure 10: NodeXL

1.3.12 KNIME

KNIME, the Konstanz Information Miner, is an open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows the assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), modeling, data analysis and visualization.

The development of KNIME was started in January 2004 by a team of software engineers at the University of Konstanz as a proprietary product. The original developer team, headed by Michael Berthold, came from a company in Silicon Valley providing software for the pharmaceutical industry. The initial goal was to create a modular, highly scalable and open data processing platform which allowed for the easy integration of different data loading, processing, transformation, analysis and visual exploration modules without the focus on any particular application area. The platform was intended to be a collaboration and research platform and should also serve as an integration platform for various other data analysis projects.

KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, and interactive views. KNIME is written in Java and based on Eclipse, and it makes use of Eclipse's extension mechanism to add plugins providing additional functionality.

Figure 11: KNIME

1.3.13 ELKI

ELKI (Environment for DeveLoping KDD-Applications Supported by Index-Structures) is a knowledge discovery in databases (KDD, "data mining") software framework originally developed for use in research and teaching at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms belong to clustering, outlier detection and database indexes. A key concept of ELKI is to allow the combination of arbitrary algorithms, data types, distance functions and indexes and evaluate these combinations. When developing new algorithms or index structures, the existing components can be reused and combined.

ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and it welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms.

Figure 12: ELKI

1.3.14 Waffles

Waffles is a collection of command-line tools for performing machine learning operations developed at Brigham Young University. These tools are written in C++, and are available under the GNU Lesser General Public License.

The Waffles machine learning toolkit contains command-line tools for performing various operations related to machine learning, data mining, and predictive modeling. The primary focus of Waffles is to provide tools that are simple to use in scripted experiments or processes. For example, the supervised learning algorithms included in Waffles are all designed to support multi-dimensional labels, classification and regression, automatically impute missing values, and automatically apply necessary filters to transform the data to a type that the algorithm can support, such that arbitrary learning algorithms can be used with arbitrary data sets. Many other machine learning toolkits provide similar functionality, but require the user to explicitly configure data filters and transformations to make it compatible with a particular learning algorithm. The algorithms provided in Waffles also have the ability to automatically tune their own parameters.

Waffles command-line applications:

• waffles_audio contains tools for processing audio files.

• waffles_cluster contains tools for clustering.
• waffles_dimred contains tools for dimensionality reduction, attribute selection, etc.
• waffles_generate contains tools to sample distributions, sample manifolds, and generate certain types of data.
• waffles_learn contains tools for supervised learning.
• waffles_plot contains tools for visualizing data.
• waffles_recommend contains tools for collaborative filtering, recommendation systems, imputation, etc.
• waffles_sparse contains tools for learning from sparse data, document classification, etc.
• waffles_transform contains tools for manipulating data, shuffling rows, swapping columns, matrix operations, etc.

Summary

• Contributions through open source have now made data analysis more convenient and effective.
• For more tools, visit Wikipedia: https://en.wikipedia.org/wiki/List_of_statistical_packages