Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]

Data Science Tools

Overview
- In-memory analytics
- Python and R
- Visualization
- The road to big data
- Notebooks and development environments
- Labeling
- File formats
- Packaging and versioning systems
- Model deployment

In-memory Analytics

(Figure: the big data and AI landscape, https://mattturck.com/data2020/)

Many tools and vendors:
- For experimentation and model development: hyperparameter tuning, logging, autoML
- For visualization
- For labeling
- For data: Hadoop, Spark, streaming data, feature stores, cloud data warehouses, different storage formats
- For deployment: different environments, pipelines
- For monitoring and maintenance

Heard about Hadoop? Spark? H2O?
- Many vendors offer their own "big data and analytics" stack: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC²
- "My data lake versus yours"
- There's always "roll your own"
- Open source, or walled garden?
- Support, features, speed of upgrades?
- The situation has stabilized a bit (i.e. the champions have settled), but does it matter?

Two sides emerge

Infrastructure:
- Big Data
- Integration
- Architecture
- NoSQL and NewSQL
- Streaming
- AI and ML ops

Analytics:
- Data Science
- Machine Learning
- AI
- NLP
- But also still: BI and Visualization

There's a difference

In-memory analytics
- Your data set fits in memory: the assumption of many tools (SAS, SPSS, MATLAB, R, Python, Julia)
- Is this really a problem? Servers with 512 GB of RAM have become relatively cheap: cheaper than an HDFS cluster (especially in today's cloud environment)
- Implementation makes a difference (the representation of the data set in memory)
- If your task is unsupervised or supervised modeling, you can apply sampling
- Some algorithms can work in online / batch mode

Python and R

The "big two" in modern data science: Python and R
- Both have their advantages
- Others are interesting too (e.g. Julia), but still less adopted
- Vendors such as SAS and SPSS remain as well, but bleeding-edge algorithms and techniques are found in open source first
- This is not (really) due to the languages themselves: the language is just an interface
- It is thanks to their huge ecosystems: many packages for data science are available ("Python is the second best language for everything")
- Add-on packages/libraries typically aim to:
  - Work with higher-order arrays (tensors) and apply operations to them, typically with broadcasting support; inspired by early array languages such as Ada, APL, FORTRAN, ...
  - This then typically forms the basis for data frames and functions to "wrangle" them: e.g. a 2-dimensional matrix where each column can have a different type, together with sort/filter/aggregation functions
  - These in turn are used to construct feature matrices to perform un/supervised learning on, using the predictive techniques we've seen earlier
  - As well as to plot and visualize results

Analytics with R
- Native concept of a "data frame": a table in which each column contains measurements on one variable, and each row contains one case
- Unlike a matrix, the data you store in the columns of a data frame can be of various types: one column might be a numeric variable, another might be a factor, and a third might be a character variable
- All columns have to be the same length (contain the same number of data items), although some of those data items may be missing values; a small sketch follows below
- Fun read: "Is a Dataframe Just a Table?", Yifan Wu, 2019
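A minimal base R sketch of these properties (the column names and values are made up for illustration): three columns of different types, all of the same length, with a missing value in one of them.

# A data frame mixes column types; every column has the same length
patients <- data.frame(
  age   = c(34, 51, NA),                                   # numeric, with a missing value
  group = factor(c("treatment", "control", "treatment")),  # factor
  name  = c("Anna", "Ben", "Chris")                        # character
)

str(patients)   # one type per column
nrow(patients)  # all columns share this length: 3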
Analytics with R

Hadley Wickham: Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University

Data Science: the "tidyverse" (https://www.tidyverse.org/)
- ggplot2 for visualizing data
- dplyr for manipulating data
- tidyr for tidying data
- stringr for working with strings
- lubridate for working with date/times

Data Import
- readr for reading .csv and fwf files
- readxl for reading .xls and .xlsx files
- haven for SAS, SPSS, and Stata files (also: the foreign package)
- httr for talking to web APIs
- rvest for scraping websites
- xml2 for importing XML files

Modern R

Learning R today? Make sure to use "modern R" principles
- tidyverse should be the first package you install, especially thanks to dplyr, tidyr, stringr, and lubridate
- dplyr implements a verb-based data manipulation language
- It works on normal data frames but can also work with database connections (already a simple way to solve the mid-to-big-sized data issue)
- Verbs can be piped together, similar to a Unix pipe:

flights %>%
  select(year, month, day) %>%
  arrange(desc(year)) %>%
  head

Modern R

# Summarize the flights per aircraft (tail number)
delay <- flights %>%
  group_by(tailnum) %>%
  summarise(
    count = n(),
    dist  = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )

# Plot distance versus average delay, sized by number of flights
delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

Also see: https://www.rstudio.com/resources/cheatsheets/

Modeling with R

Virtually any unsupervised or supervised algorithm is implemented in R as a package
- The caret package (short for Classification And REgression Training) is a set of functions that attempts to streamline the process of creating predictive models. The package contains tools for:
  - Data splitting
  - Pre-processing
  - Feature selection
  - Model tuning using resampling
  - Variable importance estimation
- caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface
- You can just use the original package as well if you know what you want
- Still widely used

Modeling with R

require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings = c("NA", ""))
test     <- read.csv("test.csv",  na.strings = c("NA", ""))

# Invoke caret with random forest and 5-fold cross-validation;
# other parameters (like ntree) are passed through to randomForest
rf_model <- train(TARGET ~ ., data = training,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 500)

print(rf_model)

## Random Forest
##
## 5889 samples
##   53 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

Modeling with R

print(rf_model$finalModel)

## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##               allowParallel = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##
##         OOB estimate of error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
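Not shown on the slides: once trained, the caret object can also score new data. A minimal sketch, reusing the rf_model and test objects from above (and assuming test.csv contains the same predictor columns as train.csv):

# Hard class predictions for the held-out set
preds <- predict(rf_model, newdata = test)
head(preds)

# Class probabilities instead of hard labels
probs <- predict(rf_model, newdata = test, type = "prob")
head(probs)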
Modeling with R

The mlr package is an alternative to caret
- R does not define a standardized interface for all its machine learning algorithms; mlr provides infrastructure so that you can focus on your experiments
- The framework provides supervised methods such as classification, regression, and survival analysis, along with their corresponding evaluation and optimization methods, as well as unsupervised methods such as clustering
- The package is connected to the OpenML R package and its online platform, which aims at supporting collaborative machine learning online and makes it easy to share datasets as well as machine learning tasks, algorithms, and experiments, in order to support reproducible research
- mlr3 (https://mlr3.mlr-org.com/) is the newer incarnation, and is gaining uptake

Modeling with R

library(mlr3)
set.seed(1)

# Define a classification task on the iris data
task_iris = TaskClassif$new(id = "iris", backend = iris, target = "Species")
learner = lrn("classif.rpart", cp = 0.01)

# 80/20 train/test split
train_set = sample(task_iris$nrow, 0.8 * task_iris$nrow)
test_set = setdiff(seq_len(task_iris$nrow), train_set)

# Train the model
learner$train(task_iris, row_ids = train_set)

# Predict on the held-out rows
prediction = learner$predict(task_iris, row_ids = test_set)

# Calculate performance
prediction$confusion

##             truth
## response     setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         1
##   virginica       0          0         6

measure = msr("classif.acc")
prediction$score(measure)

## classif.acc
##   0.9666667

Modeling with R

The modelr package provides functions that help you create elegant pipelines when modelling
- By Hadley Wickham
- Mainly for simple regression models
- More information: http://r4ds.had.co.nz/ (the modern R approach; starts simple, with linear and visual models; a good introduction)

Visualizations with R

ggplot2 reigns supreme
- By Hadley Wickham
- Uses a "grammar of graphics" approach: a grammar of graphics is a tool that enables us to concisely describe the components of a graphic, an abstraction which makes thinking about, reasoning about, and communicating graphics easier
- Such a grammar allows us to move beyond named graphics (e.g., the "scatterplot") and gain insight into the deep structure that underlies statistical graphics
- Original idea: Wilkinson (2006)

ggvis: based on ggplot2 and built on top of vega (a visualization grammar: a declarative format for creating, saving, and sharing interactive visualization designs)
- Also declaratively describes data graphics
- Different render targets
- Interactivity: interact in the browser, on a phone, ...

Visualizations with R

shiny: a web application framework for R
- Construct interactive dashboards (a minimal sketch follows at the end of this section)

Other packages worth noting
- janitor: tools for cleaning data
- stringr: work with text
- lubridate: work with times and dates
- ROCR: make ROC curves and other performance visualizations for scoring classifiers (see the sketch below)
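The promised shiny sketch: a minimal, self-contained app (the widget names and the plot are made up for illustration) showing the two shiny building blocks, a UI definition and a server function:

library(shiny)

# UI: a slider and a plot placeholder
ui <- fluidPage(
  titlePanel("Minimal shiny example"),
  sliderInput("n", "Number of observations:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: re-renders the plot whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "random draws"))
  })
}

shinyApp(ui = ui, server = server)

And a minimal ROCR sketch, using the ROCR.simple example data that ships with the package:

library(ROCR)

data(ROCR.simple)  # example scores and true labels

pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred, "tpr", "fpr")  # ROC curve: true versus false positive rate
plot(perf, colorize = TRUE)

# Area under the ROC curve
performance(pred, "auc")@y.values[[1]]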