Large-Scale Statistics with MonetDB and R

Hannes Mühleisen

DAMDID 2015, 2015-10-13

About Me

• Postdoc in the Database Architectures group at CWI since 2012

• Amsterdam is nice. We have open positions.

• Special interest in data management for statistical analysis

• Various research & software projects in this space

Outline

• Column Store / MonetDB Introduction

• Connecting R and MonetDB

• Advanced Topics

• “R as a Query Language”

• “Capturing The Laws of Data Nature”

Column Stores / MonetDB Introduction

Postgres, Oracle, DB2, etc.:

Conceptual:

class         speed  flux
NX            1      3
Constitution  1      8
Galaxy        1      3
Defiant       1      6
Intrepid      1      1

Physical (on disk), row after row:

NX 1 3 | Constitution 1 8 | Galaxy 1 3 | Defiant 1 6 | Intrepid 1 1

Column Store:

Same conceptual table (class, speed, flux), but physically each column is stored contiguously (Compression!):

class: NX Constitution Galaxy Defiant Intrepid
speed: 1 1 1 1 1
flux:  3 8 3 6 1
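R itself leans columnar: a data.frame is a list of equal-length column vectors, so the physical picture above has a direct analogue in R. A small illustration using the example table (plain R, nothing MonetDB-specific):

# The conceptual table from the slide
ships <- data.frame(
  class = c("NX", "Constitution", "Galaxy", "Defiant", "Intrepid"),
  speed = c(1, 1, 1, 1, 1),
  flux  = c(3, 8, 3, 6, 1)
)

# Physically, each column lives in its own contiguous vector
ships$speed   # 1 1 1 1 1
ships$flux    # 3 8 3 6 1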

What is MonetDB?

• OLAP RDBMS (SQL) with a strictly columnar architecture

• Started by Martin Kersten and Peter Boncz ~1994

• Free & open source, active development ongoing

• www.monetdb.org

Peter A. Boncz, Martin L. Kersten, and Stefan Manegold. 2008. Breaking the memory wall in MonetDB. Communications of the ACM 51, 12 (December 2008), 77-85. DOI=10.1145/1409360.1409380

MonetDB today

• Expanded code base

• MAL “DB assembly” & optimisers

• SQL to MAL compiler

• Memory-Mapped files

• Automatic indexing

Some MAL

• Optimisers run on MAL code

• Efficient Column-at-a-time implementations

EXPLAIN SELECT * FROM mtcars;

X_2 := sql.mvc();
X_3:bat[:oid,:oid] := sql.tid(X_2,"sys","mtcars");
X_6:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",0);
(X_9,r1_9) := sql.bind(X_2,"sys","mtcars","mpg",2);
X_12:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","mpg",1);
X_14 := sql.delta(X_6,X_9,r1_9,X_12);
X_15 := algebra.leftfetchjoin(X_3,X_14);
X_16:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",0);
(X_18,r1_18) := sql.bind(X_2,"sys","mtcars","cyl",2);
X_20:bat[:oid,:dbl] := sql.bind(X_2,"sys","mtcars","cyl",1);
X_21 := sql.delta(X_16,X_18,r1_18,X_20);
X_22 := algebra.leftfetchjoin(X_3,X_21);

“Invisible JOIN” Performance...

[Plot: TPC-H SF-100, hot runs; average time per query (s, log scale) for MonetDB vs. PostgreSQL]

But statistics with SQL?

Integrate, not Reinvent

[Diagram: Statistical Toolkits (flexibility) versus Data Management Systems (efficiency), with a question mark between them]

The analysis workflow: Collect data → Load data → Filter, transform & aggregate data → Analyze & plot data → Publish paper

The collected data keeps growing, and loading, filtering, transforming and aggregating are not really analysis features.

A statistical toolkit covers the analyze & plot step; a data management system covers loading, filtering, transforming and aggregating the data.

Bridge the Gap

• Connect R to the DB: native operators, lazy evaluation

• Embed the DB in R (or R in the DB): cheap data transfer

Previous Work

• MonetDB.R connector: on CRAN since 2013

• Embedded R in MonetDB: part of MonetDB since 2014

• MonetDBLite for R: preview release available

Also…

• Embedded Python/NumPy: in the next MonetDB release

MonetDB.R connector

Hannes Mühleisen and Thomas Lumley: Best of Both Worlds: Relational Databases and Statistics. 25th International Conference on Scientific and Statistical Database Management (SSDBM 2013)

DBI

• DBI is for R what JDBC is for Java

• Low-level interface to talk to SQL databases

• Drivers available for most relational databases

• Typically socket connection between R and DB

df <- dbGetQuery(con, "SELECT * FROM table")
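A minimal end-to-end sketch of DBI use with the MonetDB.R driver; the running server, the demo database name, port 50000 and the default credentials are assumptions for illustration, not part of the slides:

library(DBI)
library(MonetDB.R)

# Socket connection to a running MonetDB server
con <- dbConnect(MonetDB.R(), dbname = "demo", host = "localhost",
                 port = 50000, user = "monetdb", password = "monetdb")

dbWriteTable(con, "mtcars", mtcars)                          # ingest an R data.frame
df <- dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 8")  # result comes back as a data.frame
dbDisconnect(con)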

DBI

• Works, but (generally)

• Serialising/Unserialising large datasets is slow

• Data ingest is slow

• SQL knowledge required

dplyr

• Data reorganisation package in “Hadleyverse”

• Works with data.frame, data.table, SQL DBs

• Maps relational operations (selection, projection, join, grouping etc.) to native R operators

• Lazy evaluation, call chaining

• MonetDB.R includes a dplyr compatibility layer

dplyr

In R:

ni <- select(n, first_name, last_name, race_desc, sex, birth_age)
ow <- filter(ni, as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE")
print(ow)

Generated:

SELECT "first_name" AS "first_name", "last_name" AS "last_name",
       "race_desc" AS "race_desc", "sex" AS "sex", "birth_age" AS "birth_age"
FROM "ncvoter"
WHERE CAST("birth_age" AS INTEGER) > 66.0
  AND "sex" = 'MALE' AND "race_desc" = 'WHITE'
LIMIT 10
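For completeness, a sketch of how the table object n above could be set up through the dplyr compatibility layer in MonetDB.R; the database name is hypothetical, and src_monetdb()/tbl()/collect() are used as in the package documentation of the time:

library(dplyr)
library(MonetDB.R)

ms <- src_monetdb(dbname = "ncvoters")   # hypothetical database holding ncvoter
n  <- tbl(ms, "ncvoter")                 # wraps the server-side table, no data transfer yet

ni <- select(n, first_name, last_name, race_desc, sex, birth_age)
ow <- filter(ni, as.integer(birth_age) > 66, sex == "MALE", race_desc == "WHITE")
collect(ow)                              # only now is the filtered result pulled into R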

dplyr

• Better, but

• Most (All) R packages cannot work with dplyr tables, so at some point data needs to be transferred.

• What if this dataset is large?

Embedded R in MonetDB

Relationally Integrated

Statistical analysis as operators in relational queries

[Diagram: relational query plan with σ, π and ⨝ nodes and a statistical operator plugged in]

Table-producing

CREATE FUNCTION rapi01(i INTEGER) RETURNS TABLE (i INTEGER, d DOUBLE) LANGUAGE R { data.frame(i=seq(1,i),d=42.0) };

SELECT i,d FROM rapi01(42) AS r WHERE i>40;

π Transformations

CREATE FUNCTION rapi02 (i INTEGER, j INTEGER, z INTEGER) RETURNS INTEGER LANGUAGE R { i*sum(j)*z };

SELECT rapi02(i,j,2) AS r02 FROM rval;

σ Filtering

CREATE FUNCTION rapi03(i INTEGER, z INTEGER) RETURNS BOOLEAN LANGUAGE R { i>z };

SELECT * FROM rval WHERE rapi03(i,2);

Aggregation

CREATE AGGREGATE kmeans(data FLOAT, ncluster INTEGER) RETURNS INTEGER LANGUAGE R { kmeans(data,ncluster)$cluster };

SELECT cluster FROM (SELECT MIN(x) AS minx, MAX(x) AS maxx, kmeans(x,5) AS cluster FROM xdata GROUP BY cluster) as cdata ORDER BY cluster;
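Since these functions live in the database, they are reachable from R through plain DBI as well; a sketch assuming the connection con from before and an existing xdata table:

# Run the k-means aggregate server-side; only the aggregated result is transferred
clusters <- dbGetQuery(con,
  "SELECT cluster FROM (
     SELECT MIN(x) AS minx, MAX(x) AS maxx, kmeans(x,5) AS cluster
     FROM xdata GROUP BY cluster) AS cdata
   ORDER BY cluster")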

Performance…

[Plot: time (s, 0 to 40) vs. number of rows (1 K to 100 M, log scale), comparing MonetDB with embedded R against R-col, R-full, RInt, PL/R-naive and PL/R-tuned]

Code Shipping

> rf.fit <- randomForest(income~., data=training, mtry=2, ntree=10)

> predictions <- mdbapply(con, "t1", function(d) {
    p <- predict(rf.fit, type="prob", newdata=d)[,2]
    p[p > .9]
  })

MonetDB.R 1.0.0, soon
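A slightly fuller sketch of the code-shipping round trip; the connection con, the training_sample and t1 tables, and an income factor column are assumptions for illustration (randomForest must be installed wherever the function runs):

library(DBI)
library(randomForest)

# Train locally on a small sample pulled from the database
training <- dbGetQuery(con, "SELECT * FROM training_sample")
training$income <- as.factor(training$income)   # classification target for predict(type="prob")
rf.fit <- randomForest(income ~ ., data = training, mtry = 2, ntree = 10)

# Ship the scoring function (and, via its environment, the fitted model) to the
# server; it runs next to the data in table t1 instead of moving t1 into R
predictions <- mdbapply(con, "t1", function(d) {
  p <- predict(rf.fit, type = "prob", newdata = d)[, 2]
  p[p > .9]
})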

MonetDBLite

• Socket serialization/deserialization for client/server protocol is slow for large result sets.

• Too slow for many machine learning problems!

• Running a database server is cumbersome and overkill for a single R client

• Solution: Run entire database inside the R process

• Ingest data and query results are only copied around in memory: fast

• Same interface as MonetDB.R: DBI and dplyr (usage sketch below)
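A minimal usage sketch of the embedded setup; the MonetDBLite() driver function and the shutdown argument follow the later released MonetDBLite package, while the preview bundled with MonetDB.R exposed a very similar dbConnect-based interface:

library(DBI)

dbdir <- file.path(tempdir(), "mdblite")        # the database lives in a local directory
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)

dbWriteTable(con, "mtcars", mtcars)             # no socket, no serialisation
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con, shutdown = TRUE)              # shut the embedded server down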

https://goo.gl/jelaOy

Quick Benchmark

lineitem table with 10M rows, SELECT * FROM lineitem

[Bar chart: result transfer time, old MAPI socket connector: 17.2 s, MonetDBLite: 0.4 s]

Zero-Copy

Jonathan Lajus and Hannes Mühleisen: Efficient Data Management and Statistics with Zero-Copy Integration 26th International Conference on Scientific and Statistical Database Management (SSDBM2014)

R SEXP Array

[Diagram: an R SEXP is a header followed by the value array (42 43 44 …); a MonetDB BAT is a descriptor whose head and tail column references point to the same kind of plain arrays]

Dress-up

[Diagram: zero-copy dress-up, the BAT's tail column descriptor points straight at the data of an R SEXP (header followed by 42 43 44 …), so both systems share one array]

+ Garbage Collection Fun

Advanced Topics

R as a Query Language

Hannes Mühleisen, Alex Bertram and Maarten-Jan Kallen: Relational Optimizations for Statistical Analysis, Journal of Statistical Software (under review)

What is Renjin?

• R on the JVM

• Compatibility is paramount, not just an academic exercise (e.g. automatic Fortran/C translations)

• R anywhere on any data format (e.g. Cloud environments)

• Increased performance through lazy evaluation, parallel execution, …

• Easy to plug any Java code into an R analysis, easy to plug Renjin into Java projects

Abstraction in Renjin

> a <- 1:10^9
> a[1000000] <- NA #harr harr

GNU R:

> system.time(print(anyNA(a)))[[3]]
[1] TRUE
[1] 0.001
> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 2.23

Renjin:

> system.time(print(any(is.na(a))))[[3]]
[1] TRUE
[1] 0.05

“R as a query language”

• Observation 1: Lots of data wrangling happening in R scripts: subset(), [, $, merge(), aggregate(), dplyr::, data.table::

“R as a query language”

• Observation 2: Things get slow quickly as vectors get longer

• Lots of optimisation opportunities, but how?

• State of the art: Tactical optimisation/Band aids

“R as a query language”

• Proposal: Treat R scripts as declaration of intent (not as a procedural contract written in blood)

• Then we can optimise strategically!

Rule-based query optimisation

Optimisations

• Selection Pushdown

• Data-parallel scheduling

• Function specialisation/vectorisation

• Common expression elimination/caching

• Redundant computation elimination

Static analysis?

Deferred Evaluation

a <- 1:1000
b <- a + 42
c <- b[1:10]
d <- min(c) / max(c)
print(d)

[Expression graph: / at the root over min and max, which read the subset [ of (a + 42)]

Pushdown

b <- factorial(a)
c <- b[1:10]
print(c)

[Expression graphs: without pushdown, factorial runs over all n=1000 elements of a before the subset [ takes n=10; with pushdown the subset is applied first and factorial runs over only n=10 elements]
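The rewrite is valid because, for an element-wise function, subsetting commutes with the call; a plain-R illustration of the equivalence the optimiser exploits (smaller vector than on the slide to keep factorial() finite):

a <- 1:100

slow <- factorial(a)[1:10]     # compute over all 100 elements, then subset
fast <- factorial(a[1:10])     # subset first, compute over only 10 elements

stopifnot(identical(slow, fast))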

Pushdown

[Plot: execution time (s, 0 to 6) vs. dataset size (10^6 to 10^8 elements, log scale); GNU R grows with the data size, Renjin with pushdown stays near zero]

Recycling

for (i in 1:100) print((a[i] - min(a))/(max(a)-min(a)))

[Expression graphs for one loop iteration: the subtrees for min(a), max(a) and their difference are the same in every iteration; with recycling their results are cached, so only a[i] minus the cached minimum and the final division are recomputed]
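Recycling automates the loop-invariant hoisting a careful author would do by hand; a hand-optimised equivalent of the loop above in plain R (a is just example data here):

a <- runif(1000)

mn  <- min(a)                  # computed once instead of once per iteration
rng <- max(a) - mn

for (i in 1:100) print((a[i] - mn) / rng)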

[Plot: recycling performance, execution time (s, 0 to 60) vs. dataset size (10^6 to 10^8 elements, log scale); plain Renjin is slowest, GNU R sits in between, Renjin with recycling is fastest]

svymean

agep <- svymean(~agep, svydsgn, se=TRUE)

for(i in 1:ncol(wts)) {
  repmeans[i,] <- t(colSums(wts[,i]*x*pw) / sum(pw*wts[,i]))
}
[…]
v <- crossprod(sweep(thetas, 2, meantheta, "-") * sqrt(rscales)) * scale

[Expression graphs for the svymean computation over 47512 observations and five replicate-weight columns wts[,1] to wts[,5]: without recycling, every column gets its own colSums and sum subtrees feeding the divisions and the final crossprod; with recycling the shared colSums and sum results are cached and reused]

svymean

[Plot: execution time (s, up to about 100) vs. dataset size (47512, 1060060 and 9093077 elements, log scale); Renjin without optimisations is slowest, followed by GNU R and single-threaded Renjin, fully optimised Renjin is fastest]

Capturing the Laws of Data Nature

Hannes Mühleisen, Martin Kersten and Stefan Manegold: Capturing the Laws of (Data) Nature, 7th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2015

Statistical Models?

• Everyone has models, they encode our understanding of the world

• Everyone has data to train/fit and validate a model

• So far, the data management community has ignored these models

• But they hold precious domain knowledge!

[Diagrams: the measurements plus a model configuration feed a grouped by-source fitting operation; aided by convergence hints, it produces the fitted parameters for every source]

[Scatter plot: intensity (Jy, roughly 2.0 to 3.5) vs. frequency (GHz, 0.10 to 0.20) for source 17562; fitted power law with alpha = -0.692, p = 0.812]

Model to function conversion (automatic)

Move to DB (automatic)

Approximate Answer with zero I/O*
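A hedged sketch of the whole idea, reusing the embedded-R UDF mechanism shown earlier; the obs data frame of (freq, intensity) measurements, the connection con and the dbSendUpdate helper from MonetDB.R are assumptions for illustration:

# 1. Fit the power-law model I ~ p * nu^alpha for one source in R
fit <- nls(intensity ~ p * freq^alpha, data = obs,
           start = list(p = 1, alpha = -0.5))
co <- coef(fit)

# 2. Convert the fitted model into a database function (embedded R UDF)
dbSendUpdate(con, sprintf(
  "CREATE FUNCTION intensity_model(freq DOUBLE) RETURNS DOUBLE
   LANGUAGE R { %f * freq ^ %f };", co["p"], co["alpha"]))

# 3. Approximate answer without touching the raw measurements
dbGetQuery(con, "SELECT intensity_model(0.14);")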

Integrate & Intercept

• Integrate model fitting infrastructure into data management system.

• Also: Huge performance benefits for analysts!

• Intercept model fitting and validation operations by the user and store the model for later use.

• Storage format: Model code + Parameters

69 (1) (2)

I p ⌫↵ ? I p ⌫↵ ? ⇡ · S ⌫ I ⇡ · S ⌫ I

R2 =0.92 ! R2 =0.92 !

(3)

(4)

S = 42,⌫ =0.14,I =? S p ↵

I =3.0 0.05 ! ± I p ⌫↵ ⇡ · (5)

But…

• What do we do if model parameters are not specified in the query?

• Sample data?

• Given multiple parameters, it is far from certain that all combinations of values are allowed in the model.

• Construct filter?

Data & Model Changes

• What should we do if the user gives us a better model?

• Recompressing could be very expensive

• Threshold for improvement?

• Changes in the data affect the model quality, too

• Switch models?

• Constant Monitoring?

Multiple, partial or grouped

• There could be many models for a table with overlapping parameters

• Which one to pick?

• Models do not have to cover the entire table/column

• “Patching”?

• Models could be fitted on aggregation results

• Keep group counts?

Thank You

Questions?

http://hannes.muehleisen.org

@hfmuehleisen