Copyright © 2011, Oracle and/or its affiliates. All rights reserved.
In-Database Analytics: Statistics and Advanced Analytics with R
Oracle R Enterprise
Charlie Berger
Sr. Director Product Management, Data Mining and Advanced Analytics
Oracle Corporation
[email protected]
www.twitter.com/CharlieDataMine
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
Agenda
• Big Data & Big Data Analytics
• R Open Source Project
  – Challenges limiting enterprise adoption of R
• Oracle R Enterprise (New)
  – Features, benefits and advantages
• Big Data Appliance (New)
  – Open source distribution of R
• Oracle R Enterprise Beta Program
• Q & A
What Makes it Big Data?
[Figure: data sources (social, blog, smart meter) described by volume, velocity, variety and value]
Big Data in Action
Make Better Decisions Using Big Data
[Figure: cycle of Acquire, Organize, Analyze, Decide]
Announcing Oracle R Enterprise
New
R Statistical Programming Language
Open source language and environment
Used for statistical computing and graphics
Strength in easily producing publication-quality plots
Highly extensible with open source community R packages
Driven in part by the rise of big data, business intelligence (BI) is a rapidly growing market that has seen increasingly strong enterprise adoption rates. Concurrent with the growth of BI has been increased investment in predictive analytics; R is not only the tool of choice but the ideal environment for advanced analysis. R is designed to be extensible and to integrate within BI suites to incorporate advanced analytics into reports.
Source: “Hype Cycle for Analytic Applications, 2011”, 30 August 2011, http://www.gartner.com/technology/core/products/research/topics/businessIntelligence.jsp
Chart: the number of web site links that point to the main web site of each software package on March 19, 2011. http://www.r4stats.com/popularity
Growing Popularity
• R’s rapid adoption over several years has earned its reputation as a new statistical software standard
  – Rival to SAS and SPSS
“While it is difficult to calculate exactly how many people use R, those most familiar with the software estimate that close to 250,000 people work with it regularly.”
“Data Analysts Captivated by R’s Power”, New York Times, Jan 6, 2009
http://www.r-project.org/
Typical R Approach
Statistical and advanced analyses are run and stored on the user’s laptop
What Are R’s Challenges?
1. R is memory constrained
   – R processing is single threaded and does not exploit available compute infrastructure
   – R lacks industrial strength for enterprise use cases
2. R has lacked mindshare in the Enterprise market
   – R is still met with caution by the long established SAS and IBM/SPSS statistical community
     • However, major university (e.g. Yale) Statistics courses are now taught in R
     • The FDA has recently shown indications for approval of new drugs for which the submission’s data analysis was performed using R
Oracle R Enterprise Approach
Data and statistical analysis are stored and run in-database
Same R user experience & same R clients
Embed in operational systems
Complements Oracle Data Mining
What is Oracle R Enterprise?
• Oracle R Enterprise brings R’s statistical functionality closer to the Oracle Database
1. Eliminate R’s memory constraint by enabling R to work directly & transparently on database objects
   – Allows R to run on very large data sets
2. Architected for Enterprise production infrastructure
   – Automatically exploits database parallelism without requiring parallel R programming
   – Build and immediately deploy
3. Oracle R leverages the latest R algorithms and packages
   – R is an embedded component of the DBMS server
Architecture and Performance
• Transparently function-ships R constructs to database via R-to-SQL translation
  – Data structures
  – Functions
    • Data manipulation functions (select, project, join)
    • Basic statistical functions (avg, sum, summary)
    • Advanced statistical functions (gamma, beta)
• Performs data-heavy computations in database
  – R for summary analysis and graphics
• Transparent implementation enables using wide range of R “packages” from open source community
Oracle R Enterprise Architecture
Use Case: Using ONTIME airline data, run a box-plot analysis of the 36 busiest airports to find the best and worst airports for arrival delay.
R workspace console
Function push-down – data transformation & statistics
Oracle statistics engine
OBIEE, Web Services
Development, Production, Consumption
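The use case above can be sketched in plain R. This is only an illustration on synthetic data and does not call Oracle R Enterprise: the airport codes and the column name DEST are assumptions, while ARRDELAY is the arrival-delay column used in the ARIMA script later in this deck.

```r
# Synthetic stand-in for the ONTIME table; DEST and the airport codes are
# assumed names used only for this illustration.
set.seed(42)
ontime <- data.frame(
  DEST     = rep(c("AAA", "BBB", "CCC"), each = 200),
  ARRDELAY = c(rnorm(200, mean = 5,  sd = 10),
               rnorm(200, mean = 15, sd = 20),
               rnorm(200, mean = -2, sd = 8))
)

# Median arrival delay per airport, ordered from best (lowest) to worst.
median_delay <- tapply(ontime$ARRDELAY, ontime$DEST, median)
median_delay <- median_delay[order(median_delay)]

# Reorder the airports so the box plot reads best to worst, left to right.
ontime$DEST <- factor(ontime$DEST, levels = names(median_delay))
bp <- boxplot(ARRDELAY ~ DEST, data = ontime,
              xlab = "Airport", ylab = "Arrival delay (minutes)")
```

Sorting the factor levels by median before plotting is a design choice: it makes the best and worst airports sit at the two ends of the chart.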
Oracle R Enterprise for Statistical Development
R and Oracle R
All R commands, graphics, packages are identical to the user in both R and Oracle R Enterprise
Oracle R OCI package makes Oracle tables/views visible to R
Data can be in R data frames or Oracle Tables/Views
Benefits
"R for the Enterprise"
• Oracle R Enterprise enables you to:
  – Run R to interactively explore and analyze data inside the Database
  – Develop R scripts on big data stored as tables and views inside the Oracle database and then deploy them within the enterprise, without requiring code changes
  – Leverage R’s familiar R console and open source R GUIs and IDEs to explore and analyze data either in the database or stored as R data frames
  – Meet the statistical and advanced analytical requirements of the enterprise
  – Exploit an information technology platform designed to support analytically-driven applications
  – Leverage 30+ years of experience of ever advancing Oracle Database technology
How Oracle R Enterprise Works
ORE Computation Engines
• Oracle R Enterprise tightly integrates R with the database and fully manages the data operated upon by R code.
  – The database is always involved in serving up data to the R code.
  – Oracle R Enterprise runs in the Oracle Database.
• Oracle R Enterprise eliminates data movement and duplication, maintains security and minimizes latency from raw data to new information.
• Three ORE Computation Engines
  – Oracle R Enterprise provides three different interfaces between the open-source R engine and the Oracle database:
    1. Oracle R Enterprise (ORE) Transparency Layer
    2. In-Database Statistics Engine
    3. Embedded R
How Oracle R Enterprise Works
ORE Computation Engines
1. Oracle R Enterprise (ORE) Transparency Layer
   – Traps all R commands and scripts prior to execution and looks for opportunities to function-ship them to the database for native execution
   – ORE transparency layer converts R commands/scripts into SQL equivalents and thereby leverages the database as a compute engine.
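To make "converts R commands/scripts into SQL equivalents" concrete, here is a toy sketch of the idea. It is not how Oracle R Enterprise is implemented: to_sql_aggregate is a hypothetical helper written for this illustration, and it only builds SQL text for a single grouped aggregate. The table and column names (ONTIME_S, ARRDELAY, DAYOFMONTH) are taken from the ARIMA script later in this deck.

```r
# Toy illustration only: build the SQL text that an R-style "aggregate by
# group" request could be shipped as. to_sql_aggregate is hypothetical and
# not part of Oracle R Enterprise.
to_sql_aggregate <- function(table, value, by,
                             fun = c("AVG", "SUM", "MIN", "MAX")) {
  fun <- match.arg(fun)
  sprintf("SELECT %s, %s(%s) AS %s_%s FROM %s GROUP BY %s",
          by, fun, value, tolower(fun), tolower(value), table, by)
}

sql <- to_sql_aggregate("ONTIME_S", value = "ARRDELAY",
                        by = "DAYOFMONTH", fun = "AVG")
cat(sql, "\n")

# The same request evaluated locally in R on a tiny in-memory stand-in,
# which is what a user would write without thinking about SQL at all.
small <- data.frame(DAYOFMONTH = c(1, 1, 2, 2), ARRDELAY = c(10, 20, 0, 6))
local_result <- tapply(small$ARRDELAY, small$DAYOFMONTH, mean)
```

The point of the comparison is that the R expression and the SQL statement describe the same computation; only the place where the rows are processed differs.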
How Oracle R Enterprise Works
ORE Computation Engines
2. In-Database Statistics Engine
   – Significantly extends the Oracle Database’s library of statistical functions and advanced analytical computations
   – Provides support for the complete R language and statistical functions found in Base R and selected R packages based on customer usage
     • Open source packages - written entirely in R language with only the functions for which we have implemented SQL counterparts - can be translated to execute in database.
   – Without anything visibly different to the R users, their R commands and scripts are oftentimes accelerated by a factor of 10-100x
   – Base SAS and most common SAS PROC "knock-offs"
Covered functionality (driven by customers): All Base R functions, R Multiple Regression, ….
Base SAS PROCS: PROC FREQ, PROC MEANS, PROC RANK, PROC STANDARD, PROC SUMMARY, PROC UNIVARIATE, PROC APPEND, PROC SORT, PROC TRANSPOSE, PROC SQL, PROC CORR
How Oracle R Enterprise Works
ORE Computation Engines
3. Embedded R
   – For R functions not able to be mapped to native in-database functions, Oracle R Enterprise makes “extproc” remote procedure calls to multiple R engines running on multiple database servers/nodes
   – This Oracle R Enterprise embedded layer uses the database as a data provider, providing data level parallelism to R code
   – The interfaces, called embedded-layer RQ functions, pass streams of data to one or more instances of R for (parallel) row by row processing (scoring), groups of rows processing (building one model per group) and table of rows processing (building a model)
   – These functions are used for “operationalizing” R code to run in production
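The three processing styles named above (rows, groups of rows, whole table) can be pictured with a local, single-process analogy in base R. Nothing below calls Oracle R Enterprise or its RQ functions; it uses the built-in mtcars data, and the chunk size of 8 rows is an arbitrary assumption.

```r
# Table processing: build one model on the whole table.
table_model <- lm(mpg ~ wt, data = mtcars)

# Group processing: build one model per group (here, per cylinder count).
group_models <- lapply(split(mtcars, mtcars$cyl),
                       function(rows) lm(mpg ~ wt, data = rows))

# Row processing (scoring): apply the table model to independent chunks of
# rows. Because no chunk depends on another, the chunks could be scored by
# separate R engines in parallel; here they are scored one after another.
chunk_id <- ceiling(seq_len(nrow(mtcars)) / 8)   # 4 contiguous chunks of 8 rows
chunks   <- split(mtcars, chunk_id)
scores   <- unlist(lapply(chunks,
                          function(rows) predict(table_model, newdata = rows)),
                   use.names = FALSE)
```

Scoring in chunks gives the same predictions as scoring the whole table at once, which is why the row-by-row style parallelizes cleanly.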
Oracle R Enterprise Example
Illustrates Use of all 3 Engines from within 1 R Script
Oracle: Hardware and Software Engineered to Work Together
• Oracle is the world's most complete, open, and integrated business software and hardware systems company
• Data Warehousing, VLDB and ILM
• Oracle R Enterprise (New)
• R for the Enterprise
• Oracle Data Mining Option (New GUI)
• 12 in-DB data mining algorithms
Oracle has taught the Database how to do Advanced Math/Statistics/Data Mining, and more…
Two Separate Worlds…
• DBA
  – Security
  – Control
  – Scalability
  – Performance
• Line of Business
  – Ad hoc
  – Exploratory data analysis
  – Interactive graphics
  – Problem-solving
Brings Them Together
• DBA
  – Security
  – Control
  – Scalability
  – Performance
+
• Line of Business
  – Ad hoc
  – Exploratory data analysis
  – Interactive graphics
  – Problem-solving
Better Business Intelligence
Enrich BI Dashboards with Statistics, Data Mining and Adv. Analytics
• Ad hoc
• Exploratory data analysis
• Interactive graphics
• Problem-solving
Oracle R Enterprise's and ODM's results become a data feed for OBIEE
ORACLE R ENTERPRISE FUNDAMENTALS
Starting up Oracle R Enterprise
When you start up Oracle R Enterprise, it loads several packages and automatically connects to an Oracle database.
R Data and Summary Statistics: cars
head(cars) • Prints top rows of data set
summary(cars) • Provides summary statistics for data set
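The two calls above can be tried in any R session. One caveat: base R ships its own data set named cars (columns speed and dist), which is not the table shown on these slides. The sketch below therefore uses the built-in mtcars data as a stand-in, on the assumption that its mpg, cyl and wt columns are close enough to illustrate the same calls.

```r
# mtcars is a stand-in for the slides' cars table, not the same data.
top_rows <- head(mtcars)              # first six rows of the data set
print(top_rows)

mpg_summary <- summary(mtcars$mpg)    # min, quartiles, mean and max of one column
print(mpg_summary)

summary(mtcars)                       # the same summary for every column
```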
R Arithmetic & Basics
GRADE "Built-in" Dataset

R> attach(GRADE)
R> head(GRADE)
...
R> max(FINALGRADE)
[1] 97
R> min(FINALGRADE)
[1] 71
R> max(FINALGRADE) - min(FINALGRADE)
[1] 26
R> mean(FINALGRADE)
[1] 83
R> sd(FINALGRADE)
[1] 9.237604
R Histogram: cars
R> hist(cars$acceleration)
Fast cars! Slow cars
R Graphics
R> plot(cars$weight, cars$mpg)
Heavy cars
R Graphics
R> abline(coef(lm(acceleration ~ weight, cars)), col = "red")
Faster cars are heavier?
R Graphics

R> boxplot(split(weight, cylinder), col = "blue")
Heavier cars have 8 cylinders
R Graphics

R> boxplot(split(cars$mpg, cars$model.year), col = "green")
MPG increases over time…
R Graphics

R> boxplot(split(cars$acceleration, cars$model.year), col = "red")
If you want a FAST car, buy an 8 cylinder '70 model car
R Graphics
R> plot(cars)
• Supports Exploratory Data Analysis for Oracle data
R Graphics

R> plot(data.frame(cars$acceleration, cars$mpg, cars$weight, cars$cylinders), col = "purple")
• Supports Exploratory Data Analysis for Oracle data
R Graphics Using Add-in Package

R> install.packages("scatterplot3d")
--- Please select a CRAN mirror for use in this session ---
trying URL 'http://cran.case.edu/bin/windows/contrib/2.12/scatterplot3d_0.3-33.zip'
Content type 'application/zip' length 605876 bytes (591 Kb)
opened URL
downloaded 591 Kb
package 'scatterplot3d' successfully unpacked and MD5 sums checked
The downloaded packages are in
C:\Documents and Settings\chberger\Local Settings\Temp\RtmpAEe7NC\downloaded_packages
R> library(scatterplot3d)
Warning message:
package 'scatterplot3d' was built under R version 2.12.2
R> scatterplot3d(cars)
Linear Models
Linear Models
Example from R Help(lm)
http://127.0.0.1:19161/library/stats/html/lm.html

## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
weight <- c(ctl, trt)
anova(lm.D9 <- lm(weight ~ group))
summary(lm.D90 <- lm(weight ~ group - 1)) # omitting intercept

opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1) # Residuals, Fitted, ...
par(opar)

## model frame :
stopifnot(identical(lm(weight ~ group, method = "model.frame"),
                    model.frame(lm.D9)))

### less simple examples in "See Also" above
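As a reading aid for the plant weight example, the sketch below shows what the fitted coefficients mean. It reuses the data from the help page; the names fit and fit0 are chosen here for illustration and stand in for lm.D9 and lm.D90.

```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

# With R's default treatment contrasts, the intercept is the mean of the
# reference group (Ctl) and the group coefficient is the difference in means.
fit <- lm(weight ~ group)
print(coef(fit))
print(c(mean(ctl), mean(trt) - mean(ctl)))

# Without the intercept, each coefficient is simply a group mean.
fit0 <- lm(weight ~ group - 1)
print(coef(fit0))
print(c(mean(ctl), mean(trt)))
```

Both parameterizations describe the same fitted values; dropping the intercept only changes how the two group means are expressed.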
Linear Models
Oracle R Enterprise ARIMA Forecasting Script

year200801 <- ONTIME_S[(ONTIME_S$YEAR==2008) & (ONTIME_S$MONTH==1),]
y <- ore.pull(year200801)
gc()
delays <- tapply(y$ARRDELAY, y$DAYOFMONTH, mean, na.rm=TRUE)
delays <- ts(delays, start=1, end=31, frequency=1)
# Create a Kalman filter with the first 5 delays and predict the rest
preds <- c()
ses <- c()
# 1 step predictions
for (i in 5:length(delays)) {
  fit <- arima(delays[1:i], c(1,2,1))
  # predict 1 step into the future.
  pred <- predict(fit)
  preds <- c(preds, pred$pred)
  ses <- c(ses, pred$se)
}
plot(5:length(delays), preds, type='l', col='green',
     ylim=range(c(preds+2*ses, preds-2*ses)),
     xlab="Day of month", ylab="Predicted average delay (in minutes)",
     main="Average delays by day for January 2008")
lines(5:length(delays), preds+2*ses, col='red')
lines(5:length(delays), preds-2*ses, col='red')
points(5:length(delays), as.vector(delays[5:length(delays)]))

# Legend colours match the plot: black points, green prediction, red bands
legend(23, -8, c("Delay", "Predicted delay", "2 se confidence"),
       col=c(1, 3, 2), lty=c(0, 1, 1), pch=c(1, -1, -1), merge=TRUE)
Statistical Quality Control R Applications
• The qcc package for R:
  – Plots Shewhart quality control charts for continuous, attribute and count data;
  – Plots Cusum and EWMA charts for continuous data;
  – Performs process capability analyses;
  – Creates Pareto charts and cause-and-effect diagrams
http://www.stat.unipg.it/luca/Rnews_2004-1-pag11-17.pdf
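To show what a Shewhart chart computes, here is a minimal hand-rolled sketch in base R. It deliberately does not use the qcc package, so it runs without installing anything; the measurements are made up for illustration, and the chart type (an individuals chart with limits from the average moving range) is one assumed choice among the chart types the package covers.

```r
# Made-up measurements; the ninth value is a deliberate outlier.
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 25, 11)

center <- mean(x)
# Estimate the process sigma from the average moving range of consecutive
# points; 1.128 is the standard d2 constant for ranges of two observations.
sigma_hat <- mean(abs(diff(x))) / 1.128

ucl <- center + 3 * sigma_hat   # upper control limit
lcl <- center - 3 * sigma_hat   # lower control limit
out_of_control <- which(x > ucl | x < lcl)

# Draw the chart: observations, centre line and the two control limits.
plot(x, type = "b", ylim = range(c(x, ucl, lcl)),
     xlab = "Observation", ylab = "Measurement")
abline(h = c(lcl, center, ucl), lty = c(2, 1, 2))
```

Points outside the dashed limits are the ones a quality engineer would investigate; in this made-up series only the deliberate outlier is flagged.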
Statistical Quality Control R Applications

Process capability studies to characterize and understand the behavior of a "process"
Pareto (80/20 rule) analysis to understand which few factors contribute most
http://www.stat.unipg.it/luca/Rnews_2004-1-pag11-17.pdf
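The Pareto idea on this slide reduces to sorting and a cumulative percentage. The sketch below is plain base R rather than the qcc package, and the defect categories and counts are invented for illustration.

```r
# Invented defect counts by cause, in no particular order.
defects <- c(stain = 6, scratch = 50, other = 4, dent = 30, misalign = 10)

# Sort causes from most to least frequent and accumulate their share.
sorted_counts <- sort(defects, decreasing = TRUE)
cum_pct <- 100 * cumsum(sorted_counts) / sum(sorted_counts)

# The "vital few": causes needed to reach 80% of all defects.
vital_few <- sorted_counts[cum_pct <= 80]
print(round(cum_pct, 1))
print(names(vital_few))

# Bars in Pareto order; a cumulative line could be overlaid with lines().
barplot(sorted_counts, ylab = "Defect count")
```

The 80% threshold is the conventional rule of thumb named on the slide, not a fixed requirement; any cut-off can be substituted.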
You Can Think of It Like This…
Traditional SQL
• “Human-driven” queries
• Domain expertise
• Lacks necessary statistical/adv. analytical functionality
• SQL Queries
  • SELECT
  • DISTINCT
  • AGGREGATE
  • WHERE
  • AND OR
  • GROUP BY
  • ORDER BY
  • RANK
+
In-Database Stats/Adv. Analytics
• Wide range of Oracle R Enterprise statistical functions
• Ability to develop and deploy R scripts within the enterprise
• ORE Statistics/Adv. Analytics
  • SUMMARY
  • CORR
  • Regression
  • Shewhart
  • ARIMA (Time Series)
  • R packages
Summary
"R for the Enterprise"
• Enables DBAs and LOB users to readily integrate R models into production
• Enables R models to be integrated into BI dashboards
• Enables R programmers/statisticians to work against database data without knowing SQL
• Reduces the number of LOB help requests for SQL queries to obtain data
• Removes the need to manage data outside Oracle Database
Save money on SA$!
• Use Oracle R Enterprise instead of Base SAS and reduce SA$ Annual Usage Fees
• Private analytical sandboxes for LOB/data analyst to work directly on database data in-database
Oracle in-Database Analytics for Big Data
• Over 100 built-in statistical functions that are compatible with Base SAS
• High performance in-database linear algebra
• Data parallelism for open source R packages executing in-database
• Develop your own algorithms for execution closer to the data, and leverage database parallelism
Announcing Open source distribution of R
New
Big Data Appliance & Software
Integrated Big Data Platform
Oracle Distribution of Apache Hadoop
Oracle NoSQL Database EE
Oracle Data Integrator Application Adapter for Hadoop
Oracle Loader for Hadoop

Oracle Hadoop Tools
Open source distribution of R (New)
Supported by Oracle Linux and JVM
Big Data Appliance + R
For Compute Intensive Operations Using R
R workspace console
Oracle statistics engine
OBIEE, Web Services
Function push-down – data transformation & statistics
Massively parallel computations

logreg <- function(input, iterations, dims, alpha){
  plane = rep(0, dims)
  g = function(z) 1/(1 + exp(-z))
  for (i in 1:iterations) {
    z = hdfs.get(hadoop.run(
      input,
      export = c(plane, g),
      map = logisticRegressionMapper,
      reduce = logisticRegressionReducer))
    gradient = c(z$val[1], z$val[2])
    plane = plane + alpha * gradient
  }
  plane
}
x = hdfs.push(WEBSESSIONS)
logreg(x, 10, 2, 0.05)
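The slide's script leaves logisticRegressionMapper and logisticRegressionReducer undefined, so the loop cannot be run as shown. The sketch below is a local, single-process analogy in base R that fills them in with hypothetical stand-ins: the mapper computes a per-chunk gradient of the log-likelihood and the reducer sums the pieces. The data are synthetic, the column names x1, x2 and y are assumptions, and the gradient is divided by the row count for a stable step size, which the slide does not do.

```r
g <- function(z) 1 / (1 + exp(-z))            # logistic function, as on the slide

# "Map" (hypothetical stand-in): gradient contribution of one chunk of rows.
mapper <- function(chunk, plane) {
  X <- as.matrix(chunk[, c("x1", "x2")])
  as.vector(t(X) %*% (chunk$y - g(X %*% plane)))
}

# "Reduce" (hypothetical stand-in): add the per-chunk contributions together.
reducer <- function(parts) Reduce(`+`, parts)

logreg_local <- function(data, iterations, dims, alpha, n_chunks = 4) {
  plane  <- rep(0, dims)
  chunks <- split(data, rep(seq_len(n_chunks), length.out = nrow(data)))
  for (i in seq_len(iterations)) {
    gradient <- reducer(lapply(chunks, mapper, plane = plane))
    # Assumption: average the gradient so alpha does not depend on data size.
    plane <- plane + alpha * gradient / nrow(data)
  }
  plane
}

# Synthetic data: the outcome rises with x1 and falls with x2.
set.seed(1)
sessions <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
sessions$y <- rbinom(500, 1, g(2 * sessions$x1 - sessions$x2))

plane <- logreg_local(sessions, iterations = 200, dims = 2, alpha = 0.5)
print(plane)
```

Because the gradient is a sum over rows, splitting the rows into chunks and adding the partial results gives the same answer as one pass over all rows. That property is what lets the map and reduce steps run in parallel across nodes.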
Summary

• Oracle R Enterprise (New)

• Sign up for Oracle R Enterprise Beta Program

• Big Data Appliance (New)
  – Open source distribution of R: coming soon!
• MARK YOUR CALENDARS!
• BIWA Summit @ COLLABORATE 12
  April 22-26, 2012
  Mandalay Bay Convention Center, Las Vegas, Nevada
  http://events.ioug.org/p/cm/ld/fid=15
Q&A