SQL ANALYTICS – Spark Mllib Algorithm Integration E.G
Total Page:16
File Type:pdf, Size:1020Kb
Oracle Machine Learning & Advanced Analytics Data Management Platforms Move the Algorithms; Not the Data! Charlie Berger, MS Engineering, MBA Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics [email protected] www.twitter.com/CharlieDataMine Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 2 Machine Learning, Data Mining, Predictive Analytics Automatically sift through large amounts of data to find hidden patterns, discover new insights and make predictions A1 A2 A3 A4 A5 A6 A7 • Identify most important factor (Attribute Importance) • Predict customer behavior (Classification) • Predict or estimate a value (Regression) • Find profiles of targeted people or items (Decision Trees) • Segment a population (Clustering) • Find fraudulent or “rare events” (Anomaly Detection) • Determine co-occurring items in a “baskets” (Associations) Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Oracle’s Data Management and Machine Learning Market Observations: • Machine learning, predictive analytics & “AI” now must-have requirements • Separate islands for data management and data science just don’t work • Enterprises whose data science teams most rapidly extract insights and predictions win Conclusions: • Must “operationalize” ML insights and predictions throughout enterprise • Multilingual Machine Learning: SQL, R, Python, Workflow UI, Notebooks, Embed ML in Appls, • Evolving towards combined Data Management + Machine Learning environment that can essentially to manage and “think” about data Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 4 Oracle’s Data Management and Machine Learning Architectural Strategy • In-Database proprietary implementations of machine learning algorithms • Leverage strengths of the Database and adds new ML tech – Counting, conditional probabilities, sort, rank, partition, group-by, collections, etc. – Parallel execution, bitmap indexes, partitioning, aggregations, recursion w/in parallel infrastructure, IEEE float, frequent itemsets, Automatic Data Preparation (ADP), Text processing, etc. • Focus on intelligent ML defaults, simplification & automation to enable applications – ADP, xforms, binning, missing values, Prediction_Details, Predictive_Queries, Model_views • Machine learning models built via PL/SQL script; scored via SQL functions (1st class DB objects) select cust_id from customers True power evident when scoring where region = ‘US’ models using SQL functions, e.g. and prediction_probability(churnmod, ‘Y’ using *) > 0.8; – “Smart scan” ML model scoring “push down”; Supports OLTP and ATPC environments • Machine Learning and Advanced Analytics are peer to rest of Oracle Data Mgmt features – Security, Back-up, Encryption, Scalability, “Big Data” eco-system, BDA, Big Data SQL, Cloud, Spark, etc. – Best ML & analytical development and deployment platform Copyright © 2016 Oracle and/or its affiliates. All rights reserved. 5 Oracle’s Machine Learning & Advanced Analytics Fastest Way to Deliver Enterprise-wide Predictive Analytics Traditional ML Oracle’s in-DB Machine Learning Major Benefits Data Import Data Mining . Data remains in Database & Hadoop Model “Scoring” . Model building and scoring occur in-database Data Prep. & . Use R packages with data-parallel invocations Transformation avings . Leverage investment in Oracle IT Data Mining Model Building . Eliminate data duplication . Eliminate separate analytical servers Data Prep & . Deliver enterprise-wide applications Transformation . GUI for ML/Predictive Analytics & code gen Model “Scoring” . R interface leverages database as HPC engine Data Extraction Embedded Data Prep Model Building Data Preparation Hours, Days or Weeks Secs, Mins or Hours Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Oracle’s Machine Learning & Adv. Analytics Algorithms CLASSIFICATION REGRESSION FEATURE EXTRACTION – Naïve Bayes – Linear Model – Principal Comp Analysis (PCA) – Logistic Regression (GLM) – Generalized Linear Model – Non-negative Matrix Factorization – Decision Tree – Support Vector Machine (SVM) – Singular Value Decomposition (SVD) – Random Forest – Stepwise Linear regression – Explicit Semantic Analysis (ESA) – Neural Network – Neural Network – Support Vector Machine – LASSO TEXT MINING SUPPORT – Explicit Semantic Analysis – Algorithms support text type ATTRIBUTE IMPORTANCE A1 A2 A3 A4 A5 A6 A7 – Tokenization and theme extraction CLUSTERING – Minimum Description Length – Explicit Semantic Analysis (ESA) for – Hierarchical K-Means – Principal Comp Analysis (PCA) document similarity – Hierarchical O-Cluster – Unsupervised Pair-wise KL Div – Expectation Maximization (EM) – CUR decomposition for row & AI STATISTICAL FUNCTIONS – Basic statistics: min, max, ANOMALY DETECTION ASSOCIATION RULES median, stdev, t-test, F-test, – One-Class SVM – A priori/ market basket Pearson’s, Chi-Sq, ANOVA, etc. TIME SERIES PREDICTIVE QUERIES R PACKAGES – State of the art forecasting using – Predict, cluster, detect, features – CRAN R Algorithm Packages Exponential Smoothing. through Embedded R Execution – Includes all popular models SQL ANALYTICS – Spark MLlib algorithm integration e.g. Holt-Winters with trends, – SQL Windows, SQL Patterns, seasons, irregularity, missing data SQL Aggregates EXPORTABLE ML MODELS • OAA (Oracle Data Mining + Oracle R Enterprise) and ORAAH combined – REST APIs for deployment • OAA includes support for Partitioned Models, Transactional, Unstructured, Geo-spatial, Graph data. etc, Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Multiple Data Scientist User Roles Supported Oracle’s Machine Learning/Advanced Analytics Additional relevant data Historical or Current Data to and “engineered be “scored” for predictions features” Oracle Historical data Assembled Predictions & Insights historical data Database 12c Sensor data, Text, unstructured data, transactional data, spatial data, etc. DBAs Application Developers R, Python Users, Data Scientists OML Notebook Users Data Analysts, Citizen Data Scientists Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | OAA Model Build and Real-time SQL Apply Prediction Simple SQL Syntax ML Model Build (PL/SQL) begin dbms_data_mining.create_model('BUY_INSUR1', 'CLASSIFICATION', 'CUST_INSUR_LTV', 'CUST_ID', 'BUY_INSURANCE', CUST_INSUR_LTV_SET'); end; / Model Apply (SQL query) Select prediction_probability(BUY_INSUR1, 'Yes' USING 3500 as bank_funds, 825 as checking_amount, 400 as credit_balance, 22 as age, 'Married' as marital_status, 93 as MONEY_MONTLY_OVERDRAWN, 1 as house_ownership) from dual; Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | Fraud Prediction Demo Automated In-DB Analytical Methodology drop table CLAIMS_SET; exec dbms_data_mining.drop_model('CLAIMSMODEL'); create table CLAIMS_SET (setting_name varchar2(30), setting_value varchar2(4000)); insert into CLAIMS_SET values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES'); insert into CLAIMS_SET values ('PREP_AUTO','ON'); commit; begin dbms_data_mining.create_model('CLAIMSMODEL', 'CLASSIFICATION', 'CLAIMS', 'POLICYNUMBER', null, 'CLAIMS_SET'); end; / Automated Monthly “Application”! Just -- Top 5 most suspicious fraud policy holder claims select * from add: (select POLICYNUMBER, round(prob_fraud*100,2) percent_fraud, Create rank() over (order by prob_fraud desc) rnk from View CLAIMS2_30 (select POLICYNUMBER, prediction_probability(CLAIMSMODEL, '0' using *) prob_fraud As from CLAIMS Select * from CLAIMS2 where PASTNUMBEROFCLAIMS in ('2to4', 'morethan4'))) Where mydate > SYSDATE – 30 where rnk <= 5 order by percent_fraud desc; Time measure: set timing on; Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | ML Model Deployment for Real-Time Scoring Oracle Cloud Real-Time Scoring, Predictions and Recommendations • On-the-fly, single record apply with new data (e.g. from call center) Select prediction_probability(CLAS_DT_1_15, 'Yes' USING 7800 as bank_funds, 125 as checking_amount, 20 as credit_balance, 55 as age, 'Married' as marital_status, 250 as MONEY_MONTLY_OVERDRAWN, 1 as house_ownership) from dual; Social Media Likelihood to respond: Call Center Get AdviceBranch Office R Web Mobile Email Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Manage and Analyze All Your Data Data Scientists, R Users, Citizen Data Scientists Architecturally, Many Options and Flexibility SQL / R Boil down the Data Lake “Engineered Features” Big Data SQL / R – Derived attributes that reflect domain knowledge—key to best models e.g.: • Counts Object • Totals Store • Changes over time Copyright © 2018, Oracle and/or its affiliates. All rights reserved. | 12 OAA Algorithm Performance Algorithm Time (sec.) Number of Attributes SVM SGD Solver 83s 900 GLM Classification 180s 9000 SVM IPM Solver 811s 900 GLM Regression 59s 900 K-Means 168s 9000 • In-memory 640 million rows, Airline On-Time dataset • SPARC M8-2 hardware • Courtesy of the SPARC Performance Team