Predictive Analytics and Big Data with MATLAB
Ian McKenna, Ph.D.
© 2015 The MathWorks, Inc.

Agenda
. Introduction
. Predictive Modeling
  – Supervised Machine Learning
  – Time Series Modeling
. Big Data Analysis
  – Load, Analyze, Discard workflows
  – Scale computations with parallel computing
  – Distributed processing of large data sets
. Moving to Production with MATLAB

Financial Modeling Workflow
. Access (Small/Big Data): Files, Databases, Datafeeds
. Explore and Prototype (Predictive Modeling): Data Analysis & Visualization, Financial Modeling, Application Development
. Share (Deploy): Reporting, Applications, Production
. Scale across all stages

This talk focuses first on the Explore and Prototype stage: data analysis & visualization, financial modeling, and application development.
What is Predictive Modeling?
. Use of mathematical models to make predictions about the future

  Input / Predictors  →  Predictive model  →  Output / Response

. Examples
  – Trading strategies
  – Electricity demand, e.g. EL = f(T, t, DP, …)

Why develop predictive models?
. Forecast prices/returns
. Price complex instruments
. Analyze impact of predictors (sensitivity analysis)
. Stress testing
. Gain economic/market insight
. And many more reasons
Challenges

. Significant technical expertise required
. No “one size fits all” solution
. Lock-in to black-box solutions
. Time required to conduct the analysis

Predictive Modeling Workflow
Train: iterate until you find the best model

  LOAD DATA → PREPROCESS DATA → SUPERVISED LEARNING → MODEL
  (preprocessing: filters, PCA, summary statistics, cluster analysis;
   learning: classification, regression)

Predict: integrate trained models into applications

  NEW DATA → MODEL → PREDICTION
Classes of Response Variables

. Structure: Sequential vs. Non-Sequential
. Type: Continuous vs. Categorical
Examples

. Predicting customer response (bank marketing campaign)
  – Classification techniques (Classification Learner App)
  – Measure accuracy and compare models
  [Figure: misclassification rate by classifier – Neural Net, TreeBagger, Naive Bayes, Support VM, Reduced TB, Decision Trees, Logistic Regression, k-Nearest Neighbors, Discriminant Analysis]

. Predicting the S&P 500
  – ARIMA modeling
  – GARCH modeling
  [Figure: realized vs. median forecasted path, May 2001 – May 2012]
Getting Started with Predictive Modeling

. Perform common tasks interactively
  – Classification Learner App
  – Neural Net App
Example – Bank Marketing Campaign

. Goal:
  – Predict whether a customer will subscribe to a bank term deposit based on different attributes

. Approach:
  – Train a classifier using different models
  – Measure accuracy and compare models
  – Reduce model complexity
  – Use the classifier for prediction

[Figure: misclassification rate by classifier]

Data set downloaded from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
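A minimal sketch of this train-and-compare workflow. The table name `bank` and response column `y` are assumptions; the models shown (a decision tree and a bagged ensemble) stand in for the larger comparison in the figure.

```matlab
% Hypothetical sketch: compare misclassification rates on held-out data.
% Assumes Statistics and Machine Learning Toolbox; 'bank' is an assumed
% table imported from the UCI data, with response column 'y'.
cv = cvpartition(bank.y, 'HoldOut', 0.3);
trainData = bank(training(cv), :);
testData  = bank(test(cv), :);

tree = fitctree(trainData, 'y');                                   % decision tree
bag  = fitensemble(trainData, 'y', 'Bag', 50, 'Tree', ...
                   'Type', 'classification');                      % bagged trees

errTree = loss(tree, testData, 'y');   % misclassification rate on test set
errBag  = loss(bag,  testData, 'y');
```

The same comparison can be run interactively, without code, in the Classification Learner App.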
Classification Techniques

. Regression: Linear Regression, Non-linear Regression (GLM, Logistic), Decision Trees, Neural Networks, Ensemble Methods
. Classification: Support Vector Machines, Discriminant Analysis, Nearest Neighbor, Naive Bayes
Example – Bank Marketing Campaign

. Numerous predictive models with rich documentation
. Interactive visualizations and apps to aid discovery
. Built-in parallel computing support
. Quick prototyping; focus on modeling, not programming
Example – Time Series Modeling and Forecasting for the S&P 500 Index

. Goal:
  – Model the S&P 500 time series as a combined ARIMA/GARCH process and forecast on test data

. Approach:
  – Fit an ARIMA model to S&P 500 returns and estimate parameters
  – Fit a GARCH model for S&P 500 volatility
  – Perform statistical tests for time series attributes, e.g. stationarity

[Figures: realized vs. all forecasted paths and realized vs. median forecasted path, May 2001 – May 2012]
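A minimal sketch of the combined fit using Econometrics Toolbox functions. The return vector `r`, the model orders, and the forecast horizon are assumptions for illustration.

```matlab
% Hypothetical sketch: ARIMA(1,0,1) mean model with GARCH(1,1) conditional
% variance, fit to a vector r of S&P 500 returns (assumed in the workspace).
mdl    = arima('ARLags', 1, 'MALags', 1, 'Variance', garch(1, 1));
estMdl = estimate(mdl, r);                      % estimate all parameters jointly

% Forecast 100 steps ahead and simulate paths for the fan chart
[yF, yMSE] = forecast(estMdl, 100, 'Y0', r);
paths      = simulate(estMdl, 100, 'NumPaths', 1000, 'Y0', r);
```

Stationarity can be checked beforehand with tests such as `adftest` or `kpsstest` on the return series.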
Models for Time Series Data

Conditional Mean Models:
. AR – Autoregressive
. MA – Moving Average
. ARIMA – Integrated
. ARIMAX – eXogenous inputs
. VARMA – Vector ARMA
. VARMAX – eXogenous inputs
. VEC – Vector Error Correcting

Conditional Variance Models:
. ARCH
. GARCH
. EGARCH
. GJR

Non-Linear Models:
. NAR Neural Network
. NARX Neural Network

Other:
. State Space Models
. Time Varying Regression
. Time Invariant Regression with ARIMA errors

Example – Time Series Modeling and Forecasting for the S&P 500 Index
. Numerous ARIMAX and GARCH modeling techniques with rich documentation
. Interactive visualizations
. Code parallelization to maximize computing resources
. Rapid exploration & development
Challenges of Big Data
“Any collection of data sets so large and complex that it becomes difficult to process using … traditional data processing applications.” (Wikipedia)
. Volume – The amount of data
. Velocity – The speed data is generated/analyzed
. Variety – Range of data types and sources
. Value – What business intelligence can be obtained from the data?

Big Data Capabilities in MATLAB

. Memory and Data Access: 64-bit processors; memory-mapped variables; disk variables; databases (native ODBC interface, database datastore, fetch in batches, scrollable cursors)
. Programming Constructs: datastores; streaming; block processing; parallel-for loops; GPU arrays; SPMD and distributed arrays; MapReduce
. Platforms: desktop (multicore, GPU); clusters; cloud computing (MDCS on EC2); Hadoop
Techniques for Big Data in MATLAB

[Figure: techniques arranged by data size (64-bit workstation RAM up to hard drive) and by complexity (embarrassingly parallel to non-partitionable): parfor; SPMD and distributed memory; datastore; MapReduce; consulting for the hardest cases]

Memory Usage Best Practices
. Expand workspace: 64-bit MATLAB
. Use appropriate data storage
  – Categorical arrays
  – Be aware of the overhead of cells and structures
  – Use only the precision you need
  – Sparse matrices
. Minimize data copies
  – In-place operations, if possible
  – Use nested functions
  – Inherit data using object handles
Parallel Computing with MATLAB

[Figure: MATLAB desktop (client) distributing work to multiple workers]
Example: Analyzing an Investment Strategy
. Optimize portfolios against target benchmark
. Analyze and report performance over time
. Backtest over 20-year period, parallelize 3-month rebalance
When to Use parfor

. Data characteristics
  – The data for each iteration must fit in memory
  – Loop iterations must be independent
. Transition from desktop to cluster with minimal code changes
. Speed up analysis on big data
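The investment-strategy backtest above maps naturally onto a parfor loop: each rebalancing period is independent and fits in memory. A minimal sketch; `backtestPeriod` is a hypothetical helper and the period count is illustrative.

```matlab
% Hypothetical sketch: parallelize independent rebalancing periods.
% Assumes Parallel Computing Toolbox; backtestPeriod is a made-up helper
% that evaluates the strategy over one 3-month window.
nPeriods = 80;                      % e.g. quarterly rebalances over 20 years
perf = zeros(nPeriods, 1);
parfor k = 1:nPeriods
    perf(k) = backtestPeriod(k);    % each iteration is independent
end
```

The same loop runs serially if no parallel pool is open, so the code works unchanged on the desktop and on a cluster.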
Parallel Computing – Distributed Memory
Using more computers (more RAM)

[Figure: multiple multicore machines, each with its own RAM, working together]
spmd blocks

    spmd
        % single program across workers
    end

. Mix parallel and serial code in the same function
. Single Program runs simultaneously across workers
. Multiple Data spread across multiple workers
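A minimal runnable sketch of an spmd block, assuming Parallel Computing Toolbox; the pool size, data, and variable names are illustrative.

```matlab
% Hypothetical sketch: each worker computes on its own piece of data,
% then results are combined with a global operation.
parpool(4);                        % serial code runs on the client
spmd
    x = labindex * ones(1, 3);     % labindex identifies each worker
    partial = sum(x);
    total = gplus(partial);        % global sum across all workers
end
disp(total{1})                     % Composite: fetch the value from worker 1
```

Variables created inside spmd come back to the client as Composite objects, indexed by worker.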
Example: Airline Delay Analysis

. Data
  – Airline On-Time Statistics
  – 123.5M records, 29 fields
. Analysis
  – Calculate delay patterns
  – Visualize summaries
  – Estimate & evaluate predictive models
When to Use Distributed Memory

. Data characteristics
  – Data must fit in the collective memory across machines
. Compute platform
  – Prototype (on a subset of data) on the desktop
  – Run on a cluster or cloud
. Analysis characteristics
  – Distributed arrays support a subset of functions
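A minimal distributed-array sketch under those constraints; the pool size and matrix dimensions are illustrative.

```matlab
% Hypothetical sketch: the matrix is partitioned across the pool's workers
% but is used with ordinary MATLAB syntax from the client.
parpool(4);
A = rand(5000, 5000, 'distributed');   % data lives in the workers' memory
s = svd(A);                            % svd is in the distributed-array subset
result = gather(s);                    % bring the (small) result back
```

Only the small result is gathered to the client; the large matrix never has to fit on one machine.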
Access Big Data – datastore
. Easily specify the data set
  – Single text file (or a collection of text files)
. Preview data structure and format
. Select data to import using column names
. Incrementally read subsets of the data

    airdata = datastore('*.csv');
    airdata.SelectedVariables = {'Distance', 'ArrDelay'};
    data = read(airdata);
Example: Determine Unique Tickers

. 15 years of daily S&P 500 data
. Data in multiple files of different sizes
. Many irrelevant columns in the dataset
When to Use datastore

. Data characteristics
  – Text files, databases, or data stored in the Hadoop Distributed File System (HDFS)
. Analysis characteristics
  – Load, Analyze, Discard workflows
  – Incrementally read chunks of data and process them within a while loop
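The Load, Analyze, Discard pattern in a minimal sketch, applied to the unique-tickers example; the file-name pattern and column name are assumptions.

```matlab
% Hypothetical sketch: find unique tickers across files that together
% do not fit in memory.
ds = datastore('sp500_*.csv');              % assumed file naming pattern
ds.SelectedVariables = {'Ticker'};          % read only the relevant column
tickers = {};
while hasdata(ds)
    chunk = read(ds);                       % load: one in-memory chunk
    tickers = union(tickers, chunk.Ticker); % analyze, then discard the chunk
end
```

Because only the selected column of one chunk is in memory at a time, the same loop works for any number of files.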
Reading in Part of a Dataset from Files

. Text file, ASCII file
  – Read part of a collection of files using datastore
. MAT-file
  – Load and save part of a variable using matfile
. Binary file
  – Read and write directly to/from the file using memmapfile
. Databases
  – ODBC- and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
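A sketch of partial loading with matfile; the file name and variable name are hypothetical.

```matlab
% Hypothetical sketch: read a block of rows from a large variable stored
% in a MAT-file without loading the whole variable into memory.
m = matfile('returns.mat');      % assumed file containing a big matrix R
firstYear = m.R(1:252, :);       % only these rows are read from disk

m.Properties.Writable = true;
m.R(1, 1) = 0;                   % partial writes work the same way
```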
Analyze Big Data – mapreduce
. MapReduce programming technique to analyze big data
  – mapreduce uses a datastore to process data in small chunks that individually fit into memory
. mapreduce on the desktop
  – Access data on HDFS
  – Integrates with Parallel Computing Toolbox
. mapreduce with Hadoop
  – Run on Hadoop using MATLAB Distributed Computing Server
  – Deploy to Hadoop using MATLAB Compiler
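A minimal mapper/reducer pair for the unique-tickers task; the file-name pattern, column name, and key names are assumptions.

```matlab
% Hypothetical sketch: find unique tickers with mapreduce.
ds = datastore('sp500_*.csv', 'SelectedVariableNames', {'Ticker'});
result = mapreduce(ds, @tickerMapper, @tickerReducer);
readall(result)

function tickerMapper(data, ~, intermKVStore)
    % Emit the tickers seen in this chunk under a single key
    add(intermKVStore, 'Tickers', unique(data.Ticker));
end

function tickerReducer(~, intermValIter, outKVStore)
    % Combine the per-chunk ticker lists into one unique list
    tickers = {};
    while hasnext(intermValIter)
        tickers = union(tickers, getnext(intermValIter));
    end
    add(outKVStore, 'UniqueTickers', tickers);
end
```

The identical map and reduce functions run unchanged on the desktop or against a Hadoop cluster.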
MapReduce

[Figure: MapReduce data flow – datastore chunks pass to Map, which emits key/value pairs (e.g. date → tickers); Shuffle & Sort groups values by key; Reduce combines them (e.g. unique tickers per date)]

Example: Calculate the Covariance of the S&P 500 Using MapReduce
. 15 years of daily S&P500 returns stored in multiple files
. Use all the data to calculate the mean and covariance
. Computation must scale to 1-minute bars for 30 years of data
Challenges

. Multiple files of differing sizes
. How do we read/partition this dataset if it doesn’t fit in memory?

    Date        Ticker  Open    High    Low     Close   Volume    Return
    3-Jan-2000  AIG     107.13  107.44  103     103.94  166500    NaN
    3-Jan-2000  AMZN    87.25   89.56   79.05   89.56   16117600  NaN
    3-Jan-2000  GE      147.25  148     144     144     22121400  -0.040
    8-Jan-2000  AMZN    81.5    89.56   79.05   89.38   16117600  NaN
    4-Jan-2000  AIG     101.5   102.13  98.31   98.63   364000    -0.051
    Jan 4,2000  YHOO    464.5   500.12  442     443     69868800  -0.067
    4-Jan-2000  INTC    85.44   87.88   82.25   92.94   51019600  -0.046
    4-Jan-2000  GE      147.25  148     144     144     22121400  -0.040
    8-Jan-2000  GE      143.12  146.94  142.63  145.67  19873200  0.013

. Missing data (explicit/implicit)
. Mean
  – Coupling between rows
. Covariance
  – Coupling between rows
  – Coupling between columns

Approach

. Reading in chunks – do we have a full column of data?

    Long form:                      Wide form:
    Date        Ticker  Return      Date        AIG     AMZN  GE     YHOO
    3-Jan-2000  AIG     -0.012      3-Jan-2000  -0.012  NaN   0.051  NaN
    3-Jan-2000  AMZN    NaN         4-Jan-2000  0.097   NaN   NaN    -0.035
    3-Jan-2000  GE      0.051
    4-Jan-2000  AMZN    NaN
    4-Jan-2000  AIG     0.097
    4-Jan-2000  YHOO    -0.035
    4-Jan-2000  GE      NaN

. Solution: convert to tabular (wide) form with all columns
. Further memory savings (ticker/date not repeated)
Approach

. Goal: calculate mean/covariance for big data sets

[Figure: S&P 500 data files (1…N) feed a datastore; each chunk goes through tabular conversion, and MapReduce calculates per-chunk mean/covariance; the partial results are combined and validated; the same MapReduce code scales to Hadoop]
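The combine step can be sketched for the mean (the covariance combine follows the same partial-sums pattern); the file-name pattern and column name are assumptions.

```matlab
% Hypothetical sketch: accumulate partial sums per chunk, then combine —
% the same pattern a mapreduce reducer applies to compute the mean.
ds = datastore('sp500_*.csv', 'SelectedVariableNames', {'Return'});
total = 0;
n = 0;
while hasdata(ds)
    chunk = read(ds);
    r = chunk.Return(~isnan(chunk.Return));  % drop missing data
    total = total + sum(r);                  % per-chunk partial sum
    n = n + numel(r);
end
overallMean = total / n;                     % exact, regardless of chunking
```

Because partial sums combine exactly, the chunked result matches the in-memory result, which is how the algorithm is validated on the desktop before scaling out.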
The Big Data Platform

. HDFS – fault-tolerant distributed data storage
. MapReduce – programming model for distributed computation; take the computation to the data

[Figure: a datastore over HDFS; each data node runs its own Map and Reduce tasks]

Deployed Applications with Hadoop
[Figure: compiled MATLAB MapReduce code runs against the MATLAB runtime on each Hadoop data node, with a datastore over HDFS]

Solution
. datastore
  – Treat multiple files as one pool of data
  – Parse data in chunks to determine unique values
. mapreduce
  – Group, filter, and calculate summary statistics
. Hadoop
  – The algorithm is the same as the one developed on the desktop
  – Easily deploy to Hadoop using interactive tools
. MATLAB interactive environment
  – Debugger and profiler
  – Validate algorithms using built-in functions for rapid prototyping
Big Data Summary

. Access portions of data with datastore
. Cluster-ready programming constructs
  – parfor
  – SPMD
  – MapReduce
  – Distributed arrays
. Prototype code for your cluster
  – Transition from desktop to cluster with no algorithm changes

[Figure: MATLAB desktop (client) submitting work to a cluster through a scheduler]
Moving to Production with MATLAB
Financial Modeling Workflow – Deploy

. Share: Reporting, Applications, Production
. Applications run on the desktop, in the enterprise, and on the web
. Scale with Hadoop
Deployed Applications

. Example: portfolio optimization and simulation
. Example: day-ahead system load forecasting
MATLAB Production Server

. Enterprise framework for running packaged MATLAB programs
. Scalable & reliable
  – Service large numbers of concurrent requests
. Use with web, database & application servers
  – Easily integrates with IT systems (Java, .NET, C++, Python)

[Figure: a web application sends a request to the MATLAB Production Server, whose broker & program manager dispatch it to worker processes]
Integrating with IT Systems

[Figure: MATLAB Compiler SDK packages analytics (portfolio optimization, pricing, risk analytics) for the MATLAB Production Server, which serves web servers, Excel desktop applications, application servers, and database servers]
Benefits of the MATLAB Production Server

. Reduce the cost of building and deploying in-house analytics
  – Quants/analysts/financial modelers do not have to rewrite code in another language
  – Update deployed models easily without restarting the server
  – Single environment for model development and testing
. IT can efficiently integrate models/analytics into production systems
  – Centrally manage packaged MATLAB programs
  – Handoff from quant to IT only requires function signatures
  – Easily support analytics built with multiple releases of MATLAB
  – Run multiple instances of MATLAB Production Server simultaneously
Summary

Challenge: Time (loss of productivity)
MATLAB solution: Rapid analysis and application development – easily access big data sets; interactive exploratory analysis and visualization; apps to get started; debugger

Challenge: No “one-size-fits-all” solution
MATLAB solution: Multiple algorithms and programming constructs – regression, machine learning, time series modeling, parfor, MapReduce, datastore

Challenge: Big data and scaling
MATLAB solution: Work on the desktop and scale to clusters – Hadoop support, no algorithm changes required

Challenge: Time to deploy & integrate
MATLAB solution: Ease of deployment leveraging the enterprise – push-button deployment into production
Financial Modeling Workflow – Product Map

. Access: Database, Datafeed, Spreadsheet Link EX
. Research and Quantify: Statistics & Machine Learning, Econometrics, Financial, Financial Instruments, Optimization, Curve Fitting, Neural Networks, Trading
. Share: Report Generator, MATLAB Compiler, MATLAB Compiler SDK, Production Server
. Scale: Parallel Computing, MATLAB Distributed Computing Server

Learn More: Predictive Modeling with MATLAB
To learn more, visit www.mathworks.com/machine-learning

. Basket Selection using Stepwise Regression
. Classification in the presence of missing data
. Regression with Boosted Decision Trees
. Hierarchical Clustering

Learn More: Big Data
. MATLAB Documentation – Strategies for Efficient Use of Memory – Resolving "Out of Memory" Errors
. Big Data with MATLAB – www.mathworks.com/discovery/big-data-matlab.html
. MATLAB MapReduce and Hadoop – www.mathworks.com/discovery/matlab-mapreduce-hadoop.html
Training Services (mathworks.com/training)

. Classroom training
  – Customized curriculum
  – Usually a 2–5 day consecutive format
. Live online
  – Flexible scheduling
  – Full- or half-day sessions
. Self-paced
  – Learn whenever you want and at your own pace
  – Online discussion boards and live trainer chats

CPE approved provider: earn one CPE credit per hour of content.
Training Roadmap

MATLAB for Financial Applications

. Data Analysis and Modeling: Statistical Methods; Machine Learning; Time-Series Modeling (Econometrics); Risk Management; Optimization Techniques; Asset Allocation
. Application Development: Programming Techniques; Interactive User Interfaces; Parallel Computing; Interfacing with Databases; Interfacing with Excel
. Content available for on-site customization
Consulting Services – Accelerating Return on Investment

A global team of experts supporting every stage of tool and process integration:

. Process and technology automation and standardization
. Full application deployment; component deployment
. Process assessment; advisory services
. Jumpstart; migration planning; continuous improvement

[Figure: services span research, advanced engineering, product engineering teams, and supplier involvement]
Q&A