MACHINE LEARNING: AZURE REFERENCE ARCHITECTURE

Zoiner Tejada • [email protected] • @zoinertejada
• Solliance Founder, CEO • Author • Microsoft MVP – Microsoft Azure • Azure Elite, Azure Insider • Google Developer Expert (GDE) • CQURE Certified Security Professional

AGENDA

You will learn: • the key tools in the toolbox (data transformation, supervised learning modules, unsupervised learning modules) • the value that Azure ML brings to the larger solution (such as classification, clustering and predictive analytics) • how you train your model (if you have to at all) and how to validate your model • how Azure ML integrates with your data pipeline

INTRO TO DATA SCIENCE
Keepin' it stats-light

WHAT IS DATA SCIENCE

• Practice of obtaining insights from data • Applies equally to small data and BIG data • Structured and unstructured • Multidisciplinary • Stats • Math • Operations • Signal processing • Linguistics • Database / Storage • Programming • Machine Learning • Scientific Computing

WHY NOW?

• Data has become a critical asset • With volumes increasing, it's getting harder and harder to tease information and insight out of the data • Companies with more than 1k employees store an average of 235 TB of data • 50B connected devices expected by 2020 • Analysts such as Gartner say the investment is worth it • Organizations that invest in modern data infrastructure will financially outperform their peers by up to 20% • Customers now expect data sophistication • Think "you might also like" on Amazon or Netflix's recommended movies

ANALYTICS SPECTRUM

Descriptive → Diagnostic → Predictive → Prescriptive

DESCRIPTIVE ANALYTICS

• What is happening? • Example • For a retail store, identify the customer segments for marketing purposes

DIAGNOSTIC ANALYTICS

• Why is it happening? • Example • Understanding what factors are causing customers to leave a service (churn)

PREDICTIVE ANALYTICS

• What will happen? • Example • Identify customers who are likely to upgrade to the latest phone

PRESCRIPTIVE ANALYTICS

• What should be done? • Example • What’s the best offer to give to a customer who is likely to want that latest phone

PROCESS

Define the business problem → Acquire and prepare data → Develop the model → Deploy the model → Monitor model performance & tune

HOW DO MACHINES LEARN?

• The learning process is the same for humans and machines • Divided into three components • Data input – use observation, memory, and recall to provide a factual basis for further reasoning • Abstraction – translate the data into broader representations • Generalization – use the abstraction to form a basis for action

KEY ML TERMS

• Knowledge representation • the formation of logical structures that assist with turning raw data into meaningful insights • Observations/Examples • the raw data inputs, typically thought of as a tuple • Features • An attribute or column in an example • Model • how the computer summarizes the raw inputs • Training • fitting a particular model to a dataset • Over-fitting • A model that performs well on the training dataset, but poorly when tested with other data

COMMON TECHNIQUES

• Classification • Clustering • Regression • Simulation • Content Analysis • Recommendation

SUPERVISED VS. UNSUPERVISED

• Refers to the training requirements of the algorithm • Does it need to be "trained" on a set of data before it can provide conclusions? • Supervised algorithms need to be carefully trained before they can be shown other examples and provide results • Unsupervised algorithms do not require training; they provide results given the data at hand

CLASSIFICATION ALGORITHMS

• Classify people or things into groups • They classify (or predict) a “label” for an example • The outcome is typically known in advance • Tools include • Decision trees • Logistic regression • Neural networks • Supervised learning • Can provide not just the classification, but also how a particular classification was reached
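As a concrete taste of supervised classification, here is a minimal scikit-learn sketch (illustrative only; in Azure ML Studio the same idea is expressed with modules such as Two-Class Logistic Regression wired into Train Model and Score Model; the dataset and model choice here are arbitrary):

```python
# A minimal supervised classification sketch: train on labeled examples,
# then predict labels for data the model has never seen.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)           # labeled examples
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)  # training
print("held-out accuracy:", model.score(X_test, y_test))         # roughly 0.95
```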

CLUSTERING ALGORITHMS

• Dividing a set of examples into homogeneous groups • While they also can predict a "label" for an example, they are applied when the labels are not known in advance • In other words, you are discovering what groups exist in the data • Tools include • K-means clustering • Unsupervised learning
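For example, a minimal k-means sketch with scikit-learn (illustrative; Azure ML Studio offers this as its K-Means Clustering module). No labels are supplied; the algorithm discovers the groups:

```python
# Unsupervised clustering: k-means discovers two groups in unlabeled points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)),    # one blob near (0, 0)
                    rng.normal(5, 1, (50, 2))])   # another blob near (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)    # discovered group centers
print(kmeans.labels_[:5])         # cluster assignment per example
```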

PATTERN DETECTION ALGORITHMS

• Identify frequent associations in the data • Tools include • Association rules • Unsupervised learning

REGRESSION ALGORITHMS

• Predict numerical outcomes • Inputs may be categorical or numerical, but the output is typically a number • Tools include • Linear regression • Neural networks
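A minimal regression sketch (illustrative; the Azure ML Studio counterpart is the Linear Regression module). The synthetic data and coefficients are made up for the example:

```python
# Fit a numeric outcome: y ≈ 3x + 2 plus noise, then predict a new value.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))                     # numerical input
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, 100)      # numerical output

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # recovers roughly 3 and 2
print(model.predict([[4.0]]))           # predicted numeric outcome near 14
```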

SIMULATION

• Model and optimize real world processes • Offers the opportunity to test many scenarios by adjusting model variables • Tools include • Monte Carlo simulations • Markov chain analysis • Linear programming
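As a taste of the technique, a minimal Monte Carlo sketch (the scenario and distributions are invented for illustration): estimate the chance that a three-task project overruns by repeatedly sampling the uncertain task durations:

```python
# Monte Carlo simulation: sample uncertain inputs many times and
# measure how often the modeled outcome crosses a threshold.
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
# Three sequential tasks with uncertain (normally distributed) durations, in days
total = (rng.normal(10, 2, trials) +
         rng.normal(5, 1, trials) +
         rng.normal(8, 3, trials))
print("P(project takes more than 30 days) ≈", (total > 30).mean())
```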

CONTENT ANALYSIS

• Surface information and insights from content like text, audio and video • Tools • Pattern recognition • Text mining • Image recognition • OCR

RECOMMENDATION

• Identify beneficial relationships and recommend items based on similarity between entities or between entities and items • Common example is Amazon's product recommendations • Tools used • Collaborative filtering (similarity between users or between items) • Content analysis • Affinity (e.g. market basket analysis)

ENSEMBLE MODELS

• Recent approaches have shown that • You can take a set of individually weak algorithms • Use them together to process data • The result can be far superior to even the best lone algorithm • Tools used • Decision Forests (the data is split amongst many decision trees) • Boosted Decision Trees (misclassified examples flow through a chain of trees)
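A minimal sketch of the effect (illustrative, using scikit-learn; Azure ML Studio offers comparable Two-Class Decision Forest and Boosted Decision Tree modules): a forest of trees typically outscores a single tree on the same data:

```python
# Ensemble vs. lone learner: compare cross-validated accuracy of one
# decision tree against a forest of 100 trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)               # single learner
forest = RandomForestClassifier(n_estimators=100,           # many trees, combined
                                random_state=0)

print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest:     ", cross_val_score(forest, X, y, cv=5).mean())
```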

SUMMARY

• Defined data science and key machine learning terminology • Described the data science process • Enumerated the types of analytics • Reviewed the many categories of algorithms

INTRO TO AZURE MACHINE LEARNING
Democratizing machine learning, with the power of the cloud

AZURE ML STUDIO

• Web-based UI for modeling experiments • Typically requires an Azure account to design and run

GUEST ACCESS

• Experiments can be shared even with people who don't have an Azure account • Guest access allows read-only viewing of experiments • Does not allow them to be run

EXPERIMENTS

• The core “project” type in Azure ML Studio is the experiment • Option for Blank • Numerous templates/samples with which to get started

MODULES

• Experiments contain modules arranged in a flowchart fashion

MODULE HELP

• Getting help • Right click a module and select Help to view documentation

MODULE COMMENTS

• Right-click on a module, choose Edit Comment • Add free-form text to document what the module accomplishes in the context of the experiment • You can collapse the comments by clicking on the chevron (up arrow)

MODULE CATEGORIES

Source Data • ML Modules • Operationalize Your Models • Don't Use

WINE QUALITY PREDICTION

• Type: Regression • Candidate Algorithms: • Decision Tree • Data Prep: • None • Business Requirements: • Build a model that takes various characteristics of wine and predicts the quality score deemed by experts

DEMO: Tour of Azure ML Studio – A first experiment in Wine Quality

DATASET

• Data saved to your Azure ML workspace is stored as a dataset • A Dataset is data that has been uploaded to Azure Machine Learning Studio • Datasets are external to your experiment • Azure ML provides ~40 sample datasets

DATATABLE

• Even if you upload data in another format, or specify a storage format such as CSV, ARFF, or TSV, the data is implicitly converted to a DataTable object whenever used by a module in an experiment. • A DataTable is a collection of typed columns. • Columns are vectors – one dimensional arrays • Specialized handling for sparse columns • Datasets downloaded are binary • Create an experiment and use the conversion modules (e.g., Convert to CSV) to create a manageable file

GETTING DATA

• Upload files from local computer into Datasets • Use the Reader module to load from • HTTP sources (URL to file) • Azure SQL Database (Query) • Azure Storage Table (Top N or full scan) • Azure Storage Blob (path to container, directory or blob) • Apache Hive (Query)

GETTING DATA

• Type it in • For small sets, quick work using the Enter Data module

WRITING DATA

• Write data to a destination using the Writer Module • Destinations include • Azure SQL Database (Query) • Azure Storage Table (Top N or full scan) • Azure Storage Blob (path to container, directory or blob) • Apache Hive (Query) • Notice there is no HTTP destination 🙂

DEMO: Datasets

AZURE ML SAMPLES

• Azure ML includes 50+ sample experiments • Each sample includes the experiment, requisite modules and necessary datasets • Just open, explore and run

DEMO: Azure ML Samples

CORTANA ANALYTICS GALLERY

• The Cortana Analytics Gallery is the community for sharing Azure ML experiments • Community • Contains Microsoft and community contributions • Enables commenting and review • Enables associating documentation with published ML experiments

DEMO: A tour of the Cortana Analytics gallery

LIMITATIONS OF R

• Traditionally designed for a single machine • In-memory computation • Specialized algorithms are needed for cluster compute and for data sets larger than memory

AZURE ML ARCHITECTURE

• Runtime environment • Experiments run with a single A8 instance • Datasets limited to 10GB • No support for parallel compute (yet)

REVOLUTION ANALYTICS

• Revolution Analytics was recently acquired by Microsoft • Their technology solved the limitations of running R in single-machine environments • Provided 80+ algorithms optimized for cluster compute • This technology is NOT YET integrated with Azure ML

SUMMARY

• Author and design experiments within Azure ML Studio • Experiments are made up of datasets and modules • Azure ML provides sample experiments and sample datasets • The Azure ML Gallery provides sample experiments from the community • Azure ML experiments run on a single node

DATA PREPARATION
AKA 80% of the work

PROCESS (A TIME-BASED VIEW)

Define the business problem → Acquire and prepare data → Develop the model → Deploy the model → Monitor model performance & tune

DATA PREP METHODS

• Data Cleaning and Processing • Feature Selection • Feature Engineering

DATA CLEANSING & PROCESSING

• Processing Activities • Handling missing and null values • Removing duplicate records • Identifying and removing outliers • Normalizing features • Addressing imbalanced classes

DEMO: Exploring a data set

THE CURSE OF DIMENSIONALITY

• Gist: • You want to be able to build classification models that accurately classify new data • As the number of data dimensions increases, it becomes increasingly difficult to avoid over-fitting the model to the training data • So the model does not generalize well to new data, and does not perform well • Therefore you need to reduce the number of data dimensions

CLASS IMBALANCE

• When the distribution of classes strongly favors one class over another you have a class imbalance • Example: 90% of the examples are labeled true, only 10% false • This yields misleading results that are accurate only in predicting the majority class • This can be addressed by • Upsampling – generating new examples derived from the minority class • Downsampling – reducing the number of examples in the majority class • Upsampling can be performed using the SMOTE module
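A minimal upsampling sketch (illustrative) using the SMOTE implementation in the imbalanced-learn Python package; Azure ML Studio exposes the same technique as its SMOTE module. The 90/10 synthetic dataset mirrors the example above:

```python
# SMOTE: synthesize new minority-class examples until classes are balanced.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a 90/10 imbalanced two-class dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # minority class upsampled synthetically
```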

FEATURE HASHING

• Given free text input • Represent as bag of words • Each word represented as a token • Each time that word appears, 1 is assigned to the token or 0 is assigned if it does not appear • This helps, but still results in too many dimensions • The solution is to use a hash
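A minimal feature hashing sketch (illustrative) with scikit-learn's HashingVectorizer; the Azure ML Studio counterpart is the Feature Hashing module. Instead of one column per distinct word, words are hashed into a fixed number of buckets, capping the dimensionality:

```python
# Feature hashing: map an unbounded vocabulary into a fixed number of columns.
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the wine was excellent", "the service was poor"]
# 16 hash buckets regardless of how large the vocabulary grows
vectorizer = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)
X = vectorizer.transform(docs)
print(X.shape)        # (2, 16) -- fixed width
print(X.toarray())    # token counts folded into hash buckets
```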

FEATURE SELECTION

• The process of identifying the features that are the most useful to the model • Identify which have the most "predictive power" • Reduce the number of dimensions (e.g., columns) the model has to deal with • Reduce the noise in the data

PRINCIPAL COMPONENT ANALYSIS

• How do you identify which dimensions to keep? • PCA projects the dimensions that remain after feature selection onto a smaller set of uncorrelated dimensions (principal components) that retain most of the variance
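A minimal PCA sketch (illustrative) with scikit-learn; Azure ML Studio provides a Principal Component Analysis module. Six correlated columns collapse to three components with little variance lost:

```python
# PCA: reduce correlated columns to a few uncorrelated components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# Six columns, but only three independent signals (the rest are noisy copies)
X = np.hstack([base, base + rng.normal(0, 0.05, (200, 3))])

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)   # nearly all variance in 3 components
print(pca.transform(X).shape)          # (200, 3): 6 dimensions -> 3
```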

COMPANY SEGMENTATION

• Type: Clustering • Candidate Algorithms: • K-means clustering • Data Prep: • Text handling and dimensionality reduction via feature hashing • Feature selection via Principal Component Analysis to top 10 features • Business Requirements: • Given data from the S&P 500, find the groups of related companies

DEMO: Data Prep in Company Segmentation experiment

FEATURE ENGINEERING

FEATURE ENGINEERING EXAMPLES

• Decompose Categorical Attributes • Split one categorical attribute into multiple binary attributes • Decompose DateTime • Split a DateTime attribute into parts (e.g., hour) • or add interpretation (morning, noon, afternoon and night) • Transform Numerical Quantities • Convert rates (units per unit time) into units and time interval • Bucketize or Binning • Summarize ranges of numeric data into distinct partitions
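A minimal pandas sketch of the examples above (illustrative; in Azure ML Studio the same transforms can be done with built-in transformation modules or an Execute R Script module):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-06-01 08:30", "2015-06-01 19:45"]),
    "color": ["red", "white"],
    "age_years": [3, 12],
})

# Decompose a categorical attribute into multiple binary attributes
df = pd.get_dummies(df, columns=["color"])
# Decompose DateTime into parts, then add an interpretation
df["hour"] = df["timestamp"].dt.hour
df["part_of_day"] = pd.cut(df["hour"], bins=[0, 12, 17, 24],
                           labels=["morning", "afternoon", "night"])
# Bucketize / bin a numeric quantity into distinct partitions
df["age_bucket"] = pd.cut(df["age_years"], bins=[0, 5, 10, 100],
                          labels=["young", "mid", "old"])
print(df)
```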

OTHER DATA PREP OPTIONS

• Use Azure Services for data exploration, cleansing, transform, staging and loading

CHURN

• Type: Classification • Candidate Algorithms: • Decision Tree • Decision Forest • Data Prep: • Identify significant amounts of missing values and other issues • Business Requirements: • Using telecom customer data, predict which customers are likely to churn (leave the service).

DEMO: Data Prep in Churn experiment

SUMMARY

• Acquiring and preparing the data represents the bulk of the time spent in a data science experiment • Typical steps include • Exploring the data • Cleaning and Processing the data • Reducing data dimensionality • Feature selection • Feature engineering • The curse of dimensionality – too many features prevent your model from generalizing to future data

MODEL DEVELOPMENT
Build, measure, learn

MODEL VALIDATION
Don't be confused by the confusion matrix

MODEL VALIDATION

• Gist • Connect the Score Model module output to the Evaluate Model module input

EVALUATE MODEL

• Use to measure the accuracy of a trained classification model or regression model • Metrics returned depend on the type of model • Classification • Accuracy, precision, recall, F-Score, AUC, average log loss, and training log loss metrics • Regression • Mean absolute error, root mean squared error, relative absolute error, and relative squared error metrics • Clustering • Average distance to cluster center, maximum distance to cluster center, distance to center of other clusters, and the number of points in each cluster

CLASSIFICATION MODEL PERFORMANCE

• Evaluate Model can compare the performance of two models • You can toggle between the input models and view their performance stats • You can adjust the threshold that determines the classification and see the effect on performance • Chart the comparison on a single ROC curve • The model connected to the left input node is labeled "Scored dataset" • The model connected to the right input node is labeled "Scored dataset to compare" • The output node provides a Visualization

ROC

• Receiver Operating Characteristic curve (ROC) • Shows the predictive performance of a binary classification model • Plots the True Positive rate against the False Positive rate • Performance measure relative to random (e.g., guessing)

• Also known as a Sensitivity / (1 – Specificity) plot
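A minimal sketch of computing an ROC curve and its area with scikit-learn (illustrative; Azure ML Studio's Evaluate Model draws this chart for you):

```python
# ROC/AUC: score held-out examples, then trace true vs. false positive rates.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Positive-class probability is the score each threshold is applied to
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # points on the ROC curve
print("AUC:", roc_auc_score(y_te, scores))       # 0.5 = random, 1.0 = ideal
```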

ROC

• Understanding relative performance • Random guessing is the diagonal line from (0,0) to (1,1) • This yields an area of 0.5 underneath the line • This is the line that indicates a model with no predictive value • Objective • Your goal is to be better than random, so an area greater than 0.5

AUC

• Area Under Curve • The ideal (i.e., perfect) model pulls towards the top left and provides 100% accuracy in predictions • The ideal model has an area under the curve (AUC) of 1 • The ideal model has a box underneath it measuring 1 by 1, so the area = 1 × 1 = 1

AUC

• Interpreting AUC • The following is a somewhat subjective categorization, but a good starting point • A (outstanding) = 0.9 – 1.0 • B (good) = 0.8 – 0.9 • C (fair) = 0.7 – 0.8 • D (poor) = 0.6 – 0.7 • F (not helpful) = 0.5 – 0.6 • Important to apply AUC along with other measures • The area under the curve can be the same for many different shapes of curves

CONFUSION MATRIX

• Confusion Matrix • Score how well the model did at identifying the known outcomes in training data • True positives • Expect true, model scored true • True negatives • Expect false, model scored false • False positives (aka Type I error) • Expect false, model scored true • False negatives (aka Type II error) • Expect true, model scored false

CONFUSION MATRIX

• Example: Cancer detection • Which model outcome is better? • True positives • Cancer correctly detected • True negatives • Absence of cancer correctly detected • False positives (aka Type I error) • Freaked out patient, but OK • False negatives (aka Type II error) • Big problem for the hospital

ACCURACY

• Accuracy • How well the model identifies the correct outcome • (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives) • Think of it like how well a student takes a test • percentage score • higher is better

PRECISION VS. RECALL

• Precision vs. Recall • Precision and Recall are two other metrics available from the Evaluate Model module

PRECISION

• Precision • The percentage of true positives out of the total identified positives • True Positives / (True Positives + False Positives)

• When a model predicts a positive, how often is it correct?

RECALL

• Recall • The percentage of true positives out of the total actual positives • True Positives / (True Positives + False Negatives) • What percentage of the positive examples does the model correctly catch? • AKA what breadth of coverage does it have over the actual positives?

PRECISION VS. RECALL
Tiny little toggle

• Precision vs. Recall • Precision – the percentage of true positives out of the total identified positives • True Positives / (True Positives + False Positives) • Recall – the percentage of true positives out of the total actual positives • True Positives / (True Positives + False Negatives)
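A minimal sketch of these counts and ratios with scikit-learn (illustrative; Evaluate Model reports the same numbers), using a small hand-made set of expected vs. scored outcomes:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # expected outcomes
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]   # model's scored outcomes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP TN FP FN:", tp, tn, fp, fn)                  # 3 4 2 1
print("accuracy: ", accuracy_score(y_true, y_pred))    # (TP+TN)/all = 0.70
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP)  = 0.60
print("recall:   ", recall_score(y_true, y_pred))      # TP/(TP+FN)  = 0.75
```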

SENSITIVITY

• Sensitivity measures a model's ability to correctly identify the positive cases • (True Positives) / (True Positives + False Negatives)

• Also defined as the True Positive Rate • If the count of False Negatives is 0, then Sensitivity is 1.0 (perfect) • Sensitivity goes toward 0 as the model makes mistakes and misclassifies actual positives as negatives (i.e., the count of False Negatives grows)

• This is also what we introduced as Recall

SPECIFICITY

• Specificity measures a model’s ability to correctly identify negative cases • (True Negatives) / (True Negatives + False Positives)

• Also defined as the True Negative Rate • If the count of False Positives is 0, then Specificity is 1.0 (perfect) • Specificity goes to 0 as the model makes mistakes and misclassifies actual negatives as positives (i.e., the count of False Positives grows)

PROBABILITY THRESHOLDS

• For binary classifications a scored outcome provides a value between 0.0 and 1.0 • The default threshold is 0.5 • This means • A score of 0.6 would output the class of 1 • A score of 0.4 would output a class of 0

• The default threshold is arbitrary and often should be adjusted • One recommendation is to set the threshold to the point where sensitivity equals specificity
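A minimal sketch of moving the threshold (illustrative, with made-up scores): the same scored probabilities yield different classifications as the cutoff changes:

```python
import numpy as np

scores = np.array([0.15, 0.4, 0.55, 0.6, 0.9])   # scored probabilities

for threshold in (0.5, 0.6):
    labels = (scores >= threshold).astype(int)    # class 1 at or above cutoff
    print(f"threshold {threshold}: {labels}")
# threshold 0.5: [0 0 1 1 1]
# threshold 0.6: [0 0 0 1 1]
```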

THE PERFECT METRIC

• In some cases, you only care about the model’s ability to identify positives (Sensitivity) • While in other cases, you only care about the model’s ability to identify negatives (Specificity) • In a perfect world you really want both perfect sensitivity and perfect specificity

F1 SCORE

• BUT, there’s a tradeoff between sensitivity and specificity • It’s due to the theoretical error rate • Bayes Error rate • Enter the F1 score • A higher F1 score means a better balance between sensitivity and specificity • (2 x True Postives) / (2 x True Positives + False Positives + False Negatives) MODEL PERFORMANCE – LIFT CURVE

• Lift Curve • Used to understand the impact of the results of a predictive model versus the results without the predictive model • The difference between the two is the lift

REGRESSION MODEL PERFORMANCE

• Similar to Classification Models, Regression Model performance can be measured with the Evaluate Model module • Provides these 5 metrics • Mean Absolute Error • Root Mean Squared Error • Relative Absolute Error • Relative Squared Error • Coefficient of Determination

CLUSTERING MODEL PERFORMANCE

• Cluster Model Performance metrics from Evaluate Model • Avg Distance to Cluster Center • Avg Distance to Other Center • Number of Points • Max Distance to Cluster Center

WHICH METRICS TO MEASURE PERFORMANCE WITH?

• Precision, Recall, Accuracy and F1 score are all provided by Evaluate Model • How do you know which factors are more important for a given scenario? • Accuracy is a good metric, but used alone it can be misleading when there is a large class imbalance (high accuracy would come simply from predicting the majority class) • Other factors may be important; for instance, you might prefer False Positives to False Negatives due to business factors (such as cost impact) • So you need to combine performance metrics

EXPLAINABILITY

• AKA the compliance heuristic • How do you explain to customers why their credit score is 720? • Some algorithms lend themselves to explaining the computation (e.g., Rules, basic Decision Tree, Logistic Regression) • Some even provide nice visualizations • Some algorithms operate more like a black box (e.g., Boosted Decision Tree, Neural Networks) • These can be very difficult or even impossible to explain

CROSS MODEL VALIDATION

• Another approach to avoid over-fitting your model to the training data is to utilize cross-validation • Use the Cross Validate Model module • Use it in lieu of Train Model / Score Model module pairs and any Split module leading data into them

SUMMARY

• You determine if your model is good by measuring its results • Measuring classification performance • The standard point of reference is how it performs relative to random • Metrics include confusion matrix, accuracy, precision, recall, sensitivity, specificity, F1 score • Measuring regression performance • Particularly useful for comparing amongst models • Mean Absolute Error, Root Mean Squared Error, Relative Absolute Error, Relative Squared Error, Coefficient of Determination • Measuring cluster performance • Measure of how "distinct" the clusters really are • Business performance metrics • Explainability, costs of False Negatives or False Positives

OPERATIONALIZING THE MODEL
Taking it from ML Studio to the World

OPERATIONALIZE THE MODEL

• Package the model up for deployment • Deployment for production

SETUP WEB SERVICE

• Once you’ve Run your experiment you can expose it as a Web Service • This means fronting it with a REST endpoint that • Takes as input the data to score and the parameters to use • Returns as output the scored result • Select the Train Model to use • Choose Setup Web Service -> Predictive Web Service AFTER SETUP

• Original Training experiment is preserved • New Predictive Experiment appears • Wrapped between Web service input and Web service output modules

AFTER SETUP

• New Predictive Experiment appears • Toggle to switch between showing the modules that provide the schema • And the modules that provide the data

DEPLOY WEB SERVICE

• With a Predictive Experiment ready • Run the Experiment • Click Deploy Web Service

AFTER DEPLOY

• With Web Service Deployed • Deployment dashboard • Support for Request/Response API • Support for Batch Execution • Test harness

DEMO: Deploying and invoking

POWER BI INTEGRATION
Visualizing Results

ANALYZE SCORED DATA IN EXCEL

• Convert to CSV Module • Run experiment & download CSV • Analyze, Explore and Chart using • Power View • Power Map • Your other favorite Excel features

REQUEST/RESPONSE API FROM EXCEL

• Once a scored model is published as Web API, you can invoke it directly using any REST client • You can also download a special Excel spreadsheet that uses Macros to invoke your scoring service (via the REST API) with parameters you input into a sheet
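A minimal sketch of invoking the Request/Response API from Python with the requests library. The URL, API key, column names, and input values below are placeholders; copy the real endpoint, key, and request shape from your service's API help page:

```python
import requests

# Hypothetical endpoint and key -- substitute the values from your dashboard
URL = ("https://<region>.services.azureml.net/workspaces/<workspace>"
       "/services/<service>/execute?api-version=2.0")
API_KEY = "<your api key>"

payload = {
    "Inputs": {"input1": {"ColumnNames": ["fixed acidity", "alcohol"],
                          "Values": [["7.4", "9.8"]]}},
    "GlobalParameters": {},
}
resp = requests.post(URL, json=payload,
                     headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
print(resp.json())   # the scored result
```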

POWER BI SERVICE

• You can export your scored data sets from Azure ML and upload the CSVs to Power BI

BATCH SCORING

• Upload files to Azure Storage blobs • Invoke the REST API • Submit Job • Start Job • Get Job Status • Retrieve the results from Azure Storage blobs • Use the Starter code • provided in the Batch Execution API documentation page of your scored service • Available in C#, R and Python
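A skeletal sketch of that flow in Python with the requests library (illustrative only; the endpoint paths, payload shape, and status values below are assumptions modeled on the classic Batch Execution API, so rely on the starter code and exact request shapes from your service's documentation page):

```python
import time

import requests

# Hypothetical service endpoint and key -- copy the real ones from the dashboard
BASE = ("https://<region>.services.azureml.net/workspaces/<workspace>"
        "/services/<service>")
HEADERS = {"Authorization": "Bearer <your api key>"}

# 1. Submit the job, pointing at the input file already uploaded to blob storage
job = {"Input": {"ConnectionString": "<storage connection string>",
                 "RelativeLocation": "mycontainer/input.csv"}}
job_id = requests.post(f"{BASE}/jobs?api-version=2.0",
                       json=job, headers=HEADERS).json()

# 2. Start the job
requests.post(f"{BASE}/jobs/{job_id}/start?api-version=2.0", headers=HEADERS)

# 3. Poll job status until it completes, then read results from blob storage
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}?api-version=2.0",
                          headers=HEADERS).json()
    if status.get("StatusCode") in ("Finished", "Failed", "Cancelled"):
        break
    time.sleep(5)
print(status)
```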

DEMO: Batch scoring using the REST API

INTEGRATION
Azure ML is not an island

ANALYTICS DATA PIPELINE

• Ingest • How to get new data into the system • Processing • The computation needed to prepare the data for delivery • Storage • The locations in which ingested, intermediate and final calculations are stored • Delivery • How the data is presented in analytics tools or to other clients (e.g. API’s, web pages)

INGEST

• Input into the system • Batch load into a data lake • Stream input into a multi-consumer queue

INGEST ON AZURE

• Batch Loading • Import/Export Service loads blob storage with content shipped on 6 TB disks • Azure Data Factory to orchestrate loading from on-prem & cloud-hosted data sets • HDInsight using Sqoop to load HDFS from relational database sources • Stream input • Event Hubs receives binary message streams in its high-velocity multi-consumer queue

DATA PIPELINES

• Coordinate the flow of data through the stages of the analytics pipeline • Provide visibility into the status of the data pipeline • Provide mechanisms for resiliency (retry failed steps, etc.) • Orchestrate the provisioning/scaling and execution of resources utilized by the pipeline

DATA PIPELINES ON AZURE

• Azure Data Factory • HDInsight • Oozie

PROCESSING ARCHITECTURES

• Processing Architectures • Batch Processing • Interactive/Iterative Processing • Real-time Processing • Lambda Architecture

BATCH PROCESSING

• Batch processing runs computations or queries across a set of data • If data represents the full set of data, batch processing can be thought of as query(data) • It can apply to a small set or a large set, even all of the data in your solution • Batch processing accepts larger latency (time to results ready) in exchange for scalability to support ever-increasing batch sizes

BATCH PROCESSING ON AZURE

• HDInsight • MapReduce • Pig • Tez • Spark • Azure Batch • SQL Data Warehouse • Azure Machine Learning (using the Batch Execution API)

INTERACTIVE PROCESSING

• Interactive processing tends to be iterative • Issue a query, review results, refine query, issue modified query • Interactive processing strongly desires minimal latency

INTERACTIVE PROCESSING ON AZURE

• HDInsight • Hive • Spark Core • Spark SQL • Phoenix (HBase) • Azure Machine Learning (using the Request/Response API) • SQL Database • SQL Data Warehouse

REAL-TIME PROCESSING

• Real-time processing runs computations or queries against data as it is received and outputs results in near real-time • Real-time processing can occur in two forms: • Tuple at a time • Micro-batch • Real-time processing strives for the lowest possible latency, at the expense of the limited size of data it processes per unit time and the narrower types of computation it can support

REAL-TIME PROCESSING ON AZURE

• Micro-batch • HDInsight Spark Streaming • Azure Stream Analytics

• Tuple at a time • HDInsight Storm

STORAGE

• Master data storage • Storage for data in rawest form • Typically a file system • Database storage • Storage for data formatted for retrieval using specific query patterns • Index storage • Supplemental indexes for storage formats that do not have primary indexes or for those that do not support secondary indexes

STORAGE ON AZURE

• File Storage • Storage Blobs • HDInsight – HDFS atop Storage Blobs • Azure Data Lake – HDFS as a Service • Relational Storage • SQL Database • SQL Data Warehouse • Columnar Storage • HDInsight – HBase • Key/Value Storage • Azure Storage Tables • Redis Cache

DELIVERY

• The presentation of data for final consumption • Querying by end users • Querying via API • Reporting and visualization using analytics tools

DELIVERY ON AZURE

• Power BI • API Apps • Web Apps • Azure Machine Learning Experiments

LAMBDA ARCHITECTURE

• General purpose architecture • Run arbitrary function on arbitrary data • Return results with low latency

[Diagram: Lambda Architecture. Ingest feeds a multi-consumer queue. The Speed Layer does stream processing (tuple or micro-batch) into hosted realtime views. The Batch Layer bulk-loads master data storage and runs batch computation on all data to produce batch views, which the Serving Layer hosts. Analytics clients – Power BI, Excel, Azure ML, custom apps, command line – query the views.]

LAMBDA ARCHITECTURE

• Key Principles of the Lambda Architecture • Complexity is hard to scale; keep things simple and easy to reason about • Human fault tolerance – enable recovery from human mistakes

LAMBDA ARCHITECTURE

• Key Principles of Storage • Store all the data in the rawest form possible; this is the master data set • The master data set is immutable • The master data set is fully normalized • Each piece of data is true in perpetuity – each piece should have a timestamp • Data follows a fact-based model, where each data piece is atomic & timestamped • Each fact data piece must be distinguishable, adding "nonce" random #'s as needed

LAMBDA ARCHITECTURE

• Key Principles of Speed • Incremental updating is constrained to the Speed Layer – a complex implementation • The Speed Layer trades accuracy for low latency • The Speed Layer only looks at recent data • The Speed Layer allows random writes and the results are transient

LAMBDA ARCHITECTURE

• Key Principles of Batch • The Batch Layer looks at all data at once • The Batch Layer will eventually correct the Speed Layer • The Serving Layer provides denormalized views that are continually produced from the Batch Layer and intended to support random access reads • The Batch and Serving Layers are Available, but not Consistent in the face of a network Partition (CAP Theorem) • Avoid online compaction and random writes (both restricted to the Speed Layer) to optimize for the read path • This enables human fault tolerance – any mistake in generating a view or updating the speed layer can be corrected in the next batch run

LAMBDA ARCHITECTURE

• Functional Summary • Batch View = function(all data) • Realtime View = function(current realtime view, new data) • Query = function(batch view, realtime view)
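A tiny sketch of that functional summary (illustrative, with word counts standing in for an arbitrary computation):

```python
from collections import Counter

def batch_view(all_data):               # Batch View = function(all data)
    return Counter(w for line in all_data for w in line.split())

def realtime_view(current, new_data):   # Realtime View = function(current view, new data)
    current.update(w for line in new_data for w in line.split())
    return current

def query(batch, realtime):             # Query = function(batch view, realtime view)
    return batch + realtime

master = ["to be or not to be"]          # all data, recomputed in batch
recent = ["to stream or to batch"]       # new data, merged incrementally
print(query(batch_view(master), realtime_view(Counter(), recent)))
```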

LAMBDA ARCHITECTURE ON AZURE

• Azure provides many options for each layer:

[Diagram: Lambda Architecture on Azure. Ingest: Event Hubs, Import/Export Service, Azure Data Factory, HDInsight Sqoop. Speed Layer – stream processing: Azure Stream Analytics, HDInsight Storm, HDInsight Spark Streaming; realtime view hosting: SQL DB, DocDB, Redis. Batch Layer – batch processing: HDInsight (MapReduce, Pig, Tez, Spark), Azure Batch, Azure Data Lake, Azure SQL DW, Azure ML; master data storage: HDFS + Blob Storage. Serving Layer – batch view hosting: Azure SQL DW, Azure Search, HBase + Phoenix. Analytics clients: Power BI, Excel, Azure ML, custom apps, command line.]

CORTANA ANALYTICS SUITE

• Provides a managed big data and advanced analytics suite • Basically, it's a packaging of • Azure Machine Learning • Azure Stream Analytics • Azure HDInsight • Azure Data Lake • SQL Data Warehouse • Azure Data Factory • Azure Data Catalog • Marketplace APIs • Power BI

SUMMARY

• In summary: • There are numerous services in Azure that combine to provide managed experiences for analytics over big data • The Lambda Architecture provides a reference architecture for implementing big data analytic solutions that endure • Cortana Analytics Suite packages these together

THANK YOU