A I M 2 0 1 - S Hot paths to anomaly detection with TIBCO data science, streaming on AWS

Steven Hillion Michael O’Connell Sr Director, Data Science Chief Analytics Officer @StevenHillion @MichOConnell

© 2019, Web Services, Inc. or its affiliates. All rights reserved. The Ideal Data Science Platform

Offer Cross-Sell Products Streaming

Predict Impending Equipment Failure

Data Marts Optimize Batch Scoring Pricing

Transactional Stores Real-Time Scoring Prevent Fraud

Live Applications Point-Click | Code Manage Inventory External Sources Data Access & Prep Edge Devices Modeling, ML, DL

© Copyright 2000-2019 TIBCO Inc. 3 Agenda

• TIBCO Data Science and AWS Marketplace

• The TIBCO Connected Intelligence Cloud

• Anomaly Detection and Analysis

• Demonstration – Spatial Anomaly Analysis

• Links and Assets

© Copyright 2000-2019 TIBCO Software Inc. 4 TIBCO Data Science

Data Access/Prep Modeling Operations Business Apps

+ Distributed compute + Visual composition + Model lifecycle management + Engineering/IoT FUNCTION + Feature engineering + Multilingual notebook + Batch automation + Customer analytics + Reusable templates + Native ML & OS + Real-time event processing + Risk management + Auto-ML, data prep + REST, applications, embedding + Supply chain

Medic; e.g., researcher on epidemic monitoring

Engineer; e.g., aerodynamics engineer

Marketeer; e.g., customer engagement analyst

Business User USER or Data Scientist Data Scientist Analytics Operations Analytics Operations AUTOMATION Citizen Data Scientist Citizen Data Scientist IT / Software Engineer IT / Administration

© Copyright 2000-2019 TIBCO Software Inc. TIBCO Data Science on AWS TIBCO DS on AWS Marketplace; Biggest vCPU Grid; Lightest Serverless Footprint

© Copyright 2000-2019 TIBCO Software Inc. 6 Data Science in the Cloud: Leidos Healthcare Analytics

Leidos Collaborative Advanced Analytics & Data Sharing Platform (CAADS) uses TIBCO Data Science and AWS to deliver analytics services in Healthcare CDC NIH CMS NASA

Disease Outbreaks: Disease Outbreaks: Data Governance: Space Exploration: Determining the Run simulations of Analyzing and Analyzing human cause of an HIV disease propagation consolidating data factors that affect the outbreak in the to guide public policy, around emerging ability to transport Midwest specifically around Healthcare policies astronauts on long the Zika virus across 56 regions in fights (e.g., to Mars) the United States

© Copyright 2000-2019 TIBCO Software Inc. 7 TIBCO analytics transformation platform Powered by shared data assets

ANALYTICS ACTIONS Visual Analytics Data Science Operational Analytics AUGMENT INTELLIGENCE

DATA OPERATIONS INFORMATION MANAGEMENT

METADATA MDM / RDM Metadata Spark Compute & Store Data Virtualization Metadata Master and Master & (Data Governance, Reference Data UNIFY Reference Data Data Catalog) Management DATA Data Store 1 Data Store n EDW/Lake Transactional Data

EVENTS INTEGRATION

Streaming / Hybrid API Events Integration Management INTERCONNECT EVERYTHING

DATA SOURCES Streaming In-memory Enterprise ... Enterprise Sources Data Grid Application 1 Application n

Cloud Native Open Platform AI Foundation

© Copyright 2000-2019 TIBCO Software Inc. TIBCO Data Science and AWS

EDWs & Data Operational Analytics Analytics Sources Systems Sandboxes Platform

Federated Data Amazon SageMaker

Science Containers/Code Real Services Streaming Business Events & - StreamBase

Time (BW) Time SQL

Flogo Spark TIBCO Data APIs andMicroservices Amazon EMR Science Amazon SageMaker Model Management

Data Marts Batch(DV) Automation

Visual UI / Amazon Redshift Notebook Collaboration / Transactional Project

Stores

Manage Execute TIBCO Live Apps

Amazon Simple Storage Service (S3) External Visual Analytics Sources (TIBCO Spotfire)

AWS Deployments © Copyright 2000-2019 TIBCO Software Inc. 9 TIBCO Data Science Solutions on AWS Cloud Apps: Visual Analytics, Data Science, Streaming, Case Management

Anomaly Detection Risk Management Customer Engagement Starter Set

Process Mining IoT Analytics Anomaly Detection Risk Management Customer Engagement Blockchain – Dovetail Partner Management Starter Toolkit

Review Status: TIBCO Spotfire Model: TIBCO Data Science Analyze Event Stream: Case Manage: TIBCO Live Apps Identify issues, sweet spots Supervised: Train TIBCO Flogo, Cloud Integration Investigate identified cases Unsupervised: Anomalies Batch and Real-Time Updates Audit trail + recycle

© Copyright 2000-2019 TIBCO Software Inc. Anomaly Analysis Solution Overview

1 Collect data from equipment, normalize, model to predict 3 Alert raised and case created – TIBCO Cloud magnitude of anomaly – TIBCO Data Science & AWS Live Apps

2 Model detects anomaly – TIBCO Data Science 4 Case manager investigates and takes action to the equipment – TIBCO Cloud Integration

© Copyright 2000-2019 TIBCO Software Inc. Data Challenges in High-Tech Manufacturing

Stack 3D Flash Memory Cell Layers Vertically 96 Memory Cell Layers

© Copyright 2000-2019 TIBCO Software Inc. Hot Paths to Anomaly Detection

14 Longitudinal Anomaly Analysis

© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Analysis

© Copyright 2000-2019 TIBCO Software Inc. Spatial Anomaly Analysis

One-Valued Wafer Maps Big Red Spot Plaid

Geometric Patterns One Big Blob & One-Valued One Big Blob

© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Detection

© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Detection

© Copyright 2000-2019 TIBCO Software Inc. Analysis: Cluster Incidents, View Signatures

© Copyright 2000-2019 TIBCO Software Inc. Analytic workflow – methodologies + demos Find and cluster anomalous events [Demo #1] • Transform wafer maps into vectorized coefficients, then cluster on quality • Many measured parameters, e.g., quality tests: storage fidelity, logic circuits • Approach 1: SVD + K-means • Focus on failure mode parameters • Approach 2: Bessel functions + hierarchical clustering • Radial basis functions • Rotationally invariant | Null-value tolerant | Efficient storage • Better than SVD + K-means for multi-parameter analysis Monitor anomalies [Demo #2] • Stream wafer data • Vectorize and cluster

Predict when and why anomalies occur [Demo #3] Demo data • Reduce dimensionality of very wide data • 733 dies • Train models to determine sensor importance • 20 x 40 layout • 1,500 measured parameters • Identify responsible process parameters • Color by die test (white = Process variable corrections/models rebasing pass all, colors = fail a test) • Identify new patterns as they emerge (e.g., incident analysis) • Factory monitoring staff can click to characterize the new pattern

© Copyright 2000-2019 TIBCO Software Inc. 21 (TIBCO Streaming operator for Python)

Capture and Ingest data Read data Preprocess Perform SVD on the send data to into Amazon from Kinesis and transform bin values for wafers livestream Kinesis for into TIBCO the data using Python further Streaming processing

Send data for live viewing into Spotfire Data Streams Combine data PMML operator to & results cluster the data using a model trained in TIBCO Data Science Refine clusters, identify wafers of interest

Send data to Spark/Hadoop cluster for more in-depth analysis in TIBCO Data Science

Retrain models for clustering and root-cause analysis

© Copyright 2000-2019 TIBCO Software Inc. 22 Anomaly Detection with Spatial Signature Analysis

23 Anomaly Analysis

Find and explore anomalous events

Monitor anomalies

Predict when and why they occur

© Copyright 2000-2019 TIBCO Software Inc. 24 Anomaly Detection and Analysis

Find and explore anomalous events

Monitor anomalies

Predict when and why they occur

© Copyright 2000-2019 TIBCO Software Inc. 25 © Copyright 2000-2019 TIBCO Software Inc. 26 © Copyright 2000-2019 TIBCO Software Inc. 27 Anomaly Detection and Analysis

Find and explore anomalous events

Monitor anomalies

Predict when and why they occur

© Copyright 2000-2019 TIBCO Software Inc. 28 © Copyright 2000-2019 TIBCO Software Inc. 29 © Copyright 2000-2019 TIBCO Software Inc. 30 © Copyright 2000-2019 TIBCO Software Inc. 31 Anomaly Detection and Analysis

Find and explore anomalous events

Monitor anomalies

Predict when and why they occur

© Copyright 2000-2019 TIBCO Software Inc. 32 Digital Twin for Semiconductor Yield

Digital Twins for Semiconductor Manufacturing Yield: Wide-and- Analysis

Build Models to Relate Product Yield Failure Modes ( Yi ) with Process Parameters ( Pi )

Product Test: Process Process Measurement Process Start Process Step i+2 Step 1 Step i Step 1 Step n Pass / Fail, Failure Modes ・・・・・・

P1 Pi Pi+1 Pi+2 Pn ~ 1K Steps, > 1M Process Parameters Equipment Sensor Data Physical Measurement Data Equipment Names

Run continuously to capture new Yield, relationships as they emerge Yield Components or Process Optimization Process Monitoring Clusters Process Control Digital Twin Yi = f(P P P … P ) Y = f(Y1 Y2, Y3, … Yn ) Yield Models 1 , 2, 3, n ,

© Copyright 2000-2019 TIBCO Software Inc. 33 The Extreme Challenge of Big & Wide Data

• Not just big data – many rows: lots, wafers, die, units • Also wide data – many columns: > 1M process parameters • Sensor traces • Time series for every sensor on each machine in each run • Physical measurements • Film thickness, critical dimensions, layer-to-layer overlay, defect classes & counts • Equipment and process attributes • Machine and component IDs, process recipe info • Supplies • Chemical batch IDs, QA sample data

“Today [semiconductor] fabs collect more than 5 billion sensor data points each day. The challenge is to turn massive amounts of data into valuable information.” —Ann Kellehere, VP of the Technology and Manufacturing Group,

© Copyright 2000-2019 TIBCO Software Inc. 34 Solution Architecture Data Prep, Feature Further Feature Selection Visualization of Engineering & Selection & Model Building Results

• In-database parallelized computing • In-memory dedicated fast server • Interactive in-memory • Leverages Hadoop, Apache Spark visualization environment

© Copyright 2000-2019 TIBCO Software Inc. 48 Performance Benchmarks & Conclusions

• Demonstrated performance for time series data from 20,000 sensors, 10,000 wafers in under 2 minutes • Current system scales to time series for 20,000 sensors, 100,000 wafers (2.5 TB) with results in 15 minutes • More capacity and better performance can be achieved by adding nodes to the Spark cluster • Working with top memory manufacturer to deploy production system • System can provide automated real-time feedback on emerging equipment issues affecting yield

Big Data Feature Selection Performance Benchmarks – Run Time1 (minutes) 20 Sensors 200 Sensors 2K Sensors 20K Sensors Dataset Size for (1K variables) (10K variables) (100K variables) (1M variables) 1M Variables 1K Wafers 0.47 0.48 0.72 1.0 25 GB 10K Wafers 0.50 0.53 0.77 1.75 253 GB 100K Wafers 0.53 0.67 1.25 15.15 2,530 GB

1Test Conditions: • Data stored in Hadoop data source • 25 node Spark cluster – 16 cores, 32 GB for each node • Each sensor time series compressed to 50 variables with SAX encoder prior to feature selection step

© Copyright 2000-2019 TIBCO Software Inc. 49 Longitudinal Anomaly Analysis

Subsequence Search

A New Method for Identifying Anomalous Patterns in Time Series (Trace Analytics)

50 Mueen’s Algorithm for Similarity Search

Mueen’s Algorithm for Similarity Search (MASS) is specialized for finding anomalous (versus typical) subsequences of time series

Extremely fast algorithm for this use case

Suitable for further acceleration using GPU

Material adapted from: https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx

© Copyright 2000-2019 TIBCO Software Inc. Mueen’s Algorithm for Similarity Search

Quickly create a matrix profile = a partial distance matrix This uses a sliding window to define a series of subsequences The Matrix Profile plots the distance of each subsequence to its nearest match, with the time sequence of the start of each subsequence on the x-axis

Red = Original Time Series

Blue = Matrix Profile Material adapted from: https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx

© Copyright 2000-2019 TIBCO Software Inc. How to “Read” a Matrix Profile Where you see relatively low values, you know that the subsequence in the original time series must have (at least one) relatively similar subsequence elsewhere in the data (such regions are “motifs” or reoccurring patterns) Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies)

© Copyright 2000-2019 TIBCO Software Inc. Manufacturing Batches Raw Amperage - Each color delimits a batch

Matrix Profile highlights anomalies - set sliding window close to batch size Community https://community.tibco.com/exchange

© Copyright 2000-2019 TIBCO Software Inc. 56 AI in Operations Cloud Starters, Accelerators, Analytic Apps Thoughtleader-Led Solutions

AI on Demand Data Science in Operations

© Copyright 2000-2019 TIBCO Software Inc. Questions & Contact

TIBCO Community Steven Hillion Michael O’Connell community.tibco.com Sr. Director, Data Science Chief Analytics Officer TIBCO Exchange [email protected] [email protected] community.tibco.com/exchange @StevenHillion @MichOConnell TIBCO Tech Blog community.tibco.com/blog

Acknowledgements Dr. Tom Hill, Mike Alperin, Nico Rode, David Katz, Jagrata Minardi, Siva Ramalingam

© Copyright 2000-2019 TIBCO Software Inc. 58 © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.