A I M 2 0 1 - S Hot paths to anomaly detection with TIBCO data science, streaming on AWS
Steven Hillion Michael O’Connell Sr Director, Data Science Chief Analytics Officer @StevenHillion @MichOConnell
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. The Ideal Data Science Platform
Offer Cross-Sell Products Streaming
Predict Impending Equipment Failure
Data Marts Optimize Batch Scoring Pricing
Transactional Stores Real-Time Scoring Prevent Fraud
Live Applications Point-Click | Code Manage Inventory External Sources Data Access & Prep Edge Devices Modeling, ML, DL
© Copyright 2000-2019 TIBCO Software Inc. 3 Agenda
• TIBCO Data Science and AWS Marketplace
• The TIBCO Connected Intelligence Cloud
• Anomaly Detection and Analysis
• Demonstration – Spatial Anomaly Analysis
• Links and Assets
© Copyright 2000-2019 TIBCO Software Inc. 4 TIBCO Data Science
Data Access/Prep Modeling Operations Business Apps
+ Distributed compute + Visual composition + Model lifecycle management + Engineering/IoT FUNCTION + Feature engineering + Multilingual notebook + Batch automation + Customer analytics + Reusable templates + Native ML & OS + Real-time event processing + Risk management + Auto-ML, data prep + REST, applications, embedding + Supply chain
Medic; e.g., researcher on epidemic monitoring
Engineer; e.g., aerodynamics engineer
Marketeer; e.g., customer engagement analyst
Business User USER or Data Scientist Data Scientist Analytics Operations Analytics Operations AUTOMATION Citizen Data Scientist Citizen Data Scientist IT / Software Engineer IT / Administration
© Copyright 2000-2019 TIBCO Software Inc. TIBCO Data Science on AWS TIBCO DS on AWS Marketplace; Biggest vCPU Grid; Lightest Serverless Footprint
© Copyright 2000-2019 TIBCO Software Inc. 6 Data Science in the Cloud: Leidos Healthcare Analytics
Leidos Collaborative Advanced Analytics & Data Sharing Platform (CAADS) uses TIBCO Data Science and AWS to deliver analytics services in Healthcare CDC NIH CMS NASA
Disease Outbreaks: Disease Outbreaks: Data Governance: Space Exploration: Determining the Run simulations of Analyzing and Analyzing human cause of an HIV disease propagation consolidating data factors that affect the outbreak in the to guide public policy, around emerging ability to transport Midwest specifically around Healthcare policies astronauts on long the Zika virus across 56 regions in fights (e.g., to Mars) the United States
© Copyright 2000-2019 TIBCO Software Inc. 7 TIBCO analytics transformation platform Powered by shared data assets
ANALYTICS ACTIONS Visual Analytics Data Science Operational Analytics AUGMENT INTELLIGENCE
DATA OPERATIONS INFORMATION MANAGEMENT
METADATA MDM / RDM Metadata Spark Compute & Store Data Virtualization Metadata Master and Master & (Data Governance, Reference Data UNIFY Reference Data Data Catalog) Management DATA Data Store 1 Data Store n EDW/Lake Transactional Data
EVENTS INTEGRATION
Streaming / Hybrid API Events Integration Management INTERCONNECT EVERYTHING
DATA SOURCES Streaming In-memory Enterprise ... Enterprise Sources Data Grid Application 1 Application n
Cloud Native Open Platform AI Foundation
© Copyright 2000-2019 TIBCO Software Inc. TIBCO Data Science and AWS
EDWs & Data Operational Analytics Analytics Sources Systems Sandboxes Platform
Federated Data Amazon SageMaker
Science Containers/Code Real Services Streaming Business Events & - StreamBase
Time (BW) Time SQL
Flogo Spark TIBCO Data APIs andMicroservices Amazon EMR Science Amazon SageMaker Model Management
Data Marts Batch(DV) Automation
Visual UI / Amazon Redshift Notebook Collaboration / Transactional Project
Stores
Manage Execute TIBCO Live Apps
Amazon Simple Storage Service (S3) External Visual Analytics Sources (TIBCO Spotfire)
AWS Deployments © Copyright 2000-2019 TIBCO Software Inc. 9 TIBCO Data Science Solutions on AWS Cloud Apps: Visual Analytics, Data Science, Streaming, Case Management
Anomaly Detection Risk Management Customer Engagement Starter Set
Process Mining IoT Analytics Anomaly Detection Risk Management Customer Engagement Blockchain – Dovetail Partner Management Starter Toolkit
Review Status: TIBCO Spotfire Model: TIBCO Data Science Analyze Event Stream: Case Manage: TIBCO Live Apps Identify issues, sweet spots Supervised: Train TIBCO Flogo, Cloud Integration Investigate identified cases Unsupervised: Anomalies Batch and Real-Time Updates Audit trail + recycle
© Copyright 2000-2019 TIBCO Software Inc. Anomaly Analysis Solution Overview
1 Collect data from equipment, normalize, model to predict 3 Alert raised and case created – TIBCO Cloud magnitude of anomaly – TIBCO Data Science & AWS Live Apps
2 Model detects anomaly – TIBCO Data Science 4 Case manager investigates and takes action to the equipment – TIBCO Cloud Integration
© Copyright 2000-2019 TIBCO Software Inc. Data Challenges in High-Tech Manufacturing
Stack 3D Flash Memory Cell Layers Vertically 96 Memory Cell Layers
© Copyright 2000-2019 TIBCO Software Inc. Hot Paths to Anomaly Detection
14 Longitudinal Anomaly Analysis
© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Analysis
© Copyright 2000-2019 TIBCO Software Inc. Spatial Anomaly Analysis
One-Valued Wafer Maps Big Red Spot Plaid
Geometric Patterns One Big Blob & One-Valued One Big Blob
© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Detection
© Copyright 2000-2019 TIBCO Software Inc. Cross-Sectional Anomaly Detection
© Copyright 2000-2019 TIBCO Software Inc. Analysis: Cluster Incidents, View Signatures
© Copyright 2000-2019 TIBCO Software Inc. Analytic workflow – methodologies + demos Find and cluster anomalous events [Demo #1] • Transform wafer maps into vectorized coefficients, then cluster on quality • Many measured parameters, e.g., quality tests: storage fidelity, logic circuits • Approach 1: SVD + K-means • Focus on failure mode parameters • Approach 2: Bessel functions + hierarchical clustering • Radial basis functions • Rotationally invariant | Null-value tolerant | Efficient storage • Better than SVD + K-means for multi-parameter analysis Monitor anomalies [Demo #2] • Stream wafer data • Vectorize and cluster
Predict when and why anomalies occur [Demo #3] Demo data • Reduce dimensionality of very wide data • 733 dies • Train models to determine sensor importance • 20 x 40 layout • 1,500 measured parameters • Identify responsible process parameters • Color by die test (white = Process variable corrections/models rebasing pass all, colors = fail a test) • Identify new patterns as they emerge (e.g., incident analysis) • Factory monitoring staff can click to characterize the new pattern
© Copyright 2000-2019 TIBCO Software Inc. 21 (TIBCO Streaming operator for Python)
Capture and Ingest data Read data Preprocess Perform SVD on the send data to into Amazon from Kinesis and transform bin values for wafers livestream Kinesis for into TIBCO the data using Python further Streaming processing
Send data for live viewing into Spotfire Data Streams Combine data PMML operator to & results cluster the data using a model trained in TIBCO Data Science Refine clusters, identify wafers of interest
Send data to Spark/Hadoop cluster for more in-depth analysis in TIBCO Data Science
Retrain models for clustering and root-cause analysis
© Copyright 2000-2019 TIBCO Software Inc. 22 Anomaly Detection with Spatial Signature Analysis
23 Anomaly Analysis
Find and explore anomalous events
Monitor anomalies
Predict when and why they occur
© Copyright 2000-2019 TIBCO Software Inc. 24 Anomaly Detection and Analysis
Find and explore anomalous events
Monitor anomalies
Predict when and why they occur
© Copyright 2000-2019 TIBCO Software Inc. 25 © Copyright 2000-2019 TIBCO Software Inc. 26 © Copyright 2000-2019 TIBCO Software Inc. 27 Anomaly Detection and Analysis
Find and explore anomalous events
Monitor anomalies
Predict when and why they occur
© Copyright 2000-2019 TIBCO Software Inc. 28 © Copyright 2000-2019 TIBCO Software Inc. 29 © Copyright 2000-2019 TIBCO Software Inc. 30 © Copyright 2000-2019 TIBCO Software Inc. 31 Anomaly Detection and Analysis
Find and explore anomalous events
Monitor anomalies
Predict when and why they occur
© Copyright 2000-2019 TIBCO Software Inc. 32 Digital Twin for Semiconductor Yield
Digital Twins for Semiconductor Manufacturing Yield: Wide-and-Big Data Analysis
Build Models to Relate Product Yield Failure Modes ( Yi ) with Process Parameters ( Pi )
Product Test: Process Process Measurement Process Start Process Step i+2 Step 1 Step i Step 1 Step n Pass / Fail, Failure Modes ・・・・・・
P1 Pi Pi+1 Pi+2 Pn ~ 1K Steps, > 1M Process Parameters Equipment Sensor Data Physical Measurement Data Equipment Names
Run continuously to capture new Yield, relationships as they emerge Yield Components or Process Optimization Process Monitoring Clusters Process Control Digital Twin Yi = f(P P P … P ) Y = f(Y1 Y2, Y3, … Yn ) Yield Models 1 , 2, 3, n ,
© Copyright 2000-2019 TIBCO Software Inc. 33 The Extreme Challenge of Big & Wide Data
• Not just big data – many rows: lots, wafers, die, units • Also wide data – many columns: > 1M process parameters • Sensor traces • Time series for every sensor on each machine in each run • Physical measurements • Film thickness, critical dimensions, layer-to-layer overlay, defect classes & counts • Equipment and process attributes • Machine and component IDs, process recipe info • Supplies • Chemical batch IDs, QA sample data
“Today [semiconductor] fabs collect more than 5 billion sensor data points each day. The challenge is to turn massive amounts of data into valuable information.” —Ann Kellehere, VP of the Technology and Manufacturing Group, Intel
© Copyright 2000-2019 TIBCO Software Inc. 34 Solution Architecture Data Prep, Feature Further Feature Selection Visualization of Engineering & Selection & Model Building Results
• In-database parallelized computing • In-memory dedicated fast server • Interactive in-memory • Leverages Hadoop, Apache Spark visualization environment
© Copyright 2000-2019 TIBCO Software Inc. 48 Performance Benchmarks & Conclusions
• Demonstrated performance for time series data from 20,000 sensors, 10,000 wafers in under 2 minutes • Current system scales to time series for 20,000 sensors, 100,000 wafers (2.5 TB) with results in 15 minutes • More capacity and better performance can be achieved by adding nodes to the Spark cluster • Working with top memory manufacturer to deploy production system • System can provide automated real-time feedback on emerging equipment issues affecting yield
Big Data Feature Selection Performance Benchmarks – Run Time1 (minutes) 20 Sensors 200 Sensors 2K Sensors 20K Sensors Dataset Size for (1K variables) (10K variables) (100K variables) (1M variables) 1M Variables 1K Wafers 0.47 0.48 0.72 1.0 25 GB 10K Wafers 0.50 0.53 0.77 1.75 253 GB 100K Wafers 0.53 0.67 1.25 15.15 2,530 GB
1Test Conditions: • Data stored in Hadoop data source • 25 node Spark cluster – 16 cores, 32 GB for each node • Each sensor time series compressed to 50 variables with SAX encoder prior to feature selection step
© Copyright 2000-2019 TIBCO Software Inc. 49 Longitudinal Anomaly Analysis
Subsequence Search
A New Method for Identifying Anomalous Patterns in Time Series (Trace Analytics)
50 Mueen’s Algorithm for Similarity Search
Mueen’s Algorithm for Similarity Search (MASS) is specialized for finding anomalous (versus typical) subsequences of time series
Extremely fast algorithm for this use case
Suitable for further acceleration using GPU
Material adapted from: https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx
© Copyright 2000-2019 TIBCO Software Inc. Mueen’s Algorithm for Similarity Search
Quickly create a matrix profile = a partial distance matrix This uses a sliding window to define a series of subsequences The Matrix Profile plots the distance of each subsequence to its nearest match, with the time sequence of the start of each subsequence on the x-axis
Red = Original Time Series
Blue = Matrix Profile Material adapted from: https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx
© Copyright 2000-2019 TIBCO Software Inc. How to “Read” a Matrix Profile Where you see relatively low values, you know that the subsequence in the original time series must have (at least one) relatively similar subsequence elsewhere in the data (such regions are “motifs” or reoccurring patterns) Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies)
© Copyright 2000-2019 TIBCO Software Inc. Manufacturing Batches Raw Amperage - Each color delimits a batch
Matrix Profile highlights anomalies - set sliding window close to batch size Community https://community.tibco.com/exchange
© Copyright 2000-2019 TIBCO Software Inc. 56 AI in Operations Cloud Starters, Accelerators, Analytic Apps Thoughtleader-Led Solutions
AI on Demand Data Science in Operations
© Copyright 2000-2019 TIBCO Software Inc. Questions & Contact
TIBCO Community Steven Hillion Michael O’Connell community.tibco.com Sr. Director, Data Science Chief Analytics Officer TIBCO Exchange [email protected] [email protected] community.tibco.com/exchange @StevenHillion @MichOConnell TIBCO Tech Blog community.tibco.com/blog
Acknowledgements Dr. Tom Hill, Mike Alperin, Nico Rode, David Katz, Jagrata Minardi, Siva Ramalingam
© Copyright 2000-2019 TIBCO Software Inc. 58 © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.