Industry Perspective: Big Data and Big Data Analytics

David Barnes
Program Director, Emerging Internet Technologies
IBM Software Group

What is Big Data? The Adjacent Possible

Inexpensive disk + Increased processing power + Data Warehouse + The Web + X = Big Data

X = Sensors used to gather climate information, posts to social media sites, digital pictures and videos, transaction records, phone GPS signals, and more.

161 exabytes of data were created in 2006 – 3 million times the amount of information contained in all the books ever written. In 2010 the number reached 988 exabytes.
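As a rough consistency check, the compound annual growth rate implied by the 2006 and 2010 figures can be computed directly. This is a minimal sketch: the function name is illustrative, and it assumes both figures measure the same quantity.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted above: 161 exabytes (2006) and 988 exabytes (2010).
rate = implied_cagr(161, 988, years=2010 - 2006)
print(f"Implied growth: {rate:.0%} per year")
```

The result is roughly 57% per year, in line with the 60%/year compound growth quoted under Today’s Problem.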

IDC estimates that 1.8 zettabytes were created and replicated in 2011.

© 2010 IBM Corporation

Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks.

Every month people send one billion Tweets and post 30 billion messages on Facebook.

90% (or more) of the world’s data is unstructured.

The true nature of information

Unstructured Data
• Is noisy
• Is oftentimes dirty
• Is often full of valuable information

The Big Data Imperative

Big Data has swept into every industry and business function.

Businesses need to put the power of Big Data analytics in the hands of their business employees – Data Scientist is somewhat misleading.

“Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented . . . managers.” – McKinsey Global Institute

Big Data Business Patterns
• Computational Journalism
• Chief Legal Officer
• Retail Business Planner
• IT Systems Management
• Pharma - Clinical Trials
• Business Fraud Detection
• Evidence Based Medicine
• Web Archiving

Today’s Problem

• Data is growing at a compound annual rate of 60%/year
• Storage capacity continues to increase dramatically
• Storage access speeds have not kept up
• At a transfer speed of 500 MB/sec, 1 terabyte of data requires ~30 mins to read from a single drive
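The read-time figure is simple division. The sketch below reproduces it under stated assumptions: decimal units (1 TB = 10^12 bytes), a sustained 500 MB/sec, and no seek or protocol overhead. The function name and the `drives` parameter (which models data split evenly across drives read in parallel) are illustrative.

```python
TERABYTE = 10**12  # decimal units assumed
MEGABYTE = 10**6

def read_time_seconds(data_bytes: float,
                      throughput_bytes_per_sec: float,
                      drives: int = 1) -> float:
    """Time to read `data_bytes` spread evenly across `drives` drives,
    all read in parallel at the same sustained throughput."""
    return data_bytes / (throughput_bytes_per_sec * drives)

single_drive = read_time_seconds(TERABYTE, 500 * MEGABYTE)
print(f"Single drive: {single_drive:.0f} s (~{single_drive / 60:.0f} min)")
```

A single drive needs 2,000 seconds, or about 33 minutes, which the slide rounds to ~30 mins.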

Enter Map/Reduce

• Automates the mechanisms of large-scale distributed computation (i.e. work distribution, load balancing, replication, failure/recovery)
• Divide & Conquer: 1 terabyte split among 100 drives requires ~20 seconds to read
• The M/R parallel processing model provides a cost-effective framework for a new generation of analytic applications on unstructured or semi-structured data
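The programming model behind Map/Reduce can be sketched in a few lines. The following is a single-process illustration only, not Hadoop's API: the names `map_reduce`, `word_mapper` and `count_reducer` are made up for this example. A real framework would run the map and reduce calls on many machines and take care of work distribution, load balancing, replication and failure recovery.

```python
from collections import defaultdict

def map_reduce(splits, mapper, reducer):
    """Run mapper over each input split, group values by key, then reduce."""
    # Map phase: each split is processed independently (parallelizable).
    intermediate = []
    for split in splits:
        intermediate.extend(mapper(split))

    # Shuffle phase: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key is reduced independently (parallelizable).
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(text):
    """Emit (word, 1) for every word in a chunk of text."""
    for word in text.lower().split():
        yield word, 1

def count_reducer(word, counts):
    """Sum the partial counts emitted for one word."""
    return sum(counts)

# Two input splits stand in for blocks of a large file.
splits = ["big data is big", "data about data"]
word_counts = map_reduce(splits, word_mapper, count_reducer)
print(word_counts)
```

Because neither the mapper nor the reducer shares state across splits or keys, the same two functions could be distributed over many drives and machines without change.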

Requirement: A New Class of Big Data Applications

Big Data analytics must be brought to the line-of-business user.

•Leverage easy-to-use manipulation metaphors

•Use natural language technologies for analytics

•Provide rich visualizations to quickly identify insights

Buyer Sentiment Analysis Demo

Social Media: Chilean Earthquake 2010

The 2010 Chilean earthquake was the fifth-largest earthquake in recorded history

The affected areas suffered major devastation - buildings, airports, hospitals, prisons, bridges, and roads were severely damaged

Land-based communications systems suffered major outages

The wireless 3G infrastructure remained intact and operational

Sharenomics - Rise of Social Economy

Social Media: Chilean Earthquake 2010

Social networking over wireless networks became the major form of communication

Extreme Blue students collected 226 million Tweets, then analyzed and categorized them by incidence type and location

Tweets included: Can I get food? Can I get gas? Are the bridges down? (and images)

The results were visualized

Completed in ~12 weeks
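Categorizing messages by incidence type and location could be sketched as follows. This is not the students' actual method, which the slides do not describe. The keyword lists, category names and sample locations are hypothetical, and plain substring matching is a deliberately crude stand-in for real text analytics.

```python
from collections import Counter

# Hypothetical keyword lists, loosely based on the example questions above.
INCIDENT_KEYWORDS = {
    "food": ("food", "water"),
    "fuel": ("gas", "fuel"),
    "infrastructure": ("bridge", "road"),
}

def categorize(tweet_text):
    """Return every incident type whose keywords appear in the text."""
    text = tweet_text.lower()
    return [incident
            for incident, keywords in INCIDENT_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)]

def tally(tweets):
    """Count tweets per (incident type, location) pair."""
    counts = Counter()
    for text, location in tweets:
        for incident in categorize(text):
            counts[(incident, location)] += 1
    return counts

# Made-up sample input; the locations are placeholders, not real data.
sample = [
    ("Can I get food?", "area A"),
    ("Can I get gas?", "area B"),
    ("Are the bridges down?", "area A"),
]
counts = tally(sample)
for (incident, location), n in sorted(counts.items()):
    print(incident, location, n)
```

Counts keyed by incident type and location are the kind of table a map or dashboard visualization can be built from.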

Big Data = Volume, Variety and Velocity

• Volume - Scale from terabytes to zettabytes
• Variety - Relational and non-relational data types from an ever-expanding variety of sources
• Velocity - Streaming data and large volume data movement


The supercomputer is based on more than 1,200 high-powered IBM System x servers and can perform 150 trillion calculations per second, equivalent to 30 million calculations per Danish citizen per second.
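The per-citizen comparison is a single division. The snippet below back-solves the population figure that the comparison assumes, using only the two numbers quoted above; the variable names are illustrative.

```python
calculations_per_second = 150e12   # 150 trillion, as quoted
per_citizen_per_second = 30e6      # 30 million, as quoted

# Population implied by dividing the total rate by the per-citizen rate.
implied_population = calculations_per_second / per_citizen_per_second
print(f"Implied population: {implied_population:,.0f}")
```

The comparison therefore assumes a population of about five million.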

Vestas expects its data sets will grow to 20-plus petabytes over the next four years.

Seton Healthcare Family

Reducing CHF readmission to improve care

“IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM , enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”

Business Challenge

Seton Healthcare strives to reduce the occurrence of high cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.

Smarter Business Outcomes

• Seton will be able to proactively target care management and reduce re-admission of CHF patients.
• Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for re-admission and introduce early interventions to reduce cost, mortality

What’s Smart?

IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:

• Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action

IBM solution

• IBM Content and Predictive Analytics for Healthcare
• IBM Cognos Business Intelligence
• IBM BAO solution services
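For readers unfamiliar with the term, positive predictive value is the share of patients flagged by a model who are in fact readmitted. A minimal sketch of the standard definition follows; the function name is illustrative and the counts are made up, not Seton results.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged cases that turn out to be real positives."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no cases were flagged")
    return true_positives / flagged

# Made-up counts for illustration only: 50 patients flagged, 40 readmitted.
ppv = positive_predictive_value(true_positives=40, false_positives=10)
print(f"PPV: {ppv:.0%}")
```

A high value means care managers spend little effort on patients who would not have been readmitted anyway.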

© 2011 IBM Corporation

IBM Content and Predictive Analytics for Healthcare

The Seton CHF Readmission Solution

[Architecture diagram; recoverable labels follow]

Stages: Raw Information; IBM Content and Predictive Analytics; Analyzed and Visualized Information; Dynamic Multimode Interaction

Inputs
• Unstructured Data (Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram.)
• Structured Data (Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)

Content Analytics
• Natural Language Processing
• Medical Fact and Relationship Extraction (Annotation)
• Trend, Pattern, Anomaly, Deviation Analysis

Predictive Analytics
• Predictive Scoring and Probability Analysis

Interaction
• Search and Visually Explore (Mine)
• Monitor, Dashboard and Report (Cognos BI)
• Question and Answer*

Callouts
• Utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action
• IBM Watson for Healthcare: Confirm hypotheses or seek alternative ideas with confidence based responses from learned knowledge*

Supporting components: Health Data Warehouse and Model; Custom Solutions; Integration Framework; Master Data Management; Advanced Case Management

Partners (HLI); Specialized Research; Business Analytics

What Really Causes Readmissions at Seton

Key Findings

The Data We Thought Would Be Useful … Wasn’t
• 113 candidate predictors from structured and unstructured data sources
• Structured data was less reliable than unstructured data – increased the reliance on unstructured data

New Unexpected Indicators Emerged … Highly Predictive Model
• 18 accurate indicators or predictors (see next slide)

Predictor Analysis          % Encounters          % Encounters
                            Structured Data       Unstructured Data
Ejection Fraction (LVEF)    2%                    74%
Smoking Indicator           35% (65% Accurate)    81% (95% Accurate)
Living Arrangements         <1%                   73% (100% Accurate)
Drug and Alcohol Abuse      16%                   81%
Assisted Living             0%                    13%

97% at 80th percentile
49% at 20th percentile

Visualizing the Results: Readmissions Dashboard

The Cognos dashboard reporting system helps monitor the key clinical, operational and financial metrics and, more importantly, track down the top-priority cases for case management.

Dashboard panels:
1. Clinical Statistics: admission count, readmission count and readmission rate
2. Operational Statistic: Counts of different length of stay periods
3. Financial Statistic: Total direct cost by total admission and by readmission
4. Mortality: mortality rate
5. Average length of stay
6. Average direct cost by total admission and by readmission only
7. PA Model Score: Distribution of propensity of readmission


USC Annenberg School of Communications

InfoSphere Streams

Big Data Platform Vision

Bringing Big Data to the Enterprise

[Platform diagram; recoverable labels follow]

Big Data Solutions: Client and Partner Solutions

Big Data User Environments: Developers, End Users, Administrators

Big Data Enterprise Engines: Streaming Analytics, Internet Scale Analytics

Open Source Foundational Components: Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql

Integration / Agents, connecting to:
• Data Warehouse: InfoSphere Warehouse
• Warehouse Appliances: Netezza
• Master Data Mgmt: InfoSphere MDM
• Database: DB2
• Analytics: SPSS
• Business Intelligence: Cognos
• Marketing: Unica