Industry Perspective: Big Data and Big Data Analytics
David Barnes
Program Director, Emerging Internet Technologies
IBM Software Group

What is Big Data?

The Adjacent Possible
Inexpensive disk + Increased processing power + Data Warehouse + The Web + X = Big Data
X = Sensors used to gather climate information, posts to social media sites, digital pictures and videos, transaction records, cell phone GPS signals, and more.

161 exabytes of data were created in 2006, 3 million times the amount of information contained in all the books ever written. In 2010 the number reached 988 exabytes. IDC estimates that 1.8 zettabytes were created and replicated in 2011.

© 2010 IBM Corporation

Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks. Every month people send one billion Tweets and post 30 billion messages on Facebook. 90% (or more) of the world’s data is unstructured.

The true nature of information
Unstructured Data
• Is noisy
• Is often dirty
• Is often full of valuable information

The Big Data Imperative
Big Data has swept into every industry and business function.
Businesses need to put the power of Big Data analytics in the hands of their business employees. The term “Data Scientist” is somewhat misleading.

Big Data Business Patterns
• Computational Journalism
• Chief Legal Officer
• Retail Business Planner
• IT Systems Management
• Pharma - Clinical Trials
• Business Fraud Detection
• Evidence Based Medicine
• Web Archiving

“Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented
managers.” - McKinsey Global Institute

Today’s Problem
• Data is growing at a compound annual rate of 60%/year
• Storage capacity continues to increase dramatically
• Storage access speeds have not kept up
• At a transfer speed of 500 MB/sec, 1 terabyte of data will require ~30 mins to read from a single drive

Enter Map/Reduce
• Automates the mechanisms of large-scale distributed computation (i.e. work distribution, load balancing, replication, failure/recovery)
• Divide & Conquer: 1 terabyte split among 100 drives will require ~20 seconds to read
• The M/R parallel processing model provides a cost-effective framework for a new generation of analytic applications on unstructured or semi-structured data

Requirement: A New Class of Big Data Applications
Big Data analytics must be brought to the line-of-business user.
• Leverage easy-to-use manipulation metaphors
• Use natural language technologies for analytics
• Provide rich visualizations to quickly identify insights

Buyer Sentiment Analysis Demo

Social Media: Chilean Earthquake 2010
• The 2010 Chilean earthquake was the fifth largest earthquake in recorded history
• The affected areas suffered major devastation: buildings, airports, hospitals, prisons, bridges, and roads were severely damaged
• Land-based communications systems suffered major outages
• The wireless 3G infrastructure remained intact and operational

Sharenomics - Rise of Social Economy

Social Media: Chilean Earthquake 2010
• Social networking on wireless networks became a major form of communication
• Extreme Blue students collected 226 million Tweets, then analyzed and categorized them by incidence type and location
• Tweets included: Can I get food? Can I get gas?
Are the bridges down? - images
• The results were visualized
• Completed in ~12 weeks

Big Data = Volume, Variety and Velocity
• Volume - Scale from terabytes to zettabytes
• Variety - Relational and non-relational data types from an ever-expanding variety of sources
• Velocity - Streaming data and large volume data movement

The supercomputer is based on over 1,200 high-powered IBM System x servers and can perform 150 trillion calculations per second, equivalent to 30 million calculations per Danish citizen per second. Vestas expects its data sets will grow to 20-plus petabytes over the next four years.

Seton Healthcare Family
Reducing CHF readmission to improve care

“IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”

Business Challenge
Seton Healthcare strives to reduce the occurrence of high-cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.

Smarter Business Outcomes
• Seton will be able to proactively target care management and reduce re-admission of CHF patients.
• Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for re-admission and introduce early interventions to reduce cost, mortality

What’s Smart?
The IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:
• Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action

IBM solution
• IBM Content and Predictive Analytics for Healthcare
• IBM Cognos Business Intelligence
• IBM BAO solution services

© 2011 IBM Corporation

IBM Content and Predictive Analytics for Healthcare
The Seton CHF Readmission Solution
[Architecture diagram. Recoverable labels:]
• Flow: Raw Information -> IBM Content and Predictive Analytics -> Analyzed and Visualized Information -> Dynamic Multimode Interaction
• Unstructured Data (Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram)
• Structured Data (Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)
• Content Analytics: Natural Language Processing; Medical Fact and Relationship Extraction (Annotation); Trend, Pattern, Anomaly, Deviation Analysis
• Predictive Analytics: Predictive Scoring and Probability Analysis
• Interaction: Search and Visually Explore (Mine); Monitor, Dashboard and Report (Cognos BI); Question and Answer*
• IBM Watson for Healthcare: Confirm hypotheses or seek alternative ideas with confidence-based responses from learned knowledge*
• Supporting components: Health Data Warehouse and Model, Custom Solutions, Integration Framework, Master Data Management, Advanced Case Management, Partners (HLI), Specialized Research, Business Analytics

What Really Causes Readmissions at Seton
Key Findings

The Data We Thought Would Be Useful … Wasn’t
• 113 candidate predictors from structured and unstructured data sources
• Structured data was less reliable than unstructured data, which increased the reliance on unstructured data

New Unexpected Indicators Emerged … Highly Predictive Model
• 18 accurate indicators or predictors (see next slide)
• 97% at 80th percentile
• 49% at 20th percentile

Predictor Analysis
Predictor                  % Encounters, Structured Data   % Encounters, Unstructured Data
Ejection Fraction (LVEF)   2%                              74%
Smoking Indicator          35% (65% Accurate)              81% (95% Accurate)
Living Arrangements        <1%                             73% (100% Accurate)
Drug and Alcohol Abuse     16%                             81%
Assisted Living            0%                              13%

Visualizing the Results: Readmissions Dashboard
The Cognos dashboard reporting system can help in monitoring the key clinical, operational and financial metrics. More importantly, it helps track down the top-priority cases for case management.
1. Clinical Statistics: admission count, readmission count and readmission rate
2. Operational Statistic: Counts of different length of stay periods
3. Financial Statistic: Total direct cost by total admission and by readmission
4. Mortality: mortality rate
5. Average length of stay
6. Average direct cost by total admission and by readmission only
7. PA Model Score: Distribution of propensity of readmission

USC Annenberg School of Communications

InfoSphere Streams

Big Data Platform Vision
Bringing Big Data to the Enterprise
[Platform diagram. Recoverable labels:]
• Big Data Solutions: Client and Partner Solutions
• Big Data User Environments: Developers, End Users, Administrators
• Big Data Enterprise Engines: Streaming Analytics, Internet Scale Analytics
• Open Source Foundational Components: Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql
• Integration / Agents
• Data Warehouse: InfoSphere Warehouse
• Warehouse Appliances: Netezza
• Master Data Mgmt: InfoSphere MDM
• Database: DB2
• Analytics: SPSS
• Business Intelligence: Cognos
• Marketing: Unica
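
The read-time figures quoted on the "Today's Problem" and "Enter Map/Reduce" slides (~30 mins from a single drive, ~20 seconds across 100 drives) can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration, not part of the original deck. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split with no coordination overhead; the function name `read_seconds` is hypothetical.

```python
# Back-of-the-envelope check of the read-time figures on the Map/Reduce slide.
# Assumption (not stated in the slides): decimal units, 1 TB = 10**12 bytes
# and 1 MB = 10**6 bytes.

TB = 10**12
MB = 10**6


def read_seconds(total_bytes: int, bytes_per_sec: int, drives: int = 1) -> float:
    """Seconds to read total_bytes when the data is split evenly across
    `drives` drives, each read in parallel at bytes_per_sec."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return total_bytes / (bytes_per_sec * drives)


single = read_seconds(1 * TB, 500 * MB)          # one drive
parallel = read_seconds(1 * TB, 500 * MB, 100)   # 100 drives read in parallel

print(f"single drive: {single / 60:.1f} minutes")
print(f"100 drives:   {parallel:.0f} seconds")
```

Under these assumptions the single-drive read takes 2,000 seconds (about 33 minutes) and the 100-drive read takes 20 seconds, which is consistent with the rounded figures on the slides. Real clusters would see somewhat longer times once scheduling, replication and network transfer are included.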