Discovering Knowledge from Massive Networks and Science Data – Next Frontier for HPC

Alok Choudhary John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Northwestern University [email protected] DOE – CSGF July 2012

1 ACKNOWLEDGEMENTS Data-Data Everywhere

Biomedical Data

Information Assurance Network Intrusion Detection

Homeland Security Geo-spatial intelligence

2 Business

BIG DATA, High-End Analytics,

Engineering Science “Data intensive” vs “Data Driven”

Data Intensive (DI) Data Driven (DD) • Operations are driven (and • Depends on the defined) by data perspective – Massive transactions – Processor, memory, – BIG analytics • Top-down query (well-defined application, storage? operations) • Bottom up discovery • An application can be (unpredictable time-to-result) data intensive without – BIG data processing – Predictive computing (necessarily) being • Usage model further I/O intensive differentiates these – Single App, users – Large number, sharing, historical/temporal

Very few large-scale applications of practical importance are NOT Data Intensive 4 The Data Driven Discovery Ecosystem

Transactional: Data Generation

Data reduction

Data Act, Refine Historical : Data Management Feedback Processing/Organization

Data Query

Relational: Discovery/Predictive Modeling

(A)

Historical data Learning Models Trigger/questions Predict Supercomputers (Current): Illustration of Simulation Dataset Sizes Application On-Line Data Off-Line Data FLASH: Buoyancy-Driven Turbulent Nuclear Burning 75TB 300TB

Reactor Core Hydrodynamics 2TB 5TB Computational Nuclear Structure 4TB 40TB Computational Protein Structure 1TB 2TB Performance Evaluation and Analysis 1TB 1TB Kinetics and Thermodynamics of Metal and 5TB 100TB Complex Hydride Nanoparticles

Climate Science 10TB 345TB Parkinson's Disease 2.5TB 50TB Plasma Microturbulence 2TB 10TB Lattice QCD 1TB 44TB Thermal Striping in Sodium Cooled Reactors 4TB 8TB Gating Mechanisms of Membrane Proteins 10TB 10TB 6 Outline

• Data Driven Approach • Beyond Broad Trends Climate Science • Identifying Patterns and Predictions

Scalable Text, • Sentiment • Influence Network Analysis • Networks

Social, Political • Measuring the Pulse in real-time, Understanding Egyptian and Business Revolution • Networks and Action Based Applications Connections 7 Climate Change: The defining issue of our era

• The planet is warming • Multiple lines of evidence • Credible link to human GHG (green house gas) emissions

• Consequences can be dire • Extreme weather events, regional climate and ecosystem shifts, abrupt climate change, stress on key resources and critical infrastructures

• There is an urgency to act • Adaptation: “Manage the unavoidable” • Mitigation: “Avoid the unmanageable” Russia Burns, Moscow

Chokes • The societal cost of both NATIONAL GEOGRAPHIC, 2010 The Vanishing of the action and inaction is large Arctic Ice cap ecology.com, 2008 Key outstanding science challenge: Actionable predictive insights to credibly inform policy Understanding Climate Change - Physics based Approach

General Circulation Models: Mathematical models with physical equations based on fluid dynamics

Cell (K) 1919 - Parameterization and non-linearity Clouds of differential Land equations are Ocean

sources for from1880 Anomalies uncertainty!

Figure Courtesy: ORNL Physics-based models are essential but not adequate Disagreement between IPCC models

– Relatively reliable predictions at global scale for ancillary variables such as temperature

– Least reliable predictions for variables that are crucial for impact assessment such as regional precipitation

“The sad truth of climate science is that the most crucial information is the least reliable” (Nature, 2010) Regional hydrology exhibits large variations among major IPCC model projections Example Use Cases: Extreme Events Prediction

NH Tropical Cyclone (TC) Activity Climate-Meningitis Outlook

Northern Indian North Pacific North Atlantic West Sahel East Sahel

Hurricane Typhoon Hurricane Cyclone

Forecasting NA Hurricane Tracks Modeling a Climate System as a Network

Climate Data

Correlation between two anomaly time series Stat. significant correlations

Anomaly time series at each node

Extreme Normal Phase Phase Climate Network VWS SST SLP

Edge weights: significant correlations Nodes in the graph: grid points on the globe

Multivariate Networks

Multiphase Networks Towards Predictive Insights: Hurricane Frequency

Steps for Discovery of Multivariate Non-linear Interactions

IPCC AR4 Models: CMIP3 datasets Monthly mean sea surface temperature Monthly mean atmospheric temperature Daily horizontal wind at 250/850 hPa

1. Pre-process ancillary climate 2. Construct multivariate 3. Detect & track model outputs nonlinear climate network communities Steps for Predictive Modeling of Hurricanes

SST impact on hurricane(A) frequency & intensity

5. Determine non-stationary 3. Find non-linear 4. Validate w/ non-i.i.d. climates states & relationships hindcasts Build hurricane models Decadal Trends Discovery via Community Dynamics Characterize the evolution of network communities that are: – Recurring/stable or – Exhibiting significant shifts – With long distance connections

● Evidence of modulation in planetary-scale climatic pattern ● Stable teleconnections between

Nino-3 region and Indian Ocean ● Realignment of Sahel region to northern Africa

─ Indirect evidence of desertification

Surface Air Temperature Community Dynamics A Closer Look: Desertification in the Sahel Forecasting Hurricane Tracks

Improving but have mean error (>185km) beyond 48 h

Physics-based Models What if the error gets interpolated to 10- 15 day in advance forecast?

~500 km

HURDAT Historic Data

(A)

Src: NOAA/NHC: http://www.nhc.noaa.gov/verification/index.shtml Hurricane End-game Track Forecast Forecast 10-15 days in advance the end-game of a North Atlantic since hurricane embryonic formation in Western Africa. • Nearly east-oriented SLP edges suggest horizontal pressure gradient configuration in the same direction. • Based on Buys Ballot’s law, this pressure gradient would be associated with wind flow in the north-south direction. • Onshore wind anomaly flow would promote favorable conditions for landfall; opposite flow anomaly would be more favorable for hurricanes tracks in no-landfall. SLP (yellow/dashed) and SST (red/solid) (+)correlated Performance of Land-hitting vs. Offshore teleconnections; LOO 10-FOLD L—biased toward land-hitting SLP SST SLP+SST SLP SST tracks; Accuracy 0.88 0.90 0.92 0.90 0.90 O—biased toward offshore Sensitivity 0.91 0.96 0.97 0.95 0.97 tracks. Specificity 0.77 0.76 0.81 0.80 0.74 Precision 0.90 0.90 0.92 0.92 0.90 F1-meas. 0.90 0.93 0.94 0.93 Slide0.93 17 CMIP3  CMIP5 => BIG DATA and Computing

● Coupled Model Inter comparison Project ● Spatial resolution: 1 – 0.25 degrees ● Temporal resolution: 6 hours – 3 hours ● Models: 24 - 37 ● Simulation experiments: 10s - 100s – Control runs & hindcast – Decadal & centennial-scale forecasts ● Covers 1000s of simulation years ● 100+ variables ● 10s of TBs to 10s of PBs

Summary of CMIP5 model experiments, grouped into three tiers

Slide 18 Transformative Computer Science Research

Enabling large-scale data-driven science for complex, multivariate, spatio-temporal, non-linear, and dynamic systems:

• Fusion plasma End-to-end demonstration of this Complex Networks • Combustion major paradigm for future knowledge Enable studying of collective • Astrophysics discovery process. behavior of interacting eco- • …. climate systems

Nonlinear, space-time lag, Relationship Mining geographical, multi-scale Enable discovery of complex Predictive Modeling dependence structures such as non Enable predictive modeling of linear associations or long range spatial typical and extreme behavior relationships Community structure- dependencies from multivariate spatio- function-dynamics temporal data

Nonlinear, spatio-temporal, multi- kernels , features, dependencies variate, persistence, long memory Nonlinear, spatio-temporal, Crucial multivariate

High Performance Computing Enable efficient large-scale spatio-temporal analytics on future generation exascale HPC platforms with complex memory hierarchies Large scale, spatio-temporal, unstructured, dynamic From Science to Social • People/Customers/fans are interacting points in space-time • Similarity of interests defines communities • Communication across globes defines networks

Society Activity/interaction based Network

Edge weights: significant interactions/influence Nodes in the graph: people/brands/…

Action-Based Connections Massive Data and Influence Tracking Social Networks and Analysis Mining interest

BRAND Scalable Analytics

interest Multi-language Learning and Sentiment Analytics Predictive Modeling

(c) ANC http://www.pulseofthetweeters.com About Twitter  Twitter: a micro blogging social network  Millions of users  Short messages of up to 140 characters called `tweets’  ~100 million tweets per day  News, Politics, Sports, Entertainment …..  Trending topics on Twitter

7/27/2012 21 21 Social Network Pulse and (Soft) Real-time Sentiment Analysis

Static (Slowly Changing) Dynamic Topic based Social Network(1000 users) response network

22 Identifying Influencers

X

Y

Topic A Topic B

n W (U ,U ) R  (1  d )  d  R e i x x  i m i 1 W (U ,U ) 7/27/2012  e i j 25 j 1

• The 2011 Egyptian revolution resulted in the removal of longtime leader

• By most accounts, social Protests in ’s media played an integral role in organizing and building support for the revolution Ousted President Hosni Mubarak 26 Al Jazeera CNN Mona Eltahawy The Twitter Revolution The Nation 11% 16% 37% 16% 20%

Over 800,000 tweets in six trending topics on Egyptian revolution Topics: , cairo, tahrir, egyptians, hosni_mubarak, and omar_suleiman

Leaders

Influencer Description Twitter user names • Measured influence based Al Jazeera news channel AJEnglish, Dima_AlJazeera, on how often followers based in Doha, Qatar; AJGaza, AlanFisher, FatimaNaib, started discussing topics name translates to “the mohamed, SherineT the influencer tweeted island” CNN American news network bencnn, cnnbrk, CNNLive, about

founded by Ted Turner natlsecuritycnn, vhernandezcnn • Most influential Mona Egyptian journalist and monaeltahawy organizations: Al Jazeera Eltahawy public speaker and CNN Reuters News agency Reuters, JimPathokoukis headquartered in • Most influential individual: London independent journalist The Nation Left-leaning weekly jeremyscahill, thenation Mona Eltahawy magazine headquartered in New 27 York City The Evolution of the Revolution

Date Major events Twitter activity "Day of Rage" protests in Cairo Topic cairo is only trending topic related to Egypt Jan. 25 signal the start of major changes in Negative tweets outnumber positive tweets four to one Egypt. Little activity overall The Egyptian government begins Jan. 27 Topic egypt begins trending limiting internet access in Egypt. Topic egyptians (the most positive topic) begins trending Jan. 28 President Hosni Mubarak dismisses All Egypt-related topics trend his cabinet and appoints Omar Largest single-day volume of tweets on revolution Jan. 29 Suleiman as Vice President of Topic egyptians peaks in sentiment, with almost 3 positive tweets for Egypt. every negative tweet Huge increase in tweets on tahrir (about 1500% as many tweets) Blockage of internet access by Feb. 2 Topic omar_suleiman trends on Feb. 3 with more positive tweets than Egyptian government ends. negative Egypt-related topics stop trending Sentiment on topic hosni_mubarak decreased progressively throughout Feb. 6 on Twitter. Mubarak resigns Feb. the revolution, with about six times as many negative tweets as positive 11. by Feb. 6 Tweet volume Sentiment

28 Activity Signature: Not Just Another #superbowl

Compared with twenty other Significant differences trending topics from the same time period • Large percentage of retweets,

Four categories of topics more tweets overall • Politics • Tweets significantly more • Entertainment negative • Sports • Followers more likely to be • Tech/business engaged • Follower network less tightly Comparison based on tweets, knit sentiment, and network structure of Twitter followers

29

Interest based Community Extraction in Social Networks: Facebook and Twitter Data with 150M+ users

30

7/27/2012 31 Network Effect and Precise Interest Targeting

7/27/2012 Confidential36 Community Hierarchy: Twitter Republican politicians and news media

Republican Presidential Candidates

E.g. U.S. Politics 37 Top Associations by Fans For Bing, Google & Yahoo on FB 3.75% of Windows Phone users

5.30% of Google Chrome 0.98% of users George 2.58% of Foreman Microsoft users Cooking users 2.48% of 3.83% of TechCrunch Logitech users users

0.99% of 1.44% of Chillclock users Dentyne Users

2.49% of 2.53% of Adobe Microsoft users Flash Users

2.58% of 1% of Chex Mix Crest Users users 2.49% of Internet Explorer users 2.15% of 2.425% of Chex Mix Pepto-Bismol users users

2.20% of TridentA 2.37% of Chewing Gum Dentyne users users

2.32% of Yahoo! Sports users All data for 16-34 age group only 38 A different way of thinking? A “DATA DRIVEN DISCOVERY” WORTH A THOUSAND SIMULATIONS? Discovering Materials : Simulations  Analytics Summary: Discovering Knowledge from Massive Data

Business

High-End Analytics, Data

Mining, and Network Mining

Engineering Science

Thank You.

Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the Center for Ultra-Scale Computing and Security Northwestern University [email protected]

42