Statistical Graphics in Data Science
Adalbert F.X. Wilhelm
SDSS 2018, Honoring Dr. Edward J Wegman May 18th, 2018 Reston, VA Quote of the day
"Current facilities for computing, display, and real time interaction have developed substantially beyond our understanding of how to use them effectively in data analysis.“
Tukey and Wilks, 1966
2 Outline
1. Data Science 2. Statistical Graphics 3. Analysis pipeline 4. Visual representatives 5. Challenges 6. Conclusion
3 The Data Era: Data = The New Oil
• Production processes create a large amount of data from sensors, logistics, business operations and more
• Rise of cost-effective data collection gives established industries a new boost
• Producing value from data is a challenge and an opportunity
• The promise of data as the new oil is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making which provides the competitive advantage
• Competitive advantage: lower costs, higher quality What do we do with all this data?
© Cathy O‘Neill & Rachel Shutt: Doing Data Science What is Data Science?
© Cathy O‘Neill & Rachel Shutt: Doing Data Science What is Data Science?
Goals and Data Data Business Means selection Analysis decision
Purpose Decide on objectives, Provide Using and Improving identify business appropriate data adapting best business levers, define key for the given suitable analysis processes. measures objective tools Generate value
Central What do we need to Which data is already How to clean data? How to implement questions improve? available? Which modeling results? What can we How to merge and technique to use? Can we automatise this change? connect data? Can we assess reliability analysis? How can we Can we collect of our resulting Which changes are measure change? additional valuable predictions? most relevant? data? How robust are our Potentials/ Include information Create a data results? Enhanced Benchmarks of context experts. inventory. Increase What are state-of-the- understanding of Give data science data awareness art algorithms? Cross- processes. team clear among employees check with current Improved data objectives! to increase data expertise within culture. quality. company What are statistical graphics?
https://apandre.wordpress.com/dataviews/choiceofchart/ The visualization process Graphics system © Geoffrey H. Ball & David J. Hall, 1970
Special thanks to Wayne Oldford The Data Team
Data Science Manager Data Scientist Main responsibilities: Main responsibilities: • Data munging • Group morale and support • Modeling • Business development • Machine Learning • Research & Development • Reporting and Presenting
Data Engineer
Main responsibilities: • Data ingesting Data Analyst • Data architecture • Data formatting Main responsibilities: • Acquiring data • Developing and implementing data analysis • Interpreting data • Analysing results Data Visualizer Main responsibilities: • Dashboard creation • Story telling / effective communication • Programmatic visualisation Conceptual model of visualisation support (Liu & Salvendy, 2007)
Graphics for data preparation Data exploration Aspects
bias noise abnormality meaningful relevant uptodate
Data cleaning Plausibility checks Outlier detection Data understandability
Junk in – Junk out Efficient time synchronization required Exploratory Visualisation Prior to Modelling
Density of Military Spending
Peace
1.2 War YEAR REPRESSI ARMSTRAN
● 10
● 3.5
1.0 ● ●
1995 ● ●
● 3.0
● ● 5
● 0.8 2.5
1990 ●
●
● 2.0 0
0.6 ● 1985
Density ● 1.5 5 1980 0.4 − 1.0 0.5 1975 0.2 10 − 0.0
peace war peace war peace war 0.0
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Military Spending (Billion USD) MILSPEND DEVELOP CUMWAR
● ● ● ● 3.0 ● ● ● ● ● ● ● ● ● 3.5 2.0
COLONIAL TRANSITI 2.5
● others France Britain no transition transitiontransition to semi−dem (0−5) 2.0
1.5 ● peace 3.0 peace 1.5
1.0 ● WARINVOL WARINVOL war 1.0
2.5 ● war
● 0.5
0.5 ●
● 2.0 0.0 0.0
peace war peace war peace war ETHNOPOL SEMIDEM
absent present other semi democracy peace peace WARINVOL WARINVOL war war
14 Exploratory Visualisation Prior to Modelling
15 Exploratory Visualisation Prior to Modelling
16 Exploratory Visualisation: Linked views
17 Visualisation for modeling support
• Depends on purpose • Prediction • Explanation • Pattern recognition
• Depends on data type and structure • Depends on modeling approach • Depends on audience and their standards • Depends on software ecosystem
18 Visualizing model results: model coefficients & CI
© Geoff Dancy & Eric Wiebelhaus-Brahm
Let‘s practice what we preach, Gelman et al. (2002) 19 Visualizing model results: e.g trees, random forests
20 Visualizing model results: ROC, Importance plots, separation plots
21 Visualizing model results: partial dependency plots
22 Visualizing model results: MCMC and Bayesian models
23 Exploratory Visualisation after modeling
24 Continuing challenges:
• Overcoming 2D/3D – limitation
• Dynamic plots
• Conditioning (C.Hurley: condvis)
• Linked views (W. Oldford: loon, H. Hofmann: cranvas, altair ) • Resolution (e.g. imbalanced data) • Reproduction (J. Harner) and Automation (P. Hall: H2O, L. Wilkinson: next session) • Transporting to future computational ecosystems • Interpretation • Storytelling • Data Literacy and visual literacy
25 Storytelling
26 The Data Team
Data Science Manager Data Scientist Main responsibilities: Main responsibilities: • Data munging • Group morale and support • Modeling • Business development • Machine Learning • Research & Development • Reporting and Presenting
Data Engineer
Main responsibilities: • Data ingesting Data Analyst • Data architecture • Data formatting Main responsibilities: • Acquiring data • Developing and implementing data analysis • Interpreting data • Analysing results Data Visualizer Main responsibilities: • Dashboard creation • Story telling / effective communication • Programmatic visualisation Conclusion:
• Many ingredients have been around since long • Power of statistical graphics is widely acknowledged • Visualisation throughout the data analysis process • Interactive graphics in R! • Visual communication and storytelling • Liars know how to figure • Visual literacy • Interpretation • Increasing number of specialised graphics • Reproducibility and Automation
28 Thank you very much for your attention!
Questions?
Comments?
29