<<

Statistical in Science

Adalbert F.X. Wilhelm

SDSS 2018, Honoring Dr. Edward J Wegman May 18th, 2018 Reston, VA Quote of the day

"Current facilities for computing, display, and real time have developed substantially beyond our understanding of how to use them effectively in .“

Tukey and Wilks, 1966

2 Outline

1. 2. Statistical Graphics 3. Analysis pipeline 4. Visual representatives 5. Challenges 6. Conclusion

3 The Data Era: Data = The New Oil

• Production processes create a large amount of data from sensors, logistics, business operations and more

• Rise of cost-effective gives established industries a new boost

• Producing value from data is a challenge and an opportunity

• The promise of data as the new oil is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making which provides the competitive advantage

• Competitive advantage: lower costs, higher quality What do we do with all this data?

© Cathy O‘Neill & Rachel Shutt: Doing Data Science What is Data Science?

© Cathy O‘Neill & Rachel Shutt: Doing Data Science What is Data Science?

Goals and Data Data Business selection Analysis decision

Purpose Decide on objectives, Provide Using and Improving identify business appropriate data adapting best business levers, define key for the given suitable analysis processes. measures objective tools Generate value

Central What do we need to Which data is already How to clean data? How to implement questions improve? available? Which modeling results? What can we How to merge and technique to use? Can we automatise this change? connect data? Can we assess reliability analysis? How can we Can we collect of our resulting Which changes are measure change? additional valuable predictions? most relevant? data? How robust are our Potentials/ Include Create a data results? Enhanced Benchmarks of context experts. inventory. Increase What are state-of-the- understanding of Give data science data awareness art algorithms? Cross- processes. team clear among employees check with current Improved data objectives! to increase data expertise within culture. quality. company What are statistical graphics?

https://apandre.wordpress.com/dataviews/choiceofchart/ The process Graphics system © Geoffrey H. Ball & David J. Hall, 1970

Special thanks to Wayne Oldford The Data Team

Data Science Manager Data Scientist Main responsibilities: Main responsibilities: • Data munging • Group morale and support • Modeling • Business development • Machine Learning • Research & Development • Reporting and Presenting

Data Engineer

Main responsibilities: • Data ingesting Data Analyst • Data architecture • Data formatting Main responsibilities: • Acquiring data • Developing and implementing data analysis • Interpreting data • Analysing results Data Visualizer Main responsibilities: • Dashboard creation • Story telling / effective communication • Programmatic visualisation Conceptual model of visualisation support (Liu & Salvendy, 2007)

Graphics for data preparation Data exploration Aspects

bias noise abnormality meaningful relevant uptodate

Data cleaning Plausibility checks detection Data understandability

Junk in – Junk out Efficient time synchronization required Exploratory Visualisation Prior to Modelling

Density of Military Spending

Peace

1.2 War YEAR REPRESSI ARMSTRAN

● 10

● 3.5

1.0 ● ●

1995 ● ●

● 3.0

● ● 5

● 0.8 2.5

1990 ●

● 2.0 0

0.6 ● 1985

Density ● 1.5 5 1980 0.4 − 1.0 0.5 1975 0.2 10 − 0.0

peace war peace war peace war 0.0

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Military Spending (Billion USD) MILSPEND DEVELOP CUMWAR

● ● ● ● 3.0 ● ● ● ● ● ● ● ● ● 3.5 2.0

COLONIAL TRANSITI 2.5

● others France Britain no transition transitiontransition to semi−dem (0−5) 2.0

1.5 ● peace 3.0 peace 1.5

1.0 ● WARINVOL WARINVOL war 1.0

2.5 ● war

● 0.5

0.5 ●

● 2.0 0.0 0.0

peace war peace war peace war ETHNOPOL SEMIDEM

absent present other semi democracy peace peace WARINVOL WARINVOL war war

14 Exploratory Visualisation Prior to Modelling

15 Exploratory Visualisation Prior to Modelling

16 Exploratory Visualisation: Linked views

17 Visualisation for modeling support

• Depends on purpose • Prediction • Explanation • Pattern recognition

• Depends on data type and structure • Depends on modeling approach • Depends on audience and their standards • Depends on software ecosystem

18 Visualizing model results: model coefficients & CI

© Geoff Dancy & Eric Wiebelhaus-Brahm

Let‘s practice what we preach, Gelman et al. (2002) 19 Visualizing model results: e.g trees, random forests

20 Visualizing model results: ROC, Importance plots, separation plots

21 Visualizing model results: partial dependency plots

22 Visualizing model results: MCMC and Bayesian models

23 Exploratory Visualisation after modeling

24 Continuing challenges:

• Overcoming 2D/3D – limitation

• Dynamic plots

• Conditioning (C.Hurley: condvis)

• Linked views (W. Oldford: loon, H. Hofmann: cranvas, altair ) • Resolution (e.g. imbalanced data) • Reproduction (J. Harner) and Automation (P. Hall: H2O, L. Wilkinson: next session) • Transporting to future computational ecosystems • Interpretation • Storytelling • Data Literacy and visual literacy

25 Storytelling

26 The Data Team

Data Science Manager Data Scientist Main responsibilities: Main responsibilities: • Data munging • Group morale and support • Modeling • Business development • Machine Learning • Research & Development • Reporting and Presenting

Data Engineer

Main responsibilities: • Data ingesting Data Analyst • Data architecture • Data formatting Main responsibilities: • Acquiring data • Developing and implementing data analysis • Interpreting data • Analysing results Data Visualizer Main responsibilities: • Dashboard creation • Story telling / effective communication • Programmatic visualisation Conclusion:

• Many ingredients have been around since long • Power of statistical graphics is widely acknowledged • Visualisation throughout the data analysis process • Interactive graphics in R! • and storytelling • Liars know how to figure • Visual literacy • Interpretation • Increasing number of specialised graphics • Reproducibility and Automation

28 Thank you very much for your attention!

Questions?

Comments?

29