Short Course

3 April 2017 Jim Wisnowski

[email protected] (210) 218-1384

1 2 MCOTEA Example

3 Air Force Example ▪ Air Force Magazine Feb 2017 trends for women as a percent of the force

4 Callan of Sector Performance (Quilt Chart)

5 One Last Warm-Up

▪ Stephen Few is a guru in the data visualization world ▪ Let’s take his quiz on best practices at ▪ Goal is to get every one wrong—0/10 is success!

6 Objectives

▪ Appreciate the historical perspective of data visualization
▪ Know the value data visualization offers to analytics and Big Data
▪ Understand what makes a good graphical display and some of the common mistakes to avoid in graphical design
▪ Be familiar with some methodologies for the data visualization process
▪ Appreciate how to do data viz with a few common software packages

7 Data Visualization is Not New

▪ Scottish political economist William Playfair recognized in 1786 the superiority of graphs over tabular presentations; he published 43 time series plots and one bar chart
▪ Developed the first pie chart in 1801 to show distribution of Turkish Empire over Europe, Africa, and Asia
▪ Stephen Few states we really didn’t progress much from these original ideas until late 1970s with Princeton’s John Tukey and his Exploratory Data Analysis (EDA)
▪ He argues most are unaware of modern methods

8 Data Visualization is Not New

▪ Area chart using color was masterful ▪ Playfair credited with the introduction of bar

9 Data Visualization is Not New

10 Exploratory Data Analysis

 John Tukey, Princeton, 1977  Too much emphasis on hypothesis tests as confirmatory analysis—focus should also be on discovery  Objectives – Suggest hypotheses of observed data – Assess statistical test assumptions – Support selection of appropriate methods and tools – Serve as basis for further data collections and experiments  If we need a short suggestion of EDA, I would suggest that – It is an attitude; a flexibility; and requires graph paper and transparencies

The greatest value of a picture is when it forces us to notice what we never expected to see…John Tukey

11 Data Visualization Fuel

 Most important aspect of data visualization is the data itself  Value goes beyond the enterprise/transactional data itself – Unstructured data, social networks, Internet of Things

 Data quality is key and dataviz can help improve that!  Phil Simon rates organizations on visualization framework – Data (big or small) – Visualization (static or interactive)  Start small and scale

If we have data, let’s look at the data. If all we have are opinions, let’s go with mine…Jim Barksdale, Netscape

12 Data is Growing

• Big Data is an overused term, but we know there is GOLD in those data mountains
• 15 TB of Twitter data daily is a lot of data generated; how much gold do we have?
• We are exposed to more information in a day than someone from the 15th century was over a lifetime.
• 90% of today’s data was created in the last 2 years (IBM); 2.5 quintillion bytes per day
• In 2015 the number of networked devices was double the entire global population
• Of interest: Tera, Peta, Exa, Zetta, Yotta, Brontobytes

Graphic from IBM Research India, presented at Text Mining Workshop Jan 2014

13 Data Visualization Needs Credible Data!

Do not trust any statistics you did not fake yourself…Churchill

Figures don't lie, but liars do figure…Twain

14 Traits of Meaningful Data

 High Volume  Historical  Consistent  Multivariate  Atomic  Clean  Clear  Dimensionally Structured  Richly Segmented  Of Known Pedigree

Data Map and Contour Plots are “best practices”

Reference: Now You See It by Stephen Few

15 Data Visualization Definition

▪ Data is the new business capital.
▪ Data visualization: discovery of solutions that offer highly interactive and graphical user interfaces, are built on in-memory architectures, and are geared toward addressing business users’ unmet ease-of-use and rapid deployment needs. These solutions typically enable users to explore data without much training, making them accessible by a wider range of employees than traditional business analysis tools. (SAS)
▪ Key to making “analytics” approachable is visualization
  – Visual thinking is essential skill for all
  – Both an art and science => craft (Berinato, Harvard Business Review)
▪ Data is a great but messy story; visual analytics is the master filmmaker to bring the story to life (SAS)
▪ Not a great term…was Shakespeare a word sequencer?

A picture is worth a thousand data points

16 Data Visualization

▪ Characteristics (Card et al, Information Visualization)
  – Computer supported
  – Interactive
  – Visual representations of location, length, size, color, shape to allow us to see trends
  – Abstract data with no physical form (unlike, e.g., the human body)
▪ Amplify cognition by assisting memory by representing data in ways our brain can easily comprehend
▪ 3 facts: Pervasiveness has raised quality expectations, Big Data is here, and the Democratization of Data
▪ 90% of data analyses required by most organizations is possible with simple data visualization methods
  – Excel is getting better
  – Boss wants to know why graphs in meetings are not nearly as pretty as she sees on fitness tracker (Berinato)

Everyone in our business knows they need to visualize data, but it’s easy to do poorly. We invest in it. We want to use it right while they use it wrong. Daryl Morey

17 Interactive Data Visualization with Excel

▪ Consider recent data on automobile fuel economy from the EPA for model year 2017 vehicles
▪ Attributes such as make, model, mpg, class, cylinders, transmission, valve timing, etc.
▪ Downloaded from
▪ Quick exploration with Excel Pivot Tables, Tableau, and JMP

18 Data Visualization

 Allows viewing of vast quantities of data quickly and efficiently  Provides better insight into the business problem through discovery  Generates a call to action  Performs better if interactive and not static for quick stratification, drill down, and filtering  Relies less on the IT department and empowers workers once they have access to the data with intuitive tools

19 Democratization of Data Viz

 Data visualization methods should allow employees who are not data analysts or scientists the ability to quickly and easily explore data  Domain and business expertise critical to data understanding  More rapidly find trends, generate hypotheses, identify inconsistencies, and determine additional data support requirements  Reduce IT and analyst staff burden—everyone should be numerate  Tension growing in non-data driven organizations  Need to shorten the “kill-chain” of time data is collected until presented as actionable solution to decision makers – Find, Fix, Track, Target, Engage, Assess (F2T2EA)

Goal: Self-Service Approachable Analytics

20 Interactive Data Visualization For All

Flight misery map

Source: Sviokla, Harvard Business Review

21 Police Department: Interactive Criminal Activity

22 San Francisco Police Department with JMP

▪ Data is a sample file in JMP
▪ Use Graph Builder to plot each crime by color
▪ Add street map
▪ Add filter on station
▪ Create HTML with data file

23 San Francisco PD with JMP

▪ A bit more interactive is the Distribution platform
▪ Where is there a disproportionate amount of drug activity?
▪ What days of the week correlate with runaways?
▪ What are some safe precincts?

24 Democratization of Data Analytics

 Data visualization is no longer just static charts created by IT professionals for meetings  Even this graphic is outdated. Many are creating graphs continuously

Source: TDWI Research, 2013

25 The Human Side of Data Visualization

 Huge advances in past 25 years in data collection, storage and access; have ignored the primary tool to make information meaningful—the human brain  We acquire more information from vision than from all other senses combined  20 Billion neurons in brain used to form patterns from visual information  The eye and visual cortex of brain form a massively parallel processor that provides highest bandwidth channel into human cognitive centers—Colin Ware, UNH  We seek patterns

Strive for Interocular Traumatic Impact

26 The Human Side of Data Visualization

 We have selective visual attention; we are drawn to familiar patterns, and our working memory is limited

▪ Jacques Bertin’s Sémiologie Graphique in 1967 describes basic vocabulary of vision of abstract data
  – Pre-attentive attributes form the core of good data visualization methods
  – Pre-attentive means without prior conscious awareness: the things that “pop out” most
▪ We can only “remember” at most chunks of 3 visualizations and even then for only a short period
  – So don’t make comparisons difficult, such as putting them on the next chart or further down the page. Side-by-side is best.

27 Pre-Attentive Attributes

Shape, Length, Hue/Contrast, Size, Position, Color, Enclosure, Symmetry, Grouping

28 Xan’s Pre-attentive Processing Quiz

29 Pre-attentive Processing

30 Graphic Attributes: Quantitative Scales

Attributes shown on a scale from Better to Worse: Position, Position (unaligned), Length, Angle, Slope, Area, Color Hue, Color Density

Based on “Graphical Perception: Theory, Experimentation, and Application …” by William Cleveland and Robert McGill, JASA, Sept. 1984

31 The Human Side of Data Visualization

▪ Color is a key pre-attentive attribute
▪ 5% Females and 9% Males are color blind
  – Red-Green is most common
▪ There is a psychology to color
  – Red is the color of extremes: love, violence, danger, anger, and adventure
  – Yellow captures our attention more than any other color. It is the color of happiness and optimism, of enlightenment and creativity, sunshine and spring. Lurking in the background is the dark side of yellow: cowardice, betrayal, egoism, and madness. Furthermore, yellow is the color of caution and physical illness (jaundice, malaria, and pestilence).

32 Color

▪ Color choice may tell a very different story.
  – Measles rate of ID vs TN?
▪ HSV
  – Hue: wavelength (red, yellow, …)
  – Saturation: 1=color, 0=white
  – Value: brightness, 1=bright, 0=black
  – Contrast with RGB additive system
▪ Beware of default color choices; they often will not send the correct message
  – Rainbow schemes
  – Intuition (red should be bad, green good)
▪ Consider your organization’s branding guidelines
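The HSV components listed on this slide can be tried out with Python's standard `colorsys` module. This is only a small sketch: the helper name `hsv_to_hex` is made up for illustration, and all three components are assumed to be scaled to the 0..1 range that `colorsys` uses.

```python
import colorsys

def hsv_to_hex(h, s, v):
    """Convert HSV components (each in 0..1) to an RGB hex string."""
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

# Hue 0 is red; saturation and value control how "pure" and how bright it is.
pure_red = hsv_to_hex(0.0, 1.0, 1.0)    # full saturation, full brightness
washed_out = hsv_to_hex(0.0, 0.0, 1.0)  # saturation 0 at full brightness is white
black = hsv_to_hex(0.0, 1.0, 0.0)       # value 0 is black regardless of hue

print(pure_red, washed_out, black)
```

Note that saturation 0 only gives white when value is 1; at lower values it gives a gray, which is one reason to check a palette in both HSV and RGB terms.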

33 Color Psychology

▪ Red is assertive, bold, power, extremes: love, violence, danger, anger, stop, and adventure
▪ Pink is soft, tranquil, passive, feminine, health, joy
▪ Orange is warmth, compassion, enthusiasm, fun, energy
▪ Green is nature, balance, environment, healthy, calm, rebirth
▪ Blue is dignified, professional, successful, loyal, positive, authoritative, but also melancholy
▪ Yellow captures our attention most: happiness, optimism, creativity, sunshine and spring; dark side of yellow: cowardice, betrayal, egoism, caution, madness, and medical illness (jaundice, malaria, …)
▪ Purple: royalty, luxury, wisdom, inspiration, spiritual
▪ Brown: natural, reliable, strong, rustic, conservative, ordinary
▪ Black: classy, formal, authority, power, death, troubles, mourning
▪ White: pure, innocent, clean, new, simple, bland
▪ Gray: neutral, respect, humility, stable, wise

34 Visualization Expectations


35 Which direction is the top middle wheel moving?

36 Are the blocks side by side or stacked?

37 Blue and Black or White and Gold?

38 Ebbinghaus Illusion

39 Visualization

40 Visualization—Spooky

41 Visualization

Jared Leto

Dallas Buyers Club, Fight Club, Thirty Seconds to Mars and super-stoked about data visualization

42 Visualization in Logos

43 Empirical Findings (Berinato)

 We don’t go in order like reading—the top title may not be read until well after the visual middle; we spend disproportionate amounts of time in different features  We see first what stands out—peaks, valleys, intersections, dominant colors, and outliers  We see only a few things at once—with more than 5-10 variables or elements individual meaning begins to fade

 We seek meaning and make connections—we incessantly construct narratives of the graph consciously and subconsciously  We rely on conventions and metaphors—red is bad, green good, A-N-AF-M, time is on x

44 Value of Data Visualization

 Business intelligence and predictive analytics may be viewed as black boxes and not trustworthy; data visualization can add trust and provide insight to these solutions  Ultimately it is all about making better business decisions

Source: TDWI Research, 2013

45 Value of Data Visualization

▪ The two areas of data visualization:
  – Explanation – tell a story to the audience
  – Exploration – understand what the data is telling you
▪ Will take into account the audience’s expectations and composition
▪ Help you to detect relationships in data
▪ Allows you to understand “Big Data”
▪ …of those who are most effective with Big Data, 98% use data visualization techniques

46 More on Big Data

 May be overhyped by media, but is here. 90% of data today wasn’t here 2 years ago – Transition from mainframe to client server to mobile cloud – Extract-Transform-Load model is aging  Big Data really has not been solved by most organizations – Resources dedicated to collecting, storing, organizing, and cataloging; – Exploiting Big Data through analytics and viz are behind  Web is more visual, efficient, and data-friendly (Phil Simon, Visual Organization)

47 Popular Choices in Data Visualization

▪ Pie charts, line charts and bar charts still have their place, but are quickly being replaced by more informative and dynamic tools

48 Why Graph Data?

 Tabled data in files and spreadsheets are precise and summary statistics are helpful to understand structure up to a point.  Graphs quickly convey meaningful relationships that tell a story and point you in the right direction to solve your problem.  Interactive visualization takes the graphical capabilities a step further for rapid discovery and hypothesis generation.

The greatest value of a picture is when it forces us to notice what we never expected to see…John Tukey

49 Basic Concepts for Data Visualization

 Understand your data  Size – Cardinality => high = unique values (acct #); low = repeats (gender)  Determine what you are trying to visualize and information conveyed  Know your audience and how they process information  Use a visual that conveys the information best, simplest, and quickest  A “good” graphic is context sensitive (Berinato) – Who will see it? – What do they want? What do they need? – What could I show? What should I show? – How will I show it?

50 Tufte’s Principles

 Enforce visual comparison – Conclusions can be drawn by comparing data  Show causality – A graph without causality will have no meaning  Show multivariate data – Display data using more than two dimensions  Integrate all visual elements – Use words, numbers and images where appropriate  Content-driven design – Quality, relevance and integrity

51 Principles of Good Graphical Design

▪ Communicate the data with clarity, precision and efficiency
▪ Encourage the eye to compare different pieces of data focusing on substance; intriguing and curiosity provoking
▪ Make large data sets coherent presenting many numbers in a small space
▪ Reveal the data in several layers of detail
▪ Serve a clear purpose: description, exploration, tabulation, or decoration
▪ Are closely integrated with statistical and verbal descriptions of a data set
▪ Are simple, which is much better than unnecessary complexity

Generate the greatest number of ideas in the shortest time with the least ink in the smallest space

Reference: The Visual Display of Quantitative Information by Edward Tufte, known as the Strunk and White of graphics

52 Graphical Pillars for Statistical Stories

▪ Simple
▪ Clear and concrete
▪ Informative and Important
▪ Contextual
▪ Sequential
▪ Seamless
▪ Disclose Uncertainty and Truth
▪ Emphasis
▪ Actionable
▪ Clean

Best graph ever?

A quick sketch is better than a long speech…Napoleon (perhaps)

Demonstration-R, JMP

Reference: Now You See It by Stephen Few

53 Other Candidates for Best Graph Ever

57 Other Candidates for Best Graph Ever

Birth and Death of 150,000 Notable People Over Last 2000 Years

58 Data Visualization Best Practices

 Use appropriate scales—start at 0 for bar charts and end a little above max value. Stop at 100% when using percentiles.

 Consider adding reference lines (typically for the y axis) such as the mean, an industry standard, or at 0

 Split data into meaningful sub-graphs (trellis graphics) with exactly same scales and structure to better interpret multivariate data

 Examine your data using a combination of data visualization methods

 Beware of overplotting—e.g. scatterplot that is very dense with points, need to show the volume within each region

 Could make points smaller, hollow, jittered or use heat map for multiple observations

Source: Now You See It, Stephen Few

59 Bad Practices of Graphical Design

 Inappropriate display choices  Too much information  Misleading axis scaling  Difficult to understand: all capital letters, too many abbreviations or jargon, vertical text, insensitive to color, obscure legend  Inconsistent ordering or placement  Graph is taller than it is wide  Too small in presentation  Too artistic

If the graphic is bad, the information will be perceived as less credible!

60 Graphs That Cry Help!

61 Beach Ball “Graph”

1. Poor color choices
2. Distracting beach ball chartjunk
3. Different fonts throughout
4. Tall not wide
5. Out of order on x scale
6. ALL CAPITAL LETTERS
7. 3-D boxes for 2-D data

62 Data-Ink Ratio

▪ Ink on the graph represents the data
  – Maximize data ink and erase as much non-data ink as possible (Tufte)
  – Data-Ink Ratio = 1 − (proportion of the graph that can be erased without losing data)
  – Erase non-data ink so that the audience is not drawn away from the importance of the data
  – Think of gridlines: how important are they and at what frequency?

63 Data Density

 Proportion of the graph that is dedicated to displaying data

 Maximize data density and the size of the data matrix – Include more data points – Include more variables  Sparklines – Demonstration in Excel with GoPro – Demonstration in JMP with CrimeData

64 Lie Factor

 A value to describe the relation between the size of effect shown in a graphic and the size of effect shown in the data.  Exaggeration or changing of the scale in a graph
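As a rough sketch of how the Lie Factor on this slide can be computed, the snippet below measures each "size of effect" as a relative change from a first value to a second value and divides the graphic's effect by the data's effect. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)

def lie_factor(data_first, data_second, shown_first, shown_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return effect_size(shown_first, shown_second) / effect_size(data_first, data_second)

# Hypothetical case: the data grow from 100 to 150 (a 50% increase),
# but the bars drawn for them grow from 1 cm to 3 cm (a 200% increase).
factor = lie_factor(100, 150, 1.0, 3.0)
print(f"Lie factor: {factor:.1f}")  # values far from 1 signal distortion
```

A value near 1 means the graphic represents the data faithfully; values well above 1 exaggerate the effect and values well below 1 understate it.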

Reference: The Visual Display of Quantitative Information by Edward Tufte

65 Chart Junk

▪ Decorative elements that provide no data and cause confusion ▪ Distract the viewer from valuable information

66 Chart Junk—Another Point of View

 Borkin et al. from Harvard and MIT conducted experiment on what makes a graph memorable  Over 2,000 images; 400 were shown to study participants for 1 second, they then took quiz on which ones they saw.

67 Chart Junk—Another Point of View

▪ Results showed human recognizable objects are most important for memorability
▪ It also helps if the visualization is “distinct”, visually dense, colorful, and has a low data-ink ratio

68 How To Do Data Visualization

▪ Two primary questions before choosing a graphic:
  – Is the information conceptual or data-driven?
  – Am I declaring or exploring something?

Scott Berinato, Good Charts, Harvard Business Review Press

69 How To Do Data Visualization-Plan

 Prepare (5 mins): have paper and pen, put aside data to think about ideas, write the basics of who visualization is for and what setting  Talk and Listen (15 mins): discuss with colleague what you’re trying to prove or explore; capture words, phrases and statements to summarize goals  Sketch (20 mins): Focus on keywords from above steps, quickly sketch out multiple visuals  Prototype (20 mins): take best sketch and make it more accurate and detailed

Fight the impulse to directly graph your data with preset options

Scott Berinato, Good Charts, Harvard Business Review Press

70 How To Do Data Visualization-Create

 Focus on structure and hierarchy – Need title (12%), subtitle (8%), visual field (75%), and data source line (5%)  Focus on design clarity (“hit the ball squarely”) – Aggressively remove extraneous elements and let them highlight the idea – Make sure each element has a single purpose that cannot be misinterpreted – Use natural conventions and metaphors  Focus on design simplicity – Minimize number of colors-gray for second level information (gridlines, etc) – Place labels and legends close to what they describe

Goal is to make the graph more understandable—not more attractive

Scott Berinato, Good Charts, Harvard Business Review Press

71 How To Do Data Visualization-Refine to Persuade

 Hone the main idea – What am I trying to show versus I need to convince them… – Think active words  Make it stand out – Emphasize with color, pointers, labels, markers, … – Isolate it by reducing other elements  Adjust what’s around it – Add reference points and lines – Remove elements that distract with integrity – Create context and comparisons

How can you sell this most effectively?

Scott Berinato, Good Charts, Harvard Business Review Press

72 How To Do Data Visualization-Presentation

▪ Show chart and stop talking
▪ Don’t read the picture, talk about ideas not structure
▪ Guide audience for unconventional visualization
▪ Add context by showing reference, average or ideal graph versus one shown
▪ Turn off chart when you have something important to say
▪ Put a more detailed version in backup for them to take away
▪ Create tension by showing parts (builds) so they speculate, which makes it more memorable; use time and reveal gradually
▪ Bait and switch by luring into what they expect and show different
▪ Deconstruct and reconstruct: drill down dynamically

Does presentation hide something that would rightfully challenge idea?

Scott Berinato, Good Charts, Harvard Business Review Press

73 How To Do Data Visualization-Be a Critic

 Note what you see first—what stands out?  Note the first idea that forms, then search for more  What are likes, dislikes, wish I saws  What 3 things would you change and why  Sketch your own version and critique yourself

It’s not the critic who counts…the credit belongs to the man who is actually in the arena. Teddy Roosevelt

Scott Berinato, Good Charts, Harvard Business Review Press

74 Graphs

 Plots of deterministic relationships  Univariate  Bivariate  Multivariate  Time Series  Maps

75 Graphing Functions

 Deterministic equations easily plotted across the range of input variables

76 Univariate Categorical Plots

▪ Simple example but shows key points
▪ Evaluate the relative proportions: part-to-whole.
▪ Determine best graphs: bar graphs and Pareto plots, though pie charts are (too) often used.

[Figure: Percent of USAA Mbrs / Distribution of USAA Membership Eligibility, drawn three ways and labeled Good, Better, and Best. Categories: Family Mbr, Retired, Officer, Enlisted, Employee, Gov't Civilian, Other. Slice labels: 4%, 8%, 10%, 12%, 13%, 15%, 38%. Bar chart axis: 0% to 40%.]

77 What’s the Matter With Pie Charts

▪ They are overused and do not hit the “pre-attentive” attributes as hard as other methods
  – Length and position are the most discriminatory
▪ Difficult to compare 2-D areas or angles. DON’T go to 3-D!!
▪ Tough to decipher when you have a legend you have to go back and forth to

78 If You Use Pie Charts

 Keep the number of slices to a maximum of 4 or 5 – Actually not bad since we are so familiar with these  Adding text labels for percentages and categories if desired  Exploded Pie emphasizes a proportion; don’t explode more than 25% of the slices  Put largest wedge at 1:00 and make progressively smaller clockwise until 12:00  Donut charts do nut [sic] solve the problem


79 Histograms

▪ A graphical representation of the distribution of a continuous variable
▪ Group data into similar sized bins and count frequencies in data set
▪ Easy to infer probabilities and relative importance
▪ Commonly occurring “shapes” are known probability distributions (normal, uniform, exponential, Weibull, Beta…)
▪ Bars can be omitted for a Frequency Polygon
▪ Excel now has a menu option and the Analysis ToolPak
▪ Bin width is a critical parameter and should be the same for all bins
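The binning step behind a histogram can be sketched in a few lines of Python. The MPG values, the bin width of 5, and the starting edge of 15 are all hypothetical choices for illustration; the point is that every bin has the same width and each observation is counted once.

```python
def histogram_counts(values, bin_width, start):
    """Count observations per equal-width bin, keyed by each bin's left edge."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value falls in
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical highway MPG values
mpg = [21, 23, 40, 34, 33, 48, 25, 23, 25, 38, 30, 26]
counts = histogram_counts(mpg, bin_width=5, start=15)

# Text rendering: one row per non-empty bin, one mark per observation
for left_edge, n in counts.items():
    print(f"{left_edge:>3}-{left_edge + 5:<3} {'#' * n}")
```

Rerunning with a different `bin_width` shows how strongly the apparent shape depends on that one parameter.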

80 Box Plots

▪ Excellent choice to show distributions
  – Line at median; box spans the interquartile range (25th to 75th percentile); whiskers extend to 1.5 times the IQR or to the max/min
  – Display differences between population means without making assumptions; best for multiple boxes
  – IQR is a robust estimate of spread (test for equal variances)
  – Excel!! Finally
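The box plot quantities described above can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions: the data are hypothetical, and quartiles are taken with the "inclusive" method, which is only one of several conventions (software packages differ slightly in how they interpolate quartiles).

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends, and outliers using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme observations inside the fences
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical highway MPG values with one unusually efficient vehicle
mpg = [17, 18, 21, 23, 25, 26, 28, 30, 33, 35, 48]
summary = box_plot_summary(mpg)
print(summary)
```

Computing one summary per group (for example per cylinder count) gives the side-by-side comparison that box plots are best suited for.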

81 Box Plots

 Best used when there are several categorical levels of a variable – Can quickly evaluate if the variances (size of boxes) are approximately equal – Can determine if the means/medians are close based on relative positioning – Limited inherent capability in Excel; Box Charter add-in

82 Heat Maps

▪ Intuitive way to quickly identify observations that are at the extreme ends of the distribution
▪ Example: 1918 US Flu Epidemic. Shown below are death rates per 100,000 for several age groups. Not surprisingly the babies and elderly had the highest rates. Interestingly, 25-34 year olds also had very high rates: WWI vets returning via crowded and infected trains.
▪ Excel Conditional Formatting is excellent.

Age      M        F        Total Population
<1       2520.5   2020.4   4540.9
1-4      712      724.2    1436.2
5-14     162.5    190.2    352.7
15-24    700.6    475.1    1175.7
25-34    1216.6   781.4    1998
35-44    691.1    406.5    1097.6
45-54    411.8    275      686.8
55-64    420.8    339.2    760
65-74    655.8    636.5    1292.3
75-84    1112.9   1239     2351.9
>84      2111.2   2320.5   4431.7

83 Pareto Plot

• A creative and best practice approach for part-to-whole graphics
• 80/20 rule: 80% of Italy’s wealth held by 20% of residents
• Excel chart option now, and accessible via the Histogram option in the Analysis ToolPak for continuous data
• Pareto Plots are often based on categorical levels of the input factor
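The table behind a Pareto plot is simply the categories sorted from largest to smallest with a running cumulative percentage. The sketch below uses hypothetical category names and counts; `pareto_table` is an illustrative helper, not a function from any of the packages named in this course.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and attach cumulative percent."""
    total = sum(counts.values())
    rows, running = [], 0
    for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += n
        rows.append((name, n, 100.0 * running / total))
    return rows

# Hypothetical complaint counts by category
complaints = {"Billing": 30, "Shipping": 50, "Website": 5, "Quality": 15}
rows = pareto_table(complaints)

for name, n, cumulative in rows:
    print(f"{name:<10} {n:>4} {cumulative:>6.1f}%")
```

The bars of the plot come from the counts column and the overlaid line from the cumulative percent column, which makes the "vital few" categories easy to read off.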

84 Pivot Table and Pivot Chart

▪ Powerful Excel analytical and graphical capability ▪ Data summarization tool that sums across levels of variables ▪ Flexible construction of tables/graphs ▪ Easy to filter, dynamic

Average of MPG Hwy

Row Labels    Hybrid   Non hybrid    Grand Total
4             38.75    30            30.52238806
6                      24.04615385   24.04615385
8             23       19.96610169   20.01666667
Grand Total   35.6     24.76470588   25.046875

[Pivot chart: the averages above by row label (4, 6, 8) for Hybrid and Non hybrid, y axis 0 to 45]

85 Demonstration Univariate Graphs

▪ Excel, JMP, R

86 Association between Variables

▪ An association exists between two variables if the distribution of one variable changes when the level (or values) of the other variable changes. ▪ If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.

Anscombe’s Quartet All have same mean, variance, regression equation and correlation coefficient
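The claim about Anscombe's Quartet can be checked numerically. The sketch below uses the commonly published values of the four datasets (they are not taken from these slides) and computes the mean, variance, correlation, and least-squares line for each using only the standard library.

```python
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, sample variances, correlation, and least-squares line of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (len(x) - 1),
        "var_y": syy / (len(y) - 1),
        "r": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The summaries agree to about two decimal places even though scatterplots of the four datasets look nothing alike, which is the argument for plotting before modeling.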

87 Scatterplot Matrix

▪ Consider the fuel efficiency data, we can quickly see relationships between several variables

88 Area Plot

▪ Useful to display multiple dependent variables as a function of a single continuous independent variable ▪ Can also do part-to-whole ▪ Beware of “hiding” data

89 Parallel Plot, Parallel Coordinates Plot, Parallel Axis Plot

Specify two or more variables for the response and can effectively use color for another variable

90 Contour Plots: Association Between 3 Variables

• Contour lines show level of a variable
• Useful in multivariate applications to find locations of min/max response

91 Association – Categorical Variables

▪ Best graphs: bar graphs, histograms, mosaic plots, and tree maps: • Brushing methods enable you to see the distribution of a categorical variable conditioned on the setting of another. • Mosaic plots and tree maps are especially good with multiple levels of a categorical variable; unfortunately Excel does not have these graphs. ▪ For two (or more) categorical variables, the relationship is measured by association or dependence, not correlation.

92 Association – Categorical Variables: Titanic Example

▪ Mosaic plots allow you to determine if survival is dependent on what class you were in and also if gender makes a difference
▪ Excel is not set up for Treemaps, but MS has an add-in

93 Association – Brushing

▪ Brushing or dynamic linking is a great interactive way to explore relationships in your data by evaluating how a subset of observations based on the level of one variable behaves in another. ▪ We see the cross hatched values correspond to the 8 cylinder vehicles that also have more HP and less MPG

94 Which is Best

▪ Researchers have been looking into the science of data visualization ▪ Rensink 2010: Weber’s Law to scatterplots “a noticeable change in stimulus is a constant ratio of the original stimulus” ▪ Think of a match lit in a pitch black room versus a lighted room ▪ First time method is available to calculate graph effectiveness ▪ Good-scatterplots in positive correlation direction ▪ Okay-parallel coordinate plots, scatterplot negative direction, stacked area ▪ Bad-stacked bar, radar plots

95 Modeling Visualization

▪ Profilers and Contour Plots ▪ F-18 Central Composite Design Multiple Response Optimization

96 Demonstration for Multivariate Graphs

▪ Excel, JMP, R ▪ Anscombe’s Quartet

97 Time Series- Book Sales

98 Data Mining Example

▪ Peter Lawrence’s The Making of a Fly
▪ Bargain at $23,698,655.93 + $3.99 shipping
▪ Two wholesalers’ algorithms went awry
  – 17 used copies at $35, but only n=2 new copies
  – Bordee = 1.27 × Profnath; Profnath = 0.998 × Bordee
  – Algorithm quickly goes out of control: $23M on Apr 18
  – Apr 19: Profnath = $106 and Bordee = $135
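The runaway pricing loop on this slide is easy to simulate. The sketch below assumes each seller reprices once per cycle using the two multipliers from the slide; the $35 starting price and the one-cycle-per-day timing are illustrative assumptions, not details taken from the slide.

```python
BORDEE_MARKUP = 1.27       # Bordee = 1.27 x Profnath (from the slide)
PROFNATH_DISCOUNT = 0.998  # Profnath = 0.998 x Bordee (from the slide)

def simulate_price_spiral(start_price, cycles):
    """Return (cycle, profnath_price, bordee_price) after each repricing cycle."""
    profnath = start_price
    history = []
    for cycle in range(1, cycles + 1):
        bordee = BORDEE_MARKUP * profnath
        profnath = PROFNATH_DISCOUNT * bordee
        history.append((cycle, profnath, bordee))
    return history

def cycles_to_exceed(start_price, threshold, max_cycles=1000):
    """Number of cycles until Bordee's price first exceeds the threshold."""
    for cycle, _, bordee in simulate_price_spiral(start_price, max_cycles):
        if bordee > threshold:
            return cycle
    return None

history = simulate_price_spiral(35.0, 10)
growth_per_cycle = history[1][1] / history[0][1]
print(f"Growth per cycle: {growth_per_cycle:.4f}")
print("Cycles to pass $23M:", cycles_to_exceed(35.0, 23_000_000))
```

Because each cycle multiplies the price by roughly 1.27 × 0.998, the growth is exponential: a modest starting price reaches the millions in a couple of months of daily repricing.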

99 Time Series

 Look for six basic patterns: overall trend, variability, rate of change, covariation, cycles, and exceptions/outliers.  Best graphs are line graphs, overlay plots, and bar graphs; many other displays are possible over time (for example, box plots and bubble plots).  Run charts with control limits are the cornerstone to process control methods.  75% of graphs are time series

100 Time Series in Excel

 Decent capability for customizable line graphs  New for 2016 was Waterfall Charts  Demonstration

101 Time Series: Google Trends

Is interest in golf waning? What does this mean for Under Armour?

102 JMP Output Google Trends

103 Time Series

 Line Plots—use to display patterns, trends, cycles, and exceptions  Bar Graphs—use to emphasize or compare values (e.g.Budget versus Actual)  Dot Plots—use if have irregular time intervals. In Excel, just delete line in line plot with markers  Heat Maps for high volume of data to find exceptions and cycles  Box Plots for analyzing distribution (mean and variance) changes over time  Animation may be useful to see changes over time

104 Time Series

▪ Line Chart shows comparison, variability, cycles, trend
▪ Stacked Column shows the differences better
▪ Both use Excel defaults apart from title and legend size
▪ Time series analysis methods exist to forecast

[Figures: "Lord Abbett Monthly Price Change Compared to S&P 500" (line chart) and "Stacked Column Chart for Lord Abbett vs S&P 500"; series S&P and Lord Abbett, monthly from 9/1/2003 to 11/1/2005, y axis from -0.25 to 0.2]

105 Sparklines

▪ A small intense, simple, word-sized graphic with typographic resolution ▪ Typically in a cell after the last value in an ordered series (e.g. time) ▪ Can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic ▪ Started with Excel 2010—add-in available for 2007
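A word-sized graphic of this kind can be approximated in plain text with Unicode block characters. This is a minimal sketch, not how Excel draws sparklines: it scales each value to one of eight block heights, and the monthly values shown are hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Map each value to one of eight block heights between the min and max."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return BLOCKS[0] * len(values)  # flat series: draw a baseline
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[int((v - low) / span * top)] for v in values)

# Hypothetical monthly values placed right after the numbers they summarize
monthly = [12, 15, 14, 18, 21, 19, 25]
print("Sales:", monthly[-1], sparkline(monthly))
```

Because the result is just a short string, it can sit inside a sentence, a table cell, or a headline, which is the property the slide emphasizes.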

106 Conditional Formatting: Beyond Heat Maps

▪ Use to highlight specific observations that exceed a certain value (e.g., 30 MPG), fall within a range, and so forth
▪ Top or bottom 10%, 10 values, above the mean, below the mean
▪ Data bars to show at what percentile the observation ranks
▪ Icon sets for dashboard type graphics

[Figure: four MPG Hwy columns illustrating the conditional formats above]

107 Maps

▪ Big push in recent years has been geospatial or mapping
▪ Many software options
▪ Maps can be extremely useful, but they do limit our pre-attentive attributes, so use with caution
▪ Still very good for visual discovery and analytics
▪ Think beyond geography of what a map is!

108 Alluvial Plots

▪ NFL 2015 season predictions from Nate Silver’s Data viz software is D3.js

109 Visualization of Text Data

Consider the Pareto of word counts from an article this year on  Can you get the general idea of what might have happened from these frequencies alone? Note we’ve “stemmed” words: hors = horse, horses, horsing, horse’s, … and, of course, hors

110 WordCloud

pharoah/

111 Word Clouds is a very fun (free) site to paste in your text and make your own word clouds
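Both the Pareto of word counts and a word cloud start from the same table of stemmed-word frequencies. The sketch below builds such a table with the standard library. The suffix-stripping rule is a crude stand-in for a real stemmer (it is not the Porter algorithm), and the stop-word list and sample sentence are made up for illustration.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "on"})
SUFFIXES = ("ing", "es", "s", "e")  # tried in this order; illustrative only
MIN_STEM = 4  # never shorten a word below four letters


def crude_stem(word):
    """Strip a possessive, then at most one common suffix."""
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def stem_counts(text):
    """Count stemmed words, ignoring case and stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)


sample = "The horse won. Horses love racing, and the horse's trainer is happy."
# most_common() returns the counts in Pareto (descending) order
for stem, count in stem_counts(sample).most_common(3):
    print(stem, count)
```

A word cloud simply maps these counts to font size, so the Pareto chart and the cloud carry the same information in different encodings.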

112 Text Groupings From Eigenvectors

Statistical methods can tame the unstructured text to find words that cluster together and common themes

113 Sentiment Analysis

In many applications, such as online product reviews, we would like to know whether the customer base has a positive, neutral, or negative attitude about the product or service.

We can count the number of “positive” and “negative” words using a generic list of terms; it may be useful to have custom “positive” and “negative” word lists.

Combine this with Twitter and other social media and you have real-time feedback (“Opinion Mining”).

The method just uses the document term matrix (DTM) and cross-tabulates it with the Harvard list of positive and negative sentiments.
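The counting approach described above can be sketched as follows. The two word lists here are tiny, made-up stand-ins (they are not the Harvard list), and the function names and sample reviews are hypothetical.

```python
import re
from collections import Counter

# Illustrative word lists only; a real analysis would load a published lexicon
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "broken"}


def term_counts(document):
    """One row of a document term matrix: word -> count for this document."""
    return Counter(re.findall(r"[a-z]+", document.lower()))


def sentiment(document):
    """Return (positive count, negative count, label) for one document."""
    counts = term_counts(document)
    pos = sum(n for word, n in counts.items() if word in POSITIVE)
    neg = sum(n for word, n in counts.items() if word in NEGATIVE)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label


reviews = [
    "Great phone, I love it",
    "Arrived broken. Poor packaging, bad support.",
    "It is a phone",
]
for review in reviews:
    print(sentiment(review), review)
```

Simple counts like these ignore negation ("not good") and sarcasm, which is one reason a custom word list for the domain can help.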

114 Sentiment Analysis

Look at Bible by Book and Chapter

115 Sentiment Analysis Demonstration in JMP

116 Sentiment Analysis Demonstration in JMP

117 Correlation of Word Pairs from DTM

118 What Words Associated with Fatal?

Different word frequencies

[Figure panels: Not Fatal, Fatal]

119 What Words Associated with Fatal?

Crosstab Fatal structured variable with word counts

120 Tree Model for Fatal on NTSB DTM

• Classification tree groups observations based on the presence or absence of a word
• If “land” is in the write-up, a fatality is very unlikely unless “mountain” is too
• If “stall/spin” is in the write-up, a fatality is very likely
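One way to see how a tree chooses its splits is to score every word by how cleanly its presence or absence separates the two outcomes. The sketch below picks the word with the lowest weighted Gini impurity, which is one common splitting criterion; it is not necessarily the criterion used in the course demonstration, and the eight "reports" are invented for illustration, not NTSB data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_word(docs, labels):
    """Return (word, weighted impurity) for the best presence/absence split.

    docs is a list of word sets; labels uses 1 = fatal, 0 = not fatal.
    """
    best = None
    for word in sorted(set().union(*docs)):
        present = [y for d, y in zip(docs, labels) if word in d]
        absent = [y for d, y in zip(docs, labels) if word not in d]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(labels)
        if best is None or score < best[1]:
            best = (word, score)
    return best


# Made-up write-ups reduced to the words they contain
docs = [
    {"land", "runway"}, {"land", "gear"}, {"stall", "spin"},
    {"stall", "spin", "engine"}, {"land", "mountain"}, {"runway", "gear"},
    {"stall", "takeoff"}, {"land", "engine"},
]
labels = [0, 0, 1, 1, 1, 0, 1, 0]

root_word, root_score = best_split_word(docs, labels)
print("first split:", root_word, round(root_score, 3))

# Grow one more level on the branch where the first word is absent
rest = [(d, y) for d, y in zip(docs, labels) if root_word not in d]
next_word, next_score = best_split_word([d for d, _ in rest], [y for _, y in rest])
print("second split:", next_word, round(next_score, 3))
```

Repeating this search on each branch until the groups are pure or too small yields the full tree.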

121 Identifying Topics Via Latent Semantic Analysis

• Factor loadings from Singular Value Decomposition (SVD) of the document term matrix – Creates U (document) and V (term or words) reduced rank matrices – Fortunately, they are linked so we can go back and forth between the two • Plotting first two eigenvectors of V can show most dominant themes
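The link between the document side (U) and the term side (V) can be shown with a small example. This sketch finds only the leading singular triplet of a toy document term matrix using power iteration, which is a swapped-in technique chosen to keep the example free of external libraries; a real analysis would call a full SVD routine and plot the first two vectors. The matrix and function names are made up.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    """Return (unit vector, original length)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector], norm


def leading_singular_triplet(dtm, iterations=200):
    """Return (u, sigma, v): leading document vector, singular value, term vector."""
    dtm_t = transpose(dtm)
    v, _ = normalize([1.0] * len(dtm[0]))
    for _ in range(iterations):
        # One power-iteration step on X^T X, whose top eigenvector is v
        v, _ = normalize(matvec(dtm_t, matvec(dtm, v)))
    # The document vector follows from the term vector: u = X v / sigma
    u, sigma = normalize(matvec(dtm, v))
    return u, sigma, v


# Toy DTM: rows are documents, columns are the terms below (made-up counts)
terms = ["horse", "race", "stall", "spin"]
dtm = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]
u, sigma, v = leading_singular_triplet(dtm)
print("singular value:", round(sigma, 4))
print("term loadings:", dict(zip(terms, (round(x, 4) for x in v))))
print("document loadings:", [round(x, 4) for x in u])
```

In this toy matrix the dominant theme loads on "horse" and "race" and on the first two documents, which is the kind of grouping a plot of the leading vectors is meant to reveal.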

122 Other Text Visualizations

• Arc Diagram of Les Miserables

123 Text Visualization

▪ Text analytics is not going away
▪ Word clouds and frequency counts are helpful
▪ The Document Term Matrix (DTM) is the key to finding relationships between words
▪ Visualizing the Singular Value Decomposition of the DTM allows you to find topics, quantify unstructured data, and cluster both words and documents
▪ Sentiment analysis visualization methods help gauge overall preferences and emotions

124 Dashboards! Avalanches in Tableau

125 Dashboards

▪ Executive level summary graphs showing key metrics; 4 stages:
  – NOTICE: get the eye to move to the right place
  – FOCUS: quickly get to understand insights
  – INVESTIGATE: intuitive way to drill down and explore
  – ACT: right insight at the right time to take the right action

 Accessible via mobile devices

Dashboards have become a popular means to present critical business information at a glance, but few do so effectively. Huge investments are made in Information Technology to produce actionable information, only to have it robbed of meaning at the very last stage of the process: the presentation of insights to those responsible for making decisions. When designed well, dashboards engage the power of visual perception to communicate a dense collection of information in an instant with exceptional clarity.
Stephen Few

126 Survey on Dashboard Effectiveness

▪ Metrics of a good dashboard
▪ Most organizations have room to improve, especially with unstructured data

TDWI Research Survey 2013

127 Stephen Few’s Common Mistake 1: Exceeding the Boundaries of a Single Screen

▪ More difficult for the mind to recall information that is no longer visible
▪ Seeing everything on one screen allows for quicker and easier comparisons, which lead to quicker insights
▪ People often think information that they must scroll to see is of less importance than what is directly in front of them

rticle/SQ_APRIL_14-key-performance-indicator- vertical-dashboard.gif

128 Common Mistake 2: Supplying Inadequate Context for the Data

▪ Meaningful context is key to understanding the information presented
▪ Context should be incorporated in a way that does not distract the reader from the key message
▪ Context should only be included when it adds real value to avoid crowding and distraction

http://www.excelchart content/uploads/2008 /03/dundas- gauges1.png

129 Common Mistake 3: Displaying Excessive Detail or Precision

Too much detail slows reader without providing benefit

http://www. m/post/KPI -What- Where- Why-and- how-many

130 Common Mistake 4: Expressing Measures Indirectly

▪ Must know what is being measured and in what units
▪ Must find the measure that conveys the meaning most effectively
▪ Find the message needed by viewer, then select best measure to support message

131 Common Mistake 5: Choosing Inappropriate Display Media

▪ Pie charts don’t display quantitative data effectively
▪ Humans can’t compare 2-D areas effectively
▪ Linear displays such as bar graphs convey information more effectively
▪ Common mistake in all quantitative data presentations

http://www.danielp - content/uploads/2 012/11/pie_charts _vs_bar_charts_2. png

132 Common Mistake 6: Introducing Meaningless Variety

▪ People typically don’t like using the same type of chart or graph more than once on a dashboard. This often is detrimental to the dashboard.
▪ Always use the display medium that is most effective even if the dashboard already uses that display medium.
▪ Wherever appropriate, consistency in means of display allows readers to use the same strategy in interpreting information, which saves time.

ki/Images/attachment/dashboard-example.png

133 Common Mistake 7: Using Poorly Designed Display Media

▪ Components of the dashboard must be designed to communicate clearly and efficiently
▪ Most graphs are designed poorly
▪ Legends force reader’s eyes to go back and forth, wasting time

http://www.perceptualed content/uploads/2009/0 2/sas-revenue- graph.jpg

134 Common Mistake 8: Encoding Quantitative Data Inaccurately

▪ Sometimes design errors result in graphs representing values inaccurately

_z6QlRCOBLgo/TD3vOz HzA2I/AAAAAAAAABw/ Bz0rOm4Ijns/s1600/stan dard_vs_diffable+(1).png

135 Common Mistake 9: Arranging Information Poorly

▪ Dashboards must be well organized, with data appropriately placed based on importance and proper viewing sequence, and framed within a visual design that segregates information into meaningful groups (Stephen Few)
▪ Make the dashboard look good but, most importantly, arrange the information in a manner that fits its use
▪ Make important information stand out
▪ Data that needs to be compared should be arranged and visually designed to foster comparisons

oks/en/2.412.1.25/ 1/

136 Common Mistake 10: Highlighting Information Ineffectively or Not at All

▪ Viewer’s eye should be directed to the most critical information
▪ Not all information is of equal importance

http://www.bright efault/files/LG_02 _Customizable_D ashboards.jpg

137 Common Mistake 11: Cluttering the Display with Visual Effects

▪ Backgrounds, artistry, and decorations only distract from the important information presented

WSVMiug9SS4/URFgm O0yYxI/AAAAAAAAAug/ AbVYKgTd- wU/s1600/Snap4.png

138 Common Mistake 12: Misusing or Overusing Color

▪ Color should not be used haphazardly
▪ Hot colors demand attention
▪ Cool colors do not demand attention
▪ Contrasts call attention
▪ Same color creates a relationship between two displays on a dashboard

http://1.bp.blogspo zT3nda4OvLU/UZ yS2mFmn6I/AAAA AAAAEEA/aMems Fmvxxc/s1600/Mai n+DB.png

139 Common Mistake 13: Designing an Unattractive Visual Display

▪ Ugly dashboards make the viewer want to look away, making him or her less inclined to understand all of the information presented

http://www.d ashboardzon content/uplo ads/2008/04/ image-76.jpg

140 Tableau Software

▪ Strictly data visualization software; most popular in industry
  – Stock symbol: DATA
▪ Connects to standard data sources, proprietary databases, and big data platforms such as Hadoop, Teradata, and Google BigQuery
▪ Highly interactive, pretty powerful, and can quickly make graphs

 Goals: – Make data understandable – Manage large data streams – Promote data discovery – Help business decisions

141 Tableau-Hurricanes

 Bring in Excel file, add rows (lat), cols (long), color (basin), label (name)  Change marks (line), path (ISO time/hour), size (wind), color (name), filter (basin)  Animate by pages(ISO time(day)), check Show History

142 Sports Analytics

1. Yankees 2012 HR waterfall chart in Tableau (running total, rev)
2. Spurs 2014-15 performance stats using Graph Builder/Column Switcher

143 Sports Analytics Hockey greats scatterplot by position

144 R Statistical Programming Language

▪ Definitely not strictly data visualization software; most used open source stats software
▪ It can certainly do data visualization, though it may require some proficiency in the language first
▪ Decent capability from main packages and libraries
▪ ggplot2 seems to have the largest following and the most capabilities
  – Grammar of graphics; good defaults; layered customizable results, static
  – Partner ggvis enables web-based interactive graphics
▪ Web-based with Shiny and Markdown; interactive htmlwidgets, plotly, d3, googleVis, and many more packages
▪ Decent review on interactive graphics and data wrangling in a March 2017 ComputerWorld post

145 ggplot2

146 HTMLWidgets

Link to site

147 Hans Rosling

Great TED talks!

148 Questions?

Thank you. Thank you very much.