Data Visualization Short Course
3 April 2017 Jim Wisnowski
[email protected] (210) 218-1384
1 2 MCOTEA Example
3 Air Force Example ▪ Air Force Magazine Feb 2017 trends for women as a percent of the force
4 http://www.airforcemag.com/MagazineArchive/Magazine%20Documents/2017/February%202017/0217infographic.pdf Callan Chart of Sector Performance (Quilt Chart)
5 https://www.callan.com/wp-content/uploads/2017/01/Callan-PeriodicTbl_KeyInd_2017.pdf One Last Warm-Up
▪ Stephen Few is a guru in the data visualization world ▪ Let’s take his quiz on best practices at www.perceptualedge.com ▪ Goal is to get every one wrong—0/10 is success!
6 Objectives
▪ Appreciate the historical perspective of data visualization
▪ Know the value data visualization offers to analytics and Big Data
▪ Understand what makes a good graphical display and some of the common mistakes to avoid in graphical design
▪ Be familiar with some methodologies for the data visualization process
▪ Appreciate how to do data viz with a few common software packages
7 Data Visualization is Not New
Scottish political economist William Playfair in 1786 recognized superiority of graphs over tabular presentations— published 43 time series plots and one bar chart Developed the first pie chart in 1801 to show distribution of Turkish Empire over Europe, Africa, and Asia Stephen Few states we really didn’t progress much from these original ideas until late 1970s with Princeton’s John Tukey and his Exploratory Data Analysis (EDA) He argues most are unaware of modern methods
8 Data Visualization is Not New
▪ Area chart using color was masterful ▪ Playfair credited with the introduction of bar charts
9 Data Visualization is Not New
10 Exploratory Data Analysis
John Tukey, Princeton, 1977 Too much emphasis on hypothesis tests as confirmatory analysis—focus should also be on discovery Objectives – Suggest hypotheses of observed data – Assess statistical test assumptions – Support selection of appropriate methods and tools – Serve as basis for further data collections and experiments If we need a short suggestion of EDA, I would suggest that – It is an attitude; a flexibility; and requires graph paper and transparencies
The greatest value of a picture is when it forces us to notice what we never expected to see…John Tukey
11 Data Visualization Fuel
Most important aspect of data visualization is the data itself Value goes beyond the enterprise/transactional data itself – Unstructured data, social networks, Internet of Things
Data quality is key and dataviz can help improve that! Phil Simon rates organizations on visualization framework – Data (big or small) – Visualization (static or interactive) Start small and scale
If we have data, let’s look at the data. If all we have are opinions, let’s go with mine. Jim Barksdale, Netscape
12 Data is Growing
• Big Data is an overused term, but we know there is GOLD in those data mountains
• 15 TB of Twitter data daily is a lot of data generated; how much gold do we have?
We are exposed to more information in a day than someone from the 15th century was over a lifetime.
90% of today’s data was created in the last 2 years (IBM); 2.5 quintillion bytes per day

In 2015 the number of networked devices was double the entire global population

Of interest: Tera, Peta, Exa, Zetta, Yotta, Brontobytes
Graphic from IBM Research India, presented at Text Mining Workshop Jan 2014 13 Data Visualization Needs Credible Data!
Do not trust any statistics you did not fake yourself…Churchill
Figures don't lie, but liars do figure…Twain
14 Traits of Meaningful Data
High Volume Historical Consistent Multivariate Atomic Clean Clear Dimensionally Structured Richly Segmented Of Known Pedigree
Data Map and Contour Plots are “best practices” 15 Reference: Now You See It by Stephen Few Data Visualization Definition
Data is the new business capital. Data visualization: discovery of solutions that offer highly interactive and graphical user interfaces, are built on in-memory architectures, and are geared toward addressing business users’ unmet ease-of- use and rapid deployment needs. These solutions typically enable users to explore data without much training, making them accessible by a wider range of employees than traditional business analysis tools. SAS Key to making “analytics” approachable is visualization – Visual thinking is essential skill for all – Both an art and science => craft (Berinato, Harvard Business Review) Data is a great but messy story; visual analytics is the master filmmaker to bring the story to life (SAS) Not a great term…was Shakespeare a word sequencer? A picture is worth a thousand data points 16 Data Visualization
Characteristics (Card et al, Information Visualization)
– Computer supported
– Interactive
– Visual representations of location, length, size, color, shape to allow us to see trends
– Abstract data with no physical form (e.g. human body)
Amplify cognition by assisting memory by representing data in ways our brain can easily comprehend
3 facts: Pervasiveness has raised quality expectations, Big Data is here, and the Democratization of Data
90% of data analyses required by most organizations are possible with simple data visualization methods
– Excel is getting better
– Boss wants to know why graphs in meetings are not nearly as pretty as the ones she sees on her fitness tracker (Berinato)
Everyone in our business knows they need to visualize data, but it’s easy to do poorly. We invest in it. We want to use it right while they use it wrong. Daryl Morey
17 Interactive Data Visualization with Excel Consider recent data on automobile fuel economy from the EPA for 2017 year vehicles Attributes such as make, model, mpg, class, cylinders, transmission, valve timing etc Downloaded from http://www.fueleconomy.gov/feg/download.shtml Quick exploration with Excel Pivot Tables, Tableau, and JMP
18 Data Visualization
Allows viewing of vast quantities of data quickly and efficiently Provides better insight into the business problem through discovery Generates a call to action Performs better if interactive and not static for quick stratification, drill down, and filtering Relies less on the IT department and empowers workers once they have access to the data with intuitive tools
19 www.introtopolicyinformatics.wikispaces.asu.edu Democratization of Data Viz
Data visualization methods should allow employees who are not data analysts or scientists the ability to quickly and easily explore data Domain and business expertise critical to data understanding More rapidly find trends, generate hypotheses, identify inconsistencies, and determine additional data support requirements Reduce IT and analyst staff burden—everyone should be numerate Tension growing in non-data driven organizations Need to shorten the “kill-chain” of time data is collected until presented as actionable solution to decision makers – Find, Fix, Track, Target, Engage, Assess (F2T2EA)
Goal: Self- Service Approachable Analytics 20 Interactive Data Visualization For All
Flight misery map
21 Source: Sviokla, Harvard Business Review Police Department: Interactive Criminal Activity
22 http://www.raidsonline.com/?address= San%20Antonio%20TX San Francisco Police Department with JMP
Data is sample file in jmp Use Graph Builder to plot each crime by color Add street map Add filter on station Create html with data file
23 San Francisco PD with JMP A bit more interactive is the Distribution platform Where is there a disproportionate amount of drug activity What days of the week correlate with runaways? What are some safe precincts?
24 Democratization of Data Analytics
Data visualization is no longer just static charts created by IT professionals for meetings Even this graphic is outdated. Many are creating graphs continuously
Source: TDWI Research, 2013 25 The Human Side of Data Visualization
Huge advances in past 25 years in data collection, storage and access; have ignored the primary tool to make information meaningful—the human brain We acquire more information from vision than from all other senses combined 20 Billion neurons in brain used to form patterns from visual information The eye and visual cortex of brain form a massively parallel processor that provides highest bandwidth channel into human cognitive centers—Colin Ware, UNH We seek patterns
Strive for Interocular Traumatic Impact
26 The Human Side of Data Visualization
We have selective visual attention; we are drawn to familiar patterns, and our working memory is limited
Jacques Bertin’s Sémiologie Graphique (1967) describes the basic visual vocabulary for abstract data
– Pre-attentive attributes form the core of good data visualization methods
– Pre-attentive means without prior conscious awareness: the things that “pop out” most
We can only “remember” chunks of at most 3 visualizations, and even then for only a short period
– So don’t make comparisons difficult, for example by putting them on the next chart or further down the page. Side-by-side is best.
27 Pre-Attentive Attributes Shape Length Hue/Contrast
Size Position Color
Enclosure Symmetry
28 Grouping Xan’s Pre-attentive Processing Quiz
29 Pre-attentive Processing
30 Graphic Attributes: Quantitative Scales
Better → Worse
Position, Length, Slope, Area, Color Hue
Position (unaligned), Angle, Color Density
Based on “Graphical Perception: Theory, Experimentation, and Application …” by William Cleveland and Robert McGill, JASA, Sept. 1984
31 The Human Side of Data Visualization
Color is a key pre-attentive attribute
About 0.5% of females and 9% of males are color blind
– Red-green is most common
There is a psychology to color
– Red is the color of extremes: love, violence, danger, anger, and adventure
– Yellow captures our attention more than any other color: happiness and optimism, enlightenment and creativity, sunshine and spring. Lurking in the background is the dark side of yellow: cowardice, betrayal, egoism, and madness. Furthermore, yellow is the color of caution and physical illness (jaundice, malaria, and pestilence).
http://www.colormatters.com/yellow
32 Color Color choice may tell a very different story. – Measles rate of ID vs TN? HSV – Hue: wavelength red, yellow… – Saturation: 1=color, 0=white – Value: brightness, 1=bright, 0=black – Contrast with RGB additive system Beware of default color choices—not often going to send correct message – Rainbow schemes – Intuition (red should be bad, green good) Consider your organization’s branding guidelines
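The HSV description above can be tried out directly. The following is a minimal sketch using Python's standard colorsys module; the helper names hsv_to_hex and sequential_palette are hypothetical and exist only for this illustration. It shows saturation 0 giving white, value 0 giving black, and a single-hue palette as one alternative to a default rainbow scheme.

```python
import colorsys


def hsv_to_hex(hue, saturation, value):
    """Convert HSV components (each in the range 0 to 1) to an RGB hex string."""
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def sequential_palette(hue, steps):
    """Hypothetical helper: hold hue and value fixed, raise saturation from pale to full color."""
    return [hsv_to_hex(hue, (i + 1) / steps, 1.0) for i in range(steps)]


if __name__ == "__main__":
    print(hsv_to_hex(0.0, 1.0, 1.0))   # hue 0 at full saturation and value
    print(hsv_to_hex(0.0, 0.0, 1.0))   # saturation 0 gives white
    print(hsv_to_hex(0.0, 1.0, 0.0))   # value 0 gives black
    print(sequential_palette(0.6, 5))  # five shades of one bluish hue
```

A single-hue ramp like this is often easier to read for ordered quantities than a rainbow, though the right palette still depends on the message and on branding guidelines.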
33 Color Psychology
Red is the color of assertiveness, boldness, power, and extremes: love, violence, danger, anger, stop, and adventure
Pink is soft, tranquil, passive, feminine, health, joy
Orange is warmth, compassion, enthusiasm, fun, energy
Green is nature, balance, environment, healthy, calm, rebirth
Blue is dignified, professional, successful, loyal, positive, authoritative, but also melancholy
Yellow captures our attention most: happiness, optimism, creativity, sunshine and spring; dark side of yellow: cowardice, betrayal, egoism, caution, madness, and medical illness (jaundice, malaria, ...)
Purple: royalty, luxury, wisdom, inspiration, spiritual
Brown: natural, reliable, strong, rustic, conservative, ordinary
Black: classy, formal, authority, power, death, troubles, mourning
White: pure, innocent, clean, new, simple, bland
Gray: neutral, respect, humility, stable, wise
34 Visualization Expectations
Test: http://www.youtube.com/watch?v=xAFfYLR_IRY
35 Which direction is the top middle wheel moving?
36 Are the blocks side by side or stacked?
37 Blue and Black or White and Gold?
38 Ebbinghaus Illusion
39 Visualization
40 Visualization—Spooky
41 Visualization
Jared Leto
Dallas Buyers Club, Fight Club, Thirty Seconds to Mars and super- stoked about data visualization
42 Visualization in Logos
43 Empirical Findings (Berinato)
We don’t go in order like reading—the top title may not be read until well after the visual middle; we spend disproportionate amounts of time in different features We see first what stands out—peaks, valleys, intersections, dominant colors, and outliers We see only a few things at once—with more than 5-10 variables or elements individual meaning begins to fade
We seek meaning and make connections—we incessantly construct narratives of the graph consciously and subconsciously We rely on conventions and metaphors—red is bad, green good, A-N-AF-M, time is on x
44 Value of Data Visualization
Business intelligence and predictive analytics may be viewed as black boxes and not trustworthy; data visualization can add trust and provide insight to these solutions Ultimately it is all about making better business decisions
45 Source: TDWI Research, 2013 Value of Data Visualization
The two areas of data visualization:
– Explanation – tell a story to the audience
– Exploration – understand what the data is telling you
Will take into account the audience’s expectations and composition
Helps you to detect relationships in data
Allows you to understand “Big Data”
▪ …of those who are most effective with Big Data, 98% use data visualization techniques
46 More on Big Data
May be overhyped by media, but is here. 90% of data today wasn’t here 2 years ago – Transition from mainframe to client server to mobile cloud – Extract-Transform-Load model is aging Big Data really has not been solved by most organizations – Resources dedicated to collecting, storing, organizing, and cataloging; – Exploiting Big Data through analytics and viz are behind Web is more visual, efficient, and data-friendly (Phil Simon, Visual Organization)
47 Popular Choices in Data Visualization
Pie charts, line charts, and bar charts still have their place, but are quickly being replaced by more informative and dynamic tools
48 Why Graph Data?
Tabled data in files and spreadsheets are precise and summary statistics are helpful to understand structure up to a point. Graphs quickly convey meaningful relationships that tell a story and point you in the right direction to solve your problem. Interactive visualization takes the graphical capabilities a step further for rapid discovery and hypothesis generation.
The greatest value of a picture is when it forces us to notice what we never expected to see…John Tukey
49 Basic Concepts for Data Visualization
Understand your data Size – Cardinality => high = unique values (acct #); low = repeats (gender) Determine what you are trying to visualize and information conveyed Know your audience and how they process information Use a visual that conveys the information best, simplest, and quickest A “good” graphic is context sensitive (Berinato) – Who will see it? – What do they want? What do they need? – What could I show? What should I show? – How will I show it?
50 Tufte’s Principles
Enforce visual comparison – Conclusions can be drawn by comparing data Show causality – A graph without causality will have no meaning Show multivariate data – Display data using more than two dimensions Integrate all visual elements – Use words, numbers and images where appropriate Content-driven design – Quality, relevance and integrity
51 Principles of Good Graphical Design
Communicate the data with clarity, precision and efficiency Encourage the eye to compare different pieces of data focusing on substance; intriguing and curiosity provoking Make large data sets coherent presenting many numbers in a small space Reveal the data in several layers of detail Serve a clear purpose: description, exploration, tabulation, or decoration Are closely integrated with statistical and verbal descriptions of a data set Are simple, which is much better than unnecessary complexity Generate the greatest number of ideas in the shortest time with the least ink in the smallest space
Reference: The Visual Display of Quantitative Information by Edward Tufte, known as the Strunk and White of graphics
52 Graphical Pillars for Statistical Stories
Simple Clear and concrete Informative and Important Contextual Sequential Seamless Disclose Uncertainty and Truth Emphasis Actionable Clean
Best graph ever?
A quick sketch is better than a long speech Napoleon (perhaps) Demonstration-R, JMP 53 Reference: Now You See It by Stephen Few Other Candidates for Best Graph Ever
54 https://commons.wikimedia.org/wiki/File:Nightingale-mortality.jpg Other Candidates for Best Graph Ever
55 Other Candidates for Best Graph Ever
56 Other Candidates for Best Graph Ever
57 https://flowingdata.com/2015/04/02/how-we-spend-our-money-a-breakdown/ Other Candidates for Best Graph Ever
Birth and Death of 150,000 Notable People Over Last 2000 Years
http://science.sciencemag.org/content/345/6196/558 58 Data Visualization Best Practices
Use appropriate scales: start at 0 for bar charts and end a little above the max value. Stop at 100% when using percentages.
Consider adding reference lines (typically for the y axis) such as the mean, an industry standard, or at 0
Split data into meaningful sub-graphs (trellis graphics) with exactly same scales and structure to better interpret multivariate data
Examine your data using a combination of data visualization methods
Beware of overplotting—e.g. scatterplot that is very dense with points, need to show the volume within each region
Could make points smaller, hollow, jittered or use heat map for multiple observations
59 Source: Now You See It, Stephen Few Bad Practices of Graphical Design
Inappropriate display choices Too much information Misleading axis scaling Difficult to understand: all capital letters, too many abbreviations or jargon, vertical text, insensitive to color, obscure legend Inconsistent ordering or placement Graph is taller than it is wide Too small in presentation Too artistic
If the graphic is bad, the information will be perceived as less credible!
60 Graphs That Cry Help!
http://www.muschealth.com/weight/graph.htm
http://scienceblogs.com/goodmath/2009/03/more_stupid_graphs.php
http://www.macworld.com/article/134708/2008/07/ 61 http://adesigndive.blogspot.com/2010/11/show-and-tell Beach Ball “Graph”
1. Poor color choices 2. Distracting beach ball chartjunk 3. Different fonts throughout
4. Tall not wide 5. Out of order on x scale 6. ALL CAPITAL LETTERS 7. 3 D boxes for 2D data
62 Data-Ink Ratio
Ink on the graph represents the data – Maximize data ink and erase as much non-data ink as possible – Tufte – Data-Ink Ratio = 1 – Proportion of graph that can be erased – Erase non-data ink so that the audience is not drawn away from the importance of the data – Think of gridlines—how important are they and at what frequency?
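The ratio on this slide can be written as a small calculation. This is a minimal sketch, assuming "ink" is measured in some consistent unit such as pixel counts; the function names and the example numbers are hypothetical.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of the graphic's ink that presents data (data ink / total ink)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


def erasable_proportion(data_ink, total_ink):
    """Proportion of the graphic that could be erased without losing data."""
    return 1.0 - data_ink_ratio(data_ink, total_ink)


if __name__ == "__main__":
    # Hypothetical chart: 300 units of ink show data, 1200 units in total
    # (gridlines, borders, and decoration make up the rest).
    print(data_ink_ratio(300, 1200))
    print(erasable_proportion(300, 1200))
```

Thinning the gridlines in this hypothetical chart would lower total_ink while leaving data_ink unchanged, which raises the ratio.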
63 Data Density
Proportion of the graph that is dedicated to displaying data
Maximize data density and the size of the data matrix – Include more data points – Include more variables Sparklines – Demonstration in Excel with GoPro – Demonstration in JMP with CrimeData
64 Lie Factor
A value to describe the relation between the size of effect shown in a graphic and the size of effect shown in the data. Exaggeration or changing of the scale in a graph
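As a worked illustration of the definition above, here is a minimal sketch that computes the lie factor as the relative change shown in the graphic divided by the relative change in the data. The numbers are recalled from the fuel economy example in Tufte's book (a standard rising from 18 to 27.5 mpg, drawn as lines 0.6 and 5.3 inches long) and should be checked against the reference; the function names are hypothetical.

```python
def relative_change(first, second):
    """Size of effect as a fraction of the starting value."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by effect present in the data."""
    return relative_change(graphic_first, graphic_second) / relative_change(data_first, data_second)


if __name__ == "__main__":
    # Line lengths in inches versus fuel economy standards in mpg.
    factor = lie_factor(0.6, 5.3, 18.0, 27.5)
    print(round(factor, 1))
```

A value near 1 means the graphic represents the data faithfully; values well above 1 mean the effect is exaggerated.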
Reference: The Visual Display of Quantitative Information by Edward Tufte 65 Chart Junk
▪ Decorative elements that provide no data and cause confusion ▪ Distract the viewer from valuable information
66 Chart Junk—Another Point of View
Borkin et al. from Harvard and MIT conducted experiment on what makes a graph memorable Over 2,000 images; 400 were shown to study participants for 1 second, they then took quiz on which ones they saw.
67 Chart Junk—Another Point of View
Results showed human recognizable objects are most important for memorability. It also helps if the visualization is “distinct”, visually dense, colorful, and has a low data-ink ratio
68 How To Do Data Visualization
Two primary questions before choosing a graphic:
– Is the information conceptual or data-driven?
– Am I declaring or exploring something?
69 Scott Berinato, Good Charts, Harvard Business Review Press How To Do Data Visualization-Plan
Prepare (5 mins): have paper and pen, put aside data to think about ideas, write the basics of who visualization is for and what setting Talk and Listen (15 mins): discuss with colleague what you’re trying to prove or explore; capture words, phrases and statements to summarize goals Sketch (20 mins): Focus on keywords from above steps, quickly sketch out multiple visuals Prototype (20 mins): take best sketch and make it more accurate and detailed
Fight the impulse to directly graph your data with preset options
70 Scott Berinato, Good Charts, Harvard Business Review Press How To Do Data Visualization-Create
Focus on structure and hierarchy – Need title (12%), subtitle (8%), visual field (75%), and data source line (5%) Focus on design clarity (“hit the ball squarely”) – Aggressively remove extraneous elements and let them highlight the idea – Make sure each element has a single purpose that cannot be misinterpreted – Use natural conventions and metaphors Focus on design simplicity – Minimize number of colors-gray for second level information (gridlines, etc) – Place labels and legends close to what they describe
Goal is to make the graph more understandable—not more attractive
71 Scott Berinato, Good Charts, Harvard Business Review Press How To Do Data Visualization-Refine to Persuade
Hone the main idea – What am I trying to show versus I need to convince them… – Think active words Make it stand out – Emphasize with color, pointers, labels, markers, … – Isolate it by reducing other elements Adjust what’s around it – Add reference points and lines – Remove elements that distract with integrity – Create context and comparisons
How can you sell this most effectively?
72 Scott Berinato, Good Charts, Harvard Business Review Press How To Do Data Visualization-Presentation
Show chart and stop talking Don’t read the picture, talk about ideas not structure Guide audience for unconventional visualization Add context by showing reference, average or ideal graph versus one shown Turn off chart when have something important to say Put a more detailed version in backup for them to take away Create tension by showing parts (builds) so they speculate— makes it more memorable; use time and reveal gradually Bait and switch by luring into what they expect and show different Deconstruct and reconstruct—drill down dynamically
Does presentation hide something that would rightfully challenge idea?
73 Scott Berinato, Good Charts, Harvard Business Review Press How To Do Data Visualization-Be a Critic
Note what you see first—what stands out? Note the first idea that forms, then search for more What are likes, dislikes, wish I saws What 3 things would you change and why Sketch your own version and critique yourself
It’s not the critic who counts…the credit belongs to the man who is actually in the arena. Teddy Roosevelt
74 Scott Berinato, Good Charts, Harvard Business Review Press Graphs
Plots of deterministic relationships Univariate Bivariate Multivariate Time Series Maps
75 Graphing Functions
Deterministic equations easily plotted across the range of input variables
76 Univariate Categorical Plots
Simple example but shows key points
Evaluate the relative proportions: part-to-whole.
Determine best graphs: bar graphs and Pareto plots, though pie charts are (too) often used.
[Figure: three charts of Percent of USAA Mbrs by category (Family Mbr, Retired, Officer, Enlisted, Employee, Gov't Civilian, Other), labeled Good, Better, and Best. The Better chart carries the slice labels 4%, 8%, 10%, 12%, 13%, 15%, and 38%. The Best chart is titled “Distribution of USAA Membership Eligibility” with a y-axis from 0% to 40%.]
77 What’s the Matter With Pie Charts
They are overused and do not hit the “pre-attentive” attributes as hard as other methods – Length and position are the most discriminatory Difficult to compare 2-D areas or angles. DON’T go to 3-D!! Tough to decipher when have a legend you have to go back and forth to
http://annarborchronicle.com/wp-content/uploads/2012/12/DNR-StatewideByCounty.jpg 78 http://www.outsidethebeltway.com/wp-content/uploads/2010/01/us-states-population-pie-chart.png If You Use Pie Charts
Keep the number of slices to a maximum of 4 or 5
– Actually not bad since we are so familiar with these
Add text labels for percentages and categories if desired
Exploded Pie emphasizes a proportion; don’t explode more than 25% of the slices
Put largest wedge at 1:00 and make progressively smaller clockwise until 12:00
Donut charts do nut [sic] solve the problem
[Figure: pie chart of PercentMbrs by category: Family Mbr, Retired, Officer, Enlisted, Employee, Gov't Civilian, Other]
79 Histograms
A graphical representation of the distribution of a continuous variable
Group data into similar sized bins and count frequencies in data set
Easy to infer probabilities and relative importance
Commonly occurring “shapes” are known probability distributions (normal, uniform, exponential, Weibull, Beta…)
Bars can be omitted for a Frequency Polygon
Excel now has a menu option in addition to the Analysis ToolPak
Bin width is a critical parameter and should be the same for all bins
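The binning step described above can be sketched in a few lines. This is a minimal illustration with equal-width bins, not how Excel or JMP choose bins; the function name and the sample mileage values are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Split the data range into n_bins equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls exactly on the last edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


if __name__ == "__main__":
    mpg = [17, 21, 23, 23, 25, 25, 26, 28, 30, 31, 33, 33, 34, 35, 38, 40, 48]  # hypothetical
    edges, counts = histogram_counts(mpg, 4)
    for left, right, count in zip(edges, edges[1:], counts):
        print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Rerunning with a different n_bins shows how strongly the apparent shape depends on bin width.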
80 Box Plots
Excellent choice to show distributions
– Line at median, size of box is interquartile range (25th to 75th percentile), whiskers extend to 1.5 times IQR or max/min
– Display differences between population centers without making assumptions; best for multiple boxes
– IQR is a robust measure of spread (useful when checking for equal variances)
– Excel!! Finally
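The pieces of a box plot listed above can be computed directly. This is a minimal sketch using Python's statistics.quantiles with the "inclusive" method; quartile conventions differ between packages, so Excel or JMP may report slightly different box edges. The function name and sample data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Median, quartiles, IQR, whisker ends, and points beyond 1.5 times IQR."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme observations still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


if __name__ == "__main__":
    print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))
```

Computing one summary per categorical level and comparing the IQR values gives the same quick check on equal spread that side-by-side boxes provide visually.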
81 Box Plots
Best used when there are several categorical levels of a variable – Can quickly evaluate if the variances (size of boxes) are approximately equal – Can determine if the means/medians are close based on relative positioning – Limited inherent capability in Excel; Box Charter add-in
82 Heat Maps
Intuitive way to quickly identify observations that are at the extreme ends of the distribution
Example: 1918 US Flu Epidemic. Shown below are death rates per 100,000 for several age groups. Not surprisingly the babies and elderly had the highest rates. Interestingly, 25-34 year olds also had very high rates.
WWI vets returning via crowded and infected trains
Excel Conditional Formatting is excellent.

Age      M        F        Total Population
<1       2520.5   2020.4   4540.9
1-4      712      724.2    1436.2
5-14     162.5    190.2    352.7
15-24    700.6    475.1    1175.7
25-34    1216.6   781.4    1998
35-44    691.1    406.5    1097.6
45-54    411.8    275      686.8
55-64    420.8    339.2    760
65-74    655.8    636.5    1292.3
75-84    1112.9   1239     2351.9
>84      2111.2   2320.5   4431.7

83 Pareto Plot
• A creative and best practice approach for part-to-whole graphics
• 80/20 rule: 80% of Italy’s wealth held by 20% of residents
• Excel chart option now, and accessible via the Histogram option in the Analysis ToolPak for continuous data
• Pareto Plots are often based on categorical levels of the input factor
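The numbers behind a Pareto plot are just sorted counts with a running cumulative percentage. This is a minimal sketch; the function name and the defect categories are hypothetical and not taken from the course data.

```python
def pareto_table(counts):
    """Sort categories by count (largest first) and add the cumulative percent."""
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += count
        rows.append((category, count, 100.0 * running / total))
    return rows


if __name__ == "__main__":
    defects = {"Scratch": 12, "Dent": 45, "Misalignment": 30, "Other": 13}  # hypothetical
    for category, count, cumulative in pareto_table(defects):
        print(f"{category:14s} {count:4d} {cumulative:6.1f}%")
```

The bars of the plot are the sorted counts, and the overlaid line is the cumulative percent column.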
84 Pivot Table and Pivot Chart
▪ Powerful Excel analytical and graphical capability ▪ Data summarization tool that sums across levels of variables ▪ Flexible construction of tables/graphs ▪ Easy to filter, dynamic
Average of MPG Hwy
Row Labels    Hybrid   Non hybrid    Grand Total
4             38.75    30            30.52238806
6                      24.04615385   24.04615385
8             23       19.96610169   20.01666667
Grand Total   35.6     24.76470588   25.046875

[Figure: pivot chart of Average of MPG Hwy for row labels 4, 6, and 8, Hybrid vs Non hybrid, y-axis from 0 to 45]
85 Demonstration Univariate Graphs
▪ Excel, JMP, R
86 Association between Variables
▪ An association exists between two variables if the distribution of one variable changes when the level (or values) of the other variable changes. ▪ If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.
Anscombe’s Quartet All have same mean, variance, regression equation and correlation coefficient
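The claim about Anscombe's Quartet can be checked numerically. This is a minimal sketch in plain Python; the data values are the commonly published quartet, typed in from memory, so verify them against a trusted copy before relying on the output. The summarize function is a hypothetical helper.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, least-squares line, and Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": variance(x), "var_y": variance(y),
        "slope": slope, "intercept": my - slope * mx,
        "r": sxy / (sxx * syy) ** 0.5,
    }


if __name__ == "__main__":
    for name, (x, y) in QUARTET.items():
        s = summarize(x, y)
        print(name, {key: round(val, 2) for key, val in s.items()})
```

The summaries agree to about two decimal places across all four sets, yet scatterplots of the same pairs look completely different, which is the argument for graphing before modeling.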
87 Scatterplot Matrix
▪ Consider the fuel efficiency data, we can quickly see relationships between several variables
88 Area Plot
▪ Useful to display multiple dependent variables as a function of a single continuous independent variable ▪ Can also do part-to-whole ▪ Beware of “hiding” data
89 Parallel Plot, Parallel Coordinates Plot, Parallel Axis Plot
Specify two or more variables for the response and can effectively use color for another variable
90 Contour Plots—Association Between 3 Variables • Contour lines show level of a variable • Useful in multivariate applications to find locations of min/max response
[Figure: contour plot with axis ticks from 60 to 130 and 1 to 8; legend entries Series1 through Series6]
91 Association – Categorical Variables
▪ Best graphs: bar graphs, histograms, mosaic plots, and tree maps: • Brushing methods enable you to see the distribution of a categorical variable conditioned on the setting of another. • Mosaic plots and tree maps are especially good with multiple levels of a categorical variable; unfortunately Excel does not have these graphs. ▪ For two (or more) categorical variables, the relationship is measured by association or dependence, not correlation.
92 Association – Categorical Variables Titanic Example ▪ Mosaic plots allow you to determine if survival is dependent on what class you were in and also if gender makes a difference ▪ Excel not set up for Treemaps, but MS has add-in
93 Association – Brushing
▪ Brushing or dynamic linking is a great interactive way to explore relationships in your data by evaluating how a subset of observations based on the level of one variable behaves in another. ▪ We see the cross hatched values correspond to the 8 cylinder vehicles that also have more HP and less MPG
94 Which is Best
▪ Researchers have been looking into the science of data visualization ▪ Rensink 2010: Weber’s Law to scatterplots “a noticeable change in stimulus is a constant ratio of the original stimulus” ▪ Think of a match lit in a pitch black room versus a lighted room ▪ First time method is available to calculate graph effectiveness ▪ Good-scatterplots in positive correlation direction ▪ Okay-parallel coordinate plots, scatterplot negative direction, stacked area ▪ Bad-stacked bar, radar plots
95 Modeling Visualization
▪ Profilers and Contour Plots ▪ F-18 Central Composite Design Multiple Response Optimization
96 Demonstration for Multivariate Graphs
▪ Excel, JMP, R ▪ Anscombe’s Quartet
97 Time Series- Book Sales
98 Amazon.com Data Mining Example
Peter Lawrence’s The Making of a Fly
Bargain at $23,698,655.93 + $3.99 shipping
Two wholesalers’ pricing algorithms went awry
– 17 used copies at $35, but only n=2 new copies
– Bordee = 1.27 × Profnath; Profnath = 0.998 × Bordee
– Algorithm quickly goes out of control: Apr 18 $23M
– Apr 19 Profnath = $106 and Bordee = $135
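The runaway pricing follows from the two multipliers on the slide: each round trip multiplies the price by 1.27 × 0.998, which is greater than 1. This is a minimal sketch of that feedback loop. The $35 starting price and the idea of one update "cycle" are assumptions for illustration, not details of how the sellers' systems actually ran.

```python
def cycles_until(start_price, target_price, bordee_factor=1.27, profnath_factor=0.998):
    """Count repricing cycles until Bordee's price reaches target_price."""
    if bordee_factor * profnath_factor <= 1:
        raise ValueError("prices only spiral upward when the combined factor exceeds 1")
    profnath = start_price
    bordee = bordee_factor * profnath
    cycles = 0
    while bordee < target_price:
        profnath = profnath_factor * bordee  # Profnath undercuts Bordee slightly
        bordee = bordee_factor * profnath    # Bordee reprices above Profnath
        cycles += 1
    return cycles, profnath, bordee


if __name__ == "__main__":
    cycles, profnath, bordee = cycles_until(35.00, 23_698_655.93)  # start price is assumed
    print(f"combined growth per cycle: {1.27 * 0.998:.4f}")
    print(f"cycles: {cycles}, Profnath: ${profnath:,.2f}, Bordee: ${bordee:,.2f}")
```

Plotting the two price series over the cycles on a log scale would show two parallel straight lines, a quick visual signature of exponential growth.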
99 Time Series
Look for six basic patterns: overall trend, variability, rate of change, covariation, cycles, and exceptions/outliers. Best graphs are line graphs, overlay plots, and bar graphs; many other displays are possible over time (for example, box plots and bubble plots). Run charts with control limits are the cornerstone to process control methods. 75% of graphs are time series
100 Time Series in Excel
Decent capability for customizable line graphs New for 2016 was Waterfall Charts Demonstration
101 Time Series: Google Trends
Is interest in golf waning? What does this mean for Under Armour?
102 102 JMP Output Google Trends
103 103 Time Series
Line Plots: use to display patterns, trends, cycles, and exceptions
Bar Graphs: use to emphasize or compare values (e.g. Budget versus Actual)
Dot Plots: use if you have irregular time intervals. In Excel, just delete the line in a line plot with markers
Heat Maps for high volume of data to find exceptions and cycles
Box Plots for analyzing distribution (mean and variance) changes over time
Animation may be useful to see changes over time
104 Time Series
Line Chart shows comparison, variability, cycles, trend
[Figure: line chart, “Lord Abbett Monthly Price Change Compared to S&P 500”, monthly from 9/1/2003 to 11/1/2005, y-axis from -0.25 to 0.2, series S&P and Lord Abbett]
[Figure: “Stacked Column Chart for Lord Abbett vs S&P 500”, y-axis from -0.25 to 0.2, series S&P and Lord Abbett]
Stacked Column shows the differences better
Both use Excel defaults apart from title and legend size
Time series analysis methods exist to forecast
105 Sparklines
▪ A small, intense, simple, word-sized graphic with typographic resolution
▪ Typically in a cell after the last value in an ordered series (e.g. time)
▪ Can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic
▪ Started with Excel 2010; add-in available for 2007
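To show how a word-sized graphic can sit inline with text, here is a rough sketch that renders a series as a row of Unicode block characters. This is a text stand-in, not how Excel draws sparklines; the function name and sample data are hypothetical.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Map each value to one of eight block characters, scaled to the data range."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BARS[0] * len(values)  # flat series: draw the lowest bar throughout
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)


# Hypothetical monthly values placed after a label, as in a table cell.
monthly = [3, 4, 6, 5, 8, 9, 7, 10]
print("Sales", sparkline(monthly), monthly[-1])
```

The result is a plain string, so it can be embedded in a sentence, a table cell, or a headline.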
106 Conditional Formatting: Beyond Heat Maps
▪ Use to highlight specific observations that exceed a certain value (e.g. 30 MPG), fall within a range, and so forth
▪ Top or bottom 10%, 10 values, above the mean, below the mean
▪ Data bars to show what percentile the observation ranks
▪ Icon sets for dashboard type graphics
[Figure: four MPG Hwy columns illustrating conditional formats]
107 Maps
Big push in recent years has been geospatial or mapping
Many software options
Maps can be extremely useful, but they do limit our pre-attentive attributes; use with caution
Still very good for visual discovery and analytics
Think beyond geography of what a map is!
108 Alluvial Plots
▪ NFL 2015 season predictions from Nate Silver’s fivethirtyeight.com. Data viz software is D3.js
http://www.brightpointinc.com/2015-nfl-predictions/
109 Visualization of Text Data
Consider the Pareto of word counts from an article this year on cnn.com
Can you get the general idea of what might have happened by these frequencies alone?
Note we’ve “stemmed” words, so hors = horse, horses, horsing, horse’s, … and, of course, hors
110 WordCloud
http://www.cnn.com/2015/06/06/us/belmont-stakes-american-pharoah/
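A Pareto of stemmed word counts, as in the cnn.com example, could be sketched along these lines. The stemmer is a toy suffix-stripper written for this illustration, not the Porter algorithm or whatever tool produced the slide. The stopword list and sample sentence are also made up.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "es", "s", "e")  # toy rule set, checked in this order
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "by", "on", "at"}


def crude_stem(word, min_stem=4):
    """Strip a possessive and at most one suffix, keeping at least min_stem letters."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def pareto_counts(text, top=10):
    """Return the most common stems in descending order of frequency."""
    tokens = re.findall(r"[A-Za-z]+(?:'s)?", text)
    stems = [crude_stem(t) for t in tokens if t.lower() not in STOPWORDS]
    return Counter(stems).most_common(top)


sample = "The horse beat the other horses; horsing around is over."
print(pareto_counts(sample, top=3))
```

Sorting the counts in descending order is what makes the bar chart a Pareto; the same counts can also size the words in a word cloud.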
111 Word Clouds
www.wordle.net is a very fun (free) site to paste in your text and make your own word clouds
112 Text Groupings From Eigenvectors
Statistical methods can tame the unstructured text to find words that cluster together and common themes
113 Sentiment Analysis
In many applications, such as with online product reviews, we would like to know whether the customer base has a positive, neutral, or negative attitude about the product or service.
We can count the number of “positive” and “negative” words using a generic list of terms; it may be useful to have custom “positive” and “negative” word lists.
Combine this with Twitter and other social media and you have real-time feedback: “Opinion Mining”.
The method just uses the DTM and cross-tabulates it with the Harvard list of positive and negative sentiments.
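The word-counting approach can be sketched in a few lines. The two word lists below are tiny placeholders invented for this example; they stand in for a real lexicon such as the Harvard list mentioned on the slide. The function names are hypothetical.

```python
import re

# Placeholder lexicons; a real analysis would load a full sentiment word list.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}


def sentiment_counts(text, positive=POSITIVE, negative=NEGATIVE):
    """Count how many words in the text appear in each lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return pos, neg


def label(text):
    """Label a text positive, negative, or neutral by comparing the two counts."""
    pos, neg = sentiment_counts(text)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


reviews = [
    "Great phone, love the screen",
    "Terrible battery and poor support",
    "It arrived on Tuesday",
]
for review in reviews:
    print(label(review), "|", review)
```

A custom word list is just a different set passed to `sentiment_counts`, which is often needed when domain vocabulary carries a different tone than in general usage.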
114 Sentiment Analysis
Look at Bible by Book and Chapter
115 Sentiment Analysis Demonstration in JMP
116 Sentiment Analysis Demonstration in JMP
117 Correlation of Word Pairs from DTM
118 What Words Associated with Fatal?
Different word frequencies
[Figure: word frequency panels, Not Fatal vs. Fatal]
119 What Words Associated with Fatal?
Crosstab Fatal structured variable with word counts
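One way to picture the crosstab is to build a small document term matrix and tally, for each word, how many fatal and non-fatal write-ups contain it. The four reports below are invented for illustration and are not NTSB data. The function names are hypothetical.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows


def crosstab_by_outcome(docs, fatal_flags):
    """For each term, count the documents containing it, split by outcome."""
    vocab, rows = document_term_matrix(docs)
    table = {w: {"fatal": 0, "not_fatal": 0} for w in vocab}
    for row, fatal in zip(rows, fatal_flags):
        key = "fatal" if fatal else "not_fatal"
        for w, n in zip(vocab, row):
            if n:  # the term appears at least once in this document
                table[w][key] += 1
    return table


# Made-up write-ups with a structured fatal / not-fatal flag.
reports = [
    "stall spin after takeoff",
    "hard landing on runway",
    "stall on approach",
    "gear collapse on landing",
]
fatal = [True, False, True, False]
table = crosstab_by_outcome(reports, fatal)
for word in ("stall", "landing", "on"):
    print(word, table[word])
```

Words whose counts are lopsided between the two columns are the candidates for association with the outcome.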
120 Tree Model for Fatal on NTSB DTM
• Classification Tree groups observations based on presence or absence of a word
• If “land” is in the write-up, a fatality is very unlikely unless “mountain” is too
• If “stall/spin” is in the write-up, a fatality is very likely
121 Identifying Topics Via Latent Semantic Analysis
• Factor loadings from Singular Value Decomposition (SVD) of the document term matrix
– Creates U (document) and V (term or words) reduced rank matrices
– Fortunately, they are linked so we can go back and forth between the two
• Plotting first two eigenvectors of V can show most dominant themes
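To make the link between U and V concrete, here is a dependency-free sketch that finds only the leading singular triplet of a tiny document term matrix by power iteration. In practice a library SVD would return all components at once. The matrix, term labels, and function names are assumptions made for the example.

```python
import math


def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def normalize(x):
    """Return the unit vector along x together with the original length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x], norm


def leading_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (u, sigma, v) for the largest singular value."""
    at = transpose(a)
    v, _ = normalize([1.0] * len(a[0]))  # all-ones start suits a nonnegative DTM
    for _ in range(iterations):
        v, _ = normalize(matvec(at, matvec(a, v)))
    u, sigma = normalize(matvec(a, v))   # u = A v / sigma links documents to terms
    return u, sigma, v


# Hypothetical DTM: rows are documents, columns are the terms below.
terms = ["horse", "race", "stall", "spin"]
dtm = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
u, sigma, v = leading_singular_triplet(dtm)
print("sigma:", round(sigma, 4))
print("term loadings:", dict(zip(terms, (round(x, 4) for x in v))))
print("document loadings:", [round(x, 4) for x in u])
```

The dominant theme here loads on "horse" and "race" and on the first two documents, which shows the back-and-forth between V and U. A second component, needed for a two-dimensional plot, would require deflation or a full SVD routine.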
122 Other Text Visualizations
• Arc Diagram of Les Miserables
http://gastonsanchez.com/software/les_miserables_arcdiagram.pdf
123 Text Visualization
Text analytics is not going away
Word clouds and frequency counts are helpful
Document Term Matrix is the key to finding relationships between words
Visualizing Singular Value Decomposition of DTM allows you to find topics, quantify unstructured data, and cluster both words and documents
Sentiment analysis visualization methods help gauge overall preferences and emotions
124 Dashboards! Avalanches in Tableau
125 Dashboards
Executive level summary graphs showing key metrics; 4 stages:
▪ NOTICE: get eye to move to right place
▪ FOCUS: quickly get to understand insights
▪ INVESTIGATE: intuitive way to drill down and explore
▪ ACT: right insight at right time to take right action
Accessible via mobile devices
Dashboards have become a popular means to present critical business information at a glance, but few do so effectively. Huge investments are made in Information Technology to produce actionable information, only to have it robbed of meaning at the very last stage of the process: the presentation of insights to those responsible for making decisions. When designed well, dashboards engage the power of visual perception to communicate a dense collection of information in an instant with exceptional clarity.
Stephen Few
126 Survey on Dashboard Effectiveness
Metrics of a good dashboard
Most organizations have room to improve, especially with unstructured data
TDWI Research Survey 2013
127 Stephen Few’s Common Mistake 1: Exceeding the Boundaries of a Single Screen
More difficult for the mind to recall information that is no longer visible
Seeing everything on one screen allows for quicker and easier comparisons, which lead to quicker insights
People often think information that they must scroll to see is of less importance than what is directly in front of them
http://www.helpsystems.com/sites/default/files/article/SQ_APRIL_14-key-performance-indicator-vertical-dashboard.gif
128 Common Mistake 2: Supplying Inadequate Context for the Data
Meaningful context is key to understanding the information presented
Context should be incorporated in a way that does not distract the reader from the key message
Context should only be included when it adds real value to avoid crowding and distraction
http://www.excelcharts.com/blog/wp-content/uploads/2008/03/dundas-gauges1.png
129 Common Mistake 3: Displaying Excessive Detail or Precision
Too much detail slows reader without providing benefit
http://www.funkylab.com/post/KPI-What-Where-Why-and-how-many
130 Common Mistake 4: Expressing Measures Indirectly
Must know what is being measured and in what units
Must find the measure that conveys the meaning most effectively
Find the message needed by viewer, then select best measure to support message
131 Common Mistake 5: Choosing Inappropriate Display Media
Pie charts don’t display quantitative data effectively
Humans can’t compare 2-D areas effectively
Linear displays such as bar graphs convey information more effectively
Common mistake in all quantitative data presentations
http://www.danielpradilla.info/blog/wp-content/uploads/2012/11/pie_charts_vs_bar_charts_2.png
132 Common Mistake 6: Introducing Meaningless Variety
People typically don’t like using the same type of chart or graph more than once on a dashboard. This often is detrimental to the dashboard.
Always use the display medium that is most effective even if the dashboard already uses that display medium.
Wherever appropriate, consistency in means of display allows readers to use the same strategy in interpreting information, which saves time.
http://sourceforge.net/p/art/wiki/Images/attachment/dashboard-example.png
133 Common Mistake 7: Using Poorly Designed Display Media
Components of the dashboard must be designed to communicate clearly and efficiently
Most graphs are designed poorly
Legends force reader’s eyes to go back and forth, wasting time
http://www.perceptualedge.com/blog/wp-content/uploads/2009/02/sas-revenue-graph.jpg
134 Common Mistake 8: Encoding Quantitative Data Inaccurately
Sometimes design errors result in graphs representing values inaccurately
http://2.bp.blogspot.com/_z6QlRCOBLgo/TD3vOzHzA2I/AAAAAAAAABw/Bz0rOm4Ijns/s1600/standard_vs_diffable+(1).png
135 Common Mistake 9: Arranging Information Poorly
Dashboards must be well organized, with data appropriately placed based on importance and proper viewing sequence, and framed within a visual design that segregates information into meaningful groups (Stephen Few)
Make the dashboard look good but, most importantly, arrange the information in a manner that fits its use
Make important information stand out
Data that needs to be compared should be arranged and visually designed to foster comparisons
http://flylib.com/books/en/2.412.1.25/1/
136 Common Mistake 10: Highlighting Information Ineffectively or Not at All
Viewer’s eye should be directed to the most critical information
Not all information is of equal importance
http://www.brightedge.com/sites/default/files/LG_02_Customizable_Dashboards.jpg
137 Common Mistake 11: Cluttering the Display with Visual Effects
Backgrounds, artistry, and decorations only distract from the important information presented
http://4.bp.blogspot.com/-WSVMiug9SS4/URFgmO0yYxI/AAAAAAAAAug/AbVYKgTd-wU/s1600/Snap4.png
138 Common Mistake 12: Misusing or Overusing Color
Color should not be used haphazardly
Hot colors demand attention
Cool colors do not demand attention
Contrasts call attention
Same color creates a relationship between two displays on a dashboard
http://1.bp.blogspot.com/-zT3nda4OvLU/UZyS2mFmn6I/AAAAAAAAEEA/aMemsFmvxxc/s1600/Main+DB.png
139 Common Mistake 13: Designing an Unattractive Visual Display
Ugly dashboards make the viewer want to look away, making him or her less inclined to understand all of the information presented
http://www.dashboardzone.com/wp-content/uploads/2008/04/image-76.jpg
140 Tableau Software
Strictly data visualization software; most popular in industry
– Stock symbol: DATA
Connects to standard data sources, proprietary databases, and big data such as Hadoop, Teradata, Google BigQuery
Highly interactive, pretty powerful, and can quickly make graphs
Goals:
– Make data understandable
– Manage large data streams
– Promote data discovery
– Help business decisions
https://public.tableau.com/s/gallery/good-value-mbas
141 Tableau-Hurricanes
Bring in Excel file, add rows (lat), cols (long), color (basin), label (name)
Change marks (line), path (ISO time/hour), size (wind), color (name), filter (basin)
Animate by pages (ISO time (day)), check Show History
142 Sports Analytics
1. Yankees 2012 HR waterfall chart in Tableau (running total, rev)
2. Spurs 2014-15 performance stats using Graph Builder/Col Sw
143 Sports Analytics
Hockey greats scatterplot by position
144 R Statistical Programming Language
Definitely not strictly data visualization software; most used open source stats software
It can certainly do data visualization, though it may require some proficiency in the language first
Decent capability from main packages and libraries
ggplot2 seems to have the largest following and the most capabilities
– Grammar of graphics; good defaults; layered customizable results, static
– Partner ggvis enables web-based interactive graphics
Web-based with Shiny and Markdown; interactive htmlwidgets, plotly, d3, googleVis, and many more packages
Decent review of interactive graphics and data wrangling in a March 2017 Computerworld post
145 ggplot2
146 HTMLWidgets
Link to site
147 Hans Rosling
Great TED talks!
http://www.gapminder.org/tools
148 Questions?
Thank you. Thank you very much.
149