Visualizing Uncertainty What we talk about when we talk about uncertainty

Zheng Yan Yu

Thesis Presented by Zheng Yan Yu

to The Department of Arts, Media and Design In Partial Fulfillment of the Requirements for the Degree of Master of Fine Arts in Information Design and Visualization

Northeastern University Boston, Massachusetts April, 2018

Visualizing uncertainty Thesis defense

Abstract

Most visualizations have been designed on the assumption that the visually represented data are free from uncertainty. However, this is rarely the case. Visualizing uncertainty is essential if we want to improve how people understand statistics. We are surrounded by summary statistics and statistical inference in our daily lives, such as means and probabilities in newspapers, government journals, and mobile applications, and these statistics often lead us to a biased understanding of the data. Summary statistics often cannot reflect the reality of the data. Visualizing the uncertainty hidden beneath statistics can improve the way we interpret them, because we look at whole datasets to find comparisons. This thesis focuses on the uncertainties of summary statistics and statistical inference.

Dealing with the visualization of uncertainty is an important issue. However, there are few existing best-practice guidelines for people who need to visualize uncertainty. Although many data visualization practitioners are highly interested in this topic and have produced many related examples, visualizing uncertainty is regarded as one of the unsolved problems in the data visualization community. There are several reasons for this problem.

First, there is a gap between the scientific visualization community and other domains such as news agencies and consulting companies. In the scientific community, researchers have a systematic way to deal with visualizing uncertainty, but their methods are not accessible to non-experts outside of the scientific community. That is partly because scientific researchers proposed methodologies for visualizing uncertainty but did not connect them to the problems the general public cares more about.

Second, the traditional methods for visualizing uncertainty are limited. For example, the box plot and error bars cannot reflect the true distribution of data. In addition, traditional ways to visualize uncertainty usually do not consider visual cognition and perception. They often require users to have an understanding of statistics in order to understand the visualization.

I propose an updated taxonomy of uncertainty to help people deal with visualizing uncertainty. By connecting examples of visualizing uncertainty to the scientific uncertainty taxonomy, I bridge the gap between science and the other domains. Furthermore, I propose three instantiations of visualizing the uncertainty of statistical predictive models. By visualizing my research outcomes, these instantiations touch on a less discussed topic: visualizing the uncertainty of statistical predictive models.

Acknowledgements

This thesis could not have been completed without the help of many people. I would like to express my deepest appreciation to all of them.

Without the emotional and financial support of my family, I could not have studied abroad in the United States. Thank you, my grandfather: you gave me all of your earnings to support my tuition. It was your hard-earned money, but you gave me all of it when you learned that I had decided to pursue my master's degree in the USA. Thank you, my father and mother: you have always supported me, especially when I decided to study in Beijing and Boston. When I am not sure whether the decisions I make are correct, you always say that I will not regret them if I think them through.

Thank you, my advisor, Pedro Cruz. When I felt confused about my thesis, you guided me out of the maze (several times). Thank you, Dietmar Offenhuber, the instructor of the thesis course and the program lead of IDV. You put much energy into caring for each student, listened to us carefully, and solved our issues patiently. You pointed out many problems with my thesis, and I am very grateful for these suggestions, which made my thesis solid. Thank you, Miso Kim, for encouraging me to keep working on the same thesis topic after the proposal. Thank you, Nathan Felde, for helping me go through an extensive and conceptual exploration of uncertainty. The exploration helped me build my system of thinking about uncertainty. Thank you, Paul Kahn, for continuously giving feedback on my thesis exhibition. The discussions over a couple of weeks helped me build and polish the artifacts of my thesis.

Thank you, my IDV colleagues. Besides the help from professors, all of you provided great ideas and help on my thesis through the past two whole semesters. The discussions during class are the most valued treasure I have received from learning in IDV.

Thank you all.

Contents

Part 1 Introduction

Part 2 What we talk about when we talk about uncertainty
1 Uncertainty visualization precursors
2 An overview of visualizing uncertainty in the scientific community
3 The topic of visualizing uncertainty in the news agencies, consulting companies, and other domains
4 New ways to visualize uncertainty that consider human visual perception
5 Visualization of certain types of uncertainty that are yet to be addressed

Part 3 Updated taxonomy of uncertainty
1 Updated taxonomy of uncertainty
2 Examples of the taxonomy

Part 4 Visualizing the uncertainty of statistical predictive models
1 Types of uncertainty of statistical models used in this chapter
2 The three instantiations
3 First instantiation: visualizing probability
4 Second instantiation: conveying uncertainty through data points
5 Third instantiation: conveying uncertainty through physical metaphor
6 Conclusion

Part 5 Conclusion and next step

Part 6 References

Introduction

Most visualizations have been designed on the assumption that the visually represented data are free from uncertainty. However, this is rarely the case. Visualizing uncertainty is essential if we want to improve how people understand statistics. We are surrounded by summary statistics and statistical inference in our daily lives, such as means and probabilities in newspapers, government journals, and mobile applications. These statistics often lead us to a biased understanding of the data. Visualizing the uncertainty hidden beneath statistics can improve the way we interpret them, because we look at whole datasets and/or possible outcomes to find comparisons. This thesis focuses on the uncertainties generated by inefficient representation of data (summary statistics) and by statistical inference (data samples and probability). Visualizing uncertainty can help us to “reveal” the uncertainty.

First, what is the uncertainty in inefficient representation of data? It is generated by summary statistics. Merely looking at summary statistics, we cannot know the reality of the data. To “reveal” the uncertainty hidden in summary statistics, we need to visualize the whole dataset. To be more specific, if you are told that the mean (averaged value) of a dataset is 5, have you ever wondered what the overall dataset looks like? It may be 1, 5, 9. It may be 4, 5, 6 as well. When we are told that the mean of a dataset is 5, we often believe or guess that all the values are close to 5, or even that each value is 5. Summary statistics such as the mean hide the reality of the data. The two datasets are different, but their means are the same. The uncertainty arises because we have no idea what the real data look like once they are summarized by statistics. Visualizing uncertainty in summary statistics means visualizing the reality of the data.

[Figure: Suppose two datasets have the same mean value. The dataset with the wider range is more uncertain than the dataset with the narrower range.]

The most common, and inefficient, visual representation of one-dimensional data in statistics is the box plot. The median is the middle point of the data and is shown by the line that divides the box into two parts. The middle “box” represents the middle 50% of the data, and the range of the box is called the interquartile range. Data points that lie more than 1.5 times the interquartile range above the top or below the bottom of the box are called outliers.

[Figure: A boxplot with all its elements annotated.]

However, the box plot cannot reflect the reality of the data well. In addition, most people cannot read a box plot, because readers need to know what the median, the interquartile range, and outliers are before they can interpret it.

[Figure: When the range of the data changes, the three boxplots remain the same.]
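The box plot elements described above can be computed directly. The following is a minimal sketch rather than a definitive implementation: the function name `box_plot_summary` and the third dataset are hypothetical, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles` (other tools may use slightly different quartile definitions).

```python
import statistics

def box_plot_summary(data):
    """Return the median, quartiles, interquartile range and outliers of a dataset."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # values below this are outliers
    high_fence = q3 + 1.5 * iqr   # values above this are outliers
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}

# The two datasets from the text: same mean, different spread.
wide = [1, 5, 9]
narrow = [4, 5, 6]
# Hypothetical dataset with one extreme value, for illustration only.
with_outlier = [4, 5, 5, 6, 5, 4, 6, 20]

for name, data in (("wide", wide), ("narrow", narrow), ("with_outlier", with_outlier)):
    print(name, "mean:", statistics.mean(data), box_plot_summary(data))
```

The mean alone is identical for the first two datasets, while the interquartile range differs, which is the information the summary statistic hides.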

In addition, the uncertainty due to inefficient representation of two (or more) dimensions of data concerns the distribution. For example, a scatterplot shows the distribution that describes the relation between x and y. Nevertheless, if we use only summary statistics such as the mean, standard deviation, and correlation to present the relation between x and y, we will misunderstand the whole dataset.

[Figure: Different datasets and different visualizations that share the same five summary statistics (X mean, Y mean, X standard deviation, Y standard deviation, and correlation of X and Y).]

[Figure: Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.]

This is a very famous video about the relation between lifespan and income in each country. The example shows the uncertainty in distribution. If we only look at the data of China as a whole, without separating Shanghai (the richest province) from Guizhou (the poorest province), we cannot gain a good understanding of the dataset and may form a biased one.
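The point that identical summary statistics can describe very different datasets can be checked numerically. The sketch below uses the first two datasets of Anscombe's quartet, a well-known published example; it is swapped in here for illustration and is not the dataset shown in the figure of this thesis. The helper name `pearson` is hypothetical.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Anscombe's quartet, datasets I and II (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, ys in (("I", y1), ("II", y2)):
    print(
        name,
        "mean:", round(statistics.mean(ys), 2),
        "sd:", round(statistics.stdev(ys), 2),
        "corr:", round(pearson(x, ys), 2),
    )
```

The printed summaries agree to two decimals, although a scatterplot of dataset I looks roughly linear and dataset II follows a curve. Only plotting the points reveals the difference.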

Second, what is the uncertainty in statistical inference?

Things are getting more and more complex. Many terms in statistics are related to statistical inference, such as probability, p-value, r-square, and confidence interval. In this thesis, I focus on probability and data samples.

Probability and data samples are uncertain because the possible outcomes are not usually the same. We can use probability and data samples to predict the possible outcomes, but they cannot reflect the actual outcomes and/or the population perfectly.

Take probability as an example. Probability measures the possible outcomes. Consider throwing a die: the probability of showing each face is 1/6, and the averaged value based on the probability is (1+2+3+4+5+6)/6 = 3.5. The uncertainty in probability is that the probability is an estimated value and cannot reflect the real outcome perfectly. To be more specific, the averaged value in statistics is not the same as the averaged value in real life. The picture shows the averaged value (in real life) of throwing a die hundreds of times. We can see that the averaged value in real life gets close to the averaged value in statistics. However, the two will generally not be exactly the same. They are guaranteed to coincide only when we can throw the die “infinite” times. An “infinite” number exists only in math and statistics, not in real life. That is why probability is uncertain.

[Figure: The simulation of throwing a die and the averaged values.]
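The die example can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: a fair six-sided die, a fixed random seed chosen only for reproducibility, and a hypothetical helper name `running_means`.

```python
import random

FACES = (1, 2, 3, 4, 5, 6)
expected = sum(FACES) / len(FACES)  # (1+2+3+4+5+6)/6 = 3.5

def running_means(n_throws, rng):
    """Average of all throws so far, recorded after each throw."""
    total = 0
    means = []
    for i in range(1, n_throws + 1):
        total += rng.choice(FACES)
        means.append(total / i)
    return means

rng = random.Random(0)  # fixed seed, an arbitrary choice
means = running_means(10_000, rng)

for n in (10, 100, 1_000, 10_000):
    gap = abs(means[n - 1] - expected)
    print(f"after {n:>6} throws: mean = {means[n - 1]:.3f}, gap to 3.5 = {gap:.3f}")
```

The gap between the observed average and 3.5 tends to shrink as the number of throws grows, but for any finite number of throws it is usually not exactly zero.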

Therefore, without visualization, we will not “reveal” those uncertainties in data. We now know that there are two types of uncertainty, and they are the main focus of my thesis.

Besides these two uncertainties, there are other types of uncertainty, such as data quality. In the scientific community, data quality is discussed more often than in other domains, but it is not elaborated in this thesis.

Working with data, from collection to visualization, involves many kinds of uncertainty. Dealing with uncertainty from collecting to visualizing data is already well defined by scientific researchers. In addition, visualizing uncertainty has been a common topic in the scientific community for a couple of decades. However, how to visualize uncertainty confuses people, especially those outside of the scientific visualization community, some of whom are even senior data visualization practitioners with statistical backgrounds. These data visualization practitioners outside of the scientific community play an important role between scientific researchers and the general public. They bridge the communication gap between the scientific community and the public. Nevertheless, they provide little guidance on communicating uncertainty to the general public, and they even consider visualizing uncertainty an “unsolved problem”.

This thesis discusses examples and methodologies related to the visualization of uncertainty across different domains and provides a comprehensive framework for visualizing uncertainty. I want to broaden the understanding of techniques to visualize uncertainty, revise the processes to display uncertainty, and showcase examples from the news, consulting agencies, and other domains as well as from the scientific community. Besides providing an updated overview of the topic of visualizing uncertainty, this thesis proposes a taxonomy for visualizing uncertainty for the general public and a couple of experiments focusing on visualizing the uncertainty of statistical predictive models.

Previous academic literature on visualizing uncertainty rarely discusses work outside the scientific community and pays little attention to how news agencies, consulting companies, and lay people deal with uncertainty. For this reason, this thesis includes methods and work on visualizing uncertainty from academic and non-academic domains, which allows visualizing uncertainty in science to be connected with real-world problems. In addition, combining scientific research with the work of non-scientific visualization domains will make scientific research outcomes more accessible to the public. Scientific researchers know how to deal with uncertainty, and non-scientific visualization practitioners know how to communicate uncertainty to the public. Both have their advantages. Combining them will help the general public understand uncertainty.

Uncertainty is not a new topic in scientific visualization, as many studies have produced theories about visualizing uncertainty. Examples include methods to visualize uncertainty in 1D to 4D graphs and in data manipulation. A typology of visualizing uncertainty and a framework for presenting uncertainty were also elaborated over ten years ago. The sources of uncertainty and perceptual uncertainty are other topics that have been discussed in the scientific community. These discussions are broad, and it is stated that most of the popular methods of visualization can already deal with uncertainty. However, the scientific community has been less concerned with how non-scholars visualize uncertainty. Even when work from other domains is collected and discussed, it is not analyzed systematically.

Even though these scientific methods can already deal with uncertainty in visualization, there are still arguments from the public stating that visualizing uncertainty is an “unsolved problem.” That may be because some of these methods are “laboratory exercises”, have little practicality, or are not understood by the public. However, the public has an urgent need for practical methodologies for visualizing uncertainty. With the explosion of big data and the prevalence of visualization tools in recent years, the number of visualization projects outside the scientific community has increased. The growing interest in data visualization increases the interest in visualizing uncertainty. As the need for visualizing data grows, so does the need to visualize uncertainty. In addition, discussion about how to communicate uncertainty has also been rising in governments in recent years.

[Figure: There are two talks about uncertainty in 2016 OpenVis.]

[Figure: The schedule of d3.unconf.]

Although there are no established guidelines for visualizing uncertainty outside the scientific community, some news agencies, consulting companies, and governments, among others, have already started visualizing uncertainty in their own work. However, the communication of uncertainty in visualizations is often not effective, because lay users have biased understandings of the concept and tend to think negatively of it, believing that uncertainty means unreliability. In addition, the traditional way to visualize data also causes uncertainty. In academic domains besides scientific visualization, the visualization of uncertainty has also been studied: for example, the domain of geographic information systems (GIS) has produced fruitful strategies for visualizing uncertainty. Additionally, similar strategies have emerged in economics, biology, etc.

In recent years, not only have traditional ways to visualize uncertainty been critiqued, but new ways to help people understand uncertainty have been proposed, for example by displaying “hypothetical outcome plots (HOPs)”.

[Figure: HOPs, which consist of a few animated lines, have been shown to perform better than error bars and violin plots.]

Using simulated data, HOPs help users “experience” the uncertainty, and empirical studies suggest that users can interpret uncertainty more accurately than with traditional charts. It is important to note that hypothetical outcome plots are not an entirely new idea: for instance, they were used in maps to show predicted rainfall levels, but have rarely been used in recent years.

New methods have been proposed, but some problems still exist. Visualizing uncertainty still confuses the public, and some people even reject new methods that have been tested to be more efficient than traditional methods. For example, HOPs are critiqued by some readers:

[Figure: Stephen Few wrote a post criticizing HOPs; here are some of the comments on the post.]

In addition, certain authors argue that uncertainty is created in every step of the visualization pipeline (from collecting to visualizing data); nevertheless, HOPs deal with only one part of the pipeline, which is distribution. In order to help more people understand and cope with uncertainty in visualization, besides conducting qualitative research about how domain experts deal with uncertainty, it is imperative to connect the methodology created by the scientific community to real-world problems.
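The idea behind hypothetical outcome plots, showing individual simulated draws instead of a static summary, can be sketched in a few lines. This is only an illustrative sketch, not the procedure of the cited studies: the two normal distributions, their parameters, and the function name `hypothetical_outcomes` are all assumptions made for the example.

```python
import random

def hypothetical_outcomes(mean_a, mean_b, sd, n_frames, rng):
    """Return one simulated (a, b) pair per frame; an animation would show one frame at a time."""
    return [(rng.gauss(mean_a, sd), rng.gauss(mean_b, sd)) for _ in range(n_frames)]

rng = random.Random(2018)  # fixed seed, an arbitrary choice
# Hypothetical quantities A and B with overlapping uncertainty.
frames = hypothetical_outcomes(mean_a=50.0, mean_b=55.0, sd=10.0, n_frames=500, rng=rng)

# A viewer watching the frames can judge how often B ends up above A.
share_b_higher = sum(b > a for a, b in frames) / len(frames)
print(f"B exceeds A in {share_b_higher:.0%} of the simulated frames")
```

Each frame is one plausible outcome. Watching many frames in sequence lets a reader estimate "how often is B higher than A" by counting, without needing to read an error bar.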

Based on these goals, the four contributions of this thesis are to:

(1) Provide clearer definitions of uncertainties: inefficient representations of data, and statistical inference.

(2) Collect and analyze the work of visualizing uncertainty across domains such as scientific research, news agencies, consulting companies, and governments. I found that there is a gap between the scientific community and the public. Data visualization practitioners outside of the scientific community are passionate about visualizing uncertainty. However, they pay little attention to the research outcomes produced by researchers. In the chapter “What we talk about when we talk about uncertainty”, I discuss visualizing uncertainty broadly. The chapter starts from the work of the precursors, includes examples from scientific research, summarizes four categories of work from news agencies and consulting companies, discusses the existing problems of using traditional ways to visualize uncertainty, and collects new ways to visualize uncertainty.

(3) Propose an updated methodology, the uncertainty taxonomy. Both scientific researchers and data visualization practitioners outside of the scientific community have proposed methods to deal with and visualize uncertainty. However, these methods have not been combined before. I combine both sets of methods and my own thoughts into an updated methodology that can help us identify the properties of uncertainty. Only when we can identify the properties of uncertainty can we know how to visualize it. In addition, the uncertainty taxonomy proposed by this thesis connects the methodologies developed by the scientific visualization community with examples made by other domains. If we do not provide examples for the methodology, people will not know how to use it.

(4) Connect the methodology to real-world problems. There are few examples of visualizing uncertainties generated by models. This thesis introduces a couple of experiments on visualizing the uncertainty of linear probability models. The experiments examine the gender effects on employment in Mainland China and Taiwan. The data were collected in 2000 by the governments of Mainland China and Taiwan. The three experiments use different visual representations of uncertainty.
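To make the term "linear probability model" concrete, the following sketch fits one to synthetic data and uses a bootstrap to show that the estimated effect is itself uncertain. Everything here is an assumption for illustration: the data are randomly generated (they are not the 2000 government data used in the thesis), the employment rates are invented, and the names `fit_lpm` and `bootstrap_slopes` are hypothetical.

```python
import random

def fit_lpm(x, y):
    """Ordinary least squares for y = intercept + slope * x with a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bootstrap_slopes(x, y, n_resamples, rng):
    """Refit the model on resampled data to see how much the slope varies."""
    slopes = []
    for _ in range(n_resamples):
        idx = [rng.randrange(len(x)) for _ in range(len(x))]
        slopes.append(fit_lpm([x[i] for i in idx], [y[i] for i in idx])[1])
    return sorted(slopes)

rng = random.Random(42)
# Synthetic sample: x = 1 for one gender, 0 for the other; y = 1 if employed.
# The rates 0.80 and 0.65 are invented for this example.
x = [rng.randint(0, 1) for _ in range(2000)]
y = [1 if rng.random() < 0.80 - 0.15 * xi else 0 for xi in x]

intercept, slope = fit_lpm(x, y)
slopes = bootstrap_slopes(x, y, n_resamples=200, rng=rng)
low, high = slopes[4], slopes[194]  # roughly the 2.5th and 97.5th percentiles
print(f"estimated effect: {slope:.3f}, bootstrap range: [{low:.3f}, {high:.3f}]")
```

With a binary outcome, the slope is read as a difference in probability between the two groups. The spread of the bootstrap slopes is one possible quantity a visualization of model uncertainty could show.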

What we talk about when we talk about uncertainty

Visualizing uncertainty is not a new topic. It has been discussed for more than a hundred years. Most visualizations of uncertainty are about summary statistics and statistical inference, except for the work of scientific research. In the scientific community, uncertainty is more complex. Some of the work concerns the uncertainty of data quality, and some concerns visualization in three or more dimensions. In this section, a wider range of uncertainties besides summary statistics and statistical inference is discussed.

Besides the work of visualizing uncertainty, this section also includes discussions about traditional ways to visualize uncertainty. For example, error bars and boxplots are two common and traditional ways to visualize uncertainty in statistics, but they have been shown not to express uncertainty effectively.

In recent years, new ways to visualize uncertainty have been proposed. Most of these research outcomes are from Dr. Hullman. She is a professor at the University of Washington, and her research is about information visualization and data cognition. She proposes a series of visualizations of uncertainty that combine simulation and animation to express uncertainty. The research outcomes show that readers understand the visualizations she created better than traditional visualizations. However, the research discusses only some kinds of uncertainty, and others, such as the uncertainty of statistical predictive models, are not discussed yet.

This section is divided into five parts: 1) visualizing uncertainty precursors; 2) an overview of visualizing uncertainty in the scientific community, and the public's lack of access to scientific research outcomes; 3) the topic of visualizing uncertainty in the news agencies, consulting companies, and other domains; 4) new ways to visualize uncertainty that consider human visual perception; 5) visualization of certain types of uncertainty that are yet to be addressed, and the proposal of a new methodology of visualizing uncertainty.

[Figure: Law of deviation from an average. Hereditary Genius, 1869.]

2.1 Uncertainty visualization precursors

Francis Galton (1822–1911) is famous for important contributions to many fields of science, including psychology and biology. In his book, Hereditary Genius: An Inquiry into Its Laws and Consequences, he uses a diagram to explain the distribution of intelligence among the population.

In another work, Typical laws of heredity, Galton invented the quincunx for studying variation. When beans are dropped through the filters, the shape of the resulting distribution is close to the “normal distribution”, which is bilaterally symmetric. In addition, the picture shows that when the sizes of the filters are changed, the height of the distribution changes as well, but it remains bilaterally symmetric. The lower part of the graph shows the two-stage quincunx.

[Figure: Two-stage quincunx and the law of reversion. Typical laws of heredity, 1877.]
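The behaviour of the quincunx can be approximated with a small simulation. This is a simplified sketch, not a model of Galton's actual device: it assumes that each bean is deflected left or right with equal probability at every row, and the function name `quincunx` and the numbers of beans and rows are hypothetical choices.

```python
import random
from collections import Counter

def quincunx(n_beans, n_rows, rng):
    """Drop beans through n_rows of obstacles; return the number of beans per bin."""
    bins = Counter()
    for _ in range(n_beans):
        # Each row shifts the bean right (1) or leaves it left (0).
        position = sum(rng.randint(0, 1) for _ in range(n_rows))
        bins[position] += 1
    return [bins[k] for k in range(n_rows + 1)]

rng = random.Random(1877)  # fixed seed, an arbitrary choice
counts = quincunx(n_beans=5000, n_rows=12, rng=rng)

# Text histogram: one '#' per 25 beans.
for k, count in enumerate(counts):
    print(f"bin {k:2d} {'#' * (count // 25)}")
```

The bin counts follow a binomial distribution, which for many rows looks like the symmetric bell shape described above. Changing `n_rows` changes the width and height of the pile while keeping it roughly symmetric.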

2.2 An overview of visualizing uncertainty in the scientific community

The scientific research on uncertainty visualization is solid and comprehensive. Researchers have proposed examples, taxonomies, and methodologies related to visualizing uncertainty. The examples range from 1D to 4D:

[Figure: Variations of the boxplot. Overview and State-of-the-Art of Uncertainty Visualization, 2014.]

1D: Variations of data and uncertain data points. As Brodlie et al. 2012 argue, “there are a variety of ways that an indication of uncertainty can be added: error bars can be added to the data point markers, or the markers themselves can encode the uncertainty through size or color of the glyph.” In this part, the research usually discusses two aspects of uncertainty visualization: the first is box plots and error bars; the other is markers that are used to present uncertainty.

Bonneau et al. 2014 list 10 types of box plots. From left to right, they are: a) the boxplot, proposed by John W. Tukey in 1969 (the boxplot was invented by Spear in 1952) and used for presenting the range of the data distribution; b) the range plot, proposed by Mary Eleanor Spear in 1952; c) the innerquartile plot; d) the histplot, a hybrid of histogram and boxplot whose width shows the estimated density of the data; e) the vaseplot, similar to the histplot, but with the width of the box at each point proportional to the estimated density; f) the box-percentile plot and g) the violin plot, which show density information for the entire range of the data set; h) the variable width notched boxplot, whose notches provide a measure of the rough significance of differences between the values; i) the skewplot, a boxplot with an added thick line; j) the summary plot, which contains center and density information. All of them aim at showing the distribution of data.

Another aspect of this type of uncertainty visualization in scientific visualization research is markers that encode uncertainty. The figure below shows two main usages of markers: size and color.

[Figure: Examples from Sanyal et al. A user study to compare four uncertainty visualization methods for 1D and 2D datasets, 2009.]

The left graph uses the size of the markers to present uncertain data points; the right graph uses colors to show the uncertain data. The difference is that the markers express uncertainty discretely, while the colors express it continuously.

2D: Bivariate extensions of the boxplot and uncertainty in hierarchy. This part is also about the boxplot, but extended to two dimensions. These types of boxplot show the relationship between two variables. As shown below, there are four types of bivariate boxplots: a) the rangefinder boxplot, which contains six line segments: the central cross lines intersect at the cross-median values, and the other lines represent the interquartile range of the data; b) the 2D boxplot, which has three parts: a median point, an inner box containing 50% of the data, and an outer box separating the outliers; c) the bagplot, which is similar to the 2D boxplot and also has three parts: a bag (dark grey interior) containing 50% of the data points, a loop (grey area) indicating the points outside the bag, and a fence that separates inliers from outliers; d) the Quel and Rel plots, which are similar to the other bivariate boxplots because they also have three parts: a center, an interior hinge including 50% of the data, and a fence delineating outliers.

[Figure: Bivariate boxplots. Overview and State-of-the-Art of Uncertainty Visualization, 2014.]

[Figure: Examples of 3D uncertainty visualizations. A Review of Uncertainty in Data Visualization, 2012.]

Visualization of uncertainty in hierarchy is more generally used in biology than in other domains. However, it is neglected in recent literature reviews of scientific visualization. These kinds of diagrams are usually used to show the uncertainty of species evolution. The original hierarchy tree shows data in a binary way, but this is not usually the case for species evolution: when a species evolves into other species, probabilities are involved in the evolution process.

[Figure: Uncertainty visualization of hierarchy. Plotting methods for phylogenies & comparative data, 2014.]

3D, 4D: glyphs, image discontinuity, attribute modification, etc. They are mainly used for testing algorithms, the quality of data, and so on. These types of uncertainty visualization are more complicated than the previous types. For example, they could be: a) spherical glyphs scaled to radiosity differences, b) line glyphs showing the particle positions along streamlines computed by two integration methods, c) uncertainty vector glyphs over Monterey Bay, California, d) line glyphs showing the difference between bilinear and multiquadric interpolated surfaces, or e) uncertainty isosurfaces.

The 1D and 2D examples are traditional statistical visualizations, and they are usually about the box plot. This work is included in previous state-of-the-art papers, but there is little discussion about how to use the methods for different purposes and about their restrictions. The 3D and 4D examples are usually based on scientific research needs, and they are much too complex for lay users.

Another important part of scientific visualization is the uncertainty pipeline. The pipeline shows the process by which uncertainty is created. It was first proposed by Pang et al. in 1996. In this pipeline, uncertainty is generated at every step from collecting data to visualizing data. At the “collect” stage, uncertainty comes from models and measurements. Then, uncertainty derives from the transformation process. Last, uncertainty is generated by the visualization process.

[Figure: Brief introduction of data uncertainty from collecting data, to deriving data, and to visualizing data. Approaches to uncertainty visualization, 1996.]

In the following year, Pang et al. proposed an updated version of the uncertainty pipeline. The pipeline is more detailed, and the picture of the pipeline separates the uncertainty process into three main parts: acquisition, transformation, and visualization. From the graph below, we can clearly see the different ways of dealing with data that cause uncertainty.

Brodlie et al., 2012 include the Haber and McNabb model: visualization of uncertainty and uncertainty of visualization. The Haber and McNabb model was proposed in 1990; Brodlie et al. refine the model and propose a new diagram, which is shown below.

Brodlie et al. argue that “uncertainty occurs at all stages - visualization of uncertainty focusses on the data stage, while the uncertainty of visualization begins at the filter stage and passes through to the render stage.”

Bonneau et al. propose another uncertainty pipeline in 2014. Compared with that of Brodlie et al., this pipeline is simpler. The difference is that “modeling uncertainties” are included in “filter” by Brodlie et al.

[Figure: Detailed introduction of data uncertainty from collecting data, to deriving data, and to visualizing data. Approaches to uncertainty visualization, 1997.]

This discussion is systematic but complex, and usually not accessible to the public. The methodologies have been proposed, but they are usually matched with scientific examples, and there is less discussion about how they could be applied to real-world problems.

[Figure: Sources of uncertainty. Overview and State-of-the-Art of Uncertainty Visualization, 2014.]

Some of the uncertainty can be visualized, and some cannot. For example, in the acquisition part, most of the uncertainty cannot be visualized because the uncertainty here is the lost data. We cannot recover the lost data, and therefore it cannot be visualized.

[Figure: Haber and McNabb model: visualization of uncertainty and uncertainty of visualization. A Review of Uncertainty in Data Visualization, 2012.]

The uncertainty pipelines proposed by the scientific visualization community clearly state the different kinds of uncertainties and their classification. These pipelines will help us deal with uncertainty visualization. However, few of the examples and explanations are related to real-world situations. Even though the uncertainty pipelines and the classification of uncertainty visualization were proposed two decades ago, there are still arguments that uncertainty visualization is an unsolved problem today.

2.3 The topic of visualizing uncertainty in the news agencies, consulting companies, and other domains

In news agencies, a lot of work is about election prediction. A good representative example of news agency work is Who Will Win The Senate? (http://www.nytimes.com/newsgraphics/2014/senate-model/), done by the New York Times in 2014. This work uses four main kinds of visualization to express uncertainty: tendency of uncertainty, spreadsheet of uncertain data, simulation of uncertainty, and distribution of uncertain data.

A. The tendency and forecast. It usually shows the data and/or future data of one or more observed objects. Using lines to present the tendency of uncertainty is the usual type in this category. In addition, regression of uncertain data also belongs to this category. These types of uncertainty visualization are usually expressed with lines.

[Figure: The changes of the probability of each party controlling the Senate after the 2014 elections. Who Will Win The Senate?, 2014.]

The figure shows how the chances of the two parties to control the Senate change over time.

Other related work: the spaghetti plot shows the previous, actual, and forecast data, as in ‘Two-thirds’ of Hammond’s £26bn Budget war chest faces wipeout, done by the Financial Times, 2017.

[Figure: Forecasts and actual outcomes. ‘Two-thirds’ of Hammond’s £26bn Budget war chest faces wipeout, 2017.]

A type of the Fan chart. Fan Chart: The Art and Science of Communicating Uncertainty, 2014.

Watercolor regression. It is a type of visually-weighted regression. It derives from the spaghetti plot and uses a different visual approach to show the regression lines.

Watercolor regression, 2012.

Fan chart. It was first proposed by the Bank of England. It combines historical data and forecast. Historical data is drawn as a line and the forecast as a range of gradient colors. The fan chart is similar to watercolor regression, but the gradients of the fan chart are much clearer.

Probability interval. This chart shows the historical data and forecasts. The left-hand side of the dotted line shows historical data; the right-hand side shows the forecasts. This chart is confusing because it uses the same visual encoding for historical data and forecasts. The interval on the left-hand side covers the historical 90% of polls. The interval on the right-hand side is not clearly stated in the article.

Historical polls and their prediction. How popular/unpopular is Donald Trump?, 2017.

B. The spreadsheet of uncertainty. This type of uncertainty visualization simply uses a spreadsheet to show the numbers of uncertainty directly.

State-by-State Probabilities tables. Who Will Win The Senate?, 2014.

Other related work: In the same article (Who Will Win The Senate?), the New York Times uses another spreadsheet to compare the forecasts of different media outlets.

of prediction. Who Will Win The Senate?, 2014.

Like a heatmap, colors are used to denote the value of the probability. This method lets readers quickly find the highest value in the spreadsheet. It is also a good way to compare the uncertainty values.

Other Forecasts Comparison tables. Who Will Win The Senate?, 2014.

C. Statistic summary of uncertainty. This type of uncertainty visualization uses traditional statistics to summarize uncertainty, for example, the bar chart and the box plot.

Bar chart with error bar. Wie wir über Umfragen berichten, 2017.

Other related work: The combination of error bar and bar chart. The light-colored parts of the bar chart denote the uncertainty. This visualization was made by a German news outlet.

Prediction gauge. Live Presidential Forecast, 2017.

Jobs report in two scenarios. How Not to Be Misled by the Jobs Report, 2014.

D. Simulation of uncertainty. This type of uncertainty visualization is usually interactive and/or dynamic. Users can play with the work. For example, the chart below shows a compass for every state. In each compass, the colored area denotes the probability that each party will win. Clicking the “SPIN AGAIN” button usually produces different results. Simulation helps users engage with the work and understand the concept of uncertainty.

“Rolling the Dice” simulation. Who Will Win The Senate?, 2014.

Other related work: This gauge is used to show a real-time election forecast. Random data is added to the gauge, which emphasizes the uncertainty of the forecast.
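The “SPIN AGAIN” idea described above can be sketched in a few lines. This is only an illustration and not the New York Times model: the state names and win probabilities are hypothetical, and each spin simply draws one winner per state from its probability.

```python
import random

# Hypothetical win probabilities for one party in three states (illustrative only).
STATE_PROBS = {"State A": 0.80, "State B": 0.55, "State C": 0.30}


def spin_once(probs, rng):
    """One 'spin': draw True (party wins) or False for every state."""
    return {state: rng.random() < p for state, p in probs.items()}


def simulate(probs, spins, seed=0):
    """Count how many states the party wins in each of `spins` simulated elections."""
    rng = random.Random(seed)
    counts = {k: 0 for k in range(len(probs) + 1)}
    for _ in range(spins):
        wins = sum(spin_once(probs, rng).values())
        counts[wins] += 1
    return counts


if __name__ == "__main__":
    for wins, n in simulate(STATE_PROBS, 1000).items():
        print(f"{wins} states won: {n} of 1000 spins")
```

Each spin is one discrete outcome. Repeating it many times shows readers how often each result appears instead of giving them a single probability.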

Forecasting models predicted the storm’s path with relative accuracy. What’s in Irma’s path?, 2017.

Different types of error bars. Error Bars Considered Harmful: Exploring Alternate Encodings for Mean and Error, 2014.

Simulation and distribution. This dynamic bar chart is used to denote the uncertainty of the Labor Department’s monthly jobs report. For example, although the Labor Department states that job growth was steady during the past months, the real situation may be different. By taking the error into account, the dynamic bar chart provides a different perspective for reading the jobs report.
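The effect described above can be imitated with a small simulation. It is a sketch under stated assumptions, not the method used in the article: the steady monthly growth and the size of the sampling noise are hypothetical numbers chosen only for illustration.

```python
import random

TRUE_MONTHLY_GROWTH = 150_000  # hypothetical: the same real job growth every month
SAMPLING_NOISE_SD = 55_000     # hypothetical: standard deviation of the survey error


def simulated_reports(months, seed=0):
    """Return one possible series of reported job gains for a perfectly steady economy."""
    rng = random.Random(seed)
    return [round(rng.gauss(TRUE_MONTHLY_GROWTH, SAMPLING_NOISE_SD)) for _ in range(months)]


if __name__ == "__main__":
    for month, jobs in enumerate(simulated_reports(12), start=1):
        bar = "#" * max(jobs // 10_000, 0)
        print(f"month {month:2d}: {jobs:8d} {bar}")
```

Although the true growth never changes, the reported numbers rise and fall from month to month. The dynamic bar chart makes that impression visible.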

Simulation, distribution, and forecast. A geographic spaghetti plot shows the possible routes of typhoons. This dynamic chart shows the paths that Irma may take. It is a combination of spaghetti plot and simulation.

Different types of bar charts with error bars. Information Graphics, 2000.

News media provide various works that visualize uncertainty; however, certain types of uncertainty visualization have problems. Some of them are confusing to readers and some of them use the wrong type of visualization.

Three varying 1D distributions of data, all with the same boxplot representation, 2017.

First, confusing uncertainty visualization. These visualizations usually require readers to have a basic understanding of statistics. Second, wrong types of visualization. Certain types of uncertainty visualization cannot express uncertain data correctly to readers. Here, much of the research and discussion focuses on the error bar and the box plot. Correll et al. argue that the traditional error bar is not easily understandable for users and propose different ways to visualize error bars. The use of gradients to denote uncertainty was proposed by Harris in his book, Information Graphics, in 1996.

The research group at Autodesk also studies the disadvantages of the traditional ways of visualizing uncertainty. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing” proposes various visualizations to display the disadvantages of the box plot and of summary statistics.

As shown above, the data in these charts have different distributions but the box plots are the same. Besides being unable to show the exact distribution of the data, the box plot is not easily understandable for users.

The charts below show different visualizations with the same statistics. The X mean, Y mean, X SD, Y SD, and correlation are all the same, but the visualizations are totally different from each other.

Same statistics, different visualization, 2017.
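The claim that identical summary statistics can hide different distributions is easy to check on a toy case. The two small datasets below are hypothetical and are not taken from the cited paper. They were chosen so that both have a mean of 5 and a population standard deviation of 2.

```python
from collections import Counter
from statistics import mean, pstdev

# Two hypothetical datasets with the same mean and standard deviation.
A = [2, 4, 4, 4, 5, 5, 7, 9]  # one cluster with a tail to the right
B = [3, 3, 3, 3, 7, 7, 7, 7]  # two separate clusters


def summary(data):
    """Return the summary statistics (mean, population SD) of the data."""
    return mean(data), pstdev(data)


def text_histogram(data):
    """Draw one row of marks per distinct value so the shape becomes visible."""
    counts = Counter(data)
    return "\n".join(f"{value:2d} | {'o' * counts[value]}" for value in sorted(counts))


if __name__ == "__main__":
    for name, data in (("A", A), ("B", B)):
        print(name, summary(data))
        print(text_histogram(data))
```

The summaries are indistinguishable, while the text histograms show one cluster for A and two for B.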

2.4 new ways to visualize uncertainty that consider human visual perception

In recent years, some university researchers have started to explore new ways to visualize uncertainty. They try to create visualizations of uncertainty that are easier to understand.

Jessica Hullman is the most active researcher proposing and evaluating uncertainty visualizations, for example, “Hypothetical Outcome Plots” (HOPs). HOPs visualize and animate individual draws of the data. HOPs enable users to infer the properties of the distribution of the data by counting the animated lines. This method helps users “experience” uncertainty, and it performs well compared with traditional uncertainty visualizations such as the violin plot and the box plot.

Hypothetical Outcome Plots. Hypothetical Outcome Plots Outperform Error Bars and Violin Plots for Inferences About Reliability of Variable Ordering, 2016.

Another study conducted by Jessica Hullman uses a new way to visualize a density distribution. Instead of the traditional way of drawing the distribution as continuous, it uses dots to denote the distribution. Making the data discrete performs better than the traditional way.

Imagining Replications: Graphical Prediction & Discrete Visualizations Improve Recall & Estimation of Effect Uncertainty, 2017.
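The counting idea behind Hypothetical Outcome Plots can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes two normally distributed quantities A and B with hypothetical means and a shared standard deviation, and every “frame” of the animation is one random draw of each.

```python
import random


def hypothetical_outcomes(mean_a, mean_b, sd, frames, seed=0):
    """Generate the frames of the animation: one random draw of A and B per frame."""
    rng = random.Random(seed)
    return [(rng.gauss(mean_a, sd), rng.gauss(mean_b, sd)) for _ in range(frames)]


def share_b_above_a(frames):
    """What a viewer estimates by counting: the share of frames in which B is above A."""
    return sum(b > a for a, b in frames) / len(frames)


if __name__ == "__main__":
    frames = hypothetical_outcomes(mean_a=50, mean_b=55, sd=10, frames=200)
    print(f"B is above A in {share_b_above_a(frames):.0%} of the frames")
```

A viewer watching the frames one after another can estimate how reliably B exceeds A without reading an error bar.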

2.5 visualization of certain types of uncertainty that are yet to be addressed

The new methods to visualize uncertainty usually focus on distribution. However, according to the taxonomies proposed in previous research, every process in the pipeline generates uncertainty. Therefore, certain types of uncertainty have not been discussed yet.

In addition, a visual presentation based on simulation helps users understand uncertainty more effectively than a static presentation in some cases. These kinds of methods are not yet widely applied to uncertainty visualization.

In recent years, there has been a trend to visualize uncertainty by using simulation. These works provide a better way to help users understand the uncertainty than the traditional static visual presentation.

The example here tells the story of how the data is mapped to a bar chart and how uncertainty emerges with varying bin width.

Exploring by using simulation and scrolling. What’s so hard about histograms?, 2017.

In sum, three aspects of uncertainty visualization are not addressed well in the state of the art. First, the methodologies of the scientific visualization community are not widely used by visualization practitioners outside of the scientific community. Second, outside of the scientific community, certain types of uncertainty are not well discussed. Third, visualizing uncertainty by using simulation is still a new approach and few examples have been proposed.

The next sections explain an updated methodology of uncertainty visualization and propose an example that visualizes a certain type of uncertainty which is less discussed. The example has two visual presentations: one is static and the other uses simulation.

Updated taxonomy of uncertainty

The previous chapter of this thesis presented many visual representations of uncertainty proposed by data visualization practitioners. Nevertheless, there are few clear guidelines on how to deal with uncertainty in visualization. If we do not have a clear taxonomy of the properties of uncertainty, we cannot deal with uncertainty and cannot visualize it well.

Although scientific researchers proposed some guidelines which are parts of a taxonomy of uncertainty, they are too complex for people outside of the scientific community to use. In addition, the taxonomy proposed by the scientific researchers is not sufficient: it only discusses the uncertainty generated in the process of working with data (from collection to visualization).

We usually do not know exactly what we are talking about when we talk about uncertainty. We tend to mix related properties of uncertainty together. These properties are usually ill-defined, and thus confusing to people who want to deal with and visualize uncertainty.

In recent years, data visualization practitioners outside of the scientific community have published some guidelines on uncertainty. They provide taxonomies and visual representations of uncertainty.

The latest one is a 2017 blog post by Nathan Yau, who has a PhD in Statistics. This post identifies some of the properties of uncertainty, but these properties are not categorized well, although the post has the clearest definitions of uncertainty I can find. Range and distribution are representations of summary statistics. They can be used to represent data better, or to represent data that is itself uncertain, such as a probability. The example of multiple outcomes in the post is about statistical inference. However, multiple outcomes, like simulation and obscurity, are used to express the “feeling” of uncertainty. The properties of uncertainty are mixed together, which often confuses people. Although the properties of uncertainty and the visual representations of uncertainty partly overlap, we still need to point out the differences and the properties of uncertainty we want to deal with and visualize. As a result, we are facing a fog of uncertainty.

Visualizing the Uncertainty in Data, Nathan Yau, 2017

Uncertainty fog

To sum up, there are problems if we want to visualize uncertainty. First, there is no clear taxonomy which identifies the uncertainties, their properties, and the visual presentations of uncertainty.

Second, a taxonomy without related examples will confuse people, and few people will understand how to use it.

3.1 Uncertainty taxonomy

In this section, an updated taxonomy of uncertainty is proposed. It defines a set of properties of uncertainty in visualization, which can be used to better address how these properties interplay when we visualize uncertainty.

Parts of the taxonomy are based on the previous uncertainty pipelines of the scientific community and are combined with other uncertainty taxonomies proposed by data visualization practitioners outside of the scientific community. It aims at providing more comprehensive ways to deal with visualizing uncertainty.

An uncertainty taxonomy proposed in this thesis

3.2 Examples of the methodology

When we want to deal with uncertainty visualization, there are many aspects we need to think about.

1) Process

From data collection to data visualization, each process generates uncertainty.

A. Collecting
Due to the limitations of the methods used to collect data, we can usually only get a sample of the whole dataset. Some of the uncertainties come from the tools which collect the data. For example, physical trace data is collected by eye, and eyes cannot perfectly identify every data point. In addition, some uncertainties come from the statistical method used to collect the data.

An example of data collecting methodologies. Data derived from physical traces, 2017.

An example of the uncertainty generated by data collection is Gumspotting, conducted by Walker Harrison in 2016. He observed the gum spots on the sidewalks in New York City. He collected the gum spots by eye and used the data sample to predict the total number of gum spots in New York City. In the diagram above, the uncertainties come from the observation and generalization processes.

Gum distribution on different areas. Gumspotting, 2016.

B. Distributing
The uncertainties in this process usually come from the traditional ways of visualizing the data. For example, the visual presentation of a box plot consists of summary statistics. When the distribution changes, the box plot usually cannot reflect the real situation.

Visualizing the different distributions shows the uncertainty. Same Stats, Different Graphs, 2017.

C. Gathering
This process generates uncertainties from the filters used to gather data. For example, varying the width of the bins generates different shapes of the histogram. Another example is that varying the location or scale of polygons also generates different results (which are considered as uncertainty).

Different models generate different results.
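The bin-width effect mentioned under C. Gathering can be reproduced with a minimal sketch. The data values are hypothetical, and the helper `histogram` is written here only for illustration.

```python
# Hypothetical observations used only to illustrate the effect of bin width.
DATA = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]


def histogram(data, bin_width, start=0.0):
    """Count the values per bin; each bin is keyed by its left edge."""
    counts = {}
    for x in data:
        left = start + bin_width * int((x - start) // bin_width)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for width in (1, 2, 5):
        print(f"bin width {width}")
        for left, count in histogram(DATA, width).items():
            print(f"  [{left:4.1f}, {left + width:4.1f}) {'#' * count}")
```

With a width of 2 the isolated value 9 appears as a separate bar, while a width of 5 merges it into the second of only two bars. The same data tells a different story depending on the filter.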

Uncertainty generated from the side-scaling of hexagon. Synthetic spatio-temporal visualization of Ebola cases in Sierra Leone, Antonio Solano-Román and Andrés Colubri.

Same datasets with different algorithms have different results.

D. Modeling
In this section, modeling includes models and algorithms. When we use different models for prediction, the results are usually different, because the parameters of each model are usually different. This kind of uncertainty is also very common in network visualization. When we visualize the same network dataset with different algorithms, the shapes of the visualization are different.

E. Predicting
Prediction is not 100% guaranteed. Thus there are uncertainties in prediction. Examples were already elaborated in the previous section on the tendency and forecast. In addition, simulation is also a method to present prediction.

Some of the uncertainty visualization in prediction mentioned in previous sections in this paper.

3.2.2 Restriction
Some of the uncertainties cannot be visualized while some can.

The book, Making sense of uncertainty, discusses why uncertainty is important in science and why the general audience tends to misunderstand it. In order to understand uncertainty clearly, the book argues that we need to split uncertainty into two categories, known unknowns and unknown unknowns:
- known unknowns: we know there are some things we do not know;
- unknown unknowns: the ones we don't know we don't know.

Making sense of uncertainty, 2013.

Unknown unknowns are the uncertainties which cannot be visualized. In addition, some known unknowns cannot be visualized either. The uncertainties which cannot be visualized generally come from the process of data collecting. For example, due to quality issues of the machine, parts of the data are missing or have “noise.”

3.2.3 Types of uncertainty
There are many types of uncertainty. In this thesis, I focus on two types of uncertainty: inefficient representation of data, and statistical inference. The first type of uncertainty is generated from summary statistics, such as the mean, the standard deviation, and probability. However, if we only understand the datasets by reading the summary statistics, we usually have a biased interpretation of the datasets.

First, inefficient representation of data. Summary statistics are good at providing summarized information about data, but sometimes summary statistics fail to convey the reality of the data.

Summary Statistics Tell You Little About the Big Picture, Nathan Yau.

Second, statistical inference. The uncertainties of statistical inference are more complex than the uncertainties generated from summary statistics. Examples are probability, data sample, confidence interval, etc. Two of the terms are discussed in this thesis: probability and data sample.

First, probability. The uncertainty of probability is elaborated in the introduction chapter. Another example is used here. When throwing a coin, the probability of flipping a head or a tail is 0.5. We may think the probability is very accurate. However, 0.5 is only a concept in math and statistics and it does not exist in real life. If we throw a coin 100 times, as shown in this picture, the numbers of heads and tails are not always equal.

The simulation of throwing a dice.
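The coin example can be tried out with a short simulation. This is a minimal sketch, assuming a fair coin modeled with Python's pseudo-random generator. The function name `flip_coins` is introduced here only for illustration.

```python
import random


def flip_coins(n, seed=None):
    """Flip a fair coin n times and return the counts (heads, tails)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads, n - heads


if __name__ == "__main__":
    # Repeat the experiment of 100 throws five times.
    for run in range(1, 6):
        heads, tails = flip_coins(100)
        print(f"run {run}: {heads} heads, {tails} tails")
```

Across runs the split usually differs from 50/50, even though the probability of each side stays 0.5.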

Second, data sample. In statistics, a data sample is collected to stand for a population which is too large to be collected in full. Sample data is used to estimate the parameters of the population. Governments usually use data samples to estimate economic performance. Due to random statistical noise, a data sample cannot perfectly reflect the real economic performance.

An example is the Labor Department’s monthly employment estimate. It is almost impossible (and too expensive) to collect all of the employment data. Hence, data samples are used to estimate the actual labor market. The random statistical noise makes the estimate uncertain. Even if the actual job growth is stable over the past 12 months, the jobs report, which is based on data samples, could look unstable.

How Not to Be Misled by the Jobs Report, Neil Irwin and Kevin Quealy, 2014

3.2.4 Types of uncertainty visualization
Uncertainty visualization has two easily confused categories: visualization of uncertainty and uncertainty of visualization. When we talk about uncertainty visualization, we usually mix them. According to Brodlie et al. (2012), that is part of the reason why uncertainty visualization is so hard.

“We place the work in the context of a reference model for data visualization, that sees data pass through a pipeline of processes. This allows us to distinguish the visualization of uncertainty - which considers how we depict uncertainty specified with the data - and the uncertainty of visualization - which considers how much inaccuracy occurs as we process data through the pipeline.” (A Review of Uncertainty in Data Visualization, 2012.)

“Even if there is certainty about the data, errors can occur in the process of turning the data into a picture. We call this uncertainty of visualization.” (A Review of Uncertainty in Data Visualization, 2012.)

Haber and McNabb model: visualization of uncertainty and uncertainty of visualization. A Review of Uncertainty in Data Visualization, 2012.

3.2.5 Visual presentation
Two categories are listed here: techniques and visual design. In techniques, there are static, simulation, animation, and physical metaphor. In visual design, there are discrete events, gradients, shapes, and colors.

In the category of techniques, to be more specific, static means the visualization is not interactive but is good for print. Static is the most common and traditional way to visualize uncertainty.

Simulation means the data are simulated before being visualized. It is good for expressing the concept of uncertainty. Simulated data are not “real data” (they did not happen in real life) but they are used to show possible outcomes. The paths of Irma are simulated by models, so we can see the possible routes of Irma.

Forecasting models predicted the storm’s path with relative accuracy. What’s in Irma’s path?, 2017

Animation means the uncertainty is animated. Readers usually understand animated uncertainty better. Animation can help us understand statistical graphs better.

Animating from stacked bars to grouped bars

Physical metaphor means we map the uncertainty onto physical metaphors. The most common physical metaphors are the prize wheel, the gauge, and dice. They are used in some news articles. Physical metaphors make it easy to express the concept of uncertainty to the public, because the public is familiar with them and they do not require readers to have any knowledge of statistics. They differ from error bars and probability density, which require readers to know statistical models in order to interpret them correctly.

“Rolling the Dice” simulation. Who Will Win The Senate?, 2014

Risk Analysis Table

I want to talk more about discrete events here, because using discrete events to express uncertainty is a new approach and there is evidence that it helps readers understand the uncertainty of data.

Prediction gauge. Live Presidential Forecast, 2017

Discrete events mean showing data and information individually. There is evidence that people understand visualized uncertainty, such as probability, better when it is represented as discrete events. Dot plots are one kind of discrete events. Dot plots are used to replace the visual representation of the probability density of the Normal distribution.
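One way to turn a Normal density into countable dots is to place one dot at each of a set of evenly spaced quantiles. The sketch below assumes this quantile-based construction, which the text above does not spell out, and uses a hypothetical predicted arrival time (mean 10 minutes, standard deviation 2 minutes).

```python
from collections import Counter
from statistics import NormalDist


def quantile_dots(mean, sd, n_dots=20):
    """Place n_dots dots at evenly spaced quantiles of a Normal distribution."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n_dots) for i in range(n_dots)]


def chance_by(dots, t):
    """Read a probability off the plot by counting the dots at or before time t."""
    return sum(d <= t for d in dots) / len(dots)


def text_dotplot(dots, bin_width=1.0):
    """Stack the dots into bins of bin_width and draw them as rows of 'o'."""
    counts = Counter(round(d / bin_width) * bin_width for d in dots)
    return "\n".join(f"{x:5.1f} | {'o' * c}" for x, c in sorted(counts.items()))


if __name__ == "__main__":
    dots = quantile_dots(mean=10, sd=2)  # hypothetical arrival time in minutes
    print(text_dotplot(dots))
    print("chance of arriving within 8 minutes:", chance_by(dots, 8))
```

Each dot stands for an equal share of the probability, so a reader can count dots instead of judging the area under a curve.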

The traditional visualization of uncertainty for transit predictions on mobile phones is the probability density distribution. The green area is the probabilistic estimate of the time to arrival. The higher the area, the more probable it is that the bus will arrive at that time. However, this kind of visualization requires people to know what probability and a probability density distribution are. The researchers use dot plots to replace the probability density distribution. Dot plots are more intuitive than the density distribution because people can count the dots instead of having to understand the meaning of the density area.

Traditional and new ways to visualize probability density distribution

In the category of visual design, there are discrete events, gradients, shapes and colors. Gradients, shapes and colors are common methods to represent uncertainty. Some related examples can be found in the previous chapter of this thesis, “what we talk about when we talk about uncertainty”.

In this thesis, I propose a series of visualizations of uncertainty which combine animation and discrete events. These 9 visual representations of uncertainty were not entirely created by me. They are scattered across academic research outcomes and data visualization practitioners’ work. I collect these visual representations and categorize them by what they represent: range, distribution, and prediction. Some of these visualizations are originally static, for example, the dot plots and the fan chart. I animate them because animation can help people understand uncertainty better and few animated examples exist.

The visual representations of uncertainty which combine animation and discrete events

The main challenge of displaying animation is that it usually requires electronic devices as the display medium. I solve this problem by using the lenticular technique, which can show the animation and transition on physical prints.

Lenticular prints applying the nine animated visualizations of uncertainty

3.2.6 Dimension
Showing uncertainty on maps is an important topic in uncertainty visualization. In the scientific visualization community, there is a great deal of research using maps to display uncertainty.

Visualizing uncertainty in areal data with bivariate choropleth maps, map pixelation and glyph rotation, 2017.

Using 3D and/or 4D to visualize uncertainty is more common in the scientific community because the researchers usually need to deal with much more complex datasets.

A visualization of the brain using transfer functions that express the risk associated with classification. Overview and State-of-the-Art of Uncertainty Visualization, 2014.

Visualizing the uncertainty of statistical models

We are surrounded by statistical inference every day, from the predicted arrival time of a bus to possible poll outcomes. These predictions are usually represented as a single value. However, people usually do not interpret and understand the number correctly if we do not provide visual representations. In addition, traditional ways to visualize uncertainty are also regarded as “harmful” to general readers because they cannot understand them correctly.

In this chapter, I propose three visual representations of uncertainty in statistical predictive models. The three visual representations of uncertainty are based on the taxonomy proposed in the previous chapter of this thesis. In the statistical models discussed here, the uncertainty of statistical inference takes the form of probability. Visual representation techniques such as simulation, animation, physical metaphor, color coding and multiple outcomes are used to convey the concept of uncertainty.

The three visual representations of uncertainty are based on linear probability models. The models show gender inequality among employees in Taiwan and Mainland China, which is the outcome of my previous academic research. The models consider a person’s gender and other characteristics such as age, education degree, and marital status. The final outcomes of the models are the probabilities for a male or female to get a job in Taiwan or Mainland China.

4.1 Types of uncertainty of statistical models used in this chapter

The linear regression model is the most important and fundamental regression model in statistics. It is used to show the relation between two variables, which are usually denoted X and Y in statistics. Linear regression models are also used for prediction.

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

A linear regression line has an equation of the form

Y = a + bX

where X is the explanatory variable and Y is the dependent variable (the response). The slope of the line is b, and a is the intercept (the value of Y when X = 0).

A linear probability model is one kind of linear regression model. The response (often referred to as Y) of the linear probability model ranges from zero to one and can be interpreted as a probability. To be more specific, if Y equals 0, the prediction of the model is 0 percent; if Y equals 1, the prediction of the model is 100 percent.

4.2 The three instantiations

The three instantiations are based on linear probability models, and the data is census data of Taiwan and Mainland China in 2000. The census data are categorical. Each column represents one characteristic of each person (the row), such as gender, age, and whether the person is employed.

census data

The uncertainties of linear probability models come from the prediction outcomes of the models. Because the models are probability models, the prediction outcomes are probabilities. For example, the simplest model of this research considers only the gender variable:

Pr(Y=1) = Φ(β0 + β1[female])

and here is the outcome after the data are applied to this model:

Gender effect on employment: basic situation

As a result, the previous model can be written as:

Y = 0.867 + (-0.2)*[female]

Then, we apply two values to the dummy variable (which is female). If we want to know the prediction of the model for male Taiwanese, the female variable equals 0; if we want to know the prediction for female Taiwanese, the female variable equals 1. Therefore, we have two outcomes, for male and female separately:

Male: Y = 0.867 + (-0.2)*0 = 0.867
Female: Y = 0.867 + (-0.2)*1 = 0.667

Y, as mentioned in the previous paragraph, represents the response of the linear probability model and is interpreted as a probability. For the male result, 0.867 means there is an 86.7 percent probability for a male Taiwanese to get a job. For the female result, 0.667 means there is a 66.7 percent probability for a female Taiwanese to get a job.

This model is just the simplest one of the research. The research considers many other personal characteristics and their effects on the probability for male and female Taiwanese to get a job, for example, the degree, the age, and the marital status.

Discussing even this simple model reveals three main problems. First, the discussions are repetitive. When we analyze more variables, we will have more similar outcomes with different variables and coefficients. We may easily neglect some uncommon outcomes because they are hidden in visually overwhelming tables.

Second, it is hard for people to “feel” the uncertainty from these models. The values in the models are constants although they represent probabilities. In addition, probability is used for prediction, yet these models themselves cannot show any outcomes unless we apply simulated data to them.

Third, general readers are unlikely to understand these models and get insights from them, because the models rely on terms such as regression, probability, coefficient, r-square, and p-value.

Therefore, I propose three visualization experiments to address these problems in the next three subchapters.

[Figure: Repetitive analysis of these models]

The three following subchapters use different visual representations to visualize the uncertainty of the statistical models, and they are designed for different purposes and for people with different academic backgrounds.

The three ways to visualize the uncertainty of the models are: first, visualizing the probability through the coefficients of the models; second, conveying uncertainty through many data points by applying simulation to the models; third, conveying uncertainty through probabilities by applying the outcomes of the models to a physical metaphor.

4.3 First instantiation: visualizing probability

The first instantiation focuses on the models themselves and visualizes the coefficients of the variables by using color coding.
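One possible way to turn a coefficient into a color is sketched below. It follows the convention described in the results of this instantiation (greener for more negative values, redder for more positive values). The function name `coefficient_to_rgb`, the white midpoint, and the `max_abs` scaling parameter are assumptions made for illustration, not the exact palette used in the thesis.

```python
def coefficient_to_rgb(value: float, max_abs: float) -> tuple[int, int, int]:
    """Map a coefficient to an RGB color.

    Negative values shade toward green, positive values toward red,
    and zero maps to white. `max_abs` is an assumed scale: the
    magnitude at which the color becomes fully saturated.
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    # Share of full saturation, capped at 1.0 for very large coefficients.
    strength = min(abs(value) / max_abs, 1.0)
    fade = round(255 * (1.0 - strength))
    if value < 0:
        return (fade, 255, fade)   # green family
    if value > 0:
        return (255, fade, fade)   # red family
    return (255, 255, 255)         # exactly zero: white


# -0.2 is the female coefficient from the basic model in the text;
# the scale of 0.2 is an arbitrary choice for this example.
print(coefficient_to_rgb(-0.2, max_abs=0.2))
```

Because the mapping depends only on the sign and magnitude of each coefficient, an unexpected sign shows up as a square in the "wrong" color family, which is what makes uncommon results easy to spot.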

What is a coefficient? A coefficient is a numerical quantity placed before a variable and multiplying it. Taking the basic situation of the models as an example, -0.2 is the coefficient of the female variable.

[Figure: Gender effect on employment: basic situation]

As mentioned in the previous paragraph, the above table can be written as the following equation:

Y = 0.867 + (-0.2)*[female]

Y, the probability given by the model, is composed of the constant (0.867) and the coefficient multiplying the variable (-0.2*[female]). The most important part of the model is Y, and reading the coefficient and the constant is the only way to understand Y. Hence, we can apply color coding to the coefficients and remove unnecessary information from the table.

Here, a more complex model considering two variables (gender and age) is used as an example:

[Figure: Personal characteristics (age)]

As discussed in the previous subchapter, the outcomes (such as the table displayed above) are repetitive, and it is easy for readers to neglect uncommon results of the models, for example, the values of coefficients.

4.3.1 The methodology

Color coding represents the coefficients of the variables in the models.

4.3.2 The process

The most important part of the outcomes is the coefficients. I extract the values, use color coding to represent them, and arrange the squares in proper positions.

[Figure: first instantiation]

4.3.3 The results

With color coding, a greener square means the value of the coefficient is more strongly negative; a redder square means the value is more strongly positive. The uncommon results in the table are the red squares. The effect of age on the probability for people to get a job should be negative, but here the values are positive. It is usually believed that when we get older, we should be less likely to get a job because our physical condition declines as age grows. This result is good for readers to explore and find other possible explanations.

Although color coding helps reduce the visual burden, the visualization requires people to have some statistical background, such as regression and probability.

4.4 Second: conveying uncertainty through many data points

The second instantiation uses simulation to explain the uncertainty generated from the statistical models. Instead of communicating the coefficients of the statistical models to readers, I display the possible outcomes as many data points, which are generated from the probabilities of the models. Readers will know the outcomes of the statistical models without needing to understand probability, regression, and coefficients.

4.4.1 The methodology

This instantiation combines simulation, animation, and multiple outcomes from the methodology of visual representation.

In visualizing uncertainty, “static” is the traditional way, while “simulation” has become more and more popular in recent years. “Simulation” is a good way to explain complex topics, and it provides better performance compared with static uncertainty visualization. Animated visualizations catch users’ attention and engage them in understanding the visualizations. Multiple outcomes mean data points in this instantiation, which displays data points as people.

4.4.2 The process

Get simulated data based on the probability of the model (basic situation) and visualize the data in random positions, which emphasizes the concept of uncertainty. There are two circles, which represent males and females respectively. In each circle, there are 100 data points, which stand for 100 males or 100 females. The dots are blue or red. A blue dot means the male or female will get a job. A red dot means the male or female will not get a job.
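The process above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used for the thesis: the probabilities 0.867 and 0.667 come from the basic model in the text, while the function name `simulate_group`, the unit-radius circle, the dictionary layout of each dot, and the random seed are all hypothetical choices.

```python
import math
import random

# Probabilities from the basic gender-only model in the text.
P_JOB = {"male": 0.867, "female": 0.667}
N_PEOPLE = 100  # dots per circle, as described in the text


def simulate_group(p_job, n, rng, radius=1.0):
    """Simulate n people for one circle.

    Each person gets a job with probability p_job (blue dot) or does
    not (red dot), and is placed at a uniformly random position
    inside a circle of the given radius.
    """
    dots = []
    for _ in range(n):
        employed = rng.random() < p_job
        # Taking the square root keeps points uniform over the disc area
        # instead of clustering them near the center.
        r = radius * math.sqrt(rng.random())
        theta = rng.uniform(0.0, 2.0 * math.pi)
        dots.append({
            "x": r * math.cos(theta),
            "y": r * math.sin(theta),
            "color": "blue" if employed else "red",
        })
    return dots


rng = random.Random(2018)  # arbitrary seed, only for repeatable runs
groups = {name: simulate_group(p, N_PEOPLE, rng) for name, p in P_JOB.items()}

for name, dots in groups.items():
    n_blue = sum(d["color"] == "blue" for d in dots)
    print(f"{name}: {n_blue} blue (job), {len(dots) - n_blue} red (no job)")
```

Each run with a different seed gives slightly different counts of blue and red dots around the model's probabilities, and redrawing the dots repeatedly is one way an animation could let readers see that variation.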

[Figure: The outcomes of the linear probability model and simulation are used to convey the uncertainty]

[Figure: The static version of this instantiation]

4.4.3 The results

The restrictions of this instantiation concern the display medium. It can only be displayed on electronic devices with a screen, such as computers, tablets, and cell phones. In addition, it is better to let users control the speed of the animation. If the animation is too fast, it may confuse the readers. Moreover, visualizing uncertainty with multiple outcomes typically requires much space and time to communicate the outcomes, so in many situations this method may not be suitable.

4.5 Third: conveying uncertainty through physical metaphor

The above two instantiations have their own restrictions and are still complex. What would the most intuitive visualization look like? It would connect the statistical models to things the readers are already familiar with, and those things should contain the concept of uncertainty.

4.5.1 The methodology

I use a physical metaphor to represent the models: a prize wheel. The area of the wheel stands for Y (the probability). By applying the models to the prize wheel, users can understand the models more easily because they already know the outcomes of the prize wheel are uncertain.

[Figure: Three physical metaphors]

4.5.2 The process

Calculate the probabilities of four statistical models, create four physical pie charts, and attach them to the prize wheel.

[Figure: The prize wheel “will you get a job?” conveys the uncertainty concept of the linear probability models.]

4.5.3 The results

The blue pie chart means that if you are a male in Taiwan, the probability for you to get a job is 86%. The yellow pie chart means that if you are an illiterate male in Taiwan, the probability for you to get a job is 21%. The purple pie chart means that if you are a 30-year-old male in Taiwan, the probability for you to get a job is 87%.

The overlapping parts of the pie charts convey another uncertainty. Because the four pie charts are generated from four statistical models that consider different personal characteristics, the results (that is, the probabilities shown by the pie charts) are different. The uncertainty of the overlap means that if we consider different variables, we may get different outcomes from the statistical models. In this case, I may get a job if I am a 30-year-old male in Taiwan, but I may not get a job if I am a single male in Taiwan.

4.6 Conclusion

From the first instantiation to the third, the visualizations move from “hard to understand” to “easy to understand”. A visualization that is hard to understand is more accurate than one that is easy to understand, but it is also more complex. There is a trade-off among the three visualizations.

By comparing the different visual representations of uncertainty, readers may see that visualizing data actually loses certain information. For example, the prize wheel does not contain the coefficients or the p-values. The purpose of the visualization determines how much information loss is acceptable. A more intuitive uncertainty visualization loses more valuable information, while a more accurate uncertainty visualization is less intuitive. In addition, different kinds of visualization cater to different types of readers.
For example, professional readers would prefer accurate visualizations, and they can understand the academic terms used in uncertainty visualization. For general readers, the most important thing is to grasp the basic concept of the uncertainty visualization; they also prefer easy and interesting visualizations because they have less academic knowledge.

The three instantiations of visualizing uncertainty proposed in this thesis discuss only one of the regression models. There are still many statistical models that can be visualized and communicated to people more effectively (not just by showing summary statistics).

In addition, visualizing uncertainty can be applied to other, more complex domains, such as machine learning. In the age of artificial intelligence and machine learning, understanding how models work and the statistical concepts behind them is important not only for professionals but also for the general public. The core of machine learning is statistical models. However, the models of machine learning are much more complex than the models I discuss in this thesis. If we want to know how machine learning works and how the predictions are made, we need to understand these models first.

Conclusion and next steps

Visualizing uncertainty is a complex topic. The most difficult part of visualizing uncertainty is defining “what is uncertainty.” Ten people may have eleven definitions of uncertainty. However, many examples of visualizing uncertainty created by data visualization practitioners do not have a clear definition of the uncertainty they discuss.

The first contribution of this thesis is to define uncertainty and provide an updated methodology, the uncertainty taxonomy. The uncertainty taxonomy defines the categories of the properties of uncertainty. By using the taxonomy, we gain a clear understanding of what kinds of uncertainty we want to deal with and visualize. Providing examples for the taxonomy is important as well. With examples, readers clearly understand the meanings of the terms used in the taxonomy. Without examples, readers may form their own understanding of the terms and go in the wrong direction because they define the terms differently.

Traditional visual representations of uncertainty are another issue in the topic of visualizing uncertainty. It is often argued that traditional visualizations of uncertainty are not “effective” because they do not consider how people interpret them. New methods of visualizing uncertainty have been proposed and justified as more effective and better at helping people understand the visualizations of uncertainty.

The second contribution of this thesis is collecting and categorizing the new visualizations of uncertainty and the new methodologies to visualize uncertainty. Instead of focusing on the traditional ways to visualize uncertainty, the discussion of the methodology and the examples in this thesis are more about simulation, animation, and discrete events, which are the new ways to visualize uncertainty. These new ways are used not only in news agencies and consulting companies but also in the academic community.

The third contribution of this thesis is expanding the new ways of visualizing uncertainty to statistical models, which are less discussed by data visualization practitioners. The three experiments consider different ways to visualize the uncertainty of statistical models.

At the end of this thesis, we know that uncertainties are not preventable and cannot be totally visualized by a single visualization. The three experiments of visualizing uncertainty focus on different aspects and properties of uncertainty. The prize wheel focuses on expressing the concept of uncertainty through its physical body, but the outcomes are neglected. The second instantiation conveys the uncertainty through many data points, which are the prediction outcomes, but the coefficients of the models are neglected.

Visualizing uncertainty is not only a complex topic but also a comprehensive one. The three instantiations of visualizing uncertainty only consider probability. However, there are many other statistical terms related to prediction, such as r-square and p-value, which are less discussed in this thesis. In addition, only one model is discussed. The linear probability model is one of the most basic regression models in statistics. There are many other statistical models worth exploring.

The taxonomy can be developed into a tool that helps users critique visualizations of uncertainty. It can also become an interactive tool that assists journalists in dealing with uncertainty.

Another missing part of this thesis is cognitive bias. Cognitive bias can explain the trade-off between summary statistics and discrete events. Summary statistics are not useless; on the contrary, they are important. Discrete events such as data points and dot plots can help people understand uncertainty. However, if the variation of the dots or points is hard for humans to detect, summary statistics can solve the problem. A talk by Lane Harris shows the importance of summary statistics.

Which dataset is more correlated? The left one or the right one?

Many people will say the right one is more correlated. However, the left one is actually more correlated.

This example shows the importance of summary statistics. In certain situations, we need to use summary statistics instead of discrete events. This discussion can be explored further in future work on this thesis.
