ON THE VISUALIZATION IN STATISTICS

A THESIS SUBMITTED TO THE ISLAMIA UNIVERSITY OF BAHAWALPUR IN THE SUBJECT OF STATISTICS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

By SABRA SULTANA
October 2011

Department of Statistics
Bahawalpur 63100, PAKISTAN
www.iub.edu.pk

Chapter 1 Introduction

1.1 Data Visualization and Its Importance

It was noted long ago that visualization is the foundation of human understanding. Since the beginning of recorded history, human beings have seen images in natural wonders. Objects such as stars, rock formations, and clouds were more easily explained through visualization. The sayings (i) “Seeing is believing”, (ii) “One picture is worth ten thousand words” and (iii) “A good sketch is better than a long speech” hold everywhere, and statistics is no exception.

Visualization is an emerging field that has an important role to play in statistics. Playfair (1801) expressed, “For no study is less alluring or more dry and tedious than statistics unless the mind and imagination are set to work or that the person studying is particularly interested in the subject; which is seldom the case with men in any rank in life.”

Data visualization is used to (i) present data in a summarized form, (ii) analyze large amounts of data to see patterns, trends, correlations, etc., and (iii) gain insight into the data in order to make decisions. Data visualization is very popular in business and is considered necessary for business growth.

Data recorded in experiments or surveys are displayed with statistical graphs for visualization. These graphs are usually classified into types such as the Pictograph, Line Plot, Pie Chart, Histogram, Bar Graph, Line Graph, Frequency Polygon, Scatter Plot, Stem and Leaf Plot, and Box Plot. The choice of tool depends on the type of data and the questions being asked.

Schmid (1954) suggested the following three basic purposes for charts and graphs:
(i) Illustration, (ii) Analysis, and (iii) Computation.

Tukey (1972) suggested the following three categories of graphs: (i) Propaganda graphs, (ii) Analytical graphs, and (iii) Substitute of tables. Tufte (1976) added a further category, “Graphs for decoration”, to Tukey’s categories.

According to Friedman (2008), “The main goal of visualization is its ability to visualize data, communicating information clearly and effectively through graphical means. It doesn’t mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way.”

Graphical methods play a key role in the development of statistical theory and practice. Graphics provide an excellent approach for exploring data and are essential for presenting results. Graphics are divided into the following two types: (i) Presentation graphics, and (ii) Exploratory graphics.

Anscombe (1973) argued that the data should be plotted before a function is fitted to them. He constructed four data sets to demonstrate the importance of graphing data before fitting regression lines. Collectively these data sets present a very striking picture. Graphs are also a fundamental part of confirmatory analysis for statistical inference.

Graphical visualization has gained such importance that the ASA publishes a separate journal, the Journal of Computational and Graphical Statistics. With the rise of computer graphics capabilities, the work of pioneers of graphical statistics such as Tukey, Cleveland, and Tufte has been combined, resulting in many high-end scientific visualization systems. Chambers et al.
(1983) stated, “There is no single statistical tool that is as powerful as a well-chosen graph.”

Statistical data analysis has now been given a visually intuitive approach. Visual statistics through ViSta (The Visual Statistics System) has brought complex and advanced statistical methods within the reach of even less skilled users of statistics. ViSta makes available fully interactive visualizations derived from the relevant mathematical statistics, promoting perceptual and cognitive understanding of the data's story. Emphasis should therefore be placed on a paradigm for understanding data that is visual, intuitive, geometric and active, rather than one that relies on convoluted logic, heavy mathematics, and systems of algebraic equations, or on passive acceptance of results.

1.2 Problem Statement

The present age is one of multivariate data sets that must be analyzed in limited time. Data visualization provides quick and reliable results. To make data visualization more effective, new graphical forms for the presentation and comparison of multivariate data sets should be developed. Unfortunately, the existing graphs are not in generalized forms, and some of them are merely subjective in nature.

1.3 Research Questions

Based on the problems indicated above, the current study is intended to answer the following questions.

Should the existing graphical forms be improved, or should new graphical forms be developed for this purpose?
To increase the scope of graphs, should the new graphs be in generalized and objective forms?
Which types of new graphical forms can adequately present and compare multivariate data sets?

1.4 Research Objectives

Objectives of the present study are to:

Introduce new graphical forms for the presentation and comparison of multivariate data sets.
Create the new graphs in generalized forms, because generalized forms have wider scope and will be more useful.
Develop new graphical forms that are objective in nature.
Present the new graphical forms in different colors to make them aesthetically pleasing, so that they may enhance the interest of viewers.

1.5 Organization of the Thesis

Chapter 1: It comprises an introduction to the overall domain of the study, the problem statement, research questions, and objectives.
Chapter 2: It deals with the detailed history of graphical presentation.
Chapter 3: It provides an extensive review of the literature.
Chapter 4: It contains (a) a new visualization of the LSD test which is (i) objective in nature, and (ii) self-explanatory at a glance, and (b) a comparison of the proposed visualization of the LSD test with that of Iqbal and Clarke (2003b).
Chapter 5: It presents the generalized form of Multi-Series Doughnut charts. For marginal cases, more sophisticated ways of visualization are also considered.
Chapter 6: It deals with the creation of a generalized form of Bar Charts, which makes it possible to represent and compare multivariate data sets at the same time.
Chapter 7: It deals with the exploration of the generalized form of Boxes, which makes it possible to compare multivariate data sets in more than three dimensions.
Chapter 8: It gives a summary of the study and directions for future research.

Chapter 2 History of Data Visualization

Statistical graphics and data visualization are modern developments in statistics. The pedigree of the graphical portrayal of quantitative information can be traced to ancient map-making, thematic cartography, medicine, and science. Visualization originated in geometric diagrams, in tables of the positions of stars and other celestial bodies, and in map-making.

Figure 2.1 gives a graphical overview of the distinct periods of time (strictly from 1500 to 2000), showing the frequency of events regarded as milestones in the history of data visualization. It depicts a steady rise from the early 18th century through the 19th century, followed by a decline in the period of the modern dark ages.
After that, it shows a sharp rise up to the present age.

Figure 2.1 Time Distribution of Events Considered Milestones in the History of Data Visualization, Shown By a Rug Plot and Density Estimate. Source: Chen et al. (2008, p. 18)

Funkhouser (1936) presented one of the oldest graphical representations (see Figure 2.2), dating back to the 10th century, which concerns the changing positions of the seven important heavenly bodies; Tufte (1983, p. 28) later reproduced it. Figure 2.2 is a multiple time-series graph showing the changing positions of the seven heavenly bodies over space and time. Time (divided into 30 intervals) is plotted along the horizontal axis, and the vertical axis shows the inclination of the planetary orbits.

Figure 2.2 Planetary Movements shown as Cyclic Inclinations over Time. Source: Chen et al. (2008, p. 19)

Statistical graphics and data visualization have deep roots which reach back to the earliest map-making and visual depiction, and later to cartography. Egyptian surveyors used the concept of coordinates by at least 200 B.C. A memorable work of that time was the map projection of the spherical earth into longitude and latitude. During the 14th century, the idea of plotting a theoretical function, and the logical relation between tabulating values and plotting them, appeared. In the 15th century, Nicole Oresme (1482) gave the basis for the idea of a theoretical graph of distance vs speed.

In the 16th century, different techniques for the exact observation and measurement of physical quantities were suggested, and tools were developed for that purpose. Reginer Gemma-Frisius (1533) used a camera obscura to record an eclipse of the sun. Other prominent contributions were basic trigonometric tables and the first cartographic atlas.

A remarkable example from the 17th century (Figure 2.3) is the 1644 graphic by Michael Florent van Langren.
It is believed to be the first visual representation of statistical data (Tufte, 1997, p. 15). Figure 2.3 is a one-dimensional line graph showing the distance in longitude between Toledo and Rome. According to Friendly & Kwan (2003), “Van Langren’s graph is also a milestone as the earliest known exemplar of the principle of effect ordering for data display”.

This century saw new directions in theory and practical application: the rise of coordinate systems and analytic geometry, the analysis of observations, the foundation of probability theory, demographic statistics, and political arithmetic.