Part One Exploratory Data Analysis Probability Distributions
Charles A. Rohde
Fall 2001

Contents

1 Numeracy and Exploratory Data Analysis
  1.1 Numeracy
    1.1.1 Numeracy
  1.2 Discrete Data
  1.3 Stem and leaf displays
  1.4 Letter Values
  1.5 Five Point Summaries and Box Plots
  1.6 EDA Example
  1.7 Other Summaries
    1.7.1 Classical Summaries
  1.8 Transformations for Symmetry
  1.9 Bar Plots and Histograms
    1.9.1 Bar Plots
    1.9.2 Histograms
    1.9.3 Frequency Polygons
  1.10 Sample Distribution Functions
  1.11 Smoothing
    1.11.1 Smoothing Example
  1.12 Shapes of Batches
  1.13 References
2 Probability
  2.1 Mathematical Preliminaries
    2.1.1 Sets
    2.1.2 Counting
  2.2 Relating Probability to Responses and Populations
  2.3 Probability and Odds - Basic Definitions
    2.3.1 Probability
    2.3.2 Properties of Probability
    2.3.3 Methods for Obtaining Probability Models
    2.3.4 Odds
  2.4 Interpretations of Probability
    2.4.1 Equally Likely Interpretation
    2.4.2 Relative Frequency Interpretation
    2.4.3 Subjective Probability Interpretation
    2.4.4 Does it Matter?
  2.5 Conditional Probability
    2.5.1 Multiplication Rule
    2.5.2 Law of Total Probability
  2.6 Bayes Theorem
  2.7 Independence
  2.8 Bernoulli trial models; the binomial distribution
  2.9 Parameters and Random Sampling
  2.10 Probability Examples
    2.10.1 Randomized Response
    2.10.2 Screening
3 Probability Distributions
  3.1 Random Variables and Distributions
    3.1.1 Introduction
    3.1.2 Discrete Random Variables
    3.1.3 Continuous or Numeric Random Variables
    3.1.4 Distribution Functions
    3.1.5 Functions of Random Variables
    3.1.6 Other Distributions
  3.2 Parameters of Distributions
    3.2.1 Expected Values
    3.2.2 Variances
    3.2.3 Quantiles
    3.2.4 Other Expected Values
    3.2.5 Inequalities involving Expectations
4 Joint Probability Distributions
  4.1 General Case
    4.1.1 Marginal Distributions
    4.1.2 Conditional Distributions
    4.1.3 Properties of Marginal and Conditional Distributions
    4.1.4 Independence and Random Sampling
  4.2 The Multinomial Distribution
  4.3 The Multivariate Normal Distribution
  4.4 Parameters of Joint Distributions
    4.4.1 Means, Variances, Covariances and Correlation
    4.4.2 Joint Moment Generating Functions
  4.5 Functions of Jointly Distributed Random Variables
    4.5.1 Linear Combinations of Random Variables
  4.6 Approximate Means and Variances
  4.7 Sampling Distributions of Statistics
  4.8 Methods of Obtaining Sampling Distributions or Approximations
    4.8.1 Exact Sampling Distributions
    4.8.2 Asymptotic Distributions
    4.8.3 Central Limit Theorem
    4.8.4 Central Limit Theorem Example
    4.8.5 Law of Large Numbers
    4.8.6 The Delta Method - Univariate
    4.8.7 The Delta Method - Multivariate
    4.8.8 Computer Intensive Methods
    4.8.9 Bootstrap Example

Chapter 1
Numeracy and Exploratory Data Analysis

1.1 Numeracy

1.1.1 Numeracy

Since most of statistics involves the use of numerical data to draw conclusions, we first discuss the presentation of numerical data. Numeracy may be broadly defined as the ability to think about and present numbers effectively.

• One of the most common forms of presentation of numerical information is in tables.
• There are some simple guidelines which allow us to improve tabular presentation of numbers.
• In certain situations, the guidelines presented here will need to be modified if the audience (e.g., readers of a professional journal) expects the results to be presented in a specified format.

Guidelines

• Round to two significant figures.
  - A table of numbers is almost always easier to understand if the numbers do not contain too many significant figures.
• Add averages or totals.
  - Adding row and/or column averages, proportions or totals to a table, when appropriate, often provides a useful focus for establishing trends or patterns.
• Numbers are easier to compare in columns.
• Order by size.
  - A more effective presentation is often achieved by rearranging the table so that the largest (and presumably most important) numbers appear first.
• Spacing and layout.
  - It is useful to present tables in single space format, without a lot of "empty space" to distract the reader from concentrating on the numbers in the table.

1.2 Discrete Data

For discrete data, present tables of the numbers of responses at the various values, possibly grouped by factors. One can also produce bar graphs and histograms for graphical presentation. Thus in the first example in the introduction we might present the results as follows:

                     Placebo    Vaccine
  Proportion Cases      .008       .004
  Studied            200,745    201,229

A sensible description might be 4 cases per thousand for the vaccinated group and 8 cases per thousand for the placebo group.

For the alcohol use data in the Overview Section, e.g.,

  Group         Use Alcohol    Surveyed    Proportion
  Clergy                 32         300           .11
  Educators              51         250           .20
  Executives             67         300           .22
  Merchants              83         350           .24

we might present the data as

Figure 1.1:

For the self classification data in the Overview Section, e.g.,

  Class     Lower    Working    Middle    Upper
  Number       72        714       655       41

we might present the data as

Figure 1.2:

1.3 Stem and leaf displays

Suppose we have a batch or collection of numbers. Stem and leaf displays provide a simple, yet informative way to

• Develop summaries or descriptions of the batch, either to learn about it in isolation or to compare it with other batches.
  The fundamental summaries are
  - location of the batch (a center concept)
  - scale or spread of the batch (a variability concept).
• Explore (note) characteristics of the batch including
  - symmetry and general shape
  - exceptional values
  - gaps
  - concentrations

Consider the following batch of 62 numbers, which give the ages in years of graduate students, post-docs, staff and faculty of a large academic department of statistics:

  33 20 41 52 35 25 43 61 37 29 44 64 40 32 50 76
  33 22 42 55 36 26 43 61 37 30 46 65 40 32 50 79
  34 23 43 59 37 27 43 61 39 31 46 67 41 32 51 81
  37 28 44 64 37 29 44 64 40 31 49 74 51 52

Not much can be learned by looking at the numbers in this form. A simple display which begins to describe this collection of numbers is as follows:

              9 |
  ( 1)   1    8 | 1
  ( 4)   3    7 | 4 6 9
  (12)   8    6 | 1 4 5 7 4 1 1 4
  (20)   8    5 | 9 1 5 2 1 2 0 0
  (42)  16    4 | 2 1 4 3 3 3 0 3 6 0 1 6 4 0 9 4
  (26)  17    3 | 0 7 6 3 7 2 7 2 2 2 1 5 9 4 7 1 7
  ( 9)   9    2 | 9 7 3 2 9 0 5 6 8
              1 |

Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc.

A more refined version of this display, with the leaves on each stem sorted, is:

              9 |
  ( 1)   1    8 | 1
  ( 4)   3    7 | 4 6 9
  (12)   8    6 | 1 1 1 4 4 4 5 7
  (20)   8    5 | 0 0 1 1 2 2 5 9
  (42)  16    4 | 0 0 0 1 1 2 3 3 3 3 4 4 4 6 6 9
  (26)  17    3 | 0 1 1 2 2 2 3 3 4 5 6 7 7 7 7 7 9
  ( 9)   9    2 | 0 2 3 5 6 7 8 9 9
              1 |

Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc.

To construct a stem and leaf display we perform the following steps:

• To the left of the solid line we put the stem of the number.
• To the right of the solid line we put the leaf of the number.

The remaining entries in the display are discussed in the next section. Note that a stem and leaf display provides a quick and easy way to display a batch of numbers. Every statistical package now has a program to draw stem and leaf displays.

Some additional comments on stem and leaf displays:

• Number of stems. Understanding Robust and Exploratory Data Analysis suggests √n for n less than 100 and 10 log10(n) for n larger than 100. (Batches of more than 50 numbers are usually displayed using a computer, and each statistical package has its own default method.)
• Stems can be double (or more) digits, and there can be stems such as 5* and 5· which divide the numbers with stem 5 into two groups, (0,1,2,3,4) and (5,6,7,8,9).
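
The construction steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular statistical package: the names `AGES`, `suggested_lines` and `stem_and_leaf` are hypothetical, the batch is assumed to consist of nonnegative integers with the units digit as the leaf, and the treatment of n exactly equal to 100 (which the rule of thumb quoted above does not cover) is an assumption. The depth and count columns of the displays are not reproduced.

```python
import math

# Ages in years of the 62 department members listed in the text.
AGES = [
    33, 20, 41, 52, 35, 25, 43, 61, 37, 29, 44, 64, 40, 32, 50, 76,
    33, 22, 42, 55, 36, 26, 43, 61, 37, 30, 46, 65, 40, 32, 50, 79,
    34, 23, 43, 59, 37, 27, 43, 61, 39, 31, 46, 67, 41, 32, 51, 81,
    37, 28, 44, 64, 37, 29, 44, 64, 40, 31, 49, 74, 51, 52,
]


def suggested_lines(n):
    """Rule of thumb as quoted in the text: about sqrt(n) lines for n < 100
    and 10 * log10(n) lines for larger n.

    Assumption: n == 100 is handled by the log rule, since the text
    does not say which branch applies there.
    """
    if n < 1:
        raise ValueError("batch must contain at least one number")
    if n < 100:
        return round(math.sqrt(n))
    return round(10 * math.log10(n))


def stem_and_leaf(batch):
    """Return the display as a list of text lines, largest stem first.

    Assumes nonnegative integers: the stem is the number without its
    units digit and the leaf is the units digit. Leaves are sorted,
    as in the refined display.
    """
    leaves_by_stem = {}
    for x in batch:
        stem, leaf = divmod(x, 10)
        leaves_by_stem.setdefault(stem, []).append(leaf)

    lines = []
    # Walk every stem between the extremes so that gaps show up as empty rows.
    for stem in range(max(leaves_by_stem), min(leaves_by_stem) - 1, -1):
        leaves = sorted(leaves_by_stem.get(stem, []))
        line = f"{stem:>2} | " + " ".join(str(d) for d in leaves)
        lines.append(line.rstrip())
    return lines


display = stem_and_leaf(AGES)
print("\n".join(display))
print("suggested number of lines:", suggested_lines(len(AGES)))
```

For the ages batch this sketch produces one row per stem from 8 down to 2, matching the stem and leaf columns of the refined display. The empty stems 9 and 1 are omitted because the sketch only spans the stems that occur in the data.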