Part One Exploratory Data Analysis Distributions

Charles A. Rohde

Fall 2001

Contents

1 Numeracy and Exploratory Data Analysis
  1.1 Numeracy
    1.1.1 Numeracy
  1.2 Discrete Data
  1.3 Stem and leaf displays
  1.4 Letter Values
  1.5 Five Point Summaries and Box Plots
  1.6 EDA Example
  1.7 Other Summaries
    1.7.1 Classical Summaries
  1.8 Transformations for Symmetry
  1.9 Bar Plots and Histograms
    1.9.1 Bar Plots
    1.9.2 Histograms
    1.9.3 Frequency Polygons
  1.10 Sample Distribution Functions
  1.11 Smoothing
    1.11.1 Smoothing Example
  1.12 Shapes of Batches
  1.13 References

2 Probability
  2.1 Mathematical Preliminaries
    2.1.1 Sets
    2.1.2 Counting
  2.2 Relating Probability to Responses and Populations
  2.3 Probability and Odds - Basic Definitions
    2.3.1 Probability
    2.3.2 Properties of Probability
    2.3.3 Methods for Obtaining Probability Models
    2.3.4 Odds
  2.4 Interpretations of Probability
    2.4.1 Equally Likely Interpretation
    2.4.2 Relative Frequency Interpretation
    2.4.3 Subjective Probability Interpretation
    2.4.4 Does it Matter?
  2.5 Conditional Probability
    2.5.1 Multiplication Rule
    2.5.2 Law of Total Probability
  2.6 Bayes Theorem
  2.7 Independence
  2.8 Bernoulli trial models; the ...
  2.9 Parameters and Random Sampling
  2.10 Probability Examples
    2.10.1 Randomized Response
    2.10.2 Screening

3 Probability Distributions
  3.1 Random Variables and Distributions
    3.1.1 Introduction
    3.1.2 Discrete Random Variables
    3.1.3 Continuous or Numeric Random Variables
    3.1.4 Distribution Functions
    3.1.5 Functions of Random Variables
    3.1.6 Other Distributions
  3.2 Parameters of Distributions
    3.2.1 Expected Values
    3.2.2 Variances
    3.2.3 Quantiles
    3.2.4 Other Expected Values
    3.2.5 Inequalities involving Expectations

4 Joint Probability Distributions
  4.1 General Case
    4.1.1 Marginal Distributions
    4.1.2 Conditional Distributions
    4.1.3 Properties of Marginal and Conditional Distributions
    4.1.4 Independence and Random Sampling
  4.2 The ...
  4.3 The Multivariate ...
  4.4 Parameters of Joint Distributions
    4.4.1 Means, Variances, Covariances and Correlation
    4.4.2 Joint Moment Generating Functions
  4.5 Functions of Jointly Distributed Random Variables
    4.5.1 Linear Combinations of Random Variables
  4.6 Approximate Means and Variances
  4.7 Sampling Distributions of ...
  4.8 Methods of Obtaining Sampling Distributions or Approximations
    4.8.1 Exact Sampling Distributions
    4.8.2 Asymptotic Distributions
    4.8.3 Central Limit Theorem
    4.8.4 Central Limit Theorem Example
    4.8.5 ...
    4.8.6 The Delta Method - Univariate
    4.8.7 The Delta Method - Multivariate
    4.8.8 Computer Intensive Methods
    4.8.9 Bootstrap Example

Chapter 1

Numeracy and Exploratory Data Analysis

1.1 Numeracy

1.1.1 Numeracy

Since most of statistics involves the use of numerical data to draw conclusions we first discuss the presentation of numerical data. Numeracy may be broadly defined as the ability to effectively think about and present numbers.

• One of the most common forms of presentation of numerical information is in tables.

• There are some simple guidelines which allow us to improve tabular presentation of numbers.

• In certain situations the guidelines presented here will need to be modified, for example when the audience (e.g. readers of a professional journal) expects the results to be presented in a specified format.

Guidelines

• Round to two significant figures.

◦ A table of numbers is almost always easier to understand if the numbers do not contain too many significant figures (see the sketch after this list).

• Add averages or totals.

◦ Adding row and/or column averages, proportions or totals to a table, when appropriate, often provides a useful focus for establishing trends or patterns.

• Numbers are easier to compare in columns.

• Order by size.

◦ A more effective presentation is often achieved by rearranging so that the largest (and presumably most important) numbers appear first.

• Spacing and layout.

◦ It is useful to present tables in single space format and not have a lot of “empty space” to distract the reader from concentrating on the numbers in the table.
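As a small illustration of the rounding guideline, here is a minimal sketch in Python (not part of the original notes; the example numbers are made up for illustration) that rounds values to two significant figures before they go into a table.

# Round values to two significant figures for tabular display (illustrative sketch).
import math

def round2sf(x):
    """Round x to two significant figures."""
    if x == 0:
        return 0
    digits = 1 - int(math.floor(math.log10(abs(x))))
    return round(x, digits)

raw = [200745, 201229, 0.00823, 0.00412, 159.7749]
print([round2sf(v) for v in raw])    # -> [200000, 200000, 0.0082, 0.0041, 160.0]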

1.2 Discrete Data

For discrete data present tables of the numbers of responses at the various values, possibly grouped by factors. Also one can produce bar graphs and histograms for graphical presentation. Thus in the first example in the introduction we might present the results as follows:

                     Placebo    Vaccine
Proportion Cases        .008       .004
Studied              200,745    201,229

A sensible description might be 4 cases per thousand for the vaccinated group and 8 cases per thousand for the placebo group.

For the alcohol use data in the Overview Section, e.g.

Group        Use Alcohol   Surveyed   Proportion
Clergy            32          300        .11
Educators         51          250        .20
Executives        67          300        .22
Merchants         83          350        .24

we might present the data as

Figure 1.1:

For the self classification data in the Overview Section e.g.

Class     Lower   Working   Middle   Upper
Number      72      714       655      41

we might present the data as

Figure 1.2:

1.3 Stem and leaf displays

Suppose we have a batch or collection of numbers. Stem and leaf displays provide a simple, yet informative way to

• Develop summaries or descriptions of the batch either to learn about it in isolation or to compare it with other batches. The fundamental summaries are

◦ location of the batch (a center concept) ◦ scale or spread of the batch (a variability concept).

• Explore (note) characteristics of the batch including

◦ symmetry and general shape
◦ exceptional values
◦ gaps
◦ concentrations

Consider the following batch of 62 numbers which give the ages in years of graduate students, post-docs, staff and faculty of a large academic department of statistics:

33 20 41 52 35 25 43 61 37 29 44 64 40 32 50 76 33 22 42 55 36 26 43 61 37 30 46 65 40 32 50 79 34 23 43 59 37 27 43 61 39 31 46 67 41 32 51 81 37 28 44 64 37 29 44 64 40 31 49 74 51 52

Not much can be learned by looking at the numbers in this form. A simple display which begins to describe this collection of numbers is as follows:

          9 |
( 1)    1 8 | 1
( 4)    3 7 | 4 6 9
(12)    8 6 | 1 4 5 7 4 1 1 4
(20)    8 5 | 9 1 5 2 1 2 0 0
(42)   16 4 | 2 1 4 3 3 3 0 3 6 0 1 6 4 0 9 4
(26)   17 3 | 0 7 6 3 7 2 7 2 2 2 1 5 9 4 7 1 7
( 9)    9 2 | 9 7 3 2 9 0 5 6 8
          1 |

Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc.

A more refined version of this display is:

          9 |
( 1)    1 8 | 1
( 4)    3 7 | 4 6 9
(12)    8 6 | 1 1 1 4 4 4 5 7
(20)    8 5 | 0 0 1 1 2 2 5 9
(42)   16 4 | 0 0 0 1 1 2 3 3 3 3 4 4 4 6 6 9
(26)   17 3 | 0 1 1 2 2 2 3 3 4 5 6 7 7 7 7 7 9
( 9)    9 2 | 0 2 3 5 6 7 8 9 9
          1 |

Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc. To construct a stem and leaf display we perform the following steps:

• To the left of the solid line we put the stem of the number

• To the right of the solid line we put the leaf of the number.

The remaining entries in the display are discussed in the next section. Note that a stem and leaf display provides a quick and easy way to display a batch of numbers. Every statistical package now has a program to draw stem and leaf displays. Some additional comments on stem and leaf displays:

• Number of stems. Understanding Robust and Exploratory Data Analysis suggests √n for n less than 100 and 10 log10(n) for n larger than 100. (Batches of more than about 50 values are usually displayed using a computer, and each statistical package has its own default method.)

• Stems can be double (or more) digits and there can be stems such as 5* and 5· which divide the numbers with stem 5 into two groups (0,1,2,3,4) and (5,6,7,8,9). Large displays could use 5 or 10 divisions per stem. The important idea is to display the numbers effectively.

• For small batches, when working by hand, the use of stem and leaf displays is a simple way to obtain the ordered values of the batch. A minimal sketch of the construction follows.
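The sketch below (Python, not part of the original notes) builds a bare-bones display for the age batch: stems are the tens digits and leaves the units digits; the depth counts shown in the displays above are omitted.

# Simple stem and leaf display for the age batch: stem = tens digit, leaf = units digit.
ages = [33, 20, 41, 52, 35, 25, 43, 61, 37, 29, 44, 64, 40, 32, 50, 76,
        33, 22, 42, 55, 36, 26, 43, 61, 37, 30, 46, 65, 40, 32, 50, 79,
        34, 23, 43, 59, 37, 27, 43, 61, 39, 31, 46, 67, 41, 32, 51, 81,
        37, 28, 44, 64, 37, 29, 44, 64, 40, 31, 49, 74, 51, 52]

display = {}
for x in sorted(ages):
    display.setdefault(x // 10, []).append(x % 10)    # stem -> sorted leaves

for stem in sorted(display, reverse=True):            # largest stems at the top
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))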

1.4 Letter Values

The stem and leaf display can be used to determine a collection of derived numbers, called statistics, which can be used to summarize some additional features of the batch. To do this we need to determine the total size of the batch and where the individual numbers are located in the display.

• To the left of the stem we count the number of leaves on each stem.

• The numbers in parentheses are the cumulative numbers counting up and counting down.

• Using the stem and leaf display we can easily “count in” from either end of the batch.

◦ The associated count is called the depth of the number. ◦ Thus at depth 4 we have the number 74 if we count down (largest to smallest) and the number 25 if we count up (smallest to largest).

• It is easier to understand the concept of depth if the numbers are written in a column from largest to smallest.

• A measure of location is provided by the median, defined as that number in the display with depth equal to

(1 + batch size)/2

◦ If the size of the batch is even (n = 2m) the depth of the median will not be an integer.
◦ In such a case the median is defined to be halfway between the numbers with depth m and depth m + 1.
◦ In the example

median depth = (1 + 62)/2 = 63/2 = 31.5

thus the median is given by:

[(# with depth 31) + (# with depth 32)]/2 = (41 + 42)/2 = 41.5

◦ The median has the property that 1/2 of the numbers in the batch are above it and 1/2 of the numbers in the batch are below it, i.e., it is halfway from either end of the batch.

• The median is just one example of a letter value. Other letter values enable us to describe variability, shape and other characteristics of the batch.

◦ The simplest sequence of letter values divides the lower half in two and the upper half in two, each of these halves in two, and so on.
◦ To obtain these letter values we first find their depths by the formula

next letter value depth = (1 + [previous letter value depth])/2

where [ ] means we discard any fraction in the calculation (called the “floor function”).
◦ Thus the upper and lower quartiles have depths equal to

(1 + [depth of median])/2

The quartiles are sometimes called fourths.
◦ The eighths have depths equal to

(1 + [depth of hinge])/2

◦ We proceed down to the extremes which have depth 1.
◦ The median, quartiles and extremes often describe a batch of numbers quite well.
◦ The remaining letter values are used to describe more subtle features of the data (illustrated later).

In the example we thus have

F depth = (1 + 31)/2 = 32/2 = 16
E depth = (1 + 16)/2 = 17/2 = 8.5
Extreme depth = (1 + 1)/2 = 2/2 = 1

The corresponding letter values are

M    41.5            depth 31.5
F    33      52      depth 16
Ex   20      81      depth 1

We can display the letter values as follows:

Value   Depth   Lower   Upper   Spread
M       31.5    41.5    41.5      0
F       16      33      52       19
E       8.5     29      64       35
Ex      1       20      81       61

where the spread of a letter value is defined as:

upper letter value − lower letter value

A small computational sketch of these depth rules and letter values follows.
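Here is a minimal sketch (Python, not part of the original notes) of the depth rules: floor the previous depth, add one, halve, and read the letter values off the ordered batch from each end, averaging the two neighbouring values at half-integer depths.

# Letter values (median, fourths, eighths, ...) by the depth rules.
ages = [33, 20, 41, 52, 35, 25, 43, 61, 37, 29, 44, 64, 40, 32, 50, 76,
        33, 22, 42, 55, 36, 26, 43, 61, 37, 30, 46, 65, 40, 32, 50, 79,
        34, 23, 43, 59, 37, 27, 43, 61, 39, 31, 46, 67, 41, 32, 51, 81,
        37, 28, 44, 64, 37, 29, 44, 64, 40, 31, 49, 74, 51, 52]

def value_at_depth(ordered, d):
    """Value at depth d (1-based); average the two neighbours at half-integer depths."""
    return (ordered[int(d) - 1] + ordered[int(d + 0.5) - 1]) / 2

x = sorted(ages)
depth = (1 + len(x)) / 2                        # depth of the median
for label in ["M", "F", "E", "D", "C", "B"]:
    lower = value_at_depth(x, depth)            # counting up from the smallest value
    upper = value_at_depth(x[::-1], depth)      # counting down from the largest value
    print(label, depth, lower, upper, upper - lower)
    depth = (1 + int(depth)) / 2                # floor the previous depth, add 1, halve

For the age batch this reproduces the median 41.5 and the fourths 33 and 52 shown above.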

1.5 Five Point Summaries and Box Plots

• A useful summary of a batch of numbers is the five point summary in which we list the upper and lower extremes, the upper and lower hinges and the median. Thus for the example we have the five point summary given by

20, 33, 41.5, 52, 81

• A five point summary can be displayed graphically as a box plot in which we picture only the median, the lower fourth, the upper fourth and the extremes as on the following page:

For this batch of numbers there is evidence of asymmetry or skewness as can be observed from the stem-leaf display or the box plot.

Figure 1.3:

To measure spread we can use the interquartile range, which is simply the difference between the upper quartile and the lower quartile.

1.6 EDA Example

The following are the heights in centimeters of 351 elderly female patients. The data set is elderly.raw (from Hand et al., pages 120-121)

156 163 169 161 154 156 163 164 156 166 177 158 150 164 159 157 166 163 153 161
170 159 170 157 156 156 153 178 161 164 158 158 162 160 150 162 155 161 158 163
158 162 163 152 173 159 154 155 164 163 164 157 152 154 173 154 162 163 163 165
160 162 155 160 151 163 160 165 166 178 153 160 156 151 165 169 157 152 164 166
160 165 163 158 153 162 163 162 164 155 155 161 162 156 169 159 159 159 158 160
165 152 157 149 169 154 146 156 157 163 166 165 155 151 157 156 160 170 158 165
167 162 153 156 163 157 147 163 161 161 153 155 166 159 157 152 159 166 160 157
153 159 156 152 151 171 162 158 152 157 162 168 155 155 155 161 157 158 153 155
161 160 160 170 163 153 159 169 155 161 156 153 156 158 164 160 157 158 157 156
160 161 167 162 158 163 147 153 155 159 156 161 158 164 163 155 155 158 165 176
158 155 150 154 164 145 153 169 160 159 159 163 148 171 158 158 157 158 168 161
165 167 158 158 161 160 163 163 169 163 164 150 154 165 158 161 156 171 163 170
154 158 162 164 158 165 158 156 162 160 164 165 157 167 142 166 163 163 151 163
153 157 159 152 169 154 155 167 164 170 174 155 157 170 159 170 155 168 152 165
158 162 173 154 167 158 159 152 158 167 164 170 164 166 170 160 148 168 151 153
150 165 165 147 162 165 158 145 150 164 161 157 163 166 162 163 160 162 153 168
163 160 165 156 158 155 168 160 153 163 161 145 161 166 154 147 161 155 158 161
163 157 156 152 156 165 159 170 160 152 153

STATA log for EDA of Heights of Elderly Women

. infile height using c:\courses\b651201\datasets\elderly.raw (351 observations read)

. stem height

Stem-and-leaf plot for height

14t | 2
14f | 555
14s | 67777
14. | 889
15* | 000000111111
15t | 22222222222233333333333333333
15f | 44444444444555555555555555555555
15s | 6666666666666666666677777777777777777777
15. | 888888888888888888888888888888899999999999999999
16* | 00000000000000000000011111111111111111111
16t | 222222222222222222333333333333333333333333333333
16f | 44444444444444444555555555555555555
16s | 666666666667777777
16. | 88888899999999
17* | 00000000000111
17t | 333
17f | 4
17s | 67
17. | 88

. summarize height, detail

                           height
-------------------------------------------------------------
      Percentiles      Smallest
 1%          145            142
 5%          150            145
10%          152            145       Obs                 351
25%          156            145       Sum of Wgt.         351

50%          160                      Mean           159.7749
                        Largest       Std. Dev.       6.02974
75%          164            176
90%          168            177       Variance       36.35777
95%          170            178       Skewness       .1289375
99%          176            178       Kurtosis       3.160595

. display 3.49*6.02974*(351^(-1/3)) 2.9832408

. display 3.49*sqrt(r(Var))*(351^(-1/3)) 2.983241

. display (178-142)/2.98 12.080537

. display min(sqrt(351),10*log(10))
18.734994

. graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins)

. graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins) saving > (g1,replace)

. graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 5 Bins > ) saving(g2,replace)

. graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 12 Bin > s) saving(g2,replace)

. graph height, bin(18) normal xlabel ylabel ti(Heights of Elderly Women 18 Bin > s) saving(g3,replace)

. graph height, bin(25) normal xlabel ylabel ti(Heights of Elderly Women 25 Bin > s) saving(g4,replace)

. graph using g1 g2 g3 g4

Histograms of Data on Elderly Women

Figure 1.4: Histograms

. lv height

#             351      height
---------------------------------------------------------------
M       176   |            160         |   spread   pseudosigma
F      88.5   |    156     160     164 |        8       5.95675
E      44.5   |    153   159.5     166 |       13      5.667454
D      22.5   |    151  160.25   169.5 |     18.5      6.048453
C      11.5   |  148.5   159.5   170.5 |       22      5.929273
B         6   |    147     160     173 |       26      6.071367
A       3.5   |    145  160.75   176.5 |     31.5      6.659417
Z         2   |    145   161.5     178 |       33      6.360923
Y       1.5   |  143.5  160.75     178 |     34.5      6.355203
          1   |    142     160     178 |       36      6.246375
              |                        |    # below    # above
inner fence   |    144         176     |          1          4
outer fence   |    132         188     |          0          0

. format height %9.2f

. lv height

#             351      height
---------------------------------------------------------------
M       176   |           160.00          |   spread   pseudosigma
F      88.5   |   156.00  160.00   164.00 |     8.00          5.96
E      44.5   |   153.00  159.50   166.00 |    13.00          5.67
D      22.5   |   151.00  160.25   169.50 |    18.50          6.05
C      11.5   |   148.50  159.50   170.50 |    22.00          5.93
B         6   |   147.00  160.00   173.00 |    26.00          6.07
A       3.5   |   145.00  160.75   176.50 |    31.50          6.66
Z         2   |   145.00  161.50   178.00 |    33.00          6.36
Y       1.5   |   143.50  160.75   178.00 |    34.50          6.36
          1   |   142.00  160.00   178.00 |    36.00          6.25
              |                           |    # below    # above
inner fence   |   144.00          176.00  |          1          4
outer fence   |   132.00          188.00  |          0          0

. graph height, box

. graph height, box ylabel

. graph height, box ylabel l1(Height in Centimeters) ti(Box Plot of Heights of > Elderly Women)

. cumul height, gen(cum)

. graph cum height,s(i) c(l) ylabel xlabel ti(Empirical Distribution Function O > f Heights of Elderly Women) rlabel yline(.25,.5,.75)

. kdensity height

. kdensity height,normal ti(Kdensity Estimate of Heights)

. log close

1.7 Other Summaries

Other measures of location are

• mid = (UQ + LQ)/2

• tri-mean = (mid + median)/2 = (LQ + 2M + UQ)/4

where UQ is the upper quartile, M is the median and LQ is the lower quartile.

It is often useful to identify exceptional values that need special attention. We do this using fences.

• The upper and lower fences are defined by

upper fence = UF = upper hinge + (3/2)(H-spread)
lower fence = LF = lower hinge − (3/2)(H-spread)

• Values above the upper fence or below the lower fence can be considered as exceptional values and need to be examined closely for validity. A small computational sketch follows.
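A minimal sketch (Python, not part of the original notes) of these summaries for the age batch, taking the median and fourths found in Section 1.4 as given.

# Mid, tri-mean and fences from the median and fourths of the age batch.
LQ, M, UQ = 33, 41.5, 52              # lower fourth, median, upper fourth (Section 1.4)
mid = (UQ + LQ) / 2                   # 42.5
tri_mean = (mid + M) / 2              # equals (LQ + 2*M + UQ)/4 = 42.0
H_spread = UQ - LQ                    # 19
upper_fence = UQ + 1.5 * H_spread     # 80.5
lower_fence = LQ - 1.5 * H_spread     # 4.5
print(mid, tri_mean, lower_fence, upper_fence)

Only the single age 81 lies outside these fences.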

1.7.1 Classical Summaries

The summary quantities developed in the previous sections are examples of statistics, formally defined as functions of a sample data set. There are other summary measures of a sample data set.

• For location, the traditional summary measure is the sample mean defined by

x̄ = (1/n) Σ_{i=1}^n x_i

where n is the number of observations in the data set and (x_1, x_2, . . . , x_n) is the sample data set.

• For spread or variability the sample variance, s², and the sample standard deviation, s, are defined by

s² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²  and  s = √(s²)

• Note that

x̄ = (1 − 1/n) x̄_(i) + (1/n) x_i

where x̄_(i) is the sample mean of the data set with the ith observation removed.

◦ It follows that a single observation can greatly influence the magnitude of the sample mean, which explains why other summaries such as the median or tri-mean are often used for location.
◦ Similarly the sample variance and sample standard deviation are greatly influenced by single observations.

• For distributions which are “bell-shaped” the interquartile range is approximately equal to 1.34 s, where s is the sample standard deviation (see the sketch below).
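A minimal sketch (Python, not part of the original notes; the ten values and the corrupted value 290 are chosen only for illustration) contrasting the sensitivity of the mean and standard deviation with that of the median and interquartile range.

# One wild observation moves the mean and standard deviation far more than
# the median and the interquartile range.
import statistics

x = [33, 20, 41, 52, 35, 25, 43, 61, 37, 29]    # first ten ages, for illustration
y = x.copy()
y[1] = 290                                      # suppose 20 was mis-recorded as 290

for batch in (x, y):
    q = statistics.quantiles(batch, n=4)        # quartiles
    print(round(statistics.mean(batch), 1),
          round(statistics.stdev(batch), 1),
          statistics.median(batch),
          round(q[2] - q[0], 1))                # interquartile range
# The mean and standard deviation shift dramatically; the median and IQR much less.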

1.8 Transformations for Symmetry

Data can be easier to understand if it is nearly symmetric and hence we sometimes transform a batch to make it approximately symmetric. The reasons for transformations are:

• For symmetric batches we have an unambiguous measure of center (the mean or the median).

• Transformed data may have a scientific meaning.

• Many statistical methods are more reliable for symmetric data.

As examples of transformed data with scientific meaning we have

• For income and population changes the natural logarithm is often useful since both money and populations grow exponentially i.e.

Nt = N0 exp(rt)

where r is the interest rate or growth rate.

• In measuring consumption e.g. miles per gallon or BTU per gallon the reciprocal is a measure of power.

The fundamental use of transformations is to change shape, which can be loosely described as everything about the batch other than location and scale. Desirable features of a transformation are that it preserve order and that it be a simple and smooth function of the data. We first note that a linear transformation does not change shape, it only changes the location and scale of the batch since

t(yi) = a + b yi,  t(yj) = a + b yj  =⇒  t(yi) − t(yj) = b(yi − yj)

shows that a linear transformation does not change the relative distances between observations. Thus a linear transformation does not change the shape of the batch.

To choose a transformation for symmetry we first need to determine whether the data are skewed right or skewed left. A simple way to do this is to examine the “mid-list” defined as

mid letter value = (lower letter value + upper letter value)/2

If the values in the mid-list increase as the letter values increase then the batch is skewed right. Conversely if the values in the mid-list decrease as the letter values increase the batch is skewed left.

A convenient collection of transformations is the power family of transformations defined by

t_k(y) = y^k   if k ≠ 0
t_k(y) = ln(y) if k = 0

For this family of transformations we have the following ladder of re-expression or transformation:

   k     t_k(y)
   2      y²
   1      y
  1/2     √y
   0      ln(y)
 −1/2    −1/√y
  −1     −1/y
  −2     −1/y²

The rule for using this ladder is to start at the transformation where k = 1. If the data are skewed to high values, go down the ladder to find a transformation. If skewed towards low values of y go up the ladder. For the data set on ages the complete set of letter values as produced by STATA is

#            62       y
---------------------------------------------
M      31.5   |         41.5         |   spread
F        16   |    33   42.5     52  |       19
E       8.5   |    29   46.5     64  |       35
D       4.5   |  25.5     48   70.5  |       45
C       2.5   |  22.5     50   77.5  |       55
B       1.5   |    21   50.5     80  |       59
          1   |    20   50.5     81  |       61
              |                      |    # below   # above
inner fence   |   4.5        80.5    |         0         1
outer fence   |   -24         109    |         0         0

Thus the mid-list is

mid     letter value
41.5    median
42.5    fourth
46.5    eighth
48      D
50      C
50.5    B
50.5    Extreme

Since the values increase we need to go down the ladder. Hence we try square roots or natural logarithms first. Note: there are some rather sophisticated symmetry plots now available; e.g. STATA has a command symplot which helps determine the value of k. Often, however, this results in k = .48 or k = .52. Try to choose a k which is simple, e.g. k = 1/2, and hope for a scientific justification. A small sketch applying the ladder to the mid-list follows.
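Here is a minimal sketch (Python, not part of the original notes) that applies members of the ladder to the letter values of the age batch; since power transformations are monotone, the letter values of the transformed batch are, to a good approximation, the transformed letter values, so the mid-list can be recomputed directly.

# Mid-list under ladder re-expressions: roughly constant mids indicate symmetry.
import math

letter_values = [(33, 52), (29, 64), (25.5, 70.5),    # (lower, upper): F, E, D,
                 (22.5, 77.5), (21, 80), (20, 81)]    # C, B and the extremes
for k, t in [(1, lambda y: y), (0.5, math.sqrt), (0, math.log)]:
    mids = [round((t(lo) + t(up)) / 2, 2) for lo, up in letter_values]
    print("k =", k, "mids:", mids)
# k = 1 gives a steadily increasing mid-list (skewed right); k = 1/2 and k = 0 flatten it.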

Here are the stem and leaf plots of the natural logarithm and square root of the age data.

lnage
30* | 09
31* | 4
32* | 26
33* | 0377
34* | 033777
35* | 00368
36* | 111116999
37* | 1146666888
38* | 339
39* | 113355
40* | 18
41* | 1116667
42* | 0
43* | 0379

square root of age
4** | 47
4** | 69
4** | 80
5** | 00,10
5** | 20,29,39,39
5** | 48,57,57
5** | 66,66,66,74,74
5** | 83,92
6** | 00,08,08,08,08,08
6** | 24,32,32,32
6** | 40,40,48,56,56,56,56
6** | 63,63,63,78,78
6** |
7** | 00,07,07,14,14
7** | 21,21
7** | 42
7** | 68
7** | 81,81,81
8** | 00,00,00,06,19
8** |
8** |
8** | 60,72
8** | 89
9** | 00

1.9 Bar Plots and Histograms

Two other useful graphical displays for describing the shape of a batch of data are provided by bar plots and histograms.

1.9.1 Bar Plots

• Barplots are very useful for describing relative proportions and frequencies defined for different groups or intervals.

• The key concept in constructing bar plots is to remember that the plot must be such that the area of the bar is proportional to the quantity being plotted.

• This causes no problems if the intervals are of equal length but presents real problems if the intervals are not of equal length.

• Such incorrect graphs are examples of “lying graphics” and must be avoided.

1.9.2 Histograms

• Histograms are similar to bar plots and are used to graph the proportion of data set values in specified intervals.

• These graphs give insight into the distributional patterns of the data set.

• Unlike stem-leaf plots, histograms sacrifice the individual data values.

• In constructing histograms the same basic principle used in constructing bar plots applies: the area over an interval must be proportional to the number or proportion of data values in the interval. The total area is often scaled to be one.

• Smoothed histograms are available in most software packages. (more later when we discuss distributions).

The following pages show the histogram of the first data set of 62 values with equal intervals and the kdensity graph.

Histogram

Figure 1.5:

Smoothed histogram

Figure 1.6:

1.9.3 Frequency Polygons

• Closely related to histograms are frequency polygons in which the proportion or frequency of an interval is plotted at the mid point of the interval and the resulting points connected.

• Frequency polygons are also useful in visualizing the general shape of the distribution of a data set.

Here is a small data set giving the number of reported suicide attempts in a major US city in 1971:

Age         6-15   16-25   26-35   36-45   46-55   56-65
Frequency      4      28      16       8       4       1

The frequency polygon for this data set is as follows:

Figure 1.7:

1.10 Sample Distribution Functions

• Another useful graphical display is the sample distribution function or empirical distribution function which is a plot of the proportion of values less than or equal to y versus y where y represents the ordered values of the data set.

• These plots can be conveniently made using current software but usually involve too much computation to be done by hand.

• They represent a very valuable technique for comparing observed data sets to theoretical models as we will see later.

Here is the sample distribution function for the first data set on ages (a small computational sketch follows the figure).

Figure 1.8:
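A minimal sketch (Python, not part of the original notes) of the computation behind such a plot, using the first ten ages from Section 1.3 for brevity.

# Sample (empirical) distribution function: proportion of values <= y at each ordered y.
def edf(batch):
    x = sorted(batch)
    n = len(x)
    return [(v, (i + 1) / n) for i, v in enumerate(x)]   # with ties, the last repeat carries the full step

for v, p in edf([33, 20, 41, 52, 35, 25, 43, 61, 37, 29]):
    print(v, p)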

1.11 Smoothing

Time series data of the form yt : t = 0, 1, 2, . . . , n which we abbreviate to {yt} can usefully be separated into two additive parts: {zt} and {rt} where

• {zt} is the smooth or signal and represents that part of the data which is slowly varying and structured.

• {rt} is the rough or noise and represents that part of the data which is rapidly varying and unstructured.

{zt}, the smooth, tells us about long-run patterns while {rt}, the rough, tells us about exceptional points. The operator which converts the data {yt} into the smooth is called a data smoother. The smoothed data may then be written as Sm{yt}. The corresponding rough is then given by

Ro{yt} = {yt} − Sm{yt}

There are many smoothers, defined by their properties. For our purposes two general types are important:

• Linear smoothers defined by the property

Sm{axt + byt} = aSm{xt} + bSm{yt}

• Semi-linear smoothers defined by the property

Sm{ayt + b} = aSm{yt} + b

Examples of linear smoothers include moving averages, e.g.

Sm{yt} = (yt−1 + yt + yt+1)/3

and weighted moving averages such as Hanning, defined by

Sm{yt} = (1/4)yt−1 + (1/2)yt + (1/4)yt+1

(Special adjustments are made at the ends of the series.)

Examples of semi-linear smoothers include running medians of length 3 or 5 when smoothing without a computer, or even lengths if using a statistical package with the right programs, e.g.

Sm{yt} = med{yt−1, yt, yt+1}

is a smoother of running medians of length 3 with the ends replicated (copied). These kinds of smoothers are applied several times until they “settle down”. Then end adjustments are made. The two basic types of smoothers are usually combined to form compound smoothers. The nomenclature for these smoothers is rather bewildering at first but informative: e.g.

3RSSH,twice refers to the smoother which

• takes running medians of length 3 until the series stabilizes (R)

• the S refers to splitting the repeated values, using the endpoint operator on them and then replacing the original smooth with these values

• H applies the Hanning smoother to the series which remains

• twice refers to using the smoother on the rough and then adding the rough back to the smooth to form the final smoothed version

A little trial and error is needed in using these smoothers. Velleman has recommended the smoother 4253H,twice for general use. A minimal sketch of the simplest building blocks appears below.
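The sketch below (Python, not part of the original notes) shows one pass of running medians of length 3 with the end values copied, followed by one pass of Hanning; the repeated (“R”), splitting (“S”) and “twice” steps are left out. The unemployment percentages used are those listed in the example that follows.

# One pass of a length-3 running median (ends copied) and of Hanning.
def median3(y):
    return [y[0]] + [sorted(y[i - 1:i + 2])[1] for i in range(1, len(y) - 1)] + [y[-1]]

def hanning(y):
    return [y[0]] + [0.25 * y[i - 1] + 0.5 * y[i] + 0.25 * y[i + 1]
                     for i in range(1, len(y) - 1)] + [y[-1]]

unempl = [4.9, 6, 4.9, 5, 4.6, 4.1, 3.3, 3.4, 3.2, 3.1, 4.4, 5.4, 5, 4.3, 5, 7.8,
          7, 6.2, 5.2, 5.1, 6.3, 6.7, 8.6, 8.4, 6.5, 6.2, 6, 5.3, 4.7, 4.5, 4.1]

smooth = hanning(median3(unempl))                      # smooth
rough = [round(y - z, 2) for y, z in zip(unempl, smooth)]   # rough = data - smooth
print(smooth)
print(rough)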

1.11.1 Smoothing Example

To illustrate the smoothing techniques we use data on unemployment percent for the years 1960 to 1990.

. infile year unempl using c:\courses\b651201\datasets\unemploy.raw (31 observations read)

. smooth 3 unempl, gen(sm1)

. smooth 3 sm1, gen(sm2)

. smooth 3R unempl, gen(sm3)

. smooth 3RE unempl, gen(sm4)

. smooth 4253H,twice unempl, gen(sm5)

. gen sm5r=round(sm5,.1)

. list year unempl sm1 sm2 sm3 sm4

year   unempl   sm1   sm2   sm3   sm4
1960      4.9   4.9   4.9   4.9   4.9
1961        6   4.9   4.9   4.9   4.9
1962      4.9     5   4.9   4.9   4.9
1963        5   4.9   4.9   4.9   4.9
1964      4.6   4.6   4.6   4.6   4.6
1965      4.1   4.1   4.1   4.1   4.1
1966      3.3   3.4   3.4   3.4   3.4
1967      3.4   3.3   3.3   3.3   3.3
1968      3.2   3.2   3.2   3.2   3.2
1969      3.1   3.2   3.2   3.2   3.2
1970      4.4   4.4   4.4   4.4   4.4
1971      5.4     5     5     5     5
1972        5     5     5     5     5
1973      4.3     5     5     5     5
1974        5     5     5     5     5
1975      7.8     7     7     7     7
1976        7     7     7     7     7
1977      6.2   6.2   6.2   6.2   6.2
1978      5.2   5.2   5.2   5.2   5.2
1979      5.1   5.2   5.2   5.2   5.2
1980      6.3   6.3   6.3   6.3   6.3
1981      6.7   6.7   6.7   6.7   6.7
1982      8.6   8.4   8.4   8.4   8.4
1983      8.4   8.4   8.4   8.4   8.4
1984      6.5   6.5   6.5   6.5   6.5
1985      6.2   6.2   6.2   6.2   6.2
1986        6     6     6     6     6
1987      5.3   5.3   5.3   5.3   5.3
1988      4.7   4.7   4.7   4.7   4.7
1989      4.5   4.5   4.5   4.5   4.5
1990      4.1   4.1   4.1   4.1   4.1

. list year unempl sm5r

year   unempl   sm5r
1960      4.9    4.9
1961        6      5
1962      4.9      5
1963        5    4.9
1964      4.6    4.6
1965      4.1      4
1966      3.3    3.6
1967      3.4    3.4
1968      3.2    3.4
1969      3.1    3.6
1970      4.4    4.1
1971      5.4    4.6
1972        5    4.8
1973      4.3    5.1
1974        5    5.5
1975      7.8      6
1976        7    6.2
1977      6.2    6.1
1978      5.2    5.9
1979      5.1    5.8
1980      6.3    6.2
1981      6.7      7
1982      8.6    7.4
1983      8.4    7.3
1984      6.5      7
1985      6.2    6.4
1986        6    5.8
1987      5.3    5.3
1988      4.7    4.8
1989      4.5    4.4
1990      4.1    4.1

. graph unempl sm4 year,s(oi) c(ll) ti(Unemployment and 3RE Smooth) xlab

. graph unempl sm5r year,s(oi) c(ll) ti(Unemployment and 4253H,twice Smooth) x > lab

. log close

The graphs on the following two pages show the smoothed versions and the original data.

Graph of Unemployment Data and 3RE smooth

Figure 1.9:

Graph of Unemployment Data and 4253H,twice Smooth.

Figure 1.10:

1.12 Shapes of Batches

Figure 1.11:

1.13 References

1. Bound, J. A. and A. S. C. Ehrenberg (1989). Significant Sameness. J. R. Statis. Soc. A 152(Part 2): pp. 241-247.

2. Chakrapani, C. Numeracy. Encyclopedia of Statistics.

3. Chambers, J. M., W. S. Cleveland, et al. (1983). Graphical Methods for Data Analysis, Wadsworth International Group.

4. Chatfield, C. (1985). The Initial Examination of Data. J.R.Statist. Soc. A 148(3): 214-253.

5. Cleveland, W. S. and R. McGill (1984). The Many Faces of a Scatterplot. JASA 79(388): 807-822.

6. Doksum, K. A. (1977). Some Graphical Methods in Statistics. Statistica Neerlandica Vol. 31(No. 2): pp. 53-68.

7. Draper, D., J. S. Hodges, et al. (1993). Exchangeability and Data Analysis. J. R. Statist. Soc. A 156(Part 1): pp. 9-37.

8. Ehrenberg, A. S. C. (1977). Graphs or Tables ? The Statistician Vol. 27(No.2): pp. 87-96.

9. Ehrenberg, A. S. C. (1986). Reading a Table: An Example. Applied Statistics 35(3): 237-244.

10. Ehrenberg, A. S. C. (1977). Rudiments of Numeracy. J. R. Statis. Soc. A 140(3): 277-297.

11. Ehrenberg, A. S. C. Reduction of Data. Johnson and Kotz.

12. Ehrenberg, A. S. C. (1981). The Problem of Numeracy. American Statistician 35(3): 67-71.

13. Finlayson, H. C. The Place of ln x Among the Powers of x. American Mathematical Monthly: 450.

14. Gan, F. F., K. J. Koehler, et al. (1991). Probability Plots and Distribution Curves for Assessing the Fit of Probability Models. American Statistician 45(1): 14-21.

15. Goldberg, K. and B. Iglewicz (1992). Bivariate Extensions of the Boxplot. Technometrics 34(3): 307-320.

16. Hand, D. J. (1996). Statistics and the Theory of Measurement. J. R. Statist. Soc. A 159(Part 3): pp. 445-492.

17. Hand, D. J. (1998). Data Mining: Statistics and More? American Statistician 52(2): 112-118.

18. Hoaglin, D. C., F. Mosteller, et al. (1991). Fundamentals of Exploratory Analysis of Variance, John Wiley & Sons, Inc.

19. Hoaglin, D. C., F. Mosteller, et al., Eds. (1983). Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, Inc.

20. Hunter, J. S. (1988). The Digidot Plot. American Statistician 42(1): 54.

21. Hunter, J. S. (1980). The National System of Scientific Measurement. Science 210: 869-874.

22. Kafadar, K. Notched Box-and-Whisker Plots. Encyclopedia of Statistics. Johnson and Kotz.

23. Kruskal, W. (1978). Taking Data Seriously. Toward a Metric of Science, John Wiley & Sons: 139-169.

24. Mallows, C. L. and D. Pregibon (1988). Some Principles of Data Analysis, Statistical Research Reports No. 54 AT&T Bell Labs.

25. McGill, R., J. W. Tukey, et al. (1978). Variations of Box Plots. American Statistician 32(1): 12-16.

26. Mosteller, F. (1977). Assessing Unknown Numbers: Order of Magnitude Estimation. Statistical Methods for Policy Analysis. W. B. Fairley and F. Mosteller, Addison- Wesley.

27. Paulos, J. A. (1988). Innumeracy: Mathematical Illiteracy and Its Consequences, Hill and Wang.

28. Paulos, J. A. (1991). Beyond Numeracy: Ruminations of a Numbers Man, Alfred A. Knopf.

29. Preece, D. A. (1987). The language of size, quantity and comparison. The Statistician 36: 45-54.

30. Rosenbaum, P. R. (1989). Exploratory Plots for Paired Data. American Statistician 43(2): 108-109.

31. Scott, D. W. (1979). On optimal and data-based histograms. Biometrika 66(3): pp. 605-610.

32. Scott, D. W. (1985). Frequency Polygons: Theory and Applications. JASA 80(390): 348-354.

33. Sievers, G. L. Probability Plotting. Encyclopedia of Statistics. Johnson and Kotz: 232-237.

34. Snee, R. D. and C. G. Pfeifer.. Graphical Representation of Data. Encyclopedia of Statistics. Johnson and Kotz: 488-511.

35. Stevens, S. S. (1968). Measurement, Statistics and the Schemapiric View. Science 161(3844): 849-856.

36. Stirling, W. D. (1982). Enhancements to Aid Interpretation of Probability Plots. The Statistician 31(3): 211.

37. Sturges, H. A. (1926). The Choice of Class Interval. JASA 21: 65-66.

38. Terrell, G. R. and D. W. Scott (1985). Oversmoothed Nonparametric Density Estimates. JASA 80(389): 209-213.

39. Tukey, J. W. (1980). We Need Both Exploratory and Confirmatory. American Statis- tician 34(1): 23-25.

40. Tukey, J. W. (1986). Sunset Salvo. American Statistician 40(1): 72-76.

41. Tukey, J. W. (1977). Exploratory Data Analysis, Addison Wesley.

42. Tukey, J. W. and C. L. Mallows. An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects: 111-172.

43. Velleman, P. F. Applied Nonlinear Smoothing. Sociological Methodology 1982. San Francisco: Jossey-Bass.

44. Velleman, P. F. and L. Wilkinson (1993). Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading. American Statistician 47(1): 65-72.

45. Wainer, H. (1997). Improving Tabular Displays, With NAEP Tables as Examples and Inspirations. Journal of Educational and Behavioral Statistics 22(1): 1-30.

46. Wand, M. P. (1997). Data-Based Choice of Histogram Bin Width. American Statistician Vol. 51(No. 1): pp. 59-64.

47. Wilk, M. B. and R. Gnanadesikan (1968). Probability plotting methods for the analysis of data. Biometrika 55(1): 1-17.

Chapter 2

Probability

2.1 Mathematical Preliminaries

2.1.1 Sets

To study statistics effectively we need to learn some probability. There are certain elementary mathematical concepts which we use to increase the precision of our discussions. The use of set notation provides a convenient and useful way to be precise about populations and samples. Definition: A set is a collection of objects called points or elements. Examples of sets include:

• set of all individuals in this class

• set of all individuals in Baltimore

• set of integers including 0 i.e. {0, 1,...}

• set of all non-negative numbers i.e. [0, +∞)

• set of all real numbers i.e. (−∞, +∞)

To describe the contents of a set we will follow one of two conventions:

• Convention 1: Write down all of the elements in the set and enclose them in curly brackets. Thus the set consisting of the four numbers 1, 2, 3 and 4 is written as

{1, 2, 3, 4}

• Convention 2: Write down a rule which determines or defines which elements are in the set and enclose the result in curly brackets. Thus the set consisting of the four numbers 1, 2, 3 and 4 is written as

{x : x = 1, 2, 3, 4}

and is read as “the set of all x such that x = 1, 2, 3, or 4”. The general convention is thus {x : C(x)} and is read as “the set of all x such that the condition C(x) is satisfied”.

Obviously convention 2 is more useful for complicated and large sets.

Notation and Definitions:

• x ∈ A means that the point x is a point in the set A

• x ∉ A means that the point x is not a point in the set A. Thus 1 ∈ {1, 2, 3, 4} but 5 ∉ {1, 2, 3, 4}

• A ⊂ B means that a ∈ A implies that a ∈ B. Such an A is said to be a subset of B. Thus {1, 2} ⊂ {1, 2, 3, 4}

• A = B means that every point in A is also in B and conversely. More precisely A = B means that A ⊂ B and B ⊂ A.

• The union of two sets A and B is denoted by A ∪ B and is the set of all points x which are in at least one of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then A ∪ B = {1, 2, 3, 4}

• The intersection of two sets A and B is denoted by A∩B and is the set of all points x which are in both of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then A ∩ B = {2}.

• If there are no points x which are in both A and B we say that A and B are disjoint or mutually exclusive and we write

A ∩ B = ∅

where ∅ is called the empty set (the set containing no points).

• Each set under discussion is usually considered to be a subset of a larger set Ω called the universal set.

• The complement of a set A, Ac is the set of all points not in A i.e.

Ac = {x : x ∉ A}

Thus if Ω = {1, 2, 3, 4, 5} and A = {1, 2, 4} then Ac = {3, 5}.

• If B ⊂ A then A − B = A ∩ Bc = {x : x ∈ A ∩ Bc}

• If a and b are elements or points we call (a, b) an ordered pair. a is called the first coordinate and b is called the second coordinate. Two ordered pairs are defined to be equal if and only if both their first and second coordinates are equal. Thus

(a, b) = (c, d) if and only if a = c and b = d

Thus if we record for an individual their blood pressure and their age the result may be written as (age, blood pressure).

• The Cartesian product of two sets A and B is written as A × B and is the set of all ordered pairs having as first coordinate an element of A and second coordinate an element of B. More precisely

A × B = {(a, b): a ∈ A; b ∈ B}

Thus if A = {1, 2, 3} and B = {3, 4} then

A × B = {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)}

• Extension of Cartesian products to three or more sets is useful. Thus

A1 × A2 × A3 = {(a1, a2, a3): a1 ∈ A1, a2 ∈ A2, a3 ∈ A3}

defines a set of triples. Two triples are equal if and only if they are equal coordinatewise. Most computer based storage systems (data base programs) implicitly use Cartesian products to label and store data values.

• An n tuple is an ordered collection of n elements of the form a1, a2, . . . , an.

example: Consider the set (population) of all individuals in the United States. If

• A is all those who carry the AIDS virus

• B is all homosexuals

• C is all IV drug users

Then

• The set of all individuals who carry the AIDS virus and satisfy only one of the other two conditions is (A ∩ B ∩ Cc) ∪ (A ∩ Bc ∩ C)

• The set of all individuals satisfying at least two of the conditions is

(A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C)

• The set of individuals satisfying exactly two of the conditions is

(A ∩ B ∩ Cc) ∪ (A ∩ Bc ∩ C) ∪ (Ac ∩ B ∩ C)

• The set of all individuals satisfying all three conditions is

A ∩ B ∩ C

• The set of all individuals satisfying at least one of the conditions is

A ∪ B ∪ C

2.1.2 Counting

Many probability problems involve “counting the number of ways” something can occur.

Basic Principle of Counting: Given two sets A and B with n1 and n2 elements respectively of the form

A = {a1, a2, . . . , an1 }

B = {b1, b2, . . . , bn2 }

then the set A × B consisting of all ordered pairs of the form (ai, bj) contains n1n2 elements.

• To see this consider the table

           b1          b2         ···    bn2
a1     (a1, b1)    (a1, b2)    ···   (a1, bn2)
a2     (a2, b1)    (a2, b2)    ···   (a2, bn2)
...       ...         ...      ···      ...
an1    (an1, b1)   (an1, b2)   ···   (an1, bn2)

The conclusion is thus obvious.

• Equivalently: If there are n1 ways to perform operation 1 and n2 ways to perform operation 2 then there are n1n2 ways to perform first operation 1 and then operation 2.

• In general if there are r operations in which the ith operation can be performed in ni ways then there are n1 n2 ··· nr ways to perform the r operations in sequence.

• Permutations: If a set S contains n elements, there are

n! = n × (n − 1) × · · · × 3 × 2 × 1

different n tuples which can be formed from the n elements of S.

– By convention 0! = 1.
– If r ≤ n there are

(n)_r = (n − r + 1)(n − r + 2) ··· (n − 1) n

r tuples composed of elements of S.

• Combinations: If a set S contains n elements and r ≤ n, there are

C^n_r = n!/(r!(n − r)!)

subsets of size r containing elements of S. To see this we note that if we have a subset of size r from S there are r! permutations of its elements, each of which is an r tuple of elements from S. Therefore we have the equation

r! C^n_r = (n)_r

and the conclusion follows (a short numerical check appears after the examples below).

examples:

subsets of size r containing elements of S. To see this we note that if we have a subset of size r from S there are r! permutations of its elements, each of which is an r tuple of elements from S. Therefore we have the equation n r! Cr = (n)r and the conclusion follows. examples:

(1) For an ordinary deck of 52 cards there are 52 × 51 × 50 ways to deal a “hand” of three cards when the order of the cards matters.

(2) If we toss two dice (each six-sided with sides numbered 1-6) there are 36 possible outcomes.

(3) The use of the convention that 0! = 1 can be considered a special case of the Gamma function, defined for any positive α by

Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx

We note by integration by parts that

Γ(α) = [−x^(α−1) e^(−x)]_0^∞ + (α − 1) ∫_0^∞ x^(α−2) e^(−x) dx = (α − 1) Γ(α − 1)

It follows that if α = n where n is an integer then

Γ(n) = (n − 1)!

and hence with n = 1

0! = Γ(1) = ∫_0^∞ e^(−x) dx = 1
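A short numerical check (Python, not part of the original notes) of the counting rules and of examples (1) and (2).

# Factorials, r-permutations (n)_r and combinations C(n, r).
import math

def falling(n, r):
    """(n)_r = n(n-1)...(n-r+1): number of ordered r-tuples from n distinct elements."""
    return math.factorial(n) // math.factorial(n - r)

print(falling(52, 3))                                            # 132600 = 52*51*50 ordered deals
print(math.comb(52, 3))                                          # 22100 unordered 3-card subsets
print(falling(52, 3) == math.factorial(3) * math.comb(52, 3))    # r! C(n, r) = (n)_r
print(6 * 6)                                                     # 36 outcomes for two dice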

2.2 Relating Probability to Responses and Populations

Probability is a measure of the uncertainty associated with the occurrence of events.

• In applications to statistics probability is used to model the uncertainty associated with the response of a study.

• Using probability models and observed responses (data) we make statements (statistical inferences) about the study:

◦ The probability model allows us to relate the uncertainty associated with sample results to statements about population characteristics. ◦ Without such models we can say little about the population and virtually nothing about the reliability or generalizability of our results.

• The term experiment or statistical experiment or random experiment denotes the performance of an observational study, a census or sample survey or a designed experiment.

◦ The collection, Ω, of all possible results of an experiment will be called the sample space.
◦ A particular result of an experiment will be called an elementary event and denoted by ω.
◦ An event is a collection of elementary events.
◦ Events are thus sets of elementary events.

• Notation and interpretations:

◦ ω ∈ E means that E occurs when ω occurs
◦ ω ∉ E means that E does not occur when ω occurs
◦ E ⊂ F means that the occurrence of E implies the occurrence of F
◦ E ∩ F means the event that both E and F occur
◦ E ∪ F means the event that at least one of E or F occur
◦ ∅ denotes the impossible event
◦ E ∩ F = ∅ means that E and F are mutually exclusive
◦ Ec is the event that E does not occur
◦ Ω is the sample space

2.3 Probability and Odds - Basic Definitions

2.3.1 Probability

Definition: Probability is an assignment to each event of a number called its probability such that the following three conditions are satisfied:

(1) P (Ω) = 1 i.e. the probability assigned to the certain event or sample space is 1

(2) 0 ≤ P (E) ≤ 1 for any event E i.e. the probability assigned to any event must be between 0 and 1

(3) If E1 and E2 are mutually exclusive then

P (E1 ∪ E2) = P (E1) + P (E2)

i.e. the probability assigned to the union of mutually exclusive events equals the sum of the probabilities assigned to the individual events.

P (E) is called the probability of the event E

Note: In considering probabilities for continuous responses we need a stronger form of (3):

P(∪i Ei) = Σi P(Ei)

for any countable collection of events which are mutually exclusive.

2.3.2 Properties of Probability

Important properties of probabilities are:

• P (Ec) = 1 − P (E)

• P (∅) = 0

• E1 ⊂ E2 implies P (E1) ≤ P (E2)

• P (E1 ∪ E2) = P (E1) + P (E2) − P (E1 ∩ E2)

Rather than develop the theory of probability we will:

• Develop the most important probability models used in statistics.

• Learn to use these models to make calculations according to the definitions and prop- erties listed above

• Learn how to interpret probabilities. examples:

• Suppose that P (A) = .4,P (B) = .3 and P (A ∩ B) = .2 then

P (A ∪ B) = .4 + .3 − .2 = .5

• For any three events A, B and C we have

P (A∪B ∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (A∩C)−P (B ∩C)+P (A∩B ∩C)

and hence

P (A ∪ B ∪ C) ≤ P (A) + P (B) + P (C)

2.3.3 Methods for Obtaining Probability Models

The four most important sample spaces for statistical applications are

◦ {0, 1, 2, . . . , n} (discrete-finite)

◦ {0, 1, 2,...} (discrete-countable)

◦ [0, ∞) (continuous)

◦ {(−∞, ∞)} (continuous)

For these sample spaces probabilities are defined by probability mass functions (discrete case) and probability density functions (continuous case). We shall call both of these probability density functions (pdfs).

◦ For the discrete cases a pdf assigns a number f(x) to each x in the sample space such that

f(x) ≥ 0  and  Σ_x f(x) = 1

Then P (E) is defined by

P (E) = Σ_{x ∈ E} f(x)

◦ For the continuous cases a pdf assigns a number f(x) to each x in the sample space such that

f(x) ≥ 0  and  ∫ f(x) dx = 1

Then P (E) is defined by

P (E) = ∫_E f(x) dx

Since sums and integrals over disjoint sets are additive, probabilities can be assigned using pdfs (i.e. the probabilities so assigned obey the three axioms of probability).

examples:

◦ If

f(x) = C^n_x p^x (1 − p)^(n−x),  x = 0, 1, 2, . . . , n

where 0 ≤ p ≤ 1 we have a binomial probability model with parameter p. The fact that

Σ_x f(x) = Σ_{x=0}^{n} C^n_x p^x (1 − p)^(n−x) = 1

follows from the fact (Newton's binomial expansion) that

(a + b)^n = Σ_{x=0}^{n} C^n_x a^x b^(n−x)

for any a and b.

◦ If

f(x) = λ^x e^(−λ) / x!,  x = 0, 1, 2, . . .

where λ ≥ 0 we have a Poisson probability model with parameter λ. The fact that

Σ_x f(x) = Σ_{x=0}^{∞} λ^x e^(−λ) / x! = 1

follows from the fact that

Σ_{x=0}^{∞} λ^x / x! = e^λ

◦ If

f(x) = λ e^(−λx),  0 ≤ x < ∞

where λ ≥ 0 we have an exponential probability model with parameter λ. The fact

that

∫_x f(x) dx = ∫_0^∞ λ e^(−λx) dx = 1

follows from the fact that

∫_0^∞ e^(−λx) dx = 1/λ

◦ If

f(x) = (2πσ²)^(−1/2) exp{−(x − µ)²/2σ²},  −∞ < x < +∞

where −∞ < µ < +∞ and σ > 0 we have a normal or Gaussian probability model with parameters µ and σ². The fact that

∫_x f(x) dx = ∫_{−∞}^{+∞} (2πσ²)^(−1/2) exp{−(x − µ)²/2σ²} dx = 1

is shown in the supplemental notes.

Each of the above examples of probability models plays a major role in the statistical analysis of data from experimental studies. The binomial is used to model prospective (cohort) and retrospective (case-control) studies in epidemiology, the Poisson is used to model accident data, the exponential is used to model failure time data and the normal distribution is used for measurement data which has a bell-shaped distribution as well as to approximate the binomial and Poisson. The normal distribution also figures in the calculation of many common statistics used for inference via the Central Limit Theorem. All of these models are special cases of the exponential family of distributions defined as having pdfs of the form:

f(x; θ1, θ2, . . . , θp) = C(θ1, θ2, . . . , θp) h(x) exp{ Σ_{j=1}^{p} t_j(x) q_j(θ) }

A small numerical check that the first three models assign total probability one follows.
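The sketch below (Python, not part of the original notes; n = 10, p = 0.3 and λ = 2.5 are arbitrary illustrative choices) checks that the binomial and Poisson probabilities sum to one and that the exponential density integrates to one.

# Total probability one for the binomial, Poisson and exponential examples.
import math

n, p = 10, 0.3
binomial = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

lam = 2.5
poisson = sum(lam**x * math.exp(-lam) / math.factorial(x) for x in range(200))

dx = 0.001                                    # crude midpoint rule on [0, 40]
exponential = sum(lam * math.exp(-lam * (i + 0.5) * dx) * dx for i in range(int(40 / dx)))

print(round(binomial, 6), round(poisson, 6), round(exponential, 6))   # each is 1 to rounding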

2.3.4 Odds

Closely related to probabilities are odds.

• If the odds of an event E occurring are given as a to b this means, by definition, that

P (E)/P (Ec) = P (E)/(1 − P (E)) = a/b

We can solve for P (E) to obtain

P (E) = a/(a + b)

◦ Thus we can go from odds to probabilities and vice-versa.
◦ Thinking about probabilities in terms of odds sometimes provides useful interpretation of probability statements.

• Odds can also be given as the odds against E are c to d. This means that

P (Ec)/P (E) = (1 − P (E))/P (E) = c/d

so that in this case

P (E) = d/(c + d)

• example: The odds against disease 1 are 9 to 1. Thus

P (disease 1) = 1/(1 + 9) = .1

• example: The odds of thundershowers this afternoon are 2 to 3. Thus

P (thundershowers) = 2/(2 + 3) = .4

• Ratios of odds are called odds ratios and play an important role in modern epidemiology where they are used to quantify the risk associated with exposure.

◦ example: Let OR be the odds ratio for the occurrence of a disease in an exposed population relative to an unexposed or control population. Thus

OR = (odds of disease in exposed population)/(odds of disease in control population) = [p2/(1 − p2)] / [p1/(1 − p1)]

where p2 is the probability of the disease in the exposed population and p1 is the probability of the disease in the control population.
◦ Note that if OR = 1 then

p2/(1 − p2) = p1/(1 − p1)

which implies that p2 = p1 i.e. that the probability of disease is the same in the exposed and control population.
◦ If OR > 1 then

p2/(1 − p2) > p1/(1 − p1)

which can be shown to imply that p2 > p1 i.e. that the probability of disease in the exposed population exceeds the probability of the disease in the control population.
◦ If OR < 1 the reverse conclusion holds i.e. the probability of disease in the control population exceeds the probability of disease in the exposed population.

• The odds ratio, while useful in comparing the relative magnitude of risk of disease, does not convey the absolute magnitude of the risk (unless the risk is small).

◦ Note that

p2/(1 − p2) = OR · p1/(1 − p1)

implies that

p2 = OR · p1 / [1 + (OR − 1) p1]

◦ Consider a situation in which the odds ratio is 100 for exposed vs control. Thus if OR = 100 and p1 = 10^(−6) (one in a million) then p2 is approximately 10^(−4) (one in ten thousand). If p1 = 10^(−2) (one in a hundred) then

p2 = 100 (1/100) / [1 + 99 (1/100)] = 100/199 = .50

A small computational sketch of these odds calculations follows.
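A minimal sketch (Python, not part of the original notes) of the odds-to-probability conversions and the odds-ratio calculation above.

# Odds "a to b" for an event, odds "c to d" against it, and p2 from an odds ratio.
def prob_from_odds_for(a, b):
    return a / (a + b)

def prob_from_odds_against(c, d):
    return d / (c + d)

def p2_from_odds_ratio(OR, p1):
    return OR * p1 / (1 + (OR - 1) * p1)

print(prob_from_odds_against(9, 1))      # disease 1: 0.1
print(prob_from_odds_for(2, 3))          # thundershowers: 0.4
print(p2_from_odds_ratio(100, 1e-6))     # about 1e-4
print(p2_from_odds_ratio(100, 1e-2))     # about 0.50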

2.4 Interpretations of Probability

Philosophers have discussed for several centuries at various levels what constitutes “probability”. For our purposes probability has three useful operational interpretations.

2.4.1 Equally Likely Interpretation

Consider an experiment where the sample space consists of a finite number of elementary events e1, e2, . . . , eN. If, before the experiment is performed, we consider each of the elementary events to be “equally likely” or exchangeable then an assignment of probability is given by

p({ei}) = 1/N

This allows an interpretation of statements such as “we selected an individual at random from a population” since in ordinary language at random means that each individual has the same chance of being selected. Although defining probability via this recipe is circular it is a useful interpretation in any situation where the sample space is finite and the elementary events are deemed equally likely. It forms the basis of much of sample survey theory where we select individuals at random from a population in order to investigate properties of the population. Summary: The equally likely interpretation assumes that each element in the sample space has the same chance of occurring.

2.4.2 Relative Frequency Interpretation

Another interpretation of probability is the so called relative frequency interpretation.

• Imagine a long series of trials in which the event of interest either occurs or does not occur.

• The relative frequency (number of trials in which the event occurs divided by the total number of trials) of the event in this long series of trials is taken to be the probability of the event.

• This interpretation of probability is the most widely used interpretation in scientific studies. Note, however, that it is also circular.

• It is often called the “long run frequency interpretation”.

2.4.3 Subjective Probability Interpretation

This interpretation of probability requires the personal evaluation of probabilities using indifference between two wagers (bets). Suppose that you are interested in determining the probability of an event E. Consider two wagers defined as follows:

Wager 1 : You receive $100 if the event E occurs and nothing if it does not occur.

Wager 2 : There is a jar containing x white balls and N − x red balls. You receive $100 if a white ball is drawn and nothing otherwise.

You are required to make one of the two wagers. Your probability of E is taken to be the ratio x/N at which you are indifferent between the two wagers.

2.4.4 Does it Matter?

• For most applications of probability in modern statistics the specific interpretation of probability does not matter all that much.

• What matters is that probabilities have the properties given in the definition and those properties derived from them.

• In this course we will take probability as a primitive concept leaving it to philosophers to argue the merits of particular interpretations.

• Each of the interpretations discussed above satisfies the three basic axioms of the definition of probability.

2.5 Conditional Probability

• Conditional probabilities possess all the properties of probabilities.

• Conditional probabilities provide a method to revise probabilities in the light of addi- tional information (the process itself is called conditioning).

• Conditional probabilities are important because almost all probabilities are conditional probabilities.

example: Suppose a coin is flipped twice and you are told that at least one coin is a head. What is the chance or probability that they are both heads? Assuming a fair coin and a good toss, each of the four possibilities

{(H,H), (H,T ), (T,H), (T,T )}

which constitutes the sample space for this experiment has the same probability i.e. 1/4. Since the information given rules out (T,T ); a logical answer for the conditional probability of two heads given at least one head is 1/3. example: A family has three children. What is the probability that two of the children are boys? Assuming that gender distributions are equally likely the eight equally likely possibilities are:

{(B,B,B), (B,B,G), (B, G, B), (G, B, B), (G, G, B), (G, B, G), (B, G, G), (G, G, G)}

Thus the probability of two boys is

1/8 + 1/8 + 1/8 = 3/8

Depending on the conditioning information the probability of two boys is modified e.g.

• What is the probability of two boys if you are told that at least one child in the family is a boy? Answer: 3/7
• What is the probability of two boys if you are told that at least one child in the family is a girl? Answer: 3/7
• What is the probability of two boys if you are told that the oldest child is a boy? Answer: 1/2
• What is the probability of two boys if you are told that the oldest child is a girl? Answer: 1/4

We generalize to other situations using the following definition: Definition: The conditional probability of event B given event A is

P(B|A) = P(B ∩ A)/P(A) provided that P(A) > 0

example: The probability of two boys given that the oldest child is a boy is the probability of the event "two boys in the family and the oldest child in the family is a boy" divided by the probability of the event "the oldest child in the family is a boy". Thus the required conditional probability is given by

P({(B,G,B), (G,B,B)}) / P({(B,B,B), (B,G,B), (G,B,B), (G,G,B)}) = (2/8)/(4/8) = 1/2

2.5.1 Multiplication Rule

The multiplication rule for probabilities is as follows:

P (A ∩ B) = P (A)P (B|A) which can immediately be extended to

P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B) and in general to:

P (E1 ∩ E2 ∩ · · · ∩ En) = P (E1)P (E2|E1) ··· P (En|E1 ∩ E2 ∩ · · · ∩ En−1) example: There are n people in a room. What is the probability that at least two of the people have a common birthday? Solution: We first note that

P (common birthday) = 1 − P (no common birthday)

If there are just two people in the room then
P(no common birthday) = (365/365)(364/365)
while for three people we have
P(no common birthday) = (365/365)(364/365)(363/365)

It follows that the probability of no common birthday with n people in the room is given by
(365/365)(364/365) ··· ((365 − (n − 1))/365)
Simple calculations show that if n = 23 then the probability of no common birthday is slightly less than 1/2. Thus if the number of people in a room is 23 or larger the probability of a common birthday exceeds 1/2. The following is a short table of the results for other values of n

 n   Prob      n   Prob
 2   .003     17   .315
 3   .008     18   .347
 4   .016     19   .379
 5   .027     20   .411
 6   .041     21   .444
 7   .056     22   .476
 8   .074     23   .507
 9   .095     24   .538
10   .117     25   .569
11   .141     26   .598
12   .167     27   .627
13   .194     28   .654
14   .223     29   .681
15   .253     30   .706
16   .284     31   .730
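The table entries can be reproduced directly from the product formula above. A minimal Python sketch (ours, not part of the notes):

    def p_common_birthday(n):
        # probability that at least two of n people share a birthday (365 equally likely days)
        p_no_match = 1.0
        for i in range(n):
            p_no_match *= (365 - i) / 365
        return 1 - p_no_match

    for n in (5, 10, 23, 30):
        print(n, round(p_common_birthday(n), 3))   # e.g. 23 -> 0.507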

2.5.2 Law of Total Probability

Law of Total Probability: For any event E we have
P(E) = Σ_i P(E|Ei)P(Ei)

where the Ei form a partition of the sample space, i.e. the Ei are mutually exclusive and their union is the sample space.

example: An examination consists of multiple choice questions, each with 5 alternative answers, only one of which is correct. If a student has diligently done his or her homework he or she is certain to select the correct answer. If not, he or she has only a one in five chance of selecting the correct answer (i.e. they choose an answer at random). Let

• p be the probability that the student does their homework

• A the event that they do their homework

• B the event that they select the correct answer

(i) What is the probability that the student selects the correct answer to a question?
Solution: We are given
P(A) = p ; P(B|A) = 1 and P(B|Ac) = 1/5
By the Law of Total Probability

P(B) = P(A)P(B|A) + P(Ac)P(B|Ac)
     = p × 1 + (1 − p) × (1/5)
     = (5p + 1 − p)/5
     = (4p + 1)/5

(ii) What is the probability that the student did his or her homework given that they selected the correct answer to the question? Solution: In this case we want P(A|B) so that

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B|A)/P(B) = (1 × p)/[(4p + 1)/5] = 5p/(4p + 1)

example: Cross-Sectional Study Suppose a population of individuals is classified into four categories defined by

• their disease status (D is diseased and Dc is not diseased)

• their exposure status (E is exposed and Ec is not exposed).

If we observe a sample of n individuals so classified we have the following population probabilities and observed data.

Population Probabilities
          Dc           D            Total
Ec        P(Ec,Dc)     P(Ec,D)      P(Ec)
E         P(E,Dc)      P(E,D)       P(E)
Total     P(Dc)        P(D)         1

Sample Numbers
          Dc           D            Total
Ec        n(Ec,Dc)     n(Ec,D)      n(Ec)
E         n(E,Dc)      n(E,D)       n(E)
Total     n(Dc)        n(D)         n

The law of total probability then states that

P(D) = P(E,D) + P(Ec,D) = P(D|E)P(E) + P(D|Ec)P(Ec)

Define the following quantities:

Population Parameters                         Sample Estimates

prob of exposure                              prob of exposure
P(E) = P(E,D) + P(E,Dc)                       p(E) = [n(E,D) + n(E,Dc)]/n

prob of disease given exposed                 prob of disease given exposed
P(D|E) = P(E,D)/P(E)                          p(D|E) = n(D,E)/n(E)

odds of disease if exposed                    odds of disease if exposed
O(D|E) = P(D,E)/P(Dc,E)                       o(D|E) = n(D,E)/n(Dc,E)

odds of disease if not exposed                odds of disease if not exposed
O(D|Ec) = P(D,Ec)/P(Dc,Ec)                    o(D|Ec) = n(D,Ec)/n(Dc,Ec)

odds ratio (relative odds)                    odds ratio (relative odds)
OR = O(D|E)/O(D|Ec)                           or = o(D|E)/o(D|Ec)

relative risk                                 relative risk
RR = P(D|E)/P(D|Ec)                           rr = p(D|E)/p(D|Ec)

It can be shown that if the disease is rare in both the exposed group and the non exposed group then OR ≈ RR
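The rare-disease approximation can be seen numerically by computing both quantities from a 2 × 2 table of counts. The sketch below is our own; the function name and the illustrative counts are hypothetical, not from the notes.

    def odds_ratio_and_relative_risk(d_e, nd_e, d_ne, nd_ne):
        # d_e:  diseased & exposed      nd_e:  not diseased & exposed
        # d_ne: diseased & not exposed  nd_ne: not diseased & not exposed
        odds_ratio = (d_e / nd_e) / (d_ne / nd_ne)
        relative_risk = (d_e / (d_e + nd_e)) / (d_ne / (d_ne + nd_ne))
        return odds_ratio, relative_risk

    # A rare disease: the two measures nearly coincide.
    print(odds_ratio_and_relative_risk(20, 9980, 10, 9990))   # approximately (2.00, 2.00)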

The above population parameters are fundamental to the epidemiological approach to the study of disease as it relates to exposure.

example: In demography the crude death rate is defined as
CDR = Total Deaths/Population Size = D/N
If the population is divided into k age groups or other strata defined by gender, ethnicity, etc. then
D = D1 + D2 + ··· + Dk and N = N1 + N2 + ··· + Nk
and hence
CDR = D/N = (Σ_{i=1}^k Di)/N = (Σ_{i=1}^k Ni Mi)/N = Σ_{i=1}^k pi Mi

where Mi = Di/Ni is the age-specific death rate for the ith age group and pi = Ni/N is the proportion of the population in the ith age group. This is directly analogous to the law of total probability.

2.6 Bayes Theorem

Bayes theorem combines the definition of conditional probability, the multiplication rule and the law of total probability and asserts that

P(Ei|E) = P(Ei)P(E|Ei) / Σ_j P(Ej)P(E|Ej)

• where E is any event

• the Ej constitute a partition of the sample space

• Ei is any event in the partition.

Since
P(Ei|E) = P(Ei ∩ E)/P(E)

P(Ei ∩ E) = P(Ei)P(E|Ei) and P(E) = Σ_j P(Ej)P(E|Ej)

Bayes theorem is obviously true. Note: A partition of the sample space is a collection of mutually exclusive events such that their union is the sample space.

example: The probability of disease given exposure is .5 while the probability of disease given non-exposure is .1. Suppose that 10% of the population is exposed. If a diseased individual is detected what is the probability that the individual was exposed? Solution: By Bayes theorem

P(Ex|Dis) = P(Ex)P(Dis|Ex) / [P(Ex)P(Dis|Ex) + P(No Ex)P(Dis|No Ex)]
          = (.1)(.5) / [(.1)(.5) + (.9)(.1)]
          = 5/(5 + 9)
          = 5/14
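The calculation is a direct application of Bayes theorem and is easy to script; the small function below is our own illustration using the numbers of this example.

    def p_exposed_given_diseased(p_exp, p_dis_given_exp, p_dis_given_unexp):
        # Bayes theorem with the two-event partition {exposed, not exposed}
        numerator = p_exp * p_dis_given_exp
        return numerator / (numerator + (1 - p_exp) * p_dis_given_unexp)

    print(p_exposed_given_diseased(0.10, 0.5, 0.1))   # 0.3571... = 5/14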

The intuitive explanation for this result is as follows:

• Given 1,000 individuals 100 will be exposed and 900 not exposed

• Of the 100 individuals exposed 50 will have the disease.

• Of the 900 non-exposed individuals 90 will have the disease

Thus of the 140 individuals with the disease, 50 will have been exposed, which yields a proportion of 5/14.

example: Diagnostic Tests In this type of study we are interested in the performance of a diagnostic test designed to determine whether a person has a disease. The test has two possible results:

• + positive test (the test indicates presence of disease).

• − negative test (the test does not indicate presence of disease).

We thus have the following setup:

Population Probabilities
          Dc           D            Total
−         P(−,Dc)      P(−,D)       P(−)
+         P(+,Dc)      P(+,D)       P(+)
Total     P(Dc)        P(D)         1

Sample Numbers
          Dc           D            Total
−         n(−,Dc)      n(−,D)       n(−)
+         n(+,Dc)      n(+,D)       n(+)
Total     n(Dc)        n(D)         n

We define the following quantities:

Population Parameters                               Sample Estimates

sensitivity                                         sensitivity
P(+|D) = P(+,D)/[P(+,D) + P(−,D)]                   p(+|D) = n(+,D)/[n(+,D) + n(−,D)]

specificity                                         specificity
P(−|Dc) = P(−,Dc)/[P(−,Dc) + P(+,Dc)]               p(−|Dc) = n(−,Dc)/[n(−,Dc) + n(+,Dc)]

positive test probability                           proportion positive test
P(+) = P(+,D) + P(+,Dc)                             p(+) = n(+)/n

negative test probability                           proportion negative test
P(−) = P(−,D) + P(−,Dc)                             p(−) = n(−)/n

positive predictive value                           positive predictive value
P(D|+) = P(+,D)/P(+)                                p(D|+) = p(+,D)/p(+)

negative predictive value                           negative predictive value
P(Dc|−) = P(−,Dc)/P(−)                              p(Dc|−) = p(−,Dc)/p(−)

As an example consider the performance of a blood sugar diagnostic test to determine whether a person has diabetes. The test has two possible results:

• + positive test (the test indicates presence of diabetes).

• − negative test (the test does not indicate presence of diabetes).

The following numerical example is from Gordis, L. (1996), Epidemiology, W. B. Saunders. We have the following setup:

Population Probabilities
          Dc           D            Total
−         P(−,Dc)      P(−,D)       P(−)
+         P(+,Dc)      P(+,D)       P(+)
Total     P(Dc)        P(D)         1

Sample Numbers
          Dc           D            Total
−         7600         150          7750
+         1900         350          2250
Total     9500         500          10,000

We calculate the following quantities:

Population Parameters                               Sample Estimates

sensitivity                                         sensitivity
P(+|D) = P(+,D)/[P(+,D) + P(−,D)]                   p(+|D) = 350/500 = .70

specificity                                         specificity
P(−|Dc) = P(−,Dc)/[P(−,Dc) + P(+,Dc)]               p(−|Dc) = 7600/9500 = .80

positive test probability                           proportion positive test
P(+) = P(+,D) + P(+,Dc)                             p(+) = 2250/10,000 = .225

negative test probability                           proportion negative test
P(−) = P(−,D) + P(−,Dc)                             p(−) = 7750/10,000 = .775

positive predictive value                           positive predictive value
P(D|+) = P(+,D)/P(+)                                p(D|+) = 350/2250 = 0.156

negative predictive value                           negative predictive value
P(Dc|−) = P(−,Dc)/P(−)                              p(Dc|−) = 7600/7750 = 0.98
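All six estimates follow from the four cell counts, so they can be computed in a few lines. The sketch below (our own code) uses the Gordis counts from the table above.

    def test_summaries(tp, fp, fn, tn):
        # tp: positive & diseased, fp: positive & not diseased,
        # fn: negative & diseased, tn: negative & not diseased
        n = tp + fp + fn + tn
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "p(+)": (tp + fp) / n,
            "p(-)": (fn + tn) / n,
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
        }

    print(test_summaries(tp=350, fp=1900, fn=150, tn=7600))
    # sensitivity 0.70, specificity 0.80, p(+) 0.225, p(-) 0.775, PPV 0.156, NPV 0.981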

2.7 Independence

Closely related to the concept of conditional probability is the concept of independence of events. Definition: Events A and B are said to be independent if

P (B|A) = P (B)

Thus knowledge of the occurrence of A does not influence the assignment of probabilities to B. Since
P(B|A) = P(A ∩ B)/P(A)
it follows that if A and B are independent then

P (A ∩ B) = P (A)P (B)

This last formulation of independence is the definition used in building probability models.

2.8 Bernoulli trial models; the binomial distribution

• One of the most important probability models is the binomial. It is widely used in epidemiology and throughout statistics.

• The binomial model is based on the assumption of Bernoulli trials.

The assumptions for a Bernoulli trial model are

(1) The result of the experiment or study can be thought of as the result of n smaller experiments called trials each of which has only two possible outcomes e.g. (dead, alive), (diseased, non-diseased), (success, failure)

(2) The outcomes of the trials are independent

(3) The probabilities of the outcomes of the trials remain the same from trial to trial (homogeneous probabilities).

example 1: A group of n individuals are tested to see if they have elevated levels of cholesterol. Assuming the results are recorded as elevated or not elevated and we can justify (2) and (3) we may apply the Bernoulli trial model.

example 2: A population of n individuals is found to have d deaths during a given period of time. Assuming we can justify (2) and (3) we may use the Bernoulli model to describe the results of the study.

In Bernoulli trial models the quantity of interest is the number of successes x which occur in the n trials. It can be shown that the following formula gives the probability of obtaining x successes in n Bernoulli trials
P(x) = C(n, x) p^x (1 − p)^(n−x)

where

• x can be 0, 1, 2, . . . , n

• p is the probability of success on a given trial

• C(n, x), read as "n choose x", is defined by
C(n, x) = n!/(x!(n − x)!)

In this last formula r! = r(r − 1)(r − 2) ··· 3 · 2 · 1 for any integer r and 0! = 1. Note: The term distribution is used because the formula describes how to distribute probability over the possible values of x.

example: The chance or probability of having an elevated cholesterol level is 1/100. If 10 individuals are examined, what is the probability that one or more of them will have an elevated level? Solution: The binomial model applies so that
P(0) = C(10, 0)(.01)^0(1 − .01)^(10−0) = (.99)^10

Thus

P(1 or more elevated) = 1 − P(0 elevated) = 1 − (.99)^10 ≈ .096
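A quick check of this computation using only the Python standard library (the code is ours):

    from math import comb

    def binom_pmf(x, n, p):
        # probability of exactly x successes in n Bernoulli trials
        return comb(n, x) * p**x * (1 - p)**(n - x)

    p_none = binom_pmf(0, n=10, p=0.01)   # (.99)^10 = 0.9044
    print(1 - p_none)                     # P(one or more elevated) = 0.0956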

2.9 Parameters and Random Sampling

• The numbers n and p which appear in the formula for the binomial distribution are examples of what statisticians call parameters.

• Different values of n and p give different assignments of probabilities each of the binomial type.

• Thus a parameter can be considered as a label which identifies the particular assignment of probabilities.

• In applications of the binomial distribution the parameter n is known and can be fixed by the investigator - it is thus a study design parameter.

• The parameter p, on the other hand, is unknown and obtaining information about it is the reason for performing the experiment.

We use the observed data and the model to tell us something about p. This same set-up applies in most applications of statistics. To summarize:

• Probability distributions relate observed data to parameters.

• Statistical methods use data and probability models to make statements about the parameters of interest.

In the case of the binomial the parameter of interest is p, the probability of success on a given trial.

example: Random sampling and the binomial distribution. In many circumstances we are given the results of a survey or study in which the investigators state that they examined a "random sample" from the population of interest. Suppose we have a population containing N individuals or objects. We are presented with a "random sample" consisting of n individuals from the population. What does this mean? We begin by defining what we mean by a sample. Definition: A sample of size n from a target population T containing N objects is an ordered collection of n objects each of which is an object in the target population. In set notation a sample is just an n-tuple with each coordinate being an element of the target population. In symbols then a sample s is

s = (a1, a2, . . . , an)

where a1 ∈ T, a2 ∈ T, . . . , an ∈ T . Specific example: If T = {a, b, c, d} then a possible sample of size 2 is (a, b) while some others are (b, a) and (c, d). What about (a, a)? Clearly, this is a sample according to the definition. To distinguish between these two types of samples:

• A sample is taken with replacement if an element in the population can appear more than once in the sample

• A sample is taken without replacement if an element in the population can appear at most once in the sample.

Thus in our example the possible samples of size 2 with replacement are

(a, a)  (a, b)  (a, c)  (a, d)
(b, a)  (b, b)  (b, c)  (b, d)
(c, a)  (c, b)  (c, c)  (c, d)
(d, a)  (d, b)  (d, c)  (d, d)
while without replacement the possible samples are

(a, b)  (a, c)  (a, d)
(b, a)  (c, a)  (d, a)
(b, c)  (c, b)  (b, d)
(d, b)  (c, d)  (d, c)

Definition: A random sample of size n from a population of size N is a sample which is selected such that each sample has the same chance of being selected i.e.
P(sample selected) = 1/(number of possible samples)

Thus in the example each sample with replacement would be assigned a chance of 1/16 while each sample without replacement would be assigned a chance of 1/12 for random sampling.

In the general case,

• For sampling with replacement the probability assigned to each sample is 1/N^n

• For sampling without replacement the probability assigned to each sample is 1/(N)n

where (N)n is given by:

(N)n = N(N − 1)(N − 2) ··· (N − n + 1)

In our example we see that

N^n = 4^2 = 16 and (N)n = (4)2 = 4(4 − 2 + 1) = 4 × 3 = 12

To summarize: A random sample is the result of a selection process in which each sample has the same chance of being selected.

Suppose now that each object in the population can be classified into one of two categories e.g. (exposed, not exposed), (success, failure), (A, not A), (0, 1) etc. For definiteness let us call the two outcomes success and failure and denote them by S and F. In the example suppose that a and b are successes while c and d are failures. The target population is now T = {a(S), b(S), c(F), d(F)} In general D of the objects will be successes and N − D will be failures. The question of interest is: If we select a random sample of size n from a population of size N consisting of D successes and N − D failures, what is the probability that x successes will be observed in the sample? In the example we see that with replacement the samples are

(a(S), a(S))  (a(S), b(S))  (a(S), c(F))  (a(S), d(F))
(b(S), a(S))  (b(S), b(S))  (b(S), c(F))  (b(S), d(F))
(c(F), a(S))  (c(F), b(S))  (c(F), c(F))  (c(F), d(F))
(d(F), a(S))  (d(F), b(S))  (d(F), c(F))  (d(F), d(F))

Thus if sampling is at random with replacement the probabilities of 0 successes, 1 success and 2 successes are given by
P(0) = 4/16,  P(1) = 8/16,  P(2) = 4/16
If sampling is at random without replacement the probabilities are given by
P(0) = 2/12,  P(1) = 8/12,  P(2) = 2/12

These probabilities can, in the general case, be shown to be

without replacement:
P(x successes) = C(n, x) (D)x (N − D)n−x / (N)n
where (D)x and (N − D)n−x are defined analogously to (N)n.

with replacement:
P(x successes) = C(n, x) (D/N)^x (1 − D/N)^(n−x)

The distribution without replacement is called the hypergeometric distribution with parameters N, n and D. The distribution with replacement is the binomial distribution with parameters n and p = D/N. In many applications the sample size, n, is small relative to the population size N. In this situation it can be shown that the formula
C(n, x) (D/N)^x (1 − D/N)^(n−x)

provides an adequate approximation to the probabilities for sampling without replacement. Thus for most applications, random sampling from a population in which each individual is classified as a success or a failure results in a binomial distribution for the probability of obtaining x successes in the sample.
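The adequacy of the approximation is easy to check numerically. The sketch below (ours) compares the two distributions using the equivalent combinatorial form of the without-replacement formula, C(D, x)C(N − D, n − x)/C(N, n); the particular values of N, D and n are arbitrary.

    from math import comb

    def hypergeom_pmf(x, N, D, n):
        # P(x successes) sampling n without replacement from N items, D of them successes
        return comb(D, x) * comb(N - D, n - x) / comb(N, n)

    def binom_pmf(x, n, p):
        return comb(n, x) * p**x * (1 - p)**(n - x)

    N, D, n = 10_000, 300, 20   # sample small relative to the population
    for x in range(4):
        print(x, round(hypergeom_pmf(x, N, D, n), 5), round(binom_pmf(x, n, D / N), 5))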

The interpretation of the parameter p = D/N is thus:

• “the proportion of successes in the target population”

• "the chance that an individual selected at random will be classified as a success".

example: Prospective (Cohort) Study In this type of study

• we observe n(E) individuals who are exposed and n(Ec) individuals who are not exposed.

• These individuals are followed and the number in each group who develop the disease are recorded.

We thus have the following setup:

Population Probabilities
          Dc            D            Total
Ec        P(Dc|Ec)      P(D|Ec)      1
E         P(Dc|E)       P(D|E)       1

Sample Numbers
          Dc            D            Total
Ec        n(Dc,Ec)      n(D,Ec)      n(Ec)
E         n(Dc,E)       n(D,E)       n(E)

We can model this situation as two independent binomial distributions as follows:

n(D,E) is binomial (n(E), P(D|E))
n(D,Ec) is binomial (n(Ec), P(D|Ec))

We define the following quantities:

Population Parameters                         Sample Estimates

prob of disease given exposed                 prob of disease given exposed
P(D|E) = P(E,D)/P(E)                          p(D|E) = n(D,E)/n(E)

odds of disease if exposed                    odds of disease if exposed
O(D|E) = P(D,E)/P(Dc,E)                       o(D|E) = n(D,E)/n(Dc,E)

odds of disease if not exposed                odds of disease if not exposed
O(D|Ec) = P(D,Ec)/P(Dc,Ec)                    o(D|Ec) = n(D,Ec)/n(Dc,Ec)

odds ratio (relative odds)                    odds ratio (relative odds)
OR = O(D|E)/O(D|Ec)                           or = o(D|E)/o(D|Ec)

relative risk                                 relative risk
RR = P(D|E)/P(D|Ec)                           rr = p(D|E)/p(D|Ec)

As an example consider the following hypothetical study in which we follow smokers and non-smokers to see which individuals develop coronary heart disease (CHD). Thus E is smoker and Ec is non-smoker. This example is from Gordis, L. (1996), Epidemiology, W. B. Saunders.

We have the following setup:

Population Probabilities
          Dc            D            Total
Ec        P(Dc|Ec)      P(D|Ec)      1
E         P(Dc|E)       P(D|E)       1

Sample Numbers
          No CHD        CHD          Total
Ec        4,913         87           5,000
E         2,916         84           3,000

We calculate the following quantities:

Population Parameters                         Sample Estimates

prob of disease given exposed                 prob of disease given exposed
P(D|E) = P(E,D)/P(E)                          p(CHD|S) = 84/3,000 = 0.028

odds of disease if exposed                    odds of disease if exposed
O(D|E) = P(D,E)/P(Dc,E)                       o(CHD|S) = 84/2916 = 0.0288

odds of disease if not exposed                odds of disease if not exposed
O(D|Ec) = P(D,Ec)/P(Dc,Ec)                    o(CHD|NS) = 87/4913 = 0.0177

odds ratio (relative odds)                    odds ratio (relative odds)
OR = O(D|E)/O(D|Ec)                           or = (84/2916)/(87/4913) = 1.63

relative risk                                 relative risk
RR = P(D|E)/P(D|Ec)                           rr = (84/3000)/(87/5000) = 1.61

example: Retrospective (Case-Control) Study In this type of study we

• Select n(D) individuals who have the disease (cases) and n(Dc) individuals who do not have the disease (controls).

• Then the number of individuals in each group who were exposed is determined.

We thus have the following setup:

Population Probabilities
          Dc            D
Ec        P(Ec|Dc)      P(Ec|D)
E         P(E|Dc)       P(E|D)
Total     1             1

Sample Numbers
          Dc            D
Ec        n(Dc,Ec)      n(D,Ec)
E         n(Dc,E)       n(D,E)
Total     n(Dc)         n(D)

We can model this situation as two independent binomials as follows:

n(D,E) is binomial (n(D), P(E|D))
n(Dc,E) is binomial (n(Dc), P(E|Dc))

Define the following quantities:

Population Parameters                         Sample Estimates

prob of exposed given diseased                prob of exposed given diseased
P(E|D)                                        p(E|D) = n(D,E)/n(D)

odds of exposure if diseased                  odds of exposure if diseased
O(E|D) = P(E|D)/P(Ec|D)                       o(E|D) = n(D,E)/n(D,Ec)

odds of exposure if not diseased              odds of exposure if not diseased
O(E|Dc) = P(E|Dc)/P(Ec|Dc)                    o(E|Dc) = n(E,Dc)/n(Ec,Dc)

odds ratio (relative odds)                    odds ratio (relative odds)
OR = O(E|D)/O(E|Dc)                           or = o(E|D)/o(E|Dc)

As an example consider the following hypothetical study in which we examine individuals with coronary heart disease (CHD) (cases) and individuals without coronary heart disease (controls). We then determine which individuals were smokers and which were not. Thus E is smoker and Ec is non-smoker. This example is from Gordis, L. (1996), Epidemiology, W. B. Saunders.

Population Probabilities
          Controls (Dc)     Cases (D)
Ec        P(Ec|Dc)          P(Ec|D)
E         P(E|Dc)           P(E|D)
Total     1                 1

Sample Numbers
          Controls          Cases
Ec        224               88
E         176               112
Total     400               200

We calculate the following quantities:

Population Parameters                         Sample Estimates

prob of exposed given diseased                prob of exposed given diseased
P(E|D)                                        p(E|D) = 112/200 = 0.56

odds of exposure if diseased                  odds of exposure if diseased
O(E|D) = P(E|D)/P(Ec|D)                       o(E|D) = 112/88 = 1.27

odds of exposure if not diseased              odds of exposure if not diseased
O(E|Dc) = P(E|Dc)/P(Ec|Dc)                    o(E|Dc) = 176/224 = 0.79

odds ratio (relative odds)                    odds ratio (relative odds)
OR = O(E|D)/O(E|Dc)                           or = (112/88)/(176/224) = 1.62

2.10 Probability Examples

The following two examples illustrate the importance of probability in solving real problems. Each of the topics presented has been extended and generalized since its introduction.

2.10.1 Randomized Response

Suppose that a sociologist is interested in determining the prevalence of child abuse in a population. Obviously if individual parents are asked a question such as "have you abused your child" the reliability of the answer is in doubt. The sociologist would ideally like the parent to answer, honestly, one of the following two questions:

(i) Have you ever abused your children?

(ii) Have you not abused your children?

A clever method for determining prevalence in such a situation is to provide the respondent with a randomization device such as a deck of cards in which a proportion P of the cards are marked with the number 1 and the remainder with the number 2. The respondent selects a card at random and replaces it with the result unknown to the interviewer. Thus confidentiality of the respondent is protected. If the card drawn is 1 the respondent answers truthfully to question 1 whereas if the card drawn is a 2 the respondent answers truthfully to question 2.

It follows that the probability λ that the respondent answers yes is given by

λ = P(yes) = P(yes|Q1)P(Q1) + P(yes|Q2)P(Q2) = πP + (1 − π)(1 − P)
where π is the prevalence (the proportion in the population who abuse their children) and P is the proportion of 1's in the deck of cards. We assume P ≠ 1/2. If we use this procedure on n respondents and observe x yes answers then the observed proportion x/n is a natural estimate of πP + (1 − π)(1 − P), i.e.
λ̂ = x/n = πP + (1 − π)(1 − P)
Since we know P we can solve for π giving us the estimate

π̂ = (λ̂ + P − 1)/(2P − 1)
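A small simulation (our own sketch, with illustrative values of π and P) shows how the estimator recovers the prevalence even though no individual answer reveals anything:

    import random

    def estimate_prevalence(n, prevalence, p_card1, seed=0):
        # simulate n randomized responses and invert lambda-hat to estimate the prevalence
        rng = random.Random(seed)
        yes = 0
        for _ in range(n):
            abuser = rng.random() < prevalence
            card1 = rng.random() < p_card1       # card 1: answer question 1 truthfully
            yes += abuser if card1 else (not abuser)
        lam_hat = yes / n
        return (lam_hat + p_card1 - 1) / (2 * p_card1 - 1)

    print(estimate_prevalence(n=100_000, prevalence=0.05, p_card1=0.7))   # close to 0.05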

Reference: Encyclopedia of Biostatistics.

2.10.2 Screening

As another simple application of probability consider the following situation. We have a fixed amount of money available to test individuals for the presence of a disease, say $1,000. The cost of testing one sample of blood is $5. We have to test a population of size 1,000 in which we suspect the prevalence of the disease is 3/1,000. Can we do it? At $5 per test the budget covers only 200 individual tests, so testing everyone separately is impossible. If, however, we divide the population into 100 groups of size 10 then there should be 1 diseased individual in about 3 of the groups and the remaining 97 groups will be disease free. If we pool the samples from each group and test each grouped sample we would need 100 + 30 = 130 tests instead of 1,000 tests to screen everyone. The probabilistic version is as follows: A large number N of individuals are subject to a blood test which can be administered in one of two ways

(i) Each individual is to be tested separately so that N tests are required.

(ii) The samples of n individuals can be pooled or combined and tested. If this test is negative then the one test suffices to clear all of these n individuals. If this test is positive then each of the n individuals in that group must be tested. Thus n + 1 tests are required if the pooled sample tests positive.

Assume that individuals are independent and that each has probability p of testing positive. Clearly we have a Bernoulli trial model and hence the probability that the combined sample will test positive is
P(combined test positive) = 1 − P(combined test negative) = 1 − (1 − p)^n
Thus we have for any group of size n
P(1 test) = (1 − p)^n ; P(n + 1 tests) = 1 − (1 − p)^n
It follows that the expected number of tests per group if we combine samples is
(1 − p)^n + (n + 1)[1 − (1 − p)^n] = n + 1 − n(1 − p)^n
Thus if there are N/n groups we expect to run
N[1 + 1/n − (1 − p)^n]
tests if we combine samples instead of the N tests if we test each individual. Given a value of p we can choose n to minimize the total number of tests.

As an example with N = 1,000 and p = .01 we have the following numbers

Group Size    Number of Tests
 2            519.9
 3            363.0343
 4            289.404
 5            249.0099
 6            225.1865
 7            210.7918
 8            202.2553
 9            197.5939
10            195.6179
11            195.5708
12            196.9485
13            199.4021
14            202.6828
15            206.6083
16            211.0422
17            215.8803
18            221.0418
19            226.463
20            232.0931
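The table can be reproduced from the formula N(1 + 1/n − (1 − p)^n); the short sketch below (ours) also picks out the minimizing group size.

    def expected_tests(N, p, n):
        # expected number of tests when N people are screened in pools of size n
        return N * (1 + 1 / n - (1 - p) ** n)

    N, p = 1000, 0.01
    best = min(range(2, 21), key=lambda n: expected_tests(N, p, n))
    print(best, round(expected_tests(N, p, best), 1))   # 11, 195.6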

Thus we should combine individuals into groups of size 10 or 11, in which case we expect to run about 196 tests instead of 1,000 tests. Clearly we achieve real savings. Reference: Feller, W. (1950) An Introduction to Probability Theory and Its Applications. John Wiley & Sons.

Figure 2.1: Graph of expected number of tests vs group size (N = 1,000 and p = .01)

Chapter 3

Probability Distributions

3.1 Random Variables and Distributions

3.1.1 Introduction

Most of the responses we model in statistics are numerical. It is useful to have a notation for real valued responses. Real valued responses are called random variables. The notation is not only convenient, it is imperative when we consider statistics, defined as functions of sample data. The probability models for these random variables are called their sampling distributions and form the foundation of the modern theory of statistics. Definition:

• Before the experiment is performed the possible numerical response is denoted by X; X is called a random variable.

• After the experiment is performed the observed value of X is denoted by x. We call x the realized or observed value of X.

Notation:

• The set of all possible values of a random variable X is called the sample space of X and is denoted by X .

• The probability model of X is denoted by PX and we write

PX (B) = P (X ∈ B)

for the probability that the event X ∈ B occurs.

• The probability model for X is called the of X.

There are two types of random variables which are of particular importance: discrete and continuous. These correspond to the two types of numbers introduced in the overview section and the two types of probability density functions introduced in the probability section.

• A random variable is discrete if its possible values (sample space) constitute a finite or countable set e.g.

X = {0, 1} ; X = {0, 1, 2, . . . , n} ; X = {0, 1, 2,...}

◦ Discrete random variables arise when we consider response variables which are categorical or counts.

• A random variable is continuous or numeric if its possible values (sample space) is an interval of real numbers e.g.

X = [0, ∞); X = (−∞, ∞)

◦ Continuous random variables arise when we consider response variables which are recorded on interval or ratio scales.

3.1.2 Discrete Random Variables

Probabilities for discrete random variables are specified by the probability density function p(x):
PX(B) = P(X ∈ B) = Σ_{x∈B} p(x)

Probability density functions for discrete random variables have the properties

• 0 ≤ p(x) ≤ 1 for all x in the sample space X

• Σ_{x∈X} p(x) = 1

Binomial Distribution

A random variable is said to have a binomial distribution if its probability density function is of the form:
p(x) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, 2, . . . , n
where 0 ≤ p ≤ 1. If we define X as the number of successes in n Bernoulli trials then X is a random variable with a binomial distribution. The parameters are n and p where p is the probability of success on a given trial. The term distribution is used because the formula describes how to distribute probability over the possible values of x. Recall that the assumptions necessary for a Bernoulli trial model to apply are:

• The result of the experiment or study consists of the result of n smaller experiments called trials each of which has only two possible outcomes e.g. (dead, alive), (diseased, non-diseased), (success, failure).

• The outcomes of the trials are independent.

• The probabilities of the outcomes of the trials remain the same from trial to trial (homogeneous probabilities).

Figure 3.1: Histograms of binomial distributions.

Note that as n increases the binomial distribution becomes more symmetric.

Poisson Distribution

A random variable is said to have a Poisson distribution if its probability distribution is given by
p(x) = λ^x e^(−λ)/x! for x = 0, 1, 2, . . .

• The parameter of the Poisson distribution is λ.

• The Poisson distribution is one of the most important distributions in the applications of statistics to public health problems. The reasons are:

◦ It is ideally suited for modelling the occurrence of "rare events".

◦ It is also particularly useful in modelling situations involving person-time.

◦ Specific examples of situations in which the Poisson distribution applies include: number of deaths due to a rare disease, spatial distribution of bacteria, and accidents.

The Poisson distribution is also useful in modelling the occurrence of events over time. Suppose that we are interested in modelling a process where:

(1) The occurrences of the event in an interval of time are independent.

(2) The probability of a single occurrence of the event in an interval of time is proportional to the length of the interval.

(3) In any extremely short time interval, the probability of more than one occurrence of the event is approximately zero.

Under these assumptions:

• The distribution of the random variable X, defined as the number of occurrences of the event in the interval, is given by the Poisson distribution.

• The parameter λ in this case is the average number of occurrences of the event in the interval, i.e. λ = µt where µ is the rate per unit time.

example: Suppose that the suicide rate in a large city is 2 per week. Then the probability of two suicides in one week is
P(2 suicides in one week) = 2^2 e^(−2)/2! = .2707 ≈ .271
The probability of two suicides in three weeks is

P(2 suicides in three weeks) = 6^2 e^(−6)/2! = .0446 ≈ .045
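Both probabilities come straight from the Poisson probability function; a minimal sketch (ours):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        # Poisson probability of exactly x events when the mean count is lam
        return lam**x * exp(-lam) / factorial(x)

    print(poisson_pmf(2, 2.0))   # two suicides in one week,    about 0.271
    print(poisson_pmf(2, 6.0))   # two suicides in three weeks, about 0.045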

example: The Poisson distribution is often used as a model for the probability of automobile or other accidents for the following reasons:

(1) The population exposed is large.

(2) The number of people involved in accidents is small.

(3) The risk for each person is small.

(4) Accidents are “random”.

(5) The probability of being in two or more accidents in a short time period is approximately zero.

Approximations using the Poisson Distribution

Poisson probabilities can be used to approximate binomial probabilities when n is large, p is small and λ is taken to be np. Thus for n = 150 and p = .02 we have the following table:

      Binomial              Poisson
x     n = 150, p = .02      λ = 150(.02) = 3
0     0.04830               0.04979
1     0.14784               0.14936
2     0.22478               0.22404
3     0.22631               0.22404
4     0.16974               0.16803
5     0.10115               0.10082
6     0.04989               0.05041
7     0.02094               0.02160
8     0.00764               0.00810
9     0.00246               0.00270
10    0.00071               0.00081
11    0.00018               0.00022
12    0.00004               0.00006
13    0.00001               0.00001
14    0.00000               0.00000

Note the closeness of the approximation. The supplementary notes contain a "proof" of the proposition that the Poisson approximates the binomial when n is large and p is small.
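The two columns of the table can be regenerated from the binomial and Poisson probability functions; the sketch below (ours) prints the first few rows.

    from math import comb, exp, factorial

    n, p = 150, 0.02
    lam = n * p
    for x in range(6):
        binomial = comb(n, x) * p**x * (1 - p)**(n - x)
        poisson = lam**x * exp(-lam) / factorial(x)
        print(x, round(binomial, 5), round(poisson, 5))
    # x = 0 gives 0.04830 and 0.04979, as in the table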

Figure 3.2: Histograms of Poisson distributions.

Note that as λ increases the Poisson distribution becomes more symmetric.

3.1.3 Continuous or Numeric Random Variables

Probabilities for numeric or continuous random variables are given by the area under the curve of the probability density function f(x):
P(E) = ∫_E f(x) dx

• f(x) has the properties:

◦ f(x) ≥ 0

◦ The total area under the curve is one

• Probabilities for numeric random variables are tabled or can be calculated using a statistical software package.

The Normal Distribution

By far the most important continuous probability distribution is the normal or Gaussian. The probability density function is given by:
p(x) = [1/(σ√(2π))] exp{ −(x − µ)^2/(2σ^2) }

• The normal distribution is used as a basic model when the observed data has a histogram which is symmetric and bell-shaped.

• In addition the normal distribution provides useful approximations to other distribu- tions by the Central Limit Theorem.

• The Central Limit Theorem also implies that a variety of statistics have distributions that can be approximated by normal distributions.

• Most statistical methods were originally developed for the normal distribution and then extended to other distributions.

• The parameter µ is the natural center of the distribution (since the distribution is symmetric about µ).

• The parameter σ^2 or σ provides a measure of spread or scale.

• The special case where µ = 0 and σ^2 = 1 is called the standard normal or Z distribution.

The following quote indicates the importance of the normal distribution:

The normal law of error stands out in the experience of mankind as one of the broadest generalizations of natural philosophy. It serves as the guiding instrument in researches in the physical and social sciences and in medicine, agriculture and engineering. It is an indispensible tool for the analysis and the interpretation of the basic data obtained by observation and experimentation.

W. J. Youden

The principal characteristics of the normal distribution are

• The curve is bell-shaped.

• The possible values for x are between −∞ and +∞

• The distribution is symmetric about µ

• median = mode (point of maximum height of the curve)

• area under the curve is 1.

• area under the curve over an interval I gives the probability of I

• 68% of the probability is between µ − σ and µ + σ

• 95% of the probability is between µ − 2σ and µ + 2σ

• 99.7% of the probability is between µ − 3σ and µ + 3σ

• For the standard normal distribution we have

◦ P (Z ≥ z) = 1 − P (Z ≤ z)

◦ P (Z ≥ z0) = P (Z ≤ −z0) for z0 ≥ 0. Thus we have

P(Z ≤ 1.645) = .95
P(Z ≥ 1.645) = .05
P(Z ≤ −1.645) = .05

• Probabilities for any normal distribution can be calculated by converting to the standard normal distribution (µ = 0 and σ = 1) as follows:
P(X ≤ x) = P(Z ≤ (x − µ)/σ)

Figure 3.3: Plot of the Z distribution.

Figure 3.4: Plots of normal distributions.

Approximating Binomial Probabilities Using the Normal Distribution

If n is large we may approximate binomial probabilities using the normal distribution as follows:
P(X ≤ x) ≈ P(Z ≤ (x − np + 1/2)/√(np(1 − p)))

• The 1/2 in the approximation is called a continuity correction since it improves the approximation for modest values of n.

• A guideline is to use the normal approximation when
n[p/(1 − p)] ≥ 9 and n[(1 − p)/p] ≥ 9

and use the continuity correction. The Supplementary Notes give a brief discussion of the appropriateness of the continuity correction.

For the binomial distribution with n = 30 and p = .3 we find the following probabilities:

x     P(X = x)    P(X ≤ x)
0     0.00002     0.00002
1     0.00029     0.00031
2     0.00180     0.00211
3     0.00720     0.00932
4     0.02084     0.03015
5     0.04644     0.07659
6     0.08293     0.15952
7     0.12185     0.28138
8     0.15014     0.43152
9     0.15729     0.58881
10    0.14156     0.73037
11    0.11031     0.84068
12    0.07485     0.91553
13    0.04442     0.95995
14    0.02312     0.98306
15    0.01057     0.99363
16    0.00425     0.99788
17    0.00150     0.99937

Thus P(X ≤ 12) is exactly 0.91553. Using the normal approximation without the continuity correction yields a value of 0.88400. Using the continuity correction yields a value of 0.91841, close enough for most work. However, using STATA or other statistical packages makes it easy to get exact probabilities.
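These three numbers can be reproduced with the standard library alone, using the identity Φ(z) = (1 + erf(z/√2))/2 for the standard normal distribution function (the code below is our own sketch).

    from math import comb, erf, sqrt

    def std_normal_cdf(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n, p, x = 30, 0.3, 12
    mean, sd = n * p, sqrt(n * p * (1 - p))

    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))
    no_cc = std_normal_cdf((x - mean) / sd)          # without continuity correction
    with_cc = std_normal_cdf((x + 0.5 - mean) / sd)  # with continuity correction
    print(round(exact, 5), round(no_cc, 5), round(with_cc, 5))   # 0.91553 0.88400 0.91841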

Approximating Poisson Probabilities Using the Normal Distribution

If λ ≥ 10 we can use the normal (Z) distribution to approximate the Poisson distribution as follows:
P(X ≤ x) ≈ P(Z ≤ (x − λ)/√λ)
The following are some Poisson probabilities for λ = 10

x     P(X = x)    P(X ≤ x)
0     0.00005     0.00005
1     0.00045     0.00050
2     0.00227     0.00277
3     0.00757     0.01034
4     0.01892     0.02925
5     0.03783     0.06709
6     0.06306     0.13014
7     0.09008     0.22022
8     0.11260     0.33282
9     0.12511     0.45793
10    0.12511     0.58304
11    0.11374     0.69678
12    0.09478     0.79156
13    0.07291     0.86446
14    0.05208     0.91654
15    0.03472     0.95126
16    0.02170     0.97296
17    0.01276     0.98572
18    0.00709     0.99281
19    0.00373     0.99655
20    0.00187     0.99841

For x = 15 we find that P(X ≤ 15) = 0.95126. Using the normal approximation yields a value of 0.94308. A continuity correction can again be used to improve the approximation.

3.1.4 Distribution Functions

For any random variable the probability that it assumes a value less than or equal to a specified value, say x, is called its distribution function and denoted by F i.e.

F (x) = P (X ≤ x)

The distribution function F is between 0 and 1 and does not decrease as x increases. The graph of F is a step function for discrete random variables (the height of the step at x is the probability of the value x) and is a differentiable function for continuous random variables (the derivative equals the density function). Distribution functions are the model analogue to the empirical distribution function introduced in the exploratory data analysis section. They play an important role in goodness of fit tests and in finding the distribution of functions of continuous random variables. In addition, the natural estimate of the distribution function is the empirical distribution function which forms the basis for the substitution method of estimation.

3.1.5 Functions of Random Variables

It is often necessary to find the distribution of a function of a random variable(s).

Functions of Discrete Random Variables

In this case to find the pdf of Y = g(X) we find the probability density function directly using the formula
f(y) = P(Y = y) = P({x : g(x) = y})
Thus if X has a binomial pdf with parameters n and p and represents the number of successes in n trials, what is the pdf of Y = n − X, the number of failures? We find that
P(Y = y) = P({x : x = n − y}) = C(n, n − y) p^(n−y)(1 − p)^(n−(n−y)) = C(n, y)(1 − p)^y p^(n−y)
i.e. binomial with parameters n and 1 − p.

Functions of Continuous Random Variables

Here we find the distribution function of Y

P(Y ≤ y) = P({x : g(x) ≤ y})
and then differentiate to find the density function of Y.

example: Let Z be standard normal and let Y = Z^2. The distribution function of Y is given by
F(y) = P(Y ≤ y) = P({z : −√y ≤ z ≤ √y}) = ∫_{−√y}^{√y} φ(z) dz
where φ(z) is the standard normal density i.e.

φ(z) = (2π)^(−1/2) e^(−z^2/2)

It follows that the density function of Y is equal to

dF(y)/dy = [1/(2√y)] φ(√y) + [1/(2√y)] φ(−√y)
or
f(y) = (1/√y)(2π)^(−1/2) e^(−y/2) = y^(1/2 − 1) e^(−y/2)/(2^(1/2) √π)
which is called the chi-square distribution with one degree of freedom. That is, if Z is standard normal then Z^2 is chi-square with one degree of freedom.
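The result is easy to check empirically: squares of simulated standard normal variates should put about 68% of their mass below 1, since P(Y ≤ 1) = P(−1 ≤ Z ≤ 1). A two-line simulation sketch (ours):

    import random

    random.seed(1)
    squares = [random.gauss(0, 1) ** 2 for _ in range(100_000)]
    print(sum(y <= 1 for y in squares) / len(squares))   # about 0.683 = P(-1 <= Z <= 1)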

3.1.6 Other Distributions

A variety of other distributions arise in statistical problems. These include the log-normal, the chi-square, the Gamma, the Beta, the t, the F, and the negative binomial. We will discuss these as they arise.

3.2 Parameters of Distributions

3.2.1 Expected Values

In exploratory data analysis we emphasized the importance of a measure of location (center) and spread (variability) for a batch of numbers. There are analogous measures for probability distributions. Definition: The expected value, E(X), of a random variable is the weighted average of its values, the weights being the probability assigned to the values.

• For a discrete random variable we have
E(X) = Σ_x x p(x)
where p(x) is the probability density function of X.

• For continuous random variables
E(X) = ∫ x f(x) dx

Some important expected values are:

(1) The expected value of the binomial distribution is np

(2) The expected value of the Poisson distribution is λ

(3) The expected value of the normal distribution is µ

Using the properties of sums and integrals we have the following properties of expected values

• E(c) = c where c is a constant. In words: The expected value of a constant is equal to the constant.

• E(cX) = cE(X) where c is a constant. In words: The expected value of a constant times a random variable is equal to the constant times the expected value of the random variable.

• E(X + Y ) = E(X) + E(Y ) In words: The expected value of the sum of two random variables is the sum of their expected values.

• If X ≥ 0 then E(X) ≥ 0 In words: The expected value of a non-negative random variable is non-negative.

Note: The result that the expected value of the sum of two random variables is the sum of their expected values is nontrivial in the sense that one must show that the distribution of the sum has expected value equal to the sum of the individual expected values.

3.2.2 Variances

Definition: The variance of a random variable is

var (X) = E(X − µ)2 where µ = E(X)

• If we write X = µ + (X − µ) or X = µ + error we see that the variance of a random variable is a measure of the average size of the squared error made when using µ to predict the value of X.

• The square root of var (X) is called the standard deviation of X and is used as a basic measure of variability for X.

(1) For the binomial var (X) = npq where q = 1 − p

(2) For the Poisson var (X) = λ

(3) For the normal var (X) = σ2

Using the properties of sums and integrals we have the following properties of variances:

• var (c) = 0 where c is a constant. In words: The variance (variability) of a constant is 0.

• var (c + X) = var (X) where c is a constant. In words: The variance of a random variable is unchanged by the addition of a constant.

• var (cX) = c2var (X) where c is a constant. In words: The variance of a constant times a random variable equals the constant squared times the variance of the random variable.

• var (X) ≥ 0 In words: The variance of a random variable cannot be negative.

3.2.3 Quantiles

Recall that

• The median of a batch of numbers is the value which divides the batch in half.

• Similarly the upper quartile has one fourth of the numbers above it while the lower quartile has one fourth of the numbers below it.

• There are analogs for probability distributions of random variables.

Definition: The pth quantile, Qp of X is defined by

P (X ≤ Qp) = p where 0 < p < 1.

• Q.5 is called the median of X

• Q.25 is called the lower quartile of X

• Q.75 is called the upper quartile of X

• Q.75 − Q.25 is called the interquartile range of X

3.2.4 Other Expected Values

If Y = g(X) is a function of X then Y is also a random variable and has expected value given by
E[Y] = E[g(X)] = Σ_x g(x)f(x) if X is discrete
E[Y] = E[g(X)] = ∫ g(x)f(x) dx if X is continuous

Definition: The moment generating function of X, M(t), is defined as the expected value of Y = e^(tX) where t is a real number. The moment generating function has two important theoretical properties:

(1) The rth derivative of M(t) with respect to t, evaluated at t = 0, gives the rth moment of X, E(X^r), for any integer r. This often provides an easy method to find the mean, variance, etc. of a random variable.

(2) The moment generating function is unique: that is, if two distributions have the same moment generating function then they have the same distribution.

example: For the binomial distribution we have that
M(t) = E[e^(tX)] = Σ_{x=0}^n e^(tx) C(n, x) p^x (1 − p)^(n−x) = Σ_{x=0}^n C(n, x) (pe^t)^x (1 − p)^(n−x) = (pe^t + q)^n
where q = 1 − p. The first and second derivatives are

dM(t)/dt = npe^t (pe^t + q)^(n−1)
d^2M(t)/dt^2 = n(n − 1)p^2 e^(2t) (pe^t + q)^(n−2) + npe^t (pe^t + q)^(n−1)
Thus we have E(X) = np ; E(X^2) = n(n − 1)p^2 + np and hence
var (X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p)

example: For the Poisson distribution we have that

M(t) = E(e^(tX)) = Σ_{x=0}^∞ e^(tx) e^(−λ) λ^x/x! = e^(−λ) Σ_{x=0}^∞ (λe^t)^x/x! = e^(λ(e^t − 1))

The first and second derivatives are

dM(t)/dt = λe^t M(t)
d^2M(t)/dt^2 = (λe^t)^2 M(t) + λe^t M(t)
Thus we have E(X) = λ ; E(X^2) = λ^2 + λ and hence
var (X) = (λ^2 + λ) − λ^2 = λ

example: For the normal distribution we have that

M(t) = exp{tµ + t^2σ^2/2}

The first two derivatives are

dM(t)/dt = (µ + tσ^2)M(t)
d^2M(t)/dt^2 = (µ + tσ^2)^2 M(t) + σ^2 M(t)
Thus we have E(X) = µ ; E(X^2) = µ^2 + σ^2 and hence
var (X) = (µ^2 + σ^2) − µ^2 = σ^2

3.2.5 Inequalities involving Expectations

Markov’s Inequality: If Y is any non-negative random variable then

P(Y ≥ c) ≤ E(Y)/c
where c is any positive constant. To see this define a discrete random variable by the equation
Z = c if Y ≥ c and Z = 0 if Y < c

Note that Z ≤ Y so that

E(Y ) ≥ E(Z) = 0P (Z = 0) + cP (Z = c) = cP (Y ≥ c)

Tchebychev’s Inequality: If X is any random variable then

P(−δ < X − µ < δ) ≥ 1 − σ^2/δ^2
where σ^2 is the variance of X and δ is any positive number. To see this define

Y = (|X − µ|)2

Then Y is non-negative with expected value equal to σ^2 and by Markov's Inequality we have that
P(Y ≥ δ^2) ≤ σ^2/δ^2
and hence
1 − P(Y < δ^2) ≤ σ^2/δ^2  or  P(Y < δ^2) ≥ 1 − σ^2/δ^2
But P(Y < δ^2) = P(|X − µ| < δ) = P(−δ < X − µ < δ) so that
P(−δ < X − µ < δ) ≥ 1 − σ^2/δ^2

example: Consider n Bernoulli trials and let Sn be the number of successes. Then X = Sn/n has
E(Sn/n) = np/n = p  and  var (Sn/n) = npq/n^2 = pq/n
Thus Tchebychev's Inequality says that
1 ≥ P(−δ < Sn/n − p < δ) ≥ 1 − pq/(nδ^2)

In other words, if the number of trials is large, the probability that the observed frequency of successes will be close to the true probability of success is close to 1. This is used as the justification for the relative frequency interpretation of probability. It is also a special case of the Weak Law of Large Numbers.
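The bound can be compared with what actually happens for Bernoulli trials; the simulation sketch below (ours, with arbitrary n, p and δ) estimates P(|Sn/n − p| < δ) and prints the Tchebychev lower bound beside it.

    import random

    def coverage(n, p, delta, reps=20_000, seed=0):
        # estimate P(|S_n/n - p| < delta) by simulation
        rng = random.Random(seed)
        hits = 0
        for _ in range(reps):
            s = sum(rng.random() < p for _ in range(n))
            hits += abs(s / n - p) < delta
        return hits / reps

    n, p, delta = 400, 0.3, 0.05
    bound = 1 - p * (1 - p) / (n * delta**2)
    print(coverage(n, p, delta), bound)   # the simulated value is well above the bound 0.79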

Chapter 4

Joint Probability Distributions

4.1 General Case

Often we want to consider several responses simultaneously. We model these using random variables X1,X2,... and we have joint probability distributions. There are again two major types.

(i) Joint discrete distributions have the property that the sample space for each random variable is discrete and probabilities are assigned using the joint probability density function defined by
0 ≤ f(x1, x2, . . . , xk) ≤ 1 ; Σ_{x1} Σ_{x2} ··· Σ_{xk} f(x1, x2, . . . , xk) = 1

(ii) Joint continuous distributions have the property that the sample space for each random variable is continuous and probabilities are assigned using the probability density function which has the properties that
f(x1, x2, . . . , xk) ≥ 0 ; ∫_{x1} ∫_{x2} ··· ∫_{xk} f(x1, x2, . . . , xk) dx1 dx2 ··· dxk = 1

4.1.1 Marginal Distributions

Marginal distributions are distributions of subsets of random variables which have a joint distribution. In particular the distribution of one of the components, say Xi, is said to be the marginal distribution of Xi. Marginal distributions are obtained by "summing" or "integrating" out the other variables in the joint density. Thus if X and Y have a joint distribution which is discrete the marginal distribution of X is given by
fX(x) = Σ_y f(x, y)

If X and Y have a joint distribution which is continuous the marginal distribution of X is given by
fX(x) = ∫ f(x, y) dy

4.1.2 Conditional Distributions

Conditional distributions are distributions of subsets of random variables which have a joint distribution given that other components of the random variables are fixed. The conditional distribution of Y given X = x is obtained by

fY|X(y|x) = f(y, x)/fX(x)

where f(y, x) is the joint distribution of Y and X and fX(x) is the marginal distribution of X. Conditional distributions are of fundamental importance in regression and prediction problems.

4.1.3 Properties of Marginal and Conditional Distributions

• The joint distribution of X1,X2,...,Xk can be obtained as

f(x1, x2, . . . , xk) = f1(x1) f2(x2|x1) f3(x3|x1, x2) ··· fk(xk|x1, x2, . . . , xk−1)

which is a generalization of the multiplication rule for probabilities.

• The marginal distribution of Y can be obtained via the formula
fY(y) = Σ_x f(y|x)fX(x) if X, Y are discrete
fY(y) = ∫ f(y|x)fX(x) dx if X, Y are continuous
which is a generalization of the law of total probability.

• The conditional density of y given X = x can be obtained as

fY|X(y|x) = f(y, x)/fX(x) = fY(y)fX|Y(x|y)/fX(x)
which is a version of Bayes Theorem.

4.1.4 Independence and Random Sampling

If X and Y have a joint distribution they are independent if

f(x, y) = fX (x)fY (y) or if fY |X (y|x) = fY (y)

In general X1,X2,...,Xn are independent if

f(x1, x2, . . . , xn) = fX1 (x1)fX2 (x2) ··· fXn (xn) i.e. the joint distribution is the product of the marginal distributions.

Definition: We say that x1, x2, . . . , xn constitute a random sample from f if they are realized values of independent random variables X1, X2, . . . , Xn, each of which has the same probability distribution f. Random sampling from a distribution is fundamental to many applications of modern statistics.

4.2 The Multinomial Distribution

The most important joint discrete distribution is the multinomial defined as

f(x1, x2, . . . , xk) = n! ∏_{i=1}^k pi^xi / xi!
where
xi = 0, 1, 2, . . . , n , i = 1, 2, . . . , k , Σ_{i=1}^k xi = n
0 ≤ pi ≤ 1 , i = 1, 2, . . . , k , Σ_{i=1}^k pi = 1
The multinomial is the basis for the analysis of trials where the outcomes are not binary but of k distinct types and in the analysis of tables of data which consist of counts of the number of times certain response patterns occur. Note that if k = 2 the multinomial reduces to the binomial.

example: Suppose we are interested in the daily pattern of "accidents" in a manufacturing firm. Assuming individuals in the firm have accidents independently of others then the probability of accidents by day has the multinomial distribution
P(x1, x2, x3, x4, x5) = [n!/(x1! x2! x3! x4! x5!)] p1^x1 p2^x2 p3^x3 p4^x4 p5^x5

where pi is the probability of an accident on day i and i indexes working days. Of interest is whether or not the pi are equal. If they are not we might be interested in which seem too large.

example: This data set consists of the cross classification of 12,763 applications for admission to graduate programs at the University of California at Berkeley in 1973. The data were classified by gender and admission status. Of interest is the possibility of gender bias in the admissions policy of the university.

            Admissions Outcome
Gender      Admitted    Not Admitted
Male        3738        4704
Female      1494        2827

In general we have that n individuals are investigated and their gender and admission outcome is recorded. The data are thus of the form:

Gender      Admitted    Not Admitted
Male        n00         n01
Female      n10         n11

To model this data we assume that individuals are independent and that the possible response patterns for an individual are given by one of the following:

(male, admitted) = (0, 0)        (female, admitted) = (1, 0)
(male, not admitted) = (0, 1)    (female, not admitted) = (1, 1)

Denoting the corresponding probabilities by p00, p01, p10 and p11 the multinomial model applies and we have the probabilities of the observed responses given by
[n!/(n00! n01! n10! n11!)] p00^n00 p01^n01 p10^n10 p11^n11

The random variables are thus N00, N01, N10 and N11.

In the model above the probabilities are thus given by

Gender                          Admitted    Not Admitted    Marginal of Gender
Male                            p00         p01             p0+
Female                          p10         p11             p1+
Marginal of Admission Status    p+0         p+1             1

Note that p+0 gives the probability of admission and that p0+ gives the probability of being male. It is clear (why?) that the marginal distribution of admission is binomial with parameters n and p = p+0.

The probability that N00 = n00 and N01 = n01 given that N00 + N01 = n0+ gives the probability of admission given male and is

P (N00 = n00,N01 = n01|N00 + N01 = n0+)

This conditional probability is given by:
P(N00 = n00, N01 = n0+ − n00) / P(N00 + N01 = n0+)
  = { [n!/((n0+ − n00)! n00! (n − n0+)!)] p00^n00 p01^(n0+ − n00) (1 − p0+)^(n − n0+) }
    / { [n!/(n0+! (n − n0+)!)] p0+^n0+ (1 − p0+)^(n − n0+) }
  = [n0+!/(n00! (n0+ − n00)!)] (p00/p0+)^n00 (p01/p0+)^(n0+ − n00)
  = C(n0+, n00) p*^n00 (1 − p*)^(n0+ − n00)

which is a binomial distribution with parameters n0+, the number of males, and
p* = p00/p0+
Note that the odds of admission given male are
p*/(1 − p*) = p00/p01

Similarly the probability of admission given female is binomial with parameters n1+, the number of females, and P* where
P* = p10/p1+
Note that the odds in this case are given by
P*/(1 − P*) = p10/p11

Thus the odds ratio of admission (female to male) is given by

(p10/p11)/(p00/p01) = p01 p10/(p00 p11)

If the odds ratio is one, gender and admission are independent. (Why?) It follows that the odds ratio is a natural measure of association for categorical data. In the example the odds of admission for males is estimated by

odds of admission for males = (3738/8442)/(4704/8442) = 3738/4704 = 0.79

odds of admission given female = (1494/4321)/(2827/4321) = 1494/2827 = 0.53

Thus the odds of admission are lower for females. The odds ratio is estimated by

odds ratio of admission (females to males) = (1494/2827)/(3738/4704) = (1494 × 4704)/(2827 × 3738) = 0.67
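The three estimates come straight from the four cell counts; a quick check (our own code):

    admitted = {"male": 3738, "female": 1494}
    not_admitted = {"male": 4704, "female": 2827}

    odds = {g: admitted[g] / not_admitted[g] for g in admitted}
    print(odds["male"], odds["female"], odds["female"] / odds["male"])
    # 0.79, 0.53, odds ratio (female to male) about 0.67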

Is this odds ratio different enough from 1 to claim that females are discriminated against in the admissions policy? More later!!!

4.3 The Multivariate Normal Distribution

The most important joint continuous distribution is the multivariate normal distribution. The density function of X is given by
f(x) = (2π)^(−k/2) [det(V)]^(−1/2) exp{ −(1/2)(x − µ)^T V^(−1) (x − µ) }
where we assume that V is a non-singular, symmetric, positive definite matrix of rank k. The two parameters of this distribution are µ and V.

• It can be shown that the marginal distribution of any Xi is normal with parameters µi and vii where
µ = (µ1, µ2, . . . , µk)^T and V is the k × k symmetric matrix
    | v11  v12  ···  v1k |
V = | v12  v22  ···  v2k |
    |  .    .    .    .  |
    | v1k  v2k  ···  vkk |

• It can also be shown that the distribution of a linear combination of multivariate normal random variables is also multivariate normal. More precisely let W = a + BY where a is p × 1 and B is a p × k matrix with p ≤ k. Then the joint distribution of W is multivariate normal with parameters

T µW = a + BµY and VW = BVY B

where B^T is the transpose of B.

• It can also be shown that the conditional distribution of any subset of X given any other subset is multivariate normal. More precisely, let
$$\mathbf{X} = \begin{pmatrix}\mathbf{X}_1\\ \mathbf{X}_2\end{pmatrix};\qquad
\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2\end{pmatrix},\qquad
\mathbf{V} = \begin{pmatrix}\mathbf{V}_{11} & \mathbf{V}_{12}\\ \mathbf{V}_{12}^{T} & \mathbf{V}_{22}\end{pmatrix}$$

where A^T denotes the transpose of A. Then the conditional distribution of X2 given X1 = x1 is also multivariate normal with
$$\boldsymbol{\mu}_* = \boldsymbol{\mu}_2 + \mathbf{V}_{12}^{T}\mathbf{V}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1);\qquad
\mathbf{V}_* = \mathbf{V}_{22} - \mathbf{V}_{12}^{T}\mathbf{V}_{11}^{-1}\mathbf{V}_{12}$$
(a numerical sketch of these formulas follows this list)

• It follows that if X1 and X2 have a multivariate normal distribution then they are independent if and only if V12 = 0
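As a concrete illustration of the conditioning formulas above, the sketch below (an added illustration with made-up numbers, not part of the original notes) computes the conditional mean and variance of X2 given X1 = x1 for a small partitioned covariance matrix.

import numpy as np

# Hypothetical parameters for (X1, X2) with X1 two-dimensional and X2 one-dimensional
mu1 = np.array([1.0, 2.0])
mu2 = np.array([0.5])
V11 = np.array([[2.0, 0.3],
                [0.3, 1.0]])
V12 = np.array([[0.4],
                [0.2]])              # the cov(X1, X2) block, here 2 x 1
V22 = np.array([[1.5]])

x1 = np.array([1.5, 1.0])            # observed value of X1

# mu_star = mu2 + V12' V11^{-1} (x1 - mu1),  V_star = V22 - V12' V11^{-1} V12
V11_inv = np.linalg.inv(V11)
mu_star = mu2 + V12.T @ V11_inv @ (x1 - mu1)
V_star = V22 - V12.T @ V11_inv @ V12
print(mu_star, V_star)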

The multivariate normal distribution forms the basis for regression analysis, analysis of variance and a variety of other statistical methods including factor analysis and latent variable analysis.

4.4 Parameters of Joint Distributions

4.4.1 Means, Variances, Covariances and Correlation

The collection of expected values of the marginal distributions of Y is called the expected value of Y and is written as
$$E(\mathbf{Y}) = \boldsymbol{\mu} = \begin{pmatrix}E(Y_1)\\ E(Y_2)\\ \vdots\\ E(Y_k)\end{pmatrix} = \begin{pmatrix}\mu_1\\ \mu_2\\ \vdots\\ \mu_k\end{pmatrix}$$

The covariance between X and Y, where X and Y have a joint distribution, is defined by
$$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

The correlation between X and Y is defined as
$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$
and is simply a standardized covariance. Correlations have the property that
$$-1 \le \rho(X, Y) \le 1$$
Using the properties of expected values we see that covariances have the following properties:

• cov (X,Y ) = cov (Y,X)

• cov (X,X) = var (X)

• cov (X + a, Y + b) = cov (X,Y )

• cov (aX, bY ) = ab cov (X, Y )

• cov (aX + bY, cW + dZ) = ac cov (X,W ) + ad cov (X,Z) + bc cov (Y,W ) + bd cov (Y,Z)
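A quick numerical check of these properties (an added illustration with simulated data, not part of the original notes): with a large simulated sample, the sample covariances satisfy the identities above up to simulation error.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)      # y is correlated with x

def cov(u, v):
    """Sample covariance of two arrays."""
    return np.cov(u, v)[0, 1]

a, b = 2.0, -3.0
print(cov(x, y), cov(y, x))                   # symmetry
print(cov(x, x), np.var(x, ddof=1))           # cov(X, X) = var(X)
print(cov(x + 1.0, y + 5.0), cov(x, y))       # adding constants changes nothing
print(cov(a * x, b * y), a * b * cov(x, y))   # constants factor out as a product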

We define the variance covariance matrix of Y as
$$\mathbf{V}_Y = \begin{pmatrix}
\mathrm{var}(Y_1) & \mathrm{cov}(Y_1, Y_2) & \cdots & \mathrm{cov}(Y_1, Y_k)\\
\mathrm{cov}(Y_2, Y_1) & \mathrm{var}(Y_2) & \cdots & \mathrm{cov}(Y_2, Y_k)\\
\vdots & \vdots & \ddots & \vdots\\
\mathrm{cov}(Y_k, Y_1) & \mathrm{cov}(Y_k, Y_2) & \cdots & \mathrm{var}(Y_k)
\end{pmatrix}$$

Note that for the multivariate normal distribution with parameters µ and V we have that E(Y) = µ and VY = V. Thus the two parameters of the multivariate normal are, respectively, the mean vector and the variance covariance matrix.

4.4.2 Joint Moment Generating Functions

The joint moment generating function of X1,X2,...,Xk is defined as

$$M_X(\mathbf{t}) = E\left(e^{\sum_{i=1}^{k} t_i X_i}\right)$$

• Partial derivatives with respect to ti, evaluated at t1 = t2 = ··· = tk = 0, give the moments of Xi; mixed partial derivatives (e.g. with respect to ti and tj) give the mixed moments E(XiXj), from which the covariances follow (see the short derivation after this list).

• Joint moment generating functions are unique (if two distributions have the same moment generating function then the two distributions are the same).

• The joint moment generating function for the multivariate normal distribution is given by
$$M_X(\mathbf{t}) = \exp\left\{\boldsymbol{\mu}^{T}\mathbf{t} + \frac{1}{2}\,\mathbf{t}^{T}\mathbf{V}\mathbf{t}\right\}
= \exp\left\{\sum_{i=1}^{k} t_i\mu_i + \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{k} t_i t_j v_{ij}\right\}$$

• If random variables are independent then their joint moment generating function is equal to the product of the individual moment generating functions.
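As a check of the mixed-partial statement above (a short added derivation, not in the original notes), differentiate the multivariate normal moment generating function twice:
$$\frac{\partial M_X(\mathbf{t})}{\partial t_j} = M_X(\mathbf{t})\,\bigl(\mu_j + (\mathbf{V}\mathbf{t})_j\bigr),\qquad
\frac{\partial^2 M_X(\mathbf{t})}{\partial t_i\,\partial t_j} = M_X(\mathbf{t})\,\bigl(\mu_i + (\mathbf{V}\mathbf{t})_i\bigr)\bigl(\mu_j + (\mathbf{V}\mathbf{t})_j\bigr) + M_X(\mathbf{t})\,v_{ij}$$
Setting t = 0 gives E(XiXj) = µiµj + vij, so that cov(Xi, Xj) = E(XiXj) − µiµj = vij, as claimed.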

4.5 Functions of Jointly Distributed Random Variables

If Y = g(X) is any function of the random variables X we can find its distribution exactly as in the one variable case, i.e.
$$f_Y(y) = \sum_{\mathbf{x}:\,g(\mathbf{x}) = y} f(x_1, x_2, \ldots, x_k)\quad\text{if }\mathbf{X}\text{ is discrete}$$

$$f_Y(y) = \frac{dF_Y(y)}{dy}\quad\text{if }\mathbf{X}\text{ is continuous, where}\quad
F_Y(y) = \int_{\{\mathbf{x}:\,g(\mathbf{x})\le y\}} f(x_1, x_2, \ldots, x_k)\,dx_1\,dx_2\cdots dx_k$$

Thus we can find the distribution of the sum, the difference, a linear combination, a ratio, a product, etc. We shall not derive all of the results we use in later sections but we shall record a few of the most important results here

• If X has a multivariate normal distribution with mean µ and variance covariance matrix V then the distribution of

$$Y = a + \mathbf{b}^{T}\mathbf{X} = a + \sum_{i=1}^{k} b_i X_i$$
is normal with

$$E(Y) = a + \mathbf{b}^{T}\boldsymbol{\mu} = a + \sum_{i=1}^{k} b_i E(X_i)
\qquad\text{and}\qquad
\mathrm{var}(Y) = \mathbf{b}^{T}\mathbf{V}\mathbf{b} = \sum_{i=1}^{k}\sum_{j=1}^{k} b_i b_j\,\mathrm{cov}(X_i, X_j)$$

• If Z1,Z2,...,Zr are independent each N(0, 1) then the distribution of

$$Z_1^2 + Z_2^2 + \cdots + Z_r^2$$

is chi-square with r degrees of freedom.

• If Z is N(0, 1) and W is chi-square with r degrees of freedom, and Z and W are independent, then
$$T = \frac{Z}{\sqrt{W/r}}$$
has a Student's t distribution with r degrees of freedom.

• If Z1 and Z2 are each N(0, 1) and independent then the distribution of the ratio
$$C = \frac{Z_1}{Z_2}$$
is Cauchy with parameters 0 and 1.
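The last two constructions are easy to check by simulation. The sketch below (an added illustration, not part of the original notes) builds the t and Cauchy variables from standard normals and compares a few sample quantiles with the theoretical quantiles from scipy.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, r = 200_000, 5

# Student's t with r degrees of freedom from a N(0,1) and an independent chi-square
z = rng.normal(size=n)
w = rng.chisquare(df=r, size=n)
t_sim = z / np.sqrt(w / r)

# Cauchy(0, 1) as the ratio of two independent N(0,1) variables
c_sim = rng.normal(size=n) / rng.normal(size=n)

for q in (0.25, 0.5, 0.9):
    print(q,
          np.quantile(t_sim, q), stats.t.ppf(q, df=r),
          np.quantile(c_sim, q), stats.cauchy.ppf(q))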

4.5.1 Linear Combinations of Random Variables

If X1, X2, ..., Xn have a joint distribution with parameters µ1, µ2, ..., µn and variances and covariances given by cov (Xi, Xj) = vij, then the expected value of the linear combination is given by
$$E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i) = \sum_{i=1}^{n} a_i\mu_i$$

and the variance is given by
$$\mathrm{var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\,\mathrm{cov}(X_i, X_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j v_{ij}$$

If we write
$$\boldsymbol{\mu} = \begin{pmatrix}\mu_1\\ \mu_2\\ \vdots\\ \mu_n\end{pmatrix};\qquad
\mathbf{V} = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1n}\\ v_{21} & v_{22} & \cdots & v_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ v_{n1} & v_{n2} & \cdots & v_{nn}\end{pmatrix}$$
we see that the above results may be written as

$$E(\mathbf{a}^{T}\mathbf{X}) = \mathbf{a}^{T}\boldsymbol{\mu};\qquad \mathrm{var}(\mathbf{a}^{T}\mathbf{X}) = \mathbf{a}^{T}\mathbf{V}\mathbf{a}$$
As special cases we have (a numerical sketch follows this list):

• var (X + Y ) = var (X) + var (Y ) + 2 cov (X,Y )

• var (X − Y ) = var (X) + var (Y ) − 2 cov (X,Y )

• Thus if X and Y are uncorrelated with the same variance σ^2 we have

– var (X + Y ) = 2σ^2
– var (X − Y ) = 2σ^2

• More generally, if X1, X2, ..., Xn are uncorrelated then
$$\mathrm{var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2\,\mathrm{var}(X_i)$$

– In particular, if we take each ai = 1/n we have
$$\mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}$$
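The matrix forms above are convenient to compute with directly. The sketch below (an added illustration with made-up numbers) evaluates E(a'X) and var(a'X) from a mean vector and covariance matrix, and checks the σ²/n special case for the mean of uncorrelated variables.

import numpy as np

# Hypothetical mean vector and covariance matrix for (X1, X2, X3)
mu = np.array([1.0, 2.0, 3.0])
V = np.array([[4.0, 1.0, 0.5],
              [1.0, 9.0, 2.0],
              [0.5, 2.0, 16.0]])
a = np.array([2.0, -1.0, 0.5])

print(a @ mu, a @ V @ a)        # E(a'X) = a'mu and var(a'X) = a'Va

# Special case: mean of n uncorrelated variables with common variance sigma^2
n, sigma2 = 10, 4.0
a_bar = np.full(n, 1.0 / n)
V_uncorr = sigma2 * np.eye(n)
print(a_bar @ V_uncorr @ a_bar, sigma2 / n)    # both equal 0.4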

4.6 Approximate Means and Variances

In some problems we cannot find the expected value, variance, or distribution of Y = g(X) exactly. It is useful to have approximations for the means and variances in such cases. If the function g is reasonably linear in a neighborhood of µX, the expected value of X, then we can write
$$Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X)$$
by Taylor's Theorem. Hence we have

$$E(Y) \approx g(\mu_X)\qquad\text{and}\qquad \mathrm{var}(Y) \approx [g^{(1)}(\mu_X)]^2\,\sigma_X^2$$
We can get an improved approximation to the expected value of Y by writing
$$Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X) + \frac{1}{2}\,g^{(2)}(\mu_X)(X - \mu_X)^2$$
Thus
$$E(Y) \approx g(\mu_X) + \frac{1}{2}\,g^{(2)}(\mu_X)\,\sigma_X^2$$
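As a quick check of the second-order correction (a small worked example added here, not in the original notes), take g(x) = x², so that g^(1)(µX) = 2µX and g^(2)(µX) = 2:
$$E(X^2) \approx \mu_X^2 + \frac{1}{2}\cdot 2\cdot\sigma_X^2 = \mu_X^2 + \sigma_X^2$$
which in this case is exact, while the first-order approximation alone would give only µX².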

If Z = g(X, Y) is a function of two random variables then we can write
$$Z = g(X, Y) \approx g(\boldsymbol{\mu}) + \frac{\partial g(\boldsymbol{\mu})}{\partial x}(X - \mu_X) + \frac{\partial g(\boldsymbol{\mu})}{\partial y}(Y - \mu_Y)$$
where µ denotes the point (µX, µY) and
$$\frac{\partial g(\boldsymbol{\mu})}{\partial x} = \left.\frac{\partial g(x, y)}{\partial x}\right|_{x=\mu_X,\,y=\mu_Y}$$
Thus we have that
$$E(Z) \approx g(\boldsymbol{\mu})$$
$$\mathrm{var}(Z) \approx \left[\frac{\partial g(\boldsymbol{\mu})}{\partial x}\right]^2\sigma_X^2
+ \left[\frac{\partial g(\boldsymbol{\mu})}{\partial y}\right]^2\sigma_Y^2
+ 2\left[\frac{\partial g(\boldsymbol{\mu})}{\partial x}\right]\left[\frac{\partial g(\boldsymbol{\mu})}{\partial y}\right]\mathrm{cov}(X, Y)$$
As in the single variable case we can obtain an improved approximation for the expected value by using Taylor's Theorem with second order terms, e.g.
$$E(Z) \approx g(\boldsymbol{\mu}) + \frac{1}{2}\left[\frac{\partial^2 g(\boldsymbol{\mu})}{\partial x^2}\right]\sigma_X^2
+ \frac{1}{2}\left[\frac{\partial^2 g(\boldsymbol{\mu})}{\partial y^2}\right]\sigma_Y^2
+ \left[\frac{\partial^2 g(\boldsymbol{\mu})}{\partial x\,\partial y}\right]\mathrm{cov}(X, Y)$$

• Note 1: The improved approximation is needed for the expected value because in general E[g(X)] ≠ g(µ), e.g. E(X^2) ≠ µ^2.

• Note 2: Some care is needed when working with discrete variables and certain functions. Thus if X is binomial with parameters n and p the expected value of log(X) is not defined (X = 0 has positive probability), so that no approximation can be correct.

4.7 Sampling Distributions of Statistics

Definition: A statistic is a numerical quantity calculated from a set of data. Typically a statistic is designed to provide information about some parameter of the population.

• If x1, x2, ..., xn are the data, some statistics are
– x̄, the sample mean
– the median
– the upper quartile
– s^2, the sample variance
– the range

• Since the data are realized values of random variables, a statistic is also a realized value of a random variable.

• The probability distribution of this random variable is called the sampling distribution of the statistic.

In most contemporary applications of statistics the sampling distribution of a statistic is used to assess its performance for inference about population parameters. The following is a schematic diagram of the concept of the sampling distribution of a statistic.

Figure 4.1: Schematic diagram of the sampling distribution of a statistic.

Figure 4.2: Illustration of Sampling Distributions: Sampling Distribution of the Sample Mean, Sample Size 25.

Figure 4.3: Illustration of Sampling Distributions: Sampling Distribution of (n − 1)s²/σ², Sample Size n = 10.

Figure 4.4: Illustration of Sampling Distributions: Sampling Distribution of t = √n(x̄ − µ)/s, Sample Size n = 10.

example: Given a sample of data suppose we calculate the sample mean x̄ and the sample median q.5. Which of these is a better measure of the center of the population?

• If we assume that the data represent a random sample from a probability distribution which is N(µ, σ2) then it is known that:

◦ the sampling distribution of X̄ is N(µ, σ²/n)
◦ the sampling distribution of the sample median is approximately N(µ, (π/2)(σ²/n)).

• Thus the sample mean will, on average, be closer to the population mean than will the sample median, and so the sample mean is preferred as an estimate of the population mean (a simulation sketch follows this list).

• If the underlying population is not N(µ, σ2) then the above result does not hold and the sample median may be the preferred estimate.

• It follows that the role of assumptions about the underlying probability model is crucial in the development and assessment of statistical procedures.
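The simulation sketch below (an added illustration, not part of the original notes) draws repeated samples from a normal population and compares the empirical variances of the sample mean and the sample median with the σ²/n and (π/2)σ²/n values quoted above.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 1.0, 25, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.var(ddof=1), sigma**2 / n)                    # both near 0.04
print(medians.var(ddof=1), (np.pi / 2) * sigma**2 / n)    # both near 0.063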

4.8 Methods of Obtaining Sampling Distributions or Approximations

There are three methods used to obtain information on sampling distributions:

• Exact sampling distributions. Statisticians have, over the last 100 years, developed the sampling distributions for a variety of useful statistics for specific parametric models. For the most part these statistics are simple functions of the sample data such as the sample mean, the sample variance, etc.

• Asymptotic (approximate) distributions. When exact sampling distributions are not tractable we may find the distribution of the statistic for large sample sizes. These are called asymptotic methods and are surprisingly useful.

• Computer intensive methods. These are based on resampling from the empirical distribution of the data and have been shown to have useful properties. The most important of these methods is called the bootstrap.

4.8.1 Exact Sampling Distributions

Here we find the exact sampling distribution of the statistic using the methods previously discussed. The most famous example of this method is the result that if we have a random sample from a normal distribution then the distribution of the sample mean is also normal. Other examples include the distribution of the sample variance from a normal sample, the t distribution and the F distribution.

4.8.2 Asymptotic Distributions

4.8.3 Central Limit Theorem

If we cannot find the exact sampling distribution of a statistic we may be able to find its mean and variance. If the sampling distribution were approximately normal then we would be able to make approximate statements using just the mean and variance. In the discussion of the Binomial and Poisson distributions we noted that for large n the distributions could be approximated by the normal distribution.

• In fact, the sampling distribution of X¯ for almost any population distribution becomes more and more similar to the normal distribution regardless of the shape of the original distribution as n increases.

More precisely:

Central Limit Theorem: If X1, X2, ..., Xn are independent, each with the same distribution having expected value µ and variance σ², then the sampling distribution of X̄ is approximately N(µ, σ²/n), i.e.
$$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) \approx P(Z \le z)$$
where P(Z ≤ z) is the area under the standard normal curve up to z. The Central Limit Theorem has been extended and refined over the last 75 years.

• Many statistics have sampling distributions which are approximately normal.

• This explains the great use of the normal distribution in statistics.

• In particular, whenever a measurement can be thought of as a sum of individual components we may expect it to be approximately normal.

4.8.4 Central Limit Theorem Example

We now illustrate the Central Limit Theorem and some other results on sampling distributions. The data set consists of a population of 1826 children whose blood lead values (milligrams per deciliter) were recorded at the Johns Hopkins Hospital. The data are courtesy of Dr. Janet Serwint. Lead in children is a serious public health problem: lead levels exceeding 15 milligrams per deciliter are considered to have implications for learning disabilities, are implicated in violent behavior, and are the concern of major governmental efforts aimed at reducing exposure. The distribution in real populations is often assumed to follow a log-normal distribution, i.e. the natural logarithm of blood lead values is normally distributed.

Note the asymmetry of the distribution of blood lead values, and note that the log transformation results in a decided improvement in symmetry, indicating that the log-normal assumption is probably appropriate. We select random samples from the population of blood lead readings and log blood lead readings: 100 random samples of size 10, 25 and 100 respectively. As the histograms indicate, the distribution of the sample means of the blood lead values does indeed appear normal even though the distribution of blood lead values is highly skewed.

Figure 4.5: Histograms of Blood Lead and Log Blood Lead Values

Figure 4.6: Histograms of Sample Means of Blood Lead Values

Figure 4.7: Histograms of Sample Means of Log Blood Lead Values

The summary statistics for blood lead values and the samples are as follows

> summary(blpb)
   Min. 1st Qu. Median  Mean 3rd Qu. Max.
      0       5      8 9.773      12  128
> var(blpb)
71.79325

Sample Size    Mean    Variance
        10     9.93        9.53
        25     9.75        2.91
       100     9.87         .72

The summary statistics for log blood lead values and the samples are as follows

> summary(logblpb)
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 -1.386   1.658   2.11 2.084   2.506 4.854
> var(logblpb)
[1] 0.4268104

Sample Size    Mean    Variance
        10     2.07        .037
        25     2.08        .017
       100     2.08        .004
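The blood lead data themselves are not reproduced here, but the sampling experiment just described is easy to mimic. The sketch below (an added illustration using a synthetic skewed population, not the actual data) draws 100 samples of each size and records the mean and variance of the resulting sample means; the variance of the sample means should be close to the population variance divided by the sample size.

import numpy as np

rng = np.random.default_rng(3)
# Synthetic skewed "population", standing in for the 1826 blood lead values
population = rng.lognormal(mean=2.0, sigma=0.7, size=1826)

for size in (10, 25, 100):
    sample_means = np.array([rng.choice(population, size=size, replace=False).mean()
                             for _ in range(100)])
    print(size, round(sample_means.mean(), 2), round(sample_means.var(ddof=1), 3),
          round(population.var() / size, 3))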

4.8.5 Law of Large Numbers

Under quite weak conditions, the average of a sample is "close" to the population average if the sample size is large. More precisely:

Law of Large Numbers: If we have a random sample X1,X2,...,Xn from a distribution with expected value µ and variance σ2 then

P (X¯ ≈ µ) ≈ 1

for n sufficiently large. The approximation becomes closer the larger the value of n. We write X̄ →p µ and say that X̄ converges in probability to µ. If g is a continuous function and X̄ converges in probability to µ, then g(X̄) converges in probability to g(µ). Some idea of the value of n needed can be obtained from Chebyshev's inequality, which states that
$$P(-k \le \bar{X} - \mu \le k) \ge 1 - \frac{\sigma^2}{n k^2}$$
where k is any positive constant.
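For example (a small worked illustration added here), to guarantee P(|X̄ − µ| ≤ k) ≥ 0.95 it suffices by Chebyshev's inequality to have
$$1 - \frac{\sigma^2}{n k^2} \ge 0.95,\qquad\text{i.e.}\qquad n \ge \frac{20\,\sigma^2}{k^2}$$
so with σ = 1 and k = 0.1 a sample size of n ≥ 2000 suffices. The bound is conservative; the Central Limit Theorem suggests that roughly n ≈ 385 would already be enough.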

Figure 4.8: Law of Large Numbers Examples

4.8.6 The Delta Method - Univariate

For statistics Sn which are normal or approximately normal the delta method can be used to find the approximate distribution of g(Sn), a function of Sn. The technique is based on approximating g by a linear function, as in obtaining approximations to expected values and variances of functions, i.e.

$$g(S_n) \approx g(\mu) + g^{(1)}(\mu)(S_n - \mu)$$

where Sn converges in probability to µ and g^(1)(µ) is the derivative of g evaluated at µ. Thus we have that
$$g(S_n) - g(\mu) \approx g^{(1)}(\mu)(S_n - \mu)$$
If √n(Sn − µ) has an exact or approximate normal distribution with mean 0 and variance σ², then
$$\sqrt{n}\,[g(S_n) - g(\mu)]$$
has an approximate normal distribution with mean 0 and variance

$$[g^{(1)}(\mu)]^2\,\sigma^2$$

It follows that we may make approximate calculations by treating g(Sn) as if it were normal with mean g(µ) and variance [g^(1)(µ)]²σ²/n, i.e.
$$P(g(S_n) \le s) = P\left(\frac{g(S_n) - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2\,\sigma^2/n}} \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2\,\sigma^2/n}}\right)
\approx P\left(Z \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2\,\sigma^2/n}}\right)$$

where Z is N(0, 1). In addition, if g^(1)(µ) is continuous then we can replace µ by Sn in the formula for the variance.

example: Let X be binomial with parameters n and p and let Sn = X/n. Then we know by the Central Limit Theorem that the approximate distribution of √n(Sn − p) is N(0, pq). If we define
$$g(x) = \ln\left(\frac{x}{1-x}\right) = \ln(x) - \ln(1-x)$$
then
$$g^{(1)}(x) = \frac{1}{x} + \frac{1}{1-x} = \frac{(1-x) + x}{x(1-x)} = \frac{1}{x(1-x)}$$
Thus
$$g^{(1)}(p) = \frac{1}{pq}$$
and hence
$$\sqrt{n}\left[\ln\left(\frac{S_n}{1-S_n}\right) - \ln\left(\frac{p}{1-p}\right)\right]$$
is approximately normal with mean 0 and variance
$$\frac{pq}{(pq)^2} = \frac{1}{p} + \frac{1}{q}$$

Since g^(1)(µ) is continuous we may treat the sample log odds ln(Sn/(1 − Sn)) as if it were normal with
$$\text{mean } \ln\left(\frac{p}{1-p}\right)\quad\text{and variance}\quad \frac{1}{n}\left[\frac{1}{S_n} + \frac{1}{1-S_n}\right] = \frac{1}{X} + \frac{1}{n-X}$$

Thus the distribution of the sample log odds in a binomial may be approximated by a normal distribution with mean equal to the population log odds and variance equal to the sum of the reciprocals of the number of successes and the number of failures.

4.8.7 The Delta Method - Multivariate

More generally, if we have a collection of statistics S1, S2, ..., Sk then we say that they are approximately multivariate normally distributed with mean µ and variance covariance matrix V if
$$\sqrt{n}\,\mathbf{a}^{T}(\mathbf{S}_n - \boldsymbol{\mu})$$
has an approximate normal distribution with mean 0 and variance a^T V a for any a.

In this case the distribution of g(Sn) is also approximately normal, i.e.
$$\sqrt{n}\,[g(\mathbf{S}_n) - g(\boldsymbol{\mu})]$$

is approximately normal with mean 0 and variance
$$\sigma_g^2 = \nabla(\boldsymbol{\mu})^{T}\,\mathbf{V}\,\nabla(\boldsymbol{\mu})
\qquad\text{where}\qquad
\nabla(\boldsymbol{\mu}) = \begin{pmatrix}\dfrac{\partial g(\boldsymbol{\mu})}{\partial\mu_1}\\[1.5ex] \dfrac{\partial g(\boldsymbol{\mu})}{\partial\mu_2}\\ \vdots\\ \dfrac{\partial g(\boldsymbol{\mu})}{\partial\mu_k}\end{pmatrix}$$

Thus we may make approximate calculations by treating g(Sn) as if it were normal with mean g(µ) and variance σg²/n, i.e.
$$P(g(\mathbf{S}_n) \le s) = P\left(\frac{g(\mathbf{S}_n) - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}} \le \frac{s - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}}\right)
\approx P\left(Z \le \frac{s - g(\boldsymbol{\mu})}{\sqrt{\sigma_g^2/n}}\right)$$
where Z is N(0, 1). In addition, if each partial derivative is continuous we may replace µ by Sn in the formula for the variance.

example: Let X1 be binomial with parameters n and p1, let X2 be binomial with parameters n and p2, and let them be independent. Then the joint distribution of
$$\mathbf{S}_n = \begin{pmatrix}S_{1n}\\ S_{2n}\end{pmatrix} = \begin{pmatrix}X_1/n\\ X_2/n\end{pmatrix}$$
is such that √n(Sn − p) is approximately multivariate normal with mean 0 and variance covariance matrix V where
$$\mathbf{V} = \begin{pmatrix}p_1 q_1 & 0\\ 0 & p_2 q_2\end{pmatrix}$$

Thus if
$$g(\mathbf{p}) = \ln\left(\frac{p_2}{1-p_2}\right) - \ln\left(\frac{p_1}{1-p_1}\right) = \ln(p_2) - \ln(1-p_2) - \ln(p_1) + \ln(1-p_1)$$
we have that
$$\frac{\partial g(\mathbf{p})}{\partial p_1} = -\frac{1}{p_1} - \frac{1}{1-p_1} = -\frac{1}{p_1 q_1},\qquad
\frac{\partial g(\mathbf{p})}{\partial p_2} = \frac{1}{p_2} + \frac{1}{1-p_2} = \frac{1}{p_2 q_2}$$
It follows that
$$\sigma_g^2 = \begin{pmatrix}-\dfrac{1}{p_1 q_1} & \dfrac{1}{p_2 q_2}\end{pmatrix}
\begin{pmatrix}p_1 q_1 & 0\\ 0 & p_2 q_2\end{pmatrix}
\begin{pmatrix}-\dfrac{1}{p_1 q_1}\\[1.5ex] \dfrac{1}{p_2 q_2}\end{pmatrix}
= \frac{1}{p_1 q_1} + \frac{1}{p_2 q_2}$$
Since the partial derivatives are continuous we may treat the sample log odds ratio as if it were normal with mean equal to the population log odds ratio
$$\ln\left(\frac{p_2/(1-p_2)}{p_1/(1-p_1)}\right)$$
and variance
$$\frac{1}{X_1} + \frac{1}{n-X_1} + \frac{1}{X_2} + \frac{1}{n-X_2}$$

If we write the sample data as
$$\begin{array}{lll}\text{sample 1:} & X_1 = a, & n - X_1 = b\\ \text{sample 2:} & X_2 = c, & n - X_2 = d\end{array}$$
then the above formula reads as
$$\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$
a very widely used formula in epidemiology.

Technical Notes

(1) Only a minor modification is needed to show that the result is true when the sample size in the two binomials is different provided that the ratio of the sample sizes does not tend to 0.

(2) The log odds ratio is much more nearly normally distributed than the odds ratio.

We generate 1000 samples of size 20 from each of two binomial populations, one with parameter .3 and the other with parameter .5. It follows that the population odds ratio and the population log odds ratio are given by

$$\text{odds ratio} = \frac{.5/.5}{.3/.7} = \frac{7}{3} = 2.333;\qquad \text{log odds ratio} = .8473$$

The asymptotic variance for the log odds ratio is given by the formula

$$\frac{1}{6} + \frac{1}{14} + \frac{1}{10} + \frac{1}{10} = .4381$$
which leads to an asymptotic standard deviation of .6618. The mean of the 1000 simulated log odds ratios is .9127, with variance .5244 and standard deviation .7241.
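A sketch of the simulation just described (an added illustration; the exact figures will differ from those quoted above because of random variation):

import numpy as np

rng = np.random.default_rng(4)
n, p1, p2, reps = 20, 0.3, 0.5, 1000

x1 = rng.binomial(n, p1, size=reps)
x2 = rng.binomial(n, p2, size=reps)

# Keep only replicates with no zero cells so the log odds ratio is defined
ok = (x1 > 0) & (x1 < n) & (x2 > 0) & (x2 < n)
log_or = np.log(x2[ok] / (n - x2[ok])) - np.log(x1[ok] / (n - x1[ok]))

# Compare with the population log odds ratio .8473 and the asymptotic variance .4381
print(log_or.mean(), log_or.var(ddof=1), log_or.std(ddof=1))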

Figure 4.9: Graphs of the Simulated Distributions

4.8.8 Computer Intensive Methods

• The sampling distribution of statistics which are complicated functions of the observations can be approximated using the Delta Method.

• With the advent of fast modern computing, other methods of obtaining sampling distributions have been developed. One of these, called the bootstrap, is of great importance in estimation and in interval estimation.

The Bootstrap Method

Given data x1, x2, ..., xn, a random sample from p(x; θ), we estimate θ by the statistic θ̂. Of interest is the standard error of θ̂. We may not be able to obtain the standard error if θ̂ is a complicated function of the data, nor do we want an asymptotic result which may be suspect if used for small samples. The bootstrap method, introduced in 1979 by Bradley Efron, is a computer intensive method for obtaining the standard error of θ̂ which has been shown to be valid in most situations. The bootstrap method for estimating the standard error of θ̂ is as follows:

(1) Draw a random sample of size n with replacement from the observed data x1, x2, . . . , xn and compute θˆ.

(2) Repeat step 1 a large number, B, of times, obtaining B separate estimates of θ denoted by θ̂1, θ̂2, ..., θ̂B.

(3) Calculate the mean of the estimates in step 2, i.e.
$$\bar\theta = \frac{\sum_{i=1}^{B}\hat\theta_i}{B}$$

(4) The bootstrap estimate of the standard error of θ̂ is given by
$$\hat\sigma_{BS}(\hat\theta) = \sqrt{\frac{\sum_{i=1}^{B}(\hat\theta_i - \bar\theta)^2}{B - 1}}$$

The bootstrap is computationally intensive but is easy to use except in very complex problems. Efron suggests that about 250 samples be drawn (i.e. B = 250) in order to obtain reliable estimates of the standard error. To obtain percentiles of the bootstrap distribution it is suggested that 500 to 1000 bootstrap samples be taken.
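A minimal sketch of these four steps (an added illustration, not the course's Stata implementation):

import numpy as np

def bootstrap_se(data, statistic, B=250, seed=0):
    """Bootstrap estimate of the standard error of statistic(data)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Steps 1 and 2: resample with replacement B times and recompute the statistic
    estimates = np.array([statistic(rng.choice(data, size=n, replace=True))
                          for _ in range(B)])
    # Steps 3 and 4: standard deviation of the B estimates about their mean
    return estimates.std(ddof=1)

# Example usage with a made-up sample and the sample mean as the statistic
data = np.array([4.1, 5.3, 2.2, 7.8, 6.0, 3.3, 5.9, 4.4])
print(bootstrap_se(data, np.mean, B=1000))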

The following is a schematic of the bootstrap procedure.

Figure 4.10: Schematic of the bootstrap procedure.

It is interesting to note that the current citation index for statistics lists about 600 papers involving use of the bootstrap!

References:

1. A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation (1983), B. Efron and G. Gong; The American Statistician, Vol. 37, No. 1.

2. Bootstrapping (1993) C. Mooney and R. Duval; Sage Publications. This is a very readable introduction designed for applications in the Social Sciences.

3. The STATA Manual has an excellent section on the bootstrap and a bootstrap com- mand is available.

The Jackknife Method

The jackknife is another procedure for obtaining estimates and standard errors in situations where

• The exact sampling distribution of the estimate is not known.

• We want an estimate of the standard error of the estimate which is robust against model failure and the assumption of large sample sizes.

The jackknife is computer intensive but relatively easy to implement.

Assume that we have n observations x1, x2, ..., xn which are a random sample from a distribution p. Assume the parameter of interest is θ and that the estimate is θ̂. The jackknife procedure is as follows:

1. Let θ̂(i) denote the estimate of θ determined by eliminating the ith observation.

2. The jackknife estimate of θ is defined by
$$\hat\theta_{(JK)} = \frac{1}{n}\sum_{i=1}^{n}\hat\theta_{(i)}$$
i.e. the average of the θ̂(i).

3. The jackknife estimate of the standard error of θ̂ is given by
$$\hat\sigma_{JK} = \left[\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat\theta_{(i)} - \hat\theta_{(JK)}\right)^2\right]^{1/2}$$
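A minimal sketch of the jackknife recipe (an added illustration, not the course's Stata implementation):

import numpy as np

def jackknife_se(data, statistic):
    """Jackknife estimate of the standard error of statistic(data)."""
    n = len(data)
    # Step 1: recompute the statistic with each observation left out in turn
    leave_one_out = np.array([statistic(np.delete(data, i)) for i in range(n)])
    # Steps 2 and 3: average the leave-one-out estimates and form the standard error
    theta_jk = leave_one_out.mean()
    return np.sqrt((n - 1) / n * np.sum((leave_one_out - theta_jk) ** 2))

# Example usage with a made-up sample and the sample mean as the statistic
data = np.array([4.1, 5.3, 2.2, 7.8, 6.0, 3.3, 5.9, 4.4])
print(jackknife_se(data, np.mean))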

4.8.9 Bootstrap Example

In ancient Greece a rectangle was called a "Golden Rectangle" if the width to length ratio was
$$\frac{2}{\sqrt{5} + 1} = 0.618034$$
This ratio was a design feature of their architecture. The following data set gives the breadth to length ratio of beaded rectangles used by the Shoshani Indians in the decoration of leather goods. Were they also using the Golden Rectangle?

.693 .672 .668 .553 .748 .615 .611 .570 .654 .606 .606 .844 .670 .690 .609 .576 .662 .628 .601 .933

We now use the bootstrap method for the sample mean and the sample median.

. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)

. stem ratio

Stem-and-leaf plot for ratio
ratio rounded to nearest multiple of .001
plot in units of .001

5** | 53,70,76
6** | 01,06,06,09,11,15,28
6** | 54,62,68,70,72,90,93
7** | 48
7** |
8** | 44
8** |
9** | 33

. summarize ratio

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
       ratio |      20      .66045    .0924608       .553       .933

. bs "summarize ratio" "r(mean)", reps(1000) saving(mean)

command:      summarize ratio
statistic:    r(mean)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+--------------------------------------------------------------
     bs1 |  1000    .66045  .0017173   .0197265   .6217399   .6991601  (N)
         |                                         .626775   .70365    (P)
         |                                         .6264     .7021     (BC)
--------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected

. use mean, clear
(bs: summarize ratio)

. kdensity bs1

. kdensity bs1,saving(g1,replace)

. drop _all

. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)

. bs "summarize ratio,detail" "r(p50)", reps(1000) saving(median)

command:      summarize ratio,detail
statistic:    r(p50)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+--------------------------------------------------------------
     bs1 |  1000      .641  -.001711   .0222731   .5972925   .6847075  (N)
         |                                         .6075      .671     (P)
         |                                         .609       .679     (BC)
--------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected

. use median,clear
(bs: summarize ratio,detail)

. kdensity bs1,saving(g2,replace)

. graph using g1 g2

The bootstrap distributions of the sample mean and the sample median are given below:

Figure 4.11: Bootstrap distributions of the sample mean and the sample median.