Statistical Regression Analysis
Larry Winner
University of Florida, Department of Statistics
August 15, 2017

Contents

1 Probability Distributions, Estimation, and Testing
1.1 Introduction
1.1.1 Discrete Random Variables/Probability Distributions
1.1.2 Continuous Random Variables/Probability Distributions
1.2 Linear Functions of Multiple Random Variables
1.3 Functions of Normal Random Variables
1.4 Likelihood Functions and Maximum Likelihood Estimation
1.5 Likelihood Ratio, Wald, and Score (Lagrange Multiplier) Tests
1.5.1 Single Parameter Models
1.5.2 Multiple Parameter Models
1.6 Sampling Distributions and an Introduction to the Bootstrap
2 Simple Linear Regression - Scalar Form
2.1 Introduction
2.2 Inference Regarding β1
2.3 Estimating a Mean and Predicting a New Observation @ X = X∗
2.4 Analysis of Variance
2.5 Correlation
2.6 Regression Through the Origin
2.7 R Programs and Output Based on lm Function
2.7.1 Carpet Aging Analysis
2.7.2 Suntan Product SPF Assessments
3 Simple Linear Regression - Matrix Form
4 Distributional Results
5 Model Diagnostics and Influence Measures
5.1 Checking Linearity
5.2 Checking Normality
5.3 Checking Equal Variance
5.4 Checking Independence
5.5 Detecting Outliers and Influential Observations
6 Multiple Linear Regression
6.1 Testing and Estimation for Partial Regression Coefficients
6.2 Analysis of Variance
6.3 Testing a Subset of βs = 0
6.4 Tests Based on the Matrix Form of Multiple Regression Model
6.4.1 Equivalence of Complete/Reduced Model and Matrix Based Test
6.4.2 R-Notation for Sums of Squares
6.4.3 Coefficients of Partial Determination
6.5 Models With Categorical (Qualitative) Predictors
6.6 Models With Interaction Terms
6.7 Models With Curvature
6.8 Response Surfaces
6.9 Trigonometric Models
6.10 Model Building
6.10.1 Backward Elimination
6.10.2 Forward Selection
6.10.3 Stepwise Regression
6.10.4 All Possible Regressions
6.11 Issues of Collinearity
6.11.1 Principal Components Regression
6.11.2 Ridge Regression
6.12 Models with Unequal Variances (Heteroskedasticity)
6.12.1 Estimated Weighted Least Squares
6.12.2 Bootstrap Methods When Distribution of Errors is Unknown
6.13 Generalized Least Squares for Correlated Errors
7 Nonlinear Regression
8 Random Coefficient Regression Models
8.1 Balanced Model with 1 Predictor
8.2 General Model with p Predictors
8.2.1 Unequal Sample Sizes Within Subjects
8.2.2 Tests Regarding Elements of Σβ
8.2.3 Tests Regarding β
8.2.4 Correlated Errors
8.3 Nonlinear Models
9 Alternative Regression Models
9.1 Introduction
9.2 Binary Responses - Logistic Regression
9.2.1 Interpreting the Slope Coefficients
9.2.2 Inferences for the Regression Parameters
9.2.3 Goodness of Fit Tests and Measures
9.3 Count Data - Poisson and Negative Binomial Regression
9.3.1 Poisson Regression
9.3.2 Goodness of Fit Tests
9.3.3 Overdispersion
9.3.4 Models with Varying Exposures
9.4 Negative Binomial Regression
9.5 Gamma Regression
9.5.1 Estimating Model Parameters
9.5.2 Inferences for Model Parameters and Goodness of Fit Test
9.6 Beta Regression
9.6.1 Estimating Model Parameters
9.6.2 Diagnostics and Influence Measures

Chapter 1
Probability Distributions, Estimation, and Testing
1.1 Introduction
Here we introduce probability distributions and basic estimation/testing methods. Random variables are outcomes of an experiment or data-generating process, where the outcome is not known in advance, although the set of possible outcomes is. Random variables can be discrete or continuous. Discrete random variables can take on only a finite or countably infinite set of possible outcomes. Continuous random variables can take on values along a continuum. In many cases, variables of one type may be treated as, or reported as, the other type. In the introduction, we will use upper-case letters (such as Y) to represent random variables, and lower-case letters (such as y) to represent specific outcomes. Not all books (particularly applied statistics texts) follow this convention.
1.1.1 Discrete Random Variables/Probability Distributions
In many applications, the result of the data-generating process is the count of a number of events of some sort. In some cases, a certain number of trials are conducted, and the outcome of each trial is observed as a “Success” or “Failure” (binary outcomes). In these cases, the number of trials ending in Success is observed. Alternatively, a series of trials may be conducted until a pre-selected number of Successes are observed. In other settings, the number of events of interest may be counted in a fixed amount of time or space, without actually breaking the domain into a set of distinct trials.
For discrete random variables, we will use p(y) to represent the probability that the random variable Y takes on the value y. We require that all such probabilities be bounded between 0 and 1 (inclusive), and that they sum to 1:
$$P\{Y = y\} = p(y) \qquad 0 \le p(y) \le 1 \qquad \sum_y p(y) = 1$$
The cumulative distribution function is the probability that a random variable takes on a value less than or equal to a specific value y∗. It is an increasing function that begins at 0 and increases to 1, and we will denote it as F(y∗). For discrete random variables it is a step function, taking a step at each point where p(y) > 0:
$$F(y^*) = P(Y \le y^*) = \sum_{y \le y^*} p(y)$$
The mean or Expected Value (µ) of a random variable is its long-run average if the experiment were conducted repeatedly ad infinitum. The variance σ² is the average squared difference between the random variable and its mean, measuring the dispersion within the distribution. The standard deviation (σ) is the positive square root of the variance, and is in the same units as the data.
$$\mu_Y = E\{Y\} = \sum_y y\,p(y) \qquad \sigma_Y^2 = V\{Y\} = E\left\{(Y-\mu_Y)^2\right\} = \sum_y (y-\mu_Y)^2\,p(y) \qquad \sigma_Y = +\sqrt{\sigma_Y^2}$$
Note that for any function of Y, the expected value and variance of the function are computed as follows:
$$E\{g(Y)\} = \sum_y g(y)\,p(y) = \mu_{g(Y)} \qquad V\{g(Y)\} = E\left\{\left(g(Y)-\mu_{g(Y)}\right)^2\right\} = \sum_y \left(g(y)-\mu_{g(Y)}\right)^2 p(y)$$
For any constants a and b, we have the mean and variance of the linear function a + bY:
$$E\{a+bY\} = \sum_y (a+by)\,p(y) = a\sum_y p(y) + b\sum_y y\,p(y) = a(1) + bE\{Y\} = a + b\mu_Y$$
$$V\{a+bY\} = \sum_y \left((a+by) - (a+b\mu_Y)\right)^2 p(y) = b^2\sum_y (y-\mu_Y)^2\,p(y) = b^2\sigma_Y^2$$
A very useful result in mathematical statistics is the following:
$$\sigma_Y^2 = V\{Y\} = E\left\{(Y-\mu_Y)^2\right\} = E\left\{Y^2 - 2\mu_Y Y + \mu_Y^2\right\} = E\{Y^2\} - 2\mu_Y E\{Y\} + \mu_Y^2 = E\{Y^2\} - \mu_Y^2$$
Thus, $E\{Y^2\} = \sigma_Y^2 + \mu_Y^2$. Also, from this result we obtain $E\{Y(Y-1)\} = \sigma_Y^2 + \mu_Y^2 - \mu_Y$, and hence $\sigma_Y^2 = E\{Y(Y-1)\} - \mu_Y^2 + \mu_Y$, which is useful for some discrete probability distributions.
Next, we consider several families of discrete probability distributions: the Binomial, Poisson, and Negative Binomial families.
Binomial Distribution
When an experiment consists of n independent trials, each of which can end in one of two outcomes (Success or Failure) with constant probability of success, we refer to this as a binomial experiment. The random variable Y is the number of Successes in the n trials, and can take on the values y = 0, 1, ..., n. Note that in some settings, the "Success" can be a negative attribute. We denote the probability of success as π, which lies between 0 and 1, and use the notation $Y \sim B(n,\pi)$. The probability distribution, mean, and variance of Y depend on the sample size n and probability of success π.
$$p(y) = \binom{n}{y}\pi^y(1-\pi)^{n-y} \qquad E\{Y\} = \mu_Y = n\pi \qquad V\{Y\} = \sigma_Y^2 = n\pi(1-\pi)$$
where $\binom{n}{y} = \frac{n!}{y!(n-y)!}$. In practice, π will be unknown, and estimated from sample data. Note that to obtain the mean, we have:
$$E\{Y\} = \mu_Y = \sum_{y=0}^{n} y\,p(y) = \sum_{y=0}^{n} y\,\frac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y} = \sum_{y=1}^{n} \frac{n!}{(y-1)!(n-y)!}\pi^y(1-\pi)^{n-y}$$
$$= n\pi\sum_{y=1}^{n} \frac{(n-1)!}{(y-1)!(n-y)!}\pi^{y-1}(1-\pi)^{n-y} \stackrel{y^*=y-1}{=} n\pi\sum_{y^*=0}^{n-1}\binom{n-1}{y^*}\pi^{y^*}(1-\pi)^{n-1-y^*} = n\pi\sum_{y^*}p(y^*) = n\pi$$
To obtain the variance, we use the result from above, $\sigma_Y^2 = E\{Y(Y-1)\} - \mu_Y^2 + \mu_Y$:
$$E\{Y(Y-1)\} = \sum_{y=0}^{n} y(y-1)\,p(y) = \sum_{y=0}^{n} y(y-1)\,\frac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y} = \sum_{y=2}^{n} \frac{n!}{(y-2)!(n-y)!}\pi^y(1-\pi)^{n-y}$$
$$= n(n-1)\pi^2\sum_{y=2}^{n}\frac{(n-2)!}{(y-2)!(n-y)!}\pi^{y-2}(1-\pi)^{n-y} \stackrel{y^{**}=y-2}{=} n(n-1)\pi^2\sum_{y^{**}=0}^{n-2}\binom{n-2}{y^{**}}\pi^{y^{**}}(1-\pi)^{n-2-y^{**}} = n(n-1)\pi^2$$
$$\Rightarrow\ \sigma_Y^2 = n(n-1)\pi^2 - n^2\pi^2 + n\pi = n\pi - n\pi^2 = n\pi(1-\pi)$$
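As a quick numerical check of these results (a minimal R sketch; the values n = 10 and π = 0.3 are illustrative, not from the text):

```r
# Verify E{Y} = n*pi and V{Y} = n*pi*(1-pi) directly from the binomial pmf
n <- 10; p <- 0.3                  # illustrative values
y <- 0:n
pmf <- dbinom(y, size = n, prob = p)
sum(pmf)                           # probabilities sum to 1
mu <- sum(y * pmf); mu             # = n*p = 3
sum((y - mu)^2 * pmf)              # = n*p*(1-p) = 2.1
```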
Poisson Distribution
In many applications, researchers observe the counts of a random process in some fixed amount of time or space. The random variable Y is a count that can take on any non-negative integer. One important aspect of the Poisson family is that the mean and variance are equal, which does not hold for all applications. We use the notation $Y \sim \text{Poi}(\lambda)$. The probability distribution, mean, and variance of Y are:
$$p(y) = \frac{e^{-\lambda}\lambda^y}{y!} \qquad E\{Y\} = \mu_Y = \lambda \qquad V\{Y\} = \sigma_Y^2 = \lambda$$
Note that λ > 0. The Poisson arises by dividing the time/space into n infinitely small areas, each having either 0 or 1 Success, with Success probability π = λ/n. Then Y is the number of areas having a Success.
$$p(y) = \frac{n!}{y!(n-y)!}\left(\frac{\lambda}{n}\right)^y\left(1-\frac{\lambda}{n}\right)^{n-y} = \frac{n(n-1)\cdots(n-y+1)}{y!}\left(\frac{\lambda}{n}\right)^y\left(1-\frac{\lambda}{n}\right)^{n-y}$$
$$= \frac{1}{y!}\cdot\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-y+1}{n}\;\lambda^y\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-y}$$
The limit as n goes to ∞ is:
$$\lim_{n\to\infty} p(y) = \frac{1}{y!}(1)(1)\cdots(1)\,\lambda^y e^{-\lambda}(1) \quad\Rightarrow\quad p(y) = \frac{e^{-\lambda}\lambda^y}{y!}$$
To obtain the mean of Y for the Poisson distribution, we have:
$$E\{Y\} = \mu_Y = \sum_{y=0}^{\infty} y\,\frac{e^{-\lambda}\lambda^y}{y!} = \sum_{y=1}^{\infty} \frac{e^{-\lambda}\lambda^y}{(y-1)!} = \lambda\sum_{y=1}^{\infty}\frac{e^{-\lambda}\lambda^{y-1}}{(y-1)!} \stackrel{y^*=y-1}{=} \lambda\sum_{y^*=0}^{\infty}\frac{e^{-\lambda}\lambda^{y^*}}{y^*!} = \lambda$$
We use the same result as that for the binomial to obtain the variance for the Poisson distribution:
$$E\{Y(Y-1)\} = \sum_{y=0}^{\infty} y(y-1)\,\frac{e^{-\lambda}\lambda^y}{y!} = \sum_{y=2}^{\infty}\frac{e^{-\lambda}\lambda^y}{(y-2)!} = \lambda^2\sum_{y^{**}=0}^{\infty}\frac{e^{-\lambda}\lambda^{y^{**}}}{y^{**}!} = \lambda^2 \quad (y^{**}=y-2)$$
$$\Rightarrow\ \sigma_Y^2 = \lambda^2 - \lambda^2 + \lambda = \lambda$$
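The E{Y(Y − 1)} identity above is easy to verify numerically (a minimal R sketch; λ = 4 is an illustrative value):

```r
# Verify E{Y} = lambda and sigma^2 = E{Y(Y-1)} - mu^2 + mu = lambda for the Poisson
lambda <- 4                         # illustrative value
y <- 0:200                          # truncation point; mass beyond is negligible
pmf <- dpois(y, lambda)
mu <- sum(y * pmf); mu              # = lambda = 4
sum(y * (y - 1) * pmf) - mu^2 + mu  # = lambda, via the E{Y(Y-1)} identity
```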
Negative Binomial Distribution
The negative binomial distribution is used in two quite different contexts. The first is where a binomial-type experiment is being conducted, except that instead of having a fixed number of trials, the experiment is completed when the rth Success occurs. The random variable Y is the number of trials needed until the rth Success, and can take on any integer value greater than or equal to r. The probability distribution, mean, and variance are:
$$p(y) = \binom{y-1}{r-1}\pi^r(1-\pi)^{y-r} \qquad E\{Y\} = \mu_Y = \frac{r}{\pi} \qquad V\{Y\} = \sigma_Y^2 = \frac{r(1-\pi)}{\pi^2}$$
A second use of the negative binomial distribution is as a model for count data. It arises from a mixture of Poisson models. In this setting it has 2 parameters, is more flexible than the Poisson (which has the variance equal to the mean), and can take on any non-negative integer value. In this form, the negative binomial distribution can be written as follows (see e.g. Agresti (2002) and Cameron and Trivedi (2005)).
$$f(y;\mu,\alpha) = \frac{\Gamma\left(\alpha^{-1}+y\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y+1)}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}\left(\frac{\mu}{\alpha^{-1}+\mu}\right)^{y} \qquad \Gamma(w) = \int_0^{\infty}x^{w-1}e^{-x}\,dx = (w-1)\Gamma(w-1)$$
The mean and variance of this form of the Negative Binomial distribution are:
$$E\{Y\} = \mu \qquad V\{Y\} = \mu(1+\alpha\mu)$$
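In R, this form corresponds to dnbinom() with mu = µ and size = α⁻¹; a brief sketch of the mean-variance relationship (µ = 3 and α = 0.5 are illustrative values):

```r
# Negative Binomial with E{Y} = mu and V{Y} = mu*(1 + alpha*mu): use size = 1/alpha
mu <- 3; alpha <- 0.5               # illustrative values
y <- 0:1000                         # truncation point; mass beyond is negligible
pmf <- dnbinom(y, mu = mu, size = 1/alpha)
m <- sum(y * pmf); m                # = mu = 3
sum((y - m)^2 * pmf)                # = mu*(1 + alpha*mu) = 7.5
```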
1.1.2 Continuous Random Variables/Probability Distributions
Random variables that can take on any value along a continuum are continuous. Here, we consider the normal, gamma, t, and F families. Special cases of the gamma family include the exponential and chi-squared distributions. Continuous distributions are described by density functions, as opposed to probability mass functions. The density is always non-negative and integrates to 1. We will use the notation f(y) for density functions. The mean and variance for continuous distributions are obtained in a similar manner as discrete distributions, with integration replacing summation.
$$E\{Y\} = \mu_Y = \int_{-\infty}^{\infty} y\,f(y)\,dy \qquad V\{Y\} = \sigma_Y^2 = \int_{-\infty}^{\infty}(y-\mu_Y)^2 f(y)\,dy$$
In general, for any function g(Y), we have:
$$E\{g(Y)\} = \int_{-\infty}^{\infty} g(y)f(y)\,dy = \mu_{g(Y)} \qquad V\{g(Y)\} = E\left\{\left(g(Y)-\mu_{g(Y)}\right)^2\right\} = \int_{-\infty}^{\infty}\left(g(y)-\mu_{g(Y)}\right)^2 f(y)\,dy$$
Normal Distribution
The normal distributions, also known as the Gaussian distributions, are a family of symmetric mound-shaped distributions. The distribution has 2 parameters: the mean µ and the variance σ², although often it is indexed by its standard deviation σ. We use the notation $Y \sim N(\mu,\sigma^2)$. The probability density function, mean, and variance are:
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \qquad E\{Y\} = \mu_Y = \mu \qquad V\{Y\} = \sigma_Y^2 = \sigma^2$$
The mean µ defines the center (median and mode) of the distribution, and the standard deviation σ is a measure of the spread (µ − σ and µ + σ are the inflection points). Despite the differences in location and spread of the different distributions in the normal family, probabilities with respect to standard deviations from the mean are the same for all normal distributions. For −∞ < z₁ < z₂ < ∞, we have:
$$P(\mu+z_1\sigma \le Y \le \mu+z_2\sigma) = \int_{\mu+z_1\sigma}^{\mu+z_2\sigma}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)dy = \int_{z_1}^{z_2}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = \Phi(z_2) - \Phi(z_1)$$
where Z is standard normal: a normal distribution with mean 0 and variance (and standard deviation) 1. Here Φ(z∗) is the cumulative distribution function of the standard normal distribution, up to the point z∗:
$$\Phi(z^*) = \int_{-\infty}^{z^*}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz$$
These probabilities and critical values can be obtained directly or indirectly from standard tables, statistical software, or spreadsheets. Note that:
$$Y \sim N(\mu,\sigma^2)\ \Rightarrow\ Z = \frac{Y-\mu}{\sigma} \sim N(0,1)$$
This makes it possible to use the standard normal table for any normal distribution. Plots of three normal distributions are given in Figure 1.1.
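In R, Φ is pnorm(); a brief sketch of the standardization result (µ = 50, σ = 10, and the interval (45, 70) are illustrative values):

```r
# P(a <= Y <= b) for Y ~ N(mu, sigma^2), directly and via Z = (Y - mu)/sigma
mu <- 50; sigma <- 10
a <- 45; b <- 70
pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)  # direct
pnorm((b - mu)/sigma) - pnorm((a - mu)/sigma)                      # standardized, same value
```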
Gamma Distribution
The gamma family of distributions is used to model non-negative random variables that are often right-skewed. There are two widely used parameterizations. The first given here is in terms of shape and scale parameters:
$$f(y) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,y^{\alpha-1}e^{-y/\beta} \quad y \ge 0,\ \alpha > 0,\ \beta > 0 \qquad E\{Y\} = \mu_Y = \alpha\beta \qquad V\{Y\} = \sigma_Y^2 = \alpha\beta^2$$
Here, Γ(α) is the gamma function $\Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy$, which is built into virtually all statistical packages and spreadsheets. It also has two simple properties:
$$\alpha > 1:\ \Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1) \qquad \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$$
Thus, if α is an integer, Γ(α) = (α − 1)!. The second version given here is in terms of shape and rate parameters:
$$f(y) = \frac{\theta^{\alpha}}{\Gamma(\alpha)}\,y^{\alpha-1}e^{-y\theta} \quad y \ge 0,\ \alpha > 0,\ \theta > 0 \qquad E\{Y\} = \mu_Y = \frac{\alpha}{\theta} \qquad V\{Y\} = \sigma_Y^2 = \frac{\alpha}{\theta^2}$$
[Figure 1.1: Three Normal Densities, showing N(25,5), N(50,10), and N(75,3)]
Note that different software packages use different parameterizations in generating samples and giving tail- areas and critical values. For instance, EXCEL uses the first parameterization and R uses the second. Figure 1.2 displays three gamma densities of various shapes.
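For example, in R the dgamma() function accepts either form: shape with scale (the first parameterization) or shape with rate (the second, its default). A brief sketch (α = 3 and β = 2 are illustrative values):

```r
# The two gamma parameterizations agree when theta = 1/beta
alpha <- 3; beta <- 2                     # shape and scale (first form)
y <- 5
dgamma(y, shape = alpha, scale = beta)    # first parameterization
dgamma(y, shape = alpha, rate = 1/beta)   # second parameterization, same density
alpha * beta                              # E{Y} = alpha*beta = alpha/theta = 6
```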
Two special cases are the exponential family, where α = 1 and the chi-squared family, with α = ν/2 and β = 2 for integer valued ν. For the exponential family, based on the second parameterization:
$$f(y) = \theta e^{-y\theta} \qquad E\{Y\} = \mu_Y = \frac{1}{\theta} \qquad V\{Y\} = \sigma_Y^2 = \frac{1}{\theta^2}$$
Probabilities for the exponential distribution are trivial to obtain, as $F(y^*) = 1 - e^{-y^*\theta}$. Figure 1.3 gives three exponential distributions.
For the chi-square family, based on the first parameterization:
$$f(y) = \frac{1}{\Gamma\left(\frac{\nu}{2}\right)2^{\nu/2}}\,y^{\frac{\nu}{2}-1}e^{-y/2} \qquad E\{Y\} = \mu_Y = \nu \qquad V\{Y\} = \sigma_Y^2 = 2\nu$$
Here, ν is the degrees of freedom, and we denote the distribution as $Y \sim \chi^2_{\nu}$. Upper and lower critical values of the chi-square distribution are available in tabular form, and in statistical packages and spreadsheets. Probabilities can be obtained with statistical packages and spreadsheets. Figure 1.4 gives three Chi-square distributions.
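The closed-form exponential cdf and the chi-square critical values can be checked against R's built-in functions (θ = 0.5, y∗ = 3, and ν = 6 are illustrative values):

```r
# Exponential cdf F(y*) = 1 - exp(-y* * theta), and a chi-square critical value
theta <- 0.5; ystar <- 3
1 - exp(-ystar * theta)        # closed form
pexp(ystar, rate = theta)      # built-in, same value
qchisq(0.95, df = 6)           # upper 5% chi-square critical value, nu = 6
```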
[Figure 1.2: Three Gamma Densities, showing Gamma(20,2), Gamma(50,2), and Gamma(50,1)]
[Figure 1.3: Three Exponential Densities, showing Exponential(1), Exponential(0.5), and Exponential(0.2)]
[Figure 1.4: Three Chi-Square Densities, showing χ²(5), χ²(15), and χ²(40)]
Beta Distribution
The Beta distribution can be used to model data that are proportions (or percentages divided by 100). The traditional model for the Beta distribution is given below.
$$f(y;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y^{\alpha-1}(1-y)^{\beta-1} \quad 0 < y < 1;\ \alpha > 0,\ \beta > 0 \qquad \int_0^1 w^a(1-w)^b\,dw = \frac{\Gamma(a+1)\Gamma(b+1)}{\Gamma(a+b+2)}$$
Note that the Uniform distribution is a special case, with α = β = 1. The mean and variance of the Beta distribution are obtained as follows:
$$E\{Y\} = \int_0^1 y\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y^{\alpha-1}(1-y)^{\beta-1}\,dy = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 y^{\alpha}(1-y)^{\beta-1}\,dy$$
$$= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\Gamma(\alpha+\beta)\,\alpha\,\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha)\Gamma(\beta)\,(\alpha+\beta)\,\Gamma(\alpha+\beta)} = \frac{\alpha}{\alpha+\beta}$$
By extending this logic, we can obtain:
$$E\{Y^2\} = \int_0^1 y^2\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y^{\alpha-1}(1-y)^{\beta-1}\,dy = \frac{\alpha(\alpha+1)}{(\alpha+\beta+1)(\alpha+\beta)} \quad\Rightarrow\quad V\{Y\} = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$$
An alternative formulation of the distribution involves re-parameterizing as follows:
$$\mu = \frac{\alpha}{\alpha+\beta} \qquad \phi = \alpha+\beta \quad\Rightarrow\quad \alpha = \mu\phi \qquad \beta = (1-\mu)\phi$$
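A brief numerical check of the mean and variance results, and of the (µ, φ) re-parameterization (a minimal R sketch; α = 4 and β = 2 are illustrative values):

```r
# Beta(alpha, beta): E{Y} = alpha/(alpha+beta); variance matches the (mu, phi) form
alpha <- 4; beta <- 2
integrate(function(y) y * dbeta(y, alpha, beta), 0, 1)$value  # E{Y} = 4/6
alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))        # V{Y}, traditional form
mu <- alpha / (alpha + beta); phi <- alpha + beta
mu * (1 - mu) / (phi + 1)                                     # V{Y}, re-parameterized form
```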
Figure 1.5 gives three Beta distributions.
[Figure 1.5: Three Beta Densities, showing Beta(1,1), Beta(4,2), and Beta(0.5,2)]
1.2 Linear Functions of Multiple Random Variables
Suppose we simultaneously observe two random variables: X and Y . Their joint probability distribution can be discrete, continuous, or mixed (one discrete, the other continuous). We consider the joint distribution and the marginal distributions for the discrete case below.
$$p(x,y) = P\{X=x, Y=y\} \qquad p_X(x) = P\{X=x\} = \sum_y p(x,y) \qquad p_Y(y) = P\{Y=y\} = \sum_x p(x,y)$$
$$\text{Joint density when } X=x,\ Y=y:\ f(x,y) \qquad f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx$$
$$F(a,b) = P\{X \le a, Y \le b\} = \int_{-\infty}^{b}\int_{-\infty}^{a} f(x,y)\,dx\,dy$$
Note the following.
$$\text{Discrete: } \sum_x\sum_y p(x,y) = \sum_x p_X(x) = \sum_y p_Y(y) = 1$$
$$\text{Continuous: } \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = \int_{-\infty}^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty} f_Y(y)\,dy = 1$$
The conditional probability that X = x given Y = y (or that Y = y given X = x) is denoted as follows.
$$p(x|y) = P\{X=x|Y=y\} = \frac{p(x,y)}{p_Y(y)} \qquad p(y|x) = P\{Y=y|X=x\} = \frac{p(x,y)}{p_X(x)}$$
assuming p_Y(y) > 0 and p_X(x) > 0. This is simply the probability that both events occur, divided by the probability that Y = y or that X = x, respectively. X and Y are said to be independent if p(x|y) = p(x) for all y and p(y|x) = p(y) for all x. The conditional densities for continuous random variables are similarly defined based on the joint and marginal densities.
$$f(x|y) = \frac{f(x,y)}{f_Y(y)}\ \ f_Y(y) > 0 \qquad f(y|x) = \frac{f(x,y)}{f_X(x)}\ \ f_X(x) > 0$$
The conditional mean and variance are the mean and variance of the conditional distribution (density), and are often functions of the conditioning variable.
$$\text{Discrete: } E\{Y|X=x\} = \mu_{Y|x} = \sum_y y\,p(y|x) \qquad \text{Continuous: } E\{Y|X=x\} = \mu_{Y|x} = \int_{-\infty}^{\infty} y\,f(y|x)\,dy$$
$$\text{Discrete: } V\{Y|X=x\} = \sigma^2_{Y|x} = \sum_y \left(y-\mu_{Y|x}\right)^2 p(y|x) \qquad \text{Continuous: } V\{Y|X=x\} = \sigma^2_{Y|x} = \int_{-\infty}^{\infty}\left(y-\mu_{Y|x}\right)^2 f(y|x)\,dy$$
Next we consider the variance of the conditional mean and the mean of the conditional variance for the continuous case (with integration being replaced by summation for the discrete case).
$$V_X\{E\{Y|x\}\} = \int_{-\infty}^{\infty}\left(\mu_{Y|x} - \mu_Y\right)^2 f_X(x)\,dx \qquad E_X\{V\{Y|x\}\} = \int_{-\infty}^{\infty}\sigma^2_{Y|x}\,f_X(x)\,dx$$
Note that we can partition the variance of Y into the sum of the variance of the conditional mean and the mean of the conditional variance:
$$V\{Y\} = V_X\{E\{Y|x\}\} + E_X\{V\{Y|x\}\}$$
The covariance σXY between X and Y is the average product of deviations from the mean for X and Y . For the discrete case, we have the following.
$$\sigma_{XY} = E\{(X-\mu_X)(Y-\mu_Y)\} = \sum_x\sum_y (x-\mu_X)(y-\mu_Y)\,p(x,y) = \sum_x\sum_y (xy - x\mu_Y - \mu_X y + \mu_X\mu_Y)\,p(x,y)$$
$$= \sum_x\sum_y xy\,p(x,y) - \mu_Y\sum_x\sum_y x\,p(x,y) - \mu_X\sum_x\sum_y y\,p(x,y) + \mu_X\mu_Y = E\{XY\} - \mu_X\mu_Y$$
For the continuous case, replace summation with integration. If X and Y are independent, σ_XY = 0, but the converse is not typically the case. Covariances can be either positive or negative, depending on the association (if any) between X and Y. The covariance is unbounded, and depends on the scales of measurement for X and Y. The correlation ρ_XY is a measure that is unit-less, is not affected by linear transformations of X and/or Y, and is bounded between -1 and 1.
$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$
where σX and σY are the standard deviations of the marginal distributions of X and Y , respectively.
For fixed constants a and b, the mean and variance of the linear function W = aX + bY are, for the discrete case:
$$E\{W\} = E\{aX+bY\} = \sum_x\sum_y (ax+by)\,p(x,y) = a\sum_x x\,p_X(x) + b\sum_y y\,p_Y(y) = a\mu_X + b\mu_Y$$
$$V\{W\} = V\{aX+bY\} = \sum_x\sum_y \left[(ax+by) - (a\mu_X+b\mu_Y)\right]^2 p(x,y)$$
$$= \sum_x\sum_y \left[a^2(x-\mu_X)^2 + b^2(y-\mu_Y)^2 + 2ab(x-\mu_X)(y-\mu_Y)\right] p(x,y) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_{XY}$$
In general, if Y1,...,Yn are a sequence of random variables, and a1,...,an are a sequence of constants, we have the following results.
$$E\left\{\sum_{i=1}^n a_iY_i\right\} = \sum_{i=1}^n a_iE\{Y_i\} = \sum_{i=1}^n a_i\mu_i \qquad V\left\{\sum_{i=1}^n a_iY_i\right\} = \sum_{i=1}^n a_i^2\sigma_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} a_ia_j\sigma_{ij}$$
Here µ_i is the mean of Y_i, σ_i² is the variance of Y_i, and σ_ij is the covariance of Y_i and Y_j.
If we have two linear functions of the same sequence of random variables, say $W_1 = \sum_{i=1}^n a_iY_i$ and $W_2 = \sum_{i=1}^n b_iY_i$, we can obtain their covariance as follows.
$$\text{COV}\{W_1,W_2\} = \sum_{i=1}^n\sum_{j=1}^n a_ib_j\sigma_{ij} = \sum_{i=1}^n a_ib_i\sigma_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left(a_ib_j + a_jb_i\right)\sigma_{ij}$$
The last term drops out when Y1,...,Yn are independent.
1.3 Functions of Normal Random Variables
First, note that if $Z \sim N(0,1)$, then $Z^2 \sim \chi^2_1$. Many software packages present Z-tests as (Wald) χ²-tests. See the section on testing below.
Suppose Y₁,...,Y_n are independent with $Y_i \sim N(\mu,\sigma^2)$ for i = 1,...,n. Then the sample mean and sample variance are computed as follows.
$$\bar{Y} = \frac{\sum_{i=1}^n Y_i}{n} \qquad S^2 = \frac{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}{n-1}$$
In this case, we obtain the following sampling distributions for the mean and a function of the variance.
$$\bar{Y} \sim N\left(\mu,\frac{\sigma^2}{n}\right) \qquad \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}{\sigma^2} \sim \chi^2_{n-1} \qquad \bar{Y},\ \frac{(n-1)S^2}{\sigma^2}\ \text{are independent.}$$
Note that in general, if Y1,...,Yn are normally distributed (and not necessarily with the same mean and/or variance), any linear function of them will be normally distributed, with mean and variance given in the previous section.
Two distributions associated with the normal and chi-squared distributions are Student's t and F. Student's t-distribution is similar to the standard normal (N(0,1)), except that it is indexed by its degrees of freedom and has heavier tails than the standard normal. As its degrees of freedom approach infinity, its distribution converges to the standard normal. Let $Z \sim N(0,1)$ and $W \sim \chi^2_{\nu}$, where Z and W are independent. Then, we get:
$$Y \sim N(\mu,\sigma^2)\ \Rightarrow\ Z = \frac{Y-\mu}{\sigma} \sim N(0,1) \qquad T = \frac{Z}{\sqrt{W/\nu}} \sim t_{\nu}$$
where the probability density, mean, and variance for Student's t-distribution are:
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \qquad E\{T\} = \mu_T = 0 \qquad V\{T\} = \frac{\nu}{\nu-2}\ \ \nu > 2$$
and we use the notation $T \sim t_{\nu}$. Three t-distributions, along with the standard normal (z) distribution, are shown in Figure 1.6.
Now consider the sample mean and variance, and the fact that they are independent.
$$\bar{Y} \sim N\left(\mu,\frac{\sigma^2}{n}\right)\ \Rightarrow\ Z = \frac{\bar{Y}-\mu}{\sqrt{\sigma^2/n}} = \sqrt{n}\,\frac{\bar{Y}-\mu}{\sigma} \sim N(0,1)$$
$$W = \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^n\left(Y_i-\bar{Y}\right)^2}{\sigma^2} \sim \chi^2_{n-1}\ \Rightarrow\ \sqrt{\frac{W}{\nu}} = \sqrt{\frac{(n-1)S^2}{\sigma^2(n-1)}} = \frac{S}{\sigma}$$
$$\Rightarrow\ T = \frac{Z}{\sqrt{W/\nu}} = \frac{\sqrt{n}\,\frac{\bar{Y}-\mu}{\sigma}}{S/\sigma} = \sqrt{n}\,\frac{\bar{Y}-\mu}{S} \sim t_{n-1}$$
The F-distribution arises often in Regression and Analysis of Variance applications. If $W_1 \sim \chi^2_{\nu_1}$, $W_2 \sim \chi^2_{\nu_2}$, and W₁ and W₂ are independent, then:
$$F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F_{\nu_1,\nu_2}$$
where the probability density, mean, and variance for the F-distribution are given below as a function of the specific point F = f.
$$f(f) = \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\nu_1/2\right)\Gamma\left(\nu_2/2\right)}\left[\nu_1^{\nu_1/2}\,\nu_2^{\nu_2/2}\right]\left[\frac{f^{\nu_1/2-1}}{(\nu_1 f+\nu_2)^{(\nu_1+\nu_2)/2}}\right]$$
$$E\{F\} = \mu_F = \frac{\nu_2}{\nu_2-2}\ \ \nu_2 > 2 \qquad V\{F\} = \frac{2\nu_2^2(\nu_1+\nu_2-2)}{\nu_1(\nu_2-2)^2(\nu_2-4)}\ \ \nu_2 > 4$$
[Figure 1.6: Three t-densities, t(3), t(12), and t(25), with the N(0,1) density]
[Figure 1.7: Three F-densities, F(3,12), F(2,24), and F(6,30)]
Three F-distributions are given in Figure 1.7.
Critical values for the t and F-distributions are given in statistical textbooks. Probabilities can be obtained from many statistical packages and spreadsheets. Technically, the t and F distributions described here are central t and central F distributions.
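For example, in R the qt()/pt() and qf()/pf() functions give these critical values and probabilities directly (the degrees of freedom below are illustrative):

```r
# Critical values and tail probabilities for the (central) t and F distributions
qt(0.975, df = 24)                # t critical value with alpha/2 = 0.025
pt(qt(0.975, df = 24), df = 24)   # cdf at that point: 0.975
qf(0.95, df1 = 3, df2 = 12)       # upper 5% critical value of F(3,12)
1 - pf(qf(0.95, 3, 12), 3, 12)    # corresponding upper-tail area: 0.05
```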
Inferences Regarding µ and σ²
We can test hypotheses concerning µ and obtain confidence intervals based on the sample mean and standard deviation when the data are independent N(µ, σ²). Let $t_{\alpha/2,\nu}$ be the value such that:
$$T \sim t_{\nu}\ \Rightarrow\ P\left(T \ge t_{\alpha/2,\nu}\right) = \alpha/2$$
Then we get the following probability statement, which leads to a (1 − α)100% confidence interval for µ.
$$1-\alpha = P\left(-t_{\alpha/2,n-1} \le \sqrt{n}\,\frac{\bar{Y}-\mu}{S} \le t_{\alpha/2,n-1}\right) = P\left(-t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} \le \bar{Y}-\mu \le t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right)$$
$$= P\left(\bar{Y} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{Y} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right)$$
Once a sample is observed with Y₁ = y₁,...,Y_n = y_n, and the sample mean ȳ and sample standard deviation s are obtained, the Confidence Interval for µ is formed as follows.
$$\bar{y} \pm t_{\alpha/2,n-1}\,s_{\bar{Y}} \equiv \bar{y} \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}$$
A 2-sided test of whether µ = µ₀ is set up as follows, where TS is the test statistic and RR is the rejection region.
$$H_0: \mu = \mu_0 \quad H_A: \mu \ne \mu_0 \qquad TS:\ t_{obs} = \sqrt{n}\,\frac{\bar{y}-\mu_0}{s} \qquad RR:\ |t_{obs}| \ge t_{\alpha/2,n-1}$$
with P-value $= 2P\left(t_{n-1} \ge |t_{obs}|\right)$.
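A minimal R sketch of this interval and test (the data are simulated for illustration, with µ₀ = 50; t.test() serves as a built-in check):

```r
# (1-alpha)100% CI for mu and 2-sided t-test of H0: mu = mu0
set.seed(1)
y <- rnorm(20, mean = 52, sd = 8)       # illustrative sample
n <- length(y); ybar <- mean(y); s <- sd(y)
mu0 <- 50; alpha <- 0.05
tcrit <- qt(1 - alpha/2, df = n - 1)
ybar + c(-1, 1) * tcrit * s / sqrt(n)   # confidence interval for mu
tobs <- sqrt(n) * (ybar - mu0) / s
2 * pt(-abs(tobs), df = n - 1)          # P-value
t.test(y, mu = mu0)                     # built-in check: same interval and P-value
```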
To make inferences regarding σ², we will make use of the following notational convention.
$$W \sim \chi^2_{\nu}\ \Rightarrow\ P\left(W \ge \chi^2_{\alpha/2,\nu}\right) = \alpha/2$$
Since the χ² distribution is not symmetric around 0, as Student's t is, we will also have to obtain $\chi^2_{1-\alpha/2,\nu}$, representing the lower tail of the distribution having area α/2. Then, we can obtain a (1 − α)100% Confidence Interval for σ², based on the following probability statements.
$$1-\alpha = P\left(\chi^2_{1-\alpha/2,n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2,n-1}\right) = P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}\right)$$
Based on the observed sample variance s², the Confidence Intervals for σ² and σ are formed as follows.
$$\sigma^2:\ \left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}\right) \qquad \sigma:\ \left(\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}}},\ \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}}\right)$$
That is, to obtain a (1 − α)100% Confidence Interval for σ, take the positive square roots of the end points of the interval for σ². To test H₀: σ² = σ₀² versus H_A: σ² ≠ σ₀², simply check whether σ₀² lies in the confidence interval for σ².
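Continuing the sketch above, the chi-square based intervals for σ² and σ (using the same simulated y, n, s, and α):

```r
# (1-alpha)100% CI for sigma^2, based on the chi-square pivot (n-1)S^2/sigma^2
lo <- (n - 1) * s^2 / qchisq(1 - alpha/2, df = n - 1)
hi <- (n - 1) * s^2 / qchisq(alpha/2, df = n - 1)
c(lo, hi)          # CI for sigma^2
sqrt(c(lo, hi))    # CI for sigma: positive square roots of the end points
```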
1.4 Likelihood Functions and Maximum Likelihood Estimation
Suppose we take a random sample of n items from a probability mass (discrete) or probability density (continuous) function. We can write the marginal probability density (mass) for each observation (say yi) as a function of one or more parameters (θ):
$$\text{Discrete: } p(y_i|\theta) \qquad \text{Continuous: } f(y_i|\theta)$$
If the data are independent, then we get the joint density (mass) function as the product of the individual (marginal) functions:
$$\text{Discrete: } p(y_1,\ldots,y_n|\theta) = \prod_{i=1}^n p(y_i|\theta) \qquad \text{Continuous: } f(y_1,\ldots,y_n|\theta) = \prod_{i=1}^n f(y_i|\theta)$$
Consider the following cases: the Binomial, Poisson, Negative Binomial, Normal, Gamma, and Beta distributions. For the binomial case, suppose we consider n individual trials, where each trial can end in Success (with probability π) or Failure (with probability 1 − π). Note that each y_i will equal 1 (S) or 0 (F). This is referred to as a Bernoulli distribution when each trial is considered individually.
$$p(y_i|\pi) = \pi^{y_i}(1-\pi)^{1-y_i}\ \Rightarrow\ p(y_1,\ldots,y_n|\pi) = \prod_{i=1}^n p(y_i|\pi) = \pi^{\sum y_i}(1-\pi)^{n-\sum y_i}$$
For the Poisson model, we have:
$$p(y_i|\lambda) = \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}\ \Rightarrow\ p(y_1,\ldots,y_n|\lambda) = \prod_{i=1}^n p(y_i|\lambda) = \frac{e^{-n\lambda}\lambda^{\sum y_i}}{\prod y_i!}$$
For the Negative Binomial distribution, we have:
$$f(y_i|\mu,\alpha) = \frac{\Gamma\left(\alpha^{-1}+y_i\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y_i+1)}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}\left(\frac{\mu}{\alpha^{-1}+\mu}\right)^{y_i}$$
For the Normal distribution, we obtain:
$$f\left(y_i|\mu,\sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu)^2}{2\sigma^2}\right)\ \Rightarrow$$
$$f\left(y_1,\ldots,y_n|\mu,\sigma^2\right) = \prod_{i=1}^n f\left(y_i|\mu,\sigma^2\right) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left[-\frac{\sum(y_i-\mu)^2}{2\sigma^2}\right]$$
For the Gamma model, we have:
$$f(y_i|\alpha,\theta) = \frac{\theta^{\alpha}}{\Gamma(\alpha)}\,y_i^{\alpha-1}e^{-y_i\theta}$$
For the Beta distribution, we have:
$$f(y_i|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y_i^{\alpha-1}(1-y_i)^{\beta-1}$$
An alternative formulation of the distribution involves re-parameterizing as follows.
$$\mu = \frac{\alpha}{\alpha+\beta} \qquad \phi = \alpha+\beta\ \Rightarrow\ \alpha = \mu\phi \qquad \beta = (1-\mu)\phi$$
The re-parameterized model, mean, and variance are given below.
$$f(y_i;\mu,\phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\,y_i^{\mu\phi-1}(1-y_i)^{(1-\mu)\phi-1} \quad 0 < \mu < 1,\ \phi > 0$$
$$E\{Y\} = \mu \qquad V\{Y\} = \frac{\mu(1-\mu)}{\phi+1}$$
Note that in each of these cases (and for other distributions as well), once we have collected the data, the joint distribution can be thought of as a function of unknown parameter(s). This is referred to as the likelihood function. Our goal is to choose parameter value(s) that maximize the likelihood function. These are referred to as maximum likelihood estimators (MLEs). For most distributions, it is easier to maximize the log of the likelihood function, which is a monotonic function of the likelihood.
$$\text{Likelihood: } L(\theta|y_1,\ldots,y_n) = f(y_1,\ldots,y_n|\theta) \qquad \text{Log-Likelihood: } l = \ln(L)$$
To obtain the MLE(s), we take the derivative of the log-likelihood with respect to the parameter(s) θ, set it to zero, and solve for θ̂. Under commonly met regularity conditions, the maximum likelihood estimator θ̂_ML is asymptotically normal, with mean equal to the true parameter(s) θ, and variance (or variance-covariance matrix when the number of parameters p > 1) equal to:
$$V\left\{\hat{\theta}_{ML}\right\} = \left[-E\left\{\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right\}\right]^{-1} = I^{-1}(\theta)$$
where:
$$\theta = \begin{bmatrix}\theta_1\\ \vdots\\ \theta_p\end{bmatrix} \qquad \frac{\partial^2 l}{\partial\theta\,\partial\theta'} = \begin{bmatrix}\frac{\partial^2 l}{\partial\theta_1^2} & \cdots & \frac{\partial^2 l}{\partial\theta_1\partial\theta_p}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 l}{\partial\theta_p\partial\theta_1} & \cdots & \frac{\partial^2 l}{\partial\theta_p^2}\end{bmatrix}$$
The estimated variance (or variance-covariance matrix) replaces the unknown parameter values θ with their ML estimates θ̂_ML. The standard error is the standard deviation of the sampling distribution: the square root of the variance.
We can construct approximate large-sample Confidence Intervals for the parameter(s) θ, based on the asymptotic normality of the MLEs:
$$\hat{\theta}_{ML} \pm z_{\alpha/2}\sqrt{\hat{V}\left\{\hat{\theta}_{ML}\right\}} \qquad P\left(Z \ge z_{\alpha/2}\right) = \frac{\alpha}{2}$$
Now, we consider the 6 distributions described above.
Bernoulli/Binomial Distribution
$$L(\pi|y_1,\ldots,y_n) = \pi^{\sum y_i}(1-\pi)^{n-\sum y_i}\ \Rightarrow\ l = \ln(L) = \sum y_i\,\ln(\pi) + \left(n - \sum y_i\right)\ln(1-\pi)$$
Taking the derivative of l with respect to π, setting it to 0, and solving for π̂, we get:
$$\frac{\partial l}{\partial\pi} = \frac{\sum y_i}{\pi} - \frac{n-\sum y_i}{1-\pi} \stackrel{\text{set}}{=} 0\ \Rightarrow\ \hat{\pi} = \frac{\sum y_i}{n}$$
$$\frac{\partial^2 l}{\partial\pi^2} = -\frac{\sum y_i}{\pi^2} - \frac{n-\sum y_i}{(1-\pi)^2}\ \Rightarrow\ E\left\{\frac{\partial^2 l}{\partial\pi^2}\right\} = -\frac{n\pi}{\pi^2} - \frac{n(1-\pi)}{(1-\pi)^2} = -n\left[\frac{1}{\pi} + \frac{1}{1-\pi}\right] = -\frac{n}{\pi(1-\pi)}$$
$$\Rightarrow\ V\{\hat{\pi}_{ML}\} = \left[\frac{n}{\pi(1-\pi)}\right]^{-1} = \frac{\pi(1-\pi)}{n}\ \Rightarrow\ \hat{V}\{\hat{\pi}_{ML}\} = \frac{\hat{\pi}(1-\hat{\pi})}{n}$$
Example: WNBA Free Throw Shooting - Maya Moore
We would like to estimate WNBA star Maya Moore’s true probability of making a free throw, treating her 2014 season attempts as a random sample from her underlying population of all possible (in game) free throw attempts. We treat her individual attempts as independent Bernoulli trials with probability of success π. Further, we would like to test whether her underlying proportion is π = π0 =0.80 (80%). Over the course of the season, she attempted 181 free throws, and made 160 of them.
$$\hat{\pi} = \frac{\sum_{i=1}^n y_i}{n} = \frac{160}{181} = 0.884 \qquad \hat{V}(\hat{\pi}) = \frac{\hat{\pi}(1-\hat{\pi})}{n} = \frac{0.884(0.116)}{181} = 0.0005665$$
A 95% Confidence Interval for π is:
$$\hat{\pi} \pm z_{.025}\sqrt{\hat{V}(\hat{\pi})} \equiv 0.884 \pm 1.96\sqrt{0.0005665} \equiv 0.884 \pm 0.047 \equiv (0.837,\ 0.931)$$
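The same interval in R, using the season totals from the example:

```r
# Large-sample 95% CI for pi, based on 160 makes in 181 attempts
y <- 160; n <- 181
pihat <- y / n
se <- sqrt(pihat * (1 - pihat) / n)
pihat + c(-1, 1) * qnorm(0.975) * se   # (0.837, 0.931)
```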
∇
Poisson Distribution
$$L(\lambda|y_1,\ldots,y_n) = \frac{e^{-n\lambda}\lambda^{\sum y_i}}{\prod y_i!}\ \Rightarrow\ l = \ln(L) = -n\lambda + \sum y_i\,\ln(\lambda) - \sum\ln(y_i!)$$
$$\frac{\partial l}{\partial\lambda} = -n + \frac{\sum y_i}{\lambda} \stackrel{\text{set}}{=} 0\ \Rightarrow\ \hat{\lambda} = \frac{\sum y_i}{n}$$
$$\frac{\partial^2 l}{\partial\lambda^2} = -\frac{\sum y_i}{\lambda^2}\ \Rightarrow\ E\left\{\frac{\partial^2 l}{\partial\lambda^2}\right\} = -\frac{n\lambda}{\lambda^2} = -\frac{n}{\lambda}\ \Rightarrow\ V\left\{\hat{\lambda}_{ML}\right\} = \frac{\lambda}{n}\ \Rightarrow\ \hat{V}\left\{\hat{\lambda}_{ML}\right\} = \frac{\hat{\lambda}}{n}$$
Example: English Premier League Football Total Goals per Game - 2012/13 Season
We are interested in estimating the population mean combined goals per game among the 2012/13 English Premier League (EPL) teams, based on the sample of games played in the season (380 total games). There are 20 teams, and each team plays each other team twice, one at Home, one Away. Assuming a Poisson model (which may not be reasonable, as different teams play in different games), we will estimate the underlying population mean λ. There were 380 games, with a total of 1063 goals, and sample mean and variance of 2.797 and 3.144, respectively. The number of goals and frequencies are given in Table 1.1. 24 CHAPTER 1. PROBABILITY DISTRIBUTIONS, ESTIMATION, AND TESTING
Goals:   0   1   2   3   4   5   6   7   8   9  10
Games:  35  61  72  91  64  32  13   9   1   0   2

Table 1.1: Frequency Tabulation for EPL 2012/2013 Total Goals per Game
Goals  Observed  Expected (Poisson)  Expected (Neg Bin)
  0       35          23.41               27.29
  1       61          65.24               67.74
  2       72          90.92               87.90
  3       91          84.46               79.34
  4       64          58.85               55.94
  5       32          32.80               32.81
  6       13          15.24               16.65
 ≥7       12           9.08               12.33

Table 1.2: Frequency Tabulation and Expected Counts for EPL 2012/2013 Total Goals per Game
$$\hat{\lambda} = \frac{\sum_{i=1}^n y_i}{n} = \frac{1063}{380} = 2.797 \qquad \hat{V}\left\{\hat{\lambda}\right\} = \frac{\hat{\lambda}}{n} = \frac{2.797}{380} = 0.007361$$
A 95% Confidence Interval for λ is:
$$\hat{\lambda} \pm z_{.025}\sqrt{\hat{V}\left\{\hat{\lambda}\right\}} \equiv 2.797 \pm 1.96\sqrt{0.007361} \equiv 2.797 \pm 0.168 \equiv (2.629,\ 2.965)$$
Table 1.2 gives the categories (goals), observed and expected counts for the Poisson and Negative Binomial (next subsection) models, and the Chi-Square Goodness-of-Fit tests for the two distributions. The goodness-of-fit test statistic is computed as follows, where O_i is the observed count for the i-th category, E_i is the expected count for the i-th category, and N is the total number of observations.
$$\hat{\pi}_i = P(Y=i) = \frac{e^{-\hat{\lambda}}\hat{\lambda}^i}{i!}\quad i=0,\ldots,6 \qquad E_i = N\cdot\hat{\pi}_i\quad i=0,\ldots,6 \qquad O_7 = N - \sum_{i=0}^{6}O_i \qquad E_7 = N - \sum_{i=0}^{6}E_i$$
$$X^2_{GOF} = \sum_{i=0}^{7}\frac{(O_i-E_i)^2}{E_i}$$
The degrees of freedom for the Chi-Square Goodness-of-Fit test is the number of categories minus 1, minus the number of estimated parameters. In the case of the EPL total goals per game with a Poisson distribution, we have 8 categories (0, 1,..., 7+) and one estimated parameter (λ), for 8 − 1 − 1 = 6 degrees of freedom.
$$X^2_{GOF-Poi} = \frac{(35-23.41)^2}{23.41} + \cdots + \frac{(12-9.08)^2}{9.08} = 12.197 \qquad \chi^2(0.05,6) = 12.592 \qquad P\left(\chi^2_6 \ge 12.197\right) = 0.0577$$
We fail to reject the hypothesis that total goals per game follows a Poisson distribution. We compare the Poisson model to the Negative Binomial model below.
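The goodness-of-fit computation in R, using the observed counts from Table 1.2 (a minimal sketch of the calculation just described; results should match the text up to rounding):

```r
# Chi-square GOF test of the Poisson model for EPL 2012/13 total goals per game
obs <- c(35, 61, 72, 91, 64, 32, 13, 12)   # counts for 0,...,6 goals and 7+
N <- sum(obs)                              # 380 games
lambda.hat <- 1063 / 380                   # 2.797
E <- N * dpois(0:6, lambda.hat)
E <- c(E, N - sum(E))                      # last cell: 7 or more goals
X2 <- sum((obs - E)^2 / E); X2             # compare with 12.197 in the text
qchisq(0.95, df = 8 - 1 - 1)               # critical value: 12.592
1 - pchisq(X2, df = 6)                     # P-value, compare with 0.0577
```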
∇
Negative Binomial Distribution
The probability distribution function for the Negative Binomial distribution can be written as (where Y is the number of occurrences of an event):
$$f(y_i|\mu,\alpha) = \frac{\Gamma\left(\alpha^{-1}+y_i\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y_i+1)}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}\left(\frac{\mu}{\alpha^{-1}+\mu}\right)^{y_i}$$
The mean and variance of this form of the Negative Binomial distribution are:
$$E\{Y\} = \mu \qquad V\{Y\} = \mu(1+\alpha\mu)$$
Since µ and α⁻¹ must be strictly positive, we re-parameterize so that α* = ln α⁻¹ and µ* = ln µ. This way α* and µ* do not need to be positive in the estimation algorithm. The likelihood function for the i-th observation is given below, which will be used to obtain estimators of the parameters and their estimated variances.
$$L_i(\mu,\alpha) = f(y_i;\mu,\alpha) = \frac{\Gamma\left(\alpha^{-1}+y_i\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y_i+1)}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}\left(\frac{\mu}{\alpha^{-1}+\mu}\right)^{y_i}$$
Note that, due to the recursive pattern of the Γ(·) function, we have:
$$\frac{\Gamma\left(\alpha^{-1}+y_i\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y_i+1)} = \frac{\left(\alpha^{-1}+y_i-1\right)\left(\alpha^{-1}+y_i-2\right)\cdots\alpha^{-1}\,\Gamma\left(\alpha^{-1}\right)}{\Gamma\left(\alpha^{-1}\right)\Gamma(y_i+1)} = \frac{\left(\alpha^{-1}+y_i-1\right)\left(\alpha^{-1}+y_i-2\right)\cdots\alpha^{-1}}{y_i!}$$
Taking the logarithm of the likelihood for the i-th observation, and replacing α⁻¹ with e^{α*} and µ with e^{µ*}, we get:
$$l_i = \ln L_i(\mu,\alpha) = \sum_{j=0}^{y_i-1}\ln\left(\alpha^{-1}+j\right) - \ln(y_i!) + \alpha^{-1}\ln\alpha^{-1} + y_i\ln\mu - \left(\alpha^{-1}+y_i\right)\ln\left(\alpha^{-1}+\mu\right)$$
$$= \sum_{j=0}^{y_i-1}\ln\left(e^{\alpha^*}+j\right) - \ln(y_i!) + e^{\alpha^*}\ln e^{\alpha^*} + y_i\ln e^{\mu^*} - \left(e^{\alpha^*}+y_i\right)\ln\left(e^{\alpha^*}+e^{\mu^*}\right)$$
Taking the derivative of the log-likelihood with respect to α* = ln α⁻¹, we obtain the following result.
$$\frac{\partial l_i}{\partial\alpha^*} = e^{\alpha^*}\left[\sum_{j=0}^{y_i-1}\frac{1}{e^{\alpha^*}+j} + \ln e^{\alpha^*} + 1 - \ln\left(e^{\alpha^*}+e^{\mu^*}\right) - \frac{e^{\alpha^*}+y_i}{e^{\alpha^*}+e^{\mu^*}}\right]$$
The derivative of the log-likelihood with respect to µ* = ln µ is
$$\frac{\partial l_i}{\partial\mu^*} = y_i - \frac{e^{\mu^*}\left(e^{\alpha^*}+y_i\right)}{e^{\alpha^*}+e^{\mu^*}}$$
The second derivatives, used to obtain estimated variances, are
$$\frac{\partial^2 l_i}{\partial(\alpha^*)^2} = e^{\alpha^*}\left[\sum_{j=0}^{y_i-1}\frac{1}{e^{\alpha^*}+j} + \ln e^{\alpha^*} + 1 - \ln\left(e^{\alpha^*}+e^{\mu^*}\right) - \frac{e^{\alpha^*}+y_i}{e^{\alpha^*}+e^{\mu^*}}\right] + e^{\alpha^*}\left[-\sum_{j=0}^{y_i-1}\frac{e^{\alpha^*}}{\left(e^{\alpha^*}+j\right)^2} + 1 - \frac{e^{\alpha^*}}{e^{\alpha^*}+e^{\mu^*}} + \frac{e^{\alpha^*}\left(y_i-\mu\right)}{\left(e^{\alpha^*}+\mu\right)^2}\right]$$
$$\frac{\partial^2 l_i}{\partial(\mu^*)^2} = -\frac{e^{\alpha^*}e^{\mu^*}\left(e^{\alpha^*}+y_i\right)}{\left(e^{\alpha^*}+e^{\mu^*}\right)^2} \qquad \frac{\partial^2 l_i}{\partial\alpha^*\partial\mu^*} = \frac{e^{\alpha^*}e^{\mu^*}\left(y_i-e^{\mu^*}\right)}{\left(e^{\alpha^*}+e^{\mu^*}\right)^2}$$
Once these have been computed (noting that they are functions of the unknown parameters), compute the following quantities.
$$g_{\alpha^*} = \sum_{i=1}^n\frac{\partial l_i}{\partial\alpha^*} \qquad g_{\mu^*} = \sum_{i=1}^n\frac{\partial l_i}{\partial\mu^*} \qquad G_{\alpha^*} = \sum_{i=1}^n\frac{\partial^2 l_i}{\partial(\alpha^*)^2} \qquad G_{\mu^*} = \sum_{i=1}^n\frac{\partial^2 l_i}{\partial(\mu^*)^2}$$
We define θ as the vector containing the elements µ* and α*, and create a vector of first partial derivatives g_θ and a matrix of second partial derivatives G_θ.
$$\theta = \begin{bmatrix}\mu^*\\ \alpha^*\end{bmatrix} \qquad g_\theta = \begin{bmatrix}g_{\mu^*}\\ g_{\alpha^*}\end{bmatrix} \qquad G_\theta = \begin{bmatrix}G_{\mu^*} & \sum_{i=1}^n\frac{\partial^2 l_i}{\partial\alpha^*\partial\mu^*}\\[4pt] \sum_{i=1}^n\frac{\partial^2 l_i}{\partial\alpha^*\partial\mu^*} & G_{\alpha^*}\end{bmatrix}$$
The Newton-Raphson algorithm is used to obtain (iterated) estimates of µ* and α*. Then α* is back-transformed to α⁻¹, and µ* to µ.
In the first step, set α* = 0, which corresponds to α⁻¹ = 1, and obtain an iterated estimate of µ*. A reasonable starting value for µ* is $\ln\bar{Y}$.
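A minimal R sketch of this fit (not the author's Newton-Raphson code; it maximizes the same log-likelihood on the (µ*, α*) scale with optim(), using dnbinom() with size = α⁻¹ = e^{α*}; the data y are simulated stand-ins):

```r
# ML fit of the Negative Binomial on the (mu*, alpha*) = (ln mu, ln alpha^-1) scale
set.seed(1)
y <- rnbinom(380, mu = 2.8, size = 5)    # illustrative data

negll <- function(theta) {
  mu   <- exp(theta[1])                  # mu = e^{mu*}
  size <- exp(theta[2])                  # alpha^{-1} = e^{alpha*}
  -sum(dnbinom(y, mu = mu, size = size, log = TRUE))
}

# Starting values as in the text: mu* = ln(ybar), alpha* = 0 (so alpha^{-1} = 1)
fit <- optim(c(log(mean(y)), 0), negll, hessian = TRUE)
exp(fit$par)                             # estimates of mu and alpha^{-1}
sqrt(diag(solve(fit$hessian)))           # SEs on the (mu*, alpha*) scale
```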