Chapter 1 Introduction
Yibi Huang
Department of Statistics
University of Chicago

Outline
• Variable types
• Review of binomial and multinomial distributions
• Likelihood and maximum likelihood method
• Inference for a binomial proportion (Wald, score, and likelihood ratio tests and confidence intervals)
• Small-sample inference

Variable Types

Regression methods are used to analyze data when the response variable is numerical
• e.g., temperature, blood pressure, height, speed, income
• Stat 22200, Stat 22400

Methods in categorical data analysis are used when the response variable takes categorical (or qualitative) values
• e.g.,
  • gender (male, female),
  • political philosophy (liberal, moderate, conservative),
  • region (metropolitan, urban, suburban, rural)
• Stat 22600

In either case, the explanatory variables can be numerical or categorical.

Two Types of Categorical Variables

Nominal: unordered categories, e.g.,
• transport to work (car, bus, bicycle, walk, other)
• favorite music (rock, hip-hop, pop, classical, jazz, country, folk)

Ordinal: ordered categories, e.g.,
• patient condition (excellent, good, fair, poor)
• government spending (too high, about right, too low)

We pay special attention to binary variables (success or failure), for which the nominal-ordinal distinction is unimportant.

Review of Binomial and Multinomial Distributions

Binomial Distributions (Review)

If n Bernoulli trials are performed:
• there are only two possible outcomes for each trial (success, failure),
• π = P(success) and 1 − π = P(failure) for each trial,
• the trials are independent,
• Y = number of successes out of the n trials,

then Y has a binomial distribution, denoted Y ~ binomial(n, π).

The probability function of Y is

P(Y = y) = C(n, y) π^y (1 − π)^(n−y),   y = 0, 1, ..., n,

where C(n, y) = n! / (y! (n − y)!) is the binomial coefficient and m! = "m factorial" = m × (m − 1) × (m − 2) × ... × 1. Note that 0! = 1.

Example

Vote (Dem, Rep). Suppose π = P(Dem) = 0.4. Sample n = 3 voters and let y = the number of Dem votes among them. Then

P(y) = [3! / (y! (3 − y)!)] (0.4)^y (0.6)^(3−y)

P(0) = [3! / (0! 3!)] (0.4)^0 (0.6)^3 = (0.6)^3 = 0.216
P(1) = [3! / (1! 2!)] (0.4)^1 (0.6)^2 = 3(0.4)(0.6)^2 = 0.432

y      P(y)
0      0.216
1      0.432
2      0.288
3      0.064
total  1

R Codes

> dbinom(x=0, size=3, p=0.4)
[1] 0.216
> dbinom(0, 3, 0.4)
[1] 0.216
> dbinom(1, 3, 0.4)
[1] 0.432
> dbinom(0:3, 3, 0.4)
[1] 0.216 0.432 0.288 0.064
> plot(0:3, dbinom(0:3, 3, .4), type = "h", xlab = "y", ylab = "P(y)")

[Figure: probability histogram of the binomial(n = 3, π = 0.4) distribution produced by the plot() call above.]

Facts About the Binomial Distribution

If Y is a binomial(n, π) random variable, then
• E(Y) = nπ,
• σ(Y) = sqrt(Var(Y)) = sqrt(nπ(1 − π)),
• binomial(n, π) can be approximated by Normal(nπ, nπ(1 − π)) when n is large (nπ > 5 and n(1 − π) > 5).

[Figure: probability histograms of binomial(n = 8, π = 0.2) and binomial(n = 25, π = 0.2).]

Multinomial Distribution — Generalization of Binomial

If n trials are performed:
• in each trial there are c > 2 possible outcomes (categories),
• πi = P(category i) for each trial, with Σ_{i=1}^{c} πi = 1,
• the trials are independent,
• Yi = number of the n trials that fall in category i,

then the joint distribution of (Y1, Y2, ..., Yc) is a multinomial distribution, with probability function

P(Y1 = y1, Y2 = y2, ..., Yc = yc) = [n! / (y1! y2! ... yc!)] π1^y1 π2^y2 ... πc^yc,

where 0 ≤ yi ≤ n for all i and Σ_i yi = n.

Example

Suppose the proportions of individuals with genotypes AA, Aa, and aa in a large population are

(πAA, πAa, πaa) = (0.25, 0.5, 0.25).

Randomly sample n = 5 individuals from the population. The chance of getting 2 AA's, 2 Aa's, and 1 aa is

P(YAA = 2, YAa = 2, Yaa = 1) = [5! / (2! 2! 1!)] (0.25)^2 (0.5)^2 (0.25)^1
                             = [5·4·3·2·1 / ((2·1)(2·1)(1))] (0.25)^2 (0.5)^2 (0.25)^1 ≈ 0.117,

and the chance of getting no AA, 3 Aa's, and 2 aa's is

P(YAA = 0, YAa = 3, Yaa = 2) = [5! / (0! 3! 2!)] (0.25)^0 (0.5)^3 (0.25)^2
                             = [5·4·3·2·1 / ((1)(3·2·1)(2·1))] (0.25)^0 (0.5)^3 (0.25)^2 ≈ 0.078.
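These multinomial probabilities can also be checked in R with dmultinom(), in the same spirit as the dbinom() calls above; this is a minimal sketch added for illustration, not part of the original slides.

> dmultinom(c(2, 2, 1), prob = c(0.25, 0.5, 0.25))
[1] 0.1171875
> dmultinom(c(0, 3, 2), prob = c(0.25, 0.5, 0.25))
[1] 0.078125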
Facts About the Multinomial Distribution

If (Y1, Y2, ..., Yc) has a multinomial distribution with n trials and category probabilities (π1, π2, ..., πc), then
• E(Yi) = nπi for i = 1, 2, ..., c,
• σ(Yi) = sqrt(Var(Yi)) = sqrt(nπi(1 − πi)),
• Cov(Yi, Yj) = −nπiπj.

Likelihood and Maximum Likelihood Estimation

A Probability Question

A push pin is tossed n = 5 times. Let Y be the number of times the push pin lands on its head. What is P(Y = 3)?

Answer. As the tosses are independent, Y is binomial(n = 5, π), so

P(Y = y; π) = [n! / (y! (n − y)!)] π^y (1 − π)^(n−y),

where π = P(push pin lands on its head in a toss). If π is known to be 0.4, then

P(Y = 3; π) = [5! / (3! 2!)] (0.4)^3 (0.6)^2 = 0.2304.

A Statistics Question

Suppose a push pin is observed to land on its head Y = 8 times in n = 20 tosses. Can we infer the value of π = P(push pin lands on its head in a toss)?

The chance of observing Y = 8 in n = 20 tosses is

P(Y = 8; π) = C(20, 8) (0.3)^8 (0.7)^12 ≈ 0.1143   if π = 0.3,
P(Y = 8; π) = C(20, 8) (0.6)^8 (0.4)^12 ≈ 0.0354   if π = 0.6.

It appears that π = 0.3 is more plausible than π = 0.6, since the former gives a higher probability to the observed outcome Y = 8.

Likelihood

The probability

P(Y = y; π) = C(n, y) π^y (1 − π)^(n−y) = ℓ(π|y),

viewed as a function of π, is called the likelihood function (or just likelihood) of π, denoted ℓ(π|y). It is a measure of the "plausibility" of a value being the true value of π.

[Figure: curves of the likelihood ℓ(π|y) against π for y = 0, 2, 8, and 14 with n = 20.]

In general, suppose the observed data (Y1, Y2, ..., Yn) have a joint probability distribution with some parameter(s) θ,

P(Y1 = y1, Y2 = y2, ..., Yn = yn) = f(y1, y2, ..., yn | θ).

The likelihood function for the parameter θ is

ℓ(θ) = ℓ(θ | y1, y2, ..., yn) = f(y1, y2, ..., yn | θ).

• Note that the likelihood function regards the probability as a function of the parameter θ rather than as a function of the data y1, y2, ..., yn.
• If ℓ(θ1 | y1, ..., yn) > ℓ(θ2 | y1, ..., yn), then θ1 appears more plausible to be the true value of θ than θ2 does, given the observed data y1, ..., yn.

Maximum Likelihood Estimate (MLE)

The maximum likelihood estimate (MLE) of a parameter θ is the value at which the likelihood function is maximized.

Example. If a push pin lands on its head Y = 8 times in n = 20 tosses, the likelihood function

ℓ(π | y = 8) = C(20, 8) π^8 (1 − π)^12

reaches its maximum at π = 0.4, so the MLE of π is π̂ = 0.4 given the data Y = 8.

[Figure: the likelihood ℓ(π | y = 8) plotted against π, maximized at π = 0.4.]

Maximizing the Log-likelihood

Rather than maximizing the likelihood, it is usually computationally easier to maximize its logarithm, called the log-likelihood,

log ℓ(π|y),

which is equivalent since the logarithm is strictly increasing:

x1 > x2  ⟺  log(x1) > log(x2).

So

ℓ(π1|y) > ℓ(π2|y)  ⟺  log ℓ(π1|y) > log ℓ(π2|y).

Example (MLE for Binomial)

If the observed data Y ~ binomial(n, π) but π is unknown, the likelihood of π is

ℓ(π|y) = P(Y = y | π) = C(n, y) π^y (1 − π)^(n−y),

and the log-likelihood is

log ℓ(π|y) = log C(n, y) + y log(π) + (n − y) log(1 − π).

From calculus, we know a function f(x) reaches its max at x = x0 if (d/dx) f(x) = 0 at x = x0 and (d²/dx²) f(x) < 0 at x = x0. As

(d/dπ) log ℓ(π|y) = y/π − (n − y)/(1 − π) = (y − nπ) / (π(1 − π))

equals 0 when π = y/n, and

(d²/dπ²) log ℓ(π|y) = −y/π² − (n − y)/(1 − π)² < 0

is always true, we know log ℓ(π|y) reaches its max when π = y/n. So the MLE of π is π̂ = y/n.
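The same maximum can also be found numerically. Below is a minimal R sketch (added for illustration, not from the slides) that maximizes the log-likelihood for the push-pin data y = 8, n = 20 with optimize(); up to numerical tolerance, the maximizer agrees with the closed-form MLE y/n = 0.4.

> loglik <- function(p) dbinom(8, size = 20, prob = p, log = TRUE)
> optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum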
More Facts about MLEs

• If Y1, Y2, ..., Yn are i.i.d. N(µ, σ²), the MLE of µ is the sample mean Σ_{i=1}^{n} Yi / n.
• In ordinary linear regression,

  Yi = β0 + β1 xi1 + ... + βp xip + εi,

  when the noise terms εi are i.i.d. normal, the usual least squares estimates of β0, β1, ..., βp are MLEs (see the sketch at the end of this section).

Large Sample Optimality of MLEs

MLEs are not always the best estimators, but they have a number of good properties.
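As a quick check of the regression bullet above, here is a minimal R sketch with simulated data (the data and coefficient values are hypothetical, not from the slides): fitting the same model by least squares with lm() and by maximum likelihood with glm() under a Gaussian family returns identical coefficient estimates.

> set.seed(1)
> x <- rnorm(50)
> y <- 1 + 2 * x + rnorm(50)             # linear model with i.i.d. normal noise
> coef(lm(y ~ x))                        # least squares estimates
> coef(glm(y ~ x, family = gaussian()))  # ML fit; same estimates as lm()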