Maximum Likelihood Estimation (MLE): Definition of MLE


Econ 620

Definition of MLE

• Consider a parametric model in which the joint distribution of $Y = (y_1, y_2, \dots, y_n)$ has a density $\ell(Y;\theta)$ with respect to a dominating measure $\mu$, where $\theta \in \Theta \subset \mathbb{R}^p$.

Definition 1. A maximum likelihood estimator (MLE) of $\theta$ is a solution to the maximization problem
$$\max_{\theta \in \Theta} \ell(y;\theta).$$

• Since the solution to an optimization problem is invariant to a strictly monotone increasing transformation of the objective function, a MLE can equivalently be obtained as a solution to the problem
$$\max_{\theta \in \Theta} \log \ell(y;\theta) = \max_{\theta \in \Theta} L(y;\theta).$$

Proposition 2 (Sufficient condition for existence). If the parameter space $\Theta$ is compact and the likelihood function $\theta \mapsto \ell(y;\theta)$ is continuous on $\Theta$, then a MLE exists.

Proposition 3 (Sufficient condition for uniqueness of MLE). If the parameter space $\Theta$ is convex and the likelihood function $\theta \mapsto \ell(y;\theta)$ is strictly concave in $\theta$, then the MLE is unique when it exists.

• If the observations on $Y$ are i.i.d. with density $f(y_i;\theta)$ for each observation, then we can write the likelihood function as
$$\ell(y;\theta) = \prod_{i=1}^{n} f(y_i;\theta) \quad\Rightarrow\quad L(y;\theta) = \sum_{i=1}^{n} \log f(y_i;\theta).$$

Properties of MLE

Proposition 4 (Functional invariance of MLE). Suppose $g:\Theta \to \Lambda$ is a bijective function, where $\Lambda \subset \mathbb{R}^q$, and $\hat\theta$ is a MLE of $\theta$. Then $\hat\lambda = g(\hat\theta)$ is a MLE of $\lambda \in \Lambda$.
⇒ By definition of MLE, we have $\hat\theta \in \Theta$ and $\ell(y;\hat\theta) \ge \ell(y;\theta)$ for all $\theta \in \Theta$, or equivalently, $\hat\lambda \in \Lambda$ and $\ell(y; g^{-1}(\hat\lambda)) \ge \ell(y; g^{-1}(\lambda))$ for all $\lambda \in \Lambda$, which implies that $\hat\lambda = g(\hat\theta)$ is a MLE of $\lambda$ in a model with density $\ell(y; g^{-1}(\lambda))$.

Proposition 5 (Relationship with sufficiency). The MLE is a function of every sufficient statistic.
⇒ Let $S(Y)$ be a sufficient statistic. By the factorization theorem for sufficient statistics, the density function can be written as $\ell(y;\theta) = \Psi(S(y);\theta)\, h(y)$, i.e., $L(y;\theta) = \log \Psi(S(y);\theta) + \log h(y)$. Hence maximizing $\ell(y;\theta)$ with respect to $\theta$ is equivalent to maximizing $\log \Psi(S(y);\theta)$ with respect to $\theta$. Therefore the MLE depends on $Y$ only through $S(Y)$.

• To discuss the asymptotic properties of MLE, which are the reason we study and use MLE in practice, we need some so-called regularity conditions. These conditions are supposed to be verified, not taken for granted, before MLE is used, although in practice they are difficult, and often impossible, to check.

Regularity Conditions

1. The variables $Y_i$, $i = 1, 2, \dots$, are independent and identically distributed with density $f(y;\theta)$.
2. The parameter space $\Theta$ is compact.
3. The true but unknown parameter value $\theta_0$ is identified, i.e.,
$$\theta_0 = \arg\max_{\theta \in \Theta} E_{\theta_0} \log f(Y_i;\theta).$$
4. The likelihood function $L(y;\theta) = \sum_{i=1}^{n} \log f(y_i;\theta)$ is continuous in $\theta$.
5. $E_{\theta_0} \log f(Y;\theta)$ exists.
6. The log-likelihood function is such that $\frac{1}{n} L(y;\theta)$ converges almost surely (in probability) to $E_{\theta_0} \log f(Y_i;\theta)$ uniformly in $\theta \in \Theta$, i.e.,
$$\sup_{\theta \in \Theta} \left| \frac{1}{n} L(y;\theta) - E_{\theta_0} \log f(Y_i;\theta) \right| \to 0 \quad \text{almost surely (in probability)}.$$

Proposition 6. Under conditions 1-6, there exists a sequence of MLEs converging almost surely (in probability) to the true parameter value $\theta_0$. That is, the MLE is a consistent estimator.
⇒ Conditions 1 and 2 ensure the existence of the MLE $\hat\theta_n$. It is obtained by maximizing $L(y;\theta)$ or, equivalently, $\frac{1}{n} L(y;\theta)$. Since $\frac{1}{n} L(y;\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f(y_i;\theta)$ can be interpreted as the sample mean of the i.i.d. random variables $\log f(y_i;\theta)$, the objective function converges almost surely (in probability) to $E_{\theta_0} \log f(Y;\theta)$ by the strong (weak) law of large numbers. Furthermore, the uniform strong law of large numbers implies that the maximizer of $\frac{1}{n} \sum_{i=1}^{n} \log f(y_i;\theta)$, namely $\hat\theta_n$, converges to the solution of the limit problem
$$\max_{\theta \in \Theta} E_{\theta_0} \log f(Y;\theta), \quad\text{i.e.,}\quad \max_{\theta \in \Theta} \int_{\mathcal{Y}} \log f(y;\theta)\, f(y;\theta_0)\, dy.$$
The identifiability condition 3 then ensures the convergence of $\hat\theta_n$ to $\theta_0$.
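• As a numerical illustration (not part of the original notes), the definition translates directly into a computation: maximize the sample log-likelihood $L(y;\theta)$ over a compact $\Theta$. The sketch below, which assumes NumPy and SciPy are available, uses a hypothetical i.i.d. Exponential($\theta$) sample, for which the MLE has the closed form $\hat\theta_n = 1/\bar y$, so the numerical maximizer can be checked against it.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.0                                        # true rate parameter (illustration only)
y = rng.exponential(scale=1.0 / theta0, size=500)   # i.i.d. Exponential(theta0) draws

def neg_log_likelihood(theta, y):
    """Negative log-likelihood for i.i.d. Exponential(theta) data:
    L(y; theta) = n*log(theta) - theta*sum(y)."""
    return -(len(y) * np.log(theta) - theta * y.sum())

# Maximize L(y; theta) over a compact parameter space Theta = [1e-6, 50]
res = minimize_scalar(neg_log_likelihood, args=(y,),
                      bounds=(1e-6, 50.0), method="bounded")
theta_hat_numeric = res.x
theta_hat_closed_form = 1.0 / y.mean()   # analytical MLE for this model

print(theta_hat_numeric, theta_hat_closed_form)  # the two should agree closely
```

By Proposition 6, both estimates converge to $\theta_0$ as the sample size grows; rerunning the sketch with larger `size` values illustrates the consistency result.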
More regularity conditions for asymptotic distribution

2'. $\theta_0 \in \operatorname{Int}(\Theta)$.
7. The log-likelihood function $L(y;\theta)$ is twice continuously differentiable in a neighborhood of $\theta_0$.
8. Integration and differentiation operators are interchangeable.
9. The matrix
$$I(\theta_0) = E_{\theta_0}\!\left[ -\frac{\partial^2 \log f(Y;\theta_0)}{\partial\theta\,\partial\theta'} \right],$$
called the information matrix, exists and is non-singular.

• These additional assumptions enable us to use differential methods to obtain the MLE and its asymptotic distribution.

Lemma 7.
$$E_{\theta_0}\!\left[ \frac{\partial \log f(Y;\theta_0)}{\partial\theta} \right] = 0.$$
⇒
$$E_{\theta_0}\!\left[ \frac{\partial \log f(Y;\theta_0)}{\partial\theta} \right] = \int \frac{\partial \log f(y;\theta_0)}{\partial\theta} f(y;\theta_0)\, dy = \int \frac{1}{f(y;\theta_0)} \frac{\partial f(y;\theta_0)}{\partial\theta} f(y;\theta_0)\, dy = \int \frac{\partial f(y;\theta_0)}{\partial\theta}\, dy.$$
However, $\int f(y;\theta_0)\, dy = 1$ by definition. Hence, differentiating with respect to $\theta$ gives
$$\frac{\partial}{\partial\theta} \int f(y;\theta_0)\, dy = \int \frac{\partial f(y;\theta_0)}{\partial\theta}\, dy = 0.$$

Lemma 8.
$$E_{\theta_0}\!\left[ \frac{\partial \log f(Y;\theta_0)}{\partial\theta} \frac{\partial \log f(Y;\theta_0)}{\partial\theta'} \right] = E_{\theta_0}\!\left[ -\frac{\partial^2 \log f(Y;\theta_0)}{\partial\theta\,\partial\theta'} \right].$$
⇒
$$
\begin{aligned}
E_{\theta_0}\!\left[ \frac{\partial^2 \log f(Y;\theta_0)}{\partial\theta\,\partial\theta'} \right]
&= \int \frac{\partial^2 \log f(y;\theta_0)}{\partial\theta\,\partial\theta'} f(y;\theta_0)\, dy
= \int \frac{\partial}{\partial\theta'}\!\left( \frac{\partial \log f(y;\theta_0)}{\partial\theta} \right) f(y;\theta_0)\, dy \\
&= \int \frac{\partial}{\partial\theta'}\!\left( \frac{1}{f(y;\theta_0)} \frac{\partial f(y;\theta_0)}{\partial\theta} \right) f(y;\theta_0)\, dy \\
&= \int \left[ -\frac{1}{\big(f(y;\theta_0)\big)^2} \frac{\partial f(y;\theta_0)}{\partial\theta} \frac{\partial f(y;\theta_0)}{\partial\theta'} + \frac{1}{f(y;\theta_0)} \frac{\partial^2 f(y;\theta_0)}{\partial\theta\,\partial\theta'} \right] f(y;\theta_0)\, dy \\
&= -\int \frac{1}{f(y;\theta_0)} \frac{\partial f(y;\theta_0)}{\partial\theta}\, \frac{1}{f(y;\theta_0)} \frac{\partial f(y;\theta_0)}{\partial\theta'}\, f(y;\theta_0)\, dy + \int \frac{\partial^2 f(y;\theta_0)}{\partial\theta\,\partial\theta'}\, dy \\
&= -\int \frac{\partial \log f(y;\theta_0)}{\partial\theta} \frac{\partial \log f(y;\theta_0)}{\partial\theta'} f(y;\theta_0)\, dy
= -E_{\theta_0}\!\left[ \frac{\partial \log f(Y;\theta_0)}{\partial\theta} \frac{\partial \log f(Y;\theta_0)}{\partial\theta'} \right].
\end{aligned}
$$
The last line follows from the fact that $\int \frac{\partial^2 f(y;\theta_0)}{\partial\theta\,\partial\theta'}\, dy = 0$.

Proposition 9. Under conditions 1, 2', and 3-9, a sequence of MLEs $\hat\theta_n$ satisfies
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\, I(\theta_0)^{-1}\right).$$
⇒ A Taylor series expansion of the first-order condition around the true value $\theta_0$ yields
$$\frac{\partial L(\hat\theta_n)}{\partial\theta} = \frac{\partial L(\theta_0)}{\partial\theta} + \frac{\partial^2 L(\theta^*)}{\partial\theta\,\partial\theta'} (\hat\theta_n - \theta_0),$$
where $\theta^*$ lies on the line segment connecting $\hat\theta_n$ and $\theta_0$. From the first-order condition, we have
$$0 = \frac{\partial L(\theta_0)}{\partial\theta} + \frac{\partial^2 L(\theta^*)}{\partial\theta\,\partial\theta'} (\hat\theta_n - \theta_0).$$
Therefore,
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \left( -\frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial\theta\,\partial\theta'} \right)^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta}.$$
As $n \to \infty$,
$$-\frac{1}{n} \frac{\partial^2 L(\theta^*)}{\partial\theta\,\partial\theta'} = \frac{1}{n} \sum_{i=1}^{n} \left( -\frac{\partial^2 \log f(Y_i;\theta^*)}{\partial\theta\,\partial\theta'} \right)$$
converges almost surely to
$$I(\theta_0) = E_{\theta_0}\!\left[ -\frac{\partial^2 \log f(Y;\theta_0)}{\partial\theta\,\partial\theta'} \right]$$
by the strong law of large numbers and the fact that $\theta^* \xrightarrow{a.s.} \theta_0$. Moreover,
$$\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial \log f(Y_i;\theta_0)}{\partial\theta} = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log f(Y_i;\theta_0)}{\partial\theta} - E_{\theta_0} \frac{\partial \log f(Y;\theta_0)}{\partial\theta} \right),$$
which converges in distribution to $N(0, I(\theta_0))$ by the central limit theorem; Lemma 7 and Lemma 8 supply the mean and variance needed for the asymptotic distribution of $\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta}$. Then,
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\, I(\theta_0)^{-1}\right).$$

• The asymptotic distribution by itself is not directly usable, since the information matrix would have to be evaluated at the true parameter value. However, we can consistently estimate the asymptotic variance of the MLE by evaluating the information matrix at the MLE, i.e.,
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\, I(\hat\theta_n)^{-1}\right).$$
Another expression, slightly misleading but commonly used in practice, is
$$\hat\theta_n \xrightarrow{d} N\!\left(\theta_0,\, \tfrac{1}{n} I(\hat\theta_n)^{-1}\right) = N\!\left(\theta_0,\, \big(n I(\hat\theta_n)\big)^{-1}\right), \quad\text{where}\quad n I(\hat\theta_n) = -\frac{\partial^2 L(\hat\theta_n)}{\partial\theta\,\partial\theta'}.$$
We can also use the approximation
$$I_n(\hat\theta_n) = \sum_{i=1}^{n} \frac{\partial \log f(y_i;\hat\theta_n)}{\partial\theta} \frac{\partial \log f(y_i;\hat\theta_n)}{\partial\theta'}.$$
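• The two estimates of the information matrix above correspond to the negative Hessian of the log-likelihood at $\hat\theta_n$ and to the outer product of the scores. A minimal sketch (not from the original notes) for the hypothetical exponential example introduced earlier, where the score is $1/\theta - y$ and the second derivative is $-1/\theta^2$, is:

```python
import numpy as np

def mle_standard_errors(y, theta_hat):
    """Two estimates of the standard error of the MLE in the i.i.d.
    Exponential(theta) model, using
      (a) the observed information n*I(theta_hat) = -d^2 L / d theta^2, and
      (b) the outer-product-of-gradients (sum of squared scores) approximation.
    For this model: d log f / d theta = 1/theta - y,
                    d^2 log f / d theta^2 = -1/theta^2."""
    n = len(y)
    n_info_hessian = n / theta_hat**2            # -d^2 L(theta_hat)/d theta^2
    scores = 1.0 / theta_hat - y                 # per-observation scores at theta_hat
    n_info_opg = np.sum(scores**2)               # outer-product approximation
    # Asymptotic variance of theta_hat is approximately (n*I(theta_hat))^{-1}
    return np.sqrt(1.0 / n_info_hessian), np.sqrt(1.0 / n_info_opg)

# Example usage (continuing the earlier hypothetical sample y, theta_hat_numeric):
# se_hess, se_opg = mle_standard_errors(y, theta_hat_numeric)
# ci_95 = (theta_hat_numeric - 1.96 * se_hess, theta_hat_numeric + 1.96 * se_hess)
```

Either standard error can be plugged into a Wald-type interval as in the commented example; under the regularity conditions above the two agree asymptotically.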
Proposition 10. Let $g$ be a continuously differentiable function of $\theta \in \mathbb{R}^p$ with values in $\mathbb{R}^q$. Then, under the assumptions of Proposition 9,
(i) $g(\hat\theta_n)$ converges almost surely to $g(\theta_0)$;
(ii) $\sqrt{n}\,\big(g(\hat\theta_n) - g(\theta_0)\big) \xrightarrow{d} N\!\left(0,\; \frac{dg(\theta_0)}{d\theta'}\, I(\theta_0)^{-1} \left(\frac{dg(\theta_0)}{d\theta'}\right)'\right)$.
⇒ The first claim is a straightforward application of Slutsky's theorem. For the second claim, a Taylor expansion of $g(\hat\theta_n)$ around $\theta_0$ gives
$$g(\hat\theta_n) = g(\theta_0) + \frac{dg(\theta^*)}{d\theta'} (\hat\theta_n - \theta_0).$$
Hence,
$$\sqrt{n}\,\big(g(\hat\theta_n) - g(\theta_0)\big) = \frac{dg(\theta^*)}{d\theta'}\, \sqrt{n}\,(\hat\theta_n - \theta_0).$$
Note that, as $n \to \infty$, we have
$$\frac{dg(\theta^*)}{d\theta'} \xrightarrow{a.s.} \frac{dg(\theta_0)}{d\theta'} \quad\text{and}\quad \sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0,\, I(\theta_0)^{-1}\right).$$
The claim follows immediately.
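• Proposition 10 is the delta method: the asymptotic variance of $g(\hat\theta_n)$ is the gradient of $g$ sandwiched around $I(\theta_0)^{-1}$. A minimal sketch (not from the original notes), again for the hypothetical scalar Exponential($\theta$) example with $g(\theta) = 1/\theta$ (the mean), is:

```python
import numpy as np

def delta_method_se(theta_hat, n, grad_g):
    """Standard error of g(theta_hat) via the delta method for scalar theta:
    Var(g(theta_hat)) ~ g'(theta_hat)^2 * I(theta_hat)^{-1} / n,
    using I(theta) = 1/theta^2 for the Exponential(theta) model."""
    var_theta_hat = theta_hat**2 / n          # I(theta_hat)^{-1} / n
    return np.sqrt(grad_g(theta_hat)**2 * var_theta_hat)

# g(theta) = 1/theta is the mean of an Exponential(theta) variable,
# with gradient g'(theta) = -1/theta^2.
# Example usage (continuing the earlier hypothetical sample):
# se_mean = delta_method_se(theta_hat_numeric, len(y), lambda t: -1.0 / t**2)
# Since g(theta_hat) = 1/theta_hat = ybar, this reproduces the usual
# standard error of the sample mean, (1/theta_hat)/sqrt(n).
```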