

Maximum Likelihood
Class Notes. Manuel Arellano. Revised: January 28, 2020.

1 Likelihood models

Given some data $w_1, \ldots, w_n$, a likelihood model is a family of density functions $f_n(w_1, \ldots, w_n; \theta)$, $\theta \in \Theta \subseteq \mathbb{R}^p$, that is specified as a probability model for the observed data. If the data are a random sample from some population with density function $g(w_i)$, then $f_n(w_1, \ldots, w_n; \theta)$ is a model for $\prod_{i=1}^n g(w_i)$. If the model is correctly specified, $f_n(w_1, \ldots, w_n; \theta) = \prod_{i=1}^n f(w_i, \theta)$ and $g(w_i) = f(w_i, \theta_0)$ for some value $\theta_0$ in $\Theta$. The function $f$ may be a conditional or an unconditional density. It may be flexible enough to be unrestricted, or it may place restrictions on the form of the population density.¹

In the likelihood function $\mathcal{L}(\theta) = \prod_{i=1}^n f(w_i, \theta)$ the data are given and $\theta$ is the argument. Usually we work with the log likelihood function:

$$L(\theta) = \sum_{i=1}^n \ln f(w_i, \theta). \qquad (1)$$

$L(\theta)$ measures the ability of different densities within the pre-specified class to generate the data. The maximum likelihood estimator (MLE) is the value of $\theta$ associated with the largest likelihood:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta). \qquad (2)$$

Example 1: Linear regression under normality. The model is $y \mid X \sim \mathcal{N}(X\beta_0, \sigma_0^2 I_n)$, so that

$$f_n(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(y_i \mid x_i; \theta)$$

where $\theta = (\beta', \sigma^2)'$, $\theta_0 = (\beta_0', \sigma_0^2)'$, and

$$f(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}\left(y_i - x_i'\beta\right)^2\right] \qquad (3)$$

$$L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - x_i'\beta\right)^2. \qquad (4)$$

The MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ consists of the OLS estimator $\hat{\beta}$ and the residual variance $\hat{\sigma}^2$ without degrees-of-freedom adjustment. Letting $\hat{u} = y - X\hat{\beta}$ we have:

$$\hat{\beta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2 = \frac{1}{n}\hat{u}'\hat{u}. \qquad (5)$$

¹ This note follows the practice of using the term density both for continuous random variables and for the probability function of discrete random variables, as for example in David Cox, Principles of Statistical Inference, 2006.

Example 2: Logit regression. There is a binary dependent variable $y_i$ that takes only two values, 0 and 1. Therefore, in this case $f(y_i \mid x_i) = p_i^{y_i}(1 - p_i)^{(1 - y_i)}$ where $p_i = \Pr(y_i = 1 \mid x_i)$. In the logit model the log odds ratio depends linearly on $x_i$:

$$\ln\frac{p_i}{1 - p_i} = x_i'\theta,$$

so that $p_i = \Lambda(x_i'\theta)$, where $\Lambda$ is the logistic cdf $\Lambda(r) = 1/\left(1 + \exp(-r)\right)$.

Assuming that $f_n(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(y_i \mid x_i; \theta)$, the log likelihood function is

$$L(\theta) = \sum_{i=1}^n \left[\, y_i \ln\Lambda(x_i'\theta) + (1 - y_i)\ln\left(1 - \Lambda(x_i'\theta)\right)\right].$$

The first and second partial derivatives of $L(\theta)$ are:

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^n x_i\left(y_i - \Lambda(x_i'\theta)\right)$$

$$\frac{\partial^2 L(\theta)}{\partial\theta\,\partial\theta'} = -\sum_{i=1}^n x_i x_i'\,\Lambda_i(1 - \Lambda_i), \qquad \Lambda_i \equiv \Lambda(x_i'\theta).$$

Since the Hessian matrix is negative semidefinite, as long as it is nonsingular there exists a single maximum of the likelihood function.² The first-order conditions $\partial L(\theta)/\partial\theta = 0$ are nonlinear, but we can find their root using the Newton-Raphson method of successive approximations. Namely, we begin by finding the root $\hat{\theta}_1$ of a linear approximation of $\partial L(\theta)/\partial\theta$ around some initial value $\hat{\theta}_0$ and iterate the procedure until convergence:

$$\hat{\theta}_{j+1} = \hat{\theta}_j - \left[\frac{\partial^2 L(\hat{\theta}_j)}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{\partial L(\hat{\theta}_j)}{\partial\theta} \qquad (j = 0, 1, 2, \ldots).$$

Example 3: Duration data. In the analysis of a duration variable $Y$ it is common to study its hazard rate, which is given by $h(y) = f(y)/\Pr(Y \geq y)$. If $Y$ is discrete, $h(y) = \Pr(Y = y \mid Y \geq y)$, whereas if $Y$ is continuous, $h(y) = \lim_{\Delta y \to 0}\Pr(y \leq Y < y + \Delta y \mid Y \geq y)/\Delta y = f(y)/[1 - F(y)]$.³

A simple model is to assume that the hazard function is constant: $h(y) = \lambda$ with $\lambda > 0$, which for a continuous $Y$ gives rise to the exponential distribution with pdf $f(y) = \lambda\exp(-\lambda y)$ and cdf $F(y) = 1 - \exp(-\lambda y)$. A popular conditional model is the proportional hazard specification: $h(y, x) = \lambda(y)\,\varphi(x)$.
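As a quick numerical check of the constant-hazard case (a minimal sketch, not part of the original notes: the data are simulated and the parameter values are illustrative), the exponential log likelihood $L(\lambda) = n\ln\lambda - \lambda\sum_i y_i$ is maximized at $\hat{\lambda} = n/\sum_i y_i = 1/\bar{y}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
lam_true = 0.5                                          # illustrative hazard rate
y = rng.exponential(scale=1.0 / lam_true, size=2000)   # simulated complete durations

def neg_loglik(lam):
    # constant-hazard exponential log likelihood: n*ln(lam) - lam*sum(y)
    return -(y.size * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print("numerical MLE :", res.x)           # close to 0.5
print("1 / mean(y)   :", 1.0 / y.mean())  # closed-form MLE
```

The numerical maximizer and the closed form $\hat{\lambda} = 1/\bar{y}$ agree, which is a natural sanity check before covariates enter the hazard.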
A special case is an exponential model in which the log hazard depends linearly on $x_i$: $\ln h(y_i, x_i) = x_i'\theta$. Here the log likelihood of a sample of conditionally independent complete and incomplete durations is

$$L(\theta) = \sum_{\text{complete}}\left[x_i'\theta - y_i e^{x_i'\theta}\right] - \sum_{\text{incomplete}} y_i e^{x_i'\theta} = \sum_{i=1}^n\left[(1 - c_i)\left(x_i'\theta - y_i e^{x_i'\theta}\right) - c_i\, y_i e^{x_i'\theta}\right]$$

where $c_i = 0$ if a duration is complete and $c_i = 1$ if not. The contribution to the log likelihood of a complete duration is $\ln f(y_i)$, whereas the contribution of an incomplete duration is $\ln\Pr(Y > y_i)$.

² To see that the Hessian is negative semidefinite, note that using the decomposition $\sigma_i'\sigma_i = \Lambda_i(1 - \Lambda_i)$ with $\sigma_i' = \left[\Lambda_i(1 - \Lambda_i)^{1/2},\ \Lambda_i^{1/2}(1 - \Lambda_i)\right]$, the Hessian can be written as $\partial^2 L(\theta)/\partial\theta\,\partial\theta' = -\sum_{i=1}^n x_i\,\sigma_i'\sigma_i\,x_i' = -\sum_{i=1}^n (\sigma_i x_i')'(\sigma_i x_i')$, which is negative semidefinite.

³ The hazard rate fully characterizes the distribution of $Y$. To recover $F(y)$ from $h(y)$, note that $\ln[1 - F(y)] = \sum_{s=1}^{y}\ln[1 - h(s)]$ in the discrete case, and $\ln[1 - F(y)] = \int_0^y [-h(s)]\,ds$ in the continuous one.

What does $\hat{\theta}$ estimate? The population counterpart of the sample calculation (2) is

$$\theta_0 = \arg\max_{\theta \in \Theta} E\left[\ln f(W, \theta)\right] \qquad (6)$$

where $W$ is a random variable with density $g(w)$, so that $E[\ln f(W,\theta)] = \int \ln f(w,\theta)\, g(w)\, dw$. If the population density $g(w)$ belongs to the $f(w,\theta)$ family, then $\theta_0$ as defined in (6) is the true parameter value. In effect, due to Jensen's inequality:

$$\int \ln f(w,\theta)\, f(w,\theta_0)\, dw - \int \ln f(w,\theta_0)\, f(w,\theta_0)\, dw = \int \ln\frac{f(w,\theta)}{f(w,\theta_0)}\, f(w,\theta_0)\, dw \leq \ln\int\frac{f(w,\theta)}{f(w,\theta_0)}\, f(w,\theta_0)\, dw = \ln\int f(w,\theta)\, dw = \ln 1 = 0.$$

Thus, when $g(w) = f(w,\theta_0)$ we have $E[\ln f(W,\theta)] - E[\ln f(W,\theta_0)] \leq 0$ for all $\theta$.⁴

The value $\theta_0$ can be interpreted more generally as follows:

$$\theta_0 = \arg\min_{\theta \in \Theta} E\left[\ln\frac{g(W)}{f(W,\theta)}\right]. \qquad (7)$$

The quantity $E\left[\ln\frac{g(W)}{f(W,\theta)}\right] = \int \ln\frac{g(w)}{f(w,\theta)}\, g(w)\, dw$ is the Kullback-Leibler divergence (KLD) from $f(w,\theta)$ to $g(w)$. The KLD is the expected log difference between $g(w)$ and $f(w,\theta)$ when the expectation is taken using $g(w)$. Thus, $f(w,\theta_0)$ can be regarded as the best approximation to $g(w)$ in the class $f(w,\theta)$ when the approximation is understood in the KLD sense. If $g(w) = f(w,\theta_0)$, then $\theta_0$ is called the "true value". If $g(w)$ does not belong to the $f(w,\theta)$ class and $f(w,\theta_0)$ is just the best approximation to $g(w)$ in the KLD sense, then $\theta_0$ is called a "pseudo-true value".

The extent to which a pseudo-true value remains an interesting quantity is model specific. For example, (3) is a restrictive model of the conditional distribution of $y_i$ given $x_i$: first because it assumes that the dependence of $y_i$ on $x_i$ occurs exclusively through the conditional mean $E(y_i \mid x_i)$, and secondly because this conditional mean is assumed to be a linear function of $x_i$. However, if $y_i$ depends on $x_i$ in other ways, for example through the conditional variance, the parameter values $\theta_0 = (\beta_0', \sigma_0^2)'$ remain interpretable quantities: $\beta_0$ as $\partial E(y_i \mid x_i)/\partial x_i$ and $\sigma_0^2$ as the unconditional variance of the errors $u_i = y_i - x_i'\beta_0$. If $E(y_i \mid x_i)$ is a nonlinear function, $\beta_0$ and $\sigma_0^2$ can only be characterized as the linear projection regression coefficient vector and the linear projection error variance, respectively.

Pseudo maximum likelihood estimation. The statistic $\hat{\theta}$ is the maximum likelihood estimator under the assumption that $g(w)$ belongs to the $f(w,\theta)$ class. In the absence of this assumption, $\hat{\theta}$ is a pseudo-maximum likelihood (PML) estimator based on the $f(w,\theta)$ family of densities. Sometimes $\hat{\theta}$ is called a quasi-maximum likelihood estimator.
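To make the pseudo-true value concrete, the sketch below (not part of the original notes; the data are simulated and all choices are illustrative) fits the normal location-scale model $f(w;\mu,\sigma^2)$ by maximum likelihood to data generated from a skewed exponential population. For this family the KLD-minimizing pseudo-true values are the population mean and variance, and the PML estimates approach them even though no normal density generated the data:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.exponential(scale=2.0, size=100_000)  # true density g(w): exponential with mean 2, not normal

# PML estimates under the misspecified N(mu, sigma^2) model:
# maximizing sum_i ln f(w_i; mu, sigma^2) gives the sample mean and
# the sample variance without degrees-of-freedom adjustment.
mu_hat = w.mean()
sig2_hat = ((w - mu_hat) ** 2).mean()

# Pseudo-true values: the KLD-minimizing normal matches the population
# mean and variance of g (here 2 and 4 for an exponential with mean 2).
print("mu_hat   =", mu_hat,   "  pseudo-true mu     = 2")
print("sig2_hat =", sig2_hat, "  pseudo-true sigma2 = 4")
```

The fitted normal is only the best KLD approximation to the exponential population; whether its parameters are of interest depends on the application, as emphasized above.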
⁴ Since $E\left[\ln\frac{g(W)}{f(W,\theta)}\right] = E[\ln g(W)] - E[\ln f(W,\theta)]$ and $E[\ln g(W)]$ does not depend on $\theta$, the arg min in (7) is the same as the arg max in (6).

2 Consistency and asymptotic normality of PML estimators

Under regularity and identification conditions a PML estimator is a consistent estimator of the (pseudo) true value $\theta_0$. Since $\hat{\theta}$ may not have a closed form expression, we need a method for establishing the consistency of an estimator that maximizes an objective function. The following theorem, taken from Newey and McFadden (1994), provides such a method. The requirements are boundedness of the parameter space, uniform convergence of the objective function to some nonstochastic continuous limit, and that the limiting objective function is uniquely maximized at the truth (identification).

Consistency Theorem. Suppose that $\hat{\theta}$ maximizes the objective function $S_n(\theta)$ in the parameter space $\Theta$. Assume the following:

(a) $\Theta$ is a compact set.
(b) The function $S_n(\theta)$ converges uniformly in probability to $S_0(\theta)$.
(c) $S_0(\theta)$ is continuous.
(d) $S_0(\theta)$ is uniquely maximized at $\theta_0$.

Then $\hat{\theta} \xrightarrow{p} \theta_0$.

In the PML context $S_n(\theta) = (1/n)\sum_{i=1}^n \ln f(w_i, \theta)$ and $S_0(\theta) = E[\ln f(W, \theta)]$. In particular, in the regression example, by the law of large numbers:

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\,E\left[\left(y_i - x_i'\beta\right)^2\right] \qquad (8)$$

and, noting that $y_i - x_i'\beta = u_i - x_i'(\beta - \beta_0)$, also

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\left[\sigma_0^2 + (\beta - \beta_0)'\,E(x_i x_i')\,(\beta - \beta_0)\right].$$
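A small simulation (a sketch, not part of the notes; the data-generating choices are illustrative) shows the consistency result at work in the regression example: with non-normal errors the pseudo-true value is still $(\beta_0', \sigma_0^2)'$, and the PML estimator, namely OLS plus the unadjusted residual variance, approaches it as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0 = np.array([1.0, -0.5])   # illustrative pseudo-true coefficients
sig2_0 = 3.0                    # unconditional error variance

for n in (100, 10_000, 1_000_000):
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    # t(5) errors scaled to variance sig2_0 (Var of a t(5) draw is 5/3): non-normal by design
    u = rng.standard_t(df=5, size=n) * np.sqrt(sig2_0 * 3.0 / 5.0)
    y = x @ beta0 + u
    beta_hat = np.linalg.solve(x.T @ x, x.T @ y)   # OLS = PML estimate of beta
    sig2_hat = ((y - x @ beta_hat) ** 2).mean()    # residual variance, no df adjustment
    print(n, beta_hat.round(3), round(sig2_hat, 3))
```

As $n$ grows the estimates settle near $(1, -0.5)$ and $3$, in line with the theorem: $S_n(\theta)$ converges to $S_0(\theta)$, which is uniquely maximized at $\theta_0$.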