ECE 645: Estimation Theory, Spring 2015
Instructor: Prof. Stanley H. Chan

Lecture 8: Properties of Maximum Likelihood Estimation (MLE)
(LaTeX prepared by Haiguang Wen), April 27, 2015

This lecture note is based on ECE 645 (Spring 2015) by Prof. Stanley H. Chan in the School of Electrical and Computer Engineering at Purdue University.

1 Efficiency of MLE

Maximum Likelihood Estimation (MLE) is a widely used statistical estimation method. In this lecture, we study its properties: efficiency, consistency and asymptotic normality. MLE is a method for estimating the parameters of a statistical model. Given the distribution $f(y;\theta)$ of a statistical model with unknown deterministic parameter $\theta$, the MLE estimates $\theta$ by maximizing the likelihood $f(y;\theta)$ of the observations $y$:

\[
\hat\theta(y) = \arg\max_{\theta} f(y;\theta). \tag{1}
\]

Please see the previous lecture note (Lecture 7) for details.

1.1 Cramér–Rao Lower Bound (CRLB)

The Cramér–Rao Lower Bound (CRLB) was introduced in Lecture 7. Briefly, the CRLB is a lower bound on the variance of estimators of the deterministic parameter $\theta$:

\[
\operatorname{Var}\big(\hat\theta(Y)\big) \;\ge\; \frac{\big(\tfrac{\partial}{\partial\theta}\,\mathbb{E}[\hat\theta(Y)]\big)^2}{I(\theta)}, \tag{2}
\]

where $I(\theta)$ is the Fisher information, which measures the information carried by the observable random variable $Y$ about the unknown parameter $\theta$. For an unbiased estimator $\hat\theta(Y)$, Equation (2) simplifies to

\[
\operatorname{Var}\big(\hat\theta(Y)\big) \;\ge\; \frac{1}{I(\theta)}, \tag{3}
\]

which means the variance of any unbiased estimator is at least as large as the inverse of the Fisher information.

1.2 Efficient Estimator

From Section 1.1, we know that the variance of an estimator $\hat\theta(y)$ cannot be lower than the CRLB. Any estimator whose variance equals the lower bound is called an efficient estimator.

Definition 1 (Efficient Estimator). An estimator $\hat\theta(y)$ is efficient if it achieves equality in the CRLB.

Example 1.
Question: $Y = \{Y_1, Y_2, \ldots, Y_n\}$ are i.i.d. Gaussian random variables with distribution $\mathcal{N}(\theta, \sigma^2)$. Determine the maximum likelihood estimator of $\theta$. Is the estimator efficient?

Solution: Let $y = \{y_1, y_2, \ldots, y_n\}$ be the observations. Then

\[
f(y;\theta) = \prod_{k=1}^{n} f(y_k;\theta)
            = \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_k-\theta)^2}{2\sigma^2}\right)
            = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{\sum_{k=1}^{n}(y_k-\theta)^2}{2\sigma^2}\right).
\]

Taking the log of both sides, we have

\[
\log f(y;\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\sum_{k=1}^{n}(y_k-\theta)^2}{2\sigma^2}.
\]

Since $\log f(y;\theta)$ is a quadratic concave function of $\theta$, we can obtain the MLE by solving

\[
\frac{\partial \log f(y;\theta)}{\partial\theta} = \frac{\sum_{k=1}^{n}(y_k-\theta)}{\sigma^2} = 0.
\]

Therefore, the MLE is $\hat\theta_{\mathrm{MLE}}(y) = \frac{1}{n}\sum_{k=1}^{n} y_k$.

Now let us check whether the estimator is efficient or not. It is easy to check that the MLE is unbiased ($\mathbb{E}[\hat\theta_{\mathrm{MLE}}(Y)] = \theta$). To determine the CRLB, we need the Fisher information of the model:

\[
I(\theta) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log f(Y;\theta)\right] = \frac{n}{\sigma^2}. \tag{4}
\]

According to Equation (3), we have

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MLE}}(Y)\big) \;\ge\; \frac{1}{I(\theta)} = \frac{\sigma^2}{n}, \tag{5}
\]

and the variance of the MLE is

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MLE}}(Y)\big) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{k=1}^{n} Y_k\right) = \frac{\sigma^2}{n}. \tag{6}
\]

So the CRLB equality is achieved, and thus the MLE is efficient.
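The efficiency claim in Example 1 is easy to sanity-check numerically. The following short Monte Carlo sketch is not part of the original note; the values of theta, sigma, n and the number of trials are arbitrary illustrative choices. It estimates the bias and variance of the sample-mean MLE and compares the variance with the CRLB $\sigma^2/n$ from Equation (5).

```python
# Monte Carlo sketch for Example 1: the sample-mean MLE of a Gaussian mean
# is (empirically) unbiased and its variance matches the CRLB sigma^2 / n.
# theta, sigma, n and trials are illustrative choices, not from the notes.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 3.0, 50, 200_000

y = rng.normal(loc=theta, scale=sigma, size=(trials, n))
theta_mle = y.mean(axis=1)          # MLE from Example 1: the sample mean

print("empirical bias     :", theta_mle.mean() - theta)   # close to 0
print("empirical variance :", theta_mle.var())            # close to sigma^2 / n
print("CRLB  sigma^2 / n  :", sigma**2 / n)                # Equation (5)
```

With these settings the empirical variance and the bound agree to within Monte Carlo error, which is exactly what "efficient" means here.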
1.3 Minimum Variance Unbiased Estimator (MVUE)

Recall that a Minimum Variance Unbiased Estimator (MVUE) is an unbiased estimator whose variance is no larger than that of any other unbiased estimator, for all possible values of the parameter $\theta$. That is,

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MVUE}}(Y)\big) \;\le\; \operatorname{Var}\big(\tilde\theta(Y)\big) \tag{7}
\]

for any unbiased estimator $\tilde\theta(Y)$ and any $\theta$.

Proposition 1 (Unbiased and Efficient Estimators). If an estimator $\hat\theta(y)$ is unbiased and efficient, then it must be an MVUE.

Proof. Since $\hat\theta(y)$ is efficient, it achieves the CRLB, so

\[
\operatorname{Var}\big(\hat\theta(Y)\big) \;\le\; \operatorname{Var}\big(\tilde\theta(Y)\big) \tag{8}
\]

for any unbiased $\tilde\theta(Y)$. Therefore $\hat\theta(Y)$ must be minimum variance (MV). Since $\hat\theta(Y)$ is also unbiased, it is an MVUE. □

Remark: The converse of the proposition is not true in general. That is, an MVUE does NOT need to be efficient. Here is a counterexample.

Example 2 (Counterexample). Suppose that $Y = \{Y_1, Y_2, \ldots, Y_n\}$ are i.i.d. exponential random variables with unknown rate parameter $\theta$, i.e., $f(y;\theta) = \theta e^{-\theta y}$ for $y \ge 0$. Find the MLE and the MVUE of $\theta$. Are these estimators efficient?

Solution: Let $y = \{y_1, y_2, \ldots, y_n\}$ be the observations. Then

\[
f(y;\theta) = \prod_{k=1}^{n} f(y_k;\theta)
            = \prod_{k=1}^{n} \theta\exp\{-\theta y_k\}
            = \theta^n \exp\!\left\{-\theta\sum_{k=1}^{n} y_k\right\}. \tag{9}
\]

Taking the log of both sides, we have

\[
\log f(y;\theta) = n\log\theta - \theta\sum_{k=1}^{n} y_k.
\]

Since $\log f(y;\theta)$ is a concave function of $\theta$, we can obtain the MLE by solving

\[
\frac{\partial \log f(y;\theta)}{\partial\theta} = \frac{n}{\theta} - \sum_{k=1}^{n} y_k = 0.
\]

So the MLE is

\[
\hat\theta_{\mathrm{MLE}}(y) = \frac{n}{\sum_{k=1}^{n} y_k}. \tag{10}
\]

To compare against the CRLB, we need $\mathbb{E}\big[\hat\theta_{\mathrm{MLE}}(Y)\big]$ and $\operatorname{Var}\big(\hat\theta_{\mathrm{MLE}}(Y)\big)$. Let $T(y) = \sum_{k=1}^{n} y_k$. Using the moment generating function, one can show that $T(Y)$ has the Erlang distribution

\[
f_T(t) = \frac{\theta^n t^{n-1}}{(n-1)!}\, e^{-\theta t}, \qquad t \ge 0. \tag{11}
\]

So we have

\[
\mathbb{E}\big[\hat\theta_{\mathrm{MLE}}(T(Y))\big]
= \int_0^{\infty} \frac{n}{t}\,\frac{\theta^n t^{n-1}}{(n-1)!}\, e^{-\theta t}\,dt
= \frac{n\theta}{n-1}\int_0^{\infty} \frac{(\theta t)^{n-2}}{(n-2)!}\, e^{-\theta t}\, d(\theta t)
= \frac{n}{n-1}\,\theta. \tag{12}
\]

Therefore the MLE is a biased estimator of $\theta$. Similarly, we can calculate the variance of the MLE:

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MLE}}(T(Y))\big)
= \mathbb{E}\big[\hat\theta_{\mathrm{MLE}}^2(T(Y))\big] - \mathbb{E}\big[\hat\theta_{\mathrm{MLE}}(T(Y))\big]^2
= \frac{\theta^2 n^2}{(n-1)^2(n-2)}.
\]

The Fisher information is

\[
I(\theta) = -\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log f(Y;\theta)\right] = \frac{n}{\theta^2},
\]

so the CRLB for the (biased) MLE is

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MLE}}(T(Y))\big)
\;\ge\; \frac{\big(\tfrac{\partial}{\partial\theta}\mathbb{E}[\hat\theta_{\mathrm{MLE}}(T(Y))]\big)^2}{I(\theta)}
= \frac{\big(\tfrac{n}{n-1}\big)^2}{\tfrac{n}{\theta^2}}
= \frac{n}{(n-1)^2}\,\theta^2.
\]

Since $\frac{n^2\theta^2}{(n-1)^2(n-2)} > \frac{n\theta^2}{(n-1)^2}$ for $n > 2$, the CRLB equality does NOT hold, so $\hat\theta_{\mathrm{MLE}}$ is not efficient.

The distribution in Equation (9) belongs to the exponential family, and $T(y) = \sum_{k=1}^{n} y_k$ is a complete sufficient statistic. The MLE can therefore be expressed as $\hat\theta_{\mathrm{MLE}}(T(y)) = \frac{n}{T(y)}$, a function of $T(y)$; however, it is biased (Equation (12)). We can construct an unbiased estimator based on the MLE:

\[
\tilde\theta(T(y)) = \frac{n-1}{n}\,\hat\theta_{\mathrm{MLE}}(T(y)) = \frac{n-1}{T(y)}.
\]

It is easy to check that $\mathbb{E}\big[\tilde\theta(T(Y))\big] = \frac{n-1}{n}\,\mathbb{E}\big[\hat\theta_{\mathrm{MLE}}(T(Y))\big] = \frac{n-1}{n}\cdot\frac{n}{n-1}\,\theta = \theta$. Since $\tilde\theta(T(y))$ is unbiased and is a function of the complete sufficient statistic, $\tilde\theta(T(y))$ is the MVUE (by the Lehmann–Scheffé theorem). So

\[
\hat\theta_{\mathrm{MVUE}}(T(y)) = \frac{n-1}{T(y)}. \tag{13}
\]

The variance of the MVUE is

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MVUE}}(T(Y))\big)
= \operatorname{Var}\!\left(\frac{n-1}{n}\,\hat\theta_{\mathrm{MLE}}(T(Y))\right)
= \frac{(n-1)^2}{n^2}\cdot\frac{n^2\theta^2}{(n-1)^2(n-2)}
= \frac{\theta^2}{n-2},
\]

while the CRLB for unbiased estimators is

\[
\operatorname{Var}\big(\hat\theta_{\mathrm{MVUE}}(T(Y))\big) \;\ge\; \frac{1}{I(\theta)} = \frac{\theta^2}{n}.
\]

Since $\frac{\theta^2}{n-2} > \frac{\theta^2}{n}$, the MVUE is NOT an efficient estimator.
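The algebra in Example 2 can likewise be checked by simulation. The sketch below is illustrative only: theta, n and the number of trials are arbitrary choices (with n > 2 assumed). It estimates the mean and variance of the MLE $n/T$ and of the MVUE $(n-1)/T$ and compares them with the closed-form expressions and the corresponding CRLBs derived above.

```python
# Simulation sketch for Example 2: exponential(rate theta) samples.
# MLE = n/T is biased by n/(n-1) and misses its CRLB; MVUE = (n-1)/T is
# unbiased but its variance theta^2/(n-2) still exceeds 1/I(theta) = theta^2/n.
# theta, n and trials are illustrative choices, not from the notes.
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 2.0, 10, 500_000

# T = sum of n i.i.d. Exp(rate theta) draws for each trial
t = rng.exponential(scale=1.0 / theta, size=(trials, n)).sum(axis=1)
theta_mle = n / t
theta_mvue = (n - 1) / t

print("E[MLE]    :", theta_mle.mean(),  "  theory:", n * theta / (n - 1))
print("Var[MLE]  :", theta_mle.var(),   "  theory:", n**2 * theta**2 / ((n - 1)**2 * (n - 2)))
print("MLE CRLB  :", n * theta**2 / (n - 1)**2)
print("E[MVUE]   :", theta_mvue.mean(), "  theory:", theta)
print("Var[MVUE] :", theta_mvue.var(),  "  theory:", theta**2 / (n - 2))
print("CRLB 1/I  :", theta**2 / n)
```

The run shows both gaps at once: the MLE's variance stays above its CRLB, and the MVUE, although unbiased and minimum variance, still does not reach $1/I(\theta)$.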
2 Consistency of MLE

Definition 2 (Consistency). Let $\{Y_1, \ldots, Y_n\}$ be a sequence of observations and let $\hat\theta_n$ be the estimator based on $\{Y_1, \ldots, Y_n\}$. We say that $\hat\theta_n$ is consistent if $\hat\theta_n \xrightarrow{p} \theta$, i.e.,

\[
\mathbb{P}\big(|\hat\theta_n - \theta| > \varepsilon\big) \to 0, \qquad \text{as } n \to \infty. \tag{14}
\]

Remark: A sufficient condition for Equation (14) is that

\[
\mathbb{E}\big[(\hat\theta_n - \theta)^2\big] \to 0, \qquad \text{as } n \to \infty. \tag{15}
\]

Proof. According to Chebyshev's inequality, we have

\[
\mathbb{P}\big(|\hat\theta_n - \theta| \ge \varepsilon\big) \;\le\; \frac{\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]}{\varepsilon^2}. \tag{16}
\]

Since $\mathbb{E}[(\hat\theta_n - \theta)^2] \to 0$, we have

\[
0 \;\le\; \mathbb{P}\big(|\hat\theta_n - \theta| \ge \varepsilon\big) \;\le\; \frac{\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]}{\varepsilon^2} \;\to\; 0.
\]

Therefore, $\mathbb{P}(|\hat\theta_n - \theta| > \varepsilon) \to 0$ as $n \to \infty$. □

Example 3. $Y_1, Y_2, \ldots, Y_n$ are i.i.d. Gaussian random variables with distribution $\mathcal{N}(\theta, \sigma^2)$. Is the MLE based on $\{Y_1, Y_2, \ldots, Y_n\}$ consistent?

Solution: From Example 1, we know that the MLE is

\[
\hat\theta_n = \frac{1}{n}\sum_{k=1}^{n} Y_k.
\]

Since

\[
\mathbb{E}\big[(\hat\theta_n - \theta)^2\big] = \operatorname{Var}\big(\hat\theta_n\big) = \frac{\sigma^2}{n} \quad \text{(see Equation (6))},
\]

we have $\mathbb{E}[(\hat\theta_n - \theta)^2] \to 0$. Therefore $\hat\theta_n \xrightarrow{p} \theta$, i.e., $\hat\theta_n$ is consistent.

In fact, the result of the example above holds for any distribution under mild regularity conditions. The following proposition states this result.

Proposition 2 (The MLE of i.i.d. observations is consistent). Let $\{Y_1, \ldots, Y_n\}$ be a sequence of i.i.d. observations with $Y_k \overset{\mathrm{iid}}{\sim} f_\theta(y)$. Then the MLE of $\theta$ is consistent.

Proof. (This proof is only a sketch; see Levy, Chapter 4.5, for a complete discussion.) The MLE of $\theta$ is

\[
\hat\theta_n = \arg\max_{\theta} \prod_{k=1}^{n} f_\theta(y_k)
            = \arg\max_{\theta} \sum_{k=1}^{n} \log f_\theta(y_k)
            = \arg\max_{\theta} \frac{1}{n}\sum_{k=1}^{n} \log f_\theta(y_k)
            = \arg\max_{\theta} \varphi_n(\theta),
\]

where $\varphi_n(\theta) = \frac{1}{n}\sum_{k=1}^{n} \log f_\theta(y_k)$. Let $\bar\theta$ denote the true value of the parameter (so the data are generated by $f_{\bar\theta}$) and define $\ell_{\bar\theta}(y_k) = \log\frac{f_{\bar\theta}(y_k)}{f_\theta(y_k)}$. Then we have

\[
\mathbb{E}_{\bar\theta}\big[\ell_{\bar\theta}(Y_k)\big]
\;\overset{\mathrm{def}}{=}\; \int \log\frac{f_{\bar\theta}(y_k)}{f_\theta(y_k)}\, f_{\bar\theta}(y_k)\,dy_k
= D\big(f_{\bar\theta}\,\|\,f_\theta\big) \;\ge\; 0,
\]

with equality if and only if $f_\theta = f_{\bar\theta}$. By the law of large numbers, $\varphi_n(\theta) \xrightarrow{p} \mathbb{E}_{\bar\theta}[\log f_\theta(Y_k)]$ for each $\theta$, and the inequality above shows that this limiting objective is maximized at $\theta = \bar\theta$. Under suitable regularity conditions, the maximizer $\hat\theta_n$ of $\varphi_n$ therefore converges in probability to $\bar\theta$, i.e., the MLE is consistent. □
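To make Proposition 2 concrete, the sketch below is an assumed illustration (not from the notes): it revisits the exponential model of Example 2 and estimates $\mathbb{P}(|\hat\theta_n - \theta| > \varepsilon)$ for growing $n$; the probability shrinks toward zero, as consistency predicts. The values of theta, eps, trials and the list of sample sizes are arbitrary choices, and the code draws $T = \sum_k Y_k$ directly from its Gamma distribution to avoid storing individual samples.

```python
# Consistency sketch for Proposition 2, using the exponential model of Example 2:
# the MLE theta_hat_n = n / sum(Y) concentrates around the true rate theta as n grows.
# theta, eps, trials and the n values are illustrative choices, not from the notes.
import numpy as np

rng = np.random.default_rng(2)
theta, eps, trials = 2.0, 0.2, 200_000

for n in (10, 100, 1_000, 10_000):
    # sum of n i.i.d. Exp(rate=theta) draws ~ Gamma(shape=n, scale=1/theta)
    t = rng.gamma(shape=n, scale=1.0 / theta, size=trials)
    theta_hat = n / t
    p_far = np.mean(np.abs(theta_hat - theta) > eps)   # estimate of P(|error| > eps)
    print(f"n = {n:6d}   P(|theta_hat - theta| > {eps}) ~ {p_far:.4f}")
```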