Likelihood Ratio Tests


Math 541: Statistical Theory II

Likelihood Ratio Tests

Instructor: Songfeng Zheng

A very popular form of hypothesis test is the likelihood ratio test, which is a generalization of the optimal test for simple null and alternative hypotheses developed by Neyman and Pearson. (We skipped the Neyman-Pearson lemma because we are short of time.) The likelihood ratio test is based on the likelihood function $f_n(X_1, \cdots, X_n \mid \theta)$ and on the intuition that the likelihood function tends to be highest near the true value of $\theta$. Indeed, this is also the foundation for maximum likelihood estimation. We will start with a very simple example.

1 The Simplest Case: Simple Hypotheses

Let us first consider simple hypotheses, in which both the null hypothesis and the alternative hypothesis consist of a single value of the parameter. Suppose $X_1, \cdots, X_n$ is a random sample of size $n$ from an exponential distribution

$$f(x \mid \theta) = \frac{1}{\theta} e^{-x/\theta}, \quad x > 0.$$

Consider the following simple hypothesis testing problem:

$$H_0: \theta = \theta_0 \quad \text{vs.} \quad H_a: \theta = \theta_1,$$

where $\theta_1 < \theta_0$. Suppose the significance level is $\alpha$.

If $H_0$ were correct, the likelihood function would be

$$f_n(X_1, \cdots, X_n \mid \theta_0) = \prod_{i=1}^n \frac{1}{\theta_0} e^{-X_i/\theta_0} = \theta_0^{-n} \exp\left\{-\sum X_i / \theta_0\right\}.$$

Similarly, if $H_a$ were correct, the likelihood function would be

$$f_n(X_1, \cdots, X_n \mid \theta_1) = \theta_1^{-n} \exp\left\{-\sum X_i / \theta_1\right\}.$$

We define the likelihood ratio as

$$LR = \frac{f_n(X_1, \cdots, X_n \mid \theta_0)}{f_n(X_1, \cdots, X_n \mid \theta_1)} = \frac{\theta_0^{-n} \exp\{-\sum X_i/\theta_0\}}{\theta_1^{-n} \exp\{-\sum X_i/\theta_1\}} = \left(\frac{\theta_0}{\theta_1}\right)^{-n} \exp\left\{\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right) \sum X_i\right\}.$$

Intuitively, if the evidence (data) supports $H_a$, then the likelihood function $f_n(X_1, \cdots, X_n \mid \theta_1)$ should be large, and therefore the likelihood ratio should be small. Thus, we reject the null hypothesis if the likelihood ratio is small, i.e. $LR \le k$, where $k$ is a constant such that $P(LR \le k) = \alpha$ under the null hypothesis ($\theta = \theta_0$).

To find what kind of test results from this criterion, we expand the condition:

$$\begin{aligned}
\alpha = P(LR \le k) &= P\left( \left(\frac{\theta_0}{\theta_1}\right)^{-n} \exp\left\{\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right) \sum X_i\right\} \le k \right) \\
&= P\left( \exp\left\{\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right) \sum X_i\right\} \le k \left(\frac{\theta_0}{\theta_1}\right)^{n} \right) \\
&= P\left( \left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right) \sum X_i \le \log\left[ k \left(\frac{\theta_0}{\theta_1}\right)^{n} \right] \right) \\
&= P\left( \sum X_i \le \frac{\log k + n \log \theta_0 - n \log \theta_1}{\frac{1}{\theta_1} - \frac{1}{\theta_0}} \right) \\
&= P\left( \frac{2}{\theta_0} \sum X_i \le \frac{2}{\theta_0} \cdot \frac{\log k + n \log \theta_0 - n \log \theta_1}{\frac{1}{\theta_1} - \frac{1}{\theta_0}} \right) \\
&= P\left( V \le \frac{2}{\theta_0} \cdot \frac{\log k + n \log \theta_0 - n \log \theta_1}{\frac{1}{\theta_1} - \frac{1}{\theta_0}} \right),
\end{aligned}$$

where $V = \frac{2}{\theta_0} \sum X_i$. From the properties of the exponential distribution, we know that under the null hypothesis $\frac{2}{\theta_0} X_i$ follows a $\chi^2$ distribution with 2 degrees of freedom; consequently, $V$ follows a chi-square distribution with $2n$ degrees of freedom. Thus, by looking at the chi-square table, we can find the value of the chi-square statistic with $2n$ degrees of freedom such that the probability that $V$ is less than that number is $\alpha$; that is, solve for $c$ such that $P(V \le c) = \alpha$. Once you find the value of $c$, you can solve for $k$ and define the test in terms of the likelihood ratio.

For example, suppose that $H_0: \theta = 2$ and $H_a: \theta = 1$, and we want to do the test at significance level $\alpha = 0.05$ with a random sample of size $n = 5$ from an exponential distribution. We can look at the chi-square table under 10 degrees of freedom to find that 3.94 is the value below which there is 0.05 area. Using this, we obtain $P\left(\frac{2}{\theta_0} \sum X_i \le 3.94\right) = 0.05$. This implies that we should reject the null hypothesis if $\sum X_i \le 3.94$ in this example. To find a rejection criterion directly in terms of the likelihood ratio, we can solve for $k$ from

$$\frac{2}{\theta_0} \cdot \frac{\log k + n \log \theta_0 - n \log \theta_1}{\frac{1}{\theta_1} - \frac{1}{\theta_0}} = 3.94,$$

and the solution is $k \approx 0.224$. So, going back to the original likelihood ratio, we reject the null hypothesis if

$$\left(\frac{\theta_0}{\theta_1}\right)^{-n} \exp\left\{\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right) \sum X_i\right\} = \left(\frac{2}{1}\right)^{-5} \exp\left\{\left(\frac{1}{1} - \frac{1}{2}\right) \sum X_i\right\} \le 0.224.$$
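For concreteness, here is a minimal sketch (assuming NumPy and SciPy are available) that reproduces the numbers in this example: the chi-square critical value $c$ with $2n = 10$ degrees of freedom, the corresponding cutoff for $\sum X_i$, and the likelihood-ratio cutoff $k$.

```python
import numpy as np
from scipy.stats import chi2

# Simple-vs-simple test for the exponential mean:
# H0: theta = 2  vs  Ha: theta = 1, with n = 5 and alpha = 0.05.
theta0, theta1, n, alpha = 2.0, 1.0, 5, 0.05

# Under H0, V = (2/theta0) * sum(X_i) ~ chi-square with 2n degrees of freedom,
# so the critical value c solves P(V <= c) = alpha.
c = chi2.ppf(alpha, df=2 * n)          # ~ 3.94

# Reject H0 when sum(X_i) <= (theta0 / 2) * c.
sum_cutoff = (theta0 / 2) * c          # ~ 3.94 here, because theta0 = 2

# The same rejection region on the likelihood-ratio scale:
# LR = (theta0/theta1)^(-n) * exp{(1/theta1 - 1/theta0) * sum(X_i)},
# which is increasing in sum(X_i), so k is the LR evaluated at the boundary.
def likelihood_ratio(sum_x):
    return (theta0 / theta1) ** (-n) * np.exp((1 / theta1 - 1 / theta0) * sum_x)

k = likelihood_ratio(sum_cutoff)
print(c, sum_cutoff, k)                # ~ 3.94, 3.94, 0.224
```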
2 General Likelihood Ratio Test

Likelihood ratio tests are useful for testing a composite null hypothesis against a composite alternative hypothesis. Suppose that the null hypothesis specifies that $\theta$ (which may be a vector) lies in a particular set of possible values $\Theta_0$, i.e. $H_0: \theta \in \Theta_0$; the alternative hypothesis specifies that $\theta$ lies in another set of possible values $\Theta_a$, which does not overlap $\Theta_0$, i.e. $H_a: \theta \in \Theta_a$. Let $\Theta = \Theta_0 \cup \Theta_a$. Either or both of the hypotheses $H_0$ and $H_a$ can be composite.

Let $L(\hat{\Theta}_0)$ be the maximum (actually the supremum) of the likelihood function over all $\theta \in \Theta_0$; that is, $L(\hat{\Theta}_0) = \max_{\theta \in \Theta_0} L(\theta)$. $L(\hat{\Theta}_0)$ represents the best explanation of the observed data over all $\theta \in \Theta_0$. Similarly, $L(\hat{\Theta}) = \max_{\theta \in \Theta} L(\theta)$ represents the best explanation of the observed data over all $\theta \in \Theta = \Theta_0 \cup \Theta_a$. If $L(\hat{\Theta}_0) = L(\hat{\Theta})$, then a best explanation of the observed data can be found inside $\Theta_0$, and we should not reject the null hypothesis $H_0: \theta \in \Theta_0$. However, if $L(\hat{\Theta}_0) < L(\hat{\Theta})$, then the best explanation of the observed data may lie inside $\Theta_a$, and we should consider rejecting $H_0$ in favor of $H_a$. A likelihood ratio test is therefore based on the ratio $L(\hat{\Theta}_0)/L(\hat{\Theta})$. Define the likelihood ratio statistic by

$$\Lambda = \frac{L(\hat{\Theta}_0)}{L(\hat{\Theta})} = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta} L(\theta)}.$$

A likelihood ratio test of $H_0: \theta \in \Theta_0$ vs. $H_a: \theta \in \Theta_a$ uses $\Lambda$ as a test statistic, and the rejection region is determined by $\Lambda \le k$. Clearly, $0 \le \Lambda \le 1$. A value of $\Lambda$ close to zero indicates that the likelihood of the sample is much smaller under $H_0$ than under $H_a$, so the data suggest favoring $H_a$ over $H_0$. The actual value of $k$ is chosen so that $\alpha$ achieves the desired value.
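Concretely, $\Lambda$ can be computed by maximizing the log-likelihood twice: once over $\Theta_0$ and once over $\Theta$. The sketch below is only an illustration of this definition, not an example from these notes: it uses the exponential model of Section 1 with the composite hypotheses $H_0: \theta = \theta_0$ vs. $H_a: \theta \ne \theta_0$, hypothetical simulated data, and assumes NumPy/SciPy.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=20)   # hypothetical data
theta0 = 2.0                              # H0: theta = theta0

def neg_loglik(theta):
    # Negative log-likelihood of an Exponential(theta) sample, f(x|theta) = (1/theta) e^{-x/theta}.
    return len(x) * np.log(theta) + x.sum() / theta

# sup over Theta_0 = {theta0} is just the likelihood evaluated at theta0.
loglik_null = -neg_loglik(theta0)

# sup over Theta = (0, infinity): maximize numerically (the maximizer is the MLE, xbar).
res = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
loglik_full = -res.fun

Lambda = np.exp(loglik_null - loglik_full)   # likelihood ratio statistic, 0 < Lambda <= 1
print(res.x, x.mean(), Lambda)               # numerical maximizer matches xbar
```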
" P # ^ p 1 e¡n=2 2 n=2 n ¹ 2 n=2 L(£0) 2¼σ^0 σ^ i=1(Xi ¡ X) ¤ = = ³ ´n = = P 1 2 n 2 L(£)^ p e¡n=2 σ^0 i=1(Xi ¡ ¹0) 2¼σ^ Notice that 0 < ¤ · 1 because £0 ½ £, thus when ¤ < k we would reject H0, where k < 1 is a constant. Because Xn Xn Xn 2 ¹ ¹ 2 ¹ 2 ¹ 2 (Xi ¡ ¹0) = [(Xi ¡ X) + (X ¡ ¹0)] = (Xi ¡ X) + n(X ¡ ¹0) ; i=1 i=1 i=1 the rejection region, ¤ < k, is equivalent to P n ¹ 2 (Xi ¡ X) Pi=1 2=n 0 n 2 < k = k i=1(Xi ¡ ¹0) 5 P n ¹ 2 (Xi ¡ X) P i=1 < k0 n ¹ 2 ¹ 2 i=1(Xi ¡ X) + n(X ¡ ¹0) 1 0 ¹ 2 < k : Pn(X¡¹0) 1 + n (X ¡X¹)2 i=1 i This inequality is equivalent to ¹ 2 n(X ¡ ¹0) 1 P > ¡ 1 = k00 n ¹ 2 0 i=1(Xi ¡ X) k ¹ 2 n(X ¡ ¹0) P > (n ¡ 1)k00 1 n ¹ 2 n¡1 i=1(Xi ¡ X) By de¯ning Xn 2 1 ¹ 2 S = (Xi ¡ X) ; n ¡ 1 i=1 the above rejection region is equivalent to ¯ ¯ ¯p ¹ ¯ q ¯ n(X ¡ ¹0)¯ ¯ ¯ > (n ¡ 1)k00: ¯ S ¯ p ¹ We can recognize that n(X ¡ ¹0)=S is the t statistic employed in previous sections, and the decision rule is exactly the same as previous.