Lecture 11: Likelihood, MLE and Sufficiency


Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković

11.1 Likelihood - definition and examples

As we have seen many times before, statistical inference based on a random sample depends heavily on the model we assume for the data. We would estimate the unknown parameter (say, the mean of the distribution) in vastly different ways when the data are normal, compared to the case of uniformly distributed data. The information about the assumed model is captured in the joint pdf (or pmf) of our data and its dependence on the parameters. In a huge number of situations, this joint pdf/pmf has a simple explicit form and, as we will see below, many important conclusions can be reached by looking at it in the right way.

Definition 11.1.1. Given a random sample Y_1, ..., Y_n from a discrete distribution D with an unknown parameter θ, we define the likelihood (function) by

  L(\theta; y_1, \dots, y_n) = p^{\theta}_{Y_1, \dots, Y_n}(y_1, \dots, y_n) = p^{\theta}(y_1)\, p^{\theta}(y_2) \cdots p^{\theta}(y_n),

where p^θ is the pmf of (each) Y_i. If Y_1, ..., Y_n come from a continuous distribution, we set

  L(\theta; y_1, \dots, y_n) = f^{\theta}_{Y_1, \dots, Y_n}(y_1, \dots, y_n) = f^{\theta}(y_1)\, f^{\theta}(y_2) \cdots f^{\theta}(y_n),

where f^θ is the pdf of (each) Y_i.

Simply put, the likelihood is the same as the joint pdf (pmf), but with the emphasis placed on the dependence on the parameter. When considering likelihoods, we think of y_1, ..., y_n as fixed and of L(θ; y_1, ..., y_n) as the "likelihood" that it was the parameter θ that actually produced y_1, ..., y_n. This should not be confused with a probability: as a function of θ, the likelihood L(θ; y_1, ..., y_n) is not a pdf (or a pmf) of a probability distribution. In order to interpret the likelihood as a probability, we need a completely different paradigm, namely Bayesian statistics.

In these notes, Y_1, ..., Y_n are always independent, so the likelihood can always be written as a product of the individual pdfs (pmfs). In general, when Y_1, ..., Y_n are dependent random variables, the notion of a likelihood can still be used if the joint distribution (pmf or pdf) of Y_1, ..., Y_n is specified. Almost everything we cover below applies to that case as well.

Example 11.1.2. Here are the likelihood functions for random samples from some of our favorite distributions:

1. Bernoulli. Suppose that Y_1, ..., Y_n are independent and Y_i ∼ B(p). The pmf of Y_i can be written as

  p(y) = P[Y_i = y] = \begin{cases} p, & y = 1 \\ 1 - p, & y = 0 \end{cases} = p^{y} (1 - p)^{1 - y} \quad \text{for } y = 0, 1.

While it may look strange at first, the right-most expression p^y (1 − p)^{1−y} happens to be very useful. For example, it allows us to write the full likelihood in a very compact form:

  L(p; y_1, \dots, y_n) = p^{y_1}(1 - p)^{1 - y_1} \times \cdots \times p^{y_n}(1 - p)^{1 - y_n} = p^{\sum_i y_i} (1 - p)^{n - \sum_i y_i}.

2. Normal. For a random sample Y_1, ..., Y_n from a normal N(μ, σ) distribution, we have

  f(y) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(y - \mu)^2}{2\sigma^2}},

and, so,

  L(\mu, \sigma; y_1, \dots, y_n) = \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^{n} e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{\sum_{i=1}^{n}(y_i - \mu)^2}{2\sigma^2}}.

We can go a step further and try to isolate the parameters μ and σ by expanding each square (y_i − μ)^2:

  L(\mu, \sigma; y_1, \dots, y_n) = \frac{1}{\sigma^{n} (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} y_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^{n} y_i - \frac{n\mu^2}{2\sigma^2}}.

3. Uniform. Let Y_1, ..., Y_n be a random sample from a uniform distribution U(0, θ), with an unknown θ > 0. The pdf of a single Y_i is

  f(y) = \frac{1}{\theta}\, \mathbf{1}_{\{0 \le y \le \theta\}},

and, so,

  L(\theta; y_1, \dots, y_n) = \frac{1}{\theta^{n}}\, \mathbf{1}_{\{0 \le y_1 \le \theta\}} \times \cdots \times \mathbf{1}_{\{0 \le y_n \le \theta\}} = \frac{1}{\theta^{n}}\, \mathbf{1}_{\{0 \le y_1, \dots, y_n \le \theta\}}.

The condition 0 ≤ y_1, ..., y_n ≤ θ is equivalent to the two conditions 0 ≤ min(y_1, ..., y_n) and max(y_1, ..., y_n) ≤ θ. Therefore, we can write

  L(\theta; y_1, \dots, y_n) = \frac{1}{\theta^{n}}\, \mathbf{1}_{\{\min(y_1, \dots, y_n) \ge 0\}}\, \mathbf{1}_{\{\max(y_1, \dots, y_n) \le \theta\}}.
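As a quick sanity check on the uniform likelihood above, here is a small Python sketch; it is not part of the original notes, and the sample in `ys` is made up for illustration.

```python
import numpy as np

def uniform_likelihood(theta, ys):
    """L(theta; y_1, ..., y_n) = theta^(-n) * 1{min(y) >= 0} * 1{max(y) <= theta}."""
    ys = np.asarray(ys)
    indicator = (ys.min() >= 0) & (ys.max() <= theta)
    return indicator / theta**len(ys)

# Hypothetical sample, imagined as coming from U(0, theta); its maximum is 2.8.
ys = np.array([0.4, 1.1, 2.8, 0.9, 2.2])

for theta in [2.0, 2.8, 3.0, 4.0]:
    print(theta, uniform_likelihood(theta, ys))
# The likelihood is 0 for theta < max(ys) and decreases like theta^(-n) afterwards.
```

The likelihood vanishes whenever θ is smaller than the largest observation and decays like θ^{-n} beyond it, which already hints that the likelihood-maximizing value of θ is max(y_1, ..., y_n).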
11.2 Maximum-likelihood estimation

We mentioned that the word "likelihood" in the likelihood function refers to the parameter, but that we cannot think of it as a probability without changing our entire worldview. What we can do is compare likelihoods for different values of the parameter and think of the parameters with the higher value of the likelihood as "more likely" to have produced the observations y_1, ..., y_n.

Example 11.2.1. Three buses (of unknown sizes, and not necessarily the same) carrying football fans arrive at a game between the Green team and the Orange team. Suppose that 90% of the people in the first bus are fans of the Green team, and 10% are fans of the Orange team. The composition of the second bus is almost exactly the opposite: 15% Green-team fans and 85% Orange-team fans. The third bus carries the same number of Green-team and Orange-team fans. Once the buses arrive at the game, the populations mix (say, as they enter the stadium) and a person is randomly selected from the crowd. It turns out she is a fan of the Green team. What is your best guess of the bus she came to the game in?

The situation can be modeled as follows: the (unknown) parameter θ - corresponding to the bus our fan came from - can take only three values, 1, 2 or 3. The observation Y can take only two values, G (for Green-team fans) and O (for Orange-team fans). The likelihood function L(θ; y) is given by

  L(\theta; G) = \begin{cases} 0.9, & \theta = 1 \\ 0.15, & \theta = 2 \\ 0.5, & \theta = 3 \end{cases}
  \qquad \text{and} \qquad
  L(\theta; O) = \begin{cases} 0.1, & \theta = 1 \\ 0.85, & \theta = 2 \\ 0.5, & \theta = 3 \end{cases}.

Since the randomly picked person was a fan of the Green team, we focus on L(θ; G). The three values it can take, namely 0.9, 0.15 and 0.5, cannot be interpreted as probabilities, as they do not add up to 1. We can still say that, in this case, θ = 1 is much more likely than θ = 2, and we should probably guess that our fan came in the first bus (the one that carried mostly Green-team fans).

The thinking we used to formulate our guess in the above example was simple: pick the value of the parameter which yields the highest likelihood, given the observed data. If we follow this procedure systematically, we arrive at one of the most important classes of estimators:

Definition 11.2.2. An estimator θ̂ = θ̂(y_1, ..., y_n) is called the maximum-likelihood estimator (MLE) if it has the property that for any other estimator θ̂′ = θ̂′(y_1, ..., y_n) we have

  L(\hat{\theta}; y_1, \dots, y_n) \ge L(\hat{\theta}'; y_1, \dots, y_n) \quad \text{for all } y_1, \dots, y_n.

Maximum-likelihood estimators are often easy to find whenever explicit expressions for the likelihood functions are available. Unlike in Example 11.2.1 above, the unknown parameters often vary continuously, and we can use calculus to find the values that maximize the likelihood. A very useful trick is to maximize the log-likelihood log L(θ; y_1, ..., y_n) instead of the likelihood L. We get the same maximizers (as x ↦ log(x) is an increasing function), but the expressions involved are often much simpler.
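As a small numerical illustration of this trick (my addition, not part of the original notes), the following Python sketch evaluates the Bernoulli likelihood from Example 11.1.2 and its logarithm on a grid of values of p for a made-up sample; both are maximized at the same point, namely the sample proportion.

```python
import numpy as np

# Hypothetical Bernoulli sample: 7 ones out of 10 trials.
ys = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])
s, n = ys.sum(), ys.size

# Grid of candidate values of p (avoiding the endpoints 0 and 1).
grid = np.linspace(0.01, 0.99, 99)

likelihood = grid**s * (1 - grid)**(n - s)
log_likelihood = s * np.log(grid) + (n - s) * np.log(1 - grid)

# Both criteria pick out the same maximizer, the sample proportion 0.7.
print("argmax of L:    ", grid[np.argmax(likelihood)])
print("argmax of log L:", grid[np.argmax(log_likelihood)])
```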
Example 11.2.3.

1. Normal (with a known variance). Let Y_1, ..., Y_n be a random sample from a normal model with an unknown mean μ and a known variance σ² = 1. Thanks to Example 11.1.2 above, we have the following expression for the likelihood function:

  L(\mu; y_1, \dots, y_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2}.

To find the MLE μ̂ for μ, we need to find the value of μ (depending on y_1, ..., y_n) which maximizes L(μ; y_1, ..., y_n). We could use the standard technique: differentiate L(μ; y_1, ..., y_n) in μ, set the obtained expression to 0 and solve for μ. The log-likelihood

  \log L(\mu; y_1, \dots, y_n) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{n} (y_i - \mu)^2

is much easier to differentiate:

  \frac{\partial}{\partial \mu} \log L(\mu; y_1, \dots, y_n) = -\frac{1}{2} \sum_{i=1}^{n} \frac{\partial}{\partial \mu} (y_i - \mu)^2 = -\frac{1}{2} \sum_{i=1}^{n} (-2)(y_i - \mu) = \sum_{i=1}^{n} y_i - n\mu.

We set the obtained expression to 0 and solve for μ, obtaining

  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} Y_i.

It can be shown that this μ̂ is indeed the maximum of L(μ; y_1, ..., y_n) (and not a minimum or an inflection point). It should not be too surprising that we obtained the sample mean: it has already been shown to be the best estimator for μ in the mean-squared-error sense.
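To double-check this numerically, here is a short Python sketch (not part of the original notes; the simulated data, the seed, and the use of scipy are illustrative choices) that minimizes the negative log-likelihood of the N(μ, 1) model and compares the minimizer with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
ys = rng.normal(loc=2.0, scale=1.0, size=50)  # hypothetical N(mu=2, sigma=1) sample

def neg_log_likelihood(mu):
    """- log L(mu; y_1, ..., y_n) for the N(mu, 1) model, dropping the constant term."""
    return 0.5 * np.sum((ys - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)
print("numerical MLE:", result.x)
print("sample mean:  ", ys.mean())  # the two agree up to numerical precision
```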