The Calculus of M-Estimation

Total Page:16

File Type:pdf, Size:1020Kb

The Calculus of M-Estimation The Calculus of MEstimation Leonard A Stefanski and Dennis D Boos Abstract Since the groundbreaking pap ers by Hub er in the s Mestimation metho ds estimating equations have b een increasingly imp ortant for asymptotic analysis and approximate inference Now with the prevalence of programs like Maple and Mathematica calculation of asymptotic variances in complex problems is just a matter of routine manipulation The intent of this article is to illustrate the depth and generality of the Mestimation approach and thereby facilitate its use KEY WORDS Asymptotic variance Central Limit Theorem Estimating equations Maple M estimator Institute of Statistics Mimeo Series No March Leonard A Stefanski and Dennis D Bo os are Professors Department of Statistics North Carolina State Univer sity Raleigh NC Email addresses are stefanskistatncsuedub o os statncsuedu INTRODUCTION P n b Y That is the Mestimator Mestimators are solutions of the vector equation i i=1 satises n X b Y i i=1 Here we are assuming that Y Y are indep endent but not necessarily identically distributed 1 n is a pdimensional parameter and is a known p function that do es not dep end on i or n In this description Y represents the ith datum In some applications it is advantageous i to emphasize the dep endence of on particular comp onents of Y For example in a regression i problem Y x Y and would typically b e written i i i n X b Y x i i i=1 where x is the ith regressor i Hub er intro duced Mestimators and their asymptotic prop erties and they were an imp ortant part of the development of mo dern robust statistics Liang and Zeger help ed p opularize Mestimators in the biostatistics literature under the name generalized estimating equa tions GEE Obviously many others have made imp ortant contributions For example Go damb e intro duced the concept of an optimum estimating function in an Mestimator context and that pap er could b e called a forerunner of the Mestimator approach However our goal is not to do cument the development of Mestimators or to give a bibliog raphy of contributions to the literature Rather we want to show that the Mestimator approach is simple p owerful and more widely applicable than many readers imagine We want students to feel comfortable nding and using the asymptotic approximations that ow from the metho d The key advantage is that a very large class of asymptotically normal statistics includin g delta metho d transformations can b e put in the general Mestimator framework This unies large sample ap proximation metho ds simplies analysis and makes computations routine An imp ortant practical consideration is the availability of a symb olic manipulation program like Maple otherwise the matrix calculations can b e overwhelming Nevertheless the general theory is straightforward We claim that many estimators not thought of as Mestimators can b e written in the form of Mestimators Consider as a simple example the mean deviation from the sample mean n X b Y j jY i 1 n i=1 Is this an Mestimator There is certainly no single equation of the form n X Y i i=1 b b that yields Moreover there is no family of densities f y such that is a comp onent of the 1 1 maximum likeliho o d estimator of But if we let y jy j and y y 1 1 2 2 1 2 1 2 2 then P n n b b X jY j i 2 1 i=1 C B C B b b Y A A i 1 2 P n b Y i=1 i 2 i=1 P n b b yields Y and n jY Y j We like to use the term partial Mestimator for 2 1 i i=1 an estimator that is not naturally an Mestimator until additional functions are added The key idea is simple any estimator that would b e an Mestimator if certain parameters were known is a partial Mestimator b ecause we can stack functions for each of the unknown parameters This asp ect of Mestimators is related to the general approach of Randles for replacing unknown parameters by estimators b From the ab ove example it should b e obvious that we can replace Y by any other 2 estimator dened by an estimating equation for example the sample median Moreover we can b b also add functions to give delta metho d asymptotic results for transformations like log 3 1 In this latter context there are connections to Benichou and Gail Certainly the combination of standard inuence curve and delta theorem metho dology can handle a larger class of problems than this enhanced Mestimation approach However we b elieve that the combination of a single approach along with a symb olic manipulator like Maple will make this Mestimation approach much more likely to b e used in complex problems A description of the basic approach is given in Section along with a few examples Connec tions to the inuence curve are given in Section and then extensions for nonsmo oth functions are given in Section Extensions for regression are given in Section A discussion of some test ing problems is given in Section and Section summarizes the key features of the Mestimator metho d The Basic Approach Mestimators solve where the vector function must b e a known function that do es not dep end on i or n For regression situations the argument of will b e expanded to dep end on regressors x i but the basic will still not dep end on i For the moment we will conne ourselves to the iid case where Y Y are iid p ossibly vectorvalued with distribution function F The true parameter 1 n value is dened by 0 Z E Y y dF y F 1 0 0 R For example if Y Y then clearly the p opulation mean y dF y is the unique i i 0 R solution of y dF y If there is one unique satisfying then in general there exists a sequence of Mestimators 0 p b b such that the weak law of large numb ers leads to as n Furthermore if is suitably 0 P n 1 smo oth then Taylor expansion of G n Y gives n i i=1 0 b b G G G R n n 0 0 0 n n h i 0 T 0 to b e nonsingular For n suciently large we exp ect G G where G 0 0 n n n = so that we can rearrange and get p p p 1 0 b n G n G nR 0 0 n 0 n n Under suitable regularity conditions as n n X p 0 E Y Y A G i 0 1 0 0 0 n T T n i=1 h i p d T n G MVN B where B E Y Y n 0 0 0 1 0 1 0 p p nR n If A exists the Weak Law of Large Numb ers gives If B exists then follows from 0 0 the Central Limit Theorem The hard part to prove is Hub er was the rst to give general results for but there have b een many others since then We shall b e content to observe that holds in most regular situations where there are sucent smo othness conditions on and has xed dimension p as n Putting and together with Slustkys Theorem we have that V 0 b is AMN as n 0 n 1 1 T where V A B fA g AMN means asymptotically multivariate normal 0 0 0 0 The limiting covariance V is called the sandwich matrix b ecause the meat B is placed 0 0 1 1 T b etween the bread A and fA g 0 0 b Extension Supp ose that instead of satises n X b Y c i n i=1 p p p n n is absorb ed as n Following the ab ove arguments and noting that c where c n n p in nR of we can see that as long as and hold then will also hold This extension n allows us to cover a much wider class of statistics including empirical quantiles estimators whose function dep ends on n and Bayesian estimators For maximum likeliho o d estimation y log f y is often called the score func tion If the data truly come from the assumed parametric family f y then A B 0 0 1 I the information matrix In this case the sandwich matrix V reduces to the usual I 0 0 0 One of the key contributions of Mestimation theory has b een to p oint out what happ ens when the assumed parametric family is not correct In such cases there is often a welldened satisfying 0 b and satisfying but A B and valid inference should b e carried out using the 0 0 1 1 T 1 correct limiting covariance matrix V A B fA g not I 0 0 0 0 0 Using the lefthandside of we dene the empirical estimator of A by 0 n X 0 b b b Y A Y G i n n T n i=1 b b Note that for maximum likeliho o d estimation nA is the observed information matrix I n Y Similarly the empirical estimator of B is 0 n X T b b b Y Y B Y i i n n i=1 Putting these together yields the empirical sandwich estimator 1 1 T b b b b V Y A Y B Y fA Y g n n n n b V Y will generally b e consistent for V under mild regularity conditions see Iverson and n 0 Randles for a general theorem on this convegence b V Y requires no analytic work b eyond sp ecifying the function In some problems it is n 1 1 T simpler to work directly with the limiting form V A B fA g and just plug 0 0 0 0 in estimators for and any other unknown quantities in V The notation V suggests that 0 0 0 is the only unknown quantity in V but in reality V often involves higher moments or 0 0 0 other characteristics of the true distribution function F of Y In fact there is a range of p ossibilities i for estimating V dep ending on what mo del assumptions are used For simplicity we will just 0 b b use the notation V Y for the purely empirical estimator and V for any of the exp ected n value plus mo del assumption versions For maximum likeliho o d estimation with a correctly sp ecied family the three comp eting h i 1 1 1 1 b b b b b A Y and I V In this case estimators for I are V Y I n n n Y h i 1 1 b b b and I are generally more ecient than V Y for the standard estimators I n n Y 1 1
Recommended publications
  • Analysis of Variance; *********43***1****T******T
    DOCUMENT RESUME 11 571 TM 007 817 111, AUTHOR Corder-Bolz, Charles R. TITLE A Monte Carlo Study of Six Models of Change. INSTITUTION Southwest Eduoational Development Lab., Austin, Tex. PUB DiTE (7BA NOTE"". 32p. BDRS-PRICE MF-$0.83 Plug P4istage. HC Not Availablefrom EDRS. DESCRIPTORS *Analysis Of COariance; *Analysis of Variance; Comparative Statistics; Hypothesis Testing;. *Mathematicai'MOdels; Scores; Statistical Analysis; *Tests'of Significance ,IDENTIFIERS *Change Scores; *Monte Carlo Method ABSTRACT A Monte Carlo Study was conducted to evaluate ,six models commonly used to evaluate change. The results.revealed specific problems with each. Analysis of covariance and analysis of variance of residualized gain scores appeared to, substantially and consistently overestimate the change effects. Multiple factor analysis of variance models utilizing pretest and post-test scores yielded invalidly toF ratios. The analysis of variance of difference scores an the multiple 'factor analysis of variance using repeated measures were the only models which adequately controlled for pre-treatment differences; ,however, they appeared to be robust only when the error level is 50% or more. This places serious doubt regarding published findings, and theories based upon change score analyses. When collecting data which. have an error level less than 50% (which is true in most situations)r a change score analysis is entirely inadvisable until an alternative procedure is developed. (Author/CTM) 41t i- **************************************4***********************i,******* 4! Reproductions suppliedby ERRS are theibest that cap be made * .... * '* ,0 . from the original document. *********43***1****t******t*******************************************J . S DEPARTMENT OF HEALTH, EDUCATIONWELFARE NATIONAL INST'ITUTE OF r EOUCATIOp PHISDOCUMENT DUCED EXACTLYHAS /3EENREPRC AS RECEIVED THE PERSONOR ORGANIZATION J RoP AT INC.
    [Show full text]
  • Autoregressive Conditional Kurtosis
    Autoregressive conditional kurtosis Article Accepted Version Brooks, C., Burke, S. P., Heravi, S. and Persand, G. (2005) Autoregressive conditional kurtosis. Journal of Financial Econometrics, 3 (3). pp. 399-421. ISSN 1479-8417 doi: https://doi.org/10.1093/jjfinec/nbi018 Available at http://centaur.reading.ac.uk/20558/ It is advisable to refer to the publisher’s version if you intend to cite from the work. See Guidance on citing . Published version at: http://dx.doi.org/10.1093/jjfinec/nbi018 To link to this article DOI: http://dx.doi.org/10.1093/jjfinec/nbi018 Publisher: Oxford University Press All outputs in CentAUR are protected by Intellectual Property Rights law, including copyright law. Copyright and IPR is retained by the creators or other copyright holders. Terms and conditions for use of this material are defined in the End User Agreement . www.reading.ac.uk/centaur CentAUR Central Archive at the University of Reading Reading’s research outputs online This is a pre-copyedited, author-produced PDF of an article accepted for publication in the Journal of Financial Econometrics following peer review. The definitive publisher-authenticated version (C. Brooks, S.P. Burke, S. Heravi and G. Persand, ‘Autoregressive Conditional Kurtosis’, Journal of Financial Econometrics, 3.3 (2005)) is available online at: http://jfec.oxfordjournals.org/content/3/3/399 1 Autoregressive Conditional Kurtosis Chris Brooks1, Simon P. Burke2, Saeed Heravi3, Gita Persand4 The authors’ affiliations are 1Corresponding author: Cass Business School, City of London, 106 Bunhill Row, London EC1Y 8TZ, UK, tel: (+44) 20 70 40 51 68; fax: (+44) 20 70 40 88 81 41; e-mail: [email protected] ; 2School of Business, University of Reading 3Cardiff Business School, and 4Management School, University of Southampton.
    [Show full text]
  • 6 Modeling Survival Data with Cox Regression Models
    CHAPTER 6 ST 745, Daowen Zhang 6 Modeling Survival Data with Cox Regression Models 6.1 The Proportional Hazards Model A proportional hazards model proposed by D.R. Cox (1972) assumes that T z1β1+···+zpβp z β λ(t|z)=λ0(t)e = λ0(t)e , (6.1) where z is a p × 1 vector of covariates such as treatment indicators, prognositc factors, etc., and β is a p × 1 vector of regression coefficients. Note that there is no intercept β0 in model (6.1). Obviously, λ(t|z =0)=λ0(t). So λ0(t) is often called the baseline hazard function. It can be interpreted as the hazard function for the population of subjects with z =0. The baseline hazard function λ0(t) in model (6.1) can take any shape as a function of t.The T only requirement is that λ0(t) > 0. This is the nonparametric part of the model and z β is the parametric part of the model. So Cox’s proportional hazards model is a semiparametric model. Interpretation of a proportional hazards model 1. It is easy to show that under model (6.1) exp(zT β) S(t|z)=[S0(t)] , where S(t|z) is the survival function of the subpopulation with covariate z and S0(t)isthe survival function of baseline population (z = 0). That is R t − λ0(u)du S0(t)=e 0 . PAGE 120 CHAPTER 6 ST 745, Daowen Zhang 2. For any two sets of covariates z0 and z1, zT β λ(t|z1) λ0(t)e 1 T (z1−z0) β, t ≥ , = zT β =e for all 0 λ(t|z0) λ0(t)e 0 which is a constant over time (so the name of proportional hazards model).
    [Show full text]
  • Quadratic Versus Linear Estimating Equations
    ST 762 Nonlinear Statistical Models for Univariate and Multivariate Response Quadratic versus Linear Estimating Equations GLS estimating equations 0 1 n fβj 0 2 2 −1 X σ gj 0 Yj − fj @ 2 2 1/σ A 4 4 2 2 2 = 0: 0 2σ gj 0 2σ gj (Yj − fj ) − σ gj j=1 νθj Estimating equations for β are linear in Yj . Estimating equations for β only require specification of the first two moments. GLS is optimal among all linear estimating equations. 1 / 26 Quadratic versus Linear Estimating Equations ST 762 Nonlinear Statistical Models for Univariate and Multivariate Response Gaussian ML estimating equations 0 2 2 1 n fβj 2σ gj νβj 2 2 −1 X σ g 0 Yj − fj j = 0: @ 2 2 1/σ A 4 4 (Y − f )2 − σ2g 2 0 2σ gj 0 2σ gj j j j j=1 νθj Estimating equations for β are quadratic in Yj . Estimating equations for β require specification of the third and fourth moments as well. Specifically, if we let Yj − f (xj ; β) j = ; σg (β; θ; xj ) then we need to know 3 ∗ 2 ∗ E j = ζj and var j = 2 + κj : 2 / 26 Quadratic versus Linear Estimating Equations ST 762 Nonlinear Statistical Models for Univariate and Multivariate Response Questions ∗ ∗ ^ If we know the true values ζj and κj , how much is β improved using the quadratic estimating equations versus using the linear estimating equations? If we use working values (for example ζj = κj = 0, corresponding to ∗ ∗ normality) that are not the true values (i.e., ζj and κj ), is there any improvement in using the quadratic estimating equations? If we use working variance functions that are not the true variance functions, is there
    [Show full text]
  • Inference Based on Estimating Equations and Probability-Linked Data
    University of Wollongong Research Online Centre for Statistical & Survey Methodology Faculty of Engineering and Information Working Paper Series Sciences 2009 Inference Based on Estimating Equations and Probability-Linked Data R. Chambers University of Wollongong, [email protected] J. Chipperfield Australian Bureau of Statistics Walter Davis University of Wollongong, [email protected] M. Kovacevic Statistics Canada Follow this and additional works at: https://ro.uow.edu.au/cssmwp Recommended Citation Chambers, R.; Chipperfield, J.; Davis, Walter; and Kovacevic, M., Inference Based on Estimating Equations and Probability-Linked Data, Centre for Statistical and Survey Methodology, University of Wollongong, Working Paper 18-09, 2009, 36p. https://ro.uow.edu.au/cssmwp/38 Research Online is the open access institutional repository for the University of Wollongong. For further information contact the UOW Library: [email protected] Centre for Statistical and Survey Methodology The University of Wollongong Working Paper 18-09 Inference Based on Estimating Equations and Probability-Linked Data Ray Chambers, James Chipperfield, Walter Davis, Milorad Kovacevic Copyright © 2008 by the Centre for Statistical & Survey Methodology, UOW. Work in progress, no part of this paper may be reproduced without permission from the Centre. Centre for Statistical & Survey Methodology, University of Wollongong, Wollongong NSW 2522. Phone +61 2 4221 5435, Fax +61 2 4221 4845. Email: [email protected] Inference Based on Estimating Equations and Probability-Linked Data Ray Chambers, University of Wollongong James Chipperfield, Australian Bureau of Statistics Walter Davis, Statistics New Zealand Milorad Kovacevic, Statistics Canada Abstract Data obtained after probability linkage of administrative registers will include errors due to the fact that some linked records contain data items sourced from different individuals.
    [Show full text]
  • 18.650 (F16) Lecture 10: Generalized Linear Models (Glms)
    Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X (µ(X), σ2I), | ∼ N And ⊤ IE(Y X) = µ(X) = X β, | 2/52 Components of a linear model The two components (that we are going to relax) are 1. Random component: the response variable Y X is continuous | and normally distributed with mean µ = µ(X) = IE(Y X). | 2. Link: between the random and covariates X = (X(1),X(2), ,X(p))⊤: µ(X) = X⊤β. · · · 3/52 Generalization A generalized linear model (GLM) generalizes normal linear regression models in the following directions. 1. Random component: Y some exponential family distribution ∼ 2. Link: between the random and covariates: ⊤ g µ(X) = X β where g called link function� and� µ = IE(Y X). | 4/52 Example 1: Disease Occuring Rate In the early stages of a disease epidemic, the rate at which new cases occur can often increase exponentially through time. Hence, if µi is the expected number of new cases on day ti, a model of the form µi = γ exp(δti) seems appropriate. ◮ Such a model can be turned into GLM form, by using a log link so that log(µi) = log(γ) + δti = β0 + β1ti. ◮ Since this is a count, the Poisson distribution (with expected value µi) is probably a reasonable distribution to try. 5/52 Example 2: Prey Capture Rate(1) The rate of capture of preys, yi, by a hunting animal, tends to increase with increasing density of prey, xi, but to eventually level off, when the predator is catching as much as it can cope with.
    [Show full text]
  • Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds
    Optimal Variance Control of the Score Function Gradient Estimator for Importance Weighted Bounds Valentin Liévin 1 Andrea Dittadi1 Anders Christensen1 Ole Winther1, 2, 3 1 Section for Cognitive Systems, Technical University of Denmark 2 Bioinformatics Centre, Department of Biology, University of Copenhagen 3 Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital {valv,adit}@dtu.dk, [email protected], [email protected] Abstract This paper introduces novel results for the score function gradient estimator of the importance weighted variational bound (IWAE). We prove that in the limit of large K (number of importance samples) one can choose the control variate such that the Signal-to-Noise ratio (SNR) of the estimator grows as pK. This is in contrast to the standard pathwise gradient estimator where the SNR decreases as 1/pK. Based on our theoretical findings we develop a novel control variate that extends on VIMCO. Empirically, for the training of both continuous and discrete generative models, the proposed method yields superior variance reduction, resulting in an SNR for IWAE that increases with K without relying on the reparameterization trick. The novel estimator is competitive with state-of-the-art reparameterization-free gradient estimators such as Reweighted Wake-Sleep (RWS) and the thermodynamic variational objective (TVO) when training generative models. 1 Introduction Gradient-based learning is now widespread in the field of machine learning, in which recent advances have mostly relied on the backpropagation algorithm, the workhorse of modern deep learning. In many instances, for example in the context of unsupervised learning, it is desirable to make models more expressive by introducing stochastic latent variables.
    [Show full text]
  • Bayesian Methods: Review of Generalized Linear Models
    Bayesian Methods: Review of Generalized Linear Models RYAN BAKKER University of Georgia ICPSR Day 2 Bayesian Methods: GLM [1] Likelihood and Maximum Likelihood Principles Likelihood theory is an important part of Bayesian inference: it is how the data enter the model. • The basis is Fisher’s principle: what value of the unknown parameter is “most likely” to have • generated the observed data. Example: flip a coin 10 times, get 5 heads. MLE for p is 0.5. • This is easily the most common and well-understood general estimation process. • Bayesian Methods: GLM [2] Starting details: • – Y is a n k design or observation matrix, θ is a k 1 unknown coefficient vector to be esti- × × mated, we want p(θ Y) (joint sampling distribution or posterior) from p(Y θ) (joint probabil- | | ity function). – Define the likelihood function: n L(θ Y) = p(Y θ) | i| i=1 Y which is no longer on the probability metric. – Our goal is the maximum likelihood value of θ: θˆ : L(θˆ Y) L(θ Y) θ Θ | ≥ | ∀ ∈ where Θ is the class of admissable values for θ. Bayesian Methods: GLM [3] Likelihood and Maximum Likelihood Principles (cont.) Its actually easier to work with the natural log of the likelihood function: • `(θ Y) = log L(θ Y) | | We also find it useful to work with the score function, the first derivative of the log likelihood func- • tion with respect to the parameters of interest: ∂ `˙(θ Y) = `(θ Y) | ∂θ | Setting `˙(θ Y) equal to zero and solving gives the MLE: θˆ, the “most likely” value of θ from the • | parameter space Θ treating the observed data as given.
    [Show full text]
  • Simple, Lower-Variance Gradient Estimators for Variational Inference
    Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference Geoffrey Roeder Yuhuai Wu University of Toronto University of Toronto [email protected] [email protected] David Duvenaud University of Toronto [email protected] Abstract We propose a simple and general variant of the standard reparameterized gradient estimator for the variational evidence lower bound. Specifically, we remove a part of the total derivative with respect to the variational parameters that corresponds to the score function. Removing this term produces an unbiased gradient estimator whose variance approaches zero as the approximate posterior approaches the exact posterior. We analyze the behavior of this gradient estimator theoretically and empirically, and generalize it to more complex variational distributions such as mixtures and importance-weighted posteriors. 1 Introduction Recent advances in variational inference have begun to Optimization using: make approximate inference practical in large-scale latent Path Derivative ) variable models. One of the main recent advances has Total Derivative been the development of variational autoencoders along true φ with the reparameterization trick [Kingma and Welling, k 2013, Rezende et al., 2014]. The reparameterization init trick is applicable to most continuous latent-variable mod- φ els, and usually provides lower-variance gradient esti- ( mates than the more general REINFORCE gradient es- KL timator [Williams, 1992]. Intuitively, the reparameterization trick provides more in- 400 600 800 1000 1200 formative gradients by exposing the dependence of sam- Iterations pled latent variables z on variational parameters φ. In contrast, the REINFORCE gradient estimate only de- Figure 1: Fitting a 100-dimensional varia- pends on the relationship between the density function tional posterior to another Gaussian, using log qφ(z x; φ) and its parameters.
    [Show full text]
  • Generalized Estimating Equations for Mixed Models
    GENERALIZED ESTIMATING EQUATIONS FOR MIXED MODELS Lulah Alnaji A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY August 2018 Committee: Hanfeng Chen, Advisor Robert Dyer, Graduate Faculty Representative Wei Ning Junfeng Shang Copyright c August 2018 Lulah Alnaji All rights reserved iii ABSTRACT Hanfeng Chen, Advisor Most statistical approaches of molding the relationship between the explanatory variables and the responses assume subjects are independent. However, in clinical studies the longitudinal data are quite common. In this type of data, each subject is assessed repeatedly over a period of time. Therefore, the independence assumption is unlikely to be valid with longitudinal data due to the correlated observations of each subject. Generalized estimating equations method is a popular choice for longitudinal studies. It is an efficient method since it takes the within-subjects correla- tion into account by introducing the n n working correlation matrix R(↵) which is fully char- ⇥ acterized by the correlation parameter ↵. Although the generalized estimating equations’ method- ology considers correlation among the repeated observations on the same subject, it ignores the between-subject correlation and assumes subjects are independent. The objective of this dissertation is to provide an extension to the generalized estimating equa- tions to take both within-subject and between-subject correlations into account by incorporating the random effect b to the model. If our interest focuses on the regression coefficients, we regard the correlation parameter ↵ and as nuisance and estimate the fixed effects β using the estimating equations U(β,G,ˆ ↵ˆ).
    [Show full text]
  • STAT 830 Likelihood Methods
    STAT 830 Likelihood Methods Richard Lockhart Simon Fraser University STAT 830 — Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Likelihood Methods STAT 830 — Fall 2011 1 / 32 Purposes of These Notes Define the likelihood, log-likelihood and score functions. Summarize likelihood methods Describe maximum likelihood estimation Give sequence of examples. Richard Lockhart (Simon Fraser University) STAT 830 Likelihood Methods STAT 830 — Fall 2011 2 / 32 Likelihood Methods of Inference Toss a thumb tack 6 times and imagine it lands point up twice. Let p be probability of landing points up. Probability of getting exactly 2 point up is 15p2(1 − p)4 This function of p, is the likelihood function. Def’n: The likelihood function is map L: domain Θ, values given by L(θ)= fθ(X ) Key Point: think about how the density depends on θ not about how it depends on X . Notice: X , observed value of the data, has been plugged into the formula for density. Notice: coin tossing example uses the discrete density for f . We use likelihood for most inference problems: Richard Lockhart (Simon Fraser University) STAT 830 Likelihood Methods STAT 830 — Fall 2011 3 / 32 List of likelihood techniques Point estimation: we must compute an estimate θˆ = θˆ(X ) which lies in Θ. The maximum likelihood estimate (MLE) of θ is the value θˆ which maximizes L(θ) over θ ∈ Θ if such a θˆ exists. Point estimation of a function of θ: we must compute an estimate φˆ = φˆ(X ) of φ = g(θ). We use φˆ = g(θˆ) where θˆ is the MLE of θ.
    [Show full text]
  • Using Generalized Estimating Equations to Estimate Nonlinear
    Using generalized estimating equations to estimate nonlinear models with spatial data ∗ § Cuicui Lu†, Weining Wang ‡, Jeffrey M. Wooldridge Abstract In this paper, we study estimation of nonlinear models with cross sectional data using two-step generalized estimating equations (GEE) in the quasi-maximum likelihood estimation (QMLE) framework. In the interest of improving efficiency, we propose a grouping estimator to account for the potential spatial correlation in the underlying innovations. We use a Poisson model and a Negative Binomial II model for count data and a Probit model for binary response data to demon- strate the GEE procedure. Under mild weak dependency assumptions, results on estimation consistency and asymptotic normality are provided. Monte Carlo simulations show efficiency gain of our approach in comparison of different esti- mation methods for count data and binary response data. Finally we apply the GEE approach to study the determinants of the inflow foreign direct investment (FDI) to China. keywords: quasi-maximum likelihood estimation; generalized estimating equations; nonlinear models; spatial dependence; count data; binary response data; FDI equation JEL Codes: C13, C21, C35, C51 arXiv:1810.05855v1 [econ.EM] 13 Oct 2018 ∗This paper is supported by the National Natural Science Foundation of China, No.71601094 and German Research Foundation. †Department of Economics, Nanjing University Business School, Nanjing, Jiangsu 210093 China; email: [email protected] ‡Department of Economics, City, U of London; Northampton Square, Clerkenwell, London EC1V 0HB. Humboldt-Universität zu Berlin, C.A.S.E. - Center for Applied Statistics and Economics; email: [email protected] §Department of Economics, Michigan State University, East Lansing, MI 48824 USA; email: [email protected] 1 1 Introduction In empirical economic and social studies, there are many examples of discrete data which exhibit spatial or cross-sectional correlations possibly due to the closeness of ge- ographical locations of individuals or agents.
    [Show full text]