Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference

Total pages: 16 · File type: PDF · Size: 1020 KB

doi 10.1515/ijb-2012-0038 · The International Journal of Biostatistics 2014; 10(1): 29–57

Research Article

Mark J. van der Laan*

Abstract: In order to obtain concrete results, we focus on estimation of the treatment-specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment-specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters, such as ensemble learning and specifically super-learning. Because such estimators involve an optimal trade-off of bias and variance w.r.t. the infinite-dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order, and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment-specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.

Keywords: asymptotic linearity; cross-validation; efficient influence curve; influence curve; targeted minimum loss-based estimation

*Corresponding author: Mark J. van der Laan, Division of Biostatistics, University of California – Berkeley, Berkeley, CA, USA. E-mail: [email protected]

1 Introduction and overview

This introduction provides an atlas for the contents of this article. It starts by formulating the role of estimation of nuisance parameters in obtaining asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target the estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present the concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems.
Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.

1.1 The role of nuisance parameter estimation

Suppose we observe $n$ independent and identically distributed copies of a random variable $O$ with probability distribution $P_0$. In addition, assume that it is known that $P_0$ is an element of a statistical model $\mathcal{M}$ and that we want to estimate $\psi_0 = \Psi(P_0)$ for a given target parameter mapping $\Psi : \mathcal{M} \to \mathbb{R}$. In order to guarantee that $P_0 \in \mathcal{M}$ one is forced to only incorporate real knowledge, and, as a consequence, such models $\mathcal{M}$ are always very large and, in particular, infinite dimensional. We assume that the target parameter mapping is path-wise differentiable and let $D^*(P)$ denote the canonical gradient of the path-wise derivative of $\Psi$ at $P \in \mathcal{M}$ [1]. An estimator $\psi_n = \hat{\Psi}(P_n)$ is a functional applied to the empirical distribution $P_n$ of $O_1, \ldots, O_n$ and can thus be represented as a mapping $\hat{\Psi} : \mathcal{M}_{NP} \to \mathbb{R}$ from the non-parametric statistical model $\mathcal{M}_{NP}$ into the real line. An estimator $\hat{\Psi}$ is efficient if and only if it is asymptotically linear with influence curve $D^*(P_0)$:
$$\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^n D^*(P_0)(O_i) + o_P(1/\sqrt{n}).$$
The empirical mean of the influence curve $D^*(P_0)$ represents the first-order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals (i.e. $\hat{\Psi}$) of the empirical distribution [2–4].

Suppose that $\Psi(P)$ only depends on $P$ through a parameter $Q(P)$, and that the canonical gradient depends on $P$ only through $Q(P)$ and a nuisance parameter $g(P)$. The construction of an efficient estimator requires the construction of estimators $Q_n$ and $g_n$ of these nuisance parameters $Q_0$ and $g_0$, respectively. Targeted minimum loss-based estimation (TMLE) represents a method for construction of (e.g. efficient) asymptotically linear substitution estimators $\Psi(Q_n^*)$, where $Q_n^*$ is a targeted update of $Q_n$ that relies on the estimator $g_n$ [5–7]. The targeting of $Q_n$ is achieved by specifying a parametric submodel $\{Q_n(\epsilon) : \epsilon\} \subset \{Q(P) : P \in \mathcal{M}\}$ through the initial estimator $Q_n$ and a loss function $O \mapsto L(Q)(O)$ for $Q_0 = \arg\min_Q P_0 L(Q)$, where $P_0 L(Q) = \int L(Q)(o)\,dP_0(o)$, so that the generalized score $\frac{d}{d\epsilon} L(Q_n(\epsilon))\big|_{\epsilon=0}$ spans a desired user-supplied estimating function $D(Q_n, g_n)$. In addition, one may decide to target $g_n$ by specifying a parametric submodel $\{g_n(\epsilon_1) : \epsilon_1\} \subset \{g(P) : P \in \mathcal{M}\}$ and loss function $O \mapsto L(g)(O)$ for $g_0 = \arg\min_g P_0 L(g)$, so that the generalized score $\frac{d}{d\epsilon_1} L(g_n(\epsilon_1))\big|_{\epsilon_1=0}$ spans another desired estimating function $D_1(g_n, \eta_n)$ for some estimator $\eta_n$ of a nuisance parameter $\eta$. The parameter $\epsilon$ is fitted with MLE, $\epsilon_n = \arg\min_\epsilon P_n L(Q_n(\epsilon))$, providing the first-step update $Q_n^1 = Q_n(\epsilon_n)$, and similarly $\epsilon_{1,n} = \arg\min_{\epsilon_1} P_n L(g_n(\epsilon_1))$ provides $g_n^1 = g_n(\epsilon_{1,n})$. This updating process that maps a current fit $(Q_n, g_n)$ into an update $(Q_n^1, g_n^1)$ is iterated until convergence, at which point the TMLE $(Q_n^*, g_n^*)$ solves $P_n D(Q_n^*, g_n^*) = 0$, i.e. the empirical mean of the estimating function equals zero at the final TMLE $(Q_n^*, g_n^*)$. If one also targeted $g_n$, then it also solves $P_n D_1(g_n^*, \eta_n) = 0$. The submodel through $Q_n$ will depend on $g_n$, while the submodel through $g_n$ will depend on another nuisance parameter $\eta_n$.
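Although the article formally introduces its concrete example later, the targeting step can already be made tangible for the treatment-specific mean $\psi_0 = E_0[\bar{Q}_0(1, W)]$, with $\bar{Q}_0(A, W) = E_0(Y \mid A, W)$ and propensity score $g_0(W) = P_0(A = 1 \mid W)$. The following Python sketch implements the standard logistic fluctuation along the "clever covariate" $H(A, W) = A/g_n(W)$, as is common in the TMLE literature; it is an illustrative sketch under stated assumptions (outcome bounded in $(0,1)$, user-supplied initial fits `Qbar1` and `g`), not the article's code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_treatment_specific_mean(Qbar1, g, A, Y):
    """One targeting step for psi_0 = E_0[Qbar_0(1, W)].

    Qbar1 : initial estimates of E(Y | A=1, W_i), e.g. from a super-learner
    g     : estimates of the propensity score P(A=1 | W_i)
    A, Y  : observed binary treatment and outcome in (0, 1)
    """
    Qbar1 = np.clip(Qbar1, 1e-6, 1 - 1e-6)
    H = A / g                                   # clever covariate A / g_n(W)
    offset = np.log(Qbar1 / (1 - Qbar1))        # fluctuate on the logit scale

    def neg_loglik(eps):
        # Loss whose generalized score at eps spans the residual component
        # H * (Y - Qbar1(eps)) of the efficient influence curve.
        p = expit(offset + eps * H)
        return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

    eps = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x
    Qbar1_star = expit(offset + eps * H)        # targeted update Qbar_n^*
    return Qbar1_star.mean()                    # substitution estimator Psi(Q_n^*)
```

At the fitted $\epsilon$, the score equation forces the empirical mean of $H(A,W)\,(Y - \bar{Q}_n^*(1,W))$ to zero, and since the returned estimator plugs in $\bar{Q}_n^*$ itself, this is precisely how $P_n D^*(Q_n^*, g_n) = 0$ arises in this example; for this parameter a single targeting step already suffices.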
By setting $D(Q, g)$ equal to the efficient influence curve $D^*(Q, g)$, the resulting TMLE solves the efficient influence curve estimating equation $P_n D^*(Q_n^*, g_n^*) = 0$ and thereby will be asymptotically efficient when $(Q_n^*, g_n)$ is consistent for $(Q_0, g_0)$, under appropriate regularity conditions; for this the targeting of $g_n$ is not needed. The latter is shown as follows. By the property of the canonical gradient (in fact, of any gradient) we have
$$\Psi(Q_n^*) - \Psi(Q_0) = -P_0 D^*(Q_n^*, g_n) + R_n(Q_n^*, Q_0, g_n, g_0),$$
where $R_n$ involves integrals of second-order products of the differences $(Q_n^* - Q_0)$ and $(g_n - g_0)$. Combined with $P_n D^*(Q_n^*, g_n) = 0$, this implies the following identity:
$$\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n^*, g_n) + R_n(Q_n^*, Q_0, g_n, g_0).$$
The first term is an empirical process term that, under empirical process conditions (mentioned below), equals $(P_n - P_0) D^*(Q, g)$, where $(Q, g)$ denotes the limit of $(Q_n^*, g_n)$, plus an $o_P(1/\sqrt{n})$ term. This then yields
$$\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q, g) + R_n(Q_n^*, Q_0, g_n, g_0) + o_P(1/\sqrt{n}).$$
To obtain the desired asymptotic linearity of $\Psi(Q_n^*)$ one needs $R_n = o_P(1/\sqrt{n})$, which in general requires at a minimum that both nuisance parameters are consistently estimated: $Q = Q_0$ and $g = g_0$. However, in many problems of interest, $R_n$ only involves a cross-product of the differences $Q_n^* - Q_0$ and $g_n - g_0$, so that $R_n$ converges to zero if either $Q_n^*$ or $g_n$ is consistent: i.e. $Q = Q_0$ or $g = g_0$. In this latter case, the TMLE is so-called double robust. Either way, the consistency of the TMLE now relies on one of the nuisance parameter estimators being consistent, thereby requiring the use of non-parametric adaptive estimation, such as super-learning [8–10], for at least one of the nuisance parameters. If only one of the nuisance parameter estimators is consistent, and we are in the double robust scenario, then it follows that the bias is of the same order as the bias of the consistent nuisance parameter estimator. However, if that nuisance parameter estimator is not based on a correctly specified parametric model, but instead is a data-adaptive estimator, then this bias will converge to zero at a rate slower than $1/\sqrt{n}$: i.e. $\sqrt{n} R_n$ diverges as $n \to \infty$. Thus, in that case, the estimator of the target parameter may be overly biased and thereby will not be asymptotically linear.
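For the treatment-specific mean example, this remainder can be written out in closed form; the following display is a standard calculation, included here as a reasoning aid rather than quoted from the article (notation as above, with $\bar{Q}(A, W)$ the outcome regression and $g(W) = P(A = 1 \mid W)$ the propensity score). With efficient influence curve
$$D^*(Q, g)(O) = \frac{A}{g(W)}\bigl(Y - \bar{Q}(1, W)\bigr) + \bar{Q}(1, W) - \Psi(Q),$$
a direct computation gives
$$R(Q, Q_0, g, g_0) = \Psi(Q) - \Psi(Q_0) + P_0 D^*(Q, g) = E_0\!\left[\frac{(g_0 - g)(W)}{g(W)}\,\bigl(\bar{Q}_0 - \bar{Q}\bigr)(1, W)\right],$$
which is exactly a cross-product of the two estimation errors: it vanishes whenever $\bar{Q} = \bar{Q}_0$ or $g = g_0$ (double robustness), and by Cauchy–Schwarz it is bounded by the product of the $L^2(P_0)$ norms of $(g_0 - g)/g$ and $(\bar{Q}_0 - \bar{Q})(1, \cdot)$, making the second-order nature of $R_n$ explicit.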
Recommended publications
  • Ordinary Least Squares
    Ordinary least squares. In statistics, ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula, especially in the case of a single regressor on the right-hand side. The OLS estimator is consistent when the regressors are exogenous and there is no multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics) and electrical engineering (control theory and signal processing), among many areas of application. [Figure: Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate; the ordinary least squares method is used to construct the regression line describing this law.] Linear model. Suppose the data consist of $n$ observations $\{y_i, x_i\}$. Each observation includes a scalar response $y_i$ and a vector of predictors (or regressors) $x_i$. In a linear regression model the response variable is a linear function of the regressors:
$$y_i = x_i'\beta + \varepsilon_i,$$
where $\beta$ is a $p \times 1$ vector of unknown parameters; the $\varepsilon_i$'s are unobserved scalar random variables (errors) which account for the discrepancy between the actually observed responses $y_i$ and the "predicted outcomes" $x_i'\beta$; and $'$ denotes matrix transpose, so that $x_i'\beta$ is the dot product between the vectors $x_i$ and $\beta$.
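As a quick companion to the formulas above, here is a minimal, self-contained numpy sketch of OLS on simulated data (all names and numbers are illustrative). It uses a least-squares solver rather than forming $(X'X)^{-1}X'y$ explicitly, which is the numerically preferred route to the same estimator.

```python
import numpy as np

# Minimal OLS sketch: y = X beta + eps, beta_hat solves min_b ||y - X b||^2.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # equivalent to (X'X)^{-1} X'y
print(beta_hat)  # close to beta_true for exogenous, well-conditioned regressors
```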
  • Quantum State Estimation with Nuisance Parameters
    Quantum state estimation with nuisance parameters. Jun Suzuki ([email protected]), Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, 182-8585 Japan. Yuxiang Yang ([email protected]), Institute for Theoretical Physics, ETH Zürich, 8093 Zürich, Switzerland. Masahito Hayashi ([email protected]), Graduate School of Mathematics, Nagoya University, Nagoya, 464-8602, Japan; Shenzhen Institute for Quantum Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China; Center for Quantum Computing, Peng Cheng Laboratory, Shenzhen 518000, China; Centre for Quantum Technologies, National University of Singapore, 117542, Singapore. arXiv:1911.02790v3 [quant-ph] 29 Jun 2020. Abstract: In parameter estimation, nuisance parameters refer to parameters that are not of interest but nevertheless affect the precision of estimating other parameters of interest. For instance, the strength of noises in a probe can be regarded as a nuisance parameter. Despite its long history in classical statistics, the nuisance parameter problem in quantum estimation remains largely unexplored. The goal of this article is to provide a systematic review of quantum estimation in the presence of nuisance parameters, and to supply those who work in quantum tomography and quantum metrology with tools to tackle relevant problems. After an introduction to the nuisance parameter and quantum estimation theory, we explicitly formulate the problem of quantum state estimation with nuisance parameters. We extend quantum Cramér–Rao bounds to the nuisance parameter case and provide a parameter orthogonalization tool to separate the nuisance parameters from the parameters of interest. In particular, we put more focus on the case of one-parameter estimation in the presence of nuisance parameters, as it is most frequently encountered in practice.
  • Nearly Optimal Tests When a Nuisance Parameter Is Present
    Nearly Optimal Tests when a Nuisance Parameter is Present Under the Null Hypothesis. Graham Elliott (UCSD), Ulrich K. Müller (Princeton University), Mark W. Watson (Princeton University and NBER). January 2012 (revised July 2013). Abstract: This paper considers nonstandard hypothesis testing problems that involve a nuisance parameter. We establish an upper bound on the weighted average power of all valid tests, and develop a numerical algorithm that determines a feasible test with power close to the bound. The approach is illustrated in six applications: inference about a linear regression coefficient when the sign of a control coefficient is known; small sample inference about the difference in means from two independent Gaussian samples from populations with potentially different variances; inference about the break date in structural break models with moderate break magnitude; predictability tests when the regressor is highly persistent; inference about an interval-identified parameter; and inference about a linear regression coefficient when the necessity of a control is in doubt. JEL classification: C12; C21; C22. Keywords: least favorable distribution, composite hypothesis, maximin tests. This research was funded in part by NSF grant SES-0751056 (Müller). The paper supersedes the corresponding sections of the previous working papers "Low-Frequency Robust Cointegration Testing" and "Pre and Post Break Parameter Inference" by the same set of authors. We thank Lars Hansen, Andriy Norets, Andres Santos, two referees, the participants of the AMES 2011 meeting at Korea University, of the 2011 Greater New York Metropolitan Econometrics Colloquium, and of a workshop at USC for helpful comments.
  • Composite Hypothesis, Nuisance Parameters
    Practical Statistics – part II: "Composite hypothesis, Nuisance Parameters". W. Verkerke (NIKHEF). Summary of yesterday, plan for today: start with the basics and gradually build up to complexity — statistical tests with simple hypotheses for counting data ("What do we mean with probabilities"); statistical tests with simple hypotheses for distributions ("p-values"); hypothesis testing as basis for event selection ("Optimal event selection & machine learning"); composite hypotheses (with parameters) for distributions ("Confidence intervals, maximum likelihood"); statistical inference with nuisance parameters ("Fitting the background"); response functions and subsidiary measurements ("Profile likelihood fits"). Introducing the concept of composite hypotheses: in most cases in physics, a hypothesis is not "simple" but "composite". A composite hypothesis is any hypothesis which does not specify the population distribution completely. Example: a counting experiment with signal and background that leaves the signal expectation unspecified — with background b = 5 fixed, s = 0 is a simple hypothesis, whereas leaving s free (s = 5, 10, 15, ...) yields a composite hypothesis (see the sketch below). Notation convention: all symbols with a tilde are constants. A common convention in the meaning of model parameters is to recast signal rate parameters into a normalized form, e.g. w.r.t. the Standard Model rate. This "universal" parameter interpretation makes it easier to work with your models: μ = 0 means no signal, μ = 1 the expected signal.
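The following sketch makes the counting example concrete (the observed count and grid are illustrative choices, not from the slides): under the composite hypothesis the signal rate s is left free in n ~ Poisson(s + b), and the likelihood is maximized over s.

```python
import numpy as np
from scipy.stats import poisson

# Counting experiment: n ~ Poisson(s + b) with known background b = 5.
# Simple hypothesis: s fixed (e.g. s = 0). Composite hypothesis: s free.
b = 5.0
n_obs = 9                                    # illustrative observed count

s_grid = np.linspace(0.0, 15.0, 301)
loglik = poisson.logpmf(n_obs, s_grid + b)   # log L(s) = log Pois(n_obs; s + b)
s_hat = s_grid[np.argmax(loglik)]            # MLE of the signal rate
print(s_hat)                                 # ~ n_obs - b = 4
```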
  • Likelihood Inference in the Presence of Nuisance Parameters
    PHYSTAT2003, SLAC, Stanford, California, September 8–11, 2003. Likelihood Inference in the Presence of Nuisance Parameters. N. Reid, D.A.S. Fraser, Department of Statistics, University of Toronto, Toronto, Canada M5S 3G3. We describe some recent approaches to likelihood-based inference in the presence of nuisance parameters. Our approach is based on plotting the likelihood function and the p-value function, using recently developed third-order approximations. Orthogonal parameters and adjustments to profile likelihood are also discussed. Connections to classical approaches of conditional and marginal inference are outlined. 1. INTRODUCTION. We take the view that the most effective form of inference is provided by the observed likelihood function along with the associated p-value function. In the case of a scalar parameter the likelihood function is simply proportional to the density function. The p-value function can be obtained exactly if there is a one-dimensional statistic that measures the parameter. If not, the p-value can be obtained to a high order of approximation using recently developed methods of likelihood asymptotics. In the presence of nuisance parameters, the likelihood function for a (one-dimensional) parameter of interest is obtained via an adjustment to the profile likelihood function. The likelihood function is defined only up to arbitrary multiples which may depend on $y$ but not on $\theta$. This ensures in particular that the likelihood function is invariant to one-to-one transformations of the measurement(s) $y$. In the context of independent, identically distributed sampling, where $y = (y_1, \ldots, y_n)$ and each $y_i$ follows the model $f(y; \theta)$, the likelihood function is proportional to $\prod_i f(y_i; \theta)$ and the log-likelihood function becomes a sum of independent and identically distributed components:
$$\ell(\theta) = \ell(\theta; y) = \sum_i \log f(y_i; \theta) + a(y). \quad (2)$$
The maximum likelihood estimate $\hat\theta$ is the value of $\theta$ maximizing $\ell(\theta)$.
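To illustrate the profile likelihood that the authors adjust (a generic textbook construction, not code from this paper): with a normal model where the mean $\mu$ is of interest and the variance $\sigma^2$ is a nuisance parameter, the profile log-likelihood maximizes out $\sigma^2$ at each fixed $\mu$.

```python
import numpy as np

# Profile log-likelihood for a normal mean, profiling out the variance:
# for fixed mu, the MLE of sigma^2 is mean((y - mu)^2); plugging it back in
# gives l_p(mu) = -(n/2) * (log(2*pi*sigma2_hat(mu)) + 1).
def profile_loglik(mu, y):
    n = len(y)
    sigma2_hat = np.mean((y - mu) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=50)   # illustrative data
mu_grid = np.linspace(1.0, 3.0, 201)
lp = np.array([profile_loglik(mu, y) for mu in mu_grid])
print(mu_grid[np.argmax(lp)])                 # profile MLE of mu ~ sample mean
```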
  • Learning to Pivot with Adversarial Networks
    Learning to Pivot with Adversarial Networks. Gilles Louppe (New York University, [email protected]), Michael Kagan (SLAC National Accelerator Laboratory, [email protected]), Kyle Cranmer (New York University, [email protected]). Abstract: Several techniques for domain adaptation have been proposed to account for differences in the distribution of the data used for training and testing. The majority of this work focuses on a binary domain label. Similar problems occur in a scientific context where there may be a continuous family of plausible data generation processes associated to the presence of systematic uncertainties. Robust inference is possible if it is based on a pivot – a quantity whose distribution does not depend on the unknown values of the nuisance parameters that parametrize this family of data generation processes. In this work, we introduce and derive theoretical results for a training procedure based on adversarial networks for enforcing the pivotal property (or, equivalently, fairness with respect to continuous attributes) on a predictive model. The method includes a hyperparameter to control the trade-off between accuracy and robustness. We demonstrate the effectiveness of this approach with a toy example and examples from particle physics. 1 Introduction. Machine learning techniques have been used to enhance a number of scientific disciplines, and they have the potential to transform even more of the scientific process. One of the challenges of applying machine learning to scientific problems is the need to incorporate systematic uncertainties, which affect both the robustness of inference and the metrics used to evaluate a particular analysis strategy. In this work, we focus on supervised learning techniques where systematic uncertainties can be associated to a data generation process that is not uniquely specified.
  • Exact Statistical Inferences for Functions of Parameters of the Log-Gamma Distribution
    UNLV Theses, Dissertations, Professional Papers, and Capstones, 5-1-2015. Exact Statistical Inferences for Functions of Parameters of the Log-Gamma Distribution. Joseph F. Mcdonald, University of Nevada, Las Vegas. Follow this and additional works at: https://digitalscholarship.unlv.edu/thesesdissertations. Part of the Multivariate Analysis Commons, and the Probability Commons. Repository Citation: Mcdonald, Joseph F., "Exact Statistical Inferences for Functions of Parameters of the Log-Gamma Distribution" (2015). UNLV Theses, Dissertations, Professional Papers, and Capstones. 2384. http://dx.doi.org/10.34917/7645961. EXACT STATISTICAL INFERENCES FOR FUNCTIONS OF PARAMETERS OF THE LOG-GAMMA DISTRIBUTION by Joseph F. McDonald. Bachelor of Science in Secondary Education, University of Nevada, Las Vegas, 1991. Master of Science in Mathematics, University of Nevada, Las Vegas, 1993. A dissertation submitted in partial fulfillment of the requirements for the Doctor of Philosophy – Mathematical Sciences, Department of Mathematical Sciences, College of Sciences, The Graduate College, University of Nevada, Las Vegas, May 2015.
  • Confidence Intervals and Nuisance Parameters
    Confidence Intervals and Nuisance Parameters. Common example: interval estimation in Poisson sampling with scale factor and background subtraction. The problem (e.g.): a "cut and count" analysis for a branching fraction B finds n events. The background estimate is $\hat b \pm \sigma_b$ events; the efficiency and parent sample are estimated to give a scaling factor $\hat f \pm \sigma_f$. How do we determine a (frequency) confidence interval? Assume $n$ is sampled from Poisson with $\mu = \langle n\rangle = fB + b$; assume $\hat b$ is sampled from normal $N(b, \sigma_b)$; assume $\hat f$ is sampled from normal $N(f, \sigma_f)$. Example, continued. The likelihood function is:
$$L(n, \hat b, \hat f; B, b, f) = \frac{\mu^n e^{-\mu}}{n!}\,\frac{1}{2\pi\sigma_b\sigma_f}\, e^{-\frac12\left(\frac{\hat b - b}{\sigma_b}\right)^2 - \frac12\left(\frac{\hat f - f}{\sigma_f}\right)^2}, \qquad \mu = \langle n\rangle = fB + b.$$
We are interested in the branching fraction B. In particular, we would like to summarize the data relevant to B, for example in the form of a confidence interval, without dependence on the uninteresting quantities b and f: b and f are "nuisance parameters". Interval estimation in Poisson sampling (continued) — a variety of approaches for dealing with the nuisance parameters: (i) just give $n$, $\hat b \pm \sigma_b$, and $\hat f \pm \sigma_f$ — this provides a "complete" summary and should be done anyway, but it isn't a confidence interval; (ii) integrate over the $N(\hat f, \sigma_f)$ "PDF" for f and the $N(\hat b, \sigma_b)$ "PDF" for b (variant: normal assumption on 1/f) — quasi-Bayesian (uniform prior for f and b, or, e.g., for 1/f). (Frank Porter, March 22, 2005, CITBaBar)
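As a numerical companion (all data values below are illustrative, not from the slides), one common way to eliminate the nuisance parameters b and f — profiling, which the slides' list of approaches complements — maximizes the likelihood above over (b, f) at each fixed B:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, poisson

# Illustrative inputs (not from the slides)
n, b_hat, sig_b, f_hat, sig_f = 12, 3.0, 0.5, 10.0, 1.0

def loglik(B, b, f):
    """log L(n, b_hat, f_hat; B, b, f) with mu = f*B + b."""
    mu = f * B + b
    return (poisson.logpmf(n, mu)
            + norm.logpdf(b_hat, loc=b, scale=sig_b)
            + norm.logpdf(f_hat, loc=f, scale=sig_f))

def profile_loglik(B):
    """Maximize out the nuisance parameters (b, f) at fixed B."""
    res = minimize(lambda x: -loglik(B, x[0], x[1]), x0=[b_hat, f_hat],
                   bounds=[(1e-6, None), (1e-6, None)])
    return -res.fun

B_grid = np.linspace(0.2, 1.8, 81)
pl = np.array([profile_loglik(B) for B in B_grid])
print(B_grid[np.argmax(pl)])   # profile MLE of B, near (n - b_hat) / f_hat
```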
  • Reducing the Sensitivity to Nuisance Parameters in Pseudo-Likelihood Functions
    544 The Canadian Journal of Statistics, Vol. 42, No. 4, 2014, Pages 544–562 / La revue canadienne de statistique. Reducing the sensitivity to nuisance parameters in pseudo-likelihood functions. Yang NING (Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada), Kung-Yee LIANG (Department of Life Sciences and Institute of Genome Sciences, National Yang-Ming University, Taiwan, ROC), and Nancy REID (Department of Statistical Science, University of Toronto, Toronto, Canada). Key words and phrases: composite likelihood; higher order inference; invariance; nuisance parameters; pseudo-likelihood. MSC 2010: primary 62F03; secondary 62F10. Abstract: In a parametric model, parameters are often partitioned into parameters of interest and nuisance parameters. However, as the data structure becomes more complex, inference based on the full likelihood may be computationally intractable or sensitive to potential model misspecification. Alternative likelihood-based methods proposed in these settings include pseudo-likelihood and composite likelihood. We propose a simple adjustment to these likelihood functions to reduce the impact of nuisance parameters. The advantages of the modification are illustrated through examples and reinforced through simulations. The adjustment is still novel even if attention is restricted to the profile likelihood. The Canadian Journal of Statistics 42: 544–562; 2014. © 2014 Statistical Society of Canada. Résumé: The parameters of a model are often categorized as either nuisance parameters or parameters of interest. As the data structure becomes more complex, the likelihood may become intractable to compute or sensitive to specification errors. Pseudo-likelihood and composite likelihood have been proposed as solutions in these situations.
  • Lectures on Statistics in Theory: Prelude to Statistics in Practice
    Lectures on Statistics in Theory: Prelude to Statistics in Practice. Robert D. Cousins ([email protected]), Dept. of Physics and Astronomy, University of California, Los Angeles, Los Angeles, California 90095. August 4, 2018. arXiv:1807.05996v2 [physics.data-an]. Abstract: This is a writeup of lectures on "statistics" that have evolved from the 2009 Hadron Collider Physics Summer School at CERN to the forthcoming 2018 school at Fermilab. The emphasis is on foundations, using simple examples to illustrate the points that are still debated in the professional statistics literature. The three main approaches to interval estimation (Neyman confidence, Bayesian, likelihood ratio) are discussed and compared in detail, with and without nuisance parameters. Hypothesis testing is discussed mainly from the frequentist point of view, with pointers to the Bayesian literature. Various foundational issues are emphasized, including the conditionality principle and the likelihood principle. Contents: 1 Introduction; 2 Preliminaries (2.1 Why foundations matter; 2.2 Definitions are important; 2.3 Key tasks: important to distinguish); 3 Probability (3.1 Definitions, Bayes's theorem; 3.2 Example of Bayes's theorem using frequentist P; 3.3 Example of Bayes's theorem using Bayesian P; 3.4 A note re decisions; 3.5 Aside: what is the "whole space"?); 4 Probability, probability density, likelihood (4.1 Change of observable variable ("metric") x in pdf p(x|µ); 4.2 Change of parameter µ in pdf p(x|µ); 4.3 Probability integral transform; 4.4 Bayes's theorem generalized to probability densities); ...
  • P Values and Nuisance Parameters
    P Values and Nuisance Parameters. Luc Demortier, The Rockefeller University, New York, NY 10065, USA. Abstract: We review the definition and interpretation of p values, describe methods to incorporate systematic uncertainties in their calculation, and briefly discuss a non-regular but common problem caused by nuisance parameters that are unidentified under the null hypothesis. 1 Introduction. Statistical theory offers three main paradigms for testing hypotheses: the Bayesian construction of hypothesis probabilities, Neyman–Pearson procedures for controlling frequentist error rates, and Fisher's evidential interpretation of tail probabilities. Since practitioners often attempt to combine elements from different approaches, especially the last two, it is useful to illustrate with a couple of examples how they actually address different questions [1]. The first example concerns the selection of a sample of $p\bar{p}$ collision events for measuring the mass of the top quark. For each event one must decide between two hypotheses, H0: the event is background, versus H1: the event contains a top quark. Since the same testing procedure is sequentially repeated on a large number of events, the decision must be made in such a way that the rate of wrong decisions is fully controlled in the long run. Traditionally, this problem is solved with the help of Neyman–Pearson theory, with a terminology adapted to physics goals: the Type-I error rate α translates to background contamination, and the power of the test to selection efficiency. In principle either hypothesis can be assigned the role of H0 in this procedure. Since only the Type-I error rate is directly adjustable by the investigator (via the size of the critical region), a potentially useful criterion is to define as H0 the hypothesis for which it is deemed more important to control the incorrect rejection probability than the incorrect acceptance probability.
  • Stat 5102 Lecture Slides Deck 2
    Stat 5102 Lecture Slides: Deck 2. Charles J. Geyer, School of Statistics, University of Minnesota. Statistical Inference. Statistics is probability done backwards. In probability theory we give you one probability model, also called a probability distribution. Your job is to say something about expectations, probabilities, quantiles, etc. for that distribution. In short, given a probability model, describe data from that model. In theoretical statistics, we give you a statistical model, which is a family of probability distributions, and we give you some data assumed to have one of the distributions in the model. Your job is to say something about which distribution that is. In short, given a statistical model and data, infer which distribution in the model is the one for the data. Statistical Models. A statistical model is a family of probability distributions. A parametric statistical model is a family of probability distributions specified by a finite set of parameters. Examples: Ber(p), N(µ, σ²), and the like. A nonparametric statistical model is a family of probability distributions too big to be specified by a finite set of parameters. Examples: all probability distributions on ℝ, all continuous symmetric probability distributions on ℝ, all probability distributions on ℝ having second moments, and the like. Statistical Models and Submodels. If M is a statistical model, it is a family of probability distributions. A submodel of a statistical model M is a family of probability distributions that is a subset of M. If M is parametric, then we often specify it by giving the PMF (if the distributions are discrete) or PDF (if the distributions are continuous), $\{ f_\theta : \theta \in \Theta \}$, where Θ is the parameter space of the model.