STAT 200B: Theoretical Statistics


Arash A. Amini
March 2, 2020

Statistical decision theory

• A probability model P = {P_θ : θ ∈ Ω} for data X ∈ 𝒳:
  Ω: parameter space, 𝒳: sample space.
• An action space 𝒜: the set of available actions (decisions).
• A loss function, for example:
  0-1 loss: L(θ, a) = 1{θ ≠ a}, with Ω = 𝒜 = {0, 1}.
  Quadratic loss (squared error): L(θ, a) = ||θ − a||², with Ω = 𝒜 = R^d.

Statistical inference as a game:
1. Nature picks the "true" parameter θ and draws X ~ P_θ. Thus X is a random element of 𝒳.
2. The statistician observes X and makes a decision δ(X) ∈ 𝒜; the map δ : 𝒳 → 𝒜 is called a decision rule.
3. The statistician incurs the loss L(θ, δ(X)).

The goal of the statistician is to minimize the expected loss, a.k.a. the risk:

  R(θ, δ) := E_θ[L(θ, δ(X))] = ∫ L(θ, δ(x)) dP_θ(x) = ∫ L(θ, δ(x)) p_θ(x) dµ(x),

where the last expression holds when the family is dominated: dP_θ = p_θ dµ. We usually work with the family of densities {p_θ(·) : θ ∈ Ω}.

Example 1 (Bernoulli trials)
A coin is being flipped, and we want to estimate the probability that it comes up heads.
• One possible model: X = (X_1, ..., X_n) with X_i iid Ber(θ), for some θ ∈ [0, 1].
• Formally, 𝒳 = {0, 1}^n, P_θ = Ber(θ)^{⊗n} and Ω = [0, 1].
• PMF of X_i: P(X_i = x) = θ if x = 1 and 1 − θ if x = 0, i.e. P(X_i = x) = θ^x (1 − θ)^{1−x} for x ∈ {0, 1}.
• Joint PMF: p_θ(x_1, ..., x_n) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i}.
• Action space: 𝒜 = Ω.
• Quadratic loss: L(θ, δ) = (θ − δ)².

Comparing estimators via their risk
For the Bernoulli trials, let us look at four estimators:
• Sample mean: δ_1(X) = (1/n) Σ_{i=1}^n X_i, with R(θ, δ_1) = θ(1 − θ)/n.
• Constant estimator: δ_2(X) = 1/2, with R(θ, δ_2) = (θ − 1/2)².
• Strange-looking estimator: δ_3(X) = (Σ_i X_i + 3)/(n + 6), with R(θ, δ_3) = [nθ(1 − θ) + (3 − 6θ)²]/(n + 6)².
• Throw the data out: δ_4(X) = X_1, with R(θ, δ_4) = θ(1 − θ).

[Figure: the four risk functions R(θ, δ_1), ..., R(θ, δ_4) plotted against θ ∈ [0, 1], for n = 10 (left panel) and n = 50 (right panel).]

The comparison depends on the choice of the loss; a different loss gives a different picture.
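As a quick illustration (this sketch is not part of the original slides; the θ grid, the sample size n = 10, and the use of NumPy are arbitrary choices), the exact risk formulas above can be tabulated directly; plotting the four curves reproduces the comparison in the figure.

```python
import numpy as np

n = 10                                  # sample size (n = 50 gives the second panel)
theta = np.linspace(0.0, 1.0, 201)      # grid of parameter values

# Exact risk (MSE) functions of the four estimators from the slides.
R1 = theta * (1 - theta) / n                                          # sample mean
R2 = (theta - 0.5) ** 2                                               # constant estimator 1/2
R3 = (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2  # (S + 3)/(n + 6)
R4 = theta * (1 - theta)                                              # first observation only

# No estimator is best for every theta: delta_2 wins near theta = 1/2,
# delta_1 wins near the endpoints.
for name, R in [("delta1", R1), ("delta2", R2), ("delta3", R3), ("delta4", R4)]:
    print(f"{name}: max risk = {R.max():.4f}, risk at theta=0.5 = {R[100]:.4f}")
```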
Comparing estimators via their risk
How do we deal with the fact that risks are functions of θ rather than numbers?
• Summarize them by reducing to numbers:
  (Bayesian) take a weighted average: inf_δ ∫_Ω R(θ, δ) dπ(θ).
  (Frequentist) take the maximum: inf_δ max_{θ ∈ Ω} R(θ, δ).
• Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
• Rule out estimators that are dominated by others (inadmissible).

Admissibility

Definition 1. Let δ and δ* be decision rules. δ* (strictly) dominates δ if
• R(θ, δ*) ≤ R(θ, δ) for all θ ∈ Ω, and
• R(θ, δ*) < R(θ, δ) for some θ ∈ Ω.
δ is inadmissible if there is a different δ* that dominates it; otherwise δ is admissible.

An inadmissible rule can be uniformly "improved". δ_4 in the Bernoulli example is inadmissible. We will see a non-trivial example soon (exponential distribution).

Bias

Definition 2. The bias of δ for estimating g(θ) is B_θ(δ) := E_θ(δ) − g(θ). The estimator is unbiased if B_θ(δ) = 0 for all θ ∈ Ω.

It is not always possible to find unbiased estimators; for example, g(θ) = sin(θ) in the binomial family (Keener, Example 4.2, p. 62).

Definition 3. g is called U-estimable if there is an unbiased estimator δ for g. Usually g(θ) = θ.

Bias-variance decomposition
For the quadratic loss L(θ, a) = (θ − a)², the risk is the mean-squared error (MSE). In this case we have the decomposition

  MSE_θ(δ) = [B_θ(δ)]² + var_θ(δ).

Proof. Let µ_θ := E_θ(δ). We have

  MSE_θ(δ) = E_θ(θ − δ)² = E_θ(θ − µ_θ + µ_θ − δ)²
           = (θ − µ_θ)² + 2(θ − µ_θ) E_θ[µ_θ − δ] + E_θ(µ_θ − δ)²
           = [B_θ(δ)]² + var_θ(δ),

since the cross term vanishes (E_θ[µ_θ − δ] = 0). The same decomposition holds for a general g(θ) and in higher dimensions, with L(θ, a) = ||g(θ) − a||²_2.

Example 2 (Berger)
Let X ~ N(θ, 1) and consider the class of estimators δ_c(X) = cX, for c ∈ R. Then

  MSE_θ(δ_c) = (θ − cθ)² + c² = (1 − c)² θ² + c².

For c > 1 we have 1 = MSE_θ(δ_1) < MSE_θ(δ_c) for all θ. For c ∈ [0, 1] the rules are incomparable.

Optimality depends on the loss

Example 3 (Poisson process)
Let X_1, ..., X_n be the inter-arrival times of a Poisson process with rate λ, so that X_1, ..., X_n iid Expo(λ). The model has the density

  p_λ(x) = ∏_{i=1}^n p_λ(x_i) = ∏_{i=1}^n λ e^{−λ x_i} 1{x_i > 0} = λ^n e^{−λ Σ_i x_i} 1{min_i x_i > 0},

with Ω = 𝒜 = (0, ∞).
• Let S = Σ_i X_i and X̄ = S/n.
• The MLE for λ is λ̂ = 1/X̄ = n/S.
• S = Σ_{i=1}^n X_i has the Gamma(n, λ) distribution.
• 1/S has the Inv-Gamma(n, λ) distribution, with mean λ/(n − 1).
• Hence E_λ[λ̂] = nλ/(n − 1): the MLE is biased for λ.
• Then λ̃ := (n − 1)λ̂/n is unbiased.
• We also have var_λ(λ̃) < var_λ(λ̂).
• It follows that MSE_λ(λ̃) < MSE_λ(λ̂) for all λ, so the MLE λ̂ is inadmissible for the quadratic loss.

A possible explanation: the quadratic loss penalizes over-estimation more than under-estimation on Ω = (0, ∞).

Alternative loss function (Itakura-Saito distance):

  L(λ, a) = λ/a − 1 − log(λ/a),  a, λ ∈ (0, ∞).

• With this loss function, R(λ, λ̃) > R(λ, λ̂) for all λ.
• That is, under this loss the MLE renders λ̃ inadmissible.

This loss is an example of a Bregman divergence, with φ(x) = −log x. For a convex function φ : R^d → R, the Bregman divergence is defined as

  d_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,

the remainder of the first-order Taylor expansion of φ at y.

Details: Consider δ_α(X) = α/S. Then we have

  R(λ, δ_α) − R(λ, δ_β) = n/α − n/β − (log(n/α) − log(n/β)).

• Take α = n − 1 and β = n.
• Use log x − log y < x − y for x > y ≥ 1.
  (This follows from the strict concavity of f(x) = log x: f(x) − f(y) < f'(y)(x − y) for y ≠ x.)

With x = n/(n − 1) and y = 1, this gives R(λ, δ_{n−1}) − R(λ, δ_n) > 0 for every λ, i.e. R(λ, λ̃) > R(λ, λ̂), as claimed.
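A small Monte Carlo sketch of this reversal (not from the slides; NumPy is assumed, and n, λ, the seed, and the number of replications are arbitrary illustration choices): under quadratic loss the unbiased λ̃ = (n − 1)/S has the smaller estimated risk, while under the Itakura-Saito loss the MLE λ̂ = n/S does.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, reps = 5, 2.0, 200_000

# S = sum of n Expo(lam) draws, i.e. S ~ Gamma(n, rate lam).
S = rng.gamma(shape=n, scale=1.0 / lam, size=reps)
lam_hat = n / S          # MLE
lam_tilde = (n - 1) / S  # unbiased estimator

def quad_risk(est):
    return np.mean((est - lam) ** 2)

def itakura_saito_risk(est):
    r = lam / est
    return np.mean(r - 1 - np.log(r))

# Under quadratic loss the unbiased estimator wins; under Itakura-Saito the MLE wins.
print("quadratic:      MLE %.4f  unbiased %.4f" % (quad_risk(lam_hat), quad_risk(lam_tilde)))
print("Itakura-Saito:  MLE %.4f  unbiased %.4f" % (itakura_saito_risk(lam_hat), itakura_saito_risk(lam_tilde)))
```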
Sufficiency

Idea: separate the data into
• parts that are relevant for estimating θ (sufficient), and
• parts that are irrelevant (ancillary).

Benefits:
• Data compression: efficient computation and storage.
• Irrelevant parts can increase the risk (Rao-Blackwell).

Definition 4. Consider the model P = {P_θ : θ ∈ Ω} for X. A statistic T = T(X) is sufficient for P (or for θ, or for X) if the conditional distribution of X given T does not depend on θ. More precisely,

  P_θ(X ∈ A | T = t) = Q_t(A)  for all t, A,

for some Markov kernel Q. Making this fully precise requires measure theory. Intuitively, given T, we can simulate X using an external source of randomness.

Example 4 (Coin tossing)
• X_i iid Ber(θ), i = 1, ..., n. Notation: X = (X_1, ..., X_n), x = (x_1, ..., x_n).
• We will show that T = T(X) = Σ_i X_i is sufficient for θ. (This should be intuitive.)

  P_θ(X = x) = p_θ(x) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^{T(x)} (1 − θ)^{n−T(x)}.

• Then

  P_θ(X = x, T = t) = P_θ(X = x) 1{T(x) = t} = θ^t (1 − θ)^{n−t} 1{T(x) = t}.

• Marginalizing over x ∈ {0, 1}^n,

  P_θ(T = t) = Σ_{x ∈ {0,1}^n} θ^t (1 − θ)^{n−t} 1{T(x) = t} = C(n, t) θ^t (1 − θ)^{n−t},

  where C(n, t) is the binomial coefficient.
• Hence,

  P_θ(X = x | T = t) = θ^t (1 − θ)^{n−t} 1{T(x) = t} / [C(n, t) θ^t (1 − θ)^{n−t}] = 1{T(x) = t} / C(n, t).

What is the above (conditional) distribution? It is the uniform distribution over the C(n, t) binary sequences with exactly t ones; in particular, it does not depend on θ.

Factorization Theorem
It is not convenient to check for sufficiency this way, hence:

Theorem 1 (Factorization (Fisher-Neyman)). Assume that P = {P_θ : θ ∈ Ω} is dominated by µ. A statistic T is sufficient iff, for some functions g_θ, h ≥ 0,

  p_θ(x) = g_θ(T(x)) h(x)  for µ-a.e. x.

In that case the likelihood θ ↦ p_θ(X) depends on X only through T(X). The assumption that the family is dominated (has a density) is important.
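Returning to Example 4, the remark after Definition 4 (given T, we can simulate X from external randomness alone) can be made concrete with a short sketch; n, θ and the seed below are arbitrary illustration choices, and NumPy is assumed. Conditionally on T = t, X is a uniformly random arrangement of t ones among n positions, which requires no knowledge of θ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 12, 0.7

# Step 1: nature draws X ~ Ber(theta)^n and we record only T = sum(X).
x = rng.binomial(1, theta, size=n)
t = x.sum()

# Step 2: knowing only T = t (and not theta), regenerate a data set X* by
# placing t ones uniformly at random among the n positions.
x_star = np.zeros(n, dtype=int)
x_star[rng.choice(n, size=t, replace=False)] = 1

# X* has exactly the conditional distribution of X given T = t, whatever theta is,
# so X* is just as informative about theta as the original data.
print("T =", t)
print("original  X :", x)
print("simulated X*:", x_star)
```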