Bayesian Statistics Lecture Notes 2015

Bayesian Statistics Lecture Notes 2015 B.J.K. Kleijn Korteweg-de Vries institute for Mathematics Contents Preface iii 1 Introduction 1 1.1 Frequentist statistics . 1 1.1.1 Data and model . 1 1.1.2 Frequentist estimation . 6 1.2 Bayesian statistics . 9 1.3 The frequentist analysis of Bayesian methods . 12 1.4 Exercises . 13 2 Bayesian basics 15 2.1 Bayes's rule, prior and posterior distributions . 16 2.1.1 Bayes's rule . 16 2.1.2 Bayes's billiard . 20 2.1.3 The Bayesian view of the model . 21 2.1.4 A frequentist's view of the posterior . 23 2.2 Bayesian point estimators . 27 2.2.1 Posterior predictive and posterior mean . 27 2.2.2 Small-ball and formal Bayes estimators . 30 2.2.3 The Maximum-A-Posteriori estimator . 31 2.3 Confidence sets and credible sets . 33 2.3.1 Frequentist confidence intervals . 33 2.3.2 Bayesian credible sets . 36 2.4 Tests and Bayes factors . 37 2.4.1 Neyman-Pearson tests . 37 2.4.2 Randomized tests and the Neyman-Pearson lemma . 40 2.4.3 Test sequences and minimax optimality . 42 2.4.4 Bayes factors . 44 2.5 Decision theory and classification . 47 2.5.1 Decision theory . 47 2.5.2 Bayesian decision theory . 51 i 2.5.3 Frequentist versus Bayesian classification . 53 2.6 Exercises . 56 3 Choice of the prior 61 3.1 Subjective priors . 62 3.1.1 Motivation for the subjectivist approach . 62 3.1.2 Methods for the construction of subjective priors . 63 3.2 Non-informative priors . 66 3.2.1 Uniform priors . 66 3.2.2 Jeffreys prior and reference priors . 68 3.3 Conjugate families . 71 3.3.1 Basic definitions and some examples . 71 3.3.2 Exponential families . 73 3.4 Hierarchical priors . 75 3.4.1 Hyperparameters and hyperpriors . 76 3.4.2 Hierarchical prior construction in an example . 77 3.5 Empirical prior choice . 79 3.5.1 Model selection with empirical methods . 80 3.6 Dirichlet priors . 82 3.6.1 Dirichlet distributions . 82 3.6.2 The Dirichlet process . 85 3.7 Exercises . 90 4 Bayesian asymptotics 93 4.1 Efficiency in smooth parametric models . 94 4.1.1 Optimality in smooth, parametric estimation problems . 94 4.1.2 Regularity and efficiency . 96 4.2 Le Cam's Bernstein-von Mises theorem . 99 4.2.1 Proof of the Bernstein-von Mises theorem . 102 4.2.2 Three subsidiary lemmas . 106 4.3 Exercises . 108 A Measure theory 111 A.1 Sets and sigma-algebras . 111 A.2 Measures . 112 A.3 Measurability, random variables and integration . 115 A.4 Existence of stochastic processes . 118 A.5 Conditional distributions . 119 A.6 Convergence in spaces of probability measures . 121 Bibliography 123 ii Preface These lecture notes were written for the course `Bayesian Statistics', taught at University of Amsterdam in the spring of 2007. The course was aimed at first-year MSc.-students in statistics, mathematics and related fields. The aim was for students to understand the basic properties of Bayesian statistical methods; to be able to apply this knowledge to statistical questions and to know the extent (and limitations) of conclusions based thereon. Considered were the basic properties of the procedure, choice of the prior by objective and subjective crite- ria and Bayesian methods of inference. In addition, some non-parametric Bayesian modelling and basic posterior asymptotic behaviour will received due attention. An attempt has been made to make these lecture notes as self-contained as possible. Nevertheless the reader is expected to have been exposed to some statistics, preferably from a mathematical perspective. It is not assumed that the reader is familiar with asymptotic statistics. Where possible, definitions, lemmas and theorems have been formulated such that they cover parametric and nonparametric models alike. An index, references and an extensive bibliography are included. Since Bayesian statistics is formulated in terms of probability theory, some background in measure theory is prerequisite to understanding these notes in detail. However the reader is not supposed to have all measure-theorical knowledge handy: appendix A provides an overview of relevant measure-theoretic material. In the description and handling of nonparametric statistical models, functional analysis and topology play a role. Of the latter two, however, only the most basic notions are used and all necessary detail in this respect will be provided during the course. For corrections to early versions of these notes the author thanks C. Muris. Bas Kleijn, Amsterdam, January 2007 iii iv Chapter 1 Introduction The goal of statistical inference is to understand, describe and estimate (aspects of) the randomness of measured data. Quite naturally, this invites the assumption that the data represents a sample from an unknown but fixed probability distribution. 1.1 Frequentist statistics Any frequentist inferential procedure relies on three basic ingredients: the data, a model and an estimation procedure. The central assumption in frequentism is that the data has a definite but unknown, underlying distribution to which all inference pertains. 1.1.1 Data and model The data is a measurement or observation which we denote by Y , taking values in a corre- sponding samplespace. Definition 1.1. The samplespace for an observation Y is a measurable space (Y ; B) (see definition A.2) containing all values that Y can take upon measurement. Measurements and data can take any form, ranging from categorical data (sometimes referred to as nominal data where the samplespace is simply a (usually finite) set of points or labels with no further mathematical structure), ordinal data (also known as ranked data, where the samplespace is endowed with an total ordering), to interval data (where in addition to having an ordering, the samplespace allows one to compare differences or distances between points), to ratio data (where we have all the structure of the real line). Moreover Y can collect the results of a number of measurements, so that it takes its values in the form of a vector (think of an experiment involving repeated, stochastically independent measurements of the same quantity, leading to a so-called independent and identically distributed (or i.i.d.) 1 2 Introduction sample). The data Y may even take its values in a space of functions or in other infinite- dimensional spaces, for example, in the statistical study of continuous-time time-series. The samplespace Y is assumed to be a measurable space to enable the consideration of probability measures on Y , formalizing the uncertainty in measurement of Y . As was said in the opening words of this chapter, frequentist statistics hinges on the assumption that there exists a probability measure P0 : B ! [0; 1] on the samplespace Y representing the \true distribution of the data": Y ∼ P0 (1.1) Hence from the frequentist perspective, statistics revolves around the central question: \What does the data make clear about P0?", which may be considered in parts by questions like, \From the data, what can we say about the mean of P0?", \Based on the data that we have, how sharp can we formulate hypotheses concerning the value of the variance of P0?", etcetera. The second ingredient of a statistical procedure is a model, which contains all explanations under consideration of the randomness in Y . Definition 1.2. A (frequentist) statistical model P is a collection of probability measures P : B ! [0; 1] on the samplespace (Y ; B). The distributions Pθ are called model distributions. For every sample space (Y ; B), the collection of all probability distributions is called the full model (sometimes referred to as the full non-parametric model), denoted M (Y ; B). The model P contains the candidate distributions for Y that the statistician finds \rea- sonable" explanations of the uncertainty he observes (or expects to observe) in Y . As such, it constitutes a choice of the statistician analyzing the data rather than a given. From a more mathematical perspective we observe that a model P on (Y ; B) is a subset of the vector space ca(Y ; B) of all signed measures µ : B ! R (that is, all countably additive, real-valued set functions) that are of finite total variation. Equipped with the total-variational norm (see appendix A, definition A.7), µ 7! kµk, ca(Y ; B) is a Banach space, in which the full model can be characterized as follows: M (Y ; B) = P 2 ca(Y ; B): P ≥ 0;P (Y ) = 1 : Often, we describe models as families of probability densities rather than distributions. Definition 1.3. If there exists a σ-finite measure µ : B ! [0; 1] such that for all P 2 P, P µ, we say that the model is dominated. The Radon-Nikodym theorem (see theorem A.26) guarantees that we may represent a dominated probability measure P in terms of a probability density function p = dP/dµ : Y ! R [0; 1) that satisfies A p(y) dµ(y) = P (A) for all A 2 B. For dominated models, it makes sense to adopt a slightly different mathematical perspective: if µ dominates P, we map P to the space of all µ-integrable functions L1(µ) by means of the Radon-Nikodym mapping. Frequentist statistics 3 Example 1.4. Suppose that Y is countable (and let B be the powerset of Y ): then the measure µ that puts mass one at every point in Y , also known as the counting measure on Y , is σ-finite and dominates every other (finite) measure on Y . Consequently, any model on (Y ; B) can be represented in terms of elements p in the Banach space L1(µ), more commonly denoted as, X `1 = f(f1; f2;:::) 2 [0; 1] : jfij < 1g: i≥1 P P where it is noted that pi ≥ 0 and kpk = i jpij = i pi = 1 for all P 2 M (Y ; B).

Bayesian Statistics Lecture Notes 2015

Assessing Fairness with Unlabeled Data and Bayesian Inference

Bayesian Inference Chapter 4: Regression and Hierarchical Models

The Bayesian Lasso

Point Estimation Decision Theory

Posterior Propriety and Admissibility of Hyperpriors in Normal

Empirical Bayes Methods for Combining Likelihoods Bradley EFRON

Statistical Inference a Work in Progress

A New Hyperprior Distribution for Bayesian Regression Model with Application in Genomics

An Economic Theory of Statistical Testing∗

Bayesian Learning

6 X 10.5 Long Title.P65

9 Introduction to Hierarchical Models