Bayesian Analysis: Objectivity, Multiplicity and Discovery

James O. Berger, Duke University and the Statistical and Applied Mathematical Sciences Institute (SAMSI)
PHYSTAT-LHC Workshop, June 27-29, 2007

Relevant SAMSI Programs

• Astrostatistics/Phystat Program: January – July 2006 (Jogesh Babu and Louis Lyons were Program Leaders)
• Multiplicity and Reproducibility in Scientific Studies: July 10-28, 2006
• Future program on Discovery or ?

Outline

• Objective Bayesian Analysis
• Bayesian Hypothesis Testing
• Multiplicity and Discovery

Objective Bayesian Analysis

• In subjective Bayesian analysis, a prior distribution π(θ) for an unknown θ represents personal beliefs or knowledge.
  – But: if a prior distribution reflects the consensus belief of science, should we call that subjective or objective?
• In objective Bayesian analysis, prior distributions represent 'neutral' knowledge. Many types:
  – Maximum entropy
  – Minimum description length (or minimum message length)
  – Jeffreys / reference
  – Uniform in local-location parameterizations
  – Right-Haar and left-Haar (structural)
  – Fiducial distributions and other specific invariance
  – Matching priors
  – Admissible priors

Why Objective Bayes?

• It is the oldest and most used form of Bayesianism (Laplace, ...).
• Some believe it to be the ultimate inferential truth (e.g., Jaynes), or at least the correct answer to precise definitions of 'letting the data speak for itself.'
• Good versions are argued to yield better frequentist answers than asymptotic frequentist methods.
• Numerous difficult problems, such as dealing with multiplicity, become straightforward to handle.
• One is automatically solving the otherwise very difficult problem of proper conditioning, called 'Bayesian credibility' in this community (see Joel's, Eilam's, and Paul's talks).

Artificial example of coverage and 'Bayesian credibility'

Observe X_1 and X_2, where

  X_i = θ + 1 with probability 1/2,
  X_i = θ − 1 with probability 1/2.

Consider the confidence set (a singleton) for θ:

  C(X_1, X_2) = (X_1 + X_2)/2  if X_1 ≠ X_2,
              = X_1 − 1        if X_1 = X_2.

Unconditional coverage: P_θ(C(X_1, X_2) contains θ) = 0.75.

Bayesian credibility: if one uses the objective prior π(θ) = 1, then C(X_1, X_2) has Bayesian credibility of 100% if x_1 ≠ x_2 (the data then determine θ exactly) and 50% if x_1 = x_2 (θ is then either x_1 − 1 or x_1 + 1, each equally likely), which is obviously the right answer in this example.

The point: unconditional frequentism can be practically silly for particular data in the absence of 'Bayesian credibility.'
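A minimal simulation sketch of this two-observation example (not from the slides; the choice θ = 0, the trial count, and the seed are arbitrary) shows the 75% unconditional coverage alongside the conditional coverage, which here coincides with the Bayesian credibility under the flat prior:

```python
# A minimal simulation sketch (not from the slides) of the two-observation
# example above; theta = 0 and the trial count are arbitrary choices.
import random

def simulate(theta=0.0, n_trials=100_000, seed=1):
    random.seed(seed)
    hits = hits_eq = n_eq = hits_ne = n_ne = 0
    for _ in range(n_trials):
        x1 = theta + random.choice([+1, -1])
        x2 = theta + random.choice([+1, -1])
        # The singleton "confidence set" C(X1, X2) from the slide.
        c = 0.5 * (x1 + x2) if x1 != x2 else x1 - 1
        hit = (c == theta)
        hits += hit
        if x1 == x2:
            n_eq += 1
            hits_eq += hit
        else:
            n_ne += 1
            hits_ne += hit
    print(f"unconditional coverage    ~ {hits / n_trials:.3f}")  # about 0.75
    print(f"coverage given x1 == x2   ~ {hits_eq / n_eq:.3f}")   # about 0.50
    print(f"coverage given x1 != x2   ~ {hits_ne / n_ne:.3f}")   # about 1.00

simulate()
```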
Comments on Bayesian Credibility

• Is it the case that coverage errors are more important than credibility errors? (Harrison's point)
• What priors should be chosen for assessing credibility?
  – Ideally, credibility would be measured by a prior that is nearly exactly frequentist matching (although then the 'right answer' is known).
  – Alternatively, one might choose a reasonable class of priors and compute the credibility range over the class (robust Bayesian analysis).

A psychiatry diagnosis example (with Mossman, 2001)

The Psychiatry Problem:
• Within a population, p_0 = Pr(Disease D).
• A diagnostic test results in either a Positive (P) or Negative (N) reading.
• p_1 = Pr(P | patient has D).
• p_2 = Pr(P | patient does not have D).
• It follows from Bayes theorem that

  θ ≡ Pr(D | P) = p_0 p_1 / [p_0 p_1 + (1 − p_0) p_2].

The Statistical Problem: the p_i are unknown. Based on (independent) data X_i ∼ Binomial(n_i, p_i), arising from medical surveys of n_i individuals, find a 100(1 − α)% confidence set for θ.

Suggested Solution: assign each p_i the Jeffreys-rule objective prior

  π(p_i) ∝ p_i^(−1/2) (1 − p_i)^(−1/2)

(not the full reference prior!). By Bayes theorem, the posterior distribution of p_i given the data x_i is

  π(p_i | x_i) = p_i^(x_i − 1/2) (1 − p_i)^(n_i − x_i − 1/2) / ∫₀¹ p_i^(x_i − 1/2) (1 − p_i)^(n_i − x_i − 1/2) dp_i,

which is the Beta(x_i + 1/2, n_i − x_i + 1/2) distribution.

Finally, compute the desired confidence set (formally, the 100(1 − α)% equal-tailed posterior credible set) through Monte Carlo simulation from the posterior distribution by
• drawing random p_i from the Beta(x_i + 1/2, n_i − x_i + 1/2) posterior distributions, i = 0, 1, 2;
• computing the associated θ = p_0 p_1 / [p_0 p_1 + (1 − p_0) p_2];
• repeating this process 10,000 times, yielding θ_1, θ_2, ..., θ_10000;
• using the 100(α/2)% upper and lower percentiles of these generated θ to form the desired confidence limits.
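A sketch of this Monte Carlo construction, assuming standard NumPy (the function name, seed, and defaults are illustrative, not from the slides); run on the first row of Table 1 below, it should approximately reproduce the reported interval (0.107, 0.872):

```python
# A sketch of the Monte Carlo credible-interval construction described above,
# assuming standard NumPy; names and defaults are illustrative.
import numpy as np

def credible_interval(x, n, alpha=0.05, draws=10_000, seed=1):
    """Equal-tailed 100(1 - alpha)% posterior credible interval for theta."""
    rng = np.random.default_rng(seed)
    # Independent Beta(x_i + 1/2, n_i - x_i + 1/2) posteriors for p0, p1, p2.
    p0, p1, p2 = (rng.beta(xi + 0.5, ni - xi + 0.5, size=draws)
                  for xi, ni in zip(x, n))
    theta = p0 * p1 / (p0 * p1 + (1 - p0) * p2)
    return tuple(np.quantile(theta, [alpha / 2, 1 - alpha / 2]))

# First row of Table 1: the slides report the interval (0.107, 0.872).
print(credible_interval(x=(2, 18, 2), n=(20, 20, 20)))
```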
Table 1: The 95% equal-tailed posterior credible interval for θ = p_0 p_1 / [p_0 p_1 + (1 − p_0) p_2], for various values of the n_i and x_i.

  n_0 = n_1 = n_2   (x_0, x_1, x_2)   95% credible interval
  20                (2, 18, 2)        (0.107, 0.872)
  20                (10, 18, 0)       (0.857, 1.000)
  80                (20, 60, 20)      (0.346, 0.658)
  80                (40, 72, 8)       (0.808, 0.952)

Unconditional frequentist performance of the objective Bayes procedure

The goal was to find confidence sets for θ = Pr(D | P) = p_0 p_1 / [p_0 p_1 + (1 − p_0) p_2]. Consider the frequentist percentage of the time that the 95% Bayesian sets miss on the left and on the right (ideal would be 2.5% each) for the indicated parameter values when n_0 = n_1 = n_2 = 20.

  (p_0, p_1, p_2)      O-Bayes      Log Odds     Gart-Nam     Delta
  (1/4, 3/4, 1/4)      2.86, 2.71   1.53, 1.55   2.77, 2.57   2.68, 2.45
  (1/10, 9/10, 1/10)   2.23, 2.47   0.17, 0.03   1.58, 2.14   0.83, 0.41
  (1/2, 9/10, 1/10)    2.81, 2.40   0.04, 4.40   2.40, 2.12   1.25, 1.91

Conclusion: by construction, reasonable 'Bayesian credibility' is guaranteed; the unconditional frequentist performance is clearly fine (and the expected lengths of the Bayesian intervals were the smallest).

What is Frequentism in the Basic HEP Problem?

Model: N_{s+b} ∼ Poisson(N_{s+b} | s + b), where s is the unknown signal mean and b the unknown background mean.

Goal: an upper confidence limit for s.

Proper prior density for b: π(b), arising from either
• Case 1: sideband data N_b ∼ Poisson(N_b | b);
• Case 2: known physical randomness;
• Case 3: agreed scientific beliefs.

Objective prior density for s: π^o(s | b).

Bayesian analysis: construct the upper confidence limit U for s from the posterior distribution

  π(s | N_{s+b}) ∝ ∫ Poisson(N_{s+b} | s + b) π^o(s | b) π(b) db.

Case 1: With the sideband data N_b, π(b) ∝ Poisson(N_b | b) π^o(b), where π^o(b) is an objective prior.

Natural frequentist goal: frequentist coverage with respect to the joint distribution of N_{s+b} and N_b, i.e.

  P(s < U(N_{s+b}, N_b) | s, b) = Σ_{N_{s+b}=0}^∞ Σ_{N_b=0}^∞ 1{s < U(N_{s+b}, N_b)} Poisson(N_{s+b} | s + b) Poisson(N_b | b).

Objective Bayesian solution: find a reference prior π^o(s, b) (see Luc's work) that has excellent frequentist coverage properties (except possibly at the boundaries). If this is too hard, find ad hoc objective priors that work; also hard, alas. But achieving 'Bayesian credibility' is at least as hard, since a frequentist must establish this for all possible data.

Case 2: π(b) describes the physical randomness of the (otherwise unmeasured) background from experiment to experiment.

Natural frequentist goal: frequentist coverage with respect to the marginal density

  f(N_{s+b} | s) = ∫ Poisson(N_{s+b} | s + b) π(b) db,

i.e., coverage as

  P(s < U(N_{s+b}) | s) = Σ_{N_{s+b}=0}^∞ 1{s < U(N_{s+b})} f(N_{s+b} | s).

Frequentist Principle (Neyman): in repeated actual use of a statistical procedure, the average actual error should not be greater than (and should ideally equal) the average reported error.

Objective Bayesian solution: find the reference prior corresponding to f(N_{s+b} | s), which is the Jeffreys prior

  π^J(s) = √I(s),   I(s) = − Σ_{N_{s+b}=0}^∞ f(N_{s+b} | s) (d²/ds²) log f(N_{s+b} | s).

This should have excellent frequentist coverage properties (except possibly at the boundary s = 0).

Case 3: π(b) encodes accepted scientific beliefs.

Natural frequentist goal: unclear! One could
• insist that, for every given s and b, we control

    P(s < U(N_{s+b}) | s, b) = Σ_{N_{s+b}=0}^∞ 1{s < U(N_{s+b})} Poisson(N_{s+b} | s + b)

  (actually not possible here, but let's pretend it is);
• again simply control coverage with respect to f(N_{s+b} | s), i.e.

    P(s < U(N_{s+b}) | s) = Σ_{N_{s+b}=0}^∞ 1{s < U(N_{s+b})} f(N_{s+b} | s) = ∫ P(s < U(N_{s+b}) | s, b) π(b) db.

Objective Bayesian solution: none for the first criterion; indeed, what does a frequentist do with π(b)? For the second criterion: as before.

Bayesian Hypothesis Testing

Key issue 1: Is the hypothesis being tested believable? Example: test of H_0: s = 0, where s is the mean signal. Here s = 0 should be plausible (e.g., no Higgs).

Key issue 2: There is no need to assign prior probabilities to hypotheses; one can give Bayes factors, and sometimes useful bounds on Bayes factors that are completely independent of priors.
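As a concrete illustration of the second point (not taken from these slides): one widely used prior-free calibration, due to Sellke, Bayarri and Berger (2001), bounds the Bayes factor in favour of H_0 by B(p) ≥ −e p ln p for p < 1/e, so that a p-value of 0.05 can correspond to at most roughly 2.5-to-1 evidence against the null. A short sketch:

```python
# Sketch of one prior-free calibration (Sellke, Bayarri and Berger, 2001),
# not reproduced from the slides: for p < 1/e, the Bayes factor in favour
# of H0 satisfies B(p) >= -e * p * ln(p).
import math

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor in favour of H0 implied by a p-value."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    b = bayes_factor_bound(p)
    # Lower bound on Pr(H0 | data) when the prior odds on H0 are 1:1.
    post = b / (1.0 + b)
    print(f"p = {p:<6}  B(H0:H1) >= {b:.3f}  Pr(H0 | data) >= {post:.3f}")
```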