Interval Estimation for a Difference Between Intraclass Kappa Statistics


Biometrics 58, 209-215, March 2002

Allan Donner* and Guangyong Zou
Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario N6A 5C1, Canada
*e-mail: [email protected]

SUMMARY. Model-based inference procedures for the kappa statistic have developed rapidly over the last decade. However, no method has yet been developed for constructing a confidence interval about a difference between independent kappa statistics that is valid in samples of small to moderate size. In this article, we propose and evaluate two such methods based on an idea proposed by Newcombe (1998, Statistics in Medicine, 17, 873-890) for constructing a confidence interval for a difference between independent proportions. The methods are shown to provide very satisfactory results in sample sizes as small as 25 subjects per group. Sample size requirements that achieve a prespecified expected width for a confidence interval about a difference of kappa statistics are also presented.

KEY WORDS: Confidence interval; Interobserver agreement; Intraclass correlation; Reliability studies; Sample size.

1. Introduction
There has recently been increased interest in developing model-based inferences for the kappa statistic, particularly as applied to the assessment of interobserver agreement. Most of this research has focused on problems that arise in a single sample of subjects, each assessed on the presence or absence of a binary trait by two observers (e.g., Donner and Eliasziw, 1992; Hale and Fleiss, 1993; Basu, Banerjee, and Sen, 2000; Klar, Lipsitz, and Ibrahim, 2000; Nam, 2000). Extension of this research to single-sample inference problems involving more than two raters and/or multinomial outcomes has also recently appeared (e.g., Altaye, Donner, and Klar, 2001; Bartfay and Donner, 2001).

Although single-sample problems involving the kappa statistic perhaps predominate, the need to compare two or more kappa statistics also frequently arises in practice. For example, it may be of interest to compare measures of interobserver agreement across different study populations or different patient subgroups in a single study. The latter was an objective of a study reported by Ishmail et al.
(1987) that focused on interobserver agreement with respect to the presence of the third heart sound (S3), recognized as an important sign in the evaluation of patients with congestive heart failure. One question of interest in this study was whether the degree of interobserver agreement varied by the gender of the patient, an issue that arose because it may be more difficult to hear S3 in women compared with men. Other examples of such investigations are given by Barlow, Lai, and Azen (1991), McLellan et al. (1985), and McLaughlin et al. (1987). The comparison of coefficients of interobserver agreement is also the major focus of many measurement studies in educational and psychological research (e.g., Alsawalmeh and Feldt, 1992).

Appropriate testing procedures for this problem have been described by Donner, Eliasziw, and Klar (1996) for the case of independent samples of subjects and Donner et al. (2000) for the case of dependent samples involving the same subjects. Reed (2000) has also addressed this problem, providing executable Fortran code for testing the homogeneity of kappa statistics in the former case. However, to the best of our knowledge, procedures have not yet been developed for constructing an interval estimate about the difference between two kappa statistics. With the increased emphasis on confidence-interval construction as an alternative to significance testing, this would seem to be an important gap in the literature.

As discussed by Nam (2000), there are two models that have tended to be adopted for constructing inferences for the kappa statistic in the case of two raters and a binary outcome. The first model allows the marginal probabilities of a success to differ between the two raters and leads naturally to Cohen's (1960) kappa. The second model assumes the same marginal probability of success for each rater and leads naturally to the intraclass kappa statistic, identically equal to Scott's (1955) $\pi$. Landis and Koch (1977) and Bloch and Kraemer (1989) have presented arguments supporting the use of this model when the main emphasis is directed at the reliability of the measurement process rather than at potential differences among raters. Zwick (1988) also discusses this issue, pointing out that, if marginal differences are small, the value of Cohen's kappa will be close to that of Scott's $\pi$ and the choice between them will not be important. Furthermore, as shown by Blackman and Koval (2000), the two estimators are asymptotically equivalent.

In this article, we develop and evaluate a method for constructing a confidence interval for a difference between two intraclass kappa statistics computed from independent samples of subjects. The procedures compared are developed in Section 2, followed in Section 3 by a simulation study evaluating their properties. Section 4 provides guidelines for sample size estimation, while two examples illustrating the procedures are presented in Section 5. The article concludes with overall recommendations for practice and with advice concerning available software.

2. Development of Procedures
Following Donner et al. (1996), we let $X_{ijh}$ denote the rating for the $i$th subject by the $j$th rater in sample $h$ ($h = 1, 2$). Letting $\pi_h = \Pr(X_{ijh} = 1)$ be the probability of a success, the probabilities of joint responses within sample $h$ are given by
$$P_{1h}(\kappa_h) = \Pr(X_{i1h} = 1, X_{i2h} = 1) = \pi_h^2 + \pi_h(1 - \pi_h)\kappa_h,$$
$$P_{2h}(\kappa_h) = \Pr(X_{i1h} = 0, X_{i2h} = 1 \text{ or } X_{i1h} = 1, X_{i2h} = 0) = 2\pi_h(1 - \pi_h)(1 - \kappa_h),$$
$$P_{3h}(\kappa_h) = \Pr(X_{i1h} = 0, X_{i2h} = 0) = (1 - \pi_h)^2 + \pi_h(1 - \pi_h)\kappa_h.$$
This model, previously discussed by several authors, including Mak (1988) and Bloch and Kraemer (1989), has been referred to as the common correlation model because it assumes that the correlation between any pair $(X_{i1h}, X_{i2h})$ has the same value $\kappa_h$.

We now assume that the observed frequencies $n_{1h}, n_{2h}, n_{3h}$ corresponding to $P_{1h}(\kappa_h), P_{2h}(\kappa_h), P_{3h}(\kappa_h)$ follow a multinomial distribution conditional on $N_h = \sum_{k=1}^{3} n_{kh}$, the total number of subjects in sample $h$. Letting $\hat{\pi}_h = (2n_{1h} + n_{2h})/(2N_h)$, Bloch and Kraemer (1989) show that, for a single sample of subjects, the maximum likelihood estimator of $\kappa_h$ under the common correlation model is given by $\hat{\kappa}_h = 1 - n_{2h}/\{2N_h\hat{\pi}_h(1 - \hat{\pi}_h)\}$, with large-sample variance obtained as
$$\mathrm{var}(\hat{\kappa}_h) = \frac{1 - \kappa_h}{N_h}\left[(1 - \kappa_h)(1 - 2\kappa_h) + \frac{\kappa_h(2 - \kappa_h)}{2\pi_h(1 - \pi_h)}\right].$$
It is easily verified that $\hat{\kappa}_h$ is simply the standard intraclass correlation coefficient (e.g., Snedecor and Cochran, 1989) as applied to dichotomous outcome data.

We can estimate $\mathrm{var}(\hat{\kappa}_h)$ by substituting $\hat{\pi}_h$ and $\hat{\kappa}_h$ for $\pi_h$ and $\kappa_h$, respectively. A large-sample $100(1 - \alpha)\%$ confidence interval for $\kappa_1 - \kappa_2$ is then given by $(\hat{\kappa}_1 - \hat{\kappa}_2) \pm z_{\alpha/2}\{\widehat{\mathrm{var}}(\hat{\kappa}_1) + \widehat{\mathrm{var}}(\hat{\kappa}_2)\}^{1/2}$, where $\widehat{\mathrm{var}}(\hat{\kappa}_h)$ is the sample estimate of $\mathrm{var}(\hat{\kappa}_h)$ and $z_{\alpha/2}$ denotes the $100(1 - \alpha/2)$ percentile point of the standard normal distribution. We refer to this method as the simple asymptotic (SA) method.
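As a concrete illustration of the quantities just defined, the following Python sketch computes $\hat{\pi}_h$, $\hat{\kappa}_h$, the estimated variance, and the SA interval for $\kappa_1 - \kappa_2$. This is a minimal sketch rather than code from the paper; the function names (intraclass_kappa, sa_interval) and the example frequencies are invented for illustration.

```python
import math


def intraclass_kappa(n1, n2, n3):
    """MLE of the intraclass kappa under the common correlation model.

    n1, n2, n3 are the numbers of (1,1), discordant, and (0,0) rating pairs.
    Returns (pi_hat, kappa_hat, var_hat), with the variance evaluated at the
    estimates, as in the SA method.
    """
    N = n1 + n2 + n3
    pi = (2 * n1 + n2) / (2 * N)               # estimated success probability
    if pi <= 0.0 or pi >= 1.0:
        raise ValueError("kappa undefined when pi_hat(1 - pi_hat) = 0")
    kappa = 1 - n2 / (2 * N * pi * (1 - pi))   # Bloch-Kraemer MLE
    var = (1 - kappa) * ((1 - kappa) * (1 - 2 * kappa)
                         + kappa * (2 - kappa) / (2 * pi * (1 - pi))) / N
    return pi, kappa, var


def sa_interval(sample1, sample2, z=1.96):
    """Simple asymptotic 95% CI for kappa_1 - kappa_2 (independent samples)."""
    _, k1, v1 = intraclass_kappa(*sample1)
    _, k2, v2 = intraclass_kappa(*sample2)
    d = k1 - k2
    half = z * math.sqrt(v1 + v2)
    return d - half, d + half


# Example with made-up frequencies (n_1h, n_2h, n_3h) for two independent groups
print(sa_interval((40, 15, 45), (30, 25, 45)))
```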
A potential difficulty with the SA method is that the sampling distribution of $\hat{\kappa}_h$ may be far from normal when $\kappa_h$ is close to one and the nuisance parameter $\pi_h$ is close to zero or one. The two hybrid methods considered here adapt the idea of Newcombe (1998) by combining separate confidence limits computed for each $\kappa_h$. Let $(l_h, u_h)$ denote lower and upper $100(1 - \alpha)\%$ limits for $\kappa_h$. Then the hybrid confidence interval for $\Delta = \kappa_1 - \kappa_2$ is given by $(L, U)$, where
$$L = (\hat{\kappa}_1 - \hat{\kappa}_2) - z_{\alpha/2}\left\{\mathrm{var}(\hat{\kappa}_1)\big|_{\kappa_1 = l_1} + \mathrm{var}(\hat{\kappa}_2)\big|_{\kappa_2 = u_2}\right\}^{1/2}$$
and
$$U = (\hat{\kappa}_1 - \hat{\kappa}_2) + z_{\alpha/2}\left\{\mathrm{var}(\hat{\kappa}_1)\big|_{\kappa_1 = u_1} + \mathrm{var}(\hat{\kappa}_2)\big|_{\kappa_2 = l_2}\right\}^{1/2}.$$
Further details concerning the theoretical basis for this procedure are given in the Appendix.

One approach to obtaining the limits $(l_h, u_h)$, $h = 1, 2$, was proposed by Donner and Eliasziw (1992), who used a goodness-of-fit approach to construct a confidence interval for $\kappa_h$. This approach, algebraically equivalent to an approach independently proposed by Hale and Fleiss (1993), assumes only that the observed frequencies $n_{kh}$ ($k = 1, 2, 3$) follow a multinomial distribution with corresponding probability $P_{kh}$ conditional on $N_h$. If estimated probabilities $\hat{P}_{kh}(\kappa_h)$ are obtained by replacing $\pi_h$ with $\hat{\pi}_h$, it then follows that
$$\chi_G^2 = \sum_{k=1}^{3} \frac{\{n_{kh} - N_h\hat{P}_{kh}(\kappa_h)\}^2}{N_h\hat{P}_{kh}(\kappa_h)}$$
has a limiting chi-square distribution with 1 d.f. One can then obtain the confidence limits $(l_h, u_h)$ for $\kappa_h$ by finding the two admissible roots of the cubic equation $\chi_G^2 = \chi_{1, 1-\alpha}^2$. We refer to this method as the hybrid goodness-of-fit (HGOF) approach.

An alternative method of obtaining $(l_h, u_h)$ is to invert a modified Wald test (Rao and Mukerjee, 1997), which may be constructed by replacing $\mathrm{var}(\hat{\kappa}_h)$ with $\mathrm{var}_P(\hat{\kappa}_h)$, obtained by substituting $\hat{\pi}_h$ for $\pi_h$. The confidence limits for $\kappa_h$ are then given by the two admissible roots of the cubic equation $(\hat{\kappa}_h - \kappa_h)^2/\mathrm{var}_P(\kappa_h) = \chi_{1, 1-\alpha}^2$. A similar approach was applied by Lee and Tu (1994), who refer to it as the profile variance approach, to the problem of constructing confidence limits about Cohen's kappa. We refer to it here as the hybrid profile variance (HPR) approach. Note that the HGOF and HPR methods differ only in how the confidence limits $(l_h, u_h)$, $h = 1, 2$, are computed.

3. Evaluation
The finite sample properties of the three methods presented in Section 2 are, for the most part, intractable. We therefore compared the performance of the SA, HGOF, and HPR methods using Monte Carlo simulation. Simulation runs having $\hat{\pi}_h(1 - \hat{\pi}_h) = 0$ (for which $\hat{\kappa}_h$ is undefined) were discarded until 10,000 runs were obtained. The expected proportion of such runs is given by $\{\pi^2 + \pi(1 - \pi)\kappa\}^N + \{(1 - \pi)^2 + \pi(1 - \pi)\kappa\}^N$, well under 5% for all parameter combinations considered in the simulation.
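The hybrid procedures can be sketched numerically as well. In the following sketch, the single-sample limits $(l_h, u_h)$ are found by bisection applied to either the goodness-of-fit statistic (HGOF) or the profile-variance statistic (HPR), in place of solving the cubic equations directly, and are then combined through the $(L, U)$ formulas above. The helper names, the bisection routine, and the expression used for the lower end of the admissible range of $\kappa_h$ are assumptions of this illustration, not details taken from the paper.

```python
import math

CHI2_1_95 = 3.841458820694124  # 95th percentile of the chi-square(1) distribution


def var_kappa(kappa, pi, N):
    """Large-sample variance of kappa-hat, evaluated at the given kappa and pi."""
    return (1 - kappa) * ((1 - kappa) * (1 - 2 * kappa)
                          + kappa * (2 - kappa) / (2 * pi * (1 - pi))) / N


def _probs(kappa, pi):
    """Cell probabilities P1, P2, P3 under the common correlation model."""
    return (pi ** 2 + pi * (1 - pi) * kappa,
            2 * pi * (1 - pi) * (1 - kappa),
            (1 - pi) ** 2 + pi * (1 - pi) * kappa)


def _bisect(f, lo, hi, iters=100):
    """Locate a root of f on [lo, hi] by bisection, assuming a sign change."""
    flo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) * flo > 0:
            lo, flo = mid, f(mid)
        else:
            hi = mid
    return (lo + hi) / 2


def single_sample_limits(n1, n2, n3, method="HGOF"):
    """95% limits (l, u) for a single intraclass kappa by HGOF or HPR."""
    N = n1 + n2 + n3
    pi = (2 * n1 + n2) / (2 * N)
    k_hat = 1 - n2 / (2 * N * pi * (1 - pi))
    k_min = -min(pi, 1 - pi) / max(pi, 1 - pi)  # keeps all cell probabilities >= 0

    if method == "HGOF":
        def stat(kappa):  # goodness-of-fit statistic with pi fixed at pi-hat
            return sum((o - N * p) ** 2 / (N * p)
                       for o, p in zip((n1, n2, n3), _probs(kappa, pi)))
    else:  # "HPR": profile-variance (modified Wald) statistic
        def stat(kappa):
            return (k_hat - kappa) ** 2 / var_kappa(kappa, pi, N)

    def g(kappa):
        return stat(kappa) - CHI2_1_95

    lower = _bisect(g, k_min + 1e-8, k_hat)   # admissible root below kappa-hat
    upper = _bisect(g, k_hat, 1 - 1e-8)       # admissible root above kappa-hat
    return k_hat, pi, N, lower, upper


def hybrid_interval(sample1, sample2, method="HGOF", z=1.96):
    """Hybrid 95% CI for kappa_1 - kappa_2 built from single-sample limits."""
    k1, pi1, N1, l1, u1 = single_sample_limits(*sample1, method=method)
    k2, pi2, N2, l2, u2 = single_sample_limits(*sample2, method=method)
    d = k1 - k2
    L = d - z * math.sqrt(var_kappa(l1, pi1, N1) + var_kappa(u2, pi2, N2))
    U = d + z * math.sqrt(var_kappa(u1, pi1, N1) + var_kappa(l2, pi2, N2))
    return L, U


print(hybrid_interval((40, 15, 45), (30, 25, 45), method="HGOF"))
print(hybrid_interval((40, 15, 45), (30, 25, 45), method="HPR"))
```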
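Finally, a small simulation in the spirit of the evaluation described above can be assembled from these pieces. The sketch below reuses sa_interval, hybrid_interval, and _probs from the preceding blocks; the chosen parameter values, the number of runs, and the additional rule of skipping tables with $n_{2h} = 0$ (where the bisection sketch degenerates) are assumptions of this illustration rather than settings used in the paper.

```python
import random


def simulate_coverage(pi1, kappa1, N1, pi2, kappa2, N2, runs=500, seed=1):
    """Empirical coverage of the SA, HGOF, and HPR intervals for kappa_1 - kappa_2."""
    rng = random.Random(seed)
    true_diff = kappa1 - kappa2
    covered = {"SA": 0, "HGOF": 0, "HPR": 0}
    kept = 0
    while kept < runs:
        s1 = _draw(rng, pi1, kappa1, N1)
        s2 = _draw(rng, pi2, kappa2, N2)
        if _degenerate(s1) or _degenerate(s2):
            continue  # mimic discarding runs where kappa-hat is undefined
        kept += 1
        lo, hi = sa_interval(s1, s2)
        covered["SA"] += lo <= true_diff <= hi
        for m in ("HGOF", "HPR"):
            lo, hi = hybrid_interval(s1, s2, method=m)
            covered[m] += lo <= true_diff <= hi
    return {m: c / runs for m, c in covered.items()}


def _draw(rng, pi, kappa, N):
    """Draw one (n1, n2, n3) table of N subjects from the common correlation model."""
    p1, p2, _ = _probs(kappa, pi)
    counts = [0, 0, 0]
    for _ in range(N):
        u = rng.random()
        counts[0 if u < p1 else (1 if u < p1 + p2 else 2)] += 1
    return tuple(counts)


def _degenerate(sample):
    n1, n2, n3 = sample
    pi = (2 * n1 + n2) / (2 * (n1 + n2 + n3))
    # The paper discards runs with pi_hat(1 - pi_hat) = 0; this sketch also skips
    # n2 = 0 (kappa-hat = 1), where the bisection above has no bracket.
    return pi <= 0.0 or pi >= 1.0 or n2 == 0


def expected_discard(pi, kappa, N):
    """Expected proportion of discarded runs, P1^N + P3^N, as given above."""
    p1, _, p3 = _probs(kappa, pi)
    return p1 ** N + p3 ** N


print(expected_discard(0.9, 0.8, 25))
print(simulate_coverage(0.7, 0.8, 50, 0.7, 0.6, 50, runs=500))
```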