Statistical Testing of RNGs

For a sequence of numbers to be considered a sequence of randomly acquired numbers, it must have two basic statistical properties:

• Uniformly distributed values
• Independence from previous values drawn

A sequence fails when patterning is present in the sequence with some degree of statistical certainty. This may manifest itself by:

• Clustering or bias at some point (i.e., statistically, the sequence is not uniformly distributed)
  − This includes sequences of tuples drawn from the sequence (if truly random, there should be n-space uniformity)
• Standard statistical measures that are too high or too low
  − Mean value
  − Standard deviation
• Statistically significant patterns (indicating dependencies)
  − Alternating patterns
  − Distribution patterns (correlation)
  − Other patterning

It is always possible that some underlying patterns in a sequence will go undetected.

NIST and RNGs

In recognition of the importance of random number generators, particularly for areas such as cryptography, NIST (http://csrc.nist.gov/rng/) has established guidelines for random number generation and testing with three primary goals:

1. to develop a battery of statistical tests to detect non-randomness in binary sequences constructed using random number generators and pseudo-random number generators utilized in cryptographic applications
2. to produce documentation and software implementations of these tests
3. to provide guidance in the use and application of these tests

Typical Tests for Randomness as Reported in 1979
(Bright, H. and R. Enison, "Quasi-Random Number Sequences from a Long-Period TLP Generator with Remarks on Application to Cryptography," Computing Surveys 11(4), 1979)

1. Equidistribution or Frequency Test. Count the number of times a member of the sequence falls into a given interval. The number should be approximately the same for intervals of the same length if the sequence is uniform (a small code sketch of this test appears after the list).

2. Serial Test. This is a higher-dimensional version of the equidistribution test. Count the number of times a k-tuple of members of the sequence falls into a given k-dimensional cell. If this test is passed, the sequence is said to be k-distributed. Other tests of k-distributivity are:
   1. Poker test: Consider groups of k members of the sequence and count the number of distinct values represented in each group (e.g., a hand of 5 cards falls into one of a number of categories, such as one pair, two pairs, three of a kind, etc.).
   2. Maximum (minimum) of k: Plot the distribution of the maximum (minimum) of a k-tuple.
   3. Sum of k: Plot the distribution of the sum of a k-tuple.

3. Gap Test. Plot the distribution of gaps of various lengths in the sequence (typically, a gap is the distance between an item and its next recurrence).

4. Runs Test. Plot the distribution of the runs up (monotonically increasing) and runs down (monotonically decreasing), or the distribution of the runs above the mean and those below the mean, or the distribution of the run lengths.

5. Coupon Collector's Test. Choose a suitably small integer d and divide the universe into d intervals. Each member of the sequence then falls into one such interval. Plot the distribution of the lengths of the runs required to have all d intervals represented. The idea is that you are trying to collect all of the "coupons".

6. Permutation Test. Study the order relations between the members of the sequence in groups of k. Each of the k! possible orderings should occur about equally often. If the universe is large, the probability of equality is small; otherwise, equal members may be disregarded.

7. Serial Correlation Test. Compute the correlation coefficient between consecutive members of the sequence. This gives the serial correlation for lag 1. Similarly, one may get the serial correlation for lag k by computing the correlation coefficient between xi and xi+k. This is intended to show that the members of the sequence are independent. Recall that the correlation coefficient references the method of least squares for fitting a line to a set of points: the closer the correlation coefficient is to ±1, the more confidence one has in using the regression line as a predictor. Hence, an RNG should exhibit small serial correlation values.

Generally, the RNG is assumed to be producing probabilities, i.e., numbers in the interval [0,1].
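To make the frequency test (item 1 above) concrete, here is a minimal Python sketch; it is not part of the original notes, and the bin count k = 10, the sample size, and the use of Python's built-in Mersenne Twister are illustrative assumptions only.

```python
import random

def chi_square_frequency_test(seq, k=10):
    """Equidistribution (frequency) test: bin values from [0, 1) into k
    equal-width intervals and compute the chi-square statistic against the
    expected uniform count of n/k per interval."""
    n = len(seq)
    counts = [0] * k
    for x in seq:
        counts[int(x * k)] += 1          # interval index for x in [0, 1)
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

# Illustration with Python's built-in Mersenne Twister generator.
random.seed(42)
sample = [random.random() for _ in range(10_000)]
stat = chi_square_frequency_test(sample, k=10)
print(f"chi-square statistic (9 degrees of freedom): {stat:.2f}")
# The 95% critical value for 9 degrees of freedom is about 16.92;
# a statistic far above that would suggest non-uniformity.
```

A full test would convert the statistic to a p-value using the chi-square distribution with k − 1 degrees of freedom; the hard-coded critical value in the comment is only for orientation.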
Knuth, Seminumerical Algorithms (Addison-Wesley, 1997, 3rd edition) is an excellent source for a discussion of testing random number generators.

An often-cited battery of tests for random number generators is Marsaglia's 1995 Diehard collection (http://stat.fsu.edu/pub/diehard/). Both source and object files are provided, but at least some of the source is "an f2c conversion of patched and jumbled Fortran code," which may limit its usefulness.

Background for Hypothesis Testing

Suppose we have n observations X1, X2, …, Xn (assumed to come from independent, identically distributed random variables) with unknown population mean µ and variance σ². The sample mean

    X̄(n) = (1/n) ∑_{i=1}^{n} Xi

provides an (unbiased) estimator for µ, meaning that if we perform a large number of independent trials, each producing a value for X̄(n), the average of the X̄(n)'s will be a close approximation to µ. Likewise, the sample variance

    S²(n) = (1/(n − 1)) ∑_{i=1}^{n} [Xi − X̄(n)]²

is an estimator for σ². (The population variance is calculated by dividing by n; division by n − 1 is used here because it can be proved that the sample variance is then an unbiased estimator for σ² when dealing with independent, identically distributed random variables.)

Basically, two random variables are independent if there is no relationship between them; i.e., the values assumed by one random variable tell us nothing about how the values of the other are distributed, and vice versa.

Of course, using X̄(n) to estimate µ has little value without additional information assessing how closely X̄(n) approximates µ. For the mean, it is easy to show that

• E(cX) = cE(X)
• E(X1 + X2) = E(X1) + E(X2)

For the variance,

• Var(cX) = c²Var(X), but
• Var(X1 + X2) = Var(X1) + Var(X2) only if the Xi's are independent

Evidently, we must examine Var(X̄(n)) to see how good an approximation to µ we can expect. Consider that

    Var(X̄(n)) = Var( (1/n) ∑_{i=1}^{n} Xi ) = (1/n²) Var( ∑_{i=1}^{n} Xi ) = (1/n²) ∑_{i=1}^{n} Var(Xi)

if the Xi's are independent. If so, then Var(X̄(n)) = σ²/n. In other words, when the Xi's are independent, it is quite clear that as n grows the variance of X̄(n) shrinks, and the sample mean provides a good estimate for µ.

Does this matter? Law and Kelton report that simulation output data are almost always correlated; i.e., the set of means acquired from multiple simulation runs lacks independence. In this case, using the sample variance as an estimator may significantly underestimate the population variance, in turn leading to erroneous estimates of µ.
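The estimators above translate directly into code. The following minimal Python sketch (not from the original notes) computes the sample mean, the sample variance with the n − 1 divisor, and the lag-k serial correlation used in the serial correlation test; the seed and sample size are arbitrary choices.

```python
import random

def sample_mean(xs):
    """X-bar(n): unbiased estimator of the population mean."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """S^2(n): note the division by n - 1, not n."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def serial_correlation(xs, lag=1):
    """Sample serial correlation between x_i and x_{i+lag};
    a good RNG should give values close to 0."""
    m = sample_mean(xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

random.seed(1)
u = [random.random() for _ in range(10_000)]
print(sample_mean(u))             # near 0.5 for U(0, 1)
print(sample_variance(u))         # near 1/12 ~ 0.0833
print(serial_correlation(u, 1))   # near 0 at lag 1
```

Applied to a genuinely correlated sequence, serial_correlation would return a noticeably nonzero value at small lags, which is the kind of situation Law and Kelton describe for simulation output.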
Covariance and Correlation as Measures of Independence Between Two Random Variables

For two random variables X1 and X2, the covariance is given by

    Cov(X1, X2) = E[(X1 − µ1)(X2 − µ2)] = E(X1X2) − µ1µ2

since E[(X1 − µ1)(X2 − µ2)] = E(X1X2) − µ1E(X2) − µ2E(X1) + µ1µ2 = E(X1X2) − µ1µ2, using E(X1) = µ1 and E(X2) = µ2.

If X1 and X2 are independent, then E(X1X2) = E(X1)E(X2) (this can be determined from the fact that for independent events x and y, P(xy) = P(x)P(y)). Hence, when X1 and X2 are independent, Cov(X1, X2) = 0.

The relationship σ² = E[(X − µ)²] = E(X²) − µ² is not hard to show (we did it earlier), so in particular, Cov(X1, X1) = E(X1²) − µ1² = σ1² (a boundary case, since nothing could be less independent than a random variable compared with itself).

When Cov(X1, X2) = 0, the two random variables are said to be uncorrelated. We noted that if X1 and X2 are independent, then Cov(X1, X2) = 0, which indicates there is some relationship between covariance and independence, but the converse fails: zero covariance does not imply independence (so, for those interested in statistical theory, the conditions under which uncorrelatedness does imply independence provide a natural object of study; a small numerical illustration appears at the end of this section).

When Cov(X1, X2) > 0, the two random variables are said to be positively correlated (the counterpart being negatively correlated). Intuitively, positive correlation typically occurs when X1 > µ1 and X2 > µ2 together, or X1 < µ1 and X2 < µ2 together (with the obvious counterpart for negative correlation). In particular, when negative correlation is present, you expect that when one of them is on the high side, the other will be on the low side.

To normalize the statistic, the correlation ρ12 (also known as the correlation coefficient) is defined by

    ρ12 = Cov(X1, X2) / (σ1σ2)

We now have −1 ≤ ρ12 ≤ 1. This follows from the Schwarz inequality, [E(X1X2)]² ≤ E(X1²)E(X2²), applied to the centered variables X1 − µ1 and X2 − µ2, which gives Cov(X1, X2)² ≤ σ1²σ2². The closer ρ is to 1 or −1, the stronger the dependence relationship between X1 and X2.

Simulation models typically employ random variables for input, so the model outputs are also random. A stochastic process is a collection of random variables ordered over time, so in general the collection of random variables X1, X2, …, Xn representing one of the outputs of a simulation model over multiple runs is a stochastic process. Reaching conclusions about the model from the implicit stochastic process may require making assumptions about the nature of that process that may not be true, but are necessary for the application of statistical analysis.
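To make concrete the point that uncorrelated does not imply independent, here is a small Python sketch; it is not part of the original notes, and the choice of X uniform on [−1, 1] with Y = X² is a standard textbook illustration rather than anything drawn from the source.

```python
import random

def covariance(xs, ys):
    """Sample analogue of Cov(X1, X2) = E[(X1 - mu1)(X2 - mu2)]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample correlation coefficient rho12 = Cov(X1, X2) / (sigma1 * sigma2),
    which always lies in [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

random.seed(7)
x = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
y = [v * v for v in x]       # y is completely determined by x ...
print(correlation(x, y))     # ... yet the sample correlation is near 0:
                             # uncorrelated does not imply independent.
```

Even though Y here is a deterministic function of X, the symmetry of X about zero drives Cov(X, Y) toward zero, so the printed correlation is essentially 0 despite complete dependence.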