Summary Statistics in Rows


Paper PH02
Summary Statistics in Rows
John Henry King, Ouachita Clinical Data Services, Inc., Mount Ida, AR

ABSTRACT
Many if not most summary tables created from clinical trials data use basically the same format: they display a mixture of discrete (categorical) and/or continuous variables in a style that could be referred to as “summary statistics in rows”. Each variable, whether discrete or continuous, forms a group of rows in the table, and the descriptive labels for these variables form the first (left-most) column of the table. Continuous-variable rows are labeled with descriptive statistic names, e.g. N, Mean, Standard Deviation, Median, Minimum and Maximum. It is common for two or more of these statistics to be displayed on the same row, e.g. “Mean (SD)” and “Min, Max”. Discrete-variable rows are formed from the formatted values of the levels of each discrete variable, e.g. the levels of Gender (Male, Female). Usually the treatment variable from the clinical trial forms the column groupings to the right of the row descriptions. The table’s cells contain either count and column percent for discrete variables or the values of summary statistics for continuous variables. This table style is not directly obtainable in SAS®, and many pharmaceutical companies and individuals have devoted great effort to programming elaborate macros to produce it. This paper presents a simple yet dynamic program that uses SAS procedures, the standard SAS metadata objects FORMAT and LABEL, and a few macro variables to produce tables in the style of summary statistics in rows. Summary statistics are displayed with a decimal precision derived from each continuous variable’s format. Each variable’s LABEL is used to describe the variable in the table. The formatted values of the discrete variables are used to label the rows for each discrete variable. Complete rows and/or columns of zero counts are accommodated using SAS procedure options.
This is all accomplished without a macro, using SAS procedures and data steps, yet the program is dynamic with regard to the variables summarized and the treatment details.

INTRODUCTION
The most common clinical trials table looks more or less like the example shown in Figure 1. Fonts and titling may differ, but the table should look very familiar to anyone working with clinical trials data. This table style is used for both safety and efficacy data summaries. It may contain variables of all one type, categorical or continuous, or a mixture. The summary statistics displayed here are easily obtainable with the SAS procedures SUMMARY and FREQ; it is the formatting that makes this table require extra attention, specifically the table cells that contain values from two statistics: N (%), Mean (SD) and Min, Max. Also, the row labels usually consist of data from two variables displayed in one column, which requires some additional processing. The goal of this paper is to describe a program that produces tables of this type. The program is dynamic but does not use a complicated macro to achieve its results; without one, the program can be readily distributed to clients with no need to support, explain or otherwise document a macro. The program uses SAS procedures and simple data steps, whose syntax and other details are fully documented in the SAS user documentation. The program uses a simple user interface of %LET statements.
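The macro variables themselves appear later in the paper; purely as an illustration of the style of such an interface, a set of %LET statements might look like the following. The names and values below are hypothetical, not the paper's actual interface.

```sas
* Hypothetical %LET user interface -- names and values are illustrative only;
%let data    = ADSL;               * subject-level input data set;
%let trtvar  = trt;                * treatment variable defining the columns;
%let varlist = sex race age ageg;  * analysis variables, in display order;
```

An interface of plain %LET statements keeps the program distributable: a client edits a handful of assignments at the top of the program rather than learning a macro's parameter list.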
ABC Pharmaceutical                                                Page 1 of 1
Protocol: ABC2010-103, A Phase III Study

                                Table 14-2.1
              Summary of Demographic Characteristics at Baseline

                                          Treatment
                              Placebo        Active         Total
                              (N=13)         (N=6)          (N=19)

Gender, n(%)
   Female                     7 (53.8)       2 (33.3)       9 (47.4)
   Male                       6 (46.2)       4 (66.7)      10 (52.6)
Ethnic Origin, n(%)
   White                      7 (53.8)       4 (66.7)      11 (57.9)
   Black                      6 (46.2)       1 (16.7)       7 (36.8)
   Hispanic                   0              1 (16.7)       1 (5.3)
   Other                      0              0              0
Age (years)
   N                          13             6              19
   Mean (SD)                  13.2 (1.5)     13.7 (1.6)     13.3 (1.5)
   Median                     13.0           13.5           13.0
   Min, Max                   11, 15         12, 16         11, 16
Age group, n(%)
   10 and Under               0              0              0
   Pre-teen                   5 (38.5)       2 (33.3)       7 (36.8)
   Teen                       8 (61.5)       4 (66.7)      12 (63.2)

C:\Documents and Settings\johking\My Documents\My SAS Files\9.1\TT05.sas    03SEP11:09:45:55

Figure 1 Example of a Summary Statistics in Rows table.

The program uses PROC REPORT as the “print engine” to display a specially structured data set that is printed with very generic PROC REPORT syntax. Therefore most of the program is concerned with building this data set. The PROC CONTENTS output shown in Figure 2 describes the variables created by the program. Variables 1-6 will always be created; variables 7 and higher are the treatment column variables. The number of COL_n variables depends on the number of levels of the treatment variable; in the example data there are three levels of treatment.

#  Variable  Type  Len  Label
1  page      Num     8  Break variable used in PROC REPORT BREAK statement
2  vorder    Num     8  Order variable used to order the analysis variables, defined in PROC REPORT as ORDER NOPRINT
3  vname     Char   32  Analysis variable name
4  vlabel    Char  256  Analysis variable label, defined in PROC REPORT as ORDER NOPRINT, displayed with LINE statement in COMPUTE block
5  roworder  Num     8  Order variable for row labels, defined in PROC REPORT as ORDER NOPRINT
6  rowlabel  Char  256  Detail row label, derived from categories or statistic labels
7  col_1     Char   32  Placebo~(N=13)
8  col_2     Char   32  Active~(N=6)
9  col_3     Char   32  Total~(N=19)

Figure 2 Variable information from PROC CONTENTS.

An example of the data contained in this data set is displayed in Figure 3; variables COL_2 and COL_3 have been omitted here.

page  vorder  vname   vlabel               roworder  rowlabel      col_1
   1       3  Age     Age (years)                 1  N             13
   1       3  Age     Age (years)                 2  Mean (SD)     13.2 (1.5)
   1       3  Age     Age (years)                 3  Median        13.0
   1       3  Age     Age (years)                 4  Min, Max      11, 15
   2       6  Height  Height (inches)             1  N             13
   2       6  Height  Height (inches)             2  Mean (SD)     61.87 (4.81)
   2       6  Height  Height (inches)             3  Median        62.50
   2       6  Height  Height (inches)             4  Min, Max      51.3, 69.0
   2       7  Weight  Weight (lbs.)               1  N             13
   2       7  Weight  Weight (lbs.)               2  Mean (SD)     94.31 (17.68)
   2       7  Weight  Weight (lbs.)               3  Median        98.00
   2       7  Weight  Weight (lbs.)               4  Min, Max      50.5, 112.5
   2       5  bmi     BMI (kg/m**2)               1  N             13
   2       5  bmi     BMI (kg/m**2)               2  Mean (SD)     17.181 (1.917)
   2       5  bmi     BMI (kg/m**2)               3  Median        17.772
   2       5  bmi     BMI (kg/m**2)               4  Min, Max      13.49, 20.25
   1       4  ageg    Age group, n(%)             1  10 and Under  0
   1       4  ageg    Age group, n(%)             2  Pre-teen      5 (38.5)
   1       4  ageg    Age group, n(%)             3  Teen          8 (61.5)
   1       2  race    Ethnic Origin, n(%)         1  White         7 (53.8)
   1       2  race    Ethnic Origin, n(%)         2  Black         6 (46.2)
   1       2  race    Ethnic Origin, n(%)         3  Hispanic      0
   1       2  race    Ethnic Origin, n(%)         4  Other         0
   1       1  Sex     Gender, n(%)                1  Female        7 (53.8)
   1       1  Sex     Gender, n(%)                2  Male          6 (46.2)

Figure 3 Example of the data set produced by the program.

DETAILS

SUBJECT LEVEL DATA
The program accepts a subject-level data set similar to the ADaM data set ADSL. This data set has one observation per subject, a treatment variable used to define the summary table columns, and the analysis variables. The statements below create a data set that will demonstrate the features of the program.
proc format;
   value trt  1='Placebo' 2='Active' 3='Total';
   value $sex 'F'='Female' 'M'='Male';
   value race 1='White' 2='Black' 3='Hispanic' 4='Other';
   value age  low-10='10 and Under' 11-12='Pre-teen' 13-high='Teen';
   run;

These formats are used to label the output. Each categorical variable should be formatted if it is a coded variable (e.g. TRT, RACE and SEX) or if there are levels of the variable that do not appear in the data. The grouping format AGE will be used to create and summarize age groups. We can use SASHELP.CLASS to model ADSL; treatment (TRT) and a few other variables are created to make the output more interesting. Notice that each variable is given a LABEL and a FORMAT; these metadata objects will be used to annotate the output.

data ADSL;
   set sashelp.class;
   trt  = rantbl(12345,.5);
   race = rantbl(12345,.5,.4);
   ageg = age;
   bmi  = (weight*703) / height**2;
   attrib age    format=F3.0  label='Age (years)';
   attrib height format=F7.1  label='Height (inches)';
   attrib weight format=F7.1  label='Weight (lbs.)';
   attrib sex    format=$sex. label='Gender';
   attrib race   format=race. label='Ethnic Origin';
   attrib ageg   format=age.  label='Age group';
   attrib bmi    format=F7.2  label='BMI (kg/m**2)';
   attrib trt    format=trt.  label='Treatment';
   run;

The data created by this data step are shown in Figure 4, with the formatting removed.
Obs  Name     Sex  Age  Height  Weight  trt  race  ageg      bmi
  1  Alfred    M    14    69.0   112.5    1     2    14  16.6115
  2  Alice     F    13    56.5    84.0    2     1    13  18.4986
  3  Barbara   F    13    65.3    98.0    1     2    13  16.1568
  4  Carol     F    14    62.8   102.5    1     2    14  18.2709
  5  Henry     M    14    63.5   102.5    2     2    14  17.8703
  6  James     M    12    57.3    83.0    1     1    12  17.7715
  7  Jane      F    12    59.8    84.5    1     1    12  16.6115
  8  Janet     F    15    62.5   112.5    1     2    15  20.2464
  9  Jeffrey   M    13    62.5    84.0    1     1    13  15.1173
 10  John      M    12    59.0    99.5    1     1    12  20.0944
 11  Joyce     F    11    51.3    50.5    1     2    11  13.4900
 12  Judy      F    14    64.3    90.0    1     2    14  15.3030
 13  Louise    F    12    56.3    77.0    2     1    12  17.0777
 14  Mary      F    15    66.5   112.0    1     1    15  17.8045
 15  Philip    M    16    72.0   150.0    2     1    16  20.3414
 16  Robert    M    12    64.8   128.0    2     1    12  21.4297
 17  Ronald    M    15    67.0   133.0    2     3    15  20.8285
 18  Thomas    M    11    57.5    85.0    1     1    11  18.0733
 19  William   M    15    66.5   112.0    1     1    15  17.8045

Figure 4 Data ADSL, example input data.
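Given a data set with the structure described in Figure 2, the generic PROC REPORT step that prints it can be sketched roughly as below. This is a sketch consistent with the Figure 2 variable definitions, not the paper's verbatim code; the input data set name FINAL and the COMPUTE-block details are assumptions.

```sas
* Sketch of the generic PROC REPORT print-engine step. The data set name
  FINAL and the compute-block details are assumptions, not taken verbatim
  from the paper;
proc report data=final nowd split='~' missing;
   column page vorder vlabel roworder rowlabel col_1 col_2 col_3;
   define page     / order noprint;   * used by the BREAK statement;
   define vorder   / order noprint;   * orders the analysis variables;
   define vlabel   / order noprint;   * variable label, written by LINE;
   define roworder / order noprint;   * orders the detail rows;
   define rowlabel / display ' ';     * detail row labels;
   define col_1    / display 'Placebo~(N=13)';
   define col_2    / display 'Active~(N=6)';
   define col_3    / display 'Total~(N=19)';
   compute before vlabel;             * one label line per analysis variable;
      line @1 vlabel $256.;
   endcomp;
   break after page / page;           * start a new page for each PAGE value;
   run;
```

Because every table-specific decision is carried in the data (row order, labels, formatted cell values) and in the COL_n column headers, this step stays the same no matter which variables are summarized; the SPLIT= character '~' turns the headers stored in the variable labels into two-line column headings.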