Center: Finding the Median; Spread: Home on the Range


Center: Finding the Median
• When we think of a typical value, we usually look for the center of the distribution.
• For a unimodal, symmetric distribution, it’s easy to find the center: it’s just the center of symmetry.
• We could average the minimum and maximum data values (called the midrange) as a measure of center, but the midrange is very sensitive to skewed distributions and outliers.

Copyright © 2004 Pearson Education, Inc. Slide 5-1

Center: Finding the Median (cont.)
• A more reasonable choice for center than the midrange is the value with exactly half the data values below it and half above it. This particular value is called the median.
• The median is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
• The median has the same units as the data.

Median
The sample median is the $\frac{n+1}{2}$-th largest observation.
If $\frac{n+1}{2}$ is not a whole number, the median is the average of the two observations on either side.

Spread: Home on the Range
• When describing a distribution numerically, we always report a measure of its spread along with its center.
• The range of the data is the difference between the maximum and minimum values: Range = max - min.
• A disadvantage of the range is that a single extreme value can make it very large and, thus, not representative of the data overall.

The Interquartile Range
• The interquartile range (IQR) allows us to ignore extreme data values and concentrate on the middle of the data.
• To find the IQR, we first need to know what quartiles are…
• The difference between the quartiles is the IQR, so IQR = upper quartile - lower quartile.

Quartiles
Quartiles split the data into quarters
• Lower quartile (Q1) divides bottom half of data into two
  – median of observations below the median
• Upper quartile (Q3) divides upper half of data into two
  – median of observations above the median

The Interquartile Range (cont.)
• The lower and upper quartiles are the 25th and 75th percentiles of the data, so…
• The IQR contains the middle 50% of the values of the distribution, as shown in Figure 5.3 from the text:
[Figure 5.3 omitted]

The Five-Number Summary
• Five number summary
  { Min, Q1, Median, Q3, Max }
• Example:

Boxplots
• A boxplot is a graphical display of the five-number summary. The steps involved in constructing a boxplot can also be found on pages 60-61 of the text.
• Boxplots are particularly useful when comparing groups.

Boxplot
[Figure 2.4.4 Construction of a box plot. Labels: Data, Scale, Q1, Med, Q3; on each side of the box, a whisker of at most 1.5 IQR (pull back until hit observation). From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.]

Construction of Boxplot
Data: breaking strength of wire in kilograms
220 214 222 218 223 210 223 210 227 225 212

Leaf Unit = 1.0 kg
  4   21  0024
  5   21  8
 (4)  22  0233
  2   22  57

• Find Median
• Find Quartiles Q1 =      Q3 =
• Calculate Interquartile range Q3 - Q1 =
• Calculate whisker length 1.5 x (Q3 - Q1) =

Comparing Groups With Boxplots
• The following set of boxplots compares the effectiveness of various coffee containers:
[Figure omitted]
• What does this graphical display tell you?

Summarizing Symmetric Distributions
• Medians do a good job of identifying the center of skewed distributions. When we have symmetric data, the mean is a good measure of center.
• We find the mean by adding up all of the data values and dividing by n, the number of data values we have.

Sample Mean – average
• The sample mean is denoted by $\bar{x}$
$$\text{The sample mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$
[Figure 2.4.1 Mechanical construction representing a dot plot: (a) shows a balanced rod while (b) and (c) show unbalanced rods.]

Mean or Median?
• Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.
• In symmetric distributions, the mean and median are approximately the same in value, so either measure of center may be used.
• For skewed data, though, it’s better to report the median than the mean as a measure of center.

Relationship between mean and median
[Figure 2.4.2 The mean and the median. (a) Data symmetric about P, with Med = $\bar{x}$. (b) Two largest points moved to the right, with Med and $\bar{x}$ marked. Grey disks in (b) are the "ghosts" of the points that were moved. From Chance Encounters by C.J. Wild and G.A.F. Seber, © John Wiley & Sons, 2000.]

What About Spread?
• A more powerful measure of spread than the IQR is the standard deviation, which takes into account how far each data value is from the mean.
• A deviation is the distance that a data value is from the mean. Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.

Variance
• The sample variance, denoted by $s^2$, is found using the formula
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

Sample Standard Deviation
$$s_x = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
• In same units as data
  – So preferable to sample variance
• Equals zero only if all observations identical
• Sensitive to outliers (extreme observations)
• Button on calculator – learn to use it!
  – Much simpler than applying formula

Shape, Center, and Spread
• When telling about a quantitative variable, always report the shape of its distribution, along with a center and a spread.
• If the shape is skewed, report the median and IQR.
• If the shape is symmetric, report the mean and standard deviation and possibly the median and IQR as well.

What About Outliers?
• If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing.
• Note: The median and IQR are not likely to be affected by the outliers.

What Can Go Wrong?
• Do a reality check: don’t let technology do your thinking for you.
• Don’t forget to sort the values before finding the median or percentiles.
• Don’t compute numerical summaries of a categorical variable.
• Watch out for multiple modes: multiple modes might indicate multiple groups in your data.

What Can Go Wrong? (cont.)
• Be aware of slightly different methods: different statistics packages and calculators may give you different answers for the same data.
• Beware of outliers.
• Make a picture (make a picture, make a picture).
• Be careful when comparing groups that have very different spreads.

So What Do We Know?
• We describe distributions in terms of shape, center, and spread.
• For symmetric distributions, it’s safe to use the mean and standard deviation; for skewed distributions, it’s better to use the median and interquartile range.
• Always make a picture: don’t make judgments about which measures of center and spread to use by just looking at the data.
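Worked example (a minimal sketch, not part of the original slides): the Python below applies the slides' definitions to the wire breaking-strength data from the "Construction of Boxplot" slide. The function names `five_number_summary` and `whisker_ends` are hypothetical, chosen for illustration. The quartile rule assumed here is the one stated on the Quartiles slide (median of the observations below and above the median, leaving the median itself out when n is odd); as the slides warn, other packages may use slightly different rules and return different quartiles.

```python
import statistics

# Breaking strength of wire in kilograms (data from the slides).
wire = [220, 214, 222, 218, 223, 210, 223, 210, 227, 225, 212]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for at least two values.

    Assumed rule: Q1 and Q3 are the medians of the observations below
    and above the median position; the median itself is left out of
    both halves when n is odd.
    """
    data = sorted(values)              # sort before finding the median
    n = len(data)
    half = n // 2
    lower = data[:half]                # bottom half of the data
    upper = data[half + 1:] if n % 2 else data[half:]  # top half
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


def whisker_ends(values):
    """Reach 1.5 IQR beyond each quartile, then pull back to an observation."""
    _, q1, _, q3, _ = five_number_summary(values)
    reach = 1.5 * (q3 - q1)
    low = min(x for x in values if x >= q1 - reach)
    high = max(x for x in values if x <= q3 + reach)
    return low, high


summary = five_number_summary(wire)
iqr = summary[3] - summary[1]          # IQR = upper quartile - lower quartile
whisker_length = 1.5 * iqr

mean = statistics.mean(wire)           # sum of observations / number of observations
variance = statistics.variance(wire)   # sample variance, divides by n - 1
sd = statistics.stdev(wire)            # square root of the sample variance

print("five-number summary:", summary)
print("IQR:", iqr, "whisker length:", whisker_length)
print("whisker ends:", whisker_ends(wire))
print("mean:", mean, "variance:", variance, "sd:", sd)
```

Under the assumed quartile rule, the wire data give the summary (210, 212, 220, 223, 227), an IQR of 11, and a whisker length of 16.5. Every observation lies within that reach, so the whiskers stop at the minimum and maximum and no point would be flagged as an outlier.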