Numerically Summarizing Data

3 Numerically Summarizing Data Outline Making an Informed Decision 3.1 Measures of Central Tendency Suppose that you are in the market for a used 3.2 Measures of car. To make an informed decision regarding Dispersion your purchase, you decide to collect as much information as possible. What information is 3.3 Measures of Central important in helping you make this decision? See Tendency and the Decisions project on page 178. Dispersion from Grouped Data 3.4 Measures of Position and Outliers 3.5 The Five-Number Summary and Boxplots PUTTING IT TOGETHER When we look at a distribution of data, we should consider three characteristics of the distribution: shape, center, and spread. In the last chapter, we discussed methods for organizing raw data into tables and graphs. These graphs (such as the histogram) allow us to identify the shape of the distribution: symmetric (in particular, bell shaped or uniform), skewed right, or skewed left. The center and spread are numerical summaries of the data. The center of a data set is commonly called the average. There are many ways to describe the average value of a distribution. In addition, there are many ways to measure the spread of a distribution. The most appropriate measure of center and spread depends on the distribution’s shape. Once these three characteristics of the distribution are known, we can analyze the data for interesting features, including unusual data values, called outliers. 117 M03B_SULL3539_05_SE_C03.1.indd 117 11/02/15 4:01 PM 118 CHApTER 3 Numerically Summarizing Data 3.1 Measures of Central Tendency Preparing for This Section Before getting started, review the following: • Population versus sample (Section 1.1, p. 5) • Qualitative data (Section 1.1, p. 8) • Parameter versus statistic (Section 1.1, p. 5) • Simple random sampling (Section 1.3, pp. 21–25) • Quantitative data (Section 1.1, p. 8) Objectives ➊ Determine the arithmetic mean of a variable from raw data ➋ Determine the median of a variable from raw data ➌ Explain what it means for a statistic to be resistant ➍ Determine the mode of a variable from raw data A measure of central tendency numerically describes the average or typical data value. We hear the word average in the news all the time: • The average miles per gallon of gasoline of the 2015 Chevrolet Corvette Z06 in highway driving is 29. • According to the U.S. Census Bureau, the national average commute time to work in 2013 was 25.4 minutes. • According to the U.S. Census Bureau, the average household income in 2013 was $51,939. • The average American woman is 5′4″ tall and weighs 142 pounds. CAUTION! Whenever you hear the word In this chapter, we discuss the three most widely-used measures of central tendency: average, be aware that the word may not always be referring to the the mean, the median, and the mode. In the media (newspapers, blogs, and so on), mean. One average could be used to average usually refers to the mean. But beware: some reporters use average to refer to support one position, while another the median or mode. As we shall see, these three measures of central tendency can give average could be used to support a different position. very different results! ➊ Determine the Arithmetic Mean of a Variable from Raw Data In everyday language, the word average often represents the arithmetic mean. To compute the arithmetic mean of a set of data, the data must be quantitative. Definitions The arithmetic mean of a variable is computed by adding all the values of the variable in the data set and dividing by the number of observations. The population arithmetic mean, m (pronounced “mew”), is computed using all the individuals in a population. The population mean is a parameter. The sample arithmetic mean, x (pronounced “x-bar”), is computed using sample data. The sample mean is a statistic. While other types of means exist (see Problems 39 and 40), the arithmetic mean is generally referred to as the mean. We will follow this practice for the remainder of the text. We usually use Greek letters to represent parameters and Roman letters (such as x or s) to represent statistics. The formulas for computing population and sample means follow: If x1, x2, c, xN are the N observations of a variable from a population, then the In Other Words population mean, m, is To find the mean of a set of data, add up all the observations x1 + x2 + g + xN a xi m = = (1) and divide by the number of N N observations. M03B_SULL3539_05_SE_C03.1.indd 118 11/02/15 4:01 PM SEcTION 3.1 Measures of Central Tendency 119 If x1, x2, c, xn are n observations of a variable from a sample, then the sample mean, x, is x1 + x2 + g + xn a xi x = = (2) n n Note that N represents the size of the population, and n represents the size of the sample. The symbol Σ (the Greek letter capital sigma) tells us to add the terms. The subscript i shows that the various values are distinct and does not serve as a mathematical operation. For example, x1 is the first data value, x2 is the second, and so on. EXAmpLe 1 Computing a Population Mean and a Sample Mean Table 1 Problem The data in Table 1 represent the first exam score of 10 students enrolled in Introductory Statistics. Treat the 10 students as a population. Student Score (a) Compute the population mean. 1. Michelle 82 (b) Find a simple random sample of size n = 4 students. 2. Ryanne 77 (c) Compute the sample mean of the sample found in part (b). 3. Bilal 90 4. Pam 71 Approach 5. Jennifer 62 (a) To compute the population mean, add all the data values (test scores) and divide by the number of individuals in the population. 6. Dave 68 (b) Recall from Section 1.3 that we can use Table I in Appendix A, a calculator with 7. Joel 74 a random-number generator, or computer software to obtain simple random 8. Sam 84 samples. We will use a TI-84 Plus C graphing calculator. 9. Justine 94 (c) Find the sample mean by adding the data values corresponding to the individuals 10. Juan 88 in the sample and then dividing by n = 4, the sample size. Solution (a) Compute the population mean by adding the scores of all 10 students: a xi = x1 + x2 + x3 + g + x10 = 82 + 77 + 90 + 71 + 62 + 68 + 74 + 84 + 94 + 88 = 790 Divide this result by 10, the number of students in the class. a xi 790 m = = = 79 N 10 Although it was not necessary in this problem, we will agree to round the mean to one more decimal place than that in the raw data. Figure 1 (b) To find a simple random sample of size n = 4 from a population of size N = 10, we will use the TI-84 Plus C random-number generator with a seed of 14. (Recall that this gives the calculator its starting point to generate the list of random numbers.) Figure 1 shows the students in the sample: Jennifer (62), Juan (88), Ryanne (77), and Dave (68). (c) Compute the sample mean by adding the scores of the four students: a xi = x1 + x2 + x3 + x4 = 62 + 88 + 77 + 68 = 295 Divide this result by 4, the number of individuals in the sample. a xi 295 x = = = 73.8 Round to the nearest tenth. • Now Work Problem 21 n 4 • M03B_SULL3539_05_SE_C03.1.indd 119 11/02/15 4:01 PM 120 CHApTER 3 Numerically Summarizing Data It helps to think of the mean of a data set as the center of gravity. In other words, the mean is the value such that a histogram of the data is perfectly balanced, with equal weight on each side of the mean. Figure 2 shows a histogram of the data in Table 1 with the mean labeled. The histogram balances at m = 79. Figure 2 Scores on First Exam 3 y 2 requenc F 1 60 65 70 75 80 85 90 95 m ϭ 79 Score ➋ Determine the Median of a Variable from Raw Data A second measure of central tendency is the median. To compute the median of a set of data, the data must be quantitative. Definition The median of a variable is the value that lies in the middle of the data when arranged in ascending order. We use M to represent the median. In Other Words To help remember the idea behind the median, think of the Steps in Finding the Median of a Data Set median of a highway; it divides the highway in half. So the Step 1 Arrange the data in ascending order. median divides the data in half, Step 2 Determine the number of observations, n. with at most half the data below the median and at most half Step 3 Determine the observation in the middle of the data set. above it. • If the number of observations is odd, then the median is the data value exactly in the middle of the data set. That is, the median is the observation that lies in n + 1 the position. 2 • If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the n n observations that lie in the position and the + 1 position. 2 2 EXAmpLe 2 Determining the Median of a Data Set with an Odd Number of Observations Problem The data in Table 2 represent the length (in seconds) of a random sample of songs released in the 1970s.

Numerically Summarizing Data

A Review on Outlier/Anomaly Detection in Time Series Data

Mean, Median and Mode We Use Statistics Such As the Mean, Median and Mode to Obtain Information About a Population from Our Sample Set of Observed Values

Accelerating Raw Data Analysis with the ACCORDA Software and Hardware Architecture

Weighted Means and Grouped Data on the Graphing Calculator

Lecture 3: Measure of Central Tendency

Random Variables and Applications

Big Data and Open Data As Sustainability Tools a Working Paper Prepared by the Economic Commission for Latin America and the Caribbean

Mistaking the Forest for the Trees: the Mistreatment of Group-Level Treatments in the Study of American Politics

Principal Component Analysis: Application to Statistical Process Control

Statistics, Measures of Central Tendency I

Lesson 16.5 Measures of Central Tendency and Grouped Data

Have You Seen the Standard Deviation?