Chapter 1: Exploring Data

Introduction: The goal of this course is to learn various tools for using data to gain understanding and make sound decisions. Statistical tools and ideas can help you examine data in order to describe their main features. This examination is called ______.

There are two basic strategies that help us organize our exploration of a set of data:
• Begin with a ______. Then add numerical summaries of specific aspects of the data.
• Begin by examining ______ by itself. Then move on to study relationships among the variables.

In Chapters 1 and 2 we will examine single-variable data, and in Chapters 3 and 4 we will look at the relationships among variables. In both settings, we will begin with graphs and then move on to numerical summaries.

Definition: ______ are the objects described by a set of data. ______ may be people, but they also may be animals or things.

A ______ is any characteristic of an individual. A ______ can take different values for different individuals.

SIDE NOTE: When you plan a statistical study or explore data from someone else’s work, ask yourself the following key questions:

1. Who are the individuals described by the data? How many individuals are there?
2. What are the variables? In what ______ is each variable recorded? Weights, for example, might be recorded in pounds or kilograms.
3. Why were the data gathered? Do we hope to ______? Do we want to draw conclusions about individuals other than the ones we actually have data for?
4. When, where, how, and by whom were the data produced? Where did the data come from? Are these available data or ______? Are the data from an experiment or an ______? Can we trust the data?

Definitions: A ______ variable places an individual into one of several groups or categories. A ______ variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

Try it! – what type of variable?
• Age (years)
• Car Manufacturer (GM, Ford, etc.)
• Starting Salary (annual in $1000s)
• Calcium level (micrograms per milliliter)
• Current Smoker (Yes, No)

Section 1.1: Displaying Distributions with Graphs

Definition: The distribution of a

We begin in Chapter 1 by examining each variable by itself, later we will look at relationships between variables. We start with graphs, then turn to numerical summaries of variables.

Graphs for Categorical Variables

Example 1.1 P39 – Radio station formats

Format                    Count of stations   Percent of stations
Adult contemporary                    1,556                  11.2
Adult standards                       1,196                   8.6
Contemporary hit                        569                   4.1
Country                               2,066                  14.9
News/Talk/Information                 2,179                  15.7
Oldies                                1,060                   7.7
Religious                             2,014                  14.6
Rock                                    869                   6.3
Spanish language                        750                   5.4
Other formats                         1,579                  11.4
Total

SIDE NOTE: The counts should add to 13,838, the total number of stations. The percents should add to 100%. Sometimes you have ______. ______ don’t point to mistakes in our work, just to the effect of rounding off results.

How would you go about summarizing the data? The distribution of a categorical variable lists the categories and the count, proportion, or percent of the items that fall into each category. A graphical display of this distribution can be a pie chart or a bar graph. ______ are hard to draw by hand. Use a ______ only when you want to emphasize each category’s relation to the whole. ______ are easier to make and also easier to read. Both graphs can display the distribution of a categorical variable, but a bar graph can also compare any set of quantities that are measured in the same units.
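The arithmetic behind the side note can be checked with a short script. This is only a sketch in Python (the course itself uses the TI-83/84 and SPSS); the counts are copied from the Example 1.1 table, and the variable names are made up for illustration.

```python
# Station counts copied from the Example 1.1 table.
counts = {
    "Adult contemporary": 1556,
    "Adult standards": 1196,
    "Contemporary hit": 569,
    "Country": 2066,
    "News/Talk/Information": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish language": 750,
    "Other formats": 1579,
}

total = sum(counts.values())

# Percent of stations in each format, rounded to one decimal place
# the way the table reports them.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

# Sum of the *rounded* percents; compare it with 100.
percent_sum = round(sum(percents.values()), 1)

for fmt, pct in percents.items():
    print(f"{fmt:<24}{counts[fmt]:>7,}{pct:>7.1f}")
print(f"{'Total':<24}{total:>7,}{percent_sum:>7.1f}")
```

Running it lets you compare the total count with 13,838 and see how close the rounded percents come to 100.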


NOTE: Always label and scale axes and title your graphs.

There is one question that you should always ask when you look at data: do the data tell you what you want to know?

Let’s say that you plan to buy radio time to advertise your Web site for downloading MP3 music files. How helpful are the data in Example 1.1? (very useful / not very useful) You are interested, not in counting stations, but in counting listeners. In fact, you aren’t even interested in the entire radio audience, because MP3 users are mostly ______. You really want to know what kinds of radio stations reach the largest number of ______.

Displaying Distributions – Quantitative Variables

Stemplots (or stem-and-leaf plots)

A stemplot gives a quick picture of the shape of a distribution. It works best for a small number of observations that are all greater than 0.

Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.

Example 1.4 P42 – Literacy in Islamic nations
Constructing and interpreting a stemplot.

Table 1.1 - Literacy rates in Islamic nations

Country         Female percent   Male percent
Algeria                     60             78
Bangladesh                  31             50
Egypt                       46             68
Iran                        71             85
Jordan                      86             96
Kazakhstan                  99            100
Lebanon                     82             95
Libya                       71             92
Malaysia                    85             92
Morocco                     38             68
Saudi Arabia                70             84
Syria                       63             89
Tajikistan                  99            100
Tunisia                     63             83
Turkey                      78             94
Uzbekistan                  99            100
Yemen                       29             70

To make a stemplot of the percents of females who are literate, use the first digits as stems and the second digits as leaves. KEY:

The overall pattern of the stemplot is irregular (skewed to the left). There do appear to be two ______ of countries. The plot suggests that we may want to investigate the variation in literacy. For example, why do the three central Asian countries (Kazakhstan, Tajikistan, and Uzbekistan) have very high literacy rates?
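The stem-and-leaf rule described above (stem = all but the final digit, leaf = the final digit) can be sketched in a few lines of Python. The function name `stemplot` is made up for this illustration, and the sketch assumes whole-number observations such as the female literacy percents in Table 1.1.

```python
from collections import defaultdict

# Female literacy percents from Table 1.1, in table order.
female_literacy = [60, 31, 46, 71, 86, 99, 82, 71, 85,
                   38, 70, 63, 99, 63, 78, 99, 29]


def stemplot(values):
    """Return {stem: sorted list of leaves} for whole-number data."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf is the final digit
        stems[stem].append(leaf)
    # Keep stems with no leaves so gaps in the data stay visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


plot = stemplot(female_literacy)
for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Printing every stem, including empty ones, is a deliberate choice here: skipping an empty stem would hide a gap and distort the shape of the distribution.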

SIDE NOTE: When discussing quantitative data, you need to comment about the Shape, its Center and Spread, as well as any Outliers. Look for patterns in the data and for deviations from those patterns.

When you wish to compare two related distributions, a ______ with common stems is useful.

TRY IT! Comparing Female and Male Literacy Rates

Final thoughts on stemplots: Stemplots do not work well for large data sets where each stem must hold a large number of leaves. You can double the number of stems in a plot by ______ into two stems, one with leaves 0 to 4 and the other with leaves 5 to 9. When the observed values have many digits, it is often best to round the numbers to just a few digits before making a stemplot. Remember that the purpose of a stemplot is to display the shape of a distribution.

HISTOGRAMS

Stemplots display the actual values of the observations, while histograms do not. A histogram breaks the ______ of values of a variable into ______ and displays only the ______ or ______ of the observations that fall into each class. You can choose any convenient number of classes, but you should

Histogram Tips
• Be sure to choose classes that are all the ______.
• There is no one right choice of the number of classes. ______ classes is a good minimum. Use your judgment in choosing classes to display the shape.
• Statistical software and graphing calculators will choose the classes for you. The default choice is often a good one, but you can change it if you want.
• Use histograms of percents for comparing several distributions with different numbers of observations.

Exercise 1.11 P57 - Presidential ages at inauguration

President         Age    President        Age    President          Age
Washington         57    Buchanan          65    Hoover              54
J. Adams           61    Lincoln           52    F. D. Roosevelt     51
Jefferson          57    A. Johnson        56    Truman              60
Madison            57    Grant             46    Eisenhower          61
Monroe             58    Hayes             54    Kennedy             43
J. Q. Adams        57    Garfield          49    L. B. Johnson       55
Jackson            61    Arthur            51    Nixon               56
Van Buren          54    Cleveland         47    Ford                61
W. H. Harrison     68    B. Harrison       55    Carter              52
Tyler              51    Cleveland         55    Reagan              69
Polk               49    McKinley          54    G. Bush             64
Taylor             64    T. Roosevelt      42    Clinton             46
Fillmore           50    Taft              51    G. W. Bush          54
Pierce             48    Wilson            56
                         Harding           55
                         Coolidge          51

How would you go about summarizing the inauguration ages in this data set?

• We might first find the smallest and largest values:

• Could list all of the values from _____ to _____ and then count how many of each of these values occurs but

• So instead we will take this overall range and break it up into intervals (of equal width)

What is reasonable here?

Class Frequency Table

Class     Frequency (Count)     Relative Frequency     Percent

SIDE NOTE: A large number of observations on a single variable can be summarized in a table of ______(count) or ______(fractions or percents).
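To check a hand-built frequency table, the tallying can be sketched in Python. The class width of 5 and starting point of 40 are assumptions chosen for illustration (other reasonable choices exist), and `frequency_table` is a hypothetical helper name, not something from the text. The ages are copied from the Exercise 1.11 table in chronological order.

```python
# Ages at inauguration from Exercise 1.11, Washington through G. W. Bush.
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
        65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56, 55, 51,
        54, 51, 60, 61, 43, 55, 56, 61, 52, 69, 64, 46, 54]


def frequency_table(values, start, width):
    """Tally whole-number values into equal-width classes.

    Returns rows of (class label, count, relative frequency).
    Assumes every value is >= start.
    """
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    rows = []
    for i, count in enumerate(counts):
        low = start + i * width
        rows.append((f"{low}-{low + width - 1}", count, count / len(values)))
    return rows


table = frequency_table(ages, start=40, width=5)
print("Class   Count  Rel.freq  Percent")
for label, count, rel in table:
    print(f"{label:<7}{count:>6}{rel:>10.3f}{100 * rel:>8.1f}%")
```

Changing `start` and `width` shows how the choice of classes changes the apparent shape, which is the judgment call the histogram tips describe.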

Let’s draw a histogram of the ages of presidents at inauguration.

SIDE NOTE: A ______shows the distribution of counts or percents among the values of a ______variable. A ______displays the distribution of a ______variable. Draw bar graphs with spaces between bars to separate the categories. Draw histograms with no space, to indicate that all values of the variable are covered.

Examining Distributions

In any graph of data, look for the ______ and for ______ from the pattern. You can describe the overall pattern by its ______. An important kind of deviation is an ______, an individual value that falls outside the overall pattern. In Section 1.2 we will look at numerical methods for finding outliers, center and spread. For now, we will describe the ______ by its midpoint and the ______ by giving the smallest and largest values.

Here are some things to look for when describing shape:

• Does the distribution have one or several major peaks, called ______? A distribution with one major peak is called unimodal.
• Is the distribution symmetric or is it skewed in one direction? A distribution is symmetric if the values smaller and larger than its midpoint are ______ images of each other. It is ______ to the right if the right tail (______ values) is much longer than the left tail (______ values).

Read through Example 1.7 page 53. Notes:

For exercise 1.11:
• Describe the shape, center and spread of the distribution.

• Who was the youngest president? Who was the oldest?

• Was Bill Clinton, at age 46, unusually young?

Now let’s try this on the TI-83/84. See Tech Toolbox P59, if necessary.

Dealing with Outliers

With small data sets, outliers generally will stand apart from the overall pattern of the histogram or stemplot. Look for points that are clearly apart from the body of the data, not just the ______ observations in a distribution. You should search for an explanation for any outlier. Sometimes outliers ______ made in recording the data. In other cases, they could be caused by equipment failure or other unusual circumstances.

Read through Example 1.8 page 54. Notes:

Relative Frequency and Cumulative Frequency

A histogram displays the distribution of values of a quantitative variable. But it does not tell us about the relative standing of an individual observation. For that we would need to construct a relative cumulative frequency graph, often called an ______. Let’s try this with exercise 1.11.

Class Frequency Table

Class     Frequency (Count)     Relative Frequency     Cumulative Frequency     Relative Cumulative Frequency
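The cumulative columns are running totals of the frequency column, which the following Python sketch illustrates. The class counts used here are an assumption: they come from tallying the Exercise 1.11 ages into classes of width 5 starting at 40, so a different choice of classes would give different rows.

```python
from itertools import accumulate

# Assumed classes (width 5, starting at 40) and their counts for the
# 43 inauguration ages in Exercise 1.11.
class_labels = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
class_counts = [2, 6, 13, 12, 7, 3]

n = sum(class_counts)

# Cumulative frequency: how many observations fall in this class or below.
cumulative = list(accumulate(class_counts))

# Relative cumulative frequency: the same running total as a fraction of n.
rel_cumulative = [c / n for c in cumulative]

print("Class   Count  Cum.  Rel.cum.")
for label, count, cum, rel in zip(class_labels, class_counts,
                                  cumulative, rel_cumulative):
    print(f"{label:<7}{count:>6}{cum:>6}{rel:>10.3f}")
```

Reading down the last column answers relative-standing questions, for example what fraction of presidents were inaugurated at an age in or below a given class.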

Time Plots

A lot of the data that we see summarized in the newspaper are time series data: measurements taken at regular intervals over time, such as unemployment data, stock data, and crime rates. When data are observations collected over time, it is wise to plot them against time (or order). Always put ______ on the ______ scale and the ______ on the ______ scale. Connect the points with line segments to help see the pattern over time (if there is one). Just as with most graphs, we first look for overall patterns and then any deviations from that pattern. Examples: ______ (pattern that repeats at known regular intervals); ______ (persistent long-term rise or fall).

Exercise 1.16 – Life expectancy

Here are the numbers for women.

a. Construct a time plot for these data.

Year    Life Expectancy (female)
1900    48.3
1910    51.8
1920    54.6
1930    61.6
1940    65.2
1950    71.1
1960    73.1
1970    74.7
1980    77.5
1990    78.8
2000    79.5

b. Describe what you see about the life expectancy of females over the last hundred years.
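Alongside the time plot, the decade-to-decade changes give a quick numerical check on the pattern. The sketch below, in Python, uses the Exercise 1.16 values; it does not draw the plot itself, and the variable names are made up for illustration.

```python
# Female life expectancy from Exercise 1.16, one value per decade.
years = list(range(1900, 2001, 10))
life_exp = [48.3, 51.8, 54.6, 61.6, 65.2, 71.1, 73.1, 74.7, 77.5, 78.8, 79.5]

# Change from each decade to the next, rounded to one decimal place
# to match the precision of the data.
changes = [round(later - earlier, 1)
           for earlier, later in zip(life_exp, life_exp[1:])]

for year, change in zip(years[1:], changes):
    print(f"{year - 10}-{year}: {change:+.1f} years")
```

Looking at the signs and sizes of these changes is one way to support what you describe in part b.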

We have been doing Exploratory Data Analysis: using graphs and numerical summaries to describe the variables in a data set. We will look at more numerical summaries (mean, median, standard deviation, etc.) in the next section, 1.2.

Recall: We discussed how to make and interpret two plots for quantitative variables. A ______ can be used to display the distribution of ______. We can talk about the “shape” of the distribution, the approximate ______. When the quantitative data are observations collected over time, it is wise to plot them against time (or order). In a ______ (or a series or sequence plot), time goes on the horizontal axis and the response or variable you are measuring goes on the vertical axis (different from that for a histogram!). Connect the points with line segments to help see the pattern over time (if there is one).

One more Time Plot: What if? In a production process, a crucial measurement for a particular part is its length, which has a target value of 10 cm (with some allowance around this value). To monitor the production process, you sample a part each hour and record its length on a time series chart. If the chart looked like this – what would you think?

Section 1.2: Describing Distributions with Numbers

In this section we continue with how to describe a distribution. In 1.1 we focused on making a picture of the distribution. In this section we will focus on ______summaries of the center and the spread of the distribution (appropriate for ______data only!). A brief description of a distribution should include its shape and numbers describing its center and spread.

Note: describe the shape of the distribution based upon the histogram or stemplot.

Measuring Center

Two basic “averages” or measures of center:
• Mean x̄ – the average value

• Median – the middle value

Example: Golf scores of 12 members of a women’s golf team in tournament play.

89 90 87 95 86 81 102 97 83 88 91 79

What should we do first? Graph it!

1. Describe the shape of this distribution.

2. Compute the mean

3. Compute the median

4. What if the worst player’s score was incorrectly entered as 201 instead of 102?

Note: the mean is ______to extreme observations. The median is ______to extreme observations. Most graphical displays should have detected such an outlying value.
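To see the effect of the data-entry error in question 4, both measures of center can be recomputed with and without the mistake. This is a sketch using Python's standard `statistics` module on the golf scores above; the name `typo_scores` is made up for illustration.

```python
from statistics import mean, median

# Golf scores of the 12 team members.
scores = [89, 90, 87, 95, 86, 81, 102, 97, 83, 88, 91, 79]

# Same data with the worst score mistakenly entered as 201.
typo_scores = [201 if s == 102 else s for s in scores]

for label, data in [("correct", scores), ("with typo", typo_scores)]:
    print(f"{label:<10} mean = {mean(data):.2f}   median = {median(data):.1f}")
```

Comparing the two printed lines shows which measure of center moves when a single extreme value is introduced and which one stays put.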

Some Pictures: Mean versus Median

Principles Regarding Averages

Find the mean for these data: 1, 1, 1, 1, 1, 1, 1, 1, 1, 11. Does this average represent the typical value for this data set?

Principle:

For the data set above, what proportion of the values are less than or equal to the mean?

Principle:

The mean age for 10 adults in a room is 35 years. A 32-year-old adult enters the room. Can you find the new mean age for the 11 adults?

The median age for 10 adults in a room is 35 years. A 32-year-old adult enters the room. Can you find the new median age for the 11 adults?

Principle:

The mean life length of male pipe smokers is 78 years. The mean life length of males is 74 years. Does smoking a pipe help you live longer? Should these two averages be compared?

Principle:

The mean wage of males is approximately $10/hour overall. The mean wage for females is approximately $6/hour overall. Is this evidence of wage discrimination? Further study of the data revealed that females earn more than males in each job category. How can this be?

Principle:

Measuring Spread or Variation

Midterms are returned and the “average” was reported as 76 out of 100. You received a score of 88. How should you feel?

Often what is missing when the “average” of something is reported, is a corresponding measure of spread or variability. Here we discuss various measures of variation, each useful in some situations, each with some limitations.

Range = Maximum – Minimum

Percentiles: The pth percentile is the value such that p% of the observations fall at or below that value.

Some Common percentiles:

Median

First quartile

Third quartile

Note:

Five Number Summary: Minimum, Q1, Median, Q3, Maximum

IQR -
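A five-number summary can be checked with a short Python sketch. One assumption matters here: the quartiles are computed as the medians of the lower and upper halves of the ordered data (leaving the overall median out of both halves when the number of observations is odd). Calculators and software packages use several slightly different quartile rules, so their answers may differ a little. The function name is made up for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves of the
    ordered data. Needs at least two observations.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # observations to the left of the median position
    upper = data[-half:]   # observations to the right of the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


golf = [89, 90, 87, 95, 86, 81, 102, 97, 83, 88, 91, 79]
minimum, q1, med, q3, maximum = five_number_summary(golf)
print("Five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", q3 - q1)
```

The same function can be reused on any small data set to compare with a hand calculation.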

Try it! Golf Score Data

Ordered data: 79 81 83 86 87 88 89 90 91 95 97 102

Find the five-number summary:

Example: Test Scores The five-number summary for the distribution of test scores for a very large math class is provided below.

34 46 58 78 95

1. Suppose you scored a 46 on the test. What can you say about the percentage of students who scored higher than you?

2. Suppose you scored 47 on the test. What can you say about the percentage of students who scored higher than you?

3. If the top 25% of the students received an A on the test, what was the minimum score needed to get an A on the test?

Boxplots

A boxplot is a graphical representation of the five-number summary. Steps:
• Make a box with ends at the quartiles Q1 and Q3.
• Draw a line in the box at the median.
• Check for possible outliers using the 1.5*IQR rule and, if there are any, plot them individually.
• Extend lines from the ends of the box to the smallest and largest observations that are not possible outliers.

Note: Possible outliers are observations that are more than 1.5*IQR outside the quartiles. That is, observations that are below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
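The 1.5*IQR rule in the note translates directly into code. The Python sketch below assumes the same quartile convention as a hand calculation with medians of the lower and upper halves; `outlier_fences` and `possible_outliers` are hypothetical helper names, and the second data set is the variation of the golf scores in which the worst score is 107.

```python
from statistics import median


def outlier_fences(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) using medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def possible_outliers(values):
    """List the observations that fall outside the fences."""
    low, high = outlier_fences(values)
    return [v for v in values if v < low or v > high]


golf = [79, 81, 83, 86, 87, 88, 89, 90, 91, 95, 97, 102]
golf_107 = [79, 81, 83, 86, 87, 88, 89, 90, 91, 95, 97, 107]

print("Fences:", outlier_fences(golf))
print("Possible outliers (max 102):", possible_outliers(golf))
print("Possible outliers (max 107):", possible_outliers(golf_107))
```

In a boxplot, anything this check returns is plotted as an individual point, and the whiskers stop at the most extreme observations that remain inside the fences.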

Try it! Golf Score Data Recall the five-number summary is:

IQR =

1.5*IQR =

Q1 - 1.5*IQR =

Q3 + 1.5*IQR =

Sketch the boxplot:

Suppose the worst golf score of 102 was really 107.

Ordered data: 79 81 83 86 87 88 89 90 91 95 97 107

Then the five number summary would be:

The IQR and 1.5*IQR would be the same, so the “boundaries” for checking for possible outliers are again ______and ______. Now we would have one potential high outlier, the maximum value of 107.

Sketch the boxplot if we have this one outlier.

Here are the boxplots for the golf score data generated by SPSS.

Now, let’s try it on the TI-83.

Note on Boxplots:
• Side-by-side boxplots are good for
• Watch out – points plotted individually are
• Can’t comment on
• When reading values from a graph

Standard Deviation (and Variance)

When the mean is used to measure center, the most common measure of spread is the ______. The standard deviation is a measure of the spread of the ______. We will refer to it as a kind of “average distance” of the observations from the mean. But it actually is the root mean square deviation – the square root of the average of the squared deviations of the observations from the mean. Since that is a bit cumbersome, we like to think of the standard deviation as “roughly, the ______ of the observations from the ______.”

Here is the formula: s^2 = variance =

s = standard deviation =

Try it! Golf Score Data: 89 90 87 95 86 81 102 97 83 88 91 79

The mean was computed earlier to be 89. Find the standard deviation for these data.

s =

It is not much fun to do this by hand. That is OK for a small number of observations, but in general we would have a calculator or computer do it for us. Interpretation: These golf scores are roughly ______ away from their mean score of ______, on average.

The mean square deviation, that is, the standard deviation without taking the square root, is called the ______. In other words, the ______ of a set of observations is the average of the squares of the deviations of the observations from ______. We emphasize the standard deviation since it is in the ______.
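The hand calculation can be checked step by step in Python. One assumption to flag: the "average" of the squared deviations is taken here by dividing by n - 1 (the usual sample definition) rather than by n, which is what calculators report as the sample standard deviation. The function names are made up for illustration.

```python
from math import sqrt
from statistics import stdev


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))


golf = [89, 90, 87, 95, 86, 81, 102, 97, 83, 88, 91, 79]

print(f"variance s^2 = {sample_variance(golf):.2f}")
print(f"standard deviation s = {sample_sd(golf):.2f}")
# Cross-check against the standard library's sample standard deviation.
print(f"statistics.stdev = {stdev(golf):.2f}")
```

Note the units: the variance is in squared strokes, while the standard deviation is back in strokes, the same units as the scores.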

Properties of the Standard Deviation
• s measures . . .
• s should be used only . . .
• So use the mean and standard deviation for ______. The five-number summary is better for ______.
• s = 0 means . . .
• Otherwise, s > 0. As the observations become . . .
• s, like the mean x̄, is . . .
• Page 86 of the text gives some nice explanations.

Example: Test Scores Mean score on the midterm was 76. You received an 88. How should you feel? How far above the mean is your score? Suppose the distribution of scores is roughly symmetric and unimodal (bell-shaped).

Let’s draw two possible curves.

What if the standard deviation is 4 points? How many standard deviations are you above the mean? What if the standard deviation is 16 points? How many standard deviations are you above the mean?
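The two "what if" questions come down to one small calculation: the distance from the mean divided by the standard deviation. A minimal Python sketch, with a made-up function name:

```python
def sds_above_mean(score, mean, sd):
    """How many standard deviations a score lies above the mean."""
    return (score - mean) / sd


# Midterm example: mean 76, your score 88, two possible standard deviations.
for sd in (4, 16):
    print(f"sd = {sd:>2}: {sds_above_mean(88, 76, sd)} standard deviations above the mean")
```

The same 12-point gap looks very different depending on the spread, which is why an average alone does not tell you how to feel about a score.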

Linear Transformations

Big Idea: If we have some height data measured in inches and report the mean to be 68 and the standard deviation to be 3, the units for these two summaries are also inches. We might say that the heights in the data are roughly 3 inches away from the mean height of 68 inches, on ______. What if the height data had been recorded in feet instead of inches? It is easy to convert numerical summaries of a set of data or of a distribution from one unit of measurement to another. The conversion is expressed in terms of a linear transformation, and some rules for the effect of a linear transformation on the measures of center and spread are provided.

Definition:

A ______changes the original variable x into the new variable xnew .

General form for a Linear Transformation

xnew = a + b·xold

Try it! Consider the following data: {1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3} Plot it on the graph below:

1. Consider adding a fixed value to each observation, say the value 5. Plot the new observations on the same graph using a different plotting symbol or color.

Did the transformation New = Old + 5 affect the center of the distribution?

Did the transformation New = Old + 5 affect the spread of the distribution?

Adding a fixed number to each observation . . .

2. Consider multiplying each observation by a fixed (positive) value, say the value 3. Plot the new observations on the same graph using a different symbol or color.

Did the transformation New = 3(Old) affect the center of the distribution?

Did the transformation New = 3(Old) affect the spread of the distribution?

Multiplying each observation by a fixed (positive) number . . .
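The two Try it! transformations can also be checked numerically. The Python sketch below applies xnew = a + b·xold to the small data set above and prints the mean and standard deviation before and after; `linear_transform` is a made-up helper name, and the standard deviation is the sample version from the `statistics` module.

```python
from statistics import mean, stdev

data = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]


def linear_transform(values, a, b):
    """Apply xnew = a + b * xold to every observation."""
    return [a + b * x for x in values]


shifted = linear_transform(data, a=5, b=1)  # New = Old + 5
scaled = linear_transform(data, a=0, b=3)   # New = 3(Old)

for label, values in [("original", data),
                      ("Old + 5", shifted),
                      ("3(Old)", scaled)]:
    print(f"{label:<9} mean = {mean(values):.3f}   sd = {stdev(values):.3f}")
```

Comparing the three printed rows with your plots is a way to confirm what each kind of transformation does to center and to spread.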

Try it! Exercise 1.45 & 1.46 page 97 – Raising teachers’ pay

A school system employs teachers at salaries between $30,000 and $60,000. The teachers’ union and the school board are negotiating the form of next year’s increase in the salary schedule.

I. Suppose that every teacher is given a flat $1000 raise.

a. How much will the mean salary increase? The median salary?

b. Will a flat $1000 raise increase the spread as measured by the distance between the quartiles?

c. Will a flat $1000 raise increase the spread as measured by the standard deviation of the salaries?

II. Suppose that the teachers each receive a 5% raise. The amount of the raise will vary from $1500 to $3000, depending on present salary. a. Will a 5% across-the-board raise increase the spread of the distribution as measured by the distance between the quartiles?

b. Do you think it will increase the standard deviation?

c. What will happen to the mean salary?

Read the Data Analysis Toolbox and Example 1.20 pages 93-95. Notes:

Complete Case Closed! Pages 102-103

Summary
