Statistics

“There are three kinds of lies: lies, damned lies, and statistics.”

Benjamin Disraeli, British Politician

In this final unit of the course, we will talk about a couple of fundamental ideas of statistics and how these concepts are used in everyday life. We are inundated with statistical information, typically coming from the media. Newspapers, television news, and the internet constantly give us numerical information on a whole host of subjects.

Statistics are often given to us in order to convince us of something. For example, statistics are used to encourage us to use or not use a given medication.

How can we judge the validity of statements made with the use of statistics?

The following page lists several statements found on the internet.

• guardian.co.uk: “The report showed median prices rose 2.6 percent in March.”
• on Budget and Policy Priorities: “Overall median household income rose modestly in 2005.”
• Foliomag.com: “Consumer circulation executives reported a mean salary of $95,400 this year.”
• Bankrate.com: “The average 30-year fixed rate rose 7 basis points, to 6.03 percent.”
• Medical Research Council: “The mode income was between £10-15,000 and the estimate of average income was £18,385.”

In each of these statements, a statistical term was used. They all represent a similar idea, which we will now discuss.

Measures of Central Tendency

The terms mean (or average), median, and mode all represent some sort of average, or measure of the “middle” of the data. Each measures something a little different, however.

Average (= Mean)

The average, or mean, of a bunch of numerical data is the most common measure of the “middle” of the data.

To compute the average, add up all the data and divide by the number of data points.

LA Laker Salaries, 2007-2008

Player                Salary
--                    $19,490,625
--                    $13,709,375
--                    $13,524,000
Vladimir Radmanovic   $5,632,200
--                    $4,350,000
--                    $4,000,000
--                    $2,710,800
--                    $2,200,000
--                    $2,172,000
Sasha Vujacic         $1,756,951
--                    $1,009,560
--                    $770,610
Coby Karl             $427,163
DJ Mbenga             $319,331
Total                 $72,072,615

Source: http://hoopshype.com/salaries/la_lakers.htm

To find the average of this set of 14 salaries, we add them up and divide by 14.

This gives $72,072,615 / 14 = $5,148,044

So, on average, a Laker will receive over $5 million for playing the 2007-2008 season.
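The calculation above can be checked with a short Python sketch. The variable and function names here are illustrative choices, not part of the course material; the salary figures are the ones listed above.

```python
# Laker salaries for 2007-2008, in dollars, copied from the list above.
salaries = [
    19_490_625, 13_709_375, 13_524_000, 5_632_200, 4_350_000,
    4_000_000, 2_710_800, 2_200_000, 2_172_000, 1_756_951,
    1_009_560, 770_610, 427_163, 319_331,
]


def mean(values):
    """Add up all the data and divide by the number of data points."""
    return sum(values) / len(values)


total = sum(salaries)
average = mean(salaries)

print(f"Number of salaries: {len(salaries)}")
print(f"Total: ${total:,}")
print(f"Average: ${average:,.0f}")
```

Rounded to the nearest dollar, the result agrees with the $5,148,044 computed by hand.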

This example brings up an issue, which we will illustrate with an even more extreme example. Suppose a company wishes to publish its average employee salary. The company has 5 employees: a CEO, two sales people, and two secretaries. Their salaries are in the following chart.

Person         Salary
CEO            $5,000,000
sales person   $75,000
sales person   $75,000
secretary      $40,000
secretary      $40,000

The average company salary is then

(5,000,000 + 75,000 + 75,000 + 40,000 + 40,000) / 5 = 5,230,000 / 5 = $1,046,000

This sounds really good! They can advertise the average salary to be over $1 million. However, if they are looking to hire anybody but the CEO, this is misleading.

The problem with an average is that extreme data, such as the very high CEO salary, can skew the average, making it not represent the “typical” salary.
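To see how much a single extreme value moves the average, here is a small Python sketch that computes the company average with and without the CEO. The variable names are illustrative; the salaries are the five from the company example.

```python
# Salaries of the five employees in the company example, in dollars.
company = [5_000_000, 75_000, 75_000, 40_000, 40_000]

# Average over everyone, CEO included.
with_ceo = sum(company) / len(company)

# Average over everyone except the single highest salary (the CEO).
others = [s for s in company if s != max(company)]
without_ceo = sum(others) / len(others)

print(f"Average with CEO:    ${with_ceo:,.0f}")
print(f"Average without CEO: ${without_ceo:,.0f}")
```

With the CEO the average is $1,046,000. Without the CEO it drops to $57,500, which is much closer to what the other four employees actually earn.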

In the case of the Lakers, the average salary of over $5 million is higher than the salaries of 10 of the 14 players, and it is way above the salaries of the 5 lowest-paid players, over a third of the team.

There are other measures of the middle of the data which lessen the impact of extreme values.

Median

The median of a set of numerical data is, essentially, the number for which half the data is above and half is below that number.

In practice there are two cases for calculating this. In either case, listing the data in order is a good idea for finding the median.

For the company example, the salaries are

40000, 40000, 75000, 75000, 5000000

In this case there are 5 data points, an odd number. We then take the number right in the middle. That is the left-most 75000. Thus, the median of this data is 75000.

For the Laker salary example, rounding the figures to keep the list short, the salaries are (M = million): .32M, .43M, .77M, 1M, 1.76M, 2.17M, 2.2M, 2.7M, 4M, 4.4M, 5.6M, 13.5M, 13.7M, 19.5M.

In this case there are 14 salaries, an even number. Look at the “middle” two values, marked in brackets below:

.32M, .43M, .77M, 1M, 1.76M, 2.17M, [2.2M, 2.7M], 4M, 4.4M, 5.6M, 13.5M, 13.7M, 19.5M

These two numbers have 6 below them and 6 above them. To find the median with an even number of data points, as in this example, we average the two middle numbers. In this case we get

(2.2M + 2.7M) / 2 = 4.9M / 2 = 2.45M.

So, the median salary on the Lakers is $2.45 million. This is a better representation of the “typical” salary, since Kobe Bryant’s salary is so much larger than the others.

To summarize, to calculate the median of a data set, first list the data in order. If there is an odd number of data points, select the number right in the middle, where there are just as many numbers below as above.

If the set has an even number of data points, take the two numbers right in the middle and average them.
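The two cases can be sketched as a short Python function. The function name `median` and the variable names are illustrative choices. The Laker figures are the rounded values used above, in millions of dollars.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # list the data in order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of data points: take the one right in the middle.
        return ordered[middle]
    # Even number of data points: average the two middle numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Company example: 5 salaries, an odd number.
company = [40_000, 40_000, 75_000, 75_000, 5_000_000]

# Laker example: 14 rounded salaries in millions, an even number.
lakers_millions = [0.32, 0.43, 0.77, 1, 1.76, 2.17, 2.2, 2.7,
                   4, 4.4, 5.6, 13.5, 13.7, 19.5]

print(f"Company median: ${median(company):,}")
print(f"Laker median: {median(lakers_millions):.2f}M")
```

The function sorts its input, so the data does not have to be listed in order beforehand.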

For another example, suppose a company has a highly paid CEO, making $5,000,000 per year, and the other 10 employees are all paid the same, $50,000 per year.

The average salary is (5000000+50000*10) / 11 = $500,000

The median is $50,000, since the list of salaries is 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 5000000.

The median is a better measure of the middle of the data than the average when there are extreme values, which skew the average toward themselves.
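The comparison can also be made with Python's standard `statistics` module, which provides `mean` and `median` functions. The sketch below uses the 11-employee company from the example above; the variable names are illustrative.

```python
import statistics

# One CEO at $5,000,000 and ten employees at $50,000 each.
salaries = [5_000_000] + [50_000] * 10

average = statistics.mean(salaries)
middle = statistics.median(salaries)

# Count how many employees earn less than the average.
below_average = sum(1 for s in salaries if s < average)

print(f"Average: ${average:,.0f}")
print(f"Median:  ${middle:,.0f}")
print(f"{below_average} of {len(salaries)} employees earn less than the average")
```

Here the average is ten times the median, and every employee except the CEO earns less than the average.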

There is a third measure of the middle, which can be used on non-numeric data.

Mode

The mode of a bunch of data, whether or not it is numeric, is the most commonly occurring value.

The mode of the previous salary information is 50000, since this occurred 10 times, while the CEO salary occurred only once. For another example, given the salaries 30000, 30000, 30000, 75000, 5000000, the mode is 30000, since it occurs 3 times, while the other values occur only once.

The mode of the data set red, green, blue, red, green, red, yellow is red, since red occurs the most.

The mode does not give useful information for some data sets. For example, the Laker salary data has no duplicate salaries. So, no salary is more common than any other.
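A minimal Python sketch of finding the mode, using `collections.Counter` from the standard library to count occurrences. The function name `modes` is an illustrative choice, and two conventions here are assumptions rather than something the text prescribes: the function returns a list so that ties can be reported, and it returns an empty list when no value repeats.

```python
from collections import Counter


def modes(values):
    """Return the most common value(s) of a non-empty collection.

    Returns an empty list when every value occurs only once
    (a convention chosen for this sketch).
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is more common than any other
    return [value for value, count in counts.items() if count == highest]


salaries = [30_000, 30_000, 30_000, 75_000, 5_000_000]
colors = ["red", "green", "blue", "red", "green", "red", "yellow"]
lakers = [19_490_625, 13_709_375, 13_524_000, 5_632_200, 4_350_000,
          4_000_000, 2_710_800, 2_200_000, 2_172_000, 1_756_951,
          1_009_560, 770_610, 427_163, 319_331]

print(modes(salaries))  # numeric data
print(modes(colors))    # non-numeric data works the same way
print(modes(lakers))    # no duplicates, so no useful mode
```

Because counting only requires checking whether two values are equal, the same function works on numeric and non-numeric data.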
