Statistics, Measures of Central Tendency I
Dan Barbasch, Math 1105, Chapter 9, Week of September 25

We are considering a random variable $X$ whose probability distribution has some parameters, and we want to get an idea of what these parameters are. We perform an experiment $n$ times and record the outcomes. This means we have i.i.d. random variables $X_1, \dots, X_n$ with the same probability distribution as $X$. We want to use the outcomes to infer what the parameters are.

Mean
The outcomes are $x_1, \dots, x_n$. The Sample Mean is
$$\bar{x} := \frac{x_1 + \cdots + x_n}{n}.$$
It is also sometimes called the average. The expected value of $X$, $E(X)$, is also called the mean of $X$. It is often denoted by $\mu$ and is sometimes called the population mean.

Median
The number such that half the values are below it and half are above it. If the sample is of even size, you take the average of the two middle terms.

Mode
The number that occurs most frequently. There could be several modes, or no mode.

Statistics, Measures of Central Tendency II

Example
You have a coin for which you know that $P(H) = p$ and $P(T) = 1 - p$. You would like to estimate $p$. You toss it $n$ times and count the number of heads. Let $X_i = 1$ if toss $i$ is heads and $X_i = 0$ if it is tails, so that $X_1 + \cdots + X_n$ is the number of heads. The sample mean should be an estimate of $p$: $E(X) = p$, and $E(X_1 + \cdots + X_n) = np$. So
$$E\left(\frac{X_1 + \cdots + X_n}{n}\right) = p.$$

Descriptive Statistics I

Frequency Distribution
Divide the range of the data into a number of disjoint intervals of equal length. For each interval, count the number of elements of the sample that fall in it.

Histogram
See the next slide.

Grouped Data Mean
Essentially, calculate the mean of the frequency distribution. Intervals are used rather than single values, and it is assumed that all the values in an interval are located at its midpoint. The letter $x_M$ represents the midpoints and $f$ represents the frequencies:
$$\bar{x} = \frac{\sum f_i \, x_{M,i}}{n}.$$

Frequency Polygon
Connect the midpoints of the tops of the histogram bars.
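The definitions above can be tried out numerically. The following is a minimal sketch in Python, not part of the slides: the sample, the interval boundaries, and the helper names `frequency_distribution` and `grouped_mean` are all hypothetical and chosen only for illustration. It uses the standard-library `statistics` module (the `multimode` function assumes Python 3.8 or later).

```python
from statistics import mean, median, multimode

# Hypothetical sample, chosen only for illustration (not from the slides).
sample = [2, 3, 3, 5, 7, 8, 8, 9, 12, 13]

sample_mean = mean(sample)        # (x_1 + ... + x_n) / n
sample_median = median(sample)    # even n: average of the two middle terms
sample_modes = multimode(sample)  # all values tied for most frequent


def frequency_distribution(data, intervals):
    """Count how many data values fall in each closed interval (low, high)."""
    return [sum(1 for x in data if low <= x <= high) for low, high in intervals]


def grouped_mean(data, intervals):
    """Grouped-data mean: sum of f_i * x_{M,i} divided by n."""
    freqs = frequency_distribution(data, intervals)
    midpoints = [(low + high) / 2 for low, high in intervals]
    n = sum(freqs)
    return sum(f * m for f, m in zip(freqs, midpoints)) / n


# Hypothetical equal-width intervals covering the sample.
intervals = [(0, 4), (5, 9), (10, 14)]

print(sample_mean, sample_median, sample_modes)
print(frequency_distribution(sample, intervals))
print(grouped_mean(sample, intervals))
```

For this made-up sample the grouped mean differs slightly from the sample mean, because every value in an interval is replaced by the interval's midpoint.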
Histogram

A histogram is a graphical representation of the distribution of numerical data. It is a kind of bar graph. To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.

Bin               Count
−3.5 to −2.51         9
−2.5 to −1.51        32
−1.5 to −0.51       109
−0.5 to 0.49        180
0.5 to 1.49         132
1.5 to 2.49          34
2.5 to 3.49           4

Mean (using the approximate midpoints $-3, -2, \dots, 3$):
$$\frac{(-3)\cdot 9 + (-2)\cdot 32 + (-1)\cdot 109 + 0\cdot 180 + 1\cdot 132 + 2\cdot 34 + 3\cdot 4}{500} = \frac{12}{500} = 0.024.$$

Example
The table on the next page gives the number of days in June and July of recent years in which the temperature reached 90 degrees or higher in New York's Central Park. Source: The New York Times and Accuweather.com.
a. Prepare a frequency distribution with a column for intervals and frequencies. Use seven intervals, starting with [0, 4].
b. Sketch a histogram and a frequency polygon, using the intervals in part a.
c. Find the mean for the original data.
d. Find the mean using the grouped data from part a.
e. Explain why your answers to parts c and d are different.
f. Find the median and the mode for the original data.

9.1 Frequency Distributions; Measures of Central Tendency

a. Use this table to estimate the mean income for white households in 2008.
b. Compare this estimate with the estimate found in Exercise 39. Discuss whether this provides evidence that white American households have higher earnings than African American households.

41. Airlines The number of consumer complaints against the top U.S. airlines during the first six months of 2010 is given in the following table. Source: U.S. Department of Transportation.

Airline           Complaints   Complaints per 100,000 Passengers Boarding
Delta                   1175   2.19
American                 660   1.56
United                   487   1.84
US Airways               428   1.69
Continental              350   1.64
Southwest                149   0.29
Skywest                   77   0.65
American Eagle            68   0.87
Expressjet                56   0.70
Alaska                    34   0.44

a. By considering the numbers in the column labeled "Complaints," calculate the mean and median number of complaints per airline.
b. Explain why the averages found in part a are not meaningful.
c. Find the mean and median of the numbers in the column labeled "Complaints per 100,000 Passengers Boarding." Discuss whether these averages are meaningful.

Life Sciences

42. Pandas The sizes of the home ranges (in square kilometers) of several pandas were surveyed over a year's time, with the following results.

Home Range   Frequency
0.1–0.5             11
0.6–1.0             12
1.1–1.5              7
1.6–2.0              6
2.1–2.5              2
2.6–3.0              1
3.1–3.5              1

a. Sketch a histogram and frequency polygon for the data.
b. Find the mean for the data.

43. Blood Types The number of recognized blood types varies by species, as indicated by the table below. Find the mean, median, and mode of this data. Source: The Handy Science Answer Book.

Animal           Number of Blood Types
Pig                                 16
Cow                                 12
Chicken                             11
Horse                                9
Human                                8
Sheep                                7
Dog                                  7
Rhesus monkey                        6
Mink                                 5
Rabbit                               5
Mouse                                4
Rat                                  4
Cat                                  2

General Interest

44. Temperature The following table gives the number of days in June and July of recent years in which the temperature reached 90 degrees or higher in New York's Central Park. Source: The New York Times and Accuweather.com.

Temperature Data

Year  Days    Year  Days    Year  Days
1972    11    1985     4    1998     5
1973     8    1986     8    1999    24
1974    11    1987    14    2000     3
1975     3    1988    21    2001     4
1976     8    1989    10    2002    13
1977    11    1990     6    2003    11
1978     5    1991    21    2004     1
1979     7    1992     4    2005    12
1980    12    1993    25    2006     5
1981    12    1994    16    2007     4
1982    11    1995    14    2008    10
1983    20    1996     0    2009     0
1984     7    1997    10    2010    20

a. Prepare a frequency distribution with a column for intervals and frequencies. Use six intervals, starting with 0–4.
b. Sketch a histogram and a frequency polygon, using the intervals in part a.
c. Find the mean for the original data.
d. Find the mean using the grouped data from part a.
e. Explain why your answers to parts c and d are different.
f. Find the median and the mode for the original data.

Measures of Variation

Summary of Section 9.2

Range
The difference Largest Data − Smallest Data in a sample.

Deviation from the Mean
The difference $x_i - \bar{x}$ between a data value and the sample mean.

1. Variance
$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n-1} = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
2. Standard Deviation
$$s = \sqrt{s^2}$$

These are random variables called the Sample Variance and the Sample Standard Deviation.

For a random variable $X$, $\mu = E(X)$ is called the mean. The variance $\mathrm{Var}(X)$ is $\sigma^2 = \mathrm{Var}(X) = E((X - \mu)^2)$.

Main Property / Explanation for dividing by $n - 1$: If the $X_i$ are i.i.d. with the distribution of $X$, and you set
$$S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1},$$
then its expected value is $E(S^2) = \sigma^2$. This is not true for the standard deviation: $E(S) \neq \sigma$.

Grouped Data
$$s = \sqrt{\frac{\sum f_i \, x_{M,i}^2 - n\bar{x}^2}{n-1}}.$$

Examples I

Example (Range)
Data: 15, −3, 4, 7, 18. The smallest is −3 and the largest is 18, so Range = 18 − (−3) = 21. The range is always a nonnegative number.

Example (Deviation from the Mean)
In the previous example, $\bar{x} = \frac{15 - 3 + 4 + 7 + 18}{5} = 8.2$. So the deviations are
15 − 8.2 = 6.8, −3 − 8.2 = −11.2, 4 − 8.2 = −4.2, 7 − 8.2 = −1.2, 18 − 8.2 = 9.8.

Example (Variance and Standard Deviation)
$$s^2 = \frac{6.8^2 + 11.2^2 + 4.2^2 + 1.2^2 + 9.8^2}{4} = \frac{15^2 + 3^2 + 4^2 + 7^2 + 18^2 - 5\cdot 8.2^2}{4} = \frac{286.8}{4} = 71.7,$$
$$s = \sqrt{s^2} \approx 8.47.$$

Examples II

Example (Binomial Distribution, a single trial)
$P(X = 1) = p$, $P(X = 0) = 1 - p$. Then $\mu = E(X) = p$, and
$$\sigma^2 = E((X - p)^2) = (1 - p)^2 p + (0 - p)^2 (1 - p) = p(1 - p).$$
This is the same as
$$E(X^2 - p^2) = (1 - p^2)\,p + (-p^2)(1 - p) = (1 - p)\,p.$$

Remark: Note that the formula for the sample variance and sample standard deviation only holds for $n \geq 2$. Otherwise, for $n = 1$, you would be dividing by 0.
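The Variance and Standard Deviation example with the data 15, −3, 4, 7, 18 can be checked numerically, and the property $E(S^2) = \sigma^2$ can be verified exactly in a small case. The following Python sketch is an illustration under stated assumptions, not part of the slides: the function name `expected_sample_variance` and the choice $p = 1/3$, $n = 3$ are hypothetical. It enumerates all outcomes of $n$ coin tosses with exact fractions rather than simulating.

```python
from fractions import Fraction
from itertools import product
from math import sqrt
from statistics import variance

# Data from the Range / Deviation / Variance examples.
data = [15, -3, 4, 7, 18]
n = len(data)
xbar = sum(data) / n                                  # sample mean
deviations = [x - xbar for x in data]                 # x_i - xbar

# The two equivalent formulas for the sample variance.
s2_from_deviations = sum(d * d for d in deviations) / (n - 1)
s2_shortcut = (sum(x * x for x in data) - n * xbar ** 2) / (n - 1)
s = sqrt(s2_from_deviations)                          # sample standard deviation


def expected_sample_variance(p, n):
    """Exact E(S^2) for n i.i.d. tosses with P(X=1)=p, summing over all 2^n outcomes."""
    total = Fraction(0)
    for outcome in product([0, 1], repeat=n):
        heads = sum(outcome)
        prob = p ** heads * (1 - p) ** (n - heads)    # probability of this outcome
        mean = Fraction(heads, n)
        s2 = sum((x - mean) ** 2 for x in outcome) / (n - 1)
        total += prob * s2
    return total


# Hypothetical parameters, chosen only for illustration.
p = Fraction(1, 3)
print(s2_from_deviations, s2_shortcut, variance(data), s)
print(expected_sample_variance(p, 3), p * (1 - p))    # both equal sigma^2 = p(1 - p)
```

Dividing by $n - 1$ is what makes the two numbers on the last line agree; dividing by $n$ instead would give a smaller expected value.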
For one random variable, the variance is defined as
$$\mathrm{Var}(X) = E\big((X - E(X))^2\big).$$
For two independent random variables $X_1, X_2$,
$$\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2).$$

Suppose $X$ is a random variable. We can write a table

X       a_1   a_2   ...   a_n
P(X)    p_1   p_2   ...   p_n

Examples III

For the expected value $\mu = E(X)$, you multiply the two terms in each column and add:
$$\mu = \sum_i a_i \, p_i = a_1 p_1 + \cdots + a_n p_n.$$
In a spreadsheet program, the data would be in columns, and you would add over the products from the rows.
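The column-by-column computation of the expected value can be sketched in a few lines of Python. This is only an illustration: the fair six-sided die used as the table, and the helper names `expected_value`, `var`, and `sum_distribution`, are assumptions and do not come from the slides. Exact fractions are used so that the additivity of the variance for independent variables can be checked without rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical table: a fair six-sided die (values a_i with probabilities p_i).
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6


def expected_value(values, probs):
    """E(X): multiply the two entries in each column of the table, then add."""
    if sum(probs) != 1:
        raise ValueError("probabilities must add up to 1")
    return sum(a * p for a, p in zip(values, probs))


def var(values, probs):
    """Var(X) = E((X - mu)^2), computed from the same table."""
    mu = expected_value(values, probs)
    return sum((a - mu) ** 2 * p for a, p in zip(values, probs))


def sum_distribution(values1, probs1, values2, probs2):
    """Table for X1 + X2 when X1 and X2 are independent."""
    table = {}
    for (a, p), (b, q) in product(zip(values1, probs1), zip(values2, probs2)):
        table[a + b] = table.get(a + b, 0) + p * q    # independence: P = p * q
    return list(table.keys()), list(table.values())


sum_values, sum_probs = sum_distribution(values, probs, values, probs)
print(expected_value(values, probs), var(values, probs))
print(var(sum_values, sum_probs), 2 * var(values, probs))
```

The last line prints the variance of the sum of two independent dice next to twice the variance of one die; the two agree, as the additivity formula says they should.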