Geog 2000: Introduction to Geographic Statistics

Total Page:16

File Type:pdf, Size:1020Kb

Geog 2000: Introduction to Geographic Statistics

Problem Set #1 Geog 2000: Introduction to Geographic Statistics

Instructor: Dr. Paul C. Sutton

Due Tuesday Week #3 in Lab Instructions:

There are three parts to this problem set: 1) Do it by Hand Exercises (#’s 1 – 7). Do these with a calculator or by hand. Draw your figures by hand. Write out your answers by either digitally or by hand. Start your answer to each numbered question on a new page. Make it easy for me to grade. 2) Computer exercises (#’s 8-13). Use the computer to generate the graphics and paste them into a digital version of this assignment while leaving room for you to digitally or hand write your responses if you so wish. 3) ‘How to Lie with Statistics’ Essay(s) (#’s 14-17). For this section prepare a typed paragraph answering each question. Do not send me digital versions of your answers. I want paper copies because that is the easiest way for me to spill coffee on your assignments when I am grading them. All of the questions on these four problem sets will resemble many of the questions on the three exams. Consequently it behooves you to truly understand – ON YOUR OWN – how to do these problems. I am fine with you working with other students on these problem sets but my exams will punish those who free ride the comprehension of the material in these problem sets. Do it By Hand Exercises (i.e. Don’t Use A Computer)

#1) Strange Numbers from What or Why or Where or How?

Consider the following set of numbers that are observations of some random process or phenomena: (BTW N = 16)

0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 5, 8, 14, 41

A) Calculate and/or identify the mean, median, mode, range, inter-quartile range, maximum, Shortest half, outliers, and minimum of these numbers.

Mean: 4.75 or 4.0 * Median: 2 Mode: 0

Range: 41 Inter-quartile Range: 4.5 Shortest Half: 0-2

Maximum: 41 Minimum: 0 Outliers: 14 and 41 *

* NOTE: Quantiles are not obvious to calculate for small samples (see Quantile Calculation paper on course web site). Nonetheless, for our purposes I’m ok with the 25th and 75th quantiles of this dataset to be either .5 and 4.5 or .25 and 4.75. As far as outliers go JMP uses a rule that works for me: upper quartile + 1.5*(interquartile range) ……lower quartile - 1.5*(interquartile range).

B) Draw a histogram, a box-plot, and a stem-and-leaf plot for these numbers. Explain how the Stem and Leaf plot does not really improve on the histogram with respect to characterizing these numbers in this particular example but might in another dataset.

Stem & Leaf on 5’s

0,0,0,0,1,1,1,2,2,3,4 5,8 14 - - ... 0 10 20 30 40 41 C) Write a short 2-4 sentence explanation for each of the following random processes as to why each of them do or do not represent a reasonable and probable manifestation of the measurements represented by these numbers. ( I expect answers to vary )

1) The number of sexual partners of a 16 randomly selected 20 year old men in New York City in 1995. This strikes me as possible. It is a skewed distribution with many low values and perhaps a few very high values. It could be argued that 41 sexual partners at the age of 20 is too high an outlier but it is at least possible and maybe probable.

2) The number of corporate logos (e.g. the McDonald’s ‘M’, The Nike ‘Swoosh’, British Petroleum’s ‘BP’) recognized from a presentation of 1000 logos to a randomly selected set of 16 sixth grade students from Boston MA. FOUR kids in Boston that don’t regognize at least one corporate logo? Not a chance. In fact, most kids know more corporate logos by the age of four than the number of plants or animals they can identify in their own backyard as high school students. Possible but not very likely in the least.

3) The length in inches of Trout caught in a Fishing Derby in Lake Tahoe CA. Zero inch trout? No way. Who would even report catching a zero inch trout? Impossible and improbable.

4) The actual Fertility rate (the number of live births) of 16 randomly selected American Women of the age of 50 in 2007 (Actual TFR then was 2.05). While the numbers may be possible the probability is almost zero that in a sample of sixteen women you would find one that had given birth to 41 children. Also, the average is way above the real average for this phenomena. Almost impossible and certainly improbable.

5) The net worth (in millions of dollars) of sixteen people who paid $50 to have their photograph taken at the top of a ski lift in Aspen Colorado in 2001.

Aspen has a large population of very rich people floating through. And, millionaires are the new middle class in America. This strikes me as both possible and perhaps probable. #2) Fuzzy Dice from Hell and their Probability Density Functions

Assume you have two tetrahedrally shaped dice (i.e. four sided) on which are printed a 1, 2, 3, and 4 on each side. Also assume that the dice are ‘fair’ (i.e. equal probability of 1-4 showing up [in this case face down] when they are rolled).

A) 1) What is the average result you would expect from the ‘sum’ value of the rolling of these two dice? 5 2) What is the range of possible outcomes from the sum of the rolling of these two dice? 2 to 8 (range = 6) 3) What is a more likely outcome of the sum of a single roll of these two dice: Sum of Dice = 2 or Sum of Dice = 5? Explain.

Sum of Dice = 5 greater probability than sum of dice = 2 Probability of sum = 2 is ¼ * ¼ = 1/16 (of 16 possible outcomes only 1 chance for sum of dice to = 2) Several ways for sum of dice to be 5 (a 3 and a 2, a 2 and a 3, a 4 and a 1 or a 1 and a 4 --- 4/16 probability)

B) Draw a probability distribution function (actually since this is a discrete distribution it is more accurately referred to as a probability mass function – which should look a lot like a histogram or bar chart), that characterizes the probability of each of the possible sums of the values of the face down value of these two dice.

1/4 .25 3/16 3/16 .1875 1/8 1/8 .125 1/16 1/16 .0625 0.0 2 3 4 5 6 7 8

C) Suppose your roommate suggests to you that she will engage you in a game of chance with these two die. If she rolls a score of three or less (where the ‘score’ is the sum of the two dice) you pay her 10 dollars. If she rolls any score greater than three she pays you three dollars. Your roommate will roll the die 100 times and payments will be made for each roll of the dice. 1) Are you wise to play this game of chance? 2) If you did choose to play this game of chance how much would you expect to win or lose assuming this ‘game of chance’ worked out perfectly randomly (which these things rarely do)?

Expected Loss on any roll: 3/16 * 10 - 13/16 * 3 = - .5625 dollars On average, after 100 rolls you will have won $ 56.25 D) Suppose you try to cheat on your roommate by sneaking in a ‘bogus’ die with ‘twos’ on every side. Did you improve your odds of winning? Explain.

Not a smart move. You actually reduce your odds from winning to losing. Expected win is now: ¾ * 3 – ¼ * 10 = 2.25 – 2.50 = -0.25 After 100 rolls, on average, you will be down $25.

E) What is the average value of the roll of one of these four sided die? How might you do financially if you suggested to your roommate that she roll a SINGLE four sided die and you pay her $100 dollars every time she rolled the average value and you collected a mere twenty five cents every time she did not? Explain.

The average roll of a single four-sided die with 1 through 4 marked on each side is 2.5 In this case the average can NEVER be rolled. Good bet. You’ll win EVERY time. But Your roommate just might figure this out after a while .

#3) The odds and oddity of heirs

Assume that the probability of having a baby boy or a baby girl are equal and 50%. Assume a woman has two children. What are her chances of having at least one female child? Explain.

The four possible outcomes are: Boy & Boy, Boy & Girl, Girl & Boy, Girl & Girl In 3 of the 4 cases there is at least one girl. Thus, if a woman has two children, ~75% of the time she will have at least one girl.

#4) Sometimes it is easier to think about what couldn’t happen.

Assume you have been randomly selected from thousands of college age students to participate in an ‘American Idol’ competition. Also assume that there is an equal probability of all of these college age students having a birthday on any day of the year (No Leap Years – we’re not going there – e.g. 1 in 365 days of the year).

A) How many students would have to be selected for which the probability of you having the same birthday as one of the other contestants would be a 50% chance? (Note: You can use a calculator or spreadsheet for this question – it does involve significant sequential arithmetic). Wikipedia Provided a very clear answer to this question

To compute the approximate probability that in a room of n people, at least two have the same birthday, we disregard variations in the distribution, such as leap years, twins, seasonal or weekday variations, and assume that the 365 possible birthdays are equally likely. Real-life birthday distributions are not uniform since not all dates are equally likely.[3]

It is easier to first calculate the probability p(n) that all n birthdays are different. If n > 365, by the pigeonhole principle this probability is 0. On the other hand, if n ≤ 365, it is given by

because the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as the first two (363/365), etc.

The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability p(n) is

This probability surpasses 1/2 for n = 23 (with value about 50.7%).

B) Assume the song you have decided to perform for this ‘American Idol’ competition is “Blue Valentines” by Tom Waits – What are the odds that out of 1,000 other contestants one or more of them would choose the same song? Answers will vary. I say the odds are zero because Tom Waits is a gravelly voiced singer who sings depressing songs that would be an unlikely choice of candidates trying to win an American Idol competition. C) Characterize the probabilistic approach you took to parts ‘A’ and ‘B’ of this Question as classical, relative frequency, monte carlo, or subjective. Explain.

The first calculation is a classical probability approach whereas the second is a subjective probability approach.

#5) Conditional Probability can be REALLY tough to recognize: “Let’s Make a Deal”.

Monte Hall (and I think now Drew Carey) hosted (and Drew Carey now hosts) a famous game show called “Let’s Make a Deal”. A classic statistical problem is associated with a challenge in this show involving “The infamous Door #3” or whatever door comes into play. Assume you are a contestant on this show. Assume the producers of the show etc. are not messing around with you. Here is your situation: Monte Hall (now Drew Carey) presents you with three doors that obscure potential prizes. The prizes are: $1,000,000; an old Goat, and a young Sheep. Assume, (which I think is a safe bet), you want the million bucks. Monte Hall tells you to pick a door: “ Door #1?, Door #2? Or,…Door #3”. You pick ‘Door #2’. Now… (Monte Hall WILL do This no matter what door you initially choose) Monte Hall says – “Are You Sure you want to Stay with Door #2?” – And then he reveals to you the Goat or Sheep behind Door #1 or Door #3. And Monte Hall now gives you this option: “Do you want to stay with Door #2 OR Do you want to switch to the door I have not revealed to you” . It’s up to you. Should you switch or should you stick with your first choice? Remember this: Monte Hall WILL challenge you with this option of switching regardless of what your initial choice was. Is he doing you a favor by giving you this choice from a statistical perspective or is this merely a 50 – 50 choice on your part? What should you do: Switch or Stick or does it not matter? Explain using what you have learned about probability.

You should defnintely switch given the opportunity. On your first selection you have a 1/3 chance of choosing the money and a 2/3 chance of choosing a hooved animal. If you chose the money you would HAVE to switch to a goat or sheep (this will happen 1/3 of the time). If you chose a hooved animal you will ALWAYS switch to the money because the other hooved animal has been revealed to you (This will happen 2/3 of the time). Consequently it is NOT a 50/50 proposition. It is a 2/3 – 1/3 proposition in your favor.

#6) Probabilistic Approaches

Define and provide and example of each of the following approaches to Probability:

A) Classical Population parameters of Random Variables are known and the approach is derived from strict mathematical rules. Casino games such as roulette, craps, poker, black jack, etc are all classical probability problems.

B) Relative Frequency A large body of empirical evidence provides the groundwork for relative frequency approaches. Insurance companies do this all the time. Male drivers between the ages of 20 and 25 have such and such a probability of getting in accidents that cost such and such amount. Insurance rates are based on relative frequencies. These are often adjusted based on conditional probabilities such as # of DUIs, # of Tickets, Levels of Education, Distance traveled to work etc.

C) Subjective Subjective probability is when someone makes decisions based on experience, folklore, hunches, judgement, etc. A good example is listening to sports guys on TV pick football games. They are not random. Las Vegas also puts ‘odds’ on sporting events. These ‘odds’ are basically the subjective assessment of experts. They are often pretty damn good.

D) Monte Carlo The Monte Carlo approach is similar to relative frequency approach but often takes place when one does not have a large history of empirical results to draw from. In this case, theoretical models are developed using stochastic (random variables) and ‘empirically modeled’ relative frequencies are generated. For example, the “Let’s Make A Deal” question (#5) can be approached classically or you could simply play it 100 times (e.g. Monte Carlo it) and get an empirical relative frequency distribution. An example provided by Pete Rogerson (Geography Prof At SUNY Buffalo) is two theoretical hockey teams. In this example lets use to stochastic parameters per team: 1) # of shots on goal each team makes in a game; and 2) The % of saves each goalie makes for each shot on goal. If you model these parameters as random instantiations either team can win even if the team with the better goalie also averages more shots on goal. After many ‘runs’ of the model though – the ‘better team’ wins more games.

#7) Will the Real Independent Random Variable Please step forward?

What does it mean for observations of a random variable to be ‘Independent’? Provide an example of independent observations and an example in which Observations are not independent. Why is geography (and especially spatial Analyses) often criticized for using regular statistics on non-independent Measurements?

Independence of measurement means that making one measurement will tell you nothing about any of the other measurements you make. This is essentially why we want truly random sampling when we are doing statistics (if possible). When two events are statistically independent, it means that knowing whether one of them occurs makes it neither more probable nor less probable that the other occurs. In other words, the occurrence of one event occurs does not affect the outcome of the occurrence of the other event. Similarly, when we assert that two random variables are independent, we intuitively mean that knowing something about the value of one of them does not yield any information about the value of the other.

Example of Independent Observation: The roll of a single die should not effect the roll of another single die.

Example of Non-Independent Observation: The Let’s Make a Deal question (#5) demonstrates a kind of non-independence. Your odds of ‘winning’ change depending upon whether your first choice was the money or a hooved animal.

Non-Independence in Space: Space makes almost everything dependent. The value of a house is almost always related to the value of the houses near it. Even if you randomly select within a neighborhood you have almost necessarily biased your sample of housing values. The temperature at a given location is more like the temperature around it than the temperature 1500 miles away in any direction. This idea of spatial dependence is one of the foundational reasons geography is a discipline. Nonetheless, it is also a reason that we are often ‘cheating’ (to a greater or lesser extent) whenever we use traditional statistics. Computer Problems (use JMP and/or Excel for these exercises)

The Bell Curve (aka The Normal Distribution) is but one of many, many different kinds of random number distributions. The JMP software package allows you to generate sets of ‘observations’ from a large number of different kinds of random number distributions. Some distributions have ‘parameters’. The meaning and interpretation of these parameters is not always the same. For example, the Normal distribution has two parameters: The mean (usually symbolized as: µ), and the standard deviation (usually symbolized as: σ). The mean is a measure of the central tendency of the normal distribution and the standard deviation is a measure of the spread, or, variability of the numbers around the mean. The Normal distribution is fairly good at characterizing many random variables. For example, consider the distribution of the height of 25 year old men in the state of Colorado. Let’s assume that is can be characterized by a Normal distribution with a mean of 70 inches (5’ 10”) and a standard deviation of 3 inches. We can ask JMP to artificially sample from a N(70,3) (read as Normal distribution with mean of 70 and standard deviation 3) distribution. In JMP open a new table, add 2000 rows (we’ll take a sample of 2000) and double click on the column and click on Edit Formula (we’ll do this in lab too) and choose ‘Random’ then choose ‘Normal’. Add the 70 and the 3 to the box, check apply, and bam! – You have 2000 numbers in your table drawn from a Random N(70, 3) distribution. Use the Analysis tool to make a histogram and get descriptive statistics.

Figure 1: Histogram of 2000 randomly generated points from a N(70, 3) distribution So here is a histogram of the randomly generated numbers from a N(70, 3) distribution. Use the ‘?’ tool (which is the help system of JMP) to find out what all these diagrams, and numbers are. There is a box and whiskers chart, a shortest half (the red figure), and a whole bunch more. Make sure you know how to use this ‘?’ tool. It is very cool. In fact make sure you can find the figure pasted onto the next page in the JMP help system. This is a very good way to learn both how to use the software, and how to interpret the statistical output of the software. Figure 2: JMP Help system explanation of figures in the Histogram function Find the above figure in the JMP help system. Now, we’ll generate another 2000 observations from a Normal distribution with different parameters. Assume a coke can is 7 inches tall. The variability of the size of a coke can will be much smaller (relatively and actually) than the variability of the heights of 25 year old men living in Colorado. So do the same thing again by creating a new column, editing its formula etc. but this time generate your numbers from a N(7, 0.03) distribution (e.g. very little spread about the mean). Plot the histogram. Compare the two datasets – Both of them are from a Normal Distribution but have different ‘parameters’ for the Mean and Standard Deviation.

Figure 3: Histogram of 2000 randomly generated points from a N(7, 0.03) distribution

Note the numerical scale across the bottom. All the points fall within 0.06 inches around 7 whereas for the first generation from N(70, 3) all the points were within 20 inches around 70. Both are Normal distributions but clearly the heights of coke cans are much less variable than the heights of 25 year old men living in Colorado. Your computer exercises for this problem set are to prepare a one page summary for each of the following distributions: Uniform, Binomial, Exponential, Poisson, and Chi-Square. (Note the Chi-Square is not in the ‘Random’ part of the formula editor but in the ‘Statistical’ part). Google each distribution to find out a little bit about it: How many parameters does it have? How are the parameters interpreted? Is this a discrete or a continuous distribution. What kinds of phenomena are best represented by this kind of distribution? Generate two histograms for each distribution using different parameters. Tell me a one page story about each distribution with these figures and your interpretation based on your research. Definitely answer the italicized questions above in this story.

This is the essence of the assignment:

#8) Generate two Normal distributions identical to those described above and Create a single column with all of the observations (e.g. 2000 from N(70,3) and 2000 from a N(7, 0.03). Plot them all on one histogram. What do you see?

All the variability of the N(7, .03) variable (~ from 6.9 to 7.1) fits into a single bin of the histogram; whereas, the variability of the N(70, 3) random variable ranges from roughly 60 to 80. Both are symmetric normal distributions but they have different means and standard deviations (aka central tendencies and spreads).

10 20 30 40 50 60 70 80

#9) (Less than) One page Summary of the Uniform Distribution Uniform (100) The Uniform distribution is essentially equal probability across the range for which it exists. In fact, the only ‘parameter’ of the uniform distribution is this ‘range’. The Uniform distribution is essentially what is used for random number generation applications such as selecting lottery numbers. There are discrete and continuous versions of the uniform.The pdf of the continuous uniform distribution is: 0 10 20 30 40 50 60 70 80 90 100 #10) One Page Summary of the Binomial Distribution

Distributions Bin(100, .5) Quantiles Moments 100.0% maximum 70.000 Mean 49.791 99.5% 62.000 Std Dev 5.0389497 97.5% 60.000 Std Err Mean 0.1126743 90.0% 56.000 upper 95% Mean 50.011971 75.0% quartile 53.000 low er 95% Mean 49.570029 50.0% median 50.000 N 2000 25.0% quartile 46.000 10.0% 43.000 40 50 60 70 2.5% 40.000 0.5% 36.005 0.0% minimum 33.000

If you play with the ‘hand tool’ in JMP you can make the histogram as fine ‘binned’ as you like. In the case above I made it so fine ‘binned’ that you can see the discrete nature of the binomial distribution. ‘Scores’ of 50 and 49 and 19 can occur but no non-integer values can. This Bin(100, .5) example would be analogous to flipping a coin 100 times and counting the number of heads – 2000 times. Theoretically a 0 or a 100 could occur; however, note that in 2000 trials of 100 coin flips they do not occur. In fact, all the values fall between 30 and 70. The parameters of a Binomial are ‘n’ (analogous to the # of coin flips – or Bernoulli trials, and ‘p’ the probability of getting a ‘heads’ or ‘success’ in Bernoulli trial jargon. The formulas for the pmf and some of the parameters for the Binomial are pasted in from Wikipedia below: Parameters number of trials (integer) success probability (real) Support Probability mass function (pmf) Cumulative distribution function (cdf) Mean one of Median

Mode Variance #11) One Page Summary of the Exponential Distribution

Distributions Exp(10) Quantiles Moments 100.0% maximum 10.331 Mean 0.9796749 99.5% 5.063 Std Dev 1.0018597 97.5% 3.640 Std Err Mean 0.0224023 90.0% 2.256 upper 95% Mean 1.0236092 75.0% quartile 1.335 low er 95% Mean 0.9357407 50.0% median 0.690 N 2000 25.0% quartile 0.282 10.0% 0.100 0 1 2 3 4 5 6 7 8 9 10 2.5% 0.025 0.5% 0.00633 0.0% minimum 0.00025

The exponential is a skewed continuous distribution that has many applications in the natural and social sciences. It is often considered a continuous analog of the negative binomial distribution. The exponential distribution has only one parameter: λ (e.g. Exp(λ)). The mean of the exponential is 1/ λ; while the variance is given by 1/ λ2 . Processes that are often modeled by the exponential include the decay of radioactive material and the length of time one might expect to wait for a bus at a bus stop. The Wikipedia characterization of pdf and cumulative distribution function are pasted in below:

Probability density function

The probability density function (pdf) of an exponential distribution has the form

where λ > 0 is a parameter of the distribution, often called the rate parameter. The distribution is supported on the interval [0,∞). If a random variable X has this distribution, we write X ~ Exponential(λ).

[edit] Cumulative distribution function

The cumulative distribution function is given by

#12) One Page Summary of the Poisson Distribution Distributions Poisson(10) Quantiles Moments 100.0% maximum 20.000 Mean 10.112 99.5% 19.000 Std Dev 3.1039126 97.5% 17.000 Std Err Mean 0.0694056 90.0% 14.000 upper 95% Mean 10.248115 75.0% quartile 12.000 low er 95% Mean 9.9758851 50.0% median 10.000 N 2000 25.0% quartile 8.000 10.0% 6.000 0 10 20 2.5% 5.000 0.5% 3.000 0.0% minimum 1.000

The Poisson distribution is a discrete distribution often used in spatial applications. In fact, is an important distribution with respect to deriving complete spatial randomness (CSR). Imagine a field study of say.. meteorite impacts. If you divide this area up into equal sized sub-areas (often quadrats) you would expect the ‘count’ of meteorite impacts in these quadrats to be distributed as a poisson variable with λ = # of meteorite impacts / # of quadrats (if they are spatially random). In fact, for the Poisson; it’s only parameter: λ, is also the expected value of both it’s mean and 2 variance (e.g. for X~P(λ); E[X] = λ and E[ σ x ] = λ. Again Wikipedia provides some nice background info on the Poisson distribution pasted in below:

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.The distribution was discovered by Siméon-Denis Poisson (1781–1840) and published, together with his probability theory, in 1838 in his work Recherches sur la probabilité des jugements en matières criminelles et matière civile ("Research on the Probability of Judgments in Criminal and Civil Matters"). The work focused on certain random variables N that count, among other things, a number of discrete occurrences (sometimes called "arrivals") that take place during a time-interval of given length. If the expected number of occurrences in this interval is λ, then the probability that there are exactly k occurrences (k being a non- negative integer, k = 0, 1, 2, ...) is equal to

where e is the base of the natural logarithm (e = 2.71828...) k is the number of occurrences of an event - the probability of which is given by the function k! is the factorial of k λ is a positive real number, equal to the expected number of occurrences that occur during the given interval. For instance, if the events occur on average every 4 minutes, and you are interested in the number of events occurring in a 10 minute interval, you would use as model a Poisson distribution with λ = 10/4 = 2.5. As a function of k, this is the probability mass function. The Poisson distribution can be derived as a limiting case of the binomial distribution. #13) One Page Summary of the Chi-Square Distribution For clues on how to do this see: http://web.utk.edu/~leon/stat571/jmp/chi-square/default.html

Distributions Chi-Square (5) Quantiles Moments 100.0% maximum 22.252 Mean 5.1645598 99.5% 16.547 Std Dev 3.2840109 97.5% 13.207 Std Err Mean 0.0734327 90.0% 9.741 upper 95% Mean 5.3085724 75.0% quartile 6.861 low er 95% Mean 5.0205471 50.0% median 4.435 N 2000 25.0% quartile 2.726 10.0% 1.674 0 10 20 2.5% 0.807 0.5% 0.441 0.0% minimum 0.194

The Chi-Square distribution (aka χ2 ) is a continuous distribution with only one parameter often denoted ‘k’ and called ‘degrees of freedom’. The Chi-square distribution is the distribution that results from the summing of ‘k’ standard normal variables squared. Consequently it is always positive and it is skewed to the right. It is used in the oft used Chi-Square ‘Goodness of Fit’ test which has many applications including the testing of the independence of nominal variables with respect to frequency distributions (for example are ‘eye color’ and ‘handedness’ of human beings correlated?), and several geographic applications. The shape of the Chi-Square changes significantly as it’s degree of freedom ‘k’ parameter increases from 1-5. For ‘k’’s beyond 5 it becomes an increasingly flat yet slightly skewed curve. The most common use of the chi-square is the ‘goodness of fit’ test in which the statistics is:

(Observed – Expected)2 Where the ‘Observed’ and ‘Expected’ 2 χ = ------numbers are obtained from a‘Contingency (Expected)2 Table’. (more on this later in the course) Again, hat’s off to Wikipedia for some additional info on this distribution (below): In probability theory and statistics, the chi-square distribution (also chi-squared or χ2 distribution) is one of the most widely used theoretical probability distributions in inferential statistics, i.e. in statistical significance tests. It is useful because, under reasonable assumptions, easily calculated quantities can be proven to have distributions that approximate to the chi-square distribution if the null hypothesis is true.If Xi are k independent, normally distributed random variables with mean 0 and variance 1, then the random variable

is distributed according to the chi-square distribution. This is usually written

The chi-square distribution has one parameter: k - a positive integer that specifies the number of degrees of freedom (i.e. the number of Xi). The chi-square distribution is a special case of the gamma distribution.The best-known situations in which the chi-square distribution are used are the common chi-square tests for goodness of fit of an observed distribution to a theoretical one, and of the independence of two criteria of classification of qualitative data. However, many other statistical tests lead to a use of this distribution. One example is Friedman's analysis of variance by ranks.

How to Lie With Statistics Question

#14) Write a 4 to 7 sentence summary of Chapter 1: The Sample with built in Bias ANSWERS WILL VARY: The average income of Yale graduates was used as an example. They asked alumni who came to a re-union and believed what they said. DU can ‘bias’ their ‘Average SAT’ score of applicant by not requiring applicants to submit SAT scores. This biases the average up because applicants with good scores will want to submit whereas applicants with bad scores are more likely to not sbumt their SAT scores #15) Write a 4 to 7 sentence summary of Chapter 2: The well chosen average ANSWERS WILL VARY: The skewed distribution of housing prices is used here as an example. There are many measures of ‘central tendency’ which can sometimes scurrilously be replaced with the technical and precise term of ‘average’. The mean (aka ‘average’, median (aka middle value), and mode (most common value) are all measures of central tendency. All of them could be used to characterize a distribution. The true ‘average’ price of a house in a given housing market is usually higher than what most people pay for a house because housing prices tend to be skewed kind of like a chi-squared distribution – i.e. a long tail to the right. So, if you are trying to impress people about what a nice neighborhood you live in is you might give the average price of a home in it; however, if you are trying to sell someone on the affordability of the neighborhood you might use the median price of a home.

#16) Write a 4 to 7 sentence summary of Chapter 3: The little figures that are not there ANSWERS WILL VARY: Some results are prominently presented even though they may be based on small sample sizes or even biased samples. The old DENTYNE commercial comes to mind: “Four out of Five Dentists, WHO CHEW GUM, choose Dentyne. Come on, how many Dentists chew gum? (I don’t really know probably 99% ) #17) An apochryphal story about the department of Geography at the University of North Carolina (UNC) is that their undergraduate geography majors earn more money than the geography graduates of any other university in the country. It just so happens that Michael Jordan (an extremely well paid basketball player) got his degree from UNC in geography in 1986. Would Darrell Huff describe this as ‘Lying With Statistics’? If so, what category of your above essays would this ‘lie’ fall under. Explain.

ANSWERS WILL VARY: Darrell Huff would describe this as lying with statistics and he would probably categorize it as either a sample with built in bias or as a well-chosen average. Hopefully you get the point in either case.

Recommended publications