13 Collecting Statistical Data

13.1 The Population
13.2 Sampling
13.3 Random Sampling

• Polls, studies, surveys, and other data-collecting tools collect data from a small part of a larger group so that we can learn something about the larger group.
• This is a common and important goal of statistics: learn about a large group by examining data from some of its members.

Data
Collections of observations (such as measurements, genders, survey responses).

Statistics
The science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on the data.

Population
The complete collection of all individuals (scores, people, measurements, and so on) to be studied; the collection is complete in the sense that it includes all of the individuals to be studied.

Census
Collection of data from every member of a population.

Sample
Subcollection of members selected from a population.

A Survey
• The practical alternative to a census is to collect data only from some members of the population and use that data to draw conclusions and make inferences about the entire population.
• Statisticians call this approach a survey (or a poll when the data collection is done by asking questions).
• The subgroup chosen to provide the data is called the sample, and the act of selecting a sample is called sampling.
• The first important step in a survey is to distinguish the population to which the survey applies (the target population) from the actual subset of the population from which the sample will be drawn, called the sampling frame.
• The ideal scenario is when the sampling frame is the same as the target population; that would mean that every member of the target population is a candidate for the sample. When this is impossible (or impractical), an appropriate sampling frame must be chosen.
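The terms above (population, census, sampling frame, sample) can be tied together in a small sketch. Everything in it is an assumption made for illustration: the population of 1,000 yes/no answers, the sample size of 100, and the names `population`, `sampling_frame`, and `proportion` are all hypothetical. It models the ideal scenario in which the sampling frame equals the target population.

```python
import random

rng = random.Random(13)  # fixed seed so the sketch is repeatable

# Hypothetical target population: 1,000 individuals, each answering
# True ("yes") or False ("no"). Exactly 40% answer "yes".
population = [i % 5 < 2 for i in range(1000)]


def proportion(values):
    """Fraction of True values in a list of booleans."""
    return sum(values) / len(values)


# Census: collect data from every member of the population.
census_result = proportion(population)

# Ideal scenario: the sampling frame is the whole target population,
# so every member is a candidate for the sample.
sampling_frame = list(range(len(population)))

# Sample: a subcollection of members selected from the frame.
sample_ids = rng.sample(sampling_frame, 100)
sample_result = proportion([population[i] for i in sample_ids])

print(f"census: {census_result:.3f}, sample estimate: {sample_result:.3f}")
```

The census value is exact by construction, while the sample estimate varies with the seed; the survey's job is to make that estimate land close to the census value at a fraction of the cost.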
CASE STUDY 2 THE 1936 LITERARY DIGEST POLL

The U.S. presidential election of 1936 pitted Alfred Landon, the Republican governor of Kansas, against the incumbent Democratic President, Franklin D. Roosevelt. At the time of the election, the nation had not yet emerged from the Great Depression, and economic issues such as unemployment and government spending were the dominant themes of the campaign.

The Literary Digest had used polls to accurately predict the results of every presidential election since 1916, and their 1936 poll was the largest and most ambitious poll ever. The sampling frame for the Literary Digest poll consisted of an enormous list of names that included:
(1) every person listed in a telephone directory anywhere in the United States,
(2) every person on a magazine subscription list, and
(3) every person listed on the roster of a club or professional association.

From this sampling frame a list of about 10 million names was created, and every name on this list was mailed a mock ballot and asked to mark it and return it to the magazine.

• Based on the poll results, the Literary Digest predicted a landslide victory for Landon with 57% of the vote, against Roosevelt’s 43%.
• The election turned out to be a landslide victory for Roosevelt with 62% of the vote, against 38% for Landon.
• The difference between the poll’s prediction and the actual election results was 19 percentage points, the largest error ever in a major public opinion poll.
• For the same election, a young pollster named George Gallup was able to accurately predict a victory for Roosevelt using a sample of “only” 50,000 people.
• What went wrong with the Literary Digest poll, and why was Gallup able to do so much better?

• The first thing seriously wrong with the Literary Digest poll was the sampling frame, consisting of names taken from telephone directories, lists of magazine subscribers, rosters of club members, and so on. Telephones in 1936 were something of a luxury, and magazine subscriptions and club memberships even more so, at a time when 9 million people were unemployed.
• When it came to economic status, the Literary Digest sample was far from being a representative cross section of the voters. This was a critical problem, because voters often vote on economic issues, and given the economic conditions of the time, this was especially true in 1936.
• When the choice of the sample has a built-in tendency (whether intentional or not) to exclude a particular group or characteristic within the population, we say that a survey suffers from selection bias.
• Selection bias must be avoided, but it is not always easy to detect it ahead of time. Even the most scrupulous attempts to eliminate selection bias can fall short.

• The second serious problem with the Literary Digest poll was the issue of nonresponse bias.
• In a typical survey it is understood that not every individual is willing to respond to the survey request (and in a democracy we cannot force them to do so).
• Those individuals who do not respond to the survey request are called nonrespondents, and those who do are called respondents.
• The percentage of respondents out of the total sample is called the response rate.
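The two ideas just defined, response rate and selection bias, can be illustrated with a short sketch. The response-rate figures are the ones quoted in this case study (about 2.4 million ballots returned out of 10 million mailed). The electorate in the second half is entirely hypothetical: the group sizes, the support levels, and the names `groups` and `percent_favoring_a` are assumptions chosen only to show how a frame that leaves out one group can mislead, not a reconstruction of the 1936 electorate.

```python
def response_rate(respondents, sample_size):
    """Percentage of the total sample that responded."""
    return 100 * respondents / sample_size


# Figures quoted for the Literary Digest poll.
digest_rate = response_rate(2_400_000, 10_000_000)

# Hypothetical electorate of 10,000 voters (illustrative numbers only).
# Support for candidate A differs sharply between the two groups.
groups = {
    "has telephone": {"voters": 4000, "favor_A": 2280},
    "no telephone": {"voters": 6000, "favor_A": 1500},
}


def percent_favoring_a(group_names):
    """Percentage favoring candidate A among the listed groups."""
    voters = sum(groups[g]["voters"] for g in group_names)
    favor = sum(groups[g]["favor_A"] for g in group_names)
    return 100 * favor / voters


# Whole target population versus a frame that excludes non-owners.
true_support = percent_favoring_a(groups)
frame_support = percent_favoring_a(["has telephone"])

print(f"response rate: {digest_rate:.0f}%")
print(f"true support: {true_support:.1f}%, frame-only support: {frame_support:.1f}%")
```

In this made-up electorate, even a complete census of the telephone-owning frame overstates support for candidate A, which is why a larger sample drawn from the same frame would not fix the problem.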
• For the Literary Digest poll, out of a sample of 10 million people who were mailed a mock ballot, only about 2.4 million mailed a ballot back, resulting in a 24% response rate.
• When the response rate to a survey is low, the survey is said to suffer from nonresponse bias.
• One of the significant problems with the Literary Digest poll was that the poll was conducted by mail. This approach is the most likely to magnify nonresponse bias, because people often consider a mailed questionnaire just another form of junk mail.

The Literary Digest story has two morals:
(1) You’ll do better with a well-chosen small sample than with a badly chosen large one, and
(2) watch out for selection bias and nonresponse bias.

Examples
• Page 516, problems 17, 18, 19 (Solutions on following slides)
NOTE: Students should omit problem 28 from homework exercises.

Examples: Solutions
17(a) The citizens of Cleansburg.
17(b) The sampling frame is limited to that part of the target population that passes by a city street corner between 4:00 pm and 6:00 pm.
18(a) 475
18(b) Yes, this survey is subject to nonresponse bias. The response rate is 475/(1313 + 475) ≈ 0.266.
19(a) The choice of street corner could make a difference in the responses collected.
19(b) Interviewer D. We are assuming that people who live or work downtown are more likely to answer yes than people in other parts of town.
19(c) Yes. People on the street between 4:00 pm and 6:00 pm are not representative of the population at large. Also, the five street corners were chosen by the interviewers, and the passersby are unlikely to represent a cross section of the target population.
19(d) Omit.

Convenience Sampling
• One commonly used shortcut in sampling is known as convenience sampling. In convenience sampling, the selection of which individuals are in the sample is dictated by what is easiest or cheapest for the data collector, with no attempt to get a representative sample.
• A classic example of convenience sampling is when interviewers set up at a fixed location such as a mall or outside a supermarket and ask passersby to be part of a public opinion poll.
• A different type of convenience sampling occurs when the sample is based on self-selection: the sample consists of those individuals who volunteer to be in it.
• Self-selection is the reason why many Area Code 800 polls are not to be trusted. Convenience sampling is not always bad; at times there is no other choice, or the alternatives are so expensive that they have to be ruled out.

Quota Sampling
• Quota sampling is a systematic effort to force the sample to be representative of a given population through the use of quotas: the sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas, and so on. The proportions in each category in the sample should be the same as those in the population.
• If we can assume that every important characteristic of the population is taken into account when the quotas are set up, it is reasonable to expect that the sample will be representative of the population and produce reliable data.
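The rule that the proportions in each category of the sample should match those in the population can be sketched as a small quota calculation. The category names, the population counts, and the sample size of 250 below are hypothetical, and rounding leftover places by largest remainder is one reasonable choice made here for illustration rather than a rule stated in these slides.

```python
from math import floor


def quota_allocation(population_counts, sample_size):
    """Split sample_size across categories in proportion to population counts."""
    total = sum(population_counts.values())
    # Exact (possibly fractional) share of the sample for each category.
    exact = {g: sample_size * c / total for g, c in population_counts.items()}
    # Start from the whole-number part of each share.
    quotas = {g: floor(x) for g, x in exact.items()}
    # Hand out any remaining places to the largest fractional remainders.
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


# Hypothetical population of 10,000 people in four categories.
population_counts = {
    "urban women": 3120,
    "urban men": 2880,
    "rural women": 2050,
    "rural men": 1950,
}

quotas = quota_allocation(population_counts, 250)
for group, quota in quotas.items():
    print(f"{group}: {quota}")
```

The quotas only guarantee representativeness on the characteristics that were listed; any important characteristic left out of `population_counts` is not controlled by this calculation.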
