
Unit 1: Introduction to data
Lecture 1: Data collection

Statistics 104
Mine Çetinkaya-Rundel
May 14, 2013

Overview of data collection principles: Anecdotal evidence

Anecdotal evidence consists of an anecdote or a descriptive story about an event or phenomenon.

“Smoking is not bad for everyone, my uncle smokes three packs a day and he’s in perfectly good health.”

Anecdotal evidence is often incomplete and inaccurate, and refers to an exceptional event.

Statistical evidence                                                | Anecdotal evidence
Large, representative samples                                       | Very small, biased samples
Precise measurements in controlled situations                       | Casual observations in uncontrolled circumstances
Measure/control all other relevant factors that affect the outcome  | Other important factors are unaccounted for
Careful about making causal connections                             | Causal connections are made too easily

Statistics 104 (Mine Çetinkaya-Rundel) U1 - L1: Data collection May 14, 2013

Overview of data collection principles: Populations and samples

Research question: Can people become better, more efficient runners on their own, merely by running?

Population of interest:

Sample: Group of adult women who recently joined a running group

Population to which results can be generalized:

http://well.blogs.nytimes.com/2012/08/29/finding-your-ideal-running-form

Overview of data collection principles: Sampling from a population

Census

Wouldn’t it be better to just include everyone and “sample” the entire population? This is called a census.

Pros
- provides a true measure of the population (no sampling error)
- detailed information about small sub-groups within the population is more likely to be available

Cons
- costlier than working with a sample in data collection and processing
- some individuals are hard to locate or measure, and those individuals may have characteristics that distinguish them from the rest of the population
- populations rarely stand still
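The "no sampling error" point can be illustrated with a small simulation. The following is a minimal sketch, not part of the original lecture: the population of weekly running distances is made up, and the function name `census_vs_sample` is hypothetical.

```python
import random
import statistics


def census_vs_sample(population, n, seed=0):
    """Compare the census mean with the mean of a simple random sample of size n."""
    rng = random.Random(seed)
    census_mean = statistics.mean(population)  # uses every individual
    sample = rng.sample(population, n)         # n individuals, drawn without replacement
    sample_mean = statistics.mean(sample)
    # Sampling error: how far the sample estimate lands from the census value.
    return census_mean, sample_mean, sample_mean - census_mean


# Hypothetical population (an assumption for illustration):
# weekly running distances in km for 10,000 people.
_rng = random.Random(42)
population = [_rng.gauss(20, 5) for _ in range(10_000)]

census_mean, sample_mean, sampling_error = census_vs_sample(population, n=100)
print(f"census mean:    {census_mean:.2f}")
print(f"sample mean:    {sample_mean:.2f}")
print(f"sampling error: {sampling_error:+.2f}")
```

Rerunning with a different `seed` gives a different sample mean, while the census mean stays fixed. The simulation ignores the practical costs listed under Cons, which are the reason a sample is usually preferred.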

Overview of data collection principles: Sampling from a population

Exploratory analysis to inference

- When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis.
- If you generalize and conclude that your entire soup needs salt, that’s an inference.
- For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).

Overview of data collection principles: Sampling methods

Simple random sample

Randomly select cases from the population, where each case is equally likely to be selected.

[Figure: simple random sample]

Stratified sample

First divide the population into homogeneous strata, then randomly sample from each stratum.

[Figure: stratified sample, with regions labeled Stratum 1 through Stratum 6]

Cluster sample

Clusters are not necessarily homogeneous. First sample a few clusters, then randomly sample from within them. Usually preferred for economic reasons.

[Figure: cluster sample, with regions labeled Cluster 1 through Cluster 9]
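The three sampling methods can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the lecture's own code. The population is hypothetical, with stratum and cluster labels assigned arbitrarily to give six strata and nine clusters as in the figures. The function names are made up, and taking the same number of cases from every stratum is a simplifying choice.

```python
import random
from collections import defaultdict


def simple_random_sample(cases, n, rng):
    """Draw n cases without replacement; every case is equally likely."""
    return rng.sample(cases, n)


def group_by(cases, key):
    """Group cases into lists keyed by key(case)."""
    groups = defaultdict(list)
    for case in cases:
        groups[key(case)].append(case)
    return groups


def stratified_sample(cases, stratum_of, n_per_stratum, rng):
    """Randomly sample n_per_stratum cases from every stratum."""
    sample = []
    for members in group_by(cases, stratum_of).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(cases, cluster_of, n_clusters, n_per_cluster, rng):
    """Randomly pick n_clusters clusters, then randomly sample within each."""
    clusters = group_by(cases, cluster_of)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for cluster_id in chosen:
        sample.extend(rng.sample(clusters[cluster_id], n_per_cluster))
    return sample


# Hypothetical population of 540 cases: (case_id, stratum, cluster).
population = [(i, i % 6 + 1, i % 9 + 1) for i in range(540)]
rng = random.Random(0)

srs = simple_random_sample(population, 30, rng)
strat = stratified_sample(population, lambda c: c[1], 5, rng)
clus = cluster_sample(population, lambda c: c[2], 3, 10, rng)

print(sorted({c[1] for c in strat}))  # every stratum is represented
print(sorted({c[2] for c in clus}))   # only the chosen clusters appear
```

All three calls return 30 cases. The stratified sample always covers every stratum, while the cluster sample draws only from the three selected clusters, which is why it needs fewer sites to visit.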
Overview of data collection principles: Sampling methods

Application exercise: Stratified vs. cluster

What are the differences between stratified and cluster sampling? Give examples of cases where each method would be appropriate.

[Figure: two panels, Stratified (Stratum 1 through Stratum 6) and Cluster (Cluster 1 through Cluster 9)]

Poll

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments, and others a diverse mixture of housing structures. Which approach would likely be the least effective?

(a) Simple random sampling
(b) Cluster sampling
(c) Stratified sampling
(d) Blocked sampling

Overview of data collection principles: A few sources of bias

- Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
- Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.
- Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Poll

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?

I. Some of the mailings may have never reached the parents.
II. Overall, the school district has strong support from parents to move forward with the policy approval.
III. It is possible that the majority of the parents of high school students disagree with the policy change.
IV. The survey results are unlikely to be biased because all parents were mailed a survey.

(a) Only I
(b) I and II
(c) I and III
(d) III and IV
(e) Only IV
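The survey counts in this poll can be turned into proportions to see how much the non-respondents matter. The sketch below is an illustration only. The function name `summarize_mail_survey` is made up, and the "every non-respondent disagrees" scenario is an extreme assumption used to bound what is possible, not a claim about the actual parents.

```python
def summarize_mail_survey(mailed, returned, agreed, disagreed):
    """Return key proportions for a mail survey with non-response."""
    if agreed + disagreed != returned:
        raise ValueError("agreed + disagreed must equal returned")
    nonrespondents = mailed - returned
    return {
        "response_rate": returned / mailed,
        "agree_among_respondents": agreed / returned,
        "agree_among_all_mailed": agreed / mailed,
        # Extreme case (an assumption): every non-respondent disagrees.
        "max_possible_disagree": (disagreed + nonrespondents) / mailed,
    }


summary = summarize_mail_survey(mailed=6000, returned=1200, agreed=960, disagreed=240)
for name, value in summary.items():
    print(f"{name}: {value:.0%}")
# response_rate: 20%
# agree_among_respondents: 80%
# agree_among_all_mailed: 16%
# max_possible_disagree: 84%
```

Although 80% of the returned surveys agree, only 16% of all mailed surveys are known to agree. The remaining 4,800 parents are unobserved, so the returned surveys alone cannot establish what the majority thinks.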
