
Unit 1: Introduction to data
Lecture 1: Data collection, observational studies, and experiments

Statistics 101
Mine Çetinkaya-Rundel
January 15, 2013

Have you ...

been placed into a team?
successfully logged on to RStudio?

If not, see me after class.

Also, there are still a few of you who haven’t completed the class survey, please do so ASAP.

http://stat.duke.edu/courses/Spring13/sta101.001/schedule.html

Statistics 101 (Mine Çetinkaya-Rundel) U1 - L1: Data coll., obs. studies, experiments January 15, 2013

Readiness assessment

15 mins for individual, answer using clickers
15 mins for team, answer using scratch off sheets
  1 pt for each question correct on the first try
  0.5 pts for each question correct on the second try
  no points for more than 2 tries
  write your team name and tally your scores
Representative from each team turns in scratch off sheets and all paper copies of assessments

Overview of data collection principles

Anecdotal evidence and early smoking research

Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely unaffected.
Anti-smoking research was faced with resistance based on anecdotal evidence such as “My uncle smokes three packs a day and he’s in perfectly good health”, evidence based on a limited sample size that might not be representative of the population.
It was concluded that “smoking is a complex human behavior, by its nature difficult to study, confounded by human variability.”
In time researchers were able to examine larger samples of cases (smokers) and trends showing that smoking has negative health impacts became much clearer.

Brandt, The Cigarette Century (2009), Basic Books.

Populations and samples

Research question: Can people become better, more efficient runners on their own, merely by running?
Population of interest:
Sample: Group of adult women who recently joined a running group
Population to which results can be generalized:

http://well.blogs.nytimes.com/2012/08/29/finding-your-ideal-running-form

Census

Wouldn’t it be better to just include everyone and “sample” the entire population? This is called a census.
There are problems with taking a census:
  It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And there may be certain characteristics about those individuals who are hard to locate.
  Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.
  Taking a census may be more complex than sampling.


Sampling from a population

http://www.npr.org/templates/story/story.php?storyId=125380052

Exploratory analysis to inference

Sampling is natural. Think about sampling something you are cooking - you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.
When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis.
If you generalize and conclude that your entire soup needs salt, that’s an inference.
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
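The soup analogy can be played out numerically. The following is only a minimal sketch under made-up assumptions: the pot is modeled as 100 layers, the salt level of each layer is a hypothetical number, and all function names are invented for illustration.

```python
import random

def make_pot(n_layers=100, salty_bottom_layers=20):
    """Hypothetical unstirred pot: layer 0 is the surface, salt has sunk to the bottom."""
    # Salt level per layer in made-up units: 0.0 near the surface, 5.0 at the bottom
    return [5.0 if i >= n_layers - salty_bottom_layers else 0.0
            for i in range(n_layers)]

def mean(values):
    return sum(values) / len(values)

def surface_spoonful(pot, size=10):
    # Unstirred soup: the spoon only reaches the top `size` layers
    return pot[:size]

def stirred_spoonful(pot, size=10, seed=0):
    # Stirring gives every layer the same chance of ending up in the spoon
    rng = random.Random(seed)
    return rng.sample(pot, size)

pot = make_pot()
print("salt level, whole pot:       ", mean(pot))
print("salt level, surface spoonful:", mean(surface_spoonful(pot)))
print("salt level, stirred spoonful:", mean(stirred_spoonful(pot)))
```

The surface spoonful always misses the salt, no matter how often you taste. A stirred spoonful varies from taste to taste, but on average it matches the pot.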

Sampling methods

Simple random sample

Randomly select cases from the population, each case is equally likely to be selected.

[Figure: dot plot of the population with the sampled cases highlighted]

Stratified sample

Strata are homogenous, simple random sample from each stratum.

[Figure: dot plot of the population divided into Stratum 1 through Stratum 6]

Cluster sample

Clusters are not necessarily homogenous, simple random sample from a random sample of clusters. Usually preferred for economical reasons.

[Figure: dot plot of the population divided into Cluster 1 through Cluster 9]

Clicker question

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments, and others a diverse mixture of housing structures. Which approach would likely be the least effective?

(a) Simple random sampling
(b) Cluster sampling
(c) Stratified sampling
(d) Blocked sampling
(e) Anecdotal sampling
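The three sampling methods (simple random, stratified, cluster) can be sketched in a few lines. This is an illustrative sketch, not a survey-grade implementation: the neighborhoods and household labels are hypothetical, and the same grouping is reused as both strata and clusters only to keep the example short. In practice strata should be internally similar, while clusters need not be.

```python
import random

def simple_random_sample(population, n, rng):
    # Every case is equally likely to be selected
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    # Take a simple random sample from EVERY stratum
    sample = []
    for cases in strata.values():
        sample.extend(rng.sample(cases, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    # First randomly pick a few clusters, then sample within the chosen ones only
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample

# Hypothetical data: 6 neighborhoods with 50 households each
neighborhoods = {
    name: [f"{name}-house{i}" for i in range(50)]
    for name in ["A", "B", "C", "D", "E", "F"]
}
population = [h for houses in neighborhoods.values() for h in houses]

rng = random.Random(42)
srs = simple_random_sample(population, 12, rng)
strat = stratified_sample(neighborhoods, 2, rng)   # 2 households from each of 6 strata
clus = cluster_sample(neighborhoods, 2, 6, rng)    # 6 households from each of 2 clusters

print("simple random:", srs)
print("stratified:   ", strat)
print("cluster:      ", clus)
```

All three samples contain 12 households. The stratified sample is guaranteed to cover every neighborhood, while the cluster sample only visits two of them, which is why it tends to be cheaper to collect.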

Sampling bias

A few sources of bias

Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.
Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Image source: cnn.com, Jan 14, 2012

Landon vs. FDR

A historical example of a biased sample yielding misleading results:

In 1936, Landon sought the Republican presidential nomination opposing the re-election of FDR.


The Literary Digest Poll

The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million.
The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes.
Election result: FDR won, with 62% of the votes.
The magazine was completely discredited because of the poll, and was soon discontinued.

The Literary Digest Poll - what went wrong?

The magazine had surveyed
  its own readers,
  registered automobile owners, and
  registered telephone users.
These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time.
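A small simulation can show why a huge sample drawn from the wrong lists does worse than a small random one. This is a sketch with invented numbers: only the 62% overall FDR share comes from the election result quoted earlier. The electorate size, the share of voters who appear on the lists, and their 40% FDR support are assumptions chosen for illustration.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate of 100,000 voters: 1 = supports FDR, 0 = supports Landon.
# Assumption: 30,000 voters appear on the magazine's lists, and only 40% of them support FDR.
listed = [1] * 12_000 + [0] * 18_000
unlisted = [1] * 50_000 + [0] * 20_000
electorate = listed + unlisted  # 62,000 of 100,000 support FDR

def share(sample):
    """Fraction of the sample that supports FDR."""
    return sum(sample) / len(sample)

big_biased = rng.sample(listed, 20_000)        # huge sample, drawn only from the lists
small_random = rng.sample(electorate, 1_000)   # small simple random sample of everyone

print(f"true FDR share:              {share(electorate):.3f}")
print(f"biased sample (n = 20,000):  {share(big_biased):.3f}")
print(f"random sample (n = 1,000):   {share(small_random):.3f}")
```

Under these assumptions the biased sample settles very precisely on the wrong answer, while the much smaller random sample lands near the true share.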

Large samples are preferable, but...

The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction.
Back to the soup analogy: If the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still not taste right. If the soup is well stirred, it doesn’t matter whether you have a large or small spoon, it will taste fine either way.

Clicker question

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?

I. Some of the mailings may have never reached the parents.
II. The school district has strong support from parents to move forward with the policy approval.
III. It is possible that a majority of the parents of high school students disagree with the policy change.
IV. The survey results are unlikely to be biased because all parents were mailed a survey.

(a) Only I (b) I and II (c) I and III (d) III and IV (e) Only IV
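The arithmetic behind the parking survey is worth working through. The sketch below uses only the counts given in the question. The variable names are made up, and the bounds assume nothing at all about how non-respondents feel.

```python
mailed = 6000
returned = 1200
agreed = 960
disagreed = 240

response_rate = returned / mailed             # share of mailed surveys that came back
agree_among_respondents = agreed / returned   # share of respondents who agree

non_respondents = mailed - returned

# Bounds on the share of ALL mailed parents who disagree:
# lowest if every non-respondent agrees, highest if every non-respondent disagrees
min_disagree = disagreed / mailed
max_disagree = (disagreed + non_respondents) / mailed

print(f"response rate:            {response_rate:.0%}")
print(f"agree among respondents:  {agree_among_respondents:.0%}")
print(f"disagree among all mailed: between {min_disagree:.0%} and {max_disagree:.0%}")
```

Because so few surveys were returned, the share of respondents who agree says little about the full group of parents. The range of possible disagreement is very wide.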


Observational studies and experiments

Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables.
Experiment: Researchers randomly assign subjects to various treatments in order to be able to establish causal connections between the explanatory and response variables.

If you’re going to walk away with one thing from this class, let it be “correlation does not imply causation”.

http://xkcd.com/552/

Cereal breakfast

What type of study is this, an observational study or an experiment?

“Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days.”

What is the conclusion of the study?
Who sponsored the study?

3 possible explanations:

1 Eating breakfast causes girls to be thinner.
2 Being thin causes girls to eat breakfast.
3 A third variable is responsible for both. What could it be?

Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between the two, are called confounding variables.
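The third explanation can be illustrated with a simulation in which a hidden variable drives both measurements and neither affects the other. This is a sketch on entirely simulated data: the hidden variable `z`, the noise levels, and the variable names are hypothetical and are not taken from the breakfast study.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n = 5000

# Hypothetical third variable that influences both measurements
z = [rng.gauss(0, 1) for _ in range(n)]
# Neither variable depends on the other, only on z plus independent noise
breakfast = [zi + rng.gauss(0, 1) for zi in z]
weight = [-zi + rng.gauss(0, 1) for zi in z]

overall = pearson(breakfast, weight)

# Hold the third variable roughly fixed and look again
band = [(b, w) for zi, b, w in zip(z, breakfast, weight) if abs(zi) < 0.1]
within = pearson([b for b, _ in band], [w for _, w in band])

print(f"correlation overall:             {overall:.2f}")
print(f"correlation with z held ~fixed:  {within:.2f}")
```

The two variables are clearly associated overall, yet the association mostly disappears once the third variable is held roughly constant, even though no causal link between them was built into the simulation.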


Project ideas - observational studies

1 numerical: Is the average number of hours Americans spend relaxing after work different than the European average of 3 hours/day?
[Data: Number of hours relaxing after work]
1 categorical: Estimate the percentage of North Carolina residents living below the poverty line who planned to vote Republican in the most recent presidential election.
[Data: Vote Republican - yes, no]
1 numerical and 1 categorical: Is there a relationship between mom’s working status during the first 5 years of the child’s life and the child’s education?
[Data: Number of years of education of child; Mom’s working status - yes, no]
2 categorical: Do racial minority groups in North Carolina have less access to health care coverage?
[Data: Ethnicity - white, minority; Health coverage - yes, no]

Experiments

Principles of experimental design

1 Control: Compare treatment of interest to a control group.
2 Randomize: Randomly assign subjects to treatments.
3 Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study.
4 Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.

More on blocking

We would like to design an experiment to investigate if energy gels make you run faster:
  Treatment: energy gel
  Control: no energy gel
It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status:
  Divide the sample into pro and amateur
  Randomly assign pro athletes to treatment and control groups
  Randomly assign amateur athletes to treatment and control groups
  Pro/amateur status is equally represented in the resulting treatment and control groups
Why is this important? Can you think of other variables to block for?

Clicker question

A study is designed to test the effect of light level and noise level on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so wants to make sure both genders are represented equally under different conditions. Which of the below is correct?

(a) There are 3 explanatory variables (light, noise, gender) and 1 response variable (exam performance)
(b) There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance)
(c) There is 1 explanatory variable (gender) and 3 response variables (light, noise, exam performance)
(d) There are 2 blocking variables (light and noise), 1 explanatory variable (gender), and 1 response variable (exam performance)
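The blocking procedure for the energy gel experiment can be sketched as follows. This is one possible way to do it, not a prescribed method: the athlete names, the group sizes, and the function name are hypothetical.

```python
import random

def blocked_assignment(subjects, block_of, rng):
    """Randomly assign subjects to treatment or control separately within each block."""
    # Step 1: group subjects into blocks (here: pro vs. amateur)
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)

    # Step 2: randomize within each block, half to each group
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for s in shuffled[:half]:
            assignment[s] = "treatment"  # energy gel
        for s in shuffled[half:]:
            assignment[s] = "control"    # no energy gel
    return assignment

# Hypothetical sample: 8 pro and 12 amateur athletes
pros = [f"pro{i}" for i in range(8)]
amateurs = [f"amateur{i}" for i in range(12)]
block_of = {name: "pro" for name in pros}
block_of.update({name: "amateur" for name in amateurs})

rng = random.Random(7)
assignment = blocked_assignment(pros + amateurs, block_of, rng)

for group in ("treatment", "control"):
    n_pro = sum(1 for s in pros if assignment[s] == group)
    n_am = sum(1 for s in amateurs if assignment[s] == group)
    print(f"{group}: {n_pro} pro, {n_am} amateur")
```

Randomizing within each block guarantees that pro and amateur athletes are split evenly between the two groups. Randomizing the whole sample at once would balance them only on average.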


Difference between blocking and explanatory variables

Factors are conditions we can impose on the experimental units.
Blocking variables are characteristics that the experimental units come with, that we would like to control for.
Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.

More experimental design terminology...

Placebo: fake treatment, often used as the control group for medical studies
Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
Blinding: when experimental units do not know whether they are in the control or treatment group
Double-blind: when both the experimental units and the researchers do not know who is in the control and who is in the treatment group

Project ideas - experiments

1 numerical and 1 categorical: Is there a relationship between memory and distraction? Randomly assign 20 students to two groups: one group memorizes a list of words while also listening to music, another group memorizes the same words in silence. Compare average number of words memorized in the two groups.
[Data: Number of words memorized; Group - treatment, control]
2 categorical: Is there a relationship between learning and distraction? Randomly assign a group of students to two groups: one group studies a concept while also listening to music, the other group studies in silence using the same materials. Then test whether or not they learned the concept.
[Data: Whether or not the students learned the concept - yes, no; Group - treatment, control]

Recap

Clicker question

What is the main difference between observational studies and experiments?

(a) Experiments take place in a lab while observational studies do not need to.
(b) In an observational study we only look at what happened in the past.
(c) Most experiments use random assignment while observational studies do not.
(d) Observational studies are completely useless since no causal inference can be made based on their findings.


Random assignment vs. random sampling

                   | Random assignment                 | No random assignment                         |
Random sampling    | Causal conclusion, generalized to | No causal conclusion, correlation statement  | Generalizability
                   | the whole population.             | generalized to the whole population.         |
                   | (ideal experiment)                | (most observational studies)                 |
No random sampling | Causal conclusion, only for the   | No causal conclusion, correlation statement  | No generalizability
                   | sample.                           | only for the sample.                         |
                   | (most experiments)                | (bad observational studies)                  |
                   | Causation                         | Correlation                                  |
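The grid of random assignment against random sampling amounts to two independent yes/no questions. As a small sketch (the function name and return format are made up for illustration), it can be written as a lookup:

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> str:
    """Return the kind of conclusion a study design supports."""
    # Random assignment decides causation vs. correlation
    if random_assignment:
        claim = "Causal conclusion"
    else:
        claim = "No causal conclusion, correlation statement"
    # Random sampling decides how far the conclusion reaches
    if random_sampling:
        reach = "generalized to the whole population"
    else:
        reach = "only for the sample"
    return f"{claim}, {reach}."

for sampling in (True, False):
    for assignment in (True, False):
        print(f"random sampling={sampling}, random assignment={assignment}: "
              f"{scope_of_inference(sampling, assignment)}")
```

Writing it this way makes the separation explicit: how subjects were selected affects only generalizability, and how treatments were assigned affects only causal claims.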
