8/30/16

Unit 1: Section 1.1 & 1.2 in the Text

In God we trust. All others must bring data. – W. Edwards Deming

Unit 1 Outline

• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)


Variables and their Measurements

• Variable: Any characteristic that takes different values for different individuals in a sample or population
• Two major types of variables: categorical and quantitative
• A categorical variable is a variable that can take on a few different values (categories) when measured. Sometimes called qualitative variables.
• A quantitative variable is a variable that is measured on a numerical scale covering a large range of values.
• Examples?
  Categorical:
  Quantitative:

Categorical Variables

• Two types: Nominal and Ordinal
• Nominal variable: a categorical variable in which the categories are unordered.
• Ordinal variable: a categorical variable in which the categories have an order or hierarchy (and can possibly be numeric), but there is “no defined distance between levels on the measurement scale”
• Examples?
  Nominal:
  Ordinal:


Quantitative Variables

• Two types: Discrete and Continuous.
• Both are measured on an interval scale. That is, there is a specific numerical distance between any two measurements.
• Discrete variable: a quantitative variable that can only take on specific numbers, like 0, 1, 2, …
• Continuous variable: a quantitative variable that can take an infinite number of possibilities within a range of numbers
• Examples:
  Discrete:
  Continuous:

Dummy Variables

[Cartoon: "1 + 1 = 3", "What a Dummy!", "In English please?"]

• There is a special type of “nominal” categorical variable called a dummy variable or indicator variable.
• These variables take on only 2 possible values: 0 or 1. The one usually stands for success or yes, while the zero usually stands for failure or no.
• By convention, they are usually named after the category that is a success.
• Example: to represent sex/gender, we could define a dummy variable named female, which would be 1 for all women, and 0 for all men:

  $$\text{female} = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male} \end{cases}$$
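As a small illustration of the dummy variable named female, here is a minimal Python sketch. The records, the field name `sex`, and the helper `female_dummy` are hypothetical and exist only for this example.

```python
# Hypothetical records; the field name "sex" and its values are assumptions.
people = [
    {"name": "A", "sex": "female"},
    {"name": "B", "sex": "male"},
    {"name": "C", "sex": "female"},
]


def female_dummy(sex: str) -> int:
    """Return 1 if the recorded category is 'female', otherwise 0."""
    return 1 if sex == "female" else 0


# One 0/1 value per individual, named after the "success" category.
female = [female_dummy(p["sex"]) for p in people]
print(female)  # [1, 0, 1]

# The mean of a 0/1 dummy is the proportion of 1s in the data.
proportion_female = sum(female) / len(female)
```

Because the variable only takes the values 0 and 1, its average is the share of individuals in the "success" category, which is one reason this coding is convenient.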

Summary of types of Variables:

Variables
  Categorical
    Nominal
      Dummy (special case)
    Ordinal
  Quantitative
    Discrete
    Continuous
[Diagram annotations: "(more common)" and "(less common)"]

**Note: in this class (and most of statistics), the most important difference is that between categorical and quantitative variables. That differentiation will typically determine the type of statistics and analysis used. Nominal and ordinal variables are often treated the same. Same for discrete and continuous variables.

Unit 1 Outline

• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)


Collecting Data

• Data can be collected in many ways:
  1. Anecdotal information
  2. Available data
  3. Observational studies
  4. Randomized experiments
• The further down the list you go, the more reliable the information is. And the conclusions you can draw will typically then be stronger.

Anecdotal evidence

• Anecdotal evidence is based on haphazardly selected individual cases, that often come to our attention because they are striking (probably not representative)
• Example: Politicians often cite the case of a single individual to invoke a public response consistent with the politicians’ desire (a sample of size n = 1)
• “Ask for averages, not testimonials”


Available data

• Available data are data that were produced in the past for some other purpose but may help answer a present question
• Many use available data because producing new data is expensive (nearly always most costly part of research).
• There are lots of reliable available datasets on the web rich with information. Some examples:
  ____: http://www.census.gov/#
  ____: http://www3.norc.org/gss+website/
  ____: http://www.hcup-us.ahrq.gov/nisoverview.jsp

Observational Studies

• An observational study is one in which data is collected by merely observing the measurements on the individuals in the sample. No attempt to influence or intervene with the subject is taken.
• May be difficult to reach causal conclusions (that changing one variable causes another variable to change) since other variables may be muddling up (called confounding) this relationship.
• Example: Does smoking cigarettes increase your risk of heart disease?
• Example: Let’s come up with a ____ of Harvard:


Observational Studies

Pros:
• Usually cheap
• The only option when a randomized experiment is not feasible or is unethical
• Showing causation is not always necessary
  • Risk factors for medical decisions, population statistics.
• Risk factor (common in medicine and ____) - a variable associated with an increased risk of disease or infection.
• Examples?

Observational Studies

Cons:
• Establishing causation may be impossible due to the presence of confounding variables.
• Requires advanced statistical methods and unverifiable assumptions.
• Confounding variable (or factor), sometimes referred to as a confounder or a lurking variable, affects both the group membership and the outcome (or dependent) variable.
• This third variable causes the two variables to falsely appear to be related.
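To see how a third variable can make two otherwise unrelated variables appear related, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the variable names (temperature, ice cream sales, crime), the coefficients, and the noise levels are invented, not taken from any data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)

# Assumed confounder: daily temperature drives both outcomes.
temperature = [rng.uniform(0, 30) for _ in range(1000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
crime = [1.0 * t + rng.gauss(0, 5) for t in temperature]

# The two outcomes never influence each other, yet they are correlated.
raw_corr = pearson(ice_cream, crime)

# Remove the temperature effect (known here only because we simulated it).
ice_resid = [i - 2.0 * t for i, t in zip(ice_cream, temperature)]
crime_resid = [c - 1.0 * t for c, t in zip(crime, temperature)]
adjusted_corr = pearson(ice_resid, crime_resid)

print(round(raw_corr, 2), round(adjusted_corr, 2))
```

In this simulation the raw correlation is strong, while the correlation left after removing the confounder's contribution is close to zero. With real observational data the confounder's effect is not known, which is why adjustment requires assumptions that cannot be fully verified.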


Confounding Variables

• Name confounding variables that may induce the following associations:
  • The association between the amount of serious crime committed and the amount of ice cream sold by street vendors.
  • Drink More Diet Soda, Gain More Weight?: Overweight Risk Soars 41% With Each Daily Can of Diet Soft Drink.
  • Negative correlation between the size of one’s palm and their life expectancy.

Experiments

• An experiment is a study in which an investigator imposes an intervention (e.g. treatment) on individuals in order to observe their response.
• Clinical trials are a type of experiment
• An Example: A comparison of different drugs for women with breast cancer, often with as few as 100 people.
• The experimenter chooses which women in the study receive the different levels of the drug (new therapy vs. old therapy). The levels of the drug are called the treatment.
• The outcome of the study may be the measured amount of disease-free survival for each woman


Experiments: a few details

• There always have to be at least two treatment groups to compare. The ‘default’ condition is often called the control group (standard-of-care in clinical trials).
• The control group may receive a placebo treatment. This is a treatment that looks like the active treatment (classic ex: a ‘sugar pill’)
• The subjects should be randomized to the treatment groups. That is, chance should decide which patients receive the treatments
• On average, this balances all other variables across the treatment groups
• To ensure this balance, the study needs to be replicated enough times.

Experiments

• An experiment is the best (only?) way to determine if one variable (the treatment) causes another variable (the outcome) to vary.
• However, they are not always ethical or plausible. You cannot knowingly do harm to human subjects by forcing them to take a dangerous treatment (ex: forcing them to smoke)
• Experiments may not mimic real life (the conditions in which an experiment is run are often too ‘perfect’ or unrealistic). So there is often some loss of generalization of them to the real world.
• They are also the most expensive way to collect data
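The idea that chance should decide who receives the treatment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1,000 subjects, their `age` covariate, and the 50/50 split are all hypothetical.

```python
import random

rng = random.Random(2016)

# Hypothetical subjects with one background variable (age) we do not control.
subjects = [{"id": i, "age": rng.gauss(40, 10)} for i in range(1000)]

# Let chance decide: shuffle a copy, then split it in half.
shuffled = subjects[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment = shuffled[:half]
control = shuffled[half:]


def mean_age(group):
    """Average age in a group of subjects."""
    return sum(s["age"] for s in group) / len(group)


# With many subjects, the groups tend to look alike on background variables.
age_gap = mean_age(treatment) - mean_age(control)
print(round(age_gap, 2))
```

The gap in average age between the two groups is typically small when the groups are large. With only a handful of subjects the same procedure can leave sizeable imbalances by chance, which is why enough replication matters.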


Confounding Variables and ____

• Suppose we would like to compare two methods of teaching introductory statistics.
• At Harvard, one professor uses a standard lecturing set-up in his class and another professor uses an interactive clicker approach in her class.
• Students in the two classes are given achievement tests to see how well they learned the material.
• Confounding variables?
• Better experiment?

Unit 1 Outline

• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)


Population vs. Sample

• Population: entire group of individuals on which we desire information.
  • Technicality: actual vs. conceptual populations
  • For our Harvard study:
• Sample: a part of the population on which we actually collect data.
  • For our Harvard study:

Parameters and Statistics

• Parameter (often called an estimand): a numerical summary of the population (like µ or p).
  • For our Harvard study:
• Statistic: a numerical summary of the sample data.
  • For our Harvard study:
• Estimator: a statistic used as a guess for the value of the estimand (x̄ or p̂).
• Estimate: a particular realization of the estimator (4/12 = 0.33).
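The distinction between a parameter, an estimator, and an estimate can be made concrete with a short Python sketch. The population below (1,000 units, 300 of them coded 1) is invented for illustration; only the 4/12 realization comes from the slide.

```python
import random

rng = random.Random(0)

# Hypothetical population: 1 = has the trait, 0 = does not.
population = [1] * 300 + [0] * 700

# Parameter: a numerical summary of the whole population (usually unknown).
p = sum(population) / len(population)


def p_hat(sample):
    """Estimator: the sample proportion, a rule applied to any sample."""
    return sum(sample) / len(sample)


# Estimate: the value the estimator takes on one particular sample.
sample = rng.sample(population, 12)
estimate = p_hat(sample)

# The realization quoted on the slide: 4 successes out of 12.
slide_estimate = p_hat([1] * 4 + [0] * 8)
print(p, estimate, round(slide_estimate, 2))
```

The parameter stays fixed, while the estimate changes from sample to sample because the estimator is applied to different data each time.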


Selection of Study Units

Study/experimental unit/subject - one member of a set of entities being studied.

Two extremes of a selection mechanism:
• Self-selection (volunteers, haphazard)
• Random sampling

Analysis, Estimates, & Inference
Purpose: describe population characteristics.

Parameter vs. Estimate

Parameter (also, estimand) - proportion of childless households in the population.
Estimate - proportion of childless households in the sample.

[Figure legend: childless household; household with children under 18]


Entire population = all possible units

Target Population: a collection of units a researcher is interested in; a group about which the researcher wishes to draw conclusions.


Census: sample everybody in target population

Census

Pros:
• In principle, no need to use ____

Cons:
• Expensive
• Long and difficult
• In practice, never perfect:
  • Respondents are often not representative of target population!


Sampling frame: Collection of units that are potential members of the sample.

[Diagram labels: Target population, Undercoverage, Overcoverage]

Respondents: Sampling units whose data were actually obtained.

Sample: a [randomly selected] subset of a sampling frame

Sampling Steps

Population

Target population

Sampling frame

Sample

Respondents


Random Sampling

• Ensures that all subpopulations in the overall population are roughly represented in the sample.
• What is the simplest way of collecting a random sample?

Selecting a Sample from a Sampling Frame

• Simple Random Sampling (SRS) – every subset of n units has equal chance to be selected
  • Pick size n (may use power analysis)
  • Enumerate all units
  • Pick n numbers randomly
• Small example: selection of a 3-member advisory committee at random from the 11 faculty members of the Stat Dept.
  • What is the population? What is the sample?
  • What’s the chance that any one specific member is selected for the committee?
  • (Stat 110 question): How many different 3-person committees can be formed?
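The committee example can be worked through with a short Python sketch. The faculty names are placeholders, and the counting argument (every 3-person subset equally likely) is the SRS definition applied to n = 3 out of 11.

```python
import math
import random

# Placeholder labels for the 11 faculty members (names are hypothetical).
faculty = [f"faculty_{i}" for i in range(1, 12)]

# Number of distinct 3-person committees: "11 choose 3".
n_committees = math.comb(11, 3)

# Committees containing one specific member: choose the other 2 from 10.
p_member = math.comb(10, 2) / n_committees

# One simple random sample of size 3: every subset is equally likely.
rng = random.Random(110)
committee = rng.sample(faculty, 3)

print(n_committees, round(p_member, 4), committee)
```

Here there are 165 possible committees, and any one specific member is selected with probability 45/165 = 3/11, which is simply n/N for a simple random sample.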


Random Sampling example

• If we were to draw a simple random sample of n = 60 students from all Harvard undergrads, we could:
  1) Write out the sampling frame: the list of all individuals in the population.
  2) Assign each of the N members a number from 1 to N.
  3) Use a random numbers table or software to generate random numbers:
     So if N is a 4-digit number, then we could just generate random 4-digit numbers, and choose the individuals based on those numbers

• Systematic Random Sampling - select every kth unit from the ordered sampling frame, starting randomly from one of the first k positions.
  • Easier to administer
  • Requires well-mixed population
• Variable probability sampling – allow units to have unequal probabilities of being sampled.
  • Requires more careful analysis that involves weighting.
  • Example: stratified sampling – split the population into homogeneous subpopulations and use SRS (or another method) within a sampling frame of each subpopulation.
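The numbered-frame procedure for a simple random sample and the every-kth-unit rule for a systematic sample can both be sketched in Python. The frame of N = 6000 placeholder students and the function names are assumptions for illustration; the slide gives only n = 60.

```python
import random


def simple_random_sample(frame, n, rng):
    """Number the frame 1..N, draw n distinct numbers, return those units."""
    N = len(frame)
    numbers = rng.sample(range(1, N + 1), n)
    return [frame[i - 1] for i in numbers]


def systematic_sample(frame, k, rng):
    """Start at a random one of the first k positions, then take every kth unit."""
    start = rng.randrange(k)
    return frame[start::k]


rng = random.Random(35)

# Hypothetical sampling frame: N = 6000 students (a 4-digit N, as on the slide).
frame = [f"student_{i}" for i in range(1, 6001)]

srs = simple_random_sample(frame, 60, rng)

# k = N / n = 100 gives a systematic sample of the same size.
systematic = systematic_sample(frame, 100, rng)

print(len(srs), len(systematic))
```

The systematic sample needs only one random number, which is why it is easier to administer. If the frame is ordered in a way that relates to the outcome, however, taking every kth unit can give a misleading sample, hence the requirement of a well-mixed population.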


Stratified random samples

Basic idea: sample important groups separately, then combine these samples

1) Divide population into groups of similar individuals, called strata
2) Choose a separate simple random sample within each stratum
3) Combine the results of the simple random samples together to form the overall statistic, weighting each separate stratum correctly to mimic the population

Stratified Sample Example

• We could perform a stratified sample at Harvard by randomly selecting 5 individuals from each house, and then combining them into one sample of n = 60.

What’s an advantage to this stratified sample compared to the SRS?
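The three stratified-sampling steps can be sketched in Python as follows. The data are invented: 12 strata (implied by 5 per house and n = 60) with hypothetical sizes, and a made-up numeric measurement for each student.

```python
import random

rng = random.Random(12)

# Step 1: strata. Hypothetical houses of different sizes; each value is a
# made-up measurement for one student (e.g. hours of sleep).
houses = {
    f"house_{h}": [rng.gauss(7, 1) for _ in range(300 + 20 * h)]
    for h in range(1, 13)
}

# Step 2: a separate simple random sample of 5 within each stratum.
per_house = 5
samples = {name: rng.sample(members, per_house) for name, members in houses.items()}
combined = [x for s in samples.values() for x in s]  # overall n = 60

# Step 3: combine, weighting each stratum mean by its share of the population.
N = sum(len(members) for members in houses.values())
stratified_mean = sum(
    (len(houses[name]) / N) * (sum(s) / per_house) for name, s in samples.items()
)

print(len(combined), round(stratified_mean, 2))
```

Every stratum is guaranteed to appear in the sample, which a simple random sample of the same size does not ensure. The weights matter because equal sample sizes per house over-represent small houses relative to their share of the population.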

Unit 1 Outline

• Variables & Measurement
• Collecting Data
• Sampling
• Random Assignment (for causal inference)

Assignment to Groups

[Diagram: random sampling, followed by assignment to groups/treatments]

Two extremes of an assignment mechanism:
• Haphazard (or unknown)
• Random

If the assignment mechanism is random, the expected proportion of childless couples in each (treatment) group is the same.


Assignment to Groups

• Complete Randomization (parallel to SRS).
• Stratified or clustered Randomization (parallel to stratified sampling)
  • Ensures that representatives from all strata are present in each treatment group.
• How not to randomize:

Inferences Permitted by Study Design

• If sampled randomly:
• If groups are randomly assigned:
• If sampled randomly AND groups are randomly assigned:


Study's generalizability

• Internal validity is the validity of (causal) inferences in a scientific study.
  • Should be established first.
  • It is low when there are
    • unaccounted confounding factors;
    • ignored ____;
    • noncompliance;
    • unverified assumptions;
    • suboptimal method of analysis.
• A study that readily allows its findings to generalize to the population at large has high external validity.

Observational studies

• Difficult to draw inferences about population
• Difficult to draw causal inferences


Concepts to review: (some will be covered in Units 2 & 3)

• Outcomes, events, probability
• Random variables (r.v.)
• ____ of r.v.
• Indicator variables
• Bernoulli and Binomial Distribution
• Normal (Gaussian) distribution
• ____, ____, standard deviation (SD)
• ____
• Sample mean, sample variance, sample SD
• Law of Large Numbers
• ____


The Last Word
