
Warm-Up and Data Basics Announcements

Unit 1: Introduction to data
Lecture 2: Exploratory Data Analysis

Statistics 101

Nicole Dalzell

May 14, 2015

Statistics 101 (Nicole Dalzell) U1 - L2: EDA May 14, 2015 2 / 1

Warm-Up and Data Basics: Review

Example Study: A researcher is interested in whether or not cats will choose to sleep less if they have toys to entertain themselves. She divides 250 cats (adults and kittens) into two rooms, with adult cats in one room and baby kittens in the other room. Within each room she erects a fence, randomly placing half the cats (or kittens) on each side of the fence. On one side of the fence she scatters a variety of cat toys. For 1 day, the researcher records the number of hours each cat spends sleeping.

What is the research question?
What are the explanatory and response variables?
Is this an experimental or observational study?
What are the controls and treatments?
Is blocking employed in this study?

Warm-Up and Data Basics: Types of Variables

Still our cat example:

Cat   Age        Toys   # of Naps   Weight (lbs)
1     adult      1      3           8
2     juvenile   1      5           9
3     adult      0      2           10.5
4     adult      1      8           12.25
...   ...        ...    ...         ...
250   adult      0      5           11.67

What types of variables are these: Age? Toys? # of Naps? Weight?
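The random assignment in the cat study (half of each room placed on the toy side of the fence) can be sketched in a few lines. This is only an illustration: the function name `assign_within_blocks` and the 150/100 split between adults and kittens are assumptions, since the slide gives only the total of 250 cats.

```python
import random

def assign_within_blocks(cats, seed=0):
    """Randomly split each age block in half: one half gets toys (1), the other does not (0).

    `cats` maps a cat id to its block label ("adult" or "juvenile").
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    blocks = {}
    for cat_id, age in cats.items():
        blocks.setdefault(age, []).append(cat_id)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)          # random order within the block
        half = len(shuffled) // 2
        for cat_id in shuffled[:half]:
            assignment[cat_id] = 1     # toy side of the fence
        for cat_id in shuffled[half:]:
            assignment[cat_id] = 0     # no-toy side of the fence
    return assignment

# Hypothetical block sizes: 150 adults and 100 kittens (not stated in the slides).
cats = {i: ("adult" if i <= 150 else "juvenile") for i in range(1, 251)}
toys = assign_within_blocks(cats)
```

Because the shuffle happens separately inside each room, age cannot be confounded with the toy treatment, which is the point of blocking.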

Warm-Up and Data Basics: Sampling Methods

Population to sample

It is usually not feasible to collect information on the entire population due to the high costs of data collection, so statisticians instead work with samples that are (hopefully) representative of the populations they come from.
We try to understand certain features of the population as a whole using summary statistics and graphs based on these samples.

[Figure: a sample drawn from a population]

Obtaining good samples

Almost all statistical methods are based on the notion of implied randomness.
If observational data are not collected in a random framework from a population, these statistical methods (the estimates and the errors associated with the estimates) are not reliable.
The most commonly used random sampling techniques are simple, stratified, and cluster sampling.

Simple random sample

Randomly select cases from the population; each case is equally likely to be selected.

[Figure: simple random sample drawn from a population of points]

Stratified sample

Strata are homogeneous; take a simple random sample from each stratum.

[Figure: population divided into Strata 1 to 6, with points sampled from every stratum]

Cluster sample

Clusters are not necessarily homogeneous; take a simple random sample from a random sample of clusters. Usually preferred for economic reasons.

[Figure: population divided into Clusters 1 to 9, with points sampled from a few randomly chosen clusters]

Participation question

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments. Which approach would likely be the least effective?

(a) Simple random sampling
(b) Cluster sampling
(c) Stratified sampling
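The three sampling schemes can be compared side by side in a short sketch. The population here is hypothetical (6 neighborhoods of 50 households each), and the function names are illustrative rather than taken from the course materials.

```python
import random

def simple_random_sample(population, n, rng):
    """Every case is equally likely to be selected."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from every stratum."""
    sample = []
    for cases in strata.values():
        sample.extend(rng.sample(cases, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick some clusters, then take a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for label in chosen:
        sample.extend(rng.sample(clusters[label], n_per_cluster))
    return sample

rng = random.Random(1)
# Hypothetical population: each case is (neighborhood, household number).
groups = {g: [(g, h) for h in range(50)] for g in range(1, 7)}
population = [case for cases in groups.values() for case in cases]

srs = simple_random_sample(population, 30, rng)
strat = stratified_sample(groups, 5, rng)   # 5 households from each of the 6 groups
clus = cluster_sample(groups, 2, 15, rng)   # 15 households from each of 2 groups
```

All three samples contain 30 households, but only the stratified sample is guaranteed to include every neighborhood, while the cluster sample touches just two of them.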

Warm-Up and Data Basics: Exploratory Data Analysis

Explore the Data

When you taste a spoonful of chili and decide it doesn't taste spicy enough, that's exploratory analysis.
For data analysis, we perform exploratory data analysis, or EDA, to determine trends and features that may be present in the data.
The distribution of a variable is a list of possible values the variable can take and how often it takes each of those values.
Distributions are critical to assessing the probability of events.
Plots are almost always useful for visualizing relationships and distributions in the data.

Visualizing numerical variables

Intensity map: Useful for displaying the spatial distribution.
Dot plot: Useful when individual values are of interest.
Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.
Box plot: Especially useful for displaying the median, quartiles, and unusual observations, as well as the IQR.

Why visualize?

Describe the spatial distribution of race/ethnicity in the US.
And let's take a closer look at Durham.

http://demographics.coopercenter.org/DotMap/index.html

Scatterplot

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?
Was the relationship the same throughout the years, or did it change?

http://www.gapminder.org/world

Cars: ... vs. weight

From the cars data:

[Figure: two scatterplots against weight (pounds): price ($1000s) and miles per gallon (city rating)]

What do these scatterplots reveal about the data? How might they be useful?

Numerical Variables: Basic Plots

World Bank Data

This is public-use data available for download from http://data.worldbank.org/topic/energy-and-mining.

What does the distribution of energy use per capita look like across different countries?
Is energy use fairly uniform across different countries?
If not, can we distinguish groups of countries that use more than others?

      Country.Name            X2011
37    Afghanistan
50    Angola                  672.74
63    Albania                 689.03
76    Arab World              1806.90
89    United Arab Emirates    7407.01
102   Argentina               1966.97
115   Armenia                 916.26
128   American Samoa
141   Antigua and Barbuda
154   Australia               5500.79
167   Austria                 3927.92
180   Azerbaijan              1369.32
193   Burundi
206   Belgium                 5348.97
219   Benin                   384.56
232   Burkina Faso
245   Bangladesh              204.72
258   Bulgaria                2615.04

Visualizing numerical variables

Intensity map: Useful for displaying the spatial distribution.
Dot plot: Useful when individual values are of interest.
Histogram: Provides a view of the data density, and is especially convenient for describing the shape of the data distribution.
Box plot: Especially useful for displaying the median, quartiles, and unusual observations, as well as the IQR.

Stacked Dot Plot

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

[Figure: stacked dot plot of gpa; x-axis from 3.0 to 4.0]

Dot Plot: Why visualize?

[Figure: dot plot of weight, in ounces; x-axis from 0 to 4000]

Do you see anything out of the ordinary?

Why visualize?

What type of variable is average number of hours of sleep per night? Is this reflected in the dot plot below? If not, what might be the reason?

[Figure: dot plot of average number of hours of sleep per night; x-axis from 4 to 9]

Dot Plot: World Bank Data

Country.Name            X2011
Afghanistan
Angola                  672.74
Albania                 689.03
Arab World              1806.90
United Arab Emirates    7407.01
Argentina               1966.97
Armenia                 916.26

[Figure: Energy Data Dot Plot; x-axis: Energy per capita, 0 to 15000]

Histogram

[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]

Bins    0-2000   2001-4000   4001-6000   6001-8000   ...
Count   92       38          18          10          ...
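A histogram is just a table of bin counts drawn as bars. The sketch below counts values into bins of width 2000, using only the six non-missing 2011 values shown in the excerpt above (not the full data set behind the slide's counts). The helper name `bin_counts` is illustrative, and its bins are closed on the left, `[0, 2000)`, `[2000, 4000)`, and so on, which is an assumption that differs slightly from the slide's 0-2000, 2001-4000 labels.

```python
def bin_counts(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = {}
    for v in values:
        k = int(v // width)                 # index of the bin containing v
        counts[k] = counts.get(k, 0) + 1
    top = max(counts)
    # Report every bin up to the last occupied one, including empty bins.
    return [(k * width, (k + 1) * width, counts.get(k, 0)) for k in range(top + 1)]

# Angola, Albania, Arab World, United Arab Emirates, Argentina, Armenia
energy = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26]
table = bin_counts(energy, 2000)
for low, high, count in table:
    print(f"{low}-{high}: {count}")
```

Changing `width` changes the picture: a very small width gives many near-empty bins, while a very large one hides the shape entirely.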

Histogram: Bin Width

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

[Figure: four histograms of extracurricular hrs / week (y-axis: frequency), each drawn with a different bin width]

Bin Width

[Figure: three histograms of "Energy Use in 2011 (World Bank Data)", each drawn with a different bin width; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]

Numerical Variables: Distribution Shapes

Histogram

[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita); y-axis: Number of Countries]

Provides a view of the data density. Very useful for examining the shape of a distribution.
This distribution is right skewed and unimodal.

Shape: Skewness

We describe histograms as right skewed, left skewed, or symmetric.

[Figure: three example histograms with different skewness]

Histograms are said to be skewed to the side of the long tail.

Shape: Modality

The mode is defined as the most frequent observation in the data set.
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

[Figure: four example histograms with different modality]

In order to determine modality, it's easiest to step back and imagine a density curve over the histogram. Use the limp spaghetti method.

Shape and Skew

[Figure: two density curves labeled "Symmetric Distribution" and "Bimodal Distribution"; x-axis: Value; y-axis: Height]

Shape: Why does this matter?

How would you describe this distribution?

[Figure: histogram of average number of hours spent on school work per day]

Commonly observed shapes of distributions

modality: unimodal, bimodal, multimodal, uniform
skewness: symmetric, right skew, left skew

Participation question

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)

Density Curves

A Density Curve is a smoothed density histogram where the area under the curve is 1.
To draw a density curve from a histogram, simply connect the peaks of the histogram with a smooth line, and normalize the values of the y-axis such that the area under the curve is 1.

Unusual Observations

Are there any unusual observations or potential outliers?

[Figure: two example histograms]
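The normalization step in the density-curve recipe can be made concrete. The sketch below rescales histogram counts so that the bars have total area 1; it does not do the smoothing step. The counts and bin width are hypothetical, and `to_density` is an illustrative name.

```python
def to_density(counts, width):
    """Rescale bin counts so that the bar areas (height * width) sum to 1."""
    total = sum(counts)
    return [c / (total * width) for c in counts]

counts = [4, 10, 5, 1]   # hypothetical counts for four equal-width bins
width = 5                # hypothetical bin width
density = to_density(counts, width)

# Area under the density histogram: sum of height * width over all bins.
area = sum(d * width for d in density)
```

Each bar height is now the proportion of observations in the bin divided by the bin width, so the shape is unchanged and only the y-axis scale differs.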

Describing Your Pictures

Bell Shaped: Data is bell shaped if the majority of the data is clustered around the center value (mean), with very few data points lying far above or far below this value.
Right Skewed: Data is positively skewed if you have several large positive data points creating a long tail to the right.
Left Skewed: Data is negatively skewed if you have several unusually small data points creating a long tail to the left.
Bimodal: Data is bimodal if it has two large clusters of data points.
Symmetric: Data is symmetric if it looks like a mirror image around its center.
Uniformly Distributed: Data is evenly spread across all possible values.

Application exercise: Shapes of distributions

Descriptive Statistics: Center

Mean

The sample mean, denoted as $\bar{x}$, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{Sum of Data Points}}{\text{Number of Data Points}},$$

where $x_1, x_2, \cdots, x_n$ represent the $n$ observed values.
The population mean is a parameter computed the same way but is denoted as $\mu$. It is often not possible to calculate $\mu$ since population data is rarely available.
$\bar{x}$ is an estimate of $\mu$ based on the observed data.
The sample mean is a sample statistic, or a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population) it is usually a good guess.

Median

The median is the value that splits the data in half when ordered in ascending order.

0, 1, 2, 3, 4 → 2

If there are an even number of observations, then the median is the average of the two values in the middle.

0, 1, 2, 3, 4, 5 → $\frac{2 + 3}{2} = 2.5$

Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
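The two definitions translate directly into code. This is a minimal sketch using the slide's own example values; the function names are illustrative.

```python
def mean(xs):
    """Sum of the data points divided by the number of data points."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

odd = [0, 1, 2, 3, 4]
even = [0, 1, 2, 3, 4, 5]
print(median(odd))    # 2
print(median(even))   # 2.5
print(mean(even))     # 2.5
```

Sorting first matters: the median is defined on the ordered data, so the input may arrive in any order.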

Mean vs. Median

If the distribution is symmetric, center is the mean.
Symmetric: mean ≈ median
If the distribution is skewed or has outliers, center is the median.
Right-skewed: mean > median
Left-skewed: mean < median

[Figure: three density curves (right-skewed, left-skewed, symmetric) with the mean and median marked on each]

Back to our Energy Data

[Figure: histogram "Energy Use in 2011 (World Bank Data)"; x-axis: Energy Use (kg oil equivalent per capita), 0 to 15000; y-axis: Number of Countries, 0 to 80]

Mean: 2532.631
Median: 1593.7
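The mean-versus-median comparison can be tried on a small piece of the energy data. The sketch below uses only the six non-missing 2011 values from the World Bank excerpt shown earlier in the lecture, so its numbers differ from the full-data mean and median on the slide. Comparing the two is a rough hint about skew, not a formal test.

```python
import statistics

# Angola, Albania, Arab World, United Arab Emirates, Argentina, Armenia (2011 values)
energy = [672.74, 689.03, 1806.90, 7407.01, 1966.97, 916.26]

mean = statistics.mean(energy)
median = statistics.median(energy)

# Heuristic from the slide: the mean is pulled toward the long tail.
if mean > median:
    shape_hint = "right-skewed"
elif mean < median:
    shape_hint = "left-skewed"
else:
    shape_hint = "roughly symmetric"

print(f"mean = {mean:.2f}, median = {median:.2f}, hint: {shape_hint}")
```

The single large value (United Arab Emirates) drags the mean well above the median, the same pattern the full data set shows.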

Measures of Center

The Mean of a dataset is what we commonly refer to as the average.
The Median of a dataset is the middle value of your data. You find the median of your data by ordering from smallest to largest, then finding the value where 50% of your data is above and below that value.
The Trimmed Mean is the calculation of the mean after removing a few of the very large and very small observations.

Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics of a distribution?
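The trimmed mean from the Measures of Center list can be sketched as follows. Trimming a fixed number `k` of observations from each end is one common convention and is an assumption here; the slide does not say how many to remove. The data values are hypothetical.

```python
def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest observations."""
    if 2 * k >= len(xs):
        raise ValueError("k is too large: no observations would remain")
    s = sorted(xs)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 100]          # hypothetical values with one extreme observation
plain = sum(data) / len(data)     # ordinary mean, pulled up by the 100
trimmed = trimmed_mean(data, 1)   # drops 1 and 100, averages 2, 3, 4
```

Here the ordinary mean is 22.0 while the trimmed mean is 3.0, which shows how sensitive the mean is to a single extreme value.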

Describing distributions of numerical variables

When describing distributions of numerical variables always mention

Shape: skewness, modality
Center: an estimate of a typical observation in the distribution (mean, median, mode, etc.)
Spread: measure of variability in the distribution (SD, IQR, range, etc.)
Unusual observations: observations that stand out from the rest of the data that may be suspected outliers

Descriptive Statistics: Spread

Measures of Spread

The population Variance, $\sigma^2$, is the average squared deviation of the observations from the mean.
The population Standard Deviation, $\sigma$, is the square root of the variance.
The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.

[Figure: example distributions plotted on an axis from −3 to 3]
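The population variance and standard deviation can be computed directly from their definitions, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ and $\sigma = \sqrt{\sigma^2}$. The sketch below is an illustration with hypothetical data; note that it divides by $N$ (population form), not by $n - 1$ as a sample variance would.

```python
import math

def population_variance(xs):
    """Average squared deviation of the observations from their mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def population_sd(xs):
    """Square root of the population variance."""
    return math.sqrt(population_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical values with mean 5
var = population_variance(data)     # squared deviations sum to 32, so 32 / 8
sd = population_sd(data)
```

The standard deviation is in the same units as the data, which is why it is usually reported instead of the variance.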

Box Plot

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: box plot of # of study hours / week; axis from 10 to 40]

Anatomy of a Box Plot

[Figure: box plot of # of study hours / week (axis 0 to 40) with labels, from top to bottom: suspected outliers, max whisker reach, upper whisker, Q3 (third quartile), median, Q1 (first quartile), lower whisker]

Measures of Location

The 25th percentile is also called the first quartile, Q1.
The 50th percentile is also called the median.
The 75th percentile is also called the third quartile, Q3.

    summary(d$study_hours)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       3.00   10.00   15.00   17.42   20.00   40.00   13.00

Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

IQR = 20 − 10 = 10

Whiskers and Outliers

Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.

max upper whisker reach: Q3 + 1.5 × IQR = 20 + 1.5 × 10 = 35
max lower whisker reach: Q1 − 1.5 × IQR = 10 − 1.5 × 10 = −5

An outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
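The whisker arithmetic for the study-hours data can be checked with a short sketch. The quartiles (Q1 = 10, Q3 = 20) and the minimum, median, and maximum (3, 15, 40) come from the summary output; the function names are illustrative, and the individual observations are not available, so only those three summary values are tested.

```python
def whisker_reach(q1, q3, k=1.5):
    """Return (max lower reach, max upper reach) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def suspected_outliers(values, q1, q3):
    """Observations beyond the maximum reach of the whiskers."""
    low, high = whisker_reach(q1, q3)
    return [v for v in values if v < low or v > high]

lower, upper = whisker_reach(10, 20)
print(lower, upper)   # -5.0 35.0

# Minimum, median, and maximum from the summary output.
flagged = suspected_outliers([3.0, 15.0, 40.0], 10, 20)
print(flagged)        # [40.0]
```

The maximum of 40 hours lies beyond the upper reach of 35, so it would be drawn as a suspected outlier, while the lower reach of −5 can never be hit because study hours cannot be negative.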

Outliers (cont.)

Why is it important to look for outliers?

Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.

Why visualize?

What does a response of 0 mean in this distribution?

[Figure: box plot of number of drinks it takes students to get drunk; axis from 0 to 12]

Example: Visualizing

What does our Energy Data look like?

[Figure: Energy Use Data Boxplot; y-axis: Energy Usage, 0 to 15000]

Who uses the most energy?

     Country.Name            X2011
1    Iceland                 17964.44
2    Qatar                   17418.69
3    Trinidad and Tobago     15691.29
4    Kuwait                  10408.28
5    Brunei Darussalam       9427.09
6    Oman                    8356.29
7    Luxembourg              8045.90
8    United Arab Emirates    7407.01
9    Bahrain                 7353.16
10   Canada                  7333.28
11   North America           7062.22
12   United States           7032.35
13   Saudi Arabia            6738.42
14   Singapore               6452.33
15   Finland                 6449.04

Participation question

Which of the following is false about the distribution of average number of hours students study daily?

[Figure: box plot of average number of hours students study daily; axis from 2 to 10]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   3.821   5.000  10.000

(a) There are no students who don't study at all.
(b) 75% of the students study more than 5 hours daily, on average.
(c) 25% of the students study less than 3 hours, on average.
(d) IQR is 2 hours.

Measures of Spread

The population Variance, $\sigma^2$, is the average squared deviation of the observations from the mean.
The population Standard Deviation, $\sigma$, is the square root of the variance.
The Interquartile Range (IQR) measures the spread of the middle 50% of your data, and is visually depicted in box plots.