Chapter 3 - The Histogram
PART II: DESCRIPTIVE STATISTICS

Dr. Joseph Brennan

Math 148, BU

Dr. Joseph Brennan (Math 148, BU) Chapter 3 - The Histogram 1 / 37

Variables

Once a study has been designed and data collected, researchers begin to SUMMARIZE their data. Data may be summarized by plotting figures and computing certain summary measures to obtain important information about the data.

STATISTIC : A summary measure computed from the data.

Recall: Every data point is the value of the response VARIABLE measured on a unit. So we should think of a variable as a quantity that takes different values for different individuals.

Examples: gender, color of eyes, weight, bacteria count.


There are two types of variables, depending on their possible values: qualitative (categorical) and quantitative (numerical). Quantitative variables are further divided into discrete and continuous.

VARIABLES
    Qualitative
    Quantitative
        Discrete
        Continuous

Variable Types

A qualitative variable places an individual into one of several groups or categories. Such variables are also called categorical variables. The variable gender has two possible values: male and female. The variable major has numerous values, such as Mathematics, Biology, Physics, Economics, Chemistry, ...

A quantitative variable takes numerical values for which arithmetic operations (such as adding and averaging) make sense. Quantitative variables are also called numerical variables.

NOTE: If unsure how to classify a variable, ask whether arithmetic on its values makes sense. We cannot average gender or major, therefore they are qualitative variables.

Quantitative Variables

Quantitative variables are divided into discrete and continuous. A discrete quantitative variable takes on values which are spaced apart, i.e., for two adjacent values, there is no value that goes between them.

The variable number of children is discrete. It takes on integer values ... there cannot be 2.5 kids in a family.

A continuous quantitative variable takes values in a given interval. For ANY two values of the variable, we can always find another value between the two.

Variables such as weight, time, and distance are continuous.

NOTE: The variable salary is continuous but essentially discrete if all salaries are rounded to the whole dollar.

Example 1

Classify each of the following variables as qualitative or quantitative (discrete or continuous):

Color of eyes: qualitative

Blood pressure: quantitative (continuous)

Weight (in lb): quantitative (continuous)

Residence (country): qualitative

Number of patients under a treatment: quantitative (discrete)

Zip code: qualitative

NOTE: Not all the variables that take on numerical values are quantitative!

Variables in statistical studies

In statistical studies we have encountered three types of variables: treatment variables, response variables, and confounding variables.

All the above types of variables can be either qualitative or quantitative. In the studies which we considered in Part I the treatment variable was usually qualitative:

In the fever example (Example 1, Part I) the treatment variable was drug with the values drug A and drug B.

In the smoking example (Example 2, Part I) the treatment variable was smoking status with the values Yes and No.

In Part III we will develop methods to analyze data from studies for which both the treatment and response variables are quantitative.

An Introduction to the Histogram

Data represents the values of the response variable measured from each unit. The distribution of data is a list summarizing the observed values of the response variable and how often they were observed. When the data is quantitative, whether discrete or continuous, a histogram may be used to display its distribution.

Example (Guinea pigs)

Taken from Moore and McCabe, Table 1.8, Chapter 1. The table gives the survival times in days of 72 guinea pigs after they were injected with tubercle bacilli in a medical experiment.

43 45 53 56 56 57 58 66 67 73 74 79 80 80 81 81 81 82 83 83 84 88 89 91 91 92 92 97 99 99 100 100 101 102 102 102 103 104 107 108 109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184 191 198 211 214 243 249 329 380 403 511 522 598

We aim to describe the distribution of the survival times.


First, plot the observations on a horizontal axis.

Figure : Guinea pigs survival times plotted on a horizontal axis.

We can see that the observations are not uniformly spread along the axis. In particular, there is a crowding of observations around 100.


The density histogram is a graph representing the density of observations along the horizontal axis. Such a histogram is constructed in three steps.

Density Histogram: Step 1. Break the range of values of a variable into adjacent intervals, which are called class intervals or bins.

Class interval
40 ≤ survival time < 80
80 ≤ survival time < 120
120 ≤ survival time < 160
160 ≤ survival time < 200
200 ≤ survival time < 250
250 ≤ survival time < 400
400 ≤ survival time < 600

No particular rule was used to choose the above class intervals.


Density Histogram: Step 2. Create the distribution table which contains the count and percent (or proportion) of individuals in each class interval.

Class interval               Count   Proportion   Percent
40 ≤ survival time < 80        12      0.1667      16.67%
80 ≤ survival time < 120       32      0.4444      44.44%
120 ≤ survival time < 160      11      0.1528      15.28%
160 ≤ survival time < 200       7      0.0972       9.72%
200 ≤ survival time < 250       4      0.0556       5.56%
250 ≤ survival time < 400       2      0.0278       2.78%
400 ≤ survival time < 600       4      0.0556       5.56%
Total                          72      1.0001     100.01%

NOTE: The total percent is equal to 100.01% rather than 100% due to errors introduced by rounding.
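The counts and proportions in the distribution table can be checked with a short Python sketch. The names `survival_times`, `edges`, and `distribution_table` are illustrative choices rather than part of the course materials, and the left-endpoint-included convention [a, b) is assumed, matching the class intervals of Step 1.

```python
# Survival times (days) of the 72 guinea pigs, copied from the data listing.
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79,
    80, 80, 81, 81, 81, 82, 83, 83, 84, 88, 89, 91,
    91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102, 102,
    103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128,
    137, 138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511, 522, 598,
]

# Class-interval endpoints from Step 1; each interval is [left, right).
edges = [40, 80, 120, 160, 200, 250, 400, 600]


def distribution_table(data, edges):
    """Return one (left, right, count, proportion) row per class interval."""
    n = len(data)
    rows = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        rows.append((left, right, count, count / n))
    return rows


table = distribution_table(survival_times, edges)
for left, right, count, proportion in table:
    print(f"{left:>3} <= t < {right:<3}  {count:>2}  {proportion:.4f}  {proportion:.2%}")
```

Because the sketch sums unrounded proportions, its total is 1 up to floating-point error; the 1.0001 in the table comes only from rounding each row to four decimals first.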


Density Histogram: Step 3. Constructing the histogram.

1. On the horizontal axis mark the endpoints of class intervals.
2. On each class interval plot a rectangle, whose base covers the class interval and whose height is computed in the following way:

Bar height = (percentage of observations in class interval) / (width of class interval)

The HIGHER the bar, the GREATER the concentration (density) of observations in the corresponding class interval. We assume that observations are spread uniformly within a class interval. The units of measurement on the VERTICAL axis of the histogram are percents (proportion) per unit width.


Density Histogram: Step 3, continued.

Class interval               Bar height
40 ≤ survival time < 80      0.1667/40 = 0.0041675
80 ≤ survival time < 120     0.4444/40 = 0.01111
120 ≤ survival time < 160    0.1528/40 = 0.00382
160 ≤ survival time < 200    0.0972/40 = 0.00243
200 ≤ survival time < 250    0.0556/50 = 0.001112
250 ≤ survival time < 400    0.0278/150 ≈ 0.0001853
400 ≤ survival time < 600    0.0556/200 = 0.000278
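As a rough check of Step 3, the following Python sketch recomputes the bar heights from the counts in the distribution table. The variable names are illustrative assumptions. It divides the unrounded proportions by the interval widths, so its results can differ from the table in the last digit or two, since the table divides the already rounded proportions.

```python
# Counts per class interval, taken from the distribution table of Step 2.
counts = [12, 32, 11, 7, 4, 2, 4]
# Class-interval endpoints from Step 1; each interval is [left, right).
edges = [40, 80, 120, 160, 200, 250, 400, 600]

n = sum(counts)  # total number of observations (72)

# Density height = proportion of observations / width of the class interval.
heights = [
    (count / n) / (right - left)
    for count, left, right in zip(counts, edges[:-1], edges[1:])
]

for left, right, height in zip(edges[:-1], edges[1:], heights):
    print(f"{left:>3} <= t < {right:<3}  height = {height:.7f} per day")
```

The tallest bar is the one over [80, 120), which is where the observations are most concentrated.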

Figure 4. Histogram for guinea pig survival times.

Width of Class Intervals

Class intervals are generally chosen to have uniform width, unlike the uneven intervals in our guinea pig study. Most computer programs default to bins of equal width. What number of bins of equal width should we use?

There is no best number of bins, and different bin sizes can reveal different features of the data. Usually, the number of bins is chosen between 5 and 25. The larger the data set, the more bins should be used. One frequently used rule to compute the number of bins k (of equal width) for a data set of size n is Sturges' formula:

k = 1 + 3.322 log10(n)

We obtain k by rounding the result to the nearest integer.

Example (Guinea pigs)

We will replot the density histogram using the CrunchIt! program with its default number of bins. CrunchIt! directions:

1. Open the data in CrunchIt!: http://crunchit2.bfwpub.com/crunchit2/ips5e/?section_id=
2. Click on Chapter 1 in the upper left corner.
3. Choose Table 1.8 from the list.
4. On the grey panel click on Graphics → Histogram.
5. Click on the variable and choose Density.

CrunchIt! produces the graph shown in Figure 5. It uses 12 class intervals of width 50, ranging from 0 to 600. NOTE that in its default setting CrunchIt! uses a different formula from Sturges' rule to find the number of bins. By Sturges' formula,

k = 1 + 3.322 log10(72) = 7.17 ≈ 7.
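Sturges' formula is easy to evaluate by machine. The Python sketch below is a minimal illustration; the function name `sturges_bins` is a hypothetical choice, and the constant 3.322 is the one used in the formula above.

```python
import math


def sturges_bins(n):
    """Number of equal-width bins suggested by Sturges' formula for n observations."""
    k = 1 + 3.322 * math.log10(n)
    return round(k)  # round to the nearest integer


n = 72  # number of guinea pigs in the example
raw_k = 1 + 3.322 * math.log10(n)
print(f"k before rounding: {raw_k:.2f}")
print(f"suggested number of bins: {sturges_bins(n)}")
```

The rule grows only logarithmically in n, so even a much larger data set leads to a modest increase in the suggested number of bins.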

Histogram with Equal Bin Widths

Figure 5. Histogram for guinea pigs (CrunchIt!)

NOTE: Histograms in Figures 4 and 5 are similar, but in general the appearance of a histogram can change substantially when you change the widths of class intervals.

Four Histograms Plotted from the Same Data

The bin width and positioning of the bin edges can have a significant effect on the resulting histogram.

Endpoint Convention

What should we do if an observation happens to be on the boundary between two bars? In which class interval does the data point lie?

Each observation must be counted exactly once, so we need to choose between the left and right bars. The choice is arbitrary, but it should be stated and rigorously adhered to. The convention followed in Example 2 is that the left endpoint is included in the class interval, and the right endpoint is excluded.

NOTE: There are 2 data values of 80 days. In the histogram plotted in Figure 4, the first class interval is [40, 80) and the second is [80, 120), so both values of 80 are counted in the second interval. How does the histogram change if the endpoint convention is altered?
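One way to explore the question above is to count the observations under both conventions. The Python sketch below is illustrative only; the function `bin_counts` and its `closed` parameter are hypothetical names introduced here, not part of CrunchIt! or the course materials.

```python
# Survival times (days) of the 72 guinea pigs, copied from the data listing.
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79,
    80, 80, 81, 81, 81, 82, 83, 83, 84, 88, 89, 91,
    91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102, 102,
    103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128,
    137, 138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511, 522, 598,
]
edges = [40, 80, 120, 160, 200, 250, 400, 600]


def bin_counts(data, edges, closed="left"):
    """Count observations per class interval.

    closed="left"  uses intervals [left, right)
    closed="right" uses intervals (left, right]
    """
    counts = []
    for left, right in zip(edges[:-1], edges[1:]):
        if closed == "left":
            counts.append(sum(1 for x in data if left <= x < right))
        else:
            counts.append(sum(1 for x in data if left < x <= right))
    return counts


left_closed = bin_counts(survival_times, edges, closed="left")
right_closed = bin_counts(survival_times, edges, closed="right")
print("[a, b) counts:", left_closed)
print("(a, b] counts:", right_closed)
```

In this data set only the two values of 80 sit on a boundary, so switching conventions moves exactly two observations from the second bar to the first; every other bar is unchanged.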

Area Under the Density Histogram

The area of each bar is equal to the percent (or proportion) of observations in that bar.

In the histogram plotted in Figure 4, the area of the first bar is

area of the first bar = base × height = 40 (days) × 0.0041675 (proportion of observations per day) = 0.1667 (proportion of all observations).

The total area under the density histogram is 1 (100%).
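The claim that the areas add up to 1 can be verified numerically. This Python sketch uses the counts and class intervals of the guinea pig example; the variable names are illustrative assumptions.

```python
# Counts per class interval and interval endpoints from the guinea pig example.
counts = [12, 32, 11, 7, 4, 2, 4]
edges = [40, 80, 120, 160, 200, 250, 400, 600]

n = sum(counts)
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]

# Height of each bar in a density histogram: proportion per unit width.
heights = [(count / n) / width for count, width in zip(counts, widths)]

# Area of each bar = base * height, which recovers the proportion in that bar.
areas = [width * height for width, height in zip(widths, heights)]

print(f"area of the first bar: {areas[0]:.4f}")
print(f"total area: {sum(areas):.4f}")
```

Dividing by the width when computing the height and multiplying by it again for the area is what makes bars of unequal width comparable.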

Zero Bar Heights

Zero bar heights in histograms: In a histogram there is no horizontal space between the bars unless a class interval is empty (has no data), so that its bar has height equal to zero.

NOTE: The histogram in Figure 5 has 2 empty classes, for survival times in the ranges 250 - 300 and 450 - 500. These intervals contain no data points.
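The empty classes can be located by counting the observations in 12 equal bins of width 50 from 0 to 600. The Python sketch below assumes the left-endpoint-included convention [a, b); whether CrunchIt! uses the same convention is not stated here, although no observation falls on a multiple of 50 other than 100, which is not next to an empty bin.

```python
# Survival times (days) of the 72 guinea pigs, copied from the data listing.
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73, 74, 79,
    80, 80, 81, 81, 81, 82, 83, 83, 84, 88, 89, 91,
    91, 92, 92, 97, 99, 99, 100, 100, 101, 102, 102, 102,
    103, 104, 107, 108, 109, 113, 114, 118, 121, 123, 126, 128,
    137, 138, 139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511, 522, 598,
]

# 12 equal-width bins of width 50 covering 0 to 600.
edges = list(range(0, 601, 50))

# Count observations in each bin [left, right).
counts = [
    sum(1 for x in survival_times if left <= x < right)
    for left, right in zip(edges[:-1], edges[1:])
]

# Bins with no observations get a bar of height zero.
empty_bins = [
    (left, right)
    for left, right, count in zip(edges[:-1], edges[1:], counts)
    if count == 0
]
print("counts per bin:", counts)
print("empty bins:", empty_bins)
```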

Different Types of Histograms

There are 3 main types of histograms:

A density histogram displays percents (or proportions) per unit width in the vertical direction.

In a frequency histogram the height of each bar is equal to the actual count of observations in the class interval.

In a relative frequency histogram the height of each bar is equal to the proportion or percentage of observations in the class interval.

We will mostly deal with density histograms in this course. Later in this unit we will approximate the density histograms with density curves.

Example of a Frequency Histogram

This example is taken from Moore and McCabe. The frequency histogram below shows the distribution of IQ scores for 60 fifth-grade students. On the y-axis we have the count of students in each class interval. The sum of all the bar heights equals 60, the number of tested students.

Example of a Relative Frequency Histogram

This example is taken from Moore and McCabe. The relative frequency histogram below shows the distribution of the lengths of words used in Shakespeare’s plays.

Outliers in Data

Very often the extreme bars of the histogram correspond to outlying observations or outliers. An outlier is an observation which falls outside of the overall pattern of the histogram. Rules exist to identify outliers, but in many cases it is just a matter of judgement. Look for points that are clearly apart from the body of data, not just the most extreme observations in a distribution.

Observation 598 in Example 2 (Guinea pigs) is clearly an outlier. Observations 511 and 522 are also potential outliers. A formal rule for detecting outliers will be developed later in this unit.

Population Distribution and Data Distribution

Making a histogram is not an end in itself. The purpose of the histogram is to help us understand the data and make observations about the population from which the data was drawn.

There is a true population distribution of the variable of interest, which could be computed from census data.

The population distribution is usually unknown, since we typically cannot conduct a census. The histogram computed from the sample data shows the data distribution, which estimates the true population distribution.

Analyzing a Histogram

Once you’ve plotted the histogram (data distribution), look for the overall pattern and for outliers. The overall pattern of a distribution can be described by its shape, center, and spread. We will learn the measures of a distribution's center and spread in Chapter 4. The shape of the distribution can be described by specifying the number of modes and by classifying it as symmetric or skewed.

Modes are major peaks in the distribution. Distributions with one, two, and three modes are called unimodal, bimodal, and trimodal, respectively. If a distribution has more than 3 modes, it is usually called multimodal.

Modes in Histograms

Figure : Unimodal and bimodal distributions.

Symmetric Distributions

A symmetric distribution has a histogram symmetric about the midpoint. Imagine drawing a vertical line through the center of the histogram and folding the histogram in half around that line: the two halves should match up. Even if the true population distribution is symmetric, we do not expect the histogram of the data to be perfectly symmetric. The unimodal histogram on the previous slide is fairly symmetric, so it may correspond to a symmetric population distribution. Many symmetric unimodal histograms look bell-shaped. The unimodal histogram on the previous slide appears bell-shaped.

Many biological measurements on specimens from the same species and sex - lengths of bird bills, heights of adults - have symmetric bell-shaped distributions.

Some History

A symmetric bell-shaped distribution is also called a normal distribution or a Gaussian distribution. It is named after the famous mathematician Carl Friedrich Gauss (1777-1855), who described it in 1809.

As a general rule, test scores in large classes (like MAT148) tend to follow a normal distribution!

Analyzing the Histogram

Once you have plotted the histogram (data distribution), look for outliers and for the overall pattern, which is described by its shape, center, and spread. The shape of the distribution can be described by specifying the number of modes and by classifying it as symmetric or skewed.

Figure : Unimodal and bimodal distributions. The unimodal appears symmetric.

Symmetric and Skewed Distributions

Tails are the parts of a distribution away from modes. There is the left tail (for smallest values of the variable) and the right tail (for largest values of the variable).

A distribution is right skewed if the right tail (larger values) is much longer than the left tail (smaller values).

A distribution is left skewed if the left tail (smaller values) is much longer than the right tail (larger values).

Example: Histogram of Income Distribution

Money amounts usually have right-skewed distributions. A few families have very large incomes compared to the majority of families, which skews the income distribution to the right.

Figure : Income distribution.

The Appearance of Symmetric and Skewed Distributions

Example (English)

What is the most frequently used letter in the English language?

This is a relative frequency “histogram” generated (as per Wikipedia) from a sample of about 2700 words taken from 3 different sources. Alternatively, which letter is most likely to be at the end of a word? Take a look at the end-of-word letter frequencies:

Letter      e       s       d       t       n
Frequency   0.1917  0.1435  0.0923  0.0864  0.0786

Smooth Histogram Sketches

Many sources will represent histograms as smooth curves. However, we have defined histograms to be based upon bar graphs.

Histogram sketches are smooth curves drawn through the tops of the histogram bars and used to indicate the overall shape of a histogram.

Figure : A Histogram and its Smooth Histogram Sketch.
