Biostatistics Block I Lecture 2

Huining Kang
July 8-10, 2020

Descriptive Statistics

Introduction

• Objective of statistics
  – To determine the nature of data
• Descriptive measure
  – A number that summarizes some aspect of the data
• Objectives of this lesson
  – Learn how to manipulate the information, in the form of numbers, that you will encounter as a health sciences professional.

Content

• Ordered array
• Grouped data
  – Frequency distribution
  – Histogram and stem-and-leaf displays
• Descriptive statistics
  – Measures of central tendency
  – Measures of dispersion
  – Percentiles and quartiles
  – Boxplots

Ordered Array

• Ordered array: a list of the values of a collection in order of magnitude, from the smallest to the largest
• Command in Stata
  – sort [varlist]
  – e.g. sort age
• Menu
  – Data > Sort: Standard sort (ascending)
  – Data > Sort: Ascending and descending sort
• Example 2.2.1 (refer to Stata do-file: lecture2.do)

Grouped Data – Frequency Distribution

• Group the data into a set of contiguous, non-overlapping, equal-length intervals (class intervals)
• Rule of thumb

  – Number of class intervals: $k = 1 + 3.322\log_{10}(n)$, where n is the number of observations
  – Class interval width: $w = R/k$, where R = (largest value − smallest value)
• For each interval, report the frequency (count in the interval), the relative frequency (proportion of the values in the interval) and the percentage (of the values in the interval)

Example 2.3.1

n = 189, so $k = 1 + 3.322\log_{10}(189) \approx 9$ and $R/k = (82 - 30)/9 = 5.778$.
A class interval width of 5 or 10 would be convenient. Use w = 10.
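The arithmetic in Example 2.3.1 can be checked outside Stata. The following is a minimal Python sketch, not part of the course do-file; the function name `sturges_class_count` is made up for illustration, and the inputs (n = 189, smallest = 30, largest = 82) are the values quoted in the example.

```python
import math


def sturges_class_count(n):
    """Unrounded number of class intervals from k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


# Values quoted in Example 2.3.1: 189 ages ranging from 30 to 82.
n, smallest, largest = 189, 30, 82

k = round(sturges_class_count(n))  # rule-of-thumb count, rounded to a whole number
data_range = largest - smallest    # R = largest - smallest
width = data_range / k             # w = R / k, before choosing a convenient width

print(k, round(width, 3))
```

The rule only suggests a starting point: the computed width is then rounded to a convenient value such as 5 or 10, as the example does.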

Frequency table

  Age group |      Freq.     Percent        Cum.
------------+-----------------------------------
      30-39 |         11        5.82        5.82
      40-49 |         46       24.34       30.16
      50-59 |         70       37.04       67.20
      60-69 |         45       23.81       91.01
      70-79 |         16        8.47       99.47
      80-89 |          1        0.53      100.00
------------+-----------------------------------
      Total |        189      100.00
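The Percent and Cum. columns follow directly from the counts. As a rough illustration (a Python sketch, not the Stata workflow used in the course), the helper below rebuilds both columns from the six frequencies in the table; `frequency_table` is a hypothetical name introduced here.

```python
def frequency_table(counts):
    """Return (label, freq, percent, cumulative percent) rows for ordered class counts."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, freq in counts.items():
        running += freq  # cumulative frequency up to and including this class
        rows.append((label, freq,
                     round(100 * freq / total, 2),
                     round(100 * running / total, 2)))
    return rows


# Frequencies copied from the age-group table (n = 189).
age_counts = {"30-39": 11, "40-49": 46, "50-59": 70,
              "60-69": 45, "70-79": 16, "80-89": 1}

table_rows = frequency_table(age_counts)
for label, freq, pct, cum in table_rows:
    print(f"{label:>6} | {freq:5d} {pct:8.2f} {cum:8.2f}")
```

Relative frequency is the same quantity as the percentage divided by 100, so it is not printed separately.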

Histogram
[Figure: histogram of AGE; x-axis AGE (20 to 80), y-axis Frequency (0 to 80)]

Stata code for the previous frequency table and histogram

    * First import the data file EXA_C01_S04_01.csv
    import delimited "data/EXA_C01_S04_01.csv", clear

    * Group the data by generating a class variable agegrp
    recode age (30/39 = 1 "30-39") (40/49 = 2 "40-49") (50/59 = 3 "50-59") ///
        (60/69 = 4 "60-69") (70/79 = 5 "70-79") (80/89 = 6 "80-89"), gen(agegrp)

    * Generate the frequency table
    tabulate agegrp

    * Generate the histogram
    histogram age, width(10) start(30) color(blue) fintensity(inten10) frequency

    * Add a normal density curve to the histogram with the option normal
    histogram age, width(10) start(30) color(blue) fintensity(inten10) ///
        frequency normal

Histogram with fitted normal distribution density curve
[Figure: histogram of AGE with fitted normal density curve; x-axis AGE (20 to 100), y-axis Frequency (0 to 80)]

With a little coding you can draw the frequency polygon, and you can also put the frequency polygon and the histogram together in one display, as shown in the textbook. The Stata code follows.
[Figure: two panels, the frequency polygon alone and the frequency polygon overlaid on the histogram; x-axis age (20 to 100), y-axis Frequency (0 to 80)]

Stata code for the figures above. It is beyond our course requirement at this point.

    import delimited using "data/EXA_C01_S04_01.csv", clear
    recode age (30/39 = 1 "30-39") (40/49 = 2 "40-49") (50/59 = 3 "50-59") ///
        (60/69 = 4 "60-69") (70/79 = 5 "70-79") (80/89 = 6 "80-89"), gen(agegrp)
    label variable agegrp "Age group"

    * Replace the data with one row per age group; the count is stored in table1
    table agegrp, replace
    describe
    rename table1 freq
    label variable freq "Frequency"

    * Add an empty class at each end (groups 0 and 7) so the polygon starts and ends at zero
    set obs 7
    replace freq = 0 if agegrp==.
    replace agegrp = 0 if agegrp==.
    set obs 8
    replace freq = 0 if agegrp==.
    replace agegrp = 7 if agegrp==.

    label variable freq "Frequency"

    * Convert the group index to the class midpoint (group 1 -> 34.5, group 2 -> 44.5, ...)
    gen age = 24.5 + agegrp*10
    sort age

    * Create frequency polygon
    twoway connected freq age, sort
    * Save plot for future use
    graph save temp1, replace

    * Create frequency polygon and histogram and put them together
    twoway (connected freq age, sort) (bar freq age, sort fcolor(none) barwidth(9.5))
    * Save plot for future use
    graph save temp2, replace

    * Display combined figures
    graph combine "temp1" "temp2"

Stem-and-leaf display

. stem age, width(10)

Stem-and-leaf plot for age (AGE)

  3* | 04577888899
  4* | 0022333333444444455566666677777788888889999999
  5* | 0000000011112222223333333333333333344444444444555666666777777788999999
  6* | 000011111111111222222233444444556666667888999
  7* | 0111111123567888
  8* | 2

. stem age, width(5)

Stem-and-leaf plot for age (AGE)

  3* | 04
  3. | 577888899
  4* | 00223333334444444
  4. | 55566666677777788888889999999
  5* | 0000000011112222223333333333333333344444444444
  5. | 555666666777777788999999
  6* | 000011111111111222222233444444
  6. | 556666667888999
  7* | 0111111123
  7. | 567888
  8* | 2

Stata command: stem [var], width(#)

Definitions for a Parameter of a Population and a Statistic

• Statistic
  – A descriptive measure computed from the data of a sample is called a statistic.
• Parameter
  – A descriptive measure computed from the data of a population is called a parameter.

Note: A statistic is a function of a sample, hence it is a random variable. A parameter of a population is a fixed number, but it is usually unknown. One of the major tasks of statistics is to use a statistic to estimate the parameter and to make decisions about aspects of the population that depend on the parameter.

Descriptive Statistics – Measures of Central Tendency

• Mean (arithmetic mean)
• Median
• Mode
• Skewness

• Mean (arithmetic mean)
  – Population mean, for a population of N values
    \[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} \]
  – Sample mean, for a sample of n values
    \[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• Properties
  – Uniqueness
  – Simplicity
  – Can be easily distorted by extreme values (outliers)
• Median of a finite set of values: the middle point of the ordered values
  – When n is odd:

    \[ x_1 \le \cdots \le x_{(n-1)/2} \le x_{(n+1)/2} \le x_{(n+1)/2+1} \le \cdots \le x_n \]
  – When n is even:
    \[ x_1 \le \cdots \le x_{n/2} \le x_{n/2+1} \le \cdots \le x_n \]
  – With the values ordered as above,
    \[ \text{Median} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ (x_{n/2} + x_{n/2+1})/2, & n \text{ even} \end{cases} \]
• Properties
  – Uniqueness, simplicity, not drastically affected by extreme values
• Mode
  – The value that occurs most frequently.
• Properties
  – May not exist (e.g. all values are different)
  – May not be unique (e.g. two different values attain the maximal frequency)
• Calculation using Stata
  – Mean and median: summarize [var], detail
  – Mode: tabulate [var]
• Example (age data from the study on smoking cessation)
    summarize age, detail
    tabulate age
  mean = 55.03, median = 54, mode = 53 (with frequency = 17)
  (refer to the histogram and stem-and-leaf plot)
• Skewness
  – Skewed distribution: the graph of the distribution is asymmetric.
  – Right skew: the graph extends further to the right than to the left (long tail to the right)
  – Left skew: the graph extends further to the left than to the right (long tail to the left)

  – Formula, where s is the sample standard deviation:
    \[ \text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)\sqrt{n-1}\;s^3} \]
    • Skewness = 0: no skew (the distribution is symmetric)
    • Skewness > 0: right skew (the distribution is skewed to the right)
    • Skewness < 0: left skew (the distribution is skewed to the left)
  – Calculation
    • summarize age, detail

Descriptive statistics – measures of dispersion

• Range
• Variance
  – Sample variance and degrees of freedom
  – Variance of a population
• Standard deviation
• Coefficient of variation
• Percentiles, quartiles and interquartile range
• Box-and-whisker plots

Other terms used synonymously with dispersion: variation, spread, and scatter.

• Range
  – $R = x_{\max} - x_{\min}$ (usefulness is limited)
• Variance
  – Finite population of N values:
    \[ \sigma^2 = \sum_{i=1}^{N}(x_i-\mu)^2 / N \]
  – Sample variance:
    \[ s^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 / (n-1) \]
  – Degrees of freedom: n − 1
  – Units are squared.
• Standard deviation
  – $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$
  – Same units as the mean
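To make the difference between the two divisors concrete, here is a small Python sketch (an illustration outside Stata, not course material). The eight data values are hypothetical and chosen only so the arithmetic is easy to follow; `sum_sq_dev` is a helper name introduced here. The hand-computed results are compared with Python's standard `statistics` module.

```python
import math
import statistics


def sum_sq_dev(xs, center):
    """Sum of squared deviations of xs from center."""
    return sum((x - center) ** 2 for x in xs)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, not from the lecture

n = len(values)
mean = sum(values) / n
value_range = max(values) - min(values)        # R = x_max - x_min

pop_var = sum_sq_dev(values, mean) / n         # treat the values as a whole population
samp_var = sum_sq_dev(values, mean) / (n - 1)  # treat them as a sample: n - 1 degrees of freedom
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# Cross-check against the standard library implementations.
assert math.isclose(pop_var, statistics.pvariance(values))
assert math.isclose(samp_var, statistics.variance(values))

print(value_range, pop_var, round(samp_var, 3), pop_sd, round(samp_sd, 3))
```

The sample variance is always the larger of the two for the same data, because it divides the same sum of squares by n − 1 instead of n.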

• Coefficient of variation
  – Sample: $\text{C.V.} = \frac{s}{\bar{x}}(100)$
  – Population: $\text{C.V.} = \frac{\sigma}{\mu}(100)$
  – Advantage: independent of the measurement unit (a quantity without physical units). When comparing data sets with different units or wildly different means, use the C.V. instead of the standard deviation.
  – Disadvantage: when the mean is near zero, the C.V. is sensitive to small changes in the mean, which limits its usefulness. It is only useful for measurements that are always positive.
• Percentile
  – Definition (10th Ed. page 47)
  – Calculation: the p percentile is the (n+1)p-th ordered observation (when (n+1)p is not an integer, it is the average of the two nearest observations)
• Quartiles

  – The quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles, i.e. Q1 = the (n+1)/4-th, Q2 = the 2(n+1)/4-th and Q3 = the 3(n+1)/4-th ordered observations
• Interquartile range (a measure of dispersion)

  – IQR = Q3 − Q1
• Kurtosis (peakedness)
  – Definition (10th Ed. page 49)
  – Measures whether a distribution is "peaked" or "flat" in comparison to a normal distribution.
    \[ \text{Kurtosis}^{*} = \frac{n\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{2}} - 3 = \frac{n\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)^2 s^4} - 3 \]
  – Mesokurtic: Kurtosis ≈ 0; leptokurtic: Kurtosis > 0; platykurtic: Kurtosis < 0
  – * In Stata, the 3 is not subtracted:
    \[ \text{Kurtosis} = \frac{n\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{2}} \]
    A distribution is mesokurtic if Kurtosis ≈ 3, leptokurtic if Kurtosis > 3, and platykurtic if Kurtosis < 3.

• Box-and-whisker Plots

[Figure: annotated box-and-whisker plot on a vertical scale from 30 to 50. Labels, from top to bottom:]
  – Outlier: any value more than 1.5 × IQR beyond a quartile (there may be no outliers)
  – Upper adjacent value: the largest data value that isn't an outlier
  – Whisker: the line from the box to an adjacent value
  – Q3 (3rd quartile, 75th percentile)
  – Box length is IQR = Q3 − Q1 (the spread of the middle half of the data)
  – Median (50th percentile)
  – Q1 (1st quartile, 25th percentile)
  – Lower adjacent value: the smallest data value that isn't an outlier
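The quantities that a box-and-whisker plot displays can be worked out by hand from the rules given in this lecture. The Python sketch below is an illustration only. It uses the (n+1)p rule for percentiles and the 1.5 × IQR rule for outliers, applied to eleven hypothetical values. The function names `percentile` and `box_plot_summary` are introduced here, and Stata's own quartile convention may differ slightly from this rule when (n+1)p is not an integer.

```python
import math


def percentile(sorted_values, p):
    """(n+1)p rule: the (n+1)p-th ordered value, or the average of its two neighbours."""
    n = len(sorted_values)
    pos = (n + 1) * p
    lower = min(max(math.floor(pos), 1), n)  # 1-based positions, clamped to 1..n
    upper = min(max(math.ceil(pos), 1), n)
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


def box_plot_summary(values):
    """Quartiles, IQR, adjacent values and outliers using the 1.5 x IQR rule."""
    xs = sorted(values)
    q1, q2, q3 = (percentile(xs, p) for p in (0.25, 0.50, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_adjacent": inside[0],   # smallest value that is not an outlier
        "upper_adjacent": inside[-1],  # largest value that is not an outlier
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


# Hypothetical sample of n = 11 values (not from the lecture data).
sample = [30, 34, 35, 37, 38, 40, 42, 44, 47, 50, 82]
summary = box_plot_summary(sample)
print(summary)
```

With n = 11 the positions (n+1)p are whole numbers (3, 6 and 9), so each quartile is a single ordered observation. The value 82 lies more than 1.5 × IQR above Q3 and is flagged as an outlier, so the upper whisker stops at 50.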