Lesson A2 Mean statistics, Continuous Distributions, Probability density functions GEO1001.2020
Clara García-Sánchez, Stelios Vitalis
Resources adapted from: - David M. Lane et al. (http://onlinestatbook.com) - Allen B. Downey et al. (https://greenteapress.com/wp/think-stats-2e/) Lesson A2 Mean Statistics Overview
• Central Tendency
• Variability
• Shape Overview
• Central Tendency
• Variability
• Shape Central Tendencythe three datasets would make you happiest? In other words, in comparing your score with your fellow students' scores, in which dataset would your score of 3 be the most impressive? In Dataset A, everyone's score is 3. This puts your score at the exact center It is aboutof the determiningdistribution. You wherecan draw issatisfaction the centre from the of fact the that distribution. you did as well asThere are differenteveryone ways else. But to ofcalculate course it cuts it. both The ways: message everyone after else did all just is, as how well faras your grade foryou. example is from the centre of the distribution?
Table 1. Three possible datasets for the 5-point make-up quiz.
Student Dataset A Dataset B Dataset C
You 3 3 3
John's 3 4 2
Maria's 3 4 2
Shareecia's 3 4 2
Luther's 3 5 1
Now consider the possibility that the scores are described as in Dataset B. This is a depressing outcome even though your score is no different than the one in Dataset A. The problem is that the other four students had higher grades, putting yours below the center of the distribution. Finally, let's look at Dataset C. This is more like it! All of your classmates score lower than you so your score is above the center of the distribution. Now let's change the example in order to develop more insight into the 5/48 center of a distribution. Figure 1 shows the results of an experiment on memory for chess positions. Subjects were shown a chess position and then asked to reconstruct it on an empty chess board. The number of pieces correctly placed was recorded. This was repeated for two more chess positions. The scores represent the total number of chess pieces correctly placed for the three chess positions. The maximum possible score was 89.
!125 position along the number line, then it would be possible to balance them by placing a fulcrum at 6.8. Central Tendency Figure 4. The distribution is not balanced. Figure 5 shows an asymmetric distribution. To balance it, we cannot put the Figure 2. A balance scale. fulcrum halfway between the lowest and highest values (as we did in Figure 3). Placing the fulcrum at the “half way” point would cause it to tip towards the left. For another example, consider the distribution shown in Figure 3. It is balanced by placing the fulcrum in the geometric middle. 1. Balance scale position along the number line, then it would be possible to balance them by placing a fulcrum at 6.8.
(picking a number arbitrarily). Table 2 shows the sum of the absolute deviations of these numbers from the number 10. Figure 2. A balance scale. Figure 3. A distributionTable 2.balanced An exampleon the tip of aof triangle. the sum Figureof absolute 5. An asymmetric deviations distribution balanced on the tip of a triangle. Figure 4 illustrates that the same distribution can't be balanced by placing the For another example, consider the distribution shown in Figure 3. It is balancedfulcrum by to the left of center. The balance point defines one sense of a distribution's center. placing the fulcrum in the geometric middle. Absolute Deviations 2. Smallest Absolute Deviation Values from 10 Smallest Absolute Deviation Another way to define the center of a distribution is based on the concept of the 2 8 sum of the absolute deviations (differences). Consider the distribution made up of 3 7 the five numbers 2, 3, 4, 9, 16. Let's see how far the distribution is from 10 4 6 9 1 16 6 !128
!127 Sum 28 Table 3. An example of the sum of squared deviations. The first row of the table shows that the absolute value of the difference between 2 3. Smallest Squared Deviation Figure 3. A distribution balanced on the tip of a triangle. and 10 is 8; theSquared second Deviations row shows that the absolute difference between 3 and 10 is 7, Valuesand similarly forfrom the 10other rows. When we add up the five absolute deviations, Figure 4 illustrates that the same distribution can't be balanced by placing the we get2 28. So, the sum64 of the absolute deviations from 10 is 28. Likewise, the sum fulcrum to the left of center. of the3 absolute deviations49 from 5 equals 3 + 2 + 1 + 4 + 11 = 21. So, the sum of the absolute4 deviations from36 5 is smaller than the sum of the absolute deviations from 10. In9 this sense, 5 is 1closer, overall, to the other numbers than is 10. 16 36 We are now in a position to define a second measure of central tendency, this time in termsSum of absolute 186deviations. Specifically, according to our second definition, the center of a distribution is the number for which the sum of the absolute deviations is smallest. As we just saw, the sum of the absolute deviations from 106/48 is 28 and Thethe sum first of row the inabsolute the table deviations shows thatfrom the 5 is squared 21. Is there value a valueof the for difference which the between sum 2 andof the 10 absolute is 64; the deviations second isrow even shows smaller that than the 21?squared Yes. Fordifference these data, between there is3 aand 10 is value for which the sum of absolute deviations is only 20. See if you can find it. !127 49, and so forth. When we add up all these squared deviations, we get 186. ChangingSmallest theSquared target fromDeviation 10 to 5, we calculate the sum of the squared deviations fromWe shall 5 as discuss 9 + 4 +one 1 +more 16 +way 121 to = define 151. So,the centerthe sum of ofa distribution. the squared It deviations is based on from 5 isthe smaller concept than of the the sum sum of of squared the squared deviations deviations (differences). from 10.Again, Is there consider a value the for whichdistribution the sum of the of fivethe numberssquared deviations2, 3, 4, 9, 16. is Tableeven smaller3 shows thanthe sum 151? of Yes,the squared it is possibledeviations to of reach these 134.8. numbers Can from you the find number the target 10. number for which the sum of squared deviations is 134.8? The target that minimizes the sum of squared deviations provides another useful definition of central tendency (the last one to be discussed in this section).!129 It can be challenging to find the value that minimizes this sum.
!130 Central Tendency - Measures
1. Arithmetic mean (mean)
2. Median
Odd number of numbers: the median is the middle number
Even number of numbers: the median is the mean of the two middle numbers
When there are numbers with the same values —> 50th percentile formula
3. Mode
It is the most frequently occurring value in the list
The mode of continuous data is normally computed from a grouped frequency distribution 7/48 Overview
• Central Tendency
• Variability
• Shape Variability - Measures
The term variability refers to how spread out is a distribution. It can also be referred as spread or dispersion of the distribution.
There are 4 frequently used measures of variability:
1. Range: simply the highest and the lowest number in the distribution.
2. Interquartile Range (IQR): it is the range of the middle 50% of the scores in a distribution, as we saw before.
3. Variance: is the averaged squared difference of the scores from the mean.
4. Standard deviation: is simply the square root of the variance
9/48 Overview
• Central Tendency
• Variability
• Shape ShapesShapes and of Distributions distributions by David M. Lane
Prerequisites • Chapter 1: Distributions • Chapter 3: Measures of Central Tendency We focused• Chapter 3: Variability in two measures of shape distributions: skew and kurtosisLearning Objectives (also know as 3rd and 4th order moments (1st and 2nd being 1. Compute skew using two different formulas 2. Compute2 kurtosis � andWe saw � in ))the section on distributions in Chapter 1 that shapes of distributions can differ in skew and/or kurtosis. This section presents numerical indexes of these two measures of shape.
1. SkewnessSkew : distributions with + skew normally means Figure 1 shows a distribution with a very large positive skew. Recall that mean>>mediandistributions with positive skew have tails that extend to the right. 400
300
200 equency r F
100
0 25 75 125 175 225 275 325 375 425 475 525 575 625
Figure 1. A distribution with a very large positive skew. This histogram shows the salaries of major league baseball players (in thousands of 2. Kurtosisdollars). : the value 3 is subtracted to define no kurtosis of a normal distribution, otherwise a normal distribution would have a
kurtosis 3. !152
11/48 Lesson A2 Probability Density Functions Overview
• Probability Mass Functions (PMFs)
• Cumulative Density Functions (CDFs)
• Probability Density Functions (PDFs)
• Kernel Density Estimation
• The distribution Framework
• Moments
• Skewness Overview
• Probability Mass Functions (PMFs)
• Cumulative Density Functions (CDFs)
• Probability Density Functions (PDFs)
• Kernel Density Estimation
• The distribution Framework
• Moments
• Skewness Probability Mass Functions (PMFs)
It is a way to represent a distribution, which maps from each value its probability. Probability is a frequency expressed as a fraction of the sample size, n. To get from frequencies to probabilities, we divide through by n, which is called normalization. 34 Chapter 3. Probability mass functions
• To plot a PMF you can use “pyplot.hist”.
• By plotting PMF, instead of histogram, we can compare two distributions without being mislead by the difference in sample size (since PMFs are Figure 3.1: PMF of pregnancy lengths for first babies and others, using bar normalized) graphs and step functions.
thinkplot.PrePlot(2, cols=2) 16/48 thinkplot.Hist(first_pmf, align='right', width=width) thinkplot.Hist(other_pmf, align='left', width=width) thinkplot.Config(xlabel='weeks', ylabel='probability', axis=[27, 46, 0, 0.6])
thinkplot.PrePlot(2) thinkplot.SubPlot(2) thinkplot.Pmfs([first_pmf, other_pmf]) thinkplot.Show(xlabel='weeks', axis=[27, 46, 0, 0.6]) PrePlot takes optional parameters rows and cols to make a grid of figures, in this case one row of two figures. The first figure (on the left) displays the Pmfs using thinkplot.Hist, as we have seen before. The second call to PrePlot resets the color generator. Then SubPlot switches to the second figure (on the right) and displays the Pmfs using thinkplot.Pmfs. I used the axis option to ensure that the two figures are Overview
• Probability Mass Functions (PMFs)
• Cumulative Density Functions (CDFs)
• Probability Density Functions (PDFs)
• Kernel Density Estimation
• The distribution Framework
• Moments
• Skewness Cumulative Density Functions (CDFs)
The CDF is the function that maps from a value to its percentile rank.
It is a function of x, where x is any value that might appear in the distribution, and to evaluate the CDF(x) for a particular value of x, we compute the fraction of values in the distribution less than or equal to x. 4.4. Representing CDFs 49
Example:
- Suppose a sample [1,2,2,3,5].
- Some values of the CDF are: CDF(0)=0, CDF(1)=0.2, CDF(2)=0.6, CDF(3)=0.8, CDF(4)=0.8, CDF(5)=1
Figure 4.2: Example of a CDF. 18/48
If x is greater than the largest value, CDF(x)is1. Figure 4.2 is a graphical representation of this CDF. The CDF of a sample is a step function.
4.4 Representing CDFs
thinkstats2 provides a class named Cdf that represents CDFs. The funda- mental methods Cdf provides are:
Prob(x): Given a value x, computes the probability p = CDF(x). The • bracket operator is equivalent to Prob.
Value(p): Given a probability p, computes the corresponding value, • x; that is, the inverse CDF of p.
The Cdf constructor can take as an argument a list of values, a pandas Series, a Hist, Pmf, or another Cdf. The following code makes a Cdf for the distribution of pregnancy lengths in the NSFG: live, firsts, others = first.MakeFrames() cdf = thinkstats2.Cdf(live.prglngth, label='prglngth') Overview
• Probability Mass Functions (PMFs)
• Cumulative Density Functions (CDFs)
• Probability Density Functions (PDFs)
• Kernel Density Estimation
• The distribution Framework
• Moments
• Skewness Chapter 6
Probability density functions
The code for this chapter is in density.py. For information about down- loading and working with this code, see Section 0.2.
Probability Density Functions (PDFs) 6.1 PDFs
The derivative of a CDF isThe called derivative a probability of a CDF densityis called “probability function, density or PDF. function” For example, the PDF of an exponential distribution is