Basic Statistics for SGPE Students. Part II: Probability theory

Nicolai Vitt [email protected]

University of Edinburgh

September 2019

Thanks to Achim Ahrens, Anna Babloyan and Erkal Ersoy for creating these slides and allowing me to use them.

Outline

1. Probability
   - Conditional probability and independence
   - Bayes' theorem

2. Probability distributions
   - Discrete and continuous probability functions
   - Probability density function & cumulative distribution function
   - Binomial, Poisson and normal distributions
   - E[X] and V[X]

3. Descriptive statistics
   - Sample statistics (mean, variance, percentiles)
   - Graphs (box plot, histogram)
   - Data transformations (log transformation, unit of measurement)
   - Correlation vs. Causation

4. Statistical inference
   - Population vs. sample
   - Central limit theorem
   - Confidence intervals
   - Hypothesis testing and p-values

Probability

Example II.1 A fair coin is tossed three times.

Sample space and event
The (mutually exclusive and exhaustive) list of possible outcomes of an experiment is known as the sample space and is denoted as S. An event E is a single outcome or group of outcomes in the sample space. That is, E is a subset of S.

In this example, S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}, where H and T denote head and tail. Suppose we are interested in the event 'at least two heads'. The corresponding subset is E = {HHH, THH, HTH, HHT}. What is the probability of the event E?

Let's take a step back: What is probability?

Classical Interpretation (Jacob Bernoulli, Pierre-Simon Laplace)
If outcomes are equally likely, they must have the same probability. For example, when a coin is tossed, there are two possible outcomes: head and tail. More generally, if there are n equally likely outcomes, then the probability of each outcome is 1/n.

Frequency Interpretation
The probability that a specific outcome of a process will be obtained is the relative frequency with which that outcome would be obtained if the process were repeated a large number of times under the same conditions.

[Figure: relative frequency of heads plotted against the number of tosses (0 to 100) for two trials. As we make more and more tosses, the proportion of tosses that produce head approaches 0.5. We say that 0.5 is the probability of head.]
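To see the frequency interpretation at work, one can simulate repeated coin tosses and track the running proportion of heads. The sketch below is only an illustration (not part of the original slides); the function name and the choice of 100 tosses are arbitrary.

```python
import random

def running_proportion_of_heads(n_tosses, seed=None):
    """Toss a fair coin n_tosses times and return the running proportion of heads."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as 1 head
        proportions.append(heads / i)
    return proportions

props = running_proportion_of_heads(100, seed=1)
print(props[9], props[49], props[99])  # proportion of heads after 10, 50 and 100 tosses
```

As the number of tosses grows, the printed proportions tend to settle around 0.5, mirroring the figure described above.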

Subjective Interpretation (Bayesian approach)
The probability that a person assigns to a possible outcome represents his or her own judgement (based on the person's beliefs and information). Another person, who may have different beliefs or different information, may assign a different probability to the same outcome. This leads to the distinction between prior and posterior beliefs.

Thinking about chance...

[Carl Friedrich] Gauss's conversation turned to chance, the enemy of all knowledge, and the thing he had always wished to overcome. Viewed from up close, one could detect the infinite fineness of the web of causality behind every event. Step back and larger patterns appeared: Freedom and Chance were a question of distance, a point of view. Did he understand? Sort of, said Eugen wearily, looking at his pocket watch.
(from Measuring the World by Daniel Kehlmann)

Properties of probability

Rule 1
For any event A, 0 ≤ P(A) ≤ 1. Furthermore, P(S) = 1.

Rule 2: Complement rule
Ac denotes the complement of event A. Then P(Ac) = 1 − P(A).

Rule 3: Multiplication rule
Two events A and B are independent of each other if and only if
P(AB) = P(A and B) = P(A ∩ B) = P(A)P(B).

[On the slides, each rule is illustrated with a Venn diagram in the sample space S.]

Rule 4: Addition rule
If two events A and B are mutually exclusive, then
P(A or B) = P(A ∪ B) = P(A) + P(B).

Rule 5
If event B is a subset of event A, then P(B) ≤ P(A).

What is the probability of E?

Example II.1 A fair coin is tossed three times.
S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
E = {HHH, THH, HTH, HHT}
What is P(E)?

First, note that, because the coin is fair,
P(H) = P(T) = 1/2.
Second, since each toss is independent of the previous ones, we can use Rule 3 (Multiplication Rule):
P(HHH) = P(H)P(H)P(H) = 1/2 × 1/2 × 1/2 = 1/8,
and, following the same reasoning, P(THH) = P(HTH) = ... = 1/8.
Third, using Rule 4 (Addition Rule),
P(E) = P(HHH) + P(THH) + P(HTH) + P(HHT) = 4/8 = 1/2.
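The same answer can be checked by brute force: enumerate the eight equally likely outcomes and count those with at least two heads. This is a small illustrative sketch, not part of the original slides.

```python
from itertools import product

# All 2^3 = 8 equally likely outcomes of three fair coin tosses
sample_space = list(product("HT", repeat=3))

# Event E: at least two heads
event = [outcome for outcome in sample_space if outcome.count("H") >= 2]

print(len(event) / len(sample_space))  # 0.5
```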

Generalised addition rule

Example II.2 A fair six-sided die is rolled. The sample space is given by

S = {1, 2, 3, 4, 5, 6}.

Let E1 be the event ‘obtain 3 or 4’ and let E2 denote the event ‘smaller than 4’. Thus,

E1 = {3, 4} and E2 = {1, 2, 3}

It is immediately clear that P(E1) = 2/6 and P(E2) = 3/6. But what is the probability that either E1 or E2 occurs? That is, what is P(E1 ∪ E2)?

Since E1 and E2 are not mutually exclusive, we cannot apply Rule 4 (Addition Rule). But we can generalise Rule 4.


Rule 4': (General) Addition rule
For any two events A and B,
P(A or B) = P(A ∪ B) = P(A) + P(B) − P(AB).
Note that if A and B are mutually exclusive, P(AB) = 0. Therefore, Rule 4 is a special case of Rule 4'. Applying Rule 4', we get

P(E1 ∪ E2) = P(E1) + P(E2) − P(E1E2)
           = [P(3) + P(4)] + [P(1) + P(2) + P(3)] − P(3)
           = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 − 1/6 = 4/6.
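The general addition rule is easy to verify mechanically by treating events as sets of outcomes. The sketch below is illustrative only (not part of the slides) and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
E1 = {3, 4}      # 'obtain 3 or 4'
E2 = {1, 2, 3}   # 'smaller than 4'

def prob(event):
    # Each outcome of a fair die is equally likely
    return Fraction(len(event), len(sample_space))

# General addition rule: P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
lhs = prob(E1 | E2)
rhs = prob(E1) + prob(E2) - prob(E1 & E2)
print(lhs, rhs)  # 2/3 2/3
```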

Conditional probability

Example II.3 Suppose that, on any particular day, Anna is either in a good mood (A) or in a bad mood (Ac). Also, on any particular day, the sun is either shining (B) or not (Bc). Anna's mood depends on the weather, such that she is more likely to be in a good mood when the sun is shining.

[Venn diagram: the event A inside the sample space S.]
The blue area A, which represents the probability that Anna is in a good mood, is rather small compared to the full rectangle (≈ 35%). In general, it is more likely that Anna is in a bad mood.

[Venn diagram: the events A and B with their overlap, showing the regions ABc, AB and AcB.]
This graph shows both events, A and B, and their overlap.

[Venn diagram: conditioning on B; only the circle B is shown, split into AB and AcB.]
Now, suppose the sun is shining. We can discard the remaining sample space and focus on B. The area AB takes up most of the area in the circle. That is, given that B occurred, it is more likely that Anna is in a good mood, although, in general, she is more often in a bad mood.


Rule 3': General Multiplication rule
If A and B are any two events with P(A) > 0 and P(B) > 0, then
P(AB) = P(A)P(B|A) = P(B)P(A|B).

P(A|B) is the conditional probability of the event A given that the event B has occurred.

Conditional probability
From Rule 3' follows the definition of conditional probability:
P(A|B) = P(AB) / P(B).

Note that, if A and B are independent, then
P(A|B) = P(A)P(B) / P(B) = P(A).
Thus, Rule 3 is a special case of Rule 3'.

Example II.4 The following table contains counts (in thousands) of persons aged 25 and older, classified by educational attainment and employment status:

Education                      Employed   Unemployed   Not in labor force     Total
Did not finish high school       11,521          886               14,226    26,633
High school degree               36,857        1,682               22,834    61,373
Some college                     34,612        1,275               13,944    49,831
Bachelor's degree or higher      43,182          892               12,546    56,620
Total                           126,172        4,735               63,550   194,457

Is employment status independent of educational attainment?

Suppose we randomly draw a person from the population. What is the probability that the person is employed?
P(employed) = 126,172 / 194,457 = 0.6488.

Now, suppose we randomly draw another person and are given the information that the person did not finish high school. What is the probability that the person is employed given that the person did not finish high school?
P(employed | did not finish high school) = 11,521 / 26,633 = 0.4326.

We can display the relationship between education and employment in a probability table.

Education                      Employed   Unemployed   Not in labor force     Total
Did not finish high school      0.05925      0.00456              0.07316   0.13696
High school degree              0.18954      0.00865              0.11742   0.31561
Some college                    0.17800      0.00656              0.07171   0.25626
Bachelor's degree or higher     0.22206      0.00459              0.06452   0.29117
Total                           0.64884      0.02435              0.32681   1.00000

The probabilities in the body of the table are joint probabilities. For example,

P(no high school ∩ unemployed) = P(unemp.)P(no high school | unemp.)
                               = (4,735 / 194,457) × (886 / 4,735)
                               = P(no high school)P(unemp. | no high school)
                               = (26,633 / 194,457) × (886 / 26,633)
                               = 886 / 194,457 = 0.00456.


The probabilities in the right-most column and in the bottom row are called marginal probabilities. For example,

P(High school degree) = 61,373 / 194,457 = 0.31561.


Note that, under independence, we would have

P(no high school ∩ employed) = P(employed)P(no high school) = 0.64884 × 0.13696 = 0.08887 ≠ 0.05925,

which indicates that educational attainment and employment are not independent.
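The joint, marginal and conditional probabilities above can all be reproduced from the count table. Here is a small illustrative sketch in plain Python (not part of the original slides); the dictionary keys are shorthand labels of my own choosing.

```python
# Counts (in thousands): rows are education levels,
# columns are (employed, unemployed, not in labor force)
counts = {
    "no high school": (11_521, 886, 14_226),
    "high school":    (36_857, 1_682, 22_834),
    "some college":   (34_612, 1_275, 13_944),
    "bachelor+":      (43_182, 892, 12_546),
}
total = sum(sum(row) for row in counts.values())

# Marginal probabilities
p_employed = sum(row[0] for row in counts.values()) / total
p_no_hs = sum(counts["no high school"]) / total

# Joint and conditional probabilities
p_no_hs_and_employed = counts["no high school"][0] / total
p_employed_given_no_hs = p_no_hs_and_employed / p_no_hs

print(round(p_employed, 4))              # 0.6488
print(round(p_employed_given_no_hs, 4))  # 0.4326
print(round(p_employed * p_no_hs, 5))    # 0.08887, not equal to the joint 0.05925
```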


Example II.5 Ms Smith, Ms Brown and Ms Thomson want to spend a day in Edinburgh, but cannot agree on what to do. They decide to vote. Each person can choose between theatre (T) and cinema (C). Ms Smith and Ms Thomson decide independently, but Ms Brown is affected by Ms Thomson. The probabilities can be summarised as follows:
P(Thomson = T) = 0.2
P(Brown = T | Thomson = T) = 0.8
P(Brown = T | Thomson = C) = 0.05
P(Smith = T) = 0.8

What is the probability that the majority (i.e. at least two) will vote in favour of theatre?
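The slides leave this as an exercise. One way to answer it is to enumerate the eight possible vote combinations and add up the probabilities of those with at least two votes for theatre; a brief illustrative sketch (not from the slides), where the function and variable names are my own:

```python
from itertools import product

p_thomson_T = 0.2
p_smith_T = 0.8
p_brown_T_given_thomson = {"T": 0.8, "C": 0.05}  # Brown's vote depends on Thomson's

def prob(thomson, brown, smith):
    """Joint probability of one particular combination of votes."""
    p = p_thomson_T if thomson == "T" else 1 - p_thomson_T
    p_brown_T = p_brown_T_given_thomson[thomson]
    p *= p_brown_T if brown == "T" else 1 - p_brown_T
    p *= p_smith_T if smith == "T" else 1 - p_smith_T
    return p

p_majority_theatre = sum(
    prob(t, b, s)
    for t, b, s in product("TC", repeat=3)
    if (t, b, s).count("T") >= 2
)
print(round(p_majority_theatre, 3))  # 0.224
```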

Independence versus disjointness

Recall that P(A and B) = P(A)P(B) holds if and only if A and B are independent. Furthermore,

P(A and B) = 0

holds if and only if the events A and B are disjoint or mutually exclusive. Therefore, if A and B are nontrivial events (i.e. P(A) and P(B) are nonzero), then they cannot be both independent and mutually exclusive.

Remark
Independent and disjoint do not mean the same thing! Disjointness means that A and B cannot occur at the same time. Independence means that the occurrence of A has no influence on the probability that B happens, and vice versa.

Bayes' theorem

Derivation
From Rule 3',
P(A|B) = P(AB) / P(B)    (1)
P(B|A) = P(AB) / P(A)    (2)

We can rewrite (2) as P(B|A)P(A) = P(AB) and substitute the expression into (1) to get
P(A|B) = P(B|A)P(A) / P(B).    (3)
Furthermore, P(B) = P(BA) + P(BAc), and from (2), P(B) = P(B|A)P(A) + P(B|Ac)P(Ac). Therefore, we can write (3) as

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)].


Bayes' theorem
For any two events A and B with 0 < P(A) < 1 and 0 < P(B) < 1,
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)].

Bayes’ theorem provides a simple rule for computing the conditional probability of the event A given B from the conditional probability of B given A (and the unconditional probability of A).


Example II.6 Suppose you have three coins in a box. Two of them are fair and the other one is counterfeit and always lands heads. Thus, if you randomly pick one coin, there is a 1/3 chance that the coin is counterfeit; i.e. P(counterfeit) = 1/3. P(counterfeit) is the prior (or unconditional) probability. Now, you toss the randomly picked coin three times and get three heads.

We are interested in the (posterior) probability that the coin is counterfeit conditional on observing three heads. That is,
P(counterfeit|HHH) = P(HHH|counterfeit)P(counterfeit) / [P(HHH|counterfeit)P(counterfeit) + P(HHH|fair)P(fair)].

We know from above that
P(counterfeit) = 1/3, P(fair) = 2/3,
P(HHH|counterfeit) = 1, P(HHH|fair) = 1/8.

Thus,
P(counterfeit|HHH) = (1 × 1/3) / (1 × 1/3 + 1/8 × 2/3) = 4/5.
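A small helper function makes this kind of posterior calculation mechanical. The sketch below is illustrative only (not part of the slides); the function name is my own.

```python
from fractions import Fraction

def posterior(prior_a, likelihood_given_a, likelihood_given_not_a):
    """Bayes' theorem: P(A|B) from P(A), P(B|A) and P(B|Ac)."""
    numerator = likelihood_given_a * prior_a
    denominator = numerator + likelihood_given_not_a * (1 - prior_a)
    return numerator / denominator

# Example II.6: three heads from a randomly picked coin
print(posterior(Fraction(1, 3), Fraction(1), Fraction(1, 8)))  # 4/5
```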

Example II.7 Suppose a test for an illegal drug correctly identifies drug users 90% of the time and gives a positive reading for non-drug users only 1% of the time. One person in a thousand in the population is a drug user. Timmy is tested positive, indicating that he is a drug user. How likely is it that Timmy is actually a drug user?

We are looking for
P(user|pos.) = P(pos.|user)P(user) / [P(pos.|user)P(user) + P(pos.|non-user)P(non-user)].

From the text above, we know that P(user) = 0.001, P(non-user) = 0.999, P(pos.|user) = 0.9 and P(pos.|non-user) = 0.01. Therefore,
P(user|pos.) = (0.9 × 0.001) / (0.9 × 0.001 + 0.01 × 0.999) ≈ 0.083.



The prior (unconditional) probability that Timmy is a drug user is P(user) = 0.001. Based on the information from the test, we update the prior probability of 0.001 upwards to a posterior probability of approximately 0.083. This probability is surprisingly low. Despite the positive test result and despite the test being quite reliable, it is more likely that Timmy is not a drug user than that he is a drug user!



We can display the relationship between test results and drug consumption in a probability table:

Drug user?   Positive   Negative   Total
Non-user       0.0099     0.9891   0.999
User           0.0009     0.0001   0.001
Total          0.0108     0.9892   1.000

P(user ∩ positive) = P(user)P(pos.|user) = 0.001 × 0.9 = 0.0009
P(non-user ∩ positive) = P(non-user)P(pos.|non-user) = 0.999 × 0.01 = 0.00999
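The probability table can be built up from the prior and the test's error rates, and the posterior read off as a ratio. An illustrative sketch (not from the slides):

```python
p_user = 0.001
p_pos_given_user = 0.90
p_pos_given_nonuser = 0.01

# Joint probabilities, as in the probability table above
joint = {
    ("user", "positive"):     p_user * p_pos_given_user,
    ("user", "negative"):     p_user * (1 - p_pos_given_user),
    ("non-user", "positive"): (1 - p_user) * p_pos_given_nonuser,
    ("non-user", "negative"): (1 - p_user) * (1 - p_pos_given_nonuser),
}

# Marginal probability of a positive test, then the posterior P(user | positive)
p_positive = joint[("user", "positive")] + joint[("non-user", "positive")]
p_user_given_positive = joint[("user", "positive")] / p_positive

print(round(p_user_given_positive, 3))  # 0.083
```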

Monty Hall problem

You are in a game show. There are three doors. Behind one door is a car; behind the other two doors are goats. You pick one door (here door 1). The game host then opens another door which has a goat behind it (here door 2). Finally, the game host gives you the chance to switch to the other closed door (here door 3). Should you stick to your door or switch? Does it matter?

The answer seems obvious: it should not make a difference. There are two doors left, so the probability of winning should be 0.5, independent of how you decide. However, this reasoning is wrong! To see why, let's list all nine different cases and see which strategy is more successful.


Your pick   Car behind   Stick   Switch
Door 1      Door 1       WIN     LOSE
Door 1      Door 2       LOSE    WIN
Door 1      Door 3       LOSE    WIN
Door 2      Door 1       LOSE    WIN
Door 2      Door 2       WIN     LOSE
Door 2      Door 3       LOSE    WIN
Door 3      Door 1       LOSE    WIN
Door 3      Door 2       LOSE    WIN
Door 3      Door 3       WIN     LOSE

Whichever door we pick initially, sticking wins in only one of the three equally likely car positions, while switching wins in the other two. If we switch, we have a 2/3 chance of winning! Watch video
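A quick simulation confirms the 2/3 figure. This is an illustrative sketch only (not part of the slides); the number of repetitions and the seed are arbitrary.

```python
import random

def play(switch, rng):
    """Play one round of the Monty Hall game; return True if the player wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
print(sum(play(switch=True, rng=rng) for _ in range(n)) / n)   # close to 2/3
print(sum(play(switch=False, rng=rng) for _ in range(n)) / n)  # close to 1/3
```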

The Birthday Problem

Example II.8 Suppose there is a group of k (2 ≤ k ≤ 365) people. What is the probability that at least two people in the group share the same birthday (i.e. year of birth does not matter)? Ignore February 29 and assume that each of the 365 days of a year is equally likely to be the birthday of any person and that birthdays of the group members are unrelated (no twins).

It turns out that it is easier to start with the question “What is the probability that no one in the group shares a birthday?”. Note that

P(at least two share a birthday) = 1 − P(no one shares a birthday).

Let's start with k = 2. Given that the first person has her birthday on any arbitrary day of the year, the probability that the second person does not have the same birthday is 364/365.



k = 3: The probability that three persons do not share the same birthday is (364/365) × (363/365). And, in general,

[364 · 363 · 362 · ... · (365 − k + 1)] / 365^(k−1).



Note that
n(n − 1) ... (n − k + 1) = n(n − 1) ... (n − k + 1) × [(n − k)(n − k − 1) ... 1] / [(n − k)(n − k − 1) ... 1] = n! / (n − k)!,
where n! = n(n − 1) ... 1 and 0! = 1. Thus, we can write the above as
P(no one shares a birthday) = 365! / [(365 − k)! 365^k],
and the solution is
P(at least two share a birthday) = 1 − 365! / [(365 − k)! 365^k].



The table shows the probability p that at least two people in a group of k people will have the same birthday.

 k     p
 5     0.027
10     0.117
15     0.253
20     0.411
22     0.476
23     0.507
25     0.569
30     0.706
40     0.891
50     0.970
60     0.994

Watch video
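The probabilities in the table follow directly from the formula above; a short illustrative sketch (not part of the slides):

```python
def p_shared_birthday(k):
    """Probability that at least two of k people share a birthday (365 equally likely days)."""
    p_no_match = 1.0
    for i in range(k):
        p_no_match *= (365 - i) / 365  # the (i+1)-th person avoids the first i birthdays
    return 1 - p_no_match

for k in (5, 10, 20, 23, 30, 50, 60):
    print(k, round(p_shared_birthday(k), 3))
```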

Sampling with replacement

The birthday problem is an example of sampling with replacement.

Sampling with replacement (stylised example)
A box contains n balls numbered 1, ..., n. First, one ball is selected at random from the box and its number noted. This ball is then put back in the box and another ball is selected. Thus, it is possible that the same ball is selected again. This process is called sampling with replacement. It is assumed that each of the n balls is equally likely to be selected at each stage and that the selections are independent of each other. Suppose we pick k balls. There are in total n^k different outcomes. The probability assigned to each outcome is 1/n^k.

Sampling without replacement

Example II.9 Suppose we have a box of 6 different books and we randomly arrange the books on a shelf. What is the probability that, by chance, the books are ordered alphabetically?

There are 6 · 5 · 4 · 3 · 2 · 1 = 6! = 720 distinct ways of arranging 6 books, but only one order is alphabetically correct. Thus, p = 1/720.

More generally: Permutations
Suppose that k cards are to be selected and removed from a deck of n cards without replacement. Each possible distinct outcome is called a permutation. The total number of permutations is
P_{n,k} = n(n − 1) ... (n − k + 1) = n! / (n − k)!,
where a! = a(a − 1)(a − 2) ... 1 and 0! = 1.
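Both calculations are easy to check in code. The sketch below is illustrative only (not part of the slides); the deck size n = 52 and k = 5 are arbitrary example values.

```python
from math import factorial, perm  # math.perm requires Python 3.8+

# Example II.9: 6 books, only one of the 6! orderings is alphabetical
print(1 / factorial(6))  # 0.001388... = 1/720

# Number of permutations P_{n,k} = n! / (n − k)!
n, k = 52, 5
print(perm(n, k), factorial(n) // factorial(n - k))  # both print 311875200
```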

Summary

- Frequentist approach: The probability of an outcome is the relative frequency with which that outcome would be obtained if the experiment were repeated a large number of times.

- Independence and disjointness are not the same! If two events A and B are mutually exclusive (or disjoint), then P(AB) = 0. If two events are independent, then the occurrence of A has no influence on the probability that B occurs, and vice versa.

- Bayes' theorem provides a rule for computing the conditional probability of the event A given B from the conditional probability of B given A. It is the building block of Bayesian econometrics.
