

Basic Statistics for SGPE Students
Part II: Probability distributions

Nicolai Vitt [email protected]

University of Edinburgh

September 2019

Thanks to Achim Ahrens, Anna Babloyan and Erkal Ersoy for creating these slides and allowing me to use them.

Outline
1. Probability theory

- Conditional probability and independence
- Bayes' theorem
2. Probability distributions

- Discrete and continuous probability functions
- Probability density & cumulative distribution function
- Binomial, Poisson and normal distributions
- E[X] and V[X]
3. Descriptive statistics

- Summary statistics
- Graphs (box plots, ...)
- Data transformations (log transformation, units of measurement)
- Correlation vs. causation
4. Statistical inference

- Population vs. sample
- Law of large numbers
- Central limit theorem
- Confidence intervals
- Hypothesis testing and p-values

Random variables
Most of the outcomes or events we have considered so far have been non-numerical, e.g. either head or tail. If the outcome of an experiment is numerical, we call the variable that is determined by the experiment a random variable. Random variables may be either discrete (e.g. the number of days the sun shines) or continuous (e.g. your salary after graduating from the MSc). In contrast to a continuous random variable, we can list the distinct potential outcomes of a discrete random variable.

Notation
Random variables are usually denoted by capital letters, e.g. X. The corresponding realisations are denoted by small letters, e.g. x.

Should you make the bet?

Example III.1
I propose the following game. We toss a fair coin 10 times. If head appears 4 times or less, I pay you £2. If head appears more than 4 times, you pay me £1. Should you make the bet?

Let’s try to formalise the problem. Let the random variables X1, X2,..., X10 be defined such that

X_i = \begin{cases} 1 & \text{if head appears on the ith toss} \\ 0 & \text{if tail appears on the ith toss} \end{cases} \quad \text{for } i = 1, \dots, 10.

Should you make the bet?
Furthermore, let the random variable Y denote the number of heads. Clearly, Y = X_1 + X_2 + ... + X_{10}. If the realisation of Y is greater than 4, I win. Let P(Y = y) denote the probability that Y takes the value y. Accordingly, P(Y ≤ 4) is the probability that we obtain 4 or fewer heads and P(Y > 4) is the probability that we obtain more than 4 heads. When would you make the bet? Your expected payoff is

E[V] = P(Y ≤ 4) · £2 + P(Y > 4) · (−£1)

where V is the money you get. If E[V] > 0 (and you are risk neutral), you'll choose to play.

Should you make the bet?

Expected value
The expected value of a discrete random variable X is denoted by E[X] and given by

E[X] = x_1 P(X=x_1) + x_2 P(X=x_2) + \cdots + x_k P(X=x_k) = \sum_{i=1}^{k} x_i P(X=x_i)

where k is the number of distinct outcomes.
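To make the definition concrete, here is a minimal Python sketch (not part of the original slides; the function name is ours):

    # E[X] as the probability-weighted sum of the outcomes.
    def expected_value(outcomes, probs):
        return sum(x * p for x, p in zip(outcomes, probs))

    # A fair die (Example III.2 below) has E[X] = (1 + 2 + ... + 6)/6 = 3.5.
    print(expected_value([1, 2, 3, 4, 5, 6], [1/6] * 6))  # 3.5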

Should you make the bet?
To solve the problem, we need to find P(Y ≤ 4) and P(Y > 4). From the additive law (Rule 4), we know that

P(Y ≤ 4) = P(Y=0 ∪ Y=1 ∪ Y=2 ∪ Y=3 ∪ Y=4) = P(Y=0) + P(Y=1) + P(Y=2) + P(Y=3) + P(Y=4)
P(Y > 4) = P(Y=5) + P(Y=6) + P(Y=7) + P(Y=8) + P(Y=9) + P(Y=10)

Hence, we need to find P(Y = y) for y = 0, ..., 10.

Discrete probability distributions
It is common to denote the probability distribution of a discrete random variable Y by f(y).

Discrete probability distribution
The probability distribution or probability mass function of a discrete random variable X associates with each of the distinct potential outcomes x_i (i = 1, ..., k) a probability P(X = x_i). That is, f(x_i) = P(X = x_i). The probabilities sum to 1, i.e. \sum_{i=1}^{k} f(x_i) = 1.

Discrete probability distribution
Two examples:

Example III.2 (Discrete uniform distribution)
Let X be the result from rolling a fair die. The probability distribution is simply

f(x) = P(X = x) = \begin{cases} 1/6 & \text{for } x \in \{1, 2, \dots, 6\} \\ 0 & \text{otherwise} \end{cases}

This probability distribution is an example of a discrete uniform distribution.

Bernoulli distribution
It is said that a random variable X has a Bernoulli distribution with P(X = 1) = p (i.e. probability of success) if X can take only the values 1 (success) and 0 (failure). The probability distribution is given by

f(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}

Binomial coefficient & binomial distribution
Let's start with f(0) = P(Y=0), which is the probability of obtaining no heads. Using the multiplicative law,

P(Y=0) = P(X_1=0) P(X_2=0) \cdots P(X_{10}=0) = \left(\tfrac{1}{2}\right)^{10} = 0.00097656

Now, f (1) = P(Y =1). Since we are interested in the number of heads, we have to take into account that there is more than one combination that results in 1 head.

P(Y=1) = P(X_1=1) P(X_2=0) \cdots P(X_{10}=0)
       + P(X_1=0) P(X_2=1) \cdots P(X_{10}=0)
       + \cdots
       + P(X_1=0) P(X_2=0) \cdots P(X_{10}=1)
     = 10 \cdot \left(\tfrac{1}{2}\right)^{10} = 0.00976563

Binomial coefficient & binomial distribution
Now, f(2) = P(Y=2). How many combinations are there that yield 2 heads out of 10 tosses? Given that the first toss produces a head, there are 9 combinations that yield two heads in total. And so on...

toss:           1 2 3 4 5 6 7 8 9 10
combination 1:  H H T T T T T T T T
combination 2:  H T H T T T T T T T
combination 3:  H T T H T T T T T T
combination 4:  H T T T H T T T T T
combination 5:  H T T T T H T T T T
combination 6:  H T T T T T H T T T
combination 7:  H T T T T T T H T T
combination 8:  H T T T T T T T H T
combination 9:  H T T T T T T T T H


Repeating this with a head fixed on each of the 10 tosses gives us 10 · 9 = 90 combinations. However, this approach has the problem of double counting: each combination appears exactly twice (once for each of its two heads). So we have to divide by 2 and get (10 · 9)/2 = 45 distinct combinations. Thus,

P(Y = 2) = \frac{10 \cdot 9}{2} \left(\frac{1}{2}\right)^{10} = 0.04394531.

Binomial coefficient & binomial distribution
For P(Y = 3), P(Y = 4), ... this gets even more complicated.

Binomial coefficient
Suppose that there is a set of n distinct elements from which it is desired to choose a subset of k elements (typically 1 ≤ k ≤ n). The binomial coefficient gives the number of ways k elements can be selected from n elements. The binomial coefficient is defined as

C_{n,k} = {}_n C_k = \binom{n}{k} = \frac{n!}{k!(n-k)!}

where k! = k(k-1)(k-2) \cdots 1 and 0! = 1.

Remark
Note that

\frac{n!}{(n-k)!} = n \cdot (n-1) \cdot (n-2) \cdots (n-k+1).

For example,

\frac{7!}{(7-3)!} = \frac{7!}{4!} = \frac{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{4 \cdot 3 \cdot 2 \cdot 1} = 7 \cdot 6 \cdot 5.
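As a quick check (our addition, using only Python's standard library), the binomial coefficients used on these slides can be computed directly:

    from math import comb, factorial

    # Number of ways to choose k elements from n, ignoring order.
    print(comb(10, 2))  # 45 = (10 * 9)/2, the count derived above
    print(factorial(10) // (factorial(2) * factorial(8)))  # 45, from n!/(k!(n-k)!)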

Binomial coefficient & binomial distribution
Let's consider another example to get a better understanding of the binomial coefficient.

Example III.3
Imagine a box with four distinct elements (n = 4) denoted as a, b, c, d. We want to randomly pick two elements (k = 2). If the order of selecting elements matters, there exist 4 · 3 different combinations. However, we don't want the order to matter, so we divide by 2, as there are two ways of ordering two elements ({b, a} and {a, b}). Therefore, there are

\frac{4 \cdot 3}{2} = \frac{4!}{2!(4-2)!} = \binom{4}{2} = 6

different combinations: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.

Combination vs. permutation


Note the distinction between permutation (order matters) and combination (order does not matter). If order matters (e.g. we distinguish between {a, b} and {b, a}), the solution to the above problem is simply 4 · 3 = 12.

Binomial coefficient & binomial distribution
Back to our problem: For P(Y = 3),

f(3) = P(Y = 3) = \binom{10}{3} \left(\frac{1}{2}\right)^{10} = \frac{10!}{3!(10-3)!} \left(\frac{1}{2}\right)^{10} = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1} \left(\frac{1}{2}\right)^{10} = 0.1171875.

The binomial coefficient allows us to find a general expression for f(y).

Binomial distribution

If the random variables X_1, ..., X_n form n Bernoulli trials with parameter p (i.e. probability of success), then Y = X_1 + \cdots + X_n follows a binomial distribution. The binomial distribution is given by

f(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y}

for y = 0, 1, ..., n.

Binomial distribution
We now know the specific functional form of f(y) = P(Y = y). Hence, we can obtain the probability that we draw 0, 1, 2, ..., 10 heads.

Binomial distribution (n = 10, p = 0.5):

 y    f(y)
 0    0.00098
 1    0.00977
 2    0.04395
 3    0.11719
 4    0.20508
 5    0.24609
 6    0.20508
 7    0.11719
 8    0.04395
 9    0.00977
10    0.00098

[Figure: bar chart of f(y) for the binomial distribution with n = 10, p = 0.5]
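The table can be reproduced with a few lines of Python (a sketch; we assume scipy is available):

    from scipy.stats import binom

    # pmf of Y ~ Binomial(n = 10, p = 0.5) for y = 0, 1, ..., 10.
    n, p = 10, 0.5
    for y in range(n + 1):
        print(y, round(binom.pmf(y, n, p), 5))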

Cumulative distribution function
However, we are interested in P(Y ≤ 4).

Cumulative distribution function
The cumulative distribution function of a discrete random variable X is denoted by F(x) and is defined as F(x) = P(X ≤ x), where −∞ ≤ x ≤ +∞. The cumulative distribution function F(x) gives the probability that the outcome of X in a random trial will be less than or equal to any specified value x.

Binomial distribution
Cumulative distribution function:

 y    f(y)     F(y)
 0    0.00098  0.00098
 1    0.00977  0.01074
 2    0.04395  0.05469
 3    0.11719  0.17188
 4    0.20508  0.37695
 5    0.24609  0.62305
 6    0.20508  0.82813
 7    0.11719  0.94531
 8    0.04395  0.98926
 9    0.00977  0.99902
10    0.00098  1.00000

[Figure: step plot of the CDF F(y) for the binomial distribution with n = 10, p = 0.5]

For example, F(2) = f (0) + f (1) + f (2) = 0.00098 + 0.00977 + 0.04395 = 0.05469.

Should you make the bet?

Example III.1 (continued) I propose the following game. We toss a fair coin 10 times. If head appears 4 times or less, I pay you £2. If head appears more than 4 times, you pay me £1. Should you make the bet?

We can finally solve the problem. Your expected value is

E[V] = P(Y ≤ 4) · £2 + P(Y > 4) · (−£1)
     = F(4) · £2 + (1 − F(4)) · (−£1)
     = 0.377 · £2 + 0.623 · (−£1)
     ≈ £0.131.

You should make the bet (if you are risk neutral)! What does E[V ] mean? If we repeat the game an infinite number of times, your average payoff will be £0.131.
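As a sketch (scipy assumed), the whole calculation fits in a few lines:

    from scipy.stats import binom

    # F(4) = P(Y <= 4) for Y ~ Binomial(10, 0.5), then the expected payoff.
    F4 = binom.cdf(4, 10, 0.5)     # 0.37695...
    EV = F4 * 2 + (1 - F4) * (-1)  # in pounds
    print(F4, EV)                  # ~0.377, ~0.131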

Binomial distribution (simulation)
Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again. Let's set m = 20. We get: 3 7 5 2 3 3 2 5 3 7 4 4 4 4 5 3 4 5 4 2.

 y              0    1    2    3    4    5    6    7    8    9    10
 frequency      0    0    3    5    6    4    0    2    0    0    0
 rel. frequency 0.00 0.00 0.15 0.25 0.30 0.20 0.00 0.10 0.00 0.00 0.00

[Figure: empirical distribution from 20 repetitions vs. the binomial distribution (n = 10, p = 0.5)]

Binomial distribution (simulation)
Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again. Let's set m = 50.

 y              0    1    2    3    4    5    6    7    8    9    10
 frequency      0    0    3    7    11   8    7    7    4    3    0
 rel. frequency 0.00 0.00 0.06 0.14 0.22 0.16 0.14 0.14 0.08 0.06 0.00

[Figure: empirical distribution from 50 repetitions vs. the binomial distribution (n = 10, p = 0.5)]

Binomial distribution (simulation)
Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again. Let's set m = 100.

 y              0    1    2    3    4    5    6    7    8    9    10
 frequency      0    2    3    17   26   22   14   11   3    2    0
 rel. frequency 0.00 0.02 0.03 0.17 0.26 0.22 0.14 0.11 0.03 0.02 0.00

[Figure: empirical distribution from 100 repetitions vs. the binomial distribution (n = 10, p = 0.5)]

Binomial distribution (simulation)
Suppose that we play the game m times. That is, we toss the coin 10 times, write down the number of heads, and play again. Let's set m = 10,000.

 y              0      1      2      3      4      5      6      7      8      9      10
 frequency      4      89     429    1171   2045   2470   2075   1198   411    103    5
 rel. frequency 0.0004 0.0089 0.0429 0.1171 0.2045 0.2470 0.2075 0.1198 0.0411 0.0103 0.0005

[Figure: empirical distribution from 10,000 repetitions vs. the binomial distribution (n = 10, p = 0.5)]
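The simulation itself is easy to reproduce (a sketch; numpy assumed, and the seed is arbitrary):

    import numpy as np

    # Play the game m times: each game is n = 10 fair coin tosses;
    # record the number of heads per game.
    rng = np.random.default_rng(seed=0)
    m, n, p = 10_000, 10, 0.5
    heads = rng.binomial(n, p, size=m)

    values, counts = np.unique(heads, return_counts=True)
    for y, c in zip(values, counts):
        print(y, c, c / m)  # y, frequency, relative frequency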

Binomial distribution
The binomial distribution has two parameters: n is the number of (Bernoulli) trials and p is the probability of success in each trial. How does the distribution look for different values of n and p?

[Figure: binomial distributions for n = 10 and n = 30, each with p = 0.5, 0.7 and 0.9]

Binomial distribution
Expected value and variance
In the same way we summarise an observed dataset by the sample average and the sample variance (or standard deviation), we can characterise a probability distribution by its expected value and its variance. From the figures we can see that the expected value and variance change with n and p. To find the expected value of Y, note that:

Linearity of expectation

If Y is the sum of random variables X_1, X_2, ..., X_n, then

E[Y] = E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i].

Furthermore, if c is a constant (i.e., non-random) and X a random variable, then

E[X + c] = E[X] + c
E[cX] = c E[X].

Binomial distribution
Expected value and variance

Recall that Y = X1 + X2 + X3 + ··· + Xn. Therefore,

E[Y ] = E[X1] + E[X2] + E[X3] + ··· + E[Xn]

Recall that Xi follows a Bernoulli distribution. The expected value of a Bernoulli variable is

E[Xi] = p · 1 + (1 − p) · 0 = p. Therefore,

E[Y ] = E[X1] + E[X2] + E[X3] + ... + E[Xn] = np.

Binomial distribution
Expected value and variance

Variance
The variance of a discrete random variable X is denoted by V[X] and given by

V[X] = \sum_{i=1}^{k} (x_i - E[X])^2 P(X = x_i).

Variance of the sum of uncorrelated random variables

If Y is the sum of independent (!) random variables X_1, X_2, ..., X_n, then

V[Y] = V\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} V[X_i]

The variance of X_i is given by

V[X_i] = (1 - E[X_i])^2 p + (0 - E[X_i])^2 (1 - p) = p(1 - p)

Therefore V[Y] = np(1 - p).

Poisson distribution

Example III.4
Let X be the number of cars passing by in an hour. On average, λ cars pass by. What is the probability that 3 cars pass by?

To simplify the problem, we divide each hour into 60 minutes. Since E[X] = λ, the probability that one car passes by in any particular minute is λ/60. Using this simplification, we can work with the binomial distribution.

f(3) = P(X = 3) ≈ \binom{60}{3} \left(\frac{\lambda}{60}\right)^3 \left(1 - \frac{\lambda}{60}\right)^{60-3}

However, this approach does not take into account that more than one car may pass by in a minute.

Poisson distribution
We can address this problem by dividing each hour into 3,600 seconds, 3,600,000 milliseconds, and so on. More generally, with n being the number of units we divide the hour into:

f(x) = P(X = x) ≈ \binom{n}{x} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n-x}

Let n → ∞, do some maths, and you'll arrive at:

Poisson distribution
If X follows a Poisson process, then

f(x) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}

for x = 0, 1, 2, 3, ..., ∞. Note that e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n = 2.718....
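The limit can be seen numerically (our sketch, scipy assumed): as n grows, the binomial probabilities approach the Poisson probabilities.

    from scipy.stats import binom, poisson

    lam, x = 3, 3
    for n in (60, 3_600, 3_600_000):
        print(binom.pmf(x, n, lam / n))  # approaches the Poisson pmf as n grows
    print(poisson.pmf(x, lam))           # e**-3 * 3**3 / 3! = 0.2240...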

Poisson distribution

[Figure: Poisson distributions with lambda = 1 and lambda = 3]

The Poisson distribution is asymmetric (right-skewed) for E[X] = λ = 1 and λ = 3.

Poisson distribution

[Figure: Poisson distributions with lambda = 10 and lambda = 1000]

The higher λ, the more symmetric the Poisson distribution. Also, the distribution looks very similar to the normal distribution!

Continuous distributions

Probability density function (PDF)
If the random variable X is continuous, we use f(x) to denote the probability density function (PDF) of X. The PDF satisfies two requirements:

(1) f(x) ≥ 0 and (2) \int_{-\infty}^{+\infty} f(x)\,dx = 1

Remark (!)
If the random variable X is continuous, then the probability that X takes a particular value x is zero. That is, f(x) ≠ P(X = x) = 0.

Continuous distributions

Uniform distribution
Let a and b be two real numbers (a < b) and consider an experiment where a number X is randomly selected from the interval [a, b]. If the probability that X belongs to any subinterval of [a, b] is proportional to the length of the subinterval, we say that X is uniformly distributed. The PDF of X is given by

f(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}

We write that X ∼ u(a, b).

Continuous distributions

Example III.5 Anna and Achim arrange to meet ‘between 1pm and 2pm’ at Old College. Their arrival time is uniformly distributed and they arrive independently of each other. What is the probability that no one will have to wait more than 15 minutes? Let X denote Anna’s arrival time and Y denote Achim’s arrival time.

- Express the event 'no one will wait for more than 15 minutes' in terms of X and Y.

- What is the joint distribution of X and Y? (A simulation sketch follows below.)
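Not part of the original slides: a Monte Carlo sketch (numpy assumed) you can use to check your answer. The event is |X − Y| ≤ 15 with X and Y independent and uniform on [0, 60] minutes; the exact answer works out to 1 − (45/60)² = 7/16 = 0.4375.

    import numpy as np

    # P(|X - Y| <= 15) with X, Y ~ u(0, 60), independent.
    rng = np.random.default_rng(seed=0)
    m = 1_000_000
    x = rng.uniform(0, 60, size=m)  # Anna's arrival time (minutes after 1pm)
    y = rng.uniform(0, 60, size=m)  # Achim's arrival time
    print(np.mean(np.abs(x - y) <= 15))  # ~0.4375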

Continuous distributions
The normal distribution is by far the single most important probability distribution. Many natural phenomena are (approximately) normally distributed. Another reason for its importance comes from the central limit theorem (to be discussed in the next lecture).

Normal distribution
If the random variable X follows a normal distribution with mean µ (−∞ < µ < ∞) and variance σ² (σ > 0), its PDF is given by

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

for −∞ < x < ∞. We write that X ∼ N(µ, σ²).

Continuous distributions

[Figure: PDFs of N(0, 1) (left) and of u(1, 3) and u(−3, 0) (right)]

E[X] for continuous distributions
Expected value
The expected value of a continuous random variable is

E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx.

E[X] is the balance point of the probability mass: the probability mass to the left of E[X] is in balance with the probability mass to the right of E[X].

E[X] for continuous distributions
Thus, the expected value of the normal distribution is simply at its highest point (due to symmetry), and the expected value of a uniform distribution is halfway between a and b.

[Figure: balance points of the normal PDF and of the uniform PDF on [a, b]]

E[X] for the uniform distribution
Let's do this formally for the uniform distribution.

E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx                    (1)
     = \int_{-\infty}^{+\infty} x \frac{1}{b-a}\,dx           (2)
     = \frac{1}{b-a} \int_a^b x\,dx                           (3)
     = \frac{1}{b-a} \left[\frac{1}{2}x^2\right]_a^b          (4)
     = \frac{1}{b-a} \cdot \frac{1}{2}\left(b^2 - a^2\right)  (5)
     = \frac{1}{2} \cdot \frac{(b-a)(b+a)}{b-a}               (6)
     = \frac{1}{2}(b+a)                                       (7)

E[X] for the uniform distribution

(1) By the definition of the expected value.
(2) By the definition of the uniform distribution.
(3) If k is a constant, then

\int k f(x)\,dx = k \int f(x)\,dx.

(4) Since

\int x^n\,dx = \frac{1}{n+1} x^{n+1} + c \quad (n ≠ −1).

(4-5) Since

\int_a^b f(x)\,dx = \left[F(x)\right]_a^b = F(b) − F(a)

where \frac{d}{dx} F(x) = f(x).

V[X] for continuous distributions
Variance
The variance of a continuous random variable X is

V[X] = \int_{-\infty}^{+\infty} (x − E[X])^2 f(x)\,dx.    (8)

[Figure: PDFs of N(0, 1), N(0, 2) and N(0, 3)]

V[X] for the uniform distribution
Instead of using the definition above, we can make use of the following:

Variance

V[X] = E[(X − E[X])^2]
     = E[X^2 − 2X E[X] + (E[X])^2]
     = E[X^2] − 2(E[X])^2 + (E[X])^2
     = E[X^2] − (E[X])^2

where the third line uses the fact that E[A + B] = E[A] + E[B] and E[E[A]] = E[A]. We know E[X]. Hence, we only need to find E[X^2].

V[X] for the uniform distribution
We treat X^2 as a new random variable which follows the same PDF as X.

E[X^2] = \int_{-\infty}^{+\infty} x^2 f(x)\,dx
       = \frac{1}{b-a} \int_a^b x^2\,dx
       = \frac{1}{b-a} \left[\frac{1}{3}x^3\right]_a^b
       = \frac{1}{b-a} \cdot \frac{1}{3}\left(b^3 − a^3\right)
       = \frac{1}{3} \cdot \frac{(b-a)(a^2 + ab + b^2)}{b-a}
       = \frac{1}{3}(a^2 + ab + b^2)

V[X] = E[X^2] − (E[X])^2 = \frac{1}{3}(a^2 + ab + b^2) − \left(\frac{1}{2}(b + a)\right)^2 = \frac{1}{12}(b − a)^2
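Both results can be verified symbolically (our addition; sympy assumed):

    import sympy as sp

    # E[X] and V[X] for X ~ u(a, b) by direct integration of the PDF 1/(b - a).
    x = sp.symbols('x')
    a, b = sp.symbols('a b', real=True)
    pdf = 1 / (b - a)

    EX = sp.integrate(x * pdf, (x, a, b))
    EX2 = sp.integrate(x**2 * pdf, (x, a, b))
    print(sp.simplify(EX))         # (a + b)/2
    print(sp.factor(EX2 - EX**2))  # (a - b)**2/12, i.e. (b - a)**2/12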

PDF and probability
Consider the standard normal distribution depicted in the figure. As you can see, f(0) ≈ 0.4. To be precise, f(0) = 0.3989423.... It is important to understand that this does not mean that P(X = 0) = 0.3989423...! If a random variable is continuous, there are infinitely many distinct values that the random variable can take. Thus, the probability that the random variable takes a specific value is zero.

[Figure: PDF of N(0, 1)]

PDF and probability
However, we can say that the probability that X is below 0 is equal to the shaded grey area. Since we know that the area under f(x) is 1, we know (due to symmetry) that P(X ≤ 0) = 0.5.

[Figure: PDF of N(0, 1) with the area to the left of 0 shaded]

PDF and probability
But what is, say, P(X ≤ −1)?

[Figure: PDF of N(0, 1) with the area to the left of −1 shaded]

CDF and probability

Cumulative distribution function
The cumulative distribution function (CDF) of a continuous random variable X is denoted by F(x) and given by

F(x) = P(X ≤ x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for } −∞ ≤ x ≤ +∞.

The CDF gives the probability that the outcome of X in a random experiment is less than or equal to x.

CDF and probability

[Figure: PDF and CDF of N(0, 1)]

We can read from the CDF that

F(−1) = P(X ≤ −1) ≈ 0.159...

and

F(0) = P(X ≤ 0) = 0.5.

CDF and probability

[Figure: PDF and CDF of N(0, 1)]

What is the probability that X lies between −1 and 0? That is, what is P(−1 ≤ X ≤ 0)? It is simply

F(0) − F(−1) = 0.5 − 0.159 ≈ 0.341.

CDF and probability

[Figure: PDF and CDF of N(0, 1)]

What is the probability that X is below +1? Due to symmetry, F(1) = 1 − F(−1). Thus,

F(1) = 1 − F(−1) = 1 − 0.159 ≈ 0.841.

Inverse functions and CDF
We will often use the inverse function of the CDF.

Inverse function
In general, if g(x) is an invertible function, then the inverse function is given by g^{-1}(g(x)) = x.
Intuition: A function works like a machine. It takes x as an input and returns the output g(x) = a. An inverse function works the other way around: g^{-1}(a) = x.

Suppose we are interested in the following question: what is the value of x such that P(X ≤ x) = 0.95?

Inverse functions and CDF

[Figure: PDF and CDF of N(0, 1) with the 0.95 quantile marked at x ≈ 1.64]

Using the inverse CDF:

F^{-1}(0.95) ≈ 1.64...

This implies that, due to symmetry, approximately 90% of the probability mass lies within ±1.64.
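In practice, the inverse CDF is available as the quantile function of statistics libraries (a sketch; scipy assumed, where it is called ppf):

    from scipy.stats import norm

    # Quantile function (inverse CDF) of the standard normal distribution.
    print(norm.ppf(0.95))  # 1.6448...
    print(norm.cdf(1.6449) - norm.cdf(-1.6449))  # ~0.90 of the mass within ±1.64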

Standard normal distribution

Standard normal distribution
If X ∼ N(µ, σ²), then

Z = \frac{X − µ}{σ} ∼ N(0, 1).

We say that Z follows a standard normal distribution. The PDF and CDF of the standard normal distribution are often denoted by φ(z) and Φ(z).

[Figure: PDFs of Z ∼ N(0, 1) and X ∼ N(10, 4)]

Standard normal distribution

Example III.6
Suppose X ∼ N(10, 4) and Z = (X − 10)/√4 ∼ N(0, 1). What is P(X ≤ 8)?

P(X ≤ 8) = P\left(\frac{X − µ}{σ} ≤ \frac{8 − µ}{σ}\right) = P\left(\frac{X − 10}{\sqrt{4}} ≤ \frac{8 − 10}{\sqrt{4}}\right) = P(Z ≤ −1)

[Figure: PDFs of Z ∼ N(0, 1) and X ∼ N(10, 4), with the areas P(Z ≤ −1) and P(X ≤ 8) shaded]
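A quick numerical check of Example III.6 (our sketch, scipy assumed):

    from scipy.stats import norm

    # P(X <= 8) for X ~ N(10, 4), i.e. mean 10 and standard deviation 2.
    print(norm.cdf(8, loc=10, scale=2))  # 0.1586...
    print(norm.cdf(-1))                  # the same value via Z = (8 - 10)/2 = -1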

Multivariate distributions (discrete)

Joint probability function
The joint probability function of two discrete random variables X and Y is given by

f(x, y) = P(X = x \text{ and } Y = y) \quad \text{with} \quad \sum_i \sum_j f(x_i, y_j) = 1.

Table of probabilities:

X\Y    1    2    3    4
 1    0.1  0    0.1  0
 2    0.3  0    0.1  0.2
 3    0    0.2  0    0
 4    0    0    0    0

[Figure: 3D bar chart of the joint probability function f(x, y)]

Multivariate distributions (continuous)

Joint probability density function
The joint probability density function (or joint PDF) of two continuous random variables X and Y is given by

f(x, y) \quad \text{with} \quad \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y)\,dx\,dy = 1.

[Figure: surface plot of a joint PDF f(x, y)]

Multivariate distributions (continuous)


Example III.7
If X and Y have joint PDF f(x, y), then the probability that X lies between 0 and 2 and that, at the same time, Y lies between 0 and 1 is

P(0 ≤ X ≤ 2 \text{ and } 0 ≤ Y ≤ 1) = \int_0^1 \int_0^2 f(x, y)\,dx\,dy

Marginal distributions

Marginal probability function (discrete)
If X and Y are two discrete random variables for which the joint probability function is f(x, y), then the marginal probability function for X is

f_X(x) = P(X = x) = \sum_y P(X = x \text{ and } Y = y) = \sum_y f(x, y)

The marginal probability gives the probability of observing a specific value of X (say X = x). To calculate the probability of observing x, we need to add the probabilities of all events that correspond to X = x. That is, P(X = x) = f(x, y_1) + f(x, y_2) + ... + f(x, y_n).

Marginal distributions

Example III.8 (Discrete case)
What is the marginal probability function for X? (A computational check follows below.)

Table of probabilities:

X\Y    1    2    3    4
 1    0.1  0    0.1  0
 2    0.3  0    0.1  0.2
 3    0    0.2  0    0
 4    0    0    0    0
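A short computational check (our addition, numpy assumed): summing the joint table over y gives the marginal of X.

    import numpy as np

    # Joint probabilities from Example III.8; rows index x, columns index y.
    joint = np.array([[0.1, 0.0, 0.1, 0.0],
                      [0.3, 0.0, 0.1, 0.2],
                      [0.0, 0.2, 0.0, 0.0],
                      [0.0, 0.0, 0.0, 0.0]])

    fX = joint.sum(axis=1)  # marginal of X: [0.2, 0.6, 0.2, 0.0]
    fY = joint.sum(axis=0)  # marginal of Y: [0.4, 0.2, 0.2, 0.2]
    print(fX, fY, joint.sum())  # the probabilities sum to 1.0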

Marginal probability density function (continuous)
If X and Y are two continuous variables for which the joint probability density function is f(x, y), then the marginal probability density function for X is

f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy

Joint distributions and independence
Recall from the last lecture that, if X and Y are two independent events, then P(X and Y) = P(X)P(Y). We can generalise this statement:

Independence
Two continuous (or discrete) random variables are independent if and only if

f(x, y) = f_X(x) f_Y(y) ⟺ F(x, y) = F_X(x) F_Y(y)

where f_X(x) and f_Y(y) are the marginal PDFs, and F_X(x) and F_Y(y) denote the marginal CDFs.

Conditional distributions
Recall from the last lecture that the conditional probability is defined as P(X|Y) = P(X,Y)/P(Y). Furthermore, recall that if X and Y are two independent events, then P(X|Y) = P(X).

Conditional probability density function
Suppose that X and Y are two continuous (or discrete) random variables for which the joint PDF is f(x, y) and the marginal PDFs are f_X(x) and f_Y(y). Suppose also that the value y has already been observed. The conditional probability density function of X given that Y = y is given by

f_X(x|y) = \frac{f(x, y)}{f_Y(y)}.

Note that if X and Y are independent, we get the relation f_X(x|y) = f_X(x).

Conditional distributions

[Figure: surface plot of the joint PDF f(x, y), with the slice f(x, 1) highlighted]

The black, thick line shows f(x, 1), which is proportional to the conditional distribution of X given Y = 1. More specifically:

f_X(x|Y = 1) = \frac{f(x, 1)}{f_Y(1)}.

(Conditional) expectations and covariance

Example III.9
Let X and Z be two independently distributed standard normal random variables and let Y = X² + Z. Hence, V(X) = V(Z) = 1 and E(X) = E(Z) = 0.
a) Derive E[Y|X].
b) Derive E[Y]. [Hint: Use V[X] = E[X²] − (E[X])².]
c) Derive E[XY]. [Hint: Since X is standard normal, E[X³] = 0.]
d) Find Cov(X, Y) = E[(X − E(X))(Y − E(Y))].
What is interesting about the results? (A simulation sketch follows below.)
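Not in the original slides: a simulation (numpy assumed) that illustrates the punchline, namely that Y = X² + Z has zero covariance with X even though the two are clearly dependent.

    import numpy as np

    # X, Z independent standard normals; Y = X**2 + Z.
    rng = np.random.default_rng(seed=0)
    m = 1_000_000
    x = rng.standard_normal(m)
    z = rng.standard_normal(m)
    y = x**2 + z

    print(np.cov(x, y)[0, 1])       # ~0: Cov(X, Y) = 0
    print(y[np.abs(x) > 2].mean())  # > 4: knowing X changes E[Y|X] = X**2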

Independence and covariance
Covariance is defined as

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E[XY] − E[X]E[Y].

If X and Y are independent, E[XY] = E[X]E[Y], and therefore Cov(X, Y) = 0. However, Cov(X, Y) = 0 (and Corr(X, Y) = 0) does not imply independence, as demonstrated in the previous example.

Summary

- Random variables are either discrete or continuous. Random variables are usually denoted by capital letters, e.g. X, and realisations by small letters, e.g. x.

- For a continuous random variable, P(X = x) = 0 and f(x) ≠ P(X = x), where f(x) denotes the probability density function.

- Many probability distributions are closely related. For example, we can derive the Poisson distribution from the binomial distribution, and the Poisson distribution behaves similarly to the normal distribution as λ → ∞.

- Independence implies Cov(X, Y) = 0 (and Corr(X, Y) = 0), but not the other way around. Cov(X, Y) and Corr(X, Y) measure the strength of the linear relationship between two variables.
