Business Math II (part 4): Probability and Random Variables (by Evan Dummit, 2019, v. 1.00)

Contents

4 Probability and Random Variables
   4.1 Probability and Probability Distributions
      4.1.1 Sample Spaces and Events
      4.1.2 Probabilities of Events
      4.1.3 Computing Probabilities
   4.2 Conditional Probability and Independence
      4.2.1 Definition of Conditional Probability
      4.2.2 Independence
      4.2.3 Computing Probabilities, Bayes' Formula
   4.3 Discrete Random Variables
      4.3.1 Definition and Examples
      4.3.2 Expected Value
      4.3.3 Variance and Standard Deviation
   4.4 Continuous Random Variables
      4.4.1 Definition and Examples
      4.4.2 Expected Value, Variance, Standard Deviation
      4.4.3 The Normal Distribution and Central Limit Theorem

4 Probability and Random Variables

In this chapter, we will discuss probability, which quantifies how likely it is that a particular event will occur. We will begin by discussing discrete probability, which arose historically from the study of games of chance, and which often uses many of the counting techniques we have developed. Next, we discuss conditional probability, which describes how to re-evaluate the probability of an event based on the addition of new information, and the related topic of independence, which describes whether two events are related. We then discuss discrete and continuous random variables, which are quantities whose values depend on the outcome of a random event, and how to compute associated quantities such as the expected value (average value) and the variance and standard deviation. We also discuss several fundamentally important probability distributions, such as the normal distribution, along with some of their applications in statistics.

4.1 Probability and Probability Distributions

• We begin by discussing sample spaces and methods for computing probabilities of events.

4.1.1 Sample Spaces and Events

• The fundamental idea of probability arises from performing an experiment or observation, and tabulating how often particular outcomes arise.

◦ For any experiment or observation, the set of possible outcomes is called the sample space, and an event is a subset of the sample space.

• Example: Consider the experiment of rolling a standard 6-sided die once.

◦ There are 6 possible outcomes to this experiment, namely, rolling a 1, a 2, a 3, a 4, a 5, or a 6, so the sample space is the set S = {1, 2, 3, 4, 5, 6}. ◦ One event is rolling a 3, which would correspond to the subset {3}. ◦ Another event is rolling an even number, which would correspond to the subset {2, 4, 6}. ◦ A third event is rolling a number bigger than 2, which would correspond to the subset {3, 4, 5, 6}. ◦ A fourth event is rolling a negative number, which would correspond to the empty subset ∅ = {} because there are no outcomes in the sample space that make this event occur.

• Example: Consider the experiment of flipping a coin once.

◦ There are 2 possible outcomes to this experiment: heads and tails. Thus, the sample space is the set S = {heads, tails}, which we typically abbreviate as S = {H,T }. ◦ One event is obtaining heads, corresponding to the subset {H}. ◦ Another event is obtaining tails, corresponding to the subset {T }.

• Example: Consider the experiment of flipping a coin four times.

◦ By the multiplication principle, there are 2^4 = 16 possible outcomes to this experiment (namely, the 16 possible strings of 4 characters each of which is either H or T ). ◦ One event is exactly one head is obtained, corresponding to the subset {HTTT,THTT,TTHT,TTTH}. ◦ Another event is the first three flips are tails, corresponding to the subset {TTTH,TTTT }.

• Example: Consider the experiment of shuffling a standard 52-card deck.

◦ There are many possible outcomes to this experiment, namely, all 52! ways of arranging the 52 cards in sequence. ◦ The sample space S is the (quite large!) set of all 52! of these sequences. ◦ One event is the first card is an ace of clubs, while another event is the entire deck alternates red-black-red-black. It would be infeasible to write out the exact subsets corresponding to these events, but in principle it would be possible to list them all.

• Example: Consider the experiment of measuring the lifetime of a refrigerator in years.

◦ There are many possible outcomes to this experiment, including 0, 5, 28, 3.2, and 100. ◦ The sample space would (at least in principle) be the set of nonnegative real numbers: S = [0, ∞). ◦ One event is the refrigerator stops working after at most 3 years, corresponding to the subset [0, 3]. ◦ Another event is the refrigerator works for at least 6 years, corresponding to the subset [6, ∞).

• Example: Consider the experiment of measuring the temperature in degrees Fahrenheit outside.

◦ There are many possible outcomes to this experiment, including 50◦, 87.4◦, and 120◦. ◦ For this experiment, depending on how accurately the temperature is measured, and what the possible outside temperatures are, the sample space could be very large or even infinite, perhaps as large as the set S = (−∞, ∞) of all real numbers.

◦ If the temperature is measured to the nearest whole degree, and it is known that the temperature is never colder than 0◦ nor hotter than 120◦, then the sample space would be S = {0◦, 1◦, 2◦,..., 120◦}. ◦ One event is the temperature is above freezing, corresponding to the subset {33◦, 34◦,..., 120◦}. ◦ Another event is the temperature is a multiple of 10◦, corresponding to the subset {0◦, 10◦, 20◦,..., 120◦}.

• Since we view events as subsets of the sample space, we define the union (or intersection) of two events as the union (or intersection) of their corresponding subsets, and we define the complement of an event to be the complement of the corresponding subset inside the sample space.

◦ For example, take the sample space S = {1, 2, 3, 4, 5, 6} obtained by rolling a standard 6-sided die once, and let A = {2, 4, 6} be the event of rolling an even number and B = {3, 4, 5, 6} be the event of rolling a number larger than 2. ◦ Then the complement Ac = {1, 3, 5} is the event of not rolling an even number (i.e., rolling an odd number), and the complement Bc = {1, 2} is the event of not rolling a number larger than 2 (i.e., rolling a number less than or equal to 2). ◦ Also, the union A ∪ B = {2, 3, 4, 5, 6} is the event of rolling a number that is even or larger than 2, while the intersection A ∩ B = {4, 6} is the event of rolling a number that is even and larger than 2.

• In the case where the events A and B have A ∩ B = ∅, we say that A and B are mutually exclusive, since they cannot both occur at the same time.

4.1.2 Probabilities of Events

• We would now like to analyze probabilities of events, which measure how likely particular events are to occur.

◦ If we have an experiment with corresponding sample space S, and E is an event (which we consider as being a subset of S), we would like to define the probability of E, written P (E), to be the frequency with which E occurs if we repeat the experiment many times independently.

◦ Specifically, if we repeat the experiment n times and the event occurs e_n times, then the relative frequency that E occurs is the ratio e_n/n. If we let n grow very large (more formally, if we take the limit as n → ∞) then the ratios e_n/n should approach a fixed number, which we call the probability of the event E.

◦ Since for each n we have 0 ≤ e_n ≤ n, and thus 0 ≤ e_n/n ≤ 1, we see that the limit of the ratios e_n/n must be in the closed interval [0, 1]. This tells us that P (E) should always lie in this interval. ◦ For the event S (consisting of the entire sample space), we clearly have P (S) = 1, because if we perform the experiment n times, the event S always occurs n times, so e_n = n for every n.

◦ Also, if E1 and E2 are mutually exclusive events, then P (E1 ∪ E2) = P (E1) + P (E2): this follows by observing that because both events cannot occur simultaneously, then the total number of times E1 or E2 occurs in n experiments is equal to the total number of times E1 occurs plus the total number of times E2 occurs.

◦ In particular, if S = {s1, s2, . . . , sk} is a finite sample space, then by applying the above observation repeatedly, we can see that P ({s1}) + P ({s2}) + ··· + P ({sk}) = P ({s1, s2, . . . , sk}) = P (S) = 1. ◦ This means that the sum of the probabilities of all the individual outcomes in the sample space is equal to 1.

• We now give a more formal definition of a probability distribution on a sample space, using the properties we have worked out above as motivation:

• Definition: If S is a sample space, a probability distribution on S is any function P defined on events (i.e., subsets of S) with the following properties:

[P1] For any event E, the probability P (E) satisfies 0 ≤ P (E) ≤ 1.
[P2] The probability P (S) = 1.
[P3] If E1, E2, . . . , Ek are mutually exclusive events (meaning that Ei ∩ Ej = ∅ for i ≠ j), then P (E1 ∪ E2 ∪ · · · ∪ Ek) = P (E1) + P (E2) + ··· + P (Ek).¹

¹If S is an infinite sample space, this property also applies to infinite collections of mutually exclusive events: explicitly, we require P (∪_{i=1}^∞ E_i) = Σ_{i=1}^∞ P (E_i).

◦ We may interpret the probability P (E) as the relative frequency that the event E occurs, if we repeat the experiment a large number of times.

• There are a few fundamental properties of probabilities that can be derived from the basic assumptions [P1]-[P3]:

• Proposition (Basic Properties of Probability): For any events E, E1, E2 inside a sample space S with probability distribution P , the following properties hold:

1. P (∅) = 0.

2. P (E1 ∪ E2) = P (E1) + P (E2) − P (E1 ∩ E2). 3. P (Ec) = 1 − P (E).

◦ Proof: For (1), apply [P3] to E1 = E2 = ∅, so that E1 ∪ E2 = ∅ also, to see that P (∅) = P (∅) + P (∅), so P (∅) = 0.
◦ For (2), observe (e.g., via a Venn diagram) that if we define the events A = E1 ∩ E2^c, B = E1 ∩ E2, and C = E1^c ∩ E2, then A, B, C are mutually disjoint with A ∪ B = E1, B ∪ C = E2, and A ∪ B ∪ C = E1 ∪ E2.

◦ Thus by [P3] applied repeatedly, we have P (E1 ∪ E2) = P (A ∪ B ∪ C) = P (A) + P (B) + P (C), while P (E1) = P (A) + P (B), P (E2) = P (B) + P (C), and P (E1 ∩ E2) = P (B).

◦ Hence we see P (E1 ∪ E2) = P (A) + P (B) + P (C) = P (E1) + P (E2) − P (E1 ∩ E2), as claimed.
◦ For (3), apply the result of (2) with E1 = E and E2 = E^c: since E ∪ E^c = S and E ∩ E^c = ∅, the statement of (2) reduces to 1 = P (S) = P (E) + P (E^c) − P (∅), and thus by (1) and [P2] we see that P (E^c) = 1 − P (E) as claimed.

• In the particular case where S = {s1, s2, . . . , sk} is a finite sample space, we can describe what a probability distribution is more explicitly:

◦ Explicitly, by properties [P2] and [P3], we see that P ({s1})+P ({s2})+···+P ({sk}) = P ({s1, s2, . . . , sk}) = P (S) = 1.

◦ Thus, by [P1] the numbers P ({s1}), ... , P ({sk}) are all between 0 and 1 inclusive, and have sum 1. ◦ If we make any selection for these values satisfying these conditions, then we may compute the probability of an arbitrary event E = {t1, . . . , td} by using property [P3] again: P (E) = P ({t1}) + ··· + P ({td}). ◦ It is not hard to check that this assignment of P (E) for each subset E of S satisfies all three of [P1]-[P3].

◦ Therefore, a probability distribution on S = {s1, s2, . . . , sk} is simply an assignment of probabilities between 0 and 1 inclusive to each of the individual outcomes in the sample space, such that the sum of all the probabilities is 1.

• Example: Consider the sample space S = {H,T } corresponding to flipping a coin.

◦ By our discussion above, a probability distribution on S is determined by assigning values to P (H) and P (T ) such that 0 ≤ P (H), P (T ) ≤ 1 and P (H) + P (T ) = 1.
◦ One probability distribution arises from assuming that a head is equally likely to appear as a tail: then we would have P (H) = P (T ) = 1/2.
◦ A different probability distribution arises from assuming that a tail is twice as likely to appear as a head: then we would have P (H) = 1/3 while P (T ) = 2/3.
◦ Another probability distribution, for an even more unfair coin, would have P (H) = 0.1 with P (T ) = 0.9: this represents a coin that lands heads 10% of the time and tails the other 90% of the time.
◦ We even have a probability distribution for a coin that never comes up heads; namely, the one with P (H) = 0 and P (T ) = 1.

• In the example above, we can see that the choice of probability distribution for this sample space affects our interpretation of how fair the coin is.

◦ We view the probability of an event on a sliding scale from 0 to 1: a probability near 0 means that the event happens rarely, while a probability near 1 means the event happens often. ◦ If the sample space is finite, then an event with probability 0 never occurs, while an event with probability 1 always occurs. ◦ However, if the sample space is infinite, there are situations where events of probability 0 can (perhaps surprisingly) occur. One example of this is the experiment of choosing a real number uniformly at random from the interval [0, 1]. The assumption that we choose uniformly at random means that every real number has the same probability of being chosen, but since there are infinitely many numbers in the interval, the probability of choosing any particular one is equal to 0. Nonetheless, each time this experiment is performed, there is some real number chosen as the outcome, even though that particular outcome has probability 0 of occurring!

4.1.3 Computing Probabilities

• So far, our discussion of probability has been relatively abstract, since we have primarily spoken about probability distributions.

◦ In order to compute probabilities of particular events, we must make assumptions about the corresponding probability distributions, such as assuming that a coin is fair (i.e., that heads and tails are equally likely).

◦ In the particular case where all of the outcomes s1, s2, . . . , sk in the sample space are equally likely, we would have P (s1) = P (s2) = ··· = P (sk) = 1/k. ◦ Then the probability of any event E is simply #(E)/k, which only depends on the number of outcomes in E. Thus, we may find P (E) simply by counting all of the outcomes in E; in particular, when the sample space is small, we can simply list them all.

• Example: If a fair coin is flipped 3 times, determine the respective probabilities of obtaining (i) no heads, (ii) 1 head, (iii) 2 heads, and (iv) 3 heads.

◦ The sample space has 2^3 = 8 outcomes: explicitly, S = {TTT,TTH,THT,THH,HTT,HTH,HHT,HHH}. ◦ Under the assumption that the coin is fair, each of the 8 outcomes is equally likely, so by our analysis above, the probability of each of the 8 outcomes is 1/8. ◦ Now we simply count to determine each of the respective probabilities. ◦ There is only 1 outcome with no heads (namely, TTT ), so the probability of obtaining no heads is 1/8 . ◦ There are 3 outcomes (namely TTH, THT , and HTT ) with 1 head, so the probability of obtaining 1 head is 3/8 . ◦ There are 3 outcomes (namely THH, HTH, and HHT ) with 2 heads, so the probability of obtaining 2 heads is 3/8 .

◦ There is 1 outcome (namely, HHH) with 3 heads, so the probability of obtaining 3 heads is 1/8 .
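As an illustrative check (not part of the original notes), the short Python sketch below enumerates all 2^3 equally likely outcomes and tallies the number of heads, reproducing the probabilities 1/8, 3/8, 3/8, 1/8 computed above.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 2^3 equally likely outcomes of flipping a fair coin 3 times.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes yield each possible number of heads.
head_counts = Counter(seq.count("H") for seq in outcomes)

for k in range(4):
    print(k, Fraction(head_counts[k], len(outcomes)))   # 1/8, 3/8, 3/8, 1/8
```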

• Example: If two fair 6-sided dice are rolled, determine the probabilities of the respective events (i) the two dice read 6 and 3 in some order, (ii) the sum of the two rolls is equal to 4, (iii) both rolls are equal, (iv) neither roll is a 2, and (v) at least one roll is a 2.

◦ In this case, the sample space consists of 6^2 = 36 outcomes representing the 36 ordered pairs (R1, R2) where R1 is the outcome of the first die roll and R2 is the outcome of the second die roll. ◦ Under the assumption that both dice are fair, all 36 outcomes are equally likely, so the probability of any one of these individual outcomes is 1/36. ◦ Now we simply count to determine each of the respective probabilities. ◦ For event (i), there are two possible outcomes, namely (3, 6) and (6, 3). Thus the probability of this event is 2/36 = 1/18 .

◦ For event (ii), there are three possible outcomes, namely (1, 3), (2, 2), and (3, 1). Thus the probability of this event is 3/36 = 1/12 . ◦ For event (iii), there are six possible outcomes, namely (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), and (6, 6). Thus the probability of this event is 6/36 = 1/6 . ◦ For event (iv), there are 25 possible outcomes, consisting of the 5 · 5 ordered pairs (a, b) where each of a and b is one of the 5 numbers 1, 3, 4, 5, 6. Thus the probability of this event is 25/36 . ◦ For event (v), one way to compute the probability is simply to list all 11 possible outcomes, namely (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (1, 2), (3, 2), (4, 2), (5, 2), and (6, 2), to see that the probability of this event is 11/36 . ◦ Another way to find the probability of event (v) is to observe that it is the complement of event (iv), and thus its probability must be 1 − 25/36 = 11/36 .
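The same counting can be automated: the Python sketch below (an illustration under the fair-dice assumption) enumerates the 36 ordered outcomes and computes each of the five probabilities exactly.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes (r1, r2) of rolling two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on an outcome (r1, r2)."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

print(prob(lambda r: sorted(r) == [3, 6]))     # (i)   1/18
print(prob(lambda r: r[0] + r[1] == 4))        # (ii)  1/12
print(prob(lambda r: r[0] == r[1]))            # (iii) 1/6
print(prob(lambda r: 2 not in r))              # (iv)  25/36
print(prob(lambda r: 2 in r))                  # (v)   11/36
```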

• When the sample space is large, it can be unreasonable to enumerate all the outcomes in particular events explicitly. Nonetheless, by counting appropriately, we may still compute probabilities:

• Example: Three cards are randomly dealt from a standard 52-card deck. Determine the probabilities of the respective events (i) all three cards are aces, (ii) all three cards are the same suit, (iii) one of the cards is the 3 of diamonds, and (iv) at least one card is a jack.

◦ Let us choose the sample space S to be the set of possible triples (C1, C2, C3) of the three cards dealt in order from the deck. By the multiplication principle, there are 52 · 51 · 50 such triples, so #S = 52 · 51 · 50.
◦ As with the previous examples, under the assumption that the cards are drawn randomly, each of the outcomes in S is equally likely.
◦ For event (i), there are 4 choices for the first card (any of the 4 aces), 3 choices for the second card (any of the 3 remaining aces), and 2 choices for the third card. Thus there are 4 · 3 · 2 outcomes that yield this event, and so the probability that this event occurs is (4 · 3 · 2)/(52 · 51 · 50) = 1/5525.
◦ For event (ii), there are 4 choices for the common suit. Once we have chosen the suit, there are 13 choices for the first card, 12 for the second, and 11 for the third, yielding a total number of 4 · 13 · 12 · 11 outcomes. Thus the desired probability is (4 · 13 · 12 · 11)/(52 · 51 · 50) = 22/425 ≈ 0.0518.
◦ For event (iii), there are 3 choices for which card is the 3 of diamonds. Once this choice is made, there are 51 possibilities for the first remaining card (any card except the 3 of diamonds) and 50 possibilities for the second remaining card (any card except the 3 of diamonds or the card just chosen), yielding a total number of 3 · 51 · 50 outcomes. Thus the desired probability is (3 · 51 · 50)/(52 · 51 · 50) = 3/52 ≈ 0.0577. Intuitively, since we are choosing 3 cards out of 52, and the 3 of diamonds is equally likely to be any one of these 52 cards, it is natural to say that the probability that it is among the 3 cards we have chosen should be 3/52.
◦ For event (iv), we first find the probability that none of the cards is a jack. If none of the cards is a jack, then there are 48 · 47 · 46 possible ways to choose the cards (the first card can be any card except the 4 jacks, and so forth), and so the probability that none of the cards is a jack is (48 · 47 · 46)/(52 · 51 · 50) = 4324/5525. Then the probability that at least one card is a jack is the probability of the complementary event, and is therefore equal to 1 − 4324/5525 = 1201/5525 ≈ 0.2174.
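When listing outcomes by hand is infeasible, a computer can still enumerate them: the sketch below (illustrative only) builds the 52 · 51 · 50 ordered deals and checks each event directly, confirming the four probabilities above.

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck as (rank, suit) pairs; rank 1 = ace, 11 = jack.
deck = [(rank, suit) for rank in range(1, 14) for suit in "CDHS"]

# All 52*51*50 equally likely ordered deals of three cards.
deals = list(permutations(deck, 3))

def prob(event):
    return Fraction(sum(1 for d in deals if event(d)), len(deals))

print(prob(lambda d: all(rank == 1 for rank, suit in d)))    # (i)   1/5525
print(prob(lambda d: len({suit for rank, suit in d}) == 1))  # (ii)  22/425
print(prob(lambda d: (3, "D") in d))                         # (iii) 3/52
print(prob(lambda d: any(rank == 11 for rank, suit in d)))   # (iv)  1201/5525
```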

• Example: A fair coin is flipped 15 times. Determine the probabilities of the respective events (i) exactly 7 heads are obtained, (ii) at most 2 tails are obtained, (iii) at least 3 tails are obtained, and (iv) the number of heads is a multiple of 6.

◦ Here, the sample space is the set of 2^15 = 32768 possible strings of 15 heads or tails, where (as above) each of the possible outcomes is equally likely.

◦ From our results on combinations, there are exactly C(15, n) possible strings that have n heads and 15 − n tails, and an equal number that have 15 − n heads and n tails. We therefore only need to tally the various possibilities in each case.
◦ For event (i), there are C(15, 7) = (15 · 14 · 13 · 12 · 11 · 10 · 9)/(7 · 6 · 5 · 4 · 3 · 2 · 1) = 6435 possible outcomes, and thus the probability of this event is 6435/32768 ≈ 0.1964.
◦ For event (ii), a string with at most 2 tails can have 2 tails, 1 tail, or 0 tails, so there are C(15, 2) + C(15, 1) + C(15, 0) = (15 · 14)/2 + 15 + 1 = 121 possible outcomes. Thus the probability of this event is 121/32768 ≈ 0.0037.
◦ For event (iii), observe that this event is the complement of event (ii), and therefore its probability is 1 − 121/32768 = 32647/32768 ≈ 0.9963. This approach is simpler than the more direct method of observing that this event occurs if there are 3, 4, 5, ... , or 15 tails, and so the total number of possibilities is C(15, 3) + C(15, 4) + ··· + C(15, 15) = 32647.
◦ For event (iv), such a string can have 0 heads, 6 heads, or 12 heads, so there are C(15, 0) + C(15, 6) + C(15, 12) = 1 + 5005 + 455 = 5461 possible outcomes. Thus the probability of this event is 5461/32768 ≈ 0.1667.
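The binomial coefficients above can be computed directly; the Python sketch below (illustrative only) uses math.comb to reproduce the four probabilities.

```python
from math import comb

total = 2 ** 15   # number of equally likely sequences of 15 flips

print(comb(15, 7) / total)                             # (i)   ≈ 0.1964
at_most_2_tails = sum(comb(15, k) for k in range(3))   # 0, 1, or 2 tails
print(at_most_2_tails / total)                         # (ii)  ≈ 0.0037
print(1 - at_most_2_tails / total)                     # (iii) ≈ 0.9963 (complement of (ii))
print(sum(comb(15, k) for k in (0, 6, 12)) / total)    # (iv)  ≈ 0.1667
```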

• Example: A random five-card poker hand is dealt from a standard 52-card deck. Determine the probabilities of the respective events (i) the hand is a straight flush, (ii) the hand is a four-of-a-kind, (iii) the hand is a full house (three cards of one rank and a pair in another), (iv) the hand is a flush (all cards are the same suit) that is not a straight flush, (v) the hand is a straight (five cards in sequence) that is not a straight flush, (vi) the hand is a three-of-a-kind, (vii) the hand is two pair (two pairs of different ranks), (viii) the hand is one pair, and (ix) the hand is none of the above.

◦ First observe that there are C(52, 5) = 2598960 ways to deal a five-card hand, where the order of the cards does not matter. In some cases, it is slightly easier to count hands where we do pay attention to the ordering, in which case there are 52 · 51 · 50 · 49 · 48 total hands instead.
◦ For (i), there are 4 possible suits for a straight flush and 10 possible card sequences (A-2-3-4-5 through 10-J-Q-K-A) yielding a straight flush. Thus, the probability of obtaining a straight flush is (4 · 10)/2598960 = 1/64974 ≈ 0.0015%.
◦ For (ii), there are 13 possible ranks for the 4-of-a-kind and 48 possibilities for the remaining single card. Thus, the probability of obtaining a 4-of-a-kind is (13 · 48)/2598960 = 1/4165 ≈ 0.0240%.
◦ For (iii), there are 13 possible ranks for the 3-of-a-kind and C(4, 3) = 4 ways to choose the corresponding cards, and also 12 possible remaining ranks for the pair and C(4, 2) = 6 ways to choose those cards. Thus, the probability of obtaining a full house is (13 · 4 · 12 · 6)/2598960 = 6/4165 ≈ 0.1441%.
◦ For (iv), there are 4 possible choices for the suit and then C(13, 5) = 1287 choices for the 5 cards in that suit. Of these, 10 yield straight flushes by the analysis in (i), so the remaining 1277 do not. Therefore, the probability of obtaining a flush that is not a straight flush is (4 · 1277)/2598960 = 1277/649740 ≈ 0.1965%.
◦ For (v), there are 10 possible card sequences yielding a straight, and each of the 5 cards can be any of the four suits, yielding a total of 10 · 4^5 = 10240 straights. Of these, 40 are straight flushes, and the remaining 10200 are not. Therefore, the probability of obtaining a straight that is not a straight flush is 10200/2598960 = 5/1274 ≈ 0.3925%.

◦ For (vi), we use ordered hands. There are C(5, 3) = 10 ways to choose the locations of the 3-of-a-kind cards, and there are 13 possible ranks for the 3-of-a-kind with 4 · 3 · 2 = 24 ways to choose the corresponding cards. Once these choices are made, there are 48 possibilities for the first remaining card and 44 for the second remaining card (since it must be a different rank). Therefore, the probability of obtaining a three-of-a-kind is (10 · 13 · 24 · 48 · 44)/(52 · 51 · 50 · 49 · 48) = 88/4165 ≈ 2.1128%.
◦ For (vii), there are C(13, 2) = 78 possible ranks for the two pairs and C(4, 2) · C(4, 2) = 36 ways to choose the cards in each pair. Once these are chosen, there are 44 possibilities for the last card (since it must be a different rank than the pairs). Therefore, the probability of obtaining two pair is (78 · 36 · 44)/2598960 = 198/4165 ≈ 4.7539%.
◦ For (viii), we use ordered hands. There are C(5, 2) = 10 ways to choose the locations of the pair cards, 13 possible ranks for the pair, and 4 · 3 = 12 ways to choose the corresponding cards. Once these choices are made, there are 48 · 44 · 40 possibilities for the remaining three cards (since they must all be different ranks). Therefore, the probability of obtaining one pair is (10 · 13 · 12 · 48 · 44 · 40)/(52 · 51 · 50 · 49 · 48) = 352/833 ≈ 42.2569%.
◦ For (ix), since we have chosen possibilities above that are all mutually exclusive, the probability of obtaining none of these events is simply 1 minus the sum of all of them, which ends up simplifying to 1277/2548 ≈ 50.1177%.
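For reference, the sketch below (an illustration, not part of the original notes) recomputes these poker probabilities using unordered counts and math.comb; for hands (vi) and (viii) this replaces the ordered-hand argument above with the equivalent unordered count.

```python
from math import comb

total = comb(52, 5)   # 2,598,960 equally likely five-card hands

counts = {
    "straight flush":  4 * 10,
    "four of a kind":  13 * 48,
    "full house":      13 * comb(4, 3) * 12 * comb(4, 2),
    "flush":           4 * (comb(13, 5) - 10),        # excludes straight flushes
    "straight":        10 * 4 ** 5 - 40,              # excludes straight flushes
    "three of a kind": 13 * comb(4, 3) * comb(12, 2) * 4 * 4,
    "two pair":        comb(13, 2) * comb(4, 2) ** 2 * 44,
    "one pair":        13 * comb(4, 2) * comb(12, 3) * 4 ** 3,
}

for name, c in counts.items():
    print(f"{name:17s} {c / total:.4%}")
print(f"{'none of the above':17s} {1 - sum(counts.values()) / total:.4%}")   # ≈ 50.1177%
```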

• Our general definition of probability distribution also allows us to analyze situations where the outcomes in the sample space are not equally likely.

◦ However, this situation carries the added difficulty that we must now also assign a probability to each individual outcome in the events we are studying. ◦ We will be able to do this more easily once we discuss conditional probabilities.

4.2 Conditional Probability and Independence

• We now discuss conditional probability, which provides us a way to compute the probability that one event occurs, given that another event also occurred.

4.2.1 Definition of Conditional Probability

• As a motivating example, we calculated earlier that if two fair 6-sided dice are rolled, then the probability that neither roll is a 2 is equal to 25/36. Now suppose we are given the additional information that the first die was a 5, and we ask again for the probability that neither roll was a 2.

◦ There are now only six possible outcomes of this experiment that are consistent with the given information, namely, (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), and (5, 6). ◦ Since each of these outcomes was equally likely to occur originally, it is reasonable to say that they should still be equally likely. Our desired event (of not obtaining a 2) occurs in 5 of the 6 cases, so the probability of the event is 5/6. ◦ Notice that by providing additional information (namely, that the first roll was a 5), the probability that neither roll was a 2 changes: this is the essential idea of conditional probability.

• In some cases, we can compute conditional probabilities using counting arguments like in our earlier examples.

• Example: Suppose that 100 students in a course have grades and academic standing as given in the table below.

Category     A    B    C    Total
Sophomore    4    8    6    18
Junior       14   11   10   35
Senior       38   4    5    47
Total        56   23   21   100

Compute the probabilities that (i) a randomly-chosen student is getting an A, (ii) a randomly-chosen student is a junior, (iii) a randomly-chosen junior is getting an A, and (iv) a randomly-chosen A student is a junior.

◦ For event (i), there are 56 A-students out of 100 total students, so the probability is 56/100 = 0.56.

◦ For event (ii), there are 35 juniors out of 100 total students, so the probability is 35/100 = 0.35.

◦ For event (iii), there are 14 A-student juniors out of 35 total juniors, so the probability is 14/35 = 0.4.

◦ For event (iv), there are 14 A-student juniors out of 56 total A-students, so the probability is 14/56 = 0.25.
◦ Event (iii) is an example of a conditional probability, namely, the probability that a student is getting an A given that the student is a junior. In order to compute this conditional probability, notice that we restrict our attention only to the set of juniors, and perform our computations as if this is our entire sample space.
◦ Likewise, event (iv) is also a conditional probability, namely, the probability that a student is a junior, given that the student is getting an A, and like with event (iii) we restrict our attention only to the set of A-students and perform our computations as if this is our entire sample space.
◦ More abstractly, let S be the full sample space of all 100 students, let J represent the event that a student is a junior, and let A represent the event that a student receives an A.
◦ If we write P (A|J) for the probability that a student is receiving an A given that they are a junior, which is event (iii), then as we calculated above, P (A|J) is equal to the ratio of juniors receiving an A to the total number of juniors, or, symbolically, P (A|J) = #(J ∩ A)/#(J).
◦ Now observe that P (A|J) = #(J ∩ A)/#(J) = [#(J ∩ A)/#(S)] / [#(J)/#(S)] = P (J ∩ A)/P (J), which gives us a way to write P (A|J) in terms of probabilities of events in the sample space (specifically, the events J ∩ A and J).
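A small Python sketch (illustrative only) that reproduces these four probabilities from the counts in the table above:

```python
from fractions import Fraction

# Counts from the table above: counts[standing][grade].
counts = {"Sophomore": {"A": 4,  "B": 8,  "C": 6},
          "Junior":    {"A": 14, "B": 11, "C": 10},
          "Senior":    {"A": 38, "B": 4,  "C": 5}}

total    = sum(sum(row.values()) for row in counts.values())   # 100 students
a_total  = sum(row["A"] for row in counts.values())            # 56 A-students
juniors  = sum(counts["Junior"].values())                      # 35 juniors
junior_a = counts["Junior"]["A"]                               # 14 A-student juniors

print(Fraction(a_total, total))      # (i)   P(A)   = 14/25 = 0.56
print(Fraction(juniors, total))      # (ii)  P(J)   = 7/20  = 0.35
print(Fraction(junior_a, juniors))   # (iii) P(A|J) = 2/5   = 0.40
print(Fraction(junior_a, a_total))   # (iv)  P(J|A) = 1/4   = 0.25
```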

• The analysis in the example above motivates a way to define conditional probabilities in general:

• Definition: If A and B are events and P (B) > 0, we define the conditional probability P (A|B) that A occurs given that B occurred by P (A|B) = P (A ∩ B)/P (B).

◦ Remark: If P (B) = 0, then the conditional probability P (A|B) cannot be computed using this definition. If the sample space is finite, then the conditional probability P (A|B) does not make sense if P (B) = 0, because the event B can never occur.

• Example: Suppose two fair 6-sided dice are rolled. If A is the event neither roll is a 2 and B is the event the first roll is a 5, find P (A|B) and P (B|A) and describe what these two probabilities mean.

◦ By definition, P (A|B) is the probability that A occurs given that B occurred, so in this case it is the probability that neither roll is a 2, given that the first roll is a 5.
◦ Inversely, P (B|A) is the probability that B occurs given that A occurred, so in this case it is the probability that the first roll is a 5 given that neither roll is a 2.
◦ According to the definition, we have P (A|B) = P (A ∩ B)/P (B) and P (B|A) = P (A ∩ B)/P (A), so we need only calculate P (A), P (B), and P (A ∩ B).
◦ As we have already calculated by enumerating all of the possible outcomes, P (A) = 25/36 (there are 5 · 5 possible outcomes where neither roll is a 2), P (B) = 1/6 (there are 6 outcomes where the first roll is a 5), and P (A ∩ B) = 5/36 (there are 5 outcomes where the first roll is a 5 and neither roll is a 2).
◦ Hence the formulas give P (A|B) = (5/36)/(1/6) = 5/6, which agrees with the calculation we made earlier, and also P (B|A) = (5/36)/(25/36) = 1/5.

• Example: Suppose four fair coins are flipped. Determine the probabilities of the respective events (i) there are exactly two heads, given that the first flip is a tail, (ii) the first flip is a tail, given that there are exactly two heads, and (iii) all four flips are tails, given that there is at least one tail.

◦ For event (i), if we let A be the event that there are exactly two heads and B be the event that the first flip is a tail, then we wish to find P (A|B) = P (A ∩ B)/P (B).
◦ We can see that A ∩ B is the event that the first flip is a tail and there are exactly two heads, which occurs in three ways: {THHT, THTH, TTHH}, so P (A ∩ B) = 3/16.
◦ Furthermore, we can see that there are 8 ways in which the first flip is a tail (since the remaining 3 flips can be either heads or tails), so P (B) = 1/2.
◦ Thus, we see P (A|B) = P (A ∩ B)/P (B) = (3/16)/(1/2) = 3/8.
◦ For event (ii), using the same events as above, we now wish to find P (B|A) = P (A ∩ B)/P (A).
◦ We can see that there are C(4, 2) = 6 ways in which we can obtain exactly two heads, so P (A) = 6/16 = 3/8.
◦ Thus, since P (A ∩ B) = 3/16 as computed above, we have P (B|A) = P (A ∩ B)/P (A) = (3/16)/(3/8) = 1/2.
◦ For event (iii), if we let C be the event that all four flips are tails and D be the event that there is at least one tail, then we wish to find P (C|D) = P (C ∩ D)/P (D).
◦ It is easy to see that C ∩ D is simply the event C (since if all four flips are tails, then there is certainly at least one tail), so P (C ∩ D) = P (C) = 1/16.
◦ Also, D will occur in every outcome except the one where all four flips are heads (which occurs with probability 1/16), so P (D) = 1 − 1/16 = 15/16.
◦ Thus, P (C|D) = P (C ∩ D)/P (D) = (1/16)/(15/16) = 1/15.
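Conditional probabilities over a small sample space can also be checked by direct enumeration; the sketch below (illustrative only) restricts the 16 outcomes to those where the conditioning event occurs.

```python
from fractions import Fraction
from itertools import product

flips = list(product("HT", repeat=4))   # 16 equally likely outcomes

def cond_prob(event, given):
    """P(event | given), computed by restricting to outcomes where `given` holds."""
    pool = [f for f in flips if given(f)]
    return Fraction(sum(1 for f in pool if event(f)), len(pool))

two_heads  = lambda f: f.count("H") == 2
first_tail = lambda f: f[0] == "T"
all_tails  = lambda f: f.count("T") == 4
some_tail  = lambda f: "T" in f

print(cond_prob(two_heads, first_tail))   # (i)   3/8
print(cond_prob(first_tail, two_heads))   # (ii)  1/2
print(cond_prob(all_tails, some_tail))    # (iii) 1/15
```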

4.2.2 Independence

• It is natural to say that two events are independent of one another if the knowledge that one occurs does not give any additional information about whether the other occurs. We can easily phrase this in terms of conditional probability:

• Definition: We say that two events A and B are independent if P (A|B) = P (A), or equivalently, if P (B|A) = P (B). Two events that are not independent are said to be dependent.

◦ From the definition of the conditional probability we know that P (A|B) = P (A ∩ B)/P (B), so if we rearrange the definition above, we obtain an equivalent definition of independence, namely, that A and B are independent if P (A ∩ B) = P (A) · P (B). ◦ This observation also allows us to see that the two statements of independence are equivalent to each other.

• Example: Suppose 2 fair 6-sided dice are rolled. Let A be the event the first roll is a 3, let B be the event the second roll is a 6, and let C be the event the sum of the two rolls is 6. Determine whether each of the pairs of events is independent.

◦ Of the 36 possible outcomes of rolling the two dice, 6 have the first roll equal to 3, 6 have the second roll equal to 6, and 5 have the sum of the two rolls equal to 6. Thus, P (A) = 1/6, P (B) = 1/6, and P (C) = 5/36. ◦ First, A ∩ B is the event that the first roll is a 3 and the second roll is a 6, which can happen in only 1 way, so P (A ∩ B) = 1/36. Since P (A ∩ B) = 1/36 = (1/6) · (1/6) = P (A) · P (B), we see that A and B are independent .

◦ Second, A ∩ C is the event that the first roll is a 3 and the sum of the two rolls is 6, which can happen in only 1 way (namely, both rolls are 3s), so P (A ∩ C) = 1/36. Since P (A) · P (C) = 5/216 ≠ 1/36, the events A and C are not independent .
◦ Third, B ∩ C is the event that the second roll is a 6 and the sum of the two rolls is 6, which can never occur (since the first die roll cannot be 0), so P (B ∩ C) = 0. Since P (B) · P (C) = 5/216 ≠ 0, the events B and C are not independent .
◦ Intuitively, we should expect that A and B are independent because the two rolls of the die do not affect one another. On the other hand, C is not independent from A and B because knowing that one roll is a 3 means it is possible for the sum of the dice to be 6, while knowing that one roll is a 6 means that a sum of 6 is impossible.

• Example: A single card is randomly dealt from a standard 52-card deck. Let A be the event the card is a spade, let B be the event the card is an ace, and let C be the event the card is an ace, 2, 3, or 4. Determine whether each of the pairs of events is independent.

◦ Since there are 13 spades, 4 aces, and 16 cards total that are an ace, 2, 3, or 4, we see P (A) = 1/4, P (B) = 1/13, and P (C) = 4/13.
◦ First, A ∩ B is the event that the card is the ace of spades, so P (A ∩ B) = 1/52 = P (A) · P (B), meaning that A and B are independent .
◦ Second, A ∩ C is the event that the card is a spade and also an ace, 2, 3, or 4, so it must be the ace, 2, 3, or 4 of spades. Thus P (A ∩ C) = 4/52 = 1/13 = P (A) · P (C), meaning that A and C are independent .
◦ Third, B ∩ C is the event that the card is an ace and also an ace, 2, 3, or 4, which is the same as saying it is an ace. Thus P (B ∩ C) = 1/13 ≠ 4/169 = P (B) · P (C), meaning that B and C are not independent .
◦ Remark: This example shows that independence is not transitive: A and B are independent, as are A and C, yet B and C are not independent.

• If A and B are independent events, then knowing that A occurs does not affect the probability that B occurs. Under this interpretation, it is reasonable to say that Ac and B are also independent events:

• Proposition (Independence of Complements): If A and B are independent events, then so are Ac and B.

◦ Proof: First observe that P (B) = P (A∩B)+P (Ac ∩B), since the events A∩B and Ac ∩B are mutually disjoint and have union B. ◦ Then if A and B are independent, we have P (A ∩ B) = P (A)P (B), so we may rearrange the equation above to see that P (Ac ∩ B) = P (B) − P (A)P (B) = [1 − P (A)] · P (B) = P (Ac) · P (B). ◦ Hence Ac and B are also independent, as claimed.

• We may also define independence of more than two events at once:

• Definition: We say that the events E1, E2, . . . , En are collectively independent if P (F1 ∩ F2 ∩ · · · ∩ Fk) = P (F1) · P (F2) ··· P (Fk) for any subset F1, F2, . . . , Fk of E1, E2, . . . , En.

◦ By rearranging this definition, one can see that it is equivalent to saying that knowledge of whether some of the events have occurred does not affect the probability of any of the others.

◦ Example: A fair coin is flipped 3 times. If E1 is the event that the first flip is heads, E2 is the event that the second flip is heads, and E3 is the event that the third flip is heads, then P (E1) = P (E2) = P (E3) = 1/2, while P (E1 ∩ E2) = 1/4 = P (E1)P (E2), P (E1 ∩ E3) = 1/4 = P (E1)P (E3), P (E2 ∩ E3) = 1/4 = P (E2)P (E3), and P (E1 ∩ E2 ∩ E3) = 1/8 = P (E1)P (E2)P (E3). Thus, these three events are collectively independent.

• If any pair of the events is not independent, then the entire collection of events cannot be collectively independent. On the other hand, it is possible for all of the pairs to be individually independent, but for the entire set not to be independent:

• Example: A fair coin is flipped 3 times. If E12 is the event that flips 1 and 2 are the same, E13 is the event that flips 1 and 3 are the same, and E23 is the event that flips 2 and 3 are the same, show that each pair of these events is independent, but the three events together are not collectively independent.

◦ Observe that E12 occurs in 4 of the 8 possible flip sequences (TTT, TTH, HHT, HHH), as does E13 (TTT, THT, HTH, HHH) and E23 (TTT, HTT, THH, HHH), so P (E12) = P (E13) = P (E23) = 1/2. ◦ Also, E12 ∩ E13 is the event where all 3 flips are the same, as is E12 ∩ E23 and E13 ∩ E23, so each of these events has probability 1/4. Thus P (E12 ∩ E13) = 1/4 = (1/2) · (1/2) = P (E12) · P (E13), so these two events are independent, and similarly for the other two pairs.

◦ On the other hand, E12 ∩ E13 ∩ E23 is also the event where all 3 flips are the same, so P (E12 ∩ E13 ∩ E23) = 1/4 ≠ 1/8 = (1/2) · (1/2) · (1/2) = P (E12) · P (E13) · P (E23), so the three events are not collectively independent.

• By using the idea of collective independence, we can calculate probabilities for sequences of independent events:

• Example: A baseball player's batting average is 0.378, meaning that she has a probability of 0.378 of getting a hit on any given at-bat, independently of any other at-bat. If she bats 4 times during a game, compute the probabilities that she gets (i) four hits during the game, (ii) a hit in her first two at-bats but no other hits, (iii) exactly two hits during the game, and (iv) at least one hit during the game.

◦ For each of the four at-bats, the player has an independent probability 0.378 of getting a hit. Let Ei be the event of getting a hit on the i-th at-bat: then P (Ei) = 0.378 and P (Ei^c) = 1 − 0.378 = 0.622.

◦ Event (i) is E1 ∩ E2 ∩ E3 ∩ E4. Since the four at-bats are independent, the probability of getting a hit during all four at-bats is therefore P (E1 ∩ E2 ∩ E3 ∩ E4) = P (E1) · P (E2) · P (E3) · P (E4) = 0.378^4 ≈ 2.04%.
◦ Event (ii) is E1 ∩ E2 ∩ E3^c ∩ E4^c. Since the four at-bats are independent (and independent events also have independent complements), the probability of this event is therefore P (E1 ∩ E2 ∩ E3^c ∩ E4^c) = P (E1) · P (E2) · P (E3^c) · P (E4^c) = 0.378^2 · 0.622^2 ≈ 5.53%.
◦ Event (iii) is the union of C(4, 2) = 6 possible events (one of which is event (ii)), each of which has 2 hits and 2 non-hits in the 4 at-bats. By independence and the calculation for event (ii), each of these events has probability 0.378^2 · 0.622^2, and they are all mutually exclusive. Thus, the probability of their union is simply the sum of their individual probabilities, which is 6 · 0.378^2 · 0.622^2 ≈ 33.17%.
◦ Event (iv) is the complement of the event of not getting any hits in the game, which is E1^c ∩ E2^c ∩ E3^c ∩ E4^c. By independence, we have P (E1^c ∩ E2^c ∩ E3^c ∩ E4^c) = P (E1^c) · P (E2^c) · P (E3^c) · P (E4^c) = 0.622^4, and therefore the probability of getting at least one hit in the game is 1 − 0.622^4 ≈ 85.03%.
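These four probabilities are easy to compute directly; the sketch below (illustrative only, under the stated 0.378 hit probability) reproduces them.

```python
from math import comb

p = 0.378          # probability of a hit on a single at-bat
q = 1 - p          # probability of no hit

print(p ** 4)                      # (i)   four hits                         ≈ 0.0204
print(p ** 2 * q ** 2)             # (ii)  hits in the first two at-bats only ≈ 0.0553
print(comb(4, 2) * p**2 * q**2)    # (iii) exactly two hits                  ≈ 0.3317
print(1 - q ** 4)                  # (iv)  at least one hit                  ≈ 0.8503
```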

4.2.3 Computing Probabilities, Bayes' Formula

• By rearranging the definition of conditional probability, we obtain a useful formula for the probability of an intersection of two events, namely, P (A ∩ B) = P (B|A) · P (A).

◦ Intuitively, one may interpret this formula as saying that if we wish to find the probability that both A and B occur, then first we compute the probability that A occurs, and then we multiply this by the probability that B also occurs (given that A occurred). ◦ In many situations, it turns out to be much easier to compute the probabilities P (A) and P (B|A) separately by viewing the events of choosing A and then choosing B given A as being choices made in a sequence. ◦ We also remark that this formula is the probabilistic version of the multiplication formula from our discussion of counting principles, and we can invoke it in much the same manner.

• Example: Suppose an urn contains 7 red balls and 11 purple balls. If two balls are randomly drawn from the urn without replacement², determine the probabilities of the respective events (i) the first ball is red, (ii) the second ball is red given that the first ball is red, (iii) both balls are red, and (iv) the first ball is purple and the second ball is red.

◦ Let R1 be the event that the first ball is red, R2 be the event that the second ball is red, and P1 be the event that the first ball is purple.

◦ For event (i), we want to compute P (R1). If we use the sample space consisting only of the first ball drawn from the urn, then each of the 18 outcomes is equally likely, and 7 of them yield a red ball drawn, so we have P (R1) = 7/18 ≈ 0.3889.

◦ For event (ii), we want to compute P (R2|R1). If we imagine drawing the first red ball and ignoring it, then drawing the second ball is the same as drawing one ball from an urn containing 6 red balls and 11 purple balls. By the same logic as above, the probability that a red ball is drawn now is P (R2|R1) = 6/17 ≈ 0.3529.

◦ For event (iii), we want to compute P (R1 ∩ R2). By using the intersection formula, this is equal to P (R2|R1) · P (R1) = (6/17) · (7/18) = 7/51 ≈ 0.1373.

◦ For event (iv), we want to compute P (P1 ∩ R2) = P (R2|P1) · P (P1).
◦ By the same logic as for events (i) and (ii), we see that P (P1) = 11/18 since when we draw the first ball, there are 11 purple balls out of 18 total, and also P (R2|P1) = 7/17 since when we draw the second ball under the assumption that the first one was purple, there are 7 red balls out of 17 total.
◦ Therefore we see that P (P1 ∩ R2) = P (R2|P1) · P (P1) = (7/17) · (11/18) = 77/306 ≈ 0.2516.
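A short sketch (illustrative only) that carries out the same without-replacement computations with exact fractions:

```python
from fractions import Fraction

red, purple = 7, 11
total = red + purple

p_r1    = Fraction(red, total)               # (i)   P(R1)      = 7/18
p_r2_r1 = Fraction(red - 1, total - 1)       # (ii)  P(R2 | R1) = 6/17
p_both  = p_r2_r1 * p_r1                     # (iii) P(R1 and R2) = 7/51
p_p1_r2 = Fraction(red, total - 1) * Fraction(purple, total)   # (iv) P(P1 and R2) = 77/306

print(p_r1, p_r2_r1, p_both, p_p1_r2)
```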

• We may also iteratively apply the result above to obtain formulas for intersections of more than two events.

◦ For example, for three events A, B, C, we have P (A ∩ B ∩ C) = P (C|A ∩ B) · P (A ∩ B) = P (C|A ∩ B) · P (B|A) · P (A). ◦ If we view these probabilities as a sequence of choices, then this formula tells us that we can compute the probability of A ∩ B ∩ C by choosing A, then choosing B given A, then choosing C given both A and B. ◦ The same idea extends to intersections of four or more events: the probability of the intersection can be found by computing the probabilities of the events one at a time (each conditional on all of the previous events listed), and then multiplying all of the results. ◦ The advantage to this approach is that it allows us to break complicated events down into simpler ones whose probabilities can often be computed quickly by working with a small sample space (like in the example above with the urns where we only considered a single ball being drawn from the urn).

• By combining these ideas with our results on counting principles we can solve a wide array of probability problems.

• Example (Monty Hall Problem): On a game show, a contestant chooses one of three doors: behind one door is a car and behind the other two are goats. The host opens one of the two unchosen doors to reveal a goat, and then offers the contestant the option of switching their choice from their original door to the remaining unopened door in the hopes of winning the car³. Should the contestant accept the offer to switch doors?

²Without replacement means that when a ball is drawn, it is not placed back into the urn. The opposite is with replacement, where each ball is placed back into the urn after it is drawn.
³For this particular scenario, we also assume that the car is randomly hidden behind one of the doors, that the host knows what is behind each door, that the host always opens a door that the contestant has not chosen that reveals a goat (randomly selecting between the two if the contestant has chosen the car), and that the host always offers the contestant the option to switch doors.

◦ We want to compute the probability that the car is behind the contestant's door. Suppose we label the contestant's door with the number 1, the other two doors are labeled 2 and 3, and the host opens door 2.

◦ Let P1, P2, P3 be the events in which the prize is behind door 1, 2, or 3 respectively, and H2 and H3 be the events in which the host opens door 2 and door 3 respectively. We wish to compute the conditional probability P (P1|H2) = P (P1 ∩ H2)/P (H2).

◦ We know that P (P1) = P (P2) = P (P3) = 1/3 since the car is equally likely to be behind any of the doors at the start of the game, and also P (H2) = P (H3) = 1/2 since by symmetry the host is equally likely to open door 2 or door 3.

◦ So we need only compute P (P1 ∩ H2) = P (H2|P1) · P (P1).

◦ But P (H2|P1) = P (H3|P1) = 1/2 since if the prize is behind door 1, the host is equally likely to open door 2 or door 3.

◦ Therefore, we get P (P1 ∩ H2) = (1/2) · (1/3) = 1/6, and then P (P1|H2) = P (P1 ∩ H2)/P (H2) = (1/6)/(1/2) = 1/3.
◦ This means that there is a 1/3 probability that the prize is behind door 1 (the door chosen by the contestant), and thus there is a 2/3 probability that the prize is behind door 3 (the remaining unchosen door), so it is better to switch doors.
◦ Remark: This is a famous probability puzzle known as the Monty Hall problem, originally popularized in this form by vos Savant in 1990. Although she gave the correct answer to the puzzle, she evidently received many thousands of letters from readers who disagreed with the answer! (It is a very common mistake to argue that because there are now only 2 doors remaining to choose between, the probability that the prize is behind each of them must be 1/2.)
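The conclusion can also be checked empirically: the simulation sketch below (illustrative only, under the assumptions in the footnote) estimates the win probability for the stay and switch strategies.

```python
import random

def win_rate(switch, trials=100_000):
    """Estimate the probability of winning the car with the given strategy."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)        # door hiding the car
        choice = random.randrange(3)     # contestant's initial pick
        # Host opens an unchosen door that hides a goat.
        opened = random.choice([d for d in range(3) if d != choice and d != car])
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(win_rate(switch=False))   # ≈ 1/3
print(win_rate(switch=True))    # ≈ 2/3
```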

• Example (Birthday Problem): Assuming that birthdays are randomly distributed among the 365 days in a non-leap year, and ignoring February 29th, what is the probability pn that in a group of n people, some pair have the same birthday? Also determine the smallest number of people for which there is at least a 50% chance of having two people with the same birthday.

◦ We compute the probability of the complementary event by choosing the birthdays for the members of the group one at a time and ensuring that no two have the same birthday.
◦ The first person may have any birthday. The second person may have any birthday except the first person's birthday, and the probability that this occurs is 364/365.
◦ The third person may have any birthday except that of the first and second person, and the probability that this occurs is 363/365.
◦ We continue in this way, observing that the kth person may have any of 366 − k possible birthdays, and that the probability of this event (conditioned on the previous ones) is (366 − k)/365.
◦ Then by our formula for the probability of the intersection of events, the desired probability that there is a pair of people with the same birthday is pn = 1 − (365/365) · (364/365) · (363/365) ··· ((366 − n)/365).
◦ Here is a table of values of this probability for various n:

n     5      10     15     20     22     23     25     30     40     50
pn    2.7%   11.7%  25.3%  41.1%  47.6%  50.7%  56.9%  70.6%  89.1%  97.0%

◦ From the table we can see that the minimal n such that there is a greater-than-50% chance of having at least one pair of people with the same birthday is n = 23 .
◦ Remark: The fact that the number of people required for a decent probability of having 2 with the same birthday is only 23 is generally found to be surprising, as many people typically guess that a much larger number (e.g., around half of 365) would be needed to have a high probability of getting a match.
◦ One way to estimate the correct answer is to observe that the desired event is the union of the events that one of the pairs of people in the group has the same birthday. The probability that any given pair has the same birthday is 1/365, and although these events are not disjoint, they are moderately close.

With k people, an estimate for the probability of getting at least one match is then C(k, 2)/365 ≈ k^2/730, and setting this equal to 50% leads to an estimate k ≈ √365 ≈ 19.1, not too far away from the actual answer of 23.
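The table of values pn is easy to regenerate; the sketch below (illustrative only) computes pn from the product formula and finds the smallest n with pn > 50%.

```python
def p_shared_birthday(n):
    """Probability that some pair among n people shares a birthday (365 equally likely days)."""
    p_all_distinct = 1.0
    for k in range(1, n + 1):
        p_all_distinct *= (366 - k) / 365
    return 1 - p_all_distinct

for n in (5, 10, 15, 20, 22, 23, 25, 30, 40, 50):
    print(n, f"{p_shared_birthday(n):.1%}")

# Smallest group size with a better-than-50% chance of a shared birthday.
print(next(n for n in range(1, 366) if p_shared_birthday(n) > 0.5))   # 23
```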

• Example: Suppose A and B are events such that P (A) = 0.6, P (B|A) = 0.7, and P (B|Ac) = 0.2. Find (i) P (A ∩ B), (ii) P (A ∩ Bc), (iii) P (Ac ∩ B), (iv) P (B), (v) P (A|B), (vi) P (A ∪ B), and (vii) P (Ac ∪ Bc).

◦ In computing probabilities of this form, it is useful to label the results on a Venn diagram. ◦ For (i), we know that P (A ∩ B) = P (B|A) · P (A) = 0.7 · 0.6 = 0.42 . ◦ For (ii), we know that P (A) = 0.6 and from above we also have P (A ∩ B) = 0.42, so per a Venn diagram we see that P (A ∩ Bc) = 0.6 − 0.42 = 0.18 . ◦ For (iii), we may write P (Ac ∩ B) = P (B|Ac) · P (Ac) = 0.2 · (1 − 0.6) = 0.08 . ◦ For (iv), we know that P (A ∩ B) = 0.42 and from above we also have P (Ac ∩ B) = 0.08, so per a Venn diagram we see that P (B) = P (A ∩ B) + P (Ac ∩ B) = 0.42 + 0.08 = 0.5 . ◦ For (v), we have P (A|B) = P (A ∩ B)/P (B) = 0.42/0.5 = 0.84 using the results we found above. ◦ For (vi), we have P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.6 + 0.5 − 0.42 = 0.68 . ◦ For (vii), by a Venn diagram we have P (Ac ∪ Bc) = 1 − P (A ∩ B) = 0.58 .

• Example: A medical test for a rare disease will detect the disease in 99% of test samples from patients who have the disease, but it also has a false positive rate of 0.5%, meaning that it also detects the disease in 0.5% of test samples from patients who do not have the disease. If 0.1% of the population actually has the disease, what is the probability that a randomly chosen patient who tests positive actually has the disease?

◦ Let D be the event that a person has the disease and + be the event that they test positive for the disease: we wish to compute the conditional probability P (D|+) = P (D ∩ +)/P (+).
◦ We are given that P (D) = 0.1% = 0.001 and thus P (Dc) = 0.999, and also that P (+|D) = 99% = 0.99 and P (+|Dc) = 0.5% = 0.005.
◦ Therefore, we have P (D ∩ +) = P (+|D) · P (D) = 0.99 · 0.001 = 0.00099.
◦ We also have P (+) = P (+ ∩ D) + P (+ ∩ Dc), and we can compute the latter as P (+ ∩ Dc) = P (+|Dc) · P (Dc) = 0.005 · 0.999 = 0.004995.
◦ Therefore, P (+) = P (+ ∩ D) + P (+ ∩ Dc) = 0.005985.
◦ Hence the desired conditional probability is 0.00099/0.005985 ≈ 16.54% .
◦ We thus obtain the (perhaps surprising!) result that a patient who tests positive for the disease only has about a 1-in-6 chance of actually having the disease, despite the fact that the test yields the correct result 99% of the time for positive patients and 99.5% of the time for negative patients.
◦ Ultimately, the reason for this disparity is that a typical person is very unlikely to have the disease, and thus there are far more false positives than actual cases of the disease. Thus, even if a person tests positive once, it is still not especially likely that they have the disease, unless there is some reason to think that the person was not a random member of the population.
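The same computation written out as a short Bayes' formula sketch (illustrative only, with the stated rates):

```python
p_d       = 0.001    # P(D): prevalence of the disease
p_pos_d   = 0.99     # P(+ | D): detection rate for diseased patients
p_pos_nod = 0.005    # P(+ | Dc): false positive rate

p_pos   = p_pos_d * p_d + p_pos_nod * (1 - p_d)   # P(+) by total probability
p_d_pos = p_pos_d * p_d / p_pos                   # P(D | +) by Bayes' formula

print(p_pos)     # 0.005985
print(p_d_pos)   # ≈ 0.1654
```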

• In the last two examples above, we ended up computing a conditional probability using the values of the conditional probabilities in the other order: in the first example we used P (B|A) and P (B|Ac) to find P (A|B), while in the second example we used P (+|D) and P (+|Dc) to find P (D|+). We can in fact write down a general formula for this calculation:

• Theorem (Bayes' Formula): If A and B are any events, then P (B|A) = [P (A|B) · P (B)] / [P (A|B) · P (B) + P (A|Bc) · P (Bc)]. More generally, if events B1, B2, . . . , Bk are mutually exclusive and have union the entire sample space, then P (Bi|A) = [P (A|Bi) · P (Bi)] / [P (A|B1) · P (B1) + ··· + P (A|Bk) · P (Bk)].

◦ Proof: By definition, P (B|A) = P (A ∩ B)/P (A), and we also have P (A ∩ B) = P (A|B) · P (B). ◦ We also have P (A) = P (A ∩ B) + P (A ∩ Bc) since these events are mutually exclusive and have union A. ◦ Then P (A ∩ B) = P (A|B) · P (B) and P (A ∩ Bc) = P (A|Bc) · P (Bc) from the definition of conditional probability; plugging all of these values in yields the formula immediately.

◦ The second formula follows in the same way by writing P (A) = P (A∩B1)+P (A∩B2)+···+P (A∩Bk).

• Example: Alex has two urns, one labeled H which has 5 black and 4 red balls, and one labeled T which has 4 black and 11 red balls. Alex flips an unfair coin that has a 2/3 probability of landing heads, and then draws one ball at random from the correspondingly-labeled urn (H for heads, T for tails). If Alex draws a red ball, what is the probability that the coin flip was heads?

◦ Let R denote the event of drawing a red ball, and let H and T denote the events of flipping heads and tails respectively.
◦ We want P (H|R), which we may get via Bayes' formula: P (H|R) = [P (R|H) · P (H)] / [P (R|H) · P (H) + P (R|T ) · P (T )].
◦ We have P (H) = 2/3, P (T ) = 1/3, and also P (R|H) = 4/9 and P (R|T ) = 11/15.
◦ Therefore, P (H|R) = (4/9 · 2/3) / (4/9 · 2/3 + 11/15 · 1/3) = 40/73 ≈ 54.8%.

• Example (Prosecutor's Fallacy): A DNA sample from a minor crime is compared to a state forensic database containing 100,000 records and a single suspect is identified on this basis alone, with no other evidence suggesting guilt or innocence. From analysis of human genetic variation, it is determined that the probability that a randomly-selected innocent person would match the DNA sample is 1 in 10000. At the trial, the prosecutor states that the probability that the suspect is innocent is only 1 in 10000, and observes that this figure means that it is overwhelmingly likely the suspect is guilty. Critique this statement.

◦ Suppose M is the event that there is a DNA match, and I is the event that the suspect is innocent.
◦ The conditional probability P (I|M) is the probability that the suspect is innocent given that there is a DNA match, which is what the prosecutor is claiming is equal to 1/10000.
◦ However, the 1-in-10000 figure is actually the probability that there is a match given that the suspect is innocent: this is the conditional probability P (M|I), which as we have seen is quite different from P (I|M).
◦ In this case we may use Bayes' formula to write P (I|M) = [P (M|I) · P (I)] / [P (M|I) · P (I) + P (M|Ic) · P (Ic)].
◦ A priori, there is no reason to believe that the given suspect is any more likely to be guilty than any other person in the database, so we will take P (Ic) = 1/100000 = 0.00001 and then P (I) = 0.99999.
◦ We also take P (M|I) = 1/10000 = 0.0001 as indicated, and P (M|Ic) = 1 since (we presume) the DNA analysis will always identify a guilty suspect.
◦ Then we obtain P (I|M) = (0.0001 · 0.99999) / (0.0001 · 0.99999 + 1 · 0.00001) ≈ 90.9%.
◦ The conditional probability in this case states that there is a 90.9% chance that the suspect is innocent, given the existence of a positive match and no other evidence: quite a far cry from the prosecutor's claim of 99.99% guilt!
◦ Remark: The confusion of the probability of innocence given a positive match with the probability of a positive match given innocence is called the prosecutor's fallacy, and (as shown with this dramatic example) it is a very serious error. In this particular example, one would expect approximately 100000/10000 = 10 DNA matches to come from the database, and given the lack of evidence to say otherwise, it is no more likely that the given suspect is guilty than any of these 10 people.

4.3 Discrete Random Variables

• When we observe the result of an experiment, we are often interested in some specic property of the outcome rather than the entire outcome itself.

◦ For example, if we flip a fair coin 5 times, we may want to know only the total number of heads obtained, rather than the exact sequence of all 5 flips. ◦ As another example, if we roll a pair of dice (e.g., in playing the dice game craps) we may want to know the sum of the outcomes rather than the results of each individual roll.

4.3.1 Definition and Examples

• Formally, properties of outcomes can be thought of as functions defined on the outcomes of a sample space:

• Definition: A random variable is a (real-valued) function defined on the outcomes in a sample space. If the sample space is finite (or countably infinite), we refer to the random variable as a discrete random variable.

◦ Example: For the experiment of flipping a coin 5 times, with corresponding sample space S, one random variable X is the total number of heads obtained. The value of X on the outcome HHHHT is 4, while the value of X on the outcome TTTTT is 0.
◦ Example: For the experiment of rolling a pair of dice, with corresponding sample space S, one random variable Y is the sum of the outcomes. The value of Y on the outcome (1, 4) is 5, while the value of Y on the outcome (6, 6) is 12.
◦ If X is a random variable, the set of outcomes on which X takes a particular value (or range of values) is a subset of the sample space, which is to say, it is an event.
◦ Thus, if we have a probability distribution on the sample space, we may therefore ask about quantities like (i) P(X = n), the probability that X takes the value n, or (ii) P(X ≥ 5), the probability that the value of X is at least 5, or (iii) P(2 < X < 4), the probability that the value of X is strictly between 2 and 4.
◦ A common way to tabulate all of this information is to make a list or table of all the possible values of X along with their corresponding probabilities.

• Example: If two standard 6-sided dice are rolled, find the probability distribution for the random variable X giving the sum of the outcomes. Then calculate (i) P(X = 7), (ii) P(4 < X < 9), and (iii) P(X ≤ 6).

◦ To find the probability distribution for X, we identify all of the possible values for X and then tabulate the respective outcomes in which each value occurs.
◦ We can see that the possible values for X are 2, 3, 4, ..., 12, and that they occur as follows:

Value    Outcomes                                           Probability
X = 2    (1, 1)                                             1/36
X = 3    (1, 2), (2, 1)                                     1/18
X = 4    (1, 3), (2, 2), (3, 1)                             1/12
X = 5    (1, 4), (2, 3), (3, 2), (4, 1)                     1/9
X = 6    (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)             5/36
X = 7    (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)     1/6
X = 8    (2, 6), (3, 5), (4, 4), (5, 3), (6, 2)             5/36
X = 9    (3, 6), (4, 5), (5, 4), (6, 3)                     1/9
X = 10   (4, 6), (5, 5), (6, 4)                             1/12
X = 11   (5, 6), (6, 5)                                     1/18
X = 12   (6, 6)                                             1/36

◦ Then we have P(X = 7) = 1/6, P(4 < X < 9) = 1/9 + 5/36 + 1/6 + 5/36 = 5/9, and P(X ≤ 6) = 1/36 + 1/18 + 1/12 + 1/9 + 5/36 = 5/12.
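◦ Remark: Since there are only 36 equally likely outcomes, the entire distribution can be tabulated by brute force. The following Python sketch (illustrative only) reproduces the table's probabilities and the three requested values exactly:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Enumerate all 36 equally likely rolls of two dice and tally the sums.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
dist = {s: Fraction(c, 36) for s, c in counts.items()}

print(dist[7])                                        # 1/6
print(sum(dist[s] for s in dist if 4 < s < 9))        # 5/9
print(sum(dist[s] for s in dist if s <= 6))           # 5/12
```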

• Example: If a fair coin is flipped 4 times, find the probability distributions for the random variable X giving the number of total heads obtained, and for the random variable Y giving the longest run of consecutive tails obtained. Then calculate (i) P(X = 2), (ii) P(X ≥ 3), (iii) P(1 < X < 4), (iv) P(Y = 1), (v) P(Y ≤ 3), and (vi) P(X = Y = 2).

◦ For X, we obtain the following distribution:

Value   Outcomes                                          Probability
X = 0   TTTT                                              1/16
X = 1   TTTH, TTHT, THTT, HTTT                            1/4
X = 2   TTHH, THTH, THHT, HTTH, HTHT, HHTT                3/8
X = 3   THHH, HTHH, HHTH, HHHT                            1/4
X = 4   HHHH                                              1/16

◦ For Y, we obtain the following distribution:

Value   Outcomes                                          Probability
Y = 0   HHHH                                              1/16
Y = 1   THTH, THHT, THHH, HTHT, HTHH, HHTH, HHHT          7/16
Y = 2   TTHT, TTHH, THTT, HTTH, HHTT                      5/16
Y = 3   TTTH, HTTT                                        1/8
Y = 4   TTTT                                              1/16

◦ We can then quickly compute P(X = 2) = 3/8, P(X ≥ 3) = 1/4 + 1/16 = 5/16, P(1 < X < 4) = 3/8 + 1/4 = 5/8, P(Y = 1) = 7/16, and P(Y ≤ 3) = 1/16 + 7/16 + 5/16 + 1/8 = 15/16.
◦ To find P(X = Y = 2) we must look at the individual outcomes where X and Y are both equal to 2. There are 2 such outcomes, namely TTHH and HHTT, so P(X = Y = 2) = 1/8.
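◦ Remark: The two distributions and the joint probability can likewise be checked by enumerating all 16 sequences; the sketch below defines the longest-run-of-tails statistic directly:

```python
from fractions import Fraction
from itertools import product

def longest_tail_run(seq):
    """Length of the longest block of consecutive 'T's in seq."""
    best = run = 0
    for c in seq:
        run = run + 1 if c == 'T' else 0
        best = max(best, run)
    return best

outcomes = [''.join(s) for s in product('HT', repeat=4)]
p = Fraction(1, 16)                  # each sequence is equally likely

print(sum(p for s in outcomes if s.count('H') == 2))                # 3/8
print(sum(p for s in outcomes if s.count('H') >= 3))                # 5/16
print(sum(p for s in outcomes if longest_tail_run(s) == 1))         # 7/16
print(sum(p for s in outcomes
          if s.count('H') == 2 and longest_tail_run(s) == 2))       # 1/8
```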

• If we are given two random variables X and Y defined on the sample space, then since X and Y are both functions on outcomes, we can define various new random variables using X and Y.

◦ If g is any real-valued function, we can define a new random variable g(X) by evaluating g on all of the results of X. Some possibilities include g(X) = 2X, which doubles every value of X, or g(X) = X^2, which squares every value of X.
◦ Likewise, the random variable X + Y is defined to be the random variable whose value is the sum of the respective values of X and Y on each possible outcome.
◦ We say that the random variables X and Y are independent if the value of either variable is independent of the value of the other variable, which is to say, if P(X = a|Y = b) = P(X = a) for all values a and b. Equivalently, this means that P(X = a ∩ Y = b) = P(X = a) · P(Y = b) for each a, b.

• A particular random variable is the random variable identifying whether an event has occurred:
• Definition: If E is any event, we define the Bernoulli random variable for the event E to be X_E = 1 if E occurs and X_E = 0 if E does not occur.

◦ The name for this random variable comes from the idea of a Bernoulli trial, which is an experiment having only two possible outcomes, success (with probability p) and failure (with probability 1 − p). We think of E as being the event of success, while E^c is the event of failure.
◦ Many experiments consist of a sequence of independent Bernoulli trials, in which the outcome of each trial is independent from the outcomes of all of the others. For example, flipping a (fair or unfair) coin 10 times and testing whether heads is obtained on each flip is an example of a sequence of independent Bernoulli trials.
◦ Using our results on independence of events, we can describe explicitly the probability distribution of the random variable X giving the total number of successes when n independent Bernoulli trials are performed, each with a probability p of success.

• Proposition (Binomial Distribution): Let X be the random variable representing the total number of successes obtained by performing n independent Bernoulli trials each of which has a success probability p. Then the probability distribution of X is the binomial distribution, in which P(X = k) = C(n, k) p^k (1 − p)^(n−k) for integers k with 0 ≤ k ≤ n (here C(n, k) denotes the binomial coefficient counting the number of ways to choose k objects from n), and P(X = k) = 0 for other k.
◦ The binomial distribution is so named because of the presence of the binomial coefficients C(n, k). One particular example of this distribution is the total number of heads obtained by flipping n unfair coins each of which has probability p of landing heads.
◦ Proof: From our results on binomial coefficients, we can see that there are C(n, k) ways to choose the k trials yielding success out of a total of n. Furthermore, since all of the trials are independent, the probability of obtaining any given pattern of k successes and n − k failures is equal to p^k (1 − p)^(n−k).
◦ Thus, since the probability of obtaining any given one of the C(n, k) outcomes with exactly k successes is p^k (1 − p)^(n−k), the probability of obtaining exactly k successes is C(n, k) p^k (1 − p)^(n−k), as claimed.
• Example: A baseball player's batting average is 0.378, meaning that she has a probability of 0.378 of getting a hit on any given at-bat, independently of any other at-bat. Find the probability that in her first 100 at-bats she gets (i) exactly 37 hits, (ii) exactly 40 hits, and (iii) exactly 50 hits.

◦ We can view each at-bat as an independent Bernoulli trial (with a hit being considered a success) with p = 0.378, so the total number of hits will be binomially distributed.
◦ Then the probability of getting 37 hits is C(100, 37) · 0.378^37 · 0.622^63 ≈ 8.13%, the probability of 40 hits is C(100, 40) · 0.378^40 · 0.622^60 ≈ 7.33%, and the probability of 50 hits is C(100, 50) · 0.378^50 · 0.622^50 ≈ 0.37%.
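◦ Remark: The binomial probabilities above can be evaluated directly; the short sketch below assumes the same n = 100 and p = 0.378 and uses the built-in binomial coefficient function:

```python
from math import comb

def binomial_pmf(n, p, k):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.378
for k in (37, 40, 50):
    print(k, round(binomial_pmf(n, p, k), 4))   # ≈ 0.0813, 0.0733, 0.0037
```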

4.3.2 Expected Value

• If we repeat an experiment many times and record the different values of a random variable X each time, a useful statistic summarizing the outcomes is the average value of the outcomes.

◦ We would like a way to describe the average value of a random variable X.

◦ Suppose that the sample space has events s1, s2, ..., sn on which the random variable X takes on the values x1, x2, ..., xn with probabilities p1, p2, ..., pn, where p1 + ··· + pn = 1.
◦ Under our interpretation of these probabilities as giving the relative frequencies of events when we repeat the experiment many times, if we perform the experiment N times where N is large, we should obtain the event X = xi approximately piN times for each 1 ≤ i ≤ n.
◦ The average value would then be [(p1N)x1 + (p2N)x2 + ··· + (pnN)xn] / N = p1x1 + p2x2 + ··· + pnxn.
◦ We may use this calculation to give a definition of the average value for an arbitrary discrete random variable:

• Definition: If X is a discrete random variable, the expected value of X, written E(X), is the sum E(X) = Σ_{si∈S} P(si) X(si) over all outcomes si in the sample space S. In words, the expected value is the average of the values that X takes on the outcomes in the sample space, weighted by the probability of each outcome.

◦ The expected value is also sometimes called the mean or the average value of X.
◦ Example: If a fair coin is flipped once, the expected value of the random variable X giving the number of total heads obtained is equal to E(X) = (1/2) · 0 + (1/2) · 1 = 1/2, because there are two possible outcomes, 0 heads and 1 head, each of probability 1/2.
◦ In the example above we see that the expected value captures the idea of an average value of the random variable when the experiment is repeated many times: if we flip a fair coin N times, we would expect to see about N/2 heads (yielding on average 1/2 head per flip), which agrees with the expected value of 1/2.

◦ Example: If an unfair coin with a probability 2/3 of landing heads is flipped once, the expected value of the random variable X giving the number of total heads obtained is equal to E(X) = (1/3) · 0 + (2/3) · 1 = 2/3, because there are two possible outcomes, 0 heads and 1 head, of respective probabilities 1/3 and 2/3.
◦ Example: If a standard 6-sided die is rolled once, the expected value of the random variable X giving the result is equal to E(X) = (1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = 7/2, because each of the 6 possible outcomes 1, 2, 3, 4, 5, 6 has probability 1/6 of occurring.

• A common application of expected value is to calculate the expected winnings from a game of chance:
• Example: In one version of a Pick 3 lottery, a single entry ticket costs $1. In this lottery, 3 single digits are drawn at random, and a ticket must match all 3 digits in the correct order to win the $500 prize. What is the expected value of one ticket for this lottery?

◦ From the description, we can see that there is a 1/1000 probability of winning the prize and a 999/1000 probability of winning nothing.
◦ Since winning the prize nets a total of $499 (the prize minus the $1 entry fee), and winning nothing nets a total of −$1, the expected value of the random variable giving the net winnings is equal to (1/1000)($499) + (999/1000)(−$1) = −$0.50.
◦ The expected value of −$0.50, in this case, indicates that if one plays this lottery many times, on average one should expect to lose 50 cents on every ticket.

• Example: In one version of the game Chuck-a-luck, three standard 6-sided dice are rolled. Prizes for a bet of $1 are awarded as follows: $66 for a roll of three 6s, $5 for a roll of two 6s, $1 for one 6, and $0 for any other roll. What is the expected value of one play of this game?

◦ We calculate the probabilities of the various outcomes.
◦ The probability of obtaining three 6s is 1/6^3 = 1/216, since there is only one such outcome. In this case, the net winnings total $65.
◦ The probability of obtaining exactly two 6s is 15/216, since there are 15 such outcomes, namely, 66A, 6A6, and A66, where A is any of the rolls 1-5. In this case, the net winnings total $4.
◦ The probability of obtaining exactly one 6 is 75/216, since there are 75 such outcomes, namely, 6AB, A6B, and AB6, where A and B are any of the rolls 1-5. In this case, the net winnings total $0.
◦ The probability of obtaining no 6s is 125/216, since each die has 5 possible outcomes. In this case, the net winnings total −$1.
◦ Therefore, the expected winnings from one play are (1/216)($65) + (15/216)($4) + (75/216)($0) + (125/216)(−$1) = $0.
◦ For this game, we can see that the expected winnings are $0, meaning that the game is fair (in the sense that neither the player nor the person running the game should expect to win or lose money on average over the long term).
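◦ Remark: Because there are only 6^3 = 216 equally likely rolls, the expected winnings can also be found by brute-force enumeration; the payouts in this sketch are the net amounts assumed above:

```python
from fractions import Fraction
from itertools import product

def net_winnings(roll):
    # Net winnings (prize minus the $1 bet) as a function of how many 6s appear.
    sixes = roll.count(6)
    return {3: 65, 2: 4, 1: 0, 0: -1}[sixes]

rolls = list(product(range(1, 7), repeat=3))          # all 216 outcomes
expected = sum(Fraction(net_winnings(r), len(rolls)) for r in rolls)
print(expected)                                       # 0: the game is fair
```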

• Expected value has several important algebraic properties:
• Proposition (Linearity of Expected Value): If X and Y are discrete random variables defined on the same sample space, and a and b are any real numbers, then E(aX + b) = a · E(X) + b and E(X + Y) = E(X) + E(Y).

◦ Intuitively, if the expected value of X is 4, then it is reasonable to feel that the expected value of X + 1 should be 5, while the expected value of 2X should be 8. These two observations, taken together, form essentially the first part of the statement.
◦ Likewise, if the expected value of Y is 3, then it is also reasonable to feel that the expected value of X + Y should be 7, the sum of the expected values of X and Y.

◦ Proof: Suppose the events of the sample space are s1, s2, ... on which X takes on the values x1, x2, ... and Y takes on the values y1, y2, ... with probabilities p1, p2, ..., where p1 + p2 + ··· = 1.

◦ Then E(X) = p1x1 + p2x2 + ··· and E(Y ) = p1y1 + p2y2 + ··· .

◦ Since aX + b takes on the values ax1 + b, ax2 + b, ... on the events of the sample space, we have E(aX + b) = p1(ax1 + b) + p2(ax2 + b) + ··· = a(p1x1 + p2x2 + ··· ) + b(p1 + p2 + ··· ) = a · E(X) + b.

◦ Also, X + Y takes on the values x1 + y1, x2 + y2, ... on the events of the sample space, so E(X + Y) = p1(x1 + y1) + p2(x2 + y2) + ··· = (p1x1 + p2x2 + ··· ) + (p1y1 + p2y2 + ··· ) = E(X) + E(Y).
• By using the linearity of expected value and decomposing random events into simpler pieces, we can compute expected values of more complicated random variables:

• Example (Binomial Distribution): An unfair coin with probability p of landing heads is flipped n times. Find the expected number of heads obtained.

◦ One approach would be to let X be the random variable giving the total number of heads, then compute the probability distribution of X and use the result to find the expected value.
◦ Since the number of heads is binomially distributed, the probability of obtaining k heads is C(n, k) p^k (1 − p)^(n−k), so the expected value is Σ_{k=0}^{n} k · C(n, k) p^k (1 − p)^(n−k), which can eventually be evaluated (using algebraic identities for the binomial coefficients) as np.
◦ However, a vastly simpler approach is to observe that we can write X as the sum of random variables X = X1 + X2 + ··· + Xn, where Xi is the number of heads obtained on the ith flip.

◦ Since E(X1) = E(X2) = ··· = E(Xn) = p (the flips each independently have a probability p of landing heads), by the additivity of expectation we see that E(X) = E(X1) + E(X2) + ··· + E(Xn) = np.
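◦ Remark: The two routes to E(X) = np can be compared numerically; the sketch below evaluates the binomial sum directly and checks it against np for a few arbitrarily chosen values of n and p:

```python
from math import comb, isclose

def binomial_mean_by_sum(n, p):
    # Direct evaluation of sum_k k * C(n, k) * p^k * (1 - p)^(n - k).
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

for n, p in [(10, 0.3), (25, 0.5), (100, 0.378)]:
    assert isclose(binomial_mean_by_sum(n, p), n * p)
print("sum formula agrees with np")
```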

• Example: Five cards are randomly drawn from a standard 52-card deck. Find the expected number of cards drawn that are (i) spades, (ii) aces, (iii) aces of spades.

◦ For (i), if we let X be the random variable giving the total number of spades, then X = X1 + X2 + X3 + X4 + X5, where Xi is the random variable indicating that the ith card is a spade. Then we have E(Xi) = 13/52 = 1/4 since each card can be one of 52 equally likely possibilities, 13 of which are spades, and so E(X) = 5 · 1/4 = 5/4.

◦ For (ii), if we let Y be the random variable giving the total number of aces, then Y = Y1 + Y2 + Y3 + Y4 + Y5, where Yi is the random variable indicating that the ith card is an ace. Then we have E(Yi) = 4/52 = 1/13 since each card can be one of 52 equally likely possibilities, 4 of which are aces, and so E(Y) = 5 · 1/13 = 5/13.
◦ In the same way, for (iii), if we decompose the random variable Z giving the total number of aces of spades as a sum over the 5 cards drawn from the deck, we see that E(Z) = 5 · 1/52 = 5/52.

◦ Remark: Notice that the values of the random variables X1, ..., X5 are not independent: for example, if the first card is a spade then the second card is less likely to be a spade (and vice versa, and similarly for the other cards). Nonetheless, the expected value of the total number of spades is still the sum of the individual expectations, even though the actual values of the corresponding random variables are not independent of one another.
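◦ Remark: The linearity computation can also be sanity-checked by simulation, even though the indicator variables are not independent; this is a rough Monte Carlo sketch (the number of trials is an arbitrary choice):

```python
import random

# Deck encoded as (rank, suit); rank 1 represents an ace, suit 0 spades.
deck = [(rank, suit) for rank in range(1, 14) for suit in range(4)]

trials = 100_000
spades = aces = ace_of_spades = 0
for _ in range(trials):
    hand = random.sample(deck, 5)
    spades += sum(1 for r, s in hand if s == 0)
    aces += sum(1 for r, s in hand if r == 1)
    ace_of_spades += sum(1 for r, s in hand if r == 1 and s == 0)

# Should be close to 5/4 = 1.25, 5/13 ≈ 0.3846, and 5/52 ≈ 0.0962.
print(spades / trials, aces / trials, ace_of_spades / trials)
```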

4.3.3 Variance and Standard Deviation

• In addition to computing the expected value of a random variable, we also would like to be able to measure how much variation the values have relative to their expected value.

◦ For example, if one random variable X is always equal to 2, then there is no variation in its value. If the random variable Y is equal to 0 half the time and 4 the other half the time, then its expected value is also 2, but there is much more variation in the values of Y .

• Definition: If X is a discrete random variable with expected value E(X) = µ, we define the variance of X to be var(X) = E[(X − µ)^2], the expected value of the square of the difference between X and its expectation. The standard deviation of X, denoted σ(X), is the square root of the variance: σ(X) = √var(X).

◦ Roughly speaking, the standard deviation measures the average distance that a typical outcome of X will be from the expected outcome.
◦ We can also give another formula for the variance: by using the linearity of expectation, we can write E[(X − µ)^2] = E(X^2 − 2µX + µ^2) = E(X^2) − 2µE(X) + µ^2, and since µ = E(X), this formula simplifies to var(X) = E(X^2) − µ^2 = E(X^2) − [E(X)]^2.
◦ It is often faster to compute E(X) and E(X^2) separately than to compute the probability distribution of (X − µ)^2.

• Example: If a coin with probability p of landing heads is flipped once, find the expected value, variance, and standard deviation of the random variable X giving the number of heads.

◦ There are two possible outcomes: either X = 0, which occurs with probability 1 − p, or X = 1, which occurs with probability p.
◦ The expected value is then E(X) = (1 − p) · 0 + p · 1 = p.
◦ For the variance, there are two possible values of (X − µ)^2 = (X − p)^2: either (X − µ)^2 = (0 − p)^2 = p^2, which occurs with probability 1 − p, or (X − µ)^2 = (1 − p)^2, which occurs with probability p.
◦ The variance is then var(X) = (1 − p) · p^2 + p · (1 − p)^2 = p(1 − p), and the standard deviation is σ(X) = √var(X) = √(p(1 − p)).
◦ Alternatively, using the formula var(X) = E(X^2) − [E(X)]^2, we compute E(X^2) = (1 − p) · 0^2 + p · 1^2 = p, and then var(X) = p − p^2 = p(1 − p), as above.
◦ In particular, when p = 1/2, we see that the standard deviation is 1/2. This agrees with the natural idea that although the expected number of heads obtained when flipping a fair coin is 1/2, the actual outcome is always a distance 1/2 away from the expectation (since it is either 0 or 1).

• Example: If a standard 6-sided die is rolled once, find the variance and standard deviation of the random variable X giving the result of the die roll.
◦ Each of the possible outcomes X = 1, 2, 3, 4, 5, 6 occurs with probability 1/6.
◦ We compute E(X) = (1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = 7/2 = 3.5, and also E(X^2) = (1/6) · 1^2 + (1/6) · 2^2 + (1/6) · 3^2 + (1/6) · 4^2 + (1/6) · 5^2 + (1/6) · 6^2 = 91/6.
◦ Thus, var(X) = E(X^2) − [E(X)]^2 = 91/6 − 49/4 = 35/12, and σ(X) = √var(X) = √(35/12) ≈ 1.708.
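◦ Remark: The shortcut var(X) = E(X^2) − [E(X)]^2 is easy to verify by machine with exact fractions; a minimal sketch:

```python
from fractions import Fraction

outcomes = range(1, 7)
mean = sum(Fraction(x, 6) for x in outcomes)            # 7/2
mean_sq = sum(Fraction(x * x, 6) for x in outcomes)     # 91/6
variance = mean_sq - mean**2
print(variance, float(variance) ** 0.5)                 # 35/12, ≈ 1.708
```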

• Example: A fair coin is flipped 4 times. Find the expected value, variance, and standard deviation of the random variable Y giving the longest run of consecutive tails obtained.

◦ We have previously computed the probability distribution of Y:

n           0       1       2       3      4
P(Y = n)    1/16    7/16    5/16    1/8    1/16

◦ We compute E(Y) = (1/16) · 0 + (7/16) · 1 + (5/16) · 2 + (1/8) · 3 + (1/16) · 4 = 27/16 = 1.6875, and also E(Y^2) = (1/16) · 0^2 + (7/16) · 1^2 + (5/16) · 2^2 + (1/8) · 3^2 + (1/16) · 4^2 = 61/16.
◦ Thus, var(Y) = E(Y^2) − [E(Y)]^2 = 61/16 − 729/256 = 247/256, and σ(Y) = √var(Y) = √(247/256) ≈ 0.9823.

• The variance and standard deviation also possess some algebraic properties like those of expected value:
• Proposition (Properties of Variance): If X is a discrete random variable and a and b are any real numbers, then var(aX + b) = a^2 · var(X) and σ(aX + b) = |a| σ(X). Furthermore, if X and Y are independent random variables, then var(X + Y) = var(X) + var(Y).

◦ Note that we do require the hypothesis that X and Y be independent in order for the variance to be additive. The result is not true for non-independent random variables: an easy example is to take X = Y: then var(X + Y) = var(2X) = 4var(X), which is not equal to var(X) + var(Y) = 2var(X).
◦ Proof: From the linearity of expectation, we know that E(aX + b) = a · E(X) + b, and therefore we have aX + b − E(aX + b) = aX + b − a E(X) − b = a · [X − E(X)].
◦ Then var(aX + b) = E[(aX + b − E(aX + b))^2] = E[a^2 · (X − E(X))^2] = a^2 var(X), and by taking the square root of both sides we then get σ(aX + b) = |a| σ(X).
◦ For the second statement, if X and Y are independent random variables, then one may show that E(XY) = E(X)E(Y) by manipulating the appropriate sum of products and factoring.
◦ Using this fact, we obtain var(X + Y) = E[(X + Y)^2] − [E(X + Y)]^2 = E(X^2 + 2XY + Y^2) − [E(X) + E(Y)]^2 = E(X^2) + 2E(X)E(Y) + E(Y^2) − [E(X)^2 + 2E(X)E(Y) + E(Y)^2] = [E(X^2) − E(X)^2] + [E(Y^2) − E(Y)^2] = var(X) + var(Y), as claimed.

• Using these properties we can calculate the variance and standard deviation of a binomially-distributed random variable:

• Corollary (Binomial Variance): Let X be the binomially-distributed random variable representing the total number of successes obtained by performing n independent Bernoulli trials each of which has a success probability p. Then E(X) = np, var(X) = np(1 − p), and σ(X) = √(np(1 − p)).

◦ Proof: We write X = X1 + X2 + ··· + Xn where Xi is the random variable of success on the ith trial for 1 ≤ i ≤ n.
◦ Observe that E(Xi) = (1 − p) · 0 + p · 1 = p and E(Xi^2) = (1 − p) · 0^2 + p · 1^2 = p, so var(Xi) = E(Xi^2) − E(Xi)^2 = p(1 − p).

◦ Each of the Xi is a single Bernoulli trial and they are all collectively independent by assumption, so we have E(X) = E(X1) + ··· + E(Xn) = np (as we previously calculated), and also var(X) = var(X1) + ··· + var(Xn) = np(1 − p), so that σ(X) = √var(X) = √(np(1 − p)) as claimed.

• Example: A car dealer has a probability 0.36 of selling a car to any individual customer, independently. If 25 customers patronize the dealership, determine the expected number and the standard deviation in the total number of cars sold.

◦ Each individual customer can be thought of as a Bernoulli trial, with success corresponding to selling a car with probability p = 0.36, with a total of n = 25 trials.
◦ Thus, from our results on the binomial distribution, the expected number of cars sold is np = 25 · 0.36 = 9 and the standard deviation is √(np(1 − p)) = √(25 · 0.36 · 0.64) = 2.4.

4.4 Continuous Random Variables

• Another important class of random variables consists of the random variables whose underlying sample space is the entire real line.

◦ Because there are uncountably many possible outcomes, to evaluate probabilities of events we cannot simply sum over the outcomes that make them up. ◦ Instead, we must use the continuous analogue of summation, namely, integration.

4.4.1 Definition and Examples

• Definition: A probability density function p(x) is a piecewise-continuous, nonnegative real-valued function such that ∫_{−∞}^{∞} p(x) dx = 1.
◦ Note that the integral ∫_{−∞}^{∞} p(x) dx is in general improper. However, since p(x) is by assumption nonnegative, the value of the integral is always well-defined (although it may be ∞).
◦ Example: The function p(x) = 1/4 for 0 ≤ x ≤ 4 (and 0 for other x) is a probability density function, since the two components of p(x) are both continuous and nonnegative, and ∫_{−∞}^{∞} p(x) dx = ∫_0^4 (1/4) dx = (1/4)x |_{x=0}^{4} = 1.
◦ Example: The function q(x) = 2x for 0 ≤ x ≤ 1 (and 0 for other x) is a probability density function, since the two components of q(x) are both continuous and nonnegative, and ∫_{−∞}^{∞} q(x) dx = ∫_0^1 2x dx = x^2 |_{x=0}^{1} = 1.
◦ Example: The function e(x) = e^(−x) for x ≥ 0 (and 0 for x < 0) is a probability density function, since the two components of e(x) are both continuous and nonnegative, and ∫_{−∞}^{∞} e(x) dx = ∫_0^∞ e^(−x) dx = −e^(−x) |_{x=0}^{∞} = 1.
• Definition: We say that X is a continuous random variable if there exists a probability density function p(x) such that for any interval I on the real line, we have P(X ∈ I) = ∫_I p(x) dx.
◦ In other words, probabilities for continuous random variables are computed via integrating the probability density function on the appropriate interval.
◦ Note that if X is a continuous random variable, then (no matter what the probability density function is), the value of P(X = a) is zero for any value of a, since P(X = a) = ∫_a^a p(x) dx = 0. This means that the probability that X will attain any specific value a is equal to zero.
• Example: If X is the continuous random variable whose probability density function is p(x) = x/8 for 0 ≤ x ≤ 4 (and 0 for other x), find (i) P(1 ≤ X ≤ 3), (ii) P(X ≤ 2), (iii) P(X ≥ 5), (iv) P(−2 ≤ X ≤ 3), and (v) P(X = 2).

◦ When we set up the integrals, we must remember to break the range of integration up (if needed) so that we are integrating the correct component of p(x) on the correct interval.
◦ For example, to verify that p(x) is a probability density function, we would compute ∫_{−∞}^{∞} p(x) dx = ∫_{−∞}^{0} 0 dx + ∫_0^4 (x/8) dx + ∫_4^∞ 0 dx = 0 + (1/16)x^2 |_{x=0}^{4} + 0 = 1, as required.
◦ For (i), we have P(1 ≤ X ≤ 3) = ∫_1^3 p(x) dx = ∫_1^3 (x/8) dx = (1/16)x^2 |_{x=1}^{3} = 1/2.
◦ For (ii), we have P(X ≤ 2) = ∫_{−∞}^{2} p(x) dx = ∫_{−∞}^{0} 0 dx + ∫_0^2 (x/8) dx = (1/16)x^2 |_{x=0}^{2} = 1/4.
◦ For (iii), we have P(X ≥ 5) = ∫_5^∞ p(x) dx = ∫_5^∞ 0 dx = 0.
◦ For (iv), we have P(−2 ≤ X ≤ 3) = ∫_{−2}^{3} p(x) dx = ∫_{−2}^{0} 0 dx + ∫_0^3 (x/8) dx = (1/16)x^2 |_{x=0}^{3} = 9/16.
◦ For (v), we have P(X = 2) = ∫_2^2 p(x) dx = ∫_2^2 (x/8) dx = (1/16)x^2 |_{x=2}^{2} = 0.
• As we saw in the example above, when we evaluate these integrals we may ignore intervals on which p(x) = 0, since their contribution to the integrals will always be 0.
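◦ Remark: These probabilities can also be approximated numerically without finding antiderivatives; the sketch below uses a simple midpoint-rule integrator (the step count is an arbitrary choice):

```python
def density(x):
    # Probability density from the example: x/8 on [0, 4], zero elsewhere.
    return x / 8 if 0 <= x <= 4 else 0.0

def integrate(f, a, b, steps=10_000):
    # Midpoint-rule approximation of the integral of f from a to b.
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

print(integrate(density, 1, 3))      # ≈ 0.5
print(integrate(density, -10, 2))    # ≈ 0.25   (the density is 0 below x = 0)
print(integrate(density, -2, 3))     # ≈ 0.5625 = 9/16
```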

• A useful function related to the probability density function of a continuous random variable is the cumulative density function, which measures the total probability that the continuous random variable takes a value ≤ x:

• Definition: If X is a continuous random variable with probability density function p(x), its cumulative density function c(x) is defined as c(x) = ∫_{−∞}^{x} p(t) dt for each real value of x. Then P(X ≤ a) = c(a) for every a.
◦ By the fundamental theorem of calculus, we have c′(x) = p(x) for every x, so we may freely convert back and forth between the probability density function and the cumulative density function via differentiation and integration.
◦ Since p(x) is nonnegative, if a ≤ b then c(a) ≤ c(b), and since ∫_{−∞}^{∞} p(x) dx = 1, we have lim_{x→∞} c(x) = 1 and lim_{x→−∞} c(x) = 0.
◦ Example: For the random variable with probability density function p(x) = 1/4 for 0 ≤ x ≤ 4 (and 0 for other x), the cumulative density function is c(x) = 0 for x ≤ 0, c(x) = x/4 for 0 ≤ x ≤ 4, and c(x) = 1 for x ≥ 4.
◦ Example: For the random variable with probability density function e(x) = e^(−x) for x ≥ 0 (and 0 for x < 0), the cumulative density function is c(x) = 0 for x < 0 and c(x) = 1 − e^(−x) for x ≥ 0.

• A simple class of continuous random variables are those whose probability density functions are constant on an interval [a, b] and zero elsewhere. We say such random variables are uniformly distributed on the interval [a, b].

◦ From the requirement that the integral of the probability density function equals 1, it is straightforward to see that the probability density function for a uniformly-distributed random variable on [a, b] is given by p(x) = 1/(b − a) for a ≤ x ≤ b (and 0 for other x), and then the cumulative distribution function is c(x) = 0 for x < a, c(x) = (x − a)/(b − a) for a ≤ x ≤ b, and c(x) = 1 for x > b.
◦ Using this description we can easily compute probabilities for uniformly-distributed random variables.

• Example: The high temperature in a certain city in May is uniformly distributed between 70◦F and 90◦F. Find the probabilities that (i) the temperature is between 82◦ and 85◦, (ii) the temperature is less than 75◦, and (iii) the temperature is greater than 82◦.

◦ From our description of uniformly-distributed random variables, we can see that the probability density function for the temperature is p(x) = 1/20 for 70 ≤ x ≤ 90 (and 0 for other x).
◦ Then for (i), the probability is ∫_{82}^{85} p(x) dx = ∫_{82}^{85} (1/20) dx = 3/20.
◦ For (ii), the probability is ∫_{−∞}^{75} p(x) dx = ∫_{−∞}^{70} 0 dx + ∫_{70}^{75} (1/20) dx = 1/4.
◦ For (iii), the probability is ∫_{82}^{∞} p(x) dx = ∫_{82}^{90} (1/20) dx + ∫_{90}^{∞} 0 dx = 2/5.

4.4.2 Expected Value, Variance, Standard Deviation

• In much the same way as we defined expected value, variance, and standard deviation for discrete random variables, we can also define these quantities for continuous random variables.

• Definition: If X is a continuous random variable with probability density function p(x), we define the expected value as E(X) = ∫_{−∞}^{∞} x p(x) dx, presuming that the integral converges.

• Example: Find the expected values of the continuous random variables X, Y, and Z with respective probability density functions p(x) = 6(x − x^2) for 0 ≤ x ≤ 1 (and 0 for other x), q(x) = 2x for 0 ≤ x ≤ 1 (and 0 for other x), and e(x) = e^(−x) for x ≥ 0 (and 0 for x < 0).

◦ By definition, we have E(X) = ∫_{−∞}^{∞} x p(x) dx = ∫_0^1 x · 6(x − x^2) dx = (2x^3 − (3/2)x^4) |_{x=0}^{1} = 1/2.

◦ Likewise, E(Y) = ∫_{−∞}^{∞} x q(x) dx = ∫_0^1 x · 2x dx = (2/3)x^3 |_{x=0}^{1} = 2/3.
◦ Also, E(Z) = ∫_{−∞}^{∞} x e(x) dx = ∫_0^∞ x e^(−x) dx = (−x e^(−x) − e^(−x)) |_{x=0}^{∞} = 1.
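◦ Remark: The three expected values can be double-checked with a basic numerical integrator; this is a rough sketch, with the integration range for E(Z) truncated where the density is negligible:

```python
from math import exp

def integrate(f, a, b, steps=100_000):
    # Midpoint-rule approximation of the integral of f over [a, b].
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# E(X), E(Y), E(Z) for the three densities in this example.
print(integrate(lambda x: x * 6 * (x - x**2), 0, 1))   # ≈ 0.5
print(integrate(lambda x: x * 2 * x, 0, 1))            # ≈ 0.6667
print(integrate(lambda x: x * exp(-x), 0, 50))         # ≈ 1.0 (tail beyond 50 is negligible)
```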

• Unlike with a discrete random variable, the expected value of a continuous random variable can be infinite or even undefined (the latter case is possible because the improper integral for the expected value need not be convergent):
• Example: Find the expected values of the continuous random variables X and Y with respective probability density functions p(x) = 1/x^2 for x ≥ 1 (and 0 for other x) and q(x) = 1/(π(1 + x^2)).

◦ By definition, we have E(X) = ∫_{−∞}^{∞} x p(x) dx = ∫_1^∞ x · (1/x^2) dx = ln(x) |_{x=1}^{∞} = ∞.
◦ For Y, we would have E(Y) = ∫_{−∞}^{∞} x · 1/(π(1 + x^2)) dx. By using a substitution, we can see that an antiderivative of x/(π(1 + x^2)) is (1/(2π)) ln(1 + x^2).
◦ To evaluate this improper integral, we can split the range of integration at 0 to obtain ∫_0^∞ x · 1/(π(1 + x^2)) dx = (1/(2π)) ln(1 + x^2) |_{x=0}^{∞} = ∞, while ∫_{−∞}^{0} x · 1/(π(1 + x^2)) dx = (1/(2π)) ln(1 + x^2) |_{x=−∞}^{0} = −∞.
◦ But this tells us that the original integral E(Y) = ∫_{−∞}^{∞} x · 1/(π(1 + x^2)) dx is ∞ − ∞, which is not defined.
◦ Remark: Observe that the probability density function for Y is symmetric about x = 0, which would suggest that the expected value is 0. However, this is not the case! This particular probability density function is called the Cauchy distribution, and it yields counterexamples to many statements that might seem to be intuitively obvious.

• We would like to define the variance and standard deviation in the same way as for discrete random variables; namely, as var(X) = E[(X − µ)^2] where µ = E(X) is the expected value of X, or equivalently as var(X) = E(X^2) − [E(X)]^2.

◦ However, we need to know how to compute the expected value of a function of X.
◦ To illustrate the difficulty, suppose X is uniformly distributed on [0, 2]. Then we can easily compute E(X), but it is not so obvious how to find E(X^2).
◦ Since X^2 is a random variable, it has some probability density function, which we can try to calculate by using the cumulative density function.
◦ Explicitly, since X^2 ≤ a is equivalent to X ≤ √a (at least for X nonnegative), this means that c_{X^2}(a) = c_X(√a) for 0 ≤ a ≤ 4.
◦ In terms of the probability density functions, this says ∫_0^a p_{X^2}(x) dx = ∫_0^{√a} p_X(x) dx = ∫_0^{√a} (1/2) dx = √a/2.
◦ Then differentiating both sides yields p_{X^2}(a) = d/da (√a/2) = (1/4) a^(−1/2) for 0 ≤ a ≤ 4.
◦ Thus, we deduce that p_{X^2}(x) = x^(−1/2)/4 for 0 ≤ x ≤ 4 (and 0 for other x).
◦ We can then evaluate E(X^2) = ∫_0^4 x · (1/4) x^(−1/2) dx = (1/6) x^(3/2) |_{x=0}^{4} = 4/3.
• This procedure for computing E(X^2) is quite complicated and difficult. Fortunately, there is a simpler way to compute the expected value of a function of a continuous random variable:

• Proposition (Expected Value of Functions of X): If X is a continuous random variable with probability density function p(x) and g(x) is any piecewise-continuous function, then the expected value of g(X) is E(g(X)) = ∫_{−∞}^{∞} g(x) p(x) dx.
◦ Proof (special case): Suppose g is increasing and has an inverse function g^(−1). Then g(x) ≤ a is equivalent to x ≤ g^(−1)(a), so by the argument given above, we see that c_{g(X)}(a) = c_X(g^(−1)(a)).
◦ Differentiating both sides yields p_{g(X)}(a) = p(g^(−1)(a)) · 1/g′(g^(−1)(a)), and then E(g(X)) = ∫_{−∞}^{∞} x · p(g^(−1)(x)) · 1/g′(g^(−1)(x)) dx.
◦ Making the substitution u = g^(−1)(x), so that x = g(u) and dx = g′(u) du, in the integral and simplifying yields E(g(X)) = ∫_{−∞}^{∞} g(u) · p(u) du, as claimed.
◦ The argument in the general case is similar.

• Corollary (Properties of Expected Value): If X and Y are continuous random variables whose expected values are defined, and a and b are any real numbers, then E(aX + b) = a · E(X) + b and E(X + Y) = E(X) + E(Y).
◦ Proof (first statement): If X has probability density function p(x), then E(aX + b) = ∫_{−∞}^{∞} (ax + b) · p(x) dx = a ∫_{−∞}^{∞} x · p(x) dx + b ∫_{−∞}^{∞} p(x) dx = a · E(X) + b.
◦ The second statement can be deduced by first finding the probability density function of X + Y and then using an argument similar to that of the proposition above.

• If the expected value of a continuous random variable is defined and finite, we can define the variance and standard deviation:

• Definition: If X is a continuous random variable with finite expected value µ, the variance var(X) is defined as var(X) = E[(X − µ)^2] = E(X^2) − E(X)^2, and the standard deviation is σ(X) = √var(X).

◦ The equality E[(X − µ)^2] = E(X^2) − E(X)^2 follows for continuous random variables by the same argument used for discrete random variables.

• Example: Find the variance and standard deviation for the continuous random variables X, Y, and Z with density functions p(x) = 6(x − x^2) for 0 ≤ x ≤ 1 (and 0 for other x), q(x) = 2x for 0 ≤ x ≤ 1 (and 0 for other x), and e(x) = e^(−x) for x ≥ 0 (and 0 for x < 0).

◦ For X, we have E(X) = ∫_0^1 x · 6(x − x^2) dx = (2x^3 − (3/2)x^4) |_{x=0}^{1} = 1/2 and E(X^2) = ∫_0^1 x^2 · 6(x − x^2) dx = ((3/2)x^4 − (6/5)x^5) |_{x=0}^{1} = 3/10.
◦ Thus var(X) = E(X^2) − [E(X)]^2 = 3/10 − 1/4 = 1/20 and σ(X) = √(1/20).
◦ For Y, we have E(Y) = ∫_0^1 x · 2x dx = (2/3)x^3 |_{x=0}^{1} = 2/3 and E(Y^2) = ∫_0^1 x^2 · 2x dx = (1/2)x^4 |_{x=0}^{1} = 1/2.
◦ Thus var(Y) = E(Y^2) − [E(Y)]^2 = 1/2 − 4/9 = 1/18 and σ(Y) = √(1/18).
◦ For Z, we have E(Z) = ∫_0^∞ x e^(−x) dx = 1 and E(Z^2) = ∫_0^∞ x^2 e^(−x) dx = −(x^2 + 2x + 2)e^(−x) |_{x=0}^{∞} = 2.
◦ Thus var(Z) = E(Z^2) − [E(Z)]^2 = 2 − 1 = 1 and σ(Z) = 1.

• Proposition (Properties of Variance): If X is a continuous random variable and a and b are any real numbers, then var(aX + b) = a^2 · var(X) and σ(aX + b) = |a| σ(X). Furthermore, if X and Y are independent random variables, then var(X + Y) = var(X) + var(Y).

◦ Proof: The proof follows in the same way as for discrete random variables.

4.4.3 The Normal Distribution and Central Limit Theorem

• One of the most important probability distributions is the normal distribution, often called the bell curve due to its shape.

• Definition: A random variable Nµ,σ is normally distributed with mean µ and standard deviation σ if its probability density function is pµ,σ(x) = (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2)). The standard normal distribution N0,1 has mean 0 and standard deviation 1.

◦ Here is a graph of the standard normal distribution showing its bell curve shape:

◦ It is not trivial to verify that pµ,σ(x) actually yields a probability density function: one must verify that ∫_{−∞}^{∞} (1/(σ√(2π))) e^(−(x−µ)^2/(2σ^2)) dx = 1, although this can be done (see the footnote below).
◦ Footnote: One approach is to substitute u = (x − µ)/(σ√2), which converts the problem into showing that the value of the integral J = ∫_{−∞}^{∞} e^(−u^2) du is √π. This can be done in several ways: the most standard approach is to write J^2 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(−(x^2+y^2)) dy dx and then convert to polar coordinates to deduce J^2 = π. There are many other approaches, including differentiation under the integral, interchanging the order of a double integral, integration in the complex plane, and asymptotic analysis.
◦ By manipulating the integrals accordingly using a series of substitutions and the evaluation of ∫_{−∞}^{∞} pµ,σ(x) dx = 1, we may eventually compute that E(Nµ,σ) = µ and var(Nµ,σ) = σ^2, so that σ(Nµ,σ) = σ.

• Example: Suppose N11,2 is a normally-distributed random variable with expected value 11 and standard deviation 2. Find (i) P (N ≤ 11), (ii) P (7 ≤ N ≤ 9), and (iii) P (N ≥ 13).

◦ One approach is simply to write down the probability density function and set up the integrals.
◦ This yields P(N11,2 ≤ 11) = ∫_{−∞}^{11} (1/(2√(2π))) e^(−(x−11)^2/8) dx, with P(7 ≤ N11,2 ≤ 9) = ∫_7^9 (1/(2√(2π))) e^(−(x−11)^2/8) dx, and P(N11,2 ≥ 13) = ∫_{13}^{∞} (1/(2√(2π))) e^(−(x−11)^2/8) dx.
◦ Unfortunately, except in certain rare cases, integrals involving the normal distribution are very difficult to evaluate exactly, owing to the fact that the function e^(−(x−µ)^2/(2σ^2)) does not have an elementary antiderivative: this means we cannot write down an exact formula for the indefinite integral (i.e., the cumulative density function) in terms of polynomials, exponentials, logarithms, or trigonometric functions.
◦ Of course, we can use numerical integration procedures (implemented by a calculator's or computer's software) to approximate these integrals to any desired accuracy: this yields P(N11,2 ≤ 11) = 0.5, P(7 ≤ N11,2 ≤ 9) ≈ 0.1359, and P(N11,2 ≥ 13) ≈ 0.1587.
◦ Another approach is to use a table of computed values for the cumulative distribution function of the standard normal distribution N0,1, along with a substitution: it is a straightforward calculation to verify that if we define za = (a − µ)/σ and zb = (b − µ)/σ, then P(a ≤ Nµ,σ ≤ b) = P(za ≤ N0,1 ≤ zb).
◦ Thus, if we compute these (so-called) z-scores, we can use a table of computed values for the standard normal distribution to find the desired probabilities:

z              −3       −2       −1.5     −1       −0.5     0        0.5      1        1.5      2        3
P(N0,1 ≤ z)    0.0014   0.0228   0.0668   0.1587   0.3085   0.5      0.6915   0.8413   0.9332   0.9772   0.9987

◦ In this case, we have µ = 11 and σ = 2, so we have P (N11,2 ≤ 11) = P (N0,1 ≤ 0) = 0.5 .

◦ Likewise, P (7 ≤ N11,2 ≤ 9) = P (−2 ≤ N0,1 ≤ −1) = P (N0,1 ≤ −1) − P (N0,1 ≤ −2) = 0.1587 − 0.0228 ≈ 0.1359 .

◦ Finally, P (N11,2 ≥ 13) = P (N0,1 ≥ 1) = 1 − P (N0,1 ≤ 1) = 1 − 0.8413 = 0.1587 .
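◦ Remark: In place of a printed table, the standard normal cumulative distribution function can be evaluated with the error function available in most software; the sketch below recomputes the three probabilities for N11,2:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(N_{mu,sigma} <= x) expressed via the error function.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 11, 2
print(normal_cdf(11, mu, sigma))                            # 0.5
print(normal_cdf(9, mu, sigma) - normal_cdf(7, mu, sigma))  # ≈ 0.1359
print(1 - normal_cdf(13, mu, sigma))                        # ≈ 0.1587
```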

• The normal distribution is seen very commonly in physical applications: some examples of quantities whose distribution is (very close to) normal are human heights, sizes of parts made by automated processes, blood pressures, measurement errors, and scores on standardized examinations.

• The reason for the common appearance of the normal distribution is the following fundamental result:

• Theorem (Central Limit Theorem): Let X1, X2, ..., Xn be a sequence of independent, identically-distributed discrete or continuous random variables each with expected value µ and standard deviation σ > 0. Then the distribution of the random variable Yn = (X1 + X2 + ··· + Xn − nµ)/(σ√n) tends to the standard normal distribution (of mean 0 and standard deviation 1) as n tends to ∞: explicitly, we have P(a ≤ Yn ≤ b) → ∫_a^b (1/√(2π)) e^(−x^2/2) dx for any real numbers a ≤ b.

◦ If X1, ..., Xn are independent and identically-distributed, we may think of the random variable X1 + ··· + Xn as being the sum of the results of independently sampling a random variable X a total of n times. (One example of this would be flipping a coin n times and summing the total number of heads; another would be rolling a die n times and summing the outcomes.)
◦ From our properties of expected value and variance, it is not hard to see that E(X1 + ··· + Xn) = nµ and σ(X1 + ··· + Xn) = σ√n.

◦ Thus, the result says that if we normalize the summed distribution X1 + ··· + Xn by translating and rescaling it so that its expected value is 0 and its standard deviation is 1, then the resulting normalized sum approaches the standard normal distribution as we sum over more and more samples.
◦ This result is quite powerful, since it applies to any discrete or continuous random variable whose expected value and standard deviation are defined.

• In particular, since the binomial distribution is obtained by summing n independent Bernoulli random variables, the central limit theorem tells us that if n is sufficiently large, then the distribution will be well approximated by a normal distribution with the same expected value and standard deviation.

◦ As a practical matter, the approximation tends to be very good when np and n(1 − p) are both at least 5, and increases in accuracy when np and n(1 − p) are larger.
◦ Since the binomial distribution is discrete while the normal distribution is continuous, we typically make an adjustment when we approximate the binomial distribution by the normal distribution, namely, we approximate the probability P(B = k) that the binomial random variable equals k with the probability P(k − 1/2 ≤ N ≤ k + 1/2) that the normal random variable lands in the interval [k − 1/2, k + 1/2].
◦ We make a similar continuity adjustment whenever we approximate a discrete distribution by a continuous one.

• Here are some comparisons of binomial distributions (with various parameters n and p) with their corresponding normal approximations:

• Example: Compute the exact probability that a fair coin flipped 400 times will land heads (i) exactly 200 times, and (ii) between 203 and 208 times inclusive, and compare them to the results of the normal approximation.

◦ The total number of heads is binomially distributed with n = 400 and p = 1/2, so for (i) the probability is (1/2^400) · C(400, 200) ≈ 3.99%, while for (ii) it is (1/2^400) · [C(400, 203) + C(400, 204) + ··· + C(400, 208)] ≈ 20.36%.
◦ The normal approximation to this binomial distribution has expected value µ = np = 200 and standard deviation √(np(1 − p)) = 10.

◦ Therefore, for (i) we wish to compute P(199.5 ≤ N200,10 ≤ 200.5), which from our discussion of z-scores is also equal to P(−0.05 ≤ N0,1 ≤ 0.05). Using a table of values, we can find P(N0,1 ≤ −0.05) = 0.48006 and P(N0,1 ≤ 0.05) = 0.51994, so the desired probability is 0.51994 − 0.48006 = 0.03988 ≈ 3.99%.

◦ For (ii) we wish to compute P(202.5 ≤ N200,10 ≤ 208.5), which from our discussion of z-scores is also equal to P(0.25 ≤ N0,1 ≤ 0.85). Using a table of values, we can find P(N0,1 ≤ 0.25) = 0.59871 and P(N0,1 ≤ 0.85) = 0.80234, so the desired probability is 0.80234 − 0.59871 = 0.20363 ≈ 20.36%.
◦ As we can see from our calculations, the normal approximation is very good! (In fact, the use of the normal distribution to approximate the binomial distribution was, historically speaking, one of the very first applications of the normal distribution.)
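◦ Remark: Both the exact binomial sums and the continuity-corrected normal approximations can be computed in a few lines; a minimal sketch:

```python
from math import comb, erf, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 400, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))          # 200 and 10

exact_200 = comb(n, 200) / 2**n
exact_203_208 = sum(comb(n, k) for k in range(203, 209)) / 2**n
print(exact_200, exact_203_208)                   # ≈ 0.0399 and ≈ 0.2036

# Continuity-corrected normal approximations of the same two probabilities.
print(normal_cdf(200.5, mu, sigma) - normal_cdf(199.5, mu, sigma))
print(normal_cdf(208.5, mu, sigma) - normal_cdf(202.5, mu, sigma))
```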

• Example: Use the normal distribution to estimate the probability that if 420 fair dice are rolled, the total of all the dice rolls will be between 1460 and 1501 inclusive.
◦ We saw earlier that µX = 7/2 and σX = √(35/12) for the random variable X giving the outcome of one roll.
◦ Since we are summing the results of 420 independent samplings of X, the central limit theorem tells us that the overall distribution of the sum will be closely approximated by a normal distribution with mean 420µX = 1470 and standard deviation σX√420 = 35.

◦ The desired probability is then P(1459.5 ≤ N1470,35 ≤ 1501.5) = P(−0.3 ≤ N0,1 ≤ 0.9) ≈ 43.39%.
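◦ Remark: The central limit estimate can be compared against a direct simulation of the 420 rolls; this is a rough sketch (the number of trials is an arbitrary choice):

```python
import random
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 420 * 3.5, sqrt(420 * 35 / 12)            # 1470 and 35

# Normal (CLT) estimate with the continuity correction used above.
print(normal_cdf(1501.5, mu, sigma) - normal_cdf(1459.5, mu, sigma))  # ≈ 0.4339

# Direct simulation of the same probability.
trials = 20_000
hits = sum(1460 <= sum(random.randint(1, 6) for _ in range(420)) <= 1501
           for _ in range(trials))
print(hits / trials)
```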

Well, you're at the end of my handout. Hope it was helpful. Copyright notice: This material is copyright Evan Dummit, 2018-2019. You may not reproduce or distribute this material without my express permission.
