Chapter 1

Probability Theory

1.1 Set Theory

One of the main objectives of a statistician is to draw conclusions about a population of objects by conducting an experiment. The first step in this endeavor is to identify the possible outcomes or, in statistical terminology, the sample space.

Definition 1.1.1 The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.

If the experiment consists of tossing a coin, the sample space contains two outcomes, heads and tails; thus, S = {H, T}. Consider an experiment where the observation is reaction time to a certain stimulus. Here, the sample space would consist of all positive numbers, that is, S = (0, ∞).

The sample space can be classified into two types: countable and uncountable. If the elements of a sample space can be put into 1–1 correspondence with a subset of the integers, the sample space is countable. Otherwise, it is uncountable.

Definition 1.1.2 An event is any collection of possible outcomes of an experiment, that is, any subset of S (including S itself).

Let A be an event, a subset of S. We say the event A occurs if the outcome of the experiment is in the set A.

We first define two relationships of sets, which allow us to order and equate sets:

A ⊂ B ⇔ x ∈ A ⇒ x ∈ B (containment)

A = B ⇔ A ⊂ B and B ⊂ A. (equality)

Given any two events (or sets) A and B, we have the following elementary set operations: Union: The union of A and B, written A∪B, is the set of elements that belong to either A or B or both:

A ∪ B = {x : x ∈ A or x ∈ B}.

Intersection: The intersection of A and B, written A ∩ B, is the set of elements that belong to both A and B:

A ∩ B = {x : x ∈ A and x ∈ B}.

Complementation: The complement of A, written Ac, is the set of all elements that are not in A:

Ac = {x : x ∉ A}.

Example 1.1.1 (Event operations) Consider the experiment of selecting a card at random from a standard deck and noting its suit: clubs (C), diamonds (D), hearts (H), or spades (S). The sample space is S = {C, D, H, S}, and some possible events are

A = {C,D}, and B = {D,H,S}.

From these events we can form

A ∪ B = {C, D, H, S}, A ∩ B = {D}, and Ac = {H, S}.

Furthermore, notice that A ∪ B = S and (A ∪ B)c = ∅, where ∅ denotes the empty set (the set consisting of no elements).

Theorem 1.1.1 For any three events, A, B, and C, defined on a sample space S,

a. Commutativity. A ∪ B = B ∪ A, A ∩ B = B ∩ A.

b. Associativity. A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) = (A ∩ B) ∩ C.

c. Distributive Laws. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C), A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

d. DeMorgan’s Laws. (A ∪ B)c = Ac ∩ Bc, (A ∩ B)c = Ac ∪ Bc.
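These identities are easy to check mechanically. The following is a quick sanity check using Python's built-in set type; the sample space and events are illustrative choices, not from the text.

```python
# Sanity check of the distributive and DeMorgan laws (Theorem 1.1.1)
# on small, arbitrarily chosen sets.
S = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {3, 5, 7}

def comp(X, S=S):
    """Complement of X relative to the sample space S."""
    return S - X

# Distributive Laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# DeMorgan's Laws
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)
```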

The operations of union and intersection can be extended to infinite collections of sets as well. If A1, A2, A3, ... is a collection of sets, all defined on a sample space S, then

∪_{i=1}^∞ Ai = {x ∈ S : x ∈ Ai for some i},

∩_{i=1}^∞ Ai = {x ∈ S : x ∈ Ai for all i}.

For example, let S = (0, 1] and define Ai = [(1/i), 1]. Then

∪_{i=1}^∞ Ai = ∪_{i=1}^∞ [(1/i), 1] = (0, 1],

∩_{i=1}^∞ Ai = ∩_{i=1}^∞ [(1/i), 1] = {1}.

It is also possible to define unions and intersections over uncountable collections of sets. If Γ is an index set (a set of elements to be used as indices), then

∪_{a∈Γ} Aa = {x ∈ S : x ∈ Aa for some a},

∩_{a∈Γ} Aa = {x ∈ S : x ∈ Aa for all a}.

Definition 1.1.3 Two events A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A1, A2, ... are pairwise disjoint (or mutually exclusive) if Ai ∩ Aj = ∅ for all i ≠ j.

Definition 1.1.4 If A1, A2, ... are pairwise disjoint and ∪_{i=1}^∞ Ai = S, then the collection A1, A2, ... forms a partition of S.

1.2 Basics of Probability Theory

When an experiment is performed, the realization of the experiment is an outcome in the sample space. If the experiment is performed a number of times, different outcomes may occur each time or some outcomes may repeat. This "frequency of occurrence" of an outcome can be thought of as a probability. More probable outcomes occur more frequently. If the outcomes of an experiment can be described probabilistically, we are on our way to analyzing the experiment statistically.

1.2.1 Axiomatic Foundations

Definition 1.2.1 A collection of subsets of S is called a sigma algebra (or Borel field), denoted by B, if it satisfies the following three properties:

a. ∅ ∈ B (the empty set is an element of B).

b. If A ∈ B, then Ac ∈ B (B is closed under complementation).

c. If A1, A2, ... ∈ B, then ∪_{i=1}^∞ Ai ∈ B (B is closed under countable unions).

Example 1.2.1 (Sigma algebra-I) If S is finite or countable, these technicalities really do not arise, and we can define, for a given sample space S,

B = {all subsets of S, including S itself}.

If S has n elements, there are 2^n sets in B. For example, if S = {1, 2, 3}, then B is the following collection of 2³ = 8 sets: {1}, {1, 2}, {1, 2, 3}, {2}, {1, 3}, ∅, {3}, {2, 3}.
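The count 2^n can be verified by generating the power set directly; a minimal sketch with the standard library:

```python
# Enumerate all subsets of S = {1, 2, 3}; there should be 2^3 = 8 of
# them, including the empty set and S itself (Example 1.2.1).
from itertools import chain, combinations

S = {1, 2, 3}
power_set = list(chain.from_iterable(
    combinations(sorted(S), r) for r in range(len(S) + 1)))

assert len(power_set) == 2 ** len(S)   # 8 subsets
assert () in power_set                 # the empty set
assert (1, 2, 3) in power_set          # S itself
```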

Example 1.2.2 (Sigma algebra-II) Let S = (−∞, ∞), the real line. Then B is chosen to contain all sets of the form

[a, b], (a, b], (a, b), [a, b)

for all real numbers a and b. Also, from the properties of B, it follows that B contains all sets that can be formed by taking (possibly countably infinite) unions and intersections of sets of the above varieties.

Definition 1.2.2 Given a sample space S and an associated sigma algebra B, a probability function is a function P with domain B that satisfies

1. P (A) ≥ 0 for all A ∈ B.

2. P (S) = 1.

3. If A1, A2, ... ∈ B are pairwise disjoint, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).

The three properties given in the above definition are usually referred to as the Axioms of Probability or the Kolmogorov Axioms. Any function P that satisfies the Axioms of Probability is called a probability function. The following gives a common method of defining a legitimate probability function.

Theorem 1.2.1 Let S = {s1, ..., sn} be a finite set. Let B be any sigma algebra of subsets of S. Let p1, ..., pn be nonnegative numbers that sum to 1. For any A ∈ B, define P(A) by

P(A) = Σ_{i: si ∈ A} pi.

(The sum over an empty set is defined to be 0.) Then P is a probability function on B. This remains true if S = {s1, s2, ...} is a countable set.

Proof: We will give the proof for finite S. For any A ∈ B, P(A) = Σ_{i: si ∈ A} pi ≥ 0, because every pi ≥ 0. Thus, Axiom 1 is true. Now,

P(S) = Σ_{i: si ∈ S} pi = Σ_{i=1}^n pi = 1.

Thus, Axiom 2 is true. Let A1, ..., Ak denote pairwise disjoint events. (B contains only a finite number of sets, so we need to consider only finite disjoint unions.) Then,

P(∪_{i=1}^k Ai) = Σ_{j: sj ∈ ∪_{i=1}^k Ai} pj = Σ_{i=1}^k Σ_{j: sj ∈ Ai} pj = Σ_{i=1}^k P(Ai).

The first and third equalities are true by the definition of P(A). The disjointness of the Ai's ensures that the second equality is true, because the same pj's appear exactly once on each side of the equality. Thus, Axiom 3 is true and Kolmogorov's Axioms are satisfied. □

Example 1.2.3 (Defining probabilities-II) The game of darts is played by throwing a dart at a board and receiving a score corresponding to the number assigned to the region in which the dart lands. For a novice player, it seems reasonable to assume that the probability of the dart hitting a particular region is proportional to the area of the region. Thus, a bigger region has a higher probability of being hit. The dart board has radius r and the distance between rings is r/5. If we make the assumption that the board is always hit, then we have

P(scoring i points) = (Area of region i) / (Area of dart board).

For example,

P(scoring 1 point) = (πr² − π(4r/5)²) / πr² = 1 − (4/5)².

It is easy to derive the general formula, and we find that

P(scoring i points) = ((6 − i)² − (5 − i)²) / 5²,   i = 1, ..., 5,

independent of π and r. The sum of the areas of the disjoint regions equals the area of the dart board. Thus, the probabilities that have been assigned to the five outcomes sum to 1, and, by Theorem 1.2.1, this is a probability function.

1.2.2 The Calculus of Probabilities

Theorem 1.2.2 If P is a probability function and A is any set in B, then

a. P (∅) = 0, where ∅ is the empty set.

b. P (A) ≤ 1.

c. P (Ac) = 1 − P (A).

Theorem 1.2.3 If P is a probability function and A and B are any sets in B, then

a. P (B ∩ Ac) = P (B) − P (A ∩ B).

b. P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

c. If A ⊂ B, then P (A) ≤ P (B).
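These identities can be spot-checked by enumeration on a small uniform sample space; the events below are arbitrary illustrative choices, not from the text.

```python
# Spot-checking Theorems 1.2.2 and 1.2.3 with exact rational
# arithmetic on a 12-point uniform sample space.
from fractions import Fraction

S = set(range(12))
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}

def prob(E):
    return Fraction(len(E), len(S))

assert prob(set()) == 0                                # P(∅) = 0
assert prob(S - A) == 1 - prob(A)                      # P(Ac) = 1 − P(A)
assert prob(B - A) == prob(B) - prob(A & B)            # Theorem 1.2.3a
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # Theorem 1.2.3b
```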

Formula (b) of Theorem 1.2.3 gives a useful inequality for the probability of an intersection. Since P(A ∪ B) ≤ 1, we have

P(A ∩ B) ≥ P(A) + P(B) − 1.

This inequality is a special case of what is known as Bonferroni's inequality.

Theorem 1.2.4 If P is a probability function, then

a. P(A) = Σ_{i=1}^∞ P(A ∩ Ci) for any partition C1, C2, ...;

b. P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai) for any sets A1, A2, .... (Boole's Inequality)

Proof: Since C1, C2, ... form a partition, we have that Ci ∩ Cj = ∅ for all i ≠ j, and S = ∪_{i=1}^∞ Ci. Hence,

A = A ∩ S = A ∩ (∪_{i=1}^∞ Ci) = ∪_{i=1}^∞ (A ∩ Ci),

where the last equality follows from the Distributive Law. We therefore have

P(A) = P(∪_{i=1}^∞ (A ∩ Ci)).

Now, since the Ci are disjoint, the sets A ∩ Ci are also disjoint, and from the properties of a probability function we have

P(∪_{i=1}^∞ (A ∩ Ci)) = Σ_{i=1}^∞ P(A ∩ Ci),

establishing (a). To establish (b) we first construct a disjoint collection A1*, A2*, ..., with the property that ∪_{i=1}^∞ Ai* = ∪_{i=1}^∞ Ai. We define Ai* by

A1* = A1,   Ai* = Ai \ (∪_{j=1}^{i−1} Aj),   i = 2, 3, ...,

where the notation A \ B denotes the part of A that does not intersect with B. In more familiar symbols, A \ B = A ∩ Bc. It should be easy to see that ∪_{i=1}^∞ Ai* = ∪_{i=1}^∞ Ai, and we therefore have

P(∪_{i=1}^∞ Ai) = P(∪_{i=1}^∞ Ai*) = Σ_{i=1}^∞ P(Ai*),

where the last equality follows since the Ai*'s are disjoint. To see this, we write

Ai* ∩ Ak* = {Ai \ (∪_{j=1}^{i−1} Aj)} ∩ {Ak \ (∪_{j=1}^{k−1} Aj)}   (definition of Ai*)
          = {Ai ∩ (∪_{j=1}^{i−1} Aj)c} ∩ {Ak ∩ (∪_{j=1}^{k−1} Aj)c}   (definition of \)
          = Ai ∩ (∩_{j=1}^{i−1} Ajc) ∩ Ak ∩ (∩_{j=1}^{k−1} Ajc).   (DeMorgan's Laws)

Now if i > k, the first intersection above will be contained in the set Akc, which will have an empty intersection with Ak. If k > i, the argument is similar. Further, by construction Ai* ⊂ Ai, so P(Ai*) ≤ P(Ai) and we have

Σ_{i=1}^∞ P(Ai*) ≤ Σ_{i=1}^∞ P(Ai),

establishing (b). □

There is a similarity between Boole's Inequality and Bonferroni's Inequality. If we apply Boole's Inequality to the complements Aic, we have

P(∪_{i=1}^n Aic) ≤ Σ_{i=1}^n P(Aic),

and using the facts that ∪Aic = (∩Ai)c and P(Aic) = 1 − P(Ai), we obtain

1 − P(∩_{i=1}^n Ai) ≤ n − Σ_{i=1}^n P(Ai).

This becomes, on rearranging terms,

P(∩_{i=1}^n Ai) ≥ Σ_{i=1}^n P(Ai) − (n − 1),

which is a more general version of the Bonferroni Inequality.
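The general Bonferroni bound can be checked numerically by enumeration; the events below are arbitrary illustrative choices on a 20-point uniform space.

```python
# Checking P(A1 ∩ ... ∩ An) ≥ Σ P(Ai) − (n−1) with exact rationals.
from fractions import Fraction

S = set(range(20))
events = [set(range(0, 15)), set(range(3, 18)), set(range(5, 20))]

def prob(E):
    return Fraction(len(E), len(S))

inter = set(S)
for A in events:
    inter &= A          # intersection of all the events

lower = sum(prob(A) for A in events) - (len(events) - 1)
assert prob(inter) >= lower   # 1/2 ≥ 1/4 for these choices
```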

1.2.3 Counting

Methods of counting are often used in order to construct probability assignments on finite sample spaces, although they can be used to answer other questions also. The following theorem is sometimes known as the Fundamental Theorem of Counting.

Theorem 1.2.5 If a job consists of k separate tasks, the ith of which can be done in ni ways, i = 1, . . . , k, then the entire job can be done in n1 × n2 × · · · × nk ways.

Example 1.2.4 For a number of years the New York state lottery operated according to the following scheme. From the numbers 1,2, ..., 44, a person may pick any six for her ticket. The winning number is then decided by randomly selecting six numbers from the forty-four. So the first number can be chosen in 44 ways, and the second number in 43 ways, making a total of 44 × 43 = 1892 ways of choosing the first two numbers. However, if a person is allowed to choose the same number twice, then the first two numbers can be chosen in 44 × 44 = 1936 ways.

The above example makes a distinction between counting with replacement and counting without replacement. The second crucial element in counting is whether or not the ordering of the tasks is important. Taking all of these considerations into account, we can construct a 2 × 2 table of possibilities.

Number of possible arrangements of size r from n objects:

              Without replacement    With replacement
Ordered       n!/(n−r)!              n^r
Unordered     C(n, r)                C(n+r−1, r)

Let us consider counting all of the possible lottery tickets under each of these four cases.

Ordered, without replacement: From the Fundamental Theorem of Counting, there are

44 × 43 × 42 × 41 × 40 × 39 = 44!/38! = 5,082,517,440

possible tickets.

Ordered, with replacement: Since each number can now be selected in 44 ways, there are

44 × 44 × 44 × 44 × 44 × 44 = 44^6 = 7,256,313,856

possible tickets.

Unordered, without replacement: From the Fundamental Theorem, six numbers can be arranged in 6! ways, so the total number of unordered tickets is

(44 × 43 × 42 × 41 × 40 × 39)/(6 × 5 × 4 × 3 × 2 × 1) = 44!/(6!38!) = 7,059,052.

Unordered, with replacement: In this case, the total number of unordered tickets is

(44 × 45 × 46 × 47 × 48 × 49)/(6 × 5 × 4 × 3 × 2 × 1) = 49!/(6!43!) = 13,983,816.
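All four counts from the 2 × 2 table can be reproduced with the standard library's combinatoric functions (n = 44 numbers, r = 6 picks, as in the lottery example):

```python
# The four lottery counts, one per cell of the 2x2 table.
import math

n, r = 44, 6
ordered_without   = math.perm(n, r)           # n!/(n-r)!
ordered_with      = n ** r                    # n^r
unordered_without = math.comb(n, r)           # C(n, r)
unordered_with    = math.comb(n + r - 1, r)   # C(n+r-1, r)

assert ordered_without == 5_082_517_440
assert ordered_with == 7_256_313_856
assert unordered_without == 7_059_052
assert unordered_with == 13_983_816
```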

1.2.4 Enumerating outcomes

The counting techniques of the previous section are useful when the sample space S is a finite set and all the outcomes in S are equally likely. Then probabilities of events can be calculated by simply counting the number of outcomes in the event. Suppose that S = {s1, ..., sN} is a finite sample space. Saying that all the outcomes are equally likely means that P({si}) = 1/N for every outcome si. Then, for any event A,

P(A) = Σ_{si ∈ A} P({si}) = Σ_{si ∈ A} 1/N = (# of elements in A)/(# of elements in S).

Example 1.2.5 Consider choosing a five-card poker hand from a standard deck of 52 playing cards. What is the probability of having four aces? If we specify that four of the cards are aces, then there are 48 different ways of specifying the fifth card. Thus,

P(four aces) = 48/C(52, 5) = 48/2,598,960.

The probability of having four of a kind is

P(four of a kind) = (13 × 48)/C(52, 5) = 624/2,598,960.

The probability of having exactly one pair is

P(exactly one pair) = C(13, 1) C(4, 2) C(12, 3) 4³ / C(52, 5) = 1,098,240/2,598,960.
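The poker counts above follow directly from `math.comb`; a quick sketch reproducing each numerator and the denominator:

```python
# Poker-hand counts from Example 1.2.5, assuming all C(52, 5) hands
# are equally likely.
from math import comb

total = comb(52, 5)
assert total == 2_598_960

four_aces = 48                                    # fifth card is free
four_kind = 13 * 48                               # any rank, then fifth card
one_pair  = 13 * comb(4, 2) * comb(12, 3) * 4**3  # pair rank, suits, kickers

assert four_kind == 624
assert one_pair == 1_098_240
```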

1.3 Conditional Probability and Independence

All of the probabilities that we have dealt with thus far have been unconditional probabilities. A sample space was defined and all prob- abilities were calculated with respect to that sample space. In many instances, however, we are in a position to update the sample space based on new information. In such cases, we want to be able to update probability calculations or to calculate conditional probabilities.

Example 1.3.1 (Four aces) Four cards are dealt from the top of a well-shuffled deck. What is the probability that they are the four aces? The probability is 1/C(52, 4) = 1/270,725. We can also calculate this probability by an "updating" argument, as follows. The probability that the first card is an ace is 4/52. Given that the first card is an ace, the probability that the second card is an ace is 3/51. Continuing this argument, we get the desired probability as

(4/52)(3/51)(2/50)(1/49) = 1/270,725.

Definition 1.3.1 If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is

P(A|B) = P(A ∩ B)/P(B).   (1.1)

Note that what happens in the conditional probability calculation is that B becomes the sample space: P(B|B) = 1. The intuition is that our original sample space, S, has been updated to B. All further occurrences are then calibrated with respect to their relation to B.

Example 1.3.2 (Continuation of Example 1.3.1) Calculate the conditional probabilities given that some aces have already been drawn.

P(4 aces in 4 cards | i aces in i cards)
  = P({4 aces in 4 cards} ∩ {i aces in i cards}) / P(i aces in i cards)
  = P(4 aces in 4 cards) / P(i aces in i cards)
  = (4 − i)!48!/(52 − i)! = 1/C(52 − i, 4 − i).

For i = 1, 2, and 3, the conditional probabilities are .00005, .00082, and .02041, respectively.

Example 1.3.3 (Three prisoners) Three prisoners, A, B, and C, are on death row. The governor decides to pardon one of the three and chooses at random the prisoner to pardon. He informs the warden of his choice but requests that the name be kept secret for a few days.

The next day, A tries to get the warden to tell him who has been pardoned. The warden refuses. A then asks which of B or C will be executed. The warden thinks for a while, then tells A that B is to be executed.

Warden's reasoning: Each prisoner has a 1/3 chance of being pardoned. Clearly, either B or C must be executed, so I have given A no information about whether A will be pardoned.

A's reasoning: Given that B will be executed, either A or C will be pardoned. My chance of being pardoned has risen to 1/2.

Let A, B, and C denote the events that A, B, or C is pardoned, respectively. We know that P(A) = P(B) = P(C) = 1/3. Let W denote the event that the warden says B will die. Using (1.1), A can update his probability of being pardoned to

P(A|W) = P(A ∩ W)/P(W).

What is happening can be summarized in this table:

Prisoner pardoned    Warden tells A    Probability
A                    B dies            1/6
A                    C dies            1/6
B                    C dies            1/3
C                    B dies            1/3

Using this table, we can calculate

P(W) = P(warden says B dies)
     = P(warden says B dies and A pardoned)
       + P(warden says B dies and C pardoned)
       + P(warden says B dies and B pardoned)
     = 1/6 + 1/3 + 0 = 1/2.

Thus, using the warden's reasoning, we have

P(A|W) = P(A ∩ W)/P(W)
       = P(warden says B dies and A pardoned)/P(warden says B dies)
       = (1/6)/(1/2) = 1/3.

However, A falsely interprets the event W as equal to the event Bc and calculates

P(A|Bc) = P(A ∩ Bc)/P(Bc) = (1/3)/(2/3) = 1/2.

We see that conditional probabilities can be quite slippery and require careful interpretation.
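The three-prisoners table can be turned into code directly; a sketch with exact rational arithmetic (the dictionary encoding is my own):

```python
# Enumerating the joint distribution of (prisoner pardoned, what the
# warden tells A); the warden flips a fair coin when A is pardoned.
from fractions import Fraction

table = {
    ("A", "B dies"): Fraction(1, 6),
    ("A", "C dies"): Fraction(1, 6),
    ("B", "C dies"): Fraction(1, 3),
    ("C", "B dies"): Fraction(1, 3),
}

P_W = sum(p for (who, told), p in table.items() if told == "B dies")
P_A_and_W = table[("A", "B dies")]

assert P_W == Fraction(1, 2)
assert P_A_and_W / P_W == Fraction(1, 3)   # the warden is right
```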

The following are several variations of (1.1):

P(A ∩ B) = P(A|B)P(B),
P(A ∩ B) = P(B|A)P(A),
P(A|B) = P(B|A) P(A)/P(B).

Theorem 1.3.1 (Bayes’ Rule) Let A1, A2, ... be a partition of the sample space, and let B be any set. Then, for each i = 1, 2,...,

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^∞ P(B|Aj)P(Aj).

Example 1.3.4 (Coding) When coded messages are sent, there are sometimes errors in transmission. In particular, Morse code uses "dots" and "dashes", which are known to occur in the proportion of 3:4. This means that for any given symbol,

P(dot sent) = 3/7 and P(dash sent) = 4/7.

Suppose there is interference on the transmission line, and with probability 1/8 a dot is mistakenly received as a dash, and vice versa. If we receive a dot, can we be sure that a dot was sent? Using Bayes' Rule, we can write

P(dot sent | dot received) = P(dot received | dot sent) P(dot sent)/P(dot received).

Now,

P(dot received) = P(dot received ∩ dot sent) + P(dot received ∩ dash sent)
               = P(dot received | dot sent)P(dot sent) + P(dot received | dash sent)P(dash sent)
               = (7/8) × (3/7) + (1/8) × (4/7) = 25/56.

So we have

P(dot sent | dot received) = ((7/8) × (3/7)) / (25/56) = 21/25.

In some cases it may happen that the occurrence of a particular event, B, has no effect on the probability of another event, A. For these cases, we have the following definition.

Definition 1.3.2 Two events, A and B, are statistically independent if P(A ∩ B) = P(A)P(B).

Example 1.3.5 The gambler Chevalier de Mere was particularly interested in the event that he could throw at least 1 six in 4 rolls of a die. We have

P(at least 1 six in 4 rolls) = 1 − P(no six in 4 rolls)
                             = 1 − Π_{i=1}^4 P(no six on roll i)
                             = 1 − (5/6)⁴ = 0.518.

The equality in the second line follows by the independence of the rolls.

Theorem 1.3.2 If A and B are independent events, then the following pairs are also independent:

a. A and Bc.

b. Ac and B.

c. Ac and Bc.

Proof: (a) can be proved as follows.

P(A ∩ Bc) = P(A) − P(A ∩ B)
          = P(A) − P(A)P(B)   (A and B are independent)
          = P(A)(1 − P(B)) = P(A)P(Bc). □
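The de Mere calculation from Example 1.3.5 is a one-liner, worth having as a sanity check:

```python
# At least one six in four independent rolls of a fair die.
p = 1 - (5/6)**4
assert round(p, 3) == 0.518
```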

Example 1.3.6 (Tossing two dice) Let an experiment consist of tossing two dice. For this experiment the sample space is

S = {(1, 1), (1, 2),..., (1, 6), (2, 1),..., (2, 6),..., (6, 1),..., (6, 6)}.

Define the following events:

A = {doubles appear} = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)},

B = {the sum is between 7 and 10}

C = {the sum is 2, 7 or 8}.

Thus, we have P(A) = 1/6, P(B) = 1/2 and P(C) = 1/3. Furthermore,

P(A ∩ B ∩ C) = P(the sum is 8, composed of double 4s)
            = 1/36 = (1/6) × (1/2) × (1/3)
            = P(A)P(B)P(C).

However,

P(B ∩ C) = P(sum equals 7 or 8) = 11/36 ≠ P(B)P(C).

Similarly, it can be shown that P(A ∩ B) ≠ P(A)P(B). Therefore, the requirement P(A ∩ B ∩ C) = P(A)P(B)P(C) is not a strong enough condition to guarantee pairwise independence.
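The claim is easy to verify by enumerating the 36 equally likely outcomes; a sketch (event predicates named after the text's A, B, C):

```python
# Checking that the triple product holds while pairwise independence
# fails, as in Example 1.3.6.
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def prob(E):
    return Fraction(sum(1 for s in S if E(s)), len(S))

A = lambda s: s[0] == s[1]                 # doubles appear
B = lambda s: 7 <= s[0] + s[1] <= 10       # sum between 7 and 10
C = lambda s: s[0] + s[1] in (2, 7, 8)     # sum is 2, 7 or 8

ABC = lambda s: A(s) and B(s) and C(s)
BC  = lambda s: B(s) and C(s)

assert prob(ABC) == prob(A) * prob(B) * prob(C)   # both equal 1/36
assert prob(BC) != prob(B) * prob(C)              # 11/36 vs 1/6
```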

Definition 1.3.3 A collection of events A1, ..., An is mutually independent if for any subcollection Ai1, ..., Aik, we have

P(∩_{j=1}^k Aij) = Π_{j=1}^k P(Aij).

Example 1.3.7 (Three coin tosses) Consider the experiment of tossing a coin three times. The sample space is

{HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}.

Let Hi, i = 1, 2, 3, denote the event that the ith toss is a head. For example, H1 = {HHH, HHT, HTH, HTT}. Assuming that the coin is fair and has an equal probability of landing heads or tails on each toss, the events H1, H2 and H3 are mutually independent. To verify this we note that

P(H1 ∩ H2 ∩ H3) = P({HHH}) = 1/8 = P(H1)P(H2)P(H3).

To verify the condition in Definition 1.3.3, we must also check each pair. For example,

P(H1 ∩ H2) = P({HHH, HHT}) = 2/8 = P(H1)P(H2).

The equality is also true for the other two pairs. Thus, H1, H2 and H3 are mutually independent.

1.4 Random variables

Example 1.4.1 (Motivating example) In an opinion poll, we might decide to ask 50 people whether they agree or disagree with a certain issue. If we record a "1" for agree and "0" for disagree, the sample space for this experiment has 2^50 elements. If we define a variable X = number of 1s recorded out of 50, we have captured the essence of the problem. Note that the sample space of X is the set of integers {0, 1, 2, ..., 50} and is much easier to deal with than the original sample space.

In defining the quantity X, we have defined a mapping (a function) from the original sample space to a new sample space, usually a set of real numbers. In general, we have the following definition.

Definition 1.4.1 A random variable is a function from a sample space S into the real numbers.

Example 1.4.2 (Random variables) In some experiments random variables are implicitly used; some examples are these.

Experiment                                             Random variable
Toss two dice                                          X = sum of the numbers
Toss a coin 25 times                                   X = number of heads in 25 tosses
Apply different amounts of fertilizer to corn plants   X = yield/acre

Suppose we have a sample space

S = {s1, . . . , sn} with a probability function P and we define a random variable X with range X = {x1, . . . , xm}. We can define a probability function PX on X in the following way. We will observe X = xi if and only if the outcome of the random experiment is an sj ∈ S such that X(sj) = xi. Thus,

PX(X = xi) = P ({sj ∈ S : X(sj) = xi}). (1.2)

Note that PX is an induced probability function on X, defined in terms of the original function P. Later, we will simply write PX(X = xi) = P(X = xi).

Theorem 1.4.1 The induced probability function defined in (1.2) defines a legitimate probability function in that it satisfies the Kolmogorov Axioms.

Proof: X is finite, so B is the set of all subsets of X. We must verify each of the three axioms.

(1) If A ∈ B, then PX(A) = P(∪_{xi∈A} {sj ∈ S : X(sj) = xi}) ≥ 0, since P is a probability function.

(2) PX(X) = P(∪_{i=1}^m {sj ∈ S : X(sj) = xi}) = P(S) = 1.

(3) If A1, A2, ... ∈ B are pairwise disjoint, then

PX(∪_{k=1}^∞ Ak) = P(∪_{k=1}^∞ {∪_{xi∈Ak} {sj ∈ S : X(sj) = xi}})
               = Σ_{k=1}^∞ P(∪_{xi∈Ak} {sj ∈ S : X(sj) = xi}) = Σ_{k=1}^∞ PX(Ak),

where the second equality follows from the fact that P is a probability function. □

A note on notation: Random variables will always be denoted with uppercase letters and the realized values of the variable will be denoted by the corresponding lowercase letters. Thus, the random variable X can take the value x.

Example 1.4.3 (Three coin tosses-II) Consider again the experiment of tossing a fair coin three times independently. Define the random variable X to be the number of heads obtained in the three tosses. A complete enumeration of the value of X for each point in the sample space is

s       HHH  HHT  HTH  THH  TTH  THT  HTT  TTT
X(s)    3    2    2    2    1    1    1    0

The range of the random variable X is X = {0, 1, 2, 3}. Assuming that all eight points in S have probability 1/8, by simply counting in the above display we see that the induced probability function on X is given by

x            0     1     2     3
PX(X = x)    1/8   3/8   3/8   1/8
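The induced pmf can be obtained by direct enumeration of the eight sample points; a minimal sketch:

```python
# Inducing the pmf of X = number of heads from the 8 equally likely
# outcomes of three coin tosses (Example 1.4.3).
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=3))          # 8 sample points

def pmf(x):
    return Fraction(sum(1 for s in S if s.count("H") == x), len(S))

assert [pmf(x) for x in range(4)] == [Fraction(1, 8), Fraction(3, 8),
                                      Fraction(3, 8), Fraction(1, 8)]
```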

The previous illustrations had both a finite S and a finite X, and the definition of PX was straightforward. Such is also the case if X is countable. If X is uncountable, we define the induced probability function, PX, in a manner similar to (1.2). For any set A ⊂ X,

PX(X ∈ A) = P ({s ∈ S : X(s) ∈ A}). (1.3)

This does define a legitimate probability function for which the Kolmogorov Axioms can be verified.

1.5 Distribution Functions

Definition 1.5.1 The cumulative distribution function (cdf) of a random variable X, denoted by FX(x), is defined by

FX(x) = PX(X ≤ x), for all x.

Example 1.5.1 (Tossing three coins) Consider the experiment of tossing three fair coins, and let X = number of heads observed. The cdf of X is

FX(x) = 0     if −∞ < x < 0,
      = 1/8   if 0 ≤ x < 1,
      = 1/2   if 1 ≤ x < 2,
      = 7/8   if 2 ≤ x < 3,
      = 1     if 3 ≤ x < ∞.

Remark:

1. FX is defined for all values of x, not just those in X = {0, 1, 2, 3}. Thus, for example,

FX(2.5) = P(X ≤ 2.5) = P(X = 0, 1, or 2) = 7/8.

2. FX has jumps at the values xi ∈ X, and the size of the jump at xi is equal to P(X = xi).

3. FX(x) = 0 for x < 0 since X cannot be negative, and FX(x) = 1 for x ≥ 3 since X is certain to be less than or equal to such a value.
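These remarks can be seen concretely by building the step cdf from the pmf; a sketch for the three-coin example:

```python
# The step cdf of Example 1.5.1, obtained by summing the pmf of
# X = number of heads in three fair tosses.
from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8),
       2: Fraction(3, 8), 3: Fraction(1, 8)}

def F(x):
    return sum(p for k, p in pmf.items() if k <= x)

assert F(-1) == 0                 # FX(x) = 0 for x < 0
assert F(2.5) == Fraction(7, 8)   # constant between the jump points
assert F(3) == 1                  # FX(x) = 1 for x >= 3
```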

FX is right-continuous, namely, the function is continuous when a point is approached from the right. The property of right-continuity is a consequence of the definition of the cdf. In contrast, if we had defined FX(x) = PX(X < x), FX would then be left-continuous.

Theorem 1.5.1 The function FX(x) is a cdf if and only if the following three conditions hold:

a. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

b. F (x) is a nondecreasing function of x.

c. F(x) is right-continuous; that is, for every number x0, lim_{x↓x0} F(x) = F(x0).

Example 1.5.2 (Tossing for a head) Suppose we do an experiment that consists of tossing a coin until a head appears. Let p = probability of a head on any given toss, and define X = number of tosses required to get a head. Then, for any x = 1, 2, ...,

P (X = x) = (1 − p)x−1p.

The cdf is

FX(x) = P(X ≤ x) = Σ_{i=1}^x P(X = i) = Σ_{i=1}^x (1 − p)^{i−1} p = 1 − (1 − p)^x.

It is easy to show that if 0 < p < 1, then FX(x) satisfies the conditions of Theorem 1.5.1. First,

lim_{x→−∞} FX(x) = 0,

since FX(x) = 0 for all x < 0, and

lim_{x→∞} FX(x) = lim_{x→∞} (1 − (1 − p)^x) = 1,

where x goes through only integer values when this limit is taken. To verify property (b), we simply note that the sum contains more positive terms as x increases. Finally, to verify (c), note that, for any x, FX(x + ε) = FX(x) if ε > 0 is sufficiently small. Hence,

lim_{ε↓0} FX(x + ε) = FX(x),

so FX(x) is right-continuous.
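The closed form 1 − (1 − p)^x can be checked against the partial sums of the pmf; a sketch with an illustrative value of p:

```python
# Geometric cdf of Example 1.5.2: partial sums of the pmf should
# reproduce the closed form F(x) = 1 - (1-p)^x.
p = 0.3   # illustrative choice, any 0 < p < 1 works

def pmf(x):
    return (1 - p)**(x - 1) * p

def cdf(x):
    return 1 - (1 - p)**x

for x in range(1, 10):
    partial = sum(pmf(i) for i in range(1, x + 1))
    assert abs(partial - cdf(x)) < 1e-12
```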

Example 1.5.3 (Continuous cdf) An example of a continuous cdf (the logistic distribution) is the function

FX(x) = 1/(1 + e^{−x}).

It is easy to verify that

lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

Differentiating FX(x) gives

(d/dx) FX(x) = e^{−x}/(1 + e^{−x})² > 0,

showing that FX(x) is increasing. FX is not only right-continuous, but also continuous.

Definition 1.5.2 A random variable X is continuous if FX(x) is a continuous function of x. A random variable X is discrete if FX(x) is a step function of x.

We close this section with a theorem formally stating that FX completely determines the probability distribution of a random variable X. This is true if P(X ∈ A) is defined only for events A in B1, the smallest sigma algebra containing all the intervals of real numbers of the form (a, b), [a, b), (a, b], and [a, b]. If probabilities are defined for a larger class of events, it is possible for two random variables to have the same distribution function but not the same probability for every event (see Chung 1974, page 27).

Definition 1.5.3 The random variables X and Y are identically distributed if, for every set A ∈ B1, P(X ∈ A) = P(Y ∈ A).

Note that two random variables that are identically distributed are not necessarily equal. That is, the above definition does not say that X = Y .

Example 1.5.4 (Identically distributed random variables) Consider the experiment of tossing a fair coin three times. Define the random variables X and Y by

X =number of heads observed and Y =number of tails observed.

For each k = 0, 1, 2, 3, we have P (X = k) = P (Y = k). So X and Y are identically distributed. However, for no sample point do we have X(s) = Y (s).

Theorem 1.5.2 The following two statements are equivalent:

a. The random variables X and Y are identically distributed.

b. FX(x) = FY (x) for every x.

Proof: To show equivalence we must show that each statement implies the other. We first show that (a) ⇒ (b).

Because X and Y are identically distributed, for any set A ∈ B1, P (X ∈ A) = P (Y ∈ A). In particular, for every x, the set (−∞, x] is in B1, and

FX(x) = P (X ∈ (−∞, x]) = P (Y ∈ (−∞, x]) = FY (x).

The above argument shows that if the X and Y probabilities agree on all sets, then they agree on intervals. To show (b) ⇒ (a), we must prove that if the X and Y probabilities agree on all intervals, then they agree on all sets. For more details see Chung (1974, Section 2.2). □

1.6 Density and Mass Functions

Definition 1.6.1 The probability mass function (pmf) of a discrete random variable X is given by

fX(x) = P (X = x) for all x.

Example 1.6.1 (Geometric probabilities) For the geometric distribution of Example 1.5.2, we have the pmf

fX(x) = P(X = x) = (1 − p)^{x−1} p   for x = 1, 2, ...,
                 = 0                 otherwise.

From this example, we see that a pmf gives us “point probability”. In the discrete case, we can sum over values of the pmf to get the cdf. The analogous procedure in the continuous case is to substitute integrals for sums, and we get

P(X ≤ x) = FX(x) = ∫_{−∞}^x fX(t) dt.

Using the Fundamental Theorem of Calculus, if fX(x) is continuous, we have the further relationship

(d/dx) FX(x) = fX(x).

Definition 1.6.2 The probability density function or pdf, fX(x), of a continuous random variable X is the function that satisfies

FX(x) = ∫_{−∞}^x fX(t) dt for all x.   (1.4)

A note on notation: The expression "X has a distribution given by FX(x)" is abbreviated symbolically by "X ∼ FX(x)", where we read the symbol "∼" as "is distributed as". We can similarly write X ∼ fX(x) or, if X and Y have the same distribution, X ∼ Y.

In the continuous case we can be somewhat cavalier about the specification of interval probabilities. Since P(X = x) = 0 if X is a continuous random variable,

P (a < X < b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a ≤ X ≤ b).

Example 1.6.2 (Logistic probabilities) For the logistic distribution, we have

FX(x) = 1/(1 + e^{−x}),

and hence

fX(x) = (d/dx) FX(x) = e^{−x}/(1 + e^{−x})²,

and

P(a < X < b) = FX(b) − FX(a)
            = ∫_{−∞}^b fX(x) dx − ∫_{−∞}^a fX(x) dx
            = ∫_a^b fX(x) dx.
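The relationship P(a < X < b) = F(b) − F(a) can be checked numerically for the logistic distribution by integrating the pdf; a sketch using a simple midpoint Riemann sum (the interval endpoints are illustrative):

```python
# Logistic cdf and pdf of Examples 1.5.3 and 1.6.2: integrating the
# pdf over (a, b) should agree with F(b) - F(a).
import math

def F(x):
    return 1 / (1 + math.exp(-x))

def f(x):
    return math.exp(-x) / (1 + math.exp(-x))**2

a, b, n = -1.0, 2.0, 100_000
h = (b - a) / n
riemann = sum(f(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(riemann - (F(b) - F(a))) < 1e-6
```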

Theorem 1.6.1 A function fX(x) is a pdf (or pmf) of a random variable X if and only if

a. fX(x) ≥ 0 for all x.

b. Σ_x fX(x) = 1 (pmf) or ∫_{−∞}^∞ fX(x) dx = 1 (pdf).

Proof: If fX(x) is a pdf (or pmf), then the two properties are immediate from the definitions. In particular, for a pdf, using (1.4) and Theorem 1.5.1, we have that

1 = lim_{x→∞} FX(x) = ∫_{−∞}^∞ fX(t) dt.

The converse implication is equally easy to prove. Once we have fX(x), we can define FX(x) and appeal to Theorem 1.5.1. □