
MTH 453: Basic Random Processes

Lecture 1: Basic Review

References:
- Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics by Anirban DasGupta.
- Probability with Applications and R by Robert Dobrow.
- Introduction to Stochastic Processes with R by Robert Dobrow.

Outline:
1) Experiments, Sample Spaces and Random Variables
2) A First Look into Simulation
3) Integer-Valued and Discrete Random Variables
4) Continuous Random Variables
5) Expectation, Moments and Simulation of Arbitrary Continuous RVs.
6) Joint Distributions and Independence of Random Variables
7) Properties of the most Common Distributions (will be taught/reviewed on the go)
8) Normal Approximations and Central Limit Theorem (will be taught/reviewed on the go)

Part 1. Experiments, Sample Spaces and Random Variables

Treatment of probability starts with the consideration of a random experiment. The sample space is the set of all possible outcomes in some experiment. For example, if a coin is tossed twice and after each toss the face that shows is recorded, then the possible outcomes of this particular coin-tossing experiment, say X, are {HH, HT, TH, TT}, with H denoting the occurrence of heads and T denoting the occurrence of tails. We call Ω = {HH, HT, TH, TT} the sample space of the experiment X.

In general, a sample space is a general set Ω, finite or infinite. An easy example where the sample space Ω is infinite is to toss a coin until the first time heads show up and record the number of the trial at which the first head appeared. In this case, the sample space Ω is the countably infinite set Ω = {1, 2, 3,...} Sample spaces can also be uncountably infinite; for example, consider the experiment measuring the time for a light bulb to get burned. The sample space of this experiment is Ω = R+. In this case, Ω is an uncountably infinite set. In all cases, individual elements of a sample space are denoted as ω. The first task is to define events and to explain the meaning of the probability of an event.

Definition 1.1. Let Ω be the sample space of an experiment X. Then any subset A of Ω, including the empty set ∅ and the entire sample space Ω, is called an event. Events may contain even one single sample point ω, in which case the event is a singleton set {ω}. We want to assign probabilities to events. But we want to assign probabilities in a way that they are logically consistent. In fact, this cannot be done in general if we insist on assigning probabilities to arbitrary collections of sample points, that is, arbitrary subsets of the sample space Ω. We can only define probabilities for such subsets of Ω that are tied together like a family, the exact concept being that of a σ-field. In most applications, including those cases where the sample space Ω is infinite, events that we would want to normally think about will be members of such an appropriate σ-field.

Here is a definition of what counts as a legitimate probability on events.

Definition 1.2. Given a sample space Ω, a probability or a probability measure on Ω is a function P on subsets of Ω such that

(a) P[A] ≥ 0 for all A ⊆ Ω. (In particular, A can be a singleton A = {ω} for some ω ∈ Ω.)

(b) P[Ω] = Σ_{ω∈Ω} P[ω] = 1.

(c) Given disjoint subsets A1, A2, ... of Ω,

P[∪_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i].

In particular, considering an event A and all the possible elements in A,

P[A] = Σ_{ω∈A} P[ω].

You may not be familiar with some of the notation in this definition. The symbol ∈ means "is an element of". So ω ∈ Ω means ω is an element of Ω. We are also using a generalized Σ-notation. The notation Σ_{ω∈Ω} means that the sum is over all ω that are elements of the sample space, that is, all outcomes in the sample space. In the case of a finite sample space Ω = {ω1, . . . , ωk}, the equation in (b) becomes

P[Ω] = Σ_{ω∈Ω} P[ω] = Σ_{n=1}^{k} P[ω_n] = P[ω1] + P[ω2] + ... + P[ωk] = 1.

Definition 1.3. Let Ω be a finite sample space consisting of N sample points. We say that the sample points are equally likely or that the probability distribution is uniform if P(ω) = 1/N for each sample point ω.

An immediate consequence, due to the additivity axiom, is the following useful formula.

Proposition 1.4. Let Ω be a finite sample space consisting of N equally likely sample points. Let A be any event and suppose A contains n distinct sample points. Then

P(A) = (Number of sample points favorable to A) / (Total number of sample points) = n/N.

Example 1.5. Roll a pair of dice. Find the sample space, identify the event that the sum of the two dice is equal to 7 and compute its probability.

Solution: The random experiment is rolling two dice. Keeping track of the roll of each die gives the sample space

Ω = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2),..., (6, 5), (6, 6)}.

The event is A = {Sum is 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}, and its probability, using Proposition 1.4, is 6/36 = 1/6.
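The count above can be double-checked in R by enumerating the 36 equally likely outcomes (a short sketch; expand.grid builds the sample space as a data frame):

```r
# Build the sample space of two dice rolls as all 36 ordered pairs
omega <- expand.grid(die1 = 1:6, die2 = 1:6)
# Event A: the sum of the two dice equals 7
A <- omega$die1 + omega$die2 == 7
# Probability = favorable outcomes / total outcomes
mean(A)  # 6/36 = 1/6
```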

Homework Problem 1. Yolanda and Zach are running for president of the student association. Ten students will be voting. Identify

(i) the sample space and

(ii) the event that Yolanda beats Zach by at least 7 votes.

Homework Problem 2. Joe will continue to flip a coin until heads appears. Identify the sample space and the event that it will take Joe at least three coin flips to get a head.

Example 1.6. A college has six majors: biology, geology, physics, dance, art, and music. The numbers of students taking these majors are 45, 30, 15, 10, 10, and 35, respectively. Choose a random student. What is the probability they are a science major?

Solution: The random experiment is choosing a student. The sample space is

Ω = {Bio, Geo, Phy, Dan, Art, Mus}.

The probability function is given by the number of students in the major divided by the total number of students (45+30+15+10+10+35=145). That is,

P[Bio] = 45/145 ≈ 0.31, P[Geo] = 30/145 ≈ 0.21, P[Phy] = 15/145 ≈ 0.1,

P[Dan] = 10/145 ≈ 0.07, P[Art] = 10/145 ≈ 0.07, P[Mus] = 35/145 ≈ 0.24. The event in question is

A = {Science major} = {Bio, Geo, Phy} = {Bio} ∪ {Geo} ∪ {Phy} where the last equality holds because the events are disjoint (we are not considering “double majors”). Finally,

P[A] = P[{Bio, Geo, Phy}] = P[{Bio} ∪ {Geo} ∪ {Phy}] = P[Bio] + P[Geo] + P[Phy] ≈ 0.31 + 0.21 + 0.10 = 0.62.

Homework Problem 3. In three coin tosses, what is the probability of getting at least two tails?
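The computation in Example 1.6 can be reproduced in R (a sketch; the vector names are just for readability):

```r
# Numbers of students in each major
counts <- c(Bio = 45, Geo = 30, Phy = 15, Dan = 10, Art = 10, Mus = 35)
probs <- counts / sum(counts)        # probability of choosing each major
sum(probs[c("Bio", "Geo", "Phy")])   # P[Science major], about 0.62
```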

In order to compute some probabilities we also need to know how the probability behaves, or how two probabilities compare, when the events have certain properties; the following proposition makes this precise.

Proposition 1.7 (Properties of Probabilities).

1. If A implies B, that is, if A ⊆ B, then P[A] ≤ P[B]. Ex: A = roll of a die being 2, B = roll of a die being even.

2. P[A does not occur] = P[Ac] = 1 − P[A]. Ex: A = roll of a die being 2 or 4, Ac = roll of a die being 1, 3, 5 or 6.

3. For any events A and B, P[A or B] = P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Ex: P[roll of a die being 2 or 3, or roll being 2 or 6] = P[roll being 2 or 3] + P[roll being 2 or 6] − P[roll being 2] = 1/3 + 1/3 − 1/6 = 1/2.

Homework Problem 4. In a city, 75% of the population have brown hair, 40% have brown eyes, and 25% have both brown hair and brown eyes. A person is chosen at random. What is the probability that they

1. have brown eyes or brown hair?

2. have neither brown eyes nor brown hair?

Often the outcomes of a random experiment take on numerical values. For instance, we might be interested in how many heads occur in three coin tosses. Let X be the number of heads. Then X is equal to 0, 1, 2, or 3, depending on the outcome of the coin tosses. The object X is called a random variable. The possible values of X are 0, 1, 2, and 3.

Definition 1.8. A random variable X is a (measurable) function from the sample space Ω to the real numbers R.

You can think simply that a random variable assigns numerical values to the outcomes of a random experiment. Random variables are enormously useful and allow us to use algebraic expressions, equalities, and inequalities when manipulating events. In many of the previous examples, we have been working with random variables without using the name, for example, the number of threes in rolls of a die, the number of votes received, the number of heads in repeated coin tosses, etc.

Example 1.9. If we throw two dice, what is the probability that the sum of the dice is greater than or equal to four?

Solution: We can, of course, find the probability by direct counting. But we will use random variables. Let Y be the sum of two dice rolls. Then Y is a random variable whose possible values are 2, 3, . . . , 12. The event that the sum is greater than or equal to 4 can be written as {Y ≥ 4}.
Observe that the complementary event is {Y ≤ 3}. By taking complements,

P[Y ≤ 3] = P[Y = 2] + P[Y = 3] = P[{(1, 1)}] + P[{(1, 2), (2, 1)}] = 1/36 + 2/36 = 3/36 = 1/12,

so that P[Y ≥ 4] = 1 − P[Y ≤ 3] = 1 − 1/12 = 11/12.
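As a quick check, this probability can be estimated by simulation in R (a sketch using sample to roll the dice):

```r
# Simulate many pairs of dice rolls and estimate P[Y >= 4]
n <- 100000
y <- sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE)
mean(y >= 4)  # should be close to 11/12, about 0.917
```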

While writing my book [Stochastic Processes] I had an argument with Feller. He asserted that everyone said “random variable” and I asserted that everyone said “chance variable”. We obviously had to use the same name in our books, so we decided the issue by a [random] procedure. That is, we tossed for it and he won.

-Joe Doob, quoted in Statistical Science

Random variables are central objects in probability. The name can be confusing since they are really neither “random” nor “variables” (like x in f(x)). A random variable is actually a function, a function whose domain is the sample space.

A random variable assigns every outcome of the sample space a real number. Consider the three coins example, letting X be the number of heads in three coin tosses. Depending upon the outcome of the experiment, X takes on different values. To emphasize the dependency of X on the outcome ω, we can write X(ω), rather than just X. In particular,

X(ω) = 0, if ω = TTT;
       1, if ω = HTT, THT or TTH;
       2, if ω = HHT, HTH or THH;
       3, if ω = HHH.

The probability of getting exactly two heads is written as P[X = 2], which is shorthand for P[{ω : X(ω) = 2}]. You may be unfamiliar with this last notation, used for describing sets. The notation {ω : Property} describes the set of all ω that satisfies some property. So {ω : X(ω) = 2} is the set of all ω with the property that X(ω) = 2. That is, the set of all outcomes that result in exactly two heads, which is {HHT,HTH,THH}. Similarly, the probability of getting at most one head in three coin tosses is P[X ≤ 1] = P[{ω : X(ω) ≤ 1}] = P[{TTT,HTT,THT,TTH}]. Because of simplicity and ease of notation, we typically use the shorthand X in writing random variables instead of the more verbose X(ω).

Part 2. A First Look into Simulation

At this point it is important to review Appendix A from the textbook (Section 6 can be omitted for now). It will be handed out in class.

Using random numbers on a computer to simulate probabilities is called the Monte Carlo method. Today, Monte Carlo tools are used extensively in statistics, physics, engineering, and across many disciplines. The name was coined in the 1940s by mathematicians John von Neumann and Stanislaw Ulam working on the Manhattan Project. It was named after the famous Monte Carlo casino in Monaco.

The first thoughts and attempts I made to practice [the Monte Carlo method] were suggested by a question which occurred to me in 1946 as I was convalescing from an illness and playing solitaires. The question was what are the chances that a Canfield solitaire laid out with 52 cards will come out successfully? After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than “abstract thinking” might not be to lay it out say one hundred times and simply observe and count the number of successful plays. This was already possible to envisage with the beginning of the new era of fast computers, and I immediately thought of problems of neutron diffusion and other questions of mathematical physics, and more generally how to change processes described by certain differential equations into an equivalent form interpretable as a succession of random operations. Later [in 1946], I described the idea to John von Neumann, and we began to plan actual calculations.

-Stanislaw Ulam, quoted in Eckhardt (1987):

The Monte Carlo simulation approach is based on the relative frequency model for probabilities. Given a random experiment and some event A, the probability P(A) is estimated by repeating the random experiment many times and computing the proportion of times that A occurs. More formally, define a sequence of indicator random variables X1, X2, ..., where

X_k = 1, if A occurs on the k-th trial; 0, if A does not occur on the k-th trial,

for k = 1, 2, .... Then

(X1 + ... + Xn)/n

is the proportion of times in which A occurs in n trials. For large n, the Monte Carlo method estimates P[A] (by using the LAW OF LARGE NUMBERS) as

P[A] ≈ (X1 + ... + Xn)/n.

Definition 1.10. Simulation is a method that uses an artificial process (programmed in the computer) to represent the outcomes of a real process (like tossing a coin) that provides information about the probability of events. We use simulation to estimate probabilities in real problems and situations for which empirical or theoretical procedures are not easily calculated.

In order to simulate random variables taking values on the integers we will use the function sample in R.

> sample(a:b,n,replace=T)

The command samples with replacement from {a, a + 1, a + 2, . . . , b − 1, b} n times such that outcomes are equally likely (this can be modified by specifying the probabilities of each element in a vector of b − a + 1 elements). For a coin toss, we can sample it using

> sample(0:1,n,replace=T)

Let 0 represent tails and 1 represent heads. The output is a sequence of n ones and zeros corresponding to the outcomes of the coin tosses. The average, or mean, of the list is precisely the proportion of ones. To simulate P(Heads), type

> mean(sample(0:1,n,replace=T))

Computer Exercise: Repeat the command several times with one million trials (use the up arrow key). These give repeated Monte Carlo estimates of the desired probability. Are the answers the same each time?

Example 1.11. Obtain, by simulation, an approximate probability of getting three heads in three coin tosses. Can you increase the precision of your results?

Solution:

# Trial
> trial <- sample(0:1,3,replace=TRUE)
# Success
> if (sum(trial)==3) 1 else 0
# Replication
> n <- 10  # Number of repetitions
> simlist <- numeric(n)  # Initialize vector
> for (i in 1:n) {
    trial <- sample(0:1, 3, replace=TRUE)
    success <- if (sum(trial)==3) 1 else 0
    simlist[i] <- success
  }
> mean(simlist)  # Proportion of trials with 3 heads

To increase the precision of the result, we just make n larger and check that it starts converging to the exact answer 1/8 = 0.125.

Homework Problem 5. HARD PROBLEM: An integer is drawn uniformly at random (i.e., each number has the same chance of being drawn) from {1, ..., 1000}. Approximate, by simulation, the probability that the number drawn is divisible by 3, 5, or 6. Can you compute the answer theoretically? (The theoretical result uses Proposition 1.7.)

Homework Problem 6. For the following 2 situations identify (i) the random experiment, (ii) the sample space, (iii) the event, and (iv) the random variable. Express the probability in question in terms of the defined random variable, but do not compute the probability.

• Roll four dice. Consider the probability of getting all fives.

• In Angel’s garden, there is a 3% chance that a tomato will be bad. Angel harvests 100 tomatoes and wants to know the probability that at most five tomatoes are bad.

Homework Problem 7. A couple plans to continue having children until they have a girl or until they have six children, whichever comes first. Describe a sample space and a reasonable random variable for this random experiment.

Homework Problem 8. A random experiment has three possible outcomes a, b, and c, with P[a] = p, P[b] = p², and P[c] = p. What choice(s) of p makes this a valid probability model?

Homework Problem 9. Modify the code in Example 1.11 to simulate the probability of getting exactly one head in four coin tosses.

Homework Problem 10. Write a function dice(k) for generating k throws of a fair die. Use your function and R’s sum function to generate the sum of two dice throws.

Part 3. Integer-Valued and Discrete Random Variables

Definition 1.12. A random variable X is called discrete if it can only take countably many values.

Definition 1.13. For a random variable X that takes values in a set S, the probability mass function of X is the probability function

fX(x) = P[X = x], for x ∈ S.

Probability mass functions are the central objects in discrete probability that allow one to compute probabilities. If we know the pmf of a random variable, in a sense we have “complete knowledge” (in a probabilistic sense) of the behavior of that random variable. The most common discrete distributions are the Bernoulli, the binomial, the discrete uniform, and the Poisson distributions.

R: Working with Probability Distributions

R has several commands for working with probability distributions like the binomial distribution. These commands are prefixed with d, p, and r, and they take a suffix that describes the distribution. The Bernoulli distribution is a particular case of the binomial with n = 1; the binomial suffix is “binom”, the discrete uniform can be obtained with sample, and the Poisson distribution has suffix “pois”.

For the binomial distribution, these commands are:

• dbinom(k,n,p): Gives the value P[X = k], where X ∼binom(n, p).

• pbinom(k,n,p): Gives the value P[X ≤ k], where X ∼ binom(n, p).

• rbinom(m,n,p): Simulates m Binomial(n, p) random variables.

For the Poisson distribution, these commands are, analogously, dpois(x,lambda), ppois(x,lambda) and rpois(n,lambda).

Example 1.14. Using R, compute exactly and approximate via simulation P[X > 5], where X ∼ Binom(100, 0.1).

Solution: For the exact probability we use that P[X > 5] = 1 − P[X ≤ 5]. Thus the R-command is

> 1-pbinom(5,100,0.1)

To simulate the probability P[X > 5] based on 10,000 repetitions, we do

> n <- 10000
> simlist <- rbinom(n,100,0.10)
> sum(simlist>5)/n

Homework Problem 11. Every person in a group of 1000 people has a 1% chance of being infected by a virus. The process of being infected is independent from person to person. Using random variables, write expressions for the following probabilities and solve them with R.

(a) The probability that exactly 10 people are infected.

(b) That probability that at least 16 people are infected.

(c) The probability that between 12 and 14 people are infected.

(d) The probability that someone is infected.

Homework Problem 12. In 1693, Samuel Pepys wrote a letter to Isaac Newton posing the following question. Which of the following three occurrences has the greatest chance of success?

1. Six fair dice are tossed and at least one 6 appears.

2. Twelve fair dice are tossed and at least two 6’s appear.

3. Eighteen fair dice are tossed and at least three 6’s appear.

Using R, answer Mr. Pepys’ question.

Homework Problem 13. Suppose X has a Poisson distribution and P(X = 2) = 2P(X = 1). Find P(X = 3).

Homework Problem 14. Poisson approximation of the binomial: Suppose X ∼ Binom(n, p). Write an R function compare(n,p,k) that computes (exactly) P[X = k] − P[Y = k], where Y ∼ Pois(np). The following is a known fact:

Let X ∼ Binom(n, p) and Y ∼ Pois(λ). If n → ∞ and p → 0 in such a way that np → λ > 0, then for all k, P[X = k] → P[Y = k]. Thus, the Poisson distribution with parameter λ = np serves as a good approximation for the binomial distribution when n is large and p is small.

Try your function on 6 numbers where you expect the Poisson probability to be a good approximation of the binomial. Also try it on 6 numbers where you expect the approximation to be poor.

Part 4. Continuous Random Variables

Shooting an arrow at a target and picking a real number between 0 and 1 are examples of random experiments where the sample space is a continuum of values. Such sets have no gaps between elements; they are not discrete. The elements are uncountable and cannot be listed. We call such sample spaces continuous. The most common continuous sample spaces in one dimension are intervals such as (a, b), (−∞, c], [a, ∞) and (−∞, ∞).

An example of a problem involving continuous random variables is the following: an archer shoots an arrow at a target. The target C is a circle of radius 1. The bullseye B is a smaller circle in the center of the target of radius 1/4. What is the probability P(B) of hitting the bullseye?

To work with random variables in the continuous world will require some new mathematical tools that will be introduced as follows:

Definition 1.15. A continuous random variable X is a random variable that takes values in a continuous set. If S is an uncountable subset of the real numbers, then {X ∈ S} is the event that X takes values in S. For instance, {X ∈ (a, b)} = {a < X < b}, and {X ∈ (−∞, c]} = {X ≤ c}.

In the discrete setting, to compute P[X ∈ S] we add up values of the probability mass function. That is, P[X ∈ S] = Σ_{x∈S} P[X = x]. If X, however, is a continuous random variable, how could we sum over all the points of an uncountable set?

In order to compute P[X ∈ S] we integrate the probability density function (pdf) over S. The probability density function plays the role of the pmf. It is the function used to compute probabilities. However, if they play the same role, why are they named differently? The answer to this question is very deep and requires “measure theory”, but roughly speaking, the difference is that the pmf at a point x is the probability that the r.v. assumes the value x, while the pdf at a point x is not! This is because the probability of a continuous random variable being equal to exactly one value is always 0! (Think about how you could choose one single point in an uncountable set; it seems impossible, since you cannot even assign an ordering to it!) A more mathematical (but still not 100% correct) definition of the pdf is the following:

Definition 1.16. Let X be a continuous random variable. A function f is a probability density function of X if

1. f(x) ≥ 0, for all x ∈ R.

2. ∫_{−∞}^{∞} f(x) dx = 1.

3. For S ⊆ R,

P[X ∈ S] = ∫_S f(x) dx.

Taking this definition, notice what happens if we allow the set S in Definition 1.16 to be a singleton {a} (or a point a). Then

P[X ∈ {a}] = P[X = a] = ∫_a^a f(x) dx = 0,

because the integral of any function over an interval of length 0 is 0. Thus, as explained before, for a continuous random variable, the probability of the random variable being exactly equal to a point is 0. However, to understand a little better what the pdf is telling us, we now take the set S in Definition 1.16 to be the interval (a − h, a + h) and what we get is that:

P[X ∈ (a − h, a + h)] = ∫_{a−h}^{a+h} f(x) dx.

Dividing both sides by 2h and taking the limit as h → 0, we get:

lim_{h→0} P[X ∈ (a − h, a + h)] / (2h) = lim_{h→0} (1/(2h)) ∫_{a−h}^{a+h} f(x) dx = f(a).

Why is the last equality true? (Hint: the Fundamental Theorem of Calculus! In general, the result is called the Lebesgue Differentiation Theorem.)

In other words, using the differential notation, for very small ∆, the pdf of the random variable X tells us that

f(a)∆ ≈ P[X ∈ (a − ∆/2, a + ∆/2)] or f(a)∆ ≈ P[X ∈ (a, a + ∆)].

One way to connect and unify the treatment of discrete and continuous random variables is through the cumulative distribution function (cdf), which is defined for all random variables.
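The approximation f(a)∆ ≈ P[X ∈ (a − ∆/2, a + ∆/2)] is easy to verify numerically in R, for example with the standard normal density (dnorm and pnorm are R’s built-in density and cdf; the values of a and ∆ are chosen arbitrarily):

```r
# Compare f(a)*Delta with P[X in (a - Delta/2, a + Delta/2)] for X ~ N(0,1)
a <- 0.5
delta <- 0.01
dnorm(a) * delta                          # density approximation
pnorm(a + delta/2) - pnorm(a - delta/2)   # exact probability
# The two numbers agree to several decimal places
```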

Definition 1.17. Let X be a random variable. The cumulative distribution function of X is the function

F (x) = P[X ≤ x], defined for all real numbers x.

The cdf plays an important role for continuous random variables, in part because of its relationship to the density function. For a continuous random variable X,

F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(t) dt,

and thus, by the fundamental theorem of calculus,

F′(x) = f(x).

The four defining properties of a cumulative distribution function are the following:

Proposition 1.18. A function F is a cumulative distribution function, that is, there exists a random variable X whose cdf is F , if it satisfies the following properties:

1. lim_{x→−∞} F(x) = 0.

2. lim_{x→∞} F(x) = 1.

3. If x ≤ y, then F(x) ≤ F(y).

4. F is right-continuous. That is, for all real numbers a,

lim_{x→a+} F(x) = F(a).

Homework Problem 15. A random variable X has density function f(x) = ce^x, for −2 < x < 2.

(a) Find c.

(b) Find P[X < −1].

Homework Problem 16. The cumulative distribution function for a random variable X is

F(x) = 0, if x ≤ 0; sin(x), if 0 < x ≤ π/2; 1, if x > π/2.

Find P[0.1 < X < 0.2].

Next, we will mention some important continuous distributions.

Definition 1.19. We say that X has a Normal distribution with parameters µ ∈ R and σ > 0 (denoted as X ∼ N(µ, σ²)) if the pdf of X is given by

f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}

for all real numbers x. The parameter µ is the mean and the parameter σ is the standard deviation (see below for definitions).

Definition 1.20. We say that X has a Uniform distribution on the interval (a, b) (denoted as X ∼ U(a, b)) if the pdf of X is given by

f(x) = (1/(b − a)) 1_{(a,b)}(x)

for all real numbers x ∈ (a, b).

Definition 1.21. We say that X has an Exponential distribution with rate parameter λ > 0 (denoted as X ∼ exp(λ)) if the pdf of X is given by f(x) = λe^{−λx}, for all real numbers x > 0. The parameter λ is the rate parameter. Note: It is also common to define the exponential random variable in terms of the mean parameter θ. In that case, the pdf is given by

f(x) = (1/θ) e^{−x/θ}.

Notice that the rate parameter is related to the mean parameter by λ = 1/θ. To find the density at a point x, the cdf at a point x, and to simulate n of these random variables in R, we use the commands:

                          N(µ, σ²)            U(a, b)         exp(λ)
Density at x              dnorm(x,mu,sigma)   dunif(x,a,b)    dexp(x,lambda)
CDF at x                  pnorm(x,mu,sigma)   punif(x,a,b)    pexp(x,lambda)
Simulate n realizations   rnorm(n,mu,sigma)   runif(n,a,b)    rexp(n,lambda)
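A short sketch of how these commands are used (parameter values chosen arbitrarily for illustration; note that R’s *norm functions take the standard deviation, not the variance):

```r
# P[X <= 1] for X ~ N(0, 1)
pnorm(1, 0, 1)
# Density of U(0, 2) at x = 0.5 (constant 1/(b - a) = 0.5 on the interval)
dunif(0.5, 0, 2)
# Simulate 5 realizations of an exponential with rate lambda = 3
rexp(5, 3)
```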

Part 5. Expectation, Moments and Simulation of Arbitrary Continuous RVs.

The expectation is a numerical measure that summarizes the typical, or average, behavior of a random variable. It is computed differently for discrete and continuous random variables. First, we will delve into the discrete case and then we will proceed to the continuous case.

Part 5.A: Expectation and Moments in the Discrete Case

Definition 1.22. If X is a discrete random variable that takes values in a set S, the expectation E[X] is defined as

E[X] = Σ_{x∈S} x pX(x),

where pX(x) = P[X = x] is the pmf of X. The sum in the definition is over all values of X. If the sum is a divergent infinite series, we say that the expectation of X does not exist. The expectation is a weighted average of the values of X, where the weights are the corresponding probabilities of those values. The expectation places more weight on values that have greater probability.

In the case when X is uniformly distributed on a finite set {x1, . . . , xn}, that is, all outcomes are equally likely,

E[X] = Σ_{i=1}^{n} x_i P[X = x_i] = Σ_{i=1}^{n} x_i/n = (x1 + x2 + ... + xn)/n.

Thus, with equally likely outcomes the expectation is just the regular average of the values. Other names for expectation are mean and expected value.

Example 1.23. In the (simplified) game of roulette, a ball rolls around a roulette wheel landing on one of 38 numbers. Eighteen numbers are red; 18 are black; and two ( the 0 and 00) are green. A bet of “red” costs $1 to play and pays off even money if the ball lands on red. Let X be a player’s winnings at roulette after one bet of red. What is the distribution of X? What is the expected value E[X]?

Solution: The player either wins or loses $1. So X = 1 or −1, with P(X = 1) = 18/38 and P(X = −1) = 20/38. Once the distribution of X is found, the expected value can be computed:

E[X] = (1)P(X = 1) + (−1)P(X = −1) = 18/38 − 20/38 = −2/38 ≈ −0.0526.

The expected value of the game is about −5 cents.

But what does E[X] = −0.0526 mean? We interpret E[X] as a long-run average. If you play roulette for a long time making many red bets, then the average of all your one dollar wins and losses will be about −5 cents. What that also means is that if you play, say, 10,000 times then your total loss will be about 5 × 10, 000 = 50, 000 cents, or about $500.

Example 1.24. Using the function replicate in R, create a code for simulating the roulette game and approximate its mean.

> simlist <- replicate(10000, if (sample(1:38,1) <= 18) 1 else -1)
> mean(simlist)

Example 1.25. Obtain the mean of a Poisson r.v. with parameter λ Solution: Computing,

E[X] = Σ_{k=0}^{∞} k P[X = k] = Σ_{k=0}^{∞} k e^{−λ} λ^k / k!
     = e^{−λ} Σ_{k=1}^{∞} λ^k / (k − 1)!
     = λ e^{−λ} Σ_{k=1}^{∞} λ^{k−1} / (k − 1)!
     = λ e^{−λ} Σ_{k=0}^{∞} λ^k / k!
     = λ e^{−λ} e^{λ} = λ.

Suppose X is a random variable and f is some function. Then Y = f(X) is a random variable that is a function of X. The values of this new random variable are found as follows. If X = x, then Y = f(x).
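Both the result of Example 1.25 and the idea of applying a function to a random variable can be illustrated by simulation in R (a sketch; λ = 3 is chosen arbitrarily):

```r
# Estimate E[X] for X ~ Pois(3); should be close to lambda = 3
x <- rpois(1000000, 3)
mean(x)
# Applying a function: estimate E[X^2]
# (for a Poisson r.v., E[X^2] = lambda + lambda^2 = 12)
mean(x^2)
```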

Functions of random variables, like X², e^X, and 1/X, show up all the time in probability. Often we apply some function to the outcomes of a random experiment, as we will see in many examples later. In statistics, it is common to transform data using an elementary function such as the log or exponential function. To compute the expectation of a function of a discrete random variable we use the following proposition.

Proposition 1.26. Let X be a random variable that takes values in a set S. Let f be a function. Then,

E[f(X)] = Σ_{x∈S} f(x) P[X = x].

Example 1.27. Using the fact that the sum of the first n squares is n(n + 1)(2n + 1)/6, compute the expected value of X², where X is a number picked uniformly from 1 to 100. Also, give a 1-line code to approximate the value using simulation.

Solution: Computing the expectation,

E[X²] = Σ_{x=1}^{100} x² P[X = x] = Σ_{x=1}^{100} x²/100 = (1/100) · (100 · 101 · 201)/6 = (101 · 201)/6 = 3383.5.

The 1-line code can be:

> mean(sample(1:100,1000000,replace=T)^2)

A first thought may be that since E[X] = 101/2 = 50.5, then E[X²] = (101/2)² = 2550.25. We see that this is not correct. It is not true that E[X²] = (E[X])².

Homework Problem 17. Suppose X has a Poisson distribution with parameter λ. Find E[1/(X + 1)].

One of the most important properties of expectation is that it is a linear operator. That is,

Proposition 1.28. For any random variables X and Y , it holds that

E[aX + bY ] = aE[X] + bE[Y ] for any constants a and b.

Given an event A, define a random variable 1{A} such that

1_{A} = 1, if A occurs; 0, if A does not occur.

Therefore, 1_{A} equals 1, with probability P[A], and 0, with probability P[A^c]. Such a random variable is called an indicator variable. An indicator is a Bernoulli random variable with p = P[A]. The expectation of an indicator variable is important enough to highlight.

Proposition 1.29. For an event A with indicator r.v. 1_{A},

E[1_{A}] = (1)P[A] + (0)P[A^c] = P[A].

This is fairly simple, but nevertheless extremely useful and interesting, because it means that probabilities of events can be thought of as expectations of indicator random variables.

Often random variables involving counts can be analyzed by expressing the count as a sum of indicator variables. We illustrate this powerful technique in the next example.

Example 1.30. Expectation of the binomial distribution. Using the fact that the sum of n Bernoulli (indicator) r.v.s with parameter p is a Binomial(n, p) r.v., find the expectation of a Binomial(n, p) r.v.

Solution: Let I1, ..., In be a sequence of i.i.d. Bernoulli (indicator) random variables with success probability p. Let X = I1 + ... + In. Then X has a binomial distribution with parameters n and p (sums of Bernoullis are binomial, think about it!). By linearity of expectation,

" n # n n X X X E[X] = E Ik = E[Ik] = (1 · p + 0 · (1 − p)) = pn k=1 k=1 k=1 This result should be intuitive. For instance, if you roll 600 dice, you would expect 100 ones. The number of ones has a binomial distribution with n = 600 and p = 1/6. We emphasize the simplicity and elegance of the last derivation, a result of thinking probabilistically about the problem. Contrast this with the algebraic approach, using the definition of expectation and combinatorial coefficients. This technique of using indicator functions and knowing that the expectation of an indicator function is the probability of the event, becomes really useful sometimes as in the next example:

Example 1.31. At a graduation ceremony, a class of n seniors, upon hearing that they have graduated, throw their caps up into the air in celebration. The caps fall back to the ground uniformly at random and each student picks up a cap. What is the expected number of students who get their original cap back (a “match”)?

Solution: Let X be the number of matches. Define

I_k = 1, if the k-th student got their cap back; 0, if the k-th student did not get their cap back,

for k = 1, . . . , n. Then X = I_1 + ... + I_n. The expected number of matches is

E[X] = E[Σ_{k=1}^{n} I_k] = Σ_{k=1}^{n} E[I_k] = Σ_{k=1}^{n} P[the k-th student got their cap back] = Σ_{k=1}^{n} 1/n = n(1/n) = 1.

The reason is that the probability that the k-th student gets their cap back is 1/n, since there are n caps to choose from and only one belongs to the k-th student.

Remarkably, the expected number of matches is one, independent of the number of people, n. If every- one in China throws their hat up in the air, on average about one person will get their hat back.

While a computer algorithm to approximate the expectation using simulation is feasible, sometimes it is easier to rethink the problem to find a better solution. Another way to see this problem is to think of a permutation of the numbers 1, 2, . . . , n and count the number of fixed points, i.e., the number of positions that keep their value after the permutation. In doing so you are counting exactly how many students got their own cap back.

Example 1.32. Write code to approximate the expectation by simulating the fixed-point count of a random permutation with 50 caps.

Solution:
> n <- 50
> mean(replicate(10000,sum(sample(n,n)==(1:n))))

Note: Let’s pause to reflect upon what we have actually done. We have simulated a random element from a sample space that contains 50! ≈ 3.04 × 10^64 elements, checked the number of fixed points, and then repeated the operation 10,000 times, finally computing the average number of fixed points, all of this in a second or two on your computer. It would not be physically possible to write down a table of all 50! permutations, their numbers of fixed points, or their corresponding probabilities. And yet by generating random elements and averaging over simulations the problem becomes computationally feasible. Of course, in this case we already know the exact answer and do not need simulation to find the expectation. However, that is not the case with many complex, real-life problems. In many cases, simulation is the only way to go.

Homework Problem 18 (St. Petersburg Paradox). Very interesting problem: You are offered the following game. Flip a coin until heads appears. If it takes n tosses, you will get paid $2^n. Thus if heads comes up on the first toss, you get paid $2. If it first comes up on the 10th toss, you get paid $1024. How much would you pay to play this game? This problem, studied by the eighteenth-century Swiss mathematician Daniel Bernoulli, is the St. Petersburg paradox.
The “paradox” is that most people would not pay very much to play this game, and yet (theoretically) you should pay whatever you are asked! Question: why, in real life, should you not pay “whatever” for this game?
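One way to feel the paradox is to simulate many plays and watch the sample mean refuse to settle down. The sketch below is illustrative only (it does not answer the homework question): since the theoretical expectation is infinite, the running average jumps wildly whenever a very long run of tails occurs.

```r
# Simulate payouts of the St. Petersburg game. The number of tosses until
# the first head is 1 + a Geometric(1/2) count of failures (R's rgeom
# counts failures before the first success), and the payout is 2^tosses.
set.seed(1)
tosses <- rgeom(100000, prob = 1/2) + 1
payouts <- 2^tosses
mean(payouts)   # unstable: rerun with other seeds and it changes wildly
```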

While the mean is a measure of how the random variable behaves on average, there are other important quantities, called the moments, that capture further aspects of the behaviour of the r.v.

Definition 1.33. The n-th moment of a random variable is defined as

μ_n := E[X^n].

Thus, for a discrete r.v. X with pmf f and support on the discrete set A, we compute the n-th moment as

μ_n = Σ_{x∈A} x^n f(x)
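For a concrete discrete case, the n-th moment is just a finite sum. A minimal R sketch, using a fair die (an illustrative choice) with pmf f(x) = 1/6 on {1, ..., 6}:

```r
# n-th moment of a fair die: mu_n = sum of x^n * f(x) over x in {1,...,6}.
moment <- function(n) sum((1:6)^n * (1/6))
moment(1)   # 3.5 (the mean)
moment(2)   # 91/6, approximately 15.17
```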

Part 5.B: Expectation and Moments in the Continuous Case

Definition 1.34. For X a continuous random variable with density f, the expectation of X is computed as

E[X] = ∫_{−∞}^{∞} x f(x) dx

Also, analogously to the discrete case, we have:

Definition 1.35. For X a continuous random variable with density f and g a given function, the expectation of g(X) is computed as

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Example 1.36. A balloon has radius uniformly distributed on (0, 2). Find the expectation of the volume of the balloon exactly and by simulation.

Solution: Let V be the volume of the balloon. Then V = 4πr^3/3, where r ∼ Unif(0, 2). Then

E[V] = E[(4π/3) r^3] = (4π/3) E[r^3] = (4π/3) ∫_0^2 r^3 (1/2) dr = 8π/3.

A code that computes the expectation is

> mean((4/3)*pi*runif(1000000,0,2)^3)

Definition 1.37. As recalled from above, the n-th moment of a random variable is defined as

μ_n := E[X^n]. Thus, for a continuous r.v. X with pdf f, we compute the n-th moment as

μ_n = ∫_{−∞}^{∞} x^n f(x) dx

Finally, a very important quantity is the variance, which can be regarded as the average (squared) error of the random variable around the mean μ. That is, if the variance of a random variable X is σ^2, then on average the squared error you make by guessing that the (random) value of X is μ is σ^2. The following definition provides a way to compute it:

Definition 1.38. Let X be a random variable. Then the variance of X, denoted by V(X), is given by

V(X) := E[(X − E[X])^2].

Sometimes the above formula is complicated to evaluate, and one can prove that also

V(X) = E[X^2] − E[X]^2 = μ_2 − μ_1^2, where μ_n is the n-th moment.
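The shortcut formula is easy to check numerically. A minimal sketch, using a simulated Unif(0, 2) sample as an illustrative choice (its true variance is 1/3):

```r
# The two sides of V(X) = E[X^2] - E[X]^2, computed on the same sample, agree.
set.seed(1)
x <- runif(100000, 0, 2)
lhs <- mean((x - mean(x))^2)    # definition: average squared deviation
rhs <- mean(x^2) - mean(x)^2    # shortcut: second moment minus squared mean
c(lhs, rhs)                     # both near the true variance 1/3
```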

The following proposition establishes some basic properties of the expectation and the variance.

Proposition 1.39. For constants a and b, and random variables X and Y ,

• E[aX + b] = aE[X] + b

• E[aX + bY] = aE[X] + bE[Y]

• V[a] = 0

• V[aX + b] = a^2 V[X]

Another important concept is that of the standard deviation.

Definition 1.40. The standard deviation of X is the square root of the variance. That is,

σ_X = √(V[X])

Variance and standard deviation are always nonnegative. The greater the variability of outcomes, the larger the deviations from the mean, and the greater the two measures. If X is a constant, and hence has no variability, then X = E[X] = µX and we see from the variance formula that V [X] = 0. The converse is also true. That is, if V [X] = 0, then X is a constant (a.s.).

Example 1.41. A balloon has radius uniformly distributed on (0,2). Find the expectation and standard deviation of the volume of the balloon theoretically and by simulation.

Solution: Let V be the volume. Then V = (4π/3)R^3, where R ∼ Unif(0, 2). The expected volume is

E[V] = E[(4π/3) R^3] = (4π/3) ∫_0^2 r^3 (1/2) dr = 8π/3 ≈ 8.378.

Also,

E[V^2] = E[(4π/3)^2 R^6] = (16π^2/9) ∫_0^2 r^6 (1/2) dr = 1024π^2/63,

giving

V[V] = E[V^2] − (E[V])^2 = 1024π^2/63 − (8π/3)^2 = 64π^2/7,

with standard deviation σ_V = 8π/√7 ≈ 9.5. A sample R code could be

> volume <-(4/3)*pi*runif(1000000,0,2)^3
> mean(volume)
[1] 8.386274
> sd(volume)
[1] 9.498317

The next homework problem is challenging but provides a great example showing that not all random variables have a finite expectation.

Homework Problem 19. Answer all the following:

1. Show that the function

f(x) = (1/π) · 1/(1 + x^2), x ∈ ℝ

is a valid density function by verifying points 1 and 2 of Definition 1.16.

2. Let X be a random variable with the pdf above (such an r.v. is said to have the Cauchy distribution). Show that E[X] = ∞.

3. Let U ∼ Unif(−π/2, π/2). Show (by making an “obvious” change of variable in the integral) that E[X] = E[tan(U)].

4. Write a function that approximates E[tan(U)] by simulation, run it 20 times, and report the results. What do you observe? Explain.

Finally, it is important for us to know how to simulate almost ANY continuous random variable provided that we have its density. For example, suppose we want to simulate observations from a random variable X, where X has density f(x) = 2x for 0 < x < 1, and 0 otherwise. How would you do it?

Proposition 1.42 (Inverse Transform Method). Suppose X is a continuous random variable with cumulative distribution function F, where F is invertible with inverse function F^{−1}. Let U ∼ Unif(0, 1). Then the distribution of F^{−1}(U) is equal to the distribution of X. In other words, to simulate X, first simulate U and output F^{−1}(U).

Example 1.43. Explain how to simulate the r.v. X above with pdf f(x) = 2x for 0 < x < 1, and 0 otherwise.

Solution: First, the CDF of X is

F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(u) du = ∫_0^x 2u du = x^2, for 0 < x < 1.

On the interval (0, 1) the function F(x) = x^2 is invertible and F^{−1}(x) = √x. The inverse transform method says that if U ∼ Unif(0, 1), then F^{−1}(U) = √U has the same distribution as X. Thus to simulate X, we can implement the code

> simlist <- sqrt(runif(10000))

To check the density, we can plot the histogram of the 10,000 simulations above (which should be a proxy for the density), along with the superimposed curve of the theoretical density f(x) = 2x.

> simlist <- sqrt(runif(10000))
> hist(simlist,prob=T,main="",xlab="")
> curve(2*x,0,1,add=T)

The proof of the inverse transform method is quick and easy. We need to show that F^{−1}(U) has the same distribution as X. For x in the range of X,

P(F^{−1}(U) ≤ x) = P(U ≤ F(x)) = F(x) = P(X ≤ x),

where we used the fact that the cdf of the uniform distribution on (0, 1) is the identity function, F(x) = x.

Homework Problem 20. By using the inverse transform method, simulate X ∼ Exp(λ).

Homework Problem 21. Show that

f(x) = e^x e^{−e^x}, for all x,

is a probability density function. If X has such a density function, find the cdf of X.

Homework Problem 22. Let X ∼ Unif(a, b). Find a general expression for the k-th moment E[X^k].

Homework Problem 23. An isosceles right triangle has side length uniformly distributed on (0, 1). Find the expectation and variance of the length of the hypotenuse.

Homework Problem 24. Let X and Y be independent exponential random variables with parameter λ = 1. Estimate P(X/Y < 1) by simulation.

Part 6. Joint Distributions and Independence of Random Variables

In the case of two random variables X and Y, a joint distribution specifies the values and probabilities for all pairs of outcomes. If X and Y are discrete r.v., the joint probability mass function of X and Y is the function of two variables

f_{X,Y}(x, y) = P(X = x, Y = y).

Proposition 1.44. The joint pmf is a probability function, and therefore it sums to 1. That is, if X takes values in a set S and Y takes values in a set T, then

Σ_{x∈S} Σ_{y∈T} f_{X,Y}(x, y) = Σ_{x∈S} Σ_{y∈T} P(X = x, Y = y) = 1

As in the discrete one-variable case, probabilities of events are obtained by summing over the individual outcomes contained in the event. For instance, for constants a < b and c < d,

P[a ≤ X ≤ b, c ≤ Y ≤ d] = Σ_{x=a}^{b} Σ_{y=c}^{d} f_{X,Y}(x, y) = Σ_{x=a}^{b} Σ_{y=c}^{d} P(X = x, Y = y)

A joint probability mass function can be defined for any finite collection of discrete random variables X_1, ..., X_n defined on a common sample space. The joint pmf is the function of n variables P(X_1 = x_1, ..., X_n = x_n).

From the joint distribution of X and Y, one can obtain the univariate, or marginal, distribution of each variable. The marginal distribution of X is obtained from the joint distribution of X and Y by summing over the values of y. Similarly, the probability mass function of Y is obtained by summing the joint pmf over the values of x.

Definition 1.45. The marginal distribution of X is the individual distribution of X in the random vector (X, Y). It is computed by

P[X = x] = Σ_{y∈T} P[X = x, Y = y]

and similarly for Y,

P[Y = y] = Σ_{x∈S} P[X = x, Y = y]
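In code, a joint pmf stored as a matrix makes the marginals one-liners: summing over y is a row sum and summing over x is a column sum. The joint pmf below is an illustrative made-up example, not one from the text:

```r
# Rows index x in {0, 1}; columns index y in {0, 1, 2}.
# Entries are P[X = x, Y = y]; they sum to 1.
joint <- matrix(c(0.10, 0.20, 0.10,
                  0.20, 0.10, 0.30), nrow = 2, byrow = TRUE)
rowSums(joint)   # marginal pmf of X: 0.4, 0.6
colSums(joint)   # marginal pmf of Y: 0.3, 0.3, 0.4
```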

Random variables can arise as functions of two or more random variables. Suppose f(x, y) is a real-valued function of two variables. Then Z = f(X,Y) is a random variable, which is a function of two random variables. For expectations of such random variables, there is a multivariate version of the equation in Proposition 1.26.

Proposition 1.46. If X and Y are discrete random variables and φ(x, y) is a real function of two variables, then Z = φ(X, Y) is a random variable. If f_{X,Y}(x, y) is the joint pmf of (X, Y), then

E[Z] = E[φ(X, Y)] = Σ_{x∈S} Σ_{y∈T} φ(x, y) f_{X,Y}(x, y) = Σ_{x∈S} Σ_{y∈T} φ(x, y) P[X = x, Y = y]

In the same way, everything above can be reformulated for the continuous case:

Proposition 1.47. For continuous random variables X and Y defined on a common sample space, the joint density function fX,Y (x, y) of the random vector (X,Y ) has the following properties.

• f(x, y) ≥ 0 for all real numbers x and y.

• ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

• For S ⊆ ℝ^2,

P[(X, Y) ∈ S] = ∫∫_S f_{X,Y}(x, y) dx dy

For continuous random variables X_1, ..., X_n defined on a common sample space, the joint density function f(x_1, ..., x_n) is defined similarly.

Definition 1.48. If X and Y have joint density function f, the joint cumulative distribution function of X and Y is

F(x, y) = P[X ≤ x, Y ≤ y] = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds,

defined for all x and y. Differentiating with respect to both x and y gives

∂^2/∂x∂y F(x, y) = f(x, y)

The joint density of X and Y captures all the “probabilistic information” about X and Y. In principle, it can be used to find any probability which involves these variables. From the joint density, the marginal densities are obtained by integrating out the extra variable (in the discrete case, we sum over the other variable).

Definition 1.49. The marginal density of a random variable X given the joint density of X and Y is given by

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy

Similarly, the marginal of the random variable Y is given by

f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
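Numerically, integrating out a variable can be done with R's `integrate`. A minimal sketch, assuming the illustrative joint density f(x, y) = x + y on the unit square (not one from the text), whose exact marginal is f_X(x) = x + 1/2:

```r
# Marginal density of X obtained by numerically integrating the joint
# density over y on (0, 1).
f <- function(x, y) x + y                                 # illustrative joint density
fX <- function(x) integrate(function(y) f(x, y), 0, 1)$value
fX(0.3)   # 0.8, matching the exact marginal 0.3 + 1/2
```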

Computing expectations of functions of two or more continuous random variables should offer no surprises, as we use the continuous form of the “law of the unconscious statistician.”

Proposition 1.50. If (X, Y) have joint density f and g(x, y) is a function of two variables, then

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_{X,Y}(x, y) dx dy.

In particular, the expected product is given by

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f_{X,Y}(x, y) dx dy.

The next concept is arguably one of the most important concepts in probability. The following will be given as a definition, although it is important to note that it is not the formal definition, but a consequence of it.

Definition 1.51. Two random variables X and Y are called independent if for any subsets A and B of the real line,

P[X ∈ A, Y ∈ B] = P[X ∈ A] P[Y ∈ B]

From this definition, we can conclude that X and Y are independent if the joint probability (mass) function of X and Y is the product of the marginals. That is,

f_{X,Y}(x, y) = P(X = x, Y = y) = P(X = x) P(Y = y) = f_X(x) f_Y(y), for all x, y.

If X and Y are independent random variables, then knowledge of the value of X gives no information about the value of Y. It follows that if f and g are functions, then f(X) gives no information about g(Y), and hence f(X) and g(Y) are independent random variables.

Proposition 1.52. Suppose X and Y are independent random variables, and f and g are any functions. Then, the random variables f(X) and g(Y ) are also independent. Moreover,

E[f(X)g(Y)] = E[f(X)] E[g(Y)]

and, by choosing f(x) = g(x) = x, we get:

E[XY] = E[X] E[Y]

Example 1.53. Suppose the radius R and height H of a cone are independent and each uniformly distributed on {1, ..., 10}. Find the expected volume of the cone and write code to simulate it.

Solution: The volume of a cone is given by the formula v(r, h) = πr^2 h/3. Let V be the volume of the cone. Then V = πR^2 H/3 and

E[V] = E[(π/3) R^2 H] = (π/3) E[R^2] E[H] = (π/3) (Σ_{r=1}^{10} r^2/10) (Σ_{h=1}^{10} h/10) = (π/3)(77/2)(11/2) = 847π/12 ≈ 221.744

An R code that illustrates the example could be:

> simlist <- replicate(100000,(pi/3)*sample(1:10,1)^2*sample(1:10,1)) > mean(simlist)

Example 1.54. Suppose the joint density of X and Y is

f_{X,Y}(x, y) = cxy, for 1 < x < 4 and 0 < y < 1. Find c and compute P[2 < X < 3, Y > 1/4].

Solution: To find c, we need to solve the equation

∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1,

which becomes

∫_0^1 ∫_1^4 cxy dx dy = 1.

Solving, we have that c = 4/15. Then, to find the probability,

P[2 < X < 3, Y > 1/4] = ∫_{1/4}^{1} ∫_2^3 (4/15) xy dx dy = (4/15)(5/2)(15/32) = 5/16.

Homework Problem 25. The joint pmf of X and Y is

f_{X,Y}(x, y) = (x + 1)/12

for x = 0, 1 and y = 0, 1, 2, 3. Find the marginal distributions of X and Y. Describe their distributions qualitatively. That is, identify their distributions as one of the known distributions you have worked with (e.g., Bernoulli, binomial, Poisson, or uniform).

Homework Problem 26. Suppose X and Y are independent random variables. Does E[X/Y] = E[X]/E[Y]? Either prove it true or exhibit a counterexample.

Homework Problem 27. HARD PROBLEM: An elevator containing p passengers is at the ground floor of a building with n floors. On its way to the top of the building, the elevator will stop if a passenger needs to get off. Passengers get off at a particular floor with probability 1/n. Find the expected number of stops the elevator makes. (Hint: Use indicators, letting 1_k = 1 if the k-th floor is a stop. Be careful: more than one passenger can get off at a floor.)

Homework Problem 28. Suppose X and Y are i.i.d. exponential random variables with λ = 1. Find the density of X/Y and use it to compute P(X/Y < 1). Also, simulate that probability.

Having looked at measures of variability for individual and independent random variables, we now consider measures of variability between dependent random variables. The covariance is a measure of the association between two random variables. For jointly distributed continuous random variables, the covariance and correlation are defined as follows:

Definition 1.55. Let X and Y be jointly distributed continuous random variables with joint density function f. Let μ_X = E[X] and μ_Y = E[Y]. The covariance of X and Y is

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

Moreover, the last quantity can also be computed as

Cov[X, Y] = E[XY] − E[X]E[Y]

On the other hand, the (Pearson) correlation is given by

ρ_{X,Y} = Cov[X, Y] / √(Var[X] Var[Y])
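R's built-in `cov` and `cor` compute the sample versions of these quantities. An illustrative sketch (not from the text) with Y = X plus independent noise, so that the true covariance is Var[X] = 1 and the true correlation is 1/√2:

```r
# Sample covariance and correlation of positively associated variables.
set.seed(1)
x <- rnorm(100000)
y <- x + rnorm(100000)   # true Cov(X, Y) = 1, true correlation = 1/sqrt(2)
cov(x, y)
cor(x, y)
```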

For independent random variables, E[XY] = E[X]E[Y] and thus Cov[X, Y] = 0. The covariance will be positive when large values of X are associated with large values of Y and small values of X are associated with small values of Y. In particular, for outcomes x and y, the factors (x − μ_X) and (y − μ_Y) in the covariance formula will tend to be either both positive or both negative, in both cases giving a positive product.

On the other hand, if X and Y are inversely related, most terms (x − μ_X)(y − μ_Y) will be negative, since when X takes values above its mean, Y will tend to fall below its mean, and vice versa. In this case, the covariance between X and Y will be negative. Covariance is a measure of linear association between two variables. In a sense, the “less linear” the relationship, the closer the covariance is to 0. It is important to notice that the sign of the covariance indicates whether two random variables are positively or negatively associated. But the magnitude of the covariance can be difficult to interpret. The correlation is an alternative, standardized measure for this purpose.

Proposition 1.56 (Properties of correlation).

1. −1 ≤ ρ_{X,Y} ≤ 1.

2. If Y = aX + b is a linear function of X for constants a and b, then Corr(X, Y) = ±1, depending on the sign of a.

Correlation is a common summary measure in statistics. Dividing the covariance by the standard deviations creates a “standardized” covariance, which is a unitless measure that takes values between −1 and 1. The correlation is exactly equal to ±1 if Y is a linear function of X. Random variables that have correlation, and covariance, equal to 0 are called uncorrelated.

Definition 1.57. We say random variables X and Y are uncorrelated if

E[XY ] = E[X]E[Y ], that is, if Cov(X,Y ) = 0.
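Uncorrelated does not imply independent. A standard illustrative example (an assumption added here, not from the text): X ∼ Unif(−1, 1) and Y = X^2, so Cov(X, Y) = E[X^3] − E[X]E[X^2] = 0 by symmetry, even though Y is completely determined by X. A quick simulation check:

```r
# Y is a deterministic function of X, yet the sample correlation is near 0.
set.seed(1)
x <- runif(100000, -1, 1)
y <- x^2
cor(x, y)   # close to 0 even though Y depends entirely on X
```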
