
Summary of Basic Probability Theory
Math 218, Mathematical Statistics
D Joyce, Spring 2016


Sample space. A sample space consists of an underlying set Ω of outcomes, together with a probability function P defined on events, that is, on subsets of Ω. Among the basic properties of a probability function are the following.

8. For any two events E and F, P(E) = P(E ∩ F) + P(E ∩ F^c).

9. If event E is a subset of event F, then P(E) ≤ P(F).

10. Statement 7 above, P(E ∪ F) = P(E) + P(F) − P(E ∩ F), is called the principle of inclusion and exclusion. It generalizes to more than two events:

  P(E_1 ∪ E_2 ∪ ··· ∪ E_n) = ∑_i P(E_i) − ∑_{i<j} P(E_i ∩ E_j) + ∑_{i<j<k} P(E_i ∩ E_j ∩ E_k) − ···
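As a quick check of inclusion and exclusion, here is a minimal Python sketch; the two-dice sample space and the three particular events are my own choices, not from the text. It compares the probability of a union, computed directly, with the alternating sum in the formula.

    from fractions import Fraction
    from itertools import combinations

    # Sample space for two fair dice; each of the 36 outcomes has probability 1/36.
    omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

    def P(event):
        return Fraction(len(event), len(omega))

    # Three events chosen only for illustration.
    E1 = {w for w in omega if w[0] + w[1] == 7}   # the sum is 7
    E2 = {w for w in omega if w[0] == w[1]}       # doubles
    E3 = {w for w in omega if w[0] == 1}          # the first die shows 1
    events = [E1, E2, E3]

    # Left side: the probability of the union, computed directly.
    lhs = P(set().union(*events))

    # Right side: the alternating sum over intersections of k events at a time.
    rhs = Fraction(0)
    for k in range(1, len(events) + 1):
        for combo in combinations(events, k):
            rhs += (-1) ** (k + 1) * P(set.intersection(*combo))

    print(lhs, rhs, lhs == rhs)   # 4/9 4/9 True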

For instance, if you have a fair die, the sample space is Ω = {1, 2, 3, 4, 5, 6} where the probability of any singleton is 1/6. If you have two fair dice, you can use two random variables, X and Y, to refer to the two dice, but each has the same sample space. (Soon, we'll look at the joint distribution of (X, Y), which has a sample space defined on Ω × Ω.)

Random variables and cumulative distribution functions. A sample space can have any set as its underlying set, but usually they're related to numbers. Often the sample space is the set of real numbers R, and sometimes a power of the real numbers R^n.

The most common sample space only has two elements, that is, there are only two outcomes. For instance, flipping a coin has two outcomes, Heads and Tails; many experiments have two outcomes, Success and Failure; and polls often have two outcomes, For and Against. Even though these events aren't numbers, it's useful to replace them by numbers, namely 0 and 1, so that Heads, Success, and For are identified with 1, and Tails, Failure, and Against are identified with 0. Then the sample space can have R as its underlying set.

When the sample space does have R as its underlying set, the random variable X is called a real random variable. With it, the probability of an interval like [a, b], which is P([a, b]), can then be described as P(a ≤ X ≤ b). Unions of intervals can also be described; for instance, P((−∞, 3) ∪ [4, 5]) can be written as P(X < 3 or 4 ≤ X ≤ 5).

When the sample space is R, the probability function P is determined by a cumulative distribution function (c.d.f.) F as follows. The function F: R → R is defined by

  F(x) = P(X ≤ x) = P((−∞, x]).

Then, from F, the probability of a half-open interval can be found as

  P((a, b]) = F(b) − F(a).

Also, the probability of a singleton {b} can be found as a limit from below,

  P({b}) = lim_{a→b⁻} (F(b) − F(a)).

From these, probabilities of unions of intervals can be computed. Sometimes the c.d.f. is simply called the distribution, and the sample space is identified with this distribution.

Discrete distributions. Many distributions are determined entirely by the probabilities of their outcomes; that is, the probability of an event E is

  P(E) = ∑_{x∈E} P(X=x) = ∑_{x∈E} P({x}).

The sum here, of course, is either a finite or countably infinite sum. Such a distribution is called a discrete distribution, and when there are only finitely many outcomes x with nonzero probabilities, it is called a finite distribution.

A discrete distribution is usually described in terms of a probability mass function (p.m.f.) f defined by

  f(x) = P(X=x) = P({x}).

This p.m.f. is enough to determine the distribution since, by the definition of a discrete distribution, the probability of an event E is

  P(E) = ∑_{x∈E} f(x).

In many applications, a finite distribution is uniform, that is, the probabilities of its outcomes are all the same, 1/n, where n is the number of outcomes with nonzero probabilities. When that is the case, the field of combinatorics is useful in finding probabilities of events. Combinatorics includes various principles of counting such as the multiplication principle, permutations, and combinations.
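To make the relation between a p.m.f. and a c.d.f. concrete, here is a minimal Python sketch for the fair-die distribution above; the helper names pmf, cdf, and interval_prob are mine, not from the text.

    from fractions import Fraction

    # p.m.f. of a fair die: each outcome 1, ..., 6 has probability 1/6.
    pmf = {x: Fraction(1, 6) for x in range(1, 7)}

    def cdf(x):
        """F(x) = P(X <= x), the sum of the p.m.f. over outcomes <= x."""
        return sum((p for v, p in pmf.items() if v <= x), Fraction(0))

    def interval_prob(a, b):
        """P(a < X <= b), summed directly from the p.m.f."""
        return sum((p for v, p in pmf.items() if a < v <= b), Fraction(0))

    # The half-open interval (2, 5] picks up the outcomes 3, 4, 5.
    print(interval_prob(2, 5))   # 1/2
    print(cdf(5) - cdf(2))       # 1/2, agreeing with F(b) - F(a)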

Continuous distributions. When the cumulative distribution function F for a distribution is a differentiable function, we say it's a continuous distribution. Such a distribution is determined by a probability density function f. The relation between F and f is that f is the derivative F′ of F, and F is the integral of f:

  F(x) = ∫_{−∞}^{x} f(t) dt.

Conditional probability and independence. If E and F are two events, with P(F) ≠ 0, then the conditional probability of E given F is defined to be

  P(E|F) = P(E ∩ F) / P(F).

Two events, E and F, neither with probability 0, are said to be independent, or mutually independent, if any of the following three logically equivalent conditions holds:

  P(E ∩ F) = P(E) P(F)
  P(E|F) = P(E)
  P(F|E) = P(F)

Bayes' formula. This formula is useful to invert conditional probabilities. It says

  P(F|E) = P(E|F) P(F) / P(E)
         = P(E|F) P(F) / (P(E|F) P(F) + P(E|F^c) P(F^c)),

where the second form is often more useful in practice.
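The sketch below plugs hypothetical numbers, chosen only for illustration, into the second form of Bayes' formula and checks the result against the definition of conditional probability.

    from fractions import Fraction

    # Hypothetical inputs for illustration only.
    P_F = Fraction(1, 4)            # P(F)
    P_E_given_F = Fraction(2, 3)    # P(E | F)
    P_E_given_Fc = Fraction(1, 5)   # P(E | F^c)

    # Second form of Bayes' formula: expand P(E) over F and its complement.
    P_Fc = 1 - P_F
    P_E = P_E_given_F * P_F + P_E_given_Fc * P_Fc
    P_F_given_E = P_E_given_F * P_F / P_E
    print(P_F_given_E)   # 10/19

    # Check against the definition P(F | E) = P(F and E) / P(E).
    P_F_and_E = P_E_given_F * P_F
    print(P_F_and_E / P_E == P_F_given_E)   # True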

Expectation. The expected value E(X), also called the expectation or mean µ_X, of a random variable X is defined differently for the discrete and continuous cases.

For a discrete random variable, it is a weighted average defined in terms of the probability mass function f as

  E(X) = µ_X = ∑_x x f(x).

For a continuous random variable, it is defined in terms of the probability density function f as

  E(X) = µ_X = ∫_{−∞}^{∞} x f(x) dx.

There is a physical interpretation where this is interpreted as a center of gravity.

Expectation is a linear operator. That means that the expectation of a sum or difference is the sum or difference of the expectations,

  E(X + Y) = E(X) + E(Y),

and that's true whether or not X and Y are independent, and also

  E(cX) = c E(X)

where c is any constant. From these two properties it follows that

  E(X − Y) = E(X) − E(Y),

and, more generally, expectation preserves linear combinations:

  E(∑_{i=1}^{n} c_i X_i) = ∑_{i=1}^{n} c_i E(X_i).

Furthermore, when X and Y are independent, then E(XY) = E(X) E(Y), but that equation doesn't usually hold when X and Y are not independent.
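Here is a minimal sketch of these properties of expectation, using the two-dice sample space from earlier (the helper E and the particular linear combinations are my own choices): linearity is checked by direct enumeration, and the product rule holds here because the two dice are independent.

    from fractions import Fraction

    # Two fair dice, enumerated as 36 equally likely ordered pairs (x, y).
    omega = [(x, y) for x in range(1, 7) for y in range(1, 7)]
    p = Fraction(1, len(omega))

    def E(g):
        """Expected value of g(x, y) under this uniform distribution."""
        return sum((g(x, y) * p for x, y in omega), Fraction(0))

    EX = E(lambda x, y: x)   # 7/2
    EY = E(lambda x, y: y)   # 7/2
    print(E(lambda x, y: x + y) == EX + EY)                   # True
    print(E(lambda x, y: 3 * x - 2 * y) == 3 * EX - 2 * EY)   # True
    print(E(lambda x, y: x * y) == EX * EY)                   # True, since the dice are independent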

3 That’s the main reason why we take the square The moment generating function. There is a root of variance to normalize it—the standard de- clever way of organizing all the moments into one viation of cX is c times the standard deviation of mathematical object, and that object is called the X. Also, variance is translation invariant, that is, moment generating function. It’s a function m(t) if you add a constant to a random variable, the of a new variable t defined by variance doesn’t change: m(t) = E(etX ). Var(X + c) = Var(X). Since the exponential function et has the power se- In general, the variance of the sum of two random ries variables is not the sum of the of the two ∞ X tk t2 tk random variables. But it is when the two random et = = 1 + t + + ··· + + ··· , k! 2! k! variables are independent. k=0 we can rewrite m(t) as follows Moments, central moments, , and th tX µ2 2 µk k . The k moment of a random variable m(t) = E(e )1 + µ1t + t + ··· + t + ··· . k 2! k! X is defined as µk = E(X ). Thus, the mean is the first moment, µ = µ1, and the variance can That implies that m(k)(0), the kth derivative of m(t) 2 th be found from the first and second moments, σ = evaluated at t = 0, equals the k moment µk of 2 µ2 − µ1. X. In other words, the moment generating function The kth is defined as E((X −µ)k. generates the moments of X by differentiation. Thus, the variance is the second central moment. For discrete distributions, we can also compute A third central moment of the standardized ran- X − µ the moment generating function directly in terms dom variable X∗ = , of the probability mass function f(x) = P (X=x) σ as 3 tX X tx ∗ 3 E((X − µ) ) m(t) = E(e ) = e f(x). β3 = E((X ) ) = 3 σ x is called the skewness of X. A distribution that’s For continuous distributions, the moment generat- symmetric about its mean has 0 skewness. (In fact ing function can be expressed in terms of the prob- all the odd central moments are 0 for a symmetric ability density function fX as distribution.) But if it has a long tail to the right Z ∞ tX tx and a short one to the left, then it has a positive m(t) = E(e ) = e fX (x) dx. skewness, and a negative skewness in the opposite −∞ situation. The moment generating function enjoys the fol- A fourth central moment of X∗, lowing properties. 4 Translation. If Y = X + a, then ∗ 4 E((X − µ) ) β4 = E((X ) ) = σ4 ta mY (t) = e mX (t). is called kurtosis. A fairly flat distribution with long tails has a high kurtosis, while a short tailed Scaling. If Y = bx, then distribution has a low kurtosis. A bimodal distribu- mY (t) = mX (bt). tion has a very high kurtosis. A has a kurtosis of 3. (The word kurtosis was made Standardizing. From the last two properties, if up in the early 19th century from the Greek word X − µ X∗ = for curvature.) σ

The moment generating function. There is a clever way of organizing all the moments into one mathematical object, and that object is called the moment generating function. It's a function m(t) of a new variable t defined by

  m(t) = E(e^{tX}).

Since the exponential function e^t has the power series

  e^t = ∑_{k=0}^{∞} t^k/k! = 1 + t + t²/2! + ··· + t^k/k! + ···,

we can rewrite m(t) as follows:

  m(t) = E(e^{tX}) = 1 + µ_1 t + (µ_2/2!) t² + ··· + (µ_k/k!) t^k + ···.

That implies that m^(k)(0), the kth derivative of m(t) evaluated at t = 0, equals the kth moment µ_k of X. In other words, the moment generating function generates the moments of X by differentiation.

For discrete distributions, we can also compute the moment generating function directly in terms of the probability mass function f(x) = P(X=x) as

  m(t) = E(e^{tX}) = ∑_x e^{tx} f(x).

For continuous distributions, the moment generating function can be expressed in terms of the probability density function f_X as

  m(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx.

The moment generating function enjoys the following properties.

Translation. If Y = X + a, then

  m_Y(t) = e^{ta} m_X(t).

Scaling. If Y = bX, then

  m_Y(t) = m_X(bt).

Standardizing. From the last two properties, if X* = (X − µ)/σ is the standardized random variable for X, then

  m_{X*}(t) = e^{−µt/σ} m_X(t/σ).
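The sketch below is one way to see the moment-generating property numerically (the fair-die p.m.f. and the finite-difference step h are my choices, and the derivatives are only approximated, not computed symbolically).

    import math

    # p.m.f. of a fair die.
    pmf = {x: 1 / 6 for x in range(1, 7)}

    def m(t):
        """m(t) = E(e^{tX}) = sum over x of e^{tx} f(x)."""
        return sum(math.exp(t * x) * p for x, p in pmf.items())

    h = 1e-5   # step for the finite-difference approximations
    m1 = (m(h) - m(-h)) / (2 * h)             # approximates m'(0)
    m2 = (m(h) - 2 * m(0) + m(-h)) / h ** 2   # approximates m''(0)

    print(m1)            # about 3.5,     the first moment (the mean)
    print(m2)            # about 15.1667, the second moment E(X^2)
    print(m2 - m1 ** 2)  # about 2.9167 = 35/12, the variance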

Convolution. If X and Y are independent random variables, and Z = X + Y, then

  m_Z(t) = m_X(t) m_Y(t).

The primary use of moment generating functions is to develop the theory of probability. For instance, the easiest way to prove the central limit theorem is to use moment generating functions.

The median, quartiles, quantiles, and percentiles. The median of a distribution X, sometimes denoted µ̃, is the value such that P(X ≤ µ̃) = 1/2. Whereas some distributions, like the Cauchy distribution, don't have means, all continuous distributions have medians.

If p is a number between 0 and 1, then the pth quantile is defined to be the number θ_p such that

  P(X ≤ θ_p) = F(θ_p) = p.

Quantiles are often expressed as percentiles, where the pth quantile is also called the 100pth percentile. Thus, the median is the 0.5 quantile, also called the 50th percentile. The first quartile is another name for θ_{0.25}, the 25th percentile, while the third quartile is another name for θ_{0.75}, the 75th percentile.
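As an illustration of quantiles, the following sketch uses the exponential distribution with rate 1, a choice of mine made only because its c.d.f. is easy to invert; it computes θ_p by inverting F and checks that F(θ_p) = p.

    import math

    # Illustrative choice: the exponential distribution with rate 1,
    # whose c.d.f. is F(x) = 1 - e^{-x} for x >= 0.
    def F(x):
        return 1 - math.exp(-x)

    def quantile(p):
        """theta_p, the number with F(theta_p) = p, found by inverting F."""
        return -math.log(1 - p)

    print(quantile(0.5), math.log(2))   # the median, about 0.6931, equals ln 2
    print(F(quantile(0.25)))            # 0.25 (up to rounding): the first quartile
    print(F(quantile(0.75)))            # 0.75 (up to rounding): the third quartile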

Joint distributions. When studying two related real random variables X and Y, it is not enough just to know the distributions of each. Rather, the pair (X, Y) has a joint distribution. You can think of (X, Y) as naming a single random variable that takes values in the plane R².

Joint and marginal probability mass functions. Let's consider the discrete case first, where both X and Y are discrete random variables. The probability mass function for X is f_X(x) = P(X=x), and the p.m.f. for Y is f_Y(y) = P(Y=y). The joint random variable (X, Y) has its own p.m.f., denoted f_{(X,Y)}(x, y), or more briefly f(x, y):

  f(x, y) = P((X, Y)=(x, y)) = P(X=x and Y=y),

and it determines the two individual p.m.f.s by

  f_X(x) = ∑_y f(x, y),   f_Y(y) = ∑_x f(x, y).

The individual p.m.f.s are usually called marginal probability mass functions.

For example, assume that the random variables X and Y have the joint probability mass function given in this table.

                     Y
             −1      0       1       2
    X   −1   0       1/36    1/6     1/12
         0   1/18    0       1/18    0
         1   0       1/36    1/6     1/12
         2   1/12    0       1/12    1/6

By adding the entries row by row, we find the marginal function for X, and by adding the entries column by column, we find the marginal function for Y. We can write these marginal functions on the margins of the table.

                     Y                                f_X
             −1      0       1       2
    X   −1   0       1/36    1/6     1/12             5/18
         0   1/18    0       1/18    0                1/9
         1   0       1/36    1/6     1/12             5/18
         2   1/12    0       1/12    1/6              1/3
    f_Y      5/36    1/18    17/36   1/3

Discrete random variables X and Y are independent if and only if the joint p.m.f. is the product of the marginal p.m.f.s:

  f(x, y) = f_X(x) f_Y(y).

In the example above, X and Y aren't independent.
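The following sketch reproduces this computation; the dictionary encoding of the table is mine. It recovers the marginal p.m.f.s by summing rows and columns and confirms that X and Y are not independent.

    from fractions import Fraction as Fr

    # The joint p.m.f. from the table above, keyed by (x, y).
    joint = {
        (-1, -1): Fr(0),     (-1, 0): Fr(1, 36), (-1, 1): Fr(1, 6),  (-1, 2): Fr(1, 12),
        ( 0, -1): Fr(1, 18), ( 0, 0): Fr(0),     ( 0, 1): Fr(1, 18), ( 0, 2): Fr(0),
        ( 1, -1): Fr(0),     ( 1, 0): Fr(1, 36), ( 1, 1): Fr(1, 6),  ( 1, 2): Fr(1, 12),
        ( 2, -1): Fr(1, 12), ( 2, 0): Fr(0),     ( 2, 1): Fr(1, 12), ( 2, 2): Fr(1, 6),
    }
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})

    # Marginals: sum the joint p.m.f. over the other variable.
    fX = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    fY = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    print([str(fX[x]) for x in xs])   # ['5/18', '1/9', '5/18', '1/3']
    print([str(fY[y]) for y in ys])   # ['5/36', '1/18', '17/36', '1/3']

    # Independence would require f(x, y) = fX(x) * fY(y) in every cell.
    print(all(joint[(x, y)] == fX[x] * fY[y] for x in xs for y in ys))   # False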

Joint and marginal cumulative distribution functions. Besides the p.m.f.s, there are joint and marginal cumulative distribution functions. The c.d.f. for X is F_X(x) = P(X≤x), while the c.d.f. for Y is F_Y(y) = P(Y≤y). The joint random variable (X, Y) has its own c.d.f., denoted F_{(X,Y)}(x, y), or more briefly F(x, y):

  F(x, y) = P(X≤x and Y≤y),

and it determines the two marginal c.d.f.s by

  F_X(x) = lim_{y→∞} F(x, y),   F_Y(y) = lim_{x→∞} F(x, y).
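Here is a minimal sketch of the relation between a joint c.d.f. and its marginal c.d.f.s, using a small made-up joint p.m.f. on {0, 1} × {0, 1}; the numbers are hypothetical, chosen only for illustration.

    from fractions import Fraction as Fr

    # A small hypothetical joint p.m.f. on {0, 1} x {0, 1}.
    joint = {(0, 0): Fr(1, 8), (0, 1): Fr(3, 8), (1, 0): Fr(1, 4), (1, 1): Fr(1, 4)}

    def F(x, y):
        """Joint c.d.f.: F(x, y) = P(X <= x and Y <= y)."""
        return sum((p for (a, b), p in joint.items() if a <= x and b <= y), Fr(0))

    def FX(x):
        """Marginal c.d.f. of X, obtained here by taking y past the largest value."""
        return F(x, max(b for _, b in joint))

    print(F(0, 0))   # 1/8
    print(FX(0))     # 1/2, which is P(X <= 0)
    print(FX(1))     # 1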

Joint and marginal probability density functions. Now let's consider the continuous case, where X and Y are both continuous. The last paragraph on c.d.f.s still applies, but we'll have marginal probability density functions f_X(x) and f_Y(y), and a joint probability density function f(x, y), instead of probability mass functions. Of course, the derivatives of the marginal c.d.f.s are the density functions,

  f_X(x) = d/dx F_X(x),   f_Y(y) = d/dy F_Y(y),

and the c.d.f.s can be found by integrating the density functions,

  F_X(x) = ∫_{−∞}^{x} f_X(t) dt,   F_Y(y) = ∫_{−∞}^{y} f_Y(t) dt.

The joint probability density function f(x, y) is found by taking the derivative of F twice, once with respect to each variable, so that

  f(x, y) = ∂/∂x ∂/∂y F(x, y).

(The notation ∂ is substituted for d to indicate that there are other variables in the expression that are held constant while the derivative is taken with respect to the given variable.) The joint cumulative distribution function can be recovered from the joint density function by integrating twice:

  F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds.

Furthermore, the marginal density functions can be found by integrating the joint density function:

  f_X(x) = ∫_{−∞}^{∞} f(x, y) dy,   f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx.

Continuous random variables X and Y are independent if and only if the joint density function is the product of the marginal density functions:

  f(x, y) = f_X(x) f_Y(y).
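A rough numerical sketch of these relations, assuming a hypothetical joint density f(x, y) = x + y on the unit square and a simple midpoint-rule integrator of my own, not a library routine:

    # Hypothetical joint density on the unit square: f(x, y) = x + y,
    # which integrates to 1 over [0, 1] x [0, 1].
    def f(x, y):
        return x + y

    def integrate(g, a=0.0, b=1.0, n=500):
        """Midpoint-rule approximation of the integral of g over [a, b]."""
        h = (b - a) / n
        return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

    def fX(x):
        """Marginal density of X: integrate the joint density over y."""
        return integrate(lambda y: f(x, y))

    fY = fX   # this particular f is symmetric in x and y, so the marginals agree

    print(integrate(lambda x: integrate(lambda y: f(x, y))))   # about 1.0, so f is a density
    print(fX(0.3))                           # about 0.8, that is, x + 1/2 at x = 0.3
    print(f(0.3, 0.6), fX(0.3) * fY(0.6))    # about 0.9 versus 0.88: not independent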

Covariance and correlation. The covariance of two random variables X and Y is defined as

  Cov(X, Y) = σ_{XY} = E((X − µ_X)(Y − µ_Y)).

It can be shown that

  Cov(X, Y) = E(XY) − µ_X µ_Y.

When X and Y are independent, then σ_{XY} = 0, but in any case

  Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y).

Covariance is a bilinear operator, which means it is linear in each coordinate:

  Cov(X_1 + X_2, Y) = Cov(X_1, Y) + Cov(X_2, Y)
  Cov(aX, Y) = a Cov(X, Y)
  Cov(X, Y_1 + Y_2) = Cov(X, Y_1) + Cov(X, Y_2)
  Cov(X, bY) = b Cov(X, Y)

The correlation, or correlation coefficient, of X and Y is defined as

  ρ_{XY} = σ_{XY} / (σ_X σ_Y).

Correlation is always a number between −1 and 1.
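The sketch below computes a covariance and a correlation by enumeration; the dependent pair, the first die X and the total S of two dice, is my own choice of example.

    import math
    from fractions import Fraction

    # Two fair dice; X is the first die and S = X + second die, a dependent pair.
    omega = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
    p = Fraction(1, len(omega))

    def E(g):
        """Expected value of g(first die, second die)."""
        return sum((g(d1, d2) * p for d1, d2 in omega), Fraction(0))

    def X(d1, d2):
        return d1

    def S(d1, d2):
        return d1 + d2

    def var(g):
        return E(lambda a, b: g(a, b) ** 2) - E(g) ** 2

    cov = E(lambda a, b: X(a, b) * S(a, b)) - E(X) * E(S)   # Cov(X, S) = E(XS) - mu_X mu_S
    print(cov)                                              # 35/12, positive: S grows with X

    # Var(X + S) = Var(X) + 2 Cov(X, S) + Var(S)
    print(var(lambda a, b: X(a, b) + S(a, b)) == var(X) + 2 * cov + var(S))   # True

    rho = float(cov) / math.sqrt(float(var(X)) * float(var(S)))
    print(rho)   # about 0.707, safely between -1 and 1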

Math 218 Home Page at http://math.clarku.edu/~djoyce/ma218/