A Probabilities, Random Variables and Distributions
Contents

A.1 Events and Probabilities
    A.1.1 Conditional Probabilities and Independence
    A.1.2 Bayes' Theorem
A.2 Random Variables
    A.2.1 Discrete Random Variables
    A.2.2 Continuous Random Variables
    A.2.3 The Change-of-Variables Formula
    A.2.4 Multivariate Normal Distributions
A.3 Expectation, Variance and Covariance
    A.3.1 Expectation
    A.3.2 Variance
    A.3.3 Moments
    A.3.4 Conditional Expectation and Variance
    A.3.5 Covariance
    A.3.6 Correlation
    A.3.7 Jensen's Inequality
    A.3.8 Kullback–Leibler Discrepancy and Information Inequality
A.4 Convergence of Random Variables
    A.4.1 Modes of Convergence
    A.4.2 Continuous Mapping and Slutsky's Theorem
    A.4.3 Law of Large Numbers
    A.4.4 Central Limit Theorem
    A.4.5 Delta Method
A.5 Probability Distributions
    A.5.1 Univariate Discrete Distributions
    A.5.2 Univariate Continuous Distributions
    A.5.3 Multivariate Distributions

This appendix gives important definitions and results from probability theory in a compact and sometimes slightly simplifying way. The reader is referred to Grimmett and Stirzaker (2001) for a comprehensive and more rigorous introduction to probability theory. We start with the notion of probability and then move on to random variables and expectations. Important limit theorems are described in Appendix A.4.

A.1 Events and Probabilities

Any experiment involving randomness can be modeled with probabilities. Probabilities Pr(A) between zero and one are assigned to events A such as "It will be raining tomorrow" or "I will suffer a heart attack in the next year". The certain event has probability one, while the impossible event has probability zero. Any event A has a disjoint, complementary event A^c such that Pr(A) + Pr(A^c) = 1. For example, if A is the event that "It will be raining tomorrow", then A^c is the event that "It will not be raining tomorrow". More generally, a series of events A_1, A_2, ..., A_n is called a partition if the events are pairwise disjoint and if Pr(A_1) + ... + Pr(A_n) = 1.

A.1.1 Conditional Probabilities and Independence

Conditional probabilities Pr(A | B) are calculated to update the probability Pr(A) of a particular event under the additional information that a second event B has occurred. They can be calculated via

\[
\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)}, \tag{A.1}
\]

where Pr(A, B) is the probability that both A and B occur. This definition is only sensible if the occurrence of B is possible, i.e. if Pr(B) > 0. Rearranging this equation gives Pr(A, B) = Pr(A | B) Pr(B), but Pr(A, B) = Pr(B | A) Pr(A) must obviously also hold.
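To make (A.1) concrete, here is a minimal Python sketch; the joint probabilities are invented purely for illustration. It computes the conditional probabilities and verifies that both factorisations of Pr(A, B) recover the same joint probability:

```python
# Illustrative joint probabilities for two binary events A and B
# (arbitrary numbers; they only need to sum to 1).
pr = {
    ("A", "B"): 0.10,
    ("A", "Bc"): 0.30,
    ("Ac", "B"): 0.20,
    ("Ac", "Bc"): 0.40,
}

pr_A = pr[("A", "B")] + pr[("A", "Bc")]   # Pr(A)  = 0.4
pr_B = pr[("A", "B")] + pr[("Ac", "B")]   # Pr(B)  = 0.3

pr_A_given_B = pr[("A", "B")] / pr_B      # (A.1): Pr(A | B) = Pr(A, B) / Pr(B)
pr_B_given_A = pr[("A", "B")] / pr_A      # Pr(B | A) = Pr(A, B) / Pr(A)

# Both factorisations give back the same joint probability Pr(A, B).
assert abs(pr_A_given_B * pr_B - pr[("A", "B")]) < 1e-12
assert abs(pr_B_given_A * pr_A - pr[("A", "B")]) < 1e-12

print(round(pr_A_given_B, 3), round(pr_B_given_A, 3))  # 0.333 0.25
```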
Equating these two expressions for Pr(A, B) and rearranging gives Bayes' theorem:

\[
\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)}, \tag{A.2}
\]

which will be discussed in more detail in Appendix A.1.2.

Two events A and B are called independent if the occurrence of B does not change the probability of A occurring: Pr(A | B) = Pr(A). From (A.1) it follows that under independence Pr(A, B) = Pr(A) · Pr(B) and Pr(B | A) = Pr(B).

Conditional probabilities behave like ordinary probabilities if the conditioning event is fixed, so Pr(A | B) + Pr(A^c | B) = 1, for example. It then follows that

\[
\Pr(B) = \Pr(B \mid A)\Pr(A) + \Pr\bigl(B \mid A^{c}\bigr)\Pr\bigl(A^{c}\bigr) \tag{A.3}
\]

and more generally

\[
\Pr(B) = \Pr(B \mid A_1)\Pr(A_1) + \Pr(B \mid A_2)\Pr(A_2) + \dots + \Pr(B \mid A_n)\Pr(A_n) \tag{A.4}
\]

if A_1, A_2, ..., A_n is a partition. This is called the law of total probability.

A.1.2 Bayes' Theorem

Let A and B denote two events with 0 < Pr(A) < 1 and Pr(B) > 0. Then

\[
\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot \Pr(A)}{\Pr(B)}
= \frac{\Pr(B \mid A) \cdot \Pr(A)}{\Pr(B \mid A) \cdot \Pr(A) + \Pr(B \mid A^{c}) \cdot \Pr(A^{c})}. \tag{A.5}
\]

For a general partition A_1, A_2, ..., A_n with Pr(A_i) > 0 for all i = 1, ..., n, we have that

\[
\Pr(A_j \mid B) = \frac{\Pr(B \mid A_j) \cdot \Pr(A_j)}{\sum_{i=1}^{n} \Pr(B \mid A_i) \cdot \Pr(A_i)} \tag{A.6}
\]

for each j = 1, ..., n.

With only n = 2 events A and A^c in the denominator of (A.6), i.e. Eq. (A.5), a formulation using odds ω = π/(1 − π) rather than probabilities π provides a simpler version without the uncomfortable sum in the denominator of (A.5):

\[
\underbrace{\frac{\Pr(A \mid B)}{\Pr(A^{c} \mid B)}}_{\text{Posterior Odds}}
= \underbrace{\frac{\Pr(A)}{\Pr(A^{c})}}_{\text{Prior Odds}}
\cdot \underbrace{\frac{\Pr(B \mid A)}{\Pr(B \mid A^{c})}}_{\text{Likelihood Ratio}}.
\]

Odds ω can easily be transformed back to probabilities via π = ω/(1 + ω).

A.2 Random Variables

A.2.1 Discrete Random Variables

We now consider possible real-valued realisations x of a discrete random variable X. The probability mass function of a discrete random variable X, f(x) = Pr(X = x), describes the distribution of X by assigning probabilities to the events {X = x}. The set of all realisations x with strictly positive probabilities f(x) > 0 is called the support of X.

A multivariate discrete random variable X = (X_1, ..., X_p) has realisations x = (x_1, ..., x_p) in R^p and the probability mass function f(x) = Pr(X = x) = Pr(X_1 = x_1, ..., X_p = x_p).

Of particular interest is often the joint bivariate distribution of two discrete random variables X and Y with probability mass function f(x, y). The conditional probability mass function f(x | y) is then defined via (A.1):

\[
f(x \mid y) = \frac{f(x, y)}{f(y)}. \tag{A.7}
\]

Bayes' theorem (A.2) now translates to

\[
f(x \mid y) = \frac{f(y \mid x)\,f(x)}{f(y)}. \tag{A.8}
\]

Similarly, the law of total probability (A.4) now reads

\[
f(y) = \sum_{x} f(y \mid x)\,f(x), \tag{A.9}
\]

where the sum is over the support of X. Combining (A.7) and (A.9) shows that the argument x can be removed from the joint density f(x, y) via summation to obtain the marginal probability mass function f(y) of Y:

\[
f(y) = \sum_{x} f(x, y),
\]

again with the sum over the support of X.

Two random variables X and Y are called independent if the events {X = x} and {Y = y} are independent for all x and y. Under independence, we can therefore factorise the joint distribution into the product of the marginal distributions: f(x, y) = f(x) · f(y). If X and Y are independent and g and h are arbitrary real-valued functions, then g(X) and h(Y) are also independent.

A.2.2 Continuous Random Variables

A continuous random variable X is usually defined via its density function f(x), a non-negative real-valued function. The density function is the derivative of the distribution function F(x) = Pr(X ≤ x), and therefore

\[
\int_{-\infty}^{x} f(u)\,du = F(x) \quad\text{and}\quad \int_{-\infty}^{+\infty} f(x)\,dx = 1.
\]
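As a quick numerical illustration of this relationship between f(x) and F(x), the following sketch (assuming NumPy and SciPy are available, and taking the standard normal distribution as an arbitrary example) checks that the density integrates to one and that a finite-difference derivative of the distribution function recovers the density:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# The density of a continuous random variable integrates to one ...
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)  # approximately 1.0

# ... and it is the derivative of the distribution function F(x) = Pr(X <= x):
x = np.linspace(-3.0, 3.0, 7)
h = 1e-6
finite_diff = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)  # numerical dF/dx
print(np.allclose(finite_diff, norm.pdf(x)))  # True
```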
In the following we assume that this derivative exists. We then have

\[
\Pr(a \le X \le b) = \int_{a}^{b} f(x)\,dx
\]

for any real numbers a and b, which determines the distribution of X.

Under suitable regularity conditions, the results derived in Appendix A.2.1 for probability mass functions also hold for density functions, with summation replaced by integration. Suppose that f(x, y) is the joint density function of two random variables X and Y, i.e.

\[
\Pr(X \le x, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv.
\]

The marginal density function f(y) of Y is

\[
f(y) = \int f(x, y)\,dx,
\]

and the law of total probability now reads

\[
f(y) = \int f(y \mid x)\,f(x)\,dx.
\]

Bayes' theorem (A.8) still reads

\[
f(x \mid y) = \frac{f(y \mid x)\,f(x)}{f(y)}. \tag{A.10}
\]

If X and Y are independent continuous random variables with density functions f(x) and f(y), then f(x, y) = f(x) · f(y) for all x and y.

A.2.3 The Change-of-Variables Formula

When the transformation g(·) is one-to-one and differentiable, there is a formula for the probability density function f_Y(y) of Y = g(X) directly in terms of the probability density function f_X(x) of a continuous random variable X. This is known as the change-of-variables formula:

\[
f_Y(y) = f_X\bigl(g^{-1}(y)\bigr) \cdot \left|\frac{dg^{-1}(y)}{dy}\right|
       = f_X(x) \cdot \left|\frac{dg(x)}{dx}\right|^{-1}, \tag{A.11}
\]

where x = g^{-1}(y).

As an example, we will derive the density function of a log-normal random variable Y = exp(X), where X ∼ N(μ, σ²). From y = g(x) = exp(x) we find that the inverse transformation is x = g^{-1}(y) = log(y), which has the derivative dg^{-1}(y)/dy = 1/y. Using the formula (A.11), we get the density function

\[
f_Y(y) = f_X\bigl(g^{-1}(y)\bigr) \cdot \left|\frac{dg^{-1}(y)}{dy}\right|
       = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{-\frac{1}{2}\frac{(\log(y)-\mu)^{2}}{\sigma^{2}}\right\} \cdot \frac{1}{y}
       = \frac{1}{\sigma y}\,\varphi\!\left(\frac{\log(y)-\mu}{\sigma}\right),
\]

where φ(x) is the standard normal density function.

Consider a multivariate continuous random variable X that takes values in R^p and a one-to-one and differentiable transformation g : R^p → R^p, which yields Y = g(X).
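Before turning to the multivariate case, the log-normal example lends itself to a quick numerical check. The sketch below assumes NumPy and SciPy are available; SciPy's lognorm distribution with shape s = σ and scale = exp(μ) corresponds to Y = exp(X) with X ∼ N(μ, σ²), and the parameter values are chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 0.5, 1.2           # illustrative parameters for X ~ N(mu, sigma^2)
y = np.linspace(0.05, 10.0, 200)

# Density of Y = exp(X) via the change-of-variables formula (A.11):
# f_Y(y) = phi((log(y) - mu) / sigma) / (sigma * y)
f_y = norm.pdf((np.log(y) - mu) / sigma) / (sigma * y)

# Reference: SciPy's log-normal density with shape s = sigma, scale = exp(mu)
f_y_ref = lognorm.pdf(y, s=sigma, scale=np.exp(mu))

print(np.allclose(f_y, f_y_ref))  # True
```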