Probability for Statistics: The Bare Minimum Prerequisites

This note covers the bare minimum of topics that students should be familiar with before starting the Probability for Statistics module. Most of you will have studied classical and/or axiomatic probability before, but given students' varied backgrounds it is worth being explicit about what you should know going in.

Real Analysis

Students should be familiar with the main ideas from real analysis, and at the very least with the notions of:

• Limits of sequences of real numbers;
• Limit inferior (lim inf) and limit superior (lim sup) and their connection to limits;
• Convergence of sequences of real-valued vectors;
• Functions: injectivity and surjectivity; continuous and discontinuous functions; continuity from the left and the right; convex functions;
• Open and closed subsets of the real numbers.

For those wanting a good self-study textbook for any of these topics I recommend: Mathematical Analysis, by Malik and Arora.

Complex Analysis

We will need some tools from complex analysis when we study characteristic functions and, subsequently, the central limit theorem. Students should be familiar with the following concepts:

• Imaginary numbers, complex numbers, complex conjugates and their properties;
• De Moivre's theorem;
• Limits and convergence of sequences of complex numbers;
• Fourier transforms and inverse Fourier transforms.

For those wanting a good self-study textbook for these topics I recommend: Introduction to Complex Analysis by H. Priestley.

Classical Probability and Combinatorics

Many students will have already attended a course in classical probability, and will be very familiar with these basic concepts. Nonetheless, they are presented here to consolidate terminology and notation.

• The set of all possible outcomes of an experiment is known as the sample space of the experiment, denoted by $S$.
• An event is a set consisting of possible outcomes of an experiment. Informally, any subset $E$ of the sample space is known as an event. If the outcome of the experiment is contained in $E$, then we say that $E$ has occurred.
• A probability $P$ is a function which assigns real values to events in a way that satisfies the following axioms of probability:


1. Every probability is between 0 and 1, i.e. $0 \le P(E) \le 1$.
2. The probability that at least one of the outcomes in $S$ will occur is 1, i.e. $P(S) = 1$.
3. For any sequence of mutually exclusive events $E_1, \dots, E_n$, we have
\[ P\left( \bigcup_{i=1}^{n} E_i \right) = \sum_{i=1}^{n} P(E_i). \]

• A permutation is an arrangement of $r$ objects from a pool of $n$, where the order matters. The number of such arrangements is given by $P(n, r)$, defined as
\[ P(n, r) = \frac{n!}{(n - r)!}. \]
• A combination is an arrangement of $r$ objects from a pool of $n$ objects, where the order does not matter. The number of such arrangements is given by $C(n, r)$, defined as
\[ C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n - r)!}. \]

• Bayes' rule – For events $A$ and $B$ such that $P(B) > 0$, we have
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}. \]
Note that $P(A \cap B) = P(A)\, P(B \mid A) = P(A \mid B)\, P(B)$.

• Let $\{A_i : i = 1, \dots, n\}$ be a partition of $S$, i.e.
\[ A_i \cap A_j = \emptyset \ \text{for } i \neq j, \quad \text{and} \quad \bigcup_{i=1}^{n} A_i = S. \]
Then we have $P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)$.
• More generally,
\[ P(A_k \mid B) = \frac{P(B \mid A_k)\, P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)}. \]
• Two events $A$ and $B$ are independent if and only if
\[ P(A \cap B) = P(A)\, P(B). \]
A short numerical sketch of the counting formulas and Bayes' rule follows below.
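The following is a minimal Python sketch (not part of the original notes; it uses only the standard library) illustrating the counting formulas, the law of total probability and Bayes' rule on an arbitrary made-up example.

```python
# A minimal sketch: counting formulas and Bayes' rule on made-up numbers.
import math

# Permutations and combinations: P(n, r) = n!/(n-r)!, C(n, r) = P(n, r)/r!.
n, r = 5, 2
perm = math.factorial(n) // math.factorial(n - r)
comb = perm // math.factorial(r)
assert perm == math.perm(n, r) == 20
assert comb == math.comb(n, r) == 10

# Bayes' rule with the two-event partition {A, not A} of the sample space.
p_A = 0.3                 # P(A), chosen arbitrarily
p_B_given_A = 0.8         # P(B | A)
p_B_given_notA = 0.1      # P(B | not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B).
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(B) = {p_B:.4f}, P(A | B) = {p_A_given_B:.4f}")
```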

Random Variables

• A random variable, $X$, is a function which maps every element of a sample space to the real line. If $X$ takes discrete values (e.g. the outcomes of coin flips), then we say that $X$ is discrete; if $X$ takes continuous values, such as the temperature in the room, then $X$ is continuous.
• The cumulative distribution function $F$ is defined by $F(x) = P(X \le x)$.

This is a monotonically non-decreasing function such that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$. Clearly, $P(a < X \le b) = F(b) - F(a)$.
• Given a continuous random variable $X$ with CDF $F$, a function $f$ is a probability density function if
\[ F(x) = \int_{-\infty}^{x} f(y)\, dy, \quad \text{or equivalently} \quad f(x) = \frac{dF(x)}{dx}. \]
Clearly, $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\, dx = 1$.
• Given a discrete random variable $X$, a function $f$ is a probability mass function if $P(X = x) = f(x)$. Clearly $f(x_j) \le 1$ and $\sum_j f(x_j) = 1$. If $F$ is the CDF of $X$, then $F(x) = \sum_{x_i \le x} f(x_i)$.
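Below is a brief Python sketch (not from the original notes; it assumes numpy and scipy are available) illustrating the CDF/pdf/pmf relationships just stated, using a standard normal and a binomial distribution as examples.

```python
# CDF / pdf / pmf relationships on standard distributions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Continuous case: standard normal. The pdf integrates to 1 and
# P(a < X <= b) = F(b) - F(a).
X = stats.norm(loc=0, scale=1)
total, _ = quad(X.pdf, -np.inf, np.inf)
print(total)                                 # ~1.0
a, b = -1.0, 2.0
print(X.cdf(b) - X.cdf(a))                   # P(-1 < X <= 2) ~ 0.8186

# Discrete case: Binomial(10, 0.3). The pmf sums to 1 and
# F(x) is the sum of the pmf over the support up to x.
Y = stats.binom(n=10, p=0.3)
support = np.arange(0, 11)
print(Y.pmf(support).sum())                  # 1.0
print(Y.pmf(support[support <= 4]).sum(), Y.cdf(4))   # both ~0.8497
```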

Expectations and Moments

The expected value of a random variable, also known as the mean value or the first moment, is often denoted by $E[X]$ or $\mu$. It can be thought of as the value we would obtain by averaging the results of the experiment infinitely many times. If $f$ is the probability density function or probability mass function of $X$, then it is computed as

\[ E[X] = \sum_{i=1}^{n} x_i f(x_i) \quad \text{and} \quad E[X] = \int_{-\infty}^{\infty} x f(x)\, dx, \]
for discrete and continuous random variables, respectively. The expected value of a function of a random variable, $g(X)$, is computed as follows:

\[ E[g(X)] = \sum_{i=1}^{n} g(x_i) f(x_i) \quad \text{and} \quad E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx, \]
for discrete and continuous random variables, respectively. More generally, the $k$th (non-centered) moment is the value of $X^k$ that we expect to observe on average over infinitely many trials. It is computed as follows:

\[ E[X^k] = \sum_{i=1}^{n} x_i^k f(x_i) \quad \text{and} \quad E[X^k] = \int_{-\infty}^{\infty} x^k f(x)\, dx, \]
for discrete and continuous random variables, respectively.

Let $1_A$ be the indicator function of the set $A$, i.e.
\[ 1_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{otherwise,} \end{cases} \]
then $E[1_A(X)] = P[A]$.
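As a quick illustration (a hedged Monte Carlo sketch assuming numpy, not part of the original notes), sample averages approximate expectations and moments, and averaging an indicator approximates a probability.

```python
# Monte Carlo estimates of E[X], E[X^2] and P[A] via the indicator function.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # X ~ Exp with rate 1/2, so E[X] = 2

print(x.mean())              # ~2.0, estimate of E[X]
print((x ** 2).mean())       # ~8.0, estimate of E[X^2] = 2 / 0.5^2
print((x > 3.0).mean())      # average of 1_{X > 3}, estimates P[X > 3] = exp(-1.5)
```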

Linearity of Expectation – If $X_1, \dots, X_n$ are random variables and $a_1, \dots, a_n$ are constants, then
\[ E\left[ \sum_i a_i X_i \right] = \sum_i a_i E[X_i]. \]

In particular, if $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $E[X_i] = \mu$ for each $i$, then $E[\bar{X}] = E[X_1] = \mu$.

Expectations of Independent Random Variables – Let $X_1, \dots, X_n$ be independent random variables. Then
\[ E\left[ \prod_{i=1}^{n} X_i \right] = \prod_i E[X_i]. \]

The variance of a random variable, often denoted by $\mathrm{Var}[X]$ or $\sigma^2$, is a measure of the spread of its distribution. It is determined as follows:

\[ \mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2. \]
If $a$ and $b$ are constants, then $\mathrm{Var}[aX + b] = a^2 \mathrm{Var}[X]$. The standard deviation of a random variable, often denoted by $\sigma$, is a measure of the spread of its distribution which is compatible with the units of the random variable itself. It is given by $\sigma = \sqrt{\mathrm{Var}[X]}$.

Variance of a Sum of Independent Variables – If $X_1, \dots, X_n$ are independent and $a_1, \dots, a_n$ are constants, then
\[ \mathrm{Var}\left[ \sum_{i=1}^{n} a_i X_i \right] = \sum_i a_i^2 \mathrm{Var}[X_i]. \]

In particular, if $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ where $X_1, \dots, X_n$ are IID, then $\mathrm{Var}[\bar{X}] = \frac{1}{n} \mathrm{Var}[X_1] = \frac{\sigma^2}{n}$.
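The following simulation sketch (assuming numpy; not part of the original notes) checks that the variance of the mean of $n$ IID variables is roughly $\sigma^2 / n$.

```python
# Variance of the sample mean of n IID variables is sigma^2 / n.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 200_000
sigma2 = 4.0                                          # Var[X_i]

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
sample_means = samples.mean(axis=1)                   # one X_bar per repetition

print(sample_means.var())      # ~0.16
print(sigma2 / n)              # 0.16
```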

Characteristic Functions – The characteristic function $\psi(\omega)$ of a random variable $X$ with probability mass or density function $f$ is defined as
\[ \psi(\omega) = \sum_{i=1}^{n} e^{i \omega x_i} f(x_i) \quad \text{and} \quad \psi(\omega) = \int_{-\infty}^{\infty} e^{i \omega x} f(x)\, dx, \]
for a discrete and a continuous random variable, respectively, where $i$ is the imaginary unit. It is clear that $\psi(0) = 1$.

Euler’s formula is the name given to the identity

\[ e^{i\theta} = \cos(\theta) + i \sin(\theta). \]

Using Euler's formula we obtain $\psi(\omega) = E[\cos(\omega X)] + i\, E[\sin(\omega X)]$. The $k$th moment can also be computed from the characteristic function as follows:

\[ E[X^k] = (-i)^k \left. \frac{\partial^k \psi}{\partial \omega^k} \right|_{\omega = 0}. \]

Distribution of a Sum of Independent Random Variables – Let $Y = X_1 + \dots + X_n$ with $X_1, \dots, X_n$ independent. We have
\[ \psi_Y(\omega) = \prod_{k=1}^{n} \psi_{X_k}(\omega). \]
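Here is a hedged numerical sketch (assuming numpy; not from the original notes): the empirical characteristic function of a normal sample is close to the closed form $e^{i\omega\mu - \frac{1}{2}\omega^2\sigma^2}$, and the characteristic function of a sum of independent variables factorises into the product of the individual ones.

```python
# Empirical characteristic functions: closed form and the product rule for sums.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 1.0, 2.0, 1_000_000
x = rng.normal(mu, sigma, size=n)
y = rng.exponential(scale=1.0, size=n)      # independent of x

w = 0.7
psi_x_emp = np.exp(1j * w * x).mean()       # sample average of exp(i w X)
psi_x_theory = np.exp(1j * w * mu - 0.5 * w**2 * sigma**2)
print(psi_x_emp, psi_x_theory)              # close to each other

psi_y_emp = np.exp(1j * w * y).mean()
psi_sum_emp = np.exp(1j * w * (x + y)).mean()
print(psi_sum_emp, psi_x_emp * psi_y_emp)   # product rule for the independent sum
```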

Transformation of Random Variables – Now suppose that $X$ and $Y$ are continuous random variables linked by a one-to-one differentiable function $r$, i.e. $Y = r(X)$. Denoting by $f_X$ and $f_Y$ the probability density functions of $X$ and $Y$ respectively, we have

\[ f_Y(y) = f_X(x(y)) \left| \frac{dx}{dy} \right|. \]
Leibniz Integral Rule – Let $g$ be a function of $x$ and a parameter $\theta$, with boundaries $a = a(\theta)$ and $b = b(\theta)$. Then
\[ \frac{\partial}{\partial \theta} \left( \int_{a(\theta)}^{b(\theta)} g(x, \theta)\, dx \right) = \frac{\partial b(\theta)}{\partial \theta} \cdot g(b(\theta), \theta) - \frac{\partial a(\theta)}{\partial \theta} \cdot g(a(\theta), \theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial g}{\partial \theta}(x, \theta)\, dx. \]
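A minimal sketch of the change-of-variables formula above (assuming numpy and scipy; not part of the original notes): if $X \sim N(0, 1)$ and $Y = e^X$, then $f_Y(y) = f_X(\log y)/y$, which matches the standard lognormal density.

```python
# Change of variables: Y = exp(X) with X ~ N(0, 1) gives the lognormal density.
import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 5)
f_Y_by_formula = stats.norm.pdf(np.log(y)) / y     # f_X(x(y)) * |dx/dy| with x(y) = log y
f_Y_scipy = stats.lognorm.pdf(y, s=1.0)            # lognormal with shape s = sigma = 1

print(np.allclose(f_Y_by_formula, f_Y_scipy))      # True
```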

Probability Distributions

Discrete distributions – The main discrete distributions to be familiar with:

• Binomial: $X \sim B(n, p)$ with $P[X = x] = \binom{n}{x} p^x (1 - p)^{n - x}$, $\psi(\omega) = (p e^{i\omega} + 1 - p)^n$, $E[X] = np$ and $\mathrm{Var}[X] = np(1 - p)$.
• Poisson: $X \sim \mathrm{Poisson}(\lambda)$ with $P[X = x] = \frac{\lambda^x}{x!} e^{-\lambda}$, $\psi(\omega) = e^{\lambda(e^{i\omega} - 1)}$, $E[X] = \lambda$ and $\mathrm{Var}[X] = \lambda$.

Continuous distributions – The main continuous distributions to be familiar with:

• Uniform: $X \sim U(a, b)$ with $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$, $\psi(\omega) = \frac{e^{i\omega b} - e^{i\omega a}}{(b - a) i\omega}$, $E[X] = \frac{a + b}{2}$ and $\mathrm{Var}[X] = \frac{(b - a)^2}{12}$.
• Normal: $X \sim N(\mu, \sigma^2)$ with $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$, $\psi(\omega) = e^{i\omega\mu - \frac{1}{2}\omega^2\sigma^2}$, $E[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$.
• Exponential: $X \sim \mathrm{Exp}(\lambda)$ with $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, $\psi(\omega) = \frac{1}{1 - \frac{i\omega}{\lambda}}$, $E[X] = \frac{1}{\lambda}$ and $\mathrm{Var}[X] = \frac{1}{\lambda^2}$.
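As a quick cross-check (a sketch assuming scipy; not part of the original notes), scipy.stats reproduces the listed means and variances. Note that scipy's exponential distribution is parameterised by a scale equal to $1/\lambda$.

```python
# Cross-checking the listed means and variances with scipy.stats.
from scipy import stats

n, p = 10, 0.3
lam = 2.5
a, b = 1.0, 5.0
mu, sigma = 0.0, 2.0

print(stats.binom.stats(n, p, moments="mv"))                   # (np, np(1-p)) = (3.0, 2.1)
print(stats.poisson.stats(lam, moments="mv"))                  # (lambda, lambda)
print(stats.uniform.stats(loc=a, scale=b - a, moments="mv"))   # ((a+b)/2, (b-a)^2/12)
print(stats.norm.stats(loc=mu, scale=sigma, moments="mv"))     # (mu, sigma^2)
print(stats.expon.stats(scale=1 / lam, moments="mv"))          # (1/lambda, 1/lambda^2)
```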

Jointly Distributed Random Variables

The joint probability mass function of two discrete random variables $X$ and $Y$, denoted $f_{XY}$, is defined as
\[ f_{XY}(x_i, y_j) = P[X = x_i \text{ and } Y = y_j]. \]
The joint probability density of two continuous random variables $X$ and $Y$ is defined, for small $\Delta x$ and $\Delta y$, by
\[ f_{XY}(x, y)\, \Delta x\, \Delta y \approx P[x \le X \le x + \Delta x \text{ and } y \le Y \le y + \Delta y]. \]

Marginal density – We define the marginal distribution of the random variable $X$ as follows:
\[ f_X(x_i) = \sum_j f_{XY}(x_i, y_j) \quad \text{or} \quad f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x, y)\, dy, \]
in the discrete and continuous case, respectively.

The cumulative distribution function of $(X, Y)$ is defined as
\[ F_{XY}(x, y) = \sum_{x_i \le x} \sum_{y_j \le y} f_{XY}(x_i, y_j) \]
in the discrete case, and
\[ F_{XY}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{XY}(x', y')\, dy'\, dx' \]
in the continuous case.

Conditional density – The conditional density of $X$ given $Y$, often denoted by $f_{X|Y}$, is defined as
\[ f_{X|Y}(x \mid y) = \frac{f_{XY}(x, y)}{f_Y(y)}. \]
Independence – Two random variables $X$ and $Y$ are said to be independent if we have
\[ f_{XY}(x, y) = f_X(x) f_Y(y). \]
Moments of joint distributions – We define the moments of the joint distribution of random variables $X$ and $Y$ as
\[ E[X^p Y^q] = \sum_i \sum_j x_i^p y_j^q f_{XY}(x_i, y_j) \quad \text{and} \quad E[X^p Y^q] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^p y^q f_{XY}(x, y)\, dx\, dy, \]
in the discrete and continuous case, respectively.
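The following is a small sketch (assuming numpy; the joint pmf table is made up for illustration) of marginals, a conditional pmf, and the independence factorisation.

```python
# Marginals, conditionals and independence from a small joint pmf table.
import numpy as np

# Hypothetical joint pmf f_XY: X takes 2 values (rows), Y takes 3 values (columns).
f_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.30, 0.15]])
assert np.isclose(f_xy.sum(), 1.0)

f_x = f_xy.sum(axis=1)            # marginal pmf of X
f_y = f_xy.sum(axis=0)            # marginal pmf of Y
print(f_x, f_y)

# Conditional pmf of X given Y = y_0 (first column): f_XY(x, y_0) / f_Y(y_0).
print(f_xy[:, 0] / f_y[0])

# Independence holds iff f_XY(x, y) = f_X(x) f_Y(y) for every cell; this
# particular table factorises, so the check prints True.
print(np.allclose(f_xy, np.outer(f_x, f_y)))
```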

Covariance – We define the covariance of two random variables $X$ and $Y$, denoted $\sigma_{XY}^2$ or $\mathrm{Cov}(X, Y)$, as follows:
\[ \mathrm{Cov}(X, Y) = \sigma_{XY}^2 = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y, \]
where $\mu_X = E[X]$ and $\mu_Y = E[Y]$.

Correlation – The correlation between the random variables $X$ and $Y$, denoted $\rho_{XY}$, is given by
\[ \rho_{XY} = \frac{\sigma_{XY}^2}{\sigma_X \sigma_Y}, \]
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$. Clearly $-1 \le \rho_{XY} \le 1$, and $\rho_{XY} = 0$ if $X$ and $Y$ are independent (the converse does not hold in general).
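A brief simulation sketch (assuming numpy; not part of the original notes): the sample covariance and correlation of two variables constructed to have correlation $0.6$ agree with $E[XY] - \mu_X \mu_Y$.

```python
# Estimating covariance and correlation from a sample.
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)      # constructed so that Corr(X, Y) = 0.6

cov_manual = (x * y).mean() - x.mean() * y.mean()
print(cov_manual, np.cov(x, y)[0, 1])       # both ~0.6
print(np.corrcoef(x, y)[0, 1])              # ~0.6
```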

Conditional Expectation – The conditional expectation of $X$ given $Y = y$ is
\[ E[X \mid Y = y] = \begin{cases} \sum_i x_i f_{X|Y}(x_i \mid y) & \text{discrete case,} \\ \int x f_{X|Y}(x \mid y)\, dx & \text{continuous case.} \end{cases} \]

If $r(x, y)$ is a function of $x$ and $y$, then
\[ E[r(X, Y) \mid Y = y] = \begin{cases} \sum_i r(x_i, y) f_{X|Y}(x_i \mid y) & \text{discrete case,} \\ \int r(x, y) f_{X|Y}(x \mid y)\, dx & \text{continuous case.} \end{cases} \]

For random variables X and Y , assuming the expectations exist, we have that

\[ E[E[Y \mid X]] = E[Y] \quad \text{and} \quad E[E[X \mid Y]] = E[X]. \]
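A small sketch of this tower property (assuming numpy; the hierarchical example is made up): with $X \sim \mathrm{Poisson}(4)$ and $Y \mid X \sim B(X, 0.5)$, we have $E[Y \mid X] = 0.5 X$ and hence $E[Y] = 2$.

```python
# Tower property E[E[Y | X]] = E[Y] on a hierarchical example.
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
x = rng.poisson(lam=4.0, size=n)
y = rng.binomial(n=x, p=0.5)          # one draw of Y for each realised X

print(y.mean())                        # ~2.0, direct estimate of E[Y]
print((0.5 * x).mean())                # ~2.0, estimate of E[E[Y | X]] = E[0.5 X]
```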

Important Inequalities

Concentration inequalities provide probability bounds on how a random variable deviates from its expectation.

Markov's inequality – Let $X$ be any nonnegative integrable random variable and let $a > 0$. Then
\[ P(X \ge a) \le \frac{E[X]}{a}. \]

Chebyshev's inequality – Let $X$ be a random variable with expected value $\mu$. For $a > 0$, we have the following inequality:
\[ P(|X - \mu| \ge a) \le \frac{\mathrm{Var}[X]}{a^2}. \]

Hoeffding's inequality – Let $Y_1, \dots, Y_n$ be independent observations such that $E[Y_i] = 0$ and $a_i \le Y_i \le b_i$. Let $\epsilon > 0$. Then, for any $t > 0$,
\[ P\left( \sum_{i=1}^{n} Y_i \ge \epsilon \right) \le e^{-t\epsilon} \prod_{i=1}^{n} e^{t^2 (b_i - a_i)^2 / 8}. \]
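As a quick sanity check (a simulation sketch assuming numpy; not part of the original notes), the empirical tail probabilities of an exponential sample sit comfortably below the Markov and Chebyshev bounds.

```python
# Empirical tail probabilities versus the Markov and Chebyshev bounds.
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, Var[X] = 1
mu, var = 1.0, 1.0
a = 3.0

print((x >= a).mean(), mu / a)                   # ~0.050 <= 0.333  (Markov)
print((np.abs(x - mu) >= a).mean(), var / a**2)  # ~0.018 <= 0.111  (Chebyshev)
```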

Cauchy–Schwarz inequality – If $X$ and $Y$ have finite variances, then
\[ E|XY| \le \sqrt{E[X^2]\, E[Y^2]}. \]

Jensen's inequality – Recall that a function $g$ is convex if, for each $x$, $y$ and $\alpha \in [0, 1]$,
\[ g(\alpha x + (1 - \alpha) y) \le \alpha g(x) + (1 - \alpha) g(y). \]

Then $E[g(X)] \ge g(E[X])$. In particular, $E[X^2] \ge (E[X])^2$, and if $X$ is positive then $E[1/X] \ge 1/E[X]$.
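Finally, a tiny sketch of Jensen's inequality (assuming numpy; not part of the original notes), using the convex functions $g(x) = x^2$ and $g(x) = 1/x$ on a positive random variable.

```python
# Jensen's inequality: E[g(X)] >= g(E[X]) for convex g.
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(shape=3.0, scale=1.0, size=1_000_000)   # positive, E[X] = 3

print((x ** 2).mean(), x.mean() ** 2)      # E[X^2] ~ 12 >= (E[X])^2 ~ 9
print((1 / x).mean(), 1 / x.mean())        # E[1/X] ~ 0.5 >= 1/E[X] ~ 0.333
```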