
Statistical Methods

“Never trust a statistic you didn’t forge yourself.”

Winston Churchill

Florian Herzog

2013

Independent and identically distributed random variables

Definition 1. The random variables X1, ..., Xn are called a random sample of size n from the population f(x) if X1, ..., Xn are mutually independent random variables and the marginal pdf of each Xi is the same function f(x). Alternatively, X1, ..., Xn are called independent and identically distributed random variables with pdf f(x).

The joint pdf of X1, ..., Xn is given as:

f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i)

The slides of this section follow closely Chapters 5 and 7 of the book “G. Casella and R. Berger, Statistical Inference, Duxbury Press, 2002”.

Identically and independently distributed random variables

Often in statistics (especially in estimation) we assume identically and independently distributed (i.i.d.) random variables (r.v.). This means that a sequence Xk, where k = 1, 2, ... denotes the realizations of the r.v., has the following properties:

• Each Xk ∼ f(x) is drawn from the same density f(x).

• Xk is independent of Xk−1,Xk−2, ..., X1.

• Each Xk is uncorrelated with each Xj, i.e. Cov[Xk, Xj] = 0 ∀ j ≠ k.

Sample mean and variance

Definition 2. The sample mean is the arithmetic average of the values in a random sample and is denoted as

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i

Definition 3. The sample variance is the statistic defined as

S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

The sample standard deviation is the statistic defined as S = √(S²).

Properties of i.i.d. random variables

Theorem 1. Let X1, X2, ..., Xn be independent and identically distributed (i.i.d.) random variables with mean µ = E[Xi] and variance σ² = Var[Xi]. Then

E[\bar{X}] = \mu, \qquad Var[\bar{X}] = \frac{\sigma^2}{n}, \qquad E[S^2] = \sigma^2
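As an illustration (not part of the original slides), a minimal Monte Carlo sketch in Python/numpy that checks the three statements of Theorem 1; the normal distribution, n, and the number of trials are arbitrary choices:

import numpy as np

# Minimal sketch: check E[Xbar] = mu, Var[Xbar] = sigma^2/n and E[S^2] = sigma^2
# by simulation; distribution, n and the number of trials are arbitrary choices.
rng = np.random.default_rng(1)
mu, sigma, n, trials = 3.0, 2.0, 10, 200_000

X = rng.normal(mu, sigma, size=(trials, n))
xbar = X.mean(axis=1)                  # sample mean of each trial
s2 = X.var(axis=1, ddof=1)             # sample variance with the 1/(n-1) factor

print(xbar.mean(), "should be close to", mu)             # E[Xbar] = mu
print(xbar.var(), "should be close to", sigma**2 / n)    # Var[Xbar] = sigma^2/n
print(s2.mean(), "should be close to", sigma**2)         # E[S^2] = sigma^2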

Properties of i.i.d. random variables

Theorem 2. Let X1, X2, ..., Xn be i.i.d. random variables from a normal distribution with mean µ and variance σ². Then
1. X̄ and S² are independent random variables,
2. X̄ is distributed N(µ, σ²/n),
3. (n − 1)S²/σ² has a chi-square distribution with n − 1 degrees of freedom.
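As an added illustration (not on the slides), a minimal Python sketch that checks point 3 of Theorem 2 with a Kolmogorov-Smirnov test; the parameters are arbitrary and scipy is assumed to be available:

import numpy as np
from scipy import stats

# Minimal sketch: check that (n-1)S^2/sigma^2 follows a chi-square distribution
# with n-1 degrees of freedom for normal data (parameters chosen arbitrarily).
rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 1.5, 8, 100_000

X = rng.normal(mu, sigma, size=(trials, n))
s2 = X.var(axis=1, ddof=1)
q = (n - 1) * s2 / sigma**2

# Kolmogorov-Smirnov test against the chi-square(n-1) distribution;
# a large p-value is consistent with the theorem.
print(stats.kstest(q, "chi2", args=(n - 1,)))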

Convergence of a sequence of r.v. {Xn}

1. Convergence with probability one (or almost surely), X_n \xrightarrow{a.s.} X:

P\left(\left\{\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right\}\right) = 1 .

2. The sequence {Xn} converges to X in probability, X_n \xrightarrow{p} X, if

\lim_{n\to\infty} P\left(\left\{\omega \in \Omega : |X_n(\omega) - X(\omega)| > \varepsilon\right\}\right) = 0, \quad \text{for all } \varepsilon > 0 .

3. The sequence {Xn} converges to X in L^p, X_n \xrightarrow{L^p} X, if

\lim_{n\to\infty} E\left[|X_n(\omega) - X(\omega)|^p\right] = 0 .
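To make convergence in probability concrete (an added illustration, not on the slides), a minimal Python sketch of the weak law of large numbers: the sample mean X̄_n of i.i.d. variables converges in probability to µ, so P(|X̄_n − µ| > ε) shrinks as n grows. The distribution, ε, and the sample sizes are arbitrary choices:

import numpy as np

# Minimal sketch: P(|Xbar_n - mu| > eps) -> 0 as n -> infinity
# (convergence in probability of the sample mean).
rng = np.random.default_rng(3)
mu, sigma, eps, trials = 1.0, 2.0, 0.1, 1_000

for n in (10, 100, 1_000, 10_000):
    xbar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) > eps)
    print(f"n = {n:6d}   P(|Xbar_n - mu| > {eps}) ~ {prob:.3f}")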

Convergence concepts interrelations

The convergence concepts are related as follows:

• X_n \xrightarrow{L^p} X implies X_n \xrightarrow{L^q} X for q < p.
• X_n \xrightarrow{a.s.} X (almost surely) implies X_n \xrightarrow{p} X (in probability).
• X_n \xrightarrow{L^p} X implies X_n \xrightarrow{p} X (in probability).
• X_n \xrightarrow{p} X (in probability) implies X_n \xrightarrow{d} X (in distribution).

Parameter estimation

In stochastic systems modeling, we often build models from observations (and not from physical first principles). We need statistically motivated methods to identify the stochastic systems under consideration. The identification of a stochastic system requires the following:

• Identification of the distribution
• Identification of the dynamics
• Identification of the system parameters
• Analysis of the parameter significance

In this section we focus only on the parameter estimation and assume that the distribution is known. We will come back to this topic after the theoretical introduction of stochastic processes.

Parameter estimation (Point estimation)

Definition 1. A point estimator is any function W(X1, X2, ..., Xn) of a sample of random variables.

There are several ways of finding point estimators; the main ones are:

• Methods of moments (MM)
• Maximum Likelihood estimators (MLE)
• Expectation Maximization (EM)
• Bayes Estimators

Besides the methods of finding a point estimator, we also need to evaluate the quality of the estimator. In the following slides, we will introduce the methods of moments and the maximum likelihood estimator.

Methods of moments

We have X1, X2, ..., Xn, a sample from a population with pdf f(x|θ1, θ2, ..., θk). The parameters θi are the distribution parameters, e.g. µ and σ in the case of a normal distribution.

Definition 2. The method of moments is the matching of the first k sample moments of the data with the first k theoretical moments of the distribution. The theoretical moments are functions of the parameters, and the parameter estimation problem is reduced to solving a system of k equations.

Methods of moments

We have

m_1 = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \mu_1' = E[X],
m_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2, \qquad \mu_2' = E[X^2],
\vdots
m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k, \qquad \mu_k' = E[X^k],

where m_i denotes the sample moments and µ_i' the theoretical moments.

Methods of moments

Since µ_i' is a function of the parameters θ1, ..., θk, we get the following system of equations:

m_1 = \mu_1'(\theta_1, \ldots, \theta_k),
m_2 = \mu_2'(\theta_1, \ldots, \theta_k),
\vdots
m_k = \mu_k'(\theta_1, \ldots, \theta_k),

where m_i denotes the sample moments and µ_i' the theoretical moments. The parameters are found by solving this system of k equations.

Methods of moments

As the main example, we assume that the data is generated by a normal distribution with mean µ and variance σ². We denote θ1 = µ and θ2 = σ². The first and second moments of the normal distribution are given as

\mu_1' = \mu = \frac{1}{n} \sum_{i=1}^{n} X_i
\mu_2' = \mu^2 + \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2

Methods of moments

Solving for µ and σ² we get:

\mu = \frac{1}{n} \sum_{i=1}^{n} X_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2

The solutions are the sample moments of mean and variance and are of course the “natural” way of estimating the mean and variance of the normal distribution. (Note that the method of moments yields the variance estimator with the factor 1/n, not the unbiased sample variance S² with the factor 1/(n − 1).)
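A minimal Python sketch (added for illustration, with synthetic data and arbitrary parameters) of the method-of-moments estimates for the normal case derived above:

import numpy as np

# Minimal sketch: method-of-moments estimates for an assumed normal sample.
# The data x is synthetic; any vector of observations could be used instead.
rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

m1 = x.mean()                # first sample moment
m2 = np.mean(x**2)           # second sample moment

mu_hat = m1                  # from m1 = mu
sigma2_hat = m2 - m1**2      # from m2 = mu^2 + sigma^2

print(mu_hat, sigma2_hat)    # close to the true values 2.0 and 9.0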

Maximum Likelihood Estimation

The likelihood function is the joint pdf of X1, ..., Xn and is given as:

L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta_1, \ldots, \theta_k) .

We denote x = [x_1, x_2, \ldots]^T and θ = [\theta_1, \theta_2, \ldots]^T.

Definition 3. For each sample x, let θ̂(x) be a parameter value at which L(θ|x) attains its maximum as a function of θ. A maximum likelihood estimator (MLE) of the parameter θ based on the sample X is θ̂(X). If the likelihood function is C², then possible candidates for the MLE are the values of θ which solve

\frac{\partial}{\partial \theta_i} L(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = 0 .

Maximum log-Likelihood Estimation

Theorem 1. The maximum likelihood estimation is equivalent to the maximum log-likelihood estimation. The log-likelihood is defined as

l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \log\left(f(x_i | \theta_1, \ldots, \theta_k)\right) .

Example: We want to derive the maximum likelihood estimate for the mean (µ) of the normal distribution under the assumption of known variance σ². The log-pdf of the normal distribution is given as:

\log(f(x_i | \mu)) = -\frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2}\right)

Maximum log-Likelihood Estimation

The log-likelihood function is given as:

l(\mu | x_1, x_2, \ldots, x_n) = -\sum_{i=1}^{n} \frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2}\right)

Since σ is known, the maximization problem reduces to a least-squares problem:

\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2

which has the solution

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
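As a cross-check (an added sketch, not part of the slides), the same estimate can be obtained numerically by minimizing the negative log-likelihood; scipy.optimize.minimize_scalar is assumed to be available, and the data is synthetic:

import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch: numerical MLE of the normal mean with known sigma,
# compared with the closed-form solution (the sample mean).
rng = np.random.default_rng(5)
sigma = 1.5
x = rng.normal(loc=-1.0, scale=sigma, size=5_000)

def neg_log_likelihood(mu):
    # negative log-likelihood of N(mu, sigma^2) with sigma known
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

res = minimize_scalar(neg_log_likelihood)
print(res.x, x.mean())       # the two estimates agree up to numerical tolerance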

Invariance of Maximum Likelihood Estimation

Theorem 2. The invariance property of MLEs states that if θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).

Suppose that a distribution is parameterized by a parameter θ, but we are interested in finding an estimator for some function of θ, say τ(θ); then we can still use the MLE for θ. An example is as follows: if θ is the mean of a normal distribution, the MLE of sin(θ) is sin(µ̂).
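Another standard illustration (added here, consistent with the normal example above): the MLE of the variance of a normal distribution is \hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2, so by invariance, with τ(θ) = √θ, the MLE of the standard deviation is

\hat{\sigma}_{ML} = \tau(\hat{\sigma}^2_{ML}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2} .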

Quality of estimators: MSE

Definition 4. The mean squared error (MSE) of an estimator W of a parameter θ is the function defined by E_θ[(W − θ)²]. The MSE of W measures the average squared distance between the estimator and the true value of the parameter. The MSE has the following interpretation:

E_\theta[(W - \theta)^2] = Var_\theta[W] + (E_\theta[W] - \theta)^2

The first term is the variance of the estimator W and the second term is the squared bias.

Definition 5. The bias of an estimator W of the parameter θ is the distance between the expected value of W and the true value of θ, i.e. Bias_θ[W] = E_θ[W] − θ. An estimator whose bias is zero is called an unbiased estimator.
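The decomposition above follows by adding and subtracting E_θ[W] (a standard derivation, added for completeness):

E_\theta[(W - \theta)^2] = E_\theta\left[\left((W - E_\theta[W]) + (E_\theta[W] - \theta)\right)^2\right]
 = E_\theta[(W - E_\theta[W])^2] + 2\,(E_\theta[W] - \theta)\, E_\theta[W - E_\theta[W]] + (E_\theta[W] - \theta)^2
 = Var_\theta[W] + (E_\theta[W] - \theta)^2 ,

since the cross term vanishes, E_θ[W − E_θ[W]] = 0.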

Quality of estimators: Bias and variance

In the multivariate case where θ is a vector of parameters, the variance of the estimator is a covariance matrix Cov(θ̂). An estimator with low variance (covariance) is called an efficient estimator (in the sense that little data is needed). The MSE often reflects a trade-off between an unbiased estimator with higher variance and a biased but efficient estimator. The true value of the MSE can often not be determined since the true value of θ is not known. Therefore, we focus on the variance of the estimator in order to describe its quality.

Definition 6. An estimator is called consistent when θ̂ \xrightarrow{p} θ, where \xrightarrow{p} denotes convergence in probability. An unbiased estimator whose variance tends to zero as n → ∞ is consistent.

Quality of estimators: Normal distribution example

When we have X1, X2, ... i.i.d. data from a N(µ, σ²) distribution and use the sample mean X̄ and sample variance S² as estimators:

• E[X̄] = µ and therefore X̄ is unbiased
• E[S²] = σ² and therefore S² is unbiased
• E[(X̄ − µ)²] = Var[X̄] = σ²/n
• E[(S² − σ²)²] = Var[S²] = 2σ⁴/(n − 1)

The MSE of X̄ is still σ²/n when the data is not normal, but this does not hold for the MSE of S² when the data is not normally distributed.
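The last expression follows from Theorem 2 (a standard derivation, added for completeness): (n − 1)S²/σ² is chi-square with n − 1 degrees of freedom, and the variance of a chi-square variable with k degrees of freedom is 2k, so

Var[S^2] = \left(\frac{\sigma^2}{n-1}\right)^2 Var\left[\frac{(n-1)S^2}{\sigma^2}\right] = \frac{\sigma^4}{(n-1)^2}\, 2(n-1) = \frac{2\sigma^4}{n-1} .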

Quality of estimators: Cramer-Rao bound for the variance

Definition 7. The Fisher information matrix J is defined as

J_{i,j} = \frac{1}{n} E_\theta\left[\frac{\partial}{\partial \theta_i} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n) \cdot \frac{\partial}{\partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n)\right],

which is known as the outer product form. Under certain regularity conditions and when the log-likelihood function is C², it can be calculated as:

J_{i,j} = -\frac{1}{n} E_\theta\left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} l(\theta_1, \ldots, \theta_k | x_1, x_2, \ldots, x_n)\right],

which is called the inner product form. Note that the expectation is conditional on θ.
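As a worked example (added here, continuing the normal-mean case with known variance σ² from the MLE slides), the inner product form gives

\frac{\partial^2}{\partial \mu^2}\, l(\mu | x_1, \ldots, x_n) = \frac{\partial^2}{\partial \mu^2}\left(-\sum_{i=1}^{n} \frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x_i - \mu)^2}{\sigma^2}\right)\right) = -\frac{n}{\sigma^2},
\qquad J = -\frac{1}{n} E_\mu\left[-\frac{n}{\sigma^2}\right] = \frac{1}{\sigma^2} .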

Quality of estimators: Cramer-Rao bound for the variance

The Cramer-Rao bound states the following:

Theorem 3. The covariance of an unbiased estimator W is bounded by

Cov(W) \geq \frac{J^{-1}}{N} ,

where N is the number of observations. This bound also allows us to make a worst-case approximation of the efficiency of an estimator. The Fisher information matrix allows us to compute the uncertainty and thus the quality of an estimator.
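Continuing the normal-mean example (an added illustration): with J = 1/σ², the bound reads Cov(µ̂) ≥ σ²/N, which is exactly Var[X̄] = σ²/N, so the sample mean attains the Cramer-Rao bound.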

Quality of estimators

MLE is the main method for finding estimators, since it has the following properties:

• Consistency: the estimator converges in probability to the value being estimated.
• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.
• Efficiency: it achieves the Cramer-Rao lower bound when the sample size tends to infinity. This means that no asymptotically unbiased estimator has lower asymptotic mean squared error than the MLE.
• The estimate of θ is approximately distributed as N(θ_ML, J^{-1}/N).
• Barrett's theorem: “The maximum-likelihood procedure in any problem is what you are most likely to do if you don’t know any statistics.”
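A minimal Python sketch (added for illustration) of consistency and asymptotic efficiency for an assumed exponential example f(x|λ) = λ e^{−λx}, where the MLE is λ̂ = 1/X̄ and the Fisher information is J = 1/λ², so the asymptotic variance is J^{-1}/N = λ²/N; all parameters are arbitrary choices:

import numpy as np

# Minimal sketch: the MLE of the exponential rate, lam_hat = 1/mean(x),
# concentrates around the true rate with variance close to lam^2/N (Cramer-Rao).
rng = np.random.default_rng(6)
lam_true, N, trials = 2.0, 2_000, 5_000

samples = rng.exponential(scale=1.0 / lam_true, size=(trials, N))
lam_hat = 1.0 / samples.mean(axis=1)            # MLE for each trial

print("mean of the MLE:", lam_hat.mean())        # close to lam_true
print("empirical variance:", lam_hat.var())      # close to lam_true**2 / N
print("Cramer-Rao bound J^{-1}/N:", lam_true**2 / N)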
