II: Statistical Analysis

Prof. Dr. Alois Kneip, Statistische Abteilung, Institut für Finanzmarktökonomie und Statistik, Universität Bonn

Contents:
1. Empirical Distributions, Quantiles and Nonparametric Tests
2. Nonparametric
3.
4. Bootstrap
5. Semiparametric Models

EconometricsII-Kneip 0–1

Some literature:
• Gibbons, J.D. (1971): Nonparametric Statistical Inference; McGraw-Hill
• Bowman, A.W. and Azzalini, A. (1997): Applied Smoothing Techniques for Data Analysis; Clarendon Press
• Li and Racine (2007): Nonparametric Econometrics; Princeton University Press
• Greene, W.H. (2008): Econometric Analysis; Pearson Education
• Silverman, B.W. (1986): Density Estimation for Statistics and Data Analysis; Chapman and Hall
• Davison, A.C. and Hinkley, D.V. (2005): Bootstrap Methods and their Application; Cambridge University Press
• Yatchew, A. (2003): Semiparametric Regression for the Applied Econometrician; Cambridge University Press
• Hastie, T., Tibshirani, R. and Friedman, J. (2001): The Elements of Statistical Learning; Springer Verlag

1 Empirical distributions, quantiles and nonparametric tests

1.1 The empirical distribution function

The distribution of a real-valued random variable X can be completely described by its distribution function

$$F(x) = P(X \le x) \quad \text{for all } x \in \mathbb{R}.$$

It is well known that any distribution function possesses the following properties:
• F(x) is a monotonically increasing function of x.
• Any distribution function is right-continuous:
$$\lim_{\Delta \to 0} F(x + |\Delta|) = F(x) \quad \text{for any } x \in \mathbb{R}.$$
Furthermore,
$$\lim_{\Delta \to 0} F(x - |\Delta|) = F(x) - P(X = x).$$
• If F(x) is (absolutely) continuous, then there exists a density f such that
$$\int_{-\infty}^{x} f(t)\,dt = F(x) \quad \text{for all } x \in \mathbb{R}.$$
If f is continuous at x, then $F'(x) = f(x)$.

Data: i.i.d. random sample X1,...,Xn

For given data, the sample analogue of F is the so-called empirical distribution function, which is an important statistical tool. Let I(·) denote the indicator function, i.e., I(x ≤ t) = 1 if x ≤ t, and I(x ≤ t) = 0 if x > t.

Empirical distribution function:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

i.e. Fn(x) is the proportion of observations with Xi ≤ x.

Properties:

• 0 ≤ Fn(x) ≤ 1

• Fn(x) = 0 if x < X(1), where X(1) is the smallest observation

• Fn(x) = 1 if x ≥ X(n), where X(n) is the largest observation

• Fn is a monotonically increasing step function

Example

i     1     2     3     4     5     6     7     8
xi    5.20  4.80  5.40  4.60  6.10  5.40  5.80  5.50

Corresponding empirical distribution function:

[Figure: plot of the empirical distribution function Fn(x) for x between 4.0 and 6.5]
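The definition of Fn can be checked numerically on the eight observations of the example. The following is only a minimal sketch; the helper name `ecdf` is chosen here for illustration and does not come from the lecture.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F_n as a function: F_n(x) = (1/n) * #{i : X_i <= x}."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts the observations that are <= x
        return bisect_right(ordered, x) / n

    return F_n


data = [5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50]
F8 = ecdf(data)
for x in (4.5, 4.6, 5.4, 6.1):
    print(x, F8(x))
```

The function is zero below the smallest observation, jumps by 1/8 at every observation (by 2/8 at the duplicated value 5.40), and equals one from the largest observation onwards.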

For real-valued random variables the empirical distribution function is closely linked with the so-called “order statistics”.

• Given a sample X1,...,Xn, the corresponding order statistics are the n-tuple of the ordered observations (X(1),...,X(n)), where X(1) ≤ X(2) ≤ · · · ≤ X(n).

• For r = 1, . . . , n, X(r) is called the r-th order statistic.

Order statistics can only be determined for one-dimensional random variables. But an empirical distribution function can also be defined for random vectors. Let X be a d-dimensional random variable defined on $\mathbb{R}^d$, and let $X_i = (X_{i1},\dots,X_{id})^T$, i = 1, . . . , n, denote an i.i.d. sample of random vectors from X. Then for any $x = (x_1,\dots,x_d)^T$

$$F(x) = P(X_1 \le x_1,\dots,X_d \le x_d)$$
and
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_{i1} \le x_1,\dots,X_{id} \le x_d).$$

We can also define the so-called “empirical measure” Pn. For any $A \subset \mathbb{R}^d$

$$P_n(A) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A).$$

Note that Pn(A) simply quantifies the relative frequency of observations falling into A. As n → ∞,
$$P_n(A) \to_P P(A).$$
Note that Pn of course depends on the observations and thus is random. At the same time, however, it possesses all properties of a probability measure.

When knowing Fn we can uniquely reconstruct all observed values {X1,...,Xn}. The only information lost is the exact succession of these values. For i.i.d. samples this information is completely irrelevant for all statistical purposes. All important statistics (and estimators) can thus be written as functions of Fn (or Pn). In particular, in the theoretical literature expectations and the corresponding sample averages are often represented in the following form: For a continuous function g

$$E(g(X)) = \int g(x)\,dP = \int g(x)\,dF(x)$$
and
$$\frac{1}{n} \sum_{i=1}^{n} g(X_i) = \int g(x)\,dP_n = \int g(x)\,dF_n(x).$$

Here, $\int g(x)\,dF(x)$ refers to the Stieltjes integral. This is a generalization of the well-known Riemann integral. Let d = 1, and consider a partition $a = x_0 < x_1 < \dots < x_m = b$ of an interval [a, b]. Then

$$\int_a^b g(x)\,dF(x) = \lim_{m \to \infty;\ \sup_j |x_j - x_{j-1}| \to 0} \ \sum_{j=1}^{m} g(\xi_j)\,\bigl(F(x_j) - F(x_{j-1})\bigr)$$

if the limit exists and is independent of the specific choices of $\xi_j \in [x_{j-1}, x_j]$. It can be shown that for any continuous function g and any distribution function F the corresponding Stieltjes integral exists for any finite interval [a, b]. The integral $\int_{-\infty}^{\infty} g(x)\,dF(x) \equiv \int g(x)\,dF(x)$ corresponds to the limit (if existent) as a → −∞, b → ∞.

1.2 Theoretical properties of empirical distribution functions

In the following we will assume that X is a real-valued random variable (d = 1).

Theorem: For every x ∈ IR

$$nF_n(x) \sim B(n, F(x)),$$

i.e., nFn(x) has a binomial distribution with parameters n and F(x). The distribution of Fn(x) is thus given by

$$P\left(F_n(x) = \frac{m}{n}\right) = \binom{n}{m} F(x)^m (1 - F(x))^{n-m}, \quad m = 0, 1, \dots, n.$$

Some consequences:

• $E(F_n(x)) = F(x)$, i.e. Fn(x) is an unbiased estimator of F(x).

• $Var(F_n(x)) = \frac{1}{n} F(x)(1 - F(x))$, i.e. as n increases the variance of Fn(x) decreases.

• Fn(x) is a (weakly) consistent estimator of F(x).

Theorem of Glivenko-Cantelli:

$$P\left(\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| = 0\right) = 1$$
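The uniform convergence stated by the Glivenko-Cantelli theorem can be illustrated by simulation. The sketch below assumes, purely for illustration, that the data come from a U(0, 1) distribution, so that F(x) = x on [0, 1]; the function name and the seed are arbitrary choices.

```python
import random


def sup_distance_uniform(sample):
    """sup_x |F_n(x) - x| for a sample assumed to come from U(0,1), where F(x) = x."""
    ordered = sorted(sample)
    n = len(ordered)
    # F_n is a step function jumping at the order statistics, so the supremum is
    # reached at a jump: compare F(x_(i)) with the value i/n at the jump and with
    # the value (i-1)/n just before it.
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(ordered, start=1))


rng = random.Random(1)
for n in (100, 10_000):
    d = sup_distance_uniform([rng.random() for _ in range(n)])
    print(n, round(d, 4))
```

In typical runs the printed distance shrinks roughly at the rate 1/√n as the sample size grows.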

The distribution of Y = F(X)

Note: there is an important difference between F(x) and F(X):
• For any fixed x ∈ IR the corresponding value F(x) is also a fixed number, F(x) = P(X ≤ x).
• F(X) is a random variable, where F denotes the distribution function of X.

Theorem: Let X be a random variable with a continuous distribution function F. Then Y = F(X) has a (continuous) uniform distribution on the interval (0, 1), i.e.

F (X) ∼ U(0, 1),

P (a ≤ F (X) ≤ b) = b − a for all 0 ≤ a < b ≤ 1

Consequence: If F is continuous, then

• F (X1),...,F (Xn) can be interpreted as an i.i.d. random sample of observations from a U(0, 1) distribution

• (F(X(1)),...,F(X(n))) are the corresponding order statistics

1.3 Quantiles

Quantiles are an essential tool for statistical analysis. They provide important information for characterizing location and dispersion of a distribution. In statistical inference they play a central role in measuring risk. Let X denote a real-valued random variable with distribution function F.

Quantiles: For 0 < τ < 1, any qτ ∈ IR satisfying

$$F(q_\tau) = P(X \le q_\tau) \ge \tau \quad \text{and} \quad P(X \ge q_\tau) \ge 1 - \tau$$

is called a τth quantile (or simply τ-quantile) of X.

Note that quantiles are not necessarily unique. For given τ, there may exist an interval of possible values fulfilling the above conditions. But if X is a continuous random variable with density f, then qτ is unique if f(qτ) > 0 (then F(qτ) = τ and F(q) ≠ τ for all q ≠ qτ).

In the statistical literature most work on quantiles is based on the so-called quantile function, which is defined as an “inverse” distribution function. For 0 < τ < 1 the quantile function is defined by

$$Q(\tau) := \inf\{y \mid F(y) \ge \tau\}$$

• For any 0 < τ < 1 the value qτ = Q(τ) is a τ-quantile satisfying the above conditions. If there is an interval of possible values for qτ, Q(τ) selects the smallest possible value.
• Like the distribution function, the quantile function provides a complete characterization of the random variable X.
• If the distribution function F(x) is strictly monotonically increasing, then Q(τ) is the inverse of F, $Q(\tau) = F^{-1}(\tau)$.

Important quantiles:

• µmed = Q(0.5) is the median of X (with probability at least 0.5 an observation is smaller than or equal to Q(0.5), and with probability at least 0.5 an observation is larger than or equal to Q(0.5)).
• Q(0.25) and Q(0.75) are called lower and upper quartile, respectively. Instead of the standard deviation, the inter-quartile range IRQ = Q(0.75) − Q(0.25) (also called quartile coefficient of dispersion) is frequently used as a measure of dispersion. Note that P(X ∈ [Q(0.25), Q(0.75)]) ≈ 0.5.
• Q(0.1), Q(0.2), ..., Q(0.9) are the “deciles” of X.
• Q(0.01), Q(0.02), ..., Q(0.99) are the “percentiles” of X.

The median is of particular interest. It is often preferred to the mean µ = E(X) in order to localize the center of a distribution. Unlike the mean, the median is defined for any real-valued random variable X. The median is a robust measure: its value is not much affected by the tails of a distribution (⇒ empirically, outliers in the data do not play much of a role when estimating a median or quartiles). If a distribution is heavily skewed, then the median is more informative than the mean for localizing the “center” of a distribution.

• If the distribution of X is symmetric, then µmed = µ (provided that µ = E(X) exists).

• For skewed distributions µ ≠ µmed. In general,

µmed < µ if the distribution is right-skewed,

µmed > µ if the distribution is left-skewed.

For many important measures “summarizing” characteristics of a distribution, there exist different versions which are based either on moments or on quantiles. The quantile-based versions are necessarily more robust, since quantiles are well-defined for any distribution, while the existence of moments already introduces some restriction.

Some summary measures:

• Location measures: mean µ = E(X), median µmed
• Dispersion measures: standard deviation σ, IRQ
• Skewness measures:

$$\gamma_1 := E\left(\left(\frac{X - \mu}{\sigma}\right)^3\right), \qquad \gamma(\tau) = \frac{Q(\tau) + Q(1-\tau) - 2\mu_{med}}{Q(\tau) - Q(1-\tau)} \quad (\text{for } \tau > 0.5)$$

In empirical analysis, sample quantiles are used to estimate the unknown true quantiles of X.

Data: i.i.d. random sample X1,...,Xn from X

The sample quantile function Qn(τ) is then defined by using the empirical distribution function Fn instead of F.

The sample quantile function: For 0 < τ < 1 define

$$Q_n(\tau) := \inf\{y \mid F_n(y) \ge \tau\}$$

• For a fixed τ ∈ (0, 1), Qn(τ) is called the τth sample quantile.
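The definition Qn(τ) = inf{y | Fn(y) ≥ τ} translates almost literally into code. The sketch below applies it to the eight observations used for the empirical distribution function in Section 1.1; the function name `sample_quantile` is chosen here for illustration. Note that statistical software often uses interpolated sample quantiles, which can differ slightly from this definition.

```python
def sample_quantile(sample, tau):
    """Q_n(tau) = inf{y : F_n(y) >= tau} for 0 < tau < 1."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    ordered = sorted(sample)
    n = len(ordered)
    # F_n(X_(i)) >= i/n, and F_n(y) <= (i-1)/n for every y < X_(i), so the
    # infimum is the first order statistic X_(i) with i/n >= tau.
    for i, y in enumerate(ordered, start=1):
        if i / n >= tau:
            return y


data = [5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50]
quartiles = [sample_quantile(data, t) for t in (0.25, 0.5, 0.75)]
print(quartiles)  # [4.8, 5.4, 5.5]
```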

A frequently used tool for descriptive data analysis is the so-called boxplot. The boxplot provides a graphical description of the empirical distribution of the observed data by using sample quantiles. It provides information about median, lower and upper quartiles, as well as outliers.

Example: Ordered sample (n = 10): 0.1 0.1 0.2 0.4 0.5 0.7 0.9 1.2 1.4 1.9

[Figures: two plots of this sample, each with horizontal axis x from 0.0 to 2.0]

[Figure: plot for the two groups “Frauen” (women) and “Maenner” (men), with axis from 0 to 40]

1.4 Nonparametric tests: the Kolmogorov-Smirnov test

There exists an enormous variety of nonparametric tests for different statistical problems. Starting with the Kolmogorov-Smirnov one-sample test we will introduce some important test procedures which are based on the use of empirical distribution functions and order statistics.

There exist many further “classical” nonparametric tests based on various approaches. A reference is the book by Gibbons (1971). Although approaches and setups are different, there are some common characteristics shared by all of these tests:
• Generality: The null hypothesis of interest is formulated in a general way; no parametrization, no dependence on existence and values of moments of specific distributions.
• Distribution-free tests: The distribution of the test statistic under H0 does not depend on the underlying distribution of the variable of interest.
• Robustness: Test results should not be unduly affected by “outliers” or small departures from the model assumptions.

Goodness-of-fit tests: There are a number of nonparametric tests which try to assess whether a given distribution is suited to a dataset. The aim is to verify whether an observed variable possesses a specified distribution, e.g. an exponential distribution with parameter λ = 1 or a normal distribution with mean 0 and variance 1. The most important test in this context is the Kolmogorov-Smirnov test.

Assumption: Real-valued random variable X with continuous distribution function F

Data: i.i.d. random sample X1,...,Xn from X

Goal: Test of the null hypothesis H0 : F = F0, where F0 is a given distribution function.

Idea: Fn(x) is an unbiased and consistent estimator of F(x). Hence, if the null hypothesis is correct and F = F0, the differences |Fn(x) − F0(x)| should be sufficiently small.

Kolmogorov-Smirnov test:

H0 : F(x) = F0(x) for all x ∈ IR

H1 : F(x) ≠ F0(x) for some x ∈ IR

Test statistic:

$$D_n = \sup_{x \in \mathbb{R}} |F_n(x) - F_0(x)|$$

H0 is rejected if $D_n > d_{n,1-\alpha}$, where $d_{n,1-\alpha}$ is the (1 − α)-quantile of the distribution of Dn under H0.

Problem: Distribution of Dn under H0?

a) Under H0 : F = F0 the test statistic Dn is distribution-free. Its distribution coincides with the distribution of the random variable
$$D_n^* = \sup_{y \in [0,1]} |y - F_n^*(y)|.$$
Here, $F_n^*$ denotes the empirical distribution function of an i.i.d. sample Y1,...,Yn from a U(0, 1)-distribution.

b) Asymptotic distribution (n large): For every λ > 0 we obtain
$$\lim_{n \to \infty} P\left(D_n \le \lambda/\sqrt{n}\right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\lambda^2}$$

• Result a) implies that the critical values of a Kolmogorov-Smirnov test can be approximated by Monte Carlo simulations:
  – Using a random number generator draw an i.i.d. sample Y1,...,Yn from a U[0, 1]-distribution, and calculate the corresponding value $D_{n,1}^* = \sup_{y \in [0,1]} |y - F_n^*(y)|$.
  – Iterate k times (k large, e.g. k = 2000) ⇒ k values: $D_{n,1}^*, D_{n,2}^*, \dots, D_{n,k}^*$
  – The (1 − α)-quantile of the empirical distribution of $D_{n,1}^*, D_{n,2}^*, \dots, D_{n,k}^*$ provides an approximation of $d_{n,1-\alpha}$ (the larger k, the more accurate the approximation).

• There exist tables providing critical values $d_{n,1-\alpha}$ for small n.
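The Monte Carlo recipe above can be sketched in a few lines. The function names, the seed and the default k = 5000 are choices made here for illustration; the simulated value varies slightly with the seed and with k.

```python
import random


def ks_statistic_uniform(sample):
    """D_n^* = sup_{y in [0,1]} |y - F_n^*(y)| for a sample from U(0,1)."""
    ordered = sorted(sample)
    n = len(ordered)
    # the supremum is attained at a jump of F_n^* (value at the jump or just before it)
    return max(max(i / n - y, y - (i - 1) / n) for i, y in enumerate(ordered, start=1))


def simulated_critical_value(n, alpha, k=5000, seed=0):
    """Approximate d_{n,1-alpha} by the (1-alpha) sample quantile of k simulated D_n^*."""
    rng = random.Random(seed)
    draws = sorted(
        ks_statistic_uniform([rng.random() for _ in range(n)]) for _ in range(k)
    )
    # sample quantile Q_k(1 - alpha) = inf{d : F_k(d) >= 1 - alpha}
    for i, d in enumerate(draws, start=1):
        if i / k >= 1 - alpha:
            return d


print(simulated_critical_value(n=10, alpha=0.05))
```

For n = 10 and α = 0.05 the result should lie close to the tabulated critical value 0.409 used in the example that follows.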

Example: A manufacturer of a certain SUV claims that when driving at a constant speed of 100 km/h the fuel consumption of the SUV is normally distributed with mean µ = E(X) = 12 and standard deviation σ = 1. A random sample of 10 SUVs leads to the following observed fuel consumptions:

12.4 11.8 12.9 12.6 13.0 12.5 12.0 11.5 13.2 12.8

Calculating the K-S test statistic yields (n = 10): D10 = 0.3554

Critical value of the test for n = 10 and α = 0.05: d10,0.95 = 0.409

⇒ H0 is not rejected, since 0.3554 < 0.409
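The value D10 = 0.3554 can be reproduced directly from the data. The sketch below uses only the Python standard library; the helper names `normal_cdf` and `ks_statistic` are illustrative, and the formula for the supremum assumes a continuous F0.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def ks_statistic(sample, F0):
    """D_n = sup_x |F_n(x) - F0(x)| for a continuous F0."""
    ordered = sorted(sample)
    n = len(ordered)
    # at each order statistic compare F0 with F_n at the jump (i/n)
    # and with F_n just before the jump ((i-1)/n)
    return max(
        max(i / n - F0(x), F0(x) - (i - 1) / n) for i, x in enumerate(ordered, start=1)
    )


consumption = [12.4, 11.8, 12.9, 12.6, 13.0, 12.5, 12.0, 11.5, 13.2, 12.8]
D10 = ks_statistic(consumption, lambda x: normal_cdf(x, mu=12.0, sigma=1.0))
print(round(D10, 4))  # 0.3554
```

The maximum is reached just before the fourth order statistic 12.4, where F0(12.4) ≈ 0.6554 while the empirical distribution function still equals 0.3.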

Remark: In principle, the test may also be used for discrete distributions. In this case the test is conservative, i.e. under H0 the probability of a type I error is usually smaller than α.

Composite null hypotheses

It is common to speak of a composite null hypothesis if F0(x) ≡ F0(x, θ) is only specified up to an unknown parameter vector $\theta \in \mathbb{R}^m$. An example is the normal distribution with unknown mean and variance, i.e. θ = (µ, σ²). In such a case the aim is simply to test whether the data are “normally distributed” (irrespective of the particular mean and variance).

Testing problem:

H0 : F(x) = F0(x, θ) for all x ∈ IR; θ unknown

H1 : For all possible θ: F(x) ≠ F0(x, θ) for some x ∈ IR

Test statistic:

$$\hat{D}_n = \sup_{x \in \mathbb{R}} |F_n(x) - F_0(x, \hat{\theta})|$$

Here, $\hat{\theta}$ denotes the maximum-likelihood estimate of θ.
Normal distribution: $\hat{\theta} = (\bar{X}, \hat{\sigma}^2)$, $\hat{\sigma}^2 = \frac{1}{n} \sum_i (X_i - \bar{X})^2$.

H0 is rejected if $\hat{D}_n > d_{n,1-\alpha}$.
• In general one uses the same critical values as in the case of a simple null hypothesis (see above). This implies that the test is conservative, i.e. under H0 the probability of a type I error is usually smaller than α.
• For the special case of a normal distribution, exact critical values have been determined by Lilliefors. The resulting “Lilliefors test” is implemented in many statistical program packages.

1.5 Nonparametric one-sample tests

1.5.1 Rank statistics

Many nonparametric tests are (implicitly or explicitly) based on ranks of observations. Ranks are easily determined from order statistics.

• Consider an i.i.d. random sample X1,...,Xn from a continuous random variable X. If Xi ≠ Xj for all i ≠ j, then the rank r(Xi) of observation Xi, i = 1, . . . , n, is defined by
$$r(X_i) := \sum_{j=1}^{n} I(X_j \le X_i).$$
This implies that the smallest observation has rank 1, while the largest observation has rank n, and
$$r(X_{(i)}) = i, \quad i = 1, \dots, n.$$

• For an i.i.d. sample from a continuous random variable we have P(Xi = Xj for some i ≠ j) = 0. Consequently, with probability 1, r(X1), . . . , r(Xn) is a random permutation of all natural numbers between 1 and n.
  – $E(r(X_i)) = \frac{n+1}{2}$
  – $Var(r(X_i)) = \frac{n^2-1}{12}$
• In practice, it can of course occur that there exist “ties”, i.e. different observations which have equal values. In this case an average rank is assigned to all observations with identical value.

Examples (n = 5):

Xi       0.3    1.5    −0.1   0.8    1.0
r(Xi)    2      5      1      3      4

Xi       2.0    0.5    0.9    1.3    2.6
r(Xi)    4      1      2      3      5

Xi       1.09   2.17   2.17   2.17   3.02
r(Xi)    1      3      3      3      5

Xi       0.5    0.5    0.9    1.3    1.3
r(Xi)    1.5    1.5    3      4.5    4.5

Note: If there are ties, then the empirical variance of r(Xi) is necessarily smaller than $\frac{n^2-1}{12}$.
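The rank tables above, including the average ranks assigned to ties, can be reproduced with a short routine. This is only a sketch; the function name `average_ranks` is chosen here for illustration.

```python
def average_ranks(values):
    """Ranks 1..n of the observations; tied observations share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next ordered value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of the ranks start+1, ..., end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


print(average_ranks([0.3, 1.5, -0.1, 0.8, 1.0]))       # [2.0, 5.0, 1.0, 3.0, 4.0]
print(average_ranks([1.09, 2.17, 2.17, 2.17, 3.02]))   # [1.0, 3.0, 3.0, 3.0, 5.0]
print(average_ranks([0.5, 0.5, 0.9, 1.3, 1.3]))        # [1.5, 1.5, 3.0, 4.5, 4.5]
```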

1.5.2 Linear rank statistics (one sample)

Consider a random variable X with continuous distribution function F.

Data: i.i.d. random sample X1,...,Xn

Nonparametric one-sample tests try to verify hypotheses concerning the location of the center of a distribution. More precisely, they aim to test whether the median µmed is equal to a pre-specified value µ0.

Recall that for a continuous random variable the median µmed necessarily satisfies F(µmed) = 0.5. For simplicity, in the following we will only consider two-sided tests. One-sided tests are completely analogous.

Formal testing problem:

H0 : µmed = µ0

H1 : µmed ≠ µ0

Example: To study the intelligence of PhD students at a certain university, n = 10 students were randomly selected and their IQ values were measured using an IQ test. This led to the following 10 observations:

Xi 99 131 118 112 128 136 120 107 134 122

Question: Is the data compatible with the hypothesis H0 : µmed = 110?

Linear rank statistics for the one-sample problem rely on the ranks of the absolute values of the differences Di = Xi − µ0:

r(|Di|) := rank of |Di| = |Xi − µ0| in the sample of the absolute values |D1|, . . . , |Dn|

Moreover, let

$$V_i := \begin{cases} 1 & \text{if } X_i - \mu_0 > 0 \\ 0 & \text{if } X_i - \mu_0 \le 0 \end{cases}$$

For a suitable weight function g a linear rank statistic $L_n^+$ is then defined by

$$L_n^+ = \sum_{i=1}^{n} g(r(|D_i|)) \cdot V_i$$

IQ-example (µ0 = 110):

Xi        99   131  118  112  128  136  120  107  134  122
Vi        0    1    1    1    1    1    1    0    1    1
|Di|      11   21   8    2    18   26   10   3    24   12
r(|Di|)   5    8    3    1    7    10   4    2    9    6

There exist some general theoretical results on the choice of a suitable weight function for constructing locally optimal rank tests. The term “locally optimal” refers to the assumption that the underlying F is “close” to some pre-specified parametric distribution (e.g. normal). In practice, the most frequently used linear rank tests are the sign test and the Wilcoxon test.

The sign test: The sign test is the linear rank test with the simplest possible weight function: g(x) = 1 for all x. For testing H0 : µmed = µ0 the sign test thus relies on the test statistic

$$V_n^+ = \sum_{i=1}^{n} V_i$$

• Under H0 we obtain $P(V_i = 1) = \frac{1}{2}$ and $P(V_i = 0) = \frac{1}{2}$.
• This implies that the null distribution of $V_n^+$ is a binomial distribution with parameters n and $\frac{1}{2}$,
$$V_n^+ \sim B\left(n, \tfrac{1}{2}\right).$$

⇒ For a given significance level α > 0, the sign test rejects H0 if either $P(B_{n,1/2} \le V_n^+) \le \alpha/2$ or $P(B_{n,1/2} \ge V_n^+) \le \alpha/2$, where $B_{n,1/2}$ denotes a $B(n, \frac{1}{2})$-distributed random variable and $V_n^+$ is treated as the observed value.

n large: the binomial distribution may be approximated by a normal distribution. Under H0 we have approximately

$$\frac{V_n^+ - n/2}{\sqrt{n/4}} \sim AN(0, 1)$$

Remark: Since F is continuous we have P (Xi − µ0 = 0) = 0. In practice, however, there may exist observations with Xi −µ0 = 0. In this case it is common practice to eliminate these observations and to apply the sign test to the corresponding reduced sample.
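Applied to the IQ example with µ0 = 110, the sign test can be carried out with exact binomial tail probabilities. The sketch below follows the rejection rule stated above; the function name `sign_test` and the choice α = 0.05 are assumptions made for illustration.

```python
from math import comb


def sign_test(sample, mu0):
    """Return (V_plus, n, P(B <= V_plus), P(B >= V_plus)) with B ~ B(n, 1/2)."""
    diffs = [x - mu0 for x in sample if x != mu0]  # observations equal to mu0 are dropped
    n = len(diffs)
    v_plus = sum(1 for d in diffs if d > 0)
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    return v_plus, n, sum(pmf[: v_plus + 1]), sum(pmf[v_plus:])


iq = [99, 131, 118, 112, 128, 136, 120, 107, 134, 122]
v_plus, n, p_lower, p_upper = sign_test(iq, mu0=110)
alpha = 0.05
reject = p_lower <= alpha / 2 or p_upper <= alpha / 2
print(v_plus, round(p_upper, 4), reject)  # 8 0.0547 False
```

Eight of the ten differences are positive; the upper tail probability 56/1024 ≈ 0.0547 exceeds α/2 = 0.025, so under these assumptions the sign test does not reject H0 : µmed = 110.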

The Wilcoxon test: The Wilcoxon test is a linear rank test based on the weight function g(x) = x for all x. It relies on the additional assumption that the underlying distribution is symmetric. The test statistic is

$$W_n^+ = \sum_{i=1}^{n} r(|D_i|) \cdot V_i$$

For a given significance level α > 0, the Wilcoxon test rejects H0 if either $W_n^+ \le w_{n,\alpha/2}$ or $W_n^+ \ge w_{n,1-\alpha/2}$. Here, $w_{n,\alpha/2}$ and $w_{n,1-\alpha/2}$ are the corresponding quantiles of the distribution of $W_n^+$ under H0.

• If F is symmetric, then under H0 the statistic $W_n^+$ is distribution-free. Under H0, V1,...,Vn are i.i.d. with $P(V_i = 1) = \frac{1}{2}$ and $P(V_i = 0) = \frac{1}{2}$, while symmetry of F implies that the random variables Vi and |Di| are independent. Hence, all possible combinations of zeros and ones for the indicator variables V1,...,Vn are equally probable, while at the same time r(|D1|), . . . , r(|Dn|) are purely random permutations of {1, . . . , n}. Therefore, critical values can be obtained by straightforward combinatorial methods.

• Asymptotic approximation (n large):

$$\frac{W_n^+ - \frac{n(n+1)}{4}}{\sqrt{Var(W_n^+)}} \sim AN(0, 1), \quad \text{where } Var(W_n^+) = \frac{n(n+1)(2n+1)}{24}$$

Note: The theoretical derivation of the null distribution relies on the assumption of a continuous random variable (probability of ties equal to zero). Ties may of course exist in practice. Then the above distributions are only approximately valid, and the accuracy of the approximation decreases with the number of ties. In the literature one can find formulas which provide corrected critical values in the presence of ties.
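For the IQ example (µ0 = 110) the Wilcoxon statistic and its exact null distribution can be computed by brute force, since n = 10 gives only 2^10 sign patterns. The sketch assumes no ties and no zero differences, as in the example; the function names are chosen for illustration.

```python
from itertools import product
from math import sqrt


def wilcoxon_signed_rank(sample, mu0):
    """W_n^+ = sum of the ranks of |D_i| over all i with D_i > 0 (no ties or zeros assumed)."""
    diffs = [x - mu0 for x in sample]
    by_abs = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(rank for rank, i in enumerate(by_abs, start=1) if diffs[i] > 0)


def exact_upper_tail(n, w):
    """P(W_n^+ >= w) under H0: all 2^n patterns (V_1,...,V_n) are equally likely."""
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r * v for r, v in zip(range(1, n + 1), signs)) >= w
    )
    return hits / 2 ** n


iq = [99, 131, 118, 112, 128, 136, 120, 107, 134, 122]
n = len(iq)
w_plus = wilcoxon_signed_rank(iq, mu0=110)
z = (w_plus - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
print(w_plus, exact_upper_tail(n, w_plus), round(z, 3))
```

Here W+ = 48 (all ranks except 5 and 2), and the exact upper tail probability is 19/1024 ≈ 0.0186. Since this is below α/2 = 0.025 for α = 0.05, the two-sided Wilcoxon test rejects H0 in this example, provided the symmetry assumption holds.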

Application: Paired-sample procedures

Paired samples:

Sample (X1,Y1),...,(Xn,Yn)
• X1,...,Xn i.i.d. with distribution function FX
• Y1,...,Yn i.i.d. with distribution function FY
• Xi and Yi not independent; e.g. (Xi,Yi) repeated measurements for the same unit

Example: advertising campaign

The following table represents the weekly sales (in 10000 Euro) of a trade chain before and after an advertising campaign.

chain store            1     2     3     4     5     6
before campaign (X)    18.5  15.6  20.1  17.2  21.1  19.3
after campaign (Y)     20.2  16.6  19.8  19.3  21.9  19.0

⇒ x̄ = 18.63, ȳ = 19.47

Question: Has the advertising campaign been successful? Did the campaign (in tendency) lead to significantly higher sales?

Nonparametric approach: Analysis of the resulting sample of differences

Z1 = X1 − Y1, Z2 = X2 − Y2, . . . , Zn = Xn − Yn

The above problem can be translated into the question: Is the median of Z1,...,Zn significantly different from zero?

⇒ Testing problem:

H0 : µmed;Z = 0

H1 : µmed;Z ≠ 0

⇒ Application of the sign test (or Wilcoxon test) based on Z1,...,Zn.
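For the advertising example the sign test on the differences Zi = Xi − Yi needs only a few lines. The sketch below rounds the differences to one decimal to avoid floating-point noise; reporting the lower tail probability is a choice made here for illustration, since few positive Zi point towards higher sales after the campaign.

```python
from math import comb

before = [18.5, 15.6, 20.1, 17.2, 21.1, 19.3]  # X
after = [20.2, 16.6, 19.8, 19.3, 21.9, 19.0]   # Y
z = [round(x - y, 1) for x, y in zip(before, after)]  # Z_i = X_i - Y_i

n = len(z)
v_plus = sum(1 for d in z if d > 0)
# lower tail P(B <= v_plus) with B ~ B(n, 1/2)
p_lower = sum(comb(n, k) for k in range(v_plus + 1)) / 2 ** n
print(z, v_plus, p_lower)
```

Two of the six differences are positive, and P(B ≤ 2) = 22/64 ≈ 0.34, so with only six stores the sign test gives no evidence against H0. Note that |Z3| = |Z6| = 0.3 is a tie, which would have to be handled with average ranks if the Wilcoxon test were used instead.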

Power for detecting alternatives:
• Parametric alternative (assuming normality): Student t-test
• The asymptotic relative efficiency of the sign test relative to the t-test is 0.637 if the underlying distribution is normal. The sign test can be much more efficient than the t-test if the underlying distribution is skewed or possesses heavy tails.
• For a symmetric distribution the Wilcoxon test is typically more efficient than the sign test. The asymptotic relative efficiency of the Wilcoxon test relative to the t-test is 0.96 if the underlying distribution is normal.

1.6 Two-sample tests

In the following we consider two random variables X and Y with continuous distribution functions FX and FY.

Data: i.i.d. random samples X1,...,Xm and Y1,...,Yn from underlying populations with distribution functions FX and FY. Xi is independent of Yj for all i, j.

Problem: Test the null hypothesis H0 : FX = FY of equality of the underlying distributions

Example: Coffee and the speed of typing on a keyboard

An experiment was conducted in order to measure the influence of caffeine on the speed of typing on a computer keyboard. 20 trained test persons were randomly divided into two groups of 10 persons. The first group did not receive any beverages, but each member of the second group had to drink a big cup of coffee (administering 200 mg caffeine). Every test person then had to type a text on a keyboard. The following table provides the respective average number of characters typed per minute.

no caffeine (X)    242.8  245.3  244.0  240.2  247.1  248.3  241.7  244.7  246.5  240.4
200 mg caff. (Y)   246.4  251.1  250.2  252.3  248.0  250.9  246.1  248.2  245.6  250.0

Question: Does there exist a difference between typing speeds with and without caffeine?

Formal testing problem:

H0 : FX = FY

H1 : FX ≠ FY

For two-sample tests based on order statistics the ranks of the observations Xi and Yj in the combined sample of all n + m observations play a central role. If there are no ties, then r(Xi) is defined by

$$r(X_i) := \sum_{j=1}^{m} I(X_j \le X_i) + \sum_{j=1}^{n} I(Y_j \le X_i)$$

and consequently

$$r(X_{(i)}) = i + \sum_{j=1}^{n} I(Y_j \le X_{(i)})$$

for all i = 1, . . . , m. If H0 : FX = FY is correct, then all ranks between 1 and m + n are equally probable, $P(r(X_i) = j) = \frac{1}{n+m}$ for all j ∈ {1, . . . , m + n}. More precisely, under H0, r(X1), . . . , r(Xm) can be interpreted as m numbers randomly drawn from the set {1, 2, . . . , m + n}. All possible sequences of these m numbers are equally probable. This will not be true under the alternative.

1.6.1 The Kolmogorov-Smirnov two-sample test


• The empirical distribution functions FX,m and FY,n are unbiased and consistent estimators of FX and FY, respectively.

• If the null hypothesis H0 : FX = FY is correct, all differences |FX,m(x) − FY,n(x)| are purely random and should be sufficiently small. This motivates the two-sample test of Kolmogorov and Smirnov for testing H0 : FX = FY.

Test statistic:

$$D_{m,n} = \sup_{x \in \mathbb{R}} |F_{X,m}(x) - F_{Y,n}(x)|$$

H0 is rejected if $D_{m,n} > d_{m,n,1-\alpha}$, where $d_{m,n,1-\alpha}$ is the (1 − α)-quantile of the distribution of $D_{m,n}$ under the null hypothesis.

a) Under H0 : FX = FY, the test statistic $D_{m,n}$ is distribution-free. Critical values can be obtained by straightforward combinatorics. Recall that ties do not play any role in the theoretical analysis, since they have probability 0. We obtain

$$D_{m,n} = \max\Bigl\{ \max_{i=1,\dots,m} |F_{X,m}(X_i) - F_{Y,n}(X_i)|,\ \max_{j=1,\dots,n} |F_{X,m}(Y_j) - F_{Y,n}(Y_j)| \Bigr\}$$
$$= \max\Bigl\{ \max_{i=1,\dots,m} \Bigl| \frac{i}{m} - \frac{1}{n} \sum_{j=1}^{n} I(Y_j \le X_{(i)}) \Bigr|,\ \max_{i=1,\dots,n} \Bigl| \frac{1}{m} \sum_{j=1}^{m} I(X_j \le Y_{(i)}) - \frac{i}{n} \Bigr| \Bigr\}$$

The values of $D_{m,n}$ thus only depend on the ranks of Xi, Yj in the combined sample of all m + n observations. Since under H0 all ranks are equally probable, critical values are obtained by a simple counting procedure.

b) Asymptotic distribution (m, n large): For all λ > 0

$$\lim_{m,n \to \infty} P\left(D_{m,n} \le \lambda \Big/ \sqrt{\frac{mn}{m+n}}\right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\lambda^2}$$

c) The Kolmogorov-Smirnov test is consistent for all alternatives.
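Because both empirical distribution functions are step functions that jump only at observed values, the supremum can be evaluated on the combined sample. The following sketch computes the statistic for the caffeine data of Section 1.6; the function name `ks_two_sample` is an illustrative choice.

```python
from bisect import bisect_right


def ks_two_sample(x, y):
    """D_{m,n} = sup_z |F_{X,m}(z) - F_{Y,n}(z)|, evaluated at all observed points."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    # bisect_right(xs, z) / m is F_{X,m}(z); likewise for the Y sample
    return max(abs(bisect_right(xs, z) / m - bisect_right(ys, z) / n) for z in xs + ys)


no_caffeine = [242.8, 245.3, 244.0, 240.2, 247.1, 248.3, 241.7, 244.7, 246.5, 240.4]
caffeine = [246.4, 251.1, 250.2, 252.3, 248.0, 250.9, 246.1, 248.2, 245.6, 250.0]
D = ks_two_sample(no_caffeine, caffeine)
print(D)
```

The largest difference occurs at z = 245.3, where seven of the ten X values but none of the Y values have been reached, so the statistic equals 7/10. Whether this exceeds the critical value has to be checked against a table or the asymptotic formula in b).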

1.6.2 Linear rank statistics

• Rank tests are explicitly constructed on the basis of the ranks of Xi and Yj in the combined sample of all N = m + n observations.

• Under H0 : FX = FY the combined sample can be interpreted as an i.i.d. random sample of size N := m + n from a population with distribution function FX = FY. If there are no ties, the ranks are random permutations of the natural numbers between 1 and N. Rank tests then aim to verify whether the distribution of ranks is indeed purely random, or whether there are systematic differences between the ranks of the X and Y variables which indicate that FX ≠ FY.

Most commonly used rank tests for the two-sample problem can be classified together as linear combinations of indicator variables for the combined (ordered) samples. Such statistics are often called linear rank statistics. For the following theoretical analysis we will assume that FX and FY are continuous and that there are no ties in the samples. Let

$$V_i := \begin{cases} 1 & \text{if the } i\text{-th variable in the combined, ordered sample is an } X\text{-variable} \\ 0 & \text{else} \end{cases}$$

Linear rank statistics can now generally be written in the form

$$L_N = \sum_{i=1}^{N} a_i V_i,$$

where a1, a2, . . . are pre-specified weights (“scores”). Different test procedures use different specifications of the scores ai.

• (V1, V2, . . . , VN) is a vector consisting of m ones and n zeros. There are $\binom{N}{m}$ different possible combinations of these m ones and n zeros, each of which has the same probability under H0.

• Under H0 : FX = FY the statistic LN is distribution-free. Critical values can be determined by straightforward combinatorics:

$$P(L_N = c \mid H_0) = \frac{q(c)}{\binom{N}{m}},$$

where q(c) denotes the number of vectors (V1,...,VN) satisfying $L_N = \sum_{i=1}^{N} a_i V_i = c$.

• Moments under H0:
  – $E(V_i) = \frac{m}{N}$
  – $Var(V_i) = \frac{mn}{N^2}$
  – $Cov(V_i, V_j) = \frac{-mn}{N^2(N-1)}$ for i ≠ j
  This implies
  – $E(L_N) = \frac{m}{N} \sum_{i=1}^{N} a_i$
  – $Var(L_N) = \frac{mn}{N^2(N-1)} \Bigl( N \sum_{i=1}^{N} a_i^2 - \bigl( \sum_{i=1}^{N} a_i \bigr)^2 \Bigr)$

• Asymptotic distribution (n large):

$$Z_N = \frac{L_N - E(L_N)}{\sqrt{Var(L_N)}} \sim AN(0, 1).$$

Tests based on linear rank statistics are not consistent against all possible alternatives. However, they can be constructed in such a way that they are particularly powerful in detecting some important types of alternatives, for example shifts in location. The point is that in many practically relevant situations the structure of the distributions FX and FY is quite similar, but there exists a shift in the centers of these distributions (different medians, means).

Mathematically this can be formalized by the concept of stochastic dominance.

Definition: A real random variable X (first order) stochastically dominates a real random variable Y (written X ≥FSD Y) if

P(X > z) ≥ P(Y > z) for all z

or equivalently

FX(z) ≤ FY(z) for all z

If X ≥FSD Y, then µX,med ≥ µY,med, where µX,med and µY,med denote the medians of X and Y, respectively. Moreover, if E(X) and E(Y) exist, then E(X) ≥ E(Y).

Tests for the location problem are particularly powerful against alternatives of the form FX(z) < FY(z) or FX(z) > FY(z). Location tests based on linear rank statistics rely on specifying scores such that a1 < a2 < · · · < aN is a strictly monotonically increasing sequence.

Note that the following tests may also be able to detect alternatives where stochastic dominance of one variable is not exactly satisfied. They will, however, not be consistent against alternatives where the centers of the distributions are equal and the only difference lies in the fact that one variable is more dispersed than the other.

The Wilcoxon-Mann-Whitney test (Mann-Whitney U test):

The best known two-sample test is the Wilcoxon-Mann-Whitney test. The test statistic is a special linear rank statistic with scores ai = i, i = 1, . . . , N:

$$W_N = \sum_{i=1}^{N} i \cdot V_i = \sum_{j=1}^{m} r(X_j)$$

For 0 < α < 1 let $\omega_{N,\alpha}$ denote the α-quantile of the distribution of WN under H0.

• Two-sided test (H0 : FX = FY against H1 : FX ≠ FY): H0 is rejected if $W_N \le \omega_{N,\alpha/2}$ or $W_N \ge \omega_{N,1-\alpha/2}$.

• One-sided test (H0 : FX = FY against H1 : FX(z) < FY(z) for all z): H0 is rejected if $W_N \ge \omega_{N,1-\alpha}$.

• One-sided test (H0 : FX = FY against H1 : FX(z) > FY(z) for all z): H0 is rejected if $W_N \le \omega_{N,\alpha}$.

• Under H0, WN is distribution-free. Critical values can be obtained in a combinatorial way (see above).

• $E(W_N) = \frac{m(N+1)}{2}$, $Var(W_N) = \frac{mn(N+1)}{12}$

• Asymptotic approximation (n large): WN is approximately normal with mean $\frac{m(N+1)}{2}$ and variance $\frac{mn(N+1)}{12}$.

Note: The theoretical derivation of the null distribution relies on the assumption of continuous random variables (probability of ties equal to zero). Ties may of course exist in practice. Then the above distributions are only approximately valid, and the accuracy of the approximation decreases with the number of ties. In the literature one can find formulas which provide corrected critical values in the presence of ties.
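As an illustration, the rank sum WN and its normal approximation can be computed for the caffeine data of Section 1.6. The sketch assumes that there are no ties (true for these data); the function name `rank_sum` is chosen here for illustration.

```python
from math import sqrt


def rank_sum(x, y):
    """W_N = sum of the ranks of the X-observations in the combined sample (no ties assumed)."""
    combined = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    return sum(rank for rank, (_, label) in enumerate(combined, start=1) if label == "x")


no_caffeine = [242.8, 245.3, 244.0, 240.2, 247.1, 248.3, 241.7, 244.7, 246.5, 240.4]
caffeine = [246.4, 251.1, 250.2, 252.3, 248.0, 250.9, 246.1, 248.2, 245.6, 250.0]

m, n = len(no_caffeine), len(caffeine)
N = m + n
W = rank_sum(no_caffeine, caffeine)
mean_W = m * (N + 1) / 2       # E(W_N) under H0
var_W = m * n * (N + 1) / 12   # Var(W_N) under H0
z = (W - mean_W) / sqrt(var_W)
print(W, mean_W, var_W, round(z, 3))
```

The X observations occupy the ranks 1 to 7, 11, 12 and 15, so WN = 66 compared with an expected value of 105 under H0. The standardized value is about −2.95, which lies below −1.96, so the normal approximation suggests rejecting H0 in a two-sided test at α = 0.05; with m = n = 10 exact critical values would be preferable.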

The test by van der Waerden

The van der Waerden test relies on a special linear rank statistic with scores $a_i = \Phi^{-1}\left(\frac{i}{N+1}\right)$. Here, Φ is the distribution function of the standard normal distribution. This leads to the test statistic

$$VW_N = \sum_{i=1}^{N} \Phi^{-1}\left(\frac{i}{N+1}\right) \cdot V_i = \sum_{j=1}^{m} \Phi^{-1}\left(\frac{r(X_j)}{N+1}\right)$$

Critical values can again be obtained by using the general results for linear rank statistics.
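A minimal sketch of the van der Waerden statistic for the caffeine data of Section 1.6 is given below. It standardizes the statistic with the general formulas for E(LN) and Var(LN) of linear rank statistics, assumes no ties, and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) for Φ⁻¹; the function name is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def van_der_waerden(x, y):
    """Return (VW_N, Z_N) with scores a_i = Phi^{-1}(i / (N + 1)); no ties assumed."""
    combined = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    m, n = len(x), len(y)
    N = m + n
    scores = [NormalDist().inv_cdf(i / (N + 1)) for i in range(1, N + 1)]
    vw = sum(a for a, (_, label) in zip(scores, combined) if label == "x")
    # moments of a linear rank statistic under H0
    mean = m / N * sum(scores)
    var = m * n / (N ** 2 * (N - 1)) * (N * sum(a * a for a in scores) - sum(scores) ** 2)
    return vw, (vw - mean) / sqrt(var)


no_caffeine = [242.8, 245.3, 244.0, 240.2, 247.1, 248.3, 241.7, 244.7, 246.5, 240.4]
caffeine = [246.4, 251.1, 250.2, 252.3, 248.0, 250.9, 246.1, 248.2, 245.6, 250.0]
vw, z = van_der_waerden(no_caffeine, caffeine)
print(round(vw, 3), round(z, 3))
```

Because the scores are symmetric around zero, E(VW_N) is zero up to rounding under H0, and a clearly negative standardized value indicates that the X sample tends to take the lower ranks.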

Both tests mentioned in this section possess considerable power for detecting shifts in location. Asymptotic relative efficiencies are calculated with respect to restricted, parametrized classes of alternatives H1 : FX(t) = FY(t − δ) for some δ ∈ IR and all t ∈ IR.

Power for detecting alternatives:

• Parametric t-test: Additional assumption: normal distributions with equal variances, $X \sim N(\mu_1, \sigma^2)$ and $Y \sim N(\mu_2, \sigma^2)$ ⇒ two-sample t-test with test statistic

$$T = \frac{\bar{X} - \bar{Y}}{S\sqrt{1/n + 1/m}}$$

Under H0 the statistic T follows a Student t-distribution with N − 2 degrees of freedom (rejection of H0 if |T| is too large).

• The asymptotic relative efficiency of the Wilcoxon-Mann-Whitney-test relative to the t-test is 0.955 if the underlying distributions are normal. The Wilcoxon-Mann-Whitney-test is more efficient than the t-test for strongly skewed or heavy-tailed distributions. A lower bound for the asymptotic relative efficiency is 0.864; an upper bound does not exist.

• Assuming normal distributions, the asymptotic relative efficiency of the van der Waerden-test relative to the t-test is equal to 1. If the distributions have heavy tails, then the Wilcoxon-Mann-Whitney-test is more powerful than the van der Waerden-test.

Scale alternatives: There are also rank tests which are specialized to detect whether one random variable is more dispersed than the other (scale alternative). Such tests already rely on the assumption that the centers of the distributions are equal, i.e., µX,med = µY,med (which may be tested using a location test). Test statistics are linear rank statistics which assign small values ai to very small and very large observations, and assign large values ai to observations in the center of the distribution. The best known test in this context is the Siegel-Tukey-test. It is based on the test statistic

$$S_N = \sum_{i=1}^{N} a_i \cdot V_i,$$

where the weights ai are calculated as follows:

a1 = 1, aN = 2, aN−1 = 3, a2 = 4, a3 = 5, aN−2 = 6,

aN−3 = 7, a4 = 8, a5 = 9, aN−4 = 10,...

The critical values of the Siegel-Tukey-test coincide with the critical values of the Wilcoxon-Mann-Whitney-test.
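The alternating pattern of the Siegel-Tukey weights can be generated mechanically. The following sketch is an illustration only (the function name `siegel_tukey_scores` is hypothetical): it assigns the value 1 to the smallest position, then alternately fills two positions from the top and two from the bottom.

```python
def siegel_tukey_scores(N):
    """Return [a_1, ..., a_N] following the pattern
    a_1 = 1, a_N = 2, a_{N-1} = 3, a_2 = 4, a_3 = 5, a_{N-2} = 6, ..."""
    scores = [0] * N
    lo, hi = 0, N - 1        # next free positions (0-based) at both ends
    value = 1
    take_from_low = True
    block = 1                # first block has one element, later blocks two
    while lo <= hi:
        for _ in range(block):
            if lo > hi:
                break
            if take_from_low:
                scores[lo] = value
                lo += 1
            else:
                scores[hi] = value
                hi -= 1
            value += 1
        take_from_low = not take_from_low
        block = 2
    return scores

print(siegel_tukey_scores(10))
```

Since the weights are a permutation of 1,...,N, the null distribution of SN is the same as that of the rank sum, which is why the Wilcoxon-Mann-Whitney critical values can be reused.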

1.7 Multiple comparisons

In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole.

This is an important, although largely ignored problem in applied econometric work. In empirical studies often dozens or even hundreds of tests are performed for the same data set. When searching for significant test results, one may come up with false discoveries.

• Multiple tests: In some study many different tests are done simultaneously

• Example: m different, independent tests of significance level α > 0 (independence means that the test statistics used are mutually independent; this is usually not true in practice).

Assume that the respective null hypothesis H0 holds for each of the m tests

$$\Rightarrow\quad P(\text{Type I error by at least one of the } m \text{ tests}) = 1 - (1-\alpha)^m =: \alpha_m > \alpha$$

    m      αm
    1      0.05
    3      0.143
    5      0.226
    10     0.401
    100    0.994 (!)

⇒ Interpretation of significant results?

• Analogous problem: Construction of m (1 − α)-confidence intervals

$$P(\text{at least one of the } m \text{ confidence intervals does not contain the true parameter value}) = 1 - (1-\alpha)^m > \alpha$$
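The values αm for independent tests follow directly from the formula 1 − (1 − α)^m. A minimal sketch (the function name `familywise_error` is a hypothetical label):

```python
def familywise_error(alpha, m):
    """Probability of at least one type I error among m independent
    level-alpha tests when all null hypotheses hold."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 5, 10, 100):
    print(m, round(familywise_error(0.05, m), 3))
```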

This represents the general problem of multiple comparisons. In practice, it will not be true that all test statistics used are mutually independent. This complicates the problem even further. We will still have the effect that the probability of at least one falsely significant result increases with the number m of tests, but it will not be equal to $1 - (1-\alpha)^m$.

A statistically rigorous solution of this problem consists in modifying the constructions of tests or confidence intervals in order to arrive at simultaneous tests or simultaneous confidence intervals:

$$P(\text{Type I error by at least one of the } m \text{ tests}) \le \alpha$$

or

$$P(\text{all confidence intervals simultaneously contain the true parameter values}) \ge 1 - \alpha$$

For certain problems (e.g. ) there exist specific procedures for constructing simultaneous confidence intervals. The only generally applicable procedure seems to be the Bonferroni correction. It is based on Boole’s inequality.

Theorem (Boole): Let A1,A2,...,Am denote m different events. Then

$$P(A_1 \cup A_2 \cup \dots \cup A_m) \le \sum_{i=1}^{m} P(A_i).$$

This inequality also implies that, with $\bar{A}_i$ denoting the complementary event “not Ai”,

$$P(A_1 \cap A_2 \cap \dots \cap A_m) \ge 1 - \sum_{i=1}^{m} P(\bar{A}_i).$$

Application: Bonferroni adjustment

• m different tests of level $\alpha^* = \frac{\alpha}{m}$:

$$\Rightarrow\quad P(\text{Type I error by at least one of the } m \text{ tests}) \le \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha$$

• Analogously: Construction of m (1 − α*)-confidence intervals, $\alpha^* = \frac{\alpha}{m}$,

$$\Rightarrow\quad P(\text{all confidence intervals simultaneously contain the true parameter values}) \ge 1 - \sum_{i=1}^{m} \frac{\alpha}{m} = 1 - \alpha$$

Example: For n = 40 US corporations a multiple regression model is used to model the observed return of capital Y in dependence of 12 explanatory variables. After eliminating two outliers, the following table provides the results of the regression analysis.

    Coefficients:
                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)   0.24883     0.14386    1.730   0.09603 .
    WCFTCL        1.11519     0.36955    3.018   0.00579 **
    WCFTDT       -0.21457     0.39528   -0.543   0.59206
    GEARRAT      -0.01992     0.10610   -0.188   0.85261
    LOGSALE       0.49969     0.18335    2.725   0.01156 *
    LOGASST      -0.48743     0.17500   -2.785   0.01005 *
    NFATAST      -0.30425     0.15446   -1.970   0.06003 .
    CAPINT       -0.08022     0.03706   -2.165   0.04017 *
    FATTOT       -0.11086     0.09125   -1.215   0.23571
    INVTAST       0.23047     0.23588    0.977   0.33790
    PAYOUT        0.00168     0.01717    0.098   0.92284
    QUIKRAT       0.08012     0.10827    0.740   0.46617
    CURRAT       -0.18976     0.09244   -2.053   0.05070 .
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.0552 on 25 degrees of freedom
    Multiple R-Squared: 0.6958, Adjusted R-squared: 0.5498
    F-statistic: 4.765 on 12 and 25 DF, p-value: 0.0004878
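To connect the regression output with the Bonferroni adjustment, the following sketch compares the reported p-values with the unadjusted level α = 0.05 and with the adjusted level α* = α/m. Treating the m = 12 slope coefficients (without the intercept) as the family of tests is an assumption made here for illustration; the p-values are copied from the table.

```python
p_values = {
    "WCFTCL": 0.00579, "WCFTDT": 0.59206, "GEARRAT": 0.85261,
    "LOGSALE": 0.01156, "LOGASST": 0.01005, "NFATAST": 0.06003,
    "CAPINT": 0.04017, "FATTOT": 0.23571, "INVTAST": 0.33790,
    "PAYOUT": 0.92284, "QUIKRAT": 0.46617, "CURRAT": 0.05070,
}

alpha = 0.05
m = len(p_values)            # assumed family: the 12 slope coefficients
alpha_star = alpha / m       # Bonferroni-adjusted level

unadjusted = [name for name, p in p_values.items() if p < alpha]
bonferroni = [name for name, p in p_values.items() if p < alpha_star]

print("adjusted level:", round(alpha_star, 5))
print("significant at level alpha:", unadjusted)
print("significant after Bonferroni:", bonferroni)
```

Under this assumption even the smallest p-value (0.00579) exceeds α/12 ≈ 0.00417, so none of the individually significant coefficients survives the simultaneous test.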

1.8 Maxima of a finite sequence of random variables

The problem of multiple comparisons is closely connected with the problem of bounding $\max_{i=1,\dots,m} |X_i|$ for a sequence of random variables. When the probability distribution of each random variable Xi is known, Boole’s inequality can be used in order to obtain (fairly rough) stochastic bounds for $\max_{i=1,\dots,m} |X_i|$. More precise results can, however, be obtained for some practically important special cases.

1.8.1 Maximum of a sample of bounded random variables

Let X1,...,Xn be an i.i.d. sample, and assume that for some (unknown) θ ∈ IR the underlying distribution possesses a density f with the following properties:

• f(θ) > 0 and f(x) = 0 for all x > θ

• For any ϵ > 0, f is continuous in the interval [θ − ϵ, θ].

This implies that all possible values of X are bounded by θ, P(X > θ) = 0.

Problem: Estimate θ

A natural estimator of θ is given by

$$\hat{\theta}_n := X_{(n)} = \max_{i=1,\dots,n} X_i$$

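Note that with probability 1 we have $\hat{\theta}_n \le \theta$. $\hat{\theta}_n$ is therefore a biased estimator of θ. It is fairly straightforward to derive the asymptotic distribution of $\hat{\theta}_n$. For any c > 0 we obtain

$$P\left(n(\theta - \hat{\theta}_n) \le c\right) = P\left(X_i \in [\theta - \tfrac{c}{n}, \theta] \text{ for some } i = 1, \dots, n\right) = 1 - P\left(\sum_{i=1}^{n} I(X_i \in [\theta - \tfrac{c}{n}, \theta]) = 0\right)$$

But $\sum_{i=1}^{n} I(X_i \in [\theta - \frac{c}{n}, \theta])$ has a binomial distribution with parameters n and $P(X \in [\theta - \frac{c}{n}, \theta])$. Therefore,

$$P\left(n(\theta - \hat{\theta}_n) \le c\right) = 1 - \left(1 - P(X \in [\theta - \tfrac{c}{n}, \theta])\right)^n$$

But as n → ∞, $P(X \in [\theta - \frac{c}{n}, \theta]) = f(\theta)\frac{c}{n} + o(\frac{1}{n})$. Furthermore, it is well known that for any λ > 0

$$\lim_{n\to\infty} \left(1 - \frac{\lambda}{n}\right)^n = \exp(-\lambda),$$

and consequently,

$$\lim_{n\to\infty} \left(1 - P(X \in [\theta - \tfrac{c}{n}, \theta])\right)^n = \lim_{n\to\infty} \left(1 - \frac{f(\theta)c}{n}\right)^n = \exp(-f(\theta)c).$$

We can conclude that as n → ∞ the asymptotic distribution of $n(\theta - \hat{\theta}_n)$ is an exponential distribution with parameter f(θ),

$$n(\theta - \hat{\theta}_n) \to_D \mathrm{Exp}(f(\theta))$$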
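The convergence to the exponential limit can be checked numerically in a case where the finite-sample distribution is known exactly. The sketch below assumes, purely for illustration, that X is uniform on [0, θ], so that f(θ) = 1/θ and P(X ∈ [θ − c/n, θ]) = c/(nθ); the function names are hypothetical.

```python
import math

def exact_cdf(c, n, theta=1.0):
    """P(n (theta - max X_i) <= c) for X_i i.i.d. uniform on [0, theta]."""
    p = min(c / (n * theta), 1.0)      # P(X in [theta - c/n, theta])
    return 1 - (1 - p) ** n

def limit_cdf(c, theta=1.0):
    """Distribution function of Exp(f(theta)) with f(theta) = 1/theta."""
    return 1 - math.exp(-c / theta)

for n in (10, 100, 10000):
    print(n, round(exact_cdf(1.0, n), 5), round(limit_cdf(1.0), 5))
```

The gap between the exact and the limiting value shrinks at rate 1/n, in line with the rate n at which the maximum approaches θ.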

This type of problem is quite important in economics. The above setup represents a simple form of an extreme value problem; such problems are, for example, important in finance. The estimation of (conditional) maxima of observations is the subject of production frontier analysis.

The setup of frontier analysis can be described as follows: In an industrial sector there are usually a large number of competing companies. Each firm produces a production output X on the basis of several production inputs $z \in \mathbb{R}^p$. For a given input vector z there is a maximal output g(z) which can be produced based on the current state of technology. The function g(z) is called production function. A firm with input vector z is efficient if its output equals g(z), and it is (to some degree) inefficient if the output is smaller than g(z). For a sample of measured production outputs X1,...,Xn the basic model then can be written as

Xi = g(Zi) + ui, i = 1, . . . , n,

where ui is a nonpositive random variable, i.e. P(ui ≤ 0) = 1, which measures the degree of inefficiency. This in turn implies that P(Xi > g(z)|Zi = z) = 0.

The situation described above corresponds to the trivial case p = 0 with no inputs and g(z) ≡ θ being a fixed constant. In practice, there will of course always exist a number p > 0 of important input variables, which leads to the much more complicated problem of estimating conditional maxima. Different estimation methods (e.g. data envelopment analysis) have been developed in deterministic frontier analysis. Procedures of stochastic frontier analysis are based on a variant of the above model which adds a normally distributed measurement error ϵi, i.e. it is assumed that Xi = g(Zi) + ui + ϵi, i = 1, . . . , n. For an overview see e.g.

• Cooper, Seiford and Tone (2006): Introduction to data envelopment analysis and its uses, Springer Verlag

• Kumbhakar and Lovell (2000): Stochastic frontier analysis, Cambridge University Press

Example of stochastic frontier analysis (p = 1): [figure not reproduced]

1.8.2 Maximum of normal variables

Let X1,...,Xm be a collection of standard normal random variables, i.e. Xi ∼ N(0, 1). Note that it is not assumed that the variables are independent.

Problem: Establish a bound for $\sup_{i=1,\dots,m} |X_i|$ which is valid for large m.

We first establish a simple tail bound for a standard normal variable X: For any c > 0

$$P(X \ge c) = \frac{1}{\sqrt{2\pi}} \int_c^\infty \exp\left(-\frac{t^2}{2}\right) dt \le \frac{1}{\sqrt{2\pi}} \int_c^\infty \frac{t}{c} \exp\left(-\frac{t^2}{2}\right) dt = \frac{1}{c\sqrt{2\pi}} \left[-\exp\left(-\frac{t^2}{2}\right)\right]_c^\infty = \frac{1}{c\sqrt{2\pi}} \exp\left(-\frac{c^2}{2}\right)$$

Let A be some constant with $A > \sqrt{2}$. By symmetry, the tail bound gives $P(|X_i| \ge c) \le \frac{2}{c\sqrt{2\pi}} \exp(-\frac{c^2}{2})$. Using Boole’s inequality with $c = A\sqrt{\log m}$ we can then infer that

$$\sup_{i=1,\dots,m} |X_i| \le A\sqrt{\log m}$$

holds with probability at least $1 - \frac{2}{A\sqrt{\log m}\,\sqrt{2\pi}}\, m^{-\frac{A^2}{2}+1}$. Note that as m → ∞ this probability converges to 1.

This bound is heavily used in the analysis of high-dimensional regression procedures like the Lasso. For example, assume a standard regression model with normal errors and a very large number m ≈ n of explanatory variables. For the estimated regression coefficients $\hat{\beta}_j$ we have

$$\frac{\sqrt{n}(\hat{\beta}_j - \beta_j)}{\sigma\sqrt{q_{jj}}} \sim N(0, 1), \quad j = 1, \dots, m,$$

where $\sigma^2$ is the error variance, and $q_{jj}$ is the jth diagonal element of the matrix $(\frac{1}{n} X^T X)^{-1}$, where in this case X is the n × m dimensional matrix of regressors. Hence, whenever βj = 0 we have

$$\frac{\sqrt{n}\hat{\beta}_j}{\sigma\sqrt{q_{jj}}} \sim N(0, 1),$$

and the above bound implies that

$$\frac{\sqrt{n}|\hat{\beta}_j|}{\sigma\sqrt{q_{jj}}} \le A\sqrt{\log m} \quad \text{for all } j \in \{1, \dots, m\} \text{ with } \beta_j = 0$$

holds with high probability if m is large.
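The probability bound can be evaluated numerically. The sketch below combines the one-sided tail bound, a factor 2 for the two tails of |Xi|, and Boole’s inequality over the m variables; the function name `max_bound_probability` and the choice A = 2 are assumptions for illustration.

```python
import math

def max_bound_probability(m, A):
    """Lower bound for P(max_i |X_i| <= A sqrt(log m)) with X_i ~ N(0, 1)
    under arbitrary dependence. Requires m >= 2 so that log m > 0."""
    c = A * math.sqrt(math.log(m))
    tail = math.exp(-c * c / 2) / (c * math.sqrt(2 * math.pi))  # bound on P(X >= c)
    return 1 - 2 * m * tail      # two tails, union over m variables

for m in (10, 100, 1000, 10 ** 6):
    print(m, round(max_bound_probability(m, A=2.0), 5))
```

For constants A close to $\sqrt{2}$ the bound approaches 1 only very slowly in m, which is one reason why slightly larger constants tend to be used in practice.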

1.9 More on quantiles

Quantiles and quantile regression are an important empirical tool in risk analysis. For non-normal data quantile regression offers a robust alternative to usual least squares methods.

1.9.1 The check function

It is well known that if $E(X^2) < \infty$ the mean µ = E(X) is obtained by minimizing squared loss:

$$\mu = \arg\min_{c \in \mathbb{R}} E\left((X - c)^2\right).$$

If E(|X|) < ∞, then the median is obtained by minimizing L1-loss (absolute deviations):

$$\mu_{med} = \arg\min_{c \in \mathbb{R}} E\left(|X - c|\right).$$

The condition E(|X|) < ∞ can be avoided by rewriting the minimization problem in the (otherwise equivalent) form

$$\mu_{med} = \arg\min_{c \in \mathbb{R}} E\left(|X - c| - |X|\right).$$

Note that E(|X − c| − |X|) < ∞ for any real valued random variable X and every c ∈ IR.

In general, for every τ ∈ (0, 1) the τ-quantile Q(τ) can be obtained by minimizing expected loss with respect to an asymmetric linear loss based on the

Check function: ρτ (u) = (τ − I(u < 0))u, u ∈ IR

Q(τ) then minimizes

$$V_\tau(q) := E\left(\rho_\tau(X - q)\right) = \tau E\left(|X - q| \cdot I(X > q)\right) + (1 - \tau) E\left(|X - q| \cdot I(X < q)\right) = \tau \int_{x > q} |x - q|\, dF(x) + (1 - \tau) \int_{x < q} |x - q|\, dF(x)$$

or, equivalently, E(ρτ(X − q)) − E(ρτ(X)) (to be used if E(|X|) does not exist). It is easily seen that

• Vτ(q) = E(ρτ(X − q)) is a continuous function of q.

• If F is continuous at q, then Vτ is differentiable at q.

• If P(X = q) > 0, then F(u) has a jump at u = q, while Vτ(u) has a kink (i.e. is not differentiable) at u = q. One can, however, always define directional derivatives, i.e. right and left derivatives, by considering the difference quotients of Vτ(q + |∆|) and Vτ(q − |∆|) as ∆ → 0. More precisely, for the right derivative we have

$$\frac{\partial V_\tau(q)}{\partial q^+} = \lim_{\Delta \to 0} \frac{V_\tau(q + |\Delta|) - V_\tau(q)}{|\Delta|} = -\tau P(X > q) + (1 - \tau) P(X \le q) = -\tau + P(X \le q),$$

while the left derivative is given by

$$\frac{\partial V_\tau(q)}{\partial q^-} = \lim_{\Delta \to 0} \frac{V_\tau(q - |\Delta|) - V_\tau(q)}{-|\Delta|} = -\tau P(X \ge q) + (1 - \tau) P(X < q) = -\tau + P(X < q).$$

Now let Q(τ) denote a τ-quantile of X. Note that for any q ∈ IR with F (q) = P (X ≤ q) > F (Q(τ)) ≥ τ, we also have P (X < q) = F (q) − P (X = q) ≥ τ. Therefore,

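• $\frac{\partial V_\tau(q)}{\partial q^+} > 0$ and $\frac{\partial V_\tau(q)}{\partial q^-} \ge 0$ for any q ∈ IR with F(q) > F(Q(τ))

• $\frac{\partial V_\tau(q)}{\partial q^+} < 0$ and $\frac{\partial V_\tau(q)}{\partial q^-} < 0$ for any q ∈ IR with F(q) < F(Q(τ))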

This implies that Q(τ) minimizes Vτ(q). If X is a continuous random variable, then necessarily F(Q(τ)) = τ, and any solution hence satisfies the first order condition

0 = −τ + P(X ≤ Q(τ)) = −τ + F(Q(τ)).

Recall from the definition of quantiles that the solution is not necessarily unique. If F has constant segments, there may exist an interval of possible values for Q(τ). But Q(τ) is necessarily unique if F is continuous and if the corresponding density f satisfies f(Q(τ)) > 0.

Let X1,...,Xn be an i.i.d. random sample from X. The above arguments also imply that sample quantiles Qn(τ), τ ∈ (0, 1), can be obtained by minimizing ρτ with respect to the empirical distribution function. Any possible value Qn(τ) minimizes

$$V_{\tau,n}(q) := \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(X_i - q) = \tau \frac{1}{n} \sum_{i: X_i > q} |X_i - q| + (1 - \tau) \frac{1}{n} \sum_{i: X_i < q} |X_i - q| = \tau \int_{x > q} |x - q|\, dF_n(x) + (1 - \tau) \int_{x < q} |x - q|\, dF_n(x)$$

Assume that the distribution of X possesses a density f with f(Q(τ)) > 0. Then Q(τ) is unique, and it is easy to show that Qn(τ) is a consistent estimator of Q(τ). Furthermore,

$$\sqrt{n}\left(Q_n(\tau) - Q(\tau)\right) \to_D N\left(0, \frac{\tau(1 - \tau)}{f(Q(\tau))^2}\right)$$
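The characterization of sample quantiles as minimizers of the empirical check-function loss can be illustrated with a few lines of code. Because the objective is piecewise linear with kinks at the observations, it suffices in this sketch to search over the data points; the data and function names are illustrative assumptions.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - I(u < 0)) * u."""
    return (tau - (1 if u < 0 else 0)) * u

def empirical_objective(q, data, tau):
    """V_{tau,n}(q) = (1/n) * sum_i rho_tau(X_i - q)."""
    return sum(check_loss(x - q, tau) for x in data) / len(data)

def sample_quantile(data, tau):
    """One minimizer of V_{tau,n}, searched among the observations."""
    return min(sorted(data), key=lambda q: empirical_objective(q, data, tau))

data = [19.2, 17.4, 18.5, 16.5, 18.9]     # illustrative observations
print(sample_quantile(data, 0.5))          # median
print(sample_quantile(data, 0.25))         # lower quartile
```

When nτ is an integer the minimizer is not unique (a whole interval between two order statistics minimizes the loss); the search above then returns the smallest minimizing observation.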

1.9.2 Quantile regression

Quantile regression plays an increasingly important role in econometrics. It opens a way to explore regression relationships in depth. Much more information can be obtained than by using traditional least squares regression, which only aims to quantify a conditional mean. Furthermore, a crucial property is robustness. In particular, median regression is preferable to least squares regression when dealing with heavy-tailed distributions.

Assume an i.i.d. sample (Y1,X1),...,(Yn,Xn), where Yi ∈ IR is a response variable of interest, while $X_i \in \mathbb{R}^k$ is a vector of explanatory variables. We are now interested in determining quantiles of the conditional distribution of Y given X. For any vector $x \in \mathbb{R}^k$ there is a conditional distribution function

$$F_{Y|X=x}(y) = P(Y \le y \mid X = x)$$

and a corresponding conditional quantile function $Q_{Y|X=x}(\tau)$, τ ∈ (0, 1). In the following we will assume that all conditional distribution functions are continuous, which implies the existence of conditional densities $f_{Y|X=x}(\cdot)$.

Note that if Y and X are independent, then $F_{Y|X=x} = F_Y$ and $Q_{Y|X=x}(\cdot) = Q_Y(\cdot)$ for all $x \in \mathbb{R}^k$, where $F_Y$ and $Q_Y$ denote the (marginal) distribution and quantile functions of Y, respectively. Otherwise, $F_{Y|X=x}$ and $Q_{Y|X=x}$ will depend on the value X = x.

Standard quantile regression now rests upon the assumption that for a given τ ∈ (0, 1)

$$Q_{Y|X=x}(\tau) = x^T \beta_\tau \quad \text{for some } \beta_\tau \in \mathbb{R}^k$$

If this assumption holds for all τ ∈ (0, 1), we arrive at the general model

$$Y_i = X_i^T \beta(Z_i),$$

where the random variable Zi ∼ U(0, 1) is independent of Xi, and $\beta : (0, 1) \to \mathbb{R}^k$ is a measurable function such that $\beta(\tau) = \beta_\tau$.

Special cases:

1) Simple OLS model with $X_i = (1, X_{i1}, \dots, X_{i,k-1})^T$

$$Y_i = \beta_1 + \sum_{j=2}^{k} \beta_j X_{ij} + \epsilon_i, \quad i = 1, \dots, n,$$

where ϵ1, . . . , ϵn are i.i.d. errors with continuous strictly monotonically increasing distribution function Fϵ. Then

$$Q_{Y|X=X_i}(\tau) = \beta_1 + \sum_{j=2}^{k} \beta_j X_{ij} + F_\epsilon^{-1}(\tau) = \underbrace{\beta_1 + F_\epsilon^{-1}(\tau)}_{\beta_{\tau,1}} + \sum_{j=2}^{k} \beta_j X_{ij}$$

2) Heteroskedastic errors:

$$Y_i = \sum_{j=1}^{k} \beta_j X_{ij} + \left(\sum_{j=1}^{k} \gamma_j X_{ij}\right) \epsilon_i, \quad i = 1, \dots, n,$$

where ϵ1, . . . , ϵn are i.i.d. errors with continuous strictly monotonically increasing distribution function Fϵ, and βj, γj ∈ IR, j = 1, . . . , k. Then (with $\sum_{j} \gamma_j X_{ij} > 0$)

$$Q_{Y|X=X_i}(\tau) = \sum_{j=1}^{k} \beta_j X_{ij} + \left(\sum_{j=1}^{k} \gamma_j X_{ij}\right) F_\epsilon^{-1}(\tau) = \sum_{j=1}^{k} \beta_{\tau,j} X_{ij}$$

for $\beta_{\tau,j} = \beta_j + \gamma_j F_\epsilon^{-1}(\tau)$, j = 1, . . . , k.

A remarkable property of this approach is its equivariance to monotonic transformations: For a nondecreasing function h we have $Q_{h(Y)|X=X_i}(\tau) = h(Q_{Y|X=X_i}(\tau))$. For example, if $\alpha_\tau + x^T \beta_\tau$ is the τth conditional quantile of log Y, then $\exp(\alpha_\tau + x^T \beta_\tau)$ is the τth conditional quantile of Y.

The coefficients $\beta_\tau \in \mathbb{R}^k$ can be estimated by using the check-function approach. Estimates $\hat{\beta}_\tau$ are determined by minimizing

$$V_{\tau,n}(\beta) := \sum_{i=1}^{n} \rho_\tau(Y_i - X_i^T \beta) = \tau \sum_{i: Y_i > X_i^T \beta} |Y_i - X_i^T \beta| + (1 - \tau) \sum_{i: Y_i < X_i^T \beta} |Y_i - X_i^T \beta|$$
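For a model with an intercept and a single regressor, the minimization of the check-function objective can be sketched by brute force. The sketch relies on the linear-programming property that a minimizer can be found among the lines passing through two observations (assuming the regressor takes at least two distinct values); it is meant as an illustration, not as an efficient algorithm. The simulated data, the seed and all function names are assumptions.

```python
import random
from itertools import combinations

def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - I(u < 0)) * u."""
    return (tau - (1 if u < 0 else 0)) * u

def objective(a, b, xs, ys, tau):
    """Sum of rho_tau(Y_i - a - b * X_i) over the sample."""
    return sum(check_loss(y - a - b * x, tau) for x, y in zip(xs, ys))

def quantile_line(xs, ys, tau):
    """Exhaustive search over all lines through two data points.
    Returns (objective value, intercept, slope) of the best line."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue                      # vertical line, not a candidate
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        value = objective(a, b, xs, ys, tau)
        if best is None or value < best[0]:
            best = (value, a, b)
    return best

random.seed(0)                            # hypothetical simulated data
xs = [random.uniform(0, 10) for _ in range(40)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

value, a_hat, b_hat = quantile_line(xs, ys, tau=0.5)
print(round(a_hat, 3), round(b_hat, 3), round(value, 3))
```

The search costs on the order of n³ operations and is only feasible for small samples; dedicated linear programming routines solve the same problem far more efficiently.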

• The structure of $V_{\tau,n}(\beta)$ is similar to the structure of the function $V_{\tau,n}(q)$ analyzed before. Since $V_{\tau,n}(\beta)$ is not differentiable with respect to β, the estimator does not have a closed analytical form. There exist, however, very efficient linear programming algorithms which allow estimates to be determined numerically.

• Due to a linear loss function, $\hat{\beta}_\tau$ is much more robust to outliers than the least squares estimator.

• Quantile regression is not the same as regressions based on split samples because every quantile regression utilizes all sample data (with different weights). Thus, quantile regression also avoids the sample selection problem arising from sample splitting.

It is possible to calculate a measure for goodness-of-fit of a quantile regression model by generalizing the usual notion of R² and calculating

$$R^1(\tau) = 1 - \frac{V_{\tau,n}(\hat{\beta}_\tau)}{\tilde{V}_{\tau,n}(Q_{y,n}(\tau))},$$

where $Q_{y,n}(\tau)$ is the τth sample quantile of Y1,...,Yn, and $\tilde{V}_{\tau,n}(Q_{y,n}(\tau)) = \sum_{i=1}^{n} \rho_\tau(Y_i - Q_{y,n}(\tau))$.

Standard asymptotic theory for quantile regression has to be based on the following assumptions:

a) (Y1,X1),...,(Yn,Xn) is an i.i.d. random sample from (Y,X)

b) The regressors have bounded second moment, i.e. $E\left(\|X_i\|_2^2\right) < \infty$

c) For any $x \in \mathbb{R}^k$, the conditional distribution of the “error terms” $\epsilon_i := Y_i - x^T \beta_\tau$ given X = x has a density $f_\tau(\epsilon|X = x)$ satisfying $\int_{-\infty}^{0} f_\tau(\epsilon|X = x)\, d\epsilon = \tau$. Note that necessarily $f_\tau(0|X = x) = f_{Y|X=x}(x^T \beta_\tau)$.

d) The regressors and error density are such that the k × k matrix $C_\tau := E\left(f_\tau(0|X_i) X_i X_i^T\right)$ is positive definite.

Under these conditions it can be shown that $\hat{\beta}_\tau$ is a (weakly) consistent estimator of $\beta_\tau$, and with $M := E\left(X_i X_i^T\right)$ we obtain

$$\sqrt{n}(\hat{\beta}_\tau - \beta_\tau) \to_D N\left(0, \tau(1 - \tau) C_\tau^{-1} M C_\tau^{-1}\right)$$

If $f_\tau(0|X = x) \equiv f_\tau(0)$ does not depend on x, i.e., under conditional homogeneity, the result simplifies to

$$\sqrt{n}(\hat{\beta}_\tau - \beta_\tau) \to_D N\left(0, \frac{\tau(1 - \tau)}{f_\tau(0)^2} M^{-1}\right)$$

Inference is either based on estimates of the covariance matrix or (more frequently) on the bootstrap.

The difficulty in approximating the asymptotic covariance matrix consists in estimating the values $f_\tau(0|X_i)$ of the conditional densities (or of $f_\tau(0)$ under a homogeneity assumption). A possibility is to use (conditional) kernel density estimators. Other procedures have, for example, been proposed by Hendricks and Koenker (1991), or Powell (1991).

Let us finally compare the efficiency of least squares estimators with the estimators to be obtained by LAD (τ = 1/2) under the simple model

$$Y_i = X_i^T \beta + \epsilon_i, \quad i = 1, \dots, n,$$

with i.i.d. errors ϵ1, . . . , ϵn. Also assume that the error distribution is symmetric around 0 and has a density f with f(0) > 0. For τ = 1/2 we then have βτ = β. The model implies conditional homogeneity, and therefore

$$\sqrt{n}(\hat{\beta}_\tau - \beta) \to_D N\left(0, \frac{1}{4 f(0)^2} M^{-1}\right)$$

Under the additional moment conditions necessary to derive asymptotics for the least squares estimator $\hat{\beta}$ we obtain

$$\sqrt{n}(\hat{\beta} - \beta) \to_D N\left(0, \sigma^2 M^{-1}\right),$$

where $\sigma^2 = \mathrm{Var}(\epsilon_i)$.

For normally distributed errors the least squares estimator $\hat{\beta}$ is an asymptotically more efficient estimator than $\hat{\beta}_\tau$. In this case, $f(0) = \frac{1}{\sigma\sqrt{2\pi}}$ and hence the asymptotic variance of $\hat{\beta}_\tau$ is $\frac{2\pi\sigma^2}{4} M^{-1} \approx 1.57\,\sigma^2 M^{-1}$, which is larger than the asymptotic variance of $\hat{\beta}$.

The situation changes, however, when considering heavy-tailed error distributions. Examples of heavy-tailed, symmetric distributions are the Student t(ν)-distributions for small values of ν.

• Assume that ϵi ∼ t(1). This is, of course, a very extreme case since the t(1)-distribution (which is also a particular case of the so-called Cauchy-distribution) does not possess any moments. Even E(|ϵi|) does not exist. This implies that the ordinary least squares estimator is not even consistent in this case (no central limit arguments apply). Median regression is still applicable and asymptotic normality still holds. We then have $f(0) = \frac{1}{\pi}$.

• Assume that ϵi ∼ t(3). Then Var(ϵi) = 3 and the asymptotic variance of $\hat{\beta}$ is thus equal to $3M^{-1}$. At the same time, we have $f(0) = \frac{6\sqrt{3}}{9\pi}$, and the asymptotic variance of $\hat{\beta}_\tau$ is approximately $1.85M^{-1} < 3M^{-1}$.
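The scalar factors in these asymptotic variances can be verified numerically. The sketch below evaluates 1/(4 f(0)²) for standard normal errors (σ = 1) and for t(3) errors, using the standard formula for the Student t density at zero; the function names are hypothetical.

```python
import math

def t_density_at_zero(nu):
    """Density of the Student t(nu)-distribution at 0."""
    return math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))

def lad_variance_factor(f0):
    """Scalar c in the asymptotic covariance c * M^{-1} of median regression."""
    return 1 / (4 * f0 ** 2)

normal_factor = lad_variance_factor(1 / math.sqrt(2 * math.pi))  # sigma = 1
t3_factor = lad_variance_factor(t_density_at_zero(3))
t1_f0 = t_density_at_zero(1)

print(round(normal_factor, 3))   # to be compared with the OLS factor sigma^2 = 1
print(round(t3_factor, 3))       # to be compared with the OLS factor Var = 3
print(round(t1_f0, 4))           # Cauchy case, 1/pi
```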

1.10 Appendix: Statistical test procedures

1.10.1 Basic concepts

Assume a random sample X1,...,Xn, where the distributions of X1,...,Xn depend on some unknown parameter θ ∈ Ω, where Ω is some parameter space.

General Testing problem:

H0 : θ ∈ Ω0 against H1 : θ ∈ Ω1.

H0 is the null hypothesis, while H1 is the alternative. Ω0 ⊂ Ω and Ω1 ⊂ Ω are used to denote the possible values of θ under H0 and H1, respectively. Necessarily, Ω0 ∩ Ω1 = ∅.

For a large number of tests we have Ω = IR and the respective null hypothesis states that θ has a specific value θ0 ∈ IR, i.e., Ω0 = {θ0} and H0 : θ = θ0. Depending on the alternative one then often distinguishes between one-sided (Ω1 = (θ0, ∞) or Ω1 = (−∞, θ0)) and two-sided tests (Ω1 = {θ ∈ IR | θ ≠ θ0}).

Statistical hypothesis testing: The data are used in order to decide whether to accept or to reject H0.

Test statistic: Every hypothesis test relies on a corresponding test statistic T = T(X1,...,Xn). Any test statistic is a real valued random variable, and for given observations the resulting observed value Tobs is used to decide between H0 and H1. Generally, the distribution of T under H0 is analyzed in order to define a rejection region C:

• Tobs ̸∈ C ⇒ H0 is not rejected

• Tobs ∈ C ⇒ H0 is rejected

Typically C is of the form (−∞, c0], [c1, ∞) or (−∞, c0] ∪ [c1, ∞). The limits of the respective intervals are called critical values, and are obtained from quantiles of the null distribution (null distribution ≡ distribution of T under H0).

Type I error: H0 is rejected when it is true

Type II error: the test fails to reject a false H0

In a statistical significance test, the probability of a type I error is controlled by the significance level: Significance level α (e.g. α = 5%)

P ( type I error ) = P (T ∈ C| H0 true) ≤ α

Note: $\sup_{\theta \in \Omega_0} P(T \in C \mid \theta)$ is called the size of the test. The preselected significance level is a bound for the size, which may not be attained if, e.g., the relevant probability function is discrete.

Practically important significance levels:

• α = 0.05 - It is common to say that a test result is “significant” if a hypothesis test of level α = 0.05 rejects H0

• α = 0.01 - It is common to say that a test result is “strongly significant” if a hypothesis test of level α = 0.01 rejects H0

Statistical software usually determines the p-value of a test.

p-value = probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true

Interpretation:

• The p-value is random since it depends on the observed data. Different random samples will lead to different p-values.

• For given data, having determined the p-value of a test we also know the test decisions for all possible levels α:

α > p-value ⇒ H0 is rejected

α < p-value ⇒ H0 is accepted

Example: X ∼ N(µ, σ²); observations from an i.i.d. sample of size n = 5:

X1 = 19.20, X2 = 17.40, X3 = 18.50, X4 = 16.50, X5 = 18.90, ⇒ X¯ = 18.1

Testing problem:

H0 : µ = 17 against H1 : µ ≠ 17 (two-sided test)

Since the variance is unknown, we have to use a t-test in order to test H0. Test statistic of the t-test:

$$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S},$$

where $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is an unbiased estimator of σ².

$$T_{obs} = \frac{\sqrt{5}(18.1 - 17)}{1.125} = 2.187 \quad \Rightarrow \quad \text{p-value} = P(|T_{n-1}| \ge 2.187) = 0.094$$

t-test for different significance levels α:

α = 0.2 ⇒ 2.187 > t4,0.9 = 1.533 ⇒ H0 is rejected

α = 0.1 ⇒ 2.187 > t4,0.95 = 2.132 ⇒ H0 is rejected

α = 0.094 = p-value ⇒ 2.187 = t4,0.953 = 2.187 ⇒ H0 is rejected

α = 0.05 ⇒ 2.187 < t4,0.975 = 2.776 ⇒ H0 is accepted

α = 0.01 ⇒ 2.187 < t4,0.995 = 4.604 ⇒ H0 is accepted
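The numbers of this example can be reproduced with a short script. Since the Python standard library has no t-distribution, the sketch below obtains the p-value by integrating the Student t density numerically with Simpson's rule; this numerical shortcut and the function names are assumptions of the illustration, not part of the original example.

```python
import math
from statistics import mean, stdev

data = [19.20, 17.40, 18.50, 16.50, 18.90]
mu0 = 17.0
n = len(data)

# stdev() uses the divisor n - 1, i.e. it returns S from the text
t_obs = math.sqrt(n) * (mean(data) - mu0) / stdev(data)

def t_density(x, nu):
    """Density of the Student t-distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def two_sided_p_value(t, nu, steps=2000):
    """P(|T_nu| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    s = t_density(0.0, nu) + t_density(b, nu)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_density(k * h, nu)
    central = s * h / 3                 # P(0 <= T_nu <= |t|)
    return 1 - 2 * central

p_value = two_sided_p_value(t_obs, n - 1)
print(round(t_obs, 3), round(p_value, 3))

for alpha in (0.2, 0.1, 0.05, 0.01):
    print(alpha, "reject H0" if p_value < alpha else "do not reject H0")
```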

1.10.2 The power function

For every possible value θ ∈ Ω0 ∪ Ω1, all sample sizes n and each significance level α the corresponding value of the power function β is given by the probability

βn,α(θ) := P(H0 is rejected if the true parameter value equals θ)

Obviously, βn,α(θ) ≤ α for all θ ∈ Ω0. For any θ ∈ Ω1, 1 − βn,α(θ) is the probability of committing a type II error.

The power function is an important tool for assessing the quality of a test and for comparing different test procedures. Obviously, the power depends on the true value θ ∈ Ω, the sample size n, and on the significance level α.

Some important terminology:

• If possible, a test is constructed in such a way that size equals level, i.e., βn,α(θ) = α for some θ ∈ Ω0. In some cases, however, as for discrete test statistics or complex, composite null hypotheses, it is not possible to reach the level, and $\sup_{\theta \in \Omega_0} \beta_{n,\alpha}(\theta) < \alpha$. In this case the test is called “conservative”.

• Unbiased test: A significance test of level α > 0 is called “unbiased” if βn,α(θ) ≥ α for all θ ∈ Ω1.

• Consistent test: A significance test of level α > 0 is called “consistent” if

$$\lim_{n \to \infty} \beta_{n,\alpha}(\theta) = 1$$

for all θ ∈ Ω1.

When choosing between different testing procedures for the same testing problem, one will usually prefer the “most powerful” test. Consider a fixed sample size n.

• For a specified θ ∈ Ω1, a test with power function βn,α(θ) is said to be most powerful for θ if for any alternative test with power function $\beta^*_{n,\alpha}(\theta)$,

$$\beta_{n,\alpha}(\theta) \ge \beta^*_{n,\alpha}(\theta)$$

holds for all levels α > 0.

• A test with power function βn,α(θ) is said to be uniformly most powerful against the set of alternatives Ω1 if for any alternative test with power function $\beta^*_{n,\alpha}(\theta)$,

$$\beta_{n,\alpha}(\theta) \ge \beta^*_{n,\alpha}(\theta) \quad \text{holds for all } \theta \in \Omega_1,\ \alpha > 0$$

Unfortunately, uniformly most powerful tests only exist for very special testing problems.

Example: Let X1,...,Xn be an i.i.d. random sample for X. Assume that n = 9, and that X ∼ N(µ, 0.18²). Hence, in this simple example only the mean µ = E(X) is unknown, while the standard deviation has the known value σ = 0.18.

Testing problem:

H0 : µ = µ0 against H1 : µ ≠ µ0 for µ0 = 18.3. Since the variance is known, a test may rely on the test statistic

$$Z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} = \frac{3(\bar{X} - 18.3)}{0.18}$$

Under H0 we have Z ∼ N(0, 1), and for the significance level α = 0.05 the null hypothesis is rejected if

|Z| ≥ z1−α/2 = 1.96

Here z1−α/2 denotes the corresponding quantile of the standard normal distribution. Note that the size of this test equals its level α = 0.05.

For determining the rejection region of a test it suffices to determine the distribution of the test statistic under H0. But in order to calculate the power function one needs to quantify the distribution of the test statistic for all possible values θ ∈ Ω. For many important problems this is a formidable task. In our example it is, however, quite easy. Note that for any µ ∈ IR the corresponding distribution of Z ≡ Zµ is

$$Z_\mu = \frac{\sqrt{n}(\mu - \mu_0)}{\sigma} + \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N\left(\frac{\sqrt{n}(\mu - \mu_0)}{\sigma}, 1\right)$$

This implies that with Φ denoting the distribution function of the standard normal distribution we obtain

$$\beta_{n,\alpha}(\mu) = P\left(|Z_\mu| > z_{1-\alpha/2}\right) = 1 - \Phi\left(z_{1-\alpha/2} - \frac{\sqrt{n}(\mu - \mu_0)}{\sigma}\right) + \Phi\left(-z_{1-\alpha/2} - \frac{\sqrt{n}(\mu - \mu_0)}{\sigma}\right)$$

This example illustrates the power function of a “good” test. Under H0 : µ = µ0 we have βn,α(µ0) = α. The test is unbiased, since βn,α(µ) > α for any µ ≠ µ0. Furthermore, the test is consistent, since $\lim_{n \to \infty} \beta_{n,\alpha}(\mu) = 1$ for every fixed µ ≠ µ0.

For fixed sample size n, βn,α(µ) increases as the distance |µ − µ0| increases. If |µ − µ0| > |µ* − µ0| then βn,α(µ) > βn,α(µ*). On the other hand, βn,α(µ) decreases as the size α of the test decreases. If α > α* then $\beta_{n,\alpha}(\mu) > \beta_{n,\alpha^*}(\mu)$.

The example values µ0 = 18.3, n = 9 and σ = 0.18 lead to

• µ = 18.36 and α = 0.05 ⇒ β9,0.05(18.36) = 0.170

• µ = 18.48 and α = 0.05 ⇒ β9,0.05(18.48) = 0.851

• µ = 18.48 and α = 0.01 ⇒ β9,0.01(18.48) = 0.664
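The power function of the two-sided Gauss test can be evaluated directly from the formula 1 − Φ(z − √n(µ − µ0)/σ) + Φ(−z − √n(µ − µ0)/σ) with z = z1−α/2. The following sketch uses the example values as defaults; the function name `power` is a hypothetical label, and the printed values can be compared with the figures listed above.

```python
from statistics import NormalDist

def power(mu, mu0=18.3, sigma=0.18, n=9, alpha=0.05):
    """Power of the two-sided Gauss test at the true mean mu."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = n ** 0.5 * (mu - mu0) / sigma
    return 1 - std_normal.cdf(z - shift) + std_normal.cdf(-z - shift)

print(round(power(18.3), 3))                  # equals the level under H0
print(round(power(18.36), 3))
print(round(power(18.48), 3))
print(round(power(18.48, alpha=0.01), 3))
```

Evaluating `power` on a grid of values of µ and for several n gives a quick picture of unbiasedness (the minimum is attained at µ0) and of consistency (the curve steepens as n grows).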

1.10.3 Asymptotic relative efficiency

In the statistical literature, power comparisons of different tests are most frequently based on asymptotic theory. Explicit power calculations can usually only be done for simply structured hypotheses. We will thus only consider the case that the testing problem concerns the value of some real-valued parameter θ:

H0 : θ = θ0 against H1 : θ > θ0, for some pre-specified θ0 ∈ IR. Two-sided tests can be analyzed analogously.

Any corresponding test is based on a test statistic T, and H0 is rejected if the observed value of T is too large. The distribution of this test statistic will depend on the sample size n and on the true value θ, i.e., T ≡ Tn(θ). A first assumption is that, as usual, Tn(θ) is asymptotically normal. More precisely, we now assume that there exist some functions $e_T(\theta)$ and $\sigma_T(\theta)$ such that for all θ ∈ Ω = [θ0, ∞)

$$\frac{\sqrt{n}(T_n(\theta) - e_T(\theta))}{\sigma_T(\theta)} \to_D N(0, 1)$$

as n → ∞. We will require that $\sigma_T(\theta)$ is continuous, while $e_T(\theta)$ is strictly monotonically increasing and continuously differentiable in θ with $e_T'(\theta_0) > 0$. One can conclude that as n → ∞ an asymptotic approximation of the power function of the test is given by

$$\beta_{n,\alpha}(\theta) = P\left(\frac{\sqrt{n}(T_n(\theta) - e_T(\theta_0))}{\sigma_T(\theta_0)} > z_{1-\alpha/2}\right)$$

$$= P\left(\frac{\sqrt{n}(T_n(\theta) - e_T(\theta))}{\sigma_T(\theta)} + \frac{\sqrt{n}(e_T(\theta) - e_T(\theta_0))}{\sigma_T(\theta)} > \frac{\sigma_T(\theta_0)}{\sigma_T(\theta)} z_{1-\alpha/2}\right)$$

$$= 1 - \Phi\left(\frac{\sigma_T(\theta_0)}{\sigma_T(\theta)} z_{1-\alpha/2} - \frac{\sqrt{n}(e_T(\theta) - e_T(\theta_0))}{\sigma_T(\theta)}\right)$$

for any given significance level α > 0.

Consider an alternative test with test statistic $\tilde{T}_n(\theta)$ possessing similar properties, such that as n → ∞ its power function can be approximated by

$$\tilde{\beta}_{n,\alpha}(\theta) = P\left(\frac{\sqrt{n}(\tilde{T}_n(\theta) - \tilde{e}_T(\theta_0))}{\tilde{\sigma}_T(\theta_0)} > z_{1-\alpha/2}\right) = 1 - \Phi\left(\frac{\tilde{\sigma}_T(\theta_0)}{\tilde{\sigma}_T(\theta)} z_{1-\alpha/2} - \frac{\sqrt{n}(\tilde{e}_T(\theta) - \tilde{e}_T(\theta_0))}{\tilde{\sigma}_T(\theta)}\right)$$

We again assume that $\tilde{\sigma}_T(\theta)$ is continuous, while $\tilde{e}_T(\theta)$ is strictly monotonically increasing and continuously differentiable in θ with $\tilde{e}_T'(\theta_0) > 0$.

This already shows that asymptotically it does not make much sense to compare the efficiency of the two tests on the basis of

EconometricsII-Kneip 1–57 a fixed alternative θ > 0, since then limn→∞ βn,α(θ) = 1 as well e as limn→∞ βn,α(θ) = 1. The idea is then to consider sequences of local alternatives θm with θm → θ0 as m → ∞.
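The behaviour at a fixed alternative can be checked numerically. The following is only a minimal sketch under stated assumptions: the centering function e_T(θ) = θ and the constant scale σ_T(θ) = 1 are hypothetical choices made for illustration, as are the function and variable names. The critical value z_{1−α/2} follows the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf gives quantiles


def approx_power(n, theta, theta0, e, sigma, alpha=0.05):
    """Asymptotic power approximation
    1 - Phi( sigma(theta0)/sigma(theta) * z - sqrt(n) * (e(theta) - e(theta0)) / sigma(theta) ),
    with z = z_{1-alpha/2}, the critical value used in the text."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = sqrt(n) * (e(theta) - e(theta0)) / sigma(theta)
    return 1 - STD_NORMAL.cdf(sigma(theta0) / sigma(theta) * z - shift)


def e_T(theta):
    # Hypothetical centering function (assumption): e_T(theta) = theta
    return theta


def sigma_T(theta):
    # Hypothetical constant scale (assumption): sigma_T(theta) = 1
    return 1.0


# Fixed alternative theta = 0.2 against theta0 = 0: the approximate power
# increases with n and approaches 1.
sample_sizes = (25, 100, 400, 1600)
powers = [approx_power(n, 0.2, 0.0, e_T, sigma_T) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(f"n = {n:5d}  approximate power = {p:.4f}")
```

Any other test satisfying the same assumptions shows the same limit at a fixed alternative, which is why the comparison is instead made along alternatives that shrink towards θ0.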

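For an arbitrary $d > 0$ define a sequence $\{\theta_m\}_{m=1,2,\dots} \subset \Omega_1$ such that $\theta_m - \theta_0 = \frac{d}{\sqrt{m}}$ for all $m = 1, 2, \dots$. Asymptotic efficiency analysis then poses the following questions for arbitrary $\alpha < \beta < 1$ and large $m$:
• When using the first test, how many observations $n(m)$ do we need such that the probability of rejecting $H_0$ is at least $\beta$ when $\theta = \theta_m$?
• When using the second test, how many observations $\tilde{n}(m)$ do we need such that the probability of rejecting $H_0$ is at least $\beta$ when $\theta = \theta_m$?

Formally this requires determining $n(m)$ and $\tilde{n}(m)$ such that $\beta_{n(m),\alpha}(\theta_m) \approx \beta$ and $\tilde{\beta}_{\tilde{n}(m),\alpha}(\theta_m) \approx \beta$. Taylor expansions of $e_T(\theta)$ and $\tilde{e}_T(\theta)$ yield

$$
\begin{aligned}
\frac{\sqrt{n(m)}\,(e_T(\theta_m) - e_T(\theta_0))}{\sigma_T(\theta_m)} &= \frac{\sqrt{n(m)}\, e_T'(\theta_0)(\theta_m - \theta_0)}{\sigma_T(\theta_0)} + o\left(\frac{n(m)}{m}\right), \\
\frac{\sqrt{\tilde{n}(m)}\,(\tilde{e}_T(\theta_m) - \tilde{e}_T(\theta_0))}{\tilde{\sigma}_T(\theta_m)} &= \frac{\sqrt{\tilde{n}(m)}\, \tilde{e}_T'(\theta_0)(\theta_m - \theta_0)}{\tilde{\sigma}_T(\theta_0)} + o\left(\frac{\tilde{n}(m)}{m}\right).
\end{aligned}
$$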

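On the other hand, $\frac{\sigma_T(\theta_0)}{\sigma_T(\theta_m)}\, z_{1-\alpha/2} \to z_{1-\alpha/2}$ as well as $\frac{\tilde{\sigma}_T(\theta_0)}{\tilde{\sigma}_T(\theta_m)}\, z_{1-\alpha/2} \to z_{1-\alpha/2}$ as $m \to \infty$. Furthermore, with $\gamma(\alpha, \beta) := z_{1-\alpha/2} - z_{1-\beta}$ we obtain $1 - \Phi\left(z_{1-\alpha/2} - \gamma(\alpha, \beta)\right) = \beta$. Therefore, with

$$n(m) = m\,\frac{\gamma(\alpha,\beta)^2\, \sigma_T(\theta_0)^2}{e_T'(\theta_0)^2\, d^2} \quad \text{and} \quad \tilde{n}(m) = m\,\frac{\gamma(\alpha,\beta)^2\, \tilde{\sigma}_T(\theta_0)^2}{\tilde{e}_T'(\theta_0)^2\, d^2},$$

we have

$$
\begin{aligned}
\lim_{m\to\infty} \frac{\sqrt{n(m)}\,(e_T(\theta_m) - e_T(\theta_0))}{\sigma_T(\theta_m)} &= \gamma(\alpha, \beta), \\
\lim_{m\to\infty} \frac{\sqrt{\tilde{n}(m)}\,(\tilde{e}_T(\theta_m) - \tilde{e}_T(\theta_0))}{\tilde{\sigma}_T(\theta_m)} &= \gamma(\alpha, \beta),
\end{aligned}
$$

and we can conclude that

$$
\begin{aligned}
\lim_{m\to\infty} \beta_{n(m),\alpha}(\theta_m) &= \beta \quad \text{for } n(m) = m\,\frac{\gamma(\alpha,\beta)^2\, \sigma_T(\theta_0)^2}{e_T'(\theta_0)^2\, d^2}, \\
\lim_{m\to\infty} \tilde{\beta}_{\tilde{n}(m),\alpha}(\theta_m) &= \beta \quad \text{for } \tilde{n}(m) = m\,\frac{\gamma(\alpha,\beta)^2\, \tilde{\sigma}_T(\theta_0)^2}{\tilde{e}_T'(\theta_0)^2\, d^2}.
\end{aligned}
$$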
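The convergence of the local power to β can be illustrated numerically. This is a sketch under stated assumptions rather than a definitive implementation: the nonlinear centering function e_T(θ) = θ + θ² and the scale σ_T(θ) = sqrt(1 + θ) are hypothetical choices (with θ0 = 0, so that the derivative of e_T at θ0 is 1 and σ_T(θ0) = 1), and all function names are invented for this example. In practice n(m) would be rounded up to an integer; the sketch keeps the real-valued formula.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gamma(alpha, beta):
    """gamma(alpha, beta) = z_{1-alpha/2} - z_{1-beta}."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) - STD_NORMAL.inv_cdf(1 - beta)


def required_n(m, d, alpha, beta, e_prime0, sigma0):
    """n(m) = m * gamma^2 * sigma_T(theta0)^2 / (e_T'(theta0)^2 * d^2)."""
    return m * gamma(alpha, beta) ** 2 * sigma0 ** 2 / (e_prime0 ** 2 * d ** 2)


def e_T(theta):
    # Hypothetical centering function (assumption), with e_T'(0) = 1
    return theta + theta ** 2


def sigma_T(theta):
    # Hypothetical scale function (assumption), with sigma_T(0) = 1
    return sqrt(1.0 + theta)


def local_power(m, d=1.0, alpha=0.05, beta=0.8, theta0=0.0):
    """Approximate power at theta_m = theta0 + d / sqrt(m) when n = n(m)."""
    theta_m = theta0 + d / sqrt(m)
    n = required_n(m, d, alpha, beta, e_prime0=1.0, sigma0=sigma_T(theta0))
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = sqrt(n) * (e_T(theta_m) - e_T(theta0)) / sigma_T(theta_m)
    return 1 - STD_NORMAL.cdf(sigma_T(theta0) / sigma_T(theta_m) * z - shift)


# The approximate power at theta_m approaches the target beta = 0.8 as m grows.
for m in (10 ** 2, 10 ** 4, 10 ** 6):
    print(f"m = {m:8d}  local power = {local_power(m):.4f}")
```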

The quotient $\frac{\tilde{n}(m)}{n(m)}$ is a unique number which obviously does not depend on $\alpha$, $\beta$ or the specific choice of $d$ used to construct $\{\theta_m\}_{m=1,2,\dots}$. This quotient defines the asymptotic relative efficiency of test $T$ relative to test $\tilde{T}$:

$$ARE(T, \tilde{T}) = \lim_{m\to\infty} \frac{\tilde{n}(m)}{n(m)} = \frac{e_T'(\theta_0)^2\, \tilde{\sigma}_T(\theta_0)^2}{\tilde{e}_T'(\theta_0)^2\, \sigma_T(\theta_0)^2}$$

Interpretation:
• $ARE(T, \tilde{T}) = 1$ ⇒ both tests are equally efficient (for detecting local alternatives).
• $ARE(T, \tilde{T}) = \gamma < 1$ ⇒ Test $\tilde{T}$ is more efficient than test $T$! In order to achieve (approximately) identical local power, $\tilde{T}$ needs fewer observations (by the factor $\gamma$) than test $T$.
• $ARE(T, \tilde{T}) = \gamma^* > 1$ ⇒ Test $T$ is more efficient than test $\tilde{T}$! In order to achieve (approximately) identical local power, $\tilde{T}$ needs more observations than test $T$.
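As a small numerical check, the quotient of the two required sample sizes can be compared with the closed-form ARE. The sketch below is illustrative only: the derivative and scale values at θ0 are hypothetical numbers chosen for the example, and the function names are not part of the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def are(e_prime, sigma, e_prime_alt, sigma_alt):
    """ARE(T, T~) = e_T'(theta0)^2 * sigma~_T(theta0)^2
                    / (e~_T'(theta0)^2 * sigma_T(theta0)^2)."""
    return (e_prime ** 2 * sigma_alt ** 2) / (e_prime_alt ** 2 * sigma ** 2)


def required_n(m, d, alpha, beta, e_prime, sigma):
    """Required sample size m * gamma^2 * sigma^2 / (e_prime^2 * d^2)."""
    gamma = STD_NORMAL.inv_cdf(1 - alpha / 2) - STD_NORMAL.inv_cdf(1 - beta)
    return m * gamma ** 2 * sigma ** 2 / (e_prime ** 2 * d ** 2)


# Hypothetical values at theta0 (assumptions made for illustration only).
E_PRIME, SIGMA = 1.0, 1.0          # first test T
E_PRIME_ALT, SIGMA_ALT = 2.0, 3.0  # second test T~

efficiency = are(E_PRIME, SIGMA, E_PRIME_ALT, SIGMA_ALT)  # 1 * 9 / (4 * 1) = 2.25

# The quotient n~(m) / n(m) is the same for every choice of m, d, alpha, beta.
settings = [(100, 1.0, 0.05, 0.8), (5000, 0.3, 0.01, 0.9), (10, 2.0, 0.10, 0.5)]
quotients = [
    required_n(m, d, a, b, E_PRIME_ALT, SIGMA_ALT) / required_n(m, d, a, b, E_PRIME, SIGMA)
    for (m, d, a, b) in settings
]
print(efficiency, quotients)
```

With these hypothetical values the ARE exceeds 1, so the first test is the more efficient one and the second test needs 2.25 times as many observations for the same local power.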

Remark: If a test statistic is based on an (asymptotically) unbiased estimator of the true value of $\theta$, i.e., $E(T_n(\theta)) = \theta$, then $e_T(\theta_m) - e_T(\theta_0) = \theta_m - \theta_0$ and hence $e_T'(\theta_0) = 1$. When comparing two tests based on different unbiased estimators, ARE therefore reduces to

$$ARE(T, \tilde{T}) = \frac{\tilde{\sigma}_T(\theta_0)^2}{\sigma_T(\theta_0)^2}.$$

The "better" estimator with a smaller asymptotic variance thus also defines the more efficient test.
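A familiar illustration of this remark is the comparison of the sample mean (T) and the sample median (T̃) as estimators of the centre of a normal distribution. The sketch below rests on assumptions that are not part of this section: the data are N(θ, σ²), and the asymptotic variance of the median is taken from the standard result 1 / (4 f(θ)²), where f is the density. Function names, sample size, number of replications and seed are arbitrary illustrative choices.

```python
import random
from math import pi
from statistics import NormalDist, median, pvariance


def asymptotic_are_mean_vs_median(sigma=1.0):
    """Variance ratio sigma~^2 / sigma^2 with T = mean and T~ = median,
    assuming normal data (assumption, not from the text)."""
    var_mean = sigma ** 2
    density_at_centre = NormalDist(0.0, sigma).pdf(0.0)
    var_median = 1.0 / (4.0 * density_at_centre ** 2)  # standard asymptotic result
    return var_median / var_mean


def simulated_variance_ratio(n=101, reps=4000, seed=0):
    """Monte Carlo estimate of Var(median) / Var(mean) for N(0, 1) samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(median(sample))
    return pvariance(medians) / pvariance(means)


print("asymptotic ARE(mean, median):", asymptotic_are_mean_vs_median())
print("pi / 2                      :", pi / 2)
print("simulated variance ratio    :", simulated_variance_ratio())
```

Under these assumptions the ratio equals π/2, which is larger than 1, so the test based on the mean is the more efficient one and the median-based test needs roughly 57% more observations for the same local power. The simulated ratio is only a finite-sample approximation and varies with the seed.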
