Statistical Analysis

Econometrics II: Statistical Analysis
Prof. Dr. Alois Kneip
Statistische Abteilung, Institut für Finanzmarktökonomie und Statistik, Universität Bonn

Contents:
1. Empirical Distributions, Quantiles and Nonparametric Tests
2. Nonparametric Density Estimation
3. Nonparametric Regression
4. Bootstrap
5. Semiparametric Models

EconometricsII-Kneip 0–1

Some literature:
• Gibbons, J.D. (1971): Nonparametric Statistical Inference; McGraw-Hill
• Bowman, A.W. and Azzalini, A. (1997): Applied Smoothing Techniques for Data Analysis; Clarendon Press
• Li and Racine (2007): Nonparametric Econometrics; Princeton University Press
• Greene, W.H. (2008): Econometric Analysis; Pearson Education
• Silverman, B.W. (1986): Density Estimation for Statistics and Data Analysis; Chapman and Hall
• Davison, A.C. and Hinkley, D.V. (2005): Bootstrap Methods and their Application; Cambridge University Press
• Yatchew, A. (2003): Semiparametric Regression for the Applied Econometrician; Cambridge University Press
• Hastie, T., Tibshirani, R. and Friedman, J. (2001): The Elements of Statistical Learning; Springer Verlag

1 Empirical distributions, quantiles and nonparametric tests

1.1 The empirical distribution function

The distribution of a real-valued random variable $X$ can be completely described by its distribution function

$$F(x) = P(X \le x) \quad \text{for all } x \in \mathbb{R}.$$

It is well known that any distribution function possesses the following properties:
• $F(x)$ is a monotonically increasing function of $x$.
• Any distribution function is right-continuous: $\lim_{\Delta \to 0} F(x + |\Delta|) = F(x)$ for any $x \in \mathbb{R}$. Furthermore, $\lim_{\Delta \to 0} F(x - |\Delta|) = F(x) - P(X = x)$.
• If $F$ is absolutely continuous, then there exists a density $f$ such that $\int_{-\infty}^{x} f(t)\,dt = F(x)$ for all $x \in \mathbb{R}$. If $f$ is continuous at $x$, then $F'(x) = f(x)$.

Data: i.i.d.
random sample $X_1, \dots, X_n$

For given data, the sample analogue of $F$ is the so-called empirical distribution function, which is an important tool of statistical inference. Let $I(\cdot)$ denote the indicator function, i.e., $I(x \le t) = 1$ if $x \le t$, and $I(x \le t) = 0$ if $x > t$.

Empirical distribution function:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

i.e., $F_n(x)$ is the proportion of observations with $X_i \le x$.

Properties:
• $0 \le F_n(x) \le 1$
• $F_n(x) = 0$ if $x < X_{(1)}$, where $X_{(1)}$ is the smallest observation
• $F_n(x) = 1$ if $x \ge X_{(n)}$, where $X_{(n)}$ is the largest observation
• $F_n$ is a monotonically increasing step function

Example:
  x1    x2    x3    x4    x5    x6    x7    x8
 5.20  4.80  5.40  4.60  6.10  5.40  5.80  5.50

[Figure: corresponding empirical distribution function; vertical axis from 0.0 to 1.0, horizontal axis from 4.0 to 6.5]

For real-valued random variables the empirical distribution function is closely linked with the so-called "order statistics".
• Given a sample $X_1, \dots, X_n$, the corresponding order statistic is the $n$-tuple of the ordered observations $(X_{(1)}, \dots, X_{(n)})$, where $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$.
• For $r = 1, \dots, n$, $X_{(r)}$ is called the $r$-th order statistic.

Order statistics can only be determined for one-dimensional random variables. But an empirical distribution function can also be defined for random vectors. Let $X$ be a $d$-dimensional random variable defined on $\mathbb{R}^d$, and let $X_i = (X_{i1}, \dots, X_{id})^T$, $i = 1, \dots, n$, denote an i.i.d. sample of random vectors from $X$. Then for any $x = (x_1, \dots, x_d)^T$

$$F(x) = P(X_1 \le x_1, \dots, X_d \le x_d)$$

and

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_{i1} \le x_1, \dots, X_{id} \le x_d).$$

We can also define the so-called "empirical measure" $P_n$. For any $A \subset \mathbb{R}^d$

$$P_n(A) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \in A).$$

Note that $P_n(A)$ simply quantifies the relative frequency of observations falling into $A$. As $n \to \infty$,

$$P_n(A) \to_P P(A).$$

Note that $P_n$ of course depends on the observations and thus is random. At the same time, however, it possesses all properties of a probability measure.
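As a small illustration of the definitions above, the following Python sketch evaluates $F_n$ and the empirical measure $P_n$ for the eight example observations. The function names `ecdf` and `empirical_measure` are hypothetical helpers introduced here, and restricting $A$ to a closed interval $[a, b]$ is an assumption made only to keep the example short.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F_n as a function: F_n(x) = (1/n) * #{i : X_i <= x}."""
    ordered = sorted(sample)  # order statistics X_(1) <= ... <= X_(n)
    n = len(ordered)

    def F_n(x):
        # bisect_right returns the number of observations that are <= x
        return bisect_right(ordered, x) / n

    return F_n


def empirical_measure(sample, a, b):
    """P_n(A) for A = [a, b]: relative frequency of observations in A."""
    return sum(1 for x in sample if a <= x <= b) / len(sample)


# Example data from the text (decimal commas written as decimal points)
data = [5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50]
F_n = ecdf(data)

for x in (4.5, 5.0, 5.4, 6.1):
    print(f"F_n({x}) = {F_n(x)}")

print("P_n([5.0, 5.5]) =", empirical_measure(data, 5.0, 5.5))
```

The sorted list inside `ecdf` is exactly the order statistic $(X_{(1)}, \dots, X_{(n)})$, which is why a binary search suffices to evaluate the step function.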
Knowing $F_n$, we can uniquely reconstruct all observed values $\{X_1, \dots, X_n\}$. The only information lost is the order in which these values were observed. For i.i.d. samples this information is completely irrelevant for all statistical purposes. All important statistics (and estimators) can thus be written as functions of $F_n$ (or $P_n$).

In particular, in the theoretical literature expectations and the corresponding sample averages are often represented in the following form: for a continuous function $g$,

$$E(g(X)) = \int g(x)\,dP = \int g(x)\,dF(x)$$

and

$$\frac{1}{n} \sum_{i=1}^{n} g(X_i) = \int g(x)\,dP_n = \int g(x)\,dF_n(x).$$

Here, $\int g(x)\,dF(x)$ refers to the Stieltjes integral. This is a generalization of the well-known Riemann integral. Let $d = 1$, and consider a partition $a = x_0 < x_1 < \dots < x_m = b$ of an interval $[a, b]$. Then

$$\int_a^b g(x)\,dF(x) = \lim_{m \to \infty,\ \sup_i |x_{i+1} - x_i| \to 0} \sum_{j=1}^{m} g(\xi_j)\,\bigl(F(x_j) - F(x_{j-1})\bigr)$$

if the limit exists and is independent of the specific choices of $\xi_j \in [x_{j-1}, x_j]$. It can be shown that for any continuous function $g$ and any distribution function $F$ the corresponding Stieltjes integral exists for any finite interval $[a, b]$. The integral $\int_{-\infty}^{\infty} g(x)\,dF(x) \equiv \int g(x)\,dF(x)$ corresponds to the limit (if it exists) as $a \to -\infty$, $b \to \infty$.

1.2 Theoretical properties of empirical distribution functions

In the following we will assume that $X$ is a real-valued random variable ($d = 1$).

Theorem: For every $x \in \mathbb{R}$

$$n F_n(x) \sim B(n, F(x)),$$

i.e., $n F_n(x)$ has a binomial distribution with parameters $n$ and $F(x)$. The probability distribution of $F_n(x)$ is thus given by

$$P\left(F_n(x) = \frac{m}{n}\right) = \binom{n}{m} F(x)^m \bigl(1 - F(x)\bigr)^{n-m}, \quad m = 0, 1, \dots, n.$$

Some consequences:
• $E(F_n(x)) = F(x)$, i.e., $F_n(x)$ is an unbiased estimator of $F(x)$.
• $Var(F_n(x)) = \frac{1}{n} F(x)\bigl(1 - F(x)\bigr)$, i.e., as $n$ increases the variance of $F_n(x)$ decreases.
• $F_n(x)$ is a (weakly) consistent estimator of $F(x)$.

Theorem of Glivenko-Cantelli:
$$P\left(\lim_{n \to \infty} \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| = 0\right) = 1$$

The distribution of $Y = F(X)$

Note: there is an important difference between $F(x)$ and $F(X)$:
• For any fixed $x \in \mathbb{R}$ the corresponding value $F(x)$ is also a fixed number, $F(x) = P(X \le x)$.
• $F(X)$ is a random variable, where $F$ denotes the distribution function of $X$.

Theorem: Let $X$ be a random variable with a continuous distribution function $F$. Then $Y = F(X)$ has a (continuous) uniform distribution on the interval $(0, 1)$, i.e.,

$$F(X) \sim U(0, 1), \qquad P(a \le F(X) \le b) = b - a \quad \text{for all } 0 \le a < b \le 1.$$

Consequence: If $F$ is continuous, then
• $F(X_1), \dots, F(X_n)$ can be interpreted as an i.i.d. random sample of observations from a $U(0, 1)$ distribution
• $(F(X_{(1)}), \dots, F(X_{(n)}))$ is the corresponding order statistic

1.3 Quantiles

Quantiles are an essential tool for statistical analysis. They provide important information for characterizing the location and dispersion of a distribution. In statistical inference they play a central role in measuring risk.

Let $X$ denote a real-valued random variable with distribution function $F$.

Quantiles: For $0 < \tau < 1$, any $q_\tau \in \mathbb{R}$ satisfying

$$F(q_\tau) = P(X \le q_\tau) \ge \tau \quad \text{and} \quad P(X \ge q_\tau) \ge 1 - \tau$$

is called a $\tau$th quantile (or simply $\tau$-quantile) of $X$.

Note that quantiles are not necessarily unique. For given $\tau$, there may exist an interval of possible values fulfilling the above conditions. But if $X$ is a continuous random variable with density $f$, then $q_\tau$ is unique if $f(q_\tau) > 0$ (then $F(q_\tau) = \tau$ and $F(q) \ne \tau$ for all $q \ne q_\tau$).

In the statistical literature most work on quantiles is based on the so-called quantile function, which is defined as an "inverse" distribution function. For $0 < \tau < 1$ the quantile function is defined by

$$Q(\tau) := \inf\{y \mid F(y) \ge \tau\}$$

• For any $0 < \tau < 1$ the value $q_\tau = Q(\tau)$ is a $\tau$-quantile satisfying the above conditions. If there is an interval of possible values for $q_\tau$, $Q(\tau)$ selects the smallest possible value.
• Like the distribution function, the quantile function provides a complete characterization of the random variable $X$.
• If the distribution function $F(x)$ is strictly monotonically increasing, then $Q(\tau)$ is the inverse of $F$, $Q(\tau) = F^{-1}(\tau)$.

Important quantiles:
• $\mu_{med} = Q(0.5)$ is the median of $X$ (with probability at least 0.5 an observation is smaller than or equal to $Q(0.5)$, and with probability at least 0.5 an observation is larger than or equal to $Q(0.5)$).
• $Q(0.25)$ and $Q(0.75)$ are called the lower and upper quartile, respectively. Instead of the standard deviation, the inter-quartile range $IQR = Q(0.75) - Q(0.25)$ is frequently used as a measure of statistical dispersion. Note that $P(X \in [Q(0.25), Q(0.75)]) \approx 0.5$.
• $Q(0.1), Q(0.2), \dots, Q(0.9)$ are the "deciles" of $X$.
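The quantile function can be made concrete by applying the definition $Q(\tau) = \inf\{y \mid F(y) \ge \tau\}$ to an empirical distribution function $F_n$ in place of $F$. This substitution is an assumption made for illustration (the text above defines $Q$ for the distribution of $X$, not for a sample), and `quantile` is a hypothetical helper name. The sketch reuses the eight example observations from Section 1.1.

```python
def quantile(sample, tau):
    """Q_n(tau) = inf{y : F_n(y) >= tau} for 0 < tau < 1."""
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    ordered = sorted(sample)  # order statistics X_(1) <= ... <= X_(n)
    n = len(ordered)
    for k, value in enumerate(ordered, start=1):
        # F_n(X_(k)) >= k/n, and F_n(y) <= (k-1)/n for every y < X_(k),
        # so the first k with k/n >= tau yields the infimum.
        if k / n >= tau:
            return value
    return ordered[-1]  # not reached for tau < 1; kept as a safeguard


# Example data from Section 1.1 (decimal commas written as decimal points)
data = [5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50]

median = quantile(data, 0.5)
lower_quartile = quantile(data, 0.25)
upper_quartile = quantile(data, 0.75)
iqr = upper_quartile - lower_quartile

print("median:", median)
print("quartiles:", lower_quartile, upper_quartile)
print("inter-quartile range:", iqr)
```

Because $F_n$ is a step function, the infimum is always attained at one of the order statistics. When a whole interval of values satisfies the quantile conditions, this rule returns its left endpoint, matching the remark that $Q(\tau)$ selects the smallest possible value.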
