
Estimating the distance distribution of subpopulations and testing observation

outlyingness for a large-scale complex survey

by

Jianqiang Wang

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Statistics

Program of Study Committee:
Jean D. Opsomer, Major Professor
Wayne A. Fuller
Song X. Chen
Dan Nettleton
Dimitris Margaritis

Iowa State University

Ames, Iowa

2008

Copyright © Jianqiang Wang, 2008. All rights reserved.


DEDICATION

To my family,

Father Xiaoling Wang, mother Lanjiu Cao, sister Huihui Wang and brother Jianchao Wang.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTER 1. OVERVIEW

CHAPTER 2. REVIEW OF LITERATURE

CHAPTER 3. ESTIMATING DISTANCE DISTRIBUTION AND TESTING POINT OUTLYINGNESS

3.1 Distance Distribution Functions and Estimators
3.2 Assumptions
3.3 Distance Distribution Estimators and Measures of Outlyingness
3.3.1 Mean-based distance distribution
3.3.2 Median-based distance distribution
3.3.3 Measures of outlyingness
3.4 Variance Estimation
3.4.1 Naive estimator
3.4.2 Estimating the effect of population center estimation error
3.4.3 Jackknife variance estimator
3.5 Discussion
3.5.1 Comments on norms
3.5.2 Initial partitioning
3.5.3 Classification
3.5.4 Using auxiliary variables

CHAPTER 4. SIMULATION STUDY

4.1 Assessing consistency and asymptotic normality
4.2 Evaluating $\hat D(\hat\gamma)$ and its variance estimators
4.3 Outlier detection
4.4 Initial clustering and testing outlyingness

CHAPTER 5. OUTLIER IDENTIFICATION IN THE NATIONAL RESOURCES INVENTORY

5.1 General strategy
5.2 Introduction to the National Resources Inventory
5.3 Test application
5.4 Full-scale application
5.5 Discussion

CHAPTER 6. EXTENSIONS TO NONDIFFERENTIABLE SURVEY ESTIMATORS

6.1 Explicitly defined nondifferentiable survey estimators
6.1.1 Assumptions
6.1.2 Design-based results
6.1.3 Discussion
6.2 Nondifferentiable estimating equations
6.2.1 Assumptions
6.2.2 Design-based results
6.2.3 Discussion
6.3 Variance estimation

CHAPTER 7. CONCLUSIONS

APPENDIX A. MODEL-BASED RESULTS

A.1 Model-based results for mean-based inference
A.2 Model-based results for median-based inference
A.3 Model-based results for nondifferentiable survey estimators
A.4 Model-based results for nondifferentiable estimating equations

APPENDIX B. PROOFS

B.1 Proofs

APPENDIX C. FIGURES

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Score function for each discrimination method

Table 4.1 Comparison of estimators $\hat D_{\nu,d}(\hat\mu_\nu)$ and $\hat D_{\nu,d}(\mu_\nu)$ in terms of bias and standard deviation under three different simple random samples; $G = 2$ is the number of subpopulations; bias$(\hat D(\hat\mu))$ and bias$(\hat D(\mu))$ represent the bias of our estimator when the true population mean is estimated or known; sd$(\hat D(\hat\mu))$ and sd$(\hat D(\mu))$ represent the standard deviation of our estimator when the true population mean is estimated or known.

Table 4.2 Comparison of estimators $\hat D_{\nu,d}(\hat q_\nu)$ and $\hat D_{\nu,d}(q_\nu)$ in terms of bias and standard deviation under three different simple random samples; $G = 2$ is the number of subpopulations; bias$(\hat D(\hat q))$ and bias$(\hat D(q))$ represent the bias of our estimator when the true population median is estimated or known; sd$(\hat D(\hat q))$ and sd$(\hat D(q))$ represent the standard deviation of our estimator when the true population median is estimated or known.

Table 4.3 Comparison of variance estimators: $V_{MC}$ denotes the Monte Carlo estimate of variance from 2000 simulations under stratified sampling, $V_{NV}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the naive estimator, $V_{SM}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the analytic variance estimator using the kernel estimator, and $V_{JK}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the jackknife variance estimator using the kernel estimator.

Table 4.4 Summary table of simulation results to examine the performance of detecting population outliers. We use either the mean or the median as the measure of population center, and population outliers are defined by $\Omega^*_{g,\nu}(y^0) > 1-\alpha$ for $g = 1,\dots,5$. Sample outliers are the points $y^0$ such that $\hat\Omega^*_{g,\nu}(y^0) > 1-\alpha$ for $g = 1,\dots,5$.

Table 4.5 Summary table of simulation results to examine the performance of detecting population outliers. We use an unmatched definition of center, and population outliers are defined by $\Omega^*_{g,\nu}(y^0) > 1-\alpha$ for $g = 1,\dots,5$. Sample outliers are the points $y^0$ such that $\hat\Omega^*_{g,\nu}(y^0) > 1-\alpha$ for $g = 1,\dots,5$.

Table 5.1 Broad Use Categories

Table 5.2 Description of 6 clusters on USLE core points

Table 5.3 Center of each subpopulation for Kansas USLE points

Table 5.4 List of suspicious USLE core points in Kansas

Table 5.5 Description of 4 clusters on WEQ core points

Table 5.6 Center of each cluster for Kansas WEQ points

Table 5.7 List of suspicious WEQ core points in Kansas

LIST OF FIGURES

Figure 3.1 Four jackknife variance estimators; JK-1 indicates the jackknife procedure using $\hat D^{(j)}_{\nu,d}(\hat\mu_\nu)$ as replicate, JK-2 uses replicate $\hat D^{(j)}_{\nu,d}(\hat\mu^{(j)}_\nu)$, JK-3 uses a piecewise linear estimator of $\mathcal D_d(\cdot)$, and JK-4 uses (3.55).

Figure 4.1 Scatterplot of two subpopulations; the first two plots show each individual subpopulation and the last plot shows them both, but in different colors.

Figure 4.2 Distribution of subpopulation distances, with the solid line indicating distances to the mean vector and the dashed line indicating distances to the median; the left plot shows distance distributions for subpopulation 1, and the right plot shows distance distributions for subpopulation 2.

Figure 4.3 Bivariate population with 5 subpopulations; points marked with a "plus" sign are population outliers, defined by $\Omega^*_{g,\nu}(y^0) > 1-\alpha$ for $g = 1,\dots,5$. We use mean vectors as measures of center and the threshold $\alpha$ is chosen as 0.02.

Figure 5.1 Histogram of selected variables in subpopulations 1.1 and 1.2

Figure 5.2 Distribution of Mahalanobis distances from observations in subpopulations 1.1, 1.2, 2 and 5 to their corresponding subpopulation means

Figure 5.3 Distribution of Mahalanobis distances from observations in subpopulations 1.1 and 1.2 to both subpopulation means

Figure C.1 Scatterplot of population clusters and 10 test points

Figure C.2 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.3 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.4 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.5 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.6 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.7 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification. The solid lines are population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.8 Histogram of estimated distance distributions $\hat D_{\nu,d}(\hat\gamma_\nu)$ under SRS for $d = 0.1, 0.5, 2$ and $6$ with the population quantity superimposed.

Figure C.9 Histogram of estimated distance distributions $\hat D_{\nu,d}(\hat\gamma_\nu)$ under SRS for $d = 0.1, 0.5, 2$ and $6$ with the population quantity superimposed.

Figure C.10 Histogram of estimated distance distributions $\hat D_{\nu,d}(\hat\gamma_\nu)$ under SRS for $d = 0.1, 0.5, 2$ and $6$ with the population quantity superimposed.

Figure C.11 Histogram of estimated distance distributions $\hat D_{\nu,d}(\hat\gamma_\nu)$ under SRS for $d = 0.1, 0.5, 2$ and $6$ with the population quantity superimposed.

Figure C.12 Plot of conditional bias against $\|\hat\gamma_\nu - \gamma_\nu\|$ using both mean and median as cluster centers; red curve for $d = 0.4472$, green curve for $d = 1.0$ and blue curve for $d = 3.873$.

Figure C.13 Scatterplot of sample clusters using PAM and the adjusted k-means method.

Figure C.14 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.15 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.16 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.17 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.18 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.19 Boxplot of estimated distance distributions $\hat D_{\nu,\|z-\hat\gamma_\nu\|}(\hat\gamma_\nu)$ under SRS with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.20 Scatterplot of cluster 5 generated from a bivariate lognormal distribution and plot of population distance distributions $D_{\nu,d}(\mu_\nu)$ and $D_{\nu,d}(q_\nu)$ against $d$.

ACKNOWLEDGEMENTS

This work was supported by Cooperative Agreement No. 68-3A75-4-122 between the USDA

Natural Resources Conservation Service and the Center for Survey Statistics and Methodology at Iowa State University.

I would like to take this opportunity to express my gratitude to those who helped me throughout my Ph.D. career.

First and foremost, I want to thank Dr. Jean Opsomer for proposing the initial idea, guiding me throughout my research work and helping me write up this thesis. I especially want to thank him for being so helpful, even remotely, while I was on my internship and during the final year of my Ph.D.

I also would like to thank my committee members for raising interesting questions and commenting on the written work, especially Dr. Fuller for helping me through the NRI application section and motivating some interesting directions.

I also want to thank Richard Dorsch and Bob Dayton for their analysis of the flagged NRI outliers, which greatly complemented the theoretical and methodological components of this thesis.

Last but not least, I would like to thank the Center for Survey Statistics and Methodology for sponsoring my Ph.D. research work. I really appreciate the help offered by every co-worker and every fellow student in the center.

ABSTRACT

Many finite populations targeted by sample surveys comprise a relatively small number of homogeneous subpopulations. In ongoing survey operations, it is often of interest to be able to assess whether a new observation belongs to one of those subpopulations or should be flagged as not belonging to any of them. Because the homogeneity of the subpopulations depends on potentially large numbers of survey variables interacting in complex ways, we define a distance measure in the space induced by the survey variables, and consider the distribution of these distances within the subpopulation as a summary of the distributional characteristics of the subpopulation. We also define a measure of the outlyingness of each individual point as the fraction of points with a less extreme distance in the subpopulation. In this thesis, we propose a sample-based estimator for the subpopulation distance distribution functions and the measure of outlyingness. We allow for a general distance measure, and consider both multivariate means and medians as centers. We describe the design-based asymptotic properties of the estimator under weak assumptions on the finite population. We investigate several approaches for design-based variance estimation, including a combination of kernel regression and jackknife variance estimation. The practical properties of the procedures are evaluated in a longitudinal complex survey, the National Resources Inventory.

The theoretical derivations for the sample distance distribution can be generalized to a broad class of survey estimators involving nondifferentiable functions defined at the population level.

We extend the theoretical results to two classes of nondifferentiable survey estimators: statistics that are nondifferentiable functions of estimated parameters, and estimators implicitly defined through estimating equations. In both cases, a direct Taylor linearization is not applicable, but we can assume the limiting function is differentiable and perform the asymptotic expansion on the limit. We offer a design-based perspective on both cases in this thesis and justify the design assumptions under a probabilistic mechanism. We also propose a variance estimator using kernel smoothing and show its consistency.

CHAPTER 1. OVERVIEW

A common issue in large-scale complex surveys is the detection of outliers in the data.

Such outliers can be caused by frame imperfections, which can lead to ineligible units being selected, or by errors during data collection. If these outliers remain in the survey dataset, they can cause inference based on the survey to be invalid for the population of interest. Most survey operations therefore incorporate data editing and validation as part of the post-data-collection steps, where they attempt to identify suspicious observations and either remove or correct them. Lyberg et al. (1997) contains a collection of articles discussing data collection and post-survey processing to improve data quality. When outliers exhibit "extreme" values on one or several survey variables, they can be detected relatively easily. However, because surveys often collect large numbers of variables, there is the potential for other outliers which are not extreme on any single variable. Detecting such outliers is more difficult, and identifying unusual or suspicious patterns in the data often requires substantial subject-matter knowledge.

For literature on detecting statistical outliers in general, the reader is referred to Barnett and Lewis (1994) and Hawkins (1980).

In practice, many finite populations targeted by surveys consist of a number of relatively homogeneous subpopulations, and this structure can be exploited to develop a statistical approach for flagging suspicious observations that does not require detailed knowledge of the relationships between the variables. The type of survey we are targeting here is one that recurs over time, where the characteristics and composition of the subpopulations remain relatively stable between surveys. An example of such a survey is the National

Resources Inventory (NRI) conducted by the US Natural Resources Conservation Service. The

NRI is a longitudinal survey of all non-federally owned land in the US and its territories. Its primary method of data collection is a combination of aerial photography and photo-interpretation, and the main variables in the NRI measure various aspects of land use, farming practices and environmentally important variables such as wetland status and erosion. Because the survey is longitudinal, the NRI measures both the level of and the change over time in these variables.

Many of these variables are related to each other, so that it is indeed possible to identify broadly similar "clusters" within the population. We refer to Nusser and Goebel (1997) for further information about the NRI.

In a survey context like the one just described, we have the opportunity to develop a characterization of the subpopulations based on past years' surveys, and develop a method with which to flag suspicious observations in the current year's survey, potentially as early as while the data are being collected. For the method to be practical, it should be relatively easy to compute yet account for complex interactions between the survey variables, not require significant input from subject-matter experts, and allow for statistical analysis.

The focus of Chapter 3 is to propose an estimator for the "outlyingness" of an observation relative to a subpopulation, based on the "distance" between an observation and the "center" of the subpopulation (all these terms will be more precisely described below), and to derive its statistical properties in an asymptotic design-based context. In doing this, we will assume that the subpopulations have been identified based on prior years' surveys, in the sense that each observation in those surveys has been assigned to a subpopulation. This subpopulation identification problem and the implementation of a complete outlier detection protocol for the NRI are further described in Chapter 5, where, for the National Resources Inventory, we define subpopulations by geographical association and broad use trend, and combine subject-matter knowledge with the proposed method to identify interesting points more efficiently.

The estimated distance distribution can be regarded as a function of the estimated "center", but a direct Taylor linearization is not applicable because of the nondifferentiability of the function. Similar problems occur in other interesting survey estimators, such as the proportion below an estimated quantity, or estimating a distribution function with auxiliary variables. The theoretical derivations in Chapter 3 can be extended to more general settings. In Chapter 6, we review two types of nondifferentiable survey estimators, explicitly defined with estimated parameters or implicitly defined through estimating equations. We will give a rigorous proof of their asymptotic properties and propose variance estimators using kernel smoothing.

In order to construct the distance distribution estimator proposed in Chapter 3, we consider each element in the population as located in a multidimensional space, with each dimension corresponding to one of the variables being measured in the survey. In this space, it is possible to define a distance measure and to compute the distance between any two locations. An example of such a distance measure is the Minkowski distance, with the Euclidean distance and the Manhattan distance as special cases. The Mahalanobis distance can also be used, which takes the shape of the subpopulation into account. The approach we will describe later in the paper is valid for general distance measures, so that it is possible to customize the measure for the particular survey under investigation.
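To make the choice of distance concrete, here is a minimal Python sketch of the Minkowski family and the Mahalanobis distance; the data and function names are hypothetical illustrations, not part of the thesis itself:

```python
import numpy as np

def minkowski(y, center, p=2.0):
    """Minkowski distance; p = 1 is Manhattan, p = 2 is Euclidean."""
    return float(np.sum(np.abs(y - center) ** p) ** (1.0 / p))

def mahalanobis(y, center, cov):
    """Mahalanobis distance, which takes the shape (covariance
    structure) of the subpopulation into account."""
    diff = y - center
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical two-dimensional subpopulation with correlated variables
rng = np.random.default_rng(0)
ys = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
center = ys.mean(axis=0)
y0 = np.array([3.0, -2.0])
print(minkowski(y0, center, p=1.0),
      minkowski(y0, center, p=2.0),
      mahalanobis(y0, center, np.cov(ys, rowvar=False)))
```

The same point can appear much more or less outlying depending on which norm is chosen, which is why the methodology below is developed for a general norm.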

A simple definition for the center of a subpopulation is the mean vector of all the elements that belong to that subpopulation. Another definition, which we will also consider in this article, is the multidimensional median for the subpopulation. The median for a multivariate variable has been explored in the statistical literature. Brown (1983) discusses the spatial median as a location measure minimizing the Euclidean distance in the context of spatial data, and Small (1990) gives a review of multidimensional medians.

Given a distance measure and a subpopulation center definition, one could in principle compute the distance between each observation in the subpopulation and its corresponding center, and construct the "distance distribution function" for the subpopulation. The set of finite population-level distance distribution functions represents our target measure of outlyingness: for a new observation that needs to be assigned to one of the subpopulations, we would like to compute its distance relative to each of the subpopulation centers and obtain the corresponding tail probabilities on the subpopulation distribution functions for this observation. If the probabilities are small for all subpopulations, we conclude that this observation is unlikely to belong to any of the subpopulations and flag it for further investigation. Note that the term "probability" in this context refers to the distribution of elements in the finite population, and does not assume the existence of a probability model for the data.

The target measure just described is infeasible, because we do not have access to the full population, and hence neither the subpopulation centers nor the subpopulation distance distribution functions are available. We will therefore estimate both quantities, and the resulting measure of outlyingness will be the sample-based estimated probabilities of belonging to each of the subpopulations, as measured by the estimated distance distribution functions.

The estimation of distribution functions using survey data has been well explored in the literature. Dunstan and Chambers (1986) offered a model-based perspective and proposed an estimator under a ratio model with known variance. Rao et al. (1990) furnished a thorough treatment of estimators of distribution functions. They proposed a ratio estimator, a difference estimator, and another estimator that is asymptotically both design-unbiased and model-unbiased (often referred to as the RKM estimator). Design normality of distribution function and quantile estimators was proved by Francisco and Fuller (1991) under stratified cluster sampling. The properties of the Dunstan and Chambers estimator and the RKM estimator are further investigated by Chambers et al. (1992). Dorfman (2007) provides an overall review of estimating distributions and quantiles in the survey context.

While our approach is similarly based on estimating (sub)population distribution functions, the results of the above authors are not directly applicable to our situation, because we are considering the distribution of distances from a center that is itself estimated. In order to obtain valid inference results for our estimator, we need to incorporate the uncertainty in estimation of the subpopulation centers as well as that in estimating the distribution function.

A critical technical difficulty is that the distribution function estimator is not a smooth function with respect to the subpopulation center, so that the standard tools from classical design-based asymptotic inference cannot be used. This substantially complicates the derivation of the asymptotic results, and also requires us to investigate alternative methods for variance estimation.
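Schematically, the device that resolves this difficulty (made precise in equation (3.16) of Chapter 3) splits the estimation error into a piece with the center held fixed and a piece in which the nonsmooth estimated distribution function is replaced by its smooth limit $\mathcal D$:

$$\hat D(\hat\gamma) - D(\gamma) = \underbrace{\hat D(\gamma) - D(\gamma)}_{\text{distribution estimation, center fixed}} \;+\; \underbrace{\mathcal D(\hat\gamma) - \mathcal D(\gamma)}_{\text{center estimation, via the smooth limit}} \;+\; o_p(n^{-1/2}),$$

so that standard linearization can be applied to the second piece even though $\hat D$ itself is a step function.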

We apply the proposed estimator to data from the National Resources Inventory, a recurring large-scale complex survey, in Chapter 5. We will outline a general strategy for identifying outliers in a large-scale survey (possibly updated periodically), describe the practical concerns in initial partitioning, and present a detailed analysis of the suspicious points that are identified by the proposed approach. The initial partitioning of NRI points is a compromise between administrative concerns and concerns about model performance. We utilize geographical association as well as point classification to partition the points into relatively homogeneous subpopulations, and use the previously proposed measure of outlyingness to associate an outlyingness measure with each point. Then we specify a cutoff value, and flag the points with an outlyingness measure more extreme than the cutoff value as suspicious points. The suspicious points are presented to subject-matter specialists for close examination, and the analysis results are summarized in

Chapter 5.

Many survey estimators involve nondifferentiable functions defined at the population level.

Some survey estimators are nondifferentiable functions of estimated sample quantities; examples are the Dunstan and Chambers (1986) estimator, the Rao-Kovar-Mantel estimator (Rao et al. 1990), the endogenous post-stratification estimator (Breidt and Opsomer 2008), and distance distributions (Wang and Opsomer 2008), among others. Another class of estimators are defined implicitly through nondifferentiable estimating equations, like sample quantiles or other robust location measures.

The first class of estimators can be treated as statistics containing estimated parameters, but a direct linearization is not feasible due to nondifferentiability. Randles (1982) gave a general treatment of statistics with estimated parameters and discussed when the asymptotic variance is affected by the errors in estimated parameters. In the survey context, Deville (1999) discussed variance estimation for complex statistics using influence functions and introduced kernel smoothing into variance estimation. But no formal proof of asymptotic properties was given, and there is little theoretical work establishing the asymptotic properties of this class of estimators under a complex survey design.

The second case involves population quantities and sample estimators defined by estimating equations. Examples are quantiles, generalized parameters (Binder 1983; Wu and Sitter 2001a), and some robust measures. Section 1.3.4 of Fuller (2007) gave a derivation of estimators defined by estimating equations in complex surveys when the estimating function satisfies a differentiability condition. The usual approach to studying the asymptotic properties of estimating-equation estimators is to linearize the estimating function and back-solve for the estimator. But in cases where the estimating function is not differentiable with respect to the parameter of interest, we cannot directly linearize the estimating equations. Instead, we use the monotonicity of the estimating equations to show the asymptotic normality of our estimator.

Chapter 6 provides a uniform treatment of the previous two classes of estimators by assuming a limiting smooth function to facilitate our asymptotic expansions, and discusses variance estimation by way of kernel smoothing. The assumption of a smooth limit is critical, but it can be justified by assuming the characteristics of interest are generated through a probabilistic mechanism.

CHAPTER 2. REVIEW OF LITERATURE

The estimation of distribution functions for survey data has been well explored in previous literature. Sedransk and Sedransk (1979) used stratum cumulative distribution functions (cdf's) to compare different groups of facilities practicing radiation therapy, as a complement to only using means. They used a Hájek estimator similar to (3.3) to estimate pretreatment and therapy score cumulative distribution functions and compared them across strata. The downside of this paper is that although they explicitly derived a variance formula for their cdf estimators, the variance was not incorporated into the comparison. In this thesis, the subpopulations are not defined by stratum indicators but in terms of the variables being collected, and our primary goal is not to compare subpopulations but to discriminate points and identify suspicious observations.

Dunstan and Chambers (1986) offered a model-based perspective on this estimation problem. They proposed an estimator under a ratio model with known variance structure and examined its asymptotic properties. Additionally, they discussed its performance when the variance assumption fails to hold and treated quantile estimation as well.

Kuk (1988) compared the design variance of three estimators of distribution functions and their corresponding median estimators, without using auxiliary information. The three competing estimators are

$$\hat F_L(t) = \frac{1}{N}\sum_{i\in s}\frac{I(y_i\le t)}{\pi_i}, \qquad \hat F_R(t) = 1 - \frac{1}{N}\sum_{i\in s}\frac{I(y_i > t)}{\pi_i}, \qquad \hat F_H(t) = \sum_{i\in s}\frac{I(y_i\le t)}{\pi_i}\bigg/\sum_{i\in s}\frac{1}{\pi_i}.$$
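For illustration, a minimal Python sketch of these three estimators follows; the data, inclusion probabilities and function name are hypothetical:

```python
import numpy as np

def cdf_estimators(y, pi, N, t):
    """Three design-based CDF estimators at a point t: the
    Horvitz-Thompson form F_L, its complement form F_R, and the
    Hajek ratio form F_H; pi holds inclusion probabilities."""
    w = 1.0 / pi                              # design weights
    F_L = np.sum(w * (y <= t)) / N            # Horvitz-Thompson
    F_R = 1.0 - np.sum(w * (y > t)) / N       # complement form
    F_H = np.sum(w * (y <= t)) / np.sum(w)    # Hajek (ratio) form
    return F_L, F_R, F_H

# Hypothetical sample of n = 50 from a population of N = 1000
rng = np.random.default_rng(1)
y = rng.lognormal(size=50)
pi = rng.uniform(0.02, 0.08, size=50)
print(cdf_estimators(y, pi, N=1000, t=1.0))
```

Note that $\hat F_L$ and $\hat F_R$ require the population size $N$, while the Hájek form $\hat F_H$ does not.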

Rao et al. (1990) furnished a thorough treatment of estimators of distribution functions.

They proposed a ratio estimator, a difference estimator and another estimator that is asymptotically both design-unbiased and model-unbiased (the RKM estimator). These estimators, together with the ordinary design-based estimator (Sedransk and Sedransk, 1979) and the model-based estimator (Dunstan and Chambers, 1986), were compared through a simulation study. They also proposed variance estimators and treated quantile estimation in the same paper.

The previous papers did not prove asymptotic normality results except in the model-based case. Francisco and Fuller (1991) showed design normality for both distribution function and quantile estimators. They proposed a number of confidence intervals for quantiles, which were carefully examined by Sitter and Wu (2001) under stratified cluster sampling. In this thesis, we do not assume a specific sampling design, but instead assume design normality for a simple mean estimator.

The next paper on this topic was Chambers et al. (1992), which studied the Dunstan and Chambers estimator and the RKM estimator. They compared the biases and MSEs of the two estimators in theory and through an empirical study, and concluded that neither outperforms the other uniformly.

A following paper in the sequel was Wu and Sitter (2001b), which revisited variance estimation for the Dunstan and Chambers estimator and the RKM estimator and proposed a jackknife variance estimator. They showed that the delete-one jackknife variance estimator is design consistent for its target; in our case, however, the delete-one jackknife does not work because our estimator is a nonsmooth function of simple means. Other papers on similar topics are Dorfman and Hall (1993) and Wang and Dorfman (1996).

In this thesis, the variable whose distribution we are estimating is $\|y_i - \gamma_\nu\|$, where $\gamma_\nu$ is a measure of multivariate center. The mean vector would be a convenient choice for $\gamma_\nu$, but robust measures are preferred in the context of identifying outliers. In subsection 3.3.2, we generalize the definition of the spatial median (Brown, 1983) to a general definition of the multivariate median, and this definition will be used as a robust measure of multivariate center. The original idea of the multivariate median dates back to Hayford (1902), Weber (1906), and Gini and Galvani (1929).

Haldane (1948) revisited the median of a multivariate distribution and compared the arithmetic median with the $L_1$ median in terms of their geometric properties. He discussed the invariance of the $L_1$ median under rotation of axes and its unambiguity in the bivariate case. This paper referred to the $L_1$ median as the "geometrical median" and claimed that the arithmetic median is "to be preferred in ordinary statistical works", but that the $L_1$ median "has certain advantages in problems of geometrical probability".

Gower (1974) termed the $L_1$ median the mediancentre and examined its angle property, a generalization of the univariate sign change. The angle property can be derived by taking derivatives under a polar coordinate representation. He then proposed an iterative algorithm to compute the mediancentre in the bivariate case, using unit vectors along the direction of the line between an initial estimate of the median and the sample points.

Brown (1983) provided a thorough treatment of the $L_1$ median, terming it the spatial median. The paper derived asymptotic properties of the sample spatial median under bivariate normality. In its simulation studies, the paper compared the asymptotic efficiencies of the spatial median, the arithmetic median and the mean vector, and demonstrated that although not as efficient as the mean vector, the spatial median outperforms the arithmetic median along the direction of the larger principal component.

Milasevic and Ducharme (1987) treated the spatial median as the $\gamma$ that minimizes $E(\|Y_i - \gamma\| - \|Y_i\|)$, where $Y_i$ is a sample from a continuous multivariate distribution. They also showed the uniqueness of the proposed spatial median, which is also the basis of Lemmas A.2.1 and A.2.2 of our paper.

We generalize the definition of the spatial median by using a general norm to quantify the distances between points and the center, as in many applications we may use norms other than the $L_1$ or $L_2$ distances. When we want to place weights on different dimensions or account for correlations among variables, a quadratic norm offers an alternative to transforming the variables. Another example is working with high dimensional or infinite dimensional (functional) data, where a general norm offers a uniform treatment in defining centers and estimating distances-to-center.

Other kinds of robust measures of multivariate location have been proposed in the statistical literature. Small (1990) provided a review of various multivariate medians, and Hettmansperger and Randles (2002) gave an overview of the multivariate $L_1$ median and M-estimators of location.

Another generalization of the spatial median is to consider spatial quantiles, introduced by Dudley and Koltchinskii (1992) and Koltchinskii (1993) as an extension of the univariate quantiles described by Ferguson (1967) to $\mathbb R^p$. Chaudhuri (1996) furnished a thorough treatment of spatial quantiles, termed geometric quantiles. That paper discussed the existence and uniqueness of geometric quantiles along with computational aspects, and derived their asymptotic distribution using the Bahadur representation (Bahadur, 1966).

This thesis assumes that the initial partitioning of the survey data is already provided. In survey practice, we can obtain the partition by using auxiliary variables or by applying cluster analysis tools to the realized sample. There is a massive literature on cluster analysis in the fields of multivariate analysis, data mining and statistical learning. It is generally agreed that cluster analysis methods can be divided into two broad categories: partitional and hierarchical methods. Partitional methods seek the best way to partition the data into a specified number of clusters under a given criterion, and hierarchical methods partition the data into a nested sequence of clusters using a top-down (hierarchical divisive clustering) or bottom-up (hierarchical agglomerative clustering) approach.

The commonly used partitional methods are k-means and k-medoids (also known as Partitioning Around Medoids). Steinley (2006) gave a thorough review of k-means clustering, including the theoretical background of k-means and estimating the number of clusters. A statistical treatment of k-means was given by Pollard (1981, 1982). Concerning estimating the number of clusters, Milligan and Cooper (1985) compared 30 procedures in a Monte Carlo simulation, and their top pick was the Calinski and Harabasz index (Calinski and Harabasz, 1974); a more recent approach, known as the "Gap statistic", was proposed by Tibshirani et al. (2001). K-medoids is a more robust but computationally intensive approach, and it uses the $L_1$ measure of distances instead of the $L_2$ norm. A detailed description of the k-medoids method can be found in

Kaufman and Rousseeuw (1990) and Hastie et al. (2001).
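As a brief illustration of these partitional tools, the sketch below runs k-means for several candidate numbers of clusters and picks the one maximizing the Calinski-Harabasz index; it assumes scikit-learn is available, the data are hypothetical, and this is not the procedure used for the NRI:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Hypothetical data: three separated groups in two dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

# Choose the number of clusters maximizing the Calinski-Harabasz index
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 1))
```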

Hierarchical agglomerative clustering has two major categories: distance-based and model- based methods. Model-based clustering can be found in Banfield and Raftery (1993), Celeux et al. (1995), Bensmail et al. (1997), and Fraley and Raftery (1998, 2000).

Cluster analysis has also been applied to the design and analysis of survey data. Golder and Yeomans (1973) used cluster analysis to form a basis for stratification of first-stage sampling units in their study of Birmingham motorists. They had data on 13 socio-economic variables for the 39 wards in the city and set out to divide them into 5 clusters. They illustrated the 5 resulting clusters with a list of wards, cluster means, and the first two principal component scores, and tested cluster stability using different initial partitions.

Another application of cluster analysis in the survey context was to collapse imputation cells (Langlet, 1990). The objective was to predict missing values of patient age using gender and tabulating diagnosis in the Hospital Morbidity System data. Patients were initially classified by tabulating diagnosis. Then an age distribution was defined for each imputation class, and estimated means and standard deviations were used to represent this distribution. Finally, cluster analysis was performed on the estimated parameters to reduce the number of hot deck imputation cells.

The idea of partitioning the whole population (or sample) into homogeneous groups and defining an index based on distances has been applied in educational measurement, health studies and other scientific studies. Hutchison (1969) applied this idea in modeling career choice, under flexibly determined subgrouping, using a "centour score" based on Mahalanobis distance to measure the association of a hypothetical subject with a vocation × successful/unsuccessful group, with SAT score and I.Q. as explanatory variables. The population index so defined can be of real scientific interest. Chen (1978) discussed defining a health status index for health service areas using the more healthy areas as the norm. The index is defined using average life expectancy at birth and the percent of the population that is disability-free, and Euclidean distance is used in their study.

CHAPTER 3. ESTIMATING DISTANCE DISTRIBUTION AND TESTING POINT OUTLYINGNESS

3.1 Distance Distribution Functions and Estimators

For the theoretical study of our estimators, we will follow the framework of Isaki and

Fuller (1982) in which the properties of estimators are established under a fixed sequence of populations and a corresponding sequence of random samples. Suppose therefore that we have

an increasing sequence of finite populations $\{U_\nu\}_{\nu=1}^{\infty}$ of sizes $N_\nu$, with $N_\nu < N_{\nu+1}$. Associated with the $i$-th population element is a $p$-dimensional vector of observations

$$y_i = (y_{i,1}, \ldots, y_{i,p}), \qquad (3.1)$$

and let $\mathcal F_\nu$ be the power set of the $\nu$-th finite population $\{y_1, y_2, \ldots, y_{N_\nu}\}$. The finite population $U_\nu$ can be partitioned into $G$ subpopulations, with $U_\nu = \cup_{g=1}^{G} U_{\nu g}$. Knowledge of this structure of the finite population is usually available for longitudinal surveys of stable populations, and we assume the partition into subpopulations is provided. In deriving theoretical results, it makes no difference if we assume just one population, so we will work with a single population in the theoretical derivation sections for simplicity.

We take a sample $S_\nu$ of size $n_\nu$ from population $U_\nu$, and the sampling design generating $S_\nu$ may be a complex design with stratification or multi-stage sampling. We let $\pi_i = P(i\in S_\nu)$ and $\pi_{ij} = P(i,j\in S_\nu)$ denote the one-way and two-way inclusion probabilities under this sampling design, where we suppress the dependence on $\nu$ of both quantities. Because in general the sample size $n_\nu$ is random, we use $n_\nu^* = E(n_\nu \mid \mathcal F_\nu)$ to denote the expected sample size over repeated sampling from $U_\nu$.

Let $\|\cdot\|$ denote a norm in the space defined by the $y_i$ in (3.1). We define the population-level distance distribution function as

$$D_{\nu,d}(\gamma_\nu) = \frac{1}{N_\nu}\sum_{i\in U_\nu} I(\|y_i - \gamma_\nu\| \le d), \qquad (3.2)$$

where $\gamma_\nu$ is some measure of the center of $U_\nu$. Given a sample $S_\nu$, $D_{\nu,d}(\gamma_\nu)$ is estimated by the Hájek estimator

$$\hat D_{\nu,d}(\hat\gamma_\nu) = \frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{1}{\pi_i}\, I(\|y_i - \hat\gamma_\nu\| \le d), \qquad (3.3)$$

where $\hat\gamma_\nu$ is an estimator of $\gamma_\nu$, some measure of the population center, and $\hat N_\nu = \sum_{i\in S_\nu}\pi_i^{-1}$ is an estimator of the (sub)population size. For future reference, we also define a Horvitz-Thompson estimator of $D_{\nu,d}(\hat\gamma_\nu)$ in which $N_\nu$ is known,

$$\tilde D_{\nu,d}(\hat\gamma_\nu) = \frac{1}{N_\nu}\sum_{i\in S_\nu}\frac{1}{\pi_i}\, I(\|y_i - \hat\gamma_\nu\| \le d). \qquad (3.4)$$

In equation (3.2), we assume $\gamma_\nu$ is a nonrandom sequence of finite population centers, but $\hat\gamma_\nu$ is random due to the sampling mechanism. The quantities $D_{\nu,d}(\gamma_\nu)$ and $\hat D_{\nu,d}(\hat\gamma_\nu)$ are step functions of $d$ and $\gamma$ with jumps of size $O(N_\nu^{-1})$ and $O_p(n_\nu^{*-1})$, respectively, which will significantly complicate the study of their theoretical properties.

As a measure of center $\gamma_\nu$ in equations (3.2)-(3.4), we can use the usual mean vector

$$\mu_\nu = \frac{1}{N_\nu}\sum_{i\in U_\nu} y_i, \qquad (3.5)$$

which is estimated by

$$\hat\mu_\nu = \frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{y_i}{\pi_i}. \qquad (3.6)$$

Alternatively, we can also use a generalized version of the multivariate median. The definition of the spatial median was given in Brown (1983) and Small (1990). We generalize this idea and define the multivariate median of a finite population to be the location with the smallest overall distance to all population units with respect to some norm $\|\cdot\|$. For a given norm, we define the median for population $U_\nu$ as

$$q_\nu = \arg\inf_{\gamma}\sum_{i\in U_\nu}\|y_i - \gamma\|, \qquad (3.7)$$

and the sample-based estimator of $q_\nu$ is defined as

$$\hat q_\nu = \arg\inf_{\gamma}\sum_{i\in S_\nu}\frac{1}{\pi_i}\|y_i - \gamma\|. \qquad (3.8)$$

In what follows, we will use $\gamma_\nu$ to denote a general measure of population center, which can be the mean vector $\mu_\nu$ or the median $q_\nu$, and $\hat\gamma_\nu$ to denote the estimator of $\gamma_\nu$.
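For the Euclidean norm, the minimization in (3.8) can be carried out with a Weiszfeld-type iteration, a standard algorithm for the spatial median; the thesis does not prescribe a particular algorithm, and the starting value, tolerances and data below are only illustrative:

```python
import numpy as np

def weighted_spatial_median(ys, w, tol=1e-8, max_iter=500):
    """Approximate the design-weighted spatial median of (3.8) for the
    Euclidean norm, arg min_g sum_i w_i ||y_i - g||, by Weiszfeld
    iteration; w_i = 1/pi_i are the design weights."""
    g = np.average(ys, axis=0, weights=w)     # start at the weighted mean
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(ys - g, axis=1), 1e-12)
        u = w / dist                          # reweight by inverse distance
        g_new = (u[:, None] * ys).sum(axis=0) / u.sum()
        if np.linalg.norm(g_new - g) < tol:
            break
        g = g_new
    return g

rng = np.random.default_rng(3)
ys = rng.standard_t(df=3, size=(200, 2))      # heavy tails favor the median
w = 1.0 / rng.uniform(0.01, 0.05, size=200)
print(weighted_spatial_median(ys, w))
```

Each step reweights the observations by the inverse of their current distance to the center, driving the iterate toward a point that balances the unit direction vectors, in line with Gower's angle property discussed in Chapter 2.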

While the distribution function estimator in (3.3) can be of interest in its own right, our primary purpose is to use it in the detection of potential survey outliers. To do so, we consider a (possibly future) observation as a location $y^0$ in $\mathbb R^p$, and we would like to evaluate the distance between $y^0$ and the population center, $\|y^0 - \gamma_\nu\|$, in relation to the distribution of distances in the population. Hence, for any location $y^0 \in \mathbb R^p$, we define a measure of outlyingness of this location with respect to the finite population as

$$\Omega_\nu(y^0) = \frac{1}{N_\nu}\sum_{i\in U_\nu} I\left(\|y_i - \gamma_\nu\| \le \|y^0 - \gamma_\nu\|\right). \qquad (3.9)$$

The population outlyingness measure is generally not known, and it is our target population parameter. We propose a sample estimator of $\Omega_\nu(y^0)$ defined as

$$\hat\Omega_\nu(y^0) = \frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{1}{\pi_i}\, I\left(\|y_i - \hat\gamma_\nu\| \le \|y^0 - \hat\gamma_\nu\|\right). \qquad (3.10)$$

The closer this outlyingness measure is to one for a given observation, the more suspicious it is relative to the finite population. One possible way to use this in a particular survey is to define a cutoff value $\alpha^*$ and flag all observations $k$ with $\hat\Omega_\nu(y_k) > 1 - \alpha^*$ for closer scrutiny.
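To fix ideas, here is a minimal Python sketch of the estimators (3.3) and (3.10) and of the flagging rule with cutoff $\alpha^*$; the norm is Euclidean and the data and weights are hypothetical, so this is an illustration rather than the NRI implementation:

```python
import numpy as np

def hajek_distance_distribution(ys, w, center, d):
    """Hajek estimator (3.3) of the distance distribution at d;
    w_i = 1/pi_i are the design weights."""
    dist = np.linalg.norm(ys - center, axis=1)
    return float(np.sum(w * (dist <= d)) / np.sum(w))

def outlyingness(ys, w, center, y0):
    """Estimated outlyingness measure (3.10) of a location y0."""
    return hajek_distance_distribution(ys, w, center,
                                       np.linalg.norm(y0 - center))

# Hypothetical sample: 500 draws with unequal inclusion probabilities
rng = np.random.default_rng(4)
ys = rng.normal(size=(500, 3))
w = 1.0 / rng.uniform(0.01, 0.05, size=500)
center = np.average(ys, axis=0, weights=w)    # Hajek mean, cf. (3.6)
y0 = np.array([4.0, 0.0, 0.0])

alpha_star = 0.02                             # flag if > 1 - alpha*
print(outlyingness(ys, w, center, y0) > 1.0 - alpha_star)
```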

3.2 Assumptions

We will estimate the distribution of distances under a design-based framework. We assume the sequence of finite populations to be fixed, so that randomness only comes from the sampling mechanism. In Assumption 3.2.1, we assume a reasonable rate for the growth of the sample size relative to the population size. Assumption 3.2.1(1) is used in showing asymptotic results for the distance distribution and nondifferentiable statistics containing estimated parameters, and Assumption 3.2.1(2) is used for nondifferentiable estimating equations. We do not want to restrict our attention to a specific sampling design, but instead make rather general assumptions that cover various sampling schemes. Assumptions 3.2.2 and 3.2.3 guarantee the design consistency and asymptotic normality of a Horvitz-Thompson estimator of the mean.

Assumption 3.2.1. The sample size $n_\nu$ can be fixed or random for any $\nu$, and we require $n_\nu = O_p(N_\nu^{\beta})$ with

1. $\frac{2p}{2p+1} < \beta \le 1$;

2. $\frac{1}{2} < \beta \le 1$.

Assumption 3.2.2. The following conditions hold for the inclusion probabilities $\pi_i$ and the design variance of the Horvitz-Thompson estimator of the mean:

1. $K_L \le \frac{N_\nu}{n_\nu^*}\,\pi_i \le K_U$ for all $i$, where $K_L$ and $K_U$ are positive constants.

2. For any vector $z$ with finite $2+\delta$ moments, define $\bar z_{\nu,\pi} = \frac{1}{N_\nu}\sum_{i\in S_\nu}\frac{z_i}{\pi_i}$ as the Horvitz-Thompson estimator of $\bar z_\nu = \frac{1}{N_\nu}\sum_{i\in U_\nu} z_i$. We assume

$$\mathrm{Var}(\bar z_{\nu,\pi}\mid\mathcal F_\nu) \le K_1\,\mathrm{Var}_{SRS}(\bar z_{\nu,\pi}\mid\mathcal F_\nu),$$

for some constant $K_1$, where $\mathrm{Var}_{SRS}(\bar z_{\nu,\pi}\mid\mathcal F_\nu)$ is the design variance-covariance matrix of $\bar z_{\nu,\pi}$ under simple random sampling of size $n_\nu^*$.

It is readily shown that, under Assumption 3.2.2(2), $\hat N_\nu / N_\nu \xrightarrow{p} 1$ by bounding its design variance.

Assumption 3.2.3. For any $z$ with positive definite variance-covariance matrix and finite fourth moments,

$$n_\nu^{*1/2}\left(\bar z_{\nu,\pi} - \bar z_\nu\right) \mid \mathcal F_\nu \;\xrightarrow{d}\; N(0, \Sigma_{zz}), \qquad (3.11)$$

and

$$\left[V(\bar z_{\nu,\pi}\mid\mathcal F_\nu)\right]^{-1}\hat V_{HT}(\bar z_{\nu,\pi}\mid\mathcal F_\nu) - I = O_p(n_\nu^{*-1/2}), \qquad (3.12)$$

where $\Sigma_{zz}$ is a nonnegative definite matrix, $V(\bar z_{\nu,\pi}\mid\mathcal F_\nu)$ is the design variance-covariance matrix of the estimator $\bar z_{\nu,\pi}$, and $\hat V_{HT}(\bar z_{\nu,\pi}\mid\mathcal F_\nu)$ is the Horvitz-Thompson estimator of the variance of $\bar z_{\nu,\pi}\mid\mathcal F_\nu$.

Assumption 3.2.3 on the normality of the Horvitz-Thompson estimator for a general design and for a general vector with moment conditions is similar to assumptions in Fuller (2007), among others.

We also need to assume a further number of more specific regularity conditions on the sequence of finite populations. Assumption 3.2.4 specifies a population moment condition on the population vectors $y_i$. In Assumption 3.2.5, we assume the existence of the limiting function of $D_{\nu,d}(\gamma)$ and that the limiting function satisfies certain smoothness conditions. Assumption 3.2.6 provides limits for a number of population quantities needed to derive the design-based results.

Assumption 3.2.4. The sequence of population vectors $y_i$ has bounded $4+\delta$ moments,

$$\lim_{N_\nu\to\infty}\frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i\|^{4+\delta} < \infty$$

for some $\delta > 0$.

Assumption 3.2.5. 1. The population-level distance distribution converges to a limiting distance distribution,

$$\lim_{\nu\to\infty} D_{\nu,d}(\gamma) = \mathcal D_d(\gamma)$$

on $(d,\gamma)\in[0,\infty)\times\mathbb R^p$.

2. The limiting function $\mathcal D_d(\gamma)$ is continuous in $d\in[0,\infty)$ and $\gamma\in\mathbb R^p$. Additionally, the derivatives $\frac{\partial\mathcal D_d(\gamma)}{\partial d}$, $\frac{\partial\mathcal D_d(\gamma)}{\partial\gamma}$ and $\frac{\partial^2\mathcal D_d(\gamma)}{\partial\gamma\,\partial\gamma'}$ exist and are finite in $(d,\gamma)\in[0,+\infty)\times\mathbb R^p$.

Assumption 3.2.6. The following population quantities converge to zero:

1.

$$\frac{D_{\nu,d+h_N}(\gamma) - D_{\nu,d}(\gamma)}{h_N} - \frac{\partial\mathcal D_d(\gamma)}{\partial d} \longrightarrow 0,$$

where $h_N = O(N_\nu^{-a})$ and $a\in\left(\frac{1}{2}, 1\right)$.

2. Let

$$R_i = I\left(\|y_i - \gamma - n_\nu^{*-1/2}s\|\le d\right) - I\left(\|y_i - \gamma\|\le d\right) - \left(\mathcal D_d(\gamma + n_\nu^{*-1/2}s) - \mathcal D_d(\gamma)\right);$$

then

$$n_\nu^{*1/2}\,\frac{1}{N_\nu}\sum_{i\in U_\nu} R_i \longrightarrow 0$$

uniformly for $\gamma\in\mathbb R^p$ and $s\in C_S$, a large enough compact set in $\mathbb R^p$.

The reasonableness of the population requirements in Assumptions 3.2.5 and in particular 3.2.6 is somewhat difficult to evaluate as stated. Therefore, in Assumption A.1.1 in the Appendix, a superpopulation model version of Assumption 3.2.5 is stated, under which the $y_i$ are generated through a probabilistic mechanism. Based on that assumption, a number of model results can be shown to hold with probability one. In particular, the two parts of Assumption 3.2.6 are shown to hold almost surely under the superpopulation model, as derived in Lemmas A.1.1 and A.1.2 in the Appendix. We here assume that the (fixed) population sequence from which we are sampling is such that these results hold, without the almost sure condition.

The next set of assumptions is necessary for deriving the median-based results, where we need stronger conditions on the distribution of the finite population elements and restrictions on the norm. Assumption 3.2.7 guarantees the uniqueness of the population median $q_\nu$. Assumption 3.2.8(1) gives smoothness conditions on the norm, Assumption 3.2.8(2) will be used in showing asymptotic normality of $\hat q_\nu$, and Assumption 3.2.8(3) is needed in variance estimation.

Assumption 3.2.7. There exists no $(\lambda, \gamma_1, \gamma_2)$, with $\lambda\in(0,1)$ and $\gamma_1,\gamma_2\in\mathbb R^p$, such that a large enough finite population $U_\nu$ puts all points on

$$L_{\gamma_1,\gamma_2} = \left\{y : \|y - \gamma_1 - \gamma_2\| = \|\lambda y - \gamma_1\| + \|(1-\lambda)y - \gamma_2\|\right\}. \qquad (3.14)$$

Assumption 3.2.8. The following conditions hold for the norm $\|\cdot\|$:

1. The norm $\|\cdot\|$ is continuous on $\mathbb R^p$, with a continuous gradient vector $\psi(\gamma)$ and bounded second derivative matrix $H_s(\gamma)$.

2. For any $\gamma$ in a neighborhood of $q_\nu$, $\frac{1}{N_\nu}\sum_{i\in U_\nu} H_s(y_i - \gamma)$ is a nonsingular matrix. Further, the sequence $H_s(y_i - \gamma_\nu)$ has bounded first two moments at the population level.

3. The gradient function $\psi(y_i - \gamma_\nu)$ has bounded fourth population moments,

$$\frac{1}{N_\nu}\sum_{i\in U_\nu}\|\psi(y_i - \gamma_\nu)\|^4 < \infty.$$

The definition of $L_{\gamma_1,\gamma_2}$ in Assumption 3.2.7 is interesting to explore. It includes all the points $y$ such that $\lambda y - \gamma_1$ and $(1-\lambda)y - \gamma_2$ are on the same "line" through the origin. The definition of "line" differs from one norm to another: in the $L_2$ norm it is the usual definition of a line, but in the $L_1$ norm it says that $\lambda y - \gamma_1$ and $(1-\lambda)y - \gamma_2$ are in the same orthant.

Finally, we give some conditions on the kernel function and bandwidth that will be used in constructing a variance estimator, where we use a kernel regression estimator to estimate the derivative of the limiting smooth function, as needed in the expression of the asymptotic variance.

Assumption 3.2.9(1) is standard in the literature on kernel smoothing, and (2) specifies conditions on the bandwidth. Assumption 3.2.9(3) is satisfied by the Gaussian kernel, and compactly supported kernels like the Epanechnikov kernel also have bounded derivatives, as they take the value zero outside the interval $[-1, 1]$.

Assumption 3.2.9. The following conditions hold for the kernel function $K(t)$ and bandwidth $h$:

1. The kernel function $K(t)$ is symmetric with $\int_{-\infty}^{\infty}K(t)\,dt = 1$, and $K(t)$ is an absolutely continuous function with finite derivative $K'(t)$. Further, let $R(K) = \int_{-\infty}^{\infty}K^2(t)\,dt < \infty$ and $\sigma_K^2 = \int_{-\infty}^{\infty}t^2 K(t)\,dt < \infty$.

2. The bandwidth $h\to 0$, and $N_\nu h(\log N_\nu)^{-1}\to\infty$ as $N_\nu\to\infty$.

3. There exists a constant $c$ such that $\left|\frac{1}{h^2}K'\!\left(\frac{x}{h}\right)\right| \le c$ for any $x\ne 0$ and $h$ arbitrarily small.

4. The graph defined by $\left\{\gamma : \frac{\partial}{\partial\gamma}K\!\left(\frac{d-\|y-\gamma\|}{h}\right) = 0\right\}$ partitions $\mathbb R^p$ into a finite number of subsets.

3.3 Distance Distribution Estimators and Measures of Outlyingness

3.3.1 Mean-based distance distribution

In this section, we use the mean vector as the measure of center, and show the asymptotic properties of the sample distance distribution estimator. The sample distance distribution function (3.3) is a nonsmooth function of the estimated center, so the uncertainty due to estimating the mean cannot be directly quantified through linearization. The idea for tackling the problem is to replace the nonsmooth function by a smooth limiting function and carry out the linearization on the smooth function. Lemma 1 establishes an intermediate result, which allows us to replace the nonsmooth function by the limiting smooth function, similar to (1.6) of Randles (1982). Then we can linearize the estimator (Theorem 1) and derive asymptotic normality of the estimator (Theorem 2).

Lemma 1. Under Assumptions 3.2.1(1), 3.2.2 and 3.2.5-3.2.6,

$$n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - \mathcal D_d(\hat\mu_\nu) + \mathcal D_d(\mu_\nu)\right)\xrightarrow{p} 0 \qquad (3.15)$$

in design probability.

Proof. See Appendix. •

Theorem 1. Under Assumptions 3.2.1(1), 3.2.2 and 3.2.5-3.2.6, the sample-based quantity $\hat D_{\nu,d}(\hat\mu_\nu)$ is $\sqrt{n_\nu^*}$-consistent for the corresponding population quantity $D_{\nu,d}(\mu_\nu)$, namely,

$$n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) = O_p(1).$$

Further,

$$\hat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu) = \left(\hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) + \left(\mathcal D_d(\hat\mu_\nu) - \mathcal D_d(\mu_\nu)\right) + o_p(n_\nu^{*-1/2}). \qquad (3.16)$$

Proof. We use the following decomposition,

$$\begin{aligned}
n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu)\right)
&= n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - \mathcal D_d(\hat\mu_\nu) + \mathcal D_d(\mu_\nu)\right) \\
&\quad + n_\nu^{*1/2}\left(\hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) \\
&\quad + n_\nu^{*1/2}\left(\mathcal D_d(\hat\mu_\nu) - \mathcal D_d(\mu_\nu)\right),
\end{aligned}$$

where the first term is $o_p(1)$ by Lemma 1, and the last two terms are both $O_p(1)$ by Assumption 3.2.2 and standard linearization methods. □

Equation (3.16) allows us to obtain the asymptotic distribution of $\hat D_{\nu,d}(\hat\mu_\nu)$. The first term on the RHS of (3.16) contains $\hat D_{\nu,d}(\mu_\nu)$, which is a ratio of two Horvitz-Thompson estimators. In the second term, $\mathcal D_d(\hat\mu_\nu)$ applies a smooth function to the ratio estimator of the population mean, whose variance can be evaluated by linearizing $\mathcal D_d(\cdot)$.

Assumptions 3.2.3 and 3.2.4 imply the following multivariate normality,

$$n_\nu^{*1/2}\left(\frac{1}{N_\nu}\sum_{i\in S_\nu}\frac{z_i}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu} z_i\right)\xrightarrow{d} N(0, \Sigma_{M,d}), \qquad z_i = \left(I(\|y_i - \mu_\nu\|\le d),\; 1,\; y_i'\right)', \qquad (3.17)$$

where

$$\Sigma_{M,d} = \lim_{\nu\to\infty}\frac{n_\nu^*}{N_\nu^2}\sum_{i,j\in U_\nu}\frac{\pi_{ij}-\pi_i\pi_j}{\pi_i\pi_j}\, z_i\, z_j'. \qquad (3.18)$$

This immediately leads to the following result.

Theorem 2. Under Assumptions 3.2.1(1), 3.2.2-3.2.6,

$$n_\nu^{*1/2}\left[V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)\right]^{-1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu)\right)\Big|\,\mathcal F_\nu \;\xrightarrow{d}\; N(0,1), \qquad (3.19)$$

where

$$V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right) = a_M'\,\Sigma_{M,d}\; a_M, \qquad (3.20)$$

$$a_M = \left(1,\; -D_{\nu,d}(\mu_\nu) - \left(\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\mu_\nu}\right)'\mu_\nu,\; \left(\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\mu_\nu}\right)'\right)', \qquad (3.21)$$

and $\Sigma_{M,d}$ is defined in (3.18).

Proof. We use decomposition (3.16) and write

$$\begin{aligned}
& n_\nu^{*1/2}\left(\hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) + n_\nu^{*1/2}\left(\mathcal D_d(\hat\mu_\nu) - \mathcal D_d(\mu_\nu)\right) \\
&= n_\nu^{*1/2}\left(\tilde D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) - n_\nu^{*1/2}\, D_{\nu,d}(\mu_\nu)\left(\frac{\hat N_\nu}{N_\nu} - 1\right) \\
&\quad + n_\nu^{*1/2}\left(\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\mu_\nu}\right)'\left(\hat\mu_\nu - \mu_\nu\right) + o_p(1).
\end{aligned}$$

The result follows from Slutsky's Theorem. □

The asymptotic variance of $\hat D_{\nu,d}(\hat\mu_\nu)$ consists of two components: the first comes from using the estimator $\hat D_{\nu,d}(\mu_\nu)$, and the second is due to the uncertainty in estimating the population center. The first component is essentially the variance of a ratio estimator and can easily be estimated using a plug-in estimator or a replication procedure, but the second component involves an unknown derivative of the limiting smooth function. The derivative can be estimated by kernel smoothing and incorporated into either the plug-in or the replication estimator. We will discuss estimation of $V(\hat D_{\nu,d}(\hat\mu_\nu))$ in section 3.4.
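As a rough illustration of the kernel-smoothing idea (the formal estimators are developed in section 3.4; the Gaussian-CDF smoothing, bandwidth and step size below are placeholder choices, not the thesis's estimator), one can replace the indicator in (3.3) by a smooth weight and differentiate the smooth surrogate numerically with respect to the center:

```python
import numpy as np
from math import erf

def smoothed_D(ys, w, center, d, h):
    """Kernel-smoothed surrogate for the Hajek estimator (3.3): the
    indicator I(||y_i - center|| <= d) is replaced by a Gaussian-CDF
    weight, making the surrogate differentiable in the center."""
    dist = np.linalg.norm(ys - center, axis=1)
    smooth = 0.5 * (1.0 + np.array([erf(v) for v in
                                    (d - dist) / (np.sqrt(2.0) * h)]))
    return float(np.sum(w * smooth) / np.sum(w))

def grad_wrt_center(ys, w, center, d, h, eps=1e-4):
    """Central-difference gradient of the smoothed surrogate with
    respect to the center; one ingredient of a plug-in variance
    estimator for the second component."""
    grad = np.zeros(center.size)
    for j in range(center.size):
        e = np.zeros(center.size)
        e[j] = eps
        grad[j] = (smoothed_D(ys, w, center + e, d, h)
                   - smoothed_D(ys, w, center - e, d, h)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(5)
ys = rng.normal(size=(400, 2))
w = np.ones(400)                       # equal weights, e.g. under SRS
center = np.average(ys, axis=0, weights=w)
print(grad_wrt_center(ys, w, center, d=1.0, h=0.2))
```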

3.3.2 Median-based distance distribution

Now let us look at the case where we use the median as the measure of subpopulation center. We introduce the following estimating equations at the population and sample level, respectively,

$$\sum_{i\in U_\nu}\psi(y_i - \gamma) = 0, \qquad (3.22)$$

and

$$\sum_{i\in S_\nu}\frac{\psi(y_i - \gamma)}{\pi_i} = 0. \qquad (3.23)$$

The medians defined by (3.7)-(3.8) are related to the roots of the estimating equations

(3.22)-(3.23). More details are deferred to Appendix B.

Lemma 2. Under Assumption 3.2.7, for a large enough population, $\sum_{i\in U_\nu}\|y_i - \gamma\|$ has only one local minimum, which is also its global minimum.

Remark 1: Lemma 2 states a stronger result than the uniqueness of $q_\nu$; it also says there are no other local minima of $\sum_{i\in U_\nu}\|y_i - \gamma\|$.

Remark 2: If we make Assumption 3.2.7 on the sample $S_\nu$, then it is obvious that $\sum_{i\in S_\nu}\pi_i^{-1}\|y_i - \gamma\|$ has a unique global minimizer and no other local minimizers. Under a complex sampling design, however, we may have a nonzero probability of selecting a sample in which all points are on the same line. But considering the increasing sequence of finite populations and the sequence of sampling designs, there is high probability of selecting a sample in which not all points are on a line if the sample is large enough.

Although the sequence of sample medians $\hat q_\nu$ need not be unique, we can show that any sequence $\hat q_\nu$ is design consistent for the population median $q_\nu$ and asymptotically normally distributed, as stated and proved in Theorems 3 and 4. The distances can then be defined using any one of these sequences, and the sample distance distribution estimators will remain consistent and asymptotically normally distributed.

In establishing the weak convergence of $\hat q_\nu$ to $q_\nu$, we adopt the definition of weak convergence from page 24 of Billingsley (1968) and use the general norm $\|\cdot\|$ as a discrepancy measure. Here, we say $\hat q_\nu$ converges to $q_\nu$ weakly if and only if, for every $\epsilon > 0$,

$$\lim_{\nu\to\infty} P\left(\|\hat q_\nu - q_\nu\| > \epsilon\right) = 0. \qquad (3.24)$$

Theorem 3. Under Assumptions 3.2.1(1), 3.2.2 and 3.2.7, any sequence $\hat q_\nu$ that satisfies (3.8) is design consistent for $q_\nu$.

Proof. By contradiction. Suppose $\hat q_\nu$ does not converge to $q_\nu$ in probability. Then there exist $\epsilon_1, \delta > 0$ such that

$$P\left(\|\hat q_\nu - q_\nu\| > \epsilon_1\right) > \delta \qquad (3.25)$$

for infinitely many $\nu$. Equation (3.25) together with Assumption 3.2.7 would imply that there exists $\epsilon_2 > 0$ such that

$$P\left(\frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - \hat q_\nu\| - \frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - q_\nu\| > \epsilon_2\right) > \delta \qquad (3.26)$$

for infinitely many $\nu$.

Now, let us show that

$$\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - \hat q_\nu\|}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - \hat q_\nu\| \xrightarrow{p} 0. \qquad (3.27)$$

Define

$$Q_n(\gamma) = \frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - \gamma\|}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - \gamma\|,$$

and note that

$$P\left(|Q_n(\hat q_\nu)| > \epsilon\right) \le P\left[\sup_{\gamma\in C}|Q_n(\gamma)| > \epsilon\right] + P(\hat q_\nu \notin C),$$

where $C$ is a large enough compact set. It is then left to show that

$$\sup_{\gamma\in C}|Q_n(\gamma)| \xrightarrow{p} 0. \qquad (3.28)$$

The equation above can be shown by a covering technique, similar to the approach we use to show (B.1) in Lemma 1 in the Appendix. Equations (3.26) and (3.27) imply that there exist $\epsilon_3 > 0$ and $\delta_1 > 0$ such that

$$P\left(\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - \hat q_\nu\|}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - q_\nu\| > \epsilon_3\right) > \delta_1 \qquad (3.29)$$

for infinitely many $\nu$. Since $\hat q_\nu$ minimizes the sample objective, $\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - q_\nu\|}{\pi_i} \ge \frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - \hat q_\nu\|}{\pi_i}$, so for the same $\epsilon_3$ and $\delta_1$,

$$P\left(\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - q_\nu\|}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - q_\nu\| > \epsilon_3\right) > \delta_1$$

for infinitely many $\nu$, contradicting the fact that $\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\|y_i - q_\nu\|}{\pi_i}$ is design consistent for $\frac{1}{N_\nu}\sum_{i\in U_\nu}\|y_i - q_\nu\|$. □

Theorem 4. Under Assumptions 3.2.1(1), 3.2.2-3.2.3 and 3.2.8(1), and assuming that $\hat q_\nu$ is a design consistent sequence, we have the following asymptotic normality for the sample median,

$$n_\nu^{*1/2}\left(\hat q_\nu - q_\nu\right)\xrightarrow{d} N_p(0, \Sigma_{\nu,q}), \qquad (3.30)$$

where $\Sigma_{\nu,q} = A^{-1}\Sigma_\psi (A^{-1})'$, with $A = \lim_{\nu\to\infty}\frac{1}{N_\nu}\sum_{i=1}^{N_\nu}H_s(y_i - q_\nu)$ and $\Sigma_\psi$ the limiting design variance-covariance matrix in (3.33) below.

Proof. The consistency of $\hat q_\nu$ for $q_\nu$ gives

$$\sum_{i\in S_\nu}\frac{1}{\pi_i}\psi(y_i - \hat q_\nu) = 0$$

$$\Rightarrow\quad \sum_{i\in S_\nu}\frac{1}{\pi_i}\psi\left(y_i - q_\nu - (\hat q_\nu - q_\nu)\right) = 0$$

$$\Rightarrow\quad \sum_{i\in S_\nu}\frac{1}{\pi_i}\left\{\psi(y_i - q_\nu) - H_s(y_i - q_\nu)(\hat q_\nu - q_\nu) + o_p(\hat q_\nu - q_\nu)\right\} = 0,$$

which implies

$$\hat q_\nu = q_\nu + \left(\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{H_s(y_i - q_\nu)}{\pi_i}\right)^{-1}\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(\hat q_\nu - q_\nu), \qquad (3.31)$$

after using the non-singularity condition in Assumption 3.2.8(2) to take the inverse of $\frac{1}{\hat N_\nu}\sum_{i\in S_\nu}\frac{H_s(y_i - q_\nu)}{\pi_i}$. It is easy to argue that

$$\frac{1}{N_\nu}\sum_{i\in S_\nu}\frac{H_s(y_i - q_\nu)}{\pi_i} - \frac{1}{N_\nu}\sum_{i\in U_\nu}H_s(y_i - q_\nu) \xrightarrow{p} 0 \qquad (3.32)$$

and

$$n_\nu^{*1/2}\,\frac{1}{N_\nu}\sum_{i\in S_\nu}\frac{\psi(y_i - q_\nu)}{\pi_i} \xrightarrow{d} N_p(0, \Sigma_\psi), \qquad (3.33)$$

where $\Sigma_\psi$ is the limiting design variance-covariance matrix of $n_\nu^{*1/2}\,N_\nu^{-1}\sum_{i\in S_\nu}\psi(y_i - q_\nu)/\pi_i$. The proof is then completed by applying Slutsky's Theorem and the fact that $\hat N_\nu / N_\nu \xrightarrow{p} 1$ to obtain (3.30). □

Theorem 5. Suppose Assumptions 3.2.1(1), 3.2.2 and 3.2.7-3.2.8 are satisfied. Then for any sequence $\hat{q}_\nu$ that satisfies (3.8), the estimated distance distribution $\hat D_{\nu,d}(\hat{q}_\nu)$ is $\sqrt{n_\nu^*}$-consistent for the corresponding population quantity $D_{\nu,d}(q_\nu)$, namely,
$$n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat{q}_\nu) - D_{\nu,d}(q_\nu)\right) = O_p(1).$$
Further,
$$\hat D_{\nu,d}(\hat{q}_\nu) - D_{\nu,d}(q_\nu) = \left(\hat D_{\nu,d}(q_\nu) - D_{\nu,d}(q_\nu)\right) + \left(\mathcal D_d(\hat{q}_\nu) - \mathcal D_d(q_\nu)\right) + o_p(n_\nu^{*-1/2}). \qquad (3.34)$$

Proof. Similar to the proof of Theorem 2, we use the following decomposition,
$$n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat{q}_\nu) - D_{\nu,d}(q_\nu)\right) = n_\nu^{*1/2}\left(\hat D_{\nu,d}(\hat{q}_\nu) - \mathcal D_d(\hat{q}_\nu) - \hat D_{\nu,d}(q_\nu) + \mathcal D_d(q_\nu)\right)$$
$$+\ n_\nu^{*1/2}\left(\hat D_{\nu,d}(q_\nu) - D_{\nu,d}(q_\nu)\right) + n_\nu^{*1/2}\left(\mathcal D_d(\hat{q}_\nu) - \mathcal D_d(q_\nu)\right).$$
We can show that the first term converges to zero in design probability in a similar fashion to the proof of Lemma 1. The remainder of the proof follows as $\sqrt{n_\nu^*}\,(\hat{q}_\nu - q_\nu) = O_p(1)$. $\square$

Assumption 3.2.8(4) together with Assumption 3.2.3 gives
$$\Sigma_{q,d} = \frac{n_\nu^*}{N_\nu^2}\sum_{i\in U_\nu}\sum_{j\in U_\nu}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\, b_{q,i}\, b_{q,j}', \qquad (3.36)$$
where $b_{q,i} = \left(I_{(\|y_i - q_\nu\| \le d)},\ \psi_s(y_i - q_\nu)'\right)'$.

Theorem 6. Under Assumptions 3.2.1(1), 3.2.2-3.2.3 and 3.2.7-3.2.8, for any sequence $\hat{q}_\nu$ satisfying (3.8),
$$n_\nu^{*1/2}\left[V\left(\hat D_{\nu,d}(\hat{q}_\nu)\right)\right]^{-1/2}\left(\hat D_{\nu,d}(\hat{q}_\nu) - D_{\nu,d}(q_\nu)\right) \xrightarrow{d} N(0,1),$$
where
$$V\left(\hat D_{\nu,d}(\hat{q}_\nu)\right) = a_q\,\Sigma_{q,d}\,a_q' \qquad (3.37)$$
and
$$a_q = \left(1,\ -\frac{\partial \mathcal D_d(q_\nu)}{\partial \gamma'}\,\bar H_{s,N_\nu}^{-1}\right), \qquad (3.38)$$
where $\bar H_{s,N_\nu} = \frac{1}{N_\nu}\sum_{i=1}^{N_\nu} H_s(y_i - q_\nu)$ and $\Sigma_{q,d}$ is defined in (3.36).

Proof. The proof is similar to that of Theorem 2. But as there is no $\hat N_\nu$ in the expansion of $\hat{q}_\nu$, we do not have a complicated second term in $a_q$. $\square$

Another extension of the concept of quantiles to the multivariate case is given by Chaudhuri (1996), called the "geometric quantile", which also takes the direction in multidimensional space into account. He also derived a Bahadur representation for the proposed multivariate quantile.

3.3.3 Measures of outlyingness

In this section, we use a general measure of center, either the mean vector or the generalized median, and show that the estimated outlyingness measure $\hat\Omega_\nu(y^0)$ is design consistent for $\Omega_\nu(y^0)$ and asymptotically normally distributed, where $y^0$ is an arbitrary location in $\Re^p$ (these results do not follow directly from those in Sections 3.3.1 and 3.3.2, because of the presence of an additional estimated center in $\hat\Omega_\nu(y^0)$). Lemma 3 is similar to, but a stronger version of, Lemma 1, and it leads to the conclusion that quantifying the distance between $y^0$ and the population as $\|y^0 - \hat\gamma_\nu\|$ will not affect the leading term in the asymptotic variance, although there is error in estimating $\gamma_\nu$.

Lemma 3. Under Assumptions 3.2.1(1), 3.2.2, 3.2.5-3.2.8,
$$n_\nu^{*1/2}\left(\hat D_{\nu,\|y^0-\gamma\|}(\gamma) - D_{\nu,\|y^0-\gamma\|}(\gamma) - \hat D_{\nu,\|y^0-\gamma_\nu\|}(\gamma_\nu) + D_{\nu,\|y^0-\gamma_\nu\|}(\gamma_\nu)\right) \xrightarrow{p} 0 \qquad (3.39)$$
uniformly for $\gamma$ in a neighbourhood of $\gamma_\nu$, where the measures of center $\gamma_\nu$ and $\hat\gamma_\nu$ can be either the mean vector or the generalized median.

Theorem 7. Under Assumptions 3.2.1(1), 3.2.2-3.2.8, for any sequence $\hat\gamma_\nu$ satisfying (3.6) or (3.8),
$$n_\nu^{*1/2}\left[V\left(\hat\Omega_\nu(y^0)\right)\right]^{-1/2}\left(\hat\Omega_\nu(y^0) - \Omega_\nu(y^0)\right) \xrightarrow{d} N(0,1), \quad\text{with}\quad V\left(\hat\Omega_\nu(y^0)\right) = a_\gamma\,\Sigma_{\gamma,\|y^0-\gamma_\nu\|}\,a_\gamma', \qquad (3.40)$$
where $a_\gamma$ is defined in (3.21) or (3.38) with $\frac{\partial\mathcal D_d(\gamma_\nu)}{\partial\gamma}$ replaced by $-\frac{\partial\mathcal D_{\|y^0-\gamma\|}(\gamma_\nu)}{\partial\gamma}$, and $\Sigma_{\gamma,\|y^0-\gamma_\nu\|}$ is defined in either (3.18) or (3.36) with $d$ replaced by $\|y^0 - \gamma_\nu\|$.

Proof. We use the following decomposition,
$$n_\nu^{*1/2}\left(\hat\Omega_\nu(y^0) - \Omega_\nu(y^0)\right) = n_\nu^{*1/2}\left(\frac{1}{\hat N_\nu}\sum_{i\in s_\nu}\frac{1}{\pi_i}\,I_{(\|y_i-\hat\gamma_\nu\| \le \|y^0-\hat\gamma_\nu\|)} - \frac{1}{N_\nu}\sum_{i\in U_\nu}I_{(\|y_i-\gamma_\nu\| \le \|y^0-\gamma_\nu\|)}\right)$$
$$= n_\nu^{*1/2}\left(\hat D_{\nu,\|y^0-\hat\gamma_\nu\|}(\hat\gamma_\nu) - D_{\nu,\|y^0-\gamma_\nu\|}(\gamma_\nu)\right)$$
$$= n_\nu^{*1/2}\left(\hat D_{\nu,\|y^0-\gamma_\nu\|}(\gamma_\nu) - D_{\nu,\|y^0-\gamma_\nu\|}(\gamma_\nu)\right) + n_\nu^{*1/2}\left(\mathcal D_{\|y^0-\hat\gamma_\nu\|}(\hat\gamma_\nu) - \mathcal D_{\|y^0-\gamma_\nu\|}(\gamma_\nu)\right) + o_p(1),$$
by Lemmas 1 and 3. Then the rest of the proof follows from the mean-based or median-based distance distribution results. $\square$

We briefly consider an additional extension of the outlier detection estimator, to the situation in which the distance measure is itself sample-dependent. As an example of this, assume $\Sigma_\nu$ to be a non-negative definite shape measure for the $y_i$'s in population $U_\nu$, and assume there exists a $\sqrt{n_\nu^*}$-consistent estimator of $\Sigma_\nu$ based on the sample, say $\hat\Sigma_\nu$. We can then define the Mahalanobis distance as
$$\|y_i - \gamma_\nu\|_{\Sigma_\nu} = \sqrt{(y_i - \gamma_\nu)'\,\Sigma_\nu^{-1}(y_i - \gamma_\nu)}, \qquad (3.41)$$
and the population distance distribution can be written as
$$D_{\nu,d}(\gamma_\nu, \Sigma_\nu) = \frac{1}{N_\nu}\sum_{i\in U_\nu} I_{(\|y_i - \gamma_\nu\|_{\Sigma_\nu} \le d)}, \qquad (3.42)$$
and a natural sample estimator for the distribution is
$$\hat D_{\nu,d}(\hat\gamma_\nu, \hat\Sigma_\nu) = \frac{1}{\hat N_\nu}\sum_{i\in s_\nu}\frac{1}{\pi_i}\, I_{(\|y_i - \hat\gamma_\nu\|_{\hat\Sigma_\nu} \le d)}. \qquad (3.43)$$

The inverse matrix $\hat\Sigma_\nu^{-1}$ is defined through a principal value decomposition. Let $\hat\Sigma_\nu = \Gamma_\nu\Lambda_\nu\Gamma_\nu'$, with $\Lambda_\nu$ a diagonal matrix with entries $\lambda_{\nu k}$. We define $\hat\Sigma_\nu^{-1} = \Gamma_\nu\Lambda_\nu^{-1}\Gamma_\nu'$, where $\Lambda_\nu^{-1}$ has diagonal elements $\{\max(\lambda_{\nu k}, \tfrac{1}{M})^{-1}\}$ for a large number $M$. The sample shape matrix may also have zero eigenvalues. The above procedure effectively ignores the degenerate dimensions, but it assigns a very large distance to a new observation $y_i$ that has a different value in a direction associated with a near-zero eigenvalue.

Another remark is that the Mahalanobis distance $\|y_i - \gamma_\nu\|_{\Sigma_\nu}$ defined by (3.41) does not satisfy the usual definition of a norm, as a nonzero vector may have zero distance when the shape measure $\Sigma_\nu$ is degenerate. We will call it a seminorm.
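To make the truncation concrete, the following is a minimal sketch of the eigenvalue-truncated inverse and the induced seminorm. The function names and the use of NumPy's `eigh` are our own illustrative choices, not part of the original derivation; `M` plays the role of the large truncation constant above.

```python
import numpy as np

def truncated_inverse(sigma_hat, M=1e6):
    # Principal value decomposition: Sigma = Gamma Lambda Gamma'
    lam, gamma = np.linalg.eigh(sigma_hat)
    # Floor eigenvalues at 1/M, so near-degenerate directions receive a
    # very large (but finite) weight in the inverse
    lam_inv = 1.0 / np.maximum(lam, 1.0 / M)
    return gamma @ np.diag(lam_inv) @ gamma.T

def mahalanobis_seminorm(y, center, sigma_inv):
    # Seminorm ||y - center||; it can be zero for a nonzero vector
    # when the shape measure is degenerate
    d = np.asarray(y, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(d @ sigma_inv @ d))
```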

Similarly, consider a population measure of outlyingness defined as
$$\Omega_\nu(y^0) = \frac{1}{N_\nu}\sum_{i\in U_\nu} I_{(\|y_i - \gamma_\nu\|_{\Sigma_\nu} \le \|y^0 - \gamma_\nu\|_{\Sigma_\nu})}, \qquad (3.44)$$
with
$$\|y_i - \gamma_\nu\|_{\Sigma_\nu} = \sqrt{(y_i - \gamma_\nu)'\,\Sigma_\nu^{-1}(y_i - \gamma_\nu)}, \qquad \Sigma_\nu = \frac{1}{N_\nu - 1}\sum_{i\in U_\nu}(y_i - \gamma_\nu)(y_i - \gamma_\nu)', \qquad (3.45)$$
where the generalized inverse of $\Sigma_\nu$ is used if the population variance-covariance matrix is degenerate. This approach is attractive because the norm adjusts for the distribution and correlation of the survey variables, which might be beneficial in surveys with numerous related variables. Because $\Sigma_\nu$ is typically unknown, the corresponding sample estimator
$$\hat\Omega_\nu(y^0) = \frac{1}{\hat N_\nu}\sum_{i\in s_\nu}\frac{1}{\pi_i}\, I_{(\|y_i - \hat\gamma_\nu\|_{\hat\Sigma_\nu} \le \|y^0 - \hat\gamma_\nu\|_{\hat\Sigma_\nu})} \qquad (3.46)$$
is now used, with
$$\|y_i - \hat\gamma_\nu\|_{\hat\Sigma_\nu} = \sqrt{(y_i - \hat\gamma_\nu)'\,\hat\Sigma_\nu^{-1}(y_i - \hat\gamma_\nu)}, \qquad \hat\Sigma_\nu = \frac{1}{\hat N_\nu - 1}\sum_{i\in s_\nu}\frac{1}{\pi_i}(y_i - \hat\gamma_\nu)(y_i - \hat\gamma_\nu)'. \qquad (3.47)$$

The estimator (3.46) contains sample-dependent distances, so the previous theory does not apply directly in this case. To show the design consistency and asymptotic normality of this outlyingness measure with both center and shape estimated, we need to restate or strengthen the following assumptions:

1. The estimated shape measure $\hat\Sigma_\nu$ is $\sqrt{n_\nu^*}$-consistent for the population quantity $\Sigma_\nu$.

2. In Assumption 3.2.5, we need the limiting function to be a continuous function of the shape measure, with suitably defined derivatives.

3. A stronger uniform convergence assumption is needed in Assumption 3.2.6(2).

The asymptotic properties of estimator (3.46) could then be derived following the same approach as above, but we will not do so explicitly here.
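As an illustration of estimator (3.46), a minimal sketch with survey weights $1/\pi_i$ is given below. The function name and argument layout are ours; the shape inverse is assumed to have been obtained as in the truncation sketch above.

```python
import numpy as np

def outlyingness_hat(y0, ys, pis, center, sigma_inv):
    # Estimated fraction of points whose seminorm distance to the center
    # is no larger than that of the test location y0, as in (3.46)
    w = 1.0 / np.asarray(pis, dtype=float)          # survey weights 1/pi_i
    diffs = np.asarray(ys, dtype=float) - center
    dists = np.sqrt(np.einsum('ij,jk,ik->i', diffs, sigma_inv, diffs))
    d0 = np.asarray(y0, dtype=float) - center
    d0 = np.sqrt(d0 @ sigma_inv @ d0)
    return float(np.sum(w * (dists <= d0)) / np.sum(w))
```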

3.4 Variance Estimation

This section deals with estimating the variance of $\hat D_{\nu,d}(\hat\mu_\nu)$, $\hat D_{\nu,d}(\hat q_\nu)$ and the outlyingness measures of Section 3.3.3. We will introduce a naive estimator, a kernel estimator for incorporating the effect of the error in $\hat\gamma_\nu$, and a jackknife estimator. The naive estimator ignores the error in estimating the population center, which might be appropriate in some cases as discussed below. When the error due to estimating the population center cannot be ignored, we can estimate it by kernel smoothing and include the kernel estimator in an analytic variance estimator. An alternative approach is to incorporate the kernel estimator in a jackknife procedure. These estimators are discussed below.

3.4.1 Naive estimator

In general, in order to estimate $V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$ and $V\left(\hat D_{\nu,d}(\hat q_\nu)\right)$ as defined in (3.20) and (3.37), we need to estimate the gradient vectors $\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\gamma}$ and $\frac{\partial\mathcal D_d(q_\nu)}{\partial\gamma}$. However, for certain population distributions and norms, these gradients vanish and hence a simple variance estimator is sufficient. We investigate this case here.

We will show in Lemmas 4 and 5 that the gradient vectors $\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\gamma}$ and $\frac{\partial\mathcal D_d(q_\nu)}{\partial\gamma}$ are negligible if the population acts like a sample from an elliptically contoured distribution. We say a random vector $Y$ is distributed according to an elliptically contoured distribution with parameters $\mu$, $\Lambda$ and $\phi$, or $EC_p(\mu, \Lambda, \phi)$, if the characteristic function of $Y$ has the form
$$E\exp\{it'Y\} = \exp(it'\mu)\,\phi(t'\Lambda t),$$
where $\mu$ is a $p\times 1$ vector and $\Lambda$ is a non-negative definite matrix (Fang and Zhang 1980).

The proofs of the following two results are given in the Appendix. 30

Lemma 4. Assume random variable $Y \sim EC_p(\mu, \Lambda, \phi)$ with mean vector $\mu$, where $\Lambda$ is a non-negative definite matrix. Define the norm as
$$\|u\| = \sqrt{u'Bu}, \qquad (3.48)$$
where $B$ is a non-negative definite matrix. Then

1. the partial derivative evaluated at the superpopulation mean is zero, $\frac{\partial\mathcal D_d(\mu)}{\partial\gamma} = 0$;

2. the mean $\mu$ coincides with the median $q$, $\mu = q$.

Lemma 5. Assume 3.2.5, $\frac{\partial\mathcal D_d(\gamma)}{\partial\gamma} = 0$ for some constant vector $\gamma$, and that the sequence of population centers $\gamma_\nu$ converges to $\gamma$, $\lim_{\nu\to\infty}\gamma_\nu = \gamma$. Then $\frac{\partial\mathcal D_d(\gamma_\nu)}{\partial\gamma} = o(1)$.

Proof. The proof follows from a Taylor expansion. $\square$

Lemmas 4 and 5 imply that the extra variability due to estimating the population centers can be ignored for elliptical distributions with a norm specified by (3.48). This special case is similar to case A of Randles (1982), where we can "pretend" that we are using the true population center $\gamma_\nu$ without affecting the leading variance term. So we propose the following naive plug-in variance estimator for the leading term,
$$\hat V_{NV}\left(\hat D_{\nu,d}(\hat\gamma_\nu)\right) = \left(1,\ -\hat D_{\nu,d}(\hat\gamma_\nu)\right)\hat\Sigma_{\gamma,d,NV}\left(1,\ -\hat D_{\nu,d}(\hat\gamma_\nu)\right)', \qquad (3.49)$$
with
$$\hat\Sigma_{\gamma,d,NV} = \frac{n_\nu^*}{\hat N_\nu^2}\sum_{i\in s_\nu}\sum_{j\in s_\nu}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}\,\hat b_i\,\hat b_j', \qquad \hat b_i = \left(I_{(\|y_i - \hat\gamma_\nu\| \le d)},\ 1\right)'. \qquad (3.50)$$

We use (3.49) as an estimator of the asymptotic variances $V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$ and $V\left(\hat D_{\nu,d}(\hat q_\nu)\right)$ as defined in (3.20) and (3.37), where we plug in the estimated center $\hat\gamma_\nu$ in place of the true center $\gamma_\nu$.

3.4.2 Estimating the effect of population center estimation error

The naive estimator (3.49) ignores the variance due to estimating the population center and hence tends to underestimate the true variance for a general population. So we need to estimate $\frac{\partial\mathcal D_d(\gamma_\nu)}{\partial\gamma}$ and incorporate the extra piece of variance. Let $\zeta_d(\gamma) = \frac{\partial\mathcal D_d(\gamma)}{\partial\gamma}$; a population-level estimator of $\zeta_d(\gamma)$ using kernel smoothing is given by
$$\zeta_{\nu,d}(\gamma) = \frac{1}{N_\nu h}\sum_{i\in U_\nu} K\left(\frac{d - \|y_i - \gamma\|}{h}\right)\psi(y_i - \gamma). \qquad (3.51)$$
We propose a sample-based estimator of $\zeta_d(\gamma)$, and thus of $\zeta_{\nu,d}(\gamma)$, defined as
$$\hat\zeta_{\nu,d}(\gamma) = \frac{1}{\hat N_\nu h}\sum_{i\in s_\nu} K\left(\frac{d - \|y_i - \gamma\|}{h}\right)\psi(y_i - \gamma)\,\frac{1}{\pi_i}. \qquad (3.52)$$
The idea of the estimator $\hat\zeta_{\nu,d}(\gamma)$ is to estimate $\mathcal D_d(\gamma)$ using the primitive function of a kernel $K(\cdot)$, and then take derivatives with respect to $\gamma$.

Lemma 6. Under Assumptions 3.2.1(1), 3.2.2, 3.2.8(1,3) and 3.2.9(1,2), the estimator $\hat\zeta_{\nu,d}(\gamma)$ is design consistent for $\zeta_{\nu,d}(\gamma)$ as defined in (3.51).

Lemma 7. Under Assumptions 3.2.1(1), 3.2.2, 3.2.8(1,2) and 3.2.9, and assuming the sequence of populations is such that
$$\sup_\gamma\left|\zeta_{\nu,d}(\gamma) - \zeta_d(\gamma)\right| \to 0, \qquad (3.53)$$
the kernel estimator $\hat\zeta_{\nu,d}(\hat\gamma_\nu)$ is design consistent for $\zeta_d(\gamma_\nu)$ for every $d$.

Remarks: We have established uniform strong consistency of $\zeta_{\nu,d}(\gamma)$ for $\frac{\partial\mathcal D_d(\gamma)}{\partial\gamma}$ in Lemma A.1.4, under appropriate superpopulation assumptions. The assumption in (3.53) means that we are not working with populations for which $\zeta_{\nu,d}(\gamma)$ does not converge to $\zeta_d(\gamma)$; such populations form a zero-probability set.

We can directly plug the estimator (3.52) and the sample estimates of $\hat D_{\nu,d}(\hat\mu_\nu)$ and $\hat\Sigma_{\mu,d}$ into the variance expressions (3.20) and (3.37) to get design-consistent analytic variance estimators for the mean-based and median-based cases, respectively. Unlike the naive estimator (3.49), this variance estimator incorporates the extra variability due to estimating the center. We will use mean-based inference as an example to rigorously present the kernel-based variance estimator.

We define the kernel estimator as
$$\hat V_{SM}\left(\hat D_{\nu,d}(\hat\mu_\nu)\right) = \hat a_\mu\,\hat\Sigma_{\mu,d}\,\hat a_\mu',$$
where
$$\hat a_\mu = \left(1,\ -\hat D_{\nu,d}(\hat\mu_\nu) - \hat\zeta_{\nu,d}(\hat\mu_\nu)'\hat\mu_\nu,\ \hat\zeta_{\nu,d}(\hat\mu_\nu)'\right)$$
and $\hat\Sigma_{\mu,d}$ is a Horvitz-Thompson estimator of $\Sigma_{\mu,d}$.
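A minimal sketch of the sample-based kernel gradient estimator (3.52) follows. The Gaussian kernel, the Euclidean norm, and the choice $\psi(u) = u/\|u\|$ are illustrative assumptions; the actual kernel and norm depend on Assumption 3.2.9 and the norm in use.

```python
import numpy as np

def zeta_hat(d, gamma, ys, pis, h):
    # Kernel estimate of the gradient of the limiting distance
    # distribution at gamma, in the spirit of (3.52)
    w = 1.0 / np.asarray(pis, dtype=float)
    n_hat = w.sum()                                  # Horvitz-Thompson N-hat
    diffs = np.asarray(ys, dtype=float) - gamma
    norms = np.linalg.norm(diffs, axis=1)
    # Gaussian kernel applied to (d - ||y_i - gamma||)/h
    k = np.exp(-0.5 * ((d - norms) / h) ** 2) / np.sqrt(2.0 * np.pi)
    # psi(u) = u / ||u||, guarded against division by zero
    psi = diffs / np.where(norms > 0, norms, 1.0)[:, None]
    return (w[:, None] * k[:, None] * psi).sum(axis=0) / (n_hat * h)
```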

Theorem 8. Under Assumptions 3.2.1(1), 3.2.2-3.2.6, 3.2.8(1,2) and 3.2.9, the kernel estimator $\hat V_{SM}\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$ is consistent for $V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$, i.e.
$$\left[V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)\right]^{-1}\hat V_{SM}\left(\hat D_{\nu,d}(\hat\mu_\nu)\right) - 1 = O_p(n_\nu^{*-1/2}),$$
and in (3.19) we can replace $V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$ by the estimator to obtain
$$n_\nu^{*1/2}\left[\hat V_{SM}\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)\right]^{-1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu)\right) \xrightarrow{d} N(0,1).$$

Proof. The proof easily follows from Slutsky's Theorem. $\square$

3.4.3 Jackknife variance estimator

This section applies formally only to the mean-based estimator, but it can be modified to include the median-based estimator by using the delete-d jackknife or Balanced Repeated Replication (BRR).

To introduce the jackknife variance estimator for our application, we start by assuming that there already exists a design consistent jackknife variance estimator for simple linear estimators. We then define jackknife replicates for our case and show the design consistency of the resulting variance estimator. This approach is also used by Fuller and Kim (2005) and Da Silva and Opsomer (2006). We will use a number of assumptions from the latter article, and do not fully state them here for the sake of brevity.

Theorem 9. Let $\hat\theta$ be a linear estimator with
$$\hat\theta = \sum_{i\in s_\nu} w_i z_i,$$
where $w_i$ is the survey weight and $z_i$ has bounded $4 + \delta$ moments. Assume there is a jackknife replication procedure that generates $L$ replicated estimates
$$\hat\theta^{(l)} = \sum_{i\in s_\nu} w_i^{(l)} z_i,$$
with $l = 1, 2, \ldots, L$, where $w_i^{(l)}$ is the replication weight for unit $i$ in the $l$-th replicate. The replication variance estimator is defined as
$$\hat V_{JK}(\hat\theta) = \sum_{l=1}^{L} c_l\left(\hat\theta^{(l)} - \hat\theta\right)^2, \qquad (3.54)$$
where $c_l$ is a constant for the $l$-th replicate. Assumptions similar to (D1)-(D4) and (D6) in Da Silva and Opsomer (2006) are assumed.

We define the $l$-th jackknife replicate as
$$\tilde D_{\nu,d}^{(l)}(\hat\mu_\nu) = \hat D_{\nu,d}^{(l)}(\hat\mu_\nu) + \hat\zeta_{\nu,d}(\hat\mu_\nu)'\left(\hat\mu_\nu^{(l)} - \hat\mu_\nu\right), \qquad (3.55)$$
where $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu) = \frac{1}{\hat N_\nu^{(l)}}\sum_{i\in s_\nu} w_i^{(l)}\, I_{(\|y_i - \hat\mu_\nu\| \le d)}$ and $\hat\mu_\nu^{(l)} = \frac{1}{\hat N_\nu^{(l)}}\sum_{i\in s_\nu} w_i^{(l)} y_i$. Then
$$\hat V_{JK}\left(\hat D_{\nu,d}(\hat\mu_\nu)\right) = \sum_{l=1}^{L} c_l\left(\tilde D_{\nu,d}^{(l)}(\hat\mu_\nu) - \hat D_{\nu,d}(\hat\mu_\nu)\right)^2 \qquad (3.56)$$
is design consistent for $V\left(\hat D_{\nu,d}(\hat\mu_\nu)\right)$ in (3.20).

Proof. The two terms in the jackknife replicate (3.55) closely mimic the two leading terms in the linearization (3.16). Using the replicate $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu)$ provides a consistent estimator of the asymptotic variance of $\left(\hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu)\right)$. The second term in replicate (3.55) consistently estimates the asymptotic variance of $\left(\mathcal D_d(\hat\mu_\nu) - \mathcal D_d(\mu_\nu)\right)$. The cross product term obtained when expanding (3.56) estimates the covariance between the two leading terms in (3.16). $\square$

We have a small simulation study comparing the jackknife variance estimator that ignores the variance due to estimating the center (JK-1), the jackknife using $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu^{(l)})$ (JK-2), where the center is recalculated for each replicate sample, the jackknife using a piecewise linear distribution function estimator to estimate the derivative (JK-3), and the jackknife using (3.55) as replicate (JK-4). In this simulation study, we simulate a population of size $N = 4000$ from Gamma(3, 0.5), draw SRS samples of size $n = 1000$, and apply the above 4 jackknife variance estimators. The results are shown in Figure 3.1, with a red vertical bar indicating the true variance from the Monte Carlo simulation.

Figure 3.1 Histogram of the 4 jackknife variance estimators; JK-1 indicates the jackknife procedure using $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu)$ as replicate, JK-2 uses replicate $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu^{(l)})$, JK-3 uses a piecewise linear estimator of $\mathcal D_d(\cdot)$, and JK-4 uses (3.55).

Figure 3.1 shows that if we keep the center fixed and use $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu)$ as the jackknife replicate, we underestimate the variance substantially; using $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu^{(l)})$ overestimates the variance by a large amount, as the histogram is greatly truncated; the piecewise linear estimator of $\mathcal D_d(\cdot)$ does not have large bias, but the resulting variance estimates are highly variable; and the estimator (3.55) has little bias, with variance estimates that are less variable than those based on a piecewise linear estimator.

This replication variance estimator is readily interpreted by considering the composition of the replicate in (3.55). We first ignore the second term in (3.55) and compare the resulting jackknife estimator with the naive estimator (3.49). Both estimators consistently estimate the asymptotic variance of $\hat D_{\nu,d}(\mu_\nu)$, corresponding to setting $\frac{\partial\mathcal D_d(\mu_\nu)}{\partial\gamma} = 0$ in (3.21). The second term in (3.55) uses the combination of the kernel estimator and the replication method to estimate the effect of estimating the population center. It should be noted that the "full" jackknife variance estimator, which recalculates both the mean and the fraction of distances in each replicate, $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu^{(l)})$, results in an inconsistent variance estimator, as $\hat D_{\nu,d}(\hat\mu_\nu)$ is a nonsmooth function of $\hat\mu_\nu$.

The kernel estimator $\hat\zeta_{\nu,d}(\gamma)$ can be used either in an analytic "plug-in" estimator or in a replication-based method. In many situations, a significant advantage of the latter approach is that we do not need to estimate the covariance matrix (3.18), which can be complicated in a large-scale complex survey. In the jackknife variance estimation, we start from an existing jackknife procedure for simple linear estimators, as could for instance be provided by a statistical agency as part of the survey dataset. We then estimate the gradient vector based on the whole sample only once, while $\hat D_{\nu,d}^{(l)}(\hat\mu_\nu)$ and $\hat\mu_\nu^{(l)}$ can be computed from the original replicate weights $w_i^{(l)}$; a sketch of this replication computation is given below.
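The following sketch shows the replication computation just described, using replicate (3.55). The argument layout (a matrix of replicate weights and a vector of constants $c_l$) is an assumption about how an agency-provided jackknife would be stored.

```python
import numpy as np

def vjk_kernel(d, ys, w, w_rep, c, zeta):
    # Full-sample weighted mean and distance-distribution estimate
    ys = np.asarray(ys, dtype=float)
    mu = (w[:, None] * ys).sum(axis=0) / w.sum()
    dist = np.linalg.norm(ys - mu, axis=1)
    d_hat = (w * (dist <= d)).sum() / w.sum()
    var = 0.0
    for l in range(w_rep.shape[0]):                  # loop over L replicates
        wl = w_rep[l]
        mu_l = (wl[:, None] * ys).sum(axis=0) / wl.sum()
        d_l = (wl * (dist <= d)).sum() / wl.sum()    # center kept at full mu
        d_tilde = d_l + zeta @ (mu_l - mu)           # replicate (3.55)
        var += c[l] * (d_tilde - d_hat) ** 2         # accumulate (3.56)
    return var
```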

The delete-d jackknife can be used in complex surveys, or in the case of using the median as the measure of center. As pointed out by Shao and Wu (1989), the number of deleted points $d$ has to go to infinity at some rate depending on the nonsmoothness of the estimator. By choosing an appropriate $d$, we can account for the variation in $\hat q_\nu$ as well as the nonsmoothness of $\hat D_{\nu,d}(\cdot)$. We do not explore this issue further here.

3.5 Discussion

3.5.1 Comments on norms

In the previous proofs in Section 3.3, we work with a general norm that induces a distance satisfying Assumption A.1.1 for the centroid-based estimator. To show the existence and uniqueness of the sample median, as well as the asymptotic results on the kernel variance estimator, we need to place careful assumptions on the norm, as stated in Assumptions 3.2.7 and 3.2.8.

In practice, we do not have to use the same norm for different clusters in order for the previous results to hold, which gives us more flexibility in choosing an appropriate norm for a particular cluster.

An appropriate norm should match the shape of the underlying multivariate distribution and be robust to the existence of outliers. The Mahalanobis distance, or a robust version of it, is a natural choice when we are choosing a shape-respecting norm for a cluster. The covariance matrix $\Sigma$ can be replaced by the sample variance-covariance matrix or a more robust estimator. The estimation of multivariate location and shape has been heavily explored in the literature. Rocke and Woodruff (1996) gave a thorough review of existing robust estimators of multivariate location and shape, including minimum volume ellipsoid (MVE) and minimum covariance determinant (MCD) estimators (Hampel, Ronchetti, Rousseeuw and Stahel 1986; Rousseeuw 1985; Rousseeuw and Leroy 1987), and maximum likelihood and M-estimators (Campbell 1980; Huber 1981; Kent and Tyler 1991; Maronna 1976; Rocke 1996; Tyler 1983, 1988, 1991). So we should estimate the shape of the multivariate distribution based on the sample and define norms using the Mahalanobis distance with $\Sigma$ replaced by an appropriate estimate of shape. Notice that the "distribution" here refers to the empirical distribution of the finite population points in the survey sampling framework.

3.5.2 Initial partitioning

Usually, we do not know the cluster association of individual points in real applications. So we need to use exploratory tools like cluster analysis to partition the data into homogeneous subpopulations (clusters) and then estimate the distribution of distances for each subpopulation. An alternative approach would be to partition the population into homogeneous subpopulations based on auxiliary variables and subject matter knowledge; cluster analysis tools should then be used to confirm the partition results and guide the initial partitioning.

In the initial partitioning, we ignore the sampling weights. The usual k-means approach is efficient but would lead to biased results if clusters have different dispersions. The following clustering algorithm we propose takes the dispersion of the clusters into consideration:

1. Start from a preliminary clustering result, say from PAM or k-means.

2. At iteration step $k$, calculate an estimate of the cluster center $\hat\gamma_g^{(k)}$ and a shape estimator $\hat\Sigma_g^{(k)}$ for each cluster $g$.

3. For any point $y_i$, define a distance of $y_i$ to the center of cluster $g$,
$$d_{ig}^{(k)} = \sqrt{\left(y_i - \hat\gamma_g^{(k)}\right)'\left(\hat\Sigma_g^{(k)}\right)^{-1}\left(y_i - \hat\gamma_g^{(k)}\right)}.$$

4. Classify point $i$ into cluster $g^*$ such that $d_{ig^*}^{(k)} = \min_g d_{ig}^{(k)}$.

5. Iterate until the algorithm converges.

This algorithm will be referred to as the Shape Respecting k-means Algorithm (SRKA) later. The algorithm involves estimating the shape of each cluster at each iteration step; a computationally less intensive approach is to rescale the distances in each cluster by the mean distance. In the simulation study of Section 4.4, we use the following two clustering algorithms to generate the initial clustering of sample points (a sketch of the adjusted k-means step follows the list):

1. PAM (Partitioning Around Medoids) (Rousseeuw and van Zomeren 1990).

2. Adjusted k-means, in which we divide the raw distances by the mean distance of that particular cluster to account for variations in cluster sizes and dispersion.
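A minimal sketch of the adjusted k-means step is given below; the convergence check and the Euclidean norm are simplifications, and empty-cluster handling is omitted.

```python
import numpy as np

def adjusted_kmeans(ys, labels, max_iter=50):
    ys = np.asarray(ys, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        groups = np.unique(labels)
        centers = np.array([ys[labels == g].mean(axis=0) for g in groups])
        # mean within-cluster distance, used to rescale raw distances
        scales = np.array([
            np.linalg.norm(ys[labels == g] - centers[i], axis=1).mean()
            for i, g in enumerate(groups)])
        dist = np.linalg.norm(ys[:, None, :] - centers[None, :, :], axis=2)
        new_labels = groups[np.argmin(dist / scales, axis=1)]
        if np.array_equal(new_labels, labels):       # stop when stable
            break
        labels = new_labels
    return labels
```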

The performance of the clustering methods affects how well we can estimate the population distance distributions and thus the outlyingness of test points. A detailed empirical study can be found in Section 4.4.

3.5.3 Classification

The estimated fraction $\hat\Omega^*_{g,\nu}(y^0)$ measures the association between a point and a distribution. The fraction is close to 0 if the point lies close to the population center and close to 1 if the point is far away from the center. So $\hat\Omega^*_{g,\nu}(y^0)$ can also be used as a criterion to classify future observations if we have training data with known classification. To classify future observations, we can either use $\|y_i - \hat\gamma_{g,\nu}\|$ and assign point $y_i$ to the closest subpopulation, or use the estimated fraction and assign point $y_i$ to the subpopulation with the smallest $\hat D_{\nu g,\|y_i - \hat\gamma_{g,\nu}\|}(\hat\gamma_{g,\nu})$.

If the Mahalanobis distance (3.41) is used, the first method takes the location and shape of the distribution into account, but the second method also respects the distribution of $\|y_i - \gamma_\nu\|_\Sigma$ in the training sample. Next, we compare different classification criteria under elliptical distributions, including the variations of the classification criteria we propose using statistical distances.

1. The subpopulations only differ in location. Mathematically, we assume $Y_g \sim EC(\mu_g, \Sigma, R)$, and let $y_1, \ldots, y_{N_\nu}$ be an iid sample from this distribution. Denote by $x$ the new point to be discriminated, and let $\pi_g$ be the prior probability that $x$ belongs to the $g$-th subpopulation. The score function we maximize is listed in Table 3.1 for the four classification criteria. Under this scenario, the four classification criteria are equivalent, assuming that $H_g(\cdot)$ is a strictly monotone function and the priors are equal across classes.

2. The subpopulations differ in both location and shape, but the distribution of $R_g$ is the same. In this case, Linear Discriminant Analysis (LDA) is different from the other three criteria. Classification using distances is equivalent to using fractions, as the distribution functions $H_g(\cdot)$ are the same. Quadratic Discriminant Analysis (QDA) is similar to distances and fractions, but the latter two ignore the penalization on the volume of the variance-covariance matrix.

3. Location, shape and distance distributions all differ across subpopulations. In this case, classification using distances is different from using fractions. If the distances between $x$ and two cluster centers $\mu_g$ and $\mu_{g'}$ are the same, then the point $x$ will be classified into the cluster with the higher tail probability.

Table 3.1 Score function for each discrimination method

Classification method    Score function
LDA                      $x'\Sigma^{-1}\mu_g - \frac{1}{2}\mu_g'\Sigma^{-1}\mu_g + \log\pi_g$
QDA                      $-\frac{1}{2}\log|\Sigma_g| - \frac{1}{2}(x - \mu_g)'\Sigma_g^{-1}(x - \mu_g) + \log\pi_g$
Distance                 $-\frac{1}{2}(x - \mu_g)'\Sigma^{-1}(x - \mu_g)$
Fraction                 $1 - H_g\left((x - \mu_g)'\Sigma^{-1}(x - \mu_g)\right)$
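For concreteness, the four scores in Table 3.1 can be computed for a single class $g$ as sketched below. The callable `H_g` stands for the class's distance-distribution function, which in practice would itself be estimated; the pooled versus class-specific covariance arguments follow the table's assumptions.

```python
import numpy as np

def class_scores(x, mu_g, sigma_g, sigma_pooled, prior_g, H_g):
    si = np.linalg.inv(sigma_pooled)                 # pooled Sigma^{-1}
    sgi = np.linalg.inv(sigma_g)                     # class-specific Sigma_g^{-1}
    lda = x @ si @ mu_g - 0.5 * mu_g @ si @ mu_g + np.log(prior_g)
    qda = (-0.5 * np.log(np.linalg.det(sigma_g))
           - 0.5 * (x - mu_g) @ sgi @ (x - mu_g) + np.log(prior_g))
    distance = -0.5 * (x - mu_g) @ si @ (x - mu_g)
    fraction = 1.0 - H_g((x - mu_g) @ si @ (x - mu_g))
    return {'LDA': lda, 'QDA': qda, 'Distance': distance, 'Fraction': fraction}
```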

3.5.4 Using auxiliary variables

Most surveys also have plenty of auxiliary variables that explain the target variables $y_i$ reasonably well. The relationship between auxiliary variables and response variables can be used to improve the precision of cumulative distribution function estimators (Dunstan and Chambers 1986, Rao et al. 1990, Chambers et al. 1992). We will define distances to the center based on the estimated $y_i$ given the auxiliary variable $x_i$, and the distances so defined can be computed for each population element. We then use these distances as substitutes for $\|y_i - \mu_\nu\|$ and use a difference estimator to improve precision.

In this section, we explain the idea of incorporating auxiliary variables into estimating distance distribution functions, with univariate response and explanatory variables. The first model we consider is a simple ratio model; we then extend it to a general parametric regression model, and briefly discuss more general models at the end.

3.5.4.1 Simple ratio model

The first superpopulation model is assumed to be a ratio model,
$$y_i = \beta x_i + e_i, \qquad (3.57)$$
where the $e_i$ are iid with mean 0 and variance $x_i\sigma_e^2$. The slope $\beta$ can be estimated by $B$ if $x_i$ and $y_i$ for all population elements are known, where
$$B = \frac{\sum_{U_\nu} y_i}{\sum_{U_\nu} x_i}.$$
A design-consistent estimator of the population ratio $B$ replaces the population sums by weighted sums over the sample,
$$\hat B = \frac{\sum_{s_\nu} y_i/\pi_i}{\sum_{s_\nu} x_i/\pi_i}.$$

An ideal proxy value for $\|y_i - \mu_\nu\|$ is $\|\beta x_i - \beta\bar x_N\|$, and as $\beta$ is usually not obtainable we use the estimated slope $\hat B$. Then we can borrow strength from the RKM estimator (Rao et al. 1990), and use the following difference estimator to estimate the distance CDF,
$$\hat D_{\nu,d}^{dif} = \frac{1}{\hat N_\nu}\sum_{i\in s_\nu}\frac{I_{(\|y_i - \hat\mu_\nu\| \le d)} - I_{(\|\hat B x_i - \hat B\bar x_N\| \le d)}}{\pi_i} + \frac{1}{N_\nu}\sum_{i\in U_\nu} I_{(\|\hat B x_i - \hat B\bar x_N\| \le d)}, \qquad (3.58)$$
where $\bar x_N = N_\nu^{-1}\sum_{U_\nu} x_i$.

Rao et al. (1990) suggested that the asymptotic variance of this difference estimator is the residual variance after regressing $I_{(\|y_i - \mu_\nu\| \le d)}$ on 1 and $I_{(\|\hat B x_i - \hat B\bar x_N\| \le d)}$, so precision is improved when the proxy distances $\|\hat B(x_i - \bar x_N)\|$ are good substitutes.
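A sketch of the difference estimator (3.58) follows. It assumes the proxy distances $\|\hat B(x_i - \bar x_N)\|$ have been computed both for the sample units and for every population unit, which is feasible since $x_i$ is known population-wide.

```python
import numpy as np

def difference_estimator(d, dist_y_s, dist_proxy_s, dist_proxy_U, pis):
    # HT estimator of the sample difference between actual and proxy
    # indicators, plus the population mean of the proxy indicators
    w = 1.0 / np.asarray(pis, dtype=float)
    diff = (np.asarray(dist_y_s) <= d).astype(float) \
         - (np.asarray(dist_proxy_s) <= d).astype(float)
    return (w * diff).sum() / w.sum() + np.mean(np.asarray(dist_proxy_U) <= d)
```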

3.5.4.2 General parametric regression model

The superpopulation model we assume is
$$y_i = f(x_i; \beta) + e_i, \qquad (3.59)$$
where $\beta$ is a fixed unknown parameter and the $e_i$ are assumed to be iid with mean 0 and variance $\sigma_e^2$. At the population level, $\beta$ can be estimated by $B$ as follows,

$$B = \arg\min_\beta \sum_{i\in U_\nu}\left(y_i - f(x_i; \beta)\right)^2,$$
and the scenario here is similar to the estimating equations described in Section 3.3.2.

The sample-based estimator of $B$, and thus of $\beta$, is given by
$$\hat B = \arg\min_\beta \sum_{i\in s_\nu}\frac{1}{\pi_i}\left(y_i - f(x_i; \beta)\right)^2,$$
and its asymptotic properties can be shown in a similar way to those of the median in Section 3.3.2.

Then we can propose the following difference estimator,
$$\hat D_{\nu,d}^{dif} = \frac{1}{\hat N_\nu}\sum_{i\in s_\nu}\frac{I_{(\|y_i - \hat\mu_\nu\| \le d)} - I_{(\|f(x_i;\hat B) - \bar f_N\| \le d)}}{\pi_i} + \frac{1}{N_\nu}\sum_{i\in U_\nu} I_{(\|f(x_i;\hat B) - \bar f_N\| \le d)},$$
where $\bar f_N = N_\nu^{-1}\sum_{i\in U_\nu} f(x_i; \hat B)$.

The asymptotic variance of the difference estimator is the residual variance after regressing $I_{(\|y_i - \mu_\nu\| \le d)}$ on 1 and $I_{(\|f(x_i; B) - \bar f_N\| \le d)}$.

CHAPTER 4. SIMULATION STUDY

In this chapter, we present our simulation results. First, we assess the consistency and asymptotic normality of the proposed estimators, as a confirmation of the theoretical results derived in Section 3.3. Secondly, we compare the distance distribution estimator $\hat D_{\nu,d}(\hat\gamma_\nu)$ to $\hat D_{\nu,d}(\gamma_\nu)$, where the population center is known, and then evaluate various versions of the variance estimators. Thirdly, we define population outliers as points whose outlyingness measure exceeds a certain threshold for all subpopulations, and evaluate the power of the sample-based outlyingness estimators. Finally, we include two clustering approaches in the procedure and examine the performance of the sample outlyingness measures based on a random partitioning.

4.1 Assessing consistency and asymptotic normality

When deriving theoretical results in Section 3.3, we ignored the subpopulation structure in our finite population. But in real applications, we almost always want to partition the population into homogeneous subpopulations before applying the proposed approach. So in our simulation study, we will re-introduce the subpopulations and examine the estimator and variance estimators for subpopulations with different distribution structures.

Theorems 1 and 5 established the convergence rates of $\hat D_{\nu,d}(\hat\mu_\nu)$ for $D_{\nu,d}(\mu_\nu)$ and of $\hat D_{\nu,d}(\hat q_\nu)$ for $D_{\nu,d}(q_\nu)$, respectively, and Theorems 2 and 6 showed the asymptotic normality of the sample-based distance distributions. In this section, we use simulation to assess the asymptotic properties of our estimators.

To assess these large sample results, we first generate a bivariate finite population with the following 4 subpopulations of equal size $N_{\nu g} = 2000$, $g = 1, 2, 3, 4$. The subpopulations are defined as follows:

• Subpopulation 1 is generated from a truncated bivariate normal distribution centered at the origin,
$$Y_1 \sim N_2(0, I_2)\, I_{(\|Y_1\|_2 < 5)}.$$

• Subpopulation 2 is also a truncated bivariate normal distribution,
$$Y_2 = \begin{pmatrix} 3 \\ 4 \end{pmatrix} + \begin{pmatrix} 2 & 2 \\ 1 & -1 \end{pmatrix} W_2, \qquad W_2 \sim N_2(0, I_2)\, I_{(\|W_2\|_2 < 5)}.$$

• Subpopulation 3 is generated from a Pearson type II distribution (Johnson 1987) with $m = 4$,
$$Y_3 = \begin{pmatrix} -6 \\ 7 \end{pmatrix} + R_3 \begin{pmatrix} 2 & -2 \\ 4 & 2 \end{pmatrix} U^{(2)},$$
where $R_3 \sim \sqrt{\mathrm{Beta}(1, 5)}$ independent of $U^{(2)}$, which is uniformly distributed on the unit circle.

• Subpopulation 4 is simulated from a truncated Pearson type VII distribution,
$$Y_4 = \begin{pmatrix} -5 \\ -6 \end{pmatrix} + R_4 \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} U^{(2)},$$
where $R_4 = \frac{1}{\sqrt{Z}}\, I_{(1/\sqrt{Z} < 5)}$ with $Z \sim \mathrm{Beta}(1, 1)$ independent of $U^{(2)}$, which is defined as above.

Figure C.1 shows the four subpopulations, from which we can see that subpopulation 4 has more dispersion than the other 3 subpopulations and that the distribution of the distance of $Y_4$ to the subpopulation center has a heavier tail. We will ignore the test points on Figure C.1 for now.

Then we draw samples using simple random sampling (SRS), Poisson sampling, unequal probability sampling and stratified sampling, with different anticipated sample sizes $n = 200, 800, 2000$. The sampling schemes are as follows:

1. Simple random sampling.

2. Unequal probability sampling. The inclusion probabilities $\pi_i$ are independent of $y_i$ and of the cluster indicators, and the $\pi_i$'s are drawn from $\mathrm{Unif}\left((1-\alpha)\frac{n}{N},\,(1+\alpha)\frac{n}{N}\right)$, where $n$ is the anticipated sample size and $\alpha = 0.8$. The randomness of the inclusion probabilities does not matter much if we want to assess convergence or normality properties, but it will affect the true variances and the variance estimates.

3. Poisson sampling. The configuration of the inclusion probabilities is the same as for unequal probability sampling.

4. Stratified sampling. We have two strata in this sampling procedure: stratum 1 contains the points with first coordinate $y_{1i} < 0$, and stratum 2 contains the remaining population points. We then draw a sample of size $0.6 \times n$ from stratum 1 using SRS and $0.4 \times n$ from stratum 2 using SRS as well.

We calculate $\hat D_{\nu,d}(\hat\mu_\nu)$ and $\hat D_{\nu,d}(\hat q_\nu)$ with the Euclidean norm, and the plots are shown in Figures C.2-C.4. We did not include the results for unequal probability sampling or Poisson sampling due to limited space, but those results are similar to the results for SRS and stratified sampling. Pointwise convergence of $\hat D_{\nu,d}(\hat\mu_\nu)$ to $D_{\nu,d}(\mu_\nu)$ is seen if we look at the trend of the estimated distance distributions as the sample size $n$ increases. When the sample size is 200, the estimated $\hat D_{\nu,d}(\hat\mu_\nu)$ deviates noticeably from its population counterpart, with noticeable jumps. If we increase the sample size to 800 and examine Figure C.3, the estimated $\hat D_{\nu,d}(\hat\mu_\nu)$ has fewer jumps and the deviation from the population quantity has decreased; but for this particular sample, the estimated distance distributions for subpopulations 2, 3 and 4 still deviate somewhat from $D_{\nu,d}(\mu_\nu)$. If we increase the sample size to 2000 and look at Figure C.4, $\hat D_{\nu,d}(\hat\mu_\nu)$ is generally very close to the population quantity for both simple random sampling and stratified sampling, and the profiles of $\hat D_{\nu,d}(\hat\mu_\nu)$ against $d$ look smooth on this plotting scale.

Figures C.5-C.7 show the estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ and $D_{\nu,d}(q_\nu)$ under the different sampling schemes. We do not see any difference between the profiles of $\hat D_{\nu,d}(\hat\mu_\nu)$ and $\hat D_{\nu,d}(\hat q_\nu)$, consistent with what we have shown in Lemma 4, namely that $\mu_\nu - q_\nu \to 0$ for elliptically contoured distributions.

To assess the design normality of the estimated distance distributions, we repeatedly draw 200 SRS samples of size $n = 800$ from a fixed finite population of size $N = 8000$, and calculate $\hat D_{\nu,d}(\hat\mu_{\nu g})$ for subpopulations 3 and 4 for $d = 0.1, 0.5, 2$ and 6. The results are shown in Figures C.8 and C.9. The p-values from the Shapiro-Wilk test (Shapiro and Wilk 1965) suggest that we fail to reject normality of $\hat D_{\nu,d}(\hat\mu_{\nu g})$ except for $\hat D_{\nu,6}(\hat\mu_{\nu 3})$ and $\hat D_{\nu,0.1}(\hat\mu_{\nu 4})$. Figure C.8 shows that $\hat D_{\nu,6}(\hat\mu_{\nu 3})$ has a higher peak than the normal distribution, and Figure C.9 shows $\hat D_{\nu,0.1}(\hat\mu_{\nu 4})$ to be left skewed.

Then we increase the population size to 16000 and double the sample size as well; the results are shown in Figures C.10 and C.11. Now, $\hat D_{\nu,0.1}(\hat\mu_{\nu 4})$ is still left skewed, but the Shapiro-Wilk test fails to reject the normality of $\hat D_{\nu,6}(\hat\mu_{\nu 3})$. The reason is that for probabilities close to 0 or 1, we generally need a larger sample to obtain an approximately normally distributed estimator.

To illustrate the difference between using the mean vector and the generalized median, we included a subpopulation 5 of size 2000 from a non-elliptically contoured distribution, as follows.

• Subpopulation 5 is generated from the following non-symmetric distribution,
$$Y_5 = \begin{pmatrix} 5 \\ -6 \end{pmatrix} + \begin{pmatrix} R(\theta)\cos\theta \\ R(\theta)\sin\theta \end{pmatrix}, \qquad \theta \sim \mathrm{Unif}[0, 2\pi], \quad R(\theta) = |\theta - \pi| \times |X|.$$

We make a scatter plot of this subpopulation in Figure C.20. We then calculate the population distance distributions $D_{\nu,d}(\mu_\nu)$ and $D_{\nu,d}(q_\nu)$ and plot them against $d$ in Figure C.20, from which we can see that $D_{\nu,d}(q_\nu)$ increases earlier than $D_{\nu,d}(\mu_\nu)$, implying that there are more points within a small radius $d$ of $q_\nu$ than of $\mu_\nu$.

4.2 Evaluating $\hat D_{\nu,d}(\hat\gamma_\nu)$ and its variance estimators

In this section, we will only look at subpopulations 4 and 5 from the previous section, and call them subpopulations 1 and 2 respectively.

Figure 4.1 shows the scatterplot of the two subpopulations, from which we can see that the distribution of $Y_1$ is elliptically contoured, and subpopulation 2 is a skewed population with an irregular shape. Figure 4.2 shows the empirical distribution function of the subpopulation distances, using both mean and median as measures of center. For the elliptical subpopulation 1, the empirical distributions of distances using mean or median do not differ much, but the two distributions differ noticeably for a skewed cluster like subpopulation 2.

We only show the results from stratified sampling. We partition the population into 2 strata, based on whether the second coordinate is smaller than $-5$ (stratum 1) or greater than $-5$ (stratum 2). We then draw a simple random sample from each stratum, with sizes $0.4 n_\nu$ and $0.6 n_\nu$, and overall sample size $n_\nu$ of 80, 160 or 400. The strata we created cut across subpopulations 1 and 2: 84% of subpopulation 1 falls into stratum 1 and the remaining points fall into stratum 2, while 80% of subpopulation 2 falls into stratum 1 with the rest in stratum 2.

Figure 4.1 Scatterplot of two subpopulations; the first two plots show each individual subpopulation and the last plot shows them both, but in different colors.

To evaluate the performance of the estimator $\hat D_{\nu,d}(\hat\gamma_\nu)$, we compare it with $\hat D_{\nu,d}(\gamma_\nu)$, for which we use the true population center. The comparison is on the basis of bias and standard deviation under three different sample sizes with $d = 0.707, 1.0, 1.414, 2.45$ and 3.873. The simulation results are shown in Tables 4.1 and 4.2, with mean and median as measures of center, respectively.

The bias of $\hat D_{\nu,d}(\mu_\nu)$ is negligible compared to its standard deviation for all three sample sizes. In the case of $\hat D_{\nu,d}(\hat\mu_\nu)$, its bias has the same magnitude as its standard deviation for some $d$'s under sample sizes $n_\nu = 80$ and 160. The bias diminishes as $d$ increases.

Figure 4.2 Distribution of subpopulation distances, with the solid line indicating distances to the mean vector and the dashed line indicating distances to the median; the left plot shows the distance distributions for subpopulation 1, and the right plot shows the distance distributions for subpopulation 2.

Due to estimating the subpopulation center, we have in general introduced extra bias and variance into the estimator. The increase in variance can be seen by comparing $sd(\hat D_{\nu,d}(\hat\mu_\nu))$ with $sd(\hat D_{\nu,d}(\mu_\nu))$. The difference between $sd(\hat D_{\nu,d}(\hat\mu_\nu))$ and $sd(\hat D_{\nu,d}(\mu_\nu))$ is nearly negligible for subpopulation 1 under large sample sizes, supporting the use of the naive variance estimator for elliptical distributions. But this discrepancy remains obvious in a skewed subpopulation, even for the largest sample size.

Table 4.2 shows the results of using the median as the measure of center. We can see that for subpopulation 1, the bias and standard deviation of $\hat D_{\nu,d}(\hat q_\nu)$ are very close to those of $\hat D_{\nu,d}(\hat\mu_\nu)$, as the two subpopulation centers are very close. The bias of $\hat D_{\nu,d}(\hat q_\nu)$ is generally smaller than that of $\hat D_{\nu,d}(\hat\mu_\nu)$, everything else being the same, as a result of the median being more robust against "bad" samples. And $sd(\hat D_{\nu,d}(\hat q_\nu))$ is very close to $sd(\hat D_{\nu,d}(q_\nu))$ under all three sample sizes for elliptical subpopulations, as the extra variance is a small-order term, but the discrepancy remains obvious for skewed subpopulations.

In general, the bias of the estimator $\hat D_{\nu,d}(\gamma_\nu)$ is much smaller than its standard deviation, and ignoring the bias will not influence our inference much. But the bias of $\hat D_{\nu,d}(\hat\gamma_\nu)$ is usually much larger in magnitude, indicating that although the asymptotic bias will eventually diminish, in finite samples, especially with small sample sizes, the added bias can deteriorate the confidence intervals.

Generally, the standard deviation of $\hat D_{\nu,d}(\hat\gamma_\nu)$ is close to that of $\hat D_{\nu,d}(\gamma_\nu)$ for elliptical distributions with a reasonably large sample size, but the two standard deviations can be very different for skewed populations. One interesting finding is that as $d$ increases, the extra variance due to estimating the center diminishes, consistent with our intuition that the effect of estimating the center decreases as we examine a larger region around the center. The simulation results confirm that the contribution to the variance of estimating the center is much smaller in the tail of the distribution.

A closer look at the estimator $\hat D_{\nu,d}(\hat\gamma_\nu)$ is obtained by plotting the conditional bias
$$E\left(\hat D_{\nu,d}(\hat\gamma_\nu)\,\middle|\,\|\hat\gamma_\nu - \gamma_\nu\|\right) - D_{\nu,d}(\gamma_\nu)$$
against $\|\hat\gamma_\nu - \gamma_\nu\|$, which is shown in Figure C.12. We repeatedly draw 2000 SRS samples of size 400 and examine the relationship between the conditional bias and the distance between $\hat\gamma_\nu$ and $\gamma_\nu$. In this figure, we delete the 1% of "extreme" samples in which $\hat\gamma_\nu$ falls very far from its target, and plot curves smoothed by LOESS. One thing to mention is that how far the curves extend on the x-axis shows the 99% quantile of $\|\hat\gamma_\nu - \gamma_\nu\|$.

For a dispersed cluster like cluster 4, $\|\hat\mu_\nu - \mu_\nu\|$ is more likely to have extreme values than $\|\hat q_\nu - q_\nu\|$, but for clusters 2 and 5, the distances based on medians are worse than those based on means. Brown (1983) discusses the relative efficiency of the mean and the spatial median for estimating location parameters under normal distributions. Secondly, the general trend of these curves is expected to be decreasing as $\|\hat\gamma_\nu - \gamma_\nu\|$ increases, as shown for cluster 4; but for compact clusters like cluster 2, getting a bad sample where $\hat\gamma_\nu$ is far from its target does not have much effect on estimating the fractions. Thirdly, the effect of $\|\hat\gamma_\nu - \gamma_\nu\|$ on the estimated fraction generally diminishes as $d$ increases, as we can see from the cluster 4 and 5 plots. The decrease of the conditional bias as $\|\hat\gamma_\nu - \gamma_\nu\|$ increases is not as obvious for larger $d$'s as for smaller $d$'s, which is natural, as moving the center has more effect on estimating the fraction of points within a small neighbourhood of the center than farther away.

Next, we report on a comparison of the performance of the naive variance estimator defined by (3.49) and the combined kernel and jackknife variance estimator, for $d = 0.707, 1.0, 1.414, 2.45$ and 3.873. We only report the results for the mean-based case, as this is the case for which the jackknife procedure formally applies.

Table 4.1 Comparison of the estimators $\hat D_{\nu,d}(\hat\mu_\nu)$ and $\hat D_{\nu,d}(\mu_\nu)$ in terms of bias and standard deviation under three different simple random samples; $G = 2$ is the number of subpopulations; bias$(\hat D(\hat\mu))$ and bias$(\hat D(\mu))$ represent the bias of our estimator when the true population mean is estimated or known; sd$(\hat D(\hat\mu))$ and sd$(\hat D(\mu))$ represent the standard deviation of our estimator when the true population mean is estimated or known.

$N_\nu = 2000$, $n_\nu = 400$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu g,d}(\mu_\nu)$ .322    .422    .526    .708    .82       .13     .28     .508    .852    .99
bias$(\hat D(\mu))$    .0004   .0002   .0001   .0004   0         .0003   .0006   .001    .0003   .0002
bias$(\hat D(\hat\mu))$ -.0005  -.0025  .0028   -.001   -.0005    .0059   .0042   -.0027  .0006   0
sd$(\hat D(\mu))$      .017    .02     .021    .022    .02       .027    .035    .038    .028    .008
sd$(\hat D(\hat\mu))$  .018    .02     .022    .022    .021      .038    .043    .044    .026    .008

$N_\nu = 2000$, $n_\nu = 160$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu g,d}(\mu_\nu)$ .322    .422    .526    .708    .82       .13     .28     .508    .852    .99
bias$(\hat D(\mu))$    .0002   .0007   .0005   .0009   .0011     -.0011  -.0012  -.0012  -.0001  -.0002
bias$(\hat D(\hat\mu))$ -.0105  -.0072  -.0008  -.0009  .0029     .0071   .0047   -.0034  .0028   -.0002
sd$(\hat D(\mu))$      .038    .038    .04     .04     .036      .06     .075    .074    .043    .013
sd$(\hat D(\hat\mu))$  .033    .037    .04     .039    .035      .043    .058    .065    .046    .013

$N_\nu = 2000$, $n_\nu = 80$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu g,d}(\mu_\nu)$ .322    .422    .526    .708    .82       .13     .28     .508    .852    .99
bias$(\hat D(\mu))$    .0024   .003    .0029   .0019   .0018     -.0017  -.0013  -.0034  .0006   0
bias$(\hat D(\hat\mu))$ -.0253  -.0145  -.0033  -.0004  .0065     .0121   .0091   0       .0079   .0007
sd$(\hat D(\mu))$      .05     .054    .057    .056    .051      .061    .084    .093    .065    .019
sd$(\hat D(\hat\mu))$  .063    .059    .058    .058    .051      .084    .107    .106    .062    .018

Table 4.2 Comparison of the estimators $\hat D_{\nu,d}(\hat q_\nu)$ and $\hat D_{\nu,d}(q_\nu)$ in terms of bias and standard deviation under three different simple random samples; $G = 2$ is the number of subpopulations; bias$(\hat D(\hat q))$ and bias$(\hat D(q))$ represent the bias of our estimator when the true population median is estimated or known; sd$(\hat D(\hat q))$ and sd$(\hat D(q))$ represent the standard deviation of our estimator when the true population median is estimated or known.

$N_\nu = 2000$, $n_\nu = 400$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu,d}(q_\nu)$     .322    .421    .53     .708    .817      .217    .347    .526    .825    .981
bias$(\hat D(q))$      .0004   .0002   .0002   .0004   0         .0002   .0002   .0012   .0004   .0003
bias$(\hat D(\hat q))$ .0012   .0025   .0006   -.0005  .0002     -.0016  .0028   .0014   .0037   .0008
sd$(\hat D(q))$        .017    .02     .022    .022    .02       .032    .036    .039    .029    .011
sd$(\hat D(\hat q))$   .017    .02     .022    .022    .02       .04     .045    .038    .027    .011

$N_\nu = 2000$, $n_\nu = 160$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu,d}(q_\nu)$     .322    .421    .53     .708    .817      .217    .347    .526    .825    .981
bias$(\hat D(q))$      .0002   .0006   .0005   .001    .0011     -.0005  -.0019  -.0008  -.0001  -.0004
bias$(\hat D(\hat q))$ .0023   .003    .0009   .0002   .0018     -.0027  .0032   .0013   .0071   .0006
sd$(\hat D(q))$        .033    .037    .04     .039    .035      .053    .061    .064    .049    .018
sd$(\hat D(\hat q))$   .033    .037    .04     .039    .035      .067    .073    .065    .045    .018

$N_\nu = 2000$, $n_\nu = 80$, $G = 2$
                          Subpopulation 1                          Subpopulation 2
d                     0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873

Table 4.3 Comparison of variance estimators: $V_{MC}$ denotes the Monte Carlo estimate of the variance from 2000 simulations under stratified sampling, $\hat V_{NV}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the naive estimator, $\hat V_{SM}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the analytic variance estimator using the kernel estimator, and $\hat V_{JK}(\hat D_{\nu,d}(\hat\mu_\nu))$ denotes the jackknife variance estimator using the kernel estimator.

$N_\nu = 2000$, $n_\nu = 400$, $G = 2$
                              Subpopulation 1                           Subpopulation 2
d                        0.707   1.0     1.414   2.45    3.873     0.707   1.0     1.414   2.45    3.873
$D_{\nu,d}(\mu_\nu)$     0.331   0.425   0.517   0.674   0.813     0.183   0.324   0.544   0.868   0.987
$V_{MC} \times 1000$     0.741   0.79    0.8     0.786   0.555     0.869   1.013   0.925   0.423   0.043
$\hat V_{NV}/V_{MC}$     0.931   0.981   1.021   0.962   0.969     0.523   0.68    0.891   0.961   1.105
$\hat V_{SM}/V_{MC}$:
  h = 0.05               1.432   1.403   1.275   1.109   1.129     1.113   1.137   1.135   1.057   1.133
  h = 0.1                1.212   1.224   1.168   1.034   1.046     1.025   1.054   1.068   0.999   1.057
  h = 0.2                1.132   1.131   1.099   0.998   1.007     0.942   1.002   1.055   0.961   1.028
  h = 0.4                1.155   1.086   1.066   0.981   0.987     0.873   0.998   1.07    0.911   1.011
$\hat V_{JK}/V_{MC}$:
  h = 0.05               1.186   1.067   1.058   0.966   1.057     1.034   0.98    0.946   0.94    1.108
  h = 0.1                1.009   0.932   0.966   0.902   0.988     0.96    0.913   0.898   0.903   1.049
  h = 0.2                0.944   0.87    0.903   0.868   0.954     0.885   0.869   0.881   0.881   1.028
  h = 0.4                0.952   0.844   0.863   0.841   0.938     0.818   0.871   0.883   0.857   1.015

We can see from Table 4.3 that the Monte Carlo variance increases first and then decreases as $D_{\nu,d}(\mu_\nu)$ goes from 0 to 1. The naive variance estimator (3.49) performs satisfactorily if the underlying distribution is elliptically contoured (subpopulation 1), but it seriously underestimates the variance for skewed distributions such as subpopulation 2. Incorporating the kernel smoothing term enables us to include the additional variance component, but we need to be careful in choosing the bandwidth. The variance estimate generally decreases as the bandwidth $h$ increases, and the "optimal" bandwidth depends not only on the features of the cluster but also on how far away from the center we are looking.

4.3 Outlier Detection

In this simulation, we evaluate the performance of our proposed approach in detecting potential outliers in a survey. We apply the outlyingness measure $\hat\Omega^*_{g,\nu}(y^0)$, which is now defined at the subpopulation level and takes the shape of the subpopulation into account, as discussed in Section 3.3.3. In setting up this simulation experiment, we first need to determine which observations to label as "potential outliers." Instead of creating a population and adding artificial outliers, we decided to generate a population with five subpopulations (see Figure 4.3) and label all points that are sufficiently "far" from all of the subpopulation centers as outliers (see below). Hence, they are not outliers in the sense of not belonging to the population, but they are unusual with respect to the population distribution and therefore good candidates for closer scrutiny.

In defining suspicious points with respect to the finite population according to the reasoning above, we prespecify a threshold $\alpha$, equal to 0.02 in this case, and define the population points with $\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$ to be the set of outliers, with either the mean vector or the generalized median as the measure of center. Figure 4.3 shows the points identified as outliers when the centers are defined as mean vectors. Given this set of suspicious points, the goal of the simulation experiment is to assess how well a sample-based estimator is able to identify the same points as potential outliers.

Probability samples are drawn under a complex design with 3 strata and a different design for each stratum (simple random sampling without replacement, Poisson sampling and cluster sampling). In creating the population strata, we created a stratification variable $z_i = y_{1i} + y_{2i} + e_i$ with $e_i \sim N(0,1)$, and use $-5$ and 5 as cutoff points on $z_i$ for determining stratum membership for each element $i \in U_\nu$. Stratum 1 contains the units where $z_i < -5$, and we draw a Poisson sample with inclusion probability proportional to $z_i$ and anticipated sample size $n_\nu/4$. Stratum 2 has the units where $-5 \le z_i \le 5$; we equally partition the range of $z_i$ into 500 intervals, and select clusters using simple random sampling to obtain an anticipated stratum sample size of $n_\nu/2$. Then we draw a simple random sample of size $n_\nu/4$ from stratum 3.

Given a realized sample, we assume that the cluster association of each sample point is known; we deal with the case involving initial clustering in the next section. So in this simulation, we calculate estimates of center and shape for each subpopulation, assuming the subpopulation association is known. We then calculate $\hat\Omega^*_{g,\nu}(y^0)$ for each point in the sample by leaving it out when constructing the distance distributions. If the outlyingness measures of a point with respect to all five subpopulations are greater than $1 - \alpha$, then this point is labeled as an outlier in the sample.


Figure 4.3 Bivariate population with 5 subpopulations; points marked with a "plus" sign are population outliers, defined by $\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$. We use mean vectors as measures of center, and the threshold $\alpha$ is chosen as 0.02.

We have 8 simulation scenarios: using the mean or the median as the center of a subpopulation, large or small sample sizes ($n = 1000$ or $n = 400$), and leaving the test point out of the distance distributions or keeping it in the sample. We then calculate the average number of sample outliers, the average number of outliers identified, the average number of outliers correctly identified, and the average fraction of correctly identified outliers over the number of outliers present in the sample, and summarize the results in Table 4.4.

From Table 4.4 we can see that constructing the distance distribution without the test point performs better than with the test point, in that we can identify more outliers and the fraction of correctly identified outliers is higher. Leaving out the test point is also consistent with our theoretical results, as the distribution estimator without the test point is consistent for the population target as well.

If we decrease the sample size, we usually identify fewer suspicious points and the fraction of correctly identified outliers is smaller. This can be explained by the fact that the sample estimated region has more bias and more variation for a smaller sample size as seen in previous simulations, so non-population-outliers are more likely to be labeled as suspicious points.

Table 4.4 Summary of simulation results examining the performance of detecting population outliers. We use either the mean or the median as the measure of population center, and population outliers are defined by $\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$. Sample outliers are the points $y^0$ such that $\hat\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$.

                              average number   average number   average number of     average fraction of
                              of sample        of outliers      outliers correctly    correctly identified
                              outliers         identified       identified            outliers
mean, n=1000, leave           17.72            19.8             14.19                 0.81
mean, n=1000, not leave       17.11            16.16            12.54                 0.74
mean, n=400, leave            7.06             9.32             5.07                  0.74
mean, n=400, not leave        6.82             5.58             3.67                  0.56
median, n=1000, leave         17.56            19.57            14.20                 0.82
median, n=1000, not leave     16.98            16.08            12.53                 0.75
median, n=400, leave          7.00             9.21             5.06                  0.75
median, n=400, not leave      6.76             5.56             3.71                  0.57

Another scenario we consider in the simulation is using an unmatched definition of center: we use the median to define the population center, but the sample mean in defining the sample distances. We consider 2 sample sizes ($n = 1000$ or $n = 400$), and either leave the test point out of the distance distribution or keep it in. The results, shown in Table 4.5, are very similar to the case of a matched definition of center, probably because our populations are not skewed enough, so the results from using the mean and the median are similar. But if we were working with a highly skewed population or subpopulation, or if a smaller sample were selected, we conjecture that using the median would outperform the mean-based approach.

Table 4.5 Summary of simulation results examining the performance of detecting population outliers. We use an unmatched definition of center, and population outliers are defined by $\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$. Sample outliers are the points $y^0$ such that $\hat\Omega^*_{g,\nu}(y^0) > 1 - \alpha$ for $g = 1, \ldots, 5$.

                              average number   average number   average number of     average fraction of
                              of sample        of outliers      outliers correctly    correctly identified
                              outliers         identified       identified            outliers
n=1000, leave                 17.59            19.8             14.14                 0.81
n=1000, not leave             17               16.16            12.44                 0.74
n=400, leave                  7.01             9.32             5.06                  0.74
n=400, not leave              6.78             5.58             3.68                  0.57

4.4 Initial clustering and testing outlyingness

In this simulation study, we generate 10 test sample points for testing outlyingness. The scatter plot of the population and the prediction points is shown in Figure C.1. Initial classification is obtained by PAM (Partitioning Around Medoids) and the adjusted k-means described in Subsection 3.5.2.

Figures C.14-C.19 show the boxplots of the estimated distance distributions under SRS for both PAM and adjusted k-means clustering, as well as their population counterparts. From Figures C.14-C.16, we can see an underestimate of $\hat D_{\nu 1,\|z-\mu_{\nu 1}\|}(\hat\mu_{\nu 1})$ for all points except points 3, 5 and 9. Similarly, an overestimate of $\hat D_{\nu 2,\|z-\mu_{\nu 2}\|}(\hat\mu_{\nu 2})$ is seen for points 0 and 2, and an overestimate of $\hat D_{\nu 4,\|z-\mu_{\nu 4}\|}(\hat\mu_{\nu 4})$ is seen for all points except point 5. An explanation for this is that PAM tends to generate clusters of the same radius, so some points from clusters 2 or 4 are classified into the same group as cluster 1.

Figures C.17-C.19 show better results for estimating $D_{\nu g,\|z-\mu_{\nu g}\|}(\mu_{\nu g})$ using adjusted k-means clustering. In these plots, the population distance distributions usually lie between the 25% and 75% quantiles of the estimated distance distributions, and the bias in estimation has been greatly reduced.

Figure C.13 shows the cluster analysis results from the PAM and adjusted k-means procedures, from which we can see that a fraction of points from clusters 2 and 4 are classified into the same group as cluster 1. But the misclassification rate is reduced if we use the adjusted k-means method. This explains the behavior of the boxplots in Figures C.14-C.19.

Now let us take a closer look at Figures C.17-C.19. Point 0 is at the boundary of clusters 1 and 2. For most simulations, only $\hat D_{\nu 1,\|z-\mu_{\nu 1}\|}(\hat\mu_{\nu 1})$ and $\hat D_{\nu 2,\|z-\mu_{\nu 2}\|}(\hat\mu_{\nu 2})$ are not equal to 1, but both of them are often very large. The distribution of $\hat D_{\nu 2,\|z-\mu_{\nu 2}\|}(\hat\mu_{\nu 2})$ has a heavy left tail, as we may have misclassifications into cluster 2 in some simulation runs. Point 1 is at the boundary of a dense cluster (cluster 1) and a widespread cluster (cluster 4). As we tend to classify boundary points into cluster 1 in both algorithms (PAM and adjusted k-means), we usually underestimate $\hat D_{\nu 1,\|z-\mu_{\nu 1}\|}(\hat\mu_{\nu 1})$ and overestimate $\hat D_{\nu 4,\|z-\mu_{\nu 4}\|}(\hat\mu_{\nu 4})$. A similar overestimate of $\hat D_{\nu 4,\|z-\mu_{\nu 4}\|}(\hat\mu_{\nu 4})$ is seen for points 3, 4 and 7. The estimation for points 2 and 5 using adjusted k-means clustering looks very good, with the boxplot of $\hat D_{\nu g,\|z-\mu_{\nu g}\|}(\hat\mu_{\nu g})$ centered at the true value. Points 6 and 8 are obviously suspicious points in the population, and from Figures C.19 and C.17 we can see that the estimated sample fractions are all greater than 0.965 for the 4 clusters. Point 9 is an inlier point for cluster 4, and we can tell from Figure C.17 that the values of $\hat D_{\nu 4,\|z-\mu_{\nu 4}\|}(\hat\mu_{\nu 4})$ are all very small (< 0.4).

CHAPTER 5. OUTLIER IDENTIFICATION IN THE NATIONAL RESOURCES INVENTORY

In this chapter, we present empirical results on identifying suspicious points with the proposed approach in a large-scale national survey, the National Resources Inventory. First, we outline the general strategy for identifying suspicious points in a large-scale survey, especially for longitudinal surveys in which data are updated periodically. Secondly, we present analysis results from a test application using Kansas as an example, and give a careful description of each point we identify as a suspicious point. Thirdly, we explain the initial partitioning rules, which take into account geographical differences as well as cover/use, and present summary results on the suspicious points. Finally, we close with a discussion pointing out interesting findings and practical concerns.

5.1 General strategy

In this section, we explain how to test observations from a complex survey, and especially a longitudinal survey that is updated periodically, based on the notion of distance distribution and the outlyingness measure given in Section 3.1. We outline the general strategy as follows:

1. Identify the target variables that we are interested in or want to test outlyingness on. In a large-scale survey, we usually collect hundreds of variables, but the analysis pertains only to those that are sensitive to the measurement tools and of interest in our estimation procedures.

2. Partition the whole population into a number of relatively homogeneous subpopulations based on subject matter knowledge, with the guidance of exploratory tools like cluster analysis. In the NRI, geographical association and point classification are two variables we used for partitioning the points. In a human population, demographic information can be used to pre-classify the points into homogeneous groups.

3. Define a measure of the subpopulation center using (3.5) or (3.7), and use $\|\cdot\|$ or $\|\cdot\|_{\Sigma_\nu}$ to calculate the distance between each point and its corresponding subpopulation center. Then define the population measure of outlyingness for a location $y^0$ using (3.9) or (3.44).

4. Estimate the population measure of outlyingness by (3.10) or (3.46) for each $y_j$ in the sample. We treat each $y_j$ as a location in $\Re^p$, and estimate the fraction of points whose distance to the center is less than that of $y_j$.

5. Prespecify a threshold on the measure of outlyingness (0.95 or 0.98, say), and identify the observations whose measure of outlyingness is greater than the threshold. The set of observations with measure of outlyingness exceeding the threshold is flagged for closer scrutiny (a sketch of steps 3-5 follows this list).

6. Make decisions on the list of suspicious points identified in step 5: either correct the observation if there is an error, or leave the observation as it is. In the NRI application, we present the suspicious points to subject matter specialists to receive analysis results on the points.

7. For longitudinal surveys updated periodically, we suggest building outlier identification

rules based on previous years' data and use the rules to flag forthcoming observations.

The procedure is explained as follows,

(a) Divide the longitudinal data set into two subsets, with the same number of variables

and similar pattern of observation years so that the covariance matrix is similar

between the two sets. Use one set as training set and the other one containing most

recent years as test set.

(b) Build outlier identification rules based on previous years' data (the training set) using steps 2-5.

(c) For each point in the test set (the time period having been shifted), calculate its distance to the corresponding subpopulation center in the training set. Then estimate an outlyingness measure based on the distance distributions built from previous years' data. The idea of building outlier identification rules on historical data and applying them to upcoming data carries over readily to streaming data as well.

(d) If the estimated outlyingness measure with respect to its associated subpopulation is greater than the threshold, the observation is treated as a suspicious point. We need to make a decision on every suspicious observation identified by this procedure.
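To fix ideas, the following minimal sketch implements the core of steps 4, 5 and 7(c) for a single subpopulation. It is illustrative only: the function names are ours, and a survey-weighted empirical distribution of distances stands in for the formal estimators (3.43) and (3.46).

```python
import numpy as np

def outlyingness(dist, train_dist, train_w):
    """Estimated fraction of (weighted) training points whose distance to
    the subpopulation center falls below each given distance; values near
    1 indicate points in the far tail of the distance distribution."""
    dist, train_dist, train_w = map(np.asarray, (dist, train_dist, train_w))
    order = np.argsort(train_dist)
    ecdf = np.cumsum(train_w[order]) / train_w.sum()
    idx = np.searchsorted(train_dist[order], dist, side="right")
    return np.where(idx > 0, ecdf[np.maximum(idx - 1, 0)], 0.0)

def flag_suspicious(dist, train_dist, train_w, threshold=0.98):
    """Steps 5 and 7(c): flag observations whose estimated outlyingness
    measure exceeds the prespecified threshold."""
    return outlyingness(dist, train_dist, train_w) > threshold
```

For step 7(c), `train_dist` holds distances from the training data and `dist` the distances of new test points to the training centers, so the rule built on historical data is applied unchanged to incoming observations.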

5.2 Introduction to National Resources Inventory

The National Resources Inventory (NRI) is a long-term survey of natural resources information on nonfederal land in the US, which covers over 75 percent of the total land area. It is conducted by the U.S. Department of Agriculture (USDA) Natural Resources Conservation Service (NRCS), in cooperation with the Center for Survey Statistics and Methodology (CSSM) at Iowa State University.

The 1997 NRI survey is based on approximately 300,000 primary sampling units (PSUs) and about 800,000 sampling points; the information was updated every 5 years before 1997, and annually through a partially overlapping subsampling design since 2000. The most common primary sampling unit is a 0.5 mile x 0.5 mile square, and in the second stage of sampling we usually take 3 sample points within each PSU. In practice, there are some deviations from this standard procedure. The basic sampling design of the NRI is a longitudinal stratified two-stage area sample. Variables are collected at the point level and the PSU level. The variables for each sampling point include general information about the point (such as state, PSU, and hydrological unit), indicator variables for whether it was sampled in a specific year, broad use and land use, variables concerning the USLE (Universal Soil Loss Equation), variables concerning the WEQ (Wind Erosion Equation), conservation practice indicators, and so forth. Not all variables are applicable to each point. A detailed description of the sampling design and data collection of the

National Resources Inventory can be found in Nusser and Goebel (1997).

As the NRI data come from a large-scale, complicated survey with many variables, it is natural to have some outlying points, resulting either from errors in data collection and processing or from real points that genuinely behave abnormally. There are outliers with extreme values on one variable, or with rare or unreasonable combinations of variable values. The usual practice is to examine each point across different years to look for abnormal trends, which usually requires a large amount of time and effort to go through the whole NRI data set. We want an automatic procedure to test the outlyingness of NRI data and identify suspicious points for further examination. The conceptual idea is to first partition the state-level data into several homogeneous subpopulations and use a proper distance metric to measure how far each point is from its subpopulation center. Then we define the outlyingness measure by an estimate of the fraction of population points that are beyond this point, and take a portion of the most outlying points for careful examination.

5.3 Test application

We use the 2003 NRI data from Kansas as a test application of the initial partitioning and outlier identification procedures. Our original data consist of 12 variables on the USLE for the core sample. There are two major categories of points in our data set, which can be identified by broad use: points with USLE soil erosion and points that do not have soil erosion. The points that have soil loss are cultivated cropland, noncultivated cropland, pastureland and land that is enrolled in the Conservation Reserve Program. The other points (NONUSLE) are rangeland, forest land, minor land, urban and built-up land, rural transportation, water areas and Federal land. Points that switch from a USLE cover/use to a non-USLE land use or vice versa are not included in the study set. There are 2088 USLE core points in the study set.

The 12 variables included in the analysis are broad use (bu1997, bu2000, bu2003), C factor (cfact1997, cfact2000 and cfact2003), support practice factor (pfact2003), slope length (slopelen2003), slope (slope2003) and USLE loss (usleloss1997, usleloss2000 and usleloss2003). Broad use is a categorical variable with 12 categories, explained in Table 5.1. The C factor is the USLE cover and management factor and the P factor is the USLE support practice factor.

Slope and slope length determine the LS factor of the point. The USLE loss is calculated by the Universal Soil Loss Equation,
$$USLELOSS = R \times K \times LS \times C \times P,$$
where USLELOSS represents the potential long-term soil loss in tons per acre per year, R is the rainfall and runoff factor, K is the soil erodibility factor, LS is the slope-gradient factor, C is the crop/vegetation and management factor, and P stands for the support practice factor.
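Since this equation drives much of the later analysis, a direct transcription in code may be useful; the function name is ours.

```python
def usle_loss(R, K, LS, C, P):
    """Universal Soil Loss Equation: potential long-term soil loss in
    tons per acre per year, as the product of the five USLE factors."""
    return R * K * LS * C * P

# The standardized loss used later in this chapter divides out the two
# external factors: usle_loss(R, K, LS, C, P) / (R * K) == LS * C * P.
```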

Table 5.1 Broad Use Categories.

  Value of Broad Use   Point Type
  1                    Cultivated cropland
  2                    Noncultivated cropland
  3                    Pastureland
  4                    Rangeland
  5                    Forest land
  6                    Minor land
  7                    Urban and built-up land
  8                    Rural transportation
  9                    Small water areas
  10                   Large water areas
  11                   Federal land
  12                   Conservation Reserve Program (CRP)

As broad use is a categorical variable, we define dummy variables for each of its categories. The USLE points have 4 distinct land use types (1, 2, 3 and 12), so we define three dummy variables for each broad use variable. We standardize USLE loss by taking its ratio to two external variables (namely the R factor and the K factor), and drop 6 points whose USLE loss and K factor are both 0.
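A sketch of this preprocessing step, assuming a pandas data frame with hypothetical column names (rfact, kfact, buYYYY, uslelossYYYY); the actual NRI extract uses its own file and variable naming.

```python
import pandas as pd

df = pd.read_csv("usle_core_points.csv")  # hypothetical file name

# Drop the points whose USLE loss and K factor are both 0 (the only
# points on which the standardization below would be undefined).
df = df[~((df["kfact"] == 0) & (df["usleloss2003"] == 0))]

for year in ("1997", "2000", "2003"):
    # Three dummies per year: 4 land use types (1, 2, 3, 12), first dropped.
    dummies = pd.get_dummies(df["bu" + year].astype("category"),
                             prefix="bu" + year, drop_first=True)
    df = df.join(dummies)
    # Standardize USLE loss by the external R and K factors.
    df["stdloss" + year] = df["usleloss" + year] / (df["rfact"] * df["kfact"])
```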

The initial partitioning aims at dividing our data into homogeneous subpopulations, and can be conducted using categorical variables or clustering algorithms. We partition the working data set into 6 subpopulations based on real-world experience. First we split the data into 4 groups of points with no broad use change over the three years and a group of points with changes of cover use. As the cultivated cropland points form a large group with 1361 points, we further split those points into two subpopulations based on the P factor. Table 5.2 shows the 6 subpopulations and their sizes. The split of subpopulation 1 into 2 groups corresponds to the results of a cluster analysis using Mahalanobis distance to define pointwise distances, and this initial partitioning reflects our subject matter knowledge about the data.

Table 5.2 Description of 6 clusters on USLE core points.

  Cluster   Size   Broad use sequence   P factor
  1.1       1004   (1,1,1)              1
  1.2        357   (1,1,1)              < 1
  2           66   (2,2,2)              1
  3          182   (3,3,3)              1
  4          280   (12,12,12)           1
  5          197   with changes         1

We calculated a weighted sample mean and variance-covariance matrix for each subpopulation, and defined the distance between an arbitrary point $y_i$ in cluster $g$ and the center of that subpopulation as
$$\|y_i - \hat{\mu}_{vg}\| = \sqrt{(y_i - \hat{\mu}_{vg})'\hat{\Sigma}_{vg}^{-1}(y_i - \hat{\mu}_{vg})},$$
where $\hat{\Sigma}_{vg}$ is defined as in section 3.1. The center of each cluster is listed in Table 5.3. The partition into subpopulations 1.1 and 1.2 is based on the P factor, and the two subpopulations have different distributions of slope length and slope, as also suggested by the histograms in Figure 5.1. We use (3.43) to define a distribution of distances within subpopulation $g$. The distance distributions for subpopulations 1.1, 1.2, 2 and 5 are plotted in Figure 5.2. The variance-covariance matrices used in defining distances differ across subpopulations, having different determinants. The distributions indicate how far the distances extend in each cluster, and we can visualize the tail of each distribution. If we used a common variance-covariance matrix to define distances to the center, we could tell the difference in dispersion between subpopulations. But our main goal here is to identify suspicious points, and using a separate variance-covariance matrix for each subpopulation is preferred.
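A compact sketch of this distance computation for one subpopulation follows; the use of a pseudo-inverse to handle the degenerate broad use dimensions is our simplification, not part of the formal development.

```python
import numpy as np

def mahalanobis_to_center(Y, w):
    """Survey-weighted mean and covariance for one subpopulation, and each
    point's Mahalanobis distance to the weighted center.
    Y: (n, p) matrix of points; w: (n,) survey weights."""
    mu = np.average(Y, axis=0, weights=w)
    Z = Y - mu
    sigma = (w[:, None] * Z).T @ Z / w.sum()   # weighted covariance
    sigma_inv = np.linalg.pinv(sigma)          # pseudo-inverse: tolerates
                                               # degenerate (dummy) dimensions
    return np.sqrt(np.einsum("ij,jk,ik->i", Z, sigma_inv, Z))
```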

In this application, we use broad use to partition the whole population into subpopulations 1, 2, 3, 4, and 5, and then partition subpopulation 1 into groups 1.1 and 1.2 using the P factor.

Table 5.3 Center of each subpopulation for Kansas USLE points

  Cluster   bu1997.1   bu1997.2   bu1997.3   bu2000.1   bu2000.2   bu2000.3
  1.1       1.00       0.00       0.00       1.00       0.00       0.00
  1.2       1.00       0.00       0.00       1.00       0.00       0.00
  2         0.00       1.00       0.00       0.00       1.00       0.00
  3         0.00       0.00       1.00       0.00       0.00       1.00
  4         0.00       0.00       0.00       0.00       0.00       0.00
  5         0.44       0.35       0.05       0.54       0.28       0.07

  Cluster   bu2003.1   bu2003.2   bu2003.3   cfact1997   cfact2000   cfact2003
  1.1       1.00       0.00       0.00       0.2032      0.2046      0.2001
  1.2       1.00       0.00       0.00       0.2048      0.2077      0.2016
  2         0.00       1.00       0.00       0.0153      0.0152      0.0201
  3         0.00       0.00       1.00       0.0170      0.0152      0.0193
  4         0.00       0.00       0.00       0.0165      0.0147      0.0193
  5         0.42       0.29       0.09       0.0920      0.0912      0.0887

  Cluster   pfact2003   slopelen2003   slope2003   usleloss1997   usleloss2000   usleloss2003
  1.1       1.00        216.39         1.53        3.64           3.72           3.89
  1.2       0.55        171.80         2.70        3.67           3.90           3.25
  2         1.00        196.76         3.46        0.73           0.73           0.82
  3         1.00        196.03         3.86        0.96           0.88           0.75
  4         1.00        183.83         3.10        0.70           0.62           0.82
  5         0.95        190.54         2.54        2.16           2.67           1.99

Figure 5.1 Histograms of selected variables (cfact2003, slope2003, slopelen2003, usleloss2003) in subpopulations 1.1 and 1.2.

Figure 5.2 Distribution of Mahalanobis distances from observations in subpopulations 1.1, 1.2, 2 and 5 to their corresponding subpopulation means.

In defining distances, the subpopulation association variable (broad use, in our case) enters into the multivariate y-vector. The subpopulations then have degenerate dimensions on broad use, except subpopulation 5, and the distance between points with different cover/use is undefined, or defined as infinity. This makes it trivial to examine inter-subpopulation distances, as the distances between certain clusters are infinite (except between subpopulations 1.1 and 1.2). An ideal case for looking at inter-subpopulation distances would be when the convex hulls of points in different subpopulations overlap, where the convex hull is taken over the multivariate vector y. In this particular application, distances between subpopulations 1.1 and 1.2 are nontrivial, so we can examine potential misclassification between the two subpopulations. We can calculate the outlyingness measure of a future point with respect to each subpopulation and classify the point into the subpopulation with the largest outlyingness measure. If all the outlyingness measures are close to zero, we label the point as suspicious. Figure 5.3 suggests that a considerable portion of observations in subpopulation 1.2 have moderate outlyingness measures with respect to subpopulation 1.1, so the risk of misclassifying a point from subpopulation 1.2 into 1.1 is relatively high. The bottom two plots in Figure 5.3 suggest that the chance of classifying a point from subpopulation 1.1 into 1.2 is very small.

In identifying outliers, we take the 1% of points having the largest distances from each cluster (if a cluster has fewer than 200 observations, we take two "outliers"). Subject matter knowledge is used to carefully examine these points.

Table 5.4 gives the list of suspicious points. The outlyingness usually comes from the value of a certain variable or variable combination, and we want to provide some guidance on which variables have caused the outlyingness. We use the following normalized vector for this purpose,
$$e_{vg,i} = \frac{\hat{\Sigma}_{vg}^{-1/2}(\hat{\mu}_{vg} - y_i)}{\left\|\hat{\Sigma}_{vg}^{-1/2}(\hat{\mu}_{vg} - y_i)\right\|_2}. \qquad (5.1)$$
The idea behind this is to rotate the vector from $y_i$ to $\hat{\mu}_{vg}$, project the rotated vector onto the axes of the original variables, and pick the dimensions that have small angles with the rotated vector.
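The normalized vector (5.1) is easy to compute once the subpopulation moments are available. A sketch, assuming scipy is available for the matrix square root (the function name is ours):

```python
import numpy as np
from scipy.linalg import sqrtm

def direction_vector(y, mu, sigma):
    """Normalized vector of (5.1): the decorrelated direction from point y
    to center mu; components large in absolute value point to the
    variables driving the outlyingness."""
    root_inv = np.linalg.pinv(np.real(sqrtm(sigma)))
    e = root_inv @ (mu - y)
    return e / np.linalg.norm(e)

# Variables whose component of e exceeds 0.5 in absolute value are
# reported as the suspicious variables (cf. the full-scale application).
```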

Figure 5.3 Distribution of Mahalanobis distances from observations in subpopulations 1.1 and 1.2 to both subpopulation means.

Table 5.4 List of suspicious USLE core points in Kansas

       bu1997   bu2000   bu2003   cfact1997   cfact2000   cfact2003
  1    1        1        1        0.14        0.14        0.14
  2    1        1        1        0.14        0.14        0.14
  3    1        1        1        0.14        0.14        0.14
  4    1        1        1        0.23        0.14        0.14
  5    1        1        1        0.2         0.2         0.2
  6    1        1        1        0.16        0.15        0.02
  7    1        1        1        0.14        0.14        0.14
  8    1        1        1        0.23        0.14        0.013
  9    1        1        1        0.02        0.15        0.21
  10   1        1        1        0.14        0.19        0.22
  11   1        1        1        0.14        0.14        0.14
  12   1        1        1        0.3         0.29        0.29
  13   1        1        1        0.14        0.14        0.14
  14   2        2        2        0.013       0.013       0.042
  15   2        2        2        0.003       0.013       0.013
  16   3        3        3        0.003       0.013       0.013
  17   3        3        3        0.09        0.013       0.013
  18   12       12       12       0.013       0.13        0.013
  19   12       12       12       0.13        0.013       0.013
  20   12       1        1        0.042       0.21        0.21
  21   12       1        1        0.042       0.21        0.21
  22   12       3        12       0.013       0.003       0.013
  23   2        1        2        0.02        0.02        0.02
  24   2        1        2        0.02        0.02        0.02

Table 5.4 (Continued)

       pfact2003   slopelen2003   slope2003   usleloss1997   usleloss2000   usleloss2003
  1    1           150            18          8.781          8.75           8.391
  2    1           150            18          8.781          8.75           8.391
  3    1           150            18          8.781          8.75           8.391
  4    1           150            14          8.797          8.719          8.359
  5    1           250            11          27.413         27.307         26.507
  6    1           50             8           11.2           10.7           1.5
  7    1           90             12          8.547          8.578          8.203
  8    1           200            9           9.518          8.714          2.179
  9    1           175            5           1.49           10.314         13.294
  10   1           250            5           11.733         15.733         17.733
  11   0.8         150            18          8.766          8.734          8.375
  12   0.6         75             11          8.578          8.562          8.188
  13   0.6         90             12          8.547          8.578          7.719
  14   1           250            8           2.027          2.054          5.311
  15   1           90             12          0.547          2.234          2.219
  16   1           200            18          1.484          5.578          5.516
  17   1           160            10          9.018          2.286          2.25
  18   1           250            3           0.56           5              0.52
  19   1           100            7           9.095          1.119          1.095
  20   1           250            10          8.863          20.824         20.078
  21   1           250            15          14.667         21.843         20.902
  22   1           250            1           0.286          0.114          0.257
  23   1           100            1           0.306          0.306          0.278
  24   1           100            0.4         0.233          0.233          0.2

We have taken a close look at these points, examining each point with a subject matter specialist. We are grateful to Bob Dayton and Dick Dorsch for providing background information and studying the points. The analysis is described in the following paragraphs.

Points 1, 2, 3, 4, 6, 7, 11, 12, and 13 were picked mainly because they have a large slope percent. The interval formed by the 5%-95% quantiles of the empirical distribution of slope percent is (0.2, 5) for cluster 1.1 and (1, 8) for cluster 1.2. The distribution of slope percent is concentrated on small values, so points with slope percent in the tails are singled out as suspicious.

Point 5 has a long and steep slope, so the predicted soil erosion due to water is very high compared to points in the same cluster. The slope is a possible one and the data were accepted; only if the soil map indicated an error would the point be in error.

Point 6 has a large change in C factor (and in USLELOSS) during the three years. Looking at its C factor in all years from 1982 to 2003, it is possible that annual C factors were reported in some years in place of the C factor for the total rotation. The land use on this point changes from corn in 1997 to wheat in 2000 to hay in 2003. The proper way to define the C factor for the three years is to use the average of the C factors for the rotation. A value of 0.02 is reasonable if the point is kept as hay for a long time, but not if hay is part of the rotation.

Points 8 and 9 also involve changes in land use; the changes in C factor are considered to be caused by the use of the annual C factor. Information from subsequent years will suggest whether the data should be modified.

Point 14 has been hay ever since 1982, and it is strange that the C factor changed from

0.013 to 0.042 between 2000 and 2003. It was decided to change the C factor for 2003 to 0.013.

For point 15, 0.003 is very low for hay, and the subject matter specialist decided to change the C factor for 1997 to 0.013. Point 16 is in pasture, but is a noncultivated cropland point with land use being hay. Both points are unusual but possible.

Point 17 has a very high C factor in 1997 for hay, and it was decided to change the factor to 0.013.

Points 18 and 19 had data entry errors, as the C factors should all be 0.013; they were corrected in the data set.

Points 20 and 21 changed broad use from CRP to noncultivated cropland, with the C factor and USLE soil loss changing dramatically. The changes are considered reasonable given the change in land use.

For point 22, the C factor should be higher for pastureland in 2000 than for CRP in 1997 and 2003. A resolution would be to change the C factor in 2000 to 0.013.

Points 23 and 24 are rotation points from the same PSU. Looking at the broad use and C factor in all sampling years between 1997 and 2003, the points are judged to be reasonable.

In another test application, we examined the WEQ (Wind Erosion Equation) variables for points with wind erosion. The set of points with wind erosion is a subset of the USLE points, namely those for which wind erosion is applicable. The variables included in our analysis are broad use (bu1997, bu2000 and bu2003), wind erodibility index (eiwind2003), WEQ knoll erodibility (knoll2003), soil ridge roughness factor (wkfact2003), WEQ unsheltered distance (wlfact2003), WEQ equivalent vegetative cover (wvfact1997, wvfact2000 and wvfact2003), and yearly WEQ soil loss (weq1997, weq2000 and weq2003). We define dummy variables for each category of broad use, similar to those used for the USLE points.

The initial partitioning divides our data into 4 clusters, and Table 5.5 shows the size and broad use sequence of each cluster. We usually do not collect wind erosion on pastureland, so our data set does not contain a cluster of points that are pastureland in all three years. In defining the distances of interest, we estimate the mean vector and variance-covariance matrix for each cluster and use the Mahalanobis distance as in (3.41) to define the distance between a point and its cluster center. The mean vector for each cluster is listed in Table 5.6. In identifying suspicious points, we take the 1% of points with the largest distances and list them in Table 5.7.

We have taken a close look at the points picked by our approach and, with the help of a subject matter specialist, give explanations for them. The analysis is described in the subsequent paragraphs.

Table 5.5 Description of 4 clusters on WEQ core points.

  Cluster   Size   Broad use sequence
  1         941    (1,1,1)
  2          12    (2,2,2)
  4         218    (12,12,12)
  5         110    with changes

Table 5.6 Center of each cluster for Kansas WEQ points

  Cluster   bu1997.1   bu1997.2   bu1997.3   bu2000.1   bu2000.2   bu2000.3
  1         1.0000     0.0000     0.0000     1.0000     0.0000     0.0000
  2         0.0000     1.0000     0.0000     0.0000     1.0000     0.0000
  4         0.0000     0.0000     0.0000     0.0000     0.0000     0.0000
  5         0.4469     0.3514     0.0073     0.5902     0.2559     0.0045

  Cluster   bu2003.1   bu2003.2   eiwind2003   knoll2003   wkfact2003   wlfact2003
  1         1.0000     0.000      7.5796       0.0118      0.8216       1841.687
  2         0.0000     1.000      7.8214       0.0000      1.0000       1537.264
  4         0.0000     0.000      10.5096      0.0317      1.0000       2085.072
  5         0.4917     0.206      8.9838       0.0000      0.9347       1694.468

  Cluster   wvfact1997   wvfact2000   wvfact2003   weq1997   weq2000   weq2003
  1         1675.569     1734.164     1751.594     2.1065    1.4323    2.1142
  2         2356.418     2333.286     2359.802     0.0740    0.2414    0.3286
  4         2308.833     2363.729     2391.440     0.5694    0.4817    0.5943
  5         2190.136     1745.942     2113.253     1.0538    3.5934    2.0942

Table 5.7 List of suspicious WEQ core points in Kansas

       bu1997   bu2000   bu2003   eiwind2003   knoll2003   wkfact2003
  1    1        1        1        10.7         4           0.7
  2    1        1        1        3.8          4           0.7
  3    1        1        1        10.7         4           0.7
  4    1        1        1        3.8          3           0.6
  5    1        1        1        6.9          3           0.7
  6    1        1        1        10.7         3           0.7
  7    1        1        1        32.2         0           0.8
  8    1        1        1        17.2         0           1.0
  9    1        1        1        26.8         0           0.8
  10   2        2        2        17.2         0           1.0
  11   2        2        2        26.8         0           1.0
  12   12       12       12       3.6          3           1.0
  13   12       12       12       10.7         3           1.0
  14   12       3        12       6.7          0           1.0
  15   3        1        1        11.5         0           0.7
  16   12       12       12       2.9          0           1

       wlfact2003   wvfact1997   wvfact2000   wvfact2003   weq1997   weq2000   weq2003
  1    240          1100         1250         1250         10.96     5.61      JU9
  2    250          1500         1550         1550         0.00      0.00      0.00
  3    1200         1400         1400         1400         2.65      4.38      5.71
  4    700          800          1000         800          1.02      0.00      0.56
  5    500          1400         1400         1250         0.40      0.00      1.40
  6    600          1500         1400         1500         1.74      3.21      1.88
  7    2550         3250         1000         1250         0.00      37.94     33.13
  8    2600         1000         3000         500          13.59     0.00      59.46
  9    830          1000         1250         700          17.94     20.80     57.86
  10   2600         2600         2200         2000         0.00      1.62      2.43
  11   2470         2600         2200         2000         1.47      5.58      7.75
  12   700          3500         2600         2600         0.00      0.00      0.00
  13   450          2600         2600         2600         0.00      0.00      0.00
  14   1300         2000         5000         2600         0.00      0.00      0.00
  15   1900         2000         1250         1100         0.00      1.52      2.53
  16   2600         4000         2600         4000         0.00      2.53      0.00

Points 1-6, 12 and 13 were picked because of an excessively large knoll erodibility compared to most points in the data set. The range of knoll erodibility is from 0 to 9, but for Kansas this variable is clumped around 0. So the 8 points with "outlying" knoll erodibility were picked as suspicious points. The points were all accepted as reasonable observations.

Points 7 and 8 involve dramatic changes of yearly WEQ loss, due to changes of land use. The two points are suspicious, but given the data set at hand, the decision was not to edit them.

Point 9 has a very low vegetative factor in 2003 and thus very high wind erosion. This is probably a normal change given the data at hand, and the decision was not to edit the point.

The noncultivated cropland points are both plausible, and the changes in vegetative factor were accepted as normal after a careful examination. Points 14 and 15 are reasonable; they were picked because they were pastureland in either 1997 or 2000, while pastureland points are rare in our data set.

In another attempt, we used an average WEQ loss instead of the yearly WEQ, and found one point with a data entry error, listed as point 16 in Table 5.7. The vegetative factor for the point in 1998 was entered as 260 in error, leading to a wind erosion of 10.12 in that year, so the 1997-2000 average wind erosion is 2.53 for the point while the average WEQ loss is 0 for the other two time periods. The error has been corrected in the original data set.

5.4 Full-scale application

The test application and the follow-up analysis validated the proposed approach, and we now apply the procedure to the NRI data for all states and U.S. territories. In survey design and estimation, we usually first analyze the data state by state, and then combine information to form national-level aggregates. In our efforts to identify suspicious points in the whole NRI data set, we also take into account geographical differences as well as the various cover uses.

The National Resources Inventory has been updated annually through a partially overlapping subsampling design since 2000, with "core" sampling points observed every year, except that part of the core points were not observed in 2002 due to financial constraints. We imputed the core points that were not measured in 2002 to form a complete dataset. Even after data editing, in which a set of deterministic rules screens invalid observations, we can still benefit from the proposed outlier identification approach to improve data quality. We will use the strategy explained in section 5.1 to identify the 1% most suspicious points for careful examination.

At the moment, the 2004 and 2005 NRI data are uncleaned, so we use historical data from 2001, 2002 and 2003 to build outlier identification rules, and use the rules to test the 2003, 2004 and 2005 sample points.

The sample points included in our full-scale application are nonpseudo core points which have USLE soil erosion in years 2001-2005. In other words, the points included stayed in broad use categories 1, 2, 3, or 12 from 2001 to 2005. Changes of broad use among the four categories are possible, however, and are usually of survey interest.

We included 21 variables in the training set: broad use, land use, C factor, support practice factor, slope, slope length, and USLE loss in all three years (2001, 2002, and 2003). We use a standardized USLE loss, obtained by dividing by the rainfall/runoff factor and the soil erodibility factor, so that the standardized loss has higher correlation with the factor variables.

The training set is partitioned according to the following rules. The label for each cluster has 8 digits, where the first 2 digits are state/region indicators and the next 6 digits are broad use trend indicators (a small sketch of this labeling convention follows the list).

1. Partition national data into state-wise categories.

2. As we usually do not have many sample points in the northeastern states, we merge the northeastern states into one category to get enough observations, using the NRCS region code to merge the state data and "00" as the new region indicator.

3. For each region/state, partition the data based on broad use pattern into (1, 1, 1), (2, 2, 2), (3, 3, 3), (12, 12, 12) and another category containing points with broad use change. For example, a point in Kansas with broad use sequence (1, 1, 1) will be assigned to a cluster labeled "20010101", where "20" is the state ID and "010101" indicates the broad use sequence of this point/cluster.

4. Combine CRP points across states where the number of CRP points is less than 25 and

label the cluster as "##121212", combine noncultivated cropland points across states

where the number of noncultivated cropland points is less than 25 and label the cluster

as "##020202".

5. Merge observations in Nevada with those in Arizona if their broad use trends are the

same, and use "04" to denote the state in cluster label.

6. Partition data points that have changed their broad use according to their broad use

pattern, say (2, 3, 3), (1, 1, 12), and so forth. For example, all points whose broad use

sequences are (1, 2, 2) will be merged into a cluster labeled "##010202".

7. Eliminate clusters with fewer than 26 observations from further analysis. Only a few such small clusters exist, and we ignore them in the outlier analysis.
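The labeling convention in rules 3 and 6 amounts to concatenating the region code with zero-padded broad use codes; a minimal sketch:

```python
def cluster_label(region, bu_sequence):
    """8-digit cluster label: a 2-digit state/region code followed by a
    6-digit broad use trend, e.g. cluster_label('20', (1, 1, 1)) gives
    '20010101' for a Kansas point with broad use sequence (1, 1, 1)."""
    return region + "".join("%02d" % bu for bu in bu_sequence)
```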

To identify suspicious points in the training data, we estimate the mean vector of each cluster by (3.6), and calculate the distance between a point and its cluster center using (3.47). Then we pick out the 1% of points (or 2, if the cluster has fewer than 200 points) with the largest distances in each cluster, and label them as suspicious points.

To identify outliers in the test dataset, we partition the updated data using the same partitioning rules, and calculate the distance between each new observation and the original mean vector of its assigned cluster. Then we compare the distances with the 99% quantile of the distance distribution estimated using (3.43). Points with distances greater than the threshold are labeled as suspicious and presented to agricultural specialists for careful examination. With the threshold specified as the 99% quantile of the distribution of training distances, we identified far more than 1% of the uncleaned test data as suspicious, so for the current analysis we only take the 2 most outlying observations in each test cluster when more than 2 exceed the threshold. In total, we flagged 669 suspicious points from the training and test data sets.

The outliers we identified are stored in a comma-separated file named "outlier.csv", with the first five columns being the cluster ID, the outlier ID, the variable or variable combination that is suspicious, and two indicators for whether this is a training or a test outlier. The rest of the columns are the variables we used, covering all five years from 2001 to 2005.

We have taken a careful look at the points and are grateful to Dick Dorsch and Bob Dayton for studying them and writing a review. We calculate $e_{vg,i}$ for each point $i$ using (5.1), and flag the variables whose components of $e_{vg,i}$ exceed 0.5 in absolute value as suspicious variables. In studying the points, the subject matter specialists partition the points according to the suspicious variable, and examine the whole sequence of that particular variable. The following is the review of the 669 NRI points flagged as suspicious from the training and test data sets, separated by the variable of suspicion:

1. USLE C factor - almost all of these points were considered suspicious by NRCS reviewers.

In some cases the data could be changed locally, others would require sending the segment

(Primary Sampling Unit) back to the data collection team. The types of problems

observed included:

(a) Data entry errors: land use stays constant and the C factor changes by a multiple of 10. Examples:

i. 40061_010301S pt 2 (0.013 0.13 0.013 0.013 0.013)

ii. 19043_0204015 pt 3 (0.13 0.013 0.13 0.013 0.013)

(b) Invalid entries: C factor = 1.0 on hayland, pastureland or CRP. Examples:

i. 01033-0206035 pt 2 (hay 0.003 0.005 0.003 1.0 1.0)

ii. 01061-0302015 pt 2 (CRP 0.003 0.003 0.003 1.0 1.0)

(c) Unusual (but not necessarily invalid) levels or trends in relation to land cover/use.

Examples:

i. 05057_030604i? pt 2 (0.011 0.06 0.11 0.003 0.003)

ii. 29053_030201i? pt 3 (0.013 0.008 0.004 0.013 0.013)

2. USLE P factor - the level of the P factor is dependent on the supporting conservation practices at the site and can vary for any type of land cover/use. All of these points

were candidates for review because of the change in the P factor over time. Especially

troubling were the points that showed multi-directional changes. Examples:

(a) 19173_010201i? pt 2 (1.0 1.0 1.0 0.6 1.0)

(b) 27055_030203i? pt 2 (0.7 0.5 0.5 0.6 0.6)

3. USLE slope length - None of these points were flagged because of a change over time,

all were flagged because of the level. NRCS reviewers felt that any slope length greater

than 350 feet should be verified on site. It was also suggested that any points with a

change in slope length, regardless of level, should be reviewed.

4. USLE slope percent - Very few points were flagged because of slope percent. As with

slope length, any change in slope percent should be verified on site. The mean slope

percent of points for which the USLE is calculated is often quite small, so most outliers

detected would be because of unusually large levels. It would also be recommended to

check all slope percents = 0 since a very specific condition must be met in such cases.

5. USLE loss - It appears that the points flagged because of USLE loss (tons/acre) are the

result of corresponding suspicious values for some or all of the above factors and variables

that are used to make the loss calculation.

6. Land Cover/Use - Most of the points that were flagged as outliers because of land

cover/use were because of a change in the type of hayland or pastureland over time.

This is not a major concern for NRCS reviewers since it is very difficult to distinguish

the various types, and the types will be combined for analysis. Of more concern to the

reviewers would be certain multiple changes in the broad cover/use category over time.

5.5 Discussion

The described outlier identification procedure has proved successful in identifying various types of outliers and improving survey data quality. We have incorporated geographical knowledge and land classification in partitioning the survey population, and the complex interactions between variables enter through the variance-covariance matrix in defining distances. The use of the distance distribution as a reference takes the dispersion of the subpopulation distribution into consideration, but the use of a threshold value is sometimes criticized as arbitrary.

The analysis of outliers in the NRI application has provided feedback on the current outlier identification approach and suggested possible alternative directions:

1. Analyze the factor variables separately. In this specific application, the correlation between different USLE factor variables is weak, so we can analyze the factor variables one by one, or jointly but with a block diagonal variance-covariance matrix. The outliers in USLE loss are often caused by the factor variables, so we can conduct separate analyses on USLE loss and the factor variables, and then compare the results. More generally, this suggests examining univariate time series separately, which may not always be proper, but still provides useful insights into the data structure.

2. Assign more weight to changes over time. Most outliers identified from the USLE slope variables are flagged because of the level, but change is usually more interesting to subject matter specialists. So we can reparameterize slope and slope length, using the mean over the three years as well as the changes from one year to the next, and assign less weight to the mean when defining distances. More generally, this concerns the definition of anomaly in a specific application, and the proposed method should be tailored to reflect the subject matter concern.

3. Treat land use as an ordinal/categorical variable. Treating the land use variable as numerical is somewhat questionable, as revealed by the outliers identified on land use. We will modify the way the land use variable is treated and change the distance metric on this variable.

CHAPTER 6. EXTENSIONS TO NONDIFFERENTIABLE SURVEY

ESTIMATORS

In this chapter, we consider two classes of nondifferentiable survey estimators: statistics with estimated parameters and estimators defined by estimating equations. In either case, we cannot directly apply the usual linearization approach in the presence of nondifferentiability, but we facilitate our asymptotic expansions by assuming a differentiable limiting function. In section 6.3, we propose a novel variance estimator that uses kernel smoothing to estimate unknown variance components, and demonstrate its statistical properties.

6.1 Explicitly defined nondifferentiable survey estimators

To allow an easy generalization of our theoretical results, we do not assume a specific functional form for our estimator. Instead, we assume the population quantity takes the form of an order-1 U-statistic and the sample estimator is a Horvitz-Thompson estimator of the population quantity.

The sample estimator we are working with has the following form,
$$\hat{T}(\hat{\lambda}) = \frac{1}{N}\sum_{i \in s}\frac{1}{\pi_i}h(y_i; \hat{\lambda}), \qquad (6.1)$$
where $\hat{\lambda}$ is a sample-based estimator of some $q$-dimensional population quantity $\lambda_N$ and $h(y; \gamma): \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ is not necessarily a differentiable function of $\gamma$. We can define the following population quantity, with $\hat{\lambda}$ replaced by $\lambda_N$,
$$T_N(\lambda_N) = \frac{1}{N}\sum_{i=1}^{N}h(y_i; \lambda_N). \qquad (6.2)$$
We will use an arbitrary constant $\gamma$ in place of the $\lambda$ parameter to emphasize the role of $T_N(\cdot)$ and $\hat{T}(\cdot)$ as functions of the population quantity $\lambda_N$ or the sample estimator $\hat{\lambda}$. The case when $h(y; \gamma)$ is a smooth function of $\gamma$ is relatively easy to deal with, as we can apply Taylor linearization and carefully argue the ignorability of the remainder terms in the expansion. But if $h(y; \gamma)$ is a nondifferentiable function of $\gamma$, we cannot quantify the extra variation by a direct linearization, and further steps need to be taken to study the asymptotic properties of the estimator. Randles (1982) gave a general treatment of nondifferentiable estimators in a nonsurvey setting. But in the survey framework, if $h(y; \gamma)$ is a nonsmooth function of $\gamma$, the expectation of $\hat{T}(\gamma)$ under the design, namely $T_N(\gamma)$, remains a nonsmooth function of $\gamma$, so we need to modify the proof of Randles (1982) to extend the results to the survey context.

6.1.1 Assumptions

In this section, we provide sufficient conditions to establish the asymptotic properties of the nondifferentiable survey estimator (6.1). Assumptions 6.1.1 and 6.1.2 specify conditions on the estimated parameter $\hat{\lambda}$, and Assumption 6.1.3 assumes the existence of a smooth limit of $T_N(\gamma)$ with an appropriate number of derivatives. Assumption 6.1.4 specifies the order of magnitude of several population quantities. Assumptions 6.1.3 and 6.1.4 are shown to hold almost surely under a probabilistic model, and the proof is given in the Appendix.

We need the population parameter $\lambda_N$ to lie in a compact region as $N \to \infty$ and to be consistently estimated by $\hat{\lambda}$, with a linearization available for $\hat{\lambda}$.

Assumption 6.1.1. The population parameter of interest $\lambda_N \in C_\lambda$, where $C_\lambda$ is compact on $\mathbb{R}^q$.

Assumption 6.1.2. We need the following conditions on $\hat{\lambda}$ as an estimator of $\lambda_N$:

1. $\hat{\lambda}$ is $\sqrt{n^*}$-consistent for $\lambda_N$.

2. $\hat{\lambda}$ has the following linearization,
$$\hat{\lambda} = \lambda_N + \frac{1}{N}\sum_{i \in s}\frac{g(y_i)}{\pi_i} - \frac{1}{N}\sum_{i=1}^{N}g(y_i) + o_p(n^{*-1/2}),$$
where $g(y_i)$ has finite fourth population moments.

We also need a number of specific regularity conditions on the sequence of finite populations. Assumption 6.1.3 specifies a limiting smooth function for $T_N(\gamma)$, and Assumption 6.1.4 provides limits for a number of population quantities. Both assumptions are needed to allow an asymptotic expansion of the estimator $\hat{T}(\hat{\lambda})$.

Assumption 6.1.3. 1. The population-level function $T_N(\gamma)$ converges to a limiting smooth function $T(\gamma)$ uniformly in a neighborhood of $\lambda$, where $\lambda = \lim \lambda_N$.

2. The limiting function $T(\gamma)$ is uniformly continuous for $\gamma$ in a neighborhood of $\lambda$, say $C_\lambda$. Further, $T(\gamma)$ has finite first and second derivatives with respect to $\gamma$, denoted $\zeta(\gamma) = T'(\gamma)$ and $T''(\gamma)$.

Assumption 6.1.4. 1. The function $h(\cdot; \cdot)$ in (6.1) is bounded.

2. The population quantities satisfy
$$n^{*1/2}\left[T_N(\lambda_N + n^{*-1/2}s) - T_N(\lambda_N) - T(\lambda_N + n^{*-1/2}s) + T(\lambda_N)\right] \to 0 \qquad (6.3)$$
uniformly for $s \in C_s$, where $C_s$ is any compact set in $\mathbb{R}^q$.

3. For $\|s_N\| = O(N^{-a})$, $a \in (\frac{1}{2}, 1)$,
$$\frac{1}{N}\sup_{\gamma}\sum_{i=1}^{N}\left|h(y_i; \gamma + s_N) - h(y_i; \gamma)\right| = O(\|s_N\|).$$

The reasonableness of the population requirements in Assumptions 6.1.3 and 6.1.4, and in particular 6.1.4, is somewhat difficult to evaluate as stated. Therefore, in Appendix A.3, a superpopulation model version of Assumption 6.1.3 is stated, under which the $y_i$ are generated through a probabilistic mechanism. Based on that assumption, a number of model results can be shown to hold with probability one. In particular, we can show that Assumption 6.1.4(2) holds almost surely under the superpopulation model, as derived in Lemma A.3.1 in the Appendix. We here assume that the (fixed) population sequence from which we are sampling is such that these results hold, without the almost sure qualification.

6.1.2 Design-based results

The key intermediate result we need in this section is stated in Lemma 8, which allows using the limiting smooth function $T(\gamma)$ in place of the nonsmooth population quantity $T_N(\gamma)$ in asymptotic expansions. We then establish the asymptotic normality of estimator (6.1) in Theorem 10 and give the expression for its asymptotic variance.

Lemma 8. Under Assumptions 3.2.1(1), 3.2.2, 6.1.2(1) and 6.1.4,
$$n^{*1/2}\left[\hat{T}(\hat{\lambda}) - \hat{T}(\lambda_N) - T(\hat{\lambda}) + T(\lambda_N)\right] \xrightarrow{p} 0.$$

Proof. See Appendix. •

Theorem 10. Under Assumptions 3.2.1(1), 3.2.2-3.2.3 and 6.1.1-6.1.4, the sample estimator $\hat{T}(\hat{\lambda})$ is design consistent for $T_N(\lambda_N)$ and asymptotically normally distributed,
$$n^{*1/2}\left[V(\hat{T}(\hat{\lambda}))\right]^{-1/2}\left\{\hat{T}(\hat{\lambda}) - T_N(\lambda_N)\right\} \xrightarrow{\mathcal{L}} N(0, 1), \qquad (6.4)$$
where
$$V(\hat{T}(\hat{\lambda})) = \left(1, [\zeta(\lambda_N)]^T\right) V(z_n) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}, \qquad (6.5)$$
$\zeta(\cdot)$ is the derivative of $T(\cdot)$ as defined in Assumption 6.1.3(2), and
$$z_n = \frac{1}{N}\sum_{i \in s}\frac{1}{\pi_i}\begin{pmatrix} h(y_i; \lambda_N) \\ g(y_i) \end{pmatrix}. \qquad (6.6)$$

Proof. We have the following decomposition,
$$n^{*1/2}\left\{\hat{T}(\hat{\lambda}) - T_N(\lambda_N)\right\} = n^{*1/2}\left\{\hat{T}(\lambda_N) - T_N(\lambda_N)\right\} + n^{*1/2}\left\{T(\hat{\lambda}) - T(\lambda_N)\right\} + n^{*1/2}\left\{\hat{T}(\hat{\lambda}) - \hat{T}(\lambda_N) - T(\hat{\lambda}) + T(\lambda_N)\right\}, \qquad (6.7)$$
where the last term is stochastically small by Lemma 8. Then we have
$$n^{*1/2}\left\{\hat{T}(\hat{\lambda}) - T_N(\lambda_N)\right\} = n^{*1/2}\left\{\frac{1}{N}\sum_{i \in s}\frac{h(y_i; \lambda_N)}{\pi_i} - \frac{1}{N}\sum_{i=1}^{N}h(y_i; \lambda_N)\right\} + n^{*1/2}[\zeta(\lambda_N)]^T\left\{\frac{1}{N}\sum_{i \in s}\frac{g(y_i)}{\pi_i} - \frac{1}{N}\sum_{i=1}^{N}g(y_i)\right\} + o_p(1),$$
and we can use the fact that $h(y_i; \lambda_N)$ and $g(y_i)$ have finite fourth population moments, together with Assumption 3.2.3, to conclude the asymptotic normality of the sample estimator. $\Box$

Generally speaking, for nondifferentiable survey estimators with estimated parameters, we can first replace the estimated parameter $\hat{\lambda}$ with an arbitrary constant $\gamma$ in $C_\lambda$, then take the expectation with respect to the sampling design to obtain the population quantity $T_N(\gamma)$. The population quantity usually remains a nondifferentiable function of $\gamma$, but we can reasonably assume a differentiable limit for $T_N(\gamma)$ as in Assumption 6.1.3. The differentiable limit can then be used in asymptotic expansions and variance expressions.

This section furnishes a theoretical treatment of nondifferentiable survey estimators with estimated parameters, assuming that the estimator admits expression (6.1). In practice, many complex estimators in ongoing surveys cannot be written in the simple form of a survey-weighted order-1 U-statistic, but some are differentiable functions of estimators with expression (6.1). We can carry out the expansion (6.7) and modify our proof slightly to adapt to this situation. A specific example is that, in estimating a cumulative distribution function, we usually use the estimated population size $\hat{N}$ in the denominator, with a numerator that admits expression (6.1). Estimators of this type can be linearized and studied with the results presented in this section.

Corollary 1 shows the asymptotic normality of a differentiable function of $\hat{T}(\hat{\lambda})$.

Corollary 1. Under Assumptions 3.2.1(1), 3.2.2-3.2.3 and 6.1.1-6.1.4, for any function $f(\cdot)$ with continuous first derivative in a neighborhood of $T_N(\lambda_N)$,
$$n^{*1/2}\left[f'(T_N(\lambda_N))\, V(\hat{T}(\hat{\lambda}))\, f'(T_N(\lambda_N))\right]^{-1/2}\left(f(\hat{T}(\hat{\lambda})) - f(T_N(\lambda_N))\right) \xrightarrow{\mathcal{L}} N(0, 1). \qquad (6.8)$$

6.1.3 Discussion

In this section, we review specific examples of nondifferentiable estimators with estimated parameters in the survey literature. This class of estimators is frequently encountered in estimating distribution functions with auxiliary information and in estimating a fraction below or above an estimated quantity.

6.1.3.1 Estimating distribution functions using auxiliary information

There is extensive literature on estimating the distribution function of a target variable when auxiliary information is present (Dunstan and Chambers 1986, Rao et al. 1990, Chambers et al. 1992, Wang and Dorfman 1996). To incorporate auxiliary information in estimating a distribution function, we generally estimate some model or population parameter first and then use the model-predicted values of the target variable to construct a distribution function estimator. The sample distribution estimator is usually a nondifferentiable function of the estimated parameter, like the model-based estimator in Dunstan and Chambers (1986), or the ratio estimator $\hat{F}_r(t)$, difference estimator $\hat{F}_d(t)$ and Rao-Kovar-Mantel estimator $\hat{F}_{RKM}(t)$ in Rao et al. (1990). It was stated in Rao et al. (1990) that we can ignore the variation due to estimating parameters in the estimators $\hat{F}_r(t)$, $\hat{F}_d(t)$ and $\hat{F}_{RKM}(t)$, but no rigorous proof was presented. We will show that this is because the derivative $\zeta(\lambda_N)$ is either strictly zero or a smaller-order term.

We use the difference estimator $\hat{F}_d(t)$ as an example and work under the ratio model. The estimator is given by
$$\hat{F}_d(t) = \hat{F}_d(t; \hat{R}) = \frac{1}{N}\left[\sum_{i \in s}\frac{1(y_i \leq t)}{\pi_i} + \sum_{i=1}^{N}1(\hat{R}x_i \leq t) - \sum_{i \in s}\frac{1(\hat{R}x_i \leq t)}{\pi_i}\right].$$
We can replace $\hat{R}$ by an arbitrary constant $\gamma$ and take the expectation with respect to the design to get
$$T_N(\gamma) = \frac{1}{N}\left[\sum_{i=1}^{N}1(y_i \leq t) + \sum_{i=1}^{N}1(\gamma x_i \leq t) - \sum_{i=1}^{N}1(\gamma x_i \leq t)\right] = \frac{1}{N}\sum_{i=1}^{N}1(y_i \leq t),$$
which does not depend on the parameter $\gamma$. So the derivative of the limiting function with respect to $\gamma$ is zero, and the extra variance due to estimating the population parameter $R_N$ is negligible. Similarly, the extra variance is negligible for the ratio estimator and the RKM estimator. There is a

Similarly, the extra variance is negligible in ratio estimator and RKM estimator. There is a 84

slight difference between difference estimator and ratio estimator, in that the derivative C(AJV) is exactly zero for the difference estimator and a nonzero but negligible term for ratio estimator due to ratio bias. To summarize, for all three distribution function estimators, estimating the population parameter will not contribute to an increase in leading term in the asymptotic variance. A slight modification of the previous case would be to estimate the population distribution of residuals t{ = yi — RNXII defined as

TN(RN) = j?^2l(yi-RNxi

A sample-based estimator is

{yi-Rxi

where h(yi,Xi;j) = \Vi-1Xi

TN(I) = jf^2l(yi-ixi

Under a probabilistic framework assuming (yi,Xi) are realizations of iid random variables

(l7^, Xi), T/v(7) strongly converges to the distribution function of -^ pointwise and a uniform convergence result holds due to Glivenko-Cantelli Lemma. A rigorous proof of Assumption

6.1.4(2) in this case will not be given, but very similar to "projection pursuit " in P.834 of

Shorack and Wellner (1986) and Diaconis and Freedman (1984). In this case, the extra variance due to estimated parameter is not negligible in general.

6.1.3.2 Estimating a fraction below/above an estimated quantity

Another estimator we will discuss is an estimated fraction below or above an estimated level, which is commonly seen in social surveys. There is extensive literature on variance estimation for the proportion below an estimated level, as in Shao and Rao (1993), Binder and

Kovacevic (1995), Preston (1996), Deville (1999), Eurostat (2000). More specifically, we might 85 be interested in the population fraction below the mean or a fixed fraction of a quantile, but the population mean or quantile may be unknown. A specific example is to estimate the fraction of households in poverty when the poverty line is draw at 50% of the median income. In this case, our sample fraction with estimated median plugged in is a nondifferentiable function of the estimated parameter, and we can apply the previous results to this situation.

In the univariate case,

iesu u If we assume y^s are independent and identically distributed random variables with distribution function Fy('), the limit of TJv(7) equals Fy{^) almost surely. Denote the true population quantity as AJV? then the derivative in (6.5) becomes FY(\N), and then we can evaluate the asymptotic variance of T(A) given the asymptotic variance of A.
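For this univariate example, the quadratic form (6.5) can be written out explicitly. The following is a sketch under the assumptions just stated, obtained simply by expanding (6.5); the partition of $z_n$ is ours for illustration.

```latex
% Expanding (6.5) for the univariate fraction-below-\hat{\lambda} case,
% with z_n = (z_{n1}, z_{n2})', z_{n1} the HT mean of 1(y_i \le \lambda_N),
% z_{n2} the linearized variable of \hat{\lambda}, and
% \zeta(\lambda_N) = F_Y'(\lambda_N) = f_Y(\lambda_N):
V\bigl(\hat{T}(\hat{\lambda})\bigr)
  = V(z_{n1})
  + 2\, f_Y(\lambda_N)\,\operatorname{Cov}(z_{n1}, z_{n2})
  + f_Y(\lambda_N)^2\, V(z_{n2}).
```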

In the multivariate case, we discussed the distribution of distances to an estimated center in Chapter 3: we specify a radius $d$ and are interested in estimating the fraction of points within distance $d$ of some measure of center. For ease of presentation, we suppress $v$ in the subscript. Here,
$$\hat{T}(\hat{\lambda}) = \frac{1}{N}\sum_{i \in s}\frac{1(\|y_i - \hat{\lambda}\| \leq d)}{\pi_i},$$
and $T_N(\gamma)$ is a nondifferentiable function of the $p$-dimensional parameter $\gamma$, but we assume it has a differentiable limit. A detailed treatment of this case is given in Chapter 3.

To summarize, the extra variance due to estimating the parameter cannot be ignored in either the univariate or the multivariate case of estimating fractions below an estimated quantity, and a key component of the variance formula (6.5) is the derivative of the limiting function $T(\gamma)$.

6.1.3.3 Endogenous post-stratification estimator (EPSE)

The endogenous post-stratification estimator is a more complicated estimator with an estimated quantity, discussed in Breidt and Opsomer (2008). We apply the methodology explained in section 6.1.2 to the endogenous post-stratification estimator and examine it from the design-based perspective. The endogenous post-stratification estimator, as in (7) of Breidt and Opsomer (2008), is given by
$$\hat{\mu}(\hat{\lambda}) = \sum_{h=1}^{H}A_{Nh0}(\hat{\lambda})\,\frac{\hat{A}_{Nh1}(\hat{\lambda})}{\hat{A}_{Nh0}(\hat{\lambda})}, \qquad (6.9)$$
where $A_{Nh0}(\lambda) = \frac{1}{N}\sum_{i=1}^{N}1(\tau_{h-1} < m(x_i, \lambda) \leq \tau_h)$ and $\hat{A}_{Nhk}(\lambda) = \frac{1}{N}\sum_{i \in s}\frac{y_i^k}{\pi_i}1(\tau_{h-1} < m(x_i, \lambda) \leq \tau_h)$, $k = 0, 1$. We replace $\hat{\lambda}$ by $\gamma$ and get
$$\hat{T}(\gamma) = \hat{\mu}(\gamma) = \sum_{h=1}^{H}\hat{A}_{Nh1}(\gamma) + \sum_{h=1}^{H}\frac{A_{Nh0}(\gamma) - \hat{A}_{Nh0}(\gamma)}{\hat{A}_{Nh0}(\gamma)}\,\hat{A}_{Nh1}(\gamma),$$
where the second term is asymptotically negligible, assuming $H$ is finite and some regularity conditions. Then we can approximate the expectation of $\hat{T}(\gamma)$ by the expectation of the first term, after a careful treatment of the negligible term,
$$E\hat{T}(\gamma) \approx T_N(\gamma) = \frac{1}{N}\sum_{h=1}^{H}\sum_{i=1}^{N}y_i\, 1(\tau_{h-1} < m(x_i, \gamma) \leq \tau_h) = \frac{1}{N}\sum_{i=1}^{N}y_i,$$
which does not depend on $\gamma$, so the variance due to estimating $\gamma$ can be ignored in this case. As a further note, if instead of the whole population mean we estimate a domain mean, the extra variance should be carefully examined: $T_N(\gamma)$ may then be a non-trivial function of $\gamma$, since the indicators $1(\tau_{h-1} < m(x_i, \gamma) \leq \tau_h)$ no longer sum out over the post-strata within the domain.

Assumption 6.1.4(2) implies that
$$\hat{A}_{Nh0}(\hat{\lambda}) = \hat{A}_{Nh0}(\lambda_N) + a_{h0}(\hat{\lambda}) - a_{h0}(\lambda_N) + o_p(n^{*-1/2}), \qquad (6.10)$$
and the previous theory implies
$$\hat{A}_{Nh1}(\hat{\lambda}) = \hat{A}_{Nh1}(\lambda_N) + a_{h1}(\hat{\lambda}) - a_{h1}(\lambda_N) + o_p(n^{*-1/2}), \qquad (6.11)$$
where $a_{hk}(\cdot)$ denotes the corresponding smooth limit; then everything follows the proof of result 2 in Breidt and Opsomer (2008).

6.2 Nondifferentiable estimating equations

Assume that the population parameter $\xi_N$ is of interest, defined by
$$\xi_N = \inf\{\gamma : S_N(\gamma) \geq 0\}, \qquad (6.12)$$
where
$$S_N(\gamma) = \frac{1}{N}\sum_{i=1}^{N}\psi(y_i - \gamma) \qquad (6.13)$$
and $\psi(\cdot)$ is a univariate real function. The function $S_N(\gamma)$ in our discussion is assumed to be monotone; the assumption is elaborated in 6.2.2(1).

The population parameter $\xi_N$ is then estimated by $\hat{\xi}$, where
$$\hat{\xi} = \inf\{\gamma : \hat{S}(\gamma) \geq 0\}, \qquad (6.14)$$
with
$$\hat{S}(\gamma) = \frac{1}{N}\sum_{i \in s}\frac{\psi(y_i - \gamma)}{\pi_i} \qquad (6.15)$$
and $\pi_i$ the inclusion probability of the $i$th element.

6.2.1 Assumptions

We also need to assume a number of application-specific regularity conditions on the sequence of finite populations. Assumption 6.2.1 assumes that the population quantity $\xi_N$ lies in a closed interval on $\mathbb{R}$, and Assumption 6.2.2 specifies conditions on the monotonicity and smoothness of $S_N(\cdot)$ and its limit.

Assumption 6.2.1. The population parameter of interest $\xi_N \in C_\xi$, where $C_\xi$ is a closed interval on $\mathbb{R}$.

Assumption 6.2.2. The population score function $S_N(\cdot)$ and $\psi(\cdot)$ satisfy the following conditions:

1. $\lim S_N(\gamma) = S(\gamma)$ uniformly on $C_\xi$, and $S(\gamma)$ is strictly increasing with finite first derivative in a neighbourhood of $\xi_N$. Further, $S(\xi_N) = O(N^{-1})$.

2. The population variance of $\psi(y_i - \gamma)$ exists and remains bounded for all $\gamma \in \mathbb{R}$.

3. For any sequence $h_N = O(N^{-a})$, $a \in (\frac{1}{2}, 1)$,
$$\frac{1}{N}\sup_{\gamma}\sum_{i=1}^{N}\left|\psi(y_i - \gamma - h_N) - \psi(y_i - \gamma)\right| = O(|h_N|).$$

4. The following holds for $h_N = O(\frac{1}{\sqrt{n^*}})$,
$$n^{*1/2}\left[S_N(\xi_N + h_N) - S_N(\xi_N) - S(\xi_N + h_N) + S(\xi_N)\right] \to 0.$$

5. For $h \to 0^+$,
$$\frac{1}{N}\sum_{i=1}^{N}\left\{\psi(y_i - \gamma - h) - \psi(y_i - \gamma)\right\}^2 \to 0.$$

The uniform convergence in Assumption 6.2.2(1) can be shown by the Glivenko-Cantelli Lemma, and (2) is essentially a condition on the second moment of $\psi(y_i - \gamma)$. Assumption 6.2.2(3) resembles Assumption 6.1.4(3), and a special case of it is justified under a probabilistic mechanism in Appendix A.3. Assumption 6.2.2(4) resembles Assumption 6.1.4(2) and is justified in Lemma A.3.1 under a probability model. Assumption 6.2.2(5) is also a condition on the second moment of $\psi(y_i - \gamma)$, but stronger than Assumption 6.2.2(2).

6.2.2 Design-based results

The main results for estimating equations are presented in this section: Lemma 9 shows that $\hat{S}(\gamma)$ converges weakly to its population counterpart uniformly, Theorem 11 states the design consistency of the sample estimator $\hat{\xi}$, and Theorem 12 states the design-based asymptotic normality of the sample estimator. All proofs are deferred to the Appendix.

Lemma 9. Under Assumptions 3.2.1(2), 3.2.2(2) and 6.2.2(2-3), for any large enough closed interval $C \subset \mathbb{R}$,
$$\sup_{\gamma \in C}\left|\hat{S}(\gamma) - S_N(\gamma)\right| \xrightarrow{p} 0.$$

Theorem 11. Under Assumptions 3.2.1(2), 3.2.2, 6.2.1 and 6.2.2(1-3), the sample estimator $\hat{\xi}$ is design consistent for the population quantity $\xi_N$.

Theorem 12. Under Assumptions 3.2.1(2), 3.2.2-3.2.3, 6.2.1-6.2.2, the sample estimator $\hat{\xi}$ is asymptotically normally distributed, i.e.
$$n^{*1/2}\left[V(\hat{\xi})\right]^{-1/2}(\hat{\xi} - \xi_N) \xrightarrow{\mathcal{L}} N(0, 1),$$
where
$$V(\hat{\xi}) = \frac{V(\hat{S}(\xi_N))}{[S'(\xi_N)]^2}. \qquad (6.16)$$

6.2.3 Discussion

In this section, we give a few examples of nondifferentiable estimating equations and bring the different cases into the theoretical framework established in the preceding section.

The first example is the sample quantile defined by inverting the Hajek estimator of the cumulative distribution function. In this case, the estimating function for the $p$th quantile is
$$\psi(y_i - \gamma) = 1(y_i - \gamma \leq 0) - p,$$
and the population estimating equation is
$$S_N(\gamma) = \frac{1}{N}\sum_{i=1}^{N}1(y_i \leq \gamma) - p.$$
The sample quantile is defined as
$$\hat{\xi} = \inf\{\gamma : \hat{S}(\gamma) \geq 0\} = \inf\left\{\gamma : \frac{1}{\hat{N}}\sum_{i \in s}\frac{1(y_i \leq \gamma)}{\pi_i} \geq p\right\},$$
where $\hat{N} = \sum_{i \in s}\pi_i^{-1}$. The limiting function of $S_N(\gamma)$ is denoted $S(\gamma) = F(\gamma) - p$, where $F(\gamma)$ is the distribution function of the $y_i$ if we assume they are identically distributed and independent (or weakly dependent). So we can estimate the asymptotic variance of $\hat{\xi}$ by estimating the design variance $V(\hat{S}(\xi_N))$ and the derivative $F'(\xi_N)$.
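As a concrete illustration, here is a minimal numpy sketch of the design-weighted quantile obtained by inverting the Hajek distribution function estimator; the function name is ours.

```python
import numpy as np

def hajek_quantile(y, pi, p):
    """p-th quantile: the smallest observed y-value gamma with
    S_hat(gamma) >= 0, i.e. where the Hajek CDF estimate reaches p."""
    y, w = np.asarray(y), 1.0 / np.asarray(pi)
    order = np.argsort(y)
    F = np.cumsum(w[order]) / w.sum()       # Hajek CDF at sorted y values
    return y[order][np.searchsorted(F, p)]  # first index with F >= p
```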

Another example is the Winsorized mean introduced by Huber (1964), where the estimating function $\psi(\cdot)$ is defined as
$$\psi(y_i - \gamma) = \begin{cases} -k, & y_i - \gamma < -k, \\ y_i - \gamma, & |y_i - \gamma| \leq k, \\ k, & y_i - \gamma > k, \end{cases}$$
which can be written as $\psi(y_i - \gamma) = (y_i - \gamma)1(|y_i - \gamma| \leq k) + k\,1(y_i - \gamma > k) - k\,1(y_i - \gamma < -k)$. The population score function is
$$S_N(\gamma) = \frac{1}{N}\sum_{i=1}^{N}\left[(y_i - \gamma)1(|y_i - \gamma| \leq k) + k\,1(y_i - \gamma > k) - k\,1(y_i - \gamma < -k)\right],$$
and we assume $S_N(\gamma)$ converges to a limit function $S(\gamma)$ which is differentiable in a neighborhood of $\xi_N$, where $\xi_N$ is the population Winsorized mean as defined by (6.12). This population score function is nonincreasing, but we can use $-S_N(\gamma)$ and still define the parameter of interest as in (6.12). Then we can define the sample estimating equation and estimator, and show their asymptotic properties as before.
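The Winsorized score above is just a clipped residual; a one-line transcription (the function name is ours):

```python
import numpy as np

def psi_winsor(y, gamma, k):
    """Winsorized-mean score: the residual y - gamma clipped to [-k, k]."""
    return np.clip(y - gamma, -k, k)

# S_N(gamma) = mean(psi_winsor(y, gamma, k)) is nonincreasing in gamma,
# so -S_N(gamma) fits the increasing-score formulation of (6.12).
```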

6.3 Variance estimation

To estimate the design variance of $\hat{T}(\hat{\lambda})$ in (6.5) or of $\hat{\xi}$ in (6.16), we need to estimate the derivative $\zeta(\lambda_N)$ or $S'(\xi_N)$ of the unknown limiting function $T(\gamma)$ or $S(\gamma)$. We can start from some smoothed estimator of the primitive function $T(\gamma)$ or $S(\gamma)$, and take derivatives with respect to $\gamma$. Natural sample-based estimators are $\hat{T}(\gamma)$ and $\hat{S}(\gamma)$, respectively, but we cannot differentiate these two functions, so some smoothing technique is necessary. The two cases can be handled in the same fashion, so we only examine the first to avoid duplication. We propose an estimator for $\zeta(\lambda_N)$, show its consistency, and establish the design consistency of the overall variance estimator.

We let $K_q(\cdot)$ denote a kernel function on $\mathbb{R}^q$, and use the convolution of the nonsmooth function $h(y_i; \cdot)$ with the kernel $K_q(\cdot)$ at bandwidth $b$,
$$h_i * K_q(\gamma) = \frac{1}{b^q}\int \cdots \int h(y_i; x)\,K_q\left(\frac{\gamma - x}{b}\right)dx;$$
we can then estimate $T(\gamma)$ by
$$\tilde{T}(\gamma) = \frac{1}{Nb^q}\sum_{i \in s}\frac{1}{\pi_i}\int \cdots \int h(y_i; x)\,K_q\left(\frac{\gamma - x}{b}\right)dx. \qquad (6.17)$$
Taking the derivative of (6.17) with respect to $\gamma$ gives the estimator
$$\hat{\zeta}(\gamma) = \frac{1}{Nb^{q+1}}\sum_{i \in s}\frac{1}{\pi_i}\int \cdots \int h(y_i; x)\,K_q'\left(\frac{\gamma - x}{b}\right)dx, \qquad (6.18)$$
with population counterpart
$$\zeta_N(\gamma) = \frac{1}{Nb^{q+1}}\sum_{i=1}^{N}\int \cdots \int h(y_i; x)\,K_q'\left(\frac{\gamma - x}{b}\right)dx. \qquad (6.19)$$

We use $\|\cdot\|$ to denote the $L_2$ norm on $\mathbb{R}^q$ in assessing convergence, and make the following assumptions to show the consistency of $\hat{\zeta}(\hat{\lambda})$ for $\zeta(\lambda_N)$. Assumption 6.3.1 specifies conditions on the kernel function and bandwidth; we then assume the population quantity $\zeta_N(\lambda_N)$ converges to the target quantity $\zeta(\lambda_N)$ uniformly, as in Assumption 6.3.2.
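To see what (6.18) amounts to in a simple case, take $q = 1$ and $h(y; \gamma) = 1(y \leq \gamma)$: integrating the kernel derivative against the indicator collapses the estimator to a survey-weighted kernel density estimate at $\gamma$. A sketch of this special case follows; we use a Gaussian kernel for simplicity, although Assumption 6.3.1 below asks for compact support.

```python
import numpy as np

def zeta_hat(gamma, y, pi, b):
    """Estimator (6.18) specialized to h(y; gamma) = 1(y <= gamma), q = 1:
    a survey-weighted kernel density estimate of the derivative at gamma."""
    y, w = np.asarray(y), 1.0 / np.asarray(pi)
    u = (gamma - y) / b
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel
    N = w.sum()               # Hajek estimate of the population size
    return np.sum(w * K) / (N * b)
```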

Assumption 6.3.1. The following conditions hold for the kernel function $K_q(\cdot)$ and bandwidth $b$:

1. The kernel function $K_q(\cdot)$ has compact support $C_K$ and is absolutely continuous with finite derivative $K_q'(\cdot)$. Additionally, $\int \cdots \int K_q(x)dx = 1$.

2. The bandwidth $b \to 0$ and $Nb^q \to \infty$ as $N \to \infty$.

3. There exists a constant $c$ such that $\|h_i * K_q(x_1) - h_i * K_q(x_2)\| \leq c\|x_1 - x_2\|$ for any $x_1, x_2$, and $b$ arbitrarily small.

Assumption 6.3.2. The deviation $\|\zeta_N(\gamma) - \zeta(\gamma)\| \to 0$ uniformly for $\gamma \in C_\lambda$.

Given Assumptions 6.3.1 and 6.3.2, and some regularity conditions on the sampling design, we can show the consistency of the derivative estimator in Lemma 10 and of the whole variance estimator in Theorem 13.

Lemma 10. Under Assumptions 3.2.2, 6.1.1-6.1.2(1), 6.3.1-6.3.2, the estimator $\hat{\zeta}(\hat{\lambda})$ is design consistent for $\zeta(\lambda_N)$.

Theorem 13. Let $\hat{V}_{HT}(z_n)$ be the Horvitz-Thompson variance estimator for $z_n$ as defined in (6.6). Under Assumptions 3.2.2-3.2.3, 6.1.1-6.1.2, 6.1.4(1), 6.3.1-6.3.2, the estimator
$$\hat{V}(\hat{T}(\hat{\lambda})) = \left(1, [\hat{\zeta}(\hat{\lambda})]^T\right)\hat{V}_{HT}(z_n)\begin{pmatrix} 1 \\ \hat{\zeta}(\hat{\lambda}) \end{pmatrix}$$
is design consistent for $V(\hat{T}(\hat{\lambda}))$ as defined in (6.5).

Proof. The proof follows easily from Assumption 3.2.3, Lemma 10, and Slutsky's Theorem. $\Box$

CHAPTER 7. CONCLUSIONS

In this chapter, we emphasize our contributions to the existing literature, from both a theoretical and an applied point of view.

The proposed distance distribution and measure of outlyingness are special cases of the nondifferentiable survey estimators treated in Chapter 6, for which the standard tools of classical design-based asymptotic inference cannot be directly applied, and for which variance estimation is challenging because the extra variance due to estimated parameters cannot be ignored in general. We have introduced a number of methodological innovations to handle these complications.

First, we provide a method of proof for the design consistency and asymptotic normality of a non-smooth function of a survey estimator, under the assumption that the non-smooth function has a smooth limit. The method of proof is broadly similar to that of Randles (1982), who considered U-statistics with estimated parameters, and that of Breidt and Opsomer (2008).

Unlike in our situation, their results are derived in a model-based context.

Second, in order for our design-based results to hold, a number of regularity conditions are required to hold for the sequence of finite populations (among others, the smooth limit of the non-smooth function mentioned above). As a complement to stating those as population-level assumptions, we show that these regularity conditions hold with probability one under a set of model assumptions. In other words, we show that the assumed behavior of the finite population sequence is "reasonable" in the sense that deviations from that behavior only happen on a set of (model) probability 0 under stated model assumptions. This explicit connection between sufficient model assumptions and the finite population assumptions is to our knowledge not commonly used in design-based inference, but it provides a formal assessment of the generality of the population specifications, and hence of the subsequent design-based results. 94

Third, we use a novel combination of nonparametric regression and replication to estimate the design-based variance of the estimated subpopulation distance distribution functions. A kernel-based method is used to obtain an estimate of the derivative of the limiting smooth function mentioned above, and it is combined with a traditional design-based replication method such as the jackknife to estimate the variance. The method is straightforward to implement, because the replicates do not require recomputing the distances between the observations and the estimated center, and the nonparametric regression step does not have to be repeated across the replicates.
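As a concrete illustration of this recipe, the following Python sketch combines a delete-one jackknife with a kernel estimate of the derivative. The Gaussian kernel, the Euclidean norm, the Hajek-type weighting, and all function names are our own illustrative assumptions, not the dissertation's exact implementation.

import numpy as np

def distance_dist(dists, w, d):
    # Hajek-type estimator of the distance distribution at radius d:
    # weighted share of units whose distance to the center is <= d.
    return np.sum(w * (dists <= d)) / np.sum(w)

def jackknife_variance(y, w, d, h):
    # y: (n, p) sample observations; w: (n,) design weights (1/pi_i);
    # d: radius of interest; h: bandwidth for the derivative estimate.
    n = len(w)
    mu = np.average(y, axis=0, weights=w)          # estimated center
    dists = np.linalg.norm(y - mu, axis=1)         # computed once, reused
    u = (y - mu) / np.maximum(dists, 1e-12)[:, None]   # unit "score" vectors
    # Kernel-weighted average playing the role of the derivative estimate
    # of the smooth limit with respect to the center (the analogue of C-hat).
    k = np.exp(-0.5 * ((d - dists) / h) ** 2) / np.sqrt(2.0 * np.pi)
    C = (w[:, None] * k[:, None] * u).sum(axis=0) / (np.sum(w) * h)
    theta = np.empty(n)
    for j in range(n):                             # delete-one replicates
        keep = np.arange(n) != j
        muj = np.average(y[keep], axis=0, weights=w[keep])
        # Distances are NOT recomputed for the replicate center; the center
        # effect enters only through the fixed derivative estimate C.
        theta[j] = distance_dist(dists[keep], w[keep], d) + C @ (muj - mu)
    return (n - 1.0) / n * np.sum((theta - theta.mean()) ** 2)

Note that the distances and the derivative estimate C are computed once and reused across all replicates, which is exactly the computational saving described above.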

From an application point of view, we have made the following contributions to the literature on exploratory analysis.

First, we have introduced a measure of the dispersion of multivariate populations, namely the distribution of the distances to the population center, together with a design-consistent estimator of this measure. The measure describes multivariate populations and serves as a graphical tool for evaluating the subpopulations produced by an initial partitioning procedure.
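A minimal Python sketch of the estimator, and of the associated outlyingness measure discussed next, assuming Euclidean distances, Hajek weighting and our own function names:

import numpy as np

def distance_distribution(y, w, center, d_grid):
    # Weighted share of the subpopulation within radius d of the center,
    # evaluated on a grid of radii d_grid.
    dists = np.linalg.norm(y - center, axis=1)
    return np.array([np.sum(w * (dists <= d)) for d in d_grid]) / np.sum(w)

def outlyingness(z, y, w, center):
    # Estimated share of the subpopulation lying closer to the center than
    # the test point z does; values near 1 flag z as suspicious.
    dz = np.linalg.norm(np.asarray(z) - center)
    dists = np.linalg.norm(y - center, axis=1)
    return np.sum(w * (dists <= dz)) / np.sum(w)

# Toy usage under equal weights:
rng = np.random.default_rng(1)
y = rng.normal(size=(500, 2))
w = np.ones(500)
print(outlyingness([4.0, 4.0], y, w, np.average(y, axis=0, weights=w)))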

Second, the proposed measure of outlyingness quantifies how suspicious an observation is with respect to its assigned subpopulation, and we have demonstrated the use of this measure for outlier identification both in a simulation study and in a large-scale complex survey.

Third, we proposed a convenient correction to the widely used k-means clustering algorithm in Section 3.5.2. The correction is easy to implement in computer algorithms and is broadly similar to model-based cluster analysis under certain Gaussian mixture models (Banfield and Raftery (1993), Fraley and Raftery (1998), Fraley and Raftery (2000)). A sketch of one such adjusted assignment step is given below.
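The following Python sketch illustrates the flavor of such a correction by standardizing distances with a cluster-specific scale during assignment; this is our own illustrative variant, not necessarily the exact adjustment of Section 3.5.2, and the scale estimate and update order here are assumptions.

import numpy as np

def scale_adjusted_kmeans(y, k, n_iter=50, seed=0):
    # k-means variant in which assignments use distances standardized by a
    # cluster-specific scale, so large diffuse clusters are not unduly
    # penalized relative to tight ones (mimicking a Gaussian mixture with
    # cluster-specific spherical variances).
    rng = np.random.default_rng(seed)
    centers = y[rng.choice(len(y), size=k, replace=False)].astype(float)
    scales = np.ones(k)
    labels = np.zeros(len(y), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(y[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d / scales[None, :], axis=1)  # scaled distances
        for g in range(k):
            members = y[labels == g]
            if len(members) > 0:
                centers[g] = members.mean(axis=0)
                rms = np.sqrt(np.mean(np.sum((members - centers[g]) ** 2, axis=1)))
                scales[g] = max(rms, 1e-8)               # guard singletons
    return labels, centers, scales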

APPENDIX A. MODEL-BASED RESULTS

A.1 Model-based results for mean-based inference

In Section 3.2, we have assumed several regularity conditions on the sequence of finite populations to derive the design-based results. In this section, we provide sufficient conditions under a superpopulation model to make it possible to evaluate the "reasonableness" of these population-level regularity conditions. We will show that the assumptions made in Section 3.2 hold with probability one under the superpopulation model.

In the model-based context, we assume that each subpopulation is an independent and identically distributed sample from a superpopulation model with cumulative distribution function $F_g(y)$. Let $Y_g$ denote the random variable from model component $g$. We state a model version of Assumption 3.2.5 below, and show that the statements in Assumption 3.2.6 hold with probability one under that assumption. Proofs are given in Appendix B.

Assumption A.1.1. The distribution of $\|Y_g - \gamma\|$ can be written as
$$\mathrm{E}\, I_{(\|Y_g-\gamma\|\le d)} = \mathcal D_{g,d}(\gamma),$$
which is continuous in $d\in[0,\infty)$ and $\gamma\in\Re^p$. Additionally, the derivatives $\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial d}$, $\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma}$ and $\frac{\partial^2\mathcal D_{g,d}(\gamma)}{\partial\gamma\,\partial d}$ all exist on $(d,\gamma)\in[0,+\infty)\times\Re^p$.

Lemmas A.1.1 and A.1.2 provide almost sure results on finite populations under the model, corresponding to Assumptions 3.2.5 and 3.2.6 on the finite population in Section 3.2.

Lemma A.1.1. Under Assumption A.1.1,
$$\frac{1}{h_{N_\nu}}\left\{\frac{1}{N_\nu}\sum_{U_{\nu g}} I_{(d<\|y_i-\gamma\|\le d+h_{N_\nu})}\right\} \xrightarrow{a.s.} \frac{\partial\mathcal D_{g,d}(\gamma)}{\partial d}.$$

Lemma A.1.2. Under Assumption A.1.1, and assuming that $N_{\nu g}\,h_{N_{\nu g}}(\log N_{\nu g})^{-1} \to \infty$ as $N_{\nu g}\to\infty$, then
$$\sup\left|\frac{1}{N_{\nu g}\,h_{N_{\nu g}}}\sum_{U_{\nu g}}\left[I_{(\|y_i-\gamma-h_{N_{\nu g}}s\|\le d)} - I_{(\|y_i-\gamma\|\le d)}\right] - \frac{\mathcal D_{g,d}(\gamma+h_{N_{\nu g}}s) - \mathcal D_{g,d}(\gamma)}{h_{N_{\nu g}}}\right| \xrightarrow{a.s.} 0.$$

Corollary A.1.1. In Lemma A.1.2, we can replace $\gamma$ with a stochastically bounded sequence $\hat\gamma_{\nu g}$ and obtain
$$\frac{1}{N_{\nu g}\,h_{N_{\nu g}}}\sum_{U_{\nu g}}\left[I_{(\|y_i-\hat\gamma_{\nu g}-h_{N_{\nu g}}s\|\le d)} - I_{(\|y_i-\hat\gamma_{\nu g}\|\le d)}\right] - \frac{\mathcal D_{g,d}(\hat\gamma_{\nu g}+h_{N_{\nu g}}s) - \mathcal D_{g,d}(\hat\gamma_{\nu g})}{h_{N_{\nu g}}} \xrightarrow{a.s.} 0,$$
uniformly for $s\in C_s$.

In Lemma 7, we required statement (3.53) to hold for the population. We show here that, under some general model assumptions, that statement holds with model probability one. We will use results from the empirical process theory described in Pollard (1984) and previously used in Opsomer and Ruppert (1997). We define a set of functions $g_\gamma(\cdot)$ on $\Re^p$ as
$$g_\gamma(\cdot) = K\!\left(\frac{d-\|\cdot-\gamma\|}{h}\right)\psi_j(\cdot-\gamma), \qquad (A.1)$$
and let $\mathcal G_g = \{g_\gamma(\cdot) : \gamma\in\Re^p\}$ indicate the class of these functions. We also define the graph of $g_\gamma(\cdot)$ as the subset
$$\mathrm{gr}(g_\gamma) = \{(y,t)\,|\,0\le t\le g_\gamma(y),\ \text{or}\ g_\gamma(y)\le t\le 0\}$$
of $\Re^p\times\Re$.

We need the following assumptions to show the theoretical properties of $C_{\nu g,d}(\gamma)$. Assumption 3.2.9(1) gives smoothness and moment conditions on the kernel function $K(\cdot)$, and Assumption 3.2.9(2) puts some conditions on the bandwidth $h$. Below, Assumption A.1.2 strengthens Assumption A.1.1 with further requirements on $\mathcal D_{g,d}(\gamma)$. Lemma A.1.3 states that the graphs of the functions in $\mathcal G_g$ have polynomial discrimination.

Assumption A.1.2. Assume that the second derivative $\frac{\partial^2\mathcal D_{g,d}(\gamma)}{\partial\gamma^2}$ is a Lipschitz function of $\gamma$.

Assumption A.1.3. The first derivative of $\psi_j(\cdot-\gamma)$ with respect to $\gamma$ partitions $\Re^p$ into a finite number of subsets.

Lemma A.1.3. Under Assumptions 3.2.9(4) and A.1.3, the graphs of the functions in $\mathcal G_g$ have polynomial discrimination.

Lemma A.1.4. Under Assumptions 3.2.8(1), 3.2.9(1,2) and A.1.1-A.1.3, $C_{\nu g,d}(\gamma)$ as defined in (3.51) is weakly consistent for $-\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma}$, and further,
$$\sup_\gamma\left|C_{\nu,d}(\gamma) + \frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma}\right| \xrightarrow{a.s.} 0. \qquad (A.2)$$
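The following heuristic indicates why the kernel average targets this derivative (a sketch, assuming the Euclidean norm, writing $r(\gamma)=\|Y_g-\gamma\|$ with density $f_r$, and taking $\psi(y-\gamma) = \partial\|y-\gamma\|/\partial\gamma = -(y-\gamma)/\|y-\gamma\|$ as in Chapter 3):
$$\mathrm E\,\frac1h K\!\left(\frac{d-r(\gamma)}{h}\right)\psi(Y_g-\gamma)\ \xrightarrow{h\to0}\ f_r(d)\,\mathrm E\!\left[\psi(Y_g-\gamma)\mid r(\gamma)=d\right] = -\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma},$$
since $\mathcal D_{g,d}(\gamma) = P(r(\gamma)\le d)$ and differentiating under the integral gives $\partial\mathcal D_{g,d}(\gamma)/\partial\gamma = -f_r(d)\,\mathrm E[\partial r(\gamma)/\partial\gamma \mid r(\gamma)=d]$.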

A.2 Model-based results for median-based inference

We can define the superpopulation median as
$$q_g = \arg\inf_\gamma \mathrm E\,\|Y_g - \gamma\|, \qquad (A.3)$$
with corresponding estimating equation
$$\mathrm E\,\psi(Y_g - \gamma) = 0. \qquad (A.4)$$

Assumption A.2.1 requires that the multivariate distribution $F_g(y)$ does not put all of its probability mass on a line. Assumption A.2.2 guarantees that, when we take derivatives of $\mathrm E\|Y_g-\gamma\|$, we can interchange the expectation and the derivatives to obtain estimating equation (A.4).

Assumption A.2.1. The probability $\int\cdots\int_{L_{\gamma_1,\gamma_2}} dF_g(y) < 1$, $\forall\,\gamma_1,\gamma_2$, where $L_{\gamma_1,\gamma_2}$ is defined by (3.14).

The existence of the infimum of $\mathrm E\|Y_g-\gamma\|$ is trivial as long as we have a nontrivial norm and nondegenerate $F_g(y)$. Lemma A.2.1 shows the uniqueness of $q_g$ as a minimizer of $\mathrm E\|Y_g-\gamma\|$, and Lemma A.2.2 shows the limiting behavior of $q_\nu$.

Lemma A.2.1. Assumption A.2.1 together with (A.3) guarantees the uniqueness of $q_g$.

Lemma A.2.2. Assuming that $\int\cdots\int_{L_{\gamma_1,\gamma_2}} dF_g(y) = 0$ for any $\gamma_1,\gamma_2\in\Re^p$, the sequence of $q_\nu$, $\nu\ge3$, as defined by (3.7) is unique and converges to $q_g$ almost surely.

Assumption A.2.2. For any $\gamma\in\Re^p$, there exists a function $\phi(\cdot)$ such that $|\psi(y-\gamma)| \le \phi(y)$ and $\mathrm E\,\phi(Y_g) < \infty$.

We need to relate the generalized medians, defined by minimizing overall distance, to the roots of estimating equations. Since $\mathrm E\|Y_g-\gamma\|$ and $\frac{1}{N_\nu}\sum_{U_\nu}\|y_i-\gamma\|$ are both continuous and differentiable functions of $\gamma$, their minimizers $q_g$ and $q_\nu$ will solve the corresponding estimating equations (A.4) and (3.22) under mild assumptions. With Assumption A.2.2, we can interchange the expectation and the partial derivative to obtain (A.4). For estimating equation (3.22) at the finite population level, summation and differentiation can be interchanged as well.
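For the Euclidean norm, this interchange can be made explicit (a sketch; for a general norm the unit vector below is replaced by the gradient of the norm):
$$\frac{\partial}{\partial\gamma}\,\mathrm E\,\|Y_g-\gamma\| = \mathrm E\,\frac{\partial}{\partial\gamma}\|Y_g-\gamma\| = -\mathrm E\,\frac{Y_g-\gamma}{\|Y_g-\gamma\|},$$
so that (A.4) requires the expected unit vector pointing from $\gamma$ to $Y_g$ to vanish, i.e. $\psi(y-\gamma) \propto (y-\gamma)/\|y-\gamma\|$, which is the spatial-median score; the finite population analogue (3.22) follows from the same computation with the sum in place of the expectation.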

The population median $q_\nu$ is asymptotically normally distributed under weak conditions, but the details will not be covered here.

In sections 6.1.1 and 6.2.1, we assumed some regularity conditions on the sequence of fixed finite populations. In sections A.3 and A.4, we will provide sufficient conditions under a superpopulation model to evaluate the "reasonableness" of these population-level regularity conditions.

A.3 Model-based results for nondifferentiable survey estimators

In the model-based context, we assume each finite population is an independent and identically distributed sample from a superpopulation model with cumulative distribution function $F(y)$. We state a model version of Assumption 6.1.3 below and assume some regularity conditions on the function $h(\cdot;\cdot)$; we then show that the statements in 3.2.6(1,2) hold with probability one under the superpopulation model. Detailed proofs are presented in Appendix B.

Assumption A.3.1. Let $Y$ be a random vector with absolutely continuous cumulative distribution function $F(y)$, and denote $T(\gamma) = \mathrm E\,h(Y;\gamma)$. Further, $T(\gamma)$ is a continuous function of $\gamma$, with finite first derivative $\frac{\partial T}{\partial\gamma}$ and second derivative $\frac{\partial^2 T}{\partial\gamma^2}$.

Assumption A.3.2. The number of monotonicity changes in the set of functions $\{h(y;\gamma) : \gamma\in\Re^q\}$, as functions of $y$, is uniformly bounded for $\gamma\in\Re^q$.

Lemma A.3.1. Under Assumptions 3.2.6(1), A.3.1 and A.3.2, the population quantity
$$n^{*1/2}\left[T_N(\lambda_N + n^{*-1/2}s) - T_N(\lambda_N) - T(\lambda_N + n^{*-1/2}s) + T(\lambda_N)\right] \xrightarrow{a.s.} 0,$$
uniformly for $s\in C_s$, a large enough compact set in $\Re^q$.

Lemma A.3.2. Let $B = \{(-\infty,x] : x\in\Re^p\}$, and $\|s_N\| = O(N^{-a})$, $a\in(\frac12,1)$. Under Assumption A.3.1, the population quantity
$$\sup_{\gamma\in C_T}\frac{1}{N}\sum_{i=1}^N\left|I_{(y_i-\gamma-s_N\in B)} - I_{(y_i-\gamma\in B)}\right| = O(\|s_N\|),\ a.s., \qquad (A.5)$$
where $C_T$ is a large enough compact region in $\Re^p$.

Assumption 3.2.6(2) is stated in terms of a general function $h(y;\gamma)$, and Lemma A.3.2 only covers the case where
$$h(y;\gamma) = I_{(y-\gamma\in B)},$$
where $B = \{(-\infty,x] : x\in\Re^p\}$, but it can easily be generalized to cases where $y$ and $\gamma$ are additive. Let $h(y;\gamma) = h_0(y-\gamma)$; then we can approximate a broad class of functions $h_0(y-\gamma)$ by a finite collection of functions of the form $I_{(y-\gamma\in B)}$ to obtain results similar to (A.5).

A.4 Model-based results for nondifferentiable estimating equations

In this section, we will provide justifications for some of the assumptions in section 6.2.1 under a probabilistic model. We assume the population characteristics $y_i$ are independent and identically distributed realizations of a random variable $Y$.

Assumption A.4.1. Let $Y$ be a random variable with absolutely continuous cumulative distribution function, and denote $S(\gamma) = \mathrm E[\psi(Y-\gamma)]$. The score function $S(\gamma)$ is strictly increasing with finite first derivative. Further, the function $\psi(\cdot)$ is bounded.

Assumption A.4.2. The function $\psi(\cdot)$ is nonincreasing with a finite number of monotonicity changes.

Lemma A.4.1. Under Assumptions A.4.1 and A.4.2, the population quantity
$$n^{*1/2}\left[S_N(\xi_N + n^{*-1/2}s) - S_N(\xi_N) - S(\xi_N + n^{*-1/2}s) + S(\xi_N)\right] \xrightarrow{a.s.} 0,$$
uniformly for $s\in C_s$, a large enough closed interval in $\Re$.

APPENDIX B. PROOFS

B.1 Appendix: Proofs

Proof of Lemma 1

Proof. Let us define $D_{\nu,d}(\hat\mu_\nu)$ similar to (3.4); then
$$n^{*1/2}\left[\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - \mathcal D_{g,d}(\hat\mu_\nu) + \mathcal D_{g,d}(\mu_\nu)\right]$$
$$= n^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\hat\mu_\nu) + D_{\nu,d}(\mu_\nu)\right) + n^{*1/2}\left(D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu) - \mathcal D_{g,d}(\hat\mu_\nu) + \mathcal D_{g,d}(\mu_\nu)\right)$$
$$= n^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\hat\mu_\nu) + D_{\nu,d}(\mu_\nu)\right) + R_\nu,$$
where
$$R_\nu = \left(\mathcal D_{g,d}(\hat\mu_\nu) - \mathcal D_{g,d}(\mu_\nu)\right)O_p(1).$$

The expression $\left|\mathcal D_{g,d}(\hat\mu_\nu) - \mathcal D_{g,d}(\mu_\nu)\right|$ converges weakly to 0, as
$$P\left(\left|\mathcal D_{g,d}(\hat\mu_\nu) - \mathcal D_{g,d}(\mu_\nu)\right| > \epsilon\right) \le P\left(\|\hat\mu_\nu - \mu_\nu\| > n^{*-1/2}M\right) + P\left(\frac{1}{N_\nu}\sum_{U_\nu} I_{\left(d - n^{*-1/2}M < \|y_i-\mu_\nu\| \le d + n^{*-1/2}M\right)} > \epsilon\right) \to 0,$$
by Assumption 3.2.6, where $M$ is a sufficiently large constant.

Now it suffices to show that
$$n^{*1/2}\left(\hat D_{\nu,d}(\hat\mu_\nu) - \hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\hat\mu_\nu) + D_{\nu,d}(\mu_\nu)\right) \xrightarrow{p} 0.$$
Define
$$Q_n(s) = n^{*1/2}\left(\hat D_{\nu,d}(\mu_\nu + n^{*-1/2}s) - \hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu + n^{*-1/2}s) + D_{\nu,d}(\mu_\nu)\right).$$
As $\hat\mu_\nu = \mu_\nu + O_p(1/\sqrt{n^*})$ following Assumption 3.2.2, it suffices to prove that
$$\sup_{s\in C}|Q_n(s)| \xrightarrow{p} 0,$$
for some compact set $C$. As
$$|Q_n(s)| \le n^{*1/2}\left|\hat D_{\nu,d}(\mu_\nu + n^{*-1/2}s) - \hat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu + n^{*-1/2}s) + D_{\nu,d}(\mu_\nu)\right| + n^{*1/2}\left|D_{\nu,d}(\mu_\nu + n^{*-1/2}s) - D_{\nu,d}(\mu_\nu) - \mathcal D_{g,d}(\mu_\nu + n^{*-1/2}s) + \mathcal D_{g,d}(\mu_\nu)\right|,$$
and the second term converges to zero uniformly for $s\in C_s$ by Assumption 3.2.6(2), it remains to show that
$$\sup_{s\in C}\left|\frac{1}{N_\nu}\sum_{U_\nu}\left(\frac{I_i}{\pi_i} - 1\right)a_i(s)\right| \xrightarrow{p} 0, \qquad (B.1)$$
where $a_i(s) = n^{*1/2}\left[I_{(\|y_i-\mu_\nu-n^{*-1/2}s\|\le d)} - I_{(\|y_i-\mu_\nu\|\le d)}\right]$.

For any $1-\beta < \xi < \frac{1}{2p}$, where $\beta$ is defined in Assumption 3.2.2(1), use the following partition of $C$:
$$C = C_1\cup C_2\cup\cdots\cup C_{N_\nu^{\xi p}}, \quad C_j\cap C_{j'} = \emptyset,\ \forall\,j\ne j',$$
where $\mathrm{Diam}(C_j) = O(N_\nu^{-\xi})$, $\forall\,j = 1,2,\ldots,N_\nu^{\xi p}$.

Select a set of $s_j\in C_j$, $j = 1,2,\ldots,N_\nu^{\xi p}$. A union bound over the $N_\nu^{\xi p}$ cells, combined with the design-variance bound of Assumption 3.2.2(3) and the fact that $a_i(s_j)$ is nonzero only when $\|y_i-\mu_\nu\|$ lies within an $O(n^{*-1/2})$ band around $d$, yields
$$P\left(\max_j\left|\frac{1}{N_\nu}\sum_{U_\nu}\left(\frac{I_i}{\pi_i}-1\right)a_i(s_j)\right| > \epsilon\right) \to 0.$$
Additionally,
$$\max_j\sup_{s\in C_j}\left|\frac{1}{N_\nu}\sum_{U_\nu}\left(\frac{I_i}{\pi_i}-1\right)\left(a_i(s) - a_i(s_j)\right)\right| \to 0,$$
uniformly for all $j = 1,2,\ldots,N_\nu^{\xi p}$, as a result of Assumption 3.2.6(1) and the small diameter of the $C_j$. Hence, the proof of (B.1) is completed by bounding the LHS of (B.1) by
$$\max_j\left|\frac{1}{N_\nu}\sum_{U_\nu}\left(\frac{I_i}{\pi_i}-1\right)a_i(s_j)\right| + \max_j\sup_{s\in C_j}\left|\frac{1}{N_\nu}\sum_{U_\nu}\left(\frac{I_i}{\pi_i}-1\right)\left(a_i(s) - a_i(s_j)\right)\right|. \qquad \square$$

Proof of Lemma 2

Proof. Suppose $\gamma_1$ and $\gamma_2$ are two distinct local minimizers of $\sum_{U_\nu}\|y_i-\gamma\|$. For any $\lambda\in(0,1)$, from the triangle inequality, we have
$$\sum_{U_\nu}\|y_i - \lambda\gamma_1 - (1-\lambda)\gamma_2\| \le \lambda\sum_{U_\nu}\|y_i-\gamma_1\| + (1-\lambda)\sum_{U_\nu}\|y_i-\gamma_2\|.$$
For any $y_i\in\Re^p\setminus L_{\gamma_1,\gamma_2}$,
$$\|y_i - \lambda\gamma_1 - (1-\lambda)\gamma_2\| < \lambda\|y_i-\gamma_1\| + (1-\lambda)\|y_i-\gamma_2\|.$$
So
$$\sum_{U_\nu}\|y_i - \lambda\gamma_1 - (1-\lambda)\gamma_2\| < \max\left\{\sum_{U_\nu}\|y_i-\gamma_1\|,\ \sum_{U_\nu}\|y_i-\gamma_2\|\right\}.$$
If we let $\lambda$ approach 0 or 1, then $\lambda\gamma_1 + (1-\lambda)\gamma_2$ can be made arbitrarily close to $\gamma_2$ or $\gamma_1$, contradicting the fact that both $\gamma_1$ and $\gamma_2$ are local minimizers. □

Proof of Lemma 4

Proof. Let $Y_g = \mu_g + R_gA_gU^{(p)}$, where $A_gA_g' = \Lambda_g$, and the distribution of the univariate random variable $R_g$ is $H(\cdot)$ with density $h(\cdot)$. The random vector $U^{(p)}$ is uniformly distributed on the unit hypersphere in $\Re^p$. For any $\gamma_g\in K(\mu_g)$, where $K(\mu_g)$ denotes a small neighborhood of $\mu_g$,
$$\|Y_g-\gamma_g\|_B^2 = (Y_g-\mu_g+\mu_g-\gamma_g)'B(Y_g-\mu_g+\mu_g-\gamma_g) = (U^{(p)})'A_gBA_g'U^{(p)}R_g^2 + 2(\mu_g-\gamma_g)'BA_g'U^{(p)}R_g + (\mu_g-\gamma_g)'B(\mu_g-\gamma_g) = a_1R_g^2 + a_2R_g + a_3,$$
where $a_1 = (U^{(p)})'A_gBA_g'U^{(p)}$, $a_2 = 2(\mu_g-\gamma_g)'BA_g'U^{(p)}$ and $a_3 = (\mu_g-\gamma_g)'B(\mu_g-\gamma_g)$.

The case when $A_gBA_g'$ is a zero matrix is trivial. If $A_gBA_g'$ is not a zero matrix, then $P\left((U^{(p)})'A_gBA_g'U^{(p)} = 0\right) = 0$, so
$$P\left(\|Y_g-\gamma_g\|_B^2 \le d^2\right) = P\left(a_1R_g^2 + a_2R_g + (a_3 - d^2) \le 0\right) = \mathrm E_{U^{(p)}}\left\{H\!\left(R_g^{(2)}(\gamma_g)\right) - H\!\left(R_g^{(1)}(\gamma_g)\right)\right\},$$
where $R_g^{(1)}(\gamma_g)$ and $R_g^{(2)}(\gamma_g)$ are the two distinct solutions to the following quadratic equation, whose discriminant is positive for $\gamma_g\in K(\mu_g)$:
$$(U^{(p)})'A_gBA_g'U^{(p)}R_g^2 + 2(\mu_g-\gamma_g)'BA_g'U^{(p)}R_g + \left((\mu_g-\gamma_g)'B(\mu_g-\gamma_g) - d^2\right) = 0. \qquad (B.2)$$

We can solve (B.2) for $R_g^{(i)}(\gamma_g)$:
$$R_g^{(i)}(\gamma_g) = \frac{(\gamma_g-\mu_g)'BA_g'U^{(p)} \pm \sqrt{(\mu_g-\gamma_g)'\left\{BA_g'U^{(p)}(U^{(p)})'A_gB' - a_1B\right\}(\mu_g-\gamma_g) + a_1d^2}}{a_1},\quad i = 1,2.$$

As $U^{(p)}$ is bounded with probability 1, $R_g^{(1)}(\gamma_g)$ is negative with probability 1 if $K(\mu_g)$ is small enough. So we can obtain
$$\frac{\partial}{\partial\gamma_g}\,\mathrm E_{U^{(p)}}\left\{H\!\left(R_g^{(2)}(\gamma_g)\right) - H\!\left(R_g^{(1)}(\gamma_g)\right)\right\} = \frac{\partial}{\partial\gamma_g}\,\mathrm E_{U^{(p)}}\left\{H\!\left(R_g^{(2)}(\gamma_g)\right)\right\} = \mathrm E_{U^{(p)}}\left\{h\!\left(R_g^{(2)}(\gamma_g)\right)\frac{\partial R_g^{(2)}(\gamma_g)}{\partial\gamma_g}\right\}.$$

We evaluate the expression above at $\mu_g$ and obtain
$$\frac{\partial\mathcal D_{g,d}(\mu_g)}{\partial\gamma} = \mathrm E_{U^{(p)}}\left\{h\!\left(\frac{d}{\sqrt{a_1}}\right)\frac{BA_g'U^{(p)}}{a_1}\right\} = 0,$$
since $a_1 = (U^{(p)})'A_gBA_g'U^{(p)}$ is even in $U^{(p)}$ while $BA_g'U^{(p)}$ is odd, and the distribution of $U^{(p)}$ is symmetric about 0 with a compact support.

It is easy to show that $\mu_g$ uniquely minimizes (A.3); thus it coincides with the median. □

Proof of Lemma 6

Proof. Assumption 3.2.8(1) implies that the $\psi(y_i-\gamma)$ are uniformly bounded. Assumption 3.2.9(1) on the kernel function $K(t)$ implies that $t^2K(t)\to0$ as $t\to\infty$, so the $\frac1h K\!\left(\frac{d-\|y_i-\gamma\|}{h}\right)$ are uniformly bounded for large enough $N_\nu$. We define
$$\tilde C_{\nu,d}(\gamma) = \frac{1}{N_\nu h}\sum_{i\in s_\nu}\frac{1}{\pi_i}K\!\left(\frac{d-\|y_i-\gamma\|}{h}\right)\psi(y_i-\gamma),$$
which is design unbiased for $C_{\nu,d}(\gamma)$, and by Assumption 3.2.2(3),
$$\mathrm{Var}\left(\tilde C_{\nu,d}(\gamma)\,\middle|\,\mathcal F_\nu\right) \le \frac{K_1}{n^*N_\nu^2}\sum_{U_\nu}\sum_{U_\nu}\left|\frac1h K\!\left(\frac{d-\|y_i-\gamma\|}{h}\right)\right|\,\left|\frac1h K\!\left(\frac{d-\|y_j-\gamma\|}{h}\right)\right|\,\left\|\psi(y_i-\gamma)\right\|\,\left\|\psi(y_j-\gamma)\right\| \to 0,$$
assuming that $\psi(y_i-\gamma)$ has a bounded second moment at the population level. Then $\tilde C_{\nu,d}(\gamma)$ is both weakly and MSE consistent for the population target, and thus $\hat C_{\nu,d}(\gamma)$ is design consistent by Slutsky's Theorem. □

Proof of Lemma 7

Proof. It is easy to show that
$$\left|\hat C_{\nu,d}(\hat\gamma_\nu) - C_d(\gamma_\nu)\right| \le \left|\hat C_{\nu,d}(\hat\gamma_\nu) - \hat C_{\nu,d}(\gamma_\nu)\right| + \left|\hat C_{\nu,d}(\gamma_\nu) - C_{\nu,d}(\gamma_\nu)\right| + \left|C_{\nu,d}(\gamma_\nu) - C_d(\gamma_\nu)\right|,$$
and
$$\left|\hat C_{\nu,d}(\hat\gamma_\nu) - \hat C_{\nu,d}(\gamma_\nu)\right| \le \frac{1}{\hat N_\nu h}\left|\sum_{i\in s_\nu}\frac{1}{\pi_i}\psi_j(y_i-\hat\gamma_\nu)\left[K\!\left(\frac{d-\|y_i-\hat\gamma_\nu\|}{h}\right) - K\!\left(\frac{d-\|y_i-\gamma_\nu\|}{h}\right)\right]\right| + \frac{1}{\hat N_\nu h}\left|\sum_{i\in s_\nu}\frac{1}{\pi_i}\left(\psi_j(y_i-\hat\gamma_\nu) - \psi_j(y_i-\gamma_\nu)\right)K\!\left(\frac{d-\|y_i-\gamma_\nu\|}{h}\right)\right|.$$
It suffices to show that both terms on the right-hand side are stochastically small. For the first,
$$\frac{1}{\hat N_\nu h}\sum_{i\in s_\nu}\frac{1}{\pi_i}\psi_j(y_i-\hat\gamma_\nu)\left[K\!\left(\frac{d-\|y_i-\hat\gamma_\nu\|}{h}\right) - K\!\left(\frac{d-\|y_i-\gamma_\nu\|}{h}\right)\right] = \frac{1}{\hat N_\nu h^2}\sum_{i\in s_\nu}\frac{1}{\pi_i}\psi_j(y_i-\gamma_\nu)K'\!\left(\frac{d-\|y_i-\gamma_\nu\|}{h}\right)\frac{\partial\|y_i-\gamma\|}{\partial\gamma}\bigg|_{\gamma_\nu}'\,(\hat\gamma_\nu-\gamma_\nu) + \text{smaller order terms} = O_p\!\left(n^{*-1/2}\right)O_p(1) = o_p(1),$$
as the leading term has bounded mean and variance given Assumptions 2 and 3.

It is similar to show that $\frac{1}{\hat N_\nu h}\sum_{i\in s_\nu}\frac{1}{\pi_i}\left(\psi_j(y_i-\hat\gamma_\nu) - \psi_j(y_i-\gamma_\nu)\right)K\!\left(\frac{d-\|y_i-\gamma_\nu\|}{h}\right)$ is stochastically small; thus the design consistency of the kernel estimator follows. □

Proof of Lemma A.1.1

Proof. Here,
$$\mathrm{LHS} = \frac{1}{N_\nu h_{N_\nu}}\sum_{U_\nu}\left\{I_{(d<\|y_i-\gamma\|\le d+h_{N_\nu})} - P\left(d<\|y_i-\gamma\|\le d+h_{N_\nu}\right)\right\} + \frac{1}{h_{N_\nu}}P\left(d<\|y_i-\gamma\|\le d+h_{N_\nu}\right).$$
Using Assumption A.1.1 and the mean value theorem,
$$\frac{1}{h_{N_\nu}}P\left(d<\|y_i-\gamma\|\le d+h_{N_\nu}\right) = \frac{\partial\mathcal D_{g,d^*}(\gamma)}{\partial d} \to \frac{\partial\mathcal D_{g,d}(\gamma)}{\partial d},$$
where $d^*\in[d,\,d+h_{N_\nu}]$.

Define $T_{1,N_\nu} = \frac{1}{N_\nu h_{N_\nu}}\sum_{i=1}^{N_\nu}X_i$, where
$$X_i = I_{(d<\|y_i-\gamma\|\le d+h_{N_\nu})} - P\left(d<\|y_i-\gamma\|\le d+h_{N_\nu}\right).$$
It is easy to show that $\mathrm EX_i = 0$, and $\mathrm E|X_i|^k = O(h_{N_\nu})$ for $k\ge2$. Thus, in the multinomial expansion of $\mathrm E\left(\sum_iX_i\right)^{2r}$,
$$\mathrm EX_{i_1}^{k_1}\,\mathrm EX_{i_2}^{k_2}\cdots\mathrm EX_{i_l}^{k_l} = 0,\ \text{if any}\ k_j = 1,\ j = 1,2,\ldots,l,$$
and for all $k_j\ge2$ with $\sum_jk_j = 2r$, the number of contributing index combinations is $O(N_\nu^r)$, each of order $O(h_{N_\nu}^r)$, so
$$\mathrm E\left(\sum_{i=1}^{N_\nu}X_i\right)^{2r} = O\left(N_\nu^rh_{N_\nu}^r\right).$$
Now we can apply Markov's Inequality to obtain
$$P\left(|T_{1,N_\nu}| > \epsilon\right) \le \frac{\mathrm E\left(\sum_iX_i\right)^{2r}}{\left(N_\nu h_{N_\nu}\epsilon\right)^{2r}} = O\left(\left(N_\nu h_{N_\nu}\right)^{-r}\epsilon^{-2r}\right).$$
Choosing $r$ large enough, we can show that $\sum_{N_\nu=1}^\infty P\left(|T_{1,N_\nu}| > \epsilon\right) < \infty$, and use the Borel-Cantelli Lemma to obtain strong consistency. □

Proof of Lemma A.1.2

Proof. Define
$$X_i = I_{(\|y_i-\gamma-h_{N_\nu}s\|\le d)} - I_{(\|y_i-\gamma\|\le d)} - P\left(\|y_i-\gamma-h_{N_\nu}s\|\le d\right) + P\left(\|y_i-\gamma\|\le d\right)$$
$$= \left[I_{(\|y_i-\gamma-h_{N_\nu}s\|\le d,\,\|y_i-\gamma\|>d)} - P\left(\|y_i-\gamma-h_{N_\nu}s\|\le d,\,\|y_i-\gamma\|>d\right)\right] - \left[I_{(\|y_i-\gamma-h_{N_\nu}s\|>d,\,\|y_i-\gamma\|\le d)} - P\left(\|y_i-\gamma-h_{N_\nu}s\|>d,\,\|y_i-\gamma\|\le d\right)\right] = X_{1i} - X_{2i},$$
and let $T_{1,N_\nu} = \frac{1}{N_\nu h_{N_\nu}}\sum_{i=1}^{N_\nu}X_{1i}$. Now, let us show that $\sup_{(\gamma,s)\in\Re^p\times C_s}|T_{1,N_\nu}| \to 0$. It is easy to establish that $\mathrm EX_{1i} = 0$ and
$$\mathrm E\left(I_{(\|y_i-\gamma-h_{N_\nu}s\|\le d,\,\|y_i-\gamma\|>d)}\right) = P\left(\|y_i-\gamma-h_{N_\nu}s\|\le d,\,\|y_i-\gamma\|>d\right) \le P\left(d-\|h_{N_\nu}s\| < \|y_i-\gamma\| \le d+\|h_{N_\nu}s\|\right) = O(h_{N_\nu}).$$
Without loss of generality, we assume $\mathrm E\left(I_{(\|y_i-\gamma-h_{N_\nu}s\|\le d,\,\|y_i-\gamma\|>d)}\right) \le h_{N_\nu}$; otherwise we can rescale $X_{1i}$ to make sure its expectation is smaller than $h_{N_\nu}$.

Define
$$g_{\gamma,s}(y) = I_{(\|y-\gamma-h_{N_\nu}s\|\le d,\,\|y-\gamma\|>d)},$$
with $|g_{\gamma,s}(y)|\le1$ and $\mathrm E\,g_{\gamma,s}(Y_i) \le h_{N_\nu}$. Now we define the graph of $g_{\gamma,s}(y)$ as
$$\mathrm{gr}(g_{\gamma,s}) = \{(y,t)\,|\,0\le t\le g_{\gamma,s}(y)\} = \{(y,t)\,|\,t\in[0,1],\,\|y-\gamma\|>d\}\cap\{(y,t)\,|\,t\in[0,1],\,\|y-\gamma-h_{N_\nu}s\|\le d\}.$$
Both of the two sets of graphs are translation families of $\{(y,t)\,|\,0\le t\le I_{(\|y-\gamma\|>d)}\}$, which has polynomial discrimination in $\Re^p$ by Lemma B.1 (ii) of Opsomer and Ruppert (1997). Lemma A.4 (ii) of the same paper states that the intersection of the two classes also has polynomial discrimination.

Now, everything is set up for applying Theorem II.37 of Pollard (1984). Let $n = N_\nu$, $\alpha_n = 1$, $\delta_n^2 = h_{N_\nu}$; then we can conclude that
$$\frac{1}{N_\nu h_{N_\nu}}\sum_{i=1}^{N_\nu}X_{1i}$$
converges to 0 almost surely, uniformly for $(\gamma,s)\in\Re^p\times C_s$. The term involving the $X_{2i}$ is handled identically. □

Proof of Lemma A.1.3

Proof. The proof readily follows from Lemma B.1 (ii) in Opsomer and Ruppert (1997) by treating $K\!\left(\frac{d-\|\cdot-\gamma\|}{h}\right)\psi_j(\cdot-\gamma)$ as a translation family of functions. □

Proof of Lemma A.1.4

Proof. We can re-write $C_{\nu,d}(\gamma)$ as follows,
$$C_{\nu,d}(\gamma) = \frac{1}{N_\nu h}\sum_{U_\nu}K\!\left(\frac{d-\|y_i-\gamma\|}{h}\right)\psi_j(y_i-\gamma), \qquad (B.3)$$
under Assumptions 3.2.9(1,2).

Let $\bar K(t) = \int_{-\infty}^tK(u)\,du$, and let $u_j$ be the $j$-th column vector of the $p\times p$ identity matrix. Then, under the continuity conditions of Assumption 3.2.9(1), differentiation and integration can be interchanged to give
$$\mathrm E\,C_{\nu,d}(\gamma) = \mathrm E\,\frac1h K\!\left(\frac{d-\|Y_g-\gamma\|}{h}\right)\psi_j(Y_g-\gamma) = -\frac{\partial}{\partial\gamma_j}\,\mathrm E\,\bar K\!\left(\frac{d-\|Y_g-\gamma\|}{h}\right),$$
where (1) the interchange is justified because $\frac{\partial}{\partial\gamma_j}\bar K\!\left(\frac{d-\|y-\gamma\|}{h}\right)$ is dominated by an integrable function, and (2) $\bar K\!\left(\frac{d-\|y-\gamma\|}{h}\right)$ is bounded by a constant. Here,
$$\mathrm E\,\frac1h K\!\left(\frac{d-\|y_i-\gamma\|}{h}\right) = \int_0^\infty\frac1h K\!\left(\frac{d-t}{h}\right)\frac{\partial\mathcal D_{g,t}(\gamma)}{\partial t}\,dt = \frac{\partial\mathcal D_{g,d}(\gamma)}{\partial d} + O(h^2), \qquad (B.4)$$
by the first mean value theorem for integration and the fact that $\|y_i-\gamma\|$ is bounded almost surely. Now we can obtain that
$$\mathrm E\,C_{\nu,d}(\gamma) = -\lim_{\Delta\gamma_j\to0}\frac{\mathcal D_{g,d}(\gamma+\Delta\gamma_ju_j) - \mathcal D_{g,d}(\gamma)}{\Delta\gamma_j}\bigg|_{j=1,\ldots,p} + O(h^2) = -\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma} + O(h^2), \qquad (B.5)$$
by Assumption A.1.2.

To show the uniform convergence in (A.2), we assume $|g_\gamma(\cdot)|\le1$ and
$$\mathrm E\,K^2\!\left(\frac{d-\|y_i-\gamma\|}{h}\right)\psi_j^2(y_i-\gamma) \le h;$$
otherwise, we can rescale so that the two conditions hold. Applying Approximation Lemma II.25 of Pollard (1984), we obtain that the class of functions $\mathcal G_g$ has finite covering numbers under the $\mathcal L^1$ metric induced by $F_g(y)$. Everything is set up for Theorem II.37 of Pollard (1984). Let $\alpha_n = 1$, $\delta_n^2 = h$; then we can obtain that
$$\sup_\gamma\left|C_{\nu,d}(\gamma) - \mathrm E\,C_{\nu,d}(\gamma)\right| \xrightarrow{a.s.} 0,$$
given Lemma A.1.3 and the fact that we are working with finite-dimensional random vectors.

Equation (B.5) implies the difference between $\mathrm E\,C_{\nu,d}(\gamma)$ and $-\frac{\partial\mathcal D_{g,d}(\gamma)}{\partial\gamma}$ is negligible; thus (A.2) follows. □

Proof of Lemma A.2.1

Proof. The argument is similar to that of Milasevic and Ducharme (1987).

Suppose $\gamma_1$ and $\gamma_2$ are two distinct interior points with
$$\mathrm E\|Y_g-\gamma_1\| = \mathrm E\|Y_g-\gamma_2\| = \inf_\gamma\mathrm E\|Y_g-\gamma\|. \qquad (B.6)$$
The triangle inequality would imply
$$\|y - (\lambda\gamma_1 + (1-\lambda)\gamma_2)\| \le \lambda\|y-\gamma_1\| + (1-\lambda)\|y-\gamma_2\|,$$
for $y\in\Re^p$ and $\lambda\in(0,1)$. For any $y\in\Re^p\setminus L_{\gamma_1,\gamma_2}$, the definition of $L_{\gamma_1,\gamma_2}$ implies that
$$\|y - (\lambda\gamma_1 + (1-\lambda)\gamma_2)\| < \lambda\|y-\gamma_1\| + (1-\lambda)\|y-\gamma_2\|,$$
as the equality only holds for $y$ on the line through $\gamma_1$ and $\gamma_2$ if the norm has an inner product representation.

Assumption A.2.1 provides $P\left(\Re^p\setminus L_{\gamma_1,\gamma_2}\right) > 0$, so we will have
$$\mathrm E\|Y_g - (\lambda\gamma_1 + (1-\lambda)\gamma_2)\| < \mathrm E\|Y_g-\gamma_2\|,$$
contradicting (B.6). □

Proof of Lemma A.2.2

Proof. The uniqueness follows from the fact that, with probability 1, the finite population $U_\nu$, $\nu\ge3$, will not put all $y_i$ on the same line.

We use negation to show strong consistency of $q_\nu$ for $q_g$. Assume that $P(q_\nu\nrightarrow q_g) > 0$; then, without loss of generality, assume that on a positive probability set, $q_\nu\to q_g'$ with $q_g'\ne q_g$, so that for large enough $\nu$ and some $\epsilon>0$, $\|q_\nu - q_g\| > \epsilon$.

Then by the SLLN and continuity properties, we can show that
$$\frac{1}{N_\nu}\sum_{U_\nu}\|y_i-\gamma\| \xrightarrow{a.s.} \mathrm E\|Y_g-\gamma\|,\ \text{pointwise},\qquad \inf_{\gamma\in\Re^p}\frac{1}{N_\nu}\sum_{U_\nu}\|y_i-\gamma\| \xrightarrow{a.s.} \inf_{\gamma\in\Re^p}\mathrm E\|Y_g-\gamma\|,$$
and
$$\inf_{\gamma\in\Re^p}\mathrm E\|Y_g-\gamma\| = \inf_{\gamma\in\mathbb Q^p}\mathrm E\|Y_g-\gamma\|,\ \text{pointwise},$$
contradicting (A.3) and (3.7). Thus $P(q_\nu\nrightarrow q_g) = 0$. □

Proof of Lemma 3.15

Proof. We can define
$$Q_n(s) = n^{*1/2}\left[\hat T(\lambda_N + n^{*-1/2}s) - \hat T(\lambda_N) - T(\lambda_N + n^{*-1/2}s) + T(\lambda_N)\right],$$
and it suffices to show that
$$\sup_{s\in C}|Q_n(s)| \xrightarrow{p} 0,$$
where $C$ is a large enough compact region in $\Re^q$.

As
$$|Q_n(s)| \le n^{*1/2}\left|\hat T(\lambda_N + n^{*-1/2}s) - \hat T(\lambda_N) - T_N(\lambda_N + n^{*-1/2}s) + T_N(\lambda_N)\right| + n^{*1/2}\left|T_N(\lambda_N + n^{*-1/2}s) - T_N(\lambda_N) - T(\lambda_N + n^{*-1/2}s) + T(\lambda_N)\right|, \qquad (B.7)$$
where the supremum of the second term converges to zero by Assumption 3.2.6(1), we need to show that the supremum of the first term converges to zero. For any $1-\beta < \xi < \frac{1}{2q}$, where $\beta$ is defined in Assumption 3.2.1, partition the compact region $C$ into
$$C = C_1\cup C_2\cup\cdots\cup C_{N^{\xi q}},\quad C_j\cap C_{j'} = \emptyset,\ \forall\,j\ne j',$$
where $\mathrm{Diam}(C_j) = O(N^{-\xi})$, $\forall\,j = 1,2,\cdots,N^{\xi q}$.

Select a set of $s_j\in C_j$, $j = 1,2,\cdots,N^{\xi q}$. Then, by a union bound and Chebyshev's inequality,
$$P\left(\max_j\,n^{*1/2}\left|\frac1N\sum_{i=1}^N\frac{I_i-\pi_i}{\pi_i}\left[h(y_i;\lambda_N + n^{*-1/2}s_j) - h(y_i;\lambda_N)\right]\right| > \epsilon\right) \le \sum_{j=1}^{N^{\xi q}}P\left(n^{*1/2}\left|\frac1N\sum_{i=1}^N\frac{I_i-\pi_i}{\pi_i}\left[h(y_i;\lambda_N + n^{*-1/2}s_j) - h(y_i;\lambda_N)\right]\right| > \epsilon\right) = O\!\left(N^{\xi q-\frac12}\right),$$
where the bound uses the moment conditions on $h$, and the last term goes to zero since $\xi < \frac{1}{2q}$.

Additionally,
$$\max_j\sup_{s\in C_j}\,n^{*1/2}\left|\frac1N\sum_{i=1}^N\frac{I_i-\pi_i}{\pi_i}\left[h(y_i;\lambda_N + n^{*-1/2}s) - h(y_i;\lambda_N + n^{*-1/2}s_j)\right]\right| \le \frac{n^{*1/2}}{\min_i\pi_i}\,\max_j\sup_{s\in C_j}\,\frac1N\sum_{i=1}^N\left|h(y_i;\lambda_N + n^{*-1/2}s) - h(y_i;\lambda_N + n^{*-1/2}s_j)\right| \to 0,$$
by Assumption 3.2.6(2). So the first term in (B.7) can be bounded by these two maxima, and thus the proof is completed. □

Proof of Lemma A.3.1

Proof. Define
$$X_i = h(y_i;\gamma + n^{*-1/2}s) - h(y_i;\gamma) - T(\gamma + n^{*-1/2}s) + T(\gamma);$$
then we need to show
$$\frac{n^{*1/2}}{N}\sum_{i=1}^NX_i \xrightarrow{a.s.} 0,$$
uniformly for $\gamma\in\Re^q$ and $s\in C_s$, a large enough compact set in $\Re^q$.

Here, $\mathrm EX_i = 0$ and $\mathrm E\left|h(y_i;\gamma + n^{*-1/2}s) - h(y_i;\gamma)\right| = O\!\left(n^{*-1/2}\right)$. Without loss of generality, assume
$$\left|h(y_i;\gamma + n^{*-1/2}s) - h(y_i;\gamma)\right| \le 1 \quad\text{and}\quad \mathrm E\left|h(y_i;\gamma + n^{*-1/2}s) - h(y_i;\gamma)\right| \le n^{*-1/2}.$$
Now we define the graph of $g_{\gamma,s}(y) = h(y;\gamma + n^{*-1/2}s) - h(y;\gamma)$,
$$\mathrm{gr}(g_{\gamma,s}) = \{(y,t)\,|\,0\le t\le g_{\gamma,s}(y)\}\cup\{(y,t)\,|\,g_{\gamma,s}(y)\le t\le 0\}.$$
Assumption A.3.2 implies that the set of functions
$$\left\{h(y;\gamma + n^{*-1/2}s) - h(y;\gamma) : (\gamma,s)\in\Re^q\times C_s\right\}$$
can be written as the summation of a finite number of monotone function classes. Lemmas 9.9 and 9.11 of Kosorok (2008) imply that this set is a VC class and thus has polynomial discrimination. Everything is set up for Theorem II.37 of Pollard (1984). Let $\alpha_n = 1$, $\delta_n^2 = n^{*-1/2}$; then we can obtain that
$$\sup\,n^{*1/2}\left|T_N(\gamma + n^{*-1/2}s) - T_N(\gamma) - T(\gamma + n^{*-1/2}s) + T(\gamma)\right| \xrightarrow{a.s.} 0,$$
and thus the result follows. □

Proof of Lemma 9

Proof. The lemma is proved via a covering approach. We can equally partition $C$ into $N^v$ sub-intervals
$$C = C_1\cup C_2\cup\cdots\cup C_{N^v},$$
with $\frac12 < v < \beta$, and select an arbitrary point $\gamma_k\in C_k$, $k = 1,2,\cdots,N^v$.

Then
$$\sup_\gamma\left|\hat S(\gamma) - S_N(\gamma)\right| \le \max_k\left|\hat S(\gamma_k) - S_N(\gamma_k)\right| + \max_k\sup_{\gamma\in C_k}\left|\left[\hat S(\gamma) - S_N(\gamma)\right] - \left[\hat S(\gamma_k) - S_N(\gamma_k)\right]\right|,$$
and it suffices to show the two terms on the RHS are both stochastically small.

Let us first calculate the design variance of $\hat S(\gamma) - S_N(\gamma)$:
$$V\!\left(\hat S(\gamma) - S_N(\gamma)\right) = \frac{1}{N^2}\sum_i\sum_j(\pi_{ij}-\pi_i\pi_j)\,\frac{\psi(y_i-\gamma) - S_N(\gamma)}{\pi_i}\,\frac{\psi(y_j-\gamma) - S_N(\gamma)}{\pi_j} \le \frac{c_2}{n^*},$$
by Assumption 6.2.2(2). So
$$P\left(\max_k\left|\hat S(\gamma_k) - S_N(\gamma_k)\right| > \epsilon\right) \le \sum_{k=1}^{N^v}\frac{V\!\left(\hat S(\gamma_k) - S_N(\gamma_k)\right)}{\epsilon^2} \le \frac{c_2N^v}{n^*\epsilon^2} \to 0.$$
Now let us show that
$$\max_k\sup_{\gamma\in C_k}\left|\left[\hat S(\gamma) - S_N(\gamma)\right] - \left[\hat S(\gamma_k) - S_N(\gamma_k)\right]\right| \le \max_k\sup_{\gamma\in C_k}\left|\frac1N\sum_{i\in s}\frac{1}{\pi_i}\left(\psi(y_i-\gamma) - \psi(y_i-\gamma_k)\right) - \left[S_N(\gamma) - S_N(\gamma_k)\right]\right| \le c_3\,\max_k\sup_{\gamma\in C_k}\frac1N\sum_{i=1}^N\left|\psi(y_i-\gamma) - \psi(y_i-\gamma_k)\right| \to 0,$$
where $c_3$ is a positive constant, and the last convergence follows from Assumption 6.2.2(3). □

Proof of Lemma 10:

Proof. The triangle inequality implies that
$$\left\|\hat C(\hat\lambda) - C(\lambda_N)\right\| \le \left\|\hat C(\hat\lambda) - \hat C(\lambda_N)\right\| + \left\|\hat C(\lambda_N) - C_N(\lambda_N)\right\| + \left\|C_N(\lambda_N) - C(\lambda_N)\right\|,$$
where the third term goes to zero by Assumption 6.3.2, and the first two terms are $o_p(1)$ by Assumptions 3.2.2 and 6.3.1(3).

Here,
$$\left\|\hat C(\hat\lambda) - \hat C(\lambda_N)\right\| = \left\|\frac{1}{\hat Nb^q}\sum_{i\in s}\frac{1}{\pi_i}\int\cdots\int h(y_i;x)\left[K_q\!\left(\frac{x-\hat\lambda}{b}\right) - K_q\!\left(\frac{x-\lambda_N}{b}\right)\right]dx\right\| \le c\,\|\hat\lambda - \lambda_N\|\cdot\sup|h(y;x)|\cdot\frac{1}{\hat N}\sum_{i\in s}\frac{1}{\pi_i}\cdot\frac{1}{b^q}\int\cdots\int_{C_K}dx \xrightarrow{p} 0,$$
and we can show the second term also converges to zero in probability by showing that its asymptotic variance goes to zero using Assumption 3.2.2(2). □

Proof of Lemma A.4.1:

Proof. We need to show that the set of functions
$$\left\{\psi(y - \gamma - n^{*-1/2}s) - \psi(y - \gamma)\,\middle|\,(\gamma,s)\in Q\times C_s\right\}$$
has polynomial discrimination, and, observing that
$$n^{*1/2}S_N(\xi_N) \xrightarrow{a.s.} 0,$$
the rest of the proof then follows from that of Lemma A.3.1.

We define the following set of functions $g_{\gamma,s}(\cdot)$,
$$g_{\gamma,s}(y) = \psi(y - \gamma - n^{*-1/2}s) - \psi(y - \gamma),$$
and the graph of $g_{\gamma,s}$ is the subset
$$\mathrm{gr}(g_{\gamma,s}) = \{(y,t)\,|\,0\le t\le g_{\gamma,s}(y),\ \text{or}\ g_{\gamma,s}(y)\le t\le 0\}$$
of $\Re\times\Re$.

The graphs in $\mathrm{gr}_+(g_{\gamma,s}) = \{(y,t)\,|\,0\le t\le g_{\gamma,s}(y)\}$ can be partitioned into a finite number of mutually exclusive translational classes of sets, and each translational class has polynomial discrimination by Assumption A.4.2 and by Lemma B.1 of Opsomer and Ruppert (1997).

Then the graphs in $\mathrm{gr}_+(g_{\gamma,s})$ have polynomial discrimination by Lemma 15 of Pollard (1984).

Similarly, we can argue that $\mathrm{gr}_-(g_{\gamma,s}) = \{(y,t)\,|\,g_{\gamma,s}(y)\le t\le 0\}$ has polynomial discrimination.

Thus we have shown that $\mathrm{gr}(g_{\gamma,s})$ has polynomial discrimination, and the rest of the proof follows easily. □

Proof of Theorem 11:

Proof. By negation. Without loss of generality, assume that $\exists\,\delta_1,\epsilon_1 > 0$ such that
$$P\left(\xi_N - \hat\xi > \delta_1\right) > \epsilon_1, \qquad (B.8)$$
for any $N,n$. Assumption 6.2.2(1) implies that, $\forall\,\delta_1 > 0$ and $\gamma_2 > \gamma_1 + \delta_1$, where $\gamma_1$ and $\gamma_2$ are in a neighbourhood of $\xi_N$, $\exists\,N(\delta_1;\gamma_1,\gamma_2)$ such that $\forall\,N > N(\delta_1;\gamma_1,\gamma_2)$,
$$S_N(\gamma_2) - S_N(\gamma_1) > \delta_2,$$
for some $\delta_2 > 0$. Together with (B.8), this then implies that
$$\varliminf P\left(S_N(\xi_N) - S_N(\hat\xi) > \delta_2\right) > \epsilon_1, \qquad (B.9)$$
for some $\delta_2 > 0$.

Assumption 6.2.2(1) also implies that $S_N(\xi_N) \to 0$, which together with (B.9) implies that, for some $\delta_2' > 0$,
$$\varliminf P\left(S_N(\hat\xi) < -\delta_2'\right) > \epsilon_1. \qquad (B.10)$$

Lemma 9 implies that
$$\lim P\left(\sup_\gamma\left|\hat S(\gamma) - S_N(\gamma)\right| > \frac{\delta_2'}{2}\right) = 0. \qquad (B.11)$$

Equations (B.10) and (B.11) imply that $\hat S(\hat\xi) < -\frac{\delta_2'}{2}$ with positive probability, contradicting the definition of $\hat\xi$. The other half of the proof can be done in a similar fashion. □

Proof of Theorem 12:

Proof. It is easy to verify that
$$P\left(\hat S(\gamma) > 0\right) \le P\left(\hat\xi \le \gamma\right) \le P\left(\hat S(\gamma) \ge 0\right). \qquad (B.12)$$

We define
$$\gamma_{z,n} = \xi_N + z\,n^{*-1/2}\sigma,$$
where $\sigma$ will be specified later. Eventually, we want to show
$$\lim P\left(\frac{\hat\xi - \xi_N}{n^{*-1/2}\sigma} \le z\right) = \Phi(z),$$
for which it suffices, given (B.12), to show
$$\lim P\left(\hat S(\gamma_{z,n}) > 0\right) = \Phi(z).$$

Here,
$$\lim P\left(\hat S(\gamma_{z,n}) > 0\right) = \lim P\left(\frac1N\sum_{i\in s}\frac{1}{\pi_i}\psi(y_i - \gamma_{z,n}) > 0\right) = \lim P\left(\frac1N\sum_{i\in s}\frac{1}{\pi_i}\psi(y_i - \gamma_{z,n}) - S_N\!\left(\xi_N + z\,n^{*-1/2}\sigma\right) > -S_N\!\left(\xi_N + z\,n^{*-1/2}\sigma\right)\right),$$
where the last equation is then handled using Assumption 6.2.2(4).

Now let us show that
$$n^*\left\{V\!\left(\hat S(\gamma_{z,n})\right) - V\!\left(\hat S(\xi_N)\right)\right\} \to 0.$$
Here,
$$n^*\left|V\!\left(\hat S(\gamma_{z,n})\right) - V\!\left(\hat S(\xi_N)\right)\right| = n^*\left|\mathrm{Cov}\left(\hat S(\gamma_{z,n}) + \hat S(\xi_N),\ \hat S(\gamma_{z,n}) - \hat S(\xi_N)\right)\right|$$
$$\le n^*\sqrt{V\!\left(\hat S(\gamma_{z,n}) + \hat S(\xi_N)\right)}\,\sqrt{V\!\left(\hat S(\gamma_{z,n}) - \hat S(\xi_N)\right)} \le C\left\{\frac1N\sum_{j=1}^N\left[\psi(y_j - \gamma_{z,n}) - \psi(y_j - \xi_N) - S_N(\gamma_{z,n}) + S_N(\xi_N)\right]^2\right\}^{1/2} \to 0,$$
by Assumption 6.2.2(5). Then we can conclude that
$$\frac{V\!\left(\hat S(\gamma_{z,n})\right)}{V\!\left(\hat S(\xi_N)\right)} \to 1.$$
Taking
$$\sigma = \frac{\left\{n^*\,V\!\left(\hat S(\xi_N)\right)\right\}^{1/2}}{S'(\xi_N)},$$
we have, by Assumption 3.2.3,
$$\left\{V\!\left(\hat S(\gamma_{z,n})\right)\right\}^{-1/2}\left[\frac1N\sum_{i\in s}\frac{1}{\pi_i}\psi(y_i - \gamma_{z,n}) - S_N(\gamma_{z,n})\right] \xrightarrow{\mathcal L} N(0,1).$$
Then we complete the proof by Slutsky's Theorem. □

Proof of Lemma A.3.2

Proof. We can multiply both sides of equation (A.5) by $N^a$ to get
$$\sup_{\gamma\in C_T}\frac{N^a}{N}\sum_{i=1}^N\left|I_{(y_i-\gamma-s_N\in B)} - I_{(y_i-\gamma\in B)}\right| = O(1),\ a.s. \qquad (B.13)$$

Here,
$$\sup\frac{N^a}{N}\sum_{i=1}^N\left|I_{(y_i-\gamma-s_N\in B)} - I_{(y_i-\gamma\in B)}\right| \le \sup\frac{N^a}{N}\sum_{i=1}^NI_{(y_i-\gamma-s_N\in B,\ y_i-\gamma\in B^c)} + \sup\frac{N^a}{N}\sum_{i=1}^NI_{(y_i-\gamma-s_N\in B^c,\ y_i-\gamma\in B)}.$$
It suffices to show that
$$\sup\frac{N^a}{N}\sum_{i=1}^NI_{(y_i-\gamma-s_N\in B,\ y_i-\gamma\in B^c)} = O(1),\ a.s.,$$
and thus it suffices to show
$$\sup_{\gamma\in C_T}\frac{N^a}{N}\sum_{i=1}^N\left[I_{(y_i-\gamma-s_N\in B,\ y_i-\gamma\in B^c)} - P\left(Y - \gamma - s_N\in B,\ Y - \gamma\in B^c\right)\right] = O(1),\ a.s. \qquad (B.14)$$

Let $g_\gamma(y) = I_{(y-\gamma-s_N\in B,\ y-\gamma\in B^c)}$ and $\mathcal G = \{g_\gamma(y)\,|\,\gamma\in C_T\}$; then the class of functions $\mathcal G$ is a VC class by Shorack and Wellner (1986) and thus has polynomial discrimination. Obviously, $|g_\gamma(y)|\le1$ and $\left\{\mathrm E\,g_\gamma^2(Y)\right\}^{1/2} \le N^{-a/2}$ for each $\gamma\in C_T$. Everything is set up to apply Theorem II.37 of Pollard (1984) to obtain (B.14). □

APPENDIX C. FIGURES

Figure C.1 Scatter plot of population clusters 1-4 and 10 labeled test points (#0-#9), plotted in the (dim 1, dim 2) plane.

Figure C.2 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification, for clusters 1-4 with n = 200 (mean center). The solid lines are the population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.3 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification, for clusters 1-4 with n = 800 (mean center). The solid lines are the population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.4 Estimated distance distributions $\hat D_{\nu,d}(\hat\mu_\nu)$ under SRS and stratification, for clusters 1-4 with n = 2000 (mean center). The solid lines are the population quantities $D_{\nu,d}(\mu_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.5 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification, for clusters 1-4 with n = 200 (median center). The solid lines are the population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.6 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification, for clusters 1-4 with n = 800 (median center). The solid lines are the population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.7 Estimated distance distributions $\hat D_{\nu,d}(\hat q_\nu)$ under SRS and stratification, for clusters 1-4 with n = 2000 (median center). The solid lines are the population quantities $D_{\nu,d}(q_\nu)$, dashed lines are estimators using SRS, dotted lines using stratified sampling.

Figure C.8 Histograms of estimated distance distributions $\hat D_{\nu,d}(\hat\mu_{\nu3})$ under SRS (cluster 3, N = 8000, n = 800) for d = 0.1, 0.5, 2 and 6, with the population quantity superimposed; Shapiro-Wilk p-values 0.5986, 0.127, 0.888 and 0.0172, respectively.

Figure C.9 Histograms of estimated distance distributions $\hat D_{\nu,d}(\hat\mu_{\nu4})$ under SRS (cluster 4, N = 8000, n = 800) for d = 0.1, 0.5, 2 and 6, with the population quantity superimposed; Shapiro-Wilk p-values 2e-04, 0.4855, 0.6311 and 0.2154, respectively.

Figure C.10 Histograms of estimated distance distributions $\hat D_{\nu,d}(\hat\mu_{\nu3})$ under SRS (cluster 3, N = 16000, n = 1600) for d = 0.1, 0.5, 2 and 6, with the population quantity superimposed; Shapiro-Wilk p-values 0.781, 0.7787, 0.7669 and 0.382, respectively.

Figure C.11 Histograms of estimated distance distributions $\hat D_{\nu,d}(\hat\mu_{\nu4})$ under SRS (cluster 4, N = 16000, n = 1600) for d = 0.1, 0.5, 2 and 6, with the population quantity superimposed; Shapiro-Wilk p-values 0, 0.4018, 0.3325 and 0.3074, respectively.

Figure C.12 Plot of conditional bias against $\|\hat\mu - \mu\|$ using both the mean and the median as cluster centers (clusters 2, 4 and 5, sample size 400); red curve for d = 0.4472, green curve for d = 1.0 and blue curve for d = 3.873.

Figure C.13 Scatter plots of sample clusters 1-4 using PAM and the adjusted k-means method.

Figure C.14 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 0-3, by cluster, with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.15 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 4-7, by cluster, with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.16 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 8-9, by cluster, with the population quantity superimposed. Method for initial clustering is PAM.

Figure C.17 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 8-9, by cluster, with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.18 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 0-3, by cluster, with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.19 Boxplots of estimated distance distributions $\hat D_{\nu,\|z-\hat\mu_\nu\|}(\hat\mu_\nu)$ under SRS (n = 800) for test points 4-7, by cluster, with the population quantity superimposed. Method for initial clustering is adjusted k-means.

Figure C.20 Scatter plot of cluster 5, generated from a bivariate lognormal distribution (N = 2000), and plot of the population distance distributions $D_{\nu,d}(\mu_\nu)$ and $D_{\nu,d}(q_\nu)$ against d (mean versus median center).

BIBLIOGRAPHY

Bahadur, R. R. (1966). A note on quantiles in large samples. The Annals of Mathematical

Statistics, 37:577-580.

Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering.

Biometrics, 49:803-821.

Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. John Wiley & Sons.

Bensmail, H., Celeux, G., Raftery, A. E., and Robert, C. P. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7:1-10.

Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons.

Binder, D. and Kovacevic, M. (1995). Estimating some measures of income inequality from survey data: an application of the estimating equations approach. Survey Methodology, 21:137-145.

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex

surveys. International Statistical Review, 51:279-292.

Breidt, F. and Opsomer, J. (2008). Endogenous post-stratification in surveys: classifying with

a sample-fitted model. Annals of Statistics, to appear.

Breidt, F. J. and Opsomer, J. D. (2000). Local polynomial regression estimators in survey sampling. The Annals of Statistics, 28(4):1026-1053.

Brown, B. M. (1983). Statistical uses of the spatial median. Journal of the Royal Statistical Society, Series B: Methodological, 45:25-30.

Calinski, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3:1-27.

Campbell, N. A. (1980). Robust procedures in multivariate analysis I: Robust covariance

estimation. Applied Statistics, 29:231-237.

Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition, 28:781-793.

Chambers, R. L., Dorfman, A. H., and Hall, P. (1992). Properties of estimators of the finite

population distribution function. Biometrika, 79:577-582.

Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91:862-872.

Chen, M. (1978). A norm-referenced population health status indicator based on life expectancy and disability. Social Indicators Research, 5:245-253.

Da Silva, D. and Opsomer, J. (2006). A kernel smoothing method to adjust for unit nonresponse in sample surveys. The Canadian Journal of Statistics, 34:563-579.

Deville, J.-C. (1999). Variance estimation for complex statistics and estimators: Linearization

and residual techniques. Survey Methodology, 25:193-203.

Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. The

Annals of Statistics, 12:793-815.

Dorfman, A. H. (2007). Inference on distributions and quantiles. In Rao, C. and Pfeffermann, D., editors, Handbook of Statistics Volume 29: Sample Surveys: Theory, Methods and Inference. Elsevier Science Publishing Co., New York; North-Holland Publishing Co., Amsterdam.

Dorfman, A. H. and Hall, P. (1993). Estimators of the finite population distribution function using nonparametric regression. The Annals of Statistics, 21:1452-1475.

Dudley, R. and Koltchinskii, V. (1992). The spatial quantiles.

Dunstan, R. and Chambers, R. L. (1986). Model-based confidence intervals in multipurpose

surveys. Applied Statistics, 35:276-280.

Eurostat (2000). Low-wage employees in EU countries. Statistics in Focus: Population and Social Conditions.

Fang, K.-T. and Zhang, Y.-T. (1980). Generalized Multivariate Analysis. Springer-Verlag.

Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.

Fraley, C. and Raftery, A. (2000). Model-based clustering, discriminant analysis, and density estimation. Technical Report 380, Department of Statistics, University of Washington.

Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578-588.

Francisco, C. A. and Fuller, W. A. (1991). Quantile estimation with a complex survey design.

The Annals of Statistics, 19:454-469.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons.

Fuller, W. (2007). Sampling Statistics. Unpublished manuscript.

Fuller, W. A. and Kim, J. K. (2005). Hot deck imputation for the response model. Survey

Methodology, 31(2):139-149.

Gini, C. and Galvani, L. (1929). Di talune estensioni dei concetti di media ai caratteri qualitativi. Metron, 8:3-209.

Golder, P. A. and Yeomans, K. A. (1973). The use of cluster analysis for stratification. Applied

Statistics, 22:213-219.

Gower, J. (1974). Algorithm AS 78: The mediancentre. Applied Statistics, 23:466-470.

Haldane, J. (1948). Note on the median of a multivariate distribution. Biometrika, 35:414-415.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning.

Springer.

Hawkins, D. M. (1980). Identification of Outliers. Chapman & Hall Ltd.

Hayford, J. (1902). What is the center of an area, or the center of a population? Journal of the American Statistical Association, 8:47-58.

Hettmansperger, T. P. and Randles, R. H. (2002). A practical affine equivariant multivariate median. Biometrika, 89(4):851-860.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35:73-101.

Huber, P. J. (1981). Robust Statistics. John Wiley & Sons.

Hutchison, T. (1969). A numerical example of centour analysis among flexibly determined

subgroups. American Educational Research Journal, 6:91-103.

Isaki, C. and Fuller, W. (1982). Survey design under the regression superpopulation model.

Journal of the American Statistical Association, 77:89-96.

Johnson, M. E. (1987). Multivariate Statistical Simulation. John Wiley & Sons.

Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.

Kent, J. T. and Tyler, D. E. (1991). Redescending M-estimates of multivariate location and

scatter. The Annals of Statistics, 19:2102-2119.

Koltchinskii, V. (1993). Bahadur-Kiefer approximation for spatial quantiles.

Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. To appear in Springer Series in Statistics.

Kuk, A. Y. C. (1988). Estimation of distribution functions and medians under sampling with

unequal probabilities. Biometrika, 75:97-103.

Kuk, A. Y. C. and Mak, T. K. (1989). Median estimation in the presence of auxiliary information. Journal of the Royal Statistical Society, Series B: Methodological, 51:261-269.

Langlet, E. R. (1990). Use of cluster analysis for collapsing imputation classes. Survey Methodology, 16:137-143.

Lyberg, L., Biemer, P., Collins, M., de Leeuw, E., Dippo, C., Schwarz, N., and Trewin, D. (1997). Survey Measurement and Process Quality. John Wiley & Sons.

Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. The Annals

of Statistics, 4:51-67.

Milasevic, P. and Ducharme, G. R. (1987). Uniqueness of the spatial median. The Annals of

Statistics, 15:1332-1333.

Milligan, G. W. and Cooper, M. C. (1985). An examination of procedures for determining the

number of clusters in a data set. Psychometrika, 50:159-179.

Nusser, S. M. and Goebel, J. J. (1997). The National Resources Inventory: A long-term

multi-resource monitoring programme. Environmental and Ecological Statistics, 4:181-204.

Opsomer, J. D. and Ruppert, D. (1997). Fitting a bivariate additive model by local polynomial

regression. The Annals of Statistics, 25(1):186-211.

Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9:135-140.

Pollard, D. (1982). A central limit theorem for k-means clustering. The Annals of Probability, 10:919-926.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag Inc.

Preston, I. (1996). Sampling distributions of relative poverty statistics. Applied Statistics,

45:399.

Randles, R. H. (1982). On the asymptotic normality of statistics with estimated parameters. The Annals of Statistics, 10:462-474.

Rao, J., Kovar, J. G., and Mantel, H. J. (1990). On estimating distribution functions and

quantiles from survey data using auxiliary information. Biometrika, 77:365-375.

Rocke, D. M. (1996). Robustness properties of S-estimators of multivariate location and shape

in high dimension. The Annals of Statistics, 24(3):1327-1345.

Rocke, D. M. and Woodruff, D. L. (1996). Identification of outliers in multivariate data.

Journal of the American Statistical Association, 91:1047-1061.

Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In Grossmann,

W., Pflug, G., Vincze, I., and Wertz, W., editors, Mathematical Statistics with Applications,

pages 283-297. D. Reidel Publishing Co.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons.

Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage

points. Journal of the American Statistical Association, 85:633-639.

Sedransk, N. and Sedransk, J. (1979). Distinguishing among distributions using data from

complex sample designs. Journal of the American Statistical Association, 74:754-760.

Shao, J. and Rao, J. (1993). Standard errors for low income proportions estimated from stratified multistage samples. Sankhya B, 55:393-414.

Shao, J. and Wu, C. F. J. (1989). A general theory for jackknife variance estimation. The

Annals of Statistics, 17:1176-1197.

Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52:591-611.

Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. John Wiley & Sons.

Sitter, R. R. and Wu, C. (2001). A note on Woodruff confidence intervals for quantiles.

Statistics & Probability Letters, 52(4):353-358.

Small, C. G. (1990). A survey of multidimensional medians. International Statistical Review,

58:263-277.

Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59:1-34.

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a

data set via the gap statistic. Journal of the Royal Statistical Society, Series B: Statistical

Methodology, 63(2):411-423.

Tyler, D. E. (1983). Robustness and efficiency properties of scatter matrices (Corr: V71 p656).

Biometrika, 70:411-420.

Tyler, D. E. (1988). Some results on the existence, uniqueness, and computation of the M-estimates of multivariate location and scatter. SIAM Journal on Scientific and Statistical Computing, 9:354-362.

Tyler, D. E. (1991). Some issues in the robust estimation of multivariate location and scatter.

In Stahel, W. and Weisberg, S., editors, Directions in Robust Statistics and Diagnostics.

Part II, pages 327-336. Springer-Verlag Inc.

Wang, J. and Opsomer, J. D. (2008). Estimating the distance distribution of subpopulations for a large scale complex survey.

Wang, S. and Dorfman, A. H. (1996). A new estimator for the finite population distribution function. Biometrika, 83:639-652.

Wu, C. and Sitter, R. R. (2001a). A model-calibration approach to using complete auxiliary information from survey data. Journal of the American Statistical Association, 96(453):185-193.

Wu, C. and Sitter, R. R. (2001b). Variance estimation for the finite population distribution

function with complete auxiliary information. The Canadian Journal of Statistics /La Revue

Canadienne de Statistique, 29(2):289-307.