Robust Hidiroglou-Lavallée stratified sampling strategy∗

Maria Caterina Bramati†

May 2010

Abstract There are several reasons why robust techniques are useful tools in sampling design. First of all, when stratified samples are considered, one needs to deal with three main issues: the sample size, the determination of the strata bounds and the allocation of the sample to the strata. Since the target variable y, the objective of the survey, is unknown, some auxiliary information x, known for the entire population from which the sample is drawn, is used instead. Such information is helpful because it is strongly correlated with the target y, but of course some discrepancies between the two may arise. The use of auxiliary information, combined with the choice of an appropriate model for its relationship with the variable of interest y, is crucial for determining the strata bounds, the size of the sample and the sampling rates for a chosen precision level of the estimates, as shown by Rivest (2002). Nevertheless, this regression-based approach is highly sensitive to the presence of contaminated data. Indeed, outlying observations in both y and x have an explosive impact on the estimates, leading to strong departures from the optimum sample allocation. We therefore expect inflated sample sizes in the strata, a wrong allocation of sampling units to the strata and errors in the determination of the strata bounds. Since the key tool for stratified sampling is the measure of scale of y conditional on the knowledge of some auxiliary x, a robust approach based on the S-estimator of regression is proposed in this paper. The aim is to allow for robust determination of the sample size and strata bounds, together with the optimal sample allocation. To show the advantages of the proposed method, an empirical illustration is provided for Belgian business surveys in the Construction sector. We consider a skewed population framework, which is typical for businesses, and a stratified design with one take-all stratum and L − 1 take-some strata. Simulation results are also provided.

1 Introduction

The presence of outliers can strongly bias the sampling design and hence the survey results. In particular, it can lead to a wrong computation of the number of statistical units to sample, usually overestimating it.

∗The author acknowledges colleagues and friends at Statistics Belgium for support and many insightful discussions. All errors are my own.

†La Sapienza, Università di Roma, Dipartimento di Metodi e Modelli per l’Economia, il Territorio e la Finanza, Via del Castro Laurenziano 9, I-00161 Rome. [email protected].

In what follows we focus on the stratified sampling design, which has been proven to be the most efficient surveying technique under some basic assumptions (see Tillé, 2001) and is currently in use at several national statistical institutes (NSIs) for business surveys. For instance, suppose that some outliers arise in the stratification variable X. Outliers are observations arbitrarily far from the majority of the data. They are often due to mistakes, such as editing, measurement and observational errors. Intuitively, when outliers are present in a given stratum for the stratification variable X, they affect both the location and the scale measures of X. A higher dispersion than the 'true' one will therefore be observed in that stratum. Such a situation will bias the outcome of the HL method. For instance, the sample size would be bigger than it should be, because observations seem to be more distant (on average) than they are in reality. Moreover, the strata bounds and the sample allocation would both be biased. This is clear when we consider, for example, the Neyman allocation, which is based on within-stratum dispersion. Since the principle is to survey more units in the strata in which the auxiliary variable is more dispersed, outliers might have the effect of increasing the sample size in each stratum enormously and unduly. For this reason we build two robust versions of the HL method, the naive robust and the robust HL sampling strategy, which we compare through a simulation study.

2 The problem

In what follows we focus on simple stratified samples with one take-all stratum and several take-some strata. This is because we deal with

• skewed distributions (small number of units accounts for a large share of the study variables)

• availability of administrative information, providing a list of the statistical units of the target population (e.g. tax declarations, social security registers)

• survey burdens for firms and costs for NSIs

• data quality (administrative sources and survey collection)

• compliance requirements established by EUROSTAT

Indeed, in a stratified sample the population is divided into subgroups (or strata) in order to maximize the intra-group homogeneity (according to a chosen target variable) and to minimize the inter-group homogeneity. Strata must be mutually exclusive (i.e. a unit can belong to one stratum only) and collectively exhaustive (i.e. no population unit is excluded). The statistical precision is the constraint under which the sample sizes and the allocation are chosen. This approach was proposed in a seminal paper by Hidiroglou and Lavallée (1988), who derive an optimal solution for the choice of the strata bounds, the sample size and the allocation subject to a fixed precision for the target variable. That is to say, the algorithm allows for the simultaneous determination of the minimum sample size, the strata bounds and the sample allocation which satisfy a desired statistical precision.

Consider a stratified random sampling scheme with L strata for a variable of interest Y in a target population U of size N. Denoting by N_h the size, by S_h the random sample of size n_h and by a_h = n_h/N_h the sampling fraction of stratum h, h = 1,...,L, the survey estimator for the total

    \hat{t}_{ystrat} = \sum_{h=1}^{L} \frac{N_h}{n_h} \sum_{k \in S_h} y_k

has variance estimated by

    \widehat{Var}(\hat{t}_{ystrat}) = \sum_{h=1}^{L} N_h \frac{1 - a_h}{a_h} s_{yh}^2    (1)

where

    s_{yh}^2 = \frac{1}{n_h - 1} \sum_{k \in S_h} (y_k - \hat{y}_h)^2,

and \hat{y}_h is the sample mean of Y within stratum h. In the procedure the L-th stratum is the take-all stratum, i.e. n_L = N_L, whereas the remaining L−1 strata are take-some, i.e. drawn by simple random sampling with n_h = (n − N_L) a_h, h < L, where from here on a_h denotes the share of the n − N_L take-some units allocated to stratum h. Therefore, by straightforward calculations (1) can be rewritten as

    \widehat{Var}(\hat{t}_{ystrat}) = \frac{1}{n - N_L} \sum_{h=1}^{L-1} \frac{N_h^2 s_{yh}^2}{a_h} - \sum_{h=1}^{L-1} N_h s_{yh}^2    (2)

from which, solving for n,

    \hat{n}_{t_{ystrat}} = N_L + \frac{\sum_{h=1}^{L-1} N_h^2 s_{yh}^2 / a_h}{\widehat{Var}(\hat{t}_{ystrat}) + \sum_{h=1}^{L-1} N_h s_{yh}^2},    (3)

which is the sample size required for a given (supposedly known) variance of the estimator of the total. Expression (3) can also be written as

    \hat{n}_{t_{ystrat}} = N_L + \frac{\sum_{h=1}^{L-1} W_h^2 s_{yh}^2 / a_h}{(cY/N)^2 + \sum_{h=1}^{L-1} \frac{W_h}{N} s_{yh}^2},    (4)

where W_h = N_h/N, c is the target coefficient of variation (the precision level, which often ranges between 1% and 10% in business surveys) and Y is the population total of the survey variable, so that Y/N is its mean. The idea is to find the optimal stratum boundaries b_1, ..., b_{L−1} which minimize n_{\hat{t}_{ystrat}} subject to a requirement on the precision of \hat{t}_{ystrat} such as Var(\hat{t}_{ystrat}) = c^2 Y^2, with some appropriate sampling allocation (Neyman, proportional, ...). When Neyman allocation is applied,

    a_h = \frac{n_h}{n - N_L} = \frac{W_h s_{yh}}{\sum_{k=1}^{L-1} W_k s_{yk}},    (5)

i.e. the number of units sampled in a stratum is proportional to the dispersion of the survey variable within the stratum relative to the overall dispersion, each weighted by the relative size of the stratum. If the Neyman allocation rule is applied to (4), the following holds

    \hat{n}_{t_{ystrat}} = N_L + \frac{\left( \sum_{h=1}^{L-1} W_h s_{yh} \right)^2}{(cY/N)^2 + \sum_{h=1}^{L-1} \frac{W_h}{N} s_{yh}^2}.    (6)

Now, it is known that there is a discrepancy between the auxiliary variable X used for stratification and the survey variable Y: the auxiliary information X is only a proxy for the target variable Y. The strategy suggested by Rivest (2002), with his generalization of the HL algorithm, is therefore to account for this discrepancy by means of a regression model. In the business survey literature, the relationship between Y and X is often modeled as log-linear.

In what follows we consider X and Y as continuous random variables and we denote by f(x), x ∈ R, the density of X. The data x_1, ..., x_N are considered as N independent realizations of the random variable X. Since stratum h consists of the population units with an X-value in the interval (b_{h−1}, b_h], the stratification process uses the values of E(Y | b_h ≥ X > b_{h−1}) and Var(Y | b_h ≥ X > b_{h−1}), which denote the conditional mean and variance of Y given that the unit falls in stratum h, for h = 1,...,L − 1. This model considers the regression relationship between Y and X expressed by

    \log Y = \alpha + \beta_{log} \log X + \varepsilon,

where ε is assumed to be a zero-mean random variable, normally distributed with variance \sigma_{log}^2 and independent of X, whereas α and \beta_{log} are the parameters to be estimated. The conditional moments of Y are obtained using the basic properties of the lognormal distribution, and they are

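    E(Y | b_h \ge X > b_{h-1}) = \exp(\alpha + \sigma_{log}^2/2) \, E(X^{\beta_{log}} | b_h \ge X > b_{h-1})    (7)

and

    Var(Y | b_h \ge X > b_{h-1}) = \exp(2\alpha + \sigma_{log}^2) \{ e^{\sigma_{log}^2} E(X^{2\beta_{log}} | b_h \ge X > b_{h-1}) - E(X^{\beta_{log}} | b_h \ge X > b_{h-1})^2 \}.    (8)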
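As an illustration of how the conditional moments feed the sample-size formula, the following minimal sketch evaluates them empirically and plugs them into the Neyman-allocation sample size (6). It is not the author's implementation: the function names, the replacement of the conditional expectations by within-stratum averages of the frame values, the use of the model-expected total for Y, and the assumptions that all x are strictly positive and that no stratum is empty are all choices made here for illustration. The conditional variance is computed as E(Y^2 | stratum) − E(Y | stratum)^2 under the lognormal model.

```python
import math


def stratum_moments(x, bounds, beta, sigma2, alpha=0.0):
    """Return (W_h, N_h, mean_h, var_h) for each stratum (b_{h-1}, b_h].

    Assumes log Y = alpha + beta*log X + eps with eps ~ N(0, sigma2),
    strictly positive x, and no empty stratum (illustrative assumptions).
    """
    n_pop = len(x)
    edges = [-math.inf] + list(bounds) + [math.inf]
    k = math.exp(alpha + sigma2 / 2.0)  # lognormal factor E(exp(alpha + eps))
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = [v for v in x if lo < v <= hi]
        m1 = sum(v ** beta for v in xs) / len(xs)        # E(X^beta | stratum)
        m2 = sum(v ** (2 * beta) for v in xs) / len(xs)  # E(X^(2 beta) | stratum)
        mean_y = k * m1                                   # conditional mean of Y
        var_y = k ** 2 * (math.exp(sigma2) * m2 - m1 ** 2)  # E(Y^2|h) - E(Y|h)^2
        out.append((len(xs) / n_pop, len(xs), mean_y, var_y))
    return out


def neyman_sample_size(x, bounds, beta, sigma2, c, alpha=0.0):
    """Sample size under Neyman allocation; the last stratum is take-all."""
    n_pop = len(x)
    strata = stratum_moments(x, bounds, beta, sigma2, alpha)
    n_take_all = strata[-1][1]
    take_some = strata[:-1]
    # Model-expected population total of Y (assumption: stands in for Y).
    total_y = math.exp(alpha + sigma2 / 2.0) * sum(v ** beta for v in x)
    num = sum(w * math.sqrt(var) for w, _, _, var in take_some) ** 2
    den = (c * total_y / n_pop) ** 2 + sum(w * var / n_pop for w, _, _, var in take_some)
    return n_take_all + num / den


# Hypothetical skewed frame of 200 units and three interior bounds (L = 4).
x_demo = [math.exp(0.05 * i) for i in range(1, 201)]
n_demo = neyman_sample_size(x_demo, [10.0, 100.0, 1000.0], beta=0.75, sigma2=0.5, c=0.01)
```

In this sketch the intercept α cancels out of the sample size, because the numerator and both terms of the denominator scale by the same factor, so only β and σ²_log matter for the design.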

However, this approach presents some weaknesses:

1. s_{yh}^2 is unknown, which makes the use of the auxiliary information X crucial

2. the number L of strata is selected by the user

3. the administrative records are often of low quality (errors)

Focusing on the last issue, we consider several types of outliers in the data which might affect the HL sampling algorithm. We can distinguish three main sources of anomalies, such as

• erroneous records in the surveyed data (Y) (vertical outliers)

• quality issues in the administrative registers (X) (leverage points)

• outliers in both variables (X, Y) (good/bad leverage points)

The presence of such anomalies makes the conditional mean and variance of Y|X unreliable, and therefore affects the determination of the sample size and strata bounds as well as the sample allocation.

In what follows we propose two possible alternatives to Rivest's (2002) algorithm. In the first approach, robust regression estimators are used to estimate the parameters of the log-linear regression model. The robust parameter estimates are then plugged into the HL objective function used for the computation of the strata bounds, the minimum sample size and the sample allocation which satisfy a fixed statistical precision. We call this the naive robust approach. In the second approach, strata bounds and sizes are derived by minimizing the conditional variance in each stratum after re-weighting the information according to its degree of outlyingness. We refer to this approach as the robust generalization of the HL algorithm (RGHL).

3 The Robust Generalization of the HL algorithm (RGHL)

Suppose that a log-linear relationship exists between the survey variable Y and the auxiliary variable X, and consider the S-estimator of regression of Rousseeuw and Yohai (1984),

    S(x, y) = \arg\min_{\beta} s(r_1(\beta), ..., r_N(\beta)),

where the r_i(\beta) are the regression residuals and s is the scale measure which solves

    \frac{1}{N} \sum_{i=1}^{N} \rho\left( \frac{r_i(\beta)}{s} \right) = b

for a conveniently chosen function ρ and a constant b. This estimator is robust with respect to both vertical outliers and leverage points. Then, with some straightforward calculations (expanding ρ(·)), the following approximation holds

    Var[Y | b_h \ge X > b_{h-1}] \approx e^{\sigma^2} \psi_h / W_h - (\phi_h / W_h)^2,

where

    W_h = \int_{b_{h-1}}^{b_h} \omega(x^{\beta}) f(x) dx
    \phi_h = \int_{b_{h-1}}^{b_h} x^{\beta} \omega(x^{\beta}) f(x) dx
    \psi_h = \int_{b_{h-1}}^{b_h} x^{2\beta} \omega(x^{\beta}) f(x) dx,    (9)

β and σ are the parameters of the log-linear model of the previous section, and ω(x) = ρ'(x)/x is the weighting function. Some choices of ρ(x) and ω(x) = ρ'(x)/x are presented in the figure below.
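To make the scale equation and the weighting function concrete, here is a minimal sketch for one possible choice of ρ, Tukey's biweight. The paper does not state which ρ, tuning constant or b it uses, so the values below (c = 1.547 and b equal to half the maximum of ρ, a commonly used setting for a 50% breakdown point) are assumptions, as are the function names and the fixed-point iteration used to solve for s.

```python
import math
import random

C_TUKEY = 1.547                     # assumed tuning constant of the biweight
B_CONST = 0.5 * C_TUKEY ** 2 / 6.0  # assumed b: half of the maximum of rho


def rho(u, c=C_TUKEY):
    """Tukey biweight rho function, bounded by c^2/6."""
    if abs(u) >= c:
        return c ** 2 / 6.0
    return (c ** 2 / 6.0) * (1.0 - (1.0 - (u / c) ** 2) ** 3)


def omega(u, c=C_TUKEY):
    """Weight function omega(u) = rho'(u)/u; zero beyond the cutoff c."""
    if abs(u) >= c:
        return 0.0
    return (1.0 - (u / c) ** 2) ** 2


def m_scale(residuals, b=B_CONST, c=C_TUKEY, tol=1e-10, max_iter=1000):
    """Solve (1/N) * sum(rho(r_i / s)) = b for s by fixed-point iteration.

    Assumes more than half of the residuals are nonzero.
    """
    abs_r = sorted(abs(r) for r in residuals)
    s = abs_r[len(abs_r) // 2] / 0.6745  # starting value based on the median |r|
    for _ in range(max_iter):
        avg = sum(rho(r / s, c) for r in residuals) / len(residuals)
        s_new = s * math.sqrt(avg / b)
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s


# Hypothetical residuals: the scale estimate is barely moved by gross outliers.
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(500)]
contaminated = clean[:450] + [1000.0] * 50
s_clean, s_contaminated = m_scale(clean), m_scale(contaminated)
```

An S-estimator of regression would wrap a search over β around `m_scale`, keeping the β whose residuals give the smallest scale; that outer search is not shown here.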

The problem then reduces to solving for the bounds b_1, ..., b_h, ..., b_L which minimize n_{\hat{t}_{ystrat}} under the Neyman allocation scheme. In symbols, under the log-linear specification the objective function is

    n_{\hat{t}_{ystrat}} = N_L + \frac{\left( \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h W_h - \phi_h^2)^{1/2} \right)^2}{(c \sum_i x_i^{\beta} / N)^2 + \sum_{h=1}^{L-1} (e^{\sigma^2} \psi_h - \phi_h^2 / W_h) / N}    (10)

where the robust moments W_h, \phi_h and \psi_h are those in (9), and β and σ are the parameters of the log-linear model estimated by robust regression (S-estimator or LTS). Then, Sethi's iterations are run for a given L and precision c, computing the optimal strata bounds and sample size.
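The objective function can be evaluated for a candidate set of bounds as in the sketch below. This is only an illustration under stated assumptions: the integrals in (9) are replaced by averages over the N frame values, the robustness weights are passed in as a precomputed list `w` (a hypothetical input standing in for the ω(·) terms), every take-some stratum is assumed to have positive total weight, and the search over bounds (Sethi's iterations) is not implemented.

```python
import math


def rghl_sample_size(x, w, bounds, beta, sigma2, c):
    """Evaluate the robust objective function for given strata bounds.

    x: positive auxiliary values; w: per-unit robustness weights in [0, 1]
    (hypothetical input); bounds: interior bounds b_1 < ... < b_{L-1}.
    The last stratum is the take-all stratum.
    """
    n_pop = len(x)
    edges = [-math.inf] + list(bounds) + [math.inf]
    n_strata = len(edges) - 1
    big_w = [0.0] * n_strata  # empirical W_h
    phi = [0.0] * n_strata    # empirical phi_h
    psi = [0.0] * n_strata    # empirical psi_h
    counts = [0] * n_strata
    for xi, wi in zip(x, w):
        h = next(k for k in range(n_strata) if edges[k] < xi <= edges[k + 1])
        big_w[h] += wi / n_pop
        phi[h] += wi * xi ** beta / n_pop
        psi[h] += wi * xi ** (2 * beta) / n_pop
        counts[h] += 1
    e = math.exp(sigma2)
    take_some = range(n_strata - 1)
    # max(..., 0) only guards against tiny negative rounding errors.
    num = sum(math.sqrt(max(e * psi[h] * big_w[h] - phi[h] ** 2, 0.0)) for h in take_some) ** 2
    den = (c * sum(xi ** beta for xi in x) / n_pop) ** 2
    den += sum((e * psi[h] - phi[h] ** 2 / big_w[h]) / n_pop for h in take_some)
    return counts[-1] + num / den


# Hypothetical frame with unit weights, i.e. no unit is downweighted.
x_demo = [math.exp(0.05 * i) for i in range(1, 201)]
w_demo = [1.0] * len(x_demo)
n_demo = rghl_sample_size(x_demo, w_demo, [10.0, 100.0, 1000.0], 0.75, 0.5, 0.01)
```

A bound-search routine would call this function repeatedly, moving the bounds until the returned sample size stops decreasing.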

4 Simulation Study

The aim of the simulation study is to compare the performance of the two robust sampling strategies proposed in this paper with that of Rivest's (2002) strategy based on classical LS regression. Simulations are performed using the business sampling frame of the Structural Business Survey (SBS) in 2002, where we consider as target variable y the value added of enterprises in the Construction industry, which are stratified by economic-size class. The number of strata is set to L = 6, according to common practice in the SBS, with one take-all stratum and 5 take-some strata. The auxiliary information x is turnover (from the VAT register). The population is generated from log y_i = β log x_i + ε_i with β = 0.75. We then consider the following designs

1. no outliers: ε_i ∼ N(0, 1)

2. long-tailed errors: ε_i ∼ Cauchy_1

3. long-tailed errors: ε_i ∼ t_3

4. vertical outliers: δ% of ε_i ∼ N(5 \sqrt{\chi^2_{1;0.99}}, 1.5)

5. bad leverage points: δ% of ε_i ∼ N(10, 10) and corresponding X ∼ N(−10, 10).

The contamination level, i.e. the percentage of outliers in the data, is set to δ = 15% and 30%. Three procedures are used to compute the strata bounds, sizes and allocation at 1% precision

• generalization of the HL algorithm (Rivest (2002))

• naive robust HL method

• robust generalization of the HL algorithm (RGHL)

and they are compared by means of the relative MSE of the Horvitz-Thompson estimator of the mean. The main results are displayed in the tables below. We observe that in the case of long-tailed distributions the RGHL algorithm performs better, which is not the case for the naive robust method. To reach an efficiency level equal to 10% of that of the LS-based approach in the case of no outliers, the RGHL method requires a sample size 100 times bigger than the LS one.
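The error designs of the simulation can be sketched as follows. This is an illustrative generator, not the code used for the study: the function names are hypothetical, the Cauchy errors are taken to be standard Cauchy, the second argument of N(·, 1.5) is read as a standard deviation, and the bad leverage design (which also replaces x-values) is left out because its scale is not specified precisely enough to reproduce. Values are kept on the log scale to avoid overflow with heavy-tailed errors.

```python
import math
import random

CHI2_1_099 = 6.635  # 0.99 quantile of the chi-square with 1 df (about 2.5758**2)


def draw_errors(n, design, delta=0.15, rng=None):
    """Draw n regression errors for one simulation design (illustrative)."""
    rng = rng or random.Random(0)
    if design == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if design == "cauchy":
        # Inverse-CDF draw from the standard Cauchy distribution.
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    if design == "t3":
        out = []
        for _ in range(n):
            z = rng.gauss(0.0, 1.0)
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
            out.append(z / math.sqrt(chi2 / 3.0))  # Student t with 3 df
        return out
    if design == "vertical":
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        shift = 5.0 * math.sqrt(CHI2_1_099)
        # Replace a fraction delta of the errors by shifted draws
        # (assumption: 1.5 is a standard deviation).
        for i in rng.sample(range(n), round(delta * n)):
            eps[i] = rng.gauss(shift, 1.5)
        return eps
    raise ValueError(f"unknown design: {design}")


def simulate_log_y(x, beta, eps):
    """Return log y_i = beta * log x_i + eps_i for strictly positive x."""
    return [beta * math.log(xi) + e for xi, e in zip(x, eps)]


# Hypothetical frame values; the study itself uses the SBS 2002 frame.
x_demo = [1.0 + i for i in range(1000)]
log_y_demo = simulate_log_y(x_demo, 0.75, draw_errors(1000, "vertical", delta=0.15))
```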

5 Conclusions and Agenda

In this work we propose a robust approach to simple stratified sampling. There are many advantages to using robust techniques when dealing with administrative records. Since such records might be of low quality, due to errors in editing, source etc., they might affect the size and allocation of the sample, as well as the determination of the strata bounds. To correct for this we propose a robust approach to the HL algorithm, considering a log-linear regression relationship between the survey variable y and the auxiliary information x.

Design                  Relative efficiency   Relative sample size
No outliers             4.52                  1
Long-tailed Cauchy      0.07                  0.03
Long-tailed t           6.79                  0.15
Vertical outliers 15%   1.01                  0.24
Leverage points 15%     4.52                  0.24
Vertical outliers 30%   0.99                  0.99
Leverage points 30%     0.11                  0.99

Table 1: Summary of results comparing naive robust HL method versus generalization of HL algorithm (Rivest, 2002), target precision: 1%

Design                  Relative efficiency   Relative sample size
No outliers             0.10                  100
Long-tailed Cauchy      0.00                  0.29
Long-tailed t           0.08                  10
Vertical outliers 15%   0.99                  10
Leverage points 15%     0.00                  10
Vertical outliers 30%   0.99                  1.43
Leverage points 30%     0.00                  1.43

Table 2: Summary of results comparing RGHL versus generalization of HL algorithm (Rivest, 2002), target precision: 1%

A simulation study illustrates the performance of the proposed sampling strategy, clearly showing many advantages compared to the classical LS strategy suggested by Rivest (2002). Further work is required on the numerical implementation of optimal stratification for surveys with multiple variables.

6 References

Rivest, L.P. (2002). A generalization of Lavallée and Hidiroglou algorithm for stratification in business surveys. Techniques d’enquête 28, pp. 207-214.

Rousseeuw, P. J., Yohai, V. J. (1984). Robust Regression by Means of S-Estimators, in Robust and Nonlinear Time Series Analysis, eds. J. Franke, W. Härdle, and D. Martin, Lecture Notes in Statistics, 26. Berlin, Springer-Verlag, pp. 256-272.

Teirlinck, P., Spithoven, A. (2005). Spatial inequality and location of private R&D activities in Belgian districts. Royal Dutch Geographical Society 96, pp. 558-572.

Tillé, Y. (2001). Théorie des sondages. Paris, Dunod.
