Survey Methodology, December 2010, Vol. 36, No. 2, pp. 131-144
Statistics Canada, Catalogue No. 12-001-X

Design for estimation: Identifying auxiliary vectors to reduce nonresponse bias

Carl-Erik Särndal and Sixten Lundström 1

Abstract

This article develops computational tools, called indicators, for judging the effectiveness of the auxiliary information used to control nonresponse bias in survey estimates, obtained in this article by calibration. This work is motivated by the survey environment in a number of countries, notably in northern Europe, where many potential auxiliary variables are derived from reliable administrative registers for households and individuals. Many auxiliary vectors can be composed. There is a need to compare these vectors to assess their potential for reducing bias. The indicators in this article are designed to meet that need. They are used in surveys at Statistics Sweden. General survey conditions are considered: There is probability sampling from the finite population, by an arbitrary sampling design; nonresponse occurs. The probability of inclusion in the sample is known for each population unit; the probability of response is unknown, causing bias. The study variable (the y-variable) is observed for the set of respondents only. No matter what auxiliary vector is used in a calibration estimator (or in any other estimation method), a residual bias will always remain. The choice of a "best possible" auxiliary vector is guided by the indicators proposed in the article. Their background and computational features are described in the early sections of the article. Their theoretical background is explained. The concluding sections are devoted to empirical studies. One of these illustrates the selection of auxiliary variables in a survey at Statistics Sweden. A second empirical illustration is a simulation with a constructed finite population; a number of potential auxiliary vectors are ranked in order of preference with the aid of the indicators.

Key Words: Calibration weighting; Nonresponse adjustment; Nonresponse bias; Auxiliary variables; Bias indicator.

1. Introduction

Large nonresponse is typical of many surveys today. This creates a need for techniques for reducing as much as possible the nonresponse bias in the estimates. Powerful auxiliary information is needed. Administrative files are a source of such information. The Scandinavian countries and some other European countries, notably the Netherlands, are in an advantageous position. Many potential auxiliary variables (called x-variables) can be taken from high quality administrative registers where auxiliary variable values are specified for the entire population. Variables measuring aspects of the data collection are another useful type of auxiliary data. Effective action can be taken to control nonresponse bias. Beyond sampling design, design for estimation becomes, in these countries, an important component of the total design. Statistics Sweden has devoted considerable resources to the development of techniques for selecting the best auxiliary variables.

Many articles discuss weighting in surveys with nonresponse and the selection of "best auxiliary variables". Examples include Eltinge and Yansaneh (1997), Kalton and Flores-Cervantes (2003), and Thomsen, Kleven, Wang and Zhang (2006). Weighting in panel surveys with attrition receives special attention in, for example, Rizzo, Kalton and Brick (1996), who suggest that "the choice of auxiliary variables is an important one, and probably more important than the choice of the weighting methodology". The review by Kalton and Flores-Cervantes (2003) provides many references to earlier work. As in this paper, a calibration approach to nonresponse weighting is favoured in Deville (2002) and Kott (2006).

Some earlier methods are special cases of the outlook in this article, which is based on a systematic use of auxiliary information by calibration at two levels. Recently the search for efficient weighting has emphasized two directions: (i) to provide a more general setting than the popular but limited cell weighting techniques, and (ii) to quantify the search for auxiliary variables with the aid of computable indicators. Särndal and Lundström (2005, 2008) propose such indicators, while Schouten (2007) uses a different perspective to motivate an indicator. An article of related interest is Schouten, Cobben and Bethlehem (2009).

The content of this article has four parts: The general background for estimation with nonresponse is stated in Sections 2 to 4. Indicators for preference ranking of x-vectors are presented in Sections 5 and 6, and the computational aspects are discussed. The linear algebra derivations behind the indicators are presented in Sections 7 and 8. The two concluding Sections 9 and 10 present two empirical illustrations. The first (Section 9) uses real data from a large survey at Statistics Sweden. The second (Section 10) reports a simulation carried out on a constructed finite population.

1. Carl-Erik Särndal, Professor, and Sixten Lundström, Senior Methodological Advisor, Statistics Sweden. E-mail: [email protected].

2. Calibration estimators for a survey with nonresponse

A probability sample $s$ is drawn from the population $U = \{1, 2, \ldots, k, \ldots, N\}$. The sampling design gives unit $k$ the known inclusion probability $\pi_k = \Pr(k \in s) > 0$ and the known design weight $d_k = 1/\pi_k$. Nonresponse occurs. The response set $r$ is a subset of $s$; how it was generated is unknown. We assume $r \subset s \subset U$, and $r$ nonempty. The (design weighted) response rate is
$$P = \frac{\sum_r d_k}{\sum_s d_k} \qquad (2.1)$$
(if $A$ is a set of units, $A \subseteq U$, a sum $\sum_{k \in A}$ will be written as $\sum_A$). Ordinarily a survey has many study variables. A typical one, whether continuous or categorical, is denoted $y$. Its value for unit $k$ is $y_k$, recorded for $k \in r$, not available for $k \in U - r$. We seek to estimate the population $y$-total, $Y = \sum_U y_k$. Many parameters of interest in the finite population are functions of several totals, but we can focus on one such total.

The auxiliary information is of two kinds. To these correspond two vector types, $\mathbf{x}_k^*$ and $\mathbf{x}_k^\circ$. Population auxiliary information is transmitted by $\mathbf{x}_k^*$, a vector value known for every $k \in U$. Thus $\sum_U \mathbf{x}_k^*$ is a known population total. Alternatively, we allow that $\sum_U \mathbf{x}_k^*$ is imported from an exterior source and that $\mathbf{x}_k^*$ is a known (observed) vector value for every $k \in s$. Sample auxiliary information is transmitted by $\mathbf{x}_k^\circ$, a vector value known (observed) for every $k \in s$; the total $\sum_U \mathbf{x}_k^\circ$ is unknown but is estimated without bias by $\sum_s d_k \mathbf{x}_k^\circ$. The auxiliary vector value combining the two types is denoted $\mathbf{x}_k$. This vector and the associated information are
$$\mathbf{x}_k = \begin{pmatrix} \mathbf{x}_k^* \\ \mathbf{x}_k^\circ \end{pmatrix}; \qquad \mathbf{X} = \begin{pmatrix} \sum_U \mathbf{x}_k^* \\ \sum_s d_k \mathbf{x}_k^\circ \end{pmatrix}. \qquad (2.2)$$
Tied to the $k$-th unit is the vector $(y_k, \mathbf{x}_k, \pi_k)$. Here, $\pi_k$ is known for all $k \in U$, $y_k$ for all $k \in r$; the component $\mathbf{x}_k^*$ of $\mathbf{x}_k$ carries population information, the component $\mathbf{x}_k^\circ$ of $\mathbf{x}_k$ carries sample information.

Many x-vectors can be formed with the aid of variables from administrative registers, survey process data or other sources. Among all the vectors at our disposal, we wish to identify the one most likely to reduce the nonresponse bias, if not to zero, so at least to a near-zero value.

We consider vectors having the property that there exists a constant non-null vector $\boldsymbol{\mu}$ such that
$$\boldsymbol{\mu}'\mathbf{x}_k = 1 \quad \text{for all } k \in U. \qquad (2.3)$$
"Constant" means that $\boldsymbol{\mu} \neq \mathbf{0}$ does not depend on $k$, nor on $s$ or $r$. Condition (2.3) simplifies the mathematical derivations and does not severely restrict $\mathbf{x}_k$. Most x-vectors useful in practice are in fact covered. Examples include: (1) $\mathbf{x}_k = (1, x_k)'$, where $x_k$ is the value for unit $k$ of a continuous auxiliary variable $x$; (2) the vector representing a categorical x-variable with $J$ mutually exclusive and exhaustive classes, $\mathbf{x}_k = \boldsymbol{\gamma}_k = (\gamma_{1k}, \ldots, \gamma_{jk}, \ldots, \gamma_{Jk})'$, where $\gamma_{jk} = 1$ if $k$ belongs to group $j$, and $\gamma_{jk} = 0$ if not, $j = 1, 2, \ldots, J$; (3) the vector $\mathbf{x}_k$ used to codify two categorical variables, the dimension of $\mathbf{x}_k$ being $J_1 + J_2 - 1$, where $J_1$ and $J_2$ are the respective numbers of classes, and the 'minus one' is to avoid a singularity in the computation of weights calibrated to the two arrays of marginal counts; (4) the extension of (3) to more than two categorical variables. Vectors of the types (3) and (4) are especially important in statistics production in statistical agencies (the choice $\mathbf{x}_k = x_k$, not covered by (2.3), leads to the nonresponse ratio estimator, known to be a usually poor choice for controlling nonresponse bias, compared with $\mathbf{x}_k = (1, x_k)'$, so excluding the ratio estimator is no great loss).

The calibration estimator of $Y = \sum_U y_k$, computed on the data $y_k$ for $k \in r$, is
$$\hat{Y}_{\mathrm{CAL}} = \sum_r w_k y_k \qquad (2.4)$$
with $w_k = d_k\{1 + (\mathbf{X} - \sum_r d_k \mathbf{x}_k)'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k\}$. The weights $w_k$ are calibrated on both kinds of information: $\sum_r w_k \mathbf{x}_k = \mathbf{X}$, which implies $\sum_r w_k \mathbf{x}_k^* = \sum_U \mathbf{x}_k^*$ and $\sum_r w_k \mathbf{x}_k^\circ = \sum_s d_k \mathbf{x}_k^\circ$. We assume throughout that the symmetric matrix $\sum_r d_k \mathbf{x}_k \mathbf{x}_k'$ is nonsingular (for computational reasons, it is prudent to impose a stronger requirement: the matrix should not be ill-conditioned, or near-singular). In view of (2.3), we have $\hat{Y}_{\mathrm{CAL}} = \sum_r w_k y_k$ with weights $w_k = d_k v_k$ where $v_k = \mathbf{X}'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k$. The weights satisfy $\sum_r d_k v_k \mathbf{x}_k = \mathbf{X}$, where $\mathbf{X}$ has one or both of the components in (2.2).

A closely related calibration estimator is based on the same two-tiered vector $\mathbf{x}_k$ but with calibration only to the sample level:
$$\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k \qquad (2.5)$$
where
$$m_k = \Big(\sum_s d_k \mathbf{x}_k\Big)'\Big(\sum_r d_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}\mathbf{x}_k. \qquad (2.6)$$
The calibration equation then reads $\sum_r d_k m_k \mathbf{x}_k = \sum_s d_k \mathbf{x}_k$, where $\mathbf{x}_k$ has the two components as in (2.2). The auxiliary vector $\mathbf{x}_k$ serves two purposes: to achieve a low variance and a low nonresponse bias. From the variance perspective alone, $\hat{Y}_{\mathrm{CAL}}$ is usually preferred to $\tilde{Y}_{\mathrm{CAL}}$ because the former profits from the input of a known population total $\sum_U \mathbf{x}_k^*$. But this paper studies the bias.
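A minimal computational sketch (not part of the original article) of the weights behind (2.4) to (2.6) may be helpful. It assumes NumPy arrays for the sample, uses the $w_k = d_k v_k$ form valid under (2.3), and all function and variable names are illustrative:

```python
import numpy as np

def calibration_weights(d_s, x_s, resp, X=None):
    """Weights for the calibration estimators (2.4)-(2.6).

    d_s  : (n,) design weights d_k for the sample s
    x_s  : (n, p) auxiliary vectors x_k, one row per sample unit
    resp : (n,) boolean response indicator (True if k is in r)
    X    : (p,) information total; defaults to the sample-level
           total sum_s d_k x_k, in which case w_k = d_k m_k as in (2.6)
    """
    d_r, x_r = d_s[resp], x_s[resp]
    if X is None:
        X = (d_s[:, None] * x_s).sum(axis=0)      # sum_s d_k x_k
    M = (d_r[:, None] * x_r).T @ x_r              # sum_r d_k x_k x_k'
    v = x_r @ np.linalg.solve(M, X)               # v_k = X' M^{-1} x_k
    return d_r * v                                # w_k = d_k v_k

# Usage: Y_CAL = (calibration_weights(d, x, resp) * y[resp]).sum()
```

With the default sample-level total this returns $d_k m_k$ and hence (2.5); supplying a vector X built from population totals gives the weights of (2.4).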


From that perspective, we are virtually indifferent between $\hat{Y}_{\mathrm{CAL}}$ and $\tilde{Y}_{\mathrm{CAL}}$, and we focus on the latter. Under liberal conditions, the difference between the bias of $\hat{Y}_{\mathrm{CAL}}$ and that of $\tilde{Y}_{\mathrm{CAL}}$ is of order $n^{-1}$, thereby of little practical consequence even for modest sample sizes $n$, as discussed for example in Särndal and Lundström (2005).

An alternative expression for (2.5) is
$$\tilde{Y}_{\mathrm{CAL}} = \Big(\sum_s d_k \mathbf{x}_k\Big)'\mathbf{B}_x \qquad (2.7)$$
where
$$\mathbf{B}_x = \mathbf{B}_{x|r;d} = \Big(\sum_r d_k \mathbf{x}_k \mathbf{x}_k'\Big)^{-1}\sum_r d_k \mathbf{x}_k y_k \qquad (2.8)$$
is the regression coefficient vector arising from the ($d_k$-weighted) least squares fit based on the data $(y_k, \mathbf{x}_k)$ for $k \in r$.

A remark on the notation: When needed for emphasis, a symbol has two indices separated by a semicolon. The first shows the set of units over which the quantity is computed and the second indicates the weighting, as in $\mathbf{B}_{x|r;d}$ given by (2.8), and in weighted means such as $\bar{y}_{r;d} = \sum_r d_k y_k / \sum_r d_k$. If the weighting is uniform, the second of the two indices is dropped, as in $\bar{y}_U = (1/N)\sum_U y_k$.

3. Points of reference

The most primitive choice of vector is the constant one, $\mathbf{x}_k = 1$ for all $k$. Although inefficient for reducing nonresponse bias, it serves as a benchmark. Then $m_k = 1/P$ for all $k$, where $P$ is the survey response rate (2.1), and $\tilde{Y}_{\mathrm{CAL}}$ is the expansion estimator:
$$\tilde{Y}_{\mathrm{EXP}} = (1/P)\sum_r d_k y_k = \hat{N}\,\bar{y}_{r;d} \qquad (3.1)$$
where $\hat{N} = \sum_s d_k$ is design unbiased for the population size $N$. The bias of $\tilde{Y}_{\mathrm{EXP}}$ can be large. At the opposite end of the bias spectrum are the unbiased, or nearly unbiased, estimators obtainable under full response, when $r = s$. They are hypothetical, not computable in the presence of nonresponse. Among these are the GREG estimator with weights calibrated to the known population total $\sum_U \mathbf{x}_k^*$,
$$\hat{Y}_{\mathrm{FUL}} = \sum_s d_k g_k y_k$$
where $g_k = 1 + (\sum_U \mathbf{x}_k^* - \sum_s d_k \mathbf{x}_k^*)'(\sum_s d_k \mathbf{x}_k^* \mathbf{x}_k^{*\prime})^{-1}\mathbf{x}_k^*$, and FUL refers to full response. The unbiased HT estimator (obtained when $g_k = 1$ for all $k$) is
$$\tilde{Y}_{\mathrm{FUL}} = \sum_s d_k y_k = \hat{N}\,\bar{y}_{s;d}. \qquad (3.2)$$
It disregards the information $\sum_U \mathbf{x}_k^*$, which may be important for variance reduction. But for the study of bias in this paper, we are indifferent between $\hat{Y}_{\mathrm{FUL}}$ and $\tilde{Y}_{\mathrm{FUL}}$. The difference in bias between the two is of little consequence, even for modest sample sizes. We can focus on $\tilde{Y}_{\mathrm{FUL}}$.

4. The bias ratio

For a given outcome $(s, r)$, consider the estimates $\tilde{Y}_{\mathrm{CAL}}$, $\tilde{Y}_{\mathrm{EXP}}$ and $\tilde{Y}_{\mathrm{FUL}}$ given by (2.5), (3.1) and (3.2) as three points on a horizontal axis. Both $\tilde{Y}_{\mathrm{EXP}}$ (generated by the primitive $\mathbf{x}_k = 1$) and $\tilde{Y}_{\mathrm{CAL}}$ (generated by a better x-vector) are computable, but biased. As the x-vector improves, $\tilde{Y}_{\mathrm{CAL}}$ will distance itself from $\tilde{Y}_{\mathrm{EXP}}$ and may come near the unbiased but unrealized ideal $\tilde{Y}_{\mathrm{FUL}}$. We consider therefore three deviations: $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$, $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}$ and $\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}$, of which only the middle one is computable. The unknown "deviation total", $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$, is decomposable as a "deviation accounted for" (by the chosen x-vector) plus a "deviation remaining":
$$\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}} = (\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}) + (\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}). \qquad (4.1)$$
If computable, $\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}$ would be of particular interest, as an estimate of the bias remaining in $\tilde{Y}_{\mathrm{CAL}}$ (and in $\hat{Y}_{\mathrm{CAL}}$), whereas $\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}$ would estimate the usually much larger bias of the benchmark, $\tilde{Y}_{\mathrm{EXP}}$. The bias ratio for a given outcome $(s, r)$ sets the estimated bias of $\tilde{Y}_{\mathrm{CAL}}$ in relation to that of $\tilde{Y}_{\mathrm{EXP}}$:
$$\text{bias ratio} = \frac{\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}}{\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}}. \qquad (4.2)$$
We scale the three deviations by the estimated population size $\hat{N} = \sum_s d_k$ and use the notation $T = A + R$, where $T$ suggests "total", $A$ "accounted for" and $R$ "remaining". Noting that $\sum_r d_k(y_k - \mathbf{x}_k'\mathbf{B}_x) = 0$, we have
$$T = \hat{N}^{-1}(\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}}) = \bar{y}_{r;d} - \bar{y}_{s;d}$$
$$R = \hat{N}^{-1}(\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}) = \bar{\mathbf{x}}_{s;d}'\mathbf{B}_x - \bar{y}_{s;d}$$
$$A = \hat{N}^{-1}(\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}}) = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x$$
where $\bar{\mathbf{x}}_{s;d} = \sum_s d_k \mathbf{x}_k / \sum_s d_k$, $\bar{\mathbf{x}}_{r;d} = \sum_r d_k \mathbf{x}_k / \sum_r d_k$, and $\bar{y}_{s;d}$ and $\bar{y}_{r;d}$ are the analogously defined means for the y-variable. Then (4.2) takes the form
$$\text{bias ratio} = \frac{R}{T} = 1 - \frac{A}{T} = 1 - \frac{(\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x}{\bar{y}_{r;d} - \bar{y}_{s;d}}. \qquad (4.3)$$
We have bias ratio = 1 for the primitive vector $\mathbf{x}_k = 1$. Ideally, we want the auxiliary vector $\mathbf{x}_k$ for $\tilde{Y}_{\mathrm{CAL}}$ to give bias ratio $\approx 0$. For a given outcome $(s, r)$ and a given y-variable, we take steps in that direction by finding an x-vector that makes the computable numerator $A = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x$ large (in absolute value). This is within our reach.
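The decomposition $T = A + R$ is straightforward to compute in a setting where $y_k$ is known for all of $s$ (a simulation, or a study variable observed on the full sample). The following sketch is not from the paper; names and array layout are illustrative:

```python
import numpy as np

def deviations(d_s, x_s, y_s, resp):
    """T, A, R of Section 4 and the bias ratio (4.2)-(4.3).
    Requires y_k for all k in s, so usable only in a simulation
    or for a variable observed on the whole sample."""
    d_r, x_r, y_r = d_s[resp], x_s[resp], y_s[resp]
    ybar_s = np.average(y_s, weights=d_s)               # y-bar_{s;d}
    ybar_r = np.average(y_r, weights=d_r)               # y-bar_{r;d}
    xbar_s = np.average(x_s, axis=0, weights=d_s)
    xbar_r = np.average(x_r, axis=0, weights=d_r)
    M = (d_r[:, None] * x_r).T @ x_r
    B = np.linalg.solve(M, (d_r[:, None] * x_r).T @ y_r)  # B_x of (2.8)
    T = ybar_r - ybar_s                                  # deviation total
    A = (xbar_r - xbar_s) @ B                            # accounted for (computable)
    R = T - A                                            # remaining
    return T, A, R, R / T                                # last value is the bias ratio
```

In a real survey only A is computable; T and R serve here as benchmarks, exactly as in the discussion above.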

But whatever our final choice of x-vector, the remaining bias of $\tilde{Y}_{\mathrm{CAL}}$ is unknown. Even with the best available x-vector, considerable bias may remain. We have then attempted to do the best possible, under perhaps unfavourable circumstances.

To summarize, for a given outcome $(s, r)$ and a given y-variable, the three deviations have the following features: (i) $T = \bar{y}_{r;d} - \bar{y}_{s;d}$ is an unknown constant value, depending on both unobserved and observed y-values; (ii) $A$ is computable; it depends on $y_k$ for $k \in r$ and on the values $\mathbf{x}_k$ for $k \in s$ of the chosen x-vector; (iii) $R$ cannot be computed; it depends on unobserved values $y_k$, and on $\mathbf{x}_k$ for $k \in s$.

To follow the progression of the estimates when the x-vector improves, consider a given outcome $(s, r)$. The deviation $T$ can have either sign. Suppose $T > 0$, indicating a positive bias in $\tilde{Y}_{\mathrm{EXP}}$, as when large units respond with greater propensity than small ones. When the x-vector in $\tilde{Y}_{\mathrm{CAL}}$ becomes progressively more powerful by the inclusion of more and more x-variables, $A$ tends to increase away from zero and will, ideally, come near $T$, indicating a desired closeness of $\tilde{Y}_{\mathrm{CAL}}$ to the unbiased $\tilde{Y}_{\mathrm{FUL}}$. As long as the x-vector remains relatively weak, $A < T$ is likely to hold. When the x-vector becomes increasingly powerful, $A$ moves closer to the fixed $T$, a sign of bias nearing zero. It could even "move beyond", so that an "overadjustment", $A > T$, has occurred. This is not a detrimental feature; although $R = T - A$ is then negative, it is ordinarily small (the analyst can only work with $A$; it is unknown to him/her whether $A$ and $T$ are close, or whether the overadjustment $A > T$ has occurred). These points are illustrated by the simulation in Section 10. If $T < 0$, these tendencies are reversed.

The form of (4.3) may suggest an argument which can however be misleading: Suppose that a vector $\mathbf{x}_k$ has been suggested, containing variables thought to be effective, along with an assumption that $y_k = \boldsymbol{\beta}'\mathbf{x}_k + \varepsilon_k$, where $\varepsilon_k$ is a small residual. Then $\bar{y}_{r;d} - \bar{y}_{s;d} \approx (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x \approx (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\boldsymbol{\beta}$, and consequently bias ratio $\approx 0$, sending a message, often false, that the postulated vector $\mathbf{x}_k$ is efficient. One weakness of the argument stems from the well-known fact that nonresponse (unless completely random) will cause $\mathbf{B}_x$ to be biased for a regression vector that describes the y-to-x relationship in the population. Further comments on this issue are given in Section 8.

Finally, there is the practical consideration that a typical survey has many y-variables. To every y-variable corresponds a calibration estimator, and a bias ratio given by (4.3). The ideal x-vector is one that would be capable of controlling bias in all those estimators. This is usually not possible without compromise, as we discuss later.

5. Expressing the deviation accounted for

The responding unit $k$ receives the weight $d_k m_k$ in the estimator $\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k$. The nonresponse adjustment factor $m_k = (\sum_s d_k \mathbf{x}_k)'(\sum_r d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\mathbf{x}_k$ expands the design weight $d_k$. We can view $m_k$ as the value of a derived variable, defined for a particular outcome $(s, r)$ and choice of $\mathbf{x}_k$, independent of all y-variables of interest, and computable for $k \in s$ (but used in $\tilde{Y}_{\mathrm{CAL}}$ only for $k \in r$). Using (2.3), we have
$$\sum_r d_k m_k \mathbf{x}_k = \sum_s d_k \mathbf{x}_k; \qquad \sum_r d_k m_k = \sum_s d_k; \qquad \sum_r d_k m_k^2 = \sum_s d_k m_k. \qquad (5.1)$$
Two weighted means are needed:
$$\bar{m}_{r;d} = \frac{\sum_r d_k m_k}{\sum_r d_k} = \frac{\sum_s d_k}{\sum_r d_k} = \frac{1}{P}; \qquad \bar{m}_{s;d} = \frac{\sum_s d_k m_k}{\sum_s d_k} \qquad (5.2)$$
where $P$ is the response rate (2.1). Thus the average adjustment factor in $\tilde{Y}_{\mathrm{CAL}} = \sum_r d_k m_k y_k$ is $1/P$, regardless of the choice of x-vector. Whether a chosen x-vector is efficient or not for reducing bias will depend on higher moments of the $m_k$. The weighted variance of the $m_k$ is
$$S_m^2 = S_{m|r;d}^2 = \sum_r d_k(m_k - \bar{m}_{r;d})^2 \Big/ \sum_r d_k. \qquad (5.3)$$
The simpler notation $S_m^2$ will be used. A development of (5.3) and a use of (5.1) and (5.2) gives
$$S_m^2 = \bar{m}_{r;d}(\bar{m}_{s;d} - \bar{m}_{r;d}). \qquad (5.4)$$
The coefficient of variation of the $m_k$ is
$$\mathrm{cv}_m = \frac{S_m}{\bar{m}_{r;d}} = \sqrt{\frac{\bar{m}_{s;d}}{\bar{m}_{r;d}} - 1}. \qquad (5.5)$$
The weighted variance of the study variable $y$ is
$$S_y^2 = S_{y|r;d}^2 = \sum_r d_k(y_k - \bar{y}_{r;d})^2 \Big/ \sum_r d_k \qquad (5.6)$$
(when the response probabilities are not all equal, $S_y^2 = S_{y|r;d}^2$ is not unbiased for the population variance $S_{y|U}^2$, but this is not an issue for the derivations that follow). We need the covariance
$$\mathrm{Cov}(y, m) = \mathrm{Cov}(y, m)_{r;d} = \frac{1}{\sum_r d_k}\sum_r d_k(m_k - \bar{m}_{r;d})(y_k - \bar{y}_{r;d}) \qquad (5.7)$$
and the correlation coefficient, $R_{y,m} = \mathrm{Cov}(y, m)/(S_y S_m)$, satisfying $-1 \le R_{y,m} \le 1$.

The deviation $A = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x$ is a crucial component in the bias ratio (4.3). We seek an x-vector that makes $A$ large.
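Identities (5.1) to (5.5) are easy to verify numerically. The following self-contained check is not from the paper; the toy data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = np.column_stack([np.ones(n), rng.gamma(2.0, 5.0, size=n)])   # x_k = (1, x_k)'
d = rng.uniform(1.0, 3.0, size=n)                                 # design weights d_k
resp = rng.random(n) < 0.7                                        # an arbitrary response set r
d_r, x_r = d[resp], x[resp]

M = (d_r[:, None] * x_r).T @ x_r                                  # sum_r d_k x_k x_k'
tot_s = (d[:, None] * x).sum(axis=0)                              # sum_s d_k x_k
m_s = x @ np.linalg.solve(M, tot_s)                               # m_k of (2.6), k in s
m_r = m_s[resp]                                                   # m_k for k in r
P = d_r.sum() / d.sum()                                           # response rate (2.1)

mbar_r = np.average(m_r, weights=d_r)
mbar_s = np.average(m_s, weights=d)
S2_m = np.average((m_r - mbar_r) ** 2, weights=d_r)               # (5.3)

print(np.isclose(mbar_r, 1.0 / P))                                # (5.2)
print(np.isclose(S2_m, mbar_r * (mbar_s - mbar_r)))               # (5.4)
print(np.isclose(np.sqrt(S2_m) / mbar_r,                          # (5.5)
                 np.sqrt(mbar_s / mbar_r - 1.0)))
```

All three checks hold exactly (up to floating point) because the x-vector includes an intercept, so condition (2.3) is satisfied.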


The factors that determine $A$ are seen in (5.8) to (5.10). Computational tools (indicators) to assist the search for effective x-variables are given in (5.11) and (5.12). Their derivation by linear algebra is deferred to Section 7, which may be bypassed by readers more interested in the practical use of these tools in the search for x-variables, as illustrated in the empirical Sections 9 and 10.

We can factorize $A/S_y$ as
$$A/S_y = -R_{y,m} \times \mathrm{cv}_m. \qquad (5.8)$$
Two simple multiplicative factors determine $A/S_y$: The coefficient of variation $\mathrm{cv}_m$, which is free of $y$ and computed on the known $\mathbf{x}_k$ alone, and the (positive or negative) correlation coefficient $R_{y,m}$. Another factorization in terms of simple concepts is
$$A/S_y = F \times R_{y,\mathbf{x}} \times \mathrm{cv}_m \qquad (5.9)$$
where $R_{y,\mathbf{x}} = \sqrt{R_{y,\mathbf{x}}^2}$ is the coefficient of multiple correlation between $y$ and $\mathbf{x}$, $R_{y,\mathbf{x}}^2$ is the proportion of the y-variance $S_y^2$ explained by the predictor $\mathbf{x}$, and $F = -R_{y,m}/R_{y,\mathbf{x}}$ (formula (7.8) states the precise expression for $R_{y,\mathbf{x}}^2$). As Section 7 also shows, $|R_{y,m}| \le R_{y,\mathbf{x}}$ for any x-vector and y-variable; consequently $-1 \le F \le 1$.

In (5.8) and (5.9), $\mathrm{cv}_m$ and $R_{y,\mathbf{x}}$ are nonnegative terms, while $R_{y,m}$ and $F$ can have either sign (or possibly be zero). Hence
$$|A|/S_y = |R_{y,m}| \times \mathrm{cv}_m = |F| \times R_{y,\mathbf{x}} \times \mathrm{cv}_m. \qquad (5.10)$$
All of $S_y$, $\mathrm{cv}_m$, $R_{y,\mathbf{x}}$, $R_{y,m}$ and $F$ are easily computed in the survey. Both $\mathrm{cv}_m$ and $R_{y,\mathbf{x}}$ increase (or possibly stay unchanged) when further x-variables are added to the x-vector; $R_{y,m}$ does not have this property.

To illustrate with the aid of fairly typical numbers, if $F = 0.5$, $R_{y,\mathbf{x}} = 0.6$ and $\mathrm{cv}_m = 0.4$, then $A/S_y = 0.12$, implying that $\tilde{Y}_{\mathrm{CAL}}/\hat{N} = \tilde{Y}_{\mathrm{EXP}}/\hat{N} - 0.12 \times S_y$. That is, the estimated y-mean $\tilde{Y}_{\mathrm{CAL}}/\hat{N}$ has become adjusted by 0.12 standard deviations down from the primitive estimate $\tilde{Y}_{\mathrm{EXP}}/\hat{N}$. The adjustment can be large compared to the standard deviation of the estimated y-mean, especially when the survey sample size is in the thousands. It remains unknown whether or not that adjustment has cured most of the biasing effect of nonresponse.

It follows from (5.8) that $0 \le |A|/S_y \le \mathrm{cv}_m$ whatever the y-variable. A sharper inequality is $|A|/S_y \le R_{y,\mathbf{x}} \times \mathrm{cv}_m$, but it depends on the y-variable. Further, if the correlation ratio $F$ stays roughly constant when the x-vector changes, so that $F \approx F_0$, then $|A|/S_y \approx |F_0| \times R_{y,\mathbf{x}} \times \mathrm{cv}_m$.

Although computable for any x-vector and any outcome $(s, r)$, $A$ does not reveal the value of the bias ratio. But $A$ suggests computational tools, called indicators, for comparing alternative x-vectors. By (5.8), let
$$H_0 = A/S_y = -R_{y,m} \times \mathrm{cv}_m. \qquad (5.11)$$
As borne out by theory in Section 8 and by the empirical work in Section 10, over a long run of outcomes $(s, r)$, the average of $H_0$ tracks the average deviation $\tilde{Y}_{\mathrm{CAL}} - Y$ (which measures the bias of $\tilde{Y}_{\mathrm{CAL}}$) in a nearly perfect linear manner when the x-vector changes. This holds independently of the response distribution that generates $r$ from $s$. Since $H_0$ can have either sign, it is practical to work with its absolute value, denoted $H_1$; in addition we consider two other indicators, $H_2$ and $H_3$, inspired by (5.9) and (5.10):
$$H_1 = |A|/S_y = |R_{y,m}| \times \mathrm{cv}_m; \qquad H_2 = R_{y,\mathbf{x}} \times \mathrm{cv}_m; \qquad H_3 = \mathrm{cv}_m. \qquad (5.12)$$
Our main alternatives are $H_1$ and $H_3$. Of these, $H_1$ is motivated by its direct link to $A$, which we want to make large, for a given y-variable. A strong reason to consider $H_3$ is its independence of all y-variables in the survey. The indicator $H_2$ is an ad hoc alternative; although $H_2$ contains a familiar concept, the multiple correlation coefficient $R_{y,\mathbf{x}}$, it is less appropriate than $H_1$ because the correlation coefficient ratio $F = -R_{y,m}/R_{y,\mathbf{x}}$ may vary considerably from one x-vector to another. Both $H_2$ and $H_3$ increase when further x-variables are added to the x-vector, something which does not hold in general for $H_1$. The use of these indicators is illustrated in the empirical Sections 9 and 10.

6. Preference ranking of auxiliary vectors

The methods in this paper are intended for use primarily with the large samples that characterize government surveys. The sample size is ordinarily much larger than the dimension of the x-vector. The variance of estimates is ordinarily small, compared to the squared bias. However, for categorical auxiliary variables, no group size should be allowed to be "too small". It is recommended that all group sizes be at least 30, if not at least 50, in order to avoid instability. The crossing of categorical variables (to allow interactions) implies a certain risk of small groups. It is preferable to calibrate on marginal counts, rather than on frequencies for small crossed cells.

In a number of countries, the many available administrative registers provide a rich source of auxiliary information, particularly for surveys on individuals and households. These registers contain many potential x-variables from which to choose. Many different x-vectors can be composed. The indicators in (5.12) provide computational tools for obtaining a preference ordering, or a ranking, of potential x-vectors, with the objective to reduce as much as possible the bias remaining in the calibration estimator.
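As a computational sketch (not from the paper) of how $H_1$, $H_2$ and $H_3$ of (5.12) might be obtained for one candidate x-vector, assuming NumPy arrays and illustrative names:

```python
import numpy as np

def indicator_values(d_s, x_s, y_r, resp):
    """H0-H3 of (5.11)-(5.12) for one candidate x-vector; y_r holds y_k for k in r."""
    d_r, x_r = d_s[resp], x_s[resp]
    M = (d_r[:, None] * x_r).T @ x_r                                  # sum_r d_k x_k x_k'
    m_r = x_r @ np.linalg.solve(M, (d_s[:, None] * x_s).sum(axis=0))  # m_k of (2.6), k in r
    mbar = np.average(m_r, weights=d_r)
    S_m = np.sqrt(np.average((m_r - mbar) ** 2, weights=d_r))
    cv_m = S_m / mbar                                                 # (5.5)
    ybar = np.average(y_r, weights=d_r)
    S_y = np.sqrt(np.average((y_r - ybar) ** 2, weights=d_r))         # (5.6)
    R_ym = np.average((m_r - mbar) * (y_r - ybar), weights=d_r) / (S_y * S_m)
    B = np.linalg.solve(M, (d_r[:, None] * x_r).T @ y_r)              # B_x of (2.8)
    R2_yx = np.average((x_r @ B - ybar) ** 2, weights=d_r) / S_y ** 2 # (7.8)
    H0 = -R_ym * cv_m                                                 # (5.11)
    return {"H0": H0, "H1": abs(H0), "H2": np.sqrt(R2_yx) * cv_m, "H3": cv_m}

# Preference ranking of a list of candidate x-vectors by, say, H1:
# best = max(candidates, key=lambda xs: indicator_values(d, xs, y_r, resp)["H1"])
```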

Scenario 1: Focus on a specific y-variable. The bias remaining in the calibration estimator depends on the y-variable; some are more bias prone than others. We identify one specific y-variable deemed to be highly important in the survey, and we seek to identify an x-vector that reduces the bias for this variable as much as possible (if more than one y-variable needs to be taken into account, a compromise must be struck, which suggests Scenario 2 below). For this purpose, we use the y-variable dependent indicator $H_1 = |A|/S_y = |R_{y,m}| \times \mathrm{cv}_m$ and choose the x-vector so as to make $H_1$ large. An ad hoc alternative is to use the indicator $H_2 = R_{y,\mathbf{x}} \times \mathrm{cv}_m$, and strive to make it as large as possible.

Scenario 2: The objective is to identify a general purpose x-vector, efficient for all or most y-variables in the survey. This suggests $H_3 = \mathrm{cv}_m$ as a compromise indicator, and to choose the x-vector that maximizes $H_3$. To that same effect, Särndal and Lundström (2005, 2008) used the indicator $S_m^2 = H_3^2/P^2$. They showed that the derived variable $m_k$ in (2.6) can be seen as a predictor of the inverse of the unknown response probability and that choosing the x-vector to make $S_m^2$ large signals a bias reduction in the calibration estimator, irrespective of the y-variable.

For each scenario we can distinguish two procedures:

All vectors procedure: A list of candidate x-vectors is prepared, based on appropriate judgment. We compute the chosen indicator for every candidate x-vector, and settle for the vector that gives the highest indicator value. The resulting x-vector may not be the same for $H_1$ (which targets a specific y-variable) as for $H_3$ (which seeks a compromise for all y-variables in the survey).

Stepwise procedure: There is a pool of available x-variables. We build the x-vector by a stepwise forward (or stepwise backward) selection from among the available x-variables, one variable at a time, using the successive changes (if considered large enough) in the value of the chosen indicator to signal the inclusion (or exclusion) of a given x-variable at a given step. The indicators $H_1$, $H_2$ and $H_3$ do not in general give the same selection of variables. Consider two x-vectors, $\mathbf{x}_{1k}$ and $\mathbf{x}_{2k}$, such that $\mathbf{x}_{2k}$ is made up of $\mathbf{x}_{1k}$ and an additional vector $\mathbf{x}_{+k}$: $\mathbf{x}_{2k} = (\mathbf{x}_{1k}', \mathbf{x}_{+k}')'$. The transition from $\mathbf{x}_{1k}$ to $\mathbf{x}_{2k}$ will increase the value of $H_2$ and $H_3$. In each step of a forward selection procedure we select the variable bringing the largest increase in $H_2$ or $H_3$. But the transition does not guarantee an increased value for the most appropriate indicator, $H_1$. However, $H_1$ may be used in stepwise selection in the manner described in Section 9.

7. Derivations

For a given y-variable and outcome $(s, r)$, we seek an x-vector to make the computable numerator $A = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\mathbf{B}_x$ in the bias ratio (4.3) large, in absolute value. In this section we prove the factorizations $A/S_y = -R_{y,m} \times \mathrm{cv}_m = F \times R_{y,\mathbf{x}} \times \mathrm{cv}_m$ in (5.8) and (5.9). We note first that $\mathrm{cv}_m^2$ is a quadratic form in the vector that contrasts the x-mean in the response set $r$ with the x-mean in the sample $s$. Let
$$\mathbf{D} = \bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d}; \qquad \boldsymbol{\Sigma} = \sum_r d_k \mathbf{x}_k \mathbf{x}_k' \Big/ \sum_r d_k. \qquad (7.1)$$
Then, with $P$ given by (2.1),
$$\mathrm{cv}_m^2 = P^2 \times S_m^2 = \mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D}. \qquad (7.2)$$
This expression follows from (5.3) and a consequence of (2.3), namely,
$$\bar{\mathbf{x}}_{r;d}'\boldsymbol{\Sigma}^{-1}\bar{\mathbf{x}}_{r;d} = \bar{\mathbf{x}}_{r;d}'\boldsymbol{\Sigma}^{-1}\bar{\mathbf{x}}_{s;d} = 1. \qquad (7.3)$$
The vector of covariances with the study variable $y$ is
$$\mathbf{C} = \sum_r d_k(\mathbf{x}_k - \bar{\mathbf{x}}_{r;d})(y_k - \bar{y}_{r;d}) \Big/ \sum_r d_k. \qquad (7.4)$$
We can then write $A$ as a bilinear form:
$$A = \mathbf{D}'\mathbf{B}_x = \mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C} \qquad (7.5)$$
using that $\mathbf{D}'\boldsymbol{\Sigma}^{-1}\bar{\mathbf{x}}_{r;d} = (\bar{\mathbf{x}}_{r;d} - \bar{\mathbf{x}}_{s;d})'\boldsymbol{\Sigma}^{-1}\bar{\mathbf{x}}_{r;d} = 0$ by (7.3).

A useful perspective on $A$ is gained from the geometric interpretation of $\mathbf{C}$ and $\mathbf{D}$ in (7.5) as vectors in the space whose dimension is that of $\mathbf{x}_k$. We have
$$A = \Lambda (\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})^{1/2}(\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^{1/2} \qquad (7.6)$$
where
$$\Lambda = \frac{\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C}}{(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})^{1/2}(\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^{1/2}}. \qquad (7.7)$$
For a specific y-variable and a specific x-vector, the scalar quantities $(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})^{1/2}$ and $(\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^{1/2}$ represent the respective vector lengths of $\mathbf{D}$ and $\mathbf{C}$ (following an orthogonal transformation based on the eigenvectors and eigenvalues of $\boldsymbol{\Sigma}^{-1}$). The scalar quantity $\Lambda$ represents the cosine of the angle between $\mathbf{D}$ (which is independent of $y$) and $\mathbf{C}$ (which depends on $y$); hence $-1 \le \Lambda \le 1$.

When the auxiliary vector $\mathbf{x}_k$ is allowed to expand by adding further available x-variables, both vector lengths $(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})^{1/2}$ and $(\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^{1/2}$ increase. The change in the angle $\Lambda$ may be in either direction; if $|\Lambda|$ stays roughly constant, (7.6) shows that $|A|$ will increase.
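The bilinear-form identity (7.5), and with it the factorization (5.8), can be checked numerically. The following short script is not from the paper; the simulated data and names are illustrative, and the identity holds exactly because the x-vector contains an intercept so that (2.3) is satisfied:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = np.column_stack([np.ones(n), rng.gamma(2.0, 5.0, size=(n, 2))])   # x_k with intercept
y = x @ np.array([1.0, 0.7, 0.3]) + rng.normal(scale=2.0, size=n)
d = rng.uniform(1.0, 3.0, size=n)                                     # design weights
resp = rng.random(n) < 0.7                                            # an arbitrary response set
d_r, x_r, y_r = d[resp], x[resp], y[resp]

D = np.average(x_r, axis=0, weights=d_r) - np.average(x, axis=0, weights=d)
Sigma = (d_r[:, None] * x_r).T @ x_r / d_r.sum()                      # (7.1)
xbar_r = np.average(x_r, axis=0, weights=d_r)
ybar_r = np.average(y_r, weights=d_r)
C = (d_r * (y_r - ybar_r)) @ (x_r - xbar_r) / d_r.sum()               # (7.4)

M = (d_r[:, None] * x_r).T @ x_r
B = np.linalg.solve(M, (d_r[:, None] * x_r).T @ y_r)                  # B_x of (2.8)
A_direct = D @ B                                                      # (x-bar_r - x-bar_s)' B_x
A_bilinear = D @ np.linalg.solve(Sigma, C)                            # D' Sigma^{-1} C, eq. (7.5)
print(np.isclose(A_direct, A_bilinear))                               # True
```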


A second useful perspective on $A$ follows by decomposing the total variability of the study variable $y$, $\sum_r d_k(y_k - \bar{y}_{r;d})^2 = (\sum_r d_k)S_y^2$. Two regression fits need to be examined, the one of $y$ on the auxiliary vector $\mathbf{x}$, and the one of $y$ on the derived variable $m$ defined by (2.6). To each fit corresponds a decomposition of $S_y^2$ into explained y-variation and residual y-variation. The two explained portions have important links to the bias ratio (4.3). Result 7.1 summarizes the two decompositions.

Result 7.1. For a given survey outcome $(s, r)$, let $\mathbf{D}$, $\boldsymbol{\Sigma}$ and $\mathbf{C}$ be given by (7.1) and (7.4). Then the proportion of the y-variance $S_y^2$ explained by the regression of $y$ on $\mathbf{x}$ is
$$R_{y,\mathbf{x}}^2 = (\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})/S_y^2. \qquad (7.8)$$
The coefficient of correlation between $y$ and the univariate predictor $m$ is
$$R_{y,m} = -(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C})/[(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})^{1/2} \times S_y]. \qquad (7.9)$$
Consequently, the proportion of $S_y^2$ explained by $m$ is
$$R_{y,m}^2 = (\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^2/[(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D}) \times S_y^2]. \qquad (7.10)$$
The proportions $R_{y,\mathbf{x}}^2$ and $R_{y,m}^2$ satisfy $R_{y,m}^2 \le R_{y,\mathbf{x}}^2 \le 1$.

Proof. The proof of (7.8) uses the weighted least squares regression of $y$ on $\mathbf{x}$ fitted over $r$. The residuals are $y_k - \hat{y}(\mathbf{x})_k$, where $\hat{y}(\mathbf{x})_k = \mathbf{x}_k'\mathbf{B}_x$ with $\mathbf{B}_x$ given by (2.8). The decomposition is
$$\sum_r d_k(y_k - \bar{y}_{r;d})^2 = \sum_r d_k(\hat{y}(\mathbf{x})_k - \bar{y}_{r;d})^2 + \sum_r d_k(y_k - \hat{y}(\mathbf{x})_k)^2.$$
The mixed term is zero. A development of the term "variation explained" gives $\sum_r d_k(\hat{y}(\mathbf{x})_k - \bar{y}_{r;d})^2 = (\sum_r d_k)\,\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C}$. Thus the proportion of variance explained is $R_{y,\mathbf{x}}^2 = \sum_r d_k(\hat{y}(\mathbf{x})_k - \bar{y}_{r;d})^2/[(\sum_r d_k)S_y^2] = \mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C}/S_y^2$, as claimed in (7.8). To show (7.9) we note that the covariance (5.7) can be written with the aid of (7.5) as
$$\mathrm{Cov}(y, m) = -A/P = -\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C}/P.$$
It then follows from (7.2) that $R_{y,m} = \mathrm{Cov}(y, m)/(S_y S_m)$ has the expression (7.9). The residuals from the regression (with intercept) of $y$ on the univariate explanatory variable $m$ are $y_k - \hat{y}(m)_k$ with $\hat{y}(m)_k = \bar{y}_{r;d} + B_m(m_k - \bar{m}_{r;d})$ and $B_m = \mathrm{Cov}(y, m)/S_m^2 = -P(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C})/(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})$. The proportion of variance explained is $\sum_r d_k(\hat{y}(m)_k - \bar{y}_{r;d})^2/[(\sum_r d_k)S_y^2]$, which upon development gives the expression for $R_{y,m}^2$ in (7.10). Finally, $R_{y,m}^2 \le R_{y,\mathbf{x}}^2$ follows from the Cauchy-Schwarz inequality for a bilinear form: $(\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C})^2 \le (\mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{D})(\mathbf{C}'\boldsymbol{\Sigma}^{-1}\mathbf{C})$.

The inequality $R_{y,m}^2 \le R_{y,\mathbf{x}}^2 \le 1$ can also be deduced by the fact that, among all predictions $\hat{y}_k = \mathbf{x}_k'\boldsymbol{\beta}$ that are linear in the x-vector, those that maximize the variance explained are $\hat{y}(\mathbf{x})_k = \mathbf{x}_k'\mathbf{B}_x$, so the predictions $\hat{y}(m)_k$, which are linear in $\mathbf{x}_k$ via $m_k$, cannot yield a greater variance explained than that maximum.

Now from (7.9), (7.2) and (7.5), $-R_{y,m}\,\mathrm{cv}_m = \mathbf{D}'\boldsymbol{\Sigma}^{-1}\mathbf{C}/S_y = A/S_y$, as claimed by formula (5.8). Moreover, (7.7), (7.8) and (7.9) imply $-R_{y,m}/R_{y,\mathbf{x}} = \Lambda$, so the correlation coefficient ratio $F$ in (5.9) equals the angle $\Lambda$ defined by (7.7).

8. Comments: Goodness of fit, properties of the bias and a related selection procedure

Three issues are examined in this section: (i) The relationship between bias and goodness of fit, (ii) the linear relation between the expected value of $A = \hat{N}^{-1}(\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}})$ and the bias of $\tilde{Y}_{\mathrm{CAL}}$ or $\hat{Y}_{\mathrm{CAL}}$, and (iii) the alternative method for selection of auxiliary variables proposed by Schouten (2007).

For the issue (i), recall that the total deviation in Section 4 is $T = A + R$, where $A$ is computable but $T$ and $R$ are not. If computable, $R = \hat{N}^{-1}(\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}})$ would be an estimate of the bias of $\tilde{Y}_{\mathrm{CAL}}$ (and of that of $\hat{Y}_{\mathrm{CAL}}$). A small $R$ is desirable. The question arises: Is this achieved when $y_k = \boldsymbol{\beta}'\mathbf{x}_k + \varepsilon_k$ (with a given vector $\mathbf{x}_k$) fits the data well? We need to distinguish two aspects: (a) The computable fit to the data $(y_k, \mathbf{x}_k)$ observed for $k \in r$; and (b) the hypothetical fit to the data $(y_k, \mathbf{x}_k)$ for $k \in s$, some observed, some not.

A good fit for the respondents, $k \in r$, does not guarantee a small $R$: The weighted LSQ fit using the observed data $(y_k, \mathbf{x}_k)$ for $k \in r$ gives the residuals $e_{k|r;d} = y_k - \mathbf{x}_k'\mathbf{B}_{x|r;d}$, computable for $k \in r$, with the property $\sum_r d_k e_{k|r;d} = 0$ (here, the detailed notation $\mathbf{B}_{x|r;d}$ specified in (2.8) is preferable to the simplified notation $\mathbf{B}_x$). For $k \in s - r$, $e_{k|r;d}$ is not computable; it has an unknown non-zero mean $\bar{e}_{s-r;d} = \sum_{s-r} d_k e_{k|r;d} / \sum_{s-r} d_k$. We have
$$R = (\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}})/\hat{N} = -(1 - P)\,\bar{e}_{s-r;d} \ne 0. \qquad (8.1)$$
Regardless of whether the fit is good (small residuals $e_{k|r;d}$; $R_{y,\mathbf{x}}^2$ near one) or poor (large residuals $e_{k|r;d}$; $R_{y,\mathbf{x}}^2$ near zero), the deviation $R$ given by (8.1) may be large, and $\tilde{Y}_{\mathrm{CAL}}$ far from unbiased. Even with a perfect fit for the respondents ($e_{k|r;d} = 0$ for all $k \in r$, and $R_{y,\mathbf{x}}^2 = 1$), there is no guarantee that the bias is small.

A similar inadequacy affects imputation based on the respondent data. If the regression imputations $\hat{y}_k = \mathbf{x}_k'\mathbf{B}_{x|r;d}$ are used to fill in for the values $y_k$ missing for $k \in s - r$, the imputed estimator is
$$\hat{Y}_{\mathrm{imp}} = \sum_r d_k y_k + \sum_{s-r} d_k \hat{y}_k.$$
Then $\hat{Y}_{\mathrm{imp}} = \tilde{Y}_{\mathrm{CAL}}$, so $\hat{Y}_{\mathrm{imp}}$ has the same exposure to bias as $\tilde{Y}_{\mathrm{CAL}}$, as is easily understood: When the nonresponse causes a skewed selection of y-values, the imputed values computed on that skewed selection will misrepresent the unknown y-values that characterize the sample $s$ or the population $U$.
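The identity $\hat{Y}_{\mathrm{imp}} = \tilde{Y}_{\mathrm{CAL}}$ is easy to confirm numerically. The following self-contained check is not from the paper; the toy data and names are illustrative, and $y$ is available for the whole sample only because the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = np.column_stack([np.ones(n), rng.gamma(2.0, 5.0, size=n)])    # x_k = (1, x_k)'
y = 2.0 + 0.5 * x[:, 1] + rng.normal(scale=3.0, size=n)
d = rng.uniform(1.0, 4.0, size=n)                                  # design weights
resp = rng.random(n) < 0.6                                         # response set r

d_r, x_r, y_r = d[resp], x[resp], y[resp]
M = (d_r[:, None] * x_r).T @ x_r
B = np.linalg.solve(M, (d_r[:, None] * x_r).T @ y_r)               # B_{x|r;d}, eq. (2.8)

m = x @ np.linalg.solve(M, (d[:, None] * x).sum(axis=0))           # m_k, eq. (2.6)
Y_cal = (d_r * m[resp] * y_r).sum()                                # tilde-Y_CAL, eq. (2.5)

Y_imp = (d * np.where(resp, y, x @ B)).sum()                       # impute x_k'B for k in s - r
print(np.isclose(Y_cal, Y_imp))                                    # True: hat-Y_imp = tilde-Y_CAL
```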


Consider now the aspect (b) of the fit, that is, the hypothetical weighted LSQ regression fit to the data $(y_k, \mathbf{x}_k)$ for $k \in s$. The regression coefficient vector would be $\mathbf{B}_{x|s;d} = (\sum_s d_k \mathbf{x}_k \mathbf{x}_k')^{-1}\sum_s d_k \mathbf{x}_k y_k$, and the residuals $e_{k|s;d} = y_k - \mathbf{x}_k'\mathbf{B}_{x|s;d}$ for $k \in s$ satisfy $\sum_s d_k e_{k|s;d} = 0$. Using that $\sum_r d_k m_k \mathbf{x}_k / \hat{N} = \bar{\mathbf{x}}_{s;d}$ and $\sum_r d_k m_k y_k / \hat{N} = \bar{\mathbf{x}}_{s;d}'\mathbf{B}_{x|r;d}$, we have
$$R = \hat{N}^{-1}(\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}}) = (1/\hat{N})\sum_r d_k m_k e_{k|s;d}. \qquad (8.2)$$
Suppose the model is "true for the sample $s$", with a perfect fit, so that $e_{k|s;d} = 0$ for all $k \in s$. Then, by (8.2) we do have $R = 0$, so the nonresponse adjusted estimator $\tilde{Y}_{\mathrm{CAL}}$ agrees with the unbiased estimator $\tilde{Y}_{\mathrm{FUL}}$. A belief that the bias is small hinges on an unverifiable assumption.

Turning to the issue (ii), we now explain the essentially linear relation between the bias of $\tilde{Y}_{\mathrm{CAL}}$ and the expected value of the indicator $H_0 = A/S_y = (\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{CAL}})/(\hat{N}S_y)$. For a given outcome $(s, r)$, a fixed y-variable and a fixed x-vector we have
$$(\tilde{Y}_{\mathrm{CAL}} - Y)/(\hat{N}S_y) = (\tilde{Y}_{\mathrm{EXP}} - Y)/(\hat{N}S_y) - H_0.$$
Let $E_{pq}$ denote the expectation operator with respect to all outcomes $(s, r)$, that is, $E_{pq}(\cdot) = E_p(E_q(\cdot\,|\,s))$, where $p(s)$ and $q(r\,|\,s)$ are, respectively, the known sampling design and the unknown response distribution. We denote $\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}}) = E_{pq}(\tilde{Y}_{\mathrm{CAL}}) - Y$, $\mathrm{bias}(\tilde{Y}_{\mathrm{EXP}}) = E_{pq}(\tilde{Y}_{\mathrm{EXP}}) - Y$ and $C = E_{pq}(\hat{N}S_y)$. Using the usual large sample replacement of the expected value of a ratio by the ratio of the expected values, we have $E_{pq}[(\tilde{Y}_{\mathrm{CAL}} - Y)/(\hat{N}S_y)] \approx [E_{pq}(\tilde{Y}_{\mathrm{CAL}} - Y)]/E_{pq}(\hat{N}S_y)$, and analogously for $\tilde{Y}_{\mathrm{EXP}}$, so
$$\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}}) \approx \mathrm{bias}(\tilde{Y}_{\mathrm{EXP}}) - C \times E_{pq}(H_0). \qquad (8.3)$$
Here $\mathrm{bias}(\tilde{Y}_{\mathrm{EXP}})$ and $C$ do not depend on the choice of x-vector, whereas $\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}})$ and $E_{pq}(H_0)$ do. Therefore, as the x-vector changes, $\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}})$ and $E_{pq}(H_0)$ are essentially linearly related. No particular forms of $p(s)$ and $q(r\,|\,s)$ need to be specified for (8.3) to hold. As a consequence, when two auxiliary vectors, $\mathbf{x}_{1k}$ and $\mathbf{x}_{2k}$, are compared, the difference in bias is, to close approximation, proportional to the change in the expected value of $H_0$:
$$\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}}(\mathbf{x}_{1k})) - \mathrm{bias}(\tilde{Y}_{\mathrm{CAL}}(\mathbf{x}_{2k})) \approx -C(E_1 - E_2) \qquad (8.4)$$
where $E_i = E_{pq}(H_0(\mathbf{x}_{ik}))$ for $i = 1, 2$. The properties (8.3) and (8.4) are validated by the Monte Carlo study in Section 10.

Note that formula (8.3) does not guarantee that $\tilde{Y}_{\mathrm{CAL}}$ based on a certain vector $\mathbf{x}_k$ will have zero or near-zero bias. It does not state that a comparatively large value of $|A|$ guarantees a small bias in $\tilde{Y}_{\mathrm{CAL}}$. What (8.3) says is that $\mathrm{bias}(\tilde{Y}_{\mathrm{CAL}})$ is linearly related to the expectation of the indicator $H_0 = A/S_y$. Therefore, to assess available x-vectors in terms of the indicator $H_0$ (or the indicator $H_1 = |A|/S_y$) is consistent with the objective of bias reduction.

Turning to the issue (iii), we comment on the alternative method for selection of auxiliary variables proposed by Schouten (2007). His indicator for the step-by-step selection of variables differs from our indicators; it will usually not select exactly the same set of variables. In a list of say 30 available categorical x-variables, the first ten to enter will not be the same set of ten as with our indicators $H_0$ to $H_3$. The order in which variables are selected will not necessarily be the same either. In some of our empirical work, we compared with the variable selection realized by Schouten's method. In some cases we noted a considerable congruence between the two sets of "first ten" picked in the two procedures.

The differences between the two approaches are best appreciated by a comparison of their background and derivation. Our indicators $H_0$ and $H_1$ originate in the notion of separation (or distance), for a given outcome $(s, r)$, between the adjusted estimator $\tilde{Y}_{\mathrm{CAL}}$ and the primitive one, $\tilde{Y}_{\mathrm{EXP}}$, and in the idea that this separation will ordinarily increase when the x-vector becomes more powerful. The probability sampling design is taken into consideration; no assumptions are made on the response distribution.

Schouten uses a superpopulation argument; sampling weights do not appear to enter into consideration. An expression for the model-expected bias of an estimator of the population mean is found to be proportional to the correlation (at the level of the population) between the y-variable and the 0-1 indicator for response. It is shown that this correlation (and consequently the bias) can be bounded inside an interval. In particular, the generalized regression estimator is considered and it is shown that its maximum absolute bias equals the width of the bias interval. This width depends on the true unknown regression vector $\boldsymbol{\beta}$ for the regression (at the population level) between $y$ and $\mathbf{x}$. This unknown $\boldsymbol{\beta}$ is replaced by an estimate based on the respondents, thus subject to some bias because of the nonresponse. Schouten emphasizes that a missing-at-random assumption is not needed for his method, which is in that respect similar to our method.


9. Auxiliary variable choice for the Swedish pilot survey on gaming and problem gambling

We identified a real survey data set to illustrate the use of the indicators $H_1$, $H_2$ and $H_3$ in building the x-vector. In 2008, the Swedish National Institute of Public Health (Svenska Folkhälsoinstitutet) conducted a pilot survey to study the extent of gambling participation and the characteristics of persons with gambling problems. Sampling and weight calibration were carried out by Statistics Sweden. We illustrate the use of the indicators in this survey, for which a stratified simple random sample $s$ of $n = 2{,}000$ persons was drawn from the Swedish Register of Total Population (RTP). The strata were defined by the cross classification of region of residence by age group. Each of the six regions was defined as a cluster of postal code areas deemed similar in regard to variables such as education level, purchasing power, type of housing, foreign background. The four age groups were defined by the brackets 16-24, 25-34, 35-64 and 65-84.

The overall unweighted response rate was 50.8%. The nonresponse, more or less pronounced in the different domains of interest, interferes with the accuracy objective. An extensive pool of potential auxiliary variables was available for this survey, including variables in the RTP, in the Education Register and a subset of those in another extensive Statistics Sweden data base, LISA. For this illustration, we prepared a data file consisting of 13 selected categorical variables. Twelve of these were designated as x-variables, and one, the dichotomous variable Employed, played the role of the study variable. The values of all variables are available for all units $k \in s$. Response ($k \in r$) or not ($k \in s - r$) to the survey is also indicated in the data file.

Variables that are continuous by nature were used as grouped; all 12 x-variables are thus categorical and of the $\mathbf{x}_k^\circ$ type, as defined in Section 2 (because most of the variables are available for the full population, they are potentially of the type $\mathbf{x}_k^*$, but since the effect on bias is of little consequence, we used them as $\mathbf{x}_k^\circ$ variables). The study variable value, $y_k = 1$ if $k$ is employed and $y_k = 0$ otherwise, is known for $k \in s$, so the unbiased estimate $\tilde{Y}_{\mathrm{FUL}}$ defined by (3.2) can be computed and used as a reference. We also computed $\tilde{Y}_{\mathrm{EXP}}$ defined by (3.1), as well as $\tilde{Y}_{\mathrm{CAL}}$ defined by (2.5) for different x-vectors built by stepwise selection from the pool of 12 x-variables with the aid of the indicators $H_1$, $H_2$ and $H_3$ defined by (5.12).

We carried out forward selection as follows: The auxiliary vector in Step 0 is the trivial $\mathbf{x}_k = 1$, and the estimator is $\tilde{Y}_{\mathrm{EXP}}$. In Step 1, the indicator value is computed for every one of the 12 presumptive auxiliary variables; the variable producing the largest value of the indicator is selected. In Step 2, the indicator value is computed for all 11 vectors of dimension two that contain the variable selected in Step 1 and one of the remaining variables. The variable that gives the largest value for the indicator is selected in Step 2, and so on, in the following steps. A new variable always joins already entered variables in the "side-by-side" (or "+") manner. Interactions are thereby relinquished. The order of selection is different for each indicator.

The values of $H_2$ and $H_3$ that identify the next variable for inclusion are by mathematical necessity increasing in every step. This does not hold for $H_1$. In a certain step $j$, we used the rule to include the x-variable with the largest of the computed $H_1$ values. That value can be smaller than the $H_1$ value that identified the variable entering in the preceding step, $j - 1$. The series of $H_1$ values for inclusion will increase up to a certain step, then begin to decline, as Table 9.1 illustrates.

The unbiased estimate is $\tilde{Y}_{\mathrm{FUL}} = 4{,}265$; the primitive estimate is $\tilde{Y}_{\mathrm{EXP}} = 4{,}719$ (both in thousands). This suggests a large positive bias in $\tilde{Y}_{\mathrm{EXP}}$, whose relative deviation (in %) from $\tilde{Y}_{\mathrm{FUL}}$ is $\mathrm{RDF} = (\tilde{Y}_{\mathrm{EXP}} - \tilde{Y}_{\mathrm{FUL}})/\tilde{Y}_{\mathrm{FUL}} \times 10^2 = 10.7$. Adding categorical x-variables one by one into the x-vector will successively change this deviation, although when a few variables have been admitted, the change is not always in the direction of a smaller value. In each step we computed the indicator, $\tilde{Y}_{\mathrm{CAL}}$ and $\mathrm{RDF} = (\tilde{Y}_{\mathrm{CAL}} - \tilde{Y}_{\mathrm{FUL}})/\tilde{Y}_{\mathrm{FUL}} \times 10^2$.

Table 9.1 shows the stepwise selection with the indicator $H_1$ (the number of categories is given in parentheses for each selected variable). First to enter is the variable Income class; this brings a large reduction in RDF from 10.7 to 4.5. The next five selections take place with increased $H_1$ values, and the value of RDF is reduced, but by successively smaller amounts. Step six, where Marital status is selected, brings about a turning point, indicated by the double line in Table 9.1: The value of $H_1$ then starts to decline, and $\tilde{Y}_{\mathrm{CAL}}$ and RDF start to increase. At step 6, RDF is at its lowest value, 0.5, then starts to rise, illustrating that inclusion of all available x-variables may not be best. The turning point of $H_1$ and the point at which RDF is closest to zero happen to agree in this example. This is not generally the case. Moreover, in a real survey setting, RDF is unknown, as is the step at which RDF is closest to zero.

Table 9.2 shows the stepwise selection with indicator $H_3$. Its value increases at every step, but at a rate that levels off, and successive changes in $\tilde{Y}_{\mathrm{CAL}}$ become negligible. This suggests stopping after six steps, at which point RDF = 2.8. In none of the 12 steps does RDF come as close to zero as the value RDF = 0.5 obtained with $H_1$ after six steps. In this respect $H_1$ is better than $H_3$, in this example. With all 12 x-variables selected, RDF attains in both tables the final value 2.6.
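The forward-selection loop just described can be sketched in a few lines. The sketch below is not from the paper; it relies on the hypothetical indicator_values() helper sketched after (5.12), and the data layout and names are illustrative:

```python
import numpy as np

def forward_selection(d_s, x_blocks, y_r, resp, indicator="H1", max_steps=None):
    """Stepwise forward selection of categorical x-variables (Section 9).
    x_blocks maps variable name -> (n, J) dummy matrix. Selected blocks are
    joined side by side (the '+' manner); the first dummy column of each added
    block is dropped so that, with the intercept, the calibration matrix stays
    nonsingular. Relies on indicator_values() sketched after (5.12)."""
    chosen, path = [], []
    remaining = dict(x_blocks)
    base = np.ones((len(d_s), 1))                      # Step 0: x_k = 1
    while remaining and (max_steps is None or len(chosen) < max_steps):
        scores = {}
        for name, block in remaining.items():
            cand = np.hstack([base] +
                             [x_blocks[v][:, 1:] for v in chosen] +
                             [block[:, 1:]])
            scores[name] = indicator_values(d_s, cand, y_r, resp)[indicator]
        best = max(scores, key=scores.get)             # largest indicator value enters
        chosen.append(best)
        path.append((best, scores[best]))
        del remaining[best]
    return path                                        # selection order with indicator values
```

With indicator="H1" the successive scores may eventually decline, exactly as in Table 9.1; with "H3" they increase at every step, as in Table 9.2.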


Table 9.1
Stepwise forward selection, indicator H1, dichotomous study variable Employed. Successive values of H1 x 10^3, of Y_CAL in thousands, and of RDF = (Y_CAL - Y_FUL)/Y_FUL x 10^2. For comparison, Y_EXP = 4,719 and Y_FUL = 4,265 (thousands)

Auxiliary variable entered           H1 x 10^3   Y_CAL (thousands)   RDF
Income class (3)                         76          4,458           4.5
Education level (3)                     107          4,350           2.0
Presence of children (2)                114          4,326           1.4
Urban centre dwelling (2)               118          4,310           1.1
Sex (2)                                 123          4,296           0.7
Marital status (2)                      125          4,286           0.5
====================================================================
Days unemployed (3)                     121          4,301           0.9
Months with sickness benefits (3)       120          4,305           1.0
Level of debt (3)                       115          4,322           1.3
Cluster of postal codes (6)             109          4,343           1.8
Country of birth (2)                    103          4,363           2.3
Age class (4)                            99          4,377           2.6

Table 9.2
Stepwise forward selection, indicator H3, dichotomous study variable Employed. Successive values of H3 x 10^3, of Y_CAL in thousands, and of RDF = (Y_CAL - Y_FUL)/Y_FUL x 10^2. For comparison, Y_EXP = 4,719 and Y_FUL = 4,265 (thousands)

Auxiliary variable entered           H3 x 10^3   Y_CAL (thousands)   RDF
Education level (3)                     186          4,520           6.0
Cluster of postcode areas (6)           250          4,505           5.6
Country of birth (2)                    281          4,498           5.5
Income class (3)                        298          4,369           2.4
Age class (4)                           354          4,399           3.1
Sex (2)                                 364          4,384           2.8
Urban centre dwelling (2)               374          4,378           2.6
Level of debt (3)                       381          4,364           2.3
Months with sickness benefits (3)       384          4,380           2.7
Presence of children (2)                387          4,379           2.7
Marital status (2)                      388          4,379           2.7
Days unemployed (3)                     388          4,377           2.6

The set of the first six variables to enter with $H_3$ has three in common with the corresponding set of six with $H_1$. There is no contradiction in the quite different selection patterns, because $H_1$ is geared to the specific y-variable Employed, while $H_3$ is a compromise indicator, independent of any y-variable. To save space, the step-by-step results for indicator $H_2$ are not shown. Its selection pattern resembles more that of $H_3$ than that of $H_1$. Out of the first six variables to enter with $H_2$, four are among the first six with $H_3$. As a general comment, we believe that in many practical situations the use of more than six variables is unnecessary, and the selection of the first few becomes crucially important.

10. Empirical validation by simulation for a constructed population

The theory presented in earlier sections makes no assumptions on the response distribution. It is unknown. The sampling design is arbitrary; its known inclusion probabilities are taken into account. For the study in this section, we specify several different response distributions with a specified positive value for the response probability $\theta_k$ for every $k \in U$. That is, with specified probability $\theta_k$, the value $y_k$ gets recorded in the experiment; with probability $1 - \theta_k$, it goes missing. We find that the indicator $H_0$ (or $H_1 = |H_0|$) defined in (5.11) ranks the different x-vectors in the correct order of preference for all participating response distributions, consistent with the theoretical results (8.3) and (8.4). We confirm that, over a long run of outcomes $(s, r)$, the average of $H_0 = A/S_y = -R_{y,m} \times \mathrm{cv}_m$ tracks the bias of the calibration estimator, measured by the average of $\tilde{Y}_{\mathrm{CAL}} - Y$, in an essentially perfectly linear manner, when the x-vector moves through 16 different formulations. We also examine the indicators $H_2$ and $H_3$ defined in (5.12), and find in this experiment that they also have a strong relationship to the bias of $\tilde{Y}_{\mathrm{CAL}}$.

We experimented with several created populations; the conclusions were similar. We report here results for one constructed population of size $N = 6{,}000$, with created values $(y_k, \mathbf{x}_k, \theta_k)$ for $k = 1, 2, \ldots, N = 6{,}000$, for 16 alternative categorical formulations of $\mathbf{x}_k$, and four different ways to assign the $\theta_k$.

The 16 alternative categorical auxiliary x-vectors were obtained by grouping the generated values $x_{1k}$ and $x_{2k}$ of two continuous auxiliary variables, $x_1$ and $x_2$. The values $(y_k, x_{1k}, x_{2k})$ for $k = 1, 2, \ldots, 6{,}000$ were created in three steps as follows. Step 1 (the variable $x_1$): The 6,000 values $x_{1k}$ were obtained as independent outcomes of the gamma distributed random variable $\Gamma(a, b)$ with parameter values $a = 2$, $b = 5$. The mean and variance of the 6,000 realized values $x_{1k}$ were 10.0 and 49.9, respectively. Step 2 (the variable $x_2$): For unit $k$, with value $x_{1k}$ fixed by Step 1, a value $x_{2k}$ is realized as an outcome of the gamma random variable with parameters such that the conditional expectation and variance of $x_{2k}$ are $\alpha + \beta x_{1k} + K\,h(x_{1k})$ and $\sigma^2 x_{1k}$, respectively, where $h(x_{1k}) = x_{1k}(x_{1k} - \bar{x}_1)(x_{1k} - 3\bar{x}_1)$ with $\bar{x}_1 = 10$. We used the values $\alpha = 1$, $\beta = 1$, $K = 0.001$ and $\sigma^2 = 25$. The polynomial term $K\,h(x_{1k})$ gives a mild nonlinear shape to the plot of $(x_{2k}, x_{1k})$, to avoid an exactly linear relationship. The mean and variance of the 6,000 realized values $x_{2k}$ were 11.0 and 210.0, respectively. The correlation coefficient between $x_1$ and $x_2$, computed on the 6,000 couples $(x_{1k}, x_{2k})$, was 0.48. Step 3 (the study variable $y$): For unit $k$, with values $x_{1k}$ and $x_{2k}$ fixed by Steps 1 and 2, a value $y_k$ is realized as an outcome of the gamma random variable with parameters such that the conditional expectation and variance of $y_k$ are $c_0 + c_1 x_{1k} + c_2 x_{2k}$ and $\sigma_0^2(c_1 x_{1k} + c_2 x_{2k})$, respectively. We used $c_0 = 1$, $c_1 = 0.7$, $c_2 = 0.3$ and $\sigma_0^2 = 2$. The mean and the variance of the 6,000 realized values $y_k$ were 11.4 and 86.5, respectively. The correlation coefficient between $y$ and $x_1$, computed on the 6,000 couples $(y_k, x_{1k})$, was 0.76; that between $y$ and $x_2$, computed on the 6,000 couples $(y_k, x_{2k})$, was 0.73.
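A sketch of Steps 1 to 3 follows. It is not the authors' code; the gamma draws are parameterized here by shape and scale derived from the stated conditional mean and variance, which is one possible generation scheme, and the reconstruction of h(x1) follows the definition given above:

```python
import numpy as np

rng = np.random.default_rng(2010)
N = 6000

# Step 1: x1 ~ Gamma(a = 2, scale b = 5), mean 10
x1 = rng.gamma(shape=2.0, scale=5.0, size=N)

# Step 2: x2 | x1 gamma with mean alpha + beta*x1 + K*h(x1) and variance sigma2*x1
alpha, beta, K, sigma2 = 1.0, 1.0, 0.001, 25.0
h = x1 * (x1 - 10.0) * (x1 - 30.0)            # h(x1) with x1-bar = 10
mu2, var2 = alpha + beta * x1 + K * h, sigma2 * x1
x2 = rng.gamma(shape=mu2 ** 2 / var2, scale=var2 / mu2)

# Step 3: y | x1, x2 gamma with mean c0 + c1*x1 + c2*x2 and variance sigma0^2*(c1*x1 + c2*x2)
c0, c1, c2, sigma0_2 = 1.0, 0.7, 0.3, 2.0
mu_y, var_y = c0 + c1 * x1 + c2 * x2, sigma0_2 * (c1 * x1 + c2 * x2)
y = rng.gamma(shape=mu_y ** 2 / var_y, scale=var_y / mu_y)
```

A re-generated population will reproduce the stated moments only approximately, since the realized values depend on the random seed.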


Each of the two x-variables was then transformed into four alternative group modes, denoted 8G, 4G, 2G and 1G, yielding $4 \times 4 = 16$ different auxiliary vectors $\mathbf{x}_k$. The 6,000 values $x_{1k}$ of variable $x_1$ were size ordered; eight equal-sized groups were formed. Group 1 consists of the units with the 750 largest values $x_{1k}$, group 2 consists of the next 750 units in the size ordering, and so on, ending with group 8. In this mode 8G of $x_1$, unit $k$ is assigned the vector value $\boldsymbol{\gamma}_{(x_1;8)k}$, of dimension eight with seven entries "0" and a single entry "1" to code the group membership of $k$. Next, successive group mergers are carried out, so that two adjoining groups always define a new group, every time doubling the group size. Thus for mode 4G, the merger of groups 1 and 2 puts the units with the 1,500 largest $x_{1k}$ values into a first new group; groups 3 and 4 merge to form the second new group of 1,500, and so on; the vector value associated with unit $k$ is $\boldsymbol{\gamma}_{(x_1;4)k}$. In mode 2G, unit $k$ has the vector value $\boldsymbol{\gamma}_{(x_1;2)k} = (1, 0)'$ for the 3,000 largest $x_1$ value units and $\boldsymbol{\gamma}_{(x_1;2)k} = (0, 1)'$ for the rest. In the ultimate mode, 1G, all 6,000 units are put together, all $x_1$ information is relinquished, and $\gamma_{(x_1;1)k} = 1$ for all $k$. The 6,000 values $x_{2k}$ were transformed by the same procedure into the group modes 8G, 4G, 2G and 1G. Corresponding group membership of unit $k$ is coded by the vectors $\boldsymbol{\gamma}_{(x_2;8)k}$, $\boldsymbol{\gamma}_{(x_2;4)k}$, $\boldsymbol{\gamma}_{(x_2;2)k}$ and $\gamma_{(x_2;1)k} = 1$. The $4 \times 4 = 16$ different auxiliary vectors $\mathbf{x}_k$ take into account both kinds of group information; the two $\boldsymbol{\gamma}$ vectors are placed side by side (as opposed to crossed), the result being a calibration on two margins, as indicated by the "+" sign.

Thus for the case denoted 8G + 8G, unit $k$ has the auxiliary vector value $\mathbf{x}_k = (\boldsymbol{\gamma}_{(x_1;8)k}', \boldsymbol{\gamma}_{(x_2;8)k(-1)}')'$, where $(-1)$ indicates that one category is excluded in either $\boldsymbol{\gamma}_{(x_1;8)k}$ or $\boldsymbol{\gamma}_{(x_2;8)k}$ to avoid a singular matrix in the computation, giving $\mathbf{x}_k$ the dimension 8 + 8 - 1 = 15. The case 8G + 8G has the highest information content. At the other extreme, the case 1G + 1G disregards all the x-information and $\mathbf{x}_k = 1$ for all $k$. There are 14 intermediate cases of information content. For example, 4G + 2G has $\mathbf{x}_k = (\boldsymbol{\gamma}_{(x_1;4)k}', \boldsymbol{\gamma}_{(x_2;2)k(-1)}')'$ of dimension 4 + 2 - 1 = 5; 4G + 1G has $\mathbf{x}_k = (\boldsymbol{\gamma}_{(x_1;4)k}', 1)_{(-1)}' = \boldsymbol{\gamma}_{(x_1;4)k}$ of dimension 4 (there is non-negligible interaction between $x_1$ and $x_2$ in this experiment, but we restrict the experiment to x-vectors without interactions, causing no risk of small group counts).

We discuss here the results for four response distributions. Their response probabilities $\theta_k$, $k = 1, 2, \ldots, N = 6{,}000$, were specified as follows:

IncExp$(10 + x_1 + x_2)$, with $\theta_k = 1 - e^{-c(10 + x_{1k} + x_{2k})}$, where $c = 0.04599$
IncExp$(10 + y)$, with $\theta_k = 1 - e^{-c(10 + y_k)}$, where $c = 0.06217$
DecExp$(x_1 + x_2)$, with $\theta_k = e^{-c(x_{1k} + x_{2k})}$, where $c = 0.01937$
DecExp$(y)$, with $\theta_k = e^{-c y_k}$, where $c = 0.03534$.

The constant $c$ was adjusted in all four cases to give a mean response probability of $\bar{\theta}_U = \sum_U \theta_k / N = 0.70$. In the first two, the value 10 (rather than 0) was used to avoid a high incidence of small response probabilities $\theta_k$. These four options represent contrasting features for the response probabilities: increasing as opposed to decreasing, dependent on x-values only as opposed to dependent on y-values only. In the second and fourth option, the response is directly y-variable dependent, and could hence be called "purely nonignorable".

We generated $J = 5{,}000$ outcomes $(s, r)$, where $s$ of size $n = 1{,}000$ is drawn from $N = 6{,}000$ by simple random sampling and, for every given $s$, the response set $r$ is realized by each of the four response distributions. That is, for $k \in s$, a Bernoulli trial was carried out with the specified probability $\theta_k$ of inclusion in the response set $r$. The Bernoulli trials are independent.

For each response distribution, for each of the 16 x-vectors, and for every outcome $(s, r)$, we computed the relative deviation $\mathrm{RD} = (\hat{Y}_{\mathrm{CAL}} - Y)/Y$, where $\hat{Y}_{\mathrm{CAL}}$ is given by (2.4) and $Y = \sum_U y_k$ is the targeted y-total, known in this experimental setting (alternatively, we used $\tilde{Y}_{\mathrm{CAL}}$ given by (2.5) but, as expected, the difference in bias compared with $\hat{Y}_{\mathrm{CAL}}$ is negligible). We also computed the indicators $H_i$, $i = 0, 1, 2, 3$, given by (5.11) and (5.12). Summary measures were computed as
$$\mathrm{relbias} = \mathrm{Av(RD)} = \frac{1}{J}\sum_{j=1}^{J}\mathrm{RD}_j; \qquad \mathrm{Av}(H_i) = \frac{1}{J}\sum_{j=1}^{J}H_{ij} \quad \text{for } i = 0, 1, 2, 3$$
where $j$ indicates the value computed for the $j$-th outcome, $j = 1, 2, \ldots, 5{,}000 = J$. For each response distribution, we thus obtain the value relbias (which is the Monte Carlo measure of the relative bias $(E_{pq}(\hat{Y}_{\mathrm{CAL}}) - Y)/Y$) and 16 values of $\mathrm{Av}(H_i)$ (which is the Monte Carlo measure of $E_{pq}(H_i)$), $i = 0, 1, 2, 3$, where $p$ stands for simple random sampling, and $q$ stands for one of the four response distributions.
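A sketch of one cell of the Monte Carlo loop follows. It is not the authors' code; it reuses the population x1, x2, y generated in the previous sketch and the hypothetical calibration_weights() and indicator_values() helpers sketched earlier, and all names are illustrative. The calibration is done at the sample level, as in (2.5), which the text notes gives a negligible difference in bias from (2.4):

```python
import numpy as np

def group_dummies(v, G):
    """Size-ordered grouping of v into G equal-sized groups, dummy coded."""
    order = np.argsort(np.argsort(-v))        # 0 = largest value of v
    grp = order * G // len(v)                 # labels 0..G-1, equal group sizes
    return np.eye(G)[grp]

# One of the 16 auxiliary vectors, e.g. 4G + 2G: margins side by side, one
# column dropped in the second block to avoid a singular calibration matrix
x_design = np.hstack([group_dummies(x1, 4), group_dummies(x2, 2)[:, 1:]])

# One of the four response distributions; the paper's c gives mean(theta) = 0.70
# for the authors' realized population, so it would be re-tuned for new data
theta = 1.0 - np.exp(-0.04599 * (10.0 + x1 + x2))     # IncExp(10 + x1 + x2)

J, n = 5000, 1000
rng = np.random.default_rng(0)
rd, h1 = [], []
for _ in range(J):
    s = rng.choice(len(y), size=n, replace=False)     # SRS of size n from U
    resp = rng.random(n) < theta[s]                   # independent Bernoulli response
    d = np.full(n, len(y) / n)                        # SRS design weights
    w = calibration_weights(d, x_design[s], resp)     # sample-level calibration, as in (2.5)
    rd.append((w * y[s][resp]).sum() / y.sum() - 1.0) # relative deviation RD
    h1.append(indicator_values(d, x_design[s], y[s][resp], resp)["H1"])
relbias, av_H1 = np.mean(rd), np.mean(h1)             # summary measures
```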


The result (8.4), holding for any response distribution and any sampling design, states that the indicator H0 will rank the 4 × 4 = 16 auxiliary vectors correctly for any response distribution (with response probabilities not all constant, as noted below). Table 10.1 illustrates (8.4) in terms of H1 = |H0|: the change, from any one cell to any other, in the value of Av(H1) (the Monte Carlo estimate of the expected value of H1) is accompanied by a proportional change in the value of relbias. The same proportionality was noted for the other three response distributions. We could have chosen other response distributions to illustrate the same property.

Table 10.1
Relbias in % and, within parentheses, the value of Av(H1)×10³, for 16 auxiliary vectors x_k. Response distribution IncExp(10 + x1 + x2)

Groups based on x1k    Groups based on x2k
                        8G           4G           2G           1G
8G                      0.2 (101)    0.5 (99)     1.3 (93)     3.4 (76)
4G                      0.5 (98)     0.9 (96)     1.8 (89)     4.1 (70)
2G                      1.5 (91)     1.9 (88)     3.2 (78)     6.5 (52)
1G                      4.1 (70)     5.0 (64)     7.3 (46)     13.2 (0)

The response distribution with a constant response probability θ_k for all k is a special case. The calibration estimator $\tilde{Y}_{\mathrm{CAL}}$ based on any vector $x_k$ then has zero bias (very nearly), and this includes the primitive estimator $\tilde{Y}_{\mathrm{EXP}}$ with $x_k = 1$. Result 8.3 continues to be valid, stating in that case that $E_{pq}(H_0) \approx \mathrm{bias}(\tilde{Y}_{\mathrm{CAL}}) \approx \mathrm{bias}(\tilde{Y}_{\mathrm{EXP}}) \approx 0$. In the context of the simulation in this section, if θ_k = 0.70 for all k is taken to be an additional response distribution, Table 10.1 will in all 16 cells show nearly zero values of both relbias in % and Av(H1)×10³, from the weakest cell (1G + 1G) all the way to the cell of the most powerful x-vector (8G + 8G). There is no bias to be removed by an improvement of the x-vector. If in practice the indicator H1 does not react to an enlargement of the x-vector, there is no incentive to seek beyond the simplest vector formulation. It could signify one of three possibilities: the y-variable in question is not subject to nonresponse bias, the response probability is almost constant, or none of the available x-vectors is capable of reducing an existing bias.

To save space we do not show the corresponding tables for Av(H2) and Av(H3). By mathematical necessity, both quantities increase in the nested transitions. Not shown either are the counterparts of Table 10.1 for the other three response distributions. The patterns are similar.

Table 10.2 for IncExp(10 + x1 + x2) and Table 10.3 for IncExp(10 + y) show how Av(H1), Av(H2) and Av(H3) rank the 16 x-vectors, represented by their value of relbias. To measure the success of ranking, we computed the Spearman rank correlation coefficient, denoted rancor, between relbias and the value of the indicator, based on the 16 values of each. For Av(H1), the bottom line of the two tables shows |rancor| = 1, for perfect ranking. For these data, |rancor| is near one also for Av(H2) and Av(H3) (more generally, the ranking obtained with H2 and H3 may be good, but is data dependent).

Table 10.2
Value, in ascending order, of relbias in %, and corresponding value and rank of Av(H1)×10³, Av(H2)×10³ and Av(H3)×10³, for 16 auxiliary vectors. Bottom line: value of the Spearman rank correlation, rancor. Response distribution IncExp(10 + x1 + x2)

relbias    Av(H1)×10³    Av(H2)×10³    Av(H3)×10³
 0.2       101 (1)       127 (1)       232 (1)
 0.5        99 (2)       119 (2)       225 (2)
 0.5        98 (3)       118 (3)       224 (3)
 0.8        96 (4)       109 (4)       217 (4)
 1.3        93 (5)       109 (5)       216 (5)
 1.5        91 (6)       105 (6)       213 (6)
 1.8        89 (7)        98 (7)       207 (7)
 1.9        88 (8)        94 (8)       205 (8)
 3.2        78 (9)        80 (11)      192 (9)
 3.4        76 (10)       90 (9)       188 (11)
 4.1        70 (11)       84 (10)      190 (10)
 4.1        70 (12)       77 (12)      175 (13)
 5.0        64 (13)       70 (13)      179 (12)
 6.4        52 (14)       52 (14)      146 (15)
 7.3        46 (15)       46 (15)      156 (14)
13.2         0 (16)        0 (16)        0 (16)
Rancor      1.00          0.99          0.99
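As a small illustration of the rancor computation described above, the following Python sketch feeds the relbias and Av(H)×10³ columns of Table 10.2 to SciPy's Spearman rank correlation. SciPy resolves the tied values by average ranks, a convention the article does not state, so the printed absolute correlations should match the table's bottom line (1.00, 0.99, 0.99) up to rounding.

```python
from scipy.stats import spearmanr

# Columns of Table 10.2 (response distribution IncExp(10 + x1 + x2)),
# listed in ascending order of relbias, exactly as in the table.
relbias = [0.2, 0.5, 0.5, 0.8, 1.3, 1.5, 1.8, 1.9,
           3.2, 3.4, 4.1, 4.1, 5.0, 6.4, 7.3, 13.2]
av_H = {
    "H1": [101, 99, 98, 96, 93, 91, 89, 88, 78, 76, 70, 70, 64, 52, 46, 0],
    "H2": [127, 119, 118, 109, 109, 105, 98, 94, 80, 90, 84, 77, 70, 52, 46, 0],
    "H3": [232, 225, 224, 217, 216, 213, 207, 205, 192, 188, 190, 175, 179, 146, 156, 0],
}

for name, values in av_H.items():
    rho, _ = spearmanr(relbias, values)   # negative: Av(H) falls as relbias grows
    print(f"|rancor| for Av({name}): {abs(rho):.2f}")
```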

There is one notable contrast between the results on relbias for the two response distributions in Tables 10.2 and 10.3. The best among the auxiliary vectors leave considerably more bias for the nonignorable IncExp(10 + y) than for IncExp(10 + x1 + x2). This is not unexpected, and it is important to note that considerable bias reduction is obtained for the nonignorable case as well.

In the simulation, the overadjustment mentioned in Section 4, A > T > 0 (when $\tilde{Y}_{\mathrm{EXP}}$ has positive bias) or A < T < 0 (when $\tilde{Y}_{\mathrm{EXP}}$ has negative bias), happens for some outcomes (s, r). The frequency varies with the strength of the auxiliary vector and is different for different response distributions. The cell for which this overadjustment is most likely to occur is 8G + 8G, the most powerful of the 16 auxiliary vectors. For IncExp(10 + x1 + x2), the bias is almost completely removed for cell 8G + 8G; relbias is only 0.2%. Hence $\tilde{Y}_{\mathrm{CAL}}$ is close to the unbiased $\tilde{Y}_{\mathrm{FUL}}$, A is near T, and A > T happened for 45.6% of all outcomes (s, r). By contrast, for the nonignorable case IncExp(10 + y), the incidence of A > T was only 0.1% for the cell 8G + 8G. Although that cell brings considerable bias reduction (compared to the primitive 1G + 1G), there is bias remaining, and as a consequence, A > T almost never happens.
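The overadjustment frequencies quoted above (45.6% versus 0.1%) are simply the share of Monte Carlo outcomes in which A overshoots T. The sketch below shows that tally only; A and T stand for the per-outcome values of the quantities defined in Section 4 of the article, whose formulas are not reproduced here, and the arrays in the example are hypothetical stand-ins chosen to mimic a case where calibration removes nearly all of a positive bias.

```python
import numpy as np

def overadjustment_rate(A, T):
    """Share of Monte Carlo outcomes (s, r) showing the overadjustment described
    above: A > T > 0, or A < T < 0. A and T hold one value per outcome of the
    quantities defined in Section 4 (formulas not reproduced in this sketch)."""
    A, T = np.asarray(A, float), np.asarray(T, float)
    over = ((T > 0) & (A > T)) | ((T < 0) & (A < T))
    return over.mean()

# Hypothetical per-outcome values, only to illustrate the call: when the x-vector
# removes nearly all of a positive bias, A hovers around T and overshoots it in
# roughly half of the outcomes, in the spirit of the 45.6% quoted for 8G + 8G.
rng = np.random.default_rng(0)
T = np.full(5000, 0.05)
A = T + rng.normal(0.0, 0.01, size=5000)
print(f"Overadjustment in {100 * overadjustment_rate(A, T):.1f}% of outcomes")
```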


We do not show the corresponding tables for DecExp(x1 + x2) and DecExp(y). The lowest value of rancor was 0.94, recorded for Av(H3) in the case of DecExp(x1 + x2).

Table 10.3
Value, in ascending order, of relbias in %, and corresponding value and rank of Av(H1)×10³, Av(H2)×10³ and Av(H3)×10³, for 16 auxiliary vectors. Bottom line: value of the Spearman rank correlation, rancor. Response distribution IncExp(10 + y)

relbias    Av(H1)×10³    Av(H2)×10³    Av(H3)×10³
 3.6        74 (1)        91 (1)       165 (1)
 3.9        71 (2)        84 (2)       158 (2)
 4.0        71 (3)        83 (3)       156 (3)
 4.3        68 (4)        76 (5)       149 (5)
 4.4        68 (5)        78 (4)       153 (4)
 4.9        64 (6)        68 (8)       142 (8)
 4.9        63 (7)        72 (6)       146 (6)
 5.3        60 (8)        69 (7)       143 (7)
 5.4        60 (9)        64 (9)       137 (9)
 6.0        55 (10)       59 (10)      132 (10)
 6.2        53 (11)       54 (11)      128 (11)
 7.2        46 (12)       54 (12)      122 (12)
 7.9        41 (13)       41 (14)      111 (13)
 7.9        40 (14)       43 (13)      109 (14)
 9.6        27 (15)       27 (15)       90 (15)
13.1         0 (16)        0 (16)        0 (16)
Rancor      1.00          0.99          0.99

A question not addressed in Tables 10.2 and 10.3 is: How often, over a long series of outcomes (s, r), does a given indicator $H(x_k)$ succeed in pointing correctly to the preferred x-vector? To answer this, let $x_{1k}$ and $x_{2k}$ be two vectors selected for comparison. If the absolute value of the bias of $\hat{Y}_{\mathrm{CAL}}(x_{2k})$ is smaller than that of $\hat{Y}_{\mathrm{CAL}}(x_{1k})$, we would like to see that $H(x_{2k}) \ge H(x_{1k})$ holds for a vast majority of all outcomes (s, r), because then the indicator $H(\cdot)$ delivers with high probability the correct decision to prefer $x_{2k}$. Because $H(x_k)$ has sampling variability, its success rate (the rate of correct indication) depends on the sample size, and we expect it to increase with sample size.

We threw some light on this question by extending the Monte Carlo experiment: 5,000 outcomes (s, r) were realized, first with sample size n = 1,000, then with sample size n = 2,000 (the response set r is realized according to one of the four response distributions, declaring unit k "responding" as the result of a Bernoulli trial with the specified probability θ_k). We computed the success rate as the proportion of all outcomes (s, r) in which the correct indication materializes in a confrontation of two different x-vectors. Several pairwise comparisons of this kind were carried out. Typical results are shown in Table 10.4, for IncExp(10 + x1 + x2). The upper entry in a table cell shows the success rate in % for n = 1,000; the lower entry shows that rate for n = 2,000. Shown in parentheses is the value of relbias for the vectors in question.
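The success-rate computation just described can be organized as a small Monte Carlo harness; the Python sketch below shows one possible layout. It is illustrative only: success_rate, X_a, X_b and theta are hypothetical names (theta being an array of the response probabilities θ_k for the N population units), and the indicator itself (H1, H2 or H3 of the article) is passed in as a function rather than implemented, since its defining formulas are not reproduced in this section.

```python
import numpy as np

def success_rate(indicator, X_a, X_b, theta, n, a_is_better,
                 n_outcomes=5000, seed=0):
    """Monte Carlo rate of correct indication in a pairwise comparison of two
    auxiliary matrices X_a, X_b (one row per population unit).

    `indicator(sample_idx, respondent_idx, X) -> float` is a stand-in for the
    article's H1, H2 or H3, larger values meaning "prefer this x-vector".
    `a_is_better` states which vector truly leaves the smaller absolute relbias
    (known in a simulation, where the full population is available)."""
    rng = np.random.default_rng(seed)
    N = len(theta)
    correct = 0
    for _ in range(n_outcomes):
        s = rng.choice(N, size=n, replace=False)     # SRS outcome s
        r = s[rng.random(n) < theta[s]]              # Bernoulli response set r
        prefers_a = indicator(s, r, X_a) >= indicator(s, r, X_b)
        correct += (prefers_a == a_is_better)
    return correct / n_outcomes

# Intended call pattern, with an implementation of the article's H1 built on the
# calibration weights (not reproduced here); in the 4G + 2G vs. 2G + 8G test the
# second vector leaves the smaller relbias, hence a_is_better=False:
#   rate = success_rate(H1, X_4G_2G, X_2G_8G, theta, n=1000, a_is_better=False)
```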

"Severe tests" are preferred, that is, confrontations of vectors with a small difference in absolute relbias, because the correct decision is then harder to obtain. There is a priori no reason why one of the indicators should always outperform the others in this study. In the five severe tests in Table 10.4, H1 has, on the whole, better success rates than H2 and H3. The success rate of H1 improves by doubling the sample size, and tends as expected to be greater when the relbias values are further apart. The case 4G + 8G vs. 8G + 8G compares nested x-vectors, so it is known beforehand that H2 and H3 give perfect success rates.

Table 10.4
Selected pairwise comparisons of auxiliary vectors; percentage of outcomes with correct indication, for the indicators H1, H2 and H3. Within parentheses, relbias in %. Upper entry: n = 1,000; lower entry: n = 2,000. Response distribution IncExp(10 + x1 + x2)

Cells compared          Percent of outcomes with correct indication
                        H1       H2       H3
4G + 8G (0.5) vs.       90.0     100.0    100.0
8G + 8G (0.2)           96.4     100.0    100.0
4G + 2G (1.8) vs.       66.8      86.0     70.7
2G + 8G (1.5)           74.2      89.0     67.4
1G + 8G (4.1) vs.       74.3      70.3     45.0
8G + 1G (3.4)           82.8      78.0     43.3
4G + 1G (4.1) vs.       90.6      61.4     83.9
2G + 2G (3.2)           97.0      68.8     92.3
1G + 2G (7.3) vs.       77.4      77.4     34.5
2G + 1G (6.5)           85.9      85.9     28.8

11. Concluding remarks

In this article, we address survey situations where many alternative auxiliary vectors (x-vectors) can be created and considered for use in the calibration estimator $\tilde{Y}_{\mathrm{CAL}}$. For any given x-vector, a certain unknown bias remains in $\tilde{Y}_{\mathrm{CAL}}$; we wish by an appropriate choice of x-vector to make that bias as small as possible. Hence we examine the bias ratio defined by (4.2) and (4.3). The component A of the bias ratio was expressed, in (5.8) to (5.10), as a product of easily interpreted statistical measures. This led us to suggest several alternative bias indicators, for use in evaluating different x-vectors in regard to their capacity to effectively reduce the bias. We studied in particular the indicator H1 given by (5.12). It functions very well but is geared to a particular study variable y. However, a typical government survey has many study variables, and for practical reasons it is desirable to use the same x-vector in estimating all y-totals. A compromise becomes necessary. We argued that the indicator H3 in (5.12) suits this purpose; it depends on the x_k but not on any y-data.

A topic for further research is to develop other indicators (than H3) for the "many y-variable situation". Another topic for further work is to examine algorithms for stepwise selection of x-variables with the indicator H1, other than the one used in Section 9.

Acknowledgements

The authors are grateful to the referees and to the Associate Editor for comments contributing to an improvement of this paper.

References

Deville, J.-C. (2002). La correction de la non-réponse par calage généralisé. Actes des Journées de Méthodologie, I.N.S.E.E., Paris.

Eltinge, J., and Yansaneh, I. (1997). Diagnostics for the formation of nonresponse adjustment cells with an application to income nonresponse in the U.S. Consumer Expenditure Survey. Survey Methodology, 23, 33-40.

Kalton, G., and Flores-Cervantes, I. (2003). Weighting methods. Journal of Official Statistics, 19, 81-98.

Kott, P.S. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology, 32, 133-142.

Rizzo, L., Kalton, G. and Brick, J.M. (1996). A comparison of some weighting adjustment methods for panel nonresponse. Survey Methodology, 22, 43-53.

Särndal, C.-E., and Lundström, S. (2005). Estimation in Surveys with Nonresponse. New York: John Wiley & Sons, Inc.

Särndal, C.-E., and Lundström, S. (2008). Assessing auxiliary vectors for control of nonresponse bias in the calibration estimator. Journal of Official Statistics, 24, 251-260.

Schouten, B. (2007). A selection strategy for weighting variables under a not-missing-at-random assumption. Journal of Official Statistics, 23, 51-68.

Schouten, B., Cobben, F. and Bethlehem, J. (2009). Indicators for the representativeness of survey response. Survey Methodology, 35, 101-113.

Thomsen, I., Kleven, Ø., Wang, J.H. and Zhang, L.-C. (2006). Coping with decreasing response rates in Statistics Norway. Recommended practice for reducing the effect of nonresponse. Reports 2006/29. Oslo: Statistics Norway.
