Small area estimation I

Monica Pratesi and Caterina Giusti

Department of Economics and Management, University of Pisa

1st EMOS Spring School Trier, Pisa, Manchester, Luxembourg, 23-27 March 2015

M. Pratesi, C. Giusti Small area estimation I Structure of the Presentation

1 Rising interest in measuring progress at a local level

2 An overview of SAE methods

3 Expansion and GREG estimators

4 Empirical Best Linear Unbiased Predictor

5 M-Quantile Approach


Part I

A short introduction to the small area estimation problem

Demand for statistics at local level

During the last decade there has been a rising demand for (objective and subjective) progress and well-being indicators.

These measures play a central role for policy makers, who use them to plan their policies and to verify their effectiveness.

Some examples of indicators:
- Poverty indicators: at-risk-of-poverty rate, household equivalised income, etc.
- Labour market indicators: unemployment rate, job satisfaction, etc.
- Health indicators: average lifespan, share of population with dangerous behaviours (obesity, smoking, etc.)

To be informative and effective, these indicators should be computed at the appropriate level of disaggregation.

Indicators can be disaggregated along various dimensions, including geographic areas, demographic groups, income/consumption groups and social groups.

Demand for statistics at local level

To gather data to compute the indicators of interest for a given population there are two main solutions: Sample Surveys

Sample surveys are recognized as a cost-effective means of obtaining information on wide-ranging topics of interest at frequent intervals over time.

Indeed, the available data to measure progress and well-being in Europe come mainly from sample surveys, such as the Survey on Income and Living Conditions (EU-SILC).

Demand for statistics at local level

Data coming from sample surveys can be used to compute accurate direct estimates only for large domains.

Direct estimation uses only domain-specific data:
1 it suffers from low precision when the domain sample size is small
2 it is not applicable when the domain sample size is zero

For example, EU-SILC data can be used to produce accurate direct estimates only at the NUTS* 2 level (that is, the regional level in Italy)

* The geocode NUTS refers to the EU Nomenclature of Territorial Units for Statistics

Introduction to Small Area Estimation

To satisfy the increasing demand for statistical estimates on progress and well-being referring to smaller domains (LAU 1 and LAU 2 levels, that is provinces and municipalities in Italy) using sample survey data, we need to resort to small area methodologies.

Small area methodologies try to fill the gap between official statistics and the local demand for data.

What is a small area? What is small area estimation?

Small Area Estimation (SAE) is concerned with the development of statistical procedures for producing efficient (precise) estimates for small areas, that is for domains with small or zero sample sizes. Domains are defined by the cross-classification of geographical districts by social/economic/demographic characteristics. The target is the estimation of a parameter (average/percentile/proportion/rate) and the estimation of the corresponding prediction error

Introduction to Small Area Estimation: Target parameters

The estimates of interest in the small areas are usually means or totals. Examples:
- Mean individual gross income in a set of areas
- Total crop production in a set of areas

However, other estimates can also be of interest. Examples:
- The quantiles of the household (equivalised) income
- The proportion of households with income below the poverty line

Introduction to Small Area Estimation: Example

- Target population: households living in an Italian Region
- Variable of interest: income or other poverty measures
- Survey sample: EU-SILC (European Union Statistics on Income and Living Conditions), designed to obtain reliable estimates at the Regional level in Italy
  - planned design domains: Regions
  - unplanned design domains: e.g. Provinces, Municipalities

EU-SILC sample size in Tuscany: 1751 households
- Pisa province: 158 households → need SAE
- Grosseto province: 70 households → need SAE

Introduction to Small Area Estimation: Example

US sample sizes for a sample of 10,000 persons selected with an equal probability of selection method

State        1994 Population (thousands)   Sample size
California                        31,431          1207
Texas                             18,378           706
New York                          18,169           698
...
DC                                   570            22
Wyoming                              476            18

Suppose we measure customer satisfaction with a government service:
- California: 24.86% → confidence interval of 22.4%-27.3% (reliable)
- Wyoming: 33.33% → confidence interval of 10.9%-55.7% (unreliable → need SAE)
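The two intervals quoted above are consistent with a normal-approximation interval with z = 1.96 and a standard error that uses n − 1 in the denominator. The sketch below reproduces them under that assumption. The counts of satisfied respondents (300 out of 1207 and 6 out of 18) are not given in the slides: they are hypothetical values chosen to match the quoted percentages.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (assumed form, n - 1 denominator)."""
    p = successes / n
    se = math.sqrt(p * (1.0 - p) / (n - 1))
    return p, p - z * se, p + z * se

# Hypothetical counts chosen to match 24.86% and 33.33%
california = proportion_interval(300, 1207)
wyoming = proportion_interval(6, 18)

for name, (p, low, high) in [("California", california), ("Wyoming", wyoming)]:
    print(f"{name}: {100 * p:.2f}% ({100 * low:.1f}%-{100 * high:.1f}%)")
```

With 18 respondents the interval is about nine times wider than with 1207, which is the motivation for borrowing strength from other areas.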

SAE: data requirements

- Survey data: available for the target variable y and for the auxiliary variable x, related to y
- Administrative data: available for x but not for y

SAE in 3 steps

1 Use survey data to estimate models that link y to x
2 Combine the estimated model parameters with x for out-of-sample units, to form predictions
3 Use these predictions to estimate the target parameters
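The three steps can be sketched as follows. This is a minimal illustration with one covariate and an ordinary least squares line, with no area effect; the function names and the toy numbers are assumptions, not part of the lesson material.

```python
def fit_line(x, y):
    """Step 1: estimate a model linking y to x (ordinary least squares, one covariate)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - slope * mean_x, slope

def area_mean(sample_y, nonsample_x, intercept, slope):
    """Steps 2 and 3: predict y for out-of-sample units, then average over the area."""
    predictions = [intercept + slope * xi for xi in nonsample_x]
    n_area = len(sample_y) + len(predictions)
    return (sum(sample_y) + sum(predictions)) / n_area

# Toy data: y is observed only for the sampled units, x is known for everyone
intercept, slope = fit_line([1.0, 2.0, 3.0], [5.0, 8.0, 11.0])
estimate = area_mean([5.0, 8.0, 11.0], [4.0, 5.0], intercept, slope)
print(intercept, slope, estimate)
```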

SAE: classifications

There are several classifications of SAE methods. Some of these classifications overlap.

Direct vs Indirect methods
- Direct methods use only domain-specific data
- Indirect methods borrow information from all the data

SAE: classifications

Design-based vs Model-based methods
- Design-based (Model-assisted) methods
  - Direct estimation
  - Can allow for the use of models (model-assisted)
  - Inference is under the randomization distribution
- Model-based methods
  - Borrow strength by using a model
  - Estimation using frequentist or Bayesian approaches
  - Inference under the model is conditional on the selected sample

SAE: classifications

Area-level vs Unit-level methods
- Area-level models relate small area direct estimators to area-specific covariates. Such models are necessary if unit (or element) level data are not available.
- Unit-level models relate the unit values of a study variable to unit-specific covariates.

How a SAE Unit Level Model Works

A classification of the Small Area Estimation Methods

Focus of the lesson

- We will focus on model-assisted and model-based methods
- For the model-based estimators we will consider two alternative approaches to SAE:
  - small area estimation using multilevel (mixed/random effects) models → unit and area-level approaches
  - small area estimation based on M-quantile models → unit level approach
- We will present some examples based on real data
- We will present R code to obtain the estimates of the examples


Part II

Classic and new approaches to model-based Small Area Estimation

Definitions

Design-based estimation: the main focus is on design unbiasedness. Estimators are unbiased with respect to the randomization that generates the survey data.

- Finite population $\Omega = \{1, \ldots, N\}$
- $y$: variable of interest, with $y_i$ the value of the $i$-th unit of the finite population
- Statistics of interest: e.g. the total, $Y = \sum_{\Omega} y_i$, or the mean, $\bar{Y} = Y/N$
- Sample $s = \{1, \ldots, n\}$
- $p(s)$: probability of selecting the sample $s$ from the population $\Omega$. $p(s)$ depends on known design variables such as stratum indicators and size measures of clusters

Definitions

- The bias of an estimator $\hat{\theta}$ is defined as $B[\hat{\theta}] = E[\hat{\theta} - \theta]$
- The variance of an estimator $\hat{\theta}$ is defined as $V[\hat{\theta}] = E[(\hat{\theta} - E[\hat{\theta}])^2]$
- The Mean Squared Error of an estimator $\hat{\theta}$ is $E[(\hat{\theta} - \theta)^2] = V[\hat{\theta}] + B[\hat{\theta}]^2$

Design bias: $Bias(\hat{Y}) = E_p[\hat{Y}] - Y$

Design variance: $V(\hat{Y}) = E_p[(\hat{Y} - E_p[\hat{Y}])^2]$

Design-based properties
1 Design-unbiasedness: $E_p[\hat{Y}] = \sum_s p(s)\hat{Y}_s = Y$
2 Design-consistency: $\hat{Y} \rightarrow Y$ in probability
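The design expectation is a sum over all possible samples, so for a tiny population it can be checked by brute force. The sketch below assumes simple random sampling without replacement (every sample has the same $p(s)$) and a hypothetical population of four values; it computes the design expectation, variance, MSE and bias of the expansion estimator of the total.

```python
from itertools import combinations

def design_moments(population, n, estimator):
    """Design expectation, variance, MSE and bias under SRS without replacement."""
    N = len(population)
    samples = list(combinations(population, n))
    p = 1.0 / len(samples)            # equal probability p(s) for every sample
    target = sum(population)          # population total Y
    estimates = [estimator(s, N) for s in samples]
    expectation = sum(p * e for e in estimates)
    variance = sum(p * (e - expectation) ** 2 for e in estimates)
    mse = sum(p * (e - target) ** 2 for e in estimates)
    return expectation, variance, mse, expectation - target

def expansion_total(sample, N):
    """Expansion estimator of the total under SRS: (N / n) times the sample sum."""
    return N / len(sample) * sum(sample)

population = [2.0, 4.0, 6.0, 10.0]   # hypothetical values, Y = 22
expectation, variance, mse, bias = design_moments(population, 2, expansion_total)
print(expectation, variance, mse, bias)
```

The same function can be called with any other estimator to see how a design bias enters the MSE.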

Estimation of Means: Expansion Estimator

Data $\{y_i\}$, $i \in s$

Expansion estimator for the mean:

$$\hat{\bar{Y}} = \frac{\sum_{i \in s} w_i y_i}{\sum_{i \in s} w_i}$$

$w_i = \pi_i^{-1}$ is the basic design weight, where $\pi_i$ is the probability of selecting unit $i$ in sample $s$

Remark: the weights $w_i$ do not depend on $y_i$
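As a small illustration, the expansion estimator is a weighted mean with weights equal to the inverse inclusion probabilities. The function name and the numbers below are assumptions made for the example.

```python
def expansion_mean(y, inclusion_prob):
    """Weighted mean of y with design weights w_i = 1 / pi_i."""
    weights = [1.0 / p for p in inclusion_prob]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Unequal inclusion probabilities: the unit with pi = 0.25 counts twice as much
unequal = expansion_mean([10.0, 20.0], [0.5, 0.25])
# Equal inclusion probabilities: the estimator reduces to the sample mean
equal = expansion_mean([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(unequal, equal)
```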

Estimation of Means: GREG Estimator

Data $\{y_i, x_i\}$, $i \in s$

$X = (X_1, \ldots, X_q)'$, population totals for $q$ auxiliary variables

Generalized REGression (GREG) estimator:

$$\hat{\bar{Y}}_{GR} = \hat{\bar{Y}} + (\bar{X} - \hat{\bar{X}})'\hat{\beta}$$

$$\hat{\beta} = \Big(\sum_{i \in s} w_i x_i x_i'/c_i\Big)^{-1}\Big(\sum_{i \in s} w_i x_i y_i/c_i\Big)$$

where $c_i$ is a specified positive constant.

Note: $\hat{\bar{Y}}_{GR} = \sum_{i \in s} w_i^* y_i$, where $w_i^* = w_i^*(s) = w_i(s) g_i(s)$, with

$$g_i(s) = 1 + (\bar{X} - \hat{\bar{X}})'\Big(\sum_{i \in s} w_i x_i x_i'/c_i\Big)^{-1} x_i/c_i$$
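A minimal sketch of the GREG estimator of the mean for the special case of a single auxiliary variable (so that $\hat{\beta}$ is a scalar) is given below. The function name, the default $c_i = 1$ and the toy data are assumptions; with several covariates the two sums become a matrix and a vector.

```python
def greg_mean(y, x, weights, pop_mean_x, c=None):
    """GREG estimator of the mean with one auxiliary variable."""
    if c is None:
        c = [1.0] * len(y)                      # assumed default: c_i = 1
    w_sum = sum(weights)
    y_hat = sum(w * yi for w, yi in zip(weights, y)) / w_sum   # expansion estimator of the mean of y
    x_hat = sum(w * xi for w, xi in zip(weights, x)) / w_sum   # expansion estimator of the mean of x
    beta = sum(w * xi * yi / ci for w, xi, yi, ci in zip(weights, x, y, c)) / sum(
        w * xi * xi / ci for w, xi, ci in zip(weights, x, c)
    )
    # Correct the expansion estimator by the known gap in the auxiliary mean
    return y_hat + (pop_mean_x - x_hat) * beta

# Toy data with y = 3 x exactly: the estimator recovers 3 times the population mean of x
estimate = greg_mean([3.0, 6.0, 12.0], [1.0, 2.0, 4.0], [10.0, 10.0, 20.0], pop_mean_x=3.0)
print(estimate)
```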

Domain Estimation

Partition the population $\Omega$ into $m$ domains:

$$\Omega = \cup_{i=1}^{m} \Omega_i$$

$\Omega_i = \{1, \ldots, N_i\}$, population of domain $i$

$s_i = \{1, \ldots, n_i\}$, sample of domain $i$

Statistics of interest for the variable $y$:
- $Y_i = \sum_{\Omega_i} y_j$, the total of domain $i$
- $\bar{Y}_i = Y_i/N_i$, the mean of domain $i$

Domain Estimation: Expansion Estimator

Data $\{y_{ij}\}$, $j \in s_i$, $i = 1, \ldots, m$

Expansion estimator of the mean for domain $i$:

$$\hat{\bar{Y}}_i = \frac{\sum_{j \in s_i} w_{ij} y_{ij}}{\sum_{j \in s_i} w_{ij}}$$

where $y_{ij}$ is the observed value and $w_{ij}$ is the weight for unit $j$ in area $i$.

The case of simple random sampling:

$$\pi_{ij} = \frac{\binom{1}{1}\binom{N_i - 1}{n_i - 1}}{\binom{N_i}{n_i}} = \frac{n_i}{N_i} \;\rightarrow\; w_{ij} = \pi_{ij}^{-1} = \Big(\frac{n_i}{N_i}\Big)^{-1} = \frac{N_i}{n_i}$$

$$\hat{\bar{Y}}_i = \frac{\sum_{j=1}^{n_i} w_{ij} y_{ij}}{\sum_{j=1}^{n_i} w_{ij}} = \frac{\frac{N_i}{n_i}\sum_{j=1}^{n_i} y_{ij}}{\frac{N_i}{n_i} n_i} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$$

(that is, the sample mean)

Domain Estimation: Expansion Estimator

- $\hat{\bar{Y}}_i$ is design unbiased
- $\hat{V}(\hat{\bar{Y}}_i) = \Big(1 - \frac{n_i}{N_i}\Big)\frac{S_i^2}{n_i}$, where $S_i^2 = \frac{\sum_{j \in s_i}(y_{ij} - \bar{y}_i)^2}{n_i - 1}$ is the sample variance
- The magnitude of the variance depends on 3 factors: $n_i/N_i$, $S_i^2$ and $n_i$

If $n_i$ is small the design variance is likely to be large. In such a situation, estimation of the variance is even more problematic.
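The variance estimator above can be computed directly; the sketch below uses hypothetical data and a hypothetical domain size to show the role of $n_i$, $N_i$ and $S_i^2$.

```python
import statistics

def srs_domain_mean_variance(y_sample, domain_size):
    """Estimated design variance of the domain sample mean under SRS."""
    n_i = len(y_sample)
    s2 = statistics.variance(y_sample)   # sample variance with n_i - 1 in the denominator
    return (1.0 - n_i / domain_size) * s2 / n_i

# Hypothetical domain: 4 sampled units out of N_i = 40
v_hat = srs_domain_mean_variance([2.0, 4.0, 6.0, 8.0], 40)
print(v_hat)
```

With $n_i = 4$ the sample variance itself rests on only 3 degrees of freedom, which illustrates why variance estimation becomes unstable in small domains.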

Domain Estimation: GREG Estimator

Data $\{y_{ij}, x_{ij}\}$, $j \in s_i$

$X_i = (X_{i1}, \ldots, X_{iq})'$, population totals for $q$ auxiliary variables in domain $i$

Generalized REGression estimator:

$$\hat{\bar{Y}}_{i,GR} = \hat{\bar{Y}}_i + (\bar{X}_i - \hat{\bar{X}}_i)'\hat{\beta}_i$$

$$\hat{\beta}_i = \Big(\sum_{j \in s_i} w_{ij} x_{ij} x_{ij}'/c_{ij}\Big)^{-1}\Big(\sum_{j \in s_i} w_{ij} x_{ij} y_{ij}/c_{ij}\Big)$$

where $c_{ij}$ is a specified positive constant.

Remark: The GREG estimator attempts to improve on the expansion estimator by borrowing strength from relevant sources (e.g. covariates) through an appropriate adjustment to the basic weights.

Remark: The GREG estimator remains approximately design-unbiased, but (should) improve on the design variance (and thereby the MSE)

GREG Estimator, Example

- We present here an example of how to apply the Generalized Regression estimator to obtain small area estimates of the area means
- The target parameter is the mean forest biomass per hectare (ha) within the 14 municipalities (small areas) of the forest in Vestfold County (Norway)
- The data are from the Norwegian National Forest Inventory, which provides estimates of forest parameters on national and regional scales by means of a systematic network of permanent sample plots
- The data set is public and detailed information on the data is available in Breidenbach and Astrup (2012)

GREG Estimator, Example

- Data on forest biomass per hectare (Biomass/ha) are available for 145 sample plots
- Auxiliary data on mean canopy height are also available from digital aerial images
- The following table shows the first six lines (out of 145) of the sample plots of the Norwegian National Forest Inventory

Figure : Norwegian National Forest Inventory sample data

GREG Estimator, Example

The relationship between the target and the auxiliary data available from the sample is represented in the following Figure

Figure : Scatterplot of the Biomass/ha vs Mean canopy height sample data

GREG Estimator, Example

- Data on the mean canopy height are also available for all the population elements; the population here is the forest covered by aerial images (for which the mean canopy height values are available)
- Thus, the $N = \sum_{i=1}^{14} N_i = 5402465$ population elements are the 16 · 16 tiles within the forest for which auxiliary variables from the canopy height mean and image data were calculated

Figure : Population data of mean canopy height from digital aerial images

GREG Estimator, Example

Using the data reported in the two previous tables we can obtain small area GREG estimates of the mean forest biomass/ha in the 14 municipalities

Figure : Point and MSE GREG estimates of the mean forest biomass per hectare for the 14 municipalities

GREG: advantages, disadvantages, extensions

EBLUP: Unit Level Approach

Unit level approach to small area estimation

- $y$ is the vector of the $y$ variable for the population $\Omega$
- $y = [y_s', y_r']'$, where $y_s$ is the vector of the observed (sampled) units and $y_r$ is the vector of the $N - n$ non-observed units ($r = 1, \ldots, N - n$)
- $X$ is the covariate matrix and is considered known for all the population units
- Subscript $i$ refers to small areas (e.g. $y_{si}$ is the vector of observed values in area $i$)

Model for the $y$ variable (known as superpopulation model)

$$y = X\beta + Zu + e$$

which can alternatively be written as

$$y_{ij} = x_{ij}'\beta + u_i + e_{ij}$$

EBLUP: Unit Level Approach

Given that
- $u \sim N(0, G)$, $e \sim N(0, R)$ and $u \perp e$
- $R = \sigma_e^2 \Sigma_e$, $G = \sigma_u^2 \Sigma_u$
- $X$ is a full rank matrix (say, of rank $q$) and the rank of $(X : Z_i)$ is greater than $q$
- $n \geq q + m + 1$

it can be shown that
- $V(y) = ZGZ' + R = V$
- $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y_s$ is the BLUE for $\beta$
- $\hat{u} = GZ'V^{-1}(y - X\hat{\beta})$ is the BLUP for $u$
- $y \sim N(X\beta, V)$
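For intuition, the matrix formulas simplify considerably in one special case: an intercept-only model with $\Sigma_e = I$, $\Sigma_u = I$ and variance components treated as known. Under these assumptions (which are ours, not the slides') the BLUE of the intercept is a weighted mean of the area sample means with weights $\gamma_i = \sigma_u^2/(\sigma_u^2 + \sigma_e^2/n_i)$, and the BLUP of $u_i$ is $\gamma_i(\bar{y}_i - \hat{\beta})$. A minimal sketch with hypothetical data:

```python
def blup_random_intercepts(area_samples, sigma2_u, sigma2_e):
    """BLUE of the intercept and BLUP of the area effects, intercept-only model."""
    gammas, means = [], []
    for y in area_samples:
        n_i = len(y)
        gammas.append(sigma2_u / (sigma2_u + sigma2_e / n_i))   # shrinkage weight gamma_i
        means.append(sum(y) / n_i)                              # area sample mean
    beta = sum(g * m for g, m in zip(gammas, means)) / sum(gammas)
    effects = [g * (m - beta) for g, m in zip(gammas, means)]
    return beta, effects

# Two hypothetical areas with 2 and 1 sampled units; variance components assumed known
beta_hat, u_hat = blup_random_intercepts([[1.0, 3.0], [6.0]], sigma2_u=1.0, sigma2_e=2.0)
print(beta_hat, u_hat)
```

The area with a single observation gets the smaller $\gamma_i$, so its deviation from $\hat{\beta}$ is shrunk more strongly.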

EBLUP: Unit Level Approach

The estimator of the mean for the $i$-th area is obtained as a linear combination of observed and unobserved units, as follows

$$\hat{\bar{Y}}_i = \frac{1}{N_i}\Big\{\sum_{j \in s_i} y_{ij} + \sum_{k \in r_i}(x_{ik}'\hat{\beta} + \hat{u}_i)\Big\}$$

The next step is to derive an MSE estimator

$$MSE(\hat{\theta}_i) \approx g_{1i}(\sigma) + g_{2i}(\sigma) + g_{3i}(\sigma) + g_{4i}(\sigma)$$

$$g_{1i}(\sigma) = \alpha_r' Z_r T_s Z_r' \alpha_r$$

$$g_{2i}(\sigma) = [\alpha_r' X_r - \alpha_r' Z_r T_s Z_s' R_s^{-1} X_s](X_s' V^{-1} X_s)^{-1}[X_r' \alpha_r - X_s' R_s^{-1} Z_s T_s Z_r' \alpha_r]$$

$$g_{3i}(\sigma) = tr\{(\nabla(\alpha_r' Z_r \Sigma_u Z_s' V_s^{-1})') V_s (\nabla(\alpha_r' Z_r \Sigma_u Z_s' V_s^{-1})')' E[(\hat{\sigma} - \sigma)(\hat{\sigma} - \sigma)']\}$$

$$g_{4i}(\sigma) = \alpha_r' R_r \alpha_r$$

$$T_s = \Sigma_u - \Sigma_u Z_s'(\Sigma_{es} + Z_s \Sigma_u Z_s')^{-1} Z_s \Sigma_u$$

$$\sigma = (\sigma_e^2, \sigma_u^2/\sigma_e^2)'$$

EBLUP: Unit Level Approach

Finally, the estimator for the MSE of $\hat{\theta}_i$ is

$$\widehat{MSE}(\hat{\theta}_i) = g_{1i}(\hat{\sigma}) + g_{2i}(\hat{\sigma}) + 2g_{3i}(\hat{\sigma}) + g_{4i}(\hat{\sigma})$$

where $\hat{\sigma}$ is an unbiased estimator of $\sigma$.

Remark: it is possible to obtain an estimate of the MSE using alternative techniques, such as bootstrap and jackknife

Example: Estimates of Mean Income in Tuscany Provinces

- Data on the equivalised income in 2005 for 1525 households in the 10 Tuscany Provinces are available from the EU-SILC survey 2006
- A set of explanatory variables is available for each unit in the population from the Population Census 2001
- We employ the EBLUP unit level model to estimate the mean of the household equivalised income
- The Municipality of Florence, with 125 units out of 457 in the Province, is considered as a stand-alone area

Example: Target variable

- Relative poverty measures are related to income or consumption
- Our estimates are based on the household equivalised disposable income (target variable)
- Averages are computed on the household equivalised disposable income
- The household equivalised disposable income is computed as

$$\frac{[\text{Disposable household income}] \cdot [\text{Within-household non-response inflation factor}]}{\text{Equivalised household size}}$$

Example: Target variable

- Disposable household income: the sum for all household members of gross personal income components plus gross income components at household level, minus employer's social insurance contributions, interest paid on mortgage, regular taxes on wealth, regular inter-household cash transfers paid, tax on income and social insurance contributions
- Within-household non-response inflation factor: the factor by which it is necessary to multiply the total gross income, the total disposable income or the total disposable income before social transfers to compensate for non-response in individual questionnaires

Example: Target variable

Equivalised household size:
- Let $HM_{14+}$ be the number of household members aged 14 and over (at the end of the income reference period)
- Let $HM_{13-}$ be the number of household members aged 13 or less (at the end of the income reference period)

$$\text{Equivalised household size} = 1 + 0.5 \cdot (HM_{14+} - 1) + 0.3 \cdot HM_{13-}$$

Remark: in this way we take into account the economies of scale present in a household
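The two formulas for the target variable can be written as a short sketch. The function names and the example household (two adults and two children under 14, disposable income 42000, inflation factor 1) are assumptions made for illustration.

```python
def equivalised_household_size(members_14_plus, members_13_minus):
    """1 for the first member aged 14+, 0.5 for each other member aged 14+, 0.3 per child."""
    return 1.0 + 0.5 * (members_14_plus - 1) + 0.3 * members_13_minus

def equivalised_income(disposable_income, inflation_factor, members_14_plus, members_13_minus):
    """Household disposable income, corrected for non-response, per equivalent member."""
    size = equivalised_household_size(members_14_plus, members_13_minus)
    return disposable_income * inflation_factor / size

size = equivalised_household_size(2, 2)
income = equivalised_income(42000.0, 1.0, 2, 2)
print(size, income)
```

The four-person household counts as 2.1 equivalent members rather than 4, which is how the scale accounts for shared costs.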

Example: auxiliary data

- The selection of covariates to fit the small area model relies on prior studies on poverty assessment
- The following covariates have been selected:
  - household size
  - ownership of dwelling (owner/tenant)
  - age of the head of the household
  - years of education of the head of the household
  - working position of the head of the household (employed/unemployed in the previous week)
- Design-based estimates of the mean income have also been computed in order to show the gain in efficiency of the EBLUP

Example: results

Table : Mean Income Estimates. Tuscany Provinces

Province      EBLUP   Est. RMSE (EBLUP)   Design-Based   Est. RMSE (DB)
Arezzo        17328               816.6          18455           1088.9
Florence M.   18139               806.4          21927           1188.8
Florence      16327               628.2          16347            490.1
Grosseto      16593               937.8          17811           1574.5
Livorno       17111               841.5          20257           3258.4
Lucca         15805               868.7          15780            783.5
Massa         15644               868.0          14814            909.0
Pisa          15950               826.5          16741            755.0
Pistoia       16467               850.1          16852            950.4
Prato         16964               824.9          16715            911.5
Siena         16660               852.0          16926            779.4

Example: results

Figure : Error bars for mean income estimates (vertical axis: Equivalized Income; horizontal axis: Provinces). Black bars represent Design-Based estimates, red bars represent EBLUP estimates

Example: results

Figure : Design-Based Estimates and EBLUP Estimates (two panels; legend values 14813.00, 16019.47, 17323.03, 18732.66, 20258.00)

EBLUP at the unit level: advantages, disadvantages, extensions

EBLUP: Area Level Approach, the Fay-Herriot Model

The area level model includes random area-specific effects and area-specific covariates $x_i$

$$\theta_i = x_i'\beta + z_i u_i, \quad i = 1, \ldots, m$$

- $\theta_i$ is the parameter of interest (e.g. totals $Y_i$ or means $\bar{Y}_i$)
- $z_i$ are known positive constants
- $u_i$ are independent and identically distributed random variables with mean 0 and variance $\sigma_u^2$ ($u_i \sim N(0, \sigma_u^2)$)
- $\beta$ is the vector of regression parameters

EBLUP: Area Level Approach, the Fay-Herriot Model

Assumption: $\hat{\theta}_i = \theta_i + e_i$

- $\hat{\theta}_i$ is a direct design-unbiased estimator
- $e_i$ are independent sampling errors with mean 0 and known variance $\sigma_e^2$

Fay-Herriot Model

$$\hat{\theta}_i = x_i'\beta + z_i u_i + e_i, \quad i = 1, \ldots, m$$

Note: this is a special case of the general linear mixed model with diagonal covariance structure

EBLUP: Area Level Approach, the Fay-Herriot Model

Under the above assumptions

$$\hat{\theta}_i \sim N(x_i'\beta, z_i^2\sigma_u^2 + \sigma_e^2)$$

In matrix notation

- $\hat{\theta} = X\beta + Zu + e$
- $u \sim N(0, G)$ and $e \sim N(0, R)$
- $\hat{\theta} \sim N(X\beta, ZGZ' + R)$; let $V = ZGZ' + R$
- Given the estimates of $\beta$ and $u$ we obtain the Best Linear Unbiased Predictor (BLUP) for $\theta$
- $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$, where $y$ is the vector of the observations
- $\hat{u} = GZ'V^{-1}(y - X\hat{\beta})$

Note: estimates for $\beta$ and $u$ can be obtained by penalized maximum likelihood ($u$ treated as fixed)

EBLUP: Area Level Approach, the Fay-Herriot Model

In the “real world” G (and R) are unknown and must be estimated

- Using restricted maximum likelihood, optimized with a scoring algorithm, we obtain an estimate of $\sigma_u^2$ (and hence of G)
- In the Fay-Herriot model $\sigma_e^2$ is considered as known (we use the sampling variance)
- Plugging the estimated area-specific variance component $\hat{\sigma}_u^2$ into the estimators of $\beta$ and $u$ we obtain their estimates

$$\hat{\beta} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y$$

$$\hat{u} = \hat{G}Z'\hat{V}^{-1}(y - X\hat{\beta})$$

EBLUP: Area Level Approach, the Fay-Herriot Model

Finally, using the obtained estimates in the Fay-Herriot model we have the Empirical BLUP (EBLUP) for the parameter of interest $\theta$

$$\tilde{\theta}_i = x_i'\hat{\beta} + z_i\hat{u}_i = \hat{\phi}_i\hat{\theta}_i + (1 - \hat{\phi}_i)\,x_i'\hat{\beta}$$

- $\hat{\phi}_i = \frac{z_i^2\hat{\sigma}_u^2}{z_i^2\hat{\sigma}_u^2 + \sigma_e^2}$ is the shrinkage factor
- $\hat{\theta}_i$ is the direct design-based estimator of $\theta_i$

Note: with this procedure the estimate of $\sigma_u^2$ can turn out to be negative; in this case it must be truncated to 0
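The shrinkage mechanism can be illustrated with a small sketch of the standard composite form of the Fay-Herriot predictor, $\phi_i\hat{\theta}_i + (1 - \phi_i)x_i'\hat{\beta}$. Several simplifications are assumed here and are not taken from the slides: a single covariate without intercept, $z_i = 1$, area-specific sampling variances passed in as known, and $\sigma_u^2$ supplied by the user rather than estimated by REML.

```python
def fay_herriot_predictor(direct, x, sampling_var, sigma2_u):
    """Composite predictor for each area, given sigma2_u (one covariate, z_i = 1)."""
    sigma2_u = max(sigma2_u, 0.0)                     # truncate a negative estimate to 0
    v = [sigma2_u + psi for psi in sampling_var]      # diagonal of V = ZGZ' + R
    # GLS estimate of beta with diagonal V
    beta = sum(xi * yi / vi for xi, yi, vi in zip(x, direct, v)) / sum(
        xi * xi / vi for xi, vi in zip(x, v)
    )
    predictions = []
    for yi, xi, psi in zip(direct, x, sampling_var):
        phi = sigma2_u / (sigma2_u + psi)             # shrinkage factor
        predictions.append(phi * yi + (1.0 - phi) * xi * beta)
    return beta, predictions

# Hypothetical direct estimates for two areas; the second has the larger sampling variance
beta_hat, eblup = fay_herriot_predictor([12.0, 18.0], [1.0, 2.0], [1.0, 3.0], sigma2_u=1.0)
print(beta_hat, eblup)
```

The area with the larger sampling variance has the smaller $\phi_i$, so its prediction is pulled further towards the regression value $x_i\hat{\beta}$.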

EBLUP: Area Level Approach, the Fay-Herriot Model

The MSE of the Fay-Herriot small area estimator is

$$MSE(\tilde{\theta}_i) = g_{1i}(\sigma_u^2) + g_{2i}(\sigma_u^2) + g_{3i}(\sigma_u^2)$$

- $g_{1i}(\sigma_u^2)$ is due to random errors (order $O(1)$)
- $g_{2i}(\sigma_u^2)$ is due to the estimation of $\beta$ (order $O(m^{-1})$, given some conditions)
- $g_{3i}(\sigma_u^2)$ is due to the estimation of $\sigma_u^2$

An approximately correct estimate of the MSE is

$$\widehat{MSE}(\tilde{\theta}_i) = g_{1i}(\hat{\sigma}_u^2) + g_{2i}(\hat{\sigma}_u^2) + 2g_{3i}(\hat{\sigma}_u^2)$$

EBLUP at the area level: example

- In this example we use the Fay-Herriot model to estimate the mean agrarian surface area used for production of grape ($\theta_i$) in each municipality of the Tuscany region
- Population data come from the Italian Agricultural Census of year 2000 for the region of Tuscany
- The census collects information about farm land by type of cultivation, amount of breeding, kind of production, structure and amount of farm employment
- We consider as small areas the 274 municipalities of Tuscany, with population sizes $N_i$, $i = 1, \ldots, m$, given by the census
- We use the census data on the agrarian surface area used for production in hectares ($x_{1i}$) and on the average number of working days in the reference year ($x_{2i}$) as covariates in the model

EBLUP at the area level: example

- Sample data are collected from a simple random sample of size $n_i$ from each area, with sampling fractions $n_i/N_i$ approximately constant and equal to 0.05
- These data are used to compute, for each municipality $i$, the direct estimator of the mean agrarian surface area used for production of grape in hectares ($y_i$) and its sampling variance ($\psi_i$)

Figure : Data on grape production

EBLUP at the area level: example

The results obtained with the Fay-Herriot model for the first 10 Municipalities are shown in the following table

Figure : Estimates on grape production

EBLUP at the area level: example

- As we can see from the second table, the MSEs of the FH-EBLUP estimates are lower than the variances of the direct estimates in the first table
- Readers interested in the results for all the 247 municipalities can refer to the deliverables of the SAMPLE project (www.sample-project.eu)

EBLUP at the area level: advantages, disadvantages, extensions

Conclusions on EBLUPs

- The EBLUPs are model-based estimators
- EBLUPs can also be used when we know only the averages of the auxiliary variables
- In many applications the EBLUPs perform better than the design-based estimators in terms of relative root MSE (narrower confidence intervals)
- EBLUPs are currently used as a standard technique to derive small area statistics

Conclusions on EBLUPs

Drawbacks
- The assumption of normality is needed for area effects and individual effects (but sensitivity analysis shows that the model is robust against non-normality if symmetry of the distributions holds)
- It is not design-unbiased, in the sense that under a complex survey design the estimates could be biased
- Parameters of interest in out-of-sample areas (areas with 0 observations) cannot be estimated (EBLUP needs a minimum of two observations per area)
- Extensions to the model are not easily implementable (complex derivation of the MSE estimator)

Conclusions on EBLUPs

Improvements for the EBLUP
- Spatial process (CAR and SAR models)
- Time process
- Spatio-temporal process
- Robust estimation
- Binary and count data models

Alternative approach
- Quantile/M-Quantile approach

Quantiles

Given $p$, with $p \in (0, 1)$, the quantile $q_p$ of a random variable $X$ is defined as:

$$\int_{-\infty}^{q_p} dF(x) = p$$

Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a random variable.

The $k$th $q$-quantile for a random variable is the value $x$ such that the probability that the random variable will be less than $x$ is at most $k/q$ and the probability that the random variable will be more than $x$ is at most $(q - k)/q = 1 - (k/q)$.

Example: the 4-quantiles are called quartiles and they are three values that divide the distribution of the values of $X$ into four parts. The first quartile is associated with $p = 0.25$, the second with $p = 0.5$ and the third with $p = 0.75$. The second quartile is the median.

M-quantiles

Given $p$, with $p \in (0, 1)$, the M-Quantile $\theta_p$ of a random variable $X$ is defined as:

$$\int \psi_p(x - \theta_p)\,dF(x) = 0$$

where

$$\psi_p(u) = \begin{cases} (1 - p)\,\psi(u) & u < 0 \\ p\,\psi(u) & u \geq 0 \end{cases}$$

and $\psi(u)$ is a suitably chosen influence function

The M-Quantile is a generalization of the quantile concept (it includes the quantile as a particular case)
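As an illustration, the sample version of the defining equation, $\sum_i \psi_p(x_i - \theta_p) = 0$, can be solved numerically. The sketch below makes several assumptions that are not in the slides: the Huber function with tuning constant 1.345 as the influence function $\psi$, residuals that are not rescaled by a scale estimate, and bisection as the root-finding method (the left-hand side is non-increasing in $\theta_p$).

```python
def huber_psi(u, c=1.345):
    """Huber influence function: identity on [-c, c], constant outside."""
    return max(-c, min(c, u))

def m_quantile(data, p, c=1.345, iterations=200):
    """Sample M-quantile of order p, found by bisection on the estimating equation."""
    def score(theta):
        total = 0.0
        for x in data:
            r = x - theta
            weight = (1.0 - p) if r < 0 else p    # asymmetric weighting of the residuals
            total += weight * huber_psi(r, c)
        return total

    low, high = min(data), max(data)              # score(low) >= 0 and score(high) <= 0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if score(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

data = [1.0, 2.0, 3.0, 10.0]
print(m_quantile(data, 0.25), m_quantile(data, 0.5), m_quantile(data, 0.75))
# With an unbounded psi (c = infinity) the order-0.5 M-quantile is the mean
print(m_quantile(data, 0.5, c=float("inf")))
```

A very small tuning constant moves the solution towards the ordinary quantile, while a very large one gives the expectile.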

M-quantile Regression

M-quantile

- Dependent variable $(y_1, \ldots, y_n)$
- Auxiliary variables $(x_1, \ldots, x_k)$

$$\theta_p(x) = \alpha_p + x_i'\beta_p + \varepsilon_i$$

The M-Quantile $\theta_p$ of order $p$, with $p \in (0, 1)$, of the conditional distribution $Y|X$ is defined as:

$$\int \psi_p(y - \theta_p(x))\,dF(y|x) = 0$$

with

$$\psi_p(r) = \begin{cases} (1 - p)\,\psi(r) & r < 0 \\ p\,\psi(r) & r \geq 0 \end{cases}$$

M-Quantile regression is a unified model that includes quantile regression and expectile regression as particular cases

M. Pratesi, C. Giusti Small area estimation I Expansion and GREG estimators Empirical Best Linear Unbiased Predictor M-Quantile How to use M-quantile model to obtain estimates for the random area effects

Small area model-based estimators borrow strength from all the sample to capture random area effects, given the hierarchical structure of the data.

M-quantile regression does not depend on a hierarchical structure.

We can characterise conditional variability across the population of interest by the M-quantile coefficients of the population units.

The linear mixed effects model captures random area effects as differences in the conditional distribution of y given x between small areas.

The M-quantile model determines the area effect from the M-quantile coefficients of the units belonging to the area.


Assume that we have individual-level data on y and x.

Each sample value of (x, y) will lie on one and only one M-quantile line.

We refer to the p-value of this line as the M-quantile coefficient of the corresponding sample unit, so every sample unit has an associated p-value.

In order to estimate these unit-specific p-values, we define a fine grid of p-values (e.g. 0.001,...,0.999) that adequately covers the conditional distribution of y given x.

We fit an M-quantile model for each p-value in the grid and use linear interpolation to estimate a unique p-value, $p_j$, for each individual $j$ in the sample.
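The grid-and-interpolation step can be sketched as follows. The example assumes the fitted M-quantile values are already available as an array `fitted`, with one row per grid value and one column per sample unit, and that the fitted lines do not cross, so each column increases with $p$. The name `unit_mq_coefficients` and the artificial fitted lines are hypothetical.

```python
import numpy as np

def unit_mq_coefficients(y, fitted, grid):
    """Interpolate, for each unit j, the order p_j whose fitted line passes through y_j.

    fitted[g, j] is the fitted M-quantile of order grid[g] at the covariates of unit j.
    Units lying outside the fitted range get the nearest grid endpoint.
    """
    y = np.asarray(y, dtype=float)
    coefs = np.empty(len(y))
    for j in range(len(y)):
        # np.interp expects increasing abscissae: the fitted values along the grid.
        coefs[j] = np.interp(y[j], fitted[:, j], grid)
    return coefs

grid = np.linspace(0.001, 0.999, 999)
x = np.array([1.0, 2.0, 3.0])
# Artificial parallel lines: the line of order p takes the value 2x + 10(p - 0.5).
fitted = 2.0 * x[None, :] + 10.0 * (grid[:, None] - 0.5)
true_p = np.array([0.2, 0.5, 0.9])
y = 2.0 * x + 10.0 * (true_p - 0.5)
p_hat = unit_mq_coefficients(y, fitted, grid)
```

Because the artificial lines are linear in `p`, the interpolation recovers `true_p` up to rounding; with real fits the result depends on how fine the grid is.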

Figure: (a) Sample data, (b) M-quantile lines, (c) M-quantile lines associated with each unit, (d) M-quantile area lines

How to use M-quantile model to measure small area effects

Calculate an M-quantile coefficient for each area by suitably averaging the q-values (the unit-level M-quantile coefficients) of the sampled individuals in that area. Denote this area-specific q-value by $\hat{\theta}_i$
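A minimal sketch of the averaging step, assuming that "suitably averaging" is taken to be the simple within-area mean, that areas are coded as integers 0, ..., m-1 and that every area has at least one sampled unit. The function name `area_mq_coefficients` and the toy values are hypothetical.

```python
import numpy as np

def area_mq_coefficients(unit_coefs, area):
    """Simple mean of the unit-level M-quantile coefficients within each area."""
    unit_coefs = np.asarray(unit_coefs, dtype=float)
    area = np.asarray(area)
    # Sum of coefficients per area divided by the number of sampled units per area.
    return np.bincount(area, weights=unit_coefs) / np.bincount(area)

# Five sampled units: two in area 0 and three in area 1.
theta_hat = area_mq_coefficients([0.2, 0.4, 0.7, 0.9, 0.5], [0, 0, 1, 1, 1])
```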

The M-Quantile small area model is

\[ y_{ij} = x_{ij}^{T}\beta_\psi(\theta_i) + \varepsilon_{ij} \]

$\beta_\psi$ is the unknown regression vector

$\theta_i$ is the unknown area-specific coefficient

$\varepsilon_{ij}$ is an individual disturbance

The linear M-quantile small area model

Linear M-quantile small area model:

\[ y_{ij} = x_{ij}^{T}\beta_\psi(\theta_i) + \varepsilon_{ij} \]

$\beta_\psi$ is estimated by iteratively reweighted least squares

$\theta_i$ is obtained by averaging the q-values of the sampled units belonging to area $i$

\[ \psi(u) = u\, I(|u| \le c) + c\, \mathrm{sgn}(u)\, I(|u| > c) \]

(Huber proposal 2 influence function)

$\varepsilon_{ij}$ has an unspecified distribution

The predictor for the target variable of the non-sampled unit $k$ in area $i$ is

\[ \hat{y}_{ki} = x_{ki}^{T}\hat{\beta}_\psi(\hat{\theta}_i) \]

Small Area Mean Estimate with M-Quantile Model

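Given estimates of $\beta_\psi$ and $\theta_i$ we can obtain the small area mean estimator

Small Area Mean Estimator

\[ \hat{\bar{Y}}_i = N_i^{-1} \left\{ \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \left( y_{ij} - x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) \right) \right\} \]

where $s_i$ and $r_i$ denote the sampled and non-sampled units of area $i$, with $n_i$ and $N_i - n_i$ units respectively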
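The small area mean estimator can be sketched term by term for a single area. The function name `mq_area_mean`, its argument names and the toy numbers are assumptions made for illustration; `beta_i` stands for the already estimated vector $\hat{\beta}_\psi(\hat{\theta}_i)$.

```python
import numpy as np

def mq_area_mean(y_s, X_s, X_r, beta_i):
    """Small area mean estimate for one area.

    y_s    : responses of the n_i sampled units
    X_s    : covariates of the sampled units, shape (n_i, k)
    X_r    : covariates of the N_i - n_i non-sampled units, shape (N_i - n_i, k)
    beta_i : estimated coefficient vector for the area, shape (k,)
    """
    y_s, X_s, X_r = (np.asarray(a, dtype=float) for a in (y_s, X_s, X_r))
    beta_i = np.asarray(beta_i, dtype=float)
    n_i = len(y_s)
    N_i = n_i + len(X_r)
    residuals = y_s - X_s @ beta_i                 # y_ij minus fitted value, sampled units
    total = (y_s.sum()                             # observed part
             + (X_r @ beta_i).sum()                # predicted part for non-sampled units
             + (N_i - n_i) / n_i * residuals.sum())  # residual-based adjustment
    return total / N_i

# Toy area: 2 sampled and 3 non-sampled units, intercept plus one covariate.
estimate = mq_area_mean(
    y_s=[11.0, 13.0],
    X_s=[[1, 1], [1, 2]],
    X_r=[[1, 3], [1, 4], [1, 5]],
    beta_i=[8.0, 2.0],
)
```

In the toy area the sampled residuals are both 1, so the last term adds (3/2) * 2 = 3 to the total of observed and predicted values.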

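and the MSE estimator

MSE estimator for the M-Quantile Small Area Mean Estimator

\[ \hat{V}_i = V(\hat{\bar{y}}_i - \bar{y}_i) = N_i^{-1} \left\{ \sum_{j \in s} u_{ij}^{2} \left( y_{ij} - x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) \right)^{2} + \frac{N_i - n_i}{n_i - 1} \sum_{j \in r_i} \left( y_{ij} - x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) \right)^{2} \right\} \]

\[ \hat{B}_i = N_i^{-1} \left( \sum_{i=1}^{m} \sum_{j \in s_i} w_{ij}\, x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) - \sum_{j \in s_i} x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i) \right) \]

\[ \widehat{MSE}_i = \hat{V}_i + \hat{B}_i^{2} \]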

Example: M-quantile estimates using EU-SILC data

Data on the equivalised income in 2007 are available from the EU-SILC survey 2008 for 1495 households in the 10 Tuscany Provinces, for 1286 households in the 5 Campania Provinces and for 2274 households in the 11 Lombardia Provinces.

The target variable is the equivalised household income already defined for the unit-level EBLUP example.

A set of explanatory variables is available for each unit in the population from the Population Census 2001.

We employ an M-quantile model to estimate the mean of the household equivalised income at LAU 1 level (Provinces).

Example: EU-SILC data

Remark 1: EU-SILC data are confidential. These data were provided by ISTAT, the Italian National Institute of Statistics, to the researchers of the SAMPLE project and were analyzed in compliance with all confidentiality restrictions.

Remark 2: We chose the Campania, Toscana and Lombardia regions because they are representative of the South, Center and North of Italy, respectively.

Remark 3: The choice of three representative regions for the North, Center and South of Italy was driven by the well-known North-South divide.

Example: Target variable statistics

Figure: Boxplots of the disposable equivalised household income for Campania, Lombardia and Toscana (vertical axis from 0 to 200000)


The boxplots show that the distribution of the household equivalised income is skewed, with a heavy right tail, in all three regions.

The boxplots also show evidence of outliers.

Evidence of outliers also emerges from the summary statistics obtained using the cross-sectional EU-SILC household weights:

Region       Min.       1st Qu.   Median   Mean     3rd Qu.   Max.
Campania     −852.7     8073      11560    13550    17430     99400
Lombardia    −21550.0   12620     17670    20040    24000     209800
Toscana      −2849.0    12120     17230    19430    23570     107900

Table: Summary statistics for the household equivalised income

Example: Census 2001 data

The Italian Census 2001 was conducted by ISTAT.

Campania region accounts for 1,862,855 households.

Lombardia region accounts for 3,652,944 households.

Toscana region accounts for 1,388,252 households.

Available variables: household size (integer value), ownership of dwelling (owner/tenant), age of the head of the household (integer value), years of education of the head of the household (integer value), working position of the head of the household (employed / unemployed in the previous week), gender of the head of the household, civil status of the head of the household.

Example: Model Specification

We used the M-quantile linear model to compute the means at LAU 1 level (Provinces).

The selection of covariates to fit the small area models relies on prior studies of poverty assessment.

The following covariates were selected:

household size (integer value)

ownership of dwelling (owner/tenant)

age of the head of the household (integer value)

years of education of the head of the household (integer value)

working position of the head of the household (employed / unemployed in the previous week)

gender of the head of the household

Example: Estimates of mean income for Lombardia Provinces

Mean of Households Equivalised Income, Lombardia Provinces (map labels; the second value is shown in parentheses as printed on the map)

Province    Estimate    (Value)
SONDRIO     16307.16    (1668.92)
COMO        18578.33    (1137.01)
LECCO       19497.61    (1131.62)
VARESE      21091.49    (1305.98)
BERGAMO     18323.07    (820.61)
BRESCIA     16326.21    (581.47)
MILANO      20798.63    (497.68)
LODI        17052.58    (965.49)
CREMONA     16774.18    (883.69)
MANTOVA     17774.9     (677.24)
PAVIA       21081.25    (4080.17)

Example: Estimates of mean income for Toscana Provinces

Mean of Households Equivalised Income, Toscana Provinces (map labels; the second value is shown in parentheses as printed on the map)

Province         Estimate    (Value)
MASSA CARRARA    14128.26    (664.84)
LUCCA            15867.69    (766.8)
PISTOIA          18980.76    (1119.33)
PRATO            17702.87    (632.74)
FIRENZE          19184.92    (498.35)
AREZZO           18665.97    (1014.42)
PISA             18550.16    (876.37)
LIVORNO          17875.01    (919.41)
SIENA            20228.98    (1113.91)
GROSSETO         16152.47    (1151.84)

Example: Estimates of mean income for Campania Provinces

Mean of Households Equivalised Income, Campania Provinces (map labels; the second value is shown in parentheses as printed on the map)

Province     Estimate    (Value)
BENEVENTO    11312.89    (1033.79)
CASERTA      11685.74    (574.89)
AVELLINO     12873.13    (979.46)
NAPOLI       12661.84    (291.73)
SALERNO      12715.91    (502.22)

MQ approach to SAE: advantages, disadvantages, extensions

Advantages and drawbacks of the M-Quantile approach with respect to the EBLUP approach

Main advantages

i. Distributional assumptions on parameters are not needed

ii. Assumptions on the hierarchical structure are not needed

iii. The M-quantile model is robust against outliers

iv. Nonparametric M-quantile approaches are easy to implement

v. The bootstrap approach to MSE estimation is faster than the bootstrap for EBLUP (linear mixed models require double bootstrap techniques)

Main drawbacks

i. There is no specification of the M-quantile model for nominal response variables

ii. There is no specification for a multivariate response variable

iii. Estimators can be design-biased (even if they are model-unbiased)

