Official Statistics for the Next Decade: Methodological Issues and Challenges

Danny Pfeffermann

Conference on

New Techniques and Technologies for Statistics (NTTS)

March, 2015

1

List of tough challenges

A- Collection and management of big data for POS √

B- Integration of computer science for POS from big data

C- Data accessibility, privacy and confidentiality

D- Possible use of Internet panels √

E- How to deal with mode effects √

F- Integration of statistics and geospatial information.

G- Future censuses and small area estimation √

Ques. Are Universities preparing students for NSOs? √

2

Collection and management of big data for POS

Exp. 1. Count of the number of vehicles crossing road sections. Presently done in a very primitive way.

Why not get this information, and much more, from cell phone companies? Available in principle for each time point.

Exp. 2. Use the Billion Prices Project (BPP), based on 5 million commodities sold online, to predict the CPI, whose computation requires two costly surveys.

3

Big Data: Big Problems, Big headache

- Coverage/selection bias (we are talking of POS)
- Data accessibility
- New legislation
- Privacy (data protection)
- Disclosure control
- Computer storage
- Computation and analysis
- Linkage of different files
- Risk of data manipulation

4

Two types of big data

Type 1. Data obtained from sensors, cameras, cell phones, …: generally structured and accurate.

Type 2. Data obtained from social networks, e-commerce, etc.: diverse, unstructured, and appearing irregularly.

Type 1 measurements are available continuously. Should POS publications be mostly in the form of graphs and pictures?

If needed, how should big data be transformed to monthly aggregates? By ? Will random sampling continue playing an important role when processing big data?

5

Other important issues

Coverage bias: a major concern in the use of big data for POS. Big data of credit card transactions contain no information on transactions made with other means of payment. Opinions expressed in social networks are different from opinions held by the general public.

No bias should occur when using big data to predict other variables, estimated from standard surveys, e.g., use the BPP to predict the CPI, use job advertisements to predict employment, use satellite images to predict crops. This requires proper statistical analysis to identify and test the prediction models.

6

Big data are supposedly free of sampling errors. Are measures of error still an issue? Measure of bias? How? Measurement errors?

Big data for sub-populations: NSOs publish estimates for sub-populations: age, gender, ethnicity, geography, …

Big data may not contain this information. Requires massive linkage if the missing information is available in other big files.

Will traditional sample surveys always be needed?

We are familiar with design-based, model-dependent, and model-assisted estimators. New: algorithmic estimators, the result of computational algorithms applied to raw big data (example: a measure of religiosity).

7

Computer engineering for POS from big data

No longer gigabytes (~$10^{9}$ bytes). Terabytes (~$10^{12}$ bytes) and petabytes (~$10^{15}$ bytes) are, at the least, the new standards.

Computing facilities at most (all?) NSOs cannot store and handle such high volumes of data. Possible solution. Use Cloud storage, management and processing facilities.

Big problem. Data protection. Multiple users, data distributed over a large number of devices.

Possible sol. Private cloud installation, incorporating all local computers; combined management of storage space and processing power of the separate computers.

8

Summary of in-house computing challenges

1. Study the logic of storage and processing of big data.

2. Prepare storage spaces that can be regularly extended to higher volumes of data.

3. Establish communication networks that permit receiving data from multiple sources in different formats, and prepare the data for processing and analysis.

4. Protect the data from possible hackers and develop new methods of statistical disclosure control (SDC).

5. Develop analytic tools for processing, editing and analysing big data, including visualization techniques.

Everything is different if a cloud service can be used.

9

Data accessibility, privacy and confidentiality

Two aspects:

A- Protect the data from intruders. Very expensive devices.

B- SDC. Guarantee that released data cannot be used to reveal private confidential data.

Current SDC procedures need extensive modifications.

Exp 1. Release “synthetic data” generated from models. Can we generate new big data?

Exp 2. Research (safe) rooms. Available procedures for release of data and control of outputs need major revision.

New trend: release synthetic data to researchers before they get the real data in research rooms.

10

Big data: summary remarks

New expensive computing facilities, new data processing techniques, new linkage methods, new visualization methods, new analytic methods, new measures of error.

These are only some of the big challenges facing computer scientists and statisticians in the use of big data for POS.

Big potential advantages: timeliness, much broader coverage (possible coverage bias), no sampling frames, no questionnaires, no interviewers, …

Considering the constant decline in response rates in traditional surveys, use of big data seems inevitable.

Big data will just grow bigger and bigger.

11

Possible use of Internet panels for POS

Web surveys have huge advantages over traditional surveys.

Major problem: respondents are volunteers with access to the internet, so at best they represent the population of internet users (IU).

Ipanel: a big group of volunteers agreeing to participate regularly in surveys, often in return for certain incentives.

Ipanel possibly recruited by probability sampling, and the samples selected from the Ipanel often selected by probability sampling.

Big challenge: Estimate general population parameters from Ipanel sample.

12

Common solutions

Propensity scores (PS): select a large traditional reference sample, treat the sample $S_I$ from the Ipanel as the treatment sample, and the reference sample $S_R$ as the control sample.

Estimate propensity scores based on all $j \in S = S_I \cup S_R$.

Divide $S_I$ into $C$ classes based on the estimated propensity scores. Compute an adjusted weight $d_j^{w,psa} = f_c d_j^{w}$ for $j \in S_{cI}$, where $d_j^{w}$ is the initial weight assigned to $j \in S_I$.

$\hat{Y}^{w,PSA} = \sum_{c} \sum_{j \in S_{cI}} d_j^{w,psa} y_j$.
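A minimal Python sketch of the class-based propensity score adjustment. The slide does not define the factor $f_c$; the sketch assumes the commonly used form in which $f_c$ is the reference-sample weight share of class $c$ divided by the Ipanel weight share of class $c$. The function name `psa_weights`, the use of pooled quantiles as class boundaries, and the toy inputs are illustrative assumptions, and the propensity scores are taken as already estimated.

```python
import numpy as np

def psa_weights(ps_panel, d_panel, ps_ref, d_ref, n_classes=5):
    """Propensity-score-adjusted weights for the Ipanel sample S_I.

    ps_*: estimated propensity scores, d_*: base weights.
    ASSUMPTION (not stated on the slide):
    f_c = (reference weight share in class c) / (panel weight share in class c).
    """
    ps_panel = np.asarray(ps_panel, dtype=float)
    d_panel = np.asarray(d_panel, dtype=float)
    ps_ref = np.asarray(ps_ref, dtype=float)
    d_ref = np.asarray(d_ref, dtype=float)

    # Class boundaries: interior quantiles of the pooled propensity scores.
    pooled = np.concatenate([ps_panel, ps_ref])
    edges = np.quantile(pooled, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    cls_panel = np.searchsorted(edges, ps_panel, side="right")
    cls_ref = np.searchsorted(edges, ps_ref, side="right")

    # Weighted share of each class in the two samples.
    share_panel = np.bincount(cls_panel, weights=d_panel, minlength=n_classes) / d_panel.sum()
    share_ref = np.bincount(cls_ref, weights=d_ref, minlength=n_classes) / d_ref.sum()
    if np.any(share_panel == 0):
        raise ValueError("a class contains no Ipanel units; use fewer classes")

    f = share_ref / share_panel          # class adjustment factors f_c
    return d_panel * f[cls_panel]        # adjusted weights d_j^{w,psa}

# Toy inputs (illustrative only): panel members have high propensity scores.
w = psa_weights([0.9, 0.8, 0.7, 0.3], [1, 1, 1, 1],
                [0.2, 0.3, 0.4, 0.8], [1, 1, 1, 1], n_classes=2)
y = np.array([5.0, 6.0, 7.0, 2.0])
total_psa = float(np.sum(w * y))         # estimated total: sum of d^{w,psa} * y
```

With this choice of $f_c$ the adjusted weights reproduce the reference sample's class distribution while keeping the sum of the Ipanel weights unchanged.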

13

Problems with the use of propensity scores

Requires drawing a large reference sample which can be very costly.

Strong ignorability: let $T = 1$ for $i \in S_I$, $T = 0$ for $i \in S_R$.

PS(a): $T \perp Y$ given the covariates $x$ used for the PS.

PS(b): $0 < \Pr(T = 1 \mid x) < 1$ for every $x \in S$.

The conditions may hold for some, but not for all study variables.

Not obvious how to estimate the variance of the estimators.

14

Another common solution: Calibration

Change the base weights $d_j^{w}$ to weights $d_j^{cal}$, such that for observed survey variables $Z$ with known totals $t_z^{U}$,

$\sum_{j \in S_I} d_j^{cal} z_j = t_z^{U}$.

Totals might be cell totals or marginal cell totals. Reliable sample estimates $\hat{t}_z^{U}$ may also be used.

Does not require a reference sample.

Combining propensity scores and calibration adjustments possibly more effective in reducing the bias.
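The calibration constraint can be illustrated with a short Python sketch. The slides do not say which distance function is used to move from the base weights to the calibrated weights; the sketch assumes linear (chi-square distance) calibration, for which the weights have the closed form $d_j^{cal} = d_j^{w}(1 + z_j'\lambda)$. The function name `linear_calibration` and the toy data are hypothetical.

```python
import numpy as np

def linear_calibration(d, Z, totals):
    """Linear calibration: w_j = d_j * (1 + z_j' lam), with lam chosen so that
    sum_j w_j z_j equals the known totals.

    ASSUMPTION: chi-square distance; other distances (e.g. raking) give
    different weights that satisfy the same constraints.
    """
    d = np.asarray(d, dtype=float)
    Z = np.asarray(Z, dtype=float)
    totals = np.asarray(totals, dtype=float)
    M = Z.T @ (d[:, None] * Z)                    # sum_j d_j z_j z_j'
    lam = np.linalg.solve(M, totals - Z.T @ d)    # Lagrange multipliers
    return d * (1.0 + Z @ lam)

# Toy data: column 1 is a constant (population size), column 2 a group indicator.
d = np.array([10.0, 10.0, 10.0, 10.0])
Z = np.array([[1, 1], [1, 0], [1, 1], [1, 0]], dtype=float)
totals = np.array([50.0, 30.0])                   # known population totals
w_cal = linear_calibration(d, Z, totals)
```

Linear calibration can produce negative weights when the base-weighted totals are far from the known totals, which is one reason bounded or multiplicative alternatives are often preferred in practice.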

15

A new alternative approach?

Let $A_i = 1$ if $i \in U$ is an Internet User (IU), $A_i = 0$ otherwise.

Assumption. $\Pr(A_i = 1 \mid x_i, y_i) > 0$ $\forall i \in U$. $y$: study variable, $x$: covariates.

Bayes $\Rightarrow$

$f_{IU}(y_i \mid x_i) = f(y_i \mid x_i, A_i = 1) = \frac{\Pr(A_i = 1 \mid x_i, y_i) f_p(y_i \mid x_i)}{\Pr(A_i = 1 \mid x_i)}$,

$f_p(y_i \mid x_i)$ = distribution in target population $U$, $f_{IU}(y_i \mid x_i)$ = distribution for IU.

16

A new alternative approach (cont.)

In practice, not every IU asked to participate in the Ipanel agrees, or a person may agree but not respond in a particular survey taken for its members.

Let $R_i = 1$ if IU $i$ is in the Ipanel and responds, $R_i = 0$ otherwise. The marginal distribution for responding unit $i$ is then

$f_R(y_i \mid x_i) = f(y_i \mid x_i, A_i = 1, R_i = 1) = \frac{\Pr(R_i = 1 \mid y_i, x_i, A_i = 1)\Pr(A_i = 1 \mid x_i, y_i) f_p(y_i \mid x_i)}{\Pr(R_i = 1 \mid x_i, A_i = 1)\Pr(A_i = 1 \mid x_i)}$.

17

A new alternative approach: inference

The respondents' likelihood (assuming independence) is

$L_{Resp}(\gamma, \theta) = \prod_{i=1}^{r} \frac{\Pr(R_i = 1 \mid y_i, x_i, A_i = 1; \gamma)\Pr(A_i = 1 \mid y_i, x_i; \gamma) f_p(y_i \mid x_i; \theta)}{\Pr(R_i = 1 \mid x_i, A_i = 1; \gamma, \theta)\Pr(A_i = 1 \mid x_i; \gamma, \theta)}$.

Inference: maximize the likelihood with respect to the unknown parameters and use $\hat{f}_p(y_i \mid x_i) = f_p(y_i \mid x_i; \hat{\theta})$ for inference about the target population $U$.

Example: estimate the total as $\hat{Y}_{IP}^{U} = \sum_{i \in U} E_p(y_i \mid x_i; \hat{\theta})$.

18

Inference (cont.)

When the Ipanel is selected with probabilities $\pi_i$ and the covariates are unknown for units outside the Ipanel sample,

$\hat{Y}_{IP}^{Uw} = \sum_{i \in S_{IR}} E_p(y_i \mid x_i; \hat{\theta}) / (\pi_i \hat{p}_{Ri} \hat{p}_{Ai})$,

$\hat{p}_{Ri} = \widehat{\Pr}(R_i = 1 \mid y_i, x_i, A_i = 1)$; $\hat{p}_{Ai} = \widehat{\Pr}(A_i = 1 \mid x_i, y_i)$.

Remarks:

1- A full parametric inference requires specifying models for $p_{Ri}$, $p_{Ai}$ and $f_p(y_i \mid x_i; \theta)$. The likelihood is complicated and may face non-identifiability problems. Use of empirical likelihood is simpler and safer.

19

Remarks (cont.)

2- Although none of the stochastic processes in the likelihood is observable, the respondents’ model is testable using classical test statistics, since it relates to the observed data.

3- Further simplification of the likelihood is obtained by combining the models for Internet use and for Ipanel response into a single model. Define $D_i = A_i R_i$. The model is

$f_D(y_i \mid x_i) = f(y_i \mid x_i, D_i = 1) = \frac{\Pr(D_i = 1 \mid y_i, x_i) f_p(y_i \mid x_i)}{\Pr(D_i = 1 \mid x_i)}$.

Much simpler model, but might be too restrictive.
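A worked illustration of the single-model formula, under assumptions that are not in the slides: the population model is normal, $f_p(y_i \mid x_i) = N(x_i'\beta, \sigma^2)$, and the combined Internet-use and response probability is approximately exponential in $y_i$ over the relevant range, $\Pr(D_i = 1 \mid y_i, x_i) \approx c \exp(a y_i)$ for constants $a$ and small $c > 0$.

```latex
\begin{align*}
\Pr(D_i = 1 \mid x_i)
  &= \int c\, e^{a y} f_p(y \mid x_i)\, dy
   = c \exp\!\left(a\, x_i'\beta + \tfrac{1}{2} a^2 \sigma^2\right), \\
f_D(y_i \mid x_i)
  &= \frac{c\, e^{a y_i}\, f_p(y_i \mid x_i)}{\Pr(D_i = 1 \mid x_i)}
   = N\!\left(x_i'\beta + a \sigma^2,\; \sigma^2\right).
\end{align*}
```

Under these assumptions the respondents' distribution is again normal with the same variance, but with the mean shifted by $a\sigma^2$. The shift cannot be separated from the intercept in $\beta$ using the respondents' data alone, which gives one concrete instance of the non-identifiability problems that a fully parametric specification may face.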

20

Comparison with other approaches

Use of the proposed approach does not require the availability of a reference sample, as required for the use of propensity scores, and does not rely on ignorability conditions.

The proposed approach requires parametric assumptions for the probabilities $p_{Ri}$ and $p_{Ai}$, but the model holding for the responding units can be tested.

21

Mode effects

Mixed mode surveys: different modes of response; telephone, personal interview, mail, internet, … The different modes are sometimes offered sequentially to those who do not respond with a previous mode.

Mode effect: encompasses two confounded effects:

Selection effect: differences between the characteristics of sample members responding with different modes and hence, possible differences in the values of the study variables.

Measurement effect: the effect of the same person responding differently, depending on the mode of response.

22

A common approach (Holland?)

Reasons for mixed mode surveys: increase response rate and reduce measurement effects. Also, some modes are cheaper than others (internet!!).

A common approach: assume that one of the modes has no measurement effect, so that when responding with this mode, the estimate of the population parameter of interest is unbiased.

Define by $D$ the mode used to respond, and let $m_1$ be the mode providing “right answers”. Suppose that every unit $i \in U$ possibly responds with every mode (counterfactual assumption).

23

A common approach (cont.)

Population parameter of interest: $\mu_{m_1} = E(Y \mid D = m_1)$. Suppose there is only one alternative mode and denote by $G$ the preferred mode for response. Then,

$\mu_{m_1} = p_{m_1}\mu_{m_1,m_1} + p_{m_2}\mu_{m_1,m_2}$,

$p_{m_g} = \Pr(G = m_g)$, $g = 1,2$; $\mu_{d,g} = E(Y \mid D = d, G = g)$.

The selection and measurement effects are defined as

$Se_{m_1} = \mu_{m_1,m_1} - \mu_{m_1,m_2}$; $Me_{m_1} = \mu_{m_1,m_2} - \mu_{m_2,m_2}$.

Mode effect $= MOe_{m_1} = \mu_{m_1,m_1} - \mu_{m_2,m_2} = Se_{m_1} + Me_{m_1}$.

Problem: estimate $\mu_{m_1,m_2}$ (respond $D = m_1$, prefer $G = m_2$).

24

A common approach (cont.)

Estimate $\mu_{m_1,m_2}$ by modelling the selection effects. Assume the existence of covariates $Z$ satisfying:

S(a)- $G$ and $Y$ are conditionally independent given $Z$,

S(b)- $D$ and $Z$ are conditionally independent given $G$.

S(a) is the familiar ignorability condition. S(b) implies the absence of measurement effects on $Z$. Let

$\mu_{d,g,z} = E(Y \mid D = d, G = g, Z = z)$, $p_{z|d,g} = \Pr(Z = z \mid D = d, G = g)$.

The following decomposition holds under S(a)-S(b):

$\mu_{m_1,m_2} = \sum_z \mu_{m_1,m_2,z}\, p_{z|m_1,m_2} = \sum_z \mu_{m_1,m_1,z}\, p_{z|m_2,m_2}$.
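The right-hand side of the decomposition involves only quantities that can be estimated from units responding with their preferred mode, which the following Python sketch illustrates. It is a simplified, unweighted illustration that assumes a single categorical covariate $Z$ and ignores survey weights; the function name `mu_m1_m2` and the toy data are hypothetical.

```python
import numpy as np

def mu_m1_m2(y_m1, z_m1, z_m2):
    """Estimate mu_{m1,m2} = sum_z mu_{m1,m1,z} * p_{z|m2,m2}.

    y_m1, z_m1: responses and covariate cell of units with D = G = m1.
    z_m2:       covariate cell of units with D = G = m2.
    ASSUMPTIONS: categorical Z, equal weights, conditions S(a)-S(b) hold.
    """
    y_m1 = np.asarray(y_m1, dtype=float)
    z_m1 = np.asarray(z_m1)
    z_m2 = np.asarray(z_m2)

    cells, counts = np.unique(z_m2, return_counts=True)
    p_z = counts / counts.sum()               # p_{z|m2,m2}
    estimate = 0.0
    for z, p in zip(cells, p_z):
        in_cell = z_m1 == z
        if not in_cell.any():
            raise ValueError(f"no m1 respondents in cell {z!r}")
        estimate += p * y_m1[in_cell].mean()  # mu_{m1,m1,z} * p_{z|m2,m2}
    return estimate

# Toy data: two covariate cells.
y_m1 = [1.0, 3.0, 10.0, 14.0]
z_m1 = [0, 0, 1, 1]
z_m2 = [0, 1, 1, 1]
mu_hat = mu_m1_m2(y_m1, z_m1, z_m2)
selection_effect = float(np.mean(y_m1)) - mu_hat   # Se = mu_{m1,m1} - mu_{m1,m2}
```

The estimate is the mean response of the $m_1$ group reweighted to the covariate distribution of the $m_2$ group, so it is only as good as the assumption that $Z$ captures the selection into modes.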

25

A common approach (cont.)

Alternatively, model the measurement effect by assuming the existence of covariates $F$ that capture it.

Limitations of the approach:

1- Assumes the existence of the benchmark mode m1. Not always obvious which mode provides the most accurate responses, and how accurate they are.

2- Assumes the existence of the covariates Z or F , satisfying the ignorability assumptions.

3- Generalization to more than 2 modes not straightforward.

26

A new alternative approach?

Suppose there are $M \geq 2$ modes and let $g_i$, $d_i$ indicate the preferred and used modes for unit $i \in S$.

Denote by $x_i$ covariates explaining $Y$ and let $u_i = (x_i', z_i', f_i')'$. The analysis that follows does not require the existence of the three sets of covariates.

Assp. 1. For every $i \in U$ there exists a true value $Y_i$ with pdf $f_p(y_i \mid x_i)$, which has its support in the sampled values. It is not assumed that $Y$ is measured accurately under any mode.

Assp. 2. Every sample unit responds by one of the modes.

27

An alternative approach (cont.)

The following factorization holds for the observed data:

$f_M(y_i \mid u_i, D_i = d, G_i = g) = \underbrace{\frac{\Pr(D_i = d \mid y_i, z_i, G_i = g)}{\Pr(D_i = d \mid z_i, x_i, G_i = g)}}_{(2)} \times \underbrace{\frac{\Pr(G_i = g \mid y_i, f_i)\, f_p(y_i \mid x_i)}{\Pr(G_i = g \mid f_i, x_i)}}_{(1)}$

(1) = $f_p(y_i \mid x_i, f_i, G_i = g)$ = measurement effect for a unit with $G = g$.

(2) = selection effect on (1) from responding with $D = d \neq g$.

A single model accounts for both the measurement and selection effects.

28

Special important case

Every unit in the sample responds with his or her preferred mode. In this case

$\frac{\Pr(D_i = g \mid y_i, z_i, G_i = g)}{\Pr(D_i = g \mid z_i, x_i, G_i = g)} = 1$ and

$f_M(y_i \mid u_i, D_i = G_i = g) = \frac{\Pr(G_i = g \mid y_i, f_i)\, f_p(y_i \mid x_i)}{\Pr(G_i = g \mid f_i, x_i)} = f_p(y_i \mid x_i, f_i, G_i = g)$.

Selection effects embedded within measurement effects.

29

Remarks

1- The proposed approach does not require the existence of covariates $Z$ or $F$ that fully capture the Se. and Me. effects.

2- The proposed approach does not assume that the responses obtained by one of the modes are correct.

3- The proposed approach requires specifying models for the probabilities $p_{di} = \Pr(D_i = d \mid y_i, z_i, G_i = g)$ and $p_{gi} = \Pr(G_i = g \mid y_i, f_i)$, or a single model for the product $p_{gd,i} = p_{gi}\, p_{di}$, but the model can be tested.

4- The model is best fitted by use of empirical likelihood.

30

Future censuses and small area estimation

A census and SAE seem to define opposite concepts. A census refers to big data; SAE deals with methods of estimating domain parameters based on very small or even no samples in some of the areas.

Modern censuses are no longer “door to door” because of severe budget constraints and heavy logistic hurdles.

Stephen Fienberg (Hansen lecture, 2014) discussed future U.S. censuses, emphasizing the need to create exhaustive administrative record systems by linkage of multiple files. Good administrative files are key for modern censuses.

31

Census in Israel

Israel has a fairly accurate population register; almost perfect at the country level. Record linkage is easy because every Israeli resident has an identity number and there are many other individual identifiers on different files.

The population register is much less accurate for small domains, with an average enumeration error of about 13%. The main reason for inaccuracy at the domain level is that people moving into or out of an area often report their change of address late.

32

Census in Israel (cont.)

The ICBS uses an integrated census, which consists of the population register, corrected by estimates obtained from two coverage samples for each area. An area sample of addresses for estimating the register undercount and a telephone sample of people registered to the area for estimating the register over-count.

33

Census in Israel (cont.)

$N_i$ - true number of people living in domain $i$,

$R_i$ - number of people registered to domain $i$,

$p_{i,L|R}$ - proportion of people living in domain $i$ among those registered to the domain,

$p_{i,R|L}$ - proportion of people registered to domain $i$ among those living in the domain.

$N_i \times p_{i,R|L} = R_i \times p_{i,L|R} \Rightarrow \hat{N}_i = R_i \times \frac{\hat{p}_{i,L|R}}{\hat{p}_{i,R|L}} = R_i \hat{\theta}_i$.

34

Census in Israel (cont.)

$Var(\hat{N}_i \mid R_i) \approx R_i^2 \times \left\{ \frac{Var(\hat{p}_{i,L|R})}{[E(\hat{p}_{i,R|L})]^2} + \frac{[E(\hat{p}_{i,L|R})]^2}{[E(\hat{p}_{i,R|L})]^4}\, Var(\hat{p}_{i,R|L}) \right\}$.

The register for domain $i$ improves if at least one of the proportions $p_{i,L|R}$ or $p_{i,R|L}$ increases.

If $p_{i,L|R},\, p_{i,R|L} \to 1$, then $Var(\hat{p}_{i,L|R}),\, Var(\hat{p}_{i,R|L}) \to 0$ and $Var[(\hat{N}_i - N_i) \mid R_i] \to 0$. The census is "consistent" (not in the usual sense).
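A small Python sketch of the integrated-census estimator, the register count multiplied by the ratio of the two estimated coverage proportions, together with its approximate variance. The variance uses the first-order (delta method) approximation for a ratio and assumes that the two coverage samples are independent, with the estimated proportions plugged in for their expectations. Function names and numerical inputs are hypothetical.

```python
def census_count(registered, p_l_given_r, p_r_given_l):
    """Estimated count: N_hat = R * p_hat(L|R) / p_hat(R|L)."""
    if not 0.0 < p_r_given_l <= 1.0 or not 0.0 <= p_l_given_r <= 1.0:
        raise ValueError("proportions must lie in [0, 1], with p(R|L) > 0")
    return registered * p_l_given_r / p_r_given_l

def census_variance(registered, p_l_given_r, p_r_given_l, var_l_given_r, var_r_given_l):
    """Delta-method variance of N_hat given R.

    ASSUMPTIONS: independent coverage samples (no covariance term) and
    plug-in of the estimated proportions for their expectations.
    """
    term_undercount = var_l_given_r / p_r_given_l ** 2
    term_overcount = p_l_given_r ** 2 * var_r_given_l / p_r_given_l ** 4
    return registered ** 2 * (term_undercount + term_overcount)

# Hypothetical domain: 10,000 registered; 90% of them live there;
# 96% of those living there are registered to the domain.
n_hat = census_count(10_000, 0.90, 0.96)
var_hat = census_variance(10_000, 0.90, 0.96, 4e-4, 1e-4)
```

Because the variance scales with the square of the register count, even small sampling variances of the estimated proportions translate into large variances of the estimated count, which motivates the small area models discussed next.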

35

Small area estimation

The estimator $\hat{N}_i = R_i \hat{\theta}_i$ is a direct estimator, as it only uses data available for domain $i$. However, the sizes of the two samples are very small, requiring the use of SAE models, which use auxiliary (covariate) information known for the area and borrow strength across the areas.

Example: let $\theta_i = p_{i,L|R} / p_{i,R|L}$. Fay-Herriot (1979) model:

$\hat{\theta}_i = \theta_i + e_i$, $e_i \sim N(0, \sigma_{Di}^2)$; $\theta_i = x_i'\beta + u_i$, $u_i \sim N(0, \sigma_u^2)$,

where $\hat{\theta}_i = \hat{p}_{i,L|R} / \hat{p}_{i,R|L}$ (direct), $e_i$ is the sampling error, $x_i$ is the set of covariates and $u_i$ is a random effect that accounts for the unexplained variation of $\theta_i$.

36

Small area estimation (cont.)

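Best Linear Unbiased Predictor (BLUP):

$\hat{\theta}_{Mi} = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i)\, x_i'\hat{\beta}_{GLS}$; $\hat{\gamma}_i = \hat{\sigma}_u^2 / (\hat{\sigma}_u^2 + \sigma_{Di}^2)$.

For an area $k$ with no sample, $\hat{\theta}_{Mk} = x_k'\hat{\beta}_{GLS}$.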

For areas with samples the BLUP is a linear combination of the direct estimator and the “synthetic estimator”.
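The predictor can be sketched in a few lines of Python. The sketch takes the random-effect variance $\sigma_u^2$ as given; the slides do not say how it is estimated, so any estimate would be plugged in at that point. The function name `fay_herriot_predict` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def fay_herriot_predict(theta_direct, X, sigma2_D, sigma2_u):
    """Fay-Herriot predictor gamma_i * direct + (1 - gamma_i) * x_i' beta_GLS.

    ASSUMPTION: sigma2_u is supplied (known or estimated elsewhere).
    """
    theta_direct = np.asarray(theta_direct, dtype=float)
    X = np.asarray(X, dtype=float)
    sigma2_D = np.asarray(sigma2_D, dtype=float)

    v = sigma2_u + sigma2_D                       # marginal variance of the direct estimator
    W = 1.0 / v
    XtWX = X.T @ (W[:, None] * X)
    beta_gls = np.linalg.solve(XtWX, X.T @ (W * theta_direct))

    gamma = sigma2_u / v                          # shrinkage factor, between 0 and 1
    synthetic = X @ beta_gls                      # synthetic estimator x_i' beta_GLS
    return gamma * theta_direct + (1.0 - gamma) * synthetic, beta_gls, gamma

# Toy example: three areas, intercept-only model, equal sampling variances.
theta_hat = [1.0, 0.9, 1.1]
X = np.ones((3, 1))
pred, beta_gls, gamma = fay_herriot_predict(theta_hat, X, [0.01, 0.01, 0.01], 0.01)
```

In the toy example the sampling and random-effect variances are equal, so each prediction lies halfway between the direct estimate and the synthetic estimate.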

We are presently trying to fit this and other models, but so far with little success in significantly reducing the enumeration error achieved by using the register counts.

37

Mean square error definition and estimation, an old, so far unresolved challenge

The use of model-dependent predictors raises the question of how to define and estimate the prediction mean square error (PMSE). A PMSE estimator is required for every area separately. Common PMSE estimators account for all model sources of error, including the variance of the random effect.

My view: the PMSE should generally condition on θi, and only account for the variance over all sample selections.

38

Mean square error definition and estimation (cont.)

Example: the PMSE of $\hat{N}_i = R_i \hat{\theta}_{Mi}$ should reflect the error in predicting the count $N_i$ of a given domain at a given time, and not the error over all possible values of the true count under a model. There is one specific true count to predict.

The Bayesian interpretation of random effects as reflecting prior uncertainty about the sign and magnitude of the difference $u_i = (\theta_i - x_i'\beta)$ is OK, but once we estimate $\sigma_u^2 = Var(u_i)$, we average over other areas. This is not likely to be accepted by an officer from the given area. Users are used to design-based measures of error.

39

Mean square error definition and estimation (cont.) My view: The Fay-Herriot- or any other model with random effects can be used for selecting an appropriate predictor, but the PMSE should condition on the actual, small area parameter of interest.

If acceptable, the only random process that should be accounted for is the sample selection, so that the PMSE is the design-based PMSE over all possible sample selections (model-assisted).

40

Computation & estimation of PMSE for F-H model

Consider the predictor $\hat{\theta}_{Mi}$ and suppose that all model parameters are known.

$PMSE_D(\hat{\theta}_{Mi}) = E_D(\hat{\theta}_{Mi} - \theta_i)^2 = E_D[\gamma_i(\hat{\theta}_i - \theta_i) + (1 - \gamma_i)(x_i'\beta - \theta_i)]^2 = E_D[\gamma_i e_i + (1 - \gamma_i)(x_i'\beta - \theta_i)]^2$.

An unbiased estimator of the design-based PMSE is

$\widehat{PMSE}_D(\hat{\theta}_{Mi}) = (2\gamma_i - 1)\sigma_{Di}^2 + (1 - \gamma_i)^2(\hat{\theta}_i - x_i'\beta)^2$.

The design-based estimator is poor if $\sigma_{Di}^2 = Var_D(\hat{\theta}_i)$ is large, the fundamental problem of SAE (small sample sizes).
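A minimal Python sketch of the design-unbiased PMSE estimator with known model parameters, $(2\gamma_i - 1)\sigma_{Di}^2 + (1 - \gamma_i)^2(\hat{\theta}_i - x_i'\beta)^2$. The function name and the numerical inputs are hypothetical; the second call shows one practical symptom of the instability noted above, namely that the estimate can be negative.

```python
def design_pmse_estimate(theta_direct, synthetic, sigma2_D, gamma):
    """Design-unbiased PMSE estimate for gamma*theta_direct + (1-gamma)*synthetic.

    theta_direct: direct estimate for the area
    synthetic:    x_i' beta, with beta assumed known
    sigma2_D:     design variance of the direct estimate, assumed known
    gamma:        shrinkage factor sigma2_u / (sigma2_u + sigma2_D)
    """
    return (2.0 * gamma - 1.0) * sigma2_D + (1.0 - gamma) ** 2 * (theta_direct - synthetic) ** 2

# Direct estimate far from the synthetic one: positive estimate.
pmse_far = design_pmse_estimate(theta_direct=1.2, synthetic=1.0, sigma2_D=0.01, gamma=0.5)

# Strong shrinkage and a direct estimate equal to the synthetic one: negative estimate.
pmse_close = design_pmse_estimate(theta_direct=1.0, synthetic=1.0, sigma2_D=0.01, gamma=0.25)
```

The estimator is unbiased over repeated sampling because $(\hat{\theta}_i - x_i'\beta)^2 - \sigma_{Di}^2$ is design-unbiased for the squared bias term $(\theta_i - x_i'\beta)^2$, but it relies on a single noisy observation per area.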

Formidable challenge: develop reliable design-based PMSE estimates for SAE.

41

Integration of statistics and geospatial information

Use of a Geographic Information System (GIS) adds a spatial dimension to available data, and hence the possibility to get new important insights. Examps: poverty maps, maps of accidents in road sections. Shows spatial similarities between neighboring locations. Transforms discrete estimates into a continuum.

Other important benefits of GIS:

1- Enhances the design of sample surveys by defining the borders of strata, sampling cells etc. Provides all the required information for area sampling.

42

Integration of statistics and GIS (cont.)

2- Efficient allocation of sampling quotas and the construction of optimal navigation routes to arrive at the sampled locations.

3- Permits tracing changes over time in socio-economic conditions of individuals or households residing in given areas, and relating them to geographic measures such as distance from big cities, commuting possibilities and the like.

4- Enhances data resolution, enabling exploration and formation of new groupings (clusters), not known before.

NSOs should continue developments of GIS to adapt to dynamic collection of big data in high resolution.

43

Are Universities preparing students to work at NSOs?

The importance of NSOs in modern society is not in doubt. NSOs are among the largest employers of statisticians and economists.

Considered three central topics in the work of NSOs: Survey sampling; Seasonal adjustment (SA) and Trend estimation; National Accounts (NA).

Browsed the curriculum of undergraduate and graduate courses in statistics and economics at the top 25 universities in the world, as ranked by the Shanghai Academic Ranking of World Universities.

44

Are Universities teaching the topics in regular courses?

- Only 11 universities offer a course on survey sampling!!

- All universities offer one or more courses on Time Series Analysis and Forecasting, but there are no specialized courses on SA or trend estimation. Found only 3 time series courses for which seasonality is even mentioned in the curriculum.

- There are no specialized courses on National Accounts. At best, NA is mentioned occasionally in macroeconomic courses, mostly with reference to the GDP.

45

Remarks following these findings

1- What is the role of universities? Are they supposed to train students for the workplace? Should universities focus solely on research, and educate new generations of researchers?

The three topics mentioned, and many other topics underlying the work of NSOs require advanced theory, no less than in other courses taught regularly. Teaching courses on these topics is in no conflict with the view that universities should concentrate on research.

46

2- It can be argued that the reason for the lack of courses related to POS is the lack of expert researchers to teach them. This is the problem. Students are not exposed to these topics during their studies, and hence they don’t even consider them for their academic research.

3- Statistics has changed dramatically in the last decade. Survey sampling and time series analysis have also changed. Courses on such topics need to adapt accordingly.

Lohr (Hansen lecture, 2009), and Kolenikov et al. (2015) in several papers to appear in “Survey Practice”, discuss ways of “Training the Modern Survey Statistician”.

47

4- Some good news: There are several universities around the world that do emphasize survey sampling in their teaching and research. There are several university Master programs in Official Statistics (JPSM in the U.S., MOffStat in the UK). The European Union has recently established a European Master Program in Official Statistics. There seems to be a growing recognition of the importance of POS by academic institutions.

NSOs should strengthen their ties with academia by helping to establish specialized courses on POS, and by involving more academic researchers in their work.

48