Official Statistics for the Next Decade-- Methodological Issues and Challenges
Danny Pfeffermann
Conference on New Techniques and Technologies for Statistics (NTTS), March 2015

List of tough challenges
A- Collection and management of big data for POS √
B- Integration of computer science for POS from big data
C- Data accessibility, privacy and confidentiality
D- Possible use of Internet panels √
E- How to deal with mode effects √
G- Future censuses and small area estimation √
F- Integration of statistics and geospatial information.
Ques. Are universities preparing students for NSOs? √

Collection and management of big data for POS
Exp. 1. Counting the number of vehicles crossing road sections. Presently done in a very primitive way. Why not get this information, and much more, from cell phone companies? Available in principle for each time point.
Exp. 2. Use the BPP, based on 5 million commodities sold online, to predict the CPI, which requires two costly surveys.

Big Data, Big Problems, Big headache
Coverage/selection bias (we are talking of POS)
Data accessibility
New legislation
Privacy (data protection)
Disclosure control
Computer storage
Computation and analysis
Linkage of different files
Risk of data manipulation

Two types of big data
Type 1. Data obtained from sensors, cameras, cell phones, ...: generally structured and accurate.
Type 2. Data obtained from social networks, e-commerce etc.: diverse, unstructured and appearing irregularly.
Type 1 measurements are available continuously. Should POS publications be mostly in the form of graphs and pictures?
If aggregate data are needed, how should big data be transformed to monthly aggregates? By sampling?
Will random sampling continue to play an important role when processing big data?

Other important issues
Coverage bias: a major concern in the use of big data for POS.
Big data of credit card transactions contain no information on transactions made with other means of payment.
Opinions expressed in social networks are different from opinions held by the general public.
No bias should occur when using big data to predict other variables, estimated from standard surveys. E.g., use the BPP to predict the CPI, job advertisements to predict employment, satellite images to predict crops.
Requires proper statistical analysis to identify and test the prediction models.

Big data are supposedly free of sampling errors. Are measures of error still an issue? Measure of bias? How? Measurement errors?
Big data for sub-populations: NSOs publish estimates for sub-populations defined by age, gender, ethnicity, geography, ... Big data may not contain this information. Requires massive linkage if the missing information is available in other big files.
Will traditional sample surveys always be needed?
We are familiar with design-based, model-dependent, and model-assisted estimators. New: algorithmic estimators, the result of computational algorithms applied to raw big data (example: measure of religiosity).

Computer engineering for POS from big data
No longer gigabytes (~$10^{9}$ bytes). Terabytes (~$10^{12}$ bytes) and petabytes (~$10^{15}$ bytes) are, at the least, the new standards.
Computing facilities at most (all?) NSOs cannot store and handle such high volumes of data.
Possible solution. Use cloud storage, management and processing facilities.
Big problem. Data protection. Multiple users, data distributed over a large number of devices.
Possible sol. Private cloud installation, incorporating all local computers; combined management of storage space and processing power of the separate computers.

Summary of in-house computing challenges
1. Study the logic of storage and processing of big data.
2. Prepare storage spaces that can be regularly extended to higher volumes of data.
3. Establish communication networks that permit receiving data from multiple sources in different formats, and prepare the data for processing and analysis.
4. Protect the data from possible hackers and develop new methods of statistical disclosure control (SDC).
5. Develop analytic tools for processing, editing and analysing big data, including visualization techniques.
Everything is different if a cloud service can be used.

Data accessibility, privacy and confidentiality
Two aspects:
A- Protect the data from intruders. Very expensive devices.
B- SDC. Guarantee that released data cannot be used to reveal private confidential data.
Current SDC procedures need extensive modifications.
Exp 1. Release “synthetic data” generated from models. Can we generate new big data?
Exp 2. Research (safe) rooms. Available procedures for release of data and control of outputs need major revision.
New trend: release synthetic data to researchers before they get the real data in research rooms.

Big data- summary remarks
New expensive computing facilities, new data processing techniques, new linkage methods, new visualization methods, new analytic methods, new measures of error. These are only some of the big challenges facing computer scientists and statisticians in the use of big data for POS.
Big potential advantages: timeliness, much broader coverage (possible coverage bias), no sampling frames, no questionnaires, no interviewers, ...
Considering the constant decline in response rates in traditional surveys, use of big data seems inevitable. Big data will just grow bigger and bigger.

Possible use of Internet panels for POS
Web surveys have huge advantages over traditional surveys.
Major problem: respondents are volunteers with access to the internet. At best they represent the population of internet users (IU).
Ipanel: big group of volunteers agreeing to participate regularly in surveys, often in return for certain incentives.
Ipanel possibly recruited by probability sampling, and the samples selected from the Ipanel often selected by probability sampling.
Big challenge: estimate general population parameters from the Ipanel sample.
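
The coverage problem of Internet panels can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population values, the logistic form of the probability of being an Internet user, and the function name simulate_internet_coverage_bias are all hypothetical and are not taken from the talk. It shows that when Internet use depends on the study variable, the mean among Internet users differs from the population mean no matter how many users are observed.

```python
import numpy as np


def simulate_internet_coverage_bias(n_pop=200_000, seed=0):
    """Return (population mean, mean among Internet users, share of users)."""
    rng = np.random.default_rng(seed)
    # Hypothetical study variable y in the target population U.
    y = rng.normal(loc=50.0, scale=10.0, size=n_pop)
    # Assumed selection model: the probability of being an Internet user
    # increases with y (logistic in the standardized value of y).
    p_internet = 1.0 / (1.0 + np.exp(-(y - 50.0) / 10.0))
    is_user = rng.random(n_pop) < p_internet
    return y.mean(), y[is_user].mean(), is_user.mean()


pop_mean, iu_mean, iu_share = simulate_internet_coverage_bias()
print(f"population mean:          {pop_mean:.2f}")
print(f"mean among Internet users: {iu_mean:.2f}")
print(f"share of Internet users:   {iu_share:.2f}")
```

Because the gap between the two means comes from who is covered rather than from sample size, enlarging the panel would not remove it; this is why the adjustments discussed next are needed.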

Common solutions
Propensity scores (PS): select a traditional large reference sample; treat the sample $S_I$ from the Ipanel as the treatment sample and the reference sample $S_R$ as the control sample.
Estimate propensity scores based on all $j \in S = S_I \cup S_R$.
Divide $S_I$ into $C$ classes based on the estimated propensity scores.
Compute an adjusted weight $d_j^{w,psa} = f_c \, d_j^{w}$ for $j \in S_{cI}$, where $d_j^{w}$ are the initial weights assigned to $j \in S_I$.
$$\hat{Y}^{w,PSA} = \sum_{c} \sum_{j \in S_{cI}} d_j^{w,psa} \, y_j .$$

Problems with the use of propensity scores
Requires drawing a large reference sample, which can be very costly.
Strong ignorability: let $T = 1$ for $i \in S_I$, $T = 0$ for $i \in S_R$.
PS(a): $T \perp Y$ given the covariates $x$ used for PS.
PS(b): $0 < \Pr(T = 1 \mid x) < 1$ for every $x \in S$.
Conditions may hold for some, but not for all study variables.
Not obvious how to estimate the variance of estimators.

Another common solution: Calibration
Change base weights $d_j^{w}$ to weights $d_j^{cal}$ such that, for observed survey variables $Z$ with known totals $t_z^{U}$,
$$\sum_{j \in S_I} d_j^{cal} \, z_j = t_z^{U} .$$
Totals might be cell totals or marginal cell totals. Reliable sample estimates $\hat{t}_z^{U}$ may also be used.
Does not require a reference sample.
Combining propensity scores and calibration adjustments is possibly more effective in reducing the bias.

A new alternative approach?
Let $A_i = 1$ if $i \in U$ is an Internet User (IU), $A_i = 0$ otherwise.
Assumption. $\Pr(A_i = 1 \mid x_i, y_i) > 0 \;\; \forall i \in U$; $y$: study variable, $x$: covariates.
Bayes $\Rightarrow$
$$f_{IU}(y_i \mid x_i) = f(y_i \mid x_i, A_i = 1) = \frac{\Pr(A_i = 1 \mid x_i, y_i) \, f_p(y_i \mid x_i)}{\Pr(A_i = 1 \mid x_i)} ,$$
$f_p(y_i \mid x_i)$ = distribution in target population $U$, $f_{IU}(y_i \mid x_i)$ = distribution for IU.

A new alternative approach (cont.)
In practice, not every IU asked to participate in the Ipanel agrees, or a person may agree but not respond in a particular survey taken for its members.
Let $R_i = 1$ if IU $i$ is in the Ipanel and responds, $R_i = 0$ otherwise.
The marginal distribution for responding unit $i$ is then
$$f_R(y_i \mid x_i) = f(y_i \mid x_i, A_i = 1, R_i = 1) = \frac{\Pr(R_i = 1 \mid y_i, x_i, A_i = 1) \, \Pr(A_i = 1 \mid x_i, y_i) \, f_p(y_i \mid x_i)}{\Pr(R_i = 1 \mid x_i, A_i = 1) \, \Pr(A_i = 1 \mid x_i)} .$$

A new alternative approach- inference
The respondents’ likelihood (assuming independence), with $\theta$ the unknown parameters of the population model and $\gamma$ those of the Internet-use and response models:
$$L_{Resp}(\theta, \gamma) = \prod_{i=1}^{r} \frac{\Pr(R_i = 1 \mid y_i, x_i, A_i = 1; \gamma) \, \Pr(A_i = 1 \mid y_i, x_i; \gamma) \, f_p(y_i \mid x_i; \theta)}{\Pr(R_i = 1 \mid x_i, A_i = 1; \theta, \gamma) \, \Pr(A_i = 1 \mid x_i; \theta, \gamma)} .$$
Inference: maximize the likelihood with respect to the unknown parameters and use $\hat{f}_p(y_i \mid x_i; \theta) = f_p(y_i \mid x_i; \hat{\theta})$ for inference about the target population $U$.
Example: estimate the total as
$$\hat{Y}_{IP}^{U} = \sum_{i \in U} E_p(y_i \mid x_i; \hat{\theta}) .$$

Inference (cont.)
When the Ipanel is selected with probabilities $\pi_i$ and the covariates are unknown for units outside the Ipanel sample,
$$\hat{Y}_{IP}^{U,w} = \sum_{i \in S_{IR}} \frac{E_p(y_i \mid x_i; \hat{\theta})}{\pi_i \, \hat{p}_{Ri} \, \hat{p}_{Ai}} ,$$
$\hat{p}_{Ri} = \widehat{\Pr}(R_i = 1 \mid y_i, x_i, A_i = 1)$; $\hat{p}_{Ai} = \widehat{\Pr}(A_i = 1 \mid x_i, y_i)$.
Remarks:
1- A full parametric inference requires specifying models for $p_{Ri}$, $p_{Ai}$ and $f_p(y_i \mid x_i; \theta)$. The likelihood is complicated and may face non-identifiability problems. Use of empirical likelihood is simpler and safer.

Remarks (cont.)
2- Although none of the stochastic processes in the likelihood is observable, the respondents’ model is testable using classical test statistics, since it relates to the observed data.
3- Further simplification of the likelihood is obtained by combining the models for Internet use and for Ipanel response into a single model. Define $D_i = A_i R_i$. The model is
$$f_D(y_i \mid x_i) = f(y_i \mid x_i, D_i = 1) = \frac{\Pr(D_i = 1 \mid y_i, x_i) \, f_p(y_i \mid x_i)}{\Pr(D_i = 1 \mid x_i)} .$$
Much simpler model, but might be too restrictive.

Comparison with other approaches
Use of the proposed approach does not require the availability of a reference sample, as required for the use of propensity scores, and does not rely on ignorability conditions.
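
For comparison, the calibration adjustment listed among the common solutions can be sketched in its simplest form, post-stratification to known cell totals. This is a minimal illustration, not the procedure used in the talk: the function name poststratify, the two age cells, the sample values and the population counts are all hypothetical. Here the calibration variables z_j are cell indicators, so the constraint that the calibrated weights reproduce the known totals reduces to rescaling the base weights within each cell.

```python
import numpy as np


def poststratify(base_weights, cells, cell_totals):
    """Rescale base weights so weighted cell counts match known totals."""
    base_weights = np.asarray(base_weights, dtype=float)
    cells = np.asarray(cells)
    calibrated = base_weights.copy()
    for cell, total in cell_totals.items():
        in_cell = cells == cell
        weight_sum = base_weights[in_cell].sum()
        if weight_sum == 0:
            raise ValueError(f"no sample units in cell {cell!r}")
        # Within a cell, every base weight is multiplied by the same factor.
        calibrated[in_cell] = base_weights[in_cell] * total / weight_sum
    return calibrated


# Hypothetical Ipanel sample in which young people are over-represented.
cells = np.array(["young"] * 6 + ["old"] * 2)
y = np.array([10.0, 12.0, 11.0, 9.0, 13.0, 11.0, 20.0, 22.0])
d = np.full(len(y), 1000.0 / len(y))            # equal base weights
known_totals = {"young": 400.0, "old": 600.0}   # assumed known population counts

w = poststratify(d, cells, known_totals)
naive_total = float(np.sum(d * y))
calibrated_total = float(np.sum(w * y))
print(f"estimated total with base weights:       {naive_total:.0f}")
print(f"estimated total with calibrated weights: {calibrated_total:.0f}")
```

The adjustment removes bias only to the extent that the study variable is unrelated to panel membership within each cell, which is the kind of ignorability condition the proposed approach avoids.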